Application of Clustering for Unsupervised Language Learning

Jeremy Hoffman and Omkar Mate

Abstract

We describe a method for automatically learning word similarity from a corpus. We constructed feature vectors for words according to their appearance in different dependency paths in parse trees of corpus sentences. Clustering the huge amount of raw data costs too much time and memory, so we devised techniques to make the problem tractable. We used PCA to reduce the dimensionality of the feature space, and we devised a partitioned hierarchical clustering approach in which we split the data set and gradually cluster and recombine the partitions. We succeeded in clustering a huge amount of word data at very reasonable time and memory cost.

Motivation

Fully automated learning of similar words and dependency paths is extremely pertinent to many natural language processing (NLP) applications. Similarity estimation can help with problems of data sparseness in statistical NLP [4], and clustering could automatically generate such similarity estimates. Word similarity estimates can also be used in question answering and machine translation, major areas of current NLP research.

Overview of Process

Our process is to build a large input matrix that describes words by their appearances in various dependency paths, cluster the words in this feature space, and then estimate that words that ended up in the same cluster are semantically similar. Since language is so varied, with a vast vocabulary, we must build a huge matrix to infer anything useful from our clusters. However, clustering may not be tractable for huge matrices because of time and computer memory constraints, so we took a more elaborate approach. In the following sections, we discuss how principal component analysis and a multi-tiered form of hierarchical clustering solved this problem and allowed a large matrix to be clustered.

Input Data

Our input data was a corpus of six million newswire articles, parsed using MINIPAR into dependency path triplets, as in [6]. The corpus contained about 750,000 unique noun pairs and about 70,000 unique dependency paths. From these data, we constructed a matrix of training examples. Our input matrix has m rows corresponding to different nouns and n columns corresponding to different dependency paths. An entry (i, j) is the number of times that noun i appeared in dependency path j. Any given word is likely to appear in only a very small number of sentences, even in a corpus of six million articles, so the input matrix is very sparse.
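As a concrete illustration of this input representation, the following is a minimal Python sketch (not the authors' code) of how such a sparse noun-by-dependency-path count matrix could be assembled from (noun, path) occurrences; the function name build_count_matrix and the toy path strings in the usage comment are illustrative assumptions.

```python
# Minimal sketch: build a sparse noun-by-dependency-path count matrix
# from (noun, dependency_path) occurrences extracted from a parsed corpus.
from collections import defaultdict
from scipy.sparse import coo_matrix

def build_count_matrix(occurrences):
    """occurrences: iterable of (noun, dependency_path) pairs.
    Returns (matrix, nouns, paths), where matrix[i, j] counts how often
    noun i appeared in dependency path j."""
    noun_index, path_index = {}, {}
    counts = defaultdict(int)
    for noun, path in occurrences:
        i = noun_index.setdefault(noun, len(noun_index))
        j = path_index.setdefault(path, len(path_index))
        counts[(i, j)] += 1
    rows, cols, vals = zip(*((i, j, c) for (i, j), c in counts.items()))
    matrix = coo_matrix((vals, (rows, cols)),
                        shape=(len(noun_index), len(path_index))).tocsr()
    nouns = sorted(noun_index, key=noun_index.get)
    paths = sorted(path_index, key=path_index.get)
    return matrix, nouns, paths

# Example usage with toy data:
# A, nouns, paths = build_count_matrix([("dog", "N:subj:V<chase"),
#                                       ("cat", "N:obj:V<chase")])
```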

Principal Component Analysis

To cluster nouns, we need to reduce the column dimension of the input matrix. To do this, we use Principal Component Analysis (PCA). PCA was implemented using the Lanczos algorithm for singular value decomposition []. (See Figure 1.)

Figure 1: PCA using singular value decomposition.

The input matrix A is decomposed as A = U S V^T. S has nonzero entries only along its diagonal, representing the singular values of A (the square roots of the eigenvalues of A A^T). Since A A^T represents the covariance of the rows of A (i.e., the nouns), the eigenvectors corresponding to its largest eigenvalues are the directions in which covariance is maximized. We choose the first k entries of S (which happen to be the k largest) to form S_k, and the first k columns of U to form U_k. (The choice of k is based on the size of the eigenvalues and the desired computational efficiency for clustering.) We multiply U_k by S_k to get A_k, the desired output matrix with reduced column dimension. Following are the results we obtained by running PCA on input matrices of various sizes:

[Table: number of rows and columns of the input matrix, and number of columns retained in the output matrix, for several input sizes.]
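The paper used the Lanczos SVD code from SVDPACKC; as a rough stand-in, the sketch below uses SciPy's sparse svds routine (also Lanczos-based) to keep the k largest singular triplets and form U_k S_k. The function name reduce_columns and the default k are illustrative assumptions, not part of the original work.

```python
# Illustrative sketch of the dimensionality-reduction step:
# keep the k largest singular triplets of the sparse count matrix
# and return A_k = U_k * S_k (one reduced feature vector per noun).
import numpy as np
from scipy.sparse.linalg import svds

def reduce_columns(A, k=100):
    """A: sparse m x n count matrix. Returns an m x k dense matrix."""
    # svds returns the k largest singular triplets, in ascending order of s.
    U, s, Vt = svds(A.asfptype(), k=k)
    order = np.argsort(s)[::-1]        # reorder so singular values descend
    return U[:, order] * s[order]      # scale each column by its singular value
```

Each row of the returned matrix is the reduced feature vector for one noun, which is what the clustering step below operates on.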

Partitioned Hierarchical Clustering

Hierarchical clustering (HC) is well suited to the task of grouping similar nouns. In contrast to hard-assignment clustering algorithms such as K-means, HC builds a tree, or dendrogram, of closest points deterministically. The basic HC procedure to cluster m points into k clusters is as follows: first, start with m clusters of one point each; then find and merge the two closest clusters, and repeat m - k times. We computed the similarity between two points (i.e., nouns) as their cosine, computed from the dot product of their feature vectors. The distance between two clusters can be defined as the minimum, maximum, or average distance between points in the clusters. Minimum-distance clustering tends to produce long chains, whereas spherical clusters match word similarity more intuitively. Maximum-distance clustering is susceptible to outliers, which makes it unsuitable for this problem because the data is noisy (the corpus could contain a few bizarre sentences). Thus, average-link clustering was the most suitable approach. Specifically, we maintain along with each cluster the mean of the feature vectors of its points, and compute the similarity of two clusters as the cosine of their mean vectors.

HC is computationally expensive. In particular, average-link clustering on m points in d-dimensional space takes O(dm^2 log m) time and O(dm + m^2) memory [5]. Even if cluster mean vectors and cosine values are discretized to 4-byte integers, storing m clusters and their pairwise cosines takes 4(dm + m^2) bytes; for m = 50,000, this is about 10^10 bytes = 10 GB, more than the memory of most computers.
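For concreteness, here is a small Python sketch (an illustrative re-creation, not the authors' implementation) of the average-link merge loop just described: each cluster carries the mean of its members' vectors, and the pair of clusters whose means have the highest cosine is merged until the desired number of clusters remains. It recomputes all similarities on every merge, so it shows the logic rather than an efficient implementation; the names cosine and average_link_hc are assumptions.

```python
# Sketch of average-link HC using the cosine of cluster mean vectors.
import numpy as np

def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def average_link_hc(points, num_clusters):
    """points: (m, d) array of reduced feature vectors.
    Returns a list of clusters, each a (member_indices, mean_vector) pair."""
    clusters = [([i], points[i].astype(float)) for i in range(len(points))]
    while len(clusters) > num_clusters:
        # find the pair of clusters whose mean vectors are most similar
        best, pair = -2.0, (0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = cosine(clusters[a][1], clusters[b][1])
                if sim > best:
                    best, pair = sim, (a, b)
        a, b = pair
        members = clusters[a][0] + clusters[b][0]
        merged = (members, points[members].mean(axis=0))
        del clusters[b], clusters[a]   # b > a, so remove b first
        clusters.append(merged)
    return clusters
```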

To make HC tractable, we devised an approach that we call partitioned hierarchical clustering. In this approach, the m points are split into k partitions of size m/k, such that HC on m/k points can be executed on a single computer. For each partition, HC is used to reduce the number of clusters by 50% by making m/(2k) mergers, so that the m/k points are combined into m/(2k) clusters. Then pairs of partitions are concatenated to create k/2 partitions, each containing m/k clusters (which contain a total of 2m/k points). HC is again used to reduce the size of each partition by 50%, and pairs of partitions are again combined, until all partitions are eventually recombined in the log_2(k)-th step. (See Figure 2.)

Since partitioned HC requires running HC on at most m/k clusters at a time, its space requirement is O(dm/k + (m/k)^2), an improvement by a factor of roughly k^2. Thus partitioned HC can be run on a computer that otherwise would not have enough memory to cluster the data. Partitioned HC is also fast: it requires log_2(k) steps, and each step entails several independent runs of HC on at most m/k clusters, giving a time cost of O(d (log k) (m/k)^2 log(m/k)), an asymptotic improvement over normal HC by a factor of about k^2 / log k. Furthermore, the process can be parallelized across several computers, which is not possible with normal HC.

Figure 2: Partitioned hierarchical clustering with k = 4 initial partitions: the large input matrix is partitioned, each partition is clustered individually, pairs of partitions are combined and clustered again, and the final combined partition is clustered.

The problem introduced by partitioned HC is that the guarantee that the closest points are merged is lost. If two very close points are in separate partitions that only meet in the final step, they may not be merged at all if they belong to clusters whose means are farther apart. This is unlikely to happen except in borderline cases, because close points should end up in close clusters that will eventually be merged, but a further investigation into how partitioned HC's clusters deviate from those of normal HC would be worthwhile. A possible solution to this problem is to perform partitioned HC multiple times, randomly picking a different initial partitioning of the data each time, and then averaging the word similarity results.
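The following sketch (again illustrative, not the authors' code) shows one way the partition-halve-recombine schedule described above could be organized, reusing the same cosine-of-means merge criterion as the HC sketch earlier. The names halve_by_average_link and partitioned_hc, and the default k = 4, are assumptions.

```python
# Sketch of partitioned hierarchical clustering:
# split into k partitions, halve each partition's cluster count with HC,
# concatenate pairs of partitions, and repeat until everything is recombined.
import numpy as np

def _merge_closest(clusters, points):
    """Merge the two clusters whose mean vectors have the highest cosine."""
    best, pair = -2.0, (0, 1)
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            u, v = clusters[a][1], clusters[b][1]
            sim = float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
            if sim > best:
                best, pair = sim, (a, b)
    a, b = pair
    members = clusters[a][0] + clusters[b][0]
    merged = (members, points[members].mean(axis=0))
    del clusters[b], clusters[a]        # b > a, so remove b first
    clusters.append(merged)

def halve_by_average_link(clusters, points):
    """Run HC merges until only half as many clusters remain."""
    clusters = list(clusters)
    target = max(1, len(clusters) // 2)
    while len(clusters) > target:
        _merge_closest(clusters, points)
    return clusters

def partitioned_hc(points, k=4):
    """points: (m, d) array of reduced feature vectors; k: initial partitions."""
    singletons = [([i], points[i].astype(float)) for i in range(len(points))]
    partitions = [singletons[i::k] for i in range(k)]   # k partitions of about m/k points
    while True:
        partitions = [halve_by_average_link(p, points) for p in partitions]
        if len(partitions) == 1:
            return partitions[0]
        # concatenate pairs of partitions for the next round
        paired = [partitions[i] + partitions[i + 1]
                  for i in range(0, len(partitions) - 1, 2)]
        if len(partitions) % 2:
            paired.append(partitions[-1])
        partitions = paired
```

With k = 4 this follows the same halving schedule as the trial run described below: four partitions are each reduced by half, pairs are combined and reduced again, and the final combined partition is reduced once more.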

Implementation and Testing

To evaluate the feasibility of our approach, we implemented PCA with SVD using the code from the SVDPACKC package [1] and implemented partitioned hierarchical clustering from scratch. For our trial run's input matrix, we took 65,000 of the most frequently occurring nouns and tens of thousands of the most frequently occurring dependency paths from the corpus. Running PCA on the data reduced the column dimension of the matrix from tens of thousands to a much smaller number of components, in about four minutes on a Windows XP PC with 256 MB of RAM. Dividing the 65,000 rows into four partitions of 16,250 and performing partitioned HC, successively reducing the 16,250 clusters by half, we generated 8,125 clusters, in about 5 minutes on a shared dual-Xeon Linux machine. When hierarchical clustering was run without partitioning, an out-of-memory error occurred. Thus we conclude that our approach is fundamentally sound and can allow clustering of matrices with a very large number of both rows and columns.

Further Work

To continue this work, we would try running our program on even larger input matrices. To evaluate the semantic significance of our cluster output, we would compare the word similarity suggested by our clusters to a gold standard such as WordNet, Latent Semantic Analysis [2], or human-tagged data. We would compute the pairwise similarity of all of the words in our input data, and compare the average similarity of words that were clustered together in our output to the average similarity of words that were not.

Acknowledgements

We acknowledge the valuable guidance provided by Rion Snow, Prof. Andrew Ng, Prof. Dan Jurafsky, and Prof. Gene Golub.

References

[1] Berry, M. SVDPACKC. ftp://cs.utk.edu/pub/berry
[2] Laham, D. (1998). Latent Semantic Analysis @ CU Boulder. Last updated October 1998; accessed November 2005. http://lsa.colorado.edu/
[3] Golub, G., and Van Loan, C. (1996). Matrix Computations. Baltimore, MD: Johns Hopkins University Press.
[4] Pereira, F., Tishby, N., and Lee, L. (1993). Distributional clustering of English words. In Proceedings of the 31st Annual Meeting of the ACL, pp. 183-190.
[5] Schütze, H. Single-Link, Complete-Link & Average-Link Clustering. No date given; accessed November 2005. http://www-csli.stanford.edu/~schuetze/completelink.html
[6] Snow, R., Jurafsky, D., and Ng, A. (2004). Learning syntactic patterns for automatic hypernym discovery. In Advances in Neural Information Processing Systems 17.