Comparing the value of Latent Semantic Analysis on two English-to-Indonesian lexical mapping tasks

Comparing the value of Latent Semantic Analysis on two English-to-Indonesian lexical mapping tasks
David Moeljadi
Nanyang Technological University
October 16, 2014

Outline
- The Authors
- The Experiments (general idea and results)
- The Details: concept and word; bilingual word mapping; bilingual concept mapping
- Results and Discussion

The Authors
Eliza Margaretha: Wissenschaftliche Angestellte (research staff) at the Institut für Deutsche Sprache
Ruli Manurung: Coordinator of the Computer Science Dept. at the Faculty of Computer Science, University of Indonesia
Eliza Margaretha's undergraduate thesis was supervised by Ruli Manurung

The Experiments - General Idea -
Resources: English WordNet version 3.0, a parallel English-Indonesian corpus (news article pairs), a bilingual English-Indonesian dictionary, and The Great Dictionary of the Indonesian Language (KBBI).
How can these be combined into an Indonesian WordNet?

The Experiments - General Idea -
Answer: use Latent Semantic Analysis (LSA) to link these resources and build the Indonesian WordNet.

The Experiments - Results -
For bilingual word mapping (parallel corpus checked against the bilingual dictionary), LSA performs poorly; for bilingual concept mapping (English WordNet to KBBI senses), LSA performs well.

Concept and Word

Language    Concept                   Word
Indonesian  00464894-n                golf
English     08420278-n, 09213565-n    bank
Indonesian  00015388-n                hewan, binatang

One word can express two concepts (bank), and one concept can be expressed by two words (hewan, binatang).

- The Corpus -
1. Define a collection of parallel article pairs. The full corpus contains 3,273 article pairs; subsets of 100, 500, and 1,000 pairs are used in the experiments.

- Latent Semantic Analysis -
2. Set up a bilingual word-document matrix for LSA. Each column is a pair of parallel articles.

ENG     Article 1E  Article 2E  ...  Article 100E
dog     5           0                0
the     10          15               50
car     4           0                7

IND     Article 1I  Article 2I  ...  Article 100I
anjing  5           0                0
itu     12          10               30
mobil   3           0                10

- Latent Semantic Analysis -
2. The English counts form a submatrix M_E and the Indonesian counts a submatrix M_I; stacking M_E on top of M_I gives the single bilingual matrix M.

- Latent Semantic Analysis -
2. For each English row of M, compute the similarity to each Indonesian row.

- Latent Semantic Analysis -
2. However, the matrix contains irrelevant information and noise that need to be removed.
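The matrix setup above can be sketched in a few lines of NumPy. The counts are the toy values from the slides; M_E and M_I follow the slides' naming, while the rest of the variable names are mine:

```python
import numpy as np

# Toy counts from the slides: rows are words, columns are article pairs.
# English block: dog, the, car over articles 1E, 2E, ..., 100E.
M_E = np.array([[5, 0, 0],
                [10, 15, 50],
                [4, 0, 7]])
# Indonesian block: anjing, itu, mobil over the parallel articles 1I, 2I, ..., 100I.
M_I = np.array([[5, 0, 0],
                [12, 10, 30],
                [3, 0, 10]])

# Column j of M_E and column j of M_I describe the SAME article pair,
# so stacking them vertically yields one bilingual word-document matrix M.
M = np.vstack([M_E, M_I])
print(M.shape)  # (6, 3)
```

Vertical stacking is the whole trick: because the columns are aligned by article pair, an English word and its Indonesian translation end up with similar row profiles in the shared column space.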

- Latent Semantic Analysis -
3. LSA: compute the SVD (Singular Value Decomposition), M = U S V^T, where U is the matrix of left singular vectors, S is the diagonal matrix containing the singular values of M, and V is the matrix of right singular vectors.


- Latent Semantic Analysis -
4. Compute the optimal reduced-rank approximation (reduce the dimensions of the matrix) to unearth implicit patterns of semantic concepts; the vectors representing closely related English and Indonesian words should then have high similarity. Ranks tested, as a percentage of the number of article pairs:

                  10%   25%   50%   100% (no reduction)
100 art. pairs    10    25    50    100
500 art. pairs    50    125   250   500
1,000 art. pairs  100   250   500   1,000

- Latent Semantic Analysis -
4. Words are represented by row vectors in U; word similarity can be measured by comparing the corresponding rows of US.
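Steps 3 and 4 can be sketched with NumPy's SVD routine. The matrix is the toy stacked one from the slides, and the rank k = 2 is purely illustrative:

```python
import numpy as np

# Toy bilingual word-document matrix (English rows stacked on Indonesian rows).
M = np.array([[5., 0., 0.],
              [10., 15., 50.],
              [4., 0., 7.],
              [5., 0., 0.],
              [12., 10., 30.],
              [3., 0., 10.]])

# Full SVD: M = U @ diag(s) @ Vt, with s sorted in descending order.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Rank-k approximation: keep only the k largest singular values.
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
M_k = U_k @ np.diag(s_k) @ Vt_k  # best rank-k approximation of M

# Words are rows of U_k; similarity is measured between rows of U_k @ diag(s_k).
word_vectors = U_k * s_k
```

Truncating the small singular values is what discards the "irrelevant information and noise" mentioned earlier, at the cost of blurring some co-occurrence detail.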

- Latent Semantic Analysis -
5. For a randomly chosen set of vectors representing English words, compute the n nearest vectors representing the n most similar Indonesian words, using the cosine of the angle between two vectors. (The slide plots dog, anjing, and mobil as vectors in article space: the angle between dog and anjing is small, so their cosine similarity is high.)
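The cosine comparison in step 5 might look like this; the two-dimensional reduced vectors below are hypothetical, chosen only to illustrate the ranking:

```python
import numpy as np

def cosine(u, v):
    # Cosine of the angle between two vectors; 1.0 means identical direction.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical reduced word vectors (rows of U_k @ diag(s_k)).
eng_words = ["dog", "the", "car"]
ind_words = ["anjing", "itu", "mobil"]
eng_vecs = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
ind_vecs = np.array([[0.88, 0.12], [0.25, 0.75], [0.65, 0.35]])

def nearest(word, n=1):
    # Rank all Indonesian words by cosine similarity to the English word.
    i = eng_words.index(word)
    sims = [cosine(eng_vecs[i], v) for v in ind_vecs]
    order = sorted(range(len(ind_words)), key=lambda j: -sims[j])
    return [ind_words[j] for j in order[:n]]

print(nearest("dog"))  # ['anjing'] with these toy vectors
```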

- Some Experiments -
6. Remove the stopwords from the matrix (English: the, a, of, in, by, for, ...; Indonesian: itu, sebuah, dari, di, oleh, untuk, ...) and do SVD again.
7. Apply two weighting schemes, TF-IDF and log-entropy, and do SVD again.

- Some Experiments -
7. Apply TF-IDF (term frequency-inverse document frequency):
- TF measures how frequently a word occurs in a document: (number of occurrences of word w in the document) / (total number of words in the document)
- IDF measures how important a word is in the corpus: log((total number of documents) / (number of documents containing word w))
- TF-IDF can also be used for stopword filtering

- Some Experiments -
7. Apply TF-IDF (example)

        Article 1  Article 2  ...  Article 100
dog     5          0               0
the     10         15              50
car     4          0               7
Total   100        150             125

TF-IDF = (occurrences of w in the document / total words in the document) x log(total documents / documents containing w)

- Some Experiments -
7. Apply TF-IDF (example)
TF-IDF of dog in Article 1 = (5/100) x log(100/1) = 0.05 x 2 = 0.1

- Some Experiments -
7. Apply TF-IDF (example)
TF-IDF of the in Article 1 = (10/100) x log(100/100) = 0.1 x 0 = 0
TF-IDF of car in Article 1 = (4/100) x log(100/2) = 0.04 x 1.7 ≈ 0.07
TF-IDF of car in Article 100 = (7/125) x log(100/2) ≈ 0.06 x 1.7 ≈ 0.09

- Some Experiments -
7. Apply TF-IDF and do SVD (example). The row for the becomes all zeros, so TF-IDF acts as a stopword filter:

        Article 1  Article 2  ...  Article 100
dog     0.10       0.00            0.00
the     0.00       0.00            0.00
car     0.07       0.00            0.09
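The worked TF-IDF arithmetic above (length-normalized TF, base-10 log) can be checked with a short sketch; the document frequencies are read off the slides, and the 100-document collection size is the slides' assumption:

```python
import numpy as np

counts = np.array([[5, 0, 0],      # dog
                   [10, 15, 50],   # the
                   [4, 0, 7]],     # car
                  dtype=float)
totals = np.array([100, 150, 125], dtype=float)  # total words per article
n_docs = 100.0                                   # 100-article collection (slides)
df = np.array([1, 100, 2], dtype=float)          # documents containing each word

tf = counts / totals          # term frequency, normalized by document length
idf = np.log10(n_docs / df)   # base-10 log, as in the worked example
tfidf = tf * idf[:, None]

print(round(float(tfidf[0, 0]), 2))  # dog in Article 1 -> 0.1
print(round(float(tfidf[1, 0]), 2))  # the in Article 1 -> 0.0 (filtered out)
```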

- Some Experiments -
7. Apply TF-IDF and do SVD (example). After dropping the all-zero row:

M =     Article 1  Article 2  ...  Article 100
dog     0.10       0.00            0.00
car     0.07       0.00            0.09

Then compute M = U S V^T as before.

- Some Experiments -
7. Apply log-entropy and do SVD. The local weight is log_ij = log(tf_ij + 1); the global weight is entropy_i = 1 + Σ_j (p_ij log p_ij) / log n, where p_ij = tf_ij / gf_i, gf_i is the total number of times word i appears in the corpus, and n is the number of documents in the corpus. After getting the new matrix from log-entropy, do SVD (same as with TF-IDF).
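The original formulas on this slide did not survive transcription; the sketch below uses the standard log-entropy definition (local weight log(tf + 1), global weight 1 + Σ p log p / log n), which matches the gf_i and n described above. This is my reconstruction, not the authors' code:

```python
import numpy as np

counts = np.array([[5, 0, 0],      # dog: concentrated in one document
                   [10, 15, 50],   # the: spread across documents
                   [4, 0, 7]], dtype=float)
n = counts.shape[1]           # number of documents in the corpus
gf = counts.sum(axis=1)       # gf_i: total occurrences of each word

p = counts / gf[:, None]      # p_ij = tf_ij / gf_i
with np.errstate(divide="ignore", invalid="ignore"):
    plogp = np.where(p > 0, p * np.log(p), 0.0)  # treat 0 log 0 as 0
entropy = 1.0 + plogp.sum(axis=1) / np.log(n)    # global weight in [0, 1]
local = np.log(counts + 1.0)                     # local weight
weighted = local * entropy[:, None]
```

A word concentrated in one document (dog) gets a global weight of 1, while a word spread evenly across the corpus (the) is pushed toward 0, which is why log-entropy also suppresses stopwords.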

- Some Experiments -
8. Do mapping selection: take the top 1, 10, 50, and 100 mappings based on similarity. The slide shows good and bad example mappings obtained using 1,000 article pairs with a 500-rank approximation and no weighting. The bad mappings arise because billion is not domain-specific, billion can sometimes be translated numerically instead of lexically, and the collection is too small (lack of data).

- Some Experiments -
9. Compute the precision and recall values for all experiments:
P = (number of correct mappings, checked against the bilingual dictionary) / (total number of mappings found)
R = (number of correct mappings, checked against the bilingual dictionary) / (total number of mappings in the bilingual dictionary)
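Step 9 reduces to simple set arithmetic; the found and gold mapping sets below are hypothetical:

```python
# Hypothetical mappings: (English word, Indonesian word) pairs.
found = {("dog", "anjing"), ("car", "mobil"), ("bank", "sungai")}
gold = {("dog", "anjing"), ("car", "mobil"), ("bank", "bank"), ("golf", "golf")}

correct = found & gold                  # mappings confirmed by the dictionary
precision = len(correct) / len(found)   # share of found mappings that are correct
recall = len(correct) / len(gold)       # share of dictionary mappings that were found

print(precision, recall)  # 2/3 and 1/2 for these toy sets
```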

- The Results -
1. As the collection size increases, the precision and recall values also increase.
2. The higher the rank-approximation percentage, the better the mapping results.

- The Results -
3. On account of the small size of the collection, stopwords may carry some semantic information.
4. Weighting can improve the mappings (especially log-entropy).

- The Results -
5. As the number of translation pairs selected increases, precision decreases while the chance of matching pairs in the bilingual dictionary (recall) increases.
Conclusion: the FREQ baseline (a basic vector space model) is better than LSA for word mapping.

Bilingual Concept Mapping - Semantic Vectors for Concepts -
1. For a WordNet concept c, construct a set of textual context words by including (1) the sublemma words, (2) the gloss words, and (3) the example-sentence words that appear in the corpus.

Bilingual Concept Mapping - Semantic Vectors for Concepts -
1. Likewise, for a KBBI concept c, construct a set of textual context words by including (1) the sublemma words, (2) the definition words, and (3) the example-sentence words that appear in the corpus.

Bilingual Concept Mapping - Semantic Vectors for Concepts -
2. Compute the semantic vector of a concept as a weighted average of the semantic vectors of the words in the set:
WordNet: sublemma 60%, gloss 30%, example 10%
KBBI: sublemma 60%, definition 30%, example 10%
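The 60/30/10 weighted average can be sketched as follows; the word vectors and the word groupings for the concept are illustrative, not from the paper:

```python
import numpy as np

# Hypothetical word vectors from the LSA space (2-D for readability).
vec = {"talk": np.array([0.8, 0.2]),
       "exchange": np.array([0.6, 0.4]),
       "message": np.array([0.7, 0.3])}

def concept_vector(sublemmas, gloss, examples, weights=(0.6, 0.3, 0.1)):
    """Weighted average of group means: sublemma 60%, gloss 30%, example 10%."""
    total, norm = np.zeros(2), 0.0
    for w, words in zip(weights, [sublemmas, gloss, examples]):
        present = [vec[x] for x in words if x in vec]  # only words seen in the corpus
        if present:
            total += w * np.mean(present, axis=0)
            norm += w
    # Renormalize when a group is empty, so the weights still sum to 1.
    return total / norm if norm else total

v = concept_vector(["talk"], ["exchange", "message"], [])
```

Renormalizing over the groups actually present handles the case shown later, where a synset has little or no gloss/example text in the corpus.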

Bilingual Concept Mapping - Latent Semantic Analysis -
3. Use 1,000 article pairs and set up a bilingual concept-document matrix for LSA. The English rows are WordNet synset IDs (e.g. 100319939, 201277784) over Articles 1E-1000E; the Indonesian rows are KBBI sense IDs (e.g. k39607, k02421) over Articles 1I-1000I.

Bilingual Concept Mapping - Latent Semantic Analysis -
3. To restrict the comparison, given a WordNet synset, look up its Indonesian words in the bilingual dictionary and select the most appropriate KBBI sense from that subset of senses only. For the synset communication, for example, only the senses of komunikasi and perhubungan are compared.

Bilingual Concept Mapping - Latent Semantic Analysis -
4. LSA: compute the SVD (Singular Value Decomposition), M = U S V^T, where U is the matrix of left singular vectors, S is the diagonal matrix containing the singular values of M, and V is the matrix of right singular vectors.

Bilingual Concept Mapping - Latent Semantic Analysis -
5. Compute the optimal reduced-rank approximation (reduce the dimensions of the matrix). For 1,000 article pairs: 10% = rank 100, 25% = rank 250, 50% = rank 500.
6. Compute the level of agreement between the LSA-based mappings and human annotations (an ongoing experiment to manually map WordNet synsets to KBBI senses).

Bilingual Concept Mapping - Check the Results -
7. As a baseline, select three random suggested Indonesian word senses as the mapping for an English word sense (RNDM3).
8. As another baseline, compare English concepts to their suggestions based on a full-rank word-document matrix (FREQ).
9. Choose the top 3 Indonesian concepts with the highest similarity values as the mapping results.

Bilingual Concept Mapping - Results -
10. Compute the Fleiss' kappa values:
- LSA 10% is better than the random baseline (RNDM3) and the frequency baseline (FREQ)
- LSA 10% is better than LSA 25% and LSA 50% (cf. the word-mapping results)
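Fleiss' kappa itself is a short computation over rating counts; a generic sketch (not the authors' code), with a toy 3-item, 3-rater, 2-category example:

```python
import numpy as np

def fleiss_kappa(ratings):
    """ratings[i][j] = number of raters assigning item i to category j."""
    ratings = np.asarray(ratings, dtype=float)
    N = ratings.shape[0]                  # number of items
    n = ratings[0].sum()                  # raters per item (assumed constant)
    p_j = ratings.sum(axis=0) / (N * n)   # overall category proportions
    # Per-item agreement: fraction of agreeing rater pairs.
    P_i = (np.square(ratings).sum(axis=1) - n) / (n * (n - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)

kappa = fleiss_kappa([[3, 0], [0, 3], [2, 1]])
```

Perfect agreement on every item gives kappa = 1; agreement at the level expected by chance gives 0, which is what makes kappa a stricter measure than raw agreement.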

Bilingual Concept Mapping - Mapping Results -
GOOD: when both textual context sets are fairly large, they provide sufficient context for LSA to choose the correct KBBI sense.
BAD: when the textual context set for the synset is very small, there is not sufficient context for LSA to choose the correct KBBI sense.

Discussion
Initial intuition: LSA is good for both word and concept mappings.
Results:
1. LSA blurs the co-occurrence information/details -> bad for word mapping
2. LSA is useful for revealing implicit semantic patterns -> good for concept mapping
Reasons:
- The rank reduction in LSA perhaps blurs some details
- Polysemous words are a problem for LSA
Suggestion: use a finer granularity of alignment (e.g. at the sentence level) for word mapping

Special thanks to Giulia and Yukun