Exploring the Vector Space Model for Finding Verbs Synonyms


Exploring the Vector Space Model for Finding Verbs Synonyms in Portuguese
Recent Advances in Natural Language Processing, September 14-16, 2009, Borovets, Bulgaria
Luís Sarmento, Paula Carvalho, Eugénio Oliveira
September 14, 2009

Motivation & Goals
- Linguistic resources for Portuguese are still scarce.
- Manual creation:
  - is time-consuming and requires linguistic expertise
  - is often biased towards a specific application (or linguistic "flavour")
  - has coverage limited by the amount of work invested
  - yields resources that are usually difficult to customize
- Automatic methods involve a large set of parameters whose impact on the final results is difficult to assess, and thus to optimize.
- Our long-term goal: develop and evaluate automatic techniques for creating lexical-semantic resources for Portuguese.

In this work
- We focus on the task of automatically finding verb synonyms for Portuguese using a Vector Space Model (VSM) approach.
- We study the impact of three core parameters of the VSM:
  1. the context used for extracting vector features
  2. the functions used for weighting features
  3. the cut-off threshold for removing vectors with insufficient feature information
- We follow a data-driven approach:
  - we use raw n-gram information (readily available)
  - we perform minimal linguistic pre-processing

The Vector Space Model and the Distributional Hypothesis
- In the VSM, items are converted into vector representations in a feature space (feature vectors): a geometric approach, with many mathematical tools available.
- The VSM is a convenient framework for finding semantic similarities because it allows an almost direct mapping of the Distributional Hypothesis: "words that occur in the same contexts tend to have similar meanings".
- The key is defining the right context and representing it in the vector space!

VSM Parameters: Context
- Context is the environment from which we extract the features that describe the semantic properties of a given word:
  - the lexical surroundings of the word (i.e. its neighbours)
  - the syntactic relations that the word establishes (e.g. a "subject of" relation)
- The choice of feature context has a huge impact on the information transferred to the vector space.
- It directly affects the notion of similarity that may be inferred from the feature vectors.

VSM Parameters: Feature Weighting Functions
- Different weighting functions promote (or demote) different sections of the feature spectrum, e.g. should idiosyncratic or rare features be considered (more) important?
- They may combine local feature information with global statistics taken from the whole corpus.
- Examples of weighting functions: tf-idf, Mutual Information, Log-Likelihood Ratio, ...
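To make the idea of combining local counts with global corpus statistics concrete, here is a minimal sketch of one such weighting function, pointwise mutual information. The slides do not state the exact MI formulation that was used, so the formula below, the function name `pmi_weight` and the toy counts are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def pmi_weight(raw):
    """Re-weight raw co-occurrence counts with pointwise mutual information.

    raw maps verb -> {feature: count}. The weight of a (verb, feature) pair is
    log2( p(verb, feature) / (p(verb) * p(feature)) ).
    """
    total = 0
    verb_totals = defaultdict(int)
    feature_totals = defaultdict(int)
    # Global statistics: marginal counts over the whole (toy) corpus.
    for verb, features in raw.items():
        for feature, count in features.items():
            total += count
            verb_totals[verb] += count
            feature_totals[feature] += count

    weighted = {}
    for verb, features in raw.items():
        weighted[verb] = {
            feature: math.log2(
                count * total / (verb_totals[verb] * feature_totals[feature])
            )
            for feature, count in features.items()
        }
    return weighted


# Toy counts (hypothetical, not taken from the slides).
raw_counts = {
    "dizer": {"X que o": 8, "X a verdade": 2},
    "afirmar": {"X que o": 6},
    "gostar": {"X de ti": 4},
}
weights = pmi_weight(raw_counts)
```

In this toy data the rare feature "X a verdade" ends up with a higher weight for "dizer" than the frequent, widely shared feature "X que o", which is the kind of promotion of idiosyncratic features the slide refers to.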

Other VSM Parameters
- Cut-off thresholds: filter out components or entire feature vectors
  - feature vectors with a low number of non-nil components
- Distance metrics for comparing the (weighted) vectors:
  - Geometry-based: Euclidean distance (L2), cosine metric, ...
  - Probabilistic-inspired: Kullback-Leibler, Jensen-Shannon, ...
- In this work we do NOT explore the distance metric, because it interacts with the weighting functions.
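The two parameters on this slide can be sketched in a few lines. The sparse-dictionary representation, the function names (`apply_cutoff`, `euclidean`, `cosine`) and the toy vectors are assumptions for illustration, not the authors' implementation.

```python
import math


def apply_cutoff(vectors, min_features):
    """Keep only the vectors with at least min_features non-nil components."""
    return {
        word: vec
        for word, vec in vectors.items()
        if sum(1 for value in vec.values() if value != 0) >= min_features
    }


def euclidean(u, v):
    """Euclidean (L2) distance between two sparse vectors stored as dicts."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[feature] for feature, weight in u.items() if feature in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vectors = {
    "a": {"f1": 1.0, "f2": 1.0},
    "b": {"f1": 2.0, "f2": 2.0},  # same direction as "a", twice the magnitude
    "c": {"f3": 5.0},             # only one non-nil component
}
kept = apply_cutoff(vectors, min_features=2)
```

Here "a" and "b" point in the same direction, so their cosine similarity is maximal even though their Euclidean distance is not zero. This illustrates why the choice of metric interacts with how the components were weighted.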

VSM for Verbs Synonyms in Portuguese: Extracting Feature Information (I)
- Synonymy is a paradigmatic relation: synonyms occur in the same context / position.
- We believe that, for Portuguese, much of the information relevant for describing the semantic properties of verbs can be found in the lexical neighbourhood:
  - for transitive verbs, the following 2 to 4 words should contain relevant verb-object relations
  - for intransitive verbs, the 2 to 4 words preceding the verb should contain information about typical subjects and modifiers

VSM for Verbs Synonyms in Portuguese: Extracting Feature Information (II)
- Our hypothesis is that a [-2, +2] window is sufficient for capturing enough distributional evidence for inferring verb synonyms.
- We used a database of n-gram statistics compiled from a dump of the Portuguese web (1000 million words).
- N-gram information in this collection is not POS-tagged, but most verb forms in Portuguese can be unambiguously analyzed using only dictionary lookups.

VSM for Verbs Synonyms in Portuguese: Extracting Feature Information (III)
- We scanned 173,607,555 3-grams (v_uf = unambiguous form):
  - Pattern 1 = [w1 = v_uf & w2 = * & w3 = *]
  - Pattern 2 = [w1 = * & w2 = * & w3 = v_uf]
- Verb forms (at w1 or at w3) were lemmatized, yielding tuples of the form:
  - (verb lemma, "X w2 w3", frequency)
  - (verb lemma, "w1 w2 X", frequency)
- Result: information for 5,025 verbs; feature space with 4,068,853 dimensions.
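The two selection patterns can be sketched as a single pass over the 3-gram table. Everything concrete here is assumed for illustration: the `(words, frequency)` input layout, the function name `extract_features`, and the tiny `UNAMBIGUOUS_FORMS` lookup, which stands in for the dictionary of unambiguous verb forms mentioned on the previous slide.

```python
from collections import defaultdict

# Hypothetical lookup of unambiguous verb forms -> lemma (toy entries only).
UNAMBIGUOUS_FORMS = {
    "disseram": "dizer",
    "dizemos": "dizer",
    "afirmou": "afirmar",
}


def extract_features(trigrams, verb_forms):
    """Build raw feature vectors (verb lemma -> {feature: frequency})."""
    vectors = defaultdict(lambda: defaultdict(int))
    for (w1, w2, w3), freq in trigrams:
        # Pattern 1: verb form at w1; the two following words are the feature.
        if w1 in verb_forms:
            vectors[verb_forms[w1]][f"X {w2} {w3}"] += freq
        # Pattern 2: verb form at w3; the two preceding words are the feature.
        if w3 in verb_forms:
            vectors[verb_forms[w3]][f"{w1} {w2} X"] += freq
    return {verb: dict(features) for verb, features in vectors.items()}


# Toy 3-gram statistics (hypothetical frequencies).
toy_trigrams = [
    (("disseram", "que", "o"), 120),
    (("dizemos", "que", "o"), 30),
    (("ele", "não", "afirmou"), 12),
    (("afirmou", "que", "o"), 45),
]
raw_vectors = extract_features(toy_trigrams, UNAMBIGUOUS_FORMS)
```

Because both inflected forms of "dizer" map to the same lemma, their frequencies for the shared feature "X que o" are summed into one component, which is the effect of the lemmatization step described above.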

VSM for Verbs Synonyms in Portuguese: Weighting and Comparing
- Feature vectors contain raw frequency information. Next steps:
  1. Features are weighted using a weighting function, to reflect more faithfully the true association between verbs and features.
  2. Weighted feature vectors are compared to obtain all pairwise similarities.
- Synonyms for verb v_i should be the other verbs, v_j, whose feature vectors [V_j] are most similar to [V_i].

Overall Procedure
1. Compile feature vectors by filtering the 3-gram database with the selection patterns.
2. Compile statistics regarding the features and generate the set of weighted feature vectors using a given weighting function.
3. Compute pairwise vector similarity using the metric of choice (we use the cosine metric in all our experiments).
4. For each verb v_i, obtain the top n vectors closest to its word vector, keeping the corresponding words as verb synonyms.
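Steps 3 and 4 of the procedure can be sketched as follows. This is a brute-force illustration over a handful of already weighted toy vectors; the function name `top_n_candidates`, the dictionary representation and the weights are assumptions, and a corpus with thousands of verbs and millions of dimensions would need a more efficient sparse implementation.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[feature] for feature, weight in u.items() if feature in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_n_candidates(vectors, n):
    """For every verb, return the n other verbs with the most similar vectors."""
    result = {}
    for verb, vec in vectors.items():
        scored = [
            (cosine(vec, other_vec), other)
            for other, other_vec in vectors.items()
            if other != verb
        ]
        # Highest similarity first; ties fall back to the verb string.
        result[verb] = [other for _, other in heapq.nlargest(n, scored)]
    return result


# Toy weighted vectors (hypothetical weights).
weighted = {
    "dizer": {"X que o": 1.0, "X a verdade": 2.0},
    "afirmar": {"X que o": 1.0, "X a verdade": 1.5},
    "gostar": {"X de ti": 3.0, "X que o": 0.2},
}
candidates = top_n_candidates(weighted, n=1)
```

With these toy weights the closest neighbour of "dizer" is "afirmar", since the two share both features, while "gostar" overlaps with them only marginally.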

Evaluation: Automatic and Manual
1. Automatic: gold synsets from OpenOffice + Wiktionary
   - 3,423 verbs, with an average of 4.53 synonyms per verb
   - enough for parameter exploration, despite coverage and recall problems
2. Manual: two groups of verbs
   - V_com: 25 declarative verbs: dizer ("to say"), mencionar ("to mention"), ...
   - V_emo: 25 psychological verbs: gostar ("to like"), envergonhar ("to shame"), ...

Evaluation: Metrics
- Metrics for each verb v_i:
  - Precision at rank 1, P@(v_i, 1)
  - Precision at rank N_gold(v_i), P@(v_i, N_gold(v_i))
  - Average Precision, AP(v_i)
- Global precision figures: P_avg@(1), P_avg@(N) and MAP.
- Global coverage figure: $C = |V_{auto} \cap V_{gold}| \,/\, |V_{gold}|$
- For manual evaluation: P_man@(v_i, n), with $n \in \{1, 5, 10, 20\}$
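The per-verb metrics and the coverage figure can be sketched as below. The slides do not spell out how Average Precision is normalised, so dividing by the number of gold synonyms is an assumption here, as are the function names and the toy data.

```python
def precision_at(ranked, gold, k):
    """Fraction of the top-k ranked candidates that are gold synonyms."""
    if k == 0:
        return 0.0
    return sum(1 for cand in ranked[:k] if cand in gold) / k


def average_precision(ranked, gold):
    """Sum of the precision at each rank holding a gold synonym, over |gold|.

    Normalising by the size of the gold set is an assumption.
    """
    hits = 0
    total = 0.0
    for rank, cand in enumerate(ranked, start=1):
        if cand in gold:
            hits += 1
            total += hits / rank
    return total / len(gold) if gold else 0.0


def coverage(auto_verbs, gold_verbs):
    """Share of gold-standard verbs for which the system produced output."""
    gold_set = set(gold_verbs)
    return len(set(auto_verbs) & gold_set) / len(gold_set)


# Toy ranking for one verb and its gold synonyms (hypothetical data).
ranked = ["afirmar", "gostar", "referir", "mencionar"]
gold = {"afirmar", "mencionar"}

p_at_1 = precision_at(ranked, gold, 1)
p_at_n_gold = precision_at(ranked, gold, len(gold))
ap = average_precision(ranked, gold)
c = coverage({"dizer", "gostar", "correr"}, {"dizer", "gostar", "amar", "falar"})
```

Averaging `p_at_1`, `p_at_n_gold` and `ap` over all evaluated verbs would give figures analogous to P_avg@(1), P_avg@(N) and MAP.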

Evaluation: 3 Sets of Experiments
1. Experiment Set 1: using a [-2, +2] window, compare weighting functions: tf-idf, Log-Likelihood Ratio (LL), Z-Score, Pearson's χ² test, Student's T test, Mutual Information (MI), Mutual Dependency (MD), φ² test, and no weighting.
2. Experiment Set 2: using the best performing weighting function, compare the [-2, 0] window vs. the [0, +2] window vs. the [-2, +2] window.
3. Experiment Set 3: using the best performing weighting function, increase the cut-off threshold on the number of non-nil features.
4. Finally: manually evaluate the best configuration in Experiment Set 3.

Results: Experiments Set 1 (Weighting Function)

Weighting   P_avg@1   P_avg@N   MAP     C
MI          0.221     0.121     0.125   0.800
MD          0.164     0.083     0.083   0.800
Z           0.134     0.096     0.067   0.712
χ²          0.087     0.075     0.030   0.392
φ²          0.084     0.075     0.027   0.375
raw         0.083     0.041     0.043   0.798
tf-idf      0.076     0.038     0.039   0.800
T           0.073     0.040     0.040   0.800
LL          0.059     0.034     0.037   0.796

Table: context window = [-2, +2] and cut-off threshold = 1.

Results: Experiments Set 2 (Context Window)

Window      P_avg@1   P_avg@N   MAP     C
[-2,  0]    0.136     0.078     0.079   0.779
[ 0, +2]    0.196     0.107     0.111   0.798
[-2, +2]    0.221     0.121     0.125   0.800

Table: weighting function = MI and cut-off threshold = 1.

Results: Experiments Set 3 (Cut-off Threshold)

Cut-off   P_avg@1   P_avg@N   MAP     C
1         0.221     0.121     0.125   0.800
10        0.251     0.136     0.136   0.783
20        0.263     0.142     0.141   0.767
50        0.277     0.149     0.149   0.736
100       0.288     0.154     0.154   0.695
200       0.297     0.155     0.155   0.632
500       0.297     0.146     0.146   0.507
1000      0.290     0.141     0.141   0.398
2000      0.294     0.140     0.141   0.300

Table: weighting function = MI and context window = [-2, +2].

Results: Manual Evaluation against V_com and V_emo

Group   P_man@1   P_man@5   P_man@10   P_man@20
V_com   0.88      0.71      0.56       0.44
V_emo   0.60      0.44      0.37       0.27

Table: manual evaluation of sets V_com and V_emo.

(I)
- Weighting functions DO have a crucial impact on results.
- Low-frequency features carry most of the information about similarity.
- Both sides of the window around the verb contain important information regarding synonymy, but the two following words seem to carry more information (this might be due to the higher number of transitive verbs).
- It is beneficial to exclude word vectors with < 50 features, but if the cut-off threshold is set too high, performance is hurt.

(II)
- Performance depends on the degree of polysemy, the vagueness of use and the number of antonyms of the verb at stake:
  - very high performance for communication verbs, V_com
  - lower performance for emotion-related verbs, V_emo
- Because almost no linguistic pre-processing was done, the results for V_emo (P@1: 0.60, P@5: 0.45) can be seen as baseline figures for the task of automatically finding verb synonyms for Portuguese.

Thank you! Questions & comments?
Luís Sarmento: las@fe.up.pt
Paula Carvalho: pcc@di.fc.ul.pt
Eugénio Oliveira: eco@fe.up.pt
This work was partially supported by grants SFRH/BD/23590/2005 and SFRH/BPD/45416/2008 (FCT-Portugal), co-financed by POSI.