Exploring the Vector Space Model for Finding Verbs Synonyms

Exploring the Vector Space Model for Finding Verbs Synonyms in Portuguese
Recent Advances in Natural Language Processing, September 14-16, 2009, Borovets, Bulgaria
Luís Sarmento, Paula Carvalho, Eugénio Oliveira
September 14, 2009

Motivation & Goals
- Linguistic resources (for Portuguese) are still scarce
- Manual creation:
  - is time-consuming and requires linguistic expertise
  - is often biased towards a specific application (or linguistic flavour)
  - coverage is limited by the amount of work
  - resulting resources are usually difficult to customize
- Automatic methods:
  - involve a large set of parameters, whose impact on final results is difficult to assess, and thus to optimize
- Our long-term goal: develop and evaluate automatic techniques for the creation of lexical-semantic resources for Portuguese

In this work
- We focus on the task of automatically finding verb synonyms for Portuguese using a Vector Space Model (VSM) approach
- We study the impact of three core parameters of the VSM:
  1. the context used for extracting vector features
  2. the functions used for weighting features
  3. the cut-off threshold for removing vectors with insufficient feature information
- We follow a data-driven approach:
  - we use raw n-gram information (readily available)
  - we perform minimal linguistic pre-processing

The Vector Space Model and the Distributional Hypothesis
- In the VSM, items are converted into a vector representation on a feature space (feature vectors)
  - geometric approach, with many mathematical tools available
- The VSM is a convenient framework for finding semantic similarities because it allows an almost direct mapping of the Distributional Hypothesis:
  - words that occur in the same contexts tend to have similar meanings
- The key is defining the right context and representing it in the vector space!

VSM Parameters: Context
- Context is the environment from which we extract features that describe the semantic properties of a given word:
  - lexical surroundings of the word (i.e. neighbouring words)
  - syntactic relations that the word establishes (e.g. being in a subject-of relation)
- The choice of feature context has a huge impact on the information transferred to the vector space
- It directly affects the notion of similarity that may be inferred from feature vectors

VSM Parameters: Feature Weighting Functions
- Different weighting functions promote (or demote) different sections of the feature spectrum
  - e.g. should idiosyncratic or rare features be considered (more) important?
- They may combine local feature information with global statistics taken from the whole corpus
- Examples of weighting functions:
  - tf-idf
  - Mutual Information
  - Log-Likelihood Ratio
  - ...
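
To make the idea of promoting rare features concrete, here is a minimal sketch of one possible weighting function. It assumes the common pointwise formulation of Mutual Information, log2(p(v, f) / (p(v) p(f))), estimated from raw counts; the slides do not give the exact formula used, and the function name `pmi_weight` and the toy counts are illustrative only.

```python
import math
from collections import defaultdict


def pmi_weight(counts):
    """Weight raw co-occurrence counts with pointwise mutual information.

    counts: dict mapping verb -> {feature: raw frequency}.
    Assumption: PMI(v, f) = log2(c(v, f) * N / (c(v) * c(f))), where N is
    the sum of all counts. This is one common definition, not necessarily
    the one used in the original experiments.
    """
    verb_totals = {v: sum(feats.values()) for v, feats in counts.items()}
    feature_totals = defaultdict(int)
    for feats in counts.values():
        for f, c in feats.items():
            feature_totals[f] += c
    total = sum(verb_totals.values())

    weighted = {}
    for v, feats in counts.items():
        weighted[v] = {
            f: math.log2(c * total / (verb_totals[v] * feature_totals[f]))
            for f, c in feats.items()
            if c > 0  # PMI is undefined for zero counts
        }
    return weighted


# Toy counts (invented for illustration): "X que o" is frequent and shared,
# the other two features are rare and verb-specific.
toy_counts = {
    "dizer": {"X que o": 8, "X a verdade": 2},
    "afirmar": {"X que o": 8, "X em tribunal": 2},
}
toy_weights = pmi_weight(toy_counts)
```

In this toy example the frequent, shared feature receives weight 0 while the rare, verb-specific features receive weight 1, which is the kind of promotion of rare features the slide refers to.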

Other VSM Parameters
- Cut-off thresholds: filter out components or entire feature vectors
  - e.g. feature vectors with a low number of non-nil components
- Distance metrics for comparing the (weighted) vectors:
  - Geometry-based: Euclidean distance (L2), cosine metric, ...
  - Probabilistic-inspired: Kullback-Leibler, Jensen-Shannon, ...
  - In this work we will NOT explore this parameter because it interacts with weighting functions
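
A cut-off threshold on the number of non-nil components can be sketched as a simple filter. This assumes sparse vectors stored as dictionaries and assumes that a vector is kept when it has at least `min_features` non-zero components; the function name and the exact comparison are illustrative choices, not taken from the slides.

```python
def apply_cutoff(vectors, min_features):
    """Keep only feature vectors with enough non-nil components.

    vectors: dict mapping word -> {feature: value} (sparse representation).
    min_features: assumed to be an inclusive lower bound on the number of
    non-zero components a vector needs in order to be kept.
    """
    kept = {}
    for word, feats in vectors.items():
        non_nil = sum(1 for value in feats.values() if value != 0)
        if non_nil >= min_features:
            kept[word] = feats
    return kept


# Toy vectors (invented): "y" has a single non-nil feature and is dropped.
toy_vectors = {"x": {"a": 1, "b": 2}, "y": {"a": 1, "b": 0}}
toy_filtered = apply_cutoff(toy_vectors, 2)
```

With a threshold of 1, every vector that has any feature information at all is kept, which matches the least restrictive setting one can choose.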

VSM for Verbs Synonyms in Portuguese: Extracting Feature Information (I)
- Synonymy is a paradigmatic relation: synonyms occur in the same context / position
- We believe that for Portuguese much of the information relevant for describing the semantic properties of verbs can be found in the lexical neighbourhood:
  - for transitive verbs, the following 2 to 4 words should contain relevant verb-object relations
  - for intransitive verbs, the 2 to 4 words preceding the verb should contain information about typical subjects and modifiers

VSM for Verbs Synonyms in Portuguese: Extracting Feature Information (II)
- Our hypothesis is that a [-2 : +2] window is sufficient for capturing enough distributional evidence for inferring verb synonyms
- We used a database of n-gram statistics compiled from a dump of the Portuguese web (1000 million words)
- N-gram information in this collection is not POS-tagged, but most verb forms in Portuguese can be unambiguously analyzed using only dictionary lookups

VSM for Verbs Synonyms in Portuguese: Extracting Feature Information (III)
- We scanned 173,607,555 3-grams (v_uf = unambiguous form):
  - Pattern 1 = [w1 = v_uf & w2 = * & w3 = *]
  - Pattern 2 = [w1 = * & w2 = * & w3 = v_uf]
- Verb forms (at w1 or at w3) were lemmatized, giving tuples of the form:
  - (verb lemma, "X w2 w3", frequency)
  - (verb lemma, "w1 w2 X", frequency)
- Information for 5,025 verbs. Feature space: 4,068,853 dimensions
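
The two selection patterns can be sketched as a single pass over the 3-gram counts. This is a minimal illustration under stated assumptions: the 3-grams are available as ((w1, w2, w3), frequency) pairs, and a hypothetical dictionary `verb_lemmas` maps each unambiguous verb form to its lemma. Neither the data layout nor the names come from the original work.

```python
from collections import defaultdict


def extract_features(trigrams, verb_lemmas):
    """Build raw-frequency feature vectors from 3-gram counts.

    trigrams: iterable of ((w1, w2, w3), frequency) pairs.
    verb_lemmas: hypothetical dict mapping an unambiguous verb form
    to its lemma (standing in for the dictionary lookup).
    Returns a dict mapping verb lemma -> {feature string: frequency}.
    """
    vectors = defaultdict(lambda: defaultdict(int))
    for (w1, w2, w3), freq in trigrams:
        if w1 in verb_lemmas:  # Pattern 1: verb at w1, features to the right
            vectors[verb_lemmas[w1]][f"X {w2} {w3}"] += freq
        if w3 in verb_lemmas:  # Pattern 2: verb at w3, features to the left
            vectors[verb_lemmas[w3]][f"{w1} {w2} X"] += freq
    return {lemma: dict(feats) for lemma, feats in vectors.items()}


# Toy data (invented): two inflected forms of "dizer" share one feature.
toy_lemmas = {"disse": "dizer", "dizem": "dizer"}
toy_trigrams = [
    (("disse", "que", "o"), 10),
    (("dizem", "que", "o"), 5),
    (("ele", "não", "disse"), 3),
]
toy_features = extract_features(toy_trigrams, toy_lemmas)
```

Because verb forms are mapped to lemmas before counting, frequencies from different inflected forms accumulate on the same feature of the same verb.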

VSM for Verbs Synonyms in Portuguese: Weighting and Comparing
- Feature vectors contain raw frequency information. Next steps:
  1. Features are weighted using a weighting function, so that the weights reflect more faithfully the true association between verbs and features
  2. Weighted feature vectors are compared to obtain all pairwise similarities
- Synonyms for verb v_i should be the other verbs, v_j, whose feature vectors [V_j] are most similar to [V_i]

Overall Procedure
1. compile feature vectors by filtering the 3-gram database with selection patterns;
2. compile statistics regarding the features and generate the set of weighted feature vectors using a given weighting function;
3. compute pairwise vector similarity using the metric of choice (we use the cosine metric in all our experiments);
4. for each verb v_i obtain the top n vectors closest to its word vector, keeping the corresponding words as verb synonyms.
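
Steps 3 and 4 of the procedure can be sketched as follows, assuming the weighted feature vectors are stored as sparse dictionaries. The function names `cosine` and `top_synonyms` and the toy vectors are illustrative; a brute-force comparison like this one is only practical for small vocabularies.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(weight * v[f] for f, weight in u.items() if f in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_synonyms(vectors, verb, n):
    """Return the n verbs whose vectors are closest to that of `verb`.

    vectors: dict mapping verb -> {feature: weight}.
    Ties are broken alphabetically so that the output is deterministic.
    """
    target = vectors[verb]
    scored = [
        (cosine(target, feats), other)
        for other, feats in vectors.items()
        if other != verb
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:n]]


# Toy weighted vectors (invented for illustration).
toy_space = {
    "dizer": {"a": 1.0, "b": 1.0},
    "afirmar": {"a": 1.0, "b": 0.5},
    "gostar": {"c": 1.0},
}
toy_candidates = top_synonyms(toy_space, "dizer", 2)
```

Note that the top n list always contains n candidates when enough verbs exist, even if some of them share no features with the target verb.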

Evaluation: Automatic and Manual
1. Automatic - gold synsets from OpenOffice + Wiktionary
   - 3,423 verbs, with an average of 4.53 synonyms per verb
   - enough for parameter exploration despite coverage and recall problems
2. Manual - two groups of verbs:
   - V_com: 25 declarative verbs: dizer (to say), mencionar (to mention), ...
   - V_emo: 25 psychological verbs: gostar (to like), envergonhar (to shame), ...

Evaluation: Metrics
- Metrics for each verb v_i:
  - Precision at rank 1, P@(v_i, 1)
  - Precision at rank N_gold(v_i), P@(v_i, N_gold(v_i))
  - Average Precision, AP(v_i)
- Global precision figures: P_avg@(1), P_avg@(N) and MAP
- Global coverage figure: $C = |V_{auto} \cap V_{gold}| / |V_{gold}|$
- For manual evaluation: P_man@(v_i, n), with $n \in \{1, 5, 10, 20\}$
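
The per-verb metrics and the coverage figure can be sketched as below. The slides name the metrics without spelling out formulas, so this assumes the standard information-retrieval definitions, with Average Precision normalised by the size of the gold synonym set; function names and the toy ranking are illustrative.

```python
def precision_at(ranked, gold, k):
    """Fraction of the top-k ranked candidates that are gold synonyms."""
    if k <= 0:
        return 0.0
    return sum(1 for word in ranked[:k] if word in gold) / k


def average_precision(ranked, gold):
    """Average of precision values at each rank where a gold synonym occurs.

    Assumption: normalised by the number of gold synonyms, so gold
    synonyms that are never retrieved lower the score.
    """
    if not gold:
        return 0.0
    hits = 0
    total = 0.0
    for rank, word in enumerate(ranked, start=1):
        if word in gold:
            hits += 1
            total += hits / rank
    return total / len(gold)


def coverage(auto_verbs, gold_verbs):
    """Share of gold verbs for which the automatic method has an entry."""
    gold = set(gold_verbs)
    return len(set(auto_verbs) & gold) / len(gold)


# Toy ranking for one verb (invented for illustration).
toy_ranked = ["afirmar", "gostar", "declarar"]
toy_gold = {"afirmar", "declarar"}
toy_p1 = precision_at(toy_ranked, toy_gold, 1)
toy_pn = precision_at(toy_ranked, toy_gold, len(toy_gold))
toy_ap = average_precision(toy_ranked, toy_gold)
```

The global figures would then be obtained by averaging these per-verb values over all verbs that appear in the gold standard.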

Evaluation: 3 Sets of Experiments
1. Experiment Set 1: using a [-2, +2] window
   - weighting functions: tf-idf, Log-Likelihood Ratio (LL), Z-Score, Pearson's χ² test, Student's t-test, Mutual Information (MI), Mutual Dependency (MD), φ² test, and no weighting
2. Experiment Set 2: using the best performing weighting function
   - [-2, 0] window vs. [0, +2] window vs. [-2, +2] window
3. Experiment Set 3: using the best performing weighting function
   - increase the cut-off threshold on the number of non-nil features
4. Finally: manually evaluate the best configuration from Experiment Set 3

Results: Experiments Set 1 (Weighting Function)

Weighting   P_avg@1   P_avg@N   MAP     C
MI          0.221     0.121     0.125   0.800
MD          0.164     0.083     0.083   0.800
Z           0.134     0.096     0.067   0.712
χ²          0.087     0.075     0.030   0.392
φ²          0.084     0.075     0.027   0.375
raw         0.083     0.041     0.043   0.798
tf-idf      0.076     0.038     0.039   0.800
T           0.073     0.040     0.040   0.800
LL          0.059     0.034     0.037   0.796

Table: context window = [-2, +2] and cut-off threshold = 1.

Results: Experiments Set 2 (Context Window)

Window      P_avg@1   P_avg@N   MAP     C
[-2,  0]    0.136     0.078     0.079   0.779
[ 0, +2]    0.196     0.107     0.111   0.798
[-2, +2]    0.221     0.121     0.125   0.800

Table: weighting function = MI and cut-off threshold = 1.

Results: Experiments Set 3 (Cut-off Threshold)

Cut-off     P_avg@1   P_avg@N   MAP     C
1           0.221     0.121     0.125   0.800
10          0.251     0.136     0.136   0.783
20          0.263     0.142     0.141   0.767
50          0.277     0.149     0.149   0.736
100         0.288     0.154     0.154   0.695
200         0.297     0.155     0.155   0.632
500         0.297     0.146     0.146   0.507
1000        0.290     0.141     0.141   0.398
2000        0.294     0.140     0.141   0.300

Table: weighting function = MI and context window = [-2, +2].

Results: Manual Evaluation against V_com and V_emo

Group       P_man@1   P_man@5   P_man@10   P_man@20
V_com       0.88      0.71      0.56       0.44
V_emo       0.60      0.44      0.37       0.27

Table: Manual evaluation of sets V_com and V_emo.

Conclusions (I)
- Weighting functions DO have a crucial impact on results
  - low-frequency features carry most information about similarity
- Both sides of the window around the verb contain important information regarding synonymy
  - but the two following words seem to carry more information (this might be due to the higher number of transitive verbs)
- It is beneficial to exclude word vectors with < 50 features
  - but if the cut-off threshold is set too high, performance is hurt

Conclusions (II)
- Performance depends on the degree of polysemy, vagueness of use and on the number of antonyms of the verb at stake:
  - very high performance for communication verbs, V_com
  - lower performance for emotion-related verbs, V_emo
- Because almost no linguistic pre-processing was made, the results for V_emo (P@1 ≈ 0.60, P@5 ≈ 0.45) can be seen as baseline figures for the task of automatically finding verb synonyms for Portuguese

Thank you! Questions & comments?
- Luís Sarmento: las@fe.up.pt
- Paula Carvalho: pcc@di.fc.ul.pt
- Eugénio Oliveira: eco@fe.up.pt

This work was partially supported by grants SFRH/BD/23590/2005 and SFRH/BPD/45416/2008 FCT-Portugal, co-financed by POSI.