Exploring Vector Space Models to Predict the Compositionality of German Noun-Noun Compounds

Institut für Maschinelle Sprachverarbeitung (IMS), Universität Stuttgart, Germany
*SEM, Atlanta, June 13-14, 2013

Overview
Motivation and Background
Description of Compositionality Ratings & Data Sets
Evaluation & Baselines
Predicting Compound-Constituent Ratings
POS Feature Comparison
Syntax Feature Comparison
Predicting Compound Whole Ratings
Conclusions

Motivation
Vector Space Models (VSMs): explore the notion of similarity between a set of target objects within a geometric setting (Turney and Pantel, 2010; Erk, 2012).
Distributional Semantics: exploit the distributional hypothesis (Firth, 1957; Harris, 1968) to determine co-occurrence features for vector space models that best describe the words, phrases, sentences, etc. of interest.
Salient Distributional Features in VSMs: there is general knowledge about useful features, but not across phenomena.
Linguist - Computational Linguist loop
Phenomenon: German noun-noun compounds, such as Feuerwerk 'fireworks' (Feuer 'fire' + Werk 'opus').

Hypotheses
1. Targets in the vector space models are nouns (compound nouns, modifier nouns, head nouns): adjectives and verbs provide the most salient features, and syntax-based features outperform window-based features.
2. Contributions of modifier noun vs. head noun: distributional properties of heads are more salient than distributional properties of modifiers in predicting the degree of compositionality of the compounds.

German Noun-Noun Compounds
German noun-noun compounds: combinations of two or more simplex nouns; the grammatical head is a noun (in German: the rightmost constituent); the modifier is a noun
Examples: Ahornblatt 'maple leaf', Obstkuchen 'fruit cake'
Degree of Compositionality: semantic relatedness between the compound meaning and the meanings of the constituents
Examples (T = transparent; O = opaque):
TT  Ahornblatt   maple + leaf
OO  Löwenzahn    lion + tooth     'dandelion'
TO  Feuerzeug    fire + stuff     'lighter'
OT  Fliegenpilz  fly + mushroom   'toadstool'
Dataset: 244 two-part noun-noun compounds

Compositionality Ratings
Two collections:
1. Compound-Constituent Ratings
2. Compound Whole Ratings

Compound-Constituent Ratings
Material: 450 concrete, depictable German noun compounds (we use a subset of these)
Participants: 30 per compound
Task: rate the degree of compositionality of the compounds with respect to their first as well as their second constituent
Scale: 1 (definitely opaque) to 7 (definitely transparent)
Mode: paper and pen
Data: rating means and standard deviations

Compound Whole Ratings
Material: 244 noun-noun compounds (subset of the above)
Participants: 27 to 34 per compound
Task: rate the degree of compositionality of the compounds as a whole
Scale: 1 (definitely opaque) to 7 (definitely transparent)
Mode: Amazon Mechanical Turk (AMT)
Data: rating means and standard deviations

Compositionality Ratings: Examples
Mean ratings and standard deviations for the compound as a whole, and with respect to modifier and head:
Compound (whole)          Literal meanings of constituents   Whole         Modifier      Head
Ahornblatt 'maple leaf'   maple + leaf                       6.03 ± 1.49   5.64 ± 1.63   5.71 ± 1.70
Löwenzahn 'dandelion'     lion + tooth                       1.66 ± 1.54   2.10 ± 1.84   2.23 ± 1.92
Fliegenpilz 'toadstool'   fly/bow tie + mushroom             2.00 ± 1.20   1.93 ± 1.28   6.55 ± 0.63
Feuerzeug 'lighter'       fire + stuff                       4.58 ± 1.75   5.87 ± 1.01   1.90 ± 1.03

Compositionality Ratings: Distribution (1)

Vector Space Models: Setup
Goal: use VSMs to identify salient distributional features to predict the degree of compositionality of the compounds
Corpora: two German web corpora
Feature Values: local mutual information (LMI; Evert, 2005) of co-occurrence counts between target nouns and features: $\mathrm{LMI} = O \cdot \log \frac{O}{E}$, with observed co-occurrence frequency $O$ and expected co-occurrence frequency $E$
Measure of Relatedness: cosine, taken as the predicted degree of compositionality
Evaluation: cosine values against human ratings, using the Spearman Rank-Order Correlation Coefficient ρ (Siegel and Castellan, 1988)
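The feature weighting and the relatedness measure from the setup can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say how the expected frequency E is estimated, so the sketch assumes the usual independence estimate (target total × feature total / grand total), and the function names `lmi`, `lmi_vector` and `cosine` are hypothetical.

```python
import math

def lmi(observed, expected):
    """Local mutual information O * log(O / E); 0 if the pair never co-occurs."""
    if observed == 0:
        return 0.0
    return observed * math.log(observed / expected)

def lmi_vector(counts, target_total, feature_totals, grand_total):
    """Turn raw co-occurrence counts of one target noun into LMI weights.

    ASSUMPTION: E is the independence estimate
    target_total * feature_total / grand_total (not stated on the slides).
    """
    vector = {}
    for feature, observed in counts.items():
        expected = target_total * feature_totals[feature] / grand_total
        vector[feature] = lmi(observed, expected)
    return vector

def cosine(u, v):
    """Cosine between two sparse vectors stored as feature -> weight dicts."""
    dot = sum(weight * v[feature] for feature, weight in u.items() if feature in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

The cosine between a compound vector and a constituent vector would then serve as the predicted compositionality score. The base of the logarithm only rescales all weights by a constant, so it does not change the cosine.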

Baseline and Upper Bound
Upper Bound: correlations between human ratings:
whole vs. compound-modifier; whole vs. compound-head
addition/multiplication: whole vs. compound-modifier combined with compound-head (by + or ×)
Baseline: random assignment of rating values [1,7] to compound-modifier and compound-head pairs; correlation of the random values against human ratings
addition/multiplication: whole vs. rand(compound-modifier) combined with rand(compound-head) (by + or ×)

Baseline and Upper Bound
Function         ρ Baseline   ρ Upper Bound
modifier only    .0959        .6002
head only        .1019        .1385
addition         .1168        .7687
multiplication   .1079        .7829
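How the upper bound and the random baseline are obtained can be illustrated with a small, self-contained sketch. It uses only the mean ratings of the four example compounds shown earlier (Ahornblatt, Löwenzahn, Fliegenpilz, Feuerzeug), so it cannot reproduce the values in the table, which rest on all 244 compounds. The helper names and the fixed random seed are assumptions made for the illustration.

```python
import math
import random

def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

def combine(mod, hd, op):
    """Combine modifier and head scores item by item (addition or multiplication)."""
    return [op(m, h) for m, h in zip(mod, hd)]

# Mean human ratings of the four example compounds (illustrative subset only).
whole = [6.03, 1.66, 2.00, 4.58]
modifier = [5.64, 2.10, 1.93, 5.87]
head = [5.71, 2.23, 6.55, 1.90]

# Upper bound: human constituent ratings against human whole ratings.
upper_add = spearman(whole, combine(modifier, head, lambda m, h: m + h))
upper_mul = spearman(whole, combine(modifier, head, lambda m, h: m * h))

# Baseline: random ratings in [1, 7] in place of the constituent ratings.
rng = random.Random(0)  # ASSUMPTION: fixed seed, chosen only for repeatability
rand_mod = [rng.uniform(1, 7) for _ in whole]
rand_head = [rng.uniform(1, 7) for _ in whole]
baseline_add = spearman(whole, combine(rand_mod, rand_head, lambda m, h: m + h))
baseline_mul = spearman(whole, combine(rand_mod, rand_head, lambda m, h: m * h))
```

With only four items the correlations are very coarse; the point of the sketch is the procedure, not the numbers.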

Corpus Data: German Web Corpora
1. sdewac (Faaß et al., 2010)
cleaned and parsed version of the German web corpus dewac created by the WaCky group (Baroni et al., 2009)
corpus cleaning: removing duplicates; disregarding syntactically ill-formed sentences; etc.
size: approx. 880 million words
disadvantage: sentences in the corpus are sorted alphabetically, so window co-occurrence refers to x words to the left and right BUT within the same sentence
2. WebKo
predecessor version of sdewac
size: approx. 1.5 billion words
disadvantage: less clean and not parsed

Window-based VSMs
Hypothesis 1 (i): adjectives and verbs provide the most salient features (for describing noun compounds)
Task: compare parts-of-speech in predicting compositionality
Setup: specify corpus, part-of-speech and window size; determine co-occurrence counts and calculate LMI values
parts-of-speech: common nouns, adjectives, main verbs
window sizes: 1, 2, 5, 10, 20 (..., 100)
basis: lemmas; no punctuation
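The window-based counting step might look roughly as follows. This is a sketch, not the setup actually used: it assumes sentences are given as lists of (lemma, POS tag) pairs, that tags follow an STTS-like scheme in which punctuation tags start with "$", and that punctuation is removed before the window is applied. The function name `window_counts` and the example sentence are hypothetical.

```python
from collections import Counter, defaultdict

def window_counts(sentences, targets, window, pos_prefixes):
    """Count context lemmas of the given parts-of-speech around target nouns.

    sentences:    iterable of sentences, each a list of (lemma, pos) pairs
    targets:      set of target noun lemmas (compounds and constituents)
    window:       number of tokens considered to the left and to the right
    pos_prefixes: tuple of tag prefixes to keep as features, e.g. ("NN",)
    Windows never cross sentence boundaries (sentence-internal condition).
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        # ASSUMPTION: punctuation tags start with "$" and are dropped first.
        tokens = [(lemma, pos) for lemma, pos in sentence if not pos.startswith("$")]
        for i, (lemma, _) in enumerate(tokens):
            if lemma not in targets:
                continue
            left = max(0, i - window)
            right = min(len(tokens), i + window + 1)
            for j in range(left, right):
                if j == i:
                    continue
                context_lemma, context_pos = tokens[j]
                if context_pos.startswith(pos_prefixes):
                    counts[lemma][context_lemma] += 1
    return counts

# Hypothetical lemmatised and tagged sentence, for illustration only.
example = [("der", "ART"), ("Ahornblatt", "NN"), ("fallen", "VVFIN"),
           ("von", "APPR"), ("Baum", "NN"), (".", "$.")]
noun_features = window_counts([example], {"Ahornblatt"}, 5, ("NN",))
```

Running the same function with different `window` and `pos_prefixes` values gives one count table per condition, which can then be weighted and compared.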

Window-based VSMs: Results
NN > NN+ADJ+VV > VV > ADJ (significant)
window sizes: 100 = 50 20 > 10 > 5 > 2 > 1
WebKo > sdewac (significant; also with sentence-internal windows)
best result: ρ = 0.6497 (WebKo, NN, window size: 20)

Syntax-based VSMs
Hypothesis 1 (ii): syntax-based features outperform window-based features
Task: compare the two co-occurrence conditions
Setup: corpus choice: sdewac (parsed); specify the syntactic function; determine co-occurrence counts and calculate LMI values
syntactic functions (VS features):
nouns in verb subcategorisation: transitive and intransitive subjects; concatenation of both trans/intrans features (all subjects); direct objects; PP objects
noun-modifying adjectives
noun-modifying and noun-modified prepositions
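One possible reading of the syntax-based condition is sketched below. It assumes the parsed corpus has already been reduced to (governor lemma, syntactic function, noun lemma) triples; the function labels ("subj_trans", "subj_intrans", "dobj"), the function name `syntax_counts` and the example triples are all hypothetical, and the slides do not specify how the parser output was actually encoded.

```python
from collections import Counter, defaultdict

# ASSUMPTION: hypothetical labels for the two subject functions.
SUBJECT_FUNCTIONS = {"subj_trans", "subj_intrans"}

def syntax_counts(triples, targets, functions, merge_subjects=False):
    """Count function-specific features for target nouns.

    triples:        iterable of (governor_lemma, function, noun_lemma)
    targets:        set of target noun lemmas
    functions:      set of syntactic functions to include; passing several
                    functions concatenates their feature spaces
    merge_subjects: if True, abstract over subject (in)transitivity by
                    mapping both subject functions to one "subj" label
    """
    counts = defaultdict(Counter)
    for governor, function, noun in triples:
        if noun not in targets or function not in functions:
            continue
        if merge_subjects and function in SUBJECT_FUNCTIONS:
            label = "subj"
        else:
            label = function
        counts[noun][f"{label}:{governor}"] += 1
    return counts

# Hypothetical triples, for illustration only.
triples = [("fallen", "subj_intrans", "Blatt"),
           ("bedecken", "subj_trans", "Blatt"),
           ("sammeln", "dobj", "Blatt"),
           ("essen", "dobj", "Kuchen")]
all_subjects = syntax_counts(triples, {"Blatt"}, SUBJECT_FUNCTIONS, merge_subjects=True)
```

Prefixing each feature with its function keeps the feature spaces apart when several functions are combined, so "dobj:sammeln" and "subj:sammeln" would remain distinct dimensions.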

Syntax-based VSMs: Results
window-based > syntax-based
noun-modifying adjectives ≈ adjectives in window 20
verbs in window 20 > verb subcategorisation; best verb subcategorisation function: direct object
abstracting over subject (in)transitivity > specific functions
concatenation worse than the best individual functions

Role of Modifiers vs. Heads (1)
Hypothesis 2: distributional properties of heads are more salient than distributional properties of modifiers
Perspective (i): salient features for compound-modifier vs. compound-head pairs
Setup: same as before (window-based and syntax-based); distinguish the evaluation of 244 compound-modifier predictions vs. 244 compound-head predictions (instead of abstracting over the constituent type and using all 488 predictions)

Role of Modifiers vs. Heads (1): Results for Windows
window-based: NN > NN+ADJ+VV > VV > ADJ (same as before)
window sizes: 20 > 10 > 5 > 2 > 1 (same as before)
small windows: compound-head > compound-modifier predictions
larger windows: difference vanishes

Role of Modifiers vs. Heads (1): Results for Syntax
syntax-based: window-based > syntax-based (as before)
compound-head > compound-modifier predictions (exception: transitive subjects)
patterns with regard to function types vary (in comparison to previous models, and for modifiers vs. heads)

Role of Modifiers vs. Heads (2)
Hypothesis 2: distributional properties of heads are more salient than distributional properties of modifiers
Perspective (ii): contribution of modifiers vs. heads to compound meaning
Setup: window-based, window 20, across parts-of-speech; correlate only one type of compound-constituent predictions with the compound whole ratings; apply addition/multiplication, in correspondence to the upper bound

Role of Modifiers vs. Heads (2): Results
impact of distributional semantics: modifiers > heads
multiplication modifiers only
multiplication > addition

Summary
Hypothesis 1 (i): against our intuition, not adjectives or verbs but nouns provided the most salient distributional information.
Hypothesis 1 (ii): syntax-based predictions were all worse than or the same as the predictions by the respective window-based parts-of-speech.
Best Model: nouns within a 20-word window (ρ = 0.6497)

Summary
Hypothesis 2 (i): the salient features for predicting similarities differ between compound-modifier and compound-head pairs. With small windows, distributional similarity between compounds and heads > between compounds and modifiers; but the difference vanishes in larger contexts.
Hypothesis 2 (ii): the influence of the modifier meaning on the compound meaning is stronger than the influence of the head meaning, both in the human ratings and in the VSMs.
Future Work: learn more about the semantic role of modifiers vs. heads in noun-noun compounds (as do Gagné and Spalding, 2009; 2011, among others).

Compositionality Ratings: Distribution (2)

Window-based VSMs: Results
Context windows only: sentence-internal windows (sdewac, nouns only) vs. sentence-external windows (WebKo, nouns only)