Measuring association between words (and other linguistic units)
Marco Baroni
Text Processing

Outline
- Introduction
- Measuring association
- Pointwise Mutual Information and other AMs
- Association measures for keyword extraction
- Conclusion

The basic problem
- Most information extracted from text corpora comes in the form of co-occurrence counts: co-occurrence of a word with another word, co-occurrence of a word with a POS, co-occurrence of a POS sequence with a syntactic structure, occurrence of a word in a certain type of text, etc.
- We want to distinguish interesting co-occurrences from those that are due to chance
- Compare "of the" and "lame duck": the first pair has a higher co-occurrence frequency, but the second is probably more interesting

A catalogue of interesting co-occurrences: multi-word expressions
- Idioms (frozen figurative expressions): to shoot the breeze, lame duck
- Collocations (arbitrary choice of adjective or verb to express a certain meaning in relation to a noun): deliver a speech, take a shower, black coffee
- Lexical bundles (very frequent sequences that behave as a single function word): as well as, in order to
- Compounds (more or less lexicalized): book cover, Spider Man
- Named entities: New York, League of Nations...
A catalogue of interesting co-occurrences: other co-occurrences
- Semantic relatives: car-wheel, murder-victim, dog-kennel; important in word sense disambiguation, in relation extraction, and as dimensions in vectorial representations of word meaning
- Beyond word sequences:
  - Words and morphological or syntactic structures: e.g., the tendency of verbs to occur/not occur in passive constructions
  - Words and corpora (keyword extraction): what are the most typical words of corpus X compared to corpus Y?
  - Words in aligned parallel texts
  - POS strings and syntactic constructions...

Frequency
- Just looking for frequently recurring bigrams is typically not that interesting
- Ten most frequent BNC bigrams:

    of the    753180
    in the    480169
    to the    286955
    on the    207550
    and the   188216
    to be     187853
    for the   159579
    at the    138045
    that the  127180
    by the    125360

POS frames
- Given a tagged corpus, one can zero in on interesting syntactic configurations
- E.g.: [pos="vv.*"][pos="av.*"]?[pos="d.*" pos="at0"]{0,2} [pos="av.*"]?[pos="aj.*"]*[pos="nn.*"];
- One needs to specify which syntactic configurations are interesting
- POS tagging is not always available or feasible
Most frequent V+N pairs in the BNC

    take place     9827
    shake head     3758
    take part      2761
    make sense     2436
    play part      2320
    take time      2210
    play role      2184
    find way       1958
    make use       1876
    make decision  1809

Frequency vs. association
- A bigram might be frequent simply because its component words are frequent ("of the")
- We need to take the frequency of the parts into account
- A large number of association measures (AMs) provide scores based on a comparison of the observed frequency of a bigram and its expected frequency under the assumption that the parts are independent

Pointwise Mutual Information (PMI)
- The oldest and most used AM in computational linguistics
- K. Church & P. Hanks. Word association norms, mutual information, and lexicography. ACL 1989, 76-83.
- The formula:

    PMI(w1, w2) = log2 [ P(w1, w2) / (P(w1) P(w2)) ]

Pointwise Mutual Information: interpretations
- In information theory, PMI quantifies the extra information (in bits) about the possible occurrence of w2 when we know that the first word is w1
- (Logarithm of the) ratio of the empirically estimated probability of the bigram to its theoretical probability under independence (the product of the empirical probabilities of the unigrams)
- (Logarithm of the) ratio of P(w2 | w1) (the probability of seeing the second word given that we saw the first word) to P(w2) (the probability of the second word independently of context)
- To see this, recall that:

    P(w2 | w1) = P(w1, w2) / P(w1)
Computing PMI
- We need to take the logarithm of:

    P(w1, w2) / (P(w1) P(w2))

- Apply the usual maximum likelihood estimates (C() is a counting function; different strategies exist for what counts as a w1, w2 co-occurrence):

    P(w1, w2) / (P(w1) P(w2)) = (C(w1, w2)/N) / ((C(w1)/N) (C(w2)/N)) = C(w1, w2) N / (C(w1) C(w2))

- Given that log(AB) = log A + log B and log(A/B) = log A - log B, we derive:

    PMI(w1, w2) = log2 C(w1, w2) + log2 N - log2 C(w1) - log2 C(w2)

What is N?
- Depending on the task, N (the sample size) might be interpreted as the number of tokens in the whole corpus, or as the number of items in the bigram list, e.g., the number of V+N pairs extracted with the expression above (in this case, unigram frequencies should also be taken from the list rather than from the whole corpus)
- Often, we are only interested in ranking a list of pairs, in which case N does not matter, being constant for all pairs
- Keep in mind, however, that for AMs that have a statistical interpretation, changing N changes the absolute value of the score, leading to a different p-value

The problem with PMI
- Random selection from the 734 V+N pairs with the highest PMI in the BNC:

    V             N          C(VN)  C(V)  C(N)  PMI
    Asalam        alekum     1      1     1     6.4719
    Astynax       mexicana   1      1     1     6.4719
    cholyglycine  hydrolase  1      1     1     6.4719
    choose{       gth        1      1     1     6.4719
    christopher   Columbus   1      1     1     6.4719
    ek            badmash    1      1     1     6.4719
    elk           n a        1      1     1     6.4719
    perswade      yong       1      1     1     6.4719
    royall        maiesty    1      1     1     6.4719
    sont          superbe    1      1     1     6.4719
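As a quick sanity check, the count-based PMI formula can be sketched in a few lines of Python (the counts below are invented toy values, not BNC data):

```python
import math

def pmi(c12, c1, c2, n):
    """Base-2 PMI from raw counts.

    c12 : co-occurrence count C(w1, w2)
    c1, c2 : unigram counts C(w1), C(w2)
    n : sample size N (corpus tokens, or size of the pair list)
    """
    # PMI = log2 C(w1,w2) + log2 N - log2 C(w1) - log2 C(w2)
    return math.log2(c12) + math.log2(n) - math.log2(c1) - math.log2(c2)

# A frequent pair whose observed count matches independence scores ~0:
print(pmi(100, 10_000, 10_000, 1_000_000))  # ~0.0
# A hapax pair (all counts 1) gets the theoretical maximum, log2(N):
print(pmi(1, 1, 1, 1_000_000))              # ~19.93
```

Note that the hapax case already hints at the low-frequency problem: the less evidence we have, the higher the score can go.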
The problem with PMI
- Serious over-estimation for low-frequency events
- Consider the core of the PMI formula:

    C(w1, w2) / (C(w1) C(w2))

- This increases with the numerator C(w1, w2) and decreases with the denominator C(w1) C(w2)
- However, these are not independent quantities: C(w1, w2) can at most be equal to C(w1) and C(w2)
- In this "best case" scenario, C(w1, w2) = C(w1) = C(w2) = f, and the core formula becomes f/f²
- Since f does not grow as fast as f², PMI decreases as f becomes larger:

    f     f/f²
    1     1
    2     0.5
    3     0.33
    10    0.1
    100   0.01
    1000  0.001

- Counterintuitively, the highest possible theoretical PMI goes to words that occur only once, and that one time they occur together

The problem with PMI: an empirical solution
- Pick a minimum frequency cut-off
- V+N pairs with the highest PMI in the BNC, minimum frequency = 100:

    V       N            C(VN)  C(V)   C(N)   PMI
    grind   pepper       112    285    206    3.7524
    beg     pardon       299    682    452    3.4587
    thank   goodness     224    1048   288    3.3424
    grit    tooth        164    181    1411   3.2796
    sow     seed         110    252    735    3.2456
    purse   lip          155    165    1996   3.1446
    list    engagement   176    929    418    3.1282
    bridge  gap          177    316    1469   3.0532
    shrug   shoulder     255    374    1852   3.0379
    resist  temptation   221    1810   351    3.0133

- Frequency thresholds often produce excellent results; however, they are arbitrary, depend on corpus size, and might cause the loss of important information...
- A more principled (but not necessarily empirically better) approach is to use an AM that takes absolute observed frequency into account
- E.g., the log-likelihood ratio: Ted Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19(1): 61-74 (1993)
- This and related measures have at their core a formula weighting absolute observed frequency by PMI (or something similar):

    C(w1, w2) log [ P(w1, w2) / (P(w1) P(w2)) ] = C(w1, w2) log [ C(w1, w2) N / (C(w1) C(w2)) ]

- This formula by itself is also known as local MI (Evert, 2005)
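The contrast between PMI and local MI can be sketched in Python with two invented count profiles (hypothetical values, chosen only to illustrate the two behaviours):

```python
import math

def pmi(c12, c1, c2, n):
    """Base-2 PMI from raw counts."""
    return math.log2(c12 * n / (c1 * c2))

def local_mi(c12, c1, c2, n):
    # local MI = observed frequency times PMI
    return c12 * pmi(c12, c1, c2, n)

N = 100_000  # hypothetical sample size
rare = (2, 2, 2, N)              # pair seen twice, components seen only twice
frequent = (500, 5000, 5000, N)  # frequent pair with frequent components

print(pmi(*rare) > pmi(*frequent))            # True: PMI prefers the rare pair
print(local_mi(*rare) < local_mi(*frequent))  # True: local MI prefers the frequent one
```

Multiplying by the observed count is exactly what pushes hapax pairs down the ranking without an arbitrary frequency cut-off.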
Weighting with absolute observed frequency
- f vs. f · log2(f · 100000 / f²):

    f     f·log2(f·100000/f²)
    1     16.6
    2     31.2
    3     45.0
    10    132.9
    100   996.6
    1000  6643.9

Top 10 BNC V+N pairs by log-likelihood ratio

    V       N         C(VN)  C(V)   C(N)   LLR
    take    place     9827   83076  16288  1.3330
    shake   head      3758   5564   12357  2.2096
    play    role      2184   10825  6442   1.9677
    answer  question  1594   3130   9077   2.2209
    play    part      2320   10825  13625  1.6686
    open    door      1773   9023   6480   1.9537
    solve   problem   1349   2135   11584  2.2087
    make    sense     2436   83867  5545   1.1911
    see     chapter   1553   63873  2050   1.5460
    give    rise      1508   39235  2939   1.5884

My practical advice
- LLR and similar measures (consider using local MI) often behave quite similarly to raw frequency
- In my experience, there are two macro-types of AMs: PMI-like and frequency-like
- Typically, it is a good idea to use both in order to harvest different kinds of co-occurrences: e.g., PMI-like for idioms, frequency/LLR-like for collocations

A note about the base of the logarithm
- In PMI and other information-theoretic AMs, the logarithm is base 2: it quantifies the number of bits needed to encode the information
- In the log-likelihood ratio and other measures from probability theory and heuristic approaches, we use the natural (base e) logarithm: an exponential function is often involved in the derivation of the measure... and it results in lower absolute values, which is handy
- If you are only interested in rank, the base does not matter
- If, however, you perform other mathematical operations on the resulting values, the difference in base will typically matter!
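The point about the base of the logarithm is easy to verify directly: since log_b(x) = ln(x)/ln(b), any two bases produce scores that differ only by a constant positive factor, so rankings are preserved. A minimal demonstration with made-up score ratios:

```python
import math

scores = [0.5, 3.0, 7.5, 120.0]  # invented positive association ratios

# Ranking is identical whatever the base:
by_log2 = sorted(range(len(scores)), key=lambda i: math.log2(scores[i]))
by_ln = sorted(range(len(scores)), key=lambda i: math.log(scores[i]))
print(by_log2 == by_ln)  # True

# But absolute values differ by the constant factor ln(2):
print(math.log(7.5) / math.log2(7.5))  # ~0.693
```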
Looking for keywords (corpus comparison)
- The use of AMs can be naturally extended to look for the typical words of a target text with respect to a general one (e.g., technical terms in a specialized corpus, compared to a general corpus), or of two specific texts with respect to each other (e.g., female vs. male text)
- Illustrated here with PMI, but any AM should do
- Recall the conditional form of PMI:

    P(w2 | w1) / P(w2) = P(w1, w2) / (P(w1) P(w2))

- More generally, replacing words with "events":

    P(A | B) / P(A) = P(A, B) / (P(A) P(B))

- Now, suppose that A is the event that the word we picked from either corpus (specialized or general) is "peptic", and B is the event that the word was extracted from the specialized corpus. Then:

    P(w = peptic | corpus(w) = spec) / P(w = peptic) = P(w = peptic, corpus(w) = spec) / (P(w = peptic) P(corpus(w) = spec))

- (Nothing changes if the comparison is not specialized/general but specialized1/specialized2: you will simply be interested in both the highest and the lowest PMI values)

Probability estimates

    P(w = peptic) = C(peptic) / (Nspec + Ngen)
    P(corpus(w) = spec) = Nspec / (Nspec + Ngen)
    P(w = peptic, corpus(w) = spec) = C(corpus(peptic) = spec) / (Nspec + Ngen)

where C(corpus(peptic) = spec) is the count of "peptic" in the specialized corpus, and Nspec and Ngen are the sizes of the specialized and general corpora.
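These estimates fit in a few lines of Python. The words and counts below are invented toy frequencies for illustration, not real corpus data:

```python
import math
from collections import Counter

def keyword_pmi(word, spec, gen):
    """Base-2 PMI of a word with the specialized corpus.

    spec, gen: Counters mapping words to frequencies in the
    specialized and general corpus, respectively.
    """
    n_spec, n_gen = sum(spec.values()), sum(gen.values())
    n = n_spec + n_gen
    p_joint = spec[word] / n                # P(w = word, corpus(w) = spec)
    p_word = (spec[word] + gen[word]) / n   # P(w = word)
    p_spec = n_spec / n                     # P(corpus(w) = spec)
    return math.log2(p_joint / (p_word * p_spec))

spec = Counter({"peptic": 10, "ulcer": 8, "the": 50})
gen = Counter({"peptic": 1, "dog": 7, "the": 60})

print(keyword_pmi("peptic", spec, gen) > 0)  # True: typical of the specialized corpus
print(keyword_pmi("the", spec, gen) < 0)     # True: under-represented there
```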
Some current research topics
- Benchmarks for AM evaluation, particularly for MWE extraction/ranking (see the MWE workshop series)
- Principled ways to pick the right AM (and how different the different AMs really are)
- Different ways to measure association for different types of co-occurrences: if we are interested in idioms, we do not want to extract "pupil" as a collocate of "eye"; if we are interested in word meaning, we do not want to extract "apple" as a collocate of "eye"
- The degree of fixedness of a co-occurrence might be exploited to distinguish lexical vs. semantic attraction (cf. "eyes have pupils" vs. *"eyes have apples")

http://www.collocations.de/
- Stefan Evert's site on association measures, including a catalogue of association measures with explanations, a reference list, and the UCS software to compute them