Measuring association between words (and other linguistic units)



Marco Baroni
Text Processing

Outline
- Introduction
- Measuring association
- Pointwise Mutual Information and other AMs
- Association measures for keyword extraction
- Conclusion

The basic problem
- Most information extracted from text corpora comes in the form of co-occurrence counts
- Co-occurrence of a word with another, co-occurrence of a word with a POS, co-occurrence of a POS sequence with a syntactic structure, occurrence of a word in a certain type of text, etc.
- We want to distinguish interesting co-occurrences from those that are due to chance
- Compare "of the" and "lame duck": the first pair has a higher co-occurrence frequency, but the second is probably more interesting

A catalogue of interesting co-occurrences: multi-word expressions
- Idioms (frozen figurative expressions): to shoot the breeze, lame duck
- Collocations (arbitrary choice of adjective or verb to express a certain meaning in relation to a noun): deliver a speech, take a shower, black coffee
- Lexical bundles (very frequent sequences that behave as a single function word): as well as, in order to
- Compounds (more or less lexicalized): book cover, Spider Man
- Named entities: New York, League of Nations...

A catalogue of interesting co-occurrences: other co-occurrences
- Semantic relatives: car-wheel, murder-victim, dog-kennel
  - Important in word sense disambiguation, relation extraction, as dimensions in vectorial representation of word meaning
- Beyond word sequences:
  - Words and morphological or syntactic structures: e.g., tendency of verbs to occur/not occur in passive constructions
  - Words and corpora: keyword extraction: what are the most typical words of corpus X compared to corpus Y?
  - Words in aligned parallel texts
  - POS strings and syntactic constructions...

Frequency
- Just looking for frequently recurring bigrams is typically not that interesting
- Ten most frequent BNC bigrams: of the, in the, to the, on the, and the, to be, for the, at the, that the, by the

POS frames
- Given a tagged corpus, one can zero in on interesting syntactic configurations
- E.g.: [pos="vv.*"][pos="av.*"]?[pos="d.*" | pos="at0"]{0,2} [pos="av.*"]?[pos="aj.*"]*[pos="nn.*"];
- Need to specify what are interesting syntactic configurations
- POS tagging not always available or feasible
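As a rough illustration of the raw-frequency approach described above, the sketch below counts adjacent word pairs in a token list and lists the most frequent ones. The toy sentence and the helper name `bigram_counts` are assumptions made for this example; they are not from the slides or from the BNC.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent token pairs (bigrams) in a list of tokens."""
    return Counter(zip(tokens, tokens[1:]))


# Toy text standing in for a corpus (hypothetical data).
tokens = "the cat sat on the mat and the cat slept on the mat".split()
counts = bigram_counts(tokens)

# Most frequent bigrams first; on real corpora these tend to be
# function-word pairs such as "of the".
for pair, freq in counts.most_common(3):
    print(" ".join(pair), freq)
```

Even on this tiny input the top pairs all involve "the", which is the pattern the BNC list shows at scale.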

Most frequent V+N in BNC

| V+N | Frequency |
|---|---|
| take place | 9827 |
| shake head | 3758 |
| take part | 2761 |
| make sense | 2436 |
| play part | 2320 |
| take time | 2210 |
| play role | 2184 |
| find way | 1958 |
| make use | 1876 |
| make decision | 1809 |

Frequency vs. association
- A bigram might be frequent simply because its component words are frequent ("of the")
- We need to take frequency of parts into account
- A large number of association measures (AMs) provide scores based on comparison of observed frequency of bigram and expected frequency under assumption that parts are independent

Pointwise Mutual Information (PMI)
- Oldest and most used AM in computational linguistics
- K. Church & P. Hanks. Word association norms, mutual information, and lexicography. ACL 1989
- The formula:

$$\mathrm{PMI}(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}$$

- In information theory, PMI quantifies extra-information (in bits) about possible occurrence of $w_2$ when we know that first word is $w_1$

Pointwise Mutual Information: interpretations

$$\mathrm{PMI}(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}$$

- (Logarithm of) ratio of empirically estimated probability of bigram and theoretical probability under independence (product of empirical probabilities of unigrams)
- (Logarithm of) ratio of $P(w_2 \mid w_1)$ (probability of seeing second word if we saw first word) to $P(w_2)$ (probability of second word independently of context)
- To see this, recall that:

$$P(w_2 \mid w_1) = \frac{P(w_1, w_2)}{P(w_1)}$$
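The two readings of PMI given above (joint probability against independence, and conditional probability against the unconditional one) can be checked numerically. This is a minimal sketch; the function names and the probability values are made up for illustration.

```python
import math


def pmi_joint(p12, p1, p2):
    """PMI as log2 of observed joint probability over the independence estimate."""
    return math.log2(p12 / (p1 * p2))


def pmi_conditional(p12, p1, p2):
    """PMI as log2 of P(w2 | w1) over P(w2), using P(w2 | w1) = P(w1, w2) / P(w1)."""
    p2_given_1 = p12 / p1
    return math.log2(p2_given_1 / p2)


# Hypothetical probabilities: the pair occurs twice as often as independence predicts.
p1, p2, p12 = 1 / 8, 1 / 4, 1 / 16
print(pmi_joint(p12, p1, p2))        # both lines print the same value
print(pmi_conditional(p12, p1, p2))
```

When the joint probability equals the product of the unigram probabilities, both functions return 0, i.e. knowing the first word tells us nothing about the second.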

Computing PMI
- We need to take logarithm of:

$$\frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}$$

- Apply usual maximum likelihood estimates ($C()$ is a counting function; different strategies for what counts as a $w_1, w_2$ co-occurrence):

$$\frac{P(w_1, w_2)}{P(w_1)\,P(w_2)} = \frac{\dfrac{C(w_1, w_2)}{N}}{\dfrac{C(w_1)}{N} \cdot \dfrac{C(w_2)}{N}} = \frac{C(w_1, w_2)}{N} \cdot \frac{N^2}{C(w_1)\,C(w_2)} = \frac{C(w_1, w_2)\,N}{C(w_1)\,C(w_2)}$$

Computing PMI
- Given:

$$\frac{C(w_1, w_2)\,N}{C(w_1)\,C(w_2)} \qquad \log AB = \log A + \log B \qquad \log \frac{A}{B} = \log A - \log B$$

- we derive

$$\mathrm{PMI}(w_1, w_2) = \log_2 C(w_1, w_2) + \log_2 N - \log_2 C(w_1) - \log_2 C(w_2)$$

What is N?
- Depending on the task, N (the sample size) might be interpreted as the number of tokens in the whole corpus, or as the number of items in the bigram list, e.g., number of V+N pairs extracted with expression above (in this case, unigram frequencies should also be taken from list rather than from whole corpus)
- Often, we are only interested in ranking a list of pairs, in which case N will not matter, being constant for all pairs:

$$\frac{C(w_1, w_2)\,N}{C(w_1)\,C(w_2)}$$

- Keep in mind, however, that for AMs that have statistical interpretation changing N will change absolute value of score, leading to different p-value

The problem with PMI
- Random selection from 734 V+N pairs with highest PMI in BNC (columns: V, N, C(VN), C(V), C(N), PMI; numeric values not preserved):
  - Asalam alekum
  - Astynax mexicana
  - cholyglycine hydrolase
  - choose{ gth
  - christopher Columbus
  - ek badmash
  - elk n a
  - perswade yong
  - royall maiesty
  - sont superbe
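The count-based form of PMI derived above, and the remark that N does not affect ranking, can be sketched as follows. The counts in `toy_counts` are invented for illustration and are not BNC figures; `pmi_from_counts` and `rank_by_pmi` are names chosen for this example.

```python
import math


def pmi_from_counts(c12, c1, c2, n):
    """PMI = log2 C(w1,w2) + log2 N - log2 C(w1) - log2 C(w2)."""
    return math.log2(c12) + math.log2(n) - math.log2(c1) - math.log2(c2)


def rank_by_pmi(counts, n):
    """Sort pairs by decreasing PMI; counts maps pair -> (C(w1,w2), C(w1), C(w2))."""
    return sorted(counts, key=lambda pair: pmi_from_counts(*counts[pair], n), reverse=True)


# Hypothetical counts, chosen only to show the mechanics.
toy_counts = {
    ("of", "the"): (1000, 30000, 60000),
    ("lame", "duck"): (20, 50, 100),
    ("take", "place"): (500, 5000, 4000),
}

# Changing N shifts every score by the same constant, so the order is unchanged.
print(rank_by_pmi(toy_counts, 1_000_000))
print(rank_by_pmi(toy_counts, 100_000_000))
```

The absolute scores do differ between the two values of N (by log2 of their ratio), which is why N still matters for measures with a statistical interpretation.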

The problem with PMI (continued)
- Serious over-estimation for low-frequency events
- Consider the core of PMI formula:

$$\frac{C(w_1, w_2)}{C(w_1)\,C(w_2)}$$

- This will increase with numerator ($C(w_1, w_2)$) and decrease with denominator ($C(w_1)\,C(w_2)$)
- However, these are not independent quantities: $C(w_1, w_2)$ can be at most equal to the smaller of $C(w_1)$ and $C(w_2)$
- In this "best case scenario", $C(w_1, w_2) = C(w_1) = C(w_2) = f$, the core formula becomes:

$$\frac{f}{f^2}$$

- Since $f$ does not grow as fast as $f^2$, PMI will decrease as $f$ becomes larger
- Counterintuitively, highest possible theoretical PMI for words that occur once, and that time they occur together

[Plot: $f/f^2$ as a function of $f$]

- Empirical solution: pick a minimum frequency cut-off
- V+N pairs with highest PMI in BNC, minimum frequency = 100 (columns: V, N, C(VN), C(V), C(N), PMI; numeric values not preserved):
  - grind pepper
  - beg pardon
  - thank goodness
  - grit tooth
  - sow seed
  - purse lip
  - list engagement
  - bridge gap
  - shrug shoulder
  - resist temptation
- Frequency thresholds often produce excellent results, however they are arbitrary, depend on corpus size, might cause loss of important information...
- More principled (but not necessarily empirically better) approach is to use AM that takes absolute observed frequency into account
- E.g., log-likelihood ratio: Ted Dunning, Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics 19(1), 1993
- This and related measures have at their core a formula weighting absolute observed frequency by PMI (or similar):

$$C(w_1, w_2) \log \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)} = C(w_1, w_2) \log \frac{C(w_1, w_2)\,N}{C(w_1)\,C(w_2)}$$

- This formula by itself is also known as local MI (Evert, 2005)
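A small numerical sketch of the best-case argument and of the two remedies discussed above (a frequency cut-off, and weighting PMI by observed frequency). The sample size `N`, the helper names, and the choice of base 2 for local MI are assumptions made for this example; the slides leave the base of the logarithm in the local MI formula unspecified.

```python
import math


def pmi(c12, c1, c2, n):
    """PMI in bits from raw counts: log2(C(w1,w2) * N / (C(w1) * C(w2)))."""
    return math.log2(c12 * n / (c1 * c2))


def local_mi(c12, c1, c2, n):
    """Observed co-occurrence frequency weighted by PMI (base 2 assumed here)."""
    return c12 * pmi(c12, c1, c2, n)


def best_case(f, n):
    """Both scores when C(w1,w2) = C(w1) = C(w2) = f."""
    return pmi(f, f, f, n), local_mi(f, f, f, n)


def apply_cutoff(counts, min_freq):
    """Keep only pairs whose co-occurrence count C(w1,w2) reaches min_freq."""
    return {pair: c for pair, c in counts.items() if c[0] >= min_freq}


N = 1_000_000  # hypothetical sample size
for f in (1, 10, 100, 1000):
    p, l = best_case(f, N)
    print(f"f={f:5d}  PMI={p:6.2f}  local MI={l:10.2f}")
```

In the best case PMI falls as f grows, so pairs seen once score highest, whereas local MI grows with f, which is the behaviour the frequency weighting is meant to provide.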

Weighting with absolute observed frequency

[Plot: $f \log_2(f/f^2)$ as a function of $f$]

Top 10 BNC V+N pairs by Log-Likelihood Ratio
- Columns: V, N, C(VN), C(V), C(N), LLR; numeric values not preserved:
  - take place
  - shake head
  - play role
  - answer question
  - play part
  - open door
  - solve problem
  - make sense
  - see chapter
  - give rise

My practical advice
- LLR and similar measures (consider using local MI) are often quite similar to raw frequency
- In my experience, there are two macro-types of AMs: PMI-like and frequency-like
- Typically, it is a good idea to use both in order to harvest different kinds of co-occurrences
- E.g., PMI-like for idioms, frequency/LLR-like for collocations

A note about the base of the logarithm
- In PMI and other information-theoretic AMs, logarithm is base 2
  - Quantifies number of bits needed to encode information
- In log-likelihood ratio and other measures from probability theory and heuristic approaches, we use natural (base e) logarithm
  - Often an exponential function is involved in the derivation, and the natural logarithm results in lower absolute values, which is handy
- If you are only interested in rank, it does not matter
- If however you perform other mathematical operations on the resulting values, difference in base will typically matter!
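The point about the base of the logarithm can be illustrated with a short sketch: switching between base 2 and base e rescales every score by the same factor, so ranks are preserved, but any operation that depends on absolute values (here, a fixed numeric threshold) can give different results. The counts and the threshold are hypothetical.

```python
import math


def pmi(c12, c1, c2, n, base=2.0):
    """PMI from raw counts with a configurable logarithm base."""
    return math.log(c12 * n / (c1 * c2), base)


N = 1_000_000  # hypothetical sample size
counts = {  # pair -> (C(w1,w2), C(w1), C(w2)), invented for illustration
    ("lame", "duck"): (20, 50, 100),
    ("take", "place"): (500, 5000, 4000),
    ("of", "the"): (1000, 30000, 60000),
}


def ranking(base):
    """Pairs sorted by decreasing PMI under the given base."""
    return sorted(counts, key=lambda p: pmi(*counts[p], N, base), reverse=True)


def above(threshold, base):
    """Pairs whose PMI under the given base exceeds a fixed threshold."""
    return {p for p in counts if pmi(*counts[p], N, base) > threshold}


print(ranking(2.0) == ranking(math.e))   # same order in both bases
print(above(4.0, 2.0))                   # pairs selected with base 2
print(above(4.0, math.e))                # pairs selected with base e
```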

Looking for keywords (corpus comparison)
- Use of AMs can be naturally extended to look for typical words of a target text with respect to a general one (e.g., technical terms in specialized corpus, compared to general corpus) or two specific texts with respect to each other (e.g., female vs. male text)
- Illustrated here with PMI, but any AM should do
- Recall conditional form of PMI:

$$\frac{P(w_2 \mid w_1)}{P(w_2)} = \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}$$

- More generally, replacing words with "events":

$$\frac{P(A \mid B)}{P(A)} = \frac{P(A, B)}{P(A)\,P(B)}$$

Looking for keywords (corpus comparison)
- Now, suppose that:
  - A is event that word we picked from either corpus (specialized or general) is "peptic"
  - B is event that word was extracted from specialized corpus
- Then:

$$\frac{P(w = \text{peptic} \mid \mathrm{corpus}(w) = \text{spec})}{P(w = \text{peptic})} = \frac{P(w = \text{peptic}, \mathrm{corpus}(w) = \text{spec})}{P(w = \text{peptic})\,P(\mathrm{corpus}(w) = \text{spec})}$$

- (Nothing changes if comparison is not specialized/general, but specialized$_1$/specialized$_2$: you will simply be interested in both highest and lowest PMI values)

Looking for keywords (corpus comparison): probability estimates

$$P(w = \text{peptic}) = \frac{C(\text{peptic})}{N_{spec} + N_{gen}}$$

$$P(\mathrm{corpus}(w) = \text{spec}) = \frac{N_{spec}}{N_{spec} + N_{gen}}$$

$$P(w = \text{peptic}, \mathrm{corpus}(w) = \text{spec}) = \frac{C(\mathrm{corpus}(\text{peptic}) = \text{spec})}{N_{spec} + N_{gen}}$$
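The probability estimates above translate directly into a keyword score. The sketch below is one way to write it, assuming the word occurs at least once in the specialized corpus (otherwise the logarithm is undefined); the function name `keyword_pmi` and all counts are hypothetical.

```python
import math


def keyword_pmi(c_spec, c_gen, n_spec, n_gen):
    """PMI between 'the word is w' and 'the word comes from the specialized corpus'.

    c_spec, c_gen: occurrences of w in the specialized and general corpus
    n_spec, n_gen: total token counts of the two corpora
    Assumes c_spec > 0.
    """
    n_total = n_spec + n_gen
    p_word = (c_spec + c_gen) / n_total   # P(w = word), counted over both corpora
    p_spec = n_spec / n_total             # P(corpus(w) = spec)
    p_joint = c_spec / n_total            # P(w = word, corpus(w) = spec)
    return math.log2(p_joint / (p_word * p_spec))


# Hypothetical corpora: the specialized corpus is one eighth of the pooled data.
n_spec, n_gen = 1024, 7 * 1024
print(keyword_pmi(8, 0, n_spec, n_gen))    # word found only in the specialized corpus
print(keyword_pmi(10, 70, n_spec, n_gen))  # word spread in proportion to corpus sizes
```

A word that occurs only in the specialized corpus gets the maximum score for this corpus-size ratio, while a word distributed in proportion to the corpus sizes scores 0.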

Some current research topics
- Benchmarks for AM evaluation, particularly for MWE extraction/ranking (see the MWE workshop series)
- Principled ways to pick the right AM (and how different different AMs really are)
- Different ways to measure association for different types of co-occurrences
  - If we are interested in idioms, we do not want to extract "pupil" as a collocate of "eye"; if we are interested in word meaning, we do not want to extract "apple" as a collocate of "eye"
  - Degree of fixedness of co-occurrence might be exploited to distinguish lexical vs. semantic attraction
  - Cf. "eyes have pupils" vs. *"eyes have apples"

Stefan Evert's site on association measures
- Including a catalogue of association measures with explanations, a reference list and the UCS software to compute them


Methods for the Qualitative Evaluation of Lexical Association Measures Stefan Evert IMS, University of Stuttgart Azenbergstr. 12 D-70174 Stuttgart, Germany evert@ims.uni-stuttgart.de Brigitte Krenn Austrian