
Measuring association between words (and other linguistic units)
Marco Baroni
Text Processing

Outline
- Introduction
- Measuring association: Pointwise Mutual Information and other AMs; association measures for keyword extraction
- Conclusion

The basic problem
- Most information extracted from text corpora comes in the form of co-occurrence counts: co-occurrence of a word with another word, co-occurrence of a word with a POS, co-occurrence of a POS sequence with a syntactic structure, occurrence of a word in a certain type of text, etc.
- We want to distinguish interesting co-occurrences from those that are due to chance.
- Compare "of the" and "lame duck": the first pair has a higher co-occurrence frequency, but the second is probably more interesting.

A catalogue of interesting co-occurrences: multi-word expressions
- Idioms (frozen figurative expressions): to shoot the breeze, lame duck
- Collocations (arbitrary choice of adjective or verb to express a certain meaning in relation to a noun): deliver a speech, take a shower, black coffee
- Lexical bundles (very frequent sequences that behave as a single function word): as well as, in order to
- Compounds (more or less lexicalized): book cover, Spider Man
- Named entities: New York, League of Nations

A catalogue of interesting co-occurrences: other co-occurrences
- Semantic relatives: car-wheel, murder-victim, dog-kennel. Important in word sense disambiguation, in relation extraction, and as dimensions in vectorial representations of word meaning.
- Beyond word sequences:
  - Words and morphological or syntactic structures: e.g., the tendency of verbs to occur or not occur in passive constructions
  - Words and corpora (keyword extraction): what are the most typical words of corpus X compared to corpus Y?
  - Words in aligned parallel texts
  - POS strings and syntactic constructions

Frequency
- Just looking for frequently recurring bigrams is typically not that interesting (see the counting sketch after this slide). The ten most frequent BNC bigrams:

  of the    753180
  in the    480169
  to the    286955
  on the    207550
  and the   188216
  to be     187853
  for the   159579
  at the    138045
  that the  127180
  by the    125360

POS frames
- Given a tagged corpus, one can zero in on interesting syntactic configurations, e.g. (CQP query syntax):

  [pos="vv.*"] [pos="av.*"]? [pos="d.*" | pos="at0"]{0,2} [pos="av.*"]? [pos="aj.*"]* [pos="nn.*"];

- One needs to specify which syntactic configurations are interesting.
- POS tagging is not always available or feasible.
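As a minimal sketch (not from the original slides), plain Python with collections.Counter is enough to produce a frequency-ranked bigram list like the BNC one in the Frequency slide above; the toy token list and the helper name bigram_counts are illustrative assumptions.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent-word bigrams in a token list."""
    return Counter(zip(tokens, tokens[1:]))

tokens = "the cat sat on the mat and the dog sat on the rug".split()
# Print the three most frequent bigrams with their counts.
for (w1, w2), freq in bigram_counts(tokens).most_common(3):
    print(w1, w2, freq)
```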

Frequency vs. association
The most frequent V+N pairs in the BNC:

  take place     9827
  shake head     3758
  take part      2761
  make sense     2436
  play part      2320
  take time      2210
  play role      2184
  find way       1958
  make use       1876
  make decision  1809

- A bigram might be frequent simply because its component words are frequent (of the).
- We need to take the frequency of the parts into account.
- A large number of association measures (AMs) provide scores based on a comparison between the observed frequency of a bigram and its expected frequency under the assumption that its parts are independent.

Pointwise Mutual Information (PMI)
- The oldest and most used AM in computational linguistics: K. Church & P. Hanks. Word association norms, mutual information, and lexicography. ACL 1989, 76-83.
- The formula:

  PMI(w1, w2) = log2 [ P(w1, w2) / (P(w1) P(w2)) ]

Pointwise Mutual Information: interpretations
- In information theory, PMI quantifies the extra information (in bits) about the possible occurrence of w2 that we gain when we know that the first word is w1.
- It is the (logarithm of the) ratio between the empirically estimated probability of the bigram and its theoretical probability under independence (the product of the empirical probabilities of the unigrams).
- It is also the (logarithm of the) ratio of P(w2|w1) (the probability of seeing the second word given that we saw the first) to P(w2) (the probability of the second word independently of context). To see this, recall that P(w2|w1) = P(w1, w2) / P(w1).
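A minimal sketch of the formula, assuming maximum-likelihood estimates over a toy token list (the code and the helper name pmi are not from the slides):

```python
import math
from collections import Counter

def pmi(w1, w2, bigrams, unigrams, n):
    """PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) ),
    with all probabilities estimated as counts over the sample size n."""
    p_joint = bigrams[(w1, w2)] / n
    return math.log2(p_joint / ((unigrams[w1] / n) * (unigrams[w2] / n)))

tokens = "the cat sat on the mat and the cat sat down".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
print(pmi("cat", "sat", bigrams, unigrams, len(tokens)))  # ~2.58 bits
```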

Computing PMI
- We need to take the logarithm of P(w1, w2) / (P(w1) P(w2)).
- Applying the usual maximum likelihood estimates (C() is a counting function; there are different strategies for what counts as a w1, w2 co-occurrence):

  P(w1, w2) / (P(w1) P(w2)) = (C(w1, w2) / N) / ((C(w1) / N) (C(w2) / N)) = C(w1, w2) N / (C(w1) C(w2))

- Given that log(AB) = log(A) + log(B) and log(A/B) = log(A) - log(B), we derive:

  PMI(w1, w2) = log2 C(w1, w2) + log2 N - log2 C(w1) - log2 C(w2)

What is N?
- Depending on the task, N (the sample size) might be interpreted as the number of tokens in the whole corpus, or as the number of items in the bigram list, e.g., the number of V+N pairs extracted with the expression above (in this case, unigram frequencies should also be taken from the list rather than from the whole corpus).
- Often we are only interested in ranking a list of pairs, in which case N does not matter, being constant for all pairs: it suffices to rank by C(w1, w2) / (C(w1) C(w2)).
- Keep in mind, however, that for AMs that have a statistical interpretation, changing N will change the absolute value of the score, leading to different p-values.

The problem with PMI
A random selection from the 734 V+N pairs with the highest PMI in the BNC:

  V             N          C(VN)  C(V)  C(N)  PMI
  Asalam        alekum     1      1     1     6.4719
  Astynax       mexicana   1      1     1     6.4719
  cholyglycine  hydrolase  1      1     1     6.4719
  choose{       gth        1      1     1     6.4719
  christopher   Columbus   1      1     1     6.4719
  ek            badmash    1      1     1     6.4719
  elk           n a        1      1     1     6.4719
  perswade      yong       1      1     1     6.4719
  royall        maiesty    1      1     1     6.4719
  sont          superbe    1      1     1     6.4719
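A sketch of the count form of the derivation above, with made-up counts and a hypothetical sample size; it reproduces the pathology illustrated by the table: a pair of hapaxes that always co-occur hits the theoretical maximum, log2(N).

```python
import math

def pmi_from_counts(c12, c1, c2, n):
    """Count form of PMI:
    log2 C(w1,w2) + log2 N - log2 C(w1) - log2 C(w2)."""
    return math.log2(c12) + math.log2(n) - math.log2(c1) - math.log2(c2)

# A (1, 1, 1) hapax pair reaches the maximum log2(N) ...
print(pmi_from_counts(1, 1, 1, 100_000))        # ~16.61
# ... while a genuinely frequent pair scores much lower.
print(pmi_from_counts(100, 200, 300, 100_000))  # ~7.38
```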

The problem with PMI
- Serious over-estimation of association for low-frequency events.
- Consider the core of the PMI formula: C(w1, w2) N / (C(w1) C(w2)). This increases with the numerator (C(w1, w2)) and decreases with the denominator (C(w1) C(w2)).
- However, these are not independent quantities: C(w1, w2) can at most be equal to C(w1) and C(w2). In this best-case scenario, C(w1, w2) = C(w1) = C(w2) = f, and the core formula becomes f N / f^2.
- Since f does not grow as fast as f^2, PMI decreases as f becomes larger:

  f     f/f^2
  1     1
  2     0.5
  3     0.33
  10    0.1
  100   0.01
  1000  0.001

- Counterintuitively, the highest possible theoretical PMI goes to words that occur only once, and that one time they occur together.
- Empirical solution: pick a minimum frequency cut-off. V+N pairs with the highest PMI in the BNC, with minimum frequency 100:

  V       N           C(VN)  C(V)  C(N)  PMI
  grind   pepper      112    285   206   3.7524
  beg     pardon      299    682   452   3.4587
  thank   goodness    224    1048  288   3.3424
  grit    tooth       164    181   1411  3.2796
  sow     seed        110    252   735   3.2456
  purse   lip         155    165   1996  3.1446
  list    engagement  176    929   418   3.1282
  bridge  gap         177    316   1469  3.0532
  shrug   shoulder    255    374   1852  3.0379
  resist  temptation  221    1810  351   3.0133

- Frequency thresholds often produce excellent results; however, they are arbitrary, they depend on corpus size, and they might cause the loss of important information.
- A more principled (but not necessarily empirically better) approach is to use an AM that takes the absolute observed frequency into account, e.g., the log-likelihood ratio: Ted Dunning. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics 19(1): 61-74 (1993).
- This and related measures have at their core a formula weighting the absolute observed frequency by PMI (or something similar):

  C(w1, w2) log [ P(w1, w2) / (P(w1) P(w2)) ] = C(w1, w2) log [ C(w1, w2) N / (C(w1) C(w2)) ]

- This formula by itself is also known as local MI (Evert, 2005).
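A minimal sketch of local MI with the same count interface as the earlier snippets (sample sizes are made up); the frequency weight keeps hapax pairs from dominating the ranking:

```python
import math

def local_mi(c12, c1, c2, n):
    """Local MI (Evert, 2005): observed frequency weighted by PMI,
    C(w1,w2) * log( C(w1,w2) * N / (C(w1) * C(w2)) )."""
    return c12 * math.log(c12 * n / (c1 * c2))

# A hapax pair no longer wins ...
print(local_mi(1, 1, 1, 100_000))                 # ~11.5
# ... against a well-attested collocation like (take, place).
print(local_mi(9827, 83076, 16288, 100_000_000))  # ~64700
```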

Weighting with absolute observed frequency
In the best-case scenario above, the weighted score now grows with f instead of shrinking:

  f     f log2(f * 100000 / f^2)
  1     16.6
  2     31.2
  3     45.0
  10    132.9
  100   996.6
  1000  6643.9

Top 10 BNC V+N pairs by log-likelihood ratio:

  V       N         C(VN)  C(V)   C(N)   LLR
  take    place     9827   83076  16288  1.3330
  shake   head      3758   5564   12357  2.2096
  play    role      2184   10825  6442   1.9677
  answer  question  1594   3130   9077   2.2209
  play    part      2320   10825  13625  1.6686
  open    door      1773   9023   6480   1.9537
  solve   problem   1349   2135   11584  2.2087
  make    sense     2436   83867  5545   1.1911
  see     chapter   1553   63873  2050   1.5460
  give    rise      1508   39235  2939   1.5884

My practical advice
- LLR and similar measures (consider using local MI) often rank pairs quite similarly to raw frequency.
- In my experience, there are two macro-types of AMs: PMI-like and frequency-like.
- Typically, it is a good idea to use both, in order to harvest different kinds of co-occurrences: e.g., PMI-like measures for idioms, frequency/LLR-like measures for collocations (see the comparison sketch below).

A note about the base of the logarithm
- In PMI and other information-theoretic AMs, the logarithm is base 2: it quantifies the number of bits needed to encode the information.
- In the log-likelihood ratio and other measures from probability theory and heuristic approaches, we use the natural (base e) logarithm: an exponential function is often involved in the derivation of the measure, and the natural base results in lower absolute values, which is handy.
- If you are only interested in ranks, the base does not matter. If, however, you perform other mathematical operations on the resulting values, the difference in base will typically matter!
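A toy comparison of the two macro-types (the counts loosely echo the BNC tables above; the corpus size N and the helper names are assumptions, not from the slides). PMI promotes the rare pair, local MI the frequent collocation:

```python
import math

# (V, N, C(VN), C(V), C(N)) triples with a made-up corpus size N.
pairs = [("take", "place", 9827, 83076, 16288),
         ("grind", "pepper", 112, 285, 206),
         ("sont", "superbe", 1, 1, 1)]
N = 100_000_000

def pmi(c12, c1, c2, n):
    return math.log2(c12 * n / (c1 * c2))

def local_mi(c12, c1, c2, n):
    return c12 * math.log(c12 * n / (c1 * c2))

for measure in (pmi, local_mi):
    ranked = sorted(pairs, key=lambda p: measure(p[2], p[3], p[4], N),
                    reverse=True)
    print(measure.__name__, [f"{v} {n}" for v, n, *_ in ranked])
# pmi ranks "sont superbe" first; local_mi ranks "take place" first.
```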

Looking for keywords (corpus comparison)
- The use of AMs can be naturally extended to looking for the typical words of a target text with respect to a general one (e.g., technical terms in a specialized corpus compared to a general corpus), or of two specific texts with respect to each other (e.g., female vs. male text).
- Illustrated here with PMI, but any AM should do.
- Recall the conditional form of PMI:

  P(w2|w1) / P(w2) = P(w1, w2) / (P(w1) P(w2))

- More generally, replacing words with events:

  P(A|B) / P(A) = P(A, B) / (P(A) P(B))

- Now suppose that A is the event that the word we picked from either corpus (specialized or general) is peptic, and B is the event that the word was extracted from the specialized corpus. Then:

  P(w = peptic | corpus(w) = spec) / P(w = peptic) = P(w = peptic, corpus(w) = spec) / (P(w = peptic) P(corpus(w) = spec))

- (Nothing changes if the comparison is not specialized/general but specialized1/specialized2: you will simply be interested in both the highest and the lowest PMI values.)

Probability estimates
With Nspec and Ngen the sizes of the specialized and general corpus, C(peptic) the count of peptic in both corpora together, and Cspec(peptic) its count in the specialized corpus:

  P(w = peptic) = C(peptic) / (Nspec + Ngen)
  P(corpus(w) = spec) = Nspec / (Nspec + Ngen)
  P(w = peptic, corpus(w) = spec) = Cspec(peptic) / (Nspec + Ngen)
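A minimal sketch of these estimates as code, assuming two toy token lists (the helper name keyword_pmi and the example corpora are illustrative, not from the slides):

```python
import math
from collections import Counter

def keyword_pmi(spec_tokens, gen_tokens):
    """PMI of each word w with the specialized corpus:
    log2( P(w, spec) / (P(w) * P(spec)) ), all probabilities
    estimated by counting over the pooled corpus of size Nspec + Ngen."""
    c_spec = Counter(spec_tokens)
    c_all = c_spec + Counter(gen_tokens)
    n_total = len(spec_tokens) + len(gen_tokens)
    p_spec = len(spec_tokens) / n_total
    return {w: math.log2((c / n_total) / ((c_all[w] / n_total) * p_spec))
            for w, c in c_spec.items()}

spec = "peptic ulcer peptic lesion the ulcer".split()
gen = "the cat sat on the mat the dog".split()
for w, score in sorted(keyword_pmi(spec, gen).items(), key=lambda x: -x[1]):
    print(w, round(score, 3))  # "peptic" ranks highest, "the" lowest
```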

Some current research topics
- Benchmarks for AM evaluation, particularly for MWE extraction/ranking (see the MWE workshop series)
- Principled ways to pick the right AM (and how different the various AMs really are)
- Different ways to measure association for different types of co-occurrences:
  - If we are interested in idioms, we do not want to extract pupil as a collocate of eye; if we are interested in word meaning, we do not want to extract apple as a collocate of eye.
  - The degree of fixedness of a co-occurrence might be exploited to distinguish lexical vs. semantic attraction (cf. eyes have pupils vs. *eyes have apples).

http://www.collocations.de/
Stefan Evert's site on association measures, including a catalogue of association measures with explanations, a reference list, and the UCS software to compute them.