Corpus Linguistics (L415/L615)


Markus Dickinson, Department of Linguistics, Indiana University
Fall 2015
1 / 25

Collocations
Collocations are characteristic co-occurrence patterns of two (or more) lexical items:
1. Firthian definition: combinations of words that co-occur more frequently than by chance. "You shall know a word by the company it keeps" (Firth 1957)
2. Phraseological definition: the meaning tends to be more than the sum of its parts: "a sequence of two or more consecutive words, ... whose exact and unambiguous meaning cannot be derived directly from the meaning or connotation of its components" (Choueka 1988)
2 / 25

Some examples by different definitions:
- Firth + phraseology: couch potato
- Firth only: potato peeler
- Phraseology only: broken record
3 / 25

Collocations are hard to define by intuition: corpora have been able to reveal connections previously unseen. Still, it may not be clear what the theoretical basis of collocations is (Q: how, and where, do they fit into grammar?). The Firthian definition is empirical: we need a test for "co-occur more frequently than by chance", i.e., significance tests or information-theoretic measures.
4 / 25

Colligations
A colligation is a slightly different concept: the collocation of a node word with a particular class of words (e.g., determiners). Colligations often create noise in a list of collocations, e.g., this house, because this is so common on its own and determiners appear before nouns. Thus, people sometimes use stop words to filter out non-collocations.
5 / 25
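As an aside, the stop-word filtering just mentioned is easy to sketch in code. A minimal Python illustration follows; the stop-word set and token list are invented for the example, not from these slides:

    from collections import Counter

    # Hypothetical stop-word list; in practice you would use a fuller one.
    STOP_WORDS = {"the", "a", "an", "this", "that", "of", "in", "to", "on", "is"}

    def candidate_bigrams(tokens):
        """Count bigrams, skipping pairs containing a stop word,
        to suppress colligation noise like 'this house'."""
        return Counter((w1, w2)
                       for w1, w2 in zip(tokens, tokens[1:])
                       if w1 not in STOP_WORDS and w2 not in STOP_WORDS)

    tokens = "this house is near this old house in the city".split()
    print(candidate_bigrams(tokens).most_common(3))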

Semantic prosody & preference
Semantic prosody is "a form of meaning which is established through the proximity of a consistent series of collocates" (Louw 2000). Idea: you can tell the semantic prosody of a word by the types of words it frequently co-occurs with. These are typically negative: e.g., peddle, ripe for, get oneself verbed. This type of co-occurrence often leads to general semantic preferences: e.g., utterly, totally, etc. typically have a feature of absence or change of state.
6 / 25

Towards corpus-based metrics
Collocations are expressions of two or more words that are in some sense conventionalized as a group:
- strong tea (cf. ??powerful tea)
- international best practice
- kick the bucket
Importance of the context: "You shall know a word by the company it keeps" (Firth 1957). There are lexical properties that more general syntactic properties do not capture.
(This slide and the next 3 adapted from Manning and Schütze (1999), Foundations of Statistical Natural Language Processing)
7 / 25

Prototypical collocations
Prototypically, collocations meet the following criteria:
- Non-compositional: the meaning of kick the bucket is not composed of the meaning of its parts
- Non-substitutable: orange hair would be just as accurate as red hair, but speakers don't say it
- Non-modifiable: often we cannot modify a collocation, even though we normally could modify one of its words: ??kick the red bucket
8 / 25

Compositionality tests
The previous properties may be hard to verify with corpus data. (At least) two tests we can use with corpora:
- Is the collocation translated word-by-word into another language? e.g., the collocation make a decision is not translated literally into French
- Do the two words co-occur more frequently together than we would otherwise expect? e.g., of the is frequent, but both words are frequent, so we might expect this
9 / 25

Kinds of collocations
Calculations ideally take into account the variability of collocations:
- Light verbs: verbs that convey very little meaning but must be the right one: make/*take a decision, take/*make a walk
- Phrasal verbs: main verb and particle combinations, often translated as a single word: to tell off, to call up
- Proper nouns: slightly different from the others, but each refers to a single idea (e.g., Brooks Brothers)
- Terminological expressions: technical terms that form a unit (e.g., hydraulic oil filter)
- Syntactically adaptable expressions: bite/biting/bit the dust, take leave of his/her/your senses
- Non-adjacent collocations: faint (stale apricot) smell
10 / 25

Ideas for calculating collocations
We want to tell if two words occur together more than by chance, meaning we should examine:
- Observed frequency of the two words together
- Expected frequency of the two words together: this is often derived from the observed frequencies of the individual words
- Metrics for combining observed & expected frequencies, e.g., t = (observed - expected) / sqrt(observed) (from Gries 2009)
11 / 25
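A minimal sketch of this observed-vs-expected logic, using the t-like score above (after Gries 2009); the counts in the example call are hypothetical, not from any corpus cited here:

    import math

    def t_score(c_w1, c_w2, c_bigram, n):
        """t = (observed - expected) / sqrt(observed), where the expected
        count assumes the two words are independent: c_w1 * c_w2 / n."""
        observed = c_bigram
        expected = c_w1 * c_w2 / n
        return (observed - expected) / math.sqrt(observed)

    # Hypothetical counts for 'strong tea' in a one-million-word corpus:
    print(t_score(c_w1=800, c_w2=1500, c_bigram=60, n=1_000_000))  # ~7.6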

Simplest approach: use frequency counts
Two words appearing together a lot are a collocation. The problem is that we get lots of uninteresting pairs of function words (M&S 1999, table 5.1):

C(w1, w2)   w1   w2
80871       of   the
58841       in   the
26430       to   the
21842       on   the

(Slides 12-24 are based on Manning & Schütze (M&S) 1999)
12 / 25
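Sketched in Python (the corpus filename is a placeholder, and whitespace tokenization is an assumption), raw bigram counting is a few lines, and its top entries will indeed be function-word pairs like those above:

    from collections import Counter

    def bigram_counts(tokens):
        """Raw bigram frequencies: the simplest collocation 'measure'."""
        return Counter(zip(tokens, tokens[1:]))

    with open("corpus.txt") as f:    # placeholder corpus file
        tokens = f.read().lower().split()

    for (w1, w2), c in bigram_counts(tokens).most_common(5):
        print(c, w1, w2)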

POS filtering
To remove frequent pairings which are uninteresting, we can use a POS filter (Justeson and Katz 1995): only examine word sequences which fit a particular part-of-speech pattern:
A N, N N, A A N, A N N, N A N, N N N, N P N

Pattern   Example
A N       linear function
N A N     mean squared error
N P N     degrees of freedom

Crucially, all other sequences are removed:

Pattern   Example
P D       of the
MV V      has been

13 / 25
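A rough sketch of such a filter, assuming NLTK's default tagger (nltk.pos_tag) with Penn Treebank tags mapped onto the A/N/P classes above; the tagger choice and the tag mapping are assumptions, and real output will vary with the tagger:

    import nltk  # requires the averaged_perceptron_tagger data

    PATTERNS = {("A", "N"), ("N", "N"), ("A", "A", "N"), ("A", "N", "N"),
                ("N", "A", "N"), ("N", "N", "N"), ("N", "P", "N")}

    def simplify(tag):
        """Map Penn Treebank tags to the A(djective)/N(oun)/P(reposition)
        classes used in the patterns; everything else becomes X."""
        if tag.startswith("JJ"):
            return "A"
        if tag.startswith("NN"):
            return "N"
        if tag == "IN":
            return "P"
        return "X"

    def pos_filtered_ngrams(tokens, n):
        """Yield only the n-grams whose tag sequence matches a pattern."""
        tagged = nltk.pos_tag(tokens)
        for i in range(len(tagged) - n + 1):
            gram = tagged[i:i + n]
            if tuple(simplify(t) for _, t in gram) in PATTERNS:
                yield tuple(w for w, _ in gram)

    sent = "the linear function gives the degrees of freedom".split()
    print(list(pos_filtered_ngrams(sent, 2)) + list(pos_filtered_ngrams(sent, 3)))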

POS filtering (2)
Some results after tag filtering (M&S 1999, table 5.3):

C(w1, w2)   w1       w2        Tag pattern
11487       New      York      A N
7261        United   States    A N
5412        Los      Angeles   N N
3301        last     year      A N

Fairly simple, but surprisingly effective. Needs to be refined to handle verb-particle collocations, and it is kind of inconvenient to write out the patterns you want.
14 / 25

(Pointwise) Mutual Information
Pointwise mutual information (PMI) compares:
- Observed: the actual probability of the two words appearing together, p(w1 w2)
- Expected: the probability of the two words appearing together if they are independent, p(w1) p(w2)
The pointwise mutual information is a measure to do this:

(1) I(w1, w2) = log [ p(w1 w2) / (p(w1) p(w2)) ]

The higher the value, the more surprising it is.
15 / 25

Pointwise Mutual Information Equation
The probabilities p(w1 w2), p(w1), p(w2) are calculated as:

(2) p(x) = C(x) / N

where N is the number of words in the corpus (the number of bigrams ≈ the number of unigrams).

(3) I(w1, w2) = log [ p(w1 w2) / (p(w1) p(w2)) ]
              = log [ (C(w1 w2)/N) / ((C(w1)/N) (C(w2)/N)) ]
              = log [ N C(w1 w2) / (C(w1) C(w2)) ]

16 / 25
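Equation (3) translates directly into code; this small sketch takes raw counts and returns base-2 PMI:

    import math

    def pmi(c_w1, c_w2, c_bigram, n):
        """Pointwise mutual information from raw counts, as in (3):
        log2( N * C(w1 w2) / (C(w1) * C(w2)) )."""
        return math.log2(n * c_bigram / (c_w1 * c_w2))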

Mutual Information example
We want to know if Ayatollah Ruhollah is a collocation in a data set we have:
C(Ayatollah) = 42
C(Ruhollah) = 20
C(Ayatollah Ruhollah) = 20
N = 14,307,668

(4) I(Ayatollah, Ruhollah) = log2 [ (20/N) / ((42/N) (20/N)) ] = log2 [ N * 20 / (42 * 20) ] ≈ 18.38

To see how good a collocation this is, we need to compare it to others.
17 / 25
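Checking the arithmetic of (4) (the two 20s cancel, leaving log2(N/42)):

    import math

    N = 14_307_668
    print(math.log2(N * 20 / (42 * 20)))  # 18.378..., i.e. the ~18.38 above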

Problems for Mutual Information
A few problems:
- Sparse data: infrequent bigrams for infrequent words get high scores
- Tends to measure independence (value of 0) better than dependence
- Doesn't account for how often the words do not appear together (M&S 1999, table 5.15)
18 / 25

Motivating Contingency Tables
What we can instead get at is: which bigrams are likely, out of a range of possibilities? Looking at the Arthur Conan Doyle story A Case of Identity, we find the following possibilities for one particular bigram:
- sherlock followed by holmes
- sherlock followed by some word other than holmes
- some word other than sherlock preceding holmes
- two words: the first not being sherlock, the second not being holmes
19 / 25

Contingency Tables
We can count up these different possibilities and put them into a contingency table (or 2x2 table):

                B = holmes   B ≠ holmes   Total
A = sherlock    7            0            7
A ≠ sherlock    39           7059         7098
Total           46           7059         7105

The Total row and Total column are the marginals. Values in this chart are the observed frequencies (f_o).
20 / 25
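A sketch of how such a table can be built from a token stream; run on the story's tokens it should reproduce the counts above (the tokenization is assumed, so exact numbers depend on preprocessing):

    from collections import Counter

    def contingency_table(tokens, a, b):
        """2x2 table of bigram counts for a target bigram (A = a, B = b):
        [[(a,b), (a, not-b)], [(not-a, b), (not-a, not-b)]]."""
        cells = Counter((w1 == a, w2 == b) for w1, w2 in zip(tokens, tokens[1:]))
        return [[cells[True, True], cells[True, False]],
                [cells[False, True], cells[False, False]]]

    # e.g., contingency_table(tokens, "sherlock", "holmes")
    # should give [[7, 0], [39, 7059]] on this story's token stream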

Observed bigram probabilities
Each cell indicates a bigram: divide each cell by the total number of bigrams (7105) to get probabilities:

             holmes    ≠ holmes   Total
sherlock     0.00099   0.0        0.00099
≠ sherlock   0.00549   0.99353    0.99901
Total        0.00647   0.99353    1.0

The marginal probabilities indicate probabilities for a given word, e.g., p(sherlock) = 0.00099 and p(holmes) = 0.00647.
21 / 25

Expected bigram probabilities
Assuming sherlock & holmes are independent results in:

             holmes              ≠ holmes            Total
sherlock     0.00647 x 0.00099   0.99353 x 0.00099   0.00099
≠ sherlock   0.00647 x 0.99901   0.99353 x 0.99901   0.99901
Total        0.00647             0.99353             1.0

This is simply p_e(w1, w2) = p(w1) p(w2).
22 / 25

Expected bigram frequencies
Multiplying by 7105 (the total number of bigrams) gives us the expected number of times we should see each bigram:

             holmes   ≠ holmes   Total
sherlock     0.05     6.95       7
≠ sherlock   45.95    7052.05    7098
Total        46       7059       7105

Values in this chart are expected frequencies (f_e).
23 / 25

Pearson's χ² test
The chi-squared (χ²) test measures how far the observed values are from the expected values:

(5) χ² = Σ (f_o - f_e)² / f_e   (summed over all cells)

(6) χ² = (7 - 0.05)²/0.05 + (0 - 6.95)²/6.95 + (39 - 45.95)²/45.95 + (7059 - 7052.05)²/7052.05
       = 966.05 + 6.95 + 1.05 + 0.007 ≈ 974.06

Looking this up in a table shows the result is unlikely to be chance. The χ² test does not work well for rare events, i.e., when f_e < 5. Other tests can be employed using the same tables.
24 / 25
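The same computation in code, deriving expected frequencies from the marginals. Note that with unrounded expected values the statistic comes out near 1075 rather than 974: the figure on the slide reflects rounding f_e for the (sherlock, holmes) cell up to 0.05. Either way, the conclusion (far beyond chance) is unchanged:

    def chi_squared(table):
        """Pearson's chi-squared for a 2x2 (or larger) table:
        sum over cells of (f_o - f_e)^2 / f_e, where
        f_e = row total * column total / grand total."""
        n = sum(sum(row) for row in table)
        row_totals = [sum(row) for row in table]
        col_totals = [sum(col) for col in zip(*table)]
        chi2 = 0.0
        for i, row in enumerate(table):
            for j, f_o in enumerate(row):
                f_e = row_totals[i] * col_totals[j] / n
                chi2 += (f_o - f_e) ** 2 / f_e
        return chi2

    print(chi_squared([[7, 0], [39, 7059]]))  # ~1075 with unrounded f_e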

Gries (2009) lists some other points to consider:
- Fertility: the number of unique types associated with a word
- Lexical gravity: window-based approaches that find the most informative contextual slots
- Multi-word collocations: breaking down the string into the most informative units for expected frequencies
- Variable n: bottom-up approaches to defining the size of n for n-gram collocates
- Discontinuous n-grams
25 / 25