Natural Language Processing: (Simple) Word Counting
Regina Barzilay
EECS Department, MIT
November 15, 2004

Corpora
- A corpus is a body of naturally occurring text, stored in a machine-readable form.
- A balanced corpus tries to be representative across a language or other domains.

Today
- Word counts
- Corpora and their properties
- Zipf's law
- Examples of annotated corpora
- A word segmentation algorithm

Word Counts
- What are the most common words in the text?
- How many words are there in the text?
- What are the properties of word distribution in large corpora?
- We will consider Mark Twain's Tom Sawyer.
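To make the word-counting questions concrete, here is a minimal Python sketch (not part of the original lecture) that computes token and type counts; `tomsawyer.txt` is a hypothetical local plain-text copy of the novel, and the regular expression is a deliberately crude tokenizer.

```python
from collections import Counter
import re

def word_counts(path: str) -> Counter:
    """Count word frequencies in a plain-text file (crude tokenization)."""
    with open(path, encoding="utf-8") as f:
        tokens = re.findall(r"[a-z']+", f.read().lower())
    return Counter(tokens)

# `tomsawyer.txt` is a hypothetical local copy of the novel.
counts = word_counts("tomsawyer.txt")
n_tokens = sum(counts.values())
print("word tokens:", n_tokens)                   # total number of words
print("word types:", len(counts))                 # vocabulary size
print("average frequency:", round(n_tokens / len(counts), 1))
print("most common:", counts.most_common(10))
```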

Most Common Words

Word   Freq   Use
the    3332   determiner (article)
and    2972   conjunction
a      1775   determiner
to     1725   preposition, inf. marker
of     1440   preposition
was    1161   auxiliary verb
it     1027   pronoun
in      906   preposition
that    877   complementizer
Tom     678   proper name

Most Common Words (Cont.)
Some observations:
- Dominance of function words
- Presence of corpus-dependent items (e.g., "Tom")
- Is it possible to create a truly representative sample of English?

How Many Words Are There?
"They picnicked by the pool, then lay back on the grass and looked at the stars."
- Type: number of distinct words in a corpus (vocabulary size)
- Token: total number of words in a corpus
- Tom Sawyer: 8,018 word types; 71,370 word tokens; average frequency 9

Frequencies of Frequencies

Word Frequency   Frequency of Frequency
1                3993
2                1292
3                 664
4                 410
5                 243
6                 199
7                 172
8                 131
9                  82
10                 91
11-50             540
51-100             99

Most words in a corpus appear only once!
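The frequencies-of-frequencies table is essentially one line of code once the word counts exist. A sketch, continuing with the `counts` Counter built above:

```python
from collections import Counter

# `counts` is the word-frequency Counter from the sketch above.
freq_of_freq = Counter(counts.values())
for f in range(1, 11):
    print(f, freq_of_freq[f])

# Hapax legomena: word types that occur exactly once.
print(f"{freq_of_freq[1]} of {len(counts)} types occur only once")
```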

Zipf's Law in Tom Sawyer

word    Freq. (f)   Rank (r)   f * r
the        3332         1       3332
and        2972         2       5944
a          1775         3       5325
he          877        10       8770
but         410        20       8200
be          294        30       8820
there       222        40       8880
one         172        50       8600
about       158        60       9480
never       124        80       9920
Oh          116        90      10440

Zipf's Law
- The frequency of use of the nth-most-frequently-used word in any natural language is approximately inversely proportional to n.
- Zipf's law captures the relationship between frequency and rank: $f \propto \frac{1}{r}$, i.e., there is a constant $k$ such that $f \cdot r = k$.
- Mandelbrot's refinement: $f = P(r + p)^{-B}$, or equivalently $\log f = \log P - B \log(r + p)$, where $P$, $B$, and $p$ are parametrized for particular corpora. This gives a better fit at low and high ranks.
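Checking Zipf's law against a corpus is a short exercise given the rank-ordered counts. A sketch, again assuming the `counts` Counter from the first code block: if $f \propto 1/r$ holds, the printed $f \cdot r$ column should stay roughly constant.

```python
# If f is roughly proportional to 1/r, then f * r should stay roughly
# constant across ranks. `counts` is the Counter from the first sketch.
ranked = counts.most_common()
for r in (1, 2, 3, 10, 20, 50, 100, 500):
    if r <= len(ranked):
        word, f = ranked[r - 1]
        print(f"r={r:4d}  {word:12s}  f={f:5d}  f*r={f * r}")
```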

Zipf's Law and the Principle of Least Effort
Human Behavior and the Principle of Least Effort (Zipf): "... Zipf argues that he found a unifying principle, the Principle of Least Effort, which underlies essentially the entire human condition (the book even includes some questionable remarks on human sexuality!). The principle argues that people will act so as to minimize their probable average rate of work." (Manning & Schütze, p. 23)

Examples of collections approximately obeying Zipf's law:
- Frequency of accesses to web pages
- Sizes of settlements
- Income distribution amongst individuals
- Sizes of earthquakes
- Notes in musical performances

Other laws:
- Word sense distribution
- Phoneme distribution
- Word co-occurrence patterns

Is Zipf's Law unique to human language?
- (Li 1992): randomly generated text exhibits Zipf's law.
- Consider a generator that produces each of the 26 letters of the alphabet and the blank uniformly at random. The probability that a generated word has length $n$ is $p(w_n) = \left(\frac{26}{27}\right)^n \frac{1}{27}$.
- The words produced by such a generator obey a power law of the Mandelbrot form: there are 26 times more possible words of length n + 1 than words of length n, and there is a constant ratio by which words of length n are more frequent than words of length n + 1.
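The Li (1992) observation is easy to reproduce. A self-contained sketch (not from the lecture) that generates "monkey typing" text over 26 letters plus the blank and prints a few rank-frequency pairs, which fall off in the characteristic Zipf-like way:

```python
import random
import string
from collections import Counter

# Generate 1M random characters over {a..z, blank}; maximal letter
# runs between blanks are the "words".
random.seed(0)
chars = random.choices(string.ascii_lowercase + " ", k=1_000_000)
words = "".join(chars).split()

ranked = Counter(words).most_common()
for r in (1, 10, 100, 1000, 10000):
    if r <= len(ranked):
        w, f = ranked[r - 1]
        print(f"r={r:6d}  {w!r:12s}  f={f}")
```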

Sparsity
"There is no data like more data."
- How often does "kick" occur in 1M words? 58
- How often does "kick a ball" occur in 1M words? 0
- How often does "kick" occur on the web? 6M
- How often does "kick a ball" occur on the web? 8,000

Very Very Large Data
- Brill & Banko 2001: in the task of confusion-set disambiguation, increasing the training data size yields significant improvement over the best-performing system trained on the standard-size training corpus.
  - Task: disambiguate between pairs such as "too" and "to"
  - Training size: varies from one million to one billion words
  - Learning methods used for comparison: winnow, perceptron, decision tree
- Lapata & Keller 2002, 2003: the web can be used as a very, very large corpus. The counts can be noisy, but for some tasks this is not an issue.
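Counting a phrase in a fixed-size corpus makes the sparsity point tangible. A sketch (illustrative, not from the lecture); `tokens` is assumed to be a list of roughly 1M corpus tokens, e.g. produced by the tokenizer in the first code block:

```python
def count_ngram(tokens: list, ngram: list) -> int:
    """Count occurrences of a contiguous word n-gram in a token list."""
    n = len(ngram)
    return sum(1 for i in range(len(tokens) - n + 1)
               if tokens[i:i + n] == ngram)

# `tokens` is assumed to be a ~1M-word token list.
print(count_ngram(tokens, ["kick"]))               # a few dozen hits
print(count_ngram(tokens, ["kick", "a", "ball"]))  # very likely 0
```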

The Brown Corpus
- Famous early corpus (made by Nelson Francis and Henry Kucera at Brown University in the 1960s)
- A balanced corpus of written American English: newspaper, novels, non-fiction, academic
- 1 million words, 500 written texts
- Do you think this is a large corpus?

Corpus Content
- Genre: newswire, novels, broadcast, spontaneous conversations
- Media: text, audio, video
- Annotations: tokenization, syntactic trees, semantic senses, translations

Recent Corpora

Corpus                    Size          Domain     Language
NA News Corpus            600 million   newswire   American English
British National Corpus   100 million   balanced   British English
EU proceedings            20 million    legal      10 language pairs
Penn Treebank             2 million     newswire   American English
Broadcast News            -             spoken     7 languages
SwitchBoard               2.4 million   spoken     American English

For more corpora, check the Linguistic Data Consortium: http://www.ldc.upenn.edu/

Example of Annotations: POS Tagging
- POS tags encode simple grammatical functions
- Several tag sets: Penn tag set (45 tags), Brown tag set (87 tags), CLAWS2 tag set (132 tags)

Category               Example          CLAWS c5   Brown   Penn
Adverb                 often, badly     AV0        RB      RB
Noun singular          table, rose      NN1        NN      NN
Noun plural            tables, roses    NN2        NNS     NNS
Noun proper singular   Boston, Leslie   NP0        NP      NNP

Issues in Annotations
- Different annotation schemes for the same task are common.
- In some cases there is a direct mapping between schemes; in other cases they do not exhibit any regular relation.
- The choice of annotation is motivated by linguistic, computational, and/or task requirements.

What's a Word?
- English: "Wash." vs. "wash"; "won't", "John's"; "pro-Arab"; "the idea of a child-as-required-yuppie-possession must be motivating them"; "85-year-old grandmother"
- East Asian languages: words are not separated by white space

Tokenization
- Goal: divide text into a sequence of words
- A word is "a string of contiguous alphanumeric characters with space on either side; may include hyphens and apostrophes but no other punctuation marks" (Kucera and Francis)
- Is tokenization easy?

Word Segmentation
- Rule-based approach: morphological analysis based on lexical and grammatical knowledge
- Corpus-based approach: learn from corpora (Ando & Lee, 2000)
- Issues to consider: coverage, ambiguity, accuracy
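The Kucera-Francis definition above translates almost directly into a regular expression. A minimal sketch, and a reminder of why tokenization is not easy (note what happens to "Wash."):

```python
import re

# Contiguous alphanumeric characters, possibly with internal hyphens
# or apostrophes, per the Kucera-Francis definition quoted above.
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")

def tokenize(text: str) -> list:
    return WORD_RE.findall(text)

print(tokenize("John's 85-year-old grandmother won't visit Wash."))
# ["John's", '85-year-old', 'grandmother', "won't", 'visit', 'Wash']
# The sentence-final period is dropped, so "Wash." (Washington)
# becomes indistinguishable from the verb "wash" after lowercasing.
```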

Motivation for Statistical Segmentation
- Unknown-words problem: presence of domain terms and proper names
- Grammatical constraints may not be sufficient
- Example: alternative segmentations of a noun phrase:

Segmentation                   Translation
sha-choh/ken/gyoh-mu/bu-choh   president/and/business/general manager
sha-choh/ken-gyoh/mu/bu-choh   president/subsidiary business/tsutomi [a name]/general manager

Word Segmentation
Key idea: for each candidate boundary, compare the frequency of the n-grams adjacent to the proposed boundary with the frequency of the n-grams that straddle it.

[Figure: the character sequence T I N G E V I D with a candidate boundary in the middle; s_1 and s_2 are the non-straddling n-grams on either side, and t_1, t_2, t_3 are the n-grams that straddle the boundary.]

For N = 4, consider the 6 questions of the form "Is $\#(s_i) \geq \#(t_j)$?", where $\#(x)$ is the number of occurrences of x. Example: Is "TING" more frequent in the corpus than "INGE"?

Algorithm for Word Segmentation

Notation:
- $s_1^n$, $s_2^n$: the non-straddling n-grams to the left and to the right of location k
- $t_j^n$: the straddling n-gram with j characters to the right of location k
- $I(y, z)$: indicator function that is 1 when $y \geq z$, and 0 otherwise

1. Calculate the fraction of affirmative answers for each n in N:

$$v_n(k) = \frac{1}{2(n-1)} \sum_{i=1}^{2} \sum_{j=1}^{n-1} I\big(\#(s_i^n), \#(t_j^n)\big)$$

2. Average the contributions of each n-gram order:

$$v_N(k) = \frac{1}{|N|} \sum_{n \in N} v_n(k)$$

Algorithm for Word Segmentation (Cont.)
Place a boundary at every location l such that either:
- l is a local maximum: $v_N(l) > v_N(l-1)$ and $v_N(l) > v_N(l+1)$, or
- $v_N(l) \geq t$, where t is a threshold parameter

[Figure: plot of $v_N(k)$ across candidate locations in a character sequence, with boundaries placed at local maxima and at points that clear the threshold t.]
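The two steps and the boundary rule fit in a short function. A sketch in the spirit of Ando & Lee (2000), assuming n-gram statistics are collected from a larger `corpus` string; the `orders` and `threshold` values are illustrative guesses, not the tuned settings from the paper:

```python
from collections import Counter

def segment(text: str, corpus: str, orders=(2, 3, 4), threshold=0.7):
    """Place word boundaries in `text` using corpus n-gram statistics."""
    # Count all character n-grams of the orders in N.
    counts = Counter()
    for n in orders:
        for i in range(len(corpus) - n + 1):
            counts[corpus[i:i + n]] += 1

    def v_n(k: int, n: int) -> float:
        # Fraction of "yes" answers to: is the non-straddling n-gram
        # s_i at least as frequent as the straddling n-gram t_j?
        if k - n < 0 or k + n > len(text):
            return 0.0
        s = [text[k - n:k], text[k:k + n]]                  # s_1, s_2
        t = [text[k - (n - j):k + j] for j in range(1, n)]  # t_1 .. t_{n-1}
        yes = sum(counts[si] >= counts[tj] for si in s for tj in t)
        return yes / (2 * (n - 1))

    # v_N(k): average the contributions of each n-gram order.
    v = [sum(v_n(k, n) for n in orders) / len(orders)
         for k in range(len(text) + 1)]

    # Boundary at local maxima of v_N, or wherever v_N clears the threshold.
    return [k for k in range(1, len(text))
            if (v[k] > v[k - 1] and v[k] > v[k + 1]) or v[k] >= threshold]
```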

Experimental Framework
- Corpus: 150 megabytes of 1993 Nikkei newswire
- Manual annotations: 50 sequences for the development set (parameter tuning) and 50 sequences for the test set
- Baseline algorithms: the Chasen and Juman morphological analyzers (115,000 and 231,000 words)

Evaluation Measures
tp: true positive; fp: false positive; tn: true negative; fn: false negative

System         target   not target
selected       tp       fp
not selected   fn       tn

Evaluation Measures (Cont.)
- Precision, the proportion of selected items that the system got right:

$$P = \frac{tp}{tp + fp}$$

- Recall, the proportion of target items that the system selected:

$$R = \frac{tp}{tp + fn}$$

- F-measure:

$$F = \frac{2PR}{P + R}$$

- Word precision (P) is the percentage of proposed brackets that match word-level brackets in the annotation; word recall (R) is the percentage of word-level brackets in the annotation that are proposed by the algorithm.

Conclusions
- Corpora are widely used in text processing
- Corpora are used either annotated or raw
- Zipf's law and its connection to natural language
- Sparsity is a major problem for corpus-processing methods
- Next time: language modeling
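Computed from boundary sets, the three measures are a few lines of Python. A sketch (a hypothetical helper, not from the lecture) that scores proposed word boundaries against gold-standard annotation:

```python
def prf(proposed: set, gold: set) -> tuple:
    """Precision, recall, and F-measure for proposed vs. gold boundaries."""
    tp = len(proposed & gold)   # boundaries we got right
    fp = len(proposed - gold)   # proposed, but not in the gold standard
    fn = len(gold - proposed)   # gold boundaries we missed
    p = tp / (tp + fp) if proposed else 0.0
    r = tp / (tp + fn) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

print(prf({1, 4, 7, 9}, {1, 4, 8, 9}))  # (0.75, 0.75, 0.75)
```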