
Natural Language Processing: (Simple) Word Counting

Regina Barzilay
EECS Department, MIT
November 15, 2004

Today

- Corpora and their properties
- Zipf's Law
- Examples of annotated corpora
- A word segmentation algorithm

Corpora

- A corpus is a body of naturally occurring text, stored in a machine-readable form.
- A balanced corpus tries to be representative across a language or other domains.

Word Counts

- What are the most common words in the text?
- How many words are there in the text?
- What are the properties of word distribution in large corpora?

We will consider Mark Twain's Tom Sawyer.
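
A minimal sketch of how such counts can be computed, using Python's collections.Counter. The filename tom_sawyer.txt and the crude regex tokenizer are illustrative assumptions, not the preprocessing behind the slide's numbers:

    import re
    from collections import Counter

    with open("tom_sawyer.txt", encoding="utf-8") as f:
        text = f.read()

    # crude tokenization: runs of letters, allowing internal apostrophes/hyphens
    tokens = re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text)
    counts = Counter(tokens)

    for word, freq in counts.most_common(10):
        print(word, freq)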

Most Common Words

    Word   Freq   Use
    the    3332   determiner (article)
    and    2972   conjunction
    a      1775   determiner
    to     1725   preposition, inf. marker
    of     1440   preposition
    was    1161   auxiliary verb
    it     1027   pronoun
    in      906   preposition
    that    877   complementizer
    Tom     678   proper name

Most Common Words (Cont.)

Some observations:
- Dominance of function words
- Presence of corpus-dependent items (e.g., "Tom")
- Is it possible to create a truly representative sample of English?

How Many Words Are There?

"They picnicked by the pool, then lay back on the grass and looked at the stars."

- Type: number of distinct words in a corpus (vocabulary size)
- Token: total number of words in a corpus

Tom Sawyer: 8,018 word types; 71,370 word tokens; average frequency ≈ 9.
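
The type/token numbers follow directly from the same counts; a small continuation of the sketch, under the same assumptions about file and tokenization:

    import re
    from collections import Counter

    text = open("tom_sawyer.txt", encoding="utf-8").read()
    counts = Counter(re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text))

    n_tokens = sum(counts.values())   # total running words
    n_types = len(counts)             # distinct words (vocabulary size)
    print(n_types, n_tokens, round(n_tokens / n_types, 1))  # e.g. 8018 71370 8.9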

Frequencies of Frequencies

    Word Frequency   Frequency of Frequency
    1                3993
    2                1292
    3                 664
    4                 410
    5                 243
    6                 199
    7                 172
    8                 131
    9                  82
    10                 91
    11-50             540
    51-100             99

Most words in a corpus appear only once!

Zipf's Law in Tom Sawyer

    Word    Freq. (f)   Rank (r)   f * r
    the     3332         1          3332
    and     2972         2          5944
    a       1775         3          5325
    he       877        10          8770
    but      410        20          8200
    be       294        30          8820
    there    222        40          8880
    one      172        50          8600
    about    158        60          9480
    never    124        80          9920
    Oh       116        90         10440

Zipf's Law

The frequency of use of the nth-most-frequently-used word in any natural language is approximately inversely proportional to n.

Zipf's Law captures the relationship between frequency and rank:

    f ∝ 1/r

That is, there is a constant k such that f * r = k.
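
A quick arithmetic check of f * r = k using only the numbers from the Tom Sawyer table above:

    # (word, frequency, rank) triples copied from the slide's table
    data = [("the", 3332, 1), ("and", 2972, 2), ("a", 1775, 3),
            ("he", 877, 10), ("but", 410, 20), ("there", 222, 40),
            ("about", 158, 60), ("never", 124, 80), ("Oh", 116, 90)]

    for word, f, r in data:
        print(f"{word:8s} f*r = {f * r}")
    # The product is well below k at the very top ranks and settles near
    # 9,000-10,000 further down: Zipf's law is only an approximation.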

Zipf's Law

[Figure: rank-frequency plot]

Mandelbrot's refinement

    f = P (r + p)^(-B),   or equivalently   log f = log P - B log(r + p)

- P, B, p are parametrized for particular corpora
- Better fit at low and high ranks
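
A sketch of estimating P, B, and p by nonlinear least squares, assuming SciPy is available; the data points are the Tom Sawyer rank/frequency pairs from the earlier table:

    import numpy as np
    from scipy.optimize import curve_fit

    def mandelbrot(r, P, B, p):
        return P * (r + p) ** (-B)

    ranks = np.array([1, 2, 3, 10, 20, 30, 40, 50, 60, 80, 90], float)
    freqs = np.array([3332, 2972, 1775, 877, 410, 294, 222,
                      172, 158, 124, 116], float)

    (P, B, p), _ = curve_fit(mandelbrot, ranks, freqs, p0=(3000.0, 1.0, 1.0))
    print(f"P={P:.0f}  B={B:.2f}  p={p:.2f}")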

Zipf's Law and the Principle of Least Effort

From Human Behavior and the Principle of Least Effort (Zipf):

"... Zipf argues that he found a unifying principle, the Principle of Least Effort, which underlies essentially the entire human condition (the book even includes some questionable remarks on human sexuality!). The principle argues that people will act so as to minimize their probable average rate of work." (Manning & Schütze, p. 23)

Other Laws

- Word sense distribution
- Phoneme distribution
- Word co-occurrence patterns

Examples of collections approximately obeying Zipf's law:

- Frequency of accesses to web pages
- Sizes of settlements
- Income distribution amongst individuals
- Sizes of earthquakes
- Notes in musical performances

Is Zipf's Law unique to human language?

(Li 1992): randomly generated text exhibits Zipf's law.

Consider a generator that randomly produces characters from the 26 letters of the alphabet and the blank, each with probability 1/27. The probability of a generated word of length n is

    p(w_n) = (26/27)^n * (1/27)

The words generated by such a generator obey a power law of the Mandelbrot form:
- There are 26 times more words of length n+1 than words of length n
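
A small simulation of this observation, in the spirit of Li (1992): generate a long character stream uniformly over 26 letters plus the blank, split on blanks, and inspect the rank-frequency products:

    import random, string
    from collections import Counter

    random.seed(0)
    stream = "".join(random.choices(string.ascii_lowercase + " ", k=1_000_000))
    counts = Counter(stream.split())

    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        if rank in (1, 10, 100, 1000):
            print(rank, word, freq, rank * freq)  # roughly Zipf-like products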

- There is a constant ratio by which words of length n are more frequent than words of length n+1

Sparsity

- How often does "kick" occur in 1M words? 58
- How often does "kick a ball" occur in 1M words? 0

Sparsity (Cont.)

"There is no data like more data."

- How often does "kick" occur in 1M words? 58
- How often does "kick a ball" occur in 1M words? 0
- How often does "kick" occur on the web? 6 million
- How often does "kick a ball" occur on the web? 8,000

Very Very Large Data

Brill & Banko (2001): in the task of confusion set disambiguation, increasing the amount of training data yields significant improvements over the best-performing system trained on the standard training corpus.
- Task: disambiguate between pairs such as "too" and "to"
- Training size: varies from one million to one billion words
- Learning methods used for comparison: winnow, perceptron, decision tree

Lapata & Keller (2002, 2003): the web can be used as a very very large corpus.
- The counts can be noisy, but for some tasks this is not an issue

The Brown Corpus

- Famous early corpus, made by Nelson Francis and Henry Kucera at Brown University in the 1960s
- A balanced corpus of written American English: newspaper, novels, non-fiction, academic
- 1 million words, 500 written texts

Do you think this is a large corpus?

Recent Corpora

    Corpus                    Size          Domain     Language
    NA News Corpus            600 million   newswire   American English
    British National Corpus   100 million   balanced   British English
    EU proceedings            20 million    legal      10 language pairs
    Penn Treebank             2 million     newswire   American English
    Broadcast News            -             spoken     7 languages
    SwitchBoard               2.4 million   spoken     American English

For more corpora, check the Linguistic Data Consortium: http://www.ldc.upenn.edu/

Corpus Content

- Genre: newswire, novels, broadcast, spontaneous conversations
- Media: text, audio, video
- Annotations: tokenization, syntactic trees, semantic senses, translations

Example of Annotations: POS Tagging

POS tags encode simple grammatical functions.

Several tag sets:
- Penn tag set (45 tags)
- Brown tag set (87 tags)
- CLAWS2 tag set (132 tags)

    Category               Example          CLAWS c5   Brown   Penn
    Adverb                 often, badly     AV0        RB      RB
    Noun singular          table, rose      NN1        NN      NN
    Noun plural            tables, roses    NN2        NNS     NNS
    Noun proper singular   Boston, Leslie   NP0        NP      NNP
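
For illustration, Penn-style tags can be obtained with NLTK, assuming the toolkit and its default tokenizer/tagger resources are installed:

    import nltk  # requires the punkt and averaged_perceptron_tagger resources

    tokens = nltk.word_tokenize("Tom saw the roses in Boston.")
    print(nltk.pos_tag(tokens))
    # e.g. [('Tom', 'NNP'), ('saw', 'VBD'), ('the', 'DT'),
    #       ('roses', 'NNS'), ('in', 'IN'), ('Boston', 'NNP'), ('.', '.')]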

Issues in Annotations

- Different annotation schemes for the same task are common
- In some cases there is a direct mapping between schemes; in other cases they do not exhibit any regular relation
- Choice of annotation is motivated by linguistic, computational, and/or task requirements

Tokenization

Goal: divide text into a sequence of words.

A word is "a string of contiguous alphanumeric characters with space on either side; may include hyphens and apostrophes but no other punctuation marks" (Kucera and Francis).

Is tokenization easy?
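
The Kucera-Francis definition above translates almost directly into a regular expression; a minimal sketch:

    import re

    # runs of alphanumeric characters, optionally joined by internal
    # hyphens or apostrophes, per the definition quoted above
    WORD = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")

    print(WORD.findall("John's 85-year-old grandmother won't wash."))
    # ["John's", '85-year-old', 'grandmother', "won't", 'wash']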

What's a Word?

English:
- "Wash." vs. "wash"
- "won't", "John's"
- "pro-arab", "the idea of a child-as-required-yuppie-possession must be motivating them", "85-year-old grandmother"

East Asian languages: words are not separated by white space.

Word Segmentation

- Rule-based approach: morphological analysis based on lexical and grammatical knowledge
- Corpus-based approach: learn from corpora (Ando & Lee, 2000)
- Issues to consider: coverage, ambiguity, accuracy

Motivation for Statistical Segmentation

- Unknown words problem: presence of domain terms and proper names
- Grammatical constraints may not be sufficient

Example: alternative segmentations of a noun phrase

    Segmentation                     Translation
    sha-choh/ken/gyoh-mu/bu-choh     president / and / business / general manager
    sha-choh/ken-gyoh/mu/bu-choh     president / subsidiary business / tsutomi [a name] / general manager

Word Segmentation

Key idea: for each candidate boundary, compare the frequency of the n-grams adjacent to the proposed boundary with the frequency of the n-grams that straddle it.

[Figure: a character string with a candidate boundary, e.g. ... T I N G | E V I D ..., marking the non-straddling 4-grams s_1 = TING and s_2 = EVID and the straddling 4-grams t_1 = INGE, t_2 = NGEV, t_3 = GEVI]

For n = 4, consider the 6 questions of the form "Is #(s_i) > #(t_j)?", where #(x) is the number of occurrences of x.

Example: is TING more frequent in the corpus than INGE?

Algorithm for Word Segmentation

Definitions:
- s_1^n, s_2^n : the non-straddling n-grams just to the left and right of location k
- t_j^n : the straddling n-gram with j characters to the right of location k
- I(y, z) : indicator function that is 1 when y > z, and 0 otherwise

1. Calculate the fraction of affirmative answers for each n in N:

    v_n(k) = (1 / (2(n-1))) * Σ_{i=1}^{2} Σ_{j=1}^{n-1} I(#(s_i^n), #(t_j^n))

2. Average the contributions of each n-gram order:

    v_N(k) = (1 / |N|) * Σ_{n∈N} v_n(k)

Algorithm for Word Segmentation (Cont.)

Place a boundary at all locations l such that either:
- l is a local maximum: v_N(l) > v_N(l-1) and v_N(l) > v_N(l+1), or
- v_N(l) ≥ t, a threshold parameter

[Figure: v_N(k) plotted over character positions, with local maxima and the threshold t marked]
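
A compact sketch of the whole procedure, assuming n-gram counts come from a large raw corpus; the function names, toy corpus, and edge handling are illustrative choices, not Ando & Lee's implementation:

    from collections import Counter

    def ngram_counts(corpus: str, n: int) -> Counter:
        return Counter(corpus[i:i + n] for i in range(len(corpus) - n + 1))

    def v_n(text: str, k: int, n: int, counts: Counter) -> float:
        """Fraction of 'yes' answers to the 2(n-1) questions at location k."""
        s = (text[k - n:k], text[k:k + n])                # non-straddling
        t = [text[k - n + j:k + j] for j in range(1, n)]  # straddling
        yes = sum(counts[a] > counts[b] for a in s for b in t)
        return yes / (2 * (n - 1))

    def segment_points(text, orders, tables, threshold):
        lo, hi = max(orders), len(text) - max(orders)
        score = {k: sum(v_n(text, k, n, tables[n]) for n in orders) / len(orders)
                 for k in range(lo, hi + 1)}
        return [l for l in range(lo + 1, hi)
                if (score[l] > score[l - 1] and score[l] > score[l + 1])
                or score[l] >= threshold]

    corpus = "settingevidenceinthesettingofevidence"  # toy stand-in corpus
    tables = {n: ngram_counts(corpus, n) for n in (2, 3, 4)}
    print(segment_points("settingevidence", (2, 3, 4), tables, threshold=0.9))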

Experimental Framework

- Corpus: 150 megabytes of 1993 Nikkei newswire
- Manual annotations: 50 sequences for the development set (parameter tuning) and 50 sequences for the test set
- Baseline algorithms: Chasen and Juman morphological analyzers (lexicons of 115,000 and 231,000 words, respectively)

Evaluation Measures

- tp: true positive
- fp: false positive
- tn: true negative
- fn: false negative

    System         target   not target
    selected       tp       fp
    not selected   fn       tn

Evaluation Measures (Cont.)

Precision: the proportion of selected items that the system got right:

    P = tp / (tp + fp)

Recall: the proportion of target items that the system selected:

    R = tp / (tp + fn)

F-measure:

    F = 2PR / (P + R)

For word segmentation: word precision (P) is the percentage of proposed brackets that match word-level brackets in the annotation; word recall (R) is the percentage of word-level brackets that are proposed by the algorithm.
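
The three measures as a tiny self-contained function; the counts here are made-up numbers for illustration:

    def prf(tp, fp, fn):
        p = tp / (tp + fp)
        r = tp / (tp + fn)
        return p, r, 2 * p * r / (p + r)

    print(prf(tp=80, fp=20, fn=40))  # (0.8, 0.666..., 0.727...)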

Conclusions

- Corpora are widely used in text processing
- Corpora are used either annotated or raw
- Zipf's law and its connection to natural language
- Sparsity is a major problem for corpus-processing methods

Next time: language modeling