
Simple, Effective, Robust Semi-Supervised Learning, Thanks To Google N-grams. Shane Bergsma, Johns Hopkins University. Hissar, Bulgaria, September 15, 2011.

Research Vision. Robust processing of human language requires knowledge beyond what's in small manually-annotated data sets. Derive knowledge from real-world data: 1) raw text on the web, 2) bilingual text (words plus their translations), 3) visual data (labelled online images).

More data is better data [Banko & Brill, 2001]: grammar correction task at Microsoft.

Search Engines vs. N-grams. Early web work: use an Internet search engine to get data [Keller & Lapata, 2003]. Example hit counts: "Britney Spears" 269,000,000 pages; "Britany Spears" 693,000 pages.

Search Engines. Search engines for NLP: objectionable? Scientifically: not reproducible, unreliable [Kilgarriff, 2007, "Googleology is bad science"]. Practically: too slow for millions of queries.

N-grams. Google N-gram Data [Brants & Franz, 2006]: N words in sequence plus their count on the web; a compressed version of all the text on the web (24 GB zipped fits on your hard drive). Enables better features for a range of tasks [Bergsma et al., ACL 2008, IJCAI 2009, ACL 2010, etc.].

Google N-gram Data Version 2. Google N-grams Version 2 [Lin et al., LREC 2010]: same source as Google N-grams Version 1, with more pre-processing (duplicate-sentence removal, sentence-length and alphabetical constraints) and part-of-speech tags. Example entries:
flies 1643568 (NNS 611646, VBZ 1031922)
caught the flies , 11 (VBD DT NNS , 11)
plane flies really well 10 (NN VBZ RB RB 10)

How to Create Robust Classifiers using Google N-grams. Features from the Google N-gram corpus: Count(some N-gram) in the Google corpus. Open questions: 1. How well do web-scale N-gram features work when combined with conventional features? 2. How well do classifiers with web-scale N-gram features perform on new domains? Conclusion: N-gram features are essential [Bergsma, Pitler & Lin, ACL 2010].

Feature Classes. Lex (lexical features), x_Lex: many thousands of binary features indicating a property of the strings to be classified. N-gm (N-gram count features), x_Ngm: a few dozen real-valued features for the logarithmic counts of various things. The classifier: x = (x_Lex, x_Ngm), h(x) = w · x.
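A minimal sketch (not the talk's code) of how the two feature classes combine into one sparse vector scored by a linear model h(x) = w · x. The feature names and the count lookup are illustrative assumptions standing in for a real N-gram table and trained SVM weights.

import math

def make_features(lex_indicators, ngram_counts):
    """lex_indicators: active string features, e.g. {"first-word=big"}.
    ngram_counts: feature name -> raw web count (assumed stand-in lookup)."""
    x = {name: 1.0 for name in lex_indicators}        # x_Lex: sparse binary
    for name, count in ngram_counts.items():          # x_Ngm: log counts
        x[name] = math.log(count + 1.0)
    return x

def h(w, x):
    """Linear decision function h(x) = w . x over sparse feature dicts."""
    return sum(w.get(name, 0.0) * value for name, value in x.items())

# Toy usage; in the talk's setting w would come from training a linear SVM.
x = make_features({"first-word=big", "second-word=green"},
                  {"count(big green)": 571_000})
score = h({"first-word=big": 0.4, "count(big green)": 0.2}, x)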

[Pipeline diagram: training examples (small) plus Google N-gram data (HUGE) yield feature vectors x_1, x_2, x_3, x_4, which feed machine learning to produce the classifier h(x).]

Uses of New N-gram Data. Applications: 1. adjective ordering, 2. real-word spelling correction, 3. noun compound bracketing. All experiments: linear SVM classifier; we report accuracy (%).

1. Adjective Ordering. "green big truck" or "big green truck"? Used in translation, generation, etc. Not a syntactic issue but a semantic one: size precedes colour, etc.

Adjective Ordering. As a classification problem: take the adjectives in alphabetical order; decision: is the alphabetical order correct or not? Why not just use the most frequent order on the web? 87% accuracy for web order, but 94% for the classifier.

Adjective Ordering Features. Lex features: indicators for the adjectives; adj_1 indicated with +1, adj_2 with -1. E.g. "big green": x_Lex = (..., 0, 0, 0, +1, 0, 0, ..., 0, 0, -1, 0, 0, ...). Decision: h_Lex(x_Lex) = w_Lex · x_Lex = w_big - w_green.

Adjective Ordering Features [three slides illustrating the learned weights as positions on a number line: w_big vs. w_green for "big green truck"; w_big vs. w_first for "first big storm"; and the overall ordering w_first, w_big, w_young, w_green, w_Canadian].

Adjective Ordering Features. N-gm features: Count("big green"), Count("big J.*"), Count("J.* big"), Count("green big"), Count("green J.*"), Count("J.* green"), ... x_Ngm = (29K, 200, 571K, 2.5M, ...).
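A hedged sketch of the adjective-ordering features just described: the +1/-1 lexical encoding (so the lexical score reduces to w_adj1 - w_adj2) plus log-count N-gram features. The feature names, the "J.*" wildcard lookups, and the toy count table are illustrative, not the talk's actual data.

import math

def lex_features(adj1, adj2):
    # adj1 gets +1 and adj2 gets -1, so the lexical score reduces to
    # w[adj1] - w[adj2]: a learned per-adjective precedence weight.
    return {adj1: +1.0, adj2: -1.0}

def ngm_features(counts, adj1, adj2):
    # Log-counts of the pair in both orders and of each adjective next to
    # any adjective (the "J.*" wildcard patterns from the slide).
    patterns = {
        "c(a1 a2)": f"{adj1} {adj2}", "c(a2 a1)": f"{adj2} {adj1}",
        "c(a1 J)": f"{adj1} J.*",     "c(J a1)": f"J.* {adj1}",
        "c(a2 J)": f"{adj2} J.*",     "c(J a2)": f"J.* {adj2}",
    }
    return {k: math.log(counts.get(p, 0) + 1.0) for k, p in patterns.items()}

toy_counts = {"big green": 571_000, "green big": 29_000}       # illustrative
x = {**lex_features("big", "green"), **ngm_features(toy_counts, "big", "green")}
# A positive h(x) = w . x keeps the alphabetical order "adj1 adj2".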

Adjective Ordering Results [accuracy chart]

In-Domain Learning Curve [chart]: 93.7%

Out-of-Domain Learning Curve [chart]!

2. Real-Word Spelling Correction. The classifier predicts the correct word in context: "Let me know weather you like it." — weather or whether?

Spelling Correction. Lex features: presence of particular words (and phrases) preceding or following the confusable word.

Spelling Correction. N-gm feats: leverage multiple relevant contexts: "Let me know _", "me know _ you", "know _ you like", "_ you like it" [Bergsma et al., 2009]. Five 5-grams, four 4-grams, three 3-grams and two 2-grams span the confusable word.

Spelling Correction. N-gm features:
5-grams: Count("let me know weather you"), Count("me know weather you like"), ...
4-grams: Count("let me know weather"), Count("me know weather you"), Count("know weather you like"), ...
5-grams for the other candidate: Count("let me know whether you"), ...
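A sketch of the "all N-grams spanning the confusable word" idea: for each candidate, sum the log-counts of every 2- to 5-gram that covers the gap. The count dictionary is an assumed stand-in for the Google N-gram data, and this simple sum replaces the weighted combination a trained classifier would learn.

import math

def spanning_score(left, right, candidate, counts, max_n=5):
    """left/right: context tokens before/after the gap; candidate: the filler."""
    score = 0.0
    for n in range(2, max_n + 1):
        # The filler can sit at any of the n positions, giving n spanning
        # n-grams: five 5-grams, four 4-grams, three 3-grams, two 2-grams.
        for k in range(n):                 # k context tokens left of the gap
            if k > len(left) or n - 1 - k > len(right):
                continue                   # not enough context on this side
            tokens = left[len(left) - k:] + [candidate] + right[:n - 1 - k]
            score += math.log(counts.get(" ".join(tokens), 0) + 1.0)
    return score

left, right = "let me know".split(), "you like it".split()
counts = {"know whether you": 120_000, "know weather you": 350}   # illustrative
best = max(["weather", "whether"],
           key=lambda c: spanning_score(left, right, c, counts))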

Spelling Correction Results [accuracy chart]

In-Domain Learning Curve [chart]

Cross-Domain Results (accuracy, %):
              N-gm + Lex   Lex
In-Domain     96.5         95.2
Literature    91.9         85.8
Biomedical    94.8         91.0

3. Noun Compound Bracketing. "female bus driver": female (bus driver), not *(female bus) driver; compare "(school bus) driver". The 3-word case is a binary classification: right or left bracketing.

Noun Compound Bracketing. Lex features: binary features for all words, pairs, and the triple, plus the capitalization pattern [Vadas & Curran, 2007].

Noun Compound Bracketing. N-gm features, e.g. for "female bus driver": Count("female bus") predicts left; Count("female driver") predicts right; Count("bus driver") predicts right; also Count("femalebus"), Count("busdriver"), etc. [Nakov & Hearst, 2005].
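A hedged sketch of the intuition behind these count features for "n1 n2 n3": compare evidence for the left reading (n1 n2) n3 against the right reading n1 (n2 n3). In the actual system such counts are fed to a linear SVM rather than compared by a hard rule, and the toy counts below are illustrative.

import math

def bracket(n1, n2, n3, counts):
    """Return "left" for (n1 n2) n3 or "right" for n1 (n2 n3)."""
    left_evidence = math.log(counts.get(f"{n1} {n2}", 0) + 1.0)
    right_evidence = (math.log(counts.get(f"{n1} {n3}", 0) + 1.0) +
                      math.log(counts.get(f"{n2} {n3}", 0) + 1.0))
    return "left" if left_evidence > right_evidence else "right"

toy = {"female bus": 800, "female driver": 60_000, "bus driver": 900_000}
print(bracket("female", "bus", "driver", toy))   # -> "right"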

In-Domain Learning Curve [chart]

Out-of-Domain Results [chart]. Without N-grams: a disaster!

Part 2 Conclusion. It's good to mix standard lexical features with N-gram count features (but be careful out-of-domain). Domain sensitivity of NLP in general: a very big deal.

Part 3: Parsing NPs with Conjunctions. 1) [dairy and meat] production; 2) [sustainability] and [meat production]. Is "dairy production" implied in (1)? Yes. Is "sustainability production" implied in (2)? No. Our contributions: new semantic features from raw web text and a new approach to using bilingual data as soft supervision [Bergsma, Yarowsky & Church, ACL 2011].

One Noun Phrase or Two: A Machine Learning Approach. Classify as either one NP or two using a linear classifier h(x) = w · x, with x_Lex = (..., first-noun=dairy, second-noun=meat, first+second-noun=dairy+meat, ...).

N-gram Features. "[dairy and meat] production": if there is only one NP, it is implicitly talking about dairy production; Count("dairy production") in the N-gram data? [High]. "sustainability and [meat production]": if there were only one NP, it would implicitly be talking about sustainability production; Count("sustainability production") in the N-gram data? [Low].

Features for Explicit Paraphrases. For "❶ and ❷ ❸" (e.g. "dairy and meat production" vs. "sustainability and meat production"): Pattern "❸ of ❶ and ❷": Count("production of dairy and meat"), Count("production of sustainability and meat"). Pattern "❷ ❸ and ❶": Count("meat production and dairy"), Count("meat production and sustainability"). New paraphrases extending ideas in [Nakov & Hearst, 2005].
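A sketch of these paraphrase-count features for a coordination "n1 and n2 n3": if the phrase is a single NP, paraphrases such as "n3 of n1 and n2" and "n2 n3 and n1" should be attested. The feature names and the lookup table are assumptions standing in for web-scale N-gram counts.

import math

def paraphrase_features(n1, n2, n3, counts):
    """Log-count features for the coordination "n1 and n2 n3"."""
    patterns = {
        "n3_of_n1_and_n2": f"{n3} of {n1} and {n2}",  # "production of dairy and meat"
        "n2_n3_and_n1":    f"{n2} {n3} and {n1}",     # "meat production and dairy"
        "n1_n3":           f"{n1} {n3}",              # implicit "dairy production"
    }
    return {name: math.log(counts.get(p, 0) + 1.0) for name, p in patterns.items()}

toy = {"production of dairy and meat": 140, "dairy production": 52_000}  # illustrative
x_para = paraphrase_features("dairy", "meat", "production", toy)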

Using Bilingual Data. Bilingual data is a rich source of paraphrases: "dairy and meat production" / "producción láctea y cárnica". Build a classifier which uses bilingual features; applicable when we know the translation of the NP.

Bilingual Paraphrase Features, for "❶ and ❷ ❸". Spanish pattern "❸ ❶ ❷": Count("producción láctea y cárnica") for "dairy and meat production"; unseen for "sustainability and meat production". Italian pattern "❶ ❸ ❷": unseen for "dairy and meat production"; Count("sostenibilità e la produzione di carne") for "sustainability and meat production".

Bilingual Paraphrase Features, for "❶ and ❷ ❸". Finnish pattern "❶- ❷❸": Count("maidon ja lihan tuotantoon") for "dairy and meat production"; unseen for "sustainability and meat production".

[Diagram, built up over three slides: the co-training setup [Yarowsky, 1995; Blum & Mitchell, 1998]. A monolingual classifier h(x_m) uses training examples plus features from Google data; a bilingual classifier h(x_b) uses bitext examples plus features from translation data. Training examples include "coal and steel money" and "rocket and mortar attacks"; bitext examples include "business and computer science", "the Bosporus and Dardanelles straits", and "the environment and air transport".]

[Chart: error rate (%) of the co-trained classifiers h(x_b)_i and h(x_m)_i across co-training rounds i.]

[Bar chart: error rate (%) on the Penn Treebank (PTB), comparing broad-coverage parsers, Nakov & Hearst (2005) (unsupervised), Pitler et al. (2010) (800 PTB training examples), the new supervised monoclassifier h(x_m) (800 PTB training examples), and the co-trained monoclassifier.]

Conclusion. Robust NLP needs to look beyond human-annotated data to exploit large corpora. Size matters: most parsing systems are trained on 1 million words. We use: billions of words in bitexts (as soft supervision); trillions of words of monolingual text (as features); hundreds of billions of online images (at roughly 1000 words each, about 100 trillion words!). [See our RANLP 2011 and IJCAI 2011 papers.]

Questions + Thanks. Gold sponsors: [logos]. Platinum sponsors (collaborators): Kenneth Church (Johns Hopkins), Randy Goebel (Alberta), Dekang Lin (Google), Emily Pitler (Penn), Benjamin Van Durme (Johns Hopkins) and David Yarowsky (Johns Hopkins).