Web-Scale N-Gram Models for Lexical Disambiguation

Transcription:

Web-Scale N-Gram Models for Lexical Disambiguation
Shane Bergsma, Dekang Lin (Google, Inc.), Randy Goebel
IJCAI 2009
Slide 1

N-grams for Disambiguation
Problem: Choose a label for a word in text
  Noun or verb? Sense 1 or Sense 2?
Method: Which label is most frequent in the word's (N-gram) context?
  Get counts from web-scale text
  Combine counts from multiple segments of context
Slide 2

Outline
1. Lexical Disambiguation
2. Gathering Web-Scale Counts
3. Combining Context Counts
4. Applications
   Preposition Selection
   Context-Sensitive Spelling Correction
   Non-Referential Pronoun Detection
Slide 3

Lexical Disambiguation
Choosing the correct meaning of a word from a set of candidates
Input: a word in context: "Bob ate a huge bass for dinner."
Output: a label, e.g.: <fish-bass> or <music-bass>
Slide 4

Lexical Disambiguation
Different meanings, same surface form:
  "Let me know weather you like it." (weather or whether?)
Also: diacritic restoration, POS-tagging, etc. (Yarowsky 1994, Roth 1998)
Slide 5

Lexical Disambiguation
Use corpus occurrences as unambiguous examples: "know weather you" vs. "know whether you"
Terminology:
  "know _ you": context pattern
  "weather", "whether": fillers
Get counts for fillers in context patterns, take highest-scoring as label
Slide 6
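
The pattern-and-filler rule on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `best_filler` and the dictionary layout are assumptions, and the two counts are the July 6, 2009 page counts quoted elsewhere in the talk.

```python
# Page counts quoted in the talk (July 6, 2009 search-engine figures).
COUNTS = {
    "know weather you": 4_060,
    "know whether you": 2_530_000,
}

def best_filler(pattern, fillers, counts):
    """Return the filler with the highest count when substituted for '_'.

    `pattern` is a context pattern such as "know _ you"; unseen n-grams
    are treated as having a count of zero (an assumption of this sketch).
    """
    scored = {f: counts.get(pattern.replace("_", f), 0) for f in fillers}
    return max(scored, key=scored.get)

label = best_filler("know _ you", ["weather", "whether"], COUNTS)
print(label)
```

With these counts the rule picks "whether", since its filled pattern is far more frequent.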

Non-word Labels
Devise proxies for labels, get pattern counts (Mihalcea & Moldovan, 1999)
"Bob ate a huge bass for dinner."
Sense          Proxies
<fish-bass>    tuna, salmon, pike
<music-bass>   guitar, drums, harmonica
Slide 7
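
One way to read the proxy idea is sketched below. The slide does not say how proxy counts are aggregated; summing them per sense is an assumption of this sketch, and every count in `TOY_COUNTS` is invented purely for illustration.

```python
# Proxy words per sense, taken from the slide.
PROXIES = {
    "fish-bass": ["tuna", "salmon", "pike"],
    "music-bass": ["guitar", "drums", "harmonica"],
}

# Hypothetical counts, NOT real corpus figures.
TOY_COUNTS = {
    "ate a huge tuna": 120,
    "ate a huge salmon": 300,
    "ate a huge pike": 40,
    "ate a huge guitar": 2,
}

def sense_by_proxies(pattern, proxies, counts):
    """Score each sense by the summed counts of its proxies in `pattern`.

    Summation is an assumed aggregation; unseen n-grams count as zero.
    """
    totals = {
        sense: sum(counts.get(pattern.replace("_", w), 0) for w in words)
        for sense, words in proxies.items()
    }
    return max(totals, key=totals.get), totals

sense, totals = sense_by_proxies("ate a huge _", PROXIES, TOY_COUNTS)
print(sense, totals)
```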

Web-Scale Data
Where to get the counts?
More data = better data (Banko & Brill, 2001)
Hmmm... Search engine page-counts = Awesome corpus counts
Slide 8

Previous work
Lapata & Keller 2005: Query web with trigram of context:
  "know weather you": 1,370,000 pages
  "know whether you": 1,600,000 pages
Correct one is higher, but???
Slide 9

Previous work (continued)
July 6, 2009:
  "know weather you": 4,060 pages
  "know whether you": 2,530,000 pages
Slide 10

Google N-gram Data
2006: Google releases web-scale N-gram corpus
From 1 trillion words of online English text
  Doesn't fit on your hard drive
1-grams to 5-grams with > 40 counts
  A compressed version of the whole web
  Approximately 24 GB gzipped
  Does fit on your hard drive
Slide 11

Web vs. N-Gram Corpus
For training a preposition selection system, needed 267 million unique counts.
Using the Google API with its 1000 queries/day limit, that would have taken over 732 years
Search-engine counts are extremely inefficient
Slide 12
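
The arithmetic behind this estimate is easy to check. The sketch below uses the rounded figure of 267 million from the slide and assumes a 365-day year; it comes out at roughly 731.5 years, so the slide's "over 732" presumably reflects the unrounded number of counts.

```python
unique_counts = 267_000_000   # rounded figure from the slide
queries_per_day = 1_000       # API limit quoted on the slide

days = unique_counts / queries_per_day
years = days / 365            # assumes a 365-day year

print(f"{days:,.0f} days, about {years:.1f} years")
```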

How much context to include? Slide 13

From: xkcd.com Slide 14

Multiple Patterns
Many contexts span the confusable word:
  Let me know _
  me know _ you
  know _ you like
  _ you like it
Five 5-grams, four 4-grams, three 3-grams and two 2-grams span the confusable word
Like a LM
Slide 15
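
Enumerating the spanning patterns can be sketched as follows. The function name `spanning_patterns` and its parameters are assumptions made for illustration. Note that the slide's tally of five, four, three and two patterns holds when there are at least four tokens of context on each side; for the short example sentence only three 5-grams fit.

```python
def spanning_patterns(tokens, i, min_n=2, max_n=5):
    """List every n-gram (min_n..max_n) covering position i.

    The token at position i is replaced by '_' to form a context pattern.
    N-grams that would run past either end of the sentence are skipped.
    """
    masked = list(tokens)
    masked[i] = "_"
    patterns = []
    for n in range(max_n, min_n - 1, -1):
        for start in range(i - n + 1, i + 1):
            end = start + n
            if start >= 0 and end <= len(masked):
                patterns.append(" ".join(masked[start:end]))
    return patterns

tokens = "Let me know weather you like it".split()
pats = spanning_patterns(tokens, 3)
for p in pats:
    print(p)
```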

SuperLM: Combining Counts
Use supervised machine learning to combine counts (Bergsma et al., ACL 2008)
Features: log(count(context-pattern{filler})), indexed by pattern position, length, filler, class
  learns the association of fillers and classes
  exploits most predictive fillers, positions
Slide 16
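
A rough sketch of how such features might be built is given below. It is not the authors' implementation: the function name, the `(length, position, filler)` key layout, and the add-one inside the logarithm (used here only to avoid log(0) for unseen n-grams) are all assumptions. Indexing by class would come from the classifier itself, for example one weight vector per class, and is not shown.

```python
import math

def superlm_features(tokens, i, fillers, counts, min_n=2, max_n=5):
    """Map (n, pos, filler) -> log(count + 1) for each spanning n-gram.

    n   : length of the n-gram
    pos : index of the target word inside the n-gram
    The +1 is an assumed smoothing so that unseen n-grams give 0.0.
    """
    feats = {}
    for n in range(min_n, max_n + 1):
        for pos in range(n):
            start = i - pos
            if start < 0 or start + n > len(tokens):
                continue  # n-gram would fall outside the sentence
            for f in fillers:
                ngram = tokens[start:i] + [f] + tokens[i + 1:start + n]
                c = counts.get(" ".join(ngram), 0)
                feats[(n, pos, f)] = math.log(c + 1)
    return feats

# Hypothetical counts, for illustration only.
toy = {"to choose among": 99, "among the three": 9}
sent = "to choose among the three candidates".split()
feats = superlm_features(sent, 2, ["among", "between"], toy)
print(len(feats))
```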

Example
"... to choose among/between the three candidates ..."
Predicting: is it "among"?
Feature                           Weight
log( C( to choose among ) )       +1
log( C( to choose between ) )     -1
log( C( among the three ) )       +3
log( C( between the three ) )     -3
Slide 17
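
To make the weighted sum concrete, the sketch below plugs the four weights from this slide into a linear score. The counts in `TOY` are invented for illustration, and the decision rule (predict "among" when the score is positive, with no bias term) is an assumption rather than something the slide states.

```python
import math

# Weights as shown on the slide.
WEIGHTS = {
    "to choose among": +1.0,
    "to choose between": -1.0,
    "among the three": +3.0,
    "between the three": -3.0,
}

# Hypothetical counts, NOT real corpus figures.
TOY = {
    "to choose among": 1000,
    "to choose between": 4000,
    "among the three": 900,
    "between the three": 300,
}

def among_score(weights, counts):
    """Linear score: sum over features of weight * log(count)."""
    return sum(w * math.log(counts[p]) for p, w in weights.items())

score = among_score(WEIGHTS, TOY)
print("among" if score > 0 else "between")
```

In this toy case the left-hand trigrams favour "between", but the more heavily weighted right-hand trigrams favour "among" and decide the outcome, which is the kind of position-specific evidence the learned weights are meant to capture.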

Other Approaches
Trigram: Compare trigram counts of fillers, take highest as label
SumLM: Sum the log-frequencies across all context patterns for each filler, take highest as label
Slide 18
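
The two unsupervised baselines can be sketched side by side. Everything here is illustrative: the function names, the add-one inside the logarithm (to avoid log(0)), the choice of the centered trigram for the Trigram baseline, and all counts in `TOY` are assumptions. The `min_n` and `max_n` parameters restrict which pattern lengths SumLM uses.

```python
import math

def trigram(pattern, fillers, counts):
    """Pick the filler with the highest count in a single trigram pattern."""
    return max(fillers, key=lambda f: counts.get(pattern.replace("_", f), 0))

def sumlm(patterns, fillers, counts, min_n=2, max_n=5):
    """Sum log(count + 1) over all patterns of length min_n..max_n."""
    scores = {}
    for f in fillers:
        total = 0.0
        for p in patterns:
            if min_n <= len(p.split()) <= max_n:
                total += math.log(counts.get(p.replace("_", f), 0) + 1)
        scores[f] = total
    return max(scores, key=scores.get), scores

# Hypothetical counts in which the single trigram is misleading.
TOY = {
    "know weather you": 50, "know whether you": 40,
    "me know weather": 5, "me know whether": 500,
    "weather you like": 20, "whether you like": 300,
    "know whether you like": 30,
}
PATTERNS = ["know _ you", "me know _", "_ you like", "know _ you like"]
FILLERS = ["weather", "whether"]

tri_label = trigram("know _ you", FILLERS, TOY)
sum_label, sum_scores = sumlm(PATTERNS, FILLERS, TOY)
print(tri_label, sum_label)
```

With these made-up counts the single trigram picks "weather" while the summed evidence picks "whether", which illustrates why combining several context patterns can be more robust than relying on one.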

Applications
1) Preposition Selection
Study in California at UCLA.
Fillers: 34 prepositions: at, by, from, in, on...
System     Accuracy
Baseline   20.9%
Trigram    58.8%
SumLM      73.7%
SuperLM    75.4%
Slide 19

SumLM from MIN to MAX
(rows: MIN pattern length; columns: MAX pattern length)
MIN \ MAX   2       3       4       5
2           50.2%   63.8%   70.4%   72.6%
3                   66.8%   72.1%   73.7%
4                           69.3%   70.6%
5                                   57.8%
Slide 22

Applications
2) Context-Sensitive Spelling Correction
Fillers: among/between, amount/number, cite/sight/site, peace/piece, raise/rise.
System     Accuracy (Avg.)
Baseline   66.9%
Trigram    88.4%
SumLM      94.8%
SuperLM    95.7%
Slide 25

Applications
3) Non-referential Pronoun Detection
"It is hungry." vs. "It is important to eat."
Fillers: it, he/she/they/etc., .* (proxies)
System     Accuracy
Baseline   59.4%
Trigram    74.3%
SumLM      79.8%
SuperLM    82.4%
Slide 26

Conclusion
Web-scale N-gram counts for many tasks
Use as much context as possible, combine in intelligent ways
Get state-of-the-art performance
Johns Hopkins Summer Workshop 2009: Unsupervised Acquisition of Lexical Knowledge from N-grams
Google N-grams Version 2: with POS-tags!
Slide 27

Thanks! Slide 28