
Statistical Natural Language Processing
CS 486/686, University of Waterloo, July 8, 2005
CS486/686 Lecture Slides (c) 2005 P. Poupart

Outline
- Introduction to Statistical NLP
- Statistical Language Models
- Information Retrieval
- Evaluation Metrics
- Other Applications of Statistical NLP
- Reading: R&N Sect. 23.1, 23.2

Symbolic NLP Insufficient
Symbolic NLP generally fails because:
- Grammars are too complex to specify
- NL is vague, imprecise, and ambiguous
- NL is often context dependent

Motivation behind Statistical NLP
Symbolic NLP involves:
- Constructing a set of rules (e.g. a grammar) for the language and the NLP task.
- Applying the rules to the data.
Success depends on how well the rules describe the data. How do we ensure the rules fit the data well? Derive the rules from the data: statistical natural language processing.

Statistical NLP
Statistical NLP involves:
- Analyzing some (training) data to derive patterns and rules for the language and the NLP task.
- Applying the rules to the (test) data.
Symbolic NLP specifies how a language should be used, while statistical NLP describes how a language is usually used. Often both are needed: hybrid models.

Statistical Language Models
- One of the most fundamental tasks in statistical NLP.
- A statistical / probabilistic language model defines a probability distribution over a (possibly infinite) set of strings.
- We will look at two popular examples:
  - N-gram models: distributions over words
  - Probabilistic context-free grammars

Unigram model
- Unigram: independent distribution P(w) for each word w in the lexicon.
- Given a document D: P(w) = #w in D / Σ_i #w_i in D
- Word sequence: Π_i P(w_i)
- Ex. 20-word sequence generated at random from a unigram model of the textbook: "logical are as are confusion a may right tries agent goal the was diesel more object then information gathering search is"

Bigram model
- Bigram: conditional distribution P(w_i | w_{i-1}) for each word w_i given the previous word w_{i-1}.
- Given a document D: P(w_i | w_{i-1}) = #(w_{i-1}, w_i) in D / #w_{i-1} in D
- Word sequence: P(w_1) Π_i P(w_i | w_{i-1})
- Ex. 20-word sequence generated at random from a bigram model of the textbook: "planning purely diagnostic expert systems are very similar computational approach would be represented compactly using tic tac toe a predicate"

Trigram model
- Trigram: conditional distribution P(w_i | w_{i-1}, w_{i-2}) for each word w_i given the previous two words.
- Given a document D: P(w_i | w_{i-1}, w_{i-2}) = #(w_{i-2}, w_{i-1}, w_i) in D / #(w_{i-2}, w_{i-1}) in D
- Word sequence: P(w_1) P(w_2 | w_1) Π_i P(w_i | w_{i-1}, w_{i-2})
- Ex. 20-word sequence generated at random from a trigram model of the textbook: "planning and scheduling are integrated the success of naive bayes model is just a possible prior source by that time"
(A small code sketch of these n-gram estimates appears at the end of this page.)

Graphically
- Unigram: zeroth-order Markov process (the words w_0, w_1, w_2, w_3, w_4, ... are independent)
- Bigram: first-order Markov process (each w_i depends only on w_{i-1})
- Trigram: second-order Markov process (each w_i depends only on w_{i-1} and w_{i-2})

N-gram models
- Quality: the language model improves with n.
- Learning: the amount of data necessary increases exponentially with n.
- Suppose a corpus of k unique words and K total words:
  - Unigram model: K > k
  - Bigram model: K > k^2
  - Trigram model: K > k^3

Textbook
- The textbook has 15,000 unique words and 500,000 total words.
- Model complexity:
  - Unigram model: 15,000 probabilities
  - Bigram model: 15,000^2 = 225 million probabilities (99.8% of probabilities are zero!)
  - Trigram model: 15,000^3 = 3.375 trillion probabilities (99.9999% of probabilities are zero!)
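To make the counting formulas above concrete, here is a minimal Python sketch of unigram and bigram estimation by relative frequency and of scoring a word sequence under the bigram model. It is an illustration rather than the course's code; the toy corpus and function names are made up.

    from collections import Counter

    def ngram_models(tokens):
        # Relative-frequency estimates: P(w) = #w / N and
        # P(w_i | w_{i-1}) = #(w_{i-1}, w_i) / #w_{i-1}.
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        total = len(tokens)

        def p_uni(w):
            return unigrams[w] / total

        def p_bi(prev, w):
            return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

        return p_uni, p_bi

    def bigram_sequence_prob(words, p_uni, p_bi):
        # P(w_1) * prod_i P(w_i | w_{i-1}) for a word sequence.
        prob = p_uni(words[0])
        for prev, w in zip(words, words[1:]):
            prob *= p_bi(prev, w)
        return prob

    # Hypothetical toy corpus, just to show the mechanics.
    corpus = "the wumpus smells the breeze and the agent smells the wumpus".split()
    p_uni, p_bi = ngram_models(corpus)
    print(p_uni("the"))                               # unigram probability of "the"
    print(p_bi("the", "wumpus"))                      # P(wumpus | the)
    print(bigram_sequence_prob("the wumpus smells".split(), p_uni, p_bi))

With a real corpus (e.g. the textbook) the same counts would drive random generation of sequences like the examples shown above.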

Smoothing
- Zero probabilities can be problematic: the word sequence probability Π_i P(w_i | w_{i-1}, w_{i-2}, ...) = 0 as soon as there is some i such that P(w_i | w_{i-1}, w_{i-2}, ...) = 0.
- Solutions:
  - Add-one smoothing: P^(w_i | w_{i-1}) = [#(w_{i-1}, w_i) + 1] / [#w_{i-1} + k^2]
  - Linear interpolation smoothing: P^(w_i | w_{i-1}) = c_2 P(w_i | w_{i-1}) + c_1 P(w_i), where c_1 + c_2 = 1
(A small code sketch of both smoothing schemes appears at the end of this page.)

Probabilistic Context-Free Grammar (PCFG)
- N-gram models: basic probabilistic language models.
- Context-free grammars: sophisticated symbolic language models.
- Probabilistic context-free grammars: sophisticated probabilistic language models.
  - Assign probabilities to rewrite rules.

Example PCFG
S → NP VP [1.0]
NP → Pronoun [0.10] | Name [0.10] | Noun [0.20] | Article Noun [0.50] | NP PP [0.10]
VP → Verb [0.60] | VP NP [0.20] | VP PP [0.20]
Noun → breeze [0.10] | wumpus [0.15] | agent [0.15]
Verb → sees [0.15] | smells [0.10] | goes [0.25]
Article → the [0.30] | a [0.35] | every [0.05]

Example probabilistic parse tree
Parse of "Every wumpus smells":
  S → NP VP            (1.0)
    NP → Article Noun  (0.5)
      Article → Every  (0.05)
      Noun → wumpus    (0.15)
    VP → Verb          (0.6)
      Verb → smells    (0.1)
Parse tree probability: 1.0 * 0.5 * 0.6 * 0.05 * 0.15 * 0.1 = 0.000225

Learning PCFGs
- When a corpus of parsed sentences is available, learn the probability of each rewrite rule: P(lhs → rhs) = #(lhs → rhs) / #(lhs)
- Problems:
  - We need a CFG, which is hard to design.
  - We also need to parse lots of sentences by hand, which takes a long time.

Learning PCFGs
- Lots of text is available, but not parsed. Can we learn from it? Yes: use the EM algorithm.
  - E step: given rule probabilities, compute the expected frequency of each rule in some corpus.
  - M step: given the expected frequency of each rule, update the rule probabilities by normalizing the rule frequencies.
- Problems:
  - EM gets stuck in local optima.
  - Probabilistic parses are often unintuitive to linguists.
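As a small illustration of the two smoothing schemes from the Smoothing slide above (a sketch, not the course's implementation): the snippet below computes add-one-smoothed and linearly interpolated bigram estimates. Note that this add-one variant adds the vocabulary size k to the denominator so the conditional distribution normalizes, and the interpolation weights c1, c2 are arbitrary example values.

    from collections import Counter

    def smoothed_bigram_models(tokens, c1=0.3, c2=0.7):
        # Returns add-one-smoothed and linearly interpolated bigram estimators.
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        k = len(set(tokens))            # number of unique words in the corpus
        n = len(tokens)                 # total number of words

        def p_unigram(w):
            return unigrams[w] / n

        def p_bigram(prev, w):
            return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

        def p_addone(prev, w):
            # Add-one smoothing: counts shifted by 1, so unseen bigrams get a
            # small nonzero probability instead of zero.
            return (bigrams[(prev, w)] + 1) / (unigrams[prev] + k)

        def p_interp(prev, w):
            # Linear interpolation: c2 * P(w | prev) + c1 * P(w), with c1 + c2 = 1.
            return c2 * p_bigram(prev, w) + c1 * p_unigram(w)

        return p_addone, p_interp

    # Hypothetical toy corpus.
    corpus = "the agent sees the wumpus and the wumpus smells the agent".split()
    p_addone, p_interp = smoothed_bigram_models(corpus)
    print(p_addone("wumpus", "sees"))   # unseen bigram, still nonzero
    print(p_interp("the", "wumpus"))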
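The parse-tree probability and the rule-probability estimate P(lhs → rhs) = #(lhs → rhs) / #(lhs) can likewise be sketched in a few lines of Python. The tree encoding and the tiny grammar fragment below mirror the "Every wumpus smells" example but are otherwise made up for illustration.

    from collections import Counter

    # A parse tree is (symbol, [children]); a leaf is just a string.
    every_wumpus_smells = (
        "S", [("NP", [("Article", ["Every"]), ("Noun", ["wumpus"])]),
              ("VP", [("Verb", ["smells"])])])

    # Probabilities of the rules used by the example parse.
    rule_probs = {
        ("S", ("NP", "VP")): 1.0,
        ("NP", ("Article", "Noun")): 0.5,
        ("VP", ("Verb",)): 0.6,
        ("Article", ("Every",)): 0.05,
        ("Noun", ("wumpus",)): 0.15,
        ("Verb", ("smells",)): 0.1,
    }

    def tree_prob(tree):
        # Multiply the probabilities of all rewrite rules used in the tree.
        if isinstance(tree, str):       # leaf word
            return 1.0
        lhs, children = tree
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        p = rule_probs[(lhs, rhs)]
        for c in children:
            p *= tree_prob(c)
        return p

    print(tree_prob(every_wumpus_smells))   # 1.0*0.5*0.05*0.15*0.6*0.1 = 0.000225

    def estimate_rule_probs(parsed_corpus):
        # P(lhs -> rhs) = #(lhs -> rhs) / #(lhs), counted over parsed sentences.
        rule_counts, lhs_counts = Counter(), Counter()
        def walk(tree):
            if isinstance(tree, str):
                return
            lhs, children = tree
            rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
            rule_counts[(lhs, rhs)] += 1
            lhs_counts[lhs] += 1
            for c in children:
                walk(c)
        for t in parsed_corpus:
            walk(t)
        return {r: n / lhs_counts[r[0]] for r, n in rule_counts.items()}

    print(estimate_rule_probs([every_wumpus_smells]))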

Learning PCFGs
- Could we also learn without a grammar?
- Yes: for instance, assume the grammar is in Chomsky normal form (CNF).
  - Any CFG can be represented in CNF.
  - Only two types of rules: X → Y Z and X → t
- But this is effective only for small grammars.

Information Retrieval
- Information retrieval: the task of finding documents that are relevant to a user.
- Information retrieval components:
  - Document collection
  - Query posed
  - Resulting set of relevant documents
- Examples: web search engines, text classification and clustering

Information Retrieval
- Initial attempts:
  - Parse documents into a knowledge base of logical formulas
  - Parse the query into a logical formula
  - Answer the query by logical inference
- This failed because of ambiguity, unknown context, etc.

Information Retrieval
- Alternative:
  - Build a unigram model for each document D_i
  - Treat the query Q as a bag of words
  - Find the document D_i that maximizes P(Q | D_i)
- It works!

Example
- Query: {Bayes, information, retrieval, model}
- Documents: each chapter of the textbook; build a unigram model for each chapter.
- Computation: P(Q | D_i) = P(Bayes, information, retrieval, model | chapter i); P^(Q | D_i) is the same but with add-one smoothing.
- [Table: per-chapter counts of the query words Bayes, information, retrieval and model, the chapter's total word count N, and the resulting P(Q | D_i) and P^(Q | D_i), for Chapt 1 (Intro), Chapt 13 (Uncertainty), Chapt 15 (Time), Chapt 22 (NLP), and Chapt 23 (the current chapter)]
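A minimal sketch of this unigram-model retrieval scheme (not from the slides): each document gets an add-one-smoothed unigram model, the query is treated as a bag of words, and documents are ranked by P(Q | D_i). The three toy documents are hypothetical stand-ins for textbook chapters.

    from collections import Counter

    def retrieve(query_words, documents):
        # Rank documents by P(Q | D_i) under an add-one-smoothed unigram model.
        vocab = {w for doc in documents for w in doc} | set(query_words)
        k = len(vocab)
        scored = []
        for i, doc in enumerate(documents):
            counts, n = Counter(doc), len(doc)
            p = 1.0
            for w in query_words:
                p *= (counts[w] + 1) / (n + k)   # smoothed unigram probability
            scored.append((p, i))
        return sorted(scored, reverse=True)

    # Hypothetical stand-ins for textbook chapters.
    docs = [
        "bayes rule combines a prior and a likelihood in a probabilistic model".split(),
        "information retrieval finds documents relevant to a query using a language model".split(),
        "robots use sensors and actuators to act in the physical world".split(),
    ]
    query = ["bayes", "information", "retrieval", "model"]
    print(retrieve(query, docs))    # highest P(Q | D_i) first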

Evaluation
- Two measures:
  - Precision: the proportion of documents in the result set that are actually relevant. False positive rate = 1 - precision.
  - Recall: the proportion of all relevant documents that appear in the result set. False negative rate = 1 - recall.

Evaluation
                      Relevant   Not relevant
  In result set           3            1
  Not in result set       2            4
- Precision: 3/(3+1) = 0.75; false positive rate = 1 - precision = 0.25
- Recall: 3/(3+2) = 0.6; false negative rate = 1 - recall = 0.4
(A short code sketch of these computations appears at the end of this page.)

Tradeoff
- There is often a tradeoff between recall and precision.
- Perfect recall: return every document. But precision will be poor.
- Perfect precision: return only documents whose relevance we are certain of, or none at all. But recall will be poor.

F Score
- The F score (or F measure) combines precision and recall.
- Definition: F = 2pr / (p + r)
  - If p = r, then F = p = r
  - If p = 0 or r = 0, then F = 0
  - Otherwise it favours a compromise between the two.
  Precision   Recall   F measure
    0.9        0.2       0.33
    0.5        0.6       0.55
    0.7        0.8       0.75

IR Refinement
- Refinements:
  - Case folding: convert to lower case (e.g. COUCH → couch, Italy → italy)
  - Stemming: truncate words to their stem (e.g. couches → couch, taken → take)
  - Synonyms: e.g. sofa → couch
- These improve recall, but worsen precision.

Statistical NLP Applications
- Many other NLP tasks are shifting toward statistical / hybrid approaches:
  - Segmentation
  - Part-of-speech tagging
  - Parsing
  - Text classification / clustering
  - Text summarization
  - Machine translation
  - Textual entailment
  - Semantic role labelling
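For completeness, a short sketch (not part of the slides) that reproduces the worked evaluation example above: precision, recall, and the F score computed from the result-set counts.

    def precision_recall_f(relevant_retrieved, irrelevant_retrieved, relevant_missed):
        # Precision, recall and F score from result-set counts.
        p = relevant_retrieved / (relevant_retrieved + irrelevant_retrieved)
        r = relevant_retrieved / (relevant_retrieved + relevant_missed)
        f = 2 * p * r / (p + r) if p + r > 0 else 0.0
        return p, r, f

    # Counts from the example table: 3 relevant retrieved, 1 irrelevant retrieved,
    # 2 relevant documents missed.
    print(precision_recall_f(3, 1, 2))   # (0.75, 0.6, ~0.667)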

Next Class
- Robotics
- Russell and Norvig, Ch. 25