Lecture 22: Introduction to Natural Language Processing (NLP)

Outline:
- Traditional NLP
- Statistical approaches
- Statistical approaches used for processing Internet documents
- If we have time: hidden variables

Natural language understanding
- Language is very important for communication!
- Two parts: syntax and semantics
- Syntax is viewed as important for understanding meaning

Grammars
A grammar is a set of rewrite rules, e.g.:
  S    := NP VP
  NP   := noun | pronoun
  noun := intelligence | wumpus | ...
  VP   := verb | verb NP | ...
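
The sketch below shows one way to encode such rewrite rules and expand them into sentences; it is a minimal illustration under the assumption of a tiny hand-written lexicon (the pronouns and verbs are made up), not a parser.

```python
import random

# A toy grammar written as rewrite rules, mirroring the slide's example.
# Each left-hand side maps to a list of alternative right-hand sides.
GRAMMAR = {
    "S":       [["NP", "VP"]],
    "NP":      [["noun"], ["pronoun"]],
    "VP":      [["verb"], ["verb", "NP"]],
    "noun":    [["intelligence"], ["wumpus"]],
    "pronoun": [["it"], ["she"]],         # illustrative lexicon entries
    "verb":    [["exists"], ["sees"]],    # illustrative lexicon entries
}

def generate(symbol="S"):
    """Expand a symbol by choosing rewrite rules at random until only words remain."""
    if symbol not in GRAMMAR:             # terminal word
        return [symbol]
    words = []
    for sym in random.choice(GRAMMAR[symbol]):
        words.extend(generate(sym))
    return words

print(" ".join(generate()))               # e.g. "wumpus sees it"
```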

Parse trees
Given a grammar, a sentence can be represented as a parse tree.

Problems with using grammars
Grammars need to be context-sensitive.
- Anaphora: using pronouns to refer back to entities already introduced in the text.
  E.g. "After Mary proposed to John, they found a preacher and got married. For the honeymoon, they went to Hawaii."
- Indexicality: sentences refer to a situation (place, time, speaker/hearer, etc.).
  E.g. "I am over here."
- Metaphor: non-literal usage of words and phrases, often systematic.
  E.g. "I've tried killing the process but it won't die. Its parent keeps it alive."

Some good tools exist
Stanford NLP parser: http://nlp.stanford.edu/software/corenlp.shtml
Input natural text, output annotated XML, which can be used for further processing:
- Named entity extraction (proper names, countries, amounts, dates, ...)
- Part-of-speech tagging (noun, adverb, adjective, ...)
- Parsing
- Co-reference resolution (finding all words that refer to the same entity)
  E.g. "Albert Einstein invented the theory of relativity. He also played the violin."
It uses state-of-the-art NLP methods and is very easy to use.
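
As a rough sketch of the same kind of pipeline (not the Stanford toolkit itself), the snippet below runs tokenization, part-of-speech tagging and named entity extraction with NLTK; co-reference resolution is not included, and the NLTK data packages named in the comments are assumed to be installed.

```python
import nltk

# One-time data downloads (uncomment on first run):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
# nltk.download("maxent_ne_chunker"); nltk.download("words")

text = ("Albert Einstein invented the theory of relativity. "
        "He also played the violin.")

for sentence in nltk.sent_tokenize(text):
    tokens = nltk.word_tokenize(sentence)   # tokenization
    tagged = nltk.pos_tag(tokens)           # part-of-speech tagging
    tree = nltk.ne_chunk(tagged)            # named entity extraction
    print(tree)
```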

Ambiguity
Examples from Stuart Russell:
- "Squad helps dog bite victim"
- "Helicopter powered by human flies"
- "I ate spaghetti with meatballs / abandon / a fork / a friend"

Statistical language models
- Words are treated as observations
- We typically have a corpus of data
- The model computes the probability of the input being generated from the same source as the training data
- Naive Bayes and n-gram models are tools of this type

Learning for document classification
- Suppose we want to provide a class label y for documents represented as a set of words x
- We can compute P(y) by counting the number of interesting and uninteresting documents we have
- How do we compute P(x|y)? Assuming a vocabulary of about 100,000 words, and not too many documents, this is hopeless! Most possible combinations of words will not appear in the data at all...
- Hence, we need to make some extra assumptions.

Reminder: Naive Bayes assumption
Suppose the features x_i are discrete. Assume the x_i are conditionally independent given y. In other words, assume that:

  P(x_i | y) = P(x_i | y, x_j),  for all i, j

Then, for any input vector x, we have:

  P(x | y) = P(x_1, x_2, ..., x_n | y)
           = P(x_1 | y) P(x_2 | y, x_1) ... P(x_n | y, x_1, ..., x_{n-1})
           = P(x_1 | y) P(x_2 | y) ... P(x_n | y)

For binary features, instead of O(2^n) numbers to describe a model, we only need O(n)!
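
A minimal sketch of this factorization for binary features is shown below; the parameter table theta (with theta[i, c] standing for P(x_i = 1 | y = c)) is an assumed input here, and estimating it is the subject of the next slides.

```python
import numpy as np

def log_p_x_given_y(x, y, theta):
    """log P(x | y) = sum_i log P(x_i | y), using the conditional independence assumption.

    x:     binary feature vector of length n
    y:     class label (0 or 1)
    theta: array of shape (n, 2); theta[i, c] = P(x_i = 1 | y = c)
    """
    p1 = theta[:, y]                                    # P(x_i = 1 | y)
    return np.sum(x * np.log(p1) + (1 - x) * np.log(1 - p1))
```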

Naive Bayes for binary features
The parameters of the model are:

  θ_{i,1} = P(x_i = 1 | y = 1),  θ_{i,0} = P(x_i = 1 | y = 0),  θ_1 = P(y = 1)

We will find the parameters that maximize the log likelihood of the training data! The likelihood in this case is:

  L(θ_1, θ_{i,1}, θ_{i,0}) = Π_{j=1}^m P(x_j, y_j) = Π_{j=1}^m [ P(y_j) Π_{i=1}^n P(x_{j,i} | y_j) ]

First, use the log trick:

  log L(θ_1, θ_{i,1}, θ_{i,0}) = Σ_{j=1}^m [ log P(y_j) + Σ_{i=1}^n log P(x_{j,i} | y_j) ]

Observe that each term in the sum depends only on the values of y_j and x_j that appear in the j-th instance.

Maximum likelihood parameter estimation for Naive Bayes

  log L(θ_1, θ_{i,1}, θ_{i,0}) = Σ_{j=1}^m [ y_j log θ_1 + (1 - y_j) log(1 - θ_1)
                                  + Σ_{i=1}^n y_j (x_{j,i} log θ_{i,1} + (1 - x_{j,i}) log(1 - θ_{i,1}))
                                  + Σ_{i=1}^n (1 - y_j) (x_{j,i} log θ_{i,0} + (1 - x_{j,i}) log(1 - θ_{i,0})) ]

To estimate θ_1, we take the derivative of log L with respect to θ_1 and set it to 0:

  ∂ log L / ∂θ_1 = Σ_{j=1}^m ( y_j / θ_1 + (1 - y_j) / (1 - θ_1) · (-1) ) = 0

Maximum likelihood parameter estimation for Naive Bayes (continued)
By solving for θ_1, we get:

  θ_1 = (1/m) Σ_{j=1}^m y_j = (number of examples of class 1) / (total number of examples)

Using a similar derivation, we get:

  θ_{i,1} = (number of instances for which x_{j,i} = 1 and y_j = 1) / (number of instances for which y_j = 1)

  θ_{i,0} = (number of instances for which x_{j,i} = 1 and y_j = 0) / (number of instances for which y_j = 0)
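
A minimal sketch of these counting estimates, assuming a binary feature matrix X of shape (m, n) and a binary label vector y:

```python
import numpy as np

def fit_naive_bayes_mle(X, y):
    """Maximum likelihood estimates for binary-feature Naive Bayes, by counting."""
    theta_1 = y.mean()                    # (# examples of class 1) / (total # examples)
    theta_i1 = X[y == 1].mean(axis=0)     # P(x_i = 1 | y = 1)
    theta_i0 = X[y == 0].mean(axis=0)     # P(x_i = 1 | y = 0)
    return theta_1, theta_i1, theta_i0
```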

Text classification revisited
- Consider again the text classification example, where the features x_i correspond to words
- Using the approach above, we can compute probabilities for all the words which appear in the document collection
- But what about words that do not appear? They would be assigned zero probability!
- As a result, the probability estimates for documents containing such words would be 0/0 for both classes, and hence no decision can be made

Laplace smoothing
Instead of the maximum likelihood estimate:

  θ_{i,1} = (number of instances for which x_{j,i} = 1 and y_j = 1) / (number of instances for which y_j = 1)

use:

  θ_{i,1} = [(number of instances for which x_{j,i} = 1 and y_j = 1) + 1] / [(number of instances for which y_j = 1) + 2]

Hence, if a word does not appear at all in the documents, it will be assigned prior probability 0.5. If a word appears in a lot of documents, this estimate is only slightly different from the maximum likelihood one. This is an example of a Bayesian prior for Naive Bayes.
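
The MLE sketch above only changes in its counts; a minimal Laplace-smoothed version (add 1 to the numerator and 2 to the denominator, as on the slide) might look like this:

```python
import numpy as np

def fit_naive_bayes_laplace(X, y):
    """Laplace-smoothed estimates for binary-feature Naive Bayes."""
    theta_1 = y.mean()
    theta_i1 = (X[y == 1].sum(axis=0) + 1) / ((y == 1).sum() + 2)   # never exactly 0 or 1
    theta_i0 = (X[y == 0].sum(axis=0) + 1) / ((y == 0).sum() + 2)
    return theta_1, theta_i1, theta_i0
```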

Example: 20 newsgroups
Given 1000 training documents from each group, learn to classify new documents according to which newsgroup they came from:

  comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x,
  alt.atheism, soc.religion.christian, talk.religion.misc, talk.politics.mideast, talk.politics.misc,
  misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey,
  sci.space, sci.crypt, sci.electronics, sci.med, talk.politics.guns

Naive Bayes: 89% classification accuracy, comparable to other state-of-the-art methods.
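
For reference, a rough sketch of this experiment using scikit-learn's copy of the dataset is below; the train/test split and preprocessing differ from the slide's setup, so the exact accuracy will not match the 89% quoted.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

vectorizer = CountVectorizer()                   # bag-of-words counts
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

model = MultinomialNB(alpha=1.0)                 # alpha=1.0 corresponds to Laplace smoothing
model.fit(X_train, train.target)
print(accuracy_score(test.target, model.predict(X_test)))
```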

Computing joint probabilities of word sequences
Suppose you model a sentence as a sequence of words w_1, ..., w_n. How do we compute the probability of the sentence, P(w_1, ..., w_n)?

  P(w_1, ..., w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_2, w_1) ... P(w_n | w_{n-1}, ..., w_1)

These conditional probabilities have to be estimated from data. But data can be sparse!

n-grams
- We make a conditional independence assumption: each word depends only on the few words immediately preceding it, not on anything before
- This is a Markovian assumption!
- 1st-order Markov model: P(w_i | w_{i-1}) (bigram model)
- 2nd-order Markov model: P(w_i | w_{i-1}, w_{i-2}) (trigram model)
- Now we can get a lot more data!
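
A minimal sketch of estimating a bigram model P(w_i | w_{i-1}) by counting, on a tiny made-up corpus (the corpus, sentence markers and whitespace tokenization are illustrative placeholders):

```python
from collections import Counter, defaultdict

corpus = [
    "the wumpus sees the gold",
    "the agent sees the wumpus",
]

bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]     # add sentence boundary markers
    for prev, cur in zip(words, words[1:]):
        bigram_counts[prev][cur] += 1

def p_bigram(cur, prev):
    """MLE estimate: count(prev, cur) / count(prev)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][cur] / total if total else 0.0

print(p_bigram("wumpus", "the"))   # 0.5 on this toy corpus
```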

Application: Speech recognition
- Input: wave sound file
- Output: typed text representing the words
- To disambiguate the next word, one can use n-gram models to predict the most likely next word, based on the past words
- The n-gram model is typically learned from past data
- This idea is at the core of many speech recognizers

NLP tasks related to the Internet
- Information retrieval (IR): given a word query, retrieve documents that are relevant to the query
  The most well understood and studied task
- Information filtering (text categorization): group documents based on topics/categories
  E.g. Yahoo categories for browsing, e-mail filters, news services
- Information extraction: given a text, extract the relevant information into a template. Closest to language understanding
  E.g. house advertisements (get location, price, features), contact information for companies

How can we do information retrieval?
Two basic approaches:
- Exact matching (logical approach)
- Approximate (inexact) matching
The exact-match approaches do not work well at all! Most often, no documents are retrieved, because the query is too restrictive, and it is hard for the user to tell which terms to drop in order to get results.

Basic idea of inexact matching systems
- We are given a collection of documents
- Each document is a collection of words
- The query is also a collection of words
- We want to retrieve the documents which are closest to the query
- The trick is how to get a good distance metric!
Key assumption: if a word occurs very frequently in a document compared to its frequency in the entire collection of documents, then the document is about that word.

Processing documents for IR
1. Assign every new document an ID
2. Break the document into words
3. Eliminate stopwords and do stemming
4. Do term weighting

Details of document processing
- Stopwords: very frequently occurring words that do not have a lot of meaning
  E.g. articles (the, a, these, ...) and prepositions (on, in, ...)
- Stemming (also known as suffix removal) is designed to take care of different conjugations and declensions
  E.g. eliminating the plural -s, the -ing and -ed endings, etc.
  Example: after stemming, win, wins, won and winning will all become WIN
How should we weight the words in a document?
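
A minimal sketch of stopword removal and stemming with NLTK (assuming nltk.download("stopwords") has been run); note that an off-the-shelf suffix-stripping stemmer handles wins/winning but will not map the irregular form won to win.

```python
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stop = set(stopwords.words("english"))       # requires the "stopwords" data package
stemmer = PorterStemmer()

tokens = ["the", "team", "wins", "and", "keeps", "on", "winning"]
processed = [stemmer.stem(t) for t in tokens if t not in stop]
print(processed)                             # ['team', 'win', 'keep', 'win']
```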

Term weighting
Key assumption: if a word occurs very frequently in a document compared to its frequency in the entire collection of documents, then the document is about that word.

Term frequency (TF): either

  TF = (number of times the term occurs in the document) / (total number of terms in the document)

or

  TF = log(number of times the term occurs in the document + 1) / log(total number of terms in the document)

This tells us if terms occur frequently, but does not tell us if they occur unusually frequently.

Inverse document frequency (IDF):

  IDF = log( (number of documents in the collection) / (number of documents in which the term occurs at least once) )
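
A minimal sketch of these formulas on a toy collection of already-tokenized documents (the documents are made up for illustration):

```python
import math

docs = [
    ["destruction", "amazon", "rain", "forest"],
    ["rain", "fall", "amazon", "river"],
    ["forest", "fire", "logging"],
]

def tf(term, doc):
    """Term frequency: occurrences of the term divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency: log(#documents / #documents containing the term)."""
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing) if containing else 0.0

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tfidf("destruction", docs[0], docs))   # higher than tfidf("rain", docs[0], docs)
```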

Processing queries for IR
We have to do the same things to the queries as we do to the documents!
1. Break the query into words
2. Stopword elimination and stemming
3. Retrieve all documents containing any of the query words
4. Rank the documents
To rank the documents, for a simple query, we compute term frequency * inverse document frequency for each query term, then sum them up. More complicated formulas are used if the query contains +/- operators, phrases, etc.
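
Continuing the sketch above (it assumes the tf/idf helpers and the toy docs defined there), ranking by summed TF-IDF over the query terms might look like this:

```python
def score(query_terms, doc, docs):
    """Sum of the TF-IDF weights of the query terms in this document."""
    return sum(tfidf(t, doc, docs) for t in query_terms)

query = ["destruction", "amazon", "rain", "forest"]   # already stopped and stemmed
ranked = sorted(docs, key=lambda d: score(query, d, docs), reverse=True)
for d in ranked:
    print(round(score(query, d, docs), 3), d)
```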

Example
Query: "The destruction of the Amazonian rain forests"
1. Case normalization: the destruction of the amazonian rain forests
2. Stopword removal: destruction amazonian rain forests
3. Stemming: destruction amazon rain forest
4. Then we apply our formula!
Note: certain terms in the query will inherently be more important than others, e.g. amazon vs. rain.

Evaluating IR Systems
Two measures:
- Precision: the ratio of the number of relevant documents retrieved to the total number of documents retrieved
- Recall: the ratio of the number of relevant documents retrieved for a given query to the number of relevant documents for that query in the database
Both precision and recall are between 0 and 1 (closer to 1 is better).
Human judges decide which documents are relevant, but they are subjective and may disagree.
Bad news: usually high precision means low recall and vice versa.
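
A minimal sketch of computing both measures for one query, given made-up sets of retrieved and relevant document IDs:

```python
retrieved = {1, 2, 3, 5, 8}              # documents returned for the query
relevant = {2, 3, 4, 8, 9, 10}           # documents judged relevant for the query

hits = retrieved & relevant              # relevant documents that were retrieved
precision = len(hits) / len(retrieved)   # 3 / 5 = 0.6
recall = len(hits) / len(relevant)       # 3 / 6 = 0.5
print(precision, recall)
```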

Why is statistical NLP good?
- Universal! It can be applied to any collection of documents, in any language, and no matter how it is structured
- In contrast, knowledge-based NLP systems work ONLY for specialized collections
- Very robust to language mistakes (e.g. bad syntax)
- Most of the time, you get at least some relevant documents

Why do we still have research in NLP?
- Statistical NLP is not really language understanding! Are word counts all that language is about?
- Syntactic knowledge could be very helpful sometimes; there are some attempts now to incorporate knowledge into statistical NLP
- Eliminating prepositions means that we cannot really understand the meaning anymore
- One can trick the system by overloading the document with certain terms, even if they do not get displayed on the screen
- If a word has more than one meaning, you get a very varied collection of documents...

AI techniques directly applicable to web text processing
- Learning:
  - Clustering: group documents, detect outliers
  - Naive Bayes: classify a document
  - Neural nets
- Probabilistic reasoning: each word can be considered as evidence; try to infer what the text is about