Artificial Intelligence Programming: Statistical NLP
Chris Brooks, Department of Computer Science, University of San Francisco

Outline
n-grams; applications of n-grams; review of context-free grammars; probabilistic CFGs; information extraction.

Advantages of IR approaches
Recall that IR-based approaches use the bag-of-words model. TF-IDF is used to account for word frequency, and it takes information about common words into account. These approaches can deal with grammatically incorrect sentences, and they give us a degree of correctness, rather than just yes or no.

Disadvantages of IR approaches
No use of structural information, not even co-occurrence of words. Can't deal with synonyms or dereferencing pronouns. Very little semantic analysis.

Advantages of classical NLP
Classical NLP approaches use a parser to generate a parse tree. This can then be used to transform knowledge into a form that can be reasoned with. They identify sentence structure, make semantic interpretation easier, and can handle anaphora, synonyms, etc.

Disadvantages of classical NLP
Doesn't take frequency into account. No way to choose between different parses for a sentence. Can't deal with incorrect grammar. Requires a lexicon. Maybe we can incorporate both statistical information and structure.
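As a concrete reminder of the bag-of-words baseline discussed above, here is a minimal sketch of TF-IDF weighting in Python. The function name and toy documents are illustrative, not from the original slides.

import math
from collections import Counter

def tfidf(docs):
    # docs: a list of token lists; returns one dict of term -> TF-IDF weight per document
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)               # raw term frequency in this document
        weights.append({t: tf[t] / len(doc) * math.log(n / df[t]) for t in tf})
    return weights

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "dog", "barked"]]
for w in tfidf(docs):
    print(w)

Note that a word appearing in every document (here "the") gets a weight of zero, which is how common words are discounted.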

n-grams
The simplest way to add structure to our IR approach is to count the occurrence not only of single tokens, but of sequences of tokens. So far, we've considered words as tokens. A token is sometimes called a gram; an n-gram model considers the probability that a sequence of n tokens occurs in a row. More precisely, it is the probability P(token_i | token_{i-1}, token_{i-2}, ..., token_{i-n+1}).

n-grams (continued)
Our approach in assignment 3 uses 1-grams, or unigrams. We could also choose to count bigrams, or 2-grams. The sentence "Every good boy deserves fudge" contains the bigrams "every good", "good boy", "boy deserves", "deserves fudge". We could continue this approach to 3-grams, 4-grams, or 5-grams. Longer n-grams give us more accurate information about content, since they include phrases rather than single words. What's the downside here?

Sampling theory
We need to be able to estimate the probability of each n-gram occurring. In assignment 3, we do this by collecting a corpus and counting the distribution of words in the corpus. If the corpus is too small, these counts may not be reflective of an n-gram's true frequency. Many n-grams will not appear at all in our corpus. For example, if we have a lexicon of 20,000 words, there are:
20,000^2 = 400 million distinct bigrams
20,000^3 = 8 trillion distinct trigrams
20,000^4 = 1.6 x 10^17 distinct 4-grams

Smoothing
So, when we are estimating n-gram counts from a corpus, there will be many n-grams that we never see. This might occur in your assignment - what if there's a word in your similarity set that's not in the corpus? The simplest thing to do is add-one smoothing: we start each n-gram with a count of 1, rather than zero. Easy, but not very theoretically satisfying.

Linear interpolation smoothing
We can also use estimates of shorter-length n-grams to help out. Assumption: the sequence w1, w2, w3 and the sequence w1, w2 are related. More precisely, we want to know P(w3 | w2, w1). We count all 1-grams, 2-grams, and 3-grams, and we estimate P(w3 | w2, w1) as
c1 * P(w3 | w2, w1) + c2 * P(w3 | w2) + c3 * P(w3)
So where do we get c1, c2, c3? They might be fixed, based on past experience. Or, we could learn them.

Application: segmentation
One application of n-gram models is segmentation: splitting a sequence of characters into tokens, or finding word boundaries. Examples: speech-to-text systems, Chinese and Japanese, genomic data, and documents where some other character is used to represent space. The algorithm for doing this is called Viterbi segmentation. (Like parsing, it's a form of dynamic programming.)
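Before turning to the segmentation algorithm itself, here is a minimal sketch of the estimation ideas above: counting bigrams from a corpus, add-one smoothing, and a linearly interpolated trigram estimate. The toy corpus, the function names, and the fixed interpolation weights are illustrative assumptions, not part of the original slides.

from collections import Counter

def ngrams(tokens, n):
    # all length-n windows of the token sequence
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

corpus = "every good boy deserves fudge every good dog deserves a bone".split()
V = set(corpus)                         # vocabulary

uni = Counter(ngrams(corpus, 1))
bi = Counter(ngrams(corpus, 2))
tri = Counter(ngrams(corpus, 3))

def p_unigram(w):
    return uni[(w,)] / len(corpus)

def p_bigram_addone(w2, w1):
    # add-one smoothing: pretend every possible bigram was seen once
    return (bi[(w1, w2)] + 1) / (uni[(w1,)] + len(V))

def p_trigram_interp(w3, w2, w1, c=(0.6, 0.3, 0.1)):
    # linear interpolation of trigram, bigram, and unigram estimates
    p3 = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    p2 = bi[(w2, w3)] / uni[(w2,)] if uni[(w2,)] else 0.0
    return c[0] * p3 + c[1] * p2 + c[2] * p_unigram(w3)

print(p_bigram_addone("boy", "good"))
print(p_trigram_interp("boy", "good", "every"))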

Viterbi segmentation
Input: a string S and a 1-gram distribution P.

n = length(S)
words = array[n+1]
best = array[n+1], initialized to 0.0
best[0] = 1.0
for i = 1 to n
    for j = 0 to i - 1
        word = S[j:i]            # the substring from j to i
        w = length(word)         # so i - w = j
        if P[word] * best[i - w] >= best[i]
            best[i] = P[word] * best[i - w]
            words[i] = word
# now recover the best sequence of words
result = []
i = n
while i > 0
    push words[i] onto result
    i = i - length(words[i])
return result, best[n]

Example
Input: "cattlefish", with P(cat) = 0.1, P(cattle) = 0.3, P(fish) = 0.1, and all other 1-grams 0.001.

best[0] = 1.0
i=1, j=0: word = "c", w=1.      0.001 * 1.0 >= 0.0, so best[1] = 0.001, words[1] = "c"
i=2, j=0: word = "ca", w=2.     0.001 * 1.0 >= 0.0, so best[2] = 0.001, words[2] = "ca"
i=2, j=1: word = "a", w=1.      0.001 * 0.001 < 0.001
i=3, j=0: word = "cat", w=3.    0.1 * 1.0 > 0.0, so best[3] = 0.1, words[3] = "cat"
i=3, j=1: word = "at", w=2.     0.001 * 0.001 < 0.1
i=3, j=2: word = "t", w=1.      0.001 * 0.001 < 0.1
i=4, j=0: word = "catt", w=4.   0.001 * 1.0 > 0.0, so best[4] = 0.001, words[4] = "catt"
i=4, j=1: word = "att", w=3.    0.001 * 0.001 < 0.001
i=4, j=2: word = "tt", w=2.     0.001 * 0.001 < 0.001
i=4, j=3: word = "t", w=1.      0.001 * 0.1 < 0.001
i=5, j=0: word = "cattl", w=5.  0.001 * 1.0 > 0.0, so best[5] = 0.001, words[5] = "cattl"
i=5, j=1: word = "attl", w=4.   etc.
i=5, j=2: word = "ttl", w=3.
i=5, j=3: word = "tl", w=2.     0.001 * 0.1 < 0.001
i=5, j=4: word = "l", w=1.
i=6, j=0: word = "cattle", w=6. 0.3 * 1.0 > 0.0, so best[6] = 0.3, words[6] = "cattle"
...and so on up to i = 10.

At the end of the forward pass:
best:  [1.0, 0.001, 0.001, 0.1, 0.001, 0.001, 0.3, 0.001, 0.001, 0.001, 0.03]
words: [-, "c", "ca", "cat", "catt", "cattl", "cattle", "cattlef", "cattlefi", "cattlefis", "fish"]
i = 10: push "fish" onto result; i = i - 4 = 6
i = 6:  push "cattle" onto result; i = i - 6 = 0
Result: "cattle fish", with probability best[10] = 0.03.

What's going on here?
The Viterbi algorithm is searching through the space of all combinations of substrings. States with high probability mass are pursued. The best array is used to prevent the algorithm from repeatedly expanding portions of the search space. This is an example of dynamic programming (like chart parsing).

Application: language detection
n-grams have also been successfully used to detect the language a document is in. Approach: consider letters as tokens, rather than words. Gather a corpus in a variety of different languages (Wikipedia works well here). Process the documents, and count all two-grams. Estimate the probability of each two-gram in language L as its count divided by the total number of two-grams; call this P_L. Assumption: different languages have characteristic two-grams.

Application: language detection (continued)
To classify a document by language: find all two-grams in the document, and call this set T = {t_1, ..., t_n}. For each language L, the likelihood that the document is of language L is P_L(t_1) * P_L(t_2) * ... * P_L(t_n). The language with the highest likelihood is the most probable language. (This is a form of Bayesian inference - we'll spend more time on this later in the semester.)

Going further
n-grams and segmentation provide some interesting ideas: we can combine structure with statistical knowledge; probabilities can be used to help guide search; probabilities can help a parser choose between different outcomes. But no structure is used apart from collocation. Maybe we can apply these ideas to grammars.

Reminder: CFGs
Recall context-free grammars from the last lecture: a single non-terminal on the left, anything on the right.
S -> NP VP
VP -> Verb | Verb PP
Verb -> run | sleep
We can construct sentences that have more than one legal parse: "Squad helps dog bite victim". CFGs don't give us any information about which parse to select.
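To see this ambiguity concretely, here is a small sketch using NLTK's chart parser and the "astronomers" grammar from the PCFG slide that follows, with the probabilities stripped off. The use of NLTK here is an assumption for illustration, not something the original slides do at this point.

import nltk

# The grammar from the PCFG example below, without probabilities.
grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
VP -> V NP | VP PP
NP -> NP PP | 'astronomers' | 'stars' | 'saw' | 'ears' | 'telescopes'
P -> 'with'
V -> 'saw'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("astronomers saw stars with ears".split()):
    print(tree)   # prints two distinct parse trees; a plain CFG gives no way to prefer one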

Probabilistic CFGs
A probabilistic CFG is just a regular CFG with probabilities attached to the right-hand sides of rules. For each non-terminal, the probabilities have to sum to 1; they indicate how often that non-terminal derives that right-hand side.
S -> NP VP (1.0)
PP -> P NP (1.0)
VP -> V NP (0.7)
VP -> VP PP (0.3)
P -> with (1.0)
V -> saw (1.0)
NP -> NP PP (0.4)
NP -> astronomers (0.1)
NP -> stars (0.18)
NP -> saw (0.04)
NP -> ears (0.18)
NP -> telescopes (0.1)

Disambiguation
The probability of a parse tree being correct is just the product of the probabilities of the rules derived in the tree. This lets us compare two parses and say which is more likely. For "astronomers saw stars with ears":

Parse 1, attaching the PP inside the object NP:
(S (NP astronomers) (VP (V saw) (NP (NP stars) (PP (P with) (NP ears)))))
P1 = 1.0 * 0.1 * 0.7 * 1.0 * 0.4 * 0.18 * 1.0 * 1.0 * 0.18 = 0.0009072

Parse 2, attaching the PP to the VP:
(S (NP astronomers) (VP (VP (V saw) (NP stars)) (PP (P with) (NP ears))))
P2 = 1.0 * 0.1 * 0.3 * 0.7 * 1.0 * 1.0 * 0.18 * 1.0 * 0.18 ≈ 0.00068

Faster parsing
We can also use probabilities to speed up parsing. Recall that both top-down and chart parsing proceed in a primarily depth-first fashion: they choose a rule to apply, and based on its right-hand side, they choose another rule. Probabilities can be used to better select which rule to apply, or which branch of the search tree to follow. This is a form of best-first search.

Information extraction
An increasingly common application of parsing is information extraction. This is the process of creating structured information (database or knowledge base entries) from unstructured text. Examples: suppose we want to build a price comparison agent that can visit sites on the web and find the best deals on flatscreen TVs. Or suppose we want to build a database about video games. We might do this by hand, or we could write a program that could parse Wikipedia pages and insert knowledge such as madeby(Blizzard, WorldOfWarcraft) into a knowledge base.

Extracting specific information
A program that fetches HTML pages and extracts specific information is called a scraper. Simple scrapers can be built with regular expressions. For example, prices typically have a dollar sign, some digits, a period, and two digits: \$[0-9]+\.[0-9]{2}. This approach will work, but it has several limitations: it can only handle simple extractions, and it is brittle and page-specific.
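A minimal sketch of such a scraper pattern in Python follows. The sample text and variable names are illustrative; a real scraper would also need to fetch and clean the HTML.

import re

price_re = re.compile(r"\$[0-9]+\.[0-9]{2}")   # a dollar sign, digits, a period, two digits

html = "Refurbished 42-inch flatscreen: <b>$299.99</b> (was $399.99)"
print(price_re.findall(html))    # ['$299.99', '$399.99']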

Extracting relational information
Suppose we want to build a database that relates organizations to cities, e.g. In(USF, San Francisco). We want to be able to extract this information from a sentence like: "AI is the best class ever!" said Chris Brooks, a professor at USF, a university in San Francisco. We subdivide this problem into two pieces: named entity extraction and relation extraction.

Named entity extraction
Named entity extraction is the process of figuring out that "Chris Brooks", "USF" and "San Francisco" are proper nouns. We could just have a big list of all people in an organization, or all cities, which might work. Or, we could use a program called a chunker, which is a probabilistic parser. It only parses to a shallow level (about two levels deep) and identifies chunks of sentences, rather than tagging every word. It will often try to identify the type of entity, such as organization or person, typically using probabilities extracted from training corpora.

Relation extraction
Once we have named entities, we need to figure out how they are related. We can write augmented regular expressions that use chunk types for this: <ORG>(.+)in(.+)<CITY> will match <organization> blah blah in blah <city>. There will be false positives; getting this highly accurate takes some care. (A small sketch of this pattern appears after the summary.) In assignment 4, you'll get to experiment with information extraction using NLTK.

Summary
We can combine the best of probabilistic and classical NLP approaches. n-grams take advantage of co-occurrence information (segmentation, language detection). CFGs can be augmented with probabilities, which speeds up parsing and deals with ambiguity. Information extraction is an increasingly common application. Still no discussion of semantics; just increasingly complex syntax processing.
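As a closing illustration, here is a minimal sketch of the relation-extraction pattern mentioned above. It assumes a chunker has already wrapped the entities it found in <ORG>, <CITY>, and <PER> tags; the tag format and the exact regular expression are illustrative assumptions, not from the original slides.

import re

# Assume a chunker has already marked up the entities it found.
chunked = ('"AI is the best class ever!" said <PER>Chris Brooks</PER>, a professor at '
           '<ORG>USF</ORG>, a university in <CITY>San Francisco</CITY>.')

# <ORG> ... in ... <CITY>  ->  In(org, city)
pattern = re.compile(r"<ORG>(.+?)</ORG>.*?\bin\b.*?<CITY>(.+?)</CITY>")
m = pattern.search(chunked)
if m:
    org, city = m.groups()
    print(f"In({org}, {city})")     # In(USF, San Francisco)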