POS tagging. Intro to NLP - ETHZ - 11/03/2013


Summary Parts of speech Tagsets Part of speech tagging HMM Tagging: Most likely tag sequence Probability of an observation Parameter estimation Evaluation

POS ambiguity "Squad helps dog bite victim" bite -> verb? bite -> noun?

Parts of Speech (PoS) Traditional parts of speech: Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc. Called: parts-of-speech, lexical categories, word classes, morphological classes, lexical tags...

Examples N (noun): car, squad, dog, bite, victim V (verb): help, bite ADJ (adjective): purple, tall ADV (adverb): unfortunately, slowly P (preposition): of, by, to PRO (pronoun): I, me, mine DET (determiner): the, a, that, those

Open and closed classes 1. Closed class: small stable set a. Auxiliaries: may, can, will, been,... b. Prepositions: of, in, by,... c. Pronouns: I, you, she, mine, his, them,... d. Usually function words, short, grammar role 2. Open class: a. new ones are created all the time ("to google/tweet", "e-quaintance", "captcha", "cloud computing", "netbook", "webinar", "widget") b. English has 4: Nouns, Verbs, Adjectives, Adverbs c. Many languages have these 4, not all!

Open class words 1. Nouns: a. Proper nouns: Zurich, IBM, Albert Einstein, The Godfather,... Capitalized in many languages. b. Common nouns: the rest, also capitalized in German, mass/count nouns (goat/goats, snow/*snows) 2. Verbs: a. Morphological affixes in English: eat/eats/eaten 3. Adverbs: tend to modify things: a. John walked home extremely slowly yesterday b. Directional/locative adverbs (here, home, downhill) c. Degree adverbs (extremely, very, somewhat) d. Manner adverbs (slowly, slinkily, delicately) 4. Adjectives: qualify nouns and noun phrases

Closed class words 1. prepositions: on, under, over,... 2. particles: up, down, on, off,... 3. determiners: a, an, the,... 4. pronouns: she, who, I,... 5. conjunctions: and, but, or,... 6. auxiliary verbs: can, may, should,... 7. numerals: one, two, three, third,...

Prepositions with corpus frequencies

Conjunctions

Pronouns

Auxiliaries

Applications A useful pre-processing step in many tasks Syntactic parsing: important source of information for syntactic analysis Machine translation Information retrieval: stemming, filtering Named entity recognition Summarization?

Applications Speech synthesis, for correct pronunciation of "ambiguous" words: lead /lid/ (guide) vs. /led/ (chemical) insult vs. INsult object vs. OBject overflow vs. OVERflow content vs. CONtent

Summarization Idea: Filter out sentences starting with certain PoS tags Use PoS statistics from gold standard titles (might need cross-validation)

Summarization Idea: Filter out sentences starting with certain PoS tags: Title1: "Apple introduced Siri, an intelligent personal assistant to which you can ask questions" Title2: "Especially now that a popular new feature from Apple is bound to make other phone users envious: voice control with Siri" Title3: "But Siri, Apple's personal assistant application on the iPhone 4S, doesn't..."
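The filtering idea above can be sketched as follows. This is a minimal illustration: the first-token tags and the "bad start tags" policy are hypothetical stand-ins for a real tagger and for statistics gathered from gold-standard titles.

```python
# Hypothetical tag of each title's first token (a real system would run a tagger).
FIRST_TAG = {"Apple": "NNP", "Especially": "RB", "But": "CC"}

# Hypothetical policy from gold-standard titles: candidate titles rarely
# start with an adverb or a coordinating conjunction.
BAD_START_TAGS = {"RB", "CC"}

def keep_title(sentence):
    first = sentence.split()[0]
    return FIRST_TAG.get(first) not in BAD_START_TAGS

titles = ["Apple introduced Siri ...", "Especially now that ...", "But Siri ..."]
kept = [t for t in titles if keep_title(t)]
# keeps only the title starting with a proper noun
```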


PoS tagging The process of assigning a part-of-speech tag (label) to each word in a text. Pre-processing: tokenization

Word:  Squad  helps  dog  bite  victim
Tag:   N      V      N    N     N

Choosing a tagset 1. There are many parts of speech, potential distinctions we can draw 2. For POS tagging, we need to choose a standard set of tags to work with 3. Coarse tagsets: N, V, Adj, Adv. a. A universal PoS tagset? http://en.wikipedia.org/wiki/Part-of-speech_tagging 4. More commonly used set is finer grained, the Penn TreeBank tagset, 45 tags

PTB tagset

Examples* 1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NN 2. Does/VBZ this/DT flight/NN serve/VB dinner/NNS 3. I/PRP have/VB a/DT friend/NN living/VBG in/IN Denver/NNP 4. Can/VBP you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS

Examples* 1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NNP 2. Does/VBZ this/DT flight/NN serve/VB dinner/NNS 3. I/PRP have/VB a/DT friend/NN living/VBG in/IN Denver/NNP 4. Can/VBP you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS

Examples* 1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NNP 2. Does/VBZ this/DT flight/NN serve/VB dinner/NN 3. I/PRP have/VB a/DT friend/NN living/VBG in/IN Denver/NNP 4. Can/VBP you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS

Examples* 1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NNP 2. Does/VBZ this/DT flight/NN serve/VB dinner/NN 3. I/PRP have/VBP a/DT friend/NN living/VBG in/IN Denver/NNP 4. Can/VBP you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS

Examples* 1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NNP 2. Does/VBZ this/DT flight/NN serve/VB dinner/NN 3. I/PRP have/VBP a/DT friend/NN living/VBG in/IN Denver/NNP 4. Can/MD you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS

Complexities Book/VB that/DT flight/NN ./. There/EX are/VBP 70/CD children/NNS there/RB ./. Mrs./NNP Shaefer/NNP never/RB got/VBD around/RP to/TO joining/VBG ./. All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT corner/NN ./. Unresolvable ambiguity: The Duchess was entertaining last night.

Words      PoS WSJ   PoS Universal
The        DT        DET
oboist     NN        NOUN
Heinz      NNP       NOUN
Holliger   NNP       NOUN
has        VBZ       VERB
taken      VBN       VERB
a          DT        DET
hard       JJ        ADJ
line       NN        NOUN
about      IN        ADP
the        DT        DET
problems   NNS       NOUN
...
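Collapsing fine-grained WSJ/PTB tags into coarse universal tags is a simple table lookup; a minimal sketch covering only the tags shown above:

```python
# Partial PTB -> universal mapping, restricted to the tags in the example
# sentence above (a full mapping covers all 45 PTB tags).
PTB_TO_UNIVERSAL = {
    "DT": "DET", "NN": "NOUN", "NNP": "NOUN", "NNS": "NOUN",
    "VBZ": "VERB", "VBN": "VERB", "JJ": "ADJ", "IN": "ADP",
}

tagged = [("The", "DT"), ("oboist", "NN"), ("has", "VBZ"), ("taken", "VBN")]
coarse = [(w, PTB_TO_UNIVERSAL[t]) for w, t in tagged]
# [('The', 'DET'), ('oboist', 'NOUN'), ('has', 'VERB'), ('taken', 'VERB')]
```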

POS Tagging Words often have more than one POS: back The back door = JJ On my back = NN Win the voters back = RB Promised to back the bill = VB The POS tagging problem is to determine the POS tag for a particular instance of a word.

Word type tag ambiguity

Methods 1. Rule-based a. Start with a dictionary b. Assign all possible tags c. Write rules by hand to remove tags in context 2. Stochastic a. Supervised/Unsupervised b. Generative/discriminative c. independent/structured output d. HMMs

Rule-based tagging 1. Start with a dictionary:

she       PRP
promised  VBN, VBD
to        TO
back      VB, JJ, RB, NN
the       DT
bill      NN, VB

Rule-based tagging 2. Assign all possible tags:

she       PRP
promised  VBN, VBD
to        TO
back      VB, JJ, RB, NN
the       DT
bill      NN, VB

Rule-based tagging 3. Introduce rules to reduce ambiguity:

she       PRP
promised  VBD
to        TO
back      VB, JJ, RB, NN
the       DT
bill      NN, VB

Rule: "<start> PRP {VBN, VBD}" -> "<start> PRP VBD"

Statistical models for POS tagging 1. A classic: one of the first successful applications of statistical methods in NLP 2. Extensively studied with all possible approaches (sequence models benchmark) 3. Simple to get started on: data, eval, literature 4. An introduction to more complex segmentation and labelling tasks: NER, shallow parsing, global optimization 5. An introduction to HMMs, used in many variants in POS tagging and related tasks.

Supervision and resources 1. Supervised case: data with words manually annotated with POS tags 2. Partially supervised: annotated data + unannotated data 3. Unsupervised: only raw text available 4. Resources: dictionaries with words' possible tags 5. Start with the supervised task

HMMs HMM = (Q, O, A, B) 1. States: Q = q_1..q_N [the part-of-speech tags] 2. Observation symbols: O = o_1..o_V [words] 3. Transitions: a. A = {a_ij}; a_ij = P(t_s = q_j | t_{s-1} = q_i) b. t_s | t_{s-1} = q_i ~ Multi(a_i) c. Special vector of initial/final probabilities 4. Emissions: a. B = {b_ik}; b_ik = P(w_s = o_k | t_s = q_i) b. w_s | t_s = q_i ~ Multi(b_i)
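A minimal sketch of (Q, O, A, B) as plain dictionaries, using the toy transition and emission numbers from the worked "Board backs plan" example later in the deck:

```python
Q = ["N", "V"]                    # states: the PoS tags
O = ["board", "backs", "plan"]    # observation symbols: words

# A[q_i][q_j] = P(t_s = q_j | t_{s-1} = q_i), with special START/END states
A = {
    "START": {"N": 0.6, "V": 0.4},
    "V": {"N": 0.8, "V": 0.2, "END": 0.3},
    "N": {"N": 0.3, "V": 0.7, "END": 0.7},
}
# B[q_i][o_k] = P(w_s = o_k | t_s = q_i)
B = {
    "V": {"board": 0.3, "backs": 0.3, "plan": 0.4},
    "N": {"board": 0.4, "backs": 0.2, "plan": 0.4},
}
```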

Markov Chain Interpretation Tagging process as a (hidden) Markov process Independence assumptions 1. Limited horizon 2. Time-invariant 3. Observation depends only on the state

Complete data likelihood The joint probability of a sequence of words and tags, given a model: P(w_1:N, t_1:N | theta) = prod_{s=1..N} P(t_s | t_{s-1}) P(w_s | t_s) Generative process: 1. generate a tag sequence 2. emit the words for each tag
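The generative process can be sketched directly: walk the tag sequence, multiplying a transition and an emission probability at each step. The toy parameters are the ones from the worked example later in the deck.

```python
A = {"START": {"N": 0.6, "V": 0.4},
     "V": {"N": 0.8, "V": 0.2, "END": 0.3},
     "N": {"N": 0.3, "V": 0.7, "END": 0.7}}
B = {"V": {"board": 0.3, "backs": 0.3, "plan": 0.4},
     "N": {"board": 0.4, "backs": 0.2, "plan": 0.4}}

def joint_prob(words, tags):
    """P(w_1:N, t_1:N): product of transition and emission probabilities."""
    p, prev = 1.0, "START"
    for w, t in zip(words, tags):
        p *= A[prev][t] * B[t][w]   # generate the tag, then emit the word
        prev = t
    return p * A[prev]["END"]       # transition into the final state

p = joint_prob(["board", "backs", "plan"], ["N", "V", "N"])
# 0.6*0.4 * 0.7*0.3 * 0.8*0.4 * 0.7 ≈ 0.0113
```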

Inference in HMMs Three fundamental problems: 1. Given an observation (e.g., a sentence) find the most likely sequence of states (e.g., PoS tags) 2. Given an observation, compute its probability 3. Given a dataset of observation sequences, estimate the model's parameters: theta = (A, B)

HMMs and FSA

HMMs and FSA also Bayes nets, directed graphical models, etc.

Other applications of HMMs NLP: named entity recognition, shallow parsing, word segmentation, optical character recognition Speech recognition Computer vision: image segmentation Biology: protein structure prediction Economics, climatology, robotics...

POS as sequence classification Observation: a sequence of N words w_1:N Response: a sequence of N tags t_1:N Task: find the predicted t'_1:N such that: t'_1:N = argmax_{t_1:N} P(t_1:N | w_1:N) The best possible tagging for the sequence.

Bayes rule reformulation argmax_{t_1:N} P(t_1:N | w_1:N) = argmax_{t_1:N} P(w_1:N | t_1:N) P(t_1:N) / P(w_1:N) = argmax_{t_1:N} P(w_1:N | t_1:N) P(t_1:N)

HMM POS tagging How can we find t'_1:N? Enumeration of all possible sequences?

HMM POS tagging How can we find t'_1:N? Enumeration of all possible sequences? O(|Tagset|^N)! Dynamic programming: Viterbi algorithm

Viterbi algorithm

Example: model

A =         to N   to V   to END
from V      0.8    0.2    0.3
from N      0.3    0.7    0.7
from START  0.6    0.4

B =   board  backs  plan
V     0.3    0.3    0.4
N     0.4    0.2    0.4

Example: observation Sentence: "Board backs plan" Find the most likely tag sequence d_s(t) = probability of most likely path ending at state s at time t

Viterbi: forward pass (trellis of d_s(t) values)

Time   1 (board)   2 (backs)   3 (plan)
V      d=.12       d=.050      d=.005
N      d=.24       d=.019      d=.016
END                            d=.011

Viterbi: backtrack Follow the back-pointers from END (d=.011): plan/N <- backs/V <- board/N

Viterbi: output board/N backs/V plan/N
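The decoding traced in the trellis can be sketched as follows, using the toy A and B from the model slide; the dictionary layout and state names are implementation choices, not part of the lecture:

```python
A = {"START": {"N": 0.6, "V": 0.4},
     "V": {"N": 0.8, "V": 0.2, "END": 0.3},
     "N": {"N": 0.3, "V": 0.7, "END": 0.7}}
B = {"V": {"board": 0.3, "backs": 0.3, "plan": 0.4},
     "N": {"board": 0.4, "backs": 0.2, "plan": 0.4}}
STATES = ["N", "V"]

def viterbi(words):
    # d[s] = probability of the best path ending in state s at the current time
    d = {s: A["START"][s] * B[s][words[0]] for s in STATES}
    bps = []                               # back-pointers, one dict per step
    for w in words[1:]:
        bp = {s: max(STATES, key=lambda r: d[r] * A[r][s]) for s in STATES}
        d = {s: d[bp[s]] * A[bp[s]][s] * B[s][w] for s in STATES}
        bps.append(bp)
    # transition into END, then follow the back-pointers
    last = max(STATES, key=lambda s: d[s] * A[s]["END"])
    best_p = d[last] * A[last]["END"]
    path = [last]
    for bp in reversed(bps):
        path.append(bp[path[-1]])
    return path[::-1], best_p

path, p = viterbi(["board", "backs", "plan"])
# path == ['N', 'V', 'N'], p ≈ 0.011, matching the trellis
```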

Observation probability Given an HMM theta = (A, B) and an observation sequence w_1:N, compute P(w_1:N | theta) Applications: language modeling Sum the complete data likelihood over all possible tag sequences: P(w_1:N | theta) = sum_{t_1:N} P(w_1:N, t_1:N | theta)

Forward algorithm Dynamic programming: each state of the trellis stores a value alpha_i(s) = probability of being in state s having observed w_1:i Sum over all paths up to i-1 leading to s Init: alpha_1(s) = a_{START,s} b_s(w_1)

Forward algorithm Recurrence: alpha_i(s) = b_s(w_i) sum_{s'} alpha_{i-1}(s') a_{s',s} Termination: P(w_1:N | theta) = sum_s alpha_N(s) a_{s,END}

Forward computation (trellis of alpha values)

Time   1 (board)   2 (backs)   3 (plan)
V      a=.12       a=.058      a=.014
N      a=.24       a=.034      a=.022
END                            a=.020
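The forward pass is the same recursion as Viterbi with max replaced by sum; a sketch with the same toy A and B:

```python
A = {"START": {"N": 0.6, "V": 0.4},
     "V": {"N": 0.8, "V": 0.2, "END": 0.3},
     "N": {"N": 0.3, "V": 0.7, "END": 0.7}}
B = {"V": {"board": 0.3, "backs": 0.3, "plan": 0.4},
     "N": {"board": 0.4, "backs": 0.2, "plan": 0.4}}
STATES = ["N", "V"]

def forward(words):
    """P(w_1:N | theta): sum over all tag sequences via the trellis."""
    alpha = {s: A["START"][s] * B[s][words[0]] for s in STATES}
    for w in words[1:]:
        # alpha_i(s) = b_s(w_i) * sum_{s'} alpha_{i-1}(s') * a_{s',s}
        alpha = {s: B[s][w] * sum(alpha[r] * A[r][s] for r in STATES)
                 for s in STATES}
    return sum(alpha[s] * A[s]["END"] for s in STATES)

p = forward(["board", "backs", "plan"])
# ≈ 0.020, matching the trellis values above
```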

Parameter estimation Maximum likelihood estimates (MLE) on data: 1. Transition probabilities: a_ij = Count(q_i, q_j) / Count(q_i) 2. Emission probabilities: b_ik = Count(q_i, o_k) / Count(q_i)
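MLE is just counting on the tagged data; a minimal sketch on a hypothetical two-sentence corpus (no smoothing):

```python
from collections import Counter

# Tiny hand-tagged corpus (hypothetical): lists of (word, tag) pairs.
corpus = [[("board", "N"), ("backs", "V"), ("plan", "N")],
          [("plan", "N"), ("backs", "N")]]

trans, emit, tag_count = Counter(), Counter(), Counter()
for sent in corpus:
    prev = "START"
    for w, t in sent:
        trans[(prev, t)] += 1   # tag-bigram counts
        emit[(t, w)] += 1       # (tag, word) counts
        tag_count[t] += 1
        prev = t

def a(q_i, q_j):
    """MLE of P(t_s = q_j | t_{s-1} = q_i)."""
    total = sum(c for (p, _), c in trans.items() if p == q_i)
    return trans[(q_i, q_j)] / total

def b(q, w):
    """MLE of P(w_s = w | t_s = q)."""
    return emit[(q, w)] / tag_count[q]
```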

Implementation details 1. Start/End states 2. Log space/rescaling 3. Vocabularies: model pruning 4. Higher order models: a. states representation b. Estimation and sparsity: deleted interpolation

Evaluation So once you have your POS tagger running, how do you evaluate it? 1. Overall error rate with respect to a gold-standard test set: a. ER = # words incorrectly tagged / # words tagged 2. Error rates on particular tags (and pairs) 3. Error rates on particular words (especially unknown words)

Evaluation The result is compared with a manually annotated gold standard. Typically accuracy > 97% on the WSJ PTB. This may be compared with the result for a baseline tagger (one that uses no context). Baselines (most frequent tag) can achieve up to 90% accuracy. Important: 100% is impossible even for human annotators.
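The most-frequent-tag baseline and the error rate can be sketched together; the tag table and gold sentence are hypothetical toy data, reusing the "back" ambiguity from the earlier slide:

```python
# Hypothetical most-frequent-tag table, as would be counted from training data.
MOST_FREQUENT_TAG = {"win": "VB", "the": "DT", "voters": "NNS", "back": "NN"}

def baseline_tag(words):
    # Context-free baseline: each word gets its most frequent tag;
    # unknown words default to NN.
    return [MOST_FREQUENT_TAG.get(w, "NN") for w in words]

words = ["win", "the", "voters", "back"]
gold = ["VB", "DT", "NNS", "RB"]          # "back" is a particle/adverb here
pred = baseline_tag(words)
errors = sum(p != g for p, g in zip(pred, gold))
error_rate = errors / len(gold)
# the baseline misses "back" (NN instead of RB): error rate 0.25
```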

Summary Parts of speech Tagsets Part of speech tagging HMM Tagging: Most likely tag sequence (decoding) Probability of an observation (word sequence) Parameter estimation (supervised) Evaluation

Next class Unsupervised POS tagging models (HMMs) Parameter estimation: forward-backward algorithm Discriminative sequence models: MaxEnt, CRF, Perceptron, SVM, etc. Read J&M 5-6 Pre-process and POS tag the data: report problems & baselines