Part Of Speech (POS) Tagging. Based on Foundations of Statistical NLP by C. Manning & H. Schütze, ch. 10, MIT Press, 2002


0. Part Of Speech (POS) Tagging. Based on Foundations of Statistical NLP by C. Manning & H. Schütze, ch. 10, MIT Press, 2002

1. POS Tagging: Overview

Task: labeling (tagging) each word in a sentence with the appropriate POS (morphological category).

Applications: partial parsing, chunking, lexical acquisition, information retrieval (IR), information extraction (IE), question answering (QA).

Approaches: Hidden Markov Models (HMM), Transformation-Based Learning (TBL); others: neural networks, decision trees, Bayesian learning, maximum entropy, etc.

Accuracy achieved: 90%–98%.

2. Sample POS Tags (from the Brown/Penn corpora)

AT      article
BEZ     is
IN      preposition
JJ      adjective
JJR     adjective: comparative
MD      modal
NN      noun: singular or mass
NNP     noun: singular, proper
NNS     noun: plural
PERIOD  . : ? !
PN      personal pronoun
RB      adverb
RBR     adverb: comparative
TO      to
VB      verb: base form
VBD     verb: past tense
VBG     verb: gerund, present participle
VBN     verb: past participle
VBP     verb: non-3rd person singular present
VBZ     verb: 3rd person singular present
WDT     wh-determiner (what, which)

3. An Example

The representative put chairs on the table.
AT  NN  VBD  NNS  IN  AT  NN
AT  JJ  NN   VBZ  IN  AT  NN   (here put = a 'put' option, an option to sell; chairs = 'leads', as in chairs a meeting)

Tagging requires (limited) syntactic disambiguation. But there are multiple POS for many words, and English has productive rules like noun → verb (e.g., flour the pan, bag the groceries). So...

4. The First Approaches to POS Tagging

[Greene & Rubin, 1971]: deterministic rule-based tagger; 77% of words correctly tagged, which is not enough; it made the problem look hard.

[Charniak, 1993]: statistical, 'dumb' tagger, based on the Brown corpus; 90% accuracy, now taken as the baseline.

5. 2. POS Tagging Using Markov Models

Assumptions:

Limited horizon: $P(t_{i+1} \mid t_{1,i}) = P(t_{i+1} \mid t_i)$ (first-order Markov model)

Time invariance: $P(X_{k+1} = t^j \mid X_k = t^i)$ does not depend on $k$

Words are independent of each other: $P(w_{1,n} \mid t_{1,n}) = \prod_{i=1}^{n} P(w_i \mid t_{1,n})$

A word's identity depends only on its tag: $P(w_i \mid t_{1,n}) = P(w_i \mid t_i)$

6. Determining Optimal Tag Sequences: The Viterbi Algorithm

$$\operatorname*{argmax}_{t_{1..n}} P(t_{1..n} \mid w_{1..n}) = \operatorname*{argmax}_{t_{1..n}} \frac{P(w_{1..n} \mid t_{1..n})\, P(t_{1..n})}{P(w_{1..n})} = \operatorname*{argmax}_{t_{1..n}} P(w_{1..n} \mid t_{1..n})\, P(t_{1..n})$$

and, using the previous assumptions,

$$= \operatorname*{argmax}_{t_{1..n}} \prod_{i=1}^{n} P(w_i \mid t_i) \prod_{i=1}^{n} P(t_i \mid t_{i-1})$$

2.1 Supervised POS Tagging

Using tagged training data, the MLE estimates are:

$$P(w \mid t) = \frac{C(w,t)}{C(t)}, \qquad P(t'' \mid t') = \frac{C(t',t'')}{C(t')}$$

A toy implementation sketch follows.
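
The following is a minimal Python sketch of the two pieces above: MLE estimation from a tagged corpus and the Viterbi recursion. It is a toy, not the book's pseudocode: no log-probabilities, no smoothing, and unseen (word, tag) pairs are simply skipped (slides 8 and 9 treat unknown words properly); all names are illustrative.

```python
from collections import defaultdict

def mle_estimates(tagged_sentences):
    """MLE: P(w|t) = C(w,t)/C(t) and P(t''|t') = C(t',t'')/C(t')."""
    emit_c, trans_c, tag_c = defaultdict(int), defaultdict(int), defaultdict(int)
    for sent in tagged_sentences:            # sent = [(word, tag), ...]
        prev = "<s>"
        tag_c[prev] += 1
        for word, tag in sent:
            emit_c[(word, tag)] += 1
            trans_c[(prev, tag)] += 1
            tag_c[tag] += 1
            prev = tag
    p_emit = {wt: c / tag_c[wt[1]] for wt, c in emit_c.items()}
    p_trans = {tt: c / tag_c[tt[0]] for tt, c in trans_c.items()}
    return p_emit, p_trans, set(tag_c) - {"<s>"}

def viterbi(words, p_emit, p_trans, tags):
    """argmax over t_1..n of prod_i P(w_i|t_i) P(t_i|t_{i-1})."""
    delta = {"<s>": 1.0}                     # best score of a path ending in each tag
    back = []                                # one backpointer dict per word
    for w in words:
        nxt, ptr = {}, {}
        for t in tags:
            e = p_emit.get((w, t), 0.0)      # zero for unseen (word, tag) pairs
            if e == 0.0:
                continue
            prev = max(delta, key=lambda p: delta[p] * p_trans.get((p, t), 0.0))
            nxt[t] = delta[prev] * p_trans.get((prev, t), 0.0) * e
            ptr[t] = prev
        back.append(ptr)
        delta = nxt
    seq = [max(delta, key=delta.get)]        # best final tag
    for ptr in reversed(back[1:]):           # follow backpointers to the start
        seq.append(ptr[seq[-1]])
    return seq[::-1]

corpus = [[("the", "AT"), ("representative", "NN"), ("put", "VBD"),
           ("chairs", "NNS"), ("on", "IN"), ("the", "AT"), ("table", "NN")]]
p_emit, p_trans, tags = mle_estimates(corpus)
print(viterbi(["the", "table"], p_emit, p_trans, tags))   # ['AT', 'NN']
```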

7. Exercises: 10.4, 10.5, 10.6, 10.7, pp. 348–350 [Manning & Schütze, 2002]

8. The Treatment of Unknown Words (I)

Using an a priori uniform distribution over all tags badly lowers the accuracy of the tagger.

Feature-based estimation [Weischedel et al., 1993]:

$$P(w \mid t) = \frac{1}{Z}\, P(\text{unknown word} \mid t)\, P(\text{Capitalized} \mid t)\, P(\text{Ending} \mid t)$$

where $Z$ is a normalization constant:

$$Z = \sum_{t'} P(\text{unknown word} \mid t')\, P(\text{Capitalized} \mid t')\, P(\text{Ending} \mid t')$$

This reduced the error rate from 40% to 20%.

Using both roots and suffixes [Charniak, 1993]; example: doe-s (verb) vs. doe-s (noun).
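
A small sketch of this feature-based estimate, under the assumption that the three component distributions have already been estimated from training counts (the table names below are illustrative, not from [Weischedel et al., 1993]):

```python
def p_unknown(word, tag, p_unk, p_cap, p_end, tags):
    """P(w|t) = (1/Z) P(unknown word|t) P(Capitalized|t) P(Ending|t)."""
    def raw(t):
        suffix = word[-2:]                  # one fixed-length ending, for brevity
        cap = p_cap.get(t, 0.0) if word[0].isupper() else 1.0 - p_cap.get(t, 0.0)
        return p_unk.get(t, 0.0) * cap * p_end.get((suffix, t), 0.0)
    z = sum(raw(t) for t in tags)           # normalization constant Z
    return raw(tag) / z if z > 0 else 0.0
```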

9. The Treatment of Unknown Words (II)

Smoothing ('Add One') [Church, 1988]:

$$P(w \mid t) = \frac{C(w,t) + 1}{C(t) + k_t}$$

where $k_t$ is the number of possible words for $t$.

[Charniak et al., 1993]:

$$P(t'' \mid t') = (1 - \epsilon)\, \frac{C(t', t'')}{C(t')} + \epsilon$$

Note: this is not a proper probability distribution.
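
Both formulas are direct to implement; a sketch, with counts as plain dicts and $k_t$ passed in precomputed (names are illustrative):

```python
def p_word_given_tag(w, t, c_wt, c_t, k_t):
    """'Add one' [Church, 1988]: P(w|t) = (C(w,t) + 1) / (C(t) + k_t)."""
    return (c_wt.get((w, t), 0) + 1) / (c_t[t] + k_t[t])

def p_tag_given_tag(t2, t1, c_tt, c_t, eps=1e-3):
    """[Charniak et al., 1993]: P(t''|t') = (1 - eps) C(t',t'')/C(t') + eps.
    As noted above, this does not sum to 1 over t'' (not a proper distribution)."""
    return (1.0 - eps) * c_tt.get((t1, t2), 0) / c_t[t1] + eps
```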

10. 2.2 Unsupervised POS Tagging Using HMMs

No labeled training data; use the EM (Forward-Backward) algorithm.

Initialisation options:
random: not very useful (needs ~10 iterations);
when a dictionary is available (2-3 iterations):

[Jelinek, 1985]:

$$b_{j.l} = \frac{b^*_{j.l}\, C(w^l)}{\sum_{w^m} b^*_{j.m}\, C(w^m)} \quad\text{where}\quad b^*_{j.l} = \begin{cases} 0 & \text{if } t^j \text{ is not allowed for } w^l \\ \dfrac{1}{T(w^l)} & \text{otherwise} \end{cases}$$

and $T(w^l)$ is the number of tags allowed for $w^l$.

[Kupiec, 1992]: group words into equivalence classes, e.g. $u_{JJ,NN} = \{top, bottom, \ldots\}$, $u_{NN,VB,VBP} = \{play, flour, bag, \ldots\}$, and distribute $C(u_L)$ over all words in $u_L$.
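
A sketch of the [Jelinek, 1985] initialization, assuming a dictionary tag_dict mapping each word to its allowed tag set and a dict of corpus frequencies counts (names are illustrative):

```python
def jelinek_init(words, tag_dict, counts, tags):
    """b[t][w] = b*(t,w) C(w) / sum_m b*(t,w_m) C(w_m),
    with b*(t,w) = 1/|T(w)| if t is allowed for w, else 0."""
    b = {}
    for t in tags:
        b_star = {w: (1.0 / len(tag_dict[w]) if t in tag_dict[w] else 0.0)
                  for w in words}
        norm = sum(b_star[w] * counts[w] for w in words)
        b[t] = {w: (b_star[w] * counts[w] / norm if norm > 0 else 0.0)
                for w in words}
    return b
```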

11. 2.3 Fine-Tuning HMMs for POS Tagging [Brants, 1998]

12. Trigram Taggers

1st-order MMs = bigram models: each state represents the previous word's tag; the probability of a word's tag is conditioned on the previous tag.

2nd-order MMs = trigram models: a state corresponds to the previous two tags; tag probability is conditioned on the previous two tags.

Example: in 'is clearly marked', BEZ RB VBN is more likely than BEZ RB VBD; in 'he clearly marked', PN RB VBD is more likely than PN RB VBN.

Problems: sometimes there is little or no syntactic dependency, e.g. across commas (in 'xx, yy', xx gives little information on yy); and data sparseness is more severe.

13. Linear Interpolation

Combine unigram, bigram, and trigram probabilities, as given by first-order, second-order, and third-order MMs on word sequences and their tags:

$$P(t_i \mid t_{i-1}, t_{i-2}) = \lambda_1 P_1(t_i) + \lambda_2 P_2(t_i \mid t_{i-1}) + \lambda_3 P_3(t_i \mid t_{i-1}, t_{i-2})$$

$\lambda_1, \lambda_2, \lambda_3$ can be automatically learned using the EM algorithm; see [Manning & Schütze, 2002, Figure 9.3, p. 323].
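
The interpolated estimate itself is one line; a sketch, assuming the three probability tables are precomputed and the lambdas have already been fit (in practice by EM on held-out data):

```python
def p_interp(t, t1, t2, P1, P2, P3, lambdas):
    """P(t_i|t_{i-1},t_{i-2}) = l1 P1(t_i) + l2 P2(t_i|t_{i-1}) + l3 P3(t_i|t_{i-1},t_{i-2})."""
    l1, l2, l3 = lambdas                    # l1 + l2 + l3 = 1
    return (l1 * P1.get(t, 0.0)
            + l2 * P2.get((t1, t), 0.0)
            + l3 * P3.get((t2, t1, t), 0.0))
```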

14. Variable Memory Markov Models

Have states of mixed length (instead of the fixed length that bigram or trigram taggers have). The actual sequence of words/signals determines the length of memory used for the prediction of state sequences.

[diagram: a tree of variable-length tag contexts, e.g. AT, AT AT, BEZ, ..., JJ, ..., WDT AT JJ IN]

15. 3. POS Tagging Based on Transformation-Based Learning (TBL) [Brill, 1995]

Exploits a wider range of regularities (lexical, syntactic) in a wider context.
Input: a tagged training corpus.
Output: a sequence of learned transformation rules; each transformation relabels some words.
Two principal components:
specification of the (POS-related) transformation space;
the TBL learning algorithm, with transformation selection criterion: greedy error reduction.

16. TBL Transformations

Rewrite rules: $t \to t'$ if condition $C$. Examples:

NN → VB     previous tag is TO             ...try to hammer...
VBP → VB    one of prev. 3 tags is MD      ...could have cut...
JJR → RBR   next tag is JJ                 ...more valuable player...
VBP → VB    one of prev. 2 words is n't    ...does n't put...

A later transformation may partially undo the effect of an earlier one. Example: go to school.

17. The TBL POS Algorithm

Tag each word with its most frequent POS.
For k = 1, 2, ...:
consider all possible transformations that would apply at least once in the corpus;
set t_k to the transformation giving the greatest error reduction;
apply the transformation t_k to the corpus;
stop if the termination criterion is met (error rate < ε).
Output: t_1, t_2, ..., t_k.

Issues: 1. the search is greedy; 2. transformations are applied (lazily...) from left to right.

A minimal sketch of this loop follows.
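
The sketch below deliberately simplifies: a rule is a (from_tag, to_tag, condition) triple rather than one of the real templates of [Brill, 1995], conditions are checked against the tags as they stood before the pass (a simplification of left-to-right application), and all names are illustrative.

```python
def n_errors(tags, gold):
    return sum(t != g for t, g in zip(tags, gold))

def apply_rule(tags, words, rule):
    src, dst, cond = rule
    return [dst if t == src and cond(words, tags, i) else t
            for i, t in enumerate(tags)]

def tbl_learn(words, gold, init_tags, candidates):
    tags, learned = list(init_tags), []
    while True:
        best = min(candidates,
                   key=lambda r: n_errors(apply_rule(tags, words, r), gold))
        if n_errors(apply_rule(tags, words, best), gold) >= n_errors(tags, gold):
            break                           # no transformation reduces the error
        tags = apply_rule(tags, words, best)
        learned.append(best)
    return learned

# example condition and rule, matching the first transformation of slide 16:
prev_is_to = lambda words, tags, i: i > 0 and tags[i - 1] == "TO"
rule_nn_vb = ("NN", "VB", prev_is_to)
```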

18. TBL Efficient Implementation: Using Finite State Transducers [Roche & Schabes, 1995]

t_1, t_2, ..., t_n → FST:
1. Convert each transformation to an equivalent FST: t_i → f_i.
2. Create a local extension for each FST: f_i → f'_i, so that running f'_i in one pass over the whole corpus is equivalent to running f_i at each position in the string. Example: for the rule 'A → B if C is one of the 2 preceding symbols', CAA → CBB requires two separate applications of f_i; f'_i does the rewrite in one pass.
3. Compose all transducers: f'_1 ∘ f'_2 ∘ ... ∘ f'_R → f_ND. This typically yields a non-deterministic transducer.
4. Convert it to a deterministic FST: f_ND → f_DET (possible for TBL for POS tagging).

19. TBL Tagging Speed

Transformations: O(Rkn), where
R = the number of transformations,
k = the maximum length of the contexts,
n = the length of the input.
FST: O(n), with a much smaller constant; one order of magnitude faster than an HMM tagger.
[André Kempe, 1997]: work on approximating HMMs with FSTs.

20. Appendix A

21. Transformation-based Error-driven Learning

Training:
1. Unannotated input (text) is passed through an initial-state annotator.
2. By comparing its output with a standard (e.g., a manually annotated corpus), transformation rules of a certain template/pattern are learned, to improve the quality (accuracy) of the output. Reiterate until no significant improvement is obtained.
Note: the algorithm is greedy: at each iteration, the rule with the best score is retained.

Test:
1. Apply the initial-state annotator.
2. Apply each of the learned transformation rules, in order.

22. [diagram: Transformation-based Error-driven Learning. Unannotated text passes through the initial-state annotator to produce annotated text; the annotated text and the truth (reference annotation) feed the learner, which outputs rules.]

23. Appendix B

24. Unsupervised Learning of Disambiguation Rules for POS Tagging [Eric Brill, 1995]

Plan:
1. An unsupervised learning algorithm (i.e., one not using a manually tagged corpus) for automatically acquiring the rules for a TBL-based POS tagger.
2. Comparison to the EM/Baum-Welch algorithm used for unsupervised training of HMM-based POS taggers.
3. Combining unsupervised and supervised TBL taggers to create a highly accurate POS tagger using only a small amount of manually tagged text.

25. 1. Unsupervised TBL-based POS Tagging

1.1 Start with a minimal amount of knowledge: the allowable tags for each word. These tags can be extracted from an on-line dictionary or through morphological and distributional analysis. The initial-state annotator assigns all these tags to the words in the text. Example:

Rival/JJ,NNP gangs/NNS have/VB,VBP turned/VBD,VBN cities/NNS into/IN combat/NN,VB zones/NNS ./.
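
A sketch of this initial-state annotator, assuming a dictionary tag_dict mapping words to their allowable tag sets (the default for out-of-dictionary words is an illustrative choice, not from the paper):

```python
def initial_annotate(words, tag_dict, default=("NN",)):
    """Each word receives the full *set* of tags its dictionary entry allows."""
    return [set(tag_dict.get(w, default)) for w in words]

tag_dict = {"Rival": {"JJ", "NNP"}, "gangs": {"NNS"}, "have": {"VB", "VBP"}}
print(initial_annotate(["Rival", "gangs", "have"], tag_dict))
# [{'JJ', 'NNP'}, {'NNS'}, {'VB', 'VBP'}]
```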

26. 1.2 The transformations which will be learned will reduce this uncertainty. They have the form:

Change the tag of a word from X to Y in context C,

where X is a set of tags, Y ∈ X, and C is of the form 'the previous/next tag/word is T/W'. Examples:

From NN,VB,VBP to VBP if the previous tag is NNS
From NN,VB to VB if the previous tag is MD
From JJ,NNP to JJ if the following tag is NNS

27. 1.3 The Scoring

Note: while in supervised training the annotated corpus is used for scoring the outcome of applying transformations, in unsupervised training we need an objective function to evaluate the effect of the learned transformations.

Idea: use information from the distribution of unambiguous words to find reliable disambiguation contexts.

The value of the objective function: the score of the rule 'Change the tag of a word from X to Y in context C' is the difference between the number of unambiguous instances of tag Y in C (over all occurrences of the context) and the number of unambiguous instances of the most likely competing tag R in C (R ∈ X, R ≠ Y), adjusted for relative frequency.

28. Formalisation:

1. Compute

$$R = \operatorname*{argmax}_{Z \in X,\, Z \neq Y} \frac{\text{incontext}(Z, C)}{\text{freq}(Z)}$$

where freq(Z) is the number of occurrences of words unambiguously tagged Z in the corpus, and incontext(Z, C) is the number of occurrences of words unambiguously tagged Z in context C.

Note: equivalently,

$$R = \operatorname*{argmin}_{Z \in X,\, Z \neq Y} \left[ \frac{\text{incontext}(Y, C)}{\text{freq}(Y)} - \frac{\text{incontext}(Z, C)}{\text{freq}(Z)} \right]$$

where freq(Y) is computed similarly to freq(Z).

29. Formalisation (cont'd):

2. The score of the (previously) given rule:

$$\text{freq}(Y) \left[ \frac{\text{incontext}(Y, C)}{\text{freq}(Y)} - \frac{\text{incontext}(R, C)}{\text{freq}(R)} \right] = \text{freq}(Y) \min_{Z \in X,\, Z \neq Y} \left[ \frac{\text{incontext}(Y, C)}{\text{freq}(Y)} - \frac{\text{incontext}(Z, C)}{\text{freq}(Z)} \right]$$

In each iteration, the learner searches for the transformation rule which maximizes this score; a sketch follows.
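
A sketch of this score, with freq and incontext passed in as counting functions over the corpus (how they are computed is described above; the signature is illustrative):

```python
def rule_score(X, Y, C, freq, incontext):
    """Score of 'change X to Y in context C':
    freq(Y) * [incontext(Y,C)/freq(Y) - incontext(R,C)/freq(R)],
    with R the strongest competing tag in X."""
    ratio = lambda Z: incontext(Z, C) / freq(Z)
    R = max((Z for Z in X if Z != Y), key=ratio)   # strongest competitor
    return freq(Y) * (ratio(Y) - ratio(R))

# The learner keeps the rule with the largest positive score in each iteration.
```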

30. 1.4 Stop the training when no positive-scoring transformations can be found.

31. 2. Unsupervised Learning of a POS Tagger: Evaluation

2.1 Results:
on the Penn Treebank corpus [Marcus et al., 1993]: 95.1%;
on the Brown corpus [Francis and Kucera, 1982]: 96%.
(For more details, see Table 1, page 8 of [Brill, 1995].)

2.2 Comparison to EM/Baum-Welch unsupervised learning:
on the Penn Treebank corpus: 83.6%;
on 1M words of Associated Press articles: 86.6%; Kupiec's version (1992), using classes of words: 95.7%.

Note: compared to the Baum-Welch tagger, no overtraining occurs. (Otherwise, an additional held-out training corpus is needed to determine an appropriate number of training iterations.)

32. 3. Weakly Supervised Rule Learning

Aim: use a tagged corpus to improve the accuracy of unsupervised TBL.
Idea: use the trained unsupervised POS tagger as the initial-state annotator for the supervised learner.
Advantage over using supervised learning alone: both tagged and untagged text are used in training.

33. [diagram: Combining unsupervised and supervised learning. Untagged text feeds the unsupervised learner (via the unsupervised initial-state annotator), producing unsupervised transformations; manually tagged text then feeds the supervised learner, which produces supervised transformations.]

34. Difference w.r.t. weakly supervised Baum-Welch: in TBL weakly supervised learning, supervision influences the learner after unsupervised training; in weakly supervised Baum-Welch, the tagged text is used to bias the initial probabilities.

Weakness of weakly supervised Baum-Welch: unsupervised training may erase what was learned from the manually annotated corpus. Example [Merialdo, 1995]: with 50K tagged words, test accuracy (by probabilistic estimation) was 95.4%, but after 10 EM iterations it dropped to 94.4%!

35. Results: see Table 2, p. 11 of [Brill, 1995].

Conclusion: the combined training outperformed purely supervised training at no added cost in terms of annotated training text.