Part-of-Speech Tagging
CS 341: Natural Language Processing
Prof. Heather Pon-Barry
www.mtholyoke.edu/courses/ponbarry/cs341.html

Announcements
- Lit Review Part 2: written review of 2 articles, due April 1
- Final Project Proposal: due Monday April 6
Today: POS Tagging

POS Tagging
The process of assigning a part-of-speech marker to each word in a collection.
Example: She/pronoun found/verb herself/pronoun falling/verb ...
Penn Treebank Tagset
Words often have more than one POS, e.g., back:
- The back door = adjective (JJ)
- On my back = noun (NN)
- Win the voters back = adverb (RB)
- Promised to back the bill = verb (VB)
The POS tagging problem is to determine the POS tag for a particular instance of a word.
Applications
- Speech synthesis: "I object" vs. "this object"
- Parsing
- Machine translation
- Named entity recognition
- Word sense disambiguation

POS Tagging Performance
How many tags are correct? (Tag accuracy)
- State of the art: about 97%
- But the baseline is already 90%
Baseline performance:
- Tag every word with its most frequent tag
- Tag unknown words as nouns
Partly easy because many words are unambiguous; you get points for them (the, a, etc.) and for punctuation marks.
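The most-frequent-tag baseline above can be sketched in a few lines of Python. This is a toy illustration, not a real tagger: the training corpus and the word/tag pairs here are made up for the example.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Learn each word's most frequent tag from a tagged corpus."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def baseline_tag(words, most_frequent_tag):
    """Tag known words with their most frequent tag; unknown words as NN."""
    return [(w, most_frequent_tag.get(w, "NN")) for w in words]

# Toy training data (hypothetical, just for illustration)
corpus = [
    [("the", "DT"), ("back", "NN"), ("door", "NN")],
    [("win", "VB"), ("the", "DT"), ("voters", "NNS"), ("back", "RB")],
    [("on", "IN"), ("my", "PRP$"), ("back", "NN")],
]
model = train_baseline(corpus)
print(baseline_tag(["the", "back", "bill"], model))
# "back" appears twice as NN, once as RB, so NN wins; "bill" is unseen, so NN
```

With realistic training data this simple strategy already reaches roughly the 90% accuracy quoted above, which is why it is the standard baseline.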
How difficult is POS Tagging?
In the Brown corpus:
- ~11% of word types are ambiguous with regard to part of speech
- ~40% of word tokens are ambiguous
But the ambiguous types tend to be very common words, e.g., that:
- I know that he is honest = preposition (IN)
- Yes, that play was nice = determiner (DT)
- You can't go that far = adverb (RB)

Automatic POS Tagging
- Symbolic: rule-based, transformation-based
- Probabilistic: hidden Markov models, log-linear models
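The type/token ambiguity distinction from the Brown corpus figures above can be computed directly from any tagged corpus: a word type is ambiguous if it is observed with more than one tag, and token ambiguity counts every occurrence of those types. A minimal sketch, with a made-up toy corpus:

```python
from collections import defaultdict

def ambiguity_stats(tagged_tokens):
    """Fraction of word types and word tokens seen with more than one tag."""
    tags_seen = defaultdict(set)
    token_count = defaultdict(int)
    for word, tag in tagged_tokens:
        tags_seen[word].add(tag)
        token_count[word] += 1
    ambiguous = {w for w, tags in tags_seen.items() if len(tags) > 1}
    type_frac = len(ambiguous) / len(tags_seen)
    token_frac = sum(token_count[w] for w in ambiguous) / len(tagged_tokens)
    return type_frac, token_frac

# Toy data: "that" is the only ambiguous type (IN vs. DT) but it is frequent,
# so token ambiguity is much higher than type ambiguity
tokens = [("that", "IN"), ("that", "DT"), ("that", "IN"),
          ("play", "NN"), ("honest", "JJ"), ("is", "VBZ")]
print(ambiguity_stats(tokens))  # 1 of 4 types, 3 of 6 tokens -> (0.25, 0.5)
```

The gap between the two fractions mirrors the Brown corpus numbers: few types are ambiguous, but they account for a large share of tokens.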
Rule-based Tagging
- Start with a dictionary
- Assign all possible tags to words from the dictionary
- Write rules by hand to selectively remove tags, leaving the correct tag for each word

Rule-based Example
The dictionary assigns all possible tags:
  She:      PRP
  promised: VBN, VBD
  to:       TO
  back:     NN, RB, JJ, VB
  the:      DT
  bill:     NN, VB
Rule-based Example
Apply rule: eliminate VBN if VBD is an option when VBN/VBD follows "<start> PRP":
  She:      PRP
  promised: VBD
  to:       TO
  back:     NN, RB, JJ, VB
  the:      DT
  bill:     NN, VB

Transformation-based Tagging
Combines rule-based and probabilistic tagging:
- rule-based: rules are used to specify tags in a certain environment
- probabilistic: a tagged corpus is used to find the best-performing rules (supervised learning)
Input:
- tagged corpus
- dictionary (with most frequent tags)
Example: Brill tagger
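The dictionary-then-eliminate workflow above can be sketched directly. The mini-dictionary and the rule's encoding below are hypothetical simplifications (the rule only checks a sentence-initial PRP), just enough to reproduce the example:

```python
def dictionary_lookup(words, lexicon):
    """Assign every possible tag from the dictionary to each word."""
    return [set(lexicon[w]) for w in words]

def eliminate_vbn_rule(candidates):
    """Eliminate VBN if VBD is an option when VBN/VBD follows <start> PRP."""
    for i, tags in enumerate(candidates):
        follows_start_prp = (i == 1 and candidates[0] == {"PRP"})
        if follows_start_prp and {"VBN", "VBD"} <= tags:
            tags.discard("VBN")
    return candidates

# Hypothetical mini-dictionary covering only the example sentence
lexicon = {"She": ["PRP"], "promised": ["VBN", "VBD"], "to": ["TO"],
           "back": ["NN", "RB", "JJ", "VB"], "the": ["DT"], "bill": ["NN", "VB"]}
words = ["She", "promised", "to", "back", "the", "bill"]
tagged = eliminate_vbn_rule(dictionary_lookup(words, lexicon))
print(tagged[1])  # {'VBD'} -- VBN eliminated for "promised"
```

A real rule-based system stacks many such hand-written elimination rules until, ideally, one tag remains per word.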
Automatic POS Tagging
- Symbolic: rule-based, transformation-based
- Probabilistic: hidden Markov models, log-linear models

HMM: Part-of-Speech Transition Probabilities
HMM: Observation Likelihoods P(word | tag)
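An HMM tagger combines the two kinds of probabilities on these slides (transition probabilities P(tag | previous tag) and observation likelihoods P(word | tag)) and uses the Viterbi algorithm to find the most probable tag sequence. A minimal sketch, with made-up toy probabilities rather than counts estimated from a corpus:

```python
def viterbi(words, tags, trans, emit, start):
    """Most probable tag sequence under an HMM:
    start[t] = P(t | <s>), trans[(t_prev, t)] = P(t | t_prev),
    emit[(t, w)] = P(w | t)."""
    # v[t] = score of the best tag sequence ending in t; back stores the path
    v = {t: start.get(t, 0.0) * emit.get((t, words[0]), 0.0) for t in tags}
    back = {t: [t] for t in tags}
    for w in words[1:]:
        v2, back2 = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: v[p] * trans.get((p, t), 0.0))
            v2[t] = v[best_prev] * trans.get((best_prev, t), 0.0) * emit.get((t, w), 0.0)
            back2[t] = back[best_prev] + [t]
        v, back = v2, back2
    return back[max(tags, key=lambda t: v[t])]

# Toy probabilities (made up for illustration)
tags = ["PRP", "VBD", "VB"]
start = {"PRP": 0.8, "VBD": 0.1, "VB": 0.1}
trans = {("PRP", "VBD"): 0.6, ("PRP", "VB"): 0.2, ("VBD", "VB"): 0.3}
emit = {("PRP", "she"): 0.9, ("VBD", "promised"): 0.5, ("VB", "promised"): 0.1}
print(viterbi(["she", "promised"], tags, trans, emit, start))  # ['PRP', 'VBD']
```

Real implementations work in log space to avoid underflow on long sentences; the multiplication here keeps the sketch close to the probability definitions.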
Maxent P(tag | word)
Can do surprisingly well just looking at a word by itself:
- Word: the → DT
- Prefixes: un- (unfathomable) → JJ
- Suffixes: -ly (importantly) → RB
- Capitalization: CAP (Meridian) → NNP
- Word shapes: d-x (35-year) → JJ
Then build a classifier to predict the tag.
Maxent P(tag | word): 93.7% overall / 82.6% on unknown words

Maximum Entropy Markov Model (MEMM)
A sequence version of the maximum entropy classifier.
[Diagram: features over previous tags t_{i-2}, t_{i-1} and words w_{i-1}, w_i, w_{i+1} for "<s> Janet will back the bill", tagged NNP MD VB ...]
Slide adapted from Dan Jurafsky
MEMMs: More Features
[Diagram: additional features over previous tags t_{i-2}, t_{i-1} and words w_{i-1}, w_i, w_{i+1} for "<s> Janet will back the bill", tagged NNP MD VB ...]
Slide adapted from Dan Jurafsky
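The word-level features listed on the maxent slides (word identity, prefixes, suffixes, capitalization, and word shape) can be extracted with a small function. The feature names and the exact shape-collapsing scheme below are assumptions for illustration, not the Stanford tagger's actual feature set:

```python
import re

def word_features(word):
    """Word-level features of the kind used by a maxent tagger:
    identity, prefix/suffix, capitalization, and word shape."""
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    # collapse runs of the same character so "35-year" -> "d-x", not "dd-xxxx"
    shape = re.sub(r"(.)\1+", r"\1", shape)
    return {
        "word": word.lower(),
        "prefix2": word[:2].lower(),
        "suffix2": word[-2:].lower(),
        "is_capitalized": word[0].isupper(),
        "shape": shape,
    }

print(word_features("35-year")["shape"])            # d-x
print(word_features("importantly")["suffix2"])      # ly
print(word_features("Meridian")["is_capitalized"])  # True
```

A classifier trained on such feature dictionaries is what gets the 93.7% overall / 82.6% unknown-word accuracy quoted above; an MEMM adds the previous tags as features on top of these.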
MEMM Decoding
Simplest algorithm, greedy: at each step in the sequence, select the tag that maximizes P(tag | nearby words, nearby tags)
In practice:
- Viterbi algorithm
- Beam search

POS Tagging Accuracies
Rough accuracies:
- Baseline (most frequent tag): ~90%
- Trigram HMM: ~95%
- Maxent P(t | w): 93.7%
- MEMM tagger: 96.9%
- Bidirectional MEMM: 97.2%
- Upper bound: ~98% (human agreement)
Slide adapted from Dan Jurafsky
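The greedy decoding strategy from the slide above can be sketched as a left-to-right loop that commits to the locally best tag at each position. The scoring function here is a hypothetical stand-in with a few hand-set preferences; in a real MEMM it would be the trained classifier's P(tag | features):

```python
def greedy_decode(words, tags, score):
    """Greedy MEMM-style decoding: at each position pick the tag that
    maximizes a local score given the word and the previous tag."""
    sequence = []
    prev = "<s>"
    for w in words:
        best = max(tags, key=lambda t: score(t, w, prev))
        sequence.append(best)
        prev = best
    return sequence

# Hypothetical local score, just for illustration
def toy_score(tag, word, prev_tag):
    prefer = {("she", "PRP"): 2.0, ("promised", "VBD"): 2.0, ("bill", "NN"): 2.0}
    bonus = 0.5 if (prev_tag, tag) == ("PRP", "VBD") else 0.0
    return prefer.get((word, tag), 0.0) + bonus

print(greedy_decode(["she", "promised"], ["PRP", "VBD", "NN"], toy_score))
# ['PRP', 'VBD']
```

Greedy decoding cannot revise an early mistake, which is why Viterbi or beam search is preferred in practice.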
More Resources
- Stanford POS Tagger (cyclic dependency network, a bidirectional version of the MEMM): http://nlp.stanford.edu/software/tagger.shtml
- CMU Twitter POS tagger: http://www.ark.cs.cmu.edu/tweetnlp/

References
- Log-linear models: Ratnaparkhi, EMNLP 1996; Toutanova et al., NAACL 2003
- Excellent recent survey: "Part-of-speech tagging from 97% to 100%: is it time for some linguistics?" (Manning, 2011)
Summary
- Penn Treebank: the standard tagset
- Approaches to POS tagging:
  - Symbolic: rule-based, transformation-based
  - Probabilistic: HMMs, MEMMs

Training a Tagger
Input:
- tagged corpus
- dictionary (with most frequent tags)
These are available for English. What about other languages?
Research in POS Tagging
Low-resource languages:
"Learning a Part-of-Speech Tagger from Two Hours of Annotation" (Garrette and Baldridge, 2013) [video]