CS630: Representing and Accessing Digital Information
Part-of-Speech Tagging
Thorsten Joachims, Cornell University
Based on slides from Prof. Claire Cardie

Why is POS Tagging Hard? Ambiguity
- He will race/VB the car.
- When will the race/NOUN end?
- The boat floated/VBD down the river.
- The boat floated/VBN down the river sank.
- Average of ~2 parts of speech for each word.
- The number of tags used by different systems varies a lot: some systems use < 20 tags, while others use > 400.

Part-of-Speech Tagging
- Task definition
- Task specification
- Why is POS tagging difficult
- Transformation-based learning approach [Brill 93]
- Hidden Markov Models

Among Easiest of NLP Problems
- State-of-the-art methods achieve ~97% accuracy.
- Simple heuristics can go a long way: ~90% accuracy just by choosing the most frequent tag for a word.
- But defining the rules for special cases can be time-consuming, difficult, and prone to errors and omissions.

Part-of-Speech Tagging Task
- Assign the correct part of speech (word class) to each word in a document:
  The/DT planet/NN Jupiter/NNP and/CC its/PRP$ moons/NNS are/VBP in/IN effect/NN a/DT mini-solar/JJ system/NN ,/, and/CC Jupiter/NNP itself/PRP is/VBZ often/RB called/VBN a/DT star/NN that/IN never/RB caught/VBN fire/NN ./.
- Needed as an initial processing step for a number of language technology applications:
  - Information extraction
  - Answer extraction in QA
  - Base step in identifying syntactic phrases for IR systems
  - Critical for word-sense disambiguation (WordNet apps)

Part-of-Speech Tagging
- Task definition
- Task specification
- Why is POS tagging difficult
- Transformation-based learning approach [Brill 93]
- Hidden Markov Models
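The most-frequent-tag heuristic mentioned above (the ~90% baseline) can be sketched in a few lines of Python. The tiny corpus and the NN default for unseen words below are hypothetical illustrations, not the WSJ-scale data behind the slides' figures.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Count (word, tag) pairs and keep the most frequent tag per word."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_baseline(words, most_frequent_tag, default="NN"):
    """Tag each word with its most frequent training tag; unseen words get a default."""
    return [most_frequent_tag.get(w, default) for w in words]

# Toy training corpus (hypothetical; the slides use WSJ-scale data).
corpus = [
    [("the", "DT"), ("can", "NN"), ("rusted", "VBD")],
    [("he", "PRP"), ("can", "MD"), ("race", "VB")],
    [("the", "DT"), ("race", "NN"), ("ended", "VBD")],
    [("the", "DT"), ("race", "NN"), ("began", "VBD")],
    [("she", "PRP"), ("can", "MD"), ("win", "VB")],
]
model = train_baseline(corpus)
print(tag_baseline(["the", "can", "race"], model))  # -> ['DT', 'MD', 'NN']
```

Note that this exhibits exactly the ambiguity problem from the slides: "can" and "race" each get one fixed tag regardless of context, which is what TBL and HMMs improve on.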
Transformation-Based Learning
- Machine learning technique for acquiring simple default heuristics and rules for special cases.
- Rules are learned by iteratively collecting errors and generating rules to correct them.
- Requires a large (training) corpus of manually tagged text and an initial state tagger.
- Allowable transformations: based on words and tags in a window surrounding the target word.
- Objective function: # correct - # incorrect [Brill 1993]

TBL: Top-Level Algorithm
- Learns an ordered list of transformations (i.e., rewrite rules).
- Learning algorithm: greedy search.
- Specify:
  - an initial state annotator
  - a space of allowable transformations
  - an objective function for comparing corpus to truth
- Algorithm:
  Iterate:
  - try each possible transformation
  - choose the one with the best score
  - add it to the list of transformations
  - update the training corpus
  until no transformation improves performance.

Rewrite Rules
- Rule: change modal to noun if the preceding word is a determiner.
- Determiners: the, a, an, this, that
- Modals: can, will, would, may, might (followed by the main verb)
- Example: The/det can/modal rusted/verb ./. -> The/det can/noun rusted/verb ./.

Transformation Templates
Change tag A to B when:
- the preceding/following word is tagged Z
- the word two before/after is tagged Z
- one of the two preceding/following words is tagged Z
- one of the three preceding/following words is tagged Z
- the preceding word is tagged Z and the following word is tagged W
- the preceding/following word is tagged Z and the word two before/after is tagged W
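The greedy loop above can be sketched for a single template ("change A to B if the previous tag is Z"). The five-token corpus and small tagset are made up for illustration; the real learner instantiates all templates over a large corpus.

```python
from itertools import product

def apply_rule(tags, rule):
    """Apply one rule (from_tag, to_tag, prev_tag) left to right over the sequence."""
    a, b, z = rule
    new = list(tags)
    for i in range(1, len(new)):
        if new[i] == a and new[i - 1] == z:
            new[i] = b
    return new

def score(tags, truth):
    """Objective function from the slides: number of correct tags."""
    return sum(t == g for t, g in zip(tags, truth))

def tbl_learn(current, truth, tagset, max_rules=10):
    """Greedy TBL: repeatedly add the rule with the best net improvement."""
    rules = []
    for _ in range(max_rules):
        base = score(current, truth)
        best, best_gain = None, 0
        for rule in product(tagset, repeat=3):  # all (A, B, Z) instantiations
            gain = score(apply_rule(current, rule), truth) - base
            if gain > best_gain:
                best, best_gain = rule, gain
        if best is None:  # no transformation improves performance: stop
            break
        rules.append(best)
        current = apply_rule(current, best)
    return rules, current

initial = ["DT", "NN", "VBD", "TO", "NN"]  # initial-state tagger output (toy)
truth   = ["DT", "NN", "VBD", "TO", "VB"]  # manually tagged truth (toy)
rules, corrected = tbl_learn(initial, truth, ["DT", "NN", "VBD", "TO", "VB"])
print(rules)  # learns: change NN to VB if the previous tag is TO
```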
Generating Transformations
- Apply the initial tagger and compile types of tagging errors. Each type of error is of the form <incorrect tag, desired tag, # of occurrences>.
- For each error type, instantiate all templates to generate candidate transformations.
- Apply each candidate transformation to the corpus and count the number of corrections and errors that it produces.
- Save the transformation that yields the greatest improvement.
- Stop when no transformation can reduce the error rate by a predetermined threshold.

Tagging New Text
The resulting tagger consists of two phases:
- Use the initial tagger to tag all the text.
- Apply each transformation, in order, to the corpus to correct some of the errors.
The order of the transformations is very important! For example, it is possible for a word's tag to change several times as different transformations are applied. In fact, a word's tag could thrash back and forth between the same two tags.

Example
- Suppose that the initial tagger mistags 159 words as verbs when they should have been nouns. This produces the error triple <verb, noun, 159>.
- Suppose template #3 is instantiated as the rule: change the tag from verb to noun if one of the two preceding words is tagged as a determiner.
- When this template is applied to the corpus, it corrects 98 of the 159 errors. But it also creates 18 new errors. The error reduction is 98 - 18 = 80.

Evaluation
- Training: 600,000 words from the Penn Treebank WSJ corpus
- Testing: separate 150,000 words from PTB
- Assumes all possible tags for all test set words are known.
- 97.0% accuracy
- Tagger learned 378 rules.

Learned Rules
1. NN -> VB if the previous tag is TO: I wanted to/TO win/NN->VB a Subaru WRX
2. VBP -> VB if one of the prev-3 tags is MD: The food might/MD vanish/VBP->VB from sight
3. NN -> VB if one of the prev-2 tags is MD: I might/MD not reply/NN->VB
4. VB -> NN if one of the prev-2 tags is DT
5. VBD -> VBN if one of the prev-3 tags is VBZ
6. VBN -> VBD if the previous tag is PRP

Problems? Not lexicalized: transformations are entirely tag-based; no specific words were used in the rules.
But certain phrases and lexicalized expressions can yield idiosyncratic tag sequences, so allowing the rules to look for specific words should help.
- Add additional templates, e.g., when the preceding/following word is w.
- The lexicalized tagger achieves 97.2% accuracy and learns 447 rules.
- The first 200 rules achieved 97.0%; the first 100 rules achieved 96.8%.
- Unknown words?
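The candidate-scoring step from the earlier example (98 corrections, 18 new errors, net gain 80) amounts to comparing the corpus before and after one transformation. A small sketch with made-up tags; for simplicity the rule checks the original tags rather than updating them in place as it scans:

```python
def verb_to_noun_after_det(tags):
    """Candidate rule: change VB to NN if one of the two preceding tags is DT."""
    out = list(tags)
    for i, t in enumerate(tags):
        if t == "VB" and "DT" in tags[max(0, i - 2):i]:
            out[i] = "NN"
    return out

def evaluate_candidate(current, truth, rule_fn):
    """Return (# corrections, # new errors, net improvement) for one candidate."""
    proposed = rule_fn(current)
    corrected = new_errors = 0
    for old, new, gold in zip(current, proposed, truth):
        if old != new:
            if new == gold:        # a wrong tag was fixed
                corrected += 1
            elif old == gold:      # a correct tag was broken
                new_errors += 1
    return corrected, new_errors, corrected - new_errors

current = ["DT", "VB", "VB", "PRP", "VB"]  # initial tagger output (toy)
truth   = ["DT", "NN", "NN", "PRP", "VB"]  # desired tags (toy)
print(evaluate_candidate(current, truth, verb_to_noun_after_det))  # -> (2, 0, 2)
```

The learner runs this evaluation for every candidate and keeps the transformation with the largest net improvement, exactly as in the 98 - 18 = 80 example.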
Transformation-Based Learning
- Part-of-speech tagging [Brill 1995; Ramshaw & Marcus 1994]
- Prepositional phrase attachment [Brill & Resnik 1994]
- Syntactic parsing [Brill 1994]
- Noun phrase chunking [Ramshaw & Marcus 1995, 1999]
- Context-sensitive spelling correction [Mangu & Brill 1997]
- Dialogue act tagging [Samuel et al. 1998]

States and Transitions
- States: think of them as nodes of a graph; one for each POS tag, plus a special start state (and maybe an end state).
- Transitions: think of them as directed edges in a graph; edges have transition probabilities.
- Output: each state also produces a word of the sequence; a sentence is generated by a walk through the graph.

Part-of-Speech Tagging
- Task specification
- Why is POS tagging difficult
- Transformation-based learning approach [Brill 93]
- Hidden Markov Models
- Named Entity Recognition

Probabilistic Model
- Starting state s_0: specifies where the sequence starts.
- Transition probability P(S_t | S_t-1): probability that one state succeeds another; a matrix of size #states x #states.
- Emission probability P(W_t | S_t): probability that a word is generated in this state; a matrix of size #states x #words.
=> Every word + state sequence has a probability:
   P(W, S) = P(w_1, ..., w_n, s_1, ..., s_n) = prod_{i=1..n} P(w_i | s_i) P(s_i | s_i-1), with s_0 = s_start

Hidden Markov Models
- Application to POS tagging: view POS tagging as a sequence of word classification tasks.
- Goal: train an HMM to label every word with one of the POS tags.
- What is an HMM? A Hidden Markov Model represents a process of generating the word and tag sequence.
- Probabilistic model: a probability for each word and tag sequence; predict the most likely tag sequence for a given word sequence.

HMM Inference Type I: Evaluation
- Question: what is the probability of an output sequence given an HMM?
- Given a fully specified HMM: s_0, P(W_t | S_t), P(S_t | S_t-1), find for a given w_1, ..., w_n:
  P(w_1, ..., w_n) = sum_{s_1, ..., s_n} prod_{i=1..n} P(w_i | s_i) P(s_i | s_i-1)
- The naive algorithm has exponential runtime; the forward algorithm is linear in the length of the sequence.
- This is a language model. Example: classify sequences as question vs. answer sentence.
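The forward algorithm above follows directly from the factorization: push the sum over state sequences inside the product. The two-state model and its probability tables below are made-up illustration values, not estimates from any corpus.

```python
def forward(words, states, start_p, trans_p, emit_p):
    """P(w_1..w_n) summed over all state sequences in O(n * |states|^2),
    versus O(|states|^n) for the naive enumeration."""
    # alpha[s] = P(w_1..w_t, S_t = s), initialized from the start distribution
    alpha = {s: start_p[s] * emit_p[s].get(words[0], 0.0) for s in states}
    for w in words[1:]:
        alpha = {s: emit_p[s].get(w, 0.0)
                    * sum(alpha[r] * trans_p[r][s] for r in states)
                 for s in states}
    return sum(alpha.values())

# Hypothetical two-state model for illustration only.
states = ["NN", "VB"]
start_p = {"NN": 0.6, "VB": 0.4}
trans_p = {"NN": {"NN": 0.3, "VB": 0.7}, "VB": {"NN": 0.6, "VB": 0.4}}
emit_p = {"NN": {"can": 0.5, "race": 0.5}, "VB": {"can": 0.2, "race": 0.8}}
print(forward(["can", "race"], states, start_p, trans_p, emit_p))  # 0.2626
```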
HMM Inference Type II: Decoding
- Question: what is the most likely state sequence given an output sequence?
- Given a fully specified HMM: s_0, P(W_t | S_t), P(S_t | S_t-1), find:
  argmax_{s_1, ..., s_n} P(s_1, ..., s_n | s_0, w_1, ..., w_n) = argmax_{s_1, ..., s_n} prod_{i=1..n} P(w_i | s_i) P(s_i | s_i-1)
- The Viterbi algorithm has runtime linear in the length of the sequence.
- Example: find the most likely tag sequence for a given sequence of words.

Experimental Results
  Tagger            HMM             TBL
  Accuracy          96.80%          96.47%
  Training time     20 sec          9 days
  Prediction time   8,000 words/s   750 words/s
Experiment setup: WSJ corpus; trigram HMM model, lexicalized, from [Pla and Molina, 2004].

Estimating the Probabilities
- Given: fully observed data, i.e., pairs of word sequences with their state sequences.
- Estimating transition probabilities P(S_t | S_t-1):
  P(s_a | s_b) = (# of times state a follows state b) / (# of times state b occurs)
- Estimating emission probabilities P(W_t | S_t):
  P(w_a | s_b) = (# of times word a is observed in state b) / (# of times state b occurs)
- Smoothing the estimates: Laplace smoothing -> uniform prior (see naive Bayes for text classification).
- Partially observed data: Expectation Maximization (EM).

HMMs for POS Tagging
- Design HMM structure (vanilla): one state per POS tag; fully connected transitions; emissions over all words observed in the training corpus.
- Estimate probabilities: use a corpus, e.g., the Treebank; smoothing; unseen words?
- Tagging new sentences: use Viterbi to find the most likely tag sequence.
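The Viterbi recursion for the vanilla tagger above can be sketched as follows: replace the forward algorithm's sum with a max and keep backpointers. The two-tag model is a made-up illustration; a real tagger would estimate the tables from Treebank counts as described.

```python
def viterbi(words, states, start_p, trans_p, emit_p):
    """Most likely state sequence for the words, linear in sentence length."""
    # V[t][s] = (best probability of any path ending in s at time t, best predecessor)
    V = [{s: (start_p[s] * emit_p[s].get(words[0], 0.0), None) for s in states}]
    for w in words[1:]:
        row = {}
        for s in states:
            prob, prev = max((V[-1][r][0] * trans_p[r][s], r) for r in states)
            row[s] = (prob * emit_p[s].get(w, 0.0), prev)
        V.append(row)
    # Backtrace from the best final state.
    path = [max(V[-1], key=lambda s: V[-1][s][0])]
    for row in reversed(V[1:]):
        path.append(row[path[-1]][1])
    return list(reversed(path))

# Hypothetical two-tag model for illustration only.
states = ["NN", "VB"]
start_p = {"NN": 0.6, "VB": 0.4}
trans_p = {"NN": {"NN": 0.3, "VB": 0.7}, "VB": {"NN": 0.6, "VB": 0.4}}
emit_p = {"NN": {"can": 0.5, "race": 0.5}, "VB": {"can": 0.2, "race": 0.8}}
print(viterbi(["can", "race"], states, start_p, trans_p, emit_p))  # ['NN', 'VB']
```

Unseen words make the emission probability 0.0 here, which zeroes out every path; this is why the smoothing step above matters in practice.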