POS tagging Intro to NLP - ETHZ - 11/03/2013
Summary Parts of speech Tagsets Part of speech tagging HMM Tagging: Most likely tag sequence Probability of an observation Parameter estimation Evaluation
POS ambiguity "Squad helps dog bite victim" bite -> verb? bite -> noun?
Parts of Speech (PoS) Traditional parts of speech: Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc. Called: parts-of-speech, lexical categories, word classes, morphological classes, lexical tags...
Examples N (noun): car, squad, dog, bite, victim V (verb): help, bite ADJ (adjective): purple, tall ADV (adverb): unfortunately, slowly P (preposition): of, by, to PRO (pronoun): I, me, mine DET (determiner): the, a, that, those
Open and closed classes 1. Closed class: small stable set a. Auxiliaries: may, can, will, been,... b. Prepositions: of, in, by,... c. Pronouns: I, you, she, mine, his, them,... d. Usually function words, short, grammar role 2. Open class: a. new ones are created all the time ("to google/tweet", "e-quaintance", "captcha", "cloud computing", "netbook", "webinar", "widget") b. English has 4: Nouns, Verbs, Adjectives, Adverbs c. Many languages have these 4, not all!
Open class words 1. Nouns: a. Proper nouns: Zurich, IBM, Albert Einstein, The Godfather,... Capitalized in many languages. b. Common nouns: the rest, also capitalized in German, mass/count nouns (goat/goats, snow/*snows) 2. Verbs: a. Morphological affixes in English: eat/eats/eaten 3. Adverbs: tend to modify things: a. John walked home extremely slowly yesterday b. Directional/locative adverbs (here, home, downhill) c. Degree adverbs (extremely, very, somewhat) d. Manner adverbs (slowly, slinkily, delicately) 4. Adjectives: qualify nouns and noun phrases
Closed class words 1. prepositions: on, under, over,... 2. particles: up, down, on, off,... 3. determiners: a, an, the,... 4. pronouns: she, who, I,... 5. conjunctions: and, but, or,... 6. auxiliary verbs: can, may, should,... 7. numerals: one, two, three, third,...
Prepositions with corpus frequencies
Conjunctions
Pronouns
Auxiliaries
Applications A useful pre-processing step in many tasks Syntactic parsing: important source of information for syntactic analysis Machine translation Information retrieval: stemming, filtering Named entity recognition Summarization?
Applications Speech synthesis, for correct pronunciation of "ambiguous" words: lead /lid/ (guide) vs. /led/ (chemical) insult vs. INsult object vs. OBject overflow vs. OVERflow content vs. CONtent
Summarization Idea: Filter out sentences starting with certain PoS tags Use PoS statistics from gold standard titles (might need cross-validation)
Summarization Idea: Filter out sentences starting with certain PoS tags: Title1: "Apple introduced Siri, an intelligent personal assistant to which you can ask questions" Title2: "Especially now that a popular new feature from Apple is bound to make other phones users envious: voice control with Siri" Title3: "But Siri, Apple's personal assistant application on the iphone 4s, doesn't
PoS tagging The process of assigning a part-of-speech tag (label) to each word in a text. Pre-processing: tokenization
Word:  Squad  helps  dog  bite  victim
Tag:   N      V      N    N     N
Choosing a tagset 1. There are many parts of speech and many potential distinctions we can draw 2. For POS tagging, we need to choose a standard set of tags to work with 3. Coarse tagsets: N, V, Adj, Adv. a. A universal PoS tagset? http://en.wikipedia.org/wiki/part-of-speech_tagging 4. A more commonly used set is finer grained: the Penn TreeBank tagset, 45 tags
PTB tagset
Examples* 1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NN 2. Does/VBZ this/DT flight/NN serve/VB dinner/NNS 3. I/PRP have/VB a/DT friend/NN living/VBG in/IN Denver/NNP 4. Can/VBP you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS
Examples* 1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NNP 2. Does/VBZ this/DT flight/NN serve/VB dinner/NNS 3. I/PRP have/VB a/DT friend/NN living/VBG in/IN Denver/NNP 4. Can/VBP you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS
Examples* 1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NNP 2. Does/VBZ this/DT flight/NN serve/VB dinner/NN 3. I/PRP have/VB a/DT friend/NN living/VBG in/IN Denver/NNP 4. Can/VBP you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS
Examples* 1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NNP 2. Does/VBZ this/DT flight/NN serve/VB dinner/NN 3. I/PRP have/VBP a/DT friend/NN living/VBG in/IN Denver/NNP 4. Can/VBP you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS
Examples* 1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NNP 2. Does/VBZ this/DT flight/NN serve/VB dinner/NN 3. I/PRP have/VBP a/DT friend/NN living/VBG in/IN Denver/NNP 4. Can/MD you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS
Complexities Book/VB that/DT flight/NN ./. There/EX are/VBP 70/CD children/NNS there/RB ./. Mrs./NNP Shaefer/NNP never/RB got/VBD around/RP to/TO joining/VBG ./. All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT corner/NN ./. Unresolvable ambiguity: The Duchess was entertaining last night.
Words      PoS WSJ   PoS Universal
The        DT        DET
oboist     NN        NOUN
Heinz      NNP       NOUN
Holliger   NNP       NOUN
has        VBZ       VERB
taken      VBN       VERB
a          DT        DET
hard       JJ        ADJ
line       NN        NOUN
about      IN        ADP
the        DT        DET
problems   NNS       NOUN
...
POS Tagging Words often have more than one POS, e.g. "back": The back door = JJ; On my back = NN; Win the voters back = RB; Promised to back the bill = VB. The POS tagging problem is to determine the POS tag for a particular instance of a word.
Word type tag ambiguity
Methods 1. Rule-based a. Start with a dictionary b. Assign all possible tags c. Write rules by hand to remove tags in context 2. Stochastic a. Supervised/Unsupervised b. Generative/discriminative c. independent/structured output d. HMMs
Rule-based tagging 1. Start with a dictionary: she: PRP; promised: VBN, VBD; to: TO; back: VB, JJ, RB, NN; the: DT; bill: NN, VB
Rule-based tagging 2. Assign all possible tags: she/{PRP} promised/{VBN, VBD} to/{TO} back/{VB, JJ, RB, NN} the/{DT} bill/{NN, VB}
Rule-based tagging 3. Introduce rules to reduce ambiguity: she/{PRP} promised/{VBN, VBD} to/{TO} back/{VB, JJ, RB, NN} the/{DT} bill/{NN, VB} Rule: "<start> PRP {VBN, VBD}" -> "<start> PRP VBD"
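The three steps above can be sketched in a few lines of Python. This is only an illustration: the lexicon holds just the six words of the example sentence, and the function names `assign_all_tags` and `rule_prp_vbd` are hypothetical, not taken from any tagger.

```python
# Toy lexicon copied from the example sentence (step 1: the dictionary).
LEXICON = {
    "she": {"PRP"},
    "promised": {"VBN", "VBD"},
    "to": {"TO"},
    "back": {"VB", "JJ", "RB", "NN"},
    "the": {"DT"},
    "bill": {"NN", "VB"},
}


def assign_all_tags(words):
    """Step 2: attach every candidate tag the dictionary lists for each word."""
    return [set(LEXICON[word]) for word in words]


def rule_prp_vbd(candidates):
    """Step 3: '<start> PRP {VBN, VBD}' -> '<start> PRP VBD'."""
    if (len(candidates) >= 2
            and candidates[0] == {"PRP"}
            and candidates[1] == {"VBN", "VBD"}):
        candidates[1] = {"VBD"}
    return candidates


words = "she promised to back the bill".split()
candidates = rule_prp_vbd(assign_all_tags(words))
for word, tags in zip(words, candidates):
    print(word, sorted(tags))
```

After this single rule, "promised" is resolved to VBD, while "back" and "bill" stay ambiguous until further hand-written rules are added.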
Statistical models for POS tagging 1. A classic: one of the first successful applications of statistical methods in NLP 2. Extensively studied with all possible approaches (sequence models benchmark) 3. Simple to get started on: data, eval, literature 4. An introduction to more complex segmentation and labelling tasks: NER, shallow parsing, global optimization 5. An introduction to HMMs, used in many variants in POS tagging and related tasks.
Supervision and resources 1. Supervised case: data with words manually annotated with POS tags 2. Partially supervised: annotated data + unannotated data 3. Unsupervised: only raw text available 4. Resources: dictionaries listing each word's possible tags 5. Start with the supervised task
HMMs HMM = (Q, O, A, B) 1. States: $Q = q_1 \dots q_N$ [the part-of-speech tags] 2. Observation symbols: $O = o_1 \dots o_V$ [words] 3. Transitions: a. $A = \{a_{ij}\}$; $a_{ij} = P(t_s = q_j \mid t_{s-1} = q_i)$ b. $t_s \mid t_{s-1} = q_i \sim \mathrm{Multi}(a_i)$ c. Special vector of initial/final probabilities 4. Emissions: a. $B = \{b_{ik}\}$; $b_{ik} = P(w_s = o_k \mid t_s = q_i)$ b. $w_s \mid t_s = q_i \sim \mathrm{Multi}(b_i)$
Markov Chain Interpretation Tagging process as a (hidden) Markov process Independence assumptions 1. Limited horizon 2. Time-invariant 3. Observation depends only on the state
Complete data likelihood The joint probability of a sequence of words and tags, given a model: Generative process: 1. generate a tag sequence 2. emit the words for each tag
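A sketch of the factorization that the independence assumptions imply, writing $t_0 = \text{START}$ as an assumed notational convention. Models with an explicit end state, as in the worked example later in the lecture, multiply in one more factor $P(\text{END} \mid t_N)$.

```latex
P(w_{1:N}, t_{1:N} \mid \theta)
  = \prod_{s=1}^{N} P(t_s \mid t_{s-1}) \, P(w_s \mid t_s)
```

Each factor pairs one transition (step 1 of the generative process) with one emission (step 2).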
Inference in HMMs Three fundamental problems: 1. Given an observation (e.g., a sentence), find the most likely sequence of states (e.g., POS tags) 2. Given an observation, compute its probability 3. Given a dataset of observation sequences, estimate the model's parameters: $\theta = (A, B)$
HMMs and FSA also Bayes nets, directed graphical models, etc.
Other applications of HMMs NLP - Named entity recognition, shallow parsing - Word segmentation - Optical Character Recognition Speech recognition Computer Vision image segmentation Biology - Protein structure prediction Economics, Climatology, Robotics...
POS as sequence classification Observation: a sequence of N words $w_{1:N}$ Response: a sequence of N tags $t_{1:N}$ Task: find the predicted $t'_{1:N}$ such that: $t'_{1:N} = \arg\max_{t_{1:N}} P(t_{1:N} \mid w_{1:N})$ The best possible tagging for the sequence.
Bayes rule reformulation
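A sketch of the usual derivation, assuming the tagging objective is the posterior-maximizing tag sequence $t'_{1:N}$. The denominator $P(w_{1:N})$ does not depend on the tags, so it can be dropped from the maximization.

```latex
\begin{aligned}
t'_{1:N} &= \arg\max_{t_{1:N}} P(t_{1:N} \mid w_{1:N}) \\
         &= \arg\max_{t_{1:N}} \frac{P(w_{1:N} \mid t_{1:N}) \, P(t_{1:N})}{P(w_{1:N})} \\
         &= \arg\max_{t_{1:N}} P(w_{1:N} \mid t_{1:N}) \, P(t_{1:N})
\end{aligned}
```

Under the HMM assumptions, $P(w_{1:N} \mid t_{1:N})$ becomes a product of emission probabilities and $P(t_{1:N})$ a product of transition probabilities.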
HMM POS tagging How can we find $t'_{1:N}$? Enumeration of all possible sequences? $O(|\text{Tagset}|^N)$! Dynamic programming: Viterbi algorithm
Viterbi algorithm
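A sketch of the standard recursion, where $d_s(t)$ is the probability of the most likely path ending at state $s$ at time $t$. The back-pointer name $\psi_s(t)$ and the explicit START/END transitions are assumptions chosen to match the worked example in this lecture.

```latex
\begin{aligned}
d_s(1)         &= P(s \mid \text{START}) \, P(w_1 \mid s) \\
d_s(t)         &= \max_{s'} \; d_{s'}(t-1) \, P(s \mid s') \, P(w_t \mid s) \\
\psi_s(t)      &= \arg\max_{s'} \; d_{s'}(t-1) \, P(s \mid s') \\
d_{\text{END}} &= \max_{s} \; d_s(N) \, P(\text{END} \mid s)
\end{aligned}
```

The best tag sequence is read off by following the back-pointers $\psi$ from END to the first word. The cost is $O(N \cdot |\text{Tagset}|^2)$ rather than exponential in $N$.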
Example: model
Transitions A (row = previous state, column = next state):
        N     V     END
V       0.8   0.2   0.3
N       0.3   0.7   0.7
START   0.6   0.4
Emissions B (row = state, column = word):
        board   backs   plan
V       0.3     0.3     0.4
N       0.4     0.2     0.4
Example: observation Sentence: "Board backs plan" Find the most likely tag sequence. $d_s(t)$ = probability of the most likely path ending at state s at time t
Viterbi: forward pass and backtrack (trellis; paths begin at START, columns are filled left to right, then back-pointers are followed from END)
        board (t=1)   backs (t=2)   plan (t=3)   END
V       d=.12         d=.050        d=.005
N       d=.24         d=.019        d=.016
END                                              d=.011
Viterbi: output board/N backs/V plan/N (probability of the best path at END: d=.011)
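The worked example can be reproduced with a short Python sketch. The probabilities are copied from the "Example: model" slide and used exactly as printed (with the END column included, the transition rows do not sum to 1). The dictionary layout and the function name `viterbi` are illustrative choices, not part of the lecture.

```python
# Transition and emission tables from the "Example: model" slide.
TRANS = {
    "START": {"N": 0.6, "V": 0.4},
    "N": {"N": 0.3, "V": 0.7, "END": 0.7},
    "V": {"N": 0.8, "V": 0.2, "END": 0.3},
}
EMIT = {
    "N": {"board": 0.4, "backs": 0.2, "plan": 0.4},
    "V": {"board": 0.3, "backs": 0.3, "plan": 0.4},
}
TAGS = ["N", "V"]


def viterbi(words):
    """Return (best tag sequence, its probability) for a list of words."""
    # delta[i][s]: probability of the best path ending in state s at word i.
    delta = [{s: TRANS["START"][s] * EMIT[s][words[0]] for s in TAGS}]
    backptr = [{}]
    for i in range(1, len(words)):
        delta.append({})
        backptr.append({})
        for s in TAGS:
            # Best predecessor of s; the emission does not affect the argmax.
            prev = max(TAGS, key=lambda p: delta[i - 1][p] * TRANS[p][s])
            delta[i][s] = delta[i - 1][prev] * TRANS[prev][s] * EMIT[s][words[i]]
            backptr[i][s] = prev
    # Transition into END, then follow the back-pointers to the first word.
    last = max(TAGS, key=lambda s: delta[-1][s] * TRANS[s]["END"])
    best_prob = delta[-1][last] * TRANS[last]["END"]
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(backptr[i][path[-1]])
    path.reverse()
    return path, best_prob


tags, prob = viterbi(["board", "backs", "plan"])
print(tags, round(prob, 3))  # ['N', 'V', 'N'] 0.011
```

A real tagger would also need a strategy for words missing from the emission table; this sketch assumes every word is known.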
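Observation probability Given HMM $\theta = (A, B)$ and observation sequence $w_{1:N}$, compute $P(w_{1:N} \mid \theta)$. Applications: language modeling. Complete data likelihood: $P(w_{1:N}, t_{1:N} \mid \theta)$. Sum over all possible tag sequences: $P(w_{1:N} \mid \theta) = \sum_{t_{1:N}} P(w_{1:N}, t_{1:N} \mid \theta)$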
Forward algorithm Dynamic programming: each state of the trellis stores a value $\alpha_i(s)$ = probability of being in state s having observed $w_{1:i}$. Sum over all paths up to i-1 leading to s. Init:
Forward algorithm
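A sketch of the standard forward recursion, with $\alpha_i(s)$ as the probability of being in state $s$ having observed $w_{1:i}$. The explicit START/END transitions are an assumption matching the worked example in this lecture.

```latex
\begin{aligned}
\alpha_1(s) &= P(s \mid \text{START}) \, P(w_1 \mid s) \\
\alpha_i(s) &= P(w_i \mid s) \sum_{s'} \alpha_{i-1}(s') \, P(s \mid s') \\
P(w_{1:N} \mid \theta) &= \sum_{s} \alpha_N(s) \, P(\text{END} \mid s)
\end{aligned}
```

The recursion has the same shape as Viterbi, with the max over predecessors replaced by a sum.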
Forward computation (trellis; paths begin at START, columns are filled left to right)
        board (t=1)   backs (t=2)   plan (t=3)   END
V       α=.12         α=.058        α=.014
N       α=.24         α=.034        α=.022
END                                              α=.020
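The forward computation can be checked with a short Python sketch. The probabilities are copied from the "Example: model" slide and used as printed; the function names `forward` and `brute_force` are illustrative. The brute-force version enumerates every tag sequence, which is only feasible for toy inputs, and serves here as a sanity check on the dynamic program.

```python
from itertools import product

# Transition and emission tables from the "Example: model" slide.
TRANS = {
    "START": {"N": 0.6, "V": 0.4},
    "N": {"N": 0.3, "V": 0.7, "END": 0.7},
    "V": {"N": 0.8, "V": 0.2, "END": 0.3},
}
EMIT = {
    "N": {"board": 0.4, "backs": 0.2, "plan": 0.4},
    "V": {"board": 0.3, "backs": 0.3, "plan": 0.4},
}
TAGS = ["N", "V"]


def forward(words):
    """Return P(words) by summing over all tag paths with dynamic programming."""
    alpha = {s: TRANS["START"][s] * EMIT[s][words[0]] for s in TAGS}
    for word in words[1:]:
        alpha = {
            s: EMIT[s][word] * sum(alpha[p] * TRANS[p][s] for p in TAGS)
            for s in TAGS
        }
    return sum(alpha[s] * TRANS[s]["END"] for s in TAGS)


def brute_force(words):
    """Same quantity by explicit enumeration of all len(TAGS)**N tag sequences."""
    total = 0.0
    for tags in product(TAGS, repeat=len(words)):
        prob, prev = 1.0, "START"
        for word, tag in zip(words, tags):
            prob *= TRANS[prev][tag] * EMIT[tag][word]
            prev = tag
        total += prob * TRANS[prev]["END"]
    return total


sentence = ["board", "backs", "plan"]
print(round(forward(sentence), 4))      # 0.0199
print(round(brute_force(sentence), 4))  # 0.0199
```

Both routes give about 0.020 for the END cell of the trellis.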
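Parameter estimation Maximum likelihood estimates (MLE) on data 1. Transition probabilities: $\hat{a}_{ij} = C(q_i, q_j) / C(q_i)$ 2. Emission probabilities: $\hat{b}_{ik} = C(q_i, o_k) / C(q_i)$, where $C(\cdot)$ counts occurrences in the annotated data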
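In the supervised case, maximum likelihood estimation reduces to relative-frequency counting. The sketch below assumes a tiny hand-made corpus (two invented sentences, purely for illustration) and adds START and END pseudo-tags around each sentence; the function name `estimate` is hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical two-sentence tagged corpus, for illustration only.
CORPUS = [
    [("board", "N"), ("backs", "V"), ("plan", "N")],
    [("plan", "N"), ("backs", "V"), ("board", "N")],
]


def normalize(counts):
    """Turn each row of counts into relative frequencies."""
    return {
        context: {item: c / sum(row.values()) for item, c in row.items()}
        for context, row in counts.items()
    }


def estimate(corpus):
    """Return (transitions, emissions) as nested dicts of MLE probabilities."""
    trans_counts = defaultdict(Counter)  # previous tag -> next tag -> count
    emit_counts = defaultdict(Counter)   # tag -> word -> count
    for sentence in corpus:
        prev = "START"
        for word, tag in sentence:
            trans_counts[prev][tag] += 1
            emit_counts[tag][word] += 1
            prev = tag
        trans_counts[prev]["END"] += 1
    return normalize(trans_counts), normalize(emit_counts)


trans, emit = estimate(CORPUS)
print(trans["N"])  # {'V': 0.5, 'END': 0.5}
print(emit["N"])   # {'board': 0.5, 'plan': 0.5}
```

Unsmoothed estimates like these assign zero probability to anything unseen in training, which is one reason the estimation and sparsity issues listed under implementation details matter.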
Implementation details 1. Start/End states 2. Log space/rescaling 3. Vocabularies: model pruning 4. Higher order models: a. states representation b. Estimation and sparsity: deleted interpolation
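A small illustration of why item 2 (log space/rescaling) matters. The per-word probability and the sequence length below are made-up values; the point is only that a long product of small probabilities underflows in floating point while the equivalent sum of logs does not.

```python
import math

p = 0.01  # hypothetical probability contributed by each word
n = 200   # hypothetical sequence length

product = 1.0
log_sum = 0.0
for _ in range(n):
    product *= p            # direct product: underflows to 0.0
    log_sum += math.log(p)  # log space: stays well within range

print(product)             # 0.0
print(round(log_sum, 2))   # -921.03
```

Viterbi carries over to log space directly, since max commutes with log. The forward algorithm also sums probabilities, so it needs either rescaling at each position or a log-sum-exp operation.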
Evaluation So once you have your POS tagger running, how do you evaluate it? 1. Overall error rate with respect to a gold-standard test set. a. ER = # words incorrectly tagged / # words tagged 2. Error rates on particular tags (and pairs) 3. Error rates on particular words (especially unknown words)
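The overall error rate and the per-pair errors can be computed as in the sketch below. The gold tags are those of "Squad helps dog bite victim" from earlier in the lecture; the predicted tags are a made-up tagger output, and the function names are hypothetical.

```python
from collections import Counter


def error_rate(gold, predicted):
    """ER = number of words incorrectly tagged / number of words tagged."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    wrong = sum(g != p for g, p in zip(gold, predicted))
    return wrong / len(gold)


def confusion_pairs(gold, predicted):
    """Count (gold tag, predicted tag) pairs for the mistagged words."""
    return Counter((g, p) for g, p in zip(gold, predicted) if g != p)


gold = ["N", "V", "N", "N", "N"]       # Squad helps dog bite victim
predicted = ["N", "V", "N", "V", "N"]  # hypothetical output: bite -> V

print(error_rate(gold, predicted))       # 0.2
print(confusion_pairs(gold, predicted))  # Counter({('N', 'V'): 1})
```

Error rates on particular words, including unknown words, follow the same pattern with the counts keyed by word instead of by tag pair.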
Evaluation The result is compared with a manually coded Gold Standard Typically accuracy > 97% on WSJ PTB This may be compared with result for a baseline tagger (one that uses no context). Baselines (most frequent tag) can achieve up to 90% accuracy. Important: 100% is impossible even for human annotators.
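A most-frequent-tag baseline can be sketched as follows. The training pairs are invented for illustration, and falling back to the overall most frequent tag for unknown words is one simple choice among several; the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical training data: (word, tag) pairs, for illustration only.
TRAIN = [
    ("the", "DT"), ("back", "NN"), ("door", "NN"), ("back", "RB"),
    ("back", "NN"), ("bill", "NN"), ("to", "TO"), ("back", "VB"),
]


def train_baseline(tagged_words):
    """Map each word to its most frequent tag; also return a fallback tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    word_to_tag = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    # Fallback for unknown words: the most frequent tag overall.
    default = Counter(tag for _, tag in tagged_words).most_common(1)[0][0]
    return word_to_tag, default


def tag_baseline(words, word_to_tag, default):
    """Tag each word independently, ignoring all context."""
    return [word_to_tag.get(word, default) for word in words]


word_to_tag, default = train_baseline(TRAIN)
print(tag_baseline(["to", "back", "the", "bill", "now"], word_to_tag, default))
# ['TO', 'NN', 'DT', 'NN', 'NN']
```

Because it ignores context, this baseline tags "back" as NN even after "to", which is exactly the kind of error a sequence model is meant to fix.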
Summary Parts of speech Tagsets Part of speech tagging HMM Tagging: Most likely tag sequence (decoding) Probability of an observation (word sequence) Parameter estimation (supervised) Evaluation
Next class Unsupervised POS tagging models (HMMs) Parameter estimation: forward-backward algorithm Discriminative sequence models: MaxEnt, CRF, Perceptron, SVM, etc. Read J&M 5-6 Pre-process and POS tag the data: report problems & baselines