Part-of-Speech Tagging Yan Shao Department of Linguistics and Philology, Uppsala University 19 April 2017
Last time: N-grams are used to build language models. The probabilities are estimated from corpora using maximum likelihood estimation (MLE). Data sparsity and smoothing. The Markov assumption. PoS Tagging 2/33
Outline
1 Introduction to PoS Tagging
2 An Example of a Tagged Corpus: SUC
3 Evaluation
4 Types of Tagging Approaches
- Rule-based Approaches
- Statistical Approaches
Part-of-speech (PoS)
Part of speech: a category of words sharing similar grammatical properties.
Traditional parts of speech: noun, verb, adjective, adverb, preposition, article, interjection, pronoun, conjunction, ...
Also known as: lexical categories, word classes, morphological classes, lexical tags, ...
There is much debate within linguistics about the number, nature, and universality of these categories.
PoS Examples
Introduction to PoS Tagging
Definition of PoS tagging: the process of assigning a part-of-speech tag to every word of a sentence or text.
Why PoS Tagging?
Distinguishing heterophones in speech synthesis:
- I did not object to the object.
- To present the present.
- The bandage was wound around the wound.
Parsing: PoS tagging is a preceding step for parsing (syntactic analysis).
Information extraction: finding names, relations, etc.
Machine translation.
What is the challenge in PoS tagging?
Ambiguous words - resolving lexical ambiguities:
- The/DT wind/NN was/VBD too/RB strong/JJ to/TO wind/VB the/DT sail/NN.
Unknown words:
- The/DT rural/JJ Babbitt/??? who/WP bloviates/??? about/IN progress/NN and/CC growth/NN
How is PoS tagging performed?
Two sources of information:
Lexical information (the word itself)
- Known words can be looked up in a lexicon listing the possible tags for each word
- Unknown words can be analyzed with respect to affixes, capitalization, special symbols, etc.
Contextual information (surrounding words)
- Contextual words
- Contextual POS tags
Two main approaches: rule-based systems and statistical systems.
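The two information sources can be sketched as a toy lookup function. The mini-lexicon and the suffix heuristics below are invented for illustration, not taken from the slides:

```python
# Hypothetical mini-lexicon mapping known words to their possible tags.
LEXICON = {
    "the": ["DT"],
    "wind": ["NN", "VB"],
    "was": ["VBD"],
    "sail": ["NN"],
}

def possible_tags(word):
    """Return candidate tags using lexical information only."""
    w = word.lower()
    if w in LEXICON:                      # known word: look it up
        return LEXICON[w]
    # Unknown word: fall back on capitalization and affixes.
    if word[0].isupper():
        return ["NNP"]                    # capitalized -> likely proper noun
    if w.endswith("ing"):
        return ["VBG"]
    if w.endswith("s"):
        return ["NNS", "VBZ"]
    return ["NN"]                         # default guess

print(possible_tags("wind"))       # ['NN', 'VB']
print(possible_tags("Babbitt"))    # ['NNP']
print(possible_tags("bloviates"))  # ['NNS', 'VBZ']
```

Contextual information (the surrounding words and tags) is then needed to choose among the candidates; that is what the rule-based and statistical approaches below provide.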
Tagsets are not universal
We need a standard set of tags to do PoS tagging, but various tagging schemes are employed in different annotated corpora.
Very coarse tagsets: N, V, Adj, Adv, ...
More commonly used tagsets are more fine-grained:
- English: Penn Treebank tagset, 45 tags
- Swedish: SUC tagset, 25 base tags; with morphosyntactic features, about 150 tags
Even more fine-grained tagsets exist.
There are two types of word classes: open and closed.
Closed class: a small, fixed membership
- Prepositions: of, in, by, ...
- Pronouns: I, you, she, mine, his, this, that, ...
- Determiners: the, a, this, that, ...
- Usually function words; often frequent and ambiguous
Open class: new members can be added all the time
- English has four: nouns, verbs, adjectives, adverbs
- Usually content words; often rare and (therefore sometimes) unknown
Penn Treebank POS Tagset
How Hard is POS Tagging? Measuring Ambiguity
The SUC POS Tagset
The SUC POS Tagset
Och han menade faktiskt allvar ('And he was actually serious')
The SUC POS Tagset
Och/KN han/PN menade/VB faktiskt/AB allvar/NN
SUC includes morphosyntactic features (illustrated with a corpus sample).
List of the morphosyntactic features
Adding morphosyntactic features
Och han menade faktiskt allvar
KN PN VB AB NN
With features:
KN | PN UTR SIN DEF SUB | VB PRT AKT | AB POS | NN NEU SIN IND NOM
Evaluation
Evaluate the accuracy of the POS tagger with respect to a manually annotated gold-standard test set:
- Overall error rate
- Error rates on known vs. unknown words
- Error rates on particular tags
Accuracy typically reaches 96-97% for English newswire text.
Error Analysis
Generate a confusion matrix (for development data): how often was tag i mistagged as tag j?
See which errors are causing problems:
- Noun (NN) vs. proper noun (NNP) vs. adjective (JJ)
- Preterite (VBD) vs. participle (VBN) vs. adjective (JJ)
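A confusion matrix and overall accuracy can be computed from gold and predicted tag sequences. The data below is a toy example invented for illustration:

```python
from collections import Counter

# Toy gold vs. predicted tag sequences (hypothetical development data).
gold = ["NN", "NNP", "JJ", "VBD", "NN", "VBN", "JJ", "NN"]
pred = ["NN", "NN",  "JJ", "VBN", "NN", "VBN", "JJ", "NN"]

# confusion[(i, j)] = how often gold tag i was predicted as tag j
confusion = Counter(zip(gold, pred))

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(f"accuracy: {accuracy:.2f}")        # accuracy: 0.75

# Show only the errors, most frequent first.
for (g, p), n in confusion.most_common():
    if g != p:
        print(f"{g} mistagged as {p}: {n}")
```

Here the off-diagonal entries (NNP predicted as NN, VBD predicted as VBN) are exactly the problem cases the slide mentions.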
Some Vocabulary
Unknown word: a word that is not in the dictionary/lexicon of the tagger.
Ambiguous word: a word that can have different tags, depending on the context.
Low-frequency word: a word that is very rare (sometimes appearing only once) in the corpus.
Two Approaches for POS Tagging
Rule-based systems:
- Constraint Grammar
- Transformation-Based Learning
Statistical sequence models:
- Hidden Markov Models
- Maximum Entropy Markov Models
- Conditional Random Fields
- Neural Networks
Two Approaches for PoS Tagging
Rule-based systems
a) Constraint Grammar
- Assign all possible tags to each word
- Apply rules that discard tags based on context
- Rules created by hand
b) Transformation-Based Learning
- Assign the most frequent tag to each word
- Apply rules that replace tags based on context
- Later rules may overwrite earlier rules
- Rules learned from a tagged corpus
Two Approaches for PoS Tagging
a) Constraint Grammar
For each ambiguous word, apply a rule. Example: "An ambiguous word is a noun rather than a verb if it follows a determiner."
Advantages:
- Can achieve very high recall with good lexical resources
- Rules can be interpreted by humans, which facilitates debugging
Drawbacks:
- Not always possible to eliminate all ambiguity
- Rule design is (very) expensive and time-consuming
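The determiner rule can be sketched in the Constraint Grammar style: start from all possible readings per word and discard readings by context. The words, tag sets, and rule below are invented for illustration:

```python
# Minimal sketch of Constraint Grammar style disambiguation.
# Start from all possible tags per word, then discard readings by context.

sentence = ["the", "wind", "blows"]
readings = {
    "the":   {"DT"},
    "wind":  {"NN", "VB"},   # ambiguous: noun or verb
    "blows": {"NNS", "VBZ"},
}
cohorts = [set(readings[w]) for w in sentence]

# Rule: discard the verb reading of an ambiguous word
# that follows an unambiguous determiner.
for i in range(1, len(sentence)):
    if cohorts[i - 1] == {"DT"} and len(cohorts[i]) > 1:
        cohorts[i].discard("VB")

print(cohorts[1])   # {'NN'} -- 'wind' disambiguated to noun
```

Note that "blows" keeps both readings here: as the slide says, one rule set cannot always eliminate all ambiguity.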
Two Approaches for PoS Tagging
b) Transformation-Based Learning (= Brill tagging)
The rules are NOT hand-written; the most frequent tags are initially assigned, then learned rules transform them.
Advantages:
- Rules can be interpreted by humans, which facilitates debugging
- Rules are learned automatically from data
Drawbacks:
- Not quite as accurate as the best statistical models
- Slow to train on large data sets
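The two Brill-tagging steps (initialize with the most frequent tag, then apply transformation rules in order) can be sketched as follows. The lexicon and the single rule are hand-picked for illustration; a real TBL system learns its rules from a tagged corpus:

```python
# Hypothetical most-frequent-tag lexicon.
MOST_FREQUENT_TAG = {"the": "DT", "wind": "NN", "can": "MD", "rusted": "VBD"}

# Each rule: (from_tag, to_tag, required tag of the previous word)
RULES = [
    ("MD", "NN", "DT"),   # 'can' after a determiner is a noun, not a modal
]

def brill_tag(words):
    # Step 1: initialize with the most frequent tag per word.
    tags = [MOST_FREQUENT_TAG.get(w, "NN") for w in words]
    # Step 2: apply transformation rules in order; later rules may
    # overwrite the output of earlier ones.
    for from_tag, to_tag, prev_tag in RULES:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return tags

print(brill_tag(["the", "can", "rusted"]))  # ['DT', 'NN', 'VBD']
```

In training, Brill's algorithm repeatedly picks the rule that fixes the most errors on the tagged corpus and appends it to the rule list; only rule application is shown here.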
Two Approaches for POS Tagging
Statistical Models
The parameters of the tagger are learned statistically from an annotated corpus.
- What is the most probable tag sequence given a sequence of words?
- What is the most probable sequence of tags that generates this sentence?
Exercise: Imagine a tagged corpus
<S> A/Adj C/Verb B/Noun B/Adj A/Verb B/Adj A/Noun A/Adj </S>
We can for instance compute:
(1) The probability of a word being assigned a certain tag: c(Adj,B)=2, c(B)=3, P(Adj|B)=2/3
(2) The probability of a transition between tags: c(Verb,Noun)=1, c(Verb)=2, P(Noun|Verb)=1/2
(3) The probability of generating a word given a certain tag: c(Adj,B)=2, c(Adj)=4, P(B|Adj)=2/4
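The counts in this exercise can be reproduced programmatically from the same toy corpus; a minimal sketch:

```python
from collections import Counter

# The toy corpus from the exercise: words with their tags.
words = ["A", "C", "B", "B", "A", "B", "A", "A"]
tags  = ["Adj", "Verb", "Noun", "Adj", "Verb", "Adj", "Noun", "Adj"]

tag_count   = Counter(tags)
word_count  = Counter(words)
pair_count  = Counter(zip(tags, words))      # c(tag, word)
trans_count = Counter(zip(tags, tags[1:]))   # c(tag_i, tag_{i+1})

# (1) P(Adj | B) = c(Adj, B) / c(B)
p_adj_given_b = pair_count[("Adj", "B")] / word_count["B"]             # 2/3
# (2) P(Noun | Verb) = c(Verb, Noun) / c(Verb)
p_noun_given_verb = trans_count[("Verb", "Noun")] / tag_count["Verb"]  # 1/2
# (3) P(B | Adj) = c(Adj, B) / c(Adj)
p_b_given_adj = pair_count[("Adj", "B")] / tag_count["Adj"]            # 2/4

print(p_adj_given_b, p_noun_given_verb, p_b_given_adj)
```

Quantities (2) and (3) are exactly the transition and emission probabilities an HMM tagger estimates from a tagged corpus with MLE.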
Hidden Markov Model (HMM) for POS tagging
Hidden Markov Model (HMM): Formally
HMM tagging is based on two mathematical statements.
Bayesian inference: P(t|w) = P(w|t) P(t) / P(w)
Applied to tag sequence prediction: the best tag sequence is
argmax over t_1..t_n of P(t_1..t_n | w_1..w_n) = argmax over t_1..t_n of P(w_1..w_n | t_1..t_n) P(t_1..t_n)
And the Markov assumptions:
- The generation of each word w_i depends only on its tag t_i, not on previous words: P(w_1..w_n | t_1..t_n) ≈ ∏_i P(w_i | t_i)
- Each tag t_i depends only on its immediate predecessor t_{i-1}: P(t_1..t_n) ≈ ∏_i P(t_i | t_{i-1})
More Formally
Alphabet Σ = {s_1, s_2, ..., s_M}
Set of states Q = {q_1, q_2, ..., q_N}
Transition probabilities between any two states: a_ij = P(q_j | q_i), the probability of moving from state i to state j
Start probabilities: π_i = P(q_i), the probability of starting in state i
Emission probabilities for each symbol and state: b_ik = P(s_k | q_i)
Summary
Part-of-speech tagging:
- A prior step in many NLP applications
- Different tagsets and tagging schemes
Approaches:
- Rule-based systems (Constraint Grammar, Transformation-Based Learning)
- Statistical sequence models (HMM, ...)
State of the art: 96-97% accuracy for English newswire text
References
Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall Series in Artificial Intelligence. Prentice Hall, Pearson International Edition, 2009.
See also: https://www.coursera.org/course/nlp