TDDE09, 729A27 Natural Language Processing (2017) Part-of-Speech Tagging Marco Kuhlmann Department of Computer and Information Science This work is licensed under a Creative Commons Attribution 4.0 International License.
Parts of speech
A part of speech is a category of words that play similar roles within the syntactic structure of a sentence. Parts of speech can be defined distributionally or functionally.
Distributionally: Kim saw the {elephant, movie, mountain, error} before we did.
Functionally: verbs = predicates; nouns = arguments; adverbs = modifiers of verbs; ...
There are many different tag sets for parts of speech: different languages, different levels of granularity, different design principles.
Universal part-of-speech tags

Tag    Category          Examples
ADJ    adjective         big, old
ADV    adverb            very, well
INTJ   interjection      ouch!
NOUN   noun              girl, cat, tree
VERB   verb              run, eat
PROPN  proper noun       Mary, John
ADP    adposition        in, to, during
AUX    auxiliary verb    has, was
CCONJ  conjunction       and, or, but
DET    determiner        a, my, this
NUM    cardinal numbers  0, one
PRON   pronoun           I, myself, this

Missing: PART, SCONJ, PUNCT, SYM, X
Source: Universal Dependencies Project
Part-of-speech tagging A part-of-speech tagger is a computer program that tags each word in a sentence with its part of speech. Part-of-speech tagging can be approached as a supervised machine learning problem. This requires training data. Part-of-speech taggers are commonly evaluated using accuracy, precision, and recall.
Ambiguity causes combinatorial explosion
jag bad om en kort bit
Each word admits several candidate tags (e.g. jag: PN or NN; bad: VB or NN; kort: JJ or NN), so the number of possible tag sequences grows multiplicatively with sentence length.
Example by Joakim Nivre
Overview of this section Introduction to part-of-speech tagging Evaluation of part-of-speech taggers Method 1: Part-of-speech tagging with hidden Markov models Method 2: Part-of-speech tagging with perceptrons
Evaluation of Part-of-Speech Taggers
A reminder about machine learning methodology
Training data: used to train a machine learning system.
Development data: used to evaluate the system during development and to set hyperparameters, e.g. the smoothing parameter in additive smoothing.
Test data: used to evaluate the final system.
Stockholm Umeå Corpus (SUC)
SUC is the largest manually annotated corpus for written Swedish. It was created in the early 1990s as a collaboration between Stockholm University and Umeå University.
SUC contains more than 1.1 million tokens; these are annotated with parts of speech, morphological features, and lemmas.
SUC is a balanced corpus with texts from different genres.
Accuracy

Confusion matrix (rows = gold-standard tag, columns = predicted tag):

        DT     JJ     NN     PP     VB
DT     923      0      0      0      1
JJ       2   1255    132      1      5
NN       0      7   4499      1     18
PP       0      0      0   2332      1
VB       0      5    132      2   3436

Correct predictions (diagonal): 12445; incorrect predictions (off-diagonal): 307.
Accuracy = 12445 / (12445 + 307) ≈ 97.6%
Precision with respect to NN

        DT     JJ     NN     PP     VB
DT     923      0      0      0      1
JJ       2   1255    132      1      5
NN       0      7   4499      1     18
PP       0      0      0   2332      1
VB       0      5    132      2   3436

Tokens predicted as NN (the NN column): 4499 correct, 264 incorrect.
Precision(NN) = 4499 / (4499 + 264) ≈ 94.5%
Recall with respect to NN

        DT     JJ     NN     PP     VB
DT     923      0      0      0      1
JJ       2   1255    132      1      5
NN       0      7   4499      1     18
PP       0      0      0   2332      1
VB       0      5    132      2   3436

Tokens whose gold-standard tag is NN (the NN row): 4499 tagged correctly, 26 tagged incorrectly.
Recall(NN) = 4499 / (4499 + 26) ≈ 99.4%
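Accuracy, precision, and recall can all be computed directly from such a confusion matrix. Below is a minimal Python sketch (not part of the course material; the function names are illustrative). The dictionary layout mirrors the tables above: rows are gold-standard tags, columns are predicted tags.

```python
# Confusion matrix: confusion[gold][predicted] = count.
# The counts are the ones from the accuracy slide above.
confusion = {
    "DT": {"DT": 923, "JJ": 0,    "NN": 0,    "PP": 0,    "VB": 1},
    "JJ": {"DT": 2,   "JJ": 1255, "NN": 132,  "PP": 1,    "VB": 5},
    "NN": {"DT": 0,   "JJ": 7,    "NN": 4499, "PP": 1,    "VB": 18},
    "PP": {"DT": 0,   "JJ": 0,    "NN": 0,    "PP": 2332, "VB": 1},
    "VB": {"DT": 0,   "JJ": 5,    "NN": 132,  "PP": 2,    "VB": 3436},
}

def accuracy(confusion):
    # fraction of all tokens that received the correct tag
    correct = sum(confusion[t][t] for t in confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    return correct / total

def precision(confusion, tag):
    # of all tokens predicted as `tag`, the fraction whose gold tag is `tag`
    predicted = sum(confusion[gold][tag] for gold in confusion)
    return confusion[tag][tag] / predicted

def recall(confusion, tag):
    # of all tokens whose gold tag is `tag`, the fraction predicted as `tag`
    gold_total = sum(confusion[tag].values())
    return confusion[tag][tag] / gold_total

print(accuracy(confusion))         # 12445 / 12752 ≈ 0.976
print(precision(confusion, "NN"))  # 4499 / 4763 ≈ 0.945
print(recall(confusion, "NN"))     # 4499 / 4525 ≈ 0.994
```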
Sample exam question

        NN    JJ    VB
NN      58     6     1
JJ       5    11     2
VB       0     7    43

(rows = gold-standard tag, columns = predicted tag)
Compute (a) precision on adjectives, (b) recall on verbs.
Overview of this section Introduction to part-of-speech tagging Evaluation of part-of-speech taggers Method 1: Part-of-speech tagging with hidden Markov models Method 2: Part-of-speech tagging with perceptrons
Part-of-Speech Tagging with Hidden Markov Models
Ambiguity causes combinatorial explosion
jag bad om en kort bit
Each word admits several candidate tags (e.g. jag: PN or NN; bad: VB or NN; kort: JJ or NN), so the number of possible tag sequences grows multiplicatively with sentence length.
Example by Joakim Nivre
Different parts of speech have different frequencies

Word / tag    PN     VB     PP     DT     JJ     NN
jag         4532      0      0      0      0     25
bad            0     41      0      0      0     10
om             0      0   4945      0      0      0
en           402      0      0     16      0      1
kort           0      0      0      0    125     18
bit            0      0      0      0      0     92

Data from the Stockholm Umeå Corpus
Different tag sequences have different frequencies

Previous / next     PN      VB      PP      DT      JJ      NN
PN                1291   35473    6812    1291    1759    1496
VB               24245   19470   22191   13175    8794   19282
PP                5582     198     501   19737   10751   52440
DT                 201       1     286     163   21648   23719
JJ                 233    1937    3650     245    2716   46678
NN                1149   41928   51855    1312    3350   10314

Data from the Stockholm Umeå Corpus
Hidden Markov Model
A hidden Markov model (HMM) is a generalised Markov model with two types of probabilities:
Transition probabilities P(tag2 | tag1): How probable is it to see a verb after having seen a pronoun?
Output probabilities P(word | tag): How probable is it to see the word bad being tagged as a verb?
Diagram: a Markov model over two words w1 and w2, drawn as states between BOS and EOS, with the transition probabilities P(w1 | BOS), P(w2 | BOS), P(w1 | w1), P(w2 | w1), P(w1 | w2), P(w2 | w2), P(EOS | w1), and P(EOS | w2).
Diagram: a hidden Markov model with the tag states PN and VB between BOS and EOS.
Transition probabilities: P(PN | BOS), P(VB | BOS), P(PN | PN), P(VB | PN), P(PN | VB), P(VB | VB), P(EOS | PN), P(EOS | VB).
Output probabilities: P(jag | VB) = 0.000004, P(bad | VB) = 0.000152, P(jag | PN) = 0.025775, P(bad | PN) = 0.000006.
Learning hidden Markov models
To learn a hidden Markov model from a corpus, we can use Maximum Likelihood Estimation just as before:
To estimate the transition probability P(VB | PN), we ask: How often do we see VB given that the previous tag was PN?
To estimate the output probability P(jag | PN), we ask: How often do we see the word jag when the tag is PN?
We can also use various smoothing techniques just as before.
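To make the estimation concrete, here is a minimal Python sketch (an illustration, not the course's reference implementation). It assumes training data given as tagged sentences, i.e. lists of (word, tag) pairs, and uses plain MLE without smoothing.

```python
from collections import Counter

def train_hmm(tagged_sentences):
    """MLE estimation of an HMM from tagged sentences.

    tagged_sentences: iterable of sentences, each a list of (word, tag) pairs.
    Returns transitions[prev_tag][tag] and outputs[tag][word] as nested dicts.
    """
    transition_counts = Counter()  # counts of (previous tag, tag), with BOS/EOS
    output_counts = Counter()      # counts of (tag, word)
    context_counts = Counter()     # how often each tag occurs as a previous tag
    tag_counts = Counter()         # how often each tag occurs at all

    for sentence in tagged_sentences:
        prev_tag = "BOS"
        for word, tag in sentence:
            transition_counts[prev_tag, tag] += 1
            context_counts[prev_tag] += 1
            output_counts[tag, word] += 1
            tag_counts[tag] += 1
            prev_tag = tag
        transition_counts[prev_tag, "EOS"] += 1
        context_counts[prev_tag] += 1

    transitions, outputs = {}, {}
    for (prev_tag, tag), n in transition_counts.items():
        transitions.setdefault(prev_tag, {})[tag] = n / context_counts[prev_tag]
    for (tag, word), n in output_counts.items():
        outputs.setdefault(tag, {})[word] = n / tag_counts[tag]
    return transitions, outputs
```

Additive smoothing would replace the two divisions with smoothed estimates, just as for n-gram models.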
Probability of a tagged sentence
jag/PN bad/VB om/PP en/DT kort/JJ bit/NN
The probability is the product of transition and output probabilities:
P = P(PN | BOS) · P(jag | PN) · P(VB | PN) · P(bad | VB) · P(PP | VB) · P(om | PP) · P(DT | PP) · P(en | DT) · P(JJ | DT) · P(kort | JJ) · P(NN | JJ) · P(bit | NN) · P(EOS | NN)
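Given the two probability tables, this product can be computed in a single left-to-right pass. A minimal sketch, assuming the transitions and outputs dictionaries from the estimation sketch above; it works with negative log probabilities (costs), which avoids underflow and matches the later slides.

```python
import math

def sentence_cost(words, tags, transitions, outputs):
    """Negative log probability of a tagged sentence under the HMM."""
    cost = 0.0
    prev_tag = "BOS"
    for word, tag in zip(words, tags):
        cost += -math.log(transitions[prev_tag][tag])  # transition cost
        cost += -math.log(outputs[tag][word])          # output cost
        prev_tag = tag
    cost += -math.log(transitions[prev_tag]["EOS"])    # transition into EOS
    return cost

# e.g. sentence_cost("jag bad om en kort bit".split(),
#                    ["PN", "VB", "PP", "DT", "JJ", "NN"], transitions, outputs)
```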
Tagging with a hidden Markov model
Given a sentence, we want to find a sequence of tags such that the probability of the tagged sentence is maximal. The tag sequence is not given in advance; it is hidden!
For each sentence there are many different tag sequences with many different probabilities (combinatorial explosion).
In spite of this, the most probable tag sequence can be found efficiently using the Viterbi algorithm.
Sample exam question
You want to compute the probability of this tagged sentence in an HMM:
jag/PN skrev/VB på/PL utan/PP att/IE tveka/VB
You can ask the model for its atomic probabilities, but each such question costs 1 crown. Which questions do you need to ask, and how much do you have to pay?
The Viterbi Algorithm
Probability of a tagged sentence
jag/PN bad/VB om/PP en/DT kort/JJ bit/NN
The probability is the product of transition and output probabilities:
P = P(PN | BOS) · P(jag | PN) · P(VB | PN) · P(bad | VB) · P(PP | VB) · P(om | PP) · P(DT | PP) · P(en | DT) · P(JJ | DT) · P(kort | JJ) · P(NN | JJ) · P(bit | NN) · P(EOS | NN)
Tagging with a hidden Markov model
Given a sentence, we want to find a sequence of tags such that the probability of the tagged sentence is maximal. The tag sequence is not given in advance; it is hidden!
For each sentence there are many different tag sequences with many different probabilities (combinatorial explosion).
In spite of this, the most probable tag sequence can be found efficiently using the Viterbi algorithm.
High-level description
The algorithm takes as its inputs an HMM and a sentence and computes the most probable tag sequence for the sentence.
The algorithm fills a matrix that contains one row for each possible tag (including BOS and EOS) and one column for each position in the sentence.
In this presentation we fill the matrix with negative log probabilities; we can interpret them as costs in crowns. We do this to avoid underflow.
The central invariant The algorithm should make sure that the value in row t, column i is the minimal cost needed to tag the first i words in the sentence in such a way that word number i is tagged as t. Remember that minimal cost = maximal probability. If the algorithm can achieve this, then we can read off the least possible cost to tag the complete sentence from the last column.
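Written as a recurrence (a LaTeX rendering of the invariant, with V(t, i) for the value in row t, column i, and w_i for the i-th word; V(t', 0) is 0 for BOS and infinite for all other tags):

```latex
V(t, i) = \min_{t'} \bigl[\, V(t', i-1) - \log P(t \mid t') - \log P(w_i \mid t) \,\bigr],
\qquad V(\mathrm{BOS}, 0) = 0
```

The least possible cost for a complete sentence of length n is then min over t of [ V(t, n) - log P(EOS | t) ].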
The completed Viterbi matrix for jag bad om en kort bit (values are costs, i.e. negative log probabilities):

       start  jag(1)  bad(2)   om(3)   en(4)  kort(5)  bit(6)   end
BOS     0.00
DT             14.49   21.33   29.38   24.82   42.62   50.67
JJ             15.46   21.13   29.88   35.22   33.00   48.36
NN             11.22   19.53   29.74   33.58   35.44   41.63
PN              5.35   21.43   28.86   29.86   42.50   50.81
PP             14.59   20.02   20.70   38.53   42.41   48.32
VB             16.11   14.83   29.53   39.65   43.08   49.15
EOS                                                            45.93
Hidden Markov model 1: Transition costs

Previous / next    PN     VB     PP     DT     JJ     NN    EOS
BOS              1.69   3.58   2.25   2.50   3.37   1.76  11.19
PN               4.00   0.69   2.34   4.00   3.69   3.85   7.94
VB               1.95   2.17   2.04   2.56   2.97   2.18   6.87
PP               3.09   6.42   5.49   1.82   2.43   0.85   8.38
DT               5.61  10.22   5.26   5.82   0.93   0.84  10.22
JJ               5.73   3.62   2.98   5.68   3.28   0.43   6.35
NN               5.30   1.70   1.49   5.17   4.23   3.11   4.30
Hidden Markov model 2: Observation costs

        jag     bad      om      en    kort     bit
PN     3.66   12.08   12.08    6.08   12.08   12.08
VB    12.53    8.79   12.53   12.53   12.53   12.53
PP    12.33   12.33    3.83   12.33   12.33   12.33
DT    11.99   11.99   11.99    2.29   11.99   11.99
JJ    12.09   12.09   12.09   12.09    7.25   12.09
NN     9.47   10.33   12.73   12.03    9.78    8.19
Filling the matrix step by step. The matrix starts with only BOS = 0.00 filled.
Cell (DT, column 1, word jag): 0.00 + cost(DT | BOS) + cost(jag | DT) = 0.00 + 2.50 + 11.99 = 14.49
(costs taken from the transition and observation tables above)
Cell (PN, column 1, word jag): 0.00 + cost(PN | BOS) + cost(jag | PN) = 0.00 + 1.69 + 3.66 = 5.35
The first column so far: DT 14.49, JJ 15.46, NN 11.22, PN 5.35; PP and VB are filled in the same way.
Cell (DT, column 4, word en), with predecessor PN in column 3 (one candidate among several):
28.86 + cost(DT | PN) + cost(en | DT) = 28.86 + 4.00 + 2.29 = 35.15
Cell (DT, column 4, word en), with predecessor PP in column 3:
20.70 + cost(DT | PP) + cost(en | DT) = 20.70 + 1.82 + 2.29 = 24.82
This is the cheapest candidate, so the cell gets the value 24.82.
Final step, with the matrix completely filled: the EOS cell is the minimum over the last column of cell value plus transition cost into EOS, obtained for NN:
41.63 + cost(EOS | NN) = 41.63 + 4.30 = 45.93
The completed matrix (shown above), with the EOS cell 45.93. Follow the backpointers to read off the sequence.
The Viterbi matrix for jag skrev på utan att tveka:

       start  jag(1) skrev(2)  på(3) utan(4)  att(5) tveka(6)   end
BOS     0.00
IE             17.22   21.69   30.02   33.79   34.63   54.70
PL             21.77   21.20   22.10   39.77   49.28   55.06
PN              5.35   21.43   27.87   33.85   44.12   48.09
PP             14.59   20.02   18.69   28.95   44.66   50.70
SN             15.83   21.51   29.20   34.29   35.24   51.40
VB             16.11   13.84   28.54   37.64   43.96   44.86
EOS                                                            51.74

It does not suffice to pick the best cell in each column!
Computational complexity Let m, n denote the number of tags in the HMM and the length of the input sentence, respectively. The memory required by the Viterbi algorithm is in O(mn); this corresponds to the size of the matrix. The runtime required by the Viterbi algorithm is in O(m 2 n): We need to fill O(mn) cells, and each cell requires us to look at O(m) cells in the previous column.
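For reference, a minimal Viterbi sketch in Python (illustrative names, not the course's reference implementation). It assumes cost dictionaries shaped like the tables above: transition_costs[previous][next] and output_costs[tag][word], both holding negative log probabilities, with BOS and EOS as special previous/next tags.

```python
def viterbi(words, tags, transition_costs, output_costs):
    """Return the least-cost tag sequence for `words`, and its total cost."""
    n = len(words)
    # best[i][t]: minimal cost of tagging words[0..i] with words[i] tagged t
    best = [dict() for _ in range(n)]
    backpointer = [dict() for _ in range(n)]

    for t in tags:  # first column: only BOS can precede
        best[0][t] = transition_costs["BOS"][t] + output_costs[t][words[0]]
        backpointer[0][t] = "BOS"

    for i in range(1, n):  # remaining columns, left to right
        for t in tags:
            cost, prev = min(
                (best[i - 1][p] + transition_costs[p][t], p) for p in tags
            )
            best[i][t] = cost + output_costs[t][words[i]]
            backpointer[i][t] = prev

    # final step: minimise cell value plus transition cost into EOS
    final_cost, last_tag = min(
        (best[n - 1][t] + transition_costs[t]["EOS"], t) for t in tags
    )

    # follow the backpointers to read off the sequence
    sequence = [last_tag]
    for i in range(n - 1, 0, -1):
        sequence.append(backpointer[i][sequence[-1]])
    return list(reversed(sequence)), final_cost
```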
Overview of this section Introduction to part-of-speech tagging Evaluation of part-of-speech taggers Method 1: Part-of-speech tagging with hidden Markov models Method 2: Part-of-speech tagging with perceptrons
Part-of-Speech Tagging with Perceptrons
Part-of-speech tagging as classification
Part-of-speech tagging can be cast as a sequence of classification problems: one classification per word in the sentence.
Based on this idea, any method for classification can be used to build a part-of-speech tagger, for example Naive Bayes.
Here we use a very simple non-probabilistic method called the multi-class perceptron.
The multi-class perceptron
Diagram: the input features x1, x2 are connected by weights w1, w2 to one summation unit per class; each unit computes an activation a1 or a2.
activation = weighted sum of the features
Interpretation of feature weights
Features whose weights are zero do not contribute to the activation; such features are ignored.
Features whose weights are positive cause the activation to increase; they suggest that the input belongs to the class.
Features whose weights are negative cause the activation to decrease; they suggest that the input falls outside of the class.
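A minimal multi-class perceptron fits in a few lines. The sketch below is illustrative (the class and method names are not from the course material); each input is represented as a dict mapping feature names to values, which for tagging are typically binary indicator features.

```python
from collections import defaultdict

class MultiClassPerceptron:
    def __init__(self, classes):
        self.classes = list(classes)
        # one weight vector per class, stored as feature name -> weight
        self.weights = {c: defaultdict(float) for c in self.classes}

    def score(self, features, c):
        # activation = weighted sum of the features
        return sum(self.weights[c][f] * v for f, v in features.items())

    def predict(self, features):
        # choose the class with the highest activation
        return max(self.classes, key=lambda c: self.score(features, c))

    def update(self, features, gold):
        # on an error, move the weights towards the gold class
        # and away from the wrongly predicted class
        predicted = self.predict(features)
        if predicted != gold:
            for f, v in features.items():
                self.weights[gold][f] += v
                self.weights[predicted][f] -= v
```

Training consists of repeatedly calling update on the training examples, typically for several epochs; averaging the weights over all updates is a common refinement.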
Part-of-speech tagging with a perceptron
jag bad om en kort bit
Scores for the first word (jag): NN 9.36, PN 81.72, VB 9.18
Part-of-speech tagging with a perceptron
jag bad om en kort bit
The highest-scoring class for jag is PN (81.72), so jag is tagged PN.
Part-of-speech tagging with a perceptron
jag/PN bad om en kort bit
Scores for the second word (bad): NN 16.08, PN 4.02, VB 64.32
Part-of-speech tagging with a perceptron
jag/PN bad om en kort bit
The highest-scoring class for bad is VB (64.32), so bad is tagged VB.
Part-of-speech tagging with a perceptron
jag/PN bad/VB om en kort bit
Tagging continues word by word in the same way.
Feature windows
Hidden Markov models look back one step; but sometimes it is a good idea to look back further, or to look ahead!
Jag bad om en kort bit.
At the same time, we do not want the classifier to see too much information (efficiency, data sparseness).
A compromise is to define a limited feature window.
Comparison between the two methods
Part-of-speech tagging with hidden Markov models: probabilistic; exhaustive search for the best sequence (Viterbi algorithm); limited possibilities to define features (current word, previous tag).
Part-of-speech tagging with multi-class perceptrons: non-probabilistic; no search, locally optimal decisions; more possibilities to define features (feature windows).
Feature window
jag bad om en kort bit (padded with BOS and EOS)
With this feature window, we see the current word, the previous word, the next word, and the previous tag (here PN, the tag already assigned to jag).
Feature window
jag bad om en kort bit (padded with BOS and EOS)
The feature window moves forward during tagging.
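A sketch of feature extraction with this window (the feature names and the BOS/EOS padding are illustrative):

```python
def window_features(words, i, prev_tag):
    """Features for tagging word number i, given the tag chosen for word i-1."""
    prev_word = words[i - 1] if i > 0 else "BOS"
    next_word = words[i + 1] if i + 1 < len(words) else "EOS"
    return {
        "current_word=" + words[i]: 1.0,
        "previous_word=" + prev_word: 1.0,
        "next_word=" + next_word: 1.0,
        "previous_tag=" + prev_tag: 1.0,
    }
```

A greedy tagger then moves left to right over the sentence, calling the perceptron's predict on these features and feeding the predicted tag back in as prev_tag for the next position.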
Comparison between the two methods
Tagging accuracy on the SUC test set:
Hidden Markov model (Viterbi search, HMM features): 92.71%
Multi-class perceptron (greedy search, fine-tuned features): 95.30%
The two intermediate configurations, which mix the search strategies and feature sets, score 89.97% and 88.86%.
Overview of this section Introduction to part-of-speech tagging Evaluation of part-of-speech taggers Method 1: Part-of-speech tagging with hidden Markov models Method 2: Part-of-speech tagging with perceptrons