Part-of-Speech Tagging

1 TDDE09, 729A27 Natural Language Processing (2017): Part-of-Speech Tagging. Marco Kuhlmann, Department of Computer and Information Science. This work is licensed under a Creative Commons Attribution 4.0 International License.

2 Parts of speech
A part of speech is a category of words that play similar roles within the syntactic structure of a sentence.
Parts of speech can be defined distributionally or functionally.
  Distributionally: Kim saw the {elephant, movie, mountain, error} before we did.
  Functionally: verbs = predicates; nouns = arguments; adverbs = modify verbs, ...
There are many different tag sets for parts of speech, reflecting different languages, different levels of granularity, and different design principles.

3 Universal part-of-speech tags

Tag     Category          Examples
ADJ     adjective         big, old
ADV     adverb            very, well
INTJ    interjection      ouch!
NOUN    noun              girl, cat, tree
VERB    verb              run, eat
PROPN   proper noun       Mary, John
ADP     adposition        in, to, during
AUX     auxiliary verb    has, was
CCONJ   conjunction       and, or, but
DET     determiner        a, my, this
NUM     cardinal numbers  0, one
PRON    pronoun           I, myself, this

Missing: PART, SCONJ, PUNCT, SYM, X
Source: Universal Dependencies Project

4 Part-of-speech tagging A part-of-speech tagger is a computer program that tags each word in a sentence with its part of speech. Part-of-speech tagging can be approached as a supervised machine learning problem. This requires training data. Part-of-speech taggers are commonly evaluated using accuracy, precision, and recall.

5 Ambiguity causes combinatorial explosion

jag     bad     om      en      kort    bit
PN      VB      PP      DT      JJ      NN
NN      NN      SN      PN      AB      VB
                PL      RG      NN
                AB      NN

Example by Joakim Nivre

6 Overview of this section
  Introduction to part-of-speech tagging
  Evaluation of part-of-speech taggers
  Method 1: Part-of-speech tagging with hidden Markov models
  Method 2: Part-of-speech tagging with perceptrons

7 Evaluation of Part-of-Speech Taggers

8 A reminder about machine learning methodology
  Training data: used to train a machine learning system.
  Development data: used to evaluate during development and to set hyperparameters (for example, the smoothing parameter in additive smoothing).
  Test data: used to evaluate the final system.

9 Stockholm Umeå Corpus (SUC)
SUC is the largest manually annotated corpus for written Swedish. It was created in the early 1990s as a collaboration between Stockholm University and Umeå University.
SUC contains more than 1.1 million tokens; these are annotated with parts of speech, morphological features, and lemmas.
SUC is a balanced corpus with texts from different genres.

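10 Accuracy
[Confusion matrix over the tags DT, JJ, NN, PP, VB: predicted tag against gold-standard tag.]
Accuracy is the proportion of tokens whose predicted tag equals the gold-standard tag, that is, the diagonal of the confusion matrix divided by the total number of tokens.

11 Precision with respect to NN
[Confusion matrix over the tags DT, JJ, NN, PP, VB: predicted tag against gold-standard tag.]
Precision with respect to NN is the proportion of tokens predicted as NN whose gold-standard tag is also NN.

12 Recall with respect to NN
[Confusion matrix over the tags DT, JJ, NN, PP, VB: predicted tag against gold-standard tag.]
Recall with respect to NN is the proportion of tokens with gold-standard tag NN that were also predicted as NN.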
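
The three measures on slides 10 to 12 can be computed from (gold tag, predicted tag) counts. The following is a minimal Python sketch; the function names and the six-token toy data are hypothetical and are not taken from SUC or from the slides.

```python
from collections import Counter


def confusion_counts(gold, predicted):
    """Count (gold tag, predicted tag) pairs over two aligned tag sequences."""
    return Counter(zip(gold, predicted))


def accuracy(counts):
    """Proportion of tokens whose predicted tag equals the gold-standard tag."""
    correct = sum(n for (g, p), n in counts.items() if g == p)
    return correct / sum(counts.values())


def precision(counts, tag):
    """Of all tokens predicted as `tag`, the proportion that really are `tag`."""
    predicted_as_tag = sum(n for (g, p), n in counts.items() if p == tag)
    return counts[(tag, tag)] / predicted_as_tag if predicted_as_tag else 0.0


def recall(counts, tag):
    """Of all tokens whose gold tag is `tag`, the proportion predicted as `tag`."""
    gold_tag = sum(n for (g, p), n in counts.items() if g == tag)
    return counts[(tag, tag)] / gold_tag if gold_tag else 0.0


# Hypothetical toy data, for illustration only.
gold = ["NN", "VB", "NN", "JJ", "NN", "VB"]
predicted = ["NN", "VB", "JJ", "JJ", "NN", "NN"]
counts = confusion_counts(gold, predicted)

print(accuracy(counts))         # 4 of 6 tokens are tagged correctly
print(precision(counts, "NN"))  # 2 of the 3 tokens predicted as NN are NN
print(recall(counts, "NN"))     # 2 of the 3 gold NN tokens are found
```

Returning 0.0 when a tag is never predicted (or never occurs in the gold standard) is a design choice of this sketch; other conventions exist.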

13 Sample exam question
[Confusion matrix over the tags NN, JJ, VB: predicted tag against gold-standard tag.]
Compute (a) precision on adjectives, (b) recall on verbs.

15 Part-of-Speech Tagging with Hidden Markov Models

17 Different parts of speech have different frequencies
[Table: counts of each word (jag, bad, om, en, kort, bit) with each tag (PN, VB, PP, DT, JJ, NN). Data from the Stockholm Umeå Corpus.]

18 Different tag sequences have different frequencies
[Table: counts of each previous tag (PN, VB, PP, DT, JJ, NN) followed by each next tag (PN, VB, PP, DT, JJ, NN). Data from the Stockholm Umeå Corpus.]

19 Hidden Markov Model
A hidden Markov model (HMM) is a generalised Markov model with two types of probabilities:
  transition probabilities P(tag_2 | tag_1): How probable is it to see a verb after having seen a pronoun?
  output probabilities P(word | tag): How probable is it to see the word bad being tagged as a verb?

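20 [Diagram: a Markov model with states BOS, w_1, w_2, EOS. Edges are labelled with the transition probabilities P(w_1 | BOS), P(w_2 | BOS), P(w_1 | w_1), P(w_2 | w_1), P(w_1 | w_2), P(w_2 | w_2), P(EOS | w_1), P(EOS | w_2).]

21 [Diagram: a hidden Markov model with states BOS, VB, PN, EOS. Edges are labelled with the transition probabilities P(VB | BOS), P(PN | BOS), P(VB | VB), P(PN | VB), P(VB | PN), P(PN | PN), P(EOS | VB), P(EOS | PN). Each of the states VB and PN carries a table of output probabilities P(w | VB) and P(w | PN) for the words jag and bad.]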

22 Learning hidden Markov models
To learn a hidden Markov model from a corpus, we can use Maximum Likelihood Estimation just as before:
  To estimate the transition probability P(VB | PN), we ask: How often do we see VB given that the previous tag was PN?
  To estimate the output probability P(jag | PN), we ask: How often do we see the word jag when the tag is PN?
We can also use various smoothing techniques just as before.
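
The counting procedure on slide 22 can be sketched in a few lines of Python. This is a minimal illustration with unsmoothed relative frequencies; the function name `train_hmm` and the three-sentence toy corpus are hypothetical, and a real tagger would add smoothing as the slide notes.

```python
from collections import Counter


def train_hmm(tagged_sentences):
    """Estimate transition and output probabilities by relative frequency (MLE)."""
    transition_counts = Counter()  # (previous tag, tag) pairs
    context_counts = Counter()     # how often each tag occurs as the previous tag
    output_counts = Counter()      # (tag, word) pairs
    tag_counts = Counter()         # how often each tag occurs
    for sentence in tagged_sentences:
        previous = "BOS"
        for word, tag in sentence:
            transition_counts[(previous, tag)] += 1
            context_counts[previous] += 1
            output_counts[(tag, word)] += 1
            tag_counts[tag] += 1
            previous = tag
        # Every sentence ends with a transition into EOS.
        transition_counts[(previous, "EOS")] += 1
        context_counts[previous] += 1
    transition = {(p, t): n / context_counts[p]
                  for (p, t), n in transition_counts.items()}
    output = {(t, w): n / tag_counts[t]
              for (t, w), n in output_counts.items()}
    return transition, output


# Hypothetical toy corpus, for illustration only.
corpus = [
    [("jag", "PN"), ("bad", "VB")],
    [("hon", "PN"), ("bad", "VB")],
    [("jag", "PN"), ("såg", "VB"), ("henne", "PN")],
]
transition, output = train_hmm(corpus)

print(transition[("PN", "VB")])  # P(VB | PN): 3 of the 4 PN tokens are followed by VB
print(output[("PN", "jag")])     # P(jag | PN): 2 of the 4 PN tokens are the word jag
```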

23 Probability of a tagged sentence
jag/PN bad/VB om/PP en/DT kort/JJ bit/NN
P(PN | BOS) P(jag | PN) P(VB | PN) P(bad | VB) P(PP | VB) P(om | PP) P(DT | PP) P(en | DT) P(JJ | DT) P(kort | JJ) P(NN | JJ) P(bit | NN) P(EOS | NN)
The probability of the tagged sentence is the product of these transition and output probabilities.
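
In general form, the product on slide 23 can be written as follows. The symbols w_i, t_i, and n are introduced here for illustration and do not appear on the slides; t_0 stands for BOS.

```latex
P(w_1 \dots w_n,\; t_1 \dots t_n)
  = \left( \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \, P(w_i \mid t_i) \right)
    \cdot P(\mathrm{EOS} \mid t_n),
  \qquad t_0 = \mathrm{BOS}
```

For a sentence of n words this gives n + 1 transition probabilities and n output probabilities, which matches the 7 + 6 factors for the six-word example.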

24 Tagging with a hidden Markov model
Given a sentence, we want to find a sequence of tags such that the probability of the tagged sentence is maximal.
The tag sequence is not given in advance; it is hidden!
For each sentence there are many different tag sequences with many different probabilities (combinatorial explosion).
In spite of this, the most probable tag sequence can be found efficiently using the Viterbi algorithm.

25 Sample exam question
You want to compute the probability of this tagged sentence in an HMM:
jag/PN skrev/VB på/PL utan/PP att/IE tveka/VB
You can ask the model for its atomic probabilities, but each such question costs 1 crown. Which questions do you need to ask, and how much do you have to pay?

26 The Viterbi Algorithm

29 High-level description
The algorithm takes as its inputs an HMM and a sentence and computes the most probable tag sequence for the sentence.
The algorithm fills a matrix that contains one row for each possible tag and one column for each position in the sentence (including BOS and EOS).
In this presentation we fill the matrix with negative log probabilities; we can interpret them as costs in crowns. We do this to avoid underflow.
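
The underflow problem mentioned on slide 29 is easy to reproduce. The sketch below uses a hypothetical sequence of 100 probabilities of 0.00001 each; the numbers are chosen only to trigger the effect.

```python
import math

# Hypothetical per-step probabilities, for illustration only.
probabilities = [1e-5] * 100

# Multiplying the probabilities directly: the true value is 1e-500, which is
# smaller than the smallest positive double, so the product underflows to 0.0.
product = 1.0
for p in probabilities:
    product *= p

# Summing negative log probabilities ("costs") instead stays in a safe range.
cost = sum(-math.log(p) for p in probabilities)

print(product)  # 0.0
print(cost)     # a finite positive number
```

Because the logarithm is monotone, the tag sequence with the smallest total cost is also the one with the largest probability.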

30 The central invariant The algorithm should make sure that the value in row t, column i is the minimal cost needed to tag the first i words in the sentence in such a way that word number i is tagged as t. Remember that minimal cost = maximal probability. If the algorithm can achieve this, then we can read off the least possible cost to tag the complete sentence from the last column.

31
              jag     bad     om      en      kort    bit
              1       2       3       4       5       6
BOS   0.00
DT            14.49   21.33   29.38   24.82   42.62   50.67
JJ            15.46   21.13   29.88   35.22   33.00   48.36
NN            11.22   19.53   29.74   33.58   35.44   41.63
PN            5.35    21.43   28.86   29.86   42.50   50.81
PP            14.59   20.02   20.70   38.53   42.41   48.32
VB            16.11   14.83   29.53   39.65   43.08   49.15
EOS                                                           45.93

32 Hidden Markov model 1: Transition costs
(rows: previous tag; columns: next tag)

       PN      VB      PP      DT      JJ      NN      EOS
BOS    1.69    3.58    2.25    2.50    3.37    1.76    11.19
PN     4.00    0.69    2.34    4.00    3.69    3.85    7.94
VB     1.95    2.17    2.04    2.56    2.97    2.18    6.87
PP     3.09    6.42    5.49    1.82    2.43    0.85    8.38
DT     5.61    10.22   5.26    5.82    0.93    0.84    10.22
JJ     5.73    3.62    2.98    5.68    3.28    0.43    6.35
NN     5.30    1.70    1.49    5.17    4.23    3.11    4.30

33 Hidden Markov model 2: Observation costs
(rows: tag; columns: word)

       jag     bad     om      en      kort    bit
PN     3.66    12.08   12.08   6.08    12.08   12.08
VB     12.53   8.79    12.53   12.53   12.53   12.53
PP     12.33   12.33   3.83    12.33   12.33   12.33
DT     11.99   11.99   11.99   2.29    11.99   11.99
JJ     12.09   12.09   12.09   12.09   7.25    12.09
NN     9.47    10.33   12.73   12.03   9.78    8.19

34
              jag     bad     om      en      kort    bit
              1       2       3       4       5       6
BOS   0.00
DT            14.49
JJ
NN
PN
PP
VB
EOS

Row DT, column 1: -log P(DT | BOS) - log P(jag | DT) = 2.50 + 11.99 = 14.49

35
              jag     bad     om      en      kort    bit
              1       2       3       4       5       6
BOS   0.00
DT            14.49
JJ            15.46
NN            11.22
PN            5.35
PP
VB
EOS

Row PN, column 1: -log P(PN | BOS) - log P(jag | PN) = 1.69 + 3.66 = 5.35

36
              jag     bad     om      en      kort    bit
              1       2       3       4       5       6
BOS   0.00
DT            14.49   21.33   29.38   35.15
JJ            15.46   21.13   29.88
NN            11.22   19.53   29.74
PN            5.35    21.43   28.86
PP            14.59   20.02   20.70
VB            16.11   14.83   29.53
EOS

Row DT, column 4, coming from PN in column 3: 28.86 - log P(DT | PN) - log P(en | DT) = 28.86 + 4.00 + 2.29 = 35.15

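37
              jag     bad     om      en      kort    bit
              1       2       3       4       5       6
BOS   0.00
DT            14.49   21.33   29.38   24.82
JJ            15.46   21.13   29.88
NN            11.22   19.53   29.74
PN            5.35    21.43   28.86
PP            14.59   20.02   20.70
VB            16.11   14.83   29.53
EOS

Row DT, column 4, coming from PP in column 3: 20.70 - log P(DT | PP) - log P(en | DT) = 20.70 + 1.82 + 2.29 ≈ 24.82 (the displayed costs are rounded). The cell keeps the minimum over all possible predecessors.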

38
              jag     bad     om      en      kort    bit
              1       2       3       4       5       6
BOS   0.00
DT            14.49   21.33   29.38   24.82   42.62   50.67
JJ            15.46   21.13   29.88   35.22   33.00   48.36
NN            11.22   19.53   29.74   33.58   35.44   41.63
PN            5.35    21.43   28.86   29.86   42.50   50.81
PP            14.59   20.02   20.70   38.53   42.41   48.32
VB            16.11   14.83   29.53   39.65   43.08   49.15
EOS                                                           45.93

Row EOS, coming from NN in column 6: 41.63 - log P(EOS | NN) = 41.63 + 4.30 = 45.93

39
              jag     bad     om      en      kort    bit
              1       2       3       4       5       6
BOS   0.00
DT            14.49   21.33   29.38   24.82   42.62   50.67
JJ            15.46   21.13   29.88   35.22   33.00   48.36
NN            11.22   19.53   29.74   33.58   35.44   41.63
PN            5.35    21.43   28.86   29.86   42.50   50.81
PP            14.59   20.02   20.70   38.53   42.41   48.32
VB            16.11   14.83   29.53   39.65   43.08   49.15
EOS                                                           45.93

Follow the backpointers to read off the sequence.
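
The matrix-filling procedure walked through on slides 34 to 39 can be sketched in Python as follows. The cost tables are copied from slides 32 and 33; the function name `viterbi` and the data layout are assumptions of this sketch, which handles only non-empty sentences made of the six words in the table. Because the slide values are rounded to two decimals, the total cost computed here can differ from the 45.93 on the slides by a few hundredths.

```python
TAGS = ["PN", "VB", "PP", "DT", "JJ", "NN"]
WORDS = ["jag", "bad", "om", "en", "kort", "bit"]

# Transition costs -log P(next | previous), copied from slide 32.
_T = {
    "BOS": [1.69, 3.58, 2.25, 2.50, 3.37, 1.76, 11.19],
    "PN": [4.00, 0.69, 2.34, 4.00, 3.69, 3.85, 7.94],
    "VB": [1.95, 2.17, 2.04, 2.56, 2.97, 2.18, 6.87],
    "PP": [3.09, 6.42, 5.49, 1.82, 2.43, 0.85, 8.38],
    "DT": [5.61, 10.22, 5.26, 5.82, 0.93, 0.84, 10.22],
    "JJ": [5.73, 3.62, 2.98, 5.68, 3.28, 0.43, 6.35],
    "NN": [5.30, 1.70, 1.49, 5.17, 4.23, 3.11, 4.30],
}
TRANSITION = {prev: dict(zip(TAGS + ["EOS"], row)) for prev, row in _T.items()}

# Observation costs -log P(word | tag), copied from slide 33.
_O = {
    "PN": [3.66, 12.08, 12.08, 6.08, 12.08, 12.08],
    "VB": [12.53, 8.79, 12.53, 12.53, 12.53, 12.53],
    "PP": [12.33, 12.33, 3.83, 12.33, 12.33, 12.33],
    "DT": [11.99, 11.99, 11.99, 2.29, 11.99, 11.99],
    "JJ": [12.09, 12.09, 12.09, 12.09, 7.25, 12.09],
    "NN": [9.47, 10.33, 12.73, 12.03, 9.78, 8.19],
}
OUTPUT = {tag: dict(zip(WORDS, row)) for tag, row in _O.items()}


def viterbi(sentence):
    """Return (tags, cost) of the cheapest tag sequence for a non-empty sentence."""
    # Each column maps a tag t to (minimal cost with the current word tagged t,
    # backpointer to the best tag for the previous word).
    first = {t: (TRANSITION["BOS"][t] + OUTPUT[t][sentence[0]], None) for t in TAGS}
    columns = [first]
    for word in sentence[1:]:
        previous = columns[-1]
        column = {}
        for t in TAGS:
            best = min(TAGS, key=lambda p: previous[p][0] + TRANSITION[p][t])
            cost = previous[best][0] + TRANSITION[best][t] + OUTPUT[t][word]
            column[t] = (cost, best)
        columns.append(column)
    # Final step: add the cost of the transition into EOS.
    last = columns[-1]
    best_last = min(TAGS, key=lambda t: last[t][0] + TRANSITION[t]["EOS"])
    total = last[best_last][0] + TRANSITION[best_last]["EOS"]
    # Follow the backpointers from the last word to the first.
    tags = [best_last]
    for column in reversed(columns[1:]):
        tags.append(column[tags[-1]][1])
    tags.reverse()
    return tags, total


tags, total = viterbi(["jag", "bad", "om", "en", "kort", "bit"])
print(tags)
print(round(total, 2))
```

The two nested loops over `TAGS` inside the loop over the words are what give the quadratic dependence on the number of tags.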

40
              jag     skrev   på      utan    att     tveka
              1       2       3       4       5       6
BOS   0.00
IE            17.22   21.69   30.02   33.79   34.63   54.70
PL            21.77   21.20   22.10   39.77   49.28   55.06
PN            5.35    21.43   27.87   33.85   44.12   48.09
PP            14.59   20.02   18.69   28.95   44.66   50.70
SN            15.83   21.51   29.20   34.29   35.24   51.40
VB            16.11   13.84   28.54   37.64   43.96   44.86
EOS                                                           51.74

It does not suffice to pick the best cell in each column!

41 Computational complexity
Let m, n denote the number of tags in the HMM and the length of the input sentence, respectively.
The memory required by the Viterbi algorithm is in O(mn); this corresponds to the size of the matrix.
The runtime required by the Viterbi algorithm is in O(m^2 n): we need to fill O(mn) cells, and each cell requires us to look at O(m) cells in the previous column.

43 Part-of-Speech Tagging with Perceptrons

44 Part-of-speech tagging as classification
Part-of-speech tagging can be cast as a sequence of classification problems: one classification per word in the sentence.
Based on this idea, any method for classification can be used to build a part-of-speech tagger (for example, Naive Bayes).
Here we use a very simple non-probabilistic method called the multi-class perceptron.

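45 The multi-class perceptron
[Diagram: inputs x_1 and x_2 are connected through weights to two summation units Σ, one per class, which produce the activations a_1 and a_2.]
activation = weighted sum of the features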

46 Interpretation of feature weights
Features whose weights are zero do not contribute to the activation; such features are ignored.
Features whose weights are positive cause the activation to increase; they suggest that the input belongs to the class.
Features whose weights are negative cause the activation to decrease; they suggest that the input falls outside of the class.
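
Prediction with a multi-class perceptron, as described on slides 45 and 46, can be sketched like this. The sketch assumes binary features represented as strings, and all weights and feature names are hypothetical; training is not shown.

```python
def activation(weights, features):
    """Weighted sum of the features that are present (binary features assumed).

    Features without an entry in `weights` have weight zero and are ignored.
    """
    return sum(weights.get(feature, 0.0) for feature in features)


def predict(class_weights, features):
    """Return the class whose weight vector gives the highest activation."""
    return max(class_weights,
               key=lambda c: activation(class_weights[c], features))


# Hypothetical weights, one weight vector per class, for illustration only.
class_weights = {
    "PN": {"word=jag": 2.0, "prev_tag=BOS": 0.5},
    "VB": {"word=bad": 1.5, "prev_tag=PN": 1.0, "word=jag": -1.0},
    "NN": {"word=bad": 0.5, "word=bit": 1.0},
}

print(predict(class_weights, ["word=jag", "prev_tag=BOS"]))  # PN
print(predict(class_weights, ["word=bad", "prev_tag=PN"]))   # VB
```

In the first call, the negative weight of `word=jag` for the class VB lowers that activation, which is the "falls outside of the class" effect described above.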

47 Part-of-speech tagging with a perceptron
jag bad om en kort bit
Activations for jag: NN 9.36, PN 81.72, VB 9.18

48 Part-of-speech tagging with a perceptron
jag bad om en kort bit
Activations for jag, sorted: PN 81.72, NN 9.36, VB 9.18

49 Part-of-speech tagging with a perceptron
jag/PN bad om en kort bit
Activations for bad: NN 16.08, PN 4.02, VB 64.32

50 Part-of-speech tagging with a perceptron
jag/PN bad om en kort bit
Activations for bad, sorted: VB 64.32, NN 16.08, PN 4.02

51 Part-of-speech tagging with a perceptron
jag/PN bad/VB om en kort bit

52 Feature windows
Hidden Markov models look back one step; but sometimes it is a good idea to look back further, or to look ahead!
Example: Jag bad om en kort bit.
At the same time, we do not want the classifier to see too much information (efficiency, data sparseness).
A compromise is to define a limited feature window.

53 Comparison between the two methods
Part-of-speech tagging with hidden Markov models:
  probabilistic
  exhaustive search for the best sequence (Viterbi algorithm)
  limited possibilities to define features (current word, previous tag)
Part-of-speech tagging with multi-class perceptrons:
  non-probabilistic
  no search; locally optimal decisions
  more possibilities to define features (feature windows)

54 Feature window
[Diagram: the sentence jag bad om en kort bit, framed by BOS and EOS, with the tag PN shown.]
With this feature window, we see the current word, the previous word, the next word, and the previous tag.

55 Feature window
[Diagram: the sentence jag bad om en kort bit, framed by BOS and EOS, with the tag PN shown.]
The feature window moves forward during tagging.
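
A feature window of the kind shown on slides 54 and 55, combined with greedy left-to-right tagging, might be sketched as follows. The feature names, the use of "BOS" and "EOS" as padding for the missing neighbours, and the lookup-based `toy_classify` function are all assumptions of this sketch; in a real tagger the classifier would be a trained multi-class perceptron.

```python
def window_features(words, i, previous_tag):
    """Features for position i: current, previous, and next word, and previous tag."""
    previous_word = words[i - 1] if i > 0 else "BOS"
    next_word = words[i + 1] if i + 1 < len(words) else "EOS"
    return [
        f"word={words[i]}",
        f"prev_word={previous_word}",
        f"next_word={next_word}",
        f"prev_tag={previous_tag}",
    ]


def greedy_tag(words, classify):
    """Tag left to right; each decision is final and feeds the next window."""
    tags = []
    previous_tag = "BOS"
    for i in range(len(words)):
        tag = classify(window_features(words, i, previous_tag))
        tags.append(tag)
        previous_tag = tag
    return tags


def toy_classify(features):
    """Hypothetical stand-in for a trained perceptron: looks only at the word."""
    lookup = {"word=jag": "PN", "word=bad": "VB", "word=om": "PP",
              "word=en": "DT", "word=kort": "JJ", "word=bit": "NN"}
    return lookup[features[0]]


sentence = ["jag", "bad", "om", "en", "kort", "bit"]
print(window_features(sentence, 1, "PN"))
print(greedy_tag(sentence, toy_classify))
```

Unlike Viterbi search, this loop never revisits an earlier decision, which is what makes it greedy.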

56 Comparison between the two methods
Hidden Markov model: Viterbi search, HMM features
Multi-class perceptron: greedy search, fine-tuned features
Tagging accuracy on the SUC test set (values in slide order): 92.71%, 89.97%, 88.86%, 95.30%
