POS tagging. Intro to NLP - ETHZ - 11/03/2013

1 POS tagging Intro to NLP - ETHZ - 11/03/2013

2 Summary Parts of speech Tagsets Part of speech tagging HMM Tagging: Most likely tag sequence Probability of an observation Parameter estimation Evaluation

3 POS ambiguity "Squad helps dog bite victim" bite -> verb? bite -> noun?

4 Parts of Speech (PoS) Traditional parts of speech: Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc. Called: parts-of-speech, lexical categories, word classes, morphological classes, lexical tags...

5 Examples N (noun): car, squad, dog, bite, victim V (verb): help, bite ADJ (adjective): purple, tall ADV (adverb): unfortunately, slowly P (preposition): of, by, to PRO (pronoun): I, me, mine DET (determiner): the, a, that, those

6 Open and closed classes
1. Closed class: small, stable set
   a. Auxiliaries: may, can, will, been, ...
   b. Prepositions: of, in, by, ...
   c. Pronouns: I, you, she, mine, his, them, ...
   d. Usually function words: short, with a grammatical role
2. Open class:
   a. New ones are created all the time ("to google/tweet", "e-quaintance", "captcha", "cloud computing", "netbook", "webinar", "widget")
   b. English has 4: nouns, verbs, adjectives, adverbs
   c. Many languages have these 4, but not all!

7 Open class words
1. Nouns:
   a. Proper nouns: Zurich, IBM, Albert Einstein, The Godfather, ... Capitalized in many languages.
   b. Common nouns: the rest; also capitalized in German. Mass/count nouns (goat/goats, snow/*snows)
2. Verbs:
   a. Morphological affixes in English: eat/eats/eaten
3. Adverbs: tend to modify things:
   a. John walked home extremely slowly yesterday
   b. Directional/locative adverbs (here, home, downhill)
   c. Degree adverbs (extremely, very, somewhat)
   d. Manner adverbs (slowly, slinkily, delicately)
4. Adjectives: qualify nouns and noun phrases

8 Closed class words
1. prepositions: on, under, over
2. particles: up, down, on, off
3. determiners: a, an, the
4. pronouns: she, who, I, ...
5. conjunctions: and, but, or
6. auxiliary verbs: can, may, should
7. numerals: one, two, three, third, ...

9 Prepositions with corpus frequencies

10 Conjunctions

11 Pronouns

12 Auxiliaries

13 Applications A useful pre-processing step in many tasks Syntactic parsing: important source of information for syntactic analysis Machine translation Information retrieval: stemming, filtering Named entity recognition Summarization?

14 Applications Speech synthesis, for correct pronunciation of "ambiguous" words: lead /lid/ (guide) vs. /led/ (chemical) insult vs. INsult object vs. OBject overflow vs. OVERflow content vs. CONtent

15 Summarization Idea: Filter out sentences starting with certain PoS tags Use PoS statistics from gold standard titles (might need cross-validation)

16 Summarization Idea: Filter out sentences starting with certain PoS tags: Title1: "Apple introduced Siri, an intelligent personal assistant to which you can ask questions" Title2: "Especially now that a popular new feature from Apple is bound to make other phones users envious: voice control with Siri" Title3: "But Siri, Apple's personal assistant application on the iphone 4s, doesn't
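A minimal Python sketch of the filtering idea, assuming the candidate sentences are already PoS-tagged. The stop-list of sentence-initial tags (`RB`, `CC`) and the function name are illustrative assumptions; the slides suggest learning such statistics from gold-standard titles instead.

```python
# Hypothetical stop-list of sentence-initial tags (an assumption, not from the slides).
BAD_INITIAL_TAGS = {"RB", "CC"}


def keep_candidates(tagged_sentences, bad_tags=BAD_INITIAL_TAGS):
    """Keep only sentences whose first token's tag is not in bad_tags."""
    return [sent for sent in tagged_sentences if sent and sent[0][1] not in bad_tags]


# Only the first few tokens of each candidate title are shown, already tagged.
candidates = [
    [("Apple", "NNP"), ("introduced", "VBD"), ("Siri", "NNP")],
    [("Especially", "RB"), ("now", "RB"), ("that", "IN")],
    [("But", "CC"), ("Siri", "NNP")],
]

kept = keep_candidates(candidates)
print([" ".join(word for word, _ in sent) for sent in kept])
```

With this stop-list only the first candidate survives, since the other two open with an adverb and a conjunction.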

18 PoS tagging
The process of assigning a part-of-speech tag (label) to each word in a text. Pre-processing: tokenization.
Word:  Squad  helps  dog  bite  victim
Tag:   N      V      N    N     N

19 Choosing a tagset 1. There are many parts of speech, potential distinctions we can draw 2. For POS tagging, we need to choose a standard set of tags to work with 3. Coarse tagsets N, V, Adj, Adv. a. A universal PoS tagset? org/wiki/part-of-speech_tagging 4. More commonly used set is finer grained, the Penn TreeBank tagset, 45 tags

20 PTB tagset

21 Examples*
1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NN
2. Does/VBZ this/DT flight/NN serve/VB dinner/NNS
3. I/PRP have/VB a/DT friend/NN living/VBG in/IN Denver/NNP
4. Can/VBP you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS

22 Examples*
1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NNP
2. Does/VBZ this/DT flight/NN serve/VB dinner/NNS
3. I/PRP have/VB a/DT friend/NN living/VBG in/IN Denver/NNP
4. Can/VBP you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS

23 Examples*
1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NNP
2. Does/VBZ this/DT flight/NN serve/VB dinner/NN
3. I/PRP have/VB a/DT friend/NN living/VBG in/IN Denver/NNP
4. Can/VBP you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS

24 Examples*
1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NNP
2. Does/VBZ this/DT flight/NN serve/VB dinner/NN
3. I/PRP have/VBP a/DT friend/NN living/VBG in/IN Denver/NNP
4. Can/VBP you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS

25 Examples*
1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NNP
2. Does/VBZ this/DT flight/NN serve/VB dinner/NN
3. I/PRP have/VBP a/DT friend/NN living/VBG in/IN Denver/NNP
4. Can/MD you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS

26 Complexities
Book/VB that/DT flight/NN ./.
There/EX are/VBP 70/CD children/NNS there/RB ./.
Mrs./NNP Shaefer/NNP never/RB got/VBD around/RP to/TO joining/VBG ./.
All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT corner/NN ./.
Unresolvable ambiguity: The Duchess was entertaining last night.

27
Words      PoS WSJ   PoS Universal
The        DT        DET
oboist     NN        NOUN
Heinz      NNP       NOUN
Holliger   NNP       NOUN
has        VBZ       VERB
taken      VBN       VERB
a          DT        DET
hard       JJ        ADJ
line       NN        NOUN
about      IN        ADP
the        DT        DET
problems   NNS       NOUN
...
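The coarse-graining in this table amounts to a lookup from fine-grained tags to universal ones. The following Python sketch covers only the tags that appear in the table. The function name and the fallback tag `"X"` for unmapped tags are assumptions made for illustration.

```python
# Partial WSJ-to-universal mapping, restricted to the tags in the table above.
PTB_TO_UNIVERSAL = {
    "DT": "DET", "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",
    "VBZ": "VERB", "VBN": "VERB", "JJ": "ADJ", "IN": "ADP",
}


def to_universal(tagged, mapping=PTB_TO_UNIVERSAL, unknown="X"):
    """Map (word, WSJ tag) pairs to (word, universal tag) pairs."""
    return [(word, mapping.get(tag, unknown)) for word, tag in tagged]


wsj = [("The", "DT"), ("oboist", "NN"), ("Heinz", "NNP"), ("Holliger", "NNP"),
       ("has", "VBZ"), ("taken", "VBN"), ("a", "DT"), ("hard", "JJ"),
       ("line", "NN"), ("about", "IN"), ("the", "DT"), ("problems", "NNS")]

universal = to_universal(wsj)
print(universal)
```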

28 POS Tagging
Words often have more than one POS, e.g. "back":
   The back door = JJ
   On my back = NN
   Win the voters back = RB
   Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word.

29 Word type tag ambiguity

30 Methods 1. Rule-based a. Start with a dictionary b. Assign all possible tags c. Write rules by hand to remove tags in context 2. Stochastic a. Supervised/Unsupervised b. Generative/discriminative c. independent/structured output d. HMMs

31 Rule-based tagging 1. Start with a dictionary:
   she: PRP
   promised: VBN, VBD
   to: TO
   back: VB, JJ, RB, NN
   the: DT
   bill: NN, VB

32 Rule-based tagging 2. Assign all possible tags:
                      NN
                      RB
        VBN           JJ           NN
   PRP  VBD       TO  VB     DT    VB
   she  promised  to  back   the   bill

33 Rule-based tagging 3. Introduce rules to reduce ambiguity:
                      NN
                      RB
        VBN           JJ           NN
   PRP  VBD       TO  VB     DT    VB
   she  promised  to  back   the   bill
Rule: "<start> PRP {VBN, VBD}" -> "<start> PRP VBD"
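The three steps can be sketched in Python as follows. This is a toy illustration built from the example dictionary, not a full rule-based tagger. The function names are made up, and only the single rule shown above is implemented.

```python
# Step 1: toy dictionary from the example; each word lists all its possible tags.
LEXICON = {
    "she": {"PRP"}, "promised": {"VBN", "VBD"}, "to": {"TO"},
    "back": {"VB", "JJ", "RB", "NN"}, "the": {"DT"}, "bill": {"NN", "VB"},
}


def assign_all_tags(words, lexicon=LEXICON):
    """Step 2: give every word the full set of tags the dictionary allows."""
    return [set(lexicon[word]) for word in words]


def rule_start_prp_vbd(candidates):
    """Step 3: '<start> PRP {VBN, VBD}' -> '<start> PRP VBD'."""
    if (len(candidates) >= 2
            and candidates[0] == {"PRP"}
            and candidates[1] == {"VBN", "VBD"}):
        candidates[1] = {"VBD"}
    return candidates


words = "she promised to back the bill".split()
candidates = rule_start_prp_vbd(assign_all_tags(words))
for word, tags in zip(words, candidates):
    print(word, sorted(tags))
```

After this one rule, "promised" is resolved to VBD while "back" and "bill" remain ambiguous. A real system would need further hand-written rules to resolve them.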

34 Statistical models for POS tagging 1. A classic: one of the first successful applications of statistical methods in NLP 2. Extensively studied with all possible approaches (sequence models benchmark) 3. Simple to get started on: data, eval, literature 4. An introduction to more complex segmentation and labelling tasks: NER, shallow parsing, global optimization 5. An introduction to HMMs, used in many variants in POS tagging and related tasks.

35 Supervision and resources 1. Supervised case: data with words manually annotated with POS tags 2. Partially supervised: annotated data + unannotated data 3. Unsupervised: only raw text available 4. Resources: dictionaries listing each word's possible tags 5. Start with the supervised task

36 HMMs
HMM = (Q, O, A, B)
1. States: $Q = q_1 \dots q_N$ [the part-of-speech tags]
2. Observation symbols: $O = o_1 \dots o_V$ [words]
3. Transitions:
   a. $A = \{a_{ij}\}$; $a_{ij} = P(t_s = q_j \mid t_{s-1} = q_i)$
   b. $t_s \mid t_{s-1} = q_i \sim \mathrm{Multi}(a_i)$
   c. Special vector of initial/final probabilities
4. Emissions:
   a. $B = \{b_{ik}\}$; $b_{ik} = P(w_s = o_k \mid t_s = q_i)$
   b. $w_s \mid t_s = q_i \sim \mathrm{Multi}(b_i)$

37 Markov Chain Interpretation Tagging process as a (hidden) Markov process Independence assumptions 1. Limited horizon 2. Time-invariant 3. Observation depends only on the state

38 Complete data likelihood The joint probability of a sequence of words and tags, given a model: Generative process: 1. generate a tag sequence 2. emit the words for each tag
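Under the independence assumptions above, the joint probability factorizes in the standard HMM way. The following is a reconstruction in the notation of the HMM definition, not necessarily the exact formula shown on the slide; $t_0 = \text{START}$ is an assumed convention for the initial probabilities.

```latex
P(w_{1:N}, t_{1:N} \mid \theta)
  = \prod_{s=1}^{N} P(t_s \mid t_{s-1}) \, P(w_s \mid t_s),
  \qquad t_0 = \text{START}
```

If final probabilities are modeled as well, one extra factor $P(\text{END} \mid t_N)$ is appended to the product.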

39 Inference in HMMs Three fundamental problems: 1. Given an observation (e.g., a sentence), find the most likely sequence of states (e.g., POS tags) 2. Given an observation, compute its probability 3. Given a dataset of observations (sequences), estimate the model's parameters: theta = (A, B)

40 HMMs and FSA

41 HMMs and FSA also Bayes nets, directed graphical models, etc.

42 Other applications of HMMs NLP - Named entity recognition, shallow parsing - Word segmentation - Optical Character Recognition Speech recognition Computer Vision image segmentation Biology - Protein structure prediction Economics, Climatology, Robotics...

43 POS as sequence classification Observation: a sequence of N words $w_{1:N}$ Response: a sequence of N tags $t_{1:N}$ Task: find the predicted $t'_{1:N}$ such that $t'_{1:N} = \arg\max_{t_{1:N}} P(t_{1:N} \mid w_{1:N})$, i.e. the best possible tagging for the sequence.

44 Bayes rule reformulation
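A standard way to write this reformulation, given here as a reconstruction rather than the slide's own derivation: apply Bayes' rule to the posterior over tag sequences, drop the denominator because it does not depend on the tags, and then apply the HMM independence assumptions.

```latex
\begin{aligned}
t'_{1:N} &= \arg\max_{t_{1:N}} P(t_{1:N} \mid w_{1:N}) \\
         &= \arg\max_{t_{1:N}} \frac{P(w_{1:N} \mid t_{1:N}) \, P(t_{1:N})}{P(w_{1:N})} \\
         &= \arg\max_{t_{1:N}} P(w_{1:N} \mid t_{1:N}) \, P(t_{1:N}) \\
         &\approx \arg\max_{t_{1:N}} \prod_{s=1}^{N} P(w_s \mid t_s) \, P(t_s \mid t_{s-1})
\end{aligned}
```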

45 HMM POS tagging How can we find $t'_{1:N}$? Enumeration of all possible sequences?

46 HMM POS tagging How can we find $t'_{1:N}$? Enumeration of all possible sequences? $O(|\mathrm{Tagset}|^N)$! Dynamic programming: Viterbi algorithm

47 Viterbi algorithm
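A minimal Python sketch of Viterbi decoding for a first-order HMM with explicit START and END states. The transition and emission values below are hypothetical numbers chosen for illustration, not the lecture's model, and the names `TRANS`, `EMIT` and `viterbi` are likewise assumptions.

```python
STATES = ["N", "V"]

# Hypothetical parameters, for illustration only.
TRANS = {
    "START": {"N": 0.8, "V": 0.2},
    "N": {"N": 0.3, "V": 0.4, "END": 0.3},
    "V": {"N": 0.6, "V": 0.1, "END": 0.3},
}
EMIT = {
    "N": {"board": 0.4, "backs": 0.2, "plan": 0.4},
    "V": {"board": 0.3, "backs": 0.5, "plan": 0.2},
}


def viterbi(words, states=STATES, trans=TRANS, emit=EMIT):
    """Return the most likely tag sequence and its joint probability."""
    # delta[t][s]: probability of the best path ending in state s at position t
    delta = [{s: trans["START"][s] * emit[s].get(words[0], 0.0) for s in states}]
    back = [{}]  # back[t][s]: best predecessor of state s at position t
    for t in range(1, len(words)):
        delta.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: delta[t - 1][r] * trans[r][s])
            delta[t][s] = (delta[t - 1][prev] * trans[prev][s]
                           * emit[s].get(words[t], 0.0))
            back[t][s] = prev
    # Transition into END, then follow the back-pointers from the last position.
    last = max(states, key=lambda r: delta[-1][r] * trans[r]["END"])
    best_prob = delta[-1][last] * trans[last]["END"]
    tags = [last]
    for t in range(len(words) - 1, 0, -1):
        tags.append(back[t][tags[-1]])
    tags.reverse()
    return tags, best_prob


tags, prob = viterbi(["board", "backs", "plan"])
print(tags, prob)
```

The forward pass costs O(N · |Tagset|²) instead of the O(|Tagset|^N) of full enumeration, because each cell only needs the best score of each predecessor.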

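48 Example: model
A = transition matrix (rows: V, N, START; columns: N, V, END)
B = emission matrix (rows: V, N; columns: board, backs, plan)
(numeric entries not preserved)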

49 Example: observation Sentence: "Board backs plan". Find the most likely tag sequence. $\delta_s(t)$ = probability of the most likely path ending at state s at time t (written "d" in the trellis)

50-62 Viterbi algorithm: example (trellis for "board backs plan", filled in column by column; d = probability of the best path into each state)

Forward pass:
State   board (t=1)   backs (t=2)   plan (t=3)   END
V       d=.12         d=.050        d=.005
N       d=.24         d=.019        d=.016
END                                              d=.011

Backtrack: END <- N (t=3) <- V (t=2) <- N (t=1) <- START
Output: board/N backs/V plan/N

63 Observation probability Given HMM $\theta = (A, B)$ and observation sequence $w_{1:N}$, compute $P(w_{1:N} \mid \theta)$. Applications: language modeling. Complete data likelihood: Sum over all possible tag sequences:
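The two quantities named on this slide can be written as follows. This is a reconstruction in the notation of the HMM definition, with $t_0 = \text{START}$ as an assumed convention, and not necessarily the slide's own typesetting.

```latex
\begin{aligned}
P(w_{1:N}, t_{1:N} \mid \theta) &= \prod_{s=1}^{N} P(t_s \mid t_{s-1}) \, P(w_s \mid t_s) \\
P(w_{1:N} \mid \theta) &= \sum_{t_{1:N}} P(w_{1:N}, t_{1:N} \mid \theta)
\end{aligned}
```

The sum ranges over $|\mathrm{Tagset}|^N$ tag sequences, so computing it by enumeration is as expensive as brute-force decoding.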

64 Forward algorithm Dynamic programming: each state of the trellis stores a value $\alpha_i(s)$ = probability of being in state s having observed $w_{1:i}$. Sum over all paths up to i-1 leading to s. Init: $\alpha_1(s) = P(s \mid \mathrm{START}) \, P(w_1 \mid s)$

65 Forward algorithm
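A minimal Python sketch of the forward algorithm, checked against explicit enumeration of all tag sequences. The parameters are hypothetical values chosen for illustration, not the lecture's model, and the function names are assumptions.

```python
from itertools import product

STATES = ["N", "V"]

# Hypothetical parameters, for illustration only.
TRANS = {
    "START": {"N": 0.8, "V": 0.2},
    "N": {"N": 0.3, "V": 0.4, "END": 0.3},
    "V": {"N": 0.6, "V": 0.1, "END": 0.3},
}
EMIT = {
    "N": {"board": 0.4, "backs": 0.2, "plan": 0.4},
    "V": {"board": 0.3, "backs": 0.5, "plan": 0.2},
}


def forward(words, states=STATES, trans=TRANS, emit=EMIT):
    """P(words) by dynamic programming: sum instead of max at each cell."""
    # Init: alpha_1(s) = P(s | START) * P(w_1 | s)
    alpha = {s: trans["START"][s] * emit[s].get(words[0], 0.0) for s in states}
    for word in words[1:]:
        # Recursion: sum over all predecessors r, then emit the current word.
        alpha = {s: sum(alpha[r] * trans[r][s] for r in states)
                    * emit[s].get(word, 0.0)
                 for s in states}
    # Termination: transition into END.
    return sum(alpha[s] * trans[s]["END"] for s in states)


def brute_force(words, states=STATES, trans=TRANS, emit=EMIT):
    """P(words) by summing the joint probability over every tag sequence."""
    total = 0.0
    for tags in product(states, repeat=len(words)):
        p = trans["START"][tags[0]]
        for i, (tag, word) in enumerate(zip(tags, words)):
            if i > 0:
                p *= trans[tags[i - 1]][tag]
            p *= emit[tag].get(word, 0.0)
        total += p * trans[tags[-1]]["END"]
    return total


sentence = ["board", "backs", "plan"]
p_forward = forward(sentence)
p_brute = brute_force(sentence)
print(p_forward, p_brute)
```

Both functions compute the same quantity, but the forward recursion needs O(N · |Tagset|²) operations while the enumeration grows exponentially with sentence length.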

66-69 Forward computation (trellis for "board backs plan", filled in column by column; a = forward probability $\alpha$)

State   board (t=1)   backs (t=2)   plan (t=3)   END
V       a=.12         a=.058        a=.014
N       a=.24         a=.034        a=.022
END                                              a≈.02 (bounded above by .014 + .022 = .036)

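70 Parameter estimation Maximum likelihood estimates (MLE) on data, where $C(\cdot)$ denotes a count in the annotated corpus:
1. Transition probabilities: $a_{ij} = C(q_i, q_j) / C(q_i)$
2. Emission probabilities: $b_{ik} = C(q_i, o_k) / C(q_i)$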
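Supervised maximum likelihood estimation reduces to counting and normalizing. The Python sketch below uses a tiny hypothetical tagged corpus and applies no smoothing; the function name and the explicit START/END pseudo-states are assumptions.

```python
from collections import Counter, defaultdict


def estimate_hmm(tagged_sentences):
    """Relative-frequency (MLE) estimates of transitions A and emissions B."""
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        prev = "START"
        for word, tag in sentence:
            trans_counts[prev][tag] += 1   # C(q_i, q_j)
            emit_counts[tag][word] += 1    # C(q_i, o_k)
            prev = tag
        trans_counts[prev]["END"] += 1

    def normalize(counts):
        # Divide each count by its row total to get conditional probabilities.
        return {ctx: {k: c / sum(row.values()) for k, c in row.items()}
                for ctx, row in counts.items()}

    return normalize(trans_counts), normalize(emit_counts)


# Hypothetical training data, for illustration only.
corpus = [
    [("board", "N"), ("backs", "V"), ("plan", "N")],
    [("plan", "N"), ("backs", "V"), ("board", "N")],
    [("board", "V"), ("plan", "N")],
]

A, B = estimate_hmm(corpus)
print(A)
print(B)
```

Unseen transitions and words get probability zero under plain MLE, which is why smoothing or interpolation matters in practice.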

71 Implementation details 1. Start/End states 2. Log space/rescaling 3. Vocabularies: model pruning 4. Higher order models: a. states representation b. Estimation and sparsity: deleted interpolation
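The log-space point can be illustrated with a short Python sketch: multiplying many small probabilities underflows to zero in floating point, while summing their logarithms stays well-behaved. The per-token probabilities below are hypothetical.

```python
import math


def product_prob(probs):
    """Multiply probabilities directly (prone to underflow)."""
    result = 1.0
    for p in probs:
        result *= p
    return result


def log_prob(probs):
    """Sum log-probabilities instead of multiplying probabilities."""
    return sum(math.log(p) for p in probs)


# Hypothetical: 100 tokens, each contributing a factor of 1e-5.
probs = [1e-5] * 100

direct = product_prob(probs)
in_log_space = log_prob(probs)
print(direct, in_log_space)
```

In Viterbi the switch is direct, since max is preserved under the logarithm. The forward algorithm needs a log-sum-exp step or per-position rescaling because it sums probabilities.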

72 Evaluation
So once you have your POS tagger running, how do you evaluate it?
1. Overall error rate with respect to a gold-standard test set.
   a. ER = # words incorrectly tagged / # words tagged
2. Error rates on particular tags (and pairs)
3. Error rates on particular words (especially unknown words)
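The error-rate definition and the per-pair breakdown can be sketched in Python as below. The function names are assumptions, and the example compares a gold tagging of "I need a flight from Atlanta" against a prediction that mistags Atlanta as NN.

```python
from collections import Counter


def error_rate(gold, predicted):
    """ER = number of words incorrectly tagged / number of words tagged."""
    assert len(gold) == len(predicted)
    errors = sum(g != p for g, p in zip(gold, predicted))
    return errors / len(gold)


def confusions(gold, predicted):
    """Count (gold tag, predicted tag) pairs for the mistagged words."""
    return Counter((g, p) for g, p in zip(gold, predicted) if g != p)


gold = ["PRP", "VBP", "DT", "NN", "IN", "NNP"]
pred = ["PRP", "VBP", "DT", "NN", "IN", "NN"]

er = error_rate(gold, pred)
pairs = confusions(gold, pred)
print(er, pairs)
```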

73 Evaluation
The result is compared with a manually coded gold standard.
Typically accuracy > 97% on WSJ PTB.
This may be compared with the result for a baseline tagger (one that uses no context).
Baselines (most frequent tag) can achieve up to 90% accuracy.
Important: 100% is impossible even for human annotators.
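A most-frequent-tag baseline of the kind mentioned here can be sketched in a few lines of Python. The training data is hypothetical. Backing off to the overall most frequent tag for unknown words is one common choice, not something the slides specify.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Learn each word's most frequent tag, plus a default for unknown words."""
    word_tags = defaultdict(Counter)
    all_tags = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            word_tags[word][tag] += 1
            all_tags[tag] += 1
    default = all_tags.most_common(1)[0][0]
    lexicon = {word: counts.most_common(1)[0][0]
               for word, counts in word_tags.items()}
    return lexicon, default


def tag_baseline(words, lexicon, default):
    """Tag each word independently of its context."""
    return [lexicon.get(word, default) for word in words]


# Hypothetical training data, for illustration only.
train = [
    [("the", "DT"), ("back", "NN"), ("door", "NN")],
    [("back", "VB"), ("the", "DT"), ("bill", "NN")],
    [("my", "PRP$"), ("back", "NN")],
]

lexicon, default = train_baseline(train)
predicted = tag_baseline(["back", "the", "plan"], lexicon, default)
print(predicted)
```

Because it ignores context, this tagger always assigns "back" the same tag, which is exactly the kind of error a sequence model is meant to fix.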

74 Summary Parts of speech Tagsets Part of speech tagging HMM Tagging: Most likely tag sequence (decoding) Probability of an observation (word sequence) Parameter estimation (supervised) Evaluation

75 Next class Unsupervised POS tagging models (HMMs) Parameter estimation: forward-backward algorithm Discriminative sequence models: MaxEnt, CRF, Perceptron, SVM, etc. Read J&M 5-6 Pre-process and POS tag the data: report problems & baselines
