Lecture 9: Part of Speech
Kai-Wei Chang
CS @ University of Virginia
kw@kwchang.net
Course webpage: http://kwchang.net/teaching/nlp16
CS6501 Natural Language Processing

This lecture
- Parts of speech (POS)
- POS tagsets

Parts of Speech
- Traditional parts of speech
  - ~8 of them

POS examples
- N    noun         chair, bandwidth, pacing
- V    verb         study, debate, munch
- ADJ  adjective    purple, tall, ridiculous
- ADV  adverb       unfortunately, slowly
- P    preposition  of, by, to
- PRO  pronoun      I, me, mine
- DET  determiner   the, a, that, those

Parts of Speech
- A.k.a. parts-of-speech, lexical categories, word classes, morphological classes, lexical tags, ...
- Lots of debate within linguistics about the number, nature, and universality of these

POS Tagging
- The process of assigning a part of speech to each word in a collection (sentence).

WORD  the  koala  put  the  keys  on  the  table
TAG   DET  N      V    DET  N     P   DET  N

Why is POS Tagging Useful?
- First step of a vast number of practical tasks
- Parsing
  - Need to know if a word is an N or V before you can parse
- Information extraction
  - Finding names, relations, etc.
- Speech synthesis/recognition: stress depends on the part of speech
  - OBject (noun) vs. obJECT (verb)
  - OVERflow (noun) vs. overFLOW (verb)
  - DIScount (noun) vs. disCOUNT (verb)
  - CONtent (noun) vs. conTENT (adjective)
- Machine translation

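The speech-synthesis point can be illustrated with a small lookup keyed by both the word and its tag. This is only a sketch: the `STRESS` table and the `pronounce` helper are hypothetical names introduced here, and capital letters mark the stressed syllable as on the slide.

```python
# Hypothetical stress lexicon keyed by (word, coarse POS tag).
# Capital letters mark the stressed syllable.
STRESS = {
    ("object", "N"): "OBject",
    ("object", "V"): "obJECT",
    ("discount", "N"): "DIScount",
    ("discount", "V"): "disCOUNT",
    ("content", "N"): "CONtent",
    ("content", "ADJ"): "conTENT",
}


def pronounce(word, tag):
    """Return the stress-marked form if the (word, tag) pair is known,
    otherwise return the word unchanged."""
    return STRESS.get((word.lower(), tag), word)


print(pronounce("object", "N"))
print(pronounce("object", "V"))
```

Without the tag, the lookup would have no way to choose between the two pronunciations of the same spelling, which is why tagging comes first in a synthesis pipeline.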
Open and Closed Classes
- Closed class: a small, fixed membership
  - Prepositions: of, in, by, ...
  - Pronouns: I, you, she, mine, his, them, ...
  - Usually function words (short common words which play a role in grammar)
- Open class: new ones can be created
  - English has 4: nouns, verbs, adjectives, adverbs
  - Many languages have these 4, but not all!

Open Class Words
- Nouns
  - Proper nouns (Boulder, Granby, Eli Manning)
  - Common nouns (the rest)
  - Count nouns and mass nouns
    - Count: have plurals, get counted: goat/goats, one goat, two goats
    - Mass: don't get counted (snow, salt, communism) (*two snows)
- Verbs
  - In English, have morphological affixes (eat/eats/eaten)

Closed Class Words
Examples:
- prepositions: on, under, over, ...
- particles: up, down, on, off, ...
- determiners: a, an, the, ...
- pronouns: she, who, I, ...
- conjunctions: and, but, or, ...
- auxiliary verbs: can, may, should, ...
- numerals: one, two, three, third, ...

Prepositions from CELEX
- CELEX: online dictionary
- Frequency counts are from the COBUILD 16-million-word corpus

English Particles

Conjunctions

Choosing a Tagset
- Could pick very coarse tagsets
  - N, V, Adj, Adv, Other
- More commonly used sets are finer grained
  - E.g., Penn TreeBank tagset, 45 tags: PRP$, WRB, WP$, VBG
  - Brown corpus, 87 tags
- Prague Dependency Treebank (Czech)
  - 4452 tags
  - AAFP3----3N----: (nejnezajímavějším) Adj Regular Feminine Plural ... Superlative [Hajic 2006, VMC tutorial]

Penn TreeBank POS Tagset

Using the Penn Tagset
- The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

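The word/TAG notation used on this slide is easy to read back into pairs. The following is a minimal sketch, assuming whitespace-separated tokens; `parse_tagged` is a hypothetical helper name, not part of any library.

```python
def parse_tagged(text):
    """Split whitespace-separated 'word/TAG' tokens into (word, TAG) pairs.

    Splitting on the last '/' keeps slashes that belong to the word
    itself and handles the punctuation token './.' correctly.
    """
    pairs = []
    for token in text.split():
        word, tag = token.rsplit("/", 1)
        pairs.append((word, tag.upper()))  # normalize tag case
    return pairs


sentence = ("The/DT grand/JJ jury/NN commented/VBD on/IN a/DT "
            "number/NN of/IN other/JJ topics/NNS ./.")
tagged = parse_tagged(sentence)
print(tagged[:3])
```

Tags are upper-cased on the way in so that lower-case annotations such as `grand/jj` end up in the same form as `The/DT`.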
Universal Tagset
- ~12 different tags
- NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, CONJ, PRT, ., X

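A fine-grained tagset can be collapsed onto a coarse one with a lookup table. The sketch below covers only the Penn tags that appear in the "grand jury" example; a full mapping would need an entry for each of the 45 Penn tags. The names `PENN_TO_UNIVERSAL` and `to_universal` are hypothetical, and sending unlisted tags to `X` is an assumption made for this example.

```python
# Partial Penn Treebank -> universal tag mapping (example tags only).
PENN_TO_UNIVERSAL = {
    "DT": "DET",
    "JJ": "ADJ",
    "NN": "NOUN",
    "NNS": "NOUN",
    "VBD": "VERB",
    "IN": "ADP",
    ".": ".",
}


def to_universal(tagged, mapping=PENN_TO_UNIVERSAL, unknown="X"):
    """Map each (word, penn_tag) pair to (word, universal_tag).

    Tags missing from the partial mapping fall back to `unknown`
    (an assumption of this sketch).
    """
    return [(word, mapping.get(tag, unknown)) for word, tag in tagged]


penn = [("The", "DT"), ("grand", "JJ"), ("jury", "NN"),
        ("commented", "VBD"), ("on", "IN"), ("a", "DT"),
        ("number", "NN"), ("of", "IN"), ("other", "JJ"),
        ("topics", "NNS"), (".", ".")]
coarse = to_universal(penn)
print(coarse)
```

Note that the mapping is many-to-one: NN and NNS both become NOUN, so the singular/plural distinction is lost in the coarse tagset.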
POS Tagging vs. Word Clustering
- Words often have more than one POS: back
  - The back door = JJ
  - On my back = NN
  - Win the voters back = RB
  - Promised to back the bill = VB
(Examples from Dekang Lin)

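The "back" examples can be used to show why a single label per word type is not enough. In the sketch below, the four tags for "back" come from the slide; the tags on the surrounding words are illustrative Penn-style annotations added for this example.

```python
from collections import defaultdict

# Tags for "back" are from the slide; the other tags are illustrative.
phrases = [
    [("The", "DT"), ("back", "JJ"), ("door", "NN")],
    [("On", "IN"), ("my", "PRP$"), ("back", "NN")],
    [("Win", "VB"), ("the", "DT"), ("voters", "NNS"), ("back", "RB")],
    [("Promised", "VBD"), ("to", "TO"), ("back", "VB"),
     ("the", "DT"), ("bill", "NN")],
]

# Collect the set of tags observed for each (lower-cased) word type.
tags_for = defaultdict(set)
for phrase in phrases:
    for word, tag in phrase:
        tags_for[word.lower()].add(tag)

# Word types that were seen with more than one tag.
ambiguous = sorted(w for w, tags in tags_for.items() if len(tags) > 1)
print(ambiguous, sorted(tags_for["back"]))
```

A clustering that assigns each word type to one class would have to pick one of these four tags for "back", whereas tagging assigns a tag to each token in its context.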
How Hard is POS Tagging?

POS tag sequences
- Some tag sequences are more likely to occur than others
- POS n-gram view: https://books.google.com/ngrams/graph?content=_ADJ_+_NOUN_%2C_ADV_+_NOUN_%2C+_ADV_+_VERB_
Existing methods often model POS tagging as a sequence tagging problem

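The claim that some tag sequences are more likely than others can be checked by counting tag bigrams. This is a minimal sketch using the "koala" sentence from earlier in the lecture as a one-sentence corpus; `transition_probs` is a hypothetical helper, and a real estimate would need far more data and some smoothing.

```python
from collections import Counter


def transition_probs(tag_sequences):
    """Maximum-likelihood estimate of P(next_tag | prev_tag)
    from adjacent tag pairs in each sequence."""
    bigrams = Counter()
    prev_counts = Counter()
    for tags in tag_sequences:
        for prev, nxt in zip(tags, tags[1:]):
            bigrams[(prev, nxt)] += 1
            prev_counts[prev] += 1
    return {pair: n / prev_counts[pair[0]] for pair, n in bigrams.items()}


# Tags for "the koala put the keys on the table".
corpus = [["DET", "N", "V", "DET", "N", "P", "DET", "N"]]
probs = transition_probs(corpus)
for (prev, nxt), p in sorted(probs.items()):
    print(f"P({nxt} | {prev}) = {p:.2f}")
```

In this toy corpus a determiner is always followed by a noun, while a noun is followed by a verb or a preposition equally often. Sequence models for tagging exploit exactly this kind of regularity.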
Evaluation
- How many words in the unseen test data can be tagged correctly?
- Usually evaluated on Penn Treebank
- State of the art: ~97%
- Trivial baseline (most likely tag): ~94%
- Human performance: ~97%

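The most-likely-tag baseline and the accuracy metric can be sketched in a few lines. The training and test data below are toy examples made up for illustration (the second training sentence and the test sentence are not from the lecture), so the resulting number says nothing about Penn Treebank performance. Tagging unseen words with the overall most frequent tag is an assumption of this sketch.

```python
from collections import Counter, defaultdict


def train_most_frequent(tagged_sentences):
    """Learn each word's most frequent tag, plus a default tag
    (the overall most frequent one) for unseen words."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
            all_tags[tag] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    default = all_tags.most_common(1)[0][0]
    return lexicon, default


def tag_words(words, lexicon, default):
    """Tag each word independently, ignoring context."""
    return [lexicon.get(w.lower(), default) for w in words]


def accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


train = [
    [("the", "DET"), ("koala", "N"), ("put", "V"), ("the", "DET"),
     ("keys", "N"), ("on", "P"), ("the", "DET"), ("table", "N")],
    [("koalas", "N"), ("munch", "V"), ("keys", "N")],
]
lexicon, default = train_most_frequent(train)

test_words = ["the", "koala", "munch", "slowly"]
gold = ["DET", "N", "V", "ADV"]
predicted = tag_words(test_words, lexicon, default)
print(predicted, accuracy(gold, predicted))
```

The only error is on the unseen word "slowly", which receives the default tag. Unseen and ambiguous words are where a context-free baseline like this loses accuracy.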