Lecture 9: Part of Speech


Kai-Wei Chang, CS @ University of Virginia, kw@kwchang.net
Course webpage: http://kwchang.net/teaching/nlp16
CS6501 Natural Language Processing

This lecture
- Parts of speech (POS)
- POS tagsets

Parts of Speech
- Traditional parts of speech
  - ~8 of them

POS examples
- N (noun): chair, bandwidth, pacing
- V (verb): study, debate, munch
- ADJ (adjective): purple, tall, ridiculous
- ADV (adverb): unfortunately, slowly
- P (preposition): of, by, to
- PRO (pronoun): I, me, mine
- DET (determiner): the, a, that, those

Parts of Speech
- A.k.a. parts-of-speech, lexical categories, word classes, morphological classes, lexical tags, ...
- Lots of debate within linguistics about the number, nature, and universality of these

POS Tagging
- The process of assigning a part-of-speech to each word in a collection (sentence).

WORD  the  koala  put  the  keys  on  the  table
TAG   DET  N      V    DET  N     P   DET  N

Why is POS Tagging Useful?
- First step of a vast number of practical tasks
- Parsing
  - Need to know if a word is an N or V before you can parse
- Information extraction
  - Finding names, relations, etc.
- Speech synthesis/recognition
  - Stress placement depends on the part of speech:
    OBject / obJECT, OVERflow / overFLOW, DIScount / disCOUNT, CONtent / conTENT
- Machine translation

Open and Closed Classes
- Closed class: a small fixed membership
  - Prepositions: of, in, by, ...
  - Pronouns: I, you, she, mine, his, them, ...
  - Usually function words (short common words which play a role in grammar)
- Open class: new ones can be created
  - English has 4: nouns, verbs, adjectives, adverbs
  - Many languages have these 4, but not all!

Open Class Words
- Nouns
  - Proper nouns (Boulder, Granby, Eli Manning)
  - Common nouns (the rest)
  - Count nouns and mass nouns
    - Count: have plurals, get counted: goat/goats, one goat, two goats
    - Mass: don't get counted (snow, salt, communism) (*two snows)
- Verbs
  - In English, have morphological affixes (eat/eats/eaten)

Closed Class Words
Examples:
- prepositions: on, under, over, ...
- particles: up, down, on, off, ...
- determiners: a, an, the, ...
- pronouns: she, who, I, ...
- conjunctions: and, but, or, ...
- auxiliary verbs: can, may, should, ...
- numerals: one, two, three, third, ...

Prepositions from CELEX
- CELEX: online dictionary
- Frequency counts are from the COBUILD 16-million-word corpus

English Particles

Conjunctions

Choosing a Tagset
- Could pick very coarse tagsets
  - N, V, Adj, Adv, Other
- More commonly used set is finer grained
  - E.g., Penn TreeBank tagset, 45 tags: PRP$, WRB, WP$, VBG
  - Brown corpus, 87 tags
- Prague Dependency Treebank (Czech)
  - 4452 tags
  - AAFP3----3N----: (nejnezajímavějším) Adj Regular Feminine Plural.Superlative [Hajic 2006, VMC tutorial]

Penn TreeBank POS Tagset

Using the Penn Tagset
- The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
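The word/TAG notation on this slide is easy to read back into a program. The following is a minimal sketch, assuming whitespace-separated tokens in which the tag follows the last slash; the function name parse_tagged is hypothetical and not part of the lecture.

```python
def parse_tagged(text):
    """Split a 'word/TAG word/TAG ...' string into (word, TAG) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash so that a token such as './.' still parses.
        word, tag = token.rsplit("/", 1)
        # Normalize tag case (an assumption made here for convenience).
        pairs.append((word, tag.upper()))
    return pairs


sentence = ("The/DT grand/JJ jury/NN commented/VBD on/IN a/DT "
            "number/NN of/IN other/JJ topics/NNS ./.")
tagged = parse_tagged(sentence)
words = [word for word, _ in tagged]
tags = [tag for _, tag in tagged]
```

Splitting on the last slash rather than the first is a design choice that keeps punctuation tokens and words containing a slash intact.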

Universal Tag set
- ~12 different tags
- NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, CONJ, PRT, . (punctuation), X
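A coarse tagset can be obtained from a fine-grained one by a many-to-one mapping. The sketch below covers only the Penn tags that appear in the "grand jury" example; the table is deliberately partial, and falling back to X for unlisted tags is an assumption of this sketch, not something stated in the lecture.

```python
# Partial Penn Treebank -> universal tag mapping (only tags from the example).
PENN_TO_UNIVERSAL = {
    "DT": "DET",
    "JJ": "ADJ",
    "NN": "NOUN",
    "NNS": "NOUN",
    "VBD": "VERB",
    "IN": "ADP",
    ".": ".",
}


def to_universal(penn_tags, mapping=PENN_TO_UNIVERSAL):
    """Map each Penn tag to a universal tag; unlisted tags become 'X' (assumed)."""
    return [mapping.get(tag, "X") for tag in penn_tags]


penn = ["DT", "JJ", "NN", "VBD", "IN", "DT", "NN", "IN", "JJ", "NNS", "."]
universal = to_universal(penn)
```

Note that NN and NNS both collapse to NOUN: the coarse tagset drops the singular/plural distinction that the Penn tagset keeps.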

POS Tagging vs. Word clustering
- Words often have more than one POS: back
  - The back door = JJ
  - On my back = NN
  - Win the voters back = RB
  - Promised to back the bill = VB
These examples are from Dekang Lin.
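The ambiguity of "back" can be made concrete by collecting the set of tags each word receives in a tagged sample. In the sketch below the four tags for "back" come from the slide; the tags on the surrounding words are an illustrative annotation added here, and tags_by_word is a hypothetical helper name.

```python
from collections import defaultdict

# Tags for "back" are from the slide; the other tags are assumed for illustration.
tagged_phrases = [
    [("The", "DT"), ("back", "JJ"), ("door", "NN")],
    [("On", "IN"), ("my", "PRP$"), ("back", "NN")],
    [("Win", "VB"), ("the", "DT"), ("voters", "NNS"), ("back", "RB")],
    [("Promised", "VBD"), ("to", "TO"), ("back", "VB"), ("the", "DT"), ("bill", "NN")],
]


def tags_by_word(phrases):
    """Collect the set of tags observed for each lowercased word."""
    seen = defaultdict(set)
    for phrase in phrases:
        for word, tag in phrase:
            seen[word.lower()].add(tag)
    return seen


seen = tags_by_word(tagged_phrases)
# Words observed with more than one tag are ambiguous in this sample.
ambiguous = {word: sorted(tags) for word, tags in seen.items() if len(tags) > 1}
```

A clustering that assigns each word type to a single class cannot represent this; tagging assigns a class to each token in context.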

How Hard is POS Tagging?

POS tag sequences
- Some tag sequences are more likely to occur than others
- POS Ngram view: https://books.google.com/ngrams/graph?content=_ADJ_+_NOUN_%2C_ADV_+_NOUN_%2C+_ADV_+_VERB_
Existing methods often model POS tagging as a sequence tagging problem.
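One simple way to see that some tag sequences are more likely than others is to count tag bigrams and turn them into relative frequencies. The sketch below does this for the tag sequence of "the koala put the keys on the table" from earlier in the lecture; a single sentence is far too little data for real estimates, and transition_probs is a hypothetical name.

```python
from collections import Counter

# Tag sequence for "the koala put the keys on the table".
tags = ["DET", "N", "V", "DET", "N", "P", "DET", "N"]


def transition_probs(tag_seq):
    """Estimate P(next tag | previous tag) by relative frequency of tag bigrams."""
    bigrams = Counter(zip(tag_seq, tag_seq[1:]))
    # Count each tag only where it occurs as the first element of a bigram.
    prev_counts = Counter(tag_seq[:-1])
    return {(prev, nxt): count / prev_counts[prev]
            for (prev, nxt), count in bigrams.items()}


probs = transition_probs(tags)
```

In this tiny sample DET is always followed by N, while N is followed by V half the time and by P the other half. Sequence models for tagging exploit exactly this kind of regularity.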

Evaluation
- How many words in the unseen test data can be tagged correctly?
- Usually evaluated on Penn Treebank
- State of the art ~97%
- Trivial baseline (most likely tag) ~94%
- Human performance ~97%
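The "most likely tag" baseline and the accuracy measure can be sketched in a few lines. The toy training and test data below are invented for illustration and have nothing to do with the Penn Treebank figures above; tagging unknown words with the overall most frequent tag is an assumption of this sketch.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Learn each word's most frequent tag, plus the overall most frequent tag."""
    per_word = defaultdict(Counter)
    all_tags = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            per_word[word.lower()][tag] += 1
            all_tags[tag] += 1
    best = {word: counts.most_common(1)[0][0] for word, counts in per_word.items()}
    default = all_tags.most_common(1)[0][0]  # used for unseen words (assumption)
    return best, default


def predict(words, best, default):
    """Tag each word with its most frequent training tag, or the default tag."""
    return [best.get(word.lower(), default) for word in words]


def accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy data, invented for illustration only.
train = [
    [("the", "DET"), ("koala", "N"), ("put", "V"), ("the", "DET"),
     ("keys", "N"), ("on", "P"), ("the", "DET"), ("table", "N")],
    [("the", "DET"), ("back", "ADJ"), ("door", "N")],
    [("on", "P"), ("my", "PRO"), ("back", "N")],
    [("my", "PRO"), ("back", "N")],
]
test = [("the", "DET"), ("keys", "N"), ("on", "P"),
        ("the", "DET"), ("back", "ADJ"), ("chair", "N")]

best, default = train_baseline(train)
predicted = predict([word for word, _ in test], best, default)
score = accuracy([tag for _, tag in test], predicted)
```

On this toy test set the baseline gets five of six tokens right: it tags "back" as N, its most frequent training tag, although the gold tag in this context is ADJ. Errors of this kind are what context-sensitive taggers aim to fix.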