
Part of speech tags
CS 585, Fall 2017: Introduction to Natural Language Processing
http://people.cs.umass.edu/~brenocon/inlp2017
Brendan O'Connor
College of Information and Computer Sciences, University of Massachusetts Amherst

What's a part-of-speech (POS)?
Syntax = how words compose to form larger meaning-bearing units.
POS = syntactic categories for words. You could substitute words within a class and still have a syntactically valid sentence, so POS tags give information about how words can combine.
- I saw the dog
- I saw the cat
- I saw the {table, sky, dream, school, anger, ...}
Schoolhouse Rock: Conjunction Junction https://www.youtube.com/watch?v=odga7ssl-6g&index=1&list=pl6795522ead6ce2f7
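To see this concretely, here is a minimal sketch using NLTK's off-the-shelf tagger (assumes nltk is installed and the tagger model has been fetched, e.g. with nltk.download('averaged_perceptron_tagger')); every substitution should yield the same tag sequence:

    import nltk

    for noun in ["dog", "cat", "table", "sky", "dream"]:
        # tag a whitespace-tokenized sentence; each should come out PRP VBD DT NN
        print(nltk.pos_tag(f"I saw the {noun}".split()))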

Open vs. closed classes
Open class (lexical) words:
- Nouns: proper (IBM, Italy), common (cat / cats, snow)
- Verbs: main (see, registered)
- Adjectives (old, older, oldest)
- Adverbs (slowly)
- Numbers (122,312; one)
- ... and more
Closed class (functional) words:
- Modals (can, had)
- Determiners (the, some)
- Prepositions (to, with)
- Conjunctions (and, or)
- Particles (off, up)
- Pronouns (he, its)
- Interjections (Ow, Eh)
- ... and more
slide credit: Chris Manning

Many tagging standards
- Penn Treebank (45 tags): the most common one
- Coarse tagsets: 12 to 20 tags (e.g. Petrov 2012, Gimpel 2011)
- UD project: coarse tags, but fine-grained grammatical features
  http://universaldependencies.org/u/pos/index.html
  http://universaldependencies.org/u/feat/index.html

Why do we want POS?
Useful for many syntactic and other NLP tasks:
- Phrase identification ("chunking")
- Named entity recognition (names = proper nouns... or are they?)
- Syntactic/semantic dependency parsing
- Sentiment
Either as features or for heuristic filtering. Especially useful when not much training data is available.

POS patterns: sentiment
Turney (2002): identify bigram phrases from an unlabeled corpus that are useful for sentiment analysis (plus co-occurrence information).
Table 1. Patterns of tags for extracting two-word phrases from reviews.
   First Word        | Second Word          | Third Word (Not Extracted)
1. JJ                | NN or NNS            | anything
2. RB, RBR, or RBS   | JJ                   | not NN nor NNS
3. JJ                | JJ                   | not NN nor NNS
4. NN or NNS         | JJ                   | not NN nor NNS
5. RB, RBR, or RBS   | VB, VBD, VBN, or VBG | anything

Table 2. An example of the processing of a review that the author has classified as "recommended."
Extracted Phrase       | Part-of-Speech Tags | Semantic Orientation
online experience      | JJ NN               |  2.253
low fees               | JJ NNS              |  0.333
local branch           | JJ NN               |  0.421
small part             | JJ NN               |  0.053
online service         | JJ NN               |  2.780
printable version      | JJ NN               | -0.705
direct deposit         | JJ NN               |  1.288
well other             | RB JJ               |  0.237
inconveniently located | RB VBN              | -1.541
other bank             | JJ NN               | -0.850
true service           | JJ NN               | -0.732
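A rough sketch of the extraction step, assuming input as a list of (word, tag) pairs; this illustrates the Table 1 patterns only, not Turney's implementation, and the semantic-orientation scoring via co-occurrence statistics is omitted:

    NOUNS = {"NN", "NNS"}
    ADVS = {"RB", "RBR", "RBS"}
    VERBS = {"VB", "VBD", "VBN", "VBG"}

    def match(t1, t2, t3):
        """True if (t1, t2) fits a Table 1 pattern and t3 passes the third-word test."""
        if t1 == "JJ" and t2 in NOUNS:                       return True  # pattern 1
        if t1 in ADVS and t2 == "JJ" and t3 not in NOUNS:    return True  # pattern 2
        if t1 == "JJ" and t2 == "JJ" and t3 not in NOUNS:    return True  # pattern 3
        if t1 in NOUNS and t2 == "JJ" and t3 not in NOUNS:   return True  # pattern 4
        if t1 in ADVS and t2 in VERBS:                       return True  # pattern 5
        return False

    def extract_phrases(tagged):
        """tagged: [(word, tag), ...] for one review; returns candidate bigrams."""
        phrases = []
        for i in range(len(tagged) - 1):
            (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
            t3 = tagged[i + 2][1] if i + 2 < len(tagged) else "END"
            if match(t1, t2, t3):
                phrases.append(f"{w1} {w2}")
        return phrases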

POS patterns: simple noun phrases
Quick and dirty noun phrase identification
http://brenocon.com/justesonkatz1995.pdf
http://brenocon.com/handler2016phrases.pdf
- Frequency: candidate strings must have frequency 2 or more in the text.
- Grammatical structure: candidate strings are those multi-word noun phrases specified by the regular expression ((A | N)+ | ((A | N)* (N P)?)(A | N)*) N, where A is an adjective, N is a noun, and P is a preposition.
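A minimal sketch of applying this pattern, mapping Penn Treebank tags down to the letters A, N, and P and then running an ordinary regex over the tag string (the frequency filter is omitted for brevity):

    import re

    def letter(tag):
        # A = adjective, N = noun, P = preposition; 'x' breaks any candidate
        if tag.startswith("JJ"):
            return "A"
        if tag.startswith("NN"):
            return "N"
        if tag == "IN":
            return "P"
        return "x"

    # ((A|N)+ | (A|N)*(NP)?(A|N)*) N collapses to the single branch below,
    # since (A|N)+N is a special case of (A|N)*(NP)?(A|N)*N.
    JK = re.compile(r"(?:A|N)*(?:NP)?(?:A|N)*N")

    def candidate_nps(tagged):
        # one letter per token, so regex offsets line up with token indices
        s = "".join(letter(t) for _, t in tagged)
        return [" ".join(w for w, _ in tagged[m.start():m.end()])
                for m in JK.finditer(s)
                if m.end() - m.start() >= 2]  # multi-word candidates only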

POS tagging: lexical ambiguity
Can we just use a tag dictionary (one tag per word type)?
                        WSJ            | Brown
Types:
  Unambiguous (1 tag)   44,432 (86%)   | 45,799 (85%)
  Ambiguous (2+ tags)    7,025 (14%)   |  8,050 (15%)
Tokens:
  Unambiguous (1 tag)  577,421 (45%)   | 384,349 (33%)
  Ambiguous (2+ tags)  711,780 (55%)   | 786,646 (67%)
(Figure 8.2: The amount of tag ambiguity for word types in the Brown and WSJ corpora.)
Most word types are unambiguous... but not so for tokens!
Ambiguous word types tend to be the common ones:
- I know that he is honest = IN (relativizer)
- Yes, that play was nice = DT (determiner)
- You can't go that far = RB (adverb)
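Counts in the same spirit as this table can be reproduced from any tagged corpus; here is a sketch using NLTK's copy of the Brown corpus (requires nltk.download('brown'); exact figures will differ with tokenization, case handling, and tagset):

    from collections import defaultdict
    from nltk.corpus import brown

    tags_of = defaultdict(set)               # word type -> set of observed tags
    for word, tag in brown.tagged_words():
        tags_of[word].add(tag)

    tokens = brown.tagged_words()
    ambig_types = sum(1 for tags in tags_of.values() if len(tags) > 1)
    ambig_tokens = sum(1 for word, _ in tokens if len(tags_of[word]) > 1)
    print(f"types ambiguous:  {ambig_types / len(tags_of):.0%}")
    print(f"tokens ambiguous: {ambig_tokens / len(tokens):.0%}")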

POS tagging: baseline
Baseline: most frequent tag. 92.7% accuracy.
Simple baselines are very important to run!
Why so high?
- Many ambiguous words have a skewed distribution of tags
- Credit for easy things like punctuation, "the," "a," etc.
Is this actually that high? I get 0.918 accuracy for token tagging... but only 0.186 whole-sentence accuracy (!)
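A minimal sketch of this baseline, assuming train_sents and test_sents (illustrative names) are lists of sentences, each a list of (word, tag) pairs; unseen words fall back to the overall most frequent tag:

    from collections import Counter, defaultdict

    def train_baseline(train_sents):
        counts = defaultdict(Counter)         # word -> tag frequencies
        overall = Counter()                   # tag frequencies over all tokens
        for sent in train_sents:
            for word, tag in sent:
                counts[word][tag] += 1
                overall[tag] += 1
        default = overall.most_common(1)[0][0]
        best = {w: c.most_common(1)[0][0] for w, c in counts.items()}
        return lambda word: best.get(word, default)

    def evaluate(tagger, test_sents):
        token_hits = tokens = sent_hits = 0
        for sent in test_sents:
            preds = [tagger(w) for w, _ in sent]
            gold = [t for _, t in sent]
            token_hits += sum(p == g for p, g in zip(preds, gold))
            tokens += len(sent)
            sent_hits += preds == gold        # whole sentence must be right
        return token_hits / tokens, sent_hits / len(test_sents)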

POS tagging can be hard for humans, too
- Mrs./NNP Shaefer/NNP never/RB got/VBD around/RP to/TO joining/VBG
- All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT corner/NN
- Chateau/NNP Petrus/NNP costs/VBZ around/RB $/$ 250/CD

Need careful guidelines (and do annotators always follow them?)
PTB POS guidelines, Santorini (1990)
4 Confusing parts of speech
This section discusses parts of speech that are easily confused and gives guidelines on how to tag such cases.
CD or JJ
Number-number combinations should be tagged as adjectives (JJ) if they have the same distribution as adjectives.
EXAMPLES: a 50-3/JJ victory (cf. a handy/JJ victory)
Hyphenated fractions (one-half, three-fourths, seven-eighths, one-and-a-half, seven-and-three-eighths) should be tagged as adjectives (JJ) when they are prenominal modifiers, but as adverbs (RB) if they could be replaced by "double" or "twice."
EXAMPLES: one-half/JJ cup (cf. a full/JJ cup); one-half/RB the amount (cf. twice/RB the amount; double/RB the amount)

Some other lexical ambiguities
Prepositions versus verb particles:
- turn into/P a monster
- take out/T the trash
- Test: "turn slowly into a monster" vs. *"take slowly out the trash"
- check it out/T, what's going on/T, shout out/T
this, that: pronouns versus determiners:
- i just orgasmed over this/O
- this/D wind is serious
Careful annotator guidelines are necessary to define what to do in many cases.
http://repository.upenn.edu/cgi/viewcontent.cgi?article=1603&context=cis_reports
http://www.ark.cs.cmu.edu/tweetnlp/annot_guidelines.pdf

How to build a POS tagger?
Key sources of information:
1. The word itself
2. Word-internal characters
3. POS tags of surrounding words: syntactic context
Approach: supervised learning (text => tags)
- Today/Thursday: with the Hidden Markov Model
- Next week: Conditional Random Field (arbitrary features)
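As a preview of the HMM material, a compact sketch of Viterbi decoding for a bigram HMM tagger. The transition and emission distributions (trans[s][t] = P(t | s), emit[t][w] = P(w | t)) are assumed to be already estimated as nested dicts, and a real tagger also needs smoothing for unseen words:

    import math

    NEG_INF = float("-inf")

    def logp(p):
        # log of a probability, with log(0) = -inf instead of an error
        return math.log(p) if p > 0 else NEG_INF

    def viterbi(words, tags, trans, emit, start="<s>"):
        # delta[t] = log-prob of the best tag sequence so far ending in tag t
        delta = {t: logp(trans[start].get(t, 0)) + logp(emit[t].get(words[0], 0))
                 for t in tags}
        backptrs = []
        for w in words[1:]:
            prev, delta, bp = delta, {}, {}
            for t in tags:
                s = max(tags, key=lambda s: prev[s] + logp(trans[s].get(t, 0)))
                delta[t] = (prev[s] + logp(trans[s].get(t, 0))
                            + logp(emit[t].get(w, 0)))
                bp[t] = s
            backptrs.append(bp)
        # trace back from the best final tag
        t = max(delta, key=delta.get)
        seq = [t]
        for bp in reversed(backptrs):
            t = bp[t]
            seq.append(t)
        return list(reversed(seq))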