Part-of-Speech Tagging & Sequence Labeling. Hongning Wang, CS@UVa

What is POS tagging?
A POS tagger maps raw text to tagged text, drawing on a tag set (e.g., NNP: proper noun; CD: numeral; JJ: adjective).
- Raw text: Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
- Tagged text: Pierre_NNP Vinken_NNP ,_, 61_CD years_NNS old_JJ ,_, will_MD join_VB the_DT board_NN as_IN a_DT nonexecutive_JJ director_NN Nov._NNP 29_CD ._.

Why POS tagging?
POS tagging is a prerequisite for further NLP analysis:
- Syntax parsing: POS tags are the basic units for parsing
- Information extraction: indication of names, relations
- Machine translation: the meaning of a particular word depends on its POS tag
- Sentiment analysis: adjectives are the major opinion holders (Good vs. Bad, Excellent vs. Terrible)

Challenges in POS tagging
- Words often have more than one POS tag:
  - The back door (adjective)
  - On my back (noun)
  - Promised to back the bill (verb)
- A simple solution with dictionary look-up does not work in practice: one needs to determine the POS tag for a particular instance of a word from its context.

Define a tagset
- We have to agree on a standard inventory of word classes, since taggers are trained on labeled corpora.
- The tagset needs to capture semantically or syntactically important distinctions that can easily be made by trained human annotators.

Word classes
- Open classes: nouns, verbs, adjectives, adverbs
- Closed classes: auxiliaries and modal verbs; prepositions, conjunctions; pronouns, determiners; particles, numerals

Public tagsets in NLP
- Brown corpus (Francis and Kucera, 1961): 500 samples, distributed across 15 genres in rough proportion to the amount published in 1961 in each of those genres; 87 tags.
- Penn Treebank (Marcus et al., 1993): hand-annotated corpus of the Wall Street Journal, 1M words; 45 tags, a simplified version of the Brown tag set. It is now the standard for English, and most statistical POS taggers are trained on this tagset.

How much ambiguity is there?
[Table: statistics of word-tag pairs in the Brown Corpus and the Penn Treebank; roughly 11% and 18% of word types, respectively, carry more than one tag.]

Is POS tagging a solved problem?
- Baseline: tag every word with its most frequent tag; tag unknown words as nouns.
  - Word-level accuracy: 90%
  - Sentence-level accuracy (the average English sentence is 14.3 words long): $0.9^{14.3} \approx 22\%$
- State-of-the-art POS taggers:
  - Word-level accuracy: 97%
  - Sentence-level accuracy: $0.97^{14.3} \approx 65\%$
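
A minimal sketch of this most-frequent-tag baseline, assuming training data given as lists of (word, tag) pairs; the toy counts and helper names are illustrative, not from the lecture:

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Count (word, tag) pairs and keep the most frequent tag for each word."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_baseline(words, most_frequent_tag):
    """Tag known words with their most frequent tag; tag unknown words as nouns (NN)."""
    return [(w, most_frequent_tag.get(w, "NN")) for w in words]

train = [[("the", "DT"), ("back", "NN"), ("door", "NN")],
         [("promised", "VBD"), ("to", "TO"), ("back", "VB"), ("the", "DT"), ("bill", "NN")]]
model = train_baseline(train)
print(tag_baseline(["the", "back", "door", "opened"], model))

# Sentence-level accuracy implied by per-word accuracy p over an n-word sentence: p ** n
print(0.9 ** 14.3, 0.97 ** 14.3)  # ~0.22 and ~0.65, matching the slide
```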

Building a POS tagger
Rule-based solution:
1. Take a dictionary that lists all possible tags for each word.
2. Assign to every word all its possible tags.
3. Apply rules that eliminate impossible/unlikely tag sequences, leaving only one tag per word. Rules can be learned via inductive learning.

Example: she (PRP), promised (VBN, VBD), to (TO), back (VB, JJ, RB, NN!), the (DT), bill (NN, VB)
- R1: A pronoun should be followed by a past tense verb.
- R2: A verb cannot follow a determiner.

Building a POS tagger
Statistical POS tagging: given a sequence of words $w = w_1 w_2 w_3 w_4 w_5 w_6$, what is the most likely sequence of tags $t = t_1 t_2 t_3 t_4 t_5 t_6$?
$$t^* = \arg\max_t p(t \mid w)$$

POS tagging with generative models
Bayes rule:
$$t^* = \arg\max_t p(t \mid w) = \arg\max_t \frac{p(w \mid t)\, p(t)}{p(w)} = \arg\max_t p(w \mid t)\, p(t)$$
where $p(w \mid t)\, p(t)$ is the joint distribution of tags and words.
Generative model: a stochastic process that first generates the tags, and then generates the words based on these tags.

Hidden Markov models
Two assumptions for POS tagging:
1. The current tag only depends on the previous $k$ tags: $p(t) = \prod_i p(t_i \mid t_{i-1}, t_{i-2}, \ldots, t_{i-k})$. When $k = 1$, this is the so-called first-order HMM.
2. Each word in the sequence depends only on its corresponding tag: $p(w \mid t) = \prod_i p(w_i \mid t_i)$.
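
A minimal sketch of the joint probability these two assumptions imply for a first-order HMM; the parameter tables (init, trans, emit) and their values are made up purely for illustration:

```python
def joint_prob(words, tags, init, trans, emit):
    """p(t, w) = p(t_1) * p(w_1|t_1) * prod_{i>1} p(t_i|t_{i-1}) * p(w_i|t_i)."""
    prob = init[tags[0]] * emit[tags[0]][words[0]]
    for i in range(1, len(words)):
        prob *= trans[tags[i - 1]][tags[i]] * emit[tags[i]][words[i]]
    return prob

# Toy parameters (illustrative values only)
init = {"DT": 0.6, "NN": 0.4}
trans = {"DT": {"NN": 0.9, "DT": 0.1}, "NN": {"NN": 0.3, "DT": 0.7}}
emit = {"DT": {"the": 0.8, "bill": 0.2}, "NN": {"bill": 0.5, "the": 0.5}}
print(joint_prob(["the", "bill"], ["DT", "NN"], init, trans, emit))  # 0.6 * 0.8 * 0.9 * 0.5
```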

Graphical representation of HMMs
- Transition probability $p(t_i \mid t_{i-1})$, over all the tags in the tagset
- Emission probability $p(w_i \mid t_i)$, over all the words in the vocabulary
- Light circles: latent random variables; dark circles: observed random variables; arrows: probabilistic dependencies

Finding the most probable tag sequence
$$t^* = \arg\max_t p(t \mid w) = \arg\max_t \prod_i p(w_i \mid t_i)\, p(t_i \mid t_{i-1})$$
Complexity analysis: each word can have up to $T$ tags, so for a sentence with $N$ words there are up to $T^N$ possible tag sequences. Key: exploit the special structure in HMMs!

[Figure: two candidate tag sequences over the words $w_1 \ldots w_5$, drawn against the tagset $\{t_1, \ldots, t_7\}$: $t^{(1)} = t_4\, t_1\, t_3\, t_5\, t_7$ and $t^{(2)} = t_4\, t_1\, t_3\, t_5\, t_2$; for example, word $w_1$ takes tag $t_4$.]

Trellis: a special structure for HMMs
[Figure: the same two sequences, $t^{(1)} = t_4\, t_1\, t_3\, t_5\, t_7$ and $t^{(2)} = t_4\, t_1\, t_3\, t_5\, t_2$, laid out on a trellis of tags $t_1 \ldots t_7$ over words $w_1 \ldots w_5$.] Because the two sequences share the prefix $t_4\, t_1\, t_3\, t_5$, computation can be reused!

Viterbi algorithm
- Store the probability of the best tag sequence for $w_1 \ldots w_i$ that ends in $t_j$ in $T[j][i]$:
$$T[j][i] = \max p(w_1 \ldots w_i, t_1, \ldots, t_i = t_j)$$
- Recursively compute $T[j][i]$ from the entries in the previous column $T[k][i-1]$:
$$T[j][i] = P(w_i \mid t_j)\, \max_k T[k][i-1]\, P(t_j \mid t_k)$$
where $P(w_i \mid t_j)$ generates the current observation, $T[k][i-1]$ is the best tag sequence up to position $i-1$, and $P(t_j \mid t_k)$ is the transition from the previous best ending tag.

Viterbi algorithm
$$T[j][i] = P(w_i \mid t_j)\, \max_k T[k][i-1]\, P(t_j \mid t_k)$$
Dynamic programming: $O(T^2 N)$!
[Figure: trellis of tags $t_1 \ldots t_7$ over words $w_1 \ldots w_5$; the columns are computed in order, left to right.]

Decode $\arg\max_t p(t \mid w)$
- Take the highest scoring entry, $T[j][i] = P(w_i \mid t_j)\, \max_k T[k][i-1]\, P(t_j \mid t_k)$, in the last column of the trellis.
- Keep backpointers in each trellis cell to keep track of the most probable sequence.
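
A compact sketch of Viterbi decoding with backpointers, using the same toy parameter tables as the joint-probability sketch above (repeated here so the snippet runs on its own); all names and values are illustrative:

```python
def viterbi(words, tagset, init, trans, emit):
    """Most probable tag sequence under a first-order HMM, with backpointers."""
    V = [{t: init[t] * emit[t].get(words[0], 1e-12) for t in tagset}]  # first column
    back = [{}]
    for i in range(1, len(words)):
        V.append({})
        back.append({})
        for t in tagset:
            # best previous tag k maximizing V[i-1][k] * P(t | k)
            k_best = max(tagset, key=lambda k: V[i - 1][k] * trans[k].get(t, 1e-12))
            V[i][t] = emit[t].get(words[i], 1e-12) * V[i - 1][k_best] * trans[k_best].get(t, 1e-12)
            back[i][t] = k_best
    # take the highest scoring entry in the last column, then follow backpointers
    best = max(tagset, key=lambda t: V[-1][t])
    tags = [best]
    for i in range(len(words) - 1, 0, -1):
        tags.append(back[i][tags[-1]])
    return list(reversed(tags))

# Toy parameters (illustrative values only)
init = {"DT": 0.6, "NN": 0.4}
trans = {"DT": {"NN": 0.9, "DT": 0.1}, "NN": {"NN": 0.3, "DT": 0.7}}
emit = {"DT": {"the": 0.8, "bill": 0.2}, "NN": {"bill": 0.5, "the": 0.5}}
print(viterbi(["the", "bill"], ["DT", "NN"], init, trans, emit))  # ['DT', 'NN']
```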

Training an HMM tagger
Parameters in an HMM tagger:
- Transition probabilities $p(t_i \mid t_j)$: a $T \times T$ matrix
- Emission probabilities $p(w \mid t)$: a $V \times T$ matrix
- Initial state probabilities $p(t \mid \pi)$: a $T \times 1$ vector, for the first tag in a sentence

Training an HMM tagger
Maximum likelihood estimation: given a labeled corpus, e.g., the Penn Treebank, count how often we see the pairs $(t_i, t_j)$ and $(w_i, t_j)$:
$$p(t_j \mid t_i) = \frac{c(t_i, t_j)}{c(t_i)}, \qquad p(w_i \mid t_j) = \frac{c(w_i, t_j)}{c(t_j)}$$
Proper smoothing is necessary!
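
A sketch of these count-based estimates with simple add-alpha smoothing; the smoothing choice and the data layout (sentences as lists of (word, tag) pairs) are assumptions for illustration, not the lecture's prescription:

```python
from collections import Counter

def estimate_hmm(tagged_sentences, alpha=1.0):
    """Estimate transition and emission probabilities with add-alpha smoothing."""
    tag_count, trans_count, emit_count = Counter(), Counter(), Counter()
    vocab, tagset = set(), set()
    for sentence in tagged_sentences:
        prev = None
        for word, tag in sentence:
            tag_count[tag] += 1
            emit_count[(word, tag)] += 1
            vocab.add(word)
            tagset.add(tag)
            if prev is not None:
                trans_count[(prev, tag)] += 1
            prev = tag
    T, V = len(tagset), len(vocab)
    trans = {(ti, tj): (trans_count[(ti, tj)] + alpha) / (tag_count[ti] + alpha * T)
             for ti in tagset for tj in tagset}
    emit = {(w, t): (emit_count[(w, t)] + alpha) / (tag_count[t] + alpha * V)
            for w in vocab for t in tagset}
    # Initial state probabilities p(t | pi) would be estimated analogously from sentence-initial tags.
    return trans, emit

train = [[("the", "DT"), ("bill", "NN")],
         [("promised", "VBD"), ("to", "TO"), ("back", "VB"), ("the", "DT"), ("bill", "NN")]]
trans, emit = estimate_hmm(train)
print(trans[("DT", "NN")], emit[("bill", "NN")])
```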

Public POS taggers
- Brill's tagger: http://www.cs.jhu.edu/~brill/
- TnT tagger: http://www.coli.uni-saarland.de/~thorsten/tnt/
- Stanford tagger: http://nlp.stanford.edu/software/tagger.shtml
- SVMTool: http://www.lsi.upc.es/~nlp/svmtool/
- GENIA tagger: http://www-tsujii.is.s.u-tokyo.ac.jp/genia/tagger/
- A more complete list at http://www-nlp.stanford.edu/links/statnlp.html#taggers
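
As a quick hands-on complement to the tools above (not one of the taggers listed in the lecture), NLTK ships a pre-trained English tagger that outputs Penn Treebank tags; the download resource names below are an assumption and may vary across NLTK versions:

```python
import nltk

# One-time model downloads (resource names may differ in newer NLTK releases)
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("Pierre Vinken will join the board as a nonexecutive director.")
print(nltk.pos_tag(tokens))
# e.g. [('Pierre', 'NNP'), ('Vinken', 'NNP'), ('will', 'MD'), ('join', 'VB'), ...]
```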

Let's take a look at other NLP tasks
Noun phrase (NP) chunking. Task: identify all non-recursive NP chunks.

The BIO encoding
Define three new tags:
- B-NP: beginning of a noun phrase chunk
- I-NP: inside of a noun phrase chunk
- O: outside of a noun phrase chunk
POS tagging with a restricted tagset?

Another NLP task
Shallow parsing. Task: identify all non-recursive NP, verb (VP) and preposition (PP) chunks.

BIO Encoding for Shallow Parsing
Define several new tags:
- B-NP, B-VP, B-PP: beginning of an NP, VP, or PP chunk
- I-NP, I-VP, I-PP: inside of an NP, VP, or PP chunk
- O: outside of any chunk
POS tagging with a restricted tagset?

Yet another NLP task
Named Entity Recognition. Task: identify all mentions of named entities (people, organizations, locations, dates).

BIO Encoding for NER
Define many new tags:
- B-PERS, B-DATE, ...: beginning of a mention of a person/date...
- I-PERS, I-DATE, ...: inside of a mention of a person/date...
- O: outside of any mention of a named entity
POS tagging with a restricted tagset?
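
A small sketch of how entity spans can be converted into per-token BIO tags; the span format (start, end, type) with an exclusive end index is an assumption chosen for illustration:

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, type) spans over token indices into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:          # end index is exclusive
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags

tokens = ["Pierre", "Vinken", "joined", "the", "board", "Nov.", "29"]
spans = [(0, 2, "PERS"), (5, 7, "DATE")]
print(list(zip(tokens, spans_to_bio(tokens, spans))))
# [('Pierre', 'B-PERS'), ('Vinken', 'I-PERS'), ('joined', 'O'), ..., ('Nov.', 'B-DATE'), ('29', 'I-DATE')]
```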

Sequence labeling
Many NLP tasks are sequence labeling tasks:
- Input: a sequence of tokens/words
- Output: a sequence of corresponding labels, e.g., POS tags or the BIO encoding for NER
Solution: find the most probable label sequence for the given word sequence, $t^* = \arg\max_t p(t \mid w)$.

Comparing to the traditional classification problem

Sequence labeling:
- $t^* = \arg\max_t p(t \mid w)$
- $t$ is a vector/matrix
- Dependencies both between $(t, w)$ and among $(t_i, t_j)$: structured output
- The inference problem is difficult to solve

Traditional classification:
- $y^* = \arg\max_y p(y \mid x)$
- $y$ is a single label
- Dependency only within $(y, x)$: independent outputs
- The inference problem is easy to solve

Two modeling perspectives
- Generative models: model the joint probability of labels and words, $t^* = \arg\max_t p(t \mid w) = \arg\max_t p(w \mid t)\, p(t)$
- Discriminative models: directly model the conditional probability of labels given the words, $t^* = \arg\max_t p(t \mid w) = \arg\max_t f(t, w)$

Generative vs. discriminative models
Binary classification as an example.
[Figure: the generative model's view and the discriminative model's view of the same binary classification problem.]

Generative vs. discriminative models

Generative:
- Specifies the joint distribution: a full probabilistic specification of all the random variables
- Dependence assumptions have to be specified for $p(w \mid t)$ and $p(t)$
- Flexible; can be used in unsupervised learning

Discriminative:
- Specifies the conditional distribution: only explains the target variable
- Arbitrary features can be incorporated for modeling $p(t \mid w)$
- Needs labeled data, so only suitable for (semi-)supervised learning

Maximum entropy Markov models
MEMMs are discriminative models of the labels $t$ given the observed input sequence $w$:
$$p(t \mid w) = \prod_i p(t_i \mid w_i, t_{i-1})$$

Design features
- Emission-like features:
  - Binary feature functions: $f_{\text{first-letter-capitalized-NNP}}(\text{China}) = 1$, $f_{\text{first-letter-capitalized-VB}}(\text{know}) = 0$
  - Integer (or real-valued) feature functions: $f_{\text{number-of-vowels-NNP}}(\text{China}) = 2$
- Transition-like features:
  - Binary feature functions: $f_{\text{first-letter-capitalized-VB-NNP}}(\text{China}) = 1$
(Example context: know_VB China_NNP.) The features are not necessarily independent!
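
A sketch of such feature functions as plain Python callables keyed by name, so that each can later receive its own weight $\lambda_k$; the feature names mirror the slide, while the representation itself is an assumption:

```python
# Each feature looks at the current word, the current tag, and the previous tag.
features = {
    "first-letter-capitalized-NNP":
        lambda w, t, t_prev: 1.0 if w[:1].isupper() and t == "NNP" else 0.0,
    "first-letter-capitalized-VB":
        lambda w, t, t_prev: 1.0 if w[:1].isupper() and t == "VB" else 0.0,
    "number-of-vowels-NNP":
        lambda w, t, t_prev: float(sum(c in "aeiouAEIOU" for c in w)) if t == "NNP" else 0.0,
    "first-letter-capitalized-VB-NNP":   # transition-like feature
        lambda w, t, t_prev: 1.0 if w[:1].isupper() and t == "NNP" and t_prev == "VB" else 0.0,
}

print(features["first-letter-capitalized-NNP"]("China", "NNP", "VB"))  # 1.0
print(features["number-of-vowels-NNP"]("China", "NNP", "VB"))          # 2.0
```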

Parameterization of $p(t_i \mid w_i, t_{i-1})$
- Associate a real-valued weight $\lambda_k$ with each specific type of feature function, e.g., $\lambda_k$ for $f_{\text{first-letter-capitalized-NNP}}(w)$.
- Define a scoring function $f(t_i, t_{i-1}, w_i) = \sum_k \lambda_k f_k(t_i, t_{i-1}, w_i)$.
- Naturally, $p(t_i \mid w_i, t_{i-1}) \propto \exp f(t_i, t_{i-1}, w_i)$.
Recall the basic definition of probability: $P(x) > 0$ and $\sum_x p(x) = 1$.

Parameterization of MEMMs
$$p(t \mid w) = \prod_i p(t_i \mid w_i, t_{i-1}) = \prod_i \frac{\exp f(t_i, t_{i-1}, w_i)}{\sum_{t_i'} \exp f(t_i', t_{i-1}, w_i)}$$
It is a log-linear model:
$$\log p(t \mid w) = \sum_i f(t_i, t_{i-1}, w_i) - C(\lambda)$$
where $C(\lambda)$ is a constant related only to $\lambda$. The Viterbi algorithm can be used to decode the most probable label sequence solely based on $\sum_i f(t_i, t_{i-1}, w_i)$.
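
A sketch of the locally normalized distribution $p(t_i \mid w_i, t_{i-1})$, reusing the feature-dictionary idea from the previous sketch; the weights and the candidate tagset below are made up for illustration:

```python
import math

def score(word, tag, prev_tag, features, weights):
    """f(t_i, t_{i-1}, w_i) = sum_k lambda_k * f_k(t_i, t_{i-1}, w_i)."""
    return sum(weights[name] * feat(word, tag, prev_tag) for name, feat in features.items())

def local_prob(word, prev_tag, tagset, features, weights):
    """Softmax over candidate tags for one position: p(t_i | w_i, t_{i-1})."""
    scores = {t: math.exp(score(word, t, prev_tag, features, weights)) for t in tagset}
    z = sum(scores.values())
    return {t: s / z for t, s in scores.items()}

features = {"capitalized-NNP": lambda w, t, tp: 1.0 if w[:1].isupper() and t == "NNP" else 0.0}
weights = {"capitalized-NNP": 2.0}
print(local_prob("China", "VB", ["NNP", "VB", "NN"], features, weights))  # NNP gets ~0.79
```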

Parameter estimation
A maximum likelihood estimator can be used in a similar way as in HMMs:
$$\lambda^* = \arg\max_\lambda \sum_{(t, w)} \log p(t \mid w) = \arg\max_\lambda \sum_{(t, w)} \Big[ \sum_i f(t_i, t_{i-1}, w_i) - C(\lambda) \Big]$$
Decompose the training data into such $(t_i, t_{i-1}, w_i)$ units.

Why maximum entropy?
We will explain this in detail when discussing logistic regression models.

A little bit more about MEMMs
Emission features can go across multiple observations:
$$f(t_i, t_{i-1}, w) \triangleq \sum_k \lambda_k f_k(t_i, t_{i-1}, w)$$
i.e., the feature functions may look at the whole observation sequence $w$ rather than just $w_i$. This is especially useful for shallow parsing and NER tasks.

Conditional random fields
A more advanced model for sequence labeling that models global dependency:
$$p(t \mid w) \propto \prod_i \exp\Big( \sum_k \lambda_k f_k(t_i, w) + \sum_l \eta_l g_l(t_i, t_{i-1}, w) \Big)$$
[Figure: a chain of tags $t_1 \ldots t_4$ over words $w_1 \ldots w_4$.] Node features $f(t_i, w)$; edge features $g(t_i, t_{i-1}, w)$.
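
A sketch of the unnormalized score inside that product, with one node feature and one edge feature; the feature choices and weights are illustrative, and a full CRF would additionally compute the global normalizer over all label sequences:

```python
import math

def crf_unnormalized_score(words, tags, node_feats, edge_feats, lam, eta):
    """exp of the summed weighted node and edge features over all positions."""
    total = 0.0
    for i, (w, t) in enumerate(zip(words, tags)):
        total += sum(lam[k] * f(t, words, i) for k, f in node_feats.items())
        if i > 0:
            total += sum(eta[l] * g(t, tags[i - 1], words, i) for l, g in edge_feats.items())
    return math.exp(total)

node_feats = {"cap-NNP": lambda t, ws, i: 1.0 if ws[i][:1].isupper() and t == "NNP" else 0.0}
edge_feats = {"NNP-after-NNP": lambda t, tp, ws, i: 1.0 if t == "NNP" and tp == "NNP" else 0.0}
lam, eta = {"cap-NNP": 1.5}, {"NNP-after-NNP": 0.8}

words = ["Pierre", "Vinken", "joined"]
print(crf_unnormalized_score(words, ["NNP", "NNP", "VBD"], node_feats, edge_feats, lam, eta))
```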

What you should know
- Definition of the POS tagging problem: properties & challenges, public tag sets
- Generative models for POS tagging: HMMs
- The general sequence labeling problem
- Discriminative models for sequence labeling: MEMMs

Today's reading
Speech and Language Processing:
- Chapter 5: Part-of-Speech Tagging
- Chapter 6: Hidden Markov and Maximum Entropy Models
- Chapter 22: Information Extraction (optional)