Language Technology (2018)
Part-of-speech tagging
Marco Kuhlmann, Department of Computer and Information Science
This work is licensed under a Creative Commons Attribution 4.0 International License.

Parts of speech

A part of speech is a category of words that play similar roles within the syntactic structure of a sentence. Three common parts of speech are noun, verb, and adjective. Example: Kim loves fast cars. There are many different tag sets for parts of speech: different languages, different levels of granularity.

Universal part-of-speech tags

Tag    Category                  Examples
ADJ    adjective                 big, old
ADP    adposition                in, to, during
ADV    adverb                    very, well
AUX    auxiliary verb            has, should
CCONJ  coordinating conjunction  and, or, but
DET    determiner                a, my, this
INTJ   interjection              ouch!
NOUN   noun                      girl, cat, tree
NUM    cardinal numbers          one, two
PRON   pronoun                   you, herself
PROPN  proper noun               Mary, John
VERB   verb                      run, eat

plus PART, SCONJ, PUNCT, SYM, X

Source: Universal Dependencies Project

Part-of-speech tagging

A part-of-speech tagger is a computer program that tags each word in a sentence with its part of speech. Part-of-speech tagging can be cast as a supervised machine learning problem. This requires training data: sentences whose words are tagged with their correct part of speech.
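
To make the training data concrete, here is a minimal sketch (in Python, with example sentences of my own choosing) of the kind of input a tagger is trained on: each sentence is a list of word-tag pairs.

```python
# POS-tagged training data: one list of (word, tag) pairs per sentence.
# Tags follow the universal tag set introduced above.
training_data = [
    [("Kim", "PROPN"), ("loves", "VERB"), ("fast", "ADJ"), ("cars", "NOUN")],
    [("I", "PRON"), ("want", "VERB"), ("to", "PART"),
     ("live", "VERB"), ("in", "ADP"), ("peace", "NOUN")],
]
```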

Ambiguity causes combinatorial explosion

I want to live in peace

I:     PRON, NOUN
want:  VERB, NOUN
to:    PART, ADP, ADV
live:  VERB, ADJ, ADV
in:    ADP, ADV, ADJ, NOUN
peace: NOUN, VERB

With these candidate tags, this short sentence already has 2 x 2 x 3 x 3 x 4 x 2 = 288 possible tag sequences.

"I only want to live in peace, plant potatoes, and dream!" (Moomin)

This Stanford University alumnus co-founded the educational technology company Coursera. (Source: MacArthur Foundation)

SPARQL query against DBpedia:

SELECT DISTINCT ?x WHERE {
  ?x dbpedia-owl:almaMater dbres:Stanford_University .
  dbres:Coursera dbpedia-owl:founder ?x .
}

Named entity recognition as tagging

State-of-the-art algorithms treat named entity recognition as a word-by-word tagging task, just as part-of-speech tagging! The basic idea is to use tags that can encode both the boundaries and the types of named entity mentions. A common encoding is the IOB scheme, where there is a tag for the beginning (B) and the inside (I) of each entity type, as well as an additional tag for tokens outside (O) any entity.

Named entity recognition as tagging

Token        IOB tag
American     B-ORG
Airlines     I-ORG
immediately  O
matched      O
the          O
move         O
Wagner       B-PER
said         O
.            O
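
The IOB encoding is easy to decode back into entity mentions. Below is a small sketch (function name and layout my own) that recovers the entity spans from a tagged token sequence like the one above; it silently ignores stray I- tags that do not continue an entity, which a production system might handle differently.

```python
def iob_to_entities(tokens, tags):
    """Collect (entity type, mention) pairs from an IOB-tagged sentence."""
    entities, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):                     # a new entity begins
            current = (tag[2:], [token])
            entities.append(current)
        elif tag.startswith("I-") and current is not None and current[0] == tag[2:]:
            current[1].append(token)                 # the current entity continues
        else:                                        # O, or a stray I- tag
            current = None
    return [(etype, " ".join(words)) for etype, words in entities]

tokens = "American Airlines immediately matched the move Wagner said .".split()
tags = ["B-ORG", "I-ORG", "O", "O", "O", "O", "B-PER", "O", "O"]
print(iob_to_entities(tokens, tags))
# [('ORG', 'American Airlines'), ('PER', 'Wagner')]
```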

Outline for today
- Introduction to part-of-speech tagging
- Evaluation of part-of-speech taggers
- Part-of-speech tagging with hidden Markov models
- Part-of-speech tagging with multi-class perceptrons

Evaluation of part-of-speech taggers

Reminder: Evaluation of text classifiers

[figure: a collection of documents (Windsor, The Queen, Mao, Communist, TV-ads, campaign), each shown with its gold-standard class and its predicted class (A, B, C); evaluation compares the two.]

Evaluation of part-of-speech taggers

Word               I     want  to    work  in    films
gold-standard tag  PRON  VERB  PART  VERB  ADP   NOUN
predicted tag      PRON  VERB  ADP   NOUN  ADP   NOUN

Stockholm Umeå Corpus (SUC)

SUC is the largest manually annotated corpus of written Swedish, a collaboration between Stockholm University and Umeå University, created in the early 1990s. SUC contains more than 1.1 million tokens; these are annotated with parts of speech, morphological features, and lemmas. SUC is a balanced corpus with texts from different genres.

Accuracy

Confusion matrix (rows: gold-standard tag, columns: predicted tag):

       DET   ADJ   NOUN  ADP   VERB
DET    923   0     0     0     1
ADJ    2     1255  132   1     5
NOUN   0     7     4499  1     18
ADP    0     0     0     2332  1
VERB   0     5     132   2     3436

Accuracy is the sum of the diagonal divided by the sum of all cells: (923 + 1255 + 4499 + 2332 + 3436) / 12752 ≈ 97.6%.

Precision with respect to NOUN

Precision is computed over the NOUN column of the matrix above: the number of correctly predicted NOUNs divided by the total number of predicted NOUNs, 4499 / (0 + 132 + 4499 + 0 + 132) = 4499 / 4763 ≈ 94.5%.

Recall with respect to NOUN

Recall is computed over the NOUN row of the matrix above: the number of correctly predicted NOUNs divided by the total number of gold-standard NOUNs, 4499 / (0 + 7 + 4499 + 1 + 18) = 4499 / 4525 ≈ 99.4%.
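
Both quantities can be read off the confusion matrix directly. The sketch below (variable and function names my own) encodes the matrix from the slides, with rows as gold-standard tags and columns as predicted tags, and computes precision and recall for a given tag.

```python
TAGS = ["DET", "ADJ", "NOUN", "ADP", "VERB"]
M = [
    [923,    0,    0,    0,    1],   # gold DET
    [  2, 1255,  132,    1,    5],   # gold ADJ
    [  0,    7, 4499,    1,   18],   # gold NOUN
    [  0,    0,    0, 2332,    1],   # gold ADP
    [  0,    5,  132,    2, 3436],   # gold VERB
]

def precision(tag):
    j = TAGS.index(tag)          # correct predictions / all predictions of tag
    return M[j][j] / sum(M[i][j] for i in range(len(TAGS)))

def recall(tag):
    i = TAGS.index(tag)          # correct predictions / all gold occurrences of tag
    return M[i][i] / sum(M[i][j] for j in range(len(TAGS)))

print(f"precision(NOUN) = {precision('NOUN'):.4f}")  # 4499 / 4763 ≈ 0.9446
print(f"recall(NOUN)    = {recall('NOUN'):.4f}")     # 4499 / 4525 ≈ 0.9943
```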

Sample exam question

Confusion matrix (rows: gold-standard tag, columns: predicted tag):

       NOUN  ADJ  VERB
NOUN   58    6    1
ADJ    5     11   2
VERB   0     7    43

Compute (a) the precision on adjectives, (b) the recall on verbs.

Outline for today
- Introduction to part-of-speech tagging
- Evaluation of part-of-speech taggers
- Part-of-speech tagging with hidden Markov models
- Part-of-speech tagging with multi-class perceptrons

Part-of-speech tagging with hidden Markov models

Ambiguity causes combinatorial explosion

I want to live in peace

I:     PRON, NOUN
want:  VERB, NOUN
to:    PART, ADP, ADV
live:  VERB, ADJ, ADV
in:    ADP, ADV, ADJ, NOUN
peace: NOUN, VERB

"I only want to live in peace, plant potatoes, and dream!" (Moomin)

Relative frequencies of tags per word

I:     PRON 99.97%, NOUN 0.00%
want:  VERB 100.00%, NOUN 0.00%
to:    PART 63.46%, ADP 35.13%, ADV 0.12%
live:  VERB 83.87%, ADJ 14.52%, ADV 0.00%
in:    ADP 92.92%, ADV 3.61%, ADJ 0.03%, NOUN 0.27%
peace: NOUN 100.00%, VERB 0.00%

Data: UD English Treebank (training data)

Relative frequencies of next tags per tag

Tag / next tag  ADJ     ADP     ADV     NOUN    PART    PRON    VERB
ADJ             5.22%   7.93%   1.34%   54.70%  3.26%   1.37%   0.94%
ADP             6.25%   2.96%   1.59%   16.35%  0.07%   13.22%  0.67%
ADV             13.70%  8.94%   10.53%  1.46%   1.84%   8.99%   19.37%
NOUN            1.14%   20.91%  3.70%   12.70%  2.82%   4.13%   5.87%
PART            3.59%   0.61%   4.12%   7.76%   0.14%   0.65%   71.03%
PRON            3.80%   3.78%   5.19%   13.42%  1.19%   2.84%   27.36%
VERB            4.32%   18.13%  7.25%   7.72%   6.74%   17.01%  1.62%

Data: UD English Treebank (training data)

Hidden Markov Model

A hidden Markov model (HMM) is a generalised Markov model with two types of probabilities:
transition probabilities P(next tag | tag): How probable is it to see a verb after having seen a pronoun?
output probabilities P(word | tag): How probable is it to see the word want being tagged as a verb?

[diagram: a hidden Markov model with two states a and b between BOS and EOS; the arcs carry the transition probabilities P(a | BOS), P(b | BOS), P(a | a), P(b | a), P(a | b), P(b | b), P(EOS | a), and P(EOS | b).]

[diagram: the same model with tags as states: PN and VB between BOS and EOS, with transition probabilities P(VB | BOS), P(PN | BOS), P(VB | VB), P(PN | PN), P(PN | VB), P(VB | PN), P(EOS | VB), and P(EOS | PN). Each state carries a table of output probabilities:]

w     P(w | VB)   P(w | PN)
jag   0.000004    0.025775
bad   0.000152    0.000006

Learning hidden Markov models

To learn a hidden Markov model from a corpus, we can use maximum likelihood estimation just as before:
To estimate the transition probability P(VERB | PRON), we ask: How often do we see VERB given that the previous tag was PRON?
To estimate the output probability P(want | VERB), we ask: How often do we see the word want when the tag is VERB?
We can also use various smoothing techniques just as before.
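
A minimal sketch of this estimation step, assuming the training data format shown earlier (lists of word-tag pairs) and leaving out the smoothing mentioned above; the function name is my own.

```python
from collections import Counter

def estimate_hmm(tagged_sentences):
    """Maximum likelihood estimates of the transition and output
    probabilities from sentences given as lists of (word, tag) pairs."""
    trans, out = Counter(), Counter()            # event counts
    context, tag_count = Counter(), Counter()    # denominator counts
    for sentence in tagged_sentences:
        previous = "BOS"
        for word, tag in sentence:
            trans[previous, tag] += 1            # for P(tag | previous tag)
            context[previous] += 1
            out[tag, word] += 1                  # for P(word | tag)
            tag_count[tag] += 1
            previous = tag
        trans[previous, "EOS"] += 1              # end-of-sentence transition
        context[previous] += 1
    p_trans = {(p, t): n / context[p] for (p, t), n in trans.items()}
    p_out = {(t, w): n / tag_count[t] for (t, w), n in out.items()}
    return p_trans, p_out
```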

Probability of a tagged sentence

I want to live in peace
PRON VERB PART VERB ADP NOUN

The probability of the tagged sentence is the product of the transition and output probabilities:

P = P(PRON | BOS) · P(I | PRON) · P(VERB | PRON) · P(want | VERB) · P(PART | VERB) · P(to | PART) · P(VERB | PART) · P(live | VERB) · P(ADP | VERB) · P(in | ADP) · P(NOUN | ADP) · P(peace | NOUN) · P(EOS | NOUN)
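
As a sketch, this product can be computed directly from the tables returned by estimate_hmm above (both the function and the toy training_data are my own constructions from earlier sketches); unseen events simply get probability zero here, which is exactly what smoothing would avoid.

```python
def sentence_probability(words, tags, p_trans, p_out):
    """Product of transition and output probabilities, as on the slide."""
    p, previous = 1.0, "BOS"
    for word, tag in zip(words, tags):
        p *= p_trans.get((previous, tag), 0.0)   # transition probability
        p *= p_out.get((tag, word), 0.0)         # output probability
        previous = tag
    return p * p_trans.get((previous, "EOS"), 0.0)

p_trans, p_out = estimate_hmm(training_data)
print(sentence_probability("I want to live in peace".split(),
                           ["PRON", "VERB", "PART", "VERB", "ADP", "NOUN"],
                           p_trans, p_out))
```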

Tagging with a hidden Markov model

Given a sentence, we want to find a sequence of tags such that the probability of the tagged sentence is maximal. The tag sequence is not given in advance; it is hidden! For each sentence there are many different tag sequences with many different probabilities (combinatorial explosion). In spite of this, the most probable tag sequence can be found efficiently using the Viterbi algorithm.
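
The following is a sketch of the Viterbi algorithm for this setting (names my own). It works in log space to avoid numerical underflow, and its running time is proportional to the sentence length times the square of the number of tags, rather than to the number of tag sequences.

```python
import math

def viterbi(words, tags, p_trans, p_out):
    """Most probable tag sequence for words under an HMM, in log space."""
    def logp(table, event):
        p = table.get(event, 0.0)
        return math.log(p) if p > 0.0 else -1e9    # stand-in for log(0)

    # best[i][t]: score of the best tagging of words[:i+1] that ends in tag t
    best = [{t: logp(p_trans, ("BOS", t)) + logp(p_out, (t, words[0]))
             for t in tags}]
    back = []                                      # back-pointers
    for i in range(1, len(words)):
        scores, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda s: best[-1][s] + logp(p_trans, (s, t)))
            scores[t] = (best[-1][prev] + logp(p_trans, (prev, t))
                         + logp(p_out, (t, words[i])))
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    # close with the end-of-sentence transition, then follow the pointers back
    last = max(tags, key=lambda t: best[-1][t] + logp(p_trans, (t, "EOS")))
    sequence = [last]
    for pointers in reversed(back):
        sequence.append(pointers[sequence[-1]])
    return list(reversed(sequence))
```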

Sample exam question

You want to compute the probability of this tagged sentence in an HMM:

jag skrev på utan att tveka
PN  VB   PL PP   IE  VB

You can ask the model for its atomic probabilities, but each such question costs 1 crown. Which questions do you need to ask, and how much do you have to pay?

Outline for today
- Introduction to part-of-speech tagging
- Evaluation of part-of-speech taggers
- Part-of-speech tagging with hidden Markov models
- Part-of-speech tagging with multi-class perceptrons

Part-of-speech tagging with multi-class perceptrons

Part-of-speech tagging as classification

Part-of-speech tagging can be cast as a sequence of classification problems: one classification per word in the sentence. Based on this idea, any method for classification, such as Naive Bayes, can be used to build a part-of-speech tagger. Here we use a very simple non-probabilistic method called the multi-class perceptron.

The classical perceptron

[diagram: inputs x1 and x2, weighted by w1 and w2, feed into a summation unit Σ that produces the activation a.]

activation = dot product of input and weights: a = w1·x1 + w2·x2

Inspiration from neurobiology

[image: a neuron with dendrites, synapses, cell body, and axon. Image source: Wikipedia]

The multi-class perceptron

[diagram: the inputs x1 and x2 feed into one summation unit per class, each with its own weight vector w; unit k produces the activation a_k.]

prediction = class with the highest activation

Interpretation of feature weights

Features whose weights are zero do not contribute to the activation; such features are ignored. Features whose weights are positive cause the activation to increase; they suggest that the input belongs to the class. Features whose weights are negative cause the activation to decrease; they suggest that the input falls outside of the class. This assumes that the features are either on (1) or off (0).
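
A minimal sketch of a multi-class perceptron with on/off features, as described above (class and method names my own); training repeatedly calls update on labelled examples.

```python
from collections import defaultdict

class MultiClassPerceptron:
    """One weight vector per class; predict the class with the highest
    activation, and change the weights only on mistakes."""

    def __init__(self, classes):
        self.weights = {c: defaultdict(float) for c in classes}

    def activation(self, c, features):
        # dot product, with every listed feature "on" (value 1)
        return sum(self.weights[c][f] for f in features)

    def predict(self, features):
        return max(self.weights, key=lambda c: self.activation(c, features))

    def update(self, features, gold):
        predicted = self.predict(features)
        if predicted != gold:
            for f in features:
                self.weights[gold][f] += 1.0       # towards the correct class
                self.weights[predicted][f] -= 1.0  # away from the wrong class
```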

Part-of-speech tagging with a perceptron

The tagger moves through the sentence jag bad om en kort bit from left to right. For each word it computes one activation per tag and predicts the tag with the highest activation:

jag: NN 9.36, PN 81.72, VB 9.18 → tagged PN
bad: NN 16.08, PN 4.02, VB 64.32 → tagged VB

and so on for the remaining words.

Feature windows

Hidden Markov models look back one step; but sometimes it is a good idea to look back further, or to look ahead! (Example: I want to live in peace.) At the same time, we do not want the classifier to see too much information (efficiency, data sparseness). A compromise is to define a limited feature window.

Feature window

jag bad om en kort bit

When the tagger is at the word bad, the window covers the current word (bad), the previous word (jag) together with its tag (PN), and the next word (om); BOS and EOS markers stand in for the missing neighbours at the sentence boundaries. The feature window moves forward during tagging.
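
A sketch of such a feature extractor (feature names my own), matching the window just described; its output is exactly the kind of on/off feature list the multi-class perceptron sketch above consumes.

```python
def window_features(words, i, previous_tag):
    """Features for tagging words[i]: current word, previous word,
    next word, and the previous tag."""
    return [
        "word=" + words[i],
        "prev_word=" + (words[i - 1] if i > 0 else "BOS"),
        "next_word=" + (words[i + 1] if i + 1 < len(words) else "EOS"),
        "prev_tag=" + previous_tag,
    ]

words = "jag bad om en kort bit".split()
print(window_features(words, 1, "PN"))
# ['word=bad', 'prev_word=jag', 'next_word=om', 'prev_tag=PN']
```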

Comparison between the two methods

Part-of-speech tagging with hidden Markov models:
- probabilistic
- exhaustive search for the best sequence (Viterbi algorithm)
- limited possibilities to define features (current word, previous tag)

Part-of-speech tagging with multi-class perceptrons:
- non-probabilistic
- no search; locally optimal decisions
- more possibilities to define features (feature windows)

Comparison between the two methods

Tagging accuracy on the SUC test set:

                     Hidden Markov model   Multi-class perceptron
                     (Viterbi search)      (greedy search)
HMM features         92.71%                89.97%
fine-tuned features  88.86%                95.30%

Limitations of the perceptron

[two plots over the features x1 and x2: in the first, the 0- and 1-labelled points can be separated by a straight line (linearly separable); in the second, the labels follow xnor(x1, x2), and no straight line separates them (not linearly separable).]

New features to the rescue!

[plot: the same four points with a third feature x3 = xnor(x1, x2); in the three-dimensional space (x1, x2, x3), the two classes become linearly separable.]
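
A tiny sketch verifying this: with the extra feature x3 = xnor(x1, x2), the weight vector (0, 0, 1) with threshold 0.5 classifies all four points of the XNOR problem correctly, since the activation then equals x3, which equals the label.

```python
def xnor(x1, x2):
    return 1 if x1 == x2 else 0

# Activation under the weights (0, 0, 1) in the space (x1, x2, x3).
for x1 in (0, 1):
    for x2 in (0, 1):
        x3 = xnor(x1, x2)
        activation = 0 * x1 + 0 * x2 + 1 * x3
        print((x1, x2, x3), "label:", xnor(x1, x2),
              "predicted:", int(activation > 0.5))
```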

How do we get new features?

We want to apply the linear model not to x directly but to a representation φ(x) of x. How do we get this representation?
Option 1. Manually engineer φ using expert knowledge (feature engineering, linear classifiers).
Option 2. Make the model sensitive to parameters such that learning these parameters identifies a good representation φ (feature learning, neural networks).

Outline for today
- Introduction to part-of-speech tagging
- Evaluation of part-of-speech taggers
- Part-of-speech tagging with hidden Markov models
- Part-of-speech tagging with multi-class perceptrons