Part-of-Speech Tagging


TDDE09, 729A27 Natural Language Processing (2017). Part-of-Speech Tagging. Marco Kuhlmann, Department of Computer and Information Science. This work is licensed under a Creative Commons Attribution 4.0 International License.

Parts of speech

A part of speech is a category of words that play similar roles within the syntactic structure of a sentence. Parts of speech can be defined distributionally, by which words can fill the same slot: Kim saw the {elephant, movie, mountain, error} before we did. They can also be defined functionally: verbs act as predicates, nouns as arguments, adverbs modify verbs. There are many different tag sets for parts of speech: different languages, different levels of granularity, different design principles.

Universal part-of-speech tags

Tag    Category         Examples
ADJ    adjective        big, old
ADV    adverb           very, well
INTJ   interjection     ouch!
NOUN   noun             girl, cat, tree
VERB   verb             run, eat
PROPN  proper noun      Mary, John
ADP    adposition       in, to, during
AUX    auxiliary verb   has, was
CCONJ  conjunction      and, or, but
DET    determiner       a, my, this
NUM    cardinal number  0, one
PRON   pronoun          I, myself, this

Missing from this table: PART, SCONJ, PUNCT, SYM, X. Source: Universal Dependencies Project

Part-of-speech tagging

A part-of-speech tagger is a computer program that tags each word in a sentence with its part of speech. Part-of-speech tagging can be approached as a supervised machine learning problem. This requires training data. Part-of-speech taggers are commonly evaluated using accuracy, precision, and recall.

Ambiguity causes combinatorial explosion

(Figure: a lattice over the Swedish sentence jag bad om en kort bit, listing several candidate tags for each word, drawn from PN, VB, PP, DT, JJ, NN, SN, AB, PL, RG; the number of possible tag sequences is the product of the number of candidates per word.) Example by Joakim Nivre.

Overview of this section

Introduction to part-of-speech tagging
Evaluation of part-of-speech taggers
Method 1: Part-of-speech tagging with hidden Markov models
Method 2: Part-of-speech tagging with perceptrons

Evaluation of Part-of-Speech Taggers

A reminder about machine learning methodology

Training data: used to train a machine learning system.
Development data: used to evaluate during development and to set hyperparameters (for example the smoothing parameter in additive smoothing).
Test data: used to evaluate the final system.

Stockholm Umeå Corpus (SUC)

SUC is the largest manually annotated corpus of written Swedish, a collaboration between Stockholm University and Umeå University, created in the early 1990s. SUC contains more than 1.1 million tokens; these are annotated with parts of speech, morphological features, and lemmas. SUC is a balanced corpus with texts from different genres.

Accuracy

Confusion matrix (rows: gold-standard tag, columns: predicted tag):

      DT     JJ     NN     PP     VB
DT   923      0      0      0      1
JJ     2   1255    132      1      5
NN     0      7   4499      1     18
PP     0      0      0   2332      1
VB     0      5    132      2   3436

The diagonal holds the correctly tagged tokens (12445 in total); the off-diagonal cells hold the errors (307 in total). Accuracy = 12445 / (12445 + 307).

Precision with respect to NN

Using the same confusion matrix: of all tokens predicted as NN (the NN column), 4499 carry the gold-standard tag NN and 264 do not. Precision(NN) = 4499 / (4499 + 264).

Recall with respect to NN

Of all tokens whose gold-standard tag is NN (the NN row), 4499 are predicted as NN and 26 are not. Recall(NN) = 4499 / (4499 + 26).
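To make the three measures concrete, here is a minimal Python sketch (not part of the original slides) that computes accuracy and per-tag precision and recall from a confusion matrix whose rows are gold-standard tags and whose columns are predicted tags, using the matrix shown above:

# Confusion matrix from the slides: rows = gold-standard tag, columns = predicted tag.
CONFUSION = {
    "DT": {"DT": 923, "JJ": 0,    "NN": 0,    "PP": 0,    "VB": 1},
    "JJ": {"DT": 2,   "JJ": 1255, "NN": 132,  "PP": 1,    "VB": 5},
    "NN": {"DT": 0,   "JJ": 7,    "NN": 4499, "PP": 1,    "VB": 18},
    "PP": {"DT": 0,   "JJ": 0,    "NN": 0,    "PP": 2332, "VB": 1},
    "VB": {"DT": 0,   "JJ": 5,    "NN": 132,  "PP": 2,    "VB": 3436},
}

def accuracy(matrix):
    # Correctly tagged tokens (diagonal) divided by all tokens.
    correct = sum(matrix[t][t] for t in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total

def precision(matrix, tag):
    # Correct predictions of `tag` divided by all predictions of `tag` (column sum).
    predicted = sum(matrix[gold][tag] for gold in matrix)
    return matrix[tag][tag] / predicted

def recall(matrix, tag):
    # Correct predictions of `tag` divided by all gold occurrences of `tag` (row sum).
    gold_total = sum(matrix[tag].values())
    return matrix[tag][tag] / gold_total

print(accuracy(CONFUSION))          # 12445 / 12752, roughly 0.976
print(precision(CONFUSION, "NN"))   # 4499 / 4763, roughly 0.945
print(recall(CONFUSION, "NN"))      # 4499 / 4525, roughly 0.994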

Sample exam question

Confusion matrix (rows: gold-standard tag, columns: predicted tag):

      NN   JJ   VB
NN    58    6    1
JJ     5   11    2
VB     0    7   43

Compute (a) precision on adjectives, (b) recall on verbs.

Overview of this section

Introduction to part-of-speech tagging
Evaluation of part-of-speech taggers
Method 1: Part-of-speech tagging with hidden Markov models
Method 2: Part-of-speech tagging with perceptrons

Part-of-Speech Tagging with Hidden Markov Models

Ambiguity causes combinatorial explosion

(Figure, repeated from earlier: a lattice over the Swedish sentence jag bad om en kort bit, listing several candidate tags for each word, drawn from PN, VB, PP, DT, JJ, NN, SN, AB, PL, RG.) Example by Joakim Nivre.

Different parts of speech have different frequencies

Word / tag     PN     VB     PP     DT     JJ     NN
jag          4532      0      0      0      0     25
bad             0     41      0      0      0     10
om              0      0   4945      0      0      0
en            402      0      0     16      0      1
kort            0      0      0      0    125     18
bit             0      0      0      0      0     92

Data from the Stockholm Umeå Corpus

Different tag sequences have different frequencies

Previous / next     PN      VB      PP      DT      JJ      NN
PN                1291   35473    6812    1291    1759    1496
VB               24245   19470   22191   13175    8794   19282
PP                5582     198     501   19737   10751   52440
DT                 201       1     286     163   21648   23719
JJ                 233    1937    3650     245    2716   46678
NN                1149   41928   51855    1312    3350   10314

Data from the Stockholm Umeå Corpus

Hidden Markov model

A hidden Markov model (HMM) is a generalised Markov model with two types of probabilities:

transition probabilities P(tag2 | tag1): how probable is it to see a verb after having seen a pronoun?
output probabilities P(word | tag): how probable is it to see the word bad being tagged as a verb?

(Figure: state diagram of a Markov model with states BOS, w1, w2, and EOS, labelled with the transition probabilities P(w1 | BOS), P(w2 | BOS), P(w1 | w1), P(w2 | w1), P(w1 | w2), P(w2 | w2), P(EOS | w1), and P(EOS | w2).)

(Figure: state diagram of a hidden Markov model with hidden states PN and VB between BOS and EOS, labelled with the transition probabilities P(PN | BOS), P(VB | BOS), P(PN | PN), P(VB | PN), P(PN | VB), P(VB | VB), P(EOS | PN), P(EOS | VB), and with output probabilities per state: P(jag | PN) = 0.025775, P(bad | PN) = 0.000006, P(jag | VB) = 0.000004, P(bad | VB) = 0.000152.)

Learning hidden Markov models

To learn a hidden Markov model from a corpus, we can use maximum likelihood estimation just as before. To estimate the transition probability P(VB | PN), we ask: how often do we see VB given that the previous tag was PN? To estimate the output probability P(jag | PN), we ask: how often do we see the word jag when the tag is PN? We can also use various smoothing techniques just as before.
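As a concrete illustration of this estimation step, here is a minimal Python sketch (not from the course; all names are my own) that derives unsmoothed transition and output probabilities from a corpus of tagged sentences by relative-frequency counting:

from collections import Counter, defaultdict

def estimate_hmm(tagged_sentences):
    """Maximum likelihood estimation of transition and output probabilities.

    tagged_sentences: iterable of sentences, each a list of (word, tag) pairs.
    Returns (transition, output), where
      transition[prev_tag][tag] approximates P(tag | prev_tag)
      output[tag][word]         approximates P(word | tag)
    """
    transition_counts = defaultdict(Counter)
    output_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        prev = "BOS"
        for word, tag in sentence:
            transition_counts[prev][tag] += 1
            output_counts[tag][word] += 1
            prev = tag
        transition_counts[prev]["EOS"] += 1

    # Turn counts into relative frequencies (no smoothing in this sketch).
    transition = {p: {t: n / sum(c.values()) for t, n in c.items()}
                  for p, c in transition_counts.items()}
    output = {t: {w: n / sum(c.values()) for w, n in c.items()}
              for t, c in output_counts.items()}
    return transition, output

# Toy corpus with a single tagged sentence.
corpus = [[("jag", "PN"), ("bad", "VB"), ("om", "PP"), ("en", "DT"),
           ("kort", "JJ"), ("bit", "NN")]]
transition, output = estimate_hmm(corpus)
print(transition["PN"]["VB"], output["PN"]["jag"])  # both 1.0 on this toy corpus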

Probability of a tagged sentence

For the tagged sentence jag/PN bad/VB om/PP en/DT kort/JJ bit/NN, the probability is the product of transition and output probabilities:

P(PN | BOS) P(jag | PN) P(VB | PN) P(bad | VB) P(PP | VB) P(om | PP) P(DT | PP) P(en | DT) P(JJ | DT) P(kort | JJ) P(NN | JJ) P(bit | NN) P(EOS | NN)
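As a small illustration (a sketch with hypothetical names, not code from the course), the same product can be computed programmatically, given transition and output probability tables like the ones estimated in the sketch above:

def tagged_sentence_probability(words, tags, transition, output):
    # Product of transition and output probabilities for one tagged sentence,
    # including the transition from BOS and the final transition into EOS.
    p = 1.0
    prev = "BOS"
    for word, tag in zip(words, tags):
        p *= transition.get(prev, {}).get(tag, 0.0) * output.get(tag, {}).get(word, 0.0)
        prev = tag
    return p * transition.get(prev, {}).get("EOS", 0.0)

# jag/PN bad/VB om/PP en/DT kort/JJ bit/NN
# tagged_sentence_probability(["jag", "bad", "om", "en", "kort", "bit"],
#                             ["PN", "VB", "PP", "DT", "JJ", "NN"],
#                             transition, output)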

Tagging with a hidden Markov model

Given a sentence, we want to find a sequence of tags such that the probability of the tagged sentence is maximal. The tag sequence is not given in advance; it is hidden! For each sentence there are many different tag sequences with many different probabilities (combinatorial explosion). In spite of this, the most probable tag sequence can be found efficiently using the Viterbi algorithm.

Sample exam question

You want to compute the probability of this tagged sentence in an HMM: jag/PN skrev/VB på/PL utan/PP att/IE tveka/VB. You can ask the model for its atomic probabilities, but each such question costs 1 crown. Which questions do you need to ask, and how much do you have to pay?

The Viterbi Algorithm

Probability of a tagged sentence

For the tagged sentence jag/PN bad/VB om/PP en/DT kort/JJ bit/NN, the probability is the product of transition and output probabilities:

P(PN | BOS) P(jag | PN) P(VB | PN) P(bad | VB) P(PP | VB) P(om | PP) P(DT | PP) P(en | DT) P(JJ | DT) P(kort | JJ) P(NN | JJ) P(bit | NN) P(EOS | NN)

Tagging with a hidden Markov model

Given a sentence, we want to find a sequence of tags such that the probability of the tagged sentence is maximal. The tag sequence is not given in advance; it is hidden! For each sentence there are many different tag sequences with many different probabilities (combinatorial explosion). In spite of this, the most probable tag sequence can be found efficiently using the Viterbi algorithm.

High-level description

The algorithm takes as its input an HMM and a sentence, and computes the most probable tag sequence for the sentence. The algorithm fills a matrix that contains one row for each possible tag and one column for each position in the sentence (including BOS and EOS). In this presentation we fill the matrix with negative log probabilities; we can interpret them as costs in crowns. We do this to avoid underflow.
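To see why negative log probabilities are used, consider multiplying many small probabilities in double precision: the product underflows to zero, while the corresponding sum of costs stays in a comfortable range. A minimal illustration (not from the slides):

import math

probs = [1e-5] * 100           # 100 probabilities of 0.00001 each

product = 1.0
for p in probs:
    product *= p
print(product)                 # 0.0 -- the true value 1e-500 underflows in double precision

cost = sum(-math.log(p) for p in probs)
print(cost)                    # roughly 1151.3 -- perfectly representable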

The central invariant

The algorithm should make sure that the value in row t, column i is the minimal cost needed to tag the first i words in the sentence in such a way that word number i is tagged as t. Remember that minimal cost = maximal probability. If the algorithm can achieve this, then we can read off the least possible cost to tag the complete sentence from the last column.

The completed cost matrix for the sentence jag bad om en kort bit:

       start   jag 1   bad 2   om 3    en 4    kort 5  bit 6   end
BOS     0.00
DT             14.49   21.33   29.38   24.82   42.62   50.67
JJ             15.46   21.13   29.88   35.22   33.00   48.36
NN             11.22   19.53   29.74   33.58   35.44   41.63
PN              5.35   21.43   28.86   29.86   42.50   50.81
PP             14.59   20.02   20.70   38.53   42.41   48.32
VB             16.11   14.83   29.53   39.65   43.08   49.15
EOS                                                            45.93

Hidden Markov model 1: Transition costs (rows: previous tag, columns: next tag)

        PN      VB      PP      DT      JJ      NN      EOS
BOS    1.69    3.58    2.25    2.50    3.37    1.76   11.19
PN     4.00    0.69    2.34    4.00    3.69    3.85    7.94
VB     1.95    2.17    2.04    2.56    2.97    2.18    6.87
PP     3.09    6.42    5.49    1.82    2.43    0.85    8.38
DT     5.61   10.22    5.26    5.82    0.93    0.84   10.22
JJ     5.73    3.62    2.98    5.68    3.28    0.43    6.35
NN     5.30    1.70    1.49    5.17    4.23    3.11    4.30

Hidden Markov model 2: Observation costs (rows: tag, columns: word)

        jag     bad     om      en      kort    bit
PN     3.66   12.08   12.08    6.08   12.08   12.08
VB    12.53    8.79   12.53   12.53   12.53   12.53
PP    12.33   12.33    3.83   12.33   12.33   12.33
DT    11.99   11.99   11.99    2.29   11.99   11.99
JJ    12.09   12.09   12.09   12.09    7.25   12.09
NN     9.47   10.33   12.73   12.03    9.78    8.19

Filling the matrix, step 1: the cost of tagging word 1 (jag) as DT is 0.00 + cost(DT | BOS) + cost(jag | DT) = 0.00 + 2.50 + 11.99 = 14.49.

Step 2: the rest of the first column is filled in the same way; for example, the cost of tagging jag as PN is 0.00 + cost(PN | BOS) + cost(jag | PN) = 0.00 + 1.69 + 3.66 = 5.35.

Step 3: for later columns, each cell considers every possible previous tag. Coming from PN, the cost of tagging word 4 (en) as DT would be 28.86 + cost(DT | PN) + cost(en | DT) = 28.86 + 4.00 + 2.29 = 35.15.

Step 4: coming from PP instead gives 20.70 + cost(DT | PP) + cost(en | DT) = 20.70 + 1.82 + 2.29 = 24.82, which is cheaper; the cell stores this minimal value (together with a backpointer to PP).

Final step: the EOS cell takes the minimum over all tags t of cost(t, bit) + cost(EOS | t); here the minimum comes from NN, giving 41.63 + cost(EOS | NN) = 41.63 + 4.30 = 45.93. This completes the matrix shown above.

The matrix is now complete (it is the one shown earlier). To read off the most probable tag sequence, follow the backpointers from the EOS cell back to BOS.

A second example, the cost matrix for jag skrev på utan att tveka:

       start   jag 1   skrev 2  på 3    utan 4  att 5   tveka 6  end
BOS     0.00
IE             17.22   21.69    30.02   33.79   34.63   54.70
PL             21.77   21.20    22.10   39.77   49.28   55.06
PN              5.35   21.43    27.87   33.85   44.12   48.09
PP             14.59   20.02    18.69   28.95   44.66   50.70
SN             15.83   21.51    29.20   34.29   35.24   51.40
VB             16.11   13.84    28.54   37.64   43.96   44.86
EOS                                                              51.74

It does not suffice to pick the best cell in each column!

Computational complexity

Let m and n denote the number of tags in the HMM and the length of the input sentence, respectively. The memory required by the Viterbi algorithm is in O(mn); this corresponds to the size of the matrix. The runtime required by the Viterbi algorithm is in O(m²n): we need to fill O(mn) cells, and each cell requires us to look at O(m) cells in the previous column.
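Putting the high-level description, the invariant, and the complexity analysis together, here is a minimal Python sketch of the Viterbi algorithm (all names are illustrative; this is not the course implementation). It assumes cost dictionaries of the same shape as the transition and observation cost tables above, with the special tags BOS and EOS:

def viterbi(words, tags, trans_cost, obs_cost):
    """Find the cheapest (= most probable) tag sequence for a non-empty sentence.

    trans_cost[prev][tag] = -log P(tag | prev), with special tags "BOS" and "EOS".
    obs_cost[tag][word]   = -log P(word | tag).
    """
    n = len(words)
    # matrix[i][tag] = minimal cost of tagging words[0..i] with words[i] tagged as tag
    matrix = [dict() for _ in range(n)]
    backpointer = [dict() for _ in range(n)]

    for tag in tags:                                  # first column: transitions from BOS
        matrix[0][tag] = trans_cost["BOS"][tag] + obs_cost[tag][words[0]]
        backpointer[0][tag] = "BOS"

    for i in range(1, n):                             # remaining columns
        for tag in tags:
            best_prev = min(tags, key=lambda p: matrix[i - 1][p] + trans_cost[p][tag])
            matrix[i][tag] = (matrix[i - 1][best_prev] + trans_cost[best_prev][tag]
                              + obs_cost[tag][words[i]])
            backpointer[i][tag] = best_prev

    # final transition into EOS
    last = min(tags, key=lambda t: matrix[n - 1][t] + trans_cost[t]["EOS"])

    sequence = [last]                                 # follow the backpointers
    for i in range(n - 1, 0, -1):
        sequence.append(backpointer[i][sequence[-1]])
    return list(reversed(sequence))

With the two cost tables above, this sketch should reproduce the worked matrix; for example, the cost of tagging jag as PN in the first column comes out as 1.69 + 3.66 = 5.35.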

Overview of this section

Introduction to part-of-speech tagging
Evaluation of part-of-speech taggers
Method 1: Part-of-speech tagging with hidden Markov models
Method 2: Part-of-speech tagging with perceptrons

Part-of-Speech Tagging with Perceptrons

Part-of-speech tagging as classification

Part-of-speech tagging can be cast as a sequence of classification problems: one classification per word in the sentence. Based on this idea, any method for classification can be used to build a part-of-speech tagger, for example Naive Bayes. Here we use a very simple non-probabilistic method called the multi-class perceptron.

The multi-class perceptron

(Figure: the input features x1 and x2 are connected, via class-specific weights w1 and w2, to one summation unit Σ per class, producing the activations a1 and a2.) The activation is the weighted sum of the features.

Interpretation of feature weights

Features whose weights are zero do not contribute to the activation; such features are ignored. Features whose weights are positive cause the activation to increase; they suggest that the input belongs to the class. Features whose weights are negative cause the activation to decrease; they suggest that the input falls outside of the class.
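A minimal sketch of how a multi-class perceptron classifies (the feature names and weight values below are made up for illustration): each class has its own weight vector, the activation is the weighted sum of the features that are present, and the predicted class is the one with the highest activation.

def activation(weights, features):
    # Weighted sum of the (binary) features that are present in the input.
    return sum(weights.get(f, 0.0) for f in features)

def predict(weights_per_class, features):
    # Return the class with the highest activation.
    return max(weights_per_class,
               key=lambda c: activation(weights_per_class[c], features))

# Illustrative weights for three tags; the feature names are hypothetical.
weights_per_class = {
    "PN": {"word=jag": 2.5, "prev_tag=BOS": 0.7},
    "VB": {"word=jag": -1.0, "prev_tag=BOS": 0.2, "word=bad": 1.9},
    "NN": {"word=bad": 0.8},
}
print(predict(weights_per_class, ["word=jag", "prev_tag=BOS"]))  # PN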

Part-of-speech tagging with a perceptron (worked example)

For the first word of the sentence jag bad om en kort bit, the classifier computes one activation per tag, for example NN 9.36, PN 81.72, VB 9.18; the highest activation is the one for PN, so jag is tagged as PN.
For the second word, bad, the activations are NN 16.08, PN 4.02, VB 64.32; the highest is the one for VB, so bad is tagged as VB.
The tagger continues through the rest of the sentence in the same way, one word at a time.

Feature windows

Hidden Markov models look back one step; but sometimes it is a good idea to look back further, or to look ahead! Jag bad om en kort bit. At the same time, we do not want the classifier to see too much information (efficiency, data sparseness). A compromise is to define a limited feature window.

Comparison between the two methods

Part-of-speech tagging with hidden Markov models: probabilistic; exhaustive search for the best sequence (Viterbi algorithm); limited possibilities to define features (current word, previous tag).
Part-of-speech tagging with multi-class perceptrons: non-probabilistic; no search, locally optimal decisions; more possibilities to define features (feature windows).

Feature window

(Illustration: the sentence jag bad om en kort bit between BOS and EOS markers, with the tag PN already assigned and a window drawn around the current position.) With this feature window, we see the current word, the previous word, the next word, and the previous tag.

Feature window

(The same illustration, with the window at a later position in the sentence.) The feature window moves forward during tagging.
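A sketch of how such a feature window could be turned into features and used in a greedy left-to-right tagger; the feature names are hypothetical, and the classify argument is meant to be something like the predict function from the perceptron sketch above:

def window_features(words, i, prev_tag):
    # Features for position i: current word, previous word, next word, previous tag.
    prev_word = words[i - 1] if i > 0 else "BOS"
    next_word = words[i + 1] if i + 1 < len(words) else "EOS"
    return [f"word={words[i]}",
            f"prev_word={prev_word}",
            f"next_word={next_word}",
            f"prev_tag={prev_tag}"]

def greedy_tag(words, classify):
    # Tag a sentence left to right, making a locally optimal decision at each word.
    # `classify` maps a list of features to a tag, e.g. the perceptron's predict.
    tags = []
    prev_tag = "BOS"
    for i in range(len(words)):
        prev_tag = classify(window_features(words, i, prev_tag))
        tags.append(prev_tag)
    return tags

# greedy_tag(["jag", "bad", "om", "en", "kort", "bit"],
#            lambda feats: predict(weights_per_class, feats))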

Comparison between the two methods

Hidden Markov model: Viterbi search, HMM features. Multi-class perceptron: greedy search, fine-tuned features. Reported tagging accuracies on the SUC test set: 92.71%, 89.97%, 88.86%, 95.30%.

Overview of this section

Introduction to part-of-speech tagging
Evaluation of part-of-speech taggers
Method 1: Part-of-speech tagging with hidden Markov models
Method 2: Part-of-speech tagging with perceptrons