CMSC 723: Computational Linguistics I
Assignment 3: Let's play tag!
Jimmy Lin (Instructor) and Melissa Egan (TA)
Due: October 14, 2009

Introduction

This assignment is about exploring part-of-speech (POS) tagging using n-gram taggers, tagger combination, and hidden Markov models (HMMs). There are a total of three problems; the first requires no programming.

This assignment requires the Python modules below:

1. Matplotlib: This provides the advanced plotting and visualization capabilities that we will need for Problem 2.

2. Numpy: This provides the efficient multi-dimensional array structure that we will need for Problem 3.

A link to installation instructions can be found on the course website under Software.

Background

In this section, we provide some background on building POS taggers. NLTK ships with a factory of POS taggers that can be easily trained on the included pre-tagged corpora. There are two main POS taggers that we will use:

1. DefaultTagger: This tagger tags every word with a default tag. For example, a very good baseline for English POS tagging is to just tag every word as a noun. Listing 1 shows how to build such a tagger.

Listing 1: Building and using a DefaultTagger

>>> import nltk
>>> t = nltk.DefaultTagger('NN')
>>> sentence = 'This is a sentence'
>>> words = sentence.split()
>>> print t.tag(words)
[('This', 'NN'), ('is', 'NN'), ('a', 'NN'), ('sentence', 'NN')]

2. NgramTagger: N-grams over any given sequence can be informally defined as overlapping subsequences, each of length N. We will formally define n-grams later in the course; for the purposes of this assignment, the informal definition should suffice. As an example, the sentence "My name is Nitin Madnani" yields the following n-grams for various values of N:

N = 1 (1-grams or Unigrams): My, name, is, Nitin, Madnani
N = 2 (2-grams or Bigrams): My name, name is, is Nitin, Nitin Madnani
N = 3 (3-grams or Trigrams): My name is, name is Nitin, is Nitin Madnani
N = 4 (4-grams): My name is Nitin, name is Nitin Madnani
N = 5 (5-grams): My name is Nitin Madnani
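If you want to generate such n-grams programmatically, here is a minimal sketch in plain Python (you do not need it for this assignment; it is only meant to make the definition concrete):

>>> words = 'My name is Nitin Madnani'.split()
>>> n = 2
# Each n-gram starts at position i and spans the next n words.
>>> [' '.join(words[i:i+n]) for i in range(len(words) - n + 1)]
['My name', 'name is', 'is Nitin', 'Nitin Madnani']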

So, how do we use n-grams for POS tagging? Figure 1 shows the basic idea behind this strategy. Instead of just looking at the word being tagged, we also look at the POS tags of the previous n words; using n-grams therefore allows us to take context into consideration when performing POS tagging. In the figure, we are using the text of the word itself plus the two previous tags, so N = 3.

Figure 1: How does an NgramTagger work? In this figure, N = 3 (original image from the NLTK documentation).

Looking at the figure, it should be easy to see how a UnigramTagger (N = 1) would work: it would use just the text of the word itself as the only context for predicting its POS tag. For example, it might learn that the word "promise" is more likely to be tagged as a verb (I promise you...) than as a noun (It is a promise...). It would therefore always tag "promise" as a verb, even though that's not always correct! However, if we were to use the previous tag as additional context, our tagger might also learn that if "promise" is preceded by an article (a), it should be tagged as a noun instead. Using larger context is therefore usually a good strategy when building n-gram based POS taggers.

The important thing to realize is that when using an NgramTagger, you need to train it on sentences for which you already know the POS tags. This is necessary because an NgramTagger needs to count and build tables of how many times a particular word is tagged as a verb (when N = 1), how many times a particular word preceded by a noun is tagged as a verb (when N = 2), and so on. To build these tables, it requires sentences in which the correct tag has already been assigned to each word.

Building and training an NgramTagger would usually be a little complicated, but NLTK makes it extremely easy. Listing 2 shows how to build and train a bigram tagger (N = 2) on the reviews category of the Brown corpus. (The Brown corpus tagset is shown in Figure 5.7 on page 134 of your textbook.) If the tagger cannot make a prediction about a particular word, it assigns that word a null tag, indicated by None.

Listing 2: Building and using an NgramTagger

>>> import nltk
>>> from nltk.corpus import brown
>>> traindata = brown.tagged_sents(categories='reviews')
>>> t = nltk.NgramTagger(n=2, train=traindata)
>>> sentence = 'This is a sentence'
>>> words = sentence.split()
>>> print t.tag(words)
[('This', 'DT'), ('is', 'BEZ'), ('a', 'AT'), ('sentence', None)]
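To make the idea of these "tables of evidence" concrete, here is a minimal sketch of the kind of word-to-tag counts a unigram tagger collects. This is an illustration using NLTK's ConditionalFreqDist class, not a description of how NgramTagger is implemented internally, and the exact counts you see will depend on the corpus slice:

>>> import nltk
>>> from nltk.corpus import brown
>>> traindata = brown.tagged_sents(categories='reviews')
# Count, for each word, how often it was seen with each tag.
>>> cfd = nltk.ConditionalFreqDist((w, t) for sent in traindata for (w, t) in sent)
>>> cfd['is'].max()   # the tag most frequently observed for 'is'
'BEZ'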

Restricting Training Evidence

As explained above, training an NgramTagger basically entails keeping track of the tag that was assigned to each word in every context it was seen in, and then using that as evidence for making predictions on test data. Now, it's reasonable to think that not all evidence should be considered reliable. For example, if a particular piece of evidence occurs only once in the training data, we may not want to rely on it, lest it be just an artifact of noise. NLTK allows us to control this with the cutoff parameter, as shown in Listing 3. By default, the value of the cutoff parameter is 1, i.e., during training, NLTK will ignore any piece of evidence unless it occurs in the training data at least twice (one more than the cutoff value). Note that the default cutoff of 1 should be sufficient for this assignment; the point of this section is just to provide information that may be worth having.

Listing 3: Using the cutoff parameter during training

>>> import nltk
>>> from nltk.corpus import brown
>>> traindata = brown.tagged_sents(categories='reviews')
# Treat everything as evidence (very noisy)
>>> t = nltk.NgramTagger(n=2, train=traindata, cutoff=0)

Measuring tagger accuracy

Assuming that you have the correct POS tags for the sentences on which you wish to test your tagger, NLTK also provides a simple way to compute how accurate your tagger's predictions are. Of course, these test sentences should be completely separate from the sentences used to train the tagger. Listing 4 shows how to compute the accuracy of a DefaultTagger on the editorial category of the Brown corpus. On this particular test set, tagging everything as a noun is correct only about 12.5% of the time.

Combining taggers

It's possible to combine two taggers such that if the primary tagger is unable to assign a tag to a particular word, it backs off to the second tagger for the prediction. This is known as backoff. Listing 5 shows how to do this in NLTK.

Listing 4: Measuring the accuracy of a DefaultTagger

>>> import nltk
>>> from nltk.corpus import brown
>>> testdata = brown.tagged_sents(categories='editorial')
>>> t = nltk.DefaultTagger('NN')
>>> print t.evaluate(testdata)
0.12458606583988052

Listing 5: Combining taggers in NLTK

>>> import nltk
>>> from nltk.corpus import brown
>>> traindata = brown.tagged_sents(categories='reviews')
>>> t1 = nltk.NgramTagger(n=1, train=traindata)
>>> t2 = nltk.NgramTagger(n=2, train=traindata, backoff=t1)
>>> sentence = 'This is a sentence'
>>> words = sentence.split()
>>> print t2.tag(words)
[('This', 'DT'), ('is', 'BEZ'), ('a', 'AT'), ('sentence', 'NN')]

Plotting using Matplotlib

As we have seen in class, the plotting capabilities of NLTK are quite primitive. The Python package Matplotlib provides more advanced plotting functions that generate nicer-looking plots. For this assignment, we only need to know how to make line plots and save them as image files. Listing 6 shows how to create and save a plot; Figure 2 shows the resulting file plot.png.

Listing 6: Create and save a simple line plot

from pylab import xlabel, ylabel, plot, savefig

x = range(1, 11)
y = [i**3 + 3 for i in x]
xlabel('x')
ylabel('x^3 + 3')
plot(x, y)
savefig('plot.png')

Figure 2: The file plot.png as produced by Listing 6.
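The pylab interface used in Listing 6 is a convenience wrapper. If you prefer, the same plot can be produced through the standard matplotlib.pyplot module, which is what most Matplotlib documentation uses; here is an equivalent sketch:

import matplotlib.pyplot as plt

x = range(1, 11)
y = [i**3 + 3 for i in x]
plt.xlabel('x')
plt.ylabel('x^3 + 3')
plt.plot(x, y)
plt.savefig('plot.png')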

Problem 1 (10 points)

Recall the constraints used in the EngCG rule-based tagger that we looked at in class. The system is described in more detail in Section 5.4 of your textbook. Say we have the following constraint in our tagger grammar:

if (-1 has only DT tag) then remove verb tags

Can you think of two different counter-examples where applying this constraint could lead to a possibly incorrect tagging?

Problem 2 (40 points)

(a) In Listing 2 above, why do you think the bigram tagger could not assign a tag to the word "sentence"? In Listing 5, however, a bigram tagger combined with a unigram tagger was able to correctly predict the tag for the same word. Why do you think that strategy worked?

(b) Create different combinations using a DefaultTagger and 3 different n-gram taggers (N = 1, 2, and 3). Use the first 500 sentences of the news category of the Brown corpus as the training data. Test each combination on the religion category of the Brown corpus. Which combination yields the highest accuracy? Plot the accuracy of the winning combination as the number of sentences used for training increases (by 500 sentences at each step). You need only go up to 4500 sentences.

(c) Let the coverage of a tagger on a test set be defined as the percentage of words that are not assigned a null tag (None). Train 6 different n-gram taggers (N = 1...6) on the news category of the Brown corpus. Compute the coverage and accuracy of each individual tagger (no combinations) on the religion category of the Brown corpus (one way of computing coverage is sketched below). Explain what happens to the two numbers as N increases.

(d) Note that the contextual information used both by a bigram tagger and by a first-order HMM tagger pertains only to the previous word. Does that mean a trigram tagger will always prove to be a better tagger than a first-order HMM? Put another way, does a first-order HMM have any advantages over an n-gram tagger with a much larger N (>= 3)? If so, what are they?
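For part (c), note that there is no built-in coverage method; here is a minimal sketch of how it might be computed, assuming t is one of your trained n-gram taggers (the variable names are illustrative):

>>> from nltk.corpus import brown
>>> testsents = brown.sents(categories='religion')
>>> outputs = [t.tag(sent) for sent in testsents]
>>> total = sum(len(sent) for sent in outputs)
# Count only words that actually received a tag.
>>> tagged = sum(1 for sent in outputs for (w, tag) in sent if tag is not None)
>>> print 100.0 * tagged / total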

Notes:

A combination should have at least two taggers. Listing 5 shows you how to combine two taggers; you have to figure out how you would use this method to create a combination of 3 or more taggers.

Even though there are a large number of possible combinations, you should be able to rule out many of them by thinking about how the individual taggers work and how they can complement each other. Your code should not enumerate all possible combinations to find the best one. The point of the problem is to ensure that you understand the pros and cons of each tagger well enough to come up with combinations that are reasonably good.

Since POS tagging is sentence oriented, we need to make sure that an NgramTagger does not consider context that goes beyond sentence boundaries. The NLTK implementation takes care of this for you.

Problem 3 (50 points)

You are provided with the file hmm.py, which defines a class called hmm. As soon as you instantiate this class, the various parameters of the HMM (transition probabilities, emission probabilities, etc.) are automatically computed, using the first 1000 sentences of the news category of the Brown corpus as the training data (via functions defined in the supporting file hmmtrainer.py). The following five parameters are available to each instance of the hmm class:

transitions: The probabilities of transitioning from one state to another. To get the probability of going to state s2 from state s1, use self.transitions[s1].prob(s2).

emissions: The probabilities of emitting a particular output symbol from a particular state. To get the probability of emitting output symbol sym in state s, use self.emissions[s].prob(sym).

priors: The probabilities of starting in a particular state. To get the probability that the HMM starts in state s, use self.priors.prob(s).

states: The states (tags) in the trained HMM.

symbols: The output symbols (words) in the trained HMM.
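Before writing any decoding code, it can help to poke at these parameters interactively. Here is a sketch, assuming the hmm constructor takes no arguments (check the provided file); the particular tags and word used below are purely illustrative:

>>> from hmm import hmm
>>> h = hmm()                          # training runs automatically on instantiation
>>> len(h.states), len(h.symbols)      # number of tags and distinct words seen
>>> h.priors.prob('NN')                # probability of starting in state NN
>>> h.transitions['AT'].prob('NN')     # probability that NN follows AT
>>> h.emissions['NN'].prob('promise')  # probability that state NN emits 'promise'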

Listing 7: Using multi-dimensional arrays

>>> from numpy import zeros, random, max, argmax, float32
# Create a 10x10 two-dimensional array initialized to zeros.
# Must use float32 to indicate 32-bit floating point precision.
>>> a = zeros((10, 10), float32)
# Add 0.5 to all elements.
>>> a += 0.5
# Element at row 1 and column 1 (zero-indexed).
>>> a[0,0]
0.5
# Add 1.0 to each element in the 6th column.
>>> a[:,5] += 1.0
# Create a 5x5 two-dimensional array with each element x
# randomly generated such that 0 <= x < 10.
>>> b = random.randint(0, 10, (5, 5))
# Find the largest element in the 5th column of b
# (your values will differ, since b is random).
>>> max(b[:,4])
9
# Find the row number in which this maximum occurred.
>>> argmax(b[:,4])
3

For this problem, implement the following:

(a) Add a decode() method to the class that performs Viterbi decoding to find the most likely tag sequence for a given word sequence.

(b) Add a tag() method that takes a sentence string as input and tags the words in that sentence using Viterbi decoding. Its output should have the form: This/DT is/BEZ a/AT sentence/NN.

(c) Tag each of the six sentences in the provided file given.sentences. Do you see any errors in the tags assigned to each sentence? If so, mention them.

Turn in the file hmm.py that implements the items above. Your program should accept sentences from stdin and print the tagged results to stdout. We will test your program with the following command-line invocation:

python hmm.py < given.sentences

Make sure your solution behaves exactly in this manner.

Notes:

1. The Viterbi decoding algorithm requires a two-dimensional trellis or chart. It is extremely tedious to implement such a chart with Python lists; this is where the efficient and versatile array datatype provided by Numpy comes in. Listing 7 shows how to create, initialize, and use a two-dimensional array. You should use such an array to implement the chart you need for decoding.

2. The probability values calculated by the trainer are going to be extremely small in scale, and multiplying two very small numbers can lead to loss of precision. Therefore, we strongly recommend that you use the log of the probabilities (logprobs) instead. To compute the log of the transition probability of going from s1 to s2, use self.transitions[s1].logprob(s2), and so on.

3. You do not need to lowercase the training data. Use the words as they occur in the data.

4. You do not need to worry about any words that are not seen in the training data. The probability distributions that the hmmtrainer module computes are all smoothed, which means that some nonzero probability mass is assigned to every event, whether or not it was observed in the data. In general, assigning a zero probability to any event is not a good idea when building statistical models, for an intuitive reason: just because you don't observe an event in your limited view of the world (as represented by the training data) doesn't mean that it never happens in the real world (which is what assigning it zero probability says). We will delve deeper into the technical details of smoothing later in the semester. For this problem, just know that you don't have to do anything special about unseen words.
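As a concrete illustration of Note 2, here is a minimal sketch of why log probabilities are needed: multiplying a hundred small probabilities underflows to zero in double precision, while the equivalent sum of logs stays perfectly representable.

>>> p = 1e-5
>>> prod = 1.0
>>> for i in range(100):
...     prod *= p
...
>>> prod    # 1e-500 is far below the smallest representable double, so this is 0.0
0.0
>>> import math
>>> 100 * math.log(p)    # the same quantity in log space: about -1151.3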