Midterm practice questions
UMass CS 585, October 2017

1 Topics on the midterm

Language concepts
- Parts of speech
- Regular expressions, text normalization

Probability / machine learning
- Probability theory: marginal probabilities, conditional probabilities, law(s) of total probability, Bayes Rule
- Maximum likelihood estimation
- Naive Bayes
- Relative frequency estimation and pseudocount smoothing
- Logistic regression (for binary classification)
- Markov / N-Gram language models

Structured models
- Hidden Markov models
- Viterbi algorithm
- Log-linear models and CRFs
- (Structured) Perceptron

2 Decoding

Question 2.1. Consider the Viterbi sequence inference algorithm for a sequence of length N with K possible states. (For POS tagging, it would be: there are N tokens and K parts of speech.) Give the following answers in terms of N and K.

(a) What's the time complexity of Viterbi?

(b) What's the space complexity of Viterbi?

(c) What's the time complexity of enumerating all possible answers?

Question 2.2. The greedy decoding algorithm is an alternative to Viterbi. It simply makes decisions left to right, without considering future decisions. Using the additive factor score notation A(y_prev, y_cur) and B_t(y_cur), it creates a predicted sequence y as follows:

    for t = 1..T:
        y_t ← argmax_{k ∈ tagset} [ A(y_{t-1}, k) + B_t(k) ]

(a) What's the time complexity of the greedy algorithm?

(b) Unlike Viterbi, the greedy algorithm does not always find the most probable solution according to the model. Why? Give an example where it might fail. Why does Viterbi get it right?
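For reference (this is not part of the question), here is a minimal Python sketch of the greedy decoder described above. It assumes the additive scores are given as a dictionary A over tag pairs and a list B of per-position score dictionaries; the function name greedy_decode, the tagset, and all numbers are invented for illustration.

    # Minimal sketch of greedy left-to-right decoding (illustrative only).
    # A[(prev_tag, tag)] is a transition score and B[t][tag] is the local
    # score at position t; the tagset and all numbers here are invented.

    def greedy_decode(A, B, tagset, start="START"):
        """Pick the best tag at each position given only the previous decision."""
        y = []
        prev = start
        for t in range(len(B)):
            best = max(tagset, key=lambda k: A[(prev, k)] + B[t][k])
            y.append(best)
            prev = best
        return y

    if __name__ == "__main__":
        tagset = ["N", "V"]
        A = {("START", "N"): 1.0, ("START", "V"): 0.0,
             ("N", "N"): 0.2, ("N", "V"): 1.0,
             ("V", "N"): 1.0, ("V", "V"): 0.1}
        B = [{"N": 2.0, "V": 0.5},
             {"N": 0.5, "V": 1.5},
             {"N": 1.0, "V": 1.0}]
        print(greedy_decode(A, B, tagset))  # prints ['N', 'V', 'N']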

3 Classification

Question 3.1. Consider training and predicting with a naive Bayes classifier for two document classes, without pseudocounts. The word "booyah" appears once for class 1, and never for class 0. When predicting on new data, if the classifier sees "booyah", what is the posterior probability of class 1?

Question 3.2. For a probabilistic classifier on a binary classification problem, consider the prediction rule: predict class 1 if P(y = 1 | x) > t, and predict class 0 otherwise. This assumes some threshold t is set. If the threshold t is increased,

(a) Does precision tend to increase, decrease, or stay the same?

(b) Does recall tend to increase, decrease, or stay the same?

4 Classifiers

Here's a naive Bayes model with the following conditional probability table (each row is that class's unigram language model):

    word type       a       b       c
    P(w | y = 1)    5/10    3/10    2/10
    P(w | y = 0)    2/10    2/10    6/10

and the following prior probabilities over classes:

    P(y = 1)    P(y = 0)
    8/10        2/10

Naive Bayes

Consider a binary classification problem, for whether a document is about the end of the world (class y = 1), or it is not about the end of the world (class y = 0).

Question 4.1. Consider a document consisting of 2 a's and 1 c. Note: in this practice and on the midterm, you do not need to convert to decimal or simplify fractions. You may find it easier not to simplify the fractions. On the midterm, we will not penalize simple arithmetic errors. Please show your work.

(a) What is the probability that it is about the end of the world?

(b) What is the probability that it is not about the end of the world?

Question 4.2. Now suppose that we know the document is about the end of the world (y = 1).

(a) True or False: the naive Bayes model is able to tell us the probability of seeing the document w = (a, a, b, c) under the model.

(b) If True, what is the probability?
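As a sanity check for hand computations in this section, here is a short Python sketch of how posteriors come out of a multinomial naive Bayes model with the tables above. The function name posterior and the demo document are invented; the demo deliberately scores a document that does not appear in the questions.

    from fractions import Fraction as F

    # Sketch of scoring under the multinomial naive Bayes model above.
    # The tables copy the values from this section; the demo document is
    # arbitrary and is not one of the documents asked about in the questions.

    likelihood = {
        1: {"a": F(5, 10), "b": F(3, 10), "c": F(2, 10)},
        0: {"a": F(2, 10), "b": F(2, 10), "c": F(6, 10)},
    }
    prior = {1: F(8, 10), 0: F(2, 10)}

    def posterior(words):
        """Return P(y | words) for both classes via Bayes rule."""
        joint = {}
        for y in (0, 1):
            p = prior[y]
            for w in words:
                p *= likelihood[y][w]  # naive Bayes: words independent given y
            joint[y] = p
        z = joint[0] + joint[1]        # normalizer P(words)
        return {y: joint[y] / z for y in joint}

    if __name__ == "__main__":
        print(posterior(["c", "c", "a"]))  # posterior over classes 0 and 1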

5 Language Models

We consider a language over the three symbols A, B, and C.

Question 5.1. Consider the training corpus (A, C, C, B, A, B, C).

(a) Under a bigram language model with zero pseudocounts, what is the probability of the observation (A, B, B)? Please include generation of the END event.

(b) Under a bigram language model with a pseudocount of α = 1, what is the probability of the observation (A, B, B)? Please include generation of the END event.
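If you want to check this kind of computation, here is a rough Python sketch of a bigram model with pseudocount smoothing and explicit START/END boundary events. Smoothing over the four possible next events {A, B, C, END} is an assumption (conventions differ), the function name bigram_prob is invented, and the demo scores a different observation than the one in the question.

    from collections import Counter
    from fractions import Fraction as F

    # Sketch: bigram probabilities estimated by relative frequency with
    # add-alpha (pseudocount) smoothing. START/END are explicit boundary
    # events; smoothing over the four possible next events {A, B, C, END}
    # is an assumption, since conventions differ.

    VOCAB = ["A", "B", "C", "END"]

    def bigram_prob(seq, corpus, alpha=0):
        """P(seq followed by END) under an add-alpha bigram model of corpus."""
        padded = ["START"] + list(corpus) + ["END"]
        bigrams = Counter(zip(padded, padded[1:]))
        context = Counter(padded[:-1])
        p = F(1)
        prev = "START"
        for w in list(seq) + ["END"]:
            p *= F(bigrams[(prev, w)] + alpha, context[prev] + alpha * len(VOCAB))
            prev = w
        return p

    if __name__ == "__main__":
        corpus = ("A", "C", "C", "B", "A", "B", "C")
        # Demo on a different observation than the question, to avoid spoilers.
        print(bigram_prob(("A", "C"), corpus, alpha=0))  # prints 1/6
        print(bigram_prob(("A", "C"), corpus, alpha=1))  # prints 4/105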

6 HMMs

Code-switching is when people switch between languages while communicating. For example, the phrase "pie a la carte" can be analyzed as code-switching, where the first token "pie" is English and the next three tokens are French.

We'll model code-switching with an HMM. The model is: at every token position t, the variable y_t denotes which language the speaker is using. y_t can be one of two states, either E or F. The word is then produced by a unigram language model for that language (this is the HMM's emission distribution). Assume we know the language model parameters (i.e., the probability of a given word, given that the state is English or French), and we only want to learn the transition parameters (i.e., the probability of switching between English, French, the START state, and the END state). We will use the example sentence w = (pie, a, la, carte).

The unigram model parameters are, where one row is P_emit(w | E) and the other row is P_emit(w | F):

                   pie     a      la     carte   ...
    English (E)    0.01    0.2    0      0       ...
    French (F)     0       0.1    0.1    0.01    ...

We are going to treat this emission distribution as fixed. We want to learn the transition distribution, which describes how likely a speaker is to stay in the same language or switch to the other language. Assume that the transition distribution is initialized to be uniform between the two states (plus a little probability for the end state):

    P_trans(E | E) = 0.4        P_trans(F | E) = 0.4        P_trans(END | E) = 0.2
    P_trans(E | F) = 0.4        P_trans(F | F) = 0.4        P_trans(END | F) = 0.2
    P_trans(E | START) = 0.5    P_trans(F | START) = 0.5

Question 6.1. Bayes Rule

Only one token has ambiguity about which language it came from, so there are only two possible (y_1, ..., y_4) sequences that have non-zero probability (remember that y_t denotes which language the speaker is using at time t). For each possible sequence y, write it and its posterior probability p(y | w). Note that many terms are shared between the unnormalized probabilities; these can be ignored when computing the posterior probabilities, since they are absorbed into the normalizer.

Question 6.2. Just to learn stuff

We do not cover the EM algorithm in 585, but you can find out a lot about it from many sources online. [1] How can you use EM to learn the parameters?

Note: EM will not be on the midterm. This question is just if you want to learn more about NLP.

[1] We like mathematicalmonk: https://www.youtube.com/watch?v=anbinavp3eq

Question 6.3. Just to learn stuff

Perform the first M-step. Given the posterior expectations from the last step, estimate new values of the transition parameters. You may choose either to include the generation of an END symbol, or not to include it.
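To verify an answer to Question 6.1 (or the expectations needed for 6.3), here is a brute-force Python sketch that scores every possible state sequence under the model above and normalizes; because of the zero emission entries, only two sequences survive. The numbers are copied from the tables in this section, and the function name joint is invented.

    from itertools import product

    # Brute-force check: score every state sequence for the 4-token example
    # and normalize. The emission and transition numbers are copied from the
    # tables above; the function name joint() is invented for this sketch.

    emit = {
        "E": {"pie": 0.01, "a": 0.2, "la": 0.0, "carte": 0.0},
        "F": {"pie": 0.0, "a": 0.1, "la": 0.1, "carte": 0.01},
    }
    trans = {
        ("START", "E"): 0.5, ("START", "F"): 0.5,
        ("E", "E"): 0.4, ("E", "F"): 0.4, ("E", "END"): 0.2,
        ("F", "E"): 0.4, ("F", "F"): 0.4, ("F", "END"): 0.2,
    }
    words = ["pie", "a", "la", "carte"]

    def joint(y):
        """p(y, w): START transition, then emission and next transition per token."""
        p = trans[("START", y[0])]
        for t, w in enumerate(words):
            p *= emit[y[t]][w]
            nxt = y[t + 1] if t + 1 < len(y) else "END"
            p *= trans[(y[t], nxt)]
        return p

    scores = {y: joint(y) for y in product("EF", repeat=len(words))}
    z = sum(scores.values())
    for y, s in scores.items():
        if s > 0:
            print("".join(y), s / z)  # the two non-zero sequences and their posteriors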

7 Language stuff

Question 7.1. Each of the following sentences has an incorrect part-of-speech tag. Identify which one and correct it. (If you think there are multiple incorrect tags, choose the one that is the most egregious.) We'll use a very simple tag system:

    NOUN   common noun or proper noun
    PRO    pronoun
    ADJ    adjective
    ADV    adverb
    VERB   verb, including auxiliary verbs
    PREP   preposition
    DET    determiner
    X      something else

1. Colorless/ADV green/ADJ clouds/PRO sleep/VERB furiously/ADV ./X

2. She/PRO saw/VERB herself/PRO through/PREP the/ADJ looking/ADJ glass/NOUN ./X

3. Wait/NOUN could/VERB you/PRO please/X ?/X

8 Perceptron

Question 8.1. In the homework, we saw an example of when the averaged perceptron outperforms the vanilla perceptron. There is another variant of the perceptron that often outperforms the vanilla perceptron, called the voting perceptron. Here's how the voting perceptron works:

- Initialize the weight vector.
- If the voting perceptron misclassifies an example at iteration i, update the weight vector and store it as w_i.
- If it makes a correct classification at iteration i, do not update the weight vector, but store w_i anyway.

To classify an example with the voting perceptron, we classify that example with each w_i and tally up the number of votes for each class. The class with the most votes is the prediction.

Despite often achieving high accuracy, the voting perceptron is rarely used in practice. Why not?
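For concreteness, here is a minimal Python sketch of the voted prediction rule for the binary case, assuming the per-iteration weight vectors have already been stored during training; the training loop is omitted, and the function name voted_predict and all numbers are invented.

    import numpy as np

    # Sketch: the voted prediction rule for binary classification, given the
    # weight vectors w_1..w_T stored at every training iteration. The training
    # loop is omitted and the weights/data below are invented stand-ins.

    def voted_predict(weight_history, x):
        """Each stored weight vector votes +1 or -1; the majority wins."""
        votes = sum(1 if w @ x >= 0 else -1 for w in weight_history)
        return 1 if votes >= 0 else -1

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        # Stand-ins for the weight vectors saved during perceptron training.
        weight_history = [rng.normal(size=3) for _ in range(5)]
        x = np.array([1.0, -2.0, 0.5])
        print(voted_predict(weight_history, x))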