Machine Learning for NLP Lecture 1: Introduction

Machine Learning for NLP, Lecture 1: Introduction. Richard Johansson, August 31, 2015

overview of today's lecture:
- some information about the course
- machine learning basics and overview
- overview of the assignments
- introduction to the scikit-learn library

overview: information about the course; machine learning basics; introduction to the scikit-learn library

teaching: 5 lectures + 1 guest lecture (hopefully); lab sessions where you work on the assignments or the project; seminars

your work: 3 assignments; present a research paper at a seminar; mini-project. For the VG grade: solve the optional parts of assignments 2 and 3, and select an ambitious project.

request: if you have a personal interest in some topic, please let me know!

assignment 1: feature design for function tagging. [example dependency tree from the slide: "She lives in a house of brick", with function labels such as nsubj and root]

assignment 1: feature design for function tagging. The purpose of this assignment is to practice the typical steps in building a machine learning-based system: designing features; analyzing the performance of the system (and its errors); trying out different learning methods.

assignment 2: classifier implementation. Read a paper about a simple algorithm for training the support vector machine classifier, then write code to implement the algorithm, similar to the algorithms in scikit-learn.

assignment 3: learning for dependency parsing [preliminary]. Implement the structured perceptron learning algorithm and use it to build a graph-based dependency parser. [example parse tree from the slide: "yesterday she gave the horse an apple"]

independent work: select a topic of interest (or ask me for ideas); define a small project; write code and carry out experiments; write a short paper and present it at a seminar at the end of the course.

overview: information about the course; machine learning basics; introduction to the scikit-learn library

basic ideas: given some object, make a prediction. Is this patient diabetic? Is the sentiment of this movie review positive? Does this image contain a cat? What is the grammatical function of this noun phrase? What will be tomorrow's share value of this stock? What are the part-of-speech tags of the words in this sentence? The goal of machine learning is to build the prediction functions by observing data.

some types of machine learning problems:
- classification: learning to output a category label; e.g. spam/non-spam, positive/negative, subject/object, ...
- structure prediction: learning to build some structure; e.g. POS tagging, dependency parsing, translation, ...
- (numerical regression: learning to guess a number; e.g. the value of a share, the number of stars in a review, ...)
- (reinforcement learning: learning to act in an environment; e.g. dialogue systems, playing games, ...)

machine learning in NLP research: ACL, EMNLP, Coling, etc. are heavily dominated by ML-focused papers.

why machine learning? Why would we want to build the function from data instead of just implementing it? Usually because we don't really know how to write down the function by hand: speech recognition; image classification; syntactic parsing; translation; ... It might not be necessary for limited tasks that we know how to handle: morphology? sentence splitting and tokenization? identification of limited classes of names, dates and times? What is more expensive in your case: knowledge or data?

don't forget your linguistic intuitions! Machine learning automates some tasks, but we still need our brains:
- defining the tasks and terminology
- annotating training and testing data
- having an intuition about which features may be useful can be crucial (in general, features are more important than the choice of learning algorithm)
- error analysis
- defining constraints to guide the learner (valency lexicons can be used in parsers; grammar-based parsers with ML-trained disambiguators)

learning from data

example: is the patient diabetic? In order to predict, we make some measurements of properties we believe will be useful; these are called the features.

attributes/values or bag of words. We often represent the features as attributes with values; in practice, as a Python dict:

features = { "gender": "male", "age": 37, "blood_pressure": 130, ... }

sometimes, it's easier just to see the features as a list of e.g. words (bag of words):

features = [ "here", "are", "some", "words", "in", "a", "document" ]

examples of ML in NLP: document classification. In a previous course, you implemented a classifier of documents. Many document classifiers use the words of the documents as their features (bag of words)... but we could also add other features, such as the presence of smileys or negations. Günther, Sentiment Analysis of Microblogs, MLT Master's Thesis, 2013.

examples of ML in NLP: difficulty level classification. What learner level (e.g. according to CEFR) do you need to understand the following Swedish sentences? "Flickan sover." ("The girl is sleeping.") A1. "Under förberedelsetiden har en baslinjestudie utförts för att kartlägga bland annat diabetesärftlighet, oral glukostolerans, mat och konditionsvanor och socialekonomiska faktorer i åldrarna 35-54 år i flera kommuner inom Nordvästra och Sydöstra sjukvårdsområdena (NVSO resp SÖSO)." ("During the preparation period, a baseline study has been carried out to map, among other things, hereditary diabetes, oral glucose tolerance, dietary and exercise habits, and socioeconomic factors among 35-54-year-olds in several municipalities within the Northwest and Southeast healthcare districts.") C2. Pilán, NLP-based Approaches to Sentence Readability for Second Language Learning Purposes, MLT Master's Thesis, 2013.

examples of ML in NLP: difficulty level classification

examples of ML in NLP: coreference resolution. Do two given noun phrases refer to the same real-world entity? Soon et al., A Machine Learning Approach to Coreference Resolution of Noun Phrases, Computational Linguistics, 2001.

examples of ML in NLP: named entity recognition. "United Nations [ORG] official Ekeus [PER] heads for Baghdad [LOC]."

examples of ML in NLP: named entity recognition. Zhang and Johnson, A Robust Risk Minimization based Named Entity Recognition System, CoNLL 2003.

what goes on when we learn? The learning algorithm observes the examples in the training set. It tries to find common patterns that explain the data: it generalizes, so that we can make predictions for new examples. How this is done depends on what algorithm we are using.

knowledge from experience? given some experience, when can we be certain that we can draw any conclusion?

a fundamental tradeoff. Goodness of fit: the learned classifier should be able to correctly classify the examples in the training data. Regularization: the classifier should be simple.
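
As an illustration (not part of the lecture), many scikit-learn classifiers expose this tradeoff as a regularization parameter; for example, in LogisticRegression a smaller C means stronger regularization (the parameter values below are arbitrary):

from sklearn.linear_model import LogisticRegression

# Smaller C: stronger regularization, the model is kept simpler,
# at the cost of fitting the training data less closely.
strongly_regularized = LogisticRegression(C=0.01)
# Larger C: weaker regularization, better fit to the training data,
# but a higher risk of overfitting.
weakly_regularized = LogisticRegression(C=100.0)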

learning algorithms that we have seen so far. We have already seen a number of learning algorithms in previous courses: Naive Bayes; the perceptron; hidden Markov models; decision trees; transformation rules (the Brill tagger).

representation of the prediction function. We may represent our prediction function in different ways: numerical models (weight or probability tables, networked models); rules (decision trees, transformation rules).

example: the prediction function as numbers

def sentiment_is_positive(features):
    score = 0.0
    score += 2.1 * features["wonderful"]
    score += 0.6 * features["good"]
    ...
    score -= 0.9 * features["bad"]
    score -= 3.1 * features["awful"]
    ...
    if score > 0:
        return True
    else:
        return False

example: the prediction function as rules

def patient_is_sick(features):
    if features["systolic_blood_pressure"] > 140:
        return True
    if features["gender"] == "m":
        if features["psa"] > 4.0:
            return True
    ...
    return False

perceptron revisited. The perceptron learning algorithm creates a weight table; each weight in the table corresponds to a feature. E.g. in sentiment analysis, "fine" probably has a high positive weight, "boring" a negative weight, and "and" a weight near zero. Classification is carried out by summing the weights for each feature:

def perceptron_classify(features, weights):
    score = 0
    for f in features:
        score += weights.get(f, 0)
    if score >= 0:
        return "pos"
    else:
        return "neg"

the perceptron learning algorithm. Start with an empty weight table and classify according to the current weight table. Each time we misclassify, change the weight table a bit: if a positive instance was misclassified, add 1 to the weight of each feature in the document, and conversely...

def perceptron_learn(examples, number_iterations):
    weights = {}
    for iteration in range(number_iterations):
        for label, features in examples:
            guess = perceptron_classify(features, weights)
            if label == "pos" and guess == "neg":
                for f in features:
                    weights[f] = weights.get(f, 0) + 1
            elif label == "neg" and guess == "pos":
                for f in features:
                    weights[f] = weights.get(f, 0) - 1
    return weights
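
To see the two functions above in action, here is a small usage sketch on a made-up toy sentiment dataset (the example documents are invented for illustration):

# Each example is a (label, list_of_words) pair.
toy_examples = [
    ("pos", ["a", "wonderful", "movie"]),
    ("pos", ["good", "and", "fine"]),
    ("neg", ["a", "boring", "movie"]),
    ("neg", ["bad", "and", "awful"]),
]
weights = perceptron_learn(toy_examples, number_iterations=10)
print(perceptron_classify(["wonderful", "and", "fine"], weights))  # with this toy data: "pos"
print(perceptron_classify(["boring", "and", "bad"], weights))      # with this toy data: "neg"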

estimation in Naive Bayes, revisited. Naive Bayes:

P(document, label) = P(f1, ..., fn, label) = P(label) * P(f1, ..., fn | label) = P(label) * P(f1 | label) * ... * P(fn | label)

How do we estimate the probabilities? Maximum likelihood: set the probabilities so that the probability of the data is maximized.

estimation in Naive Bayes: supervised case. How do we estimate P(positive)?

P_MLE(positive) = count(positive) / count(all) = 2/4

How do we estimate P(nice | positive)?

P_MLE(nice | positive) = count(nice, positive) / count(any word, positive) = 2/7
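
A minimal sketch of what this maximum-likelihood estimation looks like in code, assuming labeled documents given as (label, list_of_words) pairs; the counts 2/4 and 2/7 above come from the slide's own small example corpus, which is not reproduced here:

from collections import Counter

def train_naive_bayes_mle(examples):
    # examples: list of (label, list_of_words) pairs
    label_counts = Counter()
    word_counts = {}          # label -> Counter of word occurrences
    total_words = Counter()   # label -> total number of word tokens
    for label, words in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        total_words[label] += len(words)
    # P_MLE(label) = count(label) / count(all documents)
    p_label = {l: label_counts[l] / sum(label_counts.values()) for l in label_counts}
    # P_MLE(word | label) = count(word, label) / count(any word, label)
    p_word_given_label = {
        l: {w: c / total_words[l] for w, c in word_counts[l].items()}
        for l in word_counts
    }
    return p_label, p_word_given_label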

machine learning software.
General-purpose software, large collections of algorithms:
- scikit-learn: http://scikit-learn.org (a Python library; will be used in this course)
- Weka: http://www.cs.waikato.ac.nz/ml/weka (a Java library with a nice user interface)
- NLTK includes some learning algorithms, but seems to be discontinuing them in favor of scikit-learn
Special-purpose software, small collections of algorithms:
- LibSVM/LibLinear for support vector machines
- CRF++, CRFSGD for conditional random fields
- Theano, Caffe, Keras for neural networks
- ...

evaluation methodology. How do we evaluate our systems? Intrinsic evaluation: test the performance in isolation. Extrinsic evaluation: I changed my POS tagger; how does this change the performance of my parser? how much more money do I make? Common measures in intrinsic evaluation: classification accuracy; precision and recall (for needle-in-a-haystack problems); also several other task-dependent measures.
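
A small sketch of how these intrinsic measures can be computed with scikit-learn's metrics module (the gold and predicted label lists here are invented for illustration):

from sklearn.metrics import accuracy_score, precision_score, recall_score

Y_true = ["pos", "neg", "pos", "pos", "neg"]   # gold labels
Y_pred = ["pos", "pos", "pos", "neg", "neg"]   # classifier output

print(accuracy_score(Y_true, Y_pred))                     # 0.6: 3 of 5 labels correct
print(precision_score(Y_true, Y_pred, pos_label="pos"))   # 2 of 3 predicted "pos" are correct
print(recall_score(Y_true, Y_pred, pos_label="pos"))      # 2 of 3 true "pos" are found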

overview: information about the course; machine learning basics; introduction to the scikit-learn library

detailed example:

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import Perceptron
from sklearn.pipeline import Pipeline

def train_scikit_classifier(X, Y):
    # A DictVectorizer maps a feature dict to a sparse vector,
    # e.g. vec.transform({'label':'np'}) might give
    # [0, 0, ..., 0, 1, 0, ...]
    vec = DictVectorizer()
    # Convert all the feature dicts to vectors.
    # As usual, it's more efficient to handle all at once.
    Xe = vec.fit_transform(X)
    # Initialize the learning algorithm we will use.
    classifier = Perceptron(n_iter=20)
    # Finally, we can train the classifier.
    classifier.fit(Xe, Y)
    # Return a pipeline consisting of the vectorizer followed
    # by the classifier.
    return Pipeline([('vec', vec), ('classifier', classifier)])

what are X and Y? Our feature extractor has collected features in the form of attributes with values, e.g. { 'label':'np', ... }. They are stored in the list X, and the corresponding true outputs are stored in the list Y.

the first step: mapping features to numerical vectors. scikit-learn's learning methods work with features as numbers, not strings, so they can't directly use the feature dicts we have stored in X. Converting from strings to numbers is the purpose of these lines:

vec = DictVectorizer()
Xe = vec.fit_transform(X)

types of vectorizers:
- a DictVectorizer converts from attribute-value dicts
- a CountVectorizer converts from texts (after applying a tokenizer) or lists of tokens
- a TfidfVectorizer is like a CountVectorizer, but also uses TF*IDF weighting

what goes on in a DictVectorizer? Each feature corresponds to one or more columns in the output matrix. Easy case: boolean and numerical features. For string features, we reserve one column for each possible value; that is, we convert to booleans.

code example (DictVectorizer). Here's an example:

from sklearn.feature_extraction import DictVectorizer

X = [{'f1':'np', 'f2':'in', 'f3':False, 'f4':7},
     {'f1':'np', 'f2':'on', 'f3':True, 'f4':2},
     {'f1':'vp', 'f2':'in', 'f3':False, 'f4':9}]
vec = DictVectorizer()
Xe = vec.fit_transform(X)
print(Xe.toarray())
print(vec.vocabulary_)

the result:

[[ 1.  0.  1.  0.  0.  7.]
 [ 1.  0.  0.  1.  1.  2.]
 [ 0.  1.  1.  0.  0.  9.]]
{'f4': 5, 'f2=in': 2, 'f1=np': 0, 'f1=vp': 1, 'f2=on': 3, 'f3': 4}

CountVectorizers for document representation. A CountVectorizer converts from documents; a document is a string or a list of tokens. Just like string features in a DictVectorizer, each word type will correspond to one column.

code example (CountVectorizer). Here's an example:

from sklearn.feature_extraction.text import CountVectorizer

X = ['example text', 'another text']
vec = CountVectorizer()
Xe = vec.fit_transform(X)
print(Xe.toarray())
print(vec.vocabulary_)

the result:

[[0 1 1]
 [1 0 1]]
{'text': 2, 'example': 1, 'another': 0}
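
The TfidfVectorizer mentioned earlier is used in exactly the same way; a minimal sketch (same toy strings as above, but the output values are TF*IDF-weighted rather than raw counts):

from sklearn.feature_extraction.text import TfidfVectorizer

X = ['example text', 'another text']
vec = TfidfVectorizer()
Xe = vec.fit_transform(X)
print(Xe.toarray())      # same shape as with CountVectorizer, but TF*IDF-weighted values
print(vec.vocabulary_)   # same vocabulary mapping as before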

a comment about the vectorizer methods: fit looks at the data and creates the mapping; transform converts the data to numbers; fit_transform = fit + transform.
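
In practice, this means we call fit_transform on the training data and then only transform on any new data, so that the same feature mapping is reused; a small sketch (the feature dicts are invented):

from sklearn.feature_extraction import DictVectorizer

X_train = [{'f1': 'np'}, {'f1': 'vp'}]
X_test = [{'f1': 'np'}]

vec = DictVectorizer()
Xe_train = vec.fit_transform(X_train)   # fit: build the mapping, then transform
Xe_test = vec.transform(X_test)         # transform only: reuse the existing mapping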

training a classifier. After mapping the features to numbers with our vectorizers, we can train a perceptron classifier:

classifier = Perceptron(n_iter=20)
classifier.fit(Xe, Y)

other classifiers (e.g. Naive Bayes) can be trained in a similar way:

classifier = MultinomialNB()
classifier.fit(Xe, Y)

applying the classifier to new examples:

X_new = ...  # extract the features for new examples
Xe_new = vectorizer.transform(X_new)
guesses = classifier.predict(Xe_new)

combining a vectorizer and a classifier into a pipeline:

vectorizer = ...
Xe = vectorizer.fit_transform(X)
classifier = ...
classifier.fit(Xe, Y)
pipeline = Pipeline([('vec', vectorizer), ('cls', classifier)])

X_new = ...  # extract the features for new examples
guesses = pipeline.predict(X_new)

simplified training of a pipeline. We can call fit to train the whole pipeline in one step:

pipeline = Pipeline([('vec', DictVectorizer()),
                     ('cls', Perceptron())])
pipeline.fit(X, Y)
...
guesses = pipeline.predict(X_new)

a note on efficiency. Python is a nice language for programmers, but not always the most efficient. In scikit-learn, many functions are implemented in faster languages (e.g. C) and use specialized math libraries, so in many cases it is much faster to call the library once than many times:

import time

t0 = time.time()
guesses1 = classifier.predict(X_eval)
t1 = time.time()
guesses2 = [classifier.predict(x) for x in X_eval]
t2 = time.time()
print(t1 - t0)
print(t2 - t1)

result: 0.29 sec and 45 sec

some other practical functions. Splitting the data:

from sklearn.cross_validation import train_test_split

train_files, dev_files = train_test_split(td_files,
                                          train_size=0.8,
                                          random_state=0)

evaluation, e.g. accuracy, precision, recall, F-score:

from sklearn.metrics import f1_score

print(f1_score(Y_eval, Y_out))

note that we're using our own evaluation in this assignment, since we need more details.

extended example 1: named entity classification. We are given a name (a single word) in a sentence; determine whether it is a person, a location, or an organization. "My aunt Gözde lives in Ashgabat." The information our classifier can use: the words in the sentence; the part-of-speech tags; the position of the name that we are classifying.
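
As a sketch of what feature extraction for this task could look like (the helper function and feature names below are hypothetical, not the ones prescribed in the assignment):

def extract_name_features(words, tags, position):
    # words: list of tokens in the sentence
    # tags: list of part-of-speech tags, aligned with words
    # position: index of the name we are classifying
    features = {
        'word': words[position],
        'tag': tags[position],
        'prev_word': words[position - 1] if position > 0 else '<s>',
        'next_word': words[position + 1] if position < len(words) - 1 else '</s>',
        'position': position,
    }
    return features

words = ['My', 'aunt', 'Gözde', 'lives', 'in', 'Ashgabat', '.']
tags = ['PRP$', 'NN', 'NNP', 'VBZ', 'IN', 'NNP', '.']
print(extract_name_features(words, tags, 2))   # features for "Gözde"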

extended example 2: document classification. We are given a document; determine its category (selected from a small set of predefined categories). We reuse the review dataset that we had in the previous course; this dataset has polarity and topic labels for each document.
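
Putting the pieces from this lecture together, a minimal sketch of such a document classifier (the example documents and labels below are invented placeholders; reading the actual review dataset is left out):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Perceptron
from sklearn.pipeline import Pipeline

# X: list of document strings, Y: list of category labels.
X = ['a wonderful and touching film', 'the camera broke after two days']
Y = ['film', 'camera']

pipeline = Pipeline([('vec', CountVectorizer()),
                     ('cls', Perceptron())])
pipeline.fit(X, Y)
print(pipeline.predict(['a boring film']))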