Machine Learning for NLP Lecture 1: Introduction

Machine Learning for NLP Lecture 1: Introduction
Richard Johansson
August 29, 2016

overview of the lecture
some information about the course
machine learning basics and overview
overview of the assignments
introduction to the scikit-learn library

overview information about the course machine learning basics introduction to the scikit-learn library

teaching: 7 lectures; lab sessions where you work on the assignments or the project; seminars

your work: 3 mandatory assignments plus one optional; present a research paper at a seminar; mini-project; for the VG grade: a written exam

a request: if you have a personal interest in some topic, please let me know!

assignment 1: feature design for function tagging
example dependency tree: She lives in a house of brick, with grammatical function labels such as nsubj (for "She") and root (for "lives")

assignment 1: feature design for function tagging
the purpose of this assignment is to practice the typical steps in building a machine learning-based system:
designing features
analyzing the performance of the system (and its errors)
trying out different learning methods

assignment 2: classifier implementation
read a paper about a simple algorithm for training the support vector machine classifier
write code to implement the algorithm, similar to the algorithms in scikit-learn

assignment 3: learning for sequence tagging
implement a sequential tagging model
use it to build a named-entity recognizer
example: United Nations [ORG] official Ekeus [PER] heads for Baghdad [LOC].

assignment 4: using TensorFlow (optional)
explore Google's new TensorFlow library for neural network training
try out some classification and structure prediction tasks (e.g. translation)

independent work
select a topic of interest (or ask me for ideas)
define a small project
write code, carry out experiments
write a short paper, present it at a seminar at the end of the course

written exam (required for VG)
the questions will test your understanding of central ideas in machine learning and let you sketch and discuss (not code) ML-based solutions to some real-world problems in NLP
ML is a mathy subject, but the questions will not require much math; you might, however, need to understand the idea behind a few formulas

overview information about the course machine learning basics introduction to the scikit-learn library

basic ideas given some object, make a prediction is this patient diabetic? is the sentiment of this movie review positive? does this image contain a cat? what is the grammatical function of this noun phrase? what will be tomorrow's share value of this stock? what are the part-of-speech tags of the words in this sentence?

basic ideas given some object, make a prediction is this patient diabetic? is the sentiment of this movie review positive? does this image contain a cat? what is the grammatical function of this noun phrase? what will be tomorrow's share value of this stock? what are the part-of-speech tags of the words in this sentence? the goal of machine learning is to build the prediction functions by observing data

some types of machine learning problems
classification: learning to output a category label (spam/non-spam; positive/negative; subject/object; ...)
structure prediction: learning to build some structure (POS tagging; dependency parsing; translation; ...)
(numerical regression: learning to guess a number) value of a share; number of stars in a review; ...
(reinforcement learning: learning to act in an environment) dialogue systems; playing games; ...

machine learning in NLP research ACL, EMNLP, Coling, etc are heavily dominated by ML-focused papers

why machine learning?
why would we want to build the function from data instead of just implementing it?
usually because we don't really know how to write down the function by hand: speech recognition, image classification, syntactic parsing, translation, ...
it might not be necessary for limited tasks where we know what to do: morphology? sentence splitting and tokenization? identification of limited classes of names, dates and times?
what is more expensive in your case: knowledge or data?

don't forget your linguistic intuitions!
machine learning automates some tasks, but we still need our brains:
defining the tasks and terminology
annotating training and testing data
having an intuition about which features may be useful can be crucial; in general, features are more important than the choice of learning algorithm
error analysis
defining constraints to guide the learner: valency lexicons can be used in parsers; grammar-based parsers with ML-trained disambiguators

learning from data

example: is the patient diabetic?
in order to predict, we make some measurements of properties we believe will be useful; these are called the features

attributes/values or bag of words
we often represent the features as attributes with values; in practice, as a Python dict:
features = {"gender": "male", "age": 37, "blood_pressure": 130, ...}
sometimes, it's easier just to see the features as a list of e.g. words (bag of words):
features = ["here", "are", "some", "words", "in", "a", "document"]

examples of ML in NLP: document classification
in a previous course, you have implemented a classifier of documents
many document classifiers use the words of the documents as their features (bag of words)

examples of ML in NLP: document classification
in a previous course, you have implemented a classifier of documents
many document classifiers use the words of the documents as their features (bag of words)
... but we could also add other features, such as the presence of smileys or negations
Günther, Sentiment Analysis of Microblogs, MLT Master's Thesis, 2013

examples of ML in NLP: difficulty level classification
what learner level (e.g. according to CEFR) do you need to understand the following Swedish sentences?
Flickan sover. (A1)
Under förberedelsetiden har en baslinjestudie utförts för att kartlägga bland annat diabetesärftlighet, oral glukostolerans, mat och konditionsvanor och socialekonomiska faktorer i åldrarna 35-54 år i flera kommuner inom Nordvästra och Sydöstra sjukvårdsområdena (NVSO resp SÖSO). (C2)
Pilán, NLP-based Approaches to Sentence Readability for Second Language Learning Purposes, MLT Master's Thesis, 2013

examples of ML in NLP: difficulty level classification

examples of ML in NLP: coreference resolution
do two given noun phrases refer to the same real-world entity?
Soon et al., A Machine Learning Approach to Coreference Resolution of Noun Phrases, Comp. Ling. 2001

examples of ML in NLP: named entity recognition
example: United Nations [ORG] official Ekeus [PER] heads for Baghdad [LOC].

examples of ML in NLP: named entity recognition
Zhang and Johnson, A Robust Risk Minimization based Named Entity Recognition System, CoNLL 2003

what goes on when we learn?
the learning algorithm observes the examples in the training set
it tries to find common patterns that explain the data: it generalizes, so that we can make predictions for new examples
how this is done depends on what algorithm we are using

knowledge from experience? given some experience, when can we be certain that we can draw any conclusion?

a fundamental tradeoff in machine learning
goodness of fit: the learned classifier should be able to correctly classify the examples in the training data
regularization: the classifier should be simple
this tradeoff is called the bias-variance tradeoff (see e.g. Wikipedia)
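in practice, this tradeoff is usually exposed as a tunable parameter of the learning algorithm. As a minimal sketch (not from the slides), scikit-learn's LinearSVC, used later in this lecture, has a regularization parameter C: small values mean stronger regularization (a simpler classifier), large values mean a closer fit to the training data. The toy data below is invented for illustration.

from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# invented toy training set
X = [{'word': 'wonderful'}, {'word': 'awful'}, {'word': 'good'}, {'word': 'bad'}]
Y = ['pos', 'neg', 'pos', 'neg']

# C controls the fit/regularization balance: small C = more regularization
for C in [0.01, 1.0, 100.0]:
    classifier = Pipeline([('v', DictVectorizer()), ('c', LinearSVC(C=C))])
    classifier.fit(X, Y)
    print(C, classifier.score(X, Y))  # accuracy on the training data itself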

example: guessing the gender, based on height and weight
(scatter plot of weight against height for the two classes)

learning algorithms that we have seen so far
we have already seen a number of learning algorithms in previous courses:
Naive Bayes
perceptron
hidden Markov models
decision trees
transformation rules (Brill tagger)

representation of the prediction function
we may represent our prediction function in different ways:
numerical models: weight or probability tables; networked models
rules: decision trees; transformation rules

example: the prediction function as numbers
def sentiment_is_positive(features):
    score = 0.0
    score += 2.1 * features["wonderful"]
    score += 0.6 * features["good"]
    ...
    score -= 0.9 * features["bad"]
    score -= 3.1 * features["awful"]
    ...
    if score > 0:
        return True
    else:
        return False

example: the prediction function as rules
def patient_is_sick(features):
    if features["systolic_blood_pressure"] > 140:
        return True
    if features["gender"] == "m":
        if features["psa"] > 4.0:
            return True
    ...
    return False

perceptron revisited
the perceptron learning algorithm creates a weight table
each weight in the table corresponds to a feature: e.g. "fine" probably has a high positive weight in sentiment analysis, "boring" a negative weight, "and" a weight near zero
classification is carried out by summing the weights for each feature:
def perceptron_classify(features, weights):
    score = 0
    for f in features:
        score += weights.get(f, 0)
    if score >= 0:
        return "pos"
    else:
        return "neg"

the perceptron learning algorithm
start with an empty weight table
classify according to the current weight table
each time we misclassify, change the weight table a bit: if a positive instance was misclassified, add 1 to the weight of each feature in the document, and conversely...
def perceptron_learn(examples, number_iterations):
    weights = {}
    for iteration in range(number_iterations):
        for label, features in examples:
            guess = perceptron_classify(features, weights)
            if label == "pos" and guess == "neg":
                for f in features:
                    weights[f] = weights.get(f, 0) + 1
            elif label == "neg" and guess == "pos":
                for f in features:
                    weights[f] = weights.get(f, 0) - 1
    return weights
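to see how the two functions above fit together, here is a small usage sketch on a made-up bag-of-words training set (the documents and labels are invented for illustration):

# invented toy training set: (label, bag of words) pairs
examples = [
    ("pos", ["a", "fine", "movie"]),
    ("pos", ["wonderful", "and", "fine"]),
    ("neg", ["boring", "and", "awful"]),
    ("neg", ["a", "boring", "movie"]),
]

weights = perceptron_learn(examples, number_iterations=10)
print(perceptron_classify(["fine", "movie"], weights))    # expected: "pos"
print(perceptron_classify(["boring", "movie"], weights))  # expected: "neg"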

estimation in Naive Bayes, revisited
Naive Bayes:
P(document, label) = P(f_1, ..., f_n, label) = P(label) * P(f_1, ..., f_n | label) = P(label) * P(f_1 | label) * ... * P(f_n | label)
how do we estimate the probabilities?
maximum likelihood: set the probabilities so that the probability of the data is maximized

estimation in Naive Bayes: supervised case
how do we estimate P(positive)?
P_MLE(positive) = count(positive) / count(all) = 2/4
how do we estimate P(nice | positive)?
P_MLE(nice | positive) = count(nice, positive) / count(any word, positive) = 2/7
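a minimal sketch (not from the slides) of how these maximum-likelihood estimates can be computed by counting over a labeled training set; the toy documents below are invented and do not reproduce the counts 2/4 and 2/7 on the slide:

from collections import Counter

# invented toy training set: (label, words) pairs
documents = [
    ("positive", ["nice", "film"]),
    ("positive", ["nice", "story"]),
    ("negative", ["boring", "film"]),
    ("negative", ["awful"]),
]

label_counts = Counter(label for label, _ in documents)
word_counts = Counter((label, w) for label, words in documents for w in words)
words_per_label = Counter()
for label, words in documents:
    words_per_label[label] += len(words)

# maximum-likelihood estimates
p_positive = label_counts["positive"] / len(documents)
p_nice_given_positive = word_counts[("positive", "nice")] / words_per_label["positive"]
print(p_positive, p_nice_given_positive)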

overview information about the course machine learning basics introduction to the scikit-learn library

machine learning software
general-purpose software, large collections of algorithms:
scikit-learn: http://scikit-learn.org (Python library, will be used in this course)
Weka: http://www.cs.waikato.ac.nz/ml/weka (Java library with a nice user interface)
NLTK includes some learning algorithms but seems to be discontinuing them in favor of scikit-learn
special-purpose software, small collections of algorithms:
LibSVM/LibLinear for support vector machines
CRF++, CRFSGD for conditional random fields
TensorFlow, Theano, Caffe, Keras for neural networks
...

scikit-learn toy example: a simple training set
# training set: the features
X = [{'city':'gothenburg', 'month':'july'},
     {'city':'gothenburg', 'month':'december'},
     {'city':'paris', 'month':'july'},
     {'city':'paris', 'month':'december'}]
# training set: the gold-standard outputs
Y = ['rain', 'rain', 'sun', 'rain']

scikit-learn toy example: training a classifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
import pickle

classifier = Pipeline([('v', DictVectorizer()), ('c', LinearSVC())])

# train the classifier
classifier.fit(X, Y)

# optionally: save the classifier to a file...
with open('weather.classifier', 'wb') as f:
    pickle.dump(classifier, f)
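a natural companion to the snippet above (not shown on the slide) is loading the saved classifier back in a later session; pickle.load mirrors pickle.dump:

import pickle

# load the classifier that was saved with pickle.dump above
with open('weather.classifier', 'rb') as f:
    classifier = pickle.load(f)

print(classifier.predict([{'city': 'paris', 'month': 'july'}]))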

explanation of the code: DictVectorizer
internally, the features used by scikit-learn's classifiers are numbers, not strings
a Vectorizer converts the strings into numbers (more about this in the next lecture!)
rule of thumb:
use a DictVectorizer for attribute-value features
use a CountVectorizer or TfidfVectorizer for bag-of-words features
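for completeness, a minimal sketch of the bag-of-words case (not from the slides): CountVectorizer takes raw strings and turns each one into a vector of word counts; the documents and labels are invented for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# invented toy documents with sentiment labels
docs = ["a wonderful and fine movie", "boring and awful", "a fine story", "awful acting"]
labels = ['pos', 'neg', 'pos', 'neg']

# CountVectorizer converts each string into a bag-of-words count vector
classifier = Pipeline([('v', CountVectorizer()), ('c', LinearSVC())])
classifier.fit(docs, labels)
print(classifier.predict(["a wonderful story"]))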

explanation of the code: LinearSVC
LinearSVC is the actual classifier we're using; this is called a linear support vector machine (more about this in lecture 3)
to use Naive Bayes instead:
from sklearn.naive_bayes import MultinomialNB
...
classifier = Pipeline([('v', DictVectorizer()), ('c', MultinomialNB())])
or a perceptron:
from sklearn.linear_model import Perceptron
...
classifier = Pipeline([('v', DictVectorizer()), ('c', Perceptron())])

explanation of the code: Pipeline and fit
in scikit-learn, preprocessing steps and classifiers are often combined into a Pipeline; in our case, a DictVectorizer and a LinearSVC
the whole Pipeline is trained by calling the method fit, which will in turn call fit on all the parts of the Pipeline
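roughly speaking (a sketch, not how scikit-learn is implemented internally), fitting the two-step pipeline above is equivalent to vectorizing first and then training the classifier on the result; X and Y are the toy training set from the earlier slide:

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

vectorizer = DictVectorizer()
svc = LinearSVC()

# roughly what Pipeline.fit does: fit and apply the first step, then fit the classifier
X_numeric = vectorizer.fit_transform(X)
svc.fit(X_numeric, Y)

# and at prediction time: transform the new instance, then predict
print(svc.predict(vectorizer.transform([{'city': 'paris', 'month': 'july'}])))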

toy example: making new predictions and evaluating
from sklearn.metrics import accuracy_score

Xtest = [{'city':'gothenburg', 'month':'june'},
         {'city':'gothenburg', 'month':'november'},
         {'city':'paris', 'month':'june'},
         {'city':'paris', 'month':'november'}]
Ytest = ['rain', 'rain', 'sun', 'rain']

# classify all the test instances
guesses = classifier.predict(Xtest)

# compute the classification accuracy
print(accuracy_score(Ytest, guesses))

a note on efficiency
Python is a nice language for programmers, but not always the most efficient
in scikit-learn, many functions are implemented in faster languages (e.g. C) and use specialized math libraries
so in many cases, it is much faster to call the library once than many times:
import time

t0 = time.time()
guesses1 = classifier.predict(Xtest)
t1 = time.time()

guesses2 = []
for x in Xtest:
    guess = classifier.predict(x)
    guesses2.append(guess)
t2 = time.time()

print(t1 - t0)
print(t2 - t1)
result: 0.29 sec and 45 sec

some other practical functions
splitting the data:
from sklearn.cross_validation import train_test_split
train_files, dev_files = train_test_split(td_files, train_size=0.8, random_state=0)
evaluation, e.g. accuracy, precision, recall, F-score:
from sklearn.metrics import f1_score
print(f1_score(Y_eval, Y_out))
note that we're using our own evaluation in the first assignment, since we need more details

aside: evaluation methodology
how do we evaluate our systems?
intrinsic evaluation: test the performance in isolation
extrinsic evaluation: I changed my POS tagger; how does this change the performance of my parser? how much more money do I make?
common measures in intrinsic evaluation:
classification accuracy
precision and recall (for needle-in-a-haystack problems)
also several other task-dependent measures
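a small worked example of the needle-in-a-haystack measures (the numbers are invented): if a named-entity recognizer proposes 8 entity mentions, of which 6 are correct, and the gold standard contains 10 mentions, then precision = 6/8 = 0.75, recall = 6/10 = 0.60, and the F-score is their harmonic mean, 2 * 0.75 * 0.60 / (0.75 + 0.60) ≈ 0.67.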

extended example 1: named entity classification
we are given a name (a single word) in a sentence; determine whether it is a person, a location, or an organization
example: My aunt Gözde lives in Ashgabat.
the information our classifier can use:
the words in the sentence
the part-of-speech tags
the position of the name that we are classifying
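as a hedged sketch (not from the slides) of how such information could be packed into the attribute-value representation used earlier, the hypothetical helper below assumes the sentence is given as a list of (word, POS tag) pairs and the name is identified by its position:

def name_features(tagged_sentence, position):
    # tagged_sentence: list of (word, pos) pairs; position: index of the name to classify
    word, pos = tagged_sentence[position]
    return {
        "word": word,
        "pos": pos,
        "position": position,
        "first_in_sentence": position == 0,
        "previous_word": tagged_sentence[position - 1][0] if position > 0 else "<s>",
        "next_word": tagged_sentence[position + 1][0] if position + 1 < len(tagged_sentence) else "</s>",
    }

# example: the name "Ashgabat" at position 5 in "My aunt Gözde lives in Ashgabat ."
sentence = [("My", "PRP$"), ("aunt", "NN"), ("Gözde", "NNP"), ("lives", "VBZ"),
            ("in", "IN"), ("Ashgabat", "NNP"), (".", ".")]
print(name_features(sentence, 5))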

extended example 2: document classification
we are given a document; determine the category of the document (select from a small set of predefined categories)
we reuse the review dataset that we had in the previous course
this dataset has polarity and topic labels for each document