Machine Learning for NLP Lecture 1: Introduction


1 Machine Learning for NLP Lecture 1: Introduction UNIVERSITY OF Richard Johansson August 29, 2016

2 overview of the lecture: some information about the course; machine learning basics and overview; overview of the assignments; introduction to the scikit-learn library

3 overview: information about the course; machine learning basics; introduction to the scikit-learn library

4 teaching: 7 lectures; lab sessions where you work on an assignment or project; seminars

5 your work: 3 mandatory assignments plus one optional; present a research paper at a seminar; mini-project; for VG grade: written exam

6 request: if you have a personal interest in some topic, please let me know!

8 assignment 1: feature design for function tagging. Example: "She lives in a house of brick", where "lives" is the root and "She" has the function nsubj.

9 assignment 1: feature design for function tagging. The purpose of this assignment is to practice the typical steps in building a machine learning-based system: designing features; analyzing the performance of the system (and its errors); trying out different learning methods.

10 assignment 2: classifier implementation. Read a paper about a simple algorithm for training the support vector machine classifier; write code to implement the algorithm; the result is similar to the algorithms in scikit-learn.

11 assignment 3: learning for sequence tagging. Implement a sequential tagging model; use it to build a named-entity recognizer. Example: [ORG United Nations] official [PER Ekeus] heads for [LOC Baghdad].

12 assignment 4: using TensorFlow (optional). Explore Google's new TensorFlow library for neural network training; try out some classification and structure prediction tasks (e.g. translation).

13 independent work: select a topic of interest (or ask me for ideas); define a small project; write code and carry out experiments; write a short paper and present it at a seminar at the end of the course.

14 written exam (required for VG): the questions will test your understanding of central ideas in machine learning and let you sketch and discuss (not code) ML-based solutions to some real-world problems in NLP. ML is a mathy subject, but the questions will not require much math, although you might need to understand the idea behind a few formulas.

15 overview: information about the course; machine learning basics; introduction to the scikit-learn library

17 basic ideas: given some object, make a prediction. Is this patient diabetic? Is the sentiment of this movie review positive? Does this image contain a cat? What is the grammatical function of this noun phrase? What will be tomorrow's share value of this stock? What are the part-of-speech tags of the words in this sentence? The goal of machine learning is to build the prediction functions by observing data.

18 some types of machine learning problems. Classification: learning to output a category label (spam/non-spam, positive/negative, subject/object, ...). Structure prediction: learning to build some structure (POS tagging, dependency parsing, translation, ...). (Numerical regression: learning to guess a number, e.g. the value of a share, the number of stars in a review, ...) (Reinforcement learning: learning to act in an environment, e.g. dialogue systems, playing games, ...)

19 machine learning in NLP research: ACL, EMNLP, Coling, etc. are heavily dominated by ML-focused papers

20 why machine learning? Why would we want to build the function from data instead of just implementing it? Usually because we don't really know how to write down the function by hand: speech recognition, image classification, syntactic parsing, translation, ... It might not be necessary for limited tasks where we do know how: morphology? sentence splitting and tokenization? identification of limited classes of names, dates and times? What is more expensive in your case: knowledge or data?

21 don't forget your linguistic intuitions! Machine learning automates some tasks, but we still need our brains for: defining the tasks and terminology; annotating training and testing data; having an intuition about which features may be useful, which can be crucial (in general, features are more important than the choice of learning algorithm); error analysis; defining constraints to guide the learner (valency lexicons can be used in parsers; grammar-based parsers with ML-trained disambiguators).

22 learning from data

23 example: is the patient diabetic? In order to predict, we make some measurements of properties we believe will be useful; these are called the features.

25 attributes/values or bag of words. We often represent the features as attributes with values; in practice, as a Python dict:
    features = {"gender": "male", "age": 37, "blood_pressure": 130, ...}
Sometimes, it's easier just to see the features as a list of e.g. words (bag of words):
    features = ["here", "are", "some", "words", "in", "a", "document"]
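The two representations above are closely related: a bag of words can be turned into attribute-value features by counting the words. The following is a minimal sketch of that idea; the helper name `bag_of_words` and the example text are assumptions made for illustration, not part of the course material.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it on whitespace, and count each word."""
    return dict(Counter(text.lower().split()))

# hypothetical example document
features = bag_of_words("Here are some words in a document and some more words")
print(features)
```

Each distinct word becomes an attribute whose value is its count, so the result has the same shape as the dict representation.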


27 examples of ML in NLP: document classification. In a previous course, you implemented a classifier of documents. Many document classifiers use the words of the documents as their features (bag of words), but we could also add other features such as the presence of smileys or negations. (Günther, Sentiment Analysis of Microblogs, MLT Master's Thesis, 2013)

28 examples of ML in NLP: difficulty level classification. What learner level (e.g. according to CEFR) do you need to understand the following Swedish sentences? "Flickan sover." (A1) "Under förberedelsetiden har en baslinjestudie utförts för att kartlägga bland annat diabetesärftlighet, oral glukostolerans, mat och konditionsvanor och socialekonomiska faktorer i åldrarna år i flera kommuner inom Nordvästra och Sydöstra sjukvårdsområdena (NVSO resp SÖSO)." (C2) (Pilán, NLP-based Approaches to Sentence Readability for Second Language Learning Purposes, MLT Master's Thesis, 2013)

29 examples of ML in NLP: difficulty level classification

30 examples of ML in NLP: coreference resolution. Do two given noun phrases refer to the same real-world entity? (Soon et al., A Machine Learning Approach to Coreference Resolution of Noun Phrases, Comp. Ling. 2001)

31 examples of ML in NLP: named entity recognition. [ORG United Nations] official [PER Ekeus] heads for [LOC Baghdad].

33 examples of ML in NLP: named entity recognition (Zhang and Johnson, A Robust Risk Minimization based Named Entity Recognition System, CoNLL 2003)

34 what goes on when we learn? The learning algorithm observes the examples in the training set. It tries to find common patterns that explain the data: it generalizes so that we can make predictions for new examples. How this is done depends on what algorithm we are using.

35 knowledge from experience? given some experience, when can we be certain that we can draw any conclusion?

36 a fundamental tradeoff in machine learning. Goodness of fit: the learned classifier should be able to correctly classify the examples in the training data. Regularization: the classifier should be simple. This tradeoff is called the bias-variance tradeoff (see e.g. Wikipedia).

37 example: guessing the gender, based on height and weight

38 learning algorithms that we have seen so far: we have already seen a number of learning algorithms in previous courses: Naive Bayes; perceptron; hidden Markov models; decision trees; transformation rules (Brill tagger)

39 representation of the prediction function: we may represent our prediction function in different ways. Numerical models: weight or probability tables; networked models. Rules: decision trees; transformation rules.

40 example: the prediction function as numbers
    def sentiment_is_positive(features):
        score = 0.0
        score += 2.1 * features["wonderful"]
        score += 0.6 * features["good"]
        ...
        score -= 0.9 * features["bad"]
        score -= 3.1 * features["awful"]
        ...
        if score > 0:
            return True
        else:
            return False

41 example: the prediction function as rules
    def patient_is_sick(features):
        if features["systolic_blood_pressure"] > 140:
            return True
        if features["gender"] == "m":
            if features["psa"] > 4.0:
                return True
        ...
        return False

42 perceptron revisited. The perceptron learning algorithm creates a weight table; each weight in the table corresponds to a feature. E.g. in sentiment analysis, "fine" probably has a high positive weight, "boring" a negative weight, and "and" a weight near zero. Classification is carried out by summing the weights for each feature:
    def perceptron_classify(features, weights):
        score = 0
        for f in features:
            score += weights.get(f, 0)
        if score >= 0:
            return "pos"
        else:
            return "neg"

43 the perceptron learning algorithm. Start with an empty weight table; classify according to the current weight table; each time we misclassify, change the weight table a bit: if a positive instance was misclassified, add 1 to the weight of each feature in the document, and conversely...
    def perceptron_learn(examples, number_iterations):
        weights = {}
        for iteration in range(number_iterations):
            for label, features in examples:
                guess = perceptron_classify(features, weights)
                if label == "pos" and guess == "neg":
                    for f in features:
                        weights[f] = weights.get(f, 0) + 1
                elif label == "neg" and guess == "pos":
                    for f in features:
                        weights[f] = weights.get(f, 0) - 1
        return weights
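To see the learning loop in action, here is a self-contained sketch that repeats the two perceptron functions and runs them on a tiny training set. The four example documents and the choice of 5 iterations are assumptions invented for illustration.

```python
def perceptron_classify(features, weights):
    # sum the weights of the features that are present
    score = sum(weights.get(f, 0) for f in features)
    return "pos" if score >= 0 else "neg"

def perceptron_learn(examples, number_iterations):
    weights = {}
    for _ in range(number_iterations):
        for label, features in examples:
            guess = perceptron_classify(features, weights)
            if label == "pos" and guess == "neg":
                # push the weights of this document's features up
                for f in features:
                    weights[f] = weights.get(f, 0) + 1
            elif label == "neg" and guess == "pos":
                # push the weights of this document's features down
                for f in features:
                    weights[f] = weights.get(f, 0) - 1
    return weights

# hypothetical toy training set: (label, bag of words)
examples = [
    ("pos", ["a", "wonderful", "film"]),
    ("neg", ["a", "boring", "film"]),
    ("pos", ["wonderful", "acting"]),
    ("neg", ["boring", "and", "awful"]),
]

weights = perceptron_learn(examples, 5)
print(weights)
print(perceptron_classify(["wonderful", "story"], weights))
print(perceptron_classify(["boring", "story"], weights))
```

On this toy data the weight table stops changing after a few passes, because every training example is then classified correctly and no update is triggered.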

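44 estimation in Naive Bayes, revisited. Naive Bayes: $P(\text{document}, \text{label}) = P(f_1, \ldots, f_n, \text{label}) = P(\text{label}) \cdot P(f_1, \ldots, f_n \mid \text{label}) = P(\text{label}) \cdot P(f_1 \mid \text{label}) \cdots P(f_n \mid \text{label})$, where the last step relies on the "naive" assumption that the features are independent given the label. How do we estimate the probabilities? Maximum likelihood: set the probabilities so that the probability of the data is maximized.

45 estimation in Naive Bayes: supervised case. How do we estimate $P(\text{positive})$? $P_{\text{MLE}}(\text{positive}) = \frac{\text{count}(\text{positive})}{\text{count}(\text{all})} = \frac{2}{4}$. How do we estimate $P(\text{nice} \mid \text{positive})$? $P_{\text{MLE}}(\text{nice} \mid \text{positive}) = \frac{\text{count}(\text{nice}, \text{positive})}{\text{count}(\text{any word}, \text{positive})} = \frac{2}{7}$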
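The counting behind these estimates can be sketched in a few lines. The four documents below are hypothetical: they were made up so that the counts match the 2/4 and 2/7 on the slide, and they are not the data set used in the lecture.

```python
from collections import Counter
from fractions import Fraction

# hypothetical labeled documents: 2 of 4 are positive, and the positive
# documents contain 7 word tokens in total, 2 of which are "nice"
docs = [
    ("positive", ["a", "nice", "film"]),
    ("positive", ["nice", "and", "funny", "too"]),
    ("negative", ["boring"]),
    ("negative", ["not", "nice"]),
]

label_counts = Counter(label for label, _ in docs)
word_counts = {label: Counter() for label in label_counts}
for label, words in docs:
    word_counts[label].update(words)

def p_label(label):
    """MLE of P(label): count(label) / count(all documents)."""
    return Fraction(label_counts[label], len(docs))

def p_word_given_label(word, label):
    """MLE of P(word | label): count(word, label) / count(any word, label)."""
    return Fraction(word_counts[label][word], sum(word_counts[label].values()))

print(p_label("positive"))                     # 2/4, printed in lowest terms
print(p_word_given_label("nice", "positive"))
```

Note that a word never seen with a label gets probability zero under these estimates, which is why smoothing is usually added in practice.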

46 overview: information about the course; machine learning basics; introduction to the scikit-learn library

47 machine learning software. General-purpose software, large collections of algorithms: scikit-learn (Python library, will be used in this course); Weka (Java library with a nice user interface); NLTK (includes some learning algorithms but seems to be discontinuing them in favor of scikit-learn). Special-purpose software, small collections of algorithms: LibSVM/LibLinear for support vector machines; CRF++, CRFSGD for conditional random fields; TensorFlow, Theano, Caffe, Keras for neural networks; ...

48 scikit-learn toy example: a simple training set
    # training set: the features
    X = [{'city': 'gothenburg', 'month': 'july'},
         {'city': 'gothenburg', 'month': 'december'},
         {'city': 'paris', 'month': 'july'},
         {'city': 'paris', 'month': 'december'}]

    # training set: the gold-standard outputs
    Y = ['rain', 'rain', 'sun', 'rain']

49 scikit-learn toy example: training a classifier
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import Pipeline
    import pickle

    classifier = Pipeline([('v', DictVectorizer()), ('c', LinearSVC())])

    # train the classifier
    classifier.fit(X, Y)

    # optionally: save the classifier to a file...
    with open('weather.classifier', 'wb') as f:
        pickle.dump(classifier, f)

50 explanation of the code: DictVectorizer. Internally, the features used by scikit-learn's classifiers are numbers, not strings; a Vectorizer converts the strings into numbers (more about this in the next lecture!). Rule of thumb: use a DictVectorizer for attribute-value features; use a CountVectorizer or TfidfVectorizer for bag-of-words features.
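As a small illustration of the string-to-number conversion, the sketch below applies a DictVectorizer to two of the weather instances. Passing `sparse=False` is a choice made here only so that the matrix prints readably; the default output is a sparse matrix.

```python
from sklearn.feature_extraction import DictVectorizer

X = [{'city': 'gothenburg', 'month': 'july'},
     {'city': 'paris', 'month': 'december'}]

vec = DictVectorizer(sparse=False)
X_num = vec.fit_transform(X)

# one column per attribute=value pair seen during fit
print(vec.feature_names_)
# one row per instance; 1.0 marks the pairs that are present
print(X_num)
```

Each string-valued attribute is expanded into one 0/1 column per observed value, which is the numerical form the classifier works with.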

51 explanation of the code: LinearSVC. LinearSVC is the actual classifier we're using; this is called a linear support vector machine (more about this in lecture 3). To use Naive Bayes instead:
    from sklearn.naive_bayes import MultinomialNB
    ...
    classifier = Pipeline([('v', DictVectorizer()), ('c', MultinomialNB())])
Or a perceptron:
    from sklearn.linear_model import Perceptron
    ...
    classifier = Pipeline([('v', DictVectorizer()), ('c', Perceptron())])

52 explanation of the code: Pipeline and fit. In scikit-learn, preprocessing steps and classifiers are often combined into a Pipeline; in our case, a DictVectorizer and a LinearSVC. The whole Pipeline is trained by calling the method fit, which will in turn call fit on all the parts of the Pipeline.
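To make the role of the Pipeline concrete, the following sketch trains the toy weather classifier twice: once through a Pipeline, and once by calling the two steps by hand. Fixing `random_state=0` is an assumption added here so that both runs are reproducible and comparable.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

X = [{'city': 'gothenburg', 'month': 'july'},
     {'city': 'gothenburg', 'month': 'december'},
     {'city': 'paris', 'month': 'july'},
     {'city': 'paris', 'month': 'december'}]
Y = ['rain', 'rain', 'sun', 'rain']

# version 1: let the Pipeline chain the steps
pipeline = Pipeline([('v', DictVectorizer()),
                     ('c', LinearSVC(random_state=0))])
pipeline.fit(X, Y)

# version 2: the same steps carried out manually
vec = DictVectorizer()
clf = LinearSVC(random_state=0)
X_num = vec.fit_transform(X)   # fit the vectorizer, convert to numbers
clf.fit(X_num, Y)              # train the classifier on the numbers

pipeline_guesses = list(pipeline.predict(X))
manual_guesses = list(clf.predict(vec.transform(X)))
print(pipeline_guesses == manual_guesses)
```

The practical advantage of the Pipeline is that the vectorizer and the classifier travel together, so new instances can be passed to `predict` as dicts without a separate conversion step.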

53 toy example: making new predictions and evaluating
    from sklearn.metrics import accuracy_score

    Xtest = [{'city': 'gothenburg', 'month': 'june'},
             {'city': 'gothenburg', 'month': 'november'},
             {'city': 'paris', 'month': 'june'},
             {'city': 'paris', 'month': 'november'}]
    Ytest = ['rain', 'rain', 'sun', 'rain']

    # classify all the test instances
    guesses = classifier.predict(Xtest)

    # compute the classification accuracy
    print(accuracy_score(Ytest, guesses))

54 a note on efficiency. Python is a nice language for programmers but not always the most efficient. In scikit-learn, many functions are implemented in faster languages (e.g. C) and use specialized math libraries, so in many cases it is much faster to call the library once than many times:
    import time

    t0 = time.time()
    # one call for the whole test set
    guesses1 = classifier.predict(Xtest)
    t1 = time.time()

    # one call per instance; predict expects a list of instances
    guesses2 = []
    for x in Xtest:
        guess = classifier.predict([x])
        guesses2.append(guess)
    t2 = time.time()

    print(t1-t0)
    print(t2-t1)
Result: 0.29 sec and 45 sec.

55 some other practical functions. Splitting the data:
    # sklearn.cross_validation was removed in scikit-learn 0.20;
    # train_test_split now lives in sklearn.model_selection
    from sklearn.model_selection import train_test_split
    train_files, dev_files = train_test_split(td_files, train_size=0.8, random_state=0)
Evaluation, e.g. accuracy, precision, recall, F-score:
    from sklearn.metrics import f1_score
    print(f1_score(Y_eval, Y_out))
Note that we're using our own evaluation in the first assignment, since we need more details.
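A runnable version of these two calls might look as follows. The file names and the label lists are hypothetical placeholders; with string labels, `f1_score` needs to be told which label counts as the positive class, which is done here with `pos_label`.

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# hypothetical list of 10 file names
td_files = ["doc%d.txt" % i for i in range(10)]
train_files, dev_files = train_test_split(td_files, train_size=0.8,
                                          random_state=0)
print(len(train_files), len(dev_files))

# hypothetical gold-standard labels and system outputs
Y_eval = ['sun', 'rain', 'rain', 'sun', 'rain']
Y_out = ['sun', 'sun', 'rain', 'rain', 'rain']

# F-score for the 'sun' class
score = f1_score(Y_eval, Y_out, pos_label='sun')
print(score)
```

Fixing `random_state` makes the split reproducible, which matters when results from different runs are compared.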

56 aside: evaluation methodology. How do we evaluate our systems? Intrinsic evaluation: test the performance in isolation. Extrinsic evaluation: I changed my POS tagger; how does this change the performance of my parser? How much more money do I make? Common measures in intrinsic evaluation: classification accuracy; precision and recall (for "needle in a haystack" tasks); also several other task-dependent measures.
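Precision and recall for a "needle in a haystack" task such as finding names can be computed by comparing the set of items the system found with the set of items in the gold standard. The sketch below assumes that each item is a (start, end, label) tuple; this representation and the example spans are illustrative assumptions.

```python
def precision_recall_f1(gold, predicted):
    """Compare two collections of items and return (precision, recall, F1)."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# gold-standard spans for "United Nations official Ekeus heads for Baghdad ."
gold = {(0, 2, "ORG"), (3, 4, "PER"), (6, 7, "LOC")}
# hypothetical system output: Ekeus is mislabeled as a location
predicted = {(0, 2, "ORG"), (3, 4, "LOC"), (6, 7, "LOC")}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(precision, recall, f1)
```

Precision asks how many of the system's answers were right, while recall asks how many of the gold-standard items were found; the F-score is their harmonic mean.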

57 extended example 1: named entity classification. We are given a name (a single word) in a sentence; determine if it is a person, a location, or an organization. Example: "My aunt Gözde lives in Ashgabat." The information our classifier can use: the words in the sentence; the part-of-speech tags; the position of the name that we are classifying.
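One possible way to turn the three information sources above into attribute-value features is sketched below. The function name `name_features`, the particular features, the padding symbols, and the part-of-speech tags are all assumptions chosen for illustration; the features used in the lecture may differ.

```python
def name_features(words, pos_tags, position):
    """Build attribute-value features for the name at words[position]."""
    def at(seq, i, pad):
        # return seq[i], or a padding symbol outside the sentence
        return seq[i] if 0 <= i < len(seq) else pad

    return {
        "word": words[position].lower(),
        "pos": pos_tags[position],
        "prev_word": at(words, position - 1, "<s>").lower(),
        "next_word": at(words, position + 1, "</s>").lower(),
        "prev_pos": at(pos_tags, position - 1, "<s>"),
        "next_pos": at(pos_tags, position + 1, "</s>"),
        "position": position,
    }

words = ["My", "aunt", "Gözde", "lives", "in", "Ashgabat", "."]
pos_tags = ["PRP$", "NN", "NNP", "VBZ", "IN", "NNP", "."]  # hypothetical tags

print(name_features(words, pos_tags, 2))   # features for "Gözde"
print(name_features(words, pos_tags, 5))   # features for "Ashgabat"
```

The resulting dicts are in the attribute-value format that a DictVectorizer accepts, so they could be passed to a Pipeline like the one in the weather example.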

58 extended example 2: document classification. We are given a document; determine the category of the document (select from a small set of predefined categories). We reuse the review dataset that we had in the previous course; this dataset has polarity and topic labels for each document.
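A bag-of-words document classifier can be assembled from the same building blocks as the weather example, with a CountVectorizer in place of the DictVectorizer. The sketch below uses four made-up one-line reviews as a stand-in for the review dataset, and Naive Bayes as one possible classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# hypothetical miniature training set with polarity labels
docs = ["a wonderful and funny film",
        "great acting wonderful story",
        "boring and awful film",
        "awful acting boring story"]
labels = ["pos", "pos", "neg", "neg"]

# CountVectorizer tokenizes each document and counts its words
classifier = Pipeline([('v', CountVectorizer()), ('c', MultinomialNB())])
classifier.fit(docs, labels)

guesses = list(classifier.predict(["wonderful funny story",
                                   "boring awful film"]))
print(guesses)
```

Swapping `MultinomialNB()` for `LinearSVC()` or `Perceptron()` changes the learning algorithm without touching the rest of the code.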
