Machine Learning for NLP Lecture 1: Introduction

Machine Learning for NLP Lecture 1: Introduction
Richard Johansson
August 29, 2016

overview of the lecture
some information about the course
machine learning basics and overview
overview of the assignments
introduction to the scikit-learn library

overview information about the course machine learning basics introduction to the scikit-learn library

teaching: 7 lectures; lab sessions where you work on the assignments or the project; seminars

your work: 3 mandatory assignments plus one optional; present a research paper at a seminar; mini-project; for the VG grade: a written exam

a request: if you have a personal interest in some topic, please let me know!

assignment 1: feature design for function tagging
example dependency tree: She lives in a house of brick, with grammatical function labels such as nsubj (for "She") and root (for "lives")

assignment 1: feature design for function tagging
the purpose of this assignment is to practice the typical steps in building a machine learning-based system:
designing features
analyzing the performance of the system (and its errors)
trying out different learning methods

assignment 2: classifier implementation
read a paper about a simple algorithm for training the support vector machine classifier
write code to implement the algorithm, similar to the algorithms in scikit-learn

assignment 3: learning for sequence tagging
implement a sequential tagging model
use it to build a named-entity recognizer
example: United Nations [ORG] official Ekeus [PER] heads for Baghdad [LOC].

assignment 4: using TensorFlow (optional)
explore Google's new TensorFlow library for neural network training
try out some classification and structure prediction tasks (e.g. translation)

independent work
select a topic of interest (or ask me for ideas)
define a small project
write code, carry out experiments
write a short paper, present it at a seminar at the end of the course

written exam (required for VG)
the questions will test your understanding of central ideas in machine learning and let you sketch and discuss (not code) ML-based solutions to some real-world problems in NLP
ML is a mathy subject, but the questions will not require much math; you might, however, need to understand the idea behind a few formulas

overview information about the course machine learning basics introduction to the scikit-learn library

basic ideas given some object, make a prediction is this patient diabetic? is the sentiment of this movie review positive? does this image contain a cat? what is the grammatical function of this noun phrase? what will be tomorrow's share value of this stock? what are the part-of-speech tags of the words in this sentence?

basic ideas given some object, make a prediction is this patient diabetic? is the sentiment of this movie review positive? does this image contain a cat? what is the grammatical function of this noun phrase? what will be tomorrow's share value of this stock? what are the part-of-speech tags of the words in this sentence? the goal of machine learning is to build the prediction functions by observing data

some types of machine learning problems
classification: learning to output a category label (spam/non-spam; positive/negative; subject/object; ...)
structure prediction: learning to build some structure (POS tagging; dependency parsing; translation; ...)
(numerical regression: learning to guess a number) value of a share; number of stars in a review; ...
(reinforcement learning: learning to act in an environment) dialogue systems; playing games; ...

machine learning in NLP research ACL, EMNLP, Coling, etc are heavily dominated by ML-focused papers

why machine learning?
why would we want to build the function from data instead of just implementing it?
usually because we don't really know how to write down the function by hand: speech recognition, image classification, syntactic parsing, translation, ...
it might not be necessary for limited tasks where we know what to do: morphology? sentence splitting and tokenization? identification of limited classes of names, dates and times?
what is more expensive in your case: knowledge or data?

don't forget your linguistic intuitions!
machine learning automates some tasks, but we still need our brains:
defining the tasks and terminology
annotating training and testing data
having an intuition about which features may be useful can be crucial; in general, features are more important than the choice of learning algorithm
error analysis
defining constraints to guide the learner: valency lexicons can be used in parsers; grammar-based parsers with ML-trained disambiguators

learning from data

example: is the patient diabetic?
in order to predict, we make some measurements of properties we believe will be useful; these are called the features

attributes/values or bag of words
we often represent the features as attributes with values; in practice, as a Python dict:
features = {"gender": "male", "age": 37, "blood_pressure": 130, ...}
sometimes, it's easier just to see the features as a list of e.g. words (bag of words):
features = ["here", "are", "some", "words", "in", "a", "document"]

examples of ML in NLP: document classification
in a previous course, you have implemented a classifier of documents
many document classifiers use the words of the documents as their features (bag of words)

examples of ML in NLP: document classification
in a previous course, you have implemented a classifier of documents
many document classifiers use the words of the documents as their features (bag of words)
... but we could also add other features, such as the presence of smileys or negations
Günther, Sentiment Analysis of Microblogs, MLT Master's Thesis, 2013

examples of ML in NLP: difficulty level classification
what learner level (e.g. according to CEFR) do you need to understand the following Swedish sentences?
Flickan sover. (A1)
Under förberedelsetiden har en baslinjestudie utförts för att kartlägga bland annat diabetesärftlighet, oral glukostolerans, mat och konditionsvanor och socialekonomiska faktorer i åldrarna 35-54 år i flera kommuner inom Nordvästra och Sydöstra sjukvårdsområdena (NVSO resp SÖSO). (C2)
Pilán, NLP-based Approaches to Sentence Readability for Second Language Learning Purposes, MLT Master's Thesis, 2013

examples of ML in NLP: difficulty level classification

examples of ML in NLP: coreference resolution
do two given noun phrases refer to the same real-world entity?
Soon et al., A Machine Learning Approach to Coreference Resolution of Noun Phrases, Comp. Ling. 2001

examples of ML in NLP: named entity recognition
example: United Nations [ORG] official Ekeus [PER] heads for Baghdad [LOC].

examples of ML in NLP: named entity recognition
Zhang and Johnson, A Robust Risk Minimization based Named Entity Recognition System, CoNLL 2003

what goes on when we learn?
the learning algorithm observes the examples in the training set
it tries to find common patterns that explain the data: it generalizes, so that we can make predictions for new examples
how this is done depends on what algorithm we are using

knowledge from experience? given some experience, when can we be certain that we can draw any conclusion?

a fundamental tradeoff in machine learning
goodness of fit: the learned classifier should be able to correctly classify the examples in the training data
regularization: the classifier should be simple
this tradeoff is called the bias-variance tradeoff (see e.g. Wikipedia)
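in practice, this tradeoff is usually exposed as a tunable parameter of the learning algorithm. As a minimal sketch (not from the slides), scikit-learn's LinearSVC, used later in this lecture, has a regularization parameter C: small values mean stronger regularization (a simpler classifier), large values mean a closer fit to the training data. The toy data below is invented for illustration.

from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# invented toy training set
X = [{'word': 'wonderful'}, {'word': 'awful'}, {'word': 'good'}, {'word': 'bad'}]
Y = ['pos', 'neg', 'pos', 'neg']

# C controls the fit/regularization balance: small C = more regularization
for C in [0.01, 1.0, 100.0]:
    classifier = Pipeline([('v', DictVectorizer()), ('c', LinearSVC(C=C))])
    classifier.fit(X, Y)
    print(C, classifier.score(X, Y))  # accuracy on the training data itself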

example: guessing the gender, based on height and weight
(scatter plot of weight against height for the two classes)

learning algorithms that we have seen so far
we have already seen a number of learning algorithms in previous courses:
Naive Bayes
perceptron
hidden Markov models
decision trees
transformation rules (Brill tagger)

representation of the prediction function
we may represent our prediction function in different ways:
numerical models: weight or probability tables; networked models
rules: decision trees; transformation rules

example: the prediction function as numbers
def sentiment_is_positive(features):
    score = 0.0
    score += 2.1 * features["wonderful"]
    score += 0.6 * features["good"]
    ...
    score -= 0.9 * features["bad"]
    score -= 3.1 * features["awful"]
    ...
    if score > 0:
        return True
    else:
        return False

example: the prediction function as rules
def patient_is_sick(features):
    if features["systolic_blood_pressure"] > 140:
        return True
    if features["gender"] == "m":
        if features["psa"] > 4.0:
            return True
    ...
    return False

perceptron revisited
the perceptron learning algorithm creates a weight table
each weight in the table corresponds to a feature: e.g. "fine" probably has a high positive weight in sentiment analysis, "boring" a negative weight, "and" a weight near zero
classification is carried out by summing the weights for each feature:
def perceptron_classify(features, weights):
    score = 0
    for f in features:
        score += weights.get(f, 0)
    if score >= 0:
        return "pos"
    else:
        return "neg"

the perceptron learning algorithm
start with an empty weight table
classify according to the current weight table
each time we misclassify, change the weight table a bit: if a positive instance was misclassified, add 1 to the weight of each feature in the document, and conversely...
def perceptron_learn(examples, number_iterations):
    weights = {}
    for iteration in range(number_iterations):
        for label, features in examples:
            guess = perceptron_classify(features, weights)
            if label == "pos" and guess == "neg":
                for f in features:
                    weights[f] = weights.get(f, 0) + 1
            elif label == "neg" and guess == "pos":
                for f in features:
                    weights[f] = weights.get(f, 0) - 1
    return weights
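to see how the two functions above fit together, here is a small usage sketch on a made-up bag-of-words training set (the documents and labels are invented for illustration):

# invented toy training set: (label, bag of words) pairs
examples = [
    ("pos", ["a", "fine", "movie"]),
    ("pos", ["wonderful", "and", "fine"]),
    ("neg", ["boring", "and", "awful"]),
    ("neg", ["a", "boring", "movie"]),
]

weights = perceptron_learn(examples, number_iterations=10)
print(perceptron_classify(["fine", "movie"], weights))    # expected: "pos"
print(perceptron_classify(["boring", "movie"], weights))  # expected: "neg"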

estimation in Naive Bayes, revisited
Naive Bayes:
P(document, label) = P(f_1, ..., f_n, label) = P(label) * P(f_1, ..., f_n | label) = P(label) * P(f_1 | label) * ... * P(f_n | label)
how do we estimate the probabilities?
maximum likelihood: set the probabilities so that the probability of the data is maximized

estimation in Naive Bayes: supervised case
how do we estimate P(positive)?
P_MLE(positive) = count(positive) / count(all) = 2/4
how do we estimate P(nice | positive)?
P_MLE(nice | positive) = count(nice, positive) / count(any word, positive) = 2/7
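a minimal sketch (not from the slides) of how these maximum-likelihood estimates can be computed by counting over a labeled training set; the toy documents below are invented and do not reproduce the counts 2/4 and 2/7 on the slide:

from collections import Counter

# invented toy training set: (label, words) pairs
documents = [
    ("positive", ["nice", "film"]),
    ("positive", ["nice", "story"]),
    ("negative", ["boring", "film"]),
    ("negative", ["awful"]),
]

label_counts = Counter(label for label, _ in documents)
word_counts = Counter((label, w) for label, words in documents for w in words)
words_per_label = Counter()
for label, words in documents:
    words_per_label[label] += len(words)

# maximum-likelihood estimates
p_positive = label_counts["positive"] / len(documents)
p_nice_given_positive = word_counts[("positive", "nice")] / words_per_label["positive"]
print(p_positive, p_nice_given_positive)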

overview information about the course machine learning basics introduction to the scikit-learn library

machine learning software
general-purpose software, large collections of algorithms:
scikit-learn: http://scikit-learn.org (Python library, will be used in this course)
Weka: http://www.cs.waikato.ac.nz/ml/weka (Java library with a nice user interface)
NLTK includes some learning algorithms but seems to be discontinuing them in favor of scikit-learn
special-purpose software, small collections of algorithms:
LibSVM/LibLinear for support vector machines
CRF++, CRFSGD for conditional random fields
TensorFlow, Theano, Caffe, Keras for neural networks
...

scikit-learn toy example: a simple training set
# training set: the features
X = [{'city':'gothenburg', 'month':'july'},
     {'city':'gothenburg', 'month':'december'},
     {'city':'paris', 'month':'july'},
     {'city':'paris', 'month':'december'}]
# training set: the gold-standard outputs
Y = ['rain', 'rain', 'sun', 'rain']

scikit-learn toy example: training a classifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
import pickle

classifier = Pipeline([('v', DictVectorizer()), ('c', LinearSVC())])

# train the classifier
classifier.fit(X, Y)

# optionally: save the classifier to a file...
with open('weather.classifier', 'wb') as f:
    pickle.dump(classifier, f)
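a natural companion to the snippet above (not shown on the slide) is loading the saved classifier back in a later session; pickle.load mirrors pickle.dump:

import pickle

# load the classifier that was saved with pickle.dump above
with open('weather.classifier', 'rb') as f:
    classifier = pickle.load(f)

print(classifier.predict([{'city': 'paris', 'month': 'july'}]))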

explanation of the code: DictVectorizer
internally, the features used by scikit-learn's classifiers are numbers, not strings
a Vectorizer converts the strings into numbers (more about this in the next lecture!)
rule of thumb:
use a DictVectorizer for attribute-value features
use a CountVectorizer or TfidfVectorizer for bag-of-words features
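for completeness, a minimal sketch of the bag-of-words case (not from the slides): CountVectorizer takes raw strings and turns each one into a vector of word counts; the documents and labels are invented for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# invented toy documents with sentiment labels
docs = ["a wonderful and fine movie", "boring and awful", "a fine story", "awful acting"]
labels = ['pos', 'neg', 'pos', 'neg']

# CountVectorizer converts each string into a bag-of-words count vector
classifier = Pipeline([('v', CountVectorizer()), ('c', LinearSVC())])
classifier.fit(docs, labels)
print(classifier.predict(["a wonderful story"]))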

explanation of the code: LinearSVC
LinearSVC is the actual classifier we're using; this is called a linear support vector machine (more about this in lecture 3)
to use Naive Bayes instead:
from sklearn.naive_bayes import MultinomialNB
...
classifier = Pipeline([('v', DictVectorizer()), ('c', MultinomialNB())])
or a perceptron:
from sklearn.linear_model import Perceptron
...
classifier = Pipeline([('v', DictVectorizer()), ('c', Perceptron())])

explanation of the code: Pipeline and fit
in scikit-learn, preprocessing steps and classifiers are often combined into a Pipeline; in our case, a DictVectorizer and a LinearSVC
the whole Pipeline is trained by calling the method fit, which will in turn call fit on all the parts of the Pipeline
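roughly speaking (a sketch, not how scikit-learn is implemented internally), fitting the two-step pipeline above is equivalent to vectorizing first and then training the classifier on the result; X and Y are the toy training set from the earlier slide:

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

vectorizer = DictVectorizer()
svc = LinearSVC()

# roughly what Pipeline.fit does: fit and apply the first step, then fit the classifier
X_numeric = vectorizer.fit_transform(X)
svc.fit(X_numeric, Y)

# and at prediction time: transform the new instance, then predict
print(svc.predict(vectorizer.transform([{'city': 'paris', 'month': 'july'}])))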

toy example: making new predictions and evaluating
from sklearn.metrics import accuracy_score

Xtest = [{'city':'gothenburg', 'month':'june'},
         {'city':'gothenburg', 'month':'november'},
         {'city':'paris', 'month':'june'},
         {'city':'paris', 'month':'november'}]
Ytest = ['rain', 'rain', 'sun', 'rain']

# classify all the test instances
guesses = classifier.predict(Xtest)

# compute the classification accuracy
print(accuracy_score(Ytest, guesses))

a note on efficiency
Python is a nice language for programmers, but not always the most efficient
in scikit-learn, many functions are implemented in faster languages (e.g. C) and use specialized math libraries
so in many cases, it is much faster to call the library once than many times:
import time

t0 = time.time()
guesses1 = classifier.predict(Xtest)
t1 = time.time()

guesses2 = []
for x in Xtest:
    guess = classifier.predict(x)
    guesses2.append(guess)
t2 = time.time()

print(t1 - t0)
print(t2 - t1)
result: 0.29 sec and 45 sec

some other practical functions
splitting the data:
from sklearn.cross_validation import train_test_split
train_files, dev_files = train_test_split(td_files, train_size=0.8, random_state=0)
evaluation, e.g. accuracy, precision, recall, F-score:
from sklearn.metrics import f1_score
print(f1_score(Y_eval, Y_out))
note that we're using our own evaluation in the first assignment, since we need more details

aside: evaluation methodology
how do we evaluate our systems?
intrinsic evaluation: test the performance in isolation
extrinsic evaluation: I changed my POS tagger; how does this change the performance of my parser? how much more money do I make?
common measures in intrinsic evaluation:
classification accuracy
precision and recall (for needle-in-a-haystack problems)
also several other task-dependent measures
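a small worked example of the needle-in-a-haystack measures (the numbers are invented): if a named-entity recognizer proposes 8 entity mentions, of which 6 are correct, and the gold standard contains 10 mentions, then precision = 6/8 = 0.75, recall = 6/10 = 0.60, and the F-score is their harmonic mean, 2 * 0.75 * 0.60 / (0.75 + 0.60) ≈ 0.67.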

extended example 1: named entity classification
we are given a name (a single word) in a sentence; determine whether it is a person, a location, or an organization
example: My aunt Gözde lives in Ashgabat.
the information our classifier can use:
the words in the sentence
the part-of-speech tags
the position of the name that we are classifying
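as a hedged sketch (not from the slides) of how such information could be packed into the attribute-value representation used earlier, the hypothetical helper below assumes the sentence is given as a list of (word, POS tag) pairs and the name is identified by its position:

def name_features(tagged_sentence, position):
    # tagged_sentence: list of (word, pos) pairs; position: index of the name to classify
    word, pos = tagged_sentence[position]
    return {
        "word": word,
        "pos": pos,
        "position": position,
        "first_in_sentence": position == 0,
        "previous_word": tagged_sentence[position - 1][0] if position > 0 else "<s>",
        "next_word": tagged_sentence[position + 1][0] if position + 1 < len(tagged_sentence) else "</s>",
    }

# example: the name "Ashgabat" at position 5 in "My aunt Gözde lives in Ashgabat ."
sentence = [("My", "PRP$"), ("aunt", "NN"), ("Gözde", "NNP"), ("lives", "VBZ"),
            ("in", "IN"), ("Ashgabat", "NNP"), (".", ".")]
print(name_features(sentence, 5))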

extended example 2: document classification
we are given a document; determine the category of the document (select from a small set of predefined categories)
we reuse the review dataset that we had in the previous course
this dataset has polarity and topic labels for each document