Machine Learning for NLP Lecture 1: Introduction

Machine Learning for NLP, Lecture 1: Introduction. Richard Johansson, August 31, 2015

overview of today's lecture:
- some information about the course
- machine learning basics and overview
- overview of the assignments
- introduction to the scikit-learn library

overview: information about the course; machine learning basics; introduction to the scikit-learn library

teaching: 5 lectures + 1 guest lecture (hopefully); lab sessions where you work on the assignments or the project; seminars

your work: 3 assignments; present a research paper at a seminar; mini-project. For the VG grade: solve the optional parts of assignments 2 and 3, and select an ambitious project.

request: if you have a personal interest in some topic, please let me know!

assignment 1: feature design for function tagging. [example dependency tree from the slide: "She lives in a house of brick", with function labels such as nsubj and root]

assignment 1: feature design for function tagging. The purpose of this assignment is to practice the typical steps in building a machine learning-based system: designing features; analyzing the performance of the system (and its errors); trying out different learning methods.

assignment 2: classifier implementation. Read a paper about a simple algorithm for training the support vector machine classifier, then write code to implement the algorithm, similar to the algorithms in scikit-learn.

assignment 3: learning for dependency parsing [preliminary]. Implement the structured perceptron learning algorithm and use it to build a graph-based dependency parser. [example parse tree from the slide: "yesterday she gave the horse an apple"]

independent work: select a topic of interest (or ask me for ideas); define a small project; write code and carry out experiments; write a short paper and present it at a seminar at the end of the course.

overview: information about the course; machine learning basics; introduction to the scikit-learn library

basic ideas: given some object, make a prediction. Is this patient diabetic? Is the sentiment of this movie review positive? Does this image contain a cat? What is the grammatical function of this noun phrase? What will be tomorrow's share value of this stock? What are the part-of-speech tags of the words in this sentence? The goal of machine learning is to build the prediction functions by observing data.

some types of machine learning problems:
- classification: learning to output a category label; e.g. spam/non-spam, positive/negative, subject/object, ...
- structure prediction: learning to build some structure; e.g. POS tagging, dependency parsing, translation, ...
- (numerical regression: learning to guess a number; e.g. the value of a share, the number of stars in a review, ...)
- (reinforcement learning: learning to act in an environment; e.g. dialogue systems, playing games, ...)

machine learning in NLP research: ACL, EMNLP, Coling, etc. are heavily dominated by ML-focused papers.

why machine learning? Why would we want to build the function from data instead of just implementing it? Usually because we don't really know how to write down the function by hand: speech recognition; image classification; syntactic parsing; translation; ... It might not be necessary for limited tasks that we know how to handle: morphology? sentence splitting and tokenization? identification of limited classes of names, dates and times? What is more expensive in your case: knowledge or data?

don't forget your linguistic intuitions! Machine learning automates some tasks, but we still need our brains:
- defining the tasks and terminology
- annotating training and testing data
- having an intuition about which features may be useful can be crucial (in general, features are more important than the choice of learning algorithm)
- error analysis
- defining constraints to guide the learner (valency lexicons can be used in parsers; grammar-based parsers with ML-trained disambiguators)

learning from data

example: is the patient diabetic? In order to predict, we make some measurements of properties we believe will be useful; these are called the features.

attributes/values or bag of words. We often represent the features as attributes with values; in practice, as a Python dict:

features = { "gender": "male", "age": 37, "blood_pressure": 130, ... }

sometimes, it's easier just to see the features as a list of e.g. words (bag of words):

features = [ "here", "are", "some", "words", "in", "a", "document" ]

examples of ML in NLP: document classification. In a previous course, you implemented a classifier of documents. Many document classifiers use the words of the documents as their features (bag of words)... but we could also add other features, such as the presence of smileys or negations. Günther, Sentiment Analysis of Microblogs, MLT Master's Thesis, 2013.

examples of ML in NLP: difficulty level classification. What learner level (e.g. according to CEFR) do you need to understand the following Swedish sentences? "Flickan sover." ("The girl is sleeping.") A1. "Under förberedelsetiden har en baslinjestudie utförts för att kartlägga bland annat diabetesärftlighet, oral glukostolerans, mat och konditionsvanor och socialekonomiska faktorer i åldrarna 35-54 år i flera kommuner inom Nordvästra och Sydöstra sjukvårdsområdena (NVSO resp SÖSO)." ("During the preparation period, a baseline study has been carried out to map, among other things, hereditary diabetes, oral glucose tolerance, dietary and exercise habits, and socioeconomic factors among 35-54-year-olds in several municipalities within the Northwest and Southeast healthcare districts.") C2. Pilán, NLP-based Approaches to Sentence Readability for Second Language Learning Purposes, MLT Master's Thesis, 2013.

examples of ML in NLP: difficulty level classification

examples of ML in NLP: coreference resolution. Do two given noun phrases refer to the same real-world entity? Soon et al., A Machine Learning Approach to Coreference Resolution of Noun Phrases, Computational Linguistics, 2001.

examples of ML in NLP: named entity recognition. "United Nations [ORG] official Ekeus [PER] heads for Baghdad [LOC]."

examples of ML in NLP: named entity recognition. Zhang and Johnson, A Robust Risk Minimization based Named Entity Recognition System, CoNLL 2003.

what goes on when we learn? The learning algorithm observes the examples in the training set. It tries to find common patterns that explain the data: it generalizes, so that we can make predictions for new examples. How this is done depends on what algorithm we are using.

knowledge from experience? given some experience, when can we be certain that we can draw any conclusion?

a fundamental tradeoff. Goodness of fit: the learned classifier should be able to correctly classify the examples in the training data. Regularization: the classifier should be simple.
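
As an illustration (not part of the lecture), many scikit-learn classifiers expose this tradeoff as a regularization parameter; for example, in LogisticRegression a smaller C means stronger regularization (the parameter values below are arbitrary):

from sklearn.linear_model import LogisticRegression

# Smaller C: stronger regularization, the model is kept simpler,
# at the cost of fitting the training data less closely.
strongly_regularized = LogisticRegression(C=0.01)
# Larger C: weaker regularization, better fit to the training data,
# but a higher risk of overfitting.
weakly_regularized = LogisticRegression(C=100.0)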

learning algorithms that we have seen so far. We have already seen a number of learning algorithms in previous courses: Naive Bayes; the perceptron; hidden Markov models; decision trees; transformation rules (the Brill tagger).

representation of the prediction function. We may represent our prediction function in different ways: numerical models (weight or probability tables, networked models); rules (decision trees, transformation rules).

example: the prediction function as numbers

def sentiment_is_positive(features):
    score = 0.0
    score += 2.1 * features["wonderful"]
    score += 0.6 * features["good"]
    ...
    score -= 0.9 * features["bad"]
    score -= 3.1 * features["awful"]
    ...
    if score > 0:
        return True
    else:
        return False

example: the prediction function as rules

def patient_is_sick(features):
    if features["systolic_blood_pressure"] > 140:
        return True
    if features["gender"] == "m":
        if features["psa"] > 4.0:
            return True
    ...
    return False

perceptron revisited. The perceptron learning algorithm creates a weight table; each weight in the table corresponds to a feature. E.g. in sentiment analysis, "fine" probably has a high positive weight, "boring" a negative weight, and "and" a weight near zero. Classification is carried out by summing the weights for each feature:

def perceptron_classify(features, weights):
    score = 0
    for f in features:
        score += weights.get(f, 0)
    if score >= 0:
        return "pos"
    else:
        return "neg"

the perceptron learning algorithm. Start with an empty weight table and classify according to the current weight table. Each time we misclassify, change the weight table a bit: if a positive instance was misclassified, add 1 to the weight of each feature in the document, and conversely...

def perceptron_learn(examples, number_iterations):
    weights = {}
    for iteration in range(number_iterations):
        for label, features in examples:
            guess = perceptron_classify(features, weights)
            if label == "pos" and guess == "neg":
                for f in features:
                    weights[f] = weights.get(f, 0) + 1
            elif label == "neg" and guess == "pos":
                for f in features:
                    weights[f] = weights.get(f, 0) - 1
    return weights
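
To see the two functions above in action, here is a small usage sketch on a made-up toy sentiment dataset (the example documents are invented for illustration):

# Each example is a (label, list_of_words) pair.
toy_examples = [
    ("pos", ["a", "wonderful", "movie"]),
    ("pos", ["good", "and", "fine"]),
    ("neg", ["a", "boring", "movie"]),
    ("neg", ["bad", "and", "awful"]),
]
weights = perceptron_learn(toy_examples, number_iterations=10)
print(perceptron_classify(["wonderful", "and", "fine"], weights))  # with this toy data: "pos"
print(perceptron_classify(["boring", "and", "bad"], weights))      # with this toy data: "neg"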

estimation in Naive Bayes, revisited. Naive Bayes:

P(document, label) = P(f1, ..., fn, label) = P(label) * P(f1, ..., fn | label) = P(label) * P(f1 | label) * ... * P(fn | label)

How do we estimate the probabilities? Maximum likelihood: set the probabilities so that the probability of the data is maximized.

estimation in Naive Bayes: supervised case. How do we estimate P(positive)?

P_MLE(positive) = count(positive) / count(all) = 2/4

How do we estimate P(nice | positive)?

P_MLE(nice | positive) = count(nice, positive) / count(any word, positive) = 2/7
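
A minimal sketch of what this maximum-likelihood estimation looks like in code, assuming labeled documents given as (label, list_of_words) pairs; the counts 2/4 and 2/7 above come from the slide's own small example corpus, which is not reproduced here:

from collections import Counter

def train_naive_bayes_mle(examples):
    # examples: list of (label, list_of_words) pairs
    label_counts = Counter()
    word_counts = {}          # label -> Counter of word occurrences
    total_words = Counter()   # label -> total number of word tokens
    for label, words in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        total_words[label] += len(words)
    # P_MLE(label) = count(label) / count(all documents)
    p_label = {l: label_counts[l] / sum(label_counts.values()) for l in label_counts}
    # P_MLE(word | label) = count(word, label) / count(any word, label)
    p_word_given_label = {
        l: {w: c / total_words[l] for w, c in word_counts[l].items()}
        for l in word_counts
    }
    return p_label, p_word_given_label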

machine learning software.
General-purpose software, large collections of algorithms:
- scikit-learn: http://scikit-learn.org (a Python library; will be used in this course)
- Weka: http://www.cs.waikato.ac.nz/ml/weka (a Java library with a nice user interface)
- NLTK includes some learning algorithms, but seems to be discontinuing them in favor of scikit-learn
Special-purpose software, small collections of algorithms:
- LibSVM/LibLinear for support vector machines
- CRF++, CRFSGD for conditional random fields
- Theano, Caffe, Keras for neural networks
- ...

evaluation methodology. How do we evaluate our systems? Intrinsic evaluation: test the performance in isolation. Extrinsic evaluation: I changed my POS tagger; how does this change the performance of my parser? how much more money do I make? Common measures in intrinsic evaluation: classification accuracy; precision and recall (for needle-in-a-haystack problems); also several other task-dependent measures.
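
A small sketch of how these intrinsic measures can be computed with scikit-learn's metrics module (the gold and predicted label lists here are invented for illustration):

from sklearn.metrics import accuracy_score, precision_score, recall_score

Y_true = ["pos", "neg", "pos", "pos", "neg"]   # gold labels
Y_pred = ["pos", "pos", "pos", "neg", "neg"]   # classifier output

print(accuracy_score(Y_true, Y_pred))                     # 0.6: 3 of 5 labels correct
print(precision_score(Y_true, Y_pred, pos_label="pos"))   # 2 of 3 predicted "pos" are correct
print(recall_score(Y_true, Y_pred, pos_label="pos"))      # 2 of 3 true "pos" are found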

overview: information about the course; machine learning basics; introduction to the scikit-learn library

detailed example:

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import Perceptron
from sklearn.pipeline import Pipeline

def train_scikit_classifier(X, Y):
    # A DictVectorizer maps a feature dict to a sparse vector,
    # e.g. vec.transform({'label':'np'}) might give
    # [0, 0, ..., 0, 1, 0, ...]
    vec = DictVectorizer()
    # Convert all the feature dicts to vectors.
    # As usual, it's more efficient to handle all at once.
    Xe = vec.fit_transform(X)
    # Initialize the learning algorithm we will use.
    classifier = Perceptron(n_iter=20)
    # Finally, we can train the classifier.
    classifier.fit(Xe, Y)
    # Return a pipeline consisting of the vectorizer followed
    # by the classifier.
    return Pipeline([('vec', vec), ('classifier', classifier)])

what are X and Y? Our feature extractor has collected features in the form of attributes with values, e.g. { 'label':'np', ... }. They are stored in the list X, and the corresponding true outputs are stored in the list Y.

the first step: mapping features to numerical vectors. scikit-learn's learning methods work with features as numbers, not strings, so they can't directly use the feature dicts we have stored in X. Converting from strings to numbers is the purpose of these lines:

vec = DictVectorizer()
Xe = vec.fit_transform(X)

types of vectorizers:
- a DictVectorizer converts from attribute-value dicts
- a CountVectorizer converts from texts (after applying a tokenizer) or lists of tokens
- a TfidfVectorizer is like a CountVectorizer, but also uses TF*IDF weighting

what goes on in a DictVectorizer? Each feature corresponds to one or more columns in the output matrix. Easy case: boolean and numerical features. For string features, we reserve one column for each possible value; that is, we convert to booleans.

code example (DictVectorizer). Here's an example:

from sklearn.feature_extraction import DictVectorizer

X = [{'f1':'np', 'f2':'in', 'f3':False, 'f4':7},
     {'f1':'np', 'f2':'on', 'f3':True, 'f4':2},
     {'f1':'vp', 'f2':'in', 'f3':False, 'f4':9}]
vec = DictVectorizer()
Xe = vec.fit_transform(X)
print(Xe.toarray())
print(vec.vocabulary_)

the result:

[[ 1.  0.  1.  0.  0.  7.]
 [ 1.  0.  0.  1.  1.  2.]
 [ 0.  1.  1.  0.  0.  9.]]
{'f4': 5, 'f2=in': 2, 'f1=np': 0, 'f1=vp': 1, 'f2=on': 3, 'f3': 4}

CountVectorizers for document representation. A CountVectorizer converts from documents; a document is a string or a list of tokens. Just like string features in a DictVectorizer, each word type will correspond to one column.

code example (CountVectorizer). Here's an example:

from sklearn.feature_extraction.text import CountVectorizer

X = ['example text', 'another text']
vec = CountVectorizer()
Xe = vec.fit_transform(X)
print(Xe.toarray())
print(vec.vocabulary_)

the result:

[[0 1 1]
 [1 0 1]]
{'text': 2, 'example': 1, 'another': 0}
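
The TfidfVectorizer mentioned earlier is used in exactly the same way; a minimal sketch (same toy strings as above, but the output values are TF*IDF-weighted rather than raw counts):

from sklearn.feature_extraction.text import TfidfVectorizer

X = ['example text', 'another text']
vec = TfidfVectorizer()
Xe = vec.fit_transform(X)
print(Xe.toarray())      # same shape as with CountVectorizer, but TF*IDF-weighted values
print(vec.vocabulary_)   # same vocabulary mapping as before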

a comment about the vectorizer methods: fit looks at the data and creates the mapping; transform converts the data to numbers; fit_transform = fit + transform.
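
In practice, this means we call fit_transform on the training data and then only transform on any new data, so that the same feature mapping is reused; a small sketch (the feature dicts are invented):

from sklearn.feature_extraction import DictVectorizer

X_train = [{'f1': 'np'}, {'f1': 'vp'}]
X_test = [{'f1': 'np'}]

vec = DictVectorizer()
Xe_train = vec.fit_transform(X_train)   # fit: build the mapping, then transform
Xe_test = vec.transform(X_test)         # transform only: reuse the existing mapping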

training a classifier. After mapping the features to numbers with our vectorizers, we can train a perceptron classifier:

classifier = Perceptron(n_iter=20)
classifier.fit(Xe, Y)

other classifiers (e.g. Naive Bayes) can be trained in a similar way:

classifier = MultinomialNB()
classifier.fit(Xe, Y)

applying the classifier to new examples:

X_new = ...  # extract the features for new examples
Xe_new = vectorizer.transform(X_new)
guesses = classifier.predict(Xe_new)

combining a vectorizer and a classifier into a pipeline:

vectorizer = ...
Xe = vectorizer.fit_transform(X)
classifier = ...
classifier.fit(Xe, Y)
pipeline = Pipeline([('vec', vectorizer), ('cls', classifier)])

X_new = ...  # extract the features for new examples
guesses = pipeline.predict(X_new)

simplified training of a pipeline. We can call fit to train the whole pipeline in one step:

pipeline = Pipeline([('vec', DictVectorizer()),
                     ('cls', Perceptron())])
pipeline.fit(X, Y)
...
guesses = pipeline.predict(X_new)

a note on efficiency. Python is a nice language for programmers, but not always the most efficient. In scikit-learn, many functions are implemented in faster languages (e.g. C) and use specialized math libraries, so in many cases it is much faster to call the library once than many times:

import time

t0 = time.time()
guesses1 = classifier.predict(X_eval)
t1 = time.time()
guesses2 = [classifier.predict(x) for x in X_eval]
t2 = time.time()
print(t1 - t0)
print(t2 - t1)

result: 0.29 sec and 45 sec

some other practical functions. Splitting the data:

from sklearn.cross_validation import train_test_split

train_files, dev_files = train_test_split(td_files,
                                          train_size=0.8,
                                          random_state=0)

evaluation, e.g. accuracy, precision, recall, F-score:

from sklearn.metrics import f1_score

print(f1_score(Y_eval, Y_out))

note that we're using our own evaluation in this assignment, since we need more details.

extended example 1: named entity classification. We are given a name (a single word) in a sentence; determine whether it is a person, a location, or an organization. "My aunt Gözde lives in Ashgabat." The information our classifier can use: the words in the sentence; the part-of-speech tags; the position of the name that we are classifying.
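
As a sketch of what feature extraction for this task could look like (the helper function and feature names below are hypothetical, not the ones prescribed in the assignment):

def extract_name_features(words, tags, position):
    # words: list of tokens in the sentence
    # tags: list of part-of-speech tags, aligned with words
    # position: index of the name we are classifying
    features = {
        'word': words[position],
        'tag': tags[position],
        'prev_word': words[position - 1] if position > 0 else '<s>',
        'next_word': words[position + 1] if position < len(words) - 1 else '</s>',
        'position': position,
    }
    return features

words = ['My', 'aunt', 'Gözde', 'lives', 'in', 'Ashgabat', '.']
tags = ['PRP$', 'NN', 'NNP', 'VBZ', 'IN', 'NNP', '.']
print(extract_name_features(words, tags, 2))   # features for "Gözde"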

extended example 2: document classification. We are given a document; determine its category (selected from a small set of predefined categories). We reuse the review dataset that we had in the previous course; this dataset has polarity and topic labels for each document.
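
Putting the pieces from this lecture together, a minimal sketch of such a document classifier (the example documents and labels below are invented placeholders; reading the actual review dataset is left out):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Perceptron
from sklearn.pipeline import Pipeline

# X: list of document strings, Y: list of category labels.
X = ['a wonderful and touching film', 'the camera broke after two days']
Y = ['film', 'camera']

pipeline = Pipeline([('vec', CountVectorizer()),
                     ('cls', Perceptron())])
pipeline.fit(X, Y)
print(pipeline.predict(['a boring film']))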