Predictive Analysis of Text: Concepts, Features, and Instances

Predictive Analysis of Text: Concepts, Features, and Instances. Jaime Arguello (jarguell@email.unc.edu), August 26, 2015

Predictive Analysis of Text. Objective: developing and evaluating computer programs that automatically detect a particular concept in natural language text

basic ingredients
1. Training data: a set of positive and negative examples of the concept we want to automatically recognize
2. Representation: a set of features that we believe are useful in recognizing the desired concept
3. Learning algorithm: a computer program that uses the training data to learn a predictive model of the concept

basic ingredients
4. Model: a function that describes a predictive relationship between feature values and the presence/absence of the concept
5. Test data: a set of previously unseen examples used to estimate the model's effectiveness
6. Performance metrics: a set of statistics used to measure the predictive effectiveness of the model

training and testing
[Diagram: training: labeled examples -> machine learning algorithm -> model; testing: new, unlabeled examples -> model -> predictions]

concept, instances, and features
Each row is an instance; the columns are features, plus the concept label:

color  size   # sides  equal sides  ...  label
red    big    3        no           ...  yes
green  big    3        yes          ...  yes
blue   small  inf      yes          ...  no
blue   small  4        yes          ...  no
...
red    big    3        yes          ...  yes

training and testing
Training: labeled examples -> machine learning algorithm -> model

color  size   sides  equal sides  ...  label
red    big    3      no           ...  yes
green  big    3      yes          ...  yes
blue   small  inf    yes          ...  no
blue   small  4      yes          ...  no
...
red    big    3      yes          ...  yes

Testing: new, unlabeled examples -> model -> predictions

color  size   sides  equal sides  ...  label
red    big    3      no           ...  ???
green  big    3      yes          ...  ???
blue   small  inf    yes          ...  ???
blue   small  4      yes          ...  ???
...
red    big    3      yes          ...  ???

The model fills in the ??? labels with its predictions (here: yes, yes, no, no, ..., yes).

questions
Is a particular concept appropriate for predictive analysis?
What should the unit of analysis be?
How should I divide the data into training and test sets?
What is a good feature representation for this task?
What type of learning algorithm should I use?
How should I evaluate my model's performance?

concepts
Learning algorithms can recognize some concepts better than others. What are some properties of concepts that are easier to recognize?

concepts
Option 1: can a human recognize the concept?
Option 2: can two or more humans recognize the concept independently, and do they agree?
Option 2 is better. In fact, models are sometimes evaluated as an independent assessor: how does the model's performance compare to the performance of one assessor with respect to another? One assessor produces the ground truth and the other produces the predictions.

measuring agreement: percent agreement
Percent agreement: the percentage of instances for which both assessors agree that the concept occurs or does not occur. Given a 2x2 contingency table of the two assessors' judgments (A = both say yes, D = both say no, B and C = the two kinds of disagreement):

           yes   no
yes         A     B
no          C     D

percent agreement = (A + D) / (A + B + C + D)

Example:
           yes   no
yes          5    5    10
no          15   75    90
            20   80

% agreement = (5 + 75) / 100 = 80%
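This computation is easy to express in code. A minimal Python sketch (the function and variable names are mine, not from the lecture):

```python
# Percent agreement from a 2x2 contingency table:
# a = both assessors say yes, d = both say no,
# b and c = the two kinds of disagreement.
def percent_agreement(a, b, c, d):
    return (a + d) / (a + b + c + d)

# The example table from the slide.
print(percent_agreement(5, 5, 15, 75))  # 0.8, i.e., 80%
```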

measuring agreement: percent agreement
Problem: percent agreement does not account for agreement due to random chance. How can we compute the expected agreement due to random chance?
Option 1: assume unbiased assessors
Option 2: assume biased assessors

kappa agreement: chance-corrected % agreement
Option 1: unbiased assessors. If each assessor independently says yes half the time and no half the time, the expected contingency table is:

           yes   no
yes         25   25    50
no          25   25    50
            50   50

random chance % agreement = (25 + 25) / 100 = 50%

kappa agreement: chance-corrected % agreement
Kappa agreement: percent agreement after correcting for the expected agreement due to random chance:

K = (P(a) - P(e)) / (1 - P(e))

P(a) = percent of observed agreement
P(e) = percent of agreement due to random chance

kappa agreement: chance-corrected % agreement
Kappa agreement: percent agreement after correcting for the expected agreement due to unbiased chance.

Observed:                 Expected (unbiased chance):
     yes   no                  yes   no
yes    5    5    10       yes   25   25    50
no    15   75    90       no    25   25    50
      20   80                   50   50

P(a) = (5 + 75) / 100 = 0.80
P(e) = (25 + 25) / 100 = 0.50
K = (P(a) - P(e)) / (1 - P(e)) = (0.80 - 0.50) / (1 - 0.50) = 0.60

kappa agreement: chance-corrected % agreement
Option 2: biased assessors. Use each assessor's observed marginal proportions to compute the expected agreement due to chance:

     yes   no
yes    5    5    10
no    15   75    90
      20   80

P(a) = (5 + 75) / 100 = 0.80
P(e) = (10/100 x 20/100) + (90/100 x 80/100) = 0.02 + 0.72 = 0.74
K = (P(a) - P(e)) / (1 - P(e)) = (0.80 - 0.74) / (1 - 0.74) = 0.23
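Both chance models can be computed from the observed table. The function below is my own sketch (the biased-chance version is what is usually called Cohen's kappa):

```python
# Chance-corrected agreement for a 2x2 table:
# a = both yes, b = assessor-1 yes / assessor-2 no,
# c = assessor-1 no / assessor-2 yes, d = both no.
def kappa(a, b, c, d, biased=True):
    n = a + b + c + d
    p_a = (a + d) / n  # observed agreement
    if biased:
        # Expected agreement from each assessor's observed marginals.
        p_e = ((a + b) / n) * ((a + c) / n) + ((c + d) / n) * ((b + d) / n)
    else:
        # Unbiased assessors: each says yes/no 50% of the time.
        p_e = 0.5
    return (p_a - p_e) / (1 - p_e)

print(round(kappa(5, 5, 15, 75, biased=False), 2))  # 0.6
print(round(kappa(5, 5, 15, 75, biased=True), 2))   # 0.23
```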

Predictive Analysis: data annotation process
INPUT: unlabeled data, annotators, coding manual
OUTPUT: labeled data
1. Using the latest coding manual, have all annotators label some previously unseen portion of the data (~10%)
2. Measure inter-annotator agreement (Kappa)
3. IF agreement < X, THEN: refine the coding manual, using the disagreements to resolve inconsistencies and clarify definitions, and return to 1. ELSE: have annotators label the remainder of the data independently and EXIT

data annotation process
What is good (Kappa) agreement? It depends on who you ask. According to Landis and Koch, 1977:
0.81-1.00: almost perfect
0.61-0.80: substantial
0.41-0.60: moderate
0.21-0.40: fair
0.00-0.20: slight
< 0.00: no agreement

questions
Is a particular concept appropriate for predictive analysis?
What should the unit of analysis be?
What is a good feature representation for this task?
How should I divide the data into training and test sets?
What type of learning algorithm should I use?
How should I evaluate my model's performance?

turning data into (training and test) instances
For many text-mining applications, turning the data into instances for training and testing is fairly straightforward. Easy case: instances are self-contained, independent units of analysis:
text classification: instances = documents
opinion mining: instances = product reviews
bias detection: instances = political blog posts
emotion detection: instances = support group posts

Text Classification: predicting health-related documents
Rows are instances; w_1 ... w_n are binary word-occurrence features; the concept label is the last column:

w_1  w_2  w_3  ...  w_n  label
1    1    0    ...  0    health
0    0    0    ...  0    other
0    0    0    ...  0    other
0    1    0    ...  1    other
...
1    0    0    ...  1    health

Opinion Mining: predicting positive/negative movie reviews

w_1  w_2  w_3  ...  w_n  label
1    1    0    ...  0    positive
0    0    0    ...  0    negative
0    0    0    ...  0    negative
0    1    0    ...  1    negative
...
1    0    0    ...  1    positive

Bias Detection: predicting liberal/conservative blog posts

w_1  w_2  w_3  ...  w_n  label
1    1    0    ...  0    liberal
0    0    0    ...  0    conservative
0    0    0    ...  0    conservative
0    1    0    ...  1    conservative
...
1    0    0    ...  1    liberal
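These word-occurrence representations map directly onto standard tooling. A minimal sketch with scikit-learn; the toy documents and labels are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Invented toy data: documents and their concept labels.
docs = ["flu symptoms and treatment", "stock market rallies",
        "new vaccine trial results", "team wins championship game"]
labels = ["health", "other", "health", "other"]

# Binary word-occurrence features (the w_1 ... w_n columns above).
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)

model = LogisticRegression().fit(X, labels)
print(model.predict(vectorizer.transform(["vaccine for flu symptoms"])))
# expected: ['health'] (on this toy data)
```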

turning data into (training and test) instances
A not-so-easy case: relational data. The concept to be learned is a relation between pairs of objects.

example of relational data: Brother(X,Y)
(example borrowed and modified from the Witten et al. textbook)

example of relational data: Brother(X,Y)

name_1  gender_1  mother_1  father_1  name_2  gender_2  mother_2  father_2  brother
steven  male      peggy     peter     graham  male      peggy     peter     yes
ian     male      grace     ray       brian   male      grace     ray       yes
anna    female    pam       ian       nikki   female    pam       ian       no
pippa   female    grace     ray       brian   male      grace     ray       no
steven  male      peggy     peter     brian   male      grace     ray       no
...
anna    female    pam       ian       brian   male      grace     ray       no

turning data into (training and test) instances
A not-so-easy case: relational data. Each instance should correspond to an object pair (which may or may not share the relation of interest). This may require features that characterize properties of the pair.

example of relational data: Brother(X,Y)
(same table as above; can we think of a better feature representation?)

example of relational data: Brother(X,Y)

gender_1  gender_2  same parents  brother
male      male      yes           yes
male      male      yes           yes
female    female    no            no
female    male      yes           no
male      male      no            no
...
female    male      no            no
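A sketch of how these pair features could be derived from per-person records (the record format and helper function are my own illustration, not from the lecture):

```python
# Hypothetical per-person records: name -> (gender, mother, father).
people = {
    "steven": ("male", "peggy", "peter"),
    "graham": ("male", "peggy", "peter"),
    "anna":   ("female", "pam", "ian"),
    "brian":  ("male", "grace", "ray"),
}

# Features that characterize the pair, as in the improved representation.
def pair_features(name1, name2):
    g1, m1, f1 = people[name1]
    g2, m2, f2 = people[name2]
    return {"gender_1": g1, "gender_2": g2,
            "same_parents": (m1 == m2) and (f1 == f2)}

print(pair_features("steven", "graham"))
# {'gender_1': 'male', 'gender_2': 'male', 'same_parents': True}
```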

turning data into (training and test) instances
There is still an issue that we're not capturing! Any ideas? Hint: in this case, should the predicted labels really be independent?

turning data into (training and test) instances
Independent per-pair predictions can be mutually inconsistent:
Brother(A,B) = yes
Brother(B,C) = yes
Brother(A,C) = no

turning data into (training and test) instances
In this case, what we would really want is:
a method that does joint prediction on the test set
a method whose joint predictions satisfy a set of known properties about the data as a whole (e.g., transitivity)

turning data into (training and test) instances
There are learning algorithms that incorporate relational constraints between predictions. However, they are beyond the scope of this class. We'll be covering algorithms that make independent predictions on instances. That said, many algorithms output prediction confidence values, and heuristics can be used to disfavor inconsistencies.
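As one illustration of such a heuristic (my own sketch, not from the lecture): given per-pair "brother" probabilities, flip the least confident prediction in any triple that violates transitivity:

```python
from itertools import combinations

# Hypothetical per-pair probabilities that X and Y are brothers.
probs = {("A", "B"): 0.9, ("B", "C"): 0.8, ("A", "C"): 0.4}
preds = {pair: p >= 0.5 for pair, p in probs.items()}

names = sorted({n for pair in probs for n in pair})
for x, y, z in combinations(names, 3):
    triple = [(x, y), (y, z), (x, z)]
    # Transitivity violated: exactly two 'yes' edges in the triple.
    if sum(preds[p] for p in triple) == 2:
        # Flip the prediction whose probability is closest to 0.5.
        least_confident = min(triple, key=lambda p: abs(probs[p] - 0.5))
        preds[least_confident] = not preds[least_confident]

print(preds)  # {('A', 'B'): True, ('B', 'C'): True, ('A', 'C'): True}
```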

turning data into (training and test) instances
Examples of relational data in text-mining:
information extraction: predicting that a word-sequence belongs to a particular class (e.g., person, location)
topic segmentation: segmenting discourse into topically coherent chunks

topic segmentation example
[Diagram: a discourse shown as a sequence of topic labels (A B A B ...), one per unit of text.]

topic segmentation example: instances
[Diagram: the same sequence; each boundary between adjacent units is an instance.]

topic segmentation example: independent instances?
[Diagram: the same sequence with predicted split points marked at topic changes; decisions about neighboring splits are not independent.]

questions
Is a particular concept appropriate for predictive analysis?
What should the unit of analysis be?
How should I divide the data into training and test sets?
What is a good feature representation for this task?
What type of learning algorithm should I use?
How should I evaluate my model's performance?

training and test data
We want our model to learn to recognize a concept. So, what does it mean to learn?

training and test data
The machine learning definition of learning: "A machine learns with respect to a particular task T, performance metric P, and experience E, if the system improves its performance P at task T following experience E." -- Tom Mitchell

training and test data
We want our model to improve its generalization performance! That is, its performance on previously unseen data! Generalize: "to derive or induce a general conception or principle from particulars." -- Merriam-Webster
In order to test generalization performance, the training and test data cannot be the same. Why?

Training data + Representation: what could possibly go wrong?

training and test data
While we don't want to test on training data, models usually perform the best when the training and test set are derived from the same probability distribution. What does that mean?

training and test data
[Diagram: a dataset of positive and negative instances partitioned into a training set and a test set.]

training and test data
Is this a good partitioning? Why or why not?

training and test data
[Diagram: the training and test sets each drawn as a random sample of the data.]

training and test data
On average, random sampling should produce comparable data for training and testing.

training and test data
Models usually perform the best when the training and test set have (see the sketch below):
a similar proportion of positive and negative examples
a similar co-occurrence of feature-values and each target class value
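With scikit-learn, a random split that preserves the class proportions is a one-liner. A minimal sketch (the toy X and y are invented for illustration):

```python
from sklearn.model_selection import train_test_split

# Invented toy data: feature vectors and labels.
X = [[0, 1], [1, 0], [1, 1], [0, 0], [1, 0], [0, 1], [1, 1], [0, 0]]
y = ["yes", "no", "yes", "no", "no", "yes", "no", "yes"]

# stratify=y keeps a similar proportion of positive and negative
# examples in both the training and the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

print(y_train.count("yes") / len(y_train))  # 0.5
print(y_test.count("yes") / len(y_test))    # 0.5
```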

training and test data
Caution: in some situations, partitioning the data randomly might inflate performance in an unrealistic way! How the data is split into training and test sets determines what we can claim about generalization performance. The appropriate split between training and test sets is usually determined on a case-by-case basis.

discussion
Spam detection: should the training and test sets contain email messages from the same sender, same recipient, and/or same timeframe?
Topic segmentation: should the training and test sets contain potential boundaries from the same discourse?
Opinion mining for movie reviews: should the training and test sets contain reviews for the same movie?
Sentiment analysis: should the training and test sets contain blog posts from the same discussion thread?

questions
Is a particular concept appropriate for predictive analysis?
What should the unit of analysis be?
How should I divide the data into training and test sets?
What type of learning algorithm should I use?
What is a good feature representation for this task?
How should I evaluate my model's performance?

three types of classifiers
Linear classifiers
Decision tree classifiers
Instance-based classifiers

three types of classifiers
All types of classifiers learn to make predictions based on the input feature values. However, different types of classifiers combine the input feature values in different ways. Chapter 3 in the book refers to a trained model as a "knowledge representation".

linear classifiers: perceptron algorithm

y = 1 if w_0 + sum_{j=1}^{n} w_j x_j > 0, and y = 0 otherwise

The weights w_0, w_1, ..., w_n are parameters learned by the model; y is the predicted value (e.g., 1 = positive, 0 = negative).

linear classifiers: perceptron algorithm

test instance:  f_1 = 0.5, f_2 = 1.0, f_3 = 0.2
model weights:  w_0 = 2.0, w_1 = -5.0, w_2 = 2.0, w_3 = 1.0

output = 2.0 + (0.5 x -5.0) + (1.0 x 2.0) + (0.2 x 1.0) = 1.7
output > 0, so prediction = positive
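The same arithmetic as a short Python sketch, reproducing the slide's numbers:

```python
# Weights and features from the slide; w[0] is the bias term w_0.
w = [2.0, -5.0, 2.0, 1.0]
x = [0.5, 1.0, 0.2]

output = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
print(round(output, 2))                          # 1.7
print("positive" if output > 0 else "negative")  # positive
```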

linear classifiers: perceptron algorithm
[Figure: two-feature example borrowed from the Witten et al. textbook.]

linear classifiers: perceptron algorithm
[Figure: separating hyperplanes. Source: http://en.wikipedia.org/wiki/File:Svm_separating_hyperplanes.png]

linear classifiers: perceptron algorithm
[Plot: positive (black) and negative (white) points on axes x1 and x2, ranging from 0 to 1.0.]
Would a linear classifier do well on positive (black) and negative (white) data that looks like this?

three types of classifiers
Linear classifiers
Decision tree classifiers
Instance-based classifiers

example of decision tree classifier: Brother(X,Y)

same parents?
  no  -> predict no
  yes -> gender_1?
           female -> predict no
           male   -> gender_2?
                       female -> predict no
                       male   -> predict yes
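The same tree written out as nested conditionals (a sketch; the feature names follow the slide):

```python
# The Brother(X,Y) decision tree as nested conditionals.
def brother(gender_1, gender_2, same_parents):
    if not same_parents:
        return "no"
    if gender_1 != "male":
        return "no"
    if gender_2 != "male":
        return "no"
    return "yes"

print(brother("male", "male", True))    # yes
print(brother("female", "male", True))  # no
```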

decision tree classifiers
[Plot: labeled training points on axes x1 and x2, ranging from 0 to 1.0.]
Draw a decision tree that would perform perfectly on this training data!

three types of classifiers
Linear classifiers
Decision tree classifiers
Instance-based classifiers

instance-based classifiers
[Plots: a test point marked "?" among labeled training points on axes x1 and x2.]
Predict the class associated with the most similar training examples.

instance-based classifiers
Assumption: instances with similar feature values should have a similar label. Given a test instance, predict the label associated with its nearest neighbors. There are many different similarity metrics for computing distance between training/test instances. There are many ways of combining labels from multiple training instances.
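A minimal nearest-neighbor sketch using Euclidean distance and a majority vote over the k closest training instances (the toy data is invented for illustration):

```python
import math
from collections import Counter

# Invented toy training data: (feature vector, label).
train = [((0.2, 0.3), "white"), ((0.3, 0.2), "white"),
         ((0.7, 0.8), "black"), ((0.8, 0.7), "black")]

# Predict the majority label among the k nearest training instances.
def knn_predict(x, train, k=3):
    by_distance = sorted(train, key=lambda ex: math.dist(x, ex[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

print(knn_predict((0.75, 0.75), train))  # black
```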

questions
Is a particular concept appropriate for predictive analysis?
What should the unit of analysis be?
How should I divide the data into training and test sets?
What is a good feature representation for this task?
What type of learning algorithm should I use?
How should I evaluate my model's performance?