Introduction to Classification


Classification: Definition
- Given a collection of examples (the training set), where each example is represented by a set of features, sometimes called attributes, and each example is to be given a label or class.
- Find a model for the label as a function of the values of the features.
- Goal: previously unseen examples should be assigned a label as accurately as possible.
- A test set is used to determine the accuracy of the model. Usually the given data set is divided into a training set and a test set, with the training set used to build the model and the test set used to validate it.
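The train/test division described above can be sketched in a few lines of plain Python. This is a minimal illustration (the function name and the toy data are mine, not from the slides):

```python
import random

def train_test_split(examples, test_fraction=1/3, seed=0):
    """Randomly partition a data set into a training set and a test set."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)      # fixed seed for repeatability
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)

data = list(range(30))
train, test = train_test_split(data)
print(len(train), len(test))  # 20 10
```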

Supervised vs. Unsupervised Learning
- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation.
  - New data is classified based on the training set.
- Unsupervised learning (includes clustering)
  - The class labels of the training data are unknown.
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.

Illustrating Classification Task
[Figure: diagram of the classification task workflow.]

Examples of Classification Tasks
- Predicting tumor cells as benign or malignant
- Classifying credit card transactions as legitimate or fraudulent
- Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
- Categorizing news stories as finance, weather, entertainment, sports, etc.

Classification Techniques
There are a number of different classification techniques for building a model:
- Decision Tree based methods
- Rule-based methods
- Memory based reasoning, instance-based learning
- Neural Networks
- Genetic Algorithms
- Naïve Bayes and Bayesian Belief Networks
- Support Vector Machines
In this introduction, we illustrate classification tasks using Decision Tree methods. Features can have numeric values (continuous) or a finite set of values (categorical), including boolean true/false.

Example of a Decision Tree
[Figure: training data alongside the model, a decision tree with root node Refund (Yes / No), internal node MarSt (Married vs. Single, Divorced), and leaf test TaxInc (< 80K vs. > 80K, the latter labeled YES).]
Example task: given the marital status, refund status, and taxable income of a person, label them as to whether they will cheat on their income tax.

Another Example of Decision Tree
[Figure: an alternative tree that tests MarSt (Married vs. Single, Divorced) at the root, then Refund (Yes / No), then TaxInc (< 80K vs. > 80K, the latter labeled YES).]
There could be more than one tree that fits the same data!

Decision Tree Classification Task
[Figure: the classification workflow with a decision tree as the induced model.]

Apply Model to Test Data
Start from the root of the tree and, at each node, follow the branch that matches the test record.
[Figure, animated over six slides: the test record is routed down the tree, Refund (Yes / No), then MarSt (Married vs. Single, Divorced), then TaxInc (< 80K vs. > 80K → YES), until it reaches a leaf. For this test record the walk ends at a leaf and we assign Cheat to No.]
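Walking the tree above can be written as a small function. This is a hedged reconstruction: the slides only confirm the YES leaf on the > 80K income branch and the final "Assign Cheat to No"; the labels on the remaining leaves are assumed to be "No", consistently with that walkthrough.

```python
def classify(record):
    """Walk the slide's decision tree for one test record.
    Only the > 80K leaf (YES) is confirmed by the slides; the other
    leaves are assumed to predict No, matching the walkthrough."""
    if record["Refund"] == "Yes":
        return "No"
    if record["MarSt"] == "Married":
        return "No"
    # Single or Divorced: test taxable income (in thousands)
    return "Yes" if record["TaxInc"] > 80 else "No"

# A married, no-refund test record ends at the Married leaf:
print(classify({"Refund": "No", "MarSt": "Married", "TaxInc": 80}))  # No
```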

Evaluating Classification Methods
- Accuracy
  - Classifier accuracy: predicting the class label
  - Predictor accuracy: guessing the value of a predicted attribute
- Speed
  - Time to construct the model (training time)
  - Time to use the model (classification/prediction time)
- Robustness: handling noise and missing values
- Scalability: efficiency in disk-resident databases
- Interpretability: understanding and insight provided by the model
- Other measures, e.g. goodness of rules, such as decision tree size or compactness of the classification model

Metrics for Performance Evaluation
Focus on the predictive capability of a model, rather than on how fast it can classify or build models, its scalability, etc.
Confusion matrix for a binary classifier (two labels):

                          PREDICTED CLASS
                          Class=Yes   Class=No
  ACTUAL     Class=Yes        a           b
  CLASS      Class=No         c           d

a: TP (true positive)   b: FN (false negative)
c: FP (false positive)  d: TN (true negative)
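The four cells of the matrix can be tallied directly from a list of actual and predicted labels. A minimal sketch (the toy label lists are made up for illustration):

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Count (TP, FN, FP, TN) for a binary classifier."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive:
                tp += 1          # actual Yes, predicted Yes
            else:
                fn += 1          # actual Yes, predicted No
        else:
            if p == positive:
                fp += 1          # actual No, predicted Yes
            else:
                tn += 1          # actual No, predicted No
    return tp, fn, fp, tn

actual    = ["Yes", "Yes", "No", "No",  "Yes", "No"]
predicted = ["Yes", "No",  "No", "Yes", "Yes", "No"]
print(confusion_counts(actual, predicted))  # (2, 1, 1, 2)
```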

Classifier Accuracy Measures
Accuracy is a widely used metric: the accuracy of a classifier M is the percentage of the test set that is correctly classified by the model M.

             Yes - C1            No - C2
  Yes - C1   a: True positive    b: False negative
  No - C2    c: False positive   d: True negative

  classes              buy_computer = yes   buy_computer = no   total
  buy_computer = yes          6954                  46           7000
  buy_computer = no            412                2588           3000
  total                       7366                2634          10000

Other Classifier Measures
Alternative accuracy measures (e.g., for cancer diagnosis or information retrieval):
- sensitivity = t-pos / pos            /* true positive recognition rate */
- specificity = t-neg / neg            /* true negative recognition rate */
- precision   = t-pos / (t-pos + f-pos)
- recall      = t-pos / (t-pos + f-neg)
- accuracy    = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)
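Plugging in the counts from the buy_computer table on the accuracy slide gives concrete values for each measure:

```python
# Counts from the buy_computer confusion matrix.
t_pos, f_neg = 6954, 46     # actual "yes" row: pos = 7000
f_pos, t_neg = 412, 2588    # actual "no" row:  neg = 3000
pos, neg = t_pos + f_neg, f_pos + t_neg

sensitivity = t_pos / pos                     # true positive recognition rate
specificity = t_neg / neg                     # true negative recognition rate
precision   = t_pos / (t_pos + f_pos)
recall      = t_pos / (t_pos + f_neg)         # same as sensitivity
accuracy    = (sensitivity * pos / (pos + neg)
               + specificity * neg / (pos + neg))

print(round(sensitivity, 4), round(specificity, 4),
      round(precision, 4), round(accuracy, 4))
# 0.9934 0.8627 0.9441 0.9542
```

Note that the accuracy formula reduces to (t-pos + t-neg) / (pos + neg), i.e. (6954 + 2588) / 10000 = 0.9542 here.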

Multi-Label Classification
- Most classification algorithms solve binary classification tasks, while many tasks are naturally multi-label, i.e. there are more than two labels.
- Multi-label problems are solved by training a number of binary classifiers and combining them to get a multi-label result.
- The confusion matrix is extended to the multi-label case.
- The accuracy definition is naturally extended to the multi-label case.

Issues with Uneven Classes
Consider a 2-class problem with labels Class 0 and Class 1:
- Number of Class 0 examples = 9990
- Number of Class 1 examples = 10
If the model predicts everything to be Class 0, accuracy is 9990/10000 = 99.9%. This accuracy is misleading because the model does not detect a single Class 1 example.
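The misleading 99.9% figure from the slide can be reproduced directly:

```python
n_class0, n_class1 = 9990, 10

# A trivial model that predicts Class 0 for everything gets every
# Class 0 example right and every Class 1 example wrong.
correct = n_class0
total = n_class0 + n_class1
accuracy = correct / total
print(accuracy)  # 0.999 -- yet not a single Class 1 example is detected
```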

Evaluating the Accuracy of a Classifier
- Holdout method
  - The given data is randomly partitioned into two independent sets: a training set (e.g., 2/3) for model construction and a test set (e.g., 1/3) for accuracy estimation.
- Random sampling: a variation of holdout
  - Repeat holdout k times; accuracy = average of the accuracies obtained.
- Cross-validation (k-fold, where k = 10 is most popular)
  - Randomly partition the data into k mutually exclusive subsets D_1, ..., D_k, each of approximately equal size.
  - At the i-th iteration, use D_i as the test set and the others as the training set.
  - Leave-one-out: k folds where k = the number of tuples, for small data sets.
  - Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as in the initial data.
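The k-fold partition can be sketched as follows. For clarity this sketch assigns indices to folds round-robin rather than randomly, as the slide prescribes; shuffling the indices first would restore the random partition:

```python
def k_fold_splits(n_examples, k=10):
    """Partition indices 0..n-1 into k mutually exclusive folds of
    approximately equal size; yield (train, test) index lists."""
    folds = [list(range(i, n_examples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i
                 for idx in fold]
        yield train, test

for train, test in k_fold_splits(10, k=5):
    print(len(train), len(test))  # 8 2, five times
```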

Evaluating the Model - Learning Curve
- A learning curve shows how accuracy changes with varying sample size.
- Requires a sampling schedule for creating the learning curve:
  - Arithmetic sampling (Langley et al.)
  - Geometric sampling (Provost et al.)
- Effect of small sample size:
  - Bias in the estimate
  - Variance of the estimate

Classifier Performance: Feature Selection
Overly long training or testing times are a performance issue for classification problems with large numbers of attributes, or with nominal attributes that have large numbers of values.
Feature selection techniques aim to:
- reduce the number of features by finding a smaller or minimal set that can accurately classify the problem
- reduce training and prediction time by eliminating noisy or redundant features
Two main types of techniques:
- Filtering methods apply a statistical or other information measure to the attribute values without running any training and testing.
- Wrapper methods try different combinations of attributes, run cross-validation evaluations, and compare the results.

Filtering Feature Selection
Different measures have been applied to feature sets in order to identify features (attributes) that contribute redundant or insignificant information to the problem. Examples of measures:
- Distance measure (also called divergence or discrimination): measures the separability of the features using conditional probability.
- Information measure: uses the entropy formula and class label frequencies to calculate how much information each feature subset carries about the class labels.
- Dependency measure: measures how strongly a feature subset is associated with a class label.
- Consistency measure: uses the number of inconsistent cases, where the same or similar feature values result in different class labels.
Filtering is the only technique possible if the number of features is too large for wrapper methods.
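The information measure can be made concrete with the entropy formula. A minimal sketch of information gain, i.e. how much knowing a feature's value reduces class-label entropy (the toy data is made up for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class-label distribution, in bits."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the entropy remaining once the
    examples are split by the feature's value."""
    total = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

labels  = ["yes", "yes", "no", "no"]
feature = ["a", "a", "b", "b"]       # perfectly predicts the label
print(information_gain(feature, labels))  # 1.0
```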

Wrapper Feature Selection
Wrapper methods search for a subset of the features by a process of choosing subsets and testing them with cross-validation on the training data. These methods are more time consuming than filtering methods but can give better results for feature selection.
- Forward search: start with one feature in the feature subset and keep adding features, selecting the ones that give the best performance, until no performance improvement is achieved.
- Backward search: start with all the features in the feature subset and keep removing features, selecting the ones whose absence does not degrade performance, until no more can be removed without degradation.
- Other searches are possible: bidirectional, genetic algorithm.
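Forward search can be sketched as a greedy loop. In a real wrapper, `score` would run a cross-validated evaluation of a classifier on the chosen subset; here it is an arbitrary callable, and the toy scores are made up:

```python
def forward_search(features, score):
    """Greedy forward selection: repeatedly add the single feature that
    most improves score(subset); stop when no addition helps."""
    selected, best_score = [], float("-inf")
    remaining = list(features)
    while remaining:
        candidate = max(remaining, key=lambda f: score(selected + [f]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best_score:
            break                      # no improvement: stop searching
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected

# Toy score: features "a" and "c" each add value, "b" only adds noise.
value = {"a": 2.0, "b": -0.5, "c": 1.0}
print(forward_search(["a", "b", "c"],
                     lambda subset: sum(value[f] for f in subset)))
# ['a', 'c']
```

Backward search is the mirror image: start from the full set and greedily remove the feature whose absence degrades the score least.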

Classifier Performance: Ensemble Methods
- Construct a set of classifiers from the training data.
- Predict the class label of previously unseen records by aggregating the predictions made by multiple classifiers.
Examples of ensemble methods:
- Bagging
- Boosting
- Heterogeneous classifiers trained on different feature subsets, sometimes called a mixture of experts
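The aggregation step common to these methods can be a simple majority vote over the member classifiers' predictions. A minimal sketch with made-up toy classifiers:

```python
from collections import Counter

def ensemble_predict(classifiers, record):
    """Aggregate several classifiers' predictions by majority vote."""
    votes = [clf(record) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Three toy threshold classifiers that disagree on some inputs:
clf1 = lambda x: "Yes" if x > 3 else "No"
clf2 = lambda x: "Yes" if x > 5 else "No"
clf3 = lambda x: "Yes" if x > 4 else "No"

print(ensemble_predict([clf1, clf2, clf3], 4.5))  # Yes (two of three vote Yes)
```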

General Idea
[Figure: multiple classifiers are built from the training data and their predictions are combined into a single ensemble prediction.]

Examples of Classification Problems
Some NLP problems are widely investigated as supervised classification problems, using a variety of problem instances:
- Text categorization: assigning topic labels to documents
- Word Sense Disambiguation: assigning a sense to a word, as it occurs in a document
- Semantic Role Labeling: assigning semantic roles to phrases in a sentence
From the NLTK book, chapter 6:
- Classifying first names according to gender
- Document classification (text categorization)
- Part-of-speech tagging
- Sentence segmentation
- Identifying dialog act types
- Recognizing textual entailment

Text Categorization
Represent each document by the words/tokens/terms it contains, sometimes called unigrams or a bag-of-words. To identify terms from the document text:
- Remove symbols with little meaning.
- Remove words with little meaning, the stop words.
- Stem the meaningful words: remove endings to get the root of the word. From enchanted, enchants, enchantment, and enchanting, get the root word enchant.
- Group words together into phrases (optional): proper names or other words that are likely to have a different meaning as a phrase than as individual words.
- After grouping, you may also want to lowercase the terms.
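The pipeline above can be sketched in a few lines. The stop-word list is a tiny illustrative sample, and the suffix-stripping loop is a toy stand-in for a real stemmer (such as the Porter stemmer), good enough to reproduce the enchant example:

```python
import re

STOP_WORDS = {"the", "a", "and", "not", "to", "of"}  # tiny sample list

def terms(text):
    """Lowercase, strip symbols, drop stop words, and crudely stem."""
    tokens = re.findall(r"[a-z]+", text.lower())   # symbols removed here
    result = []
    for tok in tokens:
        if tok in STOP_WORDS:
            continue
        for suffix in ("ments", "ment", "ings", "ing", "ed", "s"):
            if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
                tok = tok[: -len(suffix)]          # keep the root
                break
        result.append(tok)
    return result

print(terms("enchanted, enchants, enchantment, enchanting"))
# ['enchant', 'enchant', 'enchant', 'enchant']
```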

Document Features
Use a feature vector to represent all the words in a document: one position for each word in the collection, holding the weight (often the frequency) of that word.
  "Water, water everywhere, and not a drop to drink!"  ->  ( 2, 1, 1, 1, 1, 0, ... )
  Another document with the word drink: "drink"        ->  ( 0, 0, 0, 0, 1, 0, ... )
(shown with frequency weights)
Feature vectors may have thousands of words and are often restricted by a threshold frequency of 5 or more.
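Building such frequency-weighted vectors is a direct lookup over a fixed vocabulary. This sketch reproduces the slide's two vectors, assuming the stop words ("and", "not", "a", "to") have already been removed from the first document:

```python
from collections import Counter

def feature_vector(doc_terms, vocabulary):
    """Frequency-weighted feature vector over a fixed vocabulary."""
    counts = Counter(doc_terms)
    return [counts[term] for term in vocabulary]

doc1 = ["water", "water", "everywhere", "drop", "drink"]  # stop words removed
doc2 = ["drink"]
vocabulary = ["water", "everywhere", "drop", "drink", "..."]

print(feature_vector(doc1, vocabulary))  # [2, 1, 1, 1, 0]
print(feature_vector(doc2, vocabulary))  # [0, 0, 0, 1, 0]
```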

Weka demonstration to observe feature vectors