A Case Study of Semi-supervised Classification Methods for Imbalanced Data Set Situation


11742 IR-Lab Project, Fall 2004
Yanjun Qi

Road Map
- Introduction to semi-supervised learning
- Three semi-supervised classifiers we compared
- Experiments and results

Introduction
Learning: supervised (classification, regression, etc.) vs. unsupervised (clustering, etc.)

Usage                    | {(x,y)} labeled data | {x} unlabeled data
Supervised learning      | Yes                  | No
Unsupervised learning    | No                   | Yes

But in some applications:
- Labeled data are often hard to obtain
  - Text categorization: manual labeling by subjects is time-consuming
  - Protein structure, protein interaction: laborious and expensive experimental effort, etc.
- Unlabeled data are often easy to obtain, and plentiful

Usage                    | {(x,y)} labeled data | {x} unlabeled data
Supervised learning      | Yes                  | No
Semi-supervised learning | Yes                  | Yes
Unsupervised learning    | No                   | Yes

A Brief Review of Semi-supervised Learning
- Semi-supervised classification
  - Training also exploits additional unlabeled data
  - Aims to produce a more accurate classification function
- Semi-supervised clustering
  - In recent years, some researchers have successfully used label-style constraints to help unsupervised clustering
  - Label-style constraints: e.g., must-link or cannot-link

Representative Methods of Semi-supervised Classification
- Generative models
- Large-margin based methods
- Graph-based methods
- Co-training

Generative Models
- Unlabeled data tell us about P(x); classification needs P(y | x)
- Generative models for the joint probability:
  - Gaussian [David Miller 96, Castelli & Cover 95, etc.]
  - Multinomial [Nigam 98, 00]
- Use EM to combine a small labeled set with a large unlabeled set
- Given a joint model P(x, y | theta), unlabeled examples can also be used to estimate the parameter theta, for instance by maximizing the joint likelihood (written out just below)
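
To make the "joint likelihood" concrete, the objective EM climbs can be written as follows (standard notation, not taken from the slides), with D_L the labeled set and D_U the unlabeled set:

L(\theta) = \sum_{(x_i, y_i) \in D_L} \log p(x_i, y_i \mid \theta) + \sum_{x_j \in D_U} \log p(x_j \mid \theta), \qquad p(x \mid \theta) = \sum_{y} p(x, y \mid \theta)

The unlabeled examples enter only through the marginal p(x | theta), which is why they can still move the parameter estimate.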

Large Margin Separation
- Maximize the classification margin on both labeled and unlabeled data, while classifying the labeled data as correctly as possible
- Some existing work:
  - Joachims 99: transductive SVM
  - Kristin 2002: boosting decision trees
  - Jaakkola 1999: maximum entropy
  - etc.

Graph-Based Methods
- Generally based on the assumption that similar examples should be given the same classification
- Place the data points on a graph based on the distance relationships between examples
- Then use the known labels to perform some type of graph partitioning
- Examples:
  - Markov random walk [Szummer and Jaakkola 2000]
  - Graph mincut [Blum 2001, 2004]
  - Gaussian random fields [Zhu 2003, 2004]
  - Tree structure [Griffiths 2003]

Co-Training
- The available features are redundant enough that we can train two classifiers on different feature views
- Unlabeled data reduce the hypothesis space by forcing h1 and h2 to agree
- The two classifiers should at least agree on the classification of each unlabeled example
- Some existing work: Avrim Blum and Tom Mitchell 1998; F. Denis et al. 2003
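
For illustration only (not from the slides): a minimal co-training sketch in Python, assuming the features can be split into two views X1 and X2 and using two Gaussian naive Bayes classifiers; the classifier choice, confidence-based selection, and round counts are all hypothetical simplifications of Blum and Mitchell's procedure.

import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_training(X1_l, X2_l, y_l, X1_u, X2_u, rounds=10, per_round=5):
    # Minimal co-training sketch: two feature views (X1, X2),
    # labeled (_l) and unlabeled (_u) data.
    X1_l, X2_l, y_l = X1_l.copy(), X2_l.copy(), y_l.copy()
    unlabeled = np.arange(len(X1_u))
    h1, h2 = GaussianNB(), GaussianNB()
    for _ in range(rounds):
        h1.fit(X1_l, y_l)
        h2.fit(X2_l, y_l)
        # Each classifier labels the unlabeled examples it is most confident about;
        # those examples (with guessed labels) join the shared training set,
        # which is what pushes h1 and h2 toward agreement.
        for h, X_view in ((h1, X1_u), (h2, X2_u)):
            if len(unlabeled) == 0:
                break
            proba = h.predict_proba(X_view[unlabeled])
            pick = unlabeled[np.argsort(-proba.max(axis=1))[:per_round]]
            y_new = h.predict(X_view[pick])
            X1_l = np.vstack([X1_l, X1_u[pick]])
            X2_l = np.vstack([X2_l, X2_u[pick]])
            y_l = np.concatenate([y_l, y_new])
            unlabeled = np.setdiff1d(unlabeled, pick)
    h1.fit(X1_l, y_l)
    h2.fit(X2_l, y_l)
    return h1, h2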

Three Methods We Compared
- Generative models: mixture of Gaussians
- Large-margin based methods: transductive SVM
- Graph-based methods: semi-supervised learning using Gaussian random fields
- (Co-training was not included: it was not clear how to split the features into two views)

(1) Mixture Gaussian - EM
- David Miller & Hasan Uyar, NIPS 1996
- Maximizes the total data likelihood, i.e. over both the labeled and the unlabeled data
- EM is used to perform the iterative maximization
- The generalized mixture (GM) model:
  - Assumes the class posterior for each mixture component is independent of the feature value
  - Each component is modeled by a Gaussian

(1) Mixture Gaussian - EM

(1) Mixture Gaussian - EM
The learning process:
- E step: calculate each data point's component posterior probability
- M step: update each component's mean and variance parameters; update the mixture weight parameters; update the class-given-component probabilities
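
A rough numpy sketch, for illustration only, of the kind of EM update described above, assuming a Miller/Uyar-style model p(x, y) = sum_m alpha[m] * N(x; mu[m], Sigma[m]) * beta[m, y]; all function and variable names are mine, and the initialization is a hypothetical choice.

import numpy as np
from scipy.stats import multivariate_normal

def semisup_gmm_em(X_l, y_l, X_u, n_comp=4, n_iter=50, reg=1e-4, seed=0):
    # Semi-supervised Gaussian mixture trained by EM on labeled + unlabeled data.
    rng = np.random.default_rng(seed)
    X = np.vstack([X_l, X_u])
    n, d = X.shape
    n_l = len(X_l)
    classes = np.unique(y_l)
    alpha = np.full(n_comp, 1.0 / n_comp)                 # mixture weights
    mu = X[rng.choice(n, n_comp, replace=False)]          # random init of means
    Sigma = np.array([np.cov(X.T) + reg * np.eye(d) for _ in range(n_comp)])
    beta = np.full((n_comp, len(classes)), 1.0 / len(classes))   # P(y | component)

    for _ in range(n_iter):
        # E step: posterior probability of each component for each data point.
        dens = np.column_stack([multivariate_normal.pdf(X, mean=mu[m], cov=Sigma[m])
                                for m in range(n_comp)])
        resp = dens * alpha
        for c_idx, c in enumerate(classes):
            # Labeled points also condition on their observed class label.
            resp[:n_l][y_l == c] *= beta[:, c_idx]
        resp /= resp.sum(axis=1, keepdims=True)

        # M step: weights, means, covariances from all data;
        # class-given-component probabilities from the labeled responsibilities only.
        Nm = resp.sum(axis=0)
        alpha = Nm / n
        mu = (resp.T @ X) / Nm[:, None]
        for m in range(n_comp):
            diff = X - mu[m]
            Sigma[m] = (resp[:, m, None] * diff).T @ diff / Nm[m] + reg * np.eye(d)
        beta = np.vstack([resp[:n_l][y_l == c].sum(axis=0) for c in classes]).T + 1e-6
        beta /= beta.sum(axis=1, keepdims=True)
    return alpha, mu, Sigma, beta, classes

A new point x would then be classified by p(y | x) proportional to sum_m alpha[m] * N(x; mu[m], Sigma[m]) * beta[m, y].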

(2) Transductive SVM
- Intuition: assume decision boundaries lie in low-density regions of the feature space
- Unlabeled examples help to find these low-density areas
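
Joachims' TSVM is not in scikit-learn; as a loose illustration only, the sketch below alternates between assigning labels to the unlabeled points and retraining a linear SVC while slowly increasing the weight of the unlabeled points. This is a simplified stand-in for the spirit of Joachims 99, not that algorithm (in particular it omits the label-switching step and the class-balance constraint); the weight schedule and names are hypothetical.

import numpy as np
from sklearn.svm import SVC

def tsvm_like(X_l, y_l, X_u, c_unlab_schedule=(0.01, 0.1, 0.5, 1.0), n_inner=5):
    # Rough TSVM-style sketch: self-labeled SVM with an annealed unlabeled weight.
    clf = SVC(kernel="linear", C=1.0)
    clf.fit(X_l, y_l)
    y_u = clf.predict(X_u)                      # initial guess for the unlabeled labels
    X_all = np.vstack([X_l, X_u])
    for c_u in c_unlab_schedule:                # slowly increase unlabeled influence
        for _ in range(n_inner):
            w = np.concatenate([np.ones(len(X_l)), np.full(len(X_u), c_u)])
            clf.fit(X_all, np.concatenate([y_l, y_u]), sample_weight=w)
            y_new = clf.predict(X_u)
            if np.array_equal(y_new, y_u):      # labels stopped changing at this weight
                break
            y_u = y_new
    return clf, y_u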

(3) Semi-supervised Learning Using Gaussian Random Fields
- X. Zhu et al., "Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions", ICML 2003
- Can be viewed as a form of nearest-neighbor approach, where the nearest labeled examples are computed in terms of a random walk on the graph

(3) Semi-supervised Learning Using Gaussian Random Fields
- Labeled and unlabeled data are represented as vertices in a weighted graph
- Edge weights encode the similarity between instances
- Labels are propagated from the labeled nodes to the unlabeled nodes over the graph
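
As an illustration (not the authors' code): the harmonic-function solution can be written in a few lines of numpy. The closed form f_u = (D_uu - W_uu)^{-1} W_ul f_l follows Zhu et al.'s formulation, but the RBF kernel width and the fully dense graph here are hypothetical simplifications; binary 0/1 labels are assumed.

import numpy as np

def harmonic_label_propagation(X_l, y_l, X_u, sigma=1.0):
    # Harmonic-function solution on a dense RBF similarity graph.
    X = np.vstack([X_l, X_u])
    n_l = len(X_l)
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq_dists / (2 * sigma ** 2))     # edge weights = similarity
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))
    L = D - W                                    # graph Laplacian
    # Split into labeled/unlabeled blocks and solve L_uu f_u = W_ul f_l.
    L_uu = L[n_l:, n_l:]
    W_ul = W[n_l:, :n_l]
    f_l = np.asarray(y_l, dtype=float)
    f_u = np.linalg.solve(L_uu, W_ul @ f_l)      # harmonic values in [0, 1]
    return (f_u > 0.5).astype(int), f_u

If an off-the-shelf variant is preferred, sklearn.semi_supervised.LabelPropagation implements a closely related graph propagation scheme.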

Experiments
- Empirical comparison of the three methods in a specific situation: only two classes, with an unbalanced class distribution
- 7 data sets from the UCI Machine Learning Repository
- All transformed into binary classification tasks
- With different levels of class imbalance

Data Sets

No. | Dataset | % minority examples | Dataset size | Feature / class situation | Class used | Unlabeled data size per run
1 | Letter-a | 3.9 | 20000 | 16 numeric (integer) features, 17 classes | Letter A against all other letters | 2000
2 | Pendigits | 8.3 | 7494 | 16 integer attributes (0..100), 10 classes | Digit 0 against all other digits | 2000
3 | Letter-a subset | 17.0 | 4639 | 16 numeric (integer) features, 17 classes | Letter A against letters B-F | 2000
5 | Yeast | 28.9 | 1484 | 8 numerical attributes, 10 classes | NUC against all other localizations (429 positive) | 1350
6 | Pima | 34.7 | 768 | 8 numerical attributes, 2 classes | (268 positive) | 650
7 | Bupa | 42.0 | 345 | 6 numerical attributes, 2 classes | (145 positive) | 240
8 | Pendigits subset | 50.0 | 1438 | 16 numeric (integer) features, 10 classes | Digit 3 against digit 9 (719 positive) | 1300

Experimental Design
- For each data set, various labeled-set sizes are tested: {5, 10, 20, 30, 40, 60, 80, 100}
- For each labeled-set size, perform 10 trials
- In each trial:
  - Randomly sample the labeled data from the entire dataset
  - Randomly sample a fixed number of items from the rest as unlabeled data
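
For concreteness, a small sketch of this sampling protocol (my code, not the author's); train_fn and eval_fn are hypothetical callables, and evaluating on the unlabeled pool (transductive evaluation) is an assumption the slides do not state.

import numpy as np

def run_experiment(X, y, train_fn, eval_fn, n_unlabeled,
                   labeled_sizes=(5, 10, 20, 30, 40, 60, 80, 100),
                   n_trials=10, seed=0):
    # For each labeled-set size: repeat n_trials of (sample labeled set,
    # sample unlabeled pool from the rest, train, evaluate).
    rng = np.random.default_rng(seed)
    results = {}
    for n_l in labeled_sizes:
        scores = []
        for _ in range(n_trials):
            idx = rng.permutation(len(X))
            lab, rest = idx[:n_l], idx[n_l:]
            unlab = rng.choice(rest, size=n_unlabeled, replace=False)
            model = train_fn(X[lab], y[lab], X[unlab])
            scores.append(eval_fn(model, X[unlab], y[unlab]))
        results[n_l] = (np.mean(scores), np.std(scores))
    return results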

Performance Measurement
We use three measures:
- Balanced error rate (BER): the average of the error rate on positive-class examples and the error rate on negative-class examples. If there are fewer positive examples, errors on positive examples count more.
- Error rate
- The area under the ROC curve (AUC score)
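
A small sketch of how these three numbers can be computed with scikit-learn (illustrative; labels are assumed to be 0/1 with 1 as the minority/positive class, and y_score is whatever decision value the classifier produces):

import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(y_true, y_pred, y_score):
    # Error rate, balanced error rate (BER), and AUC for a binary task.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    error_rate = np.mean(y_pred != y_true)
    err_pos = np.mean(y_pred[y_true == 1] != 1)   # error rate on positive examples
    err_neg = np.mean(y_pred[y_true == 0] != 0)   # error rate on negative examples
    ber = 0.5 * (err_pos + err_neg)
    auc = roc_auc_score(y_true, y_score)          # decision values or P(y=1|x)
    return error_rate, ber, auc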

Performance results for Sets 1, 2, 3, 5, 6, 7, and 8 (one slide of result plots per data set; the plots themselves are not reproduced in this transcription)

Discussion
- The harmonic function and the transductive SVM perform much better than the EM-mixture method
- Overall, the transductive SVM gains only a little over a plain SVM from using the unlabeled data
- The harmonic function seems a bit more stable than the transductive SVM

Discussion: Poor performance of EM-Mixture
- Both labeled and unlabeled data contribute to a reduction of variance, but unlabeled data may lead to an increase in bias when the modeling assumptions are incorrect!
- If the training set is too small, the learning updates become very similar to unsupervised GMM clustering, with the training points serving only as initialization

Discussion: Poor performance of EM-Mixture
- Compared with the small labeled set, the much larger unlabeled set has too big an effect on the total likelihood function
- The covariance matrix is hard to estimate when the labeled set is too small; some way of reducing this problem is needed, for instance a naive model (e.g., assuming feature independence)

Discussion: From these experiments
- Unlabeled data do help to some extent when the training set is small
- But it also happens that using the unlabeled data sometimes degrades classification performance

Discussion: From the results on these data sets with different class ratios
- It seems that the imbalanced distribution is not the main problem for a concrete classification task
- When classification performs badly under an imbalanced distribution, it is most likely caused by a training set that is too small

The End!