Transductive Inference for Text Classification using Support Vector Machines
Thorsten Joachims, International Conference on Machine Learning, 1999


Presented by Joe Drish
CSE 254: Seminar on Learning Algorithms, 2001
Department of Computer Science and Engineering, University of California, San Diego

Introduction

Main Goals
- Introduce a new method for text classification: Transductive Support Vector Machines (TSVMs)
- Analyze why TSVMs are well-suited for text classification
- Describe a novel algorithm for training TSVMs
- Experimentally demonstrate classification improvements using TSVMs compared to standard inductive learning methods

Talk Outline
I. Text classification
II. Transductive inference
III. TSVMs for text classification
IV. TSVM algorithm
V. Experimental results
VI. Conclusions and future work

Text Classification

Problem
- Classify documents into multiple, exactly one, or no semantic categories
- Learn a classifier that assigns categories automatically

Applications
- Netnews filtering: find interesting news articles
- Reorganizing a document collection: automatically classify document databases after new categories are introduced

Document Preprocessing

Information Extraction
- Documents are strings of characters
- Words are represented as word stems
- Example: computes, computing, and computer are all mapped to the word stem comput
- Information retrieval research suggests that word stems work well without information loss

Documents as Feature Vectors (see Figure 1)
- Each document has one feature vector, indexed by word stems
- Each vector entry is TF(w_i, d), the number of times word stem w_i occurs in document d

Scaling by Inverse Document Frequency (IDF)
- Each feature vector entry is multiplied by

    IDF(w_i) = log( n / DF(w_i) )

  where n is the total number of documents and DF(w_i) is the number of documents the word w_i occurs in
- IDF scaling assigns greater weight to word stems that are infrequent across all documents, and lesser weight to frequent word stems
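A minimal sketch of this TF and IDF weighting in Python (the toy corpus and the helper name tfidf_vectors are illustrative, not from the paper):

    import math
    from collections import Counter

    def tfidf_vectors(docs):
        # docs: list of documents, each a list of word stems.
        # Returns one {stem: TF*IDF weight} dict per document.
        n = len(docs)
        # DF(w): number of documents that stem w occurs in
        df = Counter(stem for doc in docs for stem in set(doc))
        vectors = []
        for doc in docs:
            tf = Counter(doc)  # TF(w, d): occurrences of stem w in d
            vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
        return vectors

    # Toy pre-stemmed corpus (illustrative only)
    docs = [["comput", "salt", "pepper"],
            ["comput", "physic"],
            ["salt", "pepper", "basil"]]
    for v in tfidf_vectors(docs):
        print(v)

Stems that occur in every document get weight log(1) = 0, while rare stems are weighted up, exactly as described above.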

Inductive Support Vector Machines
- Input vectors are separated into two regions: H1 and H2
- The margin is maximized subject to minimal separation error
- Data points that lie on the margin are support vectors

[Figure: a maximum-margin hyperplane separating regions H1 and H2, with the support vectors lying on the margin]

Inductive versus Transductive Learning

Objectives of Inductive and Transductive Inference
- Inductive learning: generalize to any future test set
- Transductive learning: predict the classes of a specific test set
- In transduction we use information from the given test set

Transduction using Support Vector Machines
- Inductive Support Vector Machines (SVMs) learn a decision boundary between two classes to predict labels for future test sets
- Transductive Support Vector Machines (TSVMs) attempt to minimize the number of erroneous predictions on a specific test set
- A middle ground between supervised and unsupervised learning

Transductive Support Vector Machines
- Positive/negative training examples are marked +/-; test examples are dots
- The solid line gives the TSVM separating hyperplane

[Figure: TSVM vs. SVM separating hyperplanes; redrawn from Figure 2 in [Joachims, 1999]]
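For reference, the optimization problem behind TSVM training, which the paper calls OP(2), can be sketched in LaTeX notation as follows (the paper additionally splits the test-example cost C* into C*_+ and C*_- by tentative label sign):

    \min_{y_1^*,\dots,y_k^*,\; \vec{w},\, b,\; \xi,\, \xi^*} \quad
        \frac{1}{2}\lVert \vec{w} \rVert^2
        + C \sum_{i=1}^{n} \xi_i
        + C^* \sum_{j=1}^{k} \xi_j^*

    \text{s.t.} \quad
        y_i(\vec{w}\cdot\vec{x}_i + b) \ge 1 - \xi_i, \qquad
        y_j^*(\vec{w}\cdot\vec{x}_j^* + b) \ge 1 - \xi_j^*, \qquad
        \xi_i \ge 0,\; \xi_j^* \ge 0

The test labels y*_j are themselves variables of this optimization, which is what makes the problem transductive and, in general, combinatorial; the training algorithm described later is a local search over these labels.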

TSVMs and Text Classification

Text Classification Task Features
- High-dimensional input space (10,000 features)
- Document feature vectors are sparse
- Every feature is important, since most words are relevant

What makes TSVMs good for this task?
- TSVMs inherit the properties of SVMs, which work well
- TSVMs exploit co-occurring patterns of text

AltaVista Search Example (number of hits in year 2001)
- pepper, salt: 181,827
- pepper, physics: 19,425
- salt: 1.9 million
- physics: 4.2 million

TSVMs Using the Test Set: An Example

Document-by-word matrix (Figure 3 in [Joachims, 1999]):

           nuclear  physics  atom  parsley  basil  salt  and
    D1        1        1
    D2        1        1      1                          1
    D3                        1                          1
    D4                               1              1    1
    D5                               1        1          1
    D6                                        1     1    1

- Documents D1 and D6 are the training feature vectors
- Documents D2 through D5 are the test feature vectors
- D1, D2, and D3 are classified into class A
- D4, D5, and D6 are classified into class B
- This is possible since the test vectors D2 and D3 share a common word (atom), as do D4 and D5 (parsley)
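The intuition can be made concrete with a toy sketch (not the TSVM algorithm itself): starting from the two labeled documents, labels spread to test documents through shared non-stopword terms.

    # Toy illustration, not the TSVM algorithm: propagate the two
    # training labels through shared terms, mirroring D1..D6 above.
    docs = {
        "D1": {"nuclear", "physics"},               # training, class A
        "D2": {"nuclear", "physics", "atom", "and"},
        "D3": {"atom", "and"},
        "D4": {"parsley", "salt", "and"},
        "D5": {"parsley", "basil", "and"},
        "D6": {"basil", "salt", "and"},             # training, class B
    }
    labels = {"D1": "A", "D6": "B"}
    stopwords = {"and"}   # ignore terms shared by nearly every document

    changed = True
    while changed:
        changed = False
        for d, terms in docs.items():
            if d in labels:
                continue
            for e in list(labels):
                # adopt e's label if d shares a non-stopword term with e
                if (terms & docs[e]) - stopwords:
                    labels[d] = labels[e]
                    changed = True
                    break
    print(labels)   # D2, D3 -> A; D4, D5 -> B

TSVMs achieve this effect implicitly: placing the margin through the low-density region between the two co-occurrence clusters yields the same grouping.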

TSVM Training Algorithm Overview

Algorithm Overview
Input:
- labeled training examples (x_1, y_1), ..., (x_n, y_n)
- unlabeled test examples x*_1, ..., x*_k
- C, C* from OP(2) in [Joachims, 1999]
- num+: anticipated number of positive test examples
Output:
- predicted labels of the test examples y*_1, ..., y*_k

User Parameters
- C and C* specify the trade-off between margin size and margin violations
- num+ allows trading off recall versus precision
- recall: proportion of items in the category that are actually placed in the category
- precision: proportion of items placed in the category that are really in the category

TSVM Training Algorithm Description

Algorithm Idea (refer to Figure 4 in [Joachims, 1999])
- First label the test data based on inductive SVM classification
- Set the cost factors C*_- and C*_+ to a small number
- Outer loop (loop 1): increment the cost factors up to the user-defined value of C*
- Inner loop (loop 2): locate two test examples for which switching the class labels decreases the current objective function OP(2); if two such examples exist, switch them

Algorithm Notes
- SVM-light (Joachims) is freely available software for the inductive SVM
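A simplified Python sketch of this training loop, with several assumptions spelled out: scikit-learn's LinearSVC stands in for SVM-light, a single cost factor C* replaces the paper's separate C*_+ and C*_-, the num+ balancing constraint is omitted, and the pair-selection rule (both examples misclassified) is a rough proxy for the paper's slack-based switching condition.

    import numpy as np
    from sklearn.svm import LinearSVC

    def tsvm_sketch(X_train, y_train, X_test, C=1.0, C_star_final=1.0):
        # Labels are assumed to be +1 / -1.
        # Step 1: label the test examples with an inductive SVM
        svm = LinearSVC(C=C).fit(X_train, y_train)
        y_star = svm.predict(X_test)

        C_star = 1e-3                 # start the test-example cost small
        X_all = np.vstack([X_train, X_test])
        while C_star < C_star_final:                  # outer loop
            C_star = min(2 * C_star, C_star_final)    # raise the cost factor
            while True:                               # inner loop
                w = np.concatenate([np.ones(len(y_train)),
                                    np.full(len(y_star), C_star)])
                svm = LinearSVC(C=C).fit(
                    X_all, np.concatenate([y_train, y_star]),
                    sample_weight=w)
                # margins of the tentatively labeled test examples;
                # margin < 0 means that example's slack exceeds 1
                margins = y_star * svm.decision_function(X_test)
                pos = np.where((y_star == 1) & (margins < 0))[0]
                neg = np.where((y_star == -1) & (margins < 0))[0]
                if len(pos) == 0 or len(neg) == 0:
                    break
                # switching one badly placed +/- pair lowers the objective
                i = pos[np.argmin(margins[pos])]
                j = neg[np.argmin(margins[neg])]
                y_star[i], y_star[j] = -1, 1
        return y_star

The paper proves convergence for the exact OP(2) objective; this simplified switching criterion is only illustrative.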

TSVM Inner Loop

Motivation
- The goal is to minimize the objective function OP(2)
- The algorithm switches two examples whenever doing so further minimizes OP(2), if two such examples exist
- The same example can have its label switched repeatedly
- OP(2) decreases with every iteration
- The algorithm converges in a finite number of steps (proof given in the paper)

Issues
- Why is it reasonable to switch two examples - randomness?

Test Set Collections

Reuters-21578
- Consists of Reuters news data collected in 1987
- ModApte split: 9,603 (75%) training and 3,299 (25%) test documents
- A document can be in one or more of 10 classes (e.g., earn, grain, crude)

WebKB collection
- A collection of World Wide Web pages
- 4,183 examples: Cornell University pages for training, the others for testing
- A document can be in exactly one of 4 classes: course, faculty, project, student

Ohsumed corpus
- Medical documents compiled in 1991
- 10,000 training examples; 10,000 test examples
- A document can be in one or more of 5 classes (e.g., pathology, neoplasms)

Performance Metrics

Recall and Precision (defined intuitively before)
- recall: tp / (tp + fn), where tp is true positives and fn is false negatives
- precision: tp / (tp + fp), where fp is false positives

Precision/Recall (P/R) Breakeven Point
- Standard measure of performance in text classification
- Defined as the value at which precision and recall are equal
- Equivalently, the number of false positives equals the number of false negatives
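A small Python sketch of these metrics; the threshold sweep in breakeven_point is one common way to locate the breakeven on a ranked output, assumed here rather than taken from the paper.

    import numpy as np

    def precision_recall(y_true, y_pred):
        # y_true, y_pred: numpy arrays of +1/-1 labels
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == -1))
        fn = np.sum((y_pred == -1) & (y_true == 1))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return precision, recall

    def breakeven_point(y_true, scores):
        # Predict the top-k scored examples as positive for growing k
        # and report the point where recall catches up with precision.
        order = np.argsort(-scores)
        for k in range(1, len(scores) + 1):
            y_pred = np.full(len(scores), -1)
            y_pred[order[:k]] = 1
            p, r = precision_recall(y_true, y_pred)
            if p > 0 and r >= p:
                return (p + r) / 2
        return 0.0

When k equals the number of true positives, tp + fp = tp + fn, so precision and recall necessarily cross there.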

Breakeven Point: Recall = Precision

[Figure: examples ranked from high to low score; those above the breakeven threshold are predicted positive, those below negative]

Worked example at the breakeven threshold:
- tp = 4, fp = 3, fn = 3 (fn = fp at the breakeven point)
- recall = tp / (tp + fn) = 4 / (4 + 3) = 4/7
- precision = tp / (tp + fp) = 4 / (4 + 3) = 4/7
- recall = precision

Reuters Experiment

P/R breakeven point (%):

    Class      Bayes   SVM    TSVM
    earn       78.8    91.3   95.4
    acq        57.4    67.8   76.6
    money-fx   43.9    41.3   60.0
    grain      40.1    56.2   68.5
    crude      24.8    40.9   83.6
    trade      22.1    29.5   34.0
    interest   24.5    35.6   50.8
    ship       33.2    32.5   46.3
    wheat      19.5    47.9   54.4
    corn       14.5    41.3   43.7
    average    35.9    48.4   60.8

(Figure 5 in [Joachims, 1999])

Results
- 17 training and 3,299 test examples
- The TSVM gives better performance on all classes
- TSVMs are better for small training sets (Figure 6)
- TSVMs are less superior for larger training sets (Figure 7)
- 61.3? (the TSVM column in fact averages to 61.3 rather than the reported 60.8)

WebKB Experiment

P/R breakeven point (%):

    Class     Bayes   SVM    TSVM
    course    57.2    68.7   93.8
    faculty   42.4    52.5   53.7
    project   21.4    37.5   18.4
    student   63.5    70.0   83.8
    average   46.1    57.2   62.4

(Figure 8 in [Joachims, 1999])

Results
- 9 training and 3,957 test examples
- course is especially good, project is especially bad. Why?
- course pages at Cornell do not give topic information; with more training examples the SVM catches up to the TSVM (Figure 10)
- project is the smallest class (1/9), and its pages do give topic information; with more training examples the TSVM overcomes the SVM (Figure 11)

Ohsumed Experiment

P/R breakeven point (%):

    Class           Bayes   SVM    TSVM
    pathology       39.6    41.8   43.4
    cardiovascular  49.0    58.0   69.1
    neoplasms       53.1    65.1   70.3
    nervous system  28.1    35.5   38.1
    immunologic     28.3    42.8   46.7
    average         39.6    48.6   53.5

(Redrawn from Figure 9 in [Joachims, 1999])

Results
- 120 training and 10,000 test examples
- The TSVM gives better performance on all classes

Conclusions and Future Work

TSVMs combine powerful tools
- use prior knowledge about the test set
- exploit co-occurrence properties of text
- use the separating-hyperplane margin (SVM)

TSVMs are well-motivated for text classification
- Improved performance verified experimentally on three challenging datasets

Open questions
- What types of concepts benefit most from transductive learning?
- Is there a better way to represent text and documents?
- Can better training algorithms be found?
- Can transductive classifiers be extended into inductive classifiers?

Questions?