
CLASSIFICATION
CS5604 Information Storage and Retrieval - Fall 2016
Virginia Polytechnic Institute and State University, Blacksburg, Virginia 24061
Professor: E. Fox
Presenters: Saurabh Chakravarty, Eric Williamson
December 1, 2016

TABLE OF CONTENTS: Problem Definition, High-Level Architecture, Data Retrieval and Processing, Classification, Experimental Results, Conclusion and Future Work

PROBLEM STATEMENT
Given the tweets in the GETAR and IDEAL collections and a set of real world events, determine which tweets belong to each real world event.

PROBLEM STATEMENT
Example mapping from tweet collections to real world events:
Tweet collections: 20: hurricane sandy; 27: hurricane; 188: #Arthur; 182: #tornado; 632: fairdale; 174: #Manhattan
Real world events: Hurricane Sandy; Hurricane Arthur; Fairdale Tornado; Manhattan Explosion
(A broad collection such as 27: hurricane can contain tweets from several events, e.g., both Hurricane Sandy and Hurricane Arthur.)

HIGH LEVEL ARCHITECTURE
Classification pipeline

TRAINING PIPELINE
Training data -> Pre-processing (clean/lemmatize/remove stop words) -> Train word vectors -> Word2Vec Model -> Train classifier -> Classifier Model

PREDICTION PIPELINE
HBase table idealcs5604f16 -> Pre-processing (clean/lemmatize/remove stop words) -> Load Word2Vec & Classifier models -> Predict tweet event

DATA RETRIEVAL CHALLENGES
- Large amounts of data (1.5 billion tweets)
- Must avoid serial execution
- Data cannot fit into memory
- Must prevent reprocessing data unnecessarily

DATA RETRIEVAL FROM HBASE
Retrieval Method | Description | Smaller collection performance | Larger collection performance
Spark HadoopRDD | Load data into the driver and parallelize across the cluster. | Seamless reading from HBase. | Hangs and does not complete reading on collections of more than one million tweets.
Batch Processing | Load one batch at a time onto the driver and parallelize across the cluster. | Slower reading due to batch overhead. | Allows classification of arbitrarily large collections.
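The batch-processing row of the table can be made concrete with a short sketch. This is a minimal illustration, not the project's actual job: it assumes the happybase Python client for HBase, and the host name, column name, and process_batch placeholder are all hypothetical.

```python
# Sketch of batch retrieval: scan HBase in bounded batches on the driver,
# parallelizing each batch across the cluster so driver memory stays flat.
# Host, table column, and process_batch are illustrative assumptions.
import happybase
from pyspark import SparkContext

sc = SparkContext(appName="tweet-classification")
connection = happybase.Connection("hbase-master")   # hypothetical HBase host
table = connection.table("idealcs5604f16")

def process_batch(rdd):
    # Placeholder: the real pipeline cleans, classifies, and writes back.
    print(rdd.count(), "tweets in this batch")

BATCH_SIZE = 100_000                                # tune to driver memory
batch = []
for row_key, data in table.scan(batch_size=10_000):
    batch.append((row_key, data.get(b"tweet:text", b"")))  # hypothetical column
    if len(batch) >= BATCH_SIZE:
        process_batch(sc.parallelize(batch))        # distribute this batch
        batch = []
if batch:
    process_batch(sc.parallelize(batch))            # final partial batch
```

Bounding each batch trades a little scan overhead for the ability to process arbitrarily large collections, which is exactly the tradeoff shown in the table.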

CHALLENGES WITH TWEETS
- Abbreviations and slang (u, omg)
- Non-English text
- URLs and special characters
- Misspellings
Example: RT: @AssociationsNow A Year After Texas Explosion Federal Repourt Outlines Progress on Fertilize... http://t.co/8fdbmu9asu #meetingprofs

TEXT PRE-PROCESSING
- Remove URLs
- Remove the # characters
- Lemmatization using StanfordNLP
- Stopword removal

CLEANING EXAMPLE
Raw: RT: @AssociationsNow A Year After Texas Explosion Federal Report Outlines Progress on Fertilize... http://t.co/8fdbmu9asu #meetingprofs
Clean: year texas explosion federal report outline progress fertilize meetingprof
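A rough Python sketch of these cleaning steps follows. The stopword list is illustrative, and the lemmatization step (StanfordNLP in the actual pipeline) is stubbed out, so inflected forms such as "outlines" and "meetingprofs" are not reduced to "outline" and "meetingprof" as in the slide above.

```python
# Sketch of the tweet-cleaning steps: drop URLs and mentions, strip '#',
# keep alphabetic tokens, remove stopwords. Lemmatization is stubbed out.
import re

STOPWORDS = {"rt", "a", "an", "the", "after", "on", "of", "in", "to", "and"}

def clean_tweet(text):
    text = text.lower()
    text = re.sub(r"http\S+", "", text)   # remove URLs
    text = re.sub(r"@\w+", "", text)      # remove user mentions
    text = text.replace("#", "")          # keep hashtag words, drop '#'
    tokens = re.findall(r"[a-z]+", text)  # alphabetic tokens only
    # the real pipeline lemmatizes here via StanfordNLP
    return " ".join(t for t in tokens if t not in STOPWORDS)

raw = ("RT: @AssociationsNow A Year After Texas Explosion Federal Report "
       "Outlines Progress on Fertilize... http://t.co/8fdbmu9asu #meetingprofs")
print(clean_tweet(raw))
# -> year texas explosion federal report outlines progress fertilize meetingprofs
```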

TRAINING DATA GENERATION
- Random samples drawn from collections corresponding to real world events.
- Each tweet was assigned the number of the class it belongs to most.
- 1,000 tweets were hand labeled.

CLASS DISTRIBUTION FOR TRAINING AND TEST DATA
[Bar chart: class distribution of the manually labeled data; x-axis: class number (0-8), y-axis: number of tweets (0-200)]

TEXT CLASSIFICATION
- Feature selection
- Feature representation
- Choice of classifier

A COMPARISON OF FEATURE SELECTION TECHNIQUES
Technique | Advantages | Disadvantages
Tf-idf | Superior for small feature sets. High term removal capability. | Accuracy suffers for large datasets.
Mutual information | Simple to implement. | Inferior accuracy performance.
Association rules | Fast execution. Very good accuracy for multi-class scenarios. Easy to interpret the rules. | Prone to discovering too many rules, or poorly understandable rules, that hurt performance and interpretation.
Chi-square statistic | Robust accuracy and performance with large sample sets with fewer classes. | Difficult to interpret when there are a large number of classes.
Within class popularity | Identifies words that are most discriminative. | Ignores the sequence of words.
Word2Vec | Captures relationships of a word with its neighbors. | High computational complexity. Long training time for large sample sizes.
For more details, please refer to the appendix.

COMMON FEATURE REPRESENTATION TECHNIQUES
- One-hot encoding
- Bag of words
Challenges:
- Large number of dimensions
- Word relationships with neighbors are not captured
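To make the dimensionality challenge concrete, here is a toy bag-of-words encoding (the vocabulary and tweets are invented): every vocabulary word becomes one dimension, so real tweet vocabularies explode into huge sparse vectors, and two tweets about related events can share no dimensions at all.

```python
# Toy illustration of the bag-of-words dimensionality problem: one dimension
# per vocabulary word, with no notion of similarity between words.
vocab = ["explosion", "fertilizer", "hurricane", "sandy", "texas", "year"]

def bag_of_words(tokens):
    return [tokens.count(w) for w in vocab]   # count per vocabulary word

print(bag_of_words(["texas", "explosion", "fertilizer", "explosion"]))
# -> [2, 1, 0, 0, 1, 0]
print(bag_of_words(["hurricane", "sandy"]))
# -> [0, 0, 1, 1, 0, 0]  (no overlap with the explosion tweet)
```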

WORD2VEC
A feature selection technique. Captures the semantic context of a word's relation with its neighbors. For more details, refer to the appendix.

WORD2VEC
Similar words are grouped together, close to one another in the vector space.
Slide courtesy - http://files.meetup.com/12426342/5_an_overview_of_word2vec.pdf

WORD2VEC
Displacements between word vectors capture relationships between the words (the classic example: king - man + woman lands near queen).

WORD2VEC
Slide courtesy - http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

CLASSIFIER IMPLEMENTATION DETAILS
- The feature vector for a tweet is the average of the word vectors generated by the Word2Vec model.
- We used the default vector representation of 100 values.
- We chose the multi-class logistic regression in the Spark framework to perform classification.
- The classifier outputs the predicted class along with the normalized probabilities of the other classes.
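A minimal PySpark sketch of this setup, assuming Spark 2.x APIs (the course cluster ran an earlier release) and invented training rows. Conveniently, Spark's Word2Vec transform itself outputs the average of a document's word vectors, which is exactly the feature described above.

```python
# Sketch: Word2Vec features (averaged word vectors, 100 dimensions) feeding
# a multi-class logistic regression, both from pyspark.ml. Data is invented.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Word2Vec
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("tweet-classifier").getOrCreate()

train = spark.createDataFrame(
    [(["year", "texas", "explosion", "federal", "report"], 0.0),
     (["hurricane", "sandy", "landfall", "jersey"], 1.0),
     (["manhattan", "building", "explosion"], 2.0)],
    ["words", "label"])

word2vec = Word2Vec(vectorSize=100, minCount=0,   # 100-dim vectors (the default)
                    inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=100)              # multinomial for >2 labels
model = Pipeline(stages=[word2vec, lr]).fit(train)

# each prediction carries the class plus normalized per-class probabilities
model.transform(train).select("prediction", "probability").show(truncate=False)
```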

EXPERIMENTS
- Effect of preprocessing
- Accuracy performance
- Runtime performance
- Probability distribution
- Class assignment

CLEANING EXPERIMENT
Determine how cleaning the data influences accuracy.
Cleaning: lemmatization, stopword removal, hashtag removal.
Experimental setup: split hand-labeled data 70% train / 30% test.

CLEANING IMPROVES ACCURACY!
[Bar chart: F1 score (0.84-0.96) on the training and testing data, without vs. with cleaning, for Word2Vec with logistic regression and for Association Rules]
Word2Vec with logistic regression: 29% fewer misclassifications.
Association Rules: 51% fewer misclassifications.

ACCURACY EXPERIMENT
Determine which classifier gives better results on the labeled data.
Experimental setup:
- Generate 10 different splits of the labeled data
- Calculate metrics for each classifier on the same split
- 9 different classes

ACCURACY EXPERIMENTAL SETUP
Labeled data divided into 10 sets: 7 sets for training, 3 sets for testing. Rotating which sets are held out generates 10 different training and test splits (see the sketch below).
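A small sketch of one plausible way to generate those rotated splits; the exact rotation scheme (consecutive test sets, shuffling, the seed) is an assumption, not taken from the report.

```python
# Sketch: partition labeled tweets into 10 sets, rotate which 3 form the
# test split, yielding 10 different 70/30 train/test breakups.
import random

def rotated_splits(examples, n_sets=10, n_test_sets=3, seed=42):
    random.Random(seed).shuffle(examples)
    size = len(examples) // n_sets          # leftovers are dropped in this sketch
    sets = [examples[i * size:(i + 1) * size] for i in range(n_sets)]
    for start in range(n_sets):             # one rotation per split
        test_idx = {(start + j) % n_sets for j in range(n_test_sets)}
        train = [x for i in range(n_sets) if i not in test_idx for x in sets[i]]
        test = [x for i in sorted(test_idx) for x in sets[i]]
        yield train, test

# for train, test in rotated_splits(labeled_tweets):
#     train each classifier on `train`; score weighted F1/precision/recall on `test`
```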

WORD2VEC OUTPERFORMS ASSOCIATION RULES
Accuracy comparison:
Metric | Word2Vec with Logistic Regression | Association Rules
Weighted F1 | 0.9609 | 0.9005
Weighted Precision | 0.9638 | 0.9109
Weighted Recall | 0.9607 | 0.9104
Word2Vec with logistic regression had a 6.7% increase in F1 score over association rules.

CLASSIFIER RUNTIME PERFORMANCE
- Need to handle large collections efficiently to classify all of the tweets
- Classify at a rate faster than the tweets coming in
- Allow reruns as more classes are added to the training set

CLASSIFICATION RUNTIME PERFORMANCE
[Line chart: seconds to predict (0-60) vs. number of tweets to predict (3,000 to 640,000) for Word2Vec with logistic regression and for Association Rules]

OPTIMIZATION
Broadcast the models to each partition: the Word2Vec model and the logistic regression model are shipped to every partition, and the pipeline becomes HBase read -> partition and distribute -> each partition cleans, classifies, and writes its tweets in parallel.
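A conceptual PySpark sketch of the broadcast optimization, reusing clean_tweet from the cleaning sketch above. It assumes the Word2Vec vectors are available as a plain {word: vector} dict and the classifier is an mllib-style model with a predict(vector) method; with the DataFrame-based pyspark.ml API, the same effect comes from a single model.transform over the partitioned data.

```python
# Broadcast both models once per executor, then classify each partition
# locally; broadcast() avoids re-shipping the models with every task.
import numpy as np

def average_word_vectors(tokens, w2v, dim=100):
    vecs = [w2v[t] for t in tokens if t in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def classify_collection(sc, tweet_rdd, w2v_vectors, lr_model):
    w2v_bc = sc.broadcast(w2v_vectors)   # word -> vector lookup table
    lr_bc = sc.broadcast(lr_model)       # trained classifier

    def classify_partition(rows):
        w2v, lr = w2v_bc.value, lr_bc.value
        for row_key, raw_text in rows:   # each partition cleans and classifies
            tokens = clean_tweet(raw_text).split()
            yield row_key, lr.predict(average_word_vectors(tokens, w2v))

    return tweet_rdd.mapPartitions(classify_partition)
```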

PROCESSING ACROSS PARTITIONS INCREASES RUNTIME PERFORMANCE!
[Line chart: seconds to predict (0-60) vs. number of tweets to predict (3,000 to 640,000) for Word2Vec with logistic regression, Association Rules, and Optimized Word2Vec]
Optimized Word2Vec: 57% faster than the original Word2Vec, 14% faster than Association Rules.

PROBABILITY DISTRIBUTION FOR TEST DATA

MULTI-CLASS ASSIGNMENT DISTRIBUTION FOR TEST DATA
[Pie chart: number of classes assigned per tweet — 1 class: 23%, 2 classes: 35%, 3 classes: 29%, 4 classes: 13%]

CONCLUSION
- Reading data in blocks from HBase and then partitioning it into parallel tasks yields large runtime efficiency gains and predictable performance.
- Cleaning text to account for the nuances of English usage on Twitter improves accuracy.
- Feature selection methods like Word2Vec, which capture richer word semantics and context, deliver better text classification accuracy than traditional methods.
- It is natural for a tweet to be classified into multiple classes; the tradeoff between precision and recall depends on the user/product requirements.

FUTURE WORK
- The system can be retrained on a bigger corpus to generate a new set of word vectors. Training on a text corpus like Google News can produce word vectors with richer word relationships encoded within, which can improve classification accuracy.
- The logistic regression classifier can be retrained on new classes.
- The system will be configured to run periodically via a cron job.
- In addition to classifying a tweet, the system emits probabilities for all classes; these could be saved in HBase and used by SOLR or the front-end team as a criterion for customizing indexing or the user experience.
- The developed classifier's results can be compared with the AR classifier and other classifiers; an inter-classifier agreement analysis can shed further light on its efficacy.

ACKNOWLEDGEMENTS
We would like to acknowledge and thank the following for assisting and supporting us throughout this project:
- Dr. Edward Fox, Dr. Denilson Alves Pereira
- NSF grant IIS-1619028, III: Small: Collaborative Research: Global Event and Trend Archive Research (GETAR)
- NSF grant IIS-1319578, III: Small: Integrated Digital Event Archiving and Library (IDEAL)
- Digital Library Research Laboratory
- Graduate Research Assistant Sunshin Lee
- Other teams in CS 5604

APPENDIX

FEATURE SELECTION TECHNIQUES
Technique | Advantages | Disadvantages
Tf-idf | Superior for small feature sets that have a large scatter of features among the classes. High term removal capability. | Accuracy suffers for large data sets, where term distribution alone does not suffice for class discrimination.
Mutual information | Simple to implement. | Inferior performance in estimation of probabilities because of bias.
Association rules | Fast execution. Very good accuracy for multi-class scenarios. A rule-based classifier helps one understand the classification decision easily. | Prone to discovering too many rules, or poorly understandable rules, that hurt performance and interpretation.
Chi-square statistic | Robust accuracy and performance with large sample sets with fewer classes. | Difficult to interpret when there are a large number of classes.
Within class popularity | Identifies words that are most discriminative. | Ignores the sequence of words.
Word2Vec | Captures relationships of a word with its neighbors. | High computational complexity. Long training time for large sample sizes.

WORD2VEC Slide courtesy - http://files.meetup.com/12426342/5_an_overview_of_word2vec.pdf

WORD2VEC
Word vectors can be learned via two forms.
CBOW: predict the word, given the context.
Slide courtesy - http://files.meetup.com/12426342/5_an_overview_of_word2vec.pdf

WORD2VEC
Skip-gram: the inverse objective of CBOW; predict the context, given a word.
Slide courtesy - http://files.meetup.com/12426342/5_an_overview_of_word2vec.pdf
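For readers who want to try both objectives outside the Spark pipeline, here is an illustrative gensim snippet (gensim >= 4 parameter names; the toy corpus is invented): the sg flag switches between CBOW (sg=0) and skip-gram (sg=1).

```python
# Train tiny CBOW and skip-gram models on an invented corpus with gensim.
from gensim.models import Word2Vec

sentences = [["hurricane", "sandy", "hits", "new", "york"],
             ["texas", "fertilizer", "plant", "explosion"],
             ["hurricane", "arthur", "approaches", "coast"]]

cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

print(skipgram.wv.most_similar("hurricane", topn=3))  # nearest neighbors
```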

REAL WORLD EVENTS USED FOR EXPERIMENTS
The real world events, along with the number of tweets labeled as each class in our experimental sets:
- Hurricane Sandy (108 tweets)
- Hurricane Isaac (83 tweets)
- New York Firefighter Shooting (58 tweets)
- Kentucky Accidental Child Shooting (16 tweets)
- Newtown School Shooting (157 tweets)
- Manhattan Building Explosion (189 tweets)
- China Factory Explosion (178 tweets)
- Texas Fertilizer Explosion (120 tweets)
- Hurricane Arthur (169 tweets)

REAL WORLD EVENTS CLASSIFIED
The real world events classified, along with some collections in which tweets of each event are found:
Real World Event | Collections
Hurricane Sandy | 23, 27, 375
Hurricane Isaac | 27, 28, 375
New York Firefighter Shooting | 173, 174, 399, 400
Kentucky Accidental Child Shooting | 41, 42, 46
Newtown School Shooting | 77, 381
Manhattan Building Explosion | 231, 232
China Factory Explosion | 43, 46
Texas Fertilizer Explosion | 96, 98, 381
Hurricane Arthur | 27, 187, 188, 375
Quebec Train Derailment | 45, 46
Fairdale Tornado | 406, 632
Oklahoma Tornado | 406, 84
Mississippi Tornado | 406, 528
Alabama Tornado | 406, 407