Big Data Analytics Clustering and Classification


E6893 Big Data Analytics Lecture 4: Big Data Analytics Clustering and Classification. Ching-Yung Lin, Ph.D., Adjunct Professor, Dept. of Electrical Engineering and Computer Science. September 28th, 2017. © 2017 CY Lin, Columbia University

Review: Key ML Components of Mahout

Machine Learning example: using an SVM to recognize a Toyota Camry. Non-ML approach: Rule 1 - the symbol looks something like a bull's head; Rule 2 - a big black portion at the front of the car; Rule 3 - ...? ML approach: a Support Vector Machine in feature space, with positive and negative support vectors (SVs).

Machine Learning example: using an SVM to recognize a Toyota Camry. The trained SVM separates positive and negative SVs in feature space and outputs a confidence, e.g., P(Camry) > 0.95.

Clustering

Clustering on the feature plane

Clustering example

Steps in clustering

Making initial cluster centers

k-means clustering

HelloWorld clustering scenario result

Parameters to the Mahout k-means clustering algorithm
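To make these parameters concrete, here is a hedged example invocation of the k-means driver (the paths and values are illustrative, not from the slide; flag names follow the Mahout command-line driver):

  # -i input vectors, -c initial cluster centers, -o output directory,
  # -dm distance measure class, -cd convergence delta, -x max iterations,
  # -k number of clusters (random seed centers are written to the -c path),
  # -cl also assign points to clusters after convergence, -ow overwrite output
  bin/mahout kmeans -i reuters-vectors/tfidf-vectors -c reuters-initial-clusters \
    -o reuters-kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
    -cd 0.1 -x 10 -k 20 -cl -ow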

HelloWorld clustering scenario

HelloWorld clustering scenario II

HelloWorld clustering scenario III

Testing different distance measures

Manhattan and cosine distances

Tanimoto distance and weighted distance

Results comparison

Data preparation in Mahout vectors

Vectorization example: dimension 0: weight, 1: color, 2: size

Mahout code to create vectors for the apple example

Mahout code to create vectors for the apple example II
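The listing itself did not survive the transcript; the following is a minimal sketch of how the apple example could be encoded as Mahout vectors and written to a SequenceFile. The field values, paths, and the weight/color/size encoding are illustrative assumptions, not the slide's exact code.

  import java.util.Arrays;
  import java.util.List;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;
  import org.apache.mahout.math.DenseVector;
  import org.apache.mahout.math.NamedVector;
  import org.apache.mahout.math.VectorWritable;

  public class ApplesToVectors {
    public static void main(String[] args) throws Exception {
      // dimension 0: weight (kg), 1: color (coded), 2: size (cm) -- illustrative values
      List<NamedVector> apples = Arrays.asList(
          new NamedVector(new DenseVector(new double[]{0.11, 510, 5.5}), "small round green apple"),
          new NamedVector(new DenseVector(new double[]{0.23, 650, 9.0}), "large oval red apple"));

      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      Path path = new Path("appledata/apples");
      // key: apple name, value: its feature vector
      SequenceFile.Writer writer =
          new SequenceFile.Writer(fs, conf, path, Text.class, VectorWritable.class);
      try {
        VectorWritable vw = new VectorWritable();
        for (NamedVector apple : apples) {
          vw.set(apple);
          writer.append(new Text(apple.getName()), vw);
        }
      } finally {
        writer.close();
      }
    }
  }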

Vectorization of text. Vector Space Model: Term Frequency (TF); stop words; stemming.

Most popular stemming algorithms

Term Frequency - Inverse Document Frequency (TF-IDF). A word's weight is reduced the more frequently it appears across all the documents in the dataset; a common form is w(t, d) = tf(t, d) x log(N / df(t)), or a smoothed variant such as tf(t, d) x log(N / (1 + df(t))).
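As a concrete illustration of this weighting, here is a small self-contained sketch that computes TF-IDF for a toy in-memory corpus (the three example documents are made up for illustration):

  import java.util.Arrays;
  import java.util.HashMap;
  import java.util.HashSet;
  import java.util.List;
  import java.util.Map;

  public class TfIdfToy {
    public static void main(String[] args) {
      List<String[]> docs = Arrays.asList(
          "it was the best of times".split(" "),
          "it was the worst of times".split(" "),
          "mahout builds tfidf vectors".split(" "));

      // document frequency: in how many documents does each term appear?
      Map<String, Integer> df = new HashMap<>();
      for (String[] doc : docs) {
        for (String term : new HashSet<>(Arrays.asList(doc))) {
          df.merge(term, 1, Integer::sum);
        }
      }

      int n = docs.size();
      // term frequency in document 0
      Map<String, Integer> tf = new HashMap<>();
      for (String term : docs.get(0)) {
        tf.merge(term, 1, Integer::sum);
      }
      // tf-idf(t, d) = tf(t, d) * log(N / df(t)); terms used in every document get weight 0
      for (Map.Entry<String, Integer> e : tf.entrySet()) {
        double w = e.getValue() * Math.log((double) n / df.get(e.getKey()));
        System.out.printf("%-8s tf=%d df=%d tfidf=%.3f%n",
            e.getKey(), e.getValue(), df.get(e.getKey()), w);
      }
    }
  }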

n-grams. "It was the best of times, it was the worst of times" ==> bigrams: "it was", "was the", "the best", ... Mahout provides a log-likelihood ratio test to prune n-grams and reduce their dimensionality.

Examples using a news corpus. Reuters-21578 dataset: 22 files, each containing 1000 documents except the last one. http://www.daviddlewis.com/resources/testcollections/reuters21578/ Extraction code:
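The extraction code itself did not survive the transcript; a hedged sketch of one common way to do it, assuming Lucene's benchmark ExtractReuters utility is on the classpath and using illustrative directory names:

  # split the 22 SGML files into one plain-text file per article
  java org.apache.lucene.benchmark.utils.ExtractReuters reuters-sgm reuters-out
  # convert the directory of text files into SequenceFiles of <document id, text>
  bin/mahout seqdirectory -i reuters-out -o reuters-seqfiles -c UTF-8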

Mahout dictionary-based vectorizer

Mahout dictionary-based vectorizer II

Mahout dictionary-based vectorizer III

Outputs & Steps: 1. Tokenization using the Lucene StandardAnalyzer. 2. n-gram generation. 3. Conversion of the tokenized documents into vectors using TF. 4. Counting DF and then creating TF-IDF vectors.

A practical setting of flags
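A hedged example of such a setting for the dictionary-based vectorizer (the values are illustrative; flag names follow Mahout's seq2sparse job):

  # -a analyzer class, -wt weighting (tf or tfidf), -ng max n-gram size,
  # -ml minimum log-likelihood ratio for keeping an n-gram,
  # -md minimum document frequency, -x maximum document-frequency percentage,
  # -n p-norm used for normalization, -nv keep named vectors, -ow overwrite
  bin/mahout seq2sparse -i reuters-seqfiles -o reuters-vectors \
    -a org.apache.lucene.analysis.standard.StandardAnalyzer \
    -wt tfidf -ng 2 -ml 50 -md 2 -x 85 -n 2 -nv -ow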

Normalization. A large document may show up as similar to all the other documents simply because of its size. ==> Normalization can help.

Clustering methods provided by Mahout

k-means clustering

Hadoop k-means clustering jobs

k-means clustering running as a MapReduce job

Hadoop k-means clustering code
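The code on the slide was lost in transcription; the following is a hedged sketch of launching the k-means MapReduce job from Java. The exact KMeansDriver.run signature changed across Mahout releases, so this follows the Mahout-in-Action-era API as an assumption to be checked against the version in use; paths are illustrative.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.mahout.clustering.kmeans.KMeansDriver;
  import org.apache.mahout.common.distance.EuclideanDistanceMeasure;

  public class NewsKMeansJob {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // input vectors, initial cluster centers, output directory -- illustrative paths
      KMeansDriver.run(conf,
          new Path("reuters-vectors/tfidf-vectors"),
          new Path("reuters-initial-clusters"),
          new Path("reuters-kmeans"),
          new EuclideanDistanceMeasure(),   // distance measure (moved into conf in later versions)
          0.01,                             // convergence delta
          10,                               // maximum iterations
          true,                             // assign points to clusters after convergence
          false);                           // false = run as MapReduce, not sequentially
    }
  }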

The output

Canopy clustering to estimate the number of clusters. Tell the algorithm what size clusters to look for, and it will find the number of clusters that have approximately that size. The algorithm uses two distance thresholds; this prevents points close to an already existing canopy from becoming the center of a new canopy.
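A hedged example of running the canopy job with the two thresholds (t1 > t2; the threshold values and paths are illustrative and depend on the distance measure used):

  # points within t2 of an existing canopy center cannot start a new canopy
  bin/mahout canopy -i reuters-vectors/tfidf-vectors -o reuters-canopy-centroids \
    -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
    -t1 2000 -t2 1500 -ow

The resulting centroids can then be handed to k-means as its -c argument instead of choosing k by hand.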

Running canopy clustering: created fewer than 50 centroids.

News clustering code

News clustering example ==> finding related articles

News clustering code II

News clustering code III

Other clustering algorithms: hierarchical clustering

Different clustering approaches

Classification definition

When to use Mahout for classification?

The advantage of using Mahout for classification

How does a classification system work?

Key terminology for classification

Input and output of a classification model

Four types of values for predictor variables

Sample data that illustrates all four value types

Supervised vs. Unsupervised Learning

Workflow in a typical classification project

Classification Example 1 - Color-Fill. Position looks promising, especially along the x-axis ==> predictor variable. Shape seems to be irrelevant. The target variable is the color-fill label.

Classification Example 2 - Color-Fill (another feature)

Mahout classification algorithms include: Naive Bayes, Complementary Naive Bayes, Stochastic Gradient Descent (SGD), and Random Forest.

Comparing two types of Mahout scalable algorithms

Step-by-step simple classification example: 1. The data and the challenge. 2. Training a model to find color-fill: preliminary thinking. 3. Choosing a learning algorithm to train the model. 4. Improving the performance of the classifier.

Choosing an algorithm via Mahout

Stochastic Gradient Descent (SGD)

Characteristics of SGD
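To make the idea concrete, here is a minimal generic sketch (not Mahout's implementation) of logistic regression trained by SGD: each randomly chosen example updates the weights immediately, which is what keeps every step cheap and makes online learning possible. The toy data and learning rate are illustrative.

  import java.util.Random;

  public class SgdLogisticRegression {
    public static void main(String[] args) {
      // toy data: x = (bias, feature); label y in {0, 1} -- illustrative
      double[][] x = {{1, 0.5}, {1, 1.5}, {1, 2.5}, {1, 3.5}};
      int[] y = {0, 0, 1, 1};

      double[] w = new double[2];
      double learningRate = 0.1;
      Random rnd = new Random(42);

      for (int step = 0; step < 1000; step++) {
        int i = rnd.nextInt(x.length);          // pick one example at random (the "stochastic" part)
        double z = w[0] * x[i][0] + w[1] * x[i][1];
        double p = 1.0 / (1.0 + Math.exp(-z));  // predicted probability of class 1
        double err = y[i] - p;
        for (int j = 0; j < w.length; j++) {    // gradient step on this single example
          w[j] += learningRate * err * x[i][j];
        }
      }
      System.out.printf("learned weights: w0=%.3f w1=%.3f%n", w[0], w[1]);
    }
  }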

Support Vector Machine (SVM): maximize the boundary margin; remember the support vectors; nonlinear kernels.

Naive Bayes. Training set; classifier using Gaussian distribution assumptions; test set ==> classified as female.
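The slide's table and formulas did not survive extraction; a hedged stand-in is the classic gender example with Gaussian class-conditional densities. The numbers below are illustrative, not the slide's, but the computation is the one described: multiply the class prior by a Gaussian likelihood per feature and pick the larger posterior (for these numbers the test sample comes out female).

  public class GaussianNaiveBayes {
    // Gaussian density with mean mu and variance var
    static double gaussian(double x, double mu, double var) {
      return Math.exp(-(x - mu) * (x - mu) / (2 * var)) / Math.sqrt(2 * Math.PI * var);
    }

    public static void main(String[] args) {
      // per-class mean/variance of (height ft, weight lb, foot size in) -- illustrative
      double[] muM = {5.86, 176.25, 11.25}, varM = {0.035, 122.9, 0.917};
      double[] muF = {5.42, 132.5, 7.5},   varF = {0.097, 558.3, 1.667};
      double[] sample = {6.0, 130.0, 8.0};     // test example
      double priorM = 0.5, priorF = 0.5;

      double scoreM = priorM, scoreF = priorF;
      for (int j = 0; j < sample.length; j++) {
        scoreM *= gaussian(sample[j], muM[j], varM[j]);   // naive independence assumption
        scoreF *= gaussian(sample[j], muF[j], varF[j]);
      }
      System.out.println(scoreM > scoreF ? "male" : "female");   // prints "female" for these numbers
    }
  }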

Random Forest. Random forest uses a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features.

Adaboost Example. Adaboost [Freund and Schapire 1996] constructs a strong learner as a linear combination of weak learners: - Start with a uniform distribution ("weights") over training examples (the weights tell the weak learning algorithm which examples are important). - Obtain a weak classifier from the weak learning algorithm, h_t : X -> {-1, 1}. - Increase the weights of the training examples that were misclassified. - Repeat.
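Written out in the standard (textbook) AdaBoost notation, with \(D_t(i)\) the weight of example \(i\) at round \(t\):

\[ \epsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i), \qquad \alpha_t = \tfrac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}, \]
\[ D_{t+1}(i) = \frac{D_t(i)\,\exp\!\big(-\alpha_t\, y_i\, h_t(x_i)\big)}{Z_t}, \qquad H(x) = \operatorname{sign}\!\Big(\sum_{t=1}^{T} \alpha_t\, h_t(x)\Big), \]

where \(Z_t\) normalizes the weights so they sum to one; misclassified examples (\(y_i h_t(x_i) = -1\)) have their weights increased for the next round.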

Example: User Modeling using Time-Sensitive Adaboost. Obtain a simple classifier on each feature, e.g., by setting a threshold on a parameter or by binary inference on an input parameter. The system classifies whether a new document interests a person via Adaptive Boosting (Adaboost): the final classifier is a linear weighted combination of single-feature classifiers. Given the single-feature simple classifiers, weights are assigned to the training samples based on whether a sample is correctly or mistakenly classified <== Boosting. Classifiers are considered sequentially; the weights selected for previously considered classifiers affect the weights to be selected for the remaining classifiers <== Adaptive. According to the summed error of each simple classifier, a weight is assigned to it; the final classifier is then the weighted linear combination of these simple classifiers. Our new Time-Sensitive Adaboost algorithm: in the original AdaBoost algorithm, all samples are regarded as equally important at the beginning of the learning process; we propose a time-adaptive AdaBoost algorithm that assigns larger weights to the latest training samples. People select apples according to their shapes, sizes, other people's interest, etc.; each attribute is a simple classifier used in Adaboost.

Time-Sensitive Adaboost [Song et al. 2005]

Evaluate the model. AUC (0 ~ 1): 1 = perfect, 0 = perfectly wrong, 0.5 = random. Confusion matrix.

Average Precision - commonly used for sorted results. Average Precision is the metric used for evaluating sorted results, commonly used for search & retrieval, anomaly detection, etc. Average Precision = the average of the precision values at all correct answers, i.e., compute the precision value up to each of the top n correct answers and average all P_n.
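Written out: with \(R\) relevant items in total, \(\mathrm{AP} = \frac{1}{R}\sum_{k} P(k)\,\mathrm{rel}(k)\), where \(P(k)\) is the precision at rank \(k\) and \(\mathrm{rel}(k)=1\) if the item at rank \(k\) is correct. For example (illustrative numbers), if the correct answers appear at ranks 1, 3, and 5, then \(\mathrm{AP} = \frac{1}{3}\big(\frac{1}{1} + \frac{2}{3} + \frac{3}{5}\big) \approx 0.76\).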

Confusion Matrix

See Training Results

Number of Training Examples vs. Accuracy

Classifiers that go bad

Target leak. A target leak is a bug that involves unintentionally providing data about the target variable among the predictor variables. Don't confuse it with intentionally including the target variable in the record of a training example. Target leaks can seriously affect the accuracy of the classification system.

Example: Target Leak

Avoid Target Leaks

Avoid Target Leaks II

Imperfect Learning for Autonomous Concept Modeling. Reference: C.-Y. Lin et al., SPIE EI West, 2005.

A solution for the scalability issues at training. Autonomous Learning of Video Concepts through Imperfect Training Labels: develop theories and algorithms for supervised concept learning from imperfect annotations -- imperfect learning; develop methodologies to obtain imperfect annotations from cross-modality information or web links; develop algorithms and systems to generate concept models -- a novel generalized Multiple-Instance Learning algorithm with Uncertain Labeling Density. Autonomous Concept Learning - Imperfect Learning - Cross-Modality Training.

What is Imperfect Learning? Definitions from the Machine Learning Encyclopedia: Supervised learning: a machine learning technique for creating a function from training data. The training data consist of pairs of input objects and desired outputs. The output of the function can be a continuous value (regression), or can predict a class label of the input object (classification). The task is to predict the value of the function for any valid input object after having seen only a small number of training examples; the learner has to generalize from the presented data to unseen situations in a "reasonable" way. Unsupervised learning: a method of machine learning where a model is fit to observations. It is distinguished from supervised learning by the fact that there is no a priori output. A data set of input objects is gathered; unsupervised learning then typically treats the input objects as a set of random variables and builds a joint density model for the data set. Proposed definition of imperfect learning: a supervised learning technique with imperfect training data. The training data consist of pairs of input objects and desired outputs, but there may be error or noise in the desired outputs. The input objects are typically treated as a set of random variables.

Why do we need Imperfect Learning? Annotation is a must for supervised learning: all (or almost all?) modeling/fusion techniques in our group used annotation for training. However, annotation is time- and cost-consuming. Previous efforts focused on improving annotation efficiency: minimum GUI interaction, template matching, active learning, etc. Is there a way to avoid annotation? Use imperfect training examples that are obtained automatically/unsupervisedly from other learning machines. These machines can be built on other modalities or on prior machines from a related dataset domain. Autonomous Concept Learning - Imperfect Learning - Cross-Modality Training [Lin 03].

Proposition. Supervised learning: time consuming; a lot of time is spent on annotation. Unsupervised continuous learning: when will it beat supervised learning? [Figure: accuracy of the tested model vs. accuracy of the training data and number of training examples.]

The key objective of this paper: can concept models be learned from imperfect labeling? Example: the effect of imperfect labeling on classifiers (left -> right: perfect labeling, imperfect labeling, misclassified area).

False positives in imperfect learning. Assume we have ten positive examples and ten negative examples. If one positive example is wrong (a false positive), how will it affect the SVM? Will the system break down? Will the accuracy decrease significantly? If the ratio changes, what happens to the result? Does it depend on the testing set? As time goes by and we accumulate more and more training data, what is the effect? Under what circumstances does the effect of false positives decrease? In what situations does it persist? Assume the distribution of features in the testing data is similar to that of the training data. When will it

Imperfect Learning. If the learning examples are not perfect, what will be the result? If you teach something wrong, what will be the consequence? Case 1: false positives only. Case 2: false positives and false negatives. Case 3: learning examples carry a confidence value.

From Heisenberg's uncertainty principle. From Heisenberg's uncertainty principle, everything is random; it is not exactly measurable. Thus, we can assume a random distribution of positive and negative examples. Assume there are two Gaussians in the feature space: one positive, the other negative. Consider two situations: first, every positive sample comes from the positive class and every negative sample from the negative class; second, there may be some random mistakes in the negatives. Also consider two cases: 1. the two Gaussians overlap; 2. they do not. So maybe these can be reduced to a variable based on the mean and sigma. If the training samples of the SVM are random, what will the result be? Is it predictable in a closed mathematical form? How about using clean examples in the beginning and then the random examples next?

False positive samples. Will false positive examples become support vectors? Very likely. We can also assume a random variable here. Maybe we can also use partially right data, putting more weight on the positive ones; then the uncertain ones have less chance of becoming support vectors. Will it work if, when a support vector is picked, we treat the uncertainty as a probability? Or should we compare it to other support vectors? This can be an interesting issue. It's like the human brain: the first thing you learn, you remember; the later ones you may forget. The more often something is learned, the more it will be picked; the less often it happens, the more easily it is forgotten. Maybe I can even develop a theory to simulate human memory: uncertainty can be a time function, and the importance of a support vector can also be a time function, so sometimes the machine will forget things. This makes it possible to adapt and adjust to the outside environment. Maybe I can develop a theory of continuous learning, or continuous learning based on imperfect memory. In this way, the learning machine will be affected mostly by the current data; for the old data, it will put less weight -- this may be reflected in the distance function. Our goal is to have a very large training set and remember a lot of things, so we need to learn to forget.

Imperfect Learning: theoretical feasibility. Imperfect learning can be modeled as the issue of noisy training samples in supervised learning. Learnability of concept classifiers can be determined by the probably approximately correct (PAC) learnability theorem. Given a set of classifiers of a fixed type, PAC learnability identifies a minimum bound on the number of training samples required for a fixed performance requirement. If there is noise in the training samples, the above-mentioned minimum bound can be modified to reflect this situation. The ratio of required samples is independent of the requirement on classifier performance. Observation: practical simulations using SVM training and detection also verify this theorem. [Figure: theoretical requirement on the number of samples needed for noisy vs. perfect training samples.]

PAC-identifiable. PAC stands for probably approximately correct. Roughly, a class of concepts C (defined over an input space with examples of size N) is PAC-learnable by a learning algorithm L if, for arbitrarily small δ and ε, for all concepts c in C, and for all distributions D over the input space, there is a probability of at least 1-δ that the hypothesis h selected from space H by learning algorithm L is approximately correct (has error less than ε):
\( \Pr\big( \Pr_{x \sim D}[\, h(x) \neq c(x) \,] > \epsilon \big) \leq \delta. \)
Based on PAC learnability, assume we have m independent examples. For a given hypothesis with error greater than ε, the probability that none of the m examples is misclassified is (1-ε)^m, which we want to be less than δ; in other words, we want (1-ε)^m ≤ δ. Since (1-x) ≤ e^{-x} for any 0 ≤ x < 1, we then have
\( m \geq \frac{1}{\epsilon} \ln\frac{1}{\delta}. \)
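For example (illustrative numbers), with \(\epsilon = 0.1\) and \(\delta = 0.05\) this bound asks for \(m \geq 10 \ln 20 \approx 30\) independent examples.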

Sample Size vs. VC dimension. Theorem 2: Let C be a nontrivial, well-behaved concept class. If the VC dimension of C is d, where d < ∞, then for 0 < ε < 1 and
\( m \geq \max\!\left( \frac{4}{\epsilon}\log_2\frac{2}{\delta},\; \frac{8d}{\epsilon}\log_2\frac{13}{\epsilon} \right), \)
any consistent function A: S_C -> C is a learning function for C; and, for 0 < ε < 1/2, m has to be larger than or equal to a lower bound,
\( m \geq \max\!\left( \frac{1-\epsilon}{\epsilon}\ln\frac{1}{\delta},\; d\big(1 - 2(\epsilon(1-\delta) + \delta)\big) \right). \)
For any m smaller than the lower bound, there is no function A: S_C -> H, for any hypothesis space H, that is a learning function for C. (The sample space of C, denoted S_C, is the set of all labeled samples of concepts in C.)

How many training samples are required? Examples of the number of training samples required at different error bounds for a PAC-identifiable hypothesis. The figure shows the upper and lower bounds of Theorem 2. The upper bound is usually referred to as the sample capacity, which guarantees learnability from the training samples.

Noisy Samples. Theorem 4: Let η < 1/2 be the rate of classification noise and N the number of rules in the class C. Assume 0 < ε, η < 1/2. Then the number of examples m required is at least
\( m \geq \max\!\left( \frac{\ln(2/\delta)}{\ln\!\big(1/(1-\epsilon(1-2\eta))\big)},\; \log_2 N \,\big(1 - 2(\epsilon(1-\delta) + \delta)\big) \right) \)
and at most
\( m \leq \frac{\ln(N/\delta)}{\epsilon\,\tfrac{1}{2}\big(1 - \exp(-2(1-2\eta)^2)\big)}. \)
r is the ratio of the required noisy training samples to the noise-free training samples:
\( r_\eta = \Big(\tfrac{1}{2}\big(1 - \exp(-2(1-2\eta)^2)\big)\Big)^{-1}. \)

Training samples required when learning from noisy examples. Ratio of the training samples required to achieve PAC-learnability under noisy vs. noise-free sampling environments. This ratio is consistent across different error bounds and VC dimensions of the PAC-learnable hypothesis.

Learning from Noisy Examples with an SVM. For an SVM, the VC dimension is bounded by
\( d \leq \min\!\big(\Lambda^2 R^2 + 1,\; n + 1\big), \)
where Λ bounds the norm of the weight vector, R is the radius of the smallest sphere enclosing the data, and n is the input dimension.

Experiments - 1. Examples of the effect of noisy training examples on model accuracy. Three rounds of testing results are shown in this figure. Model performance does not decrease significantly as long as the probability of correct labels in the training samples is larger than 60% - 70%. We also see a reverse effect of the training samples when the mislabeling probability is larger than 0.5.

Experiments - 2. Experiments on the effect of noisy training examples on visual concept model accuracy. Three rounds of testing results are shown in this figure. We simulated annotation noise by randomly changing positive examples in the manual annotations to negatives. Because perfect annotation is not available, accuracy is shown as a ratio relative to the manual annotations in [10]. In this figure, we see that model accuracy is not significantly affected by small amounts of noise. A similar drop is observed at around 60% - 70% annotation accuracy (i.e., 30% - 40% of annotations missing).

Conclusion. This paper proves that imperfect learning is possible. In general, the performance of SVM classifiers does not degrade much as long as the manual annotation accuracy is larger than about 70%. Continuous imperfect learning should have a great impact in autonomous learning scenarios.

Homework #2 (due October 12th)
1. Recommendation: 1-1. Choose any two datasets you can get from any public data source. 1-2. Try various recommendation algorithms provided by Mahout or Spark.
2. Clustering: using datasets from 1. online news (e.g., New York Times articles from September 2017, or other data sources), 2. Wikipedia articles, 3. (optional) data gathered from the Twitter API. Do clustering ==> find related documents.
3. Classification: 3-1. Using two datasets to be provided by the TA, try various classification algorithms provided by Mahout or Spark, and discuss their performance. 3-2. Do similar experiments on the Wikipedia data that you downloaded.

Questions?