Practical Methods for the Analysis of Big Data


Practical Methods for the Analysis of Big Data. Module 4: Clustering, Decision Trees, and Ensemble Methods. Philip A. Schrodt, The Pennsylvania State University (schrodt@psu.edu). Workshop at the Odum Institute, University of North Carolina, Chapel Hill, 20-21 May 2013

Topics: Module 4. Clustering: K-Means; Hierarchical Clustering: Dendrograms; Comparisons; Generating Dendrograms from LDA Topics. Classification Trees. Ensemble Methods: Bayesian Model Averaging; Random Forests™; Boosting

General comments: Clustering requires a metric for the distance between cases, and there are many to choose from. In contrast to linear approaches, but similar to SVM, clustering assumes heterogeneous subpopulations. Clustering is typically depicted in two dimensions, but it is usually computed in an arbitrarily high-dimensional space.
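
To make the point about metrics concrete, here is a minimal sketch (not from the slides; SciPy is an assumed tool choice) showing that the same pair of cases can be near or far depending on the metric:

```python
# Illustrative only: compare common distance metrics on one pair of cases.
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 0.0, 2.0, 5.0])
y = np.array([0.0, 1.0, 4.0, 2.0])

print("euclidean:", distance.euclidean(x, y))   # straight-line distance
print("manhattan:", distance.cityblock(x, y))   # sum of coordinate differences
print("chebyshev:", distance.chebyshev(x, y))   # largest single difference
print("cosine:   ", distance.cosine(x, y))      # 1 - angle-based similarity
```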

Cluster Example 1. Exercise: search Google Images for "cluster analysis" for a zillion examples.

Cluster Example 2 [this had something to do with herpetology, perhaps explaining the importance of road crossings]

Intuitive Clustering Diagrams from Michael Levitt, Structural Biology, Stanford Source: http://csb.stanford.edu/class/public/lectures/lec4/lecture6/data_visualization/images/intuitive_clustering.jpg

Overview of distance metrics Source: http://www.improvedoutcomes.com/docs/websitedocs/clustering/clustering_parameters/distance_metrics_overview.htm

K-Means Source: http://csb.stanford.edu/class/public/lectures/lec4/lecture6/data_visualization/images/k-means_clustering.jpg

K-Means algorithm Source: http://biology.unm.edu/biology/maggieww/public_html/k-means.gif

K-Means: Issues. Results vary depending on the number of clusters. Results also vary depending on the random starting points: one approach is to run the algorithm a number of times and see which clusters consistently emerge.
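
A minimal sketch of that restart strategy, assuming scikit-learn (the slides do not name a package); KMeans reruns the algorithm from n_init random starts and keeps the solution with the lowest within-cluster sum of squares:

```python
# Illustrative only: k-means over several cluster counts, with restarts.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))        # stand-in for real cases

for k in (2, 3, 4, 5):               # results vary with the number of clusters
    km = KMeans(n_clusters=k, n_init=25, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))  # within-cluster sum of squares
```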

Let's go exploring! Google Image Search: "k means clustering"

Hierarchical Clustering Source: http://csb.stanford.edu/class/public/lectures/lec4/lecture6/data_visualization/images/hierarchical_clustering.jpg
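
To reproduce figures like the one sourced above, here is a minimal sketch using SciPy's hierarchical-clustering tools (an assumed tool choice, not the slides' code):

```python
# Illustrative only: agglomerative clustering and a dendrogram.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))     # stand-in for real cases

Z = linkage(X, method="ward")    # the agglomerative merge tree
dendrogram(Z)
plt.ylabel("Height")
plt.show()
```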

Comparison Strategy: Words that are similar should co-occur in topics more frequently. For a pair of top-words, let their similarity-weight be the number of times that the pair appears within all top-word vectors. Distance between two vectors: a constant minus the sum of the similarity-weights for word-pairs that occur across the two top-word vectors.
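
A minimal sketch of this strategy; the toy topics, the helper names, and the constant C are illustrative assumptions, not the original implementation:

```python
# Illustrative only: similarity-weight distance between top-word vectors.
from itertools import combinations, product
from collections import Counter

topics = [["election", "vote", "party"],      # toy top-word vectors
          ["vote", "party", "parliament"],
          ["terror", "attack", "police"]]

# similarity-weight: how often each word pair co-occurs within a topic
pair_weight = Counter()
for words in topics:
    for pair in combinations(sorted(set(words)), 2):
        pair_weight[pair] += 1

def topic_distance(t1, t2, C=100):
    """C minus the summed similarity-weights of pairs spanning t1 and t2."""
    total = sum(pair_weight[tuple(sorted((a, b)))]
                for a, b in product(t1, t2) if a != b)
    return C - total

print(topic_distance(topics[0], topics[1]))   # similar topics: smaller
print(topic_distance(topics[0], topics[2]))   # unrelated topics: larger
```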

Comparing Topics: Combined Sample. [Figure: dendrogram for topic vectors, all countries; height axis 500-4000; leaves include Democracy, Media, Diplomacy, Comments, Ceremony, Parliament, Nuclear, Economy, Smuggling, Crime, Terrorism, Accidents, Protest, Military, and Violence, with labeled groups for Diplomacy, Negotiation, Econ-Coop, Parliament, and Election.]

Comparing Topics: France. [Figure: dendrogram for topic vectors, France; height axis 500-3500; leaves include Election, Business, Economy, Nuclear, Diplomacy, Development, EU, IOs, Peacekeeping, Military, Judiciary, Terrorism, Violence, Immigration, Protest, Travel, and Culture.]

Comparing Topics: Turkey. [Figure: dendrogram for topic vectors, Turkey; height axis 500-4000; leaves include Diplomacy, Development, EU, Cyprus, Parliament, Election, Business, Energy, Ceremony, Military, Terrorism, Smuggling, Judiciary, Genocide, Disaster, and Accidents.]

Comparing Topics across Countries: Europe. [Figure: dendrogram for topic vectors across Europe (ALL, FRA, GRC, NOR, POL); the topics group into clusters labeled Diplomacy, Elections, Crime, Economy, EU, Nucs, Refugees, Culture, Governance, Military, Protest, and Accidents.]

Comparing Topics across Countries: Middle East. [Figure: dendrogram for topic vectors across the Middle East (ALL, EGY, ISR, JOR, TUR); the topics group into clusters labeled Diplomacy, Elections, Legal, Violence, ISR/PSE, Society, Nuc, Instability, Econ, Media, and Comments.]

Let's go exploring! Google Image Search: "dendrograms"

Topics: Module 4. Clustering: K-Means; Hierarchical Clustering: Dendrograms; Comparisons; Generating Dendrograms from LDA Topics. Classification Trees. Ensemble Methods: Bayesian Model Averaging; Random Forests™; Boosting

Classification Tree Example Source: http://orange.biolab.si/doc/ofb/c_otherclass.htm

Classification Tree Example

Let's go exploring! Google Image Search: "classification tree"

Classification Tree with Continuous Breakpoints [this has something to do with classifying basalts] Source: http://www.ucl.ac.uk/~ucfbpve/papers/vermeeschgca2006/w3441-rev37x.png

ID3 Algorithm: 1. Calculate the entropy of every attribute using the data set S. 2. Split the set S into subsets using the attribute for which entropy is minimum (or, equivalently, information gain is maximum). 3. Make a decision tree node containing that attribute. 4. Recurse on the subsets using the remaining attributes. Source: http://en.wikipedia.org/wiki/ID3_algorithm
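
A minimal sketch of the entropy-minimizing attribute choice at the core of ID3; the toy weather data and helper names are illustrative, not from the slides:

```python
# Illustrative only: pick the attribute whose split minimizes entropy.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_entropy(rows, labels, attr):
    """Weighted entropy of the subsets produced by splitting on attr."""
    subsets = {}
    for row, y in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(y)
    return sum(len(ys) / len(rows) * entropy(ys) for ys in subsets.values())

rows = [{"outlook": "sunny", "windy": "no"},
        {"outlook": "sunny", "windy": "yes"},
        {"outlook": "rain",  "windy": "no"},
        {"outlook": "rain",  "windy": "yes"}]
labels = ["no", "yes", "no", "yes"]

# minimum split entropy = maximum information gain
best = min(rows[0], key=lambda a: split_entropy(rows, labels, a))
print(best)   # -> windy (its split separates the classes perfectly)
```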

Entropy: definition. Source: http://en.wikipedia.org/wiki/Entropy_%28information_theory%29
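
The formula on this slide did not survive transcription; the standard definition from the cited article is H(S) = -Σ_i p_i log2(p_i), where p_i is the proportion of cases in S belonging to class i. A pure subset has H = 0; a 50/50 binary split has H = 1 bit.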

C4.5 Algorithm. C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The training data is a set S = s_1, s_2, ... of already classified samples. Each sample s_i consists of a p-dimensional vector (x_{1,i}, x_{2,i}, ..., x_{p,i}), where the x_j represent attributes or features of the sample, as well as the class in which s_i falls. At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. The splitting criterion is the normalized information gain (difference in entropy). The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurses on the smaller sublists. Source: http://en.wikipedia.org/wiki/C4.5_algorithm

C4.5 vs. ID3. C4.5 made a number of improvements to ID3. Some of these are: Handling both continuous and discrete attributes: in order to handle continuous attributes, C4.5 creates a threshold and then splits the list into those whose attribute value is above the threshold and those that are less than or equal to it. Handling training data with missing attribute values: C4.5 allows attribute values to be marked as '?' for missing; missing attribute values are simply not used in gain and entropy calculations. Handling attributes with differing costs. Pruning trees after creation: C4.5 goes back through the tree once it's been created and attempts to remove branches that do not help by replacing them with leaf nodes. Source: http://en.wikipedia.org/wiki/C4.5_algorithm
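
A minimal sketch of that continuous-attribute thresholding (a simplified illustration, not Quinlan's implementation, which uses normalized information gain rather than raw entropy):

```python
# Illustrative only: choose a threshold for a continuous attribute by
# testing each midpoint between sorted values and keeping the split
# with the lowest weighted entropy.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    pairs = sorted(zip(values, labels))
    best = (float("inf"), None)
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        best = min(best, (h, t))
    return best   # (weighted entropy, threshold)

print(best_threshold([2.1, 3.5, 3.6, 7.0], ["a", "a", "b", "b"]))
# -> (0.0, 3.55): splitting at 3.55 separates the classes perfectly
```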

Neural networks. Developed by Geoffrey Hinton, who, through the magic of the internet, is here to explain: https://www.coursera.org/course/neuralnets

Topics: Module 4. Clustering: K-Means; Hierarchical Clustering: Dendrograms; Comparisons; Generating Dendrograms from LDA Topics. Classification Trees. Ensemble Methods: Bayesian Model Averaging; Random Forests™; Boosting

Bayesian Model Averaging. Systematically integrates the information provided by all combinations of variables. The result is the overall posterior probability that a variable is important, without having to generate hundreds of papers and thousands of nonrandomly discarded models. Machine-learning work suggests that systematic assessment of models gives about 10% better accuracy with much less information, and completely eliminates the need for vaguely defined indicators. Predictions can be made using an ensemble of all of the models; in meteorology and finance, these models are generally more robust in out-of-sample evaluations. The framework is Bayesian rather than frequentist, which eliminates a long list of philosophical and interpretive problems with the frequentist approach.
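
A minimal sketch of BMA-style variable-inclusion probabilities using the standard BIC approximation to posterior model weights; the simulated data, the statsmodels dependency, and the flat prior over models are all assumptions for illustration:

```python
# Illustrative only: fit every variable subset, weight models by
# exp(-BIC/2), and sum the weights of models containing each variable.
from itertools import combinations
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] - 2 * X[:, 2] + rng.normal(size=200)   # only vars 0 and 2 matter

models, bics = [], []
for k in range(1, 5):
    for subset in combinations(range(4), k):
        fit = sm.OLS(y, sm.add_constant(X[:, subset])).fit()
        models.append(subset)
        bics.append(fit.bic)

# posterior model weights; subtract the minimum BIC for numerical stability
w = np.exp(-0.5 * (np.array(bics) - min(bics)))
w /= w.sum()

for var in range(4):
    p = sum(wi for subset, wi in zip(models, w) if var in subset)
    print(f"P(variable {var} belongs in the model) = {p:.3f}")
```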

The problem of controls. For starters, they aren't controls, they are just another variable, often in a really bad neighborhood: nature bats last in (X'X)^(-1)X'y. For something closer to a control, use case matching or Bayesian priors. Numerous studies over the past 50 years (all ignored) have suggested that simple models are better. In many forecasting models, there is no obvious theoretical reason for using any particular measure, so instead we have to assess multiple measures of the same latent concept: power, legitimacy, authoritarianism. This is a feature, not a bug, but regression approaches have terrible pathologies in these situations. Currently, we laboriously work through all of these options across scores of journal and conference papers presented over the course of years.* (*So if BMA really catches on, a number of journals and tenure cases are doomed. On the former, how sad. On the latter: be afraid, be very afraid.)

BMA: variable inclusion probabilities

BMA: Posterior probabilities

Random Forests™: Breiman's Algorithm. Each tree is constructed using the following algorithm: 1. Let the number of training cases be N, and the number of variables in the classifier be M. 2. We are told the number m of input variables to be used to determine the decision at a node of the tree; m should be much less than M. 3. Choose a training set for this tree by choosing n times with replacement from all N available training cases (i.e., take a bootstrap sample). Use the rest of the cases to estimate the error of the tree by predicting their classes. 4. For each node of the tree, randomly choose m variables on which to base the decision at that node. Calculate the best split based on these m variables in the training set. 5. Each tree is fully grown and not pruned (as may be done in constructing a normal tree classifier). For prediction, a new sample is pushed down the tree. It is assigned the label of the training sample in the terminal node it ends up in. This procedure is iterated over all trees in the ensemble, and the mode vote of all trees is reported as the random forest prediction. Source: http://en.wikipedia.org/wiki/Random_forest
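
A minimal sketch mapping these steps onto scikit-learn's RandomForestClassifier (an assumed tool choice; the slide describes the algorithm, not code):

```python
# Illustrative only: Breiman's steps as scikit-learn parameters.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,      # number of trees in the ensemble
    bootstrap=True,        # step 3: sample N cases with replacement
    oob_score=True,        # step 3: held-out cases estimate the error
    max_features="sqrt",   # step 4: m << M variables tried at each node
    random_state=0,
).fit(X, y)

print("out-of-bag error estimate:", 1 - rf.oob_score_)
print("mode-vote prediction:", rf.predict(X[:1]))
```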

This sucker is trade-marked! Random Forests(tm) is a trademark of Leo Breiman and Adele Cutler and is licensed exclusively to Salford Systems for the commercial release of the software. Our trademarks also include RF(tm), RandomForests(tm), RandomForest(tm) and Random Forest(tm). For details: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

Features of Random Forests. Breiman et al. claim the following: It is unexcelled in accuracy among current algorithms. It runs efficiently on large data bases. It can handle thousands of input variables without variable deletion. It gives estimates of what variables are important in the classification. It generates an internal unbiased estimate of the generalization error as the forest building progresses. It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing. It has methods for balancing error in class-population-unbalanced data sets. Generated forests can be saved for future use on other data. Prototypes are computed that give information about the relation between the variables and the classification. It computes proximities between pairs of cases that can be used in clustering, locating outliers, or (by scaling) giving interesting views of the data. The capabilities of the above can be extended to unlabeled data, leading to unsupervised clustering, data views and outlier detection. It offers an experimental method for detecting variable interactions. Random Forests™ may also cure acne, remove cat hair from upholstery, and show promise for bringing peace to the Middle East, though Breiman et al. do not explicitly make these claims. Source: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#features

AdaBoost. AdaBoost, short for Adaptive Boosting, is a machine learning algorithm formulated by Yoav Freund and Robert Schapire. It is a meta-algorithm, and can be used in conjunction with many other learning algorithms to improve their performance. AdaBoost is adaptive in the sense that subsequent classifiers built are tweaked in favor of those instances misclassified by previous classifiers. AdaBoost is sensitive to noisy data and outliers. In some problems, however, it can be less susceptible to the overfitting problem than most learning algorithms. The classifiers it uses can be weak (i.e., display a substantial error rate), but as long as their performance is slightly better than random (i.e., their error rate is smaller than 0.5 for binary classification), they will improve the final model. Even classifiers with an error rate higher than would be expected from a random classifier will be useful, since they will have negative coefficients in the final linear combination of classifiers and hence behave like their inverses. AdaBoost generates and calls a new weak classifier in each of a series of rounds t = 1, ..., T. For each call, a distribution of weights D_t is updated that indicates the importance of examples in the data set for the classification. On each round, the weights of each incorrectly classified example are increased, and the weights of each correctly classified example are decreased, so the new classifier focuses on the examples which have so far eluded correct classification. Source: http://en.wikipedia.org/wiki/AdaBoost
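
A minimal sketch of that reweighting loop with depth-1 decision stumps as the weak classifiers; a simplified illustration under assumed choices (scikit-learn stumps, T = 50), not Freund and Schapire's exact formulation:

```python
# Illustrative only: AdaBoost's weight updates D_t over rounds t = 1..T.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=300, random_state=0)
y = 2 * y01 - 1                        # labels in {-1, +1}

n = len(y)
D = np.full(n, 1 / n)                  # initial weight distribution D_1
stumps, alphas = [], []

for t in range(50):                    # rounds t = 1..T
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
    pred = stump.predict(X)
    err = D[pred != y].sum()           # weighted error; should stay below 0.5
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
    D *= np.exp(-alpha * y * pred)     # raise weights on mistakes, lower on hits
    D /= D.sum()
    stumps.append(stump)
    alphas.append(alpha)

# final model: sign of the alpha-weighted vote of all weak classifiers
F = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", (np.sign(F) == y).mean())
```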

Go to: adaboost_matas.pdf http://www.robots.ox.ac.uk/~az/lectures/cv/adaboost_matas.pdf