Big Data Infrastructure, CS 489/698 (Winter 2017)


Big Data Infrastructure, CS 489/698 (Winter 2017). Week 8: Data Mining (2/4). March 2, 2017. Jimmy Lin, David R. Cheriton School of Computer Science, University of Waterloo. These slides are available at http://lintool.github.io/bigdata-2017w/. This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License; see http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

The Task. Given: $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i = [x_1, x_2, x_3, \ldots, x_d]$ is a (sparse) feature vector and $y \in \{0, 1\}$ is the label. Induce $f : X \to Y$ such that the loss is minimized: $\frac{1}{n} \sum_{i=0}^{n} \ell(f(x_i), y_i)$, where $\ell$ is the loss function. Typically, we consider functions of a parametric form: $\arg\min_{\theta} \frac{1}{n} \sum_{i=0}^{n} \ell(f(x_i; \theta), y_i)$, where $\theta$ are the model parameters.

Gradient Descent: $\theta^{(t+1)} = \theta^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=0}^{n} \nabla \ell(f(x_i; \theta^{(t)}), y_i)$. Source: Wikipedia (Hills)
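To make the update rule concrete, here is a minimal plain-Scala sketch of one full-batch gradient-descent step for logistic regression; the Point case class, the dot helper, and the learning rate gamma are illustrative assumptions, not part of the original slides:

import scala.math.exp

case class Point(x: Array[Double], y: Double) // y in {-1, +1}

object BatchGD {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  // One full-batch update: w := w - gamma * (1/n) * sum over i of grad(loss_i)
  def step(points: Seq[Point], w: Array[Double], gamma: Double): Array[Double] = {
    val n = points.size.toDouble
    val grad = new Array[Double](w.length)
    for (p <- points) {
      // gradient of the logistic loss for one instance
      val coeff = (1.0 / (1.0 + exp(-p.y * dot(w, p.x))) - 1.0) * p.y
      for (j <- w.indices) grad(j) += coeff * p.x(j)
    }
    w.indices.map(j => w(j) - gamma * grad(j) / n).toArray
  }
}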

MapReduce Implementation: $\theta^{(t+1)} = \theta^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=0}^{n} \nabla \ell(f(x_i; \theta^{(t)}), y_i)$. [Diagram: mappers compute partial gradients; a single reducer updates the model; iterate until convergence.]

Spark Implementation:

val points = spark.textFile(...).map(parsePoint).persist()
var w = // random initial vector
for (i <- 1 to ITERATIONS) {
  val gradient = points.map { p =>
    p.x * (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y
  }.reduce((a, b) => a + b)
  w -= gradient
}

[Diagram: the map stage computes partial gradients; the reduce updates the model.]

Gradient Descent Source: Wikipedia (Hills)

Stochastic Gradient Descent Source: Wikipedia (Water Slide)

Batch vs. Online. Gradient Descent: $\theta^{(t+1)} = \theta^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=0}^{n} \nabla \ell(f(x_i; \theta^{(t)}), y_i)$. Batch learning: update the model after considering all training instances. Stochastic Gradient Descent (SGD): $\theta^{(t+1)} = \theta^{(t)} - \gamma^{(t)} \nabla \ell(f(x; \theta^{(t)}), y)$. Online learning: update the model after considering each (randomly selected) training instance. In practice, just as good! Opportunity to interleave prediction and learning!
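A minimal sketch of the online variant, under the same illustrative assumptions as the batch sketch above (the shuffle and the per-instance update are the essential differences):

import scala.math.exp
import scala.util.Random

object SGD {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  // One epoch of SGD: update the model after each randomly selected
  // instance, instead of once per full pass over the data.
  def epoch(data: Seq[(Array[Double], Double)], w0: Array[Double], gamma: Double): Array[Double] = {
    var w = w0
    for ((x, y) <- Random.shuffle(data)) {
      val coeff = (1.0 / (1.0 + exp(-y * dot(w, x))) - 1.0) * y
      w = w.indices.map(j => w(j) - gamma * coeff * x(j)).toArray
    }
    w
  }
}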

Practical Notes. The order of the instances is important! Most common implementation: randomly shuffle the training instances. Single- vs. multi-pass approaches. Mini-batching as a middle ground, as sketched below. We've solved the iteration problem! What about the single-reducer problem?
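A minimal sketch of mini-batching under the same assumptions: shuffle once per epoch, then apply one (averaged) gradient update per batch rather than per instance or per full pass:

import scala.util.Random

object MiniBatch {
  // Split a shuffled epoch into batches of `batchSize` instances;
  // each batch drives a single gradient update.
  def epochBatches[T](data: Seq[T], batchSize: Int): Iterator[Seq[T]] =
    Random.shuffle(data).grouped(batchSize)
}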

Ensembles Source: Wikipedia (Orchestra)

Ensemble Learning. Learn multiple models, then combine the results from the different models to make a prediction. Common implementation: train classifiers on different input partitions of the data. Embarrassingly parallel! Combining predictions: majority voting; simple weighted voting, $y = \arg\max_{y \in \mathcal{Y}} \sum_{k=1}^{n} p_k(y \mid x)$; model averaging.
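A minimal sketch of the first two combination rules, with labels as Ints and per-classifier label scores standing in for $p_k(y \mid x)$ (the representation is an illustrative assumption):

object Voting {
  // Majority voting over hard predictions.
  def majority(votes: Seq[Int]): Int =
    votes.groupBy(identity).maxBy(_._2.size)._1

  // Simple weighted voting: pick the label with the largest summed score
  // p_k(y | x); each classifier k contributes a score for every label.
  def weighted(scores: Seq[Map[Int, Double]]): Int =
    scores.flatten
      .groupBy(_._1)
      .map { case (label, ps) => label -> ps.map(_._2).sum }
      .maxBy(_._2)
      ._1
}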

Ensemble Learning. Learn multiple models, then combine the results from the different models to make a prediction. Why does it work? If the errors are uncorrelated, multiple classifiers being wrong at the same time is less likely: for example, three independent classifiers that each err with probability 0.3 produce a wrong majority vote with probability $0.3^3 + 3 \cdot 0.3^2 \cdot 0.7 \approx 0.22$. Reduces the variance component of error.

MapReduce Implementation: $\theta^{(t+1)} = \theta^{(t)} - \gamma^{(t)} \nabla \ell(f(x; \theta^{(t)}), y)$. [Diagram: each mapper consumes one partition of the training data and runs its own learner.]

MapReduce Implementation: $\theta^{(t+1)} = \theta^{(t)} - \gamma^{(t)} \nabla \ell(f(x; \theta^{(t)}), y)$. [Diagram: mappers feed the partitions of the training data to a smaller number of reducers, each of which runs a learner.]

MapReduce Implementation: $\theta^{(t+1)} = \theta^{(t)} - \gamma^{(t)} \nabla \ell(f(x; \theta^{(t)}), y)$. How do we output the model? Option 1: write the model out as side data. Option 2: emit the model as intermediate output.

What about Spark? $\theta^{(t+1)} = \theta^{(t)} - \gamma^{(t)} \nabla \ell(f(x; \theta^{(t)}), y)$. mapPartitions with f: Iterator[T] => Iterator[U] maps an RDD[T] to an RDD[U]; the learner is the function f.
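A hedged Spark sketch of this pattern: mapPartitions hands each partition's Iterator[T] to an independent learner and yields one model per partition, i.e., an ensemble. The sequential SGD learner, the file paths, the feature dimensionality, and the parse function are all assumptions for illustration:

import org.apache.spark.{SparkConf, SparkContext}
import scala.math.exp

object EnsembleTraining {
  case class Instance(x: Array[Double], y: Double) // y in {-1, +1}

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  // The "learner" box: sequential SGD over one partition's instances.
  def train(data: Iterator[Instance], dim: Int, gamma: Double): Array[Double] = {
    val w = new Array[Double](dim)
    for (p <- data) {
      val coeff = (1.0 / (1.0 + exp(-p.y * dot(w, p.x))) - 1.0) * p.y
      for (j <- w.indices) w(j) -= gamma * coeff * p.x(j)
    }
    w
  }

  // Assumed input format: label followed by feature values, whitespace-separated.
  def parse(line: String): Instance = {
    val parts = line.split("\\s+")
    Instance(parts.tail.map(_.toDouble), parts.head.toDouble)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ensemble"))
    val training = sc.textFile("training.txt").map(parse)
    // f: Iterator[T] => Iterator[U], exactly the shape on the slide.
    val models = training.mapPartitions(part => Iterator.single(train(part, 1000, 0.1)))
    models.saveAsObjectFile("models/")
  }
}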

[Diagram: Classifier training. A previous Pig dataflow emits (label, feature vector) pairs; the map and reduce stages feed a Pig storage function that writes out the models. Making predictions: feature vectors pass through a model-backed UDF that emits predictions.] Just like any other parallel Pig dataflow.

Classifier Training:

training = load 'training.txt' using SVMLightStorage() as (target: int, features: map[]);
store training into 'model/' using FeaturesLRClassifierBuilder();

Logistic regression + SGD (L2 regularization); Pegasos variant (fully SGD or sub-gradient). Want an ensemble?

training = foreach training generate label, features, RANDOM() as random;
training = order training by random parallel 5;

Making Predictions:

define Classify ClassifyWithLR('model/');
data = load 'test.txt' using SVMLightStorage() as (target: double, features: map[]);
data = foreach data generate target, Classify(features) as prediction;

Want an ensemble?

define Classify ClassifyWithEnsemble('model/', 'classifier.lr', 'vote');

Sentiment Analysis Case Study. Binary polarity classification: {positive, negative} sentiment. Use the emoticon trick to gather data. Data: test on 500k positive / 500k negative tweets from 9/1/2011; train on {1m, 10m, 100m} instances from before (50/50 split). Features: sliding-window byte-4grams (a sketch follows). Models + optimization: logistic regression with SGD (L2 regularization); ensembles of various sizes (simple weighted voting). Source: Lin and Kolcz. (2012) Large-Scale Machine Learning at Twitter. SIGMOD.
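A minimal sketch of the feature extraction, assuming the features are simply the sliding-window byte 4-grams of the tweet text (any hashing or weighting scheme is omitted):

object ByteNgrams {
  // Sliding window of byte 4-grams over the UTF-8 bytes of a tweet;
  // each distinct 4-gram becomes one feature.
  def byte4grams(text: String): List[List[Byte]] =
    text.getBytes("UTF-8").toList.sliding(4).toList
}

For example, byte4grams("great :-)") yields the windows "grea", "reat", "eat ", and so on, as byte sequences.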

[Figure: accuracy (y-axis, 0.75 to 0.82) vs. number of classifiers in the ensemble (x-axis, 1 to 41), for a single classifier and for ensembles trained on 1m, 10m, and 100m instances. Ensembles with 10m examples beat the single classifier trained on 100m! Diminishing returns as the ensemble grows; the extra accuracy comes essentially for free.]

Supervised Machine Learning. [Diagram: during training, a machine learning algorithm induces a model from labeled examples; during testing/deployment, the model predicts labels for unseen examples.]

Evaluation. How do we know how well we're doing? Induce $f : X \to Y$ such that the loss is minimized: $\arg\min_{\theta} \frac{1}{n} \sum_{i=0}^{n} \ell(f(x_i; \theta), y_i)$. We need end-to-end metrics! Obvious metric: accuracy.

Metrics.

                       Actual Positive                        Actual Negative
Predicted Positive     True Positive (TP)                     False Positive (FP) = Type I Error
Predicted Negative     False Negative (FN) = Type II Error    True Negative (TN)

Precision = TP / (TP + FP)
Recall (or TPR) = TP / (TP + FN)
Fall-out (or FPR) = FP / (FP + TN)
Miss rate (or FNR) = FN / (FN + TP)
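The same quantities as a small Scala sketch over raw confusion-matrix counts (the Confusion case class is an illustrative container, not from the slides):

object Metrics {
  case class Confusion(tp: Long, fp: Long, fn: Long, tn: Long) {
    def precision: Double = tp.toDouble / (tp + fp)
    def recall: Double    = tp.toDouble / (tp + fn) // a.k.a. TPR
    def fallOut: Double   = fp.toDouble / (fp + tn) // a.k.a. FPR
    def missRate: Double  = fn.toDouble / (fn + tp) // 1 - recall
    def accuracy: Double  = (tp + tn).toDouble / (tp + fp + fn + tn)
  }
}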

ROC and PR Curves. [Figure: left, ROC curves (true positive rate vs. false positive rate) for Algorithm 1 and Algorithm 2, with AUC the area under the curve; right, the corresponding precision-recall curves.] Source: Davis and Goadrich. (2006) The Relationship Between Precision-Recall and ROC Curves.

Training/Testing Splits. Training: $\arg\min_{\theta} \frac{1}{n} \sum_{i=0}^{n} \ell(f(x_i; \theta), y_i)$ on the training split; evaluate on the held-out test split. Cross-Validation: [Diagram: the data is divided into folds; each fold in turn is held out as the test set while the remaining folds are used for training.]
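A minimal sketch of producing k-fold cross-validation splits (the round-robin fold assignment and the fixed seed are illustrative choices):

import scala.util.Random

object CrossValidation {
  // Shuffle once, deal instances into k folds round-robin; fold i is the
  // test set while the remaining k-1 folds form the training set.
  def kFold[T](data: Seq[T], k: Int, seed: Long = 42L): Seq[(Seq[T], Seq[T])] = {
    val folds = new Random(seed).shuffle(data)
      .zipWithIndex
      .groupBy { case (_, idx) => idx % k }
      .toSeq.sortBy(_._1)
      .map { case (_, chunk) => chunk.map(_._1) }
    for (i <- folds.indices) yield {
      val test  = folds(i)
      val train = folds.indices.filter(_ != i).flatMap(folds)
      (train, test)
    }
  }
}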


Typical Industry Setup. [Diagram: a timeline; train on older data, test on more recent data, then evaluate via an A/B test in production.]

A/B Testing. [Diagram: users are split; X% see the control, (100 - X)% see the treatment.] Gather metrics, compare the alternatives.

A/B Testing: Complexities. Properly bucketing users (a sketch follows); novelty and learning effects; long- vs. short-term effects; multiple, interacting tests; nosy tech journalists.
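A minimal sketch of deterministic user bucketing, so the same user always lands in the same arm across sessions; the hashing scheme here is an illustrative choice, not a production recipe:

object Bucketing {
  // Hash (userId, experiment) into one of 100 buckets; buckets [0, x)
  // get the treatment, the rest see the control.
  def inTreatment(userId: String, experiment: String, x: Int): Boolean = {
    val h = (userId + ":" + experiment).hashCode
    val bucket = ((h % 100) + 100) % 100 // keep the bucket non-negative
    bucket < x
  }
}

Keying the hash on the experiment name as well as the user id keeps buckets independent across concurrent tests, which is one way to mitigate the "multiple, interacting tests" problem above.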

Supervised Machine Learning. [Diagram: during training, a machine learning algorithm induces a model from labeled examples; during testing/deployment, the model predicts labels for unseen examples.]

Applied ML in Academia. Download an interesting dataset (it comes with the problem). Run a baseline model. Train/test. Build a better model. Train/test. Does the new model beat the baseline? Yes: publish a paper! No: try again!

Fantasy: extract features, develop a cool ML technique, #profit. Reality: What's the task? Where's the data? What's in this dataset? What's all the f#$!* crap? Clean the data. Extract features. Do machine learning. Fail, iterate.

Source: Wikipedia (Jujitsu). "It's impossible to overstress this: 80% of the work in any data project is in cleaning the data." DJ Patil, Data Jujitsu

On finding things

On naming things: CamelCase, smallCamelCase, user_id, userid, snake_case, camel_snake, dunder snake.

On feature extraction:

^(\\w+\\s+\\d+\\s+\\d+:\\d+:\\d+)\\s+ ([^@]+?)@(\\S+)\\s+(\\S+):\\s+(\\S+)\\s+(\\S+) \\s+((?:\\s+?,\\s+)*(?:\\s+?))\\s+(\\s+)\\s+(\\s+) \\s+\\[([^\\]]+)\\]\\s+\"(\\w+)\\s+([^\"\\\\]* (?:\\\\.[^\"\\\\]*)*)\\s+(\\s+)\"\\s+(\\s+)\\s+ (\\S+)\\s+\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*) \"\\s+\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\"\\s* (\\d*-[\\d-]*)?\\s*(\\d+)?\\s*(\\d*\\.[\\d\\.]*)? (\\s+[-\\w]+)?.*$

An actual Java regular expression used to parse log messages at Twitter circa 2010. Friction is cumulative!
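For contrast, a hedged sketch of what structured log parsing can look like on a deliberately simplified line; the three-field format is invented for illustration, and the real Twitter format was far messier, as the regex above attests:

object LogParse {
  // Assumed toy format: timestamp, user, action, whitespace-separated.
  private val Line = """^(\S+)\s+(\S+)\s+(\S+)$""".r

  def parse(line: String): Option[(String, String, String)] =
    line match {
      case Line(ts, user, action) => Some((ts, user, action))
      case _                      => None
    }
}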

Data Plumbing Gone Wrong! [Scene: a consumer internet company in the Bay Area.] The frontend engineer develops a new feature and adds logging code to capture clicks; the data scientist analyzes user behavior to extract insights that improve the feature. Data scientist: "Okay, let's get going... where's the click data?" Frontend engineer: "It's over here." Data scientist: "Well, that's kinda non-intuitive, but okay." Frontend engineer: "Well, it wouldn't fit, so we had to shoehorn... hang on, I don't remember." Data scientist: "Oh, BTW, where's the timestamp of the click?" Frontend engineer: "Uh, bad news. Looks like we forgot to log it." Data scientist: [grumble, grumble, grumble]

Fantasy: extract features, develop a cool ML technique, #profit. Reality: What's the task? Where's the data? What's in this dataset? What's all the f#$!* crap? Clean the data. Extract features. Do machine learning. Fail, iterate.

Source: Wikipedia (Hills). Congratulations, you're halfway there.

Congratulations, you're halfway there. Does it actually work? A/B testing. Is it fast enough? Good, you're two-thirds there.

Source: Wikipedia (Oil refinery) Productionize

Productionize. What are your jobs' dependencies? How/when are your jobs scheduled? Are there enough resources? How do you know if it's working? Who do you call if it stops working? Infrastructure is critical here! (plumbing)

Takeaway lessons: most of data science isn't glamorous! Source: Wikipedia (Plumbing)

Questions? Source: Wikipedia (Japanese rock garden)