Large-Scale Machine Learning at Twitter

Jimmy Lin and Alek Kolcz, Twitter, Inc.

Outline
Is Twitter big data?
How can machine learning help Twitter?
Existing challenges?
Existing literature on large-scale learning
Overview of machine learning
Twitter's analytics stack
Extending Pig
Scalable machine learning
Sentiment analysis application

Focus of the talk
What we will not talk about: the different useful applications of Twitter, or why Twitter is a great, one-of-a-kind product.
What we will talk about: the challenges faced while making it a good product, and the solution approach taken by insiders.

The Scale of Twitter (some Twitter bragging)
Twitter has more than 80 million active users.
500 million Tweets are sent per day.
50 million people log into Twitter every day.
Over 600 million monthly unique visitors to twitter.com.
Large-scale infrastructure for information delivery: users interact via the web UI, SMS, and various apps.
Over 70% of our active users are mobile users.
Real-time redistribution of content.
At Twitter HQ we consume 1,440 hard-boiled eggs weekly.
We also drink 585 gallons of coffee per week.

Problems at hand
Support for user interaction: search (relevance ranking), user recommendation (WTF, or Who To Follow), and content recommendation (relevant news, media, trends).
Other problems we are trying to solve: trending topics, language detection, anti-spam, revenue optimization, user interest modeling, and growth optimization.

To put learning formally..

Literature
Traditionally, the machine learning community has assumed sequential algorithms operating on data that fits in memory, which is no longer realistic.
Few publications address machine learning workflows and tool integration with data management platforms:
Google: adversarial advertisement detection.
Predictive analytics integrated into traditional RDBMSes.
Facebook: business intelligence tasks.
LinkedIn: Hadoop-based offline data processing.
But these are not specifically for machine learning.
Spark and ScalOps.
But they result in an end-to-end pipeline.

Contribution (what is the authors' contribution?)
Provides an overview of Twitter's analytics stack.
Describes Pig extensions that allow seamless integration of machine learning capability into the production platform.
Identifies stochastic gradient descent and ensemble methods as being particularly amenable to large-scale machine learning.
Note that the paper makes no fundamental contributions to machine learning.

Scalable Machine Learning
Techniques for large-scale machine learning: stochastic gradient descent and ensemble methods.

Gradient Descent.. Google Image

Gradient Descent.. Slides from Yaser Abu Mostafa-Caltech

Stochastic Gradient Descent (SGD)
stochastic /stəˈkastik/ (adjective): randomly determined; having a random probability distribution or pattern that may be analyzed statistically but may not be predicted precisely.
Slides from Yaser Abu-Mostafa, Caltech

Stochastic gradient descent: Gradient Descent vs. Stochastic Gradient Descent (SGD)
Gradient descent computes the gradient of the loss function over the entire dataset. Each iteration must pass over all the data to obtain a single gradient value. This is inefficient, because everything in the dataset must be considered at every step.

Stochastic gradient descent
SGD approximates the gradient using the gradient for a single instance. This solves the iteration problem: there is no need to go over the whole dataset again and again. The dataset can be streamed through a single reducer even with limited memory. However, streaming a huge dataset through a single node in the cluster causes network congestion.
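The contrast between the two update rules can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes a logistic regression model, a fixed learning rate, and a synthetic one-feature dataset, and every function name here (`batch_gradient_step`, `sgd_step`, `train_sgd`, `make_toy_data`, `accuracy`) is hypothetical.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def batch_gradient_step(w, data, lr):
    """One gradient descent step: the gradient is averaged over ALL examples."""
    grad = [0.0] * len(w)
    for x, y in data:
        err = sigmoid(dot(w, x)) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return [wi - lr * g / len(data) for wi, g in zip(w, grad)]

def sgd_step(w, x, y, lr):
    """One SGD step: the gradient is approximated from a single example."""
    err = sigmoid(dot(w, x)) - y
    return [wi - lr * err * xi for wi, xi in zip(w, x)]

def train_sgd(stream, dim, lr=0.1):
    """Single pass over a stream of (x, y) pairs; only w is kept in memory."""
    w = [0.0] * dim
    for x, y in stream:
        w = sgd_step(w, x, y, lr)
    return w

def make_toy_data(n, seed=0):
    """Synthetic data (an assumption for the demo): label is 1 when the feature is positive."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1 = rng.uniform(-1.0, 1.0)
        data.append(([1.0, x1], 1 if x1 > 0 else 0))  # bias term + one feature
    return data

def accuracy(w, data):
    hits = sum((sigmoid(dot(w, x)) >= 0.5) == (y == 1) for x, y in data)
    return hits / len(data)

if __name__ == "__main__":
    data = make_toy_data(2000)
    w = train_sgd(iter(data), dim=2)
    print("SGD accuracy after one pass:", accuracy(w, data))
```

Note that `train_sgd` consumes an iterator and never stores the examples, which mirrors the streaming property described on the slide, while `batch_gradient_step` needs the full dataset for every single update.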

Stochastic Gradient Descent ( SGD) Slides from Yaser Abu Mostafa-Caltech

Aggregation a.k.a Ensemble Learning Slides from Yaser Abu Mostafa-Caltech

Ensemble Methods (ensemble learning)
Classifier ensembles are high-performance learners that perform very well.
Some rely mostly on randomization: each learner is trained over a subset of the features and/or instances of the data.
Examples: ensembles of linear classifiers; ensembles of decision trees (random forests).
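An ensemble of linear classifiers trained on subsets of the instances can be sketched as follows. This is only an illustration under stated assumptions: the random partitioning scheme, the logistic regression base learner, and the majority vote are choices made for the example, and the names `train_logistic_sgd`, `train_ensemble`, and `ensemble_predict` are hypothetical.

```python
import math
import random

def predict_prob(w, x):
    z = sum(wi * xi for wi, xi in zip(w, x))
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_sgd(examples, dim, lr=0.1):
    """Base learner: logistic regression trained with one SGD pass."""
    w = [0.0] * dim
    for x, y in examples:
        err = predict_prob(w, x) - y
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

def train_ensemble(data, dim, n_models, seed=0):
    """Randomly partition the instances and train one model per partition.

    The partitions are independent, so the models could be trained in parallel.
    """
    rng = random.Random(seed)
    shards = [[] for _ in range(n_models)]
    for example in data:
        shards[rng.randrange(n_models)].append(example)
    return [train_logistic_sgd(shard, dim) for shard in shards]

def ensemble_predict(models, x):
    """Majority vote; costs one prediction per model in the ensemble."""
    votes = sum(1 for w in models if predict_prob(w, x) >= 0.5)
    return 1 if 2 * votes > len(models) else 0
```

Using an odd number of models avoids ties in the vote. The loop in `ensemble_predict` also makes the prediction cost visible: an ensemble of n classifiers makes n separate predictions per input.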

At Twitter

Hoeffding's Inequality
The sample frequency ν is likely close to the bin frequency µ.
Slide taken from Caltech's Learning from Data course, Dr. Yaser Abu-Mostafa
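For reference, the standard form of the inequality is given below. The symbols N (sample size) and ε (tolerance) do not appear in the transcript and are introduced here as assumptions about the notation on the original slide.

```latex
\[
  \mathbb{P}\bigl[\,|\nu - \mu| > \epsilon\,\bigr] \;\le\; 2e^{-2\epsilon^{2}N}
  \qquad \text{for any } \epsilon > 0 .
\]
```

The bound does not depend on µ and shrinks exponentially in N, which is the sense in which ν is "likely close" to µ for large samples.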

Hadoop Ecosystem Big Table open source version Image Source: Apache Yarn Release

Hadoop Ecosystem at Twitter
Data sources: databases, real-time processes, application logs, batch processes, and other sources.
Storage: Hadoop cluster (HDFS), with serialization via Protocol Buffers / Thrift.
Oink: aggregation queries and standard business intelligence tasks.
Ad hoc queries: one-off business requests, prototypes of new functions, and experiments by the analytics group.

Glorifying PIG Credits : Hortonworks

Maximizing the use of Hadoop
We cannot afford too many diverse computing environments.
Most analytics jobs are run on the Hadoop cluster; hence, that is where the data live.
It is natural to structure ML computation so that it takes advantage of the cluster and is performed close to the data.
Benefits: seamless scaling to large datasets and integration into production workflows.

What the authors contributed technically
Core libraries: a core Java library with basic abstractions similar to existing packages (Weka, MALLET, Mahout).
A lightweight wrapper that exposes the functionality in Pig.

PIG Functions.. Training models: Storage function

PIG Functions.. Shuffling data:

PIG Functions.. Using models:

HortonWorks Way.. Demo Of How Pig Works on HortonWorks: Credits : Hortonworks

Final Model which works!!! Final Learning - Ensemble Methods

Example: Sentiment Analysis
Use case: the emoticon trick.
Test dataset: 1 million English tweets, minimum 0 letters long.
Training data: 1 million, 10 million, and 100 million English training examples.
Preparation: the training and test sets contain equal numbers of positive and negative examples, and all emoticons were removed.
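The preparation steps can be illustrated with a short Python sketch: emoticons provide the labels, are then stripped from the text, and the classes are balanced. The emoticon lists below are placeholders chosen for the example (the sets used in the actual experiments are not given in the slides), and `label_tweet` and `balance` are hypothetical helper names.

```python
import re

# Illustrative emoticon lists; an assumption, not the sets used in the experiments.
POSITIVE = [":)", ":-)", ":D", "=)"]
NEGATIVE = [":(", ":-(", ":'("]

# Match longer emoticons first so that no partial fragments are left behind.
_EMOTICON_RE = re.compile(
    "|".join(re.escape(e) for e in sorted(POSITIVE + NEGATIVE, key=len, reverse=True))
)

def label_tweet(text):
    """Return (cleaned_text, label), or None if the tweet gives no clear signal."""
    has_pos = any(e in text for e in POSITIVE)
    has_neg = any(e in text for e in NEGATIVE)
    if has_pos == has_neg:
        return None  # no emoticon at all, or mixed positive and negative
    label = 1 if has_pos else 0
    cleaned = _EMOTICON_RE.sub("", text)   # remove the emoticons used as labels
    cleaned = " ".join(cleaned.split())    # normalize leftover whitespace
    return cleaned, label

def balance(examples):
    """Keep equal numbers of positive and negative examples."""
    pos = [e for e in examples if e[1] == 1]
    neg = [e for e in examples if e[1] == 0]
    n = min(len(pos), len(neg))
    return pos[:n] + neg[:n]
```

Removing the emoticons after labeling matters: if they stayed in the text, a classifier could simply learn to look for them instead of learning from the remaining words.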

Finally a graph..

Explaining a bit more of the graph
1. The error bars denote 95% confidence intervals.
2. The leftmost group of bars shows accuracy when training a single logistic regression classifier on {1, 10, 100} million training examples.
3. The change from 1 to 10 million examples is sharp; from 10 to 100 million it is not that sharp.
4. The middle and right groups of bars in the figure show the results of learning ensembles.
5. Ensembles lead to higher accuracy; note that an ensemble trained with 10 million examples outperforms a single classifier trained on 100 million examples.
6. No accurate running times are reported, since the experiments were run on production clusters, but informal observations agree with intuition: an ensemble takes less time to train because its models are learned in parallel.
7. When applying the learned models, running time increases with the size of the ensemble, since an ensemble of n classifiers requires making n separate predictions.

Conclusion
What I loved about the paper: I understood it?
"Our goal has never been to make fundamental contributions to machine learning; we have taken the pragmatic approach of using off-the-shelf toolkits where possible. Thus, the challenge becomes how to incorporate third-party software packages along with in-house tools into an existing workflow."