Introduction to statistical learning

Introduction to statistical learning. 1. Introduction. V. Lefieux, June 2018. 1/42

Table of contents 2/42

Data everywhere 4/42

Data everywhere. Before: structured data, generated by companies and organizations, with regular but infrequent updates (e.g. monthly). Now: unstructured data, generated by users, in real time. 5/42

Some data generated by companies and organizations 6/42

Some data generated by users 7/42

Some networks 8/42

And now health data 9/42

3 V? 10/42

4 V? 11/42

5 V? 12/42

The new oil? Clive Humby, 2006. 13/42

A landscape 14/42

Gartner hype cycle 2017 15/42

A process: collecting, organizing (cleaning and storing), analyzing, and visualizing large sets of data. An objective: discovering useful information to improve business decisions. 17/42

A new idea? Four major influences act on data analysis today: the formal theories of statistics; accelerating developments in computers and display devices; the challenge, in many fields, of more and ever larger bodies of data; the emphasis on quantification in an ever wider variety of disciplines. 18/42

Not so new! From "Data analysis and statistics: an expository overview", J. W. Tukey and M. B. Wilk, 1966: "Four major influences act on data analysis today: the formal theories of statistics; accelerating developments in computers and display devices; the challenge, in many fields, of more and ever larger bodies of data; the emphasis on quantification in an ever wider variety of disciplines." 19/42

Spam filter 20/42

Web search 21/42

Recommendations 22/42

Marketing 23/42

Customer relationship management (CRM). Some examples: Hotel chain uses big data to increase bookings. Pizza chain earns more dough in bad weather. Music distributor applies big data for demand planning. Financial services company scores new clients. Retailer creates pregnancy detection model. 24/42

Smart grids and smart cities. 25/42

Genomics 26/42

The data scientist 28/42

Data scientist skills 29/42

Superhero skills? 30/42

Some definitions: data science. Data science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured. It is a continuation of some of the data analysis fields such as statistics, machine learning, data mining, and predictive analytics, similar to Knowledge Discovery in Databases (KDD). 31/42

Some definitions: machine learning. Machine learning is a field of computer science that often uses statistical techniques to give computers the ability to learn (i.e., progressively improve performance on a specific task) with data, without being explicitly programmed. 33/42

Some definitions: statistical learning theory. Statistical learning theory is a framework for machine learning drawing from the fields of statistics and functional analysis. It deals with the problem of finding a predictive function based on data. Statistical learning theory has led to successful applications in fields such as computer vision, speech recognition, bioinformatics and baseball. 34/42

Statistical learning vs machine learning. Machine learning, from artificial intelligence: large-scale applications, prediction accuracy. Statistical learning, from statistics: interpretability, precision, uncertainty, inference. For some statisticians, statistical learning is a mathematical formalisation of machine learning. 35/42

Some concepts: online/offline learning. Online learning (real-time): under time constraints. Some examples: personalized advertising, personalized healthcare, navigation and transit tools, autonomous cars, load curve forecasts, weather forecasts. Offline learning (batch). 36/42
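
To make the online/offline distinction concrete, here is a minimal sketch, not taken from the slides. It estimates a mean either in batch (all data available at once) or online (one observation at a time, constant memory). The load values and the names `batch_mean` and `OnlineMean` are hypothetical, chosen only for illustration.

```python
def batch_mean(values):
    """Offline (batch) estimate: needs the whole data set at once."""
    return sum(values) / len(values)


class OnlineMean:
    """Online estimate: updated one observation at a time, O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        # Incremental update: new_mean = old_mean + (x - old_mean) / n
        self.n += 1
        self.mean += (x - self.mean) / self.n
        return self.mean


# Hypothetical hourly load measurements (illustrative values only).
load_curve = [52.0, 48.5, 50.0, 55.5, 49.0]

online = OnlineMean()
for x in load_curve:
    online.update(x)  # the estimate is usable after every new observation

print(online.mean, batch_mean(load_curve))
```

Both estimates agree once all observations have been seen. The online version never stores the data, which is what matters under time constraints.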

Some concepts: supervised/unsupervised learning. Supervised learning: infer (predict) a function/relationship from labeled training data (e.g. classification, regression). Unsupervised learning: find structure in unlabeled data (e.g. clustering). Even though unsupervised learning is more subjective than supervised learning, it can be useful as a pre-processing step for supervised learning. 37/42
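
The contrast between the two settings can be sketched on a toy one-dimensional data set. This is an illustrative assumption, not material from the slides: the data, the labels and the helper names `nearest_neighbor_predict` and `two_means` are all hypothetical. The supervised method uses the labels; the unsupervised one (a 1-D k-means with k = 2) never sees them.

```python
def nearest_neighbor_predict(train_x, train_y, x):
    """Supervised: return the label of the closest labeled training point."""
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]


def two_means(xs, iters=10):
    """Unsupervised: 1-D k-means with k = 2; no labels are used."""
    c0, c1 = min(xs), max(xs)  # simple initialisation at the extremes
    for _ in range(iters):
        g0 = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in xs if abs(x - c0) > abs(x - c1)]
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return c0, c1


# Hypothetical toy data: two well-separated groups.
xs = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
labels = ["low", "low", "low", "high", "high", "high"]

prediction = nearest_neighbor_predict(xs, labels, 4.9)  # uses the labels
centers = two_means(xs)                                 # ignores the labels
print(prediction, centers)
```

The cluster centers found without labels could then serve as a derived feature for a supervised model, which is the pre-processing use mentioned on the slide.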

Supervised learning. There are many different paradigms, including: parametric statistics (linear or non-linear); non-parametric statistics (local estimation methods, e.g. kernel smoothing methods, k-nearest neighbors); tree-based methods; support vector machines; deep learning. 38/42
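
As an illustration of the local estimation methods named above, the following sketch implements k-nearest-neighbor regression and a Gaussian kernel smoother (Nadaraya-Watson form) in plain Python. The data, the bandwidth and the function names are assumptions made for this example, not taken from the slides.

```python
import math


def knn_regress(xs, ys, x0, k=3):
    """Average the responses of the k training points closest to x0."""
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


def kernel_smooth(xs, ys, x0, bandwidth=1.0):
    """Weighted average of all responses, with Gaussian weights around x0."""
    weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)


# Hypothetical training data following y = x^2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.0, 4.0, 9.0, 16.0]

knn_estimate = knn_regress(xs, ys, 2.0, k=3)       # averages y at x = 1, 2, 3
kernel_estimate = kernel_smooth(xs, ys, 2.0, 1.0)  # uses all points, weighted
print(knn_estimate, kernel_estimate)
```

Neither method assumes a parametric form for the relationship. The amount of smoothing is controlled by `k` or by `bandwidth`.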

Some key points. Trade-off between prediction accuracy and interpretability. Avoid over-fitting. Parsimonious model vs (full) black box: less is more. 39/42
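
Over-fitting can be illustrated with a small deterministic sketch, under assumptions of our own rather than the slides'. The true relationship is y = x, the training responses carry an artificial alternating noise of +1/-1, and a parsimonious model (a least-squares line) is compared with a model that memorises the training set (1-nearest neighbor). All names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Parsimonious model: simple least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def fit_1nn(xs, ys):
    """Flexible model: memorise the training set, predict the nearest response."""
    def predict(x):
        i = min(range(len(xs)), key=lambda j: abs(xs[j] - x))
        return ys[i]
    return predict


def mse(model, xs, ys):
    """Mean squared prediction error of a fitted model."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Training data: y = x plus an artificial alternating +1/-1 noise.
train_x = [float(i) for i in range(10)]
train_y = [x + (1.0 if i % 2 == 0 else -1.0) for i, x in enumerate(train_x)]
# Test data: new inputs with the noiseless truth y = x.
test_x = [x + 0.25 for x in train_x]
test_y = list(test_x)

line, one_nn = fit_line(train_x, train_y), fit_1nn(train_x, train_y)
print("train MSE:", mse(line, train_x, train_y), mse(one_nn, train_x, train_y))
print("test MSE: ", mse(line, test_x, test_y), mse(one_nn, test_x, test_y))
```

The memorising model has zero training error but a larger test error than the line, because it reproduces the noise instead of the underlying trend.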

Outline. Introduction. Unsupervised learning: PCA and clustering. Supervised learning: cross-validation and bootstrap; reminders on linear regression and logistic regression; tree-based methods; support vector machines. 41/42

Software tools 42/42