ORIE 4741: Learning with Big Messy Data. Introduction

ORIE 4741: Learning with Big Messy Data
Introduction
Professor Udell
Operations Research and Information Engineering
Cornell
September 15, 2017

Outline
- Stories
- Definitions
- Kinds of learning
- Syllabus
- Logistics

Oh, you work with big messy data? Maybe you could help us out...?

Demography

age  gender  state  income    education
29   F       CT     $53,000   college
57   ?       NY     $19,000   high school
?    M       CA     $102,000  masters
41   F       NV     $23,000   ?
...

Medicine

age  gender  heart disease  statins?
29   F       yes            no
57   ?       no             no
?    M       no             no
41   F       yes            yes
...

Pollution [Snow, 1854]

location  time  CO2  O2  O3
1         1     .7   .9  ?
1         2     .5   .7  ?
1         3     .4   .5  1.4
...

Marketing

customer  product 1  product 2  product 3  ...
1         yes        ?          yes
2         yes        yes        ?
3         ?          ?          yes
...

Finance

ticker  t_1   t_2   ...
AAPL    .05   -.21
GOOG    -.11  .24
FB      .07   -.18
...

Email

Data by Volume

Big
- NASA, 1997: taxing the capacities of main memory, local disk, and even remote disk
- OED, 2015: data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges
- 4 Vs (figure omitted; image courtesy of Kim Minor @ IBM)
- 5th V: value

Big: our definition

Definition: An algorithm for big data is one with computational and memory requirements that scale linearly (or nearly linearly) in the size of the data.

why this definition?
- independent of hardware business
- if you use only algorithms for big data, then you're working with big data
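
To make the "scales linearly" requirement concrete, here is a minimal sketch in Python. It is not from the lecture: the function names are hypothetical, and Welford's running update is a swapped-in technique chosen only because it needs one pass and constant extra memory. The second function is a deliberate contrast whose work grows quadratically in the number of values.

```python
def streaming_mean_variance(values):
    """One pass over the data, constant extra memory (Welford's update)."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / count if count else 0.0  # population variance
    return count, mean, variance


def mean_pairwise_distance(values):
    """Contrast: compares every pair, so the work grows quadratically in n."""
    values = list(values)
    n = len(values)
    if n < 2:
        return 0.0
    total = sum(
        abs(values[i] - values[j]) for i in range(n) for j in range(i + 1, n)
    )
    return total / (n * (n - 1) / 2)
```

Under the slide's definition, the first function would count as an algorithm for big data and the second would not, regardless of the hardware it runs on.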

Messy
- noisy: some (or all) values suffer errors, inaccuracies, or malicious corruption
- missing: some values are missing, inconsistent, not recorded, or lost
- heterogeneous: values of many different types
  - continuous values (e.g., 4.2, π)
  - discrete values (e.g., 0, 4, 994)
  - nominal values (e.g., apple, banana, pear)
  - ordinal values (e.g., rarely, sometimes, often)
  - graphs or networks (e.g., person 1 is friends with person 2)
  - text (e.g., doctor's note describing symptoms)
  - sets (e.g., items purchased)
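
As a rough illustration of "missing" and "heterogeneous", the sketch below stores the four rows of the Demography slide with `None` in place of each "?" and summarizes each feature. The row layout is a reading of the flattened slide, and the helper names `missing_fraction` and `feature_kind` are hypothetical, not part of the course material.

```python
# Rows as read from the Demography slide; each "?" entry is stored as None.
FEATURES = ["age", "gender", "state", "income", "education"]
ROWS = [
    (29, "F", "CT", 53000, "college"),
    (57, None, "NY", 19000, "high school"),
    (None, "M", "CA", 102000, "masters"),
    (41, "F", "NV", 23000, None),
]


def missing_fraction(rows, features):
    """Fraction of examples with no recorded value, per feature."""
    n = len(rows)
    return {
        name: sum(row[j] is None for row in rows) / n
        for j, name in enumerate(features)
    }


def feature_kind(values):
    """Crude guess: numeric if every observed value is a number, else nominal."""
    observed = [v for v in values if v is not None]
    if observed and all(isinstance(v, (int, float)) for v in observed):
        return "numeric"
    return "nominal"


fractions = missing_fraction(ROWS, FEATURES)
kinds = {
    name: feature_kind([row[j] for row in ROWS])
    for j, name in enumerate(FEATURES)
}
```

The type guess is intentionally naive: it would label education as nominal, although the slide's taxonomy suggests it is better treated as ordinal.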

Learning
- machine learning?
- human learning?
- when data is big and messy, machine help is essential for human learning!

Data table
- n examples (patients, respondents, households, assets)
- d features (tests, questions, sensors, times)

\[
A = \begin{bmatrix} a_{11} & \cdots & a_{1d} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nd} \end{bmatrix}
\]

- $a_i$ is the ith row of A: feature vector for the ith example
- $a_{:j}$ is the jth column of A: values for the jth feature across all examples
- $a_{ij}$ is the jth feature of the ith example
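
The row, column, and entry notation can be sketched in plain Python as follows. The numbers in `A` are made up for illustration, and the helpers `row`, `column`, and `entry` are hypothetical names that use 1-based indices to match the slide.

```python
# A hypothetical data table with n = 2 examples and d = 3 features.
A = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]
n, d = len(A), len(A[0])


def row(A, i):
    """a_i: feature vector for the ith example (1-based, as on the slide)."""
    return A[i - 1]


def column(A, j):
    """a_{:j}: values of the jth feature across all examples."""
    return [r[j - 1] for r in A]


def entry(A, i, j):
    """a_{ij}: jth feature of the ith example."""
    return A[i - 1][j - 1]
```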

Supervised learning

identify one column of data that we want to predict

\[
A = \begin{bmatrix} x_{11} & \cdots & x_{1,d-1} & y_1 \\ \vdots & \ddots & \vdots & \vdots \\ x_{n1} & \cdots & x_{n,d-1} & y_n \end{bmatrix} = \begin{bmatrix} X & y \end{bmatrix}
\]

- $x_i \in \mathcal{X}$ for $i = 1, \ldots, n$ are rows of $X$
- $y_i \in \mathcal{Y}$ for $i = 1, \ldots, n$ are entries of $y$
- we believe there is a mapping $f : \mathcal{X} \to \mathcal{Y}$
- our goal is to learn $f$ so that $y_i \approx f(x_i)$
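
A small sketch of this setup, under stated assumptions: the last column of a hypothetical table `A` is taken as the target, and a 1-nearest-neighbor rule stands in for the learned mapping f. The slide does not name any particular method; nearest neighbor is used here only because it fits in a few lines.

```python
def split_features_target(A):
    """Split the table into X (all but the last column) and y (last column)."""
    X = [row[:-1] for row in A]
    y = [row[-1] for row in A]
    return X, y


def fit_nearest_neighbor(X, y):
    """Return f, which predicts the label of the closest training example."""
    def f(x):
        distances = [sum((a - b) ** 2 for a, b in zip(x, xi)) for xi in X]
        return y[distances.index(min(distances))]
    return f


# Hypothetical data: two features followed by the column we want to predict.
A = [
    [1.0, 2.0, 10.0],
    [2.0, 0.0, 20.0],
    [4.0, 1.0, 30.0],
]
X, y = split_features_target(A)
f = fit_nearest_neighbor(X, y)
```

On the training rows this f reproduces each y_i exactly, which says nothing about how well it predicts new examples.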

Example: supervised learning for credit card applications
- goal: decide which credit card applicants should be approved
- input space: $\mathcal{X} \subseteq \mathbb{R}^d$; entries of $x \in \mathcal{X}$ correspond to fields in the credit application, e.g., salary, years in residence, outstanding debt, number of credit lines, ...
- output space: $\mathcal{Y} = \{+1, -1\}$
  - +1 means approve
  - -1 means reject
- data: $\mathcal{D} = (x_1, y_1), \ldots, (x_n, y_n)$ give credit applications of previous customers, and correct decisions in hindsight
- noise?
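
One way to picture this example in code is a linear classifier with labels +1 and -1. The sketch below uses the perceptron update, which is an assumption made for illustration rather than the method the lecture prescribes. The six applicants are invented toy data with two hypothetical features, salary and outstanding debt, both in units of $10,000.

```python
def predict(w, b, x):
    """Approve (+1) if the weighted score is positive, otherwise reject (-1)."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score > 0 else -1


def train_perceptron(data, max_epochs=1000):
    """Perceptron updates: adjust (w, b) only on misclassified examples."""
    d = len(data[0][0])
    w, b = [0.0] * d, 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            if predict(w, b, x) != y:
                w = [wj + y * xj for wj, xj in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # every past decision is reproduced
            break
    return w, b


# Hypothetical past applications: ((salary, debt), correct decision in hindsight).
D = [
    ((8.0, 1.0), +1),
    ((6.0, 0.0), +1),
    ((9.0, 2.0), +1),
    ((2.0, 3.0), -1),
    ((3.0, 5.0), -1),
    ((1.0, 1.0), -1),
]
w, b = train_perceptron(D)
```

The toy data are linearly separable, so the loop stops once all six past decisions are matched. With noisy labels that need not happen, which is one reason the slide ends by asking about noise.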

Exercise: formalizing real problems
- identify a prediction goal
- identify the input space $\mathcal{X}$
- identify the output space $\mathcal{Y}$
- identify the data $\mathcal{D} = (x_1, y_1), \ldots, (x_n, y_n)$ you'd like to use
- what kinds of noise do you expect in the data?

Kinds of learning
- Supervised learning: given $(x_1, y_1), \ldots, (x_n, y_n)$, learn $f(x) = y$
- Unsupervised learning: given $x_1, \ldots, x_n$, learn patterns or structure
- Online learning: for $i = 1, \ldots, n$, given $x_i$, predict and observe $y_i$, learn $f(x) = y$
- Active learning: for $i = 1, \ldots, n$, choose $x_i$, predict and observe $y_i$, learn $f(x) = y$
- Reinforcement learning: for $i = 1, \ldots, n$, choose $x_i$, predict $y_i$, observe reward $r_i$, learn $f(x) = y$

this class: mostly supervised and unsupervised learning
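
The online learning protocol (given x_i, predict, then observe y_i and update) can be sketched as a loop. This is an illustration under assumptions: the model is a single weight w with prediction w * x, the update is a least-mean-squares gradient step, and the stream is synthetic data generated from y = 2x. None of these choices come from the slides.

```python
def online_least_squares(stream, learning_rate=0.1):
    """Predict each y_i before seeing it, then update the weight."""
    w = 0.0
    predictions = []
    for x, y in stream:
        y_hat = w * x  # prediction uses only examples seen so far
        predictions.append(y_hat)
        w += learning_rate * (y - y_hat) * x  # observe y_i, take a gradient step
    return w, predictions


# Synthetic stream following y = 2x, alternating between two inputs.
stream = [(1.0, 2.0), (2.0, 4.0)] * 25
w, predictions = online_least_squares(stream)
```

The contrast with supervised learning is that no prediction here is allowed to use its own label: each y_i is revealed only after the prediction for x_i has been recorded.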

Course objectives (I)
- plot
- predict
- cluster
- impute
- denoise
- recommend
- understand

Course objectives (II)

at the end of the course, you should have learned
- at least one method to solve any problem
- when not to trust your solution

the rest you can learn online...

This class
- algorithms for big messy data
- learning to ask the right questions
- course website (grading, course requirements, lectures, homework, etc.): https://people.orie.cornell.edu/mru8/orie4741/

Next steps
- ASAP:
  - enroll (or drop) (or get on wait list)
  - fill out course survey
- before next lecture: post a question or comment to Piazza about this lecture
- due 8/29/17: homework 0

... links on course website

Questions?