LECTURE 01: INTRODUCTION TO MACHINE LEARNING. SDS 293: Machine Learning September 11, 2017


Introductions & background
Jordan (he/him, computer scientist)
- 2017 on: Asst. Prof. in CS (Smith)
- 2015 to 2017: Visiting Asst. Prof. in SDS (Smith)
- 2013 to 2015: Research Scientist (MITLL)
- 2010 to 2013: PhD in Visual Analytics (Tufts)
- 2008 to 2010: MSc in Educational Tech. (Tufts)
- 2004 to 2008: BA in CS and Math (Smith)
Office hours: Mondays 10:30 to noon and by appointment, Ford 355 (office) or Ford 343 (lab)

People
3-Minute Biographies:
- Your name and pronouns
- Your year, school, and major / area of focus
- Technical background
- Programming language(s) you know/like
- Stats courses you've taken
3 Questions:
- What brought you to this course?
- What's one big thing you hope to get out of it?
- What's one problem / idea / curiosity that sometimes keeps you up at night?

Outline
- About this course
- What is Machine (a.k.a. Statistical) Learning?
- Example problems
- Data science refresher
- Structure of this course

Resources: course website cs.smith.edu/~jcrouser/sds293

Resources: slack channel sds293.slack.com

Resources: tutorials, mini-courses, etc. datacamp.com/groups/sds293-machine-learning Free access to ALL content until March 2018

Some context: my research
- Visualization
- Cognitive Science
- Interaction Design
- Computational Modeling

About this course Machine Learning Computational Modeling

What is machine learning? Image credit: Coursera

Machine learning: Wikipedia

Machine learning: a working definition
Machine learning is a set of computational tools for building statistical models. These models can be used to:
- Group similar data points together (clustering)
- Assign new data points to the correct group (classification)
- Identify the relationships between variables (regression)
- Draw conclusions about the population (density estimation)
- Figure out which variables are important (dimension reduction)

Example: men & money in the mid-atlantic
Wage dataset available in the ISLR package
Sample: 3000 male earners from the mid-atlantic, surveyed between 2003 and 2009
Dimensions:
- Year each datapoint was collected
- Age of respondent
- Marital status
- Race
- Educational attainment
- Job class
- Health
- Whether or not they have health insurance
- Wage

Example: men & money in the mid-atlantic
Question: what is the effect of an earner's age, education, and the year on his wage?
Find some friends, then go explore the data at: cs.smith.edu/~jcrouser/sds293/examples/wage.html
#protip: in classes with Jordan, this icon means it's your turn to talk

wage vs. age

wage vs. year

wage vs. education

Example: men & money in the mid-atlantic
- If we had to pick just one, we should probably use education
- In reality, the best predictor is probably a combination of all three

Supervised machine learning
In this example, we used the value of input variables to predict the value of output variables.
Another way to think about this:

Supervised machine learning
Goal: explain some observable phenomenon Y as a function of some set of predictors X: Y = f(X) + ϵ, where ϵ is a random error term
Problem: we don't know what the function actually looks like; we have to estimate it
Machine learning: computational tools for estimating f
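
To make "estimating f" a bit more concrete, here is a minimal sketch that generates synthetic data from a known linear function plus noise, then recovers an estimate with ordinary least squares. Everything in it is an assumption for illustration: the "true" function, the noise level, and the names `true_f` and `fit_line` do not come from the course materials, and least squares is only one of many ways to estimate f.

```python
import random

def true_f(x):
    # Hypothetical "true" relationship; in a real problem this is unknown.
    return 2.0 + 0.5 * x

def fit_line(xs, ys):
    """Estimate intercept and slope by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

random.seed(0)  # fixed seed so the run is repeatable
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [true_f(x) + random.gauss(0, 1) for x in xs]  # Y = f(X) + noise

intercept, slope = fit_line(xs, ys)
print(f"estimated f(x) = {intercept:.2f} + {slope:.2f} * x")
```

The estimate should land close to, but not exactly on, the intercept 2.0 and slope 0.5 used to generate the data, because the noise term hides part of the signal.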

Unsupervised machine learning
- We sometimes have only input variables, but no clearly defined response
- Can't check ("supervise") our analysis: unsupervised
- Can't fit a regression model (why?)
- What can we do?

Example: personalized marketing

Unsupervised machine learning
Challenge: identify whether the data separates into (relatively) distinct groups
[Figure: scatterplot panels, vertical axis X2]
This kind of problem is called cluster analysis (Ch. 10)
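
One common way to look for distinct groups is k-means clustering. The slide does not name a specific method, so treat the following as a rough sketch of one option: the data are synthetic (two made-up blobs), and the helper names `dist2` and `kmeans` are hypothetical.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: alternate between assigning points and moving centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = (sum(m[0] for m in members) / len(members),
                              sum(m[1] for m in members) / len(members))
    return centers, labels

# Synthetic data: two blobs, centered near (2, 2) and (8, 8).
rng = random.Random(1)
blob_a = [(rng.gauss(2, 0.5), rng.gauss(2, 0.5)) for _ in range(50)]
blob_b = [(rng.gauss(8, 0.5), rng.gauss(8, 0.5)) for _ in range(50)]
points = blob_a + blob_b

centers, labels = kmeans(points, k=2)
print(sorted(centers))
```

Note that no response variable appears anywhere: the algorithm only sees the inputs, and there is nothing to check the grouping against.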

Data science refresher: what is data?

Data: a definition A dataset has some set of variables available for making predictions. For example: Tuition rates, enrollment numbers, public vs. private, etc.

Data: a definition Each variable may be either independent or dependent: - An independent variable (iv) is not controlled or affected by another variable (e.g., time in a time-series dataset) - A dependent variable (dv) is affected by a variation in one or more associated independent variables (e.g., temperature in a region)

Data: a definition A dataset also contains a set of observations (also called records) over these variables. For example: tuition = $46,288, enrollment = 2,563, private, etc.

Data: a definition A dataset also contains a set of observations (also called records) over these variables. For example: tuition = $16,115, enrollment = 28,635, public, etc.

One way to think about this:
                        VARIABLES
OBSERVATIONS            Tuition    Enrollment   Public vs. Private
Smith College           $46,288    2,563        private
UMass Amherst           $16,115    28,635       public
Hampshire College       $48,065    1,400        private
Mount Holyoke College   $43,886    2,189        private
Amherst College         $50,562    1,792        private

Another way to think about this:

    # VARIABLES: the fields every observation carries
    class school_obs:
        def __init__(self, tuition, enrollment, pub_or_priv):
            self.tuition = tuition
            self.enrollment = enrollment
            self.pub_or_priv = pub_or_priv

    # OBSERVATIONS: one instance per school
    smith = school_obs(46288, 2563, "private")
    umass = school_obs(16115, 28635, "public")

Basic data types
Nominal, Ordinal, Scale / Quantitative (Ratio, Interval)
Nominal: an unordered set of non-numeric values. For example:
- Categorical (finite) data: {apple, orange, pear}, {red, green, blue}
- Arbitrary (infinite) data: {"12 Main St. Boston MA", "45 Wall St. New York NY", ...}, {"John Smith", "Jane Doe", ...}

Basic data types: Ordinal
An ordered set (also known as a tuple). For example:
- Numeric: <2, 4, 6, 8>
- Binary: <0, 1>
- Non-numeric: <G, PG, PG-13, R>

Basic data types: Scale / Quantitative
A numeric range
Ratios:
- Distance from absolute zero
- Can be compared mathematically using division
- For example: height, weight
Intervals:
- Ordered numeric elements that can be mathematically manipulated, but cannot be compared as ratios
- E.g.: date, current time

Converting between basic data types
Q → O: [0, 100] → <F, D, C, B, A>
O → N: <F, D, C, B, A> → {C, B, F, D, A}
N → O (??)
- {John, Mike, Bob} → <Bob, John, Mike>
- {red, green, blue} → <blue, green, red>
O → Q (??)
- Hashing?
- Bob + John = ??
Discussion: what do you notice?
Readings in Information Visualization: Using Vision To Think. Card, Mackinlay, Shneiderman, 1999
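
The Q → O conversion above can be sketched as a simple binning step. The cut points below (60, 70, 80, 90) are an assumption for illustration only; the slide gives the range and the ordered letters, not where the boundaries fall. `to_letter` is a hypothetical helper name.

```python
from bisect import bisect_right

# Hypothetical grade boundaries (not from the slides).
CUTOFFS = [60, 70, 80, 90]
LETTERS = ["F", "D", "C", "B", "A"]  # ordered from lowest to highest

def to_letter(score):
    """Map a quantitative score in [0, 100] to an ordinal letter grade."""
    if not 0 <= score <= 100:
        raise ValueError("score must be in [0, 100]")
    # bisect_right counts how many cutoffs the score has reached.
    return LETTERS[bisect_right(CUTOFFS, score)]

scores = [95, 42, 78, 60, 89.5]
print([to_letter(s) for s in scores])  # ['A', 'F', 'C', 'D', 'B']

# O -> N: putting the letters in a set throws the ordering away.
print(set(LETTERS) == {"C", "B", "F", "D", "A"})  # True
```

The mapping is many-to-one (95 and 100 both become A), so the original score cannot be recovered from the letter. Going down the hierarchy loses information, which is one thing to notice in the discussion question.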

Basic operations
Nominal (N)
- Equality: = and ≠
- Frequency: how often does x appear?
Ordinal (O)
- Relation to other points: >, <, ≥, ≤
- Distribution: inference on relative frequency
Quantitative (Q)
- Other mathematical operations: (+, -, *, /, etc.)
- Descriptive statistics: average, standard deviation, etc.
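
As a rough illustration of which operations make sense at each level, here is a short sketch using only the Python standard library. The sample values are made up for the example (the fruit and film-rating categories echo the earlier slides), and `RATING_ORDER` and `rank` are hypothetical names.

```python
from collections import Counter
from statistics import mean, stdev

# Nominal: only equality tests and frequency counts are meaningful.
fruits = ["apple", "pear", "apple", "orange", "apple"]
print(fruits[0] == fruits[2])       # equality
print(Counter(fruits)["apple"])     # frequency

# Ordinal: order comparisons become available once a ranking is fixed.
RATING_ORDER = ["G", "PG", "PG-13", "R"]
rank = {r: i for i, r in enumerate(RATING_ORDER)}
print(rank["PG"] < rank["R"])       # relation to other points
print(max(["PG", "G", "PG-13"], key=rank.get))  # highest-ranked rating

# Quantitative: arithmetic and descriptive statistics apply.
heights_cm = [160.0, 172.5, 181.0, 168.5]
print(mean(heights_cm), stdev(heights_cm))
```

Each level keeps the operations of the one before it: ratings can still be counted and tested for equality, and heights can still be ordered.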

(Hopefully) familiar statistical concepts
We tend to refer to problems with a quantitative response as regression problems
When the response is qualitative (i.e. nominal or ordinal), we're usually talking about a classification problem
Caveat: the distinction isn't always that crisp. For example:
- K-nearest neighbors (Ch. 2 and Ch. 4), which works with either
- Logistic regression (Ch. 4), which estimates the probabilities of a qualitative response
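
To see why K-nearest neighbors works with either kind of response, consider this minimal sketch: the neighbor search is shared, and only the final step changes (majority vote for a qualitative response, average for a quantitative one). The toy data and the function names are made up for illustration, and inputs are one-dimensional to keep it short.

```python
from collections import Counter

def k_nearest(train, x, k):
    """Return the responses of the k training points closest to x (1-D inputs)."""
    return [y for _, y in sorted(train, key=lambda pair: abs(pair[0] - x))[:k]]

def knn_classify(train, x, k=3):
    """Qualitative response: majority vote among the neighbors."""
    return Counter(k_nearest(train, x, k)).most_common(1)[0][0]

def knn_regress(train, x, k=3):
    """Quantitative response: average of the neighbors' values."""
    neighbors = k_nearest(train, x, k)
    return sum(neighbors) / len(neighbors)

# Toy training sets, made up for illustration: (input, response) pairs.
labeled = [(1.0, "low"), (1.5, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]
numeric = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0), (8.0, 80.0), (9.0, 90.0)]

print(knn_classify(labeled, 1.8))  # low
print(knn_regress(numeric, 2.1))   # 20.0
```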

What we'll cover in this class
- Ch. 2: Statistical Learning Overview (next class)
- Ch. 3: Linear Regression
- Ch. 4: Classification
- Ch. 5: Resampling Methods
- Ch. 6: Linear Model Selection
- Ch. 7: Beyond Linearity
- Ch. 8: Tree-Based Methods
- Ch. 9: Support Vector Machines
- Ch. 10: Unsupervised Learning

General information
- Course website: cs.smith.edu/~jcrouser/sds293
- Slack channel is live: sds293.slack.com
- Syllabus (with slides before each lecture)
- Textbook
- Assignments
- Grading
- Accommodations

About the textbook
- Digital edition available for free at: www.statlearning.com
- Lots of useful R source code (including labs)
- The ISLR package includes all the datasets referenced in the book:
  > install.packages("ISLR")
- Many excellent GitHub repositories of solution sets available
...wait, what?

Disclaimer: this class is an experiment in constructionism (the idea that people learn most effectively when they're building personally-meaningful things)
My job as the instructor:

Assignments and grading
- Participation (10%): show up, engage, and you'll be fine
- Labs (30%): run during regular class time, help you get a hands-on look at how various ML techniques work
- 8 (short) assignments (40%): built to help you become comfortable with applying the techniques
- Course project (20%)

Preparing for labs in R
Two options available for using R:
1. You can install RStudio on your own machine: rstudio.com
2. You can use Smith's RStudio Server: rstudio.smith.edu:8787
If you're unfamiliar with R, you might want to take a look at Smith's "Getting Started with R" tutorial: www.math.smith.edu/tutorial/r.html

Preparing for labs in Python
- I like the Anaconda distribution from continuum.io, but you're welcome to use whatever you like
- You'll need to know how to install packages
- Either 2.7 or 3.6 is fine; we'll run into bugs either way :)

Course project (20%)
Topic: ANYTHING YOU WANT
Goals:
- Learn how to break big, unwieldy questions down into clear, manageable problems
- Figure out if/how the techniques we cover in class apply to your specific problems
- Use ML to address them
Several (graded) milestones along the way
Demos and discussion on the final day of class
More on this later

Course learning objectives
1. Understand what ML is (and isn't)
2. Learn some foundational methods / tools
3. Be able to choose methods that make sense

What I expect from you
- You like difficult problems and you're excited about figuring stuff out
- You have a solid foundation in introductory statistics
- You are proficient in coding and debugging (or are ready to work to get there)
- You're comfortable asking questions

What you can expect from me
- Your learning experience and process is important to me
- I'm flexible w.r.t. the topics we cover
- I'm happy to share my professional connections
- Somewhat limited in-person access

Reading
- In today's class, we covered ISLR: p. 15-28
- Next class, we'll be talking about how to compare various kinds of models (ISLR: p. 29-37)

For Wednesday
- Make sure you can access the slack channel
- Need a refresher on something? Just ask!

#questions?