All of the course materials on this page are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

1 of 11 3/12/2018 3:27 PM

Data mining is the science of discovering structure and making predictions in large, complex data sets. Nowadays, almost every organization collects data, which they hope to use to support improved decision making. Learning from data can enable us to better: detect fraud, make accurate medical diagnoses, monitor the reliability of a system, perform market segmentation, improve the success of marketing campaigns, and much, much more.

This course serves as an introduction to Data Mining for students in Business and Data Analytics. Students will learn about many commonly used methods for predictive and descriptive analytics tasks. They will also learn to assess the methods' predictive and practical utility.

By the end of the class, students will learn to:
  Use R to run many of the commonly used data mining methods
  Understand the advantages and disadvantages of various methods
  Compare the utility of different methods
  Reliably perform model/feature selection
  Use resampling-based approaches to assess model performance and reliability
  Perform analyses of real world data

All of the course materials on this page are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Required textbook
There is one required textbook in this class. It is available for free at the link below. If you find the textbook to be useful, please show your appreciation by purchasing a copy for personal use.
  Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, An Introduction to Statistical Learning: with Applications in R

Recommended textbooks
In addition to the required text, the following references are highly recommended. Students may find it useful to own a personal copy of one or two of the texts below.
  Witten and Frank, Data Mining: Practical Machine Learning Tools and Techniques
  Hastie, Tibshirani, Friedman, Elements of Statistical Learning
  Provost and Fawcett, Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
  Kuhn and Johnson, Applied Predictive Modeling

There are many resources online that may help you with various parts of the class.

Learning R
Here are some resources to help you learn R if you don't know it already.
  RStudio
  R for Data Science
  swirl: Learn R, in R
  94-842, My R Programming class
  Introduction to R Markdown
  R Style guide
  ggplot2 cheatsheet

Your grade in this course will be determined by a series of 5 weekly homework assignments, lab participation, two exams, and a final project.

Assignments (20%)
Weekly assignments will take the form of a single R Markdown file: namely, code snippets integrated with captions and other narrative. Unless otherwise indicated, all assignments are due before the start of the Thursday class session (2:50PM) on the dates indicated on the Schedule below. Your assignment score for the course will be calculated by averaging your four (4) highest homework scores. That is, your lowest homework score will not count toward your grade. While the homework assignments may vary in length and/or difficulty, each will be graded out of a possible 20 points.

Lab participation (10%)
In addition to the two lectures, there is a weekly lab session that meets in HBH A301 from 4:50-5:30PM. Lab attendance is mandatory and counts for 10% of your final grade. During the 1 hour lab section, students will get hands-on practice with the week's material by completing a set of structured data analytic exercises. Tasks may include but are not limited to: running or modifying code from the lecture, running methods, creating visualizations, writing short reports.

There is a Lab every week, with the exception of the last week of class. Thus there are a total of 6 Lab sessions. The 4th session is reserved for an in-class midterm, and therefore does not count toward your participation score. Your participation score for the course will be calculated based on the number of "regular" (non-midterm) lab sessions you attend and participate in, as specified by the table below.

  Labs attended      0    1    2    3    4-5
  Points (max = 10)  0    2.5  5    7.5  10

Midterm exam (15%)
The Midterm exam will take place from 4:30-5:50PM on February 9, in HBH A301. Only material covered during the first 3 weeks of class is eligible for the midterm exam.

The midterm exam will take the form of an open book written test. The test will consist of several problems. Just about every problem will be TRUE/FALSE, multiple choice, or an "and explain your answer" variant of such questions.

Sample question. Linear regression is only useful if you're certain that the true relationship between Y and your inputs X is linear. TRUE or FALSE? In a sentence or two, explain your answer.

General comment: The midterm is intended to assess your conceptual understanding of the material we covered in the first 3 weeks of class. Because the test is open note, I will not be asking questions where the answer is explicitly written out in the notes. E.g., I will not ask you to write out a step-by-step description of Cross-validation. However, I could ask you something like: Suppose that we have n = 2000 observations and we perform 20-fold Cross-validation. How many observations are used for Training at each step? (Answer: There will be 2000 / 20 = 100 observations in each fold, so 1900 observations will be used for training and 100 for testing at each step.)

Final exam (25%)
The time for the final exam is set by the University. Please check the official calendars for the latest time and date information. The final exam will be a closed book written exam. This exam is intended to test your complete knowledge of the concepts and methods covered in the class.

Final project (30%)
This will be a data analysis project to be conducted in groups of 2-4 students. More details to follow. Regardless of grading basis, students must receive a score of at least 50% on the final project in order to pass the class.

Your final course grade will be calculated according to the following breakdown.

  Assignments        20%
  Lab participation  10%
  Midterm exam       15%
  Final exam         25%
  Final project      30%

Late submission
Homework is to be submitted by 2:50PM on the due date indicated. Late homework will not be accepted for credit. Note that your lowest homework score will not count toward your grade, so you can miss one homework without it counting toward your course grade.

You are encouraged to discuss homework problems with your fellow students. However, the work you submit must be your own. You must acknowledge in your submission any help received on your assignments. That is, you must include a comment in your homework submission that clearly states the name of the student, book, or online reference from which you received assistance.

Submissions that fail to properly acknowledge help from other students or non-class sources will receive no credit. Copied work will receive no credit. Any and all violations will be reported to Heinz College administration.

All students are expected to comply with the CMU policy on academic integrity. This policy can be found online at http://www.cmu.edu/academic-integrity/.

The course collaboration policy allows you to discuss the problems with other students, but requires that you complete the work on your own. Every line of text and line of code that you submit must be written by you personally. You may not refer to another student's code, or a "common set of code" while writing your own code. You may, of course, copy/modify lines of code that you saw in lecture or lab.

The following discussion of code copying is taken from the Computer Science and Engineering Department at the University of Washington. I discussed these issues early on in class, and they are also covered in some form in the academic guidelines for CMU and Heinz College.

"[It is] important to make sure that the assistance you receive consists of general advice that does not cross the boundary into using code or answers written by someone else. It is fine to discuss ideas and strategies, but you should be careful to write your programs on your own."

"You must not share actual program code with other students. In particular, you should not ask anyone to give you a copy of their code or, conversely, give your code to another student who asks you for it; nor should you post your solutions on the web, in public repositories, or any other publicly accessible place. [You may not work out a full communal solution on a whiteboard/blackboard/paper and then transcribe the communal code for your submission.] Similarly, you should not discuss your algorithmic strategies to such an extent that you and your collaborators end up turning in [essentially] the same code. Discuss ideas together, but do the coding on your own."

"Modifying code or other artifacts does not make it your own. In many cases, students take deliberate measures -- rewriting comments, changing variable names, and so forth -- to disguise the fact that their work is copied from someone else. It is still not your work. Despite such cosmetic changes, similarities between student solutions are easy to detect. Programming style is highly idiosyncratic, and the chance that two submissions would be the same except for changes of the sort made easy by a text editor is vanishingly small. In addition to solutions from previous years or from other students, you may come across helpful code on the Internet or from other sources outside the class. Modifying it does not make it yours."

"[I] allow exceptions in certain obvious instances. For example, you might be assigned to work with a project team. In that case, developing a solution as a team is expected. The instructor might also give you starter code, or permit use of local libraries. Anything which the instructor explicitly gives you doesn't normally need to be cited. Likewise, help you receive from course staff doesn't need to be cited."

If you have any questions about any of the course policies, please don't hesitate to ask. You may post your questions on Piazza or ask me directly.
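As a rough illustration of how the grading rules stated earlier fit together (average the four highest of five homework scores, convert lab attendance to points, then apply the stated weights), the arithmetic could be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that exam and project scores are already expressed as percentages and that a missing homework counts as zero; none of this is an official grade calculator.

```python
# Weights and lab-point table are taken from the syllabus text.
WEIGHTS = {
    "assignments": 0.20,
    "lab": 0.10,
    "midterm": 0.15,
    "final_exam": 0.25,
    "project": 0.30,
}

# Regular (non-midterm) labs attended -> participation points (max 10).
LAB_POINTS = {0: 0.0, 1: 2.5, 2: 5.0, 3: 7.5, 4: 10.0, 5: 10.0}


def assignment_percent(hw_scores, max_points=20, keep=4):
    """Average the `keep` highest homework scores, as a percentage.

    Assumption: if fewer than `keep` scores are given, the missing
    ones count as zero.
    """
    best = sorted(hw_scores, reverse=True)[:keep]
    return 100.0 * sum(best) / (keep * max_points)


def lab_percent(labs_attended):
    """Convert the number of regular labs attended to a percentage."""
    if labs_attended < 0:
        raise ValueError("labs_attended must be non-negative")
    return 100.0 * LAB_POINTS[min(labs_attended, 5)] / 10.0


def course_percent(hw_scores, labs_attended, midterm, final_exam, project):
    """Weighted course percentage; exam and project inputs are 0-100."""
    components = {
        "assignments": assignment_percent(hw_scores),
        "lab": lab_percent(labs_attended),
        "midterm": midterm,
        "final_exam": final_exam,
        "project": project,
    }
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def meets_project_minimum(project):
    """The syllabus requires at least 50% on the final project to pass."""
    return project >= 50.0


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    total = course_percent([20, 18, 16, 10, 0], labs_attended=3,
                           midterm=80, final_exam=70, project=90)
    print(round(total, 2), meets_project_minimum(90))
```

Keeping the weights and the lab table as plain dictionaries mirrors the two tables in the syllabus, so a change in policy would only require editing the data, not the functions.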

Computing: The statistical computing package we will use in this course is R, which is available on many campus computers. You may download your own copy from http://www.r-project.org. We require that you use R Markdown to complete your assignments, which RStudio supports very well.

Laptop Policy: Students must bring their own laptops to the lab sessions.

Communication: Assignments and class information will be posted on Canvas and the class website.

Email: The Piazza forum should be used for general course-related questions that may be of interest to others in the class. For other types of questions (e.g., to report illness, request various permissions) please contact Prof. Chouldechova via email. Please include the course code 95791 in the subject line of your email.

Disability Services: If you have a disability and need special accommodations in this class, please contact the instructor. You may also want to contact the Disability Resources office at 8-2013.

Week 1: Introduction, Regression++
Date: 01/15-01/19
Topics:
  What is Data Mining?
  Course logistics
  What are predictive analytics (supervised learning)?
  What are descriptive analytics (unsupervised learning)?
  Introduction to the central themes of the class
  Linear regression as a predictive tool
  Polynomial regression
  Step functions
Suggested reading:
  ISLR 2.1
  ISLR 3.1, 3.2, 3.3, 3.4
  ISLR 7.1, 7.2
  94-842 Lecture 9: Linear regression in R
  94-842 Lecture 10: Factors and interactions in linear regression
Links: [Lecture 1 notes] [Rmd code] [html]
Lab 1: [Rmd] [html]
Lab 1: Solutions [Rmd] [html]
  Introduction to R, RStudio, R Markdown
  Linear regression in R

Week 2: Model selection and validation in regression
Date: 01/22-01/26
Due: HW 1
Topics:
  Splines
  Additive models
  Local regression
  Bias-Variance trade-off
  Testing-training
  Cross-validation
Suggested reading:
  ISLR 7.4, 7.5.1, 7.7.1
  ISLR 2.2.1, 2.2.2
  ISLR 5.1, 5.2
  GAMs R tutorial
Links: [Lecture 2 notes]
Lab 2: [Rmd] [html]
Lab 2: Solutions [Rmd] [html]
  Validation, Cross-validation in R
  Splines, additive models

Week 3: Model Selection, Classification
Date: 01/29-02/02
Due: HW 2
Topics:
  Model selection in regression
  Subset selection
  Regularized regression
  AIC/BIC
  Introduction to classification
  Bayes classifier
  Logistic regression
Suggested reading:
  ISLR 6.1, 6.2
  ISLR 5.3.4
  ISLR 2.2.3
  ISLR 4.1, 4.2, 4.3
Links: [Lecture 3 notes]
Lab 3: [Rmd] [html]
Lab 3: Solutions [Rmd] [html]
  Best subset, Forward, and Backward variable selection
  AIC, BIC
  Validation and Cross-validation for variable selection
  Lasso

Week 4: Classification
Date: 02/05-02/09
Due: HW 3
Topics:
  Logistic regression decision boundary
  k-nearest Neighbours
  Linear Discriminant Analysis
  Quadratic Discriminant Analysis
  Naive Bayes
  Assessing performance of classifiers
  Calibration plots
  Confusion matrices
  Cost-based assessment
  ROC, AUC
Suggested reading:
  ISLR 2.2.3
  ISLR 4.4, 4.5
  ISLR 5.1.5
  APM Chapter 11: Measuring Performance in Classification Models
Links: [Lecture 4 notes] [pROC package examples]
Midterm exam

Week 5: Tree-based methods, Advanced methods
Date: 02/12-02/16
Due: HW 4. Final project assigned.
Topics:
  Decision trees
  Bagging
  Random forests
Suggested reading:
  APM Chapter 11: Measuring Performance in Classification Models
  ISLR 8.1, 8.2
Links: [Lecture 5 notes] [Final project] [Project descriptions]
Lab 4: [Rmd] [html]
Lab 4: Solutions
  Classification and Regression trees

Week 6: Unsupervised learning
Date: 02/19-02/23
Topics:
  Random Forests
  Boosting
  Bootstrap SE estimates, CIs
  What is Unsupervised learning?
  K-means clustering
  Hierarchical clustering
  Association rule mining
Suggested reading:
  ISLR 8.1, 8.2
  ISLR 5.3.4
  ISLR 10.1, 10.3
Links: [Lecture 6 notes]
Lab 5: [Rmd] [html]
Lab 5: Solutions
  Random forests
  Boosting
  K-means, Hierarchical Clustering

Week 7: Unsupervised learning
Date: 02/26-03/02
Topics:
  What is Unsupervised learning?
  K-means clustering
  Hierarchical clustering
  Association rule mining
  Gaussian mixture models
  Dimensionality reduction
  Principal components regression
Suggested reading:
  ISLR 10.2
Links: [Lecture 7 notes]
Review session [Review slides]

Instructor: Prof. Alexandra Chouldechova
  yyy@cmu.edu, where yyy=achould
  HBH 2224
  Office Hours: See Piazza.

Teaching Assistants:
  Andres Salcedo Noguera, yyy@andrew.cmu.edu, yyy=asalcedo
  Pranav Bhatt, yyy@andrew.cmu.edu, yyy=pbhatt
  Rajeev Bhatia, yyy@andrew.cmu.edu, yyy=rrbhatia
  Dev Pal, yyy@andrew.cmu.edu, yyy=devdiptp
  Pranshu Srivastava, yyy@andrew.cmu.edu, yyy=pranshus

Class Meetings:
  W 6:00-8:50PM, HBH A301 (A3)
  TR 3:00-4:20PM, HBH 1002 (B3)
  F 4:30-5:30PM, HBH A301 (All)

This Website: http://www.andrew.cmu.edu/~achoulde/95791/ All course materials will be posted on this site.

Homework submission: Assignments are to be submitted via Blackboard.

Prerequisites: Students must be enrolled in a graduate program in Heinz College. Special permission can be granted by the College.

Homework 1 [Rmd] [html] Due 2:50PM, Thursday, January 25
Homework 2 [Rmd] [html] Due 2:50PM, Thursday, February 1
Homework 3 [Rmd] [html] Due 2:50PM, Thursday, February 8
Homework 4 [Rmd] [html] Due 2:50PM, Thursday, February 15
Homework 5 [Rmd] [html] Due 2:50PM, Tuesday, February 27
Final Project [Description] Due 11:59PM, March 9

Copyright (c) 2017 CMU. All rights reserved.