Data Science: Principles and Practice

Data Science: Principles and Practice
Lecture 8: Advanced Topics
Marek Rei

01 Overview of Complementary ML Techniques
02 Ethics in Data Science
03 Replicability of Findings
04 Assignment

Overview of Complementary ML Techniques

Support Vector Machines
Support Vector Machines (SVMs) are a type of classification algorithm. Logistic regression tries to maximize the probability of the correct class; an SVM instead tries to find the hyperplane that separates the closest points from the two classes with the largest margin. More details in Machine Learning and Bayesian Inference in the Easter term.
https://towardsdatascience.com/support-vector-machine-vs-logistic-regression-94cc2975433f
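The contrast between the two classifiers can be sketched with scikit-learn (assuming it is installed; the toy dataset and parameters here are purely illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Two well-separated clusters as a toy binary classification task
X, y = make_blobs(n_samples=200, centers=[[-2, -2], [2, 2]],
                  cluster_std=1.0, random_state=0)

# Logistic regression: maximizes the likelihood of the correct class
log_reg = LogisticRegression().fit(X, y)

# Linear SVM: finds the maximum-margin separating hyperplane
svm = LinearSVC().fit(X, y)

print("logistic regression accuracy:", log_reg.score(X, y))
print("linear SVM accuracy:", svm.score(X, y))
```

On data this cleanly separable both models do well; the difference between the two objectives only shows up in where exactly the decision boundary is placed.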

Decision Trees
Recursively divide the data into smaller sections to perform classification. Each internal node is a rule that splits the data; each leaf is a classification decision. Decision trees provide a (relatively) interpretable model, but can easily overfit to the training data.
Ruiz-Samblás et al. (2014): Application of data mining methods for classification and prediction of olive oil blends with other vegetable oils.
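As a sketch (assuming scikit-learn), we can fit a tree and print its learned rules, which is where the interpretability comes from; note that an unconstrained tree keeps splitting until its leaves are pure, so it fits the training data almost perfectly:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X, y = data.data, data.target

# Unconstrained tree: keeps splitting until the leaves are pure
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print("training accuracy:", tree.score(X, y))

# Each node is a human-readable rule, each leaf a class decision
print(export_text(tree, feature_names=data.feature_names))
```

The near-perfect training accuracy is exactly the overfitting problem mentioned above; limiting `max_depth` or `min_samples_leaf` is the usual remedy.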

Random Forests
Combine many different decision trees to make a single prediction: return either the most frequently predicted class or the average of the individual outputs. Much more stable than a single decision tree, because averaging smooths out the overfitting of the individual trees. Works really well in practice!
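A minimal sketch with scikit-learn (dataset and hyperparameters are illustrative): compare a single tree against a forest of 100 trees under cross-validation, where the forest is typically the more stable of the two:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0)
# 100 trees, each fit on a bootstrap sample of the data; class votes are aggregated
forest = RandomForestClassifier(n_estimators=100, random_state=0)

tree_score = cross_val_score(tree, X, y, cv=5).mean()
forest_score = cross_val_score(forest, X, y, cv=5).mean()
print("single tree  :", tree_score)
print("random forest:", forest_score)
```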

Convolutional Neural Networks
Neural modules operating repeatedly over different subsections of the input space. Great when searching for feature patterns without knowing where in the input they might be located. The main driver of progress in image recognition; can also be used for text.
https://github.com/vdumoulin/conv_arithmetic
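The core operation, sliding one small filter over every subsection of the input, can be sketched in plain NumPy (a "valid" 2D convolution; real CNN layers add multiple filters, channels, and learned weights):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over every position of `image` (no padding, stride 1)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # The same weights are applied at every location, so the filter
            # detects its pattern wherever it occurs in the input
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
edge_kernel = np.array([[1.0, -1.0]])   # responds to horizontal changes
print(conv2d(image, edge_kernel))
```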

Recurrent Neural Networks
Designed to process input sequences of arbitrary length. Each hidden state A is calculated from the current input and the previous hidden state. The main neural architecture for processing text, with each input being a word representation.
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
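The recurrence can be sketched in NumPy as a simple (Elman-style) RNN cell, where each hidden state is computed from the current input and the previous hidden state; the weights here are random placeholders, not trained:

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a simple tanh RNN over a sequence of input vectors."""
    h = np.zeros(W_hh.shape[0])           # initial hidden state
    states = []
    for x in inputs:                       # one step per sequence element
        h = np.tanh(x @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return states

input_dim, hidden_dim = 4, 3
W_xh = rng.normal(size=(input_dim, hidden_dim))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

sequence = [rng.normal(size=input_dim) for _ in range(5)]  # e.g. 5 word vectors
states = rnn_forward(sequence, W_xh, W_hh, b_h)
print(len(states), states[-1].shape)  # one hidden state per input
```

Because the same weight matrices are reused at every step, the cell handles sequences of any length.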

Dropout
During training, randomly set some neural activations to zero; typically 50% of the activations in a layer are dropped. A form of regularization: it prevents the network from relying too much on any one node.
https://www.learnopencv.com/understanding-alexnet/
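A sketch of (inverted) dropout in NumPy: at training time each activation is kept with probability 1 − p and scaled by 1 / (1 − p) so the expected value is unchanged, and at test time the layer is a no-op:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5, train=True):
    """Zero out each activation with probability p (inverted dropout)."""
    if not train:
        return activations                        # no-op at test time
    keep = rng.random(activations.shape) >= p     # random keep mask
    return np.where(keep, activations / (1 - p), 0.0)

a = np.ones(10000)
dropped = dropout(a, p=0.5)
# Roughly half the units are zeroed; the survivors are scaled up to 2.0,
# so the mean activation stays close to 1.0
print(dropped.mean())
```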

Ethics in Data Science

Privacy
1. Don't collect or analyze personal data without consent!
2. Keep the data secure, and if you don't need the data, delete it!
3. If you release data or statistics, be careful: they may reveal more than you intend.
https://www.nytimes.com/2018/03/18/us/cambridge-analytica-facebook-privacy-data.html

Privacy
Netflix released 100M anonymized movie ratings for their data science challenge:

movie  user    date        score
1      56      2004-02-14  5
1      25363   2004-03-01  3
2      855321  2004-07-29  3
2      44562   2004-07-30  4

In 16 days, researchers had identified specific users in the dataset by:
1) mapping movie scores to public accounts on IMDb;
2) extracting a user's entire rental history based on a few rented movies.
Netflix tried to launch a sequel to the competition but were sued by a user.

Leaking Private Information
https://www.theguardian.com/world/2018/jan/28/fitness-tracking-app-gives-away-location-of-secret-us-army-bases

Bias in the Training Data
Machine learning models learn to do what they are trained to do. The algorithms will pick up whatever biases are present in the dataset, good or bad.
Problem 1: The dataset is created with a bias and does not reflect the real task properly.
https://blogs.wsj.com/digits/2015/07/01/google-mistakenly-tags-black-people-as-gorillas-showing-limits-of-algorithms/

Bias in the Training Data
Problem 2: The data is representative but contains unwanted bias. We don't want our models to be racist, sexist, or discriminatory, even when the training data is.
Example: Turkish is a gender-neutral language, but Google Translate tries to infer a gender when translating into English.
https://twitter.com/seyyedreza/status/935291317252493312

Bias in the Training Data
Prior offenses: 2 armed robberies, 1 attempted armed robbery. Subsequent offenses: 1 grand theft.
Prior offenses: 4 juvenile misdemeanors. Subsequent offenses: none.
https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Bias in the Training Data
Solution 1: just remove race as a feature. Doesn't work! In the system above, race is not used as a feature. The problem: race is correlated with many other features that we may want to use in our machine learning system.
Solution 2: include race as a feature and explicitly correct for the bias. We might need to accept lower accuracy for a fairer model.

Interpretability of our Models
For many applications we need to understand why the model produced a specific output. EU law now requires that machine learning algorithms be able to explain their decisions. Neural networks are notoriously unexplainable black-box models.
https://www.bloomberg.com/opinion/articles/2017-05-15/don-t-grade-teachers-with-a-bad-algorithm

Replicability of Findings

Replicability
We test a lot of hypotheses but report only the significant results. This is fine: we can't publish a paper for every relation that doesn't hold. But we need to be aware of this selection effect when analyzing results. Studies trying to replicate existing findings are rare, and they often fail.
https://www.theguardian.com/science/2018/aug/27/attempt-to-replicate-major-social-scientific-findings-of-past-decade-fails

Contradicting Studies
https://www.vox.com/2015/3/23/8264355/research-study-hype

P-hacking
P-hacking is the misuse of data analysis to find patterns in data that can be presented as statistically significant when in fact there is no underlying effect. It is done by running large numbers of experiments and only paying attention to the ones that come back with significant results. Also known as data dredging, data snooping, data fishing, etc.
Statistical significance at p < 0.05 means there is less than a 5% chance of seeing a result this extreme if there were no underlying effect. That means we accept that some of our significant results are going to be false positives!

P-hacking
In total, 800 hypotheses to test.

P-hacking
The true underlying distribution:
Something going on in 100 configurations.
Nothing going on in the remaining 700.

P-hacking
For each hypothesis we test, we either discover something or we don't:
P(false positive) = 0.05
P(false negative) = 0.2

P-hacking
We made 80 true discoveries (100 × 0.8) and 35 false discoveries (700 × 0.05).
False Discovery Proportion = 35 / 115 ≈ 0.30

P-hacking
If P(false negative) = 0.4 and P(false positive) = 0.05:
We made 60 true discoveries and 35 false discoveries.
False Discovery Proportion = 35 / 95 ≈ 0.37

P-hacking
If P(false negative) = 0.4 and P(false positive) = 0.05 over 1600 experiments (100 real effects, 1500 nulls):
We made 60 true discoveries and 75 false discoveries.
False Discovery Proportion = 75 / 135 ≈ 0.56
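The arithmetic on the last three slides can be reproduced with a short helper (a sketch; `n_real` is the number of configurations where something is really going on):

```python
def false_discovery_proportion(n_hypotheses, n_real, false_neg, false_pos):
    """Expected share of 'discoveries' that are actually false positives."""
    true_disc = n_real * (1 - false_neg)               # real effects we detect
    false_disc = (n_hypotheses - n_real) * false_pos   # nulls that slip through
    return true_disc, false_disc, false_disc / (true_disc + false_disc)

print(false_discovery_proportion(800, 100, 0.2, 0.05))   # 80 true, 35 false, FDP ~0.30
print(false_discovery_proportion(800, 100, 0.4, 0.05))   # 60 true, 35 false, FDP ~0.37
print(false_discovery_proportion(1600, 100, 0.4, 0.05))  # 60 true, 75 false, FDP ~0.56
```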

Spurious Correlations
http://www.tylervigen.com/spurious-correlations

Spurious Correlations
A sample study with 54 people, searching over 27,716 possible relations.
https://fivethirtyeight.com/features/you-cant-trust-what-you-read-about-nutrition/

Strategies Against P-hacking
Distinguish between verifying a hypothesis and exploring the data. Benjamini & Hochberg (1995) offer an adaptive procedure for controlling the false discovery rate:
1. Rank the p-values from the M experiments in ascending order.
2. Calculate the Benjamini-Hochberg critical value (rank / M) × α for each experiment.
3. Significant results are those ranked at or below the largest p-value that is smaller than its critical value.
https://web.stanford.edu/class/stats101
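The three steps can be sketched in plain Python (the p-values and α = 0.05 are illustrative):

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one significance flag per p-value using the Benjamini-Hochberg procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # 1. rank the p-values
    largest_ok = 0
    for rank, i in enumerate(order, start=1):
        critical = rank / m * alpha                      # 2. BH critical value
        if p_values[i] <= critical:
            largest_ok = rank
    # 3. everything ranked at or below the largest passing rank is significant
    significant = [False] * m
    for rank, i in enumerate(order, start=1):
        significant[i] = rank <= largest_ok
    return significant

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06]
print(benjamini_hochberg(p_values))
# → [True, True, False, False, False, False]
```

Note that a naive per-test p < 0.05 cutoff would have accepted five of these six results; the adaptive critical values reject the borderline ones.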

Google Flu Trends
Predicting flu epidemics based on online behaviour.
https://www.npr.org/sections/health-shots/2014/03/13/289802934/googles-flu-tracker-suffers-from-sniffles

Google Flu Trends
http://www.wbur.org/commonhealth/2013/01/13/google-flu-trends-cdc
https://www.wired.com/2015/10/can-learn-epic-failure-google-flu-trends/
