Deconstructing Data Science

Deconstructing Data Science David Bamman, UC Berkeley Info 290 Lecture 2: Survey of Methods Jan 19, 2016

Linear regression, deep learning, decision trees, ordinal regression, probabilistic graphical models, random forests, support vector machines, logistic regression, survival models, topic models, networks, neural networks, perceptron, K-means clustering, hierarchical clustering

Classification
A mapping h from input data x (drawn from instance space X) to a label (or labels) y from some enumerable output space Y.
X = set of all skyscrapers
Y = {art deco, neo-gothic, modern}
x = the Empire State Building
y = art deco

Classification
h(x) = y
h(Empire State Building) = art deco

Classification
Let h(x) be the true mapping. We never know it. How do we find the best ĥ(x) to approximate it?
One option: rule-based
if x has a sunburst motif: ĥ(x) = art deco
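The rule-based option above can be sketched in a few lines. This is only an illustration: the dictionary representation of a building and the key name `sunburst_motif` are assumptions, not something the lecture specifies, and the function abstains (returns `None`) when the single rule does not fire.

```python
def h_hat(building):
    """Rule-based approximation of the unknown true mapping h.

    `building` is a hypothetical dict of observed traits.
    """
    if building.get("sunburst_motif", False):
        return "art deco"
    return None  # no rule applies; a real system would need more rules


# Hypothetical feature values, made up for illustration.
empire_state = {"name": "empire state building", "sunburst_motif": True}
print(h_hat(empire_state))
```

Hand-written rules like this are easy to read but hard to scale, which is the motivation for learning ĥ(x) from data instead.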

Classification
Supervised learning: given training data in the form of <x, y> pairs, learn ĥ(x).

task                     X       Y
spam classification      email   {spam, not spam}
authorship attribution   text    {jk rowling, james joyce, …}
genre classification     song    {hip-hop, classical, pop, …}
image tagging            image   {B&W, color, ocean, fun, …}

Methods differ in the form of ĥ(x) learned: deep learning, decision trees, probabilistic graphical models, random forests, logistic regression, networks, support vector machines, neural networks, perceptron

Model differences
Binary classification: |Y| = 2 [one out of 2 labels applies to a given x]
Multiclass classification: |Y| > 2 [one out of N labels applies to a given x]
Multilabel classification: |y| > 1 [multiple labels apply to a given x]
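The three cases above depend only on the size of the output space and on whether several labels may apply to one x. A small sketch makes the distinction concrete; the function name and its arguments are hypothetical, and the example label sets are taken from the task table earlier in the lecture.

```python
def classification_type(label_space, multiple_labels_per_x):
    """Name the classification setting for a given output space Y."""
    if len(label_space) < 2:
        raise ValueError("a classification task needs at least two labels")
    if multiple_labels_per_x:
        return "multilabel"  # |y| > 1 is allowed for a single x
    return "binary" if len(label_space) == 2 else "multiclass"


print(classification_type({"spam", "not spam"}, False))
print(classification_type({"art deco", "neo-gothic", "modern"}, False))
print(classification_type({"B&W", "color", "ocean", "fun"}, True))
```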

Regression
A mapping from input data x (drawn from instance space X) to a point y in R (R = the set of real numbers).
x = the Empire State Building
y = 17444.6
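As a minimal illustration of learning a mapping into R, the sketch below fits ordinary least squares with a single input feature. The toy numbers are made up (they lie exactly on y = 2x + 1) and are not from the lecture; real regression problems would have many features and noisy targets.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: y is approximated by slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training pairs <x, y>.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predict a real value for a new x
print(slope, intercept, prediction)
```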

Linear regression, deep learning, decision trees, ordinal regression, probabilistic graphical models, random forests, support vector machines (regression), survival models, networks, neural networks, perceptron

Big differences Are the labels yj and yk for two different data points xj and xk independent? During learning and prediction, would your guess for yj help you predict yk?

Label dependence Object recognition in images Neighboring pixels tend to have similar values (building, sky)

Label dependence
Homophily in social networks: friends tend to have similar attribute values.
[Figure: social network with nodes J. Adams, Franklin, Jefferson, Voltaire]

Big differences
Are the labels yj and yk for two different data points xj and xk independent? During learning and prediction, would your guess for yj help you predict yk? [Part of speech tagging, network homophily, object recognition in images]
Sequence models (HMMs, CRFs, LSTMs) and general graphical models (MRFs) can capture these dependencies, but they come at a high computational cost.
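To see how a guess for one label can inform another, here is a deliberately simple sketch of the homophily idea: predict a node's label by majority vote over its labeled neighbors. This is not an HMM, CRF, or MRF; it is a much cheaper stand-in chosen for illustration, and the graph, node names, and labels are made up.

```python
from collections import Counter


def predict_from_neighbors(node, edges, known_labels):
    """Guess a node's label from the labels of its neighbors in an undirected graph."""
    neighbors = [b for a, b in edges if a == node]
    neighbors += [a for a, b in edges if b == node]
    votes = Counter(known_labels[n] for n in neighbors if n in known_labels)
    if not votes:
        return None  # no labeled neighbors, so no evidence to use
    return votes.most_common(1)[0][0]


# Hypothetical friendship graph: node "a" is unlabeled.
edges = [("a", "b"), ("a", "c"), ("d", "a")]
known_labels = {"b": "red", "c": "red", "d": "blue"}
print(predict_from_neighbors("a", edges, known_labels))  # red
```

Full graphical models reason jointly over all labels instead of one node at a time, which is where the extra computational cost comes from.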

Big differences
How do the features in x interact with each other?
Independent? [Naive Bayes]
Potentially correlated but non-interacting? [Logistic regression, linear regression, perceptron, linear SVM]
Complex interactions? [Non-linear SVM, neural networks, decision trees, random forests]

Feature interactions
training data:
I like the movie            1
I hate the movie           -1
I do not like the movie    -1
I do not hate the movie     1
how predictive is: like, hate, not, not like, not hate
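The four training sentences above make a useful experiment. With single words as features, no linear model can label all four correctly, because "not" would need a negative weight next to "like" and a positive weight next to "hate". Adding word-pair features such as "not like" removes the conflict. The sketch below checks this with a plain perceptron; the helper names, the `<bias>` feature, and the epoch cap are choices made for this illustration.

```python
DATA = [
    ("I like the movie", 1),
    ("I hate the movie", -1),
    ("I do not like the movie", -1),
    ("I do not hate the movie", 1),
]


def featurize(text, use_bigrams):
    """Binary bag of words, optionally extended with adjacent word pairs."""
    tokens = text.lower().split()
    feats = set(tokens)
    if use_bigrams:
        feats |= {a + " " + b for a, b in zip(tokens, tokens[1:])}
    feats.add("<bias>")
    return feats


def score(weights, feats):
    return sum(weights.get(f, 0.0) for f in feats)


def train_perceptron(data, use_bigrams, max_epochs=1000):
    """Standard perceptron updates; stops early once an epoch has no mistakes."""
    weights = {}
    for _ in range(max_epochs):
        mistakes = 0
        for text, y in data:
            feats = featurize(text, use_bigrams)
            if y * score(weights, feats) <= 0:
                for f in feats:
                    weights[f] = weights.get(f, 0.0) + y
                mistakes += 1
        if mistakes == 0:
            break
    return weights


def accuracy(weights, data, use_bigrams):
    correct = 0
    for text, y in data:
        guess = 1 if score(weights, featurize(text, use_bigrams)) > 0 else -1
        correct += guess == y
    return correct / len(data)


unigram_acc = accuracy(train_perceptron(DATA, False), DATA, False)
bigram_acc = accuracy(train_perceptron(DATA, True), DATA, True)
print("words only:", unigram_acc)
print("words + pairs:", bigram_acc)
```

Models that capture complex interactions can discover such combinations on their own; with a linear model, the interaction has to be supplied as an explicit feature.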

What do you need?
1. Data (emails, texts)
2. Labels for each data point (spam/not spam, which author it was written by)
3. A way of "featurizing" the data that's conducive to discriminating the classes
4. To know that it works.
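Items 3 and 4 in the list above can be sketched together: a bag-of-words featurizer, and an accuracy check on labeled examples that were held out from training. The keyword rule below is only a hypothetical stand-in for a trained model, and the held-out emails are made up.

```python
from collections import Counter


def featurize(text):
    """Bag of words: map a text to its lowercase word counts."""
    return Counter(text.lower().split())


def accuracy(classifier, labeled_data):
    """Fraction of held-out <x, y> pairs the classifier labels correctly."""
    correct = sum(classifier(featurize(x)) == y for x, y in labeled_data)
    return correct / len(labeled_data)


def keyword_classifier(features):
    """Hypothetical stand-in for a trained model."""
    return "spam" if features["free"] > 0 else "not spam"


# Hypothetical held-out data, never seen when the classifier was built.
held_out = [
    ("free money free prizes", "spam"),
    ("see you at lunch", "not spam"),
    ("claim your reward", "spam"),
]
held_out_accuracy = accuracy(keyword_classifier, held_out)
print(held_out_accuracy)
```

Measuring on held-out data matters because accuracy on the training examples says little about how the model behaves on new inputs.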

What do you need?
Two steps to building and using a supervised classification model:
1. Train a model with data where you know the answers.
2. Use that model to predict data where you don't.
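The two steps above map directly onto a `train` function and a `predict` function. As a sketch, here is a small multinomial Naive Bayes classifier with add-one smoothing; Naive Bayes is just one possible choice of model, and the training emails are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train(pairs):
    """Step 1: collect label and word counts from labeled <x, y> pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in pairs:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Step 2: choose the label with the highest log probability for new text."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        log_prob = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                log_prob += math.log((word_counts[label][word] + 1) / denom)
        if log_prob > best_score:
            best_label, best_score = label, log_prob
    return best_label


# Hypothetical labeled training data.
training_pairs = [
    ("win cash now", "spam"),
    ("cheap pills win prize", "spam"),
    ("meeting at noon", "not spam"),
    ("lunch at noon tomorrow", "not spam"),
]
model = train(training_pairs)
print(predict(model, "win a prize now"))         # spam
print(predict(model, "lunch meeting tomorrow"))  # not spam
```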

Recognizing a Classification Problem Can you formulate your question as a choice among some universe of possible classes? Can you create (or find) labeled data that marks that choice for a bunch of examples? Can you make that choice? Can you create features that might help in distinguishing those classes?

Uses of classification
Two major uses of supervised classification/regression:
Prediction: train a model on a sample of data <x, y> to predict values for some new data xʹ.
Interpretation: train a model on a sample of data <x, y> to understand the relationship between x and y.

Clustering
Clustering (and unsupervised learning more generally) finds structure in data, using just X.
X = a set of skyscrapers

What is structure? Unsupervised learning finds structure in data. clustering data into groups discovering factors

Methods differ in the kind of structure learned: deep learning, probabilistic graphical models, networks, topic models, K-means clustering, hierarchical clustering

Structure
Partitioning X into N disjoint sets [K-means clustering, PGMs]
Assigning X to hierarchical structure [Hierarchical clustering]
Assigning X to partial membership in N different sets [EM clustering, PGMs, PCA]
Learning a representation of x in X that puts similar data points close to each other [Deep learning]
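The first kind of structure, partitioning X into disjoint sets, can be sketched with a bare-bones K-means. The initialization (taking the first k points as centroids) and the fixed iteration count are simplifying assumptions for this example, and the 2D points are made up; practical implementations use better initialization and a convergence test.

```python
def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(point, c))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))


def kmeans(points, k, iterations=10):
    """Alternate between assigning points to centroids and recomputing centroids."""
    centroids = list(points[:k])  # naive initialization, assumed for simplicity
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    assignments = [nearest(p, centroids) for p in points]
    return centroids, assignments


# Hypothetical data with two obvious groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (8.0, 9.0), (2.0, 1.0), (9.0, 8.0)]
centroids, assignments = kmeans(points, k=2)
print(assignments)  # [0, 1, 0, 1, 0, 1]
```

Note that no labels y are used anywhere: the grouping comes from X alone.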

Uses of clustering
Exploratory data analysis: discovering interesting or unexpected structure can be useful for hypothesis generation.
Input to supervised models: unsupervised learning generates alternate representations of each x as it relates to the larger X.

Input to supervised models Brown clusters trained from Twitter data: every word is mapped to a single (hierarchical) cluster http://www.cs.cmu.edu/~ark/tweetnlp/cluster_viewer.html
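One common way to feed hierarchical word clusters into a supervised model is to use prefixes of each word's cluster path as features, so that words in nearby clusters share the shorter prefixes. The sketch below shows the idea; the word-to-path table is entirely made up and is not taken from the Twitter clusters linked above, and the prefix lengths are an arbitrary choice.

```python
# Hypothetical word -> cluster path (bit string) table, invented for illustration.
CLUSTERS = {
    "lol": "0010",
    "haha": "0011",
    "monday": "1100",
    "friday": "1101",
}


def cluster_features(tokens, prefix_lengths=(2, 4)):
    """Replace each known word with features for prefixes of its cluster path."""
    feats = []
    for tok in tokens:
        path = CLUSTERS.get(tok.lower())
        if path is None:
            feats.append("word=" + tok.lower())  # fall back to the word itself
        else:
            feats.extend(f"cluster[:{n}]={path[:n]}" for n in prefix_lengths)
    return feats


print(cluster_features(["haha", "Friday", "pizza"]))
```

Here "lol" and "haha" both produce the feature `cluster[:2]=00`, so a classifier that has seen only one of them in training can still generalize to the other.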

Recognizing a Classification/Regression/Clustering Problem
I want to predict a star value {1, 2, 3, 4, 5} for a product review.
I want to find all of the texts that have allusions to Paradise Lost.
Optical character recognition.
I want to associate photographs of cats with animals in a taxonomic hierarchy.
I want to reconstruct an evolutionary tree for languages.

boyd and Crawford
danah boyd and Kate Crawford (2012), "Critical Questions for Big Data," Information, Communication and Society.
Specifically about big data, but we can read it as a commentary on much quantitative practice using social data.

1. big data changes the definition of knowledge
How do computational methods/quantitative analysis pragmatically affect epistemology?
Restricted to what data is available (Twitter, data that's digitized, Google Books, etc.). How do we counter this in experimental designs?
Establishes alternative norms for what research looks like.

2. claims to objectivity and accuracy are misleading What is still subjective in data/empirical methods? What are the interpretive choices still to be made? Interpretation introduces dependence on individuals. Is this ever avoidable? What does an experiment (or results) mean?

2. claims to objectivity and accuracy are misleading
Data collection and the selection process are subjective, reflecting belief in what matters.
Model design is likewise subjective: model choice (classification vs. clustering etc.), representation of data, feature selection.
Claims need to match the sampling bias of the data.

3. bigger data is not always better data Uncertainty about its source or selection mechanism [Twitter, Google books] Appropriateness for question under examination How did the data you have get there? Are there other ways to solicit the data you need? Remember the value of small data: individual examples and case studies

4. taken out of context, big data loses its meaning
A representation (through features) is a necessary approximation; what are the consequences of that approximation?
Example: quantitative measures of tie strength and its interpretation (e.g., articulated, behavior, personal networks).

5. just because it is accessible does not make it ethical Twitter, Facebook, OkCupid Anonymization practices for sensitive data (even if born public) Accountability both to research practice and to subjects of analysis

6. limited access to big data creates new digital divides Inequalities in access to data and the production of knowledge Privileging of skills required to produce knowledge

Tuesday 1/24: Classification Bring examples of hard problems that would fall under the domain of classification, and how you could approach training data collection