Overview. Overview of the course. Classification, Clustering, and Dimension reduction. The curse of dimensionality

Overview. Overview of the course. Classification, Clustering, and Dimension reduction. The curse of dimensionality. Tianwei Yu, RSPH Room 334, Tianwei.yu@emory.edu

Course Outline. Instructor: Tianwei Yu. Office: GCR Room 334. Email: tianwei.yu@emory.edu. Office Hours: by appointment. Teaching Assistants: Yunchuan Kong, Teng Fei, Yanting Huang. Office Hours: TBA. Course Website: http://web1.sph.emory.edu/users/tyu8/534

Overview. Focus of the course: Classification, Clustering, Dimension reduction.
Lecture schedule (first half):
1. Introduction
2. Python Q & A by TAs
3. Statistical background
4. Stat decision theory 1
5. Stat decision theory 2
6. Density estimation and KNN
7. Basis expansion 1
8. Basis expansion 2
9. Linear Machine
10. Support Vector Machine 1
11. Support Vector Machine 2
12. Boosting
13. Decision Tree
14. Random Forest
15. Bump hunting and forward stagewise regression

Overview. Lecture schedule (second half):
16. Hidden Markov Model 1
17. Hidden Markov Model 2
18. Neural networks 1
19. Neural networks 2
20. Neural networks 3
21. Model generalization 1
22. Model generalization 2
23. Clustering 1
24. Clustering 2 & EM algorithm
25. Clustering 3
26. Dimension reduction 1
27. Dimension reduction 2
28. Dimension reduction 3

References.
Textbooks:
- The Elements of Statistical Learning. Hastie, Tibshirani & Friedman.
- Python Machine Learning. Raschka & Mirjalili.
Other references:
- Pattern Classification. Duda, Hart & Stork.
- Data Clustering: Theory, Algorithms, and Applications. Gan, Ma & Wu.
- An Introduction to Statistical Learning: with Applications in R. James, Witten, Hastie & Tibshirani.

References. Python: https://wiki.python.org/moin/beginnersguide/nonprogrammers
Evaluation: four homework assignments/projects (20% each for the first three, 30% for the final project). Requirement: complete in Python; submit code with results. Class participation evaluated by 4 quizzes (10%).

Overview. Machine learning / data mining (diagram modified from Figure 1.1 of Data Clustering by Gan, Ma and Wu):
- Supervised learning (direct data mining): classification, estimation, prediction.
- Unsupervised learning (indirect data mining): clustering, association rules; description, dimension reduction and visualization.
- Semi-supervised learning.

Overview. In supervised learning, the problem is well-defined: given a set of observations {x_i, y_i}, estimate the density Pr(Y, X). Usually the goal is to find the model/parameters that minimize a loss. A common loss is the Expected Prediction Error, EPE(f) = E[L(Y, f(X))]; under squared-error loss, L(Y, f(X)) = (Y - f(X))^2, it is minimized at f(x) = E(Y | X = x), the conditional mean. Objective criteria exist to measure the success of a supervised learning mechanism.
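
As a quick numerical illustration of the point above (a minimal sketch, not part of the original slides; the model y = sin(x) + noise is an arbitrary choice for the demonstration), the conditional mean gives a smaller average squared loss than a mis-specified predictor:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: Y = sin(X) + Gaussian noise, so E(Y | X = x) = sin(x).
n = 100_000
x = rng.uniform(-3, 3, size=n)
y = np.sin(x) + rng.normal(scale=0.5, size=n)

# Estimate the expected squared-error loss E[(Y - f(X))^2] for two predictors.
epe_conditional_mean = np.mean((y - np.sin(x)) ** 2)   # the optimal predictor
epe_linear_guess = np.mean((y - 0.5 * x) ** 2)         # a mis-specified alternative

print(f"EPE of conditional mean: {epe_conditional_mean:.3f}")
print(f"EPE of linear guess:     {epe_linear_guess:.3f}")
```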

Overview. In unsupervised learning, there is no output variable; all we observe is a set {x_i}. The goal is to infer Pr(X) and/or some of its properties. When the dimension is low, nonparametric density estimation is possible; when the dimension is high, we may need to find simple properties without density estimation, or apply strong assumptions to estimate the density. There are no objective criteria from the data itself; to justify a result we rely on:
- heuristic arguments,
- external information,
- evaluation based on properties of the data.

Classification. The general scheme; an example.

Classification. In most cases, a single feature is not enough to generate a good classifier.

Classification. Two extremes: overly rigid and overly flexible classifiers.

Classification. Goal: an optimal trade-off between model simplicity and training set performance.

Classification. An example of the overall scheme involving classification.

Classification. A classification project: a systematic view.

Clustering. Assign observations into clusters, such that those within each cluster are more closely related to one another than to objects assigned to different clusters. Uses include:
- detecting relations within the data,
- finding a natural hierarchy,
- ascertaining whether the data consist of distinct subgroups, ...

Clustering. Mathematically, we hope to estimate the number of clusters k and the membership matrix U, where u_{ij} indicates the membership of observation i in cluster j. In hard clustering, u_{ij} is either 0 or 1 and each observation belongs to exactly one cluster; in fuzzy clustering, we have u_{ij} in [0, 1], with the memberships of each observation summing to 1.
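
A minimal sketch of building a hard membership matrix from a clustering result (not from the slides; it uses scikit-learn's KMeans on synthetic 2-D data purely for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic data: two Gaussian blobs in 2-D.
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(5, 1, (50, 2))])

k = 2
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

# Hard membership matrix U: U[i, j] = 1 if observation i is assigned to cluster j.
U = np.zeros((X.shape[0], k), dtype=int)
U[np.arange(X.shape[0]), labels] = 1

print(U.sum(axis=0))   # cluster sizes
print(U.sum(axis=1))   # each observation's memberships sum to 1 (hard clustering)
```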

Clustering. Some clusters are well represented by a center + spread model; some are not.

Dimension reduction. The purposes of dimension reduction:
- data simplification,
- data visualization,
- noise reduction (if we can assume only the dominating dimensions are signals),
- variable selection for prediction.

Dimension reduction. Methods organized by goal and by whether an outcome variable exists:
- Data separation, outcome variable y exists (learning the association rule): classification, regression.
- Data separation, no outcome variable (learning intrinsic structure): clustering.
- Dimension reduction, outcome variable y exists: SIR, class-preserving projection, partial least squares.
- Dimension reduction, no outcome variable: PCA, MDS, Factor Analysis, ICA, NCA.
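
As a concrete example of one method from the table above (a small sketch, not part of the slides; the synthetic data and scikit-learn calls are illustrative choices), PCA projects 10-dimensional data onto 2 components:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 200 points in 10 dimensions where only 2 latent directions carry signal.
latent = rng.normal(size=(200, 2))
loading = rng.normal(size=(2, 10))
X = latent @ loading + 0.1 * rng.normal(size=(200, 10))

pca = PCA(n_components=2)
Z = pca.fit_transform(X)              # 200 x 2 reduced representation

print(Z.shape)
print(pca.explained_variance_ratio_)  # most variance is captured by the first 2 PCs
```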

Curse of Dimensionality (Bellman R.E., 1961). In p dimensions, to get a hypercube with volume r, the needed edge length is r^(1/p). In 10 dimensions, to capture 1% of the data for a local average, we need 63% of the range of each input variable, since 0.01^(1/10) ≈ 0.63.
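
A quick check of that arithmetic (an illustrative sketch, not from the slides):

```python
# Edge length of a sub-hypercube capturing a fraction r of a unit hypercube in p dimensions.
r = 0.01
for p in (1, 2, 3, 10):
    print(p, round(r ** (1 / p), 3))
# p = 10 gives ~0.631: capturing 1% of the data needs 63% of each axis range.
```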

Curse of Dimensionality. In other words, to get an equally dense sample, if we need N = 100 samples in 1 dimension, then we need N = 100^10 samples in 10 dimensions. In high dimensions the data are always sparse and do not support density estimation. More data points are close to the boundary rather than to any other data point, so prediction is much harder near the edge of the training sample.

Curse of Dimensionality. Estimating a 1D density with 40 data points; standard normal distribution.

Curse of Dimensionality. Estimating a 2D density with 40 data points; 2D normal distribution with zero mean and identity covariance matrix.
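
The two slides above show how the same 40 points that support a reasonable 1-D density estimate spread thin in 2-D. A minimal sketch of such an experiment (not the original figures; it uses SciPy's Gaussian kernel density estimator as one possible tool):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
n = 40

# 1-D: 40 draws from a standard normal; evaluate the estimated density at 0.
x1 = rng.normal(size=n)
kde1 = gaussian_kde(x1)
print("1D estimate at 0:", kde1(0.0)[0], " true:", 1 / np.sqrt(2 * np.pi))

# 2-D: 40 draws from a standard bivariate normal (identity covariance).
x2 = rng.normal(size=(2, n))          # gaussian_kde expects shape (dim, n)
kde2 = gaussian_kde(x2)
print("2D estimate at origin:", kde2([[0.0], [0.0]])[0], " true:", 1 / (2 * np.pi))
```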

Curse of Dimensionality. Another example: the EPE of the nearest-neighbor predictor. To estimate E(Y | X = x), take the average of the responses of data points close to a given x, i.e. the k nearest neighbors of x. This assumes f(x) is well approximated by a locally constant function. When N is large, the neighborhood is small and the prediction is accurate.
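
A minimal sketch of such a nearest-neighbor average (not the course's implementation; the training data are simulated for illustration):

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k=5):
    """Estimate E(Y | X = x_query) by averaging y over the k nearest training points."""
    dist = np.linalg.norm(x_train - x_query, axis=1)
    nearest = np.argsort(dist)[:k]
    return y_train[nearest].mean()

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, size=(200, 1))
y_train = np.sin(3 * x_train[:, 0]) + rng.normal(scale=0.2, size=200)

print(knn_predict(x_train, y_train, np.array([0.5])))  # compare with sin(1.5) ~ 0.997
```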

Curse of Dimensionality. Data: uniform in [-1, 1]^p.

Curse of Dimensionality.
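
A small simulation in the spirit of the two slides above (an illustrative sketch, not the original figure): with N points uniform in [-1, 1]^p, even the nearest neighbor of a query point at the origin is far away once p is large, so "local" averaging is no longer local.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # training points

for p in (1, 2, 5, 10, 20):
    X = rng.uniform(-1, 1, size=(n, p))
    nearest = np.min(np.linalg.norm(X, axis=1))  # distance from the origin to its nearest neighbor
    print(f"p={p:2d}  nearest-neighbor distance from origin: {nearest:.3f}")
```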

Curse of Dimensionality. We have talked about the curse of dimensionality in the sense of density estimation. In a classification problem, we do not necessarily need density estimation.
- Generative models care about the mechanism, i.e. the class density functions: they learn p(x, y) and predict using p(y | x). In high dimensions, this is difficult.
- Discriminative models care about the boundary: they learn p(y | x) directly, potentially with a subset of X.

Curse of Dimensionality. (Diagram: a generative model relates the features X1, X2, X3 to y; a discriminative model maps the features directly to y.) Example: classifying belt fish and carp. Looking at the length/width ratio is enough; why should we care how many teeth each kind of fish has, or what shape its fins have?
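
To make the generative/discriminative contrast concrete (a hedged sketch with synthetic data, not the fish example from the slide), compare a generative classifier (Gaussian naive Bayes, which models the class-conditional feature densities) with a discriminative one (logistic regression, which models p(y | x) directly):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic two-class data: only the first feature carries the class signal.
n = 500
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 3))
X[:, 0] += 2.0 * y          # informative feature; X[:, 1] and X[:, 2] are pure noise

generative = GaussianNB().fit(X, y)              # models p(x | y) and p(y)
discriminative = LogisticRegression().fit(X, y)  # models p(y | x) directly

print("GaussianNB accuracy:        ", generative.score(X, y))
print("LogisticRegression accuracy:", discriminative.score(X, y))
```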

Curse of Dimensionality. Modern problems are almost always high-dimensional, and training data are often limited.
Restrictive models: more assumptions (that may be wrong); less vulnerable to the curse of dimensionality; require fewer training samples.
Flexible (adaptive) models: fewer assumptions; more vulnerable to the curse of dimensionality; require more training samples (?).
The ideal model would be: flexible enough to capture complex data structures; resistant to the curse of dimensionality, training well with limited samples; able to tell us about important predictors and their interactions.