Announcements. Only 104 people have signed up for a project team. If you have not signed up, or are on a team of 1, please try contacting other folks in the same situation; if this fails, please email me. I will hold office hours tomorrow, 3-4:15, in Revelator Coffee. No homework this week (or next). Midterm exam next Thursday (March 9). No class next Tuesday (I will be out of town).

Dimensionality reduction. We observe data $x_1, \ldots, x_n \in \mathbb{R}^D$. The goal of dimensionality reduction is to transform these inputs to new variables $z_1, \ldots, z_n \in \mathbb{R}^k$, where $k < D$, in such a way that minimizes information loss. Dimensionality reduction serves two main purposes: it helps (many) algorithms to be more computationally efficient, and it helps prevent overfitting (a form of regularization), especially when the number of training examples is small relative to the dimension.

Curse of dimensionality. As the dimensionality of our feature space grows, the volume of the space increases, a lot. In learning, this often translates to requiring exponentially more data in order for the results to be reliable. Example: with binary features, how much data do we need to have at least one example of every possible combination of features? With $D$ binary features, at least $2^D$ examples.
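A quick way to see this (not from the lecture, just a small NumPy illustration) is to count how many of the $2^D$ binary feature combinations a sample whose size grows only linearly in $D$ actually covers:

```python
import numpy as np

rng = np.random.default_rng(0)

for D in [5, 10, 15, 20]:
    n_cells = 2 ** D            # number of distinct binary feature combinations
    n = 10 * D                  # a sample budget that grows only linearly in D
    samples = rng.integers(0, 2, size=(n, D))
    seen = len({tuple(row) for row in samples})
    print(f"D={D:2d}: {n_cells:8d} combinations, {seen:3d}/{n} samples distinct, "
          f"coverage = {seen / n_cells:.6f}")
```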

Dimensionality reduction. Broadly speaking, methods for dimensionality reduction can be categorized according to:
1. How is information loss quantified?
2. Supervised or unsupervised? (i.e., if labels are available, how are they used?)
3. Is the map linear or nonlinear?
4. Feature selection versus feature extraction?

Feature selection. Feature selection is the problem of selecting a subset of the variables that are most relevant for a machine learning task (e.g., classification or regression); it is sometimes called subset selection. There are three main reasons why we might want to perform feature selection: computational efficiency, regularization, and retaining interpretability. Feature selection (and feature extraction) improves performance by eliminating irrelevant features.

Filter methods. Filter methods attempt to rank features in order of importance and then take the top $k$ features. In supervised learning, importance is usually related to the ability of a feature to predict the label or response variable. Advantage: simple and fast. Disadvantage: the $k$ best individual features are usually not the best set of $k$ features. The approach to ranking the features will depend on the application.

Filtering in classification. Consider training data $x_1, \ldots, x_n \in \mathbb{R}^D$ and labels $y_1, \ldots, y_n$, where each $y_i$ belongs to one of two classes. How should we rank the features?

Ranking criteria. Misclassification rate: score feature $j$ by $\min_t \frac{1}{n} \sum_{i=1}^n \mathbb{1}\{h_{j,t}(x_i) \ne y_i\}$, where $h_{j,t}$ is a classifier that compares the feature $x_{i,j}$ to a threshold $t$. Two-sample t-test statistic: $t_j = \frac{|\mu_j^{+} - \mu_j^{-}|}{s_j \sqrt{1/n_{+} + 1/n_{-}}}$, where $\mu_j^{+}$ and $\mu_j^{-}$ are the within-class means for feature $j$, $s_j$ is the pooled sample standard deviation, and $n_{+}$ and $n_{-}$ are the class sizes.
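As a concrete illustration of the t-test criterion, here is a minimal sketch that scores each feature by the magnitude of the two-sample t-statistic; it assumes a feature matrix `X` of shape (n, D) and binary labels `y` in {0, 1}, and the function name is only illustrative:

```python
import numpy as np

def ttest_scores(X, y):
    """Score each feature by the absolute two-sample t-statistic."""
    X_pos, X_neg = X[y == 1], X[y == 0]
    n_pos, n_neg = len(X_pos), len(X_neg)
    mu_pos, mu_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    # Pooled sample standard deviation for each feature
    s_pooled = np.sqrt(((n_pos - 1) * X_pos.var(axis=0, ddof=1) +
                        (n_neg - 1) * X_neg.var(axis=0, ddof=1)) / (n_pos + n_neg - 2))
    return np.abs(mu_pos - mu_neg) / (s_pooled * np.sqrt(1 / n_pos + 1 / n_neg))

# Keep the k highest-scoring features:
# top_k = np.argsort(ttest_scores(X, y))[::-1][:k]
```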

Ranking criteria. Margin: if the data is separable along feature $j$, then we can compute the gap between the two classes, e.g., $m_j = \min_{i : y_i = +1} x_{i,j} - \max_{i : y_i = -1} x_{i,j}$. This can be made robust to the non-separable case by replacing the hard minimum (and maximum) with an order statistic that allows you to ignore some fixed number of outliers.
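A hedged sketch of this margin criterion under the same assumptions on `X` and `y` (the `ignore` parameter implements the order-statistic variant; the exact definition on the slide may differ):

```python
import numpy as np

def margin_scores(X, y, ignore=0):
    """Per-feature gap between the two classes; ignore > 0 replaces the hard
    min/max with order statistics that discard a few outliers."""
    pos = np.sort(X[y == 1], axis=0)
    neg = np.sort(X[y == 0], axis=0)
    # Gap when the positive class lies above the negative class, and vice versa
    m1 = pos[ignore] - neg[-(ignore + 1)]
    m2 = neg[ignore] - pos[-(ignore + 1)]
    return np.maximum(m1, m2)
```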

Filtering in linear regression. In linear regression, we have training data $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in \mathbb{R}^D$ and $y_i \in \mathbb{R}$, and we expect $y$ to change linearly in response to changes in any feature. How should we rank the features?

Correlation coefficient. Pick the features which are most correlated with $y$. Set $\rho_j = \frac{\sum_{i=1}^n (x_{i,j} - \bar{x}_j)(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_{i,j} - \bar{x}_j)^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}$, where $\bar{x}_j$ and $\bar{y}$ are the sample means of feature $j$ and of the response, and rank the features by $|\rho_j|$.
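In code, this ranking amounts to computing the Pearson correlation of each column of `X` with `y` (a minimal sketch, with illustrative names):

```python
import numpy as np

def correlation_scores(X, y):
    """Score each feature by |Pearson correlation| with the response y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    rho = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))
    return np.abs(rho)

# top_k = np.argsort(correlation_scores(X, y))[::-1][:k]
```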

Mutual information. The mutual information between $X$ and $Y$ is $I(X; Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}$. This is the Kullback-Leibler (KL) divergence between the joint distribution $p(x, y)$ and the product of the marginal distributions $p(x) p(y)$. Note that $I(X; Y) = 0$ if $X$ and $Y$ are independent. You can intuitively think of $I(X; Y)$ as a measure of how much knowing $X$ tells us about $Y$.

Maximizing mutual information. If $X_S$ denotes the subset of features corresponding to an index set $S \subseteq \{1, \ldots, D\}$, then ideally we would like to maximize $I(X_S; Y)$ over all possible $S$ of a desired size. Unfortunately, this is typically intractable. Instead we could rank the features according to $I(X_j; Y)$, where the mutual information is estimated by first computing histograms (or some other estimate) of the joint distribution $p(x_j, y)$ and the marginals $p(x_j)$ and $p(y)$.
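A plug-in (histogram) estimate of $I(X_j; Y)$ for each feature might look like the sketch below; it assumes continuous features `X` and discrete labels `y`, and the binning choice is arbitrary:

```python
import numpy as np

def mi_scores(X, y, bins=10):
    """Estimate I(X_j; Y) for each feature with a histogram (plug-in) estimator."""
    n, D = X.shape
    classes = np.unique(y)
    scores = np.zeros(D)
    for j in range(D):
        edges = np.histogram_bin_edges(X[:, j], bins=bins)
        # Joint probability table over (class, bin of X_j)
        joint = np.array([np.histogram(X[y == c, j], bins=edges)[0] for c in classes],
                         dtype=float)
        joint /= n
        p_y = joint.sum(axis=1, keepdims=True)   # marginal over classes
        p_x = joint.sum(axis=0, keepdims=True)   # marginal over bins
        nz = joint > 0
        scores[j] = np.sum(joint[nz] * np.log(joint[nz] / (p_y * p_x)[nz]))
    return scores
```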

Incremental maximization. This is a legitimate strategy, but (just like the other methods we have discussed) it can lead to selecting highly redundant features. With mutual information, there is a natural way to deal with this redundancy by selecting features incrementally. For example, say that we have already selected a set of features $S$ and wish to select one more. Choose the feature $j \notin S$ to maximize $I(X_{S \cup \{j\}}; Y)$.
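A hedged sketch of this incremental strategy, assuming the features have already been discretized (e.g., binned) so that the joint mutual information can be estimated by counting; for more than a handful of selected features this plug-in estimate becomes unreliable, so treat it purely as an illustration:

```python
import numpy as np
from collections import Counter

def joint_mi(Xd, y, subset):
    """Plug-in estimate of I(X_S; Y) for discrete features Xd (n, D) and labels y."""
    n = len(y)
    joint = Counter(zip(map(tuple, Xd[:, subset]), y))
    p_x = Counter(map(tuple, Xd[:, subset]))
    p_y = Counter(y)
    return sum((c / n) * np.log((c / n) / ((p_x[xs] / n) * (p_y[yy] / n)))
               for (xs, yy), c in joint.items())

def greedy_mi_selection(Xd, y, k):
    """Incrementally add the feature that most increases the estimated I(X_S; Y)."""
    selected, remaining = [], list(range(Xd.shape[1]))
    for _ in range(k):
        best = max(remaining, key=lambda j: joint_mi(Xd, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected
```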

Alternatives to filtering. A big drawback of the filtering approach is that it usually doesn't capture interactions between features, which can result in selecting redundant features. Wrapper methods are an alternative with three ingredients:
1. a machine learning algorithm
2. a way to assess the performance of a subset of features
3. a strategy for searching through subsets of features
Advantage: captures feature interactions where filter methods do not. Disadvantage: can be slow.

Examples.
1. LR, SVM, nearest neighbors, least squares, ...
2. Holdout error, cross-validation, bootstrap, ...
3. Forward selection: start with no features; try adding each one, one at a time; pick the best, and then repeat. Backward elimination: start with all features; try removing each one, one at a time; remove the worst, and then repeat. Many, many others (see greedy algorithms for sparse recovery for hundreds of examples).
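A minimal sketch of forward selection as a wrapper method, using scikit-learn's LogisticRegression as the learning algorithm and cross-validated accuracy to assess each candidate subset (the learner, the scoring choice, and the function name are illustrative, not prescribed by the lecture):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, k, cv=5):
    """Greedy forward selection: grow the feature set one feature at a time,
    scoring each candidate subset by cross-validated accuracy."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k:
        scores = {j: cross_val_score(LogisticRegression(max_iter=1000),
                                     X[:, selected + [j]], y, cv=cv).mean()
                  for j in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected
```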

Embedded methods. Embedded methods jointly perform feature selection and model fitting instead of dividing these into two separate processes. The idea is to simultaneously learn a classifier or regression function that does well on the training data while only using a small number of features. Prime examples: the LASSO, and any other learning algorithm that uses $\ell_1$-norm regularization.
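For example, with scikit-learn's Lasso the selected features are simply those with nonzero coefficients; `X`, `y`, and the penalty strength `alpha` below are placeholders:

```python
import numpy as np
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)                # alpha controls the strength of the l1 penalty
lasso.fit(X, y)                         # X: (n, D) features, y: (n,) responses
selected = np.flatnonzero(lasso.coef_)  # indices of features with nonzero coefficients
print("selected features:", selected)
```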

Feature extraction. In general, there may not be a small subset of features that works well. Examples: speech, images, almost any sampled signal. How can we design a good mapping that minimizes the loss of information using only the data we are given? We will approach this from an unsupervised perspective.

Principal component analysis (PCA). Unsupervised. Linear. Loss criterion: sum of squared errors. The idea behind PCA is to find an approximation $x_i \approx \mu + A \theta_i$, where $\mu \in \mathbb{R}^D$, $\theta_i \in \mathbb{R}^k$, and $A$ is a $D \times k$ matrix with orthonormal columns.

Example. (Figure from Chapter 14 of Hastie, Tibshirani, and Friedman.)

Derivation of PCA. Mathematically, we can define $\mu$, $\{\theta_i\}$, and $A$ as the solution to $\min_{\mu, \{\theta_i\}, A : A^T A = I} \sum_{i=1}^n \|x_i - \mu - A \theta_i\|_2^2$. The hard part of this problem is finding $A$. Given $A$, it is relatively easy to show that $\mu = \bar{x}$ and $\theta_i = A^T (x_i - \bar{x})$.

Determining $\theta_i$. Suppose $\mu$ and $A$ are fixed. We wish to minimize $\sum_{i=1}^n \|x_i - \mu - A \theta_i\|_2^2$. Claim: we must have $\theta_i = A^T (x_i - \mu)$. Why? Determining $\theta_i$ is just standard least-squares regression, and because $A$ has orthonormal columns the least-squares solution simplifies to $A^T (x_i - \mu)$.

Determining $\mu$. Setting $\theta_i = A^T (x_i - \mu)$ and still supposing $A$ is fixed, our problem reduces to minimizing $\sum_{i=1}^n \|(I - A A^T)(x_i - \mu)\|_2^2$.

Determining $\mu$. Taking the gradient with respect to $\mu$ and setting it equal to zero, we obtain $(I - A A^T) \sum_{i=1}^n (x_i - \mu) = 0$. The choice of $\mu$ is not unique, but the easy (and standard) way to ensure this equality holds is to set $\mu = \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$.

Determining $A$. It remains to minimize $\sum_{i=1}^n \|(I - A A^T)(x_i - \bar{x})\|_2^2$ with respect to $A$. For convenience, we will assume that $\bar{x} = 0$; otherwise we could just substitute $\tilde{x}_i = x_i - \bar{x}$, since this does not change the problem. In this case the problem reduces to minimizing $\sum_{i=1}^n \|x_i - A A^T x_i\|_2^2$.

Determining $A$. Expanding this out (and using $A^T A = I$), we obtain $\sum_{i=1}^n \|x_i - A A^T x_i\|_2^2 = \sum_{i=1}^n \|x_i\|_2^2 - \sum_{i=1}^n \|A^T x_i\|_2^2$. Thus, we can instead focus on maximizing $\sum_{i=1}^n \|A^T x_i\|_2^2$.

Determining $A$. Note that for any vector $v$, we have $\sum_{i=1}^n (v^T x_i)^2 = v^T \left( \sum_{i=1}^n x_i x_i^T \right) v$. Thus, we can write $\sum_{i=1}^n \|A^T x_i\|_2^2 = \sum_{j=1}^k a_j^T S a_j$, where $a_1, \ldots, a_k$ are the columns of $A$ and $S = \sum_{i=1}^n x_i x_i^T$ is a scaled version of the empirical covariance matrix, sometimes called the scatter matrix.

Determining $A$. The problem of determining $A$ reduces to the optimization $\max_{A : A^T A = I} \sum_{j=1}^k a_j^T S a_j$. Analytically deriving the optimal $A$ is not too hard, but it is a bit more involved than you might initially expect (especially if you already know the answer). We will provide justification for the solution for the case $k = 1$; the general case is proven in the supplementary notes.

One-dimensional example. Consider the optimization problem $\max_{a : \|a\|_2 = 1} a^T S a$. Form the Lagrangian $L(a, \lambda) = a^T S a - \lambda (a^T a - 1)$. Take the gradient and set it equal to zero: $2 S a - 2 \lambda a = 0$, so $a$ must be an eigenvector of $S$. Since the objective value at such a point is $a^T S a = \lambda$, take $a$ to be the eigenvector of $S$ corresponding to the maximal eigenvalue.

The general case. For general values of $k$, the solution is obtained by computing the eigendecomposition of $S$: $S = U \Lambda U^T$, where $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_D)$ with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_D \ge 0$, and $U$ is an orthonormal matrix with columns $u_1, \ldots, u_D$, where $S u_j = \lambda_j u_j$.

The general case. The optimal choice of $A$ in this case is given by $A = [u_1 \ u_2 \ \cdots \ u_k]$, i.e., take the top $k$ eigenvectors of $S$. Terminology: the map $x \mapsto A^T (x - \bar{x})$ is the principal component transform, the coordinates $u_j^T (x - \bar{x})$ are the principal components, and $u_1$ is the principal eigenvector.
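Putting the derivation together, here is a minimal NumPy sketch of PCA via the eigendecomposition of the scatter matrix (the function name is illustrative):

```python
import numpy as np

def pca(X, k):
    """PCA of an (n, D) data matrix X, keeping k components.

    Returns the sample mean, the (D, k) matrix A whose columns are the top-k
    eigenvectors of the scatter matrix, and the (n, k) principal components
    theta_i = A^T (x_i - mean).
    """
    mu = X.mean(axis=0)
    Xc = X - mu
    S = Xc.T @ Xc                          # scatter matrix
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh returns eigenvalues in ascending order
    A = eigvecs[:, ::-1][:, :k]            # top-k eigenvectors
    theta = Xc @ A
    return mu, A, theta

# Approximate reconstruction from k components: X_hat = mu + theta @ A.T
```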