Introduction to Multivariate Classification Problems Byron P. Roe University of Michigan Ann Arbor, MI 48105 June 16, 2006

Use MiniBooNE as Example This experiment has many of the problems to be discussed in C (and some in A). MiniBooNE is looking for a small class of events, ν_μ → ν_e oscillation candidates. The background is about 1000 times the signal. There are some 300 candidate feature variables (FV), computed from the reconstructed events. If the new class exists, determine two parameters; if not, set limits as functions of these parameters.

Classification problem Divide data into several categories, given a number of feature variables for each event. Often used in particle physics with two categories, signal and background.

Older Methods Artificial Neural Net (ANN) Decision Trees

Neural Network Structure Combine the features in a non-linear way into a hidden layer and then into a final layer. Use a training set to find the best weights w_ik to distinguish signal from background.
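As a rough illustration of this structure (not the actual MiniBooNE network), here is a minimal one-hidden-layer forward pass; the layer sizes, activation functions, and random weights are placeholders.

```python
import numpy as np

def nn_output(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a one-hidden-layer network: features -> hidden -> output."""
    h = np.tanh(w_hidden @ x + b_hidden)               # non-linear combination into the hidden layer
    return 1.0 / (1.0 + np.exp(-(w_out @ h + b_out)))  # final layer: signal-like output in (0, 1)

# Illustrative shapes: 10 feature variables, 5 hidden nodes.
rng = np.random.default_rng(0)
x = rng.normal(size=10)              # one event's feature variables
w_hidden = rng.normal(size=(5, 10))  # the weights w_ik that training must determine
b_hidden = np.zeros(5)
w_out = rng.normal(size=5)
b_out = 0.0
print(nn_output(x, w_hidden, b_hidden, w_out, b_out))
```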

Decision Tree Go through all feature variables and find the best variable and value at which to split the events. For each of the two subsets, repeat the process. Proceeding in this way, a tree is built; the ending nodes are called leaves, each labeled background or signal.

Select Signal and Background Leaves Assume an equal weight of signal and background training events. If more than ½ of the weight of a leaf corresponds to signal, it is a signal leaf; otherwise it is a background leaf. Signal events on a background leaf or background events on a signal leaf are misclassified.

One Criterion for Best Split Purity, P, is the fraction of the weight of a node due to signal events. Gini = W P (1 - P), where W is the total weight of the node. Note that Gini is 0 for all signal or all background. The criterion is to minimize Gini_left + Gini_right of the two children from a parent node.

Criterion for Next Branch to Split Pick the branch to maximize the change in Gini: Criterion = Gini(parent) - Gini(right child) - Gini(left child).
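A short sketch of these two definitions, assuming the Gini = W·P·(1−P) form given above; the exhaustive scan over cut values and the variable names are illustrative only.

```python
import numpy as np

def gini(weights, is_signal):
    """Gini = W * P * (1 - P): W is the node's total weight, P its signal purity."""
    W = weights.sum()
    if W == 0:
        return 0.0
    P = weights[is_signal].sum() / W
    return W * P * (1.0 - P)

def best_split(x, weights, is_signal):
    """Scan the cut values of one feature variable and return the cut that
    maximizes Criterion = Gini(parent) - Gini(left child) - Gini(right child)."""
    parent = gini(weights, is_signal)
    best_cut, best_gain = None, -np.inf
    for cut in np.unique(x):
        left = x < cut
        gain = parent - gini(weights[left], is_signal[left]) \
                      - gini(weights[~left], is_signal[~left])
        if gain > best_gain:
            best_cut, best_gain = cut, gain
    return best_cut, best_gain
```

Repeating best_split over all feature variables gives the best variable and cut value at each node.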

Problems with Older Methods ANN is not stable in many available versions: (i) if a variable is put in twice, the answer often changes; (ii) if one variable is multiplied by two, the answer often changes; (iii) if the order of the variables is changed, the answer often changes. Decision trees are also unstable. GO ON TO NEWER METHODS

Newer Methods

Boosting the Decision Tree Give the training events misclassified under this procedure a higher weight. Continuing in this way, build perhaps 1000 trees and take a weighted average of the results (+1 if an event lands on a signal leaf, -1 if on a background leaf).
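The sketch below uses the standard AdaBoost weight-updating scheme as one concrete instance of this idea; the scoring used for MiniBooNE is slightly modified (see the warning below), and the tree size and count here are only illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost_trees(X, y, n_trees=1000, max_leaves=45):
    """AdaBoost-style boosting; y is +1 (signal) or -1 (background)."""
    w = np.full(len(y), 1.0 / len(y))        # event weights, start equal
    trees, alphas = [], []
    for _ in range(n_trees):
        tree = DecisionTreeClassifier(max_leaf_nodes=max_leaves)
        tree.fit(X, y, sample_weight=w)
        pred = tree.predict(X)
        err = w[pred != y].sum() / w.sum()   # weighted misclassification rate
        if err <= 0 or err >= 0.5:
            break
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)       # misclassified events get a higher weight
        w /= w.sum()
        trees.append(tree)
        alphas.append(alpha)
    return trees, alphas

def boost_score(trees, alphas, X):
    """Weighted average of the tree outputs (+1 signal leaf, -1 background leaf)."""
    return sum(a * t.predict(X) for a, t in zip(alphas, trees)) / sum(alphas)
```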

Many variants: change the Gini criterion; several weight-updating schemes; change the scoring; don't change the weights, but build many trees with subsets of the events (bagging, random forests); for neural nets, Bayesian neural nets. The basic point is to average over many trees in some way. Boosting can, in principle, be applied to many classification schemes (ANN, ...), but most use in physics has been with trees.

Good Reference T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer (2001).

Warning: Boost Use Different than in Many Statistics Articles 45 leaves (8 or fewer in many publications); 1000 trees; slightly modified scoring; use of several sets of boosting trees. Make a cut with the first set and then retrain on the remainder (cascade boosting), OR train with several different backgrounds and then use the boosting scores from each as additional feature variables for a final training.

Rule Fit This is a variant of boosted decision trees due to J. Friedman. Here each node of each tree can be thought of as a rule to select events. For 1000 trees with 45 leaves (89 nodes) apiece, this is 89,000 rules. The score is taken as a linear sum over the truth values of the rules. An algorithm is used to optimize the weight of each rule, with a regularization term to control the variations.
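A rough sketch of the idea (not Friedman's actual RuleFit implementation): each tree node defines a rule, the binary "event satisfies this rule" indicators become new features, and an L1-regularized linear fit stands in for the regularized optimization of the rule weights.

```python
from scipy.sparse import hstack
from sklearn.linear_model import LassoCV

def rule_features(trees, X):
    """Binary 'truth of the rule' indicators: one column per node of each tree."""
    return hstack([t.decision_path(X) for t in trees]).toarray()

def fit_rulefit(trees, X, y):
    """L1 regularization plays the role of the term controlling the rule-weight variations."""
    R = rule_features(trees, X)
    return LassoCV(cv=3).fit(R, y)   # y = +1 signal, -1 background

# Usage sketch (hypothetical names): given an ensemble of small trees, e.g.
#   trees = [DecisionTreeClassifier(max_leaf_nodes=45).fit(X[idx], y[idx])
#            for idx in bootstrap_indices]
# the fitted model scores new events via
#   scores = fit_rulefit(trees, X, y).predict(rule_features(trees, X_test))
```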

Support Vector Machines In the multidimensional space of the feature variables, find the borders between signal and background events. Use only the border region. Similar in a sense to boosting, which also gives the most weight to the hard-to-classify events, which are the border events.
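A minimal illustration using scikit-learn's SVC on toy data; the kernel, parameters, and toy feature distributions are placeholders, not a tuned MiniBooNE configuration.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data standing in for the feature variables of signal and background events.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(500, 3)),
               rng.normal(1.5, 1.0, size=(500, 3))])
y = np.r_[np.ones(500), -np.ones(500)]          # +1 signal, -1 background

# Only the support vectors -- the events in the border region -- define the boundary.
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print(len(svm.support_), "support vectors out of", len(X), "training events")
scores = svm.decision_function(X)               # signed distance from the border
```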

Comparisons It is hard to generalize here. It is likely that the best method depends on the problem. Comparisons are not easy. The comparisons must be made with each method tuned. See for instance the note of J. Conrad and F. Tegenfeldt hep-ph/0605106 and the subsequent e-mails between Conrad and Haijun Yang.

Comparisons II In the comparisons we have made for MiniBooNE and some data from BaBar, boosted decision trees worked as well as any method tried. B.P. Roe, H.J. Yang, J. Zhu, I. Stancu and G. McGregor, Nucl. Inst. and Meth. A543 (2005) 577; H.J. Yang, B.P. Roe and J. Zhu, Nucl. Inst. and Meth. A555 (2005) 370-385.

Can Statisticians Help Here? Are there different approaches to the data? Are there some useful graphical methods? There is a reluctance among some physicists to use modern classification methods because they are non-intuitive and because physicists worry about accurately modeling data in many dimensions. Are there suggestions from statisticians on these issues?

Number of Feature Variables In MiniBooNE we would like to reduce from 300 to perhaps 150 feature variables. (a) Check whether data distributions agree with Monte Carlo for the individual variables, and check robustness against small systematic changes in the model. (b) Make short runs and look at: (i) the feature variables used most often, OR (ii) the feature variables giving the biggest change in the Gini criterion, OR (iii) the feature variables used first.
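A sketch of option (b) using scikit-learn: a short boosted run, with variables ranked by how often they are used in splits and by their total impurity decrease (scikit-learn's feature_importances_ is an impurity-based stand-in for the Gini-change ranking, not the exact criterion above).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def rank_features(X, y, n_trees=100):
    """Short boosted run; rank the feature variables two ways."""
    gbc = GradientBoostingClassifier(n_estimators=n_trees, max_leaf_nodes=8).fit(X, y)
    # (i) how often each variable is used in a split
    counts = np.zeros(X.shape[1], dtype=int)
    for est in gbc.estimators_.ravel():
        used = est.tree_.feature
        for f in used[used >= 0]:        # negative entries mark leaf nodes
            counts[f] += 1
    # (ii) total impurity decrease attributed to each variable
    importance = gbc.feature_importances_
    return np.argsort(counts)[::-1], np.argsort(importance)[::-1]
```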

Number of Feature Variables II To first approximation, the methods give equal results, but each has problems. (Example: two variables looking at the same thing. Boosting may randomly pick one or the other, reducing the apparent use of each by a factor of two.) Do statisticians have any suggestions concerning the selection of feature variables?

Goodness of Fit First cut on the boosting score to reduce the sample size by a factor of more than a hundred. Even in this cut sample, 2/3 or more of the events are background. For this cut sample: take the boosting score as one variable and the event energy as a second, and do a chi-square or log-likelihood fit for the best values of the two parameters of interest or, alternatively, for upper limits on the size of the rare process as a function of the two parameters.
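A minimal sketch of the binned chi-square version of this fit; `predict(p1, p2)` is a hypothetical function returning the expected bin contents (signal plus background) in the 2-D (boosting score, energy) plane for given values of the two physics parameters.

```python
import numpy as np

def chi2(data, pred, sigma):
    """Binned chi-square in the 2-D (boosting score, event energy) plane."""
    return np.sum((data - pred) ** 2 / sigma ** 2)

def grid_scan(data, sigma, predict, par1_vals, par2_vals):
    """Scan the two physics parameters and return the chi-square surface and its minimum."""
    surface = np.array([[chi2(data, predict(p1, p2), sigma)
                         for p2 in par2_vals] for p1 in par1_vals])
    best = np.unravel_index(surface.argmin(), surface.shape)
    return surface, (par1_vals[best[0]], par2_vals[best[1]])
```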

Systematic Errors It is not easy to relate an assumed error in a parameter (e.g. the fraction of Cherenkov light) to the effect on the reconstructed event. Use Monte Carlo. Unisim: one run for each systematic parameter varied by one standard deviation; compare with the central value. Multisim: a number of MC runs, in each of which all systematic parameters are varied randomly. (See B. Roe technical note.) Do statisticians have any suggestions here?
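A sketch of the multisim bookkeeping under simple assumptions: the systematic parameters are drawn from a correlated normal distribution and the spread of the resulting predictions gives an error matrix. `run_mc(params)` is a hypothetical function returning the predicted bin contents for one set of parameter values.

```python
import numpy as np

def multisim_covariance(run_mc, central, sigmas, corr, n_runs=100, seed=0):
    """Multisim: vary all systematic parameters randomly in each MC run and
    build an error matrix from the spread of the predictions."""
    rng = np.random.default_rng(seed)
    cov_par = np.outer(sigmas, sigmas) * corr           # correlated parameter errors
    draws = rng.multivariate_normal(central, cov_par, size=n_runs)
    preds = np.array([run_mc(p) for p in draws])        # one prediction per MC run
    return np.cov(preds, rowvar=False)                  # systematic error matrix over bins
```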

Chi-Square Use of data to further estimate systematic errors. (D. Stump et al., Phys. Rev. D65, 014012.) Ignore Bayes vs frequentist. Take the chi-square with only statistical errors and add a term for each systematic, using the multidimensional correlated normal distribution assumed for the systematics. N systematic parameters are added but, effectively, N bins are added as well, so the number of degrees of freedom is the same. Runs into problems if there are more systematics than bins.
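One common way to write this construction is sketched below (the exact form used by Stump et al. may differ in detail): d_i are the data, t_i(θ, s) the prediction for physics parameters θ and systematic parameters s, σ_i the statistical errors, and V the assumed covariance matrix of the systematics.

\[
\chi^2(\theta, s) \;=\; \sum_{i \in \mathrm{bins}} \frac{\bigl(d_i - t_i(\theta, s)\bigr)^2}{\sigma_i^2} \;+\; s^{\mathsf{T}} V^{-1} s .
\]

Each of the N systematic parameters brings one quadratic constraint term, which is why N parameters and, effectively, N bins are added together.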

Log Likelihood Fits Effectively means using finer bins than one can with chi-square. The approximation -2 ln L ≈ chi-square fails past the 90% CL in one example of our binning. Use Monte Carlo: if the two output parameters were really at the assumed values, what is the probability of ln L(best) - ln L(real values) being at least as large as observed? It is hard to get to the 4-sigma equivalent normal-distribution level this way. Can statisticians suggest a better way?
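A small sketch of that Monte Carlo calibration, assuming a set of toy experiments generated at the assumed parameter values is already in hand; the conversion to an equivalent number of sigmas uses the one-sided normal tail.

```python
import numpy as np
from scipy.stats import norm

def significance_from_toys(delta_2lnL_obs, delta_2lnL_toys):
    """Fraction of toy experiments with Delta(-2 ln L) at least as large as
    observed, converted to an equivalent normal-distribution sigma."""
    toys = np.asarray(delta_2lnL_toys)
    p = np.mean(toys >= delta_2lnL_obs)
    return p, norm.isf(p)   # one-sided; p ~ 3e-5 corresponds to ~4 sigma

# Reaching the 4-sigma level this way needs of order 1/3e-5 ~ 30,000+ toys,
# which is one reason it is hard in practice.
```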

Finally Physicists and statisticians are now starting to work together to the benefit of both groups. We can use all the help we can get!!

Backup

Feedforward Neural Network--I

Feedforward Neural Network--II

Comparison of Boosting and ANN The relative ratio plotted is (ANN background kept)/(boosting background kept) versus the percent of ν_e CCQE events kept; a ratio greater than one implies boosting wins! A: all types of background events; red is 21 and black is 52 training variables. B: background is pi0 events; red is 22 and black is 52 training variables.

Effects of Number of Leaves and Number of Trees Smaller is better! R = c X frac. sig/frac. bkrd.

Effect of Number of PID Variables

AdaBoost Optimization

Can Convergence Speed be Improved? Removing correlations between variables helps. A random forest (using a random fraction [1/2] of the training events per tree, with replacement, and a random fraction of the PID variables per node; all PID variables were used for the test here) helps WHEN combined with boosting. Softening the step-function scoring also helps: y = 2*purity - 1; score = sign(y)*sqrt(|y|).
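The softened leaf scoring is simple enough to state directly; this is just the formula above, written out.

```python
import numpy as np

def leaf_score(purity):
    """Softened leaf scoring: replaces the +/-1 step function with a smooth value."""
    y = 2.0 * purity - 1.0
    return np.sign(y) * np.sqrt(np.abs(y))
```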

Performance of AdaBoost with Step Function and Smooth Function

AdaBoost Optimization

The MiniBooNE Collaboration

40-foot-diameter tank of mineral oil, surrounded by about 1280 photomultipliers. Both Cherenkov and scintillation light. Geometrical shape and timing distinguish events.