Learning from a Probabilistic Perspective


Data Mining and Concept Learning, CSI 5387

Outline
- Bayesian network classifiers
- Decision trees
- Random forests
- Neural networks

Bayes Classifier

Bayes' rule: the posterior probability of class C given features X is

    P(C | X) = P(X | C) P(C) / P(X)

- Posterior probability: P(C | X)
- Prior probability: P(C)
- Conditional probability (likelihood): P(X | C)
- Normalization factor: P(X)

Probabilistic Independence

- For a Boolean concept over n variables, the full joint distribution requires 2^n parameters (8 for n = 3); assuming the variables are independent reduces this to 2*n parameters (6 for n = 3).
- Independence is a way to reduce complexity (compare OO design, or voting).
- What is independence? (think of a committee whose members judge independently)
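A minimal sketch of this computation in Python, using hypothetical prior and conditional probabilities (the numbers are made up for illustration, not taken from the slides):

    # Minimal sketch of Bayes' rule for a binary class and one Boolean feature X.
    # All probability values below are hypothetical, chosen only for illustration.
    prior = {"+": 0.4, "-": 0.6}                         # P(C)
    likelihood = {("+", 0): 0.8, ("+", 1): 0.2,          # P(X | C)
                  ("-", 0): 0.3, ("-", 1): 0.7}

    def posterior(x):
        # Unnormalized scores P(X=x | C) * P(C), then divide by the normalization factor P(X=x).
        scores = {c: likelihood[(c, x)] * prior[c] for c in prior}
        z = sum(scores.values())
        return {c: s / z for c, s in scores.items()}

    print(posterior(0))   # {'+': 0.64, '-': 0.36}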

Bayes Classifier (example)

  X1 | C
  ---+---
   1 | -
   1 | -
   0 | +
   0 | -
   0 | -

IF X1=0, Then C=-

Naïve Bayes Classifier

The simplest Bayesian network: the class is the only parent of every feature.

  X1 | X2 | X3 | C
  ---+----+----+---
   1 |  1 |  1 | -
   1 |  1 |  1 | -
   0 |  0 |  0 | +
   0 |  0 |  0 | -
   0 |  0 |  0 | -

Structure: C -> X1, X2, X3, with parameters P(C), P(X1 | C), P(X2 | C), P(X3 | C).

Naïve Bayes Classifier

When the independence assumption is violated:
- Probability estimation becomes inaccurate.
- For classification, however, there is a large tolerance for dependencies.

On the same training table as above, the decision rule is still: IF X1=0, Then C=-.

Bayesian Network Classifiers

A more general structure: C -> X1, X2, X3 with parameters P(C), P(X1 | C), P(X2 | C, X1), P(X3 | C, X1).
- Advantage of assuming independence: the model is simpler.
- Disadvantage: if the variables contain dependencies, searching for the structure is difficult.
- Naïve Bayes is the most popular Bayesian network classifier.
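A quick numeric check of this effect on the table above (a sketch; the probabilities are raw frequency counts from the five rows, with no smoothing). Because X2 and X3 duplicate X1, naïve Bayes multiplies the same evidence three times and overstates P(+ | X1=0, X2=0, X3=0) relative to the empirical 1/3 (one positive out of the three rows with X1=0):

    from collections import Counter, defaultdict

    # Training table from the slide: columns X1, X2, X3 and class C.
    data = [((1, 1, 1), "-"), ((1, 1, 1), "-"),
            ((0, 0, 0), "+"), ((0, 0, 0), "-"), ((0, 0, 0), "-")]

    class_counts = Counter(c for _, c in data)
    feature_counts = defaultdict(Counter)        # (class, feature index) -> value counts
    for x, c in data:
        for i, v in enumerate(x):
            feature_counts[(c, i)][v] += 1

    def predict_proba(x):
        # P(C) * prod_i P(Xi | C), estimated with raw frequency counts (no smoothing).
        scores = {}
        for c, n_c in class_counts.items():
            p = n_c / len(data)
            for i, v in enumerate(x):
                p *= feature_counts[(c, i)][v] / n_c
            scores[c] = p
        z = sum(scores.values())
        return {c: s / z for c, s in scores.items()}

    print(predict_proba((0, 0, 0)))   # roughly {'-': 0.33, '+': 0.67}, far from the empirical 1/3 for '+'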

Probabilistic Decision Trees

  X1 | X2 | X3 | C
  ---+----+----+---
   1 |  1 |  1 | -
   1 |  1 |  1 | -
   0 |  0 |  0 | +
   0 |  0 |  0 | -
   0 |  0 |  0 | -

A decision tree only needs X1 to split:
- X1=1: a leaf with 2 instances, both -
- X1=0: a leaf with 3 instances, 1 + and 2 -

P(+ | X1=0, X2=0, X3=0) = 1/3 = 0.33
IF X1=0, Then C=-

Probabilistic Decision Trees (continued)

Each leaf represents a probability distribution P(C | X1, X2, X3).
The leaf class distributions are only related to
(1) the number of training instances, and
(2) the number of instances in the leaf.
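A minimal sketch of the leaf estimate for the X1=0 branch; the counts come directly from the table:

    from collections import Counter

    # Same table; the tree splits only on X1, so each leaf pools every row that
    # shares an X1 value and uses the class frequencies as P(C | leaf).
    data = [((1, 1, 1), "-"), ((1, 1, 1), "-"),
            ((0, 0, 0), "+"), ((0, 0, 0), "-"), ((0, 0, 0), "-")]

    def leaf_distribution(x1_value):
        classes = [c for (x1, _, _), c in data if x1 == x1_value]
        counts = Counter(classes)
        return {c: n / len(classes) for c, n in counts.items()}

    print(leaf_distribution(0))   # {'+': 0.33..., '-': 0.66...}, the 1/3 from the slide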

Bias and Variance

Each leaf represents a probability distribution P(C | X1, X2, X3).
- A small leaf gives an inaccurate probability estimate: high variance.
- A large leaf conditions on fewer variables: high bias. Example: using P(C) to approximate P(C | X1, X2, X3).
- Ideally, we want large leaves and many variables along each path.

Decision Trees vs. Bayesian Networks

                      Bayesian networks        Decision trees
  Sample efficiency   better                   worse
  Structure learning  hard (general graph)     easy (divide and conquer)

Similarity: both are probabilistic models and learn optimal classifiers given sufficient data.
In practice: Bayes classifiers perform well on small datasets, while decision trees become optimal given sufficient data.
Decision trees also capture context-specific (in)dependence.
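A small simulation sketch of the variance point. The leaf's true class probability (0.3) and the leaf sizes are made-up values; the point is only that the estimate's variance shrinks as the leaf grows:

    import random

    # Simulation: variance of a leaf's class-probability estimate vs. leaf size.
    # The true P(C=+) in the leaf's region is a made-up value of 0.3.
    random.seed(0)
    TRUE_P = 0.3

    def leaf_estimate(n):
        # Estimate P(C=+) from n training instances that fall into the leaf.
        return sum(random.random() < TRUE_P for _ in range(n)) / n

    for n in (3, 30, 300):
        estimates = [leaf_estimate(n) for _ in range(1000)]
        mean = sum(estimates) / len(estimates)
        var = sum((e - mean) ** 2 for e in estimates) / len(estimates)
        print(n, round(mean, 3), round(var, 4))   # variance shrinks as the leaf grows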

Duplication Problem in Decision Trees

A Boolean concept: C = (A1 and A2) or (A3 and A4).
A single decision tree splits on A1, then A2; the subtree that tests A3 and A4 must be grown twice, once under A1=0 and once under A1=1, A2=0:
- A1=1, A2=1: C=1
- A1=1, A2=0: test A3, then A4
- A1=0: test A3, then A4
Because (A3 and A4) has to be learned twice, the tree requires more training instances, e.g. {A1=T, A2=F, A3=T, A4=T}, {A1=T, A2=F, A3=T, A4=F}, {A1=T, A2=F, A3=F, A4=F}.

Independent Decision Trees

For the same Boolean concept C = (A1 and A2) or (A3 and A4): P(C=1) = 7/16, P(C=0) = 9/16.
Instead of one large tree, build two small trees: T1 splits on A1 and A2, and T2 splits on A3 and A4.
Each leaf stores class-conditional estimates, e.g. p(A1=1, A2=1 | C=1) = 1, p(A1=1, A2=1 | C=0) = 0, ...
P(A1, A2, A3, A4 | C) ≈ P(A1, A2 | C) P(A3, A4 | C)
Note that each tree has a large leaf, and together the two trees still use the same number of variables to make predictions.
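A quick enumeration check for the Boolean concept C = (A1 and A2) or (A3 and A4), assuming all 16 assignments of A1..A4 are equally likely:

    from itertools import product

    # Enumerate all 16 assignments of A1..A4 for the concept above, assuming a
    # uniform distribution over assignments.
    concept = lambda a1, a2, a3, a4: (a1 and a2) or (a3 and a4)
    positives = sum(bool(concept(*bits)) for bits in product((0, 1), repeat=4))
    print(positives, "/ 16")   # 7 / 16 positive assignments, i.e. P(C=1) = 7/16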

Independent Decision Trees (continued)

- A set of independent trees is more compact than a single decision tree, and therefore requires less training data to learn.
- Finding the independences between variables is difficult.
- In practice, an approximate learning algorithm is needed.

Learning Independent Decision Trees

- Construct a set of decision trees by injecting randomness into tree learning.
- The randomness makes the trees tend to be independent of each other: lower variance.
- Each decision tree still represents dependencies between variables: lower bias.

Random Trees

Bagging: grow each tree on a sample drawn from the training data at random, with replacement.
1. Each tree has high predictive power, but the dependencies between trees are strong.
2. When the sample size is large, all trees converge to the same tree.

Random trees: randomly select the splitting attribute at each node.
1. Each tree is more independent, but its predictive power is low.
2. This hurts especially when the data contains many useless variables.

Random Forests

Building one tree in a random forest (see the sketch below):
1. At each node, randomly select a set S of k variables.
2. Pick the best variable from S and split the training data into subsets based on its values.
3. For each derived subset, repeat the preceding steps.
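A rough sketch of this per-node procedure, under a few assumptions the slides do not fix: categorical features, Gini impurity as the split criterion, and a simple depth cap as the stopping rule. The bagging wrapper at the end grows each tree on a bootstrap sample, as described under Random Trees above.

    import random
    from collections import Counter

    def gini(rows):
        # Gini impurity of a set of (features, label) rows.
        counts = Counter(c for _, c in rows)
        total = len(rows)
        return 1 - sum((n / total) ** 2 for n in counts.values()) if total else 0.0

    def build_tree(rows, k, depth=0, max_depth=5):
        # At each node: sample k candidate features, split on the one with the
        # lowest weighted Gini impurity, and recurse on each subset.
        labels = [c for _, c in rows]
        if depth == max_depth or len(set(labels)) == 1:
            return Counter(labels)                       # leaf: class distribution
        n_features = len(rows[0][0])
        candidates = random.sample(range(n_features), min(k, n_features))
        best = None
        for f in candidates:
            values = {x[f] for x, _ in rows}
            if len(values) < 2:
                continue                                 # feature is constant here
            splits = {v: [r for r in rows if r[0][f] == v] for v in values}
            score = sum(len(s) / len(rows) * gini(s) for s in splits.values())
            if best is None or score < best[0]:
                best = (score, f, splits)
        if best is None:
            return Counter(labels)                       # no useful split found
        _, f, splits = best
        return (f, {v: build_tree(s, k, depth + 1, max_depth) for v, s in splits.items()})

    def build_forest(rows, n_trees=30, k=2):
        # Bagging: each tree is grown on a bootstrap sample of the training rows.
        return [build_tree([random.choice(rows) for _ in rows], k) for _ in range(n_trees)]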

Parameters in Random Forests

- The larger k is, the stronger the dependencies between trees.
- A larger k may not always improve the accuracy of individual trees on a small dataset, but it can make a difference on larger datasets. Why? Example: a set of equally important variables, where the training data only allows one split.
- The performance of a random forest is not very sensitive to k as long as k is small (< log(number of variables)).
- The number of trees should be large enough (> 30).

Advantages

- Random forests are competitive in accuracy with other popular algorithms such as boosting and SVMs.
- Simple to use and not sensitive to parameter settings.
- Do not overfit the data; no regularization is needed.
- Accurate probability estimation.
- Resistant to noise.
- No pruning; less of a class-imbalance problem.
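In scikit-learn terms (a usage sketch; the library and the data names below are not part of the slides), k corresponds to max_features and the number of trees to n_estimators:

    from sklearn.ensemble import RandomForestClassifier

    # max_features="log2" keeps k near log2 of the number of variables, in the
    # spirit of the rule of thumb above; n_estimators is well above 30 trees.
    clf = RandomForestClassifier(n_estimators=100, max_features="log2", random_state=0)
    # clf.fit(X_train, y_train)              # X_train, y_train are hypothetical data
    # probs = clf.predict_proba(X_test)      # per-class probability estimates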

Perceptrons vs. Naïve Bayes

- Similarity: both are generalized linear models, and the representational power of each can be increased by structure learning (multi-layer networks, unrestricted Bayesian networks).
- Parameter learning:
  - Naïve Bayes: generative learning (frequency estimates)
  - Perceptrons: discriminative learning (perceptron rule, gradient descent)

Generative vs. Discriminative Learning

                      Generative learning        Discriminative learning
  Objective function  P(C, X1, ..., Xn)          P(C | X1, ..., Xn)
  Training time       efficient                  slow
  Sample efficiency   better (< 100 instances)   worse
  Overfitting         no                         yes
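A toy contrast of the two parameter-learning styles; the data is made up (the class simply equals X1), and the discriminative side uses the standard perceptron rule mentioned above:

    # Toy contrast of the two parameter-learning styles (made-up data; class = X1).
    data = [((1, 1, 0), 1), ((1, 0, 1), 1), ((0, 1, 0), 0), ((0, 0, 1), 0)]

    # Generative (frequency estimates, one counting pass): P(C) and P(Xi | C).
    pos = [x for x, y in data if y == 1]
    p_c1 = len(pos) / len(data)
    p_x1_given_c1 = sum(x[0] for x in pos) / len(pos)

    # Discriminative (perceptron rule): iterate over the data, update on mistakes.
    w, b = [0.0, 0.0, 0.0], 0.0
    for _ in range(10):                                   # a few epochs
        for x, y in data:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            if pred != y:                                 # error-driven update
                w = [wi + (y - pred) * xi for wi, xi in zip(w, x)]
                b += y - pred
    print(p_c1, p_x1_given_c1, w, b)   # counts are set in one pass; w ends up weighting X1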

Generative vs. Discriminative Learning (continued)

- If the variables are independent: generative and discriminative learning may learn the same parameters.
- If there are duplicated variables: discriminative learning learns better parameters than generative learning.
- If there are XOR functions among the variables: both need to resort to structure learning.

Some Observations in Practice

A rough decision guide:
- Are the variables dependent? No: use Naïve Bayes.
- Yes, with XOR-style interactions: use decision trees.
- Yes, with duplicated variables: use perceptrons.