Assignment 6 (Sol.) Introduction to Machine Learning Prof. B. Ravindran

1. Assume that you are given a data set and a neural network model trained on the data set. You are asked to build a decision tree model with the sole purpose of understanding/interpreting the built neural network model. In such a scenario, which among the following measures would you concentrate most on optimising?

(a) Accuracy of the decision tree model on the given data set
(b) F1 measure of the decision tree model on the given data set
(c) Fidelity of the decision tree model, which is the fraction of instances on which the neural network and the decision tree give the same output
(d) Comprehensibility of the decision tree model, measured in terms of the size of the corresponding rule set

Sol. (c)
Here the aim is not the traditional one of modelling the data well; rather, it is to build a decision tree model that is as close to the existing neural network model as possible, so that we can use the decision tree to interpret the neural network model. Hence, we optimise the fidelity measure.
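As a quick illustration, fidelity as defined in option (c) is simply the agreement rate between the two models' predictions. A minimal sketch (the prediction lists are made-up examples, not from the assignment):

```python
def fidelity(nn_predictions, dt_predictions):
    """Fraction of instances on which the two models give the same output."""
    agree = sum(a == b for a, b in zip(nn_predictions, dt_predictions))
    return agree / len(nn_predictions)

# Example: the tree agrees with the network on 4 of the 5 instances.
print(fidelity([1, 0, 1, 1, 0], [1, 0, 1, 0, 0]))  # 0.8
```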

2. Which of the following properties are characteristic of decision trees?

(a) High bias
(b) High variance
(c) Lack of smoothness of prediction surfaces
(d) Unbounded parameter set

Sol. (b), (c) & (d)
Decision trees are generally unstable, considering that a small change in the data set can result in a very different set of splits. This is mainly due to the hierarchical nature of decision trees, since a change in the split points at the initial stages will affect all the subsequent splits.

The decision surfaces that result from decision tree learning are generated by recursive splitting of the feature space using axis-parallel hyperplanes. They clearly do not produce smooth prediction surfaces such as the ones produced by, say, neural networks.

Decision trees do not make any assumptions about the distribution of the data. They are non-parametric methods where the number of parameters depends solely on the data set on which training is carried out.

3. To control the size of the tree, we need to control the number of regions. One approach to do this would be to split tree nodes only if the resultant decrease in the sum of squares error exceeds some threshold. For the described method, which among the following are true?

(a) It would, in general, help restrict the size of the trees
(b) It has the potential to affect the performance of the resultant regression/classification model
(c) It is computationally infeasible

Sol. (a) & (b)
While this approach may restrict the eventual number of regions produced, the main problem with it is that it is too restrictive and may result in poor performance. It is very common for splits at one level, which themselves are not that good (i.e., they do not decrease the error significantly), to lead to very good splits (i.e., where the error is significantly reduced) further down the tree. Think about the XOR problem.

4. Which among the following statements best describes our approach to learning decision trees?

(a) Identify the best partition of the input space and response per partition to minimise sum of squares error
(b) Identify the best approximation of the above by the greedy approach (to identifying the partitions)
(c) Identify the model which gives the best performance using the greedy approximation (option (b)) with the smallest partition scheme
(d) Identify the model which gives performance close to the best performance (option (a)) with the smallest partition scheme
(e) Identify the model which gives performance close to the best greedy approximation performance (option (b)) with the smallest partition scheme

Sol. (e)
As was discussed in class, we use a greedy approximation to identify the partitions and typically use pruning techniques, which result in a smaller tree, possibly with some degradation in performance.

5. Having built a decision tree, we are using reduced error pruning to reduce the size of the tree. We select a node to collapse. For this particular node, on the left branch there are 3 training data points with the following outputs: 5, 7, 9.6, and on the right branch there are 4 training data points with the following outputs: 8.7, 9.8, 10.5, 11. What were the original responses for data points along the two branches (left & right respectively), and what is the new response after collapsing the node?

(a) 10.8, 13.33, 14.48
(b) 10.8, 13.33, 12.06
(c) 7.2, 10, 8.8
(d) 7.2, 10, 8.6

Sol. (c)
Original responses:
Left: $\frac{5 + 7 + 9.6}{3} = \frac{21.6}{3} = 7.2$
Right: $\frac{8.7 + 9.8 + 10.5 + 11}{4} = \frac{40}{4} = 10$
New response: $7.2 \cdot \frac{3}{7} + 10 \cdot \frac{4}{7} = 8.8$
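The same arithmetic in a short Python sketch: for a regression tree, the prediction in each region is the mean of the training responses falling in that region, and the collapsed node predicts the mean over all seven points, i.e. the weighted average of the two branch means.

```python
left = [5, 7, 9.6]
right = [8.7, 9.8, 10.5, 11]

left_mean = sum(left) / len(left)      # 7.2, original response on the left branch
right_mean = sum(right) / len(right)   # 10.0, original response on the right branch

# After collapsing, the node predicts the mean over all 7 points.
collapsed = (sum(left) + sum(right)) / (len(left) + len(right))  # 8.8

print(left_mean, right_mean, collapsed)
```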

6. Given that we can select the same feature multiple times during the recursive partitioning of the input space, is it always possible to achieve 100% accuracy on the training data (given that we allow trees to grow to their maximum size) when building decision trees?

(a) Yes
(b) No

Sol. (b)
Consider a pair of data points with identical input features but different class labels. Such points can be part of the training data, but no tree can classify both of them correctly.

7. Suppose on performing reduced error pruning, we collapsed a node and observed an improvement in the prediction accuracy on the validation set. Which among the following statements are possible in light of the performance improvement observed?

(a) The collapsed node helped overcome the effect of one or more noise-affected data points in the training set
(b) The validation set had one or more noise-affected data points in the region corresponding to the collapsed node
(c) The validation set did not have any data points along at least one of the collapsed branches
(d) The validation set did have data points adversely affected by the collapsed node

Sol. (a), (b), (c) & (d)
The first option is the kind of error we normally expect pruning to help us overcome. However, a node collapse which ideally should result in an increase in the overall error of the model may actually show an improvement due to a number of factors. Perhaps the points which should have been misclassified due to the collapse are mislabelled in the validation set (option (b)). Such points may also be missing from the validation set (option (c)). Finally, even if the increased error due to the collapsed node is registered in the validation set, it may be masked by the absence of errors (existing in the training data) in other parts of the validation set (option (d)).

8. Consider the following data set:

price   maintenance   capacity   airbag   profitable
low     low           2          no       yes
low     med           4          yes      no
low     low           4          no       yes
low     high          4          no       no
med     med           4          no       no
med     med           4          yes      yes
med     high          2          yes      no
med     high          5          no       yes
high    med           4          yes      yes
high    high          2          yes      no
high    high          5          yes      yes

Considering profitable as the binary-valued attribute we are trying to predict, which of the attributes would you select as the root in a decision tree with multi-way splits using the cross-entropy impurity measure?

(a) price
(b) maintenance
(c) capacity
(d) airbag

Sol. (c)
$\text{cross-entropy}_{\text{price}}(D) = \frac{4}{11}\left(-\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4}\right) + \frac{4}{11}\left(-\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4}\right) + \frac{3}{11}\left(-\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3}\right) = 0.9777$

$\text{cross-entropy}_{\text{maintenance}}(D) = \frac{2}{11}\left(-\frac{2}{2}\log_2\frac{2}{2} - \frac{0}{2}\log_2\frac{0}{2}\right) + \frac{4}{11}\left(-\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4}\right) + \frac{5}{11}\left(-\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5}\right) = 0.8050$

$\text{cross-entropy}_{\text{capacity}}(D) = \frac{3}{11}\left(-\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3}\right) + \frac{6}{11}\left(-\frac{3}{6}\log_2\frac{3}{6} - \frac{3}{6}\log_2\frac{3}{6}\right) + \frac{2}{11}\left(-\frac{2}{2}\log_2\frac{2}{2} - \frac{0}{2}\log_2\frac{0}{2}\right) = 0.7959$

$\text{cross-entropy}_{\text{airbag}}(D) = \frac{5}{11}\left(-\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5}\right) + \frac{6}{11}\left(-\frac{3}{6}\log_2\frac{3}{6} - \frac{3}{6}\log_2\frac{3}{6}\right) = 0.9868$

(Here we use the convention $0 \log_2 0 = 0$.) Capacity gives the lowest weighted cross-entropy, so it is selected as the root.

9. For the same data set, suppose we decide to construct a decision tree using binary splits and the Gini index impurity measure. Which among the following feature and split point combinations would be the best to use as the root node, assuming that we consider each of the input features to be unordered?

(a) price - {low, med} {high}
(b) maintenance - {high} {med, low}
(c) maintenance - {high, med} {low}
(d) capacity - {2} {4, 5}

Sol. (c)
$\text{gini}_{\text{price}(\{low, med\}, \{high\})}(D) = \frac{8}{11}\cdot 2\cdot\frac{4}{8}\cdot\frac{4}{8} + \frac{3}{11}\cdot 2\cdot\frac{2}{3}\cdot\frac{1}{3} = 0.4848$

$\text{gini}_{\text{maintenance}(\{high\}, \{med, low\})}(D) = \frac{5}{11}\cdot 2\cdot\frac{2}{5}\cdot\frac{3}{5} + \frac{6}{11}\cdot 2\cdot\frac{4}{6}\cdot\frac{2}{6} = 0.4606$

$\text{gini}_{\text{maintenance}(\{high, med\}, \{low\})}(D) = \frac{9}{11}\cdot 2\cdot\frac{4}{9}\cdot\frac{5}{9} + \frac{2}{11}\cdot 2\cdot\frac{2}{2}\cdot\frac{0}{2} = 0.4040$

$\text{gini}_{\text{capacity}(\{2\}, \{4, 5\})}(D) = \frac{3}{11}\cdot 2\cdot\frac{1}{3}\cdot\frac{2}{3} + \frac{8}{11}\cdot 2\cdot\frac{5}{8}\cdot\frac{3}{8} = 0.4621$

Splitting maintenance into {high, med} and {low} gives the lowest Gini impurity, so it is the best choice for the root.
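The impurity values above can be verified with a short script. This is a sketch assuming the data set is encoded as the list of tuples below (values copied from the table); for two classes, the Gini form $1 - p^2 - (1-p)^2$ used here equals the $2\,p\,(1-p)$ form used in the solution.

```python
from math import log2

# (price, maintenance, capacity, airbag, profitable)
data = [
    ("low",  "low",  2, "no",  "yes"), ("low",  "med",  4, "yes", "no"),
    ("low",  "low",  4, "no",  "yes"), ("low",  "high", 4, "no",  "no"),
    ("med",  "med",  4, "no",  "no"),  ("med",  "med",  4, "yes", "yes"),
    ("med",  "high", 2, "yes", "no"),  ("med",  "high", 5, "no",  "yes"),
    ("high", "med",  4, "yes", "yes"), ("high", "high", 2, "yes", "no"),
    ("high", "high", 5, "yes", "yes"),
]
attrs = {"price": 0, "maintenance": 1, "capacity": 2, "airbag": 3}
labels = [row[-1] for row in data]

def entropy(ys):
    n = len(ys)
    return -sum((ys.count(c) / n) * log2(ys.count(c) / n) for c in set(ys))

def gini(ys):
    n = len(ys)
    return 1 - sum((ys.count(c) / n) ** 2 for c in set(ys))

def weighted_impurity(attr, impurity, grouping=None):
    """Weighted impurity of the children of a split on `attr`.
    `grouping` maps attribute values to branches; None means a multi-way split."""
    idx = attrs[attr]
    branches = {}
    for row, y in zip(data, labels):
        key = row[idx] if grouping is None else grouping[row[idx]]
        branches.setdefault(key, []).append(y)
    n = len(data)
    return sum(len(ys) / n * impurity(ys) for ys in branches.values())

# Question 8: multi-way splits with cross-entropy.
for a in attrs:
    print(a, round(weighted_impurity(a, entropy), 4))
# price 0.9777, maintenance 0.805, capacity 0.7959, airbag 0.9868

# Question 9: binary splits with the Gini index.
print(round(weighted_impurity("maintenance", gini,
      {"high": 0, "med": 1, "low": 1}), 4))   # 0.4606 for {high} vs {med, low}
print(round(weighted_impurity("maintenance", gini,
      {"high": 0, "med": 0, "low": 1}), 4))   # 0.404 for {high, med} vs {low}
```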

10. Consider building a spam filter for distinguishing between genuine e-mails and unwanted spam e-mails. Assuming spam to be the positive class, which among the following would be more important to optimise?

(a) Precision
(b) Recall

Sol. (a)
If we optimise recall, we may be able to capture more spam e-mails, but in the process, we may also increase the number of genuine e-mails being predicted as spam. On the other hand, if we optimise precision, we may not be able to capture as many spam e-mails as in the previous approach, but the percentage of e-mails classified as spam that actually are spam will be high. Of the two possible trade-offs, we would prefer the latter (optimising precision), since in the former approach we risk more genuine e-mails being classified as spam, which is a costlier error to make in a spam filtering application (as compared to missing a few spam e-mails, which the user may have to deal with manually).
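For reference, precision and recall are computed from the confusion-matrix counts as below. A minimal sketch with made-up counts (spam is the positive class):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical filter: 90 spam caught, 5 genuine mails flagged as spam,
# 20 spam messages missed.
p, r = precision_recall(tp=90, fp=5, fn=20)
print(p, r)  # 0.947..., 0.818...
```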

Weka-based assignment questions

In this assignment, we will use the UCI Mushroom data set available here. We will be using the J48 decision tree algorithm, which can be found in Weka under classifiers/trees. We will consider the following to be the default parameter settings:

(screenshot of the default J48 parameter settings)

Note the following:
- The class for which the prediction model is to be learned is named class and is the first attribute in the data.
- We will use the default Cross-validation test option with Folds = 10.
- Once a decision tree model has been built, you can right-click on the corresponding entry in the Result list pane on the bottom left and select Visualize tree to see a visual representation of the learned tree.

11. How many levels does the unpruned tree contain, considering multi-way and binary splits respectively, with the other parameters remaining the same as above?

(a) 6, 8
(b) 6, 7
(c) 5, 7
(d) 5, 8

Sol. (b)

Multi-way split (tree visualisation)

Binary split (tree visualisation)

12. How many levels does the pruned tree (unpruned = false, reducedErrorPruning = false) contain, considering multi-way and binary splits respectively?

(a) 6, 6
(b) 6, 7
(c) 5, 7
(d) 5, 6

Sol. (a)

Multi-way split (tree visualisation)

Binary split (tree visualisation)

13. Consider the effect of the parameter minNumObj. Try modifying the parameter value and observe the changes in the performance values. For example, considering multi-way and binary splits respectively, with the default parameters as discussed before, at what (maximum) value of the parameter do we start to see zero error?

(a) 11, 7
(b) 11, 6
(c) 10, 6
(d) 10, 7

Sol. (c)
This can be observed by trying out the listed values with the default parameters. The idea is for you to appreciate the significance of the minNumObj parameter.

14. Which among the following pairs of attributes seem to be the most important for this particular classification task?

(a) population, gill-spacing
(b) stalk-surface-below-ring, cap-color
(c) gill-spacing, ring-number
(d) odor, spore-print-color

Sol. (d)
From multiple models built using binary/multi-way splits, with and without pruning, we observe that the attributes odor and spore-print-color always appear at the higher levels of the trees, indicating their importance for this classification task.
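Questions 11-14 are meant to be answered through the Weka GUI. If you want a rough programmatic analog of the minNumObj and attribute-importance experiments, the sketch below uses scikit-learn's CART trees instead of J48 (C4.5); min_samples_leaf plays a role loosely similar to minNumObj, but the exact depths and accuracies will not match J48's, and the file name mushroom.csv is a hypothetical local copy of the UCI Mushroom data with the class label in the first column.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("mushroom.csv")                      # hypothetical local copy
# Ordinal encoding is a simplification: CART then treats categories as ordered.
X = OrdinalEncoder().fit_transform(df.iloc[:, 1:])
y = df.iloc[:, 0]                                     # class label is the first column

for leaf in (2, 6, 10, 20):
    tree = DecisionTreeClassifier(criterion="entropy",
                                  min_samples_leaf=leaf, random_state=0)
    acc = cross_val_score(tree, X, y, cv=10).mean()   # 10-fold CV, as in Weka
    tree.fit(X, y)
    root = df.columns[1:][tree.tree_.feature[0]]      # attribute used at the root
    print(f"min_samples_leaf={leaf}: depth={tree.get_depth()}, "
          f"cv accuracy={acc:.4f}, root split on '{root}'")
```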