Homework 1: Neural Networks


Scott Chow
ROB 537: Learning Based Control
October 2, 2017

1 Introduction

Neural networks have been used for a variety of classification tasks. In this report, we use a single hidden-layer feed-forward neural network for a quality control audit at a manufacturing plant. We consider neural network parameters such as the number of hidden units, the learning rate, and the training time, and examine their effect on network performance. Additionally, we look at the effects of biased datasets on both the training and testing of neural networks.

2 Our Dataset

In this assignment, we were provided with 2 training sets (denoted train1 and train2) and 3 test sets (test1, test2, and test3). Each data point has 5 inputs (x1, ..., x5) that map to 2 outputs (y1, y2). These data sets simulate data from a quality control audit at a manufacturing plant. The five inputs correspond to different features of the product, while the two outputs represent whether the product passed. While it may seem counterintuitive to have two outputs for a single binary label (pass or fail), having two outputs allows our classifier to express its confidence in a classification. In the data sets, however, all example outputs are either y1 = 1, y2 = 0 (denoted class 1 for convenience) for a passing product or y1 = 0, y2 = 1 (denoted class 2) for a failed product.

One aspect of the datasets that is interesting to examine is the class balance. Table 1 summarizes the number of examples of each class in each dataset. We see that train1 and test1 are evenly balanced, train2 and test2 are heavily biased towards passing products (class 1), and test3 is biased towards failed products (class 2). This imbalance explains many of our results in the following sections and influenced how we trained and tested the network.

Name     Number of Class 1    Number of Class 2
train1   100                  100
train2   180                  20
test1    100                  100
test2    180                  20
test3    20                   180

Table 1: A count of the number of examples of each class in each data set.

3 Neural Network Structure

Our task is to classify the provided examples in the test sets as class 1 (pass) or class 2 (fail). To accomplish this, we use a single hidden-layer feed-forward neural network. The network consists of 5 input nodes for the 5 features, a hidden layer with a variable number of hidden units (see Section 4.1), and 2 output nodes corresponding to the outputs y1 and y2. The network structure is displayed in Figure 1. The network is trained using the gradient descent algorithm, minimizing the mean squared error.

Figure 1: A simple diagram of our single hidden-layer feed-forward neural network, with inputs 1-5 feeding hidden units h1 through hn, which feed outputs 1 and 2.
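To make the structure in Figure 1 concrete, here is a minimal NumPy sketch of such a network. The report does not specify the activation function, weight initialization, or update details, so the sigmoid activations, the uniform initialization range, and all names below are illustrative assumptions rather than the actual implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class OneHiddenLayerNet:
    """A 5-input, n-hidden, 2-output feed-forward network trained by
    gradient descent on mean squared error. A sketch: the sigmoid
    activations and uniform weight initialization are assumptions."""

    def __init__(self, n_hidden, seed=42):
        rng = np.random.default_rng(seed)
        self.W1 = rng.uniform(-0.5, 0.5, (5, n_hidden))  # input -> hidden
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.uniform(-0.5, 0.5, (n_hidden, 2))  # hidden -> output
        self.b2 = np.zeros(2)

    def forward(self, x):
        self.h = sigmoid(x @ self.W1 + self.b1)  # hidden activations
        return sigmoid(self.h @ self.W2 + self.b2)

    def backprop(self, x, target, lr):
        """One stochastic gradient step on E = 0.5 * ||y - t||^2."""
        y = self.forward(x)
        delta_out = (y - target) * y * (1.0 - y)          # dE/d(net_out)
        delta_hid = (delta_out @ self.W2.T) * self.h * (1.0 - self.h)
        self.W2 -= lr * np.outer(self.h, delta_out)
        self.b2 -= lr * delta_out
        self.W1 -= lr * np.outer(x, delta_hid)
        self.b1 -= lr * delta_hid
```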

4 Neural Network Performance

In this section, we describe how the performance of our neural network changes as we vary its parameters.

4.1 Number of Hidden Units

First, let us examine how changing the number of neurons in the hidden layer affects network performance. We expect that networks with too few hidden units will have lower accuracy, since they are unable to model the true distribution. On the other hand, networks with too many hidden units will also experience a drop in test accuracy, due to overfitting on the training data.

In our experiment, we created neural networks with the learning rate set to 0.05 and varied the number of neurons in the hidden layer between 2 and 10. We use train1 as our training set, which is fed into the neural network in a random order for 100 epochs. To account for variability in neural network initialization, we conducted one trial per seed (seeds 2, 7, 8, 24, 42) for each network and recorded the percent of test examples classified correctly.

In Figure 2, we plot the average percent correct versus the number of hidden units in the network. There is a significant increase in percent correct going from two to six hidden units. Beyond six hidden units, adding more does not seem to yield significant improvement in average percent correct. It is also interesting to note that, tracking training error over epochs, networks with fewer hidden units reach a minimum in training error earlier and then begin to fluctuate as the learning rate becomes too large and causes the network to overshoot the minimum. This suggests that smaller networks train faster, albeit at the cost of accuracy, since they have too few neurons to model the actual function.
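The sweep reported in Figure 2 could be produced by a loop like the following. This is a sketch assuming the OneHiddenLayerNet class from the Section 3 sketch, with training inputs/targets X_tr/T_tr and test inputs/targets X_te/T_te already loaded as arrays; the helper names are ours, not from the assignment code.

```python
import numpy as np

SEEDS = [2, 7, 8, 24, 42]

def accuracy(net, X, T):
    """Fraction classified correctly, taking the larger output as the class."""
    preds = np.array([np.argmax(net.forward(x)) for x in X])
    return float(np.mean(preds == np.argmax(T, axis=1)))

def sweep_hidden_units(X_tr, T_tr, X_te, T_te, lr=0.05, epochs=100):
    results = {}
    for n_hidden in range(2, 11):          # 2 to 10 hidden units
        accs = []
        for seed in SEEDS:                 # one trial per seed
            net = OneHiddenLayerNet(n_hidden, seed)
            rng = np.random.default_rng(seed)
            for _ in range(epochs):
                for i in rng.permutation(len(X_tr)):   # random order
                    net.backprop(X_tr[i], T_tr[i], lr)
            accs.append(accuracy(net, X_te, T_te))
        results[n_hidden] = (np.mean(accs), np.std(accs))
    return results
```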

Figure 2: A plot of average percent correct versus number of neurons in the hidden layer.

4.2 Training Time

Next, we examine the number of epochs needed to train our network. Training for too few epochs will result in the network not performing to its full potential, as it will not have converged to a minimum. On the other hand, training for too many epochs may result in overfitting, especially if other factors encourage overfitting, such as having too many neurons.

In this experiment, we initialize networks with 6 hidden units and a learning rate of 0.05. We train each network for a total of 1000 epochs, with each epoch being a single pass through the train1 data set. We stop at various points along the way and evaluate the accuracy of the network on the test set to determine a good stopping point. Once again, to account for variations in initialization, we initialize the network with seven different seeds (1, 5, 17, 28, 42, 47, 314) and compute the average correct classification percentage over the number of epochs, shown in Figure 3.

First, we see that in all our trials, our network converges to around 85% accuracy by 400 epochs. The error bars on the graph indicate standard deviation. The error bars at 100, 200, and 300 epochs are large because in one or more of our trials the network had not yet converged and was still at around 50% correct. Because the plot shows the average and standard deviation of the percent correct, the trials in which the network has not yet converged drag the average down and greatly increase the standard deviation. The same effect explains the large error bars in the plots of the following sections. Once the network reaches convergence at around 400 epochs, the change in correct classification percentage levels off. In all the trials, the accuracy hovers around 85%, with some variation due to random initialization.

Figure 3: A plot of average percent correct versus number of epochs trained.
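This averaging effect is easy to reproduce. In the snippet below, the accuracy values are synthetic (for illustration only, not our measured data): a single slow-converging trial drags the early mean down and inflates the early standard deviation, exactly as described above.

```python
import numpy as np

# curves[s, k]: test accuracy for seed s at checkpoint k
# (e.g., every 100 epochs). Synthetic values for illustration.
curves = np.array([
    [0.84, 0.85, 0.85, 0.86],   # converged within 100 epochs
    [0.83, 0.85, 0.86, 0.85],   # converged within 100 epochs
    [0.50, 0.51, 0.84, 0.85],   # still near 50% for 200+ epochs
])

mean_curve = curves.mean(axis=0)  # the plotted line
std_curve = curves.std(axis=0)    # the error bars

print(mean_curve)  # early checkpoints pulled toward 50%
print(std_curve)   # large until every trial has converged
```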

4.3 Learning Rate

Finally, let us examine the learning rate, which controls how much the weights change per update step (the gradient descent update is w <- w - eta * dE/dw, where eta is the learning rate). A low learning rate results in the network taking a long time to converge, while a high learning rate can cause the network to overshoot the minimum and fail to converge.

The experimental setup is similar to our previous experiment for training time. We initialize networks with 6 hidden units and train each network for a total of 1000 epochs on the train1 data set, evaluating the accuracy of the network on the test set along the way. To account for variations due to randomness, we initialize the network with four different seeds (2, 3, 7, 42) and compute the average correct classification percentage over the number of epochs. This time, however, we repeat this process with different learning rates and observe how the curve of percent correct versus epochs trained changes. The results of this experiment are shown in Figure 4.

Figure 4: A plot of average percent correct versus number of epochs trained for each learning rate.

While the data may look jumbled, it is interesting to observe the trends as we increase the learning rate. To make these trends clearer, we have included a simplified version of this plot in Figure 5.

Figure 5: A simplified version of Figure 4: average percent correct versus number of epochs trained for each learning rate.

First, note that the error bars for learning rate 0.05 show that until 400 epochs, the network does not consistently converge across trials. Next, we see that the lowest learning rate (0.05) takes longer to converge than the higher learning rates. On the other hand, while the highest learning rate (0.9) reaches high accuracy in fewer epochs, it ultimately fails to converge, overshooting the minimum as indicated by the fluctuations.

4.4 Other Critical Parameters

In addition to the number of hidden units, training time, and learning rate, there are a couple of other parameters that affect learning. Specifically, we examine the effect of momentum and of randomizing the training order.

4.4.1 Momentum

We examine the effect of adding a momentum factor to our weight update. Recall that the momentum term is used to smooth weight updates and potentially speed up learning. The results for various momentum factors are summarized in Figure 6.

Figure 6: A plot of average percent correct versus number of epochs trained for each momentum factor.

From our plot, we actually observe the opposite effect: adding a momentum term and increasing the momentum factor causes the network to learn more slowly. In fact, when we increase the momentum factor to 0.5, we see signs that the network is not converging smoothly. We hypothesize that this is caused by the nature of the classification problem. From initialization, the network is already converging towards a solution within the first hundred epochs. The momentum term may be causing the network to overshoot the minimum, making it take longer to converge than with no momentum term. This effect is amplified when the momentum factor is large, which would explain the large variance in classification percentage when the momentum was set to 0.5.
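For concreteness, here is a sketch of how a momentum factor can enter the weight update. The report does not give its exact formulation, so classical momentum is assumed here.

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.05, beta=0.5):
    """Classical momentum update (an assumed formulation):
        velocity <- beta * velocity - lr * grad
        w        <- w + velocity
    With beta = 0 this reduces to plain gradient descent; a larger
    beta lets past updates carry the weights further, which can
    overshoot a nearby minimum, as hypothesized above."""
    velocity = beta * velocity - lr * grad
    return w + velocity, velocity

# Example: one update of a 3-weight vector.
w, v = np.zeros(3), np.zeros(3)
w, v = momentum_step(w, np.array([0.2, -0.1, 0.05]), v)
```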

4.4.2 Randomizing Training Order

Another significant factor in the performance of our neural network is the order in which training samples are passed into the network. In the previous experiments, all networks were trained with the training order randomized each epoch. Let us check whether this was the correct choice. In this experiment, we initialize a neural network with 6 hidden units and a learning rate of 0.1, and train for 2000 epochs either with a randomized order of training examples or with a fixed order. This process is repeated seven times with different seeds to account for variations in weight initialization. The results are plotted in Figure 7.

From Figure 7, we see that using a random ordering of training examples makes a significant difference both in the number of epochs needed to converge and in final accuracy. The reasoning behind randomizing training samples is to prevent the network weights from oscillating between values due to repeatedly encountering the same samples in the same order.

Figure 7: A plot of average percent correct versus number of epochs trained, with either random or fixed ordering of training examples. The train1 dataset was used for training.
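The two training regimes differ only in how each epoch orders the examples. A minimal sketch, reusing the backprop step from the Section 3 sketch (the function name and signature are ours):

```python
import numpy as np

def train_epoch(net, X, T, lr, rng=None):
    """One pass over the training set. Passing a Generator shuffles
    the presentation order each epoch; rng=None keeps a fixed order."""
    order = rng.permutation(len(X)) if rng is not None else range(len(X))
    for i in order:
        net.backprop(X[i], T[i], lr)

# Randomized order: a fresh permutation every epoch.
#   rng = np.random.default_rng(seed)
#   for _ in range(2000): train_epoch(net, X, T, lr=0.1, rng=rng)
# Fixed order: the same sequence every epoch.
#   for _ in range(2000): train_epoch(net, X, T, lr=0.1)
```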

4.5 Varying Test Sets

So far, all the experiments above have used the test1 dataset to evaluate performance. Recall that both train1 and test1 are evenly balanced. Now let us observe what happens when we run the neural network trained on balanced data on each of the three test sets.

In this experiment, we train our neural network on the train1 dataset. The network is initialized with 6 hidden units and a learning rate of 0.1, trained for 500 epochs, and then tested on each of the three test sets. This process is repeated 7 times with different seeds (1, 5, 17, 28, 42, 47, 314) to account for variations in initialization. The average accuracy for our network on each of the three test sets is summarized in Table 2.

        Bias                  Average Accuracy    Standard Deviation
test1   Balanced (No Bias)    82.79%              2.81
test2   More Class 1          86.21%              4.87
test3   More Class 2          78.79%              9.24

Table 2: The average percent correct and standard deviation for the neural network after being trained on train1.

Interestingly, there does not appear to be a statistically significant difference among the three test sets. One can note that the standard deviations for test2 and test3 are higher than for test1, which seems to indicate that there is more variation in the accuracy achieved on the imbalanced datasets.

5 Using an Imbalanced Dataset to Train

Until now, all our neural networks were trained on the balanced dataset train1. In this section, we explore the effect of training our neural network on an imbalanced dataset, specifically train2.

5.1 Number of Hidden Units

We once again explore the influence of the number of hidden units. The experimental setup is identical to the one described in Section 4.1, and the results are shown in Figure 8. We observe that with imbalanced data, there seem to be peaks at 8 and 12 hidden units. The large error bars for certain numbers of hidden units indicate that the network has trouble converging. There is a downwards trend past 12 hidden units, a sign that using more than 12 hidden neurons may result in overfitting and an over-complicated model. In general, 8 hidden units seems to be the ideal number, since we prefer using the fewest hidden neurons possible to avoid losing generalization ability, as discussed by Wilamowski [2].

Figure 8: A plot of average percent correct versus number of neurons in the hidden layer after training on train2.

5.2 Training Time

Now let us look at training time. We repeat the same experiment as in Section 4.2, except this time we train for a larger number of epochs, up to 6000. The results are summarized in Figure 9.

Figure 9: A plot of average percent correct versus number of epochs trained with train2.

We observe that it takes far longer for the neural network to begin to converge, around 3000 epochs, and there is an upwards trend in accuracy as we increase the training time. It is interesting to note that as we near 3000 epochs, there is a decrease in variance corresponding to when the network begins to converge. Also observe that even though the final accuracy seems to converge at around 95%, which is higher than for the network trained on the balanced dataset, the dataset itself is 90% class 1, so a network that predicts only class 1 would be correct 90% of the time. It is important to keep this in mind when comparing the two networks.

5.3 Learning Rate

Next, we examine the effect of the learning rate. We once again perform the same experiment as in Section 4.3, but we increased the number of epochs trained to 4000 in hopes of seeing convergence, as in the previous section. The results of our experiment are summarized in Figure 10.

Figure 10: A plot of average percent correct versus number of epochs trained for each learning rate.

A learning rate of 0.3 seems to lead to the highest average correct classification percentage; however, the performance gain is slight and may not be statistically significant. Once again, the large error bars indicate a wide variance in the mean correct classification percentage across trials. This is a sign that our network trained on the imbalanced dataset is not learning as well as its counterpart trained on the balanced dataset. Again, we hypothesize that these poor results are caused by the fact that we are testing our network on test1, which is balanced between the two classes.

5.4 Other Critical Parameters

In this section, we examine the other critical parameters in training, this time using train2 as our training set.

5.4.1 Momentum

One interesting parameter to consider is momentum. We replicate the experiment described in Section 4.4 with our network trained on train2, once again extending the number of epochs trained. The results of this experiment are summarized in Figure 11.

From our plot, we once again see large error bars caused by differing convergence rates among the trials. It is interesting to note that in this case, using a higher momentum factor does seem to increase the correct classification percentage, although the significance of this result is cast into doubt by the large variance. These high variances seem to be caused by the imbalanced dataset used in training.

Figure 11: A plot of average percent correct versus number of epochs trained for each momentum factor.

5.4.2 Randomizing Training Set Samples

As described previously, randomizing the order of training set samples plays a large role in getting the network to converge quickly. In this experiment, we initialized a network with 8 hidden neurons and a learning rate of 0.3, and set the maximum number of epochs to 3000. We then trained our network either with or without randomizing the order of the training examples in train2. We repeated this process with 7 different seeds to account for variations in the initialization of the weights. The results are summarized in Figure 12.

Figure 12: A plot of average percent correct versus number of epochs trained, with either random or fixed ordering of training examples.

We see that, once again, randomizing the order of inputs makes a significant difference in the number of epochs needed for convergence and in network accuracy. Shuffling the training data exposes the network to the training data in different orders and prevents it from becoming locked into a pattern and stuck oscillating back and forth.

5.5 Varying Test Sets

Finally, in this section, we examine the performance of our network on the different test sets. We expect that since the network was trained on an imbalanced dataset, it will perform well when tested on a similarly imbalanced dataset. We perform the same experiment described in Section 4.5, and the results are summarized in Table 3.

        Bias                  Average Accuracy    Standard Deviation
test1   Balanced (No Bias)    60%                 9.0
test2   More Class 1          91%                 1.0
test3   More Class 2          29%                 17.4

Table 3: The average percent correct and standard deviations for the neural network after being trained on train2.

We observe that our hypothesis was correct: our network performs very well on the test2 data set, which features the same 180-20 imbalance towards class 1 seen in train2. The fewer class 1 examples in the test set, the worse our network performs on it. This makes sense given that the majority of the training set consists of class 1 examples.

5.6 Dealing with Imbalanced Datasets

The performance difference between training on train1 and train2 is caused by the fact that train2 is an imbalanced dataset. With more class 1 than class 2 data, the neural network has a harder time training, in addition to taking a performance hit. Imbalanced datasets are encountered fairly frequently in real life, for example in anomaly detection. There are a couple of strategies for balancing a dataset: removing samples from the majority class, or duplicating samples from the minority class, until both classes are equally represented. These two methods are referred to as random undersampling and random oversampling, respectively [1]. Either method equalizes the number of examples from each class, but each has drawbacks: random undersampling discards some training data entirely, while random oversampling duplicates entries.
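A minimal sketch of the two resampling strategies, assuming integer class labels in y; the helper names are ours, not from [1].

```python
import numpy as np

def random_undersample(X, y, rng):
    """Drop majority-class samples until every class matches the smallest."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.where(y == c)[0], size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]

def random_oversample(X, y, rng):
    """Duplicate minority-class samples until every class matches the largest."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    parts = []
    for c in classes:
        idx = np.where(y == c)[0]
        parts.append(idx)                  # keep all originals
        if len(idx) < n_max:               # pad with random duplicates
            parts.append(rng.choice(idx, size=n_max - len(idx), replace=True))
    keep = np.concatenate(parts)
    return X[keep], y[keep]

# e.g., rebalancing train2's 180-20 split (array names hypothetical):
# X_bal, y_bal = random_oversample(X_train2, y_train2, np.random.default_rng(0))
```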

6 Conclusions

Single hidden-layer neural networks do an adequate job on this simple product classification task, yielding around 85% accuracy when trained and tested on a balanced dataset. Neural network performance is dictated by the number of hidden units, the training time, and the learning rate, as well as by momentum and the random ordering of training samples. Additionally, it is clear that imbalanced datasets can complicate training, which demonstrates the importance of considering the contents of the training and testing sets before training.

References

[1] He, H., and Garcia, E. A. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21, 9 (2009), 1263-1284.

[2] Wilamowski, B. M. Neural network architectures and learning algorithms. IEEE Industrial Electronics Magazine 3, 4 (2009).