Homework III Using Logistic Regression for Spam Filtering

Introduction to Machine Learning - CMPS 242
By Bruno Astuto Arouche Nunes
February 14th, 2008

1. Introduction

In this work we study batch learning with logistic regression: how the batch algorithm behaves on different data sets (permutations of the same data set), how much training is enough, and how to optimize the parameters of the learning model. The objective of this work is to become familiar with logistic regression, with plotting performance curves, and with early stopping and parameter/model optimization. In the following sections we describe the main results of this work, in which different scenarios of interest were defined and performance curves were plotted. The scenarios, the plots and the respective comments and conclusions are presented below, but first we briefly present the algorithm itself and the data set used as input for these experiments.

2. The problem

We implemented logistic regression to predict whether or not emails should be considered spam. The data set is a 2000 x 2001 matrix, where the first value in each row is the class label, in {0, 1}, with 0 meaning "not spam" and 1 meaning "spam". The remaining 2000 values in each row are also {0, 1} values that indicate the presence (1) or absence (0) of particular words in the message. Figure 1 shows the presence (dark points) or absence (blank spaces) of the features in each row of the matrix. Figure 2 shows the same data set, but with its rows permuted.

Figure 1: Data set plotted in MATLAB using the commands imagesc(1-data); colormap gray.

Figure 2: Same data set plotted in MATLAB, but with the examples/emails (rows of the matrix) permuted, using the commands p = randperm(2000); permdata = data(p(:),:); imagesc(1-permdata); colormap gray.
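For reference, the sketch below shows one way the label column and feature columns described in Section 2 might be separated in MATLAB before training; the file name spamdata.txt is an assumption for illustration only and is not part of the original assignment.

    % Minimal sketch, assuming the 2000x2001 matrix is stored in spamdata.txt
    % (hypothetical file name): column 1 is the class label, columns 2:2001
    % are the binary word-presence features.
    data = load('spamdata.txt');
    y = data(:, 1);          % labels: 0 = "not spam", 1 = "spam"
    X = data(:, 2:end);      % 2000 binary word-presence features per email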

3. Experiment 1 - Studying early stopping:

In our first experiment, we implemented logistic regression for training our model, using the first ¾ of the data set as the training set and the remaining ¼ as the test set. Our prediction function is the sigmoid y_hat_t = 1/(1 + exp(-a_hat)), where a_hat is the linear activation a_hat = (Wi'*Xi')'. The loss function is the logistic loss Logistic_Loss = log(1 + exp(a_hat)) - y.*a_hat, where y holds the class labels of the batch examples. The loss gradient Loss_Grad = (y_hat_t - y)'*Xi was used to update the weights by gradient descent, Wi = Wi - (1/Numb_of_examples_Traning)*eta*((y_hat_t - y)'*Xi)', where eta is the learning rate of the algorithm and Numb_of_examples_Traning = 1500 in our case. The main question to be answered here was when to stop training. In this experiment our stopping criterion was the magnitude of the logistic loss gradient: we evaluated the performance of the algorithm for 5 different thresholds on the gradient, stopping the training phase when it reached [10^0, 10^-1, 10^-2, 10^-3, 10^-4].

Figure 3: Early stopping using the absolute value of the loss gradient as the stopping criterion to avoid overtraining.
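A minimal MATLAB sketch of this batch training loop with early stopping on the gradient magnitude is given below. Variable names follow the report where possible; the choice of the 2-norm for the "magnitude" of the gradient and the particular threshold shown are assumptions.

    % Minimal sketch of Experiment 1: batch gradient descent with early
    % stopping on the magnitude (here the 2-norm, an assumption) of the
    % loss gradient. The first 3/4 of the rows are used for training.
    Xtrain = data(1:1500, 2:end);   % word-presence features of the training set
    ytrain = data(1:1500, 1);       % training labels
    Wi  = zeros(size(Xtrain, 2), 1);            % one weight per word feature
    eta = 0.3;                                  % learning rate used in Experiment 1
    Numb_of_examples_Traning = size(Xtrain, 1); % 1500 training examples
    threshold = 1e-2;                           % one of the tested stopping thresholds

    while true
        a_hat     = Xtrain * Wi;                   % linear activation
        y_hat_t   = 1 ./ (1 + exp(-a_hat));        % sigmoid prediction
        Loss_Grad = Xtrain' * (y_hat_t - ytrain);  % gradient of the logistic loss
        if norm(Loss_Grad) < threshold             % early stopping criterion
            break;
        end
        Wi = Wi - (eta / Numb_of_examples_Traning) * Loss_Grad;   % gradient step
    end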

Figure 3 shows how much the training loss drops the longer we train, reaching values very close to zero when the algorithm runs until the gradient reaches 10^-4. After reaching the stopping criterion, we stopped training and used the set of weights acquired in the training phase to predict on the test set. We obtained the best result at 10^-2 and observed poorer performance on the test set for the models trained until 10^-3 and 10^-4, which indicates overtraining at these last two points. In these results, no cross validation or regularization was used, and eta was set to 0.3.

4. Experiment 2 - Changing the stopping criterion:

In this experiment we used two different stopping criteria: (a) the 1-norm of the loss gradient and (b) the 2-norm of the loss gradient. For each pass over the batch we updated the weights, incurred the loss and recalculated the loss gradient. When criterion (a) or (b) reached a threshold, we stopped the training phase, applied the resulting model to the data set and averaged the loss. We varied the threshold over [10^0, 10^-1, 10^-2, 10^-3, 10^-4]. Figure 4 shows the results.

Figure 4: Early stopping using the 1-norm and 2-norm of the loss gradient as the stopping criterion to avoid overtraining.
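Relative to the sketch in Experiment 1, only the stopping test changes. A hedged illustration of the two criteria, used in place of the norm(Loss_Grad) check above, could look like this:

    % Stopping test for Experiment 2: 1-norm (criterion a) or 2-norm
    % (criterion b) of the loss gradient, against the same thresholds.
    if norm(Loss_Grad, 1) < threshold      % criterion (a): 1-norm
        break;
    end
    % ... or, alternatively ...
    if norm(Loss_Grad, 2) < threshold      % criterion (b): 2-norm
        break;
    end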

Using the 1-norm and 2-norm of the loss gradient as the stopping criterion to avoid overtraining, it is possible to see that for the 2-norm criterion the loss is smaller the longer we train. For these experiments, no cross validation was used.

5. Experiment 3 - Implementing cross-validation and optimizing the model:

In the third experiment, we implemented 5-fold cross validation with a 3-way split, using one piece of the data set for training, one for validation and a third for testing. We divided the 1500-example training set into 5 folds of 300 examples each, keeping the remaining 500 examples as the test set. Of the 5 folds, 4 were used for training and 1 for validation, and we ran 5 experiments so that every fold was used 4 times for training and once for validation. This was done for every value of the learning rate eta tested. For every value of eta we report the average logistic loss over the 5 losses measured during the 5 validation phases.

Figure 5: Very small differences between the tested etas led to very small differences in the average final loss.
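A minimal MATLAB sketch of this cross-validation loop is shown below, reusing Xtrain and ytrain from the sketch in Experiment 1. The helpers train_logreg and logistic_loss are hypothetical wrappers around the training loop and loss defined earlier; their names and signatures are illustrative, not from the report.

    % Minimal sketch of 5-fold cross validation over the learning rate.
    folds = reshape(1:1500, 300, 5);     % 5 folds of 300 training examples
    etas  = [0.1 0.2 0.3 0.4 0.5];       % example grid; the report's values differ
    avg_loss = zeros(size(etas));

    for e = 1:numel(etas)
        fold_loss = zeros(1, 5);
        for k = 1:5
            val_idx   = folds(:, k);                  % validation fold
            train_idx = setdiff(1:1500, val_idx);     % remaining 4 folds
            W = train_logreg(Xtrain(train_idx,:), ytrain(train_idx), etas(e), 1e-2);
            fold_loss(k) = logistic_loss(Xtrain(val_idx,:), ytrain(val_idx), W);
        end
        avg_loss(e) = mean(fold_loss);   % average over the 5 validation folds
    end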

Figure 5 shows the values of eta evaluated and their respective average losses. A difference is visible, but for the suggested eta update (η ∝ α^(Pass-1)) the values of eta were very close to each other, so the weights were updated very slowly, leading to very similar final results and very similar average losses. The figure shows that the difference in the losses is only visible at the order of 10^-4. For all experiments in this section we ran the training phase until the loss gradient reached a precision of 10^-2. We then tried another experiment with 20 different values of eta, over a much larger range of the learning rate, from 15 down to 0.0163. For each value we trained on 4 folds and validated on the remaining fold, for 5 permutations of the folds, as we did before for the first set of small etas. Figure 6 reports the average loss over the 5 validations. Larger differences between the tested learning rates allowed a better tuning of this parameter, since its impact on the weight updates is now larger.

Figure 6: Average logistic loss over 5 folds as a function of the learning rate. Larger differences between the tested learning rates allowed a better tuning of this parameter, since its impact is now larger.
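The wider sweep described above could, for example, be generated with a logarithmically spaced grid and reused in the cross-validation loop from Experiment 3; the use of logspace is an assumption, since the report only states the range and the number of values.

    % 20 learning rates spanning the reported range, from 15 down to 0.0163.
    etas = logspace(log10(15), log10(0.0163), 20);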

It was also interesting to notice during the experiments that larger learning rates led the gradient to converge faster, since the weights were updated much faster. But, as we can see in Figure 6, there is a tradeoff between the learning rate and accuracy, which means that we cannot choose an arbitrarily large eta. After plotting the curve in Figure 6, we chose the best and the worst learning rates (and their respective sets of weights) measured in our cross-validation step. For each of these models (the best and the worst) we ran 10 times over the test set. Table I reports the average and standard deviation (STD) of the loss over the 10 runs for both models.

Table I: Average and standard deviation of the logistic loss for the best and the worst model.

Model                 Best      Worst
Learning rate (eta)   5.1084    0.4138
Average Loss          0.0331    0.0668
STD of Loss           0.0066    0.0399

6. Future Work:

Given the limited time and the large computational effort required by this work, the following experiments were studied and implemented, but no results were generated before the deadline:
- Analysis of the impact of shrinkage;
- Implementation of regularization.

7. Conclusions:

In this work we studied logistic regression for classification problems and applied it to a case study in which it was used to estimate whether or not an email is spam. We analyzed overtraining and how to avoid it using early stopping; different early stopping criteria, namely the 1-norm and the 2-norm of the loss gradient, were also used. We also learned how to tune our model parameters in order to optimize the algorithm by using cross validation; more specifically, the parameter tuned was the learning rate. Final results using the optimized model were also presented.