CS 478 - Machine Learning

Projects
Data Representation
Basic testing and evaluation schemes

Programming Issues

- Program in any platform you want.
- Realize that you will be doing actual training, which can be more time-consuming if implemented in an inefficient way or with an inefficient platform.
- We will supply a basic ML toolkit in either C++, Java, or Python.
  - The learning code MUST be your own!
- You are welcome to create your own toolkit if you would like, but you need to have at least the level of functionality as in the supplied versions.
- Toolkit details are found in the content section of Learning Suite.

Machine Learning Toolkit

- The CS 478 toolkit is intended as a starting place for working with machine learning algorithms. It provides the following functionality to run your algorithms:
  - Parses and stores the ARFF file (ARFF is the data set format we will use)
  - Randomizes the instances in the ARFF file
  - Allows normalization of attributes
  - Parses command line arguments
  - Provides four evaluation methods:
    - Training set method: the model is evaluated on the same data set that was used for training
    - Static split test set method: two distinct data sets are made available to the learning algorithm, one for training and one for testing
    - Random split test set method: a single data set is made available to the learning algorithm, and it is split such that x% of the instances are randomly selected for training and the remainder are used for testing, where you supply the value of x
    - N-fold cross-validation
  - Allows selection of which ML model to train and test with
  - Reports training and accuracy information (training and test set accuracies, learning time, etc.)
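
Outside the supplied toolkit, the same first steps (parse the ARFF file, then randomize the instances) can be sketched in a few lines of Python. This is a minimal illustration, not the toolkit's actual code; scipy.io.arff is a real ARFF reader, while the filename pizza.arff is a hypothetical stand-in for one of the course data sets:

```python
import numpy as np
from scipy.io import arff

# Parse and store the ARFF file ("pizza.arff" is a hypothetical filename).
data, meta = arff.loadarff("pizza.arff")

# Randomize the instances before any train/test splitting.
rng = np.random.default_rng(seed=0)
data = data[rng.permutation(len(data))]
```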

Gathering a Data Set

- Data Types
  - Nominal (aka Categorical, Discrete)
  - Continuous (aka Real, Numeric)
  - Linear (aka Ordinal): usually just treated as continuous, so that ordering info is maintained
- Consider a task: classifying the quality of pizza
  - What features might we use?
  - How to represent those features? Will usually depend on the learning model we are using
- Classification assumes the output class is nominal. If the output is continuous, then we are doing regression.

Fitting Data to the Model

- Continuous -> Nominal
  - Discretize into bins (more on this later)
- Nominal -> Continuous (a perceptron expects continuous inputs)
  a) One input node for each nominal value, where one of the nodes is set to 1 and the other nodes are set to 0 (sketched below)
     - Can also explode the variable into n-1 input nodes, where the most common value is not explicitly represented (i.e., the all-0 case)
  b) Use 1 node, but with a different continuous value representing each nominal value
  c) Distributed: log2(n) binary nodes can uniquely represent n nominal values (e.g., 3 binary nodes could represent 8 values)
  d) If there is a very large number of nominal values, could cluster (discretize) them into a more manageable number of values and then use one of the techniques above
- Linear data is already in continuous form
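
As an illustration of option (a), a minimal one-hot encoding sketch in Python (assuming NumPy), using the Crust attribute from the pizza example:

```python
import numpy as np

# One input node per nominal value; exactly one node is set to 1.
values = ["Thick", "Thin", "Stuffed"]            # domain of the Crust attribute
index = {v: i for i, v in enumerate(values)}

def one_hot(v):
    x = np.zeros(len(values))
    x[index[v]] = 1.0
    return x

print(one_hot("Thin"))   # [0. 1. 0.]
```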

Data Normalization

- What would happen if you used two input features in an astronomical task as follows:
  - Weight of the planet in grams
  - Diameter of the planet in light-years
- Normalize the data between 0 and 1 (or similar bounds)
  - For a specific instance, the normalized feature is: f_normalized = (f_original - MinValue_TS) / (MaxValue_TS - MinValue_TS)
  - Use these same Max and Min values (from the training set) to normalize data in novel instances (sketched below)
- Note that a novel instance may have a normalized value outside 0 and 1
  - Why? Is it a big issue?
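
A minimal sketch of this formula in Python, with made-up training values for one feature. Note how the training-set Min/Max are reused on a novel instance, which is exactly how a value can land outside [0, 1]:

```python
import numpy as np

train = np.array([0.9, 0.1, 0.6])           # made-up training-set feature values
f_min, f_max = train.min(), train.max()     # Min/Max come from the TRAINING set

def normalize(f):
    return (f - f_min) / (f_max - f_min)

print(normalize(train))   # [1.    0.    0.625] -- all within [0, 1]
print(normalize(1.2))     # 1.375 -- a novel instance can fall outside [0, 1]
```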

ARFF Files

- An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a machine learning data set (or relation).
  - Developed at the University of Waikato (NZ) for use with the Weka machine learning software (http://www.cs.waikato.ac.nz/~ml/weka)
  - We will use the ARFF format for CS 478
- ARFF files have two distinct sections:
  - Metadata information
    - Name of relation (data set)
    - List of attributes and domains
  - Data information
    - Actual instances or rows of the relation
  - Optional comments may also be included, which give information about the data set (lines prefixed with %)

Sample ARFF File

% 1. Title: Pizza Database
% 2. Sources:
%    (a) Creator: BYU CS 478 Class
%    (b) Statistics about the features, etc.
@RELATION Pizza
@ATTRIBUTE Weight     CONTINUOUS
@ATTRIBUTE Crust      {Thick, Thin, Stuffed}
@ATTRIBUTE Cheesiness CONTINUOUS
@ATTRIBUTE Meat       {True, False}
@ATTRIBUTE Quality    {Great, Good, Fair}
@DATA
.9, Stuffed, 99, True,  Great
.1, Thin,     2, False, Fair
?,  Thin,    60, True,  Good
.6, Thick,   60, True,  Great

- Any column could be the output, but we will assume that the last column(s) is the output
- What would you do to this data before using it with a perceptron, and what would the perceptron look like?

ARFF Files

- More details and syntax information for ARFF files can be found at our website
- Data sets that we have already put into the ARFF format can also be found at our website and are linked to from the LS content page: http://axon.cs.byu.edu/data/
  - You will use a number of these in your simulations throughout the semester
  - Always read about the task, features, etc., rather than just plugging in the numbers
- You will create your own ARFF files in some projects, and particularly for the group project

Performance Measures

- There are a number of ways to measure the performance of a learning algorithm:
  - Predictive accuracy of the induced model (or error)
  - Size of the induced model
  - Time to compute the induced model
  - etc.
- We will focus here on accuracy
- Fundamental assumption: future novel instances are drawn from the same/similar distribution as the training instances

Toolkit Training/Testing Alternatives

- Four methods that we will use with our toolkit:
  - Training set method: the model is evaluated on the same data set that was used for training
  - Static split test set method: two distinct data sets are made available to the learning algorithm, one for training and one for testing
  - Random split test set method: a single data set is made available to the learning algorithm, and it is split such that x% of the instances are randomly selected for training and the remainder are used for testing, where you supply the value of x
  - N-fold cross-validation

Training Set Method

- Procedure:
  - Build the model from the data set
  - Compute accuracy on the same data set
- Simple, but the least reliable estimate of future performance on unseen data (a rote learner could score 100%!)
- Not used as a performance metric, but it is often useful information for understanding how a machine learning model learns
- You will typically report this information in your write-ups and then compare it with how the learner does on a test set, etc.
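
A toy illustration of the rote-learner point: a learner that simply memorizes its (made-up) training instances scores 100% by the training set method while learning nothing general:

```python
# Memorize every (features, label) pair seen during training.
train = [((0.9, "Stuffed"), "Great"), ((0.1, "Thin"), "Fair")]
memory = {x: y for x, y in train}

# Training set method: evaluate on the very same instances.
correct = sum(memory.get(x) == y for x, y in train)
print(correct / len(train))   # 1.0 -- says nothing about novel instances
```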

Static Training/Test Set

- Static split approach: the data owner makes available to the machine learner two distinct data sets:
  - One is used for learning/training (i.e., inducing a model), and
  - One is used exclusively for testing
- Note that this gives you a way to do repeatable tests
- Can be used for challenges (e.g., to see how everyone does on one particular unseen set)
- Be careful not to overfit the test set (the "Gold Standard")

Random Training/Test Set Approach

- Random split approach: the data owner makes available to the machine learner a single data set
- The machine learner splits the data set into a training and a test set, such that:
  - Instances are randomly assigned to either set
  - The distribution of instances (with respect to the target class) is hopefully similar in both sets due to randomizing the data before the split (stratification is even better, but not required here)
- Typically 60% to 90% of the instances are used for training and the remainder for testing; the more data there is, the more that can be used for training while still getting statistically significant test predictions
- A useful quick estimate for computationally intensive learners
- Not statistically optimal (high variance, unless there is lots of data)
  - Could get a lucky or unlucky test set
  - Best to do multiple training runs with different splits: train and test m different splits, then average the accuracy over the m runs to get a statistically more accurate prediction of generalization accuracy (a sketch follows below)
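
A minimal sketch of the multiple-split procedure, assuming NumPy; train_and_test is a hypothetical stand-in for training your model on the given training indices and returning its accuracy on the test indices:

```python
import numpy as np

# Average test accuracy over m random train/test splits.
def random_split_accuracy(n_instances, train_and_test, m=5, train_frac=0.7):
    rng = np.random.default_rng(seed=0)
    accuracies = []
    for _ in range(m):
        order = rng.permutation(n_instances)   # fresh random split each run
        cut = int(train_frac * n_instances)
        accuracies.append(train_and_test(order[:cut], order[cut:]))
    return np.mean(accuracies)
```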

N-fold Cross-validation

- Use all the data for both training and testing
  - Statistically more reliable
  - All data can be used, which is good for small data sets
- Procedure (a sketch follows below):
  - Partition the randomized data set D into N equally-sized subsets S_1, ..., S_N
  - For k = 1 to N:
    - Let M_k be the model induced from D - S_k
    - Let a_k be the accuracy of M_k on the instances of the test fold S_k
  - Return (a_1 + a_2 + ... + a_N) / N
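
A minimal sketch of this procedure, assuming NumPy; as above, train_and_test is a hypothetical stand-in that induces M_k from D - S_k and returns the accuracy a_k on the held-out fold S_k:

```python
import numpy as np

def cross_validate(n_instances, train_and_test, n_folds=10):
    rng = np.random.default_rng(seed=0)
    order = rng.permutation(n_instances)             # randomize the data set D
    folds = np.array_split(order, n_folds)           # S_1, ..., S_N
    accuracies = []
    for k in range(n_folds):
        test_idx = folds[k]                          # held-out fold S_k
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        accuracies.append(train_and_test(train_idx, test_idx))
    return np.mean(accuracies)                       # (a_1 + ... + a_N) / N
```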

N-fold Cross-validation (cont.)

- The larger N is, the smaller the variance in the final result
- The limit case where N = |D| is known as leave-one-out and provides the most reliable estimate; however, it is typically only practical for small instance sets
- Generally, a value of N = 10 is considered a reasonable compromise between time complexity and reliability
- Still must choose an actual model to use during execution; how?
  - Could select the one model that was best on its fold?
  - Train the final model on all the data! This holds with any of the above approaches
- Note that CV is just a better way to estimate how well we will do on novel data, rather than a way to do model selection

Perceptron/Regression Project

- See the Content section of LS (Learning Suite)
- Also briefly review the group project proposal part

Your Project Proposals

- Come up with one carefully proposed idea for a possible group machine learning project that could be done this semester. The proposal should be no more than one page long. It should include a thoughtful first draft of: a) a description of the project, b) what features the data set would include, and c) how and from where the data set would be gathered and labeled.
  - Give at least one fully specified example of a data set instance based on your proposed features, including a reasonable representation (continuous, nominal, etc.) and a value for each feature. The actual values may be fictional at this time. This effort will cause you to consider how plausible the future data gathering and representation might actually be.
- Examples: browse the Irvine Data Set repository to get a feel
  - Stick with supervised classification data sets for the most part
- Choose tasks which interest you
- Too hard vs. too easy:
  - Data can be gathered in a relatively short time
  - We want you to have to battle with the data/features a bit
- Read the instructions ASAP on the web site and start thinking