Machine Learning: Algorithms and Applications


Floriano Zini
Free University of Bozen-Bolzano, Faculty of Computer Science
Academic Year 2011-2012
Lab 3: 19th March 2012

WEKA: An ML and DM software toolkit

- WEKA is a Machine Learning and Data Mining software tool written in Java
- Main features
  - A set of data pre-processing tools, learning algorithms and evaluation methods
  - Graphical user interfaces (including data visualization)
  - Environment for comparing learning algorithms
- Available for download at http://www.cs.waikato.ac.nz/ml/weka/

WEKA: Main environments

- Simple CLI. A simple command-line interface
- Explorer (we will use this environment!). An environment for exploring data with WEKA
- Experimenter. An environment for performing experiments and conducting statistical tests between learning schemes
- KnowledgeFlow. An environment that allows you to graphically (drag-and-drop) design the flows of an experiment

WEKA: The Explorer environment

- Preprocess. To choose and modify the data being acted on
- Classify. To train and test learning schemes that classify or perform regression
- Cluster. To learn clusters for the data
- Associate. To discover association rules from the data
- Select attributes. To determine and select the most relevant attributes in the data
- Visualize. To view an interactive 2D plot of the data

WEKA: The dataset format

WEKA works with flat (text) files in ARFF (Attribute-Relation File Format).

Example of a dataset

    @relation weather
    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature real
    @attribute humidity real
    @attribute windy {TRUE, FALSE}
    @attribute play {yes, no}
    @data
    sunny,85,85,FALSE,no
    overcast,83,86,FALSE,yes

- @relation gives the name of the dataset (here, weather)
- An attribute declared with a set of values in braces (e.g., outlook) is a nominal attribute
- An attribute declared as real (e.g., temperature) is a numeric attribute
- play is the classification, i.e., the class attribute (by default, the last defined attribute)
- The lines after @data are the examples (instances)
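To make the ARFF layout concrete, here is a minimal Python sketch that reads the header and data sections of the weather example. It is not WEKA's own loader: the function name parse_arff is hypothetical, only nominal and real/numeric attributes are handled, and quoting, sparse data, string and date attributes are all ignored. The sample writes windy as FALSE so that it matches the declared nominal values.

```python
WEATHER = """@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
overcast,83,86,FALSE,yes
"""


def parse_arff(text):
    """Parse a minimal ARFF string into (relation, attributes, rows).

    Hypothetical helper for illustration only; not part of WEKA.
    """
    relation = None
    attributes = []  # (name, kind); kind is "numeric" or a list of nominal values
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                row[name] = float(value) if kind == "numeric" else value
            rows.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            else:
                kind = "numeric"  # simplification: anything not nominal is read as a number
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(WEATHER)
class_attribute = attributes[-1][0]  # by default, the last defined attribute
print(relation, class_attribute, rows[0])
```

Taking the last attribute as the class mirrors the default described above; in a real session the class attribute can be chosen explicitly.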

WEKA Explorer: Data pre-processing

- Data can be imported from a file in formats: ARFF, CSV, binary
- Data can also be read from a URL or from an SQL database using JDBC
- Pre-processing tools in WEKA are called filters
  - Discretization
  - Normalization
  - Re-sampling
  - Attribute selection
  - Transforming and combining attributes

WEKA Explorer: Classifiers (1)

- Classifiers in WEKA are models for predicting nominal or numeric quantities
- Classification techniques implemented in WEKA
  - Naïve Bayes classifier and Bayesian networks
  - Decision trees
  - Instance-based classifiers
  - Support vector machines
  - Neural networks
  - Linear regression
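Two of the pre-processing filters listed above, normalization and discretization, can be illustrated with a short Python sketch. It assumes min-max normalization to [0, 1] and equal-width binning; these are common choices, not necessarily the exact behaviour or defaults of WEKA's filters. The function names and the temperature values are illustrative.

```python
def normalize(values):
    """Min-max normalization: rescale values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, bins):
    """Equal-width discretization: map each value to a bin index 0..bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), bins - 1) for v in values]


temperature = [85, 80, 83, 70, 68, 65]  # illustrative numeric attribute values
scaled = normalize(temperature)
binned = discretize(temperature, bins=2)
print(scaled)
print(binned)
```

After discretization the numeric attribute can be treated as nominal, which matters for learning schemes that only accept nominal inputs.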

WEKA Explorer: Classifiers (2)

- Select a classifier
- Select test options
  - Use training set. The learned classifier will be evaluated on the training set
  - Supplied test set. To use a different dataset for the evaluation
  - Cross-validation. The dataset is divided into a number of folds, and the learned classifier is evaluated by cross-validation
  - Percentage split. To indicate the percentage of the dataset held out for the evaluation

WEKA Explorer: Classifiers (3)

- More options
  - Output model. To output (display) the learned classifier
  - Output per-class stats. To output the precision/recall and true/false statistics for each class
  - Output entropy evaluation measures. To output the entropy evaluation measures
  - Output confusion matrix. To output the confusion (classification-error) matrix of the classifier's predictions
  - Store predictions for visualization. The classifier's predictions are saved in memory so that they can be visualized later
  - Output predictions. To output the predictions on the test set
  - Random seed for XVal / % Split. To specify the random seed used when randomizing the data before it is divided up for evaluation purposes
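The interplay between the cross-validation test option and the random seed can be sketched in a few lines of Python. This is a plain, unstratified k-fold split written for illustration; it is not WEKA's implementation, and the function names are hypothetical.

```python
import random


def cross_validation_folds(n_instances, k, seed):
    """Shuffle instance indices with a fixed seed and deal them into k folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)  # same seed, same shuffle, same folds
    return [indices[i::k] for i in range(k)]


def train_test_splits(folds):
    """Yield (train, test) index lists, using each fold once as the test set."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


folds = cross_validation_folds(n_instances=14, k=3, seed=1)
for train, test in train_test_splits(folds):
    print(len(train), "training instances,", len(test), "test instances")
```

Every instance is used for testing exactly once, and rerunning with the same seed reproduces the same folds, which is what makes results comparable across runs.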

WEKA Explorer: Classifiers (4)

- Classifier output shows important information
  - Run information. The learning scheme options, name of the dataset, instances, attributes, and test mode
  - Classifier model (full training set). A textual representation of the classifier learned on the full training data
  - Predictions on test data. The learned classifier's predictions on the test set
  - Summary. The statistics on how accurately the classifier predicts the true class of the instances under the chosen test mode
  - Detailed Accuracy By Class. A more detailed per-class breakdown of the classifier's prediction accuracy
  - Confusion Matrix. Elements show the number of test examples whose actual class is the row and whose predicted class is the column

WEKA Explorer: Classifiers (5)

- Result list provides some useful functions
  - Save model. Saves a model (i.e., a trained classifier) object to a binary file. Objects are saved in Java serialized object form
  - Load model. Loads a pre-trained model (i.e., a previously learned classifier) object from a binary file
  - Re-evaluate model on current test set. To evaluate a previously learned classifier on the current test set
  - Visualize classifier errors. To show a visualization window that plots the results of classification
    - Correctly classified instances are represented by crosses, whereas incorrectly classified ones show up as squares
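The confusion matrix and the per-class precision/recall figures in the classifier output can be computed by hand, as in the Python sketch below. It follows the convention stated above (rows are actual classes, columns are predicted classes). The function names and the six predictions are made up for illustration and do not come from a real WEKA run.

```python
def confusion_matrix(actual, predicted, labels):
    """Count instances: row = actual class, column = predicted class."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def per_class_stats(matrix, labels):
    """Return {label: (precision, recall)} computed from the confusion matrix."""
    stats = {}
    for i, label in enumerate(labels):
        true_positives = matrix[i][i]
        predicted_as_label = sum(row[i] for row in matrix)  # column total
        actually_label = sum(matrix[i])                      # row total
        precision = true_positives / predicted_as_label if predicted_as_label else 0.0
        recall = true_positives / actually_label if actually_label else 0.0
        stats[label] = (precision, recall)
    return stats


labels = ["yes", "no"]
actual = ["yes", "yes", "yes", "no", "no", "yes"]     # illustrative true classes
predicted = ["yes", "no", "yes", "no", "yes", "yes"]  # illustrative predictions

matrix = confusion_matrix(actual, predicted, labels)
stats = per_class_stats(matrix, labels)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(actual)
print(matrix, stats, accuracy)
```

The diagonal holds the correctly classified instances, so the overall accuracy in the summary is the diagonal sum divided by the number of test instances.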

WEKA Explorer: Attribute selection

- To identify which (subsets of) attributes are the most predictive ones
- In WEKA, a method for attribute selection consists of two parts
  - Attribute Evaluator. A method for evaluating the appropriateness of attributes (e.g., correlation-based, wrapper, information gain, chi-squared)
  - Search Method. A method for determining how (in which order) the attributes are examined (e.g., best-first, random, exhaustive, ranking)

WEKA Explorer: Data visualization

- Visualization is very useful in practice: it helps to determine the difficulty of the learning problem
- WEKA can visualize
  - a single attribute (1-D visualization)
  - a pair of attributes (2-D visualization)
- Different class values (labels) are visualized in different colors
- The Jitter slider supports better visualization when many instances are concentrated around a point in the plot
- Zooming in/out (i.e., by increasing/decreasing PlotSize and PointSize)
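As an illustration of the two-part scheme for attribute selection, the Python sketch below pairs an information-gain evaluator with a simple ranking search. It handles nominal attributes only and is not WEKA's implementation; the function names and the six toy instances are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute, class_attribute):
    """Evaluator: entropy of the class minus its expected entropy after splitting."""
    labels = [r[class_attribute] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[class_attribute])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_attributes(rows, class_attribute):
    """Search method: score every attribute and sort from best to worst."""
    names = [n for n in rows[0] if n != class_attribute]
    scored = [(information_gain(rows, n, class_attribute), n) for n in names]
    return sorted(scored, reverse=True)


# Toy nominal instances, made up for illustration.
rows = [
    {"outlook": "sunny", "windy": "FALSE", "play": "no"},
    {"outlook": "sunny", "windy": "TRUE", "play": "no"},
    {"outlook": "overcast", "windy": "FALSE", "play": "yes"},
    {"outlook": "rainy", "windy": "FALSE", "play": "yes"},
    {"outlook": "rainy", "windy": "TRUE", "play": "no"},
    {"outlook": "overcast", "windy": "TRUE", "play": "yes"},
]

ranking = rank_attributes(rows, class_attribute="play")
for gain, name in ranking:
    print(f"{name}: {gain:.3f}")
```

On this toy data outlook separates the classes far better than windy, so it is ranked first. A ranking search scores attributes one at a time, whereas subset searches such as best-first evaluate combinations of attributes.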