Introduction to Machine Learning

Introduction to Machine Learning CSC 640: Advanced Software Engineering James Walden Northern Kentucky University James Walden (NKU) Introduction to Machine Learning 1 / 45

Topics

1. Introduction
2. Building a Model
3. A Machine Learning Algorithm
4. Machine Learning with Python
5. Using scikit-learn
6. Model Performance
7. What's Next
8. References

The Hype Cycle

AI vs ML vs Deep Learning

AI and ML Definitions

Artificial Intelligence: Artificial intelligence is a term used to describe a system which perceives its environment and takes actions to maximize its chances of achieving its goals.

Machine Learning: Machine learning is a set of techniques that enable computers to perform tasks without being explicitly programmed. ML systems generalize from past data to make predictions about future data.

Machine Learning Formal Definition

Machine Learning (Tom Mitchell, 1997): A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.

Experience        Task                          Performance
E-mail message    Identify phishing attempt     % correctly classified
Malware           Categorize by threat actor    Coherent groupings
Login records     Identify credential misuse    % verified misuse
Attack data       Predict #attacks next year    Accurate #attacks

Machine Learning Tasks

Supervised Learning: Supervised learning focuses on models that predict the probabilities of new events based on the probabilities of previously observed events. Example task: determine if a file is malware or not.

Unsupervised Learning: Unsupervised learning models attempt to find patterns in data. Example task: determine how many families of malware exist in a dataset and which files belong to each family.

Supervised Learning

Classification: Classification algorithms predict which category an input belongs to based on probabilities learned from previously observed inputs. Example task: determine if a file is malware or not.

Regression: Regression models predict a continuous output value for a given input based on the output values associated with previous inputs. Example task: predict how many malware samples will be seen next month.

We will focus on classification models.

Classification

[Figure: training data consisting of samples (X) with labels (Y), Apple or Orange, and the resulting model, which labels new samples as Apple or Orange.]

Unsupervised Learning

Machine Learning in Software Engineering

What questions can machine learning answer for us in software engineering?

- Is this class likely to have bugs?
- How many post-release bugs will this program likely have?
- Which groups of classes are similar to each other?
- How much time will it take to finish this project?
- Which keyword or name is the one intended by the programmer after a few keystrokes are entered?
- Are some requirements redundant or overlapping?
- Is this patch likely to be accepted by the core developers?
- Is this outbound packet calling back to a C2 server?

Machine Learning Process

Building a Model

1. Collect samples of data from both classifications to train the machine learning model.
2. Extract features from each training example to represent the example numerically.
3. Train the machine learning system to identify bad items using the features.
4. Test the system on data that was not used when training to evaluate its performance.

Collecting Data

Machine learning systems are only as good as their training data.

1. Training data should be as close as possible to the data being tested.
2. Having close to equal numbers of bad and good items is better.
3. More training data is better.
4. Systems need to be retrained as software engineering processes and technologies change.
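The second guideline can be checked before training by counting how many samples carry each label. This is a minimal sketch on a made-up label vector; the values and the 0 = good, 1 = bad encoding are assumptions for illustration, not course data.

```python
import numpy as np

# Hypothetical labels: 0 = good item, 1 = bad item (illustrative only).
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# Count how many samples fall in each class.
classes, counts = np.unique(y, return_counts=True)
fractions = counts / counts.sum()

for cls, count, frac in zip(classes, counts, fractions):
    print(f"class {cls}: {count} samples ({frac:.0%})")
```

A split this lopsided suggests collecting more samples of the rare class before training.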

Extracting Features

- Feature selection is guided by expert knowledge.
- There should be more samples than features.
- Feature values should not be close to constant.
- Strongly correlated features can cause problems for some algorithms.
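Three of these guidelines can be checked mechanically. The sketch below uses pandas on a small invented feature table; the column names, the values, and the 0.95 correlation threshold are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical feature table (illustrative values, not the course data).
features = pd.DataFrame({
    "size_kb":    [10.0, 12.0, 30.0, 45.0, 50.0, 80.0],
    "size_bytes": [10240.0, 12288.0, 30720.0, 46080.0, 51200.0, 81920.0],
    "flag":       [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "entropy":    [0.2, 0.9, 0.4, 0.7, 0.1, 0.5],
})

# Guideline: more samples (rows) than features (columns).
n_samples, n_features = features.shape
more_samples_than_features = n_samples > n_features

# Guideline: features with (near) zero variance carry almost no information.
variances = features.var()
near_constant = variances[variances < 1e-8].index.tolist()

# Guideline: flag feature pairs whose absolute correlation exceeds 0.95.
corr = features.drop(columns=near_constant).corr().abs()
correlated_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > 0.95
]

print(more_samples_than_features, near_constant, correlated_pairs)
```

Here `flag` is constant and `size_bytes` is just a rescaled copy of `size_kb`, so both would be candidates for removal.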

Training

For each sample, provide the training interface with:

- Feature values for the sample.
- Classification of the sample as good or bad.

Testing

Use the model to classify data that was not used in training.

Decision Trees

Comparing with Other Algorithms

Advantages:

- Decision trees can be interpreted by humans.
- Can be combined with other techniques.

Disadvantages:

- Relatively inaccurate compared to other algorithms.
- A small input change can result in a big change in the tree.
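The interpretability claim can be seen by printing a fitted tree as text rules. This is a sketch under assumptions: it uses scikit-learn's bundled iris dataset as a stand-in for real data, and the depth limit of 2 is chosen only to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Iris is a small dataset bundled with scikit-learn; it stands in for real data.
iris = load_iris()

# A shallow tree keeps the printed rules short.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text renders the learned splits as nested threshold rules.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each printed line is a threshold test on one feature, which a person can read and check against domain knowledge.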

scikit-learn

http://scikit-learn.org

- Efficient, user-friendly machine learning toolkit.
- Built on NumPy, SciPy, and matplotlib.
- Open source with a BSD license.

SciPy

https://www.scipy.org

- Scientific computing library built on NumPy.
- Sparse matrices and graphs.
- Optimization and interpolation.
- Signal processing and Fourier transforms.

NumPy

http://www.numpy.org/

- Space-efficient n-dimensional arrays.
- Fast vector operations.
- Tools for integrating C/C++ and Fortran code.
- Linear algebra functions.

Pandas

https://pandas.pydata.org/

- Python data science library built on NumPy.
- Provides user-friendly data frames like those in R.
- Statistical and data visualization functions.
- The reticulate package for R allows Pandas and R data frames to be shared between the two languages.

Matplotlib

https://matplotlib.org/

- Python 2D plotting library for publication-quality graphics.
- The pyplot module provides a MATLAB-like interface for simple plots.
- User has full control of all plotting details.
- Used as the basis of Pandas plotting abilities.
- We will generally use R's ggplot2 in this class.

IPython

https://ipython.org/

- A powerful, interactive Python shell.
- Use shell commands and Python code in the same interface.
- Used as the computation kernel by Jupyter.

Jupyter

https://jupyter.org/

- Interactive notebooks for data science in many languages.
- Combine Markdown text, computation results, and graphics in a single document.
- Similar to Mathematica notebooks or RStudio documents.
- Uses a web interface.

Anaconda

https://www.anaconda.com/

- Most popular Python data science distribution.
- Comes with scikit-learn, pandas, scipy, numpy, etc.
- Uses the conda package management tool.
- Create environments with different versions of libraries.

Conda Features

Conda Concepts:

- Channels are sources for packages.
- Environments are named collections of conda packages, enabling the user to maintain different package versions for different projects.

Conda Commands:

    conda list                # list installed pkgs
    conda install pkgname     # install package
    conda update pkgname      # upgrade package
    conda env list            # list environments
    conda create -n NAME      # create env NAME
    conda activate NAME       # use env NAME

Using scikit-learn

The basic process for building a model is:

1. Import libraries
2. Load data
3. Preprocess data
4. Split data into test/train sets
5. Train the model
6. Evaluate model performance

We will expand on this process with additional steps later.
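The six steps can be sketched end to end in one self-contained script. This is an illustration under assumptions, not the course notebook: synthetic data from `make_classification` stands in for a real CSV file, and the preprocessing step is a simple drop of rows with missing values.

```python
# 1. Import libraries
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 2. Load data: synthetic samples stand in for a real dataset (assumption).
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# 3. Preprocess: drop rows with missing feature values (none here, shown for form).
keep = ~np.isnan(X).any(axis=1)
X, y = X[keep], y[keep]

# 4. Split: 80% of the rows for training, 20% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# 5. Train the model on the training rows only.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# 6. Evaluate on rows the model has never seen.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"test accuracy: {accuracy:.3f}")
```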

Import Libraries

These are libraries that we will need regardless of ML algorithm.

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, confusion_matrix

Load Data

Read the CSV data as a Pandas data frame.

    In [5]: df = pd.read_csv("data.csv")

    In [6]: df.shape
    Out[6]: (1372, 5)

    In [7]: df.head(3)
    Out[7]:
       Variance  Skewness  Kurtosis  Entropy  Forgery
    0   3.62160    8.6661   -2.8073 -0.44699        0
    1   4.54590    8.1674   -2.4586 -1.46210        0
    2   3.86600   -2.6383    1.9242  0.10645        0

Data frames are preferred for exploring the data. Our sample dataset is the banknote forgery dataset.

Convert Data Frame to NumPy Array

Scikit-learn works with NumPy arrays internally, so we convert the data frame explicitly. It expects that:

- Labels (response variables) be a vector.
- Features (predictors) be a 2-D array.

    In [5]: y = df["Forgery"].values

    In [6]: y.shape
    Out[6]: (1372,)

    In [7]: X = df.drop("Forgery", axis=1).values

    In [8]: X.shape
    Out[8]: (1372, 4)

Split the Data

- Choose 80% of the data to train the model and 20% to test it.
- Samples (rows) are chosen randomly.
- Set the random state to make the split always the same.

    In [14]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

    In [15]: X_train.shape
    Out[15]: (1097, 4)

    In [16]: X_test.shape
    Out[16]: (275, 4)

    In [17]: y_train.shape
    Out[17]: (1097,)

    In [18]: y_test.shape
    Out[18]: (275,)
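The effect of fixing the random state can be checked directly. The sketch below is an illustration on placeholder arrays that only share the bank note data's shape (1372 rows, 4 features); the values themselves are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the bank note data (1372 x 4).
X = np.arange(1372 * 4).reshape(1372, 4)
y = np.arange(1372) % 2

# Two independent calls with the same random_state.
split_a = train_test_split(X, y, test_size=0.2, random_state=1)
split_b = train_test_split(X, y, test_size=0.2, random_state=1)

# Identical random_state yields identical train/test partitions.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

X_train, X_test, y_train, y_test = split_a
print(same_split, X_train.shape, X_test.shape)
```

Without a fixed `random_state`, each run would shuffle the rows differently, so accuracy figures would vary from run to run.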

Train the Model

Create a classifier object, then fit it.

    In [19]: from sklearn.tree import DecisionTreeClassifier

    In [21]: model = DecisionTreeClassifier()

    In [22]: model.fit(X_train, y_train);

The class names and model creation method names change, but we always use the fit method with the training features and labels.

Evaluate the Model

We make predictions using the predict() method,

    In [23]: y_pred = model.predict(X_test)

then compare the predicted labels with the actual labels to measure accuracy.

    In [24]: accuracy_score(y_test, y_pred)
    Out[24]: 0.9745454545454545

Our model predicts forged bank notes with 97.5% accuracy.

Confusion Matrix

For more detailed model performance, we use the confusion matrix.

    In [27]: confusion_matrix(y_pred, y_test)
    Out[27]:
    array([[153,   3],
           [  4, 115]])

With this argument order, rows are predicted labels and columns are actual labels. The decision tree had 3 false negatives and 4 false positives.
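Reading a confusion matrix correctly depends on argument order: scikit-learn's `confusion_matrix` takes the true labels first and the predicted labels second, putting actual classes in rows and predicted classes in columns. The sketch below illustrates this on a tiny hand-made example; the label vectors are invented for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Tiny hand-made labels (illustrative): 1 = forged, 0 = genuine.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 1, 1, 0, 1, 0, 1, 1])

# With (y_true, y_pred), rows are actual classes and columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# For a binary problem, ravel() yields the four cells in this order.
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

# Swapping the arguments transposes the matrix, exchanging FP and FN.
cm_swapped = confusion_matrix(y_pred, y_true)
print(cm_swapped)
```

Because swapping the arguments exchanges the false positive and false negative cells, it is worth confirming the order before interpreting the off-diagonal counts.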

Accuracy

Accuracy is the percentage of correct classifications.

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

Problem: If only 1% of files are malware, then a model that classifies all files as benign is a 99% accurate malware detector.
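The 1% malware problem can be reproduced in a few lines. This is a sketch on invented labels: 1000 hypothetical files, 10 of them malware, and a "detector" that labels everything benign.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 1000 hypothetical files, 1% malware (label 1), the rest benign (label 0).
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# A useless "detector" that calls every file benign.
y_pred = np.zeros(1000, dtype=int)

# Accuracy looks excellent, yet no malware sample is ever caught.
accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print(accuracy, recall)
```

The high accuracy comes entirely from the benign majority, which is why accuracy alone is a poor guide on imbalanced data.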

Precision

Precision measures how many samples predicted as positive are actually positive.

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{1} \]

Precision is used when the goal is to limit the number of false positives.

Problem: Precision can approach 1 if we identify only the samples we're most certain of as positive and classify all others as negative. Recall will then be low.

Recall

Recall measures the fraction of positive samples that were identified by the model.

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{2} \]

Recall is used when we need to identify all positive samples, i.e. when it is important to avoid false negatives.

Problem: If the model predicts all files are malware, there are zero false negatives and recall is 1. Precision will be low.

F-measure

F_1 is the harmonic mean of precision and recall.

\[ F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3} \]

It provides a balanced consideration of both precision and recall, and can be a better metric of model performance than accuracy.
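As a worked example, the four metrics can be computed by hand from confusion matrix counts. The sketch below uses the counts reported for the bank note decision tree (115 true positives, 153 true negatives, 4 false positives, 3 false negatives), treating forged notes as the positive class.

```python
# Counts read off the bank note confusion matrix shown earlier.
tp, tn, fp, fn = 115, 153, 4, 3

# Fraction of all predictions that were correct.
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Fraction of predicted positives that were truly positive.
precision = tp / (tp + fp)

# Fraction of actual positives that the model found.
recall = tp / (tp + fn)

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```

On this nearly balanced dataset all four numbers are close together; they diverge when one class is rare.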

Performance Metrics in Scikit-learn

We can easily compute precision, recall, and F1 metrics. Each function takes the true labels first and the predicted labels second.

    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    print(round(accuracy_score(y_test, y_pred), 3))
    print(round(precision_score(y_test, y_pred), 3))
    print(round(recall_score(y_test, y_pred), 3))
    print(round(f1_score(y_test, y_pred), 3))

    0.997
    0.941
    0.848
    0.892

These results are for the payment fraud dataset.

Scikit-learn Classification Report

The classification report provides one set of metrics per class.

    from sklearn.metrics import classification_report

    print(classification_report(y_test, y_pred))

                 precision    recall  f1-score   support

              0       1.00      1.00      1.00      7733
              1       0.94      0.85      0.89       112

    avg / total       1.00      1.00      1.00      7845

The first row of metrics treats 0 (the legitimate class) as the positive class. The second row treats 1 (the fraudulent class) as the positive class.

What's Next

We have a hands-on activity next, in which we will:

- log into a Linux VM with Anaconda installed,
- start a Jupyter notebook server,
- use a notebook to solve the bank note problem, and
- experiment with a few machine learning algorithms.

References

1. Clarence Chio and David Freeman, Machine Learning and Security: Protecting Systems with Data and Algorithms, O'Reilly Media, 2018.
2. Aurélien Géron, Hands-on Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, O'Reilly Media, 2017.
3. Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction to Statistical Learning with Applications in R, Springer, 2014.
4. Andreas C. Müller and Sarah Guido, Introduction to Machine Learning with Python: A Guide for Data Scientists, O'Reilly Media, 2016.
5. Joshua Saxe and Hillary Sanders, Malware Data Science: Attack Detection and Attribution, No Starch Press, 2018.