Introduction to Machine Learning
CSC 640: Advanced Software Engineering
James Walden
Northern Kentucky University

Topics
1. Introduction
2. Building a Model
3. A Machine Learning Algorithm
4. Machine Learning with Python
5. Using scikit-learn
6. Model Performance
7. What's Next
8. References

The Hype Cycle

AI vs ML vs Deep Learning

AI and ML Definitions
Artificial Intelligence: Artificial intelligence describes a system that perceives its environment and takes actions to maximize its chances of achieving its goals.
Machine Learning: Machine learning is a set of techniques that enable computers to perform tasks without being explicitly programmed. ML systems generalize from past data to make predictions about future data.

Machine Learning Formal Definition
Machine Learning (Tom Mitchell, 1997): A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.

    Experience       Task                         Performance
    E-mail message   Identify phishing attempt    % correctly classified
    Malware          Categorize by threat actor   Coherent groupings
    Login records    Identify credential misuse   % verified misuse
    Attack data      Predict #attacks next year   Accurate #attacks

Machine Learning Tasks
Supervised Learning: Supervised learning focuses on models that predict the probabilities of new events based on the probabilities of previously observed events. Example task: determine whether a file is malware.
Unsupervised Learning: Unsupervised learning models attempt to find patterns in data. Example task: determine how many families of malware exist in a dataset and which files belong to each family.

Supervised Learning
Classification: Classification algorithms predict which category an input belongs to based on probabilities learned from previously observed inputs. Example task: determine whether a file is malware.
Regression: Regression models predict a continuous output value for a given input based on the output values associated with previous inputs. Example task: predict how many malware samples will be seen next month.
We will focus on classification models.

Classification
[Figure: training data with samples (X) and labels (Y: Apple, Orange), and the resulting model separating Apple from Orange.]

Unsupervised Learning

Machine Learning in Software Engineering
What questions can machine learning answer for us in software engineering?
- Is this class likely to have bugs?
- How many post-release bugs will this program likely have?
- Which groups of classes are similar to each other?
- How much time will it take to finish this project?
- Which keyword or name is the one intended by the programmer after a few keystrokes are entered?
- Are some requirements redundant or overlapping?
- Is this patch likely to be accepted by the core developers?
- Is this outbound packet calling back to a C2 server?

Machine Learning Process

Building a Model
1. Collect samples of data from both classes to train the machine learning model.
2. Extract features from each training example to represent the example numerically.
3. Train the machine learning system to identify bad items using the features.
4. Test the system on data that was not used in training to evaluate its performance.

Collecting Data
Machine learning systems are only as good as their training data.
1. Training data should be as close as possible to the data being tested.
2. Having close to equal numbers of bad and good items is better.
3. More training data is better.
4. Systems need to be retrained as software engineering processes and technologies change.

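The balance guideline above can be checked before any training happens. The following is a minimal sketch, assuming the labels are held in a plain Python list where 1 marks a bad item and 0 a good one; the `labels` data and the `class_balance` helper are hypothetical and used only for illustration.

```python
from collections import Counter

def class_balance(labels):
    """Return each class's share of the samples as a dict."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}

# Hypothetical labels: 1 = bad item, 0 = good item.
labels = [0] * 90 + [1] * 10
balance = class_balance(labels)
print(balance)  # {0: 0.9, 1: 0.1}
```

A split this lopsided suggests collecting more bad items, or at least choosing evaluation metrics other than accuracy.
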
Extracting Features
- Feature selection is guided by expert knowledge.
- There should be more samples than features.
- Feature values should not be close to constant.
- Strongly correlated features can cause problems for some algorithms.

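Two of the guidelines above, avoiding near-constant features and watching for strongly correlated ones, can be checked mechanically. This is a rough sketch in plain Python; the feature names, the values, and the cutoff of 1e-9 are all hypothetical choices made for illustration.

```python
from math import sqrt

def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length, non-constant lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / spread

# Hypothetical features measured on five samples.
features = {
    "lines_of_code": [120, 300, 80, 450, 210],
    "num_statements": [100, 260, 70, 400, 180],  # tracks lines_of_code closely
    "num_files": [1, 1, 1, 1, 1],                # constant, carries no information
}

# Features whose variance is (almost) zero cannot help separate the classes.
near_constant = [name for name, vals in features.items() if variance(vals) < 1e-9]
# A correlation near 1 or -1 means the two features are largely redundant.
corr = pearson(features["lines_of_code"], features["num_statements"])
print(near_constant, round(corr, 3))
```
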
Training
For each sample, provide the training interface with
- Feature values for the sample.
- Classification of the sample as good or bad.

Testing
Use the model to classify data that was not used in training.

Decision Trees

Comparing with other Algorithms
Advantages
- Decision trees can be interpreted by humans.
- Can be combined with other techniques.
Disadvantages
- Relatively inaccurate compared to other algorithms.
- A small input change can result in a big change in the tree.

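To make the interpretability point concrete, the sketch below learns a one-split tree (a decision stump) on a single feature by choosing the threshold with the lowest weighted Gini impurity. It is a simplified illustration of the idea, not scikit-learn's implementation; the data and the function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Return the threshold on one feature with the lowest weighted Gini impurity."""
    best_threshold, best_score = None, float("inf")
    points = sorted(set(values))
    # Candidate thresholds are midpoints between adjacent distinct values.
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Hypothetical data: one feature value per sample, label 1 = bad, 0 = good.
values = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
labels = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(values, labels)
# For this data every sample at or below the threshold is class 0.
print(f"if value <= {threshold}: predict 0, else predict 1")
```

The learned model is a single readable rule, which is what makes trees easy to inspect. A full tree repeats the same search inside each branch.
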
scikit-learn
http://scikit-learn.org
- Efficient, user-friendly machine learning toolkit
- Built on NumPy, SciPy, and matplotlib
- Open source with BSD license

SciPy
https://www.scipy.org
- Scientific computing library built on NumPy
- Sparse matrices and graphs
- Optimization and interpolation
- Signal processing and Fourier transforms

NumPy
http://www.numpy.org/
- Space-efficient n-dimensional arrays
- Fast vector operations
- Tools for integrating C/C++ and Fortran code
- Linear algebra functions

Pandas
https://pandas.pydata.org/
- Python data science library built on NumPy
- Provides user-friendly data frames like those in R
- Statistical and data visualization functions
- The reticulate package allows Pandas and R data frames to be shared between the two languages.

Matplotlib
https://matplotlib.org/
- Python 2D plotting library for publication-quality graphics.
- The pyplot module provides a MATLAB-like interface for simple plots.
- User has full control of all plotting details.
- Used as the basis of Pandas plotting abilities.
- We will generally use R's ggplot2 in this class.

IPython
https://ipython.org/
- A powerful, interactive Python shell.
- Use shell commands and Python code in the same interface.
- Used as the computation kernel by Jupyter.

Jupyter
https://jupyter.org/
- Interactive notebooks for data science in many languages.
- Combine Markdown text, computation results, and graphics in a single document.
- Similar to Mathematica notebooks or RStudio documents.
- Uses a web interface.

Anaconda
https://www.anaconda.com/
- Most popular Python data science distribution.
- Comes with scikit-learn, pandas, scipy, numpy, etc.
- Uses the conda package management tool.
- Create environments with different versions of libraries.

Conda features
Conda Concepts
- Channels are sources for packages.
- Environments are named collections of conda packages, enabling the user to maintain different package versions for different projects.
Conda Commands
    conda list              # list installed pkgs
    conda install pkgname   # install package
    conda update pkgname    # upgrade package
    conda env list          # list environments
    conda create -n NAME    # create env NAME
    conda activate NAME     # use env NAME

Using scikit-learn
The basic process for building a model is
1. Import libraries
2. Load data
3. Preprocess data
4. Split data into train/test sets
5. Train the model
6. Evaluate model performance
We will expand on this process with additional steps later.

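The six steps above fit in one short script. The sketch below assumes scikit-learn is installed and uses synthetic data from `make_classification` as a stand-in for a real dataset, so the dataset size, the seeds, and the resulting accuracy are illustrative only.

```python
# 1. Import libraries
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 2. Load data (synthetic stand-in; a real project would read a file here)
X, y = make_classification(n_samples=500, n_features=4, random_state=0)

# 3. Preprocess data (nothing to do for this synthetic data)

# 4. Split data into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# 5. Train the model
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# 6. Evaluate model performance on the held-out test set
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(X_train.shape, X_test.shape, round(accuracy, 3))
```
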
Import Libraries
These are libraries that we will need regardless of ML algorithm.
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, confusion_matrix

Load Data
Read the CSV data as a Pandas data frame.
    In [5]: df = pd.read_csv('data.csv')
    In [6]: df.shape
    Out[6]: (1372, 5)
    In [7]: df.head(3)
    Out[7]:
       Variance  Skewness  Kurtosis  Entropy  Forgery
    0   3.62160    8.6661   -2.8073 -0.44699        0
    1   4.54590    8.1674   -2.4586 -1.46210        0
    2   3.86600   -2.6383    1.9242  0.10645        0
Data frames are preferred for exploring the data. Our sample dataset is the banknote forgery dataset.

Convert Data Frame to NumPy Array
Scikit-learn operates on NumPy arrays, so we convert the data frame. It expects that
- Labels (response variables) be a vector.
- Features (predictors) be a two-dimensional array.
    In [5]: y = df['Forgery'].values
    In [6]: y.shape
    Out[6]: (1372,)
    In [7]: X = df.drop('Forgery', axis=1).values
    In [8]: X.shape
    Out[8]: (1372, 4)

Split the Data
Choose 80% of the data to train the model and 20% to test it.
- Samples (rows) are chosen randomly.
- Set the random state to make the split always the same.
    In [14]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
    In [15]: X_train.shape
    Out[15]: (1097, 4)
    In [16]: X_test.shape
    Out[16]: (275, 4)
    In [17]: y_train.shape
    Out[17]: (1097,)
    In [18]: y_test.shape
    Out[18]: (275,)

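The idea behind a seeded random split can be sketched in plain Python. This is a simplified illustration, not scikit-learn's actual shuffling algorithm, so the rows it selects differ from those chosen by `train_test_split`; the `split_indices` helper is hypothetical.

```python
import math
import random

def split_indices(n_samples, test_size, seed):
    """Shuffle row indices with a fixed seed and split them into train/test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Rounding the test size up reproduces the 275 test rows shown above.
    n_test = math.ceil(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(1372, 0.2, seed=1)
again_train, again_test = split_indices(1372, 0.2, seed=1)

print(len(train_idx), len(test_idx))  # 1097 275
print(train_idx == again_train)       # True: same seed, same split
```

Fixing the seed matters when comparing models, since each model is then trained and tested on exactly the same rows.
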
Train the Model
Create a classifier object, then fit it.
    In [19]: from sklearn.tree import DecisionTreeClassifier
    In [21]: model = DecisionTreeClassifier()
    In [22]: model.fit(X_train, y_train);
The class names and model creation method names change, but we always use the fit method with the training features and labels.

Evaluate the Model
We make predictions using the predict() method,
    In [23]: y_pred = model.predict(X_test)
then compare the predicted labels with the actual labels to measure accuracy.
    In [24]: accuracy_score(y_test, y_pred)
    Out[24]: 0.9745454545454545
Our model predicts forged bank notes with 97.5% accuracy.

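Confusion Matrix
For more detailed model performance, we use the confusion matrix.
    In [27]: confusion_matrix(y_pred, y_test)
    Out[27]:
    array([[153,   3],
           [  4, 115]])
The decision tree had 3 false negatives and 4 false positives.
Predictions are passed first here, so rows correspond to predicted labels and columns to actual labels. Scikit-learn's documented argument order is confusion_matrix(y_true, y_pred), which would produce the transpose of this matrix.
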
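The four cells of a binary confusion matrix are simple counts. As a sketch of what is being tallied, the plain-Python function below counts them directly; the `confusion_counts` helper and the label lists are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for binary labels."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn

# Hypothetical labels: 1 = forged, 0 = genuine.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)  # 3 1 1 3
```
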
Accuracy
Accuracy is the percentage of correct classifications.
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
Problem: If only 1% of files are malware, then a model that classifies all files as benign is a 99% accurate malware detector.

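The problem described above is easy to reproduce. The sketch below builds a hypothetical dataset of 1000 files, 1% of them malware, and scores a "model" that labels every file benign.

```python
# Hypothetical dataset: 1000 files, 10 of which (1%) are malware (label 1).
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000  # a "model" that calls every file benign

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)
detected = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(accuracy)  # 0.99
print(detected)  # 0 malware files found
```

The accuracy looks excellent even though the detector never finds a single malware sample.
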
Precision
Precision measures how many samples predicted as positive are actually positive.
\[ \text{Precision} = \frac{TP}{TP + FP} \tag{1} \]
Precision is used when the goal is to limit the number of false positives.
Problem: Precision can approach 1 if we identify only the samples we are most certain of as positive and classify all others as negative. Recall will be low.

Recall
Recall measures the fraction of positive samples that were identified by the model.
\[ \text{Recall} = \frac{TP}{TP + FN} \tag{2} \]
Recall is used when we need to identify all positive samples, i.e. when it is important to avoid false negatives.
Problem: If the model predicts all files are malware, there are zero false negatives and recall is 1. Precision will be low.

F-measure
F_1 is the harmonic mean of precision and recall.
\[ F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3} \]
It provides a balanced consideration of both precision and recall, and can be a better metric of model performance than accuracy.

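As a worked example of the formulas, the sketch below plugs in the counts from the bank note confusion matrix shown earlier (3 false negatives, 4 false positives), taking forgery as the positive class. The variable names are illustrative.

```python
# Counts from the bank note confusion matrix, with forgery as the positive class.
tp, fp, fn, tn = 115, 4, 3, 153

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)                          # equation (1)
recall = tp / (tp + fn)                             # equation (2)
f1 = 2 * precision * recall / (precision + recall)  # equation (3)

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("f1", f1)]:
    print(f"{name}: {value:.3f}")
```

The accuracy computed here agrees with the 0.9745 reported by accuracy_score for the same model.
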
Performance Metrics in Scikit-learn
We can easily compute precision, recall, and F1 metrics.
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
    # scikit-learn metrics take the true labels first, then the predictions;
    # swapping the arguments swaps precision and recall
    print(round(accuracy_score(y_test, y_pred), 3))
    print(round(precision_score(y_test, y_pred), 3))
    print(round(recall_score(y_test, y_pred), 3))
    print(round(f1_score(y_test, y_pred), 3))
    0.997
    0.941
    0.848
    0.892
These results are for the payment fraud dataset.

Scikit-learn Classification Report
The classification report provides two sets of metrics.
    from sklearn.metrics import classification_report
    print(classification_report(y_test, y_pred))
                 precision    recall  f1-score   support
              0       1.00      1.00      1.00      7733
              1       0.94      0.85      0.89       112
    avg / total       1.00      1.00      1.00      7845
The first row of metrics treats class 0 (legitimate payments) as the positive class. The second row treats class 1 (fraudulent payments) as the positive class.

What's Next
We have a hands-on activity next, in which we will
- log into a Linux VM with Anaconda installed,
- start a Jupyter notebook server,
- use a notebook to solve the bank note problem, and
- experiment with a few machine learning algorithms.

References
1. Clarence Chio and David Freeman, Machine Learning and Security: Protecting Systems with Data and Algorithms, O'Reilly Media, 2018.
2. Aurélien Géron, Hands-on Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, O'Reilly Media, 2017.
3. Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction to Statistical Learning with Applications in R, Springer, 2014.
4. Andreas C. Müller and Sarah Guido, Introduction to Machine Learning with Python: A Guide for Data Scientists, O'Reilly Media, 2016.
5. Joshua Saxe and Hillary Sanders, Malware Data Science: Attack Detection and Attribution, No Starch Press, 2018.