TOWARDS DATA-DRIVEN AUTONOMICS IN DATA CENTERS

TOWARDS DATA-DRIVEN AUTONOMICS IN DATA CENTERS ALINA SIRBU, OZALP BABAOGLU SUMMARIZED BY ARDA GUMUSALAN

MOTIVATION 2

MOTIVATION Data centers that depend on human interaction are not sustainable at the scale expected of future data centers. Exascale -> a billion billion servers per data center, which would require an unreasonable number of employees to oversee the system. Future data centers will be fully autonomic: human technicians will be limited to setting high-level goals and policies, with no more low-level operations (Level 5 of "A survey of autonomic computing: degrees, models, and applications"). 3

MOTIVATION Applying traditional autonomic computing techniques to large data centers is problematic: current autonomic computing technologies are reactive and lack the predictive capabilities to anticipate undesired states in advance. 4

GOAL AND CONTRIBUTION 5

GOAL Build a predictive model for node failures in data centers. Develop a new generation of autonomics that is data-driven, predictive, and proactive, based on holistic models that treat the computer system as an ecosystem. This is the first step towards a more comprehensive predictor. 6

CONTRIBUTION Shows that modern data centers can be scaled to extreme dimensions by eliminating their reliance on human operators. Provides a prediction model that combines subsampling with bagging and precision-weighted voting (explained in the upcoming slides). Provides an example of BigQuery usage with a quantitative evaluation of running times as a function of data size; computation times are included. The data-mining-specific parts are not directly related to autonomic computing, so some of them are omitted from this summary. 7

BACKGROUND 8

RANDOM FOREST A technique that is commonly applied when the feature set is large or the data is unbalanced. Main idea: ensemble weak learners to form a strong learner; here, each weak learner is a decision tree. https://citizennet.com/blog/2012/11/10/random-forests-ensembles-and-performance-metrics/ 9

RANDOM FOREST 1. Sample N cases at random with replacement to create a subset of the data; this is called bagging. 2. At each node: 1. Randomly choose m predictor variables out of all predictor variables. 2. Determine the predictor that gives the best binary split according to some objective function. 3. At the next node, choose another m variables at random from all predictor variables and do the same. https://citizennet.com/blog/2012/11/10/random-forests-ensembles-and-performance-metrics/ 10

RANDOM FOREST (figure only: random forest workflow diagram from https://citizennet.com/blog/2012/11/10/random-forests-ensembles-and-performance-metrics/) 11

RANDOM FOREST When a new input is entered into the system, run it through all of the trees. The result can be obtained as the average or weighted average of the individual tree outputs, or by majority voting. In order to have accurate results, inter-tree correlation should be low. Tradeoff: a smaller m leads to lower inter-tree correlation but also lowers the strength of an individual tree. https://citizennet.com/blog/2012/11/10/random-forests-ensembles-and-performance-metrics/ 12
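To make the above concrete, a minimal scikit-learn sketch (illustrative only, not the authors' setup): bagging corresponds to bootstrap=True, the per-node choice of m predictors to max_features, and prediction is a majority vote or probability average over the trees. The toy dataset and parameter values are assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced toy data stands in for the real feature set.
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,      # number of decision trees (weak learners)
    max_features="sqrt",   # m: predictors sampled at each node
    bootstrap=True,        # bagging: sample cases with replacement per tree
    random_state=0,
)
rf.fit(X, y)

print(rf.predict(X[:5]))              # majority vote over the trees
print(rf.predict_proba(X[:5])[:, 1])  # averaged class probabilities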

BUILDING THE FEATURE SET 13

FEATURE SET The workload trace published by Google has been analyzed: 29 days for a cluster of 12,453 machines, including the amount of resources used per task for every 5-minute interval. Due to the size of the uncompressed data, the authors used BigQuery (17 GB for over 1 billion records). 14

BASIC FEATURES Basic features obtained for each machine for every 5-minute interval: number of tasks currently running; number of tasks that have started in the last 5 minutes; number of tasks that have finished in the last 5 minutes, split by exit status (evicted, failed, completed, killed, lost); CPU load; memory; disk time; cycles per instruction; memory accesses per instruction. A total of 12 basic features. 15

BASIC FEATURES For each of the 12 basic features, at each time step, consider the previous 6 time windows (30 minutes): 12 x 6 = 72 features in total. A separate table is built per feature. The first 5 features took between 139 and 939 seconds to compute; the remaining ones took 3,585 to 9,096 seconds. 16
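A hedged pandas sketch of how such lagged features could be built (the authors computed them as BigQuery tables; the DataFrame layout and column names here are assumptions):

import pandas as pd

def add_lagged_features(df, basic_cols, n_lags=6):
    # df: one row per (machine_id, 5-minute window), sorted by time
    # within each machine; column names are hypothetical.
    out = df.copy()
    for col in basic_cols:
        for lag in range(1, n_lags + 1):
            out[f"{col}_lag{lag}"] = out.groupby("machine_id")[col].shift(lag)
    return out

# e.g. add_lagged_features(df, BASIC_FEATURES) yields 12 x 6 = 72 lagged columns.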

SECOND LEVEL AGGREGATION For each basic feature, compute a second level of aggregation: average, standard deviation, coefficient of variation. Repeat this for 6 different running window sizes: 1, 12, 24, 48, 72, 96 hours. 3 statistics x 12 basic features x 6 window sizes = 216 additional features. The sizes of these tables ranged between 143 GB and 12.5 TB; computing them, however, was not time consuming. 17
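A hedged pandas sketch of the second-level aggregation (the real computation ran in BigQuery; machine_id and the basic-feature column names are assumed, and rows are assumed sorted by time within each machine):

import pandas as pd

WINDOW_HOURS = [1, 12, 24, 48, 72, 96]
ROWS_PER_HOUR = 12   # data is at 5-minute resolution

def add_window_aggregates(df, basic_cols):
    out = df.copy()
    grouped = out.groupby("machine_id")
    for col in basic_cols:
        for h in WINDOW_HOURS:
            w = h * ROWS_PER_HOUR
            mean = grouped[col].transform(lambda s, w=w: s.rolling(w, min_periods=1).mean())
            std = grouped[col].transform(lambda s, w=w: s.rolling(w, min_periods=1).std())
            out[f"{col}_mean_{h}h"] = mean
            out[f"{col}_std_{h}h"] = std
            out[f"{col}_cv_{h}h"] = std / mean   # coefficient of variation
    return out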

THIRD LEVEL OF AGGREGATION Compute the correlation between the following 7 features: number of running tasks, number of started tasks, number of failed jobs, CPU, memory, disk time, cycles per instruction. A total of 21 correlation pairs, calculated for each of the 6 time windows (1, 12, 24, 48, 72, 96 hours). An additional 21 x 6 = 126 features. 18
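A hedged pandas sketch of the windowed pairwise correlations (column names and layout are hypothetical; the authors computed these tables in BigQuery):

from itertools import combinations
import pandas as pd

CORR_COLS = ["num_running", "num_started", "num_failed",
             "cpu", "memory", "disk_time", "cpi"]   # hypothetical names

def add_window_correlations(df, window_hours=(1, 12, 24, 48, 72, 96)):
    # 21 pairs x 6 windows = 126 correlation features.
    out = df.copy()
    for h in window_hours:
        w = h * 12   # rows per window at 5-minute resolution
        for a, b in combinations(CORR_COLS, 2):
            rolled = (out.groupby("machine_id")
                         .apply(lambda g, a=a, b=b, w=w:
                                g[a].rolling(w, min_periods=2).corr(g[b]))
                         .reset_index(level=0, drop=True))
            out[f"corr_{a}_{b}_{h}h"] = rolled
    return out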

RUNNING TIMES The table shows the mean (and standard deviation) of the required computation times per feature. 19

ADDITIONAL FEATURES 2 new features are added: the up-time of each machine (the time passed since its last ADD event) and the number of REMOVE events for the entire cluster within the last hour. A total of 416 features. 20


CLASSIFICATION APPROACH 22

CLASSIFICATION APPROACH The features explained so far are used for classification with a Random Forest (RF) classifier. Data points are separated into two classes: SAFE (did not fail -> negative) and FAIL (failed -> positive). All points with a time to remove of less than 24 hours were assigned to the FAIL class. If the gap between a REMOVE event and the next ADD event for the same machine was longer than 2 hours, it counted as a failure. Out of 8,957 REMOVE events, 2,298 were considered failures. Due to the imbalance of the data, only a subset of the SAFE class was kept, a random sample of 0.5% of the total. This gave 544,985 SAFE data points and 108,365 FAIL data points; these 653,350 data points formed the basis of the predictive study. 23
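A sketch of how the labeling and SAFE subsampling could look, under the assumption of a per-point time_to_remove column (hypothetical name, in hours):

import pandas as pd

FAIL_HORIZON_H = 24   # points less than 24h from a failure are FAIL

def label_and_subsample(df, safe_fraction=0.005, seed=0):
    # time_to_remove: hours until the next failure-type REMOVE event
    # (a REMOVE with no ADD within 2 hours counts as a failure).
    out = df.copy()
    out["label"] = (out["time_to_remove"] < FAIL_HORIZON_H).astype(int)
    fail = out[out["label"] == 1]
    safe = out[out["label"] == 0].sample(frac=safe_fraction, random_state=seed)
    return pd.concat([fail, safe]).sample(frac=1.0, random_state=seed)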

FEATURE SELECTION MECHANISMS The authors explored two feature selection mechanisms. Principal component analysis: a statistical procedure that converts a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. Theoretically, the top principal components can be used for classification, since they should contain the most important information. However, the performance of the principal components was not better than that of the original features. 24
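A minimal scikit-learn sketch of this PCA variant for comparison (the number of components is an assumption, not taken from the paper):

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Project onto the top principal components, then classify.
pca_rf = make_pipeline(
    StandardScaler(),
    PCA(n_components=50),   # assumed value for illustration
    RandomForestClassifier(n_estimators=100, random_state=0),
)
# pca_rf.fit(X_train, y_train) would then be compared against a plain
# RandomForestClassifier fit directly on the original features.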

FEATURE SELECTION MECHANISMS Filtering the original features: filter the original features based on their correlation to the time_to_remove value; only features whose correlation exceeds a threshold are used. The best performance was obtained with a null threshold, indicating that once again the best results were obtained using all the features. Neither feature selection mechanism performed better than an RF with the original features; RF itself performs feature selection when building its decision trees. 25
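A sketch of the correlation-threshold filter, assuming a feature matrix X and the time_to_remove target as pandas objects (names hypothetical):

import pandas as pd

def filter_by_correlation(X: pd.DataFrame, target: pd.Series, threshold=0.0):
    # Keep features whose |correlation| with time_to_remove exceeds the
    # threshold; a threshold of 0 keeps every feature, which is the
    # setting that performed best in the paper.
    if threshold <= 0:
        return X
    corr = X.apply(lambda col: col.corr(target)).abs()
    return X[corr[corr > threshold].index]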

BUILDING THE ENSEMBLE OF RF 26

BUILDING THE ENSEMBLE OF RF The authors observed that the performance of individual classifiers (a single RF) was not satisfactory, so they combined RFs to form a forest of forests, a known technique especially for imbalanced data with rare events. Each RF had a varying number of decision trees (between 2 and 15), and each RF was trained with different data (bagging). Every time a new classifier is trained, a new training dataset is built by taking all the points in the positive class and a random subset of the negative class. fsafe: the ratio between SAFE and FAIL classes, in {0.25, 0.5, 1, 2, 3, 4}. 27

BUILDING THE ENSEMBLE OF RF The procedure is repeated 5 times, resulting in 420 RFs in the ensemble: 5 repetitions x 6 fsafe values x 14 RF sizes. 28
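A hedged sketch of the ensemble construction loop under these settings (the array-based interface and the per-forest random seeds are assumptions):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

FSAFE = [0.25, 0.5, 1, 2, 3, 4]     # SAFE-to-FAIL ratio per training set
TREE_COUNTS = range(2, 16)          # 2..15 trees per forest (14 sizes)
REPS = 5                            # 5 x 6 x 14 = 420 forests

def build_ensemble(X_fail, X_safe, seed=0):
    # X_fail, X_safe: numpy arrays of positive / negative points (assumed).
    rng = np.random.default_rng(seed)
    ensemble = []
    for rep in range(REPS):
        for fsafe in FSAFE:
            n_safe = min(int(fsafe * len(X_fail)), len(X_safe))
            idx = rng.choice(len(X_safe), size=n_safe, replace=False)
            X = np.vstack([X_fail, X_safe[idx]])
            y = np.concatenate([np.ones(len(X_fail)), np.zeros(n_safe)])
            for n_trees in TREE_COUNTS:
                rf = RandomForestClassifier(n_estimators=n_trees, random_state=rep)
                ensemble.append(rf.fit(X, y))
    return ensemble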

COMBINING STRATEGY Precision-weighted voting: weight each classifier by its precision. The test data is divided into two halves: an individual test set, used to evaluate the precision of the individual classifiers and calculate their weights, and an ensemble test set, used for the final evaluation of the ensemble. Precision is the fraction of points labeled FAIL that are actual failures. 29

COMBINING STRATEGY The classification of the ensemble is computed as the sum of all individual answers, each multiplied by the corresponding classifier's precision. Each individual answer is a discrete value, but their combination is continuous: every data point receives a score, and a higher score indicates a higher probability of failure. 30
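A minimal sketch of precision-weighted voting, assuming the ensemble from the previous sketch and a held-out individual-test split (interface names are hypothetical):

import numpy as np
from sklearn.metrics import precision_score

def ensemble_score(ensemble, X_ind, y_ind, X_eval):
    # Weight each forest by its precision on the individual-test split,
    # then sum the weighted 0/1 votes into a continuous failure score.
    weights = np.array([precision_score(y_ind, clf.predict(X_ind), zero_division=0)
                        for clf in ensemble])
    votes = np.array([clf.predict(X_eval) for clf in ensemble])   # (n_clf, n_points)
    return weights @ votes / weights.sum()   # normalization only rescales the threshold s*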

CLASSIFICATION RESULTS 31

CROSS VALIDATION APPROACH The authors separated their data into train and test sets: train over 10 days and test on the 12th. The first two days are omitted in order to reduce the effect of incomplete aggregated features. This mimics how the model would be used in real-life scenarios. Training took up to 9 hours on a relatively low-spec computer. 32

CROSS VALIDATION APPROACH PR: precision vs. recall. Precision: positive predictive value. Recall: sensitivity. Example: suppose a program recognizes dogs in scenes from a video that contains dogs and cats. There are a total of 9 dogs and some cats. The program identifies 7 of the animals as dogs, but only 4 of those 7 actually are dogs. Precision: 4/7. Recall: 4/9. https://en.wikipedia.org/wiki/Precision_and_recall 33
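The same example as plain arithmetic, using the standard definitions precision = TP/(TP+FP) and recall = TP/(TP+FN):

tp, fp, fn = 4, 3, 5          # 4 correct dog detections, 3 cats flagged, 5 dogs missed
precision = tp / (tp + fp)    # 4/7 ~ 0.57: fraction of flagged animals that are dogs
recall = tp / (tp + fn)       # 4/9 ~ 0.44: fraction of all dogs that were flagged
print(precision, recall)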

CLASSIFICATION RESULTS Evaluation was based on receiver operating characteristic (ROC) and precision-recall (PR) curves. ROC curves plot the True Positive Rate (TPR) versus the False Positive Rate (FPR); PR curves display precision vs. recall (which is equal to TPR). For both, higher values are better. A threshold value s* is needed to turn the ensemble score into a class label; by decreasing s*, the number of true positives grows, but so does the number of false positives. 34
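A minimal scikit-learn sketch of this evaluation on the continuous ensemble score (average precision is used here as a stand-in for the area under the PR curve):

from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score, roc_curve)

def evaluate(y_true, scores):
    # The curves are obtained by sweeping the threshold s* over the score.
    fpr, tpr, _ = roc_curve(y_true, scores)
    precision, recall, _ = precision_recall_curve(y_true, scores)
    auroc = roc_auc_score(y_true, scores)
    aupr = average_precision_score(y_true, scores)
    return (fpr, tpr), (precision, recall), auroc, aupr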

CLASSIFICATION RESULTS For all days, AUROC values are greater than 0.75 and reach up to 0.97; AUPR ranges between 0.38 and 0.87. The lower performance at the beginning can be due to the fact that some of the aggregated features were incomplete (for the tests over days 3 and 4). 35

MORE DETAILED LOOK The performance of the individual classifiers is displayed: very low FPR -> good! But in many cases also very low TPR -> bad! 36

MORE DETAILED LOOK TPR increases when the fsafe parameter decreases, but FPR also increases and precision decreases. The results obtained with different fsafe values are diverse, which makes them suitable for an ensemble approach. In general, the points corresponding to the individual classifiers lie below the ROC and PR curves describing the performance of the ensemble (except for TPR less than 0.2), which shows that the ensemble is better than the individual classifiers. 37

CLASSIFICATION RESULTS Conclusion: the recall (TPR) ranges between 27.2% and 88.6% (lowest and highest), meaning 27.2%-88.6% of the failures were identified successfully. Precision: of all instances labeled as failures, between 50.2% and 72.8% are actual failures. 38

TIME_TO_NEXT_FAILURE ANALYSIS 39

TIME_TO_NEXT_FAILURE ANALYSIS The authors analyzed the relation between the classifier label and the exact time until the next REMOVE event. A point more than 24 hours away from failure was considered SAFE, so a machine is labeled SAFE even if it fails in 2 days; likewise, the labels make no difference between failing in 10 minutes and failing in 24 hours. A SAFE point classified as FAIL counts less as a misclassification the shorter the time to the next failure, while a FAIL point classified as SAFE has a higher negative impact the closer it is to the point of failure. This will become clear in the next slides. 40

TIME_TO_NEXT_FAILURE ANALYSIS Because the authors used 24 hours as the threshold value, when the classifier gives a wrong output there may still be time to catch the failure before it occurs. A good result would be if the misclassified positives (actual failures labeled as SAFE) are further in time from the point of failure than the correctly classified failures: if we only miss the further failures and not the closer ones, there is still time to catch them later. Another good result would be if the misclassified negatives (actual SAFE points labeled as FAIL) are closer to a failure point than the correctly classified negatives: if the SAFE points we misclassify are the ones that will eventually fail soon after the 24-hour horizon, the predictions make more sense. 41

TIME_TO_NEXT_FAILURE ANALYSIS The outputs are divided into TP (failures correctly identified) and FN (failures missed). The upper and lower limits are 0 and 24 hours, as expected. TP points, on average, have lower times until the next event than FN points. This is good news: for the missed failures there is still some time left until the actual failure, so the classifier may detect them later. Taking this into account, the lowest recall goes from 27.2% to 52.5% for benchmark 4, and the highest from 88.6% to 88.8% for benchmark 15. 42

TIME_TO_NEXT_FAILURE ANALYSIS The outputs are divided into FP (SAFE labeled as FAIL) and TN (SAFE labeled as SAFE). Important: these SAFE points will eventually fail at some later time. The time to the next failure is, on average, lower for FP than for TN. This is good news: the classifier tends to give false alarms when a failure is approaching, even if it is not strictly within the next 24 hours. 43
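A sketch of how the four outcome classes and their time-to-next-REMOVE statistics could be tabulated (the score, label and time_to_remove column names are assumptions):

import numpy as np
import pandas as pd

def outcome_summary(df, s_star=0.5):
    # Split predictions into TP / FN / FP / TN and summarize the hours
    # remaining until the next REMOVE event for each class.
    pred = df["score"] >= s_star
    actual = df["label"] == 1
    outcome = np.select([pred & actual, ~pred & actual, pred & ~actual],
                        ["TP", "FN", "FP"], default="TN")
    return (df.assign(outcome=outcome)
              .groupby("outcome")["time_to_remove"]
              .describe()[["count", "mean", "min", "max"]])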

ADAPTATIONS FOR REAL LIFE USAGE 44

PERFORMING ONLINE All features need to be computed online, and the computation needs to take less than 5 minutes. Data aggregation is embarrassingly parallel (all features are independent). The cost of storing the data and using BigQuery for analysis is estimated at 60 dollars per day. Training each RF is also embarrassingly parallel; the authors estimate 5 minutes to update their models. 45
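A sketch of parallel forest training with joblib, as one way the embarrassingly parallel structure could be exploited (not the authors' implementation; X_train and y_train are placeholders):

from joblib import Parallel, delayed
from sklearn.ensemble import RandomForestClassifier

def train_one(n_trees, X, y, seed):
    # Each forest is independent, so forests can be (re)trained concurrently.
    return RandomForestClassifier(n_estimators=n_trees, random_state=seed).fit(X, y)

# ensemble = Parallel(n_jobs=-1)(delayed(train_one)(n, X_train, y_train, i)
#                                for i, n in enumerate(range(2, 16)))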

CRITIQUE 46

CRITIQUE What is the cost of missed FAILs plus misclassified FAILs compared to not using the model at all? The performance of the different features is not presented. How were the two halves of the test data (individual, ensemble) sampled: together with the training data or separately? The threshold value analysis of the ensemble output is omitted. 47