Big Data Meets Advanced Analytics
Concepts and Practical Examples
Dr. Gerhard Svolba
SAS DACH Competence Center Analytics
#analyticsx
Concepts when Handling Big Data
Use advanced machine learning methods to describe the relationships in your data
Understand specifics of complex systems by using Monte Carlo simulations
Run SAS High Performance Analytics procedures in distributed mode
Two Practical Big Data Examples
Deep Learning with Stacked Denoising Autoencoders
  » Recognize handwritten digits
  » Compress information into a few variables
Monte Carlo Simulation of the Monopoly Board Game
  » Distribution of the visit frequency on the fields
  » Studying the profitability of different investment decisions
Machine Learning
ML is often used when
The predictive accuracy of a model is more important than its interpretability
Traditional approaches appear unsuitable
  » More variables than observations
  » Many highly correlated variables
  » Unstructured data
  » Fundamentally non-linear relationships
Application: pattern recognition
Data Mining, Machine Learning
SUPERVISED LEARNING: Regression, LASSO regression, Logistic regression, Ridge regression, Decision tree, Gradient boosting, Random forests, Neural networks, SVM, Naïve Bayes, Neighbors, Gaussian processes
UNSUPERVISED LEARNING: A priori rules, Clustering, k-means clustering, Mean shift clustering, Spectral clustering, Kernel density estimation, Nonnegative matrix factorization, PCA, Kernel PCA, Sparse PCA, Singular value decomposition, SOM
SEMI-SUPERVISED LEARNING: Prediction and classification*, Clustering*, EM, TSVM, Manifold regularization, Autoencoders, Multilayer perceptron, Restricted Boltzmann machines
Further categories: TRANSDUCTION, REINFORCEMENT LEARNING, DEVELOPMENTAL LEARNING
*In semi-supervised learning, supervised prediction and classification algorithms are often combined with clustering.
Handwritten Digits as Training Data
Classic MNIST training data
784 features form a 28x28 digital grid
Greyscale features range from 0 to 255
60,000 labeled training images (785 variables, including 1 nominal target)
10,000 unlabeled test images (784 input variables)
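The pixel layout described above can be illustrated with a short sketch. It is written in Python rather than SAS so that it runs on its own. The function name `to_image`, the row-major pixel order, and the synthetic input row are assumptions for illustration; no real MNIST data is used.

```python
import numpy as np

def to_image(pixels):
    """Reshape 784 greyscale values (0-255) into a 28x28 grid scaled to [0, 1].

    Assumes the 784 features are stored row by row (row-major order).
    """
    pixels = np.asarray(pixels, dtype=float)
    if pixels.shape != (784,):
        raise ValueError("expected a flat vector of 784 pixel values")
    return pixels.reshape(28, 28) / 255.0

# Synthetic stand-in for one image row (random values, not real data).
rng = np.random.default_rng(0)
row = rng.integers(0, 256, size=784)
image = to_image(row)
print(image.shape)
```

Scaling to [0, 1] is only one possible choice; the SAS program in this deck standardizes its inputs instead (`std= std`).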
Semi-Supervised Learning
Extract a few representative features to discriminate the digits 0-9
Compress information of 784 variables into 2 features
Use a stacked denoising autoencoder (deep learning)
Deep Learning Using a Stacked Denoising Autoencoder
Target layer: uncorrupted output features
h5: hidden neurons
h4: hidden neurons
h3: hidden neurons (extractable features)
h2: hidden neurons
h1: hidden neurons
Input layer: partially corrupted input features
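The data flow through such a network can be sketched as follows, in Python rather than SAS. This is a minimal forward pass only, with no training. The hidden layer sizes 300, 100, 2, 100, 300 follow the SAS program in this deck; the masking noise, the tanh activation, the missing bias terms, the random weights, and the input width of 784 are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(123)

def corrupt(x, fraction=0.3):
    """Masking noise (assumed noise type): zero out a random share of the inputs."""
    keep = rng.random(x.shape) >= fraction
    return x * keep

# Input, h1..h5, target. Hidden sizes as in the deck's SAS code.
layer_sizes = [784, 300, 100, 2, 100, 300, 784]
weights = [rng.normal(0.0, 0.01, size=(m, n))
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x):
    """Return the activations of all layers for a batch x.

    h3 (the 2-unit bottleneck) and the target layer are linear;
    the other hidden layers use tanh (an assumption).
    """
    activations = [x]
    for k, w in enumerate(weights, start=1):
        z = activations[-1] @ w
        is_linear = k == 3 or k == len(weights)
        activations.append(z if is_linear else np.tanh(z))
    return activations

clean = rng.random((5, 784))            # synthetic stand-in for 5 images
acts = forward(corrupt(clean))          # the network sees the corrupted input
features = acts[3]                      # 2 extractable features per image
reconstruction = acts[-1]
loss = np.mean((reconstruction - clean) ** 2)   # compared with the clean input
print(features.shape, reconstruction.shape)
```

The point of the design is visible in the last lines: the error is measured against the uncorrupted images, so the network cannot simply copy its input and has to learn structure that survives the noise.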
Using SAS Code to Solve the Problem
Network: noised INPUT (i) => H1 => H2 => H3 => H4 => H5 => INPUT (t)

   proc neural data=autoencodertraining dmdbcat=work.autoencodertrainingcat;
      performance compile details cpucount=12 threads=yes;

      /* Five hidden layers; the 2-unit linear layer h3 is the bottleneck */
      archi MLP hidden=5;
      hidden 300 / id=h1;
      hidden 100 / id=h2;
      hidden 2 / id=h3 act=linear;
      hidden 100 / id=h4;
      hidden 300 / id=h5;

      /* Corrupted pixels are the inputs, the original pixels are the targets */
      input corruptedpixel1-corruptedpixel400 / id=i level=int std=std;
      target pixel1-pixel400 / act=identity id=t level=int std=std;

      initial random=123;
      prelim 10 preiter=10;

      /* Stage 1: freeze all hidden-to-hidden connections, train i->h1 */
      freeze h1->h2; freeze h2->h3; freeze h3->h4; freeze h4->h5;
      train technique=congra maxtime=129600 maxiter=1000;

      /* Stages 2 to 5: train one hidden-to-hidden connection at a time */
      freeze i->h1; thaw h1->h2;
      train technique=congra maxtime=129600 maxiter=1000;
      freeze h1->h2; thaw h2->h3;
      train technique=congra maxtime=129600 maxiter=1000;
      freeze h2->h3; thaw h3->h4;
      train technique=congra maxtime=129600 maxiter=1000;
      freeze h3->h4; thaw h4->h5;
      train technique=congra maxtime=129600 maxiter=1000;

      /* Final stage: thaw all connections and train the whole network */
      thaw i->h1; thaw h1->h2; thaw h2->h3; thaw h3->h4;
      train technique=congra maxtime=129600 maxiter=1000;

      /* Write the score code */
      code file='C:\Path\to\code.sas';
   run;
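The layer-wise training order encoded by the freeze and thaw statements above can be traced with a small bookkeeping sketch, written in Python rather than SAS. It only tracks which connections are trainable at each train statement and does not train anything. The label `h5->t` for the connection into the target layer is a hypothetical name introduced here; the SAS program never freezes that connection.

```python
# Connections of the network, from input (i) to target (t).
# "h5->t" is a hypothetical label for the connection the SAS code never freezes.
connections = ["i->h1", "h1->h2", "h2->h3", "h3->h4", "h4->h5", "h5->t"]

# One entry per train statement: (connections frozen, connections thawed)
# immediately before that train statement, in the order of the SAS program.
stages = [
    (["h1->h2", "h2->h3", "h3->h4", "h4->h5"], []),
    (["i->h1"], ["h1->h2"]),
    (["h1->h2"], ["h2->h3"]),
    (["h2->h3"], ["h3->h4"]),
    (["h3->h4"], ["h4->h5"]),
    ([], ["i->h1", "h1->h2", "h2->h3", "h3->h4"]),
]

def trainable_per_stage(connections, stages):
    """Return, for each train statement, the connections that are not frozen."""
    frozen = set()
    result = []
    for to_freeze, to_thaw in stages:
        frozen |= set(to_freeze)
        frozen -= set(to_thaw)
        result.append([c for c in connections if c not in frozen])
    return result

schedule = trainable_per_stage(connections, stages)
for number, active in enumerate(schedule, start=1):
    print(f"train {number}: {', '.join(active)}")
```

Read this way, the program first trains each connection in turn from the input towards the output and then fine-tunes all connections together in the last train statement.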
Studying a certain section in detail
[Figure: detail of layer h5 showing hidden neurons h51 and h52, edge weights W51 to W54, and Target 1 to Target 4]
Edge Weights of the 5th layer are loaded with discriminative information
Visualization of the separation achieved by the two neurons of the middle hidden layer (h3)
Our method results in much better separation than simple principal components analysis
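For reference, the principal components baseline in this comparison amounts to a linear projection onto two components. A minimal sketch in Python (rather than SAS) follows; the function name `pca_project` and the random input matrix are assumptions, and the sketch does not reproduce the deck's results.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the first principal components via SVD.

    Returns the component scores and the share of variance they explain.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T
    explained = (singular_values[:n_components] ** 2).sum() / (singular_values ** 2).sum()
    return scores, explained

# Random stand-in for 200 images with 784 pixel features (not real data).
rng = np.random.default_rng(1)
X = rng.random((200, 784))
scores, explained = pca_project(X)
print(scores.shape)
```

Unlike the two features taken from the middle hidden layer, these two scores are restricted to linear combinations of the pixels.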
Summary: Semi-Supervised Learning
Extremely accurate predictions using deep neural networks
The target variable (digit 0-9) has not been used in the model!
Feature extraction as a pre-step in predictive modeling
Requires model tuning
The most common applications of deep learning involve pattern recognition in unstructured data, such as text, photos, videos and sound.
The Monopoly board game is a complex system
Set of Complex Rules
Monetary Dimension
Dynamic Component
Additional Instructions
Framework of Opportunities and Events
Random Components
Questions of Interest
What is the distribution of visits on the fields of the board game?
Which fields are most profitable?
Which fields have a high variability in profitability?
These questions can be transferred to many other simulation studies of complex systems.
Locating the Token
Influential Factors
  » Sum of 2 Dice
  » Go to Jail!
  » Event Fields
  » Accelerator Dice
Almost Even Distribution
Influential factors: Sum of 2 Dice
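A Monte Carlo estimate of the visit distribution under dice movement alone can be sketched as follows, in Python rather than SAS. The function name, the number of simulated games, the 40 rounds per game, and the 0-based field numbering are assumptions for illustration; this is not the deck's SAS implementation.

```python
import random
from collections import Counter

def simulate_visits(n_games=2000, n_rounds=40, n_fields=40, seed=1):
    """Estimate the share of visits per field when only the sum of 2 dice moves the token.

    Fields are numbered 0..n_fields-1 here and every game starts on field 0.
    """
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(n_games):
        position = 0
        for _ in range(n_rounds):
            roll = rng.randint(1, 6) + rng.randint(1, 6)
            position = (position + roll) % n_fields
            visits[position] += 1
    total = sum(visits.values())
    return {field: visits[field] / total for field in range(n_fields)}

shares = simulate_visits()
print(f"smallest share: {min(shares.values()):.4f}")
print(f"largest share:  {max(shares.values()):.4f}")
```

With 40 fields, a perfectly even distribution would give each field a share of 0.025. The simulated shares stay close to that value, and the remaining differences come mostly from the first few throws after the start.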
All Field-40 visits are relocated to 14
Influential factors: Sum of 2 Dice, Go to Jail!
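The relocation effect can be added to a dice-only simulation with one extra rule. The sketch below is in Python rather than SAS and takes the field numbers 40 and 14 from the slide; the 1-based numbering of fields 1 to 40, the start on field 1, and the counting of a relocated token as a visit to field 14 are assumptions. Boards with a different numbering can be covered by changing the `go_to_jail` and `jail` parameters.

```python
import random
from collections import Counter

def simulate_with_jail(n_games=2000, n_rounds=40, n_fields=40,
                       go_to_jail=40, jail=14, seed=1):
    """Estimate visit shares with dice movement plus the Go to Jail relocation.

    Fields are numbered 1..n_fields (assumed) and every game starts on field 1.
    """
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(n_games):
        position = 1
        for _ in range(n_rounds):
            roll = rng.randint(1, 6) + rng.randint(1, 6)
            position = (position - 1 + roll) % n_fields + 1
            if position == go_to_jail:
                position = jail      # the visit is counted on the jail field
            visits[position] += 1
    total = sum(visits.values())
    return {field: visits[field] / total for field in range(1, n_fields + 1)}

jail_shares = simulate_with_jail()
most_visited = max(jail_shares, key=jail_shares.get)
print(most_visited, round(jail_shares[most_visited], 4))
```

Field 40 ends up with no recorded visits, while field 14 collects its own visits plus the relocated ones and becomes the most visited field.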
Event Fields relocate to other fields
Influential factors: Sum of 2 Dice, Go to Jail!, Event Fields
Red Dice introduces high variability
Influential factors: Sum of 2 Dice, Go to Jail!, Event Fields, Accelerator Dice
Effect of the accelerator dice after 20 rounds
If the 3rd dice shows the Monopoly man: move forward to the next free property field, or otherwise to the next property field
Effect of the accelerator dice after 70 rounds
Example for a Relocation
Profitability Distribution after 40 rounds
Profitability Distribution after 70 rounds
Implementation in SAS
Summary
Applying advanced analytical methods to big data allows you to better understand relationships in the underlying processes.
You receive results that would otherwise remain undiscovered.
SAS offers a full set of methods to handle big data in advanced analytics applications.
Links
Patrick Hall: Overview of Machine Learning with SAS Enterprise Miner, http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf
Rick Wicklin: Simulating Data with SAS, http://support.sas.com/publishing/authors/wicklin.html
Gerhard Svolba: Applying Data Science: Business Case Studies Using SAS (SAS Press, expected 2017)
Contact Information
Gerhard Svolba, Analytic Solution Architect, SAS-Austria
Sastools.by.gerhard@gmx.net
http://www.sascommunity.org/wiki/gerhard_svolba
LinkedIn, XING, PictureBlog
Data Quality for Analytics Using SAS, SAS Press 2012, http://www.sascommunity.org/wiki/data_quality_for_analytics
Data Preparation for Analytics Using SAS, SAS Press 2006, http://www.sascommunity.org/wiki/data_preparation_for_analytics