Synchronization-based Classification on Distributed Concept-drifting Data Streams


Introduction

Classification: Classification is a machine learning task that infers a function from labeled training data.

Distributed and parallel classification: The abundance of data and the need to process ever larger amounts of it have driven the development of machine learning. Classic classification algorithms are modified into scaled-up versions that require distributed machine learning.

Streaming classification: Machine learning has also developed toward processing a continuous supply of data. Retraining from scratch on newly arrived data is costly and time-consuming, and the learner must deal with concept drift.

Data Mining Group Seminar 2014, Data Stream Classification, Oct. 17, 2014

Introduction

Recent work can be summarized in two basic models: the Central Learning Model and the Distributed Learning Model, both established to deal with distributed data streams. Fig.2: (a) the Central Learning Model suffers from expensive data storage and communication; (b) the Distributed Learning Model suffers from the presence of concept drift and the lack of modeling of the dynamic dependence among streams.

Motivation

Challenges in distributed data stream classification: How to handle concept drift in local streams? How to learn and model the dynamic dependence or association among data streams over time? How to combine all the information for prediction? How to utilize the similarity of, and learn the association among, these large-scale distributed data streams?

Modeling the Association Among Data Streams

Since different data streams are often associated in real-world data-driven applications, we establish a new learning model. Fig.2: Principle of combining data for prediction.

Overview of the Framework

Fig.3: Framework overview.

Learning Local Patterns by Dynamic Prototype-based Learning (P-Tree_i)

Fig.4: An illustration of the P-Tree structure. Maintain a small set of important prototypes for each data stream via: a) an error-driven learning approach; b) synchronization-inspired constrained clustering; c) PCA and statistics.

Error-driven Representativeness Learning

How to dynamically select the short-term and/or long-term representative examples? Basic idea: leverage the prediction performance on test examples to infer the representativeness of stored examples via lazy learning (a nearest-neighbor classifier):

Rep(y) = Rep(y) + Sign(x_pl, x_l),

where Sign(x, y) is the sign function: 1 if x equals y, -1 otherwise. High representativeness: keep. Low representativeness: delete. Unchanged representativeness: summarization (Sync algorithm).
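The update above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name, the flat prototype arrays standing in for the P-Tree, and the deletion thresholds are all assumptions.

```python
import numpy as np

def update_representativeness(prototypes, labels, rep, x, y_true):
    """Hypothetical sketch: predict with the 1-nearest prototype (lazy
    learning), then apply the error-driven update
    Rep(nn) += Sign(y_pred, y_true), i.e. +1 if correct, -1 otherwise."""
    dists = np.linalg.norm(prototypes - x, axis=1)
    nn = int(np.argmin(dists))          # nearest prototype
    y_pred = labels[nn]
    rep[nn] += 1 if y_pred == y_true else -1
    return y_pred

# Usage: two prototypes; a test example near prototype 1 with matching
# label raises that prototype's representativeness.
prototypes = np.array([[0.0, 0.0], [1.0, 1.0]])
labels = np.array([0, 1])
rep = np.zeros(2)
pred = update_representativeness(prototypes, labels, rep, np.array([0.9, 1.1]), 1)
```

Prototypes whose score stays high are kept, low scorers are deleted, and unchanged ones are candidates for summarization by the synchronization step described next.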

Data Summarization by Synchronization

Summarization: constrained clustering by synchronization, with the update rule

x_i(t + Δt) = x_i(t) + (1 / |N_ε(x_i(t))|) · Σ_{y ∈ N_ε(x_i(t)), eq(l_x, l_y)} sin(y(t) − x_i(t)),

where N_ε(·) is the ε-neighborhood and the constraint eq(l_x, l_y) enforces cannot-links between different classes. (a) Constrained clustering by synchronization; (b) prototype-based data representation.
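One iteration of this synchronization update can be sketched as follows, a rough illustration under assumed details: 1-D "phases", the class-equality constraint standing in for cannot-links, and a fixed ε.

```python
import numpy as np

def sync_step(X, labels, eps=0.5):
    """One synchronization step (sketch): each point moves toward its
    eps-neighbors that share its class label (the eq(l_x, l_y)
    constraint realizes cannot-links between different classes)."""
    X_new = X.copy()
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.where((d <= eps) & (labels == labels[i]))[0]
        if len(nbrs) == 0:
            continue
        # x_i(t+dt) = x_i(t) + (1/|N|) * sum_y sin(y - x_i)
        X_new[i] = X[i] + np.mean(np.sin(X[nbrs] - X[i]), axis=0)
    return X_new

# Usage: two same-class points within eps of each other move closer,
# eventually collapsing into one prototype after repeated steps.
X = np.array([[0.0], [0.4]])
labels = np.array([0, 0])
X1 = sync_step(X, labels, eps=0.5)
```

Iterating the step until same-class neighborhoods collapse yields the prototypes used for the compact data representation.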

PCA and Statistics

Principal Component Analysis (PCA): analyze the change in each class's data distribution via the principal components of two sets of examples. Statistical analysis: compute a suitable statistic that is sensitive to changes in the class distribution between the two sets of examples. Fig.5: PCA-based concept drift analysis. Fig.6: Statistical analysis.
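The PCA-based check can be illustrated by comparing the leading principal component of an old and a new window of examples; a rotation of that component signals drift. This is a sketch under assumed details (SVD-based PCA, cosine similarity as the comparison statistic), not the paper's exact procedure.

```python
import numpy as np

def pc_similarity(window_old, window_new):
    """Sketch of PCA-based drift analysis: absolute cosine similarity
    between the leading principal components of two example windows.
    Near 1: same orientation; near 0: the component rotated (drift)."""
    def first_pc(W):
        Wc = W - W.mean(axis=0)             # center the window
        _, _, Vt = np.linalg.svd(Wc, full_matrices=False)
        return Vt[0]                        # leading right singular vector
    return abs(float(first_pc(window_old) @ first_pc(window_new)))

# Usage: a cloud elongated along x vs. the same cloud rotated onto y.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2)) * np.array([3.0, 0.3])
rotated = base[:, ::-1]                     # swap axes: 90-degree drift
same = pc_similarity(base, base)
drift = pc_similarity(base, rotated)
```

A threshold on this similarity (or on a comparable statistic over the class distributions) then decides whether the local prototypes must be updated.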

Weight Vector

Fig.8: Data structure for the weight vector.

How to Learn Information from Other Data Streams

a) Maintain a weight vector for each of the other data streams using dynamic error-driven learning. b) Learn which data streams are actually useful for predicting the test data. Fig.7: Process of learning information from other streams.

A Decay Function for the Weight

Fig.9: A decay function for the weight. Each contribution (a correct prediction) is decreased over time by a decay function; i.e., an old correct prediction is less important than a recent correct prediction made using other data streams:

W^(i)(t) = Σ_{k=1}^{n} W_k^(i)(x_k, t) = Σ_{k=1}^{n} λ^(t − t_k) · W_k^(i)(x_k),

where λ ∈ (0, 1) is the decay factor and t_k is the time of the k-th contribution.
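A minimal sketch of such a decayed weight, assuming the exponential form above (the function name and the (time, weight) contribution list are illustrative choices, not the paper's data structure):

```python
def decayed_weight(contributions, t, lam=0.5):
    """Sketch: each contribution (a correct prediction at time t_k with
    weight w_k) counts less as it ages:
    W(t) = sum_k lam**(t - t_k) * w_k, with decay factor lam in (0, 1)."""
    return sum((lam ** (t - t_k)) * w_k for t_k, w_k in contributions)

# Usage: two equally weighted correct predictions, one old (t=0) and
# one recent (t=9); at t=10 the recent one dominates the total weight.
contribs = [(0, 1.0), (9, 1.0)]
w = decayed_weight(contribs, t=10, lam=0.5)  # 0.5**10 + 0.5**1
```

With λ close to 1 the stream's history matters for a long time; with small λ the weight tracks only very recent helpfulness.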

Ensemble Learning Model

Fig.10: Ensemble learning process using weighted majority.

Global Learning Model

Ensemble learning framework:

c* = argmax_c P_c(x), with P_c(x) = Σ_{k=1}^{r} W_k^(i)(t) · pre_c^k(x) + W_i · pre_c^i(x),

where the first sum collects the weighted predictions of the other streams and the second term is the prediction of the local stream i.
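The global rule can be sketched as a weighted vote over class scores. This is an illustrative sketch: the function and parameter names (local_probs, remote_probs, remote_weights) are assumptions, and the local weight is simplified to a constant.

```python
import numpy as np

def global_predict(local_probs, remote_probs, remote_weights, local_weight=1.0):
    """Sketch of the global ensemble: sum the local classifier's class
    scores with the decay-weighted scores of the other streams'
    classifiers, then take the argmax over classes."""
    score = local_weight * np.asarray(local_probs, dtype=float)
    for w, p in zip(remote_weights, remote_probs):
        score += w * np.asarray(p, dtype=float)
    return int(np.argmax(score)), score

# Usage: the local stream favors class 0, but two well-weighted remote
# streams favor class 1 and flip the global decision.
pred, score = global_predict([0.6, 0.4], [[0.2, 0.8], [0.3, 0.7]], [0.9, 0.8])
```

Because the remote weights come from the error-driven, decayed updates above, streams that stopped being helpful gradually lose their vote.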

The Process of Learning Dependence from Other Data Streams

Dealing with static data containing outliers: data with outliers; other streams helping to learn from the data with outliers.

The Process of Learning Dependence from Other Data Streams

Dealing with data of a dynamic nature: arriving data without enough information; adding information for the arriving data.

Experiments & Results

Synthetic Data

A hyperplane in 2-dimensional space was used to simulate different time-changing concepts by altering its orientation and position in a smooth or sudden manner.
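A generator for such a stream can be sketched as follows; it is a minimal stand-in for the setup described above, assuming rotation about the origin as the drift mechanism and ignoring position shifts and noise.

```python
import numpy as np

def hyperplane_stream(n, drift_per_step=0.0, seed=0):
    """Sketch of a rotating-hyperplane stream: 2-D points labeled by the
    side of the hyperplane w.x >= 0, where w rotates by drift_per_step
    radians per example (0 = stationary concept; a single large jump in
    theta would simulate sudden drift)."""
    rng = np.random.default_rng(seed)
    theta = 0.0
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0, size=2)
        w = np.array([np.cos(theta), np.sin(theta)])
        yield x, int(w @ x >= 0.0)
        theta += drift_per_step  # smooth rotation of the concept

# Usage: a slowly drifting stream of 1000 labeled examples.
stream = list(hyperplane_stream(1000, drift_per_step=0.001))
```

Running several such generators with different seeds and drift rates yields the correlated, concept-drifting local streams used in the distributed experiments.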

Dealing with data with a parting plane; dealing with data without a parting plane.

Supplementing Data to Help Prediction

Dealing with data with a parting plane; dealing with data without a parting plane.

Prediction Performance

Data with parting plane (central accuracy 96.62%):
- Helpful for prediction: 628
- Harmful for prediction: 633
- Right prediction but not helpful: 51,362
- Helpful but not used: 0
- Harmful but not used: 242

Data without parting plane (central accuracy 98.31%):
- Helpful for prediction: 229
- Harmful for prediction: 22
- Right prediction but not helpful: 49,485
- Helpful but not used: 0
- Harmful but not used: 4

Data Sets

1. Electricity: contains 45,312 instances, collected from the Australian New South Wales Electricity Market every five minutes.
2. Forest Covertype: contains 581,012 instances; describes seven forest cover types on a 30x30 meter grid with 54 different geographic measurements.
3. Sensor: contains 2,219,803 instances; the stream contains information (temperature, humidity, light, and sensor voltage) collected from 54 sensors.

4. Power Supply: contains 29,928 instances, 2 attributes, and 24 classes; records the power supply of an Italian electricity company from two sources: the main grid and power transformed from other grids.
5. Kddcup99: contains 494,021 instances, 41 attributes, and 23 classes; collected for the KDD Cup 1999 challenge, where the task is to build predictive models capable of distinguishing between intrusions and normal connections.

Performance of the data stream classification algorithm on real-world data sets (local accuracy per stream, and global accuracy):

Stream No. | Electricity | Forest Covertype | Sensor | Power Supply
0          | 69.31%      | 88.57%           | 75.63% | 86.58%
1          | 68.65%      | 88.68%           | 74.63% | 87.38%
2          | 65.99%      | 88.51%           | 75.67% | 87.33%
3          | 69.07%      | 88.58%           | 75.63% | 86.63%
4          | 69.78%      | 88.64%           | 57.9%  | 80.65%
Global     | 71.13%      | 89.50%           | 73.91% | 85.61%

CovtypeNorm (central accuracy 89.5%): helpful for prediction 10,442; harmful for prediction 5,204; right prediction but not helpful 461,045; helpful but not used 4; harmful but not used 2,397.

ElecNormNew (central accuracy 71.13%): helpful for prediction 2,179; harmful for prediction 1,529; right prediction but not helpful 25,041; helpful but not used 0; harmful but not used 691.

Power Supply (central accuracy 85.61%): helpful for prediction 47; harmful for prediction 51; right prediction but not helpful 706; helpful but not used 0; harmful but not used 12.

Sensitivity w.r.t. the Number of Data Streams (e.g., Covertype)

We vary the number of streams. Fig.9: Accuracy w.r.t. the number of data streams.

Sensitivity w.r.t. the Number of Neighbors (e.g., Covertype)

We vary the number of neighbors k, testing k = 1, 3, 5, 10. Fig.9: Accuracy w.r.t. the number of neighbors.

Sensitivity w.r.t. Different Values of the Factor

We test our algorithm with different values of the factor. The results show that it is stable for the values 0.01, 0.1, 0.5, and 1: taking the Covertype data as an example, the central accuracy stays around 90.5% in all cases.

Conclusion

Our method successfully deals with concept drift. This study provides a distributed classification model that can learn the relevance of different streams. The final prediction combines the information from the local classifier with that from other streams.

Thanks for your attention! Q & A