Unsupervised and Semi-Supervised Learning. Series Editor M. Emre Celebi, Computer Science Department, Conway, Arkansas, USA



Springer's Unsupervised and Semi-Supervised Learning book series covers the latest theoretical and practical developments in unsupervised and semi-supervised learning. Titles, including monographs, contributed works, professional books, and textbooks, tackle various issues surrounding the proliferation of massive amounts of unlabeled data in many application domains and how unsupervised learning algorithms can automatically discover interesting and useful patterns in such data. The books discuss how these algorithms have found numerous applications, including pattern recognition, market basket analysis, web mining, social network analysis, information retrieval, recommender systems, market research, intrusion detection, and fraud detection. Books also discuss semi-supervised algorithms, which can make use of both labeled and unlabeled data and can be useful in application domains where unlabeled data is abundant, yet it is possible to obtain a small amount of labeled data.

Topics of interest include:
Unsupervised/Semi-Supervised Discretization
Unsupervised/Semi-Supervised Feature Extraction
Unsupervised/Semi-Supervised Feature Selection
Association Rule Learning
Semi-Supervised Classification
Semi-Supervised Regression
Unsupervised/Semi-Supervised Clustering
Unsupervised/Semi-Supervised Anomaly/Novelty/Outlier Detection
Evaluation of Unsupervised/Semi-Supervised Learning Algorithms
Applications of Unsupervised/Semi-Supervised Learning

While the series focuses on unsupervised and semi-supervised learning, outstanding contributions in the field of supervised learning will also be considered. The intended audience includes students, researchers, and practitioners.

More information about this series at http://www.springer.com/series/15892

Olfa Nasraoui, Chiheb-Eddine Ben N'Cir (Editors)
Clustering Methods for Big Data Analytics: Techniques, Toolboxes and Applications

Editors
Olfa Nasraoui, Department of Computer Engineering and Computer Science, University of Louisville, Louisville, KY, USA
Chiheb-Eddine Ben N'Cir, University of Jeddah, Jeddah, KSA

ISSN 2522-848X
ISSN 2522-8498 (electronic)
Unsupervised and Semi-Supervised Learning
ISBN 978-3-319-97863-5
ISBN 978-3-319-97864-2 (ebook)
https://doi.org/10.1007/978-3-319-97864-2
Library of Congress Control Number: 2018957659

© Springer Nature Switzerland AG 2019

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

Data has become the lifeblood of today's knowledge-driven economy and society. Big data clustering aims to summarize, segment, and group large volumes and varieties of data, generated at an accelerating velocity, into groups of similar contents. This has become one of the most important techniques in exploratory data analysis. Unfortunately, conventional clustering techniques are increasingly unable to process such data because of its high complexity, heterogeneity, large volume, and rapid generation. This raises exciting challenges for researchers to design new scalable and efficient clustering methods and tools that can extract valuable information from this tremendous amount of data. Progress on this topic is fast and exciting.

This volume aims to help the reader capture new advances in big data clustering. It provides a systematic, in-depth understanding of the scope and rapidly builds an overview of new big data clustering challenges, methods, tools, and applications.

The volume opens with a chapter entitled "Overview of Scalable Partitional Methods for Big Data Clustering." In this chapter, Ben HajKacem et al. present an overview of existing clustering methods with a special emphasis on scalable partitional methods. The authors design a new categorizing model based on the main properties that big data partitional clustering methods rely on to ensure scalability when analyzing a large amount of data. Furthermore, a comparative experimental study of most of the existing methods is given over simulated and real large datasets. The authors finally elaborate a guide for researchers and end users who want to choose the best method or framework when clustering large-scale data.

In the second chapter, "Overview of Efficient Clustering Methods for High-Dimensional Big Data Streams," Hassani focuses on analyzing continuous, possibly infinite streams of data arriving at high velocity, such as web traffic data, surveillance data, sensor measurements, and stock trading. The author reviews recent subspace clustering methods for high-dimensional big data streams while discussing approaches that efficiently combine the anytime clustering concept with the stream subspace clustering paradigm. Additionally, a novel open-source assessment framework and evaluation measures are presented for subspace stream clustering.

In the chapter entitled "Clustering Blockchain Data," Chawathe gives recent challenges and advances related to clustering blockchain data such as those generated by popular cryptocurrencies like Bitcoin and Ethereum. Analysis of these datasets has diverse applications, such as detecting fraud and illegal transactions, characterizing major services, identifying financial hotspots, and characterizing usage and performance characteristics of large peer-to-peer consensus-based systems. The author motivates the study of clustering methods for blockchain data and introduces the key blockchain concepts from a data-centric perspective. He presents different models and methods used for clustering blockchain data and describes the challenges and solutions to the problem of evaluating such methods.

Deep learning is another interesting challenge, which is discussed in the chapter titled "An Introduction to Deep Clustering" by Gopi et al. The chapter presents a simplified taxonomy of deep clustering methods based mainly on the overall procedural structure or design, which helps beginning readers quickly grasp how almost all approaches are designed. This also allows more advanced readers to learn how to design increasingly sophisticated deep clustering pipelines that fit their own machine learning problem-solving aims. Like deep learning, deep clustering promises to leave an impact on diverse application domains ranging from computer vision and speech recognition to recommender systems and natural language processing.

A new efficient Spark-based implementation of PSO (particle swarm optimization) clustering is described in a chapter entitled "Spark-Based Design of Clustering Using Particle Swarm Optimization." Moslah et al. take advantage of the in-memory operations of Spark to build groupings from large-scale data and accelerate the convergence of the method when approaching the global optimum region. Experiments conducted on real and simulated large datasets show that their proposed method is scalable and improves the efficiency of existing PSO methods.

The last two chapters describe new applications of big data clustering techniques. In "Data Stream Clustering for Real-Time Anomaly Detection: An Application to Insider Threats," Haidar and Gaber investigate a new streaming anomaly detection approach, namely, Ensemble of Random subspace Anomaly detectors In Data Streams (E-RAIDS), for insider threat detection. The investigated approach addresses the high velocity of incoming data from different sources and the high number of false alarms, or false positives (FPs). Furthermore, in "Effective Tensor-Based Data Clustering Through Sub-Tensor Impact Graphs," which completes the volume, Candan et al. investigate tensor-based methods for clustering multimodal data such as web graphs, sensor streams, and social networks. The authors deal with the computational complexity problem of tensor decomposition by partitioning the tensor and then obtaining the tensor decomposition from the resulting smaller partitions. They introduce the notion of sub-tensor impact graphs (SIGs), which quantify how the decompositions of these sub-partitions impact each other and the overall tensor decomposition accuracy, and present several complementary algorithms that leverage this novel concept to address various key challenges in tensor decomposition.

We hope that the volume will give an overview of the significant progress and the new challenges arising from big data clustering in recent years. We also hope that its contents will help researchers, practitioners, and students in their study and research.

Olfa Nasraoui, Louisville, KY, USA
Chiheb-Eddine Ben N'Cir, Manouba, Tunisia

Contents

1 Overview of Scalable Partitional Methods for Big Data Clustering
Mohamed Aymen Ben HajKacem, Chiheb-Eddine Ben N'Cir, and Nadia Essoussi

2 Overview of Efficient Clustering Methods for High-Dimensional Big Data Streams
Marwan Hassani

3 Clustering Blockchain Data
Sudarshan S. Chawathe

4 An Introduction to Deep Clustering
Gopi Chand Nutakki, Behnoush Abdollahi, Wenlong Sun, and Olfa Nasraoui

5 Spark-Based Design of Clustering Using Particle Swarm Optimization
Mariem Moslah, Mohamed Aymen Ben HajKacem, and Nadia Essoussi

6 Data Stream Clustering for Real-Time Anomaly Detection: An Application to Insider Threats
Diana Haidar and Mohamed Medhat Gaber

7 Effective Tensor-Based Data Clustering Through Sub-Tensor Impact Graphs
K. Selçuk Candan, Shengyu Huang, Xinsheng Li, and Maria Luisa Sapino

Index