Unsupervised and Semi-Supervised Learning. Series Editor M. Emre Celebi, Computer Science Department, Conway, Arkansas, USA


Springer's Unsupervised and Semi-Supervised Learning book series covers the latest theoretical and practical developments in unsupervised and semi-supervised learning. Titles, including monographs, contributed works, professional books, and textbooks, tackle various issues surrounding the proliferation of massive amounts of unlabeled data in many application domains and how unsupervised learning algorithms can automatically discover interesting and useful patterns in such data. The books discuss how these algorithms have found numerous applications, including pattern recognition, market basket analysis, web mining, social network analysis, information retrieval, recommender systems, market research, intrusion detection, and fraud detection. Books also discuss semi-supervised algorithms, which can make use of both labeled and unlabeled data and can be useful in application domains where unlabeled data is abundant, yet it is possible to obtain a small amount of labeled data.

Topics of interest include:
Unsupervised/Semi-Supervised Discretization
Unsupervised/Semi-Supervised Feature Extraction
Unsupervised/Semi-Supervised Feature Selection
Association Rule Learning
Semi-Supervised Classification
Semi-Supervised Regression
Unsupervised/Semi-Supervised Clustering
Unsupervised/Semi-Supervised Anomaly/Novelty/Outlier Detection
Evaluation of Unsupervised/Semi-Supervised Learning Algorithms
Applications of Unsupervised/Semi-Supervised Learning

While the series focuses on unsupervised and semi-supervised learning, outstanding contributions in the field of supervised learning will also be considered. The intended audience includes students, researchers, and practitioners.

More information about this series at http://www.springer.com/series/15892

Olfa Nasraoui, Chiheb-Eddine Ben N'Cir
Editors

Clustering Methods for Big Data Analytics
Techniques, Toolboxes and Applications

Editors
Olfa Nasraoui, Department of Computer Engineering and Computer Science, University of Louisville, Louisville, KY, USA
Chiheb-Eddine Ben N'Cir, University of Jeddah, Jeddah, KSA

ISSN 2522-848X
ISSN 2522-8498 (electronic)
Unsupervised and Semi-Supervised Learning
ISBN 978-3-319-97863-5
ISBN 978-3-319-97864-2 (ebook)
https://doi.org/10.1007/978-3-319-97864-2
Library of Congress Control Number: 2018957659

© Springer Nature Switzerland AG 2019

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

Data has become the lifeblood of today's knowledge-driven economy and society. Big data clustering aims to summarize, segment, and group large volumes and varieties of data, generated at an accelerating velocity, into groups of similar content. It has become one of the most important techniques in exploratory data analysis. Unfortunately, conventional clustering techniques are increasingly unable to process such data because of its high complexity, heterogeneity, large volume, and rapid generation. This raises exciting challenges for researchers to design new scalable and efficient clustering methods and tools that can extract valuable information from this tremendous amount of data. Progress on this topic is fast and exciting.

This volume aims to help the reader capture new advances in big data clustering. It provides a systematic, in-depth understanding of the scope and rapidly builds an overview of new big data clustering challenges, methods, tools, and applications.

The volume opens with a chapter entitled "Overview of Scalable Partitional Methods for Big Data Clustering." In this chapter, Ben HajKacem et al. present an overview of existing clustering methods, with a special emphasis on scalable partitional methods. The authors design a new categorizing model based on the main properties that big data partitional clustering methods rely on to ensure scalability when analyzing large amounts of data. Furthermore, a comparative experimental study of most of the existing methods is given over simulated and real large datasets. Finally, the authors provide a guide for researchers and end users who need to decide on the best method or framework for clustering large-scale data.
In the second chapter, "Overview of Efficient Clustering Methods for High-Dimensional Big Data Streams," Hassani focuses on analyzing continuous, possibly infinite streams of data arriving at high velocity, such as web traffic data, surveillance data, sensor measurements, and stock trading. The author reviews recent subspace clustering methods for high-dimensional big data streams and discusses approaches that efficiently combine the anytime clustering concept with the stream subspace clustering paradigm. Additionally, a novel open-source assessment framework and evaluation measures are presented for subspace stream clustering.

In the chapter entitled "Clustering Blockchain Data," Chawathe presents recent challenges and advances related to clustering blockchain data, such as those generated by popular cryptocurrencies like Bitcoin and Ethereum. Analysis of these datasets has diverse applications, such as detecting fraud and illegal transactions, characterizing major services, identifying financial hotspots, and characterizing the usage and performance of large peer-to-peer consensus-based systems. The author motivates the study of clustering methods for blockchain data and introduces the key blockchain concepts from a data-centric perspective. He presents different models and methods used for clustering blockchain data and describes the challenges and solutions to the problem of evaluating such methods.

Deep learning is another interesting challenge, which is discussed in the chapter titled "An Introduction to Deep Clustering" by Nutakki et al. The chapter presents a simplified taxonomy of deep clustering methods, based mainly on their overall procedural structure or design, which helps beginning readers quickly grasp how almost all approaches are designed. This also allows more advanced readers to learn how to design increasingly sophisticated deep clustering pipelines that fit their own machine learning problem-solving aims. Like deep learning, deep clustering promises to leave an impact on diverse application domains ranging from computer vision and speech recognition to recommender systems and natural language processing.

A new efficient Spark-based implementation of PSO (particle swarm optimization) clustering is described in a chapter entitled "Spark-Based Design of Clustering Using Particle Swarm Optimization." Moslah et al.
take advantage of Spark's in-memory operations to build groupings from large-scale data and to accelerate the convergence of the method when approaching the global optimum region. Experiments conducted on real and simulated large datasets show that their proposed method is scalable and improves the efficiency of existing PSO methods.

The last two chapters describe new applications of big data clustering techniques. In "Data Stream Clustering for Real-Time Anomaly Detection: An Application to Insider Threats," Haidar and Gaber investigate a new streaming anomaly detection approach, namely Ensemble of Random subspace Anomaly detectors In Data Streams (E-RAIDS), for insider threat detection. The investigated approach addresses the high velocity of data arriving from different sources and the high number of false alarms, or false positives (FPs). Finally, in "Effective Tensor-Based Data Clustering Through Sub-Tensor Impact Graphs," which completes the volume, Candan et al. investigate tensor-based methods for clustering multimodal data such as web graphs, sensor streams, and social networks. The authors deal with the computational complexity of tensor decomposition by partitioning the tensor and then obtaining the tensor decomposition from the resulting smaller partitions. They introduce the notion of sub-tensor impact graphs (SIGs), which quantify how the decompositions of these sub-partitions impact each other and the overall tensor decomposition accuracy, and they present several complementary algorithms that leverage this novel concept to address various key challenges in tensor decomposition.

We hope that the volume will give an overview of the significant progress in big data clustering and the new challenges that have arisen in recent years. We also hope that its contents will help researchers, practitioners, and students in their study and research.

Olfa Nasraoui, Louisville, KY, USA
Chiheb-Eddine Ben N'Cir, Manouba, Tunisia

Contents

1 Overview of Scalable Partitional Methods for Big Data Clustering
Mohamed Aymen Ben HajKacem, Chiheb-Eddine Ben N'Cir, and Nadia Essoussi

2 Overview of Efficient Clustering Methods for High-Dimensional Big Data Streams
Marwan Hassani

3 Clustering Blockchain Data
Sudarshan S. Chawathe

4 An Introduction to Deep Clustering
Gopi Chand Nutakki, Behnoush Abdollahi, Wenlong Sun, and Olfa Nasraoui

5 Spark-Based Design of Clustering Using Particle Swarm Optimization
Mariem Moslah, Mohamed Aymen Ben HajKacem, and Nadia Essoussi

6 Data Stream Clustering for Real-Time Anomaly Detection: An Application to Insider Threats
Diana Haidar and Mohamed Medhat Gaber

7 Effective Tensor-Based Data Clustering Through Sub-Tensor Impact Graphs
K. Selçuk Candan, Shengyu Huang, Xinsheng Li, and Maria Luisa Sapino

Index