On Using Class-Labels in Evaluation of Clusterings

Ines Färber, Stephan Günnemann, Hans-Peter Kriegel, Peer Kröger, Emmanuel Müller, Erich Schubert, Thomas Seidl, Arthur Zimek
RWTH Aachen University, Germany
LMU Munich University, Germany
MultiClust at KDD 2010, July 25, 2010

The Dilemma of Evaluation

  What would be the optimal clustering solution?
  [Figure: View 1, View 2]

Introduction

  Evaluation of clustering solutions:
    evaluation based on internal measures
      + no additional information needed; data independent
      - approaches optimizing the evaluation criteria will always be preferred
    evaluation based on an expert's opinion
      + may reveal new insight into the data
      - very expensive; results are not comparable
    evaluation based on external measures
      + objective evaluation
      - needs a valid ground truth
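To make "evaluation based on external measures" concrete, here is a minimal sketch of one common external measure, purity, computed against class labels. The slides do not name a specific measure; the function name `purity` and the toy data are assumptions for illustration only.

```python
from collections import Counter


def purity(clusters, labels):
    """Fraction of objects whose class equals the majority class of their cluster."""
    if len(clusters) != len(labels):
        raise ValueError("clusters and labels must have the same length")
    if not labels:
        raise ValueError("at least one object is required")

    # Collect the class labels that fall into each cluster.
    members = {}
    for cluster_id, label in zip(clusters, labels):
        members.setdefault(cluster_id, []).append(label)

    # For every cluster, count only the objects of its most frequent class.
    majority_total = sum(
        Counter(cluster_labels).most_common(1)[0][1]
        for cluster_labels in members.values()
    )
    return majority_total / len(labels)


# Hypothetical toy data: six objects, two classes, two clusters.
labels = ["a", "a", "a", "b", "b", "b"]
clusters = [0, 0, 1, 1, 1, 1]
score = purity(clusters, labels)
print(f"purity = {score:.3f}")
```

The score is only as meaningful as the labels it is compared against, which is the "needs a valid ground truth" drawback listed above.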

History of Cluster Evaluation

  clustering broke off from classification
    assumption: classes stand out by inherent similarity
    traditional clustering mainly follows the partitioning approach
  external evaluation of traditional clustering:
    the original assumption motivated the comparison against class labels
    example: UCI iris dataset
  class structure does not necessarily correspond to a clustering structure
    classes may split up into several subgroups
    there might be smooth transitions between two classes

Multi-View Context

  assumption: data groups differently when seen from different perspectives
    each object might be grouped in multiple clusters
    with each perspective a set of attributes can be associated
  [Figure: View 1, View 2]
  clustering goes beyond the structure of class labels
    data items potentially belong to many clusters in differing views
    class labels do not meet the assumptions of this scenario

Classes vs. Clusters

  commonly observed differences between clusterings and class labelings:
    splitting of classes into multiple clusters
    merging of classes into a single cluster
    missing class outliers
    multiple (overlapping) hidden structures
  [Figure: given class label C = shape; alternative labels H = color; groups marked Y, W, X, Z]
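The effect of class splitting on an external score can be illustrated with a small pair-counting sketch. It assumes a pairwise F1 measure, which the slides do not prescribe, and uses made-up data in which one class is split into two pure clusters.

```python
from itertools import combinations


def pairwise_scores(clusters, labels):
    """Pair-counting precision, recall and F1 of a clustering against class labels."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = labels[i] == labels[j]
        if same_cluster and same_class:
            tp += 1  # pair grouped together and sharing a class
        elif same_cluster:
            fp += 1  # pair grouped together but from different classes
        elif same_class:
            fn += 1  # pair sharing a class but separated by the clustering
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical data: the class "circle" is split into two pure clusters (0 and 1),
# while the class "square" is recovered as a single cluster (2).
labels = ["circle"] * 4 + ["square"] * 4
clusters = [0, 0, 1, 1, 2, 2, 2, 2]
precision, recall, f1 = pairwise_scores(clusters, labels)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Every cluster here is pure, yet the score drops because the split of one class counts as missed pairs. If the two subgroups are a real structure in the data, the measure penalizes a correct finding.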

Case Study: Pendigits Dataset

  different ways of digit notation
  different types of digits 9 and 3
  almost 30 different groups of digits, in contrast to the 10 given classes

Case Study: ALOI Dataset

  object groups that stand out due to similarity based on:
    color
    shape
    rotation
    object types
  feature space influences the clustering result

Can Anything Be Learned? Findin QUESTIONABLE EDSC still widely used!

Challenges

  1. ground truth should provide multiple labellings
  2. measures should be able to deal with multiple labels

  e.g. label layers
    challenges:
      clustering covers only part of the layer (incompleteness?)
      clusters in one layer vs. multiple layers (purity vs. variety)
      the clustering intersects layers
      the clustering contains newly detected clusters
  e.g. label hierarchies
    challenges:
      might be hard to derive
      clustering covers one branch (redundancy?)
      clustering covers one layer (impurity?)
      clustering covers nodes only partially (incompleteness?)
      union of nodes
      newly detected clusters
  e.g. label ontologies
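One very simple way to let a measure "deal with multiple labels" is to score the clustering against each label layer separately and report the best-matching layer. The sketch below assumes this per-layer strategy and a pair-counting F1; neither is taken from the slides, and the layer names `shape` and `color` and all data are hypothetical.

```python
from itertools import combinations


def co_grouped_pairs(assignment):
    """Set of index pairs (i, j), i < j, that share the same group in `assignment`."""
    return {
        (i, j)
        for i, j in combinations(range(len(assignment)), 2)
        if assignment[i] == assignment[j]
    }


def pair_f1(clusters, labels):
    """Pair-counting F1 between a clustering and a single label layer."""
    found = co_grouped_pairs(clusters)
    truth = co_grouped_pairs(labels)
    tp = len(found & truth)
    if tp == 0:
        return 0.0
    precision = tp / len(found)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)


# Hypothetical ground truth with two label layers for the same six objects.
label_layers = {
    "shape": ["circle", "circle", "square", "square", "circle", "square"],
    "color": ["red", "blue", "red", "blue", "red", "blue"],
}

# Hypothetical clustering result that happens to group the objects by color.
clusters = [0, 1, 0, 1, 0, 1]

scores = {name: pair_f1(clusters, layer) for name, layer in label_layers.items()}
best_layer = max(scores, key=scores.get)
print(scores, "best match:", best_layer)
```

Evaluated against the `shape` layer alone, this clustering would look poor, although it recovers the `color` layer exactly. Taking the best layer does not address the harder cases listed above, such as a clustering that intersects layers or contains newly detected clusters.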

Conclusion

  [Diagram: classification data with one class label C per object; database with hidden clusters H1, H2, H3, H4, ... per object; clustering, evaluation, enhanced evaluation result]
  proceed in the development of new clustering algorithms
  ensure objective clustering evaluation
    labeling of data
    measures for multiple labels