Clustering Analysis Basics

Ke Chen
Reading: [Ch. 7, EA], [5., KPM]
COMP4 Machine Learning

Outline
- Introduction
- Data Types and Representations
- Distance Measures
- Major Clustering Methodologies
- Summary

Introduction
- Cluster: a collection (group) of data objects (points) that are
  - similar (or related) to one another within the same group, and
  - dissimilar (or unrelated) to the objects in other groups.
- Cluster analysis: finding similarities between data according to the characteristics underlying the data, and grouping similar data objects into clusters.
- Clustering analysis is unsupervised learning: there are no predefined classes for a training data set.
- Two general tasks: identifying the natural number of clusters, and properly grouping objects into sensible clusters.
- Typical applications:
  - as a stand-alone tool to gain insight into the data distribution;
  - as a preprocessing step for other algorithms in intelligent systems.

Introduction
Illustrative example: how many clusters?

Introduction
Illustrative example: are they in the same cluster?
- Clustering criterion: how animals bear their progeny (two clusters)
  - Blue shark, sheep, cat, dog
  - Lizard, sparrow, viper, seagull, gold fish, frog, red mullet
- Clustering criterion: existence of lungs (two clusters)
  - Gold fish, red mullet, blue shark
  - Sheep, sparrow, dog, cat, seagull, lizard, frog, viper

Introduction
Real applications:
- Google News
- Genetics analysis
- Emerging applications

Introduction
A technique demanded by many real-world tasks:
- Bank/Internet security: fraud/spam pattern discovery
- Biology: taxonomy of living things such as kingdom, phylum, class, order, family, genus and species
- City planning: identifying groups of houses according to their house type, value, and geographical location
- Climate change: understanding the Earth's climate, finding patterns in the atmosphere and ocean
- Finance: stock clustering analysis to uncover correlations underlying shares
- Image compression/segmentation: grouping coherent pixels
- Information retrieval/organisation: Google search, topic-based news
- Land use: identification of areas of similar land use in an earth observation database
- Marketing: helping marketers discover distinct groups in their customer bases, and then using this knowledge to develop targeted marketing programs
- Social network mining: automatic discovery of special interest groups

Quiz

Data Types and Representations
Discrete vs. continuous features
- Discrete feature
  - Has only a finite set of values, e.g., zip codes, rank, or the set of words in a collection of documents
  - Sometimes represented as an integer variable
- Continuous feature
  - Has real numbers as feature values, e.g., temperature, height, or weight
  - Practically, real values can only be measured and represented using a finite number of digits
  - Continuous features are typically represented as floating-point variables

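Data Types and Representations
Data representations
- Data matrix (object-by-feature structure)
  - n data points (objects) with p dimensions (features)
  - Two modes: rows and columns represent different entities

\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]

- Distance/dissimilarity matrix (object-by-object structure)
  - n data points, but registers only the distance
  - A symmetric/triangular matrix
  - Single mode: rows and columns stand for the same entity (distance)

\[
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]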

Data Types and Representations
Example

Data matrix:

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Distance matrix (i.e., dissimilarity matrix) for Euclidean distance:

        p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0

Distance Measures
Minkowski distance (http://en.wikipedia.org/wiki/minkowski_distance)

For x = (x_1, x_2, ..., x_n) and y = (y_1, y_2, ..., y_n):

\[
d(x, y) = \left( |x_1 - y_1|^p + |x_2 - y_2|^p + \cdots + |x_n - y_n|^p \right)^{1/p}, \quad p > 0
\]

- p = 1: Manhattan (city block) distance
  \[ d(x, y) = |x_1 - y_1| + |x_2 - y_2| + \cdots + |x_n - y_n| \]
- p = 2: Euclidean distance
  \[ d(x, y) = \sqrt{|x_1 - y_1|^2 + |x_2 - y_2|^2 + \cdots + |x_n - y_n|^2} \]
- Do not confuse p with n: all these distances are defined over all n features (dimensions).
- A generic measure: use an appropriate p in different applications.
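The Minkowski definition above translates almost directly into code. The following is a minimal sketch in plain Python; the function name `minkowski` and the two sample vectors are illustrative assumptions, not part of the lecture.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p (p > 0) between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of features")
    if p <= 0:
        raise ValueError("p must be positive")
    # Sum |x_i - y_i|^p over all n features, then take the p-th root.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Hypothetical sample vectors with n = 2 features.
x = [0.0, 2.0]
y = [5.0, 1.0]

manhattan = minkowski(x, y, 1)  # p = 1: |0 - 5| + |2 - 1|
euclidean = minkowski(x, y, 2)  # p = 2: sqrt(5^2 + 1^2)
```

Note that `p` selects the kind of distance, while the number of features `n` is simply the length of the vectors.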

Distance Measures
Example: Manhattan and Euclidean distances

Data matrix:

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Distance matrix for Manhattan distance (L1):

L1      p1   p2   p3   p4
p1      0    4    4    6
p2      4    0    2    4
p3      4    2    0    2
p4      6    4    2    0

Distance matrix for Euclidean distance (L2):

L2      p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0
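The two distance matrices in this example can be recomputed from the data matrix. The sketch below assumes the four points are p1 = (0, 2), p2 = (2, 0), p3 = (3, 1) and p4 = (5, 1), which is consistent with the distance values shown in the example; the helper names are illustrative.

```python
import math

# Assumed coordinates of the four example points (x, y).
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}


def manhattan(a, b):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(u - v) for u, v in zip(a, b))


def euclidean(a, b):
    """L2 distance: square root of the sum of squared differences."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def distance_matrix(data, dist):
    """Build the full object-by-object matrix for a given distance function."""
    names = list(data)
    return [[dist(data[r], data[c]) for c in names] for r in names]


l1 = distance_matrix(points, manhattan)
l2 = distance_matrix(points, euclidean)

# Print the Euclidean matrix row by row, rounded to three decimals.
for name, row in zip(points, l2):
    print(name, [round(v, 3) for v in row])
```

Both matrices are symmetric with a zero diagonal, so in practice only the lower triangle needs to be stored.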

Distance Measures
Cosine measure (similarity vs. distance)

For x = (x_1, x_2, ..., x_n) and y = (y_1, y_2, ..., y_n):

\[
\cos(x, y) = \frac{x_1 y_1 + \cdots + x_n y_n}{\sqrt{x_1^2 + \cdots + x_n^2}\,\sqrt{y_1^2 + \cdots + y_n^2}},
\qquad
d(x, y) = 1 - \cos(x, y)
\]

- Property: 0 ≤ d(x, y) ≤ 2
- Nonmetric vector objects: keywords in documents, gene features in micro-arrays, ...
- Applications: information retrieval, biological taxonomy, ...

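Distance Measures
Example: Cosine measure

\[
x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0), \qquad y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
\]
\[
x_1 y_1 + \cdots + x_n y_n = 3 \times 1 + 2 \times 1 = 5
\]
\[
\sqrt{x_1^2 + \cdots + x_n^2} = \sqrt{3^2 + 2^2 + 5^2 + 2^2} = \sqrt{42} \approx 6.48
\]
\[
\sqrt{y_1^2 + \cdots + y_n^2} = \sqrt{1^2 + 1^2 + 2^2} = \sqrt{6} \approx 2.45
\]
\[
\cos(x, y) = \frac{5}{6.48 \times 2.45} \approx 0.315 \approx 0.32, \qquad d(x, y) = 1 - 0.32 = 0.68
\]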
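The cosine example can be checked with a few lines of Python. This is a sketch under the assumption that the two vectors are x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0) and y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2), which is my reading of the example; the function names are illustrative.

```python
import math


def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


def cosine_distance(x, y):
    """Distance derived from the similarity: d(x, y) = 1 - cos(x, y)."""
    return 1.0 - cosine_similarity(x, y)


# Assumed example vectors (e.g., keyword counts in two documents).
x = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
y = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

sim = cosine_similarity(x, y)
dist = cosine_distance(x, y)
```

Without intermediate rounding the similarity is about 0.315 and the distance about 0.685; the 0.68 in the example comes from rounding the similarity to 0.32 first.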

Distance Measures
Distance for binary features
- For binary features, the values can be converted into 1 or 0.
- Contingency table for binary feature vectors x and y:

             y = 1   y = 0
    x = 1      a       b
    x = 0      c       d

  - a: number of features that equal 1 for both x and y
  - b: number of features that equal 1 for x but that are 0 for y
  - c: number of features that equal 0 for x but that are 1 for y
  - d: number of features that equal 0 for both x and y

Distance Measures
Distance for binary features
- Distance for symmetric binary features
  - Both states are equally valuable and carry the same weight; i.e., there is no preference on which outcome should be coded as 1 or 0, e.g., gender.

  \[ d(x, y) = \frac{b + c}{a + b + c + d} \]

- Distance for asymmetric binary features
  - The outcomes of the states are not equally important, e.g., the positive and negative outcomes of a disease test; the rarest one is set to 1 and the other is 0.

  \[ d(x, y) = \frac{b + c}{a + b + c} \]

Distance Measures
Example: Distance for binary features

Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N

(Y: yes, P: positive, N: negative)

- Gender is a symmetric feature (less important); the remaining features are asymmetric binary.
- Set the values Y and P to 1, and the value N to 0.

\[ d(\text{Jack}, \text{Mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33 \]
\[ d(\text{Jack}, \text{Jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67 \]
\[ d(\text{Jim}, \text{Mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75 \]
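The binary-feature distances in this example can be reproduced with a short sketch. It assumes Y and P are coded as 1 and N as 0, that gender is left out, and that the six remaining features are ordered Fever, Cough, Test-1 to Test-4; the function names and the zero-denominator convention are illustrative assumptions.

```python
def binary_counts(x, y):
    """Return the contingency counts (a, b, c, d) for two 0/1 vectors."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d


def symmetric_distance(x, y):
    """(b + c) / (a + b + c + d): both states carry the same weight."""
    a, b, c, d = binary_counts(x, y)
    return (b + c) / (a + b + c + d)


def asymmetric_distance(x, y):
    """(b + c) / (a + b + c): joint absences (d) are ignored."""
    a, b, c, _ = binary_counts(x, y)
    if a + b + c == 0:
        return 0.0  # assumed convention when neither vector has any 1
    return (b + c) / (a + b + c)


# Fever, Cough, Test-1, Test-2, Test-3, Test-4 (Y/P -> 1, N -> 0).
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

d_jack_mary = asymmetric_distance(jack, mary)
d_jack_jim = asymmetric_distance(jack, jim)
d_jim_mary = asymmetric_distance(jim, mary)
```

Rounded to two decimals, the three values agree with the 0.33, 0.67 and 0.75 given in the example.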

Distance Measures
Distance for nominal features
- A generalization of the binary feature: a nominal feature can take more than two states/values, e.g., red, yellow, blue, green, ...
- There are two methods to handle such features.
  - Simple mismatching:

    \[ d(x, y) = \frac{\text{number of mismatching features between } x \text{ and } y}{\text{total number of features}} \]

  - Conversion into binary variables: create a new binary feature for each nominal state. E.g., if a feature has three possible nominal states (red, yellow and blue), it is expanded into three binary features accordingly. Distance measures for binary features are then applicable.

Distance Measures
Distance for nominal features (cont.)
Example: Play tennis

      Outlook    Temperature   Humidity   Wind
D1    Overcast   High          High       Strong
D2    Sunny      High          Normal     Strong

- Simple mismatching: d(D1, D2) = 2/4 = 0.5
- Creating new binary features, using as many bits as the number of states each feature can take:
  - Outlook = {Sunny, Overcast, Rain}: (100, 010, 001)
  - Temperature = {High, Mild, Cool}: (100, 010, 001)
  - Humidity = {High, Normal}: (10, 01)
  - Wind = {Strong, Weak}: (10, 01)

      Outlook   Temperature   Humidity   Wind
D1    010       100           10         10
D2    100       100           01         10

  d(D1, D2) = 4/10 = 0.4
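Both ways of handling nominal features can be sketched in a few lines. The code below assumes the two records are D1 = (Overcast, High, High, Strong) and D2 = (Sunny, High, Normal, Strong) and uses one bit per nominal state; the bit order within each feature and the helper names are illustrative assumptions.

```python
def simple_mismatch(x, y):
    """Fraction of positions at which two equal-length records differ."""
    mismatches = sum(1 for u, v in zip(x, y) if u != v)
    return mismatches / len(x)


def one_hot(record, domains):
    """Expand each nominal value into one binary feature per possible state."""
    bits = []
    for value, states in zip(record, domains):
        bits.extend(1 if value == state else 0 for state in states)
    return bits


# Possible states of Outlook, Temperature, Humidity and Wind.
domains = [
    ["Sunny", "Overcast", "Rain"],
    ["High", "Mild", "Cool"],
    ["High", "Normal"],
    ["Strong", "Weak"],
]
d1 = ["Overcast", "High", "High", "Strong"]
d2 = ["Sunny", "High", "Normal", "Strong"]

# Method 1: simple mismatching on the nominal values themselves.
nominal_distance = simple_mismatch(d1, d2)

# Method 2: convert to binary features, then apply the symmetric
# binary distance, which is the mismatch fraction over all bits.
b1 = one_hot(d1, domains)
b2 = one_hot(d2, domains)
binary_distance = simple_mismatch(b1, b2)
```

The two methods give different values (0.5 and 0.4 here) because each mismatching nominal feature flips two bits, while the total number of bits depends on how many states every feature has.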

Major Clustering Methodologies
Partitioning methodology
- Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared distances cost
- Typical methods: K-means, K-medoids, CLARANS, ...
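To make the evaluation criterion concrete, the sketch below computes the sum of squared distances between each point and the centroid of its cluster, and compares two candidate partitions. The four points and both partitions are hypothetical, and the code only scores given partitions; it is not an implementation of K-means or any other listed method.

```python
def centroid(members):
    """Component-wise mean of the points in one cluster."""
    n = len(members)
    dims = len(members[0])
    return tuple(sum(p[i] for p in members) / n for i in range(dims))


def sse_cost(clusters):
    """Sum over clusters of squared Euclidean distances to the cluster centroid."""
    total = 0.0
    for members in clusters:
        c = centroid(members)
        for p in members:
            total += sum((a - b) ** 2 for a, b in zip(p, c))
    return total


# Two hypothetical partitions of the same four 2-D points.
partition_a = [[(0, 2), (2, 0)], [(3, 1), (5, 1)]]
partition_b = [[(0, 2), (5, 1)], [(2, 0), (3, 1)]]

cost_a = sse_cost(partition_a)
cost_b = sse_cost(partition_b)

# Under this criterion, the partition with the lower cost is preferred.
best = "a" if cost_a < cost_b else "b"
```

A partitioning method searches over many such candidate partitions instead of enumerating them by hand.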

Major Clustering Methodologies
Hierarchical methodology
- Create a hierarchical decomposition of the set of data (or objects) using some criterion
- Typical methods: Agglomerative, Diana, Agnes, BIRCH, ROCK, ...

Major Clustering Methodologies
Density-based methodology
- Based on connectivity and density functions
- Typical methods: DBSCAN, OPTICS, DenClue, ...

Major Clustering Methodologies
Model-based methodology
- A generative model is hypothesized for each cluster, and the method tries to find the best fit between the data and these models
- Typical methods: Gaussian Mixture Model (GMM), COBWEB, ...

Major Clustering Methodologies
Spectral clustering methodology
- Convert the data set into a weighted graph (vertex, edge), then cut the graph into sub-graphs corresponding to clusters via spectral analysis
- Typical methods: Normalised-Cuts, ...

Major Clustering Methodologies
Clustering ensemble methodology
- Combine multiple clustering results (different partitions)
- Typical methods: evidence-accumulation based, graph-based combination, ...

Summary
- Clustering analysis groups objects based on their (dis)similarity and has a broad range of applications.
- The measure of distance (or similarity) plays a critical role in clustering analysis and distance-based learning.
- Clustering algorithms can be categorized into partitioning, hierarchical, density-based, model-based, spectral clustering and ensemble methodologies.
- There are still many open research issues in cluster analysis:
  - finding the number of natural clusters with arbitrary shapes
  - dealing with mixed types of features
  - handling massive amounts of data (Big Data)
  - coping with data of high dimensionality
  - performance evaluation (especially when no ground truth is available)