K-Means Clustering. By Susan L. Miertschin

Data Mining - Task Types
- Classification
- Clustering
- Discovering Association Rules
- Discovering Sequential Patterns / Sequence Analysis
- Regression
- Detecting Deviations from Normal

Data Mining - Task Types
- Classification
- Clustering
  - Divide data into groups with similar characteristics (Larson)
  - Find clusters of data objects similar in some way to one another (Oracle book, http://download.oracle.com/docs/cd/b28359_01/datamine.111/b28129/clustering.htm)
- Discovering Association Rules
- Discovering Sequential Patterns / Sequence Analysis
- Regression

Clustering
- Find customers similar to each other based on geographical distance to nearest storefront location, number of small dogs owned, number of cats owned, and number of children in household
  - Purpose? Target niche markets, plan new stores
- Find cardiologists who are similar with respect to likelihood of prescribing a certain class of medication for treatment of congestive heart failure (based on hospital patient records) and patient mix

Clustering
- Descriptive
- Unsupervised

Clustering Algorithms
- Group the data based on a criterion
- Look for improvements in the grouping
- If improvement is possible, then revise the groups
- Iterate

K-Means Clustering Algorithm
1. Choose a value for K, the number of clusters the algorithm should create
2. Select K cluster centers from the data
   - Arbitrary as opposed to intelligent selection for raw K-means
3. Assign the other instances to the group based on distance to center
   - Distance is simple Euclidean distance
4. Calculate new center for each cluster based on mean values of instances included
5. Evaluate to look for possible improvement
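The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the course: the function names (`euclidean`, `k_means`), the seven two-dimensional toy points, and the choice to leave an empty cluster's center unchanged are all assumptions made for the example.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two equal-length numeric tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(points, centers, max_iter=100):
    """Raw K-means: assign to nearest center, recompute means, repeat."""
    centers = [tuple(float(v) for v in c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        labels = [min(range(len(centers)), key=lambda j: euclidean(p, centers[j]))
                  for p in points]
        # Update step: each new center is the attribute-wise mean of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(old)  # assumption: an empty cluster keeps its center
        if new_centers == centers:  # centers unchanged, so no further improvement
            break
        centers = new_centers
    return centers, labels

# Hypothetical toy data, not taken from the slides.
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
centers, labels = k_means(points, centers=points[:2])  # K = 2, first two points as centers
print(labels)
print(centers)
```

Picking the first K points as starting centers mirrors the "arbitrary" selection of raw K-means; a different starting choice can lead to different final clusters.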

Euclidean Distance
- 2 dimensions: d = sqrt((x2 - x1)^2 + (y2 - y1)^2)
- 3 dimensions: d = sqrt((x2 - x1)^2 + (y2 - y1)^2 + (z2 - z1)^2)
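As a quick worked example, the same distance can be computed for any number of dimensions by summing the squared coordinate differences and taking the square root. The helper name `euclidean` and the sample points are chosen here for illustration only.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

d2 = euclidean((0, 0), (3, 4))        # 2 dimensions: sqrt(9 + 16)
d3 = euclidean((1, 2, 3), (4, 6, 3))  # 3 dimensions: sqrt(9 + 16 + 0)
print(d2, d3)
```

On Python 3.8 or later, the standard library's `math.dist(p, q)` performs the same calculation.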

Restrictions/Considerations
- Euclidean distance can only be calculated with real numbers
- Categorical data must be converted to numbers
- There are issues associated with this conversion process
  - If the categorical data is ordinal (i.e., an order can be established for the categories, e.g. win/place/show is an ordered set of categories) then the conversion is better
  - If the categorical data is nominal then the conversion is not true to meaning of the
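A small sketch may help show why the ordinal case converts more gracefully than the nominal one. The win/place/show coding follows the example above; the color categories and their numbering are hypothetical and stand in for any nominal attribute.

```python
# Ordinal: the numeric codes preserve the real ordering win < place < show.
ordinal_codes = {"win": 1, "place": 2, "show": 3}

# Nominal (hypothetical example): the numbering is arbitrary, so any gap
# between codes is an artifact of the coding, not a property of the data.
nominal_codes = {"red": 1, "green": 2, "blue": 3}

def gap(codes, a, b):
    """Absolute difference between the numeric codes of two categories."""
    return abs(codes[a] - codes[b])

# Meaningful: show is further from win than place is.
print(gap(ordinal_codes, "win", "place"), gap(ordinal_codes, "win", "show"))

# Not meaningful: nothing makes red "closer" to green than to blue.
print(gap(nominal_codes, "red", "green"), gap(nominal_codes, "red", "blue"))
```

Both pairs print the same numbers, but only the ordinal ones reflect something true about the categories; a distance-based algorithm cannot tell the difference.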

Example: Credit Card Promotion
Data Descriptions

Attribute Name: Income Range
  Value Description: 20-30K, 30-40K, 40-50K, 50-60K
  Numeric Values: 20000, 30000, 40000, 50000
  Definition: Salary range for an individual credit card holder

Attribute Name: Magazine Promotion
  Value Description: Yes, No
  Numeric Values: 1, 0
  Definition: Did card holder participate in magazine promotion offered before?

Attribute Name: Watch Promotion
  Value Description: Yes, No
  Numeric Values: 1, 0
  Definition: Did card holder participate in watch promotion offered before?

Attribute Name: Life Ins Promotion
  Value Description: Yes, No
  Numeric Values: 1, 0
  Definition: Did card holder participate in life insurance promotion offered before?

Attribute Name: Credit Card Insurance
  Value Description: Yes, No
  Numeric Values: 1, 0
  Definition: Does card holder have credit card insurance?

Sample of Credit Card Promotion Data (from Table 2.3)

Income Range  Magazine Promo  Watch Promo  Life Ins Promo  CC Ins  Sex     Age
40-50K        Yes             No           No              No      Male    45
30-40K        Yes             Yes          Yes             No      Female  40
40-50K        No              No           No              No      Male    42
30-40K        Yes             Yes          Yes             Yes     Male    43
50-60K        Yes             No           Yes             No      Female  38
20-30K        No              No           No              No      Female  55
30-40K        Yes             No           Yes             Yes     Male    35

See data handout.

Sample of Numerical Credit Card Promotion Data (from Table 2.3)

Income Range  Magazine Promo  Watch Promo  Life Ins Promo  CC Ins  Sex  Age
40000         1               0            0               0       1    45
30000         1               1            1               0       0    40
40000         0               0            0               0       1    42
30000         1               1            1               1       1    43
50000         1               0            1               0       0    38
20000         0               0            0               0       0    55
30000         1               0            1               1       1    35
20000         0               1            0               0       1    27
30000         1               0            0               0       1    43
30000         1               1            1               0       0    41

See data handout.
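The conversion from the categorical table to the numerical one can be sketched as a set of lookup tables. The income and Yes/No codes come from the data descriptions; the Sex coding (Male = 1, Female = 0) is inferred by comparing the two sample tables, and the names `INCOME`, `YES_NO`, `SEX`, and `encode` are invented for this illustration.

```python
INCOME = {"20-30K": 20000, "30-40K": 30000, "40-50K": 40000, "50-60K": 50000}
YES_NO = {"Yes": 1, "No": 0}
SEX = {"Male": 1, "Female": 0}  # inferred from the sample tables, not stated explicitly

def encode(row):
    """Turn one categorical record into a 7-tuple of numbers."""
    income, magazine, watch, life_ins, cc_ins, sex, age = row
    return (INCOME[income], YES_NO[magazine], YES_NO[watch],
            YES_NO[life_ins], YES_NO[cc_ins], SEX[sex], age)

# First two records of the sample data.
raw = [
    ("40-50K", "Yes", "No", "No", "No", "Male", 45),
    ("30-40K", "Yes", "Yes", "Yes", "No", "Female", 40),
]
encoded = [encode(r) for r in raw]
print(encoded)
```

Each income range is mapped to its lower bound, which keeps the ordering of the ranges; the Yes/No and Sex attributes are two-valued, so any pair of distinct codes gives the same distance behavior.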

Implementing K-Means Algorithm in Excel
- There is a link to the Excel file used to create the data handout in Blackboard
- Download the .zip archive using the link, extract the .csv file, and open it in Excel
- Follow along with the slides - using

K-Means Algorithm Steps in Excel
- Set the number of clusters K = 4 (arbitrary)
- Select K centers
  - Select first points that represent 4 different income ranges = Instances 1, 2, 5, 6 (this is slightly less arbitrary)

K-Means Algorithm Steps in Excel
- Compute distance to each center from every other instance (point)
- Use the distance formula
- Each instance in this data set is a 7-tuple
  - E.g. (40000, 1, 0, 0, 1, 45, 0)

K-Means Algorithm Steps in Excel
- Here is what your result should look like
- The cells that contain 0 correspond to the distance between a chosen center point and itself
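For readers working outside Excel, the distance table can be approximated in Python. This sketch uses the ten numeric instances in the column order of the numerical sample table and the four centers named earlier (instances 1, 2, 5, 6); it works on the untransformed values, so the numbers may not match a worksheet that rescales the data first.

```python
import math

# Columns: income, magazine, watch, life ins, CC ins, sex, age.
data = [
    (40000, 1, 0, 0, 0, 1, 45),
    (30000, 1, 1, 1, 0, 0, 40),
    (40000, 0, 0, 0, 0, 1, 42),
    (30000, 1, 1, 1, 1, 1, 43),
    (50000, 1, 0, 1, 0, 0, 38),
    (20000, 0, 0, 0, 0, 0, 55),
    (30000, 1, 0, 1, 1, 1, 35),
    (20000, 0, 1, 0, 0, 1, 27),
    (30000, 1, 0, 0, 0, 1, 43),
    (30000, 1, 1, 1, 0, 0, 41),
]
center_ids = [0, 1, 4, 5]  # instances 1, 2, 5, 6 as zero-based indices
centers = [data[i] for i in center_ids]

def euclidean(p, q):
    """Euclidean distance between two 7-tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# One row per instance, one column per center.
distances = [[euclidean(row, c) for c in centers] for row in data]
for number, row in enumerate(distances, start=1):
    print(number, [round(d, 2) for d in row])
```

The rows for instances 1, 2, 5, and 6 each contain a single 0, in the column of the center that the instance itself defines.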

K-Means Algorithm Steps in Excel
- For each instance there are four distance values
- Choose the minimum distance to associate the instance with the center of the cluster
- Do you see any problems with the way these

K-Means Algorithm Steps in Excel
- Transformed Data Values
- New Distances Calculated
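One issue a reader might notice with the untransformed values is scale: income differences are in the tens of thousands, while every other attribute differs by at most a few dozen, so income alone decides the nearest center. The sketch below shows this and then rescales each attribute to the range 0 to 1. Min-max scaling is an assumption chosen for illustration; the slides do not say which transformation was applied in the Excel workbook.

```python
import math

# Columns: income, magazine, watch, life ins, CC ins, sex, age.
data = [
    (40000, 1, 0, 0, 0, 1, 45), (30000, 1, 1, 1, 0, 0, 40),
    (40000, 0, 0, 0, 0, 1, 42), (30000, 1, 1, 1, 1, 1, 43),
    (50000, 1, 0, 1, 0, 0, 38), (20000, 0, 0, 0, 0, 0, 55),
    (30000, 1, 0, 1, 1, 1, 35), (20000, 0, 1, 0, 0, 1, 27),
    (30000, 1, 0, 0, 0, 1, 43), (30000, 1, 1, 1, 0, 0, 41),
]
center_ids = [0, 1, 4, 5]  # instances 1, 2, 5, 6 as zero-based indices

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def assign(rows, center_ids):
    """Label each row with the position of its nearest center."""
    centers = [rows[i] for i in center_ids]
    return [min(range(len(centers)), key=lambda j: euclidean(r, centers[j]))
            for r in rows]

def min_max_scale(rows):
    """Rescale every column to [0, 1] (assumed transformation)."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [tuple((v - l) / (h - l) if h > l else 0.0
                  for v, l, h in zip(r, lo, hi)) for r in rows]

raw_labels = assign(data, center_ids)        # effectively groups by income only
scaled = min_max_scale(data)
scaled_labels = assign(scaled, center_ids)   # all attributes now contribute
print(raw_labels)
print(scaled_labels)
```

With the raw values, every instance joins the center that shares its income range, regardless of its promotion history, sex, or age.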

K-Means Algorithm Steps in Excel
- New clusters

K-Means Algorithm Steps in Excel
- Identify the instances that belong to the minimum distance values

K-Means Algorithm Steps in Excel
- Calculate means of attribute values by cluster to determine the cluster center
- Sort by cluster to aid in calculation
- If calculated center = former center (to a certain precision) then terminate the algorithm
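The center update and the termination test can be sketched as follows. The cluster labels here are an assumption: they are what a nearest-center assignment gives on the untransformed values with instances 1, 2, 5, 6 as centers, and may differ from the clusters in the Excel workbook. The helper names and the tolerance `tol` are likewise illustrative, and the sketch assumes every cluster has at least one member.

```python
# Columns: income, magazine, watch, life ins, CC ins, sex, age.
data = [
    (40000, 1, 0, 0, 0, 1, 45), (30000, 1, 1, 1, 0, 0, 40),
    (40000, 0, 0, 0, 0, 1, 42), (30000, 1, 1, 1, 1, 1, 43),
    (50000, 1, 0, 1, 0, 0, 38), (20000, 0, 0, 0, 0, 0, 55),
    (30000, 1, 0, 1, 1, 1, 35), (20000, 0, 1, 0, 0, 1, 27),
    (30000, 1, 0, 0, 0, 1, 43), (30000, 1, 1, 1, 0, 0, 41),
]
labels = [0, 1, 0, 1, 2, 3, 1, 3, 1, 1]  # assumed cluster of each instance
old_centers = [data[0], data[1], data[4], data[5]]

def cluster_means(rows, labels, k):
    """Attribute-wise mean of the members of each cluster (none may be empty)."""
    centers = []
    for j in range(k):
        members = [r for r, lab in zip(rows, labels) if lab == j]
        centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return centers

def same_centers(a, b, tol=1e-6):
    """True when every coordinate of every center agrees to within tol."""
    return all(abs(x - y) <= tol for p, q in zip(a, b) for x, y in zip(p, q))

new_centers = cluster_means(data, labels, k=4)
for c in new_centers:
    print(c)
print(same_centers(new_centers, old_centers))  # False means: iterate again
```

Sorting by cluster, as suggested for Excel, is not needed here because the members of each cluster are filtered by label before averaging.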

K-Means Algorithm Steps in Excel
- Continue iteration using the new centers
- Yields new clusters
- Either terminate if new centers = previous centers OR continue iterations

Computation Question #10 (p. 103, Roiger)
- Perform the third iteration of the K-Means algorithm for the example given here in the slides
- What are the new cluster centers?
- Save your Excel workbook with your organized work relating to K-Means clustering and submit it in the dropbox named IC 0809 K-Means in Blackboard

Use WEKA
- Open the data file you downloaded and used for the Excel exercise.
- If you open this file in WEKA and then save it with WEKA

Use WEKA
- Note: K = 2 in this implementation of K-Means
