CS Data Mining. Introductions What Is It? Cultures of Data Mining


CS345: Data Mining
Introductions. What Is It? Cultures of Data Mining.

Course Staff
Instructors: Anand Rajaraman, Jeff Ullman
TA: Jeff Klingner

Requirements
Homework (Gradiance and other): 20%. Gradiance class code: DD984360.
Project: 40%.
Final Exam: 40%.

Project
Software implementation related to course subject matter. Should involve an original component or experiment. More later about available data and computing resources.

Team Projects
Working in pairs is OK, but:
1. We will expect more from a pair than from an individual.
2. The effort should be roughly evenly distributed.

What is Data Mining?
Discovery of useful, possibly unexpected, patterns in data.
Subsidiary issues:
Data cleansing: detection of bogus data. E.g., age = 150. Entity resolution.
Visualization: something better than megabyte files of output.
Warehousing of data (for retrieval).
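The data-cleansing point (a record with age = 150) can be illustrated with a minimal sketch. The records, the field names, and the plausible range of 0 to 120 are all assumptions made up for this example; they are not from the course.

```python
# Hypothetical records; the field names and values are invented for illustration.
records = [
    {"name": "A", "age": 34},
    {"name": "B", "age": 150},  # the kind of bogus value the slide mentions
    {"name": "C", "age": -2},
    {"name": "D", "age": 67},
]


def is_plausible_age(age, low=0, high=120):
    """Return True if age lies in an assumed plausible range [low, high]."""
    return low <= age <= high


# Split the data into records that pass the check and records that fail it.
clean = [r for r in records if is_plausible_age(r["age"])]
bogus = [r for r in records if not is_plausible_age(r["age"])]

print("kept:", [r["name"] for r in clean])
print("flagged:", [r["name"] for r in bogus])
```

A range check like this only catches values that are impossible on their face; it says nothing about values that are plausible but wrong.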

Typical Kinds of Patterns
1. Decision trees: succinct ways to classify by testing properties.
2. Clusters: another succinct classification by similarity of properties.
3. Bayes models, hidden Markov models, frequent itemsets: expose important associations within data.

Example: Clusters
[Figure: scatter plot of points, each marked x, illustrating clusters.]

Example: Frequent Itemsets
A common marketing problem: examine what people buy together to discover patterns.
1. What pairs of items are unusually often found together at Safeway checkout? Answer: diapers and beer.
2. What books are likely to be bought by the same Amazon customer?
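As a rough illustration of the "what do people buy together" question, the sketch below counts how often each pair of items shows up in the same basket and keeps the pairs that reach a support threshold. The baskets, the function name `frequent_pairs`, and the threshold are hypothetical. Note that this counts raw co-occurrence only; deciding whether a pair occurs "unusually often" would also require comparing against how common each item is on its own.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets, invented for illustration.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "beer", "milk"},
    {"bread", "chips"},
]


def frequent_pairs(baskets, min_support):
    """Count item pairs per basket and keep those seen at least min_support times."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair has one canonical ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


result = frequent_pairs(baskets, min_support=3)
print(result)
```

This brute-force count examines every pair in every basket, which is fine for a toy example but is exactly what becomes expensive on large, non-main-memory data.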

Applications (Among Many)
Intelligence-gathering: e.g., tracking terrorists.
Web analysis: PageRank, spam detection.
Marketing: run a sale on diapers; raise the price of beer.

Cultures
Databases: concentrate on large-scale (non-main-memory) data.
AI (machine learning): concentrate on complex methods, small data.
Statistics: concentrate on models.

Models vs. Analytic Processing
To a database person, data mining is an extreme form of analytic processing: queries that examine large amounts of data. The result is the data that answers the query.
To a statistician, data mining is the inference of models. The result is the parameters of the model.

(Way too Simple) Example
Given a billion numbers, a DB person would compute their average.
A statistician might fit the billion points to the best Gaussian distribution and report the mean and standard deviation.
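The contrast can be sketched in a few lines. This is only an illustration under stated assumptions: it uses 100,000 synthetic numbers instead of a billion, draws them from a Gaussian with an arbitrarily chosen mean of 50 and standard deviation of 10, and takes "best Gaussian" to mean the maximum-likelihood fit (sample mean and population standard deviation).

```python
import random
import statistics

# Synthetic stand-in for the "billion numbers"; size and parameters are assumptions.
random.seed(0)
data = [random.gauss(50.0, 10.0) for _ in range(100_000)]

# Database view: the answer to the query is a number computed from the data.
average = sum(data) / len(data)

# Statistics view: the answer is the parameters of a fitted model.
# For a Gaussian, the maximum-likelihood fit is the sample mean and the
# population standard deviation.
mu = statistics.fmean(data)
sigma = statistics.pstdev(data, mu)

print(f"average (query result): {average:.3f}")
print(f"fitted Gaussian: mean={mu:.3f}, std={sigma:.3f}")
```

In this deliberately simple case the two views produce the same mean; the model additionally reports a spread, which is a claim about the data's distribution rather than a plain query result.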

Meaningfulness of Answers
A big risk when data mining is that you will discover patterns that are meaningless.
Statisticians call it Bonferroni's principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap.

Examples
A big objection to TIA (Total Information Awareness) was that it was looking for so many vague connections that it was sure to find things that were bogus and thus violate innocents' privacy.
The Rhine Paradox: a great example of how not to conduct scientific research.

Story Behind the Story
I gave these two examples last year.
The hotels example got picked up by a newspaper reporter who spun it as "STANFORD PROFESSOR PROVES TRACKING TERRORISTS IS IMPOSSIBLE."
I was also corrected in the story about Joseph Rhine (whom I called David).

Rhine Paradox (1)
Joseph Rhine was a parapsychologist in the 1950s who hypothesized that some people had Extra-Sensory Perception.
He devised (something like) an experiment where subjects were asked to guess 10 hidden cards, each red or blue.
He discovered that almost 1 in 1000 had ESP: they were able to get all 10 right!

Rhine Paradox (2)
He told these people they had ESP and called them in for another test of the same type.
Alas, he discovered that almost all of them had lost their ESP.
What did he conclude? Answer on next slide.

Rhine Paradox (3)
He concluded that you shouldn't tell people they have ESP; it causes them to lose it.
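The "almost 1 in 1000" figure is what pure guessing predicts, which a short calculation makes explicit. The sketch assumes each red/blue guess is an independent coin flip; the pool of 1000 subjects is a hypothetical number chosen only to make the expectation concrete.

```python
from fractions import Fraction

# Assumption: each of the 10 red/blue guesses is right with probability 1/2,
# independently of the others.
p_all_right = Fraction(1, 2) ** 10

# Hypothetical number of subjects tested, for illustration only.
subjects = 1000
expected_first_round = subjects * p_all_right

# A subject who got 10 of 10 by luck has the same 1/1024 chance on a retest.
p_repeat = p_all_right

print(f"P(10 of 10 by chance) = {p_all_right} (about {float(p_all_right):.5f})")
print(f"expected perfect scores among {subjects} guessers: {float(expected_first_round):.2f}")
print(f"P(a lucky subject repeats the feat) = {float(p_repeat):.5f}")
```

So roughly one perfect score per thousand subjects is expected with no ESP at all, and nearly every such subject should be expected to fail the second test.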

Example: Bonferroni's Principle
This example illustrates a problem with intelligence-gathering.
Suppose we believe that certain groups of evil-doers are meeting occasionally in hotels to plot doing evil.
We want to find people who at least twice have stayed at the same hotel on the same day.

The Details
10^9 people being tracked.
1000 days.
Each person stays in a hotel 1% of the time (10 days out of 1000).
Hotels hold 100 people (so 10^5 hotels, enough for the 10^7 people who are in a hotel on any given day).
If everyone behaves randomly (i.e., no evil-doers), will the data mining detect anything suspicious?

Calculations (1)
Probability that persons p and q will be at the same hotel on day d: 1/100 * 1/100 * 10^-5 = 10^-9.
Probability that p and q will be at the same hotel on two given days: 10^-9 * 10^-9 = 10^-18.
Pairs of days: 5*10^5.

Calculations (2)
Probability that p and q will be at the same hotel on some two days: 5*10^5 * 10^-18 = 5*10^-13.
Pairs of people: 5*10^17.
Expected number of suspicious pairs of people: 5*10^17 * 5*10^-13 = 250,000.
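The arithmetic above can be checked with a short script. The variable names are illustrative; the inputs are the figures from the slides. One small difference: the script uses exact binomial coefficients for the pair counts, where the slides round them to 5*10^5 and 5*10^17, so the result comes out slightly under the rounded figure.

```python
from math import comb

# Figures from the slides.
people = 10**9
days = 1000
p_in_hotel = 0.01  # each person is in some hotel 1% of the time
hotels = 10**5

# p and q are both in a hotel on day d, and they picked the same one.
p_same_hotel_one_day = p_in_hotel * p_in_hotel / hotels

# Same hotel on each of two specific days, treating the days as independent.
p_two_given_days = p_same_hotel_one_day**2

# Exact counts of unordered pairs (the slides round these).
pairs_of_days = comb(days, 2)
pairs_of_people = comb(people, 2)

# Approximation used in the slides: sum the probability over all pairs of days.
p_some_two_days = pairs_of_days * p_two_given_days

expected_suspicious_pairs = pairs_of_people * p_some_two_days

print(f"P(same hotel on one day)   = {p_same_hotel_one_day:.1e}")
print(f"P(same hotel on some 2 days) = {p_some_two_days:.2e}")
print(f"expected suspicious pairs  = {expected_suspicious_pairs:,.0f}")
```

Every one of these expected pairs arises from purely random behavior, which is the point of the example.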

Conclusion
Suppose there are (say) 10 pairs of evil-doers who definitely stayed at the same hotel twice.
Analysts have to sift through 250,010 candidates to find the 10 real cases. Not gonna happen.
But how can we improve the scheme?

Moral
When looking for a property (e.g., "two people stayed at the same hotel twice"), make sure that there are not so many possibilities that random data will surely produce "facts of interest."