Final Exam DATA MINING II - 1DL460


Uppsala University
Department of Information Technology
Kjell Orsborn, Tore Risch

Final Exam 2011-05-27, DATA MINING II - 1DL460

Date: Friday, May 27, 2011
Time: 14:00-19:00
Teacher on duty: Kjell Orsborn, phone 471 11 54 or 070 425 06 91

Instructions: Read through the complete exam and note any unclear directives before you start solving the questions. The following guidelines hold:

- Write readably and clearly! Answers that cannot be read can obviously not result in any points, and unclear formulations can be misunderstood.
- Assumptions outside of what is stated in the question must be explained. Any assumptions made should not alter the given question.
- Write your answer on only one side of the paper and use a new sheet for each new question to simplify the correction process and to avoid possible misunderstandings.
- Please write your name on each page you hand in. When you are finished, please staple the pages together in an order that corresponds to the order of the questions.

NOTE! This examination contains 40 points in total and their distribution between sub-questions is clearly identifiable. Note that you will get credit only for answers that are correct. To pass, you must score at least 22 points. The examiner reserves the right to lower these numbers. You are allowed to use dictionaries to and from English and a calculator, but no other material.

1. Web mining and search engines: 6 pts

(a) In text of no more than one page, present the main ideas and techniques of the PageRank algorithm used by Google. (3 pts)

(b) In text of no more than one page, present the main ideas and techniques of the Clever project: the ranking technique that is based on identifying authorities and hubs. (3 pts)

2. FP-growth algorithm: 10 pts

Given the transactions of Table 1, frequent itemset mining should be performed using the frequent-pattern (FP) growth approach and a minimum support of 2. The item head table is given in Table 2, where the items are sorted in order of descending support count.

Table 1: Transaction database for Question 2

TID   List of item ids
T1    i1, i2, i5
T2    i2, i4
T3    i2, i3
T4    i1, i2, i4
T5    i1, i3
T6    i2, i3
T7    i1, i3
T8    i1, i2, i3, i5
T9    i1, i2, i3

Table 2: Item head table for Question 2 (the node-link column holds pointers into the FP-tree)

Item id   Support   Node-link
i2        7
i1        6
i3        6
i4        2
i5        2

(a) Construct the FP-tree corresponding to the set of transactions in Table 1. (4 pts)

Answer: (see the construction steps of the FP-tree below)

(b) Mine the FP-tree according to the FP-growth algorithm. The results should include the set of frequent patterns generated through the different steps in the analysis. You should explain the intermediate steps and your reasoning in your answer. (6 pts)

Answer: (see the mining steps below)
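As a quick sanity check of the item ordering in Table 2 (not required for the exam answer), the support counts can be computed directly from Table 1. The sketch below is a minimal Python illustration; the variable names are ours and purely illustrative.

```python
from collections import Counter

# Transactions from Table 1
transactions = {
    "T1": ["i1", "i2", "i5"],
    "T2": ["i2", "i4"],
    "T3": ["i2", "i3"],
    "T4": ["i1", "i2", "i4"],
    "T5": ["i1", "i3"],
    "T6": ["i2", "i3"],
    "T7": ["i1", "i3"],
    "T8": ["i1", "i2", "i3", "i5"],
    "T9": ["i1", "i2", "i3"],
}

# Count in how many transactions each item occurs
support = Counter(item for items in transactions.values() for item in items)

# Items in descending support order, as in the item head table (Table 2)
for item, count in sorted(support.items(), key=lambda kv: (-kv[1], kv[0])):
    print(item, count)
# Prints: i2 7, i1 6, i3 6, i4 2, i5 2
```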

FP-tree construction. Transactions are read one at a time; the items of each transaction are first reordered according to the item head table (i2, i1, i3, i4, i5) and then inserted as a path from the root, incrementing the counts along any shared prefix:

After reading TID = T1 (i2, i1, i5): path i2:1 - i1:1 - i5:1. (Figure 1: FP-tree construction, step 1)
After reading TID = T2 (i2, i4): i2:2 with new child i4:1. (Figure 2: FP-tree construction, step 2)
After reading TID = T3 (i2, i3): i2:3 with new child i3:1. (Figure 3: FP-tree construction, step 3)
After reading TID = T4 (i2, i1, i4): i2:4 - i1:2 with new child i4:1. (Figure 4: FP-tree construction, step 4)
After reading TID = T5 (i1, i3): new branch from the root, i1:1 - i3:1. (Figure 5: FP-tree construction, step 5)
After reading TID = T6 (i2, i3): i2:5 - i3:2. (Figure 6: FP-tree construction, step 6)
After reading TID = T7 (i1, i3): i1:2 - i3:2 on the root branch. (Figure 7: FP-tree construction, step 7)
After reading TID = T8 (i2, i1, i3, i5): i2:6 - i1:3 with new children i3:1 - i5:1. (Figure 8: FP-tree construction, step 8)
After reading TID = T9 (i2, i1, i3): i2:7 - i1:4 - i3:2. (Figure 9: FP-tree construction, step 9)
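The same construction can be sketched in code. The following minimal Python illustration performs the two passes described above (the first pass counts supports, the second inserts the reordered transactions into a counted prefix tree); the class name FPNode and all variable names are ours, and the header table with its node-links is omitted for brevity.

```python
from collections import Counter

transactions = [
    ["i1", "i2", "i5"], ["i2", "i4"], ["i2", "i3"],
    ["i1", "i2", "i4"], ["i1", "i3"], ["i2", "i3"],
    ["i1", "i3"], ["i1", "i2", "i3", "i5"], ["i1", "i2", "i3"],
]
MIN_SUPPORT = 2

class FPNode:
    def __init__(self, item, parent=None):
        self.item = item          # item id, None for the root
        self.count = 0            # number of transactions passing through this node
        self.parent = parent
        self.children = {}        # item id -> FPNode

# Pass 1: count item supports and fix the head-table order (descending support).
support = Counter(i for t in transactions for i in t)
order = {item: rank for rank, (item, _) in enumerate(
    sorted(support.items(), key=lambda kv: (-kv[1], kv[0])))}

# Pass 2: insert every transaction, reordered and pruned, as a path from the root.
root = FPNode(None)
for t in transactions:
    items = sorted((i for i in t if support[i] >= MIN_SUPPORT), key=order.get)
    node = root
    for item in items:
        node = node.children.setdefault(item, FPNode(item, node))
        node.count += 1

def show(node, depth=0):
    """Print the tree as an indented outline."""
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

show(root)  # the printed outline matches the final FP-tree after step 9
```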

FP-tree mining, step 1: frequent itemset generation for paths ending in i5 (Figure 10 shows the prefix paths and the conditional FP-tree).
Conditional pattern base for i5: PB = {(i2, i1: 1), (i2, i1, i3: 1)}.
Conditional FP-tree for i5: CFP = (i2:2, i1:2); i3 is dropped since its count of 1 is below the minimum support.
Applying FP-growth on CFP yields the frequent itemsets (sup ≥ 2): {i5:2}, {i2, i5:2}, {i1, i5:2}, {i2, i1, i5:2}.

FP-tree mining, step 2: frequent itemset generation for paths ending in i4 (Figure 11 shows the prefix paths and the conditional FP-tree).
Conditional pattern base for i4: PB = {(i2, i1: 1), (i2: 1)}.
Conditional FP-tree for i4: CFP = (i2:2); i1 is dropped since its count of 1 is below the minimum support.
Applying FP-growth on CFP yields the frequent itemsets (sup ≥ 2): {i4:2}, {i2, i4:2}.

FP-tree mining, step 3: frequent itemset generation for paths ending in i3 (Figure 12 shows the prefix paths and the conditional FP-tree).
Conditional pattern base for i3: PB = {(i2, i1: 2), (i2: 2), (i1: 2)}.
Conditional FP-tree for i3: CFP = (i2:4, i1:2), (i1:2).
Applying FP-growth on CFP yields the frequent itemsets (sup ≥ 2): {i3:6}, {i2, i3:4}, {i1, i3:4}, {i2, i1, i3:2}.

FP-tree mining, step 4: frequent itemset generation for paths ending in i1 (Figure 13 shows the prefix paths and the conditional FP-tree).
Conditional pattern base for i1: PB = {(i2: 4)}.
Conditional FP-tree for i1: CFP = (i2:4).
Applying FP-growth on CFP yields the frequent itemsets (sup ≥ 2): {i1:6}, {i2, i1:4}.

FP-tree mining, step 5: frequent itemset generation for paths ending in i2 (Figure 14 shows the prefix path).
Conditional pattern base for i2: PB = {} (i2 has no prefix items), so the conditional FP-tree is empty.
The only frequent itemset generated is {i2:7}.
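As an independent cross-check (not part of the exam answer), the same frequent itemsets can be obtained by brute-force enumeration over the transactions of Table 1. The sketch below is plain Python and deliberately ignores the FP-tree; it lists the same 13 itemsets as the five mining steps above.

```python
from itertools import combinations

transactions = [
    {"i1", "i2", "i5"}, {"i2", "i4"}, {"i2", "i3"},
    {"i1", "i2", "i4"}, {"i1", "i3"}, {"i2", "i3"},
    {"i1", "i3"}, {"i1", "i2", "i3", "i5"}, {"i1", "i2", "i3"},
]
MIN_SUPPORT = 2
items = sorted(set().union(*transactions))

# Enumerate every candidate itemset and keep those with support >= 2.
frequent = {}
for size in range(1, len(items) + 1):
    for candidate in combinations(items, size):
        support = sum(set(candidate) <= t for t in transactions)
        if support >= MIN_SUPPORT:
            frequent[candidate] = support

for itemset, support in sorted(frequent.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(itemset, support)
# 13 frequent itemsets in total, matching the union of the FP-growth results above.
```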

3. Bayesian classification: 8 pts

(a) Explain Bayes' theorem, P(Y|X) = P(X|Y) P(Y) / P(X). How is it derived? (2 pts)

(b) Explain the principles of using Bayes' theorem for classification. What assumption is important? Give an example. (2 pts)

(c) How are continuous variables handled in Bayesian classification? (2 pts)

(d) What is a Bayesian belief network? (2 pts)

4. Data stream mining: 8 pts

(a) Give examples of three requirements for data stream mining that make it different from regular data mining. (2 pts)

(b) Outline the algorithm for computing moving averages over streams. Explain how the special requirements of streaming data are addressed. (2 pts)

(c) Outline the DenStream algorithm. (2 pts)

(d) What properties of DBSCAN make it unfit for data stream mining? (2 pts)

5. Cluster validation: 8 pts

Table 3: Confusion matrix for Question 5

Cluster  Entertainment  Financial  Foreign  Metro  National  Sports  Total
#1                   1          1        0     11         4     676    693
#2                  27         89      333    827       253      33   1562
#3                 326        465        8    105        16      29    949
Total              354        555      341    943       273     738   3204

(a) Compute the entropy and purity for the confusion matrix in Table 3. The entropy of a single cluster i is given by e_i = -Σ_{j=1}^{L} p_ij log2(p_ij), where p_ij is the probability that a member of cluster i belongs to class j, and L is the number of classes. (6 pts)

Answer: (see Table 4 below)

(b) Compute the precision, recall and F-measure for the Sports class in cluster #1 and for the Metro class in cluster #2. (2 pts)

Table 4: Answers to the confusion matrix for Question 5

Cluster  Entertainment  Financial  Foreign  Metro  National  Sports  Total  Entropy  Purity
#1                   1          1        0     11         4     676    693     0.20    0.98
#2                  27         89      333    827       253      33   1562     1.84    0.53
#3                 326        465        8    105        16      29    949     1.70    0.49
Total              354        555      341    943       273     738   3204     1.44    0.61

Answer to (b): For the Sports class in cluster #1: precision = 676/693 = 0.98, recall = 676/738 = 0.92 and F-measure = 0.94. For the Metro class in cluster #2: precision = 827/1562 = 0.53, recall = 827/943 = 0.88 and F-measure = 0.66.

Good Luck! / Kjell & Tore
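The values in Table 4 and in the answer to (b) can be reproduced with a few lines of code. The following minimal Python sketch applies the entropy formula from Question 5(a), the purity definition (fraction of the dominant class in each cluster), and the standard precision/recall/F-measure definitions to the confusion matrix of Table 3; the function and variable names are ours.

```python
import math

# Confusion matrix from Table 3: rows = clusters #1-#3, columns = classes.
classes = ["Entertainment", "Financial", "Foreign", "Metro", "National", "Sports"]
matrix = [
    [1, 1, 0, 11, 4, 676],
    [27, 89, 333, 827, 253, 33],
    [326, 465, 8, 105, 16, 29],
]

def entropy(row):
    """e_i = -sum_j p_ij * log2(p_ij), skipping empty cells."""
    n = sum(row)
    return -sum((c / n) * math.log2(c / n) for c in row if c > 0)

def purity(row):
    """Fraction of the cluster belonging to its dominant class."""
    return max(row) / sum(row)

total = sum(sum(row) for row in matrix)
for k, row in enumerate(matrix, start=1):
    print(f"cluster #{k}: entropy = {entropy(row):.2f}, purity = {purity(row):.2f}")

# Overall entropy and purity are weighted by cluster size.
overall_entropy = sum(sum(row) / total * entropy(row) for row in matrix)
overall_purity = sum(max(row) for row in matrix) / total
print(f"overall: entropy = {overall_entropy:.2f}, purity = {overall_purity:.2f}")

# Precision, recall and F-measure for a given cluster (row) and class (column j).
col_totals = [sum(row[j] for row in matrix) for j in range(len(classes))]

def prf(row, j):
    precision = row[j] / sum(row)
    recall = row[j] / col_totals[j]
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

print("Sports, cluster #1: p=%.2f r=%.2f F=%.2f" % prf(matrix[0], classes.index("Sports")))
print("Metro,  cluster #2: p=%.2f r=%.2f F=%.2f" % prf(matrix[1], classes.index("Metro")))
```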