Uppsala University
Department of Information Technology
Kjell Orsborn, Tore Risch

Final Exam 2011-05-27
DATA MINING II - 1DL460

Date: Friday, May 27, 2011
Time: 14:00-19:00
Teacher on duty: Kjell Orsborn, phone 471 11 54 or 070 425 06 91

Instructions: Read through the complete exam and note any unclear directives before you start solving the questions. The following guidelines hold:
- Write legibly and clearly! Answers that cannot be read can obviously not be awarded any points, and unclear formulations can be misunderstood. Assumptions beyond what is stated in a question must be explained, and any assumptions made must not alter the given question.
- Write your answer on only one side of the paper, and use a new sheet for each new question, to simplify the correction process and to avoid possible misunderstandings. Please write your name on each page you hand in. When you are finished, staple the pages together in an order that corresponds to the order of the questions.
- NOTE! This examination contains 40 points in total, and their distribution between sub-questions is clearly identifiable. Note that you will get credit only for answers that are correct. To pass, you must score at least 22 points. The examiner reserves the right to lower these limits.
- You are allowed to use dictionaries to and from English and a calculator, but no other material.
1. Web mining and search engines: 6 pts

(a) In text of no more than one page, present the main ideas and techniques of the PageRank algorithm used by Google. (3 pts)

(b) In text of no more than one page, present the main ideas and techniques of the Clever project: the ranking technique based on identifying authorities and hubs. (3 pts)

2. FP-growth algorithm: 10 pts

Given the transactions of Table 1, frequent itemset mining should be performed using the frequent-pattern (FP) growth approach and a minimum support of 2. The item head table is given in Table 2, where the items are sorted in descending order of support count.

Table 1: Transaction database for Question 2

TID   List of item id:s
T1    i1, i2, i5
T2    i2, i4
T3    i2, i3
T4    i1, i2, i4
T5    i1, i3
T6    i2, i3
T7    i1, i3
T8    i1, i2, i3, i5
T9    i1, i2, i3

Table 2: Item head table for Question 2

Item id   Support   Nodelink
i2        7         (points to the first i2 node in the FP-tree)
i1        6         (points to the first i1 node)
i3        6         (points to the first i3 node)
i4        2         (points to the first i4 node)
i5        2         (points to the first i5 node)

(a) Construct the FP-tree corresponding to the set of transactions in Table 1. (4 pts)

Answer: see the following figures for the construction steps of the FP-tree.

(b) Mine the FP-tree according to the FP-growth algorithm. The results should include the set of frequent patterns generated through the different steps of the analysis. (6 pts)

Answer: see the mining steps below. You should explain the intermediate steps and your reasoning in your answers.
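Before the construction figures, here is a minimal preprocessing sketch in Python (an illustration added here, not part of the original exam): it recomputes the support counts of Table 2 and reorders each transaction in descending support order, which is the form in which transactions are inserted into the FP-tree.

```python
# Preprocessing for FP-tree construction: count supports and reorder
# transactions (illustrative sketch; item names follow Table 1).
from collections import Counter

transactions = [
    ["i1", "i2", "i5"], ["i2", "i4"], ["i2", "i3"],
    ["i1", "i2", "i4"], ["i1", "i3"], ["i2", "i3"],
    ["i1", "i3"], ["i1", "i2", "i3", "i5"], ["i1", "i2", "i3"],
]
min_support = 2

# Support count of each item over all transactions (Table 2's Support column).
support = Counter(item for t in transactions for item in t)

# Frequent items in descending support order; ties broken alphabetically.
order = sorted((i for i in support if support[i] >= min_support),
               key=lambda i: (-support[i], i))
rank = {item: r for r, item in enumerate(order)}

# Each transaction with infrequent items dropped and the remaining items
# sorted by descending support, ready for insertion into the FP-tree.
sorted_transactions = [
    sorted((i for i in t if i in rank), key=lambda i: rank[i])
    for t in transactions
]
print(order)                   # ['i2', 'i1', 'i3', 'i4', 'i5']
print(sorted_transactions[0])  # T1 becomes ['i2', 'i1', 'i5']
```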
FP-tree construction:

Figures 1-9: FP-tree construction steps 1-9, showing the tree after reading TID = 1 through TID = 9. [tree diagrams omitted]
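Continuing the sketch above (again an added illustration, reusing sorted_transactions from the previous snippet), the following builds the tree of Figure 9: each transaction's items follow or extend a shared prefix path, and counts on shared nodes are incremented.

```python
# FP-tree construction sketch: one node per (item, path prefix),
# children keyed by item so shared prefixes merge into one path.
class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(sorted_transactions):
    root = Node(None)  # null root
    for t in sorted_transactions:
        node = root
        for item in t:  # follow or extend the path for this transaction
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

show(build_fp_tree(sorted_transactions))
# Printed tree (cf. Figure 9):
# i2:7
#   i1:4
#     i5:1
#     i4:1
#     i3:2
#       i5:1
#   i4:1
#   i3:2
# i1:2
#   i3:2
```

A full implementation would also maintain the node-links of Table 2, chaining together all nodes for the same item; they are omitted here for brevity.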
FP-tree mining, step 1: frequent itemset generation for paths ending in i5.

Prefix paths ending in i5 and the conditional FP-tree for i5: [diagram omitted]
Conditional pattern base for i5: PB = {(i2, i1: 1), (i2, i1, i3: 1)}
Conditional FP-tree for i5: CFP = <i2: 2, i1: 2>
Applying FP-growth on the CFP yields the frequent itemsets (sup ≥ 2): {i5: 2}, {i2, i5: 2}, {i1, i5: 2}, {i2, i1, i5: 2}

Figure 10: FP-tree mining step 1

FP-tree mining, step 2: frequent itemset generation for paths ending in i4.

Prefix paths ending in i4 and the conditional FP-tree for i4: [diagram omitted]
Conditional pattern base for i4: PB = {(i2, i1: 1), (i2: 1)}
Conditional FP-tree for i4: CFP = <i2: 2>
Applying FP-growth on the CFP yields the frequent itemsets (sup ≥ 2): {i4: 2}, {i2, i4: 2}

Figure 11: FP-tree mining step 2

FP-tree mining, step 3: frequent itemset generation for paths ending in i3.

Prefix paths ending in i3 and the conditional FP-tree for i3: [diagram omitted]
Conditional pattern base for i3: PB = {(i2, i1: 2), (i2: 2), (i1: 2)}
Conditional FP-tree for i3: CFP = <i2: 4, i1: 2>, <i1: 2>
Applying FP-growth on the CFP yields the frequent itemsets (sup ≥ 2): {i3: 6}, {i2, i3: 4}, {i1, i3: 4}, {i2, i1, i3: 2}

Figure 12: FP-tree mining step 3
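Before continuing with steps 4 and 5, here is a sketch (an added illustration reusing the tree from the previous snippets) that derives the conditional pattern bases above: for every occurrence of a target item in the tree, collect the prefix path leading to it together with that occurrence's count.

```python
# Conditional pattern base sketch: walk the FP-tree and record, for each
# node holding the target item, its prefix path and count. (A real
# implementation follows the header table's node-links instead of
# scanning the whole tree.)
def pattern_base(root, target):
    bases = []
    def walk(node, prefix):
        for child in node.children.values():
            if child.item == target and prefix:
                bases.append((prefix, child.count))
            walk(child, prefix + [child.item])
    walk(root, [])
    return bases

tree = build_fp_tree(sorted_transactions)
print(pattern_base(tree, "i5"))  # [(['i2', 'i1'], 1), (['i2', 'i1', 'i3'], 1)]
print(pattern_base(tree, "i4"))  # [(['i2', 'i1'], 1), (['i2'], 1)]
print(pattern_base(tree, "i3"))  # [(['i2', 'i1'], 2), (['i2'], 2), (['i1'], 2)]
```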
FP-tree mining, step 4: frequent itemset generation for paths ending in i1.

Prefix paths ending in i1 and the conditional FP-tree for i1: [diagram omitted]
Conditional pattern base for i1: PB = {(i2: 4)}
Conditional FP-tree for i1: CFP = <i2: 4>
Applying FP-growth on the CFP yields the frequent itemsets (sup ≥ 2): {i1: 6}, {i2, i1: 4}

Figure 13: FP-tree mining step 4

FP-tree mining, step 5: frequent itemset generation for paths ending in i2.

Prefix paths ending in i2: [diagram omitted]
Conditional pattern base for i2: PB = {} (i2 occurs only directly under the root, so it has no prefix paths)
Conditional FP-tree for i2: CFP = empty
The only frequent itemset generated in this step (sup ≥ 2) is {i2: 7}

Figure 14: FP-tree mining step 5
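As a cross-check of steps 1-5, the following brute-force sketch (added here for illustration; it is not the FP-growth algorithm) enumerates all candidate itemsets and counts their support directly against Table 1, reusing transactions and min_support from the first snippet.

```python
# Brute-force verification of the frequent itemsets found by FP-growth.
from itertools import combinations

items = sorted({i for t in transactions for i in t})
frequent = {}
for k in range(1, len(items) + 1):
    for itemset in combinations(items, k):
        sup = sum(1 for t in transactions if set(itemset) <= set(t))
        if sup >= min_support:
            frequent[itemset] = sup
print(frequent)
# Singletons i1:6, i2:7, i3:6, i4:2, i5:2; pairs (i1,i2):4, (i1,i3):4,
# (i1,i5):2, (i2,i3):4, (i2,i4):2, (i2,i5):2; triples (i1,i2,i3):2,
# (i1,i2,i5):2 -- the same itemsets produced by mining steps 1-5.
```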
3. Bayesian classification: 8 pts

(a) Explain Bayes' theorem, P(Y|X) = P(X|Y) P(Y) / P(X). How is it derived? (2 pts)

(b) Explain the principles of using Bayes' theorem for classification. What assumption is important? Give an example. (2 pts)

(c) How are continuous variables handled in Bayesian classification? (2 pts)

(d) What is a Bayesian belief network? (2 pts)

4. Data stream mining: 8 pts

(a) Give three examples of requirements for data stream mining that make it different from regular data mining. (2 pts)

(b) Outline the algorithm for computing moving averages over streams. Explain how the special requirements for streaming data are addressed. (2 pts)

(c) Outline the DenStream algorithm. (2 pts)

(d) What properties of DBSCAN make it unfit for data stream mining? (2 pts)

5. Cluster validation: 8 pts

Table 3: Confusion matrix for Question 5

Cluster   Entertainment   Financial   Foreign   Metro   National   Sports   Total
#1        1               1           0         11      4          676      693
#2        27              89          333       827     253        33       1562
#3        326             465         8         105     16         29       949
Total     354             555         341       943     273        738      3204

(a) Compute the entropy and purity for the confusion matrix in Table 3. The entropy of a single cluster i is given by e_i = -Σ_{j=1}^{L} p_ij log2(p_ij), where p_ij is the probability that a member of cluster i belongs to class j, and L is the number of classes. (6 pts)

Answer: see Table 4 below.

(b) Compute the precision, recall and F-measure for the Sports class of cluster #1 and for the Metro class of cluster #2. (2 pts)
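The quantities asked for in Question 5 can be checked with the short sketch below (an added illustration, not part of the exam): it computes entropy and purity per cluster from Table 3, and precision, recall and F-measure for a given (cluster, class) pair. Its output matches the answer table that follows.

```python
# Cluster validation metrics for the confusion matrix of Table 3
# (illustrative sketch; rounding matches the answer table below).
from math import log2

classes = ["Entertainment", "Financial", "Foreign", "Metro", "National", "Sports"]
matrix = {
    1: [1, 1, 0, 11, 4, 676],
    2: [27, 89, 333, 827, 253, 33],
    3: [326, 465, 8, 105, 16, 29],
}

def entropy(row):
    n = sum(row)
    return -sum(c / n * log2(c / n) for c in row if c > 0)

def purity(row):
    return max(row) / sum(row)

def prf(cluster, cls):
    j = classes.index(cls)
    row = matrix[cluster]
    precision = row[j] / sum(row)                        # fraction of the cluster
    recall = row[j] / sum(matrix[i][j] for i in matrix)  # fraction of the class
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

for i, row in matrix.items():
    print(i, round(entropy(row), 2), round(purity(row), 2))
print(prf(1, "Sports"))  # approx. (0.98, 0.92, 0.94)
print(prf(2, "Metro"))   # approx. (0.53, 0.88, 0.66)
```

The Entropy and Purity entries of the Total row in Table 4 are the size-weighted averages of the per-cluster values, e.g. (693 × 0.20 + 1562 × 1.84 + 949 × 1.70) / 3204 ≈ 1.44.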
Table 4: Answers to the confusion matrix of Question 5

Cluster   Entertainment   Financial   Foreign   Metro   National   Sports   Total   Entropy   Purity
#1        1               1           0         11      4          676      693     0.20      0.98
#2        27              89          333       827     253        33       1562    1.84      0.53
#3        326             465         8         105     16         29       949     1.70      0.49
Total     354             555         341       943     273        738      3204    1.44      0.61

Answer: Sports class (cluster #1): precision = 0.98, recall = 0.92, F-measure = 0.94. Metro class (cluster #2): precision = 0.53, recall = 0.88, F-measure = 0.66.

Good Luck! / Kjell & Tore