Second Semester Examinations 2014/15. Data Mining and Visualisation


PAPER CODE NO.: COMP527
EXAMINER: Dr. Danushka Bollegala
DEPARTMENT: Computer Science    Tel. No. 0151 7954283

Second Semester Examinations 2014/15

Data Mining and Visualisation

TIME ALLOWED: Two and a Half Hours

INSTRUCTIONS TO CANDIDATES

Answer FOUR questions. If you attempt to answer more questions than the required number of questions (in any section), the marks awarded for the excess questions answered will be discarded (starting with your lowest mark).

PAPER CODE COMP527 page 1 of 7 Continued

Question 1

A. State the two main types of data mining models. (2 marks)

Predictive models and descriptive models. Each point will be assigned 1 mark.

B. Consider that you measured the height and weight of 100 students for a health survey. For 20 students in your sample you could only measure either their height or their weight, but not both. Assume that we would like to train a binary classifier to predict whether a student is overweight compared to the other students in this dataset. Answer the following questions about this experiment.

(a) State two algorithms that you can use to learn a binary classifier for this purpose. (2 marks)

Logistic regression, SVM, perceptron, etc.

(b) What is meant by the missing-value problem in data mining?

Some of the feature values (attributes) in the data might be missing, either because the measurements were not taken or because the data was corrupted.

(c) State two disadvantages we will encounter if we ignore the 20 instances with incomplete measurements and use the remaining 80 instances to train the classifier.

The remaining dataset might be too small to learn anything useful (underfitting), or the classifier might overfit to it. In addition, the discarded instances might contain useful information about the target task.

(d) The average height of the students in this dataset is 169 cm. Provide a reason for and a reason against using this average to fill in the missing values.

For: it is a typical value for the height of the students. Against: the 20 students for whom we do not have height measurements could be outliers.

(e) Assume that we would like to check whether there is any correlation between the height and the weight of the students in this dataset. How do we check this?

We could measure the Pearson correlation coefficient between height and weight; if it is high, we could conclude that there is a strong correlation between the two variables.
(f) Given that there is a high correlation between the height and the weight of a student, how can we use this information to overcome the missing-value problem?

We could learn a linear relationship between the two variables using a technique such as linear regression, and then use the learnt predictor to fill in the missing values. We can then train a binary classifier using these predicted data points as well as the original data points.

(g) Without having access to a separate test dataset, how can we evaluate the accuracy of our binary classifier? (2 marks)

We can set aside a portion of the training data as held-out data and evaluate the classifier on that portion.
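The answers to parts (e) and (f) can be sketched in Python. This is an illustrative sketch, not code from the paper: the weight/height values and the function names are invented for demonstration.

```python
# Sketch of Q1 (e) and (f): check the height/weight correlation, then fill
# missing heights by regressing height on weight over the complete cases.
# The data values below are made up for illustration.

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# Complete cases as (weight in kg, height in cm)
complete = [(55, 160), (60, 165), (70, 172), (80, 180)]
weights = [w for w, _ in complete]
heights = [h for _, h in complete]

r = pearson(weights, heights)          # a high r justifies the imputation
a, b = fit_line(weights, heights)
missing_weight = 65                    # a student whose height is missing
imputed_height = a * missing_weight + b
```

If the correlation r is close to 1, the regression-based imputation of part (f) is well justified; with a weak correlation, the predicted heights would add mostly noise.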

Question 2

Assume that we are trying to learn a binary sentiment classifier from Amazon product reviews. Each review is assigned a rating (1-5 stars) by a user. We have 1000 such reviews for training purposes and a separate collection of 1000 reviews for testing. Answer the following questions about this experiment.

A. Define what is meant by unigrams and bigrams. (2 marks)

A unigram is a single word, whereas a bigram is a pair of two consecutive words.

B. Why would it be a good idea to use bigrams as well as unigrams to represent reviews in this task?

Negations such as "not like" can only be captured using bigrams.

C. Propose a method to assign binary target labels to this dataset such that we could train a binary sentiment classifier from it.

For example, we could assign positive labels to reviews that have 4 or 5 star ratings and negative labels to reviews that have 1 or 2 star ratings. We could ignore reviews that have a rating value of 3.

D. Assume that we trained a logistic regression classifier from this binary-labelled dataset. How can we find out which features are most useful when predicting positive sentiment in Amazon reviews?

Sort the features in descending order of their weights in the final weight vector. The top positively weighted features are the ones that are most useful when predicting positive sentiment.

E. What is meant by stop words in text mining? (2 marks)

Stop words are non-content words such as prepositions and articles, for example "the", "an", and "what".

F. What effect would removing stop words have in our sentiment classification task?

It would reduce the dimensionality of the feature space, thereby speeding up both the training and testing stages.

G. Assume that our test dataset turns out to have 700 positive instances and 300 negative instances. What would be the classification accuracy of a random guessing algorithm on our test dataset? Explain your answer.

A random guesser predicts the positive and negative classes with probability 0.5 each. Therefore, it will correctly classify 350 of the 700 positive instances and 150 of the 300 negative instances, giving 350 + 150 = 500 correctly classified instances in total and a classification accuracy of 500/1000 = 50%.

H. For the unbalanced test dataset described in part G, what would be the accuracy obtained by a prediction algorithm that always predicts an instance to be positive? Explain your answer.

Because there are 700 positive instances in the test dataset and all of them will be correctly classified by this predictor, we will have 700/1000 = 70% accuracy.
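A short sketch of the feature extraction in parts A and B, together with the baseline accuracies from parts G and H; the example review is made up for illustration.

```python
# Unigrams are single tokens; bigrams are pairs of consecutive tokens.
def unigrams(tokens):
    return list(tokens)

def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

tokens = "i do not like this product".split()
# the bigram ('not', 'like') captures the negation that unigrams miss
negation_captured = ("not", "like") in bigrams(tokens)

# Parts G and H: baseline accuracies on a 700/300 positive/negative test set.
n_pos, n_neg = 700, 300
random_acc = (0.5 * n_pos + 0.5 * n_neg) / (n_pos + n_neg)  # 0.5
always_pos_acc = n_pos / (n_pos + n_neg)                    # 0.7
```

The always-positive baseline beating the random guesser (0.7 vs 0.5) is why accuracy alone can be misleading on unbalanced test sets.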

Question 3

Consider a training dataset consisting of four instances (x1, +1), (x2, +1), (x3, −1), and (x4, −1), where x1 = (1, 1)ᵀ, x2 = (−1, 1)ᵀ, x3 = (−1, −1)ᵀ, and x4 = (1, −1)ᵀ. Here, xᵀ denotes the transpose of vector x. We would like to train a binary Perceptron to classify the four instances in this dataset. For this question, ignore the bias term b in the Perceptron and answer the following.

A. Let us predict an instance x to be positive if wᵀx ≥ 0, and negative otherwise. Initialising w = (0, 0)ᵀ, show that after observing x1, x2, x3, and x4 in that order, the weight vector will be −x3 − x4. (6 marks)

When w = 0 we have wᵀx1 = 0, so x1 is correctly predicted as positive; the same applies to x2. However, x3 will be misclassified as positive, and the weight vector will be updated to w = 0 − x3 = −x3. Next, wᵀx4 = −x3ᵀx4 = 0, so x4 will also be incorrectly classified as positive. Therefore, w = −x3 − x4.

B. If we present the four instances in the reverse order (x4, −1), (x3, −1), (x2, +1), (x1, +1) to the Perceptron, what would be the final value of the weight vector at the end of the first iteration?

−x4 − x3. Both x4 and x3 are misclassified as positive (wᵀx4 = 0 and then −x4ᵀx3 = 0), giving w = −x4 − x3 = (0, 2)ᵀ; x2 and x1 are then classified correctly, so no further updates occur.

C. Normalise each of the four instances x1, x2, x3, and x4 to unit L2 length.

Each vector has length √2, so all the normalised vectors will have a factor 1/√2 in front: x̃i = xi/√2.

D. What would be the final weight vector after observing the four instances if you used the L2-normalised training instances instead of the original (unnormalised) instances to train the Perceptron, as in part (A) above?

−(1/√2)(x3 + x4).

E. Now let us re-assign the target labels for this dataset as follows: (x1, +1), (x2, −1), (x3, +1), (x4, −1). Can we use the Perceptron algorithm to linearly classify this revised dataset? Justify your answer.

No. With these labels the dataset forms the XOR configuration and is no longer linearly separable. Answers that either plot the data points in the 2D space or use some other method to show this will receive full marks. If no justification is given, such answers will receive 2 marks.

F. Describe a method to learn a binary linear classifier for the revised dataset described in part (E) above.

Kernelised versions, such as using the product of the two features as a third feature, will receive full marks.
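The updates in part A can be verified with a minimal sketch of the Perceptron rule stated in the question (predict positive when wᵀx ≥ 0; on a mistake, add y·x to w). The implementation below is my own, not code from the paper.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def perceptron_pass(data, w=(0, 0)):
    """One pass of the bias-free Perceptron over (x, y) pairs, y in {-1, +1}."""
    w = list(w)
    for x, y in data:
        pred = 1 if dot(w, x) >= 0 else -1
        if pred != y:                                   # mistake-driven update
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

x1, x2, x3, x4 = (1, 1), (-1, 1), (-1, -1), (1, -1)
data = [(x1, 1), (x2, 1), (x3, -1), (x4, -1)]
print(perceptron_pass(data))        # [0, 2], i.e. -x3 - x4
print(perceptron_pass(data[::-1]))  # [0, 2], i.e. -x4 - x3
```

Both presentation orders end at w = (0, 2)ᵀ, which separates the two positive instances (second coordinate +1) from the two negative ones (second coordinate −1).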

Question 4

Consider the dataset shown in Table 1, from which we would like to learn a classifier that predicts whether Play = yes using the four features Outlook, Temperature, Humidity, and Windy. Answer the following questions about this dataset.

Table 1: Weather dataset for decision tree learning.

Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     mild         high      false  no
sunny     cool         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no

A. State three problems that are frequently observed in rule-based classifiers. (6 marks)

Any three of the following: they are likely to overfit the training data, can be time-consuming to learn when the dataset is large, are too sensitive to noise in the training data, and cannot produce confidence scores. Each point will receive 2 marks.

B. Using the dataset shown in Table 1, compute the coverage and the accuracy of the rule: IF Outlook = sunny THEN Play = yes. (6 marks)

The rule covers 5 of the 14 cases, so its coverage is 5/14. Of those 5 matches, 2 have Play = yes, so the accuracy of the rule is 2/5. Correct answers for coverage will receive 3 marks and correct answers for accuracy will receive 3 marks.

C. Using Table 1, compute the conditional probabilities P(Play = yes | Outlook = sunny), P(Play = yes | Outlook = overcast), and P(Play = yes | Outlook = rainy). (6 marks)

P(Play = yes | Outlook = sunny) = 2/5, P(Play = yes | Outlook = overcast) = 4/4, and P(Play = yes | Outlook = rainy) = 3/5.

D. Use the Bayes rule to compute P(Outlook = sunny | Play = yes).

P(Outlook = sunny | Play = yes) = P(Play = yes | Outlook = sunny) P(Outlook = sunny) / P(Play = yes) = (2/5) × (5/14) × (14/9) = 2/9.

E. Describe a method to overcome zero probabilities when computing the likelihood of an event that can be decomposed into a product of multiple independent events.

Answers that describe Laplace smoothing or any other smoothing method will receive full marks.
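The conditional probabilities in part C and the Laplace smoothing of part E can be illustrated with a short sketch over the Outlook and Play columns of Table 1 (the helper name is my own):

```python
from collections import Counter

# (Outlook, Play) pairs transcribed from Table 1
rows = [("sunny", "no"), ("sunny", "no"), ("overcast", "yes"), ("rainy", "yes"),
        ("rainy", "yes"), ("rainy", "no"), ("overcast", "yes"), ("sunny", "no"),
        ("sunny", "yes"), ("rainy", "yes"), ("sunny", "yes"), ("overcast", "yes"),
        ("overcast", "yes"), ("rainy", "no")]

outlook_counts = Counter(o for o, _ in rows)
joint_counts = Counter(rows)

def p_yes_given(outlook, alpha=0.0, n_classes=2):
    """P(Play = yes | Outlook), with optional Laplace pseudo-count alpha."""
    return ((joint_counts[(outlook, "yes")] + alpha)
            / (outlook_counts[outlook] + alpha * n_classes))

print(p_yes_given("sunny"))              # 2/5 = 0.4
print(p_yes_given("overcast"))           # 4/4 = 1.0: an extreme estimate
print(p_yes_given("overcast", alpha=1))  # (4+1)/(4+2), pulled away from 1
```

With alpha = 1, no conditional probability can be exactly 0 or 1, which is precisely what part E asks smoothing to achieve.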

Question 5

Big datasets and the availability of high-performance computing resources such as GPUs have given birth to the so-called Big Data Mining era. By combining different datasets and performing pattern analysis across datasets, we can discover trends that were not previously possible to detect using small-scale individual datasets. Big Data Mining has received much attention not only from academia but also from industry. Answer the following questions about Big Data Mining.

A. Explain three challenges we face when performing data mining on large datasets. (12 marks)

B. Propose a separate solution to each of the challenges that you described in part A. (13 marks)

Some of the important challenges and their solutions are:

(a) Resolving ambiguities when merging datasets (named-entity resolution, word-sense disambiguation)
(b) Privacy issues (privacy-preserving data mining)
(c) Difficulties in loading large datasets into memory to train classification/clustering algorithms (online learning, distributed machine learning)
(d) Ethical issues in data collection (anonymised data)
(e) Reliability issues (statistical confidence tests)

Answers that elaborate along these lines will receive full marks.
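As an illustration of the online-learning solution to challenge (c), the sketch below fits a logistic regression classifier with single-pass stochastic gradient descent, touching one instance at a time so the dataset never has to fit in memory. The toy stream and learning rate are my own invented choices, not part of the exam.

```python
import math

def sgd_logistic_stream(stream, dim, lr=0.1):
    """One SGD pass of logistic regression over an instance stream; y in {0, 1}."""
    w = [0.0] * dim
    for x, y in stream:
        z = sum(wi * xi for wi, xi in zip(w, x))
        p = 1.0 / (1.0 + math.exp(-z))       # predicted P(y = 1 | x)
        for i in range(dim):
            w[i] += lr * (y - p) * x[i]      # gradient step on one instance
    return w

# Toy stream: the label follows the sign of the first feature.
stream = [((1.0, 0.5), 1), ((-1.0, 0.2), 0),
          ((0.8, -0.3), 1), ((-0.9, 0.1), 0)] * 50
w = sgd_logistic_stream(stream, dim=2)
```

In a real deployment the stream would be read lazily from disk or a socket; only the weight vector w stays in memory, regardless of dataset size.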