Fall 2015
COMPUTER SCIENCES DEPARTMENT
UNIVERSITY OF WISCONSIN-MADISON
PH.D. QUALIFYING EXAMINATION

Artificial Intelligence
Monday, September 21, 2015

GENERAL INSTRUCTIONS

1. This exam has 10 numbered pages.
2. Answer each question in a separate book.
3. Indicate on the cover of each book the area of the exam, your code number, and the question answered in that book. On one of your books, list the numbers of all the questions answered. Do not write your name on any answer book.
4. Return all answer books in the folder provided. Additional answer books are available if needed.

SPECIFIC INSTRUCTIONS

You should answer:

1. both questions in the section labeled 760 MACHINE LEARNING, and
2. two additional questions in another selected section, 7xx, where both questions must come from the same section.

Hence, you are to answer a total of four questions.

POLICY ON MISPRINTS AND AMBIGUITIES

The Exam Committee tries to proofread the exam as carefully as possible. Nevertheless, the exam sometimes contains misprints and ambiguities. If you are convinced that a problem has been stated incorrectly, mention this to the proctor. If necessary, the proctor can contact a representative of the area to resolve problems during the first hour of the exam. In any case, you should indicate your interpretation of the problem in your written answer. Your interpretation should be such that the problem is nontrivial.

760 MACHINE LEARNING: REQUIRED QUESTIONS

760-1 Naïve Bayes, Linear Models and Ensembles

1. Explain why naïve Bayes is a linear model. What are the coefficients of the linear model?
2. Explain why naïve Bayes can be viewed as an ensemble of features.
3. Are linear models and ensembles the same? Why or why not?
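For reference on part 1, here is a sketch of the standard derivation, assuming for concreteness binary features x_i in {0, 1} with Bernoulli class-conditionals (the question itself leaves the feature distribution open): the log-odds of the naïve Bayes posterior factorize over features and collapse to a linear function whose coefficients are log likelihood ratios.

```latex
% Log-odds of the naive Bayes posterior; \theta_{ic} = p(x_i = 1 \mid y = c) is assumed.
\log\frac{p(y=1\mid x)}{p(y=0\mid x)}
  = \log\frac{p(y=1)}{p(y=0)} + \sum_{i=1}^{d}\log\frac{p(x_i\mid y=1)}{p(x_i\mid y=0)}
  = w_0 + \sum_{i=1}^{d} w_i x_i,
% where the coefficients collect the per-feature likelihood-ratio terms:
\qquad
w_i = \log\frac{\theta_{i1}\,(1-\theta_{i0})}{\theta_{i0}\,(1-\theta_{i1})},
\qquad
w_0 = \log\frac{p(y=1)}{p(y=0)} + \sum_{i=1}^{d}\log\frac{1-\theta_{i1}}{1-\theta_{i0}}.
```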

760-2 Online Learning for Regression

Consider the task of learning a regression model from wearable sensor data in an online setting. For example, suppose we want a model that represents an individual's heart rate as a function of accelerometer measurements, temperature, altitude, time of day, etc. Assume that all of the variables, including heart rate, are observable and sampled at the same frequency. Even though heart rate is measured during training, we are interested in modeling it to gain biological insight and to be able to predict it when it is not directly measured.

1. Define the concept of online learning.
2. Describe how you would approach this as a supervised learning task. Specify the learning algorithm you would use and justify this choice.
3. If concept drift were expected over time (i.e., the relationship between heart rate and the other variables changes), how would you adjust your approach?
4. How could the bias/variance tradeoff be controlled when using the algorithm you described above?
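As one concrete, non-prescriptive instantiation of parts 1 and 2, here is a minimal sketch of online linear regression trained with per-sample stochastic gradient descent (the LMS rule). The feature layout and the synthetic `sensor_stream` are hypothetical stand-ins for real sensor readings.

```python
import numpy as np

class OnlineLinearRegressor:
    """Minimal online regression via per-sample SGD (the LMS rule).

    Each call to update() consumes one (features, heart_rate) pair as it
    arrives from the sensors; no batch of past data is stored.
    """

    def __init__(self, n_features, lr=0.05):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr  # constant step size; see the note on drift below

    def predict(self, x):
        return self.w @ x + self.b

    def update(self, x, y):
        err = self.predict(x) - y   # gradient of the squared error
        self.w -= self.lr * err * x
        self.b -= self.lr * err

# hypothetical synthetic stream standing in for live readings:
# features = [accelerometer magnitude, temperature, altitude, hour]
rng = np.random.default_rng(0)
sensor_stream = ((x, 60.0 + 30.0 * x[0] + rng.normal(0.0, 2.0))
                 for x in rng.random((1000, 4)))

model = OnlineLinearRegressor(n_features=4)
for x_t, hr_t in sensor_stream:
    y_hat = model.predict(x_t)  # predict before seeing the label
    model.update(x_t, hr_t)     # then learn from the true heart rate
```

A decaying step size would give convergence on a stationary stream, while keeping it constant (or adding a sliding window over recent samples) lets the fit track concept drift, which is one handle on part 3.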

761 ADVANCED MACHINE LEARNING QUESTIONS

761-1 Labeling Features

Consider logistic regression with binary features f_1(x), ..., f_d(x) ∈ {-1, 1} and binary labels y ∈ {-1, 1}:

    p(y | x) = 1 / (1 + exp(-y (∑_{i=1}^{d} w_i f_i(x) + w_0)))        (1)

You may assume there is a large unlabeled data set available to you.

1. The standard way to train the model is to collect a labeled training set (x_1, y_1), ..., (x_n, y_n). Write down the optimization problem for finding the maximum likelihood estimate of w. (Hint: use the log likelihood.)

2. In addition to the labeled training set, suppose a domain expert also provides feature labels for some of the features. A feature label for feature f_j is a binary variable z_j ∈ {-1, 1}. It intuitively means that feature f_j is indicative of class z_j. Note we have intentionally left the definition vague for you to have your own interpretation. As an example, let x = (x_1, x_2) ∈ [0, 1]^2 and let the true class labels be as in the following figure. [figure omitted] Let f_1(x) = bool(x_1 ≥ b) be the Boolean function which takes value 1 if x_1 ≥ b, and value -1 otherwise. The domain expert labels f_1 as z_1 = 1 to indicate that f_1 is a positive feature. Intuitively, when f_1 fires (f_1(x) = 1, i.e., in the rectangle with corners (b, 0) and (1, 1)) the label is always positive. Similarly, let f_2(x) = bool(x_1 ≥ a). The domain expert also labels f_2 as z_2 = 1 to indicate that f_2 is (mostly) a positive feature.

John thinks he knows how to incorporate feature labels into logistic regression training. His idea is straightforward: add constraints to the weights in the optimization problem. If feature f_j has label z_j, John's constraint is

    z_j w_j ≥ 0.        (2)

What do you think of John's approach? Be sure to justify your answer. You may use the figure to help make your case.

3. Propose another approach to incorporate feature labels. Clearly state your assumptions. Explain your approach in sufficient detail.
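For part 1, here is a minimal numpy sketch of the maximum-likelihood objective implied by model (1), i.e., minimizing the negative log likelihood over (w, w_0). The data at the bottom is a hypothetical stand-in.

```python
import numpy as np

def neg_log_likelihood(w, w0, F, y):
    """Negative log likelihood of model (1).

    F : (n, d) array with F[j, i] = f_i(x_j) in {-1, +1}
    y : (n,) array of labels in {-1, +1}
    """
    margins = y * (F @ w + w0)                  # y_j (sum_i w_i f_i(x_j) + w_0)
    return np.sum(np.logaddexp(0.0, -margins))  # sum_j log(1 + e^{-margin_j})

# hypothetical data and a plain gradient-descent MLE
rng = np.random.default_rng(0)
F = rng.choice([-1.0, 1.0], size=(200, 5))
y = np.where(F[:, 0] + 0.5 * rng.standard_normal(200) > 0, 1.0, -1.0)

w, w0 = np.zeros(5), 0.0
for _ in range(500):
    s = 1.0 / (1.0 + np.exp(y * (F @ w + w0)))  # sigma(-margin_j)
    grad_w = -(F * (y * s)[:, None]).mean(axis=0)
    grad_w0 = -(y * s).mean()
    w -= 0.1 * grad_w
    w0 -= 0.1 * grad_w0
```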

761-2 From Word Embedding to Document Distances

Consider the problem of clustering documents by their semantic similarity. Consider these two documents:

(doc1) Obama addresses the media in Illinois
(doc2) The President greets the press in Chicago

We can represent each document as a bag-of-words (BOW) vector as follows. We first define a vocabulary with V distinct words. The BOW vector has V dimensions. The ith dimension takes the integer value of the number of times the ith vocabulary word occurs in the document.

1. We can define a distance between two documents as the Euclidean distance between their BOW vectors. With respect to the ultimate goal of document clustering by semantic similarity, what is one major disadvantage of this distance? Use doc1 and doc2 as an example in your answer.

2. Word embedding maps the ith vocabulary word w_i to an m-dimensional real-valued vector x_i ∈ R^m. Recent advances in word embedding such as word2vec map semantically similar words to nearby points in R^m. For instance, if w_i = "Obama" and w_j = "President", then the Euclidean distance ||x_i - x_j|| is small. If each document has length one, we can simply use the word embedding Euclidean distance as the distance between documents. But what if each document has length two? Define a distance using word embedding Euclidean distances as building blocks and explain your idea. Your distance should make the four documents "Obama media", "President press", "media Obama", "press President" all close to each other (assuming "Obama" and "President" are close, and "media" and "press" are close).

3. Now define a document distance for the situation when the documents have arbitrary (not necessarily the same) lengths. Your distance should seek the overall best word match for the two documents, again using the word embedding Euclidean distances as building blocks. Note you may need to scramble the word order to achieve the best match, and you need to handle different document lengths. (Hint: you may normalize each BOW vector so its elements sum to one.) We ask you to precisely define this document distance by formulating it as an optimization problem. Be sure to include the variables, the objective function, and any constraints if appropriate. Be sure to explain your design in sufficient detail.
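For part 3, one well-known way to make "overall best word match" precise is an optimal-transport (earth mover's) formulation over the normalized BOW weights; below is a minimal sketch using scipy's linear-programming solver, with random stand-in embeddings. This construction is essentially the word mover's distance of Kusner et al. (2015).

```python
import numpy as np
from scipy.optimize import linprog

def document_distance(X1, X2, d1, d2):
    """Earth mover's distance between two documents.

    X1 : (n1, m) embeddings of the distinct words in doc 1
    X2 : (n2, m) embeddings of the distinct words in doc 2
    d1, d2 : normalized BOW weights (each sums to one)
    """
    n1, n2 = len(d1), len(d2)
    # pairwise word-embedding Euclidean distances as transport costs
    C = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=2)

    # flow variables T[i, j] >= 0, raveled row-major; marginals must match
    A_eq = np.zeros((n1 + n2, n1 * n2))
    for i in range(n1):
        A_eq[i, i * n2:(i + 1) * n2] = 1.0  # sum_j T[i, j] = d1[i]
    for j in range(n2):
        A_eq[n1 + j, j::n2] = 1.0           # sum_i T[i, j] = d2[j]
    b_eq = np.concatenate([d1, d2])

    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun

# toy usage with random stand-in embeddings (2-word vs. 3-word document)
rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(2, 300)), rng.normal(size=(3, 300))
print(document_distance(X1, X2, np.full(2, 0.5), np.full(3, 1.0 / 3.0)))
```

The objective minimizes total embedding distance moved; the equality constraints force every unit of word mass in one document to be matched somewhere in the other, which handles unequal lengths and arbitrary word order.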

766 ADVANCED COMPUTER VISION QUESTIONS

766-1 Lucas-Kanade Optical Flow

The Lucas-Kanade optical flow algorithm is among the most widely used methods for image alignment. This question deals with some of the formulation and optimization aspects of this algorithm.

1. Briefly describe the objective function that the Lucas-Kanade algorithm seeks to optimize. Provide some intuition behind the objective, and describe the variables being optimized and how they correspond to a solution of the image alignment problem.

2. Describe any one difference between the (a) Newton and (b) Gauss-Newton approaches when used within the Lucas-Kanade algorithm.

3. A key computational issue in an efficient implementation of the Lucas-Kanade algorithm is computing the Hessian efficiently. Briefly discuss this issue and identify at least one heuristic used in practice to reduce the computational burden.

4. Newton-type approaches in Lucas-Kanade implementations work best when the current estimate is close to a local minimum. Describe any one strategy you could use to start the estimation process when it is far from the locally optimal solution.

5. The classical formulations of optical flow tend to work best when the lighting variations across the two image frames are fairly small. Describe a variation of the standard procedure that you could use (or whether you would use a completely different algorithm) when there are significant changes in lighting.
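As a concrete reference for parts 1 and 2, here is a minimal numpy sketch of Lucas-Kanade restricted to a pure-translation warp with Gauss-Newton updates; the nearest-neighbor warping and fixed iteration cap are simplifications of real implementations.

```python
import numpy as np

def lucas_kanade_translation(T, I, n_iters=50):
    """Gauss-Newton Lucas-Kanade for a pure-translation warp (a sketch).

    Minimizes sum_x [ I(x + p) - T(x) ]^2 over the 2-vector p, using the
    first-order expansion I(x + p + dp) ~ I(x + p) + grad_I . dp.
    """
    p = np.zeros(2)
    Iy, Ix = np.gradient(I.astype(float))  # image gradients (rows, cols)
    ys, xs = np.mgrid[0:T.shape[0], 0:T.shape[1]]
    for _ in range(n_iters):
        # warp I by the current p (nearest-neighbor for simplicity)
        yw = np.clip((ys + p[0]).round().astype(int), 0, I.shape[0] - 1)
        xw = np.clip((xs + p[1]).round().astype(int), 0, I.shape[1] - 1)
        err = (T.astype(float) - I[yw, xw]).ravel()  # residual T - I(W(x; p))
        J = np.stack([Iy[yw, xw].ravel(), Ix[yw, xw].ravel()], axis=1)
        H = J.T @ J                                  # Gauss-Newton Hessian approx.
        dp = np.linalg.solve(H, J.T @ err)           # normal equations
        p += dp
        if np.linalg.norm(dp) < 1e-3:
            break
    return p

# toy usage on a smooth synthetic image: template is I shifted by (3, 2),
# so the recovered translation should be close to [-3, -2]
ys, xs = np.mgrid[0:64, 0:64]
I = np.sin(xs / 6.0) + np.cos(ys / 5.0)
T = np.roll(I, shift=(3, 2), axis=(0, 1))
print(lucas_kanade_translation(T, I))
```

The 2x2 matrix H here is exactly the Gauss-Newton (first-order) approximation to the Hessian that part 3 asks about; for richer warps its size grows with the number of warp parameters.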

766-2 Pyramid Match Kernels

The pyramid match kernel is an important technique used for image categorization problems in computer vision. This question deals with various technical details of this idea.

1. Briefly describe how the pyramid match kernel algorithm computes similarities between unordered sets of features (one set per image) to finally obtain a kernel matrix that can be used for regression and classification tasks.

2. Consider an alternative to the pyramid match kernel constructed in the following way: given a set of real-valued feature vectors derived from an image, construct a single flat histogram based on a number of pre-defined, quantized bins. This gives a "vocabulary of words" representation in which we simply count the frequency of occurrence of individual features over these quantized bins. Identify an advantage or limitation of the pyramid match kernel formulation over this alternative approach.

3. A key property of pyramid match kernels is that they satisfy Mercer's condition. Briefly describe why this property is relevant in image categorization experiments. Would this property be essential if we were using a k-nearest neighbors classifier?

4. Assume that our interest is not in image categorization; rather, we want to use pyramid match kernels simply to identify similar features (or objects) across a set of images. Describe a reasonable strategy for achieving partial match correspondences across images using pyramid match kernels.
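For part 1, here is a minimal sketch of the pyramid match similarity for sets of one-dimensional features (real descriptors are multi-dimensional, and the published kernel also uses randomly shifted bins; both are simplified away here). Histogram intersections are computed at geometrically coarser resolutions, and matches first formed at level l are discounted by 1/2^l. A full kernel matrix is obtained by evaluating this score for every pair of images.

```python
import numpy as np

def pyramid_match_kernel(x, y, L=4, d_max=16.0):
    """Pyramid match similarity between two sets of 1-D features (a sketch).

    x, y : 1-D arrays of feature values in [0, d_max)
    Histograms are intersected at levels l = 0..L with bin width 2^l;
    matches first found at level l are discounted by 1 / 2^l.
    """
    prev = 0.0
    score = 0.0
    for l in range(L + 1):
        width = 2.0 ** l
        edges = np.arange(0.0, d_max + width, width)
        hx, _ = np.histogram(x, bins=edges)
        hy, _ = np.histogram(y, bins=edges)
        inter = np.minimum(hx, hy).sum()      # matches visible at this level
        score += (inter - prev) / (2.0 ** l)  # new matches, distance-discounted
        prev = inter
    return score

# toy usage: two small feature sets with two near-matches
print(pyramid_match_kernel(np.array([1.2, 5.7, 9.3]),
                           np.array([1.0, 6.1, 15.2])))
```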

776 ADVANCED BIOINFORMATICS QUESTIONS

776-1 Genome Analysis without DNA Sequence

Suppose we are interested in studying the genome of species X, but instead of knowing the DNA sequence of the genome, we have multiple measurements of biochemical activity (e.g., transcriptional activity, levels of transcription factor binding, levels of histone modification) at each position of the genome. Specifically, we have m different real-valued biochemical measurements across the n positions of the genome, and thus the data may be represented by an m × n matrix, with measurements indexing rows and genome positions indexing columns.

1. Suppose we believe that each position of the genome belongs to one of k functional classes and that positions belonging to the same class have similar biochemical activity profiles. Describe an unsupervised approach for classifying each position as one of k functional classes that does not take positional information into account.

2. After classifying the genomic positions using your approach from (1), you wish to determine whether there are statistically significant dependencies between the functional classes of nearby positions. Describe an approach for detecting such dependencies, should they exist.

3. Assuming you detect dependencies in (2), describe an unsupervised approach for classifying the genomic positions that takes positional information, and thus the detected dependencies, into account.

4. Suppose we obtain the same types of biochemical activity measurements for the genome of species Y. Describe an approach for aligning the genomes of X and Y using the biochemical activity measurements instead of DNA sequence. You may assume that the genomes are collinear and composed of a single chromosome.
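For part 1, one natural unsupervised choice is k-means over the per-position activity profiles, treating each genome position as a point in R^m. A minimal sketch, with scikit-learn assumed and random data standing in for real measurements:

```python
import numpy as np
from sklearn.cluster import KMeans

# hypothetical data: m biochemical tracks measured at n genome positions
m, n, k = 8, 10000, 5
D = np.random.rand(m, n)  # stand-in for the real m x n measurement matrix

# each genome position is a point in R^m: cluster the columns of D
profiles = D.T            # shape (n, m)
labels = KMeans(n_clusters=k, n_init=10).fit_predict(profiles)
# labels[i] is the functional class assigned to genome position i;
# the clustering treats columns independently, so positional ordering
# is ignored, as part 1 requires
```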

776-2 Bayesian Networks for Gene Expression Networks

Recall the Bayesian network representation of gene networks. Suppose that you had gene expression levels of N genes measured in m different experimental conditions. That is, each gene has m measurements.

1. Let X_i denote the random variable for the expression level of the ith gene, and let Pa(X_i) denote the parents of X_i in a Bayesian network. Give two ways to model the conditional probability distributions (CPDs) P(X_i | Pa(X_i)), and describe two distinguishing properties of each. State the assumptions you need to make to use these forms of CPDs for gene expression data.

2. Briefly describe CPDs in the context of Module networks and how they are estimated.

3. Suppose that the module membership of a gene is also influenced by the promoter sequence of that gene. How would you change the Module network algorithm to incorporate this property?

4. Suppose you were told that the m different conditions come from k different classes. How would you extend Module networks to integrate the class of the experimental condition?
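For part 1, one of the standard CPD choices for continuous expression measurements is a linear Gaussian; here is a minimal sketch of its maximum-likelihood fit via least squares (the function and variable names are hypothetical).

```python
import numpy as np

def fit_linear_gaussian_cpd(X_parents, x_child):
    """ML fit of a linear Gaussian CPD: P(X_i | Pa(X_i)) = N(b + w . pa, sigma^2).

    X_parents : (m, p) expression of the p parent genes across m conditions
    x_child   : (m,)   expression of gene i across the same m conditions
    """
    A = np.column_stack([np.ones(len(x_child)), X_parents])  # intercept + parents
    coef, *_ = np.linalg.lstsq(A, x_child, rcond=None)       # [b, w_1, ..., w_p]
    resid = x_child - A @ coef
    sigma2 = resid @ resid / len(x_child)                    # ML variance estimate
    return coef, sigma2

# toy usage: two parent genes, 50 conditions, known linear relationship
rng = np.random.default_rng(0)
P = rng.normal(size=(50, 2))
c = 1.0 + P @ np.array([0.8, -0.5]) + rng.normal(0.0, 0.1, size=50)
print(fit_linear_gaussian_cpd(P, c))
```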

This page intentionally left blank. You may use it for scratch paper. Please note that this page will NOT be considered during grading.