Stat 613 Fall 2017 Genevera Allen

Material on the Midterm & What you need to know:

1. Regression, Penalized Regression, Non-linear Regression. For an applied word problem, you should be able to decide which method would be appropriate and justify your choice. You should be able to mathematically characterize properties of penalties and penalized regression estimators.
2. Classification: KNN, Nearest Centroid / Naive Bayes, Discriminant Analysis, Logistic / Multinomial Regression, SVMs / Kernel SVMs. For an applied word problem, you should be able to decide which method would be appropriate and justify your choice. You should be able to mathematically characterize properties of various classifiers.
3. Model Validation. You should be able to recognize situations where the process of statistical learning, model selection, and/or model assessment is done incorrectly or in a way that will bias the results. You should be able to set up correct procedures for selecting tuning parameters and assessing the model fit in applied scenarios.
4. Matrix Factorizations: PCA, Sparse PCA, ICA, NMF, and MDS. From an applied word problem, you should be able to decide which method would be appropriate and justify your choice.

5. Clustering: K-means, Hierarchical, and Biclustering. From an applied word problem, you should be able to decide which method would be appropriate and justify your choice.
6. Additionally, you should be able to examine a new problem mathematically to understand its properties and relate it to statistical learning methods covered in class.
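The model validation item above asks for correct procedures for selecting tuning parameters and assessing model fit. One common arrangement is nested cross-validation, sketched below in plain Python. This is an illustration under stated assumptions, not a procedure taken from the course: the one-predictor ridge fit stands in for whatever penalized method is being tuned, the data are synthetic, and every function name (`ridge_fit`, `k_folds`, `cv_error`, `nested_cv`) is hypothetical.

```python
import random

def ridge_fit(xs, ys, lam):
    """One-predictor ridge without intercept: beta = sum(x*y) / (sum(x^2) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def mse(beta, xs, ys):
    """Mean squared prediction error of the fitted slope on the given rows."""
    return sum((y - beta * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

def k_folds(indices, k):
    """Split a list of indices into k disjoint, roughly equal folds."""
    return [indices[i::k] for i in range(k)]

def cv_error(xs, ys, idx, lam, k):
    """Plain k-fold CV error for a fixed lam, using only the rows listed in idx."""
    total = 0.0
    for fold in k_folds(idx, k):
        held = set(fold)
        train = [i for i in idx if i not in held]
        beta = ridge_fit([xs[i] for i in train], [ys[i] for i in train], lam)
        total += mse(beta, [xs[i] for i in fold], [ys[i] for i in fold])
    return total / k

def nested_cv(xs, ys, lambdas, outer_k=5, inner_k=5, seed=0):
    """Outer loop assesses error; inner loop picks lam from outer-training rows only."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    outer_errors, chosen = [], []
    for fold in k_folds(idx, outer_k):
        held = set(fold)
        train = [i for i in idx if i not in held]
        # Tuning never sees the held-out outer fold.
        best = min(lambdas, key=lambda lam: cv_error(xs, ys, train, lam, inner_k))
        beta = ridge_fit([xs[i] for i in train], [ys[i] for i in train], best)
        outer_errors.append(mse(beta, [xs[i] for i in fold], [ys[i] for i in fold]))
        chosen.append(best)
    return sum(outer_errors) / outer_k, chosen

# Synthetic data (an assumption for illustration): y = 2x + unit-variance noise.
rng = random.Random(613)
xs = [rng.gauss(0.0, 1.0) for _ in range(100)]
ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
lambdas = [0.01, 0.1, 1.0, 10.0]

estimate, chosen = nested_cv(xs, ys, lambdas)
print(f"nested CV estimate of prediction MSE: {estimate:.3f}")
print("lambda chosen in each outer fold:", chosen)
```

The design choice to notice is that the rows used to score the model in the outer loop play no part in choosing `lam`, so selection and assessment stay separate.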

Sample Midterm Exam Questions

Disclaimer: These are sample questions from past exams that are meant to serve as examples of the types of questions that may be on your midterm. They are not comprehensive and do not reflect the full scope of problems that will be on the exam. Your actual exam questions may be harder and/or easier than these questions.

1. Suppose you are fitting a linear regression model with response $Y \in \mathbb{R}^{n}$ and predictors $X \in \mathbb{R}^{n \times p}$ where the columns of $X$ are orthogonal.
   (a) What is the solution to the lasso problem: minimize $\frac{1}{2} \| Y - X\beta \|_2^2 + \lambda \| \beta \|_1$?
   (b) What is the solution to the non-negative lasso problem: minimize $\frac{1}{2} \| Y - X\beta \|_2^2 + \lambda \| \beta \|_1$ subject to $\beta_j \geq 0$ for $j = 1, \ldots, p$?

2. A business analyst is trying to predict market demand for a product over the next six months. He has 90 features of interest measured from 275 stores and decides to use the elastic net for his prediction. To select the optimal regularization parameters, he uses five-fold cross-validation. As his boss wants an estimate of the prediction error, he runs five-fold cross-validation again. For each fold, he fits the elastic net with the previously selected regularization parameter values to four-fifths of the data and uses the one-fifth left out to estimate the prediction error. He averages the prediction error over the five folds and reports this to his boss. Is this an unbiased estimate of the prediction error? If so, why? If not, why not and how would you alter the procedure to obtain an unbiased estimate?

3. A friend proposes a new penalty function $P_\gamma(t) = \frac{\log(\gamma |t| + 1)}{\log(\gamma + 1)}$ for some parameter $\gamma > 0$. Suppose you use this function in a linear regression setting, minimizing $\frac{1}{2} \| Y - X\beta \|_2^2 + \lambda P_\gamma(\beta)$.

   (a) What is the behavior of this penalty function? Justify this mathematically.
   (b) If $\gamma \to 0$, to which other penalty is this most similar?
   (c) If $\gamma \to \infty$, to which other penalty is this most similar?
   (d) Is this penalty convex?
   (e) Describe a scenario in which you may want to use this penalty over other more common penalties.

4. For each of the following classification scenarios, which method would you recommend? Why? If you feel like you need more information, specify what information you need and how this information would change your recommendation.
   (a) A scientist cares only about misclassification error. She is trying to predict 10 classes based on 530 samples and 62 predictors.
   (b) A scientist wants to find out which variables are most important for classifying between two classes. He has 180 samples and 5600 features.
   (c) A scientist has data that is highly correlated. She wants to find out which variables are most important for classifying between two classes.
   (d) A scientist wants to classify between two classes with 8000 observations and 64 features. He cares only about prediction error.

5. A medical researcher runs PCA on his microarray data consisting of 24,000 genes and 105 Glioblastoma tumor samples. The scatterplot of PC1 versus PC2 reveals three tight clusters of the samples. The researcher is elated as he thinks he has discovered three new subtypes (groups of patients exhibiting similar genomic profiles) of Glioblastoma. To check his discovery, he runs PCA on a similar microarray data set with 19,000 genes and 72 samples obtained from a colleague. The scatterplot of PC1 versus PC2 no longer shows any clustering of the samples. The researcher is now confused and unsure of which set of results he should believe.
   (a) What happened here? What could explain the researcher's findings?
   (b) Would you recommend another approach? If so, what? Justify your responses.

6. For each of the following, choose the best combination of loss function plus penalty from the following lists. Justify your choice.
   Loss functions: absolute error, squared error, logistic loss, hinge loss.
   Penalties: lasso, ridge, adaptive lasso, elastic net.
   (a) An oil company has measured $p \approx 10{,}000$ geological features (very complex, highly correlated features) for $n \approx 1{,}000{,}000$ samples of prospective locations for drilling a new well. They want to find out which features are most important for predicting a well's two-year productivity levels (continuous).
   (b) An online advertising company is trying to predict whether an individual will like a YouTube video based on their demographic information and browsing history. They have a sample of $n = 11{,}923$ likes or dislikes for the video and $p = 62$ features.
   (c) A scientist has tested $n = 52$ rats for sensitivity to a particular drug (continuous) along with a custom-built protein array ($p = 648$). She wants to know not only which proteins are associated with drug sensitivity, but also the extent to which they are associated.
   (d) A neuroscientist wants to build a neural decoder that can most accurately classify whether a rat is moving to the left or the right in a maze based on the firing patterns of $p \approx 5{,}000$ recorded neurons. The rat was in the maze for a total of $n = 320$ time segments for which the direction (left or right) of movement was recorded.

7. List similarities AND differences between the two given methods.
   (a) Quadratic Discriminant Analysis vs. Support Vector Machines with a second degree polynomial kernel.
   (b) Hierarchical Clustering vs. Forward Step-wise Regression.

   (c) K-Means Clustering vs. Naive Bayes classifier.
   (d) Adaptive Lasso vs. Lasso.
   (e) Multi-Dimensional Scaling vs. Principal Components Analysis.
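
For a question like the penalty-function one above, it can help to tabulate the penalty before reasoning about it mathematically. The sketch below assumes the penalty reads $P_\gamma(t) = \log(\gamma |t| + 1) / \log(\gamma + 1)$; that reading is a reconstruction, and the function name `log_penalty` and the grid of values are hypothetical. It is a numerical exploration only, not a proof or a model answer.

```python
import math

def log_penalty(t, gamma):
    """Assumed form: P_gamma(t) = log(gamma*|t| + 1) / log(gamma + 1), gamma > 0."""
    # log1p(x) computes log(1 + x) accurately even when x is tiny.
    return math.log1p(gamma * abs(t)) / math.log1p(gamma)

ts = [0.0, 0.25, 0.5, 1.0, 2.0]

# Tabulate the penalty on a small grid for a tiny, a moderate, and a huge gamma.
for gamma in (1e-6, 1.0, 1e12):
    values = [round(log_penalty(t, gamma), 3) for t in ts]
    print(f"gamma={gamma:g}: {values}")

# Midpoint check on [0, 1] with gamma = 1: compare P(0.5) with the chord value.
chord_mid = 0.5 * (log_penalty(0.0, 1.0) + log_penalty(1.0, 1.0))
print("P(0.5) =", round(log_penalty(0.5, 1.0), 3), "chord midpoint =", chord_mid)
```

Comparing each printed row with $|t|$ and with the indicator of $t \neq 0$ is a quick way to form a conjecture that can then be justified analytically.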