STATS216v Introduction to Statistical Learning, Stanford University, Summer 2016. Final (Solutions). Duration: 3 hours.


Instructions: Remember the university honor code. Write your name and SUNet ID (ThisIsYourSUNetID@stanford.edu) on each page. There are 25 questions in total. All questions are of equal value and are meant to elicit fairly short answers: each question can be answered in 1-5 sentences. You may not access the internet during the exam. You are allowed to use a calculator, though calculations do not have to be carried through numerically to obtain full credit. You may refer to your course textbook and notes, and you may use your laptop provided that internet access is disabled. Please write neatly.

1. Heart rates for mice typically decrease with age. A scientist hypothesizes that the rate of this decrease changes dramatically at age 5. The scientist gathers heart rate data for mice of many different ages. What is one way to statistically investigate this hypothesis? State any assumptions your method requires.

We can fit a linear regression with heart rate as the response and two features: X1 = age and X2 = age * I[age > 5]. We assume the errors of the model are normally distributed. If the coefficient corresponding to X2 is statistically significant, then we have evidence that the scientist is correct.

2. A web company has a large dataset of movie reviews. A review is represented as a vector counting the number of appearances of the most common English words. The company has access only to this data, and they would like to know which reviews are positive and which are negative. The dataset is very large and all reviews are long, so they want to avoid paying someone to read them. A data scientist suggests using a support vector machine to automatically classify each review as 1 if it is positive and as 0 otherwise, so that not even a single movie review has to be read by a human. Do you think that this is a good solution? Explain.

Fitting a classifier is not a reasonable idea, because the company does not have any labeled data on which to train it.
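The regression suggested for question 1 can be sketched with NumPy; the mouse data below is simulated with made-up coefficients, purely to illustrate the two-feature design X1 = age, X2 = age * I[age > 5]:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated mouse data following the solution's model: the age effect
# changes when age > 5 (the true coefficients here are hypothetical).
n = 500
age = rng.uniform(0, 10, n)
rate = 700 - 2 * age - 3 * age * (age > 5) + rng.normal(0, 3, n)

# Design matrix with an intercept, X1 = age, and X2 = age * I[age > 5].
X = np.column_stack([np.ones(n), age, age * (age > 5)])
beta_hat, *_ = np.linalg.lstsq(X, rate, rcond=None)
print(beta_hat)  # estimates of (700, -2, -3), up to noise
```

A t-test on the X2 coefficient (or an F-test comparing this fit against the model with age alone) then addresses the scientist's hypothesis.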

3. A researcher fits a Lasso regression model based on a dataset with 300 predictors, picking λ via cross-validation, and ends up with 10 nonzero coefficients. On a whim, she also decides to run best subset selection to find the best linear model with 10 of the original predictors. This results in a model with much lower training error than the Lasso fit. Excited by her results, she obtains some new data, and notices that the Lasso vastly outperforms the best subset model on this validation data. Explain why this is in fact not surprising.

This is not surprising because best subset selection is choosing among a very large number of models: (300 choose 10) of them. With so many possibilities, best subset selection is likely to overfit the training data. As a consequence, it is not surprising that this method yields high test error, and in particular higher test error than the Lasso.

4. A scientist is studying ring-width series in trees from semiarid environments of western North America. He asks you to help him fit a predictive model of the effect of tree age on ring width. He provides a large dataset of ring-width series, and he believes that the width should be a smooth function of age, but the relationship may behave very differently across different (but unknown) age ranges. Suggest a reasonable method to solve the scientist's problem.

Since the relationship may be very different across different age ranges, neither linear nor polynomial regression is appropriate. Regression splines require one to specify the positions of the knots, which are unknown to the scientist. Since all he wants to assume is smoothness, a smoothing spline is a reasonable answer.
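The number of candidate models that best subset selection searches over in question 3 can be computed directly with Python's `math.comb`:

```python
import math

# Number of distinct 10-predictor models best subset selection must
# consider when choosing 10 out of 300 predictors.
n_models = math.comb(300, 10)
print(n_models)  # about 1.4e18 candidate models
```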

5. A researcher fitted a local regression model of y on x, as displayed in Figure 1. The corresponding prediction curve is shown in red on the same figure. Being unsatisfied with the outcome, he then asks a consulting data scientist for advice on how to improve the fit. After some thought, the data scientist suggests that the model may be suffering from high bias and recommends that the span parameter be decreased. Given the information at your disposal, explain whether you would agree with her. In practice, how would you confirm your intuition about the optimal span parameter?

[FIGURE 1. Data for question 5.]

You should not agree with the data scientist. It appears from the picture that the model suffers from high variance, not high bias: the red curve is very wiggly, while the data seem to follow a smooth periodic trend. Therefore, the span should be increased, not decreased. In practice, you should select the span parameter for local regression via cross-validation.

6. A researcher applies random forests to a very large set of genetic data, to predict whether or not each patient has prostate cancer. The researcher would like to use 10-fold cross-validation to estimate the test error of the random forest model. Suggest a more computationally efficient way of estimating the test error. Would your answer change if the researcher were using bagging instead of random forests?

There is no need to use cross-validation. Random forests are based on bagging, so one can estimate the test error with the out-of-bag (OOB) error. The same applies to bagging.

7. A famous dispute in sociology has to do with whether the number of books in a child's home is relevant to the child's educational outcome, commonly measured by a score of 1-100 on a standardized test. All sociologists agree that family income and average age of parents are also relevant, so the question is whether the number of books is important after accounting for family income and parent age. Unfortunately, sociologists also agree that the effects of books, family income and average parent age will be highly nonlinear. Provide a statistically valid way to settle this dispute, and mention any assumptions your method requires.

They can use a generalized additive model (GAM). For instance, they could fit

    stand_score = β0 + f1(books) + f2(income) + f3(parent_age) + ε,

with f1, f2 and f3 cubic splines (or splines of higher degree if needed). Assuming the errors ε are normal, we can perform an F-test to check whether the coefficients of the basis functions in f1 are all zero or not.

8. A friend of yours suggests a new ensemble tree method. First, he fits N decision trees for a variable y on features X; then, given a new observation, he uses the average of the predictions of the N trees. Two of your STATS216v classmates claim this is a useless method: friend A says the predictions coming from this method are exactly what you would get using a random forest, and friend B says the predictions are exactly the same as using bagging. With which of your friends A and B, if any, do you agree?

With neither. The proposed algorithm will fit the same tree N times, since there is no inherent randomness in fitting a tree. Bagging and random forests, however, will both fit N different trees, by resampling observations (bagging) or by resampling observations and also subsampling features at each split (random forests).
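The point in the answer to question 8, that fitting a tree involves no randomness, can be illustrated with a hand-rolled one-split regression tree (a "stump"); the data here is made up:

```python
# A one-split regression tree fit by exhaustive search: refitting on the
# same data always returns the same split, so averaging N such fits is
# identical to a single fit.

def fit_stump(x, y):
    """Return (split, left_mean, right_mean) minimizing squared error."""
    best = None
    for s in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi <= s]
        right = [yi for xi, yi in zip(x, y) if xi > s]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((yi - ml) ** 2 for yi in left) + \
              sum((yi - mr) ** 2 for yi in right)
        if best is None or sse < best[0]:
            best = (sse, s, ml, mr)
    return best[1:]

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9]
fits = [fit_stump(x, y) for _ in range(10)]
print(len(set(fits)))  # 1: all ten refits are identical
```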

9. A geologist is interested in prediction, and she believes that her data follow the model

    y = β0 + β1 b1(x) + β2 b2(x) + ε,

where the basis functions b1(x) and b2(x) are, respectively, defined as

    b1(x) = (x - 1) I(x ≥ 1),    b2(x) = (x - 4)² I(x ≥ 4),

where, as usual, I(x ≥ 1) equals 1 for x ≥ 1 and 0 otherwise. You fit a linear regression to the above model and obtain the coefficient estimates β̂0 = 2, β̂1 = 1 and β̂2 = -2. Sketch the estimated curve between x = -3 and x = 6. Note the intercepts, slopes, and the values of the curve at x = -3 and x = 6.

The answer is shown in Figure 2: the curve is constant at y = 2 for x < 1, increases with slope 1 on 1 ≤ x < 4 (reaching y = 5 at x = 4), and follows a downward quadratic for x ≥ 4.

[FIGURE 2. Answer to question 9.]

10. A doctor would like to classify whether someone has diabetes or not, using p = 1000 gene expression levels and n = 2000 patients. He wants to pick between bagging and random forests as his prediction method. As is common in the medical area, however, we expect many of the 1000 predictors to be irrelevant in determining whether the patient has diabetes or not. In light of this, would you recommend he use random forests or bagging? Explain.

Bagging is a good idea here, whereas random forests are not. Many of the random forest trees will be forced to split on mostly irrelevant features, and so won't make the predictions any better. Each bagged tree, on the other hand, can consider all features at every split and will therefore be able to use the few relevant features in every tree.
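The curve in question 9 can be checked numerically. The sketch below hard-codes the coefficient estimates, taking the sign of β̂2 as negative (an assumption consistent with the downward quadratic piece of the plotted answer):

```python
# Evaluate the fitted piecewise curve from question 9 at a few points,
# using b0 = 2, b1 = 1 and b2 = -2 (the sign of b2 is assumed here).

def f(x, b0=2.0, b1=1.0, b2=-2.0):
    out = b0
    if x >= 1:
        out += b1 * (x - 1)        # linear piece, active for x >= 1
    if x >= 4:
        out += b2 * (x - 4) ** 2   # quadratic piece, active for x >= 4
    return out

for x in [-3, 0, 1, 3, 4, 6]:
    print(x, f(x))
# The curve is flat at 2 for x < 1, has slope 1 on [1, 4), and is
# quadratic beyond x = 4.
```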

11. You are asked by a bank to create a regression model for credit scores based on n = 100 clients and p = 500 features. For various reasons, you are considering using either random forests or the Lasso. You are then told by the bank that, for the new data your method is about to see, about 20% of the features for each data point are expected to be missing at random. In light of this new fact, which of the two methods would you choose?

Random forests, since they can deal with missing predictors by using surrogate variables. The Lasso, on the other hand, would pick only a few of the p = 500 available predictors, which is a problem if one of those predictors ends up missing.

12. Consider the dataset shown in Figure 3. You would like to use 5-fold cross-validation to compare the performance of logistic regression and QDA on this dataset, but a friend of yours says that cross-validation would not work properly in this case. What could go wrong? Justify.

[FIGURE 3. Dataset for Problem 12.]

Using cross-validation with logistic regression here is a bad idea. Note that the two classes are almost linearly separable. Since cross-validation uses only a subset of the observations for training, it is likely that the classes in the training data will end up being perfectly separable by a straight line, in which case the logistic regression coefficients are not well defined (the maximum likelihood estimates diverge).

13. Consider a random forest regression algorithm that randomly samples m out of p total predictors at each split. Suppose that m = p^α for some constant α ∈ [0, 1]. Which of the following statements are always true (for any given dataset)? Briefly justify your answers.

(a) The test error is an increasing function of α.
(b) The out-of-bag error is an increasing function of α.
(c) If α = 1, this procedure is known as bagging.
(d) The test error is always minimized at α = 1/2.

(a) False: nothing can be said in general about the test error.
(b) False, much like the test error.
(c) True: when α = 1 we have m = p, so there is no randomization over predictors and the random forest becomes bagging.
(d) False: m = √p is a popular choice, but it does not always minimize the test error.

14. An environmental engineer is using a classification tree to predict the location of bird nest sites from some ecological data. She knows that one of the predictors is much stronger than all the others. Would you recommend she use bagging or boosting to improve the prediction accuracy of her classification tree? Explain why, and suggest a third alternative method that is also suitable.

Since one of the predictors is very strong, the bagged trees would probably be highly correlated, so boosting is the better of the two. Boosting grows the trees sequentially: by fitting small trees to the residuals, we can expect to slowly improve f̂ and to build differently shaped trees that attack the residuals left unexplained by the most powerful predictor. As a third method, random forests would decorrelate the trees by allowing each split to consider only a random subset of the predictors, which also makes them suitable for this dataset.

15. Suppose that you want to cluster 5 observations using hierarchical clustering. You have computed the dissimilarity between each pair of observations and summarized it in the matrix below:

           1     2     3     4     5
    1      0    0.3   0.45  0.7   0.2
    2     0.3    0    0.5   0.8   0.1
    3     0.45  0.5    0    0.4   0.7
    4     0.7   0.8   0.4    0    0.9
    5     0.2   0.1   0.7   0.9    0

This means, for instance, that the dissimilarity between the first and second observations is 0.3, and the dissimilarity between the second and third observations is 0.5.

(a) On the basis of this dissimilarity matrix, sketch the dendrogram that results from hierarchically clustering these five observations using single linkage. Be sure to indicate on the plot the height at which each fusion occurs, as well as the observations corresponding to each leaf in the dendrogram.

(b) Would the dendrogram change if you used hierarchical clustering with complete linkage instead of single linkage? If yes, how?

(a) The dendrogram is shown in Figure 4(a): observations 2 and 5 fuse at height 0.1, observation 1 joins them at height 0.2, observations 3 and 4 fuse at height 0.4, and the final fusion occurs at height 0.45.

(b) The shape of the dendrogram would stay the same, but some fusion heights would change: with complete linkage, observation 1 joins {2, 5} at height 0.3 rather than 0.2, and the final fusion occurs at height 0.9 rather than 0.45. The new dendrogram is shown in Figure 4(b).

[FIGURE 4. Answers to question 15 (a) and (b).]
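The fusion heights for question 15 can be verified with a small, generic agglomerative clustering routine (a naive implementation, perfectly adequate for five observations):

```python
# Dissimilarity matrix from question 15 (rows/columns are observations 1-5).
D = [
    [0.0,  0.3,  0.45, 0.7,  0.2],
    [0.3,  0.0,  0.5,  0.8,  0.1],
    [0.45, 0.5,  0.0,  0.4,  0.7],
    [0.7,  0.8,  0.4,  0.0,  0.9],
    [0.2,  0.1,  0.7,  0.9,  0.0],
]

def agglomerate(D, linkage):
    """Naive agglomerative clustering; returns the list of fusion heights.
    Pass linkage=min for single linkage, linkage=max for complete linkage."""
    clusters = [{i} for i in range(len(D))]
    heights = []
    while len(clusters) > 1:
        # Find the closest pair of clusters under the given linkage rule.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(D[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        heights.append(round(d, 2))
        clusters[i] |= clusters[j]  # fuse the two closest clusters
        del clusters[j]
    return heights

print(agglomerate(D, min))  # single linkage:   [0.1, 0.2, 0.4, 0.45]
print(agglomerate(D, max))  # complete linkage: [0.1, 0.3, 0.4, 0.9]
```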

16. A computer scientist has collected a very large dataset of labeled pictures of the handwritten digit 3. Each picture is represented as a 16 × 16 gray-scale image (that is, 16 × 16 = 256 real values representing the intensity of each pixel). A subset of 130 handwritten digits is shown in Figure 5 as an example. The computer scientist notes that all pictures are different, showing a variety of writing styles. However, her dataset is so large that she cannot afford to store all the pictures. Using one of the methods learned in this course, how would you suggest she compress her dataset, in order to retain both the nature of the digit 3 and some important differences in writing styles?

[FIGURE 5. Data for question 16.]

She can average all pictures (pixel by pixel) to obtain a simple description of what a 3 looks like on average. The differences in writing style can then be compressed using PCA. In particular, she can represent each picture as a point in R^256 and perform PCA on the differences between the individual images and the average image, keeping only the first few components. This is reasonable because the pixels are inherently correlated, so one should expect a small number of components to provide a good low-dimensional representation of each image.

17. TRUE or FALSE: for a fixed dataset, if we use a decision tree algorithm for prediction, then the more terminal nodes the tree has (equivalently, the more splits there are), the likelier it is that our prediction algorithm suffers from high variance but low bias.

True: more splits give us more flexibility in choosing the decision function, which reduces bias, but since fewer points fall in each region, variance becomes a problem.
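The compression scheme in the answer to question 16 can be sketched with the SVD; random data stands in for the digit images here:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the digit data: 100 "images" with 256 pixels each.
images = rng.normal(size=(100, 256))

# Center on the average image, then run PCA via the SVD.
mean_image = images.mean(axis=0)
U, s, Vt = np.linalg.svd(images - mean_image, full_matrices=False)

k = 20                      # keep only the first k principal components
scores = U[:, :k] * s[:k]   # k numbers stored per image instead of 256
approx = mean_image + scores @ Vt[:k]
print(approx.shape)  # (100, 256): the reconstructed images
```

Each image is then stored as k scores (plus the shared mean image and the k component directions); with all components retained the reconstruction is exact.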

18. For each of the following, suggest one method learned in this class which fits the description (no explanations are necessary; assume all predictors are quantitative):

(a) A classification method that is affected by standardizing the predictors.
(b) A classification method that is not affected by standardizing the predictors.
(c) A dimensionality reduction method that is affected by standardizing the variables.
(d) A regression method that is not affected by applying an increasing function to the predictors.

There are many possible answers. One example: (a) linear SVM; (b) logistic regression; (c) PCA; (d) decision trees.

19. You are interested in predicting whether an individual will vote Democrat, Republican or Independent in an election year, and for this you gather many features on each person, such as economic and social status, ideological leanings and where they live. You have reason to believe that a linear decision boundary would be appropriate to separate Democrats from Republicans, but that a quadratic decision boundary would be more suitable to distinguish Democrats from Independents, and Republicans from Independents. Suggest a modification to one of the algorithms seen in class that would make it ideal for this scenario.

We could fit a modified version of quadratic discriminant analysis in which the Republican and Democrat classes share the same covariance matrix while the Independent class has its own covariance matrix. This would induce a linear boundary between Republicans and Democrats and quadratic boundaries between Independents and either of the two other classes.

20. A friend of yours who works in finance develops a model to predict whether the market will go up or down tomorrow. He uses cross-validation to validate his model and obtains the left curve below, which is discouraging: it tells him that it's best to use his model with no parameters at all. To investigate this issue, he decides to bootstrap his method by creating N datasets, each sampled with replacement from his original dataset, and running cross-validation N times. He averages all the cross-validation curves and obtains the right curve in the figure below. From this, he claims it's better to use 5 parameters, not 0. Would you agree with him?

[FIGURE 6. Plots for Problem 20.]

No. Because the bootstrap samples contain duplicates, the same data point will often appear in both his training and test sets, so the averaged CV curve will be far too optimistic.

21. A friend of yours claims he can accurately determine the probability that a startup company will be successful. He collected data on n = 200 past startups, along with p = 30 relevant predictors and a binary outcome stating whether the company was acquired or had an IPO within 10 years. To estimate the success probabilities, he says he used SVMs with a polynomial kernel, for added flexibility. Would you trust his claim? Explain.

I would not. SVMs output only a class label, not probability estimates, so his claim must be false.
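The overlap problem in question 20 is easy to quantify: a given observation appears in a bootstrap sample of size n with probability 1 - (1 - 1/n)^n, which is about 0.632 for large n, so roughly 63% of the points in each resample also appear as duplicates available at "training" time. A quick simulation:

```python
import random

random.seed(0)

n = 500     # dataset size
reps = 200  # number of bootstrap resamples
fracs = []
for _ in range(reps):
    boot = [random.randrange(n) for _ in range(n)]  # sample with replacement
    fracs.append(len(set(boot)) / n)  # fraction of distinct original points

avg = sum(fracs) / reps
print(round(avg, 3))                  # close to the theoretical value
print(round(1 - (1 - 1 / n) ** n, 3))  # ~0.632
```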

22. A sports researcher is interested in predicting whether an athlete will win a medal in the next Olympic Games. She has several measurements for each athlete: 10 variables measuring their current health level, such as oxygen flow and muscle strength, 12 variables measuring their endorsements and popularity, and 21 variables measuring relevant physical attributes. She has data from past competitions and, by using multiple linear regression along with cross-validation, finds that her model is extremely good at prediction. However, none of her predictors has a statistically significant coefficient. Assuming the normality assumption holds, what is a reasonable explanation for this outcome?

She is using many correlated predictors. This does not hurt the model's predictive ability, but it inflates the standard errors of the individual coefficients, so their p-values are large. Indeed, since any variable in the model can be dropped and replaced by a strong correlate, linear regression cannot confidently say which of the predictors are statistically significant.

23. You are hired by the government to help them decide which wells in a given county provide potable water. To do so, they determined which of n = 58 wells have potable water. Besides these 58 wells, they have the locations of 43 others, and would like to use the location data to decide whether these new wells are potable or not. Geologists tell you that wells that are close in distance usually have a similar type of water, and so they frequently employ SVMs for this classification task. Do you think this is a good idea? If so, which SVM kernel should they use? If not, state why and suggest a better classifier.

SVMs are a good idea if used with the radial kernel. The radial kernel has very local behavior, in the sense that only nearby training observations affect the predicted class of a test observation, which is exactly what the geologists' expertise suggests.
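The locality of the radial kernel in question 23 can be read off its formula, K(x, z) = exp(-γ ||x - z||²); the distances below are made up for illustration:

```python
import math

def rbf(x, z, gamma=1.0):
    """Radial (RBF) kernel in one dimension."""
    return math.exp(-gamma * (x - z) ** 2)

# A nearby training well influences the prediction at a test location;
# a distant one contributes essentially nothing.
print(rbf(0.0, 0.5))  # ~0.78
print(rbf(0.0, 5.0))  # ~1.4e-11
```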

24. Three friends are discussing which variants of hierarchical clustering could have produced Figure 7 below. Friend A claims it is the result of hierarchical clustering with complete linkage, friend B argues it is due to single linkage, and friend C claims it is due to average linkage. With which of your three friends, if any, do you agree?

[FIGURE 7. Figure for Problem 24.]

With none of them: no agglomerative hierarchical clustering could have produced the figure above. All three variants would first merge the two closest points into a cluster, and in this case the two closest points in the figure belong to different clusters.

25. An astronomer has image data from a telescope and is interested in locating the six galaxy clusters he expects to find within those images. For each pixel in the picture, he knows the temperature, radiation, and gravitational force at the pixel location. However, it is known that astronomical data routinely contain several outliers. Explain why this should make him reconsider the use of k-means, and suggest an adaptation of the algorithm to attenuate the problem.

The k-means algorithm relies on the means of the clusters, which are very sensitive to outliers. One way to attenuate the problem is to replace the means with medians, which are much more robust (this is the idea behind the k-medians and k-medoids algorithms).
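The sensitivity of k-means in question 25 comes down to how a cluster center reacts to an outlier: the mean moves dramatically, the median barely at all. The values below are made up:

```python
import statistics

# A cluster of measurements with one outlier (illustrative values).
cluster = [1.0, 2.0, 3.0, 2.5, 100.0]

print(statistics.mean(cluster))    # 21.7: dragged toward the outlier
print(statistics.median(cluster))  # 2.5: barely affected
```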