February Statistics: Multiple Regression in R


February 2016
Statistics: Multiple Regression in R

How to Use This Course Book

This course book accompanies the face-to-face session taught at IT Services. It contains a copy of the slideshow and the worksheets.

Software Used

We might use Excel to capture your data, but no other software is required. Since this is a Concepts course, we will concentrate on exploring ideas and underlying concepts that researchers will find helpful in undertaking data collection and interpretation.

Revision Information

Version | Date          | Author      | Changes made
1.0     | February 2016 | John Fresen | Course book version 1

Copyright

The copyright of this document lies with Oxford University IT Services.

Contents

1 Introduction
    What You Should Already Know
    What You Will Learn
2 Your Resources for These Exercises
    Help and Support Resources
3 What Next?
    3.1 Statistics Courses
    3.2 IT Services Help Centre

Statistics: Concepts TRMSZ

1 Introduction

Welcome to the course Statistics: Multiple Regression in R. This course introduces the concept of regression using Sir Francis Galton's parent-child height data and then extends the concept to multiple regression using real examples. The course has an applied focus and makes minimal use of mathematics. No derivations of formulae are presented.

What You Should Already Know

We assume that you are familiar with entering and editing text, rearranging and formatting text (drag and drop, copy and paste), printing and previewing, and managing files and folders. The computer network in IT Services may differ slightly from that which you are used to in your College or Department; if you are confused by the differences, ask for help from the teacher.

What You Will Learn

In this course we will cover the following topics:
What is regression
Simple linear regression
Influential observations
Multiple regression
Model selection
Post-selection inference
Cross-validation
Where to get help
From problem to data to conclusions

Topics covered in related Statistics courses, should you be interested, are given in the section Statistics Courses under What Next?

2 Your Resources for These Exercises

The exercises in this handbook will introduce you to some of the tasks you will need to carry out when working with WebLearn. Some sample files and documents are provided for you; if you are on a course held at IT Services, they will be on your network drive H:\ (find it under My Computer).

During a taught course at IT Services, there may not be time to complete all the exercises. You will need to be selective, and choose your own priorities among the variety of activities offered here. However, those exercises marked with a star * should not be skipped. Please complete the remaining exercises later in your own time, or book for a Computer8 session at IT Services for classroom assistance (see section 8.2).

Help and Support Resources

You can find support information for the exercises on this course and your future use of WebLearn as follows:
WebLearn Guidance (this should be your first port of call)

If at any time you are not clear about any aspect of this course, please make sure you ask John for help. If you are away from the class, you can get help and advice by emailing the central address weblearn@it.ox.ac.uk. A website for this course, including reading material and other material, is also available. You are welcome to contact John about statistical issues and questions at john.fresen@gmail.com

3 What Next?

3.1 Statistics Courses

Now that you have a grasp of some basic concepts in Statistics, you may want to develop your skills further. IT Services offers further Statistics courses. In particular, you might like to attend the course Statistics: Introduction. This is a four-session module which covers the basics of statistics and aims to provide a platform for learning more advanced tools and techniques.

Courses on particular discipline areas or data analysis packages include:
Statistics: Designing clinical research and biostatistics
SPSS: An introduction
SPSS: An introduction to using syntax
STATA: An introduction to data access and management
STATA: Data manipulation and analysis
STATA: Statistical, survey and graphical analyses

3.2 IT Services Help Centre

The IT Services Help Centre at 13 Banbury Road is open by appointment during working hours, and on a drop-in basis from 6:00 pm to 8:30 pm, Monday to Friday. The Help Centre is also a good place to get advice about any aspect of using computer software or hardware. You can contact the Help Centre on (2)73200 or by email on help@it.ox.ac.uk

Your safety is important
Where is the fire exit?
Beware of hazards: tripping over bags and coats
Please report any equipment faults to us
Let us know if you have any other concerns


Your comfort is important
The toilets are along the corridor outside the lecture rooms
The rest area is where you registered; it has vending machines and a water cooler
The seats at the computers are adjustable
You can adjust the monitors for height, tilt and brightness

Session 1: The concept of regression from Galton

Thanks to:
Dave Baker, IT Services
Jill Fresen, IT Services
Jim Hanley, McGill University
Ian Sinclair, REES Group Oxford

Sir Francis Galton (16 February 1822 to 17 January 1911)

Sir Francis Galton was an incredible polymath and a cousin of Charles Darwin.

General: Genetics. What do we inherit from our ancestors?
Particular: Do tall parents have tall children and short parents short children? i.e. does the height of children depend on the height of parents?
Data: Famous 1885 study: 205 sets of parents, 928 offspring. mph = average height of parents; ch = child height.
Galton Peas Experiment: Selected 700 pea pods of selected sizes. Variables: average diameter of parent peas; average diameter of child peas.

Francis Galton: Do tall parents have tall children, and short parents short children? Does the height of a child depend on the height of the parents?

[Figure: frequency scatterplot and sunflower plot of the Galton child-parent data, child height against midparent height.]

[Figure: Galton data. Panels: plot of the child-parent data; sunflower plot of the data; jittered plot of the data; boxplots of the conditional distributions of child height given parent height; histograms of the marginal distributions of child height and parent height.]

Regression is a plot/trace of the means of the conditional distributions

[Figure panels: trace of actual means (regression of Child on Midparent); trace of linear regression means, which assumes the means lie on a straight line; actual and linear regressions superimposed.]

The trace of actual means has no assumptions in it, but the end distributions have a lot of sampling variation because of the small number of observations in those distributions. Linear regression stabilises that.

Linear regression model

[Figure panels: linear regression model; linear regression model fitted to data.]

The linear regression model assumes:
1. Conditional distributions are normal
2. Conditional means lie on a straight line
3. Conditional distributions all have the same spread

In words: the distribution of child height, conditional on a given midparent height, is normal, with means lying on a straight line, and constant spread.

In mathematics: $\text{ch} \mid \text{mph} \sim N(\beta_0 + \beta_1\,\text{mph},\ \sigma^2)$

This model can be extended in many ways.
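The idea that regression traces the means of the conditional distributions can be tried out directly in R. The sketch below is only an illustration: it uses simulated heights (not Galton's real measurements), and the names mph, ch and comparison, as well as the intercept 24, slope 0.65 and standard deviations, are assumptions chosen for the example.

```r
# Simulated stand-in for Galton's data (NOT the real measurements):
# midparent heights rounded to whole inches, child height linear in midparent.
set.seed(1)
n   <- 928
mph <- round(rnorm(n, mean = 68, sd = 1.8))
ch  <- 24 + 0.65 * mph + rnorm(n, sd = 2.2)

# Trace of actual means: one mean of ch for each distinct value of mph
cond_means <- tapply(ch, mph, mean)
group_size <- table(mph)

# Linear regression: assumes the conditional means lie on a straight line
fit <- lm(ch ~ mph)
mph_values <- as.numeric(names(cond_means))
line_means <- predict(fit, newdata = data.frame(mph = mph_values))

# Side-by-side comparison; the end groups contain few observations,
# so their actual means are the least stable
comparison <- data.frame(mph         = mph_values,
                         n           = as.vector(group_size),
                         actual_mean = as.vector(cond_means),
                         linear_mean = as.vector(line_means))
print(round(comparison, 2))
```

In the printed table the two columns of means should agree closely in the middle of the range and differ most in the small end groups, which is the sampling variation that the straight-line assumption smooths out.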

The Linear Model can be extended in many ways. Here are three; there are more:
1. Model the mean by a more general function such as a polynomial or trigonometric function or Fourier series or radial basis functions or some nonparametric function.
2. Model the variance as a function of the predictors.
3. Generalize from the Normal distribution to the Exponential Family, which includes: normal, exponential, gamma, chi-squared, beta, Dirichlet, Bernoulli, categorical, Poisson, Wishart, Inverse Wishart and many others.

But in all cases we are modelling the mean and other parameters of conditional distributions. Models of the third kind are called Generalized Linear Models.

In R:
lm() fits a linear model
glm() fits a generalized linear model
gam() (from an add-on package such as mgcv) fits a generalized additive model

Galton Peas Experiment: Does the average diameter of child peas depend on the average diameter of parent peas? What are the sketches telling us? Would a linear regression model be suitable? An important point about this example is that in regression, it is the slope of the regression line that is important, not the intercept.
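As a small illustration of the first two functions, the sketch below fits lm() and glm() to simulated data. All variable names and the true coefficients are assumptions made up for the example; gam() is left out because it needs an add-on package.

```r
# Simulated data: one predictor, a normal response and a count response
set.seed(2)
x        <- runif(200, 0, 2)
y_normal <- 1 + 2 * x + rnorm(200, sd = 0.5)
y_count  <- rpois(200, lambda = exp(0.3 + 0.8 * x))

# Linear model: normal conditional distributions, mean on a straight line
fit_lm <- lm(y_normal ~ x)

# The same model written as a generalized linear model (normal family)
fit_glm_gauss <- glm(y_normal ~ x, family = gaussian)

# Poisson regression: conditional distributions are Poisson and the
# log of the conditional mean is modelled as a straight line in x
fit_glm_pois <- glm(y_count ~ x, family = poisson)

print(coef(fit_lm))
print(coef(fit_glm_gauss))
print(coef(fit_glm_pois))
```

The first two sets of coefficients should coincide, since the normal-family glm() is the ordinary linear model; the Poisson coefficients are on the log scale of the mean.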

Go to Lecture 2 A: Example 1 UWC Analysis

Detecting Influential Observations using R

Influential observations may suggest that:
your model is incorrect
a data point has been mis-recorded

Detecting Influential Observations in R

Case 1: Outlying in y-space, not x-space. Use studentized residuals to detect these.
Case 2: Outlying in x-space, not y-space. High leverage point. Use hatvalues to detect this.
Case 3: Outlying in both x-space and y-space. High leverage point.

Measures used: Studentized residuals, Hatvalues, DFFITS, DFBETAS

Outliers in Y-space: To detect outliers in the Y-space we compute the studentized residuals, defined as:

$t_i = \dfrac{e_i}{\widehat{\mathrm{se}}(e_i)}, \qquad e_i = Y_i - \hat{Y}_i$

This is just the standardizing transformation on the residuals. Thus, we expect about 99% of studentized residuals to lie between -3 and 3. Values outside this range suggest possible outliers. In R these are plotted on a graph using the statement:

    plot(rstudent(fit1),type="h")
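To see what rstudent() computes, the sketch below reproduces it by hand on simulated data with one planted outlier. The data, the position of the outlier (case 10) and the names t_manual and s2_del are assumptions for illustration. The version used by rstudent() estimates the residual standard error with case i left out, which is the form written out in the code.

```r
# Simulated data with one planted outlier in y-space (case 10)
set.seed(3)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)
y[10] <- y[10] + 8
fit <- lm(y ~ x)

e <- residuals(fit)          # raw residuals
h <- hatvalues(fit)          # leverages
n <- length(e)
p <- length(coef(fit))

# Residual variance from the full fit, then with case i deleted
s2     <- sum(e^2) / (n - p)
s2_del <- ((n - p) * s2 - e^2 / (1 - h)) / (n - p - 1)

# Studentized residual: residual divided by its estimated standard error
t_manual <- e / sqrt(s2_del * (1 - h))

print(all.equal(unname(t_manual), unname(rstudent(fit))))
print(which(abs(rstudent(fit)) > 3))   # cases flagged as possible outliers
```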

Outliers in X-space: To detect outliers in the X-space we compute the hatvalues, defined as:

hatvalues = diagonal elements of the hat matrix = $\mathrm{diag}\big(X (X^T X)^{-1} X^T\big)$

The i-th hatvalue measures the distance of the x-values of case i from the centroid of the x-values. In R these are plotted on a graph using the statement:

    plot(hatvalues(fit1),type="h")

Some of these will be small, some intermediate, and some large. These are assessed in a relative sense; there is no fixed cut-off point. We want to avoid the situation in which one observation dominates the others.

DFFITS (means change (DF) in FITted values)

DFFITS measures the effect or influence of removing the i-th observation on the predicted value of the i-th observation, divided by a standardizing quantity.

$\mathrm{DFFITS}_i = \dfrac{\hat{Y}_i - \hat{Y}_{i(i)}}{\text{constant}}$

Here $\hat{Y}_{i(i)}$ is the predicted value for case i from the model fitted without case i. This is a local measure of influence: it is only concerned with what happens at case i if case i is removed from the data. In R these are plotted on a graph using the statement:

    plot(dffits(fit1),type="h")

Some of these will be small, some intermediate, and some large. These are assessed in a relative sense; there is no fixed cut-off point. We want to avoid the situation in which one observation dominates the others.
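The two definitions above can be checked numerically. The following sketch uses simulated data with a planted high-leverage point (case 5); the data frame dat and the names h_manual and dffits_manual are assumptions for the example. The standardizing quantity used for DFFITS in the code, the leave-one-out residual standard error times the square root of the hatvalue, is the one that reproduces R's dffits().

```r
# Simulated data with one planted high-leverage point (case 5)
set.seed(4)
n   <- 40
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$x1[5] <- 8
dat$y <- 1 + dat$x1 - dat$x2 + rnorm(n)
fit <- lm(y ~ x1 + x2, data = dat)

# Hatvalues: diagonal of X (X'X)^{-1} X'
X <- model.matrix(fit)
h_manual <- diag(X %*% solve(t(X) %*% X) %*% t(X))
print(all.equal(unname(h_manual), unname(hatvalues(fit))))
print(which.max(h_manual))   # case furthest from the centroid of the x-values

# DFFITS by brute force: refit without case i and compare predictions at case i
dffits_manual <- sapply(seq_len(n), function(i) {
  fit_i      <- lm(y ~ x1 + x2, data = dat[-i, ])
  yhat_i_del <- predict(fit_i, newdata = dat[i, ])
  s_del      <- summary(fit_i)$sigma
  (fitted(fit)[i] - yhat_i_del) / (s_del * sqrt(h_manual[i]))
})
print(all.equal(unname(dffits_manual), unname(dffits(fit))))
```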

Cook's Distance measures the effect or influence of removing the i-th observation on all predicted values, divided by a standardizing quantity to normalize it.

$D_i = \dfrac{\sum_j \big(\hat{Y}_j - \hat{Y}_{j(i)}\big)^2}{\text{constant}}$

This is a global measure of influence: it considers the effect of removing case i on all predicted values. In R these are plotted on a graph using the statement:

    plot(cooks.distance(fit1),type="h")

Some of these will be small, some intermediate, and some large. These are assessed in a relative sense; there is no fixed cut-off point.

DFBETAS (means change (DF) in the BETAS)

DFBETAS measures the effect or influence of removing the i-th observation on the estimated regression coefficients, divided by a standardizing quantity.

$\mathrm{DFBETAS}_{k(i)} = \dfrac{\hat{\beta}_k - \hat{\beta}_{k(i)}}{\text{constant}}$

In R these are plotted on a graph, one coefficient at a time, using statements such as:

    plot(dfbetas(fit1)[,1],type="h")

dfbetas(fit1) is an n x p matrix (n observations, p coefficients). The entry in row i and column k represents the effect of removing the i-th observation on the k-th regression coefficient. Some of these will be small, some intermediate, and some large. These are assessed in a relative sense; there is no fixed cut-off point.
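Both measures can be reproduced by actually deleting each case and refitting. The sketch below does this on simulated data with a planted influential point (case 30); the data and the names cook_manual and dfbetas_manual are assumptions, and the standardizing constants in the code are the ones that match R's cooks.distance() and dfbetas().

```r
# Simulated data with one planted influential point (case 30)
set.seed(5)
dat <- data.frame(x = rnorm(30))
dat$y <- 1 + 2 * dat$x + rnorm(30)
dat$x[30] <- 4
dat$y[30] <- -3
fit <- lm(y ~ x, data = dat)

n  <- nrow(dat)
p  <- length(coef(fit))
s2 <- summary(fit)$sigma^2
xtx_inv_diag <- diag(solve(crossprod(model.matrix(fit))))

cook_manual    <- numeric(n)
dfbetas_manual <- matrix(NA_real_, n, p)
for (i in seq_len(n)) {
  fit_i    <- lm(y ~ x, data = dat[-i, ])          # refit without case i
  yhat_del <- predict(fit_i, newdata = dat)        # predictions for ALL cases
  # Cook's distance: change in all fitted values, standardized by p * s^2
  cook_manual[i] <- sum((fitted(fit) - yhat_del)^2) / (p * s2)
  # DFBETAS: change in each coefficient, standardized by its standard error
  # computed with the leave-one-out residual standard deviation
  dfbetas_manual[i, ] <- (coef(fit) - coef(fit_i)) /
    (summary(fit_i)$sigma * sqrt(xtx_inv_diag))
}

print(all.equal(cook_manual, unname(cooks.distance(fit))))
print(all.equal(dfbetas_manual, unname(dfbetas(fit))))
print(dim(dfbetas(fit)))      # rows are observations, columns are coefficients
print(which.max(cook_manual)) # most influential case
```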

Go to Lecture 2 B: UWC Analysis, outliers and influential observations

Model Selection

The Executive Salary Data has the variables: lsalary exper educat bonus numemp assets board age profits internat sales

Question: How do we select which variables to include in a model for estimating the mean log(salary)?

This is still a vibrant research topic in statistics, with many contentious issues and much controversy.

Occam's razor: Among competing hypotheses, the one with the fewest assumptions should be selected. Choose the simplest model that gives an adequate description of the data.

Commonly used model comparison criteria include:
Akaike information criterion (AIC)
Bayesian information criterion (BIC)

AIC = -2*log(likelihood(model)) + 2*(no. of predictors)
BIC = -2*log(likelihood(model)) + (no. of predictors)*log(no. of obs)

The first term, -2*log(likelihood(model)), gets smaller with more predictors, but each criterion penalizes the increased number of predictors; the BIC penalty also grows with the sample size (no. of obs). We choose the model with the smallest AIC (or BIC).

Recent excellent articles are:
Gerda Claeskens (2016) Statistical model choice.
Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang and Linda Zhao (2013) Valid post-selection inference. Annals of Statistics.

Problem: If we have k predictors, we have 2^k possible models (without considering interactions and transformations such as log, etc). For the Exec Salary Data we have 10 predictors, so there are 2^10 = 1024 models to consider without interactions or transformations.

We consider stepwise selection, which has three variants:

Forward selection, which involves starting with no variables in the model, testing the addition of each variable using a chosen model comparison criterion, adding the variable (if any) that improves the model the most, and repeating this process until none improves the model.

Backward elimination, which involves starting with all candidate variables, testing the deletion of each variable using a chosen model comparison criterion, deleting the variable (if any) that improves the model the most by being deleted, and repeating this process until no further improvement is possible.

Bidirectional elimination, a combination of the above, testing at each step for variables to be included or excluded.
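The criteria and the three stepwise variants can be tried on a small simulated example. In the sketch below the data, the variable names and the true coefficients are assumptions, and base R's step() is used as a stand-in for the stepwise search (it is not the stepAIC() routine from the MASS package used in the worksheets, although the two are closely related). Note that R's AIC() and BIC() count every estimated parameter, including the intercept and the error variance, in the penalty.

```r
# Simulated data: y depends on x1 and x2; x3 is pure noise
set.seed(6)
n   <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 1.5 * dat$x2 + rnorm(n)

full <- lm(y ~ x1 + x2 + x3, data = dat)
null <- lm(y ~ 1, data = dat)

# AIC and BIC by hand from the log-likelihood
ll <- logLik(full)
k  <- attr(ll, "df")                      # number of estimated parameters
aic_manual <- -2 * as.numeric(ll) + 2 * k
bic_manual <- -2 * as.numeric(ll) + k * log(n)
print(c(aic_manual, AIC(full)))
print(c(bic_manual, BIC(full)))

# The three stepwise variants, each driven by AIC
backward <- step(full, direction = "backward", trace = 0)
forward  <- step(null, scope = formula(full), direction = "forward", trace = 0)
both     <- step(null, scope = formula(full), direction = "both", trace = 0)

print(formula(backward))
print(formula(forward))
print(formula(both))
```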

Inference after Model Selection

The following articles provide a great discussion of post-selection inference and make suggestions for how to proceed:
Ernst Wit, Edwin van den Heuvel and Jan-Willem Romeijn (2012) All models are wrong... : an introduction to model uncertainty. Statistica Neerlandica.
Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang, and Linda Zhao (2013) Valid post-selection inference. Ann. Statist. Volume 41, Number 2.

Two major problems that arise in the process of model selection are:

First, the distributions of the estimated regression parameters are no longer valid. (This means that the tests and confidence intervals normally calculated are no longer valid.)

Second, we should see how well the model works on unseen data. This might conceptually be achieved by splitting the data into two subsets, a training set and a validation set. This is called cross-validation. Cross-validation is important in guarding against testing hypotheses suggested by the data (called "Type III errors").
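The first problem can be made concrete with a small simulation. The sketch below is an illustration under made-up settings (50 observations, 10 candidate predictors, 500 repetitions; all names are assumptions): the response is unrelated to every predictor, the predictor most correlated with the response is selected, and the usual p-value is then computed on the same data.

```r
set.seed(9)
n_sims <- 500   # number of simulated data sets
n      <- 50    # observations per data set
k      <- 10    # candidate predictors, all unrelated to y

p_selected <- replicate(n_sims, {
  X    <- matrix(rnorm(n * k), n, k)
  y    <- rnorm(n)
  best <- which.max(abs(cor(X, y)))      # selection uses the same data
  # usual p-value for the slope of the selected predictor
  summary(lm(y ~ X[, best]))$coefficients[2, 4]
})

# Share of data sets in which the selected predictor looks "significant"
false_positive_rate <- mean(p_selected < 0.05)
print(false_positive_rate)
```

If the usual test were still valid after selection, this share would be close to 0.05; in this setting it should come out far higher.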

Leave-p-out cross-validation

Leave-p-out cross-validation (LpO CV) involves using p observations as the validation set and the remaining observations as the training set. This is repeated for all ways of cutting the original sample into a validation set of p observations and a training set. It is computationally very expensive.

Leave-one-out cross-validation

Leave-one-out cross-validation (LOOCV) is a particular case of leave-p-out cross-validation with p = 1. LOOCV doesn't have the computational problem of general LpO cross-validation.

k-fold cross-validation

In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k - 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged (or otherwise combined) to produce a single estimate.

2-fold cross-validation is the special case k = 2.
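The k-fold procedure described above can be sketched in a few lines of base R. The data are simulated and the names fold, fold_mse and cv_mse, the choice k = 5 and the use of mean squared prediction error as the result for each fold are assumptions for the example.

```r
# Simulated data for a simple linear regression
set.seed(7)
n   <- 100
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 3 + 0.5 * dat$x + rnorm(n)

# Randomly partition the observations into k equal-sized folds
k    <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# Each fold is used exactly once as the validation data
fold_mse <- sapply(seq_len(k), function(f) {
  train <- dat[fold != f, ]
  valid <- dat[fold == f, ]
  fit   <- lm(y ~ x, data = train)
  mean((valid$y - predict(fit, newdata = valid))^2)
})

# Average the k results to get a single estimate of prediction error
cv_mse <- mean(fold_mse)
print(fold_mse)
print(cv_mse)
```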

24 Go to lecture 5 Example 4 Executive salary data

Lecture 3: Example 2: Multiple Regression, Executive Salary Data, Four variables

Problem and Data

This data set was collected by a large Human Resources Placement company specializing in the placement of Chief Executive Officers (CEOs). The objective of the research is to develop a model that will predict the average salary of CEOs based on a number of factors or variables associated with the executive and the company. Because of its skewed nature, salary was converted to log-salary and recorded as lsalary to bring it closer to normality. The purpose of the model is to help guide executives as to the distribution of salaries (mean and standard deviation) earned by executives with similar qualifications and backgrounds.

The data is given in execsaldata.xlsx, execsaldata.txt, execsaldata.csv.

For this exercise we will only consider the following predictors: experience (exper); number of employees (numemp); number of years of formal education (educat); company assets in millions of dollars (assets).

Model: $\text{lsalary} = \beta_0 + \beta_1\,\text{exper} + \beta_2\,\text{educat} + \beta_3\,\text{numemp} + \beta_4\,\text{assets} + \text{noise}$, where the noise follows a normal distribution with a mean of zero and constant variance.

The data were obtained by taking a random sample of 100 companies listed on the New York Stock Exchange during 2008 and are given in a comma-delimited file execsaldata.csv. An Excel version is also given.

Assignment: Perform an analysis that will lead to a regression model for predicting the average salary of executives based on their experience and background in terms of the factors considered above. You might consider, inter alia, the following points in your analysis, but do not be restricted to this list.
1. Read the data into R and attach it.
2. Summarize the univariate marginal distributions of experience, number of employees, number of years of formal education and the company assets.
3. Provide the scatterplot matrix of these variables, using the pairs statement, compute the correlation coefficients and comment on the plots and correlation matrix.
4. Fit the regression model to the data. Construct and interpret the SUMMARY and ANOVA tables.
5. How accurate is the prediction equation? Assess graphically the accuracy of the model by plotting the observed salary against the fitted values, say salhat.
6. Perform a graphical analysis of the residuals to assess if the assumptions on the error terms are reasonable.
7. Assess the possibility of outliers and influential observations.
8. Can you make recommendations about estimating the average salary of executives from the predictor variables considered here?
9. Can you criticize the data or the model?

Recommended Steps

You may copy and paste the code below into R. The data set contains the variables: lsalary exper educat bonus numemp assets board age profits internat sales

Step 1 Read and attach data. My data is stored in the directory Data of my memory stick. One can specify any directory.

    esd = read.csv("e:/data/execsaldata.csv", header=TRUE, sep=",")
    attach(esd)
    head(esd)
    names(esd)

Step 2 Plot marginal distributions of the response variable and the four predictor variables on a 2x3 grid of plots.

    par(mfrow=c(2,3))
    hist(lsalary, prob=TRUE, col="gray"); lines(density(lsalary)); rug(lsalary)
    hist(exper, prob=TRUE, col="gray"); lines(density(exper)); rug(exper)
    hist(educat, prob=TRUE, col="gray"); lines(density(educat)); rug(educat)
    hist(numemp, prob=TRUE, col="gray"); lines(density(numemp)); rug(numemp)
    hist(assets, prob=TRUE, col="gray"); lines(density(assets)); rug(assets)
    # notice that these are on vastly different scales

Step 3 Pairs plot (matrix of scatterplots)

    newdata <- cbind(lsalary,exper,educat,numemp,assets)
    pairs(newdata)

Step 4 Can combine steps 2 and 3 (see help for pairs)

    # histogram panel for the diagonal of the pairs plot
    panel.hist <- function(x, ...) {
      usr <- par("usr"); on.exit(par(usr = usr))   # restore plot region on exit
      par(usr = c(usr[1:2], 0, 1.5))
      h <- hist(x, plot = FALSE)
      breaks <- h$breaks; nb <- length(breaks)
      y <- h$counts; y <- y/max(y)
      rect(breaks[-nb], 0, breaks[-1], y, col = "cyan", ...)
    }
    pairs(newdata, panel = panel.smooth, diag.panel = panel.hist,
          cex.labels = 1.0, font.labels = 2)

Step 5 Compute the correlation coefficients:

    cor(newdata)
    # or, to have fewer decimal places
    round(cor(newdata),3)

Step 6 Fit the linear model to the data and get summary and anova:

    fit1 = lm(lsalary~exper+educat+numemp+assets)
    summary(fit1)
    anova(fit1)

Step 7 Compute predicted values and residuals:

    salhat = fitted.values(fit1)
    res = residuals(fit1)

Step 8

    # check the fit and assumptions on noise terms i.e.
    # are they normal and independent of predictor and model?
    par(mfrow=c(2,4))
    plot(salhat,lsalary,xlim=c(10.6,12.1),ylim=c(10.6,12.1),
         main="observed vs fitted values")
    abline(0,1)                # superimposes an ideal straight line
    plot(res)                  # sequential plot of residuals
    qqnorm(res);qqline(res)    # normal probability plot of residuals
    plot(exper,res)            # residuals vs exper
    plot(educat,res)           # residuals vs educat
    plot(numemp,res)           # residuals vs numemp
    plot(assets,res)           # residuals vs assets
    plot(salhat,res)           # residuals vs fitted values

Step 9

    # Detecting possible outlying and influential observations
    # Are there outliers in the y-space or x-space?
    # Are there influential observations as measured by
    # DFFITS, Cook's Distance or DFBETAS?
    par(mfrow=c(3,3))
    plot(rstudent(fit1),type="h")
    plot(hatvalues(fit1),type="h")
    plot(dffits(fit1),type="h")
    plot(cooks.distance(fit1),type="h")
    plot(dfbetas(fit1)[,1],type="h")
    plot(dfbetas(fit1)[,2],type="h")
    plot(dfbetas(fit1)[,3],type="h")
    plot(dfbetas(fit1)[,4],type="h")
    plot(dfbetas(fit1)[,5],type="h")
    # We might wish to plot hatvalues against cooks.distance and even
    # consider other combinations of plots.
    #
    # An interesting 2x2 matrix of plots is provided by plot(fit1)
    par(mfrow=c(2,2))
    plot(fit1)

Step 10 What are our conclusions? First interpret the diagnostic plots for assumptions on residuals and then consider the possibility of influential observations and outliers. Do the assumptions on the noise terms seem reasonable? If the diagnostics are satisfactory we then look at the summary and ANOVA to assess and interpret the model: What is the mean function and what is the spread about the mean function? Is the model a reasonable approximation to the data? Can we criticize the data or the model and make suggestions for further analysis or research?

Example 1: UWC Analysis

Mathematical model: $\text{result} = \beta_0 + \beta_1\,\text{rating} + \text{noise}$, where we assume the noise follows a normal distribution with a mean of zero and variance $\sigma^2$.

We could write this as $\text{result} \mid \text{rating} \sim N(\beta_0 + \beta_1\,\text{rating},\ \sigma^2)$

The conditional distribution of result given rating is normal with a mean of $\beta_0 + \beta_1\,\text{rating}$ and variance of $\sigma^2$.

Suggested steps:

Step 1

    # Read data into R, attach data, print first 6 lines
    uwc = read.table("e:/data/uwcdata.csv", header = TRUE, sep=",")
    attach(uwc)
    head(uwc)

Step 2

    # Plot marginal distributions
    par(mfrow=c(1,2))   # 1 row by 2 cols graphics window
    hist(rating, prob=TRUE, col="gray"); lines(density(rating)); rug(rating)
    hist(result, prob=TRUE, col="gray"); lines(density(result)); rug(result)

Step 3

    # fit the linear model (linear regression model)
    fit1 <- lm(result ~ rating)
    # fit1 is an object generated by the routine lm containing a lot
    # of information about the fitted model.
    # The rest of the steps are simply accessing information in fit1

Step 4

    # obtain summary and anova of fit
    # compute fitted values and residuals
    summary(fit1)
    anova(fit1)
    yhat <- fitted.values(fit1)   # fitted values
    res <- residuals(fit1)        # residuals

Step 5

    # various common plots put into a 2x3 matrix of scatter-plots to
    # check the fit and assumptions on noise terms i.e.
    # are they normal and independent of predictor and model?
    #
    # plot 1: scatterplot of data and superimposed fitted model -
    #         only do this plot when there is only a single predictor
    # plot 2: scatterplot of observed values vs fitted values to see
    #         how close the fitted values are to the observed data
    #         do this plot no matter how many predictors
    # plot 3: plot of residuals (random or pattern?)
    # plot 4: Q-Q plot of residuals (are they approx normal?)
    # plot 5: residuals vs predictor (random or pattern?)
    #         if there are many predictors we plot residuals against
    #         each predictor in turn
    # plot 6: residuals vs fitted (random or pattern?)

    par(mfrow=c(2,3))
    plot(rating,result,ylim=c(0,100), pch=19,cex=1.5,
         main="result vs Rating UWC data\n showing pass mark and fitted model")
    abline(fit1)             # superimposes fitted straight line
    abline(h=48,lty=2)       # superimposes dashed horizontal line at 48
    plot(yhat,result,xlim=c(0,100),ylim=c(0,100),
         main="observed vs fitted values")
    abline(0,1)              # superimposes an ideal straight line
    plot(res)
    qqnorm(res);qqline(res)  # normal probability plot of residuals
    plot(rating,res)         # residuals vs predictor
    plot(yhat,res)           # residuals vs fitted values

Step 6

    # Detecting possible outlying and influential observations
    # Are there outliers in the y-space or x-space?
    # Are there influential observations as measured by
    # DFFITS, Cook's Distance or DFBETAS?
    par(mfrow=c(2,3))
    plot(rstudent(fit1),type="h")
    plot(hatvalues(fit1),type="h")
    plot(dffits(fit1),type="h")
    plot(cooks.distance(fit1),type="h")
    plot(dfbetas(fit1)[,1],type="h")
    plot(dfbetas(fit1)[,2],type="h")

Step 7 What are our conclusions? First interpret the diagnostic plots for assumptions on residuals and then consider the possibility of influential observations and outliers. If the diagnostics are satisfactory we then look at the summary and ANOVA to assess and interpret the model: What is the mean function and what is the spread about the mean function? Can we criticize the data or the model and make suggestions for further analysis or research?


Lecture 4: Example 5: Model Selection, Executive Salary Data, All variables

Stepwise Regression in R

Recommended Steps

You may copy and paste the code below into R. The data set contains the variables: lsalary exper educat bonus numemp assets board age profits internat sales

Step 1 Read and attach data. My data is stored in the directory Data of my memory stick. One can specify any directory.

    esd = read.csv("e:/data/execsaldata.csv", header=TRUE, sep=",")
    attach(esd)
    head(esd)
    names(esd)

Step 2

    # Invoke the MASS library that contains the stepAIC function
    library(MASS)

Step 3 Pairs plot (matrix of scatterplots)

    # cbind() keeps one column per variable, as pairs() and cor() require
    newdata <- cbind(exper,educat,numemp,assets,age,profits,sales)

Step 4 Can combine steps 2 and 3 (see help for pairs)

    # histogram panel for the diagonal of the pairs plot
    panel.hist <- function(x, ...) {
      usr <- par("usr"); on.exit(par(usr = usr))   # restore plot region on exit
      par(usr = c(usr[1:2], 0, 1.5))
      h <- hist(x, plot = FALSE)
      breaks <- h$breaks; nb <- length(breaks)
      y <- h$counts; y <- y/max(y)
      rect(breaks[-nb], 0, breaks[-1], y, col = "cyan", ...)
    }
    pairs(newdata, panel = panel.smooth, diag.panel = panel.hist,
          cex.labels = 1.0, font.labels = 2)

Step 5 Compute the correlation coefficients:

    round(cor(newdata),3)

Step 6 Perform stepwise regression

    # start from the model with all predictors of lsalary
    fit1 <- lm(lsalary ~ ., data=esd)
    esd.step <- stepAIC(fit1, direction = "backward")

Step 6 (continued) Fit the linear model to the data and get summary and anova:

    fit2 <- lm(lsalary ~ exper+educat+bonus+numemp+assets+board+age+
               profits+internat+sales)
    summary(fit2)
    anova(fit2)

Step 7
Fit reduced model (take out non-significant terms):
fit3 <- lm(lsalary ~ exper+educat+bonus+numemp+assets)
summary(fit3)
anova(fit3)
Compute predicted values and residuals:
salhat = fitted.values(fit3)
res = residuals(fit3)

Step 8
# check the fit and assumptions on noise terms i.e.
# are they normal and independent of predictor and model?
par(mfrow=c(3,3))
plot(salhat,lsalary,xlim=c(10.6,12.1),ylim=c(10.6,12.1),
     main="observed vs fitted values")
abline(0,1)              # superimposes an ideal straight line
plot(res)                # sequential plot of residuals
qqnorm(res);qqline(res)  # normal probability plot of residuals
plot(exper,res)          # residuals vs exper
plot(educat,res)         # residuals vs educat
plot(bonus,res)          # residuals vs bonus
plot(numemp,res)         # residuals vs numemp
plot(assets,res)         # residuals vs assets
plot(salhat,res)         # residuals vs fitted values

Step 9
# Detecting possible outlying and influential observations
# Are there outliers in the y-space or x-space?
# Are there influential observations as measured by
# DFFITS, Cook's Distance or DFBETAS?
par(mfrow=c(2,5))        # 10 plots follow, so use a 2x5 layout
plot(rstudent(fit3),type="h")
plot(hatvalues(fit3),type="h")
plot(dffits(fit3),type="h")
plot(cooks.distance(fit3),type="h")
plot(dfbetas(fit3)[,1],type="h")
plot(dfbetas(fit3)[,2],type="h")
plot(dfbetas(fit3)[,3],type="h")
plot(dfbetas(fit3)[,4],type="h")
plot(dfbetas(fit3)[,5],type="h")
plot(dfbetas(fit3)[,6],type="h")

Step 10
What are our conclusions?
First interpret the diagnostic plots for assumptions on residuals and then consider the possibility of influential observations and outliers. Do the assumptions on the noise terms seem reasonable?
If the diagnostics are satisfactory we then look at the summary and ANOVA to assess and interpret the model: What is the mean function and what is the spread about the mean function? Is the model a reasonable approximation to the data?
Can we criticize the data or the model and make suggestions for further analysis or research?
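To see what backward selection with stepAIC does without needing the executive salary file, here is a minimal sketch on simulated data. The data frame dat and the variables x1 to x4 are hypothetical: only x1 and x2 truly affect y, while x3 and x4 are pure noise. MASS is a recommended package that ships with standard R installations.

```r
library(MASS)   # provides stepAIC()

# Simulated data, assumed purely for illustration
set.seed(42)
n   <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 3 + 2 * dat$x1 - 1.5 * dat$x2 + rnorm(n)

# Start from the full model and drop terms while AIC improves
full <- lm(y ~ ., data = dat)
sel  <- stepAIC(full, direction = "backward", trace = FALSE)

formula(sel)    # the model that backward elimination settled on
sel$anova       # record of the terms removed at each step
summary(sel)    # usual summary of the selected model
```

Setting trace = FALSE only suppresses the step-by-step printout; leave it at the default to watch the AIC at each step. The summary is computed as if the selected model had been chosen in advance, so its p-values take no account of the search that produced it.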

Lecture 5: Inference after model selection

The following articles provide a good discussion of post-selection inference and make suggestions for how to proceed:

Ernst Wit, Edwin van den Heuvel and Jan-Willem Romeijn (2012) All models are wrong...: an introduction to model uncertainty. Statistica Neerlandica, doi: /j x
Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang, and Linda Zhao (2013) Valid post-selection inference. Ann. Statist. Volume 41, Number 2,

Two major problems arise in the process of model selection:
First, the usual sampling distributions of the estimated regression parameters no longer hold once the model has been selected using the data. (This means that the tests and confidence intervals normally calculated are no longer valid.)
Second, we need to see how well the selected model works on unseen data. Conceptually this can be achieved by splitting the data into two subsets, a training set and a validation set. This is called cross-validation.
Cross-validation is important in guarding against testing hypotheses suggested by the data (called "Type III errors").

Exhaustive cross-validation
Exhaustive cross-validation methods learn and test on all possible ways of dividing the original sample into a training set and a validation set.

Leave-p-out cross-validation
Leave-p-out cross-validation (LpO CV) uses p observations as the validation set and the remaining observations as the training set. This is repeated for every way of splitting the original sample into a validation set of p observations and a training set. LpO cross-validation therefore requires learning and validating C_p^n times, where n is the number of observations in the original sample and C_p^n is the binomial coefficient "n choose p". As soon as n is moderately large this becomes computationally infeasible.

Leave-one-out cross-validation
Leave-one-out cross-validation (LOOCV) is the particular case of leave-p-out cross-validation with p = 1. LOOCV does not have the computational problem of general LpO cross-validation because C_1^n = n.
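The counting argument can be checked directly in R with choose(), and LOOCV is cheap enough to run as an explicit loop. The sketch below uses simulated data with hypothetical names. It also uses a shortcut that is not part of the course text: for a least-squares fit, the leave-one-out prediction error of observation i equals its ordinary residual divided by (1 - h_ii), where h_ii is its hat value, so no refitting is needed.

```r
# Number of train/validation splits for leave-p-out with n = 100
choose(100, 1)   # 100 splits for LOOCV
choose(100, 5)   # 75287520 splits for leave-5-out

# Simulated data, assumed purely for illustration
set.seed(1)
n   <- 40
x   <- rnorm(n)
y   <- 1 + 2 * x + rnorm(n)
dat <- data.frame(x, y)

# LOOCV the long way: refit n times, each time predicting the left-out case
loo_err <- numeric(n)
for (i in seq_len(n)) {
  fit_i      <- lm(y ~ x, data = dat[-i, ])
  loo_err[i] <- dat$y[i] - predict(fit_i, newdata = dat[i, , drop = FALSE])
}
loo_rmse <- sqrt(mean(loo_err^2))

# LOOCV the short way: one fit, residuals rescaled by the hat values
fit_all       <- lm(y ~ x, data = dat)
shortcut_err  <- residuals(fit_all) / (1 - hatvalues(fit_all))
shortcut_rmse <- sqrt(mean(shortcut_err^2))

c(loop = loo_rmse, shortcut = shortcut_rmse)   # the two agree
```

The shortcut applies to ordinary least squares only; for other fitting methods the loop is the general approach.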
k-fold cross-validation
In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k - 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged (or otherwise combined) to produce a single estimate.
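The procedure just described can be written out in a few lines of base R. This is a minimal sketch on simulated data; the names dat, fold and fold_rmspe are hypothetical, and the root mean squared prediction error is used as the cost simply because it is a common choice.

```r
# Simulated data, assumed purely for illustration
set.seed(123)
n   <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)

K <- 5
# Randomly assign each observation to one of K equal-sized folds
fold <- sample(rep(seq_len(K), length.out = n))

fold_rmspe <- numeric(K)
for (k in seq_len(K)) {
  train <- dat[fold != k, ]          # k - 1 folds for fitting
  valid <- dat[fold == k, ]          # the held-out fold for validation
  fit_k <- lm(y ~ x1 + x2, data = train)
  pred  <- predict(fit_k, newdata = valid)
  fold_rmspe[k] <- sqrt(mean((valid$y - pred)^2))
}

fold_rmspe               # one prediction error per fold
cv_rmspe <- mean(fold_rmspe)
cv_rmspe                 # combined estimate
```

Because the fold assignment is random, a different seed gives a slightly different estimate; repeating the whole procedure and averaging reduces that variability.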

2-fold cross-validation
This is the simplest variation of k-fold cross-validation, also called the holdout method. For each fold, we randomly assign data points to two sets d0 and d1, so that both sets are of equal size (this is usually implemented by shuffling the data array and then splitting it in two). We then train on d0 and test on d1, followed by training on d1 and testing on d0.
Cross-validation only yields meaningful results if the validation set and training set are drawn from the same population and only if selection biases are controlled.

Cross-validation in R
Example 5: Executive salary data
esd = read.csv("e:/data/execsaldata.csv",header=TRUE,sep=",")
attach(esd)
head(esd)
fit3 <- lm(lsalary ~ exper+educat+bonus+numemp+assets, data=esd)
xmatrix <- as.matrix(cbind(exper,educat,bonus,numemp,assets))
library(cvTools)
# cvFit() is given the fitted model itself, so no separate formula is passed
cvFit(fit3, data=esd, y=lsalary, x=xmatrix,
      cost = rmspe, K = 5, R=1, foldType = "consecutive")

Practical 1: Ht Wt UCLA Data

The dataset UCLA ht wt sample.csv contains 250 records of human heights and weights. The original sample of children was collected in 1993 by a Growth Survey of children from birth to 18 years of age, recruited from Maternal and Child Health Centres (MCHC) and schools, and was used to develop Hong Kong's current growth charts for weight, height, weight-for-age, weight-for-height and body mass index (BMI). To reduce the size of the data for this exercise, a random sample of 250 rows was drawn and stored in UCLA ht wt sample.csv. Large data sets require a different treatment that we won't cover here.
The columns contain the new index (new.index), the original index (index), the ht and wt.
Use the UWC analysis to guide your R code.

Step 1
# Read data into R, attach data, print first 6 lines
d <- read.csv("e:/data/ucla ht wt sample.csv",header=TRUE)
attach(d)
head(d)

Step 2
# Plot marginal distributions of height and weight
# Provide the boxplot of the conditional distributions of weight
# given height
boxplot(wt~ht,range=0,varwidth=TRUE,col="gray",
        main="Boxplots of the distributions\n of weight given height",
        xlab="height (in)", ylab="weight (lb)")

Step 3
# fit the linear model of wt on ht

Step 4
# obtain summary and anova of fit
# compute fitted values and residuals

Step 5
# various common plots put into a 2x3 matrix of scatter-plots to
# check the fit and assumptions on noise terms i.e.
# are they normal and independent of predictor and model?
#
# plot 1: scatterplot of data and superimposed fitted model -
#         only do this plot when there is only a single predictor
# plot 2: scatterplot of observed values vs fitted values to see
#         how close the fitted values are to the observed data
#         do this plot no matter how many predictors
# plot 3: plot of residuals (random or pattern?)
# plot 4: Q-Q plot of residuals (are they approx normal?)
# plot 5: residuals vs predictor (random or pattern?)
#         if there are many predictors we plot residuals against
#         each predictor in turn
# plot 6: residuals vs fitted (random or pattern?)

Step 6
# Detecting possible outlying and influential observations
# Are there outliers in the y-space or x-space?
# Are there influential observations as measured by
# DFFITS, Cook's Distance or DFBETAS?

Step 7
What are our conclusions?
First interpret the diagnostic plots for assumptions on residuals and then consider the possibility of influential observations and outliers.
If the diagnostics are satisfactory we then look at the summary and ANOVA to assess and interpret the model: What is the mean function and what is the spread about the mean function?
Can we criticize the data or the model and make suggestions for further analysis or research?
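One possible outline for Steps 3 to 6 is sketched below. It runs on simulated heights and weights so that it works without the data file: the numbers used to generate ht and wt are invented and are not estimates from the UCLA sample. With the real file, replace the simulation by the read.csv call from Step 1.

```r
# Simulated stand-in for the height/weight sample (values are invented)
set.seed(2016)
n  <- 250
ht <- round(rnorm(n, mean = 68, sd = 2))   # height in inches
wt <- -80 + 3 * ht + rnorm(n, sd = 10)     # weight in pounds
d  <- data.frame(ht, wt)

# Step 3: fit the linear model of wt on ht
fit <- lm(wt ~ ht, data = d)

# Step 4: summary, anova, fitted values and residuals
summary(fit)
anova(fit)
wthat <- fitted.values(fit)
res   <- residuals(fit)

# Step 5: the six common plots in a 2x3 layout
par(mfrow = c(2, 3))
plot(d$ht, d$wt, main = "data and fitted line"); abline(fit)
plot(wthat, d$wt, main = "observed vs fitted");  abline(0, 1)
plot(res, main = "residuals in sequence")
qqnorm(res); qqline(res)
plot(d$ht, res, main = "residuals vs ht")
plot(wthat, res, main = "residuals vs fitted")

# Step 6: outliers and influential observations
par(mfrow = c(2, 3))
plot(rstudent(fit), type = "h")
plot(hatvalues(fit), type = "h")
plot(dffits(fit), type = "h")
plot(cooks.distance(fit), type = "h")
plot(dfbetas(fit)[, 1], type = "h")   # influence on the intercept
plot(dfbetas(fit)[, 2], type = "h")   # influence on the slope for ht
```

With a single predictor dfbetas has only two columns, so all six influence plots fit in one 2x3 layout.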

Practical 2: Salamander

Problem and Data
The data set for this assignment was obtained from Bill Peterman during the fall of 2005, then a postgraduate student in Ecology and Conservation at University of Missouri. See his webpage.
The data given in the appendix were collected on 45 salamanders to ascertain the time to anesthetization (seconds) when submerged in different concentrations of Tricaine Methanesulfonate, or MS-222 for short. It is a fine white powder that easily dissolves in water. The salamanders were placed in a container with the solution and were completely submerged. The temperature of the water-anesthetic solution (MS-222 was the anesthetic) was measured in degrees Celsius. The covariates considered were snout vent length (sl) measured in millimeters, total length (tl) measured in millimeters, mass measured in grams, pH of the solution, and the temperature.
The study was motivated because Bill needed to insert electronic tracking devices into the salamanders so that they could be easily tracked. However, he could find no guidelines about the concentration required for anesthetization of salamanders. The objective was to develop a model to predict the time required for anesthetization in terms of the concentration and the size of the salamander as measured by the mass. We will ignore the pH and temperature.

Model Building Considerations
It seems appropriate to exclude temperature and pH from the analysis because these were strongly correlated with the concentration. Further, it seemed sensible to use mass rather than either snout vent length (sl) or total length (tl) in the analysis. (Because of their high correlation only one of these measurements would be included, and mass is by far the more reliable and intuitively appealing measurement.)
After considering the scatterplots of time to anesthetization (anes [measured in minutes]) against concentration (conc [mg/l]), it seemed that the analysis should be based on log transformations of anesthetization time, concentration and mass. Thus, the complete model to be contemplated is
$$\ln(\text{anes}) = \beta_0 + \beta_1 \ln(\text{conc}) + \beta_2 \ln(\text{mass}) + \varepsilon$$

Analysis:
Perform an analysis that will lead to a regression model for predicting the average time to anesthetization in terms of concentration and mass.
Suggestions for the analysis step of the Salamander Data (approximate steps):
1 Read data into R.
2 Analyse the marginal distributions of the original/untransformed data and comment on these (e.g. stem-and-leaf, summary, Q-Q plots, histograms, etc.).
3 Obtain scatterplot and correlation matrices of the original data and comment on these.
4 Transform data to logs: log(anes), log(conc) and log(mass).
5 Check that the transformed data are approximately normal. (At this stage we are not so interested in the means and sd's but in the shape of their distributions: are they approximately normal?)
6 Repeat step 3 for the transformed data as a precursor to the model fitting.
7 Fit and assess the contemplated model:
Model 1: $\ln(\text{anes}) = \beta_0 + \beta_1 \ln(\text{conc}) + \beta_2 \ln(\text{mass}) + \varepsilon$
8 Provide the ANOVA and summary table, perform the checks that the assumptions on the error terms are reasonable for that model, and perform the usual diagnostic plots looking for outliers in the Y-space, the X-space, the DFFITS, Cook's Distance, and DFBETAS.
9 Interpretations
10 Criticisms and recommendations of experiment, data and the model
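The log-log model in step 7 can be fitted by applying log() inside the model formula. The sketch below uses simulated values, since the salamander data are not reproduced here: the concentrations, masses, coefficients and the data frame name sal are all invented for illustration and say nothing about the real experiment.

```r
# Simulated stand-in for the salamander data (all values invented)
set.seed(45)
n    <- 45
conc <- rep(c(250, 500, 1000), each = 15)   # hypothetical concentrations
mass <- runif(n, min = 2, max = 12)         # hypothetical masses in grams
anes <- exp(8 - 0.9 * log(conc) + 0.3 * log(mass) + rnorm(n, sd = 0.2))
sal  <- data.frame(anes, conc, mass)

# Model 1: ln(anes) = b0 + b1 ln(conc) + b2 ln(mass) + error
fit <- lm(log(anes) ~ log(conc) + log(mass), data = sal)
summary(fit)
anova(fit)

# In a log-log model each slope is read as an approximate percentage effect:
# a 1% increase in conc changes anes by about b1 percent, holding mass fixed
coef(fit)

# Prediction for a new animal, back-transformed to the original time scale
new       <- data.frame(conc = 500, mass = 6)
pred_log  <- predict(fit, newdata = new)    # predicted mean of ln(anes)
pred_time <- exp(pred_log)                  # back on the scale of anes
pred_time
```

Exponentiating the predicted mean of ln(anes) estimates the median time rather than the mean time when the errors are normal on the log scale, which is worth stating when reporting predictions.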

Picture of salamander species used in the anesthetization study (pictures by Bill Peterman).

Practical 4: Birthweight data

The data JRHbirthwt.csv for this exercise comes from a recent research project at the John Radcliffe Hospital. The 17 variables are:
Age
PAPPA
hcg
NT
trisomy
Parity (categorical)
BMI
Smoking (categorical)
Ethnicity (categorical)
Conception
Gestation in weeks (won't use this)
Gestation in days
Delivery (categorical)
Centile
Birthwt
PET2 (categorical)
G3M (categorical)

The objective is to develop a model to predict birthweight from the other variables.
I plotted all the marginal distributions and re-coded to eliminate sparse categories, and then converted the categorical variables to factors:
Ethnicity.new <- 1*(Ethnicity==1)+2*(Ethnicity==2)+1*(Ethnicity>2)
Ethnicity.new <- as.factor(Ethnicity.new)
PET.new <- 1*(PET2==1)+2*(PET2>1)
PET.new <- as.factor(PET.new)
G3M.new <- 1*(G3M==1)+2*(G3M>1)
G3M.new <- as.factor(G3M.new)
Smoking <- as.factor(Smoking)
Parity.new <- 0*(Parity==0)+1*(Parity==1)+1*(Parity==2)
Parity.new <- as.factor(Parity.new)
Conception.new <- 1*(Conception==1)+2*(Conception>1)
Conception.new <- as.factor(Conception.new)

Perform a stepwise regression on these variables.
Fit the best resulting model.
Obtain the summary and anova.
Select your model.
Assess assumptions and influential observations.
What are your conclusions?
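The recoding lines above rely on the fact that a logical comparison such as Parity==1 is treated as 0 or 1 in arithmetic. The small sketch below shows the idea on a made-up vector (the values are not from JRHbirthwt.csv) and compares it with an equivalent ifelse() version; Parity.alt is a hypothetical name.

```r
# Made-up parity values, purely for illustration
Parity <- c(0, 1, 2, 0, 1, 2, 2)

# Indicator arithmetic: each comparison contributes 0 or 1 times its new code
Parity.new <- 0 * (Parity == 0) + 1 * (Parity == 1) + 1 * (Parity == 2)
Parity.new <- as.factor(Parity.new)

# Cross-tabulate old against new codes to confirm the merge did what was intended
table(Parity, Parity.new)

# The same recoding written with ifelse()
Parity.alt <- factor(ifelse(Parity == 0, 0, 1))

levels(Parity.new)   # "0" "1"
```

One point to watch with indicator arithmetic: a value matched by none of the comparisons gets the code 0, so it is silently merged with any category that is coded 0. Cross-tabulating old against new codes, as above, catches this.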
