Math 385/585 Applied Regression Analysis

Fall 2017, Section 001
Meeting time: 1:50 to 2:50 M W F
Instructor: Dr. Chris Edwards
Phone: 948-3969
Office: Swart 123
Classroom: Swart 3

Text: Applied Linear Statistical Models, 5th edition, by Kutner, Nachtsheim, Neter, and Li. Earlier editions of the text will likely be adequate, but you will have to allow for different page numbers and homework problem numbers.

Catalog Description: A practical introduction to regression emphasizing applications rather than theory. Simple and multiple regression analysis, basic components of experimental design, and elementary model building. Both conventional and computer techniques will be used in performing the analyses. Prerequisite: Math 201 or Math 301, and Math 256, each with a grade of C or better.

Course Objectives: Linear models in statistics are the backbone of many applications, including regression and ANOVA techniques. Math 385 focuses on the regression side of modeling, while Math 386 focuses on the ANOVA side. In Math 385, students will learn how to calculate and interpret regression estimates, including parameter estimates, fitted values, and residuals, and will be able to perform statistical inference. In addition to simple linear regression, successful students will understand the issues introduced in multiple linear regression, including polynomial regression and non-linear regression. Finally, students will be able to assess model adequacy and will know methods to update and improve a model.

Upon successful completion of the course, students are expected to be able to:
- Identify and understand the components and assumptions of the standard linear regression model
- Use statistical inference on regression model coefficients, including confidence intervals and hypothesis tests
- Construct and interpret the ANOVA table for a linear regression model
- Calculate and analyze residuals from a regression model
- Perform diagnostics on a regression model, including assessing lack of fit
- Perform remedial measures, such as transformations, to improve a regression model
- Understand how linear algebra can be used to describe a multiple regression model
- Perform inference in multiple regression and understand how the increased number of dimensions adds complexity to the interpretations due to collinearity
- Understand how to fit polynomial regression models
- Know how to use indicator variables in regression models
- Build a model from a pool of candidate variables, using techniques such as best subsets and stepwise regression
- Identify outliers, in both the X and Y dimensions, in multiple regression models
- Understand the basics of non-linear regression, including logistic regression
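For orientation on the first objective, the standard simple linear regression model that Chapters 1 and 2 of the text work with can be written as follows (this statement of the model and its assumptions is standard background added for reference, not quoted from the syllabus):

  Yᵢ = β₀ + β₁Xᵢ + εᵢ,  i = 1, ..., n,  with the εᵢ independent N(0, σ²) errors,

where β₀ and β₁ are the regression coefficients estimated by least squares and σ² is the error variance.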

Grading: Final grades are based on these 300 points:

  Topic                               Points    Tentative Date   Chapters
  Exam 1: Simple Linear Regression    70 pts.   October 6        1 to 4
  Exam 2: Multiple Regression I       70 pts.   November 13      5 to 8
  Exam 3: Multiple Regression II      70 pts.   December 15      9 to 11, 13 and 14
  Homework (15 points each)           90 pts.

Homework: I will collect (around) 5 homework problems approximately once every other week. The due dates are listed on the course outline below. I suggest that you work together in small groups on the homework if you like; don't forget that I am a resource for you to use. Often we will use computer software to perform our analyses; include printouts where appropriate, but please make your papers readable. In other words, I don't want 25 pages of printout handed in if you can summarize it in two pages.

Final grades are assigned as follows:

  270 pts.          A   (90%)
  260 pts.          A-  (87%)
  250 pts.          B+  (83%)
  240 pts.          B   (80%)
  230 pts.          B-  (77%)
  220 pts.          C+  (73%)
  210 pts.          C   (70%)
  200 pts.          C-  (67%)
  190 pts.          D+  (63%)
  180 pts.          D   (60%)
  179 pts. or less  F

Office Hours: Office hours are times when I will be in my office to help you. There are many other times when I am in my office; if I am in and not busy, I will be happy to help. My office hours for the Fall 2017 semester are 3:00 to 3:45 Monday and Wednesday, and 9:00 to 11:00 Tuesday.

Philosophy: I strongly believe that you, the student, are the only person who can make yourself learn. Therefore, whenever it is appropriate, I expect you to discover the mathematics we will be exploring. I do not feel that lecturing to you will teach you how to do mathematics. I hope to be your guide while we learn some mathematics, but you will need to do the learning. I expect each of you to come to class prepared to digest the day's material. That means you will benefit most by having read each section of the text and the Day By Day notes before class. My personal belief is that one learns best by doing. I believe that you must be truly engaged in the learning process to learn well. Therefore, I do not think that my role as your teacher is to tell you the answers to the problems we will encounter; rather, I believe I should point you in a direction that will allow you to see the solutions yourselves. To accomplish that goal, I will find different interactive activities for us to work on. Your job is to use me, your text, your friends, and any other resources to become adept at the material. The Day By Day notes also include Skills that I expect you to attain.

Math 585 Expectations: Expectations for the graduate students are understandably more rigorous than for the undergraduate students. Students taking Math 585 will have an extra theoretical problem added to each homework, to be assigned during the semester. In addition, a final project worth 50 points will be due at the end of the semester. This project will involve a complete analysis of a data set, including model estimation, development, and validation.

Course Outline (tentative):

Week of September 4
  Mon Sep 4:   No class
  Wed Sep 6:   Day 1 - Introduction, Least Squares
  Fri Sep 8:   Day 2 - Models (Sections 1.1 to 1.5)

Week of September 11
  Mon Sep 11:  Day 3 - Estimation (Sections 1.6 to 1.8)
  Wed Sep 13:  Day 4 - Inference (Sections 2.1 to 2.3)
  Fri Sep 15:  Day 5 - Interval Estimates (Sections 2.4 to 2.6)

Week of September 18
  Mon Sep 18:  Day 6 - ANOVA (Section 2.7); Homework 1 due
  Wed Sep 20:  Day 7 - GLM (Section 2.8)
  Fri Sep 22:  Day 8 - Residuals I (Sections 3.1 to 3.6)

Week of September 25
  Mon Sep 25:  Day 9 - Residuals II (Sections 3.1 to 3.6)
  Wed Sep 27:  Day 10 - Lack of Fit (Section 3.7)
  Fri Sep 29:  Day 11 - Transformations (Sections 3.8 to 3.9)

Week of October 2
  Mon Oct 2:   Day 12 - Simultaneous Inference (Sections 4.1 to 4.3); Homework 2 due
  Wed Oct 4:   Day 13 - Review
  Fri Oct 6:   Day 14 - Exam 1

Week of October 9
  Mon Oct 9:   Day 15 - Intro to Matrices (Sections 5.1 to 5.7)
  Wed Oct 11:  Day 16 - Regression Matrices (Sections 5.8 to 5.13)
  Fri Oct 13:  Day 17 - Multiple Regression Models (Sections 6.1 to 6.2)

Week of October 16
  Mon Oct 16:  Day 18 - Inference (Sections 6.3 to 6.6)
  Wed Oct 18:  Day 19 - Intervals (Section 6.7)
  Fri Oct 20:  Day 20 - Diagnostics (Section 6.8)

Week of October 23
  Mon Oct 23:  Day 21 - Extra Sums of Squares (Section 7.1); Homework 3 due
  Wed Oct 25:  Day 22 - GLM Tests (Sections 7.2 to 7.3)
  Fri Oct 27:  Day 23 - Computational Problems and Multicollinearity (Sections 7.5 to 7.6)

Week of October 30
  Mon Oct 30:  Day 24 - Polynomial Models (Section 8.1)
  Wed Nov 1:   Day 25 - Interactions I (Section 8.1)
  Fri Nov 3:   Day 26 - Interactions II (Section 8.2)

Week of November 6
  Mon Nov 6:   Day 27 - Dummy Variables I (Sections 8.3 to 8.7)
  Wed Nov 8:   Day 28 - Dummy Variables II (Sections 8.3 to 8.7)
  Fri Nov 10:  Day 29 - Review; Homework 4 due

Week of November 13
  Mon Nov 13:  Day 30 - Exam 2
  Wed Nov 15:  Day 31 - Model Building (Sections 9.1 to 9.3)
  Fri Nov 17:  Day 32 - Best Subsets (Sections 9.4 to 9.6)

Week of November 20
  Mon Nov 20:  Day 33 - Diagnostics (Sections 10.1 to 10.2)
  Wed Nov 22:  No class
  Fri Nov 24:  No class

Week of November 27
  Mon Nov 27:  Day 34 - X Outliers (Section 10.3)
  Wed Nov 29:  Day 35 - Y Outliers (Section 10.4); Homework 5 due
  Fri Dec 1:   Day 36 - Trees (Section 11.4)

Week of December 4
  Mon Dec 4:   Day 37 - Non-Linear Regression I (Sections 13.1 to 13.2)
  Wed Dec 6:   Day 38 - Non-Linear Regression II (Sections 13.3 to 13.4)
  Fri Dec 8:   Day 39 - Logistic Regression (Sections 14.2 to 14.3)

Week of December 11
  Mon Dec 11:  Day 40 - Logistic Inference (Section 14.5); Homework 6 due
  Wed Dec 13:  Day 41 - Review
  Fri Dec 15:  Day 42 - Exam 3

Homework Assignments: (subject to change if we discover difficulties as we go)

Homework 1 - Due September 18

1.19, p. 35. Grade point average. The director of admissions of a small college selected 120 students at random from the new freshman class in a study to determine whether a student's grade point average (GPA) at the end of the freshman year (Y) can be predicted from the ACT test score (X). The results of the study follow. Assume that first-order regression model (1.1) is appropriate.

  i:   1      2      3     ...  118    119    120
  Xᵢ: 21     14     28     ...   28     16     28
  Yᵢ: 3.897  3.885  3.778  ...  3.914  1.860  2.948

a) Obtain the least squares estimates of β₀ and β₁, and state the estimated regression function.
b) Plot the estimated regression function and the data. Does the estimated regression function appear to fit the data well?
c) Obtain a point estimate of the mean freshman GPA for students with ACT test score X = 30.
d) What is the point estimate of the change in the mean response when the entrance test score increases by one point?

1.23, p. 36. Refer to Grade point average Problem 1.19.
a) Obtain the residuals eᵢ. Do they sum to zero in accord with (1.17)?
b) Estimate σ² and σ. In what units is σ expressed?

1.33, p. 37. Refer to the regression model Yᵢ = β₀ + εᵢ in Exercise 1.30. Derive the least squares estimator of β₀ for this model.

2.4, p. 90. Refer to Grade point average Problem 1.19.
a) Obtain a 99 percent confidence interval for β₁. Interpret your confidence interval. Does it include zero? Why might the director of admissions be interested in whether the confidence interval includes zero?
b) Test, using the test statistic t, whether or not a linear association exists between student's ACT score (X) and GPA at the end of the freshman year (Y). Use a level of significance of 0.01. State the alternatives, decision rule, and conclusion.
c) What is the P-value of your test in part (b)? How does it support the conclusion reached in part (b)?

2.55, p. 97. Derive the expression for SSR in (2.51): SSR = b₁² Σ(Xᵢ − X̄)².
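The syllabus does not prescribe particular software, so the short Python sketch below is only one illustrative way to check the computations asked for in Problems 1.19 and 2.4 (least squares estimates, a fitted value at X = 30, residuals, and a confidence interval for β₁). The six data values are the fragment printed above, not the full 120-case data set.

```python
import numpy as np
from scipy import stats

# Fragment of the GPA data shown above (the full problem uses all 120 cases).
x = np.array([21, 14, 28, 28, 16, 28], dtype=float)       # ACT score (X)
y = np.array([3.897, 3.885, 3.778, 3.914, 1.860, 2.948])  # freshman GPA (Y)

n = len(x)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()                 # least squares estimates
fitted = b0 + b1 * x
resid = y - fitted                            # residuals sum to ~0 (Problem 1.23a)
mse = np.sum(resid ** 2) / (n - 2)            # estimate of sigma^2

# Point estimate of mean GPA at X = 30 (Problem 1.19c)
yhat_30 = b0 + b1 * 30

# 99 percent confidence interval for beta_1 (Problem 2.4a)
se_b1 = np.sqrt(mse / np.sum((x - x.mean()) ** 2))
t_crit = stats.t.ppf(0.995, df=n - 2)
ci_b1 = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, fit at X=30: {yhat_30:.4f}")
print(f"99% CI for beta_1: ({ci_b1[0]:.4f}, {ci_b1[1]:.4f})")
```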

Homework 2 - Due October 2

2.23, p. 93. Refer to Grade point average Problem 1.19.
a) Set up the ANOVA table.
b) What is estimated by MSR in your ANOVA table? By MSE? Under what condition do MSR and MSE estimate the same quantity?
c) Conduct an F test of whether or not β₁ = 0. Control the α risk at 0.01. State the alternatives, decision rule, and conclusion.
d) What is the absolute magnitude of the reduction in the variation of Y when X is introduced into the regression model? What is the relative reduction? What is the name of the latter measure?
e) Obtain r and attach the appropriate sign.
f) Which measure, R² or r, has the more clear-cut operational interpretation? Explain.

2.67, p. 99. Refer to Grade point average Problem 1.19.
a) Plot the data, with the least squares regression line for ACT scores between 20 and 30 superimposed.
b) On the plot from part (a), superimpose a plot of the 95 percent confidence band for the true regression line for ACT scores between 20 and 30. Does the confidence band suggest that the true regression relation has been precisely estimated? Discuss.

3.3, p. 146-147. Refer to Grade point average Problem 1.19.
a) Prepare a box plot for the ACT scores Xᵢ. Are there any noteworthy features in this plot?
b) Prepare a dot plot of the residuals. What information does this plot provide?
c) Plot the residuals eᵢ against the fitted values Ŷᵢ. What departures from regression model (2.1) can be studied from this plot? What are your findings?
d) Prepare a normal probability plot of the residuals. Also obtain the coefficient of correlation between the ordered residuals and their expected values under normality. Test the reasonableness of the normality assumption here using Table B.6 and α = 0.05. What do you conclude?
e) Conduct the Brown-Forsythe test to determine whether or not the error variance varies with the level of X. Divide the data into the two groups X > 26 and X ≤ 26, and use α = 0.01. State the decision rule and conclusion. Does your conclusion support your preliminary findings in part (c)?
f) Information is given below for each student on two variables not included in the model, namely, intelligence test score X₂.

3.21, p. 151. Derive the result in (3.29), SSE = SSPE + SSLF, starting from the identity Yᵢⱼ − Ŷᵢⱼ = (Yᵢⱼ − Ȳⱼ) + (Ȳⱼ − Ŷᵢⱼ).

Homework 3 - Due October 23

3.17, p. 150-151. Sales growth. A marketing researcher studied annual sales of a product that had been introduced 10 years ago. The data are as follows, where X is the year (coded) and Y is sales in thousands of units:

  i:   1    2    3    4    5    6    7    8    9   10
  Xᵢ:  0    1    2    3    4    5    6    7    8    9
  Yᵢ: 98  135  162  178  221  232  283  300  374  395

a) Prepare a scatter plot of the data. Does a linear relation appear adequate here?
b) Use the Box-Cox procedure and standardization (3.36) to find an appropriate power transformation of Y. Evaluate SSE for λ = 0.3, 0.4, 0.5, 0.6, 0.7. What transformation of Y is suggested?
c) Use the transformation Y′ = √Y and obtain the estimated linear regression function for the transformed data.
d) Plot the estimated regression line and the transformed data. Does the regression line appear to be a good fit to the transformed data?
e) Obtain the residuals and plot them against the fitted values. Also prepare a normal probability plot. What do your plots show?
f) Express the estimated regression function in the original units.
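Problem 3.17(b) asks for SSE under several Box-Cox power transformations. A rough Python sketch of that search is below; it assumes the standardized form of the transformation, dividing by a factor based on the geometric mean of Y so that SSE values are comparable across λ, which is one common reading of standardization (3.36), so treat the details as an assumption rather than a quotation from the text.

```python
import numpy as np

# Sales growth data from Problem 3.17
x = np.arange(10, dtype=float)                       # coded year X
y = np.array([98, 135, 162, 178, 221, 232, 283, 300, 374, 395], dtype=float)

def sse_simple_regression(x, w):
    """SSE from regressing w on x by least squares."""
    b1 = np.sum((x - x.mean()) * (w - w.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = w.mean() - b1 * x.mean()
    return np.sum((w - (b0 + b1 * x)) ** 2)

gm = np.exp(np.mean(np.log(y)))                      # geometric mean of Y
for lam in [0.3, 0.4, 0.5, 0.6, 0.7]:
    # Standardized Box-Cox transformation (assumed form of (3.36)):
    # W = (Y^lambda - 1) / (lambda * gm^(lambda - 1))
    w = (y ** lam - 1.0) / (lam * gm ** (lam - 1.0))
    print(f"lambda = {lam:.1f}: SSE = {sse_simple_regression(x, w):.2f}")
```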

4.21, p. 175. When the predictor variable is so coded that X̄ = 0 and the normal error regression model (2.1) applies, are b₀ and b₁ independent? Are the joint confidence intervals for β₀ and β₁ then independent?

5.7, p. 210. Refer to Plastic hardness Problem 1.22. Using matrix methods, find: (1) Y′Y, (2) X′X, (3) X′Y.

5.20, p. 211. Find the matrix A of the quadratic form: 7Y₁² − 8Y₁Y₂ + 8Y₂².

5.26, p. 212. Refer to Plastic hardness Problems 1.22 and 5.7.
a) Using matrix methods, obtain the following: (1) (X′X)⁻¹, (2) b, (3) Ŷ, (4) H, (5) SSE, (6) s²{b}, (7) s²{pred} when Xₕ = 30.
b) From part (a6), obtain the following: (1) s²{b₁}, (2) s{b₀, b₁}, (3) s{b₀}.
c) Obtain the matrix of the quadratic form for SSE.

Homework 4 - Due November 10

6.10, p. 249. Refer to Grocery retailer Problem 6.9.
a) Fit regression model (6.5) to the data for three predictor variables. State the estimated regression function. How are b₁, b₂, and b₃ interpreted here?
b) Obtain the residuals and prepare a box plot of the residuals. What information does this plot provide?
c) Plot the residuals against Ŷ, X₁, X₂, X₃, and X₁X₂ on separate graphs. Also prepare a normal probability plot. Interpret the plots and summarize your findings.
d) Prepare a time plot of the residuals. Is there any indication that the error terms are correlated? Discuss.
e) Divide the 52 cases into two groups, placing the 26 cases with the smallest fitted values Ŷᵢ into group 1 and the other 26 cases into group 2. Conduct the Brown-Forsythe test for constancy of the error variance, using α = 0.01. State the decision rule and conclusion.
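Problems 5.7, 5.26, and 6.10 all lean on the matrix formulation of regression. The sketch below shows, in Python with numpy, how the core quantities (b, H, SSE, s²{b}) fall out of the standard matrix formulas; the toy data are made up purely for illustration and are not the Plastic hardness or Grocery retailer data from the text.

```python
import numpy as np

# Toy illustration only; replace with the actual problem data.
x1 = np.array([16.0, 24.0, 32.0, 40.0, 16.0, 24.0])
y  = np.array([199.0, 214.0, 241.0, 259.0, 205.0, 213.0])

X = np.column_stack([np.ones_like(x1), x1])   # design matrix with intercept column
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)              # (X'X)^(-1)
b = XtX_inv @ X.T @ y                         # least squares estimates b = (X'X)^(-1) X'Y
H = X @ XtX_inv @ X.T                         # hat matrix H = X (X'X)^(-1) X'
y_hat = H @ y                                 # fitted values
e = y - y_hat                                 # residuals
sse = e @ e                                   # SSE = e'e
mse = sse / (n - p)
s2_b = mse * XtX_inv                          # estimated variance-covariance matrix s^2{b}

print("b =", b)
print("SSE =", sse, " MSE =", mse)
print("s^2{b} =\n", s2_b)
```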

7.4, p. 289. Refer to Grocery retailer Problem 6.9.
a) Obtain the analysis of variance table that decomposes the regression sum of squares into extra sums of squares associated with X₁; with X₃, given X₁; and with X₂, given X₁ and X₃.
b) Test whether X₂ can be dropped from the regression model given that X₁ and X₃ are retained. Use the F test statistic and α = 0.05. State the alternatives, decision rule, and conclusion. What is the P-value of the test?
c) Does SSR(X₁) + SSR(X₂ | X₁) equal SSR(X₂) + SSR(X₁ | X₂) here? Must this always be the case?

7.17, p. 290. Refer to Grocery retailer Problem 6.9.
a) Transform the variables by means of the correlation transformation (7.44) and fit the standardized regression model (7.45).
b) Calculate the coefficients of determination between all pairs of predictor variables. Is it meaningful here to consider the standardized regression coefficients to reflect the effect of one predictor variable when the others are held constant?
c) Transform the estimated standardized regression coefficients by means of (7.53) back to the ones for the fitted regression model in the original variables. Verify that they are the same as the ones obtained in Problem 6.10a.
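Problem 7.17 works with the correlation transformation and standardized regression coefficients. As a rough Python illustration on made-up data, assuming the usual definition of the correlation transformation, (value minus mean) divided by (√(n−1) times the standard deviation), as the form of (7.44), and the usual back-transformation bₖ = (s_Y/s_Xk)·bₖ′ as the form of (7.53):

```python
import numpy as np

# Made-up data for illustration; substitute the Grocery retailer data.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2)) * [5.0, 2.0] + [50.0, 10.0]
y = 3.0 + 0.8 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=2.0, size=20)

n = len(y)

def corr_transform(v):
    # Correlation transformation (assumed form of (7.44)):
    # (v_i - mean) / (sqrt(n - 1) * standard deviation)
    return (v - v.mean()) / (np.sqrt(n - 1) * v.std(ddof=1))

Xs = np.column_stack([corr_transform(X[:, j]) for j in range(X.shape[1])])
ys = corr_transform(y)

# Standardized model has no intercept; solve by least squares.
b_std, *_ = np.linalg.lstsq(Xs, ys, rcond=None)

# Transform standardized coefficients back to the original scale
# (assumed form of (7.53)): b_k = (s_Y / s_Xk) * standardized b_k.
b_orig = b_std * (y.std(ddof=1) / X.std(axis=0, ddof=1))
b0 = y.mean() - np.sum(b_orig * X.mean(axis=0))

print("standardized coefficients:", b_std)
print("back-transformed coefficients:", b_orig, " intercept:", b0)
```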

8.16, p. 337-338. Refer to Grade point average Problem 1.19. An assistant to the director of admissions conjectured that the predictive power of the model could be improved by adding information on whether the student had chosen a major field of concentration at the time the application was submitted. Assume that regression model (8.33) is appropriate, where X₁ is entrance test score and X₂ = 1 if the student had indicated a major field of concentration at the time of application and 0 if the major field was undecided. Data for X₂ were as follows:

  i:   1   2   3  ...  118  119  120
  X₂:  0   1   0  ...    1    1    0

a) Explain how each regression coefficient in model (8.33) is interpreted here.
b) Fit the regression model and state the estimated regression function.
c) Test whether the X₂ variable can be dropped from the regression model; use α = 0.01. State the alternatives, decision rule, and conclusion.
d) Obtain the residuals for regression model (8.33) and plot them against X₁X₂. Is there any evidence in your plot that it would be helpful to include an interaction term in the model?

8.34, p. 340. In a regression study, three types of banks were involved, namely, commercial, mutual savings, and savings and loan. Consider the following system of indicator variables for type of bank:

  Type of bank        X₂   X₃
  Commercial           1    0
  Mutual savings       0    1
  Savings and loan     1    1

a) Develop a first-order linear regression model for relating last year's profit or loss (Y) to size of bank (X₁) and type of bank (X₂, X₃).
b) State the response functions for the three types of banks.
c) Interpret each of the following quantities: (1) β₂, (2) β₃, (3) β₂ − β₃.

Homework 5 - Due November 29

9.15, p. 378-379. Kidney function. Creatinine clearance (Y) is an important measure of kidney function, but it is difficult to obtain in a clinical office setting because it requires 24-hour urine collection. To determine whether this measure can be predicted from some data that are easily available, a kidney specialist obtained the data that follow for 33 male subjects. The predictor variables are serum creatinine concentration (X₁), age (X₂), and weight (X₃).
a) Prepare separate dot plots for each of the three predictor variables. Are there any noteworthy features in these plots? Comment.
b) Obtain the scatter plot matrix. Also obtain the correlation matrix of the X variables. What do the scatter plots suggest about the nature of the functional relationship between the response variable Y and each predictor variable? Discuss. Are any serious multicollinearity problems evident? Explain.
c) Fit the multiple regression function containing the three predictor variables as first-order terms. Does it appear that all predictor variables should be retained?

9.16, p. 379. Refer to Kidney function Problem 9.15.
a) Using first-order and second-order terms for each of the three predictor variables (centered around the mean) in the pool of potential X variables (including cross products of the first-order terms), find the three best hierarchical subset regression models according to the Cₚ criterion.
b) Is there much difference in Cₚ for the three best subset models?
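Problem 9.16 asks for the best subsets by the Cₚ criterion. A small brute-force Python sketch of that idea is below; it uses Mallows' Cₚ = SSEₚ/MSE(full) − (n − 2p) on made-up data, and it ignores the "hierarchical" restriction in the problem, so it only shows the mechanics of the search rather than the assigned analysis.

```python
import numpy as np
from itertools import combinations

# Made-up data standing in for the Kidney function predictors.
rng = np.random.default_rng(1)
n = 33
X_full = rng.normal(size=(n, 3))                      # stand-ins for X1, X2, X3
y = 5 + 2 * X_full[:, 0] - 1 * X_full[:, 2] + rng.normal(size=n)

def fit_sse(X, y):
    """Return (SSE, number of parameters) for an intercept-plus-X model."""
    Xd = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    r = y - Xd @ b
    return r @ r, Xd.shape[1]

sse_full, p_full = fit_sse(X_full, y)
mse_full = sse_full / (n - p_full)

results = []
for k in range(1, 4):
    for subset in combinations(range(3), k):
        sse_p, p = fit_sse(X_full[:, list(subset)], y)
        cp = sse_p / mse_full - (n - 2 * p)           # Mallows' Cp
        results.append((cp, subset))

for cp, subset in sorted(results)[:3]:                # three best subsets by Cp
    print(f"subset {subset}: Cp = {cp:.2f}")
```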

9.19, p. 379. Refer to Kidney function Problem 9.15.
a) Using the same pool of potential X variables as in Problem 9.16a, find the best subset of variables according to forward stepwise regression, with α limits of 0.10 and 0.15 to add or delete a variable, respectively.
b) How does the best subset according to forward stepwise regression compare with the best subset according to the R²ₐ,ₚ criterion obtained in Problem 9.16a?

10.10(a), p. 415. Refer to Grocery retailer Problems 6.9 and 6.10.
a) Obtain the studentized deleted residuals and identify any outlying Y observations. Use the Bonferroni outlier test procedure with α = 0.05. State the decision rule and conclusion.

Homework 6 - Due December 11

10.10(b-f), p. 415. Refer to Grocery retailer Problems 6.9 and 6.10.
b) Obtain the diagonal elements of the hat matrix. Identify any outlying X observations using the rule of thumb presented in the chapter.
c) Management wishes to predict the total labor hours required to handle the next shipment, containing X₁ = 300,000 cases, with indirect costs of the total hours X₂ = 7.2 and X₃ = 0 (no holiday in week). Construct a scatter plot of X₂ against X₁ and determine visually whether this prediction involves an extrapolation beyond the range of the data. Also, use (10.29) to determine whether an extrapolation is involved. Do your conclusions from the two methods agree?
d) Cases 16, 22, 43, and 48 appear to be outlying X observations, and cases 10, 32, 38, and 40 appear to be outlying Y observations. Obtain the DFFITS, DFBETAS, and Cook's distance values for each of these cases to assess their influence. What do you conclude?
e) Calculate the average absolute percent difference in the fitted values with and without each of these cases. What does this measure indicate about the influence of each of the cases?
f) Calculate Cook's distance Dᵢ for each case and prepare an index plot. Are any cases influential according to this measure?
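Problem 10.10 revolves around leverage and influence measures. The Python sketch below computes hat-matrix diagonals, studentized deleted residuals, and Cook's distance directly from the standard formulas, on made-up data since the Grocery retailer data are in the text; it illustrates the formulas rather than reproducing the assigned solution.

```python
import numpy as np

# Made-up data for illustration; use the Grocery retailer data in practice.
rng = np.random.default_rng(2)
n = 52
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([10.0, 2.0, 0.5, -1.0]) + rng.normal(size=n)
p = X.shape[1]

XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T
h = np.diag(H)                                   # leverages h_ii
e = y - H @ y                                    # residuals
sse = e @ e
mse = sse / (n - p)

# Studentized deleted residuals t_i (for outlying Y observations)
t = e * np.sqrt((n - p - 1) / (sse * (1 - h) - e ** 2))

# Cook's distance D_i (influence)
D = (e ** 2) * h / (p * mse * (1 - h) ** 2)

# Common rule of thumb for outlying X observations: h_ii > 2p/n
print("high-leverage cases:", np.where(h > 2 * p / n)[0])
print("largest |t_i|:", np.max(np.abs(t)).round(3))
print("largest Cook's D:", np.max(D).round(3))
```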

11.29, p. 479. Refer to Muscle mass Problem 1.27.
a) Fit a two-region regression tree. What is the first split point based on age? What is SSE for this two-region tree?
b) Find the second split point given the two-region tree in part (a). What is SSE for the resulting three-region tree?
c) Find the third split point given the three-region tree in part (b). What is SSE for the resulting four-region tree?
d) Prepare a scatter plot of the data with the four-region tree in part (c) superimposed. How well does the tree fit the data? What does the tree suggest about the change in muscle mass with age?
e) Prepare a residual plot of eᵢ versus Ŷᵢ for the four-region tree in part (d). State your findings.

13.10, p. 550. Enzyme kinetics. In an enzyme kinetics study the velocity of a reaction (Y) is expected to be related to the concentration (X) as follows:

  Yᵢ = γ₀Xᵢ / (γ₁ + Xᵢ) + εᵢ

Eighteen concentrations have been studied and the results follow:

  i:   1    2    3   ...   16    17    18
  Xᵢ:  1    1.5  2   ...   30    35    40
  Yᵢ:  2.1  2.5  4.9 ...   19.7  21.3  21.6

a) To obtain starting values for g₀ and g₁, observe that when the error term is ignored we have Yᵢ′ = β₀ + β₁Xᵢ′, where Yᵢ′ = 1/Yᵢ, β₀ = 1/γ₀, β₁ = γ₁/γ₀, and Xᵢ′ = 1/Xᵢ. Therefore fit a linear regression function to the transformed data to obtain initial estimates g₀⁽⁰⁾ = 1/b₀ and g₁⁽⁰⁾ = b₁/b₀.
b) Using the starting values obtained in part (a), find the least squares estimates of the parameters γ₀ and γ₁.

13.12, p. 550. Refer to Enzyme kinetics Problem 13.10. Assume that the fitted model is appropriate and that large-sample inferences can be employed here.
1) Obtain an approximate 95 percent confidence interval for γ₀.
2) Test whether or not γ₀ = 20; use α = 0.05. State the alternatives, decision rule, and conclusion.
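Problems 13.10 and 13.12 fit the nonlinear model Yᵢ = γ₀Xᵢ/(γ₁ + Xᵢ) + εᵢ. One way to carry out the two-step approach in part (a), linearize to get starting values and then run nonlinear least squares, is sketched below in Python with scipy; the six data values are only the fragment printed above, so the numbers it produces are illustrative rather than answers to the problem.

```python
import numpy as np
from scipy.optimize import curve_fit

# Fragment of the enzyme kinetics data shown above (the full problem has 18 cases).
x = np.array([1.0, 1.5, 2.0, 30.0, 35.0, 40.0])
y = np.array([2.1, 2.5, 4.9, 19.7, 21.3, 21.6])

# Step 1: linearize 1/Y = beta0 + beta1 * (1/X) to get starting values.
xr, yr = 1.0 / x, 1.0 / y
b1 = np.sum((xr - xr.mean()) * (yr - yr.mean())) / np.sum((xr - xr.mean()) ** 2)
b0 = yr.mean() - b1 * xr.mean()
g0_start, g1_start = 1.0 / b0, b1 / b0        # g0 = 1/b0, g1 = b1/b0

# Step 2: nonlinear least squares for Y = g0 * X / (g1 + X).
def mm(x, g0, g1):
    return g0 * x / (g1 + x)

(g0_hat, g1_hat), cov = curve_fit(mm, x, y, p0=[g0_start, g1_start])

# Approximate large-sample 95% CI for g0 from the asymptotic covariance matrix.
se_g0 = np.sqrt(cov[0, 0])
print(f"starting values: g0 = {g0_start:.3f}, g1 = {g1_start:.3f}")
print(f"estimates: g0 = {g0_hat:.3f}, g1 = {g1_hat:.3f}")
print(f"approx. 95% CI for g0: ({g0_hat - 1.96 * se_g0:.3f}, {g0_hat + 1.96 * se_g0:.3f})")
```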