Statistics 13, Lab 6: Regression

1. Getting started

The data for this lab come from a study initiated by the Tasmanian Aquaculture and Fisheries Institute to investigate the growth patterns of abalone living along the Tasmanian coastline. The harvest of abalone is subject to quotas that restrict both the number of abalone that can be caught and their size. The population of abalone can be regulated effectively if there is a simple way to tell the age of abalone based solely on their appearance. Hence, researchers are interested in relating abalone age to variables like the length, height and weight of the animal (measurements that a diver can take before harvesting the animal). If a reasonably accurate model can be found, then the Tasmanian officials can develop rules that prevent overharvesting of young abalone.

Determining the actual age of an abalone is a bit like estimating the age of a tree. Rings are formed in the shell of the abalone as it grows, usually at the rate of one ring per year. Getting access to the rings of an abalone involves cutting the shell. After polishing and staining, a lab technician examines a shell sample under a microscope and counts the rings. Because some rings are hard to make out using this method, these researchers believed adding 1.5 to the ring count is a reasonable approximation of the abalone's age.

The relationship between age and ring count, however, is somewhat controversial. Under certain conditions, abalone can grow more than one ring per year. These conditions relate to weather patterns and other environmental variables. These effects suggest that any relationship we find between ring count and size measurements taken from the animal is likely to involve a lot of error; there are plenty of variables that have been left out of the study.

Let's start by loading your data set.
    source("http://www.stat.ucla.edu/~cocteau/stat13/data/ab.r")
    ls()

Among the various objects in your workspace, you should see a new dataset called ab (for abalone).

2. Examining the data

With the discussion in the previous section as background, let's look at the variables that were collected for this study; presumably they are among the most important factors influencing ring count.

    dim(ab)
    names(ab)
    ab[1:5,]

The last command returns the first five observations from the dataset; that is, data on 5 of the 2,500 abalone. We can approximately determine the age of an abalone in this study by adding 1.5 to the number of rings in the variable rings. How old are these 5 specimens?

The other measurements recorded for each abalone are given in the table below.

Variable name   Units   Description
-------------   -----   -----------
rings           count   number of rings
length          mm      longest shell measurement
diameter        mm      measured perpendicular to the length
height          mm      height of abalone with meat in the shell
whole           grams   weight of the whole abalone
shucked         grams   weight of just the meat
viscera         grams   gut weight after bleeding
shell           grams   weight of the shell after drying
infant          0/1     1 if the abalone is an infant, 0 otherwise

The 2,500 abalone in this dataset all had between 1 and 29 rings. If we use the approximate dating rule, that means they are between 2.5 and 30.5 years old. Examine the data for the oldest and youngest abalone in the dataset with the following two commands.

    subset(ab, ab$rings == 1)
    subset(ab, ab$rings == 29)

In the first command, recall that the == relation is asking for all the data in ab for abalone having just one ring; in the second command we are asking for the data for those abalone with 29 rings.

Question 1.

(a) Do the sets of measurements for abalone with 1 and with 29 rings make sense? That is, what aspects of the data agree with the fact that one of these abalone is very young and the other is very old? (Not more than four sentences, please.)

(b) Briefly comment on the variables in your dataset. All but one are quantitative; use summary() to describe them. For the one categorical variable, use table().

Note: This is meant to document the fact that you looked at the data briefly. Do not spend a lot of time on this question. Good data analysis starts by looking at the data a little.

3. Correlation

We now consider more formal descriptions of the relationships between two variables.
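The main such description in this section is the correlation coefficient. As a minimal sketch of what it computes, the code below standardizes two variables, multiplies the standardized values pairwise, and divides the sum by n - 1, then compares the result with R's built-in cor. The vectors x and y are made-up illustration values, not measurements from ab.

```r
# Hypothetical toy data, not taken from the abalone dataset
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Standardize each variable, multiply pairwise, and divide the sum by n - 1
n <- length(x)
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)
r_by_hand <- sum(zx * zy) / (n - 1)

r_by_hand
cor(x, y)  # the built-in function computes the same quantity
```

The two printed values should agree up to rounding, because sd() also uses the n - 1 divisor.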
In lecture 18 and in your text, you will find a description of the correlation coefficient. Under certain conditions, this quantity measures the degree to which the data in a simple scatterplot lie on a straight line. See your text or your lecture notes for details on how you would compute the correlation coefficient from a sample of data. In R, we can use the command cor.

    cor(ab$length, ab$rings)

What value do you get? Use the material in lecture to guess what the scatterplot of length and rings might look like. Now, let's make this plot and see if your guess was correct.

    plot(ab$length, ab$rings)

In lecture we discussed the use of a scatterplot matrix to see the relationship between several variables at once. You can create this kind of plot with the R command pairs.

    pairs(ab[,1:3])

This will create a scatterplot matrix with the first three columns, or variables, in the dataset ab. (Note that plots involving rings will have stripes because the ring count from abalone shells is a discrete variable. It has a large number of levels, but ring count is still a whole number.)

Just like pairs, the command cor can also act on several variables and return their correlations in the form of a matrix. Try the following command.

    cor(ab[,1:3])

Question 2.

(a) Do the relationships in the scatterplots agree with your intuition about the correlation coefficients? Explain using a pair with high correlation and one with relatively low correlation. The diagonal elements in cor(ab[,1:3]) are all 1. Does that make sense?

(b) Consider now the first four variables in ab and again form a scatterplot matrix and a matrix of correlation coefficients. The new variable introduced is height. How is it related to the other variables? How is it correlated with the other variables? (Have a look at the last row or column of the pairs plot and the matrix of correlation coefficients.) Finally, what is odd about the variable height?

(c) Now, remove the two outliers in height; they are observations 1381 and 1504. You can remove them with the command

    ab2 <- ab[-c(1381,1504),]

which creates a new dataset ab2. Again, we are indexing rows in this command, but the minus sign tells R that we want to leave these rows out. This new dataset should have only 2,498 rows. Now, remake the scatterplot matrix and the matrix of correlation coefficients, calling the commands above with ab2 rather than ab. What has happened to the correlation coefficients? Comment on the change in the plots.

Notice that many of the relationships we have seen so far tend to exhibit unequal spread. The plot of height by diameter, for example, shows greater spread for larger values of diameter. We are not going to consider this problem in depth (this is your first taste of regression, after all), but it is something we should worry about, and will in a more advanced class.

4. Regression with a single predictor

In the last section, we saw plots of rings against the first three predictor variables. There seemed to be a lot of spread in these data. As noted in the introduction, the investigators at the Marine Resources Division acknowledge that several environmental factors can influence the growth of rings in abalone. In other words, there are variables relating to the condition of the water off the coast of Tasmania, as well as the weather patterns for the last 30 years, that might help explain ring counts. None of these data are available to us at this point, and hence we expect a certain amount of error in any model we build with our 8 predictor variables.

To begin the modeling process, we will just look at a simple linear model with only one predictor. In mathematical terms we consider the following description of the data:

    $$\text{rings} = \beta_0 + \beta_1 \cdot \text{length} + \text{error} \tag{1}$$

where error is a random error. Here the error accounts for all the variables we haven't included in the model. We can compute least squares estimates for $\beta_0$ and $\beta_1$ with the command

    fit <- lm(rings~length,data=ab)

Here, the argument data = ab tells R to look for the data on rings and length in the dataset ab.
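For a model with a single predictor, such as (1), the least squares estimates have a simple closed form: the slope is the correlation times the ratio of standard deviations, and the intercept makes the line pass through the point of means. The sketch below checks this against lm on simulated data. The data frame toy and its columns x and y are hypothetical stand-ins, not part of the lab's dataset.

```r
# Simulated toy data; the names and numbers are illustrative only
set.seed(1)
toy <- data.frame(x = runif(50))
toy$y <- 3 + 10 * toy$x + rnorm(50)

# Closed-form least squares estimates for a single predictor
b1 <- cor(toy$x, toy$y) * sd(toy$y) / sd(toy$x)  # slope
b0 <- mean(toy$y) - b1 * mean(toy$x)             # intercept

c(b0, b1)
coefficients(lm(y ~ x, data = toy))  # should agree with the line above
```

This ties the correlation from the previous section to the slope: with the spreads held fixed, a stronger correlation gives a steeper fitted line.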
The summary table we studied in class is obtained with the next command.

    summary(fit)

We can extract just the coefficients from the model with the command

    coefficients(fit)
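Once the coefficients are available, the estimated conditional mean of the response at a given predictor value is the intercept plus the slope times that value. The sketch below shows this on simulated data, both by hand and with predict(). The names toy, toy_fit, x and y, and the value 0.25, are hypothetical choices for illustration.

```r
# Simulated toy data; not the abalone measurements
set.seed(2)
toy <- data.frame(x = runif(40))
toy$y <- 1 + 4 * toy$x + rnorm(40, sd = 0.5)
toy_fit <- lm(y ~ x, data = toy)

# Estimated conditional mean at x = 0.25, computed by hand
b <- coefficients(toy_fit)
by_hand <- unname(b[1] + b[2] * 0.25)

# The same estimate from predict(), which expects a data frame of new values
by_predict <- unname(predict(toy_fit, newdata = data.frame(x = 0.25)))

c(by_hand, by_predict)
```

The column name in newdata has to match the predictor name used in the model formula, otherwise predict() cannot find the variable.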

Question 3.

(a) What are the least squares estimates of $\beta_0$ and $\beta_1$?

(b) Use these values to create an estimate of the conditional mean ring count for abalone that have length 0.4. Do the same for abalone that have length 0.7.

(c) Assuming the 2,500 specimens in your dataset are a random sample of abalone off the coast of Tasmania, explain how you would use the bootstrap to assess the uncertainty in the estimate of the slope $\beta_1$.

Bonus. Implement the bootstrap procedure to estimate the standard error of the slope estimate. In the code below, we draw abalone with replacement from our original data set. For each bootstrap sample we fit a regression and record the slope estimate. In all we will have 5,000 bootstrap replicates of the slope. Describe the distribution of these numbers and use their standard deviation as an estimate of the standard error. How does it compare to the value in the regression table we generated with the command summary above?

    replicates <- rep(0,5000)
    for(i in 1:5000){
      # draw row numbers with replacement (TRUE, not t, which is the transpose function)
      sample_points <- sample(1:nrow(ab),replace=TRUE)
      bootsample <- ab[sample_points,]
      # use a new name so the original fit is not overwritten
      bootfit <- lm(rings~length,data=bootsample)
      replicates[i] <- coefficients(bootfit)[2]
    }
    print(i)   # prints 5000 once the loop has finished
    hist(replicates)
    sd(replicates)
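The comparison asked for in the bonus can be tried out first on simulated data, where each run is quick. The following is a sketch under stated assumptions, not the lab's required solution: the data frame toy, the helper boot_slope and the choice of 1,000 replicates are all hypothetical. It places the bootstrap standard error next to the one that summary reports.

```r
# Simulated toy data standing in for a real sample
set.seed(3)
toy <- data.frame(x = runif(100))
toy$y <- 2 + 5 * toy$x + rnorm(100)
toy_fit <- lm(y ~ x, data = toy)

# One bootstrap replicate: resample rows with replacement, refit, keep the slope
boot_slope <- function() {
  rows <- sample(nrow(toy), replace = TRUE)
  unname(coefficients(lm(y ~ x, data = toy[rows, ]))[2])
}
slopes <- replicate(1000, boot_slope())

# Bootstrap standard error next to the one reported in the regression table
sd(slopes)
summary(toy_fit)$coefficients["x", "Std. Error"]
```

When the errors have roughly constant spread, the two numbers are usually close; a large gap can be a sign of the unequal spread noted earlier in the lab.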