Lecture 1 - Data and Data Summaries

Lecture 1 - Data and Data Summaries Statistics 102 Colin Rundel January 14, 2013

Announcements

Homework 1 - Out 1/16, due 1/23. Questions come from the textbook, so make sure you have a copy.
Lab 1 - Tomorrow. RStudio accounts have been created; try logging in at http://beta.rstudio.org
In-class quiz - using Sakai, first 10 mins (open book, internet, etc.)

Data - Types of Data

All variables are either numerical or categorical.

Numerical (quantitative) - takes on numerical values.
Ask yourself: is it sensible to add, subtract, or calculate an average of these values?

Categorical (qualitative) - takes on one of a limited number of distinct categories.
Ask yourself: are there only certain values (or categories) possible? Even if the categories can be identified with numbers, check whether it would be sensible to do arithmetic operations with these values.

Data - Types of Data: Numerical Data

Numerical variables are either continuous or discrete.

Continuous - data that is measured; any numerical (decimal) value
Discrete - data that is counted; only whole non-negative numbers

Data - Types of Data: Categorical Data

Categorical variables are either ordinal or regular categorical.

Ordinal - categorical data where the categories have a natural order
If the levels do not have an inherent ordering, the variable is simply called categorical.

Data - Types of Data: Example - Class Survey

Students in an introductory statistics course were asked the following questions as part of a class survey:

1. What is your gender, male or female?
2. Are you introverted or extraverted?
3. On average, how much sleep do you get per night?
4. What is your bedtime: 8pm-10pm, 10pm-12am, 12am-2am, later than 2am?
5. How many countries have you visited?
6. On a scale of 1 (very little) - 5 (a lot), how much do you dread this semester?

Data - Types of Data: Example - Class Survey

The data matrix (data frame) below shows a sample of responses from this survey.
Columns represent variables.
Rows represent observations (cases).

student  gender  intro_extra  sleep  bedtime  countries  dread
1        male    extravert    8      10-12    13         3
2        female  extravert    8      8-10     7          2
3        female  introvert    5      12-2     1          4
4        female  extravert    6.5    12-2     0          2
...
86       male    extravert    7      12-2     5          3
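As a minimal sketch of the rows-as-observations, columns-as-variables idea, the five rows shown above can be held as a list of records in Python. The lecture itself uses R; the names `columns`, `rows`, `data`, and `sleep` are illustrative choices, not part of the course material.

```python
# Column names (variables) as they appear in the slide's data matrix.
columns = ["student", "gender", "intro_extra", "sleep", "bedtime", "countries", "dread"]

# The five rows (observations) visible on the slide; the other 81 are elided there.
rows = [
    (1, "male", "extravert", 8.0, "10-12", 13, 3),
    (2, "female", "extravert", 8.0, "8-10", 7, 2),
    (3, "female", "introvert", 5.0, "12-2", 1, 4),
    (4, "female", "extravert", 6.5, "12-2", 0, 2),
    (86, "male", "extravert", 7.0, "12-2", 5, 3),
]

# Each observation becomes a dict keyed by variable name.
data = [dict(zip(columns, row)) for row in rows]

# Pulling out one column gives all recorded values of one variable.
sleep = [obs["sleep"] for obs in data]

print(len(data), "observations,", len(columns), "variables")
print("sleep:", sleep)
```

Here `sleep` and `countries` are numerical, while `gender`, `intro_extra`, and `bedtime` are categorical even though `bedtime` is written with digits.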

Visualization - Scatterplots

http://www.gapminder.org/world

Visualization - Dot plots

Useful for visualizing one numerical variable, especially when individual values are of interest.

[Figure: dot plot of d$weight_kg, axis from 50 to 250]

Do you see anything out of the ordinary?

Histograms and shape - Histograms

Preferable when the sample size is large, but they hide finer details such as individual observations.
Histograms provide a view of the data density. Higher bars represent where the data are relatively more common.
Histograms are especially convenient for describing the shape of the data distribution.

[Figure: histogram of d$no_sex_partner (0 to 10); y-axis: Frequency]

Histograms and shape - Bin width

The chosen bin width can alter the story the histogram is telling.

Which one(s) of these histograms are useful? Which reveal too much about the data? Which hide too much?

[Figure: three histograms of d$no_fb_day drawn with different bin widths; y-axis: Frequency]
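The effect of bin width can be sketched without any plotting by counting the same values into bins of different widths. The values below are made up for illustration (they are not the class data), and `bin_counts` is a hypothetical helper, not a function from the course.

```python
from collections import Counter


def bin_counts(values, width):
    """Count values in left-closed bins [k*width, (k+1)*width), keyed by left edge."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))


# Made-up, right-skewed values used only to illustrate the idea.
values = [0, 1, 1, 2, 3, 5, 8, 8, 9, 13, 21, 34]

for width in (20, 5, 1):
    print(f"bin width {width}: {bin_counts(values, width)}")
```

With a width of 20 almost everything falls into one bar, which hides the shape; with a width of 1 nearly every value gets its own bar, which shows too much detail to read a shape from.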

Histograms and shape - Skewness

Is the histogram right skewed, left skewed, or symmetric?

[Figure: three histograms with panels labeled rs, ls, and sym]

Histograms and shape - Modality

Does the histogram have a single prominent peak (unimodal), several (bimodal/multimodal), or no prominent peaks (uniform)?

[Figure: three histograms with panels labeled unimod, bimod, and uniform]

Note: In order to determine modality, it's best to step back and imagine a smooth curve.

Histograms and shape - Examples

How would you expect each of these variables to be distributed?

1. weights of adult females
2. salaries of a random sample of people from North Carolina
3. exam scores
4. birthdays of classmates (day of the month)

Centrality - Guess the center

What would you guess is the average number of hours students sleep per night?

[Figure: dot plot of d$hrs_sleep_night, axis from 4 to 10]

Centrality - Guess the center, cont.

What would you guess is the average weight of students?

[Figure: dot plot of d$weight_kg, axis from 50 to 250]

Centrality - Mean

$$\bar{x} = \frac{1}{n}\left(x_1 + x_2 + x_3 + \cdots + x_n\right) = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Sample mean ($\bar{x}$) - Arithmetic average of the values in the sample.
Population mean ($\mu$) - Computed the same way, but it is often not possible to calculate $\mu$ since population data is rarely available.

The sample mean is a sample statistic, or a point estimate of the population mean. This estimate may not be perfect, but if the sample is good (representative of the population) it is usually a good guess.
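A minimal sketch of the sample mean formula, applied to the five sleep values visible in the class survey sample. Python is used here for illustration although the course works in R; the variable names are assumptions.

```python
from statistics import mean

# Hours of sleep for the five students shown in the survey sample.
sleep = [8, 8, 5, 6.5, 7]

# Sample mean computed directly from the formula: (1/n) * sum of x_i.
n = len(sleep)
x_bar = sum(sleep) / n

# The standard library gives the same value.
print(x_bar, mean(sleep))
```

These five values sum to 34.5, so the sample mean is 34.5 / 5 = 6.9 hours. It is a point estimate for the students shown, not the population mean.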

Centrality - Are you typical?

http://www.youtube.com/watch?v=4b2xovkffz4

How useful are centers alone for conveying the true characteristics of a distribution?

Centrality - Variance

Sample variance:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$

Population variance:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2$$

Roughly the average squared deviation from the mean.

Why do we use the squared deviation in the calculation of variance?

Centrality - Standard deviation

Defined to be the square root of the variance.

Sample SD:

$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$

Population SD:

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2}$$

Note that variance has squared units while the SD has the same units as the data, which leads to a more natural interpretation.
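The variance and SD formulas can be checked numerically with a short sketch. The eight values below are made up so the arithmetic comes out cleanly, and they are treated as a sample and as a whole population only to show the two divisors side by side.

```python
from math import sqrt
from statistics import pvariance, variance

# Made-up values with mean 5, chosen so the arithmetic is easy to follow.
x = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(x)
x_bar = sum(x) / n

# Sum of squared deviations from the mean.
ss = sum((xi - x_bar) ** 2 for xi in x)

sample_var = ss / (n - 1)  # s^2: divide by n - 1
pop_var = ss / n           # sigma^2: divide by N (treating x as the whole population)

sample_sd = sqrt(sample_var)
pop_sd = sqrt(pop_var)

print("s^2 =", sample_var, " s =", sample_sd)
print("sigma^2 =", pop_var, " sigma =", pop_sd)

# Cross-check against the standard library.
print(variance(x), pvariance(x))
```

The squared deviations sum to 32, so the population variance is 32 / 8 = 4 (SD 2) and the sample variance is 32 / 7, which is slightly larger because of the n - 1 divisor.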

Centrality - Median, Quartiles, and IQR

The median is the value that splits the data in half when ordered in ascending order, i.e. the 50th percentile.

0, 1, 2, 3, 4 (median = 2)

If there are an even number of observations, then the median is the average of the two values in the middle.

0, 1, 2, 3, 4, 5 (median = $\frac{2 + 3}{2} = 2.5$)

The 25th percentile is also called the first quartile, Q1.
The 75th percentile is also called the third quartile, Q3.
The range that the middle 50% of the data span is called the interquartile range, or the IQR.
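A short sketch of these definitions using the two small data sets above. Quartile conventions differ between textbooks and software; this example assumes the "inclusive" interpolation method of Python's `statistics.quantiles`, which is not necessarily the convention used in the course.

```python
from statistics import median, quantiles

odd = [0, 1, 2, 3, 4]
even = [0, 1, 2, 3, 4, 5]

# Median: the middle value, or the average of the two middle values.
print(median(odd))   # 2
print(median(even))  # 2.5

# Quartiles of the odd-length example under the 'inclusive' convention (an assumption).
q1, q2, q3 = quantiles(odd, n=4, method="inclusive")
iqr = q3 - q1

print("Q1 =", q1, "Q3 =", q3, "IQR =", iqr)
```

For 0, 1, 2, 3, 4 this convention gives Q1 = 1, Q3 = 3, and IQR = 2; other conventions can give slightly different quartiles on small samples.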

Box plots - Box plot

A box plot visualizes the median, the quartiles, and suspected outliers.

[Figure: box plot of Number of Characters (in thousands), axis from 0 to 60, with labels for the lower whisker, Q1 (first quartile), median, Q3 (third quartile), upper whisker, max whisker reach, and suspected outliers]

Box plots - Box plot - Example

Resting Pulse: 62, 64, 68, 70, 70, 74, 74, 76, 76, 78, 78, 80

Steps:

1. Calculate median, Q1, Q3, IQR, min, and max
2. Calculate upper and lower fences (Q1 - 1.5 × IQR, Q3 + 1.5 × IQR)
3. Find the location of the upper and lower whiskers
4. Locate data points outside the whiskers as potential outliers
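The steps above can be sketched for the resting pulse data as follows. The quartiles here are taken as the medians of the lower and upper halves of the sorted data, which is one common convention and an assumption on my part; software such as R may use a different rule and report slightly different quartiles.

```python
from statistics import median

pulse = [62, 64, 68, 70, 70, 74, 74, 76, 76, 78, 78, 80]

# Step 1: median, quartiles (median-of-halves convention, an assumption), IQR, min, max.
data = sorted(pulse)
half = len(data) // 2
med = median(data)
q1 = median(data[:half])    # median of the lower half
q3 = median(data[-half:])   # median of the upper half
iqr = q3 - q1

# Step 2: fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Step 3: whiskers reach the most extreme observations inside the fences.
inside = [x for x in data if lower_fence <= x <= upper_fence]
lower_whisker, upper_whisker = min(inside), max(inside)

# Step 4: anything beyond the fences is a suspected outlier.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("median", med, "Q1", q1, "Q3", q3, "IQR", iqr)
print("fences", lower_fence, upper_fence)
print("whiskers", lower_whisker, upper_whisker, "outliers", outliers)
```

Under this convention the median is 74, Q1 is 69, Q3 is 77, and the IQR is 8, so the fences sit at 57 and 89. Every observation lies inside them, the whiskers end at the minimum (62) and maximum (80), and no points are flagged as outliers.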

Box plots - Robust statistics

The median and IQR are examples of what are known as robust statistics, because they are less affected by skewness and outliers than statistics like the mean and SD.

As such:
for skewed distributions it is more appropriate to use the median and IQR to describe the center and spread
for symmetric distributions it is more appropriate to use the mean and SD to describe the center and spread

If you were searching for a car and are price conscious, would you be more interested in the mean or the median vehicle price?

Box plots - Mean vs. median

If the distribution is symmetric, center is the mean.
Symmetric: mean = median

If the distribution is skewed or has outliers, center is the median.
Right-skewed: mean > median
Left-skewed: mean < median

[Figure: three histograms with panels labeled ls, rs, and sym; red solid line - mean, black dashed line - median]
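A small numeric sketch of how skew pulls the mean away from the median. All three data sets are made up for illustration; the left-skewed set is simply the mirror image of the right-skewed one.

```python
from statistics import mean, median

# Made-up data: a long right tail, its mirror image, and a symmetric set.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 20]
left_skewed = [21 - x for x in right_skewed]
symmetric = [1, 2, 3, 4, 5]

for name, values in [("right-skewed", right_skewed),
                     ("left-skewed", left_skewed),
                     ("symmetric", symmetric)]:
    print(f"{name}: mean = {mean(values):.2f}, median = {median(values)}")
```

The single large value 20 drags the mean of the right-skewed set above its median of 3, while the median itself would not move if 20 were replaced by 200. That is the sense in which the median is robust.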

Box plots - Relative Frequency Histograms

The infant mortality rate is defined as the number of infant deaths per 1,000 live births. The relative frequency histogram below shows the distribution of estimated infant death rates in 2012 for 222 countries.

Where would you estimate the third quartile to be located?

[Figure: relative frequency histogram; x-axis: Infant Mortality Rate (per 1000 births), 0 to 120; y-axis ticks at 0, 0.125, 0.25, 0.375]

Categorical data - Summarizing categorical data: Contingency tables

Is there a relationship between believing in God and gender?

          Female  Male
No        14      8
Somewhat  16      7
Yes       26      10

What percent of females believe in God?
What percent of males believe in God?

Categorical data - Summarizing categorical data: Contingency tables (cont.)

          Female  Male  Total
No        14      8     22
Somewhat  16      7     23
Yes       26      10    36
Total     56      25    81

Females: 26 / 56 ≈ 46% answered Yes
Males: 10 / 25 = 40% answered Yes
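The column percentages can be computed from the counts in the table with a short sketch. The dictionary layout and variable names are illustrative assumptions, and "believe in God" is read here as the "Yes" row only.

```python
# Counts from the contingency table: answer -> gender -> count.
counts = {
    "No": {"Female": 14, "Male": 8},
    "Somewhat": {"Female": 16, "Male": 7},
    "Yes": {"Female": 26, "Male": 10},
}
genders = ("Female", "Male")

# Marginal totals.
col_totals = {g: sum(row[g] for row in counts.values()) for g in genders}
row_totals = {answer: sum(row.values()) for answer, row in counts.items()}
grand_total = sum(col_totals.values())

# Percent answering "Yes" within each gender (column percentages).
pct_yes = {g: 100 * counts["Yes"][g] / col_totals[g] for g in genders}

print(col_totals, row_totals, grand_total)
for g in genders:
    print(f"{g}: {pct_yes[g]:.1f}% answered Yes")
```

The counts in the table sum to 81. Dividing by the column total rather than the grand total is what makes the two genders comparable, since there are more than twice as many female respondents as male respondents.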

Categorical data - Visualizing categorical data: Barplot

[Figure: bar plot; y-axis: Frequency, 0 to 35; bars for Arts and humanities, Natural science, Social sciences, and Other]

Categorical data - Visualizing categorical data: Mosaicplots

Is there a relationship between major and relationship status?

      Rel  Compl  Single
A&H   8    2      7
NS    6    1      17
SS    9    5      23
Oth   1    0      3

[Figure: mosaic plot of major (A&H, NS, SS, Oth) by relationship status (Rel, Compl, Single)]

Categorical data - Visualizing categorical data: Bivariate Barplots

[Figure: bar plots of relationship status (Rel, Compl, Single) by major (A&H, NS, SS, Oth); y-axis: Frequency, 0 to 50]

Categorical data - Numerical data across categories: Side-by-side box plot

How does the number of drinks consumed per week vary by affiliation?

[Figure: side-by-side box plots; y-axis: Drinks per week, 0 to 30; x-axis: Affiliation (Greek, SLG, Independent)]

Categorical data - Summary

Visualization Summary:

Single numeric - dot plot, box plot, histogram
Single categorical - bar plot (or a table)
Two numeric - scatter plot
Two categorical - mosaic plot, stacked or side-by-side bar plot
Numeric and categorical - side-by-side box plot

Tufte's Principles:

1. Above all else show data.
2. Maximize the data-ink ratio.
3. Erase non-data-ink.
4. Erase redundant data-ink.
5. Revise and edit.