Variables, distributions, and samples Phil 12: Logic and Decision Making Spring 2011 UC San Diego 4/21/2011
Midterm this Tuesday!
- Don't need a blue book or scantron
- Just bring something to write with
Sample midterm
- Not posting an answer key
- Check answers by checking text, notes, in section, office hours, email
- If asking me or TAs, must talk through what you think the answer might be, talk through options, reasoning
Anonymous clicker question Do you want me to hold office hours Monday afternoon or evening? A. Yes, Monday 2-4pm B. Yes, Monday 3-5pm C. No, I'm good
Review Observational research involves careful recording and analysis of what is observed - Without an attempt to manipulate what happens Naturalistic vs. participant observation Risks that must be minimized: - Observer bias - Reactivity - Anthropomorphizing
Coding Schemes A coding scheme is a set of categories used to classify observed phenomena - extract data so as to learn from the observations How can a coding scheme be poorly designed? - fail to have a category for some phenomena you care about recording and analyzing - use one category for phenomena you would like to distinguish
Recording continuously vs. selectively
- Continuous observation: recording what is happening at every moment of time
- Time sampling: recording what is happening at predetermined intervals
- Event sampling: recording whenever an event of a specified kind occurs
- Situation sampling: recording what happens in a variety of different situations (locations)
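The difference between these recording strategies can be sketched in a few lines of Python. The observation log below is hypothetical (the minutes and behaviors are invented for illustration), and the 3-minute interval is an arbitrary choice.

```python
# Hypothetical observation log: one entry per minute (invented for illustration).
behaviors = ["reading", "reading", "phone call", "reading", "talking",
             "phone call", "reading", "talking", "reading", "reading"]
log = list(enumerate(behaviors))  # [(minute, behavior), ...]

# Continuous observation: keep every moment.
continuous = log

# Time sampling: record only at predetermined intervals (here, every 3 minutes).
time_sampled = [(minute, b) for minute, b in log if minute % 3 == 0]

# Event sampling: record only when an event of a specified kind occurs.
event_sampled = [(minute, b) for minute, b in log if b == "phone call"]

print("time sampling: ", time_sampled)
print("event sampling:", event_sampled)
```

In this made-up log, time sampling happens to miss every phone call, which is one reason the choice of strategy depends on what you want to learn.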
Clicker question To determine how many students carry backpacks, a researcher sits outside the library and records, for every fifth student who exits, whether they have a backpack. The researcher is performing A. Continuous observation B. Time sampling C. Event sampling D. Situation sampling
Variables The data from observational research is analyzed in terms of variables A variable is a characteristic or feature of an event that varies (i.e., takes on different values) - Variables of a thrown ball: velocity, momentum, direction, spin,... - Variables of human hair: color, length, texture,... - Variables of human cognition: memory span, speed of reasoning, emotional state,...
Types of variables Variables differ in the type of measurement of the values of the variable that is possible. Sometimes one refers to types of scales rather than types of variables. 1. Categorical or nominal variables 2. Ordinal or rank variables 3. Interval variables 4. Ratio variables
Types of variables - 1 Categorical or nominal variables: items can be assigned to a category (whose members can then be counted, or compared on another variable) - Examples: Gender: male/female Major: psychology, political science, economics,... Organisms: Plant, Animal, Bacteria, Virus,...
Types of variables - 2 Ordinal or rank variables: There is a rank-order to the values the variable may take - Numbers might be assigned to the items, but since there is no metric one cannot compare how much higher or lower one item on the scale is than another - Examples: Movies; *, **, ***, **** Class rank: top 10, next 10, etc. Patient condition: resting and comfortable, stable, guarded, and critical Socioeconomic class: low, middle, high
Types of variables - 3 Interval variables: equal differences between numbers assigned to items reflect equal differences between the values being measured. - Allows additive comparison (e.g., x is three more than y) - But lacking a natural zero-point, does not permit multiplicative comparison (e.g., x is three times y) - Examples: Intelligence: IQ score Temperature: in degrees Celsius or Fahrenheit Personality: degree of extroversion
Types of variables - 4 Ratio variables: items are rated on a scale with equal intervals and a natural 0-point. - Allows for both additive and multiplicative comparison - Examples: Age: in years, months, days,... Temperature: in Kelvin Time: in milliseconds, seconds, years,... Velocity, acceleration, etc. - Interval and ratio data are often treated similarly and counted as score data
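The interval/ratio distinction can be checked with the temperature examples from the slides. This is a minimal sketch; the two temperatures are arbitrary values picked for illustration.

```python
# Interval vs. ratio scales, using temperature (values chosen for illustration).

def celsius_to_kelvin(celsius: float) -> float:
    """Convert Celsius (interval scale) to Kelvin (ratio scale, natural zero)."""
    return celsius + 273.15

warm_c, cool_c = 20.0, 10.0

# Additive comparison works on both scales: the difference is 10 degrees either way.
diff_c = warm_c - cool_c
diff_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

# Multiplicative comparison is only meaningful on the scale with a natural zero.
naive_ratio = warm_c / cool_c  # 2.0, but 20 C is not "twice as hot" as 10 C
true_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print(f"difference: {diff_c:.1f} C, {diff_k:.1f} K")
print(f"ratio in Celsius: {naive_ratio:.3f}, ratio in Kelvin: {true_ratio:.3f}")
```

The Celsius ratio changes if you switch to Fahrenheit, while the Kelvin ratio reflects the quantity itself; that is what "lacking a natural zero-point" costs you.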
Summary: Types of Variables
Type of variable                 Example
Categorical or nominal           college major
Ordinal or rank                  patient condition
Interval (score variable)        temperature in degrees Fahrenheit
Ratio (score variable)           age
Clicker question The variable number of clicker responses is A. A categorical or nominal variable B. An ordinal or rank variable C. An interval variable D. A ratio variable
Clicker question On the CAPE evaluations, you respond to questions such as Exams are representative of the course material (the variables being measured) using the following answer choices (values): 1 = strongly disagree 2 = disagree 3 = neither 4 = agree 5 = strongly agree What type of variable are these questions? A. A categorical or nominal variable B. An ordinal or rank variable C. An interval variable D. A ratio variable
Visual representations of data
Nominal & ordinal variables: Bar graphs & Pie Charts Example: Profile of pet ownership in San Diego County
Score variables: Histograms
- Histograms rather than bar graphs are used because score variables are continuous
- A histogram is built by creating bins and tabulating the number of items in each bin
- The size of bins can create radically different pictures of the distribution!
[Figure: the same data plotted with bin size 0.25 and with bin size 1]
Daily Life Activities
[Figure: three histograms of "Studying (online + offline)" with bin sizes of 1 hr, 0.5 hr, and 0.25 hr; x-axis: Hours (0-12), y-axis: No. of people (0-25)]
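The effect of bin size can be reproduced without any plotting. The sketch below uses hypothetical study-hour reports (not the class data) and a small helper, `histogram`, written just for this illustration.

```python
import math
from collections import Counter

def histogram(values, bin_size):
    """Count how many values fall in each bin [start, start + bin_size)."""
    counts = Counter(math.floor(v / bin_size) * bin_size for v in values)
    return dict(sorted(counts.items()))

# Hypothetical hours of study per day (illustrative only, not the class data).
hours = [0.5, 1.0, 1.25, 1.5, 2.0, 2.0, 2.25, 2.5, 3.0, 3.0, 3.5, 4.0, 4.75, 6.0]

for bin_size in (1.0, 0.5, 0.25):
    print(f"bin size {bin_size}: {histogram(hours, bin_size)}")
```

With 1-hour bins these numbers form a single hump; with 0.25-hour bins the same data break into many small, uneven bars.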
Normal and non-normal distributions
Normal distributions
- Have a single peak
- Scores equally distributed around the peak
- Fewer scores further from the peak
Non-normal distributions
- Skewed
- Bimodal
Daily Life Activities
[Figure: two histograms, N = 32 each, of "Studying (online + offline)" and "In class"; x-axis: Hours (0-12), y-axis: No. of people (0-25)]
Clicker question The distribution below is
Range:  <100  100-199  200-299  300-399  400-499  500-599  600-699  700-799  >800
Count:    63       45       35       37       82       35       39       53    53
A. Normal since it has one peak
B. Normal since scores are equally distributed around the peak
C. Not normal because there are not fewer scores further from the peak
D. Not normal because scores are not equally distributed around the peak
Describing distributions
Two principal measures:
1. Central tendency
[Figure: two comparable distributions differing in central tendency]
2. Variability (e.g., the standard deviation)
[Figure: two distributions with the same central tendency but differing in variability]
Three measures of central tendency
- Mean: the arithmetic average: the sum of all the scores divided by the number of instances
- Median: the score of which half are higher and half are lower
- Mode: the most frequent score
Consider this distribution of values: 2, 6, 9, 7, 9, 9, 10, 8, 6, 7
mean = 73 / 10 = 7.3
median = 7.5
mode = 9
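The three values for this distribution can be checked with Python's standard `statistics` module; this is just a verification sketch of the slide's arithmetic.

```python
import statistics

scores = [2, 6, 9, 7, 9, 9, 10, 8, 6, 7]

mean = statistics.mean(scores)      # sum of scores / number of scores
median = statistics.median(scores)  # middle of the sorted scores (average of the two middle ones here)
mode = statistics.mode(scores)      # most frequent score

print(f"mean = {mean}, median = {median}, mode = {mode}")
```

Because there are ten scores, the median is the average of the fifth and sixth sorted values (7 and 8).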
Which measure to use? If the distribution is normal, all three measures of central tendency give the same result - The mean is the easiest to calculate and the most frequently reported If there are extreme outliers in one direction, the mean may be distorted - Exam scores: 21, 72, 76, 79, 82, 84, 87, 88, 90, 91, 95 Mean: 78.6 Median: 84 - In such a case, the median gives a better picture of the central tendency of the class
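A quick way to see the distortion is to compute both measures with and without the outlier. The sketch below uses the exam scores from the slide; dropping the 21 is only for comparison, not a recommendation to discard data.

```python
import statistics

exam_scores = [21, 72, 76, 79, 82, 84, 87, 88, 90, 91, 95]
without_outlier = [s for s in exam_scores if s != 21]

mean_all = statistics.mean(exam_scores)
median_all = statistics.median(exam_scores)
mean_trimmed = statistics.mean(without_outlier)
median_trimmed = statistics.median(without_outlier)

print(f"with outlier:    mean = {mean_all:.1f}, median = {median_all}")
print(f"without outlier: mean = {mean_trimmed:.1f}, median = {median_trimmed}")
```

One low score moves the mean by almost six points but the median by only a point and a half, which is why the median is the better summary here.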
Measures of variability
Variability concerns: How much do the scores vary?
- Range: the lowest value to the highest value
[Figure: two histograms; x-axis 0-10, y-axis 0-40]
Measures of variability
- Variance: $\frac{\sum (X - \text{mean})^2}{N}$
- Standard deviation: $\sqrt{\text{Variance}}$
[Figure: two histograms of scores 1-10, both with Mean = 5.0; one with SD = 0, the other with SD = 1.04]
Measures of variability
Intuitive interpretation of the standard deviation (for a normal distribution):
- 1 SD: about 68% of the scores fall within 1 SD of the mean
- 2 SD: about 95% of the scores fall within 2 SD of the mean
- 3 SD: about 99% of the scores (more precisely 99.7%) fall within 3 SD of the mean
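These percentages are properties of the normal curve and can be checked with `statistics.NormalDist` from the Python standard library. The helper name `share_within` is made up for this sketch.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k: float) -> float:
    """Proportion of a normal distribution lying within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(k):.1%}")
```

The same proportions hold for any mean and SD, as long as the scores really are normally distributed.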
Variance
Consider a distribution with Mean = 6:
X:              4   5   5   6   6   6   7   7   8
X - mean:      -2  -1  -1   0   0   0   1   1   2
(X - mean)^2:   4   1   1   0   0   0   1   1   4
Variance = $\frac{\sum (X - \text{mean})^2}{N} = \frac{12}{9} = 1.33$
SD = $\sqrt{\text{variance}} = \sqrt{1.33} = 1.15$
Range of 1 SD = 6 ± 1.15 = 4.85 to 7.15
Range of 2 SD = 6 ± 2.30 = 3.70 to 8.30
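The same calculation, step by step in Python. Note that this follows the slide in dividing by N; many statistics packages divide by N - 1 by default when the data are treated as a sample, so the numbers can differ slightly.

```python
import math

scores = [4, 5, 5, 6, 6, 6, 7, 7, 8]

mean = sum(scores) / len(scores)
deviations = [x - mean for x in scores]        # X - mean
squared = [d ** 2 for d in deviations]         # (X - mean)^2
variance = sum(squared) / len(scores)          # divide by N, as on the slide
sd = math.sqrt(variance)

print(f"mean = {mean}, sum of squares = {sum(squared)}")
print(f"variance = {variance:.2f}, SD = {sd:.2f}")
print(f"range of 1 SD: {mean - sd:.2f} to {mean + sd:.2f}")
print(f"range of 2 SD: {mean - 2 * sd:.2f} to {mean + 2 * sd:.2f}")
```

Carrying full precision gives 2-SD limits of about 3.69 and 8.31; the slide's 3.70 and 8.30 come from rounding the SD to 1.15 first.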
Range and Standard Deviation
[Figure: a distribution marked with its full range, the band containing 68% of scores, and the band containing 95% of scores]
Clicker question On an exam on which scores were distributed normally and the mean was 86 and the SD was 4, A. 68% of the scores were between 78 and 94 B. 68% of the scores were between 82 and 90 C. 95% of the scores were between 78 and 94 D. 95% of the scores were between 82 and 90 E. None of the above
Populations
The phenomena about which we seek to draw conclusions in a study are known as the population.
Sometimes one can study each member of the population of interest
But if the population is large:
- it may be impossible to study the whole population
- there may be no need to study the whole population
Samples A sample is a subset of the population chosen for study. From studying the distribution of a variable in a sample, one makes an estimate of the distribution in the actual population Sometimes the estimate from a sample may be more accurate than trying to study the population itself - U.S. Census
Is the sample biased?
If information about the sample is to be informative about the actual population, the sample must be representative
- Randomization: attempt to ensure that the sample is representative by avoiding bias in selecting the sample
Risk: inadvertently developing an unrepresentative sample
- E.g., using telephone numbers in the phonebook to sample the electorate
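The phonebook problem can be illustrated with a toy simulation. Everything in it is hypothetical: the electorate of 1,000 voters, the assumption that listed voters support a candidate at 70% and unlisted voters at 40%, and the fixed random seed used to make the run repeatable.

```python
import random

rng = random.Random(12)  # fixed seed so the sketch is repeatable

# Hypothetical electorate: each voter is (listed_in_phonebook, supports_candidate).
population = (
    [(True, True)] * 280 + [(True, False)] * 120      # listed voters: 70% support
    + [(False, True)] * 240 + [(False, False)] * 360  # unlisted voters: 40% support
)

def support_share(voters):
    """Fraction of the given voters who support the candidate."""
    return sum(supports for _, supports in voters) / len(voters)

true_share = support_share(population)

# Biased selection: unlisted voters can never be chosen.
listed_only = [voter for voter in population if voter[0]]
phonebook_share = support_share(rng.sample(listed_only, 200))

# Randomized selection: every voter is equally likely to be chosen.
random_share = support_share(rng.sample(population, 200))

print(f"true share: {true_share:.2f}")
print(f"phonebook sample: {phonebook_share:.2f}, random sample: {random_share:.2f}")
```

The phonebook sample is just as large as the random one, but it estimates the wrong group; a bigger biased sample does not fix that.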
Does the sample reflect the population?
Does the mean of the sample reflect the mean of the actual population?
- Sampling distribution simulation
- Very unlikely that the mean of the sample will exactly equal the mean of the population
- Key question: how much does the mean of the sample vary from the mean of the actual population?
Given the mean of a sample, what is the range within which the mean of the actual population lies?
- To determine this, the standard deviation measure is very useful
Standard deviation and mean
- In 68% of samples, the mean of the population will fall within 1 standard deviation of the mean of the sample
- In 95% of samples, the mean of the population will fall within 2 standard deviations of the mean of the sample
[Figure: distribution centered on the sample mean]
What happens as sample size gets larger?
As sample size grows, the SD of the sample mean (its standard error) shrinks
So with larger samples, the range of 2 standard deviations shrinks
Assume sample mean is 50:
Sample size   Range of 2 SD (95% confidence interval)   Range of 3 SD (99% confidence interval)
10            34.5-65.5                                 29.5-70.5
20            39-61                                     35.6-64.4
50            43-57                                     40.9-59.1
100           45-55                                     43.5-56.5
500           47.8-52.2                                 47.1-52.9
1000          48.4-51.6                                 48-52
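The shrinking range can be seen in a small sampling simulation. The population below is hypothetical: 20,000 normally distributed values with mean 50 and SD 25. The SD of 25 is an assumption, chosen because it gives a 2-SD range of about ±5 at a sample size of 100, roughly in line with the table; the seed and number of trials are arbitrary.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population (mean 50, SD 25 are assumed values for illustration).
population = [rng.gauss(50, 25) for _ in range(20_000)]

def spread_of_sample_means(sample_size: int, trials: int = 1_000) -> float:
    """Draw many random samples and return the SD of their means."""
    means = [statistics.fmean(rng.sample(population, sample_size))
             for _ in range(trials)]
    return statistics.pstdev(means)

spreads = {n: spread_of_sample_means(n) for n in (10, 100, 1000)}

for n, spread in spreads.items():
    print(f"n = {n:>4}: SD of sample means = {spread:.2f}, "
          f"2-SD range around 50 = {50 - 2 * spread:.1f} to {50 + 2 * spread:.1f}")
```

The spread of the sample means falls by roughly the square root of the increase in sample size, so going from 10 to 1,000 observations narrows the range about tenfold.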
Example of estimating population mean from sample mean
Example: age of people eating at the Food Court
- Draw a sample to make an inference about the average age of people eating at the Food Court
Age:         <17   17   18   19   20   21   22   23   24   25  >25
Population:    6   18   23   34   32   18   26   29   14   10   10
Sample counts: 2 1 3 1 2 1
Estimating real distribution
Age:         <17   17   18   19   20   21   22   23   24   25  >25
Population:    6   18   23   34   32   18   26   29   14   10   10
Sample 1 (n = 10) counts: 2 1 3 1 2 1
Sample 2 (n = 20) counts: 1 2 4 6 3 2 2
Mean of the actual population: 20.63
                      Sample 1     Sample 2
Mean of the sample:   19.4         20.1
SD of the sample:     1.9          1.6
Range of 1 SD:        17.5-22.3    18.5-21.7
Range of 2 SD:        15.9-24.2    16.9-23.3
Want to predict more accurately? Use a larger sample size
Review
Four types of variables:
- Nominal, ordinal, interval, ratio
Values of variables are distributed
- Important goal: characterizing the distribution
Graphs
- Bar graphs for nominal and ordinal variables
- Histograms for score variables
Normal versus non-normal distributions
- Skewed, bimodal, etc.
Review
Two principal measures of distributions
- Central tendency: mean, median, mode
- Variability: range, variance, SD
- 1 SD includes approx. 68% of scores
- 2 SD includes approx. 95% of scores
- 3 SD includes approx. 99% of scores
Review
Population and samples
- From studying the distribution in a sample, estimate the distribution in the actual population
- Mean of actual population will
  - fall within one SD of mean of sample for 68% of samples
  - fall within two SD of mean of sample for 95% of samples
  - fall within three SD of mean of sample for 99% of samples
- Larger sample yields smaller SD and hence more precise estimate
- Hence, to improve the precision of an estimate, use a larger sample