CSI 23 LECTURE NOTES (Ojakian)

Topic 1: Overview and Fundamental Background
(References: 1.1, 1.2, 1.3, 3.1)

OUTLINE
1. Introduction to Statistics and Excel
2. Fundamental Terminology
3. Mean, Median, Mode and other Data Summaries
4. Random Samples
5. Topics with brief introduction: Probability, Estimation, Correlation

1. Simple Introduction to Statistics
   (a) Goal: Understand some characteristic of a population.
       i. Example: Consider the population = all NYC residents. Want to understand issues related to voting in the election for president in November 2016. (By the way, 78.59% voted for Clinton and 18.4% voted for Trump. Source: https://www.dnainfo.com/new-york/numbers/clinton-trump-president-vice-presidentevery-neighborhood-map-election-results-voting-general-primary-nyc)
       ii. Example: Consider the population = all NYC residents. Want to understand how old we are.
       iii. Example: Consider the population = all the S.U.V.s in the USA. Want to understand how safe they are.
   (b) Approach of Statistics: Focus on some parameters of the population and use a sample.
       i. Example: For the population = all NYC residents, to understand how old we are:
          A. Focus on some parameters such as: average age, percent of the population that is older than 65, etc.
          B. Select a sample to study.

2. Some Mathematical Terminology
   (a) Set:
   (b) List:
   (c) Integers:
   (d) Real Numbers:
   (e) Function:
   (f) Applying a function to every element of a list to get a new list:
   (g) EXERCISES
       PROBLEM 1. Suppose F(x) is the function which maps an integer x to 0 if it is odd and 1 if it is even. Evaluate F(100003) and F(77777774). Apply F to the list (3, 4, 4, 1, 3).
       *PROBLEM* 2. Suppose G(x) is the function which maps any real number x to the nearest integer (rounding up if x is exactly halfway between two integers). Evaluate G(23.78) and G(100.12). Apply G to the list (0.77, 4.1, 50, 7.5).

3. Some Statistics Terminology
   (a) Population:
   (b) Individuals:
   (c) Variable (in the Statistics Sense!): Function from the population to the real numbers.
   (d)
       i. Example: Population = NYC residents.
          A. One variable is the function that maps a person to that person's height.
          B. Another variable is the function that maps a person to 0 if the person voted for Trump, 1 if the person voted for Clinton, and 2 if the person did something else.
       ii. Remark: It is a reduction of information.
       PROBLEM 3. Textbook (1.1) - 8 (6 in 5th Edition)
       *PROBLEM* 4. On a small sheet of paper, write down an example of one population and two different variables.
       *PROBLEM* 5. Write a variable or two that could apply to our class (such as "height"). Write down one you think would be interesting to understand; I will create a survey based on your responses!
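The idea that a variable is a function, applied to every individual to get a list of numbers, can be sketched in a few lines of Python. The tiny population and the names `population`, `height_cm`, and `is_tall` are hypothetical, made up for illustration; they are not from the notes or the textbook.

```python
# Hypothetical mini-population: each individual is a (name, height in cm) pair.
population = [("Ana", 162.0), ("Ben", 180.5), ("Cal", 171.2), ("Dee", 158.9)]

def height_cm(person):
    """A variable: maps an individual to a real number (the height)."""
    return person[1]

def is_tall(person):
    """Another variable: 1 if the person is at least 170 cm tall, else 0."""
    return 1 if person[1] >= 170 else 0

# Applying a function to every element of a list gives a new list.
heights = [height_cm(p) for p in population]
tall_flags = [is_tall(p) for p in population]

print(heights)     # [162.0, 180.5, 171.2, 158.9]
print(tall_flags)  # [0, 1, 1, 0]
```

Each variable keeps only one number per person and discards everything else (here, the names), which is one way to read the remark that a variable is a reduction of information.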
4. Common Data Summaries: Mean, Median, Mode
   (a) Mean
   (b) Median
   (c) Mode
   (d) Remark: Sometimes "average" refers to the mean, and sometimes it refers to any of the mean, median, or mode.
   (e) EXERCISES
       PROBLEM 6. Consider the data: 1, 4, 0, -2, 1.
         i. What is the mean? What is the median?
         ii. If the largest number is increased, how does this affect the mean and median?
         iii. If the smallest number is increased so that it is now the largest number, how does this affect the mean and median?
       PROBLEM 7. Textbook (3.1): 13 (5 in 5th edition)
       *PROBLEM* 8. The net worth of someone is the amount of money that person would have if they sold everything they have and then subtracted their debt. Based on the Net Worth handout, answer the following questions (round numbers to the nearest 1000):
         i. Suppose there is a bar with two typical 40 year olds and three typical 50 year olds. What is the mean net worth and what is the median net worth in the bar?
         ii. Suppose there is a bar with 99 typical 40 year olds. What is the mean net worth and what is the median net worth in the bar? Now pick your favorite super rich man from the top eight; suppose he walks into the bar. Now what is the mean net worth and what is the median net worth in the bar?

5. Excel Introduction
   (a) Putting something in a box: Text, Number, or Function
   (b) Some functions:
       i. For mean use: average
       ii. For median use: median
       iii. For mode use: mode
       iv. For summing use: sum
   (c) Different worksheets.
   (d) Please!... Organize your work clearly.
   (e) EXERCISES
       PROBLEM 9. Go to Cengage Data (at webpage), download Heights of Pro Basketball Players from the n 30 data. Find the median. Find the mean in two ways: 1) using the average function and 2) using the sum function, but not the average function.
       *PROBLEM* 10. Go to Cengage Data (at webpage), download a data set of your choice from the n 30 data. Find the median. Find the mean in two ways: 1) using the average function and 2) using the sum function, but not the average function.
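For readers who like to check a spreadsheet against code, the Excel functions listed above have close analogues in Python's standard `statistics` module. The sketch below uses a made-up data list (not one of the course data sets) and computes the mean in two ways, then shows how one extreme value moves the mean far more than the median.

```python
import statistics

# Hypothetical data, chosen only for illustration.
data = [3, 7, 7, 2, 9]

mean_builtin = statistics.mean(data)   # analogue of Excel's average
mean_by_sum = sum(data) / len(data)    # mean from a sum, without a mean function
median_value = statistics.median(data) # analogue of Excel's median
mode_value = statistics.mode(data)     # analogue of Excel's mode

print(mean_builtin, mean_by_sum)  # 5.6 5.6
print(median_value)               # 7
print(mode_value)                 # 7

# Add one very large value: the mean jumps, the median barely reacts.
with_outlier = data + [1000]
print(statistics.mean(with_outlier))    # about 171.33
print(statistics.median(with_outlier))  # 7.0
```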
6. Data Summaries in General
   (a) Functions whose inputs are lists (Examples: mean, median, mode)
   (b) Other Data Summaries
       i. Maximum and Minimum
       ii. Range
       iii. Percentages
   (c) EXERCISES
       PROBLEM 11. Consider the data summary X that maps a list of integers to the percent of negative numbers in the list. Evaluate X(4, 0, 4, 1, 4) and X(1, 2, 3, 4, 5).
       *PROBLEM* 12. On a small sheet of paper, write down an example of another data summary (remember: its input should be a list of numbers and its output should be a single number).

7. Fundamental Idea of Statistics: To understand a population, understand a sample of the population.
   (a) Population versus Sample
       Example: All NYC residents versus this class.
   (b) Population Parameter versus Sample Statistic
       *PROBLEM* 13. I have 35 sheets of paper (each numbered 1-10). To guess the population mean, population median, and percent of 1's, choose a sample of size 5 and find the sample mean, sample median, and sample percentage. (Do with 3 different volunteers and save info on the board.)
       *PROBLEM* 14. Suppose we want to know 1) the average age of a CUNY student and 2) the percent of students 25 years and older. Let's use our class to guess.
         i. What is the population?
         ii. What is the variable?
         iii. What are the population parameters?
         iv. What is the sample?
         v. What are the sample statistics?
         vi. Calculate the sample mean and sample percent (need class data!).
         vii. How well do you think our sample statistics approximate the population parameters? What factors support accepting our approximations and what factors support rejecting our approximations?
   (c) Terminology
       i. Population mean: µ (pronounced "mew")
       ii. Sample mean: x̄ (pronounced "x bar")
       PROBLEM 15. Use the names µ and x̄ on the previous problems.
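The difference between a population parameter and a sample statistic can be sketched with a small simulation. Everything below is an assumption made for illustration: the 35 numbers are generated by the computer (they are not the actual sheets used in class), and `percent_equal_to` is a hypothetical data summary written for this example.

```python
import random
import statistics

def percent_equal_to(values, target):
    """Data summary: percent of entries in `values` equal to `target`."""
    return 100 * sum(1 for v in values if v == target) / len(values)

rng = random.Random(23)  # fixed seed so the run is repeatable

# Hypothetical population: 35 sheets, each showing a number from 1 to 10.
population = [rng.randint(1, 10) for _ in range(35)]

# Population parameters (in practice these are unknown to us).
mu = statistics.mean(population)
population_median = statistics.median(population)
population_pct_ones = percent_equal_to(population, 1)

# Sample statistics from one random sample of size 5.
sample = rng.sample(population, 5)
x_bar = statistics.mean(sample)
sample_median = statistics.median(sample)
sample_pct_ones = percent_equal_to(sample, 1)

print("population:", mu, population_median, population_pct_ones)
print("sample:    ", x_bar, sample_median, sample_pct_ones)
```

Changing the seed (or removing it) plays the role of asking a different volunteer: the population parameters stay fixed, while the sample statistics change from sample to sample.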
8. Probability (Details: ch. 5, 6)
   (a) Probability of an event: A measure of how likely it is, using a number between 0 and 1.
   (b) Examples: Coins, Dice, Polls. Probability that Clinton would win the 2016 presidential election: 70% or 99% depending on who you asked...
   (c) Typical assumption: Equally Likely Outcomes

       $$\text{Probability} = \frac{\text{Favorable}}{\text{Total}}$$

   (d) EXERCISES
       PROBLEM 16. Suppose you roll a 6-sided die (with numbers 1 through 6).
         i. What is the probability of rolling a 2?
         ii. What is the probability of rolling a number larger than 2?
       *PROBLEM* 17. Suppose you choose one card from a standard deck of playing cards: 52 cards in total, with 4 suits: 13 red hearts, 13 red diamonds, 13 black clubs, and 13 black spades; in each suit there are cards: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, Jack, Queen, King.
         i. What is the probability of picking the Queen of Spades?
         ii. What is the probability of picking a 1 (i.e. an Ace)?
         iii. What is the probability of picking a diamond?
         iv. What is the probability of picking a red Jack?

9. Random Samples
   (a) Random Sample:
   (b) Random sample using Excel: randbetween
   (c) EXERCISES
       PROBLEM 18. Textbook (1.2) - 9 (5 in 5th Edition)
       *PROBLEM* 19. Which of the following ways of getting a sample from a population are random? If not completely random, how close to random does it seem and how could you correct the sample to make it random?
         i. Population = All US residents. Sample = Randomly call 100 people.
         ii. Population = All US residents that own a phone. Sample = Randomly call 100 people.
         iii. Population = All subway riders. Sample = Randomly select 100 people entering the Burnside Avenue Subway.
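The "favorable over total" rule and the idea of drawing a random sample can both be sketched in Python. This is only an illustration under stated assumptions: the helper `probability` is hypothetical, the event (rolling an even number) is chosen so as not to overlap with the exercises, and the population of 500 numbered individuals is made up. `random.randint(a, b)` returns a random integer from a to b inclusive, much like Excel's randbetween.

```python
import random
from fractions import Fraction

def probability(outcomes, is_favorable):
    """Equally likely outcomes: (number favorable) / (total number)."""
    favorable = [o for o in outcomes if is_favorable(o)]
    return Fraction(len(favorable), len(outcomes))

# Rolling an even number on a 6-sided die: 3 favorable outcomes out of 6.
die = [1, 2, 3, 4, 5, 6]
p_even = probability(die, lambda face: face % 2 == 0)
print(p_even)  # 1/2

rng = random.Random(5)  # fixed seed so the run is repeatable

# One random individual out of a hypothetical population numbered 1 to 500.
one_id = rng.randint(1, 500)

# A sample of 10 distinct individuals (no one is chosen twice).
chosen = rng.sample(range(1, 501), 10)
print(one_id, chosen)
```

Calling `randint` ten times could pick the same individual more than once; `sample` avoids repeats, which is usually what is wanted when surveying people.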
10. Estimation (Details: ch. 8)
   (a) Confidence interval
       i. Have some population and an unknown population parameter Q.
       ii. Choose a "confidence level": a percent, P%, between 0% and 100% (i.e. a probability measure between 0 and 1).
       iii. From a random sample obtain a P% confidence interval (a, b) for Q.
       iv. The probability that you pick an interval (a, b) that contains Q is P%.
       v. Subtlety: The parameter Q is either in or not in (a, b). Having a confidence interval with confidence level P% means that the process that yields (a, b) has a P% chance of producing an interval containing Q.
   (b) Example: Suppose the newspaper tells you that (22.1, 25.8) is a 90% confidence interval for the average age of a college student. This means that you can be 90% confident that the average age of a college student is between 22.1 and 25.8. More subtlety: The process that yielded (22.1, 25.8) had a 90% chance of yielding an interval that contains the actual average age of a college student.
   (c) Example
       i. Population = the earlier papers with numbers 1-10. Parameter is µ.
       ii. Take confidence level 95%.
       iii. Take a random sample. For now, use Excel to obtain the confidence interval.
   (d) Using Excel to get a confidence interval.
       i. Data → Data Analysis → Descriptive Statistics → Confidence Level for Mean
       ii. Add and subtract the Confidence Level from the Sample Mean to obtain the Confidence Interval.
       PROBLEM 20. Find a confidence interval for the various samples in the earlier attempt to guess the average of the numbers on the papers.
       PROBLEM 21. Using our earlier work, find a confidence interval for the average age of a CUNY student using our class as the sample.

11. Correlation (Details: ch. 4)
       PROBLEM 22. Textbook: 4.1-7
   (a) Important principle: Correlation does not imply Causation.
   (b) Lurking variable:
   (c) EXERCISE
       *PROBLEM* 23. Textbook: 4.1-8
   (d) Moral: Be careful about declaring a cause for some phenomenon!
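The subtlety in the confidence interval discussion (it is the process, not one particular interval, that has a P% chance of capturing the parameter) can be seen in a simulation. The sketch below rests on several assumptions that are not part of the notes: the population of 10,000 numbers from 1 to 10 is generated by the computer, the sample size is 30, and the interval is built with a normal approximation (sample mean plus or minus z times s divided by the square root of n), so its margin may differ from the number Excel reports, especially for small samples.

```python
import math
import random
import statistics

def normal_confidence_interval(sample, level=0.95):
    """Approximate interval for the mean: x_bar +/- z * s / sqrt(n)."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    margin = z * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin

rng = random.Random(2016)  # fixed seed so the run is repeatable

# Hypothetical population; in this simulation we are allowed to know mu.
population = [rng.randint(1, 10) for _ in range(10_000)]
mu = statistics.mean(population)

# Repeat the process many times and count how often the interval contains mu.
trials = 1000
hits = 0
for _ in range(trials):
    sample = rng.sample(population, 30)
    low, high = normal_confidence_interval(sample, level=0.95)
    if low <= mu <= high:
        hits += 1

coverage = hits / trials
print("fraction of intervals containing mu:", coverage)
```

Any single interval either contains µ or it does not; the fraction printed at the end describes how often the process succeeds, and it should come out near the chosen confidence level.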