I.4 Sampling Lecture Notes

1. Statistical Thinking

Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.
– H. G. Wells, author of “War of the Worlds”

Definition: Statistics is the science of collecting, analyzing, and interpreting data in such a way that the conclusions can be objectively evaluated.

2. Three Phases of Statistics
• Collect the data
• Analyze the data
  – order the data
  – graphical displays
  – numerical calculations (such as mean and standard deviation)
• Interpret the results
  – use proper statistical techniques to substantiate or refute hypothesized statements
  – match data to the appropriate technique
  – determine whether the proper assumptions are satisfied

3. Two types of statistics
• Descriptive statistics – summarize and describe a characteristic for some group
• Inferential statistics – estimate, infer, predict, or conclude something about a larger group

4. Examples
• Descriptive
  – Batting Average
  – Yards Per Carry
  – Test Scores
• Inferential
  – Polls
  – Medical Studies
  – Market Surveys

5. Two types of data
• Quantitative data – values recorded on a natural numerical scale
• Qualitative data – values classified into categories

6. Quantitative Data
• Weight of subjects in medical sample
• Height of buildings in Chicago
• Daily temperatures at Antarctica Weather Station

7. Qualitative Data
• Gender of subjects in medical sample
• Political affiliation of respondents in a poll survey
• Class (fresh, soph, jr, sr) of Math 101 students

8. Vocabulary
• The population is the entire set of objects (people or things) under consideration.
• A sample is a subset of the population that is available for the analysis.
• A bias is a favoring of certain outcomes over others.
• A census collects data from each member of the population.
• A statistic is a statement of numerical information about a sample.
• A parameter is a statement of numerical information about a population.

9.
Census versus Sample

Would you use a census or a sample for each of the following?
• Project the winner of an election
• Calculate a baseball player’s batting average
• Predict whether it will rain tomorrow
• Test whether the soup is too salty
• Calculate Shaq’s free throw average
• Use a market study to determine a new flavor of toothpaste
• Report the Dow Jones Average
• Generalize a medical study to other groups
• Calculate the average score on the first test

10. Dealing with bias
Bias in some form occurs in the collecting of most, if not all, sets of data. The bias may come from
• the portion of the population surveyed
• the phrasing of the questions

11. Examples
• “Dewey defeats Truman” projection of Chicago Tribune based on 1948 telephone poll
• “Are you in favor of Illinois banning cell phones in cars? Dial *91 on your cellular phone to vote.”
• “Do you feel budget cuts are more important than humanitarian programs that would need to be cut to obtain a balanced budget?”

12. Methods for Choosing Samples
• Judgement Sample
  – Use the opinion of person(s) deemed qualified to choose members of the sample.
  – Example: to investigate study habits of athletes, ask their coaches and teachers.
• Simple Random Selection
  – Use random numbers to select the sample.
  – Page 315 Random Digit Table: 72985547555515086461
• Stratified Sampling
  – Divide the population into relatively homogeneous groups, draw a sample from each group, and take their union.

13. Goals of a good sample
• from the correct population
• chosen in an unbiased way
• large enough to reflect the total population

14. Normal Distribution of Random Events
Toss a coin 100 times and count the number of heads. How many heads would you expect?
• about 50
• exactly 50
It does not seem reasonable that the count will be exactly 50. We would not be surprised if the number of heads turned out to be 48 or 51 or even 55. We would be surprised to see 80 heads, and would begin to suspect that the coin was not fair.

15.
Coin Toss Data

Experiment: A coin is tossed n = 100 times. The experiment is repeated 1000 times. Here are the results:

16. Frequency Table: No. of Heads

Heads  Freq    Heads  Freq    Heads  Freq
    1     0       45    54       58    27
    ⋮     0       46    49       59    19
   34     0       47    54       60    11
   35     2       48    66       61    11
   36     2       49    89       62     5
   37     2       50    70       63     4
   38     2       51    77       64     2
   39     5       52    85       65     0
   40    14       53    62       66     0
   41    16       54    57       67     1
   42    25       55    52       68     0
   43    30       56    40        ⋮     0
   44    31       57    36      100     0

mean = 50.296
standard deviation = 5.100

17. Coin Toss Histogram
[Histogram of the number of heads; horizontal axis runs from 30 to 70]

18. Sampling Distributions
If we could examine all possible samples of size n of a population, then the frequency distribution of the means of these samples is approximately normally distributed, and the approximation improves as n grows.
• $\mu$ = the mean over the entire population
• $\sigma$ = the standard deviation over the entire population
• $\bar{x}$ = the mean of the sampling distribution
• $\sigma_{\bar{x}}$ = the standard deviation of the sampling distribution

19. Two Rules
Rule 1. $\bar{x} = \mu$
Rule 2. $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$
We are assuming in Rule 2 that the size of the entire population is much larger than the sample size n.

20. Two Outcome Situations
Situation: Two outcomes (for–against; heads–tails; yes–no)
p = percent in favor
q = percent opposed
Written as decimals, $p + q = 1$. Why?

21. Example
• 29% of Americans favor Bush’s handling of the War in Iraq, while 71% do not.
• p = .29
• q = .71
• p + q = .29 + .71 = 1

22. Quantifying the Data
• We count a for (or yes) vote as $X_1 = 1$
• and an against (or no) vote as $X_2 = 0$
• Out of 100 people, we would expect 100p yes votes
• and 100q no votes

23. To calculate the mean
Outcome (out of 100 cases):

Vote              Frequency    Freq × $X_i$
$X_1 = 1$ (yes)   100p         100p
$X_2 = 0$ (no)    100q         0
Total                          100p

So the mean is $\mu = \dfrac{100p}{100} = p$.

24. Standard Deviation
Out of 100 cases,

Vote         Freq    $(X_i - \mu)^2$    Freq × $(X_i - \mu)^2$
$X_1 = 1$    100p    $(1 - p)^2$        $100p(1 - p)^2$
$X_2 = 0$    100q    $(0 - p)^2$        $100q(0 - p)^2$
Total                                   $100p(1 - p)^2 + 100q(0 - p)^2$

25.
Calculating standard deviation

First divide the Total by n = 100 cases:

$$
\begin{aligned}
\frac{\text{Total}}{100} &= p(1 - p)^2 + q(0 - p)^2 \\
&= p(1 - p)^2 + qp^2 \\
&= pq^2 + qp^2 && [1 - p = q] \\
&= pq(q + p) \\
&= pq && [\text{because } p + q = 1]
\end{aligned}
$$

Then to get $\sigma$, take the square root:

$$\sigma = \sqrt{pq}$$

26. The p–q Rule
Suppose a coin has probability p of landing heads and q = 1 − p of landing tails. (A value other than $p = \frac{1}{2}$ means the coin is not “fair.”) The variable X that scores a head as X = 1 and a tail as X = 0 has
mean $\mu = p$ and
standard deviation $\sigma = \sqrt{pq}$

27. Bush Popularity Example
29% think Bush is doing a good job
71% do not
p = .29 and q = .71
$\mu = p = .29$
$\sigma = \sqrt{pq} = \sqrt{(.29)(.71)} = .4538$

28. Fair Coin Toss
Heads = 1, Tails = 0
With a fair coin, we expect the percentage of heads to be 50%:
p = .5 and q = .5
$\mu = p = .5$
$\sigma = \sqrt{pq} = \sqrt{(.5)(.5)} = \sqrt{.25} = .5$

29. Percents versus Actual Numbers
Sometimes our calculations are in terms of percents and sometimes they are given as actual numbers.
For example, suppose we flip a coin 340 times. We would expect to have roughly 170 heads (and 170 tails).
We expect the percentage of heads to be $\frac{170}{340} = \frac{1}{2}$, or 50%.
p = 0.5 is the number used in our formulas (along with q = .5).
To convert from the percentage to the actual number of expected heads, simply multiply p by n.
In this case, we expect $\frac{1}{2} \times 340 = 170$ heads.

30. Percents versus Actual Numbers Cont’d
The p–q formula computes the standard deviation $\sigma$ for the population when we are thinking in terms of percent.
The formula $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$ computes the standard error of the mean when we are thinking in terms of percent.
To convert to actual numbers, multiply $\sigma_{\bar{x}}$ by n. By properties of the square root function,

$$\frac{\sigma}{\sqrt{n}} \cdot n = \sigma \cdot \sqrt{n}$$

31. Percents versus Actual Numbers
Flip a coin 340 times and count the number of heads.
Mean and Standard Deviation for the Entire Population
$\mu = \frac{1}{2} = 0.5$
$\sigma = \sqrt{(.5 \times .5)} = 0.5$

Mean and Standard Deviation for Sample Size of n = 340 tosses
In terms of percents:
$\bar{x} = \mu = 0.5$
$\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{340}} = .027$
In terms of actual numbers, multiply by n = 340:
mean = 0.5 × 340 = 170
standard deviation = $\sigma_{\bar{x}} \times 340 = 0.5 \times \sqrt{340} \approx 9.22$ (multiplying the rounded value .027 by 340 gives 9.18)

32. Interpretation
Since the sampling distribution is approximately normal with mean 170 and standard deviation of about 9.2, the 68–95–99.7 rule tells us:
If you flip a fair coin 340 times, you would expect the number of heads to be
• between 161 and 179 about 68% of the time [1 standard deviation]
• between 152 and 188 about 95% of the time [2 standard deviations]
• between 142 and 198 about 99.7% of the time [3 standard deviations]

33. Coin–Toss Model
• Suppose a coin has probability p of landing heads and q = 1 − p of landing tails.
• Suppose we flip the coin n times and record x, the number of heads for each sample.
• The values of x will be approximately normally distributed, with mean and standard deviation given as follows:

              Population               Sample Distribution          Sample Distribution
                                       (Percents)                   (Actual Numbers)
Mean          $p$                      $p$                          $p \cdot n$
Stan. Dev.    $\sigma = \sqrt{pq}$     $\dfrac{\sigma}{\sqrt{n}}$   $\sigma \cdot \sqrt{n}$

34. Comparison with Previous Experiment
Toss a coin n = 100 times

              Actual Value    Predicted Value
Mean          50.296          50
Stan. Dev.    5.100           5
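
The comparison above can be reproduced with a short script. The following is only a sketch: the function names (pq_rule, coin_toss_model, table_mean_sd) are hypothetical and chosen for illustration, and the frequency data are copied from the table on slide 16 with the zero-frequency rows left out. It applies the p–q rule and the percent-to-count conversions from the Coin–Toss Model, then recomputes the mean and standard deviation of the observed frequency table.

```python
import math


def pq_rule(p):
    """Mean and standard deviation of one 0/1 outcome with P(X = 1) = p."""
    q = 1.0 - p
    return p, math.sqrt(p * q)


def coin_toss_model(p, n):
    """Predicted mean and standard deviation for n tosses.

    Returns the values both as percents (written as decimals)
    and as actual numbers of heads.
    """
    mu, sigma = pq_rule(p)
    return {
        "percent_mean": mu,                  # Rule 1
        "percent_sd": sigma / math.sqrt(n),  # Rule 2
        "count_mean": mu * n,                # multiply the percent mean by n
        "count_sd": sigma * math.sqrt(n),    # (sigma / sqrt(n)) * n
    }


# Slide 16 frequency table: number of heads -> number of experiments.
# Rows with frequency 0 are omitted; they do not affect the results.
FREQ = {
    35: 2, 36: 2, 37: 2, 38: 2, 39: 5, 40: 14, 41: 16, 42: 25, 43: 30, 44: 31,
    45: 54, 46: 49, 47: 54, 48: 66, 49: 89, 50: 70, 51: 77, 52: 85, 53: 62,
    54: 57, 55: 52, 56: 40, 57: 36, 58: 27, 59: 19, 60: 11, 61: 11, 62: 5,
    63: 4, 64: 2, 67: 1,
}


def table_mean_sd(freq):
    """Mean and standard deviation of a frequency table (dividing by N)."""
    total = sum(freq.values())
    mean = sum(x * f for x, f in freq.items()) / total
    variance = sum(f * (x - mean) ** 2 for x, f in freq.items()) / total
    return mean, math.sqrt(variance)


if __name__ == "__main__":
    predicted = coin_toss_model(0.5, 100)
    actual_mean, actual_sd = table_mean_sd(FREQ)
    print("predicted mean, sd:", predicted["count_mean"], predicted["count_sd"])
    print("actual mean, sd:   ", actual_mean, round(actual_sd, 3))
    print("n = 340 model:     ", coin_toss_model(0.5, 340))
```

In this sketch the table's standard deviation is computed by dividing by the number of experiments (1000), which gives about 5.100, the value reported on slide 16; dividing by 999 instead would give about 5.102.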