Know the symbols and the meanings
We can always easily calculate/know a statistic
Often we don’t know (and will never know) the value of the parameter(s); we would have to take a census... Is that even possible?
Pew Research Center surveyed 1650 adult internet users to estimate the proportion of American internet users between 18 & 32 years old. Common wisdom holds that this group of young adults is among the heavier users of the internet. The survey found that 30% of the respondents in the sample were between ages 18 & 32.
Identify the population, sample, parameter, & statistic (‘estimator’).
This example... proportions... we have ‘means’ too...
Drawing conclusions about a population on the basis of observing only a small subset of that population (i.e., a sample); more on this a little later...
Always involves some uncertainty
Does a given sample represent a particular population accurately? Is the sample systematically ‘off?’ By a lot? By a little?
Recall, bias is… being systematically ‘off;’ scale 5 pounds heavy; clock 10 minutes fast, etc.
Textbook explains two different types of bias: measurement bias & sampling bias
Don’t need to know if a particular bias is measurement or sampling; just need to know the concept of bias
Let’s discuss…
Voluntary Response Sampling:
• People who choose themselves by responding to a general appeal
• Biased because people with strong opinions, especially negative opinions, are most likely to respond
• Often very misleading
• With a partner, come up with one real-life example of voluntary response sampling with which you are familiar; 30 seconds & share out
All of the examples we discussed make our sample statistic systematically ‘off;’ in other words, biased
Worthless data
Convenience Sampling: Choosing individuals who are easiest to reach. With a partner, think of a real-life example of convenience sampling. 30 seconds, then share out.
Situational examples
Bias; worthless data
Convenience sampling & voluntary response sampling are both biased sampling methods
How do we minimize (we can’t eliminate; just minimize as much as possible) bias when taking samples?
Let impersonal chance/randomness do the choosing; more on this... (remember, in voluntary response people chose to respond; in convenience sampling, the interviewer made the choice; in both situations, personal choice creates bias)
Simple Random Sample (SRS) – type of probability or random sample; this type of sampling is without replacement
SRS – chance selects the sample
The use of chance to select the sample is the essential principle of statistical sampling
Several different types of random sampling; all involve chance selecting the sample
Choosing samples by chance gives all individuals an equal chance to be chosen
We will focus on Simple Random Sampling (SRS)
SRS ensures that every set of n individuals has an equal chance to be the sample actually selected
Easiest ways to use chance/SRS:
• Names in a hat
• Random digits generator in calculator or Minitab
• Random digits table
Random digits table: a long string of the digits 0, 1, ..., 9 in which:
• Each entry in the table is equally likely to be any of the digits 0–9
• Entries are independent. Knowing the digits in one part of the table gives no information about another part of the table
• Table is in rows & columns; read either way (but usually rows); groups & rows have no meaning; they just make the table easier to read
• 0 – 9 equally likely; 00 – 99 equally likely; 000 – 999 equally likely
Joan’s small accounting firm serves 30 business clients. Joan wants to interview a sample of 5 clients to find ways to improve client satisfaction. To avoid bias, she chooses a SRS of size 5.
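Joan’s selection can also be sketched in software, as a stand-in for names in a hat or a random digits generator. This is only a minimal sketch in Python (not Minitab); the function name `simple_random_sample` and the seed value are made up for illustration, and the labels 01 to 30 follow the two-digit labeling used with a random digits table.

```python
import random

def simple_random_sample(labels, n, seed=None):
    """Draw a simple random sample of n distinct labels (without replacement)."""
    rng = random.Random(seed)  # seed is optional; it only makes the draw repeatable
    return sorted(rng.sample(labels, n))

# Joan's 30 clients, labeled 01..30 (every label has the same number of digits).
client_labels = [f"{i:02d}" for i in range(1, 31)]

# The seed 2024 is an arbitrary, hypothetical choice for this illustration.
chosen = simple_random_sample(client_labels, 5, seed=2024)
print(chosen)
```

Because the draw is without replacement, duplicates cannot occur, which matches the "ignore duplicates" rule used with the table. A different seed (or no seed) gives a different sample, just as entering the table on a different line would.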
Enter table at a random row
Notice her clients are numbered (labeled) with 2-digit numbers (if this isn’t already done, you must label your list), so we are going to go by 2-digit numbers in the table
Ignore all 2-digit numbers that are beyond 30 (our data is numbered from 01 to 30)
Ignore duplicates
Continue until we have 5 distinct 2-digit numbers chosen & identify who those clients are
1. Label... Assign a numerical label to every individual
2. Random Digits Table (or Minitab or names in hat)... Select labels at random
3. Stopping Rule... Indicate when you should stop sampling
4. Identify Sample... Use labels to identify subjects/individuals selected to be in the sample
Be certain all labels have the same # of digits if using RDT (ensures individuals have the same chance to be chosen)
Use the shortest possible label, i.e., 1 digit for populations up to 10 members (can use labels from 0 to 9), 2 digits for populations from 11 – 100 members (can use labels from 00 to 99), etc.; this is just a good standard of practice...
Label students; what labels should we use?
Label candy; what labels should we use?
SRS of 2 students using Random Digits Table; enter table on line ___
SRS of 2 students using my Minitab (will my Minitab be different from your Minitab?)
Should we allow duplicates? When should we and when should we not?
To take a SRS, we need a list of our population
Is this always possible? Could we get a list of everyone in the United States? In Santa Clarita? COC students?
Randomization is great to help with voluntary response bias and convenience bias
But it doesn’t help at all with other possible bias... such as wording, non-response, undercoverage, etc.
Let’s discuss other sources of potential bias which could make our statistic/our data worthless
Most samples suffer from some degree of undercoverage (another type of bias)
What is bias again??...
Bias is systematically favoring a particular outcome
Undercoverage occurs when one or more groups are left out of the process of choosing the sample
Talk in your groups and come up with an example of undercoverage (some groups in the population are left out of the process of choosing the sample)
Another source of bias in many/most sample surveys is non-response, when a selected individual cannot be contacted or refuses to cooperate
Big problem; happens very often, even with aggressive follow-up
Almost impossible to eliminate non-response; we can just try to minimize it as much as possible
Note: Most media polls don’t tell us the rate of non-response
Response bias... occurs when respondents are untruthful, especially if asked about illegal or unpopular beliefs or behaviors
Can you come up with some situations where people might be less than truthful?
Who won the “First Lady Debate?”
http://www.youtube.com/watch?v=EohGmGQUhA
http://perezhilton.com/tv/JIMMY_KIMMEL_LIVE_Who_Won_The_Presidential_Debate_BEFORE_It_Even_Happened/?id=79b7cb451ec00#.Vb-lN_NViko
http://www.mrctv.org/videos/kimmel-publicweighs-first-lady-debate
Should we ban disposable diapers?
A survey paid for by makers of disposable diapers found that 84% of the sample opposed banning disposable diapers. Here’s the actual question:
It is estimated that disposable diapers account for less than 2% of the trash in today’s landfills. In contrast, beverage containers, third-class mail and yard wastes are estimated to account for about 21% of the trash in landfills. Given this, in your opinion, would it be fair to ban disposable diapers?
Remember our survey...
Possible activity: Sampling Distribution on football lovers...
Even if we use SRS/probability sampling, and we are very careful in reducing bias as much as possible, the statistics we get from a given sample will likely be different from the statistics we get from another sample.
Statistics vary from sample to sample
We can improve our results (our sample statistic can get closer to what our population parameter actually is; our variability decreases) by increasing our random sample size
Remember: samples vary; parameters are fixed
Statistics (like sample proportions) vary from sample to sample
A larger n (sample size) decreases standard deviation (variability)
A smaller n (sample size) increases standard deviation (variability)
But remember, no matter how large our sample is, a large sample size does not ‘fix’ underlying issues, like bad wording, undercoverage, convenience sampling, etc.
Questions to ask about any survey:
• Who carried out the survey?
• How was the sample selected/was the sample representative of the population?
• How large was the sample?
• What was the response rate?
• How were the subjects contacted?
• When was the survey conducted?
• What was the exact question asked?
True value of population parameter (p or μ) is like a bull’s eye on a target
Sample statistic ($\bar{x}$, $\hat{p}$, etc.) is like an arrow fired at the target; sometimes it hits the bull’s eye and sometimes it misses
Keep in mind… we (very) often don’t know how close to the bull’s eye we are…
When we take many samples from a population (sampling distribution), bias & variability can look like the following:
• Bias, aim, accuracy
• Variability, precision, standard deviation
AND most of the time, we don’t even know where the target is!... We are trying to estimate where the target is... estimation method...
We want: low bias & low variability (good aim & good precision)
Properly chosen statistics computed from random samples of sufficient size will (hopefully) have low bias & low variability (good aim & precision)
Hits the bull’s eye on the target (even though we don’t know where the target is)
Can’t eliminate bias & variability (bad aim & precision); can just do all that we can to reduce bias & variability
How good are our estimators?
The sampling distribution of a statistic is the distribution of values taken by the statistic in all possible samples of the same size from the same population.
The population we will consider is the scores of 10 students on an exam as follows:
The parameter of interest is the mean score in this population. The sample is a SRS drawn from the population.
Use RDT (enter at random line) to draw a SRS of size n = 3 from this population. Calculate the mean ($\bar{x}$) of the sample scores. This statistic is an estimate of the population parameter, μ.
Repeat this process 5 times. Plot your 5 $\bar{x}$s on the board. We are constructing a sampling distribution.
What is the approximate value of the center of our sampling distribution? What is the shape?
Based on this sampling distribution, what do you think our true population parameter, μ, is?
Now let’s calculate our true population parameter, μ.
Also, remember the Reese’s Pieces simulation...
If we use good, sound statistical practices, we can get pretty good estimates of our population parameters based on samples
Also, our estimates get better and better if...
Our estimates get better and better as our ‘n’ increases... not our population size, but our sample size, n.
In other words, the precision (variability; how much it fluctuates) of an estimator (a sample statistic, like a sample proportion or a sample mean) does not depend on the size of the population; it depends only on the sample size.
So, an estimator based on a sample size of 10 is just as precise taken from a population of 1000 people as it is taken from a population of a million.
Let’s explore this idea more…
Consider 1000 SRS’s of n = 100 for proportion of U.S. adults who watched Survivor Guatemala
Discuss in your groups some observations you have about this distribution; share out.
Now compare two sampling distributions: n = 100 and n = 1000 (1000 SRS’s for both)
What do you notice?
Given that sampling randomization is used properly, the larger the SRS size (n), the smaller the spread (the more tightly clustered; the more precise) of the sampling distribution.
• Center doesn’t change significantly (good estimator; unbiased estimator)
• Shape doesn’t change significantly
• Spread (range, standard deviation) does change significantly; spread depends on n
A SRS of 2,500 from the U.S. population of 300 million has the same variability/precision as a SRS of 2,500 from the San Francisco population of 750,000
Both just as precise (given that population is well-mixed); equally trustworthy; no difference in standard deviation, how ‘spread out’ the distribution is
Not about population size; it’s about sample size (n)
As long as the population is large relative to the sample size, the precision has nothing to do with the size of the population, but only with the size of the sample.
Increased sample size improves precision/reduces variability
Surveys based on larger sample sizes have smaller standard error (SE) and therefore better precision (less variability)
Standard Deviation (or standard error) = $\sqrt{\frac{p(1-p)}{n}}$
Let’s try the formula:
Let’s say that 20,000 students attend COC, and 20% of them say their favorite ice cream is vanilla. If we were to take a SRS of 100 students and ask them if vanilla is their favorite flavor of ice cream, our standard deviation would be...
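The n = 100 ice cream question can be checked two ways: with the formula, and with a small simulation in the spirit of the "1000 SRS’s" idea. This is a rough sketch in Python rather than Minitab; the helper name `standard_error`, the seed, and the 0/1 coding of the population (1 = vanilla) are assumptions made for illustration.

```python
import math
import random
import statistics

def standard_error(p, n):
    """Standard deviation of a sample proportion: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)

# Formula: p = 0.20, n = 100.
se_100 = standard_error(0.20, 100)
print(round(se_100, 4))  # 0.04

# Simulation (assumed setup): 20,000 students, 20% coded 1 for "vanilla".
rng = random.Random(1)  # arbitrary seed so the run is repeatable
population = [1] * 4000 + [0] * 16000

# Take 1000 SRSs of size 100 and record each sample proportion.
sample_props = [sum(rng.sample(population, 100)) / 100 for _ in range(1000)]

# The spread of the simulated sample proportions should land near 0.04.
sim_sd = statistics.stdev(sample_props)
print(round(sim_sd, 3))
```

The same function can be reused for other sample sizes; only n changes, not the population size of 20,000.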
Standard Deviation (or standard error) = $\sqrt{\frac{p(1-p)}{n}}$
Now let’s say that 20,000 students attend COC, and 20% of them say their favorite ice cream is vanilla. If we were to take a SRS of 1,000 students and ask them if vanilla is their favorite flavor of ice cream, our standard deviation would be...
Standard Deviation (or standard error) = $\sqrt{\frac{p(1-p)}{n}}$
Now let’s say that 20,000 students attend COC, and 20% of them say their favorite ice cream is vanilla. If we were to take a SRS of 2,000 students and ask them if vanilla is their favorite flavor of ice cream, our standard deviation would be...
Standard Deviation (or standard error) = $\sqrt{\frac{p(1-p)}{n}}$
Increased sample size means we have more information about our population, so our variability/standard deviation is going to be less.
Increased sample size improves precision/reduces variability
Surveys based on larger sample sizes have smaller standard error (SE) and therefore better precision (less variability)
Trade-offs… Cost increases, time-consuming, etc.
Life was good and easy with the Normal distribution
Could easily calculate probabilities
Good working model
If we could use the Normal distribution with sampling distributions for proportions, life would be great
Guess what? We can.
Meet the Central Limit Theorem
Has many versions (one for proportions, one for means, etc.)
Let’s discuss proportions for now
To use CLT with proportions, three conditions must be met:
• Random; samples collected randomly from population
• Large sample; np ≥ 10 and n(1 – p) ≥ 10; expected numbers of successes and failures are each at least 10
• Big population (population at least 10 times sample size)
20,000 students attend COC, and we know that 20% of them say their favorite ice cream is vanilla. We want to find out... IF we asked a SRS of 100 students if vanilla is their favorite ice cream, what’s the probability that half or more of them would say ‘yes’?
Let’s check conditions to see if we can use the CLT to calculate that probability.
Random?
Large sample: np ≥ 10 and n(1 – p) ≥ 10?
Big population: population is at least 10 times the sample size?
20,000 students attend COC, and we know that 20% of them say their favorite ice cream is vanilla. We want to find out... IF we asked a SRS of 15 students if vanilla is their favorite ice cream, what’s the probability that half or more of them would say ‘yes’?
Let’s check conditions to see if we can use the CLT to calculate that probability.
Random?
Large sample: np ≥ 10 and n(1 – p) ≥ 10?
Big population: population is at least 10 times the sample size?
20,000 students attend COC, and we know that 20% of them say their favorite ice cream is vanilla. We want to find out... IF we asked a SRS of 5,000 students if vanilla is their favorite ice cream, what’s the probability that half or more of them would say ‘yes’?
Let’s check conditions to see if we can use the CLT to calculate that probability.
Random?
Large sample: np ≥ 10 and n(1 – p) ≥ 10?
Big population: population is at least 10 times the sample size?
It’s important to check conditions before we calculate probabilities to make sure that what we calculate is accurate, valuable, meaningful information (not nonsense)
• Random
• Large sample: np ≥ 10 and n(1 – p) ≥ 10
• Big population: population is at least 10 times the sample size
The examples so far have described situations in which we know the value of the population parameter, p. Very unrealistic
The whole point of carrying out a survey (most of the time) is that we don’t know the value of p, but we want to estimate it
Think about the elections... Parties are taking a lot of sample surveys (polls) to see who is ‘leading’
Took a random sample of 2,928 adults in the US and asked them if they believed that reducing the spread of AIDS and other infectious diseases was an important policy goal for the US government. 1,551 responded ‘yes;’ 53%
Spiral back for a moment... Random? Large sample? Big population?
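The two numerical conditions for the three ice cream scenarios above (n = 100, 15, and 5,000) can be checked mechanically. The sketch below is an illustration in Python, not a Minitab procedure; the function name `clt_conditions` is invented here, and the Random condition is left out because it depends on how the sample was collected, not on arithmetic.

```python
def clt_conditions(n, p, population_size):
    """Check the Large sample and Big population conditions for a proportion.

    The Random condition is NOT checked here; it has to be judged
    from how the sample was actually selected.
    """
    large_sample = n * p >= 10 and n * (1 - p) >= 10
    big_population = population_size >= 10 * n
    return {"large_sample": large_sample, "big_population": big_population}

# COC ice cream scenarios: 20,000 students, p = 0.20.
results = {n: clt_conditions(n, 0.20, 20_000) for n in (100, 15, 5000)}
for n, outcome in results.items():
    print(n, outcome)
```

Under these assumptions, n = 100 passes both checks, n = 15 fails Large sample (np = 3), and n = 5,000 fails Big population (10 × 5,000 is more than 20,000). For a survey where p is unknown, the same checks are usually run with the sample proportion in place of p.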
(So, if we wanted to find a probability using the CLT, we could...)
These are the exact conditions we must check to create a confidence interval as well
More on confidence intervals in a few...
The above percentage just tells us about OUR sample of those specific 2,928 people.
What about another sample? Would we get a different % of yes’s?
What about the percentage of all adults in the US who believe this? How much larger or smaller than 53% might it be?
Do we think a majority (more than 50%) of Americans share this belief?
We don’t know p, population parameter; we do know $\hat{p}$ for this sample; it’s 53%.
We also know:
• Our estimate (53%) is unbiased; remember sampling distributions? (maybe not exactly = p; maybe just a little low or a little high)
• Standard error (typical amount of variability) is $\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\frac{0.53(1-0.53)}{2928}}$, about 0.0092 ≈ 0.9%
• Because we have a ‘large sample,’ the probability distribution of our $\hat{p}$s is close to Normally distributed & centered around the true population parameter.
Our best estimate of the true, unknown population parameter is 0.53; sampling distribution is approximately Normal; standard error (SD; amount of variability in sample statistic) ≈ 0.009; so...
About 68% of sample proportions are within 1 standard error of the unknown population parameter, p
About 95% of sample proportions are within 2 standard errors of the unknown population parameter, p
About 99.7% of sample proportions are within 3 standard errors of the unknown population parameter, p
Our best estimate of the true, unknown population parameter is 0.53; sampling distribution is approximately Normal; standard error (SD) ≈ 0.009
So we can be highly confident, 99.7% confident, that the true, unknown population proportion, p, is between 0.53 – (3)(0.009) and 0.53 + (3)(0.009)
This is a confidence interval; we are 99.7% confident that the interval from about 50.3% to 55.7% captures the true, unknown population proportion of Americans who believe that reducing the spread of AIDS and other infectious diseases is an important policy goal for the US government.
What if we wanted to construct a confidence interval in which we are 95% confident?
What proportion of us have at least one tattoo?
So our sample statistic, our $\hat{p}$ = ___
If we were to ask another group of COC students, we would get another (likely different) $\hat{p}$
445 Math 075 students were asked this last Spring; 133/445 = 0.299 = 29.9% had at least one tattoo
Remember, larger n, generally less variation; but still centered at the same value (unbiased estimator)
We want to be able to say with a high level of certainty what proportion of all COC students have at least one tattoo. But we don’t know the true, unknown population parameter, p.
We don’t know p (population parameter)
We do know $\hat{p}$ (sample statistic); actually we have 2 sample statistics: our class and the Math 075 data
Our estimators are unbiased (what does that mean?)
Let’s check conditions for each of our samples: Random selection; Large sample; Big population
Can we use either sample statistic (either $\hat{p}$)?
If so, which should we use?
Calculate our standard deviation (our standard error)
Our distribution is ≈ Normal (because our conditions are met), centered around $\hat{p}$; 68% within 1 SD; 95% within 2 SDs; 99.7% within 3 SDs
Let’s create a 95% confidence interval...
We are 95% confident that the interval from _____ to _____ captures the true, unknown population parameter, p, the proportion of all COC students that have at least one tattoo.
This is a confidence interval with a 95% confidence level
Let’s create a 99.7% confidence interval...
We are 99.7% confident that the interval from _____ to _____ captures the true, unknown population parameter, p, the proportion of all COC students that have at least one tattoo.
This is a confidence interval with a 99.7% confidence level
Let’s create a 68% confidence interval...
We are 68% confident that the interval from _____ to _____ captures the true, unknown population parameter, p, the proportion of all COC students that have at least one tattoo.
This is a confidence interval with a 68% confidence level
What did you notice about the lengths of our confidence intervals as we changed from 68% confident to 95% confident to 99.7% confident?
More on this a little later...
Statistical inference provides methods for drawing conclusions about a population based on sample data
Methods used for statistical inference assume that the data was produced by a properly randomized design
Confidence intervals are one type of inference, and are based on sampling distributions of statistics.
The other type of inference we will learn and practice is hypothesis testing (more on this later).
Estimator ± margin of error (MOE)
Our estimator we just used was our sample proportion, our $\hat{p}$
Our margin of error we just used was built from our standard error (our standard deviation): 1, 2, or 3 standard errors for 68%, 95%, or 99.7% confidence
Margin of error tells us the amount we are most likely ‘off’ with our estimate
Margin of error helps account for sampling variability (NOT any of the biases we discussed... voluntary response, non-response, etc.)
How confident are you...
• that the mean temperature in Santa Clarita in degrees Fahrenheit is between -50 and 150?
• that the mean temperature in Santa Clarita in degrees Fahrenheit is between 70 and 70.001?
In general, large interval, high confidence level; small interval, lower confidence level
• 99% confidence level (or 99.7%)
• 95% confidence level
• 90% confidence level
Typically we want both: a reasonably high confidence level AND a reasonably small interval; but there are trade-offs; more on this in a little bit
Will we ever know for sure if we captured the true unknown population parameter p? No. Actual p is unknown.
Interpretation of a confidence interval: “I am ___% confident that the interval from _____ to _____ captures the true, unknown population proportion of (context).”
The lower the confidence level (say 10% confident), the shorter the confidence interval (I am 10% confident that the mean temperature in Santa Clarita is between 70.01 degrees and 70.02 degrees)
The higher the confidence level (say 99% confident), the wider the confidence interval (I am 99% confident that the mean temperature in Santa Clarita is between 40 degrees and 100 degrees)
What else affects the length of the confidence interval?
Larger the n (sample size), shorter the confidence interval (small MOE)
Smaller the n (sample size), longer the confidence interval (large MOE)
Let’s look at how sample size (n) affects the length of the confidence interval, specifically the margin of error
Remember, estimate ± margin of error
Margin of error
Let’s put some numbers in as a simple example...
So, if you want (need) a high confidence level AND a small(er) interval (margin of error), it is possible if you are willing to increase n
Can be expensive, time-consuming
Sometimes not realistic (why?)
In reality, you may need to compromise on the confidence level (lower confidence level) and/or your n (smaller n).
Alcohol abuse is considered by some as the #1 problem on college campuses. How common is it? A recent SRS of 10,904 US college students collected information on drinking behavior & alcohol-related problems. The researchers defined “frequent binge drinking” as having 5 or more drinks in a row 3 or more times in the past 2 weeks. According to this definition, 2,486 students were classified as frequent binge drinkers.
Based on these data, what can we say about the proportion of all US college students who have engaged in frequent binge drinking?
Let’s create a confidence interval so we can approximate the true population proportion of all US college students who engaged in frequent binge drinking.
How confident do we want to be (i.e., what confidence level do we want to use)?
We must check conditions before we calculate a confidence interval... Random? Large sample? Big population?
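As a cross-check on the software output for the binge drinking data above, a confidence interval of the form estimate ± margin of error can be sketched in Python. This is an illustration only, assuming the Normal approximation with the standard error computed from the sample proportion; the function name `proportion_ci` is invented here, and statistical packages may use a slightly different (exact) method by default.

```python
import math
from statistics import NormalDist

def proportion_ci(successes, n, confidence):
    """Normal-approximation confidence interval for a population proportion."""
    p_hat = successes / n
    # Critical value z* leaving (1 - confidence)/2 in each tail of the Normal curve.
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    moe = z_star * se
    return p_hat - moe, p_hat + moe

# 2,486 frequent binge drinkers out of an SRS of 10,904 students, 99% confidence.
low, high = proportion_ci(2486, 10904, 0.99)
print(f"{low:.3f} to {high:.3f}")  # 0.218 to 0.238
```

Changing the `confidence` argument (for example to 0.90 or 0.80) shows how the interval gets shorter as the confidence level drops, with the same data.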
Perform Minitab calculations
• 1 sample, proportion
• Options, 99 CL
• Data, summarized data, events, trials
• “+”, data labels
Always conclude with interpretation, in context
I am 99% confident that the interval from 21.8% to 23.8% contains the true, unknown population parameter, p, the actual proportion of all US college students who have engaged in frequent binge drinking.
Let’s use our Math 075 statistic: 133/445 students had at least one tattoo (or 29.9%)
We already checked our conditions
Previously we created confidence intervals at 68%, 95%, and 99.7% confidence levels (applied Empirical Rule... as an introduction to confidence intervals)
Now let’s use Minitab and construct a 90% confidence interval; an 80% confidence interval. Be sure to practice interpreting these confidence intervals as well.
Often researchers choose the margin of error & confidence level they want ahead of time/before the survey
So they need to have a particular n to achieve the MOE and the CL they want.
$MOE = z^* \sqrt{\frac{p(1-p)}{n}}$
A common CL is 95%, so z* ≈ 2
Can solve for n & get formula in textbook: with p = 0.5 and margin of error m, $m = 2\sqrt{\frac{0.5(1-0.5)}{n}}$, so $n = \frac{1}{m^2}$
A company has received complaints about its customer service. They intend to hire a consultant to carry out a survey of customers. Before contacting the consultant, the company president wants some idea of the sample size that she will be required to pay for. One critical question is the degree of satisfaction with the company's customer service. The president wants to estimate the proportion p of customers who are satisfied. She decides that she wants the estimate to be within 3% (0.03) at a 95% confidence level.
No idea of the true proportion p of satisfied customers; so use p = 0.5.
The sample size required is given by $MOE = z^* \sqrt{\frac{p(1-p)}{n}}$:
$0.03 = 2\sqrt{\frac{0.5(1-0.5)}{n}}$
n = 1111.11
So if the president wants to estimate the proportion, p, of customers who are satisfied, at the 95% confidence level, with a margin of error of 3%, she would need a sample size (an n) of at least 1,112.
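The same calculation can be sketched as a small Python function, solving the margin of error formula for n and rounding up. The name `required_sample_size` and its default arguments are assumptions for illustration; z* = 2 follows the approximation used above, and p = 0.5 is the value used when the true proportion is unknown.

```python
import math

def required_sample_size(moe, z_star=2.0, p=0.5):
    """Smallest n with z* * sqrt(p(1-p)/n) <= moe, i.e. n >= (z*/moe)^2 * p(1-p)."""
    return math.ceil((z_star / moe) ** 2 * p * (1 - p))

# President's requirement: within 0.03 at 95% confidence, with z* approximated by 2.
n_needed = required_sample_size(0.03)
print(n_needed)  # 1112
```

Rounding up matters: 1,111 customers would leave the margin of error just over 3%. With the more precise z* = 1.96 instead of 2, the same function gives 1,068.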