Stat 226 - Introduction to Business Statistics I
Spring 2009, Section A
Professor: Dr. Petrutza Caragea
Tuesdays and Thursdays 9:30-10:50 a.m.

Chapter 3, Section 3.3: Toward Statistical Inference

Question: What is the average height of all Stat 226 students?

We have several options to answer this question:
1. make a "wild guess"
2. collect everybody's height and compute the exact average
3. take a representative sample and compute the sample mean

Of the three options, the third (taking a representative sample and computing the sample mean) appears to be the most reasonable one.

However, this option raises a new and even more important question: How reliable is our estimate based on the sample?

Answer: it depends on
1. the choice of the sample, i.e. how the sample was obtained
2. the sample size (the larger the sample ⇒ the more information we have at hand ⇒ the more accurate and precise our estimate should be)

Let's recall the "big picture"

- µ is the overall mean of the population
- µ is a fixed but unknown value
- µ is referred to as a population parameter
- x̄ is the mean of the sample taken from the population
- x̄ varies from sample to sample (it is random, but we know its value once we have collected the sample)
- x̄ is referred to as a sample statistic

How do we obtain a representative sample?

We distinguish two types of studies in statistics: observational studies versus experiments.

Observational study: observe individuals with respect to a variable of interest.

- in a 1981 study, researchers compared the scholastic performance of music students with that of non-music students at a California high school
- music students had a much higher overall GPA than non-music students
- a whopping 16% of music students had all A's, compared with only 5% of the non-music students
- as a result of the study, music programs were expanded nationwide

Researchers tried to show an association between music education and grades. But the study was neither a survey, nor were students assigned to get music education. Students were simply observed, recording the choices they made (music education, no music education) and the overall outcome (grades).

Observational study: In observational studies, treatments are not assigned to study individuals; individuals are simply observed.

What is wrong with concluding that music education causes good grades?

Experiment: we actively impose a treatment on individuals and observe the variable of interest.

Example: Is a new drug more effective in lowering blood cholesterol level compared to standard drugs?

- a group of patients gets randomly assigned to one of two treatment groups: new drug and standard drug
- receiving the standard drug is called the control treatment
- patients do not know which drug they receive, to eliminate bias
- if neither doctors nor patients know who is receiving which treatment, the study is called a double-blind study
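
The random assignment of patients to the new drug and the standard drug can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the patient IDs, the group size of 20, the seed, and the helper name `randomly_assign` are all hypothetical and do not come from the course material.

```python
import random


def randomly_assign(subjects, treatments=("new drug", "standard drug"), seed=None):
    """Shuffle the subjects by chance, then deal them out evenly to the treatments."""
    rng = random.Random(seed)      # a seed makes the illustration reproducible
    shuffled = list(subjects)
    rng.shuffle(shuffled)          # chance, not the researcher, decides the order
    groups = {treatment: [] for treatment in treatments}
    for position, subject in enumerate(shuffled):
        treatment = treatments[position % len(treatments)]
        groups[treatment].append(subject)
    return groups


# Hypothetical patient IDs (an assumption for illustration only).
patients = [f"patient_{i:02d}" for i in range(1, 21)]
groups = randomly_assign(patients, seed=226)

for treatment, members in groups.items():
    print(treatment, len(members), sorted(members))
```

Because the order is shuffled before the subjects are dealt out, neither the patients nor the researcher can influence who ends up in the control group, which is the point of random assignment.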
Experiments: An experiment requires a random assignment of study subjects to treatments.

- experiments are the only way to show cause-and-effect relationships
- there is much more to learn about designing an experiment, but that is beyond the scope of this class
- keep in mind, though, that experiments can be designed well, but also really badly; badly designed experiments often reveal no information at all
- most of the success in conducting a designed experiment results directly from how well the pre-experimental planning was done

How to obtain a random sample

Consider the following example: You want to find out how much debt an Iowa State student has on average. How should you pick a representative sample?

- take all Stat 226 students from our section
- go to the dorms and take a random sample of 100 students
- go to the library and take a random sample of 100 students
- sample from the football team
- ...

Convenience sampling: the selection of units from the population is based on easy availability and/or accessibility.

- e.g. a mall survey
- the trade-off made for the ease of obtaining the sample is that such samples are typically not very representative of the population
- convenience samples very often yield biased responses

Voluntary response sample: consists of people who chose themselves by responding to a general appeal.

- e.g. NBC, CNN polls
- be aware: they often overrepresent people with strong opinions, most often negative opinions
- voluntary response samples very often yield biased responses
- a study of exercise called for volunteers to run on a treadmill ⇒ the study concluded that "Americans are in great shape"

Example: The advice columnist Ann Landers once asked her readers, "If you had to do it over again, would you have children?" A few weeks later, her column was headlined "70% OF PARENTS SAY KIDS NOT WORTH IT."

Indeed, 70% of the nearly 10,000 parents who wrote in said they would not have children if they could make the choice again. These data are worthless as indicators of opinion among all American parents. The people who responded felt strongly enough to take the trouble to write Ann Landers. Their letters showed that many of them were angry at their children. These people don't fairly represent all parents. It is not surprising that a statistically designed opinion poll on the same issue a few months later found that 91% of parents would have children again. Ann Landers announced a 70% "No" result when the truth about parents was close to 91% "Yes."

Source: http://score.kings.k12.ca.us/lessons/wwwstats/voluntary.html

So how do we choose a sample?

The best way to obtain a representative sample is to let chance choose the sample from the population.

Simple Random Sample of size n: In a simple random sample (SRS), each set of n individuals has an equal chance of selection.

To obtain a simple random sample of size n, create a list of all individuals of the population and choose n at random, e.g. using the table of random digits (Table B).

random selection ⇒ removes bias and subjectivity

Using the table of random digits

- label all individuals, assigning each a distinct number/label
- labels have to have the same number of digits, e.g.
  [Table: example labels for individuals (e.g. Adam), with columns "correct" and "wrong"]
- pick a line in Table B to start, e.g. line 122
- choose a sample of size n by selecting the first n labels that appear
- if a label/number does not match any label in the list, or if a label/number comes up more than once ⇒ skip it
- if you cannot obtain a sample of size n in one line, continue in the next line, e.g. with line 123 if you started in line 122

Example: using the map provided in class, choose 5 counties of Iowa. For a simple random sample of size n = 5 of Iowa counties, starting at line 122 and using the labels as indicated on the map (provided in class), we obtain

[County names appearing on the slide: Story, Polk, Ida, Mills, Clay, Lyon, Tama, Linn, Lee, Jackson]

Caution: SRSs are not always feasible or appropriate. For example, you may consider so-called stratified random samples: divide the population into strata, groups of individuals that are similar in some way that is important to the response. Then choose a separate SRS from each stratum and combine these SRSs to form the full sample (more on this on page 179 of the textbook).

Assuming that we obtained a representative sample of size n, how do we know that the sample mean x̄ from this sample is indeed a "good" estimate for µ?

Answer: Amazingly, averages of random samples behave in very regular and predictable ways, so knowing how x̄ values behave in general lets us deduce how our x̄ value is likely to behave in terms of being close to µ.

More details on this follow in Section 4.4.
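
The steps for using a table of random digits (label, read fixed-width labels, skip non-matching and repeated labels, continue in the next line) can be sketched in Python as follows. Everything specific here is an assumption made for illustration: the two-digit labels 01 to 10, the pairing of those labels with the ten county names, and the digit lines, which are made up and are not taken from Table B or from the class map.

```python
def srs_from_random_digits(lines, labels, n):
    """Pick n distinct individuals by reading fixed-width labels from lines of digits."""
    label_width = len(next(iter(labels)))
    if any(len(label) != label_width for label in labels):
        raise ValueError("all labels must have the same number of digits")

    # Read the lines one after another, ignoring the spaces between digit groups.
    stream = "".join(ch for line in lines for ch in line if ch.isdigit())

    chosen = []
    for start in range(0, len(stream) - label_width + 1, label_width):
        candidate = stream[start:start + label_width]
        if candidate not in labels or candidate in chosen:
            continue  # skip labels that match nobody or that already came up
        chosen.append(candidate)
        if len(chosen) == n:
            break
    if len(chosen) < n:
        raise ValueError("not enough digits; supply further lines of the table")
    return [labels[label] for label in chosen]


# Hypothetical labels: every label has exactly two digits ("01", not "1").
counties = ["Story", "Polk", "Ida", "Mills", "Clay",
            "Lyon", "Tama", "Linn", "Lee", "Jackson"]
labels = {f"{i:02d}": name for i, name in enumerate(counties, start=1)}

# Made-up digit lines standing in for two consecutive lines of a random digit table.
lines = ["04711 00493", "02580 96601"]

sample = srs_from_random_digits(lines, labels, n=5)
print(sample)  # ['Mills', 'Jackson', 'Polk', 'Lee', 'Story']
```

Reading the digits in pairs gives 04, 71, 10, 04, 93, 02, 58, 09, 66, 01. The pairs 71, 93, 58 and 66 match no label and the second 04 is a repeat, so all of these are skipped, leaving the labels 04, 10, 02, 09 and 01.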
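
The claim that averages of random samples behave in regular and predictable ways can be illustrated with a small simulation. This is a sketch, not course material: the population of 5,000 heights is generated artificially (normal with mean 170 cm and standard deviation 10 cm, both assumptions), and the sample sizes, the number of repetitions, and the seed are arbitrary choices.

```python
import random
import statistics

rng = random.Random(226)  # fixed seed so the illustration is reproducible

# Hypothetical population of heights in cm (made-up data, not real students).
population = [rng.gauss(170, 10) for _ in range(5000)]
mu = statistics.mean(population)  # the population parameter, known here by construction


def sample_means(population, n, repetitions, rng):
    """Draw `repetitions` simple random samples of size n and return their means."""
    return [statistics.mean(rng.sample(population, n)) for _ in range(repetitions)]


center = {}
spread = {}
for n in (10, 100):
    means = sample_means(population, n, 1000, rng)
    center[n] = statistics.mean(means)   # where the x-bar values are centered
    spread[n] = statistics.stdev(means)  # how much x-bar varies from sample to sample
    print(f"n = {n:3d}: center of x-bar = {center[n]:.2f}, "
          f"spread of x-bar = {spread[n]:.2f}, mu = {mu:.2f}")
```

In a run of this sketch the x̄ values cluster around µ for both sample sizes, and they vary less from sample to sample when n = 100 than when n = 10, in line with the earlier point that larger samples should give more precise estimates.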