Introduction to Statistics
From the Data at Hand to the World at Large, Part I
Siana Halim

TOPICS
• Population and Sample
• Sampling distribution models
• Confidence interval for proportions

References:
• De Veaux, Velleman, Bock, Stats: Data and Models, Pearson Addison Wesley, International Edition, 2005
• John A. Rice, Mathematical Statistics and Data Analysis, Duxbury Press, 1995

Sampling and Population
We’d like to know about an entire population of individuals, but examining all of them is usually impractical, if not impossible. So we settle for examining a smaller group of individuals, a sample, selected from the population.
We should select individuals for the sample at random. Randomizing protects us from the influences of all the features of our population, even ones that we may not have thought about.
The fraction of the population that you’ve sampled doesn’t matter. It’s the sample size itself that’s important.

Does a census make sense?
• It can be difficult to complete a census.
• Populations rarely stand still.
• Taking a census can be more complex than sampling.

Population and Parameters
Models use mathematics to represent reality. Parameters are the key numbers in those models. A parameter used in a model for a population is called a population parameter. Any summary found from the data is a statistic.

Name                     Statistic   Parameter
Mean                     ȳ           μ (mu)
Standard deviation       s           σ (sigma)
Correlation              r           ρ (rho)
Regression coefficient   b           β (beta)
Proportion               p̂           p

Simple Random Samples
We need to be sure that the statistics we compute from the sample reflect the corresponding parameters accurately, that is, that the sample is representative. How would we select a representative sample?
A Simple Random Sample (SRS): every possible sample of the size we plan to draw has an equal chance of being selected. Each combination of people has an equal chance of being selected as well.
The sampling frame is a list of individuals from which the sample is drawn.
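The idea of drawing an SRS from a sampling frame can be sketched in a few lines of Python. This is only an illustration: the frame of 1,000 numbered individuals, the sample size of 10, the seeds, and the helper name draw_srs are all hypothetical. The standard-library call random.sample draws without replacement, so every subset of the requested size is equally likely.

```python
import random


def draw_srs(frame, n, seed=None):
    """Return a simple random sample of n individuals from the sampling frame."""
    rng = random.Random(seed)      # separate generator so the draw is reproducible
    return rng.sample(frame, n)    # sampling without replacement


# Hypothetical sampling frame: individuals numbered 1..1000.
frame = list(range(1, 1001))

sample_a = draw_srs(frame, 10, seed=1)
sample_b = draw_srs(frame, 10, seed=2)
print(sample_a)
print(sample_b)
```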
Samples drawn at random generally differ from one another. Each draw of random numbers selects different people for our sample. These differences lead to different values for the variables we measure. We call these sample-to-sample differences sampling variability.

Stratified Sampling
All statistical sampling designs have in common the idea that chance, rather than human choice, is used to select the sample.
Designs that are used to sample from large populations, especially populations residing across large areas, are often more complicated than simple random samples.
Sometimes the population is first sliced into homogeneous groups, called strata, before the sample is selected. Then simple random sampling is used within each stratum before the results are combined. This common sampling design is called stratified random sampling.

Cluster and Multistage Sampling
Splitting the population into similar parts or clusters can make sampling more practical. Then we could simply select one or a few clusters at random and perform a census within each of them. This design is called cluster sampling.
Sampling schemes that combine several methods are called multistage samples.
Sometimes we draw a sample by selecting individuals systematically. This is called systematic sampling.

Sampling Distribution Models
• Why do sample proportions vary at all?
• How can surveys conducted at essentially the same time by the same organization asking the same questions get different results?
• The answer is at the heart of statistics.
• It’s because each survey is based on a different sample.
• The proportions vary from sample to sample because the samples are composed of different people.

Modeling the Distribution of Sample Proportion
Most models are useful only when specific assumptions are true. In the case of the model for the distribution of sample proportions, there are two assumptions:
1. The sampled values must be independent of each other.
2. The sample size, n, must be large enough.
The corresponding conditions to check before using the Normal to model the distribution of sample proportions are:
1. 10% condition: If sampling has not been made with replacement, then the sample size, n, must be no larger than 10% of the population.
2. Success/failure condition: The sample size has to be big enough so that both np and nq are at least 10.

The Sampling Distribution Model of a Proportion
Provided that the sampled values are independent and the sample size is large enough, the sampling distribution of p̂ is modeled by a Normal model with mean $\mu(\hat{p}) = p$ and standard deviation $SD(\hat{p}) = \sqrt{\frac{pq}{n}}$, that is, $N\left(p, \sqrt{\frac{pq}{n}}\right)$.
Here $\hat{p} = \frac{y}{n}$ is the sample proportion and $\hat{q} = 1 - \hat{p}$, where y is the number of successes and n is the sample size.
[Figure: Normal model $N(p, \sqrt{pq/n})$, horizontal axis marked at p and at 1, 2 and 3 standard deviations $\sqrt{pq/n}$ on either side of p]

The Central Limit Theorem (CLT)
As the sample size, n, increases, the mean of n independent values has a sampling distribution that tends toward a Normal model with mean $\mu(\bar{y})$ equal to the population mean, µ, and standard deviation $\sigma(\bar{y}) = SD(\bar{y}) = \frac{\sigma}{\sqrt{n}}$.
The CLT requires remarkably few assumptions, so there are few conditions to check:
1. Random sampling condition.
2. Independence assumption.

Sampling Distribution Model for Mean
If the assumptions of independence and random sampling are met, and the sample size is large enough, the sampling distribution of the sample mean is modeled by a Normal model with a mean equal to the population mean, µ, and a standard deviation equal to $\frac{\sigma}{\sqrt{n}}$, that is, $\bar{X} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$.
The parameters in the population are estimated by:
Sample mean: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Sample standard deviation: $\hat{\sigma} = s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{X})^2}$
[Figure: Normal model $N(\mu, \sigma/\sqrt{n})$, horizontal axis marked at µ and at 1, 2 and 3 standard deviations $\sigma/\sqrt{n}$ on either side of µ]

Working with Sampling Distribution Models
Example 1. About 13% of the population is left-handed. A 200-seat school auditorium has been built with 15 “leftie seats,” seats that have the built-in desk on the left rather than the right arm of the chair. In a class of 90 students, what’s the probability that there will not be enough seats for the left-handed students?

Step-by-step
• State what we want to know.
• Check the conditions.
• State the parameters and the sampling distribution model.
• Make a picture. Sketch the model and shade the area we’re interested in.
• Find the z-score or the cutoff proportion.
• Find the resulting probability from a table of Normal probabilities.
• Discuss the probability in the context of the question.

Working with Sampling Distribution Models
Example 2. Suppose that mean adult weight is 175 pounds with a standard deviation of 25 pounds. An elevator in our building has a weight limit of 10 persons or 2000 pounds. What’s the probability that the 10 people who get on the elevator overload its weight limit?

Standard Error
When we estimate the standard deviation of a sampling distribution using statistics found from the data, the estimate is called a standard error.
$SE(\hat{p}) = \sqrt{\frac{\hat{p}\hat{q}}{n}}$
$SE(\bar{y}) = \frac{s}{\sqrt{n}}$

Confidence Interval for Proportion
We are 95% confident that the true proportion of the population is in our interval.

Confidence Interval (Example)
Sea fans, one spectacular kind of coral, in the Caribbean Sea have been under attack by the disease aspergillosis. In June of 2000, the sea fan disease team from Dr. Drew Harvell’s lab randomly sampled some sea fans at the Las Redes Reef in Akumal, Mexico, at a depth of 40 feet. They found that 54 of the 104 sea fans they sampled were infected with the disease. What might this say about the prevalence of this disease among sea fans in general?
What can we say about the population proportion, p? Is the infected proportion of all sea fans 51.9%?
We do know, though, that the sampling distribution model of p̂ is centered at p, and we know that the standard deviation of the sampling distribution is $SD(\hat{p}) = \sqrt{\frac{pq}{n}}$.
But we don’t know p. Instead we’ll use p̂ and find the standard error, $SE(\hat{p}) = \sqrt{\frac{\hat{p}\hat{q}}{n}} = \sqrt{\frac{(0.519)(0.481)}{104}} \approx 0.049$.
Now we know the sampling model for p̂ should look like this:
[Figure: Normal model $N(p, \sqrt{pq/n})$, horizontal axis marked at p and at 1, 2 and 3 standard deviations on either side of p]
Because it’s Normal, it says that about 68% of all samples of 104 sea fans will have p̂’s within 1 SE, 0.049, of p. And about 95% of all these samples will be within p ± 2 SEs.
BUT where is our sample proportion in this picture?
We do know that for 95% of random samples, p̂ will be no more than 2 SEs away from p. So let’s look at this from p̂’s point of view.
If I’m p̂, there’s a 95% chance that p is no more than 2 SEs away from me. If I reach out 2 SEs, or 2 × 0.049, away from me on both sides, I’m 95% sure that p will be within my grasp. Now, I’ve got him! Probably.

“Far better an approximate answer to the right question, … than an exact answer to the wrong question.” - John W. Tukey

So what can we really say about p?
1. “51.9% of all sea fans on the Las Redes Reef are infected.” → NO WAY!
2. “It is probably true that 51.9% of all sea fans on the Las Redes Reef are infected.” → NO
3. “We don’t know exactly what proportion of sea fans on the Las Redes Reef are infected, but we know that it’s within the interval 51.9% ± 2 × 4.9%. That is, it’s between 42.1% and 61.7%.” → GETTING CLOSER!
4. “We don’t know exactly what proportion of sea fans on the Las Redes Reef are infected, but the interval from 42.1% to 61.7% probably contains the true proportion.” → TRUE, but a bit wishy-washy.
5. “We are 95% confident that between 42.1% and 61.7% of Las Redes Reef sea fans are infected.” → YES!
Statements like these are called confidence intervals. They’re the best we can do.
The interval is called a one-proportion z-interval.
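The sea fan arithmetic can be checked with a short Python sketch. It uses only the counts given in the example (54 infected out of 104) and the "reach out 2 SEs" rule from the slides; the function name two_se_interval is hypothetical and chosen for illustration.

```python
import math


def two_se_interval(successes, n):
    """Sample proportion, its standard error, and the interval p_hat +/- 2 SE."""
    p_hat = successes / n
    q_hat = 1 - p_hat
    se = math.sqrt(p_hat * q_hat / n)   # SE(p_hat) = sqrt(p_hat * q_hat / n)
    me = 2 * se                         # margin of error using the 2-SE rule
    return p_hat, se, p_hat - me, p_hat + me


# Counts from the sea fan example: 54 infected out of 104 sampled.
p_hat, se, lower, upper = two_se_interval(54, 104)
print(f"p_hat = {p_hat:.3f}, SE = {se:.3f}")        # p_hat = 0.519, SE = 0.049
print(f"interval: {lower:.1%} to {upper:.1%}")      # interval: 42.1% to 61.7%
```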
Margin of Error
A confidence interval (CI) has the form $\hat{p} \pm 2\,SE(\hat{p})$.
The extent of the interval on either side of p̂ is called the margin of error (ME). In general, CIs look like this: estimate ± ME.
The more confident we want to be, the larger the margin of error must be.

Critical Value
The values z* = 1.96 and z* = 1.645 are called critical values. For the standard Normal model, the central area between −1.96 and 1.96 is 0.95, and the central area between −1.645 and 1.645 is 0.90.
The CIs for the sample proportion and the sample mean can be formulated as follows:
$\hat{p} \pm z^{*}\,SE(\hat{p})$
$\bar{X} \pm z^{*}\,SE(\bar{X})$

Assumptions and Conditions
Independence Assumption → check three conditions:
• Plausible independence condition. This condition depends on your knowledge of the situation.
• Randomization condition. Were the data sampled at random or generated from a properly randomized experiment?
• 10% condition.
Sample Size Assumption → check the success/failure condition. We must expect at least 10 “successes” and at least 10 “failures.”

One-proportion z-interval
When the conditions are met, we are ready to find the confidence interval for the population proportion, p. The confidence interval is $\hat{p} \pm z^{*}\,SE(\hat{p})$, where the standard error of the proportion is estimated by $SE(\hat{p}) = \sqrt{\frac{\hat{p}\hat{q}}{n}}$.

Example
In May 2002, the Gallup Poll asked 537 randomly sampled adults the question “Generally speaking, do you believe the death penalty is applied fairly or unfairly in this country today?” Of these, 53% answered “Fairly” and 7% said they didn’t know. What can we conclude from this survey?

Student t distribution
The t-distribution has a shape similar to the Normal distribution, but it has longer tails.
Note: t → Z as n increases.
[Figure: density curves for the standard Normal (t with df = ∞), t (df = 13) and t (df = 5), centered at 0]

t-Distribution
        Upper-tail area
df      .25      .10      .05
1       1.000    3.078    6.314
2       0.817    1.886    2.920
3       0.765    1.638    2.353

Let n = 3, so df = n − 1 = 2. With α = .10, the upper-tail area is α/2 = .05, and the table gives t = 2.920. The table entries are values of t, not probabilities.
[Figure: t density with upper-tail area α/2 = .05 shaded to the right of t = 2.920]
Using the t distribution, the CI for the mean can be formulated as $\bar{X} \pm t^{*}_{n-1}\,SE(\bar{X})$.
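As a rough illustration of the t-based interval, the Python sketch below plugs the table value t* = 2.920 (df = 2, upper-tail area .05, i.e. 90% confidence) into $\bar{X} \pm t^{*}_{n-1}\,SE(\bar{X})$. The three data values and the function name t_interval are hypothetical, made up only so that n = 3 matches the table lookup above.

```python
import math
import statistics


def t_interval(data, t_star):
    """Confidence interval for the mean: x_bar +/- t_star * s / sqrt(n)."""
    n = len(data)
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)      # sample standard deviation (n - 1 denominator)
    se = s / math.sqrt(n)           # SE(x_bar) = s / sqrt(n)
    me = t_star * se
    return x_bar - me, x_bar + me


data = [10.0, 12.0, 14.0]           # hypothetical measurements, n = 3
T_STAR = 2.920                      # from the t table: df = 2, upper-tail area .05

lower, upper = t_interval(data, T_STAR)
print(f"90% CI for the mean: {lower:.3f} to {upper:.3f}")
```

With so few observations the interval is wide, which reflects the longer tails of the t distribution at small df.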