Chapter 18: Sampling Distributions.1

___________________________ is the process of drawing conclusions about a population based on data from a sample of that population.

Any quantity computed from the values in a sample is called a _____________. The corresponding population characteristic is usually called a ___________________.

Note: Since we use “statistics” to estimate “parameters,” “statistics” are also called “estimates.”

The problem with using statistics to estimate parameters is that our estimates usually aren’t exactly correct. For example, suppose that we know the mean weight of all students at LSHS is 135 pounds (i.e., µ = 135). However, if we were to take a random sample of 50 students, we may get ______________________. We need to answer the question: “How good is our estimate?”

Since the value of the statistic (estimate) depends on the sample selected, the value of the statistic will vary from sample to sample. This variability is called ____________________________.

Note: Sampling variability is different from the variability we have studied so far.
- If y = weight of an LSHS student, then y has variability (different students have different weights). This is NOT sampling variability.
- If ȳ = the average weight of a sample of 50 LSHS students, then ȳ also has variability (different samples will produce different averages). This is sampling variability.

Thus, statistics (estimates) are variable while parameters (population characteristics) are constant.

The distribution of a statistic (estimate) is called its ______________________________. It is the distribution of all the values of the statistic based on all possible samples of the same size.

Note: The important features of a sampling distribution are its ______________________________.

Suppose that our class is the population and we are taking samples of size 3 to estimate the true average height (µ).
y = height (inches)
ȳ = average height of 3 students

The sampling distribution of ȳ lists all possible values of ȳ and how frequently they will occur.

Note: Each possible statistic (range, standard deviation, median, proportion of males, etc.) has its own sampling distribution.

HW #1 Textbook intro

How Many Textbooks??

During World War II, Allied forces noticed that all German tanks were labeled with a serial number. After observing numerous tanks, it became obvious that the tanks were numbered systematically. Thus, the Allies were able to predict the total number of German tanks and their locations using the observed serial numbers. This became known as the German Tank Problem. Fortunately for our troops in the Middle East, enemy forces do not have any tanks to speak of. However, the methods used to solve the German Tank Problem can be applied to many other situations.

In this assignment, your job is to develop a method to estimate the total number of AP Statistics books (or any other textbook) at our school based on a random sample of books and the assumption that the books are numbered sequentially starting from 1.

For example, suppose a random sample of n = 7 books gives the following numbers: 10, 38, 59, 61, 74, 90, 94. How can you use this information to estimate the total number of books (N)? One possible method would be to take the median value and double it. In this example, 2 × median = 2 × 61 = 122. Thus, our estimate for the total number of books is N = 122.

For this activity you will create three other statistics to estimate the total number of books. Remember, a statistic is any quantity computed from the values in a sample. You may use any combination of the summary statistics we already know (mean, median, min, max, quartiles, IQR, standard deviation, etc.) or invent your own. The goal is to find a relatively simple statistic that reliably predicts the total number of books.
To determine which of your three statistics gives the best predictions, you will perform a simulation to generate sampling distributions for each of your three statistics. For the purposes of the simulation, assume that there are 100 books total (N = 100) and that you will be taking samples of size 7 (n = 7). For each run (do 25 runs), generate 7 random numbers from 1 to 100 and compute the values of each of your three statistics.

What you need to turn in:
- An introduction describing the choice of your 3 statistics.
- Three dotplots showing the sampling distributions for each of your statistics. Make sure they are on the same scale so they can be easily compared.
- A verbal description/comparison of the three distributions.
- A description of which statistic you think is best and why.

Due _________________ (one per group)

NOTE: _______________ in class we will have a competition to see which group has the best statistic. The nominating order will be randomly selected and no statistic will be allowed more than once. During each round of the competition, I will secretly choose N and then give you a sample of n numbers from 1 to N. Each group will compute the value of their statistic based on the sample of n numbers. The closest estimate gets 3 points, the second closest gets 2 points, and the third closest gets 1 point. After several rounds, the team with the most points will get 2 extra credit points per member.

Chapter 18: Modeling the Distribution of Sample Proportions.2

One of the major tasks of a statistician is to discover the proportion of individuals in a population who possess a certain property or characteristic. Unfortunately, without conducting a census, we will never know the true value of p, the ____________. Instead, statisticians will estimate p by taking a sample and computing the ____________________, p̂.
$$\hat{p} = \frac{\text{number of ``successes'' in the sample}}{\text{sample size}} = \frac{x}{n}$$

Note: Since x is the number of successes, x is a binomial random variable.

Suppose that 22% of registered voters in California are senior citizens. If we were to take a sample of size 100 and find the sample proportion of senior citizens (p̂), what values might we get? What does the distribution of p̂ look like?

What about the shape of the distribution? Recall that the binomial distribution can be well approximated by a normal distribution when both np and nq are at least 10 (where q = 1 − p).

General Properties of the Sampling Distribution of a Sample Proportion:

Let p̂ be the proportion of successes in a random sample of size n from a population whose proportion of successes is p. Denote the mean value of the p̂ distribution by $\mu_{\hat{p}}$ and the standard deviation of the p̂ distribution by $\sigma_{\hat{p}}$. Then,
1. $\mu_{\hat{p}} = p$
2. $\sigma_{\hat{p}} = \sqrt{\dfrac{pq}{n}}$
3. When n is large and p is not too close to 0 or 1, the sampling distribution of p̂ is approximately normal.

Conditions:
a. The data come from a random sample from the population of interest.
b. Large sample size: both np and nq are at least 10 (so the normal model is a good approximation to the binomial).
c. Sample less than 10% of the population (so we can consider trials to be independent).

Note: Rules 1 and 2 follow directly from the rules for binomial random variables.

$$\mu_{\hat{p}} = \mu_{x/n} = \frac{\mu_x}{n} = \frac{np}{n} = p$$

$$\sigma_{\hat{p}} = \sigma_{x/n} = \frac{\sigma_x}{n} = \frac{\sqrt{npq}}{n} = \sqrt{\frac{npq}{n^2}} = \sqrt{\frac{pq}{n}}$$

Suppose that 43% of registered voters in California are Democrats. What is the probability that there will be less than 30% Democrats in a random sample of size 50?

If n = 50, what is the probability that the sample proportion will be within .05 of the true proportion? How can we make this probability bigger?

Note: If the large sample size condition isn’t met (or if you just don’t like the normal distribution), you can still use the binomial distribution. However, in later chapters we will need to use the normal approximation, so we will use it whenever we can from now on.
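The two Democrat questions above can be checked numerically. The following is a minimal sketch, assuming the normal model is appropriate (np = 21.5 and nq = 28.5 are both at least 10); the helper name `phat_model` is an illustrative choice, not part of the notes, and it relies only on Python's standard `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist


def phat_model(p, n):
    """Normal model for p-hat: mean p, standard deviation sqrt(pq/n)."""
    q = 1 - p
    # Large-sample condition from the notes: np and nq both at least 10.
    if n * p < 10 or n * q < 10:
        raise ValueError("np and nq should both be at least 10")
    return NormalDist(mu=p, sigma=sqrt(p * q / n))


# 43% Democrats in the population, random sample of size 50.
model = phat_model(0.43, 50)

# P(p-hat < 0.30)
prob_below_30 = model.cdf(0.30)

# P(p-hat within .05 of p) = P(0.38 < p-hat < 0.48)
prob_within_05 = model.cdf(0.48) - model.cdf(0.38)

print(f"P(p-hat < 0.30)         = {prob_below_30:.4f}")
print(f"P(0.38 < p-hat < 0.48)  = {prob_within_05:.4f}")
```

Calling `phat_model(0.43, 200)` instead shows how a larger n shrinks the standard deviation and raises the "within .05" probability.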
HW #1: Problems: 18.1-10 all

Chapter 18: The Sampling Distribution of the Sample Mean.3

General Properties of the Sampling Distribution of a Sample Mean (ȳ):

Let ȳ denote the mean of the observations in a random sample of size n from a population having mean µ and standard deviation σ. Denote the mean value of the ȳ distribution by $\mu_{\bar{y}}$ and the standard deviation of the ȳ distribution by $\sigma_{\bar{y}}$. Then, the following rules hold:
1. $\mu_{\bar{y}} = \mu$
2. $\sigma_{\bar{y}} = \dfrac{\sigma}{\sqrt{n}}$
3. When the population distribution of y is normal, the sampling distribution of ȳ is also normal for any size n.
4. Central Limit Theorem: When n is sufficiently large, the sampling distribution of ȳ is well approximated by a normal curve, even when the population is not normal.

To explore this, try the following Applet: http://www.ruf.rice.edu/~lane/rvls.html

Conditions:
a. The data come from a random sample from the population of interest.
b. Large sample size (n > 30) OR distribution of population approximately normal (so the distribution of ȳ can be considered approximately normal).
c. Sample < 10% of population (so selections can be considered independent).

Note: Rules 1 and 2 come from the rules for combining independent random variables:

$$\mu_{\bar{y}} = \mu_{\frac{y_1 + y_2 + \cdots + y_n}{n}} = \frac{\mu + \mu + \cdots + \mu}{n} = \frac{n\mu}{n} = \mu$$

$$\sigma_{\bar{y}} = \sigma_{\frac{y_1 + y_2 + \cdots + y_n}{n}} = \frac{\sqrt{\sigma^2 + \sigma^2 + \cdots + \sigma^2}}{n} = \frac{\sqrt{n\sigma^2}}{n} = \sqrt{\frac{n\sigma^2}{n^2}} = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}}$$

Suppose that the weights of apples are approximately normally distributed with a mean of 200 grams and a standard deviation of 35 grams.
a. What is the probability that if you choose one apple at random, its weight is at least 220 grams?
b. What is the probability that if you choose 25 apples at random, the average weight is at least 220 g?

Suppose that the population of residents of a local town has a mean annual income of $50,000 and a standard deviation of $35,000. If a random sample of 16 town residents is selected, what is the probability that their sample mean is within $5000 of the true mean?

HW #3: 18.11, 17-23 odd

How Many Textbooks? The Competition!!
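Before the competition, a group could compare candidate statistics by simulation, as the textbook assignment describes. The following is a minimal sketch under the assignment's settings (N = 100, n = 7, 25 runs) and the assumption that books are sampled without replacement. `double_median` is the example statistic from the assignment; `double_mean` and `max_plus_share` are hypothetical candidates chosen only for illustration.

```python
import random
from statistics import mean, median, pstdev

N = 100    # true number of books, known only inside the simulation
n = 7      # sample size
RUNS = 25  # number of simulated samples per statistic


def double_median(sample):
    """Example statistic from the assignment: 2 x median."""
    return 2 * median(sample)


def double_mean(sample):
    """Hypothetical candidate: 2 x mean."""
    return 2 * mean(sample)


def max_plus_share(sample):
    """Hypothetical candidate: max + max/n - 1."""
    largest = max(sample)
    return largest + largest / len(sample) - 1


def simulate(statistic, runs=RUNS, seed=0):
    """Return one estimate of N per simulated sample (drawn without replacement)."""
    rng = random.Random(seed)
    return [statistic(rng.sample(range(1, N + 1), n)) for _ in range(runs)]


# Summarize each simulated sampling distribution: center (bias) and spread.
for stat in (double_median, double_mean, max_plus_share):
    estimates = simulate(stat)
    print(f"{stat.__name__:15s} mean = {mean(estimates):6.1f}   SD = {pstdev(estimates):5.1f}")
```

A statistic whose simulated estimates are centered near 100 with a small spread is the kind of estimator the assignment asks for; dotplots of the three lists returned by `simulate` would show the same comparison graphically.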
HW #4: 18.12, 18-24 even

Review Chapter 18

HW #5: 18.13, 14, 27, 28, 30

Test Chapter 18

Chapter 19: Confidence Intervals for Proportions.1

One of the primary jobs of a statistician is to estimate characteristics of populations (parameters). What proportion of students drive to school? What is the average weight of a teenager? To make these estimates, statisticians will select a sample from the population of interest and study the members of the sample. However, because of sampling variability, it is possible that our estimates will be wrong.

In general, there are two types of estimates that can be made: point estimates and interval estimates.

A ________________________ is a single number that represents a plausible value for the population parameter. It is called a “point” estimate since it represents a single point on the number line. For example, based on a recent survey, we estimate that 61% of young adults favor social security reform.

An _________________________ is a range of plausible values for that parameter. It is called an “interval” estimate since it represents an interval on the number line. For example, based on a recent survey, we estimate that between 58% and 64% of young adults favor social security reform (61% ± 3%).

Using an interval estimate greatly increases the probability we are correct. For example, I predict the high temperature tomorrow will be between 0 and 150 degrees. Of course, this interval won’t help me pick my clothes in the morning. There is a tradeoff between confidence and usefulness.

Making point estimates: Which statistic should we use?

There are two things to look for when choosing a statistic to estimate a population characteristic (parameter): the statistic should have no bias and low variability.

An _____________________________ is a statistic with a mean value equal to that of the population characteristic being estimated (i.e., its sampling distribution is centered in the right place).
For example, in the textbook problem, some unbiased statistics were:

They are unbiased since their sampling distributions were centered in the right place. In other words, the estimates weren’t consistently too low or consistently too high.

Some examples of biased statistics would be:

Why are these biased?

Given a choice between several unbiased statistics, which should we use?

Note: Sometimes a biased statistic is preferred if it has much less variability.

Making Interval Estimates

Because of the variability inherent in sampling, our point estimates will rarely be correct. After all, different samples will yield different estimates. And, although a point estimate may be our best single guess for the value of a population parameter, it is certainly not the only plausible value. Thus, statisticians will usually report an interval of plausible values for the population parameter based on the sample. This interval is called a ___________________________. Using a confidence interval gives a much better chance of being correct.

The Logic of Confidence Intervals:
1. The distance from p to p̂ is the same as the distance from p̂ to p.
2. 95% of all samples will give a sample proportion p̂ within 1.96 SD of the population proportion p, assuming the distribution of p̂ is normal.
3. Therefore, in 95% of all samples the population proportion p will be within 1.96 SD of the sample proportion p̂.

When the distribution of p̂ is normal, the 95% confidence interval for a population proportion (p) is:

$$\text{point estimate} \pm \text{margin of error} = \hat{p} \pm 1.96\sqrt{\frac{\hat{p}\hat{q}}{n}}$$

Why do we use p̂ in the standard deviation instead of p?

Note: When we use the sample to estimate the SD, it is called the Standard Error (SE).

$$SD = \sqrt{\frac{pq}{n}} \quad \text{and} \quad SE = \sqrt{\frac{\hat{p}\hat{q}}{n}}$$

How do we know if the distribution of p̂ is normal?

Suppose that in a random sample of 100 La Serna HS students, 43 owned a graphing calculator. Find a 95% confidence interval for the true proportion of LSHS students with a graphing calculator.
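The graphing calculator interval can be computed directly from the formula above. This is a minimal sketch, assuming the conditions hold (random sample, np̂ = 43 and nq̂ = 57 both at least 10, sample under 10% of the school); the function name `proportion_ci` is an illustrative choice, and the critical value comes from Python's standard `statistics.NormalDist` rather than a table.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, confidence=0.95):
    """One-proportion z-interval: p-hat +/- z* x sqrt(p-hat x q-hat / n)."""
    p_hat = successes / n
    q_hat = 1 - p_hat
    # Critical value z* leaves (1 - confidence)/2 in each tail (about 1.96 for 95%).
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = sqrt(p_hat * q_hat / n)
    margin = z_star * standard_error
    return p_hat - margin, p_hat + margin


# 43 of 100 sampled students owned a graphing calculator.
low, high = proportion_ci(43, 100)
print(f"95% CI for p: ({low:.3f}, {high:.3f})")
```

Passing `confidence=0.90` or `confidence=0.99` to the same function shows how the interval narrows or widens with the confidence level.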
HW #1: Read 432-36 Problems 19.17, 18

Chapter 19: Formal Confidence Intervals.2

Suppose that in a random sample of 200 Whittier teenagers, 18 had tattoos. Use a 95% confidence interval to estimate p, the true proportion of Whittier teenagers with a tattoo.

To get full credit on any confidence interval problem, you MUST include the following 4 steps!

Note: If any of the conditions in step 2 are violated, then the interval may not be reliable. In this case, state your concerns and proceed with caution.

Changing confidence levels: Although a 95% confidence interval is the most frequently used, there are other common confidence levels, including 90% and 99%.

The general form of a confidence interval:
point estimate ± margin of error
statistic ± (critical value)(estimated standard deviation of the statistic)

Suppose that in a random sample of 1000 US adults, 483 believe that the US war in Iraq was worth the costs. Use a 90% confidence interval to estimate the true proportion of US residents who believe this.

Since p̂ = .483, can we conclude that the majority of US residents feel the war was not worth the costs?

Repeat step 3 with a 95% CI and a 99% CI.

What factors affect the margin of error (length of the interval)?
1. The confidence level. If we want to be really confident, we can increase the confidence level. However, while this will give the interval a better chance to capture the parameter, our estimate will have less precision.
2. The sample size. If we want more precision, we can increase the sample size. However, this will cost us additional time and money.

HW #2: Read 436-438, Problems 19.3, 7, 11, 21, 23

Chapter 19: Finding Sample Size.3

Suppose we are planning a survey and we want a margin of error of less than 4% with 95% confidence. What sample size do we need?

$$1.96\sqrt{\frac{\hat{p}\hat{q}}{n}} < .04$$

We are trying to solve for n, but we also do not know the value of p̂. What can we do?
- Use a known value of p̂ from a previous survey.
- Do a preliminary pilot survey to estimate p̂.
- Be conservative. What is the biggest sample size we could possibly need?
  o The bigger p̂q̂ is, the bigger n needs to be.
  o For what value of p̂ will p̂q̂ be the largest?

Note: When no other information is given, use the third option (p̂ = .5).

Suppose we wanted to estimate a proportion with 99% confidence with a margin of error of at most 2.5%. What is the largest sample size we might need?

HW #3 Read 441-443 Problems 19.20, 25, 27, 29

Chapter 19: Interpreting the confidence level.4

Suppose that 10% of all students at La Serna have a tattoo and we want to simulate a random sample of size 200 to calculate a 90% confidence interval for the proportion of La Serna students with a tattoo.

Let 0 = has a tattoo and 1-9 = no tattoo. Generate 200 integers from 0-9 and count the number of 0’s. Use this number to calculate p̂ and the 90% CI for p.

Interpreting the confidence level. In other words, what does it mean to be 90% confident?

In the long run, 90% of all intervals produced with this method will capture the true value.

Or: Before we gather our sample, there is a 90% chance that the sample we will obtain will produce an interval that captures the true value.

It does NOT mean that there is a 90% chance (or probability) that a particular interval captures the true value--that particular interval either captures the value or it doesn’t. Again, it is incorrect to say that “there is a 90% chance that the interval from .05 to .13 captures the true proportion.” Rather, the 90% confidence level refers to the method used to construct the interval rather than any particular interval.

How many students can you identify at La Serna High School?

Suppose that Mr. DuPont randomly selected 50 students from LSHS and you could name 14 of them. Construct a 95% CI for the proportion of students you know at La Serna High School.
Then, use the fact that there are 1800 students at LSHS to estimate the total number of students you can identify at LSHS.

HW #4 Problems: 19.5, 13, 26, 30

Chapter 19: Review.5

HW #5 Problems 19.14, 22, 24, 28

Chapter 23: Confidence Intervals for a Population Mean.6

In addition to using interval estimates for population proportions (p), we can also use confidence intervals to estimate the true value of a population mean (µ).

The formula for a CI for a population mean is based on the sampling distribution of the sample mean ȳ. Recall that:
1. $\mu_{\bar{y}} = \mu$
2. $\sigma_{\bar{y}} = \dfrac{\sigma}{\sqrt{n}}$
3. ȳ is approximately normally distributed when n > 30 or when the population is normal.

Thus far, we have been using z-critical values from the normal distribution to make our confidence intervals. Now, remember that a z-score is calculated as

$$z = \frac{\text{observed value} - \text{expected value}}{\text{standard deviation}} = \frac{\bar{y} - \mu}{\sigma/\sqrt{n}}$$

As long as ȳ is normally distributed, the distribution of z will also be normal. This is because z is a linear transformation of ȳ. In other words, since µ, σ, and n are all constants, the shape shouldn’t change, only the center and spread.

However, if we don’t know σ (the population standard deviation) and we use s (the sample standard deviation) to estimate σ, the distribution of $\dfrac{\bar{y} - \mu}{s/\sqrt{n}}$ isn’t quite normal, even if ȳ is approximately normal. So, we give this quantity a new name:

$$t = \frac{\bar{y} - \mu}{s/\sqrt{n}}$$

Since t is based on 2 variables (s and ȳ) instead of just ȳ (like z), the t-distributions have more variability than the z (normal) distribution. However, as the sample size increases, s gets closer to σ and the t-distributions get closer to the standard normal distribution (z).

The t-curves are very similar to normal curves, except that they are wider and defined by a number called the __________________________. For this chapter, df = n – 1.

Properties of the t-distributions:
1. The t-curve corresponding to any fixed number of degrees of freedom (df) is bell shaped, symmetric and centered at 0.
2. Each t-curve is more spread out than the z-curve (standard normal curve).
3. As the df increase, the spread of the corresponding t-curve decreases.
4. As the number of df increases, the t-curves get closer and closer to the z-curve.

Since the t-curves are wider than the z-curve, we must go out further than 1.96 SD to capture 95% of the area. To find out how far, we use a t-critical value from the t-table.

If df = 20 and you want 95% confidence, what t-critical value should you use?

Thus, to capture the middle 95% of the t-distribution with 20 df, you must go out 2.086 standard deviations.

Find the t-critical values for the following:
a. 95% confidence with n = 10
b. 90% confidence with n = 25
c. 99% confidence with n = 100

Note: When the df you want is not provided in the t-table, round down to the nearest df given.

Suppose that a machine is designed to produce bolts that have a diameter of 5 mm. Every hour a random sample of 15 bolts is selected and a 95% confidence interval for the mean diameter is constructed. If there is evidence that µ ≠ 5, the machine is adjusted. In one particular sample, the mean diameter was 5.08 mm with a standard deviation of 0.11 mm. Calculate the interval and decide if you need to adjust the machine.

HW #6: Read 527-530, problems 23.3, 5, 7, 13 (4 steps only), 15

Chapter 23: More confidence intervals.7

Checking for Normality: When the sample size is small and the data from the sample are provided, you must graph the sample to see if it is reasonable to assume it came from a normally distributed population.
Conclusions:
- Slight to moderate skewness is OK, since samples from normal populations will not always look normal: “Since there are no outliers or strong skewness in the sample, it is reasonable to assume that the population is approximately normal.”
- Strong skewness or outliers make this condition questionable: “Since there is an outlier or strong skewness in the sample, it is questionable to assume that the population is approximately normal. I will proceed with caution.”

Suppose that a random sample of students at a particular SAT school was selected and their improvements in SAT score were calculated. Construct a 99% confidence interval to estimate the true mean improvement of students at this school. Does this interval give evidence that students in this school are improving their scores, on average?

Improvements: 50, 110, 20, 140, 80, 70, 70, 40

HW #7 Problems: 23.4, 6, 8, 28 (4 steps + e), 27 (4 steps)

Chapter 23: Finding Sample Size.8

Suppose that we wanted to estimate the average height of an LSHS student to within 0.5 inches of the true value with 99% confidence. If the standard deviation is known to be 2.3 inches, how big of a sample do we need?

Since we know the population standard deviation, we can use z to calculate the sample size. Unfortunately, we rarely (if ever) know the true standard deviation σ but not the true mean µ. So, these problems are somewhat unrealistic.

In most cases, to find an appropriate sample size we need to first estimate the standard deviation (s). There are a few ways to do this.
- Use a previously known value of s.
- Do a preliminary study (pilot study) and use the sample standard deviation s.

Also, we use z-critical values instead of t-critical values since we cannot calculate df without knowing n (which is what we are looking for). However, if the sample size is large, using z will be close enough. But, for small sample sizes, using z will underestimate the sample size we need.
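The height question above amounts to solving z*·σ/√n ≤ margin of error for n and rounding up. This is a minimal sketch, assuming σ = 2.3 inches is treated as known so a z-critical value applies; the function name `sample_size_for_mean` is an illustrative choice, and the critical value comes from Python's standard `statistics.NormalDist` instead of a table.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_mean(sigma, margin, confidence):
    """Smallest n with z* x sigma / sqrt(n) <= margin, i.e. n >= (z* x sigma / margin)^2."""
    # Critical value z* leaves (1 - confidence)/2 in each tail (about 2.576 for 99%).
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Always round up: rounding down would give a margin slightly too large.
    return ceil((z_star * sigma / margin) ** 2)


# Heights: sigma = 2.3 inches, margin of error 0.5 inches, 99% confidence.
n_needed = sample_size_for_mean(sigma=2.3, margin=0.5, confidence=0.99)
print("Sample size needed:", n_needed)
```

The same function can be reused with a pilot-study s in place of σ, keeping in mind the caution above that z will underestimate n when the resulting sample size is small.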
Using the example from yesterday, how many additional students do we need to select to reduce the margin of error to 20 points?

HW #8: Read 535-538, Problems: 23.11 (4 steps + de), 13, 34 (4 steps + e)

HW #9 Review: Chapter 19 & 23

Test