AMS 5
ESTIMATING PERCENTAGES

Models for Percentages

We consider problems where the population is split into two groups. This can be represented with binary boxes that contain only tickets with values 0 and 1. We consider the following specific topics:

• Mean and SE for percentages
• Approximations using the normal distribution
• Obtaining samples from a binary population
• Estimating the mean and the SD from a sample
• Confidence intervals for percentages

Estimating Percentages

Consider a population that is split into two groups by a specific characteristic. For example, faulty versus properly working parts in a population of computer chips, or rainy versus dry days during a specific year.

Suppose we take a simple random sample of the population. Then the expected value for the sample percentage equals the population percentage. That is, if the percentage of men in the population is 46%, then the expected value for the percentage of men in the sample is 46%. But in a given sample we won't necessarily observe 46% men. This is because of chance error. How do we estimate the chance error?

Mean and SD of a binary box

Suppose a box has only tickets with either 0 or 1. Then the mean of the box is given by:

mean = (number of 1s) / (total number of tickets) = fraction of 1s

The SD of the box is given by:

SD = √((fraction of 1s) × (fraction of 0s))

Consider the problem of taking a sample from a population where the number of men is 3,091 and the number of women is 3,581. Then we can think of a box model with ones corresponding to men and zeroes corresponding to women. Suppose a sample of size 100 is taken from the box. The fraction of ones in the box is 3,091/6,672 ≈ 46%, so the SD of the box is equal to √(0.46 × 0.54) ≈ 0.5. So the SE for the sum of 100 draws is √100 × 0.5 = 5.
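The box calculation above can be reproduced numerically. The following is a minimal sketch in Python, assuming draws made with replacement; the function names `box_sd`, `se_of_sum`, and `se_of_percentage` are illustrative choices and do not come from the course material.

```python
import math


def box_sd(fraction_of_ones: float) -> float:
    """SD of a 0-1 box: sqrt(fraction of 1s * fraction of 0s)."""
    return math.sqrt(fraction_of_ones * (1.0 - fraction_of_ones))


def se_of_sum(n_draws: int, fraction_of_ones: float) -> float:
    """SE for the number of 1s in n_draws draws made with replacement."""
    return math.sqrt(n_draws) * box_sd(fraction_of_ones)


def se_of_percentage(n_draws: int, fraction_of_ones: float) -> float:
    """SE for the percentage of 1s: SE of the sum divided by n_draws, times 100."""
    return se_of_sum(n_draws, fraction_of_ones) / n_draws * 100.0


# Population from the example: 3,091 men (ones) and 3,581 women (zeroes).
men, women = 3091, 3581
fraction_men = men / (men + women)

sd = box_sd(fraction_men)
se_sum = se_of_sum(100, fraction_men)
se_pct = se_of_percentage(100, fraction_men)

print(f"fraction of ones: {fraction_men:.3f}")
print(f"SD of the box:    {sd:.3f}")
print(f"SE of the sum:    {se_sum:.2f}")
print(f"SE of the pct:    {se_pct:.2f}%")
```

The unrounded values are slightly below the hand-rounded 0.5 and 5, because the hand calculation rounds the SD of the box up to 0.5.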
Notice that we are supposing that the tickets are drawn with replacement. This is unlikely to be true for a human population, but given that the ratio of the sample size to the size of the population is very small, drawing with replacement is a good approximation.

An SE of 5 for the sum implies that the number of men in the sample is around 46, give or take 5, out of 100. This means that the percentage of men in the sample will be around 46%, give or take 5%.

Suppose the sample size is now 400. Then

SE for number = √400 × 0.5 = 10
SE for percentage = 10 / 400 = 2.5%

So we multiplied the sample size by 4 and got an SE for the percentage that is half of the original one. Notice that, as the sample size goes up:

• The SE of the sum increases in proportion to the square root of the sample size.
• The SE of the percentage decreases in inverse proportion to the square root of the sample size.

Approximations

We can use the normal curve to obtain approximations to probabilities involving percentages.

Q: A phone company has 100,000 subscribers. According to the census data, 20% of the subscribers earn over $50,000 a year. What are the expected value and SE of the percentage of subscribers with incomes over $50,000 in a sample of size 400?

A: A box model for this problem has 100,000 tickets, where the ones correspond to the subscribers with incomes above $50,000 and the zeroes to those with incomes of $50,000 or less. The expected value of the number of customers with incomes above $50,000 is 400 × 0.2 = 80, since there are 20,000 ones out of 100,000 tickets.

The SD of the box is given by √(0.2 × 0.8) = 0.4, so that the SE of the sum is √400 × 0.4 = 8. Converting to percentages we get 80/400 = 20% and 8/400 = 2% for the expected value and the SE respectively.

Q: What are the chances that between 18% and 22% of the persons in the sample earn more than $50,000 a year?

A: We can convert to standard units and use the normal curve.
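The conversion to standard units in this answer can also be sketched numerically. This is a minimal Python illustration, assuming the normal approximation holds for a sample of 400; the helper `normal_cdf` is a hypothetical name, built on the standard library's `math.erf` in place of a printed normal table.

```python
import math


def normal_cdf(z: float) -> float:
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


fraction_of_ones = 0.20   # subscribers earning over $50,000
sample_size = 400

# Expected value and SE of the sample percentage.
box_sd = math.sqrt(fraction_of_ones * (1.0 - fraction_of_ones))
expected_pct = fraction_of_ones * 100.0
se_pct = math.sqrt(sample_size) * box_sd / sample_size * 100.0

# Convert the endpoints 18% and 22% to standard units.
z_low = (18.0 - expected_pct) / se_pct
z_high = (22.0 - expected_pct) / se_pct

# Chance that the sample percentage falls between the two endpoints.
chance = normal_cdf(z_high) - normal_cdf(z_low)

print(f"expected value: {expected_pct:.0f}%, SE: {se_pct:.1f}%")
print(f"standard units: {z_low:.2f} to {z_high:.2f}")
print(f"chance: {chance:.1%}")
```

The computed chance should agree with the value read from a normal table for the same interval in standard units.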
(18 − 20)/2 = −1 and (22 − 20)/2 = 1

Thus we are looking at the interval (−1, 1) in standard units. According to the normal table, this has a 68% chance.

Sampling with or without replacement

When taking a sample from a finite population, it is important to bear in mind two issues:

1. The accuracy is not determined by the size of the sample relative to the population.
2. Sampling with or without replacement produces almost the same results when the population is large relative to the sample.

The first statement is somewhat counterintuitive. Suppose we are polling New Mexico and Texas to estimate the voting intentions in a presidential election. NM has about 1.2 million voters and TX has about 12.5 million. Intuitively we think that, to achieve the same accuracy, we should need a larger sample in Texas than in New Mexico. This is not true! What counts is the absolute size of the sample.

Think of the chemical composition of a liquid. If the liquid is well mixed, then a drop should reflect the composition regardless of whether it is taken from a small test tube or from a large jug. The formula for the SE of a percentage does not contain any information about the population size.

Statistical inference for percentages

A political candidate hires a polling organization to get an estimate of his chances of winning a primary. There are 100,000 voters and a sample of 2,500 is taken. 1,328 of the voters in the sample favor the candidate. So the estimated percentage is (1,328/2,500) × 100% ≈ 53%. The candidate is quite happy with the result, but the pollsters warn him that this is a sample and that chance error has to be accounted for.

To estimate the size of the chance error we need the SE of the percentage of voters. To calculate the SE we consider a box model with ones for the voters favoring the candidate and zeroes for those not favoring him.
The SD of the box is given by

√((fraction of 1s) × (fraction of 0s))

The problem is that the fraction of 1s is exactly the quantity that we wanted to estimate! A solution can be worked out by plugging in the value of the fraction of 1s estimated from the poll. Since 1,328 of the 2,500 sampled voters favor the candidate and 1,172 do not,

SD ≈ √((1328/2500) × (1172/2500)) ≈ 0.5

From this we can get an estimate of the SE of the number of voters for the given candidate, which is √2500 × 0.5 = 25, and the SE of the percentage of voters will be given by 25/2500 × 100% = 1%. So the candidate can estimate that 53% of the voters, give or take 1%, intend to vote for him.

Confidence Intervals

SEs are useful to obtain intervals for the possible values of percentages. Suppose we have a sample of students from a certain university and use it to estimate that 79% of the students live at home, with an SE of 2%. We can create intervals around 79% using multiples of the SE, and attach to each interval a level of confidence that it contains the true percentage of students living at home. Using the normal curve we can build intervals for different levels of confidence. For the previous example, if we consider 2 SEs then we have the interval (75%, 83%), which is a 95% confidence interval for the percentage of students living at home.

In general we have that:

• Sample percentage ± 1 SE is a 68% confidence interval for the percentage.
• Sample percentage ± 2 SE is a 95% confidence interval for the percentage.
• Sample percentage ± 3 SE is a 99.7% confidence interval for the percentage.

Notice that the larger the confidence, the wider the interval. A confidence level of 100% can never be achieved, since the normal curve has positive mass over the whole range of the real numbers.

Interpretation

A frequentist interpretation of statistical inference assumes that the parameters are fixed but unknown. The chances are in the sampling procedure. A confidence interval does not express a chance statement about the parameter.
The interval gives a range for the values of the parameter, together with a confidence level. By a confidence level we mean that, if the sampling procedure is repeated many times, the intervals obtained will contain the parameter in about that percentage of the repetitions.

Problems

Problem 1: 500 draws are made at random from a box with 60,000 0s and 20,000 1s. True or false, and explain:

1. The expected value for the percentage of 1s among the draws is 25%.
This is true. The fraction of 1s is 20,000/80,000 = 25%.

2. The expected value for the percentage of 1s among the draws is 25%, give or take 2%.
This is false. The expected value involves no chance error.

3. The percentage of 1s among the draws will be 25%, give or take 2% or so.
This is true. The SD of the box is √(0.25 × 0.75) ≈ 0.43. The SE of the sum is √500 × √(0.25 × 0.75) ≈ 9.68, and the SE of the percentage of 1s is 9.68/500 × 100% ≈ 2%.

4. The percentage of 1s in the box is around 25%, give or take 2% or so.
This is false. The box has exactly 25% 1s.

Problem 2: A simple random sample of 3,500 people age 18 or over is taken in a large town to estimate the percentage of people age 18 and over in that town who read newspapers. It turns out that 2,487 people in the sample are newspaper readers.

1. Give an estimate of the population percentage.
(2487/3500) × 100% ≈ 71%

2. Give an estimate of the SE.
SE for the number: √3500 × √(0.71 × 0.29) ≈ 27
SE for the percentage: 27/3500 × 100% ≈ 0.8%

3. Give a 95% confidence interval for the percentage of newspaper readers.
71% ± 2 × 0.8% = (69.4%, 72.6%)
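The three steps of Problem 2 can be checked with a short Python sketch. It assumes the plug-in approach used in these notes (the sample fraction stands in for the unknown fraction of 1s) and the normal approximation; the function name `percentage_confidence_interval` and its parameter `n_se` are hypothetical names chosen for illustration.

```python
import math


def percentage_confidence_interval(successes: int, sample_size: int, n_se: float = 2.0):
    """Return (estimate, SE, (low, high)), all in percent.

    The SD of the box is estimated by plugging in the sample fraction.
    n_se is the number of SEs on each side: 1, 2, or 3 for roughly
    68%, 95%, or 99.7% confidence under the normal approximation.
    """
    fraction = successes / sample_size
    estimated_box_sd = math.sqrt(fraction * (1.0 - fraction))
    # SE of the percentage: (sqrt(n) * SD / n) * 100, which equals SD / sqrt(n) * 100.
    se_pct = estimated_box_sd / math.sqrt(sample_size) * 100.0
    estimate_pct = fraction * 100.0
    interval = (estimate_pct - n_se * se_pct, estimate_pct + n_se * se_pct)
    return estimate_pct, se_pct, interval


# Problem 2: 2,487 newspaper readers in a simple random sample of 3,500.
estimate, se_pct, (low, high) = percentage_confidence_interval(2487, 3500)

print(f"estimate: {estimate:.1f}%")
print(f"SE:       {se_pct:.2f}%")
print(f"95% CI:   ({low:.1f}%, {high:.1f}%)")
```

Because the code does not round intermediate values, its interval endpoints can differ in the first decimal place from the hand calculation, which rounds the SE to 0.8% before doubling it.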