ESTIMATING PERCENTAGES

AMS 5
Models for Percentages
We consider problems where the population is split into two
groups. This can be represented with binary boxes that have only
tickets with values 0 and 1. We consider the following specific
topics:
• Mean and SE for percentages
• Approximations using the normal distribution
• Obtaining samples from a binary population
• Estimating the mean and the SD from a sample
• Confidence intervals for percentages
Estimating Percentages
Consider a population that is split into two groups by a specific
characteristic: for example, faulty versus properly working parts
in a population of computer chips, or rainy versus dry days during
a specific year. Suppose we take a simple random sample of the
population. Then the expected value for the sample percentage
equals the population percentage. That is, if the percentage of
men in the population is 46%, then the expected value for the
percentage of men in the sample is 46%. But in a given sample
we won't necessarily observe exactly 46% men. This is because of
chance error. How do we estimate the chance error?
Mean and SD of a binary box
Suppose a box has only tickets with either 0 or 1. Then the mean
of the box is given by:
(number of 1s) / (total number of tickets) = fraction of 1s
The SD of the box is given by:
√((fraction of 1s) × (fraction of 0s))
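The two formulas for a binary box can be sketched in Python as follows. This is only an illustration: the function name `box_mean_sd` and the box with 20 ones and 80 zeroes are assumptions made for the example, not part of the slides.

```python
import math

def box_mean_sd(ones, zeros):
    """Return (mean, SD) of a box holding `ones` tickets marked 1
    and `zeros` tickets marked 0."""
    total = ones + zeros
    frac_ones = ones / total
    frac_zeros = zeros / total
    mean = frac_ones                         # mean = fraction of 1s
    sd = math.sqrt(frac_ones * frac_zeros)   # SD = sqrt(frac 1s x frac 0s)
    return mean, sd

# Illustrative box (hypothetical counts): 20 ones and 80 zeroes.
mean, sd = box_mean_sd(ones=20, zeros=80)
print(f"mean = {mean:.2f}, SD = {sd:.2f}")
```

Note that the SD is largest, 0.5, when the box is split evenly, and shrinks toward 0 as the box becomes all 0s or all 1s.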
Consider the problem of taking a sample from a population
where the number of men is 3,091 and the number of women is
3,581. Then we can think of a box model with 3,091 ones
corresponding to men and 3,581 zeroes corresponding to women.
Suppose a sample of size 100 is taken from the box.
Since the fraction of ones in the box is 46%, the SD of the box is
equal to √(0.46 × 0.54) ≈ 0.5. So the SE for the sum of 100 draws
is √100 × 0.5 = 5.
Notice that we are supposing that the tickets are drawn with
replacement. This is unlikely to be true for a human
population, but given that the ratio of the sample size to the size
of the population is very small, drawing with replacement is a good
approximation. The SE above implies that the number of men in the
sample is around 46, give or take 5, out of 100. This means that
the percentage of men in the sample will be around 46%, give or
take 5%.
Suppose the sample size is now 400. Then
SE for number = √400 × 0.5 = 10
SE for percentage = 10 / 400 × 100% = 2.5%
So we multiplied the sample size by 4 and got an SE for the
percentage that is half of the original one.
Notice that, as the sample size goes up:
• The SE of the sum increases like the square root of the sample
size.
• The SE of the percentage decreases like one over the square
root of the sample size.
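The two scaling rules can be checked numerically. The sketch below assumes the rounded box SD of 0.5 from the men/women example; the helper names `se_sum` and `se_percentage` and the extra sample size of 1,600 are added here for illustration only.

```python
import math

def se_sum(n, box_sd):
    """SE for the sum of n draws: sqrt(n) x SD of the box."""
    return math.sqrt(n) * box_sd

def se_percentage(n, box_sd):
    """SE for the percentage: (SE for the sum / n) x 100%."""
    return se_sum(n, box_sd) / n * 100

box_sd = 0.5  # rounded SD of the box in the men/women example

# Each step multiplies the sample size by 4.
for n in (100, 400, 1600):
    print(f"n = {n:5d}  SE sum = {se_sum(n, box_sd):5.1f}  "
          f"SE percentage = {se_percentage(n, box_sd):.2f}%")
```

Each fourfold increase in the sample size doubles the SE of the sum and halves the SE of the percentage.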
Approximations
We can use the normal curve to obtain approximations to
probabilities involving percentages.
Q: A phone company has 100,000 subscribers. According to the
census data, 20% of the subscribers earn over $50,000 a year.
What are the expected value and SE of the percentage of
subscribers with incomes over $50,000 in a sample of size 400?
A: A box model for this problem has 100,000 tickets: 20,000 ones
and 80,000 zeroes, where the ones correspond to the subscribers
with incomes above $50,000 and the zeroes to those with incomes
of $50,000 or less. The expected value of the number of customers
with incomes above $50,000 is 400 × 0.2 = 80 since there are 20,000
ones out of 100,000 tickets.
The SD of the box is given by √(0.2 × 0.8) = 0.4, so that the SE of
the sum is √400 × 0.4 = 8. Converting to percentages we get
80/400 = 20% and 8/400 = 2% for the expected value and the SE,
respectively.
Q: What are the chances that between 18% and 22% of the
persons in the sample earn more than $50,000 a year?
A: We can convert to standard units and use the normal curve.
(18 − 20) / 2 = −1 and (22 − 20) / 2 = 1
Thus we are looking at the interval (-1,1) in standard units.
According to the table this has a 68% chance.
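The same calculation can be done without a table. The sketch below is one possible way to do it, using the error function from Python's standard library to get the area under the normal curve; the helper names `normal_cdf` and `chance_between` are introduced here for illustration.

```python
import math

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def chance_between(low_pct, high_pct, expected_pct, se_pct):
    """Normal approximation to the chance that the sample percentage
    falls between low_pct and high_pct."""
    z_low = (low_pct - expected_pct) / se_pct    # convert to standard units
    z_high = (high_pct - expected_pct) / se_pct
    return normal_cdf(z_high) - normal_cdf(z_low)

# Phone company example: 20% ones in the box, sample of size 400.
n = 400
frac_ones = 0.20
box_sd = math.sqrt(frac_ones * (1 - frac_ones))   # 0.4
expected_pct = frac_ones * 100                    # 20%
se_pct = math.sqrt(n) * box_sd / n * 100          # 2%

chance = chance_between(18, 22, expected_pct, se_pct)
print(f"chance of 18% to 22%: {chance:.1%}")
```

The result agrees with the table value of about 68% for the interval (−1, 1) in standard units.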
Sampling with or without replacement
When taking a sample from a finite population it is important to
bear in mind two issues:
1. The accuracy is not determined by the size of the sample
relative to the population.
2. Sampling with or without replacement produces almost the
same results when the population size is large relative to the
sample.
The first statement is somewhat counterintuitive. Suppose we are
polling New Mexico and Texas to estimate the voting intentions in
a presidential election. NM has about 1.2 million voters and TX has
about 12.5 million. Intuitively we think that, to achieve the same accuracy,
we should need a larger sample in Texas than in New Mexico. This is
not true! What counts is the absolute size of the sample. Think of
the chemical composition of a liquid. If the liquid is well mixed, then
a drop should reflect the composition regardless of whether it is
taken from a small test tube or from a large jug. The formula for the
SE of a percentage does not involve the population size.
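To see how little the population size matters, one can look at the correction factor for sampling without replacement. This factor is not given in the slides; it is a standard result, stated here as background: the SE without replacement equals the SE with replacement multiplied by √((N − n) / (N − 1)), where N is the population size and n the sample size. The sample size of 2,500 below is a hypothetical choice for illustration.

```python
import math

def correction_factor(population_size, sample_size):
    """Factor that converts the SE for draws with replacement into
    the SE for draws without replacement (standard result, not
    taken from the slides)."""
    return math.sqrt((population_size - sample_size) / (population_size - 1))

sample_size = 2500  # hypothetical poll size, same in both states

for state, voters in (("NM", 1_200_000), ("TX", 12_500_000)):
    factor = correction_factor(voters, sample_size)
    print(f"{state}: correction factor = {factor:.4f}")
```

Both factors are very close to 1, so the SE of the percentage is essentially the same in the two states for the same sample size.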
Statistical inference for percentages
A political candidate hires a polling organization to get an estimate
of his chances of winning a primary. There are 100,000 voters
and a sample of 2,500 is taken. 1,328 of the voters in the sample
favor the candidate. So the estimated percentage is
(1,328/2,500) × 100% ≈ 53%.
The candidate is quite happy with the result, but the pollsters
warn him that this is a sample and that chance error has to be
accounted for. To estimate the size of the chance error we need
the SE of the percentage of voters. To calculate the SE we
consider a box model with ones for the voters favoring the
candidate and zeroes for those not favoring him. The SD of the
box is given by √((fraction of 1s) × (fraction of 0s)).
The problem is that the fraction of 1s is exactly the quantity that
we wanted to estimate! A solution can be worked out by plugging
in the value of the fraction of 1s estimated from the poll. Thus
SD = √(1328/2500 × 1172/2500) ≈ 0.5.
From this we can get an estimate of the SE of the number of
voters for the given candidate, which is √2500 × 0.5 = 25,
and the SE of the percentage of voters will be given by
25/2500 × 100% = 1%. So the candidate can estimate that 53%
of the voters, give or take 1%, have the intention of voting for
him.
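The plug-in calculation for the poll can be sketched as below. The function name `estimate_percentage_and_se` is an assumption made for this example; the code works with unrounded values, so the SE comes out slightly below the rounded 1% used above.

```python
import math

def estimate_percentage_and_se(ones_in_sample, n):
    """Estimate the population percentage and its SE from a sample,
    plugging the sample fraction of 1s into the SD formula."""
    frac_ones = ones_in_sample / n
    sd = math.sqrt(frac_ones * (1 - frac_ones))  # estimated SD of the box
    se_number = math.sqrt(n) * sd                # SE for the number of 1s
    se_pct = se_number / n * 100                 # SE for the percentage
    return frac_ones * 100, se_pct

# Poll from the slides: 1,328 of 2,500 sampled voters favor the candidate.
pct, se_pct = estimate_percentage_and_se(1328, 2500)
print(f"estimate = {pct:.1f}%, give or take {se_pct:.1f}% or so")
```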
Confidence Intervals
SEs are useful to obtain intervals for the possible values of
percentages. Suppose we have a sample of students from a
certain university and use it to estimate that 79% of the students
live at home, with an SE of 2%. We can create intervals around
79% using multiples of the SE, and attach to each interval a
level of confidence that it contains the true percentage of
students that live at home. Using the normal curve we can build
intervals for different levels of confidence. For the previous
example, if we consider 2 SEs then we have the interval
(75%, 83%), which is an approximate 95% confidence interval for the
percentage of students living at home.
In general we have that:
• Sample percentage ± 1 SE is a 68% confidence interval of the
percentage.
• Sample percentage ± 2 SE is a 95% confidence interval of the
percentage.
• Sample percentage ± 3 SE is a 99.7% confidence interval of
the percentage.
Notice that the larger the confidence level, the wider the interval.
A confidence level of 100% can never be achieved since the normal
curve has positive mass over the whole range of the real numbers.
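The three rules can be applied to the student example (79% with an SE of 2%) with a short sketch. The function name `confidence_interval` is an assumption made for this illustration, and the confidence levels are the approximate normal-curve values quoted above.

```python
def confidence_interval(sample_pct, se_pct, num_se):
    """Interval: sample percentage plus or minus num_se SEs."""
    margin = num_se * se_pct
    return sample_pct - margin, sample_pct + margin

# Student example from the slides: 79% live at home, SE of 2%.
for num_se, level in ((1, "68%"), (2, "95%"), (3, "99.7%")):
    low, high = confidence_interval(79, 2, num_se)
    print(f"{level} confidence interval: ({low}%, {high}%)")
```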
Interpretation
A frequentist interpretation of statistical inference assumes that
the parameters are fixed but unknown. Chances are in the
sampling procedure.
A confidence interval does not express a chance about the
parameter. The interval gives a range for the value of the
parameter, together with a confidence level for that range.
By a confidence level we mean that the interval will cover the
parameter in a given percent of the times the sampling procedure
is repeated.
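The repeated-sampling interpretation can be illustrated with a small simulation. This is a sketch under stated assumptions: the true fraction of 0.2, the sample size of 400, the 1,000 repetitions, and the fixed seed are all choices made for the example, and draws are made with replacement.

```python
import math
import random

def coverage_rate(true_frac, n, repetitions, seed=0):
    """Fraction of repeated samples whose interval
    (sample fraction +/- 2 SE) covers the true fraction."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    covered = 0
    for _ in range(repetitions):
        # Draw n tickets with replacement from a 0-1 box.
        ones = sum(1 for _ in range(n) if rng.random() < true_frac)
        frac = ones / n
        sd = math.sqrt(frac * (1 - frac))   # plug-in SD of the box
        se = math.sqrt(n) * sd / n          # SE of the sample fraction
        if frac - 2 * se <= true_frac <= frac + 2 * se:
            covered += 1
    return covered / repetitions

rate = coverage_rate(true_frac=0.2, n=400, repetitions=1000)
print(f"intervals covering the true fraction: {rate:.1%}")
```

The parameter stays fixed at 0.2 throughout; what changes from one repetition to the next is the interval. The proportion of intervals that cover the parameter should come out near 95%.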
Problems
Problem 1: 500 draws are made at random from a box with
60,000 0s and 20,000 1s. True or false and explain:
1. The expected value for the percentage of 1s among the draws
is 25%.
This is true: the fraction of 1s is 20,000/80,000 = 25%.
2. The expected value for the percentage of 1s among the draws
is 25%, give or take 2%.
This is false. The expected value involves no chance error.
3. The percentage of 1s among the draws will be around 25%, give
or take 2% or so.
This is true. The SD of the box is about 0.43. The SE of the sum
is 9.68 and the SE of the percentage of 1s is about 2%.
4. The percentage of 1s in the box is around 25% give or take
2% or so.
This is false: the box has exactly 25% 1s.
Problem 2: A simple random sample of 3,500 people age 18 or
over is taken in a large town to estimate the percentage of
people age 18 and over in that town who read newspapers.
It turns out that 2,487 people in the sample are newspaper
readers.
1. Give an estimate of the population percentage.
(2487 / 3500) × 100% ≈ 71%
2. Give an estimate of the SE.
√3500 × √(0.71 × 0.29) ≈ 27, 27 / 3500 × 100% ≈ 0.8%
3. Give a 95% confidence interval for the percentage of
newspaper readers. 71% ± 2 × 0.8% = (69.4%, 72.6%)
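The three steps of Problem 2 can be combined in one short sketch. The function name `percentage_interval` is an assumption made for this example. Because the code does not round intermediate values, its interval differs slightly from the hand calculation, which rounds the estimate to 71% and the SE to 0.8%.

```python
import math

def percentage_interval(ones_in_sample, n, num_se=2):
    """Return (estimate, SE, low, high), all in percent, using the
    plug-in SD and an interval of estimate +/- num_se SEs."""
    frac = ones_in_sample / n
    sd = math.sqrt(frac * (1 - frac))     # estimated SD of the box
    se_pct = math.sqrt(n) * sd / n * 100  # SE of the percentage
    pct = frac * 100
    return pct, se_pct, pct - num_se * se_pct, pct + num_se * se_pct

# Problem 2: 2,487 newspaper readers in a sample of 3,500.
pct, se_pct, low, high = percentage_interval(2487, 3500)
print(f"estimate = {pct:.1f}%, SE = {se_pct:.2f}%")
print(f"95% confidence interval: ({low:.1f}%, {high:.1f}%)")
```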