Models for Percentages
In this class we consider problems where the population is split in two groups.
This can be represented with binary boxes that have only tickets with values 0
and 1.
We consider the following specific topics:
- Mean and SE for percentages
- Confidence intervals for percentages
Estimating Percentages
Consider a population that is split in two groups by a specific characteristic. For
example, faulty versus properly working parts in a population of computer chips,
or rainy versus dry days during a specific year.
Suppose we take a simple random sample of the population. Then the expected
value for the sample percentage equals the population percentage.
That is, if the percentage of men in the population is 46%, then the expected
value for the percentage of men in the sample is 46%.
But in a given sample we won’t necessarily observe 46% of men. This is
because of chance error. How do we estimate the error?
Mean and SD of a binary box
Suppose a box has only tickets with either 0 or 1. Then the mean of the box is
given by

    mean = (number of 1s) / (total number of tickets) = fraction of 1s

The SD of the box is given by

    SD = √[ ((number of 0s)(0 − fraction of 1s)² + (number of 1s)(1 − fraction of 1s)²) / (total number of tickets) ]
       = √[ (fraction of 0s)(fraction of 1s)² + (fraction of 1s)(fraction of 0s)² ]
       = √[ (fraction of 0s)(fraction of 1s)((fraction of 1s) + (fraction of 0s)) ]
       = √[ (fraction of 0s)(fraction of 1s) ]
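As a quick numeric check (ours, not part of the notes), the shortcut on the last line agrees with the full root-mean-square formula; here it is in Python for the box used in the example below.

```python
import math

# Box from the example below: 3,091 tickets marked 1, 3,581 marked 0.
n_ones, n_zeros = 3_091, 3_581
total = n_ones + n_zeros
f1 = n_ones / total   # fraction of 1s (the mean of the box)
f0 = n_zeros / total  # fraction of 0s

# Full formula: root-mean-square deviation of the tickets from the mean.
full = math.sqrt((n_zeros * (0 - f1) ** 2 + n_ones * (1 - f1) ** 2) / total)
# Shortcut for a 0-1 box.
shortcut = math.sqrt(f0 * f1)

print(full, shortcut)  # both ≈ 0.4987 (the text rounds 46% and gets ≈ 0.5)
```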
Consider the problem of taking a sample from a population where the number
of men is 3,091 and the number of women is 3,581. Then we can think of a box
model with 3,091 tickets marked 1 (the men) and 3,581 tickets marked 0 (the
women). Suppose a sample of size 100 is taken from the box.
Since the fraction of ones in the box is 46%, the SD of the box is equal to

    √(0.46 × 0.54) = 0.4983974 ≈ 0.5

So the SE for the sum of 100 draws is

    √100 × 0.5 = 5
Notice that we are supposing that the tickets are drawn with replacement; this
is unlikely to be true for a human population, but given that the ratio of
the sample size to the size of the population is very small, drawing with
replacement is a good approximation.
This implies that the number of men in the sample will be around 46, give or
take 5, out of 100. This means that the percentage of men in the sample will
be around 46%, give or take 5%.
The SE for a percentage is the SE of the sum converted to percent, relative to
the sample size:

    SE for percentage = (SE for number / size of the sample) × 100%
Suppose the sample size is now 400. Then

    SE for number = √400 × 0.5 = 10

    SE for percentage = (10 / 400) × 100% = 2.5%

So we multiplied the sample size by 4 and got an SE for the percentage that is
half of the original one.
Multiplying the sample size by a factor, say k, divides the SE for the
percentage by the square root of that factor, √k.
Notice that, as the sample size goes up:
- The SE of the sum increases as the square root of the sample size.
- The SE of the percentage decreases as the square root of the sample size.
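As a small illustration (the function name is ours, not from the notes), this scaling can be checked numerically:

```python
import math

def se_for_percentage(p, n):
    """SE for the sample percentage when drawing n tickets (with
    replacement) from a 0-1 box whose fraction of 1s is p."""
    sd_box = math.sqrt(p * (1 - p))  # SD of the binary box
    se_sum = math.sqrt(n) * sd_box   # SE for the sum of the draws
    return se_sum / n * 100          # as a percent of the sample size

print(se_for_percentage(0.46, 100))  # ≈ 5.0 (the notes round the SD to 0.5)
print(se_for_percentage(0.46, 400))  # ≈ 2.5: 4x the sample size, half the SE
```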
Confidence intervals
SEs are useful to obtain intervals for the possible values of percentages.
Suppose we have a sample of students from a certain university and use it to
estimate that 79% of the students live at home, with an SE of 2%. We can create
intervals around 79% using multiples of the SE; these intervals carry a certain
confidence that the true percentage of students living at home lies within them.
Using the normal curve we can build intervals for different levels of confidence.
For the previous example, if we consider 2 SEs then we have the interval
(75%, 83%), which is a 95% confidence interval for the percentage of students
living at home.
In general we have that:
- Sample percentage ± 1 SE is a 68% confidence interval for the percentage.
- Sample percentage ± 2 SE is a 95% confidence interval for the percentage.
- Sample percentage ± 3 SE is a 99.7% confidence interval for the percentage.
Notice that the larger the confidence, the wider the interval. 100% confidence
can never be achieved, since the normal curve has positive mass over the whole
range of the real numbers.
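As a small illustration (ours, not from the notes), the three standard intervals for this example:

```python
# Intervals around the sample percentage for the 79% +/- 2% (SE) example.
pct, se = 79.0, 2.0
for k, level in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    print(f"{level}: ({pct - k * se:g}%, {pct + k * se:g}%)")
# The 2-SE line gives (75%, 83%), matching the interval in the text.
```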
Interpretation
A frequentist interpretation of statistical inference assumes that the
parameters are fixed but unknown; the chances are in the sampling procedure.
A confidence interval does not express chance. The interval gives a range for
the values of the parameter and a confidence level that the parameter will be in
that range.
By a confidence level we mean that the parameter will be in the range specified
by the interval a given percent of the times the sampling procedure is repeated.
Problems
Problem 1: 500 draws are made at random from a box with 60,000 0’s and
20,000 1’s. True or false and explain:
1. The expected value for the percentage of 1’s among the draws is 25%.
This is true: the fraction of 1’s in the box is 20,000/80,000 = 25%.
2. The expected value for the percentage of 1’s among the draws is 25%, give
or take 2%.
This is false. The expected value involves no chance error.
3. The percentage of 1’s among the draws will be 25%, give or take 2% or so.
This is true. The SD of the box is about 0.43. The SE of the sum is 9.68
and the SE of the percentage of 1’s is about 2%.
4. The percentage of 1’s in the box is around 25%, give or take 2% or so.
This is false: the percentage of 1’s in the box is exactly 25%.
Problem 2: A simple random sample of 3,500 people age 18 or over is taken in
a large town to estimate the percentage of people age 18 or over in that town
who read newspapers. It turns out that 2,487 people in the sample are
newspaper readers.
1. Give an estimate of the population percentage.
    (2,487 / 3,500) × 100% ≈ 71%
2. Give an estimate of the SE.
    √3,500 × √(0.71 × 0.29) ≈ 27

    (27 / 3,500) × 100% ≈ 0.8%
3. Give a 95% confidence interval for the percentage of newspaper readers.
    71% ± 2 × 0.8% = (69.4%, 72.6%)
Problem 3: An epidemiologist wishes to determine the rate of breast cancer in
women 60 to 65 years old. A random sample of 5,000 women is taken in this
age group and it is found that 50 of them had breast cancer sometime during
their lifetime.
1. Give an estimate of the population percentage.
    (50 / 5,000) × 100% = 1%
2. Give an estimate of the SE.
    √(0.01 × 0.99 / 5,000) ≈ 0.0014 = 0.14%
3. Give a 95% confidence interval for the rate of breast cancer.

    1% ± 2 × 0.14% = (0.7%, 1.3%)
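Both problems can be checked with a short helper (the name percentage_estimate is ours, not from the notes; small differences from the text come from rounding):

```python
import math

def percentage_estimate(count, n):
    """Sample percentage, its SE (in percent), and a 95% confidence
    interval, treating the sample as n draws from a 0-1 box."""
    p = count / n
    se_pct = math.sqrt(p * (1 - p) / n) * 100  # SE for the percentage
    est = p * 100
    return est, se_pct, (est - 2 * se_pct, est + 2 * se_pct)

print(percentage_estimate(2_487, 3_500))  # ≈ 71%, SE ≈ 0.77%, CI ≈ (69.5%, 72.6%)
print(percentage_estimate(50, 5_000))     # ≈ 1%,  SE ≈ 0.14%, CI ≈ (0.72%, 1.28%)
```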
Hypothesis Testing
We consider the problem of deciding between two mutually exclusive hypotheses
on the basis of evidence from observations.
We will consider the following topics:
1. Definition of Null and Alternative hypotheses
2. How to make a test of significance
3. z-test
4. t-test
5. P-values
Test of Significance
- A certain brand of tobacco has an average nicotine content of 2.5
  milligrams, while another one has an average of only 1.5 milligrams. A
  cigarette manufacturer receives an unlabeled shipment of tobacco and
  needs to determine the nicotine content using a sample of the tobacco.
- A certain type of vaccine is known to be only 25% effective over a period
  of 2 years. A new type of vaccine is being tested on 2,000 people chosen at
  random. We want to test if this new vaccine is more effective than the
  original one.
- A machine for filling bottles of soda has to put 333 ml of liquid in each
  bottle. If the average amount is too low or too high with respect to the
  expected content, then the machine is considered to be out of control. The
  machine is regularly inspected to check whether it is out of control, by
  taking a sample of bottles.
- A balance designed for precision weighing in a lab has to be kept very
  well calibrated. To test the calibration of the balance, several measurements
  of the weight of an object are taken. If they differ too much, then the
  balance has to be re-calibrated.
In all these examples a decision has to be made based on the numbers in a
sample. These numbers are subject to uncertainty, and we have to decide whether
the differences that we observe are due only to chance or not.
Null and Alternative Hypotheses
A bill is proposed to simplify the tax code. The proposer claims that the bill is
‘revenue-neutral’, that is, it will not lower tax revenues. A simulation is run
using 100 tax returns chosen at random and the differences between the tax
paid using the old rules and those that would be paid using the new rules are
recorded.
The average difference turns out to be -$219, with a standard deviation of $725.
Can we claim that the new rules are revenue-neutral?
We can put the problem in these terms: there are two hypotheses:
NULL HYPOTHESIS, H0
and
ALTERNATIVE HYPOTHESIS, H1
Under the null hypothesis there is no difference in revenue and the fact that the
observed value is not 0 is totally due to chance. Under the alternative
hypothesis the difference is real.
For the examples that we considered at the beginning we have that:
- Cigarette: H0: the mean nicotine content is 1.5 milligrams. H1: the mean
  nicotine content is 2.5 milligrams.
- Vaccine: H0: the proportion is 25%. H1: the proportion is higher than 25%.
- Soda Bottles: H0: the average amount of liquid is 333 ml. H1: the average
  amount of liquid is not equal to 333 ml.
- Balance: H0: the device is calibrated. H1: the device is not properly
  calibrated.
Test statistics
How do we test the null hypothesis against the alternative?
Back to the tax example. Suppose the null is true. Then the difference should
be $0. How ‘large’ is -$219 with respect to $0? To answer this question we
convert to standard units. Given that the sample was of size 100, the SE for
the average is approximately $72 (SE = SD/√sample size):

    (-$219 − $0) / $72 ≈ −3

so the difference between the value under the null and the observed value is −3
standard units.
The probability of the interval to the left of −3 is about 0.001, that is, one
chance in 1,000. So, under the null hypothesis, -$219 is a very unlikely value.
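A minimal Python sketch of this calculation (standard library only; the helper name is ours):

```python
import math

def normal_left_tail(z):
    """P(Z < z) for a standard normal, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

se_avg = 725 / math.sqrt(100)  # SE of the average = SD / sqrt(sample size)
z = (-219 - 0) / se_avg        # observed minus expected, in standard units
print(z, normal_left_tail(z))  # ≈ -3.02, left tail ≈ 0.0013 (about 1 in 1,000)
```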
In general we are calculating a test statistic given by

    z = (observed − expected) / SE

which is referred to as the z-test.
The calculation of SE depends on what is “observed” in the numerator: a sum,
an average, a number, or a percent.
Once the z-test is calculated we have to decide whether its value is ‘large’ or
‘small’. We observe the probability of the tail of the normal curve beyond the
observed z (the left tail for a negative z, the right tail for a positive one).
If this probability is small, then the value of the z-test is far from the
center of the distribution.
The observed significance level is the chance of getting a
test statistic as extreme as, or more extreme than, the observed one.
This is usually denoted by P and referred to as the P-value.
The smaller the P-value, the stronger the evidence against the null, but
The P-value is NOT the chance of the null hypothesis being
right
Making a test of significance
To make a test of significance you need to:
- set up the null hypothesis;
- pick a test statistic to measure the difference between the data and what
  is expected under the null hypothesis;
- compute the test statistic and the corresponding observed significance
  level.
A small observed significance level implies that the data are far from the values
expected from the model under the null hypothesis.
What is a small observed significance level?
This is somewhat arbitrary, but it is usually considered that if P is less than 5%
the results are significant. If P is less than 1% the results are highly significant.
Examples
1. A random sample of 85 8th graders has a mean score of 265 with an SD of
55 on a national math test. A state administrator claims that the mean
score of 8th graders on the examination is above 260. Is there enough
evidence to support the administrator’s claim?
The hypotheses are:
H0 : mean ≤ 260 vs H1 : mean > 260
The test statistic is obtained by changing to standard units:

    z = (265 − 260) / (55/√85) ≈ 0.838
The probability that a standard normal is above 0.838 is about 20%. This
is a rather large P-value, so there does not seem to be enough evidence to
reject H0.
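The same computation in Python (a sketch, standard library only):

```python
import math

se = 55 / math.sqrt(85)  # SE of the average score
z = (265 - 260) / se     # ≈ 0.838
p_value = 0.5 * (1 - math.erf(z / math.sqrt(2)))  # right tail of the normal
print(z, p_value)        # ≈ 0.838, P ≈ 0.20: not enough to reject H0
```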
2. A light bulb manufacturer guarantees that the mean life of the bulbs is at
least 750 hours. A random sample of 36 light bulbs has a mean of 725
hours and a standard deviation of 60 hours. Is there enough evidence to
reject the manufacturer’s claim?
The hypotheses are
H0 : mean ≥ 750 vs H1 : mean < 750
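The notes stop at the hypotheses here; as a sketch (ours), the remaining computation mirrors the previous example:

```python
import math

se = 60 / math.sqrt(36)  # SE of the average lifetime = 10 hours
z = (725 - 750) / se     # = -2.5
p_value = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # left tail of the normal
print(z, p_value)  # -2.5, P ≈ 0.006: evidence against the manufacturer's claim
```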
Binary boxes
Consider again the problem of testing the new vaccine. This is a binary model
since we can classify the population in two groups: the group of people for
which the vaccine was effective and that for which the vaccine was not.
Under the null hypothesis the box model that generates the sample consists of
a box with one ticket marked 1 and three tickets marked 0, since under the
null there is a 25% chance that the vaccine is effective.
Suppose that the number of people in the sample for which the vaccine was
effective is 534. According to the null, the expected number would be 500. Is
the 34-person difference large enough to reject the null hypothesis and claim
that the new vaccine is more effective?
We need to calculate the difference between 534 and the expected 500 in
standard units.
Under the null the SD of the box is

    SD = √(0.25 × 0.75) = 0.4330127 ≈ 0.43

so the standard error is SE = √2000 × 0.43 = 19.23. Then

    z = (534 − 500) / 19.23 = 1.768071

The observed significance level is given by the area under the normal curve
corresponding to the interval above 1.768071. This is around 4%, which is small
enough to conclude that the difference is statistically significant.
So there is evidence to support the claim that the new vaccine is more effective
than the standard.
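A sketch of the whole test in Python (standard library only; the notes' rounding of the SD to 0.43 gives the slightly different z of 1.77):

```python
import math

n, expected, observed = 2_000, 500, 534
sd_box = math.sqrt(0.25 * 0.75)  # SD of the null box, ≈ 0.433
se_sum = math.sqrt(n) * sd_box   # ≈ 19.4
z = (observed - expected) / se_sum
p_value = 0.5 * (1 - math.erf(z / math.sqrt(2)))  # right tail of the normal
print(z, p_value)                # ≈ 1.76, P ≈ 0.04
```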
The t-test
The examples that we have seen so far rely on the fact that the sample size is
large. So, even when the SD of the box is unknown, we can still use the normal
curve to obtain the observed significance level of the test.
This is not the case when the sample size is small. In this case we need a
modification of the z-test due to ‘Student’, the pen name of the statistician
Gosset.
Consider the following example. To check the calibration of an instrument,
five measurements of the concentration of carbon monoxide (CO) are taken from
a gas sample where the concentration is precisely controlled to be 70 parts
per million (ppm):

    78 83 68 72 88

The null hypothesis is that the device is calibrated, and so the average of the
measurements is 70 ppm. The average of the sample is 77.8 ppm, the SD is
7.22 ppm, and thus the SE of the average is 7.22/√5 ≈ 3.23. The z-test can be
obtained as

    z = (77.8 − 70) / 3.23 ≈ 2.4
To determine the observed significance level we calculate the area to the right
of 2.4 under the normal curve. This is less than 1%, which looks like strong
evidence against the null hypothesis. Unfortunately, we have to remember that
the SD that we have calculated is NOT the SD of the box. It is the SD of the
sample, whose size is fairly small, and thus the approximation is not very
precise. We correct the procedure with the following steps.
Step 1: Consider a different estimate of the SD:

    SD+ = √(number of measurements / (number of measurements − 1)) × SD

Notice that SD+ > SD.
In our previous example we get

    SD+ = √(5/4) × 7.22 ≈ 8.07

so the SE of the average becomes 8.07/√5 ≈ 3.61, as opposed to 3.23,
reflecting a higher level of uncertainty.
Then the test statistic becomes

    t = (77.8 − 70) / 3.61 ≈ 2.2
Step 2: To find the observed significance level we cannot use the normal curve
any more. We need to use a Student’s t curve. This curve depends on the
degrees of freedom (DF), calculated as

    degrees of freedom = number of measurements − 1
A table for the Student’s t curves is found at the end of the book. There is one
curve for each value of the DF. Each row corresponds to one curve. The
probabilities that are reported correspond to the right hand tail, as opposed to
what was reported for the normal curve. These curves are symmetric around 0
and for DF above 25 they resemble the normal curve very closely.
Thus, in our example, we need a Student’s t curve with 4 DF. The value 2.2 is
not present in the table in the row corresponding to 4 DF. The closest value is
2.13, which corresponds to 5%. So the P-value for this test is about 5%, which
is much weaker evidence against the null than before.
Suppose now that 6 measurements are taken with the device:

    72 79 65 84 67 77
The average is equal to 74 ppm and the SD is 6.68 ppm. The corrected SD is
SD+ = √(6/5) × 6.68 ≈ 7.32, so the SE of the average is 7.32/√6 ≈ 2.99. The
t-test is then

    t = (74 − 70) / 2.99 ≈ 1.34
This time the DF are 5, and looking at the table we find that the probability
corresponding to 1.48 is 10%. Since 1.34 is smaller than 1.48, the P-value is
larger than 10%. This is not enough evidence against the null, so the machine
can be considered to be well calibrated.
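Both calibration checks can be reproduced with a short routine (a sketch; the function name is ours, and SciPy is assumed to be available for the t tail areas, where the notes use the printed table):

```python
import math
from scipy import stats  # assumed available; the notes use a printed t table

def t_test_mean(data, mu0):
    """One-sample t-test of H0: mean = mu0, following the steps above."""
    n = len(data)
    avg = sum(data) / n
    sd = math.sqrt(sum((x - avg) ** 2 for x in data) / n)  # SD of the sample
    sd_plus = math.sqrt(n / (n - 1)) * sd                  # corrected SD
    t = (avg - mu0) / (sd_plus / math.sqrt(n))
    return t, stats.t.sf(t, n - 1)  # right-tail P-value with n-1 DF

print(t_test_mean([78, 83, 68, 72, 88], 70))      # t ≈ 2.16, P ≈ 0.048
print(t_test_mean([72, 79, 65, 84, 67, 77], 70))  # t ≈ 1.34, P ≈ 0.12
```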
Examples
1. An environmentalist claims that the mean waste recycled by adults in the
US is more than 1 pound per person per day. You take a sample of 12 adults
and find that the waste recycled per person per day is 1.2 pounds with a
standard deviation of 0.3 pounds. Can you support the environmentalist’s claim?
The hypotheses are:
H0 : mean ≤ 1 vs H1 : mean > 1
The corrected value of the SD is
r
12
× 0.3 = 0.32
11
The test statistic is obtained by changing to standard units:

    t = (1.2 − 1) / (0.313/√12) ≈ 2.21
The probability that a Student’s t with 11 degrees of freedom is above 2.21 is
about 2.5%. This is a rather small P-value, so there seems to be enough
evidence to reject H0.
2. A microwave oven repairer says that the mean repair cost for damaged
microwave ovens is less than $100. You find a random sample of 5 ovens has a
mean repair cost of $75 with an SD of $12.5. Do you have enough evidence to
support the repairer’s claim?
The hypotheses are:
H0 : mean ≥ 100 vs H1 : mean < 100
The corrected value of the SD is
r
5
× 12.5 = 13.95
4
The test statistic is obtained by changing to standard units:

    t = (75 − 100) / (13.98/√5) ≈ −4.0
The probability that a Student’s t with 4 degrees of freedom is below −4.0 is
less than 1%. This is a rather small P-value, so there seems to be enough
evidence to reject H0 and support the repairer’s claim.
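Both examples can be checked from their summary statistics (a sketch; the function name is ours, and SciPy is assumed for the t tail areas):

```python
import math
from scipy import stats  # assumed available; the notes use a printed t table

def t_from_summary(mean, sd, n, mu0):
    """t statistic from summary statistics, applying the SD+ correction."""
    sd_plus = math.sqrt(n / (n - 1)) * sd
    return (mean - mu0) / (sd_plus / math.sqrt(n))

t1 = t_from_summary(1.2, 0.3, 12, 1)   # recycling example (right-tailed)
print(t1, stats.t.sf(t1, 11))          # ≈ 2.21, P ≈ 0.024
t2 = t_from_summary(75, 12.5, 5, 100)  # microwave example (left-tailed)
print(t2, stats.t.cdf(t2, 4))          # = -4.0, P ≈ 0.008
```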