Chapter 8: Statistical Inference Confidence Intervals
Background:
Critical values of Z
Z_a is the value such that P(Z > Z_a) = a, where Z ~ N(0, 1) and 0 < a < 1.
Ex:
P(Z > 1.645) = .05, so Z_.05 = 1.645
P(Z > 1.96) = .025, so Z_.025 = 1.96
P(Z > 2.326) = .01, so Z_.01 = 2.326
P(Z > 2.576) = .005, so Z_.005 = 2.576
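The critical values above can also be computed rather than looked up. The following is a minimal sketch using Python's standard library; the function name z_critical is our own choice for illustration and is not part of the course materials.

```python
from statistics import NormalDist

def z_critical(a: float) -> float:
    """Return Z_a, the value with upper-tail area a under N(0, 1)."""
    if not 0 < a < 1:
        raise ValueError("a must be strictly between 0 and 1")
    # inv_cdf(1 - a) is the point with area (1 - a) to its left,
    # which leaves area a to its right.
    return NormalDist().inv_cdf(1 - a)

for a in (0.05, 0.025, 0.01, 0.005):
    print(f"Z_{a} = {z_critical(a):.3f}")
```

Rounded to three decimals, this should reproduce the four values listed above.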
A statistic is a number that describes the sample. Examples include the sample mean (x-bar), the sample standard deviation (s) and the sample proportion (p-hat = number of successes in the sample divided by the sample size).
A parameter is a number that describes a population. Examples include the population mean (μ), the population standard deviation (σ) and the population proportion (p).
Population parameters are usually unknown, so we want to estimate them. The simplest way to estimate a parameter is with a statistic, which we call a point estimate. We can do better by not just using a single point, but by setting up an interval around the point and stating how much confidence or certainty we have that the parameter is in the interval. We call this interval estimation.
Interval Estimation is the beginning of Statistical Inference, which is the action of generalizing about a population from a sample. Statistical inference lets us make statements about population parameters with a stated level of certainty. Statistical inference also enables us to make better decisions.
In order for Statistical inference to be correct we need to make sure that the sample is representative of the population. We do this by taking random samples. These are samples where each unit / subject has the same probability of being selected as any other unit in the population.
Finding CI’s for the population mean.
First we find a point estimator (statistic) that estimates the population parameter. For μ we use x-bar, the sample mean.
Second we find the standard error of the estimator.
For x-bar the SE is simple:
SE = σ / sqrt(n) ≈ s / sqrt(n)
Where σ is the population standard deviation (which will not be known much of the time), s is the sample standard deviation and n is the sample size. (These numbers will be given to you)
Third we find the margin of error. This will depend on whether or not we know σ. If we know σ, the margin of error involves multiplying the SE by a z-score from the standard normal distribution.
To find a 95% CI for μ when σ is known:
(x-bar – Z_.025 * SE, x-bar + Z_.025 * SE)
To find a CI at any confidence level, which we denote (1 – α) where α is between 0 and 1, we need to find Z_(α/2).
Common CI levels are 95%, 99% and 90%.
Fourth, in order to use the methodology we need to know that the CLT applies. We need either n ≥ 30 or, if n < 30, a sample that comes from an approximately normally distributed population.
Of course we are also assuming that the sample was obtained randomly.
You do not want to use either methodology to find CI’s for μ if the data is binary, like yes-no questions.
Ex:
Suppose you want to find a 95% CI for µ when the sample mean is 27.5 from a random sample of 36 subjects. The population standard deviation is 8.6.
Since n = 36 > 30 we can use the above methodology.
X-bar = 27.5 and σ = 8.6. Our confidence level = .95, so α = .05 and α/2 = .025, so Z_(α/2) = 1.96.
Therefore our 95% CI for µ is:
(27.5 – 1.96 * 8.6/6, 27.5 + 1.96 * 8.6/6)
(27.5 – 2.809, 27.5 + 2.809)
(24.691, 30.309)
Therefore we are 95% sure that the population mean is between 24.691 and 30.309.
If someone claimed that the population mean is 22, we can state that we are 95% confident that they are incorrect, because 22 is not in the interval.
If someone claimed that the population mean is 32, we can state that we are 95% confident that they are incorrect, because 32 is not in the interval.
However, if someone claimed that the population mean is 26, we cannot state that we are 95% confident that they are correct, because 26 is in the interval. We can only state that we are NOT 95% confident that they are wrong.
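The arithmetic in this example can be checked with a few lines of code. This is a minimal sketch, assuming σ is known and the CLT condition holds; the function name z_interval is hypothetical, and the code uses the unrounded critical value rather than 1.96, which gives the same answer to three decimals here.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence=0.95):
    """CI for mu when sigma is known: x-bar +/- Z_(alpha/2) * sigma / sqrt(n)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # Z_(alpha/2)
    margin = z * sigma / sqrt(n)              # margin of error = Z * SE
    return x_bar - margin, x_bar + margin

low, high = z_interval(27.5, 8.6, 36)
print(f"({low:.3f}, {high:.3f})")  # (24.691, 30.309)
```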
On the TI 83/84
Hit [STAT] select TESTS select 7: ZInterval…
ZInterval
Inpt: Data Stats (select Stats)
σ: 8.6
x-bar: 27.5
n: 36
C-Level: .95
Calculate
Output
(24.691, 30.309)
x-bar: 27.5
n: 36
Ex:
A random sample of 50 Twisty pretzels is taken. Their mean baking time was 13.2 minutes. The population standard deviation is 4.1 minutes. Find a 95% CI for the population mean baking time of the pretzels.
Definition of terms: n = 50 x-bar = 13.2 σ = 4.1
μ = unknown mean baking time for all Twisty pretzels, this is what we are finding the CI for. Note that n > 30.
On the TI 83/84
Hit [STAT] select TESTS select 7: ZInterval…
Output
(12.064, 14.336) x-bar: 13.2 n: 50
We are 95% sure that the mean baking time for all Twisty Pretzels is between 12.064 and 14.336 minutes.
T-Intervals
For CI’s of μ when we do not know σ we will use s to approximate σ and we find the margin of error by multiplying the SE by a t-score which comes from the T-distributions.
The T-distributions are very similar to the Standard Normal Distribution, but are a little more complicated because each T-distribution depends on n. Recall that the normal distribution has 2 parameters, μ and σ. The T-distribution has only one parameter, called v = n -1 = degrees of freedom. The mean of any T-distribution is always 0.
For this class it will simply mean hitting a different button on your calculator.
To find a 95% CI for μ when σ is unknown:
(x-bar – T_(α/2) * SE, x-bar + T_(α/2) * SE), where α/2 = .025
In the old days you would need to learn how to read the T-table and calculate the margin of error by hand, but your calculator will now calculate CI’s for μ using T-scores for you. The value of T_(α/2) changes depending on n.
Ex: Same as before but now we use T
A random sample of 50 Twisty pretzels is taken. Their mean baking time was 13.2 minutes with a standard deviation of 4.1 minutes. Find a 95% CI for the population mean baking time of the pretzels.
Definition of terms: n = 50 x-bar = 13.2 s = 4.1
μ = unknown mean baking time for all Twisty pretzels, this is what we are finding the CI for. Note that n > 30.
On the TI 83/84
Hit [STAT] select TESTS select 8: TInterval…
TInterval
Inpt: Data Stats (select Stats)
x-bar: 13.2
Sx: 4.1
n: 50
C-Level: .95
Calculate
Output
(12.035, 14.365)
x-bar: 13.2
Sx: 4.1
n: 50
We are 95% sure that the mean baking time for all Twisty Pretzels is between 12.035 and 14.365 minutes.
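To see what the calculator is doing, here is a minimal sketch of the T-interval arithmetic. Python's standard library has no T-distribution, so the critical value is passed in as an argument. The value 2.0096 for T_.025 with 49 degrees of freedom is an assumption taken from a standard T-table, and the function name t_interval is hypothetical.

```python
from math import sqrt

def t_interval(x_bar, s, n, t_crit):
    """CI for mu when sigma is unknown.

    t_crit is T_(alpha/2) for n - 1 degrees of freedom,
    supplied by the caller (for example, from a T-table).
    """
    margin = t_crit * s / sqrt(n)   # margin of error = T * SE
    return x_bar - margin, x_bar + margin

# Assumption: tabulated T_.025 for 49 degrees of freedom.
T_CRIT_49 = 2.0096
low, high = t_interval(13.2, 4.1, 50, T_CRIT_49)
print(f"({low:.3f}, {high:.3f})")  # (12.035, 14.365)
```

Because 2.0096 is slightly larger than 1.96, the T-interval is slightly wider than the Z-interval computed from the same numbers.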
Ex2. A random sample of 40 students taking the Kaplan SAT prep course is taken. Their average SAT score was 1165 with a standard deviation of 86.3. Find a 95% CI for the mean SAT score for all students taking the Kaplan SAT prep course.
n = 40 x-bar = 1165 s = 86.3
95% CI for mean SAT score for all Kaplan students is
(1137.4, 1192.6)
ETS reports that the mean score for all students taking the SAT’s is 1120. Based on the data can Kaplan claim that the mean score for all its students is greater than the mean for all students in general?
Yes. Kaplan is 95% sure that the mean for all its students is between 1137.4 and 1192.6, and 1120 is below this interval (1120 < 1137.4).
The Princeton Review, a rival of Kaplan, says that the mean for all its students is 1180.
Can the Princeton Review claim that the mean SAT score for all its students is greater than the mean SAT score for all Kaplan’s students?
No, because 1180 is within the CI (1137.4 < 1180 < 1192.6), so there is insufficient evidence to state that the mean SAT score for all Princeton Review students is greater than the mean SAT score for all Kaplan students.
What CI’s really mean?
You should really use T because σ is rarely known.
Approximate CI’s for p
Ex
Suppose we want to know what percentage / proportion of the US approves of the president.
We cannot sample every one of the 300 million people in the US.
Survey companies take random samples of people in the US and calculate the sample proportion. Usually they also report a margin of error, something like 32% plus or minus 2%.
This margin of error gives us an interval for estimation.
(30%, 34%)
Other survey companies may take similar surveys and find slightly different results. But if all the companies were careful to take random samples and word their questions the same, the results will be very similar.
If the sample proportions are slightly different that does not mean that one poll is wrong and one is right. The sample proportion is a statistic and like all statistics it is also a random variable. It will change from sample to sample.
Ex. Assume that the population proportion of US people who approve of the president is actually 30%. If 10 different survey companies take random samples of 100 people, what might the sample proportions look like?
S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
0.29 0.31 0.32 0.34 0.27 0.29 0.30 0.32 0.33 0.27
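A quick simulation shows the same sample-to-sample variation. This is a minimal sketch, assuming each respondent independently approves with probability p = 0.30; the function name and the seed are arbitrary choices for illustration, and the simulated values will not match the table above.

```python
import random

def simulate_surveys(p, n, surveys, seed=None):
    """Return one sample proportion (p-hat) per simulated survey."""
    rng = random.Random(seed)
    p_hats = []
    for _ in range(surveys):
        # Count respondents who "approve" in a random sample of size n.
        successes = sum(rng.random() < p for _ in range(n))
        p_hats.append(successes / n)
    return p_hats

# Ten survey companies, each sampling 100 people.
print(simulate_surveys(p=0.30, n=100, surveys=10, seed=1))
```

Running it with different seeds gives different sets of p-hat values, all scattered around 0.30.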
P-hat = sample proportion (the notation used by the calculator and most texts). Your text calls it P’.
P-Hat is a good estimator for p for 3 important reasons.
1. The mean value of p-hat is p. If you could find all possible p-hat’s and average them, you would get p. We say that p-hat is an unbiased estimator for p.
2. The standard error (or standard deviation) of p-hat is known and is small when n * p-hat and n * q-hat are both ≥ 5, where q-hat = 1 – p-hat.
The SE of p-hat is:
SE(p-hat) = sqrt(p-hat * (1 – p-hat) / n)
3. If n * p-hat and n * q-hat are both ≥ 5, then p-hat is approximately Normally distributed. This can be shown by the Central Limit Theorem.
These 3 reasons make finding Confidence intervals for p very easy.
Note that x-bar is likewise an unbiased estimator of μ, SE(x-bar) is small and known, and x-bar is approximately normally distributed when the CLT applies, which makes finding CI’s for μ very easy as well.
To find a 100(1 – α)% Confidence Interval for p:
(p-hat – Z_(α/2) * SE, p-hat + Z_(α/2) * SE)
where SE = sqrt(p-hat * (1 – p-hat) / n)
For example, to find a 95% Confidence Interval for p:
(p-hat – 1.96 * SE, p-hat + 1.96 * SE)
The –1.96 and +1.96 come from the standard normal or Z distribution. P(–1.96 < Z < +1.96) = .95
Ex:
In a random sample of 100 SCSU students, 72 said that there was not ample parking on campus. Find and interpret a 95% CI for the true proportion of SCSU students who think that there is not ample parking on campus.
n = 100 and x = 72, therefore p-hat = x/n = .72
So q-hat = 1 – .72 = .28
SE = sqrt(.72 * .28 / 100) = .0449
First check to see if n*p-hat and n*q-hat are both at least 5.
n*p-hat = 100 * .72 = 72 ≥ 5
n*q-hat = 100 * .28 = 28 ≥ 5
So the 95% CI for p is:
(.72 – 1.96* .0449, .72 + 1.96 * .0449)
(.72 - .088, .72 + .088) Note that .088 is the margin of error.
(.632, .808)
We are 95% sure that the population proportion of all SCSU students who think that there is not ample parking is between .632 and .808.
So if the administration were to say that 50% of students think there is not ample parking you could say that you are 95% sure that they are lying. (.50 is not in the interval)
Or if some radical student association said that 90% of students say there is not ample parking, you could say that you are 95% sure that they are lying. (.90 is not in the interval)
If someone else said that 75% of students think there is not ample parking, you CANNOT say that you are 95% sure that they are right. You can only say that the true or population proportion of students who think that there is not ample parking is somewhere between .632 and .808. Where p is in the interval you do not know.
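The parking example can be reproduced in code. This is a minimal sketch of the approximate interval described above, including the check that n*p-hat and n*q-hat are at least 5; the function name proportion_interval is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(x, n, confidence=0.95):
    """Approximate CI for p: p-hat +/- Z_(alpha/2) * sqrt(p-hat * q-hat / n)."""
    p_hat = x / n
    q_hat = 1 - p_hat
    # The normal approximation is only used when both counts are at least 5.
    if n * p_hat < 5 or n * q_hat < 5:
        raise ValueError("need n*p-hat and n*q-hat to both be at least 5")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # Z_(alpha/2)
    margin = z * sqrt(p_hat * q_hat / n)                 # margin of error
    return p_hat - margin, p_hat + margin

low, high = proportion_interval(72, 100)
print(f"({low:.3f}, {high:.3f})")  # (0.632, 0.808)
```

Passing a different confidence value changes only the critical value, so a higher confidence level gives a wider interval from the same data.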
To find a 95% CI for p on the TI 83/84 :
Hit [STAT] then go over to Tests, scroll down to
A: 1-PropZInt Hit [ENTER]
(NOT 1-Prop-ZTest)
1-PropZInt x: 72 n: 100
C-Level : .95
Calculate
You can use the calculator to find a CI for p for any C-level between 0% and 100%, not inclusive.
The most common CI’s are 95% and 99%.
The formula for the 99% CI is the same as the 95% CI except where there is 1.96 you substitute 2.576.
Find a 99% CI for p, using the previous data:
1-PropZInt x: 72 n: 100
C-Level : .99
Calculate
(.604, .836)
Note that this interval is a little bigger.
Determining the Sample Size
CI for p and µ are very similar. Each CI depended on finding the SE of the statistic and using a distribution, Z for p and T for μ.
Each CI also involved dividing by sqrt(n). Therefore the width of the CI depends on n.
The problem we dealt with last time was finding an interval that we could use for statistical inference. A similar problem: we want to estimate a parameter (p or μ) with a certain confidence level (like 95% or 99%), we want the margin of error (1/2 the width of the CI) to be at most a certain amount, and we want to know how many observations we need to accomplish this.
Recall that to find a 95% Confidence Interval for p the formula is:
(p-hat – 1.96 * SE, p-hat + 1.96 * SE)
where SE = sqrt(p-hat * (1 – p-hat) / n)
The margin of error is 1.96*SE.
Note that as n increases the SE will decrease so the margin of error will decrease. We can always find an n big enough such that we will find a small enough margin of error to satisfy the problem. Whether that n is realistic is another question.
The formula for finding n for a 95% CI for p that has a margin of error EBP (Error Bound for Proportion) is:
n = p-hat * (1 – p-hat) * 1.96^2 / EBP^2
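The sample-size formula for a 95% CI follows from setting the margin of error equal to EBP and solving for n. The steps below are a short sketch of that algebra, written in LaTeX notation:

```latex
\begin{aligned}
EBP &= 1.96\,\sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}} \\
EBP^{2} &= 1.96^{2}\,\frac{\hat{p}\,(1-\hat{p})}{n} \\
n &= \frac{\hat{p}\,(1-\hat{p})\cdot 1.96^{2}}{EBP^{2}}
\end{aligned}
```

For any other confidence level, the same steps apply with 1.96 replaced by the appropriate critical value of Z.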
Ex: A drug company wants to find a 95% CI for the population proportion of people its new drug will cure. It wants the margin of error to be 10%. In previous testing they estimate p to be near 70%. How many subjects must they sample?
Information given:
95% CI for p, EBP = .10 and p-hat = .70 so 1 – p-hat = .30
n = .7 * .3 * 1.96^2 / .1^2
n = .7 * .3 * 3.8416 / .01 = 80.6736
The final answer is 81, because n has to be an integer and you ALWAYS round up!
The formula for finding n for a 99% CI for p that has a margin of error EBP is:
n = p-hat * (1 – p-hat) * 2.576^2 / EBP^2
Ex2: If we use the same information as the previous example, and find n for a 99% CI we would get:
99% CI for p, EBP = .10 and p-hat = .70 so 1 – p-hat = .30
n = .7 * .3 * 6.636 / .01 = 139.356
The final answer is 140, because n has to be an integer and you ALWAYS round up!
Increasing the confidence level while EBP and p-hat stay the same makes n increase. The more confident you want to be, the more observations you need.
In general the formula for finding n is:
n = p-hat * (1 – p-hat) * (Z_(α/2))^2 / EBP^2
Ex 3: If the company wanted a smaller margin of error, .01 instead of .1, how many observations would they need for a 95% CI, assuming p-hat = .70?
n = .7 * .3 * 3.8416 / .0001 = 8067.36, so n = 8068, a much bigger number than for a 10% margin of error.
As EBP gets smaller n increases.
Ex 4: What if the company did not know that p-hat was about 70%? Because in the numerator of the formula for n we are multiplying p-hat * (1 – p-hat) and p-hat has to be between 0 and 1, we can maximize the numerator by assuming p-hat = 0.5.
So if you are given a problem where p-hat is not given and you are asked to find n, you assume that p-hat = .5 = 1 – p-hat.
To find the sample size required for a 95% CI for p, when we want the margin of error to be 10%:
n = .5 * .5 * 3.8416 / .01 = 96.04, so n = 97
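The four sample-size examples for p can be checked with one small function. This is a minimal sketch with a hypothetical function name; it defaults to p-hat = .5 when no prior estimate is available and always rounds up. It uses the unrounded critical value instead of 1.96 or 2.576, which gives the same rounded-up answers for these examples.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_proportion(ebp, confidence=0.95, p_hat=0.5):
    """Smallest n whose margin of error for p is at most ebp."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # Z_(alpha/2)
    # n = p-hat * (1 - p-hat) * Z^2 / EBP^2, always rounded up.
    return ceil(p_hat * (1 - p_hat) * z**2 / ebp**2)

print(sample_size_for_proportion(0.10, 0.95, p_hat=0.70))  # 81
print(sample_size_for_proportion(0.10, 0.99, p_hat=0.70))  # 140
print(sample_size_for_proportion(0.01, 0.95, p_hat=0.70))  # 8068
print(sample_size_for_proportion(0.10, 0.95))              # 97
```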
To find the sample size for a 100(1 – α)% CI for μ:
n = (Z_(α/2) * σ / EBM)^2 = (Z_(α/2))^2 * σ^2 / EBM^2
Note that EBM stands for Error Bound for Mean and will sometimes be called the margin of error.
Also note that we are using σ and Z here not s and T, why?
Ex 5: Find the sample size required for a 95% CI for μ with a margin of error of 3, when the standard deviation is 10.
n = 3.8416 * 100 / 9 = 42.68, so n = 43
Ex 6: Find the sample size required for a 99% CI for μ with a margin of error of 3, when the standard deviation is 10.
n = 6.636 * 100 / 9 = 73.7, so n = 74
Ex 7: Find the sample size required for a 95% CI for μ with a margin of error of 4, when the standard deviation is 10.
n = 3.8416 * 100 / 16 = 24.01, so n = 25
How is the width of the CI affected by σ, α, EBM?
Ex 8: Find the sample size required for a 95% CI for μ with a margin of error of $1000 for the mean annual income of Native Americans in Onondaga County, NY. We are not given any information about the standard deviation. We know that nearly all the incomes fall between $0 and $120,000 and that the distribution is close to bell shaped.
First we need to approximate σ. If nearly all incomes fall between 0 and 120,000 and the distribution is bell shaped, then this range is approximately the interval (μ – 3σ, μ + 3σ). This means that 6σ = 120000, which means that σ = 20000. We will use this as our value for σ.
n = 3.8416 * 20000^2 / 1000^2 = 1536.64, so n = 1537
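The sample-size examples for μ, including the income example, can be verified the same way. This is a minimal sketch with a hypothetical function name; sigma is whatever value is given or approximated for the population standard deviation. It uses the unrounded critical value, which yields the same rounded-up answers as the hand calculations above.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_mean(sigma, ebm, confidence=0.95):
    """Smallest n whose margin of error for mu is at most ebm."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # Z_(alpha/2)
    # n = (Z * sigma / EBM)^2, always rounded up.
    return ceil((z * sigma / ebm) ** 2)

print(sample_size_for_mean(10, 3))           # 43
print(sample_size_for_mean(10, 3, 0.99))     # 74
print(sample_size_for_mean(10, 4))           # 25
print(sample_size_for_mean(20000, 1000))     # 1537
```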