Objectives

Objectives 6.1, 7.1 Estimating with confidence (CIS: Chapter 10)
- Statistical confidence (CIS gives a good explanation of a 95% CI)
- Confidence intervals
- Choosing the sample size
- t distributions
- One-sample t confidence interval for a population mean
- How confidence intervals behave
Adapted from authors’ slides © 2012 W.H. Freeman and Company

Overview of Inference
Sample ≠ population, and sample mean x̄ ≠ population mean µ.
But we do not know the value of µ, and if we want to draw any
conclusions about µ then we have to use x̄ to do so.
Methods for drawing conclusions about a population from sample
data are called statistical inference.
There are two main types of inference:
- Confidence Intervals: estimating the value of a population
parameter, and
- Tests of Significance: assessing evidence for a claim (hypothesis)
about a population.
Inference is appropriate when data are produced by either
- a random sample or
- a randomized experiment.

Introducing confidence intervals
It is very unlikely that the mean of a sample will ever exactly
equal the true mean. Our aim is to construct an interval around the
sample mean which is `likely’ to contain the true mean. This is called a
confidence interval.
In the first lecture we considered a Gallup poll for the proportion of the
electorate that would vote for Obama.
Gallup predicted that the Obama vote would be in the interval
[45%,51%] with 95% confidence.
The Obama vote turned out to be 50.5%, so the interval did capture the
true proportion.
You may be asking yourself how to understand the 95%: since 50.5%
lies in this interval, there does not appear to be any uncertainty in it.
In the next few slides, our objective is to understand how a
confidence interval is constructed and how to interpret it.

Review: properties of the sample mean
The sample mean x̄ is a unique number for any particular sample. If
you had obtained a different sample (by chance) you almost certainly
would have had a different value for your sample mean.
In fact, you could get many different values for the sample mean, and
virtually none of them would actually equal the true population mean, µ.
Because the sampling distribution of x̄ is narrower than the population
distribution, by a factor of √n, the estimates x̄ tend to be closer to
the population parameter µ than individual observations are.
[Figure: population distribution of individual subjects x (center µ, spread σ) and the narrower sampling distribution of sample means x̄ of n subjects (center µ, spread σ/√n)]
If the population is normally distributed N(µ,σ),
the sampling distribution is N(µ,σ/√n).
Using the empirical rule, since the distribution of the sample mean is
close to normal, 95% of the time it will be within 2 standard errors of the
mean. That is, if I had a hundred sample means, then about 95 of them
would lie in the interval [µ – 2×σ/√n, µ + 2×σ/√n].
Now we make a small correction to the empirical rule. It is not 2
standard errors from the mean, but 1.96 standard errors from
the mean. To see why, look up 1.96 in the z-tables.
But the mean is unknown, so our objective is to locate the true mean
based on the sample mean.
To do this we turn the story around: if the sample mean lies in the
interval [µ –1.96×σ/√n, µ+1.96×σ/√n], this is the same as saying the
mean µ lies in the interval [sample mean –1.96×σ/√n, sample mean
+1.96×σ/√n].
Thus 95% of the time, the true mean (that we want to estimate) will
be in the interval
[sample mean –1.96×σ/√n, sample mean +1.96×σ/√n]. This is an
interval which is centered about the sample mean. In the next slide we
illustrate what we mean by 95%.
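The "95% of the time" claim can be checked by simulation. The sketch below is only an illustration: the values µ = 67, σ = 3.8 and n = 10 are chosen for the demonstration, the function name `coverage` is made up, and it assumes Python 3.8 or later for `statistics.NormalDist`. It draws many samples from a normal population, builds the interval [sample mean ± 1.96×σ/√n] for each, and counts how often the interval contains µ.

```python
import math
import random
from statistics import NormalDist, fmean

def coverage(mu, sigma, n, trials=10_000, level=0.95, seed=1):
    """Fraction of known-sigma intervals that contain the true mean mu."""
    rng = random.Random(seed)                       # fixed seed: repeatable run
    z_star = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    margin = z_star * sigma / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        # Draw one sample of size n and compute its mean.
        sample_mean = fmean(rng.gauss(mu, sigma) for _ in range(n))
        # Does the interval centered on the sample mean capture mu?
        if sample_mean - margin <= mu <= sample_mean + margin:
            hits += 1
    return hits / trials

# Illustrative values only (not taken from a real data set).
rate = coverage(mu=67, sigma=3.8, n=10)
print(rate)
```

The printed proportion should be close to 0.95. Each individual interval either contains µ or it does not; the 95% describes the long-run proportion.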
If multiple samples were possible, 95% of all sample means would
be within 1.96 (roughly 2) standard deviations (1.96×σ/√n) of the
population parameter µ.
This implies that the population parameter µ will be within 1.96 standard
deviations of the sample average x̄ in 95% of all samples.
This reasoning is the essence of statistical inference.
[Figure: red dot marks the mean value of an individual sample]

Mean height – sample size one
Human heights are approximately normally distributed. The
standard deviation of human height is 3.8 inches.
Our objective is to construct a confidence interval for the mean
height.
We start with a very crude estimator and use just one height to
estimate the mean; this is the same as using a sample of size one.
In this case the standard error is 3.8/√1 = 3.8.
Each of you construct a 95% confidence interval for the mean height
using your height as the sample:
[your height – 1.96×3.8, your height + 1.96×3.8] =
[your height – 7.44, your height + 7.44]. For example, in my case
the interval is [63 – 7.44, 63+7.44] = [55.56,70.44].
In fact it is known that the mean height of a
person is 67 inches. Does your interval contain the mean? The
proportion of intervals that contain the mean should be approximately 95%.

Mean height – sample size two
In the previous experiment we used just one individual to
estimate the mean height. The `cost’ of using one individual was
that the confidence interval was very wide.
We repeat the experiment, but this time each of you buddy up with
your neighbour and calculate the average height of the two of
you (i.e. (your height plus neighbour’s height)/2). You and your
buddy form a sample of size two.
We know that the sample mean based on a sample of size n=2
has the standard error 3.8/√2 = 2.68.
Each group construct the interval [sample mean – 1.96×2.68,
sample mean + 1.96×2.68] = [sample mean ± 5.26].
The mean height is 67 inches; does your interval contain the mean?
What proportion of the intervals in the class contain the mean?

Observations
We see that the length of the confidence interval when using just one
person in the sample is 2×7.44 = 14.88. This is quite long, and does
not really allow us to pinpoint the mean.
Whereas the length of the interval using two people to calculate the
sample mean is 10.52, which is quite a big reduction in length!
If ten people were used to calculate the sample mean the
corresponding interval length would be 14.88/√10 = 4.7.
We see that for any given interval either the mean is in this interval
or not. The 95% comes into play when we look at the proportion of
intervals that contain the mean.
In reality:
- We do not know the true mean µ, so will never know whether the interval
contained the mean or not.
- We only observe one sample of size n, and thus have one CI.
- One confidence interval contains information about the mean. This is
why we say with 95% confidence the mean lies in it.
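The interval lengths quoted above all come from the same expression, 2×1.96×σ/√n. As a small sketch (the function name `interval_length` is made up for illustration), the lengths for several sample sizes can be tabulated. The slides round the margin to 7.44 before doubling, so the last digit may differ slightly.

```python
import math

SIGMA = 3.8    # standard deviation of heights in inches, as given in the slides
Z_STAR = 1.96  # multiplier for 95% confidence

def interval_length(n, sigma=SIGMA, z_star=Z_STAR):
    """Length of the known-sigma confidence interval: 2 * z* * sigma / sqrt(n)."""
    return 2 * z_star * sigma / math.sqrt(n)

# Going from n to 4n halves the length, because of the square root.
for n in (1, 2, 10, 40, 100):
    print(n, round(interval_length(n), 2))
```

The square root is the reason that precision gets expensive: to halve the length of the interval we need four times as many people.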

Implications
We do not need to (and
cannot, anyway) take a lot of
random samples to “rebuild”
the sampling distribution and
find µ at its center.
All we need is one SRS of
size n and we can rely on
the properties of the
sampling distribution to infer
reasonable values for the
population mean µ.
[Figure: one sample of size n drawn from a population with mean µ]

Multiple samples revisited
With 95% confidence, we can say
that µ should be within 1.96
standard deviations (1.96×σ/√n)
of our sample mean x̄.
In 95% of all possible samples of
this size n, µ will indeed fall in our
confidence interval.
In only 5% of samples will x̄ be
farther from µ.
“Confidence” = the proportion of
possible samples that give us a
correct conclusion.

Calculation practice
You want to rent an unfurnished one-bedroom apartment in Dallas.
The mean monthly rent for 10 randomly sampled apartments is 980
dollars. Assume that monthly rents follow a normal distribution with
standard deviation 280 dollars. Construct a 95% confidence interval
for the mean monthly rent of a one-bedroom apartment.
The standard error for the sample mean is 280/√10 = 88.54.
Thus the 95% CI is [980 ±1.96×88.54] = [806,1153]. With 95%
confidence we believe the mean rent of one-bedroom apartments in
Dallas lies in this interval.
Does the above confidence interval mean that 95% of all rents
should lie in this interval?
No, it is the interval for the mean. If we want the interval where 95% of
all rents should lie it is [980 ±1.96×√(88.54²+280²)] = [404,1556] (the
variances add, not the standard deviations). You do not
have to understand the calculation, but you will notice this interval is
much wider. The reason is that it must capture 95% of all rents, which
are extremely varied. The previous CI was just capturing the mean rent,
based on the sample mean, which is much less varied.
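The confidence interval for the mean rent can be reproduced in a few lines. This is a minimal sketch, assuming Python 3.8 or later; the helper name `z_interval` is made up for illustration. It uses the exact normal quantile rather than the rounded 1.96, so the endpoints can differ from the slide in the last digit.

```python
import math
from statistics import NormalDist

def z_interval(sample_mean, sigma, n, level=0.95):
    """Confidence interval for a mean when the population sd (sigma) is known."""
    z_star = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided critical value
    margin = z_star * sigma / math.sqrt(n)          # z* times the standard error
    return sample_mean - margin, sample_mean + margin

# Apartment example: sample mean 980, sigma 280, n = 10.
low, high = z_interval(980, 280, 10)
print(round(low, 1), round(high, 1))
```

Note that the function never looks at individual rents, only at the sample mean. This is why the interval describes the mean rent and not the spread of all rents.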

Calculation practice
Hypokalemia is diagnosed when the blood potassium level is below
3.5mEq/dl. The potassium in a blood sample varies from sample to
sample and follows a normal distribution with standard deviation 0.2.
A patient’s potassium is measured over 4 days. The sample
mean level over these 4 days is 3.7. Construct a 95% confidence
interval for the mean potassium and discuss whether the patient is
likely to be diagnosed with hypokalemia.
The standard error for the sample mean is 0.2/√4 = 0.1. Thus the 95%
confidence interval for the mean potassium level is [3.7±1.96×0.1] =
[3.504,3.896]. This means with 95% confidence we believe the mean lies
in this interval.
Since the whole interval lies above 3.5, with 95% confidence I can
say that the patient does not have this condition.

Confidence interval misunderstandings
Suppose 400 alumni were asked to rate the University of Okoboji
counseling services on a scale of 1 to 10. The sample
mean was found to be 8.6 and it is known that the standard
deviation is σ=2. Ima Bitlost has done the analysis, but has made
some mistakes.
Ima computes the 95% confidence interval for the mean satisfaction score
as [8.6±1.96×2]. What is her mistake?
Ima has not taken into account that the sample mean has a much
smaller standard deviation (standard error) than the population. The
standard error is 2/√400 = 0.1. Thus the correct CI is
[8.6±1.96×0.1] = [8.4,8.796].
After correcting her mistake, she states that “I am 95% confident
that the sample mean lies in the interval [8.4,8.796]”. What is wrong
with her statement?
This is a meaningless statement: the sample mean certainly lies in this
interval! It is the population mean that we are 95% confident lies there.
She quickly realizes her mistake and instead states “the probability
that the mean lies in the interval [8.4,8.796] is 95%”. What
misinterpretation is she making now?
By 95%, we mean that if we repeated the experiment many times over,
about 95% of the intervals would contain the mean. For any given
interval the mean is either in there or not. There is no probability
attached to it. To overcome this issue we say that we have 95%
confidence that the mean lies in this interval.
Finally, in her defense of using the normal distribution to determine
the confidence coefficient (1.96) she says “Because the sample
size is quite large, the population of alumni ratings will be close to
normal”. Explain to Ima her misunderstanding.
The distribution of the population always stays the same, regardless of
the sample size (in this case, it is clear that variables that take integer
values between 1 and 10 cannot be normal). However, the distribution of the
sample mean does get closer to normal as the sample size grows. With a sample size
of 400, the distribution of the sample mean will be very close to normal.
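The last point can be illustrated by simulation. The sketch below is an illustration only: the skewed rating distribution in `weights` is invented for the demonstration and is not the alumni data. Even though the individual ratings are integers from 1 to 10 and clearly not normal, about 95% of the sample means of size 400 land within 1.96 standard errors of the population mean, as the normal approximation predicts.

```python
import math
import random
from statistics import fmean

rng = random.Random(0)                     # fixed seed so the run is repeatable
ratings = list(range(1, 11))               # possible scores 1..10
weights = [1, 1, 1, 2, 2, 3, 5, 8, 9, 8]   # hypothetical, skewed population

# Exact mean and standard deviation of this hypothetical population.
total = sum(weights)
mu = sum(r * w for r, w in zip(ratings, weights)) / total
sigma = math.sqrt(sum(w * (r - mu) ** 2 for r, w in zip(ratings, weights)) / total)

n = 400
se = sigma / math.sqrt(n)                  # standard error of the sample mean

# Draw many samples of size n and record each sample mean.
means = [fmean(rng.choices(ratings, weights=weights, k=n)) for _ in range(2000)]
within = sum(abs(m - mu) <= 1.96 * se for m in means) / len(means)
print(round(within, 3))
```

The population itself does not change with n; only the distribution of the sample mean becomes closer to normal.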

Different levels of confidence
There is no need to restrict ourselves to 95% confidence intervals.
The level of confidence we use really depends on how much
confidence we want. For example, you would expect a 99%
confidence interval to be more likely to contain the mean than a 95%
confidence interval.
To construct a 99% confidence interval we use exactly the same
prescription as used to construct a 95% confidence interval; the only
thing that changes is that 1.96 goes to 2.57 (if you look up -2.57 in the
z-tables you will see this corresponds to 0.5%, so 99% of the time the
sample mean will lie within 2.57 standard errors of the mean).
A 99% CI for the mean one-bedroom apartment price is
[980±2.57×88.54]. Length of interval is 2×2.57×88.54.
A 90% CI for the mean one-bedroom apartment price is
[980±1.64×88.54]. Length of interval is 2×1.64×88.54.
What does a 100% confidence interval look like? In a 100% CI we are
sure to find the mean, but this interval is so wide it is not informative.
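Instead of looking up the z-tables, the multiplier for any confidence level can be computed from the inverse normal distribution. A short sketch, assuming Python 3.8 or later (the function name `z_star` is made up for illustration):

```python
from statistics import NormalDist

def z_star(level):
    """Two-sided critical value: the area between -z* and z* equals `level`."""
    return NormalDist().inv_cdf(0.5 + level / 2)

for level in (0.80, 0.90, 0.95, 0.98, 0.99):
    print(f"{level:.0%}: z* = {z_star(level):.3f}")
```

To three decimals the multipliers for 90% and 99% are 1.645 and 2.576; the slides use the slightly coarser values 1.64 and 2.57.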

Sample size and length of the CI
Let us return to the apartment example. We recall that the
confidence interval for the mean price is [980 ±1.96×88.54] =
[806,1153]. The length of this interval is 2×1.96×88.54 = 347.
What happens to the length of the interval if I increase the sample size?
Suppose I take an SRS of 100 apartments in Dallas, and the sample
mean based on this sample is 1000. What will the CI be?
What we observe is:
- The standard error is 280/√100 = 28 (much smaller than when the
sample size is 10), and the CI is [1000 ±1.96×28]. The length of this
interval is 2×1.96×28 ≈ 110.
- The length of the interval does not depend on the sample mean, which is
just the center of the interval. It only depends on 1.96, the standard
deviation and the sample size.
- The length of the interval gets smaller as the sample size grows.
This suggests that if we want the interval to have a certain level of
precision, we can choose the sample size accordingly.

Margin of Error
Margin of error is the lingo used for the plus and minus part of the
confidence interval.
That is, if the confidence interval is
[sample mean±1.96×σ/√n], the margin of error is 1.96×σ/√n.
For example, in the previous example the margin of error for the CI
based on 10 apartments is 1.96×88.54.
The margin of error for the CI based on 100 apartments is 1.96×28.
The margin of error is, in some sense, a measure of accuracy. The
smaller the margin of error, the more precisely we can pinpoint the true
mean.
Suppose we want the margin of error to be equal to some value;
then we can find the sample size such that we obtain that margin of
error. Solve for n the equation MoE = 1.96×σ/√n (the Margin of Error
and the standard deviation σ are given). See the next slide for an
example.

Calculation practice: What sample size for a given margin of error?
Annual coffee sales:
A marketing firm plans to study the annual sales in coffee shops. They want
to estimate the mean annual sales to within $0.2 million, this time with 98%
confidence. How many coffee shops should they sample to obtain a margin
of error of at most $0.2 million with a confidence level of 98%? From a
previous study they guess σ ≈ $1.03 million. To solve the formula we need to
find the correct z-score that will give a 98% CI. Looking up the tables we see
that z* = 2.326. Thus we solve the equation:
n ≈ (z*σ/m)² = (2.326×1.03/0.2)² ≈ 12.0² = 144.
From the calculation, we see they need 144 observations for the
margin of error to be $0.2 million.
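The same calculation can be written as a small function. This is a sketch under the assumptions of the slide (σ treated as known, normal multiplier z*); the name `sample_size` is made up for illustration, and it assumes Python 3.8 or later. The result is rounded up, because a smaller n would give a margin of error larger than requested.

```python
import math
from statistics import NormalDist

def sample_size(margin, sigma, level):
    """Smallest n with z* * sigma / sqrt(n) <= margin (sigma treated as known)."""
    z_star = NormalDist().inv_cdf(0.5 + level / 2)
    return math.ceil((z_star * sigma / margin) ** 2)

# Coffee shop example: margin $0.2 million, sigma guessed as $1.03 million, 98%.
print(sample_size(margin=0.2, sigma=1.03, level=0.98))  # 144
```

Because n depends on (σ/m)², halving the margin of error multiplies the required sample size by four.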

Calculation practice
In a study of bone turnover in young women with a medical
condition, serum TRAP was measured in 31 subjects. The sample
mean was 13.2 units per liter. Assume the standard deviation is
known to be 6.5U/l. Find the 80% CI for the mean serum level.
Look up 10% in the z-tables; this gives 1.28. The standard error for the
sample mean is 6.5/√31 = 1.17. Altogether this gives the CI
[13.2±1.28×1.17] = [11.7,14.7]. This means we believe with 80%
confidence that the mean serum level for women with this medical condition
lies in this interval. By choosing such a low level of confidence our
interval is quite narrow, but our confidence in this interval is relatively low.
How large a sample size should we choose such that the 80% CI for
the mean has the margin of error 1U/l?
This means solving 1.28×6.5/√n = 1, which gives n = (1.28×6.5/1)² ≈ 69.2,
so we round up to n = 70.
A confidence interval for µ can be expressed two ways.
x̄ ± m, where m is called the margin of error.
Egg carton example: 64.17g ± 2.83g. We say “We conclude that µ is
within 2.83g of 64.17g, with 95% confidence.”
Two endpoints of an interval: (x̄ − m) to (x̄ + m).
Egg carton example: 61.34g to 67.00g. We say “We conclude that µ is
between 61.34g and 67.00g, with 95% confidence.”
Again, the confidence level C is the proportion of possible samples for
which the conclusion is correct. That is, it is the proportion of possible
samples for which the interval contains µ. (C usually is given in %.)
But there is an important issue to deal with.
We do not know the value of σ any more than we know the value of µ.

When σ is unknown
In this case we can estimate the standard deviation from the data.
The sample standard deviation s provides an estimate of the population
standard deviation σ.
But when the sample size is
small, the sample contains only
a few individuals. Then s is a
mediocre estimate of σ.
The data are unlikely to contain
values in the tails, and s is likely
to underestimate σ.
When the sample size is large,
the sample is likely to contain
elements representative of the
whole population. Then s is a
good estimate of σ.
[Figure: population distribution, with a large sample and a small sample drawn from it]

The z-transform with estimated standard deviation
Simply replacing the true standard deviation with the estimated
standard deviation can have severe consequences for the
confidence interval if we do not correct for it.
To see why, consider the z-transforms of the sample mean with
known and estimated standard deviations:
(sample mean - µ)/(σ/√n)
(sample mean - µ)/(s/√n)
In the first case, the z-transform will be a standard normal. In the second
case the estimated standard deviation adds extra variability into the
`system’. In particular, because s can be smaller than σ, this means
the z-transform can be larger and take more extreme values than we would
expect for a standard normal.
In the next few slides we show that when we estimate the standard
deviation the z-transform is no longer a standard normal, but follows the
so-called t-distribution.
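The extra variability can be seen in a small simulation. This is only a sketch: the sample size n = 5, the standard normal population and the function name `tail_fraction` are chosen for illustration. For each sample the transform is computed once with the true σ and once with the estimate s, and we count how often its absolute value exceeds 1.96.

```python
import math
import random
from statistics import fmean, stdev

def tail_fraction(n, use_sample_sd, trials=10_000, cutoff=1.96, seed=2):
    """Fraction of samples whose standardized mean exceeds the cutoff in size."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0                  # illustrative normal population
    extreme = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        # Either the known sigma or the sample standard deviation s.
        sd = stdev(sample) if use_sample_sd else sigma
        stat = (fmean(sample) - mu) / (sd / math.sqrt(n))
        if abs(stat) > cutoff:
            extreme += 1
    return extreme / trials

z_tail = tail_fraction(5, use_sample_sd=False)  # should be near 0.05
t_tail = tail_fraction(5, use_sample_sd=True)   # should be noticeably larger
print(z_tail, t_tail)
```

With s in the denominator the transform lands beyond ±1.96 far more often than 5% of the time, so an interval built with 1.96 would have less than 95% confidence.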

How brewers saved statistics
Just over 100 years ago, W.S.
Gosset was a biometrician who
worked for the Guinness Brewery in
Dublin, Ireland.
Gosset realized that his
inferences with small sample
data seemed to be incorrect
too often – his true confidence
level was less than it was
stated to be!
He worked out the proper
method that took into account
substituting s for σ.
But he had to publish under a
pseudonym: Student.
Gosset’s theory is based on
the distribution of the quantity
t = (x̄ − µ)/(s/√n).
This looks like the z-score for
x̄, except that s replaces σ in
the denominator.

Student’s t distributions
Suppose that an SRS of size n is drawn from a
Normal(µ,σ) population.
When σ is known, the sampling distribution for z = (x̄ − µ)/(σ/√n)
is Normal(0,1).
When σ is estimated by the sample standard deviation s, the
sampling distribution for t = (x̄ − µ)/(s/√n) will be very close to normal if the
sample size n is large. This is because for large n, s will be a very
reliable estimator of σ.
However, in the case that n is not so large, the variability in s will
have an impact on the distribution.
It is clear that the impact it has depends on the sample size.

Student’s t distributions
When σ is estimated by the sample standard deviation s, the
sampling distribution for t = (x̄ − µ)/(s/√n) will depend on the sample size.
The sampling distribution of t = (x̄ − µ)/(s/√n) is a t distribution with
n − 1 degrees of freedom.
The degrees of freedom (df) is a measure of how well s estimates
σ. The larger the degrees of freedom, the better σ is estimated.
This means we need a new set of tables!
When n is very large, s is a very good estimate of σ, and the
corresponding t distributions are very close to the normal distribution.
The t distributions become wider (thicker tailed) for smaller sample
sizes, reflecting that s can be smaller than σ, so the corresponding
t-transform is more likely to take extreme values than the z-transform.

Impact on confidence intervals
Suppose we want to construct the C% confidence interval for the mean.
The standard deviation is unknown, so as well as estimating the mean
we also estimate the standard deviation from the sample.
Practical use of t: t*
t* is related to the chosen confidence level C.
C is the area under Student’s t curve between −t* and t*.
The confidence interval is thus: x̄ ± t*×s/√n.
Example: For an 80% confidence level C, 80% of Student’s t curve’s
area is contained in the interval.
[Figure: Student’s t curve with area C between −t* and t*]

Confidence level and the margin of error
The confidence level C determines the value of t* (in Table D).
The margin of error also depends on t*: m = t*×s/√n.
- Higher confidence C implies a larger
margin of error m (thus less precision
in our estimates).
- A lower confidence level C produces
a smaller margin of error m (thus
better precision in our estimates).
- We find t* in the line of Table D for df
= n−1 and confidence level C.
[Figure: Student’s t curve with area C between −t* and t*]

Table D
When σ is unknown,
we use a t distribution
with “n−1” degrees of
freedom (df): t = (x̄ − µ)/(s/√n).
Table D shows the
z-values and t-values
corresponding to
landmark P-values/
confidence levels.
When the sample is
very large, we use the
normal distribution
and the standardized
z-value.
Focus first on 2.5%. For each row, the 2.5% corresponds to the area
in each of the left and right tails of the t-distribution with that number of
degrees of freedom (df = n − 1). Remember a distribution gives the
chance/likelihood of certain outcomes.
Recall that for a normal distribution, the point where we get 2.5% in
each of the left and right tails of the distribution is 1.96.
If we go down the table, we see that as the degrees of freedom
increase the value corresponding to 2.5% goes from 12.71 (for df=1)
to a number that is very close to 1.96 for extremely large n.
This means that for small n the variability of the standard deviation s
means that the chance of the t-transform being extreme is relatively
large.
However, as n grows, the estimator of the standard deviation
improves, and the t-transform gets closer to a normal distribution.
You will observe the same is true for other percentages. Take a look
at 5% and 0.5% and look down the table.

Calculation practice (red wine 1)
It has been suggested that drinking red wine in moderation may protect against
heart attacks. This is because red wine contains polyphenols which act on blood
cholesterol.
To see if moderate red wine consumption increases the average blood level of
polyphenols, a group of nine randomly selected healthy men were assigned to
drink half a bottle of red wine daily for two weeks. The percent changes in their
blood polyphenol levels are presented here:
0.7, 3.5, 4.0, 4.9, 5.5, 7.0, 7.4, 8.1, 8.4
Sample average x̄ = 5.50
Sample standard deviation s = 2.517
Degrees of freedom df = n − 1 = 8
We will encounter two problems
when doing the analysis. The first is
that the sample size is not huge so
we have to hope that the sample
mean is close to normal. The
second is the standard deviation is
unknown and has to be estimated
from the data.
What is the 95% confidence interval for the average percent change?
First, we determine what t* is. The degrees of freedom are df =
n − 1 = 8 and C = 95%.
From Table D we get t* = 2.306.
(…)
The margin of error m is: m = t* × s/√n = 2.306 × 2.517/√9 ≈ 1.93. So
the 95% confidence interval is 5.50 ± 1.93, or 3.57 to 7.43.
We can say “With 95% confidence, the mean percent increase
is between 3.57% and 7.43%.”
What if we want a 99% confidence interval instead?
For C = 99% and df = 8, we find t* = 3.355. Thus m = 3.355 × 2.517/√9 ≈ 2.81.
Now, with 99% confidence, we only can conclude the mean is
between 2.69 and 8.31. (A big price to pay for the extra confidence.)
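The red wine intervals can be reproduced directly from the nine data values. A minimal sketch: the helper name `t_interval` is made up for illustration, and the t* values are typed in from Table D as quoted in the slides rather than computed.

```python
import math
from statistics import mean, stdev

# Percent changes in blood polyphenol level for the nine men.
changes = [0.7, 3.5, 4.0, 4.9, 5.5, 7.0, 7.4, 8.1, 8.4]

def t_interval(data, t_star):
    """One-sample t interval: mean +/- t* * s / sqrt(n), with t* from a table."""
    n = len(data)
    margin = t_star * stdev(data) / math.sqrt(n)  # stdev divides by n - 1
    center = mean(data)
    return center - margin, center + margin

ci95 = t_interval(changes, t_star=2.306)  # df = 8, C = 95% (Table D)
ci99 = t_interval(changes, t_star=3.355)  # df = 8, C = 99% (Table D)
print([round(v, 2) for v in ci95])
print([round(v, 2) for v in ci99])
```

Only t* changes between the two intervals; the center and the standard error s/√n stay the same.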

Calculation practice (red wine 2)
Let us return to the same study, but this time we increase the sample size to 15
men. The data are now:
0.7, 3.5, 4.0, 4.9, 5.5, 7.0, 7.4, 8.1, 8.4, 3.2, 0.8, 4.3, -0.2, -0.6, 7.5
The sample mean in this case is 4.3 and the sample standard deviation is 3.07.
Since the sample size has increased, it is likely that the sample standard
deviation is a more reliable estimator of the true standard deviation.
The number of degrees of freedom is 14.
Just as in the previous example we can construct a 95% confidence interval but
now we use 14 df instead of 8 df.
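As a sketch of that calculation: the slides do not quote t* for 14 degrees of freedom, so the value 2.145 below is an assumption taken from a standard t table (95%, df = 14). Everything else is computed from the fifteen data values.

```python
import math
from statistics import mean, stdev

# Percent changes for the fifteen men.
changes15 = [0.7, 3.5, 4.0, 4.9, 5.5, 7.0, 7.4, 8.1, 8.4,
             3.2, 0.8, 4.3, -0.2, -0.6, 7.5]

T_STAR_14 = 2.145   # assumed: standard table value for df = 14, C = 95%

n = len(changes15)
x_bar = mean(changes15)
s = stdev(changes15)                  # sample standard deviation (n - 1)
margin = T_STAR_14 * s / math.sqrt(n)
low15, high15 = x_bar - margin, x_bar + margin
print(round(x_bar, 2), round(s, 2), round(low15, 2), round(high15, 2))
```

Under these assumptions the interval is roughly 2.6 to 6.0. Both t* and the standard error are smaller than with nine men, even though s itself is larger in this sample.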

More calculation practice
Let us return to the example of prices of apartments in Dallas. 10
apartments are randomly sampled. The sample mean and the
sample standard deviation based on this sample are 980 dollars and
250 dollars (both are estimates based on a sample of size ten).
Construct a 95% confidence interval for the mean:
The standard error is 250/√10 = 79.
Looking up the t-tables at 2.5% and 9 degrees of freedom gives 2.262.
The 95% confidence interval for the mean is [980 ±
2.262×79]=[801,1159].
Suppose we want to know whether the price of apartments has
increased since last year, when the mean price was 850 dollars.
Based on this interval we see that 850 dollars is contained in
the interval, together with higher values. This means the mean could be
850 dollars or higher. Therefore,
given the sample, it is unclear whether the mean price of apartments has
increased since last year or not.

Example: comparing z and t-values
We want to calculate a 99% CI for the mean weight of a newborn
calf. To do this upload the calf data into Statcrunch.
Go to Stat; from here you have two options. If we treat the standard
deviation of calf weights as known (not random), then we can use
the z-statistic, else we need to use the t-statistic.
Suppose we choose t-statistic option -> one-sample -> with data ->
then choose the variable of weights at birth (wt 0). To get the 99%
CI we need to select 99% on the second page of the options. We
get the 99% interval (using a t-distribution with 43 degrees of
freedom) [90.05,96.37] pounds. This means with 99% confidence we
believe the mean weight of newborn calves lies in this interval.
To see how well the normal distribution works, we do the same, but
choose the z-statistic option. This gives the 99% confidence interval
[90.19,96.23]. Notice that it is slightly narrower, because it does not
take into account the possible underestimation of the standard deviation.

More calculation practice
Let us return to the M&M data. Suppose we want to calculate a 99%
confidence interval for the mean number of M&Ms in plain, peanut
butter and peanut M&Ms. These can be calculated using the
summary statistics output:
Summary statistics for Total:
Group by: Type
Type  n   Mean       Variance  Std. Dev.  Std. Err.   Median  Range  Min  Max  Q1  Q3
M     84  17.297619  8.259753  2.8739786  0.3135768   18      14     7    21   17  19
P     40  8.675      9.814744  3.1328492  0.49534693  8       15     6    21   7   8
PB    46  10.913043  3.325604  1.8236238  0.26887867  11      10     8    18   10  11
Using this output we can calculate the confidence intervals for the mean
number of M&Ms in each type.
Do this.
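One way to do the calculation from the summary output alone is sketched below. The means and standard errors are read from the output (the mean for type P is taken as 8.675). The t* values for 99% confidence with 83, 39 and 45 degrees of freedom are assumptions taken from standard t tables, not from the slides, and the name `summary` is made up for illustration.

```python
# (n, mean, standard error, assumed 99% t* for df = n - 1)
summary = {
    "M":  (84, 17.297619, 0.3135768, 2.636),
    "P":  (40, 8.675, 0.49534693, 2.708),
    "PB": (46, 10.913043, 0.26887867, 2.690),
}

intervals = {}
for kind, (n, x_bar, std_err, t_star) in summary.items():
    margin = t_star * std_err       # the standard error already equals s / sqrt(n)
    intervals[kind] = (x_bar - margin, x_bar + margin)
    low, high = intervals[kind]
    print(f"{kind}: df = {n - 1}, 99% CI = [{low:.2f}, {high:.2f}]")
```

Because the output already reports the standard error, there is no need to divide the standard deviation by √n again.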

Statcrunch will also give the CIs
Go to Stats -> t-statistics -> one-sample -> with data -> select the
column you want to analyse (choose the Group by if you want it
grouped), on the next page select confidence interval and the level
you want it at.
Sample mean  Std. err  DF  L limit  U limit
17.2         0.31      83  16.4     18.12
8.6          0.49      39  7.33     10.01
10.9         0.268     45  10.18    11.63
Looking at the intervals, do you think that the mean number of M&Ms in a
plain and a peanut bag could be the same?
What about the mean number in peanut and peanut butter?
Later on we shall make a formal test of these questions.

Calculation practice: coffee shop sales
A marketing firm randomly samples 45 coffee shops and
determines their annual sales. The sample has an average
of $2.67 million and a standard deviation of $1.03 million.
What can we say with 90% confidence about the mean
annual sales for the population of all coffee shops?
The degrees of freedom is 45−1 = 44.
For 90% confidence, we find t* = 1.680.
The margin of error is 1.680×1.03/√45 = 0.258.
So the interval x̄ ± t*×s/√n for the true mean is 2.67 ± 0.26.
“We conclude that the mean annual sales of all coffee shops is between
$2.41 million and $2.93 million, with 90% confidence.”
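When only the summary numbers are available (mean, standard deviation and sample size), the t interval needs nothing more than the formula x̄ ± t*×s/√n. A small sketch of the coffee shop calculation, with the helper name `t_interval_from_summary` made up for illustration and t* = 1.680 typed in as quoted above:

```python
import math

def t_interval_from_summary(x_bar, s, n, t_star):
    """One-sample t interval from summary statistics and a tabulated t*."""
    margin = t_star * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin, margin

# Coffee shops: mean 2.67, s = 1.03 (both in $ million), n = 45, 90% confidence.
low, high, margin = t_interval_from_summary(2.67, 1.03, 45, t_star=1.680)
print(round(margin, 3), round(low, 2), round(high, 2))
```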

Summary of confidence interval for µ
The confidence interval for a population mean µ is x̄ ± t*×s/√n.
t* is obtained from Student’s t distribution using n−1 degrees of
freedom. (Table D in the textbook.)
t* is the value such that the confidence level C is the area between
–t* and t*.
Confidence is the proportion of samples that lead to a correct
conclusion (for a specific method of inference).
The investigator chooses the confidence level C.
Tradeoff: more confidence means bigger margin of error, wider
intervals.
The degrees of freedom is associated with s, the estimate for σ.
The margin of error t*×s/√n also depends on the sample size:
larger samples are better.

Sample size and experimental design
An investigator may need a certain margin of error m (e.g., in a
marketing survey, in a drug trial, etc.).
So plan ahead what sample size to use to achieve that margin of
error.
You will have to guess the value of σ, perhaps from historical data,
and you will not know the degrees of freedom at first. But you can do
a rough calculation:
m ≈ z*×σ/√n ⇔ n ≈ (z*σ/m)².
This is done in the planning stages of the study. It is not an inference
or conclusion and there are no data yet.
Remember, too, that there typically are costs and constraints
associated with large samples. Economy and feasibility are factors
that will tend to keep sample sizes smaller.

Interpretation of confidence, again
The confidence level C is the proportion of all possible random
samples (of size n) that will give results leading to a correct
conclusion, for a specific method.
In other words, if many random samples were obtained and
confidence intervals were constructed from their data with C = 95%
then 95% of the intervals would contain the true parameter value.
In the same way, if an investigator always uses C = 95% then 95%
of the confidence intervals he constructs will contain the parameter
value being estimated.
But he never knows which ones do!
Changing the method (such as changing the value of t*) will change
the confidence level.
Once computed, any individual confidence interval either will or will
not contain the true population parameter value. It is not random.
It is not correct to say C is the probability that the true value falls in
the particular interval you have computed.

Cautions about using x̄ ± t*×s/√n
This formula is only for inference about µ, the population mean.
Different formulas are used for inference about other parameters.
The data must be a simple random sample from the population.
The formula is not quite correct for other sampling designs. (But see
a statistician to get the right inference method.)
Confidence intervals based on t* are not resistant to outliers.
If n is small and the population is not normal, the true confidence
level could be smaller than C. (Usually n ≥ 30 suffices unless the
data are highly skewed.)
This inference cannot rescue sampling bias, badly produced data or
computational errors.