SAMPLING DISTRIBUTIONS PART I: ESTIMATING A POPULATION

Confidence Intervals
In statistics, we often prefer to construct an interval estimate of the unknown population
parameter. An interval estimate proposes a range of values to account for the inherent
uncertainty in any estimation process.
The most common type of interval estimate is the confidence interval, which employs an interval
formula that, a priori, contains the true parameter with high probability. Unfortunately,
once a particular interval is calculated from sample data, we never know whether it is among those
that contain the true parameter or those that do not.
From our knowledge of the Central Limit Theorem and the standard normal distribution (Z), we
know that the following probability statement is approximately true for large enough n:
\[ P\left( -z_{\alpha/2} \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2} \right) = 1 - \alpha \]
where the “cutoff” point $z_{\alpha/2}$ is chosen so that $\alpha/2$ area is nestled in the upper tail.
[Figure: Standard Normal (Z) density over Z-values with the cutoff $z_{\alpha/2}$ marked; the area in each tail is exactly $\alpha/2$.]
Q: How would you compute the z-value $z_{\alpha/2}$ needed here?
A: Take the given $\alpha$ and compute either
(1) =NORMSINV(1-α/2)
or
(2) =NORMINV(1-α/2, 0, 1)
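For readers working outside Excel, the same cutoff can be sketched in Python. This is a minimal illustration that assumes the standard library's statistics.NormalDist is an acceptable stand-in for NORMSINV; the helper name z_cutoff is ours, not part of the notes.

```python
from statistics import NormalDist

def z_cutoff(alpha):
    """Return z_{alpha/2}: the standard normal value with area alpha/2 above it."""
    # inv_cdf(q) returns the value with area q BELOW it, hence 1 - alpha/2.
    return NormalDist(mu=0.0, sigma=1.0).inv_cdf(1 - alpha / 2)

# Cutoffs for 90%, 95% and 99% confidence levels.
for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha:.2f}   z = {z_cutoff(alpha):.4f}")
```

Note that the argument is 1-α/2, not α/2, because the inverse function works with the area to the left of the cutoff.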
The preceding probability statement can be rearranged to state
\[ P\left( \bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \right) = 1 - \alpha. \]
This says the formula $\bar{X} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$ will “work” (contain the true mean $\mu$) $100(1-\alpha)\%$ of
the time in the long run. If we replace $\bar{X}$ with an observed sample mean $\bar{x}$, then we get a
sample interval $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$, which is called a $100(1-\alpha)\%$ confidence interval for the mean.
Observe that we still do not know if the computed interval is one of the instances where the
formula “worked.” If the underlying population we draw from is normally distributed, the
confidence interval is said to be an exact $100(1-\alpha)\%$ confidence interval. Otherwise we call it
an approximate $100(1-\alpha)\%$ confidence interval. The probability $\alpha$ is assumed to be small,
typically .05 or less. This is the probability that the formula fails in the long run.
Copyright John Semple 2007
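The “long run” interpretation can be illustrated with a small simulation. The sketch below is not part of the original notes: the function names, the normal population, and the values of mu, sigma, n and the seed are all hypothetical choices made only for illustration.

```python
import random
from statistics import NormalDist, fmean

def z_interval(xbar, sigma, n, conf=0.95):
    """Interval xbar +/- z_{alpha/2} * sigma / sqrt(n), with sigma assumed known."""
    alpha = 1 - conf
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma / n ** 0.5
    return xbar - half_width, xbar + half_width

def coverage(mu, sigma, n, conf=0.95, trials=4000, seed=1):
    """Fraction of simulated samples whose interval contains the true mean mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        lo, hi = z_interval(fmean(sample), sigma, n, conf)
        hits += lo <= mu <= hi
    return hits / trials

# Hypothetical population: mean 200, standard deviation 70, samples of size 25.
print(coverage(mu=200.0, sigma=70.0, n=25))
```

The printed fraction should land near 0.95. Any single interval either contains the true mean or it does not; only the long-run fraction is controlled.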
The confidence interval formula stated earlier is not very useful unless $\sigma$ (or equivalently $\sigma^2$) is
known. Soon we’ll see how to replace $\sigma^2$ with a sample estimate, which creates only minor
changes in the formula.
Example. Assuming $\sigma^2 = 4900$, construct a 95% confidence interval for the mean repair cost in
the preceding car repair example. Recall that the sample mean was $\bar{x} = \$215.66$.
Solution.
Confidence Intervals for a Population Proportion (*Optional)
If you haven’t seen this formula in the real world, then you don’t watch TV. We are awash in
TV polls spewing sample statistics regarding the percentage of people that believe in something,
suffer from something, would vote for someone, etc. The next time you see a poll on TV, glance
at the bottom of the screen. They usually state a margin of error (the “sampling error”).
Suppose you want to determine the percentage of Texans who own firearms. If you take a
random sample of n people from the state, a certain sample percentage will own guns. How does
this sample percentage compare to the true population percentage? Let p denote the true
population percentage and let p̂ denote the sample percentage. Each individual is just a random
draw from a population whose distribution is
X                      Probability
1 (Own gun)            p
0 (Do not own gun)     1-p
You can check that the mean of this population is p, and the variance is p(1-p). Applying the
CLT (standard normal version) to this problem means
\[ \frac{\hat{p} - p}{\sqrt{p(1-p)/n}} \sim Z. \]
This results in a $(1-\alpha)100\%$ confidence interval for p given by $\hat{p} \pm z_{\alpha/2}\sqrt{p(1-p)/n}$. This
expression still involves the unknown parameter p under the square root. The most common way
to handle this is to replace p with $\hat{p}$ under the root. This approximation is acceptable provided
$n\hat{p} \ge 5$ and $n(1-\hat{p}) \ge 5$.
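The mean and variance quoted for the 0/1 population can be checked directly from its two-point distribution. The short derivation below is our own filling-in of that step, using only the probabilities p and 1-p from the table.

```latex
\[
E[X] = 1\cdot p + 0\cdot(1-p) = p, \qquad
E[X^2] = 1^2\cdot p + 0^2\cdot(1-p) = p,
\]
\[
\operatorname{Var}(X) = E[X^2] - \bigl(E[X]\bigr)^2 = p - p^2 = p(1-p).
\]
```

Since the sample percentage is just the sample mean of n such 0/1 draws, its variance is p(1-p)/n, which is the quantity under the square root.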
Confidence Interval for a Proportion
A $100(1-\alpha)\%$ confidence interval for the true population proportion p is
given by
\[ \hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \]
where $\hat{p}$ is the sample percentage.
Example (This problem was supplied by Chris Weldon, SMU MBA Class P40). A large North
Texas company acquired a new business unit to deliver company mail to each of their 15 North
Texas sites (3-5 deliveries per day) in the fall of 1997. Mail included U.S. mail, small
manufactured parts, internal documents, packages, etc. The cost of this unit is about $1,000,000
a year, and so management felt it was necessary to monitor the performance of their deliveries
(i.e., getting the right package to the right location). It was too costly to keep records of every
item, and so they decided to randomly sample items (usually by auditors at each site's mail drop)
to estimate the true delivery accuracy. In October of 1997, 8717 units were randomly checked,
of which 69 were delivered in error. Calculate a 95% confidence interval for the true proportion
of accurate deliveries.
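One way to carry out this kind of calculation outside Excel is sketched below. The function name proportion_interval is ours, and the only inputs used are the counts stated in the example (8717 items checked, 69 in error); treat it as an illustration of the boxed formula rather than an official solution.

```python
from statistics import NormalDist

def proportion_interval(successes, n, conf=0.95):
    """Interval p_hat +/- z_{alpha/2} * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    # Rule of thumb from the text for replacing p with p_hat under the root.
    if n * p_hat < 5 or n * (1 - p_hat) < 5:
        raise ValueError("need n*p_hat >= 5 and n*(1 - p_hat) >= 5")
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    half_width = z * (p_hat * (1 - p_hat) / n) ** 0.5
    return p_hat - half_width, p_hat + half_width

checked = 8717
errors = 69
accurate = checked - errors          # items delivered correctly
lo, hi = proportion_interval(accurate, checked, conf=0.95)
print(f"95% CI for the proportion of accurate deliveries: {lo:.4f} to {hi:.4f}")
```

The half-width printed here plays the role of the “margin of error” quoted in TV polls.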
Estimating an Unknown Population Variance
In the vast majority of real life data situations, you will not know the population variance $\sigma^2$.
The formulas for confidence intervals developed earlier do not apply “as is.”
There are a number of ways to handle the problem, but most focus on using an estimate of the
true population variance $\sigma^2$. Given a random sample $(X_1, X_2, \ldots, X_n)$ from a large population,
the most common estimate of the population variance is the sample variance, denoted by $s^2$, and
given by the formula
\[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2. \]
This value is computed in Excel with the VAR(:) function. The argument (:) references the
spreadsheet cells containing the sample data. The sample standard deviation is s (the square root
of the sample variance), and it is given by the Excel function STDEV(:). Alternatively, s can be
calculated by taking the square root of VAR(:).
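For comparison with VAR(:) and STDEV(:), here is a minimal Python sketch of the same formula. The data values are made up purely for illustration, and sample_variance is a hypothetical helper name; statistics.variance and statistics.stdev are the standard library's n-1 versions.

```python
import statistics

def sample_variance(data):
    """s^2 = sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(data)
    xbar = sum(data) / n
    return sum((x - xbar) ** 2 for x in data) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]      # hypothetical sample
s2 = sample_variance(data)
s = s2 ** 0.5                        # sample standard deviation

print(s2, statistics.variance(data)) # the two variances should agree
print(s, statistics.stdev(data))     # and so should the standard deviations
```

Dividing by n-1 rather than n is what distinguishes the sample variance from the population variance formula.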
Confidence Intervals with an Unknown Population Variance: The t Distribution
In our earlier confidence interval formulas, we assumed that the standard deviation $\sigma$ was
known (see the hail dent analysis). This allowed us to use z-values from a standard normal and
compute confidence intervals. If we do not know $\sigma$, it is tempting to simply insert the sample
estimate s in its place. Fortunately, we can do this with only minor modifications. However,
this requires an understanding of a new distribution called the t distribution.
If a random sample is drawn from a normal distribution, the distribution of $(\bar{X} - \mu)/(s/\sqrt{n})$
follows a t distribution with n-1 degrees of freedom (df). Note that the denominator has an s
instead of a $\sigma$. The “degrees of freedom” is a parameter that you do not need to estimate since it
depends directly on the sample size n. The t distribution is symmetric about 0 and looks a lot
like the normal distribution (especially as the degrees of freedom increase). The general shape
for various degrees of freedom is depicted below.
[Figure: t distribution vs. z distribution, centered at 0, showing t with 10 df, t with 20 df, and t with df > 30 (almost normal).]
To calculate probabilities for the t distribution, we use the TDIST function in Excel. The value
p=TDIST(x, df, 1) is the area (=probability) above a nonnegative value x from a t distribution
with df degrees of freedom. The value p=TDIST(x, df, 2) is the combined area above x and
below –x.
In a picture,
[Figure: t density sketches with -x and x marked on the axis, one labeled Probability = p = TDIST(x, df, 1) and one labeled Probability = p = TDIST(x, df, 2).]
To convert a probability into a t-value, we use the TINV function. In Excel, TINV(p,df) is the
value x such that p/2 is the area above x in a t distribution with df degrees of freedom.
In a picture,
[Figure: t density with AREA = p/2 shaded above the point x=TINV(p,df).]
One must remember that TINV splits the given probability (p) in half when calculating the
x value.
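To make the one-tail versus two-tail conventions concrete, the sketch below mimics the behavior described above in plain Python. It is not Excel's implementation: the helper names t_pdf, tdist and tinv are ours, the tail area is approximated by Simpson's rule, and the inverse is found by bisection.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def tdist(x, df, tails, steps=2000):
    """Area above x (tails=1), or above x plus below -x (tails=2), for x >= 0."""
    if x < 0 or tails not in (1, 2):
        raise ValueError("need x >= 0 and tails equal to 1 or 2")
    # Simpson's rule (steps is even) for the area between 0 and x.
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    upper_tail = 0.5 - total * h / 3   # by symmetry, half the area lies above 0
    return tails * upper_tail

def tinv(p, df):
    """The x with area p/2 above it, so that tdist(x, df, 2) equals p."""
    lo, hi = 0.0, 1.0
    while tdist(hi, df, 2) > p:        # widen until the answer is bracketed
        hi *= 2
    for _ in range(60):                # bisection on the two-tail area
        mid = (lo + hi) / 2
        if tdist(mid, df, 2) > p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

x = tinv(0.05, 10)
print(round(x, 3), round(tdist(x, 10, 1), 4), round(tdist(x, 10, 2), 4))
```

Feeding the result of tinv(p, df) back into tdist with tails=2 recovers p, while tails=1 gives p/2, which is the halving that the text warns about.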
Confidence Intervals using the t Distribution
If we continue to assume that the population we are sampling from is normal (or approximately
normal), then we can construct a $100(1-\alpha)\%$ confidence interval (CI) for the mean using the
formula
\[ \bar{x} \pm t_{\alpha/2,\,df}\,\frac{s}{\sqrt{n}}, \]
where df = n-1.
Example. (Brent Pope, SMU MBA Class 46P). Airco has inspected 81 armature hubs and
measured their roughness. The data is provided in Hubs.xls. Calculate a 95% confidence
interval (CI) for the mean roughness. Calculate a 98% confidence interval (CI) for the mean
roughness.
Example. (Tamir Ayad, SMU MBA Class P43P) A new production process is being tested to
reduce the number of “large” particles (2 microns and higher) that contaminate silicon wafers. A
random sample of n=81 wafers is drawn and the large contaminating particles counted. The
following sample statistics were computed: $\bar{x} = 1.88$, $s^2 = 2.71$. Construct a 99% confidence interval for the
mean of the new process.
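As a sketch of how the t-interval formula might be applied to these wafer statistics, assume the table value $t_{.005,\,80} \approx 2.639$ (the value =TINV(0.01, 80) should return in Excel). The rounding below is ours, and this is offered as an illustration rather than the official solution.

```latex
\[
s = \sqrt{2.71} \approx 1.646, \qquad
\frac{s}{\sqrt{n}} = \frac{1.646}{\sqrt{81}} \approx 0.1829,
\]
\[
\bar{x} \pm t_{.005,\,80}\,\frac{s}{\sqrt{n}}
  \approx 1.88 \pm (2.639)(0.1829)
  \approx 1.88 \pm 0.483 .
\]
```

Under these assumptions the 99% interval runs from roughly 1.40 to 2.36 large particles per wafer.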
Note: When there is sufficient data, you should use the HISTOGRAM function in DATA
ANALYSIS (under TOOLS) to see if the data appear to be roughly normally distributed when
using the t distribution. Double click on HISTOGRAM and fill out your options (select “Chart
Output” to get the visual display or histogram).
FAQ: Selecting a Sample Size (*Optional)
Suppose you want to construct a confidence interval to estimate the mean of some population.
Note that the width of the confidence interval $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$ depends on quantities that do not
change with the sample mean. In fact, the width only depends on $z_{\alpha/2}$, $\sigma$ and n. If a level of
confidence is specified ($1-\alpha$) and $\sigma$ is known (or can be estimated), then the width of the
confidence interval can be made narrower by selecting a larger value of n.
Example. Suppose in the KIA Motors example we want a 95% confidence interval whose width
is $20. How big a sample should we take?
Solution. This means the half-width is $10. We need to solve
\[ z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} = \$10 \]
for n.
The general formula is
\[ n = \left( \frac{z_{\alpha/2}\,\sigma}{D} \right)^2 \]
where 1) $z_{\alpha/2}$ is determined from an N(0,1) table after the level $1-\alpha$ is specified
2) $\sigma$ is the population standard deviation which is known (or approximated)
3) D is the desired half-width of the confidence interval.
In practice, the standard deviation can often be approximated by the formula
\[ \sigma \approx \frac{\text{Range}}{4}, \]
where the Range is the difference (in absolute value) between the biggest value and the smallest
value in a random sample.
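The sample-size formula and the Range/4 rule can be sketched in a few lines of Python. The function names are ours, rounding up to the next whole observation is our choice (the formula rarely yields an integer), and the value sigma = 70 in the usage line is an assumption borrowed from the earlier repair-cost example ($\sigma^2 = 4900$), not a figure stated for this problem.

```python
import math
from statistics import NormalDist

def required_sample_size(sigma, half_width, conf=0.95):
    """Smallest whole n with z_{alpha/2} * sigma / sqrt(n) <= half_width."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return math.ceil((z * sigma / half_width) ** 2)

def sigma_from_range(smallest, largest):
    """Rough standard deviation estimate: Range / 4."""
    return abs(largest - smallest) / 4

# Assumed sigma of $70 and a desired half-width D of $10 at 95% confidence.
print(required_sample_size(sigma=70.0, half_width=10.0, conf=0.95))

# Hypothetical pilot sample whose smallest and largest values are $100 and $380.
print(sigma_from_range(100.0, 380.0))
```

Because n grows with the square of sigma/D, halving the desired half-width requires roughly four times as many observations.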

Assignment 3
Central Limit Theorem Problems
1. Book, 7.24 (Use CLT for n ≥ 30) NOTE: On part (d), simply determine if the sample size in
part (c) is adequate. What would you recommend with respect to the sample size n=45?
2. Book, 7.25 (Use CLT for n ≥ 30)
Confidence Interval Problems
3. Book, 8.17 (Remember, if you use s for σ, use t!)
4. Book, 8.21
5. Book, 8.22
Bonus (5 pts)
Book 8.37 (this requires calculating a confidence interval for a proportion)