Estimation with Confidence Intervals

Biostatistics
Unit 6
Confidence Intervals
1
Statistical inference
• Statistical inference is the procedure by
which we reach a conclusion about a
population on the basis of the information
contained in a sample drawn from that
population.
• Estimation involves the use of the data in
the sample to calculate a statistic that
approximates the corresponding
parameter in the population from which the
sample was drawn.
2
Types of estimates
• A point estimate is a single numerical
value used to estimate the
corresponding population parameter.
• An interval estimate consists of two
numerical values defining a range that, with
a specified degree of confidence, we believe
includes the parameter being estimated.
3
Estimator
• An estimator is a rule or formula that
tells how to compute the estimate.
• An estimator is unbiased if its expected
value equals the population parameter
being estimated.
4
Table of unbiased estimators
5
Sampled and target populations
• The sampled population is the
population from which we actually draw
the sample.
• The target population is the
population about which we wish to
make an inference.
(continued)
6
Sampled and target populations
• These two populations may or may not be
the same.
• When they are the same, it is possible to
use statistical inference procedures to make
conclusions about the target population.
• If the sample and target populations are
different, conclusions can be made about
the target population only on the basis of
nonstatistical considerations.
7
Random and nonrandom samples
The strict validity of statistical procedures
depends on the assumption of random
samples.
8
Confidence intervals to be studied
A) Confidence Interval for a Population mean
B) Confidence Interval for the Difference of Two
Population Means
C) Confidence Interval for a Population Proportion
D) Confidence Interval for the Difference of Two
Population Proportions
E) Confidence Interval for the Variance of a Normally
Distributed Population
F) Confidence Interval for the Ratio of Variances of
Two Normally Distributed Populations
9
A) Confidence interval for a population mean
Estimating the mean
• Estimating the mean of a normally
distributed population entails drawing a
sample of size n and computing $\bar{x}$, which is
used as a point estimate of $\mu$.
• It is more meaningful to estimate $\mu$ by an
interval that communicates information
regarding the probable magnitude of $\mu$.
10
Sampling distributions and estimation
Interval estimates are based on
sampling distributions. When the sample
mean is being used as an estimator of a
population mean, and the population is
normally distributed, the sample mean
will be normally distributed with mean,
$\mu_{\bar{x}}$, equal to the population mean, $\mu$, and
variance of $\sigma_{\bar{x}}^2 = \sigma^2/n$.
11
The 95% confidence interval
• About 95% of the values of $\bar{x}$ making up the
distribution will lie within two standard
deviations of the mean.
• The exact multiplier is 1.96.
• The interval is marked by the two
points, $\mu - 1.96\sigma_{\bar{x}}$ and $\mu + 1.96\sigma_{\bar{x}}$, so
that 95% of the values are in the
interval $\mu \pm 1.96\sigma_{\bar{x}}$.
12
The 95% confidence interval
• Since $\mu$ is unknown, the
location of the distribution is uncertain.
• We can use $\bar{x}$ as a point estimate of $\mu$.
• In constructing intervals of $\bar{x} \pm 1.96\sigma_{\bar{x}}$,
about 95% of these intervals would
contain $\mu$.
13
Example
Suppose a researcher, interested in obtaining an
estimate of the average level of some enzyme in a
certain human population, takes a sample of 10
individuals, determines the level of the enzyme in
each, and computes a sample mean of $\bar{x}$ = 22.
Suppose further it is known that the variable of
interest is approximately normally distributed with a
variance of 45. We wish to estimate $\mu$.
14
Solution
$\bar{x} \pm 1.96\,\sigma_{\bar{x}}$
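The arithmetic for this example can be sketched in a few lines of Python. This is only an illustration: the function name `z_interval` is hypothetical, and the inputs (mean 22, variance 45, n = 10, z = 1.96) are taken from the example slide.

```python
import math

def z_interval(xbar, variance, n, z=1.96):
    """Return (lower, upper) for xbar ± z * sqrt(variance / n)."""
    standard_error = math.sqrt(variance / n)  # sigma of the sample mean
    margin = z * standard_error
    return xbar - margin, xbar + margin

# Enzyme example: sample mean 22, known variance 45, n = 10
lower, upper = z_interval(xbar=22, variance=45, n=10)
print(f"95% CI for the mean: ({lower:.2f}, {upper:.2f})")
```

The interval is centered on the sample mean, and its half-width is the reliability coefficient times the standard error.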
15
Components of an interval estimate
• The interval estimate of $\mu$ is centered
on the point estimate of $\mu$.
• 95% of the values of the standard
normal curve lie within 1.96 standard
deviations of the mean.
• The z score of 1.96 used in this case is
called the reliability coefficient.
16
General expression for an interval estimate
estimator ± (reliability coefficient) × (standard error)
17
Table of reliability coefficients
for confidence intervals
18
Interpretation of confidence intervals
The interval estimate for $\mu$ is expressed
as:
$\bar{x} \pm z_{1-\alpha/2}\,\sigma_{\bar{x}}$
If $\alpha = .05$, we can say that, in repeated
sampling, 95% of the intervals constructed
this way will include $\mu$. This is based on
the probability of occurrence of different
values of $\bar{x}$.
(continued)
19
Interpretation of confidence intervals
The area under the curve of $\bar{x}$ that lies outside
the interval is called $\alpha$.
The area inside the interval is
called $1-\alpha$.
20
Probabilistic interpretation of the interval
In repeated sampling from a normally
distributed population with a known
standard deviation, $100(1-\alpha)$ percent of
all intervals of the form
$\bar{x} \pm z_{1-\alpha/2}\,\sigma_{\bar{x}}$
will, in the long run, include the
population mean, $\mu$.
(continued)
21
Probabilistic interpretation of the interval
The quantity $1-\alpha$ is called the
confidence coefficient or confidence
level, and the
interval,
$\bar{x} \pm z_{1-\alpha/2}\,\sigma_{\bar{x}}$, is called the
confidence interval for $\mu$.
22
Practical interpretation of the interval
When sampling is from a normally
distributed population with known
standard deviation, we are $100(1-\alpha)$
percent confident that the single
computed interval,
$\bar{x} \pm z_{1-\alpha/2}\,\sigma_{\bar{x}}$,
contains the population mean, $\mu$.
23
Precision
• Precision indicates how far the
estimate may lie from the parameter
being estimated.
• Precision is found by multiplying the
reliability factor by the standard error of
the mean.
• This is also called the margin of error.
24
Exercise 6.2.2
We wish to estimate the mean serum indirect
bilirubin level of 4-day-old infants. The mean for a
sample of 16 infants was found to be 5.98
mg/dl. Assuming bilirubin levels in 4-day-old infants
are approximately normally distributed with a
standard deviation of 3.5 mg/dl, find:
A) The 90% confidence interval for $\mu$
B) The 95% confidence interval for $\mu$
C) The 99% confidence interval for $\mu$
25
Solution
(1) Given
$\bar{x}$ = 5.98
$\sigma$ = 3.5
n = 16
26
(2) Sketch
27
Solution
(3) Calculations
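A) 90% interval (z = 1.645)
$\sigma_{\bar{x}} = 3.5/\sqrt{16} = .875$
5.98 ± 1.645 (.875)
5.98 - 1.439375, 5.98 + 1.439375
(4.5406, 7.4194)
With the unrounded z = 1.6449, a calculator gives (4.5408, 7.4192).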
28
Solution
B) 95% interval (z = 1.96)
5.98 ± 1.96 (.875)
(4.265, 7.695)
29
Solution
C) 99% interval (z = 2.575)
5.98 ± 2.575 (.875)
(3.7269, 8.2331)
With the unrounded z = 2.5758, a calculator gives (3.7261, 8.2339).
30
Solution
(4) Results
A higher confidence level gives a
wider interval. There is less chance that the
interval misses $\mu$, but the estimate is less precise.
Calculator answers can differ slightly from
table-based answers because the calculator uses
unrounded values of z.
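The three intervals can be reproduced with unrounded reliability coefficients using Python's standard library. This is a sketch under the exercise's assumptions (normal population, known standard deviation 3.5, n = 16). The function name `z_interval_known_sigma` is hypothetical.

```python
from statistics import NormalDist

def z_interval_known_sigma(xbar, sigma, n, level):
    """Return (z, (lower, upper)) for a mean with known sigma."""
    alpha = 1 - level
    # Unrounded reliability coefficient z_{1 - alpha/2}
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sigma / n ** 0.5
    return z, (xbar - margin, xbar + margin)

for level in (0.90, 0.95, 0.99):
    z, (lo, hi) = z_interval_known_sigma(5.98, 3.5, 16, level)
    print(f"{level:.0%}: z = {z:.4f}, interval = ({lo:.4f}, {hi:.4f})")
```

Small differences from table-based answers come only from rounding z to three decimals.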
31
The t distribution
In most real life situations the variance of the
population is unknown. We know that the z
score,
$z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}$,
is normally distributed if the population is
normally distributed and is approximately
normally distributed when the sample is
large. But it cannot be used because $\sigma$ is
unknown.
32
Estimation of the standard deviation
The sample standard deviation,
$s = \sqrt{\sum (x_i - \bar{x})^2/(n-1)}$,
can be used to replace $\sigma$. If $n \ge 30$, then
s is a good approximation of $\sigma$. An
alternate procedure is used when the
samples are small. It is based on
Student's t distribution.
33
Student's t distribution
Student's t distribution is used as an
alternative for z with small samples. It
uses the following formula:
$t = \dfrac{\bar{x} - \mu}{s/\sqrt{n}}$
34
Student's t distribution
Student's t distribution was developed
in 1908 by W. S. Gosset (1876-1937)
who worked for the Guinness Brewery.
35
Properties of the t distribution
1. Mean = 0
2. It is symmetrical about the mean.
3. Variance is greater than 1 but approaches
1 as the sample gets large. For df > 2, the
variance = df/(df-2) or $(n-1)/(n-3)$.
(continued)
36
Properties of the t distribution
4. The range is $-\infty$ to $+\infty$.
5. t is really a family of distributions because
there is a different distribution for each value
of n-1, the divisor used in computing $s^2$.
6. Compared with the normal distribution, t is
less peaked and has higher tails.
7. t distribution approaches the normal
distribution as n-1 approaches infinity.
37
38
Confidence interval for a mean using t
General relationship
estimator ± (reliability coefficient) × (standard error)
The reliability coefficient is obtained from
the t distribution.
39
Confidence interval
When sampling is from a normal distribution
whose standard deviation, $\sigma$, is unknown, the
$100(1-\alpha)$ percent confidence interval for the
population mean, $\mu$, is given by:
$\bar{x} \pm t_{1-\alpha/2}\, s/\sqrt{n}$, with n-1 degrees of freedom
40
Deciding between z and t
• When constructing a confidence interval for
a population mean, we must decide whether
to use z or t.
• Which one to use depends on the size of the
sample, whether the population is normally
distributed or not, and whether or not the
population variance is known.
• There are various flowcharts and decision
keys that can be used to help decide. Mine
appears below.
41
Key for deciding between z and t in
confidence interval construction
1. Population normally distributed ................. 2
   Population not normally distributed ............. 5
2. Sample size is large (30 or higher) ............. 3
   Sample size is small (less than 30) ............. 4
3. Population variance is known .................... use z
   Population variance not known ................... use t (or z)
4. Population variance is known .................... use z
   Population variance is not known ................ use t
5. Sample size is large ............................ 6
   Sample size is small ............................ 7
6. Population variance is known .................... use z
   Population variance not known
   (central limit theorem applies) ................. use z
7. Must use a non-parametric method
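The key can also be written as a small function, which makes the branching explicit. This is a sketch of the key as stated on the slide, not a general rule. The function name `choose_reliability_coefficient` and its argument names are hypothetical, and the cutoff of 30 is the one the key uses.

```python
def choose_reliability_coefficient(population_normal, n, variance_known):
    """Follow the slide's key and return which method to use."""
    large_sample = n >= 30  # the key treats 30 or higher as large
    if population_normal:
        if variance_known:
            return "z"  # steps 3 and 4, variance known
        # Variance unknown: t always works; z is acceptable for large n
        return "t (or z)" if large_sample else "t"
    if large_sample:
        return "z"  # step 6: the central limit theorem applies
    return "non-parametric method"  # step 7

print(choose_reliability_coefficient(True, 10, False))
```

For the preeclampsia example that follows in the deck (normal population assumed, n = 10, variance unknown) the key leads to t.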
42
Example
In a study of preeclampsia, Kaminski and
Rechberger found the mean systolic
blood pressure of 10 healthy,
nonpregnant women to be 119 with a
standard deviation of 2.1.
(continued)
43
Example
(Preeclampsia: Development of hypertension,
albuminuria, or edema between the 20th week
of pregnancy and the first week postpartum.
Eclampsia: Coma and/or convulsive seizures
in the same time period, without other
etiology.)
44
Example
a. What is the estimated standard error of the
mean?
b. Construct the 99% confidence interval for the
mean of the population from which the 10 subjects
may be presumed to be a random sample.
c. What is the precision of the estimate?
d. What assumptions are necessary for the validity
of the confidence interval you constructed?
45
Solution
(1) Given
n = 10
$\bar{x}$ = 119
s = 2.1
46
(2) Sketch of t distribution
47
Reading the t table
48
49
(3) Calculations
$s/\sqrt{n} = 2.1/\sqrt{10}$ = .6640783086
$t_{.995}$ with 9 degrees of freedom = 3.2498
119 ± 3.2498 (.66407...)
(116.84, 121.16)
50
Solution
Precision = 3.2498 (.66407...)
= 2.158121687
Assumptions
The population is normally distributed
The 10 subjects represent a random sample
from this population.
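The standard error, precision and interval for this example can be checked with a short script. It is a sketch that takes the tabled value 3.2498 (t with 9 degrees of freedom for a 99% interval) as given rather than computing it. The variable names are illustrative.

```python
import math

n, xbar, s = 10, 119, 2.1
t_crit = 3.2498  # tabled t_{.995} with n - 1 = 9 degrees of freedom

standard_error = s / math.sqrt(n)        # part (a)
precision = t_crit * standard_error      # part (c), the margin of error
t_lower, t_upper = xbar - precision, xbar + precision  # part (b)

print(f"standard error = {standard_error:.4f}")
print(f"precision      = {precision:.4f}")
print(f"99% CI         = ({t_lower:.2f}, {t_upper:.2f})")
```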
51
B) Confidence interval for the difference of two
population means
Introduction
From each of two populations an independent
random sample is drawn. Sample
means, $\bar{x}_1$ and $\bar{x}_2$, are calculated.
(continued)
52
B) Confidence interval for the difference of two
population means Introduction
The difference is $\bar{x}_1 - \bar{x}_2$, which is an
unbiased estimator of the difference
between the two population
means, $\mu_1 - \mu_2$. The variance of the
estimator is $\sigma_1^2/n_1 + \sigma_2^2/n_2$.
53
Conditions for use
Assuming the populations are normally
distributed, there are three situations
where we would determine the $100(1-\alpha)$
percent confidence interval for $\mu_1 - \mu_2$.
(continued)
54
Conditions for use
a) where the population variances are known (use z)
b) where the population variances are unknown but
equal (use t)
c) where the population variances are unknown but
unequal (use t').
55
Population variances are known
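When the population variances are
known, the $100(1-\alpha)$ percent confidence
interval for $\mu_1 - \mu_2$ is given by
$(\bar{x}_1 - \bar{x}_2) \pm z_{1-\alpha/2}\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}$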
56
Example 6.4.1
A research team is interested in the difference
between serum uric acid levels in patients with and
without Down's syndrome. In a large hospital for the
treatment of individuals with intellectual disabilities, a
sample of 12 individuals with Down's syndrome yielded a mean
of $\bar{x}_1$ = 4.5 mg/100 ml. In a general hospital a
sample of 15 normal individuals of the same age and
sex were found to have a mean value of $\bar{x}_2$ = 3.4
mg/100 ml. If it is reasonable to assume that the
two populations of values are normally distributed
with variances equal to 1 and 1.5, find the 95%
confidence interval for $\mu_1 - \mu_2$.
57
Solution
(1) Given
n1 = 12, $\bar{x}_1$ = 4.5, $\sigma_1^2$ = 1
n2 = 15, $\bar{x}_2$ = 3.4, $\sigma_2^2$ = 1.5
58
Solution
(2) Calculations
The point estimate for $\mu_1 - \mu_2$ is
$\bar{x}_1 - \bar{x}_2$ = 4.5 - 3.4 = 1.1
59
Solution
The standard error is
$\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2} = \sqrt{1/12 + 1.5/15} = .4282$
60
Solution
The 95% confidence interval is
1.1 ± 1.96 (.4282)
(.26, 1.94)
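A short script reproduces the standard error and the interval for Example 6.4.1. It is a sketch with illustrative variable names. The population variances 1 and 1.5 are treated as known, as the example states.

```python
import math

n1, xbar1, var1 = 12, 4.5, 1.0
n2, xbar2, var2 = 15, 3.4, 1.5
z = 1.96  # reliability coefficient for 95% confidence

diff = xbar1 - xbar2
se_diff = math.sqrt(var1 / n1 + var2 / n2)  # standard error of the difference
ci_known = (diff - z * se_diff, diff + z * se_diff)

print(f"difference = {diff:.1f}, standard error = {se_diff:.4f}")
print(f"95% CI = ({ci_known[0]:.2f}, {ci_known[1]:.2f})")
```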
61
Population variances unknown but
equal
If it can be assumed that the population
variances are equal then each sample
variance is actually a point estimate of the
same quantity. Therefore, we can combine
the sample variances to form a pooled
estimate.
62
Weighted averages
The pooled estimate of the common
variance is made using weighted
averages. This means that each sample
variance is weighted by its degrees of
freedom.
63
Pooled estimate of the
variance
The pooled estimate of the variance
comes from the formula:
$s_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$
64
Standard error of the estimate
The standard error of the estimate is
$\sqrt{s_p^2/n_1 + s_p^2/n_2}$
65
Confidence interval
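The $100(1-\alpha)$ percent confidence interval for
$\mu_1 - \mu_2$ is:
$(\bar{x}_1 - \bar{x}_2) \pm t_{1-\alpha/2}\sqrt{s_p^2/n_1 + s_p^2/n_2}$, with $n_1 + n_2 - 2$ degrees of freedom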
66
Example
(1) Given
n1 = 13, $\bar{x}_1$ = 21.0, s1 = 4.9
n2 = 17, $\bar{x}_2$ = 12.1, s2 = 5.6
$\alpha$ = .05
67
Example
(2) Calculations
The point estimate for $\mu_1 - \mu_2$ is
$\bar{x}_1 - \bar{x}_2$ = 21.0 - 12.1 = 8.9
68
Example
The pooled estimate of the variance is
$s_p^2 = \dfrac{12(4.9)^2 + 16(5.6)^2}{13 + 17 - 2} = 28.21$
69
Example
The standard error is
$\sqrt{28.21/13 + 28.21/17} = 1.9569$
70
Example
The 95% confidence interval is
8.9 ± 2.0484 (1.9569)
8.9 ± 4.0085
(4.9, 12.9)
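The pooled-variance steps of this example can be sketched as follows. The tabled value 2.0484 (t with 28 degrees of freedom for a 95% interval) is taken from the slide rather than computed, and the variable names are illustrative.

```python
import math

n1, xbar1, s1 = 13, 21.0, 4.9
n2, xbar2, s2 = 17, 12.1, 5.6
t_crit = 2.0484  # tabled t_{.975} with n1 + n2 - 2 = 28 degrees of freedom

# Each sample variance is weighted by its degrees of freedom
pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
se_pooled = math.sqrt(pooled_var / n1 + pooled_var / n2)
margin_pooled = t_crit * se_pooled
ci_pooled = (xbar1 - xbar2 - margin_pooled, xbar1 - xbar2 + margin_pooled)

print(f"pooled variance = {pooled_var:.2f}, standard error = {se_pooled:.4f}")
print(f"95% CI = ({ci_pooled[0]:.1f}, {ci_pooled[1]:.1f})")
```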
71
Population variances unknown
and not equal
With unequal variances, the quantity
used to calculate the test statistic does
not follow the t distribution. A substitute
reliability factor called t' has been
proposed.
72
C) Confidence interval for a population
proportion
To begin, a sample is drawn from the population of
interest and the sample proportion, $\hat{p}$, is
calculated. This sample proportion is used as the
point estimator of the population proportion, p. The
confidence interval is defined by the general formula:
estimator ± (reliability coefficient) × (standard error)
73
Distribution
When n is large, the reliability coefficient will be z
from the standard normal distribution. Since p, the
population proportion, is unknown, we use $\hat{p}$ as an
estimate. The estimate of $\sigma_{\hat{p}}$, the
standard error, is given by:
$\sqrt{\hat{p}(1-\hat{p})/n}$
74
Confidence interval
The $100(1-\alpha)$ percent confidence interval for p is
given by:
$\hat{p} \pm z_{1-\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$
75
Probabilistic interpretation.
We say that we are 95% confident that
the population proportion, p,
lies between the calculated limits since,
in repeated sampling, about 95% of the
intervals constructed this way would
contain p.
76
Practical interpretation.
In a specific example, we would expect,
with 95% confidence, to find the
population proportion between the two
boundaries.
77
Example 6.5.2
A research study obtained data regarding sexual
behavior from a sample of unmarried men and
women between the ages of 20 and 44 residing in
geographic areas characterized by high rates of
sexually transmitted diseases and admission to drug
programs. Fifty percent of 1229 respondents
reported that they never used a condom. Construct
a 95 percent confidence interval for the population
proportion never using a condom.
78
Solution
(1) Given
n = 1229
$\hat{p}$ = .50
(for the TI-83, x = 615)
79
Solution
(2) Calculation
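The calculation for this example can be sketched in Python. It assumes the large-sample normal approximation described earlier in the section and uses z = 1.96 for 95% confidence. The function name `proportion_interval` is hypothetical.

```python
import math

def proportion_interval(p_hat, n, z=1.96):
    """Return (lower, upper) for p_hat ± z * sqrt(p_hat * (1 - p_hat) / n)."""
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z * standard_error
    return p_hat - margin, p_hat + margin

p_lower, p_upper = proportion_interval(p_hat=0.50, n=1229)
print(f"95% CI for p: ({p_lower:.3f}, {p_upper:.3f})")
```

With these inputs the interval runs from roughly .47 to .53.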
80
D) Confidence interval for the difference
of two population proportions
When studying the difference between two
population proportions, the difference between the
two sample proportions, $\hat{p}_1 - \hat{p}_2$, can be used as an
unbiased point estimator for the difference between
the two population proportions, p1 – p2. This is used
with the general formula:
estimator ± (reliability coefficient) × (standard error)
81
Distribution
When the central limit theorem applies,
the normal distribution is used to obtain
confidence intervals. The standard error
is estimated by the formula:
$\sqrt{\hat{p}_1(1-\hat{p}_1)/n_1 + \hat{p}_2(1-\hat{p}_2)/n_2}$
82
Confidence interval
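The $100(1-\alpha)$ percent confidence
interval for p1 – p2 is given by:
$(\hat{p}_1 - \hat{p}_2) \pm z_{1-\alpha/2}\sqrt{\hat{p}_1(1-\hat{p}_1)/n_1 + \hat{p}_2(1-\hat{p}_2)/n_2}$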
83
Probabilistic interpretation.
We say that we are 95% confident that
the difference between the two
population proportions, p1 – p2,
lies between the calculated limits since,
in repeated sampling, about 95% of the
intervals constructed this way would
contain p1 – p2.
84
Practical interpretation.
In a specific example, we would expect,
with 95% confidence, to find the
difference between the two population
proportions between the two limits.
85
Example 6.6.1
A study of teenage suicide included a sample of 96
boys and 123 girls between ages of 12 and 16 years
selected scientifically from admissions records to a
private psychiatric hospital. Suicide attempts were
reported by 18 of the boys and 60 of the girls. We
assume that the girls constitute a simple random
sample from a population of similar girls and likewise
for the boys. Construct a 99 percent confidence
interval for the difference between the two
proportions.
86
Solution
(1) Given
n1 = 123 (girls), $\hat{p}_1$ = 60/123 = .4878
n2 = 96 (boys), $\hat{p}_2$ = 18/96 = .1875
87
Solution
(2) Calculation
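The calculation for this example can be sketched with the standard library. It assumes the normal approximation applies and computes the 99% reliability coefficient with `statistics.NormalDist` rather than reading it from a table, so the last digits may differ slightly from a table-based answer. The variable names are illustrative.

```python
import math
from statistics import NormalDist

x1, n1 = 60, 123  # girls reporting a suicide attempt, sample size
x2, n2 = 18, 96   # boys reporting a suicide attempt, sample size

p1, p2 = x1 / n1, x2 / n2
z99 = NormalDist().inv_cdf(0.995)  # z_{1 - alpha/2} for alpha = .01

se_props = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
dp_lower = (p1 - p2) - z99 * se_props
dp_upper = (p1 - p2) + z99 * se_props

print(f"p1 - p2 = {p1 - p2:.4f}, standard error = {se_props:.4f}")
print(f"99% CI = ({dp_lower:.4f}, {dp_upper:.4f})")
```

Because the whole interval lies above zero, the data are consistent with a higher proportion among the girls in the sampled population.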
88
Determining the sample size for
estimating means
It is important to have a sample that is the correct
size. It is also important to have a method that will
allow prediction of the correct sample size for
estimating a population mean or a population
proportion. This is important especially in business
or commercial situations where money is
involved. Selecting a sample size that is too big
wastes money. One that is too small may give
inaccurate results.
89
Objectives
The width of the confidence interval is determined by
the magnitude of the margin of error which is given
by:
d = (reliability coefficient) (standard error)
The total width of the interval is twice this amount.
90
Reducing the margin of error
In the standard error, $\sigma/\sqrt{n}$, the value of $\sigma$ is a
constant. If the reliability coefficient is fixed, the only
way to reduce the margin of error is to take a larger
sample. The size of the sample depends on the size
of $\sigma$, the degree of reliability and the desired interval
width.
91
Margin of error
92
Sample size for a large population
d = (reliability coefficient) × (standard error)
$d = z\,\sigma/\sqrt{n}$
Solving for n gives
$n = z^2\sigma^2/d^2$
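As a sketch, the sample size formula for a mean, n = z²σ²/d², can be wrapped in a small function. The function name `sample_size_for_mean` and the input values below (σ = 20, d = 5, z = 1.96) are hypothetical and chosen only for illustration. The result is rounded up because a sample size must be a whole number.

```python
import math

def sample_size_for_mean(sigma, d, z=1.96):
    """Smallest n for which z * sigma / sqrt(n) does not exceed d."""
    return math.ceil((z * sigma / d) ** 2)

# Hypothetical inputs: sigma = 20, desired margin of error d = 5, 95% confidence
n_mean = sample_size_for_mean(sigma=20, d=5)
print(n_mean)
```

Halving d roughly quadruples the required sample size, since n is proportional to 1/d².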
93
Estimating $\sigma^2$
Generally the variance of the population under study
is unknown. As a result $\sigma$ has to be estimated. The
most common sources of estimates for $\sigma$ are:
1. A pilot sample which is drawn from the population
and used as an estimate of $\sigma$.
2. Estimates of $\sigma$ from previous or similar studies.
3. In a normally distributed population, the range is
usually about 6 standard deviations so $\sigma$ is estimated
by R/6.
94
Determination of the sample size for estimating
proportions
The manner of finding sample sizes for estimating a
population proportion is basically the same as for
estimating a mean.
The general formula is:
95
Sample size
Assuming proper random sampling and
an approximately normal distribution, the
sample size is
$n = z^2 p q/d^2$, where $q = 1 - p$
96
Estimating the population proportion
It is necessary to estimate the population proportion,
p, to use in the determination of the sample size.
1. If an upper limit is suspected or presumed, it
could be used to represent p.
2. A pilot sample could be drawn and used to obtain
an estimate for p.
3. With no better estimate, one may use p = .5
which gives the maximum value of n.
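A minimal sketch of the sample size formula for a proportion, n = z²p(1 − p)/d², shows why p = .5 is the conservative choice. The function name `sample_size_for_proportion` and the inputs (d = .05, z = 1.96, trial values of p) are hypothetical.

```python
import math

def sample_size_for_proportion(p, d, z=1.96):
    """Smallest n for which z * sqrt(p * (1 - p) / n) does not exceed d."""
    return math.ceil(z**2 * p * (1 - p) / d**2)

# Hypothetical margin of error d = .05 at 95% confidence
for p in (0.20, 0.35, 0.50):
    print(p, sample_size_for_proportion(p, d=0.05))
```

The product p(1 − p) is largest at p = .5, so that value never understates the required n.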
97
E) Confidence interval for the variance of a
normally distributed population
Measures of dispersion
$\sigma^2 = \sum (x_i - \mu)^2 / N$
$S^2 = \sum (x_i - \mu)^2 / (N - 1)$
(continued)
98
E) Confidence interval for the variance of a
normally distributed population
Measures of dispersion
$\sigma^2 = \sum (x_i - \mu)^2 / N$
$E(s^2) = \sigma^2$ when
sampling is with
replacement.
$S^2 = \sum (x_i - \mu)^2 / (N - 1)$
$E(s^2) = S^2$ when
sampling is without
replacement.
99
Large population size
When N is large, N and N-1 are
approximately equal so $\sigma^2$ and $S^2$ will be
approximately equal. These results
justify why $s^2$ can be used to estimate
the population variance.
100
Interval estimate of a population variance
• The value of $s^2$ is used as a point estimator of the
population variance, $\sigma^2$.
• Confidence intervals for $\sigma^2$ are based on the
sampling distribution of $(n-1)s^2/\sigma^2$.
• If samples of size n are drawn from a normally
distributed population, this quantity has a
distribution known as the chi-square distribution
with n-1 degrees of freedom.
• The assumption that the sample is drawn from a
normally distributed population is crucial.
101
The chi-square distribution
The chi-square distribution is not symmetrical. For
low values of n, its shape is variable. The
distribution does not have negative values.
102
Microsoft Excel Demonstration
Note how the shape of the curve
changes depending on the degrees of
freedom. With 1 degree of freedom, the
curve is hyperbolic.
[Here follows the Excel Worksheet.]
103
Microsoft Excel Demonstration
104
Reading the $\chi^2$ table
105
Finding $\chi^2$ values
106
Finding $\chi^2$ values
107
Finding $\chi^2$ values
108
109
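Confidence interval on the $\chi^2$ distribution
The $100(1-\alpha)$ percent confidence interval for the distribution
of $(n-1)s^2/\sigma^2$ is a two-tailed $\chi^2$ interval between
$\chi^2_{\alpha/2}$ and $\chi^2_{1-\alpha/2}$. This interval is given by
$\chi^2_{\alpha/2} < (n-1)s^2/\sigma^2 < \chi^2_{1-\alpha/2}$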
110
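Confidence interval for $\sigma^2$
From the sampling distribution of $(n-1)s^2/\sigma^2$ the
confidence interval for $\sigma^2$ is derived. The formula
is:
$\dfrac{(n-1)s^2}{\chi^2_{1-\alpha/2}} < \sigma^2 < \dfrac{(n-1)s^2}{\chi^2_{\alpha/2}}$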
111
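Confidence interval for $\sigma$
To get the $100(1-\alpha)$ percent confidence interval for $\sigma$, the
population standard deviation, the square root of
each term is taken. The result is the formula below.
$\sqrt{\dfrac{(n-1)s^2}{\chi^2_{1-\alpha/2}}} < \sigma < \sqrt{\dfrac{(n-1)s^2}{\chi^2_{\alpha/2}}}$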
112
Example 6.9.1
In a study on cholesterol levels a sample of 12 men
and women was chosen. The plasma cholesterol
levels (mmol/L) of the subjects were as follows: 6.0,
6.4, 7.0, 5.8, 6.0, 5.8, 5.9, 6.7, 6.1, 6.5, 6.3, and
5.8. We assume that these 12 subjects constitute a
simple random sample of a population of similar
subjects. We wish to estimate the variance of the
plasma cholesterol levels with a 95 percent
confidence interval.
113
Solution
(1) Given
6.0 6.4 7.0 5.8 6.0 5.8
5.9 6.7 6.1 6.5 6.3 5.8
Estimate the variance with a 95%
confidence interval.
114
Solution
(2) Calculations
Value of s = .3918680978
Values of $\chi^2$ from table (11 degrees of freedom)
$\chi^2_{.975}$ = 21.920
$\chi^2_{.025}$ = 3.816
115
Calculations
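The calculations for this example can be sketched as follows. The sample variance is computed from the twelve observations, and the two tabled chi-square values (21.920 and 3.816, for 11 degrees of freedom) are taken from the slide rather than computed. The variable names are illustrative.

```python
import math
from statistics import variance

cholesterol = [6.0, 6.4, 7.0, 5.8, 6.0, 5.8, 5.9, 6.7, 6.1, 6.5, 6.3, 5.8]
chi2_upper = 21.920  # tabled chi-square .975 quantile, 11 degrees of freedom
chi2_lower = 3.816   # tabled chi-square .025 quantile, 11 degrees of freedom

df = len(cholesterol) - 1
s2 = variance(cholesterol)  # sample variance with divisor n - 1

# Dividing by the larger chi-square value gives the lower limit
var_ci = (df * s2 / chi2_upper, df * s2 / chi2_lower)
sd_ci = (math.sqrt(var_ci[0]), math.sqrt(var_ci[1]))

print(f"s^2 = {s2:.4f}")
print(f"95% CI for the variance: ({var_ci[0]:.4f}, {var_ci[1]:.4f})")
print(f"95% CI for the standard deviation: ({sd_ci[0]:.4f}, {sd_ci[1]:.4f})")
```

The interval is not symmetric about s² because the chi-square distribution is not symmetrical.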
116
F) Confidence interval for the ratio of variances
of two normally distributed populations
A way to compare the variances of two normally
distributed populations is to use the variance ratio,
$s_1^2/s_2^2$. The variance ratio is used, among other
things, as the test statistic for analysis of variance
(ANOVA). If the two variances are equal, then
V. R. = 1.
117
Sampling distribution
The sampling distribution of $(s_1^2/\sigma_1^2)/(s_2^2/\sigma_2^2)$ is
used. Since the population variances are usually not
known, the sample variances are used. The
assumptions are that $s_1^2$ and $s_2^2$ are computed from
independent samples of size n1 and n2, respectively,
drawn from two normally distributed populations.
(continued)
118
Sampling distribution
If the assumptions are met, $(s_1^2/\sigma_1^2)/(s_2^2/\sigma_2^2)$
follows a distribution known as the F distribution
with two values used for degrees of freedom.
119
Degrees of freedom
• The F distribution uses two values for degrees of
freedom.
• The numerator degrees of freedom is the
value of n1 - 1, which is used in calculating $s_1^2$.
• The denominator degrees of freedom is the value
of n2 - 1, which is used in calculating $s_2^2$.
120
The F distribution
• The F distribution is not symmetrical.
• The distribution does not have negative
values.
• Because it uses two values of degrees
of freedom, there is a separate table
for each confidence level.
121
F distribution tables
122
Reading F tables
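F tables come in denominations based on $1-\alpha$, such as
$F_{.95}$, $F_{.975}$ and $F_{.995}$, each with one
tail. For two-tail intervals, the lower boundary,
$F_{\alpha/2}$, must be calculated as the reciprocal of the
tabled upper value with the degrees of freedom reversed.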
123
Reading F tables
124
Two-tail F distribution boundaries
125
The F.95 table
126
The F.975 table
127
The F.995 table
128
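Confidence interval for $\sigma_1^2/\sigma_2^2$
The distribution of $(s_1^2/\sigma_1^2)/(s_2^2/\sigma_2^2)$ is used to
establish the $100(1-\alpha)$ percent confidence interval
for $\sigma_1^2/\sigma_2^2$. The starting point is
$F_{\alpha/2} < \dfrac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2} < F_{1-\alpha/2}$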
(continued)
129
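Confidence interval for $\sigma_1^2/\sigma_2^2$
From this relation, it can be shown that the $100(1-\alpha)$
percent confidence interval for $\sigma_1^2/\sigma_2^2$ is
$\dfrac{s_1^2/s_2^2}{F_{1-\alpha/2}} < \dfrac{\sigma_1^2}{\sigma_2^2} < \dfrac{s_1^2/s_2^2}{F_{\alpha/2}}$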
130
Example 6.10.1
Among 11 patients in a certain study, the
standard deviation of the property of
interest was 5.8. In another group of 4
patients, the standard deviation was
3.4. We wish to construct a 95 percent
confidence interval for the ratio of the
variances of these two populations.
131
Solution
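(1) Given
n1 = 11, $s_1^2$ = (5.8)^2 = 33.64
n2 = 4, $s_2^2$ = (3.4)^2 = 11.56
$\alpha$ = .05
$F_{.975}$ with 10, 3 degrees of freedom = 14.42
$F_{.025}$ with 10, 3 degrees of freedom = 1/($F_{.975}$ with 3, 10 degrees of freedom) = 1/4.83 = .20704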
132
133
Solution
(2) Calculations
Calculation of the 95% confidence interval for $\sigma_1^2/\sigma_2^2$
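The calculation for Example 6.10.1 can be sketched as follows. The two tabled F values (14.42 for 10 and 3 degrees of freedom, 4.83 for 3 and 10 degrees of freedom) are taken from the slide rather than computed, and the variable names are illustrative.

```python
s1, n1 = 5.8, 11
s2, n2 = 3.4, 4

f_upper = 14.42     # tabled F_{.975} with (10, 3) degrees of freedom
f_lower = 1 / 4.83  # F_{.025} with (10, 3) df = 1 / F_{.975} with (3, 10) df

variance_ratio = s1**2 / s2**2

# Dividing by the larger F value gives the lower limit
ratio_ci = (variance_ratio / f_upper, variance_ratio / f_lower)

print(f"variance ratio = {variance_ratio:.4f}")
print(f"95% CI for the ratio of variances: ({ratio_ci[0]:.4f}, {ratio_ci[1]:.4f})")
```

The interval is very wide because the second sample has only 3 degrees of freedom. Since it contains 1, equal population variances cannot be ruled out.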
134
fin
135