Chapter 6: Statistics and Estimation

6.1. Random samples
- In this chapter, we will study statistics.
-
What is a statistic: a statistic is a function of sample observations that contains no
unknown parameters.
-
Important concepts in statistics
* A Population is a data set that is the target of our interest.
* A Sample is a subset of data selected from a population.
-
The study of statistics concerned with using sample data to make an inference
about a population is often called "inferential statistics".
-
Why sample? If a population is infinite, it is impossible to observe the
population entirely. Even for a finite population, it is usually necessary to use a
sample, a part of the population, to infer results pertaining to the entire
population.
(a) example 1: study and exercise habits of students in the class
(b) example 2: annual income of Canadians
(c) example 3: pH values of land under development
-
Random Sampling: Given a finite (or infinite) population, we randomly select n
objects x1, x2, x3, ..., xn and expect to make some inference without looking at each
and every object in the entire population; this is called random sampling. The n
objects are called a random sample of size n from this population. Following
the above examples, the random samples are:
(a) the 12 students
(b) samples from different regions, backgrounds, ...
(c) samples at different places
-
Because the observations in a random sample are independent and identically
distributed, the joint distribution of the sample is the product of the population's
probability distribution evaluated at each observation:
f(x1, x2, x3, ..., xn) = f(x1) f(x2) f(x3) ··· f(xn)
-
Usage of samples: We are usually concerned with the problem of making
inferences about the parameters of populations, such as the proportion p, the mean
μ, and the standard deviation σ. In making such inferences, we use statistics,
namely quantities calculated from the sample observations.
6.2. Some important statistics
- The important statistics include: proportion, mean, and variance (standard
deviation).
-
It is known that the population mean and variance are defined as follows:
μ = (1/N) Σ_{i=1}^{N} x_i
σ² = (1/N) Σ_{i=1}^{N} (x_i − μ)²
where N is the number of elements in the population (N → ∞ if the population is
infinite). On the other hand, we can calculate the sample mean and variance:
x̄ = (1/n) Σ_{i=1}^{n} x_i
s² = (1/(n − 1)) Σ_{i=1}^{n} (x_i − x̄)²
Note:
(a) There is a difference between the population mean and the sample mean.
(b) In the sample variance, why (n − 1) instead of n?
* fact: with divisor n, the sample variance tends to underestimate the population variance
* proof: later in the chapter
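The fact above can be illustrated numerically; the following Monte Carlo sketch (population parameters and the seed are illustrative choices, not from the notes) compares the two divisors:

```python
import random

# Monte Carlo sketch: dividing the sum of squared deviations by n tends to
# underestimate the population variance, while dividing by n - 1 is
# unbiased on average. Population: normal with variance 4 (illustrative).
random.seed(0)
n, trials = 5, 20000

avg_biased = 0.0
avg_unbiased = 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, 2.0) for _ in range(n)]
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    avg_biased += ss / n / trials        # divisor n
    avg_unbiased += ss / (n - 1) / trials  # divisor n - 1

print(avg_biased)    # noticeably below 4.0 (its expected value is 3.2)
print(avg_unbiased)  # close to 4.0
```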
-
Question: We can easily calculate the sample mean and variance, but what are the
mean and variance of the population?
-
Other important statistics:
* central tendency:
sample mean
sample median
sample mode
* variability:
variance (standard deviation)
range
-
The linear function of a random variable: given y = ax + b, the sample mean is:
ȳ = a x̄ + b
and the sample variance is:
s_y² = a² s_x²
6.3 Sampling distributions
- An example: take another 6 students, the number of hours of study per week.

Record #   Name       Sex   # of hours of study   # of hours of exercise
1          Gurpreet   M     6                     3
2          Camila     F     20                    5
3          Cara       F     20                    3
4          Chris      M     10                    4
5          Joe        M     12                    2
6          Mike       M     5                     12

Hence, the sample mean (average number of hours of study) is 12.167 and the
sample standard deviation is 6.6.
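The quoted figures can be recomputed directly from the study-hours column of the table:

```python
import math

# Sample mean and sample standard deviation (n - 1 divisor) of the
# study-hours column from the six-student table above.
hours = [6, 20, 20, 10, 12, 5]

n = len(hours)
xbar = sum(hours) / n
s = math.sqrt(sum((x - xbar) ** 2 for x in hours) / (n - 1))

print(round(xbar, 3))  # 12.167
print(round(s, 1))     # 6.6
```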
-
From the above example, it is seen that the new sample mean differs from the old
one (15.5 hours in the sample of 12 students). This is because the samples are
randomly selected; thus, the mean of a random sample is itself random.
-
The sampling mean: If a random sample of size n is taken from a population
having mean μ and standard deviation σ, then x̄ is a value of a random variable
whose distribution has the mean μ.
-
The sampling variance:
For samples from an infinite population, the variance of this distribution is σ²/n.
For samples from a finite population, the variance of this distribution is
(σ²/n) · (N − n)/(N − 1)
where (N − n)/(N − 1) is called the finite population correction factor. When the
sample size is small relative to the population, the correction factor is close to 1
and can often be ignored, e.g., N = 100, n = 10 gives (N − n)/(N − 1) = 90/99 =
10/11 ≈ 0.91.
-
The reliability of the sample mean x̄ as an estimate of μ
(the population mean) is measured by
σ_x̄ = σ/√n
This is called the standard error of the mean, and this reliability measure changes
with the square root of the sample size. The larger the sample size, the
smaller the standard error. For example, the standard error for samples of size
10,000 is one tenth of that for samples of size 100 (since √(10000/100) = 10).
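As a quick numerical check (using σ = 4, the value assumed in the examples later in this chapter):

```python
import math

# Standard error of the mean for two sample sizes; multiplying n by 100
# divides the standard error by 10. sigma = 4 is the value assumed in the
# chapter's examples.
sigma = 4.0
se_100 = sigma / math.sqrt(100)      # n = 100
se_10000 = sigma / math.sqrt(10000)  # n = 10000
print(se_100, se_10000)   # 0.4 0.04
```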
6.4 Sampling distribution of means (x̄):
- Standardized mean. Suppose that we know the variance of the population; a
random variable called the "standardized mean" is defined by
Z = (x̄ − μ)/(σ/√n)
namely, the difference between the sample mean x̄ and the population mean μ,
divided by the standard error of the mean (as defined previously); then, the
following theorem holds.
-
Theorem 6.1 (Central Limit Theorem): If x̄ is the mean of a random sample of size
n taken from a population having the mean μ and the finite standard deviation σ,
then the standardized mean
Z = (x̄ − μ)/(σ/√n)
is the value of a random variable whose distribution function approaches that of
the standard normal distribution as n → ∞.
In short, it is said that the standardized sample mean is approximately standard normal.
-
The implications of the central limit theorem:
(a) no matter what the probability distribution of the population is, the central
limit theorem always holds.
(b) the normal distribution is usually a good approximation when the sample size
is large (n > 25).
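Implication (a) can be illustrated by simulation; the following sketch (population, sample size, and seed are illustrative choices) draws from a strongly skewed exponential population, yet the standardized means behave like a standard normal variable:

```python
import random
import statistics

# Monte Carlo sketch of the theorem: the population here is exponential
# with mean 1 and standard deviation 1 (nothing like a normal), yet the
# standardized means of samples of size n = 30 are close to N(0, 1).
random.seed(1)
mu, sigma, n, reps = 1.0, 1.0, 30, 10000

zs = []
for _ in range(reps):
    xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
    zs.append((xbar - mu) / (sigma / n ** 0.5))

print(round(statistics.mean(zs), 1))   # near 0
print(round(statistics.stdev(zs), 1))  # near 1
```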
-
An example: what is the average number of hours of study per week of the
students in the class? We don't know for sure (unless we sample all the
students). But given a probability of 95%, we can estimate its range. Assuming that
the standard deviation of the population is 4 (if the population standard deviation
is unknown, we have to estimate it, as discussed later), the
standardized mean is:
Z = (15.5 − μ)/(4/√12)
Based on the central limit theorem, Z follows the standard normal distribution, and
hence:
P(|Z| < z) = 0.95
From Table A.3, an inverse table lookup shows that z = 1.96. Therefore:
μ = 15.5 ± (1.96)(1.155) = 15.5 ± 2.26
-
Note:
(a) suppose we use the sample of 6 students:
Z = (12.17 − μ)/(4/√6)
μ = 12.17 ± (1.96)(1.633) = 12.17 ± 3.20
it is seen that the fewer the samples, the larger the interval.
(b) suppose we use a probability of 99%: from a normal distribution table,
z_{α/2} = 2.576
Z = (15.5 − μ)/(4/√12)
μ = 15.5 ± (2.576)(1.155) = 15.5 ± 2.97
it is seen that the higher the probability, the larger the interval.
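The 95% interval can be computed without a table lookup; Python's standard library provides the exact normal quantile (z_{0.025} = `NormalDist().inv_cdf(0.975)` ≈ 1.96):

```python
from statistics import NormalDist

# 95% interval for the mean, with sigma = 4 assumed known and n = 12,
# matching the example above.
xbar, sigma, n = 15.5, 4.0, 12
z = NormalDist().inv_cdf(0.975)
half = z * sigma / n ** 0.5
print(round(xbar - half, 2), round(xbar + half, 2))   # 13.24 17.76
```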
-
How to estimate the population standard deviation? From the sample of 12 students:
max = 25, min = 10, thus R = 25 − 10 = 15. Using Chebyshev's theorem, a probability
of 95% results in:
1 − 1/k² = 0.95
or k ≈ 4.5. From:
|X − μ| < kσ
treating the range as a bound, R ≤ kσ, it follows that:
σ ≈ R/k = 15/4.5 = 3.333
using this estimate, in the above example, the standardized mean is:
Z = (15.5 − μ)/(3.33/√12)
and the 95% interval is:
μ = 15.5 ± (1.96)(0.96) = 15.5 ± 1.89.
-
The difference of two populations can also be assessed by the central limit
theorem, since
Z = [(X̄₁ − X̄₂) − (μ₁ − μ₂)] / √(σ₁²/n₁ + σ₂²/n₂)
is approximately normally distributed.
-
An example: the difference in study hours between male and female students (12
students example). Assume that the two populations are normally distributed
with the same standard deviation σ_M = σ_F = 4, and that
x̄_M = 15.5, x̄_F = 15.5
suppose female students study harder than the male students by an average of 2
hours, so that (μ_M − μ_F) = −2; hence:
z = [(15.5 − 15.5) − (−2)] / [(4)(0.28)] ≈ 1.78
P(Z > z) = 1 − 0.9625 = 0.0375
Therefore, it can be concluded that there is not enough evidence that female
students study 2 more hours than male students on average.
6.5 The Sampling Distribution of S²
- In the previous sections it was shown that the sample mean is a random variable;
the sample variance, S², is also a random variable.
-
The sample mean is approximately normal (as long as n is large, regardless of the
distribution of the population), as stated in the central limit theorem. What about S²?
Unfortunately, no comparably general result is available.
-
So we can only discuss a special case: assuming that the distribution of the
population is normal, (n − 1)S²/σ² has a chi-square distribution. Let
U = (n − 1)S²/σ²
then the pdf of U is:
f(u) = [1 / (2^{(n−1)/2} Γ((n−1)/2))] u^{(n−1)/2 − 1} e^{−u/2},   u > 0
f(u) = 0 elsewhere
where (n − 1) is called the degrees of freedom.
-
the mean and variance of the chi-square distribution with v degrees of freedom:
mean: v
variance: 2v
-
we can use the chi-square (χ²) distribution to make inferences about the standard
deviation of the population.
-
In the students example, what is the probability that the standard deviation of the
population is greater than 5? That is:
P(σ² > 25) = ?
or:
P(1/σ² < 1/25) = P[(n − 1)S²/σ² < (n − 1)s²/25]
Let U = (n − 1)S²/σ², and note that s² = 16 and n = 12, so (n − 1)s²/25 = 7.04. Using
the chi-square distribution, it follows that:
P(U < 7.04) = 1 − 0.8 = 0.2
-
There are two ways to calculate chi-square distributions:
- Use a chi-square distribution Table.
- Use a computer program
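A small program of the second kind: a Monte Carlo estimate (sample counts and seed are illustrative) of the probability from the students example, using the fact that a chi-square variable with 11 degrees of freedom is a sum of 11 squared standard normals:

```python
import random

# Monte Carlo estimate of P(U < 7.04) for U ~ chi-square with 11 degrees
# of freedom, matching the students example above (tabled value ~0.2).
random.seed(2)
dof, trials = 11, 50000

hits = 0
for _ in range(trials):
    u = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(dof))
    if u < 7.04:
        hits += 1

p = hits / trials
print(round(p, 2))   # about 0.2
```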
6.6. t-Distribution
- If the population is normal, when we use S to replace σ:
T = (X̄ − μ)/(S/√n)
the result has a t-distribution.
-
The pdf of the t-distribution:
f(t) = [Γ((v + 1)/2) / (√(vπ) Γ(v/2))] (1 + t²/v)^{−(v+1)/2},   −∞ < t < ∞
where v = n − 1 is called the degrees of freedom.
-
The applications of the t-distribution: inference regarding the population mean
when the population is known to be normal and σ is unknown.
-
An example: in the 12 students example, suppose the population is normally
distributed; with a probability of 95%, what is the range of the population mean?
μ = x̄ ± t_{α/2} s/√n
From the examples above, we know the sample mean is 15.5 and the sample
standard deviation is 4. From a t-distribution table:
t_{α/2}(11) = t_{0.025}(11) = 2.2
therefore, the population mean is within 15.5 ± (2.2)(4/√12) = 15.5 ± 2.54.
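The same interval as a computation; the critical value t_{0.025}(11) = 2.201 is taken from a t table (hard-coded here, not computed):

```python
import math

# 95% t-interval for the mean of the 12-student sample; t_{0.025}(11) = 2.201
# comes from a t table (assumed, not computed in this sketch).
xbar, s, n = 15.5, 4.0, 12
t = 2.201
half = t * s / math.sqrt(n)
print(round(half, 2))                                  # 2.54
print(round(xbar - half, 2), round(xbar + half, 2))    # 12.96 18.04
```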
6.7 F-Distribution
- Comparing the sample variances of two samples, (n₁, s₁) and (n₂, s₂), the ratio:
F = (S₁²/σ₁²) / (S₂²/σ₂²)
has an F-distribution with degrees of freedom v₁ = n₁ − 1 and v₂ = n₂ − 1.
-
The pdf of the F-distribution:
h(f) = [Γ((v₁ + v₂)/2) / (Γ(v₁/2) Γ(v₂/2))] (v₁/v₂)^{v₁/2} f^{v₁/2 − 1} [1 + (v₁/v₂) f]^{−(v₁+v₂)/2},   0 < f < ∞
The example: the difference of variance between male students and female
students in terms of study hours.
- male students: n_M = 8, s_M² = 4.75
- female students: n_F = 4, s_F² = 3.32
therefore, under the hypothesis σ_M² = σ_F², the ratio F = S_M²/S_F² has an
F-distribution, and the observed value is 4.75/3.32 = 1.43. The degrees of freedom
are v_M = 8 − 1 = 7 and v_F = 4 − 1 = 3. Hence, we have:
P(F > 1.43) = 0.58.
The probability of 0.58 indicates that the difference is not significant.
6.8 Introduction to Estimation
- Problems studied previously:
* what is the distribution of the sample?
* what are the mean and variance of the sample?
-
problems that will be solved in the discussions below:
* what is the distribution of the population?
* what are the mean and variance of the population?
-
Estimation:
* an estimator is a statistic that specifies how to use the sample data to
estimate an unknown parameter of the population
* an estimator is a random variable
-
An example:
X̄ is an estimator of μ
S is an estimator of σ
(x/n) is an estimator of p
-
Questions to be answered:
* how good is an estimator?
* how many samples are needed?
6.9. Methods of Estimation
- Two types of methods for estimation:
* point estimation
* interval estimation
-
A Point Estimator is a formula or function that tells us how to calculate a
numerical estimate of a population parameter based on the sample
measurements.
-
Given a "point" θ, which is a parameter related to the population distribution
(e.g., mean or variance), then
* θ̂ is denoted as an estimate of θ
* if E(θ̂) = θ, then θ̂ is an unbiased estimator (or unbiased estimate)
* θ̂₁ is said to be a more efficient unbiased estimate than θ̂₂ if
(a) both θ̂₁ and θ̂₂ are unbiased estimates of θ, and
(b) V(θ̂₁) < V(θ̂₂).
-
Interval estimation or confidence intervals: basic concept
* (θ̂ − θ) = ?, or how close is θ̂ to θ?
* there exists an interval [g₁(θ̂), g₂(θ̂)] such that:
P(g₁(θ̂) < θ < g₂(θ̂)) = 1 − α,
* [g₁(θ̂), g₂(θ̂)] is called the confidence interval
-
A confidence interval has random endpoints; all it says is that θ will be in the
interval with a high probability (1 − α).
-
What will be covered in this chapter
(a) Estimation of means:
Single sample: the mean, standard error, and tolerance limits
Two samples: difference between two means, paired observations,
(b) Estimation of proportion
Single sample: estimating a proportion
Two samples: estimating the difference between two proportions
(c) Estimation of variance
Single samples: estimating the variance
Two samples: estimating the ratio of two variances
(d) Bayesian method
(e) Maximum likelihood estimation
6.10 Estimation of the means:
(1) Case 1: use the sample mean x̄ as an estimate of μ when large samples are available:
- the error:
x̄ − μ
-
the standardized error:
Z = (x̄ − μ)/(σ/√n)
-
this is approximately standard normal (according to the Central Limit
Theorem). So, with probability 1 − α, the confidence interval is determined
by:
P(−z_{α/2} < Z < z_{α/2}) = 1 − α
rearranging, the error with probability 1 − α satisfies:
|x̄ − μ| ≤ z_{α/2} σ/√n
or
μ ∈ (x̄ − z_{α/2} σ/√n, x̄ + z_{α/2} σ/√n)
-
Note
(a) if σ is unknown, one can use s, which results in a loss of accuracy
(b) the one-sided confidence interval uses z_α instead:
x̄ + z_α σ/√n (upper bound) or x̄ − z_α σ/√n (lower bound)
-
There are three quantities linked by the confidence interval: the maximum error E
(the half-width of the interval), the number of samples n, and the confidence
level 1 − α. Given a confidence level 1 − α and the maximum error allowed, E,
the required sample size can be found by the following equation:
n = (z_{α/2} σ / E)²
-
In the students example, what is the sample size needed?
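A sketch of the answer, assuming σ = 4 (as in the earlier examples) and a maximum allowed error of E = 1 hour (an illustrative choice; the notes do not fix E):

```python
import math
from statistics import NormalDist

def sample_size(sigma, max_error, confidence=0.95):
    """Smallest n such that z_{a/2} * sigma / sqrt(n) <= max_error."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / max_error) ** 2)

print(sample_size(sigma=4.0, max_error=1.0))   # 62
```

Halving the allowed error roughly quadruples the required sample size, as the squared formula suggests.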
(2) Case 2: small sample size but the population has a normal distribution
- basic assumptions:
(a) the sample size is less than 25 (n < 25, as a rule of thumb)
(b) the population distribution is normal,
then we cannot use the standard normal distribution; instead we use the Student
t-distribution:
-
Using S to estimate σ: note that
Z = (X̄ − μ)/(σ/√n)
is standard normal, and
X² = (n − 1)S²/σ²
is a chi-square random variable with n − 1 degrees of freedom. Let
T = Z / √(X²/(n − 1)) = (X̄ − μ)/(S/√n)
then T follows a t-distribution.
-
Using the t-distribution, the confidence interval can be determined by
P(−t_{α/2} ≤ (X̄ − μ)/(S/√n) ≤ t_{α/2}) = 1 − α
or
x̄ ± t_{α/2} s/√n
-
Note:
* the one-sided confidence interval uses t_α instead of t_{α/2}
* the maximum error is given by
E = t_{α/2} S/√n
* given the maximum error, the minimum number of samples required can be found
(3) Case 3: estimating the difference between two means (two samples)
- The difference between two means:
θ = E(X̄₁ − X̄₂) = μ₁ − μ₂
the questions:
(a) θ̂ = ?
(b) confidence interval = ?
-
The students example: the difference in the number of hours of study between male
and female students. From the samples, we have:
Male: n_M = 8, x̄_M = 15.5, s_M² = 4.78
Female: n_F = 4, x̄_F = 15.5, s_F² = 3.32
Assuming that both population distributions are normal, the point estimate of
the difference is:
D = x̄_F − x̄_M
It follows a t-distribution:
T = [(X̄_F − X̄_M) − (μ_F − μ_M)] / √(S_F²/n_F + S_M²/n_M)
with the degrees of freedom:
dof = (S_F²/n_F + S_M²/n_M)² / [ (S_F²/n_F)²/(n_F − 1) + (S_M²/n_M)²/(n_M − 1) ]
hence, the confidence interval is:
(x̄_F − x̄_M) − t_{α/2} √(S_F²/n_F + S_M²/n_M) ≤ μ_F − μ_M ≤ (x̄_F − x̄_M) + t_{α/2} √(S_F²/n_F + S_M²/n_M)
-
In the students example, the standard error is √(3.32/4 + 4.78/8) = 1.195 and
dof ≈ 7.3, rounded to 7.
Given α = 0.05, t_{α/2}(dof) = t_{0.025}(7) = 2.365. Therefore, the confidence interval is:
0 − (2.365)(1.195) ≤ μ_F − μ_M ≤ 0 + (2.365)(1.195), i.e., −2.83 ≤ μ_F − μ_M ≤ 2.83
-
A more general application: linear functions of multiple random variables. Suppose
there are m populations, each having a mean μ_i and a standard deviation σ_i;
furthermore, the parameter to be estimated is a linear function:
θ = Σ_i a_i μ_i
then, an unbiased estimate is:
θ̂ = Σ_i a_i X̄_i
and
V(θ̂) = Σ_i a_i² (σ_i²/n_i)
note that σ can be replaced by s without losing much accuracy for large samples
-
the confidence interval with probability 1 − α is:
θ̂ ± z_{α/2} √V(θ̂)
6.11 Estimating the proportion
- Binomial distribution: large sample confidence interval for p
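For a large sample with x successes in n trials, the point estimate is p̂ = x/n and the large-sample interval is p̂ ± z_{α/2} √(p̂(1 − p̂)/n). A minimal sketch (the counts are illustrative, not from the notes):

```python
import math
from statistics import NormalDist

# Large-sample 95% confidence interval for a binomial proportion;
# x and n are illustrative values.
x, n = 60, 100            # 60 successes in 100 trials
phat = x / n
z = NormalDist().inv_cdf(0.975)
half = z * math.sqrt(phat * (1 - phat) / n)
print(round(phat - half, 3), round(phat + half, 3))   # 0.504 0.696
```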
6.12 Estimating the variances
- the result: the confidence interval for σ² with confidence coefficient 1 − α is:
( (n − 1)s²/χ²_{α/2}(ν), (n − 1)s²/χ²_{1−α/2}(ν) )
where ν = (n − 1) is the degrees of freedom.
-
An example: chemical pollution LC50 measurements in 12 samples are: 15, 5,
21, 19, 10, 5, 8, 2, 7, 2, 4, 9. What are the mean and the variance of the population?
* the sample mean and standard deviation:
x̄ = 8.917
s = 6.331
* since n = 12 < 30, we have to use the t-distribution
degrees of freedom = 12 − 1 = 11
90% => α/2 = (1 − 0.9)/2 = 0.05
t_{0.05}(11) = 1.796 (from the t-distribution table)
-
the true mean with confidence coefficient 0.9 is:
x̄ ± 1.796(6.331)/√12 = (5.63, 12.20)
-
to determine the confidence interval for the population variance, we have to
use the chi-square distribution as discussed above
-
95% confidence interval for the population variance:
* α/2 = 0.025, 1 − α/2 = 0.975
* χ²_{0.025}(11) = 21.92
* χ²_{0.975}(11) = 3.816
* (n − 1)s² = 11(40.08) = 440.92
* the interval: (20.1, 115.5)
-
if instead n = 100 (a large sample), everything is the same except that the
normal-based results of Case 1 apply.
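The LC50 example can be recomputed directly from the data; the quantiles t_{0.05}(11) = 1.796, χ²_{0.025}(11) = 21.92, and χ²_{0.975}(11) = 3.816 are taken from tables (hard-coded, not computed):

```python
import math

# LC50 example: 90% interval for the population mean and 95% interval for
# the population variance, computed from the raw data.
data = [15, 5, 21, 19, 10, 5, 8, 2, 7, 2, 4, 9]
n = len(data)
xbar = sum(data) / n
s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)

# 90% interval for the mean (t quantile from a table)
half = 1.796 * math.sqrt(s2 / n)
print(round(xbar - half, 2), round(xbar + half, 2))

# 95% interval for the variance (chi-square quantiles from a table)
print(round((n - 1) * s2 / 21.92, 1), round((n - 1) * s2 / 3.816, 1))
```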
6.13. Confidence Intervals: the Multiple Sample Case
- Multiple samples involve multiple random variables, or linear functions of
multiple random variables, e.g.:
E(X̄₁ − X̄₂) = μ₁ − μ₂
-
Questions:
(a) θ̂ = ?
(b) confidence interval = ?
-
Cases:
* linear functions of means of general distributions (large samples)
* linear functions of proportions of binomial distributions (large samples)
* linear functions of means of normal distributions (small samples)
* σ₁²/σ₂² of normal distributions (small samples)
-
Case 1: suppose there are m populations, each having a mean μ_i and a standard
deviation σ_i; furthermore, the parameter to be estimated is a linear function:
θ = Σ_i a_i μ_i
then, an unbiased estimate is:
θ̂ = Σ_i a_i x̄_i
and
V(θ̂) = Σ_i a_i² (σ_i²/n_i)
note that σ can be replaced by s without losing much accuracy for large
samples
-
the confidence interval with probability 1 − α is:
θ̂ ± z_{α/2} √V(θ̂)
-
Case 3: for small sample sizes, we have to assume the populations are normal.
Compared to Case 1, σ may be significantly different from s, so define the pooled
variance:
S_p² = Σ_{i=1}^{m} (n_i − 1)S_i² / Σ_{i=1}^{m} (n_i − 1)
and
U(θ̂) = √( Σ_{i=1}^{m} a_i²/n_i )
then:
T = (θ̂ − θ) / (S_p U(θ̂))
has a t-distribution with Σ_{i=1}^{m} (n_i − 1) degrees of freedom. Furthermore, the
confidence interval is:
θ̂ ± t_{α/2} S_p U(θ̂)
-
Case 4: the ratio of population variances σ₁²/σ₂². Define
F = (S₁²/σ₁²) / (S₂²/σ₂²)
then F follows an F-distribution with ν₁ = n₁ − 1 and ν₂ = n₂ − 1 degrees of
freedom.
-
the confidence interval for σ₁²/σ₂² with probability 1 − α is:
( (s₁²/s₂²) · 1/F_{α/2}(ν₁, ν₂), (s₁²/s₂²) · F_{α/2}(ν₂, ν₁) )
An example: the problem:
* 50 samples from type I coupling agent having X̄₁ = 92, s₁² = 20
* 40 samples from type II coupling agent having X̄₂ = 98, s₂² = 30
* what is the true difference between the mean resistances with a 95% confidence
interval? (E34)
* assuming the same number of samples shall be used from both types, how many
samples are needed to estimate the true difference between mean resistances to
within 1 unit with a confidence coefficient of 0.95? (E35)
-
this is a Case 1 problem since n₁, n₂ > 30:
* 1 − α = 0.95, α = 0.05, α/2 = 0.025, z_{α/2} = 1.96
* the true difference
θ = μ₁ − μ₂
has a normally distributed estimate. The 95% confidence interval is:
(92 − 98) ± 1.96 √(20/50 + 30/40) = −6 ± 2.1
-
A special case: assuming n₁ = n₂ = n, then
1.96 √((20 + 30)/n) ≤ 1
n = (1.96)²(20 + 30) = 192.08 → 193.
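Both computations above can be reproduced with the exact normal quantile:

```python
import math
from statistics import NormalDist

# Coupling-agent example: 95% interval for mu1 - mu2, then the common
# sample size needed to bring the maximum error down to 1 unit.
x1, v1, n1 = 92.0, 20.0, 50    # type I: sample mean, sample variance, size
x2, v2, n2 = 98.0, 30.0, 40    # type II
z = NormalDist().inv_cdf(0.975)

# 95% confidence interval for mu1 - mu2
half = z * math.sqrt(v1 / n1 + v2 / n2)
print(round(x1 - x2 - half, 1), round(x1 - x2 + half, 1))   # -8.1 -3.9

# common sample size for a maximum error of 1 unit
n = math.ceil(z ** 2 * (v1 + v2))   # from z * sqrt((v1 + v2)/n) <= 1
print(n)   # 193
```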
6.14. Prediction Intervals
(1) The problem:
what is the likely value of the next observation?
what is the likely interval of the next observation?
(2) The method:
- based on the previous n observations, we have X̄
- the next observation is X_{n+1}
- under the normal assumption, X̄ − X_{n+1} is normal with zero mean and variance:
σ²/n + σ² = σ²(1 + 1/n)
- if the population variance is unknown and is estimated by S², then:
P( −t_{α/2} ≤ (X̄ − X_{n+1}) / (S √(1 + 1/n)) ≤ t_{α/2} ) = 1 − α
- therefore the prediction interval is:
x̄ ± t_{α/2} s √(1 + 1/n)
(3) An example (Example 7.16)
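As a sketch of the method, predicting the next student's study hours from the 12-student sample (x̄ = 15.5, s = 4 as in the earlier examples); t_{0.025}(11) = 2.201 is taken from a t table:

```python
import math

# 95% prediction interval for the next observation under the normal
# assumption; t_{0.025}(11) = 2.201 is a table value (assumed).
xbar, s, n = 15.5, 4.0, 12
t = 2.201
half = t * s * math.sqrt(1 + 1 / n)
print(round(xbar - half, 2), round(xbar + half, 2))   # 6.34 24.66
```

Note the interval is much wider than the confidence interval for the mean, since it must cover the variability of a single new observation, not just the uncertainty in x̄.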
6.15. Maximum Likelihood Estimation
- A demonstration example: sampling from a batch of mechanical parts to
determine the probability model (the form and the parameters). The distribution is
assumed to be Poisson with an unknown parameter λ (the mean).
-
The solution: take samples x1, x2, ..., xn, and assume the samples are
independent; then:
P(X1 = x1, X2 = x2, ..., Xn = xn) = P(X1 = x1) P(X2 = x2) ··· P(Xn = xn)
since
P(X = x) = f(x) = e^{−λ} λ^x / x!,   x = 0, 1, 2, ...
the likelihood is:
L(λ) = e^{−nλ} λ^{Σ_{i=1}^{n} x_i} / Π_{i=1}^{n} x_i!
what is the most likely value of λ?
* take ln(·)
* take the derivative and set it to zero:
d ln L(λ)/dλ = −n + (1/λ) Σ_{i=1}^{n} x_i = 0
therefore:
λ̂ = (1/n) Σ_{i=1}^{n} x_i = x̄
note that x̄ is called the maximum likelihood estimate
The general procedure:
- obtain a random sample x1, x2, ..., xn from the distribution of a random variable X
with density f and associated parameter θ;
- define the likelihood function for the sample:
L(θ) = Π_{i=1}^{n} f(x_i)
- find the expression for θ that maximizes the likelihood function;
- replace θ by θ̂ to obtain an expression for the maximum likelihood estimator for θ;
- (optional) find the observed value of this estimator for a given sample.
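The Poisson result λ̂ = x̄ can be checked numerically; this sketch (the count data are randomly generated stand-ins, not from the notes) grid-searches the log-likelihood and confirms the maximizer lands at the sample mean:

```python
import math
import random

# Numerical check that the Poisson log-likelihood derived above is
# maximized at lambda = xbar; the data are illustrative stand-in counts.
random.seed(3)
xs = [random.randint(0, 8) for _ in range(50)]
n = len(xs)
xbar = sum(xs) / n

def loglik(lam):
    # ln L(lambda) for L(lambda) = exp(-n*lam) * lam**sum(xs) / prod(x_i!)
    return -n * lam + sum(xs) * math.log(lam) - sum(math.lgamma(x + 1) for x in xs)

grid = [0.01 * k for k in range(1, 1201)]   # lambda = 0.01 .. 12.00
best = max(grid, key=loglik)
print(abs(best - xbar) <= 0.01)   # True: the grid maximizer sits at xbar
```

Because the log-likelihood is concave in λ, the grid maximizer is always the grid point nearest the analytical maximizer x̄.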
6.16. Bayes Estimation