Class 2

Estimation and Hypothesis Testing
When a parameter (e.g., the average) of a population is estimated using a sample of
data, the estimated value will vary, depending on the particular sample chosen.
Sampling variation, or more formally, sampling distribution, of the estimated
parameter gives us a frame of reference of how accurate the estimate is likely to be. If
we repeatedly sample, and our estimated parameter does not change much, then we
are confident that the estimate from just one sample is likely to be accurate. On the
other hand, if our estimated parameter changes quite markedly for different samples
of data, then we are not at all confident that the estimate from just one sample is likely
to be accurate.
Whenever we report an estimated value (e.g., the average of a sample of data), we
must provide our degree of confidence about the accuracy of the estimate. Typically,
in reports, you will see results given as an estimate together with a confidence interval.
However, typically, we will not have the resources to repeatedly sample to obtain the
sampling variation of our estimated value to report a confidence interval. Statistical
theory can help us with the computation of the confidence interval so we don’t need
to resort to repeated sampling to establish confidence intervals.
The sampling distributions of various statistics (e.g., mean, percentage, median,
standard deviation, etc.) are different, and deriving each of them requires different
statistical theory. Below we will focus on the sampling distribution of
the mean. That is, whenever we are computing averages, we can use a formula to
compute the confidence interval based on just one sample of data.
Central Limit Theorem (CLT)
The sampling distribution of the mean of independently drawn observations will be
approximately normally distributed, even if the distribution from which the sample is
drawn is not normal.
(Check internet sources/other references for descriptions of the central limit theorem)
A simulation can be conducted to illustrate the CLT. The data set, Literacy2.csv, contains
27598 students’ reading test scores. The following shows the histogram of the reading
scores:
[Figure: histogram of the reading scores]
This histogram looks quite skewed (i.e., not normally distributed). Compute the mean
and standard deviation of the reading scores:
Mean:_______________
Standard deviation:_________________
If we sample from this population, and compute the mean of the sample, we will not
get exactly the population mean. There will be variations in mean values across
different samples.
Select one random sample of 100 students. You can do this using the R function
“sample”.
Compute the mean of this sample.
Repeat this a few times by drawing a few samples, and see the variation of the mean
values of different samples.
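The sampling steps above can be sketched in R as follows. This is only an illustration: because the real Literacy2.csv file and its column names are not shown here, the `population` vector below is a simulated stand-in (an assumption), built to resemble skewed scores out of 30. With the real data, `population` would instead hold the reading-score column read from the file.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# ASSUMPTION: simulated stand-in for the 27598 reading scores (not the real data)
population <- round(30 * rbeta(27598, shape1 = 3.3, shape2 = 1.9))

# Draw one random sample of 100 students (without replacement) and take its mean
one_sample <- sample(population, size = 100)
print(mean(one_sample))

# Repeat a few times to see how the sample mean varies from sample to sample
few_means <- replicate(5, mean(sample(population, size = 100)))
print(round(few_means, 2))
```

The five printed means should differ slightly from each other; that spread is the sampling variation discussed above.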
Obtain the sampling distribution of the sample mean by drawing a random sample of 100
students 2000 times, using the R code provided.
Graph the sampling distribution. You should get a histogram like this one (replace my
picture with yours):
Now this histogram looks normally distributed!
Compute the mean and standard deviation of the 2000 sample means:
(A) Mean of the 2000 sample means:____________
(B) Standard deviation of the 2000 sample means:____________
The standard deviation of the sample means is called the standard error; the value in (B)
is a simulation-based estimate of it.
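A minimal sketch of the 2000-sample simulation is shown below. It is not the course's provided R code, and it uses a simulated stand-in `population` (an assumption) in place of the Literacy2.csv scores, so your own values for (A) and (B) will differ somewhat.

```r
set.seed(1)

# ASSUMPTION: simulated stand-in for the reading scores (not the real data)
population <- round(30 * rbeta(27598, shape1 = 3.3, shape2 = 1.9))

# 2000 repetitions: draw 100 students at random and record the sample mean
sample_means <- replicate(2000, mean(sample(population, size = 100)))

# Histogram of the 2000 sample means (the simulated sampling distribution)
hist(sample_means, breaks = 30,
     main = "Sampling distribution of the mean (n = 100)",
     xlab = "Sample mean")

mean_of_means <- mean(sample_means)  # (A)
sd_of_means   <- sd(sample_means)    # (B)
print(c(A = mean_of_means, B = sd_of_means))
```

(A) should be close to the mean of the whole population, while (B) should be much smaller than the standard deviation of the individual scores.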
Given that the sampling distribution of the sample means is approximately normally
distributed, we can compute a confidence interval based on the normal distribution. That
is, for normal distributions, about 95% of the observations lie within
mean ± 1.96 × standard deviation.
In our case, about 95% of the sample mean values should lie between
____________ and _____________
(work out the two values using (A) and (B) above)
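To check the 95% claim empirically, one option is to count how many simulated sample means fall inside the range. The sketch below again assumes a simulated stand-in `population` rather than the real reading scores.

```r
set.seed(1)

# ASSUMPTION: simulated stand-in for the reading scores (not the real data)
population <- round(30 * rbeta(27598, shape1 = 3.3, shape2 = 1.9))
sample_means <- replicate(2000, mean(sample(population, size = 100)))

# Range: mean of the sample means +/- 1.96 standard deviations of the sample means
lower <- mean(sample_means) - 1.96 * sd(sample_means)
upper <- mean(sample_means) + 1.96 * sd(sample_means)
print(c(lower = lower, upper = upper))

# Proportion of the 2000 sample means that fall inside the range
inside <- mean(sample_means >= lower & sample_means <= upper)
print(inside)

# For comparison: normal probability of lying within +/- 1.96 standard deviations
normal_prob <- pnorm(1.96) - pnorm(-1.96)
print(normal_prob)
```

The proportion `inside` should be close to 0.95, which is what the normal approximation predicts.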
Formula for computing the standard error
In practice, we can use the result derived from statistical theory that the standard error
of the mean is approximately
$$\frac{\sigma}{\sqrt{n}}$$
where $\sigma$ is the population standard deviation, and $n$ is the sample size.
In our case, $n$ is 100, $\sigma$ is 5.8, so using this formula, the standard error should be
about 0.58. How does this compare with what you obtained in (B) above?
In real life, we don’t know the value of $\sigma$. But if you look over the standard deviations
of several samples of around 100 observations, you will find that the sample standard
deviation of 100 observations is a good estimate of the population standard deviation.
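The claim that a sample standard deviation is a reasonable substitute for the population value can be checked with a short sketch. As before, `population` is a simulated stand-in (an assumption), so the numbers are illustrative only.

```r
set.seed(1)

# ASSUMPTION: simulated stand-in for the reading scores (not the real data)
population <- round(30 * rbeta(27598, shape1 = 3.3, shape2 = 1.9))
sigma <- sd(population)  # population standard deviation (unknown in real life)

# Standard deviations of ten separate samples of 100 observations
sample_sds <- replicate(10, sd(sample(population, size = 100)))
print(round(sample_sds, 2))

# Standard error from the formula: once with sigma, once with one sample's estimate
se_true      <- sigma / sqrt(100)
se_estimated <- sample_sds[1] / sqrt(100)
print(c(se_true = se_true, se_estimated = se_estimated))
```

The ten sample standard deviations should scatter around `sigma`, so the two standard errors should be similar.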
In practice, how to compute the confidence interval of the sample mean
(1) draw a sample of size $n$
(2) compute the sample mean ($\bar{X}$)
(3) compute the sample standard deviation ($\hat{\sigma}$)
(4) compute the 95% confidence interval of the true mean using $\bar{X} \pm 1.96\,\frac{\hat{\sigma}}{\sqrt{n}}$
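The four steps can be wrapped in a small R function. The name `mean_ci` is a hypothetical helper introduced here for illustration, and the `population` it is applied to is a simulated stand-in (an assumption) for the reading scores.

```r
# Hypothetical helper (not from the handout): normal-theory confidence interval
# for a mean, using the sample standard deviation as the estimate of sigma
mean_ci <- function(x, z = 1.96) {
  n  <- length(x)
  se <- sd(x) / sqrt(n)  # estimated standard error of the mean
  c(lower = mean(x) - z * se, mean = mean(x), upper = mean(x) + z * se)
}

set.seed(1)
# ASSUMPTION: simulated stand-in for the reading scores (not the real data)
population <- round(30 * rbeta(27598, shape1 = 3.3, shape2 = 1.9))

x <- sample(population, size = 100)  # step (1)
print(mean_ci(x))                    # steps (2) to (4)
```

Passing `z = 1.64` instead of the default gives a 90% interval with the same function.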
In one sentence, describe the meaning of a statement like the following:
The estimated mean of height is 174cm ± 29cm (95% confidence interval)
____________________________________________________________________
____________________________________________________________________
____________________________________________________________________
General process of making inference about a statistic
(1) Establish the sampling distribution of the statistic to assess the variability of the
statistic.
For example, if we are interested in the mean reading score of students in Taipei, we
take a sample and compute the sample mean. Because this sample mean is not the
population mean, there is likely variation in the value of the sample mean if different
samples are drawn. We need to find out how large the variation is. If the variation is
large, then our estimate is probably not very accurate to represent the population
mean. If the variation is small, then our estimate is probably quite close to the
population mean.
(2) We can repeatedly sample to establish the sampling distribution of the statistic of
interest. But this may be impractical as it will be too costly. We can also establish
sampling distributions theoretically.
Making inferences about the mean
We can use the central limit theorem to establish the sampling distribution of the
sample mean, without doing repeated sampling. The central limit theorem says that the
mean of independently drawn observations will be approximately normally
distributed, even if the distribution from which the sample is drawn is not normal.
Further, it can be shown that mean values computed from samples of size $n$ follow a
normal distribution with mean $\mu$ and standard deviation of $\sigma/\sqrt{n}$ (known as the
standard error), where $\mu$ is the mean of the distribution we draw our samples from,
and $\sigma$ is the standard deviation of the distribution we draw our samples from.
That is, if $\bar{X}$ denotes the sample mean, then
$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$
has a standard normal distribution
with mean zero and standard deviation 1 (z-score). For a standard normal distribution,
95% of the observations lie within ±1.96. With a little re-arrangement of the equation,
it can be shown that 95% of the time, or, we are 95% confident that,
$$\bar{X} - 1.96\,\frac{\sigma}{\sqrt{n}} \;\le\; \mu \;\le\; \bar{X} + 1.96\,\frac{\sigma}{\sqrt{n}}$$
(That is, for about 95% of the samples we could draw, the range shown above contains the
population mean.)
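The re-arrangement can be written out step by step. Starting from the fact that a standard normal variable lies within ±1.96 with probability of about 0.95, multiply through by $\sigma/\sqrt{n}$ and then isolate $\mu$:

```latex
\begin{align*}
P\!\left(-1.96 \le \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \le 1.96\right) &\approx 0.95 \\
P\!\left(-1.96\,\frac{\sigma}{\sqrt{n}} \le \bar{X}-\mu \le 1.96\,\frac{\sigma}{\sqrt{n}}\right) &\approx 0.95 \\
P\!\left(\bar{X}-1.96\,\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X}+1.96\,\frac{\sigma}{\sqrt{n}}\right) &\approx 0.95
\end{align*}
```

The probability refers to the random sample mean $\bar{X}$; the population mean $\mu$ is a fixed value.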
Hypothesis Testing
Hypothesis testing is about using data to make (statistical) conclusions about a
hypothesis.
For example, suppose I have a hypothesis that the population mean of students’ reading
scores is 17 out of 30:
$H_0: \mu = 17$
I draw a sample, say, of 100 students. The sample mean and standard deviation of my
sample are 18.4 and 5.8 respectively. The 95% confidence interval for the mean is
$$18.4 \pm 1.96 \times \frac{5.8}{\sqrt{100}} = (17.3,\ 19.5)$$
The 95% confidence interval of (17.3, 19.5) means that, based on our sample, we are
95% confident that the true mean lies between 17.3 and 19.5: intervals computed this way
contain the true mean for about 95% of samples, and miss it for about 5%.
As the hypothesised mean value, 17, is outside this confidence interval, we
reject the null hypothesis at the 95% confidence level. This is
also described as rejecting at the level of p=0.05. It means that, if the null hypothesis
were true, there would be only a 5% chance of drawing a sample that leads us to reject it
incorrectly.
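The worked example can be reproduced in R from the summary numbers alone (sample mean 18.4, standard deviation 5.8, n = 100, hypothesised mean 17). The z-statistic and two-sided p-value at the end are an equivalent way of reaching the same decision; they are added here for illustration and are not part of the handout's procedure.

```r
sample_mean <- 18.4
sample_sd   <- 5.8
n           <- 100
mu0         <- 17    # hypothesised population mean under H0

se <- sample_sd / sqrt(n)                  # standard error
ci <- sample_mean + c(-1, 1) * 1.96 * se   # 95% confidence interval
print(round(ci, 1))

# Reject H0 if the hypothesised mean falls outside the interval
reject <- mu0 < ci[1] || mu0 > ci[2]
print(reject)

# Equivalent check: z-statistic and two-sided p-value (illustrative addition)
z <- (sample_mean - mu0) / se
p_value <- 2 * pnorm(-abs(z))
print(p_value)
```

Rejecting because 17 lies outside the interval corresponds to the p-value being below 0.05.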
More generally, we make inferences from our sample about the likelihood of
population parameters, and we make conclusions about the hypothesis based on our
inference.
Sample size and hypothesis testing
Now, draw a sample of 10 from our reading score data. Test the hypothesis that
$H_0: \mu = 17$
What is the confidence interval in this case?
The 95% confidence interval of the mean is between ________________ and _____________.
What is your conclusion about the hypothesis?
Reject or Accept?
Next, draw a sample of 20, and then 50, and then 200. Does the sample size make a
difference to whether you accept or reject the null hypothesis at p=0.05?
Sample of 20: Reject or Accept?
Sample of 50: Reject or Accept?
Sample of 200: Reject or Accept?
What if you use p=0.1 (a 90% confidence interval; for a normal distribution, 90% of the
sample means lie between $\mu \pm 1.64\,\frac{\sigma}{\sqrt{n}}$)? Would you change your decision of
rejecting or accepting the hypothesis?
Make a table below:
Sample size    Reject or Accept               Reject or Accept
               $H_0: \mu = 17$ at p=0.05      $H_0: \mu = 17$ at p=0.1
10             ________                       ________
50             ________                       ________
100            ________                       ________
200            ________                       ________
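One way to produce the entries of such a table is sketched below. The helper `rejects` is hypothetical, and `population` is a simulated stand-in (an assumption) for the reading scores, so the decisions printed here need not match those from your own samples.

```r
set.seed(1)

# ASSUMPTION: simulated stand-in for the reading scores (not the real data)
population <- round(30 * rbeta(27598, shape1 = 3.3, shape2 = 1.9))
mu0 <- 17  # hypothesised mean under H0

# Hypothetical helper: TRUE if mu0 falls outside the interval mean +/- z * SE
rejects <- function(x, mu0, z) {
  se <- sd(x) / sqrt(length(x))
  abs(mean(x) - mu0) > z * se
}

sizes <- c(10, 50, 100, 200)
results <- data.frame(n = sizes, reject_p05 = NA, reject_p10 = NA)

for (i in seq_along(sizes)) {
  x <- sample(population, size = sizes[i])              # one sample per size
  results$reject_p05[i] <- rejects(x, mu0, z = 1.96)    # 95% interval
  results$reject_p10[i] <- rejects(x, mu0, z = 1.64)    # 90% interval
}
print(results)
```

Because the 90% interval is narrower than the 95% interval for the same sample, any hypothesis rejected at p=0.05 is also rejected at p=0.1.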
Given that we know that the true population mean is 18.98 (which, in real life, we
will not know), what do you think about your conclusions in the above table?
What if the hypothesis is $H_0: \mu = 18$? Could you reject this hypothesis? What sample
size would you use to reject this hypothesis?
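As a rough guide to the sample-size question, suppose (an assumption made only for this back-of-the-envelope calculation) that the sample mean comes out close to the true mean of 18.98 and the sample standard deviation close to 5.8. The hypothesised value 18 falls outside the 95% interval when the half-width of the interval is smaller than the distance 18.98 − 18 = 0.98:

```latex
\begin{align*}
1.96\,\frac{5.8}{\sqrt{n}} &< 0.98 \\
\sqrt{n} &> \frac{1.96 \times 5.8}{0.98} = 11.6 \\
n &> 11.6^2 = 134.56
\end{align*}
```

So a sample of roughly 135 or more would be needed under these assumptions. Any particular sample mean will differ from 18.98, so the decision for a given sample can still go either way.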
A cartoon in Darrell Huff’s book “How to Lie with Statistics” depicted one person
saying “I want to know the truth”, and another person replying “it ain’t statistics”.
What is your assessment of statistical hypothesis testing in relation to this cartoon?
What DOES statistics tell you?
Some discussion points:
(1) For a population of people, the height distribution is normally distributed with a
mean of 170 cm and a standard deviation of 12 cm. Dave has a height of 196cm.
Could Dave be from this population of people?
(2) In a region, the number of rainy days per year is approximately normally
distributed, with a mean of 85 days and a standard deviation of 15 days (the
distribution was established by collecting 200 years of data). One year, the number of
rainy days was 120 days. Was this year an ‘abnormal’ year?
If so, looking at the 200 years of data, what percentage of years do you think will
be ‘abnormal’? But the 200 years of data have been used to establish the ‘norm’, so
how can any particular year be ‘abnormal’?
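As a starting point for these discussion questions, the sketch below computes how many standard deviations each observation lies from its mean, and the corresponding normal tail probabilities. Treating "abnormal" as lying outside mean ± 1.96 standard deviations is an assumption made here for illustration; it is not a definition given in the questions.

```r
# (1) Dave's height relative to the population (mean 170 cm, sd 12 cm)
z_dave   <- (196 - 170) / 12                       # standard deviations above the mean
p_taller <- 1 - pnorm(196, mean = 170, sd = 12)    # share of people at least this tall
print(c(z = z_dave, p = p_taller))

# (2) A year with 120 rainy days (mean 85 days, sd 15 days)
z_rain   <- (120 - 85) / 15
p_wetter <- 1 - pnorm(120, mean = 85, sd = 15)     # share of years at least this wet
print(c(z = z_rain, p = p_wetter))

# ASSUMPTION: "abnormal" means outside mean +/- 1.96 sd, in either direction
p_outside <- 2 * pnorm(-1.96)
print(p_outside)
```

Under that convention about 5% of years would be labelled "abnormal" even though every one of them comes from the same distribution, which is the tension the last question points at.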