Class 1
Sampling distribution and confidence interval
In quantitative research, we are often interested in knowing certain characteristics of
the population. In the context of education, this could be the average student
achievement in mathematics, the percentage of students reaching a particular level,
comparisons between girls and boys, etc. Typically, we will not have access to
population data. So sampling is carried out to estimate the population parameters of
interest.
It is straightforward to compute means, percentages, and other statistics based on the
sample. The issue is: how accurate are these estimates compared to the population
parameters? Given that we don't have population data, how are we going to assess the
accuracy of our estimates? If we don't provide accuracy measures associated with our
estimates (technically termed confidence intervals), incorrect conclusions may be
drawn. For example, we may conclude that 75% of students reach a minimal
proficiency standard when, in the population, the figure is actually 80%.
Whether this difference (between the estimated and the actual parameter values)
matters depends on the purposes for which these numbers are used. We may
erroneously conclude that a policy measure failed (or succeeded) simply because our
estimates are not accurate enough.
Whenever estimation is carried out, it is essential to report confidence intervals,
so the accuracy of estimates can be taken into account when conclusions are
drawn.
One way to determine the confidence interval is to repeatedly draw samples. The
sample estimate of the parameter is computed for each sample. An examination of the
variation of the estimates across samples gives us an idea of how different the
estimates are. For example, if we want to estimate the average height of people in a
population, we draw samples of a certain size and compute the averages. Suppose
that, for 10 samples, we obtain 175, 175, 175, 176, 174, 175, 176, 176, 175, 175 as
the sample averages. We may then conclude that our sample means seem quite
accurate, because there is not much variation across samples. On the other hand, if
we obtain 178, 164, 185, 190, 163, 174, 172, 166, 187, 184 as the sample averages,
we may conclude that there is a great deal of variation from sample to sample, so our
confidence in the accuracy of ONE sample mean is not very high. In the latter case,
our confidence interval will be wide, reflecting the uncertainty of the estimate. In the
former case, our confidence interval will be narrow, showing that the sample estimate
is quite accurate.
The dilemma is that, typically, we only have the resources to draw one sample, and
we have to estimate the confidence interval based on this one sample alone. In this
case, statistical theory can help us.
To understand the statistical process of computing confidence intervals, a number of
simulations are designed below. The advantage of simulation is that we begin with
population parameters that are known, simulate sampling processes, and examine the
degree of variation across samples. We can then compare sample estimates with the
population parameters and assess how different they are. Simulation will help us
understand the notion (and computation) of confidence intervals in estimation. In real
life, we typically cannot carry out simulation in a straightforward way, as we do not
know the true population values. Instead, we apply formulas to compute confidence
intervals. The simulation exercises below are aimed at helping us understand these
formulas, so that they are not just black boxes out of which we pull a few numbers.
Simulation 1
Description
We are interested in estimating the proportion of heads in tossing a coin, to assess
whether the coin is fair or biased. To do this, we can toss a coin 10 times, and record
the number of heads in 10 tosses. The proportion of heads in 10 tosses is our
estimated value of the population proportion.
We know that, even if the coin is fair, in 10 tosses the proportion of heads may not be
exactly 50%. We could get 7 out of 10 (7/10), or 8 out of 10 (8/10), or 2 out of 10
(2/10), etc. If we base our estimate on just 10 tosses and use it as the population
proportion (say, in 1 million tosses), more often than not we will reach incorrect
conclusions. However, if we take the confidence interval into account, we might note,
for example, that our estimates mostly vary between 3 and 7 heads, so the coin could
still be fair even if we don't get exactly 50% heads in the 10 tosses. To determine the
likely number of heads in 10 tosses of a fair coin, do the following simulation.
Step 1
Do the simulation in EXCEL, following the animated software demo
10CoinTosses_demo.swf.
Read the EXCEL csv file into R. Plot a histogram.
(Your picture will look a little different from mine, as your random numbers will not
likely be the same as mine. Paste your results over mine.)
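If you would rather do the whole step in R, the sketch below is one way; the file name
CoinTosses.csv is hypothetical (use whatever name you gave your EXCEL export), and
rbinom() plays the role of the EXCEL RAND() formula.

    # Option A: read your EXCEL export (file name is hypothetical; adjust it)
    # heads <- read.csv("CoinTosses.csv")$heads

    # Option B: simulate directly in R
    set.seed(1)                                  # reproducible random numbers
    heads <- rbinom(100, size = 10, prob = 0.5)  # 100 samples of 10 fair tosses
                                                 # (change prob for a biased coin)
    hist(heads, breaks = seq(-0.5, 10.5, by = 1),
         main = "Number of heads in 10 tosses (100 samples)",
         xlab = "Number of heads")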
This distribution is called the sampling distribution of the random variable number of
heads in 10 tosses of a fair coin.
From this distribution, we get an idea of the variability of our results. For example,
from the histogram, we see that we can easily get 3 to 7 heads in 10 tosses of a fair
coin. (Your conclusion may be slightly different from mine. If we simulate this 1000
times instead of 100 times, our results are likely to be more similar.) If we have a
coin and we want to assess whether the coin is fair or not, we are not likely to
conclude that the coin is biased if we get 3 to 7 heads, since we know that such
results are quite likely from a fair coin.
(An additional exercise is to change the probability of getting a head, say, to 0.7
instead of 0.5, and then find the sampling distribution. That is, instead of the formula
“=IF(RAND()<0.5,1,0)”, use the formula “=IF(RAND()<0.7,1,0)”. Note the “<”:
RAND() falls below 0.7 with probability 0.7, so the cell shows a head 70% of the
time.)
Step 2
Repeat the process for 50 tosses instead of 10 tosses. Use R code to do this
simulation.
(Replace my graph with yours. Make sure the scale goes from 0 to 50.)
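A minimal R sketch for this step (the variable names are my own; setting n.tosses to
100 gives you Step 3 as well):

    set.seed(2)                           # reproducible random numbers
    n.tosses <- 50                        # set to 100 for Step 3
    heads50 <- rbinom(100, size = n.tosses, prob = 0.5)
    hist(heads50, xlim = c(0, n.tosses),  # keeps the scale from 0 to 50
         main = "Number of heads in 50 tosses (100 samples)",
         xlab = "Number of heads")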
This is the sampling distribution when the proportion of heads is estimated from 50
tosses. We can see that the sampling distribution is tighter than that from the 10
tosses. That is, we are more confident about our estimated proportion of heads from
one sample of 50 tosses (than from one sample of 10 tosses). We might say that, in 50
tosses, if we get between 20 and 30 heads, we will conclude that the coin is fair.
Step 3
Repeat the process for 100 tosses instead of 50 tosses.
[Insert your sampling distribution below.]
What range of values would be your acceptance range to conclude that the coin is
fair?
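One way to answer this from your simulation is to take the middle 95% of the
simulated counts as the acceptance range. A sketch, assuming heads100 holds your
100 simulated counts from Step 3:

    heads100 <- rbinom(100, size = 100, prob = 0.5)  # 100 samples of 100 tosses
    quantile(heads100, c(0.025, 0.975))              # middle 95% of the counts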
Level of Confidence
In deciding on the confidence interval associated with an estimate, we first need to
decide on a level at which the confidence interval is stated. For example, we may say
that at the 95% confidence level, our estimate will range between 18 and 32. What
this means is that, if we repeatedly draw samples, we expect our sample estimate to be
within the range of 18 to 32, 95% of the time. For 5% of the time, our sample estimate
will be outside this range. Statistically, we say that the 95% confidence interval is 18
to 32.
We can choose a different level of confidence. For example, we may choose 90%
confidence level. That means that the range we quote will cover 90% of our sample
estimates if we repeatedly draw samples. For 10% of the time, the sample estimate
will be outside the 90% confidence interval. Consequently, the higher the confidence
level (e.g., 99% instead of 95%), the wider the confidence interval will be.
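This trade-off can be seen directly from the normal quantiles. A quick illustration in
R, using the example above (an estimate of 25 with a standard error of about 3.57,
which reproduces the 18-to-32 interval at the 95% level; the standard error is my
back-calculation, not a value given in the text):

    est <- 25; se <- 3.57                  # se back-calculated from 18-32 at 95%
    for (level in c(0.90, 0.95, 0.99)) {
      z <- qnorm(1 - (1 - level) / 2)      # about 1.64, 1.96, 2.58
      cat(level, "interval:", round(est - z * se, 1),
          "to", round(est + z * se, 1), "\n")
    }

The 99% interval comes out widest, as the paragraph above says it should.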
Theoretical formula for a random variable with a binomial distribution
The coin tossing exercise is an example of the binomial distribution, where the
number of heads in n tosses, X, follows a binomial distribution. The number of tosses,
n, is the “number of trials”, and the probability of success, p, (of tossing a head) is 0.5
in our case.
For a binomial random variable, the mean is np, and the variance is npq, where
q=1-p. So the standard deviation is the square root of (npq).
Therefore, we can compute the confidence interval using an approximate formula.
Remember that for a normal distribution, 95% of the observations lie within
mean ± 1.96 × standard deviation. So the 95% confidence interval for a binomial
random variable is approximately np±1.96×sqrt(npq). Check this with your simulated
results.
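A minimal check of this formula in R, for the three numbers of tosses used above:

    p <- 0.5; q <- 1 - p
    for (n in c(10, 50, 100)) {
      half <- 1.96 * sqrt(n * p * q)   # half-width of the approximate interval
      cat("n =", n, ":", n * p - half, "to", n * p + half, "\n")
    }

For n = 50 this gives roughly 18 to 32 heads, consistent with the simulated sampling
distribution above.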
Simulation 2
Description
This time, we will simulate a height distribution and sample from it.
Step 1
Assume that, in some adult population, the average height is 175 cm, the standard
deviation is 20 cm, and heights are normally distributed.
We will randomly sample from this population 10 people and record their heights. We
then compute the average of the 10 heights, and use this as our estimate of the
population average.
(Replace my graph with yours)
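A minimal R sketch for this step (as before, the variable names are mine; we draw
100 samples, the same number as in Simulation 1):

    set.seed(3)
    n.people <- 10                        # set to 50 or 100 for Steps 2 and 3
    # 100 samples of n.people heights from N(175, 20); record each sample mean
    sample.means <- replicate(100, mean(rnorm(n.people, mean = 175, sd = 20)))
    hist(sample.means,
         main = "Sampling distribution of the mean (n = 10)",
         xlab = "Sample mean height (cm)")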
How confident are we about using the average height of 10 randomly selected people
to represent the population average? If our estimate from a sample of 10 is 185 cm,
can we say that IS the population average? The sampling distribution shown above
tells us that, if the true mean is 175 cm, we can easily get a mean of 185 cm in a
sample of 10 people. Should we say that, while our estimate is 185 cm, given the
inaccuracies caused by sampling, the true mean is likely to lie somewhere between x
and y?
If we increase our sample size from 10 to 100, would our values of x and y be much
closer to each other?
Step 2
Repeat the process for a sample of 50 people instead of 10 people.
[Insert your sampling distribution below.]
Step 3
Repeat the process for a sample of 100 people.
[Insert your sampling distribution below.]
Step 4
In R, compute the average and standard deviation of the 100 sample means obtained
when 10 people were selected.
Do the same for the 100 samples when 50 people were selected.
Do the same for the 100 samples when 100 people were selected.
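A sketch of these computations, assuming the sample means were generated as in
Steps 1 to 3 (the names means10, means50, and means100 are mine):

    sim.means <- function(n) replicate(100, mean(rnorm(n, mean = 175, sd = 20)))
    means10  <- sim.means(10)
    means50  <- sim.means(50)
    means100 <- sim.means(100)
    # average and spread of the 100 sample means for each sample size
    sapply(list(n10 = means10, n50 = means50, n100 = means100),
           function(m) c(mean = mean(m), sd = sd(m)))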
As the sample size increases from 10 people to 100 people, are the sample means
getting closer to 175 cm?
As the sample size increases from 10 people to 100 people, do the sample means vary
less?
In R, we have recorded the sample means of 100 samples. A histogram of these
means is the sampling distribution of the sample means, and it tells us how accurate
our estimated means are.
Standard errors
The standard deviation of the sampling distribution of the sample means is called the
standard error of the mean.
From statistical theory, the standard error of the mean is given by

    σ_height / sqrt(n)

where σ_height is the standard deviation of the population distribution (in our
example, this is 20 cm) and n is the sample size (e.g., 10, 50, 100).
We can use this formula to estimate the variation of the sample means, without
resorting to simulation.
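For our height example:

    sigma.height <- 20
    n <- c(10, 50, 100)
    sigma.height / sqrt(n)   # expected standard errors: about 6.32, 2.83, 2.00

These are the values that belong in the last column of the table below.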
Fill in the table below.
Sample size | Observed mean | Observed standard deviation of the means | Expected mean | Expected standard deviation of the means (as calculated from the formula above)
10          |               |                                          |               |
50          |               |                                          |               |
100         |               |                                          |               |
However, the formula for computing the standard error requires knowledge of the
population standard deviation, which is usually unknown to us. In practice, we use
the observed standard deviation as an estimate. This is the topic for next week.
A Practical Example
A data set (not simulated) of students’ reading scores is provided (Literacy.csv).
The data set contains student scores on a reading test for a population.
Compute population mean and standard deviation of reading scores.
Compute population means for girls and boys separately.
Use the R sample function to subsample just 10 students. Compute the sample mean
for all 10 students, and compute sample means for girls and boys separately. Are you
happy that these are close enough to the population means?
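A sketch of these computations; the column names score and gender are my
assumptions (check names(literacy) and adjust to the actual columns in Literacy.csv):

    literacy <- read.csv("Literacy.csv")
    # population parameters (column names assumed; adjust as needed)
    mean(literacy$score); sd(literacy$score)
    tapply(literacy$score, literacy$gender, mean)  # girls and boys separately
    # subsample 10 students and compare
    samp <- literacy[sample(nrow(literacy), 10), ]
    mean(samp$score)
    tapply(samp$score, samp$gender, mean)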
Increase your sample size. At what sample size are you happy enough that the sample
means are good estimates of the population means?
Summary
By the end of this class, you should understand the need to specify the degree of
accuracy of an estimate.
Typically, in reports, you will see results such as a mean score of 418 with a 95%
confidence interval of ±14.3. This means that, with 95% certainty, the mean score
lies within 418 ± 14.3. That is, you are making a statement that the population mean
lies between 403.7 and 432.3, and there is a 5% chance that this statement is
incorrect.