DTC Quantitative Research Methods
Statistical Inference I: Sampling
distributions
Thursday 30th October 2014
Outline
• Inference
• Confidence intervals
• Sampling distributions
• The normal distribution and z-scores
• Working out confidence intervals
What is inference?
• Most of the time we care about the attributes of a
population – adults in the UK; women workers; small
businesses…
• But we usually only study a sample of the population.
• Inferential statistics give you the tools to infer
population characteristics from the sample.
• Inferential statistics usually assume a random
sample. This is why it is so important to use methods
of random sampling when at all possible.
• Instead of, say, reporting that 35% of our sample
have some characteristic, using inferential statistics
we are able to estimate, or infer, the proportion of the
population that is likely to have that characteristic.
• In order to do this we use confidence intervals.
What is a ‘Confidence
Interval’?
• A ‘Confidence Interval’ for a particular sample statistic
(e.g. the mean) is a range of values around the statistic
that is believed to contain, with a certain level of
probability (often 95%), the ‘true’ value of that statistic (i.e.
the population value).
• For example, suppose we see a report that 37% of people (plus
or minus 3%) intend to vote Labour. What is being said is that
the pollsters are reasonably confident that the true proportion
of people who intend to vote Labour is between 34% and 40%. If
they have not said otherwise, it is very likely that this is a
95% confidence interval.
How do we arrive at a
confidence interval?
• How do we judge how big a confidence interval
should be (plus or minus 2% or 5% or 15%...)?
• What does it mean to be 95% certain that it is the
size that we say it is?
• And how do we know that the results we got in
our sample of the population are not just a quirk
of our particular sample (or ‘sampling error’)?
• Part of the answer to these questions can be
seen in common-sense assessments…
Example: Judging whether
differences occur by chance…
How do we judge whether it is plausible that two
population means are the same and that any difference
between them simply reflects sampling error?
Example: Household size of minority ethnic groups
(HOH = Head of household; data adapted from 1991 Census)
1. The size of the difference between the two sample means

                      Mean
   Indian HOH:         3.0
   Bangladeshi HOH:    5.0

   Indian HOH:         3.0
   Pakistani HOH:      4.0
The first difference is more ‘convincing’
Judging whether differences
occur by chance…
2. The sample sizes of the two samples
                                            Mean
   Pakistani HOH:    3 4 5                   4.0
   Bangladeshi HOH:  4 5 6                   5.0

   Pakistani HOH:    2 2 3 4 4 4 5 5 5 6     4.0
   Bangladeshi HOH:  2 3 4 4 5 5 6 6 7 8     5.0
The second difference is more ‘convincing’
Judging whether differences
occur by chance…
3. The amount of variation in each of the
two groups (samples)
                                            Mean
   Pakistani HOH:    2 2 3 4 4 4 5 5 5 6     4.0
   Bangladeshi HOH:  2 3 4 4 5 5 6 6 7 8     5.0

   Pakistani HOH:    4 4 4 4 4 4 4 4 4 4     4.0
   Bangladeshi HOH:  5 5 5 5 5 5 5 5 5 5     5.0
The second difference is more ‘convincing’.
Example continued: the impact of
variability on a difference in means.
The three graphs each
show two groups with the
same mean difference.
However, the groups in
each of the three graphs
have different levels of
variability.
Where there is lower
variability there is less
cross-over between the
groups, and so the
difference between the
means expresses a more
pertinent difference (there
is almost no one in group
A with the same score as
anyone in group B).
Judging whether differences
occur by chance…
As we’ll see, these three things – the size of
the difference between the means, sample
size, and the amount of variation (measured
by the standard deviation) within the
sample(s) – are critical to our determination
of whether a difference we observe in a
sample (or between samples) is likely to
represent a real difference in the population
(or between populations).
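To make this concrete, here is a minimal Python sketch (assuming numpy and scipy are available; it uses the household-size samples shown above) of the two-sample t-test that we will meet in the material on statistical testing, which combines exactly these three ingredients:

```python
# A minimal sketch (not from the original slides) showing how the three
# ingredients -- mean difference, sample size, and variation -- combine
# in a two-sample t-test. Data: the household-size samples shown above.
import numpy as np
from scipy import stats

pakistani = np.array([2, 2, 3, 4, 4, 4, 5, 5, 5, 6])    # mean 4.0
bangladeshi = np.array([2, 3, 4, 4, 5, 5, 6, 6, 7, 8])  # mean 5.0

# A larger mean difference, a larger n, and a smaller spread all push t
# further from zero, making chance a less plausible explanation.
t, p = stats.ttest_ind(pakistani, bangladeshi)
print(f"t = {t:.2f}, p = {p:.3f}")
```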
So, what is the relation of the
sample to the population?
• If the sample is a random sample of the population, it
may sometimes have a large number of extremely
high values (for example: very happy people)
• And sometimes it may have a large number of
extremely low values (for example: very sad
people)
• But over the long run (if we kept on taking a sample,
and then putting it back and taking another one), we
would expect that most of the samples would fairly
well represent the population (for example: with a
mean happiness that corresponds fairly closely to
the mean happiness of the population).
Sampling Distributions
• The distribution of different possible samples that could be
  taken from a population is known as a sampling distribution.
• The more we understand about this distribution the better,
  because it will help us to work out the likely relationship of
  our particular sample to the population.
• What we find is that as more and more samples are taken, the
  average (i.e. mean) of the sample means tends to equal the mean
  of the population.
• The sampling distribution of means also looks like a normal
  distribution (Central Limit Theorem).
• However, the sampling distribution of means is less varied than
  the population.
• See sampling distribution simulation at:
  http://onlinestatbook.com/stat_sim/
  (a small code sketch of the same idea follows below)
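The same idea can be tried offline in a few lines of Python (a minimal sketch assuming numpy; the population and sample size are invented for illustration):

```python
# Sketch of a sampling-distribution simulation (illustrative values only):
# draw many samples from a population, record each sample's mean, and
# compare the distribution of those means with the population itself.
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)  # deliberately skewed

n = 100                                   # size of each sample
sample_means = [rng.choice(population, size=n).mean() for _ in range(5_000)]

print(population.mean(), np.mean(sample_means))   # nearly equal
print(population.std(), np.std(sample_means))     # means are far less varied
print(population.std() / np.sqrt(n))              # ~ std of the sample means
```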
[Figure: Sampling from a Population: Sample Means (from Field, 2005)]
The formal theorem
“If repeated (simple random) samples of size N
are drawn from a normally distributed population,
the means of such samples will be normally
distributed with mean μ and standard error [i.e.
standard deviation] σ/√N... if the N of each
sample drawn is large, then regardless of the
shape of the population distribution the sample
means will tend to distribute themselves normally
with mean μ and standard error σ/√N”.
μ = population mean
σ = population standard deviation
N = number in sample
So where does this get us…???
• Well, we know that over the long run the mean of our
samples is likely to end up as the population mean.
• We know that over the long run (when the sample is
‘large’ enough) the distribution of sample means
looks normal. [Note: A “large sample” is sometimes
considered to be one of size 30+, but a size of 100+
can more ‘safely’ be viewed as adequately large.]
• And we know that the variation in the sample means,
known as the standard error, is (more or less) σ/√N.
• Although we usually only have a single sample, this
information means we can work out a fairly reliable
estimate of the population mean by combining the
sample with what we know about normal distributions.
What’s so special about the
Normal Curve?
• The normal curve is a symmetrical distribution of scores with an equal
  number of scores above and below the midpoint of the abscissa (the
  horizontal axis, or ‘x-axis’, for the curve).
• Since the distribution of scores is symmetric, the mean, median, and mode
  are all at the same point on the abscissa. In other words, the mean = the
  median = the mode.
• If we divide the distribution up into standard deviation units, a known
  proportion of scores lies within each portion under the curve.
  [Figure: normal curve annotated “34% of cases are between the mean and
  one SD away”]
• From published or online tables, we can find the proportion of scores above
  and below any point on the abscissa, expressed in standard deviation units.
  Scores expressed in standard deviation units are referred to as z-scores.
z-scores
z-scores can be calculated for any value. They are a means of standardizing
values that are measured on different scales by showing these values just in
terms of the number of standard deviations away from the mean that they fall.

z-scores are calculated by subtracting the mean from any value and dividing
the result by the standard deviation:

    z = (x - mean) / s

z-scores will always have a mean of 0 and a standard deviation of 1.

We can quickly see that this is true of the mean, since when x = mean, the
numerator (top bit!) will equal 0, and therefore z must = 0.

It may be a little less clear that it is true of the standard deviation.
However, if you think about the instance when x is one standard deviation
bigger than the mean (i.e. x = mean + s):

    z = ((mean + s) - mean) / s = s / s = 1
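This can be checked numerically in Python (a minimal sketch assuming numpy; the scores are invented):

```python
# Sketch: standardize an invented set of scores and confirm that the
# resulting z-scores have mean 0 and standard deviation 1.
import numpy as np

x = np.array([3.0, 5.0, 4.0, 6.0, 2.0])   # made-up scores
z = (x - x.mean()) / x.std()              # z = (x - mean) / s

print(z.mean())   # 0.0 (up to floating-point error)
print(z.std())    # 1.0
```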
Finding the 95% point on a
normal distribution…
• From a table of z-scores we can see that when
z = 1.96 (sometimes simplified to z = 2)
the p-value, which represents the
probability of being in the larger area
(to the left), is 0.975.
• Therefore the area under one (small)
tail of the curve is p=0.025.
• This means that scores greater than
z = 1.96 occur just 2.5% of the time.
• Further (because the normal curve is
symmetric) we can calculate that the
area under both tails (beyond z = 1.96
and z = -1.96) is 0.05.
• In other words, 95% of the area is in the
middle, between z = -1.96 and z = 1.96.
• And scores further from the mean than
1.96 thus only occur 5% of the time.

[Figure: standard normal curve. The area below z = 1.96 is 97.5%, the
upper tail beyond it is 2.5%, and 95% of the area lies between
z = -1.96 and z = 1.96]
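Instead of a printed z-table, the same numbers can be recovered in Python (a sketch assuming scipy):

```python
# Sketch: recover the 1.96 'magic number' and the tail areas from the
# standard normal distribution instead of a printed z-table.
from scipy import stats

print(stats.norm.ppf(0.975))           # 1.959963... i.e. z = 1.96
print(1 - stats.norm.cdf(1.96))        # 0.025 (area in the upper tail)
print(stats.norm.cdf(1.96) - stats.norm.cdf(-1.96))   # 0.95 (middle area)
```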
Note: What happens if the sample size is
too small for one to safely assume that the
sample mean has a Normal distribution?
• When a sample is small (i.e. less than about 25)
the assumption that the sample mean is normally
distributed is not reasonable.
• In fact, regardless of sample size, the sample
mean can be assumed to have a t-distribution;
the precise shape of a t-distribution depends on
the sample size, and for moderate-to-large
sample sizes the t-distribution is very similar to
the Normal distribution (and, as the sample size
approaches infinity, eventually converges with it).
Combining that with what we know about the sampling distribution:
• 95% of cases lie within +/- 1.96 standard deviations of the mean in a
normal distribution.
• The distribution of sample means is normal.
• And the standard error of sample means is approximately σ/√n.

[Figure: the sampling distribution of the mean, centred on the population
mean μ. 95% of sample means lie within 1.96σ/√n either side of μ; 2.5% of
sample means lie in each tail. Horizontal axis: sample mean; vertical
axis: frequency.]

Therefore 95% of sample means fall into the range:
μ - 1.96(σ/√n) to μ + 1.96(σ/√n)
Example
• If we take a sample of 100 people and find that they work a mean
of 34 hours per week with a standard deviation of 8 hours, how
do we construct a 95% confidence interval for the mean number
of hours worked by the population?
• We know that 95% of sample means fall in the range:
  μ - 1.96(σ/√n) to μ + 1.96(σ/√n)
• We estimate σ using the sample standard deviation, which is 8.
• The sample size (n) is 100. Therefore √n = 10.
• Therefore 1.96(σ/√n) = 1.96 x (8 / 10) = 1.96 x 0.8 = 1.568
• Therefore there is a 95% likelihood that the sample mean that we
have found is within (about) 1.57 hours of the actual mean.
• And so we can say with 95% confidence that the population’s
mean weekly hours of work will fall somewhere between 34
minus 1.57 and 34 plus 1.57.
• A 95% confidence interval of 32.43 to 35.57 hours per week.
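The same arithmetic in Python (a sketch using only the standard library; the figures are those of the example above):

```python
# Sketch: the 95% confidence interval from the worked example
# (mean 34 hours, standard deviation 8, n = 100).
import math

mean, s, n = 34.0, 8.0, 100
margin = 1.96 * s / math.sqrt(n)      # 1.96 x 0.8 = 1.568

print(mean - margin, mean + margin)   # 32.432 to 35.568
```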
Why 95%?
• A confidence interval need not be 95%.
• However, this is the generally accepted level for statistical testing.
  It is considered that errors occurring only 5% (or 1 in 20) of the time
  are acceptable. Furthermore, a higher value can produce confidence
  intervals that may be viewed as too wide (producing an
  unacceptable risk of Type II errors – discussed next week).
• However for some purposes a more cautious approach may be
necessary.
• For instance, if you were an antiquarian librarian sampling over
  time the humidity in your rare book storage facility, you might
  want to be confident, at a 99.9% level at least, that the average
  humidity level was neither destructively high nor low! In this case
  you would construct a 99.9% confidence interval (where only
  0.1% of cases fell outside of the range). You could use the
  normal distribution to do this, in a similar fashion to the way in
  which we used it to work out that the 95% confidence level
  relates to plus or minus 1.96 standard errors.
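The 99.9% multiplier that replaces 1.96 can be found the same way (a sketch assuming scipy):

```python
# Sketch: the multiplier for a 99.9% confidence interval. 0.1% of cases
# fall outside the range, i.e. 0.05% in each tail, so we ask for the
# 99.95th percentile of the standard normal distribution.
from scipy import stats

print(stats.norm.ppf(0.9995))   # about 3.29, replacing 1.96
```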
Note: Small samples continued…
• The procedure for producing 95% confidence intervals remains very
  similar to the one for larger sample sizes (i.e. the one using the
  ‘normal distribution’, which might just as well be referred to as the
  z-distribution), as does the test to see whether a suggested
  population mean is plausible.
• The only difference is that the ‘magic number’ 1.96 is replaced by a
slightly larger number, the magnitude of which gets bigger as the
sample size gets smaller.
• Thus, for a sample size of 25, 1.96 is replaced by 2.06 and, for a
  sample size of 15, by 2.14. (You can sometimes find a table of
  values for the t-distribution at the back of a statistics textbook.)
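These replacement values come from the t-distribution with n - 1 degrees of freedom, and can be checked in Python (a sketch assuming scipy):

```python
# Sketch: the t-distribution values that replace 1.96 for small samples
# (95% two-tailed, i.e. the 97.5th percentile, with n - 1 degrees of freedom).
from scipy import stats

for n in (25, 15):
    print(n, stats.t.ppf(0.975, df=n - 1))   # 2.06 for n=25, 2.14 for n=15

# As n grows the value approaches the normal distribution's 1.96:
print(stats.t.ppf(0.975, df=1000))           # 1.962...
```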
However, another problem arises with small samples: the distribution
of sample means can be asymmetric. In fact, the assumption that the
sample mean has a t-distribution is only reasonable for small
samples if the distribution of the variable under consideration
approximates the normal distribution.