DTC Quantitative Research Methods
Statistical Inference I: Sampling distributions
Thursday 30th October 2014

Outline
• Inference
• Confidence intervals
• Sampling distributions
• The normal distribution and z-scores
• Working out confidence intervals

What is inference?
• Most of the time we care about the attributes of a population: adults in the UK, women workers, small businesses…
• But we usually only study a sample of the population.
• Inferential statistics give you the tools to infer population characteristics from the sample.
• Inferential statistics usually assume a random sample. This is why it is so important to use methods of random sampling whenever possible.
• Instead of, say, reporting that 35% of our sample have some characteristic, inferential statistics let us estimate, or infer, the proportion of the population that is likely to have that characteristic.
• In order to do this we use confidence intervals.

What is a ‘Confidence Interval’?
• A ‘Confidence Interval’ for a particular sample statistic (e.g. the mean) is a range of values around the statistic that is believed to contain, with a certain level of probability (often 95%), the ‘true’ value of that statistic (i.e. the population value).
• For example, suppose we see a report that 37% of people (plus or minus 3%) intend to vote Labour. What is being said is that the pollsters are reasonably confident that the true percentage of people who intend to vote Labour is between 34% and 40%. If they have not said otherwise, it is very likely that this is a 95% confidence interval.

How do we arrive at a confidence interval?
• How do we judge how big a confidence interval should be (plus or minus 2% or 5% or 15%...)?
• What does it mean to be 95% certain that it is the size that we say it is?
• And how do we know that the results we got in our sample of the population are not just a quirk of our particular sample (or ‘sampling error’)?
• Part of the answer to these questions can be seen in common-sense assessments…

Example: Judging whether differences occur by chance…
How do we judge whether it is plausible that two population means are the same and that any difference between them simply reflects sampling error?
Example: Household size of minority ethnic groups (HOH = Head of household; data adapted from 1991 Census)

1. The size of the difference between the two sample means
Indian HOH:       Mean 3.0
Bangladeshi HOH:  Mean 5.0

Indian HOH:       Mean 3.0
Pakistani HOH:    Mean 4.0
The first difference is more ‘convincing’.

Judging whether differences occur by chance…
2. The sample sizes of the two samples
Pakistani HOH:    3 4 5                  Mean 4.0
Bangladeshi HOH:  4 5 6                  Mean 5.0

Pakistani HOH:    2 2 3 4 4 4 5 5 5 6    Mean 4.0
Bangladeshi HOH:  2 3 4 4 5 5 6 6 7 8    Mean 5.0
The second difference is more ‘convincing’.

Judging whether differences occur by chance…
3. The amount of variation in each of the two groups (samples)
Pakistani HOH:    2 2 3 4 4 4 5 5 5 6    Mean 4.0
Bangladeshi HOH:  2 3 4 4 5 5 6 6 7 8    Mean 5.0

Pakistani HOH:    4 4 4 4 4 4 4 4 4 4    Mean 4.0
Bangladeshi HOH:  5 5 5 5 5 5 5 5 5 5    Mean 5.0
The second difference is more ‘convincing’.

Example continued: the impact of variability on a difference in means.
The three graphs each show two groups with the same mean difference. However, the groups in each of the three graphs have different levels of variability. Where there is lower variability there is less cross-over between the groups, and so the difference between the means expresses a more pertinent difference (there is almost no one in group A with the same score as anyone in group B).
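The household-size comparison above can be checked numerically. The following is a minimal sketch using Python's standard `statistics` module; the variable names and the `summarise` helper are illustrative choices, not part of the original slides. It reports the sample size, mean and sample standard deviation of each group, the three quantities the slides identify as mattering.

```python
from statistics import mean, stdev

# Household-size samples copied from the slides above.
pakistani_varied = [2, 2, 3, 4, 4, 4, 5, 5, 5, 6]
bangladeshi_varied = [2, 3, 4, 4, 5, 5, 6, 6, 7, 8]
pakistani_uniform = [4] * 10
bangladeshi_uniform = [5] * 10


def summarise(label, values):
    """Return (label, n, mean, sample standard deviation) for one group."""
    return label, len(values), mean(values), stdev(values)


rows = [
    summarise("Pakistani (varied)", pakistani_varied),
    summarise("Bangladeshi (varied)", bangladeshi_varied),
    summarise("Pakistani (uniform)", pakistani_uniform),
    summarise("Bangladeshi (uniform)", bangladeshi_uniform),
]

# Both pairs differ by one person on average, but the spread within
# the groups is very different.
for label, n, m, s in rows:
    print(f"{label:<22} n={n:>2}  mean={m:.1f}  sd={s:.2f}")
```

Both pairs of groups have the same difference in means; only the standard deviations differ, which is why the second pair looks more ‘convincing’.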
Judging whether differences occur by chance…
As we’ll see, three things are critical to our determination of whether a difference we observe in a sample (or between samples) is likely to represent a real difference in the population (or between populations): the size of the difference between the means, the sample size, and the amount of variation (measured by the standard deviation) within the sample(s).

So, what is the relation of the sample to the population?
• If the sample is a random sample of the population, it may sometimes have a large number of extremely high values (for example: very happy people).
• And sometimes it may have a large number of extremely low values (for example: very sad people).
• But over the long run (if we kept on taking a sample, and then putting it back and taking another one), we would expect that most of the samples would represent the population fairly well (for example: with a mean happiness that corresponds fairly closely to the mean happiness of the population).

Sampling Distributions
• The distribution of a statistic (such as the mean) across the different possible samples that could be taken from a population is known as a sampling distribution.
• The more we understand about this distribution the better, because it will help us to work out the likely relationship of our particular sample to the population.
• What we find is that as more and more samples are taken, the average (i.e. mean) of the sample means tends to equal the mean of the population.
• The sampling distribution of means also looks like a normal distribution (Central Limit Theorem).
• However, the sampling distribution of means is less varied than the population.
• See sampling distribution simulation at: http://onlinestatbook.com/stat_sim/

[Figure: Sampling from a Population: Sample Means (from Field, 2005).]
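The behaviour of the sampling distribution described above can also be explored with a small simulation. This is a sketch under stated assumptions: the population is hypothetical (a deliberately skewed set of made-up values), and the sample size of 50, the 2,000 repetitions and the function name `sample_means` are arbitrary illustrative choices.

```python
import random
from statistics import mean, pstdev, stdev

random.seed(2014)  # fixed seed so the sketch is repeatable

# Hypothetical, deliberately skewed population of 10,000 values.
population = [
    random.choice([1, 1, 2, 2, 2, 3, 3, 4, 5, 8]) for _ in range(10_000)
]


def sample_means(pop, sample_size, n_samples):
    """Draw n_samples simple random samples and return the mean of each."""
    return [mean(random.sample(pop, sample_size)) for _ in range(n_samples)]


means = sample_means(population, sample_size=50, n_samples=2_000)

pop_mean = mean(population)
pop_sd = pstdev(population)  # population standard deviation

print(f"population mean        : {pop_mean:.3f}")
print(f"mean of sample means   : {mean(means):.3f}")
print(f"population SD          : {pop_sd:.3f}")
print(f"SD of sample means     : {stdev(means):.3f}")
print(f"population SD / sqrt(n): {pop_sd / 50 ** 0.5:.3f}")
```

The mean of the sample means should sit close to the population mean, and the spread of the sample means should be much smaller than the spread of the population, roughly the population standard deviation divided by the square root of the sample size.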
The formal theorem
“If repeated (simple random) samples of size N are drawn from a normally distributed population, the means of such samples will be normally distributed with mean $\mu$ and standard error [i.e. standard deviation] $\sigma/\sqrt{N}$... if the N of each sample drawn is large, then regardless of the shape of the population distribution the sample means will tend to distribute themselves normally with mean $\mu$ and standard error $\sigma/\sqrt{N}$”.
$\mu$ = population mean
$\sigma$ = population standard deviation
N = number in sample

So where does this get us…???
• Well, we know that over the long run the mean of our samples is likely to end up as the population mean.
• We know that over the long run (when the sample is ‘large’ enough) the distribution of sample means looks normal. [Note: A “large sample” is sometimes considered to be one of size 30+, but a size of 100+ can more ‘safely’ be viewed as adequately large.]
• And we know that the variation in the sample means, known as the standard error, is (more or less) $\sigma/\sqrt{N}$.
• Although we usually only have a single sample, this information means we can work out a fairly reliable estimate of the population mean by combining the sample with what we know about normal distributions.

What’s so special about the Normal Curve?
• The normal curve is a symmetrical distribution of scores with an equal number of scores above and below the midpoint of the abscissa (the horizontal axis, or ‘x-axis’, for the curve).
• Since the distribution of scores is symmetric, the mean, median, and mode are all at the same point on the abscissa. In other words, the mean = the median = the mode.
• If we divide the distribution up into standard deviation units, a known proportion of scores lies within each portion under the curve.
[Figure: normal curve; 34% of cases are between the mean and one SD away.]
• From published or online tables, we can find the proportion of scores above and below any point on the abscissa, expressed in standard deviation units.
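The tabulated proportions can also be computed directly rather than looked up. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `proportion_between` is an illustrative assumption, not something from the slides.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def proportion_between(lower, upper):
    """Proportion of scores lying between two points given in SD units."""
    return standard_normal.cdf(upper) - standard_normal.cdf(lower)


def proportion_below(point):
    """Proportion of scores lying below a point given in SD units."""
    return standard_normal.cdf(point)


print(f"between the mean and +1 SD: {proportion_between(0, 1):.4f}")
print(f"within 1 SD of the mean   : {proportion_between(-1, 1):.4f}")
print(f"within 2 SD of the mean   : {proportion_between(-2, 2):.4f}")
print(f"below +1 SD               : {proportion_below(1):.4f}")
```

The first figure is the roughly 34% quoted on the slide; by symmetry, the proportion within one standard deviation on either side of the mean is twice that.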
Scores expressed in standard deviation units are referred to as z-scores.

z-scores
z-scores can be calculated for any value. They are a means of standardizing values that are measured on different scales by showing these values just in terms of the number of standard deviations away from the mean they fall.
z-scores are calculated by subtracting the mean from any value and dividing the result by the standard deviation:
$$z = \frac{x - \text{mean}}{s}$$
z-scores will always have a mean of 0 and a standard deviation of 1.
We can quickly see that this is true of the mean, since when x = mean, the numerator (top bit!) will equal 0, and therefore z must = 0.
It may be a little less clear that it is true of the standard deviation. However, think about the instance when x is one standard deviation bigger than the mean (i.e. x = mean + s):
$$z = \frac{(\text{mean} + s) - \text{mean}}{s} = \frac{s}{s} = 1$$

Finding the 95% point on a normal distribution…
• From the table of z-scores we can see that when z = 1.96 (sometimes simplified to z = 2) the p-value, which represents the probability of being in the larger area (to the left), is 0.975.
• Therefore the area under one (small) tail of the curve is p = 0.025.
• This means that scores greater than z = 1.96 occur just 2.5% of the time.
• Further (because the normal curve is symmetric) we can calculate that the area under both tails (beyond z = 1.96 and z = -1.96) is 0.05.
• In other words, 95% of the area is in the middle, between z = -1.96 and z = 1.96.
• And scores further from the mean than 1.96 standard deviations thus occur only 5% of the time.
[Figures: normal curve with 97.5% of the area to the left of z = 1.96 and 2.5% in the tail to the right; normal curve with 95% of the area between z = -1.96 and z = 1.96.]

Note: What happens if the sample size is too small for one to safely assume that the sample mean has a Normal distribution?
• When a sample is small (i.e. less than about 25) the assumption that the sample mean is normally distributed may not be reasonable.
• In fact, regardless of sample size, the (standardized) sample mean can be treated as having a t-distribution; the precise shape of a t-distribution depends on the sample size, and for moderate-to-large sample sizes the t-distribution is very similar to the Normal distribution (and, as the sample size approaches infinity, eventually converges with it).

Combining the 95% point with what we know about the sampling distribution:
• 95% of cases lie within +/- 1.96 standard deviations of the mean in a normal distribution.
• The distribution of sample means is normal.
• And the standard error of sample means is approximately $\sigma/\sqrt{n}$.
[Figure: frequency distribution of sample means centred on $\mu$ (the population mean); 95% of sample means lie within $1.96\,\sigma/\sqrt{n}$ on either side of $\mu$, with 2.5% of sample means in each tail.]
Therefore 95% of sample means fall into the range: $\mu - 1.96(\sigma/\sqrt{n})$ to $\mu + 1.96(\sigma/\sqrt{n})$

Example
• If we take a sample of 100 people and find that they work a mean of 34 hours per week with a standard deviation of 8 hours, how do we construct a 95% confidence interval for the mean number of hours worked by the population?
• We know that 95% of sample means fall in the range: $\mu - 1.96(\sigma/\sqrt{n})$ to $\mu + 1.96(\sigma/\sqrt{n})$
• We estimate $\sigma$ using the sample standard deviation, which is 8.
• The sample size (n) is 100. Therefore $\sqrt{n} = 10$.
• Therefore $1.96(\sigma/\sqrt{n}) = 1.96 \times (8/10) = 1.96 \times 0.8 = 1.568$
• Therefore there is a 95% likelihood that the sample mean that we have found is within (about) 1.57 hours of the actual mean.
• And so we can say with 95% confidence that the population’s mean weekly hours of work will fall somewhere between 34 minus 1.57 and 34 plus 1.57.
• A 95% confidence interval of 32.43 to 35.57 hours per week.

Why 95%?
• A confidence interval need not be 95%.
• However, this is the generally accepted level for statistical testing. It is considered that errors occurring only 5% of the time (1 time in 20) are acceptable. Furthermore, a higher value can produce confidence intervals that may be viewed as too wide (producing an unacceptable risk of Type II errors, discussed next week).
• However, for some purposes a more cautious approach may be necessary.
• For instance, if you were an antiquarian librarian sampling over time the humidity in your rare book storage facility, you might want to be confident that the average humidity level was neither destructively high nor destructively low at a 99.9% level at least! In this case you would construct a 99.9% confidence interval (where only 0.1% of cases fell outside of the range). You could use the normal distribution to do this, in a similar fashion to the way in which we used it to work out that the 95% confidence level relates to plus or minus 1.96 standard errors.

Note: Small samples continued…
• The procedure for producing 95% confidence intervals remains very similar to the one for larger sample sizes (i.e. the one using the ‘normal distribution’, which might just as well be referred to as the z-distribution), as does the test to see whether a suggested population mean is plausible.
• The only difference is that the ‘magic number’ 1.96 is replaced by a slightly larger number, the magnitude of which gets bigger as the sample size gets smaller.
• Thus, for a sample size of 25, 1.96 is replaced by 2.06 and, for a sample size of 15, by 2.14. (You can sometimes find a table of values for the t-distribution at the back of a statistics textbook.)
• However, another problem arises with small samples: the distribution of sample means can be asymmetric. In fact, the assumption that the sample mean has a t-distribution is only reasonable for small samples if the distribution of the variable under consideration approximates the normal distribution.
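Returning to the large-sample procedure, the working-hours example (a sample of 100 people, mean 34 hours, standard deviation 8 hours) and the 99.9% interval mentioned for the librarian can both be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the function name `mean_confidence_interval` and its parameters are assumptions made for the example, and it uses the normal distribution only, so it is not appropriate for the small samples discussed in the note above (where the multiplier would come from a t-table instead).

```python
from math import sqrt
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sample_sd, n, level=0.95):
    """Large-sample (normal-based) confidence interval for a population mean.

    Assumes n is large enough for the sample mean to be treated as
    normally distributed, with sample_sd used as the estimate of sigma.
    """
    # Two-tailed multiplier: about 1.96 standard errors when level=0.95.
    multiplier = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = multiplier * sample_sd / sqrt(n)  # multiplier x standard error
    return sample_mean - margin, sample_mean + margin


# Worked example from the slides: n = 100, mean = 34 hours, SD = 8 hours.
low, high = mean_confidence_interval(34, 8, 100)
print(f"95% interval  : {low:.2f} to {high:.2f} hours per week")

# A more cautious 99.9% interval for the same sample is wider.
low999, high999 = mean_confidence_interval(34, 8, 100, level=0.999)
print(f"99.9% interval: {low999:.2f} to {high999:.2f} hours per week")
```

With the default 95% level this reproduces the interval of 32.43 to 35.57 hours worked out by hand; raising the level widens the interval, which is the trade-off discussed under ‘Why 95%?’.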