Class 2 Estimation and Hypothesis Testing

When a parameter (e.g., the average) of a population is estimated using a sample of data, the estimated value will vary depending on the particular sample chosen. Sampling variation, or more formally the sampling distribution, of the estimated parameter gives us a frame of reference for how accurate the estimate is likely to be. If we repeatedly sample and our estimated parameter does not change much, then we can be confident that the estimate from just one sample is likely to be accurate. On the other hand, if our estimated parameter changes markedly across different samples of data, then we cannot be confident that the estimate from just one sample is accurate.

Whenever we report an estimated value (e.g., the average of a sample of data), we must provide our degree of confidence about the accuracy of the estimate. Typically, in reports, you will see an estimate accompanied by a confidence interval.

However, we typically will not have the resources to sample repeatedly in order to obtain the sampling variation of our estimated value and report a confidence interval. Statistical theory can help us compute the confidence interval, so we don't need to resort to repeated sampling.

The sampling distributions of various statistics (e.g., mean, percentage, median, standard deviation, etc.) are different, and they require different statistical theory to derive. Below we will focus on the sampling distribution of the mean. That is, whenever we are computing averages, we can use a formula to compute the confidence interval based on just one sample of data.

Central Limit Theorem (CLT)

The sampling distribution of the mean of independently drawn observations will be approximately normally distributed when the sample is reasonably large, even if the distribution from which the sample is drawn is not normal.
(Check internet sources or other references for descriptions of the central limit theorem.)

A simulation can be conducted to illustrate the CLT. The data set, Literacy2.csv, contains 27598 students' reading test scores. The following shows the histogram of the reading scores:

[Figure: histogram of the reading scores]

This histogram looks quite skewed (i.e., not normally distributed).

Compute the mean and standard deviation of the reading scores:

Mean:_______________ Standard deviation:_________________

If we sample from this population and compute the mean of the sample, we will not get exactly the population mean. The mean values will vary across different samples.

Select one random sample of 100 students. You can do this using the R function "sample". Compute the mean of this sample. Repeat this a few times by drawing a few more samples, and see how the mean values vary from sample to sample.

Next, obtain the sampling distribution of the sample mean by drawing a random sample of 100 students 2000 times, using the R code provided. Graph the sampling distribution. You should get a histogram like this one (replace my picture with yours):

[Figure: histogram of the 2000 sample means]

Now this histogram looks normally distributed!

Compute the mean and standard deviation of the 2000 sample means:

(A) Mean of the 2000 sample means:____________
(B) Standard deviation of the 2000 sample means:____________

The standard deviation of the 2000 sample means is called the standard error.

Given that the sampling distribution of the sample means is approximately normally distributed, we can compute a confidence interval based on the normal distribution. That is, for normal distributions, about 95% of the observations lie between mean ± 1.96 × standard deviation.
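The handout's R code and the Literacy2.csv data are not reproduced here, so the following is only a minimal Python sketch of the same kind of simulation, run on a made-up skewed population. The population, the random seeds, and the helper name `simulate_sample_means` are assumptions for illustration; the numbers it prints will not match those for the reading scores.

```python
import random
import statistics


def simulate_sample_means(population, sample_size=100, n_samples=2000, seed=1):
    """Draw n_samples random samples (without replacement) and return each sample's mean."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]


# Hypothetical skewed stand-in population (NOT the Literacy2.csv reading scores):
# exponential values with mean 6, capped at 30.
pop_rng = random.Random(0)
population = [min(30.0, pop_rng.expovariate(1 / 6.0)) for _ in range(27598)]

pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

sample_means = simulate_sample_means(population)
mean_of_means = statistics.fmean(sample_means)    # corresponds to (A)
standard_error = statistics.stdev(sample_means)   # corresponds to (B)

# Standard error predicted by theory: population sd divided by sqrt(n), n = 100.
theoretical_se = pop_sd / 100 ** 0.5

# Share of the simulated sample means inside mean +/- 1.96 standard errors.
low = mean_of_means - 1.96 * standard_error
high = mean_of_means + 1.96 * standard_error
share_inside = sum(low <= m <= high for m in sample_means) / len(sample_means)

print(f"population mean {pop_mean:.2f}, sd {pop_sd:.2f}")
print(f"(A) mean of sample means {mean_of_means:.2f}")
print(f"(B) standard error {standard_error:.3f} (theory: {theoretical_se:.3f})")
print(f"share of sample means within ({low:.2f}, {high:.2f}): {share_inside:.3f}")
```

Even though the stand-in population is strongly skewed, the share of sample means falling inside the interval should come out close to 95%, which is the behaviour the central limit theorem describes.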
In our case, about 95% of the sample mean values should lie between ____________ and _____________ (work out the two values using (A) and (B) above).

Formula for computing the standard error

In practice, we can use the result derived from statistical theory that the standard error of the mean is approximately $\sigma/\sqrt{n}$, where $\sigma$ is the population standard deviation and $n$ is the sample size. In our case, $n$ is 100 and $\sigma$ is 5.8, so using this formula, the standard error should be about 0.58. How does this compare with what you obtained in (B) above?

In real life, we don't know the value of $\sigma$. But, scanning over the standard deviations of samples of around 100 observations, you will find that the sample standard deviation of 100 observations is a good estimate of the population standard deviation.

In practice, how to compute the confidence interval of the sample mean

(1) Draw a sample of size $n$.
(2) Compute the sample mean ($\bar{X}$).
(3) Compute the sample standard deviation ($\hat{\sigma}$).
(4) Compute the 95% confidence interval of the true mean using $\bar{X} \pm 1.96\,\hat{\sigma}/\sqrt{n}$.

In one sentence, describe the meaning of a statement like the following:

The estimated mean of height is 174cm ± 29cm (95% confidence interval)

____________________________________________________________________
____________________________________________________________________
____________________________________________________________________

General process of making inference about a statistic

(1) Establish the sampling distribution of the statistic to assess the variability of the statistic. For example, if we are interested in the mean reading score of students in Taipei, we take a sample and compute the sample mean. Because this sample mean is not the population mean, the value of the sample mean is likely to vary if different samples are drawn. We need to find out how large the variation is. If the variation is large, then our estimate probably does not represent the population mean very accurately.
If the variation is small, then our estimate is probably quite close to the population mean.

(2) We can repeatedly sample to establish the sampling distribution of the statistic of interest. But this may be impractical, as it will be too costly. We can also establish sampling distributions theoretically.

Making inferences about the mean

We can use the central limit theorem to establish the sampling distribution of the sample mean without doing repeated sampling. The central limit theorem says that the mean of independently drawn observations will be approximately normally distributed, even if the distribution from which the sample is drawn is not normal.

Further, it can be shown that mean values computed from samples of size $n$ follow a normal distribution with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$ (known as the standard error), where $\mu$ is the mean of the distribution we draw our samples from, and $\sigma$ is the standard deviation of the distribution we draw our samples from.

That is, if $\bar{X}$ denotes the sample mean, then

$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$

has a standard normal distribution with mean zero and standard deviation 1 (z-score).

For a standard normal distribution, 95% of the observations lie within $\pm 1.96$. With a little re-arrangement of the equation, it can be shown that 95% of the time, or, we are 95% confident that,

$$\bar{X} - 1.96\,\frac{\sigma}{\sqrt{n}} \;\le\; \mu \;\le\; \bar{X} + 1.96\,\frac{\sigma}{\sqrt{n}}$$

(Loosely speaking, there is a 95% chance that the population mean lies within the range shown above. More precisely, intervals constructed this way contain the population mean for about 95% of the samples we could draw.)

Hypothesis Testing

Hypothesis testing is about using data to make (statistical) conclusions about a hypothesis. For example, suppose I have a hypothesis that the population mean of students' reading scores is 17 out of 30:

$$H_0: \mu = 17$$

I draw a sample, say, of 100 students. The sample mean and standard deviation of my sample are 18.4 and 5.8 respectively. The 95% confidence interval for the mean is

$$18.4 \pm 1.96 \times \frac{5.8}{\sqrt{100}} = (17.3,\ 19.5)$$

The 95% confidence interval of (17.3, 19.5) means that, based on our sample, there is a 95% chance that the true mean lies between 17.3 and 19.5.
There is a 5% chance that the true mean lies outside this interval. As the hypothesised mean value, 17, is outside this confidence interval, we conclude that we reject the null hypothesis at the 95% confidence level. This is sometimes also expressed as rejecting at the level of p=0.05. It means that, if the null hypothesis were true, there would be a 5% chance of incorrectly rejecting it.

More generally, we make inferences from our sample about the likelihood of population parameters, and we draw conclusions about the hypothesis based on our inference.

Sample size and hypothesis testing

Now, draw a sample of 10 from our reading score data. Test the hypothesis that

$$H_0: \mu = 17$$

What is the confidence interval in this case?

95% confidence interval of the mean is between ________________ and _____________.

What is your conclusion about the hypothesis? Reject or Accept?

Next, draw a sample of 20, and then 50, and then 200. Do you see a difference in whether you accept or reject the null hypothesis at p=0.05?

Sample of 20: Reject or Accept?
Sample of 50: Reject or Accept?
Sample of 200: Reject or Accept?

What if you use p=0.1 (a 90% confidence interval; for a normal distribution, 90% of the sample means lie within $\pm 1.64\,\sigma/\sqrt{n}$)? Would you change your decision of rejecting or accepting the hypothesis?

Make a table below:

Sample size    Reject or Accept $H_0: \mu = 17$    Reject or Accept $H_0: \mu = 17$
               at p=0.05                           at p=0.1
10
50
100
200

Given that we know that the true population mean is 18.98 (which, in real life, we will not know), what do you think about your conclusions in the above table?

What if the hypothesis is $H_0: \mu = 18$? Could you reject this hypothesis? What sample size would you use to reject this hypothesis?

A cartoon in Darrell Huff's book "How to Lie with Statistics" depicted one person saying "I want to know the truth", and another person replying "it ain't statistics". What is your assessment of statistical hypothesis testing in relation to this cartoon? What DOES statistics tell you?
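The confidence-interval test used in this section can be sketched in a few lines of Python. This is only an illustration: the function names `mean_confidence_interval` and `rejects_null` are made up here, and the loop holds the sample mean (18.4) and sample standard deviation (5.8) fixed while changing n, which is an assumption made to isolate the effect of sample size. Real samples of different sizes will each have their own mean and standard deviation.

```python
import math


def mean_confidence_interval(sample_mean, sample_sd, n, z=1.96):
    """Return (low, high) for sample_mean +/- z * sample_sd / sqrt(n)."""
    half_width = z * sample_sd / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


def rejects_null(mu0, sample_mean, sample_sd, n, z=1.96):
    """Reject H0: mu = mu0 when mu0 falls outside the confidence interval."""
    low, high = mean_confidence_interval(sample_mean, sample_sd, n, z)
    return not (low <= mu0 <= high)


# Worked example from the text: n = 100, mean 18.4, sd 5.8, H0: mu = 17.
low, high = mean_confidence_interval(18.4, 5.8, 100)
print(f"95% CI: ({low:.1f}, {high:.1f})")

# Illustrative assumption: same mean and sd at every sample size.
for n in (10, 20, 50, 100, 200):
    for label, z in (("p=0.05", 1.96), ("p=0.1", 1.64)):
        lo_n, hi_n = mean_confidence_interval(18.4, 5.8, n, z)
        decision = "reject" if rejects_null(17, 18.4, 5.8, n, z) else "do not reject"
        print(f"n={n:3d} {label}: ({lo_n:.2f}, {hi_n:.2f}) -> {decision}")
```

The interval narrows in proportion to $1/\sqrt{n}$, so the same observed difference between the sample mean and the hypothesised value can lead to different decisions at different sample sizes.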
Some discussion points:

(1) For a population of people, height is normally distributed with a mean of 170 cm and a standard deviation of 12 cm. Dave has a height of 196 cm. Could Dave be from this population of people?

(2) In a region, the number of rainy days per year is approximately normally distributed, with a mean of 85 days and a standard deviation of 15 days (the distribution was established by collecting 200 years of data). One year, the number of rainy days was 120. Was this year an 'abnormal' year? If so, and if we look at the 200 years of data, what percentage of years do you think will be 'abnormal'? But the 200 years of data have been used to establish the 'norm', so how can any particular year be 'abnormal'?
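As a starting point for these discussions, each value can be converted to a z-score and the corresponding tail probability of the normal distribution looked up. The sketch below uses Python's standard-library `statistics.NormalDist`; the helper names `z_score` and `two_sided_tail` are made up for illustration. It only produces the numbers. Whether a given probability is small enough to call something 'abnormal' is the judgment the questions ask you to discuss.

```python
from statistics import NormalDist


def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd


def two_sided_tail(z):
    """Probability that a standard normal value is at least as far from 0 as z."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


# (1) Dave's height against a population with mean 170 cm and sd 12 cm.
dave_z = z_score(196, 170, 12)

# (2) A year with 120 rainy days against mean 85 days and sd 15 days.
rain_z = z_score(120, 85, 15)

for label, z in (("Dave's height", dave_z), ("120 rainy days", rain_z)):
    print(f"{label}: z = {z:.2f}, two-sided tail probability = {two_sided_tail(z):.3f}")
```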