Session 7: Standard errors, Estimation and Confidence Intervals

Learning Objectives
By the end of this session, you will be able to:
explain what is meant by an estimate of a population parameter, and its standard error;
explain the meaning of a confidence interval;
calculate a confidence interval for the population mean using sample data, and state the assumptions underlying this calculation.

Reminder: What is inference?
Inference is about drawing conclusions concerning population characteristics using information gathered from the sample.
It is assumed that the sample is representative of the population.
A further assumption is that the sample has been drawn as a simple random sample from an infinite population.

Estimation
                  Population    Sample
Mean              $\mu$         $\bar{x}$
Variance          $\sigma^2$    $s^2$
Std. deviation    $\sigma$      $s$
Population characteristics (parameters) are denoted by Greek letters, sample values by Latin letters.
Sample characteristics are measurable and form estimates of the population values.

Example of statistical inference
What is the mean number of persons per household in Mukono district?
Data from 80 households surveyed in this district gave a mean household size of 5.6 with a standard deviation of 3.30.
Our best estimate of the mean household size in Mukono district is therefore 5.6.
What results are likely if we sampled again with a different set of households?

Example using Stata
Open the Stata file UNHS_hh&poverty.dta.
The numeric code for Mukono is 109.

Use the summarize dialogue
Type db summarize, or use the menu: Statistics > Summaries, tables > Summaries > Summary Statistics.
Variable: hhsize
Then use the by/if/in tab; the condition is dist == 109.

Results
Summaries for whole sample
Summaries for Mukono only

The distribution of means
Suppose 10 university students were given a standard meal and the time taken to consume the meal was recorded for each.
Suppose the 10 values gave: mean = 11.24, with std. dev. = 0.864.
Let’s assume this exercise was repeated 50 times with different samples of students.
A histogram of the resulting 500 observations appears below, followed by a histogram of the 50 means, one from each sample.

Histogram of raw data
The data appear to follow a normal distribution.

Histogram of 50 sample means
The distribution of the sample means is called the sampling distribution of the mean.
Notice that the variability of this distribution is smaller than the variability of the raw data.

Back to estimation…
The estimate of the mean household size in Mukono district was 5.6. Is this sufficient for reporting purposes, given that this answer is based on one particular sample?
What we have is an estimate based on a sample of size 80. But how good is this estimate?
We need a measure of the precision, i.e. variability, of this estimate…

Sampling Variability
The accuracy of the sample mean $\bar{x}$ as an estimate of $\mu$ depends on:
(i) the sample size ($n$), since the more data we collect, the more we know about the population, and
(ii) the inherent variability in the data, $\sigma^2$.
These two quantities must enter the measure of precision of any estimate of a population parameter.
We aim for high precision, i.e. low standard error!

Standard error of the mean
The precision of $\bar{x}$ as an estimate of $\mu$ is given by the standard error of the mean:
$\text{s.e.}(\bar{x}) = \sigma / \sqrt{n}$
Also written as s.e.m., or sometimes s.e.
It is estimated using the sample data: $s / \sqrt{n}$.
For the example on household size, s.e. $= 3.298/\sqrt{80} = 3.298/8.944 = 0.369$.

Confidence Interval for $\mu$
Instead of using a point estimate, it is usually more informative to summarise using an interval which is likely (i.e. with 95% confidence) to contain $\mu$.
This is called an interval estimate or a Confidence Interval (C.I.).
For example, we could report that the mean household size in Mukono district is 5.6 with 95% confidence interval (4.87, 6.33), i.e. we can be 95% confident that the interval (4.87, 6.33) includes the true value $\mu$.
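The meal-time experiment and the standard-error formula can be tried out with a short simulation. This is only a sketch: it assumes the meal times come from a normal population whose mean and standard deviation equal the values quoted for the first sample (11.24 and 0.864), which the slides do not state, and the function names `standard_error` and `simulate_meal_times` are made up for illustration.

```python
import math
import random
import statistics


def standard_error(sd, n):
    """Estimated standard error of the mean: s / sqrt(n)."""
    return sd / math.sqrt(n)


def simulate_meal_times(n_samples=50, sample_size=10,
                        mu=11.24, sigma=0.864, seed=1):
    """Draw repeated samples; return (all raw observations, sample means).

    ASSUMPTION: the population is normal with mean `mu` and s.d. `sigma`
    set to the values from the first sample. The slides do not say this.
    """
    rng = random.Random(seed)
    raw, means = [], []
    for _ in range(n_samples):
        sample = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        raw.extend(sample)
        means.append(statistics.mean(sample))
    return raw, means


raw, means = simulate_meal_times()
print("s.d. of the 500 raw observations:", round(statistics.stdev(raw), 3))
print("s.d. of the 50 sample means:     ", round(statistics.stdev(means), 3))
print("sigma / sqrt(n) with n = 10:     ", round(standard_error(0.864, 10), 3))
print("s.e. for the household example:  ", round(standard_error(3.298, 80), 3))
```

The spread of the 50 means should come out much smaller than the spread of the raw data, and close to $\sigma/\sqrt{n}$, which is what the standard error measures.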
Analysis using Stata
Type db ci, or use the menu.
Use the by/if/in tab as before.

Results
For whole sample
Just for Mukono

Finding the Confidence Interval
The 95% confidence limits for $\mu$ (lower and upper) are calculated as:
$\bar{x} - t_{n-1}\,(s/\sqrt{n})$ and $\bar{x} + t_{n-1}\,(s/\sqrt{n})$
where $t_{n-1}$ is the 5% level for the t-distribution with $(n-1)$ degrees of freedom, i.e. the value that leaves 2½% in each tail, below $-t$ and above $t$.
Statistical tables and statistical software give t-values.

t-values for finding 95% C.I.
The interval is $\bar{x} \pm t_{n-1}\,(s/\sqrt{n})$, with 2½% of the t-distribution in each tail. Use the P = 5 column for a 95% C.I.

d.f.        P = 10    P = 5    P = 2
1           6.31      12.7     31.8
2           2.92      4.30     6.96
3           2.35      3.18     4.54
4           2.13      2.78     3.75
5           2.02      2.57     3.36
6           1.94      2.45     3.14
7           1.89      2.36     3.00
8           1.86      2.31     2.90
9           1.83      2.26     2.82
10          1.81      2.23     2.76
20          1.72      2.09     2.53
30          1.70      2.04     2.46
40          1.68      2.02     2.42
60          1.67      2.00     2.39
$\infty$    1.64      1.96     2.33

Correct interpretation of C.I.s
If we sampled repeatedly and found a 95% C.I. each time, about 95% of them would include the true $\mu$, i.e. there is a 95% chance that a single interval would include $\mu$.
[Figure: plot with horizontal axis 0 to 50 and vertical axis 10 to 13]

An example (persons per HH)
For rural households (n = 40) in Mukono, we find mean = 6.43, std. dev. = 3.54 for the number of persons per household.
Hence a 95% confidence interval for the true mean number of persons per household is:
$6.43 \pm t_{39}\,(s/\sqrt{n}) = 6.43 \pm 2.02\,(3.54/\sqrt{40}) = 6.43 \pm 1.13 = (5.30, 7.56)$
Can you interpret this interval? Write down your answer. We will then discuss.

Analysis in Stata
Press Page Up to retrieve the last command.
Then add “& rurban == 0” to the condition.
Or use the menus and change the dialogue.

Underlying assumptions
The above computation of a confidence interval assumes that the data have a normal distribution.
More exactly, it requires the sampling distribution of the mean to have a normal distribution.
What happens if data are not normal?
This is not a serious problem if the sample size is large, because of the Central Limit Theorem, which states that the sampling distribution of the mean is approximately normal for large sample sizes.
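The arithmetic of the rural-household example can be checked with a few lines of Python. This is a minimal sketch, not a replacement for Stata's ci command: the t-values are copied from the 5% column of the table in the slides, and the helper `table_t_value`, which simply picks the nearest tabulated degrees of freedom, is a hypothetical convenience that becomes crude well beyond 60 degrees of freedom.

```python
import math

# Two-sided 5% points of the t-distribution, copied from the slide table.
T_5_PERCENT = {
    1: 12.7, 2: 4.30, 3: 3.18, 4: 2.78, 5: 2.57,
    6: 2.45, 7: 2.36, 8: 2.31, 9: 2.26, 10: 2.23,
    20: 2.09, 30: 2.04, 40: 2.02, 60: 2.00,
}


def table_t_value(df):
    """Hypothetical helper: t-value at the nearest tabulated d.f."""
    nearest = min(T_5_PERCENT, key=lambda tabulated: abs(tabulated - df))
    return T_5_PERCENT[nearest]


def confidence_interval_95(mean, sd, n):
    """Return (lower, upper) = mean -/+ t * sd / sqrt(n), using n - 1 d.f."""
    t = table_t_value(n - 1)
    half_width = t * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


# Rural households in Mukono: n = 40, mean = 6.43, std. dev. = 3.54
lower, upper = confidence_interval_95(6.43, 3.54, 40)
print(f"95% C.I.: ({lower:.2f}, {upper:.2f})")  # prints 95% C.I.: (5.30, 7.56)
```

For 39 degrees of freedom the nearest table entry is 40, giving 2.02, the same value used in the worked example.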
Assumptions - continued
So even when data are not normal, the formula for a 95% confidence interval will give an interval whose “confidence” is still high, approximately 95%, provided the sample size is large.
It is better to attach some measure of uncertainty than to worry about the exact confidence level.
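The claim that the confidence level stays close to 95% for non-normal data can be explored by simulation. The sketch below makes its own assumptions, none of which come from the slides: the data are drawn from an exponential distribution with true mean 1 (strongly skewed), each sample has 61 observations so that the tabulated t-value for 60 degrees of freedom (2.00) applies, and the function name `coverage_for_skewed_data` is hypothetical.

```python
import math
import random
import statistics

T_60 = 2.00  # two-sided 5% point for 60 d.f., from the slide table


def coverage_for_skewed_data(n=61, repeats=2000, seed=7):
    """Fraction of nominal 95% intervals that contain the true mean.

    ASSUMPTION: data are exponential with true mean 1, i.e. clearly
    not normal. The sample size n = 61 gives 60 degrees of freedom.
    """
    rng = random.Random(seed)
    true_mean = 1.0
    hits = 0
    for _ in range(repeats):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        mean = statistics.mean(sample)
        half_width = T_60 * statistics.stdev(sample) / math.sqrt(n)
        if mean - half_width <= true_mean <= mean + half_width:
            hits += 1
    return hits / repeats


print("Proportion of intervals containing the true mean:",
      coverage_for_skewed_data())
```

The proportion is typically a little below 0.95 for data this skewed, and it moves closer to 0.95 as the sample size grows, in line with the Central Limit Theorem.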