Interval estimates of the mean for small n, σ unknown. Estimates of sample size.
ASW, 8.2 – 8.4
Economics 224 – Notes for October 15

Interval estimates using the t distribution
• When a random sample of size n is drawn from a normally distributed population whose standard deviation σ is unknown, the sampling distribution of the sample mean, once standardized using s in place of σ, is a t distribution.
• The interval estimate has the same format as earlier, but with a t value replacing the Z value, and the sample standard deviation s replacing σ. The interval estimate of the population mean μ is
\[ \bar{x} \pm t \frac{s}{\sqrt{n}} \]
• A confidence level must be specified for each interval estimate; the confidence level and the sample size determine the t value.

t distribution
• The shape of the t distribution is similar to that of the normal distribution – bell-shaped, peaked in the centre, symmetrical about the mean, and asymptotic to the horizontal axis.
• The table of the t distribution is a standardized t distribution, that is, the t values are centred on a mean of 0 and measure distance from the centre in units of the estimated standard error \( s/\sqrt{n} \).
• There are many different t distributions, one for each degree of freedom (see explanation on a later slide).
• For small degrees of freedom, the t distribution is very dispersed. As the degrees of freedom increase, the t distribution becomes more concentrated around the mean.
• The limiting distribution for the t distribution is the normal distribution. That is, as the degrees of freedom increase, the t distribution is approximated more and more closely by the normal distribution.

The concept of degrees of freedom (df)
• “Degrees of freedom” refers to how many sample values can vary freely. In many statistical procedures, some sample values are constrained by the parameters to be estimated.
• When a t distribution describes the sampling distribution of the mean, one degree of freedom is lost since s is used as an estimate of σ (ASW, 304). In this case, the t distribution has n − 1 degrees of freedom, where n is the sample size.
• Degrees of freedom are used in the chi-square test and distribution and in regression and analysis of variance models. The type and number of constraints and degrees of freedom differ from model to model.

t table (ASW, 303 and Appendix B, Table 2)
• In ASW, the t table gives areas in the upper tail of the distribution. The t distribution is symmetric about a mean of t = 0, so the same area for the lower tail is given by the negative of the t value in the table.
• Each degree of freedom defines a different t distribution. For degrees of freedom above 100, use Z values from the standard normal distribution, since the normal distribution closely approximates the t distribution for large df.
• t values are given in the body of the table, with areas under the curve given at the top of each column, and df at the start of each row.
• Each value in the table is the t value, that is, the number of standard errors to the right of the centre of the distribution, that leaves the stated area in the right tail of the distribution.

General notation for interval estimates
• The confidence coefficient is given the symbol (1 – α). α is the first letter of the Greek alphabet and is termed “alpha.” For interval estimates, α is merely a symbol used to denote the area in the two tails of a distribution. The area in the middle of the distribution is (1 – α) or (1 – α) × 100% and there is α/2 of the area in each of the tails of the distribution.
• When a sample mean is normally distributed, the Z values for the 95% interval estimate, or the (1 – 0.05) × 100% = 95% interval, are ±1.96. That is, if α = 0.05,
\[ Z_{\alpha/2} = Z_{0.05/2} = Z_{0.025} = 1.96 \]
and the interval estimate of the population mean μ is
\[ \bar{x} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}} = \bar{x} \pm 1.96 \frac{\sigma}{\sqrt{n}} \]

Notation for interval estimates using the t distribution
• For the t distribution, the notation is the same as in the last slide, with t replacing Z. The only addition is that for each df, there is a different t value.
• For a sample mean \( \bar{x} \) with a t distribution, the t values for the (1 – α) × 100% interval are \( \pm t_{\alpha/2} \) for the appropriate df.
• If a sample mean has a t distribution, df = n − 1, that is, the sample size minus 1.
• For a 98% interval estimate of a population mean, where the sample size is n = 9, or df = n – 1 = 8, \( t_{\alpha/2} = t_{0.01} = 2.896 \).
• In this case, the interval estimate of the population mean μ is
\[ \bar{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}} = \bar{x} \pm 2.896 \frac{s}{\sqrt{9}} = \bar{x} \pm 0.965\, s \]

Example of wages of workers employed at new jobs after a plant shutdown – I
• Prior to the shutdown of an Ontario manufacturing plant in the 1990s, male workers were paid $13.76 per hour and female workers $11.80 per hour.
• Two years after the workers were laid off, researchers located some of these laid off workers. For twelve male workers who found new jobs, the mean hourly wage was $12.20, with a standard deviation of $3.27. For twelve female workers who found new jobs, the mean hourly wage was $8.11, with a standard deviation of $3.53.
• Obtain 90% interval estimates of the mean wage for all laid off male workers who found new jobs, and for all laid off female workers who found new jobs. What do you conclude from these results?
Source: Data and research from Belinda Leach and Anthony Winson, “Bringing ‘Globalization’ Down to Earth: Restructuring and Labour in Rural Communities,” Canadian Review of Sociology and Anthropology, 32:3, August 1995.

Example of wages – II
• These are small samples, and for neither males nor females is the standard deviation of wages known. While male wages in the new jobs may not be exactly normally distributed, assume that they are symmetrically distributed or close to normal. Assume the same for the distribution of wages of female workers. Assume that each sample is a random sample of all laid off workers who had new jobs at the time of the study.
• From the above assumptions, for each of male and female workers who found new jobs, the distribution of sample mean pay is a t distribution. In each case, the sample has size n = 12, so the df associated with each interval estimate is df = 12 − 1 = 11.
For a 90% interval estimate, there is 90% or 0.90 of the area in the middle of the distribution and 5% or 0.05 of the area in each of the two tails of the distribution. The appropriate t value for 11 df and 90% confidence is 1.796.

Example of wages – III
• For males, let μ be the mean wage of all those laid off male workers who found new jobs. The 90% interval estimate for the population mean μ is
\[ \bar{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}} = 12.20 \pm 1.796 \times \frac{3.27}{\sqrt{12}} = 12.20 \pm 1.796 \times 0.944 = 12.20 \pm 1.70 \]
or $12.20 ± $1.70 or ($10.50, $13.90).
• For females, the procedure is the same and the resulting 90% interval estimate is
\[ \bar{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}} = 8.11 \pm 1.796 \times \frac{3.53}{\sqrt{12}} = 8.11 \pm 1.796 \times 1.019 = 8.11 \pm 1.83 \]
or $8.11 ± $1.83 or ($6.28, $9.94).

Example of wages – IV
• These interval estimates suggest that the mean hourly pay of laid off workers has declined, with much stronger evidence for females than for males.
• For males, the 90% confidence interval is $10.50 to $13.90. The sample mean of $12.20 is below the pre-layoff level of $13.76, but the interval includes $13.76, so with this sample a decline in mean pay for all re-employed males cannot be established at 90% confidence.
• For females the situation appears worse. The 90% interval estimate of mean pay of re-employed females is from $6.28 to $9.94, an interval well below the pre-layoff level of $11.80.
• Neither result is certain, but the evidence at the time of the study points to a decline in mean pay, clearly so for females, as compared with the pre-layoff situation.
• Cautions – samples may not be random, distribution of pay may not be normal, uncertainty with only 90% confidence.

Using the t distribution
• Small n often occurs in practice.
• σ unknown is the usual situation.
• Normal distribution of population. This is unlikely but so long as the population distribution is close to symmetric, the t distribution should still give reliable results.
• In these situations, when estimating a population mean, it is advisable to use the t distribution as the sampling distribution of the sample mean, rather than the normal distribution.
• For larger sample sizes, the sampling distribution of sample means can be approximated by the normal distribution.
• All of the above assume that the sample is a random sample, or is equivalent to a random sample.

t distribution in economic analysis
• Much economic analysis uses very large sample sizes. But there are situations where n is small and the population standard deviation is unknown. Then the sample mean has a t distribution. When n > 100, the normal provides a good approximation to the t distribution.
• Experiments, administrative data, or other situations with a small number of cases.
• Measurement error is often close to normally distributed.
• Regression analysis, especially with time series data, where the number of observations across time is not large. Regression coefficients have a t distribution (ASW, Ch. 12).
• Economic variables are sometimes assumed to be normally distributed, but with unknown variability, so the t distribution is used for the distribution of the sample mean.

Estimating sample size (ASW, 310-313)
• Prior to conducting a research study, it is often useful to estimate the sample size required to achieve a particular margin of error, specifying a confidence level.
• While this may not be the final sample size a researcher obtains, the following calculations provide an estimate of the number of population elements from which a researcher should attempt to obtain data. This, in turn, can be used to plan the research study and estimate the time and cost that will be required to conduct it.
• Cost may be too great, respondents may refuse to participate, nonresponse to some questions, time may be insufficient.
• The method examined here provides the required sample size for a random sample from a population, given the margin of error and confidence level.

Formula
• Margin of error = E.
• Confidence level = (1 – α) × 100% and the corresponding normal value is \( Z_{\alpha/2} \).
• Population standard deviation is σ.
• n is the required sample size when random sampling from this population.
\[ n = \left( \frac{Z_{\alpha/2}\, \sigma}{E} \right)^2 \]

Rationale for formula
• Formula for interval estimate is \( \bar{x} \pm Z_{\alpha/2} \dfrac{\sigma}{\sqrt{n}} \)
• Researcher wishes an interval \( \bar{x} \pm E \)
• Let \( E = Z_{\alpha/2} \dfrac{\sigma}{\sqrt{n}} \)
• And solve this expression for n, giving \( n = \left( \dfrac{Z_{\alpha/2}\, \sigma}{E} \right)^2 \)

Example of sample size – I
• A manager at Access Communications wants to know whether it is worthwhile to target university students for a promotion. In order to do this she would like to know how many minutes of TV students watch each day, accurate to within five minutes, with 99% confidence. The upper limit of budget expenditures for the study is $1,000.
• You have been hired as a consultant to the manager and your task is to survey a sample of university students to obtain the required information. What sample size is required? What would you recommend to the manager?

Example of sample size – II
Fortunately you have kept the Excel worksheet from Economics 224 and when you check it, you determine that the standard deviation of the hours students in Economics 224 reported watching TV daily was 1.298 hours. From this, you use the requirements specified by the manager, that is, a margin of error of 5 minutes or E = 5/60 = 0.0833 hours and 99% confidence. The Z value is approximately 2.575, so the required sample size is:
\[ n = \left( \frac{Z_{\alpha/2}\, \sigma}{E} \right)^2 = \left( \frac{2.575 \times 1.298}{0.0833} \right)^2 = 40.124^2 \approx 1{,}609.955 \]

Example – III
• You report that the required sample size is at least n = 1,610.
• You also report that a larger sample size might be required, since your estimate of the variability of the population may be low. That is, σ for all university students might exceed 1.3 hours.
• If this is a random sample, to obtain a sampling frame and then contact each student by telephone, email, or Canada Post, you estimate that the cost of sampling is approximately $10 per student, for a total cost of about $16,100, well over the $1,000 budget.
• From this, you might recommend a smaller sample size, with relaxed requirements, say E = 15 minutes, 95% confidence, for a sample of around 100 students.
You might note that the required margin of error of five minutes is very difficult to obtain and too demanding.
• Explore less expensive methods of conducting the survey.

Estimating σ
• To obtain an estimate of the required n, some estimate of the population standard deviation σ is required.
• Use s from previous studies or similar populations.
• Pilot study. Obtain a preliminary estimate of σ.
• Judgment or best guess. Dividing the range by 4 can produce a reasonable provisional estimate of σ. If there are outliers, it may be best to eliminate these. For example, what is the σ of income for Saskatchewan residents? Minimum = 0 and maximum might be $10 million plus. But make range from 0 to $100,000 and this may include 99% plus of the population. Rough estimate of σ for Saskatchewan income would be around $25,000. (For 2001, s = $23,000, from Census).
• Structure sampling procedure so that the sample size can later be increased, if necessary.

Additional notes about sample size – I
• When obtaining n from the formula, round up.
• Make sure units for σ and E in the formula are the same.
• n larger with
  – Greater variability σ in the population.
  – Larger confidence level.
  – Smaller margin of error E.
• Trade off between costs and accuracy of results.
• For a random sample, n does not depend on population size if the proportion of the population sampled (n/N) is small.
• It may not be possible to obtain the required n, so researcher will have to settle for a larger margin of error or reduced confidence level, or both. For example, time series data on Internet use may only be available for n = 10 years.

Additional notes about sample size – II
• Sample size given by above formula indicates the number of population elements actually required in the study. If individuals or firms to be surveyed are reluctant to participate, cannot be found, or are unwilling to answer some questions, expand the required number of elements in the hope that the n indicated can be obtained.
For example, if the formula indicates a required n = 500 and 25% nonresponse is expected, expand sample size to 650 or 700.
• Sampling procedure affects required sample size. Cluster samples might need to have larger n but stratification of a sample might reduce the required sample size. Different formulae for more complex sampling procedures.

Weighting of sample elements
• This issue is not discussed in the text but is one that needs consideration in much survey sampling.
• Sampling procedure may be designed so each element selected in the sample represents a different number of population elements (e.g., cluster, stratified, multistage sampling). Research methodology should report the weighting procedures to be used when conducting data analysis. Statistics Canada often includes a weight in the data set.
• Weighting may occur after data are obtained, to estimate characteristics of population. For example, if males and females are about equal in number in a population but a sample has fewer males than females, data from males may be more heavily weighted when analyzing and reporting results.

Conclusion about interval estimates and sample size
• Formulae are precise but approximations are often used:
  – Random sample?
  – Standard deviation of population?
  – Confidence level arbitrary.
  – Nonresponse and other nonsampling errors.
• When data come from samples, there is usually sampling error. Interval estimates and estimates of sample size are necessary but remember above cautions about their accuracy.
• Replication of studies, similar research on related topics and comparable populations.
• Careful sample and research design and data analysis.

Later on Wednesday or on Monday
• Normal approximation to the binomial (ASW, 6.3).
• Sampling distribution of the sample proportion (ASW, 7.6).
• Interval estimate of a population proportion (ASW, 8.4).
• Sample size for estimation of a population proportion (ASW, 315-316).
• Review – Monday during class and Tuesday at review session
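
Sketch: checking the calculations in Python
The interval estimate \( \bar{x} \pm t_{\alpha/2}\, s/\sqrt{n} \) and the sample size formula \( n = (Z_{\alpha/2}\, \sigma / E)^2 \) from these notes can be checked with a few lines of code. The following is a minimal sketch, not part of ASW or the course materials. The function names t_interval and required_sample_size are illustrative assumptions. The t value is passed in by hand from the t table (1.796 for 11 df and 90% confidence), and the Z value comes from Python's standard library, so it is unrounded (about 2.5758 for 99% confidence) rather than the table value used in the slides.

```python
from math import ceil, sqrt
from statistics import NormalDist


def t_interval(mean, s, n, t_value):
    """Return (lower, upper) for mean ± t_value * s / sqrt(n).

    t_value must be looked up in a t table for n - 1 degrees of freedom
    and the chosen confidence level; it is not computed here.
    """
    margin = t_value * s / sqrt(n)  # margin of error = t times the standard error
    return mean - margin, mean + margin


def required_sample_size(sigma, margin_of_error, confidence):
    """Return n = (Z * sigma / E)^2, rounded up to the next whole number.

    sigma and margin_of_error must be in the same units.
    """
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # Z value with alpha/2 in the upper tail
    return ceil((z * sigma / margin_of_error) ** 2)


# Wage example: n = 12 in each group, t = 1.796 from the table (11 df, 90%).
for label, mean, s in [("males", 12.20, 3.27), ("females", 8.11, 3.53)]:
    lower, upper = t_interval(mean, s, 12, 1.796)
    print(f"{label}: ({lower:.2f}, {upper:.2f})")

# Sample size example: sigma = 1.298 hours, E = 5 minutes = 5/60 hours, 99%.
print(required_sample_size(1.298, 5 / 60, 0.99))

# Relaxed requirements: E = 15 minutes, 95% confidence.
print(required_sample_size(1.298, 15 / 60, 0.95))
```

For the 99% requirement the function returns 1,610, in line with the sample size reported in the TV viewing example. Entering σ and E in the same units (hours here) matters, as noted in the additional notes about sample size.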