October 15 -- Sample Size

Interval estimates of the mean
for small n, σ unknown.
Estimates of sample size.
ASW, 8.2 – 8.4
Economics 224 – Notes for October 15
Interval estimates using the t distribution
• When a random sample of size n is drawn from a
normally distributed population whose standard
deviation σ is unknown, the standardized sample mean
(x̄ – μ)/(s/√n) has a t distribution.
• The interval estimate has the same format as earlier,
but with a t value replacing the Z value, and the
sample standard deviation s replacing σ. The interval
estimate of the population mean μ is
$$\bar{x} \pm t \frac{s}{\sqrt{n}}$$
• A confidence level must be specified for each interval
estimate; the confidence level and the sample size
determine the t value.
t distribution
• The shape of the t distribution is similar to that of the normal
distribution – bell-shaped, peaked in the centre, symmetrical
about the mean, and asymptotic to the horizontal axis.
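• The table of the t distribution is a standardized t distribution,
that is, the t values are centred on a mean of 0. The standard
deviation is slightly greater than 1 (it equals √(df/(df – 2)) for
df > 2) and approaches 1 as df increases.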
• There are many different t distributions, one for each degree
of freedom (see explanation on a later slide).
• For small degrees of freedom, the t distribution is very
dispersed. As the degrees of freedom increase, the t
distribution becomes more concentrated around the mean.
• The limiting distribution for the t distribution is the normal
distribution. That is, as the degrees of freedom increase, the
t distribution is approximated by the normal distribution.
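The convergence described above can be checked numerically. The following is a minimal sketch, not the method used in ASW: it assumes only the Python standard library, integrates the t density with Simpson's rule, and finds the critical value by bisection. The function names `t_pdf`, `t_upper_tail`, and `t_critical` are illustrative choices.

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_upper_tail(t, df, steps=2000):
    """Area to the right of t >= 0, using Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


def t_critical(tail_area, df):
    """t value that leaves tail_area in the upper tail, by bisection."""
    lo, hi = 0.0, 100.0  # assumption: the critical value is below 100
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_upper_tail(mid, df) > tail_area:
            lo = mid  # too much area remains in the tail, move right
        else:
            hi = mid
    return (lo + hi) / 2


z = NormalDist().inv_cdf(0.975)  # normal value for a 95% interval, about 1.96
for df in (2, 5, 11, 30, 100):
    print(f"df = {df:3d}: t = {t_critical(0.025, df):.3f}, Z = {z:.3f}")
```

The printed t values shrink toward the Z value as df grows, which is the sense in which the normal distribution is the limiting distribution.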
The concept of degrees of freedom (df)
• “Degrees of freedom” refers to how many sample values can
vary freely. In many statistical procedures, some sample
values are constrained by the parameters to be estimated.
• When a t distribution describes the sampling distribution of
the mean, one degree of freedom is lost since s is used as an
estimate of σ (ASW, 304). In this case, the t distribution has
n-1 degrees of freedom, where n is the sample size.
• Degrees of freedom are used in the chi-square test and
distribution and in regression and analysis of variance models.
The type and number of constraints and degrees of freedom
differ from model to model.
t table (ASW, 303 and Appendix B, Table 2)
• In ASW, the t table gives areas in the upper tail of the
distribution. The t distribution is symmetric about a mean of
t = 0 so the same area for the lower tail is given by the
negative of the t value in the table.
• Each degree of freedom defines a different t distribution. For
degrees of freedom above 100, use Z values from the
standard normal distribution, since the normal distribution
closely approximates the t distribution for large df.
• t values are given in the body of the table, with areas under
the curve given at the top of each column, and df at the start
of each row.
• Each value in the table is the t value, or number of standard
errors to the right of the centre of the distribution, that
leaves the area shown at the top of the column in the right
tail of the distribution.
General notation for interval estimates
• The confidence coefficient is given the symbol (1-α). α is the
first letter of the Greek alphabet and is termed “alpha.” For
interval estimates, α is merely a symbol used to denote the
area in the two tails of a distribution. The area in the middle
of the distribution is (1-α) or (1-α) x 100% and there is α/2 of
the area in each of the tails of the distribution.
• When a sample mean is normally distributed, the Z values for
the 95% interval estimate, or the (1 – 0.05) x 100% = 95%
interval, are ±1.96. That is, if α = 0.05,
$$Z_{\alpha/2} = Z_{0.05/2} = Z_{0.025} = 1.96$$
and the interval estimate of the population mean μ is
$$\bar{x} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}} = \bar{x} \pm 1.96 \frac{\sigma}{\sqrt{n}}$$
Notation for interval estimates using the t
distribution
• For the t distribution, the notation is the same as in the last
slide, with t replacing Z. The only addition is that for each df,
there is a different t value.
• For a statistic with a t distribution, the t values for the
(1–α) × 100% interval are ±tα/2 for the appropriate df.
• If a sample mean has a t distribution, df = n-1, that is, the
sample size minus 1.
• For a 98% interval estimate of a population mean, where the
sample size is n = 9, or df = n – 1 = 8, t = 2.896.
• In this case, the interval estimate of the population mean μ is
$$\bar{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}} = \bar{x} \pm 2.896 \frac{s}{\sqrt{9}} = \bar{x} \pm 0.965\,s$$
Example of wages of workers employed at new
jobs after a plant shutdown – I
• Prior to the shutdown of an Ontario manufacturing plant in
the 1990s, male workers were paid $13.76 per hour and female
workers $11.80 per hour.
• Two years after the workers were laid off, researchers located
some of these laid off workers. For twelve male workers who
found new jobs, the mean hourly wage was $12.20, with a
standard deviation of $3.27. For twelve female workers who
found new jobs, the mean hourly wage was $8.11, with a
standard deviation of $3.53.
• Obtain 90% interval estimates of the mean wage for all laid off
male workers who found new jobs, and for all laid off female
workers who found new jobs. What do you conclude from
these results?
Source: Data and research from Belinda Leach and Anthony Winson, “Bringing
‘Globalization’ Down to Earth: Restructuring and Labour in Rural Communities,”
Canadian Review of Sociology and Anthropology, 32:3, August 1995.
Example of wages – II
• These are small samples, and for neither males nor females is
the standard deviation of wages known. While male wages in
the new jobs may not be exactly normally distributed, assume
that they are symmetrically distributed or close to normal.
Assume the same for the distribution of wages of female
workers. Assume that each sample is a random sample of all
laid off workers who had new jobs at the time of the study.
• From the above assumptions, for each of male and female
workers who found new jobs, the distribution of sample mean
pay is a t distribution. In each case, the sample has size n =
12, so the df associated with each interval estimate is
df = 12 – 1 = 11. For a 90% interval estimate, there is 90% or
0.90 of the area in the middle of the distribution and 5% or
0.05 of the area in each of the two tails of the distribution. The
appropriate t value for 11 df and 90% confidence is 1.796.
Example of wages – III
• For males, let μ be the mean wage of all those laid off male
workers who found new jobs. The 90% interval estimate for
the population mean μ is
$$\bar{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}} = 12.20 \pm 1.796 \times \frac{3.27}{\sqrt{12}} = 12.20 \pm 1.70$$
or $12.20 ± $1.70 or ($10.50 , $13.90).
• For females, the procedure is the same and the resulting 90%
interval estimate is
$$\bar{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}} = 8.11 \pm 1.796 \times \frac{3.53}{\sqrt{12}} = 8.11 \pm 1.83$$
or $8.11 ± $1.83 or ($6.28 , $9.94).
Example of wages – IV
• These interval estimates provide evidence that the mean
hourly pay of laid off workers has declined, with stronger
evidence for females than for males.
• For males, the 90% confidence interval is $10.50 to $13.90.
The sample mean of $12.20 is below the pre-layoff level of
$13.76, but $13.76 lies inside the interval, near its upper
limit, so a decline in mean pay for all re-employed males is
suggested rather than established at this confidence level.
• For females the situation appears worse. The 90% interval
estimate of mean pay of re-employed females is from $6.28 to
$9.94, an interval well below the pre-layoff level of $11.80.
• Neither result is certain but evidence at the time of the study
is that workers, especially females, experienced a decline in
mean pay, as compared with the pre-layoff situation.
• Cautions – samples may not be random, distribution of pay
may not be normal, uncertainty with only 90% confidence.
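The arithmetic of this example can be reproduced from the summary statistics alone. The sketch below assumes the table value t = 1.796 quoted in the notes for 11 df and takes the means, standard deviations, and pre-layoff wages from the slides; the function name `t_interval` and the dictionary layout are illustrative.

```python
import math


def t_interval(mean, sd, n, t_value):
    """Return (margin, lower, upper) for mean ± t * sd / sqrt(n)."""
    margin = t_value * sd / math.sqrt(n)
    return margin, mean - margin, mean + margin


T_11DF_90 = 1.796  # table value for 11 df and 90% confidence, from the notes

# (sample mean, sample sd, pre-layoff wage) in dollars per hour
groups = {
    "males": (12.20, 3.27, 13.76),
    "females": (8.11, 3.53, 11.80),
}

for name, (mean, sd, old_wage) in groups.items():
    margin, lower, upper = t_interval(mean, sd, 12, T_11DF_90)
    inside = lower <= old_wage <= upper
    print(f"{name}: {mean:.2f} ± {margin:.2f} -> ({lower:.2f}, {upper:.2f}); "
          f"pre-layoff wage inside interval: {inside}")
```

Checking whether the pre-layoff wage falls inside each interval is one simple way to read the results. It is a convenience added here, not a formal hypothesis test.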
Using the t distribution
• Small n often occurs in practice.
• σ unknown is the usual situation.
• Normally distributed population. Exact normality is unlikely,
but so long as the population distribution is close to
symmetric, the t-based results should remain reliable.
• In these situations, when estimating a population mean, it is
advisable to use the t distribution as the sampling distribution
of the sample mean, rather than the normal distribution.
• For larger sample sizes, the sampling distribution of sample
means can be approximated by the normal distribution.
• All of the above assume that the sample is a random sample,
or is equivalent to a random sample.
t distribution in economic analysis
• Much economic analysis uses very large sample sizes. But
there are situations where n is small and the population
standard deviation is unknown. Then the sample mean has a
t distribution. When n >100, the normal provides a good
approximation to the t distribution.
• Experiments, administrative data, or other situations with a
small number of cases.
• Measurement error is often close to normally distributed.
• Regression analysis, especially with time series data, where
the number of observations across time is not large.
Regression coefficients have a t distribution (ASW, Ch. 12).
• Economic variables are sometimes assumed to be normally
distributed, but with unknown variability, so the t distribution
is used for the distribution of the sample mean.
Estimating sample size (ASW, 310-313)
• Prior to conducting a research study, it is often useful to
estimate the sample size required to achieve a particular
margin of error, specifying a confidence level.
• While this may not be the final sample size a researcher
obtains, the following calculations provide an estimate of the
number of population elements from which a researcher
should attempt to obtain data. This, in turn, can be used to
plan the research study and estimate the time and cost that
will be required to conduct it.
• The final sample may be smaller than planned: cost may be too
great, respondents may refuse to participate or may not answer
some questions, or time may be insufficient.
• The method examined here provides the required sample size
for a random sample from a population, given the margin of
error and confidence level.
Formula
• Margin of error = E.
• Confidence level = (1 – α) × 100% and the corresponding normal
value is Zα/2.
• Population standard deviation is σ.
• n is the required sample size when random sampling from this
population.
$$n = \left( \frac{Z_{\alpha/2}\,\sigma}{E} \right)^2$$
Rationale for formula
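• Formula for interval estimate is
$$\bar{x} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$
• Researcher wishes an interval
$$\bar{x} \pm E$$
• Let
$$E = Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$
• And solve this expression for n, giving
$$n = \left( \frac{Z_{\alpha/2}\,\sigma}{E} \right)^2$$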
Example of sample size – I
• A manager at Access Communications wants to know whether
it is worthwhile to target university students for a promotion.
In order to do this she would like to know how many minutes
of TV students watch each day, accurate to within five
minutes, with 99% confidence. The upper limit of budget
expenditures for the study is $1,000.
• You have been hired as a consultant to the manager and your
task is to conduct a sample of university students to obtain
the required information. What sample size is required?
What would you recommend to the manager?
Example of sample size – II
Fortunately you have kept the Excel worksheet from Economics
224 and when you check it, you determine that the standard
deviation of the hours students in Economics 224 reported
watching TV daily was 1.298 hours.
From this, you use the requirements specified by the manager,
that is, a margin of error of 5 minutes or E = 5/60 = 0.0833
hours and 99% confidence. The Z value is 2.575, so the
required sample size is:
$$n = \left( \frac{Z_{\alpha/2}\,\sigma}{E} \right)^2 = \left( \frac{2.575 \times 1.298}{0.0833} \right)^2 = (40.124)^2 \approx 1{,}609.955$$
Example – III
• You report that the required sample size is at least n = 1,610.
• You also report that a larger sample size might be required,
since you have an estimate of the variability of the population
that may be low. That is, σ for all university students might
exceed 1.3 hours.
• If this is to be a random sample, you need to obtain a
sampling frame and then contact each student by telephone,
email, or Canada Post. You estimate that the cost of sampling
is approximately $10 per student, for a total cost of about
$16,100, well over the $1,000 budget.
• From this, you might recommend a smaller sample size, with
relaxed requirements, say E = 15 minutes, 95% confidence, for
a sample of around 100 students. You might note that the
required margin of error of five minutes is very difficult to
obtain and too demanding.
• Explore less expensive methods of conducting the survey.
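The two options in this example can be compared with a short calculation. This is a sketch under stated assumptions: it uses the exact normal quantile from Python's `statistics.NormalDist` rather than a table value, it keeps E as the unrounded fraction of an hour, and the names `required_n`, `SIGMA_HOURS`, and `COST_PER_STUDENT` are illustrative.

```python
import math
from statistics import NormalDist


def required_n(sigma, margin, confidence):
    """Sample size n = (Z * sigma / E)^2, rounded up to a whole number."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / margin) ** 2)


SIGMA_HOURS = 1.298       # s from the Economics 224 worksheet, used as sigma
COST_PER_STUDENT = 10     # dollars, the consultant's estimate in the example

scenarios = {
    "manager's request (5 min, 99%)": (5 / 60, 0.99),
    "relaxed option (15 min, 95%)": (15 / 60, 0.95),
}

for label, (margin_hours, confidence) in scenarios.items():
    n = required_n(SIGMA_HOURS, margin_hours, confidence)
    print(f"{label}: n = {n}, cost = ${n * COST_PER_STUDENT:,}")
```

With these inputs the first scenario gives n = 1,610 and the relaxed one gives n = 104, so even the relaxed design costs about $1,040 at $10 per student. That is slightly above the $1,000 limit, which is one more reason to look for a cheaper way to collect the data.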
Estimating σ
• To obtain an estimate of the required n, some estimate of the
population standard deviation σ is required.
• Use s from previous studies or similar populations.
• Pilot study. Obtain a preliminary estimate of σ.
• Judgment or best guess. Dividing the range by 4 can produce
a reasonable provisional estimate of σ. If there are outliers, it
may be best to eliminate these. For example, what is the σ of
income for Saskatchewan residents? Minimum = 0 and
maximum might be $10 million plus. But a range from 0 to
$100,000 may include more than 99% of the population.
Dividing this range by 4, a rough estimate of σ for
Saskatchewan income would be around $25,000. (For 2001,
s = $23,000, from Census).
• Structure sampling procedure so that the sample size can
later be increased, if necessary.
Additional notes about sample size – I
• When obtaining n from the formula, round up.
• Make sure units for σ and E in the formula are the same.
• n larger with
– Greater variability σ in the population.
– Larger confidence level.
– Smaller margin of error E.
• Trade off between costs and accuracy of results.
• For a random sample, n does not depend on population size if
the proportion of the population sampled (n/N) is small.
• It may not be possible to obtain the required n, so researcher
will have to settle for a larger margin of error or reduced
confidence level, or both. For example, time series data on
Internet use may only be available for n = 10 years.
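How strongly each factor moves n follows from the formula, since σ, E, and Z all enter as squares. The sketch below uses made-up baseline inputs (σ = 1.0, E = 0.2, 95% confidence) purely for illustration; `raw_n` is a hypothetical helper that returns the value before rounding up.

```python
import math
from statistics import NormalDist


def raw_n(sigma, margin, confidence):
    """Unrounded sample size (Z * sigma / E)^2 for a random sample."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return (z * sigma / margin) ** 2


# Illustrative baseline only; these inputs are not taken from the notes.
base = raw_n(sigma=1.0, margin=0.2, confidence=0.95)
print("baseline n:", math.ceil(base))

# Ratio of each changed design to the baseline.
print("sigma doubled:     ", round(raw_n(2.0, 0.2, 0.95) / base, 2))
print("margin halved:     ", round(raw_n(1.0, 0.1, 0.95) / base, 2))
print("99% instead of 95%:", round(raw_n(1.0, 0.2, 0.99) / base, 2))
```

Doubling σ or halving E each multiplies the required n by four, while moving from 95% to 99% confidence raises it by roughly 70%. This shows why a tight margin of error is usually the most expensive requirement.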
Additional notes about sample size – II
• Sample size given by above formula indicates the number of
population elements actually required in the study. If
individuals or firms to be surveyed are reluctant to
participate, cannot be found, or are unwilling to answer some
questions, expand the required number of elements in the
hope that the n indicated can be obtained. For example, if the
formula indicates a required n = 500 and 25% nonresponse is
expected, expand sample size to 650 or 700.
• Sampling procedure affects required sample size. Cluster
samples might need to have larger n but stratification of a
sample might reduce the required sample size. Different
formulae for more complex sampling procedures.
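One common way to allow for nonresponse is to divide the required n by the expected response rate. The notes do not state this rule, so it is an assumption here; the function name `elements_to_contact` is illustrative.

```python
import math


def elements_to_contact(required_n, expected_nonresponse):
    """Number of elements to approach so that about required_n respond.

    Assumes every contacted element responds independently with
    probability (1 - expected_nonresponse).
    """
    if not 0 <= expected_nonresponse < 1:
        raise ValueError("expected_nonresponse must be in [0, 1)")
    return math.ceil(required_n / (1 - expected_nonresponse))


# The case from the notes: required n = 500 with 25% nonresponse expected.
print(elements_to_contact(500, 0.25))
```

For n = 500 and 25% nonresponse this gives 667, which sits inside the 650 to 700 range suggested above. Rounding further up leaves a margin in case nonresponse turns out higher than expected.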
Weighting of sample elements
• This issue is not discussed in the text but is one that needs
consideration in much survey sampling.
• Sampling procedure may be designed so each element
selected in the sample represents a different number of
population elements (e.g. cluster, stratified, multistage
sampling). Research methodology should report the
weighting procedures to be used when conducting data
analysis. Statistics Canada often includes a weight in the
data set.
• Weighting may occur after the data are obtained, to estimate
characteristics of population. For example, if males and
females are about equal in number in a population but a
sample has fewer males than females, data from males may
be more heavily weighted when analyzing and reporting
results.
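The male/female case above can be sketched as a simple post-stratification weight: population share divided by sample share. All numbers below are hypothetical and chosen only to show the mechanics. The helper `poststrat_weights` is illustrative and is not a Statistics Canada procedure.

```python
def poststrat_weights(population_share, sample_counts):
    """Weight per group = population share / sample share."""
    total = sum(sample_counts.values())
    return {g: population_share[g] / (sample_counts[g] / total)
            for g in sample_counts}


# Hypothetical inputs: equal population shares, but fewer males sampled.
population_share = {"male": 0.5, "female": 0.5}
sample_counts = {"male": 40, "female": 60}
sample_means = {"male": 2.0, "female": 3.0}  # hypothetical hours of TV per day

weights = poststrat_weights(population_share, sample_counts)

n_total = sum(sample_counts.values())
unweighted = sum(sample_means[g] * sample_counts[g]
                 for g in sample_counts) / n_total

weighted_total = sum(weights[g] * sample_counts[g] for g in sample_counts)
weighted = sum(weights[g] * sample_counts[g] * sample_means[g]
               for g in sample_counts) / weighted_total

print("weights:", {g: round(w, 3) for g, w in weights.items()})
print("unweighted mean:", round(unweighted, 2))
print("weighted mean:", round(weighted, 2))
```

Males are under-represented in this hypothetical sample, so each male observation receives a weight above 1. The weighted mean then moves toward the male group mean.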
Conclusion about interval estimates and
sample size
• Formulae are precise but approximations are often used:
– Random sample?
– Standard deviation of population?
– Confidence level arbitrary.
– Nonresponse and other nonsampling errors.
• When data come from samples, there is usually sampling
error. Interval estimates and estimates of sample size are
necessary but remember above cautions about their accuracy.
• Replication of studies, similar research on related topics and
comparable populations.
• Careful sample and research design and data analysis.
Later on Wednesday or on Monday
• Normal approximation to the binomial (ASW, 6.3).
• Sampling distribution of the sample proportion
(ASW, 7.6).
• Interval estimate of a population proportion (ASW,
8.4).
• Sample size for estimation of a population
proportion (ASW, 315-316).
• Review – Monday during class and Tuesday at review
session