Paul Cornwell
March 25, 2011
MAT 5900- Monte Carlo Simulations
Professors Frey and Volpert
Necessary Sample Size for Good Central Limit Theorem Approximation
In conjunction with the Law of Large Numbers, the Central Limit Theorem is the
foundation for the majority of statistical practices today. The term Central Limit Theorem (CLT)
actually refers to a series of theorems, but in practice it is usually condensed as follows: averages
of independent, identically distributed variables (with finite, positive variance σ²) tend towards the
normal (Gaussian) distribution as sample size n gets large. Specifically, the normal distribution
to which the sampling distribution tends has mean $\mu_{\bar{x}} = \mu$ and variance $\sigma_{\bar{x}}^2 = \sigma^2/n$. The utility of
this theory is grounded in the idea of averages. Even if the underlying distribution of some
variable is finicky, it can still be analyzed through the use of averages. For this reason, the CLT
is invoked in many situations involving predictions or probabilities. Because of the “limit” part
of the theorem, however, the normal distribution will only be an approximation for the sampling
distribution if n is finite. Exactly how large n has to be for the CLT to provide a good
approximation is an elusive question, and it turns out that Monte Carlo simulations are as good a
tool as any for making such determinations.
The generalization of the CLT commonly taught in statistics classes today is a theorem of
Laplace from 1810. In discussing this result, Hans Fischer says the normal approximation is
appropriate “under conditions that, in practice, were always fulfilled.”1 As Fischer indicates,
there is not a lot of concern surrounding the conditions (i.e. the necessary sample size) needed to
appeal to the CLT. However, it is still important to know at what point it becomes acceptable to
do so. As sample size increases, the normal approximation becomes better and better, so at the
¹ A History of the Central Limit Theorem, 353.
very least having a target n in mind would give an indication of the quality of an approximation,
even if it is impossible to alter the design of an experiment or study by increasing the sample
size. One problem with this (which is probably responsible for a lack of guidelines on the topic)
is the tremendous dependence of the sampling distribution on the underlying distribution. Thus,
instead of getting too specific, the word “large” is employed to describe the requisite sample
size. My goal is to come up with a more specific procedure for determining how large of a
sample size is needed to get a good approximation from the Central Limit Theorem.
The problem of diverse underlying distributions is not the only obstacle in this endeavor.
Prior to giving any kind of numerical answer(s) to the question at hand, it is necessary to define
exactly what it means to be a “good” approximation. Practically speaking, the sampling
distribution of a random variable is never exactly normal. For this reason, using conventional “tests for normality” is a task doomed to failure. Given enough data, they will always be able to reject the hypothesis that the sampling distribution is normal, because it is not. However, there are a few techniques to be salvaged from this field that will be of use. Instead of doing conventional
tests, the best way to approach this problem is to consider the qualities of the Gaussian
distribution. This way, there is at least some basis of comparison for the sampling distribution to
the normal. Once these criteria are in place, the next task is to determine how closely a
distribution must resemble the normal on each criterion, which will then determine whether or not the
sample size is large enough for a good approximation. Finally, I will have to devise a way of
communicating these results. It is not necessarily practical to have a separate number for every
possible distribution for two reasons: first, there are far too many for this information to be
helpful; second, in practice it is common not to know the underlying distribution for a random
variable, only certain features. Instead, it would be helpful if I could identify some common
features of these distributions that affect the speed of their convergence.
The normal distribution can be observed to have the following properties: it is unimodal;
it is bell-shaped; it is symmetric; and it is continuous. Thus, in order for a sampling distribution
to be approximated by the normal, it should certainly exhibit these traits. Each of these can be
measured quantitatively using different metrics. Skewness, for example, is defined as the third
standardized central moment. For any continuous or discrete probability density function f (x),
this is given by the equation $\gamma_1 = E\left[\left(\frac{\bar{x}-\mu}{\sigma/\sqrt{n}}\right)^{3}\right]$.
As the name suggests, this moment measures the
skewness of a distribution, or more specifically its asymmetry. While skewness of zero does not
necessarily imply perfect symmetry, perfect symmetry implies skewness equal to zero. Thus, the
normal distribution has skewness equal to zero. The “peakedness” of a distribution is measured
by the fourth standardized central moment: $\gamma_2 = E\left[\left(\frac{\bar{x}-\mu}{\sigma/\sqrt{n}}\right)^{4}\right]$.²
This statistic is called kurtosis, and it
measures, depending on who you ask, how peaked or flat a distribution is, or how heavy the tails
are. The kurtosis of the normal distribution is equal to three, so it is common practice to subtract
three from the calculated fourth moment to quantify peakedness vis-à-vis the normal. This
modified statistic is known as “excess kurtosis.”3 This is the statistic that I observe in my
simulations. In theory, there is nothing special about the first four moments (these two, along
with μ1 = mean and μ2 = variance). It turns out that these numbers are insufficient for describing
a distribution in full. However, they have the advantage of being fairly easy to compute, and
² Testing for Normality, 41.
³ Wolfram MathWorld, “Skewness.”
behaving well for averages.4 Thus, if used in combination with other measures, they could be
useful tools in determining the quality of an approximation.
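Both moments are straightforward to compute for a simulated sample in base R; the short sketch below is my own illustration rather than code from the paper:

    # Sample skewness and excess kurtosis from the standardized central
    # moments, using base R only.
    skewness <- function(x) {
      z <- (x - mean(x)) / sd(x)
      mean(z^3)
    }
    excess.kurtosis <- function(x) {
      z <- (x - mean(x)) / sd(x)
      mean(z^4) - 3
    }

    skewness(rexp(100000))          # roughly 2 for the exponential
    excess.kurtosis(rexp(100000))   # roughly 6 for the exponential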
The various moments of a distribution can all be calculated analytically, without the aid
of simulation. Other measures, however, necessitate some kind of simulation in order to work. R
is very useful in this case, because it can easily generate variables from a large number of built-in
distributions or homemade PDFs. There are two “size” parameters at play in my code. The first
(N) is the number of observations in the sampling distribution. This number should be large so
that one has a thorough idea of what the sampling distribution looks like. Unless otherwise
noted, I generated 500,000 sample means in every simulation. The second (n) is the number of
observations that are averaged to make each of those sample means. It is the latter of these that
should affect the shape of the sampling distribution and is therefore the subject of this
investigation. A sampling distribution for any distribution can be generated fairly easily and
quickly with a “for” loop, and that is all that is needed to begin a comparison to the normal. One
of the features of a sampling distribution that cannot be measured purely analytically is the
proportion of observations in the tails. We know that 95% of the data falls within approximately
1.96 standard deviations from the mean for any normal distribution. Thus, if you standardize the
sample based on its own mean and standard deviation, you can compare the size of the tails by
calculating the proportion of numbers whose absolute value is greater than 1.96. This calculation
takes practically no time, even for a large number of values N. This formula can be tweaked to
give the percentage of the data that falls within any number of standard deviations of the mean.
The so-called empirical rule states that, for the normal distribution, 68.27% of data is within one standard deviation of the mean, 95.44% within two standard deviations, and 99.73% of
⁴ (from Dr. Frey) Skewness decreases by a factor of √n and excess kurtosis by a factor of n.
observations within three standard deviations.5 These proportions can be calculated through
simulation just like the tail probabilities, and they can be easily compared to the normal.
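A minimal R sketch of the simulation just described (my own illustration, not the paper's original code; the exponential with rate 1 stands in for whatever underlying distribution is being studied):

    # Build the sampling distribution: N sample means, each the average
    # of n draws from the underlying distribution.
    N <- 500000   # number of sample means
    n <- 10       # observations averaged per sample mean
    xbar <- numeric(N)
    for (i in 1:N) {
      xbar[i] <- mean(rexp(n, rate = 1))
    }

    # Standardize by the sampling distribution's own mean and sd, then
    # compare tail and empirical-rule proportions to the normal values.
    z <- (xbar - mean(xbar)) / sd(xbar)
    mean(abs(z) > 1.96)   # near .05 if the normal approximation is good
    mean(abs(z) < 1)      # near .6827
    mean(abs(z) < 2)      # near .9544
    mean(abs(z) < 3)      # near .9973

The loop can be vectorized (for example with colMeans(matrix(rexp(N * n, rate = 1), nrow = n))), but the explicit loop matches the description above.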
Another valuable tool for comparison in R is the empirical cumulative distribution
function (EDF). A cumulative distribution function (CDF) gives the probability that a random variable will be less than or equal to x. Thus, the limit of the CDF at negative infinity is zero, and at
positive infinity is one. Given a sample from any distribution, one can construct a CDF empirically by
simply plotting the sample observations as the independent variable versus the proportion of the
sample elements less than or equal to each observation as the dependent variable. This can be
done easily in R by simply sorting the vector containing the sample and plotting it against the
cumulative probabilities. For an easy contrast with the normal, one can standardize the sampling
distribution based on its mean and standard deviation and compare it to the standard normal
CDF. The better the approximation by the normal, the closer these two graphs should be. In
practice, it turns out that the largest deviation between two CDFs usually occurs in the middle,
which makes sense because the extreme values (limits) must be the same for all CDFs. Although
even an “eye test” is valuable with these graphs, there is a way to quantify closeness between
CDFs. Kolmogorov-Smirnov distance is defined as the maximum vertical distance between two CDFs. Although a test for normality based on this statistic is unpopular because of its poor power and its inapplicability to discrete distributions, it is still a good measure of the quality of an approximation
that should be relatively consistent from simulation to simulation.
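A short R sketch of this comparison (again my own illustration, reusing the standardized means z from the loop above):

    # Empirical CDF of the standardized sample means versus the standard
    # normal CDF, plus the K-S distance between them.
    z <- sort(z)
    p.hat <- (1:length(z)) / length(z)      # empirical cumulative probabilities

    plot(z, p.hat, type = "l", xlab = "z", ylab = "cumulative probability")
    curve(pnorm(x), add = TRUE, lty = 2)    # standard normal CDF for comparison

    # K-S distance: largest vertical gap, checking both sides of each jump
    # of the empirical CDF.
    ks.dist <- max(pmax(abs(p.hat - pnorm(z)),
                        abs(p.hat - 1 / length(z) - pnorm(z))))
    ks.dist
    # ks.test(z, "pnorm")$statistic gives the same value (with a warning
    # about ties when the underlying distribution is discrete).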
For the reasons discussed above, there are few guidelines for a minimum sample size
required to invoke the CLT in practice. The general rule of thumb that one sees in introductory
statistics courses calls for n ≥ 30—not a very helpful guide. In reality, depending on the
underlying distribution, this recommendation can either be overkill, or entirely inadequate. A
⁵ A First Course in Statistical Methods, 81.
slight improvement on this is with the binomial problem, where the recommendation is a
function of the parameter p, which gives the probability of a “success” for a binary variable.
Generally, one will see recommendations that n be greater than both 5/p and 5/q, where q is equal to (1-p).6 In other words, np and nq should each be at least five; for p = .1, this requires n ≥ 50. The reasoning for this is that the skewness of the binomial distribution
increases as p moves away from .5. In general, the closer a distribution is to normal, the quicker
its sampling distribution will approach normal. Others suggest that the values np and nq be
greater than ten instead of five, but the idea is the same. Besides the binomial distribution,
however, there is not much guidance beyond n having to be “large.”
To come up with a set of criteria for these statistics is seemingly arbitrary. However, it
must be done in order to have some kind of objective scale for identifying a good approximation
to the normal distribution. One way to make this process easier is to have two different thresholds:
one for a superior approximation and the other for an adequate one. By looking at both
histograms of the samples I generated and comparisons of the empirical distribution functions to
the normal cumulative distribution function, I decided that the following standards are requisite
for an “adequate” approximation to normal: excess kurtosis should be less than .5 in magnitude;
skewness should be less than .25 in magnitude; the tail probabilities for nominal 5% should be
between .04 and .06; and Kolmogorov-Smirnov distance should be less than .05. For a superior
approximation, the following numbers are required: excess kurtosis less than .3 in magnitude;
skewness less than .15 in magnitude; tail probabilities for nominal 5% should be between .04 and
.06; and K-S distance should be less than .02. One could argue that these requirements are fairly
conservative, but sometimes there is very little differentiating two distributions, especially
considering the traits that all probability density functions must share by virtue of being
⁶ A First Course in Statistical Methods, 167.
probability distributions. What follows is the application of these measures to various underlying
distributions.
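Before turning to those applications, one convenient way to apply the thresholds is to bundle them into a single R function (my own packaging of the criteria above; the argument is assumed to be a sampling distribution already standardized by its own mean and standard deviation):

    # Evaluate the four criteria on a standardized sampling distribution z.
    approximation.quality <- function(z) {
      skew  <- mean(z^3)
      kurt  <- mean(z^4) - 3
      tails <- mean(abs(z) > 1.96)
      ksd   <- unname(ks.test(z, "pnorm")$statistic)
      list(skewness = skew, excess.kurtosis = kurt,
           tail.prop = tails, ks.distance = ksd,
           adequate = abs(kurt) < .5 && abs(skew) < .25 &&
                      tails > .04 && tails < .06 && ksd < .05,
           superior = abs(kurt) < .3 && abs(skew) < .15 &&
                      tails > .04 && tails < .06 && ksd < .02)
    }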
It turns out that one of the fastest distributions to converge to normal in averages is the
continuous uniform. The only parameters of the uniform distribution are the endpoints of the
interval. The third and fourth moments are independent of the length of the interval, and
experiments with changing the length suggest that doing so does not affect the rate of
convergence. Thus, for all of my simulations, I used the interval [0,1]. A distribution of 500,000
samples of size three already seems to be approximately normal by looking at a histogram of the
data. Because of its symmetry, the uniform distribution has no skewness, so one would expect
samples to behave the same way. The excess kurtosis, however, is equal to -1.2 for the
continuous uniform distribution. Thus, for a sample of size five it is equal to -.24 (kurtosis
decreases by a factor of n). Generating uniform samples of size five gave a diminutive
Kolmogorov-Smirnov (KS) distance of .005, which suggests a very tight fit by the normal
distribution. The percentage of data in the tails (compared to the normal value of 5%) was .048.
The continuous uniform is a member of the beta family of distributions (α = β =1). The beta has
two parameters, α and β, and is defined on the interval [0,1]. If one lets α = β =1/3, then the
result is a bimodal, symmetric distribution, with a valley at its mean. It turns out that the
convergence of this distribution is fast as well. The exact sample sizes needed for each statistic
are summarized in the tables at the end. Other than the normal distribution itself, it is unlikely
that a distribution will converge this quickly, but it could still be helpful to have some of the
statistics of this sampling distribution in order to see what is required for a good fit.
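The uniform figures quoted above can be reproduced with a few lines of R (an illustrative reconstruction rather than the original code):

    # Averages of n = 5 Uniform(0,1) draws: kurtosis, tails, and K-S distance.
    N <- 500000; n <- 5
    xbar <- colMeans(matrix(runif(N * n), nrow = n))
    z <- (xbar - mean(xbar)) / sd(xbar)
    mean(z^4) - 3                    # excess kurtosis, near -1.2 / 5 = -.24
    mean(abs(z) > 1.96)              # tail proportion, reported above as .048
    ks.test(z, "pnorm")$statistic    # K-S distance, reported above as about .005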
Returning to the binomial distribution, it seems as if the rule of thumb that the product of
n and p should be greater than 5 to invoke the Central Limit Theorem is insufficient. Even when
Cornwell 8
that product is ten, it turns out that there is still a large deviation between the two CDFs around
the middle of the distribution. The following graph shows how Kolmogorov-Smirnov distance
decreases as np increases.
[Figure: Kolmogorov-Smirnov distance for the binomial sampling distribution plotted against np, for p = .5 and p = .1. K-S distance (vertical axis, 0 to 0.14) decreases as np increases (horizontal axis, 0 to 35).]
As expected, the quality of approximation (at least insofar as it is measured by K-S distance)
seems to vary with np, so it is appropriate to speak about that product instead of a particular
proportion p. It appears that the K-S distance doesn’t really begin to level off until about np = 15.
And it is not until np = 30 that it finally gets down to the .05 level. This number indicates
problems in the middle of the distribution, but even the tails are not as close as is desired for np =
20. A sample of 20 sampling distributions when n = 40 and p = .5 yielded an upper bound of
.0414 for the proportion of data outside 1.96 standard deviations. This is at the very bottom of
the target area (.04, .06) compared to the normal distribution .05. Once np is greater than 30, the
tail percentage is a more desirable .0519. The skewness of the binomial is zero when p = .5, but
for p = .1 it requires a sample size of 317 to get the skewness below .15. Even this number is
small compared to the value of 1850 needed to get the K-S distance under .02. Part of the
Cornwell 9
problem is that the binomial distribution is discrete, which hurts the K-S statistic, but it still
seems that the requirement that np be bigger than 10 (n =100 for p = .1) is not adequate.
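As a rough check on where a number of this size comes from (my own calculation from the standard binomial moments, not a figure taken from the paper): the skewness of the mean of n Bernoulli(p) trials is

$$\gamma_1 = \frac{1-2p}{\sqrt{np(1-p)}},$$

so requiring $|\gamma_1| < .15$ with p = .1 gives $n > (.8/.15)^2/.09 \approx 316$, in line with the 317 reported above.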
Another common continuous distribution is the exponential. The exponential has one
parameter, λ, which is the inverse of its mean. It turns out that the rate of convergence for this
distribution in averages is independent of that parameter; thus, I elected to use λ =1. When n is as
small as eight, the K-S distance drops below .05. However, at this point, the EDF still indicates a
difference in the tails of the data because the skewness and kurtosis are a little high (.75 for kurtosis and $2^{-1/2} \approx .707$ for skewness). I think that when the sample size is ten the approximation
becomes better. The pictures below show a histogram of the data from the sample, and a graph of
the EDF versus the CDF for the normal distribution. When the sample size is 10, the excess
kurtosis and skewness values are .6 and .63 respectively, and the distribution appears to follow
the normal fairly closely.
The one problem—and this is evident in the histogram more than the EDF—is the positive skew
of this distribution. This distribution shows that it takes a long time to get rid of asymmetry from
the underlying distribution. In this case, it is not until n reaches 64 that the skewness becomes
low enough for an adequate approximation. The next highest n value needed to get in range for
the other statistics is 12 for the kurtosis. Likewise, for a superior approximation a sample size of
178 is needed to get a low enough skewness (n = 45 needed for the K-S distance is next highest).
At this point, the normal approximation for exponential samples is very good.
With all of this discussion of the Central Limit Theorem, it is worth mentioning the times
that it is inapplicable. The theorem requires that the random variables being averaged have finite, positive variance, which means they must have at least two moments. Thus, some distributions with
no moments (heavy tails) should not be approximated well by the normal. For example, the
Cauchy distribution has no mean, and therefore no variance or any other moments. The PDF of the Cauchy distribution is $f(x) = \frac{1}{\pi}\,\frac{b}{(x-m)^2 + b^2}$, where b is the half-width of the distribution at half its maximum and m is the median. Integrating this function confirms that it is a PDF regardless of b, but one runs into problems trying to find the mean. Letting m = 0 (centering the distribution at the y-axis), the antiderivative $\int x f(x)\,dx = \frac{b}{2\pi}\ln(x^2 + b^2) + C$ represents the first moment of the
Cauchy distribution. Since the log function grows unbounded, this distribution has no mean for
any value of b. Thus, for distributions with no mean or variance, one would not expect any kind
of convergence to the normal. In fact, it turns out that the Cauchy distribution looks the same in
averages of any sample size (including n = 1). One can see this by comparing the EDF for 50,000 Cauchy variables with location zero and scale one to the EDF for averages of Cauchy samples of size 50 with the same parameters: the two look the same.
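A minimal R sketch of that comparison (my own illustration, using location 0 and scale 1 as above):

    # Single Cauchy draws versus means of 50 Cauchy draws: the two EDFs
    # coincide because an average of Cauchy variables is again Cauchy.
    N <- 50000
    single <- rcauchy(N, location = 0, scale = 1)
    avg50  <- replicate(N, mean(rcauchy(50, location = 0, scale = 1)))

    plot(ecdf(single), xlim = c(-10, 10),
         main = "Cauchy: single draws vs. means of 50")
    lines(ecdf(avg50), col = "red")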
Another interesting example of this phenomenon is the Student’s t distribution. This
distribution has heavier tails than the normal, and the number of moments it has depends on the
number of degrees of freedom. This distribution could be helpful in determining the minimum
sample size needed for a symmetric, unimodal distribution because its tails get increasingly lighter as the degrees of freedom increase. Considering that the variance of the Student’s t only exists for degrees
of freedom (df = n - 1) greater than 2 and the kurtosis for df > 4, the convergence of the Student’s
t is remarkably fast. The underlying distribution itself—not even the sampling distribution—
becomes quite close to normal with as few as 5 degrees of freedom. In this case, the K-S distance
is already down to .03, and the CDF is very close to the normal. The value outside 1.96 standard
deviations is 5.3%, which is more than the normal (as expected), but still very close. One thing to
note here is the value of the kurtosis. It is $6/(df - 4)$ for the Student’s t, which shows that even
distributions with a high kurtosis (6 in this case) can be good approximations to the normal. I
looked at averages for two cases: first, with degrees of freedom equal to 2.5; and second, with
degrees of freedom equal to 4.1. To get the K-S distance sufficiently small for df = 2.5, it
required a sample size of 320. This shows how the sample size will blow up and that eventually
the Central Limit Theorem won’t apply as degrees of freedom get arbitrarily close to 2. For
degrees of freedom = 4.1, the K-S distance gets very small with samples as small as 5. The only
statistic that is difficult to lose is the kurtosis (n = 200 for superior approximation). The example
of the Student’s t casts some doubt on the importance of the fourth moment as a metric for
comparing distributions.
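As a quick check on the kurtosis figures (my own arithmetic from the formula above, not a result reported in the paper): with df = 4.1 the underlying excess kurtosis is $6/(4.1-4) = 60$, and since excess kurtosis shrinks by a factor of n in averages, it reaches the two cutoffs at

$$\frac{60}{120} = .5, \qquad \frac{60}{200} = .3,$$

consistent with the n = 120 and n = 200 entries in the tables below.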
Below are two tables containing the results of my investigations. The tables are organized
so that each reports the minimum sample size needed to satisfy each criterion. The minimum sample
size needed for an adequate or superior approximation would therefore be the maximum in each
row, which is also reported.
Adequate Approximation

Distribution            |Kurtosis| < .5   |Skewness| < .25   Tail prob. .04 < x < .06   K-S Distance < .05   Maximum
Uniform                        3                  1                     2                        2                3
Beta (α = β = 1/3)             4                  1                     3                        3                4
Exponential                   12                 64                     5                        8               64
Binomial (p = .1)             11                114                    14                      332              332
Binomial (p = .5)              4                  1                    12                       68               68
Student’s t (2.5 df)         N/A                N/A                    13                       20               20
Student’s t (4.1 df)         120                  1                     1                        2              120
Superior Approximation

Distribution            |Kurtosis| < .3   |Skewness| < .15   Tail prob. .04 < x < .06   K-S Distance < .02   Maximum
Uniform                        4                  1                     2                        2                4
Beta (α = β = 1/3)             6                  1                     3                        4                6
Exponential                   20                178                     5                       45              178
Binomial (p = .1)             18                317                    14                     1850             1850
Binomial (p = .5)              7                  1                    12                      390              390
Student’s t (2.5 df)         N/A                N/A                    13                      320              320
Student’s t (4.1 df)         200                  1                     1                        5              200
The first thing I notice in looking at these tables is the magnitude of the numbers. With
the exception of the beta family of distributions, none of these fairly common distributions have
sampling distributions well approximated by the normal for n around 30. Because of the
widespread use of and reliance on hypothesis tests, it is interesting to note that it can take a very large sample size to make such tests reliable. Also, it is clear that skewness in
the underlying distribution is the hardest thing to correct for in the sampling distribution. In the
distributions where kurtosis was high, the alignment of the EDF and the normal CDF came much faster than in those with a strong skew. Another alarming result is the high sample size
requirements for the binomial distribution. Part of the problem in minimizing the K-S distance
here was the fact that the distribution only takes on discrete values. If one were to employ
continuity correction methods to the binomial, I suspect that it would be well approximated by
the normal much faster than in my study. Finally, these studies have made me very suspicious of
kurtosis as a valuable statistic for measuring normality. Even when it is not defined, such as with
the Student’s t with four or fewer degrees of freedom, the other statistics suggest that the
convergence is quite fast. A graph of the PDF for the Student’s t with only five degrees of
freedom shows that it is already close to the normal. Thus, taking averages of those variables will
make the approximation even better with only a small value of n.
So, what are the implications of all this? First, it reinforces the fact that statisticians
should strive to collect as much data as possible when doing inferential tests. Second, it indicates
the importance of methods for doing inference when traditional assumptions of normality fail. In
practice, it seems that the conditions for invoking the Central Limit Theorem are not so trivial
after all. That being said, its pervasiveness in the social and natural sciences makes it one of the
most important theorems in mathematics.