Paul Cornwell
March 25, 2011
MAT 5900: Monte Carlo Simulations
Professors Frey and Volpert

Necessary Sample Size for Good Central Limit Theorem Approximation

In conjunction with the Law of Large Numbers, the Central Limit Theorem is the foundation for the majority of statistical practices today. The term Central Limit Theorem (CLT) actually refers to a series of theorems, but in practice it is usually condensed as follows: averages of independent, identically distributed random variables (with finite, positive variance $\sigma^2$) tend toward the normal (Gaussian) distribution as the sample size n gets large. Specifically, the normal distribution to which the sampling distribution tends has mean $\mu_{\bar{x}} = \mu$ and variance $\sigma_{\bar{x}}^2 = \sigma^2/n$. The utility of this theorem is grounded in the idea of averages. Even if the underlying distribution of some variable is finicky, it can still be analyzed through the use of averages. For this reason, the CLT is invoked in many situations involving predictions or probabilities. Because of the "limit" part of the theorem, however, the normal distribution is only ever an approximation to the sampling distribution when n is finite. Exactly how large n has to be for the CLT to provide a good approximation is an elusive question, and it turns out that Monte Carlo simulations are as good a tool as any for making such determinations.

The generalization of the CLT commonly taught in statistics classes today is a theorem of Laplace from 1810. In discussing this result, Hans Fischer says the normal approximation is appropriate "under conditions that, in practice, were always fulfilled" (A History of the Central Limit Theorem, 353). As Fischer indicates, there is not a lot of concern surrounding the conditions (i.e., the necessary sample size) needed to appeal to the CLT. However, it is still important to know at what point it becomes acceptable to do so. As sample size increases, the normal approximation becomes better and better, so at the very least having a target n in mind would give an indication of the quality of an approximation, even if it is impossible to alter the design of an experiment or study by increasing the sample size. One problem here (which is probably responsible for the lack of guidelines on the topic) is the tremendous dependence of the sampling distribution on the underlying distribution. Thus, instead of getting too specific, the word "large" is employed to describe the requisite sample size. My goal is to come up with a more specific procedure for determining how large a sample size is needed to get a good approximation from the Central Limit Theorem.

The problem of diverse underlying distributions is not the only obstacle in this endeavor. Prior to giving any kind of numerical answer to the question at hand, it is necessary to define exactly what it means to be a "good" approximation. Practically speaking, the sampling distribution of a random variable is never exactly normal. For this reason, using conventional "tests for normality" is a task doomed to failure: given enough data, they will always reject the hypothesis that the sampling distribution is normal, because it is not. However, there are a few techniques to be salvaged from this field that will be of use. Instead of doing conventional tests, the best way to approach this problem is to consider the qualities of the Gaussian distribution. This way, there is at least some basis for comparing the sampling distribution to the normal.
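A quick simulation illustrates the mean and variance given above (the exponential underlying distribution, the seed, and the simulation sizes in this sketch are arbitrary choices made only for the example):

```r
# Minimal sketch: sample means of an exponential(1) variable should have mean
# near mu and standard deviation near sigma / sqrt(n), as the CLT statement asserts.
set.seed(1)          # arbitrary seed, for reproducibility only
N <- 100000          # number of simulated sample means
n <- 30              # size of each sample being averaged

xbar <- replicate(N, mean(rexp(n, rate = 1)))  # exponential(1): mu = 1, sigma = 1

mean(xbar)           # should be close to mu = 1
sd(xbar)             # should be close to sigma / sqrt(n) = 1 / sqrt(30), about 0.183
```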
Once these criteria are in place, the next task is to determine how closely a distribution must resemble the normal on each of them, which will then determine whether or not the sample size is large enough for a good approximation. Finally, I will have to devise a way of communicating these results. It is not necessarily practical to have a separate number for every possible distribution, for two reasons: first, there are far too many distributions for this information to be helpful; second, in practice it is common not to know the underlying distribution of a random variable, only certain features of it. Instead, it would be helpful if I could identify some common features of these distributions that affect the speed of their convergence.

The normal distribution can be observed to have the following properties: it is unimodal; it is bell-shaped; it is symmetric; and it is continuous. Thus, in order for a sampling distribution to be approximated by the normal, it should certainly exhibit these traits. Each of these can be measured quantitatively using different metrics. Skewness, for example, is defined as the third standardized central moment. For any continuous or discrete probability density function f(x), this is given by the equation

$$\gamma_1 = E\left[\left(\frac{x - \mu}{\sigma}\right)^{3}\right].$$

As the name suggests, this moment measures the skewness of a distribution, or more specifically its asymmetry. While a skewness of zero does not necessarily imply perfect symmetry, perfect symmetry does imply a skewness of zero. Thus, the normal distribution has skewness equal to zero. The "peakedness" of a distribution is measured by the fourth standardized moment,

$$\gamma_2 = E\left[\left(\frac{x - \mu}{\sigma}\right)^{4}\right]$$

(Testing for Normality, 41). This statistic is called kurtosis, and it measures, depending on who you ask, how peaked or flat a distribution is, or how heavy its tails are. The kurtosis of the normal distribution is equal to three, so it is common practice to subtract three from the calculated fourth moment in order to quantify peakedness relative to the normal. This modified statistic is known as "excess kurtosis" (Wolfram MathWorld, "Skewness"), and it is the statistic that I observe in my simulations.

In theory, there is nothing special about the first four moments (these two, along with $\mu_1$ = mean and $\mu_2$ = variance), and it turns out that these numbers are insufficient for describing a distribution in full. However, they have the advantage of being fairly easy to compute and of behaving well for averages (a result from Dr. Frey: skewness decreases by a factor of $\sqrt{n}$ and excess kurtosis by a factor of n). Thus, if used in combination with other measures, they can be useful tools in determining the quality of an approximation.

The various moments of a distribution can all be calculated analytically, without the aid of simulation. Other measures, however, require some kind of simulation in order to work. R is very useful in this case, because it can easily generate variables from a large number of built-in distributions or homemade PDFs. There are two "size" parameters at play in my code. The first (N) is the number of observations in the sampling distribution. This number should be large so that one has a thorough idea of what the sampling distribution looks like; unless otherwise noted, I generated 500,000 sample means in every simulation. The second (n) is the number of observations that are averaged to make each of those sample means. It is the latter that should affect the shape of the sampling distribution, and it is therefore the subject of this investigation.
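In outline, the simulation code is quite short. The following sketch gives the idea (the exponential underlying distribution, the seed, and n = 10 are arbitrary choices for illustration; skewness and excess kurtosis are computed directly from the moment definitions above):

```r
# Sketch of the basic simulation setup (illustrative only).
set.seed(42)                       # arbitrary seed for reproducibility
N <- 500000                        # observations in the sampling distribution
n <- 10                            # observations averaged per sample mean

# Build the sampling distribution; rexp() stands in for any underlying
# distribution available in R or defined by hand.
means <- numeric(N)
for (i in 1:N) {
  means[i] <- mean(rexp(n, rate = 1))
}

# Sample skewness and excess kurtosis, computed from the standardized values.
z <- (means - mean(means)) / sd(means)
c(skewness = mean(z^3), excess.kurtosis = mean(z^4) - 3)
```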
A sampling distribution for any underlying distribution can be generated fairly easily and quickly with a "for" loop, and that is all that is needed to begin a comparison to the normal. One feature of a sampling distribution that cannot be measured purely analytically is the proportion of observations in the tails. We know that 95% of the data falls within approximately 1.96 standard deviations of the mean for any normal distribution. Thus, if one standardizes the sample based on its own mean and standard deviation, one can compare the size of the tails by calculating the proportion of values whose absolute value is greater than 1.96. This calculation takes practically no time, even for a large number of values N. The same calculation can be tweaked to give the percentage of the data that falls within any number of standard deviations of the mean. The so-called empirical rule states that, for the normal distribution, 68.27% of observations fall within one standard deviation of the mean, 95.44% within two standard deviations, and 99.73% within three standard deviations (A First Course in Statistical Methods, 81). These proportions can be calculated through simulation just like the tail probabilities, and they can be easily compared to the normal values.

Another valuable tool for comparison in R is the empirical cumulative distribution function (EDF). A cumulative distribution function (CDF) gives the probability that a random variable takes a value less than or equal to x. Thus, the limit of the CDF at negative infinity is zero, and at positive infinity it is one. Given any PDF (or even just a sample), one can construct a CDF empirically by plotting the sample observations as the independent variable against the proportion of the sample less than or equal to each observation as the dependent variable. This can be done easily in R by sorting the vector containing the sample and plotting it against the cumulative probabilities. For an easy contrast with the normal, one can standardize the sampling distribution based on its mean and standard deviation and compare it to the standard normal CDF. The better the approximation by the normal, the closer these two graphs should be. In practice, it turns out that the largest deviation between two CDFs usually occurs in the middle, which makes sense because the extreme values (the limits) must be the same for all CDFs.

Although even an "eye test" is valuable with these graphs, there is a way to quantify closeness between CDFs: the Kolmogorov-Smirnov distance, defined as the maximum distance between two CDFs. Although a test for normality based on this statistic is unpopular because of its poor power and its inapplicability to discrete distributions, it is still a good measure of the quality of an approximation, and it should be relatively consistent from simulation to simulation.
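A minimal sketch of these three comparisons might look like the following (again with an arbitrary seed and an exponential example; only the K-S statistic, not the test's p-value, is of interest here):

```r
# Illustrative sketch of the three comparisons described above.
set.seed(7)                              # arbitrary seed
N <- 500000; n <- 10
means <- replicate(N, mean(rexp(n, rate = 1)))
z <- (means - mean(means)) / sd(means)   # standardize the sampling distribution

# 1. Tail proportion: should be near .05 if the normal approximation is good.
mean(abs(z) > 1.96)

# 2. EDF of the standardized sample plotted against the standard normal CDF.
zs <- sort(z)
plot(zs, (1:N) / N, type = "l", xlab = "z", ylab = "cumulative probability")
curve(pnorm(x), add = TRUE, lty = 2)     # standard normal CDF for comparison

# 3. Kolmogorov-Smirnov distance: the maximum gap between the two CDFs.
#    Only the statistic is used; the test itself is not of interest here.
ks.test(z, "pnorm")$statistic
```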
For the reasons discussed above, there are few guidelines for the minimum sample size required to invoke the CLT in practice. The general rule of thumb one sees in introductory statistics courses calls for n ≥ 30, which is not a very helpful guide. In reality, depending on the underlying distribution, this recommendation can be either overkill or entirely inadequate. A slight improvement exists for the binomial problem, where the recommendation is a function of the parameter p, the probability of a "success" for a binary variable. Generally, one will see recommendations that n be large enough that both np and nq are at least five, where q is equal to (1 − p) (A First Course in Statistical Methods, 167). The reasoning for this is that the skewness of the binomial distribution increases as p moves away from .5, and in general, the closer a distribution is to normal, the quicker its sampling distribution approaches normal. Others suggest that the values np and nq be greater than ten instead of five, but the idea is the same. Besides the binomial distribution, however, there is not much guidance beyond n having to be "large."

Coming up with a set of criteria for these statistics is seemingly arbitrary. However, it must be done in order to have some kind of objective scale for identifying a good approximation to the normal distribution. One way to make this process easier is to have two different thresholds: one for a superior approximation and the other for an adequate one. By looking at both histograms of the samples I generated and comparisons of the empirical distribution functions to the normal cumulative distribution function, I decided that the following standards are requisite for an "adequate" approximation to the normal: excess kurtosis less than .5 in magnitude; skewness less than .25 in magnitude; the tail probability for the nominal 5% between .04 and .06; and Kolmogorov-Smirnov distance less than .05. For a superior approximation, the following are required: excess kurtosis less than .3 in magnitude; skewness less than .15 in magnitude; the tail probability for the nominal 5% between .04 and .06; and K-S distance less than .02. One could argue that these requirements are fairly conservative, but sometimes there is very little differentiating two distributions, especially considering the traits that all probability density functions must share by virtue of being probability distributions. What follows is the application of these measures to various underlying distributions.

It turns out that one of the fastest distributions to converge to normal in averages is the continuous uniform. The only parameters of the uniform distribution are the endpoints of its interval. The third and fourth standardized moments are independent of the length of the interval, and experiments with changing the length suggest that doing so does not affect the rate of convergence; thus, for all of my simulations I used the interval [0, 1]. A distribution of 500,000 sample means of size three already seems approximately normal when one looks at a histogram of the data. Because of its symmetry, the uniform distribution has no skewness, so one would expect samples to behave the same way. The excess kurtosis, however, is equal to -1.2 for the continuous uniform distribution; thus, for a sample of size five it is equal to -.24 (excess kurtosis decreases by a factor of n). Generating uniform samples of size five gave a diminutive Kolmogorov-Smirnov (K-S) distance of .005, which suggests a very tight fit by the normal distribution. The percentage of data in the tails (compared to the normal value of 5%) was .048.

The continuous uniform is a member of the beta family of distributions (α = β = 1). The beta distribution has two parameters, α and β, and is defined on the interval [0, 1]. If one lets α = β = 1/3, the result is a bimodal, symmetric distribution with a valley at its mean. It turns out that the convergence of this distribution is fast as well.
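Since these thresholds amount to a checklist, they are easy to encode. The helper below is a rough sketch (the function name and the choice of Beta(1/3, 1/3) with n = 4 are arbitrary illustrations) of how a vector of simulated sample means can be scored against the adequate criteria:

```r
# Illustrative helper applying the "adequate" thresholds defined above
# to a vector of simulated sample means.
adequate <- function(means) {
  z <- (means - mean(means)) / sd(means)
  stats <- c(excess.kurtosis = mean(z^4) - 3,
             skewness        = mean(z^3),
             tail.prob       = mean(abs(z) > 1.96),
             ks.distance     = unname(ks.test(z, "pnorm")$statistic))
  ok <- abs(stats[["excess.kurtosis"]]) < .5 &&
        abs(stats[["skewness"]])        < .25 &&
        stats[["tail.prob"]] > .04 && stats[["tail.prob"]] < .06 &&
        stats[["ks.distance"]] < .05
  list(statistics = stats, adequate = ok)
}

# Example: sample means of size 4 from the bimodal Beta(1/3, 1/3) distribution.
set.seed(3)                                  # arbitrary seed
means <- replicate(500000, mean(rbeta(4, shape1 = 1/3, shape2 = 1/3)))
adequate(means)
```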
The exact sample sizes needed for each statistic are summarized in the tables at the end. Other than the normal distribution itself, it is unlikely that a distribution will converge this quickly, but it is still helpful to have the statistics of this sampling distribution in order to see what is required for a good fit.

Returning to the binomial distribution, it seems as if the rule of thumb that the product of n and p should be greater than 5 to invoke the Central Limit Theorem is insufficient. Even when that product is ten, it turns out that there is still a large deviation between the two CDFs around the middle of the distribution. The following graph shows how the Kolmogorov-Smirnov distance decreases as np increases.

[Figure: "K-S Distance for Binomial versus n*p"; K-S distance on the vertical axis, np on the horizontal axis, with separate series for p = .5 and p = .1.]

As expected, the quality of approximation (at least insofar as it is measured by K-S distance) seems to vary with np, so it is appropriate to speak about that product rather than a particular proportion p. The K-S distance does not really begin to level off until about np = 15, and it is not until np = 30 that it finally gets down to the .05 level. The K-S distance indicates problems in the middle of the distribution, but even the tails are not as close as desired for np = 20. A set of 20 sampling distributions with n = 40 and p = .5 yielded an upper bound of .0414 for the proportion of data outside 1.96 standard deviations, which is at the very bottom of the target range (.04, .06) around the normal value of .05. Once np is greater than 30, the tail percentage is a more desirable .0519. The skewness of the binomial is zero when p = .5, but for p = .1 it requires a sample size of 317 to get the skewness below .15. Even this number is small compared to the 1850 needed to get the K-S distance under .02. Part of the problem is that the binomial distribution is discrete, which hurts the K-S statistic, but it still seems that the requirement that np be bigger than 10 (n = 100 for p = .1) is not adequate.

Another common continuous distribution is the exponential. The exponential has one parameter, λ, which is the inverse of its mean. It turns out that the rate of convergence for this distribution in averages is independent of that parameter, so I elected to use λ = 1. When n is as small as eight, the K-S distance drops below .05. However, at this point the EDF still indicates a difference in the tails of the data, because the skewness and excess kurtosis are a little high (.75 for excess kurtosis and $2^{-1/2} \approx .707$ for skewness). I think that when the sample size reaches ten the approximation becomes better. The pictures below show a histogram of the data from the sample and a graph of the EDF versus the CDF for the normal distribution.

[Figure: histogram of the sampling distribution of exponential means (n = 10) and its EDF plotted against the normal CDF.]

When the sample size is 10, the excess kurtosis and skewness values are .6 and .63 respectively, and the distribution appears to follow the normal fairly closely. The one problem, which is evident in the histogram more than in the EDF, is the positive skew of this distribution. This example shows that it takes a long time to get rid of asymmetry inherited from the underlying distribution. In this case, it is not until n reaches 64 that the skewness becomes low enough for an adequate approximation. The next highest n value needed to get in range for the other statistics is 12, for the kurtosis. Likewise, for a superior approximation, a sample size of 178 is needed to get a low enough skewness (the next highest requirement is n = 45 for the K-S distance). At this point, the normal approximation for exponential samples is very good.
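The slow decay of the skewness is easy to watch directly. A rough sketch (with a coarse grid of n values and a smaller N than in my main simulations, purely to keep the search quick) tracks the sample skewness of exponential means as n grows:

```r
# Illustrative search for the smallest n at which simulated exponential sample
# means satisfy the skewness criterion for an adequate approximation.
set.seed(11)                         # arbitrary seed
N <- 100000                          # smaller than 500,000 to keep the search quick
sample.skewness <- function(n) {
  means <- replicate(N, mean(rexp(n, rate = 1)))
  z <- (means - mean(means)) / sd(means)
  mean(z^3)
}
for (n in c(8, 16, 32, 64, 128)) {   # coarse grid of sample sizes
  cat(n, sample.skewness(n), "\n")
}
```

Since the skewness of the exponential distribution is 2, the skewness of the sampling distribution should fall like $2/\sqrt{n}$, which is roughly what the output shows.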
With all of this discussion of the Central Limit Theorem, it is worth mentioning the cases in which it is inapplicable. The theorem requires that the random variables being averaged have finite, positive variance, which means they must have at least two finite moments. Thus, distributions with heavy tails and no finite moments should not be expected to be well approximated by the normal. For example, the Cauchy distribution has no mean, and therefore no variance or any higher moments. The PDF of the Cauchy distribution is

$$f(x) = \frac{1}{\pi}\,\frac{b}{(x - m)^2 + b^2},$$

where b is the half-width of the distribution at half its maximum and m is the median. Integrating this function confirms that it is a PDF regardless of b, but one runs into problems trying to find the mean. Letting m = 0 (centering the distribution at the y-axis), the antiderivative

$$\int x f(x)\,dx = \frac{b}{2\pi}\ln\!\left(x^2 + b^2\right) + C$$

is what would define the first moment of the Cauchy distribution. Since the logarithm grows without bound, the integral diverges and this distribution has no mean for any value of b. Thus, for distributions with no mean or variance, one would not expect any kind of convergence to the normal. In fact, it turns out that the Cauchy distribution looks the same in averages of any sample size (including n = 1). One can see below how the EDF of 50,000 Cauchy variables with location zero and scale one and the EDF of averages of 50 such variables look the same:

[Figure: EDFs of individual Cauchy observations and of averages of 50 Cauchy observations, which are visually indistinguishable.]

Another interesting example of this phenomenon is the Student's t distribution. This distribution has heavier tails than the normal, and the number of finite moments it has depends on its degrees of freedom. It could be helpful in determining the minimum sample size needed for a symmetric, unimodal distribution, because its tails get increasingly lighter as the degrees of freedom grow. Considering that the variance of the Student's t only exists for degrees of freedom (df = n − 1) greater than 2 and the kurtosis only for df > 4, the convergence of the Student's t is remarkably fast. The underlying distribution itself, before any averaging, becomes quite close to normal with as few as 5 degrees of freedom. In this case, the K-S distance is already down to .03, and the CDF is very close to the normal. The proportion outside 1.96 standard deviations is 5.3%, which is more than for the normal (as expected), but still very close. One thing to note here is the value of the kurtosis: the excess kurtosis of the Student's t is $\frac{6}{df - 4}$, which shows that even a distribution with a high kurtosis (6 in this case, at df = 5) can be well approximated by the normal.

I looked at averages for two cases: first, with degrees of freedom equal to 2.5; and second, with degrees of freedom equal to 4.1. To get the K-S distance sufficiently small for df = 2.5 required a sample size of 320. This shows how the required sample size blows up as the degrees of freedom get arbitrarily close to 2, where the variance ceases to exist and the Central Limit Theorem no longer applies. For df = 4.1, the K-S distance gets very small with samples as small as 5. The only statistic that is slow to come into range is the kurtosis (n = 200 for a superior approximation). The example of the Student's t casts some doubt on the importance of the fourth moment as a metric for comparing distributions.
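A small sketch makes this contrast concrete (illustrative only; the sample size n = 5 follows the discussion above): at this sample size the K-S distance and tail proportion for means of the t with 4.1 degrees of freedom already look normal, while the estimated excess kurtosis remains far from the threshold.

```r
# Illustrative look at sample means of the Student's t with 4.1 degrees of freedom:
# the K-S distance and tail proportion are already good at n = 5, while the
# excess kurtosis remains far from the normal value of zero.
set.seed(9)                          # arbitrary seed
N <- 500000; n <- 5
means <- replicate(N, mean(rt(n, df = 4.1)))
z <- (means - mean(means)) / sd(means)

c(ks.distance     = unname(ks.test(z, "pnorm")$statistic),
  tail.prob       = mean(abs(z) > 1.96),
  excess.kurtosis = mean(z^4) - 3)   # noisy for heavy tails, but clearly large
```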
Below are two tables containing the results of my investigations. Each cell reports the minimum sample size needed to satisfy the corresponding criterion. The minimum sample size needed for an adequate or superior approximation is therefore the maximum in each row, which is also reported.

Adequate Approximation

| Distribution | ∣Kurtosis∣ < .5 | ∣Skewness∣ < .25 | Tail prob. .04 < x < .06 | K-S distance < .05 | Maximum |
| --- | --- | --- | --- | --- | --- |
| Uniform | 3 | 1 | 2 | 2 | 3 |
| Beta (α = β = 1/3) | 4 | 1 | 3 | 3 | 4 |
| Exponential | 12 | 64 | 5 | 8 | 64 |
| Binomial (p = .1) | 11 | 114 | 14 | 332 | 332 |
| Binomial (p = .5) | 4 | 1 | 12 | 68 | 68 |
| Student's t, 2.5 df | N/A | N/A | 13 | 20 | 20 |
| Student's t, 4.1 df | 120 | 1 | 1 | 2 | 120 |

Superior Approximation

| Distribution | ∣Kurtosis∣ < .3 | ∣Skewness∣ < .15 | Tail prob. .04 < x < .06 | K-S distance < .02 | Maximum |
| --- | --- | --- | --- | --- | --- |
| Uniform | 4 | 1 | 2 | 2 | 4 |
| Beta (α = β = 1/3) | 6 | 1 | 3 | 4 | 6 |
| Exponential | 20 | 178 | 5 | 45 | 178 |
| Binomial (p = .1) | 18 | 317 | 14 | 1850 | 1850 |
| Binomial (p = .5) | 7 | 1 | 12 | 390 | 390 |
| Student's t, 2.5 df | N/A | N/A | 13 | 320 | 320 |
| Student's t, 4.1 df | 200 | 1 | 1 | 5 | 200 |

The first thing I notice in looking at these tables is the magnitude of the numbers. With the exception of the beta family of distributions, none of these fairly common distributions has a sampling distribution that is well approximated by the normal for n around 30. Because of the widespread use of and reliance on hypothesis tests, it is interesting to note that it can take a very large sample size to make such tests reliable. Also, it is clear that skewness in the underlying distribution is the hardest thing to correct for in the sampling distribution. In the distributions where kurtosis was high, the alignment of the EDF and the normal CDF happened much faster than in those with a strong skew. Another alarming result is the high sample size requirement for the binomial distribution. Part of the problem in minimizing the K-S distance here is that the distribution only takes on discrete values; if one were to apply continuity-correction methods to the binomial, I suspect that it would be well approximated by the normal much sooner than in my study. Finally, these studies have made me very suspicious of kurtosis as a valuable statistic for measuring normality. Even when it is not defined, as with the Student's t with four or fewer degrees of freedom, the other statistics suggest that the convergence is quite fast. A graph of the PDF of the Student's t with only five degrees of freedom shows that it is already close to the normal, so taking averages of those variables makes the approximation even better with only a small value of n.

So, what are the implications of all this? First, it reinforces the fact that statisticians should strive to collect as much data as possible when doing inferential tests. Second, it indicates the importance of methods for doing inference when the traditional assumptions of normality fail. In practice, it seems that the conditions for invoking the Central Limit Theorem are not so trivial after all. That being said, its pervasiveness in the social and natural sciences makes it one of the most important theorems in mathematics.