The normal distribution

When we measure living things, we frequently see data with a distribution that looks roughly bell-shaped. Suppose I count the number of ants on each of the 200 roses in my garden. I might get a graph like this. Instead of plotting all the individual points, it is usually more convenient to draw a histogram, or to draw the bell curve that fits the data. These data have a normal distribution, also known as the Gaussian distribution or the bell curve.

What was Gauss doing that led him to this distribution? Gauss worked in astronomy and was interested in distances such as that from the earth to the moon or to the sun. When he and other astronomers repeatedly measured the distance from the earth to the moon, they got slightly different answers. Gauss showed that what we now call the Gaussian distribution was a good description of the variability of these repeated observations. The Gaussian distribution often arises when each observation is affected by many small random errors.

The height of the normal distribution curve for any given value of x is given by the equation

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

where $\mu$ (the Greek letter mu) is the population mean and $\sigma$ (the Greek letter sigma) is the population standard deviation. The height of the curve for a given x is called the probability density. Probability density is defined so that the total area under the curve is 1.0, as required for probability. This equation for the height of the normal curve is called the probability density function. Refer to the supplementary material on probability distributions, or a textbook on probability, for more details.

The normal distribution occurs frequently in statistical analyses. The location of the normal distribution is determined by its mean; its shape (spread) is determined by its standard deviation (or variance).

[Figure: some example normal curves with different means and standard deviations, from Wikipedia]

Z-scores

We'll find it useful to describe an observation in terms of the number of standard deviations it lies from the mean. This measure of distance from the mean is called the z-score. Converting from the original scale (where the units might be seconds or kilograms) to the z-score scale (where the units are standard deviations) is like converting Fahrenheit to Celsius, or inches to millimeters: it is just a change in the units of measurement.

The z-score rescaling is much like rescaling temperature from degrees F to degrees C: C = (F - 32) * 5/9, which can also be written C = (F - 32) / 1.8. To convert from F to C, we subtract a constant (32) and divide by a constant (1.8). Similarly, to convert a measurement from its original scale to a z-score, we subtract a constant (the mean, $\bar{X}$) and divide by a constant (the standard deviation, S).

Here's an example. Suppose that we count the number of flowers on 5 rose plants.

Plant   Number of flowers
1       8
2       8
3       10
4       12
5       12

We can calculate the mean of the sample:

$$\bar{X} = \frac{\sum_{i=1}^{5} X_i}{5} = 10 \text{ flowers}$$

We can calculate the sample variance and sample standard deviation of the sample:

$$S^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1} = 4 \text{ flowers}^2$$

$$S = \sqrt{S^2} = 2 \text{ flowers}$$

Now we are ready to convert from flowers to standard deviations as the measure of distance from the mean, which gives us the z-score. We convert from the original scale to the z-score by subtracting the mean $\bar{X}$ and dividing by the standard deviation S:

$$z_i = \frac{X_i - \bar{X}}{S}$$

With a mean of 10 flowers and a sample standard deviation of 2 flowers, the z-score for plant i is (Xi - 10) / 2. We can calculate the z-score of the number of flowers on each plant.

Plant   Number of flowers   z-score (standardized)
1       8                   -1.00
2       8                   -1.00
3       10                   0.00
4       12                   1.00
5       12                   1.00
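The same calculation is easy to check in code. Here is a minimal sketch in plain Python (the variable names are ours, not part of the notes) that reproduces the mean, standard deviation, and z-scores for the 5-plant example:

```python
# A minimal sketch (plain Python, no libraries) reproducing the worked
# example above.
flowers = [8, 8, 10, 12, 12]          # number of flowers on each of 5 plants

n = len(flowers)
mean = sum(flowers) / n               # 10.0 flowers

# Sample variance uses the N - 1 denominator, as in the formula above.
variance = sum((x - mean) ** 2 for x in flowers) / (n - 1)   # 4.0 flowers^2
sd = variance ** 0.5                  # 2.0 flowers

# z-score: subtract the mean, divide by the standard deviation.
z_scores = [(x - mean) / sd for x in flowers]
print(z_scores)                       # [-1.0, -1.0, 0.0, 1.0, 1.0]
```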
Standard normal distribution

For many statistical analyses, it is convenient to use a normal curve that has mean = 0 and standard deviation = 1. This is called the standard normal distribution. On the original scale, the units on the x-axis can be whatever you measured (such as kilograms, inches, or blood pressure). For the standard normal distribution, we label the x-axis in units of standard deviations from the mean.

For any probability distribution, the sum of the probabilities of all possible outcomes is always 1.0. For example, when we flip a fair coin, we say that the probability of heads is 0.5 and the probability of tails is 0.5. The sum of the two probabilities (which cover all the possible outcomes) is 0.5 + 0.5 = 1.0. The normal distribution is a probability distribution, so the area under the curve must be 1.0.

The normal distribution has some properties that are useful when we do t-tests, ANOVA, regression, and other statistical tests:

- 68% of all observations fall within 1 standard deviation of the mean
- 95% of all observations fall within 2 standard deviations of the mean (within 1.96 standard deviations, to be exact)
- 99.7% of all observations fall within 3 standard deviations of the mean

In the normal distribution, 95% of the total area under the curve falls within plus or minus 2 standard deviations of the mean. That is, 95% of the area (0.95 of the probability) falls in the range mean ± 2 SD (actually 1.96, but 2 is easy to remember). The remaining 5% of the total area falls beyond 2 standard deviations from the mean. That is, 5% of the area (0.05 of the probability) falls in the tails of the normal distribution beyond mean ± 2 SD. Said another way, the area under the tails of the standard normal curve, more than 1.96 standard deviations from the mean, is approximately 5%, or 0.05. This 0.05 area is the source of the probability threshold of 0.05 (also known as alpha = 0.05) that we often use in statistical tests.

Transforming data to normality

Many statistical methods, such as t-tests, analysis of variance, and regression, assume that the data are normally distributed and have equal variance within each treatment group or level. Even when this is not strictly required, most analyses work better with data that are normally distributed and that have equal variance within each treatment group or level. If you have outliers and/or a non-normal distribution, you may be able to apply a transform to the data to make the distribution more normal and to reduce the influence of the outliers.

Here is an example of a data distribution typical of biological measurements, such as the amount of protein in the blood. The distribution of the original values does not look at all normal. The log transform of the original values looks much more like a normal distribution. If the log transform is not effective in producing a more normal distribution, your software may have alternatives such as Box-Cox transforms or Johnson transforms. Another alternative is to use a non-parametric test, such as the Wilcoxon rank-sum test, described later.
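To see this kind of transform in action, here is a minimal sketch in Python. The data are simulated (lognormal values standing in for a skewed measurement such as blood protein), and numpy and scipy are assumed to be available:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated right-skewed data standing in for a measurement such as
# blood protein concentration. Lognormal data are exactly the case
# where a log transform restores normality.
raw = rng.lognormal(mean=1.0, sigma=0.8, size=200)
logged = np.log(raw)

# Skewness near 0 suggests a roughly symmetric, more normal shape.
print("skewness before:", stats.skew(raw))
print("skewness after: ", stats.skew(logged))

# Shapiro-Wilk test of normality: a large p-value means no evidence
# against normality.
stat_raw, p_raw = stats.shapiro(raw)
stat_log, p_log = stats.shapiro(logged)
print("Shapiro-Wilk p, raw:   ", p_raw)
print("Shapiro-Wilk p, logged:", p_log)
```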
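Finally, the 68/95/99.7 figures and the 0.05 tail area quoted in the standard normal distribution section above can be checked numerically. A short sketch, again assuming scipy is available:

```python
from scipy import stats

z = stats.norm()          # standard normal: mean 0, standard deviation 1

# Area under the curve within k standard deviations of the mean.
for k in (1, 2, 3):
    area = z.cdf(k) - z.cdf(-k)
    print(f"within {k} SD: {area:.4f}")
# within 1 SD: 0.6827, within 2 SD: 0.9545, within 3 SD: 0.9973

# Two-tailed area beyond 1.96 standard deviations: the source of alpha = 0.05.
print(f"beyond 1.96 SD: {2 * z.cdf(-1.96):.4f}")   # 0.0500
```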