2.1 Normal distribution and z-scores

The normal distribution
When we measure living things, we frequently see data with a distribution that looks
similar to a bell shape. Suppose I count the number of ants on each of the 200 roses in my
garden. I might get a graph like this.
Instead of plotting all the individual points, it is usually convenient to draw a histogram,
or to draw the bell curve that fits the data.
These data have a normal distribution, also known as the Gaussian distribution or the bell
curve. What was Gauss doing that led him to this distribution? Gauss was an astronomer,
and was interested in the distance from the earth to the moon or to the sun. When he and
other astronomers repeatedly measured the distance from the earth to the moon, they got
slightly different answers. Gauss showed that what we now call the Gaussian distribution
was a good description of the variability of these repeated observations. The Gaussian
distribution often occurs when each observation is affected by many small random errors.
The height of the normal distribution curve for any given value of x is given by the
equation:

f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-(x - \mu)^2 / (2\sigma^2)}

where:
Population mean = μ (Greek letter mu)
Population standard deviation = σ (Greek letter sigma)
The height of the curve for a given x is called the probability density. Probability density
is defined so that the area under the curve is 1.0, as required for probability. This
equation for the height of the normal curve is called the probability density function.
Refer to the supplementary material on probability distributions or a textbook on
probability for more details.
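To make the density function concrete, here is a minimal sketch in Python (the language choice and the example mean and standard deviation are ours, for illustration). It evaluates the formula above and checks numerically that the area under the curve is about 1.0:

import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # Height of the normal curve at x:
    # (1 / (sigma * sqrt(2*pi))) * exp(-(x - mu)^2 / (2 * sigma^2))
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# The area under the whole curve should be 1.0; approximate it with a
# Riemann sum over mean +/- 6 standard deviations.
mu, sigma = 10.0, 2.0
dx = 0.001
steps = int(6 * sigma / dx)
area = sum(normal_pdf(mu + i * dx, mu, sigma) * dx for i in range(-steps, steps))
print(round(area, 4))  # approximately 1.0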

The normal distribution occurs frequently in statistical analyses. The location of the
normal distribution is determined by its mean. The shape (spread) of the normal
distribution is determined by its standard deviation (or variance).
[Figure: example normal curves with different means and standard deviations (from Wikipedia)]
Z-scores
We'll find it useful to describe an observation in terms of the number of standard
deviations it is from the mean. This measure of distance from the mean is called the
z-score. Converting from the original scale (where the units might be seconds or
kilograms) to the z-score (where the units are standard deviations) is like converting
from Fahrenheit to Celsius, or from inches to millimeters. It is just a change of the
units of measurement.
The z-score rescaling is much like re-scaling temperature from degrees F to degrees C:
C = (F - 32) *5/9.
The conversion can also be written as
C = (F – 32) / 1.8
To convert from F to C, we subtract a constant (32) and divide by a constant (1.8).
Similarly, to convert a measurement from its original scale to a z-score, we subtract a
constant (the sample mean, X̄) and divide by a constant (the sample standard deviation, S).
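As a minimal sketch of the parallel in Python (the numbers are illustrative):

def f_to_c(f):
    # Temperature: subtract a constant (32), divide by a constant (1.8).
    return (f - 32) / 1.8

def z_score(x, mean, sd):
    # Same pattern: subtract a constant (the mean), divide by a constant (the SD).
    return (x - mean) / sd

print(f_to_c(212))                 # 100.0
print(z_score(12, mean=10, sd=2))  # 1.0

Here is a worked example.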
Suppose that we collect the number of flowers on 5 rose plants.
Plant   Number of flowers   z-score (standardized)
1       8                   -1.00
2       8                   -1.00
3       10                   0.00
4       12                   1.00
5       12                   1.00
We can calculate the mean of the sample:

Mean = \bar{X} = \frac{\sum_{i=1}^{5} X_i}{5} = 10 flowers
We can calculate the sample variance and sample standard deviation of the sample.
Sample variance = S^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1} = 4 flowers^2

Sample standard deviation = S = \sqrt{S^2} = 2 flowers
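As a quick check, these summary statistics can be reproduced with Python's statistics module, whose variance() and stdev() use the same N − 1 divisor:

import statistics

flowers = [8, 8, 10, 12, 12]
print(statistics.mean(flowers))      # 10 (flowers)
print(statistics.variance(flowers))  # 4 (flowers^2), sample variance with N - 1 divisor
print(statistics.stdev(flowers))     # 2.0 (flowers)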
Now we are ready to use standard deviations, rather than flowers, as the measure of
distance from the mean, which gives us the z-score. We convert from the original scale to
the z-score by subtracting the mean, X̄, and dividing by the standard deviation, S:

z-score = (X_i − X̄) / S
We can calculate the z-score of the number of flowers on each plant.
Mean of 5 plants = 10 flowers
Sample standard deviation = 2 flowers.
z-score = (X_i − 10) / 2
Plant   Number of flowers   z-score (standardized)
1       8                   -1.00
2       8                   -1.00
3       10                   0.00
4       12                   1.00
5       12                   1.00
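The z-score column can be reproduced with a short Python sketch, using the mean and standard deviation computed above:

mean, sd = 10, 2
flowers = [8, 8, 10, 12, 12]
z_scores = [(x - mean) / sd for x in flowers]
print(z_scores)  # [-1.0, -1.0, 0.0, 1.0, 1.0]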
Standard normal distribution
For many statistical analyses, it is convenient to use a normal curve that has mean = 0
and standard deviation = 1. This is called the standard normal distribution.
The units on the x-axis of a normal curve can be whatever you measured (such as
kilograms, inches, or blood pressure). For the standard normal distribution, the x-axis
is labeled in units of standard deviations from the mean.
The sum of the probabilities of all possible outcomes, for any probability distribution, is
always 1.0. For example, when we flip a fair coin, we say that the probability of heads is
0.5, and the probability of tails is 0.5. The sum of the two probabilities (which are all the
possible outcomes) is 0.5 + 0.5 = 1.0. The normal distribution is a probability
distribution, so the area under the curve must be 1.0.
The normal distribution has some properties that are useful when we do t-tests, ANOVA,
regression, and other statistical tests:
• 68% of all observations fall within 1 standard deviation of the mean.
• 95% of all observations fall within 2 standard deviations of the mean (within 1.96
  standard deviations, to be exact).
• 99.7% of all observations fall within 3 standard deviations of the mean.
In the normal distribution, 95% of the total area under the curve falls within plus or
minus 2 standard deviations of the mean. That is, 95% of the area (0.95 of the
probability) falls in the range mean ± 2 SD (actually 1.96, but 2 is easy to remember).
5% of the total area under the curve falls beyond 2 standard deviations from the mean.
That is, 5% of the area (0.05 of the probability) falls in the tails of the normal
distribution beyond mean ± 2 SD.
Said another way, the area under the tails of the standard normal curve, more than 1.96
standard deviations from the mean, is approximately 5%, or 0.05. This area of 0.05 is the
source of the probability threshold 0.05 (also known as alpha = 0.05) that we often use
in statistical tests.
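These areas can be computed from the error function, which Python's math module provides; for a standard normal Z, P(−k ≤ Z ≤ k) = erf(k / √2). A minimal sketch:

import math

def prob_within(k):
    # Probability that a standard normal observation falls within k SDs of the mean.
    return math.erf(k / math.sqrt(2))

print(round(prob_within(1), 4))         # 0.6827 -> the 68% rule
print(round(prob_within(2), 4))         # 0.9545 -> the (roughly) 95% rule
print(round(prob_within(3), 4))         # 0.9973 -> the 99.7% rule
print(round(1 - prob_within(1.96), 4))  # 0.05   -> the two-tailed alpha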
Transforming data to normality
Many statistical methods, such as t-tests, analysis of variance, and regression, assume
that the data are normally distributed and have equal variance within each treatment
group or level. Even when normality is not strictly required, most analyses work better
with data that are normally distributed and that have equal variance within each
treatment group or level.
If you have outliers and/or non-normal distributions, you may be able to apply a
transform to the data to make the distribution more normal, and to reduce the influence of
the outliers.
Here is an example of a data distribution typical of biological measurements, such as the
amount of protein in the blood.
The distribution of the original values does not look at all normal. The log transform of
the original values looks more like a normal distribution.
If the log transform is not effective in producing a more normal distribution, your
software may have alternatives such as Box-Cox transforms or Johnson transforms.
Another alternative is to use a non-parametric test, such as a Wilcoxon rank sum test,
described later.