Testing for Normality and Symmetry

Biomedical Statistics
Testing for Normality and Symmetry
Teacher: Jang-Zern Tsai (蔡章仁)
Student: 邱瑋國









Outline
Graphical Tests
Analysis of Skewness and Kurtosis
Statistical Tests
Chi-square test for normality
Kolmogorov-Smirnov test
Shapiro-Wilk (original test)
Shapiro-Wilk (expanded test)
Transformations
Graphical Tests
Histogram
A histogram can be used as an informal check of whether data are
normally distributed.
Example: Determine whether the data in column B of Figure 1 are
normally distributed using a histogram.
Result: The histogram doesn’t look particularly normal.
QQ Plot
There are two versions of normal probability plots:
A PP plot (probability-probability plot)
A QQ plot (quantile-quantile plot)
Example: Using a QQ plot, determine whether the data set {-5.2, -3.9, -2.1,
0.2, 1.1, 2.7, 4.9, 5.3} is normally distributed.
In the worksheet, the original data are standardized so that each
standardized value is easy to compare with the corresponding standard
normal value. Finally, a scatter diagram (the QQ plot) shows the data
vs. the standardized normal data. The closer the points lie to a
straight line, the closer the data are to a normal distribution.
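The QQ plot coordinates for this data set can be sketched in Python as follows. This is a minimal sketch, not the worksheet's exact method: the plotting position (i - 0.5)/n and the function name `qq_points` are assumptions, and other plotting positions give slightly different theoretical quantiles.

```python
from statistics import NormalDist, fmean, stdev

# Data set from the example above.
data = [-5.2, -3.9, -2.1, 0.2, 1.1, 2.7, 4.9, 5.3]

def qq_points(sample):
    """Return (standard normal quantile, standardized data value) pairs."""
    xs = sorted(sample)
    n = len(xs)
    mean, sd = fmean(xs), stdev(xs)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for i, x in enumerate(xs, start=1):
        p = (i - 0.5) / n  # assumed plotting position
        points.append((std_normal.inv_cdf(p), (x - mean) / sd))
    return points

for q, z in qq_points(data):
    print(f"normal quantile {q:7.3f}   standardized data {z:7.3f}")
```

Plotting the second column against the first gives the QQ plot; points close to the line y = x suggest approximate normality.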
Real Statistics Excel Support
The Real Statistics Resource Pack contains a supplemental data analysis
tool which creates QQ plots automatically. In the current implementation,
the data must be organized as a single column. We illustrate the use of
the QQ Plot data analysis tool in the following example.
Box Plots
While box plots can’t actually be used to test for normality, they can
be useful for testing for symmetry, which often is a sufficient
substitute for normality.
Analysis of Skewness and Kurtosis
Since the skewness and (excess) kurtosis of the normal distribution are
zero, values for these two parameters should be close to zero for data
to follow a normal distribution.
A rough measure of the standard error of the skewness is $\sqrt{6/n}$,
where n is the sample size.
A rough measure of the standard error of the kurtosis is $\sqrt{24/n}$,
where n is the sample size.
If the absolute value of the skewness for the data is more than
twice the standard error, this indicates that the data are not
symmetric, and therefore not normal. Similarly, if the absolute value
of the kurtosis for the data is more than twice the standard error,
this is also an indication that the data are not normal.
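The rule of thumb above can be sketched in Python. This is a minimal sketch under stated assumptions: it uses the simple moment-based estimates of skewness and excess kurtosis, which differ slightly from the bias-corrected values returned by Excel's SKEW and KURT, and the function name `skew_kurtosis_check` is hypothetical.

```python
import math
from statistics import fmean

def skew_kurtosis_check(sample):
    """Compare skewness and excess kurtosis with twice their rough standard errors."""
    n = len(sample)
    mean = fmean(sample)
    # Central moments of order 2, 3 and 4.
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2 - 3  # excess kurtosis (0 for a normal distribution)
    se_skew = math.sqrt(6 / n)
    se_kurt = math.sqrt(24 / n)
    return {
        "skewness": skew,
        "kurtosis": kurt,
        "skewness_flag": abs(skew) > 2 * se_skew,
        "kurtosis_flag": abs(kurt) > 2 * se_kurt,
    }

# Data set from the QQ plot example earlier in these slides.
print(skew_kurtosis_check([-5.2, -3.9, -2.1, 0.2, 1.1, 2.7, 4.9, 5.3]))
```

A flag of True means the corresponding statistic is more than two rough standard errors from zero. With small samples the standard errors are large, so the check has little power.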
Statistical Tests for Normality and Symmetry
Chi-square test for normality
The chi-square goodness of fit test can be used to test the
hypothesis that data come from a normal distribution. In particular,
we can use Theorem 2 of Goodness of Fit to test the null
hypothesis:
H0: the data are sampled from a normal distribution.
We now perform the chi-square goodness of fit test. Since the observed and
expected frequencies of the first and last intervals are less than 5, it is better
to combine the 1st and 2nd intervals as well as the last and second-to-last intervals.
The chi-square test statistic is 4.47, which is less than the critical value of
14.07 (α = .05, df = 7), and so we cannot reject the null hypothesis: the
normal distribution is a good fit. Note that df = number of intervals – 1 =
8 – 1 = 7 since the mean and standard deviation are given.
We next test the null hypothesis that the data is normally distributed
using the sample mean and variance (3.74 and 2.20 respectively, as
seen in Figure 3) as estimates for the population mean and variance. As in
Example 1, we combine the first two and the last two intervals so that
all frequencies are at least 5. Once again we use a chi-square
goodness of fit test based on 8 intervals, but this time, since the mean
and variance are estimated parameters, df = number of intervals –
number of estimated parameters – 1 = 8 – 2 – 1 = 5, per Theorem 3 of
Goodness of Fit.
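The two variants of the test differ only in the degrees of freedom. A minimal Python sketch of the calculation is given below. The function name `chi_square_normal_fit`, its parameters, and the counts in the usage example are hypothetical, since the observed frequencies from the slides' figures are not reproduced here. The statistic must still be compared with a chi-square critical value from a table.

```python
from statistics import NormalDist

def chi_square_normal_fit(observed, edges, mean, sd, estimated_params=0):
    """Chi-square goodness of fit statistic for a normal distribution.

    observed: counts per interval; the first and last intervals are open-ended.
    edges: interior interval boundaries, so len(edges) == len(observed) - 1.
    estimated_params: 0 if mean and sd are given, 2 if both are estimated.
    """
    if len(edges) != len(observed) - 1:
        raise ValueError("need exactly len(observed) - 1 interior edges")
    dist = NormalDist(mean, sd)
    n = sum(observed)
    cdf = [0.0] + [dist.cdf(e) for e in edges] + [1.0]
    expected = [n * (cdf[i + 1] - cdf[i]) for i in range(len(observed))]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - estimated_params - 1
    return stat, df, expected

# Hypothetical counts for four intervals split at -1, 0 and 1.
stat, df, expected = chi_square_normal_fit([16, 34, 34, 16], [-1, 0, 1], 0, 1)
print(stat, df)
```

Intervals with expected frequencies below 5 should be merged before calling the function, as described above.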
Kolmogorov-Smirnov Test
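Definition 1: Let x1,…,xn be an ordered sample with x1 ≤ … ≤ xn
and define Sn(x) as follows:
$S_n(x) = \begin{cases} 0 & x < x_1 \\ k/n & x_k \le x < x_{k+1} \\ 1 & x \ge x_n \end{cases}$
Now suppose that the sample comes from a population with
cumulative distribution function F(x) and define Dn as follows:
$D_n = \max_x |F(x) - S_n(x)|$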
Observation: It can be shown that the distribution of Dn doesn’t
depend on F. Since Sn(x) depends on the sample chosen, Dn is a random
variable. Our objective is to use Dn as a way of estimating F(x).
The distribution of Dn can be calculated, but for our purposes the
important aspect of this distribution is its critical values. These
can be found in the Kolmogorov-Smirnov Table.
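If Dn,α is the critical value from the table, then P(Dn ≤ Dn,α) =
1 – α. Dn can be used to test the hypothesis that a random
sample came from a population with a specific distribution
function F(x). If
$\max_x |F(x) - S_n(x)| \le D_{n,\alpha}$
then the sample data is a good fit with F(x).
Also from the definition of Dn given above, it follows that
$1 - \alpha = P(D_n \le D_{n,\alpha}) = P(S_n(x) - D_{n,\alpha} \le F(x) \le S_n(x) + D_{n,\alpha} \text{ for all } x)$
Thus Sn(x) ± Dn,α provides a confidence band for F(x).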
Example: For the sample data (n = 1000), the mean is 481.4 and the
standard deviation is 155.2. We can now build the table that allows us
to carry out the KS test, namely:
Dn = the largest value in column G, which in our case is 0.0117. If
Dn is smaller than the critical value Dn,α, we cannot reject the
hypothesis that the data is normally distributed. From the
Kolmogorov-Smirnov Table we see that
Dn,α = D1000,.05 = 1.36 / SQRT(1000) = 0.043007
Since Dn = 0.0117 < 0.043007 = Dn,α, we conclude that the data is
a good fit with the normal distribution.
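The calculation of Dn and the large-sample critical value can be sketched in Python. This is a minimal sketch: the function names and the sample values are hypothetical, and the approximation 1.36/√n for α = .05 is the large-sample formula used in the example above, so small samples should use the table instead.

```python
import math
from statistics import NormalDist, fmean, stdev

def ks_statistic(sample, cdf):
    """Largest gap between the empirical step function Sn(x) and F(x)."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # Sn jumps from (i - 1)/n to i/n at x, so check both sides of the step.
        d = max(d, abs(f - i / n), abs(f - (i - 1) / n))
    return d

def ks_critical_05(n):
    """Large-sample critical value at alpha = .05."""
    return 1.36 / math.sqrt(n)

# Hypothetical sample; the 1000 values from the slides are not reproduced here.
sample = [512, 430, 388, 601, 475, 540, 296, 455, 690, 410]
fitted = NormalDist(fmean(sample), stdev(sample))
print(ks_statistic(sample, fitted.cdf), ks_critical_05(len(sample)))
```

The sketch fits the mean and standard deviation from the sample, as the slides' example does.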
Shapiro-Wilk Original Test
1. Rearrange the data in ascending order so that x1 ≤ … ≤ xn.
2. Calculate SS as follows:
$SS = \sum_{i=1}^{n} (x_i - \bar{x})^2$
3. If n is even, let m = n/2, while if n is odd let m = (n–1)/2.
4. Calculate b as follows, taking the ai weights from Table 1
(based on the value of n) in the Shapiro-Wilk Table. Note that if n
is odd, the median data value is not used in the calculation of b.
$b = \sum_{i=1}^{m} a_i (x_{n+1-i} - x_i)$
5. Calculate the test statistic $W = b^2 / SS$.
6. Find the value in Table 2 of the Shapiro-Wilk Table (for a
given value of n) that is closest to W, interpolating if necessary. This
gives the p-value for the test.
We begin by sorting the data in column A using Data > Sort &
Filter|Sort or the QSORT supplemental function, putting the results in
column B. We next look up the coefficient values for n = 12 (the
sample size) in Table 1 of the Shapiro-Wilk Table, putting these values
in column E. Corresponding to each of these 6 coefficients a1,…,a6, we
calculate the values x12 – x1, …, x7 – x6, where xi is the ith data
element in sorted order. E.g. since x1 = 35 and x12 = 86, we place the
difference 86 – 35 = 51 in the corresponding cell. Column I contains the
product of the coefficient and difference values. E.g. cell I5 contains
the formula =E5*H5. The sum of these values is b = 44.1641.
We next calculate SS = 2008.667. Thus W = b^2 ⁄ SS = .971026.
We now look for .971026 when n = 12 in Table 2 of
the Shapiro-Wilk Table and find that the value lies between the
entries for p = .50 and p = .90. The W value for .5 is .943 and the W
value for .9 is .973. Interpolating .971026 between these values, we
arrive at p-value = .873681. Since p-value = .87 > .05 = α, we retain
the null hypothesis that the data are normally distributed.
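The arithmetic of this example can be checked with a short Python sketch. The function names are hypothetical; the numbers b, SS and the two table entries are those quoted above. The ai weights themselves must still be taken from the Shapiro-Wilk Table and are not reproduced here.

```python
def b_statistic(sorted_x, weights):
    """b = sum of a_i * (x_(n+1-i) - x_i) for i = 1..m, with len(weights) == m."""
    n = len(sorted_x)
    return sum(a * (sorted_x[n - 1 - i] - sorted_x[i])
               for i, a in enumerate(weights))

def w_statistic(b, ss):
    """W = b^2 / SS."""
    return b * b / ss

def interpolate_p(w, w_low, p_low, w_high, p_high):
    """Linear interpolation of the p-value between two table entries."""
    return p_low + (w - w_low) / (w_high - w_low) * (p_high - p_low)

w = w_statistic(44.1641, 2008.667)
p = interpolate_p(w, 0.943, 0.50, 0.973, 0.90)
print(round(w, 6), round(p, 4))
```

The results agree with W = .971026 and p-value ≈ .8737 from the worked example.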
SHAPIRO(R1, False) = the Shapiro-Wilk test statistic W for the data in
the range R1
SWTEST(R1, False) = p-value of the Shapiro-Wilk test on the data in
R1
SWCoeff(n, j, False) = the jth coefficient for samples of size n
SWCoeff(R1, C1, False) = the coefficient corresponding to cell C1
within sorted range R1
SWPROB(n, W) = p-value of the Shapiro-Wilk test for a sample of
size n for test statistic W
Shapiro-Wilk Expanded Test
The following version of the Shapiro-Wilk Test handles samples
between 12 and 5,000 elements, although samples of at least 20
elements are recommended. We also show how to handle samples with
more than 5,000 elements.
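Assuming that the sample has n elements, perform the following steps:
1. Sort the data in ascending order x1 ≤ … ≤ xn
2. Define the values m1, …, mn by
mi = NORMSINV((i − .375)/(n + .25))
3. Let M = [mi] be the n × 1 column vector whose elements are these
mi and let
$m = M^T M = \sum_{i=1}^{n} m_i^2$
If M is represented by the n × 1 range R1 in Excel, then
=SUMSQ(R1) calculates the value m.
4. Set $u = 1/\sqrt{n}$ and define the coefficients a1, …, an where
$a_n = -2.706056u^5 + 4.434685u^4 - 2.071190u^3 - 0.147981u^2 + 0.221157u + m_n m^{-1/2}$
$a_{n-1} = -3.582633u^5 + 5.682633u^4 - 1.752461u^3 - 0.293762u^2 + 0.042981u + m_{n-1} m^{-1/2}$
$a_i = m_i / \sqrt{\varepsilon}$ for 2 < i < n − 1
$a_2 = -a_{n-1}$, $a_1 = -a_n$
$\varepsilon = \dfrac{m - 2m_n^2 - 2m_{n-1}^2}{1 - 2a_n^2 - 2a_{n-1}^2}$
It turns out that ai = −an-i+1 for all i and that
$A^T A = \sum_{i=1}^{n} a_i^2 = 1$
where A = [ai] is the n × 1 column vector whose elements are the ai.
5. The W statistic is now defined by
$W = \dfrac{\left(\sum_{i=1}^{n} a_i x_i\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$
Because of the above properties of the coefficients a1, …, an it turns out
that W = the square of the correlation coefficient between a1, …,
an and x1, …, xn. Thus the values of W are always between 0 and 1.
6. For values of n between 12 and 5,000 the statistic ln(1 − W) is
approximately normally distributed with mean and standard deviation
$\mu = 0.0038915(\ln n)^3 - 0.083751(\ln n)^2 - 0.31082\ln n - 1.5861$
$\sigma = \exp\left(0.0030302(\ln n)^2 - 0.082676\ln n - 0.4803\right)$
Thus we can test the statistic
$z = \dfrac{\ln(1 - W) - \mu}{\sigma}$
using the standard normal distribution, with p-value = 1 − NORMSDIST(z).
If the p-value ≤ α then we reject the null hypothesis that the original
data is normally distributed.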
We carry out the calculations described above to get the results
shown in Figure 1. The W statistic is 0.971066. The p-value
= .921649 > .05 = α shows that there are no grounds for rejecting
the null hypothesis that the data is normally distributed.
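The expanded test can be sketched in Python using only the standard library. The numeric constants below are assumptions taken from Royston's published approximation, on which this form of the test is commonly based. The function names and the sample values are hypothetical.

```python
import math
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()

def sw_coefficients(n):
    """Approximate weights a_1..a_n (steps 2 to 4); constants assumed from Royston."""
    m = [STD_NORMAL.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    m_sq = sum(v * v for v in m)
    u = 1 / math.sqrt(n)
    a = [0.0] * n
    a[-1] = (-2.706056 * u**5 + 4.434685 * u**4 - 2.071190 * u**3
             - 0.147981 * u**2 + 0.221157 * u + m[-1] / math.sqrt(m_sq))
    a[-2] = (-3.582633 * u**5 + 5.682633 * u**4 - 1.752461 * u**3
             - 0.293762 * u**2 + 0.042981 * u + m[-2] / math.sqrt(m_sq))
    eps = ((m_sq - 2 * m[-1]**2 - 2 * m[-2]**2)
           / (1 - 2 * a[-1]**2 - 2 * a[-2]**2))
    for i in range(2, n - 2):  # 1-based indices 3 .. n-2
        a[i] = m[i] / math.sqrt(eps)
    a[0], a[1] = -a[-1], -a[-2]
    return a

def sw_statistic(sample):
    """W = (sum a_i x_i)^2 / sum (x_i - mean)^2 on the sorted sample (step 5)."""
    xs = sorted(sample)
    a = sw_coefficients(len(xs))
    mean = fmean(xs)
    ss = sum((x - mean) ** 2 for x in xs)
    return sum(ai * x for ai, x in zip(a, xs)) ** 2 / ss

def sw_p_value(w, n):
    """Upper-tail p-value from the normal approximation to ln(1 - W) (step 6)."""
    ln_n = math.log(n)
    mu = 0.0038915 * ln_n**3 - 0.083751 * ln_n**2 - 0.31082 * ln_n - 1.5861
    sigma = math.exp(0.0030302 * ln_n**2 - 0.082676 * ln_n - 0.4803)
    z = (math.log(1 - w) - mu) / sigma
    return 1 - STD_NORMAL.cdf(z)

# Hypothetical sample of 12 values (not the data from Figure 1).
sample = [35, 42, 47, 51, 55, 58, 61, 64, 68, 72, 79, 86]
w = sw_statistic(sample)
print(w, sw_p_value(w, len(sample)))
```

Calling `sw_p_value(0.971066, 12)` gives a value close to the p-value of .921649 reported in the example above, which is a useful check on the constants.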
SHAPIRO(R1) = the Shapiro-Wilk test statistic W for the data in
the range R1 using the expanded method
SWTEST(R1) = p-value of the Shapiro-Wilk test on the data in R1
using the expanded method
SWCoeff(n, j) = the jth coefficient for samples of size n
SWCoeff(R1, C1) = the coefficient corresponding to cell C1 within
sorted range R1.
Note that these functions can optionally take an additional
argument b: SHAPIRO(R1, b), SWTEST(R1, b), SWCoeff(n, j, b)
and SWCoeff(R1, C1, b). When omitted this argument defaults to
True (i.e. the values for the expanded Shapiro-Wilk test as
described above are used). If b is set to False then the values for
the original Shapiro-Wilk test are used instead.
This time we use the supplemental functions described above to
obtain the results shown in Figure 2. The value of W and the p-value
are calculated using the formulas shown in the figure. Since p-value =
0.019314 < .05 = α, we reject the hypothesis that the data is
normally distributed. Note that we don’t need to sort the data and
the data does not have to be arranged in a column to use the
formulas.
If for some reason we want to obtain the coefficients, we need to
sort the data. This is done by highlighting the range G3:K14,
entering =QSORT(A3:E14) and pressing Ctrl-Shift-Enter. The first
coefficient is obtained by entering the formula
=SWCoeff($G$3:$K$14,G3) in cell M3. If you then highlight the range
M3:Q14 and press Ctrl-R and Ctrl-D, the formula is copied to the rest
of the range, giving the remaining coefficients.
Observation: If a sample has more than 5,000 elements, you can randomly divide it
into a number of approximately equal-sized smaller samples and then
run the SW algorithm as described above on each sample to obtain the z score
for each smaller sample. Suppose that there are k such samples with z scores
of z1, …, zk. Recall that if range R1 contains sample i then zi =
NORMSINV(SWTEST(R1)).
The average z of the z scores will be an approximation of the z value for the
whole sample. The expected mean of z is the average of the means of the zi,
namely 0, and the standard deviation of z should be the standard deviation of the
zi divided by √k, namely 1/√k. Thus you should test z/(1/√k) = z√k using the
standard normal distribution.
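The combination of the k sub-sample results can be sketched in Python. This is a minimal sketch under stated assumptions: it takes the sub-sample p-values as input, follows the convention zi = NORMSINV(p-value) used above, so small p-values give large negative z scores, and reads the combined p-value from the lower tail. The function name and the p-values in the usage example are hypothetical.

```python
import math
from statistics import NormalDist, fmean

def combine_subsample_p_values(p_values):
    """Combine Shapiro-Wilk p-values from k random sub-samples.

    Each p-value must lie strictly between 0 and 1.
    Returns (z * sqrt(k), combined p-value).
    """
    std_normal = NormalDist()
    k = len(p_values)
    z_scores = [std_normal.inv_cdf(p) for p in p_values]  # zi = NORMSINV(p)
    statistic = fmean(z_scores) * math.sqrt(k)  # mean z divided by 1/sqrt(k)
    return statistic, std_normal.cdf(statistic)

# Hypothetical p-values from four sub-samples.
stat, p = combine_subsample_p_values([0.21, 0.48, 0.09, 0.33])
print(stat, p)
```

Because the sub-samples are drawn at random, the result varies from one split to another, so it is worth fixing the random seed when the split is generated.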
Transformations to Create Symmetry
When a sample is not distributed normally, and is not even
symmetric, then sometimes it can be useful to transform the data
so that the transformed data is more normal or at least roughly
symmetric. We touch upon the subject in Transformations, and will
explore this concept a bit further in this section.
When data is skewed to the right (a long tail of large values),
transformations such as f(x) = log x (either base 10 or base e) and
f(x) = √x will tend to correct some of the skew since larger values
are compressed. Neither of these transformations accepts negative
numbers, and so the transformations f(x) = log (x+a) or f(x) = √(x+a)
may need to be used instead, where a is a constant sufficiently large
so that x + a is positive for all the data elements. We now show how to
use a log transformation via an example.
If we create a QQ Plot as described in Graphical Tests for
Normality and Symmetry, we see that the data is not very
normal (Figure 2). We now make a log transformation. We choose
log base 10, although the result would be similar if we had
chosen log base e.
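The effect of a log transformation on skewness can be sketched in Python. This is a minimal sketch: the data values are hypothetical (the example data set from the slides is not reproduced here), the function names are assumptions, and the skewness is the simple moment-based estimate rather than Excel's SKEW.

```python
import math
from statistics import fmean

def skewness(sample):
    """Moment-based sample skewness."""
    n = len(sample)
    mean = fmean(sample)
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    return m3 / m2 ** 1.5

def log10_transform(sample, a=0.0):
    """Apply f(x) = log10(x + a); a must make every x + a positive."""
    if min(sample) + a <= 0:
        raise ValueError("choose a larger constant a so that x + a > 0")
    return [math.log10(x + a) for x in sample]

# Hypothetical data with a long tail of large values.
raw = [1, 2, 2, 3, 3, 4, 5, 8, 13, 40]
transformed = log10_transform(raw)
print(skewness(raw), skewness(transformed))
```

For these values the skewness of the transformed data is closer to zero than that of the raw data, which is the effect described above.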
THE END