Descriptive Statistics

Part 5 – Measures of Shape
We have looked at numerical measures of location and dispersion; now we will look at
measures of shape. A histogram can give you a general idea of the shape of a distribution, but
two numerical measures of shape give a more precise evaluation: skewness tells you the
amount and direction of skew (departure from horizontal symmetry), and kurtosis tells you
how tall and sharp the central peak is, relative to a standard bell curve.
Why do we care? One application is testing for normality: many inferential statistics require
that a distribution be normal or nearly normal. A normal distribution has skewness and excess
kurtosis of 0, so if your distribution is close to those values then it is probably close to
normal.
Skewness
Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution,
or data set, is symmetric if it looks the same to the left and right of the center point.
To calculate skewness we use Pearson's moment coefficient of skewness, which is based on the
third moment about the mean.
Population Skewness
$$\text{Population Skewness} = \frac{\sum f (X - \mu)^3 / N}{\sigma^3}$$
where:
µ = population mean
N = number of observations in the population
X = class midpoint
σ = population standard deviation
f = class frequency
Sample Skewness
To get the skewness for a sample we simply use the population skewness formula and make
the following adjustment:
$$\text{Sample Skewness} = \frac{\sqrt{n(n-1)}}{n-2} \times \text{Population Skewness}$$
where:
n = number of observations in the sample
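To get a feel for how large the sample adjustment is, the short sketch below evaluates the correction factor for a few sample sizes. It assumes the adjustment is the commonly used factor sqrt(n(n − 1)) / (n − 2); the function name `sample_skewness_factor` is made up for this illustration.

```python
import math


def sample_skewness_factor(n):
    """Factor that converts population skewness to sample skewness.

    Assumes the usual adjustment sqrt(n(n - 1)) / (n - 2), which is only
    defined for samples of at least three observations.
    """
    if n < 3:
        raise ValueError("need at least 3 observations")
    return math.sqrt(n * (n - 1)) / (n - 2)


# The factor shrinks toward 1 as the sample grows.
for n in (5, 10, 30, 100, 1000):
    print(n, round(sample_skewness_factor(n), 4))
```

For n = 100 the factor is only about 1.015, so for samples of that size the sample and population skewness differ very little.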
The skewness for a normal distribution is zero, and any symmetric data should have a
skewness near zero. Negative values for the skewness indicate data that are skewed left and
positive values for the skewness indicate data that are skewed right. By skewed left, we mean
that the left tail is long relative to the right tail. Similarly, skewed right means that the right
tail is long relative to the left tail.
| Skewness (Calculated Value) | Distribution Shape | Mean, Median, Mode |
|---|---|---|
| Positive | Tail to the right; values extend further to the right but are concentrated in the left | Mean > Median > Mode |
| Zero | Bell shaped or symmetrical | Mean = Median = Mode |
| Negative | Tail to the left; values extend further to the left but are concentrated in the right | Mean < Median < Mode |
In (a), a negatively skewed distribution, there is a long tail to the left caused by extremely
small values, which pull the mean downward so that it is less than the median.
In (c), a positively skewed distribution, there is a long tail to the right caused by extremely
large values, which pull the mean upward so that it is greater than the median.
Example:
Below is grouped data for the heights of a sample of 100 randomly selected male students.

| Height (Meters) | Frequency (f) |
|---|---|
| 1.51 to 1.58 | 5 |
| 1.59 to 1.66 | 18 |
| 1.67 to 1.74 | 42 |
| 1.75 to 1.82 | 27 |
| 1.83 to 1.90 | 8 |
A histogram shows that the data are skewed left, not symmetric.
[Figure: histogram of Frequency (0 to 45) against Height (Meters) for the five classes, 1.51 to 1.58 through 1.83 to 1.90]
But how highly skewed are they, compared to other data sets? To answer this question, you
have to compute the skewness, starting with the formula for population skewness.
Step 1: Find the midpoint of each class. Recall that the midpoint of a class is halfway
between the lower limits of two consecutive classes. It is computed by adding the
lower limits of consecutive classes and dividing the result by 2. Referring to the table
above, for the first class the lower class limit is 1.51m and the next lower limit is 1.59m.
The class midpoint is 1.55m, found by (1.51m + 1.59m)/2.
| Height (Meters) | Class Midpoint (X) | Frequency (f) |
|---|---|---|
| 1.51 to 1.58 | 1.55 | 5 |
| 1.59 to 1.66 | 1.63 | 18 |
| 1.67 to 1.74 | 1.71 | 42 |
| 1.75 to 1.82 | 1.79 | 27 |
| 1.83 to 1.90 | 1.87 | 8 |
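The midpoint rule in Step 1 can be sketched in a few lines of Python. This is only an illustration: the variable names are made up, and it assumes every class has the same width so that the last class, which has no following lower limit, can be handled the same way as the others.

```python
# Lower class limits from the heights table (metres).
lower_limits = [1.51, 1.59, 1.67, 1.75, 1.83]

# Assumption: all classes share one width, taken from the first two limits.
class_width = lower_limits[1] - lower_limits[0]

# Midpoint = (lower limit + next lower limit) / 2 = lower limit + width / 2.
# Rounding to 2 decimals removes floating-point noise.
midpoints = [round(low + class_width / 2, 2) for low in lower_limits]

print(midpoints)
```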
Step 2: Compute the arithmetic mean for the distribution.
| Height (Meters) | Class Midpoint (X) | Frequency (f) | fX |
|---|---|---|---|
| 1.51 to 1.58 | 1.55 | 5 | 7.75 |
| 1.59 to 1.66 | 1.63 | 18 | 29.34 |
| 1.67 to 1.74 | 1.71 | 42 | 71.82 |
| 1.75 to 1.82 | 1.79 | 27 | 48.33 |
| 1.83 to 1.90 | 1.87 | 8 | 14.96 |
| Total | | | 172.2 = ΣfX |
Solving for the arithmetic mean we get:
$$\mu = \frac{\sum fX}{N} = \frac{172.2}{100} = 1.722$$
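As a quick check of Step 2, the grouped mean can be computed directly from the class midpoints and frequencies. The variable names below are illustrative only.

```python
# Class midpoints (X) and frequencies (f) from the heights table.
midpoints = [1.55, 1.63, 1.71, 1.79, 1.87]
frequencies = [5, 18, 42, 27, 8]

n_obs = sum(frequencies)                                     # N
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))  # sum of fX
mean = sum_fx / n_obs                                        # sum of fX divided by N

print(f"N = {n_obs}, sum of fX = {sum_fx:.2f}, mean = {mean:.3f}")
```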
Step 3: Subtract the mean from the class midpoint. That is, find (X - µ).
| Height (Meters) | Class Midpoint (X) | Frequency (f) | fX | Mean (µ) | (X - µ) |
|---|---|---|---|---|---|
| 1.51 to 1.58 | 1.55 | 5 | 7.75 | 1.722 | -0.172 |
| 1.59 to 1.66 | 1.63 | 18 | 29.34 | 1.722 | -0.092 |
| 1.67 to 1.74 | 1.71 | 42 | 71.82 | 1.722 | -0.012 |
| 1.75 to 1.82 | 1.79 | 27 | 48.33 | 1.722 | 0.068 |
| 1.83 to 1.90 | 1.87 | 8 | 14.96 | 1.722 | 0.148 |
| Total | | | 172.2 | | |
Step 4: Square the difference between the class midpoint and the mean and multiply the
squared difference by the class frequency and sum. Calculate the standard deviation.
| Height (Meters) | Class Midpoint (X) | Frequency (f) | fX | Mean (µ) | (X - µ) | (X - µ)² | f(X - µ)² |
|---|---|---|---|---|---|---|---|
| 1.51 to 1.58 | 1.55 | 5 | 7.75 | 1.722 | -0.172 | 0.029584 | 0.14792 |
| 1.59 to 1.66 | 1.63 | 18 | 29.34 | 1.722 | -0.092 | 0.008464 | 0.15235 |
| 1.67 to 1.74 | 1.71 | 42 | 71.82 | 1.722 | -0.012 | 0.000144 | 0.00605 |
| 1.75 to 1.82 | 1.79 | 27 | 48.33 | 1.722 | 0.068 | 0.004624 | 0.12485 |
| 1.83 to 1.90 | 1.87 | 8 | 14.96 | 1.722 | 0.148 | 0.021904 | 0.17523 |
| Total | | | 172.2 | | | | 0.6064 |
To find the standard deviation we insert these values into the formula:
$$\sigma = \sqrt{\frac{\sum f (X - \mu)^2}{N}} = \sqrt{\frac{0.6064}{100}} = 0.077872$$
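Step 4 can be checked with a short Python sketch that sums f(X − µ)² and takes the square root of the result divided by N. It uses the population form of the standard deviation, as the worked example does; the variable names are illustrative.

```python
import math

# Class midpoints (X) and frequencies (f) from the heights table.
midpoints = [1.55, 1.63, 1.71, 1.79, 1.87]
frequencies = [5, 18, 42, 27, 8]

n_obs = sum(frequencies)
mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n_obs

# Sum of f * (X - mean)^2 over all classes.
sum_sq = sum(f * (x - mean) ** 2 for f, x in zip(frequencies, midpoints))

# Population standard deviation: divide by N, not N - 1.
sigma = math.sqrt(sum_sq / n_obs)

print(f"sum of f(X - mean)^2 = {sum_sq:.4f}, sigma = {sigma:.6f}")
```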
Step 5: Cube the difference between the class midpoint and the mean, multiply the cubed
difference by the class frequency, and sum.
| Height (Meters) | Class Midpoint (X) | Frequency (f) | fX | Mean (µ) | (X - µ) | (X - µ)³ | f(X - µ)³ |
|---|---|---|---|---|---|---|---|
| 1.51 to 1.58 | 1.55 | 5 | 7.75 | 1.722 | -0.172 | -0.005088448 | -0.02544 |
| 1.59 to 1.66 | 1.63 | 18 | 29.34 | 1.722 | -0.092 | -0.000778688 | -0.01402 |
| 1.67 to 1.74 | 1.71 | 42 | 71.82 | 1.722 | -0.012 | -0.000001728 | -0.00007 |
| 1.75 to 1.82 | 1.79 | 27 | 48.33 | 1.722 | 0.068 | 0.000314432 | 0.00849 |
| 1.83 to 1.90 | 1.87 | 8 | 14.96 | 1.722 | 0.148 | 0.003241792 | 0.02593 |
| Total | | | 172.2 | | | | -0.0051 |
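Step 6: Use the formula to calculate population skewness.
$$\text{Population Skewness} = \frac{\sum f (X - \mu)^3 / N}{\sigma^3} = \frac{-0.0051 / 100}{(0.077872)^3} \approx -0.1080$$
This would be the skewness if you had data for the whole population. However, you are
dealing with a sample and must compute the sample skewness:
$$\text{Sample Skewness} = \frac{\sqrt{n(n-1)}}{n-2} \times \text{Population Skewness} = \frac{\sqrt{100 \times 99}}{98} \times (-0.1080) \approx -0.1097$$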
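Steps 1 to 6 can be collected into one small function, sketched below. It assumes the population skewness is the third central moment divided by σ³ and that the sample adjustment is the common factor sqrt(n(n − 1)) / (n − 2); the function name `grouped_skewness` is made up for this illustration. Because the code carries full precision instead of the rounded table totals, its results can differ from the hand calculation in the fourth decimal place.

```python
import math


def grouped_skewness(midpoints, frequencies):
    """Return (population skewness, sample skewness) for grouped data."""
    n = sum(frequencies)
    mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n
    # Second and third central moments, each divided by N.
    m2 = sum(f * (x - mean) ** 2 for f, x in zip(frequencies, midpoints)) / n
    m3 = sum(f * (x - mean) ** 3 for f, x in zip(frequencies, midpoints)) / n
    population = m3 / m2 ** 1.5  # m2 ** 1.5 equals sigma cubed
    # Assumed sample adjustment: sqrt(n(n - 1)) / (n - 2).
    sample = math.sqrt(n * (n - 1)) / (n - 2) * population
    return population, sample


midpoints = [1.55, 1.63, 1.71, 1.79, 1.87]
frequencies = [5, 18, 42, 27, 8]
population_skew, sample_skew = grouped_skewness(midpoints, frequencies)
print(f"population skewness = {population_skew:.4f}")
print(f"sample skewness     = {sample_skew:.4f}")
```

Both values come out close to −0.11, in line with the hand calculation.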
If skewness is positive, the data are positively skewed or skewed right, meaning that the right
tail of the distribution is longer than the left. If skewness is negative, the data are negatively
skewed or skewed left, meaning that the left tail is longer.
If skewness = 0, the data are perfectly symmetrical. But a skewness of exactly zero is quite
unlikely for real-world data, so how can you interpret the skewness number?
• If skewness is less than −1 or greater than +1, the distribution is highly skewed.
• If skewness is between −1 and −0.5 or between +0.5 and +1, the distribution is
moderately skewed.
• If skewness is between −0.5 and +0.5, the distribution is approximately symmetric.
With a skewness of −0.1097, the sample data for student heights are approximately
symmetric.
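The rule of thumb for interpreting skewness can be written as a small helper. This is a sketch only: the function name is hypothetical, and because the text does not say which category a value of exactly ±0.5 or ±1 belongs to, the boundary handling below is an assumption.

```python
def describe_skewness(skewness):
    """Classify a skewness value using the rule of thumb from the text."""
    magnitude = abs(skewness)
    if magnitude > 1:
        return "highly skewed"
    if magnitude >= 0.5:  # assumption: the boundary values count as moderate
        return "moderately skewed"
    return "approximately symmetric"


print(describe_skewness(-0.1097))  # the student heights sample
print(describe_skewness(0.7))
print(describe_skewness(-1.3))
```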
Caution: This is an interpretation of the data you actually have. When you have data for the
whole population, that's fine. But when you have a sample, the sample skewness doesn't
necessarily apply to the whole population.
Kurtosis
If a distribution is symmetric, the next question is about the central peak: is it high and sharp,
or short and broad? You can get some idea of this from the histogram, but a numerical
measure is more precise.
Kurtosis is the degree of peakedness of a distribution, that is, whether it is peaked or flat
relative to a normal distribution.
Data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and
have heavy tails.
Data sets with low kurtosis tend to have a flat top near the mean rather than a sharp peak.
The reference standard is a normal distribution, which has a kurtosis of 3. Often, excess
kurtosis is presented instead of kurtosis, where excess kurtosis is simply kurtosis − 3. For
example, the "kurtosis" reported by Excel is actually the excess kurtosis.
• A normal distribution has kurtosis exactly 3 (excess kurtosis exactly 0). Any
distribution with kurtosis ≈ 3 (excess ≈ 0) is called mesokurtic.
• A distribution with kurtosis < 3 (excess kurtosis < 0) is called platykurtic.
Compared to a normal distribution, its central peak is lower and broader, and its
tails are shorter and thinner.
• A distribution with kurtosis > 3 (excess kurtosis > 0) is called leptokurtic.
Compared to a normal distribution, its central peak is higher and sharper, and
its tails are longer and fatter.
| Term | Distribution Shape | Kurtosis | Excess Kurtosis |
|---|---|---|---|
| Leptokurtic | Peaked | Greater than 3 | Greater than 0 |
| Mesokurtic | Normal | 3 | 0 |
| Platykurtic | Flat | Less than 3 | Less than 0 |
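The three terms in the table can be expressed as a small classification helper. The function name and the `tolerance` parameter are hypothetical: the text only says that kurtosis "≈ 3" is mesokurtic, so how close counts as "approximately 3" is an assumption made here for illustration.

```python
def describe_kurtosis(kurtosis, tolerance=0.1):
    """Label a kurtosis value (not excess kurtosis) relative to the normal value of 3.

    `tolerance` is an illustrative cut-off for "approximately 3"; the text
    does not specify one.
    """
    excess = kurtosis - 3  # excess kurtosis
    if abs(excess) <= tolerance:
        return "mesokurtic"
    return "leptokurtic" if excess > 0 else "platykurtic"


print(describe_kurtosis(3.0))
print(describe_kurtosis(4.2))
print(describe_kurtosis(1.8))
```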
To calculate kurtosis, we use the fourth moment. It is computed almost the same way as the
coefficient of skewness: just change the exponent 3 to 4 in the formula.
Population Kurtosis
$$\text{Population Kurtosis} = \frac{\sum f (X - \mu)^4 / N}{\sigma^4}$$
where:
µ = population mean
N = number of observations in the population
X = class midpoint
σ = population standard deviation
f = class frequency
We usually want to express this as excess kurtosis, so we simply subtract 3. Excess kurtosis
is generally used because the excess kurtosis of a normal distribution is 0.
$$\text{Population Excess Kurtosis} = \text{Population Kurtosis} - 3$$
Sample Excess Kurtosis
To get the excess kurtosis for a sample we simply compute the population excess kurtosis and
make the following adjustment:
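$$\text{Sample Excess Kurtosis} = \frac{n-1}{(n-2)(n-3)} \left[ (n+1) \times \text{Population Excess Kurtosis} + 6 \right]$$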
Example:
Below is grouped data for the heights of a sample of 100 randomly selected male students.

| Height (Meters) | Frequency (f) |
|---|---|
| 1.51 to 1.58 | 5 |
| 1.59 to 1.66 | 18 |
| 1.67 to 1.74 | 42 |
| 1.75 to 1.82 | 27 |
| 1.83 to 1.90 | 8 |
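From earlier:
µ = 1.722
N = 100
σ = 0.077872

| Height (Meters) | Class Midpoint (X) | Frequency (f) | fX | Mean (µ) | (X - µ) | (X - µ)⁴ | f(X - µ)⁴ |
|---|---|---|---|---|---|---|---|
| 1.51 to 1.58 | 1.55 | 5 | 7.75 | 1.722 | -0.172 | 0.000875213 | 0.004376065 |
| 1.59 to 1.66 | 1.63 | 18 | 29.34 | 1.722 | -0.092 | 0.000071639 | 0.001289507 |
| 1.67 to 1.74 | 1.71 | 42 | 71.82 | 1.722 | -0.012 | 0.000000021 | 0.000000871 |
| 1.75 to 1.82 | 1.79 | 27 | 48.33 | 1.722 | 0.068 | 0.000021381 | 0.000577297 |
| 1.83 to 1.90 | 1.87 | 8 | 14.96 | 1.722 | 0.148 | 0.000479785 | 0.003838282 |
| Total | | | 172.2 | | | | 0.010082022 |

Use the formula to calculate population kurtosis.
$$\text{Population Kurtosis} = \frac{\sum f (X - \mu)^4 / N}{\sigma^4} = \frac{0.010082022 / 100}{(0.077872)^4} \approx 2.742$$
Find the Population Excess Kurtosis
$$\text{Population Excess Kurtosis} = \text{Population Kurtosis} - 3 \approx 2.742 - 3 = -0.258$$
Find the Sample Excess Kurtosis using the formula
$$\text{Sample Excess Kurtosis} = \frac{n-1}{(n-2)(n-3)} \left[ (n+1) \times \text{Population Excess Kurtosis} + 6 \right] = \frac{99}{98 \times 97} \left[ 101 \times (-0.258) + 6 \right] \approx -0.209$$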
This sample is slightly platykurtic: its peak is just a bit shallower than the peak of a normal
distribution.
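The kurtosis calculation for the heights example can be reproduced with the sketch below. It assumes population kurtosis is the fourth central moment divided by σ⁴, and that the sample adjustment is the commonly used (n − 1) / ((n − 2)(n − 3)) × [(n + 1) × excess + 6]; that formula and the function name `grouped_excess_kurtosis` are assumptions made for this illustration.

```python
def grouped_excess_kurtosis(midpoints, frequencies):
    """Return (population kurtosis, population excess, sample excess) for grouped data."""
    n = sum(frequencies)
    mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n
    # Second and fourth central moments, each divided by N.
    m2 = sum(f * (x - mean) ** 2 for f, x in zip(frequencies, midpoints)) / n
    m4 = sum(f * (x - mean) ** 4 for f, x in zip(frequencies, midpoints)) / n
    population_kurtosis = m4 / m2 ** 2  # m2 ** 2 equals sigma to the fourth power
    population_excess = population_kurtosis - 3
    # Assumed sample adjustment (needs n > 3).
    sample_excess = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * population_excess + 6)
    return population_kurtosis, population_excess, sample_excess


midpoints = [1.55, 1.63, 1.71, 1.79, 1.87]
frequencies = [5, 18, 42, 27, 8]
kurtosis, excess, sample_excess = grouped_excess_kurtosis(midpoints, frequencies)
print(f"population kurtosis        = {kurtosis:.3f}")
print(f"population excess kurtosis = {excess:.3f}")
print(f"sample excess kurtosis     = {sample_excess:.3f}")
```

The sample excess kurtosis comes out slightly below zero, which matches the description of this sample as slightly platykurtic.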