Descriptive Statistics Part 5 – Measures of Shape

We have looked at numerical measures of location and dispersion; now we will look at measures of shape. A histogram can give you a general idea of the shape of a distribution, but two numerical measures of shape give a more precise evaluation: skewness tells you the amount and direction of skew (departure from horizontal symmetry), and kurtosis tells you how tall and sharp the central peak is, relative to a standard bell curve.

Why do we care? One application is testing for normality: many inferential statistics require that a distribution be normal or nearly normal. A normal distribution has skewness and excess kurtosis of 0, so if your distribution is close to those values then it is probably close to normal.

Skewness

Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point. To calculate skewness we use Pearson's moment coefficient of skewness, which is based on the third moment about the mean.

Population Skewness

$$\text{Population Skewness} = \frac{\sum f (X - \mu)^3 / N}{\sigma^3}$$

where:
µ = population mean
N = number of observations in the population
X = class midpoint
σ = population standard deviation
f = class frequency

Sample Skewness

To get the skewness for a sample we simply use the population skewness formula and make the following adjustment:

$$\text{Sample Skewness} = \frac{\sqrt{n(n-1)}}{n-2} \times \text{Population Skewness}$$

where:
n = number of observations in the sample

The skewness for a normal distribution is zero, and any symmetric data should have a skewness near zero. Negative values for the skewness indicate data that are skewed left, and positive values indicate data that are skewed right. By skewed left, we mean that the left tail is long relative to the right tail. Similarly, skewed right means that the right tail is long relative to the left tail.
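The skewness definitions above can be sketched in Python for grouped data. This is a minimal sketch, not a definitive implementation: the function names and the toy inputs are hypothetical, and the sample adjustment factor sqrt(n(n-1))/(n-2) is assumed to be the intended one.

```python
import math

def grouped_population_skewness(midpoints, freqs):
    """Third central moment divided by the cubed population standard deviation."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(midpoints, freqs)) / n
    m2 = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs)) / n  # population variance
    m3 = sum(f * (x - mean) ** 3 for x, f in zip(midpoints, freqs)) / n  # third central moment
    return m3 / m2 ** 1.5  # m2 ** 1.5 is sigma cubed

def grouped_sample_skewness(midpoints, freqs):
    """Population skewness scaled by the assumed factor sqrt(n(n-1)) / (n-2)."""
    n = sum(freqs)
    return math.sqrt(n * (n - 1)) / (n - 2) * grouped_population_skewness(midpoints, freqs)

# Hypothetical toy data, chosen only to show the sign convention.
symmetric = grouped_population_skewness([1, 2, 3], [1, 2, 1])
right_tail = grouped_population_skewness([1, 2, 3, 10], [4, 3, 2, 1])
left_tail = grouped_population_skewness([-10, -3, -2, -1], [1, 2, 3, 4])

print(symmetric, right_tail > 0, left_tail < 0)
```

The symmetric toy data give a skewness of zero, the data with a long right tail give a positive value, and the mirrored data give a negative value of the same size.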
| Skewness (calculated value) | Distribution shape | Mean, median, mode |
|---|---|---|
| Positive | Tail to the right; values extend further to the right but are concentrated on the left | Mean > Median > Mode |
| Zero | Bell shaped or symmetrical | Mean = Median = Mode |
| Negative | Tail to the left; values extend further to the left but are concentrated on the right | Mean < Median < Mode |

In (a), the negatively skewed case, there is a long tail to the left caused by extremely small values, which pull the mean downward so that it is less than the median. In (c), the positively skewed case, there is a long tail to the right caused by extremely large values, which pull the mean upward so that it is greater than the median.

Example: Below is grouped data for the heights of a sample of 100 randomly selected male students.

| Height (Meters) | Frequency (f) |
|---|---|
| 1.51 to 1.58 | 5 |
| 1.59 to 1.66 | 18 |
| 1.67 to 1.74 | 42 |
| 1.75 to 1.82 | 27 |
| 1.83 to 1.90 | 8 |

A histogram shows that the data are skewed left, not symmetric.

[Histogram: Frequency versus Height (Meters) for the five classes]

But how highly skewed are they, compared to other data sets? To answer this question, you have to compute the skewness (using the formula for population skewness to start).

Step 1: Find the midpoint of each class. Recall that the midpoint of a class is halfway between the lower limits of two consecutive classes. It is computed by adding the lower limits of consecutive classes and dividing the result by 2. Referring to the table above, for the first class the lower class limit is 1.51m and the next lower limit is 1.59m. The class midpoint is 1.55m, found by (1.51m + 1.59m)/2.

| Height (Meters) | Class Midpoint (X) | Frequency (f) |
|---|---|---|
| 1.51 to 1.58 | 1.55 | 5 |
| 1.59 to 1.66 | 1.63 | 18 |
| 1.67 to 1.74 | 1.71 | 42 |
| 1.75 to 1.82 | 1.79 | 27 |
| 1.83 to 1.90 | 1.87 | 8 |

Step 2: Compute the arithmetic mean for the distribution.
| Height (Meters) | Class Midpoint (X) | Frequency (f) | fX |
|---|---|---|---|
| 1.51 to 1.58 | 1.55 | 5 | 7.75 |
| 1.59 to 1.66 | 1.63 | 18 | 29.34 |
| 1.67 to 1.74 | 1.71 | 42 | 71.82 |
| 1.75 to 1.82 | 1.79 | 27 | 48.33 |
| 1.83 to 1.90 | 1.87 | 8 | 14.96 |
| Total | | 100 | ΣfX = 172.2 |

Solving for the arithmetic mean we get:

$$\mu = \frac{\sum fX}{N} = \frac{172.2}{100} = 1.722$$

Step 3: Subtract the mean from the class midpoint. That is, find (X − µ).

| Height (Meters) | Class Midpoint (X) | Frequency (f) | Mean µ | (X − µ) |
|---|---|---|---|---|
| 1.51 to 1.58 | 1.55 | 5 | 1.722 | −0.172 |
| 1.59 to 1.66 | 1.63 | 18 | 1.722 | −0.092 |
| 1.67 to 1.74 | 1.71 | 42 | 1.722 | −0.012 |
| 1.75 to 1.82 | 1.79 | 27 | 1.722 | 0.068 |
| 1.83 to 1.90 | 1.87 | 8 | 1.722 | 0.148 |

Step 4: Square the difference between the class midpoint and the mean, multiply the squared difference by the class frequency, and sum. Then calculate the standard deviation.

| Class Midpoint (X) | Frequency (f) | (X − µ) | (X − µ)² | f(X − µ)² |
|---|---|---|---|---|
| 1.55 | 5 | −0.172 | 0.029584 | 0.14792 |
| 1.63 | 18 | −0.092 | 0.008464 | 0.15235 |
| 1.71 | 42 | −0.012 | 0.000144 | 0.00605 |
| 1.79 | 27 | 0.068 | 0.004624 | 0.12485 |
| 1.87 | 8 | 0.148 | 0.021904 | 0.17523 |
| Total | 100 | | | 0.6064 |

To find the standard deviation we insert these values into the formula:

$$\sigma = \sqrt{\frac{\sum f (X - \mu)^2}{N}} = \sqrt{\frac{0.6064}{100}} = \sqrt{0.006064} = 0.077872$$

Step 5: Cube the difference between the class midpoint and the mean, multiply the cubed difference by the class frequency, and sum.

| Class Midpoint (X) | Frequency (f) | (X − µ) | (X − µ)³ | f(X − µ)³ |
|---|---|---|---|---|
| 1.55 | 5 | −0.172 | −0.005088448 | −0.02544 |
| 1.63 | 18 | −0.092 | −0.000778688 | −0.01402 |
| 1.71 | 42 | −0.012 | −0.000001728 | −0.00007 |
| 1.79 | 27 | 0.068 | 0.000314432 | 0.00849 |
| 1.87 | 8 | 0.148 | 0.003241792 | 0.02593 |
| Total | 100 | | | −0.0051 |

Step 6: Use the formula to calculate population skewness.
$$\text{Population Skewness} = \frac{\sum f (X - \mu)^3 / N}{\sigma^3} = \frac{-0.0051 / 100}{(0.077872)^3} = \frac{-0.000051}{0.000472} = -0.1080$$

This would be the skewness if you had data for the whole population. However, you are dealing with a sample and must compute the sample skewness:

$$\text{Sample Skewness} = \frac{\sqrt{n(n-1)}}{n-2} \times \text{Population Skewness} = \frac{\sqrt{100 \times 99}}{98} \times (-0.1080) = 1.0153 \times (-0.1080) = -0.1097$$

If skewness is positive, the data are positively skewed or skewed right, meaning that the right tail of the distribution is longer than the left. If skewness is negative, the data are negatively skewed or skewed left, meaning that the left tail is longer. If skewness = 0, the data are perfectly symmetrical.

But a skewness of exactly zero is quite unlikely for real-world data, so how can you interpret the skewness number?

If skewness is less than −1 or greater than +1, the distribution is highly skewed.
If skewness is between −1 and −0.5 or between +0.5 and +1, the distribution is moderately skewed.
If skewness is between −0.5 and +0.5, the distribution is approximately symmetric.

With a skewness of −0.1097, the sample data for student heights are approximately symmetric.

Caution: This is an interpretation of the data you actually have. When you have data for the whole population, that’s fine. But when you have a sample, the sample skewness doesn’t necessarily apply to the whole population.

Kurtosis

If a distribution is symmetric, the next question is about the central peak: is it high and sharp, or short and broad? You can get some idea of this from the histogram, but a numerical measure is more precise.

Kurtosis is the degree of peakedness of a distribution, that is, whether it is peaked or flat relative to a normal distribution. Data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails. Data sets with low kurtosis tend to have a flat top near the mean rather than a sharp peak.

The reference standard is a normal distribution, which has a kurtosis of 3. Often, excess kurtosis is presented instead of kurtosis, where excess kurtosis is simply kurtosis − 3.
For example, the “kurtosis” reported by Excel is actually the excess kurtosis.

A normal distribution has kurtosis exactly 3 (excess kurtosis exactly 0). Any distribution with kurtosis ≈ 3 (excess ≈ 0) is called mesokurtic.
A distribution with kurtosis < 3 (excess kurtosis < 0) is called platykurtic. Compared to a normal distribution, its central peak is lower and broader, and its tails are shorter and thinner.
A distribution with kurtosis > 3 (excess kurtosis > 0) is called leptokurtic. Compared to a normal distribution, its central peak is higher and sharper, and its tails are longer and fatter.

| Term | Distribution Shape | Kurtosis | Excess Kurtosis |
|---|---|---|---|
| Leptokurtic | Peaked | Greater than 3 | Greater than 0 |
| Mesokurtic | Normal | 3 | 0 |
| Platykurtic | Flat | Less than 3 | Less than 0 |

To calculate kurtosis, we use the fourth moment. It is computed almost the same way as the coefficient of skewness: just change the exponent 3 to 4 in the formula.

Population Kurtosis

$$\text{Population Kurtosis} = \frac{\sum f (X - \mu)^4 / N}{\sigma^4}$$

where:
µ = population mean
N = number of observations in the population
X = class midpoint
σ = population standard deviation
f = class frequency

We normally want this in terms of excess kurtosis, so we simply subtract 3. Again, the excess kurtosis is generally used because the excess kurtosis of a normal distribution is 0.

$$\text{Population Excess Kurtosis} = \frac{\sum f (X - \mu)^4 / N}{\sigma^4} - 3$$

Sample Excess Kurtosis

To get the excess kurtosis for a sample we simply compute the population excess kurtosis and make the following adjustment:

$$\text{Sample Excess Kurtosis} = \frac{n-1}{(n-2)(n-3)} \left[ (n+1) \times \text{Population Excess Kurtosis} + 6 \right]$$

where:
n = number of observations in the sample

Example: Below is grouped data for the heights of a sample of 100 randomly selected male students.

From earlier:
µ = 1.722
N = 100
σ = 0.077872

| Height (Meters) | Class Midpoint (X) | Frequency (f) | fX |
|---|---|---|---|
| 1.51 to 1.58 | 1.55 | 5 | 7.75 |
| 1.59 to 1.66 | 1.63 | 18 | 29.34 |
| 1.67 to 1.74 | 1.71 | 42 | 71.82 |
| 1.75 to 1.82 | 1.79 | 27 | 48.33 |
| 1.83 to 1.90 | 1.87 | 8 | 14.96 |
| Total | | 100 | 172.2 |

Use the formula to calculate population kurtosis.
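| Class Midpoint (X) | Mean µ | (X − µ) | (X − µ)⁴ | f(X − µ)⁴ |
|---|---|---|---|---|
| 1.55 | 1.722 | −0.172 | 0.000875213 | 0.004376065 |
| 1.63 | 1.722 | −0.092 | 0.000071639 | 0.001289507 |
| 1.71 | 1.722 | −0.012 | 0.000000021 | 0.000000871 |
| 1.79 | 1.722 | 0.068 | 0.000021381 | 0.000577297 |
| 1.87 | 1.722 | 0.148 | 0.000479785 | 0.003838282 |
| Total | | | | 0.010082022 |

Since σ² = 0.006064, we have σ⁴ = (0.006064)², and:

$$\text{Population Kurtosis} = \frac{\sum f (X - \mu)^4 / N}{\sigma^4} = \frac{0.010082022 / 100}{(0.006064)^2} = \frac{0.00010082}{0.00003677} = 2.7418$$

Find the Population Excess Kurtosis

$$\text{Population Excess Kurtosis} = 2.7418 - 3 = -0.2582$$

Find the Sample Excess Kurtosis using the formula

$$\text{Sample Excess Kurtosis} = \frac{n-1}{(n-2)(n-3)} \left[ (n+1) \times \text{Population Excess Kurtosis} + 6 \right] = \frac{99}{98 \times 97} \left[ 101 \times (-0.2582) + 6 \right] = -0.2091$$

This sample is slightly platykurtic: its peak is just a bit shallower than the peak of a normal distribution.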
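The kurtosis calculation for the height data can be checked with a short Python sketch. It uses the class midpoints and frequencies from the example. The variable names are illustrative, and the sample adjustment (n−1)/((n−2)(n−3)) × [(n+1) × excess + 6] is the commonly used form, which is an assumption about the intended formula.

```python
# Class midpoints (meters) and frequencies from the height example.
midpoints = [1.55, 1.63, 1.71, 1.79, 1.87]
freqs = [5, 18, 42, 27, 8]

n = sum(freqs)
mean = sum(f * x for x, f in zip(midpoints, freqs)) / n

# Second and fourth central moments, dividing by n (population form).
m2 = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs)) / n
m4 = sum(f * (x - mean) ** 4 for x, f in zip(midpoints, freqs)) / n

population_kurtosis = m4 / m2 ** 2          # m2 ** 2 is sigma to the fourth power
population_excess = population_kurtosis - 3  # a normal distribution scores 0

# Assumed sample adjustment (see the lead-in).
sample_excess = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * population_excess + 6)

print(round(population_kurtosis, 4), round(population_excess, 4), round(sample_excess, 4))
# prints 2.7418 -0.2582 -0.2091
```

Dividing by the squared variance rather than by a rounded standard deviation raised to the fourth power keeps rounding error out of the last decimal place.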