Skewness & Kurtosis: Reference
Source: http://mathworld.wolfram.com/NormalDistribution.html

Further Moments – Skewness
• Skewness measures the degree of asymmetry exhibited by the data
$$\text{skewness} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n s^3}$$
• If skewness equals zero, the histogram is symmetric about the mean
• Positive skewness vs negative skewness
• Skewness measured in this way is sometimes referred to as “Fisher’s skewness”

Further Moments – Skewness
[Figure: two skewed distributions, A and B, with the mode, median, and mean marked]
Source: http://library.thinkquest.org/10030/3smodsas.htm

Further Moments – Skewness (worked example)
[Figure: histogram of the example values with the median and mean marked]
n = 26, mean = 4.23, median = 3.5, mode = 3

Value   Occurrences   Deviation               Cubed deviation        Occurrences × cubed deviation
1       1             (1 – 4.23) = -3.23      (-3.23)^3 = -33.70     -33.70
2       4             (2 – 4.23) = -2.23      (-2.23)^3 = -11.09     -44.36
3       8             (3 – 4.23) = -1.23      (-1.23)^3 = -1.86      -14.89
4       4             (4 – 4.23) = -0.23      (-0.23)^3 = -0.01      -0.05
5       3             (5 – 4.23) = 0.77       (+0.77)^3 = 0.46       1.37
6       2             (6 – 4.23) = 1.77       (+1.77)^3 = 5.54       11.09
7       1             (7 – 4.23) = 2.77       (+2.77)^3 = 21.25      21.25
8       1             (8 – 4.23) = 3.77       (+3.77)^3 = 53.58      53.58
9       1             (9 – 4.23) = 4.77       (+4.77)^3 = 108.53     108.53
10      1             (10 – 4.23) = 5.77      (+5.77)^3 = 192.10     192.10

Mean = 4.23, s = 2.27
Sum = 294.94
Skewness = 294.94 / (26 × 2.27^3) = 0.97

[Figure: distribution with the mode, median, and mean marked] Skewness > 0 (positively skewed)
[Figure: distributions A and B with the mode, median, and mean marked] Skewness < 0 (negatively skewed)
[Figure: normal curve] Skewness = 0 (symmetric distribution)
Source: http://mathworld.wolfram.com/NormalDistribution.html

Skewness – Review
• Positive skewness
  – There are more observations below the mean than above it
  – Occurs when the mean is greater than the median
• Negative skewness
  – There are a small number of low observations and a large number of high ones
  – Occurs when the median is greater than the mean

Kurtosis – Review
• Kurtosis measures how peaked the histogram is (Karl Pearson, 1905)
$$\text{kurtosis} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n s^4} - 3$$
• The kurtosis of a normal distribution is 0
• Kurtosis characterizes the relative peakedness or flatness of a distribution compared to the normal distribution

Kurtosis – Review
• Platykurtic – When the kurtosis < 0, the frequencies throughout the curve are closer to equal (i.e., the curve is flatter and wider)
• Thus, negative kurtosis indicates a relatively flat distribution
• Leptokurtic – When the kurtosis > 0, there are high frequencies in only a small part of the curve (i.e., the curve is more peaked)
• Thus, positive kurtosis indicates a relatively peaked distribution
Source: http://espse.ed.psu.edu/Statistics/Chapters/Chapter3/Chap3.html

Measures of central tendency – Review
• Measures of the location of the middle or the center of a distribution
• Mean
• Median
• Mode

Mean – Review
• Mean – Average value of a distribution; the most commonly used measure of central tendency
• Median – The value of a variable such that half of the observations are above it and half are below it, i.e., this value divides the distribution into two groups of equal size
• Mode – The most frequently occurring value in the distribution

An Example Data Set
• Daily low temperatures recorded in Chapel Hill (01/18-01/31, 2005, °F)
Jan. 18 – 11
Jan. 19 – 11
Jan. 20 – 25
Jan. 21 – 29
Jan. 22 – 27
Jan. 23 – 14
Jan. 24 – 11
Jan. 25 – 25
Jan. 26 – 33
Jan. 27 – 22
Jan. 28 – 18
Jan. 29 – 19
Jan. 30 – 30
Jan. 31 – 27
• For these 14 values, we will calculate all three measures of central tendency: the mean, median, and mode

Mean – Review
• Mean – Most commonly used measure of central tendency
• Procedures
• (1) Sum all the values in the data set
• (2) Divide the sum by the number of values in the data set
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
• Watch for outliers

Mean – Review
• (1) Sum all the values in the data set
11 + 11 + 11 + 14 + 18 + 19 + 22 + 25 + 25 + 27 + 27 + 29 + 30 + 33 = 302
• (2) Divide the sum by the number of values in the data set
Mean = 302/14 = 21.57 (°F)
• Is this a good measure of central tendency for this data set?
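The two steps for the mean can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the variable names and the helper `mean` are made up here, and the 90 °F reading used to show sensitivity to outliers is hypothetical and does not appear in the data set.

```python
# Daily low temperatures (°F), Chapel Hill, Jan. 18-31, 2005, in date order
tmin = [11, 11, 25, 29, 27, 14, 11, 25, 33, 22, 18, 19, 30, 27]


def mean(values):
    """Arithmetic mean: (1) sum the values, (2) divide by how many there are."""
    return sum(values) / len(values)


total = sum(tmin)   # step (1)
x_bar = mean(tmin)  # steps (1) and (2) together
print(total, round(x_bar, 2))  # prints: 302 21.57

# "Watch for outliers": one hypothetical extreme reading (NOT in the data set)
# pulls the mean upward noticeably.
with_outlier = tmin + [90]
print(round(mean(with_outlier), 2))
```

The last line is there only to show why the slides warn about outliers: a single extreme value shifts the mean by several degrees.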
Median – Review
• Median – Half of the values are above it and half are below it
• (1) Sort the data in ascending order
• (2) Find the value with an equal number of values above and below it
• (3) Odd number of observations: take the [(n-1)/2]+1 value from the lowest
• (4) Even number of observations: average the (n/2) and [(n/2)+1] values
• (5) Use the median with asymmetric distributions, particularly with outliers

Median – Review
• (1) Sort the data in ascending order:
11, 11, 11, 14, 18, 19, 22, 25, 25, 27, 27, 29, 30, 33
• (2) Find the value with an equal number of values above and below it
Even number of observations: average the (n/2) and [(n/2)+1] values
(14/2) = 7; [(14/2)+1] = 8
(22+25)/2 = 23.5 (°F)
• Is this a good measure of central tendency for this data?

Mode – Review
• Mode – The most frequently occurring value in the distribution
• (1) Sort the data in ascending order
• (2) Count the instances of each value
• (3) Find the value that has the most occurrences
• If more than one value occurs an equal number of times and these counts exceed all others, we have multiple modes
• Use the mode for multi-modal data

Mode – Review
• (1) Sort the data in ascending order:
11, 11, 11, 14, 18, 19, 22, 25, 25, 27, 27, 29, 30, 33
• (2) Count the instances of each value:
Value   11   14   18   19   22   25   27   29   30   33
Count   3x   1x   1x   1x   1x   2x   2x   1x   1x   1x
• (3) Find the value that has the most occurrences
mode = 11 (°F)
• Is this a good measure of the central tendency of this data set?

Measures of Dispersion – Review
• In addition to measures of central tendency, we can also summarize data by characterizing its variability
• Measures of dispersion are concerned with the distribution of values around the mean in data:
  – Range
  – Interquartile range
  – Variance
  – Standard deviation
  – z-scores
  – Coefficient of Variation (CV)

An Example Data Set
• Daily low temperatures recorded in Chapel Hill (01/18-01/31, 2005, °F)
Jan. 18 – 11
Jan. 19 – 11
Jan. 20 – 25
Jan. 21 – 29
Jan. 22 – 27
Jan. 23 – 14
Jan. 24 – 11
Jan. 25 – 25
Jan. 26 – 33
Jan. 27 – 22
Jan. 28 – 18
Jan. 29 – 19
Jan. 30 – 30
Jan. 31 – 27
• For these 14 values, we will calculate all measures of dispersion

Range – Review
• Range – The difference between the largest and the smallest values
• (1) Sort the data in ascending order
• (2) Find the largest value: max
• (3) Find the smallest value: min
• (4) Calculate the range: range = max - min
• Vulnerable to the influence of outliers

Range – Review
• (1) Sort the data in ascending order
11, 11, 11, 14, 18, 19, 22, 25, 25, 27, 27, 29, 30, 33
• (2) Find the largest value: max = 33
• (3) Find the smallest value: min = 11
• (4) Calculate the range: range = 33 – 11 = 22 (°F)

Interquartile Range – Review
• Interquartile range – The difference between the 25th and 75th percentiles
• (1) Sort the data in ascending order
• (2) Find the 25th percentile: the (n+1)/4 observation
• (3) Find the 75th percentile: the 3(n+1)/4 observation
• (4) The interquartile range is the difference between these two percentiles

Interquartile Range – Review
• (1) Sort the data in ascending order
11, 11, 11, 14, 18, 19, 22, 25, 25, 27, 27, 29, 30, 33
• (2) Find the 25th percentile: the (n+1)/4 observation
(14+1)/4 = 3.75, i.e., three quarters of the way from the 3rd value (11) to the 4th value (14)
11 + (14-11)*0.75 = 13.25
• (3) Find the 75th percentile: the 3(n+1)/4 observation
3(14+1)/4 = 11.25, i.e., one quarter of the way from the 11th value (27) to the 12th value (29)
27 + (29-27)*0.25 = 27.5
• (4) The interquartile range is the difference between these two percentiles
27.5 – 13.25 = 14.25 (°F)

Variance – Review
• Variance is the sum of squared deviations from the mean, divided by the population size (for a population) or by the sample size minus one (for a sample):
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Variance – Review
• (1) Calculate the mean: $\bar{x}$
• (2) Calculate the deviation for each value: $x_i - \bar{x}$
• (3) Square each of the deviations: $(x_i - \bar{x})^2$
• (4) Sum the squared deviations: $\sum (x_i - \bar{x})^2$
• (5) Divide the sum of squares by (n-1) for a sample: $\sum (x_i - \bar{x})^2 / (n - 1)$

Variance – Review
• (1) Calculate the mean: $\bar{x}$ = 21.57
• (2) Calculate the deviation for each value: $x_i - \bar{x}$
Jan. 18   (11 – 21.57) = -10.57
Jan. 19   (11 – 21.57) = -10.57
Jan. 20   (25 – 21.57) = 3.43
Jan. 21   (29 – 21.57) = 7.43
Jan. 22   (27 – 21.57) = 5.43
Jan. 23   (14 – 21.57) = -7.57
Jan. 24   (11 – 21.57) = -10.57
Jan. 25   (25 – 21.57) = 3.43
Jan. 26   (33 – 21.57) = 11.43
Jan. 27   (22 – 21.57) = 0.43
Jan. 28   (18 – 21.57) = -3.57
Jan. 29   (19 – 21.57) = -2.57
Jan. 30   (30 – 21.57) = 8.43
Jan. 31   (27 – 21.57) = 5.43

Variance – Review
• (3) Square each of the deviations: $(x_i - \bar{x})^2$ (squares are computed from the unrounded deviations)
Jan. 18   (-10.57)^2 = 111.76
Jan. 19   (-10.57)^2 = 111.76
Jan. 20   (3.43)^2 = 11.76
Jan. 21   (7.43)^2 = 55.18
Jan. 22   (5.43)^2 = 29.47
Jan. 23   (-7.57)^2 = 57.33
Jan. 24   (-10.57)^2 = 111.76
Jan. 25   (3.43)^2 = 11.76
Jan. 26   (11.43)^2 = 130.61
Jan. 27   (0.43)^2 = 0.18
Jan. 28   (-3.57)^2 = 12.76
Jan. 29   (-2.57)^2 = 6.61
Jan. 30   (8.43)^2 = 71.04
Jan. 31   (5.43)^2 = 29.47
• (4) Sum the squared deviations: $\sum (x_i - \bar{x})^2$ = 751.43

Variance – Review
• (5) Divide the sum of squares by (n-1) for a sample:
$\sum (x_i - \bar{x})^2 / (n - 1)$ = 751.43 / (14-1) = 57.8
• The variance of the Tmin data set (Chapel Hill) is 57.8 (°F^2)

Standard Deviation – Review
• Standard deviation is equal to the square root of the variance
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
• Compared with variance, standard deviation is expressed in the same units as the mean and the original data

Standard Deviation – Review
• (1) Calculate the mean: $\bar{x}$
• (2) Calculate the deviation for each value: $x_i - \bar{x}$
• (3) Square each of the deviations: $(x_i - \bar{x})^2$
• (4) Sum the squared deviations: $\sum (x_i - \bar{x})^2$
• (5) Divide the sum of squares by (n-1) for a sample: $\sum (x_i - \bar{x})^2 / (n - 1)$
• (6) Take the square root of the resulting variance: $\sqrt{\sum (x_i - \bar{x})^2 / (n - 1)}$

Standard Deviation – Review
• (1) – (5) $s^2$ = 57.8
• (6) Take the square root of the variance: $\sqrt{57.8} \approx 7.6$
• The standard deviation (s) of the Tmin data set (Chapel Hill) is 7.6 (°F)

z-score – Review
• Since data come from distributions with different means and different degrees of variability, it is common to standardize observations
• One way to do this is to transform each observation into a z-score
$$z = \frac{x_i - \bar{x}}{s}$$
• May be interpreted as the number of standard deviations an observation is away from the mean

z-scores – Review
• A z-score is the number of standard deviations an observation is away from the mean
• (1) Calculate the mean: $\bar{x}$
• (2) Calculate the deviation: $x_i - \bar{x}$
• (3) Calculate the standard deviation: $s = \sqrt{\sum (x_i - \bar{x})^2 / (n - 1)}$
• (4) Divide the deviation by the standard deviation: $z = (x_i - \bar{x}) / s$

z-scores – Review
• z-score for the maximum Tmin value (33 °F)
• (1) Calculate the mean: $\bar{x}$ = 21.57
• (2) Calculate the deviation: $x_i - \bar{x}$ = 33 – 21.57 = 11.43
• (3) Calculate the standard deviation (SD): $s = \sqrt{\sum (x_i - \bar{x})^2 / (n - 1)}$ = 7.6
• (4) Divide the deviation by the standard deviation: $z = (x_i - \bar{x}) / s$ = 11.43 / 7.6 = 1.50

Coefficient of Variation – Review
• Coefficient of variation (CV) measures the spread of a set of data as a proportion of its mean
• It is the ratio of the sample standard deviation to the sample mean
$$CV = \frac{s}{\bar{x}} \times 100\%$$
• It is sometimes expressed as a percentage
• There is an equivalent definition for the coefficient of variation of a population

Coefficient of Variation – Review
• (1) Calculate the mean: $\bar{x}$
• (2) Calculate the standard deviation: $s = \sqrt{\sum (x_i - \bar{x})^2 / (n - 1)}$
• (3) Divide the standard deviation by the mean: $CV = (s / \bar{x}) \times 100\%$

Coefficient of Variation – Review
• (1) Calculate the mean: $\bar{x}$ = 21.57
• (2) Calculate the standard deviation: $s = \sqrt{\sum (x_i - \bar{x})^2 / (n - 1)}$ = 7.6
• (3) Divide the standard deviation by the mean: $CV = (s / \bar{x}) \times 100\%$ = 7.6 / 21.57 × 100% ≈ 35.2%

Histograms – Review
• We may also summarize our data by constructing histograms, which are vertical bar graphs
• A histogram is used to graphically summarize the distribution of a data set
• A histogram divides the range of values in a data set into intervals
• Over each interval is placed a bar whose height represents the percentage of data values in the interval.
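As a rough sketch of the interval idea described above, the Python below groups the example temperatures into classes and reports the count and percentage in each one. The class layout (integer classes 5 °F wide, starting at 11) and the helper name `grouped_percentages` are assumptions chosen for illustration; other class choices are equally valid and will change the shape of the histogram.

```python
# Daily low temperatures (°F), Chapel Hill, Jan. 18-31, 2005, in date order
tmin = [11, 11, 25, 29, 27, 14, 11, 25, 33, 22, 18, 19, 30, 27]


def grouped_percentages(values, start, width, n_classes):
    """Count integer values per class [lo, lo + width - 1], then convert to percentages.

    The class layout is an assumption for this sketch, not a rule.
    """
    classes = [(start + k * width, start + (k + 1) * width - 1)
               for k in range(n_classes)]
    counts = [sum(lo <= v <= hi for v in values) for lo, hi in classes]
    percents = [100 * c / len(values) for c in counts]
    return classes, counts, percents


classes, counts, percents = grouped_percentages(tmin, start=11, width=5, n_classes=5)

for (lo, hi), c, p in zip(classes, counts, percents):
    # One text "bar" per class; the bar length equals the class count
    print(f"{lo}-{hi}: {'#' * c} {c} ({p:.1f}%)")
```

Printing `#` characters stands in for the bars of a plotted histogram; a plotting library would draw the same counts or percentages as bar heights.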
Building a Histogram – Review
• (1) Develop an ungrouped frequency table
11, 11, 11, 14, 18, 19, 22, 25, 25, 27, 27, 29, 30, 33

Value   Frequency
11      3
14      1
18      1
19      1
22      1
25      2
27      2
29      1
30      1
33      1

Building a Histogram – Review
• (2) Construct a grouped frequency table: select a set of classes

Class   Frequency
11-15   4
16-20   2
21-25   3
26-30   4
31-35   1

Building a Histogram – Review
• (3) Plot the frequencies of each class

Box Plots – Review
• We can also use a box plot to graphically summarize a data set
• A box plot represents a graphical summary of what is sometimes called a “five-number summary” of the distribution
  – Minimum
  – Maximum
  – 25th percentile
  – 75th percentile
  – Median
• Interquartile Range (IQR)
[Figure: box plot with the max., 75th percentile, median, 25th percentile, and min. labeled. Rogerson, p. 8.]

Boxplot – Review
[Figure: box plot]

Further Moments of the Distribution
• While measures of dispersion are useful for helping us describe the width of the distribution, they tell us nothing about the shape of the distribution
Source: Earickson, RJ, and Harlin, JM. 1994. Geographic Measurement and Quantitative Analysis. USA: Macmillan College Publishing Co., p. 91.

Skewness – Review
• Skewness measures the degree of asymmetry exhibited by the data
• Positive skewness – More observations below the mean than above it
• Negative skewness – A small number of low observations and a large number of high ones
$$\text{skewness} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n s^3}$$
For the example data set: Skewness = -0.1851 (negatively skewed)

Kurtosis – Review
• Kurtosis measures how peaked the histogram is
• Leptokurtic: a high degree of peakedness – Values of kurtosis over 0
• Platykurtic: flat histograms – Values of kurtosis less than 0
$$\text{kurtosis} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n s^4} - 3$$
For the example data set: Kurtosis = -1.54 < 0 (platykurtic)
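The skewness and kurtosis values quoted for the example data set can be checked with a short Python sketch. It follows the formulas as written on these slides, using the sample standard deviation s (divisor n-1) in the denominator; the function names are made up for this illustration, and statistical packages that use other conventions may report slightly different numbers.

```python
import math

# Daily low temperatures (°F), Chapel Hill, Jan. 18-31, 2005, in date order
tmin = [11, 11, 25, 29, 27, 14, 11, 25, 33, 22, 18, 19, 30, 27]


def sample_std(values):
    """Sample standard deviation: square root of sum((x - mean)^2) / (n - 1)."""
    n = len(values)
    x_bar = sum(values) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))


def skewness(values):
    """Sum of cubed deviations divided by n * s^3 (the slides' formula)."""
    n = len(values)
    x_bar = sum(values) / n
    s = sample_std(values)
    return sum((x - x_bar) ** 3 for x in values) / (n * s ** 3)


def kurtosis(values):
    """Sum of fourth-power deviations divided by n * s^4, minus 3 (the slides' formula)."""
    n = len(values)
    x_bar = sum(values) / n
    s = sample_std(values)
    return sum((x - x_bar) ** 4 for x in values) / (n * s ** 4) - 3


print(round(sample_std(tmin), 1))  # prints: 7.6
print(round(skewness(tmin), 4))    # prints: -0.1851
print(round(kurtosis(tmin), 2))    # prints: -1.54
```

Both results are negative, matching the slides' reading of the data as slightly negatively skewed and platykurtic.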