Looking at data: distributions - Describing distributions with numbers
IPS section 1.2
© 2006 W.H. Freeman and Company (authored by Brigitte Baldi, University of California-Irvine; adapted by Jim Brumbaugh-Smith, Manchester College)

Objectives
Describing distributions with numbers
- Describe center of a set of data
- Describe positions within a set of data
- Represent quartiles graphically
- Identify outliers mathematically
- Describe amount of variation (or “spread”) in a set of data
- Choose appropriate summary statistics
- Describe effects of linear transformations

Terminology
- Measures of center: mean ($\bar{x}$), median (M), mode
- Measures of position: percentiles, quartiles (Q1 and Q3)
- Five-number summary
- Boxplot (regular and modified)
- Measures of spread: range, interquartile range (IQR), variance ($s^2$), standard deviation (s)

Measure of center: the mean
The mean (or arithmetic average)
To calculate the mean ($\bar{x}$), add all values, then divide by the number of observations.
The sum of the heights is 1598.3; divided by 25 women, this gives a mean of 63.9 inches.

58.2  59.5  60.7  60.9  61.9
61.9  62.2  62.2  62.4  62.9
63.1  63.9  63.9  64.0  64.1
64.5  64.8  65.2  65.7  66.2
66.7  67.1  67.8  68.9  69.6

Mathematical notation
n: number of values (i.e., observations) in data set
$x_i$: data value number i, for $x_1, x_2, \ldots, x_n$
Σ: sum up the expression that follows (Σ is the Greek upper case “sigma”)

woman (i)   height (x)      woman (i)   height (x)
i = 1       x1 = 58.2       i = 14      x14 = 64.0
i = 2       x2 = 59.5       i = 15      x15 = 64.1
i = 3       x3 = 60.7       i = 16      x16 = 64.5
i = 4       x4 = 60.9       i = 17      x17 = 64.8
i = 5       x5 = 61.9       i = 18      x18 = 65.2
i = 6       x6 = 61.9       i = 19      x19 = 65.7
i = 7       x7 = 62.2       i = 20      x20 = 66.2
i = 8       x8 = 62.2       i = 21      x21 = 66.7
i = 9       x9 = 62.4       i = 22      x22 = 67.1
i = 10      x10 = 62.9      i = 23      x23 = 67.8
i = 11      x11 = 63.1      i = 24      x24 = 68.9
i = 12      x12 = 63.9      i = 25      x25 = 69.6
i = 13      x13 = 63.9      n = 25      Σ = 1598.3

Mathematical notation:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
$$\bar{x} = \frac{1598.3}{25} \approx 63.9$$

Your numerical summary must be meaningful.
Height of 25 women in a class ($\bar{x} = 63.9$): The distribution of women’s heights appears coherent and fairly symmetrical. The mean is a good numerical summary.
Second histogram ($\bar{x} = 69.6$): Here the shape of the distribution is wildly irregular. Why? Could we have more than one plant species or phenotype?
[Figure: Height of Plants by Color. Number of plants against height in centimeters for red, pink, and blue plants, with group means marked at $\bar{x} = 63.9$, $\bar{x} = 70.5$, and $\bar{x} = 78.3$.]
A single numerical summary here would not make sense.

Measure of center: the median
The median (M) is the midpoint of a distribution: the number such that half of the observations are smaller and half are larger.

Sorted data (n = 25):
0.6  1.2  1.5  1.6  1.9  2.1  2.3  2.3  2.5  2.8  2.9  3.3  3.4
3.6  3.7  3.8  3.9  4.1  4.2  4.5  4.7  4.9  5.3  5.6  6.1

1. Sort observations in increasing order; n = number of observations.
2. If n is odd, the median is the exact middle value.
   n = 25: (n+1)/2 = 26/2 = 13, so the median is the 13th value. Median = 3.4
3. If n is even, the median is the mean of the two middle observations.
   n = 24 (the same data without the largest value, 6.1): n/2 = 12, so the median is the mean of the 12th and 13th values. Median = (3.3+3.4)/2 = 3.35

Comparing the mean and the median
The mean and the median are approximately equal if the distribution is roughly symmetrical.
The median is resistant to skewness and outliers, staying near the main peak.
The mean is not resistant, being pulled in the direction of outliers or skewness.
[Figure: positions of the mean and median for a symmetric distribution, a left-skewed distribution, and a right-skewed distribution.]

Mean and median of a distribution with outliers
[Figure: two histograms of percent of people dying. Without the outliers, $\bar{x} = 3.3$; with the outliers, 14 and 14, $\bar{x} = 4.1$.]
The mean is pulled quite a bit to the right by the two high outliers (from 3.3 up to 4.1).
The median is only slightly pulled to the right by the outliers (from 3.4 up to 3.6).

Impact of skewed data
Mean and median of symmetric data
Disease X: M = 3.4, $\bar{x} = 3.3$. Mean and median are nearly the same.
… and for right-skewed distribution
Multiple myeloma: M = 2.5, $\bar{x} = 3.4$. Mean is pulled toward the skewness (i.e., longer tail).

Measure of spread: the quartiles
The first quartile, Q1, is a value that has 25% (one fourth) of the data at or below it (it is the median of the lower half of the sorted data, excluding M).
The third quartile, Q3, is a value that has 75% (three fourths) of the data at or below it (it is the median of the upper half of the sorted data, excluding M).

Lower half (ranks 1 to 12):   0.6  1.2  1.5  1.6  1.9  2.1  2.3  2.3  2.5  2.8  2.9  3.3
Rank 13:                      3.4
Upper half (ranks 14 to 25):  3.6  3.7  3.8  3.9  4.1  4.2  4.5  4.7  4.9  5.3  5.6  6.1

Q1 = first quartile = 2.2
M = median = 3.4
Q3 = third quartile = 4.35

Five-number summary and boxplot
Largest = max = 6.1
Q3 = third quartile = 4.35
M = median = 3.4
Q1 = first quartile = 2.2
Smallest = min = 0.6
[Figure: boxplot for Disease X, years until death on the vertical axis, with the upper “whisker” reaching the max and the lower “whisker” reaching the min.]
Five-number summary: min, Q1, M, Q3, max

Boxplots for skewed data
Comparing box plots for a symmetric and a right-skewed distribution
[Figure: side-by-side boxplots for Disease X and Multiple Myeloma, years until death on the vertical axis.]
Boxplots remain true to the data and clearly depict symmetry or skewness.

IQR Test for Outliers (or “1.5 IQR Criterion”)
Outliers are troublesome data points; it is important to be able to identify them.
In a boxplot, outliers are far beneath or far above the box (i.e., far below Q1 or above Q3).
Define the interquartile range (IQR) to be the height of the box: IQR = Q3 − Q1 (distance between Q1 and Q3).
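The median, quartile, and five-number-summary rules above can be sketched in Python. This is only an illustration under stated assumptions: the helper name `five_number_summary` is made up for this example, the quartiles use the slides' median-of-halves rule (statistical software may use other quartile conventions and give slightly different values), and the function needs at least two observations.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, M, Q3, max) using the median-of-halves rule.

    Hypothetical helper for illustration; requires at least two values.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]                                  # values below the median position
    upper = data[half + 1:] if n % 2 else data[half:]    # skip M itself when n is odd
    return data[0], median(lower), median(data), median(upper), data[-1]

# Disease X data (years until death) as listed on the slides
years = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]

minimum, q1, m, q3, maximum = five_number_summary(years)
iqr = q3 - q1  # height of the box in a boxplot

print(f"min={minimum}, Q1={q1:.2f}, M={m}, Q3={q3:.2f}, max={maximum}")
print(f"IQR = {iqr:.2f}")
```

For the Disease X data this should agree with the values on the slides (Q1 = 2.2, M = 3.4, Q3 = 4.35), up to floating-point rounding.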
We identify an observation as an outlier if it falls more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile.
If X < Q1 − 1.5(IQR) then X is considered a low outlier.
If X > Q3 + 1.5(IQR) then X is considered a high outlier.
Create a modified boxplot by plotting outliers separately and extending the whiskers to the lowest and highest non-outliers.

Sorted data, largest (rank 25) to smallest (rank 1):
7.9  6.1  5.6  5.3  4.7  4.5  4.2  4.1  3.9  3.8  3.7  3.6  3.4
3.3  2.9  2.8  2.5  2.3  2.3  2.1  1.9  1.6  1.5  1.2  0.6

[Figure: modified boxplot for Disease X, years until death on the vertical axis, marking Q1 = 2.2, Q3 = 4.35, the largest non-outlier 6.1, and the cutoff 7.575.]

Interquartile range IQR = Q3 − Q1 = 4.35 − 2.2 = 2.15
1.5(IQR) = 1.5(2.15) = 3.225
Observation #25 has a value of 7.9 years, a possible high outlier.
Q3 + 1.5(IQR) = 4.35 + 3.225 = 7.575
Since 7.9 > 7.575 it is considered an outlier, so use the modified boxplot.

Measures of spread: the standard deviation
Measures of variation or spread answer the question, “How much is the data set as a whole spread out?”
- Range: distance from smallest data value to largest; range = max − min. Highly sensitive to outliers since it depends solely on the two most extreme values.
- Interquartile range: IQR = Q3 − Q1. Better than the overall range since it is resistant to outliers.
- Variance and standard deviation: each measures variation from the mean.

Standard deviation
The standard deviation (s) describes variation above and below the mean. Like the mean, it is not resistant to skewness or outliers.
1. First calculate the variance $s^2$.
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
2. Then take the square root to get the standard deviation s.
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

Calculations …
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
Mean = 63.4
Sum of squared deviations from mean = 85.2
Degrees of freedom (df) = n − 1 = 13
$s^2$ = variance = 85.2/13 = 6.55 inches squared
s = standard deviation = √6.55 = 2.56 inches

Women’s height (inches)
i      x_i    x̄      (x_i − x̄)   (x_i − x̄)²
1      59     63.4    -4.4        19.0
2      60     63.4    -3.4        11.3
3      61     63.4    -2.4         5.6
4      62     63.4    -1.4         1.8
5      62     63.4    -1.4         1.8
6      63     63.4    -0.4         0.1
7      63     63.4    -0.4         0.1
8      63     63.4    -0.4         0.1
9      64     63.4     0.6         0.4
10     64     63.4     0.6         0.4
11     65     63.4     1.6         2.7
12     66     63.4     2.6         7.0
13     67     63.4     3.6        13.3
14     68     63.4     4.6        21.6
Mean   63.4           Sum 0.0     Sum 85.2

SPSS output for summary statistics:
From menu: Analyze > Descriptive Statistics > Explore
Displays common statistics of your sample data: $\bar{x}$, M, $s^2$, s, min, max, range, IQR

Descriptives: Height
                                                 Statistic   Std. Error
Mean                                             63.3571     .68426
95% Confidence Interval for Mean   Lower Bound   61.8789
                                   Upper Bound   64.8354
5% Trimmed Mean                                  63.3413
Median                                           63.0000
Variance                                         6.555
Std. Deviation                                   2.56026
Minimum                                          59.00
Maximum                                          68.00
Range                                            9.00
Interquartile Range                              3.50
Skewness                                         .177        .597
Kurtosis                                         -.360       1.154

Comments on standard deviation
- Standard deviation is generally positive (and never negative!). s = 0 only when all data values are identical, which is not very interesting data!
- A larger standard deviation means more variation in the data (i.e., data is spread out farther from the mean).
- Standard deviation has the same units as the original data (while variance does not).

Choosing measures of center and spread:
- Mean and standard deviation are more precise (since based on actual data values); have nice mathematical properties but are not resistant.
- Median and IQR are less precise (since based only on positions); are resistant to outliers, errors and skewness.

Choosing among summary statistics
- Since the mean and std. deviation are not resistant, use them only to describe distributions that are fairly symmetrical with no outliers.
- If clear outliers or strong skewness are present, use the median and IQR.
- Don’t mix & match; use either $\bar{x}$ and s, or M and IQR.
Similar to a boxplot representing median and quartiles, the mean and std. dev. can be represented by using error bars.
[Figure: Height of 25 Women, height in inches on the vertical axis. Left panel: boxplot. Right panel: mean ± SD, with error bars at $\bar{x} - s$, $\bar{x}$, and $\bar{x} + s$.]

Mean or Median #1
Which should you use (and why): mean or median?
- Middletown is considering imposing an income tax on citizens. City hall wants a numerical summary of its citizens’ income to estimate the total tax base.
- In a study of standard of living of families in Middletown, a sociologist desires a numerical summary of “typical” family income in that city.

Mean or Median #2
You are planning to buy a home in Middletown. You ask your real estate agent what the “average” home value is in the neighborhood you are considering.
- Which would be more useful to you as the home buyer: the mean or the median?
- Which might the real estate agent be tempted to tell you is the “average” home value? Why?

Changing the unit of measurement
Variables can be recorded in different units of measurement. Most often, one measurement unit is a linear transformation of another measurement unit: $x_{new} = a + bx$.
Temperatures can be expressed in degrees Fahrenheit (F) or degrees Celsius (C): C = (5/9)*F − 160/9
Linear transformations do not change the basic shape of a distribution (skewness, symmetry, modes, outliers). But they do change the measures of center and spread:
- Multiplying each observation by a positive number b multiplies both measures of center (mean, median) and spread (IQR, s) by b.
- Adding the same number a (positive or negative) to each observation adds a to all measures of center and quartiles but it does not change measures of spread (IQR, s).

Changing degrees Fahrenheit to Celsius
           Fahrenheit   Celsius
Mean       25.73        (5/9)*25.73 − 160/9 = −3.48
Std Dev    5.12         (5/9)*5.12 = 2.84
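
The two transformation rules above can be checked numerically. The sketch below is only an illustration: the Fahrenheit readings are invented for the example (the slides do not list the underlying temperature data), and `linear_transform` is a hypothetical helper name. It uses `statistics.stdev`, which divides by n − 1 like the formula for s.

```python
from statistics import mean, median, stdev

def linear_transform(values, a, b):
    """Apply x_new = a + b*x to every observation (hypothetical helper)."""
    return [a + b * x for x in values]

# Fahrenheit to Celsius: C = (5/9)F - 160/9
A, B = -160 / 9, 5 / 9

# Invented Fahrenheit readings, for illustration only (not from the slides)
fahrenheit = [18.0, 21.5, 24.0, 25.5, 27.0, 30.5, 33.5]
celsius = linear_transform(fahrenheit, A, B)

# Center: the transformed mean and median equal a + b * (original value)
print(mean(celsius), A + B * mean(fahrenheit))
print(median(celsius), A + B * median(fahrenheit))

# Spread: s is multiplied by b; the shift a drops out
print(stdev(celsius), B * stdev(fahrenheit))

# The summary values in the table follow the same rules
mean_c = A + B * 25.73   # mean in Celsius
sd_c = B * 5.12          # standard deviation in Celsius
print(round(mean_c, 2), round(sd_c, 2))
```

Each of the first three lines prints a pair of values that agree up to floating-point rounding; the last line gives the Celsius mean and standard deviation from the table.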