Chapter 4 Descriptive Statistics

Topics: Numerical Description, Central Tendency, Dispersion, Standardized Data, Percentiles and Quartiles, Box Plots, Grouped Data, Skewness and Kurtosis (optional)

• Statistics are descriptive measures derived from a sample (n items).
• Parameters are descriptive measures derived from a population (N items).

Numerical Description
• Three key characteristics of numerical data:
  Central Tendency: Where are the data values concentrated? What seem to be typical or middle data values?
  Dispersion: How much variation is there in the data? How spread out are the data values? Are there unusual values?
  Shape: Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal?

Numerical Description: Example: Vehicle Quality
• Consider the data set of vehicle defect rates from J. D. Power and Associates.
• Defect rate = (total no. of defects / no. inspected) × 100
• Numerical statistics can be used to summarize this random sample of brands.
• Must allow for sampling error since the analysis is based on sampling.
• The data are the number of defects per 100 vehicles, 2004 models.
• To begin, sort the data in Excel.
• Sorted data provides insight into central tendency and dispersion.

Numerical Description: Visual Displays
• The dot plot offers a visual impression of the data.
• Histograms were drawn with 5 bins (suggested by Sturges' Rule) and with 10 bins.
• Both are symmetric with no extreme values and show a modal class toward the low end.

Descriptive Statistics in Excel
• Go to Tools | Data Analysis and select Descriptive Statistics.
• Highlight the data range, specify a cell for the upper-left corner of the output range, check Summary Statistics and click OK.
[Figure: resulting Excel Descriptive Statistics output]
Descriptive Statistics in MegaStat
[Figure: resulting MegaStat analysis]

Central Tendency
• The central tendency is the middle or typical values of a distribution.
• Central tendency can be assessed using a dot plot, histogram, or more precisely with numerical statistics.

Central Tendency: Six Measures of Central Tendency

Mean
  Formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
  Excel: =AVERAGE(Data)
  Pro: Familiar and uses all the sample information.
  Con: Influenced by extreme values.
Median
  Formula: Middle value in sorted array
  Excel: =MEDIAN(Data)
  Pro: Robust when extreme data values exist.
  Con: Ignores extremes and can be affected by gaps in data values.
Mode
  Formula: Most frequently occurring data value
  Excel: =MODE(Data)
  Pro: Useful for attribute data or discrete data with a small range.
  Con: May not be unique, and is not helpful for continuous data.
Midrange
  Formula: $\frac{x_{\min} + x_{\max}}{2}$
  Excel: =0.5*(MIN(Data)+MAX(Data))
  Pro: Easy to understand and calculate.
  Con: Influenced by extreme values and ignores most data values.
Geometric mean (G)
  Formula: $\sqrt[n]{x_1 x_2 \cdots x_n}$
  Excel: =GEOMEAN(Data)
  Pro: Useful for growth rates and mitigates high extremes.
  Con: Less familiar and requires positive data.
Trimmed mean
  Formula: Same as the mean except omit the highest and lowest k% of data values (e.g., 5%)
  Excel: =TRIMMEAN(Data, %)
  Pro: Mitigates effects of extreme values.
  Con: Excludes some data values that could be relevant.

Central Tendency: Mean
• A familiar measure of central tendency.
  Population formula: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
  Sample formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• For the sample of n = 37 car brands:
  $\bar{x} = \frac{87 + 93 + 98 + \cdots + 159 + 164 + 173}{37} = \frac{4639}{37} = 125.38$
• In Excel, use the function =AVERAGE(Data), where Data is an array of data values.
Central Tendency: Characteristics of the Mean
• Arithmetic mean is the most familiar average.
• Affected by every sample item.
• The balancing point or fulcrum for the data.
• Regardless of the shape of the distribution, the signed deviations of the data points from the mean always sum to zero:
  $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$
• Consider the following asymmetric distribution of quiz scores (42, 60, 70, 75, 78) whose mean is 65:
  $\sum_{i=1}^{n} (x_i - \bar{x}) = (42 - 65) + (60 - 65) + (70 - 65) + (75 - 65) + (78 - 65) = (-23) + (-5) + (5) + (10) + (13) = -28 + 28 = 0$

Central Tendency: Median
• The median is the value that divides the ordered data into two equal halves.
• The median is not sensitive to extreme values.
• It depends only on the rank order of the data, not on the magnitudes of the values.
• The median value is unique.
• The median (M) is the 50th percentile or midpoint of the sorted sample data.
• M separates the upper and lower half of the sorted observations.
• If n is odd, the median is the middle observation in the data array.
• If n is even, the median is the average of the middle two observations in the data array.
• For n = 8, the median is between the fourth and fifth observations in the data array.
• For n = 9, the median is the fifth observation in the data array.
• Consider the following n = 6 data values: 11 12 15 17 21 32. What is the median?
  For even n, $M = \frac{x_{n/2} + x_{(n/2+1)}}{2}$, with n/2 = 6/2 = 3 and n/2 + 1 = 6/2 + 1 = 4.
  $M = (x_3 + x_4)/2 = (15 + 17)/2 = 16$
• Consider the following n = 7 data values: 12 23 23 25 27 34 41. What is the median?
  For odd n, $M = x_{(n+1)/2}$, with (n + 1)/2 = (7 + 1)/2 = 4.
  $M = x_4 = 25$
• Use Excel's function =MEDIAN(Data), where Data is an array of data values.
• For the 37 vehicle quality ratings (odd n), the position of the median is (n + 1)/2 = (37 + 1)/2 = 19.
• So, the median is $x_{19} = 121$.
• When there are several duplicate data values, the median does not provide a clean "50-50" split in the data.

Central Tendency: Characteristics of the Median
• The median is insensitive to extreme data values.
• For example, consider the following quiz scores for 3 students:
  Tom's scores: 20, 40, 70, 75, 80 (Mean = 57, Median = 70, Total = 285)
  Jake's scores: 60, 65, 70, 90, 95 (Mean = 76, Median = 70, Total = 380)
  Mary's scores: 50, 65, 70, 75, 90 (Mean = 70, Median = 70, Total = 350)
• What does the median for each student tell you?

Central Tendency: Mode
• The most frequently occurring data value.
• Similar to mean and median if data values occur often near the center of sorted data.
• May have multiple modes or no mode.
• For example, consider the following quiz scores for 4 students:
  Lee's scores: 60, 70, 70, 70, 80 (Mean = 70, Median = 70, Mode = 70)
  Pat's scores: 45, 45, 70, 90, 100 (Mean = 70, Median = 70, Mode = 45)
  Sam's scores: 50, 60, 70, 80, 90 (Mean = 70, Median = 70, Mode = none)
  Xiao's scores: 50, 50, 70, 90, 90 (Mean = 70, Median = 70, Modes = 50, 90)
• What does the mode for each student tell you?
• Easy to define, not easy to calculate in large samples.
• Use Excel's function =MODE(Array)
  - will return #N/A if there is no mode.
  - will return the first mode found if multimodal.
• Generally isn't useful for continuous data since data values rarely repeat.
• Best for attribute data or a discrete variable with a small range (e.g., Likert scale).
• May be far from the middle of the distribution and not at all typical.
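The quiz-score examples above can be checked with a few lines of Python. This is a minimal sketch using the standard `statistics` module; the helper name `central_tendency` and the convention of returning an empty list when no value repeats are assumptions made for illustration, not part of the slides.

```python
from statistics import mean, median, multimode

def central_tendency(scores):
    """Return (mean, median, modes) for a list of numeric scores."""
    modes = multimode(scores)  # every value tied for the highest count
    # Assumption: if no value repeats, report "no mode" as an empty list.
    if len(scores) > 1 and len(modes) == len(scores):
        modes = []
    return mean(scores), median(scores), modes

# Quiz scores taken from the median and mode examples above.
quiz_scores = {
    "Tom":  [20, 40, 70, 75, 80],
    "Jake": [60, 65, 70, 90, 95],
    "Mary": [50, 65, 70, 75, 90],
    "Lee":  [60, 70, 70, 70, 80],
    "Pat":  [45, 45, 70, 90, 100],
    "Sam":  [50, 60, 70, 80, 90],
    "Xiao": [50, 50, 70, 90, 90],
}

for name, scores in quiz_scores.items():
    print(name, central_tendency(scores))
```

Unlike Excel's =MODE, which returns only the first mode it finds, `multimode` returns every tied value, so Xiao's two modes are both reported.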
Central Tendency: Example: Price/Earnings Ratios and Mode
• Consider the following P/E ratios for a random sample of 68 Standard & Poor's 500 stocks.
  7 8 8 10 10 10 10 12 13 13 13 13 13 13 13 14 14
  14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19
  19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26
  26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91
• What is the mode?
• Excel's descriptive statistics results are:
  Mean 22.7206
  Median 19
  Mode 13
  Range 84
  Minimum 7
  Maximum 91
  Sum 1545
  Count 68
• The mode 13 occurs 7 times, but what does the dot plot show?
• The dot plot shows local modes (a peak with valleys on either side) at 10, 13, 15, 19, 23, 26, 29.
• These multiple modes suggest that the mode is not a stable measure of central tendency.

Central Tendency: Mode
• A bimodal distribution refers to the shape of the histogram rather than the mode of the raw data.
• Occurs when dissimilar populations are combined in one sample.
[Figure: example of a bimodal histogram]

Central Tendency: Skewness
• Compare the mean and median, or look at the histogram, to determine the degree of skewness.
• The mean, median, and mode are approximately related. In a symmetric, unimodal distribution they are all equal. Otherwise, for moderately skewed data,
  Mode ≈ Mean - 3(Mean - Median)

Central Tendency: Symptoms of Skewness
  Skewed left (negative skewness)
    Histogram appearance: Long tail of histogram points left (a few low values but most data on right).
    Statistics: Mean < Median
  Symmetric
    Histogram appearance: Tails of histogram are balanced (low/high values offset).
    Statistics: Mean ≈ Median
  Skewed right (positive skewness)
    Histogram appearance: Long tail of histogram points right (most data on left but a few high values).
    Statistics: Mean > Median
• For the sample of J. D. Power quality ratings, the mean (125.38) exceeds the median (121). What does this suggest?

Central Tendency: Geometric Mean
• The geometric mean (G) is a multiplicative average:
  $G = \sqrt[n]{x_1 x_2 \cdots x_n}$
• For the J. D. Power quality data (n = 37):
  $G = \sqrt[37]{(87)(93)(98)\cdots(164)(173)} = \sqrt[37]{2.37667 \times 10^{77}} = 123.38$
• In Excel use =GEOMEAN(Array)
• The geometric mean tends to mitigate the effects of high outliers.

Central Tendency: Growth Rates
• A variation on the geometric mean is used to find the average growth rate for a time series. Because n data values span n - 1 growth periods,
  $G = \sqrt[n-1]{\frac{x_n}{x_1}} - 1$
• For example, from 1998 to 2002, Spirit Airlines revenues are:
  Year   Revenue (mil)
  1998   131
  1999   227
  2000   311
  2001   354
  2002   403
• The average growth rate is given by taking the geometric mean of the ratios of each year's revenue to the preceding year.
• Due to cancellations, only the first and last years are relevant:
  $G = \sqrt[4]{\frac{227}{131}\cdot\frac{311}{227}\cdot\frac{354}{311}\cdot\frac{403}{354}} - 1 = \sqrt[4]{\frac{403}{131}} - 1 = 1.324 - 1 = .324$ or 32.4% per year
• In Excel use =(403/131)^(1/4)-1

Central Tendency: Geometric Mean
• Suppose $100 is growing by 10% each year. Then:
  Year 0 (current)  $100
  Year 1            $110
  Year 2            $121
  Year 3            $133.10
  Year 4            $146.41
• Mean = 122.1
• Geometric mean = 121.0
• Now let us take the average growth rate. The amount grew by $46.41 in 4 years.
• Average growth (using AM) = 46.41/4 = 11.6025 percent of the initial amount per year.
• Average growth (using GM) = 10.0 percent per year, which is correct.
• GM is a better measure of central tendency when the data show a proportionate change.

Central Tendency: Midrange
• The midrange is the point halfway between the lowest and highest values of X.
• Easy to use but sensitive to extreme data values.
  $\text{Midrange} = \frac{x_{\min} + x_{\max}}{2}$
• For the J. D. Power quality data (n = 37):
  $\text{Midrange} = \frac{x_{\min} + x_{\max}}{2} = \frac{x_1 + x_{37}}{2} = \frac{87 + 173}{2} = 130$
• Here, the midrange (130) is higher than the mean (125.38) or median (121).
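The growth-rate calculation can be sketched in Python as follows. The function name `average_growth_rate` is hypothetical, and the sketch assumes equally spaced periods: five annual figures give four year-over-year ratios, so the root taken is the fourth root, not the fifth.

```python
from statistics import geometric_mean

def average_growth_rate(values):
    """Average per-period growth rate implied by the first and last values."""
    periods = len(values) - 1  # n values span n - 1 growth periods
    return (values[-1] / values[0]) ** (1 / periods) - 1

# Spirit Airlines revenue (mil), 1998 to 2002, from the example above.
revenue = [131, 227, 311, 354, 403]

# $100 growing by 10% each year for four years.
savings = [100, 110, 121, 133.1, 146.41]

print(f"Revenue growth: {average_growth_rate(revenue):.1%} per year")
print(f"Savings growth: {average_growth_rate(savings):.1%} per year")
print(f"Geometric mean of savings: {geometric_mean(savings):.1f}")
```

Compounding the revenue result recovers the final figure (131 × 1.3244^4 ≈ 403), which is a quick way to confirm that the number of periods is right.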
Central Tendency: Trimmed Mean
• To calculate the trimmed mean, first remove the highest and lowest k percent of the observations.
• For example, for the n = 68 P/E ratios, we want a 5 percent trimmed mean (i.e., k = .05).
• To determine how many observations to trim, multiply k × n = 0.05 × 68 = 3.4, and round down to 3 observations.
• So, we would remove the three smallest and three largest observations before averaging the remaining values.
• Here is a summary of all the measures of central tendency for the n = 68 P/E values.
  Mean:            22.72   =AVERAGE(PERatio)
  Median:          19.00   =MEDIAN(PERatio)
  Mode:            13.00   =MODE(PERatio)
  Geometric Mean:  19.85   =GEOMEAN(PERatio)
  Midrange:        49.00   =(MIN(PERatio)+MAX(PERatio))/2
  5% Trim Mean:    21.10   =TRIMMEAN(PERatio,0.1)
• The trimmed mean mitigates the effects of very high values, but still exceeds the median.
• The Federal Reserve uses a 16% trimmed mean to mitigate the effects of extremes in its analysis of the Consumer Price Index.

Dispersion
• Variation is the "spread" of data points about the center of the distribution in a sample. Consider the following measures of dispersion:

Dispersion: Measures of Variation

Range
  Formula: $x_{\max} - x_{\min}$
  Excel: =MAX(Data)-MIN(Data)
  Pro: Easy to calculate.
  Con: Sensitive to extreme data values.
Variance ($s^2$)
  Formula: $\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
  Excel: =VAR(Data)
  Pro: Plays a key role in mathematical statistics.
  Con: Non-intuitive meaning.
Standard deviation (s)
  Formula: $\sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
  Excel: =STDEV(Data)
  Pro: Most common measure. Uses same units as the raw data ($, £, ¥, etc.).
  Con: Non-intuitive meaning.
Coefficient of variation (CV)
  Formula: $100 \times \frac{s}{\bar{x}}$
  Excel: None
  Pro: Measures relative variation in percent so can compare data sets.
  Con: Requires non-negative data.
Mean absolute deviation (MAD)
  Formula: $\frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$
  Excel: =AVEDEV(Data)
  Pro: Easy to understand.
  Con: Lacks "nice" theoretical properties.

Dispersion: Range
• The difference between the largest and smallest observation.
  Range = $x_{\max} - x_{\min}$
• For example, for the n = 68 P/E ratios, Range = 91 - 7 = 84

Dispersion: Variance
• The population variance ($\sigma^2$) is defined as the sum of squared deviations around the mean µ divided by the population size:
  $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
• For the sample variance ($s^2$), we divide by n - 1 instead of n; otherwise $s^2$ would tend to underestimate the unknown population variance $\sigma^2$:
  $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$

Dispersion: Standard Deviation
• The square root of the variance.
• Explains how individual values in a data set vary from the mean.
• Units of measure are the same as X.
  Population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
  Sample standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
• Excel's built-in functions are:
  Statistic            Excel population formula   Excel sample formula
  Variance             =VARP(Array)               =VAR(Array)
  Standard deviation   =STDEVP(Array)             =STDEV(Array)

Dispersion: Calculating a Standard Deviation
• Consider the following five quiz scores for Stephanie.
[Table: Stephanie's five quiz scores]
• Now, calculate the sample standard deviation:
  $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{2380}{5-1}} = \sqrt{595} = 24.39$
• Somewhat easier, the two-sum formula can also be used:
  $s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1} = \frac{28300 - \frac{(360)^2}{5}}{5-1} = \frac{28300 - 25920}{5-1} = 595$, so $s = \sqrt{595} = 24.39$
• The standard deviation is nonnegative because deviations around the mean are squared.
• When every observation is exactly equal to the mean, the standard deviation is zero.
• Standard deviations can be large or small, depending on the units of measure.
• Compare standard deviations only for data sets measured in the same units and only if the means do not differ substantially.

Dispersion: Coefficient of Variation
• Useful for comparing variables measured in different units or with different means.
• A unit-free measure of dispersion.
• Expressed as a percent of the mean:
  $CV = 100 \times \frac{s}{\bar{x}}$
• For example:
  Defect rates (n = 37): s = 22.89, $\bar{x}$ = 125.38 gives CV = 100 × (22.89)/(125.38) = 18%
  ATM deposits (n = 100): s = 280.80, $\bar{x}$ = 233.89 gives CV = 100 × (280.80)/(233.89) = 120%
  P/E ratios (n = 68): s = 14.08, $\bar{x}$ = 22.72 gives CV = 100 × (14.08)/(22.72) = 62%
• Only appropriate for nonnegative data. It is undefined if the mean is zero or negative.

Dispersion: Mean Absolute Deviation
• The Mean Absolute Deviation (MAD) reveals the average distance from an individual data point to the mean (center of the distribution).
• Uses absolute values of the deviations around the mean:
  $MAD = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$
• Excel's function is =AVEDEV(Array)

Dispersion: Central Tendency vs. Dispersion: Manufacturing
• Consider the histograms of hole diameters drilled in a steel plate during manufacturing.
• The desired distribution is outlined in red.
[Figure: histograms of hole diameters for Machine A and Machine B]
  Machine A: Acceptable variation, but the mean is less than 5 mm.
  Machine B: Desired mean (5 mm), but too much variation.
• Take frequent samples to monitor quality.

Dispersion: Central Tendency vs. Dispersion: Job Performance
• Consider student ratings of four professors on eight teaching attributes (10-point scale).
[Table: ratings of the four professors]
• Jones and Wu have identical means but different standard deviations.
• Smith and Gopal have different means but identical standard deviations.
Dispersion: Job Performance
• A high mean (better rating) and low standard deviation (more consistency) is preferred. Which professor do you think is best?

Standardized Data: Chebyshev's Theorem
• Developed by mathematicians Jules Bienaymé (1796-1878) and Pafnuty Chebyshev (1821-1894).
• For any population with mean µ and standard deviation σ, the percentage of observations that lie within k standard deviations of the mean must be at least $100[1 - 1/k^2]$.
• For k = 2 standard deviations, $100[1 - 1/2^2] = 75\%$
• So, at least 75.0% will lie within µ ± 2σ
• For k = 3 standard deviations, $100[1 - 1/3^2] = 88.9\%$
• So, at least 88.9% will lie within µ ± 3σ
• Although applicable to any data set, these limits tend to be too wide to be useful.

Standardized Data: The Empirical Rule
• The normal or Gaussian distribution was named for Karl Gauss (1777-1855).
• The normal distribution is symmetric and is also known as the bell-shaped curve.
• The Empirical Rule states that for data from a normal distribution, we expect that for
  k = 1, about 68.26% will lie within µ ± 1σ
  k = 2, about 95.44% will lie within µ ± 2σ
  k = 3, about 99.73% will lie within µ ± 3σ
• Distance from the mean is measured in terms of the number of standard deviations.
• Note: no upper bound is given. Data values outside µ ± 3σ are rare.

Standardized Data: Example: Exam Scores
• If 80 students take an exam, how many will score within 2 standard deviations of the mean?
• Assuming exam scores follow a normal distribution, the empirical rule states that about 95.44% will lie within µ ± 2σ, so 95.44% × 80 ≈ 76 students will score within ± 2σ of µ.
• How many students will score more than 2 standard deviations from the mean?

Standardized Data: Unusual Observations
• Unusual observations are those that lie beyond µ ± 2σ.
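To see how much wider Chebyshev's limits are than the Empirical Rule, the two can be computed side by side. This is a small sketch using only Python's `math` module; the function names are made up for illustration, and the normal percentages come from the error function, so they differ from the table-rounded 68.26% and 95.44% in the second decimal place.

```python
import math

def chebyshev_min_percent(k):
    """Minimum percent of any data set within k standard deviations (k > 1)."""
    return 100 * (1 - 1 / k**2)

def normal_percent(k):
    """Percent of a normal distribution within k standard deviations."""
    return 100 * math.erf(k / math.sqrt(2))

for k in (2, 3):
    print(f"k = {k}: Chebyshev at least {chebyshev_min_percent(k):.1f}%, "
          f"normal about {normal_percent(k):.2f}%")

# Exam example: 80 students, scores assumed to be normally distributed.
students = 80
within_2sd = round(normal_percent(2) / 100 * students)
print(f"Expected within 2 standard deviations: {within_2sd} of {students}")
```

For the exam example, Chebyshev alone would guarantee only 60 of the 80 students within two standard deviations, while the normality assumption raises the expectation to about 76.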
Standardized Data: Unusual Observations
• Outliers are observations that lie beyond µ ± 3σ.
• For example, the P/E ratio data contains several large data values. Are they unusual or outliers?
  7 8 8 10 10 10 10 12 13 13 13 13 13 13 13 14 14
  14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19
  19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26
  26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91

Standardized Data: The Empirical Rule
• If the sample came from a normal distribution, then the Empirical Rule gives the following intervals:
  $\bar{x} \pm 1s = 22.72 \pm 1(14.08) = (8.6, 36.8)$
  $\bar{x} \pm 2s = 22.72 \pm 2(14.08) = (-5.4, 50.9)$; values beyond this interval are unusual
  $\bar{x} \pm 3s = 22.72 \pm 3(14.08) = (-19.5, 65.0)$; values beyond this interval are outliers
• Are there any unusual values or outliers?
  7 8 . . . 48 55 68 91

Standardized Data: Defining a Standardized Variable
• A standardized variable (Z) redefines each observation in terms of the number of standard deviations from the mean.
  Standardization formula for a population: $z_i = \frac{x_i - \mu}{\sigma}$
  Standardization formula for a sample: $z_i = \frac{x_i - \bar{x}}{s}$
• $z_i$ tells how far away the observation is from the mean.
• A negative z value means the observation is below the mean.
• For example, for the P/E data, the first value is $x_1 = 7$. The associated z value is
  $z_1 = \frac{x_1 - \bar{x}}{s} = \frac{7 - 22.72}{14.08} = -1.12$
• A positive z means the observation is above the mean. For $x_{68} = 91$,
  $z_{68} = \frac{x_{68} - \bar{x}}{s} = \frac{91 - 22.72}{14.08} = 4.85$
• Here are the standardized z values for the P/E data:
[Table: standardized z values for the P/E data]
• What do you conclude for these four values?
• MegaStat calculates standardized values as well as checks for outliers.
• In Excel, use =STANDARDIZE(x, Mean, StDev) to calculate a standardized z value.
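The standardization steps above can be reproduced for the full P/E sample. The sketch below uses Python's `statistics` module and applies the slides' cutoffs (beyond 2 standard deviations is unusual, beyond 3 is an outlier); the variable names are illustrative only.

```python
from statistics import mean, stdev

# The 68 P/E ratios from the example, sorted.
pe_ratios = [
    7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
    14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
    19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
    26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91,
]

x_bar = mean(pe_ratios)
s = stdev(pe_ratios)  # sample standard deviation (divides by n - 1)

# Standardize every observation: z = (x - mean) / s
z_scores = [(x - x_bar) / s for x in pe_ratios]

# Classify by distance from the mean in standard deviations.
unusual = [x for x, z in zip(pe_ratios, z_scores) if 2 < abs(z) <= 3]
outliers = [x for x, z in zip(pe_ratios, z_scores) if abs(z) > 3]

print(f"mean = {x_bar:.2f}, s = {s:.2f}")
print(f"z for smallest value: {z_scores[0]:.2f}")
print(f"z for largest value:  {z_scores[-1]:.2f}")
print("unusual:", unusual, "outliers:", outliers)
```

Because the data are right-skewed, all of the flagged values sit in the upper tail; nothing on the low side comes close to 2 standard deviations below the mean.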
Standardized Data: Outliers
• What do we do with outliers in a data set?
• If due to erroneous data, then discard.
• An outrageous observation (one completely outside of an expected range) is certainly invalid.
• Recognize unusual data points and outliers and their potential impact on your study.
• Research books and articles on how to handle outliers.

Standardized Data: Estimating Sigma
• For a normal distribution, the range of values is 6σ (from µ - 3σ to µ + 3σ).
• If you know the range R (high - low), you can estimate the standard deviation as σ = R/6.
• Useful for approximating the standard deviation when only R is known.
• This estimate depends on the assumption of normality.

Percentiles and Quartiles: Percentiles
• Percentiles divide the data into 100 groups.
• For example, you score in the 83rd percentile on a standardized test. That means that 83% of the test-takers scored below you.
• Deciles divide the data into 10 groups.
• Quintiles divide the data into 5 groups.
• Quartiles divide the data into 4 groups.
• Percentiles are used to establish benchmarks for comparison purposes (e.g., health care, manufacturing and banking industries use 5, 25, 50, 75 and 90 percentiles).
• Quartiles (25, 50, and 75 percent) are commonly used to assess financial performance and stock portfolios.
• Percentiles are used in employee merit evaluation and salary benchmarking.

Percentiles and Quartiles: Quartiles
• Quartiles are scale points that divide the sorted data into four groups of approximately equal size.
  Lower 25% | Q1 | Second 25% | Q2 | Third 25% | Q3 | Upper 25%
• The three values that separate the four groups are called Q1, Q2, and Q3, respectively.
• The second quartile Q2 is the median, an important indicator of central tendency.
  Lower 50% | Q2 | Upper 50%
• Q1 and Q3 measure dispersion since the interquartile range Q3 - Q1 measures the degree of spread in the middle 50 percent of data values.
  Lower 25% | Q1 | Middle 50% | Q3 | Upper 25%
• The first quartile Q1 is the median of the data values below Q2, and the third quartile Q3 is the median of the data values above Q2.
  For the first half of the data, 50% lie above and 50% below Q1.
  For the second half of the data, 50% lie above and 50% below Q3.
• Depending on n, the quartiles Q1, Q2, and Q3 may be members of the data set or may lie between two of the sorted data values.

Percentiles and Quartiles: Method of Medians
• For small data sets, find quartiles using the method of medians:
  Step 1. Sort the observations.
  Step 2. Find the median Q2.
  Step 3. Find the median of the data values that lie below Q2.
  Step 4. Find the median of the data values that lie above Q2.

Percentiles and Quartiles: Excel Quartiles
• Use Excel function =QUARTILE(Array, k) to return the kth quartile.
• Excel treats quartiles as a special case of percentiles. For example, to calculate Q3:
  =QUARTILE(Array, 3)
  =PERCENTILE(Array, 0.75)
• Excel calculates the quartile positions as:
  Position of Q1: 0.25n + 0.75
  Position of Q2: 0.50n + 0.50
  Position of Q3: 0.75n + 0.25

Percentiles and Quartiles: Example: P/E Ratios and Quartiles
• Using Excel's method of interpolation with n = 68, the quartile positions are:
  Quartile   Position formula           Interpolate between
  Q1         0.25(68) + 0.75 = 17.75    X17 and X18
  Q2         0.50(68) + 0.50 = 34.50    X34 and X35
  Q3         0.75(68) + 0.25 = 51.25    X51 and X52
• Consider the following P/E ratios for 68 stocks in a portfolio.
  7 8 8 10 10 10 10 12 13 13 13 13 13 13 13 14 14
  14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19
  19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26
  26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91
• Use quartiles to define benchmarks for stocks that are low-priced (bottom quartile) or high-priced (top quartile).
• The quartiles are:
  First (Q1):  Q1 = X17 + 0.75(X18 - X17) = 14 + 0.75(14 - 14) = 14
  Second (Q2): Q2 = X34 + 0.50(X35 - X34) = 19 + 0.50(19 - 19) = 19
  Third (Q3):  Q3 = X51 + 0.25(X52 - X51) = 26 + 0.25(26 - 26) = 26
• So, to summarize:
  Lower 25% of P/E ratios | Q1 = 14 | Second 25% of P/E ratios | Q2 = 19 | Third 25% of P/E ratios | Q3 = 26 | Upper 25% of P/E ratios
• These quartiles express central tendency and dispersion. What is the interquartile range?
• Because of clustering of identical data values, these quartiles do not provide clean cut points between groups of observations.

Percentiles and Quartiles: Tip
Whether you use the method of medians or Excel, your quartiles will be about the same. Small differences in calculation techniques typically do not lead to different conclusions in business applications.

Percentiles and Quartiles: Dispersion Using Quartiles
• Some robust measures of central tendency and dispersion are built from quartiles, such as the midhinge, $\frac{Q_1 + Q_3}{2}$.

Percentiles and Quartiles: Caution
• Quartiles generally resist outliers.
• However, quartiles do not provide clean cut points in the sorted data, especially in small samples with repeating data values.
  Data set A: 1, 2, 4, 4, 8, 8, 8, 8 (Q1 = 3, Q2 = 6, Q3 = 8)
  Data set B: 0, 3, 3, 6, 6, 6, 10, 15 (Q1 = 3, Q2 = 6, Q3 = 8)
• Although they have identical quartiles, these two data sets are not similar. The quartiles do not represent either data set well.
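Both quartile methods described above can be sketched in Python. The function names are hypothetical. `quartiles_by_medians` splits the sorted data by position, which is one reasonable reading of "values below Q2" when duplicates equal the median, and `excel_quartile` interpolates at the 1-based position 0.25k(n - 1) + 1, which is the same as the positions 0.25n + 0.75, 0.50n + 0.50 and 0.75n + 0.25 given earlier.

```python
from statistics import median

def quartiles_by_medians(data):
    """Return (Q1, Q2, Q3) using the method of medians."""
    xs = sorted(data)
    n = len(xs)
    lower = xs[: n // 2]        # observations below the median position
    upper = xs[(n + 1) // 2:]   # observations above the median position
    return median(lower), median(xs), median(upper)

def excel_quartile(data, k):
    """Return the k-th quartile (k = 1, 2, 3) by linear interpolation."""
    xs = sorted(data)
    pos = 0.25 * k * (len(xs) - 1)  # 0-based fractional position
    i = int(pos)
    frac = pos - i
    if i + 1 >= len(xs):
        return xs[-1]
    return xs[i] + frac * (xs[i + 1] - xs[i])

set_a = [1, 2, 4, 4, 8, 8, 8, 8]
set_b = [0, 3, 3, 6, 6, 6, 10, 15]
pe_ratios = [
    7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
    14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
    19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
    26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91,
]

print("A:", quartiles_by_medians(set_a))
print("B:", quartiles_by_medians(set_b))
print("P/E:", [excel_quartile(pe_ratios, k) for k in (1, 2, 3)])
```

Data sets A and B come out with identical quartiles even though their values differ, which is the caution the slide is making.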
Percentiles and Quartiles: Dispersion Using Quartiles

Midhinge
  Formula: $\frac{Q_1 + Q_3}{2}$
  Excel: =0.5*(QUARTILE(Data,1)+QUARTILE(Data,3))
  Pro: Robust to presence of extreme data values.
  Con: Less familiar to most people.
Midspread
  Formula: $Q_3 - Q_1$
  Excel: =QUARTILE(Data,3)-QUARTILE(Data,1)
  Pro: Stable when extreme data values exist.
  Con: Ignores magnitude of extreme data values.
Coefficient of quartile variation (CQV)
  Formula: $100 \times \frac{Q_3 - Q_1}{Q_3 + Q_1}$
  Excel: None
  Pro: Relative variation in percent so we can compare data sets.
  Con: Less familiar to non-statisticians.

Percentiles and Quartiles: Midhinge
• The mean of the first and third quartiles.
  $\text{Midhinge} = \frac{Q_1 + Q_3}{2}$
• For the 68 P/E ratios,
  $\text{Midhinge} = \frac{Q_1 + Q_3}{2} = \frac{14 + 26}{2} = 20$
• A robust measure of central tendency since quartiles ignore extreme values.

Percentiles and Quartiles: Midspread (Interquartile Range)
• A robust measure of dispersion.
  Midspread = $Q_3 - Q_1$
• For the 68 P/E ratios,
  Midspread = $Q_3 - Q_1$ = 26 - 14 = 12

Percentiles and Quartiles: Coefficient of Quartile Variation (CQV)
• Measures relative dispersion; expresses the midspread as a percent of the midhinge.
  $CQV = 100 \times \frac{Q_3 - Q_1}{Q_3 + Q_1}$
• For the 68 P/E ratios,
  $CQV = 100 \times \frac{Q_3 - Q_1}{Q_3 + Q_1} = 100 \times \frac{26 - 14}{26 + 14} = 30.0\%$
• Similar to the CV, the CQV can be used to compare data sets measured in different units or with different means.

Box Plots
• A useful tool of exploratory data analysis (EDA).
• Also called a box-and-whisker plot.
• Based on a five-number summary: Xmin, Q1, Q2, Q3, Xmax
• Consider the five-number summary for the 68 P/E ratios:
  Xmin = 7, Q1 = 14, Q2 = 19, Q3 = 26, Xmax = 91
[Figure: box plot of the P/E ratios (right-skewed), labeled with minimum, Q1, median (Q2), Q3, maximum, box, and whiskers; the center of the box is the midhinge]

Box Plots: Fences and Unusual Data Values
• Use quartiles to detect unusual data points.
• These points are called fences and can be found using the following formulas:
                  Lower fence            Upper fence
  Inner fences:   Q1 - 1.5(Q3 - Q1)      Q3 + 1.5(Q3 - Q1)
  Outer fences:   Q1 - 3.0(Q3 - Q1)      Q3 + 3.0(Q3 - Q1)
• For example, consider the P/E ratio data:
                  Lower fence                  Upper fence
  Inner fences:   14 - 1.5(26 - 14) = -4       26 + 1.5(26 - 14) = +44
  Outer fences:   14 - 3.0(26 - 14) = -22      26 + 3.0(26 - 14) = +62
• Ignore the lower fence since it is negative and P/E ratios are only positive.
• Values outside the inner fences are unusual while those outside the outer fences are outliers.
• Truncate the whisker at the fences and display unusual values and outliers as dots.
[Figure: box plot of the P/E ratios with inner and outer fences, unusual values, and outliers marked]
• Based on these fences, there are three unusual P/E values and two outliers.

Grouped Data: Nature of Grouped Data
• Although some information is lost, grouped data are easier to display than raw data.
• When bin limits are given, the mean and standard deviation can be estimated.
• Accuracy of grouped estimates depends on
  - the number of bins
  - the distribution of data within bins
  - bin frequencies

Grouped Data: Mean and Standard Deviation
• Consider the frequency distribution for prices of Lipitor® for three cities:
[Table: frequency distribution of Lipitor® prices]
• Estimate the mean and standard deviation by
  $\bar{x} = \sum_{j=1}^{k} \frac{f_j m_j}{n} = \frac{3427.5}{47} = 72.92552$
  $s = \sqrt{\sum_{j=1}^{k} \frac{f_j (m_j - \bar{x})^2}{n-1}} = \sqrt{\frac{2091.48936}{47 - 1}} = 6.74293$
• where
  $m_j$ = class midpoint
  $f_j$ = class frequency
  k = number of classes
  n = sample size
• Note: don't round off too soon.
• Now estimate the coefficient of variation:
  CV = 100 (s / $\bar{x}$) = 100 (6.74293 / 72.92552) = 9.2%

Grouped Data: Accuracy Issues
• How accurate are grouped estimates compared to ungrouped estimates?
• For the previous example, we can compare the grouped data statistics to the ungrouped data statistics.
[Table: grouped vs. ungrouped statistics]
• For this example, very little information was lost due to grouping.
• However, accuracy could be lost due to the nature of the grouping (i.e., if the data values were not evenly spaced within bins).
• The dot plot shows a relatively even distribution within the bins.
• Effects of uneven distributions within bins tend to average out unless there is systematic skewness.
• Accuracy tends to improve as the number of bins increases.
• If the first or last class is open-ended, there will be no class midpoint (no mean can be estimated).
• Assume a lower limit of zero for the first class when the data are nonnegative.
• You may be able to assume an upper limit for some variables (e.g., age).
• Median and quartiles may be estimated even with open-ended classes.

Skewness and Kurtosis: Skewness
• Generally, skewness may be indicated by looking at the sample histogram or by comparing the mean and median.
• This visual indicator is imprecise and does not take into consideration sample size n.
• Skewness is a unit-free statistic.
• The coefficient can be used to compare two samples measured in different units, or one sample with a known reference distribution (e.g., symmetric normal distribution).
• Calculate the sample's skewness coefficient as:
  $\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3$
• In Excel, go to Tools | Data Analysis | Descriptive Statistics or use the function =SKEW(Array)
• Consider the following table showing the 90% range for the sample skewness coefficient.
[Table: 90% range for the sample skewness coefficient, by sample size]
• Coefficients within the 90% range may be attributed to random variation.
• Coefficients outside the range suggest the sample came from a nonnormal population.
• As n increases, the range of chance variation narrows.

Skewness and Kurtosis: Kurtosis
• Kurtosis is the relative length of the tails and the degree of concentration in the center.
• Consider three kurtosis prototype shapes.
[Figure: three kurtosis prototype shapes]
• A histogram is an unreliable guide to kurtosis since scale and axis proportions may differ.
• Excel and MINITAB calculate kurtosis as:
  $\text{Kurtosis} = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$
• Consider the following table of the expected 90% range for the sample kurtosis coefficient.
[Table: 90% range for the sample kurtosis coefficient, by sample size]
• A sample coefficient within the range may be attributed to chance variation.
• Coefficients outside the range would suggest the sample differs from a normal population.
• As sample size increases, the chance range narrows.
• Inferences about kurtosis are risky for n < 50.
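The skewness and kurtosis coefficients can be computed directly from their formulas. The sketch below is a plain-Python transcription of the two expressions given in this section (the same adjusted formulas the slides attribute to Excel and MINITAB); the function names are illustrative, and the P/E ratios are reused as sample data.

```python
from statistics import mean, stdev

def sample_skewness(data):
    """Adjusted sample skewness coefficient (requires n >= 3)."""
    n = len(data)
    x_bar, s = mean(data), stdev(data)
    total = sum(((x - x_bar) / s) ** 3 for x in data)
    return n / ((n - 1) * (n - 2)) * total

def sample_kurtosis(data):
    """Adjusted sample excess kurtosis coefficient (requires n >= 4)."""
    n = len(data)
    x_bar, s = mean(data), stdev(data)
    total = sum(((x - x_bar) / s) ** 4 for x in data)
    first = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * total
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return first - correction

pe_ratios = [
    7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
    14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
    19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
    26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91,
]

print(f"P/E skewness: {sample_skewness(pe_ratios):.2f}")
print(f"P/E kurtosis: {sample_kurtosis(pe_ratios):.2f}")
```

A positive skewness coefficient for the P/E data agrees with the earlier observation that its mean (22.72) exceeds its median (19). Whether the coefficient is large enough to rule out chance variation still depends on the 90% range for n = 68.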