A PowerPoint Presentation Package to Accompany
Applied Statistics in Business & Economics, 4th edition
David P. Doane and Lori E. Seward
Prepared by Lloyd R. Jaisingh
McGraw-Hill/Irwin
Copyright © 2013 by The McGraw-Hill Companies, Inc. All rights reserved.

Chapter 4 Descriptive Statistics

Chapter Contents
4.1 Numerical Description
4.2 Measures of Center
4.3 Measures of Variability
4.4 Standardized Data
4.5 Percentiles, Quartiles, and Box Plots
4.6 Correlation and Covariance
4.7 Grouped Data
4.8 Skewness and Kurtosis

Chapter Learning Objectives
LO4-1: Explain the concepts of center, variability, and shape.
LO4-2: Use Excel to obtain descriptive statistics and visual displays.
LO4-3: Calculate and interpret common measures of center.
LO4-4: Calculate and interpret common measures of variability.
LO4-5: Transform a data set into standardized values.
LO4-6: Apply the Empirical Rule and recognize outliers.
LO4-7: Calculate quartiles and other percentiles.
LO4-8: Make and interpret box plots.
LO4-9: Calculate and interpret a correlation coefficient and covariance.
LO4-10: Calculate the mean and standard deviation from grouped data.
LO4-11: Assess skewness and kurtosis in a sample.

4.1 Numerical Description

LO4-1: Explain the concepts of center, variability, and shape.
• Three key characteristics of numerical data: center, variability, and shape.

LO4-2: Use Excel to obtain descriptive statistics and visual displays.
[Figure: Excel histogram display for Table 4.3]

4.2 Measures of Center

LO4-3: Calculate and interpret common measures of center.

Mean
• A familiar measure of center.
• Population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
• Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• In Excel, use the function =AVERAGE(Data), where Data is an array of data values.

Median
• The median (M) is the 50th percentile or midpoint of the sorted sample data.
• M separates the upper and lower halves of the sorted observations.
• If n is odd, the median is the middle observation in the data array.
• If n is even, the median is the average of the middle two observations in the data array.

Mode
• The most frequently occurring data value.
• A data set may have multiple modes or no mode.
• The mode is most useful for discrete or categorical data with only a few distinct data values. For continuous data or data with a wide range, the mode is rarely useful.

Shape
LO4-1: Explain the concepts of center, variability, and shape.
• Compare the mean and median, or look at the histogram, to determine the degree of skewness.
• Figure 4.10 shows prototype population shapes with varying degrees of skewness.

Geometric Mean
• The geometric mean (G) is a multiplicative average: $G = \sqrt[n]{x_1 x_2 \cdots x_n}$

Growth Rates
• A variation on the geometric mean is used to find the average growth rate for a time series: $GR = \sqrt[n-1]{x_n / x_1} - 1$
• For example, JetBlue Airlines revenues from 2006 to 2010 were:

Year | Revenue (mil)
2006 | 2,361
2007 | 2,843
2008 | 3,392
2009 | 3,292
2010 | 3,779

• The average growth rate is $GR = \sqrt[4]{3779 / 2361} - 1 = 0.125$, or 12.5% per year.

Midrange
• The midrange is the point halfway between the lowest and highest values of X.
• Easy to use but sensitive to extreme data values.
• For the J.D. Power quality data: $\text{Midrange} = \frac{x_{\min} + x_{\max}}{2} = 126.5$
• Here, the midrange (126.5) is higher than both the mean (114.70) and the median (113).

Trimmed Mean
• To calculate the trimmed mean, first remove the highest and lowest k percent of the observations.
• For example, for the n = 33 P/E ratios, we want a 5 percent trimmed mean (i.e., k = .05).
• To determine how many observations to trim, multiply k by n, which is 0.05 × 33 = 1.65, or 2 observations after rounding.
• So, we would remove the two smallest and two largest observations before averaging the remaining values.
• Here is a summary of all the measures of central tendency for the J.D. Power data:

Statistic | Value | Excel
Mean | 114.70 | =AVERAGE(Data)
Median | 113 | =MEDIAN(Data)
Mode | 111 | =MODE.SNGL(Data)
Geometric mean | 113.35 | =GEOMEAN(Data)
Midrange | 126.5 | =(MIN(Data)+MAX(Data))/2
5% trimmed mean | 113.94 | =TRIMMEAN(Data, 0.1)

• In TRIMMEAN, the second argument is the total fraction trimmed, so 0.1 removes 5 percent from each end.
• The trimmed mean mitigates the effects of very high values, but still exceeds the median.

4.3 Measures of Variability

LO4-4: Calculate and interpret common measures of variability.
• Variation is the "spread" of data points about the center of the distribution in a sample.
• Consider the following measures of variability:

Statistic | Formula | Excel | Pro | Con
Range | $x_{\max} - x_{\min}$ | =MAX(Data)-MIN(Data) | Easy to calculate. | Sensitive to extreme data values.
Sample variance ($s^2$) | $\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$ | =VAR.S(Data) | Plays a key role in mathematical statistics. | Nonintuitive meaning.
Sample standard deviation ($s$) | $\sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$ | =STDEV.S(Data) | Most common measure. Uses same units as the raw data ($, £, ¥, grams, etc.). | Nonintuitive meaning.
Sample coefficient of variation (CV) | $100 \times \frac{s}{\bar{x}}$ | None | Measures relative variation in percent, so can compare data sets. | Requires nonnegative data.
Mean absolute deviation (MAD) | $\frac{\sum_{i=1}^{n}\lvert x_i - \bar{x}\rvert}{n}$ | =AVEDEV(Data) | Easy to understand. | Lacks "nice" theoretical properties.

Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$
Population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}}$

Coefficient of Variation
• Useful for comparing variables measured in different units or with different means.
• A unit-free measure of dispersion.
• Expressed as a percent of the mean.
• Only appropriate for nonnegative data.
  It is undefined if the mean is zero or negative.

Mean Absolute Deviation
• This statistic reveals the average distance from the center.
• Absolute values must be used, since otherwise the deviations around the mean would sum to zero.
• The MAD is stated in the original unit of measurement.
• The MAD is appealing because of its simple interpretation.

Central Tendency vs. Dispersion: Manufacturing
• Take frequent samples to monitor quality.

4.4 Standardized Data

Chebyshev's Theorem
• For any population with mean $\mu$ and standard deviation $\sigma$, the percentage of observations that lie within k standard deviations of the mean must be at least $100[1 - 1/k^2]$.
• For k = 2 standard deviations, $100[1 - 1/2^2] = 75\%$. So, at least 75.0% will lie within $\mu \pm 2\sigma$.
• For k = 3 standard deviations, $100[1 - 1/3^2] = 88.9\%$. So, at least 88.9% will lie within $\mu \pm 3\sigma$.
• Although applicable to any data set, these limits tend to be rather wide.

The Empirical Rule
• The normal distribution is symmetric and is also known as the bell-shaped curve.
• The Empirical Rule states that for data from a normal distribution, we expect the interval $\mu \pm k\sigma$ to contain a known percentage of the data:
  For k = 1, 68.26% will lie within $\mu \pm 1\sigma$.
  For k = 2, 95.44% will lie within $\mu \pm 2\sigma$.
  For k = 3, 99.73% will lie within $\mu \pm 3\sigma$.
• Note: No upper bound is given. Data values outside $\mu \pm 3\sigma$ are rare.

LO4-5: Transform a data set into standardized values.
• A standardized variable (Z) redefines each observation in terms of the number of standard deviations from the mean.
• Standardization formula for a population: $z_i = \frac{x_i - \mu}{\sigma}$
• Standardization formula for a sample (for n > 30): $z_i = \frac{x_i - \bar{x}}{s}$
• A negative z value means the observation is to the left of the mean.
• A positive z value means the observation is to the right of the mean.
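
The standardization formula and Chebyshev's bound can be sketched in a few lines of Python. This is a minimal illustration, not part of the slides: the function names `standardize` and `chebyshev_bound` and the eight-value sample are hypothetical, and the sample is kept small only for readability.

```python
from statistics import mean, stdev


def standardize(sample):
    """Return z-scores using the sample mean and sample standard deviation."""
    x_bar = mean(sample)
    s = stdev(sample)  # divides by n - 1, like Excel's =STDEV.S
    return [(x - x_bar) / s for x in sample]


def chebyshev_bound(k):
    """Minimum percent of observations within k standard deviations (k > 1)."""
    return 100 * (1 - 1 / k**2)


# Hypothetical sample, not taken from the slides.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
z_scores = standardize(sample)

# Negative z: observation lies left of the mean; positive z: right of it.
for x, z in zip(sample, z_scores):
    side = "left of" if z < 0 else "right of" if z > 0 else "at"
    print(f"x = {x}: z = {z:+.3f} ({side} the mean)")

print(f"Chebyshev, k = 2: at least {chebyshev_bound(2):.1f}%")
print(f"Chebyshev, k = 3: at least {chebyshev_bound(3):.1f}%")
```

Standardized values always have mean 0 and standard deviation 1. The Chebyshev percentages (75.0 and 88.9) are guaranteed minimums for any data set, while the Empirical Rule percentages assume normality.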
LO4-6: Apply the Empirical Rule and recognize outliers.

Estimating Sigma
• For a normal distribution, the range of values is almost $6\sigma$ (from $\mu - 3\sigma$ to $\mu + 3\sigma$).
• If you know the range R (high minus low), you can estimate the standard deviation as $\sigma = R/6$.
• Useful for approximating the standard deviation when only R is known.
• This estimate depends on the assumption of normality.

4.5 Percentiles, Quartiles, and Box Plots

LO4-7: Calculate quartiles and other percentiles.

Percentiles
• Percentiles are data that have been divided into 100 groups.
• For example, if you score in the 83rd percentile on a standardized test, 83% of the test-takers scored below you.
• Deciles are data that have been divided into 10 groups.
• Quintiles are data that have been divided into 5 groups.
• Quartiles are data that have been divided into 4 groups.
• Percentiles may be used to establish benchmarks for comparison purposes (e.g., health care, manufacturing, and banking industries use 5th, 25th, 50th, 75th, and 90th percentiles).
• Quartiles (25, 50, and 75 percent) are commonly used to assess financial performance and stock portfolios.
• Percentiles can be used in employee merit evaluation and salary benchmarking.

Quartiles
• Quartiles are scale points that divide the sorted data into four groups of approximately equal size.

Lower 25% | Q1 | Second 25% | Q2 | Third 25% | Q3 | Upper 25%

• The three values that separate the four groups are called Q1, Q2, and Q3, respectively.
• The second quartile Q2 is the median, a measure of central tendency.

Lower 50% | Q2 | Upper 50%

• Q1 and Q3 measure dispersion, since the interquartile range $Q_3 - Q_1$ measures the degree of spread in the middle 50 percent of data values.

Lower 25% | Q1 | Middle 50% | Q3 | Upper 25%

Quartiles: The Method of Medians
• The first quartile Q1 is the median of the data values below Q2, and the third quartile Q3 is the median of the data values above Q2.

Lower 25% | Q1 | Second 25% | Q2 | Third 25% | Q3 | Upper 25%

• For the first half of the data, 50% lie above and 50% lie below Q1.
• For the second half of the data, 50% lie above and 50% lie below Q3.
• For small data sets, find quartiles using the method of medians:
  Step 1: Sort the observations.
  Step 2: Find the median Q2.
  Step 3: Find the median of the data values that lie below Q2.
  Step 4: Find the median of the data values that lie above Q2.

[Figure: method of medians example]

Example: P/E Ratios and Quartiles
• So, to summarize:

Lower 25% of P/E ratios | Q1 = 27 | Second 25% of P/E ratios | Q2 = 35.5 | Third 25% of P/E ratios | Q3 = 40.5 | Upper 25% of P/E ratios

• These quartiles express central tendency and dispersion. What is the interquartile range?

LO4-8: Make and interpret box plots.
• A useful tool of exploratory data analysis (EDA).
• Also called a box-and-whisker plot.
• Based on a five-number summary: $x_{\min}$, Q1, Q2, Q3, $x_{\max}$.
• Consider the five-number summary for the previous P/E ratios example:

Xmin | Q1 | Q2 | Q3 | Xmax
7 | 27 | 35.5 | 40.5 | 49

Box Plots
• The box plot displays the five-number summary visually.
[Figure: box plot]
• A box plot shows variability and shape.
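
The four steps of the method of medians and the five-number summary can be sketched as follows. This is only an illustration under stated assumptions: the helper names and the eight data values are hypothetical (the P/E data set itself is not reproduced in these slides), and "values below Q2" is taken to mean the lower half of the sorted data by position, dropping the middle observation when n is odd.

```python
from statistics import median


def quartiles_by_medians(values):
    """Return (Q1, Q2, Q3) using the method of medians."""
    data = sorted(values)               # Step 1: sort the observations
    n = len(data)
    q2 = median(data)                   # Step 2: find the median Q2
    half = n // 2
    lower = data[:half]                 # values positioned below Q2
    upper = data[half + (n % 2):]       # values positioned above Q2
    q1 = median(lower)                  # Step 3: median of the lower half
    q3 = median(upper)                  # Step 4: median of the upper half
    return q1, q2, q3


def five_number_summary(values):
    """Return (xmin, Q1, Q2, Q3, xmax), the basis of a box plot."""
    q1, q2, q3 = quartiles_by_medians(values)
    return min(values), q1, q2, q3, max(values)


# Hypothetical data, not taken from the slides.
data = [30, 7, 21, 15, 9, 24, 12, 18]
summary = five_number_summary(data)
iqr = summary[3] - summary[1]           # interquartile range Q3 - Q1

print("Five-number summary:", summary)
print("Interquartile range:", iqr)
```

Applying the same subtraction to the P/E quartiles summarized above gives an interquartile range of 40.5 - 27 = 13.5. Other quartile conventions (for example, interpolation-based percentiles in spreadsheet software) can give slightly different values on small samples.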
[Figure: box plots]

Box Plots: Fences and Unusual Data Values
• Use quartiles to detect unusual data points.
• The cutoff points, called fences, are found using the following formulas:

 | Inner fences | Outer fences
Lower fence | $Q_1 - 1.5(Q_3 - Q_1)$ | $Q_1 - 3.0(Q_3 - Q_1)$
Upper fence | $Q_3 + 1.5(Q_3 - Q_1)$ | $Q_3 + 3.0(Q_3 - Q_1)$

• Values outside the inner fences are unusual, while those outside the outer fences are outliers.
• For example, consider the J.D. Power quality data (Q1 = 107, Q3 = 126):

 | Inner fences | Outer fences
Lower fence | 107 - 1.5(126 - 107) = 78.5 | 107 - 3.0(126 - 107) = 50
Upper fence | 126 + 1.5(126 - 107) = 154.5 | 126 + 3.0(126 - 107) = 183

• There is one unusual value (170) that lies above the upper inner fence (154.5).
• No values exceed the outer fences.
• Truncate the whiskers at the fences and display unusual values and outliers as dots.
[Figure: box plot with the value 170 marked as a dot]
• Based on these fences, only one data value (170) is flagged.

Box Plots: Midhinge
• The midhinge is the average of the first and third quartiles: $\text{Midhinge} = \frac{Q_1 + Q_3}{2}$
• The name midhinge derives from the idea that, if the "box" were folded in half, it would resemble a "hinge".

4.6 Correlation and Covariance

LO4-9: Calculate and interpret a correlation coefficient and covariance.

Correlation Coefficient
• The sample correlation coefficient is a statistic that describes the degree of linearity between paired observations on two quantitative variables X and Y:

$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$

• Note: $-1 \le r \le +1$.
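
As a rough illustration of the sample correlation coefficient, the sketch below uses the standard definition (sum of cross-products of deviations divided by the square root of the product of the sums of squared deviations). The function name `sample_correlation` and all data values are hypothetical and chosen so the results are easy to check by hand.

```python
from math import sqrt


def sample_correlation(xs, ys):
    """Sample correlation coefficient r for paired observations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical paired data, not taken from the slides.
xs = [1, 2, 3, 4, 5]
ys_up = [3, 5, 7, 9, 11]      # exactly linear, increasing
ys_down = [10, 8, 6, 4, 2]    # exactly linear, decreasing
ys_mixed = [2, 1, 4, 3, 5]    # positive but imperfect relationship

for label, ys in [("up", ys_up), ("down", ys_down), ("mixed", ys_mixed)]:
    print(f"{label}: r = {sample_correlation(xs, ys):+.2f}")
```

The exactly linear cases sit at the limits +1 and -1, and the imperfect case falls between them, consistent with the bounds on r noted above.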
[Figure: illustration of correlation coefficients]

Covariance
• The covariance of two random variables X and Y (denoted $\sigma_{XY}$) measures the degree to which the values of X and Y change together.
• Population covariance: $\sigma_{XY} = \frac{\sum_{i=1}^{N}(x_i - \mu_X)(y_i - \mu_Y)}{N}$
• Sample covariance: $s_{XY} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}$
• The correlation coefficient is the covariance divided by the product of the standard deviations of X and Y: $r = \frac{s_{XY}}{s_X s_Y}$

4.7 Grouped Data

LO4-10: Calculate the mean and standard deviation from grouped data.

Weighted Mean

Group Mean and Standard Deviation
• With class midpoints $m_j$ and class frequencies $f_j$ for k classes: $\bar{x} = \frac{\sum_{j=1}^{k} f_j m_j}{n}$ and $s = \sqrt{\frac{\sum_{j=1}^{k} f_j (m_j - \bar{x})^2}{n-1}}$

4.8 Skewness and Kurtosis

LO4-11: Assess skewness and kurtosis in a sample.

Skewness

Kurtosis
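
Returning to Section 4.7, the grouped-data calculation can be sketched as below. The sketch assumes the usual midpoint approximation, in which every observation in a class is treated as if it sat at the class midpoint. The function name `grouped_mean_std` and the three-class frequency table are hypothetical, not data from the slides.

```python
from math import sqrt


def grouped_mean_std(classes):
    """Approximate mean and sample standard deviation from grouped data.

    classes: list of (lower_limit, upper_limit, frequency) tuples.
    """
    midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]
    freqs = [f for _, _, f in classes]
    n = sum(freqs)
    # Frequency-weighted mean of the class midpoints.
    mean = sum(f * m for f, m in zip(freqs, midpoints)) / n
    # Frequency-weighted squared deviations, divided by n - 1.
    variance = sum(f * (m - mean) ** 2 for f, m in zip(freqs, midpoints)) / (n - 1)
    return mean, sqrt(variance)


# Hypothetical frequency table: (lower limit, upper limit, frequency).
classes = [(0, 10, 2), (10, 20, 5), (20, 30, 3)]
g_mean, g_std = grouped_mean_std(classes)

print(f"Grouped mean: {g_mean:.2f}")
print(f"Grouped standard deviation: {g_std:.2f}")
```

Because the raw observations are replaced by midpoints, these are estimates; they generally differ somewhat from the mean and standard deviation computed from ungrouped data.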