Chapter Study Guide
Chapter 5 Displaying and Describing Quantitative Data

I. How to Display Quantitative Data?
Histogram, Stem-and-Leaf, Boxplot and Time Series Plot

1. Histogram: A histogram is a bar chart that counts the number of cases that fall into each bin and displays that count as the height of the corresponding bar. Unlike bar charts for categorical data, histograms do not show gaps between bars unless a bin contains no cases. We need to decide the width of a bin (endpoints) or let the software calculate it automatically. When percentages are used instead of counts, we get a relative frequency histogram.

2. Stem-and-Leaf Display: Unlike histograms, stem-and-leaf displays show the actual values in each bin. Stem-and-leaf displays work best when the data set is relatively small, i.e., fewer than a few hundred values. For a large dataset, histograms do a better job. Read the textbook (Sharpe 2011, pp.88-89) and understand how it works.

II. How to Describe Quantitative Data?
Shape: Mode, Symmetry, Outlier
Center: Mean and Median
Spread: Range, Interquartile Range (IQR), Variance and Standard Deviation

1. Mode: Formally, the mode is defined as the single value that appears most often in the dataset. However, humps in a histogram are also called modes. A histogram with one hump is described as a unimodal distribution, one with two humps as a bimodal distribution, one with three or more humps as a multimodal distribution, and one with no apparent humps as a uniform distribution.

2. Symmetry: A distribution is symmetric if the halves on either side of the center approximately mirror each other. The thinner ends of a distribution are called tails. If one tail stretches out farther than the other, the distribution is said to be skewed in the direction of the longer tail.

3. Outlier: Extremely high or low values in the dataset are usually considered outliers. They may be true values or may be caused by error.

Chaodong Han OPRE 504 Page 1 of 6

4.
Mean: The simple average of all observations in the dataset is called the Mean. The Mean is a good indicator of the center for symmetric data but can be misleading for highly skewed data (with outliers on one side).

5. Median: The Median is the value that splits the data into two halves. Since the Median is resistant to outliers and skewness, it is commonly used for data with a skewed distribution, such as Census data on household income.

   Steps to find the median by hand:
   1. Re-arrange the data in order of values (from low to high or from high to low).
   2. Count the number of values in the dataset; call this number n.
   3. If n is odd, the middle value is the Median. Its position in the ordered list is $(n+1)/2$.
   4. If n is even, there are two middle values, at positions $n/2$ and $n/2 + 1$. The Median is the simple average of these two middle values.

6. Range: Total Range = Maximum - Minimum. Total Range is sensitive to extreme values and can be misleading for skewed data.

7. Interquartile Range (IQR): The IQR is also called the midspread or middle fifty. IQR = Q3 - Q1. To get the IQR, first order all values (e.g., from high to low). Then use three values to evenly split the dataset into four parts, which should have approximately the same number of cases.

   Order      1   2   3   4   5   6   7   8   9   10  11
   Value      28  26  25  22  20  19  18  17  15  14  12
   Quartile           Q3          Q2          Q1

   Q3 = upper quartile; Q2 = also called the Median; Q1 = lower quartile.

8. Variance and Standard Deviation:

   Variance: $s^2 = \dfrac{\sum (y - \bar{y})^2}{n - 1}$

   Standard Deviation: $s = \sqrt{\dfrac{\sum (y - \bar{y})^2}{n - 1}}$

   where $\bar{y}$ is the mean, $n$ is the number of observations, and $y$ is the actual value of an individual observation.

   Read Sharpe 2011 (p.94) and find out how to calculate Variance and Standard Deviation by hand.

Summary: How to Describe the Distribution of Quantitative Data

If the data are symmetric, it is appropriate to report the mean and standard deviation; if the data are skewed, the median and IQR are more appropriate descriptions of the distribution.
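As a rough illustration of the measures of center and spread described in this part, the sketch below computes them for the 11 values used in the IQR example. It assumes Python's standard `statistics` module, which is not something this guide or the textbook prescribes. Its default "exclusive" quartile rule agrees with the hand method for this dataset, but other software may use a slightly different quartile convention and report slightly different Q1 and Q3.

```python
import statistics

# The 11 values from the IQR example (input order does not matter here).
values = [28, 26, 25, 22, 20, 19, 18, 17, 15, 14, 12]

# Center
mean = statistics.mean(values)
median = statistics.median(values)

# Spread
total_range = max(values) - min(values)
q1, q2, q3 = statistics.quantiles(values, n=4)  # default method="exclusive"
iqr = q3 - q1
variance = statistics.variance(values)  # sample variance, divides by n - 1
std_dev = statistics.stdev(values)      # square root of the sample variance

print(f"Mean = {mean:.2f}, Median = {median}")
print(f"Range = {total_range}, Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"Variance = {variance:.2f}, Standard Deviation = {std_dev:.2f}")
```

For a skewed dataset, the median and IQR lines are the ones to focus on; for a symmetric dataset, the mean and standard deviation are the more natural report.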
In practice, all those descriptive statistics are easily reported by software packages. But interpretation and focus should vary depending on the actual shape of the distribution.

Mean, Median and Mode in Histogram

[Figure: three histograms, one per dataset, with the mean, median and mode marked on each]

   Shape          Data                            Mean   Median   Mode   Relationship
   Left Skewed    1, 2, 3, 3, 3, 4, 4, 4, 4, 5    3.3    3.5      4      Mean < Median < Mode (long, thin tail to the left)
   Symmetric      1, 2, 2, 3, 3, 3, 4, 4, 5       3      3        3      Mean = Median = Mode
   Right Skewed   1, 2, 2, 2, 2, 3, 3, 3, 4, 5    2.7    2.5      2.0    Mean > Median > Mode (long, thin tail to the right)

Summary: In a unimodal distribution, extreme values in the long tail pull the mean towards them, so the mean lies closest to the tail, the mode lies opposite the tail, and the median sits in the middle.

III. How to Display Quantitative Data (Continued)
Boxplot

1. Five-Number Summary: It describes the distribution of a dataset by reporting five statistics: median (Q2), quartiles (Q1 and Q3), and extremes (Max and Min).

2. Boxplot: A boxplot is used to show the Five-Number Summary. Read Sharpe 2011, pp.95-96 and understand how to build and interpret a boxplot. Below is the distribution of NYSE daily transaction volumes.

   [Figure: boxplot of NYSE daily transaction volumes. Labels, from top to bottom:]
      Extreme outliers (*): values > Q3 + 3*IQR
      Outliers (o): values > Q3 + 1.5*IQR but < Q3 + 3*IQR
      Highest value within the upper fence (upper fence = Q3 + 1.5*IQR)
      Q3 (Upper Quartile)
      Q2 (Median)
      Q1 (Lower Quartile)
      Lowest value within the lower fence (lower fence = Q1 - 1.5*IQR)
      Outliers (o): values < Q1 - 1.5*IQR
      Extreme outliers (*): values < Q1 - 3*IQR

Question 5.1 [Sharpe 2011, Exercise 11, p.117]. Below are summary statistics for the sizes (in acres) of upstate New York vineyards.

   Variable   N    Mean   St. Dev.   Min   Q1      Median (Q2)   Q3   Max
   Acres      36   46.5   47.76      6     18.50   33.50         55   250

a) From the summary statistics, would you describe this distribution as symmetric or skewed? Explain.
   Step 1 Comparing the Mean and Median:

   Step 2 Comparing Q3 - Q2 and Q2 - Q1:

   Step 3 Comparing Max - Median and Median - Min:

b) Are there any outliers? Explain.

More Exercises: Sharpe 2011, Exercises 9, 12, 17, 18, 19, 20, 23 and 24.

Question 5.2 [Sharpe 2011, Exercise 37, p.122]. Three statistics classes all took the same test. Here are the histograms of the scores for each class.

a) Which class had the highest mean score?
b) Which class had the highest median score?
c) For which class are the mean and median most different? Which is higher? Why?
d) Which class had the smallest standard deviation? Why?
e) Overall, which class do you think performed best on the test? Why?

Question 5.3 [Sharpe 2011, Exercise 38, p.123]. Based on the information provided in Exercise 37, match each boxplot with the corresponding class's histogram.

More Exercises: Sharpe 2011 (pp. 121-122) Chapter 5, Exercises 29, 30, 31, 32, 33, 34, 35, and 36.
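The fence rules from the boxplot discussion in Part III can be sketched in a few lines of Python. The function name `classify_value` and its labels are illustrative assumptions, not taken from the textbook; the thresholds (1.5*IQR and 3*IQR beyond the quartiles) are the ones given in this guide. The example reuses the quartiles from the 11-value IQR example (Q1 = 15, Q3 = 25).

```python
def classify_value(y, q1, q3):
    """Label y using the boxplot fences described in this guide.

    Hypothetical helper for illustration: fences sit 1.5*IQR beyond the
    quartiles, and extreme outliers lie more than 3*IQR beyond them.
    """
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    lower_extreme, upper_extreme = q1 - 3 * iqr, q3 + 3 * iqr

    if y < lower_extreme or y > upper_extreme:
        return "extreme outlier"
    if y < lower_fence or y > upper_fence:
        return "outlier"
    return "within fences"


# Quartiles from the 11-value IQR example: Q1 = 15, Q3 = 25, so IQR = 10,
# fences at 0 and 40, and extreme-outlier limits at -15 and 55.
values = [28, 26, 25, 22, 20, 19, 18, 17, 15, 14, 12]
labels = {y: classify_value(y, q1=15, q3=25) for y in values}
print(labels)
```

The same check can be applied to the summary statistics in Question 5.1 by passing its Q1 and Q3 and testing the Min and Max.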