Chapter Study Guide
Chapter 5
Displaying and Describing Quantitative Data
I.
How to Display Quantitative Data?
Histogram, Stem-and-Leaf, Boxplot and Time Series Plot
1.
Histogram:
A histogram is a bar chart that plots bin counts as the heights of bars: it counts the number of
cases that fall into each bin and displays that count as the height of the corresponding bar. Unlike
bar charts for categorical data, histograms show no gaps between bars unless a bin contains no
cases. We either choose the bin width (the endpoints) ourselves or let the software calculate it
automatically. When percentages are used instead of counts, we get a relative frequency
histogram.
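The counting behind a histogram can be sketched in a few lines of Python. This is only an illustration: the helper name `bin_counts`, the bin width of 5, and the starting point of 10 are assumptions, and the eleven values are the ones used in the IQR example of this guide. Each bin is assumed to include its left endpoint and exclude its right endpoint.

```python
from collections import Counter

def bin_counts(values, width, start):
    """Count the values in each bin [start + k*width, start + (k+1)*width)."""
    counts = Counter((v - start) // width for v in values)
    n_bins = max(counts) + 1
    # A bin with no cases gets a count of 0, which appears as a gap in the plot.
    return [counts.get(k, 0) for k in range(n_bins)]

values = [12, 14, 15, 17, 18, 19, 20, 22, 25, 26, 28]
counts = bin_counts(values, width=5, start=10)
relative = [c / len(values) for c in counts]   # heights of a relative frequency histogram

for k, c in enumerate(counts):
    low = 10 + 5 * k
    # Labels such as "10-14" are valid here because the data are whole numbers.
    print(f"{low}-{low + 4} | {'#' * c} ({c})")
```

Dividing each count by the number of cases gives the relative frequencies, which always add up to 1.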
2.
Stem-and-Leaf Display:
Unlike histograms, stem-and-leaf displays show the actual values in each bin. Stem-and-leaf
displays work best when the data set is relatively small, i.e., fewer than a few hundred values. For
a large dataset, histograms do a better job. Read the textbook (Sharpe 2011, pp. 88-89) to
understand how the display works.
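As a rough sketch of the idea (not the textbook's procedure), the Python below builds a stem-and-leaf display for two-digit whole numbers, using the tens digit as the stem and the ones digit as the leaf. The function name `stem_and_leaf` is an assumption for illustration, and the data are the eleven values from the IQR example in this guide.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group two-digit integers by tens digit (stem); ones digits are the leaves."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(stems)

display = stem_and_leaf([12, 14, 15, 17, 18, 19, 20, 22, 25, 26, 28])
for stem in sorted(display):
    leaves = "".join(str(leaf) for leaf in display[stem])
    print(f"{stem} | {leaves}")
```

Every original value can be read back from the display (stem 1 with leaf 2 is 12), which is what distinguishes it from a histogram. This sketch skips stems that have no leaves; a hand-drawn display would normally keep them so that gaps stay visible.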
II.
How to Describe Quantitative Data?
Shape: Mode, Symmetry, Outlier
Center: Mean and Median
Spread: Range, Interquartile Range (IQR), Variance and Standard Deviation
1.
Mode:
Formally, the mode is the single value that appears most often in the dataset. However, humps
in a histogram are also called modes. A histogram with one hump is described as a unimodal
distribution, one with two humps as bimodal, and one with three or more humps as multimodal.
A histogram with no apparent humps is described as a uniform distribution.
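The formal definition (the most frequent value) can be checked with Python's standard `statistics.multimode`, which returns every value tied for the highest count. The first data set comes from the "Mean, Median and Mode in Histogram" example in this guide; the second is made up for illustration.

```python
from statistics import multimode

left_skewed = [1, 2, 3, 3, 3, 4, 4, 4, 4, 5]   # from this guide
single = multimode(left_skewed)                 # one most frequent value

two_peaks = [1, 2, 2, 2, 3, 4, 5, 5, 5, 6]      # made-up data with a tie
tied = multimode(two_peaks)                     # two equally frequent values

print(single, tied)
```

Note that this finds the most frequent values only. Judging whether a histogram is unimodal or bimodal in the "hump" sense still requires looking at the plot, because humps depend on the bins chosen.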
2.
Symmetry:
A distribution is symmetric if the halves on either side of the center approximately mirror each
other. The thinner ends of a distribution are called tails. If one tail stretches out farther than the
other, the distribution is said to be skewed in the direction of the longer tail.
3.
Outlier:
Extremely high or low values in the dataset are usually considered outliers. They may be true
values or may be caused by error.
Chaodong Han
OPRE 504
Page 1 of 6
4.
Mean:
The simple average of all observations in the dataset is called the mean. The mean is a good
indicator of the center of a symmetric distribution but can be misleading for highly skewed data
(with outliers on one side).
5.
Median:
The median is the value that splits the ordered data into two halves. Because the median is
resistant to outliers and skewness, it is commonly used for data with a skewed distribution, such
as Census data on household income.
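A small sketch of this resistance, using made-up income figures (in thousands of dollars) that are purely hypothetical: adding one extreme value moves the mean a long way but barely changes the median.

```python
from statistics import mean, median

incomes = [32, 35, 38, 41, 44]      # hypothetical incomes, in $1000s
with_outlier = incomes + [400]      # the same data plus one extreme value

mean_before, median_before = mean(incomes), median(incomes)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print("before:", mean_before, median_before)
print("after: ", round(mean_after, 1), median_after)
```

In this example the mean jumps from 38 to roughly 98, while the median moves only from 38 to 39.5.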
Steps to find the median by hand:
1. Arrange the data in order of value (from low to high or from high to low).
2. Count the number of values in the dataset; call this number n.
3. If n is odd, the middle value is the median. Its position in the ordered list is $\frac{n+1}{2}$.
4. If n is even, there are two middle values, in positions $\frac{n}{2}$ and $\frac{n}{2}+1$. The
median is the simple average of these two values.
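The by-hand procedure can be sketched as a short Python function. The name `median_by_hand` is an assumption for illustration. For an even number of values the sketch averages the values in positions n/2 and n/2 + 1, which are the two middle positions of the ordered list. The two test data sets come from the "Mean, Median and Mode in Histogram" example in this guide.

```python
def median_by_hand(values):
    ordered = sorted(values)              # step 1: order the values
    n = len(ordered)                      # step 2: count them
    if n % 2 == 1:
        # step 3: odd n, take the value in position (n + 1) / 2
        return ordered[(n + 1) // 2 - 1]  # minus 1 because Python indexes from 0
    # step 4: even n, average the values in positions n/2 and n/2 + 1
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

odd_case = median_by_hand([1, 2, 2, 3, 3, 3, 4, 4, 5])       # n = 9
even_case = median_by_hand([1, 2, 3, 3, 3, 4, 4, 4, 4, 5])   # n = 10
print(odd_case, even_case)
```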
6.
Range:
Total Range = Maximum – Minimum. Total Range is sensitive to extreme values and could be
misleading in skewed data.
7.
Interquartile Range (IQR):
The IQR, also called the midspread or middle fifty, is defined as IQR = Q3 - Q1. To get the IQR,
first order all values (e.g., from high to low). Then use three values (the quartiles) to split the
dataset into four parts that contain approximately the same number of cases.
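As a hedged illustration, the quartiles of the eleven values used in this section's example can be computed with Python's standard `statistics.quantiles`. Its default "exclusive" method is only one of several quartile conventions, so other software packages may report slightly different Q1 and Q3 for the same data.

```python
from statistics import quantiles

values = [28, 26, 25, 22, 20, 19, 18, 17, 15, 14, 12]

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = quantiles(values, n=4)
iqr = q3 - q1

print(f"Q1={q1}, Q2={q2}, Q3={q3}, IQR={iqr}")
```

With eleven values, the three cut points fall on the 3rd, 6th and 9th ordered values, leaving two values in each of the four parts.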
Order   Value   Quartile
1       28
2       26
3       25      Q3 (upper quartile)
4       22
5       20
6       19      Q2 (also called Median)
7       18
8       17
9       15      Q1 (lower quartile)
10      14
11      12
8.
Variance and Standard Deviation:
Variance: $s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$ and Standard Deviation:
$s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$, where $\bar{y}$ is the mean, $n$ is the
number of observations, and $y$ is the actual value of an individual observation.
Read Sharpe 2011 (p. 94) to find out how to calculate the variance and standard deviation by hand.
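The formulas can be followed step by step in Python. This is a sketch of the arithmetic, not the textbook's worked example; the nine values are the symmetric data set from the "Mean, Median and Mode in Histogram" example in this guide, and the result is compared with the standard library's `variance` and `stdev`, which also divide by n - 1.

```python
from math import sqrt
from statistics import stdev, variance

y = [1, 2, 2, 3, 3, 3, 4, 4, 5]
n = len(y)
y_bar = sum(y) / n                                   # the mean
squared_deviations = [(value - y_bar) ** 2 for value in y]
s2 = sum(squared_deviations) / (n - 1)               # sample variance
s = sqrt(s2)                                         # standard deviation

print(y_bar, sum(squared_deviations), s2, round(s, 3))
print(variance(y), round(stdev(y), 3))               # library check
```

Here the mean is 3, the squared deviations add up to 12, and dividing by n - 1 = 8 gives a variance of 1.5.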
Summary: How to Describe the Distribution of Quantitative Data
If the data are symmetric, it is appropriate to report the mean and standard deviation;
If the data are skewed, the median and IQR are more appropriate descriptions of the
distribution.
In practice, all of these descriptive statistics are easily reported by software packages. But
interpretation and focus should vary depending on the actual shape of the distribution.
Mean, Median and Mode in Histogram
[Figure: three histograms, one for each data set below]
Left Skewed (Long, Thin Tail to the Left)
Data: 1, 2, 3, 3, 3, 4, 4, 4, 4, 5
mean 3.3, median 3.5, mode 4
Mean < Median < Mode
Symmetric
Data: 1, 2, 2, 3, 3, 3, 4, 4, 5
Mean = Median = Mode = 3
Right Skewed (Long, Thin Tail to the Right)
Data: 1, 2, 2, 2, 2, 3, 3, 3, 4, 5
mean 2.7, median 2.5, mode 2.0
Mean > Median > Mode
Summary:
In a unimodal distribution, extreme values in the long tail pull the mean towards them: the mean
is closest to the tail, the mode is farthest from the tail, and the median sits in between.
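The ordering of the three measures can be verified for the three example data sets with a short Python check. The dictionary labels are chosen here for illustration; the data are the ones given in this guide.

```python
from statistics import mean, median, mode

datasets = {
    "left skewed": [1, 2, 3, 3, 3, 4, 4, 4, 4, 5],
    "symmetric": [1, 2, 2, 3, 3, 3, 4, 4, 5],
    "right skewed": [1, 2, 2, 2, 2, 3, 3, 3, 4, 5],
}

# For each data set, store (mean, median, mode).
summary = {name: (mean(d), median(d), mode(d)) for name, d in datasets.items()}

for name, (m, med, mo) in summary.items():
    print(f"{name:>12}: mean={m}, median={med}, mode={mo}")
```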
III.
How to Display Quantitative Data (Continued)
Boxplot
1.
Five-Number Summary:
It describes the distribution of a dataset by reporting five statistics: median (Q2), quartiles (Q1
and Q3), and extremes (Max and Min).
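A minimal sketch of a five-number summary in Python, assuming the default "exclusive" quartile method of `statistics.quantiles` (other software may use a different quartile convention). The function name `five_number_summary` is an assumption, and the data are the eleven values from the IQR example in this guide.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (Min, Q1, Median, Q3, Max) for a list of numbers."""
    q1, q2, q3 = quantiles(values, n=4)
    return min(values), q1, q2, q3, max(values)

summary = five_number_summary([28, 26, 25, 22, 20, 19, 18, 17, 15, 14, 12])
print(summary)
```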
2.
Boxplot:
A boxplot displays the five-number summary. Read Sharpe 2011, pp. 95-96 to understand
how to build and interpret a boxplot. Below is the distribution of NYSE daily transaction volumes.
[Figure: boxplot of NYSE daily transaction volumes, labeled from top to bottom]
Extreme Outliers (* indicating > Q3 + 3*IQR)
Outliers (o indicating > Q3 + 1.5*IQR but < Q3 + 3*IQR)
Highest value within the upper fence (upper fence = Q3 + 1.5*IQR)
Q3 (Upper Quartile)
Q2 (Median)
Q1 (Lower Quartile)
Lowest value within the lower fence (lower fence = Q1 - 1.5*IQR)
Outliers (o indicating < Q1 - 1.5*IQR but > Q1 - 3*IQR)
Extreme Outliers (* indicating < Q1 - 3*IQR)
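The fence rules in the labels above can be sketched in Python. The function name `classify_outliers` and the thirteen data values are made up for illustration (they are not the NYSE volumes), and the quartiles use the default "exclusive" method of `statistics.quantiles`, which may differ slightly from other software.

```python
from statistics import quantiles

def classify_outliers(values):
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    lower_extreme, upper_extreme = q1 - 3 * iqr, q3 + 3 * iqr

    # Extreme outliers lie beyond 3 IQRs from the quartiles.
    extreme = [v for v in values if v < lower_extreme or v > upper_extreme]
    # Ordinary outliers lie beyond the fences but are not extreme.
    outliers = [v for v in values
                if (v < lower_fence or v > upper_fence) and v not in extreme]
    # The lines drawn out from the box end at the most extreme values inside the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return outliers, extreme, (min(inside), max(inside))

data = [12, 14, 15, 17, 18, 19, 20, 22, 25, 26, 28, 45, 70]   # made-up values
outliers, extreme, ends = classify_outliers(data)
print(outliers, extreme, ends)
```

For these values Q1 = 16 and Q3 = 27, so IQR = 11, the upper fence is 43.5 and the extreme limit is 60. The value 45 is therefore an ordinary outlier and 70 is an extreme outlier.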
Question 5.1 [Sharpe 2011, Exercise 11, p. 117]. Below are summary statistics for the sizes (in
acres) of upstate New York vineyards.
Variable   N    Mean   St. Dev.   Min   Q1      Median (Q2)   Q3   Max
Acres      36   46.5   47.76      6     18.50   33.50         55   250
a) From the summary statistics, would you describe this distribution as symmetric or
skewed? Explain.
Step 1 Compare the Mean and Median:
Step 2 Compare Q3-Q2 and Q2-Q1:
Step 3 Compare Max - Median and Median – Min:
b) Are there any outliers? Explain.
More Exercises:
Sharpe 2011, Exercises 9, 12, 17, 18, 19, 20, 23 and 24.
Question 5.2 [Sharpe 2011, Exercise 37, p. 122]. Three statistics classes all took the same test.
Here are the histograms of the scores for each class.
a) Which class had the highest mean score?
b) Which class had the highest median score?
c) For which class are the mean and median most different? Which is higher? Why?
d) Which class had the smallest standard deviation? Why?
e) Overall, which class do you think performed best on the test? Why?
Question 5.3 [Sharpe 2011, p. 123, Exercise 38]. Based on the information provided in Exercise 37,
match each boxplot with the corresponding class’s histogram.
More Exercises:
Sharpe 2011 (pp. 121-122) Chapter 5, Exercises 29, 30, 31, 32, 33, 34, 35, and 36.