
L2.4 Descriptive Univar Compare

Introduction to Business Statistics
Module 2. Descriptive Statistics: Univariate
Lecture 2.4
Comparing Distributions
Professor J.B. Lang, University of Iowa, 8/1/22
L2.4 Topics
• Side-by-Side Box Plots for comparing Quant-C distributions
• Comparing 2-number summaries
• Comparing 5-number summaries
• REL FREQ Clustered Bar Charts for comparing Qual distributions
(FREQ version not as useful for comparisons)
• FREQ and REL FREQ 100% Stacked Bar Charts for comparing Qual distributions
Univariate Descriptive Statistics: Goals
𝑋𝑋: 𝑠𝑠 is [feature-value] data; it is the collection of 𝑋𝑋 values for sample 𝑠𝑠 = (𝑠𝑠1 , … , 𝑠𝑠𝑛𝑛 ).
𝑋𝑋: 𝑠𝑠 = π‘₯π‘₯1 , π‘₯π‘₯2 , … , π‘₯π‘₯𝑛𝑛 [featureless notation] will also be used for computational convenience.
Definition. Sample distribution of 𝑋𝑋 values for the sample 𝑠𝑠 is
Goals:
𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑 𝑋𝑋: 𝑠𝑠 =
𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑 𝑣𝑣𝑣𝑣𝑣𝑣𝑣𝑣𝑣𝑣𝑣𝑣 𝑖𝑖𝑖𝑖 𝑋𝑋: 𝑠𝑠 , π‘Ÿπ‘Ÿπ‘Ÿπ‘Ÿπ‘Ÿπ‘Ÿ 𝑓𝑓𝑓𝑓𝑓𝑓𝑓𝑓𝑓𝑓
Graphically and Numerically describe sample 𝑠𝑠 , data 𝑋𝑋: 𝑠𝑠, or 𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑(𝑋𝑋: 𝑠𝑠)
Communicate Relative Standing of a value relative to a reference distribution.
Graphically and Numerically Compare Two or More Samples or Univariate Distributions
(this lecture).
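As a rough illustration of the definition above, the following Python sketch computes a sample distribution as a mapping from each distinct value to its relative frequency. The function name `sample_distribution` and the eight data values are made up for illustration; they are not from the lecture.

```python
from collections import Counter

def sample_distribution(values):
    """Return {distinct value: relative frequency} for the data X:s."""
    n = len(values)
    if n == 0:
        raise ValueError("sample must contain at least one value")
    counts = Counter(values)
    return {value: count / n for value, count in counts.items()}

# Hypothetical data X:s for a sample of n = 8 units (not from the lecture).
x = ["Dog", "Cat", "Dog", "Other", "Dog", "Cat", "Dog", "Other"]
dist = sample_distribution(x)

# Print the distinct values alongside their relative frequencies.
for value, rel_freq in sorted(dist.items()):
    print(f"{value:<6} {rel_freq:.3f}")
```

The relative frequencies always sum to 1, which is what makes distributions from samples of different sizes directly comparable.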
Example 1. T-Bill Rates, 2- or 5-Number Summary (or Neither)?
The distn is approximately bell-shaped, perhaps
with just a slight skew to the right. Arguably, either
the 2- or 5-number summary could be used here.
[Figure: Density Histogram of T-Bill Rates 1958-2008 (n=2600). x-axis: Rate (APR); y-axis: Density.]
2-Number Summary:
$\bar{x} = 5.57$, $s = 2.70$, $n = 2600$.
5-Number Summary:
$\min = 0.85$, $Q_1 = 3.73$, $Q_2 = 5.15$, $Q_3 = 6.99$, $\max = 15.76$, $n = 2600$.
An advantage of the 5-number summary is that it
can be graphically displayed in a boxplot.
Indeed, for this example, we can compare T-Bill
Rates across decades using side-by-side boxplots.
See next page…
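For readers working outside Excel, here is a minimal Python sketch of the two summaries. The T-Bill data are not reproduced in these slides, so the `rates` list is hypothetical, and the choice of the "exclusive" quartile method is an assumption (other quartile conventions give slightly different Q1 and Q3).

```python
import statistics

def two_number_summary(x):
    """Sample mean and sample standard deviation (n - 1 divisor)."""
    return statistics.mean(x), statistics.stdev(x)

def five_number_summary(x):
    """Min, Q1, median (Q2), Q3, max, using the 'exclusive' quartile method."""
    q1, q2, q3 = statistics.quantiles(x, n=4, method="exclusive")
    return min(x), q1, q2, q3, max(x)

# Hypothetical rates (APR); these are NOT the T-Bill data from Example 1.
rates = [3.1, 4.0, 4.6, 5.0, 5.2, 5.9, 6.4, 7.3, 9.8]

mean, sd = two_number_summary(rates)
five = five_number_summary(rates)

print(f"2-number summary: mean = {mean:.2f}, sd = {sd:.2f}, n = {len(rates)}")
print("5-number summary:", ", ".join(f"{v:.2f}" for v in five))
```

The 5-number summary is what a boxplot draws, so calling `five_number_summary` once per group gives the ingredients for side-by-side boxplots.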
Example 1, cont’d. Side-by-Side Box Plots
[Figure: Boxplot of T-Bill Rate. y-axis: Rate.]
5-Number Summary
Min    Q1     Median   Q3       Max     N
0.85   3.73   5.145    6.9875   15.76   2600
[Figure: Boxplots of T-Bill Rates (by years). x-axis: Years (1958-1959, 1960-1969, 1970-1979, 1980-1989, 1990-1999, 2000-2008); y-axis: Rate.]
Example 1, cont’d. Side-by-Side Ornament Plots
[Figure: Ornament Plots of T-Bill Rates by Years]
Ornament plots are a variant of boxplots.
Example 2. Which Bond Fund?
You have decided to invest in a bond fund and plan to limit
your choice of funds to Morningstar “medalist” funds. But
now you must choose between a taxable fund and a
municipal bond fund that is at least partially tax-free. Which
is better? Here (left) are the returns (%) for the three-year
period leading up to spring of 2013.
Goal: Compare the two sample distributions, say
$\mathrm{dist}(\mathit{tbf}{:}s_1)$ and $\mathrm{dist}(\mathit{mbf}{:}s_2)$.
Example 2. Which Bond Fund?, cont’d
Both variables are Quant-C, but with so few observations the “shapes” of
the sample distributions are a little hard to assess.
Let’s compute both the 5-Number and 2-Number summaries for each.
Also display side-by-side box plots.
While the measures of centrality are similar (medians of 6.18 and 5.695), the variability of returns
for the taxable bond fund (tbf) is much greater than for the municipal bond fund (mbf); e.g., IQR: 5.935 vs. 0.9025.
Ignoring tax implications, the risk-averse investor might choose the less risky mbf (tbf risk:
sd = 3.63 vs. mbf risk: sd = 1.44).
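The comparison above (similar centers, very different spreads) can be sketched in Python as follows. The two return lists are hypothetical stand-ins, not the Morningstar returns from the slide, and the names `fund_a`, `fund_b`, and `center_and_spread` are made up for illustration.

```python
import statistics

def center_and_spread(x):
    """Median, IQR (exclusive quartiles), and sample sd for one sample."""
    q1, _, q3 = statistics.quantiles(x, n=4, method="exclusive")
    return {
        "median": statistics.median(x),
        "IQR": q3 - q1,
        "sd": statistics.stdev(x),
    }

# Hypothetical 3-year returns (%); NOT the data shown in Example 2.
samples = {
    "fund_a": [1.0, 3.5, 5.0, 6.0, 7.5, 9.0, 12.0],
    "fund_b": [4.8, 5.2, 5.6, 5.9, 6.1, 6.5, 6.9],
}

summary = {name: center_and_spread(x) for name, x in samples.items()}
for name, stats in summary.items():
    print(name, {k: round(v, 3) for k, v in stats.items()})

# If sd is used as the measure of risk, the smaller sd is the less risky fund.
less_variable = min(summary, key=lambda name: summary[name]["sd"])
print("smaller sd:", less_variable)
```

Treating sd as "risk" follows the slide's usage; it ignores tax treatment and anything else an investor might care about.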
Example 2. Which Bond Fund? cont’d
An Excel (for Microsoft 365) Aside…
Boxplots (aka Box and Whisker Plots) in Excel use quartile calculations based on =quartile.exc by
default. Right-clicking on a boxplot, choosing Format Data Series, and then selecting the “Inclusive
Median” box in the Quartile Calculation section gives a boxplot based on =quartile.inc
[equivalently, =quartile]. The differences are typically minimal, especially for larger data sets.
[Figure: boxplots labeled QUARTILE.EXC and QUARTILE.INC]
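To see the size of the difference on a small sample, the sketch below uses Python's `statistics.quantiles`, whose "exclusive" and "inclusive" methods are assumed here to play the same roles as =quartile.exc and =quartile.inc (positions based on n + 1 versus n - 1). The data are hypothetical.

```python
import statistics

def quartiles(x, method):
    """Return (Q1, Q2, Q3) using Python's 'exclusive' or 'inclusive' method."""
    q1, q2, q3 = statistics.quantiles(x, n=4, method=method)
    return q1, q2, q3

# Hypothetical sample of n = 8 values (not data from the lecture).
x = [1, 2, 3, 4, 5, 6, 7, 8]

exc = quartiles(x, "exclusive")  # assumed analogue of =quartile.exc
inc = quartiles(x, "inclusive")  # assumed analogue of =quartile.inc

print("exclusive:", exc, "IQR =", exc[2] - exc[0])
print("inclusive:", inc, "IQR =", inc[2] - inc[0])
```

For this sample the median is the same under both methods, while the exclusive method places Q1 lower and Q3 higher, giving a wider box. With larger samples the gap shrinks.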
Example 3. Pet Ownership
Data on pet ownership has been collected in two regions. The full
data are in pets.xlsx. Here, we use only the first 100 observations.
The surveyed owners self-identified as $\mathit{Cat}$, $\mathit{Dog}$, or $\mathit{Other}$ owners.
We set out to compare the two distributions, say $\mathit{Pet}{:}s_A$ and $\mathit{Pet}{:}s_B$.
Here, $s_A$ and $s_B$ are the samples of owners from Region $A$ and
Region $B$, respectively. The variable $\mathit{Pet}$ takes values ($\mathit{Cat}$,
$\mathit{Dog}$, $\mathit{Other}$) and hence is of type Qual-N.
To compare the two sample distributions, we will use side-by-side
REL FREQ Bar Charts (either clustered or 100% stacked).
FREQ Bar Charts are not as helpful, because the sample sizes are
quite different across the two Regions (see next slide).
Example 3. Pet Ownership, cont’d
[Figure: FREQ Clustered Bar Chart]
(Created in Excel using Insert > Tables > PivotTable.)
[Figure: REL FREQ Clustered Bar Chart]
Example 3. Pet Ownership, cont’d
[Figure: FREQ 100% Stacked Bar Charts]
[Figure: REL FREQ 100% Stacked Bar Charts]
E.g., the proportions labeled as $\mathit{Dog}$ owners are about the same for the two Regions ($\mathrm{prop}_{s_A}(\mathit{Pet} = \mathit{Dog}) = 45\%$
vs. $\mathrm{prop}_{s_B}(\mathit{Pet} = \mathit{Dog}) = 42.5\%$). But the proportions labeled $\mathit{Cat}$ and $\mathit{Other}$ owners are quite different for
the two Regions (e.g., $\mathrm{prop}_{s_A}(\mathit{Pet} = \mathit{Cat}) = 10\%$ vs. $\mathrm{prop}_{s_B}(\mathit{Pet} = \mathit{Cat}) = 47.5\%$).
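A small Python sketch shows why relative frequencies, rather than raw counts, are the right basis for comparison when sample sizes differ. The counts below are illustrative only: they were chosen to reproduce the percentages quoted above for the first 100 observations and were not read from pets.xlsx.

```python
def relative_frequencies(freqs):
    """Convert a {category: count} table into {category: proportion}."""
    n = sum(freqs.values())
    return {category: count / n for category, count in freqs.items()}

# Illustrative counts (an assumption, consistent with the quoted percentages).
counts = {
    "A": {"Cat": 6, "Dog": 27, "Other": 27},   # n = 60
    "B": {"Cat": 19, "Dog": 17, "Other": 4},   # n = 40
}

rel = {region: relative_frequencies(freqs) for region, freqs in counts.items()}

# Text version of a REL FREQ clustered bar chart: one bar per region and category.
for category in ["Cat", "Dog", "Other"]:
    for region in ["A", "B"]:
        p = rel[region][category]
        bar = "#" * round(p * 40)
        print(f"{category:<6} Region {region}: {bar} {p:.1%}")
```

With these counts, Region A has more Dog owners than Region B in absolute terms (27 vs. 17), yet the two proportions are close. A FREQ chart would show the former and hide the latter.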
L2.4 Exercises
1. Excel. Go to baseball_weights.xlsx.
(a) Compute the 2-number summaries for Manufacturer A and for Manufacturer B.
(b) Compute the 5-number summaries for Manufacturer A and for Manufacturer B.
(c) Are the 2-number summaries reasonable to report in this case? Explain. Hint: Create histograms for the two samples and look at the shapes.
(d) Create side-by-side boxplots to compare the two sample distributions of weights.
2. Excel. Go to pets.xlsx. (See also Example 3 above, where we restricted our analysis to the first 100 observations.)
(a) Does either the 2-number or the 5-number summary make sense for describing the distributions of $\mathit{Pet}$? Explain.
(b) What are the sample sizes for Regions A and B?
(c) How many of the Region A owners consider themselves Cat owners?
(d) What proportion of the Region A owners consider themselves Cat owners?
(e) Create a REL FREQ Clustered Bar Chart that is useful for comparing the $\mathit{Pet}$ distributions across Region, i.e., for comparing $\mathrm{dist}(\mathit{Pet}{:}s_A)$ and $\mathrm{dist}(\mathit{Pet}{:}s_B)$. See Example 3 for a useful example of this chart.
(f) Explain why a FREQ Clustered Bar Chart is not as useful as the REL FREQ version for comparing the $\mathit{Pet}$ distributions.
(g) Create a FREQ 100% Stacked Bar Chart that is useful for comparing the $\mathit{Pet}$ distributions across Region, i.e., for comparing $\mathrm{dist}(\mathit{Pet}{:}s_A)$ and $\mathrm{dist}(\mathit{Pet}{:}s_B)$. Again, see Example 3 for a useful example of this chart.
3. This exercise uses data in nightmares.xlsx. The goal is to compare the $\mathit{Nightmare?}$ distributions across the samples of
Left-Side (L) and Right-Side (R) sleepers.
The table of counts below was created using Pivot Table in Excel.

Count of Nightmare?    Side
Nightmare?             L     R     Grand Total
yes                    9     6     15
no                     13    33    46
Grand Total            22    39    61

(a) If $s_L$ and $s_R$ are the two samples being compared, what are the two sample sizes?
(b) What is the value of $\mathrm{prop}_{s_L}(\mathit{Nightmare?} = \mathit{yes})$? That is, what proportion of the sample of Left-Side sleepers had nightmares?
(c) What is the value of $\mathrm{prop}_{s_R}(\mathit{Nightmare?} = \mathit{yes})$? That is, what proportion of the sample of Right-Side sleepers had nightmares?
(d) Create a 100% Stacked Bar Chart that is useful for comparing the two sample distributions, $\mathrm{dist}(\mathit{Nightmare?}{:}s_L)$ and
$\mathrm{dist}(\mathit{Nightmare?}{:}s_R)$. You can do this by hand or in Excel. Do clearly label your chart.
4. The data displayed at the right can be found in
house_prices_saratoga.xlsx. The side-by-side boxplot can be
used to compare the sample distributions of house prices
($1000’s) across those houses without a fireplace and those
with at least one fireplace. The respective sample sizes were
428 (without) and 635 (with).
(a) Which boxplot (Left or Right) corresponds to the sample of
houses without a fireplace? Why?
(b) What are the mean and median house price for houses
with at least one fireplace?
(c) The middle 50% of the house prices for houses with at least
one fireplace fall between what two numbers?
(d) If you were to look at the histogram of prices for those
houses with at least one fireplace, would its shape be
symmetric, skewed right, or skewed left? Explain.
5. In an exercise of L2.1, you created stem plots of the Female ages and the Male ages. Because the two sample sizes are small and identical, a back-to-back stem plot can be useful for comparison. These data are part of the file oscars.xlsx.
(a) Create a back-to-back stem plot for comparing the Female and Male ages. Hint: Use the shared stems 2,2,3,3,4,4,5,5,6,6.
(b) In words, how do the age distributions compare for the Females and Males?

Ages of First 15 Oscar-Winning Actors (Female): 22, 37, 28, 63, 32, 26, 31, 27, 26, 27, 30, 26, 29, 24, 38
Ages of First 15 Oscar-Winning Actors (Male): 44, 40, 62, 53, 35, 47, 34, 33, 52, 41, 37, 38, 34, 32, 40
6. Excel. Information on height and singing part for 130 members of a choir is given in the data set singers_by_parts.xlsx.
(a) Create, and describe the shape of, a histogram of all 130 heights (ignoring singing part). Use bin widths of 2.5 inches
starting at 60 inches; i.e., use bins [60, 62.5], (62.5, 65], …, (75, 77.5].
(b) Give the 2-number summary for the heights of the Sopranos.
(c) Give the 2-number summary for the heights of the Altos.
(d) Compute the 90th percentile for the heights of the Altos and give it an interpretation.
(e) Compute the IQR for the Altos and fill in the blank: “The middle 50% of the heights for Altos fall in a range of width
__________ inches.”
(f) Create side-by-side boxplots for the heights of each of the four singing parts (Soprano, Alto, Tenor, and Bass).
(g) Comment on the height comparisons across the four singing parts.