Introduction to Business Statistics
Module 2. Descriptive Statistics: Univariate
Lecture 2.4 Comparing Distributions
Professor J.B. Lang, University of Iowa, 8/1/22

L2.4 Topics
• Side-by-Side Box Plots for comparing Quant-C distributions
• Comparing 2-number summaries
• Comparing 5-number summaries
• REL FREQ Clustered Bar Charts for comparing Qual distributions (FREQ version not as useful for comparisons)
• FREQ and REL FREQ 100% Stacked Bar Charts for comparing Qual distributions

Univariate Descriptive Statistics: Goals
$X : s$ is [feature-value] data; it is the collection of $X$ values for the sample $s = (s_1, \ldots, s_n)$.
$X : s = x_1, x_2, \ldots, x_n$ [featureless notation] will also be used for computational convenience.
Definition. The sample distribution of $X$ values for the sample $s$ is
$\mathrm{dist}(X : s)$ = distinct values of $X : s$, and freqs.
Goals:
• Graphically and Numerically describe sample $s$, data $X : s$, or $\mathrm{dist}(X : s)$.
• Communicate Relative Standing of a value relative to a reference distribution.
• Graphically and Numerically Compare Two or More Samples or Univariate Distributions (this lecture).

Example 1. T-Bill Rates, 2- or 5-Number Summary (or Neither)?
The distribution is approximately bell-shaped, perhaps with just a slight skew to the right. Arguably, either the 2- or 5-number summary could be used here.
[Figure: Density Histogram of T-Bill Rates 1958-2008 (n=2600); x-axis: Rate (APR), y-axis: Density.]
2-Number Summary: $\bar{x} = 5.57$, $s = 2.70$, $n = 2600$.
5-Number Summary: $min = 0.85$, $Q_1 = 3.73$, $Q_2 = 5.15$, $Q_3 = 6.99$, $max = 15.76$, $n = 2600$.
An advantage of the 5-number summary is that it can be graphically displayed in a boxplot. Indeed, for this example, we can compare T-Bill Rates across decades using side-by-side boxplots. See next page…

Example 1, cont’d.
Side-by-Side Box Plots
[Figure: Boxplot of T-Bill Rate; y-axis: Rate.]
5-Number Summary
Min    Q1     Median   Q3       Max     N
0.85   3.73   5.145    6.9875   15.76   2600
[Figure: Boxplots of T-Bill Rates (by years); x-axis: Years (1958-1959, 1960-1969, 1970-1979, 1980-1989, 1990-1999, 2000-2008), y-axis: Rate.]

Example 1, cont’d. Side-by-Side Ornament Plots
[Figure: Ornament Plots of T-Bill Rates by Years.]
Ornament Plots are a variant of boxplots.

Example 2. Which Bond Fund?
You have decided to invest in a bond fund and plan to limit your choice of funds to Morningstar “medalist” funds. But now you must choose between a taxable fund and a municipal bond fund that is at least partially tax-free. Which is better?
Here (left) are the returns (%) for the three-year period leading up to spring of 2013.
Goal: Compare the two sample distributions, say $\mathrm{dist}(tbf : s_1)$ and $\mathrm{dist}(mbf : s_2)$.

Example 2. Which Bond Fund?, cont’d
Both variables are Quant-C, but with so few observations the “shapes” of the sample distributions are a little hard to assess. Let’s compute both the 5-Number and 2-Number summaries for each, and also display side-by-side box plots.
While the measures of centrality are similar (medians of 6.18 and 5.695), the variability of returns for the taxable bond fund (tbf) is much greater than for the municipal bond fund (mbf) (e.g., IQR: 5.935 vs. 0.9025). Ignoring tax implications, the risk-averse investor might choose the less risky mbf (tbf risk: sd = 3.63 vs. mbf risk: sd = 1.44).

Example 2. Which Bond Fund?, cont’d
An Excel (for Microsoft 365) Aside…
Boxplots (aka Box and Whisker Plots) in Excel use quartile calculations based on =QUARTILE.EXC by default. Right-clicking on a boxplot, then choosing Format Data Series, then selecting the “Inclusive Median” box in the Quartile Calculation section will give a boxplot based on =QUARTILE.INC [=QUARTILE]. The differences are typically minimal, especially for larger data sets.
[Figure: Boxplots labeled QUARTILE.EXC and QUARTILE.INC.]

Example 3.
Pet Ownership
Data on pet ownership have been collected in two regions. The full data are in pets.xlsx. Here, we use only the first 100 observations. The surveyed owners self-identified as Cat, Dog, or Other owners.
We set out to compare the two distributions, say $pet : s_A$ and $pet : s_B$. Here, $s_A$ and $s_B$ are the samples of owners from Region A and Region B, respectively.
The variable $pet$ takes values (Cat, Dog, Other) and hence is of type Qual-N.
To compare the two sample distributions, we will use side-by-side REL FREQ Bar Charts (either clustered or 100% stacked). FREQ Bar Charts are not as helpful, because the sample sizes are quite different across the two Regions (see next slide).

Example 3. Pet Ownership, cont’d
[Figure: FREQ Clustered Bar Chart.]
In Excel, used Insert > Table > Pivot Table.
[Figure: REL FREQ Clustered Bar Chart.]

Example 3. Pet Ownership, cont’d
[Figure: FREQ 100% Stacked Bar Charts.]
[Figure: REL FREQ 100% Stacked Bar Charts.]
E.g., the proportions labeled as Dog owners are about the same for the two Regions ($\mathrm{prop}_{s_A}(\text{pet} = \text{Dog})$ = 45% vs. $\mathrm{prop}_{s_B}(\text{pet} = \text{Dog})$ = 42.5%). But the proportions labeled Cat and Other owners are quite different for the two Regions (e.g., $\mathrm{prop}_{s_A}(\text{pet} = \text{Cat})$ = 10% vs. $\mathrm{prop}_{s_B}(\text{pet} = \text{Cat})$ = 47.5%).

L2.4 Exercises
1. Excel. Go to baseball_weights.xlsx.
(a) Compute the 2-number summaries for Manufacturer A and for Manufacturer B.
(b) Compute the 5-number summaries for Manufacturer A and for Manufacturer B.
(c) Are the 2-number summaries reasonable to report in this case? Explain. Hint: Create histograms for the two samples and look at the shapes.
(d) Create side-by-side boxplots to compare the two sample distributions of weights.

2. Excel. Go to pets.xlsx. (See also Example 3 above, where we restricted our analysis to the first 100 observations.)
(a) Does either the 2-number or the 5-number summary make sense for describing the distributions of $pet$? Explain.
(b) What are the sample sizes for Regions A and B?
(c) What number of the Region A owners consider themselves Cat owners?
(d) What proportion of the Region A owners consider themselves Cat owners?
(e) Create a REL FREQ Clustered Bar Chart that is useful for comparing the $pet$ distributions across Region, i.e., for comparing $\mathrm{dist}(pet : s_A)$ and $\mathrm{dist}(pet : s_B)$. See Example 3 for a useful example of this chart.
(f) Explain why a FREQ Clustered Bar Chart is not as useful as the REL FREQ version for comparing the $pet$ distributions.
(g) Create a FREQ 100% Stacked Bar Chart that is useful for comparing the $pet$ distributions across Region, i.e., for comparing $\mathrm{dist}(pet : s_A)$ and $\mathrm{dist}(pet : s_B)$. Again, see Example 3 for a useful example of this chart.

3. This exercise uses data in nightmares.xlsx. The goal is to compare the Nightmare? distributions across the samples of Left-Side (L) and Right-Side (R) sleepers. The table of counts below was created using Pivot Table in Excel.

Count of Nightmare?   Side
Nightmare?            L     R     Grand Total
yes                   9     6     15
no                    13    33    46
Grand Total           22    39    61

(a) If $s_L$ and $s_R$ are the two samples being compared, what are the two sample sizes?
(b) What is the value of $\mathrm{prop}_{s_L}(\text{Nightmare?} = \text{yes})$? That is, what proportion of the sample of Left-Side sleepers had nightmares?
(c) What is the value of $\mathrm{prop}_{s_R}(\text{Nightmare?} = \text{yes})$? That is, what proportion of the sample of Right-Side sleepers had nightmares?
(d) Create a 100% Stacked Bar Chart that is useful for comparing the two sample distributions, $\mathrm{dist}(\text{Nightmare?} : s_L)$ and $\mathrm{dist}(\text{Nightmare?} : s_R)$. You can do this by hand or in Excel. Do clearly label your chart.

4.
The data displayed at the right can be found in house_prices_saratoga.xlsx. The side-by-side boxplot can be used to compare the sample distributions of house prices ($1000’s) across those houses without a fireplace and those with at least one fireplace. The respective sample sizes were 428 (without) and 635 (with).
(a) Which boxplot (Left or Right) corresponds to the sample of houses without a fireplace? Why?
(b) What are the mean and median house prices for houses with at least one fireplace?
(c) The middle 50% of the house prices for houses with at least one fireplace fall between what two numbers?
(d) If you were to look at the histogram of prices for those houses with at least one fireplace, would its shape be symmetric, skewed right, or skewed left? Explain.

5. In an exercise of L2.1, you created stem plots of the Female ages and the Male ages. Because the two sample sizes are small and identical, a back-to-back stem plot can be useful for comparison.
(a) Create a back-to-back stem plot for comparing the Female and Male ages. Hint: Use the shared stems 2, 2, 3, 3, 4, 4, 5, 5, 6, 6.
(b) In words, how do the age distributions compare for the Females and Males?
Ages of First 15 Oscar-Winning Actors (Female): 22 37 28 63 32 26 31 27 26 27 30 26 29 24 38
Ages of First 15 Oscar-Winning Actors (Male): 44 40 62 53 35 47 34 33 52 41 37 38 34 32 40
These data are part of the file oscars.xlsx.

6. Excel. Information on height and singing part for 130 members of a choir is given in the data set singers_by_parts.xlsx.
(a) Create, and describe the shape of, a histogram of all 130 heights (ignoring singing part). Use bin widths of 2.5 inches starting at 60 inches; i.e., use bins [60, 62.5], (62.5, 65], …, (75, 77.5].
(b) Give the 2-number summary for the heights of the Sopranos.
(c) Give the 2-number summary for the heights of the Altos.
(d) Compute the 90th percentile for the heights of the Altos and give it an interpretation.
(e) Compute the IQR for the Altos and fill in the blank: “The middle 50% of the heights for Altos fall in a range of width __________ inches.”
(f) Create side-by-side boxplots for the heights of each of the four singing parts (Soprano, Alto, Tenor, and Bass).
(g) Comment on the height comparisons across the four singing parts.
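
Supplement to the Excel aside in Example 2. The two quartile conventions mentioned there can be tried outside Excel. The sketch below is only an illustration, not part of the lecture's Excel workflow: it uses Python's standard-library statistics.quantiles on the female Oscar-winner ages listed in Exercise 5. It assumes that method="exclusive" plays the role of =QUARTILE.EXC and method="inclusive" the role of =QUARTILE.INC; the variable names are made up for this example.

```python
from statistics import quantiles

# Ages of the first 15 Oscar-winning actors (female), as listed in Exercise 5.
female_ages = [22, 37, 28, 63, 32, 26, 31, 27, 26, 27, 30, 26, 29, 24, 38]

# n=4 requests the three cut points Q1, Q2 (median), and Q3.
# Assumption: "exclusive" corresponds to Excel's QUARTILE.EXC and
# "inclusive" to QUARTILE.INC; check against Excel before relying on it.
exclusive = quantiles(female_ages, n=4, method="exclusive")
inclusive = quantiles(female_ages, n=4, method="inclusive")

print("exclusive Q1, Q2, Q3:", exclusive)
print("inclusive Q1, Q2, Q3:", inclusive)

# The IQR (Q3 - Q1) is the width of the box in the corresponding boxplot.
print("IQR (exclusive):", exclusive[2] - exclusive[0])
print("IQR (inclusive):", inclusive[2] - inclusive[0])
```

For these 15 ages, Q1 and the median agree under both conventions and only Q3 differs (32 vs. 31.5), in line with the remark that the differences are typically minimal.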