Chapter 3
Graphical and Numerical Summaries of Categorical Data

UNIT OBJECTIVES
At the conclusion of this unit you should be able to:
1) Construct graphs that appropriately describe data.
2) Calculate and interpret numerical summaries of a data set.
3) Combine numerical methods with graphical methods to analyze a data set.

Displaying Qualitative Data

“Sometimes you can see a lot just by looking.”
Yogi Berra, Hall of Fame Catcher, NY Yankees

The three rules of data analysis won’t be difficult to remember:
1. Make a picture: it reveals aspects not obvious in the raw data and enables you to think clearly about the patterns and relationships that may be hiding in your data.
2. Make a picture: it shows important features of and patterns in the data. You may also see things that you did not expect, such as extraordinary (possibly wrong) data values or unexpected patterns.
3. Make a picture: the best way to tell others about your data is with a well-chosen picture.

Bar Charts: show counts or relative frequency for each category
Example: Titanic passenger/crew distribution

[Bar chart: "Titanic Passengers by Class". Counts: Crew 885, First 325, Second 285, Third 706.]

Pie Charts: show proportions of the whole in each category
Example: Titanic passenger/crew distribution

[Pie chart: "Titanic Passengers by Class". Crew 40%, First 15%, Second 13%, Third 32%.]

Example: Top 10 causes of death in the United States, 2001

Rank  Cause of death        Counts    % of top 10s  % of total deaths
1     Heart disease         700,142   37%           28%
2     Cancer                553,768   29%           22%
3     Cerebrovascular       163,538   9%            6%
4     Chronic respiratory   123,013   6%            5%
5     Accidents             101,537   5%            4%
6     Diabetes mellitus     71,372    4%            3%
7     Flu and pneumonia     62,034    3%            2%
8     Alzheimer’s disease   53,852    3%            2%
9     Kidney disorders      39,480    2%            2%
10    Septicemia            32,238    2%            1%
      All other causes      629,967                 25%

For each individual who died in the United States in 2001, we record the cause of death. The table above is a summary of that information.
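The two percentage columns in the causes-of-death table can be recomputed from the counts alone. The sketch below is only an illustration: the variable names are invented here, and it assumes that total deaths equal the top 10 counts plus the "All other causes" count.

```python
# Counts copied from the 2001 causes-of-death table.
top10_counts = {
    "Heart disease": 700_142,
    "Cancer": 553_768,
    "Cerebrovascular": 163_538,
    "Chronic respiratory": 123_013,
    "Accidents": 101_537,
    "Diabetes mellitus": 71_372,
    "Flu and pneumonia": 62_034,
    "Alzheimer's disease": 53_852,
    "Kidney disorders": 39_480,
    "Septicemia": 32_238,
}
other_causes = 629_967

# Assumption: every death is either a top 10 cause or "all other causes".
top10_total = sum(top10_counts.values())
all_deaths = top10_total + other_causes


def percent(part, whole):
    """Relative frequency of `part` within `whole`, as a percentage."""
    return 100 * part / whole


# Relative frequencies against the two different "wholes".
pct_of_top10 = {c: percent(n, top10_total) for c, n in top10_counts.items()}
pct_of_total = {c: percent(n, all_deaths) for c, n in top10_counts.items()}
pct_other = percent(other_causes, all_deaths)

for cause, n in top10_counts.items():
    print(f"{cause:<22}{n:>9,}{pct_of_top10[cause]:>6.0f}%{pct_of_total[cause]:>6.0f}%")
print(f"{'All other causes':<22}{other_causes:>9,}{'':>7}{pct_other:>6.0f}%")
```

The same count can give different percentages depending on which whole it is divided by, which is why a chart's title and labels should state the whole being used.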
Top 10 causes of death: bar graph

[Bar graph: "Top 10 causes of deaths in the United States 2001". Horizontal axis: cause of death. Vertical axis: Counts (x1000), from 0 to 800.]

Each category is represented by one bar. The bar’s height shows the count (or sometimes the percentage) for that particular category. For example, the number of individuals who died of an accident in 2001 is approximately 100,000.

[Two versions of the same bar graph, "Top 10 causes of deaths in the United States 2001", vertical axis Counts (x1000):]
Bar graph sorted by rank: easy to analyze.
Sorted alphabetically: much less useful.

Top 10 causes of death: pie chart

Each slice represents a piece of one whole. The size of a slice depends on what percent of the whole this category represents.

[Pie charts: "Percent of people dying from top 10 causes of death in the United States in 2001". One chart shows percent of deaths from top 10 causes; the other shows percent of deaths from all causes.]

Make sure your labels match the data.
Make sure all percents add up to 100.

Child poverty before and after government intervention (UNICEF, 1996)

What does this chart tell you?
• The United States has the highest rate of child poverty among developed nations (22% of under 18).
• Its government does the least, through taxes and subsidies, to remedy the problem (size of orange bars and percent difference between orange/blue bars).

Could you transform this bar graph to fit in one pie chart?
In two pie charts? Why?

The poverty line is defined as 50% of national median income.

Contingency Tables: Categories for Two Variables
Example: Survival and class on the Titanic

Marginal distributions

Survival      Crew       First      Second     Third      Total   Marg. dist. of survival
Alive         212        202        118        178        710     710/2201 = 32.3%
Dead          673        123        167        528        1491    1491/2201 = 67.7%
Total         885        325        285        706        2201
Marg. dist.   885/2201   325/2201   285/2201   706/2201
of class      = 40.2%    = 14.8%    = 12.9%    = 32.1%

[Figure: Marginal distribution of class, bar chart.]
[Figure: Marginal distribution of class, pie chart.]

Contingency Tables: Categories for Two Variables (cont.)

Conditional distributions. Given the class of a passenger, what is the chance the passenger survived?

Survival               Crew     First    Second   Third    Total
Alive    Count         212      202      118      178      710
         % of col.     24.0%    62.2%    41.4%    25.2%    32.3%
Dead     Count         673      123      167      528      1491
         % of col.     76.0%    37.8%    58.6%    74.8%    67.7%
Total    Count         885      325      285      706      2201

[Figure: Conditional distributions, segmented bar chart.]

Contingency Tables: Categories for Two Variables (cont.)

Questions (answers read from the table above):
What fraction of survivors were in first class? 202/710
What fraction of passengers were in first class and survivors? 202/2201
What fraction of the first class passengers survived? 202/325

3-Way Tables
Example: Georgia death-sentence data

Race of defendant     Black             White
Race of victim        Black    White    Black    White    Totals
Death sentence: Yes   18       50       2        58       128
Death sentence: No    1420     178      62       687      2347
Totals                1438     228      64       745      2475
% death sentence      1.2      21.9     3.1      7.8

UC Berkeley Lawsuit

                     MEN     WOMEN
No. of applicants    2691    1835
Admitted             1199    557
% admitted           44.6    30.4

LAWSUIT (cont.)

         MEN                                WOMEN
MAJOR    No. of applicants   No. admitted   No. of applicants   No. admitted
A        825                 512 (62%)      108                 *89 (82%)
B        560                 353 (63%)      25                  *17 (68%)
C        325                 120 (37%)      593                 202 (34%)
D        417                 138 (33%)      375                 *131 (35%)
E        191                 53 (28%)       393                 94 (24%)
F        373                 23 (6%)        341                 *24 (7%)
TOTAL    2691                1199           1835                557

* Majors in which women were admitted at a higher rate than men.

Simpson’s Paradox

The reversal of the direction of a comparison or association when data from several groups are combined to form a single group.

Fly Alaska Airlines, the on-time airline!

               Alaska Airlines                America West
Destination    % on time   No. of arrivals    % on time   No. of arrivals
L.A.           88.9%       559                85.6%       811
Phoenix        94.8%       233                92.1%       5,255
San Diego      91.4%       232                85.5%       448
San Fran.      83.1%       605                71.3%       449
Seattle        85.8%       2,146              76.7%       262
Total                      3,775                          7,225

America West Wins! You’re a Hero!

With the same destination-by-destination figures, the combined on-time rates are:

               Alaska Airlines                America West
               % on time   No. of arrivals    % on time   No. of arrivals
Total          86.7%       3,775              89.1%       7,225

End of Chapter 3
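The reversal in the airline example can be checked by recomputing each airline's overall on-time rate as a weighted average of its per-destination rates, weighted by number of arrivals. The following is a minimal sketch under that assumption; the dictionary layout and function name are illustrative, and the figures are the rounded percentages from the tables in this chapter.

```python
# (percent on time, number of arrivals) per destination.
arrivals = {
    "Alaska Airlines": {
        "L.A.": (88.9, 559),
        "Phoenix": (94.8, 233),
        "San Diego": (91.4, 232),
        "San Fran.": (83.1, 605),
        "Seattle": (85.8, 2146),
    },
    "America West": {
        "L.A.": (85.6, 811),
        "Phoenix": (92.1, 5255),
        "San Diego": (85.5, 448),
        "San Fran.": (71.3, 449),
        "Seattle": (76.7, 262),
    },
}


def overall_on_time(by_city):
    """Per-city on-time rates averaged with arrivals as the weights."""
    total_arrivals = sum(n for _, n in by_city.values())
    weighted_sum = sum(pct * n for pct, n in by_city.values())
    return weighted_sum / total_arrivals


overall = {name: overall_on_time(cities) for name, cities in arrivals.items()}

alaska = arrivals["Alaska Airlines"]
west = arrivals["America West"]
# Within every single destination, Alaska has the higher on-time rate...
alaska_wins_every_city = all(alaska[c][0] > west[c][0] for c in alaska)

print("Alaska better in every city:", alaska_wins_every_city)
for name, pct in overall.items():
    # ...yet the combined rate favors the other airline.
    print(f"{name}: {pct:.1f}% on time overall")
```

The weights explain the reversal: 5,255 of the second airline's 7,225 arrivals are in Phoenix, where both airlines post their best rates, while 2,146 of Alaska's 3,775 arrivals are in Seattle, where rates are lower. The same check can be applied to the admissions data by major.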