Lecture Unit 2
Graphical and Numerical Summaries of Data

UNIT OBJECTIVES
At the conclusion of this unit you should be able to:
1) Construct graphs that appropriately describe data.
2) Calculate and interpret numerical summaries of a data set.
3) Combine numerical methods with graphical methods to analyze a data set.
4) Apply graphical methods of summarizing data to choose appropriate numerical summaries.
5) Apply software and/or calculators to automate graphical and numerical summary procedures.

Section 2.1
Displaying Qualitative Data

“Sometimes you can see a lot just by looking.”
Yogi Berra, Hall of Fame Catcher, NY Yankees

The three rules of data analysis won’t be difficult to remember:
1. Make a picture: it reveals aspects not obvious in the raw data and enables you to think clearly about the patterns and relationships that may be hiding in your data.
2. Make a picture: it shows important features of and patterns in the data. You may also see things that you did not expect: extraordinary (possibly wrong) data values or unexpected patterns.
3. Make a picture: the best way to tell others about your data is with a well-chosen picture.
Bar Charts: show counts or relative frequency for each category
Example: Titanic passenger/crew distribution
[Bar chart: Titanic Passengers by Class. Crew 885, First 325, Second 285, Third 706; vertical axis 0 to 1000]

Pie Charts: show proportions of the whole in each category
Example: Titanic passenger/crew distribution
[Pie chart: Titanic Passengers by Class. Crew 40%, First 15%, Second 13%, Third 32%]

Example: Top 10 causes of death in the United States

Rank  Cause of death         Counts    % of top 10s   % of total deaths
1     Heart disease          700,142   37%            28%
2     Cancer                 553,768   29%            22%
3     Cerebrovascular        163,538   9%             6%
4     Chronic respiratory    123,013   6%             5%
5     Accidents              101,537   5%             4%
6     Diabetes mellitus      71,372    4%             3%
7     Flu and pneumonia      62,034    3%             2%
8     Alzheimer’s disease    53,852    3%             2%
9     Kidney disorders       39,480    2%             2%
10    Septicemia             32,238    2%             1%
      All other causes       629,967                  25%

For each individual who died in the United States, we record the cause of death. The table above is a summary of that information.

Top 10 causes of death: bar graph
[Bar graph: Top 10 causes of deaths in the United States. Horizontal axis: cause of death; vertical axis: Counts (x1000), 0 to 800]

The number of individuals who died of an accident is approximately 100,000.
Each category is represented by one bar. The bar’s height shows the count (or sometimes the percentage) for that particular category.
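As a rough illustration of "one bar per category, height equals count", the sketch below draws the top-10 counts as text bars. The function name `text_bar_chart` and the scale of one `#` per 25,000 deaths are assumptions chosen for this example, not part of the original slides.

```python
# Counts taken from the top-10 causes of death table.
counts = {
    "Heart disease": 700_142,
    "Cancer": 553_768,
    "Cerebrovascular": 163_538,
    "Chronic respiratory": 123_013,
    "Accidents": 101_537,
    "Diabetes mellitus": 71_372,
    "Flu and pneumonia": 62_034,
    "Alzheimer's disease": 53_852,
    "Kidney disorders": 39_480,
    "Septicemia": 32_238,
}


def text_bar_chart(counts, unit=25_000):
    """Return one text line per category, largest count first.

    Each '#' stands for `unit` deaths (an arbitrary scale for this sketch).
    """
    width = max(len(name) for name in counts)
    lines = []
    for name, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        bar = "#" * round(n / unit)
        lines.append(f"{name:<{width}} {bar} {n:,}")
    return lines


print("\n".join(text_bar_chart(counts)))
```

The bar for accidents comes out four characters long, which matches the reading of about 100,000 from the graph.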
[Figure: Top 10 causes of deaths in the United States, shown as two bar graphs. Vertical axis: Counts (x1000), 0 to 800]
Bar graph sorted by rank: easy to analyze.
Bar graph sorted alphabetically: much less useful.

Recent Annual Computer Hardware Sales ($billions)
1. United States $158
2. China $64.4
3. Japan $54
4. Germany $24.4
5. Britain $23.5
6. France $19.3
7. Brazil $14.2
8. Italy $13.1
9. Australia $12.8
10. India $11.9
NY Times

Recent Annual Software Sales ($billions)
1. United States $137.9
2. Japan $23.4
3. Germany $20
4. Britain $16.8
5. France $12.6
6. Canada $7.3
7. Italy $6.3
8. China $5.4
9. Netherlands $5.4
10. Australia $4.8

Top 10 causes of death: pie chart
Each slice represents a piece of one whole. The size of a slice depends on what percent of the whole this category represents.
[Pie chart: Percent of people dying from top 10 causes of death in the United States]
Make sure your labels match the data. Make sure all percents add up to 100.
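The two percentage columns in the causes-of-death table differ only in the denominator, and each set should add up to 100. The sketch below recomputes both from the counts; the helper name `percents` is an assumption made for this example.

```python
# Counts from the top-10 causes of death table.
top10 = {
    "Heart disease": 700_142,
    "Cancer": 553_768,
    "Cerebrovascular": 163_538,
    "Chronic respiratory": 123_013,
    "Accidents": 101_537,
    "Diabetes mellitus": 71_372,
    "Flu and pneumonia": 62_034,
    "Alzheimer's disease": 53_852,
    "Kidney disorders": 39_480,
    "Septicemia": 32_238,
}
all_other = 629_967


def percents(counts):
    """Convert category counts to percentages of their own total."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


# Pie chart of the top 10 only: the whole is the top-10 total.
pct_top10 = percents(top10)
# Pie chart of all deaths: the whole must include "All other causes".
pct_all = percents({**top10, "All other causes": all_other})

for name in top10:
    print(f"{name:<20} {pct_top10[name]:5.1f}% of top 10  {pct_all[name]:5.1f}% of all")

# Unrounded percentages add up to 100 (up to floating-point error).
print(sum(pct_top10.values()), sum(pct_all.values()))
```

Rounded percentages can drift a point or two away from 100 in general, so it is safer to check the sum before rounding.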
[Pie charts: Percent of deaths from top 10 causes; Percent of deaths from all causes]

Internships
[Figures: basic bar chart; side-by-side bar chart]

Average Student Debt by State, 2010 Class
[Bar chart: one bar per state and DC, listed from New Hampshire to Utah. Axis: $0 to $35,000]

Student Debt, North Carolina Schools
[Bar chart: North Carolina Private Schools, 2010 Class. Bars per school: average debt of graduates; tuition and fees (in-state). Axis: 0 to 40000. Includes the North Carolina mean.]
[Bar chart: North Carolina Public Schools, 2010 Class. Bars per school: average debt of graduates; tuition and fees (in-state). Axis: 0 to 25000. Includes the mean for North Carolina 4-year or above.]

[Figure: Unnecessary dimension in a pie chart]

Contingency Tables: Categories for Two Variables
Example: Survival and class on the Titanic
Marginal distributions

                       Crew       First      Second     Third      Total    marg. dist. of survival
Alive                  212        202        118        178        710      710/2201 = 32.3%
Dead                   673        123        167        528        1491     1491/2201 = 67.7%
Total                  885        325        285        706        2201
marg. dist. of class   885/2201   325/2201   285/2201   706/2201
                       40.2%      14.8%      12.9%      32.1%

[Figures: marginal distribution of class as a bar chart and as a pie chart]

Contingency Tables: Categories for Two Variables (cont.)
Conditional distributions. Given the class of a passenger, what is the chance the passenger survived?

                          Class
Survival                  Crew     First    Second   Third    Total
Alive     Count           212      202      118      178      710
          % of col.       24.0%    62.2%    41.4%    25.2%    32.3%
Dead      Count           673      123      167      528      1491
          % of col.       76.0%    37.8%    58.6%    74.8%    67.7%
Total     Count           885      325      285      706      2201

[Figure: conditional distributions as a segmented bar chart]

Contingency Tables: Categories for Two Variables (cont.)
Questions (using the table above):
What fraction of survivors were in first class? 202/710
What fraction of passengers were in first class and survivors? 202/2201
What fraction of the first class passengers survived? 202/325

TV viewers during the Super Bowl in 2013.
What is the marginal distribution of those who watched the commercials only?
1. 8.0%
2. 23.5%
3. 58.2%
4. 27.7%

TV viewers during the Super Bowl in 2013.
What percentage watched the game and were female?
1. 41.8%
2. 38.8%
3. 51.2%
4. 19.8%

TV viewers during the Super Bowl in 2013.
Given that a viewer did not watch the Super Bowl telecast, what percentage were male?
1. 45.2%
2. 48.8%
3. 26.8%
4. 27.7%

3-Way Tables
Example: Georgia death-sentence data

                     Race of Defendant: Black    Race of Defendant: White
                     Race of Victim              Race of Victim
Death Sentence       Black       White           Black       White          Totals
Yes                  18          50              2           58             128
No                   1420        178             62          687            2347
Totals               1438        228             64          745            2475
% Death Sentence     1.2         21.9            3.1         7.8

UC Berkeley Lawsuit

                     MEN      WOMEN
No. of applicants    2691     1835
Admitted             1199     557
% admitted           44.6     30.4

UC Berkeley Lawsuit (cont.)

         MEN                                WOMEN
MAJOR    No. of Applicants   No. Admitted   No. of Applicants   No. Admitted
A        825                 512 (62%)      108                 * 89 (82%)
B        560                 353 (63%)      25                  * 17 (68%)
C        325                 120 (37%)      593                 202 (34%)
D        417                 138 (33%)      375                 * 131 (35%)
E        191                 53 (28%)       393                 94 (24%)
F        373                 23 (6%)        341                 * 24 (7%)
TOTAL    2691                1199           1835                557

* Majors in which the percentage of women admitted exceeds the percentage of men admitted.

Simpson’s Paradox
The reversal of the direction of a comparison or association when data from several groups are combined to form a single group.

Fly Alaska Airlines, the on-time airline!

               Alaska Airlines              America West
               % Arrivals    No. of         % Arrivals    No. of
Destination    On Time       Arrivals       On Time       Arrivals
L.A.           88.9%         559            85.6%         811
Phoenix        94.8%         233            92.1%         5,255
San Diego      91.4%         232            85.5%         448
San Fran.      83.1%         605            71.3%         449
Seattle        85.8%         2,146          76.7%         262
Total                        3,775                        7,225

America West Wins! You’re a Hero!
Same table, with the overall on-time percentages filled in on the Total row:

Total          86.7%         3,775          89.1%         7,225

Section 2.2
Displaying Quantitative Data
Histograms
Stem and Leaf Displays

[Figure: Relative Frequency Histogram of Exam Grades. Horizontal axis: Grade, 40 to 100; vertical axis: relative frequency, 0 to .30]

Frequency Histograms
[Figure: Baker City Hospital, Length of Stay Distribution. Classes 0<2, 2<4, 4<6, 6<8, 8<10, 10<12, 12<14, 14<16, 16<18; vertical axis: frequency, 0 to 70]

Frequency Histograms
A histogram shows three general types of information:
* It provides a visual indication of where the approximate center of the data is.
* We can gain an understanding of the degree of spread, or variation, in the data.
* We can observe the shape of the distribution.
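Before going further into histograms, the Simpson's paradox airline table from Section 2.1 can be checked numerically. The sketch below assumes each airline's overall on-time percentage is the arrivals-weighted average of its per-destination percentages; the names `rows` and `overall_pct` are made up for this example.

```python
# (destination, Alaska % on time, Alaska arrivals,
#  second airline % on time, second airline arrivals), from the airline table.
rows = [
    ("L.A.", 88.9, 559, 85.6, 811),
    ("Phoenix", 94.8, 233, 92.1, 5255),
    ("San Diego", 91.4, 232, 85.5, 448),
    ("San Fran.", 83.1, 605, 71.3, 449),
    ("Seattle", 85.8, 2146, 76.7, 262),
]


def overall_pct(pairs):
    """Arrivals-weighted average of (percent on time, number of arrivals) pairs."""
    total_arrivals = sum(n for _, n in pairs)
    return sum(p * n for p, n in pairs) / total_arrivals


alaska = overall_pct([(r[1], r[2]) for r in rows])
west = overall_pct([(r[3], r[4]) for r in rows])

# Alaska has the higher on-time percentage at every single destination...
alaska_wins_everywhere = all(r[1] > r[3] for r in rows)

print(f"Alaska overall: {alaska:.1f}%")
print(f"Other airline overall: {west:.1f}%")
print("Alaska better at each destination:", alaska_wins_everywhere)
# ...yet the combined figure favors the other airline, because most of its
# arrivals are in Phoenix, where both airlines do best.
```

The reversal comes entirely from the weights: the second airline's total is dominated by its 5,255 Phoenix arrivals, while Alaska's is dominated by Seattle.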
[Figure: All 200 m Races 20.2 secs or less (approx. 700). Horizontal axis: TIMES, 19.2 to 20.19 in steps of 0.03; vertical axis: Frequency, 0 to 60. Labeled points: Usain Bolt 2008, 19.30; Michael Johnson 1996, 19.32]

Histograms Showing Different Centers
[Figure: two frequency histograms. Classes 0<2 to 16<18; vertical axis 0 to 70]

Histograms - Same Center, Different Spread
[Figure: two frequency histograms. Classes 0<2 to 16<18; vertical axis 0 to 70]

Excel Example: 2012-13 NFL Salaries
[Figure: Histogram. Horizontal axis: Bin, 369480 to 17547935.38, then "More"; vertical axis: Frequency, 0 to 1000]

Statcrunch Example: 2012-13 NFL Salaries

Frequency and Relative Frequency Histograms
* identify smallest and largest values in data set
* divide interval between largest and smallest values into between 5 and 20 subintervals called classes
  * each data value in one and only one class
  * no data value is on a boundary

How Many Classes?
Can choose from two formulas, where n is the sample size:
$(2n)^{1/3}$
Sturges' Rule: $1 + \log(n)/\log(2)$

Histogram Construction (cont.)
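The two class-count formulas from "How Many Classes?" can be sketched in a few lines. The function names are invented for this example, and n = 100 is an arbitrary illustrative sample size rather than a value from the slides.

```python
import math


def cube_root_rule(n):
    """Approximate number of classes as the cube root of 2n."""
    return (2 * n) ** (1 / 3)


def sturges_rule(n):
    """Approximate number of classes as 1 + log(n)/log(2)."""
    return 1 + math.log(n) / math.log(2)


n = 100  # hypothetical sample size, for illustration only
print(f"cube-root rule: {cube_root_rule(n):.2f} classes")
print(f"Sturges' rule:  {sturges_rule(n):.2f} classes")
```

Neither formula returns a whole number, so the results are best read as a rough range for the number of classes rather than a single required value.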
* compute frequency or relative frequency of observations in each class
* x-axis: class boundaries; y-axis: frequency or relative frequency scale
* over each class draw a rectangle with height corresponding to the frequency or relative frequency in that class

Example. Number of daily employee absences from work
106 obs; approx. no. of classes:
$\{2(106)\}^{1/3} = \{212\}^{1/3} \approx 5.96$
$1 + \log(106)/\log(2) = 1 + 6.73 = 7.73$
There is no single “correct” answer for the number of classes. For example, you can choose 6, 7, 8, or 9 classes; don’t choose 15 classes.

[Figure: EXCEL histogram, Histogram of Employee Absences. Horizontal axis: Absences from Work; vertical axis: Frequency, 0 to 45]

Absences from Work (cont.)
6 classes
class width: (158-121)/6 = 37/6 = 6.17, rounded up to 7
6 classes, each of width 7; classes span 6(7) = 42 units
data spans 158-121 = 37 units
classes overlap the span of the actual data values by 42-37 = 5
lower boundary of 1st class: (1/2)(5) units below 121 = 121-2.5 = 118.5

[Figure: EXCEL histogram, Histogram of Employee Absences. Horizontal axis: Absences from Work, class boundaries 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5; vertical axis: Frequency, 0 to 70]

Grades on a statistics exam
Data:
75 66 77 66 64 73 91 65 59 86
61 86 61 58 70 77 80 58 94 78
62 79 83 54 52 45 82 48 67 55

Frequency Distribution of Grades

Class Limits      Frequency
40 up to 50       2
50 up to 60       6
60 up to 70       8
70 up to 80       7
80 up to 90       5
90 up to 100      2
Total             30

Relative Frequency Distribution of Grades

Class Limits      Relative Frequency
40 up to 50       2/30 = .067
50 up to 60       6/30 = .200
60 up to 70       8/30 = .267
70 up to 80       7/30 = .233
80 up to 90       5/30 = .167
90 up to 100      2/30 = .067

[Figure: Relative Frequency Histogram of Grades. Horizontal axis: Grade, 40 to 100; vertical axis: relative frequency, 0 to .30]

Based on the histogram, about what percent of the values are between 47.5 and 52.5?
1. 50%
2. 5%
3. 17%
4. 30%

Stem and leaf displays
Have the following general appearance:

stem | leaf
1 | 8 9
2 | 1 2 8 9 9
3 | 2 3 8 9
4 | 0 1
5 | 6 7
6 | 4

Stem and Leaf Displays
Partition each number in the data into a “stem” and a “leaf”.
Constructing a stem and leaf display:
1) determine the stem and leaf partition (5-20 stems)
2) write stems in a column with the smallest stem at top; include all stems in the range of the data
3) only 1 digit in leaves; drop digits or round off
4) record the leaf for each number in the corresponding stem row; ordering the leaves in each row helps

Example: employee ages at a small company
18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39
stem: 10’s digit; leaf: 1’s digit
18: stem=1; leaf=8; 18 = 1 | 8

stem | leaf
1 | 8 9
2 | 1 2 8 9 9
3 | 2 3 8 9
4 | 0 1
5 | 6 7
6 | 4

Suppose a 95 yr. old is hired

stem | leaf
1 | 8 9
2 | 1 2 8 9 9
3 | 2 3 8 9
4 | 0 1
5 | 6 7
6 | 4
7 |
8 |
9 | 5

Number of TD passes by NFL teams: 2012-2013 season (stems are 10’s digit)

stem | leaf
4 | 03
3 | 247
2 | 6677789
2 | 01222233444
1 | 13467889
0 | 8

Pulse Rates, n = 138

  #   Stem   Leaves
      4*
  3   4.     588
  9   5*     001233444
 10   5.     5556788899
 23   6*     00011111122233333344444
 23   6.     55556666667777788888888
 16   7*     00000112222334444
 23   7.     55555666666777888888999
 10   8*     0000112224
 10   8.     5555667789
  4   9*     0012
  2   9.     58
  4   10*    0223
      10.
  1   11*    1

Advantages/Disadvantages of Stem-and-Leaf Displays
Advantages
1) each measurement displayed
2) ascending order in each stem row
3) relatively simple (data set not too large)
Disadvantages
display becomes unwieldy for large data sets

[Stem-and-leaf display: Population of 185 US cities with between 100,000 and 500,000. Multiply stems by 100,000]

Back-to-back stem-and-leaf displays
TD passes by NFL teams: 1999-2000, 2012-13 (multiply stems by 10)

  1999-2000 | stem | 2012-13
          2 |  4   | 03
          6 |  3   | 7
          2 |  3   | 24
       6655 |  2   | 6677789
43322221100 |  2   | 01222233444
 9998887666 |  1   | 67889
        421 |  1   | 134
            |  0   | 8

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77? Stems are 10’s digits.
1. 4
2. 6
3. 8
4. 10
5. 12

Interpreting Graphical Displays: Shape
Symmetric distribution
A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.
Skewed distribution
A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.

Complex, multimodal distribution
Not all distributions have a simple overall shape, especially when there are few observations.

Shape (cont.): Female heart attack patients in New York state
Age: left-skewed
Cost: right-skewed

Shape (cont.): Outliers
An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.
The overall pattern is fairly symmetrical except for 2 states clearly not belonging to the main trend. Alaska and Florida have unusual representation of the elderly in their population.
A large gap in the distribution is typically a sign of an outlier.
[Figure: histogram with Alaska and Florida labeled]

Center: typical value of frozen personal pizza? ~$2.65

Spread: fuel efficiency, 4 and 8 cylinders
4 cylinders: more spread
8 cylinders: less spread

Other Graphical Methods for Economic Data
Time plots: plot observations in time order, with time on the horizontal axis and the variable on the vertical axis.
** Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

[Time plots: Unemployment Rate, by Educational Attainment; Water Use During Super Bowl; Winning Times 100 M Dash; Annual Mean Temperature]

End of Section 2.2