2.4 Describing Distributions Numerically
Numerical and More Graphical Methods to Describe Univariate Data

2 characteristics of a data set to measure
  center: measures where the “middle” of the data is located
  variability: measures how “spread out” the data is

The median: a measure of center
Given a set of n measurements arranged in order of magnitude,
  Median = middle value, if n is odd
         = mean of the 2 middle values, if n is even
Ex. 2, 4, 6, 8, 10; n=5; median=6
Ex. 2, 4, 6, 8; n=4; median=(4+6)/2=5

Student Pulse Rates (n=62)
  38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70, 70, 70, 70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 75, 76, 77, 77, 77, 77, 78, 78, 79, 79, 80, 80, 80, 84, 84, 85, 85, 87, 90, 90, 91, 92, 93, 94, 94, 95, 96, 96, 96, 98, 98, 103
  Median = (75+76)/2 = 75.5 (mean of the 31st and 32nd ordered values)

Medians are used often
  Year 2016 baseball salaries: median $1,956,250 (max=$32,000,000 Clayton Kershaw; min=$507,000)
  Median fan age: MLB 45; NFL 43; NBA 41; NHL 39
  Median existing home sales price: May 2011 $166,500; May 2010 $174,600
  Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

The median splits the histogram into 2 halves of equal area

[Figure: Median Salaries by Major]

Examples
  Example: n = 7
    17.5 2.8 3.2 13.9 14.1 25.3 45.8
  Example: n = 7 (ordered); m = 14.1
    2.8 3.2 13.9 14.1 17.5 25.3 45.8
  Example: n = 8
    17.5 2.8 3.2 13.9 14.1 25.3 35.7 45.8
  Example: n = 8 (ordered); m = (14.1+17.5)/2 = 15.8
    2.8 3.2 13.9 14.1 17.5 25.3 35.7 45.8

Below are the annual tuition charges at 7 public universities. What is the median tuition?
  4429 4960 4960 4971 5245 5546 7586
  1. 5245
  2. 4965.5
  3. 4960
  4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?
  4429 4960 5245 5546 4971 5587 7586
  1. 5245
  2. 4965.5
  3. 5546
  4. 4971
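The median rule above (middle value for odd n, mean of the 2 middle values for even n) can be sketched in a few lines of Python. This is only an illustration of the rule on the slides; the function name `median` is our own choice, and the data are the small examples already shown.

```python
def median(values):
    """Median by the rule above: sort, then take the middle value (n odd)
    or the mean of the two middle values (n even)."""
    ordered = sorted(values)          # arrange in order of magnitude
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:                    # n odd: single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # n even: mean of 2 middle values


print(median([2, 4, 6, 8, 10]))                               # 6
print(median([2, 4, 6, 8]))                                   # 5.0
print(median([17.5, 2.8, 3.2, 13.9, 14.1, 25.3, 45.8]))       # 14.1
print(median([17.5, 2.8, 3.2, 13.9, 14.1, 25.3, 35.7, 45.8])) # about 15.8
```

Note that the function sorts first, so it can be applied to data that are not listed in order.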
Measures of Spread
The range and interquartile range

Ways to measure variability
  range = largest - smallest
  OK sometimes; in general, too crude; sensitive to one large or small data value
  The range measures spread by examining the ends of the data
  A better way to measure spread is to examine the middle portion of the data

Quartiles: Measuring spread by examining the middle
  The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).
  The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).

[Figure: 25 ordered observations with the quartiles and median marked]
  0.6 1.2 1.5 1.6 1.9 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1
  Q1 = first quartile = 2.3
  m = median = 3.4
  Q3 = third quartile = 4.2

Quartiles and median divide data into 4 pieces
  1/4 | Q1 | 1/4 | M | 1/4 | Q3 | 1/4

Quartiles are common measures of spread
  http://oirp.ncsu.edu/ir/admit
  http://oirp.ncsu.edu/univ/peer
  University of Southern California
  Economic Value of College Majors: mid-career earnings by major: 25th, 50th, 75th percentiles.

Rules for Calculating Quartiles
  Step 1: find the median of all the data (the median divides the data in half)
  Step 2a: find the median of the lower half; this median is Q1
  Step 2b: find the median of the upper half; this median is Q3
  Important: when n is odd include the overall median in both halves; when n is even do not include the overall median in either half.

Example
  2 4 6 8 10 12 14 16 18 20; n = 10
  Median m = (10+12)/2 = 22/2 = 11
  Q1: median of lower half 2 4 6 8 10; Q1 = 6
  Q3: median of upper half 12 14 16 18 20; Q3 = 16

Pulse Rates, n = 138
(stem-and-leaf display; a stem marked * holds leaves 0-4, a stem marked . holds leaves 5-9)
  #    Stem  Leaves
       4*
  3    4.    588
  9    5*    001233444
  10   5.    5556788899
  23   6*    00011111122233333344444
  23   6.    55556666667777788888888
  16   7*    00000112222334444
  23   7.    55555666666777888888999
  10   8*    0000112224
  10   8.    5555667789
  4    9*    0012
  2    9.    58
  4    10*   0223
       10.
  1    11*   1
  Median: mean of pulses in locations 69 & 70: median = (70+70)/2 = 70
  Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63
  Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?
  #     stem  leaf
  2     22    55
  4     23    57
  6     24    26
  7     25    7
  10    26    257
  12    27    59
  (4)   28    1567
  15    29    35599
  10    30    333
  7     31    45
  5     32    155
  2     33    6
  1     34    0
  1. 287
  2. 257.5
  3. 263.5
  4. 262.5

Interquartile range
  lower quartile: Q1
  middle quartile: median
  upper quartile: Q3
  interquartile range (IQR): IQR = Q3 - Q1
  measures spread of middle 50% of the data
  Example: beginning pulse rates; Q3 = 78; Q1 = 63; IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team (same stem-and-leaf display as in the previous question). The first quartile Q1 is 263.5. What is the value of the IQR?
  1. 23.5
  2. 39.5
  3. 46
  4. 69.5

5-number summary of data
  Minimum  Q1  median  Q3  maximum
  Pulse data: 45  63  70  78  111

Boxplot: display of 5-number summary
[Figure: boxplot of years until death for 25 individuals with Disease X; vertical axis 0 to 7]
  Data (largest first): 6.1 5.6 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.9 1.6 1.5 1.2 0.6
  Largest = max = 6.1
  Q3 = third quartile = 4.2
  m = median = 3.4
  Q1 = first quartile = 2.3
  Smallest = min = 0.6
  Five-number summary: min Q1 m Q3 max

Boxplot: display of 5-number summary
Example: age of 66 “crush” victims at rock concerts 1999-2000.
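The quartile rule from these slides (include the overall median in both halves when n is odd, in neither half when n is even), the IQR, and the 5-number summary can be sketched in Python as follows. This is a minimal illustration; the function names are our own. Statistical software often uses other quartile conventions, so its results can differ slightly from this rule.

```python
def _median_sorted(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the rule on the slides."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least 2 observations")
    half = n // 2
    if n % 2 == 1:
        lower = ordered[:half + 1]   # n odd: overall median is in both halves
        upper = ordered[half:]
    else:
        lower = ordered[:half]       # n even: overall median is in neither half
        upper = ordered[half:]
    return _median_sorted(lower), _median_sorted(ordered), _median_sorted(upper)


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    q1, m, q3 = quartiles(values)
    return min(values), q1, m, q3, max(values)


even_example = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
print(quartiles(even_example))            # (6, 11.0, 16)

disease_x = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
             3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]
print(five_number_summary(disease_x))     # (0.6, 2.3, 3.4, 4.2, 6.1)

q1, _, q3 = quartiles(disease_x)
print("IQR =", round(q3 - q1, 2))         # IQR = Q3 - Q1
```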
5-number summary (ages of the rock concert victims): 13 17 19 22 47

Boxplot construction
  1) construct box with ends located at Q1 and Q3; in the box mark the location of the median (usually with a line or a “+”)
  2) fences are determined by moving a distance 1.5(IQR) from each end of the box;
    2a) upper fence is 1.5*IQR above the upper quartile
    2b) lower fence is 1.5*IQR below the lower quartile
  Note: the fences only help with constructing the boxplot; they do not appear in the final boxplot display

Box plot construction (cont.)
  3) whiskers: draw lines from the ends of the box left and right to the most extreme data values found within the fences;
  4) outliers: special symbols represent each data value beyond the fences;
    4a) sometimes a different symbol is used for “far outliers” that are more than 3 IQRs from the quartiles

Boxplot: display of 5-number summary
[Figure: boxplot of years until death for 25 individuals with Disease X; vertical axis 0 to 8]
  Data (largest first): 7.9 6.1 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.9 1.6 1.5 1.2 0.6
  Largest = max = 7.9
  Q3 = third quartile = 4.2
  Q1 = first quartile = 2.3
  Interquartile range: Q3 - Q1 = 4.2 - 2.3 = 1.9
  1.5 * IQR = 1.5 * 1.9 = 2.85
  Upper fence: Q3 + 1.5*IQR = 4.2 + 2.85 = 7.05
  Individual #25 has a value of 7.9 years; since 7.9 > 7.05, 7.9 is an outlier.
  The line from the top end of the box is drawn to the biggest number in the data that is less than 7.05, which is 6.1.

[Figure: ATM Withdrawals by Day, Month, Holidays]

Beg. of class pulses (n=138)
  Q1 = 63, Q3 = 78
  IQR = 78 - 63 = 15
  1.5(IQR) = 1.5(15) = 22.5
  Lower fence: Q1 - 1.5(IQR) = 63 - 22.5 = 40.5
  Upper fence: Q3 + 1.5(IQR) = 78 + 22.5 = 100.5
[Figure: boxplot of the pulses; fences at 40.5 and 100.5, box from 63 to 78, median 70, minimum 45]

Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?
[Figure: boxplot titled “Pass Catching Yards by Receivers”; axis from 0 to 1369]
Answer choices:
  1. 450
  2. 750
  3. 215
  4. 545

[Figure: Rock concert deaths: histogram and boxplot]

Automating Boxplot Construction
  Excel “out of the box” does not draw boxplots.
  Many add-ins are available on the internet that give Excel the capability to draw box plots.
  Statcrunch (http://statcrunch.stat.ncsu.edu) draws box plots.

Statcrunch Boxplot
[Figure: Statcrunch boxplot of the 25 Disease X observations]
  Data (largest first): 7.9 6.1 5.3 4.9 4.7 4.5 4.2 4.1 3.9 3.8 3.7 3.6 3.4 3.3 2.9 2.8 2.5 2.3 2.3 2.1 1.9 1.6 1.5 1.2 0.6
  Largest = max = 7.9
  Q3 = third quartile = 4.2
  Q1 = first quartile = 2.3

[Figure: Tuition 4-yr Colleges]

Macro: Stock, bond returns - 30 yrs
              Bonds              Stocks
  Smallest    -21.98             -26.61
  Q1          0.2075             -0.555
  Median      2.935              10.435
  Q3          10.725             25.1275
  Largest     42.98              44.38
  IQR         10.5175            25.6825
  Outliers    (42.98, -21.98)    ()
[Figure: BoxPlot panels for Bonds and Stocks]
  Stocks data column (partial): 44.38 34.84 32.54 29.93 29.22 27.31 25.30 25.07 22.36
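The boxplot construction steps given earlier (box at Q1 and Q3, fences 1.5 IQR beyond the box, whiskers to the most extreme values inside the fences, outliers beyond the fences) can also be automated without a plotting package. The sketch below is only an illustration: the function names and the returned dictionary keys are our own, and the quartiles follow the rule from these slides rather than any particular software's convention. It computes the numbers a boxplot displays; it does not draw the plot.

```python
def _median_sorted(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """(Q1, median, Q3): overall median is in both halves if n is odd, neither if even."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half + 1] if n % 2 == 1 else ordered[:half]
    upper = ordered[half:]
    return _median_sorted(lower), _median_sorted(ordered), _median_sorted(upper)


def fences(q1, q3, k=1.5):
    """Lower and upper fences, a distance k*IQR beyond each end of the box."""
    step = k * (q3 - q1)
    return q1 - step, q3 + step


def boxplot_stats(values):
    """Everything needed to draw the boxplot by hand."""
    ordered = sorted(values)
    q1, m, q3 = quartiles(ordered)
    low_fence, high_fence = fences(q1, q3)
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    return {
        "q1": q1, "median": m, "q3": q3,
        "lower_fence": low_fence, "upper_fence": high_fence,
        "whisker_low": inside[0],     # most extreme values within the fences
        "whisker_high": inside[-1],
        "outliers": [x for x in ordered if x < low_fence or x > high_fence],
    }


# Disease X data from the slides (the version whose largest value is 7.9)
disease_x = [7.9, 6.1, 5.3, 4.9, 4.7, 4.5, 4.2, 4.1, 3.9, 3.8, 3.7, 3.6, 3.4,
             3.3, 2.9, 2.8, 2.5, 2.3, 2.3, 2.1, 1.9, 1.6, 1.5, 1.2, 0.6]
stats = boxplot_stats(disease_x)
print(stats["outliers"])       # [7.9]
print(stats["whisker_high"])   # 6.1

# Beginning-of-class pulses: fences from Q1 = 63 and Q3 = 78
print(fences(63, 78))          # (40.5, 100.5)
```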