THE AVERAGE TIME SPENT ON HW1 WAS 166 MINUTES. THE MEDIAN TIME SPENT ON HW1 WAS 115 MINUTES. THE DISTRIBUTION OF TIME SPENT ON HW1 IS…
1. Symmetric
2. Uniform
3. Skewed to the left
4. Skewed to the right

UPCOMING IN CLASS
Homework #2 due Sunday (9/1) at 10:00pm
Quiz #1 in class 8/28 (open book)
Part 1 of the Data Project due (9/4)

WHAT ABOUT SPREAD? THE STANDARD DEVIATION (CONT.)
The standard deviation, s, is just the square root of the variance and is measured in the same units as the original data.
$$s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$$

CHAPTER 4
Understanding and Comparing Distributions

WHY DO WE NEED TO UNDERSTAND AND COMPARE DISTRIBUTIONS?
Understanding distributions gives us preliminary descriptive information about the data and helps us get a sense of which models to use for further explanation.

THE BIG PICTURE
We can answer much more interesting questions about variables when we compare distributions for different groups.
Below is a histogram of the Average Wind Speed for every day in 1989.

COMPARING GROUPS
It is always more interesting to compare groups.
With histograms, note the shapes, centers, and spreads of the two distributions.
What does this graphical display tell you?

DAILY WIND SPEED: MAKING BOXPLOTS
A boxplot is a graphical display of the five-number summary.
Boxplots are particularly useful when comparing groups.

THE FIVE-NUMBER SUMMARY
The five-number summary of a distribution reports its median, quartiles, and extremes (maximum and minimum).
Example: The five-number summary for the daily wind speed is:
Max     8.67
Q3      2.93
Median  1.90
Q1      1.15
Min     0.20

MEN VS WOMEN STARTING SALARIES
              Men      Women
Min           18,000   18,000
Q1            25,000   35,000
Q2 (Median)   45,000   42,000
Q3            65,000   45,000
Max           70,000   50,000

CONSTRUCTING BOXPLOTS
1. Draw a single vertical axis spanning the range of the data. Draw short horizontal lines at the lower and upper quartiles and at the median. Then connect them with vertical lines to form a box.
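
WORKED EXAMPLE: SPREAD FROM A FIVE-NUMBER SUMMARY
The five-number summaries on the preceding slides already contain enough information to compare spreads. The sketch below is a minimal illustration, not part of the original slides: it takes the wind speed and starting salary summaries as given and computes the range (max minus min) and the interquartile range (IQR = Q3 minus Q1). The helper name `spread` and the dictionary layout are assumptions made for this example.

```python
# Five-number summaries copied from the slides: (min, Q1, median, Q3, max).
summaries = {
    "wind speed": (0.20, 1.15, 1.90, 2.93, 8.67),
    "men's salary": (18_000, 25_000, 45_000, 65_000, 70_000),
    "women's salary": (18_000, 35_000, 42_000, 45_000, 50_000),
}


def spread(summary):
    """Return (range, IQR) for a five-number summary tuple.

    Hypothetical helper for illustration only.
    """
    minimum, q1, _median, q3, maximum = summary
    return maximum - minimum, q3 - q1


for name, summary in summaries.items():
    data_range, iqr = spread(summary)
    print(f"{name}: range = {data_range:g}, IQR = {iqr:g}")
```

Both salary groups start at the same minimum, but the men's IQR (40,000) is four times the women's (10,000), so the middle half of the men's salaries is far more spread out.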

CONSTRUCTING BOXPLOTS (CONT.)
2. Erect “fences” around the main part of the data.
The upper fence is 1.5 IQRs above the upper quartile: Q3 + 1.5*IQR
The lower fence is 1.5 IQRs below the lower quartile: Q1 - 1.5*IQR
Note: the fences only help with constructing the boxplot and should not appear in the final display.

CONSTRUCTING BOXPLOTS (CONT.)
3. Use the fences to grow “whiskers.”
Draw lines from the ends of the box up and down to the most extreme data values found within the fences.
If a data value falls outside one of the fences, we do not connect it with a whisker.

CONSTRUCTING BOXPLOTS (CONT.)
4. Add the outliers by displaying any data values beyond the fences with special symbols.
We often use a different symbol for “far outliers” that are farther than 3 IQRs from the quartiles.

WIND SPEED: MAKING BOXPLOTS (CONT.)
Compare the histogram and boxplot for daily wind speeds:
How does each display represent the distribution?

COMPARING GROUPS (CONT.)
Boxplots offer an ideal balance of information and simplicity, hiding the details while displaying the overall summary information.
We often plot them side by side for groups or categories we wish to compare.
What do these boxplots tell you?

A CLASS OF FOURTH GRADERS TAKES A DIAGNOSTIC READING TEST, AND THE SCORES ARE REPORTED BY READING GRADE LEVEL. THE 5-NUMBER SUMMARIES FOR THE BOYS AND GIRLS ARE SHOWN BELOW.
        Girls   Boys
Min     2.5     2.7
Q1      3.7     4.1
Q2      4.3     4.5
Q3      4.7     4.9
Max     5.8     5.9

WHICH GROUP HAD THE HIGHEST SCORE?
1. Girls
2. Boys

WHICH GROUP HAD THE GREATEST RANGE?
1. Girls
2. Boys
3. They are the same

WHICH GROUP HAD THE GREATEST IQR?
1. Girls
2. Boys
3. They are the same

WHICH GROUP’S SCORES APPEAR MORE SKEWED?
1. The boys’ scores are more skewed. The quartiles are the same distance from the median.
2. The girls’ scores are more skewed.
The quartiles are not the same distance from the median.
3. The boys’ scores are more skewed. The quartiles are not the same distance from the median.
4. The girls’ scores are more skewed. The quartiles are the same distance from the median.

WHICH GROUP GENERALLY DID BETTER ON THE TEST?
1. Girls did better because the mean for girls was higher.
2. Girls did better because the median for girls was higher.
3. Boys did better because the mean for boys was higher.
4. Boys did better because the median for boys was higher.

WHAT CAN GO WRONG? (CONT.)
Avoid inconsistent scales, either within the display or when comparing two displays.
Label clearly so a reader knows what the plot displays.
Good intentions, bad plot:

WHAT ABOUT OUTLIERS?
If there are any clear outliers and you are reporting the mean and standard deviation, report them both with the outliers present and with the outliers removed. The differences may be quite revealing.
Note: The median and IQR are not likely to be affected by the outliers.

*RE-EXPRESSING SKEWED DATA TO IMPROVE SYMMETRY

TRANSFORMING DATA
$y = \log_{10}(x)$. To get the original data back: $x = 10^{y}$
$y = \sqrt{x}$. To get the original data back: $x = y^{2} = y \cdot y$

*RE-EXPRESSING SKEWED DATA TO IMPROVE SYMMETRY (CONT.)
One way to make a skewed distribution more symmetric is to re-express or transform the data by applying a simple function (e.g., a logarithmic function).
Note the change in skewness from the raw data (previous slide) to the transformed data (right):

NEXT TIME
Chapter 5: How we use the standard deviation to make comparisons…
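
WORKED EXAMPLE: FENCES AND OUTLIERS
The boxplot construction steps (fences at 1.5 IQRs beyond the quartiles, far outliers beyond 3 IQRs) can be checked against the daily wind speed five-number summary given earlier. This is a minimal sketch under stated assumptions, not the method used to draw the original figures; the function names `fences` and `classify` are hypothetical.

```python
# Quartiles and extremes for daily wind speed, copied from the slides.
Q1, Q3 = 1.15, 2.93
MINIMUM, MAXIMUM = 0.20, 8.67


def fences(q1, q3, multiplier=1.5):
    """Return (lower, upper) cutoffs placed `multiplier` IQRs beyond the quartiles."""
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr


# Ordinary fences (1.5 IQRs) and the "far outlier" cutoffs (3 IQRs).
lower, upper = fences(Q1, Q3)
far_lower, far_upper = fences(Q1, Q3, multiplier=3)


def classify(value):
    """Label a value by where it falls relative to the two sets of cutoffs."""
    if value < far_lower or value > far_upper:
        return "far outlier"
    if value < lower or value > upper:
        return "outlier"
    return "within fences"


for label, value in (("min", MINIMUM), ("max", MAXIMUM)):
    print(f"{label} = {value}: {classify(value)}")
```

With these numbers the IQR is 1.78, so the upper fence is about 5.60 and the 3-IQR cutoff is about 8.27. The maximum of 8.67 lies beyond both, so it would be plotted with the far-outlier symbol and the upper whisker would stop at the largest value inside the fence.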
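
WORKED EXAMPLE: RE-EXPRESSING AND GETTING THE DATA BACK
The log and square-root re-expressions from the transforming-data slide can be sketched as follows. The data values here are hypothetical, invented only to show a right-skewed batch, and `standardized_gap` is an assumed helper that uses (mean minus median) divided by the standard deviation as a rough indicator of skew; it is not a measure defined in the slides.

```python
import math
import statistics

# Hypothetical right-skewed values (NOT from the slides), for illustration only.
raw = [1, 2, 2, 3, 4, 5, 8, 20, 100]

# Re-express the data with the two functions from the slide.
logged = [math.log10(x) for x in raw]
rooted = [math.sqrt(x) for x in raw]

# Undo each re-expression: x = 10^y for the log, x = y*y for the square root.
restored_from_log = [10 ** y for y in logged]
restored_from_root = [y * y for y in rooted]


def standardized_gap(values):
    """(mean - median) / standard deviation: a rough, assumed skew indicator."""
    gap = statistics.mean(values) - statistics.median(values)
    return gap / statistics.stdev(values)


print("raw gap:   ", round(standardized_gap(raw), 3))
print("sqrt gap:  ", round(standardized_gap(rooted), 3))
print("log10 gap: ", round(standardized_gap(logged), 3))
```

For this made-up batch the mean sits well above the median in the raw data, and the gap shrinks after taking logs, which is the kind of improvement in symmetry the slides describe. Up to floating-point rounding, the back-transformed values match the originals.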