Name: Ms. D’Amato   Date:   Block:

Chapter 5: Describing Distributions Numerically

Finding the Center: The Median

When we think of a typical value, we usually look for the ________ of the distribution.
For a unimodal, symmetric distribution, it’s easy to find the center: ________ See left graph below.
As a measure of center, the ________ (the average of the minimum and maximum values) is very sensitive to skewed distributions and outliers. See right graph above.
The ________ is a more reasonable choice for center than the midrange.
   o It is the value with exactly half the values below it and half above it.
   o It is the data value (once the data values have been ordered) that divides the histogram into two ________ areas.
   o It has the same units as the data.

Finding the median by hand:

When n is odd…
…the median is the middle value. Counting from the ends, we find this value in the $\frac{n+1}{2}$ position.
14.1, 3.2, 25.3, 2.8, -17.5, 13.9, 45.8

When n is even…
…there are two middle values. The median is the average of the two values in positions $\frac{n}{2}$ and $\frac{n}{2} + 1$.
14.1, 3.2, 25.3, 2.8, -17.5, 13.9, 45.8, 35.7

Spread: Home on the Range

Always report a measure of ________ along with a measure of center when describing a distribution numerically.
The ________ of the data is the difference between the maximum and minimum values:
Range = max – min
What is the range of the following data values? 34, 25, 16, 73, 41
A disadvantage of the range (like the midrange) is that a single ________ can make it very large and, thus, not representative of the data overall.

Spread: The Interquartile Range

The ________ lets us ignore extreme data values and concentrate on the middle of the data.
To find the IQR, we first need to know what quartiles are…
   o Quartiles divide the data into four equal sections.
     The ________ quartile is the median of the half of the data below the median.
     The ________ quartile is the median of the half of the data above the median.
   o The difference between the quartiles is the IQR, so
     IQR = upper quartile – lower quartile
The lower and upper quartiles are the ________ and ________ percentiles of the data, so…
The IQR contains the middle ________ of the values of the distribution.

Finding quartiles by hand:

When n is odd…
14.1, 3.2, 25.3, 2.8, -17.5, 13.9, 45.8
Lower Quartile = ________
Upper Quartile = ________
IQR = ________

When n is even…
14.1, 3.2, 25.3, 2.8, -17.5, 13.9, 45.8, 35.7
Lower Quartile = ________
Upper Quartile = ________
IQR = ________

5-Number Summary

The 5-number summary of a distribution reports its ________, ________, and ________ (maximum and minimum).
   o Example: Let’s look at the five-number summary for the number of flat tires on Route 7.

     Max      47 flat tires
     Q3       22
     Median   19
     Q1       17
     Min      13

   o A ________ is a graphical display of the 5-number summary of a (quantitative) variable.
   o Boxplots are useful when comparing groups.

Making Boxplots

1. Draw a single vertical axis spanning the range of the data. Draw short horizontal lines at the lower and upper quartiles and at the median. Then connect them with vertical lines to form a box.
2. Erect “fences” around the main part of the data. The upper fence is 1.5 IQRs above the upper quartile. The lower fence is 1.5 IQRs below the lower quartile. Note: the fences only help with constructing the boxplot and should not appear in the final display.
   Upper fence = Q3 + 1.5 IQRs = 22 + 1.5 × 5 = 29.5 flat tires
   Lower fence = Q1 – 1.5 IQRs = 17 – 1.5 × 5 = 9.5 flat tires
3. Use the fences to grow “whiskers.” Draw lines from the ends of the box up and down to the most extreme data values found within the fences. If a data value falls outside one of the fences, we do not connect it with a whisker.
4. Add the outliers by displaying any data values beyond the fences with special symbols. We often use a different symbol for “far outliers” that are farther than 3 IQRs from the quartiles.
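The fence arithmetic in the boxplot steps above can be sketched in a few lines of Python. This is only an illustration: the quartiles, minimum, and maximum are the Route 7 flat-tire values from the notes, while the function names `fences` and `classify` are assumptions made up for this sketch.

```python
# Sketch of the boxplot fence calculation for the Route 7 flat-tire example.
# The numbers come from the 5-number summary in the notes; the function names
# are illustrative assumptions, not part of any textbook or library.

def fences(q1, q3, multiplier=1.5):
    """Return (lower_fence, upper_fence), placed `multiplier` IQRs beyond the quartiles."""
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr


def classify(value, q1, q3):
    """Label a data value relative to the 1.5-IQR fences and the 3-IQR cutoffs."""
    lower, upper = fences(q1, q3)
    far_lower, far_upper = fences(q1, q3, multiplier=3.0)
    if value < far_lower or value > far_upper:
        return "far outlier"
    if value < lower or value > upper:
        return "outlier"
    return "inside fences"


q1, q3 = 17, 22  # quartiles from the flat-tire summary
lower_fence, upper_fence = fences(q1, q3)
print(lower_fence, upper_fence)  # 9.5 29.5

# Check the two extremes reported in the 5-number summary.
for label, value in (("min", 13), ("max", 47)):
    print(label, value, classify(value, q1, q3))
```

Because the maximum (47) lies beyond the upper fence, the upper whisker would stop at the largest data value that is still at or below 29.5, and 47 would be drawn as a separate symbol.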
The center of a boxplot is a box that shows the middle half of the data, between the quartiles.
If the median is roughly centered between the quartiles, then the middle half of the data is roughly symmetric. If it is not centered, the distribution is skewed.
The whiskers show skewness as well if they are not the same length.
By turning a boxplot and putting it on the same scale, we can compare the boxplot and histogram of flat tires and see how each represents the distribution.

Comparing Groups with Boxplots

When boxplots are placed side by side, we can see which group has a higher median, which has a greater IQR, where the central 50% of the data are located, and which has the greater overall range.
We can get an idea of the symmetry from whether the medians are centered within their boxes and whether the whiskers extend roughly the same distance on either side of the boxes.
The following set of boxplots compares the effectiveness of various coffee containers:
What does this graphical display tell you?

Summarizing Symmetric Distributions

Medians do a good job of identifying the center of skewed distributions.
When we have symmetric data, the ________ is a good measure of center.
We find the mean by adding up all of the data values and dividing by n, the number of data values we have.
The distribution of pulse rates for 52 adults is generally symmetric, with a mean of 72.7 beats per minute (bpm) and a median of 73 bpm:

The Formula for Averaging (Say It in Greek)

The formula for the mean is given by:
$$\bar{y} = \frac{\text{Total}}{n} = \frac{\sum y}{n}$$
The formula says that to find the mean, we add up the numbers and divide by n.

Mean or Median?

Regardless of the shape of the distribution, the ________ is the point at which a histogram of the data would balance:
In symmetric distributions, the mean and median are approximately the ________ in value, so either measure of center may be used.
For skewed data, though, it’s better to report the ________ than the mean as a measure of center.
The median splits a histogram so that the ________ of the bars on either side of the median are equal.
The mean ________ the histogram, taking into account both the size of the bars and their distance from the center, but as a result, it may not have equal numbers of data values on either side.
   o When the data are symmetric, the mean = median.
   o When the data are skewed to the left (tail longer on left), the mean is to the left of the median (mean < median).
   o When the data are skewed to the right (tail longer on right), the mean is to the right of the median (mean > median).

What About Spread? The Standard Deviation

A more powerful measure of spread than the IQR is the ________, which takes into account how far ________ data value is from the mean.
A ________ is the distance that a data value is from the mean.
   o Since adding all deviations together would total zero, we square each deviation and find an average of sorts for the deviations.
The ________, notated by ________, is found by summing the squared deviations and (almost) averaging them:
$$s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$$
The variance will play a role later in our study, but it is problematic as a measure of spread: it is measured in ________ units!
The ________, ________, is just the square root of the variance and is measured in the same units as the original data.
$$s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$$

Finding the standard deviation by hand:

Steps:
1. Find the mean, $\bar{y}$. Round to the nearest tenth if necessary.
2. Next, find the deviations by subtracting $\bar{y}$ from each value: $(y - \bar{y})$
3. Square each deviation: $(y - \bar{y})^2$
4. Add these numbers up and divide by n – 1. This gives you the variance.
5. Take the square root to find the standard deviation.

Let’s look at the batch of values 4, 3, 10, 12, 8, 9, and 3.
1. Find the mean.
2-3. Find the deviations and squared deviations.

     Observations y     Deviations $(y - \bar{y})$     Squared Deviations $(y - \bar{y})^2$

4. Add the squared deviations up and divide by n – 1.
5. Take the square root of your answer for #4. This is your standard deviation.

Which y value(s) lie within one standard deviation of the mean?

Thinking About Variation

Since Statistics is about variation, ________ is an important fundamental concept of Statistics.
Measures of spread help us talk about what we don’t know.
When the data values are tightly clustered around the center of the distribution, the IQR and standard deviation will be ________.
When the data values are scattered far from the center, the IQR and standard deviation will be ________.

Shape, Center, and Spread

When telling about a quantitative variable, always report the shape of its distribution, along with a center and a spread.
   o If the shape is ________, report the ________ and ________.
   o If the shape is ________, report the ________ and ________ and possibly the median and IQR as well.

What About Outliers?

If there are any clear outliers and you are reporting the mean and standard deviation, report them with the outliers present and with the outliers removed. The differences may be quite revealing.
Note: The median and IQR are not likely to be affected by the outliers.

What Can Go Wrong?

Don’t forget to do a reality check: don’t let technology do your thinking for you.
Don’t forget to sort the values before finding the median or percentiles.
Don’t compute numerical summaries of a categorical variable.
Watch out for multiple modes: multiple modes might indicate multiple groups in your data.
Be aware of slightly different methods: different statistics packages and calculators may give you different answers for the same data.
Beware of outliers.
Make a picture (make a picture, make a picture).
Be careful when comparing groups that have very different spreads.
   o Consider these side-by-side boxplots of cotinine levels:

What Have We Learned?

We can now summarize distributions of quantitative variables numerically.
   o The 5-number summary displays the min, Q1, median, Q3, and max.
   o Measures of center include the mean and median.
   o Measures of spread include the range, IQR, and standard deviation.
We know which measures to use for symmetric distributions and skewed distributions.
We can also display distributions with boxplots.
   o While histograms better show the shape of the distribution, boxplots reveal the center, middle 50%, and any outliers in the distribution.
   o Boxplots are useful for comparing groups.
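
As a recap of the by-hand procedures in these notes, the following Python sketch walks through the position rule for the median and the five steps for the standard deviation, using the batch 4, 3, 10, 12, 8, 9, 3 from the worked example. The function names are assumptions made up for illustration, and Python's standard `statistics` module is used only as a cross-check on the hand method.

```python
# Sketch of the "by hand" median and standard deviation procedures.
# Function names are illustrative assumptions; the data are the batch
# of values from the notes.
import math
import statistics


def median_by_position(values):
    """Median via the position rule: (n+1)/2 when n is odd, average of n/2 and n/2+1 when even."""
    ordered = sorted(values)          # always sort before counting positions
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]   # subtract 1: Python lists start at index 0
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


def standard_deviation_by_hand(values):
    """Follow steps 1-5: mean, deviations, squared deviations, variance, square root."""
    n = len(values)
    mean = sum(values) / n                        # step 1
    deviations = [y - mean for y in values]       # step 2
    squared = [d ** 2 for d in deviations]        # step 3
    variance = sum(squared) / (n - 1)             # step 4: divide by n - 1, not n
    return mean, variance, math.sqrt(variance)    # step 5


batch = [4, 3, 10, 12, 8, 9, 3]
mean, variance, sd = standard_deviation_by_hand(batch)
print("median:", median_by_position(batch))
print("mean:", mean, "variance:", variance, "standard deviation:", sd)

# Values within one standard deviation of the mean.
print("within one SD:", [y for y in batch if abs(y - mean) <= sd])

# Cross-check against the standard library (sample standard deviation, n - 1 divisor).
print("library check:", statistics.median(batch), statistics.stdev(batch))
```

The deviations in step 2 always sum to zero (up to rounding), which is why the procedure squares them before averaging.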