CHAPTER 4
DESCRIPTIVE METHODS FOR A SINGLE NUMERICAL VARIABLE

So far in this course we have dealt with categorical variables, which we summarized with counts, percentages, bar graphs, and mosaic plots. In this chapter, we consider descriptive methods appropriate for summarizing a single numerical variable. These summaries are intended to describe the following characteristics of a numerical data set:

1. Location
2. Dispersion
3. Shape
4. Position of an Observation

MEASURES OF LOCATION

Consider the hair activity completed in class. The following hair lengths were obtained for a group of four men.

Person    Hair Length (mm)
1         64
2         40
3         48
4         12

For convenience, we will label the observations of a data set as $x_1$, $x_2$, $x_3$, and $x_4$. That is, $x_1$ is the value of the first measurement (i.e., 64), $x_2$ is the value of the second measurement (i.e., 40), and so on. Let $n$ represent the total number of data points.

We can use the following statistics to measure some aspect of the location of a data set (or distribution):

Mean: The arithmetic average of all of the values in a data set. Note that this quantity measures the center of a data set.

Sample Mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

Median: The middle term of a data set (after the numerical values have been ordered). If the data set contains an even number of observations, then the median is the average of the middle two observations. This quantity is also a measure of center.

Mode: The observation(s) that occur most frequently in a data set.

Using Excel to calculate these descriptive statistics:

Example 3.1: Consider the men's hair lengths obtained above. Put these values into Excel as shown here. The average hair length can be obtained using the =AVERAGE() function.

You can “name” a range of data in Excel.
This is done by highlighting your data and giving the data range a name in the box just above the column labels. If a set of data values has been named in Excel, you can use this name in formulas. This is shown here.

The median and mode can be obtained similarly in Excel.

Questions:

1. Why does Excel give a value of #N/A for the mode?
2. Is the median necessarily a data point in the dataset? Explain.
3. Is the mean necessarily a data point in the dataset?
4. Suppose the data point for Person 4 was replaced by somebody who had just shaved off their hair completely. That is, suppose the value of 12 was replaced by 0 in the dataset.
   a. Explain the impact of this change on the mean. Recalculate the mean. Your friend decided to divide by 3 instead of 4 when calculating this new mean. Do you think this is a good idea? Why or why not?
   b. Explain the impact of this change on the median.
5. The mode is often discussed alongside the mean and the median. This gives the impression that the mode is a reasonable measure of center. Explain why this is not necessarily the case. When would the mode be a good measure of center for a dataset?

The following compares these three measures.

Questions:

6. Compare and contrast the mean and median for the men's hair lengths.

Consider the following hair lengths for women. Find the mean, median, and mode in Excel.

Questions:

7. Compare the mean hair length for each gender.
8. Compare the median hair length for each gender.
9. Outliers can adversely affect the mean more than the median. Do you think it is more, less, or equally likely that men will have outliers for hair length compared to women? Explain.

MEASURES OF POSITION

In addition to the mean, percentiles give us an idea of the entire spectrum of data values.

Percentiles: The pth percentile of a set of measurements is the value at or below which p% of the measurements fall.

Consider the Hair.xlsx dataset. The following is the data for Men only.
The percentiles can be obtained using the =PERCENTILE() function in Excel.

It is often more useful to investigate this entire spectrum of percentiles using a plot. Statisticians call this plot a Cumulative Distribution Function Plot, or CDF Plot for short. The CDF plot for the men's hair lengths is shown here.

Questions:

10. What is the shortest hair length? What is the longest?
11. What is the “middle” hair length? What name do we give this value?
12. What percent of men have a hair length of more than 50 mm?
13. Use the plot to decide on a range of values in which “most” hair lengths can be found. What is this range?
14. The 2.5th percentile is about 3 and the 97.5th percentile is 112.5. What proportion of men's hair lengths are between these two values?
15. Why does this plot have a long tail at the upper end? What does this mean in the context of this example?

Consider the percentiles and CDF plot for Women.

Often two cumulative distribution functions are displayed on a single graph. This allows for easy comparisons.

Questions:

16. Compare and contrast these two CDF plots. State at least two differences.

Quartiles: Quantities that divide the data into quarters.

Q2: The halfway point in the data (i.e., the median).
Q1: The median of the lower half of the data.
Q3: The median of the upper half of the data.

Consider the hair lengths of women.

Women Hair Length: 180, 240, 270, 350, 360

The quartiles can be obtained in Excel using the =QUARTILE() function.

Consider the hair lengths of men.

Men Hair Length: 12, 40, 48, 64

Comment: Software packages may differ in their computation of quartiles. Any such differences usually diminish as the number of observations increases, so be careful when calculating quartiles for small data sets!

[Figure: quartiles for the men's hair lengths as computed by JMP and by Minitab]

Note: The differences in the methods for computing quartiles become less important the more data you have. For example, when hair lengths from all women are included, the differences are small.
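The comment above about software packages disagreeing on quartiles can be made concrete with a small sketch. Python is not part of these notes (which use Excel, JMP, and Minitab), so treat this only as an illustration. The helper name `quartiles_median_of_halves` is hypothetical, and the pairing of Python's "inclusive" method with Excel's =QUARTILE() is an assumption; no claim is made here about which rule JMP or Minitab applies.

```python
import statistics

# Men's hair lengths (mm) from this chapter.
men = [12, 40, 48, 64]


def quartiles_median_of_halves(values):
    """Q1, Q2, Q3 using the 'median of each half' definition given in the notes.

    Hypothetical helper for illustration; needs at least two values.
    For an odd number of observations the middle value is left out of both halves.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[-half:]
    return (statistics.median(lower),
            statistics.median(data),
            statistics.median(upper))


# Rule 1: median of the lower and upper halves.
halves = quartiles_median_of_halves(men)

# Rule 2: linear interpolation that treats the minimum and maximum as the
# 0th and 100th percentiles (assumed to mirror Excel's =QUARTILE()).
inclusive = statistics.quantiles(men, n=4, method="inclusive")

# Rule 3: linear interpolation that allows for values beyond the observed range.
exclusive = statistics.quantiles(men, n=4, method="exclusive")

print("Median of halves:", halves)     # (26.0, 44.0, 56.0)
print("Inclusive:       ", inclusive)  # [33.0, 44.0, 52.0]
print("Exclusive:       ", exclusive)  # [19.0, 44.0, 60.0]
```

All three rules agree on the median (44), but Q1 and Q3 differ noticeably because there are only four observations.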
[Figure: quartiles for all women (n = 40) as computed by Excel and by JMP]

MEASURES OF SPREAD

Example 3.2: Consider the following data sets.

[Figure: Data Set A, Data Set B, Data Set C]

Questions:

1. What is the mean for each data set? The median?
2. Is a measure of center enough to describe a data set? If not, what else do we need?

Several quantities exist for measuring the amount of spread (i.e., dispersion) in a data set.

Range: The difference between the largest measurement and the smallest measurement in a data set.

Range = Maximum - Minimum

Questions:

3. How many observations from the data set are used in the computation of the range?
4. Outliers (which we will discuss later) are extreme observations which need to be handled with care in an analysis. How will outliers affect the range?
5. What is the smallest possible value for the range? What does it mean if the range is at this value?

Interquartile Range (IQR): In an attempt to alleviate the problems that the range has with outliers, the IQR is computed as the difference between the third and first quartiles.

IQR = Q3 - Q1

Questions:

6. What percent of the data lies within the interquartile range?
7. Do you feel that the IQR adequately measures dispersion? Why or why not?

Average Distance from the Mean: To summarize the variability in a set of measurements, we may want to use every observation in the data set to calculate the “average distance from the mean.”

Average distance from mean: $\frac{\sum_{i=1}^{n} (x_i - \bar{x})}{n}$

Calculate the average distance from the mean for the men's hair lengths:

Questions:

8. What is the problem with using this method?
9. Recall what happened when we attempted to use string to measure the variation in hair lengths. Red string was used to represent the average and white string was used to represent each person's hair length. What color string would each individual have for their residual string? Explain.
10. What is the total length of the red residual string? White residual string? Are these two lengths the same? Why is this a problem?
11. It can be shown using a little bit of algebra that we will always get zero for an answer. Do you have any ideas on how to overcome this problem?

Mean Absolute Deviation (MAD): This is the average distance from the mean calculated using absolute distances.

$MAD = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$

Compute the MAD for the men's hair lengths:

Although this gives us a valid measure of the variability in a set of measurements, this quantity has difficult statistical properties. So, we traditionally use the variance and standard deviation.

Variance: This is (approximately) the average squared distance from the mean.

Sample Variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

Comments: We divide by $n - 1$ because dividing by $n$ tends to produce a biased estimate (specifically, an underestimate). That is, statistically speaking, using $n - 1$ is better.

Note that the sample variance is quite large when compared to the values in our original data set. This is because the original distances were squared, and so the variance is in terms of squared units. So, to get back to the scale of our original data set, we take the square root of the variance.

Standard Deviation: The square root of the variance.

Sample Standard Deviation: $s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$

Determine the Range, IQR, Mean Absolute Deviation, and Standard Deviation for the Women hair lengths.

                           Women    Men
Range                               52
IQR                                 19
Mean Absolute Deviation             15
Standard Deviation                  21.76

Questions:

12. Which gender has more variability in their measurements? Explain.
13. Which measurement is used most by statisticians? Why is this so?
14. Which measurement is most influenced by outliers? Least influenced?
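As a check on the Men column of the table above, the four measures of spread can be sketched in Python. This is only an illustration under stated assumptions: the notes themselves use Excel, the variable names are made up for this example, and the quartiles use Python's "inclusive" method, which is assumed here to correspond to Excel's =QUARTILE().

```python
import statistics

# Men's hair lengths (mm) from this chapter.
men = [12, 40, 48, 64]
n = len(men)
mean = statistics.mean(men)

# Range = Maximum - Minimum
data_range = max(men) - min(men)

# IQR = Q3 - Q1, with quartiles from the "inclusive" interpolation rule
# (an assumption: chosen to mirror Excel's =QUARTILE()).
q1, _, q3 = statistics.quantiles(men, n=4, method="inclusive")
iqr = q3 - q1

# Mean absolute deviation: average of |x_i - mean|, dividing by n.
mad = sum(abs(x - mean) for x in men) / n

# Sample standard deviation: square root of the sample variance (divides by n - 1).
std_dev = statistics.stdev(men)

print("Range:", data_range)                       # 52
print("IQR:", iqr)                                # 19.0
print("Mean absolute deviation:", mad)            # 15.0
print("Standard deviation:", round(std_dev, 2))   # 21.76
```

These values agree with the Men column. Swapping in the women's hair lengths for `men` gives one way to check the Women column after working it out by hand or in Excel.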