CHAPTER 4
DESCRIPTIVE METHODS FOR A SINGLE NUMERICAL VARIABLE
So far in this course we have dealt with categorical variables. We have summarized categorical
variables with counts, percentages, bar graphs, and mosaic plots. In this chapter, we will consider
descriptive methods appropriate for summarizing a single numerical variable. These summaries
are intended to describe the following characteristics of a numerical data set:
1. Location
2. Dispersion
3. Shape
4. Position of an Observation
MEASURES OF LOCATION
Consider the hair activity completed in class. The following hair lengths have been obtained for a
group of four men.
Person    Hair Length (mm)
1         64
2         40
3         48
4         12
For convenience, we will label the observations of a data set as x1, x2, x3, and x4. That is, x1 is the
value of the first measurement (i.e., 64), x2 is the value of the second measurement (i.e., 40), etc.
Let n represent the total number of data points.
We can use the following statistics to measure some aspect of the location of a data set (or
distribution):

Mean: The arithmetic average of all of the values in a data set. Note that this quantity
measures the center of a data set.
Sample Mean: $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$

Median: The middle term of a data set (after the numerical values have been ordered).
If the data set contains an even number of observations, then the median is the average of
the middle two observations. This quantity is also a measure of center.

Mode: The observation(s) that occurs most frequently in a data set.
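As a rough cross-check outside Excel, the three measures can also be computed with Python's standard `statistics` module. This is only a sketch using the four men's hair lengths listed above; the variable names are assumptions made for illustration.

```python
import statistics

# Hair lengths (mm) for the four men listed above; the variable name is illustrative.
men_hair_mm = [64, 40, 48, 12]

mean_mm = statistics.mean(men_hair_mm)      # sum of the values divided by n
median_mm = statistics.median(men_hair_mm)  # average of the two middle values when n is even
modes = statistics.multimode(men_hair_mm)   # every value tied for the highest frequency

print("mean:", mean_mm)
print("median:", median_mm)
print("most frequent value(s):", modes)
```

Note that `multimode` returns every value tied for the highest count, which is worth comparing with what Excel reports for the same data.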
Using Excel to calculate these descriptive statistics:
Example 3.1: Consider the men's hair lengths obtained above. Enter these values into Excel as
shown here. The average hair length can be obtained using the =AVERAGE() function.
You can “name” a range of data in Excel. This is done by highlighting your data and giving
the data range a name in the box just above the column labels.
If a set of data values has been named in Excel, you can use this name in formulas. This is
shown here.
The median and mode can be obtained similarly in Excel.
Questions:
1. Why does Excel give a value of #N/A for the mode?
2. Is the median necessarily a data point in the dataset? Explain.
3. Is the mean necessarily a data point in the dataset?
4. Suppose the data point for Person 4 was replaced by somebody who had just completely
shaved their head. That is, suppose the value of 12 was replaced by 0 in the dataset.
a. Explain the impact of this change on the mean. Recalculate the mean. Your
friend decided to divide by 3 instead of 4 when calculating this new mean. Do
you think this is a good idea? Why or why not?
b. Explain the impact of this change on the median.
5. Often the mode is discussed when means and medians are discussed. This gives the
impression that the mode is a reasonable measure of center. Explain why this is not
necessarily the case. When would the mode be a good measure of center for a dataset?
The following compares these three measures.
Questions
6. Compare and contrast the mean and median for the men's hair lengths.
Consider the following hair lengths for women. Find the mean, median, and mode in Excel.
Questions:
7. Compare the mean hair length for each gender.
8. Compare the median hair length for each gender.
9. Outliers can adversely affect the mean more than the median. Do you think it is more,
less, or equally likely that men will have outliers for hair length compared to women?
Explain.
MEASURES OF POSITION
In addition to the mean, percentiles give us an idea of the entire spectrum of data values.
Percentiles: The pth percentile of a set of measurements is the value at or below which p% of
the measurements fall.
Consider the Hair.xlsx dataset. The following is the data for Men only. The percentiles can be
obtained using the =PERCENTILE() function in Excel.
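One common way to compute a percentile is to interpolate linearly between the ordered data values. The sketch below assumes that convention, which to the best of our knowledge is the one Excel's =PERCENTILE() follows. The helper name `percentile_inclusive` is hypothetical, and because the full Hair.xlsx data are not reproduced in these notes, the example uses only the four men's hair lengths from earlier in the chapter.

```python
import math

def percentile_inclusive(values, p):
    """Return the p-th percentile (0 <= p <= 100) by linear interpolation.

    Hypothetical helper: the ordered values are treated as sitting at
    ranks 0 .. n-1, and the percentile is read off at rank (n - 1) * p / 100.
    """
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100
    lower = math.floor(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

men_hair_mm = [64, 40, 48, 12]  # the four men from earlier in the chapter

for p in (0, 25, 50, 75, 100):
    print(f"{p}th percentile: {percentile_inclusive(men_hair_mm, p)} mm")
```

Under this rule the 0th and 100th percentiles are the minimum and maximum, and the 50th percentile is the median.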
It is often more useful to investigate this entire spectrum of percentiles using a plot. Statisticians
call this plot a Cumulative Distribution Function Plot, or CDF Plot for short. The CDF plot for men's
hair length is shown here.
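The height of a CDF plot at a value x is the proportion of observations at or below x. A minimal sketch of that calculation follows. It assumes only the four men's hair lengths from earlier in the chapter (not the full Hair.xlsx data), and the function name `ecdf` is hypothetical.

```python
def ecdf(values, x):
    """Proportion of observations that are less than or equal to x."""
    return sum(1 for v in values if v <= x) / len(values)

men_hair_mm = [64, 40, 48, 12]

# Evaluate the CDF at each observed value, from smallest to largest.
for x in sorted(men_hair_mm):
    print(f"{x} mm -> {ecdf(men_hair_mm, x):.2f}")

# The proportion above a cutoff is one minus the CDF at that cutoff.
share_above_50 = 1 - ecdf(men_hair_mm, 50)
print("share above 50 mm:", share_above_50)
```

Plotting these proportions against the sorted values gives the step-shaped curve that a CDF plot displays.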
Questions
10. What is the shortest hair length? What is the longest?
11. What is the “middle” hair length? What name do we give this value?
12. What percent of Men have hair length more than 50mm?
13. Use the plot to decide a range of values for which “most” hair lengths can be found.
What is this range?
14. The 2.5th percentile is about 3 and the 97.5th percentile is 112.5. What proportion of men’s hair
lengths are between these two values?
15. Why does this plot have a long tail on the upper end? What does this mean in the context of
this example?
Consider the percentiles and CDF plot for Women.
Often two cumulative distribution functions are displayed on a single graph. This allows for easy
comparisons.
Questions:
16. Compare and contrast these two CDF plots. State at least two differences.
Quartiles: Quantities that divide the data into quarters.

Q2 – The half way point in the data (i.e. the median)


Q1 – The median of the lower half of the data.
Q3 – The median of the upper half of the data.
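The definitions above translate directly into a short calculation. The sketch below assumes that, when the number of observations is odd, the middle value is left out of both halves; that is one convention among several. The function name `quartiles_by_halves` is hypothetical, and the data are the four men's hair lengths from earlier in the chapter.

```python
import statistics

def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3), taking Q1 and Q3 as medians of the lower and upper halves.

    Assumptions: at least two observations, and for an odd number of
    observations the middle value belongs to neither half.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[-half:]
    return (
        statistics.median(lower_half),
        statistics.median(ordered),
        statistics.median(upper_half),
    )

men_hair_mm = [64, 40, 48, 12]
q1, q2, q3 = quartiles_by_halves(men_hair_mm)
print("Q1:", q1, "Q2:", q2, "Q3:", q3)
```

Software that interpolates between observations rather than splitting the data into halves can report different values for Q1 and Q3 on the same data.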
Consider the hair length of women
Women Hair Length
180
240
270
350
360
Getting quartiles in Excel using the =QUARTILE() function.
Consider the hair length of men
Men Hair Length
12
40
48
64
Comment: Software packages may differ in their computation of quartiles. Any such
differences usually diminish as the number of observations increases, so be careful when
calculating quartiles for small data sets!
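To see how much the convention can matter for a small data set, the sketch below uses Python's `statistics.quantiles`, which offers two interpolation rules named "inclusive" and "exclusive". These notes do not state which rule a given package uses, so the mapping to any particular software is left open. The data are the four men's and five women's hair lengths listed above.

```python
import statistics

men_hair_mm = [12, 40, 48, 64]
women_hair_mm = [180, 240, 270, 350, 360]

for label, data in (("men", men_hair_mm), ("women", women_hair_mm)):
    for method in ("inclusive", "exclusive"):
        # n=4 asks for the three cut points that split the data into quarters.
        q1, q2, q3 = statistics.quantiles(data, n=4, method=method)
        print(f"{label:5s} {method:9s} Q1={q1} Q2={q2} Q3={q3}")
```

With only four or five observations, the two rules give noticeably different first and third quartiles, while the median agrees.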
JMP Software
Minitab Software
Note: The differences in the methods for computing quartiles become less important the more
data you have. For example, when hair lengths from all women are included, the differences are
small.
Excel
All Women (n=40)
JMP Software
MEASURES OF SPREAD
Example 3.2: Consider the following data sets.
Data Set A
Data Set B
Data Set C
Questions:
1. What is the mean for each data set? The median?
2. Is a measure of center enough to describe a data set? If not, what else do we need?
Several quantities exist for measuring the amount of spread (i.e. dispersion) in a data set.

Range: The difference between the largest measurement and the smallest
measurement in a data set.
Range = Maximum – Minimum
Questions:
3. How many observations from the data set are used in the computation of the range?
4. Outliers (which we will discuss later) are extreme observations which need to be
handled with care in an analysis. How will outliers affect the range?
5. What is the smallest possible value for the range? What does it mean if the range is
at this value?

Interquartile Range (IQR): In an attempt to alleviate the problems that the range has
with outliers, the IQR is computed as the difference between the first and third quartiles.
IQR = Q3 – Q1
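Both measures are quick to compute once the quartiles are known. The sketch below uses the four men's hair lengths from earlier in the chapter and assumes the "inclusive" interpolation rule of Python's `statistics.quantiles` for the quartiles; other quartile conventions would give a different IQR.

```python
import statistics

men_hair_mm = [64, 40, 48, 12]

# Range: largest value minus smallest value.
data_range = max(men_hair_mm) - min(men_hair_mm)

# IQR: distance between the third and first quartiles (inclusive rule assumed).
q1, _, q3 = statistics.quantiles(men_hair_mm, n=4, method="inclusive")
iqr = q3 - q1

print("range:", data_range, "mm")
print("IQR:", iqr, "mm")
```

These values agree with the 52 and 19 reported for men in the summary table at the end of the chapter.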
Questions:
6. What percent of the data lies within the interquartile range?
7. Do you feel that the IQR adequately measures dispersion? Why or why not?

Average Distance from the Mean:
To summarize the variability in a set of measurements, we may want to use every
observation in the data set to calculate the “average distance from the mean.”
Average distance from mean $= \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})}{n}$
Calculate the average distance from the mean for the Men hair length:
Questions:
8. What is the problem with using this method?
9. Recall what happened when we attempted to use string to measure the variation in
hair lengths. Red string was used to represent the average and white string was used
to represent each person’s hair length. What color string would each individual have
for their residual string? Explain.
10. What is the total length of the red residual string? White residual string? Are these
two lengths the same? Why is this a problem?
11. It can be shown using a little bit of algebra that we will always get zero for an
answer. Do you have any ideas on how to overcome this problem?
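The algebra mentioned in Question 11 is short. It relies only on the definition of the sample mean, which implies that the sum of the observations equals n times the mean:

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0
```

For the four men, the mean is 41, so the deviations are 23, -1, 7, and -29, which sum to zero.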

Mean Absolute Deviation (MAD): This is the average distance from the mean
calculated using absolute distances. Compute the MAD for the rats in the control group:
MAD $= \dfrac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$
Although this gives us a valid measure of the variability in a set of measurements, absolute
values are awkward to work with mathematically. So, we traditionally use the variance and
standard deviation.
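The data for the exercise above are not reproduced in these notes, so the sketch below applies the MAD formula to the four men's hair lengths from earlier in the chapter instead. The variable names are assumptions made for illustration.

```python
men_hair_mm = [64, 40, 48, 12]

n = len(men_hair_mm)
mean_mm = sum(men_hair_mm) / n

# Absolute distance of each observation from the mean.
abs_deviations = [abs(x - mean_mm) for x in men_hair_mm]

mad = sum(abs_deviations) / n
print("absolute deviations:", abs_deviations)
print("MAD:", mad, "mm")
```

Taking absolute values stops the positive and negative deviations from cancelling, so the result is no longer forced to be zero.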

Variance: This is the average squared distance from the mean.
Sample Variance: $s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Comments:


We divide by n – 1 because dividing by n tends to produce a biased estimate of the
population variance (specifically, an underestimate). That is, statistically speaking, using n – 1 is better.

Note that the sample variance is quite large when compared to the values in our
original data set. This is because the original distances were squared and so the
variance is in terms of squared units. So, to get back in the scale of our original data
set, we take the square root of the variance.
Standard Deviation: The square root of the variance.
Sample Standard Deviation: $s = \sqrt{s^2} = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
Determine the Range, IQR, Mean Absolute Deviation, and Standard Deviation for the
Women hair lengths.
Measure                    Women    Men
Range                      ____     52
IQR                        ____     19
Mean Absolute Deviation    ____     15
Standard Deviation         ____     21.76
Questions:
12. Which gender has more variability in their measurements? Explain.
13. Which measurement is used most by statisticians? Why is this so?
14. Which measurement is most influenced by outliers? Least influenced?