Measures of Position

Chapter 3 – Numerically Summarizing Data
After we have become somewhat familiar with the data through representing it graphically and
observing the characteristics of the distribution, we want to describe the characteristics with numerical
values called descriptive statistics.
Recall from Chapter 1:
Defn: A parameter is a numerical characteristic of a population.
Defn: A statistic is a numerical characteristic of a sample. (Remember, a sample is a subset of a
population.)
We want to use the value of a statistic found from the sample data to gain knowledge about the value
of the corresponding parameter, which we would be able to get directly if we had access to the entire
population.
Measures of Central Tendency give us information about the location of the center (in some sense) of
the distribution of (numeric) data values. We will discuss four measures of central tendency: mean,
median, mode, and the midrange.
Defn: If we have a set of n sample data values, $x_1, x_2, \ldots, x_n$, the mean of these data values is their
arithmetic average:
$$\bar{x} = \frac{1}{n}(x_1 + x_2 + \cdots + x_n) = \frac{1}{n}\sum_{i=1}^{n} x_i .$$
If we have a set of N population data values, the mean of these values is:
$$\mu = \frac{1}{N}(x_1 + x_2 + \cdots + x_N) = \frac{1}{N}\sum_{i=1}^{N} x_i .$$
Note: $\bar{x}$ is a statistic; $\mu$ is a parameter.
Example: p. 123, Example 6
1) Go to STAT, 1:Edit.
2) Enter the data, with a suitable variable name, such as BP.
3) Choose STAT, CALC, 1:1-Var Stats.
4) Enter the variable name, and press ENTER.
5) You will see a list of numerical values for the data, including $\bar{x} = 7.488$ and
$\sum_{i=1}^{50} x_i = 374.4$.
The average, or mean, birthweight for the babies was found to be 7.488 pounds. The total weight for
the babies is 374.4 pounds.
Properties of the Mean:
1) One computes the mean by using all of the values of the data.
2) The mean varies less than the median or the mode when samples are taken
from the same population and all three measures are computed for these samples.
3) The mean is used in computing other statistics, such as the variance.
4) The mean for the data set is unique, and not necessarily one of the data values.
5) The mean is affected by extremely high or low values and may not be the appropriate measure to
use in these situations.
Example: Suppose that we had made a mistake in entering the data for the first baby, entering 0.8,
rather than 5.8. The computed value of the mean would be $\bar{x} = 7.388$ pounds, somewhat lower
than the value computed from the correct data.
Sometimes, the correct raw data has extreme values. In these situations, the mean may not be the best
measure of central tendency to use. In such cases, we might prefer to use the median.
Defn: The median is the midpoint of the data set; it is a value $\tilde{x}$ such that at least 50% of the data
values lie at or below $\tilde{x}$ and at least 50% of the data values lie at or above $\tilde{x}$.
Example: p. 123, Example 6
1) Go to STAT, 1:Edit.
2) Enter the data, with a suitable variable name, such as BP.
3) Choose STAT, CALC, 1:1-Var Stats.
4) Enter the variable name, and press ENTER.
5) You will see a list of numerical values for the data, including $\tilde{x}$ = Med = 7.35.
The median value of the birthweights for the sample of babies was 7.35 pounds.
Properties of the Median:
1) The median is used when one must find the center or middle value of a data set.
2) The median is used when one must determine whether the data values fall into the upper half or
the lower half of the distribution.
3) The median is used to find the average of an open-ended distribution.
4) The median is affected less than the mean by extremely high or extremely low values.
Example: Let’s return to the above example, and assume that the data value 5.8 was incorrectly
entered as 0.8. The median is still found to be 7.35 pounds, since the single incorrect
value has little effect on the calculation of the median.
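The different sensitivity of the mean and the median to a single bad entry can be checked with a short Python sketch. The textbook birthweights are not reproduced in these notes, so the ten values below are hypothetical and chosen only for illustration.

```python
from statistics import mean, median

# Hypothetical birthweights in pounds (NOT the textbook data).
weights = [5.8, 6.4, 6.9, 7.1, 7.3, 7.4, 7.9, 8.2, 8.8, 9.4]

# Simulate the data-entry error: 5.8 is typed as 0.8.
weights_with_error = [0.8 if w == 5.8 else w for w in weights]

print("mean:  ", round(mean(weights), 3), "->", round(mean(weights_with_error), 3))
print("median:", median(weights), "->", median(weights_with_error))
```

With these ten values the error lowers the sum by 5, so the mean drops by 0.5 pounds, while the two middle values (and hence the median) are unchanged.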
Sometimes, the median is a more appropriate measure of central tendency than the mean for a data set.
Example: The U.S. Department of Commerce Bureau of Labor Statistics gives information about the
distribution of personal incomes in the U.S. This distribution, of course, has extreme values. Hence
the Bureau uses the median income, rather than the mean, as the appropriate measure of central
tendency.
In some situations, the most appropriate measure of central tendency is the mode of the distribution.
Defn: The data value that occurs most often in a data set is called the mode.
Note: Some data sets do not have a mode. For example, the data set consisting of the values 1, 1, 2, 2,
3, 3, 4, 4 does not have a single most frequently occurring value, and hence does not have a mode. For
this data, the mean or median would be the most appropriate measure of central tendency.
Examples: p. 124 Examples 7 and 8.
The calculator will provide only a little help here, in sorting the data.
1) Rearrange the data so that the values are listed in increasing order.
2) Find the value that occurs most frequently, if such a value exists.
For this data set, the mode is 0.
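The two steps above can also be sketched in Python. The function name `mode_or_none` is made up for this illustration; it follows the convention of these notes that a data set with no single most frequent value has no mode.

```python
from collections import Counter

def mode_or_none(values):
    """Return the single most frequent value, or None if there is a tie.

    Convention assumed from these notes: if two or more values share the
    highest frequency, the data set is said to have no mode.
    """
    if not values:
        raise ValueError("the data set is empty")
    counts = Counter(values).most_common()  # (value, count), highest count first
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie for the highest frequency
    return counts[0][0]

print(mode_or_none([1, 1, 2, 2, 3, 3, 4, 4]))          # data set from the note above
print(mode_or_none(["red", "blue", "red", "green"]))   # hypothetical categorical data
```

Because the function only counts occurrences, it works for categorical data as well as numeric data.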
Properties of the Mode:
1) The mode is used when the most typical case is desired.
2) The mode is the easiest average to compute.
3) The mode can be used when the data are categorical, such as religious preference, gender, or
political affiliation.
4) The mode is not always unique. A data set can have more than one value tied for the highest
frequency, in which case, under the convention used in these notes, we say that it does not have a mode.
Distribution Shapes: (See p. 122)
1) In a positively skewed distribution, the following relationship holds among the measures of
central tendency: $\text{Mode} < \tilde{x} < \bar{x}$.
2) In a negatively skewed distribution, the following relationship holds among the measures of
central tendency: $\bar{x} < \tilde{x} < \text{Mode}$.
3) In a symmetrical distribution, the data values are evenly distributed on both sides of the mean.
In this situation, the following relationship holds among the measures of central tendency:
$\bar{x} = \tilde{x} = \text{Mode}$.
Measures of Variability: In addition to locating the center (in some sense) of a data distribution, we
also want to know how spread out the data values are. We will talk about three measures of variability:
Range, Variance, and Standard Deviation.
Defn: The range of a data set is the difference between the largest and smallest data values: Range =
Xmax – Xmin.
Example: p. 123, Example 6 (baby birthweights)
Range = 9.4 – 5.8 = 3.6 pounds
The range is not the most useful measure of the variability of the data, however, since it ignores much
of the information about variability. The following two data distributions have the same range, but we
would not say that they have the same variability:
Data set 1: 10, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 90
Data set 2: 10, 10, 10, 10, 50, 50, 50, 50, 90, 90, 90, 90
If we construct histograms for each of these data sets, we see that the first set of data values is more
concentrated at the center.
We need another measure of variability that will allow us to distinguish between these two situations.
This measure of variability should include information about the location of each item in the data set
relative to the center of the data distribution.
Defn: For an observation $x_i$, define the corresponding deviation from the mean to be
$e_i = x_i - \bar{x}$.
Can we use the sum of all of the deviation scores for the data as our measure of variability? No. Why
not?
For any data set $x_1, x_2, \ldots, x_n$, we have $\sum_{i=1}^{n} e_i = 0$. Why is this so?
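One way to see that the deviations from the mean always sum to zero uses only the definition of the sample mean, which gives $\sum_{i=1}^{n} x_i = n\bar{x}$:

```latex
\sum_{i=1}^{n} e_i
  = \sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0
```

The positive and negative deviations cancel for every data set, so their sum carries no information about spread.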
Defn: For a population of N data values, $x_1, x_2, \ldots, x_N$, having population mean
$$\mu = \frac{1}{N}(x_1 + x_2 + \cdots + x_N) = \frac{1}{N}\sum_{i=1}^{N} x_i ,$$
the variance of the population data set is
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2 .$$
The standard deviation of the population data set is the square root of the variance.
Defn: For a sample of n data values, $x_1, x_2, \ldots, x_n$, having sample mean
$$\bar{x} = \frac{1}{n}(x_1 + x_2 + \cdots + x_n) = \frac{1}{n}\sum_{i=1}^{n} x_i ,$$
the variance of the sample data set is
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 .$$
The standard deviation of the sample data set is the square root of the variance.
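As a check on the definitions, the following Python sketch applies the sample variance and standard deviation formulas to Data set 1 and Data set 2 given earlier, which share the same range. The helper names `sample_variance` and `sample_std_dev` are chosen for this illustration.

```python
from math import sqrt

def sample_variance(data):
    """Sample variance: squared deviations from the mean, divided by n - 1."""
    n = len(data)
    x_bar = sum(data) / n
    return sum((x - x_bar) ** 2 for x in data) / (n - 1)

def sample_std_dev(data):
    """Sample standard deviation: square root of the sample variance."""
    return sqrt(sample_variance(data))

data_set_1 = [10, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 90]
data_set_2 = [10, 10, 10, 10, 50, 50, 50, 50, 90, 90, 90, 90]

for name, data in [("Data set 1", data_set_1), ("Data set 2", data_set_2)]:
    data_range = max(data) - min(data)
    print(name, "range =", data_range, " s =", round(sample_std_dev(data), 2))
```

Both ranges are 80, but the standard deviation of Data set 2 is twice that of Data set 1, which matches the impression from the histograms that the first set is more concentrated at the center.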
Why do we need to define two different additional measures of variability for a data set? (Hint: units
of measurement). Why do we divide by n – 1, rather than by n, when computing the sample variance?
Defn: An unbiased estimator of a parameter is a statistic, such that the average of the values of the
statistic for repeated random samples of the same size tends toward the true value of the parameter.
When we divide by n – 1, rather than n, to compute the sample variance, we are creating an unbiased
statistic for estimating the population variance.
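The effect of the divisor can be illustrated by listing every possible sample from a small population and averaging the resulting variances. This is only a sketch under stated assumptions: the population (the six faces of a die) is hypothetical, and samples are drawn with replacement so that all ordered samples are equally likely.

```python
from itertools import product

def variance(data, divisor):
    """Sum of squared deviations from the data's own mean, divided by `divisor`."""
    m = sum(data) / len(data)
    return sum((x - m) ** 2 for x in data) / divisor

# Hypothetical population: the six faces of a fair die.
population = [1, 2, 3, 4, 5, 6]
N = len(population)
sigma_squared = variance(population, N)  # population variance divides by N

# Every possible ordered sample of size n, drawn with replacement.
n = 3
samples = list(product(population, repeat=n))

avg_div_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)
avg_div_n = sum(variance(s, n) for s in samples) / len(samples)

print("population variance:      ", round(sigma_squared, 4))
print("average, dividing by n-1: ", round(avg_div_n_minus_1, 4))
print("average, dividing by n:   ", round(avg_div_n, 4))
```

Averaged over all 216 samples, dividing by n – 1 reproduces the population variance (about 2.9167), while dividing by n gives a smaller average (about 1.9444), so it underestimates the population variance.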
Example: p.123, Example 6.
From the 1-Var Stats function of the calculator, we find that the mean is $\bar{x} = 7.488$ pounds. The
standard deviation of the data set is s = 0.8030 pounds. The variance of the data set is then
$s^2$ = 0.6448 squared pounds.
Now assume that we have committed two data entry errors, replacing 5.8 with 0.8 and replacing 9.4
with 19.4. What are the values of the variability measures now?
We find s = 2.0762 pounds and $s^2$ = 4.3105 squared pounds. There is much more variability in the data with
these two data entry errors.
Example: Given the following data set: 5, 5, 5, 5, 5, 5, 5, 5, what is the standard deviation? (Hint:
You don’t need to use the calculator to answer this question.)
Uses of Variance and Standard Deviation:
1) As previously stated, variances and standard deviations can be used to determine the spread of the
data. If the variance or standard deviation is large, the data are more dispersed. This information is
useful in comparing two or more data sets to determine which is more (most) variable.
2) The measures of variance and standard deviation are used to determine the consistency of a variable.
For example, in the manufacture of fittings, like nuts and bolts, the variation in the diameters must be
small or parts will not fit together.
3) The variance and standard deviation are used to determine the number of data values that fall within
a specified interval in a distribution.
4) Finally, the variance and standard deviation are used quite often in inferential statistics. These uses
will be shown later in the course.
The Empirical Rule: If a data distribution is bell-shaped (or normal), then the following statements are
true:
1) Approximately 68% of the data values lie within one standard deviation on either side of the mean.
2) Approximately 95% of the data values lie within two standard deviations on either side of the
mean.
3) Approximately 99.7% of the data values lie within three standard deviations on either side of the
mean.
Example: p. 128, Exercise 29
Suppose that we know that the distribution of M&M weights is approximately bell-shaped. From the
Empirical Rule, we can then say that approximately 68% of the M&M’s in the sample have weights
between
$\bar{x} - s = 0.8746\text{ g} - 0.0356\text{ g} = 0.8390\text{ g}$, and
$\bar{x} + s = 0.8746\text{ g} + 0.0356\text{ g} = 0.9102\text{ g}$.
We can also say that approximately 95% of the M&M’s in the sample have weights between
$\bar{x} - 2s = 0.8746\text{ g} - (2 \times 0.0356\text{ g}) = 0.8034\text{ g}$, and
$\bar{x} + 2s = 0.8746\text{ g} + (2 \times 0.0356\text{ g}) = 0.9458\text{ g}$.
Finally, we can say that approximately 99.7% of the M&M’s in the sample have weights between
$\bar{x} - 3s = 0.8746\text{ g} - (3 \times 0.0356\text{ g}) = 0.7678\text{ g}$, and
$\bar{x} + 3s = 0.8746\text{ g} + (3 \times 0.0356\text{ g}) = 0.9814\text{ g}$.
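The interval endpoints in the M&M example can be reproduced with a few lines of Python. The function name `empirical_rule_intervals` is made up for this sketch; the mean and standard deviation are the values quoted in the example.

```python
def empirical_rule_intervals(mean, std_dev):
    """Return {k: (mean - k*std_dev, mean + k*std_dev)} for k = 1, 2, 3."""
    return {k: (mean - k * std_dev, mean + k * std_dev) for k in (1, 2, 3)}

# Sample mean and standard deviation of the M&M weights, in grams.
intervals = empirical_rule_intervals(0.8746, 0.0356)

coverage = {1: "68%", 2: "95%", 3: "99.7%"}
for k, (low, high) in intervals.items():
    print(f"about {coverage[k]} of weights between {low:.4f} g and {high:.4f} g")
```

The percentages are only approximate, and they apply only when the distribution is roughly bell-shaped.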
Example: p. 96, Exercise 42
Measures of Position
In addition to summary statistics describing the entire data set, we are often interested in locating
particular members of the sample, or particular data values, within the context of the data set as a
whole.
Defn: A standard score, or z-score, for a data value is obtained by subtracting the mean from the data
value and then dividing by the standard deviation. For an observation $x_i$ in a sample data set, the
z-score is
$$z_i = \frac{x_i - \bar{x}}{s} .$$
For a member of a population data set, the z-score is
$$z_i = \frac{x_i - \mu}{\sigma} .$$
With z-scores, we can compare relative locations of scores from two different data distributions.
Example: Which of the following exam grades has a better relative position?
a) A grade of 43 on a test with a mean of 40 and standard deviation of 3
b) A grade of 75 on a test with a mean of 72 and standard deviation of 5.
The z-score corresponding to the data value from the first data set is
$$z_1 = \frac{43 - 40}{3} = 1 .$$
The z-score corresponding to the data value from the second data set is
$$z_2 = \frac{75 - 72}{5} = 0.6 .$$
The first score is higher, relative to its score distribution, than the second score.
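The same comparison of the two exam grades can be written as a small Python sketch; the function name `z_score` is chosen for this illustration.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that `value` lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev

z1 = z_score(43, mean=40, std_dev=3)   # grade of 43 on the first test
z2 = z_score(75, mean=72, std_dev=5)   # grade of 75 on the second test

better = "first" if z1 > z2 else "second"
print(f"z1 = {z1:.1f}, z2 = {z2:.1f}; the {better} grade has the better relative position")
```

Both grades are 3 points above their test means, but the first test has the smaller standard deviation, so the same 3 points count for more there.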
Percentiles
To locate the position of an individual score relative to its own data set, we often use percentiles.
Defn: The kth percentile of a data set is the value for which at most k% of the data values are less than
that value and at most (100 – k)% of the data values are more than that value.
Finding a Data Value Corresponding to the kth Percentile.
1) Arrange the data in order from lowest to highest.
2) Substitute into the formula $i = \frac{kn}{100}$, where n is the size of the data set, and k is the particular
percent.
3) a) If i is not an integer, round up to the next higher integer. Starting at the smallest data value in
the list, count up to the position corresponding to the rounded value of i. The data value at that
position is the kth percentile of the data set.
b) If i is an integer, use the value halfway between the ith and (i+1)th data values, counting from
the smallest data value.
Example: What value in the following data set corresponds to the 60th percentile?
1) 12, 28, 35, 42, 47, 49, 50
2) n = 7, and $i = \frac{(7)(60)}{100} = 4.2$.
3) 4.2 is not an integer so we round up to 5. The 60th percentile is then the 5th data value, counting
from the lowest, or 47.
Example: The A. C. Nielsen Company publishes data on the TV-viewing habits of Americans in the
Nielsen Report on Television. A sample of 20 people yielded the following data on weekly viewing
times:
25, 41, 27, 32, 43, 66, 35, 31, 15, 5, 34, 26, 32, 38, 16, 30, 38, 30, 20, 21.
What is the 25th percentile of the data? I.e., what is the value such that at most 25% of the data values
are less than that value and at most 75% of the data values are greater than that value?
1) Rearrange the data in ascending order: 5, 15, 16, 20, 21, 25, 26, 27, 30, 30, 31, 32, 32, 34, 35,
38, 38, 41, 43, 66
2) $i = \frac{kn}{100} = \frac{(20)(25)}{100} = 5$.
3) The position of the 25th percentile is halfway between the 5th and the 6th data observations,
namely, $\frac{21 + 25}{2} = 23$.
Hence, 25% of the weekly viewing times are less than 23 hours, and 75% of the weekly viewing times
are greater than 23 hours.
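The percentile rule used in the last two examples can be sketched in Python as follows. The function name `percentile` and the restriction of k to whole numbers from 1 to 99 are assumptions made for this illustration; other textbooks and software use slightly different percentile rules.

```python
def percentile(data, k):
    """kth percentile (k a whole number from 1 to 99) using the rule in these notes."""
    if not data:
        raise ValueError("the data set is empty")
    if not 1 <= k <= 99:
        raise ValueError("k must be between 1 and 99")
    ordered = sorted(data)               # step 1: lowest to highest
    n = len(ordered)
    if (k * n) % 100 != 0:               # step 3a: i = k*n/100 is not an integer
        i = (k * n) // 100 + 1           # round up to the next higher integer
        return ordered[i - 1]            # positions count from 1, Python indexes from 0
    i = (k * n) // 100                   # step 3b: i is an integer
    return (ordered[i - 1] + ordered[i]) / 2   # halfway between ith and (i+1)th values

small = [12, 28, 35, 42, 47, 49, 50]
viewing = [25, 41, 27, 32, 43, 66, 35, 31, 15, 5,
           34, 26, 32, 38, 16, 30, 38, 30, 20, 21]

print(percentile(small, 60))     # 60th percentile of the seven-value data set
print(percentile(viewing, 25))   # 25th percentile of the Nielsen viewing times
```

The two calls return 47 and 23.0, agreeing with the hand calculations above.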
Defn: The first quartile of a data set is Q1, the 25th percentile. The second quartile of the data set is
$\tilde{x}$ = Q2, the median, or 50th percentile.
The third quartile of the data set is Q3, the 75th percentile.
Defn: An outlier is an extremely high or extremely low data value, when compared to the rest of the
data values.
Defn: The 5-number summary of a data set consists of the lowest value of the data, Xmin, the three
quartiles, Q1, $\tilde{x}$, and Q3, and the highest value of the data, Xmax.
Example: For the Nielsen data
1) Enter the data as the variable VIEW.
2) Go to STAT, CALC, 1-Var Stats, and enter the variable name VIEW.
3) Scroll down and read off the 5-number summary of the data set:
Xmin = 5, Q1 = 23, Med = 30.5, Q3 = 36.5, Xmax = 66.
Defn: The interquartile range of a data set is IQR = Q3 – Q1.
We will consider a data value to be an outlier if its value is either greater than Q3 + 1.5IQR or less than
Q1 – 1.5IQR.
Example: The Nielsen data. We suspect that the largest value, 66, could be an outlying observation.
We calculate
Q3 + 1.5IQR = 36.5 + (1.5)(36.5 – 23) = 56.75. Since 66 > 56.75, the largest data value is
an outlier, and should be investigated individually.
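The 5-number summary and the outlier check for the Nielsen data can be sketched in Python. This sketch assumes that Q1 and Q3 are computed as the medians of the lower and upper halves of the ordered data, which reproduces the calculator values quoted above for this data set; other software may use different quartile conventions. The function names are chosen for this illustration.

```python
from statistics import median

def five_number_summary(data):
    """Return (Xmin, Q1, Med, Q3, Xmax); needs at least two data values.

    Assumption: Q1 and Q3 are the medians of the lower and upper halves,
    leaving out the middle value when the number of values is odd.
    """
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]        # values below the median position
    upper = ordered[n - half:]    # values above the median position
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

def outliers(data):
    """Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

viewing = [25, 41, 27, 32, 43, 66, 35, 31, 15, 5,
           34, 26, 32, 38, 16, 30, 38, 30, 20, 21]

print(five_number_summary(viewing))
print(outliers(viewing))
```

For the viewing times, the only value flagged is 66; the smallest value, 5, lies above the lower fence of 23 – 20.25 = 2.75 and is not flagged.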
Defn: A boxplot is a graphical representation of a numeric data set, using the 5-number summary.
The data values between Q1 and Q3 are represented by a box, with a vertical line at the median value.
The data values between Xmin and Q1 are represented by a line segment attached to the left end of the
box. The data values between Q3 and Xmax are represented by a line segment attached to the right end
of the box.
Note: The TI-83 will do boxplots.
Information Obtained from a Boxplot
a) If the median is near the center of the box and the two lines are of approximately equal length, the
distribution is approximately symmetric.
b) If the median falls to the left of the center of the box and the right line is longer than the left line, the
distribution is positively skewed.
c) If the median falls to the right of the center of the box and the left line is longer than the right line,
the distribution is negatively skewed.
Example: The Nielsen data
1) Enter the data in the calculator.
2) Set the WINDOW appropriately.
3) Clear all other plots and drawings.
4) Choose STAT PLOT, and turn Plot 1 on.
5) Choose the 5th Type of plot, the Boxplot.
6) For Xlist, choose the name of the variable, in this case VIEW.
7) To display the boxplot, hit the GRAPH key.
8) Use the TRACE key to find Xmin, Q1, the median, Q3, and Xmax.