Chapter 4
Numerical Methods for Describing Data
4.1 Describing the Center of a Data Set
In the last chapter, we introduced some graphical and tabular methods for describing data.
We have seen that a stem-and-leaf display, a frequency distribution, or a histogram gives
a general impression of where a data set is centered and how much it spreads out
about its center. We now show how to calculate numerical summary measures that
describe both the center and the extent of spread more precisely.
A measure of center is a number that describes roughly where the data set is
“centered”. The two most popular measures of center are the mean and the median.
Notation:
x = the variable for which we have sample data
n = the number of observations in the sample (sample size)
$x_1$ = the first sample observation
$x_2$ = the second sample observation
⋮
$x_n$ = the nth (last) sample observation
$\sum x = \sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n$
Definition 4.1 The sample mean of a numerical sample $x_1, x_2, \ldots, x_n$, denoted by $\bar{x}$, is the
arithmetic average, that is,
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x}{n}$$
Example 4.1 A student took five exams. His scores on the five exams are
96 85 93 87 91
Then the mean of the scores is
$$\bar{x} = \frac{96 + 85 + 93 + 87 + 91}{5} = \frac{452}{5} = 90.4$$
Note: The mean $\bar{x}$ need not be a value that x can actually take.
Definition 4.2 The population mean, denoted by $\mu$, is the average of all x values in the
entire population.
It is customary to use Roman letters to denote sample characteristics and Greek letters to
denote population characteristics, for example, $\bar{x}$ for the sample mean and $\mu$ for the population
mean.
The value of $\bar{x}$ varies from sample to sample, whereas there is just one value for $\mu$. We
shall see subsequently how the value of $\bar{x}$ from a particular sample can be used to draw
various conclusions about $\mu$.
One drawback to the mean as a measure of center for a data set is that its value can be
greatly affected by the presence of even a single outlier in the data set. We now introduce
another measure of center that is not so sensitive to outliers.
Definition 4.3 Once the data values have been listed in order from smallest to largest, the
median is the middle value in the list and divides the list into two equal parts, that is,
Sample median = the single middle value, if n is odd;
Sample median = the average of the middle two values, if n is even.
The population median is the middle value in the ordered list of all population
observations.
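If we denote the ordered sample list from smallest to largest by
$x_{(1)}, x_{(2)}, \ldots, x_{(n)}$
then
$$\text{Sample median} = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\[4pt] \dfrac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2} & \text{if } n \text{ is even.} \end{cases}$$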
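The two cases of the formula can be sketched in Python as follows. This is a minimal illustration rather than a library routine; the function name `sample_median` is a hypothetical helper, and the only subtlety is that Python lists are indexed from 0 while the order statistics are indexed from 1.

```python
def sample_median(values):
    """Return the sample median using the ordered list x_(1), ..., x_(n)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    if n % 2 == 1:
        # n odd: the single middle value x_((n+1)/2); subtract 1 for 0-based indexing
        return ordered[(n + 1) // 2 - 1]
    # n even: the average of x_(n/2) and x_(n/2 + 1)
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


print(sample_median([96, 85, 93, 87, 91]))  # odd sample size
print(sample_median([1, 2, 3, 4]))          # even sample size
```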
Example 4.2 The ordered scores in Example 4.1 are
Score:            85       87       91       93       96
Order statistic:  $x_{(1)}$  $x_{(2)}$  $x_{(3)}$  $x_{(4)}$  $x_{(5)}$
so that median = $x_{\left(\frac{5+1}{2}\right)} = x_{(3)} = 91$.
An advantage of the median over the mean is that the median is not highly influenced by outliers.
Question: Suppose that, for some special reason, the student scored poorly on the fourth exam,
say 37 instead of 87. What are the mean and the median then?
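One way to check an answer to this question is to let Python's standard `statistics` module do the arithmetic. The sketch below simply recomputes both measures for the original scores and for the scores with the fourth exam changed; the variable names are illustrative only.

```python
from statistics import mean, median

original = [96, 85, 93, 87, 91]
with_outlier = [96, 85, 93, 37, 91]  # fourth score changed from 87 to 37

for label, scores in (("original", original), ("with outlier", with_outlier)):
    print(label, "mean =", mean(scores), "median =", median(scores))
```

Comparing the two printed lines shows how far each measure moves when a single observation becomes an outlier.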

Comparing the Mean and the Median
(1) When the histogram is symmetric, the mean and the median are equal. (2) When the
histogram is positively skewed, the mean is larger than the median. (3) When the histogram is
negatively skewed, the mean is smaller than the median.
There are also other measures of center, for example, the trimmed mean.
Definition 4.4 A trimmed mean is computed by (i) first ordering the data values from
smallest to largest, (ii) then deleting a selected number of values from each end of the
ordered list, and (iii) finally averaging the remaining values. The trimming percentage is
the percentage of values deleted from each end of the ordered list.
Number deleted from each end = (trimming percentage / 100) × n
Sometimes the number of observations to be deleted from each end of the data set is
specified. Then the corresponding trimming percentage is
trimming percentage = (number deleted from each end / n) × 100
Question: (1) How many observations should we delete from each end to get a 20%
trimmed mean of a data set of size 20? (2) What is the 0% trimmed mean? (3) What is the
range of trimming percentage?
Example 4.3 The data below give the number of seconds showing alcohol use in
each of 30 animated films released between 1980 and 1997. Find the 10% trimmed mean.
Listed in order from smallest to largest, the data values are
  0    0    0    0    0    0    0    0    0    0
  0    0    0    0    0    3    5    7   13   28
 34   38   39   46   72   73   74   76  123  414
Since 10% of 30 is 3, the 10% trimmed mean results from deleting the three largest and
three smallest data values and then averaging the remaining 24 data values.
10% trimmed mean = (0 + 0 + ⋯ + 74) / 24 = 432 / 24 = 18,
which falls between the mean $\bar{x}$ = 34.83 and the median ?.
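The three steps of Definition 4.4 can be sketched in Python as below. The function name `trimmed_mean` is a hypothetical helper, and rounding a fractional number of deletions to the nearest whole observation is an assumption of this sketch (the definition does not say what to do in that case). The 30 data values are those of Example 4.3; their total, 1045, agrees with the stated mean of 34.83.

```python
def trimmed_mean(values, trimming_percentage):
    """Average what remains after deleting the same number of values from each end."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumption: a fractional count is rounded to the nearest whole observation.
    k = round(trimming_percentage / 100 * n)
    if 2 * k >= n:
        raise ValueError("trimming would delete every observation")
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)


# The 30 values of Example 4.3, in sorted order.
seconds = [0] * 15 + [3, 5, 7, 13, 28, 34, 38, 39, 46, 72, 73, 74, 76, 123, 414]

print(trimmed_mean(seconds, 10))  # deletes 3 values from each end
print(trimmed_mean(seconds, 0))   # no trimming: the ordinary mean
```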
A trimmed mean with a small to moderate trimming percentage (between 5% and 25%) is
less affected by outliers than the mean, but it is not as insensitive as the median.
Trimmed means are used in many sports, for example, gymnastics and diving. In these
sports, several judges give scores to each athlete. However, the final score of an athlete is
computed by first deleting the lowest and highest scores and then averaging the
remaining scores.
4.2 Describing Variability in a Data Set
Reporting a measure of center gives only partial information about a data set. It is also
important to describe the spread of values about the center. For example, consider the
following scores of two students in a course:
Student 1:   85   90   95
Student 2:   80   90  100
Which student’s performance is better? Both students have a mean score of 90, so we cannot
distinguish the two students’ performances based only on a measure of center.
Definition 4.5 The range of a sample = the largest value – the smallest value.
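As a small illustration, the sketch below applies this definition to the scores of the two students above (85, 90, 95 and 80, 90, 100). The helper name `sample_range` is hypothetical; it is used here only to show that the two samples share a mean but differ in spread.

```python
def sample_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


student_1 = [85, 90, 95]
student_2 = [80, 90, 100]

for name, scores in (("Student 1", student_1), ("Student 2", student_2)):
    print(name, "mean =", sum(scores) / len(scores), "range =", sample_range(scores))
```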
The n deviations from the sample mean are the differences
$x_1 - \bar{x},\ x_2 - \bar{x},\ \ldots,\ x_n - \bar{x}$.
A deviation is positive if the x value exceeds $\bar{x}$ and negative if the x value is less than $\bar{x}$.
Generally, the larger the magnitudes (ignoring the signs) of the deviations, the greater the
amount of variability in the sample.
How can the deviations be combined into a single numerical measure?
Since $\sum (x - \bar{x}) = \sum x - n\bar{x} = \sum x - \sum x = 0$, we cannot simply add the
deviations together to measure variability. The standard way to prevent positive and
negative deviations from counteracting one another is to square them before combining.
Definition 4.6 The sample variance, denoted by $s^2$, is the sum of squared deviations from
the mean divided by n - 1. That is,
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{S_{xx}}{n - 1}$$
The sample standard deviation is the positive square root of the sample variance and is
denoted by s. In practice s is used more often than $s^2$ because it has the same units as the observations.
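Note: (1) Interpretation of s: the typical amount by which an observation deviates
from $\bar{x}$.
(2) A computational formula for $S_{xx}$ is
$$\begin{aligned}
S_{xx} &= \sum (x - \bar{x})^2 = \sum (x^2 - 2x\bar{x} + \bar{x}^2) \\
&= \sum x^2 - 2\bar{x} \sum x + n\bar{x}^2 \\
&= \sum x^2 - 2\,\frac{(\sum x)^2}{n} + \frac{(\sum x)^2}{n} \\
&= \sum x^2 - \frac{(\sum x)^2}{n}
\end{aligned}$$
Thus,
$$s^2 = \frac{S_{xx}}{n - 1} = \frac{\sum x^2 - (\sum x)^2 / n}{n - 1}.$$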
Example 4.4 A sociologist is studying the amount of time that 3- to 6-year-olds are allowed to
watch television each day. A sample of 15 children is selected, and the amount of time
each child was allowed to watch television is recorded. The data are listed below in hours.
4.0  2.5  1.5  2.0  8.0  3.5  4.0  4.0  2.0  1.0  3.2  1.5  2.5  3.0  3.5
$\bar{x}$ = (4.0+2.5+1.5+2.0+8.0+3.5+4.0+4.0+2.0+1.0+3.2+1.5+2.5+3.0+3.5) / 15
= 46.2/15 = 3.08
Observation   Deviation         Squared deviation     $x^2$
x             $x - \bar{x}$     $(x - \bar{x})^2$
4.0            0.92              0.8464               16
2.5           -0.58              0.3364                6.25
1.5           -1.58              2.4964                2.25
2.0           -1.08              1.1664                4
8.0            4.92             24.2064               64
3.5            0.42              0.1764               12.25
4.0            0.92              0.8464               16
4.0            0.92              0.8464               16
2.0           -1.08              1.1664                4
1.0           -2.08              4.3264                1
3.2            0.12              0.0144               10.24
1.5           -1.58              2.4964                2.25
2.5           -0.58              0.3364                6.25
3.0           -0.08              0.0064                9
3.5            0.42              0.1764               12.25
                                 Sum = 39.444         Sum = 181.74
$s^2$ = ?.
$s^2$ can also be calculated by
$$s^2 = \frac{\sum x^2 - (\sum x)^2 / n}{n - 1} = \ ?.$$
s = $\sqrt{s^2}$ = ?.
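The two routes to $s^2$ can be compared numerically. The following sketch recomputes $S_{xx}$ for the data of Example 4.4 both from the definition and from the computational formula, then derives $s^2$ and s; the variable names are illustrative, and the rounding in the `print` calls is only for display.

```python
from math import isclose, sqrt

hours = [4.0, 2.5, 1.5, 2.0, 8.0, 3.5, 4.0, 4.0, 2.0, 1.0, 3.2, 1.5, 2.5, 3.0, 3.5]
n = len(hours)
x_bar = sum(hours) / n

# Definition: sum of squared deviations from the mean
s_xx_definition = sum((x - x_bar) ** 2 for x in hours)
# Computational formula: sum of x^2 minus (sum of x)^2 / n
s_xx_shortcut = sum(x * x for x in hours) - sum(hours) ** 2 / n

s2 = s_xx_definition / (n - 1)  # sample variance, divisor n - 1
s = sqrt(s2)                    # sample standard deviation

print("x_bar =", round(x_bar, 2))
print("S_xx  =", round(s_xx_definition, 3), "and", round(s_xx_shortcut, 3))
print("s^2   =", round(s2, 4), " s =", round(s, 4))
```

Both versions of $S_{xx}$ should agree up to floating-point rounding, which is why the comparison in practice uses a tolerance (for example `math.isclose`) rather than exact equality.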
The measures of variability for the entire population that are analogous to $s^2$ and s for a
sample are called the population variance and the population standard deviation, and are
denoted by $\sigma^2$ and $\sigma$, respectively. Generally $\sigma^2$ is unknown and is estimated by $s^2$. We
use the divisor n - 1 in $s^2$ rather than n because (1) the resulting estimate tends, on average,
to be a bit closer to $\sigma^2$, and (2) $s^2$ has n - 1 degrees of freedom. It is better to use s and $\sigma$ for comparative
purposes than for an absolute assessment of variability.
As with $\bar{x}$, s is greatly affected by the presence of even a single outlier. A measure of
variability that is resistant to the effects of outliers is the interquartile range.
Definition 4.7 Lower quartile = median of the lower half of the sample.
Upper quartile = median of the upper half of the sample.
(If n is odd, the median of the entire sample is excluded from both halves.)
The interquartile range (iqr) = upper quartile – lower quartile.
The population interquartile range = upper population quartile – lower population
quartile.
Example 4.5 Determine the lower quartile, upper quartile, and the interquartile range for
the data in Example 4.4.
The ordered data values are
1.0  1.5  1.5  2.0  2.0  2.5  2.5  3.0  3.2  3.5  3.5  4.0  4.0  4.0  8.0
The sample size n = 15 is an odd number, so the median, 3.0, is excluded from both
halves of the sample:
Lower half:  1.0  1.5  1.5  2.0  2.0  2.5  2.5
Upper half:  3.2  3.5  3.5  4.0  4.0  4.0  8.0
Lower quartile = ?, upper quartile = ?, and iqr = ? - ? = ?.
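The quartile rule of Definition 4.7 can be sketched in Python as follows. The helper `quartiles` is hypothetical and implements only the "median of each half" convention used in these notes; statistical software often uses other quartile conventions, so its results may differ slightly for the same data.

```python
from statistics import median


def quartiles(values):
    """Lower and upper quartiles as the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # When n is odd, skip the middle value so that it belongs to neither half.
    upper_half = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower_half), median(upper_half)


hours = [4.0, 2.5, 1.5, 2.0, 8.0, 3.5, 4.0, 4.0, 2.0, 1.0, 3.2, 1.5, 2.5, 3.0, 3.5]
lower_q, upper_q = quartiles(hours)
print("lower quartile =", lower_q, "upper quartile =", upper_q, "iqr =", upper_q - lower_q)
```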
If a histogram of a data set can be reasonably well approximated by a normal curve, then
roughly s ≈ iqr / 1.35.
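The constant 1.35 can be motivated by a short calculation, sketched here under the assumption that the data follow a normal curve with mean $\mu$ and standard deviation $\sigma$. For such a curve the quartiles lie about 0.6745 standard deviations on either side of the mean, so

```latex
\text{iqr} \approx (\mu + 0.6745\,\sigma) - (\mu - 0.6745\,\sigma)
           = 1.349\,\sigma \approx 1.35\,\sigma
\quad\Longrightarrow\quad
\sigma \approx \frac{\text{iqr}}{1.35}.
```

Replacing $\sigma$ by its sample counterpart s gives the rule of thumb above.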
4.3 Boxplots
A boxplot is a display that provides information about the center, spread, and symmetry
(or skewness) of the data.

Construction of a skeletal boxplot
1. Draw a horizontal (or vertical) measurement scale.
2. Construct a rectangular box whose left (or lower) edge is at the lower quartile and
whose right (or upper) edge is at the upper quartile.
3. Draw a vertical (or horizontal) line segment inside the box at the location of the
median.
4. Extend horizontal (or vertical) line segments from each end of the box to the smallest
and largest observations in the data set. (These line segments are called whiskers.)
Example 4.6 A dentist is researching the average time that people spend brushing their teeth. A
sample of 21 brushing times is collected and listed below (in seconds).
 15   30  120   45   35   30   90
 60   45  335  240   50  135   75
120   15   30   30   45   60   30
The ordered observations are
 15   15   30   30   30   30   30
 35   45   45   45   50   60   60
 75   90  120  120  135  240  335
Five-number summary:
Smallest observation = 15
Median = $x_{((21+1)/2)} = x_{(11)}$ = ?
Lower quartile = median of the lower half = ?
Upper quartile = median of the upper half = ?
Largest observation = 335
[Skeletal boxplot drawn above a horizontal scale running from 0 to 350 in steps of 50]
Figure 4.1 Skeletal boxplot for the brushing times data
The median line is closer to the lower edge of the box than to the upper edge, suggesting
a concentration of values in the lower part of the middle half. The upper whisker is much
longer than the lower whisker, giving the impression of positive skewness.

Construction of a modified boxplot
We know that an outlier is an unusually small or large data value. But what does
“unusually small or large” mean? Here we give a more formal definition of outliers.
An outlier: an observation that is more than 1.5 iqr away from the closest end of the
box.
An extreme outlier: an outlier that is more than 3 iqr from the closest end of the box.
A mild outlier: an outlier that is not an extreme outlier.
A modified boxplot: a boxplot in which mild outliers are represented by shaded
circles, extreme outliers are represented by open circles, and whiskers extend on each end
to the most extreme observations that are not outliers.
Example 4.7 Draw a modified boxplot for the data set in Example 4.6.
Median = 45
Lower quartile = 30
Upper quartile = 105
iqr = 105 – 30 = 75
1.5 iqr = 1.5 × 75 = 112.5
3 iqr = 3 × 75 = 225
Thus,
Upper edge of box + 1.5 iqr = ? + 112.5 = ?
Lower edge of box – 1.5 iqr = ? – 112.5 = ?
So 240 and 335 are both outliers on the upper end (because they are greater than 217.5),
and there are no outliers on the lower end (because no observations are less than -82.5).
Since
Upper edge of the box + 3 iqr = ? + 225 = ?,
335 is an extreme outlier, and 240 is a mild outlier.
The upper whisker extends to the largest observation that is not an outlier, ?, and the
lower whisker extends to ?.
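The classification rules can be checked with a short Python sketch. The function `classify_outliers` is a hypothetical helper written for this example; it takes the quartiles stated above (30 and 105) as inputs rather than recomputing them, and it returns the mild outliers, the extreme outliers, and the two whisker endpoints.

```python
def classify_outliers(values, lower_quartile, upper_quartile):
    """Split observations into mild and extreme outliers and find the whisker ends."""
    iqr = upper_quartile - lower_quartile
    mild, extreme = [], []
    for x in sorted(values):
        # Distance beyond the closest end of the box (zero or negative inside the box)
        distance = max(lower_quartile - x, x - upper_quartile)
        if distance > 3 * iqr:
            extreme.append(x)
        elif distance > 1.5 * iqr:
            mild.append(x)
    non_outliers = [x for x in values if x not in mild and x not in extreme]
    # Whiskers reach the most extreme observations that are not outliers.
    return mild, extreme, (min(non_outliers), max(non_outliers))


times = [15, 30, 120, 45, 35, 30, 90, 60, 45, 335, 240, 50, 135, 75, 120,
         15, 30, 30, 45, 60, 30]
mild, extreme, whiskers = classify_outliers(times, lower_quartile=30, upper_quartile=105)
print("mild outliers:", mild)
print("extreme outliers:", extreme)
print("whiskers extend to:", whiskers)
```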
[Modified boxplot drawn above a horizontal scale running from 0 to 350 in steps of 50, with one point labeled as a mild outlier and one labeled as an extreme outlier]
Figure 4.2 Modified boxplot for the data in Example 4.6
4.4 Measures of Relative Standing
After you take a test, you probably want to know where your score stands among all the
scores on the test. Does your score place you among the top 10% of those who took the test, or
only among the top 30%? Such questions can be answered by measures of relative
standing.
The z score of a particular observation in a data set is
z score = (observation – mean) / standard deviation
The z score tells us how many standard deviations the observation is from the mean. It is
positive if the observation lies above the mean, and negative if the observation lies below
the mean.
Example 4.8 On a GRE test, a student scored 540 on the verbal section, where the mean was 520 and
the standard deviation was 20, and 750 on the math section, where the mean was 720 and the
standard deviation was 40. In which section did the student perform better?
Verbal z score = (? – ?) / ? = ?
Math z score = (? – ?) / ? = ?
Thus, the student performed better in the verbal section.
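The comparison can be reproduced with a few lines of Python. The function name `z_score` and the variable names are illustrative; the numbers are those given in Example 4.8.

```python
def z_score(observation, mean, standard_deviation):
    """Number of standard deviations the observation lies from the mean."""
    return (observation - mean) / standard_deviation


verbal = z_score(540, mean=520, standard_deviation=20)
math_section = z_score(750, mean=720, standard_deviation=40)
print("verbal z =", verbal, "math z =", math_section)
```

The section with the larger z score is the one in which the student stands higher relative to the other test takers.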
Another important measure of relative standing is the percentile.
For any number r between 0 and 100, the rth percentile is a value such that r percent of
the observations in the data set are less than or equal to that value.
The median is the ?th percentile, and the lower and upper quartiles are the ?th and ?th
percentiles, respectively.
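To relate a particular value to this definition, one can count the percentage of observations that are less than or equal to it. The sketch below does this for the brushing times of Example 4.6; `percent_at_or_below` is a hypothetical helper. In a small sample with repeated values the percentage will usually not land exactly on a round number, so the definition holds only approximately.

```python
def percent_at_or_below(values, threshold):
    """Percentage of observations that are less than or equal to threshold."""
    count = sum(1 for x in values if x <= threshold)
    return 100 * count / len(values)


times = [15, 30, 120, 45, 35, 30, 90, 60, 45, 335, 240, 50, 135, 75, 120,
         15, 30, 30, 45, 60, 30]
for value in (30, 45, 105):
    print(value, "->", round(percent_at_or_below(times, value), 1), "percent at or below")
```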