739 26–3 Numerical Description of Data

Stem and Leaf Plots
Make a stem and leaf plot for the data of
16. problem 1.
17. problem 2.
18. problem 3.
26–3 Numerical Description of Data
We saw in Sec. 26–2 how we can describe a set of data by a frequency distribution, a frequency
histogram, a frequency polygon, or a cumulative frequency distribution. We can also describe
a set of data with just a few numbers, and this more compact description is more convenient
for some purposes. For example, if, in a report, you wanted to describe the heights of a group
of students and did not want to give the entire frequency distribution, you might simply say:
The mean height is 58 inches, with a standard deviation of 3.5 inches.
The mean is a number that shows the centre of the data; the standard deviation is a measure of
the spread of the data. As mentioned earlier, the mean and the standard deviation may be found
either for an entire population (and are thus population parameters) or for a sample drawn from
that population (and are thus sample statistics).
Thus to describe a population or a sample, we need numbers that give both the centre of the
data and the spread. Further, for sample statistics, we need to give the uncertainty of each figure.
This will enable us to make inferences about the larger population.
◆◆◆
Example 15: A student recorded the running times for a sample of participants in a race.
From that sample she inferred (by methods we’ll learn later) that for the entire population of racers
mean time = 23.65 ± 0.84 minutes
standard deviation = 5.83 ± 0.34 minutes
◆◆◆
The mean time is called a measure of central tendency. We show how to calculate the mean,
and other measures of central tendency, later in this section. The standard deviation is called a
measure of dispersion. We also show how to compute it, and other measures of dispersion, in this
section. The “plus-or-minus” values show a degree of uncertainty called the standard error. The
uncertainty ±0.84 min is called the standard error of the mean, and the uncertainty ±0.34 min
is called the standard error of the standard deviation. We show how to calculate standard errors
in Sec. 26–6.
Measures of Central Tendency: The Mean
Some common measures of the centre of a distribution are the mean, the median, and the mode.
The arithmetic mean, or simply the mean, of a set of measurements is equal to the sum of the
measurements divided by the number of measurements. It is what we commonly call the average, and it is the most commonly used measure of central tendency.
(Margin note: There is a story, probably untrue, about a statistician who drowned in a lake that had an average depth of 1 ft (30 cm).)
If our n measurements are $x_1, x_2, \ldots, x_n$, then
$$\sum x = x_1 + x_2 + \cdots + x_n$$
740
Chapter 26
◆
Introduction to Statistics and Probability
where we use the Greek symbol $\Sigma$ (sigma) to represent “the sum of.” The mean, which we call
$\bar{x}$ (read “x bar”), is then given by the following:
Arithmetic Mean (Eq. 249):
$$\bar{x} = \frac{\sum x}{n}$$
The arithmetic mean of n measurements is the sum of those measurements divided by n.
(Margin note: $\Sigma$ is the Greek capital letter sigma.)
We can calculate the mean for a sample or for the entire population. However, we use a
different symbol in each case.
$\bar{x}$ is the sample mean
$\mu$ (mu) is the population mean
◆◆◆
Example 16: Find the mean of the following sample:
746 574 645 894 736 695 635 794
Solution: Adding the values gives
$$\sum x = 746 + 574 + 645 + 894 + 736 + 695 + 635 + 794 = 5719$$
Then, with $n = 8$,
$$\bar{x} = \frac{5719}{8} \approx 715$$
rounded to three significant digits.
◆◆◆
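The calculation in Example 16 can be checked with a few lines of Python. This is only a minimal sketch; the function name `mean` and the variable names are our own choices for illustration and do not come from the text.

```python
# Sample from Example 16.
sample = [746, 574, 645, 894, 736, 695, 635, 794]


def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


total = sum(sample)   # the sum of the measurements
x_bar = mean(sample)  # the sample mean

print(total)          # 5719
print(x_bar)          # 714.875, which rounds to 715
```

The unrounded result, 714.875, agrees with the value of 715 given in the example once it is rounded to three significant digits.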
Weighted Mean
When not all of the data are of equal importance, we may compute a weighted mean, where the
values are weighted according to their importance.
◆◆◆
Example 17: A student has grades of 83, 59, and 94 on 3 one-hour exams, a grade of 82 on
the final exam which is equal in weight to 2 one-hour exams, and a grade of 78 on a laboratory
report, which is worth 1.5 one-hour exams. Compute the weighted mean.
Solution: If a one-hour exam is assigned a weight of 1, then we have a total of the weights of
$$\sum w = 1 + 1 + 1 + 2 + 1.5 = 6.5$$
To get a weighted mean, we add the products of each grade and its weight, and divide by the
total weight.
$$\text{weighted mean} = \frac{83(1) + 59(1) + 94(1) + 82(2) + 78(1.5)}{6.5} = 79.5$$
◆◆◆
In general, the weighted mean is given by the following:
$$\text{weighted mean} = \frac{\sum (wx)}{\sum w}$$
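As a quick check of Example 17, the weighted mean can be sketched in Python as follows. The function name `weighted_mean` and its argument names are assumptions made for this illustration.

```python
def weighted_mean(values, weights):
    """Sum of (weight * value) divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("each value needs exactly one weight")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Grades and weights from Example 17: three one-hour exams (weight 1 each),
# the final exam (weight 2), and the laboratory report (weight 1.5).
grades = [83, 59, 94, 82, 78]
weights = [1, 1, 1, 2, 1.5]

print(round(weighted_mean(grades, weights), 1))  # 79.5
```

If all weights are equal, this reduces to the ordinary arithmetic mean.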
Midrange
We have already noted that the range of a set of data is the difference between the highest
and the lowest numbers in the set. The midrange is simply the value midway between the two
extreme values.
$$\text{midrange} = \frac{\text{highest value} + \text{lowest value}}{2}$$
◆◆◆
Example 18: The midrange for the values
3, 5, 6, 6, 7, 11, 11, 15
is
$$\text{midrange} = \frac{3 + 15}{2} = 9$$
◆◆◆
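The range and the midrange both depend only on the two extreme values, as this short Python sketch for the data of Example 18 shows. The function names `data_range` and `midrange` are hypothetical names chosen for this illustration.

```python
def data_range(values):
    """Difference between the highest and the lowest value."""
    return max(values) - min(values)


def midrange(values):
    """Value midway between the highest and the lowest value."""
    return (max(values) + min(values)) / 2


data = [3, 5, 6, 6, 7, 11, 11, 15]

print(midrange(data))    # 9.0
print(data_range(data))  # 12
```

Because only the extremes are used, a single unusually large or small value shifts the midrange noticeably.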
Mode
Our next measure of central tendency is the mode.
Mode (Eq. 251): The mode of a set of numbers is the value(s) that occur most often in the set.
A set of numbers can have no mode, one mode, or more than one mode.
◆◆◆
Example 19:
(a) The set 1 3 3 5 6 7 7 7 9 has the mode 7.
(b) The set 1 1 3 3 5 5 7 7 has no mode.
(c) The set 1 3 3 3 5 6 7 7 7 9 has two modes, 3 and 7. It is called bimodal.
◆◆◆
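The three cases of Example 19 can be reproduced with the Python sketch below. The function name `modes` is our own, and the rule that a set has no mode when every distinct value occurs equally often is an assumption inferred from part (b) of the example, not a definition stated in the text.

```python
from collections import Counter


def modes(values):
    """Return the list of modes (possibly empty) of a data set."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    # Assumption (from Example 19b): if several distinct values all occur
    # equally often, no value stands out, so the set has no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([1, 3, 3, 5, 6, 7, 7, 7, 9]))     # [7]
print(modes([1, 1, 3, 3, 5, 5, 7, 7]))        # []
print(modes([1, 3, 3, 3, 5, 6, 7, 7, 7, 9]))  # [3, 7]
```

Returning a list lets one function cover all three outcomes: no mode, one mode, or more than one mode.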
Median
To find the median, we simply arrange the data in order of magnitude and pick the middle value.
For an even number of measurements, we take the mean of the two middle values.
Median (Eq. 250): The median of a set of numbers arranged in order of
magnitude is the middle value of an odd number of
measurements, or the mean of the two middle values
of an even number of measurements.
◆◆◆
Example 20: Find the median of the data in Example 16.
Solution: We rewrite the data in order of magnitude.
574 635 645 695 736 746 794 894
The two middle values are 695 and 736. Taking their mean gives
$$\text{median} = \frac{695 + 736}{2} = 715.5$$
◆◆◆
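The procedure of Example 20 (sort, then take the middle value or the mean of the two middle values) can be sketched in Python as follows. The function name `median` is our own choice for this illustration.

```python
def median(values):
    """Middle value of the sorted data, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd count: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: mean of the middle pair


# Data of Example 16, in the order originally given.
sample = [746, 574, 645, 894, 736, 695, 635, 794]

print(median(sample))  # 715.5
```

Note that the function sorts a copy of the data, so the original order of the measurements does not matter.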
Five-Number Summary
The median of that half of a set of data from the minimum value up to and including the
median is called the lower hinge. Similarly, the median of the upper half of the data is called
the upper hinge.
The lowest value in a set of data, together with the highest value, the median, and the two
hinges, is called a five-number summary of the data.
◆◆◆
Example 21: For the data
12 17 18 20 22 28 32 34 49 52 59 66
the median is $(28 + 32)/2 = 30$, so, counting the median in each half, the ordered data are
12 17 18 20 22 28 30 30 32 34 49 52 59 66
where 12 is the minimum value, 20 is the lower hinge, 30 is the median, 49 is the upper hinge, and 66 is the maximum value.
The five-number summary for this set of data may be written
[12, 20, 30, 49, 66]
◆◆◆
[FIGURE 26–6 A boxplot, also called a box and whisker diagram, marking Min = 12, Hinge = 20, Median = 30, Hinge = 49, and Max = 66.]
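The five-number summary of Example 21 can be reproduced with the Python sketch below. The function names are ours. The treatment of the hinges follows the definition given above (each half includes the median); for an even number of values, where the median is not itself a data value, we assume that the computed median is added to each half, as the worked example does.

```python
def median(values):
    """Middle value of the sorted data, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return [minimum, lower hinge, median, upper hinge, maximum]."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the data set is empty")
    med = median(ordered)
    half = n // 2
    if n % 2 == 1:
        # Odd count: the median is a data value and belongs to both halves.
        lower = ordered[:half + 1]
        upper = ordered[half:]
    else:
        # Even count (assumption, following Example 21): add the computed
        # median to each half before finding the hinges.
        lower = ordered[:half] + [med]
        upper = [med] + ordered[half:]
    return [ordered[0], median(lower), med, median(upper), ordered[-1]]


data = [12, 17, 18, 20, 22, 28, 32, 34, 49, 52, 59, 66]

print(five_number_summary(data))  # [12, 20, 30.0, 49, 66]
```

These five values are exactly the ones a boxplot displays: the ends of the whiskers, the ends of the box, and the line inside the box.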
Boxplots
A graph of the values in the five-number summary is called a boxplot.
◆◆◆
Example 22: A boxplot for the data of Example 21 is given in Fig. 26–6.
◆◆◆
Measures of Dispersion
We will usually use the mean to describe a set of numbers, but used alone it can be misleading.
◆◆◆
Example 23: The set of numbers
1 1 1 1 1 1 1 1 1 91
and the set
10 10 10 10 10 10 10 10 10 10
each have a mean value of 10 but are otherwise quite different. Each set has a sum of 100. In
the first set most of this sum is concentrated in a single value, but in the second set the sum is
dispersed among all the values.
◆◆◆
Thus we need some measure of dispersion of a set of numbers. Four common ones are the
range, the percentile range, the variance, and the standard deviation. We will cover each of these.
Range
We have already introduced the range in Sec. 26–2, and we define it here.
Range (Eq. 252): The range of a set of numbers is the difference
between the largest and smallest numbers in the set.
◆◆◆
Example 24: For the set of numbers
6 3 9 12 44 2 53 1 8
the largest value is 53 and the smallest is 1, so the range is
$$\text{range} = 53 - 1 = 52$$
◆◆◆
We sometimes give the range by stating the end values themselves. Thus, in Example 24
we might say that the range is from 1 to 53.
Quartiles, Deciles, and Percentiles
We have seen that for a set of data arranged in order of magnitude, the median divides the data
into two equal parts. There are as many numbers below the median as there are above it.
Similarly, we can determine two more values that divide each half of the data in half again.
We call these values quartiles. Thus one-fourth of the values will fall into each quartile. The
quartiles are labelled $Q_1$, $Q_2$, and $Q_3$. Thus $Q_2$ is the median.
Those values that divide the data into 10 equal parts we call deciles and label $D_1, D_2, \ldots$.
Those values that divide the data into 100 equal parts are percentiles, labelled $P_1, P_2, \ldots$. Thus
the 25th percentile is the same as the first quartile ($P_{25} = Q_1$). The median is the 50th percentile,
the fifth decile, and the second quartile ($P_{50} = D_5 = Q_2$). The 70th percentile is the seventh
decile ($P_{70} = D_7$).
One measure of dispersion sometimes used is to give the range of values occupied by
some given percentiles. Thus we have a quartile range, decile range, and percentile range. For
example, the quartile range is the range from the first to the third quartile.
◆◆◆
Example 25: Find the quartile range for the set of data
2 4 5 7 8 11 15 18 19 21 24 25
Solution: The quartiles are
$$Q_1 = \frac{5 + 7}{2} = 6$$
$$Q_2 = \text{the median} = \frac{11 + 15}{2} = 13$$
$$Q_3 = \frac{19 + 21}{2} = 20$$
Then
$$\text{quartile range} = Q_3 - Q_1 = 20 - 6 = 14$$
(Margin note: Note that the quartiles are not in the same locations as the upper and lower hinges. The hinges here would be at 7 and 19.)
◆◆◆
We see that half the values in Example 25 fall within the quartile range. We may similarly
compute the range of any percentiles or deciles. The range from the 10th to the 90th percentile
is the one commonly used.
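The quartiles of Example 25 can be computed with the Python sketch below. It follows the method used in the example: $Q_2$ is the median, and $Q_1$ and $Q_3$ are the medians of the lower and upper halves of the ordered data. The function names are ours, and leaving the middle value out of both halves when the number of values is odd is an assumption; the text works only an even-sized example, and other conventions for quartiles exist.

```python
def median(values):
    """Middle value of the sorted data, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) as the medians of the lower half, all data, upper half."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("at least two values are needed")
    half = n // 2
    lower = ordered[:half]
    # Assumption: for an odd count, the middle value is left out of both halves.
    upper = ordered[half + (n % 2):]
    return median(lower), median(ordered), median(upper)


data = [2, 4, 5, 7, 8, 11, 15, 18, 19, 21, 24, 25]
q1, q2, q3 = quartiles(data)

print(q1, q2, q3)  # 6.0 13.0 20.0
print(q3 - q1)     # 14.0, the quartile range

# Half of the values lie between the first and third quartiles.
inside = [x for x in data if q1 < x < q3]
print(len(inside), "of", len(data))  # 6 of 12
```

The last lines confirm the remark above that half of the values fall within the quartile range.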
Variance
We now give another measure of dispersion called the variance, but we must first mention
deviation. We define the deviation of any number x in a set of data as the difference between
that number and the mean $\bar{x}$ of that set of data.
◆◆◆
Example 26: A certain set of measurements has a mean of 48.3. What are the deviations
of the values 24.2 and 69.3 in that set?
Solution: The deviation of 24.2 is
$$24.2 - 48.3 = -24.1$$
and the deviation of 69.3 is
$$69.3 - 48.3 = 21.0$$
◆◆◆
To get the variance of a population of n numbers, we add up the squares of the deviations
of each number in the set and divide by n.
Population Variance (Eq. 253):
$$\sigma^2 = \frac{\sum (x - \bar{x})^2}{n}$$
(Margin note: $\sigma$ is the Greek small letter sigma.)
To find the variance of a sample, it is more accurate to divide by $n - 1$ rather than $n$. As
with the mean, we use one symbol for the sample variance and a different symbol for the population variance.
$s^2$ is the sample variance
$\sigma^2$ is the population variance
Sample Variance (Eq. 254):
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
(Margin note: For large samples (over 30 or so), the variance found by either formula is practically the same.)
◆◆◆
Example 27: Compute the variance for the population
1.74 2.47 3.66 4.73 5.14 6.23 7.29 8.93 9.56
Solution: We first compute the mean, $\bar{x}$.
$$\bar{x} = \frac{1.74 + 2.47 + 3.66 + 4.73 + 5.14 + 6.23 + 7.29 + 8.93 + 9.56}{9} = 5.53$$
We then subtract the mean from each of the nine values to obtain deviations. The deviations are
then squared and added, as shown in Table 26–6.
TABLE 26–6
Measurement      Deviation        Deviation Squared
x                x − x̄            (x − x̄)²
1.74             −3.79            14.36
2.47             −3.06             9.36
3.66             −1.87             3.50
4.73             −0.80             0.64
5.14             −0.39             0.15
6.23              0.70             0.49
7.29              1.76             3.10
8.93              3.40            11.56
9.56              4.03            16.24
Σx = 49.75       Σ(x − x̄) = −0.02     Σ(x − x̄)² = 59.41
The variance $\sigma^2$ is then
$$\sigma^2 = \frac{59.41}{9} = 6.60$$
◆◆◆
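The two variance formulas differ only in the divisor, which the following Python sketch makes explicit using the data of Example 27. The function names `population_variance` and `sample_variance` are our own. The sketch uses the unrounded mean, so its intermediate sums differ slightly from the rounded entries of Table 26–6, but the final result agrees to three significant digits.

```python
def population_variance(values):
    """Sum of squared deviations from the mean, divided by n."""
    n = len(values)
    if n == 0:
        raise ValueError("the data set is empty")
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / n


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("a sample variance needs at least two values")
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / (n - 1)


population = [1.74, 2.47, 3.66, 4.73, 5.14, 6.23, 7.29, 8.93, 9.56]

print(round(population_variance(population), 2))  # 6.6
# For comparison only: the same nine values treated as a sample.
print(round(sample_variance(population), 2))      # 7.43
```

Python's standard `statistics` module provides `pvariance` and `variance`, which implement the same two definitions.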
Section 26–3
◆
745
Numerical Description of Data
Standard Deviation
Once we have the variance, it is a simple matter to get the standard deviation. It is the most
common measure of dispersion.
Standard Deviation (Eq. 255): The standard deviation of a set of numbers is the
positive square root of the variance.
[FIGURE 26–7 Six data sets of 12 numbers each. For each set the figure shows a relative frequency histogram (relative frequency versus x) and the population standard deviation σ.]
Data                                   Relative frequency of each value     σ
6 6 6 6 6 6 6 6 6 6 6 6                1                                    0
5 5 5 5 5 5 6 6 6 6 6 6                1/2                                  0.50
5 5 5 5 6 6 6 6 7 7 7 7                1/3                                  0.82
4 4 4 5 5 5 6 6 6 7 7 7                1/4                                  1.12
3 3 4 4 5 5 6 6 7 7 8 8                1/6                                  1.71
1 2 3 4 5 6 7 8 9 10 11 12             1/12                                 3.45
As before, we use $s$ for the sample standard deviation and $\sigma$ for the population standard
deviation.
◆◆◆
Example 28: Find the standard deviation for the data of Example 27.
Solution: We have already found the variance in Example 27.
$$\sigma^2 = 6.60$$
Taking the square root gives the standard deviation.
$$\sigma = \sqrt{6.60} = 2.57$$
◆◆◆
To get an intuitive feel for the standard deviation, we have computed it in Fig. 26–7 for
several data sets consisting of 12 numbers that can range from 1 to 12. In the first set, all of the
numbers have the same value, 6, and in the last set every number is different. The data sets in
between have differing amounts of repetition. To the right of each data set are a relative frequency histogram and the population standard deviation (computation not shown).
Note that the most compact distribution has the lowest standard deviation, and as the distribution spreads, the standard deviation increases.
For a final demonstration, let us again take 12 numbers in two groups of six equal values,
as shown in Fig. 26–8. Now let us separate the two groups, first by one interval and then by
two intervals. Again, notice that the standard deviation increases as the data move further from
the mean.
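[FIGURE 26–8 Three data sets of 12 numbers, each made up of two groups of six equal values. For each set the figure shows a relative frequency histogram (relative frequency 1/2 at each of the two values of x) and the population standard deviation σ.]
Data                                   σ
5 5 5 5 5 5 6 6 6 6 6 6                0.50
4 4 4 4 4 4 6 6 6 6 6 6                1.0
4 4 4 4 4 4 7 7 7 7 7 7                1.50
(Margin note: The mean shifts slightly from one distribution to the next, but this does not affect our conclusions.)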
From these two demonstrations we may conclude that the standard deviation increases
whenever data move away from the mean.
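The standard deviations quoted in Example 28 and in Figs. 26–7 and 26–8 can be reproduced with the short Python sketch below. The function name `population_std` and the list names are our own choices for this illustration.

```python
import math


def population_std(values):
    """Positive square root of the population variance."""
    n = len(values)
    if n == 0:
        raise ValueError("the data set is empty")
    m = sum(values) / n
    return math.sqrt(sum((x - m) ** 2 for x in values) / n)


# Population of Example 27 (standard deviation found in Example 28).
example_27 = [1.74, 2.47, 3.66, 4.73, 5.14, 6.23, 7.29, 8.93, 9.56]
print(round(population_std(example_27), 2))  # 2.57

# The six data sets of Fig. 26-7, from most compact to most spread out.
figure_26_7 = [
    [6] * 12,
    [5] * 6 + [6] * 6,
    [5] * 4 + [6] * 4 + [7] * 4,
    [4] * 3 + [5] * 3 + [6] * 3 + [7] * 3,
    [3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8],
    list(range(1, 13)),
]
print([round(population_std(d), 2) for d in figure_26_7])
# [0.0, 0.5, 0.82, 1.12, 1.71, 3.45]

# The three data sets of Fig. 26-8: two groups moved steadily apart.
figure_26_8 = [
    [5] * 6 + [6] * 6,
    [4] * 6 + [6] * 6,
    [4] * 6 + [7] * 6,
]
print([round(population_std(d), 2) for d in figure_26_8])
# [0.5, 1.0, 1.5]
```

In both lists the standard deviation never decreases from one set to the next, which is the pattern the two figures are meant to show.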
Exercise 3 ◆ Numerical Description of Data
Mean
1. Find the mean of the following set of grades:
85 74 69 59 60 96 84 48 89 76 96 68 98 79 76
2. Find the mean of the following set of weights:
173 127 142 164 163 153 116 199
3. Find the mean of the weights in problem 1 of Exercise 2.
4. Find the mean of the times in problem 2 of Exercise 2.
5. Find the mean of the prices in problem 3 of Exercise 2.
Weighted Mean
6. A student’s grades and the weight of each grade are given in the following table. Find their
weighted mean.
               Grade    Weight
Hour exam       83        5
Hour exam       74        5
Quiz            93        1
Final exam      79       10
Report          88        7
7. A student receives hour-test grades of 86, 92, 68, and 75, a final exam grade of 82, and
a project grade of 88. Find the weighted mean if each hour-test counts for 15% of his
grade, the final exam counts for 30%, and the project counts for 10%.
Midrange
8. Find the midrange of the grades in problem 1.
9. Find the midrange of the weights in problem 2.
Mode
10. Find the mode of the grades in problem 1.
11. Find the mode of the weights in problem 2.
12. Find the mode of the weights in problem 1 of Exercise 2.
13. Find the mode of the times in problem 2 of Exercise 2.
14. Find the mode of the prices in problem 3 of Exercise 2.
Median
15. Find the median of the grades in problem 1.
16. Find the median of the weights in problem 2.
17. Find the median of the weights in problem 1 of Exercise 2.
18. Find the median of the times in problem 2 of Exercise 2.
19. Find the median of the prices in problem 3 of Exercise 2.
Five-Number Summary
20. Give the five-number summary for the grades in problem 1.
21. Give the five-number summary for the weights in problem 2.
Boxplot
22. Make a boxplot using the results of problem 20.
23. Make a boxplot using the results of problem 21.
Range
24. Find the range of the grades in problem 1.
25. Find the range of the weights in problem 2.
Percentiles
26. Find the quartiles and give the quartile range of the following data:
28 39 46 53 69 71 83 94 102 117 126
27. Find the quartiles and give the quartile range of the following data:
1.33 2.28 3.59 4.96 5.23 6.89 7.91 8.13 9.44 10.6 11.2 12.3
Variance and Standard Deviation
28. Find the variance and standard deviation of the grades in problem 1. Assume that these
grades are a sample drawn from a larger population.
29. Find the variance and standard deviation of the weights in problem 2. Assume that these
weights are a sample drawn from a larger population.
30. Find the population variance and standard deviation of the weights in problem 1 of Exercise 2.
31. Find the population variance and standard deviation of the times in problem 2 of Exercise 2.
32. Find the population variance and standard deviation of the prices in problem 3 of Exercise 2.
26–4 Introduction to Probability
Why do we need probability to learn statistics? In Sec. 26–3 we learned how to compute
certain sample statistics, such as the mean and the standard deviation. Knowing, for example,
that the mean height $\bar{x}$ of a sample of students is 69.5 in., we might infer that the mean
height of the entire population is 69.5 in. But how reliable is that number? Is $\mu$ equal to
69.5 in. exactly? We’ll see later that we give population parameters such as $\mu$ as a range of
values, say, 69.5 ± 0.6 in., and we state the probability that the true mean lies within that
range. We might say, for example, that there is a 68% chance that the true value lies within
the stated range.