2-2 Frequency Distributions
When working with large data sets, it is often helpful to organize and summarize
the data by constructing a table that lists the different possible data values (either
individually or by groups) along with the corresponding frequencies, which represent the number of times those values occur.
Definition
A frequency distribution lists data values (either individually or by groups of
intervals), along with their corresponding frequencies (or counts).
Table 2-2 is a frequency distribution summarizing the measured cotinine levels of the 40 smokers listed in Table 2-1. The frequency for a particular class is
the number of original values that fall into that class. For example, the first class
in Table 2-2 has a frequency of 11, indicating that 11 of the original data values
are between 0 and 99 inclusive.
We first present some standard terms used in discussing frequency distributions, and then we describe how to construct and interpret them.
Definitions
Lower class limits are the smallest numbers that can belong to the different
classes. (Table 2-2 has lower class limits of 0, 100, 200, 300, and 400.)
Upper class limits are the largest numbers that can belong to the different
classes. (Table 2-2 has upper class limits of 99, 199, 299, 399, and 499.)
Class boundaries are the numbers used to separate classes, but without the gaps
created by class limits. They are obtained as follows: Find the size of the gap between the upper class limit of one class and the lower class limit of the next
class. Add half of that amount to each upper class limit to find the upper class
boundaries; subtract half of that amount from each lower class limit to find the
lower class boundaries. (Table 2-2 has gaps of exactly 1 unit, so 0.5 is added to
the upper class limits and subtracted from the lower class limits. The first class
An Addison-Wesley product. Copyright (c) 2004 Pearson Education.
Table 2-2 Frequency Distribution of Cotinine Levels of Smokers

Cotinine    Frequency
0–99        11
100–199     12
200–299     14
300–399     1
400–499     2
CHAPTER 2
Describing, Exploring, and Comparing Data
has boundaries of -0.5 and 99.5, the second class has boundaries of 99.5 and
199.5, and so on. The complete list of boundaries used for all classes is -0.5,
99.5, 199.5, 299.5, 399.5, and 499.5.)
Class midpoints are the midpoints of the classes. (Table 2-2 has class midpoints
of 49.5, 149.5, 249.5, 349.5, and 449.5.) Each class midpoint can be found by
adding the lower class limit to the upper class limit and dividing the sum by 2.
Class width is the difference between two consecutive lower class limits or two
consecutive lower class boundaries. (Table 2-2 uses a class width of 100.)
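The definitions above can be checked with a short calculation. The following is a minimal sketch in Python that uses the class limits of Table 2-2; the function name class_summary is our own choice for illustration, and the sketch assumes that all classes have the same width and the same gap between them.

```python
# Class limits taken from Table 2-2.
lower_limits = [0, 100, 200, 300, 400]
upper_limits = [99, 199, 299, 399, 499]

def class_summary(lower_limits, upper_limits):
    """Return (boundaries, midpoints, width), assuming equal-width classes."""
    # Gap between the upper limit of one class and the lower limit of the next.
    gap = lower_limits[1] - upper_limits[0]
    half = gap / 2
    # The first lower boundary, followed by every upper boundary.
    boundaries = [lower_limits[0] - half] + [u + half for u in upper_limits]
    # Midpoint: (lower class limit + upper class limit) / 2.
    midpoints = [(lo + hi) / 2 for lo, hi in zip(lower_limits, upper_limits)]
    # Width: difference between two consecutive lower class limits.
    width = lower_limits[1] - lower_limits[0]
    return boundaries, midpoints, width

boundaries, midpoints, width = class_summary(lower_limits, upper_limits)
print(boundaries)  # [-0.5, 99.5, 199.5, 299.5, 399.5, 499.5]
print(midpoints)   # [49.5, 149.5, 249.5, 349.5, 449.5]
print(width)       # 100
```

Note that the width is computed from consecutive lower class limits, so it is 100 and not the 99 that results from subtracting a lower class limit from its own upper class limit.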
Growth Charts Updated
Pediatricians typically use standardized growth charts to compare their patients’ weights
and heights to a sample of other children. Children are considered to be in the normal range
if their weight and height fall between the 5th and 95th percentiles. If they fall outside of
that range, they are often given tests to ensure that there are no serious medical problems.
Pediatricians became increasingly aware of a major problem with the charts: Because they were
based on children living between 1929 and 1975, the growth charts were found to be inaccurate.
To rectify this problem, the charts were updated in 2000 to reflect the current measurements
of millions of children. The weights and heights of children are good examples of populations
that change over time. This is the reason for including changing characteristics of data over
time as an important consideration for a population.
The definitions of class width and class boundaries are tricky. Be careful to
avoid the easy mistake of making the class width the difference between the lower
class limit and the upper class limit. See Table 2-2 and note that the class width is
100, not 99. You can simplify the process of finding class boundaries by understanding that they basically fill the gaps between classes by splitting the difference
between the end of one class and the beginning of the next class.
Procedure for Constructing a Frequency Distribution
Frequency distributions are constructed for these reasons: (1) Large data sets can
be summarized, (2) we can gain some insight into the nature of data, and (3) we
have a basis for constructing important graphs (such as histograms, introduced in
the next section). Many uses of technology allow us to automatically obtain frequency distributions without manually constructing them, but here is the basic
procedure:
1. Decide on the number of classes you want. The number of classes should be
between 5 and 20, and the number you select might be affected by the convenience of using round numbers.
2. Calculate
$$\text{class width} \approx \frac{(\text{highest value}) - (\text{lowest value})}{\text{number of classes}}$$
Round this result to get a convenient number. (Usually round up.) You might
need to change the number of classes, but the priority should be to use values
that are easy to understand.
3. Starting point: Begin by choosing a number for the lower limit of the first
class. Choose either the lowest data value or a convenient value that is a little
smaller.
4. Using the lower limit of the first class and the class width, proceed to list the
other lower class limits. (Add the class width to the starting point to get the
second lower class limit. Add the class width to the second lower class limit to
get the third, and so on.)
5. List the lower class limits in a vertical column and proceed to enter the upper
class limits, which can be easily identified.
6. Go through the data set putting a tally in the appropriate class for each data
value. Use the tally marks to find the total frequency for each class.
When constructing a frequency distribution, be sure that classes do not overlap so
that each of the original values must belong to exactly one class. Include all
classes, even those with a frequency of zero. Try to use the same width for all
classes, although it is sometimes impossible to avoid open-ended intervals, such
as “65 years or older.”
EXAMPLE Cotinine Levels of Smokers Using the 40
cotinine levels for the smokers in Table 2-1, follow the above procedure to construct the frequency distribution shown in Table 2-2.
Assume that you want 5 classes.
SOLUTION
Step 1: Begin by selecting 5 as the number of desired classes.
Step 2: Calculate the class width. In the following calculation, 98.2 is rounded
up to 100, which is a more convenient number.
$$\text{class width} \approx \frac{(\text{highest value}) - (\text{lowest value})}{\text{number of classes}} = \frac{491 - 0}{5} = 98.2 \approx 100$$
Step 3: We choose a starting point of 0, which is the lowest value in the list
and is also a convenient number.
Step 4: Add the class width of 100 to the starting point of 0 to determine that
the second lower class limit is 100. Continue to add the class width of
100 to get the remaining lower class limits of 200, 300, and 400.
Step 5: List the lower class limits vertically, as shown in the margin. From
this list, we can easily identify the corresponding upper class limits
as 99, 199, 299, 399, and 499.
Step 6: After identifying the lower and upper limits of each class, proceed to
work through the data set by entering a tally mark for each value.
When the tally marks are completed, add them to find the frequencies
shown in Table 2-2.
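The six steps can also be sketched in code. The following Python sketch is one possible way to carry them out, not the only one. The function names are our own, the rounding of 98.2 up to 100 is left to the reader's judgment as in Step 2, and the list named sample contains a few hypothetical values used only to exercise the tally (it is not the data of Table 2-1). The sketch assumes whole-number data, so that consecutive classes are separated by a gap of 1.

```python
def raw_class_width(highest, lowest, num_classes):
    """Step 2: (highest value - lowest value) / number of classes, before rounding."""
    return (highest - lowest) / num_classes

def class_limits(start, width, num_classes, gap=1):
    """Steps 3-5: list (lower limit, upper limit) pairs for each class."""
    lowers = [start + i * width for i in range(num_classes)]
    # With whole-number data, each upper limit is 1 below the next lower limit.
    return [(lo, lo + width - gap) for lo in lowers]

def tally(values, limits):
    """Step 6: count how many values fall into each class."""
    counts = [0] * len(limits)
    for v in values:
        for i, (lo, hi) in enumerate(limits):
            if lo <= v <= hi:
                counts[i] += 1
                break
        else:
            # Every value must belong to exactly one class.
            raise ValueError(f"{v} does not fall in any class")
    return counts

raw_width = raw_class_width(highest=491, lowest=0, num_classes=5)
print(raw_width)  # 98.2, which we round up to the convenient value 100

limits = class_limits(start=0, width=100, num_classes=5)
print(limits)  # [(0, 99), (100, 199), (200, 299), (300, 399), (400, 499)]

sample = [0, 87, 130, 250, 491]  # hypothetical values, for illustration only
print(tally(sample, limits))  # [2, 1, 1, 0, 1]
```

Classes with a frequency of zero remain in the result, in keeping with the advice to include all classes.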
Relative Frequency Distribution
An important variation of the basic frequency distribution uses relative frequencies, which are easily found by dividing each class frequency by the total of all
frequencies. A relative frequency distribution includes the same class limits as a
frequency distribution, but relative frequencies are used instead of actual frequencies. The relative frequencies are sometimes expressed as percents.
$$\text{relative frequency} = \frac{\text{class frequency}}{\text{sum of all frequencies}}$$
In Table 2-3 the actual frequencies from Table 2-2 are replaced by the corresponding relative frequencies expressed as percents. The first class has a relative
frequency of 11/40 = 0.275, or 27.5%, which is often rounded to 28%. The
second class has a relative frequency of 12/40 = 0.3, or 30.0%, and so on. If constructed correctly, the sum of the relative frequencies should total 1 (or 100%),
with some small discrepancies allowed for rounding errors. Because 27.5% was
rounded to 28% and 2.5% was rounded to 3%, the sum of the relative frequencies
in Table 2-3 is 101% instead of 100%.

Margin list for Step 5 of the preceding example (lower class limits listed vertically): 0, 100, 200, 300, 400

Table 2-3 Relative Frequency Distribution of Cotinine Levels in Smokers

Cotinine    Relative Frequency
0–99        28%
100–199     30%
200–299     35%
300–399     3%
400–499     5%
Because they use simple proportions or percentages, relative frequency distributions make it easier for us to understand the distribution of the data and to compare different sets of data.
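As a brief illustration, the relative frequencies can be computed from the frequencies of Table 2-2 with the following Python sketch. The function names are our own. Because Python's built-in round sends 2.5 to 2 rather than 3, the sketch uses integer arithmetic to round halves upward, which we assume here in order to match the whole percents described in the text.

```python
frequencies = [11, 12, 14, 1, 2]  # class frequencies from Table 2-2

def relative_frequencies(freqs):
    """Divide each class frequency by the sum of all frequencies."""
    total = sum(freqs)
    return [f / total for f in freqs]

def whole_percents(freqs):
    """Relative frequencies as whole percents, with halves rounded up."""
    total = sum(freqs)
    # (200*f + total) // (2*total) equals floor(100*f/total + 0.5), with no float error.
    return [(200 * f + total) // (2 * total) for f in freqs]

rel = relative_frequencies(frequencies)
print(rel[0], rel[1])  # 0.275 0.3

percents = whole_percents(frequencies)
print(percents)       # [28, 30, 35, 3, 5]
print(sum(percents))  # 101, a small discrepancy caused by rounding
```

The unrounded relative frequencies still sum to 1; only the rounded percents produce the total of 101%.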
Authors Identified
In 1787–88 Alexander Hamilton, John Jay, and James Madison anonymously published the
famous Federalist Papers in an attempt to convince New Yorkers that they should ratify the
Constitution. The identity of most of the papers’ authors became known, but the authorship
of 12 of the papers was contested. Through statistical analysis of the frequencies of various
words, we can now conclude that James Madison is the likely author of these 12 papers. For
many of the disputed papers, the evidence in favor of Madison’s authorship is overwhelming
to the degree that we can be almost certain of being correct.
Cumulative Frequency Distribution
Another variation of the standard frequency distribution is used when cumulative
totals are desired. The cumulative frequency for a class is the sum of the frequencies for that class and all previous classes. Table 2-4 is the cumulative frequency distribution based on the frequency distribution of Table 2-2. Using the
original frequencies of 11, 12, 14, 1, and 2, we add 11 + 12 to get the second cumulative frequency of 23, then we add 11 + 12 + 14 = 37 to get the third, and so
on. See Table 2-4 and note that in addition to using cumulative frequencies, the
class limits are replaced by “less than” expressions that describe the new range of
values.
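The running totals can be produced with a few lines of Python. This is a minimal sketch that uses the frequencies of Table 2-2; the variable names are our own, and the "less than" labels are built by assuming the lower class limits 0, 100, 200, 300, 400 and the class width of 100.

```python
from itertools import accumulate

frequencies = [11, 12, 14, 1, 2]          # from Table 2-2
lower_limits = [0, 100, 200, 300, 400]    # from Table 2-2
class_width = 100

# Each cumulative frequency is the sum of that class and all previous classes.
cumulative = list(accumulate(frequencies))

# Each class is described as "less than" the lower limit of the next class.
labels = [f"Less than {lo + class_width}" for lo in lower_limits]

for label, total in zip(labels, cumulative):
    print(label, total)
# Less than 100 11
# Less than 200 23
# Less than 300 37
# Less than 400 38
# Less than 500 40
```

The last cumulative frequency always equals the total number of data values, which gives a quick check of the arithmetic.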
Critical Thinking: Interpreting Frequency Distributions
The transformation of raw data to a frequency distribution is typically a means to
some greater end. The following examples illustrate how frequency distributions
can be used to describe, explore, and compare data sets. (The following section
shows how the construction of a frequency distribution is often the first step in the
creation of a graph that visually depicts the nature of the distribution.)
EXAMPLE Describing Data Refer to Data Set 1 in Appendix B for the
pulse rates of 40 randomly selected adult males. Table 2-5 summarizes the last
digits of those pulse rates. If the pulse rates are measured by counting the number of heartbeats in 1 minute, we expect that those last digits should occur with
frequencies that are roughly the same. But note that the frequency distribution
shows that the last digits are all even numbers; there are no odd numbers present. This suggests that the pulse rates were not counted for 1 minute. Perhaps
they were counted for 30 seconds and the values were then doubled. (Upon
further examination of the original pulse rates, we can see that every original
value is a multiple of four, suggesting that the number of heartbeats was
counted for 15 seconds, then that count was multiplied by four.) It’s fascinating
to learn something about the method of data collection by simply describing
some characteristics of the data.
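A last-digit frequency distribution of this kind is easy to produce with software. The following Python sketch shows one way; the function name is our own, and the list pulse_rates holds a few hypothetical values (all multiples of four) chosen only to illustrate the pattern, not the actual values of Data Set 1.

```python
from collections import Counter

def last_digit_distribution(values):
    """Return the frequencies of the last digits 0 through 9 for whole-number data."""
    counts = Counter(v % 10 for v in values)
    # Include every digit, even those with a frequency of zero.
    return [counts.get(d, 0) for d in range(10)]

pulse_rates = [64, 72, 88, 60, 76, 64, 80, 56]  # hypothetical values
dist = last_digit_distribution(pulse_rates)
print(dist)  # [2, 0, 1, 0, 2, 0, 2, 0, 1, 0]

# If no odd last digit ever occurs, the values were probably not counted directly.
odd_digits_absent = all(dist[d] == 0 for d in (1, 3, 5, 7, 9))
print(odd_digits_absent)  # True
```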
Table 2-4 Cumulative Frequency Distribution of Cotinine Levels in Smokers

Cotinine         Cumulative Frequency
Less than 100    11
Less than 200    23
Less than 300    37
Less than 400    38
Less than 500    40

Table 2-5 Last Digits of Male Pulse Rates

Last Digit    Frequency
0             7
1             0
2             6
3             0
4             11
5             0
6             9
7             0
8             7
9             0
EXAMPLE Exploring Data In studying the behavior of the Old Faithful geyser in Yellowstone National Park, geologists collect data for the times
(in minutes) between eruptions. Table 2-6 summarizes actual data that were
obtained. Examination of the frequency distribution reveals unexpected behavior: The distribution of times has two different peaks. This distribution led geologists to consider possible explanations.
Table 2-6 Times (in minutes) Between Old Faithful Eruptions

Time       Frequency
40–49      8
50–59      44
60–69      23
70–79      6
80–89      107
90–99      11
100–109    1

EXAMPLE Comparing Data Sets The Chapter Problem
given at the beginning of this chapter includes data sets consisting of
measured cotinine levels from smokers, nonsmokers exposed to tobacco smoke, and nonsmokers not exposed to tobacco smoke. Table 2-7 shows
the relative frequencies for the three groups. By comparing those relative frequencies, it should be obvious that the frequency distribution for smokers is
very different from the frequency distributions for the other two groups. Because the two groups of nonsmokers (exposed and not exposed) have such
high frequency amounts for the first class, it might be helpful to further compare those data sets with a closer examination of those values.

Table 2-7 Cotinine Levels for Three Groups

Cotinine    Smokers    Nonsmokers Exposed to Smoke    Nonsmokers Not Exposed to Smoke
0–99        28%        85%                            95%
100–199     30%        5%                             0%
200–299     35%        3%                             3%
300–399     3%         3%                             3%
400–499     5%         0%                             0%
500–599     0%         5%                             0%
2-2 Basic Skills and Concepts
In Exercises 1–4, identify the class width, class midpoints, and class boundaries for the
given frequency distribution based on Data Set 1 in Appendix B.
1. Systolic Blood Pressure of Men

Systolic Blood Pressure    Frequency
90–99                      1
100–109                    4
110–119                    17
120–129                    12
130–139                    5
140–149                    0
150–159                    1

2. Systolic Blood Pressure of Women

Systolic Blood Pressure    Frequency
80–99                      9
100–119                    24
120–139                    5
140–159                    1
160–179                    0
180–199                    1

3. Cholesterol of Men

Cholesterol    Frequency
0–199          13
200–399        11
400–599        5
600–799        8
800–999        2
1000–1199      0
1200–1399      1

4. Body Mass Index of Women

Body Mass Index    Frequency
15.0–20.9          10
21.0–26.9          15
27.0–32.9          11
33.0–38.9          2
39.0–44.9          2

Table for Exercise 13

Outcome    Frequency
1          27
2          31
3          42
4          40
5          28
6          32

Table for Exercise 14

Digit    Frequency
0        18
1        12
2        14
3        9
4        17
5        20
6        21
7        26
8        7
9        16
In Exercises 5–8, construct the relative frequency distribution that corresponds to the
frequency distribution in the exercise indicated.
5. Exercise 1
6. Exercise 2
7. Exercise 3
8. Exercise 4
In Exercises 9–12, construct the cumulative frequency distribution that corresponds to the
frequency distribution in the exercise indicated.
9. Exercise 1
10. Exercise 2
11. Exercise 3
12. Exercise 4
13. Loaded Die The author drilled a hole in a die and filled it with a lead weight, then
proceeded to roll it 200 times. (Yes, the author has too much free time.) The results
are given in the frequency distribution in the margin. Construct the corresponding relative frequency distribution and determine whether the die is significantly different
from a fair die that has not been “loaded.”
14. Lottery The frequency distribution in the margin is based on the Win Four numbers
from the New York State Lottery, as listed in Data Set 26 in Appendix B. Construct
the corresponding relative frequency distribution and determine whether the results
appear to be selected in such a way that all of the digits are equally likely.
15. Bears Refer to Data Set 9 in Appendix B and construct a frequency distribution of the
weights of bears. Use 11 classes beginning with a lower class limit of 0 and use a
class width of 50 lb.
16. Body Temperatures Refer to Data Set 4 in Appendix B and construct a frequency distribution of the body temperatures for midnight on the second day. Use 8 classes beginning with a lower class limit of 96.5 and use a class width of 0.4°F. Describe two
different notable features of the result.
17. Head Circumferences Refer to Data Set 3 in Appendix B. Construct a frequency distribution for the head circumferences of baby boys and construct a separate frequency
distribution for the head circumferences of baby girls. In both cases, use the classes of
34.0–35.9, 36.0–37.9, and so on. Then compare the results and determine whether
there appears to be a significant difference between the two genders.
18. Animated Movies for Children Refer to Data Set 7 in Appendix B. Construct a frequency distribution for the lengths of time that animated movies for children contain
tobacco use and construct a separate frequency distribution for the lengths of time for
alcohol use. In both cases, use the classes of 0–99, 100–199, and so on. Compare the
results and determine whether there appears to be a significant difference.
19. Marathon Runners Refer to Data Set 8 in Appendix B. Construct a relative frequency
distribution for the ages of the sample of males who finished the New York City
marathon, then construct a separate relative frequency distribution for the ages of the
females. In both cases, start the first class with a lower class limit of 19 and use a
class width of 10. Compare the results and determine whether there appears to be any
notable difference between the two groups.
20. Regular Coke/Diet Coke Refer to Data Set 17 in Appendix B. Construct a relative
frequency distribution for the weights of regular Coke by starting the first class at
0.7900 lb and use a class width of 0.0050 lb. Then construct another relative frequency distribution for the weights of diet Coke by starting the first class at 0.7750 lb
and use a class width of 0.0050 lb. Then compare the results and determine whether
there appears to be a significant difference. If so, provide a possible explanation for
the difference.
2-2 Beyond the Basics
21. Interpreting Effects of Outliers Refer to Data Set 20 in Appendix B for the axial loads
of aluminum cans that are 0.0111 in. thick. The load of 504 lb is called an outlier because it is very far away from all of the other values. Construct a frequency distribution that includes the value of 504 lb, then construct another frequency distribution
with the value of 504 lb excluded. In both cases, start the first class at 200 lb and use
a class width of 20 lb. Interpret the results by stating a generalization about how much
of an effect an outlier might have on a frequency distribution.
22. Number of Classes In constructing a frequency distribution, Sturges’ guideline suggests that the ideal number of classes can be approximated by 1 + (log n)/(log 2),
where n is the number of data values. Use this guideline to complete the table for determining the ideal number of classes.
Table for Exercise 22

Number of Values    Ideal Number of Classes
16–22               5
23–45               6
______              7
______              8
______              9
______              10
______              11
______              12
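Sturges’ guideline from Exercise 22 can be evaluated with a short Python sketch. The function name is our own, and we assume that the result of 1 + (log n)/(log 2) is rounded to the nearest whole number, an assumption that agrees with the two rows already filled in for the table. The sketch checks only those two rows and leaves the rest of the table to the reader.

```python
import math

def sturges_classes(n):
    """Approximate ideal number of classes: 1 + (log n)/(log 2), rounded to a whole number."""
    # math.log2(n) is the same quantity as (log n)/(log 2).
    return round(1 + math.log2(n))

# Check the endpoints of the two rows already given in the table for Exercise 22.
for n in (16, 22, 23, 45):
    print(n, sturges_classes(n))
# 16 5
# 22 5
# 23 6
# 45 6
```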