Uploaded by Mayank Rawat

Notes 12, Numerical Summaries to Probability Plots

advertisement
6
Descriptive Statistics
Chapter Outline
6-1 Numerical Summaries of Data
6-2 Stem-and-Leaf Diagrams
6-3 Frequency Distributions and
Histograms
6-4 Box Plots
6-5 Time Sequence Plots
6-6 Scatter Diagrams
6-7 Probability Plots
Statistics is the science of data. An important aspect of dealing with data is organizing and summarizing the data in ways
that facilitate its interpretation and subsequent analysis. This
aspect of statistics is called descriptive statistics, and is the
subject of this chapter. For example, in Chapter 1 we presented eight prototype units made on the pull-off force of
prototype automobile engine connectors. The observations
(in pounds) were 12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, and
13.1. There is obvious variability in the pull-off force values.
How should we summarize the information in these data?
This is the general question that we consider. Data summary
methods should highlight the important features of the data,
such as the middle or central tendency and the variability,
because these characteristics are most often important for
engineering decision making. We will see that there are both
numerical methods for summarizing data and a number of
powerful graphical techniques. The graphical techniques are
particularly important. Any good statistical analysis of data
should always begin with plotting the data.
199
200
Chapter 6/Descriptive Statistics
Learning Objectives
After careful study of this chapter, you should be able to do the following:
1. Compute and interpret the sample mean, sample variance, sample standard deviation, sample
median, and sample range
2. Explain the concepts of sample mean, sample variance, population mean, and population variance
3. Construct and interpret visual data displays, including the stem-and-leaf display, the histogram, and
the box plot
4. Explain the concept of random sampling
5. Construct and interpret normal probability plots
6. Explain how to use box plots and other data displays to visually compare two or more samples of data
7. Know how to use simple time series plots to visually display the important features of time-oriented data
8. Know how to construct and interpret scatter diagrams of two or more variables
6-1 Numerical Summaries of Data
Well-constructed data summaries and displays are essential to good statistical thinking,
because they can focus the engineer on important features of the data or provide insight about
the type of model that should be used in solving the problem. The computer has become an
important tool in the presentation and analysis of data. Although many statistical techniques
require only a handheld calculator, this approach may require much time and effort, and a
computer will perform the tasks much more eficiently.
Most statistical analysis is done using a prewritten library of statistical programs. The user
enters the data and then selects the types of analysis and output displays that are of interest.
Statistical software packages are available for both mainframe machines and personal computers. We will present examples of typical output from computer software throughout the
book. We will not discuss the hands-on use of speciic software packages for entering and
editing data or using commands.
We often ind it useful to describe data features numerically. For example, we can characterize the location or central tendency in the data by the ordinary arithmetic average or mean.
Because we almost always think of our data as a sample, we will refer to the arithmetic mean
as the sample mean.
Sample Mean
If the n observations in a sample are denoted by x1 , x2 , … , xn , the sample mean is
n
x + x2 + … + x n
=
x− = 1
n
∑ xi
i =1
n
(6-1)
Sample Mean Let’s consider the eight observations on pull-off force collected from the prototype engine connectors from Chapter 1. The eight observations are x1 = 12.6, x2 = 12.9, x3 = 13.4,
x4 = 12.3, x5 = 13.6, x6 = 13.5, x7 = 12.6, and x8 = 13.1. The sample mean is
Example 6-1
8
xi
12.6 + 12.9 + ⋯ + 13.1 104
x + x + ⋯ + xn ∑
= i =1 =
=
= 13.0 pounds
x= 1 2
n
8
8
8
Section 6-1/Numerical Summaries of Data
201
A physical interpretation of the sample mean as a measure of location is shown in the dot diagram of the pull-off
force data. See Fig. 6-1. Notice that the sample mean x = 13.0 can be thought of as a “balance point.” That is, if each
observation represents 1 pound of mass placed at the point on the x-axis, a fulcrum located at x would exactly balance
this system of weights.
The sample mean is the average value of all observations in the data set. Usually, these data
are a sample of observations that have been selected from some larger population of observations. Here the population might consist of all the connectors that will be manufactured and
sold to customers. Recall that this type of population is called a conceptual or hypothetical
population because it does not physically exist. Sometimes there is an actual physical population, such as a lot of silicon wafers produced in a semiconductor factory.
In previous chapters, we have introduced the mean of a probability distribution, denoted
μ. If we think of a probability distribution as a model for the population, one way to think of
the mean is as the average of all the measurements in the population. For a inite population
with N equally likely values, the probability mass function is f ( xi ) = 1 / N and the mean is
N
N
μ = ∑ xi f ( xi ) =
i −1
∑ xi
i =1
N
(6-2)
The sample mean, x, is a reasonable estimate of the population mean, μ. Therefore, the engineer designing the connector using a 3/32-inch wall thickness would conclude on the basis of
the data that an estimate of the mean pull-off force is 13.0 pounds.
Although the sample mean is useful, it does not convey all of the information about a sample of data. The variability or scatter in the data may be described by the sample variance or
the sample standard deviation.
Sample Variance and
Standard Deviation
If x1 , x2 ,… , xn is a sample of n observations, the sample variance is
n
s =
2
∑ ( xi − x )
2
i =1
(6-3)
n −1
The sample standard deviation, s, is the positive square root of the sample variance.
The units of measurement for the sample variance are the square of the original units of
the variable. Thus, if x is measured in pounds, the units for the sample variance are (pounds)2.
The standard deviation has the desirable property of measuring variability in the original units
of the variable of interest, x.
How Does the Sample Variance Measure Variability?
To see how the sample variance measures dispersion or variability, refer to Fig. 6-2, which
shows a dot diagram with the deviations xi − x for the connector pull-off force data. The
higher the amount of variability in the pull-off force data, the larger in absolute magnitude
x = 13
12
14
15
Pull-off force
FIGURE 6-1 Dot diagram showing the sample mean
as a balance point for a system of weights.
202
Chapter 6/Descriptive Statistics
some of the deviations xi − x will be. Because the deviations xi − x always sum to zero, we
must use a measure of variability that changes the negative deviations to non-negative quantities. Squaring the deviations is the approach used in the sample variance. Consequently, if
s 2 is small, there is relatively little variability in the data, but if s 2 is large, the variability is
relatively large.
Example 6-2
Sample Variance Table 6-1 displays the quantities needed for calculating the sample variance
and sample standard deviation for the pull-off force data. These data are plotted in Fig. 6-2. The
numerator of s 2 is
8
∑ ( xi − x ) = 1.60
2
i =1
5"#-&t6-1 Calculation of Terms for the Sample Variance and Sample Standard Deviation
i
xi
xi − x
( xi − x ) 2
1
12.6
–0.4
0.16
2
12.9
–0.1
0.01
3
13.4
0.4
0.16
4
12.3
–0.7
0.49
5
13.6
0.6
0.36
6
13.5
0.5
0.25
7
12.6
–0.4
0.16
8
13.1
0.1
0.01
104.0
0.0
1.60
x
12
13
x2
14
15
x8
x1
x7
x4
x3
x6
x5
FIGURE 6-2 How the sample variance measures variability through the deviations x i − x .
so the sample variance is
s2 =
1.60 1.60
2
=
= 0.2286 ( pounds)
8 −1
7
and the sample standard deviation is
s = 0.2286 = 0.48 pounds
Computation of s 2
The computation of s 2 requires calculation of x , n subtractions, and n squaring and adding
operations. If the original observations or the deviations xi − x are not integers, the deviations xi − x may be tedious to work with, and several decimals may have to be carried
Section 6-1/Numerical Summaries of Data
203
to ensure numerical accuracy. A more eficient computational formula for the sample
variance is obtained as follows:
n
s2 =
∑ ( xi − x )
i =1
n −1
∑ ( xi 2 + x 2 − 2 xxi )
n
2
=
i =1
n −1
n
=
n
∑ xi 2 + nx 2 − 2 x ∑ xi
i =1
i =1
n −1
and because x = (1 / n)Σ in=1 xi , this last equation reduces to
n
s2 =
∑ xi 2 −
⎛ n ⎞
xi ⎟
⎜⎝ i∑
=1 ⎠
i =1
2
n
(6-4)
n −1
Note that Equation 6-4 requires squaring each individual xi , then squaring the sum of the xi ,
2
subtracting ( ∑ xi ) / n from ∑ xi 2 , and inally dividing by n − 1. Sometimes this is called the
shortcut method for calculating s 2 (or s).
Example 6-3
We will calculate the sample variance and standard deviation using the shortcut method,
Equation 6-4. The formula gives
n
s2 =
∑ xi 2 −
⎛ n ⎞
xi ⎟
⎜⎝ i∑
=1 ⎠
i =1
n −1
n
2
=
(104)2
8 = 1.60 = 0.2286 ( pounds)2
7
7
1353.6 −
and
s = 0.2286 = 0.48 pounds
These results agree exactly with those obtained previously.
Analogous to the sample variance s2 , the variability in the population is deined by the
population variance (σ 2 ). As in earlier chapters, the positive square root of σ 2, or σ, will
denote the population standard deviation. When the population is inite and consists of N
equally likely values, we may deine the population variance as
N
σ2 =
∑ ( xi − μ )
i =1
2
(6-5)
N
We observed previously that the sample mean could be used as an estimate of the population
mean. Similarly, the sample variance is an estimate of the population variance. In Chapter 7,
we will discuss estimation of parameters more formally.
Note that the divisor for the sample variance is the sample size minus 1 ( n − 1) , and for the
population variance, it is the population size N . If we knew the true value of the population
mean μ, we could ind the sample variance as the average square deviation of the sample
observations about μ. In practice, the value of μ is almost never known, and so the sum of the
square deviations about the sample average x must be used instead. However, the observations
xi tend to be closer to their average, x , than to the population mean, μ. Therefore, to compensate for this, we use n − 1 as the divisor rather than n. If we used n as the divisor in the sample
variance, we would obtain a measure of variability that is on the average consistently smaller
than the true population variance σ 2.
204
Chapter 6/Descriptive Statistics
Another way to think about this is to consider the sample variance s 2 as being based
on n − 1 degrees of freedom. The term degrees of freedom results from the fact that the n
deviations x1 − x , x2 − x ,… , xn − x always sum to zero, and so specifying the values of any
n −1 of these quantities automatically determines the remaining one. This was illustrated in
Table 6-1. Thus, only n −1 of the n deviations, xi − x, are freely determined. We may think
of the number of degrees of freedom as the number of independent pieces of information
in the data.
In addition to the sample variance and sample standard deviation, the sample range, or the
difference between the largest and smallest observations, is often a useful measure of variability. The sample range is deined as follows.
Sample Range
If the n observations in a sample are denoted by x1 , x2 , … , xn , the sample range is
r = max ( xi ) − min ( xi )
(6-6)
For the pull-off force data, the sample range is r = 13.6 − 12.3 = 1.3. Generally, as the variability
in sample data increases, the sample range increases.
The sample range is easy to calculate, but it ignores all of the information in the sample
data between the largest and smallest values. For example, the two samples 1, 3, 5, 8, and 9
and 1, 5, 5, 5, and 9 both have the same range (r = 8). However, the standard deviation of the
irst sample is s1 = 3.35, while the standard deviation of the second sample is s2 = 2.83. The
variability is actually less in the second sample.
Sometimes when the sample size is small, say n < 8 or 10 , the information loss associated
with the range is not too serious. For example, the range is used widely in statistical quality
control where sample sizes of 4 or 5 are fairly common. We will discuss some of these applications in Chapter 15.
In most statistics problems, we work with a sample of observations selected from the
population that we are interested in studying. Figure 6-3 illustrates the relationship between
the population and the sample.
Population
m
s
Sample (x1, x2, x3, … , xn)
x, sample average
s, sample standard
deviation
Histogram
x
x
s
FIGURE 6-3 Relationship between a population and a sample.
Section 6-1/Numerical Summaries of Data
Exercises
205
FOR SECTION 6-1
Problem available in WileyPLUS at instructor’s discretion.
Tutoring problem available in WileyPLUS at instructor’s discretion.
6-1. Will the sample mean always correspond to one of the
observations in the sample?
6-2. Will exactly half of the observations in a sample fall
below the mean?
6-3. Will the sample mean always be the most frequently
occurring data value in the sample?
6-4. For any set of data values, is it possible for the sample
standard deviation to be larger than the sample mean? If so,
give an example.
6-5. Can the sample standard deviation be equal to zero? If so,
give an example.
6-6. Suppose that you add 10 to all of the observations in a
sample. How does this change the sample mean? How does it
change the sample standard deviation?
6-7.
Eight measurements were made on the inside diameter of forged piston rings used in an automobile engine. The
data (in millimeters) are 74.001, 74.003, 74.015, 74.000,
74.005, 74.002, 74.005, and 74.004. Calculate the sample
mean and sample standard deviation, construct a dot diagram,
and comment on the data.
6-8.
In Applied Life Data Analysis (Wiley,
1982), Wayne Nelson presents the breakdown time of an insulating luid between electrodes at 34 kV. The times, in minutes, are
as follows: 0.19, 0.78, 0.96, 1.31, 2.78, 3.16, 4.15, 4.67, 4.85,
6.50, 7.35, 8.01, 8.27, 12.06, 31.75, 32.52, 33.91, 36.71, and
72.89. Calculate the sample mean and sample standard deviation.
6-9.
The January 1990 issue of Arizona Trend contains a
supplement describing the 12 “best” golf courses in the state.
The yardages (lengths) of these courses are as follows: 6981,
7099, 6930, 6992, 7518, 7100, 6935, 7518, 7013, 6800, 7041,
and 6890. Calculate the sample mean and sample standard
deviation. Construct a dot diagram of the data.
6-10.
An article in the Journal of Structural Engineering
(Vol. 115, 1989) describes an experiment to test the yield strength
of circular tubes with caps welded to the ends. The irst yields (in
kN) are 96, 96, 102, 102, 102, 104, 104, 108, 126, 126, 128, 128,
140, 156, 160, 160, 164, and 170. Calculate the sample mean and
sample standard deviation. Construct a dot diagram of the data.
6-11.
An article in Human Factors (June 1989) presented
data on visual accommodation (a function of eye movement)
when recognizing a speckle pattern on a high-resolution CRT
screen. The data are as follows: 36.45, 67.90, 38.77, 42.18,
26.72, 50.77, 39.30, and 49.71. Calculate the sample mean and
sample standard deviation. Construct a dot diagram of the data.
6-12.
The following data are direct solar intensity measurements (watts/m2) on different days at a location in southern
Spain: 562, 869, 708, 775, 775, 704, 809, 856, 655, 806, 878,
909, 918, 558, 768, 870, 918, 940, 946, 661, 820, 898, 935,
952, 957, 693, 835, 905, 939, 955, 960, 498, 653, 730, and
753. Calculate the sample mean and sample standard deviation.
Prepare a dot diagram of these data. Indicate where the sample
mean falls on this diagram. Give a practical interpretation of
the sample mean.
6-13.
The April 22, 1991, issue of Aviation Week and
Space Technology reported that during Operation Desert
Storm, U.S. Air Force F-117A pilots lew 1270 combat sorties for a total of 6905 hours. What is the mean duration of an
F-117A mission during this operation? Why is the parameter
you have calculated a population mean?
6-14. Preventing fatigue crack propagation in aircraft structures
is an important element of aircraft safety. An engineering study
to investigate fatigue crack in n = 9 cyclically loaded wing boxes
reported the following crack lengths (in mm): 2.13, 2.96, 3.02,
1.82, 1.15, 1.37, 2.04, 2.47, 2.60. Calculate the sample mean and
sample standard deviation. Prepare a dot diagram of the data.
6-15. An article in the Journal of Physiology [“Response of Rat
Muscle to Acute Resistance Exercise Deined by Transcriptional
and Translational Proiling” (2002, Vol. 545, pp. 27–41)] studied
gene expression as a function of resistance exercise. Expression
data (measures of gene activity) from one gene are shown in the
following table. One group of rats was exercised for six hours
while the other received no exercise. Compute the sample mean
and standard deviation of the exercise and no-exercise groups separately. Construct a dot diagram for the exercise and no-exercise
groups separately. Comment on any differences for the groups.
6 Hours of
Exercise
425.313
223.306
388.793
139.262
212.565
324.024
6 Hours of
Exercise
208.475
286.484
244.242
408.099
157.743
436.37
No
Exercise
485.396
159.471
478.314
245.782
236.212
252.773
No
Exercise
406.921
335.209
Exercise 6-11 describes data from an article in
6-16.
Human Factors on visual accommodation from an experiment
involving a high-resolution CRT screen.
Data from a second experiment using a low-resolution screen
were also reported in the article. They are 8.85, 35.80, 26.53,
64.63, 9.00, 15.38, 8.14, and 8.24. Prepare a dot diagram for this
second sample and compare it to the one for the irst sample.
What can you conclude about CRT resolution in this situation?
6-17.
The pH of a solution is measured eight times by one
operator using the same instrument. She obtains the following
data: 7.15, 7.20, 7.18, 7.19, 7.21, 7.20, 7.16, and 7.18. Calculate the sample mean and sample standard deviation. Comment
on potential major sources of variability in this experiment.
6-18.
An article in the Journal of Aircraft (1988) described
the computation of drag coeficients for the NASA 0012 airfoil. Different computational algorithms were used at M ∞ = 0.7
206
Chapter 6/Descriptive Statistics
with the following results (drag coeficients are in units of drag
counts; that is, one count is equivalent to a drag coeficient of
0.0001): 79, 100, 74, 83, 81, 85, 82, 80, and 84. Compute the
sample mean, sample variance, and sample standard deviation,
and construct a dot diagram.
6-19.
The following data are the joint temperatures of the
O-rings (°F) for each test iring or actual launch of the space
shuttle rocket motor (from Presidential Commission on the
Space Shuttle Challenger Accident, Vol. 1, pp. 129–131): 84, 49,
61, 40, 83, 67, 45, 66, 70, 69, 80, 58, 68, 60, 67, 72, 73, 70, 57,
63, 70, 78, 52, 67, 53, 67, 75, 61, 70, 81, 76, 79, 75, 76, 58, 31.
(a) Compute the sample mean and sample standard deviation and
construct a dot diagram of the temperature data.
(b) Set aside the smallest observation (31°F ) and recompute
the quantities in part (a). Comment on your indings. How
“different” are the other temperatures from this last value?
into clouds to promote rainfall was widely used in the 20th century. Recent research has questioned its effectiveness [Journal
of Atmospheric Research (2010, Vol. 97 (2), pp. 513–525)]. An
experiment was performed by randomly assigning 52 clouds
to be seeded or not. The amount of rain generated was then
measured in acre-feet. Here are the data for the unseeded and
seeded clouds:
6-20. The United States has an aging infrastructure as witnessed by several recent disasters, including the I-35 bridge
failure in Minnesota. Most states inspect their bridges regularly
and report their condition (on a scale from 1–17) to the public.
Here are the condition numbers from a sample of 30 bridges
in New York State (https://www.dot.ny.gov/main/bridgedata):
Find the sample mean, sample standard deviation, and range
of rainfall for
(a) All 52 clouds
(b) The unseeded clouds
(c) The seeded clouds
5.08 5.44 6.66 5.07 6.80 5.43 4.83 4.00 4.41 4.38
7.00 5.72 4.53 6.43 3.97 4.19 6.26 6.72 5.26 5.48
4.95 6.33 4.93 5.61 4.66 7.00 5.57 3.42 5.18 4.54
6-23. Construct dot diagrams of the seeded and unseeded
clouds and compare their distributions in a couple of sentences.
(a) Find the sample mean and sample standard deviation of
these condition numbers.
(b) Construct a dot diagram of the data.
6-21. In an attempt to measure the effects of acid rain, researchers measured the pH (7 is neutral and values below 7 are acidic)
of water collected from rain in Ingham County, Michigan.
5.47 5.37 5.38 4.63 5.37 3.74 3.71 4.96 4.64 5.11
5.65 5.39 4.16 5.62 4.57 4.64 5.48 4.57 4.57 4.51
4.86 4.56 4.61 4.32 3.98 5.70 4.15 3.98 5.65 3.10
5.04 4.62 4.51 4.34 4.16 4.64 5.12 3.71 4.64 5.59
(a) Find the sample mean and sample standard deviation of
these measurements.
(b) Construct a dot diagram of the data.
6-22. Cloud seeding, a process in which chemicals such as silver iodide and frozen carbon dioxide are introduced by aircraft
Unseeded:
81.2 26.1 95.0 41.1 28.6 21.7 11.5 68.5 345.5 321.2
1202.6 1.0 4.9 163.0 372.4 244.3 47.3 87.0 26.3 24.4
830.1 4.9 36.6 147.8 17.3 29.0
Seeded:
274.7 302.8 242.5 255.0 17.5 115.3 31.4 703.4 334.1
1697.8 118.3 198.6 129.6 274.7 119.0 1656.0 7.7 430.0
40.6 92.4 200.7 32.7 4.1 978.0 489.1 2745.6
6-24. In the 2000 Sydney Olympics, a special program initiated by IOC president Juan Antonio Samaranch allowed developing countries to send athletes to the Olympics without the
usual qualifying procedure. Here are the 71 times for the irst
round of the 100 meter men’s swim (in seconds).
60.39
50.56
49.75
49.76
52.43
51.93
50.49
49.45
49.93
52.72
54.06
49.73
51.28
52.24
49.84
51.28
53.40
50.95
53.50
50.90
52.22
52.82
52.91
49.09
51.82
49.74
50.63
59.26
49.76
50.96
52.52
58.79
50.46
49.16
51.93
49.29
49.70
48.64
50.32
49.74
51.34 50.28 50.19 52.14
52.57 52.53 52.09 52.40
51.62 52.58 53.55 51.07
52.78 112.72 49.79 49.83
52.90 50.19 54.33 62.45
51.11 50.87 52.18 54.12
51.52 52.0 52.85 52.24
49.32 50.62 49.45
(a) Find the sample mean and sample standard deviation of
these 100 meter swim times.
(b) Construct a dot diagram of the data.
(c) Comment on anything unusual that you see.
6-2 Stem-and-Leaf Diagrams
The dot diagram is a useful data display for small samples up to about 20 observations. However, when the number of observations is moderately large, other graphical displays may be
more useful.
For example, consider the data in Table 6-2. These data are the compressive strengths in
pounds per square inch (psi) of 80 specimens of a new aluminum-lithium alloy undergoing
evaluation as a possible material for aircraft structural elements. The data were recorded in
the order of testing, and in this format they do not convey much information about compressive strength. Questions such as “What percent of the specimens fail below 120 psi?” are not
Section 6-2/Stem-and-Leaf Diagrams
207
easy to answer. Because there are many observations, constructing a dot diagram of these data
would be relatively ineficient; more effective displays are available for large data sets.
5"#-&t6-2 Compressive Strength (in psi) of 80 Aluminum-Lithium Alloy Specimens
105
221
183
186
121
181
180
143
97
154
153
174
120
168
167
141
245
228
174
199
181
158
176
110
163
131
154
115
160
208
158
133
207
180
190
193
194
133
156
123
134
178
76
167
184
135
229
146
218
157
101
171
165
172
158
169
199
151
142
163
145
171
148
158
160
175
149
87
160
237
150
135
196
201
200
176
150
170
118
149
A stem-and-leaf diagram is a good way to obtain an informative visual display of a data
set x1 , x2 ,… , xn where each number xi consists of at least two digits. To construct a stem-andleaf diagram, use the following steps.
Steps to Construct
a Stem-and-Leaf
Diagram
(1) Divide each number xi into two parts: a stem, consisting of one or more of the
leading digits, and a leaf, consisting of the remaining digit.
(2) List the stem values in a vertical column.
(3) Record the leaf for each observation beside its stem.
(4) Write the units for stems and leaves on the display.
To illustrate, if the data consist of percent defective information between 0 and 100 on lots
of semiconductor wafers, we can divide the value 76 into the stem 7 and the leaf 6. In general,
we should choose relatively few stems in comparison with the number of observations. It is
usually best to choose between 5 and 20 stems.
Example 6-4
Alloy Strength To illustrate the construction of a stem-and-leaf diagram, consider the alloy
compressive strength data in Table 6-2. We will select as stem values the numbers 7, 8, 9,… , 24.
The resulting stem-and-leaf diagram is presented in Fig. 6-4. The last column in the diagram is a frequency count
of the number of leaves associated with each stem. Inspection of this display immediately reveals that most of the
compressive strengths lie between 110 and 200 psi and that a central value is somewhere between 150 and 160 psi.
Furthermore, the strengths are distributed approximately symmetrically about the central value. The stem-and-leaf
diagram enables us to determine quickly some important features of the data that were not immediately obvious in
the original display in Table 6-2.
In some data sets, providing more classes or stems may be desirable. One way to do
this would be to modify the original stems as follows: Divide stem 5 into two new stems,
5L and 5U. Stem 5L has leaves 0, 1, 2, 3, and 4, and stem 5U has leaves 5, 6, 7, 8, and 9.
This will double the number of original stems. We could increase the number of original
stems by four by deining ive new stems: 5z with leaves 0 and 1, 5t (for twos and three)
with leaves 2 and 3, 5f (for fours and ives) with leaves 4 and 5, 5s (for six and seven) with
leaves 6 and 7, and 5e with leaves 8 and 9.
208
Chapter 6/Descriptive Statistics
Stem
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
Leaf
6
7
7
51
580
103
413535
29583169
471340886808
3073050879
8544162106
0361410
960934
7108
8
189
7
5
Frequency
1
1
1
2
3
3
6
8
12
10
10
7
6
4
1
3
1
1
Stem: Tens and hundreds digits (psi); Leaf: Ones digits (psi).
FIGURE 6-4
Stem-and-leaf diagram for the compressive strength data in Table 6-2.
Chemical Yield Figure 6-5 is the stem-and-leaf diagram for 25 observations on batch yields
Example 6-5
from a chemical process. In Fig. 6-5(a), we have used 6, 7, 8, and 9 as the stems. This results in too
few stems, and the stem-and-leaf diagram does not provide much information about the data. In Fig. 6-5(b), we have
divided each stem into two parts, resulting in a display that more adequately displays the data. Figure 6-5(c) illustrates a
stem-and-leaf display with each stem divided into ive parts. There are too many stems in this plot, resulting in a display
that does not tell us much about the shape of the data.
Stem
6
7
8
9
Leaf
134556
011357889
1344788
235
(a)
Stem
6L
6U
7L
7U
8L
8U
9L
9U
Leaf
134
556
0113
57889
1344
788
23
5
(b)
Stem
6z
6t
6f
6s
6e
7z
7t
7f
7s
7e
8z
8t
8f
8s
8e
9z
9t
9f
9s
9e
Leaf
1
3
455
6
011
3
5
7
889
1
3
44
7
88
23
5
(c)
FIGURE 6-5 Stem-and-leaf displays for Example 6-5. Stem: Tens digits. Leaf: Ones digits.
Section 6-2/Stem-and-Leaf Diagrams
209
Stem-and-leaf of Strength
N = 80 Leaf
Unit = 1.0
FIGURE 6-6 A typical
computer-generated
stem-and-leaf
diagram.
1
2
3
5
8
11
17
25
37
(10)
33
23
16
10
6
5
2
1
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
6
7
7
15
058
013
133455
12356899
001344678888
0003357789
0112445668
0011346
034699
0178
8
189
7
5
Figure 6-6 is a typical computer-generated stem-and-leaf display of the compressive strength
data in Table 6-2. The software uses the same stems as in Fig. 6-4. Note also that the computer
orders the leaves from smallest to largest on each stem. This form of the plot is usually called an
ordered stem-and-leaf diagram. This is not usually used when the plot is constructed manually because it can be time-consuming. The computer also adds a column to the left of the stems
that provides a count of the observations at and above each stem in the upper half of the display
and a count of the observations at and below each stem in the lower half of the display. At the
middle stem of 16, the column indicates the number of observations at this stem.
The ordered stem-and-leaf display makes it relatively easy to ind data features such as percentiles, quartiles, and the median. The sample median is a measure of central tendency that
divides the data into two equal parts, half below the median and half above. If the number of
observations is even, the median is halfway between the two central values. From Fig. 6-6 we
ind the 40th and 41st values of strength as 160 and 163, so the median is (160 + 163) / 2 = 161.5.
If the number of observations is odd, the median is the central value. The sample mode is the
most frequently occurring data value. Figure 6-6 indicates that the mode is 158; this value
occurs four times, and no other value occurs as frequently in the sample. If there were more
than one value that occurred four times, the data would have multiple modes.
We can also divide data into more than two parts. When an ordered set of data is divided
into four equal parts, the division points are called quartiles. The irst or lower quartile, q1,
is a value that has approximately 25% of the observations below it and approximately 75%
of the observations above. The second quartile, q2, has approximately 50% of the observations below its value. The second quartile is exactly equal to the median. The third or upper
quartile, q3 , has approximately 75% of the observations below its value. As in the case of
the median, the quartiles may not be unique. The compressive strength data in Fig. 6-6 contain n = 80 observations. Therefore, calculate the irst and third quartiles as the ( n + 1) / 4 and
3 ( n + 1) / 4 ordered observations and interpolate as needed, for example, (80 + 1) / 4 = 20.25
and 3 (80 + 1) / 4 = 60.75. Therefore, interpolating between the 20th and 21st ordered observation we obtain q1 = 143.50 and between the 60th and 61st observation we obtain q3 = 181.00.
In general, the 100kth percentile is a data value such that approximately 100k% of the observations are at or below this value and approximately 100 (1 − k ) % of them are above it. Finally,
210
Chapter 6/Descriptive Statistics
we may use the interquartile range, deined as IQR = q3 − q1 , as a measure of variability. The
interquartile range is less sensitive to the extreme values in the sample than is the ordinary
sample range.
Many statistics software packages provide data summaries that include these quantities.
Typical computer output for the compressive strength data in Table 6-2 is shown in Table 6-3.
5"#-&t6-3 Summary Statistics for the Compressive Strength Data from Software
Exercises
N
Mean
Median
StDev
SE Mean
Min
Max
Q1
Q3
80
162.66
161.50
33.77
3.78
76.00
245.00
143.50
181.00
FOR SECTION 6-2
Problem available in WileyPLUS at instructor’s discretion.
Tutoring problem available in WileyPLUS at instructor’s discretion.
6-25. For the data in Exercise 6-20,
(a) Construct a stem-and-leaf diagram.
(b) Do any of the bridges appear to have unusually good or
poor ratings?
(c) If so, compute the mean with and without these bridges and
comment.
6-26. For the data in Exercise 6-21,
(a) Construct a stem-and-leaf diagram.
(b) Many scientists consider rain with a pH below 5.3 to be
acid rain (http://www.ec.gc.ca/eau-water/default.asp?
lang=En&n=FDF30C16-1). What percentage of these samples could be considered as acid rain?
6-27. A back-to-back stem-and-leaf display on two data sets
is conducted by hanging the data on both sides of the same
stems. Here is a back-to-back stem-and-leaf display for the
cloud seeding data in Exercise 6-22 showing the unseeded
clouds on the left and the seeded clouds on the right.
65098754433332221000 | 0 | 01233492223
| 2 | 00467703
| 4 | 39
| 6|0
3| 8|8
| 10 |
0 | 12 |
| 14 |
| 16 | 60
| 18 |
| 20 |
| 22 |
| 24 |
| 26 | 5
6-30.
An article in Technometrics (1977, Vol. 19, p. 425)
presented the following data on the motor fuel octane ratings of
several blends of gasoline:
88.5
94.7
84.3
90.1
89.0
89.8
91.6
90.3
90.0
91.5
89.9
98.8
88.3
90.4
91.2
90.6
92.2
87.7
91.1
86.7
93.4
96.1
89.6 92.2
90.4 83.4
91.6 91.0
90.7 88.2
88.6 88.5
88.3 93.3
94.2 87.4
85.3 91.1
90.1 90.5
89.3 100.3
91.1 87.6
92.7
87.9
93.0
94.4
90.4
91.2
86.7
94.2
90.8
90.1
91.8
88.4
92.6
93.7
96.5
84.3
93.2
88.6
88.7
92.7
89.3
91.0
87.5
87.8
88.3
89.2
92.3
88.9
89.8
92.7
93.3
86.7
91.0
90.9
89.9
91.8
89.7
92.2
Construct a stem-and-leaf display for these data. Calculate the
median and quartiles of these data.
6-31.
The following data are the numbers
of cycles to failure of aluminum test coupons subjected to
repeated alternating stress at 21,000 psi, 18 cycles per second.
How does the back-to-back stem-and-leaf display show the differences in the data set in a way that the dotplot cannot?
1115
1310
1540
1502
1258
1315
1085
798
1020
865
2130
1421
1109
1481
1567
1883
1203
1270
1015
845
1674
1016
1102
1605
706
2215
785
885
1223
375
2265
1910
1018
1452
1890
2100
1594
2023
1315
1269
1260
1888
1782
1522
1792
1000
1820
1940
1120
910
1730
1102
1578
758
1416
1560
1055
1764
1330
1608
1535
1781
1750
1501
1238
990
1468
1512
1750
1642
6-28. When will the median of a sample be equal to the
sample mean?
6-29. When will the median of a sample be equal to the mode?
Construct a stem-and-leaf display for these data. Calculate the
median and quartiles of these data. Does it appear likely that a
coupon will “survive” beyond 2000 cycles? Justify your answer.
Section 6-2/Stem-and-Leaf Diagrams
6-32.
The percentage of cotton in material used to manufacture men’s shirts follows. Construct a stem-and-leaf display
for the data. Calculate the median and quartiles of these data.
34.2
37.8
33.6
32.6
33.8
35.8
34.7
34.6
33.1
36.6
34.7
33.1
34.2
37.6
33.6
33.6
34.5
35.4
35.0
34.6
33.4
37.3
32.5
34.1
35.6
34.6
35.4
35.9
34.7
34.6
34.1
34.7
36.3
33.8
36.2
34.7
34.6
35.5
35.1
35.7
35.1
37.1
36.8
33.6
35.2
32.8
36.8
36.8
34.7
34.0
35.1
32.9
35.0
32.1
37.9
34.3
33.6
34.1
35.3
33.5
34.9
34.5
36.4
32.7
Hong Kong
India
Indonesia
Japan
Korea, North
Korea, South
Laos
Malaysia
Mongolia
Nepal
New Zealand
Pakistan
Philippines
Singapore
Sri Lanka
Taiwan
Thailand
Vietnam
Total
6-33. The following data represent the yield on 90 consecutive
batches of ceramic substrate to which a metal coating has been
applied by a vapor-deposition process. Construct a stem-andleaf display for these data. Calculate the median and quartiles
of these data.
94.1
86.1
95.3
84.9
88.8
84.6
94.4
84.1
93.2
90.4
94.1
78.3
86.4
83.6
96.1
83.7
90.6
89.1
97.8
89.6
85.1
85.4
98.0
82.9
91.4
87.3
93.1
90.3
84.0
89.7
85.4
87.3
88.2
84.1
86.4
93.1
93.7
87.6
86.6
86.4
86.1
90.1
87.6
94.6
87.7
85.1
91.7
84.5
95.1
95.2
94.1
96.3
90.6
89.6
87.5
90.0
86.1
92.1
94.7
89.4
90.0
84.2
92.4
94.3
96.4
91.1
88.6
90.1
85.1
87.3
93.2
88.2
92.4
84.1
94.3
90.5
86.6
86.7
86.4
90.6
82.6
97.3
95.6
91.2
83.0
85.0
89.1
83.1
96.8
88.3
6-34.
Calculate the sample median, mode, and mean of
the data in Exercise 6-30. Explain how these three measures of
location describe different features of the data.
6-35. Calculate the sample median, mode, and mean of the
data in Exercise 6-31. Explain how these three measures of
location describe different features in the data.
6-36.
Calculate the sample median, mode, and mean for
the data in Exercise 6-32. Explain how these three measures of
location describe different features of the data.
6-37. The net energy consumption (in billions of kilowatthours) for countries in Asia in 2003 was as follows (source:
U.S. Department of Energy Web site, www.eia.doe.gov/emeu).
Construct a stem-and-leaf diagram for these data and comment
on any important features that you notice. Compute the sample
mean, sample standard deviation, and sample median.
Afghanistan
Australia
Bangladesh
Burma
China
Billions of Kilowatt-Hours
1.04
200.66
16.20
6.88
1671.23
211
38.43
519.04
101.80
946.27
17.43
303.33
3.30
73.63
2.91
2.30
37.03
71.54
44.48
30.89
6.80
154.34
107.34
36.92
4393.8
The female students in an undergraduate engineer6-38.
ing core course at ASU self-reported their heights to the nearest
inch. The data follow. Construct a stem-and-leaf diagram for
the height data and comment on any important features that
you notice. Calculate the sample mean, the sample standard
deviation, and the sample median of height.
62
64 61 67 65 68 61 65 60 65 64 63 59
68
64 66 68 69 65 67 62 66 68 67 66 65
69
65 69 65 67 67 65 63 64 67 65
6-39. The shear strengths of 100 spot welds in a titanium
alloy follow. Construct a stem-and-leaf diagram for the weld
strength data and comment on any important features that you
notice. What is the 95th percentile of strength?
5408 5431 5475 5442 5376 5388 5459 5422 5416 5435
5420 5429 5401 5446 5487 5416 5382 5357 5388 5457
5407 5469 5416 5377 5454 5375 5409 5459 5445 5429
5463 5408 5481 5453 5422 5354 5421 5406 5444 5466
5399 5391 5477 5447 5329 5473 5423 5441 5412 5384
5445 5436 5454 5453 5428 5418 5465 5427 5421 5396
5381 5425 5388 5388 5378 5481 5387 5440 5482 5406
5401 5411 5399 5431 5440 5413 5406 5342 5452 5420
5458 5485 5431 5416 5431 5390 5399 5435 5387 5462
5383 5401 5407 5385 5440 5422 5448 5366 5430 5418
6-40. An important quality characteristic of water is the concentration of suspended solid material. Following are 60 measurements on suspended solids from a certain lake. Construct
a stem-and-leaf diagram for these data and comment on any
important features that you notice. Compute the sample mean,
the sample standard deviation, and the sample median. What is
the 90th percentile of concentration?
212
Chapter 6/Descriptive Statistics
42.4 65.7 29.8 58.7 52.1 55.8 57.0 68.7 67.3 67.3
681 679 691 683 705 746 706 649 668 672 690 724
54.3 54.0 73.1 81.3 59.9 56.9 62.2 69.9 66.9 59.0
652 720 660 695 701 724 668 698 668 660 680 739
56.3 43.3 57.4 45.3 80.1 49.7 42.8 42.4 59.6 65.8
61.4 64.0 64.2 72.6 72.5 46.1 53.1 56.1 67.2 70.7
42.6 77.4 54.7 57.1 77.3 39.3 76.4 59.3 51.1 73.8
61.4 73.1 77.3 48.5 89.8 50.7 52.0 59.6 66.1 31.6
6-41. The United States Golf Association tests golf balls
to ensure that they conform to the rules of golf. Balls are
tested for weight, diameter, roundness, and overall distance.
The overall distance test is conducted by hitting balls with
a driver swung by a mechanical device nicknamed “Iron
Byron” after the legendary great Byron Nelson, whose swing
the machine is said to emulate. Following are 100 distances
(in yards) achieved by a particular brand of golf ball in the
overall distance test. Construct a stem-and-leaf diagram for
these data and comment on any important features that you
notice. Compute the sample mean, sample standard deviation, and the sample median. What is the 90th percentile of
distances?
261.3 259.4 265.7 270.6 274.2 261.4 254.5 283.7
258.1 270.5 255.1 268.9 267.4 253.6 234.3 263.2
717 727 653 637 660 693 679 682 724 642 704 695
704 652 664 702 661 720 695 670 656 718 660 648
683 723 710 680 684 705 681 748 697 703 660 722
662 644 683 695 678 674 656 667 683 691 680 685
681 715 665 676 665 675 655 659 720 675 697 663
6-43.
A group of wine enthusiasts taste-tested a pinot noir
wine from Oregon. The evaluation was to grade the wine on
a 0-to-100-point scale. The results follow. Construct a stemand-leaf diagram for these data and comment on any important
features that you notice. Compute the sample mean, the sample standard deviation, and the sample median. A wine rated
above 90 is considered truly exceptional. What proportion
of the taste-tasters considered this particular pinot noir truly
exceptional?
94
90
88
85
90
93
95
91
92
87
91
85
91
90
88
89
91
91
89
88
86
92
92
84
89
89
87
85
91
86
89
90
91
89
95
90
90
90
92
83
254.2 270.7 233.7 263.5 244.5 251.8 259.5 257.5
257.7 272.6 253.7 262.2 252.0 280.3 274.9 233.7
237.9 274.0 264.5 244.8 264.0 268.3 272.1 260.2
255.8 260.7 245.5 279.6 237.8 278.5 273.3 263.7
241.4 260.6 280.3 272.7 261.0 260.0 279.3 252.1
244.3 272.2 248.3 278.7 236.0 271.2 279.8 245.6
241.2 251.1 267.0 273.4 247.7 254.8 272.8 270.5
254.4 232.1 271.5 242.9 273.6 256.1 251.6
256.8 273.0 240.8 276.6 264.5 264.5 226.8
255.3 266.6 250.2 255.8 285.3 255.4 240.5
255.0 273.2 251.4 276.1 277.8 266.8 268.5
6-42. A semiconductor manufacturer produces devices used as
central processing units in personal computers. The speed of the
devices (in megahertz) is important because it determines the
price that the manufacturer can charge for the devices. The following table contains measurements on 120 devices. Construct
a stem-and-leaf diagram for these data and comment on any
important features that you notice. Compute the sample mean,
the sample standard deviation, and the sample median. What
percentage of the devices has a speed exceeding 700 megahertz?
680 669 719 699 670 710 722 663 658 634 720 690
6-44.
In their book Introduction to Linear Regression
Analysis (5th edition, Wiley, 2012), Montgomery, Peck, and
Vining presented measurements on NbOCl3 concentration
from a tube-low reactor experiment. The data, in gram-mole
per liter × 10 −3 , are as follows. Construct a stem-and-leaf diagram for these data and comment on any important features
that you notice. Compute the sample mean, the sample standard deviation, and the sample median.
450
450
473
507
457
452
453 1215 1256
1145 1085 1066 1111 1364 1254 1396 1575 1617
1733 2753 3186 3227 3469 1911 2588 2635 2725
6-45. In Exercise 6-38, we presented height data that were
self-reported by female undergraduate engineering students
in a core course at ASU. In the same class, the male students
self-reported their heights as follows. Construct a comparative
stem-and-leaf diagram by listing the stems in the center of the
display and then placing the female leaves on the left and the
male leaves on the right. Comment on any important features
that you notice in this display.
69 67 69 70 65 68 69 70 71 69 66 67 69 75 68 67 68
677 669 700 718 690 681 702 696 692 690 694 660
69 70 71 72 68 69 69 70 71 68 72 69 69 68 69 73 70
649 675 701 721 683 735 688 763 672 698 659 704
73 68 69 71 67 68 65 68 68 69 70 74 71 69 70 69
213
Section 6-3/Frequency Distributions and Histograms
6-3
Frequency Distributions and Histograms
Choosing the Number
of Bins in a Frequency
Distribution or Histogram is Important
A frequency distribution is a more compact summary of data than a stem-and-leaf diagram. To
construct a frequency distribution, we must divide the range of the data into intervals, which are
usually called class intervals, cells, or bins. If possible, the bins should be of equal width in order
to enhance the visual information in the frequency distribution. Some judgment must be used in
selecting the number of bins so that a reasonable display can be developed. The number of bins
depends on the number of observations and the amount of scatter or dispersion in the data. A frequency distribution that uses either too few or too many bins will not be informative. We usually
ind that between 5 and 20 bins is satisfactory in most cases and that the number of bins should
increase with n. Several sets of rules can be used to determine the member of bins in a histogram.
However, choosing the number of bins approximately equal to the square root of the number of
observations often works well in practice.
A frequency distribution for the comprehensive strength data in Table 6-2 is shown in
Table 6-4. Because the data set contains 80 observations, and because 80 ≃ 9 , we suspect
that about eight to nine bins will provide a satisfactory frequency distribution. The largest and
smallest data values are 245 and 76, respectively, so the bins must cover a range of at least
245 − 76 = 169 units on the psi scale. If we want the lower limit for the irst bin to begin
slightly below the smallest data value and the upper limit for the last bin to be slightly above
the largest data value, we might start the frequency distribution at 70 and end it at 250. This
is an interval or range of 180 psi units. Nine bins, each of width 20 psi, give a reasonable frequency distribution, so the frequency distribution in Table 6-4 is based on nine bins.
The second row of Table 6-4 contains a relative frequency distribution. The relative frequencies are found by dividing the observed frequency in each bin by the total number of
observations. The last row of Table 6-4 expresses the relative frequencies on a cumulative
basis. Frequency distributions are often easier to interpret than tables of data. For example,
from Table 6-4, it is very easy to see that most of the specimens have compressive strengths
between 130 and 190 psi and that 97.5 percent of the specimens fall below 230 psi.
The histogram is a visual display of the frequency distribution. The steps for constructing
a histogram follow.
Constructing a
Histogram (Equal
Bin Widths)
(1) Label the bin (class interval) boundaries on a horizontal scale.
(2) Mark and label the vertical scale with the frequencies or the relative frequencies.
(3) Above each bin, draw a rectangle where height is equal to the frequency (or relative frequency) corresponding to that bin.
Figure 6-7 is the histogram for the compression strength data. The histogram, like the stemand-leaf diagram, provides a visual impression of the shape of the distribution of the measurements and information about the central tendency and scatter or dispersion in the data.
5"#-&t6-4 Frequency Distribution for the Compressive Strength Data in Table 6-2
Class
Frequency
70 Ä x < 90 90 Ä x < 110 110 Ä x < 130 130 Ä x < 150 150 Ä x < 170 170 Ä x < 190 190 Ä x < 210 210 Ä x < 230 230 Ä x < 250
2
3
6
14
22
17
10
4
2
Relative
frequency
0.0250
0.0375
0.0750
0.1750
0.2750
0.2125
0.1250
0.0500
0.0250
Cumulative
relative
frequency
0.0250
0.0625
0.1375
0.3125
0.5875
0.8000
0.9250
0.9750
1.0000
Chapter 6/Descriptive Statistics
0.3125
25
0.2500
20
0.1895
0.1250
Frequency
FIGURE 6-7
Histogram of
compressive strength
for 80 aluminumlithium alloy
specimens.
Relative frequency
214
15
10
0.0625
5
0
0
70
90 110 130 150 170 190 210 230 250
Compressive strength (psi)
Notice the symmetric, bell-shaped distribution of the strength measurements in Fig. 6-7.
This display often gives insight about possible choices of probability distributions to use
as a model for the population. For example, here we would likely conclude that the normal
distribution is a reasonable model for the population of compression strength measurements.
Sometimes a histogram with unequal bin widths will be employed. For example, if the
data have several extreme observations or outliers, using a few equal-width bins will result
in nearly all observations falling in just a few of the bins. Using many equal-width bins will
result in many bins with zero frequency. A better choice is to use shorter intervals in the region
where most of the data fall and a few wide intervals near the extreme observations. When the
bins are of unequal width, the rectangle’s area (not its height) should be proportional to the
bin frequency. This implies that the rectangle height should be
Rectangular height =
Histograms are Best
for Relatively Large
Samples
Bin frequancy
Bin width
In passing from either the original data or stem-and-leaf diagram to a frequency distribution or histogram, we have lost some information because we no longer have the individual
observations. However, this information loss is often small compared with the conciseness
and ease of interpretation gained in using the frequency distribution and histogram.
Figure 6-8 is a histogram of the compressive strength data with 17 bins. We have noted
that histograms may be relatively sensitive to the number of bins and their width. For small
data sets, histograms may change dramatically in appearance if the number and/or width of
the bins changes. Histograms are more stable and thus reliable for larger data sets, preferably
of size 75 to 100 or more. Figure 6-9 is a histogram for the compressive strength data with
20
Frequency
Frequency
10
5
0
10
0
100
150
Strength
200
250
80
100 120 140 160 180 200 220 240
Strength
FIGURE 6-8 A histogram of the compressive
FIGURE 6-9 A histogram of the compressive strength
strength data with 17 bins.
data with nine bins.
Section 6-3/Frequency Distributions and Histograms
215
Cumulative frequency
80
FIGURE 6-10
A cumulative
distribution plot of the
compressive strength
data.
70
60
50
40
30
20
10
0
100
150
Strength
200
250
nine bins. This is similar to the original histogram shown in Fig. 6-7. Because the number of
observations is moderately large (n = 80), the choice of the number of bins is not especially
important, and both Figs. 6-8 and 6-9 convey similar information.
Figure 6-10 is a variation of the histogram available in some software packages, the
cumulative frequency plot. In this plot, the height of each bar is the total number of observations that are less than or equal to the upper limit of the bin. Cumulative distributions
are also useful in data interpretation; for example, we can read directly from Fig. 6-10 that
approximately 70 observations are less than or equal to 200 psi.
When the sample size is large, the histogram can provide a reasonably reliable indicator of
the general shape of the distribution or population of measurements from which the sample
was drawn. See Figure 6-11 for three cases. The median is denoted as xɶ . Generally, if the data
are symmetric, as in Fig. 6-11(b), the mean and median coincide. If, in addition, the data have
only one mode (we say the data are unimodal), the mean, median, and mode all coincide.
If the data are skewed (asymmetric, with a long tail to one side), as in Fig. 6-11(a) and (c),
the mean, median, and mode do not coincide. Usually, we ind that mode < median < mean
if the distribution is skewed to the right, whereas mode > median > mean if the distribution
is skewed to the left.
Frequency distributions and histograms can also be used with qualitative or categorical
data. Some applications will have a natural ordering of the categories (such as freshman,
sophomore, junior, and senior), whereas in others, the order of the categories will be arbitrary
(such as male and female). When using categorical data, the bins should have equal width.
Example 6-6
Figure 6-12 presents the production of transport aircraft by the Boeing Company in 1985. Notice
that the 737 was the most popular model, followed by the 757, 747, 767, and 707.
A chart of occurrences by category (in which the categories are ordered by the number of
occurrences) is sometimes referred to as a Pareto chart. An exercise asks you to construct
such a chart.
FIGURE 6-11
Histograms for
symmetric and
skewed distributions.
x
|
x
Negative or left skew
(a)
x
|
x
Symmetric
(b)
|
x
x
Positive or right skew
(c)
216
Chapter 6/Descriptive Statistics
FIGURE 6-12
Airplane production
in 1985. (Source:
Boeing Company.)
Number of airplanes
manufactured in 1985
250
150
100
50
0
737
757
747 767
Airplane model
707
In this section, we have concentrated on descriptive methods for the situation in which
each observation in a data set is a single number or belongs to one category. In many cases,
we work with data in which each observation consists of several measurements. For example,
in a gasoline mileage study, each observation might consist of a measurement of miles per
gallon, the size of the engine in the vehicle, engine horsepower, vehicle weight, and vehicle
length. This is an example of multivariate data. In section 6.6, we will illustrate one simple
graphical display or multivariate data. In later chapters, we will discuss analyzing this type
of data.
Exercises
FOR SECTION 6-3
Problem available in WileyPLUS at instructor’s discretion.
Tutoring problem available in WileyPLUS at instructor’s discretion.
6-46.
Construct a frequency distribution and histogram for
the motor fuel octane data from Exercise 6-30. Use eight bins.
6-47. Construct a frequency distribution and histogram using
the failure data from Exercise 6-31.
6-48.
Construct a frequency distribution and histogram
for the cotton content data in Exercise 6-32.
6-49. Construct a frequency distribution and histogram for the
yield data in Exercise 6-33.
6-50.
Construct frequency distributions and histograms
with 8 bins and 16 bins for the motor fuel octane data in Exercise 6-30. Compare the histograms. Do both histograms display similar information?
6-51. Construct histograms with 8 and 16 bins for the data in
Exercise 6-31. Compare the histograms. Do both histograms
display similar information?
6-52.
Construct histograms with 8 and 16 bins for the data
in Exercise 6-32. Compare the histograms. Do both histograms
display similar information?
6-53. Construct a histogram for the energy consumption data
in Exercise 6-37.
6-54.
Construct a histogram for the female student height
data in Exercise 6-38.
6-55. Construct a histogram for the spot weld shear strength
data in Exercise 6-39. Comment on the shape of the histogram.
Does it convey the same information as the stem-and-leaf
display?
6-56.
Construct a histogram for the water quality data in
Exercise 6-40. Comment on the shape of the histogram. Does
it convey the same information as the stem-and-leaf display?
6-57. Construct a histogram for the overall golf distance data
in Exercise 6-41. Comment on the shape of the histogram. Does
it convey the same information as the stem-and-leaf display?
6-58. Construct a histogram for the semiconductor speed
data in Exercise 6-42. Comment on the shape of the histogram. Does it convey the same information as the stem-andleaf display?
6-59. Construct a histogram for the pinot noir wine rating
data in Exercise 6-43. Comment on the shape of the histogram. Does it convey the same information as the stem-andleaf display?
6-60.
The Pareto Chart. An important variation of a
histogram for categorical data is the Pareto chart. This chart
is widely used in quality improvement efforts, and the categories usually represent different types of defects, failure
modes, or product/process problems. The categories are
ordered so that the category with the largest frequency is on
the left, followed by the category with the second largest frequency, and so forth. These charts are named after the Italian
economist V. Pareto, and they usually exhibit “Pareto’s law”;
that is, most of the defects can be accounted for by only a
few categories. Suppose that the following information on
structural defects in automobile doors is obtained: dents, 4;
Section 6-4/Box Plots
pits, 4; parts assembled out of sequence, 6; parts undertrimmed, 21; missing holes/slots, 8; parts not lubricated, 5;
parts out of contour, 30; and parts not deburred, 3. Construct
and interpret a Pareto chart.
6-61. Construct a frequency distribution and histogram for the
bridge condition data in Exercise 6-20.
6-4
217
6-62. Construct a frequency distribution and histogram for the
acid rain measurements in Exercise 6-21.
6-63. Construct a frequency distribution and histogram for the
combined cloud-seeding rain measurements in Exercise 6-22.
6-64. Construct a frequency distribution and histogram for the
swim time measurements in Exercise 6-24.
Box Plots
The stem-and-leaf display and the histogram provide general visual impressions about a data
set, but numerical quantities such as x or s provide information about only one feature of
the data. The box plot is a graphical display that simultaneously describes several important
features of a data set, such as center, spread, departure from symmetry, and identiication of
unusual observations or outliers.
A box plot, sometimes called box-and-whisker plots, displays the three quartiles, the minimum, and the maximum of the data on a rectangular box, aligned either horizontally or vertically. The box encloses the interquartile range with the left (or lower) edge at the irst quartile,
q1, and the right (or upper) edge at the third quartile, q3 . A line is drawn through the box at
the second quartile (which is the 50th percentile or the median), q2 = x . A line, or whisker,
extends from each end of the box. The lower whisker is a line from the irst quartile to the
smallest data point within 1.5 interquartile ranges from the irst quartile. The upper whisker is
a line from the third quartile to the largest data point within 1.5 interquartile ranges from the
third quartile. Data farther from the box than the whiskers are plotted as individual points. A
point beyond a whisker, but less than three interquartile ranges from the box edge, is called an
outlier. A point more than three interquartile ranges from the box edge is called an extreme
outlier. See Fig. 6-13. Occasionally, different symbols, such as open and illed circles, are
used to identify the two types of outliers.
Figure 6-14 presents a typical computer-generated box plot for the alloy compressive
strength data shown in Table 6-2. This box plot indicates that the distribution of compressive
strengths is fairly symmetric around the central value because the left and right whiskers and
the lengths of the left and right boxes around the median are about the same. There are also
two mild outliers at lower strength and one at higher strength. The upper whisker extends to
observation 237 because it is the highest observation below the limit for upper outliers. This
limit is q3 + 1.5IQR = 181 + 1.5 (181 − 143.5) = 237.25. The lower whisker extends to observation 97 because it is the smallest observation above the limit for lower outliers. This limit is
q1 − 1.5IQR = 143.5 − 1.5 (181 − 143.5) = 87.25.
Box plots are very useful in graphical comparisons among data sets because they have
high visual impact and are easy to understand. For example, Fig. 6-15 shows the comparative
box plots for a manufacturing quality index on semiconductor devices at three manufacturing
plants. Inspection of this display reveals that there is too much variability at plant 2 and that
plants 2 and 3 need to raise their quality index performance.
Whisker extends to
largest data point within
1.5 interquartile ranges
from third quartile
Whisker extends to
smallest data point within
1.5 interquartile ranges from
first quartile
First quartile
FIGURE 6-13
Description of a
box plot.
Second quartile
Third quartile
Outliers
1.5 IQR
Outliers
1.5 IQR
IQR
1.5 IQR
Extreme outlier
1.5 IQR
218
Chapter 6/Descriptive Statistics
120
237.25
1.5 IQR
200
q3 = 181
181
Strength
110
* 245
237
q2 = 161.5
IQR
143.5
150
Quality index
250
100
90
q1 = 143.5
80
1.5 IQR
100
87.25
97
* 87
70
* 76
FIGURE 6-14 Box plot for compressive strength data in
Table 6-2.
Exercises
1
2
3
Plant
FIGURE 6-15 Comparative box plots of a
quality index at three plants.
FOR SECTION 6-4
Problem available in WileyPLUS at instructor’s discretion.
Tutoring problem available in WileyPLUS at instructor’s discretion.
6-65. Using the data on bridge conditions from Exercise 6-20,
(a) Find the quartiles and median of the data.
(b) Draw a box plot for the data.
(c) Should any points be considered potential outliers? Compare
this to your answer in Exercise 6-20. Explain.
6-66. Using the data on acid rain from Exercise 6-21,
(a) Find the quartiles and median of the data.
(b) Draw a box plot for the data.
(c) Should any points be considered potential outliers? Compare
this to your answer in Exercise 6-21. Explain.
6-67. Using the data from Exercise 6-22 on cloud seeding,
(a) Find the median and quartiles for the unseeded cloud data.
(b) Find the median and quartiles for the seeded cloud data.
(c) Make two side-by-side box plots, one for each group on the
same plot.
(d) Compare the distributions from what you can see in the
side-by-side box plots.
6-68. Using the data from Exercise 6-24 on swim times,
(a) Find the median and quartiles for the data.
(b) Make a box plot of the data.
(c) Repeat (a) and (b) for the data without the extreme outlier
and comment.
(d) Compare the distribution of the data with and without the
extreme outlier.
6-69.
The “cold start ignition time” of an
automobile engine is being investigated by a gasoline manufacturer. The following times (in seconds) were obtained for a
test vehicle: 1.75, 1.92, 2.62, 2.35, 3.09, 3.15, 2.53, 1.91.
(a) Calculate the sample mean, sample variance, and sample
standard deviation.
(b) Construct a box plot of the data.
6-70. An article in Transactions of the Institution of Chemical Engineers (1956, Vol. 34, pp. 280–293) reported data from
an experiment investigating the effect of several process variables on the vapor phase oxidation of naphthalene. A sample
of the percentage mole conversion of naphthalene to maleic
anhydride follows: 4.2, 4.7, 4.7, 5.0, 3.8, 3.6, 3.0, 5.1, 3.1, 3.8,
4.8, 4.0, 5.2, 4.3, 2.8, 2.0, 2.8, 3.3, 4.8, 5.0.
(a) Calculate the sample mean, sample variance, and sample
standard deviation.
(b) Construct a box plot of the data.
6-71.
The nine measurements that follow are furnace temperatures recorded on successive batches in a semiconductor
manufacturing process (units are °F): 953, 950, 948, 955, 951,
949, 957, 954, 955.
(a) Calculate the sample mean, sample variance, and standard
deviation.
(b) Find the median. How much could the highest temperature
measurement increase without changing the median value?
(c) Construct a box plot of the data.
6-72. Exercise 6-18 presents drag coeficients for the NASA
0012 airfoil. You were asked to calculate the sample mean, sample variance, and sample standard deviation of those coeficients.
(a) Find the median and the upper and lower quartiles of the
drag coeficients.
(b) Construct a box plot of the data.
(c) Set aside the highest observation (100) and rework parts (a)
and (b). Comment on your indings.
6-73. Exercise 6-19 presented the joint temperatures of the
O-rings (°F) for each test iring or actual launch of the space
shuttle rocket motor. In that exercise, you were asked to ind
the sample mean and sample standard deviation of temperature.
Section 6-5/Time Sequence Plots
(a) Find the median and the upper and lower quartiles of
temperature.
(b) Set aside the lowest observation (31°F ) and recompute the
quantities in part (a). Comment on your indings. How “different” are the other temperatures from this lowest value?
(c) Construct a box plot of the data and comment on the possible presence of outliers.
6-74.
Reconsider the motor fuel octane rating data in
Exercise 6-28. Construct a box plot of the data and write an
interpretation of the plot. How does the box plot compare in
interpretive value to the original stem-and-leaf diagram?
6-75. Reconsider the energy consumption data in Exercise
6-37. Construct a box plot of the data and write an interpretation of the plot. How does the box plot compare in interpretive
value to the original stem-and-leaf diagram?
6-76.
Reconsider the water quality data in Exercise 6-40.
Construct a box plot of the concentrations and write an interpretation of the plot. How does the box plot compare in interpretive value to the original stem-and-leaf diagram?
6-77. Reconsider the weld strength data in Exercise 6-39.
Construct a box plot of the data and write an interpretation of
the plot. How does the box plot compare in interpretive value
to the original stem-and-leaf diagram?
6-78. Reconsider the semiconductor speed data in Exercise
6-42. Construct a box plot of the data and write an interpretation of the plot. How does the box plot compare in interpretive
value to the original stem-and-leaf diagram?
6-79.
Use the data on heights of female and male engineering students from Exercises 6-38 and 6-45 to construct
comparative box plots. Write an interpretation of the information that you see in these plots.
6-80
In Exercise 6-69, data were presented on the cold
start ignition time of a particular gasoline used in a test vehicle.
A second formulation of the gasoline was tested in the same
vehicle, with the following times (in seconds): 1.83, 1.99, 3.13,
3.29, 2.65, 2.87, 3.40, 2.46, 1.89, and 3.35. Use these new data
along with the cold start times reported in Exercise 6-69 to
219
construct comparative box plots. Write an interpretation of the
information that you see in these plots.
6-81. An article in Nature Genetics [“Treatment-speciic
Changes in Gene Expression Discriminate in Vivo Drug
Response in Human Leukemia Cells” (2003, Vol. 34(1), pp.
85–90)] studied gene expression as a function of treatments for
leukemia. One group received a high dose of the drug, while
the control group received no treatment. Expression data (measures of gene activity) from one gene are shown in Table 6E.1.
Construct a box plot for each group of patients. Write an interpretation to compare the information in these plots.
5"#-&t6E.1 Gene Expression
High Dose
16.1
Control
297.1
Control
25.1
Control
131.1
134.9
52.7
14.4
124.3
99
24.3
16.3
15.2
47.7
12.9
72.7
126.7
46.4
60.3
23.5
43.6
79.4
38
58.2
26.5
491.8
1332.9
1172
1482.7
335.4
528.9
24.1
545.2
92.9
337.1
102.3
255.1
100.5
159.9
168
95.2
132.5
442.6
15.8
175.6
820.1
82.5
713.9
785.6
114
31.9
86.3
646.6
169.9
20.2
280.2
194.2
408.4
155.5
864.6
355.4
634
2029.9
362.1
166.5
2258.4
497.5
263.4
252.3
351.4
678.9
3010.2
67.1
318.2
2476.4
181.4
2081.5
424.3
188.1
563
149.1
2122.9
1295.9
6-5 Time Sequence Plots
The graphical displays that we have considered thus far such as histograms, stem-and-leaf plots, and
box plots are very useful visual methods for showing the variability in data. However, we noted in
Chapter 1 that time is an important factor that contributes to variability in data, and those graphical
methods do not take this into account. A time series or time sequence is a data set in which the
observations are recorded in the order in which they occur. A time series plot is a graph in which the
vertical axis denotes the observed value of the variable (say, x ) and the horizontal axis denotes the
time (which could be minutes, days, years, etc.). When measurements are plotted as a time series,
we often see trends, cycles, or other broad features of the data that could not be seen otherwise.
For example, consider Fig. 6-16(a), which presents a time series plot of the annual sales of
a company for the last 10 years. The general impression from this display is that sales show
an upward trend. There is some variability about this trend with some years’ sales increasing
over those of the last year and some years’ sales decreasing. Figure 6-16(b) shows the last
three years of sales reported by quarter. This plot clearly shows that the annual sales in this
business exhibit a cyclic variability by quarter with the irst- and second-quarter sales being
generally higher than sales during the third and fourth quarters.
220
Chapter 6/Descriptive Statistics
Sales, x
Sales, x
Sometimes it can be very helpful to combine a time series plot with some of the other
graphical displays that we have considered previously. J. Stuart Hunter (The American Statistician, 1988, Vol. 42, p. 54) has suggested combining the stem-and-leaf plot with a time series
plot to form a digidot plot.
Figure 6-17 is a digidot plot for the observations on compressive strength from Table
6-2, assuming that these observations are recorded in the order in which they occurred.
This plot effectively displays the overall variability in the compressive strength data
and simultaneously shows the variability in these measurements over time. The general
impression is that compressive strength varies around the mean value of 162.66, and no
strong obvious pattern occurs in this variability over time.
The digidot plot in Fig. 6-18 tells a different story. This plot summarizes 30 observations on
concentration of the output product from a chemical process where the observations are recorded
at one-hour time intervals. This plot indicates that during the irst 20 hours of operation, this
process produced concentrations generally above 85 grams per liter, but that following sample
20, something may have occurred in the process that resulted in lower concentrations. If this variability in output product concentration can be reduced, operation of this process can be improved.
Notice that this apparent change in the process output is not seen in the stem-and-leaf portion of
the digidot plot. The stem-and-leaf plot compresses the time dimension out of the data. This illustrates why it is always important to construct a time series plot for time-oriented data.
1982 1983 1984 1985 1986 19871988 1989 1990 1991 Years
(a)
1
2 3
1989
4
1
2 3
1990
(b)
FIGURE 6-16 Company sales by year (a). By quarter (b).
Leaf
FIGURE 6-17 A
digidot plot of the
compressive strength
data in Table 6-2.
Stem
5
24
7
23
189
22
8
21
7108
20
960934
19
0361410
18
8544162106
17
3073050879
16
471340886808
15
29583169
14
413535
13
103
12
580
11
15
10
7
9
7
8
6
7
Time series plot
4
1
2 3
1991
4 Quarters
Section 6-5/Time Sequence Plots
Leaf
FIGURE 6-18 A
digidot plot of
chemical process
concentration
readings, observed
hourly.
Exercises
Stem
8
9e
6
9s
45
9f
2333
9t
0010000
9z
99998
8e
66677
8s
45
8f
23
8t
1
8z
221
Time series plot
FOR SECTION 6-5
Problem available in WileyPLUS at instructor’s discretion.
Tutoring problem available in WileyPLUS at instructor’s discretion.
6-82.
The following data are the viscosity measurements
for a chemical product observed hourly (read down, then left to
right). Construct and interpret either a digidot plot or a separate
stem-and-leaf and time series plot of these data. Speciications
on product viscosity are at 48 ± 2 . What conclusions can you
make about process performance?
47.9
47.9
48.6
48.0
48.4
48.1
48.0
48.6
48.8
48.1
48.3
47 .2
48.9
48.6
48.0
47 .5
48.6
48.0
47.9
48.3
48.5
48.1
48.0
48.3
43.2
43.0
43.5
43.1
43.0
42.9
43.6
43.3
43.0
42.8
43.1
43.2
43.6
43.2
43.5
43.0
The pull-off force for a connector is measured in a
6-83.
laboratory test. Data for 40 test specimens follow (read down,
then left to right). Construct and interpret either a digidot plot
or a separate stem-and-leaf and time series plot of the data.
241
203
201
251
236
190
258
195
195
238
245
175
237
249
255
210
209
178
210
220
245
198
212
175
194
194
235
199
185
190
225
245
220
183
187
248
209
249
213
218
6-84.
In their book Time Series Analysis, Forecasting, and
Control (Prentice Hall, 1994), G. E. P. Box, G. M. Jenkins, and
G. C. Reinsel present chemical process concentration readings
made every two hours. Some of these data follow (read down,
then left to right).
17.0
16.6
16.7
17.4
17.1
17.4
17.5
18.1
17.6
17.5
16.3
16.1
17.1
16.9
16.8
17.4
17.1
17.0
17.2
17.4
17.4
17.0
17.3
17.2
17.4
16.8
17.4
17.5
17.4
17.6
17.4
17.3
17.0
17.8
17.5
17.4
17.4
17.1
17.6
17.7
17.4
17.8
16.5
17.8
17.3
17.3
17.1
17.4
16.9
17.3
Construct and interpret either a digidot plot or a separate stemand-leaf and time series plot of these data.
The 100 annual Wolfer sunspot numbers from 1770 to
6-85.
1869 follow. (For an interesting analysis and interpretation of these
numbers, see the book by Box, Jenkins, and Reinsel referenced in
Exercise 6-84. Their analysis requires some advanced knowledge
of statistics and statistical model building.) Read down, then left to
right. The 1869 result is 74. Construct and interpret either a digidot
plot or a stem-and-leaf and time series plot of these data.
101
82
66
35
41
21
16
6
4
7
14
34
45
43
48
42
28
31
7
20
92
10
8
2
0
1
5
12
14
35
46
41
30
24
154
125
85
68
16
7
4
2
8
17
36
50
62
67
71
48
28
38
23
10
24
8
13
57
122
138
103
86
63
37
24
11
15
40
83
132
131
118
62
98
124
96
66
64
54
39
21
7
4
23
55
90
67
60
47
94
96
77
59
44
47
30
16
7
37
74
222
Chapter 6/Descriptive Statistics
In their book Introduction to Time Series Analysis and
6-86.
Forecasting (Wiley, 2008), Montgomery, Jennings, and Kolahci
presented the data in Table 6E.2, which are the monthly total passenger airline miles lown in the United Kingdom from 1964 to
1970 (in millions of miles). Comment on any features of the data
that are apparent. Construct and interpret either a digidot plot or a
separate stem-and-leaf and time series plot of these data.
6-87.
Table 6E.3 shows the number of earthquakes per
year of magnitude 7.0 and higher since 1900 (source: Earthquake Data Base System of the U.S. Geological Survey,
National Earthquake Information Center, Golden, Colorado).
Construct and interpret either a digidot plot or a separate stemand-leaf and time series plot of these data.
6-88. Table 6E.4 shows U.S. petroleum imports as a percentage of the totals, and Persian Gulf imports as a percentage of
all imports by year since 1973 (source: U.S. Department of
Energy Web site, www.eia.doe.gov/). Construct and interpret
either a digidot plot or a separate stem-and-leaf and time series
plot for each column of data.
6-89. Table 6E.5 contains the global mean surface air temperature anomaly and the global CO2 concentration for the years
1880–2004. The temperature is measured at a number of locations around the world and averaged annually, and then subtracted from a base period average (1951–1980) and the result
reported as an anomaly.
(a) Construct a time series plot of the global mean surface air
temperature anomaly data and comment on any features
that you observe.
(b) Construct a time series plot of the global CO2 concentration
data and comment on any features that you observe.
(c) Overlay the two plots on the same set of axes and comment
on the plot.
5"#-&t6E.2 United Kingdom Passenger Airline Miles Flown
Month
Jan.
Feb.
Mar.
Apr.
May
June
July
Aug.
Sept.
Oct.
Nov.
Dec.
1964
1965
1966
1967
1968
1969
1970
7.269
6.775
7.819
8.371
9.069
10.248
11.030
10.882
10.333
9.109
7.685
7.682
8.350
7.829
8.829
9.948
10.638
11.253
11.424
11.391
10.665
9.396
7.775
7.933
8.186
7.444
8.484
9.864
10.252
12.282
11.637
11.577
12.417
9.637
8.094
9.280
8.334
7.899
9.994
10.078
10.801
12.953
12.222
12.246
13.281
10.366
8.730
9.614
8.639
8.772
10.894
10.455
11.179
10.588
10.794
12.770
13.812
10.857
9.290
10.925
9.491
8.919
11.607
8.852
12.537
14.759
13.667
13.731
15.110
12.185
10.645
12.161
10.840
10.436
13.589
13.402
13.103
14.933
14.147
14.057
16.234
12.389
11.594
12.772
5"#-&t6E.3 Earthquake Data
1900
1901
1902
1903
1904
1905
1906
1907
1908
1909
1910
1911
1912
1913
1914
13
14
8
10
16
26
32
27
18
32
36
24
22
23
22
1928
1929
1930
1931
1932
1933
1934
1935
1936
1937
1938
1939
1940
1941
1942
22
19
13
26
13
14
22
24
21
22
26
21
23
24
27
1956
1957
1958
1959
1960
1961
1962
1963
1964
1965
1966
1967
1968
1969
1970
15
34
10
15
22
18
15
20
15
22
19
16
30
27
29
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
8
15
6
11
8
7
18
16
13
12
13
20
15
16
12
Section 6-5/Time Sequence Plots
1915
1916
1917
1918
1919
1920
1921
1922
1923
1924
1925
1926
1927
18
25
21
21
14
8
11
14
23
18
17
19
20
1943
1944
1945
1946
1947
1948
1949
1950
1951
1952
1953
1954
1955
41
31
27
35
26
28
36
39
21
17
22
17
19
1971
1972
1973
1974
1975
1976
1977
1978
1979
1980
1981
1982
1983
23
20
16
21
21
25
16
18
15
18
14
10
15
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
5"#-&t6E.4 Petroleum Import Data
Year
Petroleum Imports
(thousand barrels per
day)
Total Petroleum Imports
as Percent of Petroleum
Products Supplied
Petroleum Imports from Persian
Gulf as Percent of Total
Petroleum Imports
1973
1974
1975
1976
1977
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
6256
6112
6055
7313
8807
8363
8456
6909
5996
5113
5051
5437
5067
6224
6678
7402
8061
8018
7627
7888
8620
8996
8835
9478
10,162
10,708
10,852
11,459
11,871
36.1
36.7
37.1
41.8
47.7
44.3
45.6
40.5
37.3
33.4
33.1
34.5
32.2
38.2
40.0
42.8
46.5
47.1
45.6
46.3
50.0
50.7
49.8
51.7
54.5
56.6
55.5
58.1
60.4
13.5
17.0
19.2
25.1
27.8
26.5
24.4
21.9
20.3
13.6
8.7
9.3
6.1
14.6
16.1
20.8
23.0
24.5
24.1
22.5
20.6
19.2
17.8
16.9
17.2
19.9
22.7
21.7
23.2
18
15
16
13
15
16
11
11
18
12
15
223
224
Chapter 6/Descriptive Statistics
5"#-&t6E.4 (Continued)
Year
Petroleum Imports
(thousand barrels per day)
Total Petroleum Imports as
Percent of Petroleum Products Supplied
Petroleum Imports from Persian Gulf as
Percent of Total Petroleum
Imports
2002
2003
2004
2005
2006
2007
2008
11,530
12,264
13,145
13,714
13,707
13,468
12,915
58.3
61.2
63.4
65.9
66.3
65.1
66.2
19.6
20.3
18.9
17.0
16.1
16.1
18.4
5"#-&t6E.5 Global Mean Surface Air Temperature Anomaly and Global CO2 Concentration
Year
Anomaly, oC
CO2, ppmv
Year
Anomaly, oC
CO2, ppmv
Year
Anomaly, oC
CO2, ppmv
1880
−0.11
290.7
1922
−0.09
303.8
1964
−0.25
319.2
1881
−0.13
291.2
1923
−0.16
304.1
1965
−0.15
320.0
1882
−0.01
291.7
1924
−0.11
304.5
1966
−0.07
321.1
1883
−0.04
292.1
1925
−0.15
305.0
1967
−0.02
322.0
1884
−0.42
292.6
1926
0.04
305.4
1968
−0.09
322.9
1885
–0.23
293.0
1927
−0.05
305.8
1969
0.00
324.2
1886
−0.25
293.3
1928
0.01
306.3
1970
0.04
325.2
1887
−0.45
293.6
1929
−0.22
306.8
1971
−0.10
326.1
1888
1889
−0.23
0.04
293.8
294.0
1930
1931
−0.03
0.03
307.2
307.7
1972
1973
−0.05
0.18
327.2
328.8
1890
−0.22
294.2
1932
0.04
308.2
1974
−0.06
329.7
1891
−0.55
294.3
1933
−0.11
308.6
1975
−0.02
330.7
1892
−0.40
294.5
1934
0.05
309.0
1976
−0.21
331.8
1893
−0.39
294.6
1935
−0.08
309.4
1977
0.16
333.3
1894
−0.32
294.7
1936
0.01
309.8
1978
0.07
334.6
1895
−0.32
294.8
1937
0.12
310.0
1979
0.13
336.9
1896
−0.27
294.9
1938
0.15
310.2
1980
0.27
338.7
1897
−0.15
295.0
1939
−0.02
310.3
1981
0.40
339.9
1898
−0.21
295.2
1940
0.14
310.4
1982
0.10
341.1
1899
−0.25
295.5
1941
0.11
310.4
1983
0.34
342.8
1900
−0.05
295.8
1942
0.10
310.3
1984
0.16
344.4
1901
−0.05
296.1
1943
0.06
310.2
1985
0.13
345.9
1902
−0.30
296.5
1944
0.10
310.1
1986
0.19
347.2
1903
−0.35
296.8
1945
−0.01
310.1
1987
0.35
348.9
1904
−0.42
297.2
1946
0.01
310.1
1988
0.42
351.5
1905
−0.25
297.6
1947
0.12
310.2
1989
0.28
352.9
1906
−0.15
298.1
1948
−0.03
310.3
1990
0.49
354.2
1907
−0.41
298.5
1949
−0.09
310.5
1991
0.44
355.6
1908
−0.30
298.9
1950
−0.17
310.7
1992
0.16
356.4
1909
−0.31
299.3
1951
−0.02
311.1
1993
0.18
357.0
299.7
1952
0.03
311.5
1994
1910
−0.21
0.31
358.9
Section 6-6/Scatter Diagrams
Year
Anomaly, oC
CO2, ppmv
Year
Anomaly, oC
CO2, ppmv
Year
Anomaly, oC
CO2, ppmv
1911
−0.25
300.1
1912
−0.33
300.4
1953
0.12
311.9
1995
0.47
360.9
1954
−0.09
312.4
1996
0.36
1913
−0.28
362.6
300.8
1955
−0.09
313.0
1997
0.40
1914
1915
363.8
−0.02
0.06
301.1
301.4
1956
1957
−0.18
0.08
313.6
314.2
1998
1999
0.71
0.43
366.6
368.3
1916
−0.20
301.7
1958
0.10
314.9
2000
0.41
369.5
1917
−0.46
302.1
1959
0.05
315.8
2001
0.56
371.0
1918
−0.33
302.4
1960
−0.02
316.6
2002
0.70
373.1
1919
−0.09
302.7
1961
0.10
317.3
2003
0.66
375.6
1920
−0.15
303.0
1962
0.05
318.1
2004
0.60
377.4
1921
−0.04
303.4
1963
0.03
318.7
225
(source: http://data.giss.nasa.gov/gistemp/)
5"#-&t6.5 Quality Data for Young Red Wines
6-6
Quality
pH
Total SO2
Color Density
Color
19.2
18.3
17.1
15.2
14.0
13.8
12.8
17.3
16.3
16.0
15.7
15.3
14.3
14.0
13.8
12.5
11.5
14.2
17.3
15.8
3.85
3.75
3.88
3.66
3.47
3.75
3.92
3.97
3.76
3.98
3.75
3.77
3.76
3.76
3.90
3.80
3.65
3.60
3.86
3.93
66
79
73
86
178
108
96
59
22
58
120
144
100
104
67
89
192
301
99
66
9.35
11.15
9.40
6.40
3.60
5.80
5.00
10.25
8.20
10.15
8.80
5.60
5.55
8.70
7.41
5.35
6.35
4.25
12.85
4.90
5.65
6.95
5.75
4.00
2.25
3.20
2.70
6.10
5.00
6.00
5.50
3.35
3.25
5.10
4.40
3.15
3.90
2.40
7.70
2.75
Scatter Diagrams
In many problems, engineers and scientists work with data that is multivariate in nature; that
is, each observation consists of measurements of several variables. We saw an example of this
in the wire bond pull strength data in Table 1.2. Each observation consisted of data on the
pull strength of a particular wire bond, the wire length, and the die height. Such data are very
commonly encountered. Table 6.5 contains a second example of multivariate data taken from
an article on the quality of different young red wines in the Journal of the Science of Food
226
Chapter 6/Descriptive Statistics
Scatterplot of Quality vs. Color
20
19
18
Quality
17
16
15
14
13
FIGURE 6-19
Scatter diagram of
wine quality and
color from Table 6-5.
12
11
2
3.50
3
3.75
4
4.00
5
Color
100
200
6
300 4
7
8
8
12 2.0
4.5
7.0
20
16
Quality
12
4.00
3.75
pH
3.50
300
200
Total so2
100
12
FIGURE 6-20
Matrix of scatter
diagrams for the
wine quality data in
Table 6-5.
Color Density
8
4
Color
and Agriculture (1974, Vol. 25) by T.C. Somers and M.E. Evans. The authors reported quality
along with several other descriptive variables. We show only quality, pH, total SO2 (in ppm),
color density, and wine color for a sample of their wines.
Suppose that we wanted to graphically display the potential relationship between quality and
one of the other variables, say color. The scatter diagram is a useful way to do this. A scatter
diagram is constructed by plotting each pair of observations with one measurement in the pair
on the vertical axis of the graph and the other measurement in the pair on the horizontal axis.
Figure 6.19 is the scatter diagram of quality versus the descriptive variable color. Notice
that there is an apparent relationship between the two variables with wines of more intense
color generally having a higher quality rating.
A scatter diagram is an excellent exploratory tool and can be very useful in identifying potential relationships between two variables. Data in Figure 6-19 indicate that a linear relationship
Section 6-6/Scatter Diagrams
227
between quality and color may exist. We saw an example of a three-dimensional scatter diagram
in Chapter 1 where we plotted wire bond strength versus wire length and die height for the bond
pull strength data.
When two or more variables exist, the matrix of scatter diagrams may be useful in
looking at all of the pairwise relationships between the variables in the sample. Figure 6-20
is the matrix of scatter diagrams (upper half only shown) for the wine quality data in
Table 6-5. The top row of the graph contains individual scatter diagrams of quality versus the other four descriptive variables, and other cells contain other pairwise plots of
the four descriptive variables pH, SO2, color density, and color. This display indicates a
weak potential linear relationship between quality and pH and somewhat stronger potential
relationships between quality and color density and quality and color (which was noted
previously in Figure 6-19). A strong apparent linear relationship between color density and
color exists (this should be expected).
The sample correlation coeficient rxy is a quantitative measure of the strength of the
linear relationship between two random variables x and y. The sample correlation coeficient
is deined as
n
rxy =
∑ yi ( xi − x )
i =1
n
⎡n
2
2⎤
⎢ ∑ ( yi − y ) ∑ ( xi − x ) ⎥
i =1
⎦
⎣ i =1
(6.6)
1/ 2
If the two variables are perfectly linearly related with a positive slope rxy = 1 and if they are
perfectly linearly related with a negative slope, then rxy = −1. If no linear relationship between
the two variables exists, then rxy = 0. The simple correlation coeficient is also sometimes
called the Pearson correlation coeficient after Karl Pearson, one of the giants of the ields
of statistics in the late 19th and early 20th centuries.
The value of the sample correlation coeficient between quality and color, the two variables plotted in the scatter diagram of Figure 6-19, is 0.712. This is moderately strong correlation, indicating a possible linear relationship between the two variables. Correlations below
| 0.5 | are generally considered weak and correlations above | 0.8 | are generally considered
strong.
All pairwise sample correlations between the ive variables in Table 6-5 are as follows:
Quality
pH
Total SO2
Color density
Color
pH
0.349
−0.445
0.702
0.712
Total SO2
Color
Density
−0.679
0.482
0.430
−0.492
−0.480
0.996
Moderately strong correlations exist between quality and the two variables color and
color density and between pH and total SO2 (note that this correlation is negative). The
correlation between color and color density is 0.996, indicating a nearly perfect linear
relationship.
See Fig. 6-21 for several examples of scatter diagrams exhibiting possible relationships
between two variables. Parts (e) and (f) of the igure deserve special attention; in part (e), a
probable quadratic relationship exists between y and x, but the sample correlation coeficient
is close to zero because the correlation coeficient is a measure of linear association, but
in part (f), the correlation is approximately zero because no association exists between the
two variables.
228
Chapter 6/Descriptive Statistics
(a) Weak positive relationship
(b) Strong positive relationship
(c) Weak negative relationship
(d) Strong negative relationship
(e) Nonlinear quadratic relationship, rxy < o
(f) No relationship, rxy < o
FIGURE 6-21
Potential relationship
between variables.
Exercises
FOR SECTION 6-6
Problem available in WileyPLUS at instructor’s discretion.
Tutoring problem available in WileyPLUS at instructor’s discretion.
6-90. Table 6E.6 presents data on the ratings of quarterbacks
for the 2008 National Football League season (source: The
Sports Network). It is suspected that the rating (y) is related to
the average number of yards gained per pass attempt (x).
(a) Construct a scatter plot of quarterback rating versus yards
per attempt. Comment on the suspicion that rating is related
to yards per attempt.
(b) What is the simple correlation coeficient between these two
variables?
6-91. An article in Technometrics by S. C. Narula and J. F.
Wellington [“Prediction, Linear Regression, and a Minimum
Sum of Relative Errors” (1977, Vol. 19)] presents data on
the selling price and annual taxes for 24 houses. The data are
shown in Table 6E.7.
5"#-&t6E.6 2008 NFL Quarterback Rating Data
Player
Team
Yards per
Attempt
Rating
Points
Philip
Rivers
SD
8.39
105.5
Chad
Pennington
MIA
7.67
97.4
Kurt
Warner
ARI
7.66
96.9
Drew
Brees
NO
7.98
96.2
Peyton
Manning
IND
7.21
95
Aaron
Rodgers
GB
7.53
93.8
Section 6-6/Scatter Diagrams
Matt
Schaub
HOU
8.01
92.7
Tony
Romo
DAL
7.66
91.4
Jeff
Garcia
TB
7.21
90.2
Matt
Cassel
NE
7.16
89.4
Matt
Ryan
ATL
7.93
87.7
Shaun
Hill
SF
7.10
87.5
Seneca
Wallace
SEA
6.33
87
Eli
Manning
NYG
6.76
86.4
Donovan
McNabb
PHI
6.86
86.4
Jay
Cutler
DEN
7.35
86
Trent
Edwards
BUF
7.22
85.4
Jake
Delhomme
CAR
7.94
84.7
Jason
Campbell
WAS
6.41
84.3
David
Garrard
JAC
6.77
81.7
Brett
Favre
NYJ
6.65
81
Joe
Flacco
BAL
6.94
80.3
Kerry
Collins
TEN
6.45
80.2
Ben
Roethlisberger
PIT
7.04
80.1
Kyle
Orton
CHI
6.39
79.6
JaMarcus
Russell
OAK
6.58
77.1
Tyler
Thigpen
KC
6.21
76
Gus
Freotte
MIN
7.17
73.7
Dan
Orlovsky
DET
6.34
72.6
Marc
Bulger
STL
6.18
71.4
Ryan
Fitzpatrick
CIN
5.12
70
Derek
Anderson
CLE
5.71
66.5
(a) Construct a scatter plot of sales price versus taxes paid.
Comment on the widely held belief that price is related to
taxes paid.
(b) What is the simple correlation coeficient between these
two variables?
6-92. An article in the Journal of Pharmaceuticals Sciences
(1991, Vol. 80, pp. 971–977) presented data on the observed
mole fraction solubility of a solute at a constant temperature
and the dispersion, dipolar, and hydrogen-bonding Hansen
partial solubility parameters. The data are as shown in Table
6E.8, where y is the negative logarithm of the mole fraction
solubility, x1 is the dispersion partial solubility, x2 is the dipolar partial solubility, and x3 is the hydrogen bonding partial
solubility.
(a) Construct a matrix of scatter plots for these variables.
(b) Comment on the apparent relationships among y and the
other three variables?
5"#-&t6E.7 House Price and Tax Data
Sale
Price/1000
Taxes (local,
school),
county)/1000
Sale
Price/1000
Taxes (local,
school),
county)/1000
25.9
29.5
27.9
25.9
29.9
29.9
30.9
28.9
35.9
31.5
31.0
30.9
4.9176
5.0208
4.5429
4.5573
5.0597
3.8910
5.8980
5.6039
5.8282
5.3003
6.2712
5.9592
30.0
36.9
41.9
40.5
43.9
37.5
37.9
44.5
37.9
38.9
36.9
45.8
5.0500
8.2464
6.6969
7.7841
9.0384
5.9894
7.5422
8.7951
6.0831
8.3607
8.1400
9.1416
5"#-&t6E.8 Solubility Data for Exercise 6-93
Observation
Number
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
y
0.22200
0.39500
0.42200
0.43700
0.42800
0.46700
0.44400
0.37800
0.49400
0.45600
0.45200
0.11200
0.43200
0.10100
0.23200
0.30600
0.09230
0.11600
0.07640
0.43900
0.09440
0.11700
0.07260
0.04120
0.25100
0.00002
x1
7.3
8.7
8.8
8.1
9.0
8.7
9.3
7.6
10.0
8.4
9.3
7.7
9.8
7.3
8.5
9.5
7.4
7.8
7.7
10.3
7.8
7.1
7.7
7.4
7.3
7.6
x2
0.0
0.0
0.7
4.0
0.5
1.5
2.1
5.1
0.0
3.7
3.6
2.8
4.2
2.5
2.0
2.5
2.8
2.8
3.0
1.7
3.3
3.9
4.3
6.0
2.0
7.8
x3
0.0
0.3
1.0
0.2
1.0
2.8
1.0
3.4
0.3
4.1
2.0
7.1
2.0
6.8
6.6
5.0
7.8
7.7
8.0
4.2
8.5
6.6
9.5
10.9
5.2
20.7
229
230
Chapter 6/Descriptive Statistics
6-7
Probability Plots
How do we know whether a particular probability distribution is a reasonable model for data?
Sometimes this is an important question because many of the statistical techniques presented
in subsequent chapters are based on an assumption that the population distribution is of a speciic type. Thus, we can think of determining whether data come from a speciic probability
distribution as verifying assumptions. In other cases, the form of the distribution can give
insight into the underlying physical mechanism generating the data. For example, in reliability
engineering, verifying that time-to-failure data come from an exponential distribution identiies the failure mechanism in the sense that the failure rate is constant with respect to time.
Some of the visual displays we used earlier, such as the histogram, can provide insight about
the form of the underlying distribution. However, histograms are usually not really reliable indicators of the distribution form unless the sample size is very large. A probability plot is a graphical
method for determining whether sample data conform to a hypothesized distribution based on a
subjective visual examination of the data. The general procedure is very simple and can be performed quickly. It is also more reliable than the histogram for small- to moderate-size samples.
Probability plotting typically uses special axes that have been scaled for the hypothesized distribution. Software is widely available for the normal, lognormal, Weibull, and various chi-square
and gamma distributions. We focus primarily on normal probability plots because many statistical
techniques are appropriate only when the population is (at least approximately) normal.
To construct a probability plot, the observations in the sample are irst ranked from smallest to largest. That is, the sample x1 , x2 ,… , xn is arranged as x(1) , x(2) ,… , x(n) , where x(1) is the
smallest observation, x(2) is the second-smallest observation, and so forth with x(n) the largest.
The ordered observations x( j ) are then plotted against their observed cumulative frequency
( j − 0.5) / n on the appropriate probability paper. If the hypothesized distribution adequately
describes the data, the plotted points will fall approximately along a straight line; if the plotted points deviate signiicantly from a straight line, the hypothesized model is not appropriate.
Usually, the determination of whether or not the data plot is a straight line is subjective. The
procedure is illustrated in the following example.
Battery Life Ten observations on the effective service life in minutes of batteries used in a
portable personal computer are as follows: 176, 191, 214, 220, 205, 192, 201, 190, 183, 185. We
hypothesize that battery life is adequately modeled by a normal distribution. To use probability plotting to investigate this hypothesis, irst arrange the observations in ascending order and calculate their cumulative frequencies
( j − 0.5) / 10 as shown in Table 6-6.
Example 6-7
5"#-&t6-6 Calculation for Constructing a Normal Probability Plot
j
x( j )
( j − 0.5) /10
zj
1
2
3
4
5
6
7
8
9
10
176
183
185
190
191
192
201
205
214
220
0.05
0.15
0.25
0.35
0.45
0.55
0.65
0.75
0.85
0.95
–1.64
–1.04
–0.67
–0.39
–0.13
0.13
0.39
0.67
1.04
1.64
Section 6-7/Probability Plots
231
The pairs of values x( j ) and ( j − 0.5) / 10 are now plotted on normal probability axes. This plot is shown in Fig. 6-22.
Most normal probability plots have 100 ( j − 0.5) / n on the left vertical scale and (sometimes) 100 [1 − ( j − 0.5) / n ] on
the right vertical scale, with the variable value plotted on the horizontal scale. A straight line, chosen subjectively, has
been drawn through the plotted points. In drawing the straight line, you should be inluenced more by the points near
the middle of the plot than by the extreme points. A good rule of thumb is to draw the line approximately between the
25th and 75th percentile points. This is how the line in Fig. 6-22 was determined. In assessing the “closeness” of the
points to the straight line, imagine a “fat pencil” lying along the line. If all the points are covered by this imaginary
pencil, a normal distribution adequately describes the data. Because the points in Fig. 6-19 would pass the “fat pencil”
test, we conclude that the normal distribution is an appropriate model.
0.1
99
1
95
5
80
20
50
50
20
80
5
95
1
99
0.1
170
180
190
200
210
220
100[1 – ( j – 0.5)/n]
100( j – 0.5)/n
99.9
99.9
x ( j)
FIGURE 6-22 Normal probability plot for battery life.
A normal probability plot can also be constructed on ordinary axes by plotting the standardized normal scores z j against x( j ) where the standardized normal scores satisfy
j − 0.5
= P( Z ≤ z j ) = Φ ( z j )
n
For example, if ( j − 0.5) / n = 0.05, Φ ( z j ) = 0.05 implies that z j = −1.64. To illustrate, consider
the data from Example 6-4. In the last column of Table 6-6 we show the standardized normal
scores. Figure 6-23 is the plot of z j versus x( j ). This normal probability plot is equivalent to the
one in Fig. 6-22.
We have constructed our probability plots with the probability scale (or the z-scale) on
the vertical axis. Some computer packages “lip” the axis and put the probability scale on the
horizontal axis.
3.30
1.65
zj
FIGURE 6-23
Normal probability
plot obtained from
standardized normal
scores.
0
–1.65
–3.30
170
180
190
200
x( j )
210
220
232
zj
Chapter 6/Descriptive Statistics
3.30
3.30
3.30
1.65
1.65
1.65
zj
0
–1.65
–3.30
170
zj
0
–1.65
180
190 200
x( j)
210
220
–3.30
170
0
–1.65
180
190 200
x( j)
(a)
210
220
(b)
–3.30
170
180
190 200
x( j )
210
220
(c)
FIGURE 6-24 Normal probability plots indicating a nonnormal distribution. (a) Light-tailed distribution.
(b) Heavy-tailed distribution. (c) A distribution with positive (or right) skew.
Normal Probability
Plots of Small Samples
Can Be Unreliable
Exercises
The normal probability plot can be useful in identifying distributions that are symmetric but
that have tails that are “heavier” or “lighter” than the normal. They can also be useful in identifying skewed distributions. When a sample is selected from a light-tailed distribution (such as
the uniform distribution), the smallest and largest observations will not be as extreme as would
be expected in a sample from a normal distribution. Thus, if we consider the straight line drawn
through the observations at the center of the normal probability plot, observations on the left side
will tend to fall below the line, and observations on the right side will tend to fall above the line.
This will produce an S-shaped normal probability plot such as shown in Fig. 6-24(a). A heavytailed distribution will result in data that also produce an S-shaped normal probability plot, but now
the observations on the left will be above the straight line and the observations on the right will lie
below the line. See Fig. 6-24(b). A positively skewed distribution will tend to produce a pattern
such as shown in Fig. 6-24(c), where points on both ends of the plot tend to fall below the line,
giving a curved shape to the plot. This occurs because both the smallest and the largest observations from this type of distribution are larger than expected in a sample from a normal distribution.
Even when the underlying population is exactly normal, the sample data will not plot
exactly on a straight line. Some judgment and experience are required to evaluate the plot.
Generally, if the sample size is n < 30, there can be signiicant deviation from linearity in
normal plots, so in these cases only a very severe departure from linearity should be interpreted as a strong indication of nonnormality. As n increases, the linear pattern will tend
to become stronger, and the normal probability plot will be easier to interpret and more
reliable as an indicator of the form of the distribution.
FOR SECTION 6-7
Problem available in WileyPLUS at instructor’s discretion.
Tutoring problem available in WileyPLUS at instructor’s discretion.
6-93.
Construct a normal probability plot of the piston
ring diameter data in Exercise 6-7. Does it seem reasonable to
assume that piston ring diameter is normally distributed?
6-94.
Construct a normal probability plot of the insulating
luid breakdown time data in Exercise 6-8. Does it seem reasonable to assume that breakdown time is normally distributed?
6-95.
Construct a normal probability plot of the visual
accommodation data in Exercise 6-11. Does it seem reasonable to assume that visual accommodation is normally
distributed?
6-96.
Construct a normal probability plot of the solar
intensity data in Exercise 6-12. Does it seem reasonable to
assume that solar intensity is normally distributed?
6-97. Construct a normal probability plot of the O-ring joint
temperature data in Exercise 6-19. Does it seem reasonable to
assume that O-ring joint temperature is normally distributed?
Discuss any interesting features that you see on the plot.
6-98.
Construct a normal probability plot of the octane
rating data in Exercise 6-30. Does it seem reasonable to assume
that octane rating is normally distributed?
Section 6-7/Probability Plots
6-99.
Construct a normal probability plot of the cycles
to failure data in Exercise 6-31. Does it seem reasonable to
assume that cycles to failure is normally distributed?
6-100. Construct a normal probability plot of the suspended
solids concentration data in Exercise 6-40. Does it seem reasonable to assume that the concentration of suspended solids
in water from this particular lake is normally distributed?
6-101. Construct two normal probability plots for the height
data in Exercises 6-38 and 6-45. Plot the data for female
and male students on the same axes. Does height seem to
be normally distributed for either group of students? If both
233
populations have the same variance, the two normal probability plots should have identical slopes. What conclusions would
you draw about the heights of the two groups of students from
visual examination of the normal probability plots?
6-102. It is possible to obtain a “quick-and-dirty” estimate
of the mean of a normal distribution from the 50th percentile
value on a normal probability plot. Provide an argument why
this is so. It is also possible to obtain an estimate of the standard deviation of a normal distribution by subtracting the 84th
percentile value from the 50th percentile value. Provide an
argument explaining why this is so.
Supplemental Exercises
Problem available in WileyPLUS at instructor’s discretion.
Tutoring problem available in WileyPLUS at instructor’s discretion.
6-103. The National Oceanic and Atmospheric Administration provided the monthly absolute estimates of global (land
and ocean combined) temperature index (degrees C) from
2000. Read January to December from left to right in www.
ncdc.noaa.gov/oa/climate/research/anomalies/anomalies.
html). Construct and interpret either a digidot plot or a separate
stem-and-leaf and time series plot of these data.
6-104.
The concentration of a solution is measured six
times by one operator using the same instrument. She obtains
the following data: 63.2, 67.1, 65.8, 64.0, 65.1, and 65.3
(grams per liter).
(a) Calculate the sample mean. Suppose that the desirable
value for this solution has been speciied to be 65.0 grams
per liter. Do you think that the sample mean value computed here is close enough to the target value to accept the
solution as conforming to target? Explain your reasoning.
(b) Calculate the sample variance and sample standard
deviation.
(c) Suppose that in measuring the concentration, the operator
must set up an apparatus and use a reagent material. What
do you think the major sources of variability are in this
experiment? Why is it desirable to have a small variance of
these measurements?
6-105.
Table 6E.10 shows unemployment data for the
United States that are seasonally adjusted. Construct a time
series plot of these data and comment on any features (source:
U.S. Bureau of Labor Web site, http://data.bls.gov).
6-106. A sample of six resistors yielded the following resistances
(ohms): x1 = 45, x2 = 38, x3 = 47, x4 = 41, x5 = 35, and x6 = 43.
(a) Compute the sample variance and sample standard deviation.
(b) Subtract 35 from each of the original resistance measurements and compute s 2 and s. Compare your results with
those obtained in part (a) and explain your indings.
(c) If the resistances were 450, 380, 470, 410, 350, and 430
ohms, could you use the results of previous parts of this
problem to ind s 2 and s?
6-107.
Consider the following two samples:
Sample 1: 10, 9, 8, 7, 8, 6, 10, 6
Sample 2: 10, 6, 10, 6, 8, 10, 8, 6
(a) Calculate the sample range for both samples. Would you conclude that both samples exhibit the same variability? Explain.
(b) Calculate the sample standard deviations for both samples.
Do these quantities indicate that both samples have the
same variability? Explain.
(c) Write a short statement contrasting the sample range versus
the sample standard deviation as a measure of variability.
5"#-&t 6E.9 Global Monthly Temperature
Year
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
1
12.3
12.4
12.7
12.6
12.6
12.6
12.4
12.8
12.2
12.5
2
12.6
12.5
12.9
12.6
12.8
12.5
12.6
12.7
12.4
12.6
3
13.2
13.3
13.4
13.2
13.3
13.4
13.2
13.3
13.4
13.2
4
14.3
14.2
14.2
14.2
14.3
14.4
14.2
14.4
14.1
14.3
5
15.3
15.4
15.3
15.4
15.2
15.4
15.3
15.3
15.2
15.3
6
15.9
16.0
16.1
16.0
16.0
16.2
16.1
16.0
16.0
16.1
7
16.2
16.3
16.4
16.3
16.3
16.4
16.4
16.3
16.3
16.4
8
16.0
16.2
16.1
16.2
16.1
16.2
16.2
16.1
16.1
16.2
9
15.4
15.5
15.5
15.6
15.5
15.7
15.6
15.5
15.5
15.6
10
14.3
14.5
14.5
14.7
14.6
14.6
14.6
14.5
14.6
11
13.1
13.5
13.5
13.4
13.6
13.6
13.5
13.4
13.5
12
12.5
12.7
12.6
12.9
12.7
12.8
12.9
12.6
12.7
234
Chapter 6/Descriptive Statistics
5"#-&t6E.10 Unemployment Percentage
Year
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
Jan
4.3
4.0
4.2
5.7
5.8
5.7
5.2
4.7
4.6
4.9
7.6
Feb
4.4
4.1
4.2
5.7
5.9
5.6
5.4
4.8
4.5
4.8
8.1
Mar
4.2
4.0
4.3
5.7
5.9
5.8
5.2
4.7
4.4
5.1
8.5
Apr
4.3
3.8
4.4
5.9
6.0
5.6
5.2
4.7
4.5
5.0
8.9
May
4.2
4.0
4.3
5.8
6.1
5.6
5.1
4.7
4.5
5.5
9.4
6-108. An article in Quality Engineering (1992, Vol. 4,
pp. 487–495) presents viscosity data from a batch chemical
process. A sample of these data is in Table 6E.11.
13.3
14.5
15.3
15.3
14.3
14.8
15.2
14.5
14.6
14.1
14.3
16.1
13.1
15.5
12.6
14.6
14.3
15.4
15.2
16.8
14.9
13.7
15.2
14.5
15.3
15.6
15.8
13.3
14.1
15.4
15.2
15.2
15.9
16.5
14.8
15.1
17.0
14.9
14.8
14.0
15.8
13.7
15.1
13.4
14.1
14.8
14.3
14.3
16.4
16.9
14.2
16.9
14.9
15.2
14.4
15.2
14.6
16.4
14.2
15.7
16.0
14.9
13.6
15.3
14.3
15.6
16.1
13.9
15.2
14.4
14.0
14.4
13.7
13.8
15.6
14.5
12.8
16.1
16.6
15.6
(a) Reading left to right and up and down, draw a time series
plot of all the data and comment on any features of the data
that are revealed by this plot.
(b) Consider the notion that the irst 40 observations were generated from a speciic process, whereas the last 40 observations were generated from a different process. Does the plot
indicate that the two processes generate similar results?
(c) Compute the sample mean and sample variance of the irst
40 observations; then compute these values for the second 40
observations. Do these quantities indicate that both processes
yield the same mean level? The same variability? Explain.
5"#-&t6E.11 Viscosity Data
13.3
14.5
15.3
15.3
14.3
14.8
15.2
14.5
14.6
14.1
14.3
16.1
13.1
15.5
12.6
14.6
14.3
15.4
15.2
16.8
14.9
13.7
15.2
14.5
15.3
15.6
15.8
13.3
14.1
15.4
15.2
15.2
15.9
16.5
14.8
15.1
17.0
14.9
14.8
14.0
15.8
13.7
15.1
13.4
14.1
14.8
14.3
14.3
16.4
16.9
14.2
16.9
14.9
15.2
14.4
15.2
14.6
16.4
14.2
15.7
16.0
14.9
13.6
15.3
14.3
15.6
16.1
13.9
15.2
14.4
14.0
14.4
13.7
13.8
15.6
14.5
12.8
16.1
16.6
15.6
Jun
4.3
4.0
4.5
5.8
6.3
5.6
5.1
4.6
4.6
5.6
9.5
Jul
4.3
4.0
4.6
5.8
6.2
5.5
5.0
4.7
4.7
5.8
9.4
Aug
4.2
4.1
4.9
5.7
6.1
5.4
4.9
4.7
4.7
6.2
9.7
Sep
4.2
3.9
5.0
5.7
6.1
5.4
5.0
4.5
4.7
6.2
9.8
Oct
4.1
3.9
5.3
5.7
6.0
5.5
5.0
4.4
4.8
6.6
Nov
4.1
3.9
5.5
5.9
5.8
5.4
5.0
4.5
4.7
6.8
Dec
4.0
3.9
5.7
6.0
5.7
5.4
4.8
4.4
4.9
7.2
The total net electricity consumption of the United
6-109.
States by year from 1980 to 2007 (in billion kilowatt-hours)
is in Table 6E.12. Net consumption excludes the energy consumed by the generating units.
TABLE t6E.12 U.S. Electricity Consumption
1980 2094.4 1981 2147.1 1982 2086.4 1983 2151.0
1984 2285.8 1985 2324.0 1986 2368.8 1987 2457.3
1988 2578.1 1989 2755.6 1990 2837.1 1991 2886.1
1992 2897.2 1993 3000.7 1994 3080.9 1995 3164.0
1996 3253.8 1997 3301.8 1998 3425.1 1999 3483.7
2000 3592.4 2001 3557.1 2002 3631.7 2003 3662.0
2004 3715.9 2005 3811.0 2006 3816.8 2007 3891.7
(source: U.S. Department of Energy Web site, www.eia.doe.gov/emeu/
international/contents.html#InternationalElectricity).
Construct a time series plot of these data. Construct and
interpret a stem-and-leaf display of these data.
6-110. Reconsider the data from Exercise 6-108. Prepare comparative box plots for two groups of observations: the irst 40
and the last 40. Comment on the information in the box plots.
6-111. The data shown in Table 6E.13 are monthly champagne sales in France (1962–1969) in thousands of bottles.
(a) Construct a time series plot of the data and comment on
any features of the data that reveals by this plot.
(b) Speculate on how you would use a graphical procedure to
forecast monthly champagne sales for the year 1970.
6-112.
The following data are the temperatures of efluent
at discharge from a sewage treatment facility on consecutive
days:
43
45
49
47
52
45
51
46
44
48
51
50
52
44
48
50
49
50
46
46
49
49
51
50
(a) Calculate the sample mean, sample median, sample variance, and sample standard deviation.
(b) Construct a box plot of the data and comment on the information in this display.
Section 6-7/Probability Plots
235
5"#-&t6E.13 Champagne Sales in France
Month
1962
1963
1964
1965
1966
1967
1968
1969
Jan.
Feb.
Mar.
Apr.
May
June
July
Aug.
Sept.
Oct.
Nov.
Dec.
2.851
2.672
2.755
2.721
2.946
3.036
2.282
2.212
2.922
4.301
5.764
7.132
2.541
2.475
3.031
3.266
3.776
3.230
3.028
1.759
3.595
4.474
6.838
8.357
3.113
3.006
4.047
3.523
3.937
3.986
3.260
1.573
3.528
5.211
7.614
9.254
5.375
3.088
3.718
4.514
4.520
4.539
3.663
1.643
4.739
5.428
8.314
10.651
3.633
4.292
4.154
4.121
4.647
4.753
3.965
1.723
5.048
6.922
9.858
11.331
4.016
3.957
4.510
4.276
4.968
4.677
3.523
1.821
5.222
6.873
10.803
13.916
2.639
2.899
3.370
3.740
2.927
3.986
4.217
1.738
5.221
6.424
9.842
13.076
3.934
3.162
4.286
4.676
5.010
4.874
4.633
1.659
5.591
6.981
9.851
12.670
6-113.
A manufacturer of coil springs is interested in
implementing a quality control system to monitor his production process. As part of this quality system, it is decided
to record the number of nonconforming coil springs in each
production batch of size 50. During 40 days of production, 40
batches of data were collected as follows:
Read data across and down.
9
12
6
9
7
14
12
4
6
7
8
5
9
7
8
11
3
6
7
7
11
4
4
8
7
5
6
4
5
8
19
19
18
12
11
17
15
17
13 13
(a) Construct a stem-and-leaf plot of the data.
(b) Find the sample average and standard deviation.
(c) Construct a time series plot of the data. Is there evidence
that there was an increase or decrease in the average number of nonconforming springs made during the 40 days?
Explain.
6-114. A communication channel is being monitored by recording the number of errors in a string of 1000 bits. Data for 20 of
these strings follow:
Read data across and down
3
1
0
1
3
2
4
1
3
1
1
1
2
3
3
2
0
2
0
1
(a) Construct a stem-and-leaf plot of the data.
(b) Find the sample average and standard deviation.
(c) Construct a time series plot of the data. Is there evidence
that there was an increase or decrease in the number of
errors in a string? Explain.
6-115. Reconsider the golf course yardage data in Exercise
6-9. Construct a box plot of the yardages and write an interpretation of the plot.
6-116. Reconsider the data in Exercise 6-108. Construct normal
probability plots for two groups of the data: the irst 40 and the
last 40 observations. Construct both plots on the same axes. What
tentative conclusions can you draw?
6-117. Construct a normal probability plot of the efluent
discharge temperature data from Exercise 6-112. Based on the
plot, what tentative conclusions can you draw?
6-118. Construct normal probability plots of the cold start
ignition time data presented in Exercises 6-69 and 6-80. Construct a separate plot for each gasoline formulation, but arrange
the plots on the same axes. What tentative conclusions can
you draw?
6-119. Reconsider the golf ball overall distance data in Exercise 6-41. Construct a box plot of the yardage distance and write
an interpretation of the plot. How does the box plot compare in
interpretive value to the original stem-and-leaf diagram?
6-120. Transformations. In some data sets, a transformation
by some mathematical function applied to the original data,
such as y or log y, can result in data that are simpler to work
with statistically than the original data. To illustrate the effect
of a transformation, consider the following data, which represent cycles to failure for a yarn product: 675, 3650, 175, 1150,
290, 2000, 100, 375.
(a) Construct a normal probability plot and comment on the
shape of the data distribution.
(b) Transform the data using logarithms; that is, let y ∗(new
value) = log y (old value). Construct a normal probability
plot of the transformed data and comment on the effect of
the transformation.
6-121. In 1879, A. A. Michelson made 100 determinations of
the velocity of light in air using a modiication of a method
proposed by the French physicist Foucault. Michelson made
the measurements in ive trials of 20 measurements each. The
observations (in kilometers per second) are in Table 6E.14.
Each value has 299,000 subtracted from it.
The currently accepted true velocity of light in a vacuum
is 299,792.5 kilometers per second. Stigler (1977, The Annals
of Statistics) reported that the “true” value for comparison to
these measurements is 734.5. Construct comparative box plots
of these measurements. Does it seem that all ive trials are
236
Chapter 6/Descriptive Statistics
5"#-&t6E.14 Velocity of Light Data
900
930
1070
650
Trial 1
930
760
850
810
950
1000
980
1000
980
960
880
960
960
830
940
790
960
810
940
880
Trial 2
880
880
800
830
850
800
880
790
900
760
840
800
880
880
880
910
880
850
860
870
Trial 3
720
840
720
840
620
850
860
840
970
840
950
840
890
910
810
920
810
890
820
860
Trial 4
800
880
770
720
760
840
740
850
750
850
760
780
890
870
840
870
780
810
810
740
Trial 5
760
810
810
940
790
950
810
800
820
810
850
870
850
1000
740
980
consistent with respect to the variability of the measurements?
Are all ive trials centered on the same value? How does each
group of trials compare to the true value? Could there have been
“startup” effects in the experiment that Michelson performed?
Could there have been bias in the measuring instrument?
6-122.
In 1789, Henry Cavendish estimated the density
of the Earth by using a torsion balance. His 29 measurements
follow, expressed as a multiple of the density of water.
5.50
5.55
5.57
5.34
5.42
5.30
5.61
5.36
5.53
5.79
5.47
5.75
4.88
5.29
5.62
5.10
5.63
5.86
4.07
5.58
5.29
5.27
5.34
5.85
5.26
5.65
5.44
5.39
5.46
(a) Calculate the sample mean, sample standard deviation, and
median of the Cavendish density data.
(b) Construct a normal probability plot of the data. Comment on
the plot. Does there seem to be a “low” outlier in the data?
(c) Would the sample median be a better estimate of the density of the earth than the sample mean? Why?
6-123. In their book Introduction to Time Series Analysis
and Forecasting (Wiley, 2008), Montgomery, Jennings, and
Kulahci presented the data on the drowning rate for children
between one and four years old per 100,000 of population in
Arizona from 1970 to 2004. The data are: 19.9, 16.1, 19.5, 19.8,
21.3, 15.0, 15.5, 16.4, 18.2, 15.3, 15.6, 19.5, 14.0, 13.1, 10.5,
11.5, 12.9, 8.4, 9.2, 11.9, 5.8, 8.5, 7.1, 7.9, 8.0, 9.9, 8.5, 9.1, 9.7,
6.2, 7.2, 8.7, 5.8, 5.7, and 5.2.
(a) Perform an appropriate graphical analysis of the data.
(b) Calculate and interpret the appropriate numerical summaries.
(c) Notice that the rate appears to decrease dramatically starting about 1990. Discuss some potential reasons explaining
why this could have happened.
(d) If there has been a real change in the drowning rate beginning about 1990, what impact does this have on the summary statistics that you calculated in part (b)?
6-124. Patients arriving at a hospital emergency department
present a variety of symptoms and complaints. The following
data were collected during one weekend night shift (11:00 p.m.
to 7:00 a.m.):
Chest pain
Dificulty breathing
Numbness in extremities
Broken bones
Abrasions
Cuts
Stab wounds
Gunshot wounds
Blunt force trauma
Fainting, loss of consciousness
Other
8
7
3
11
16
21
9
4
10
5
9
(a) Calculate numerical summaries of these data. What practical
interpretation can you give to these summaries?
(b) Suppose that you knew that a certain fraction of these
patients leave without treatment (LWOT). This is an important problem because these patients may be seriously ill or
injured. Discuss what additional data you would require to
begin a study into the reasons why patients LWOT.
6-125. One of the authors (DCM) has a Mercedes-Benz 500
SL Roadster. It is a 2003 model and has fairly low mileage
(currently 45,324 miles on the odometer). He is interested in
learning how his car’s mileage compares with the mileage on
similar SLs. Table 6E.15 contains the mileage on 100 MercedesBenz SLs from the model years 2003−2009 taken from the
Cars.com website.
(a) Calculate the sample mean and standard deviation of the
odometer readings.
(b) Construct a histogram of the odometer readings and comment on the shape of the data distribution.
(c) Construct a stem-and-leaf diagram of the odometer
readings.
(d) What is the percentile of DCM’s mileage?
6-126. The energy consumption for 90 gas-heated homes
during a winter heating season is given in Table 6E.16. The
variable reported is BTU/number of heating degree days.
(a) Calculate the sample mean and standard deviation of
energy usage.
(b) Construct a histogram of the energy usage data and comment on the shape of the data distribution.
Section 6-7/Probability Plots
(c) Construct a stem-and-leaf diagram of energy usage.
(d) What proportion of the energy usage data is above the average usage plus 2 standard deviations?
6-127. The force needed to remove the cap from a medicine
bottle is an important feature of the product because requiring too much force may cause dificulty for elderly patients
or patients with arthritis or similar conditions. Table 6E.17
presents the results of testing a sample of 68 caps attached to
bottles for the force (in pounds) required for removing the cap.
(a) Construct a stem-and-leaf diagram of the force data.
(b) What are the average and the standard deviation of the force?
(c) Construct a normal probability plot of the data and comment on the plot.
(d) If the upper speciication on required force is 30 pounds,
what proportion of the caps do not meet this requirement?
(e) What proportion of the caps exceeds the average force plus
2 standard deviations?
(f) Suppose that the irst 36 observations in the table come from
one machine and the remaining come from a second machine
(read across the rows and the down). Does there seem to be a
possible difference in the two machines? Construct an appropriate graphical display of the data as part of your answer.
(g) Plot the irst 36 observations in the table on a normal probability plot and the remaining observations on another normal
probability plot. Compare the results with the single normal
probability plot that you constructed for all of the data in part (c).
6-128. Consider the global mean surface air temperature
anomaly and the global CO2 concentration data originally
shown in Table 6E.5.
(a) Construct a scatter plot of the global mean surface air
temperature anomaly versus the global CO2 concentration
Comment on the plot.
(b) What is the simple correlation coeficient between these
two variables?
5"#-&t6E.15 Odometer Readings on 100 Mercedes-Benz SL500 Automobiles, Model Years 2003−2009
2020
7893
19,000
15,984
32,271
38,277
32,524
15,972
63,249
60,583
8905
10,327
31,668
16,903
36,889
72,272
38,139
41,218
58,526
83,500
1698
37,687
33,512
37,789
21,564
3800
62,940
43,382
66,325
56,314
17,971
15,000
28,522
28,958
31,000
21,218
51,326
15,879
49,489
67,072
6207
4166
5824
40,944
42,915
29,250
54,126
13,500
32,800
62,500
22,643
9056
18,327
18,498
19,377
48,648
4100
77,809
67,000
47,603
4977
19,842
31,845
40,057
19,634
29,216
45,540
25,708
60,499
51,936
17,656
15,598
30,015
15,272
26,313
44,944
26,235
29,000
63,260
65,195
8940
33,745
2171
28,968
43,049
49,125
46,505
58,006
60,449
64,473
11,508
22,168
36,161
30,487
30,396
33,065
34,420
51,071
27,422
85,475
10.04
13.96
12.71
10.64
9.96
6.72
13.47
12.19
6.35
12.62
6.80
6.78
6.62
10.30
10.21
8.47
5.56
9.83
7.62
4.00
9.82
5.20
16.06
8.61
11.70
9.76
12.16
5"#-&t6E.16 Energy Usage in BTU/Number of Heating Degree Days
7.87
11.12
8.58
12.91
12.28
14.24
11.62
7.73
7.15
9.43
13.43
8.00
10.35
7.23
11.43
11.21
8.37
12.69
7.16
9.07
5.98
9.60
2.97
10.28
10.95
7.29
13.38
8.67
6.94
15.24
9.58
8.81
13.60
7.62
10.49
13.11
12.31
10.28
8.54
9.83
9.27
5.94
10.40
8.69
10.50
9.84
9.37
11.09
9.52
11.29
10.36
12.92
8.26
14.35
16.90
7.93
11.70
18.26
8.29
6.85
15.12
7.69
13.42
TABLE t6E.17 Force to Remove Bottle Caps
14
17
25
17
24
31
19
17
18
22
15
15
27
34
21
17
27
16
16
17
17
32
14
21
24
16
15
20
32
24
14
34
24
18
15
17
31
16
19
24
28
30
19
20
27
37
15
237
22
16
19
15
21
36
30
21
14
10
17
21
34
24
16
15
22
20
26
20
15
238
Chapter 6/Descriptive Statistics
Mind-Expanding Exercises
6-129. Consider the airfoil data in Exercise 6-18. Subtract
30 from each value and then multiply the resulting quantities
by 10. Now compute s 2 for the new data. How is this quantity related to s 2 for the original data? Explain why.
2
n
6-130. Consider the quantity ∑ i = 1 ( xi − a ) . For what
value of a is this quantity minimized?
6-131. Using the results of Exercise 6-130, which of the
2
2
n
n
two quantities ∑ i = 1 ( xi − x ) and ∑ i = 1 ( xi − μ ) will be
smaller, provided that x ≠ μ?
6-132. Coding the Data. Let yi = a + bxi , i = 1, 2, … , n,
where a and b are nonzero constants. Find the relationship
between x and y, and between s x and s y .
6-133. A sample of temperature measurements in a furnace yielded a sample average (°F) of 835.00 and a sample
standard deviation of 10.5. Using the results from Exercise
6-132, what are the sample average and sample standard
deviations expressed in oC?
6-134. Consider the sample x1, x2 , … , xn with sample mean x and sample standard deviation s . Let
zi = ( xi − x ) / s, i = 1, 2, … , n. What are the values of the
sample mean and sample standard deviation of the zi ?
6-135. An experiment to investigate the survival time in
hours of an electronic component consists of placing the
parts in a test cell and running them for 100 hours under
elevated temperature conditions. (This is called an “accelerated” life test.) Eight components were tested with the
following resulting failure times:
6-136. Suppose that you have a sample x1, x2 , … , xn and
have calculated xn and sn2 for the sample. Now an (n + 1)
st observation becomes available. Let xn + 1 and s2n + 1 be the
sample mean and sample variance for the sample using all
n + 1 observations.
(a) Show how xn + 1 can be computed using xn and xn + 1 .
n ( x n +1 − x n )
n +1
(c) Use the results of parts (a) and (b) to calculate the new
sample average and standard deviation for the data of
Exercise 6-38, when the new observation is x38 = 64.
(b) Show that nsn2+1 = ( n − 1) sn2 +
2
75, 63, 100 + , 36, 51, 45, 80, 90
6-137. Trimmed Mean. Suppose that the data are arranged
in increasing order, T% of the observations are removed from
each end, and the sample mean of the remaining numbers is
calculated. The resulting quantity is called a trimmed mean,
which generally lies between the sample mean x and the sample median x. Why? The trimmed mean with a moderate trimming percentage (5% to 20%) is a reasonably good estimate of
the middle or center. It is not as sensitive to outliers as the mean
but is more sensitive than the median.
(a) Calculate the 10% trimmed mean for the yield data in
Exercise 6-33.
(b) Calculate the 20% trimmed mean for the yield data in
Exercise 6-33 and compare it with the quantity found
in part (a).
(c) Compare the values calculated in parts (a) and (b) with the
sample mean and median for the yield data. Is there much
difference in these quantities? Why?
The observation 100 + indicates that the unit still functioned at
100 hours. Is there any meaningful measure of location that
can be calculated for these data? What is its numerical value?
6-138. Trimmed Mean. Suppose that the sample size n is
such that the quantity nT / 100 is not an integer. Develop a
procedure for obtaining a trimmed mean in this case.
Important Terms and Concepts
Box plot
Degrees of freedom
Frequency distribution and
histogram
Histogram
Interquartile range
Matrix of scatter plots
Quartiles, and percentiles
Multivariate data
Normal probability plot
Outlier
Pareto chart
Percentile
Population mean
Population standard deviation
Population variance
Probability plot
Relative frequency
distribution
Sample correlation coeficient
Sample mean
Sample median
Sample mode
Sample range
Sample standard deviation
Sample variance
Scatter diagram
Stem-and-leaf diagram
Time series
Download