Descriptive Statistics: Data Summarization Techniques

6 Descriptive Statistics Chapter Outline 6-1 Numerical Summaries of Data 6-2 Stem-and-Leaf Diagrams 6-3 Frequency Distributions and Histograms 6-4 Box Plots 6-5 Time Sequence Plots 6-6 Scatter Diagrams 6-7 Probability Plots Statistics is the science of data. An important aspect of dealing with data is organizing and summarizing the data in ways that facilitate its interpretation and subsequent analysis. This aspect of statistics is called descriptive statistics, and is the subject of this chapter. For example, in Chapter 1 we presented eight prototype units made on the pull-off force of prototype automobile engine connectors. The observations (in pounds) were 12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, and 13.1. There is obvious variability in the pull-off force values. How should we summarize the information in these data? This is the general question that we consider. Data summary methods should highlight the important features of the data, such as the middle or central tendency and the variability, because these characteristics are most often important for engineering decision making. We will see that there are both numerical methods for summarizing data and a number of powerful graphical techniques. The graphical techniques are particularly important. Any good statistical analysis of data should always begin with plotting the data. 199 200 Chapter 6/Descriptive Statistics Learning Objectives After careful study of this chapter, you should be able to do the following: 1. Compute and interpret the sample mean, sample variance, sample standard deviation, sample median, and sample range 2. Explain the concepts of sample mean, sample variance, population mean, and population variance 3. Construct and interpret visual data displays, including the stem-and-leaf display, the histogram, and the box plot 4. Explain the concept of random sampling 5. Construct and interpret normal probability plots 6. Explain how to use box plots and other data displays to visually compare two or more samples of data 7. Know how to use simple time series plots to visually display the important features of time-oriented data 8. Know how to construct and interpret scatter diagrams of two or more variables 6-1 Numerical Summaries of Data Well-constructed data summaries and displays are essential to good statistical thinking, because they can focus the engineer on important features of the data or provide insight about the type of model that should be used in solving the problem. The computer has become an important tool in the presentation and analysis of data. Although many statistical techniques require only a handheld calculator, this approach may require much time and effort, and a computer will perform the tasks much more eficiently. Most statistical analysis is done using a prewritten library of statistical programs. The user enters the data and then selects the types of analysis and output displays that are of interest. Statistical software packages are available for both mainframe machines and personal computers. We will present examples of typical output from computer software throughout the book. We will not discuss the hands-on use of speciic software packages for entering and editing data or using commands. We often ind it useful to describe data features numerically. For example, we can characterize the location or central tendency in the data by the ordinary arithmetic average or mean. Because we almost always think of our data as a sample, we will refer to the arithmetic mean as the sample mean. Sample Mean If the n observations in a sample are denoted by x1 , x2 , … , xn , the sample mean is n x + x2 + … + x n = x− = 1 n ∑ xi i =1 n (6-1) Sample Mean Let’s consider the eight observations on pull-off force collected from the prototype engine connectors from Chapter 1. The eight observations are x1 = 12.6, x2 = 12.9, x3 = 13.4, x4 = 12.3, x5 = 13.6, x6 = 13.5, x7 = 12.6, and x8 = 13.1. The sample mean is Example 6-1 8 xi 12.6 + 12.9 + ⋯ + 13.1 104 x + x + ⋯ + xn ∑ = i =1 = = = 13.0 pounds x= 1 2 n 8 8 8 Section 6-1/Numerical Summaries of Data 201 A physical interpretation of the sample mean as a measure of location is shown in the dot diagram of the pull-off force data. See Fig. 6-1. Notice that the sample mean x = 13.0 can be thought of as a “balance point.” That is, if each observation represents 1 pound of mass placed at the point on the x-axis, a fulcrum located at x would exactly balance this system of weights. The sample mean is the average value of all observations in the data set. Usually, these data are a sample of observations that have been selected from some larger population of observations. Here the population might consist of all the connectors that will be manufactured and sold to customers. Recall that this type of population is called a conceptual or hypothetical population because it does not physically exist. Sometimes there is an actual physical population, such as a lot of silicon wafers produced in a semiconductor factory. In previous chapters, we have introduced the mean of a probability distribution, denoted μ. If we think of a probability distribution as a model for the population, one way to think of the mean is as the average of all the measurements in the population. For a inite population with N equally likely values, the probability mass function is f ( xi ) = 1 / N and the mean is N N μ = ∑ xi f ( xi ) = i −1 ∑ xi i =1 N (6-2) The sample mean, x, is a reasonable estimate of the population mean, μ. Therefore, the engineer designing the connector using a 3/32-inch wall thickness would conclude on the basis of the data that an estimate of the mean pull-off force is 13.0 pounds. Although the sample mean is useful, it does not convey all of the information about a sample of data. The variability or scatter in the data may be described by the sample variance or the sample standard deviation. Sample Variance and Standard Deviation If x1 , x2 ,… , xn is a sample of n observations, the sample variance is n s = 2 ∑ ( xi − x ) 2 i =1 (6-3) n −1 The sample standard deviation, s, is the positive square root of the sample variance. The units of measurement for the sample variance are the square of the original units of the variable. Thus, if x is measured in pounds, the units for the sample variance are (pounds)2. The standard deviation has the desirable property of measuring variability in the original units of the variable of interest, x. How Does the Sample Variance Measure Variability? To see how the sample variance measures dispersion or variability, refer to Fig. 6-2, which shows a dot diagram with the deviations xi − x for the connector pull-off force data. The higher the amount of variability in the pull-off force data, the larger in absolute magnitude x = 13 12 14 15 Pull-off force FIGURE 6-1 Dot diagram showing the sample mean as a balance point for a system of weights. 202 Chapter 6/Descriptive Statistics some of the deviations xi − x will be. Because the deviations xi − x always sum to zero, we must use a measure of variability that changes the negative deviations to non-negative quantities. Squaring the deviations is the approach used in the sample variance. Consequently, if s 2 is small, there is relatively little variability in the data, but if s 2 is large, the variability is relatively large. Example 6-2 Sample Variance Table 6-1 displays the quantities needed for calculating the sample variance and sample standard deviation for the pull-off force data. These data are plotted in Fig. 6-2. The numerator of s 2 is 8 ∑ ( xi − x ) = 1.60 2 i =1 5"#-&t6-1 Calculation of Terms for the Sample Variance and Sample Standard Deviation i xi xi − x ( xi − x ) 2 1 12.6 –0.4 0.16 2 12.9 –0.1 0.01 3 13.4 0.4 0.16 4 12.3 –0.7 0.49 5 13.6 0.6 0.36 6 13.5 0.5 0.25 7 12.6 –0.4 0.16 8 13.1 0.1 0.01 104.0 0.0 1.60 x 12 13 x2 14 15 x8 x1 x7 x4 x3 x6 x5 FIGURE 6-2 How the sample variance measures variability through the deviations x i − x . so the sample variance is s2 = 1.60 1.60 2 = = 0.2286 ( pounds) 8 −1 7 and the sample standard deviation is s = 0.2286 = 0.48 pounds Computation of s 2 The computation of s 2 requires calculation of x , n subtractions, and n squaring and adding operations. If the original observations or the deviations xi − x are not integers, the deviations xi − x may be tedious to work with, and several decimals may have to be carried Section 6-1/Numerical Summaries of Data 203 to ensure numerical accuracy. A more eficient computational formula for the sample variance is obtained as follows: n s2 = ∑ ( xi − x ) i =1 n −1 ∑ ( xi 2 + x 2 − 2 xxi ) n 2 = i =1 n −1 n = n ∑ xi 2 + nx 2 − 2 x ∑ xi i =1 i =1 n −1 and because x = (1 / n)Σ in=1 xi , this last equation reduces to n s2 = ∑ xi 2 − ⎛ n ⎞ xi ⎟ ⎜⎝ i∑ =1 ⎠ i =1 2 n (6-4) n −1 Note that Equation 6-4 requires squaring each individual xi , then squaring the sum of the xi , 2 subtracting ( ∑ xi ) / n from ∑ xi 2 , and inally dividing by n − 1. Sometimes this is called the shortcut method for calculating s 2 (or s). Example 6-3 We will calculate the sample variance and standard deviation using the shortcut method, Equation 6-4. The formula gives n s2 = ∑ xi 2 − ⎛ n ⎞ xi ⎟ ⎜⎝ i∑ =1 ⎠ i =1 n −1 n 2 = (104)2 8 = 1.60 = 0.2286 ( pounds)2 7 7 1353.6 − and s = 0.2286 = 0.48 pounds These results agree exactly with those obtained previously. Analogous to the sample variance s2 , the variability in the population is deined by the population variance (σ 2 ). As in earlier chapters, the positive square root of σ 2, or σ, will denote the population standard deviation. When the population is inite and consists of N equally likely values, we may deine the population variance as N σ2 = ∑ ( xi − μ ) i =1 2 (6-5) N We observed previously that the sample mean could be used as an estimate of the population mean. Similarly, the sample variance is an estimate of the population variance. In Chapter 7, we will discuss estimation of parameters more formally. Note that the divisor for the sample variance is the sample size minus 1 ( n − 1) , and for the population variance, it is the population size N . If we knew the true value of the population mean μ, we could ind the sample variance as the average square deviation of the sample observations about μ. In practice, the value of μ is almost never known, and so the sum of the square deviations about the sample average x must be used instead. However, the observations xi tend to be closer to their average, x , than to the population mean, μ. Therefore, to compensate for this, we use n − 1 as the divisor rather than n. If we used n as the divisor in the sample variance, we would obtain a measure of variability that is on the average consistently smaller than the true population variance σ 2. 204 Chapter 6/Descriptive Statistics Another way to think about this is to consider the sample variance s 2 as being based on n − 1 degrees of freedom. The term degrees of freedom results from the fact that the n deviations x1 − x , x2 − x ,… , xn − x always sum to zero, and so specifying the values of any n −1 of these quantities automatically determines the remaining one. This was illustrated in Table 6-1. Thus, only n −1 of the n deviations, xi − x, are freely determined. We may think of the number of degrees of freedom as the number of independent pieces of information in the data. In addition to the sample variance and sample standard deviation, the sample range, or the difference between the largest and smallest observations, is often a useful measure of variability. The sample range is deined as follows. Sample Range If the n observations in a sample are denoted by x1 , x2 , … , xn , the sample range is r = max ( xi ) − min ( xi ) (6-6) For the pull-off force data, the sample range is r = 13.6 − 12.3 = 1.3. Generally, as the variability in sample data increases, the sample range increases. The sample range is easy to calculate, but it ignores all of the information in the sample data between the largest and smallest values. For example, the two samples 1, 3, 5, 8, and 9 and 1, 5, 5, 5, and 9 both have the same range (r = 8). However, the standard deviation of the irst sample is s1 = 3.35, while the standard deviation of the second sample is s2 = 2.83. The variability is actually less in the second sample. Sometimes when the sample size is small, say n < 8 or 10 , the information loss associated with the range is not too serious. For example, the range is used widely in statistical quality control where sample sizes of 4 or 5 are fairly common. We will discuss some of these applications in Chapter 15. In most statistics problems, we work with a sample of observations selected from the population that we are interested in studying. Figure 6-3 illustrates the relationship between the population and the sample. Population m s Sample (x1, x2, x3, … , xn) x, sample average s, sample standard deviation Histogram x x s FIGURE 6-3 Relationship between a population and a sample. Section 6-1/Numerical Summaries of Data Exercises 205 FOR SECTION 6-1 Problem available in WileyPLUS at instructor’s discretion. Tutoring problem available in WileyPLUS at instructor’s discretion. 6-1. Will the sample mean always correspond to one of the observations in the sample? 6-2. Will exactly half of the observations in a sample fall below the mean? 6-3. Will the sample mean always be the most frequently occurring data value in the sample? 6-4. For any set of data values, is it possible for the sample standard deviation to be larger than the sample mean? If so, give an example. 6-5. Can the sample standard deviation be equal to zero? If so, give an example. 6-6. Suppose that you add 10 to all of the observations in a sample. How does this change the sample mean? How does it change the sample standard deviation? 6-7. Eight measurements were made on the inside diameter of forged piston rings used in an automobile engine. The data (in millimeters) are 74.001, 74.003, 74.015, 74.000, 74.005, 74.002, 74.005, and 74.004. Calculate the sample mean and sample standard deviation, construct a dot diagram, and comment on the data. 6-8. In Applied Life Data Analysis (Wiley, 1982), Wayne Nelson presents the breakdown time of an insulating luid between electrodes at 34 kV. The times, in minutes, are as follows: 0.19, 0.78, 0.96, 1.31, 2.78, 3.16, 4.15, 4.67, 4.85, 6.50, 7.35, 8.01, 8.27, 12.06, 31.75, 32.52, 33.91, 36.71, and 72.89. Calculate the sample mean and sample standard deviation. 6-9. The January 1990 issue of Arizona Trend contains a supplement describing the 12 “best” golf courses in the state. The yardages (lengths) of these courses are as follows: 6981, 7099, 6930, 6992, 7518, 7100, 6935, 7518, 7013, 6800, 7041, and 6890. Calculate the sample mean and sample standard deviation. Construct a dot diagram of the data. 6-10. An article in the Journal of Structural Engineering (Vol. 115, 1989) describes an experiment to test the yield strength of circular tubes with caps welded to the ends. The irst yields (in kN) are 96, 96, 102, 102, 102, 104, 104, 108, 126, 126, 128, 128, 140, 156, 160, 160, 164, and 170. Calculate the sample mean and sample standard deviation. Construct a dot diagram of the data. 6-11. An article in Human Factors (June 1989) presented data on visual accommodation (a function of eye movement) when recognizing a speckle pattern on a high-resolution CRT screen. The data are as follows: 36.45, 67.90, 38.77, 42.18, 26.72, 50.77, 39.30, and 49.71. Calculate the sample mean and sample standard deviation. Construct a dot diagram of the data. 6-12. The following data are direct solar intensity measurements (watts/m2) on different days at a location in southern Spain: 562, 869, 708, 775, 775, 704, 809, 856, 655, 806, 878, 909, 918, 558, 768, 870, 918, 940, 946, 661, 820, 898, 935, 952, 957, 693, 835, 905, 939, 955, 960, 498, 653, 730, and 753. Calculate the sample mean and sample standard deviation. Prepare a dot diagram of these data. Indicate where the sample mean falls on this diagram. Give a practical interpretation of the sample mean. 6-13. The April 22, 1991, issue of Aviation Week and Space Technology reported that during Operation Desert Storm, U.S. Air Force F-117A pilots lew 1270 combat sorties for a total of 6905 hours. What is the mean duration of an F-117A mission during this operation? Why is the parameter you have calculated a population mean? 6-14. Preventing fatigue crack propagation in aircraft structures is an important element of aircraft safety. An engineering study to investigate fatigue crack in n = 9 cyclically loaded wing boxes reported the following crack lengths (in mm): 2.13, 2.96, 3.02, 1.82, 1.15, 1.37, 2.04, 2.47, 2.60. Calculate the sample mean and sample standard deviation. Prepare a dot diagram of the data. 6-15. An article in the Journal of Physiology [“Response of Rat Muscle to Acute Resistance Exercise Deined by Transcriptional and Translational Proiling” (2002, Vol. 545, pp. 27–41)] studied gene expression as a function of resistance exercise. Expression data (measures of gene activity) from one gene are shown in the following table. One group of rats was exercised for six hours while the other received no exercise. Compute the sample mean and standard deviation of the exercise and no-exercise groups separately. Construct a dot diagram for the exercise and no-exercise groups separately. Comment on any differences for the groups. 6 Hours of Exercise 425.313 223.306 388.793 139.262 212.565 324.024 6 Hours of Exercise 208.475 286.484 244.242 408.099 157.743 436.37 No Exercise 485.396 159.471 478.314 245.782 236.212 252.773 No Exercise 406.921 335.209 Exercise 6-11 describes data from an article in 6-16. Human Factors on visual accommodation from an experiment involving a high-resolution CRT screen. Data from a second experiment using a low-resolution screen were also reported in the article. They are 8.85, 35.80, 26.53, 64.63, 9.00, 15.38, 8.14, and 8.24. Prepare a dot diagram for this second sample and compare it to the one for the irst sample. What can you conclude about CRT resolution in this situation? 6-17. The pH of a solution is measured eight times by one operator using the same instrument. She obtains the following data: 7.15, 7.20, 7.18, 7.19, 7.21, 7.20, 7.16, and 7.18. Calculate the sample mean and sample standard deviation. Comment on potential major sources of variability in this experiment. 6-18. An article in the Journal of Aircraft (1988) described the computation of drag coeficients for the NASA 0012 airfoil. Different computational algorithms were used at M ∞ = 0.7 206 Chapter 6/Descriptive Statistics with the following results (drag coeficients are in units of drag counts; that is, one count is equivalent to a drag coeficient of 0.0001): 79, 100, 74, 83, 81, 85, 82, 80, and 84. Compute the sample mean, sample variance, and sample standard deviation, and construct a dot diagram. 6-19. The following data are the joint temperatures of the O-rings (°F) for each test iring or actual launch of the space shuttle rocket motor (from Presidential Commission on the Space Shuttle Challenger Accident, Vol. 1, pp. 129–131): 84, 49, 61, 40, 83, 67, 45, 66, 70, 69, 80, 58, 68, 60, 67, 72, 73, 70, 57, 63, 70, 78, 52, 67, 53, 67, 75, 61, 70, 81, 76, 79, 75, 76, 58, 31. (a) Compute the sample mean and sample standard deviation and construct a dot diagram of the temperature data. (b) Set aside the smallest observation (31°F ) and recompute the quantities in part (a). Comment on your indings. How “different” are the other temperatures from this last value? into clouds to promote rainfall was widely used in the 20th century. Recent research has questioned its effectiveness [Journal of Atmospheric Research (2010, Vol. 97 (2), pp. 513–525)]. An experiment was performed by randomly assigning 52 clouds to be seeded or not. The amount of rain generated was then measured in acre-feet. Here are the data for the unseeded and seeded clouds: 6-20. The United States has an aging infrastructure as witnessed by several recent disasters, including the I-35 bridge failure in Minnesota. Most states inspect their bridges regularly and report their condition (on a scale from 1–17) to the public. Here are the condition numbers from a sample of 30 bridges in New York State (https://www.dot.ny.gov/main/bridgedata): Find the sample mean, sample standard deviation, and range of rainfall for (a) All 52 clouds (b) The unseeded clouds (c) The seeded clouds 5.08 5.44 6.66 5.07 6.80 5.43 4.83 4.00 4.41 4.38 7.00 5.72 4.53 6.43 3.97 4.19 6.26 6.72 5.26 5.48 4.95 6.33 4.93 5.61 4.66 7.00 5.57 3.42 5.18 4.54 6-23. Construct dot diagrams of the seeded and unseeded clouds and compare their distributions in a couple of sentences. (a) Find the sample mean and sample standard deviation of these condition numbers. (b) Construct a dot diagram of the data. 6-21. In an attempt to measure the effects of acid rain, researchers measured the pH (7 is neutral and values below 7 are acidic) of water collected from rain in Ingham County, Michigan. 5.47 5.37 5.38 4.63 5.37 3.74 3.71 4.96 4.64 5.11 5.65 5.39 4.16 5.62 4.57 4.64 5.48 4.57 4.57 4.51 4.86 4.56 4.61 4.32 3.98 5.70 4.15 3.98 5.65 3.10 5.04 4.62 4.51 4.34 4.16 4.64 5.12 3.71 4.64 5.59 (a) Find the sample mean and sample standard deviation of these measurements. (b) Construct a dot diagram of the data. 6-22. Cloud seeding, a process in which chemicals such as silver iodide and frozen carbon dioxide are introduced by aircraft Unseeded: 81.2 26.1 95.0 41.1 28.6 21.7 11.5 68.5 345.5 321.2 1202.6 1.0 4.9 163.0 372.4 244.3 47.3 87.0 26.3 24.4 830.1 4.9 36.6 147.8 17.3 29.0 Seeded: 274.7 302.8 242.5 255.0 17.5 115.3 31.4 703.4 334.1 1697.8 118.3 198.6 129.6 274.7 119.0 1656.0 7.7 430.0 40.6 92.4 200.7 32.7 4.1 978.0 489.1 2745.6 6-24. In the 2000 Sydney Olympics, a special program initiated by IOC president Juan Antonio Samaranch allowed developing countries to send athletes to the Olympics without the usual qualifying procedure. Here are the 71 times for the irst round of the 100 meter men’s swim (in seconds). 60.39 50.56 49.75 49.76 52.43 51.93 50.49 49.45 49.93 52.72 54.06 49.73 51.28 52.24 49.84 51.28 53.40 50.95 53.50 50.90 52.22 52.82 52.91 49.09 51.82 49.74 50.63 59.26 49.76 50.96 52.52 58.79 50.46 49.16 51.93 49.29 49.70 48.64 50.32 49.74 51.34 50.28 50.19 52.14 52.57 52.53 52.09 52.40 51.62 52.58 53.55 51.07 52.78 112.72 49.79 49.83 52.90 50.19 54.33 62.45 51.11 50.87 52.18 54.12 51.52 52.0 52.85 52.24 49.32 50.62 49.45 (a) Find the sample mean and sample standard deviation of these 100 meter swim times. (b) Construct a dot diagram of the data. (c) Comment on anything unusual that you see. 6-2 Stem-and-Leaf Diagrams The dot diagram is a useful data display for small samples up to about 20 observations. However, when the number of observations is moderately large, other graphical displays may be more useful. For example, consider the data in Table 6-2. These data are the compressive strengths in pounds per square inch (psi) of 80 specimens of a new aluminum-lithium alloy undergoing evaluation as a possible material for aircraft structural elements. The data were recorded in the order of testing, and in this format they do not convey much information about compressive strength. Questions such as “What percent of the specimens fail below 120 psi?” are not Section 6-2/Stem-and-Leaf Diagrams 207 easy to answer. Because there are many observations, constructing a dot diagram of these data would be relatively ineficient; more effective displays are available for large data sets. 5"#-&t6-2 Compressive Strength (in psi) of 80 Aluminum-Lithium Alloy Specimens 105 221 183 186 121 181 180 143 97 154 153 174 120 168 167 141 245 228 174 199 181 158 176 110 163 131 154 115 160 208 158 133 207 180 190 193 194 133 156 123 134 178 76 167 184 135 229 146 218 157 101 171 165 172 158 169 199 151 142 163 145 171 148 158 160 175 149 87 160 237 150 135 196 201 200 176 150 170 118 149 A stem-and-leaf diagram is a good way to obtain an informative visual display of a data set x1 , x2 ,… , xn where each number xi consists of at least two digits. To construct a stem-andleaf diagram, use the following steps. Steps to Construct a Stem-and-Leaf Diagram (1) Divide each number xi into two parts: a stem, consisting of one or more of the leading digits, and a leaf, consisting of the remaining digit. (2) List the stem values in a vertical column. (3) Record the leaf for each observation beside its stem. (4) Write the units for stems and leaves on the display. To illustrate, if the data consist of percent defective information between 0 and 100 on lots of semiconductor wafers, we can divide the value 76 into the stem 7 and the leaf 6. In general, we should choose relatively few stems in comparison with the number of observations. It is usually best to choose between 5 and 20 stems. Example 6-4 Alloy Strength To illustrate the construction of a stem-and-leaf diagram, consider the alloy compressive strength data in Table 6-2. We will select as stem values the numbers 7, 8, 9,… , 24. The resulting stem-and-leaf diagram is presented in Fig. 6-4. The last column in the diagram is a frequency count of the number of leaves associated with each stem. Inspection of this display immediately reveals that most of the compressive strengths lie between 110 and 200 psi and that a central value is somewhere between 150 and 160 psi. Furthermore, the strengths are distributed approximately symmetrically about the central value. The stem-and-leaf diagram enables us to determine quickly some important features of the data that were not immediately obvious in the original display in Table 6-2. In some data sets, providing more classes or stems may be desirable. One way to do this would be to modify the original stems as follows: Divide stem 5 into two new stems, 5L and 5U. Stem 5L has leaves 0, 1, 2, 3, and 4, and stem 5U has leaves 5, 6, 7, 8, and 9. This will double the number of original stems. We could increase the number of original stems by four by deining ive new stems: 5z with leaves 0 and 1, 5t (for twos and three) with leaves 2 and 3, 5f (for fours and ives) with leaves 4 and 5, 5s (for six and seven) with leaves 6 and 7, and 5e with leaves 8 and 9. 208 Chapter 6/Descriptive Statistics Stem 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 Leaf 6 7 7 51 580 103 413535 29583169 471340886808 3073050879 8544162106 0361410 960934 7108 8 189 7 5 Frequency 1 1 1 2 3 3 6 8 12 10 10 7 6 4 1 3 1 1 Stem: Tens and hundreds digits (psi); Leaf: Ones digits (psi). FIGURE 6-4 Stem-and-leaf diagram for the compressive strength data in Table 6-2. Chemical Yield Figure 6-5 is the stem-and-leaf diagram for 25 observations on batch yields Example 6-5 from a chemical process. In Fig. 6-5(a), we have used 6, 7, 8, and 9 as the stems. This results in too few stems, and the stem-and-leaf diagram does not provide much information about the data. In Fig. 6-5(b), we have divided each stem into two parts, resulting in a display that more adequately displays the data. Figure 6-5(c) illustrates a stem-and-leaf display with each stem divided into ive parts. There are too many stems in this plot, resulting in a display that does not tell us much about the shape of the data. Stem 6 7 8 9 Leaf 134556 011357889 1344788 235 (a) Stem 6L 6U 7L 7U 8L 8U 9L 9U Leaf 134 556 0113 57889 1344 788 23 5 (b) Stem 6z 6t 6f 6s 6e 7z 7t 7f 7s 7e 8z 8t 8f 8s 8e 9z 9t 9f 9s 9e Leaf 1 3 455 6 011 3 5 7 889 1 3 44 7 88 23 5 (c) FIGURE 6-5 Stem-and-leaf displays for Example 6-5. Stem: Tens digits. Leaf: Ones digits. Section 6-2/Stem-and-Leaf Diagrams 209 Stem-and-leaf of Strength N = 80 Leaf Unit = 1.0 FIGURE 6-6 A typical computer-generated stem-and-leaf diagram. 1 2 3 5 8 11 17 25 37 (10) 33 23 16 10 6 5 2 1 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 6 7 7 15 058 013 133455 12356899 001344678888 0003357789 0112445668 0011346 034699 0178 8 189 7 5 Figure 6-6 is a typical computer-generated stem-and-leaf display of the compressive strength data in Table 6-2. The software uses the same stems as in Fig. 6-4. Note also that the computer orders the leaves from smallest to largest on each stem. This form of the plot is usually called an ordered stem-and-leaf diagram. This is not usually used when the plot is constructed manually because it can be time-consuming. The computer also adds a column to the left of the stems that provides a count of the observations at and above each stem in the upper half of the display and a count of the observations at and below each stem in the lower half of the display. At the middle stem of 16, the column indicates the number of observations at this stem. The ordered stem-and-leaf display makes it relatively easy to ind data features such as percentiles, quartiles, and the median. The sample median is a measure of central tendency that divides the data into two equal parts, half below the median and half above. If the number of observations is even, the median is halfway between the two central values. From Fig. 6-6 we ind the 40th and 41st values of strength as 160 and 163, so the median is (160 + 163) / 2 = 161.5. If the number of observations is odd, the median is the central value. The sample mode is the most frequently occurring data value. Figure 6-6 indicates that the mode is 158; this value occurs four times, and no other value occurs as frequently in the sample. If there were more than one value that occurred four times, the data would have multiple modes. We can also divide data into more than two parts. When an ordered set of data is divided into four equal parts, the division points are called quartiles. The irst or lower quartile, q1, is a value that has approximately 25% of the observations below it and approximately 75% of the observations above. The second quartile, q2, has approximately 50% of the observations below its value. The second quartile is exactly equal to the median. The third or upper quartile, q3 , has approximately 75% of the observations below its value. As in the case of the median, the quartiles may not be unique. The compressive strength data in Fig. 6-6 contain n = 80 observations. Therefore, calculate the irst and third quartiles as the ( n + 1) / 4 and 3 ( n + 1) / 4 ordered observations and interpolate as needed, for example, (80 + 1) / 4 = 20.25 and 3 (80 + 1) / 4 = 60.75. Therefore, interpolating between the 20th and 21st ordered observation we obtain q1 = 143.50 and between the 60th and 61st observation we obtain q3 = 181.00. In general, the 100kth percentile is a data value such that approximately 100k% of the observations are at or below this value and approximately 100 (1 − k ) % of them are above it. Finally, 210 Chapter 6/Descriptive Statistics we may use the interquartile range, deined as IQR = q3 − q1 , as a measure of variability. The interquartile range is less sensitive to the extreme values in the sample than is the ordinary sample range. Many statistics software packages provide data summaries that include these quantities. Typical computer output for the compressive strength data in Table 6-2 is shown in Table 6-3. 5"#-&t6-3 Summary Statistics for the Compressive Strength Data from Software Exercises N Mean Median StDev SE Mean Min Max Q1 Q3 80 162.66 161.50 33.77 3.78 76.00 245.00 143.50 181.00 FOR SECTION 6-2 Problem available in WileyPLUS at instructor’s discretion. Tutoring problem available in WileyPLUS at instructor’s discretion. 6-25. For the data in Exercise 6-20, (a) Construct a stem-and-leaf diagram. (b) Do any of the bridges appear to have unusually good or poor ratings? (c) If so, compute the mean with and without these bridges and comment. 6-26. For the data in Exercise 6-21, (a) Construct a stem-and-leaf diagram. (b) Many scientists consider rain with a pH below 5.3 to be acid rain (http://www.ec.gc.ca/eau-water/default.asp? lang=En&n=FDF30C16-1). What percentage of these samples could be considered as acid rain? 6-27. A back-to-back stem-and-leaf display on two data sets is conducted by hanging the data on both sides of the same stems. Here is a back-to-back stem-and-leaf display for the cloud seeding data in Exercise 6-22 showing the unseeded clouds on the left and the seeded clouds on the right. 65098754433332221000 | 0 | 01233492223 | 2 | 00467703 | 4 | 39 | 6|0 3| 8|8 | 10 | 0 | 12 | | 14 | | 16 | 60 | 18 | | 20 | | 22 | | 24 | | 26 | 5 6-30. An article in Technometrics (1977, Vol. 19, p. 425) presented the following data on the motor fuel octane ratings of several blends of gasoline: 88.5 94.7 84.3 90.1 89.0 89.8 91.6 90.3 90.0 91.5 89.9 98.8 88.3 90.4 91.2 90.6 92.2 87.7 91.1 86.7 93.4 96.1 89.6 92.2 90.4 83.4 91.6 91.0 90.7 88.2 88.6 88.5 88.3 93.3 94.2 87.4 85.3 91.1 90.1 90.5 89.3 100.3 91.1 87.6 92.7 87.9 93.0 94.4 90.4 91.2 86.7 94.2 90.8 90.1 91.8 88.4 92.6 93.7 96.5 84.3 93.2 88.6 88.7 92.7 89.3 91.0 87.5 87.8 88.3 89.2 92.3 88.9 89.8 92.7 93.3 86.7 91.0 90.9 89.9 91.8 89.7 92.2 Construct a stem-and-leaf display for these data. Calculate the median and quartiles of these data. 6-31. The following data are the numbers of cycles to failure of aluminum test coupons subjected to repeated alternating stress at 21,000 psi, 18 cycles per second. How does the back-to-back stem-and-leaf display show the differences in the data set in a way that the dotplot cannot? 1115 1310 1540 1502 1258 1315 1085 798 1020 865 2130 1421 1109 1481 1567 1883 1203 1270 1015 845 1674 1016 1102 1605 706 2215 785 885 1223 375 2265 1910 1018 1452 1890 2100 1594 2023 1315 1269 1260 1888 1782 1522 1792 1000 1820 1940 1120 910 1730 1102 1578 758 1416 1560 1055 1764 1330 1608 1535 1781 1750 1501 1238 990 1468 1512 1750 1642 6-28. When will the median of a sample be equal to the sample mean? 6-29. When will the median of a sample be equal to the mode? Construct a stem-and-leaf display for these data. Calculate the median and quartiles of these data. Does it appear likely that a coupon will “survive” beyond 2000 cycles? Justify your answer. Section 6-2/Stem-and-Leaf Diagrams 6-32. The percentage of cotton in material used to manufacture men’s shirts follows. Construct a stem-and-leaf display for the data. Calculate the median and quartiles of these data. 34.2 37.8 33.6 32.6 33.8 35.8 34.7 34.6 33.1 36.6 34.7 33.1 34.2 37.6 33.6 33.6 34.5 35.4 35.0 34.6 33.4 37.3 32.5 34.1 35.6 34.6 35.4 35.9 34.7 34.6 34.1 34.7 36.3 33.8 36.2 34.7 34.6 35.5 35.1 35.7 35.1 37.1 36.8 33.6 35.2 32.8 36.8 36.8 34.7 34.0 35.1 32.9 35.0 32.1 37.9 34.3 33.6 34.1 35.3 33.5 34.9 34.5 36.4 32.7 Hong Kong India Indonesia Japan Korea, North Korea, South Laos Malaysia Mongolia Nepal New Zealand Pakistan Philippines Singapore Sri Lanka Taiwan Thailand Vietnam Total 6-33. The following data represent the yield on 90 consecutive batches of ceramic substrate to which a metal coating has been applied by a vapor-deposition process. Construct a stem-andleaf display for these data. Calculate the median and quartiles of these data. 94.1 86.1 95.3 84.9 88.8 84.6 94.4 84.1 93.2 90.4 94.1 78.3 86.4 83.6 96.1 83.7 90.6 89.1 97.8 89.6 85.1 85.4 98.0 82.9 91.4 87.3 93.1 90.3 84.0 89.7 85.4 87.3 88.2 84.1 86.4 93.1 93.7 87.6 86.6 86.4 86.1 90.1 87.6 94.6 87.7 85.1 91.7 84.5 95.1 95.2 94.1 96.3 90.6 89.6 87.5 90.0 86.1 92.1 94.7 89.4 90.0 84.2 92.4 94.3 96.4 91.1 88.6 90.1 85.1 87.3 93.2 88.2 92.4 84.1 94.3 90.5 86.6 86.7 86.4 90.6 82.6 97.3 95.6 91.2 83.0 85.0 89.1 83.1 96.8 88.3 6-34. Calculate the sample median, mode, and mean of the data in Exercise 6-30. Explain how these three measures of location describe different features of the data. 6-35. Calculate the sample median, mode, and mean of the data in Exercise 6-31. Explain how these three measures of location describe different features in the data. 6-36. Calculate the sample median, mode, and mean for the data in Exercise 6-32. Explain how these three measures of location describe different features of the data. 6-37. The net energy consumption (in billions of kilowatthours) for countries in Asia in 2003 was as follows (source: U.S. Department of Energy Web site, www.eia.doe.gov/emeu). Construct a stem-and-leaf diagram for these data and comment on any important features that you notice. Compute the sample mean, sample standard deviation, and sample median. Afghanistan Australia Bangladesh Burma China Billions of Kilowatt-Hours 1.04 200.66 16.20 6.88 1671.23 211 38.43 519.04 101.80 946.27 17.43 303.33 3.30 73.63 2.91 2.30 37.03 71.54 44.48 30.89 6.80 154.34 107.34 36.92 4393.8 The female students in an undergraduate engineer6-38. ing core course at ASU self-reported their heights to the nearest inch. The data follow. Construct a stem-and-leaf diagram for the height data and comment on any important features that you notice. Calculate the sample mean, the sample standard deviation, and the sample median of height. 62 64 61 67 65 68 61 65 60 65 64 63 59 68 64 66 68 69 65 67 62 66 68 67 66 65 69 65 69 65 67 67 65 63 64 67 65 6-39. The shear strengths of 100 spot welds in a titanium alloy follow. Construct a stem-and-leaf diagram for the weld strength data and comment on any important features that you notice. What is the 95th percentile of strength? 5408 5431 5475 5442 5376 5388 5459 5422 5416 5435 5420 5429 5401 5446 5487 5416 5382 5357 5388 5457 5407 5469 5416 5377 5454 5375 5409 5459 5445 5429 5463 5408 5481 5453 5422 5354 5421 5406 5444 5466 5399 5391 5477 5447 5329 5473 5423 5441 5412 5384 5445 5436 5454 5453 5428 5418 5465 5427 5421 5396 5381 5425 5388 5388 5378 5481 5387 5440 5482 5406 5401 5411 5399 5431 5440 5413 5406 5342 5452 5420 5458 5485 5431 5416 5431 5390 5399 5435 5387 5462 5383 5401 5407 5385 5440 5422 5448 5366 5430 5418 6-40. An important quality characteristic of water is the concentration of suspended solid material. Following are 60 measurements on suspended solids from a certain lake. Construct a stem-and-leaf diagram for these data and comment on any important features that you notice. Compute the sample mean, the sample standard deviation, and the sample median. What is the 90th percentile of concentration? 212 Chapter 6/Descriptive Statistics 42.4 65.7 29.8 58.7 52.1 55.8 57.0 68.7 67.3 67.3 681 679 691 683 705 746 706 649 668 672 690 724 54.3 54.0 73.1 81.3 59.9 56.9 62.2 69.9 66.9 59.0 652 720 660 695 701 724 668 698 668 660 680 739 56.3 43.3 57.4 45.3 80.1 49.7 42.8 42.4 59.6 65.8 61.4 64.0 64.2 72.6 72.5 46.1 53.1 56.1 67.2 70.7 42.6 77.4 54.7 57.1 77.3 39.3 76.4 59.3 51.1 73.8 61.4 73.1 77.3 48.5 89.8 50.7 52.0 59.6 66.1 31.6 6-41. The United States Golf Association tests golf balls to ensure that they conform to the rules of golf. Balls are tested for weight, diameter, roundness, and overall distance. The overall distance test is conducted by hitting balls with a driver swung by a mechanical device nicknamed “Iron Byron” after the legendary great Byron Nelson, whose swing the machine is said to emulate. Following are 100 distances (in yards) achieved by a particular brand of golf ball in the overall distance test. Construct a stem-and-leaf diagram for these data and comment on any important features that you notice. Compute the sample mean, sample standard deviation, and the sample median. What is the 90th percentile of distances? 261.3 259.4 265.7 270.6 274.2 261.4 254.5 283.7 258.1 270.5 255.1 268.9 267.4 253.6 234.3 263.2 717 727 653 637 660 693 679 682 724 642 704 695 704 652 664 702 661 720 695 670 656 718 660 648 683 723 710 680 684 705 681 748 697 703 660 722 662 644 683 695 678 674 656 667 683 691 680 685 681 715 665 676 665 675 655 659 720 675 697 663 6-43. A group of wine enthusiasts taste-tested a pinot noir wine from Oregon. The evaluation was to grade the wine on a 0-to-100-point scale. The results follow. Construct a stemand-leaf diagram for these data and comment on any important features that you notice. Compute the sample mean, the sample standard deviation, and the sample median. A wine rated above 90 is considered truly exceptional. What proportion of the taste-tasters considered this particular pinot noir truly exceptional? 94 90 88 85 90 93 95 91 92 87 91 85 91 90 88 89 91 91 89 88 86 92 92 84 89 89 87 85 91 86 89 90 91 89 95 90 90 90 92 83 254.2 270.7 233.7 263.5 244.5 251.8 259.5 257.5 257.7 272.6 253.7 262.2 252.0 280.3 274.9 233.7 237.9 274.0 264.5 244.8 264.0 268.3 272.1 260.2 255.8 260.7 245.5 279.6 237.8 278.5 273.3 263.7 241.4 260.6 280.3 272.7 261.0 260.0 279.3 252.1 244.3 272.2 248.3 278.7 236.0 271.2 279.8 245.6 241.2 251.1 267.0 273.4 247.7 254.8 272.8 270.5 254.4 232.1 271.5 242.9 273.6 256.1 251.6 256.8 273.0 240.8 276.6 264.5 264.5 226.8 255.3 266.6 250.2 255.8 285.3 255.4 240.5 255.0 273.2 251.4 276.1 277.8 266.8 268.5 6-42. A semiconductor manufacturer produces devices used as central processing units in personal computers. The speed of the devices (in megahertz) is important because it determines the price that the manufacturer can charge for the devices. The following table contains measurements on 120 devices. Construct a stem-and-leaf diagram for these data and comment on any important features that you notice. Compute the sample mean, the sample standard deviation, and the sample median. What percentage of the devices has a speed exceeding 700 megahertz? 680 669 719 699 670 710 722 663 658 634 720 690 6-44. In their book Introduction to Linear Regression Analysis (5th edition, Wiley, 2012), Montgomery, Peck, and Vining presented measurements on NbOCl3 concentration from a tube-low reactor experiment. The data, in gram-mole per liter × 10 −3 , are as follows. Construct a stem-and-leaf diagram for these data and comment on any important features that you notice. Compute the sample mean, the sample standard deviation, and the sample median. 450 450 473 507 457 452 453 1215 1256 1145 1085 1066 1111 1364 1254 1396 1575 1617 1733 2753 3186 3227 3469 1911 2588 2635 2725 6-45. In Exercise 6-38, we presented height data that were self-reported by female undergraduate engineering students in a core course at ASU. In the same class, the male students self-reported their heights as follows. Construct a comparative stem-and-leaf diagram by listing the stems in the center of the display and then placing the female leaves on the left and the male leaves on the right. Comment on any important features that you notice in this display. 69 67 69 70 65 68 69 70 71 69 66 67 69 75 68 67 68 677 669 700 718 690 681 702 696 692 690 694 660 69 70 71 72 68 69 69 70 71 68 72 69 69 68 69 73 70 649 675 701 721 683 735 688 763 672 698 659 704 73 68 69 71 67 68 65 68 68 69 70 74 71 69 70 69 213 Section 6-3/Frequency Distributions and Histograms 6-3 Frequency Distributions and Histograms Choosing the Number of Bins in a Frequency Distribution or Histogram is Important A frequency distribution is a more compact summary of data than a stem-and-leaf diagram. To construct a frequency distribution, we must divide the range of the data into intervals, which are usually called class intervals, cells, or bins. If possible, the bins should be of equal width in order to enhance the visual information in the frequency distribution. Some judgment must be used in selecting the number of bins so that a reasonable display can be developed. The number of bins depends on the number of observations and the amount of scatter or dispersion in the data. A frequency distribution that uses either too few or too many bins will not be informative. We usually ind that between 5 and 20 bins is satisfactory in most cases and that the number of bins should increase with n. Several sets of rules can be used to determine the member of bins in a histogram. However, choosing the number of bins approximately equal to the square root of the number of observations often works well in practice. A frequency distribution for the comprehensive strength data in Table 6-2 is shown in Table 6-4. Because the data set contains 80 observations, and because 80 ≃ 9 , we suspect that about eight to nine bins will provide a satisfactory frequency distribution. The largest and smallest data values are 245 and 76, respectively, so the bins must cover a range of at least 245 − 76 = 169 units on the psi scale. If we want the lower limit for the irst bin to begin slightly below the smallest data value and the upper limit for the last bin to be slightly above the largest data value, we might start the frequency distribution at 70 and end it at 250. This is an interval or range of 180 psi units. Nine bins, each of width 20 psi, give a reasonable frequency distribution, so the frequency distribution in Table 6-4 is based on nine bins. The second row of Table 6-4 contains a relative frequency distribution. The relative frequencies are found by dividing the observed frequency in each bin by the total number of observations. The last row of Table 6-4 expresses the relative frequencies on a cumulative basis. Frequency distributions are often easier to interpret than tables of data. For example, from Table 6-4, it is very easy to see that most of the specimens have compressive strengths between 130 and 190 psi and that 97.5 percent of the specimens fall below 230 psi. The histogram is a visual display of the frequency distribution. The steps for constructing a histogram follow. Constructing a Histogram (Equal Bin Widths) (1) Label the bin (class interval) boundaries on a horizontal scale. (2) Mark and label the vertical scale with the frequencies or the relative frequencies. (3) Above each bin, draw a rectangle where height is equal to the frequency (or relative frequency) corresponding to that bin. Figure 6-7 is the histogram for the compression strength data. The histogram, like the stemand-leaf diagram, provides a visual impression of the shape of the distribution of the measurements and information about the central tendency and scatter or dispersion in the data. 5"#-&t6-4 Frequency Distribution for the Compressive Strength Data in Table 6-2 Class Frequency 70 Ä x < 90 90 Ä x < 110 110 Ä x < 130 130 Ä x < 150 150 Ä x < 170 170 Ä x < 190 190 Ä x < 210 210 Ä x < 230 230 Ä x < 250 2 3 6 14 22 17 10 4 2 Relative frequency 0.0250 0.0375 0.0750 0.1750 0.2750 0.2125 0.1250 0.0500 0.0250 Cumulative relative frequency 0.0250 0.0625 0.1375 0.3125 0.5875 0.8000 0.9250 0.9750 1.0000 Chapter 6/Descriptive Statistics 0.3125 25 0.2500 20 0.1895 0.1250 Frequency FIGURE 6-7 Histogram of compressive strength for 80 aluminumlithium alloy specimens. Relative frequency 214 15 10 0.0625 5 0 0 70 90 110 130 150 170 190 210 230 250 Compressive strength (psi) Notice the symmetric, bell-shaped distribution of the strength measurements in Fig. 6-7. This display often gives insight about possible choices of probability distributions to use as a model for the population. For example, here we would likely conclude that the normal distribution is a reasonable model for the population of compression strength measurements. Sometimes a histogram with unequal bin widths will be employed. For example, if the data have several extreme observations or outliers, using a few equal-width bins will result in nearly all observations falling in just a few of the bins. Using many equal-width bins will result in many bins with zero frequency. A better choice is to use shorter intervals in the region where most of the data fall and a few wide intervals near the extreme observations. When the bins are of unequal width, the rectangle’s area (not its height) should be proportional to the bin frequency. This implies that the rectangle height should be Rectangular height = Histograms are Best for Relatively Large Samples Bin frequancy Bin width In passing from either the original data or stem-and-leaf diagram to a frequency distribution or histogram, we have lost some information because we no longer have the individual observations. However, this information loss is often small compared with the conciseness and ease of interpretation gained in using the frequency distribution and histogram. Figure 6-8 is a histogram of the compressive strength data with 17 bins. We have noted that histograms may be relatively sensitive to the number of bins and their width. For small data sets, histograms may change dramatically in appearance if the number and/or width of the bins changes. Histograms are more stable and thus reliable for larger data sets, preferably of size 75 to 100 or more. Figure 6-9 is a histogram for the compressive strength data with 20 Frequency Frequency 10 5 0 10 0 100 150 Strength 200 250 80 100 120 140 160 180 200 220 240 Strength FIGURE 6-8 A histogram of the compressive FIGURE 6-9 A histogram of the compressive strength strength data with 17 bins. data with nine bins. Section 6-3/Frequency Distributions and Histograms 215 Cumulative frequency 80 FIGURE 6-10 A cumulative distribution plot of the compressive strength data. 70 60 50 40 30 20 10 0 100 150 Strength 200 250 nine bins. This is similar to the original histogram shown in Fig. 6-7. Because the number of observations is moderately large (n = 80), the choice of the number of bins is not especially important, and both Figs. 6-8 and 6-9 convey similar information. Figure 6-10 is a variation of the histogram available in some software packages, the cumulative frequency plot. In this plot, the height of each bar is the total number of observations that are less than or equal to the upper limit of the bin. Cumulative distributions are also useful in data interpretation; for example, we can read directly from Fig. 6-10 that approximately 70 observations are less than or equal to 200 psi. When the sample size is large, the histogram can provide a reasonably reliable indicator of the general shape of the distribution or population of measurements from which the sample was drawn. See Figure 6-11 for three cases. The median is denoted as xɶ . Generally, if the data are symmetric, as in Fig. 6-11(b), the mean and median coincide. If, in addition, the data have only one mode (we say the data are unimodal), the mean, median, and mode all coincide. If the data are skewed (asymmetric, with a long tail to one side), as in Fig. 6-11(a) and (c), the mean, median, and mode do not coincide. Usually, we ind that mode < median < mean if the distribution is skewed to the right, whereas mode > median > mean if the distribution is skewed to the left. Frequency distributions and histograms can also be used with qualitative or categorical data. Some applications will have a natural ordering of the categories (such as freshman, sophomore, junior, and senior), whereas in others, the order of the categories will be arbitrary (such as male and female). When using categorical data, the bins should have equal width. Example 6-6 Figure 6-12 presents the production of transport aircraft by the Boeing Company in 1985. Notice that the 737 was the most popular model, followed by the 757, 747, 767, and 707. A chart of occurrences by category (in which the categories are ordered by the number of occurrences) is sometimes referred to as a Pareto chart. An exercise asks you to construct such a chart. FIGURE 6-11 Histograms for symmetric and skewed distributions. x | x Negative or left skew (a) x | x Symmetric (b) | x x Positive or right skew (c) 216 Chapter 6/Descriptive Statistics FIGURE 6-12 Airplane production in 1985. (Source: Boeing Company.) Number of airplanes manufactured in 1985 250 150 100 50 0 737 757 747 767 Airplane model 707 In this section, we have concentrated on descriptive methods for the situation in which each observation in a data set is a single number or belongs to one category. In many cases, we work with data in which each observation consists of several measurements. For example, in a gasoline mileage study, each observation might consist of a measurement of miles per gallon, the size of the engine in the vehicle, engine horsepower, vehicle weight, and vehicle length. This is an example of multivariate data. In section 6.6, we will illustrate one simple graphical display or multivariate data. In later chapters, we will discuss analyzing this type of data. Exercises FOR SECTION 6-3 Problem available in WileyPLUS at instructor’s discretion. Tutoring problem available in WileyPLUS at instructor’s discretion. 6-46. Construct a frequency distribution and histogram for the motor fuel octane data from Exercise 6-30. Use eight bins. 6-47. Construct a frequency distribution and histogram using the failure data from Exercise 6-31. 6-48. Construct a frequency distribution and histogram for the cotton content data in Exercise 6-32. 6-49. Construct a frequency distribution and histogram for the yield data in Exercise 6-33. 6-50. Construct frequency distributions and histograms with 8 bins and 16 bins for the motor fuel octane data in Exercise 6-30. Compare the histograms. Do both histograms display similar information? 6-51. Construct histograms with 8 and 16 bins for the data in Exercise 6-31. Compare the histograms. Do both histograms display similar information? 6-52. Construct histograms with 8 and 16 bins for the data in Exercise 6-32. Compare the histograms. Do both histograms display similar information? 6-53. Construct a histogram for the energy consumption data in Exercise 6-37. 6-54. Construct a histogram for the female student height data in Exercise 6-38. 6-55. Construct a histogram for the spot weld shear strength data in Exercise 6-39. Comment on the shape of the histogram. Does it convey the same information as the stem-and-leaf display? 6-56. Construct a histogram for the water quality data in Exercise 6-40. Comment on the shape of the histogram. Does it convey the same information as the stem-and-leaf display? 6-57. Construct a histogram for the overall golf distance data in Exercise 6-41. Comment on the shape of the histogram. Does it convey the same information as the stem-and-leaf display? 6-58. Construct a histogram for the semiconductor speed data in Exercise 6-42. Comment on the shape of the histogram. Does it convey the same information as the stem-andleaf display? 6-59. Construct a histogram for the pinot noir wine rating data in Exercise 6-43. Comment on the shape of the histogram. Does it convey the same information as the stem-andleaf display? 6-60. The Pareto Chart. An important variation of a histogram for categorical data is the Pareto chart. This chart is widely used in quality improvement efforts, and the categories usually represent different types of defects, failure modes, or product/process problems. The categories are ordered so that the category with the largest frequency is on the left, followed by the category with the second largest frequency, and so forth. These charts are named after the Italian economist V. Pareto, and they usually exhibit “Pareto’s law”; that is, most of the defects can be accounted for by only a few categories. Suppose that the following information on structural defects in automobile doors is obtained: dents, 4; Section 6-4/Box Plots pits, 4; parts assembled out of sequence, 6; parts undertrimmed, 21; missing holes/slots, 8; parts not lubricated, 5; parts out of contour, 30; and parts not deburred, 3. Construct and interpret a Pareto chart. 6-61. Construct a frequency distribution and histogram for the bridge condition data in Exercise 6-20. 6-4 217 6-62. Construct a frequency distribution and histogram for the acid rain measurements in Exercise 6-21. 6-63. Construct a frequency distribution and histogram for the combined cloud-seeding rain measurements in Exercise 6-22. 6-64. Construct a frequency distribution and histogram for the swim time measurements in Exercise 6-24. Box Plots The stem-and-leaf display and the histogram provide general visual impressions about a data set, but numerical quantities such as x or s provide information about only one feature of the data. The box plot is a graphical display that simultaneously describes several important features of a data set, such as center, spread, departure from symmetry, and identiication of unusual observations or outliers. A box plot, sometimes called box-and-whisker plots, displays the three quartiles, the minimum, and the maximum of the data on a rectangular box, aligned either horizontally or vertically. The box encloses the interquartile range with the left (or lower) edge at the irst quartile, q1, and the right (or upper) edge at the third quartile, q3 . A line is drawn through the box at the second quartile (which is the 50th percentile or the median), q2 = x . A line, or whisker, extends from each end of the box. The lower whisker is a line from the irst quartile to the smallest data point within 1.5 interquartile ranges from the irst quartile. The upper whisker is a line from the third quartile to the largest data point within 1.5 interquartile ranges from the third quartile. Data farther from the box than the whiskers are plotted as individual points. A point beyond a whisker, but less than three interquartile ranges from the box edge, is called an outlier. A point more than three interquartile ranges from the box edge is called an extreme outlier. See Fig. 6-13. Occasionally, different symbols, such as open and illed circles, are used to identify the two types of outliers. Figure 6-14 presents a typical computer-generated box plot for the alloy compressive strength data shown in Table 6-2. This box plot indicates that the distribution of compressive strengths is fairly symmetric around the central value because the left and right whiskers and the lengths of the left and right boxes around the median are about the same. There are also two mild outliers at lower strength and one at higher strength. The upper whisker extends to observation 237 because it is the highest observation below the limit for upper outliers. This limit is q3 + 1.5IQR = 181 + 1.5 (181 − 143.5) = 237.25. The lower whisker extends to observation 97 because it is the smallest observation above the limit for lower outliers. This limit is q1 − 1.5IQR = 143.5 − 1.5 (181 − 143.5) = 87.25. Box plots are very useful in graphical comparisons among data sets because they have high visual impact and are easy to understand. For example, Fig. 6-15 shows the comparative box plots for a manufacturing quality index on semiconductor devices at three manufacturing plants. Inspection of this display reveals that there is too much variability at plant 2 and that plants 2 and 3 need to raise their quality index performance. Whisker extends to largest data point within 1.5 interquartile ranges from third quartile Whisker extends to smallest data point within 1.5 interquartile ranges from first quartile First quartile FIGURE 6-13 Description of a box plot. Second quartile Third quartile Outliers 1.5 IQR Outliers 1.5 IQR IQR 1.5 IQR Extreme outlier 1.5 IQR 218 Chapter 6/Descriptive Statistics 120 237.25 1.5 IQR 200 q3 = 181 181 Strength 110 * 245 237 q2 = 161.5 IQR 143.5 150 Quality index 250 100 90 q1 = 143.5 80 1.5 IQR 100 87.25 97 * 87 70 * 76 FIGURE 6-14 Box plot for compressive strength data in Table 6-2. Exercises 1 2 3 Plant FIGURE 6-15 Comparative box plots of a quality index at three plants. FOR SECTION 6-4 Problem available in WileyPLUS at instructor’s discretion. Tutoring problem available in WileyPLUS at instructor’s discretion. 6-65. Using the data on bridge conditions from Exercise 6-20, (a) Find the quartiles and median of the data. (b) Draw a box plot for the data. (c) Should any points be considered potential outliers? Compare this to your answer in Exercise 6-20. Explain. 6-66. Using the data on acid rain from Exercise 6-21, (a) Find the quartiles and median of the data. (b) Draw a box plot for the data. (c) Should any points be considered potential outliers? Compare this to your answer in Exercise 6-21. Explain. 6-67. Using the data from Exercise 6-22 on cloud seeding, (a) Find the median and quartiles for the unseeded cloud data. (b) Find the median and quartiles for the seeded cloud data. (c) Make two side-by-side box plots, one for each group on the same plot. (d) Compare the distributions from what you can see in the side-by-side box plots. 6-68. Using the data from Exercise 6-24 on swim times, (a) Find the median and quartiles for the data. (b) Make a box plot of the data. (c) Repeat (a) and (b) for the data without the extreme outlier and comment. (d) Compare the distribution of the data with and without the extreme outlier. 6-69. The “cold start ignition time” of an automobile engine is being investigated by a gasoline manufacturer. The following times (in seconds) were obtained for a test vehicle: 1.75, 1.92, 2.62, 2.35, 3.09, 3.15, 2.53, 1.91. (a) Calculate the sample mean, sample variance, and sample standard deviation. (b) Construct a box plot of the data. 6-70. An article in Transactions of the Institution of Chemical Engineers (1956, Vol. 34, pp. 280–293) reported data from an experiment investigating the effect of several process variables on the vapor phase oxidation of naphthalene. A sample of the percentage mole conversion of naphthalene to maleic anhydride follows: 4.2, 4.7, 4.7, 5.0, 3.8, 3.6, 3.0, 5.1, 3.1, 3.8, 4.8, 4.0, 5.2, 4.3, 2.8, 2.0, 2.8, 3.3, 4.8, 5.0. (a) Calculate the sample mean, sample variance, and sample standard deviation. (b) Construct a box plot of the data. 6-71. The nine measurements that follow are furnace temperatures recorded on successive batches in a semiconductor manufacturing process (units are °F): 953, 950, 948, 955, 951, 949, 957, 954, 955. (a) Calculate the sample mean, sample variance, and standard deviation. (b) Find the median. How much could the highest temperature measurement increase without changing the median value? (c) Construct a box plot of the data. 6-72. Exercise 6-18 presents drag coeficients for the NASA 0012 airfoil. You were asked to calculate the sample mean, sample variance, and sample standard deviation of those coeficients. (a) Find the median and the upper and lower quartiles of the drag coeficients. (b) Construct a box plot of the data. (c) Set aside the highest observation (100) and rework parts (a) and (b). Comment on your indings. 6-73. Exercise 6-19 presented the joint temperatures of the O-rings (°F) for each test iring or actual launch of the space shuttle rocket motor. In that exercise, you were asked to ind the sample mean and sample standard deviation of temperature. Section 6-5/Time Sequence Plots (a) Find the median and the upper and lower quartiles of temperature. (b) Set aside the lowest observation (31°F ) and recompute the quantities in part (a). Comment on your indings. How “different” are the other temperatures from this lowest value? (c) Construct a box plot of the data and comment on the possible presence of outliers. 6-74. Reconsider the motor fuel octane rating data in Exercise 6-28. Construct a box plot of the data and write an interpretation of the plot. How does the box plot compare in interpretive value to the original stem-and-leaf diagram? 6-75. Reconsider the energy consumption data in Exercise 6-37. Construct a box plot of the data and write an interpretation of the plot. How does the box plot compare in interpretive value to the original stem-and-leaf diagram? 6-76. Reconsider the water quality data in Exercise 6-40. Construct a box plot of the concentrations and write an interpretation of the plot. How does the box plot compare in interpretive value to the original stem-and-leaf diagram? 6-77. Reconsider the weld strength data in Exercise 6-39. Construct a box plot of the data and write an interpretation of the plot. How does the box plot compare in interpretive value to the original stem-and-leaf diagram? 6-78. Reconsider the semiconductor speed data in Exercise 6-42. Construct a box plot of the data and write an interpretation of the plot. How does the box plot compare in interpretive value to the original stem-and-leaf diagram? 6-79. Use the data on heights of female and male engineering students from Exercises 6-38 and 6-45 to construct comparative box plots. Write an interpretation of the information that you see in these plots. 6-80 In Exercise 6-69, data were presented on the cold start ignition time of a particular gasoline used in a test vehicle. A second formulation of the gasoline was tested in the same vehicle, with the following times (in seconds): 1.83, 1.99, 3.13, 3.29, 2.65, 2.87, 3.40, 2.46, 1.89, and 3.35. Use these new data along with the cold start times reported in Exercise 6-69 to 219 construct comparative box plots. Write an interpretation of the information that you see in these plots. 6-81. An article in Nature Genetics [“Treatment-speciic Changes in Gene Expression Discriminate in Vivo Drug Response in Human Leukemia Cells” (2003, Vol. 34(1), pp. 85–90)] studied gene expression as a function of treatments for leukemia. One group received a high dose of the drug, while the control group received no treatment. Expression data (measures of gene activity) from one gene are shown in Table 6E.1. Construct a box plot for each group of patients. Write an interpretation to compare the information in these plots. 5"#-&t6E.1 Gene Expression High Dose 16.1 Control 297.1 Control 25.1 Control 131.1 134.9 52.7 14.4 124.3 99 24.3 16.3 15.2 47.7 12.9 72.7 126.7 46.4 60.3 23.5 43.6 79.4 38 58.2 26.5 491.8 1332.9 1172 1482.7 335.4 528.9 24.1 545.2 92.9 337.1 102.3 255.1 100.5 159.9 168 95.2 132.5 442.6 15.8 175.6 820.1 82.5 713.9 785.6 114 31.9 86.3 646.6 169.9 20.2 280.2 194.2 408.4 155.5 864.6 355.4 634 2029.9 362.1 166.5 2258.4 497.5 263.4 252.3 351.4 678.9 3010.2 67.1 318.2 2476.4 181.4 2081.5 424.3 188.1 563 149.1 2122.9 1295.9 6-5 Time Sequence Plots The graphical displays that we have considered thus far such as histograms, stem-and-leaf plots, and box plots are very useful visual methods for showing the variability in data. However, we noted in Chapter 1 that time is an important factor that contributes to variability in data, and those graphical methods do not take this into account. A time series or time sequence is a data set in which the observations are recorded in the order in which they occur. A time series plot is a graph in which the vertical axis denotes the observed value of the variable (say, x ) and the horizontal axis denotes the time (which could be minutes, days, years, etc.). When measurements are plotted as a time series, we often see trends, cycles, or other broad features of the data that could not be seen otherwise. For example, consider Fig. 6-16(a), which presents a time series plot of the annual sales of a company for the last 10 years. The general impression from this display is that sales show an upward trend. There is some variability about this trend with some years’ sales increasing over those of the last year and some years’ sales decreasing. Figure 6-16(b) shows the last three years of sales reported by quarter. This plot clearly shows that the annual sales in this business exhibit a cyclic variability by quarter with the irst- and second-quarter sales being generally higher than sales during the third and fourth quarters. 220 Chapter 6/Descriptive Statistics Sales, x Sales, x Sometimes it can be very helpful to combine a time series plot with some of the other graphical displays that we have considered previously. J. Stuart Hunter (The American Statistician, 1988, Vol. 42, p. 54) has suggested combining the stem-and-leaf plot with a time series plot to form a digidot plot. Figure 6-17 is a digidot plot for the observations on compressive strength from Table 6-2, assuming that these observations are recorded in the order in which they occurred. This plot effectively displays the overall variability in the compressive strength data and simultaneously shows the variability in these measurements over time. The general impression is that compressive strength varies around the mean value of 162.66, and no strong obvious pattern occurs in this variability over time. The digidot plot in Fig. 6-18 tells a different story. This plot summarizes 30 observations on concentration of the output product from a chemical process where the observations are recorded at one-hour time intervals. This plot indicates that during the irst 20 hours of operation, this process produced concentrations generally above 85 grams per liter, but that following sample 20, something may have occurred in the process that resulted in lower concentrations. If this variability in output product concentration can be reduced, operation of this process can be improved. Notice that this apparent change in the process output is not seen in the stem-and-leaf portion of the digidot plot. The stem-and-leaf plot compresses the time dimension out of the data. This illustrates why it is always important to construct a time series plot for time-oriented data. 1982 1983 1984 1985 1986 19871988 1989 1990 1991 Years (a) 1 2 3 1989 4 1 2 3 1990 (b) FIGURE 6-16 Company sales by year (a). By quarter (b). Leaf FIGURE 6-17 A digidot plot of the compressive strength data in Table 6-2. Stem 5 24 7 23 189 22 8 21 7108 20 960934 19 0361410 18 8544162106 17 3073050879 16 471340886808 15 29583169 14 413535 13 103 12 580 11 15 10 7 9 7 8 6 7 Time series plot 4 1 2 3 1991 4 Quarters Section 6-5/Time Sequence Plots Leaf FIGURE 6-18 A digidot plot of chemical process concentration readings, observed hourly. Exercises Stem 8 9e 6 9s 45 9f 2333 9t 0010000 9z 99998 8e 66677 8s 45 8f 23 8t 1 8z 221 Time series plot FOR SECTION 6-5 Problem available in WileyPLUS at instructor’s discretion. Tutoring problem available in WileyPLUS at instructor’s discretion. 6-82. The following data are the viscosity measurements for a chemical product observed hourly (read down, then left to right). Construct and interpret either a digidot plot or a separate stem-and-leaf and time series plot of these data. Speciications on product viscosity are at 48 ± 2 . What conclusions can you make about process performance? 47.9 47.9 48.6 48.0 48.4 48.1 48.0 48.6 48.8 48.1 48.3 47 .2 48.9 48.6 48.0 47 .5 48.6 48.0 47.9 48.3 48.5 48.1 48.0 48.3 43.2 43.0 43.5 43.1 43.0 42.9 43.6 43.3 43.0 42.8 43.1 43.2 43.6 43.2 43.5 43.0 The pull-off force for a connector is measured in a 6-83. laboratory test. Data for 40 test specimens follow (read down, then left to right). Construct and interpret either a digidot plot or a separate stem-and-leaf and time series plot of the data. 241 203 201 251 236 190 258 195 195 238 245 175 237 249 255 210 209 178 210 220 245 198 212 175 194 194 235 199 185 190 225 245 220 183 187 248 209 249 213 218 6-84. In their book Time Series Analysis, Forecasting, and Control (Prentice Hall, 1994), G. E. P. Box, G. M. Jenkins, and G. C. Reinsel present chemical process concentration readings made every two hours. Some of these data follow (read down, then left to right). 17.0 16.6 16.7 17.4 17.1 17.4 17.5 18.1 17.6 17.5 16.3 16.1 17.1 16.9 16.8 17.4 17.1 17.0 17.2 17.4 17.4 17.0 17.3 17.2 17.4 16.8 17.4 17.5 17.4 17.6 17.4 17.3 17.0 17.8 17.5 17.4 17.4 17.1 17.6 17.7 17.4 17.8 16.5 17.8 17.3 17.3 17.1 17.4 16.9 17.3 Construct and interpret either a digidot plot or a separate stemand-leaf and time series plot of these data. The 100 annual Wolfer sunspot numbers from 1770 to 6-85. 1869 follow. (For an interesting analysis and interpretation of these numbers, see the book by Box, Jenkins, and Reinsel referenced in Exercise 6-84. Their analysis requires some advanced knowledge of statistics and statistical model building.) Read down, then left to right. The 1869 result is 74. Construct and interpret either a digidot plot or a stem-and-leaf and time series plot of these data. 101 82 66 35 41 21 16 6 4 7 14 34 45 43 48 42 28 31 7 20 92 10 8 2 0 1 5 12 14 35 46 41 30 24 154 125 85 68 16 7 4 2 8 17 36 50 62 67 71 48 28 38 23 10 24 8 13 57 122 138 103 86 63 37 24 11 15 40 83 132 131 118 62 98 124 96 66 64 54 39 21 7 4 23 55 90 67 60 47 94 96 77 59 44 47 30 16 7 37 74 222 Chapter 6/Descriptive Statistics In their book Introduction to Time Series Analysis and 6-86. Forecasting (Wiley, 2008), Montgomery, Jennings, and Kolahci presented the data in Table 6E.2, which are the monthly total passenger airline miles lown in the United Kingdom from 1964 to 1970 (in millions of miles). Comment on any features of the data that are apparent. Construct and interpret either a digidot plot or a separate stem-and-leaf and time series plot of these data. 6-87. Table 6E.3 shows the number of earthquakes per year of magnitude 7.0 and higher since 1900 (source: Earthquake Data Base System of the U.S. Geological Survey, National Earthquake Information Center, Golden, Colorado). Construct and interpret either a digidot plot or a separate stemand-leaf and time series plot of these data. 6-88. Table 6E.4 shows U.S. petroleum imports as a percentage of the totals, and Persian Gulf imports as a percentage of all imports by year since 1973 (source: U.S. Department of Energy Web site, www.eia.doe.gov/). Construct and interpret either a digidot plot or a separate stem-and-leaf and time series plot for each column of data. 6-89. Table 6E.5 contains the global mean surface air temperature anomaly and the global CO2 concentration for the years 1880–2004. The temperature is measured at a number of locations around the world and averaged annually, and then subtracted from a base period average (1951–1980) and the result reported as an anomaly. (a) Construct a time series plot of the global mean surface air temperature anomaly data and comment on any features that you observe. (b) Construct a time series plot of the global CO2 concentration data and comment on any features that you observe. (c) Overlay the two plots on the same set of axes and comment on the plot. 5"#-&t6E.2 United Kingdom Passenger Airline Miles Flown Month Jan. Feb. Mar. Apr. May June July Aug. Sept. Oct. Nov. Dec. 1964 1965 1966 1967 1968 1969 1970 7.269 6.775 7.819 8.371 9.069 10.248 11.030 10.882 10.333 9.109 7.685 7.682 8.350 7.829 8.829 9.948 10.638 11.253 11.424 11.391 10.665 9.396 7.775 7.933 8.186 7.444 8.484 9.864 10.252 12.282 11.637 11.577 12.417 9.637 8.094 9.280 8.334 7.899 9.994 10.078 10.801 12.953 12.222 12.246 13.281 10.366 8.730 9.614 8.639 8.772 10.894 10.455 11.179 10.588 10.794 12.770 13.812 10.857 9.290 10.925 9.491 8.919 11.607 8.852 12.537 14.759 13.667 13.731 15.110 12.185 10.645 12.161 10.840 10.436 13.589 13.402 13.103 14.933 14.147 14.057 16.234 12.389 11.594 12.772 5"#-&t6E.3 Earthquake Data 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 13 14 8 10 16 26 32 27 18 32 36 24 22 23 22 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 22 19 13 26 13 14 22 24 21 22 26 21 23 24 27 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 15 34 10 15 22 18 15 20 15 22 19 16 30 27 29 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 8 15 6 11 8 7 18 16 13 12 13 20 15 16 12 Section 6-5/Time Sequence Plots 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 18 25 21 21 14 8 11 14 23 18 17 19 20 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 41 31 27 35 26 28 36 39 21 17 22 17 19 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 23 20 16 21 21 25 16 18 15 18 14 10 15 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 5"#-&t6E.4 Petroleum Import Data Year Petroleum Imports (thousand barrels per day) Total Petroleum Imports as Percent of Petroleum Products Supplied Petroleum Imports from Persian Gulf as Percent of Total Petroleum Imports 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 6256 6112 6055 7313 8807 8363 8456 6909 5996 5113 5051 5437 5067 6224 6678 7402 8061 8018 7627 7888 8620 8996 8835 9478 10,162 10,708 10,852 11,459 11,871 36.1 36.7 37.1 41.8 47.7 44.3 45.6 40.5 37.3 33.4 33.1 34.5 32.2 38.2 40.0 42.8 46.5 47.1 45.6 46.3 50.0 50.7 49.8 51.7 54.5 56.6 55.5 58.1 60.4 13.5 17.0 19.2 25.1 27.8 26.5 24.4 21.9 20.3 13.6 8.7 9.3 6.1 14.6 16.1 20.8 23.0 24.5 24.1 22.5 20.6 19.2 17.8 16.9 17.2 19.9 22.7 21.7 23.2 18 15 16 13 15 16 11 11 18 12 15 223 224 Chapter 6/Descriptive Statistics 5"#-&t6E.4 (Continued) Year Petroleum Imports (thousand barrels per day) Total Petroleum Imports as Percent of Petroleum Products Supplied Petroleum Imports from Persian Gulf as Percent of Total Petroleum Imports 2002 2003 2004 2005 2006 2007 2008 11,530 12,264 13,145 13,714 13,707 13,468 12,915 58.3 61.2 63.4 65.9 66.3 65.1 66.2 19.6 20.3 18.9 17.0 16.1 16.1 18.4 5"#-&t6E.5 Global Mean Surface Air Temperature Anomaly and Global CO2 Concentration Year Anomaly, oC CO2, ppmv Year Anomaly, oC CO2, ppmv Year Anomaly, oC CO2, ppmv 1880 −0.11 290.7 1922 −0.09 303.8 1964 −0.25 319.2 1881 −0.13 291.2 1923 −0.16 304.1 1965 −0.15 320.0 1882 −0.01 291.7 1924 −0.11 304.5 1966 −0.07 321.1 1883 −0.04 292.1 1925 −0.15 305.0 1967 −0.02 322.0 1884 −0.42 292.6 1926 0.04 305.4 1968 −0.09 322.9 1885 –0.23 293.0 1927 −0.05 305.8 1969 0.00 324.2 1886 −0.25 293.3 1928 0.01 306.3 1970 0.04 325.2 1887 −0.45 293.6 1929 −0.22 306.8 1971 −0.10 326.1 1888 1889 −0.23 0.04 293.8 294.0 1930 1931 −0.03 0.03 307.2 307.7 1972 1973 −0.05 0.18 327.2 328.8 1890 −0.22 294.2 1932 0.04 308.2 1974 −0.06 329.7 1891 −0.55 294.3 1933 −0.11 308.6 1975 −0.02 330.7 1892 −0.40 294.5 1934 0.05 309.0 1976 −0.21 331.8 1893 −0.39 294.6 1935 −0.08 309.4 1977 0.16 333.3 1894 −0.32 294.7 1936 0.01 309.8 1978 0.07 334.6 1895 −0.32 294.8 1937 0.12 310.0 1979 0.13 336.9 1896 −0.27 294.9 1938 0.15 310.2 1980 0.27 338.7 1897 −0.15 295.0 1939 −0.02 310.3 1981 0.40 339.9 1898 −0.21 295.2 1940 0.14 310.4 1982 0.10 341.1 1899 −0.25 295.5 1941 0.11 310.4 1983 0.34 342.8 1900 −0.05 295.8 1942 0.10 310.3 1984 0.16 344.4 1901 −0.05 296.1 1943 0.06 310.2 1985 0.13 345.9 1902 −0.30 296.5 1944 0.10 310.1 1986 0.19 347.2 1903 −0.35 296.8 1945 −0.01 310.1 1987 0.35 348.9 1904 −0.42 297.2 1946 0.01 310.1 1988 0.42 351.5 1905 −0.25 297.6 1947 0.12 310.2 1989 0.28 352.9 1906 −0.15 298.1 1948 −0.03 310.3 1990 0.49 354.2 1907 −0.41 298.5 1949 −0.09 310.5 1991 0.44 355.6 1908 −0.30 298.9 1950 −0.17 310.7 1992 0.16 356.4 1909 −0.31 299.3 1951 −0.02 311.1 1993 0.18 357.0 299.7 1952 0.03 311.5 1994 1910 −0.21 0.31 358.9 Section 6-6/Scatter Diagrams Year Anomaly, oC CO2, ppmv Year Anomaly, oC CO2, ppmv Year Anomaly, oC CO2, ppmv 1911 −0.25 300.1 1912 −0.33 300.4 1953 0.12 311.9 1995 0.47 360.9 1954 −0.09 312.4 1996 0.36 1913 −0.28 362.6 300.8 1955 −0.09 313.0 1997 0.40 1914 1915 363.8 −0.02 0.06 301.1 301.4 1956 1957 −0.18 0.08 313.6 314.2 1998 1999 0.71 0.43 366.6 368.3 1916 −0.20 301.7 1958 0.10 314.9 2000 0.41 369.5 1917 −0.46 302.1 1959 0.05 315.8 2001 0.56 371.0 1918 −0.33 302.4 1960 −0.02 316.6 2002 0.70 373.1 1919 −0.09 302.7 1961 0.10 317.3 2003 0.66 375.6 1920 −0.15 303.0 1962 0.05 318.1 2004 0.60 377.4 1921 −0.04 303.4 1963 0.03 318.7 225 (source: http://data.giss.nasa.gov/gistemp/) 5"#-&t6.5 Quality Data for Young Red Wines 6-6 Quality pH Total SO2 Color Density Color 19.2 18.3 17.1 15.2 14.0 13.8 12.8 17.3 16.3 16.0 15.7 15.3 14.3 14.0 13.8 12.5 11.5 14.2 17.3 15.8 3.85 3.75 3.88 3.66 3.47 3.75 3.92 3.97 3.76 3.98 3.75 3.77 3.76 3.76 3.90 3.80 3.65 3.60 3.86 3.93 66 79 73 86 178 108 96 59 22 58 120 144 100 104 67 89 192 301 99 66 9.35 11.15 9.40 6.40 3.60 5.80 5.00 10.25 8.20 10.15 8.80 5.60 5.55 8.70 7.41 5.35 6.35 4.25 12.85 4.90 5.65 6.95 5.75 4.00 2.25 3.20 2.70 6.10 5.00 6.00 5.50 3.35 3.25 5.10 4.40 3.15 3.90 2.40 7.70 2.75 Scatter Diagrams In many problems, engineers and scientists work with data that is multivariate in nature; that is, each observation consists of measurements of several variables. We saw an example of this in the wire bond pull strength data in Table 1.2. Each observation consisted of data on the pull strength of a particular wire bond, the wire length, and the die height. Such data are very commonly encountered. Table 6.5 contains a second example of multivariate data taken from an article on the quality of different young red wines in the Journal of the Science of Food 226 Chapter 6/Descriptive Statistics Scatterplot of Quality vs. Color 20 19 18 Quality 17 16 15 14 13 FIGURE 6-19 Scatter diagram of wine quality and color from Table 6-5. 12 11 2 3.50 3 3.75 4 4.00 5 Color 100 200 6 300 4 7 8 8 12 2.0 4.5 7.0 20 16 Quality 12 4.00 3.75 pH 3.50 300 200 Total so2 100 12 FIGURE 6-20 Matrix of scatter diagrams for the wine quality data in Table 6-5. Color Density 8 4 Color and Agriculture (1974, Vol. 25) by T.C. Somers and M.E. Evans. The authors reported quality along with several other descriptive variables. We show only quality, pH, total SO2 (in ppm), color density, and wine color for a sample of their wines. Suppose that we wanted to graphically display the potential relationship between quality and one of the other variables, say color. The scatter diagram is a useful way to do this. A scatter diagram is constructed by plotting each pair of observations with one measurement in the pair on the vertical axis of the graph and the other measurement in the pair on the horizontal axis. Figure 6.19 is the scatter diagram of quality versus the descriptive variable color. Notice that there is an apparent relationship between the two variables with wines of more intense color generally having a higher quality rating. A scatter diagram is an excellent exploratory tool and can be very useful in identifying potential relationships between two variables. Data in Figure 6-19 indicate that a linear relationship Section 6-6/Scatter Diagrams 227 between quality and color may exist. We saw an example of a three-dimensional scatter diagram in Chapter 1 where we plotted wire bond strength versus wire length and die height for the bond pull strength data. When two or more variables exist, the matrix of scatter diagrams may be useful in looking at all of the pairwise relationships between the variables in the sample. Figure 6-20 is the matrix of scatter diagrams (upper half only shown) for the wine quality data in Table 6-5. The top row of the graph contains individual scatter diagrams of quality versus the other four descriptive variables, and other cells contain other pairwise plots of the four descriptive variables pH, SO2, color density, and color. This display indicates a weak potential linear relationship between quality and pH and somewhat stronger potential relationships between quality and color density and quality and color (which was noted previously in Figure 6-19). A strong apparent linear relationship between color density and color exists (this should be expected). The sample correlation coeficient rxy is a quantitative measure of the strength of the linear relationship between two random variables x and y. The sample correlation coeficient is deined as n rxy = ∑ yi ( xi − x ) i =1 n ⎡n 2 2⎤ ⎢ ∑ ( yi − y ) ∑ ( xi − x ) ⎥ i =1 ⎦ ⎣ i =1 (6.6) 1/ 2 If the two variables are perfectly linearly related with a positive slope rxy = 1 and if they are perfectly linearly related with a negative slope, then rxy = −1. If no linear relationship between the two variables exists, then rxy = 0. The simple correlation coeficient is also sometimes called the Pearson correlation coeficient after Karl Pearson, one of the giants of the ields of statistics in the late 19th and early 20th centuries. The value of the sample correlation coeficient between quality and color, the two variables plotted in the scatter diagram of Figure 6-19, is 0.712. This is moderately strong correlation, indicating a possible linear relationship between the two variables. Correlations below | 0.5 | are generally considered weak and correlations above | 0.8 | are generally considered strong. All pairwise sample correlations between the ive variables in Table 6-5 are as follows: Quality pH Total SO2 Color density Color pH 0.349 −0.445 0.702 0.712 Total SO2 Color Density −0.679 0.482 0.430 −0.492 −0.480 0.996 Moderately strong correlations exist between quality and the two variables color and color density and between pH and total SO2 (note that this correlation is negative). The correlation between color and color density is 0.996, indicating a nearly perfect linear relationship. See Fig. 6-21 for several examples of scatter diagrams exhibiting possible relationships between two variables. Parts (e) and (f) of the igure deserve special attention; in part (e), a probable quadratic relationship exists between y and x, but the sample correlation coeficient is close to zero because the correlation coeficient is a measure of linear association, but in part (f), the correlation is approximately zero because no association exists between the two variables. 228 Chapter 6/Descriptive Statistics (a) Weak positive relationship (b) Strong positive relationship (c) Weak negative relationship (d) Strong negative relationship (e) Nonlinear quadratic relationship, rxy < o (f) No relationship, rxy < o FIGURE 6-21 Potential relationship between variables. Exercises FOR SECTION 6-6 Problem available in WileyPLUS at instructor’s discretion. Tutoring problem available in WileyPLUS at instructor’s discretion. 6-90. Table 6E.6 presents data on the ratings of quarterbacks for the 2008 National Football League season (source: The Sports Network). It is suspected that the rating (y) is related to the average number of yards gained per pass attempt (x). (a) Construct a scatter plot of quarterback rating versus yards per attempt. Comment on the suspicion that rating is related to yards per attempt. (b) What is the simple correlation coeficient between these two variables? 6-91. An article in Technometrics by S. C. Narula and J. F. Wellington [“Prediction, Linear Regression, and a Minimum Sum of Relative Errors” (1977, Vol. 19)] presents data on the selling price and annual taxes for 24 houses. The data are shown in Table 6E.7. 5"#-&t6E.6 2008 NFL Quarterback Rating Data Player Team Yards per Attempt Rating Points Philip Rivers SD 8.39 105.5 Chad Pennington MIA 7.67 97.4 Kurt Warner ARI 7.66 96.9 Drew Brees NO 7.98 96.2 Peyton Manning IND 7.21 95 Aaron Rodgers GB 7.53 93.8 Section 6-6/Scatter Diagrams Matt Schaub HOU 8.01 92.7 Tony Romo DAL 7.66 91.4 Jeff Garcia TB 7.21 90.2 Matt Cassel NE 7.16 89.4 Matt Ryan ATL 7.93 87.7 Shaun Hill SF 7.10 87.5 Seneca Wallace SEA 6.33 87 Eli Manning NYG 6.76 86.4 Donovan McNabb PHI 6.86 86.4 Jay Cutler DEN 7.35 86 Trent Edwards BUF 7.22 85.4 Jake Delhomme CAR 7.94 84.7 Jason Campbell WAS 6.41 84.3 David Garrard JAC 6.77 81.7 Brett Favre NYJ 6.65 81 Joe Flacco BAL 6.94 80.3 Kerry Collins TEN 6.45 80.2 Ben Roethlisberger PIT 7.04 80.1 Kyle Orton CHI 6.39 79.6 JaMarcus Russell OAK 6.58 77.1 Tyler Thigpen KC 6.21 76 Gus Freotte MIN 7.17 73.7 Dan Orlovsky DET 6.34 72.6 Marc Bulger STL 6.18 71.4 Ryan Fitzpatrick CIN 5.12 70 Derek Anderson CLE 5.71 66.5 (a) Construct a scatter plot of sales price versus taxes paid. Comment on the widely held belief that price is related to taxes paid. (b) What is the simple correlation coeficient between these two variables? 6-92. An article in the Journal of Pharmaceuticals Sciences (1991, Vol. 80, pp. 971–977) presented data on the observed mole fraction solubility of a solute at a constant temperature and the dispersion, dipolar, and hydrogen-bonding Hansen partial solubility parameters. The data are as shown in Table 6E.8, where y is the negative logarithm of the mole fraction solubility, x1 is the dispersion partial solubility, x2 is the dipolar partial solubility, and x3 is the hydrogen bonding partial solubility. (a) Construct a matrix of scatter plots for these variables. (b) Comment on the apparent relationships among y and the other three variables? 5"#-&t6E.7 House Price and Tax Data Sale Price/1000 Taxes (local, school), county)/1000 Sale Price/1000 Taxes (local, school), county)/1000 25.9 29.5 27.9 25.9 29.9 29.9 30.9 28.9 35.9 31.5 31.0 30.9 4.9176 5.0208 4.5429 4.5573 5.0597 3.8910 5.8980 5.6039 5.8282 5.3003 6.2712 5.9592 30.0 36.9 41.9 40.5 43.9 37.5 37.9 44.5 37.9 38.9 36.9 45.8 5.0500 8.2464 6.6969 7.7841 9.0384 5.9894 7.5422 8.7951 6.0831 8.3607 8.1400 9.1416 5"#-&t6E.8 Solubility Data for Exercise 6-93 Observation Number 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 y 0.22200 0.39500 0.42200 0.43700 0.42800 0.46700 0.44400 0.37800 0.49400 0.45600 0.45200 0.11200 0.43200 0.10100 0.23200 0.30600 0.09230 0.11600 0.07640 0.43900 0.09440 0.11700 0.07260 0.04120 0.25100 0.00002 x1 7.3 8.7 8.8 8.1 9.0 8.7 9.3 7.6 10.0 8.4 9.3 7.7 9.8 7.3 8.5 9.5 7.4 7.8 7.7 10.3 7.8 7.1 7.7 7.4 7.3 7.6 x2 0.0 0.0 0.7 4.0 0.5 1.5 2.1 5.1 0.0 3.7 3.6 2.8 4.2 2.5 2.0 2.5 2.8 2.8 3.0 1.7 3.3 3.9 4.3 6.0 2.0 7.8 x3 0.0 0.3 1.0 0.2 1.0 2.8 1.0 3.4 0.3 4.1 2.0 7.1 2.0 6.8 6.6 5.0 7.8 7.7 8.0 4.2 8.5 6.6 9.5 10.9 5.2 20.7 229 230 Chapter 6/Descriptive Statistics 6-7 Probability Plots How do we know whether a particular probability distribution is a reasonable model for data? Sometimes this is an important question because many of the statistical techniques presented in subsequent chapters are based on an assumption that the population distribution is of a speciic type. Thus, we can think of determining whether data come from a speciic probability distribution as verifying assumptions. In other cases, the form of the distribution can give insight into the underlying physical mechanism generating the data. For example, in reliability engineering, verifying that time-to-failure data come from an exponential distribution identiies the failure mechanism in the sense that the failure rate is constant with respect to time. Some of the visual displays we used earlier, such as the histogram, can provide insight about the form of the underlying distribution. However, histograms are usually not really reliable indicators of the distribution form unless the sample size is very large. A probability plot is a graphical method for determining whether sample data conform to a hypothesized distribution based on a subjective visual examination of the data. The general procedure is very simple and can be performed quickly. It is also more reliable than the histogram for small- to moderate-size samples. Probability plotting typically uses special axes that have been scaled for the hypothesized distribution. Software is widely available for the normal, lognormal, Weibull, and various chi-square and gamma distributions. We focus primarily on normal probability plots because many statistical techniques are appropriate only when the population is (at least approximately) normal. To construct a probability plot, the observations in the sample are irst ranked from smallest to largest. That is, the sample x1 , x2 ,… , xn is arranged as x(1) , x(2) ,… , x(n) , where x(1) is the smallest observation, x(2) is the second-smallest observation, and so forth with x(n) the largest. The ordered observations x( j ) are then plotted against their observed cumulative frequency ( j − 0.5) / n on the appropriate probability paper. If the hypothesized distribution adequately describes the data, the plotted points will fall approximately along a straight line; if the plotted points deviate signiicantly from a straight line, the hypothesized model is not appropriate. Usually, the determination of whether or not the data plot is a straight line is subjective. The procedure is illustrated in the following example. Battery Life Ten observations on the effective service life in minutes of batteries used in a portable personal computer are as follows: 176, 191, 214, 220, 205, 192, 201, 190, 183, 185. We hypothesize that battery life is adequately modeled by a normal distribution. To use probability plotting to investigate this hypothesis, irst arrange the observations in ascending order and calculate their cumulative frequencies ( j − 0.5) / 10 as shown in Table 6-6. Example 6-7 5"#-&t6-6 Calculation for Constructing a Normal Probability Plot j x( j ) ( j − 0.5) /10 zj 1 2 3 4 5 6 7 8 9 10 176 183 185 190 191 192 201 205 214 220 0.05 0.15 0.25 0.35 0.45 0.55 0.65 0.75 0.85 0.95 –1.64 –1.04 –0.67 –0.39 –0.13 0.13 0.39 0.67 1.04 1.64 Section 6-7/Probability Plots 231 The pairs of values x( j ) and ( j − 0.5) / 10 are now plotted on normal probability axes. This plot is shown in Fig. 6-22. Most normal probability plots have 100 ( j − 0.5) / n on the left vertical scale and (sometimes) 100 [1 − ( j − 0.5) / n ] on the right vertical scale, with the variable value plotted on the horizontal scale. A straight line, chosen subjectively, has been drawn through the plotted points. In drawing the straight line, you should be inluenced more by the points near the middle of the plot than by the extreme points. A good rule of thumb is to draw the line approximately between the 25th and 75th percentile points. This is how the line in Fig. 6-22 was determined. In assessing the “closeness” of the points to the straight line, imagine a “fat pencil” lying along the line. If all the points are covered by this imaginary pencil, a normal distribution adequately describes the data. Because the points in Fig. 6-19 would pass the “fat pencil” test, we conclude that the normal distribution is an appropriate model. 0.1 99 1 95 5 80 20 50 50 20 80 5 95 1 99 0.1 170 180 190 200 210 220 100[1 – ( j – 0.5)/n] 100( j – 0.5)/n 99.9 99.9 x ( j) FIGURE 6-22 Normal probability plot for battery life. A normal probability plot can also be constructed on ordinary axes by plotting the standardized normal scores z j against x( j ) where the standardized normal scores satisfy j − 0.5 = P( Z ≤ z j ) = Φ ( z j ) n For example, if ( j − 0.5) / n = 0.05, Φ ( z j ) = 0.05 implies that z j = −1.64. To illustrate, consider the data from Example 6-4. In the last column of Table 6-6 we show the standardized normal scores. Figure 6-23 is the plot of z j versus x( j ). This normal probability plot is equivalent to the one in Fig. 6-22. We have constructed our probability plots with the probability scale (or the z-scale) on the vertical axis. Some computer packages “lip” the axis and put the probability scale on the horizontal axis. 3.30 1.65 zj FIGURE 6-23 Normal probability plot obtained from standardized normal scores. 0 –1.65 –3.30 170 180 190 200 x( j ) 210 220 232 zj Chapter 6/Descriptive Statistics 3.30 3.30 3.30 1.65 1.65 1.65 zj 0 –1.65 –3.30 170 zj 0 –1.65 180 190 200 x( j) 210 220 –3.30 170 0 –1.65 180 190 200 x( j) (a) 210 220 (b) –3.30 170 180 190 200 x( j ) 210 220 (c) FIGURE 6-24 Normal probability plots indicating a nonnormal distribution. (a) Light-tailed distribution. (b) Heavy-tailed distribution. (c) A distribution with positive (or right) skew. Normal Probability Plots of Small Samples Can Be Unreliable Exercises The normal probability plot can be useful in identifying distributions that are symmetric but that have tails that are “heavier” or “lighter” than the normal. They can also be useful in identifying skewed distributions. When a sample is selected from a light-tailed distribution (such as the uniform distribution), the smallest and largest observations will not be as extreme as would be expected in a sample from a normal distribution. Thus, if we consider the straight line drawn through the observations at the center of the normal probability plot, observations on the left side will tend to fall below the line, and observations on the right side will tend to fall above the line. This will produce an S-shaped normal probability plot such as shown in Fig. 6-24(a). A heavytailed distribution will result in data that also produce an S-shaped normal probability plot, but now the observations on the left will be above the straight line and the observations on the right will lie below the line. See Fig. 6-24(b). A positively skewed distribution will tend to produce a pattern such as shown in Fig. 6-24(c), where points on both ends of the plot tend to fall below the line, giving a curved shape to the plot. This occurs because both the smallest and the largest observations from this type of distribution are larger than expected in a sample from a normal distribution. Even when the underlying population is exactly normal, the sample data will not plot exactly on a straight line. Some judgment and experience are required to evaluate the plot. Generally, if the sample size is n < 30, there can be signiicant deviation from linearity in normal plots, so in these cases only a very severe departure from linearity should be interpreted as a strong indication of nonnormality. As n increases, the linear pattern will tend to become stronger, and the normal probability plot will be easier to interpret and more reliable as an indicator of the form of the distribution. FOR SECTION 6-7 Problem available in WileyPLUS at instructor’s discretion. Tutoring problem available in WileyPLUS at instructor’s discretion. 6-93. Construct a normal probability plot of the piston ring diameter data in Exercise 6-7. Does it seem reasonable to assume that piston ring diameter is normally distributed? 6-94. Construct a normal probability plot of the insulating luid breakdown time data in Exercise 6-8. Does it seem reasonable to assume that breakdown time is normally distributed? 6-95. Construct a normal probability plot of the visual accommodation data in Exercise 6-11. Does it seem reasonable to assume that visual accommodation is normally distributed? 6-96. Construct a normal probability plot of the solar intensity data in Exercise 6-12. Does it seem reasonable to assume that solar intensity is normally distributed? 6-97. Construct a normal probability plot of the O-ring joint temperature data in Exercise 6-19. Does it seem reasonable to assume that O-ring joint temperature is normally distributed? Discuss any interesting features that you see on the plot. 6-98. Construct a normal probability plot of the octane rating data in Exercise 6-30. Does it seem reasonable to assume that octane rating is normally distributed? Section 6-7/Probability Plots 6-99. Construct a normal probability plot of the cycles to failure data in Exercise 6-31. Does it seem reasonable to assume that cycles to failure is normally distributed? 6-100. Construct a normal probability plot of the suspended solids concentration data in Exercise 6-40. Does it seem reasonable to assume that the concentration of suspended solids in water from this particular lake is normally distributed? 6-101. Construct two normal probability plots for the height data in Exercises 6-38 and 6-45. Plot the data for female and male students on the same axes. Does height seem to be normally distributed for either group of students? If both 233 populations have the same variance, the two normal probability plots should have identical slopes. What conclusions would you draw about the heights of the two groups of students from visual examination of the normal probability plots? 6-102. It is possible to obtain a “quick-and-dirty” estimate of the mean of a normal distribution from the 50th percentile value on a normal probability plot. Provide an argument why this is so. It is also possible to obtain an estimate of the standard deviation of a normal distribution by subtracting the 84th percentile value from the 50th percentile value. Provide an argument explaining why this is so. Supplemental Exercises Problem available in WileyPLUS at instructor’s discretion. Tutoring problem available in WileyPLUS at instructor’s discretion. 6-103. The National Oceanic and Atmospheric Administration provided the monthly absolute estimates of global (land and ocean combined) temperature index (degrees C) from 2000. Read January to December from left to right in www. ncdc.noaa.gov/oa/climate/research/anomalies/anomalies. html). Construct and interpret either a digidot plot or a separate stem-and-leaf and time series plot of these data. 6-104. The concentration of a solution is measured six times by one operator using the same instrument. She obtains the following data: 63.2, 67.1, 65.8, 64.0, 65.1, and 65.3 (grams per liter). (a) Calculate the sample mean. Suppose that the desirable value for this solution has been speciied to be 65.0 grams per liter. Do you think that the sample mean value computed here is close enough to the target value to accept the solution as conforming to target? Explain your reasoning. (b) Calculate the sample variance and sample standard deviation. (c) Suppose that in measuring the concentration, the operator must set up an apparatus and use a reagent material. What do you think the major sources of variability are in this experiment? Why is it desirable to have a small variance of these measurements? 6-105. Table 6E.10 shows unemployment data for the United States that are seasonally adjusted. Construct a time series plot of these data and comment on any features (source: U.S. Bureau of Labor Web site, http://data.bls.gov). 6-106. A sample of six resistors yielded the following resistances (ohms): x1 = 45, x2 = 38, x3 = 47, x4 = 41, x5 = 35, and x6 = 43. (a) Compute the sample variance and sample standard deviation. (b) Subtract 35 from each of the original resistance measurements and compute s 2 and s. Compare your results with those obtained in part (a) and explain your indings. (c) If the resistances were 450, 380, 470, 410, 350, and 430 ohms, could you use the results of previous parts of this problem to ind s 2 and s? 6-107. Consider the following two samples: Sample 1: 10, 9, 8, 7, 8, 6, 10, 6 Sample 2: 10, 6, 10, 6, 8, 10, 8, 6 (a) Calculate the sample range for both samples. Would you conclude that both samples exhibit the same variability? Explain. (b) Calculate the sample standard deviations for both samples. Do these quantities indicate that both samples have the same variability? Explain. (c) Write a short statement contrasting the sample range versus the sample standard deviation as a measure of variability. 5"#-&t 6E.9 Global Monthly Temperature Year 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 1 12.3 12.4 12.7 12.6 12.6 12.6 12.4 12.8 12.2 12.5 2 12.6 12.5 12.9 12.6 12.8 12.5 12.6 12.7 12.4 12.6 3 13.2 13.3 13.4 13.2 13.3 13.4 13.2 13.3 13.4 13.2 4 14.3 14.2 14.2 14.2 14.3 14.4 14.2 14.4 14.1 14.3 5 15.3 15.4 15.3 15.4 15.2 15.4 15.3 15.3 15.2 15.3 6 15.9 16.0 16.1 16.0 16.0 16.2 16.1 16.0 16.0 16.1 7 16.2 16.3 16.4 16.3 16.3 16.4 16.4 16.3 16.3 16.4 8 16.0 16.2 16.1 16.2 16.1 16.2 16.2 16.1 16.1 16.2 9 15.4 15.5 15.5 15.6 15.5 15.7 15.6 15.5 15.5 15.6 10 14.3 14.5 14.5 14.7 14.6 14.6 14.6 14.5 14.6 11 13.1 13.5 13.5 13.4 13.6 13.6 13.5 13.4 13.5 12 12.5 12.7 12.6 12.9 12.7 12.8 12.9 12.6 12.7 234 Chapter 6/Descriptive Statistics 5"#-&t6E.10 Unemployment Percentage Year 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 Jan 4.3 4.0 4.2 5.7 5.8 5.7 5.2 4.7 4.6 4.9 7.6 Feb 4.4 4.1 4.2 5.7 5.9 5.6 5.4 4.8 4.5 4.8 8.1 Mar 4.2 4.0 4.3 5.7 5.9 5.8 5.2 4.7 4.4 5.1 8.5 Apr 4.3 3.8 4.4 5.9 6.0 5.6 5.2 4.7 4.5 5.0 8.9 May 4.2 4.0 4.3 5.8 6.1 5.6 5.1 4.7 4.5 5.5 9.4 6-108. An article in Quality Engineering (1992, Vol. 4, pp. 487–495) presents viscosity data from a batch chemical process. A sample of these data is in Table 6E.11. 13.3 14.5 15.3 15.3 14.3 14.8 15.2 14.5 14.6 14.1 14.3 16.1 13.1 15.5 12.6 14.6 14.3 15.4 15.2 16.8 14.9 13.7 15.2 14.5 15.3 15.6 15.8 13.3 14.1 15.4 15.2 15.2 15.9 16.5 14.8 15.1 17.0 14.9 14.8 14.0 15.8 13.7 15.1 13.4 14.1 14.8 14.3 14.3 16.4 16.9 14.2 16.9 14.9 15.2 14.4 15.2 14.6 16.4 14.2 15.7 16.0 14.9 13.6 15.3 14.3 15.6 16.1 13.9 15.2 14.4 14.0 14.4 13.7 13.8 15.6 14.5 12.8 16.1 16.6 15.6 (a) Reading left to right and up and down, draw a time series plot of all the data and comment on any features of the data that are revealed by this plot. (b) Consider the notion that the irst 40 observations were generated from a speciic process, whereas the last 40 observations were generated from a different process. Does the plot indicate that the two processes generate similar results? (c) Compute the sample mean and sample variance of the irst 40 observations; then compute these values for the second 40 observations. Do these quantities indicate that both processes yield the same mean level? The same variability? Explain. 5"#-&t6E.11 Viscosity Data 13.3 14.5 15.3 15.3 14.3 14.8 15.2 14.5 14.6 14.1 14.3 16.1 13.1 15.5 12.6 14.6 14.3 15.4 15.2 16.8 14.9 13.7 15.2 14.5 15.3 15.6 15.8 13.3 14.1 15.4 15.2 15.2 15.9 16.5 14.8 15.1 17.0 14.9 14.8 14.0 15.8 13.7 15.1 13.4 14.1 14.8 14.3 14.3 16.4 16.9 14.2 16.9 14.9 15.2 14.4 15.2 14.6 16.4 14.2 15.7 16.0 14.9 13.6 15.3 14.3 15.6 16.1 13.9 15.2 14.4 14.0 14.4 13.7 13.8 15.6 14.5 12.8 16.1 16.6 15.6 Jun 4.3 4.0 4.5 5.8 6.3 5.6 5.1 4.6 4.6 5.6 9.5 Jul 4.3 4.0 4.6 5.8 6.2 5.5 5.0 4.7 4.7 5.8 9.4 Aug 4.2 4.1 4.9 5.7 6.1 5.4 4.9 4.7 4.7 6.2 9.7 Sep 4.2 3.9 5.0 5.7 6.1 5.4 5.0 4.5 4.7 6.2 9.8 Oct 4.1 3.9 5.3 5.7 6.0 5.5 5.0 4.4 4.8 6.6 Nov 4.1 3.9 5.5 5.9 5.8 5.4 5.0 4.5 4.7 6.8 Dec 4.0 3.9 5.7 6.0 5.7 5.4 4.8 4.4 4.9 7.2 The total net electricity consumption of the United 6-109. States by year from 1980 to 2007 (in billion kilowatt-hours) is in Table 6E.12. Net consumption excludes the energy consumed by the generating units. TABLE t6E.12 U.S. Electricity Consumption 1980 2094.4 1981 2147.1 1982 2086.4 1983 2151.0 1984 2285.8 1985 2324.0 1986 2368.8 1987 2457.3 1988 2578.1 1989 2755.6 1990 2837.1 1991 2886.1 1992 2897.2 1993 3000.7 1994 3080.9 1995 3164.0 1996 3253.8 1997 3301.8 1998 3425.1 1999 3483.7 2000 3592.4 2001 3557.1 2002 3631.7 2003 3662.0 2004 3715.9 2005 3811.0 2006 3816.8 2007 3891.7 (source: U.S. Department of Energy Web site, www.eia.doe.gov/emeu/ international/contents.html#InternationalElectricity). Construct a time series plot of these data. Construct and interpret a stem-and-leaf display of these data. 6-110. Reconsider the data from Exercise 6-108. Prepare comparative box plots for two groups of observations: the irst 40 and the last 40. Comment on the information in the box plots. 6-111. The data shown in Table 6E.13 are monthly champagne sales in France (1962–1969) in thousands of bottles. (a) Construct a time series plot of the data and comment on any features of the data that reveals by this plot. (b) Speculate on how you would use a graphical procedure to forecast monthly champagne sales for the year 1970. 6-112. The following data are the temperatures of efluent at discharge from a sewage treatment facility on consecutive days: 43 45 49 47 52 45 51 46 44 48 51 50 52 44 48 50 49 50 46 46 49 49 51 50 (a) Calculate the sample mean, sample median, sample variance, and sample standard deviation. (b) Construct a box plot of the data and comment on the information in this display. Section 6-7/Probability Plots 235 5"#-&t6E.13 Champagne Sales in France Month 1962 1963 1964 1965 1966 1967 1968 1969 Jan. Feb. Mar. Apr. May June July Aug. Sept. Oct. Nov. Dec. 2.851 2.672 2.755 2.721 2.946 3.036 2.282 2.212 2.922 4.301 5.764 7.132 2.541 2.475 3.031 3.266 3.776 3.230 3.028 1.759 3.595 4.474 6.838 8.357 3.113 3.006 4.047 3.523 3.937 3.986 3.260 1.573 3.528 5.211 7.614 9.254 5.375 3.088 3.718 4.514 4.520 4.539 3.663 1.643 4.739 5.428 8.314 10.651 3.633 4.292 4.154 4.121 4.647 4.753 3.965 1.723 5.048 6.922 9.858 11.331 4.016 3.957 4.510 4.276 4.968 4.677 3.523 1.821 5.222 6.873 10.803 13.916 2.639 2.899 3.370 3.740 2.927 3.986 4.217 1.738 5.221 6.424 9.842 13.076 3.934 3.162 4.286 4.676 5.010 4.874 4.633 1.659 5.591 6.981 9.851 12.670 6-113. A manufacturer of coil springs is interested in implementing a quality control system to monitor his production process. As part of this quality system, it is decided to record the number of nonconforming coil springs in each production batch of size 50. During 40 days of production, 40 batches of data were collected as follows: Read data across and down. 9 12 6 9 7 14 12 4 6 7 8 5 9 7 8 11 3 6 7 7 11 4 4 8 7 5 6 4 5 8 19 19 18 12 11 17 15 17 13 13 (a) Construct a stem-and-leaf plot of the data. (b) Find the sample average and standard deviation. (c) Construct a time series plot of the data. Is there evidence that there was an increase or decrease in the average number of nonconforming springs made during the 40 days? Explain. 6-114. A communication channel is being monitored by recording the number of errors in a string of 1000 bits. Data for 20 of these strings follow: Read data across and down 3 1 0 1 3 2 4 1 3 1 1 1 2 3 3 2 0 2 0 1 (a) Construct a stem-and-leaf plot of the data. (b) Find the sample average and standard deviation. (c) Construct a time series plot of the data. Is there evidence that there was an increase or decrease in the number of errors in a string? Explain. 6-115. Reconsider the golf course yardage data in Exercise 6-9. Construct a box plot of the yardages and write an interpretation of the plot. 6-116. Reconsider the data in Exercise 6-108. Construct normal probability plots for two groups of the data: the irst 40 and the last 40 observations. Construct both plots on the same axes. What tentative conclusions can you draw? 6-117. Construct a normal probability plot of the efluent discharge temperature data from Exercise 6-112. Based on the plot, what tentative conclusions can you draw? 6-118. Construct normal probability plots of the cold start ignition time data presented in Exercises 6-69 and 6-80. Construct a separate plot for each gasoline formulation, but arrange the plots on the same axes. What tentative conclusions can you draw? 6-119. Reconsider the golf ball overall distance data in Exercise 6-41. Construct a box plot of the yardage distance and write an interpretation of the plot. How does the box plot compare in interpretive value to the original stem-and-leaf diagram? 6-120. Transformations. In some data sets, a transformation by some mathematical function applied to the original data, such as y or log y, can result in data that are simpler to work with statistically than the original data. To illustrate the effect of a transformation, consider the following data, which represent cycles to failure for a yarn product: 675, 3650, 175, 1150, 290, 2000, 100, 375. (a) Construct a normal probability plot and comment on the shape of the data distribution. (b) Transform the data using logarithms; that is, let y ∗(new value) = log y (old value). Construct a normal probability plot of the transformed data and comment on the effect of the transformation. 6-121. In 1879, A. A. Michelson made 100 determinations of the velocity of light in air using a modiication of a method proposed by the French physicist Foucault. Michelson made the measurements in ive trials of 20 measurements each. The observations (in kilometers per second) are in Table 6E.14. Each value has 299,000 subtracted from it. The currently accepted true velocity of light in a vacuum is 299,792.5 kilometers per second. Stigler (1977, The Annals of Statistics) reported that the “true” value for comparison to these measurements is 734.5. Construct comparative box plots of these measurements. Does it seem that all ive trials are 236 Chapter 6/Descriptive Statistics 5"#-&t6E.14 Velocity of Light Data 900 930 1070 650 Trial 1 930 760 850 810 950 1000 980 1000 980 960 880 960 960 830 940 790 960 810 940 880 Trial 2 880 880 800 830 850 800 880 790 900 760 840 800 880 880 880 910 880 850 860 870 Trial 3 720 840 720 840 620 850 860 840 970 840 950 840 890 910 810 920 810 890 820 860 Trial 4 800 880 770 720 760 840 740 850 750 850 760 780 890 870 840 870 780 810 810 740 Trial 5 760 810 810 940 790 950 810 800 820 810 850 870 850 1000 740 980 consistent with respect to the variability of the measurements? Are all ive trials centered on the same value? How does each group of trials compare to the true value? Could there have been “startup” effects in the experiment that Michelson performed? Could there have been bias in the measuring instrument? 6-122. In 1789, Henry Cavendish estimated the density of the Earth by using a torsion balance. His 29 measurements follow, expressed as a multiple of the density of water. 5.50 5.55 5.57 5.34 5.42 5.30 5.61 5.36 5.53 5.79 5.47 5.75 4.88 5.29 5.62 5.10 5.63 5.86 4.07 5.58 5.29 5.27 5.34 5.85 5.26 5.65 5.44 5.39 5.46 (a) Calculate the sample mean, sample standard deviation, and median of the Cavendish density data. (b) Construct a normal probability plot of the data. Comment on the plot. Does there seem to be a “low” outlier in the data? (c) Would the sample median be a better estimate of the density of the earth than the sample mean? Why? 6-123. In their book Introduction to Time Series Analysis and Forecasting (Wiley, 2008), Montgomery, Jennings, and Kulahci presented the data on the drowning rate for children between one and four years old per 100,000 of population in Arizona from 1970 to 2004. The data are: 19.9, 16.1, 19.5, 19.8, 21.3, 15.0, 15.5, 16.4, 18.2, 15.3, 15.6, 19.5, 14.0, 13.1, 10.5, 11.5, 12.9, 8.4, 9.2, 11.9, 5.8, 8.5, 7.1, 7.9, 8.0, 9.9, 8.5, 9.1, 9.7, 6.2, 7.2, 8.7, 5.8, 5.7, and 5.2. (a) Perform an appropriate graphical analysis of the data. (b) Calculate and interpret the appropriate numerical summaries. (c) Notice that the rate appears to decrease dramatically starting about 1990. Discuss some potential reasons explaining why this could have happened. (d) If there has been a real change in the drowning rate beginning about 1990, what impact does this have on the summary statistics that you calculated in part (b)? 6-124. Patients arriving at a hospital emergency department present a variety of symptoms and complaints. The following data were collected during one weekend night shift (11:00 p.m. to 7:00 a.m.): Chest pain Dificulty breathing Numbness in extremities Broken bones Abrasions Cuts Stab wounds Gunshot wounds Blunt force trauma Fainting, loss of consciousness Other 8 7 3 11 16 21 9 4 10 5 9 (a) Calculate numerical summaries of these data. What practical interpretation can you give to these summaries? (b) Suppose that you knew that a certain fraction of these patients leave without treatment (LWOT). This is an important problem because these patients may be seriously ill or injured. Discuss what additional data you would require to begin a study into the reasons why patients LWOT. 6-125. One of the authors (DCM) has a Mercedes-Benz 500 SL Roadster. It is a 2003 model and has fairly low mileage (currently 45,324 miles on the odometer). He is interested in learning how his car’s mileage compares with the mileage on similar SLs. Table 6E.15 contains the mileage on 100 MercedesBenz SLs from the model years 2003−2009 taken from the Cars.com website. (a) Calculate the sample mean and standard deviation of the odometer readings. (b) Construct a histogram of the odometer readings and comment on the shape of the data distribution. (c) Construct a stem-and-leaf diagram of the odometer readings. (d) What is the percentile of DCM’s mileage? 6-126. The energy consumption for 90 gas-heated homes during a winter heating season is given in Table 6E.16. The variable reported is BTU/number of heating degree days. (a) Calculate the sample mean and standard deviation of energy usage. (b) Construct a histogram of the energy usage data and comment on the shape of the data distribution. Section 6-7/Probability Plots (c) Construct a stem-and-leaf diagram of energy usage. (d) What proportion of the energy usage data is above the average usage plus 2 standard deviations? 6-127. The force needed to remove the cap from a medicine bottle is an important feature of the product because requiring too much force may cause dificulty for elderly patients or patients with arthritis or similar conditions. Table 6E.17 presents the results of testing a sample of 68 caps attached to bottles for the force (in pounds) required for removing the cap. (a) Construct a stem-and-leaf diagram of the force data. (b) What are the average and the standard deviation of the force? (c) Construct a normal probability plot of the data and comment on the plot. (d) If the upper speciication on required force is 30 pounds, what proportion of the caps do not meet this requirement? (e) What proportion of the caps exceeds the average force plus 2 standard deviations? (f) Suppose that the irst 36 observations in the table come from one machine and the remaining come from a second machine (read across the rows and the down). Does there seem to be a possible difference in the two machines? Construct an appropriate graphical display of the data as part of your answer. (g) Plot the irst 36 observations in the table on a normal probability plot and the remaining observations on another normal probability plot. Compare the results with the single normal probability plot that you constructed for all of the data in part (c). 6-128. Consider the global mean surface air temperature anomaly and the global CO2 concentration data originally shown in Table 6E.5. (a) Construct a scatter plot of the global mean surface air temperature anomaly versus the global CO2 concentration Comment on the plot. (b) What is the simple correlation coeficient between these two variables? 5"#-&t6E.15 Odometer Readings on 100 Mercedes-Benz SL500 Automobiles, Model Years 2003−2009 2020 7893 19,000 15,984 32,271 38,277 32,524 15,972 63,249 60,583 8905 10,327 31,668 16,903 36,889 72,272 38,139 41,218 58,526 83,500 1698 37,687 33,512 37,789 21,564 3800 62,940 43,382 66,325 56,314 17,971 15,000 28,522 28,958 31,000 21,218 51,326 15,879 49,489 67,072 6207 4166 5824 40,944 42,915 29,250 54,126 13,500 32,800 62,500 22,643 9056 18,327 18,498 19,377 48,648 4100 77,809 67,000 47,603 4977 19,842 31,845 40,057 19,634 29,216 45,540 25,708 60,499 51,936 17,656 15,598 30,015 15,272 26,313 44,944 26,235 29,000 63,260 65,195 8940 33,745 2171 28,968 43,049 49,125 46,505 58,006 60,449 64,473 11,508 22,168 36,161 30,487 30,396 33,065 34,420 51,071 27,422 85,475 10.04 13.96 12.71 10.64 9.96 6.72 13.47 12.19 6.35 12.62 6.80 6.78 6.62 10.30 10.21 8.47 5.56 9.83 7.62 4.00 9.82 5.20 16.06 8.61 11.70 9.76 12.16 5"#-&t6E.16 Energy Usage in BTU/Number of Heating Degree Days 7.87 11.12 8.58 12.91 12.28 14.24 11.62 7.73 7.15 9.43 13.43 8.00 10.35 7.23 11.43 11.21 8.37 12.69 7.16 9.07 5.98 9.60 2.97 10.28 10.95 7.29 13.38 8.67 6.94 15.24 9.58 8.81 13.60 7.62 10.49 13.11 12.31 10.28 8.54 9.83 9.27 5.94 10.40 8.69 10.50 9.84 9.37 11.09 9.52 11.29 10.36 12.92 8.26 14.35 16.90 7.93 11.70 18.26 8.29 6.85 15.12 7.69 13.42 TABLE t6E.17 Force to Remove Bottle Caps 14 17 25 17 24 31 19 17 18 22 15 15 27 34 21 17 27 16 16 17 17 32 14 21 24 16 15 20 32 24 14 34 24 18 15 17 31 16 19 24 28 30 19 20 27 37 15 237 22 16 19 15 21 36 30 21 14 10 17 21 34 24 16 15 22 20 26 20 15 238 Chapter 6/Descriptive Statistics Mind-Expanding Exercises 6-129. Consider the airfoil data in Exercise 6-18. Subtract 30 from each value and then multiply the resulting quantities by 10. Now compute s 2 for the new data. How is this quantity related to s 2 for the original data? Explain why. 2 n 6-130. Consider the quantity ∑ i = 1 ( xi − a ) . For what value of a is this quantity minimized? 6-131. Using the results of Exercise 6-130, which of the 2 2 n n two quantities ∑ i = 1 ( xi − x ) and ∑ i = 1 ( xi − μ ) will be smaller, provided that x ≠ μ? 6-132. Coding the Data. Let yi = a + bxi , i = 1, 2, … , n, where a and b are nonzero constants. Find the relationship between x and y, and between s x and s y . 6-133. A sample of temperature measurements in a furnace yielded a sample average (°F) of 835.00 and a sample standard deviation of 10.5. Using the results from Exercise 6-132, what are the sample average and sample standard deviations expressed in oC? 6-134. Consider the sample x1, x2 , … , xn with sample mean x and sample standard deviation s . Let zi = ( xi − x ) / s, i = 1, 2, … , n. What are the values of the sample mean and sample standard deviation of the zi ? 6-135. An experiment to investigate the survival time in hours of an electronic component consists of placing the parts in a test cell and running them for 100 hours under elevated temperature conditions. (This is called an “accelerated” life test.) Eight components were tested with the following resulting failure times: 6-136. Suppose that you have a sample x1, x2 , … , xn and have calculated xn and sn2 for the sample. Now an (n + 1) st observation becomes available. Let xn + 1 and s2n + 1 be the sample mean and sample variance for the sample using all n + 1 observations. (a) Show how xn + 1 can be computed using xn and xn + 1 . n ( x n +1 − x n ) n +1 (c) Use the results of parts (a) and (b) to calculate the new sample average and standard deviation for the data of Exercise 6-38, when the new observation is x38 = 64. (b) Show that nsn2+1 = ( n − 1) sn2 + 2 75, 63, 100 + , 36, 51, 45, 80, 90 6-137. Trimmed Mean. Suppose that the data are arranged in increasing order, T% of the observations are removed from each end, and the sample mean of the remaining numbers is calculated. The resulting quantity is called a trimmed mean, which generally lies between the sample mean x and the sample median x. Why? The trimmed mean with a moderate trimming percentage (5% to 20%) is a reasonably good estimate of the middle or center. It is not as sensitive to outliers as the mean but is more sensitive than the median. (a) Calculate the 10% trimmed mean for the yield data in Exercise 6-33. (b) Calculate the 20% trimmed mean for the yield data in Exercise 6-33 and compare it with the quantity found in part (a). (c) Compare the values calculated in parts (a) and (b) with the sample mean and median for the yield data. Is there much difference in these quantities? Why? The observation 100 + indicates that the unit still functioned at 100 hours. Is there any meaningful measure of location that can be calculated for these data? What is its numerical value? 6-138. Trimmed Mean. Suppose that the sample size n is such that the quantity nT / 100 is not an integer. Develop a procedure for obtaining a trimmed mean in this case. Important Terms and Concepts Box plot Degrees of freedom Frequency distribution and histogram Histogram Interquartile range Matrix of scatter plots Quartiles, and percentiles Multivariate data Normal probability plot Outlier Pareto chart Percentile Population mean Population standard deviation Population variance Probability plot Relative frequency distribution Sample correlation coeficient Sample mean Sample median Sample mode Sample range Sample standard deviation Sample variance Scatter diagram Stem-and-leaf diagram Time series

Descriptive Statistics: Data Summarization Techniques

Related documents

Products

Support

Descriptive Statistics: Data Summarization Techniques

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib