Statistics 2014, Fall 2001

Chapter 2: Organization and Description of Data

When data are in their original form, as collected, they are called raw data. The first task to be done with raw data is clean-up. This is always done. The data must be double-checked to see that they were collected accurately. Any unusual data values should be followed up to see whether they resulted from errors in data collection or from unusual members of the sample. When the data are entered into a calculator or spreadsheet, they should be double-checked to see that they were entered correctly.

After the clean-up procedure, the next task is to describe the data. There are two kinds of methods for summarizing and describing data: graphical techniques and numerical summaries. We will discuss some graphical techniques first.

With non-numeric data, we often want a graph which is a variation on the histogram, called a Pareto chart. This type of graph is useful in quality control and process improvement studies, in which the data often represent the different types of defects or failure modes. A Pareto chart graphs the frequencies of occurrence of the different types of defects, ordered from the most frequent to the least frequent. The purpose of a Pareto chart is to focus attention on the main causes or modes of failure.

Example: We have data, listed below, on the number of accidents between 1959 and 1999 for each of a number of different types of aircraft, as well as the number of accidents per million flights.

Aircraft type     Actual no. of hull losses   Hull losses/million departures
MD-11                       5                          6.54
707/720                   115                          6.46
DC-8                       71                          5.84
F-28                       32                          3.94
BAC 1-11                   22                          2.64
DC-10                      20                          2.57
747-Early                  21                          1.90
A310                        4                          1.40
A300-600                    3                          1.34
DC-9                       75                          1.29
A300-Early                  7                          1.29
737-1 & 2                  62                          1.23
727                        70                          0.97
A310/319/321                7                          0.96
F100                        3                          0.80
L1011                       4                          0.77
BAe 146                     3                          0.59
747-400                     1                          0.49
757                         4                          0.46
MD-80/90                   10                          0.43
767                         3                          0.41
737-3, 4 & 5               12                          0.39

The Pareto chart is shown below.
To construct the graph using Excel, we enter the data, with the categories listed in the first column, and the frequencies or relative frequencies listed in the second column. Highlight the data, and choose Insert, Chart, Column.

[Figure: Pareto chart, "Aircraft Accident Rates, 1959 - 1999". Horizontal axis: Type of Aircraft. Vertical axis: Number of Accidents per Million Flights, scaled from 0 to 7.]

In this case, of the 22 types of aircraft, we see that the MD-11 had the highest accident rate, followed by the Boeing 707/720 and the DC-8. The latter two are no longer in service in most of the world. The years of service of the MD-11 were 1990 to 1999.

Frequency Distributions and Histograms

For numeric data, there are a number of different graphical techniques available. The author presents several, including the dot-plot. We will not include the dot-plot, as other types of graphs, such as histograms and stem-and-leaf plots, are equally useful.

Often, with univariate data (resulting from a single measured characteristic of a sample), there are too many different data values for a listing of the raw data to be useful in visualizing the characteristics of the data. It is common to divide the interval of values of the data into a relatively small number of subintervals, called classes, and to tabulate the data using the frequencies. Each frequency is the number of occurrences of data values within a subinterval.

We sometimes also want to use relative frequencies. The relative frequency for a class is found by dividing the frequency for that class by the size of the entire data set.

Defn: A histogram is a graph that displays numeric data by using vertical bars of various heights to represent the frequencies of occurrence of data values within each class.

Characteristics of a histogram:
1) The classes are listed in order along the horizontal axis.
2) The vertical axis provides a scale for the frequencies.
3) A bar is drawn for each class having width equal to the class width and height equal to the class frequency.
4) The axes are labeled and the graph is titled.

Note: The number of classes, or subintervals, depends on the size of the data set. A good rule of thumb is to choose 5 classes for a small data set (n = 25) and 20 classes for a large data set (n = 1000).

Note: The class width is found by dividing the range of the data by the number of classes and rounding up slightly, so that the largest data value will be included in the last class. The class limits are the uppermost and lowermost data values that could be included in the class (note that there may be no actual data values equal to the upper or lower class limit for any given class).

Since we may do the histogram with the calculator or with Excel, we do the histogram first, followed by the grouped frequency distribution.

Example: Compressive strength, in pounds per square inch (psi), of specimens of a new aluminum-lithium alloy undergoing evaluation for possible use in aircraft structural components. The data are listed in the following table.

105  167  160   76  199  150  221  141  208  167
151  135  183  245  158  184  142  196  186  228
133  135  163  201  121  174  207  229  145  200
181  199  180  146  171  176  180  181  190  218
148  150  143  158  193  157  158  170   97  176
194  101  160  118  154  110  133  171  175  149
153  163  156  165  149  174  131  123  172   87
120  154  134  158  160  168  115  178  169  237

We will construct a histogram for the data using Excel (instructions for constructing a histogram using the TI-83/TI-84 are included in the calculator handout). We have a data set with n = 80. We will choose to use 7 classes. The range is 245 - 76 = 169. Therefore the class width will be

$\text{Class width} = \frac{\text{Range}}{\text{No. of classes}} = \frac{169}{7} = 24.142857$, rounded up to 24.2.

The lower limit of the first class will be the smallest data value, 76 (the author sometimes chooses a different value for the lower class limit of the first class).
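The class-width arithmetic above, and the tallying of class frequencies that follows from it, can be sketched in Python. This is a minimal illustration, not part of the course materials (which use Excel and the TI-83/TI-84). The names `data`, `class_width`, `upper_limits`, and `frequencies` are chosen here for illustration, and each value is placed in the first class whose upper limit it does not exceed, which is assumed to match the way the Excel histogram tool treats its bins.

```python
import math
from bisect import bisect_left

# Compressive strengths (psi) of the 80 aluminum-lithium specimens.
data = [
    105, 167, 160, 76, 199, 150, 221, 141, 208, 167,
    151, 135, 183, 245, 158, 184, 142, 196, 186, 228,
    133, 135, 163, 201, 121, 174, 207, 229, 145, 200,
    181, 199, 180, 146, 171, 176, 180, 181, 190, 218,
    148, 150, 143, 158, 193, 157, 158, 170, 97, 176,
    194, 101, 160, 118, 154, 110, 133, 171, 175, 149,
    153, 163, 156, 165, 149, 174, 131, 123, 172, 87,
    120, 154, 134, 158, 160, 168, 115, 178, 169, 237,
]

n_classes = 7
data_range = max(data) - min(data)  # 245 - 76 = 169

# Round the raw width up to one decimal place so that the largest
# data value falls inside the last class.
class_width = math.ceil(data_range / n_classes * 10) / 10

# Upper limits of every class except the last one.
upper_limits = [round(min(data) + class_width * k, 1) for k in range(1, n_classes)]

# Tally: a value goes into the first class whose upper limit is at least
# as large as the value; values above the last listed limit go into the
# final class.
frequencies = [0] * n_classes
for x in data:
    frequencies[bisect_left(upper_limits, x)] += 1

relative_frequencies = [f / len(data) for f in frequencies]

print("class width:", class_width)
print("upper limits:", upper_limits)
for k, (f, rf) in enumerate(zip(frequencies, relative_frequencies), start=1):
    print(f"class {k}: frequency {f}, relative frequency {rf:.4f}")
```

Because every data value is a whole number and every class limit has one decimal place, no value falls exactly on a class boundary, so the choice of boundary convention does not affect the counts for this data set.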
To construct the histogram in Excel:
1) Enter the data.
2) Enter a second column giving the upper class limits for all classes except the last class: 100.2, 124.4, 148.6, 172.8, 197.0, 221.2.
3) Choose Tools, Data Analysis, Histogram.
4) The input range will be a1..a80.
5) The bin range will be b1..b6.
6) The output range will be c1. The type of output will be chart output.

Below is the resulting histogram, followed by the grouped frequency table, constructed using the information from the histogram (relative frequencies are included in the table).

[Figure: Histogram. Horizontal axis: Compression Strength (psi), with bins labeled 100.2, 124.4, 148.6, 172.8, 197, 221.2, More. Vertical axis: Frequency, scaled from 0 to 30.]

Class (psi)       Frequency   Relative Frequency
 76.0 - 100.2         3        0.0375 =  3.75%
100.3 - 124.4         8        0.1000 = 10.00%
124.5 - 148.6        12        0.1500 = 15.00%
148.7 - 172.8        28        0.3500 = 35.00%
172.9 - 197.0        17        0.2125 = 21.25%
197.1 - 221.2         8        0.1000 = 10.00%
221.3 - 245.4         4        0.0500 =  5.00%

Looking at a histogram of a data set can sometimes provide a quick way of answering questions about the data, by simply noting the characteristics of the graph.

Example 1: p. 18
It is immediately apparent from the graph that there are two superimposed distributions, perhaps due to two different operating processes.

Example 2: p. 19
It is immediately obvious from the histogram that most of the inter-request times are relatively small, with only a few very large times.

Sometimes we want to do a relative frequency histogram of a data set (sometimes called a density histogram, for reasons to be covered in Chapter 6).

Example: pp. 19 - 20
The density histogram shows an approximately symmetric, bell-shaped distribution for the compressive strengths.

A simple graphical display of relatively small data sets may be done with a stem-and-leaf plot. (Excel does not do this kind of plot.)

Example: The original aluminum-lithium alloy compressive-strength data set.
 7 | 6
 8 | 7
 9 | 7
10 | 1 5
11 | 0 5 8
12 | 0 1 3
13 | 1 3 3 4 5 5
14 | 1 2 3 5 6 8 9 9
15 | 0 0 1 3 4 4 6 7 8 8 8 8
16 | 0 0 0 3 3 5 7 7 8 9
17 | 0 1 1 2 4 4 5 6 6 8
18 | 0 0 1 1 3 4 6
19 | 0 3 4 6 9 9
20 | 0 1 7 8
21 | 8
22 | 1 8 9
23 | 7
24 | 5

It is clear from the graph that the distribution of compressive strengths is approximately bell-shaped, centered at a value of approximately 160 psi.

Numerical Descriptive Measures

One type of numerical summary describes, in some sense, the location of the center of a data set. There are several measures of central tendency, the most important of which is the mean.

Defn: For a variable X measured for every member of a finite population of size N, yielding a set of values x1, x2, …, xN, the mean, or average, is given by

$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$.

For a sample of size n chosen from the population, yielding a set of values x1, x2, …, xn, the sample mean, or average, is given by

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.

Sometimes, the sample mean is not the most useful measure of central tendency. For example, sometimes a data set has some extreme values (either very large or very small). These extreme values are called outliers (more on this topic later). The value of the sample mean may be strongly affected by these outliers. In such a case, a more useful measure of central tendency may be the sample median.

Defn: The sample median, $\tilde{x}$, is the center of the data set when the data are ordered from smallest to largest. If n is odd, then the median is the middle item of data. If n is even, then the median is the average of the two middle items of data.

The median is not usually affected by outliers (Example on page 26).

Example: In the original compression strength data set, n = 80, so the median is the average of the 40th and 41st ordered values:

$\tilde{x} = \frac{160 + 163}{2} = 161.5$ psi.

In addition to locating the center of the data set, we want to describe the dispersion of the data values. The simplest, although least useful, measure of dispersion is the range of the data set.
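As a quick check of the measures introduced so far, the sample mean, the sample median, and the range of the compression strength data can be computed with a short Python sketch. This is an illustration only, not the course's calculator or Excel procedure; the variable names are chosen here, and the standard-library `statistics` module is used solely as a cross-check of the hand-coded median rule.

```python
import statistics

# Compressive strengths (psi) of the 80 aluminum-lithium specimens.
data = [
    105, 167, 160, 76, 199, 150, 221, 141, 208, 167,
    151, 135, 183, 245, 158, 184, 142, 196, 186, 228,
    133, 135, 163, 201, 121, 174, 207, 229, 145, 200,
    181, 199, 180, 146, 171, 176, 180, 181, 190, 218,
    148, 150, 143, 158, 193, 157, 158, 170, 97, 176,
    194, 101, 160, 118, 154, 110, 133, 171, 175, 149,
    153, 163, 156, 165, 149, 174, 131, 123, 172, 87,
    120, 154, 134, 158, 160, 168, 115, 178, 169, 237,
]

n = len(data)

# Sample mean: the sum of the values divided by the sample size.
sample_mean = sum(data) / n

# Sample median: middle value if n is odd, otherwise the average of
# the two middle values of the ordered data.
ordered = sorted(data)
if n % 2 == 1:
    sample_median = ordered[n // 2]
else:
    sample_median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

# Range: largest value minus smallest value.
data_range = ordered[-1] - ordered[0]

print("n =", n)
print("mean =", sample_mean, "psi")
print("median =", sample_median, "psi")
print("range =", data_range, "psi")

# Cross-check against the standard library.
print("statistics.median agrees:", statistics.median(data) == sample_median)
```

For this data set the mean and the median are close to each other, which is consistent with the roughly symmetric shape seen in the stem-and-leaf plot.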
Defn: The range of a data set is the difference between the largest and smallest values of the data; the range is a simple measure of the dispersion of the data.

Example: For the compression strength data, Range = 245 psi - 76 psi = 169 psi.

The range cannot distinguish between the dispersion of two data sets that have the same largest and smallest values, even though the values in between may be quite different from one data set to the other. For this reason, we need a measure of dispersion that takes into consideration the location of each data value relative to the center of the data set.

Consider a data set with data values $x_1, x_2, x_3, \ldots, x_n$. For each data value $x_i$, we define the deviation from the mean as $x_i - \bar{x}$. This value gives the (directed) distance of the ith data value from the mean of the sample data. We may consider using the sum of all of these deviations as our measure of dispersion. However, it would be useless to do so, because the positive and negative deviations always cancel, as you will show in Exercise 2.50. Instead, we define two other measures of dispersion, the variance and the standard deviation.

Defn: For a variable X measured for every member of a finite population of size N, yielding a set of values x1, x2, …, xN, the variance of the data is given by

$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$,

and the standard deviation is given by $\sigma$. For a sample of size n chosen from the population, yielding a set of values x1, x2, …, xn, the sample variance is given by

$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$,

and the sample standard deviation is s.

Note: In the above definitions, $\mu$ and $\sigma$ are parameters; these two quantities have fixed but usually unknown values. The two quantities $\bar{x}$ and s are statistics; the values of these two quantities depend on the particular sample chosen from the population.

If all of the data values in a data set are the same, then the variance and standard deviation are both 0.
If there are any differences among the data values, then both the variance and standard deviation are positive; the greater the differences among the data values, the greater the values of the variance and standard deviation.

Note: While the defining formulae for the population mean and the sample mean have the same form, the defining formulae for the population variance and the sample variance differ. For the population, the variance is the mean of the squared deviations of the data values from the mean value. For the sample, the variance is almost the mean of the squared deviations of the data values from the mean value. Instead of dividing the sum of squared deviations by the sample size, we divide by n - 1. The reason for doing so is that we want the sample variance to be a good estimator of the population variance. A better estimator is given by dividing by n - 1, rather than by n. Statistically, we say that there are n - 1 degrees of freedom associated with the sample variance.

Note: If we select a random sample of size n from a population or distribution, we start out with n quantities which are free to vary, so that we have n degrees of freedom. Each time we use the data to estimate a parameter (such as using the sample mean to estimate the population mean), we use up one degree of freedom. Thus, we have only n - 1 degrees of freedom associated with the sample variance.

Note: Another, and often simpler, way to calculate the variance is to use the following fact:

$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i^2 - 2\bar{x}x_i + \bar{x}^2\right)$
$= \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{2}{n}\left(\sum_{i=1}^{n} x_i\right)^2 + \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
$= \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$.

Example: Compressive strength, in pounds per square inch (psi), of specimens of a new aluminum-lithium alloy undergoing evaluation for possible use in aircraft structural components.
The data are listed in the following table.

105  167  160   76  199  150  221  141  208  167
151  135  183  245  158  184  142  196  186  228
133  135  163  201  121  174  207  229  145  200
181  199  180  146  171  176  180  181  190  218
148  150  143  158  193  157  158  170   97  176
194  101  160  118  154  110  133  171  175  149
153  163  156  165  149  174  131  123  172   87
120  154  134  158  160  168  115  178  169  237

The sum of the data values is $\sum_{i=1}^{80} x_i = 13013$ psi. The sum of the squared data values is $\sum_{i=1}^{80} x_i^2 = 2206837$ psi$^2$. Hence, the sample mean is

$\bar{x} = \frac{13013}{80} = 162.6625$ psi;

the sample variance is

$s^2 = \frac{1}{79}\left[2206837 - \frac{(13013)^2}{80}\right] = 1140.6315$ psi$^2$.

The sample standard deviation is then $s = \sqrt{1140.6315} = 33.7732$ psi.

The above example illustrates the usefulness of the standard deviation as a measure of variation: the data have units of psi, the variance has units of psi$^2$, and the standard deviation has the same units of measurement as the data.

As an example of the uses of the sample statistics, let us find the fraction of the compression strength data that lie within two standard deviations on either side of the mean. We have

$\bar{x} - 2s = 162.6625 - (2)(33.7732) = 95.1161$ psi, and
$\bar{x} + 2s = 162.6625 + (2)(33.7732) = 230.2089$ psi.

From the stem-and-leaf plot, we see that there are two data values below 95.1161 psi, and two above 230.2089 psi. Hence, the fraction of the data set that lies within two standard deviations on either side of the mean is

$(100)\left(\frac{76}{80}\right) = 95\%$.

(Hint: Remember this number.)

Coefficient of Variation

If we want to compare the variations of several sets of data, we often use a quantity called the coefficient of variation, which scales each data set according to its mean value and measures relative variation. (Note that the coefficient of variation is used only with data sets for which the values of the variables are non-negative.)

Defn: If a set of non-negative numeric data has sample mean $\bar{x}$ and sample standard deviation s, then the coefficient of variation for the data is

$V = \frac{s}{\bar{x}}$.

Example: We want to compare the precision of two micrometers.
Measurements are made on the diameters of ball bearings using micrometer A. The mean and standard deviation are found to be $\bar{x}_A = 3.92$ mm, $s_A = 0.0152$ mm. For micrometer B, measurements are made of the length of an unstretched spring. The mean and standard deviation are found to be $\bar{x}_B = 1.54$ in, $s_B = 0.0086$ in.

The coefficient of variation for micrometer A is

$V_A = \frac{s_A}{\bar{x}_A} = \frac{0.0152}{3.92} = 0.39\%$.

The coefficient of variation for micrometer B is

$V_B = \frac{s_B}{\bar{x}_B} = \frac{0.0086}{1.54} = 0.56\%$.

Since the coefficient of variation is larger for micrometer B than for micrometer A, we conclude that B is less precise than A.

Quartiles and Percentiles

Defn: The first quartile, Q1, of a data set is a number such that 25% of the data values are no greater than that number and 75% of the data values are no less than that number. The third quartile, Q3, of a data set is a number such that 75% of the data values are no greater than that number and 25% of the data values are no less than that number.

Example: For the aluminum-lithium alloy compression strength data,

$Q_1 = \frac{143 + 145}{2} = 144$ psi, and $Q_3 = \frac{181 + 181}{2} = 181$ psi.

25% of the specimens had compressive strengths no greater than 144 psi, and 75% of the specimens had compressive strengths no greater than 181 psi.

Defn: The interquartile range, IQR, is the difference between the third and first quartiles. IQR is a measure of spread of the data set.

Example: For the original compression strength data, IQR = 181 psi - 144 psi = 37 psi.

Defn: The 100pth percentile of a data set is a number such that 100p% of the data values are no greater than that number and 100(1 - p)% of the data values are no less than that number.

For small data sets, it is generally easy to read a percentile from an ordered stem-and-leaf plot. For larger data sets, percentiles must be calculated.

Steps in calculating the 100pth percentile for a numeric data set:
1. Re-order the data values from smallest to largest.
2. Determine the value of the product np, where n is the size of the data set.
3. If np is not an integer, round it up to the next integer. Count up to that position in the listed data to find the 100pth percentile. If np is an integer, count up to the npth position in the listed data, and calculate the average of that data value and the next higher data value.

Example: For the aluminum-lithium alloy compression strength data, the 35th percentile is a number such that 35% of the data values, or 28 values, are no greater than that number. From the stem-and-leaf plot, we see that the 35th percentile is 152. Thirty-five percent of the specimens in the sample have compression strengths no greater than 152 psi.

Alternatively:
1. The data presented in the stem-and-leaf plot are already ordered.
2. np = (80)(0.35) = 28. This is an integer, so we average the 28th and the 29th data values, obtaining

$\frac{151 + 153}{2} = 152$ = 35th percentile.

Boxplots

Defn: The five-number summary of a data set consists of the minimum value, the first quartile, the median, the third quartile, and the maximum value.

Example: For the aluminum-lithium alloy compression strength data, minX = 76 psi, Q1 = 144 psi, $\tilde{x} = 161.5$ psi, Q3 = 181 psi, and maxX = 245 psi.

Defn: A boxplot is a graphical representation of a numeric data set using the 5-number summary. The data values between the first and third quartiles are represented by a box, with a vertical line at the median value. The data values between minX and the first quartile are represented by a line drawn from one end of the box; the data values between the third quartile and maxX are represented by a line drawn from the other end of the box.

Note: Excel does not do boxplots. Excel can help in constructing boxplots by providing the 5-number summary for the data, using the Rank and Percentile function under Data Analysis.

Example: For the compression strength data, the boxplot is shown below. To find the 5-number summary with Excel, we enter the data, and use Tools, Data Analysis, Rank and Percentile.
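The percentile steps and the five-number summary can also be sketched in Python. This is an illustration under stated assumptions, not the Excel procedure used in the course: the function name `percentile` and its whole-number argument `pct` (with 0 < pct < 100) are choices made here, so that the product np can be tested exactly for being an integer without floating-point round-off.

```python
# Compressive strengths (psi) of the 80 aluminum-lithium specimens.
data = [
    105, 167, 160, 76, 199, 150, 221, 141, 208, 167,
    151, 135, 183, 245, 158, 184, 142, 196, 186, 228,
    133, 135, 163, 201, 121, 174, 207, 229, 145, 200,
    181, 199, 180, 146, 171, 176, 180, 181, 190, 218,
    148, 150, 143, 158, 193, 157, 158, 170, 97, 176,
    194, 101, 160, 118, 154, 110, 133, 171, 175, 149,
    153, 163, 156, 165, 149, 174, 131, 123, 172, 87,
    120, 154, 134, 158, 160, 168, 115, 178, 169, 237,
]


def percentile(values, pct):
    """Return the pct-th percentile using the three steps in these notes.

    pct is a whole number with 0 < pct < 100 (hypothetical interface).
    """
    ordered = sorted(values)                      # step 1: order the data
    whole, remainder = divmod(len(ordered) * pct, 100)  # step 2: np = whole + remainder/100
    if remainder != 0:
        # np is not an integer: round up to position whole + 1
        # (1-based), which is index `whole` in a 0-based list.
        return ordered[whole]
    # np is an integer: average the np-th value and the next higher one.
    return (ordered[whole - 1] + ordered[whole]) / 2


q1 = percentile(data, 25)
median = percentile(data, 50)
q3 = percentile(data, 75)
five_number_summary = (min(data), q1, median, q3, max(data))
iqr = q3 - q1

print("35th percentile:", percentile(data, 35))
print("five-number summary:", five_number_summary)
print("IQR:", iqr)
```

Other software may use a different interpolation rule for percentiles, so results from a spreadsheet or calculator can differ slightly from the rule sketched here.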
[Figure: Boxplot of the compression strength data. Horizontal axis: Compression Strength (psi), with tick marks running from 75 to 255.]

If the median line is approximately in the center of the box, and if the two whiskers are of approximately equal length, then the data distribution is symmetric.

Defn: An outlier is an observation whose value is quite different from the values of most of the observations in the data set.

Note: When outliers are encountered, they should be investigated. They may result from mistakes in data collection or in data entry. Or they may result from unusual members of the sample.

Note: Practically speaking, an outlier is an observation whose value is either at least 1.5 IQR's below Q1, or at least 1.5 IQR's above Q3. An extreme outlier is an observation whose value is either at least 3 IQR's below Q1, or at least 3 IQR's above Q3.

Example: A boxplot of the compression strength data, with outliers indicated, is shown below:

[Figure: Boxplot of the compression strength data with outliers marked by asterisks, two below the lower whisker and two above the upper whisker. Horizontal axis: Compression Strength (psi), with tick marks running from 75 to 255.]

Side-by-side boxplots are often useful in comparing the central tendencies and variabilities of several data sets, as in the results of scientific experiments.

Example: pp. 32-33.
From examination of the side-by-side boxplots, we see that the quality index is most variable for Plant 2, is lowest (on average) for Plant 4, and is highest (on average) for Plant 3.

Time Series Plots

Often, in a manufacturing situation, we are interested in the development of the value of a variable over time. The other graphs we have discussed examine data collected at a single point in time.

A time series is an ordered sequence of observations. Usually the ordering is over time, although it may also be over some spatial dimension.
The key point here is that successive observations are dependent, or correlated with each other. This is what makes time series data different from the other types of data we have looked at.

In time series analysis, we are looking for two types of characteristics in the data: trends and cycles. The following two graphs show the two types of characteristics.

Example 1: The first graph is of data on U.S. yearly lumber production for the period 1921 to 1982. There are some long-term trends evident, but no obvious cyclical patterns.

Steps in constructing a time series plot using Excel:
1) Enter the data.
2) Choose Insert, Chart, Line Graph.
3) Input range is the range of the data.

[Figure: Time series plot, "U.S. Lumber Production, 1921 - 1982". Horizontal axis: Time Point. Vertical axis: Production (Billions of Board Ft.), scaled from 0 to 45.]

Example 2: The next time series plot shows distinct periodicity. This is a plot of the yearly count of sunspots over the period 1770 to 1869.

[Figure: Time series plot, "Sunspot Cycle, 1770 - 1869". Horizontal axis: Time Point. Vertical axis: Number of Sunspots, scaled from 0 to 180.]

Example 3: p. 33
We see that for the measurement instrument, the measurements of material thickness display a decreasing trend over time. The instrument is not being consistent in its measurements.
