Chapter 2 – Organization and Description of Data
When data are in their original form, as collected, they are called
raw data. The first task with raw data is always clean-up. The data
must be double-checked to see that they were collected accurately.
Any unusual data values should be followed up to see whether they
resulted from errors in data collection or from unusual members of
the sample. When the data are entered into a calculator or
spreadsheet, they should be double-checked to see that they were
entered correctly.
After the clean-up procedure, the next task is to describe the data.
There are two kinds of methods for summarizing and describing data –
graphical techniques and numerical summaries. We will discuss
some graphical techniques first.
With non-numeric data, we often want a graph which is a variation
on the histogram, called a Pareto chart. This type of graph is useful
in quality control and process improvement studies, in which the
data often represent the different types of defects or failure modes.
A Pareto chart graphs the frequencies of occurrences of the different
types of defects, ordered from the most frequent to the least
frequent. The purpose of a Pareto chart is to focus on the main
causes or modes of failure.
Example: We have data, listed below, on number of accidents
between 1959 and 1999 for each of a number of different types of
aircraft, as well as the number of accidents per million flights.
Aircraft type     Actual no. of hull losses     Hull losses per million departures
MD-11                         5                              6.54
707/720                     115                              6.46
DC-8                         71                              5.84
F-28                         32                              3.94
BAC 1-11                     22                              2.64
DC-10                        20                              2.57
747-Early                    21                              1.90
A310                          4                              1.40
A300-600                      3                              1.34
DC-9                         75                              1.29
A300-Early                    7                              1.29
737-1 & 2                    62                              1.23
727                          70                              0.97
A310/319/321                  7                              0.96
F100                          3                              0.80
L1011                         4                              0.77
BAe 146                       3                              0.59
747-400                       1                              0.49
757                           4                              0.46
MD-80/90                     10                              0.43
767                           3                              0.41
737-3, 4 & 5                 12                              0.39
The Pareto chart is shown below. To construct the graph using
Excel, we enter the data, with the categories listed in the first
column, and the frequencies or relative frequencies listed in the
second column. Highlight the data, and choose Insert, Chart,
Column.
[Figure: Pareto chart, "Aircraft Accident Rates, 1959 - 1999".
Horizontal axis: Type of Aircraft. Vertical axis: Number of Accidents
per Million Flights.]
In this case, of the 22 types of aircraft, we see that the MD-11 had
the highest accident rate, followed by the Boeing 707/720 and the
DC-8. The latter two are no longer in service in most of the world.
The years of service of the MD-11 were 1990 – 1999.
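The ordering step behind a Pareto chart can also be sketched outside
Excel. The following Python fragment is only an illustration: the
name pareto_order and the text-bar scaling are our own choices, and
the rows are copied from the table above. It sorts the categories
from most frequent to least frequent, either by rate or by count.

```python
# Rows copied from the table: (type, hull losses, losses per million departures).
AIRCRAFT = [
    ("MD-11", 5, 6.54), ("707/720", 115, 6.46), ("DC-8", 71, 5.84),
    ("F-28", 32, 3.94), ("BAC 1-11", 22, 2.64), ("DC-10", 20, 2.57),
    ("747-Early", 21, 1.90), ("A310", 4, 1.40), ("A300-600", 3, 1.34),
    ("DC-9", 75, 1.29), ("A300-Early", 7, 1.29), ("737-1 & 2", 62, 1.23),
    ("727", 70, 0.97), ("A310/319/321", 7, 0.96), ("F100", 3, 0.80),
    ("L1011", 4, 0.77), ("BAe 146", 3, 0.59), ("747-400", 1, 0.49),
    ("757", 4, 0.46), ("MD-80/90", 10, 0.43), ("767", 3, 0.41),
    ("737-3, 4 & 5", 12, 0.39),
]


def pareto_order(rows, column):
    """Sort rows from largest to smallest on the chosen column (hypothetical helper)."""
    return sorted(rows, key=lambda row: row[column], reverse=True)


by_rate = pareto_order(AIRCRAFT, 2)    # ordering used in the chart
by_count = pareto_order(AIRCRAFT, 1)   # ordering by raw number of hull losses

# Crude text version of the chart: one '#' per 0.2 losses per million departures.
for name, _, rate in by_rate:
    print(f"{name:<14}{rate:5.2f}  {'#' * round(rate * 5)}")
```

Ordering by the raw counts instead of the rates puts the 707/720,
DC-9, and DC-8 first, which shows why the choice of frequency measure
matters when reading a Pareto chart.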
Frequency Distributions and Histograms
For numeric data, there are a number of different graphical
techniques available. The author presents several, including the
dot-plot. We will not include the dot-plot, as other types of graphs,
such as histograms, are equally useful.
Often, with univariate data (resulting from a single measured
characteristic of a sample), there are too many different data values
for a listing of the raw data to be useful in visualizing the
characteristics of the data. It is common to divide the interval of
values of the data into a relatively small number of subintervals,
called classes, and to tabulate the data using the frequencies. Each
frequency is the number of occurrences of data values within a
subinterval. We sometimes want also to use relative frequencies.
The relative frequency for a class is found by dividing the frequency
for that class by the size of the entire data set.
Defn: A histogram is a graph that displays numeric data by using
vertical bars of various heights to represent the frequencies of
occurrence of data values within a subinterval.
Characteristics of a histogram:
1) The classes are listed in order along the horizontal axis.
2) The vertical axis provides a scale for the frequencies.
3) A bar is drawn for each class having width equal to the class
width and height equal to the class frequency.
4) The axes are labeled and the graph is titled.
Note: The number of classes, or subintervals, depends on the size of
the data set. A good rule of thumb is to choose the number of
classes to be approximately equal to the square root of the size of the
data set. For example, if n = 25, then we would use 5 classes; if n =
80, then we would use 9 classes.
Note: The class width is found by dividing the range of the data by
the number of classes and rounding up slightly, so that the largest
data value will be included in the last class.
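The two rules of thumb above can be sketched in a few lines of
Python. This is only an illustration: the function name class_layout
and the choice to round the width up to one decimal place are our
assumptions, and the values 76 and 245 are the smallest and largest
values of the compressive strength data used later in this chapter.

```python
import math


def class_layout(n, smallest, largest, decimals=1):
    """Return (number of classes, class width) by the square-root rule.

    The width is rounded UP at the given number of decimals, so that the
    largest data value falls inside the last class.
    """
    num_classes = round(math.sqrt(n))
    raw_width = (largest - smallest) / num_classes
    scale = 10 ** decimals
    width = math.ceil(raw_width * scale) / scale
    return num_classes, width


print(class_layout(25, 0, 10)[0])   # number of classes for n = 25
print(class_layout(80, 76, 245))    # classes and width for n = 80
```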
The class limits are the uppermost and lowermost data values that
could be included in the class (note that there may be no actual data
values equal to the upper or lower class limit for any given class).
Since we may do the histogram with the calculator or with Excel, we
do the histogram first, followed by the grouped frequency
distribution.
Example: Compressive strength, in pounds per square inch (psi) of
specimens of a new aluminum-lithium alloy undergoing evaluation
for possible use in aircraft structural components. The data are
listed in the following table.
105  167  160   76  199  150  221  141  208  167
151  135  183  245  158  184  142  196  186  228
133  135  163  201  121  174  207  229  145  200
181  199  180  146  171  176  180  181  190  218
148  150  143  158  193  157  158  170   97  176
194  101  160  118  154  110  133  171  175  149
153  163  156  165  149  174  131  123  172   87
120  154  134  158  160  168  115  178  169  237
We will construct a histogram for the data using Excel. We have a
data set with n = 80. We will choose to use 9 classes. The range is
245 – 76 = 169. Therefore the class width will be
$$\text{Class width} = \frac{\text{Range}}{\text{No. of classes}} = \frac{169}{9} = 18.77778,$$
which we round up to 18.8.
The lower limit of the first class will be the smallest data value, 76
(the author sometimes chooses a different value for the lower class
limit of the first class). To construct the histogram in Excel:
1) Enter the data.
2) Enter a second column giving the upper class limits for all
classes except the last class – 94.8, 113.6, 132.4, 151.2, 170.0,
188.8, 207.6, 226.4.
3) Choose Tools, Data Analysis, Histogram.
4) The input range will be a1..a80. The bin range will be b1..b8.
5) The output range will be c1.
6) The type of output will be chart output.
Below is the resulting histogram, followed by the grouped frequency
table, constructed using the information from the histogram (In the
table, relative frequencies are included).
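[Figure: Excel histogram of the compressive strength data.
Horizontal axis: Bin (upper class limits 94.8, 113.6, 132.4, 151.2,
170, 188.8, 207.6, 226.4, More). Vertical axis: Frequency (0 to 25).]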
Class (psi)         Frequency     Relative Frequency
 76.0 –  94.8            2        0.0250 =  2.50%
 94.9 – 113.6            4        0.0500 =  5.00%
113.7 – 132.4            6        0.0750 =  7.50%
132.5 – 151.2           16        0.2000 = 20.00%
151.3 – 170.0           20        0.2500 = 25.00%
170.1 – 188.8           16        0.2000 = 20.00%
188.9 – 207.6            9        0.1125 = 11.25%
207.7 – 226.4            3        0.0375 =  3.75%
226.5 – 245.2            4        0.0500 =  5.00%
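The grouped frequency table can also be tabulated directly from the
raw data. The Python sketch below is one possible way to do it, not
the Excel procedure itself: the names DATA, UPPER_LIMITS, and
grouped_frequencies are ours, and we assume that a value equal to an
upper class limit belongs to that class.

```python
from bisect import bisect_left

# Compressive strength data (psi), copied from the table above.
DATA = [
    105, 167, 160, 76, 199, 150, 221, 141, 208, 167,
    151, 135, 183, 245, 158, 184, 142, 196, 186, 228,
    133, 135, 163, 201, 121, 174, 207, 229, 145, 200,
    181, 199, 180, 146, 171, 176, 180, 181, 190, 218,
    148, 150, 143, 158, 193, 157, 158, 170, 97, 176,
    194, 101, 160, 118, 154, 110, 133, 171, 175, 149,
    153, 163, 156, 165, 149, 174, 131, 123, 172, 87,
    120, 154, 134, 158, 160, 168, 115, 178, 169, 237,
]

# Upper class limits for all classes except the last.
UPPER_LIMITS = [94.8, 113.6, 132.4, 151.2, 170.0, 188.8, 207.6, 226.4]


def grouped_frequencies(data, upper_limits):
    """Count the data values in each class; the last class is open-ended."""
    counts = [0] * (len(upper_limits) + 1)
    for x in data:
        # Index of the first upper limit that is >= x.
        counts[bisect_left(upper_limits, x)] += 1
    return counts


frequencies = grouped_frequencies(DATA, UPPER_LIMITS)
relative = [f / len(DATA) for f in frequencies]

for f, r in zip(frequencies, relative):
    print(f"{f:3d}  {r:.4f}")
```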
Looking at a histogram of a data set can sometimes provide a quick
way of answering questions about data, by simply noting the
characteristics of the graph.
Example 1: p. 18
It is immediately apparent from the graph that there are two
superimposed distributions, perhaps due to two different operating
processes.
Example 2: p. 19
It is immediately obvious from the histogram that most of the
interrequest times are relatively small, with only a few very large
times.
Sometimes we want to do a relative frequency histogram of a data
set (sometimes called a density histogram, for reasons to be covered
in Chapter 6).
Example: pp. 19 – 20
The density histogram shows an approximately symmetric, bell-shaped
distribution for the compressive strengths.
Numerical Descriptive Measures
One type of numerical summary describes, in some sense, the
location of the center of a data set. There are several measures of
central tendency, the most important of which is the mean.
Defn: For a variable X measured for every member of a finite
population of size N, yielding a set of values $x_1, x_2, \ldots, x_N$,
the mean, or average, is given by
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i.$$
For a sample of size n chosen from the population, yielding a set of
values $x_1, x_2, \ldots, x_n$, the sample mean, or average, is given by
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
Sometimes, the sample mean is not the most useful measure of
central tendency. For example, sometimes a data set has some
extreme values (either very large or very small). These extreme
values are called outliers (more on this topic later). The value of the
sample mean may be strongly affected by these outliers. In such a
case, a more useful measure of central tendency may be the sample
median.
Defn: The sample median, $\tilde{x}$, is the center of the data set when the
data are ordered from smallest to largest. If n is odd, then the
median is the middle item of data. If n is even, then the median is
the average of the two middle items of data.
The median is not usually affected by outliers (Example on page
26).
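A small numerical illustration may help here. The values below are
hypothetical (they are not from the text); the sketch uses Python's
standard statistics module to show how a single extreme value moves
the mean much more than the median.

```python
import statistics

values = [10, 12, 13, 15, 16]     # hypothetical data without an outlier
with_outlier = values + [95]      # same data with one extreme value added

print(statistics.mean(values), statistics.median(values))
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```

In this sketch the mean roughly doubles when the extreme value is
added, while the median changes only from 13 to 14.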
Example: In the original compression strength data set, n = 80, so
$$\tilde{x} = \frac{160 + 163}{2} = 161.5 \text{ psi}.$$
In addition to locating the center of the data set, we want to describe
the dispersion of the data values.
The simplest, although least useful, measure of dispersion is the
range of the data set.
Defn: The range of a data set is the difference between the largest
and smallest values of the data; the range is a simple measure of the
dispersion of the data.
Example: For the compression strength data,
Range = 245 psi – 76 psi = 169 psi
The range cannot distinguish between the dispersion of two data sets
that have the same largest and smallest values, even though the
values in between may be quite different from one data set to the
other. For this reason, we need a measure of dispersion that takes
into consideration the location of each data value relative to the
center of the data set.
Consider a data set with data values $x_1, x_2, x_3, \ldots, x_n$. For each data
value $x_i$, we define the deviation from the mean as $x_i - \bar{x}$. This
value gives the (directed) distance of the ith data value from the
mean of the sample data. We may consider using the sum of all of
these deviations as our measure of dispersion. However, it would be
useless to do so, as you will show in Exercise 2.50.
Instead, we define two other measures of dispersion, the variance
and the standard deviation.
Defn: For a variable X measured for every member of a finite
population of size N, yielding a set of values $x_1, x_2, \ldots, x_N$,
the variance of the data is given by
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2,$$
and the standard deviation is given by $\sigma$. For a sample of size n
chosen from the population, yielding a set of values
$x_1, x_2, \ldots, x_n$, the sample variance is given by
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2,$$
and the sample standard deviation is $s$.
Note: In the above definitions, $\mu$ and $\sigma$ are parameters; these two
quantities have fixed but usually unknown values. The two
quantities $\bar{x}$ and $s$ are statistics; the values of these two quantities
depend on the particular sample chosen from the population.
If all of the data values in a data set are the same, then the variance
and standard deviation are both 0. If there are any differences
among the data values, then both the variance and standard deviation
are positive; the greater the differences among the data values, the
greater the values of the variance and standard deviation.
Note: While the defining formulae for the population mean and the
sample mean have the same form, the defining formulae for the
population variance and the sample variance differ. For the
population, the variance is the mean of the squared deviations of the
data values from the mean value. For the sample, the variance is
almost the mean of the squared deviations of the data values from
the mean value. Instead of dividing the sum of squared deviations
by the sample size, we divide by n – 1. The reason for doing so has
to do with the fact that we want the sample variance to be a good
estimator of the population variance. A better estimator is given by
dividing by n – 1, rather than by n. Statistically, we say that there
are n – 1 degrees of freedom associated with the sample variance.
Note: If we select a random sample of size n from a population or
distribution, we start out with n quantities which are free to vary, so
that we have n degrees of freedom. Each time we use the data to
estimate a parameter (such as using the sample mean to estimate the
population mean), we use up one degree of freedom. Thus, we have
only n – 1 degrees of freedom associated with the sample variance.
Note: Another, and often simpler, way to calculate the variance is to
use the following fact:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i^2 - 2\bar{x}x_i + \bar{x}^2\right)$$
$$= \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{2}{n}\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} x_i\right) + \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right] = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right].$$
Example: Compressive strength, in pounds per square inch (psi) of
specimens of a new aluminum-lithium alloy undergoing evaluation
for possible use in aircraft structural components. The data are
listed in the following table.
105  167  160   76  199  150  221  141  208  167
151  135  183  245  158  184  142  196  186  228
133  135  163  201  121  174  207  229  145  200
181  199  180  146  171  176  180  181  190  218
148  150  143  158  193  157  158  170   97  176
194  101  160  118  154  110  133  171  175  149
153  163  156  165  149  174  131  123  172   87
120  154  134  158  160  168  115  178  169  237
The sum of the data values is $\sum_{i=1}^{80} x_i = 13013$ psi. The sum of the
squared data values is $\sum_{i=1}^{80} x_i^2 = 2206837$ psi². Hence, the sample
mean is 162.6625 psi; the sample variance is 1140.6315 psi². The
sample standard deviation is then 33.7732 psi.
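These figures can be checked with a short Python sketch. It is only
an illustration, with variable names of our own choosing: it computes
the two sums, then the sample variance both from the definition and
from the shortcut formula, so that the two results can be compared.

```python
import math

# Compressive strength data (psi), copied from the table above.
DATA = [
    105, 167, 160, 76, 199, 150, 221, 141, 208, 167,
    151, 135, 183, 245, 158, 184, 142, 196, 186, 228,
    133, 135, 163, 201, 121, 174, 207, 229, 145, 200,
    181, 199, 180, 146, 171, 176, 180, 181, 190, 218,
    148, 150, 143, 158, 193, 157, 158, 170, 97, 176,
    194, 101, 160, 118, 154, 110, 133, 171, 175, 149,
    153, 163, 156, 165, 149, 174, 131, 123, 172, 87,
    120, 154, 134, 158, 160, 168, 115, 178, 169, 237,
]

n = len(DATA)
total = sum(DATA)                        # sum of the data values
total_sq = sum(x * x for x in DATA)      # sum of the squared data values
mean = total / n

# Definition: sum of squared deviations from the mean, divided by n - 1.
var_definition = sum((x - mean) ** 2 for x in DATA) / (n - 1)

# Shortcut: uses only the two sums computed above.
var_shortcut = (total_sq - total ** 2 / n) / (n - 1)

sd = math.sqrt(var_shortcut)
print(total, total_sq, mean)
print(round(var_definition, 4), round(var_shortcut, 4), round(sd, 4))
```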
The above example illustrates the usefulness of the standard
deviation as a measure of variation; the data have units of psi. The
variance has units of psi². The standard deviation has the same units
of measurement as the data.
As an example of the uses of the sample statistics, let us find the
fraction of the compression strength data that lie within two standard
deviations on either side of the mean. We have
$$\bar{x} - 2s = 162.6625 - (2)(33.7732) = 95.1161 \text{ psi}$$
and
$$\bar{x} + 2s = 162.6625 + (2)(33.7732) = 230.2089 \text{ psi}.$$
From the data set, we see that there are two data values below
95.1161 psi, and two above 230.2089 psi. Hence, the fraction of the
data set that lies within two standard deviations on either side of
the mean is
$$(100)\left(\frac{76}{80}\right) = 95\%.$$
(Hint: Remember this number.)
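The count can be verified with a brief Python sketch. It is only an
illustration with names of our own choosing; it uses the standard
statistics module, whose stdev function divides by n – 1, as in the
definition of the sample standard deviation above.

```python
import statistics

# Compressive strength data (psi), copied from the table above.
DATA = [
    105, 167, 160, 76, 199, 150, 221, 141, 208, 167,
    151, 135, 183, 245, 158, 184, 142, 196, 186, 228,
    133, 135, 163, 201, 121, 174, 207, 229, 145, 200,
    181, 199, 180, 146, 171, 176, 180, 181, 190, 218,
    148, 150, 143, 158, 193, 157, 158, 170, 97, 176,
    194, 101, 160, 118, 154, 110, 133, 171, 175, 149,
    153, 163, 156, 165, 149, 174, 131, 123, 172, 87,
    120, 154, 134, 158, 160, 168, 115, 178, 169, 237,
]

mean = statistics.mean(DATA)
s = statistics.stdev(DATA)               # sample standard deviation (n - 1)
lower, upper = mean - 2 * s, mean + 2 * s

inside = [x for x in DATA if lower <= x <= upper]
outside = sorted(x for x in DATA if x < lower or x > upper)
fraction = len(inside) / len(DATA)

print(round(lower, 4), round(upper, 4))
print(outside, fraction)
```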
Quartiles and Percentiles
Defn: The first quartile, Q1, of a data set is a number such that 25%
of the data values are no greater than that number and 75% of the
data values are no less than that number. The third quartile, Q3, of a
data set is a number such that 75% of the data values are no greater
than that number and 25% of the data values are no less than that
number.
Example: For the aluminum-lithium alloy compression strength data,
$$Q_1 = \frac{143 + 145}{2} = 144 \text{ psi}, \quad \text{and} \quad Q_3 = \frac{181 + 181}{2} = 181 \text{ psi}.$$
25% of the specimens had compressive strengths no greater than
144 psi, and 75% of the specimens had compressive strengths no
greater than 181 psi.
Defn: The interquartile range, IQR, is the difference between the
third and first quartiles. IQR is a measure of spread of the data set.
Example: For the original compression strength data,
IQR = 181 psi – 144 psi = 37 psi.
Defn: The 100pth percentile of a data set is a number such that
100p% of the data values are no greater than that number and
100(1 – p)% of the data values are no less than that number.
Steps in calculating the 100pth percentile for a numeric data set:
1. Re-order the data values from smallest to largest.
2. Determine the value of the product np, where n is the size of the
data set.
3. If np is not an integer, round it up to the next integer, and count
up to that position in the ordered data to find the 100pth percentile.
If np is an integer, count up to the (np)th position in the ordered
data, and calculate the average of that data value and the next
higher data value.
Example: For the aluminum-lithium alloy compression strength
data, the 35th percentile is a number such that 35% of the data
values, or 28 values, are no greater than that number. From the
stem-and-leaf plot, we see that the 35th percentile is 152. Thirty-five
percent of the specimens in the sample have compression strengths
no greater than 152 psi.
Alternatively:
1. The data presented in the stem-and-leaf plot are already ordered.
2. np = (80)(0.35) = 28. This is an integer, so we average the 28th
and the 29th data values, obtaining
$$\frac{151 + 153}{2} = 152 = \text{35th percentile}.$$
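The steps for calculating a percentile can be sketched in Python as
follows. This is one possible rendering, not the only one: the
function name percentile is ours, and we pass the percentage as a
whole number (35 rather than 0.35) so that the test "is np an
integer?" can be done exactly with integer arithmetic.

```python
# Compressive strength data (psi), copied from the table earlier in the chapter.
DATA = [
    105, 167, 160, 76, 199, 150, 221, 141, 208, 167,
    151, 135, 183, 245, 158, 184, 142, 196, 186, 228,
    133, 135, 163, 201, 121, 174, 207, 229, 145, 200,
    181, 199, 180, 146, 171, 176, 180, 181, 190, 218,
    148, 150, 143, 158, 193, 157, 158, 170, 97, 176,
    194, 101, 160, 118, 154, 110, 133, 171, 175, 149,
    153, 163, 156, 165, 149, 174, 131, 123, 172, 87,
    120, 154, 134, 158, 160, 168, 115, 178, 169, 237,
]


def percentile(data, pct):
    """Return the pct-th percentile (pct a whole number from 1 to 99)."""
    if not 0 < pct < 100:
        raise ValueError("pct must be strictly between 0 and 100")
    ordered = sorted(data)                         # step 1
    position, remainder = divmod(len(ordered) * pct, 100)   # step 2: np
    if remainder:
        # np is not an integer: round up to position + 1 (counting from 1),
        # which is index `position` when counting from 0.
        return ordered[position]
    # np is an integer: average the np-th value and the next higher one.
    return (ordered[position - 1] + ordered[position]) / 2


q1, q3 = percentile(DATA, 25), percentile(DATA, 75)
print(q1, percentile(DATA, 50), q3, q3 - q1)
print(percentile(DATA, 35))
```

The same function gives the quartiles and the median, since these are
the 25th, 75th, and 50th percentiles.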
Boxplots
Defn: The five-number summary of a data set consists of the
minimum value, the first quartile, the median, the third quartile, and
the maximum value.
Example: For the aluminum-lithium alloy compression strength
data, minX = 76 psi, Q1 = 144 psi, $\tilde{x} = 161.5$ psi, Q3 = 181 psi, and
maxX = 245 psi.
Defn: A boxplot is a graphical representation of a numeric data set
using the 5-number summary. The data values between the first and
third quartiles are represented by a box, with a vertical line at the
median value. The data values between minX and the first quartile
are represented by a line drawn from one end of the box; the data
values between the third quartile and maxX are represented by a line
drawn from the other end of the box.
Note: Excel does boxplots, but not readily; some Excel
programming is required. Excel can help in constructing boxplots
through providing the 5-number summary for the data, using the
Rank and Percentile function under Data Analysis.
Example: For the compression strength data, the boxplot is shown
below. To find the 5-number summary with Excel, we enter the
data, and use Tools, Data Analysis, Rank and Percentiles.
[Figure: boxplot, "Aluminum-Lithium Alloy Compression Strength".
Horizontal axis: Compression Strength (psi), from 75 to 255.]
If the median line is approximately in the center of the box, and if
the two whiskers are of approximately equal length, then the data
distribution is approximately symmetric.
Defn: An outlier is an observation whose value is quite different
from the values of most of the observations in the data set.
Note: When outliers are encountered, they should be investigated.
They may result from mistakes in data collection or in data entry.
Or they may result from unusual members of the sample.
Note: Practically speaking, an outlier is an observation whose value
is either at least 1.5 IQRs below Q1, or at least 1.5 IQRs above Q3.
An extreme outlier is an observation whose value is either at least 3
IQRs below Q1, or at least 3 IQRs above Q3.
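The rule above can be sketched in Python. This is only an
illustration: the function name find_outliers is ours, and the
quartiles Q1 = 144 psi and Q3 = 181 psi are the values found earlier
for the compression strength data.

```python
# Compressive strength data (psi), copied from the table earlier in the chapter.
DATA = [
    105, 167, 160, 76, 199, 150, 221, 141, 208, 167,
    151, 135, 183, 245, 158, 184, 142, 196, 186, 228,
    133, 135, 163, 201, 121, 174, 207, 229, 145, 200,
    181, 199, 180, 146, 171, 176, 180, 181, 190, 218,
    148, 150, 143, 158, 193, 157, 158, 170, 97, 176,
    194, 101, 160, 118, 154, 110, 133, 171, 175, 149,
    153, 163, 156, 165, 149, 174, 131, 123, 172, 87,
    120, 154, 134, 158, 160, 168, 115, 178, 169, 237,
]


def find_outliers(data, q1, q3, multiple=1.5):
    """Return (low cutoff, high cutoff, sorted values at or beyond a cutoff)."""
    iqr = q3 - q1
    low, high = q1 - multiple * iqr, q3 + multiple * iqr
    flagged = sorted(x for x in data if x <= low or x >= high)
    return low, high, flagged


low, high, outliers = find_outliers(DATA, 144, 181)             # 1.5 IQRs
_, _, extreme = find_outliers(DATA, 144, 181, multiple=3)       # 3 IQRs
print(low, high, outliers, extreme)
```

With these quartiles the cutoffs are 88.5 psi and 236.5 psi, so two
low values and two high values are flagged, and none of them is an
extreme outlier.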
Example: A boxplot of the compression strength data, with outliers
indicated, is shown below:
[Figure: boxplot with outliers marked by asterisks (two below the
lower whisker and two above the upper whisker), "Aluminum-Lithium
Alloy Compression Strength". Horizontal axis: Compression Strength
(psi), from 75 to 255.]
Side-by-side boxplots are often useful in comparing the central
tendencies and variabilities of several data sets, as in the results of
scientific experiments.
Example: pp. 32-33.
From examination of the side-by-side boxplots, we see that the
quality index is most variable for Plant 2, is lowest (on average) for
Plant 4, and is highest (on average) for Plant 3.
Example: Handout
Time Series Plots
Often, in a manufacturing situation, we are interested in the
development of the value of a variable over time. The other graphs
we have discussed examine data collected at a single point in time.
A time series is an ordered sequence of observations. Usually the
ordering is over time, although it may also be over some spatial
dimension. The key point here is that successive observations are
dependent, or correlated with each other. This is what makes time
series data different from the other types of data we have looked at.
In time series analysis, we are looking for two types of
characteristics in the data – trends and cycles.
The following two graphs show the two types of characteristics.
Example 1: p. 33
We see that for the measurement instrument, the measurements of
material thickness display a decreasing trend over time. The
instrument is not being consistent in its measurements.
Example 2: Handout