Chapter 8
Statistical Interpretation of Data
Introduction—Sources of Variability in Measurement
It is commonplace to observe that repeated measurements of what seems to
be the same object or phenomenon do not produce identical results. Measurement
variation arises from a number of sources, but one root cause is often the finite precision of the measuring tool. If a simple yardstick is used to measure carpet, we
expect to obtain a result no better than perhaps 1/4 inch, the smallest division on
the measurement scale, but this is entirely adequate for our purpose. If we try to
measure the carpet to a higher resolution, say 1/64 inch, we are likely to find that
the roughness of the edge of the carpet causes uncertainty about just where the
boundary really is, and measurements at different locations produce different
answers. To push this particular example just a bit farther, if the measurement is
performed without taking care that the ruler is perpendicular to the edges of the
carpet, the measurement will be biased and the result too large.
The limitations of the measurement tool and the difficulty in defining just
what is to be measured are two factors that limit the precision and accuracy of measurements. Precision is defined in terms of repeatability, the ability of the measurement process to duplicate the same measurement and produce the same result.
Accuracy is defined as the agreement between the measurement and some objective
standard taken as the “truth.” In many cases, the latter requires standards which
are themselves defined and measured by some national standards body such as NIST
or NPL (for example, measures of dimension). Arguing with a traffic officer that
your speedometer is very precise because it always reads 55 mph when your car is
traveling at a particular velocity will not overcome his certainty that the accuracy
of his radar detector indicated you were actually going 60 mph (arguing about the
accuracy of the calibration of the radar gun may in fact be worthwhile; the calibration and standardization of such devices can be rather poor!). Figure 8.1 suggests the difference between precision and accuracy—a highly precise but inaccurate
measurement is generally quite useless.
In a few cases, such as counting of events, it is possible to imagine an absolute
truth, but even in such an apparently simple case the ability of our measurement
process to yield the correct answer is far from certain. In one classic example of
counting errors, George Moore from the National Bureau of Standards (before
it became NIST) asked participants to count the occurrences of the letter “e” in
a paragraph of text. Aside from minor ambiguities such as whether the intent
was to count the capital “E” as well as lower case, and whether a missing “e” in a
misspelled word should be included, the results (Figure 8.2) showed that many
people systematically undercounted the occurrences (perhaps because we read by recognizing whole words, and do not naturally dwell on the individual letters) while a significant number counted more occurrences than were actually present.
Figure 8.1. Effects of accuracy and precision in shooting at a target.
A computer spell checker program can do a flawless job of counting the
letter “e” in text of course, provided that there are no ambiguities about what is to
be counted. This highlights one of the advantages that computer-assisted measurements have over manual ones. But the problem remains one of identifying the
things that are to be counted. In most cases involving images of microstructure (particularly of biological structures which tend to be more complex than man-made
ones) this is not so easy. The human visual system, aided by knowledge about the
objects of interest, the system in which they reside, and the history of the sample,
can often outperform a computer program. It is, however, also subject to distractions, and results will vary from one person to another or from one time to the next. Tests in which the same series of images is presented to an experienced observer in a different orientation or sequence, yielding quite different results, are often reported.
Figure 8.2a reproduces the text used in the counting experiment:
Nearly all laboratories where there is occasion to use extensive manual measurement of micrographs, or counting of cells or particles therein, have reason to note that certain workers consistently obtain values which are either higher or lower than the average. While it is fairly easy to count all the objects in a defined field, either visually or by machine, it is difficult to count only objects of a single class when mixed with objects of other classes. Before impeaching the eyesight or arithmetic of your colleagues, see if you can correctly count objects which you have been trained to recognize for years, specifically the letter “E.” After you have finished reading these instructions, go back to the beginning of this paragraph and count all the appearances of the letter “e” in the body of this paragraph. You may use a pencil to guide your eyes along the lines, but do not strike out or mark E’s and do not tally individual lines. Go through the text once only; do not repeat or retrace. You should not require more than 2 minutes. Please stop counting here!
Please write down your count on a separate piece of paper before you forget it or are tempted to “improve” it. You will later be given an opportunity to compare your count with the results obtained by others. Thank you for your cooperation!
Figure 8.2. The “e” counting experiment (a) and typical results (b). The true answer is 115, and the mean (average) from many trials is about 103.

In addition to the basic idea that measurement values can vary due to finite precision of measurement, natural variations also occur due to finite sampling of results. There are practically no instances in which a measurement procedure can be applied to all members of a population. Instead, we sample from the entire collection of objects and measure a smaller number. Assuming (a very large assumption) that the samples are representative of the population, the results of our measurements allow us to infer something about the entire population. Achieving a representative sample is difficult, and many of the other chapters deal to greater or lesser degrees with how sampling should be done in order to eliminate bias.
The human population affords many illustrations of improper sampling.
It would not be reasonable to use the student body of a university as a sample to
determine the average age of the population, because as a group students tend to
be age selected. It would not be reasonable to use the population of a city to represent an entire nation to assess health, because both positive and negative factors
of environment, nutrition and health care distinguish them from rural dwellers. In
fact, it is very difficult to figure out a workable scheme by which to obtain a representative and unbiased sample of the diverse human population on earth. This has
been presented as one argument against the existence of flying saucers: If space
aliens are so smart that they can build ships to travel here, they must also understand statistics; if they understand the importance of proper random sampling, we
would not expect the majority of sightings and contact reports to come from people
in the rural south who drive pickup trucks and have missing teeth; Q.E.D.
For the purposes of this chapter, we will ignore the various sources of bias
(systematic variation or offset in results) that can arise from human errors, sampling
bias, etc. It will be enough to deal with the statistical characterization of our data
and the practical tests to which we can subject them.
Distributions of Values
Since the measurement results from repeated attempts to measure the same
thing naturally vary, either when the operation is performed on the same specimen
(due to measurement precision) or when multiple samples of the population are
taken (involving also the natural random variation of the samples), it is natural to
treat them as a frequency distribution. Plotting the number of times that each measured value is obtained as a function of the value, as was done in Figure 8.2 for the
“e” experiment, shows the distribution. In cases where the measurement value is
continuous rather than discrete, it is common to “bin” the data into ranges. Such a
distribution plot is often called a histogram.
Usually for convenience the bin ranges are either scaled to fit the actual range
of data values, so as to produce a reasonable number of bins (typically 10 to 20) for
plotting, or they are set up with boundary limits that are rounded off to an appropriate degree of precision. However this is done, the important goal is to be sure
that the bins are narrow enough and there are enough of them to reveal the actual
shape of the distribution, but there are not so many bins that there are only a few
counts in each, which as we will see means that the heights of the bins are not well
defined. Figure 8.3 shows cases of plotting the same data with different numbers of
bins. A useful guide is to use no more bins than about N/20 where N is the number
of data points, as this will put an average of 20 counts into each bin. Of course, it
is the variation in counts from one bin to another that makes distributions interesting and so there must be some bins with very few counts, so this offers only a
starting point.
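As a rough illustration of this guideline, a short Python sketch (numpy is assumed to be available; the function name and the 200 simulated values are purely illustrative, not part of the original figures) might choose the binning as follows:

import numpy as np

def choose_bins(values, counts_per_bin=20, max_bins=20):
    # rule of thumb from the text: roughly N/20 bins, i.e. about 20 counts
    # per bin on average, capped at a convenient number for plotting
    n_bins = max(1, min(len(values) // counts_per_bin, max_bins))
    return np.histogram(values, bins=n_bins)

rng = np.random.default_rng(0)
data = rng.normal(50.0, 3.0, 200)      # 200 simulated measurements, cf. Figure 8.3
counts, edges = choose_bins(data)
print(len(counts), "bins:", counts)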
Plotting histogram distributions of measured data is a very common technique that communicates in a more succinct way than tables of raw data how the
measurements vary. In some cases, the distribution itself is the only meaningful way
to show this. A complicated distribution with several peaks of different sizes and
shapes is not easily reduced to fewer numbers in a way that retains the complexity
and permits comparison with other populations.
Figure 8.3. Histogram plots of 200 data points in 4, 12 and 60 bins.
Usually the reason that we perform the measurements in the first place is to
have some means of comparing a set of data from one population with a set of data
from another group. This may be to compare two groups that are both being measured, or to compare one with data taken at another time and place, or to compare
real data with some theoretical model or design specification. In any case, plotting
the two histograms for visual comparison is not a very good solution. We would
prefer to have only a few numbers extracted from the histogram that summarize the
most important aspects of the data, and tools that we can employ to compare them.
The comparisons will most often be of the form that can be reported as a
probability—usually the probability that the observed difference(s) between the two
or more groups are considered to be real rather than chance. This is a surprisingly
difficult concept for some people to grasp. We can almost never make a statement
that group A is definitely different from group B based on a limited set of data that
have variations due to the measurement process and to the sampling of the populations. Instead we say that based on the sample size we have taken there is a probability that the difference we actually observe between our two samples is small
enough to have arisen purely by chance, due to the random nature of our sampling.
If this probability is very high, say 30% or 50%, then it is risky to conclude
that the two samples really are different. If it is very low, say 1% or 5%, then we
may be encouraged to believe that the difference is a real reflection of an underlying difference between the two populations. A 5% value in this case means that only
one time in twenty would we expect to find such a large difference due solely to the
chance variation in the measurements, if we repeated the experiment. Much of the
rest of this chapter will present tools for estimating such probabilities for our data,
which will depend on just how great the differences are, how many measurements
we have made, how variable the results within each group are, and what the nature
of the actual distribution of results is.
Let us consider two extreme examples by way of illustration. Imagine measuring the lengths of two groups of straws that are all chosen randomly from the
same bucket. Just because there is an inherent variation in the measurement results
(due both to the measurement process and the variation in the straws), the average
value of the length of straws in group A might be larger than group B. But we would
expect that as the number of straws we tested in each group increased, the difference would be reduced. We would expect the statistical tests to conclude that there
was a very large probability that the two sets of data were not distinguishable and
might actually have come from the same master population.
Next, imagine measuring the heights of male and female students in a class.
We expect that for a large test sample (lots of students) the average height of the
men will be greater than that for the women, because that is what is usually observed
in the adult human population. But if our sample is small, say twenty students with
11 men and 9 women, we may not get such a clear-cut result. The variation within
the groups (men and women) may be as large or larger than that between them. But
even for a small group the statistical tests should conclude that there is a low probability that the samples could have been taken from the same population.
When political polls quote an uncertainty of several percent in their results,
they are expressing this same idea in a slightly different way. Most of us would
recognize that preferences of 49 and 51% for two candidates are not really very different if the stated uncertainty of the poll was 4 percentage points, whereas preferences of 35 and 65% probably are different. Of course, this makes the implicit
assumption that the poll has been unbiased in how it selected people from the pool
of voters, in how it asked the questions, and so forth.
Expressing statistical probabilities in a meaningful and universally understandable way is not simple, and this is why the abuse of statistical data is so widespread. When the weatherman says that there is a 50% chance of showers tomorrow,
does this mean that a) 50% of the locations within the forecast area will see some
moment of rain during the day; b) every place will experience rain for 50% of the
day; c) every location has a 50% probability of being rained upon 50% of the time;
or d) something else entirely?
The Mean, Median and Mode
The simplest way to reduce all of the data from a series of measurements to
a single value is simply to average them. The mean value is the sum of all the measurements divided by the number of measurements, and is familiar to most people
as the average. It is often used because of a belief that it must be a better reflection
of the “true” value than any single measurement, but this is not always the case; it
depends on the shape of the distribution. Notice that for the data on the “e” experiment in Figure 8.2, the mean value is not a good estimate of the true value.
There are actually three parameters that are commonly used to describe a
histogram distribution in terms of one single value. The mean or average is one; the
median and mode are the others. For discrete measurements such as the “e” experiment the mode is the value that corresponds to the peak of the distribution (108
in that example). For a continuous measurement such as length, it is usually taken
as the central value within the bin that has the highest frequency. The meaning of
the mode is simple—it is the value that is most frequently observed.
The median is the value that lies in the middle of the distribution in the sense
that just as many measured values are larger than it as are smaller than it. This
brings in the idea of ranking in order rather than looking at the values themselves,
which will be used in several of the tests in this chapter.
If a distribution is symmetric and has a single peak, then the mean or average
does correspond to the peak of the distribution (the mode) and there are just as
many values to the right and left so this is also the median. So, for a simple symmetrical distribution the mean, median and mode are identical. For a distribution
that consists of a single peak but is not symmetric, the mode lies at the top of the
peak, and the median is always closer to the mode than the mean is. This is summarized in Figure 8.4.
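For a small, positively skewed set of values (invented here purely for illustration), Python's standard library shows the three measures separating in the expected order:

import statistics

values = [4, 5, 5, 5, 6, 6, 7, 8, 11, 15]     # hypothetical, positively skewed data
print("mean  :", statistics.mean(values))      # 7.2
print("median:", statistics.median(values))    # 6.0
print("mode  :", statistics.mode(values))      # 5
# mode < median < mean, as expected for a distribution with a tail to the right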
When distributions do not have very many data points, determining the
mode can be quite difficult. Even the distribution in Figure 8.2 is “noisy” so that
repeated experiments with new groups of people would probably show the mode
varying by quite a lot compared to the mean. The mean is the most stable of the
three values, and it is also the most widely used. This is partly because it is the easiest
to determine (it is just the average), and partly because many distributions of real data are symmetrical and in fact have a very specific shape that makes the mean a particularly robust way to describe the population.
Figure 8.4. Distribution showing the mean, median and mode.
The Central Limit Theorem and the Gaussian Distribution
One of the most common distributions that is observed in natural data has
a particular shape, sometimes called the “normal” or “bell” curve; the proper name
is Gaussian. Whether we are measuring the weights of bricks, human intelligence
(the so-called IQ test), or even some stereological parameters, the distribution of
results often takes on this form. This is not an accident. The Central Limit Theorem
in statistics states that whenever there are a very large number of independent causes
of variation, each producing fluctuations in the measured result, then regardless of
what the shape of each individual distribution may be, the effect of adding them all
together is to produce a Gaussian shape for the overall distribution.
The Gaussian curve that describes the probability of observing a result of
any particular value x, and which fits the shape of the histogram distribution, is
given by equation (8.1).
G(x, m, s) = \frac{1}{s\sqrt{2\pi}} \cdot e^{-(x - m)^2 / (2 s^2)}    (8.1)
The parameters m and s are the mean and standard deviation of the
distribution, which are discussed below. The Gaussian function is a continuous
probability of the value x, and when it is compared to a discrete histogram
produced by summing occurrences in bins or ranges of values, the match is only
approximate. As discussed below, the calculation of the m and s values from the
data should properly be performed with the individual values rather than the histogram as well, although in many real cases the differences are not large.
Figure 8.5 shows several histogram distributions of normally distributed (i.e.
Gaussian) data, with the Gaussian curve described by the mean and standard deviation of the data superimposed. There are three data sets represented, each containing 100 values generated by simulation with a Gaussian probability using a
random number generator. Notice first of all that several of the individual histogram
bins vary substantially from the smooth Gaussian curve. We will return to this variation to analyze its significance in terms of the number of observed data points in
each bin later.
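The comparison in Figure 8.5 is easy to reproduce numerically; the sketch below (numpy assumed available; the mean, standard deviation and sample size are those of Figure 8.5b) bins 100 simulated values and lists, beside each observed count, the count that the Gaussian of equation (8.1) predicts for that bin:

import numpy as np

rng = np.random.default_rng(1)
m, s, n = 50.0, 3.0, 100                      # population values as in Figure 8.5b
data = rng.normal(m, s, n)

counts, edges = np.histogram(data, bins=12)
centers = 0.5 * (edges[:-1] + edges[1:])
width = edges[1] - edges[0]

# expected counts per bin from equation (8.1): n * bin width * G(x, m, s)
gauss = np.exp(-(centers - m) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))
expected = n * width * gauss

for center, observed, exp_count in zip(centers, counts, expected):
    print(f"bin at {center:5.1f}: observed {observed:3d}, expected {exp_count:5.1f}")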
There are certainly many situations in which our measured data will not
conform to a Gaussian distribution, and later in this chapter we will deal with the
statistical tests that are properly used in those cases. But when the data do fit a
Gaussian shape, the statistical tools for describing and comparing sets of data are
particularly simple, well developed and analyzed in textbooks, and easily calculated
and understood.
Variance and Standard Deviation
In Figure 8.5, notice that the three distributions have different mean values
(locations of the peak in the distribution) and different widths. Two of the distributions (b and c) have the same mean value of 50, but one is much broader than the
other. Two of the distributions (a and b) have a similar breadth, but the mean values
differ. For the examples in the figure, the means and standard deviations of the populations from which the data were sampled are:
Figure    m      s
8.5a      53.0   3.0
8.5b      50.0   3.0
8.5c      50.0   1.0
The mean value, as noted above, is simply the average of all of the measurements. For a Gaussian distribution, since it is symmetrical, the mode and
median are equal to the mean. The only remaining parameter needed to describe
the Gaussian distribution is the width of the distribution, for which either the standard deviation (s in equation 8.1) or the variance (the square of the standard deviation) is used. These descriptive parameters for a distribution of values can be
calculated whether it is actually displayed as a histogram or not.
The standard deviation is defined as the square root of the mean (average)
value of the square of the difference between each value and the mean (hence the
name “root-mean-square” or rms difference). However, calculating it according to
this definition is not efficient. A procedure for doing so with a single pass through
the data is shown in Figure 8.6 (this also includes several parameters discussed in
the next section).
Figure 8.5. Distributions of normally distributed data with differing means and standard deviations.
// assume data are held in array value[1..N]
// sum[1..5] accumulates N, Sum(x), Sum(x^2), Sum(x^3) and Sum(x^4) in a single pass
for i = 1 to 5 do
    sum[i] = 0.0;
for j = 1 to N do
{
    temp = 1.0;
    for i = 1 to 5 do
    {
        sum[i] = sum[i] + temp;
        temp = temp * value[j];
    }
}
mean = sum[2]/sum[1];
variance = sum[3]/sum[1] - mean*mean;
std_dev = sqrt(variance);
skew = ((sum[4] - 3*mean*sum[3])/sum[1] + 2*mean**3)/variance**1.5;
kurtosis = ((sum[5] - 4*sum[4]*mean + 6*sum[3]*mean**2)/sum[1] - 3*mean**4)/variance**2;
Figure 8.6. Example computer procedure to calculate statistical parameters.
Because it has the same units as the mean, the standard deviation is often
used to denote the spread of the data in the form (e.g.) 50.0 ± 3.0 cm, the units in
which the value was measured. For data that are normally distributed, this means
that 68% of the measured values are expected to fall within the range from 47.0 (50
- 3.0) to 53.0 (50 + 3.0) cm, 95% within the range 44.0 (50 - 2 · 3.0) to 56.0 (50 +
2 · 3.0) cm, and 99% within the range 41.0 (50 - 3 · 3.0) to 59.0 (50 + 3 · 3.0) cm.
The values 68, 95 and 99% come directly from integrating the mathematical
shape of the Gaussian curve within limits of ±1, 2 and 3 times the standard deviation. As written in equation (8.1), the integral of the Gaussian curve is 1.0 (meaning
that the probability function is normalized and the total probability of all possible
measurement values being observed is unity, which makes the statisticians happy).
The definite integral between the limits of ±1, 2 and 3s gives results of 0.6827,
0.9545, and 0.9973, respectively. Tables of the value and integral of the Gaussian
function are widely available in statistics books, but will not be needed for the analytical tests discussed in this chapter.
The variance is just the square of the standard deviation and another
measure of the distribution width. It is used in some statistical tests discussed later
on. Because in most cases the analyzed data set is but a sample of the population
of all possible data, the estimated variance of the population is slightly greater than
that calculated in the procedure of Figure 8.6 for the data set. The variance of the
population is N/(N - 1) times the sample variance, where N is the number of values
in the analyzed data set.
The mean value calculated from the data values sampled from a larger population is also only an estimate of the true value of the entire population. For
example, the data shown in Figure 8.5b come from a simulated population with a
mean of exactly 50, but the sample of 100 data points have a calculated mean value
of 50.009. We also need to be able to evaluate how well our calculated mean value
estimates that of the population, and again the result depends on how many data
points we have taken. The standard error of the mean is just s/√N where s is the
calculated standard deviation of the sample and N is the number of values used. It
is used in the same way as the standard deviation, namely we expect to find the true
population mean within 1, 2 or 3 standard errors of the calculated sample mean 68,
95 and 99% of the time, respectively. It is important to understand the difference
between the standard deviation of our data sample (s) and the standard error of
the calculated mean value m.
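A minimal numpy sketch (simulated data; ddof=1 applies the N/(N - 1) correction described above) makes the distinction between the two quantities concrete:

import numpy as np

rng = np.random.default_rng(2)
sample = rng.normal(50.0, 3.0, 100)      # 100 values drawn from a population with mean 50, s = 3

mean = sample.mean()
s = sample.std(ddof=1)                   # sample standard deviation (divisor N - 1)
sem = s / np.sqrt(len(sample))           # standard error of the mean, s / sqrt(N)

print(f"calculated mean            = {mean:.3f}")
print(f"standard deviation of data = {s:.3f}")
print(f"standard error of the mean = {sem:.3f}")
# the true population mean (50.0 here) is expected to lie within about
# 2 standard errors of the calculated mean roughly 95% of the time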
The mean and standard deviation are also used when measurement data are combined. Adding together two sets of data is fairly straightforward. If one set of measurements on one population sample determines a mean and standard deviation for a particular event as m1 and s1, and a second set of measurements on another sample determines the mean and standard deviation for a second event as m2 and s2, then the mean for the combined event (either one occurring) is the sum of the two individual means (m1 + m2) and the standard deviation is the square root of the sum of the two variances, (s1² + s2²)^1/2. To give a concrete example, if the rate of vehicles passing on the highway is determined in one time period as 50 ± 6.5 buses per hour, and in a second time period we count 120 ± 8.4 trucks per hour, then the combined rate of “large vehicles” (buses plus trucks) would be estimated as 170 ± 10.6. The 170 is simply 50 + 120 and the 10.6 is the square root of 112.8 = (6.5)² + (8.4)².
The situation is a bit more complicated, and certainly much less favorable, if we try to combine two events to determine a difference. Imagine that we counted 120 ± 8.4 trucks per hour in one time period and then 75 ± 7.2 eighteen-wheel transports (a subclass of trucks) in a second period. Subtracting the second class from the first would give us the rate of small trucks, and the mean value is estimated simply as 45 per hour (= 120 - 75). But the standard deviation of the net value is determined just as it was for the case of addition; multiple sources of uncertainty always add together as the square root of the sum of squares. The standard deviation is thus 11.1 (the square root of 122.4 = (8.4)² + (7.2)²) for an estimated rate of 45 ± 11.1, which is a very large relative uncertainty. This highlights the importance of always trying to measure the things we are really interested in, and not determining them by difference between two larger values.
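The vehicle-counting arithmetic can be written out in a few lines (a sketch that simply reproduces the numbers quoted above; the uncertainties combine as the square root of the sum of squares whether the means are added or subtracted):

from math import hypot

def combine(mean1, sd1, mean2, sd2, subtract=False):
    # root-sum-of-squares combination of the two standard deviations
    mean = mean1 - mean2 if subtract else mean1 + mean2
    return mean, hypot(sd1, sd2)

print(combine(50, 6.5, 120, 8.4))                 # buses + trucks: (170, 10.6...)
print(combine(120, 8.4, 75, 7.2, subtract=True))  # trucks - transports: (45, 11.06...)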
Testing Distributions for Normality—Skew and Kurtosis
In order to properly use the mean and standard deviation as descriptors of
the distribution, it is required that the data actually have a normal or Gaussian distribution. As can be seen from the graphs in Figure 8.5, visual judgment is not a
reliable guide (especially when the total number of observations is not great). With
little additional computational effort we can calculate two additional parameters,
the skew and kurtosis, that can reveal deviations from normality.
The skew and kurtosis are generally described as the third and fourth
moments or powers of the distribution. Just as the variance uses the sum of
(x - m)2, the skew uses the average value of (x - m)3 and the kurtosis uses the
average value of (x - m)4 divided by s4. Viewed in this way, the mean is the first
moment (it describes how far the data lie on the average from zero on the measurement scale) and the variance is the second moment (it uses the squares of the
deviations from the mean as a measure of the spread of the data). The skew uses
the third powers and the kurtosis the fourth powers of the values, according to the
definitions:
\text{Skew} = \frac{m_3}{m_2^{3/2}} \quad (3\text{rd moment}), \qquad \text{Kurtosis} = \frac{m_4}{m_2^{2}} \quad (4\text{th moment})
where m_k = \sum (x_i - m)^k / N. An efficient calculation method is included in the procedure shown in Figure 8.6.
The skew is a measure of how symmetrical the distribution is. A perfectly
symmetrical distribution has a skew of zero, while positive and negative values indicate distributions that have tails extending to the right (larger values) and left
(smaller values) respectively. Kurtosis is a measure of the shape of the distribution.
A perfectly Gaussian distribution has a Kurtosis value of 3.0; smaller values indicate that the distribution is flatter-topped than the Gaussian, while larger values
result from a distribution that has a high central peak. A word of caution is needed:
some statistical analysis packages subtract 3.0 from the calculated kurtosis value so
that zero corresponds to a normal distribution, positive values to ones with a central
peak, and negative values to ones that are flat-topped.
It can be quite important to use these additional parameters to test a data set
to see whether it is Gaussian, before using descriptive parameters and tests that
depend on that assumption. Figure 8.7 shows an example of four sets of values that
have the same mean and standard deviation but are very different in the way the data
are actually distributed. One set (#3 in the figure) does contain values sampled from
a normal population. The others do not—one is uniformly spaced, one is bimodal
(with two peaks), and one has a tight cluster of values with a few outliers. The skew
and kurtosis reveal this. The outliers in data set #1 produce a positive skew and the
clustering produces a large kurtosis. The kurtosis values of the bimodal and uniformly distributed data sets are smaller than 3. For the sample of data taken from a
normal distribution the values are close to the ideal zero (skewness) and 3 (kurtosis).
But how close do they actually need to be, and how can we practically use
these values to test for normality? As usual, the results must be expressed as a probability. For a given value of skewness or kurtosis calculated for an actual data set,
what is the probability that it could have resulted due to the finite sampling of a
population that is actually normal? The graphs in Figures 8.8 and 8.9 show the
answers, as a function of the number of data values actually used. If we apply these
tests to the 100 data points shown in Figure 8.5b we calculate a skew value of 0.09
and a kurtosis of 2.742, well within the values of 0.4 and range of 2.4–3.6 that the
graphs predict can happen one time in twenty (5%). Consequently there is no reason
to expect that the data did not come from a normal population and we can use that
assumption in further analysis. As the data set grows larger, the constraints on skew
and kurtosis narrow.
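In practice these two moments are usually obtained from a library rather than from the procedure of Figure 8.6; a sketch using scipy (assumed available) follows. Note that, as cautioned above, scipy's kurtosis subtracts 3.0 by default, so fisher=False is needed to match the convention used here.

import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(3)
data = rng.normal(50.0, 3.0, 100)        # a simulated normal sample, cf. Figure 8.5b

print("skew    :", skew(data))                     # close to 0 for a normal sample
print("kurtosis:", kurtosis(data, fisher=False))   # close to 3 for a normal sample
# compare the results against the 5% limits read from Figures 8.8 and 8.9 for
# N = 100 before relying on tests that assume normality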
Data Set #        1       2       3        4
Mean = m          100     100     100      100
Variance = s2     140     140     140      140
Skew              1.68    0       -0.352   0
Kurtosis          5.06    1.12    2.929    1.62
Figure 8.7. Four data sets with the same mean and standard deviation values, but different shapes as revealed by higher moments (skew and kurtosis).
Figure 8.8. The probability that the absolute value of skewness will exceed various values for random samples from a Gaussian distribution, as a function of the number of observations.
Figure 8.9. The probability that the kurtosis will exceed various values for random samples from a Gaussian distribution, as a function of the number of observations.

Some Other Common Distributions
The Gaussian or normal curve is certainly not the only distribution that can arise from simple physical processes. As one important example, many of the
procedures we use to obtain stereological data are based on counting, and a proper
understanding of the statistics of counting is therefore important. Consider a simple
event counting process, such as determining the number of cars per hour that pass
on a particular freeway. We could count for an entire hour, but it seems simpler to
count for a short period of time and scale the results up. For example, if we counted
the number of cars in a 5 minute period (e.g., 75 cars) and multiplied by 12, it should
give the number of cars per hour (12 · 75 = 900). But of course, the particular five
minute period we chose cannot be perfectly representative. Counting for a different
5 minute period (even assuming that the average rate of traffic does not change)
would produce a slightly different result. To what extent can we predict the variation, or to put it another way how confident can we be that the result we obtain is
representative of the population?
When we count events, it is not possible to get a negative result. This is immediately a clue that the results cannot have a Gaussian or normal distribution,
because the tails of the Gaussian curve extend in both directions. Instead, counting
statistics are Poisson. The Poisson function is
P(x, m) = \frac{m^{x} \cdot e^{-m}}{x!}    (8.2)
where m is the mean value and x! indicates factorial. Notice that there is no s
term in this expression, as there was for the Gaussian distribution. The width of
the Poisson distribution is uniquely determined by the number of events counted;
it is the square root of the mean value.
This simple result means that if we counted 75 cars in our 5 minute period,
the standard deviation that allows us to predict the variation of results (68%
within 1 standard deviation, 95% within 2, etc.) is simply √75 = 8.66. The five minute result of 75 ± 8.66 scales up to an hourly rate of 900 ± 103.9, which is a much less precise result than we would have obtained by counting longer. Counting 900 cars in an hour would have given a standard deviation of √900 = 30, considerably better than 104. On the other hand, counting for 1 minute would have tallied approximately 900/60 = 15 cars, √15 = 3.88, and 15 ± 3.88 scales to an estimated hourly rate of 900 ± 232.8 which is much worse. This indicates that controlling the number of events (e.g., marks on grids) that we count is vitally important
to control the precision of the estimate that we obtain for the desired quantity being
measured.
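The same arithmetic in a short sketch (pure Python; the counts and time intervals are those used in the text):

from math import sqrt

def scaled_rate(count, minutes, per_minutes=60):
    # the standard deviation of a Poisson count is the square root of the count;
    # both the count and its uncertainty scale by the same factor
    factor = per_minutes / minutes
    return count * factor, sqrt(count) * factor

for minutes, count in [(5, 75), (60, 900), (1, 15)]:
    rate, sd = scaled_rate(count, minutes)
    print(f"{count:4d} counts in {minutes:2d} min -> {rate:.0f} +/- {sd:.1f} per hour")
# 75 counts in 5 min -> 900 +/- 103.9; 900 in 60 min -> 900 +/- 30.0;
# 15 in 1 min -> 900 +/- 232.4 (the text rounds sqrt(15) to 3.88, giving 232.8)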
Figure 8.10 shows a Poisson distribution for the case of m = 2 (a mean value
of two counts). The continuous line for the P(x, m) curve is only of mathematical
interest because counting deals only with integers. But the curve does show that the
mean value (2.0) is greater than the mode (the highest point on the curve), which is
always true for the Poisson distribution. So the Poisson function has a positive skew
(and also, incidentally, a kurtosis greater than three).
Fortunately, we usually deal with a much larger number of counts than 2.
When the mean of the Poisson function is large the distribution is indistinguishable
from the Gaussian (except of course that the standard deviation s is still given by
the square root of the mean m). This means that the function and distribution
becomes symmetrical and shaped like a Gaussian, and consequently the statistical
tools for describing and comparing data sets can be used.
Figure 8.10. Poisson function and distribution for a mean of 2.
However, there is one important arena in which the assumption of large
numbers is not always met. When a distribution of any measured function is plotted,
even if the number of bins is small enough that most of them contain a large number
of counts, there are usually some bins at the extremities of the distribution that have
only a few counts in them. If these bins are compared from one distribution to
another, it must be based on their underlying Poisson nature and not on an assumption of Gaussian behavior. Comparing the overall distributions by their means and
standard deviations is fine as a way to determine if the populations can be distinguished. Trying to detect small differences at the extremes of the populations, such
as the presence of a small fraction of larger or smaller members of the population,
is much more difficult.
Another distribution that shows up often enough in stereological
measurement to take note of is the log-normal distribution. This is a histogram
in which the horizontal axis is not the measured value but rather the logarithm
of the measured value, but the resulting shape of the distribution is normal or
Gaussian. Note that the bins in the distribution are still of uniform width, which
means they each cover the same ratio of size values (or whatever the measurement
records) rather than the same increment of sizes. This type of distribution is
often observed for particle sizes, since physical processes such as the grinding and
fracturing of brittle materials produce this behavior. Figure 8.11 shows a histogram
distribution for log-normally distributed data; on a linear scale it is positively
skewed.
For a thorough discussion of various distributions that occur and the tools
available to analyze them, see P. R. Bevington (1969) “Data Reduction and Error
Analysis for the Physical Sciences” McGraw Hill, New York.
Comparing Sets of Measurements—The T-Test
As mentioned at the beginning of this chapter, one common purpose of statistical analysis is to determine the likelihood that two (or more) sets of measurements (for example, on different specimens) are the same or different. The answer
to this question is usually expressed as a probability that the two samples of data
could have come from the same original population. If this probability is very low,
it is often taken as an indication that the specimens are really different. A probability of 5% (often written as a = 0.05) that two data sets could have come from the
same population and have slightly different descriptive parameters (mean, etc.)
purely due to sampling variability is often expressed as a 95% confidence level that
they are different. It is in fact much more difficult to prove the opposite position,
that the two specimens really are the same.
When the data are normally distributed, a very efficient comparison can be
made with student’s t-test. This compares the mean and standard deviation values
for two populations, and calculates a probability that the two data sets are really
drawn from the same master population and that the observed differences have
arisen by chance due to sampling. Remember that for a Gaussian distribution, the
mean and standard deviation contain all of the descriptive information that is
needed (the skew and kurtosis are fixed).
Figure 8.11. Linear and logarithmic scale histogram plots of the same set of log-normally distributed values.
Given two normal (Gaussian) sets of data, each with ni observations and
characterized by a mean value mi and a standard deviation si, what is the probability that they are significantly different? This is calculated from the difference in the
mean values, in terms of the magnitude of the standard deviations and number of
observations in each data set, as shown in equation (8.3); ν is the number of degrees of freedom, which is needed to assess the probabilities in Figure 8.12.

T = \frac{m_1 - m_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}, \qquad
\nu = \frac{\left( \dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2} \right)^{2}}{\dfrac{\left( s_1^2 / n_1 \right)^{2}}{n_1 - 1} + \dfrac{\left( s_2^2 / n_2 \right)^{2}}{n_2 - 1}}    (8.3)
Figure 8.12. Critical values for t-test comparison as a function of the number of degrees
of freedom.
The larger the difference of means (relative to the standard deviations) and
the larger the data sets, the more likely it is that the two sets of observations could
not have come from a single population by random selection. The procedure is to
compare the value of the parameter T to the table of student’s t values shown by
the graph in Figure 8.12 for the number of degrees of freedom and probability a.
If the magnitude of T is less than the table value, it indicates that the two data sets
could have come from the same population, with the differences arising simply from
random selection. If it exceeds the table value, it indicates the two sets are probably different at the corresponding level of confidence 100 · (1 - a); for example, a =
0.01 corresponds to a 99% confidence level that the two groups are not the same.
The value of a is given for a double-sided test (the two means are “different”). In a single sided test (deciding whether one mean is “greater” or “less” than
the second), the value of a is halved. Notice that the curves of critical values level
off quickly as the number of degrees of freedom increases. For typical data sets with
at least tens of observations in each, only the asymptotic limit values are needed.
Hence a value of T greater than 1.645 indicates that the groups are not the same with a confidence level of 90% (a = 0.10), and values greater than 1.960 and 2.576 give corresponding confidence levels of 95% (a = 0.05) and 99% (a = 0.01) respectively. Common and convenient use is made of the easily remembered approximate test value T ≥ 2.0 for 95% confidence.
Comparing the two sets of data in Figure 8.5a and 8.5b using the t-test gives
the following results:
Data Set     m        s       n
a            52.976   3.110   100
b            50.009   3.152   100
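The value of T can be reproduced directly from these summary values; a sketch (scipy assumed available, and its Welch option corresponds to the unequal-variance form of equation 8.3):

from math import sqrt
from scipy.stats import ttest_ind_from_stats

m1, s1, n1 = 52.976, 3.110, 100
m2, s2, n2 = 50.009, 3.152, 100

# direct evaluation of equation (8.3)
T = (m1 - m2) / sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
print(f"T = {T:.2f}")                              # roughly 6.7

# cross-check using scipy's t-test from summary statistics
t, p = ttest_ind_from_stats(m1, s1, n1, m2, s2, n2, equal_var=False)
print(f"t = {t:.2f}, two-sided probability = {p:.2g}")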
The calculated value of T is approximately 6.7. The number of degrees of freedom is large enough to consider just the asymptotic values, and since T far exceeds even the 99% critical value of 2.576, we conclude with better than 99% confidence that the two data sets are different.

          ν1 = 3   ν1 = 5   ν1 = 10   ν1 = 40   ν1 = ∞
ν2 = 3     9.28     9.01     8.79      8.59      8.53
ν2 = 5     5.41     5.05     4.74      4.46      4.36
ν2 = 10    3.71     3.33     2.98      2.66      2.54
ν2 = 40    2.84     2.45     2.08      1.69      1.51
ν2 = ∞     2.60     2.21     1.83      1.39      1.00
Figure 8.13. Critical values of F for a = 0.05 (ANOVA test).
The Analysis of Variance or ANOVA test is a generalization of the t-test to
more than 2 groups, still making the assumption that each of the groups has a
normal data distribution. The ANOVA test compares the differences between the
means of several classes, in terms of their number of observations and variances.
For the case of two data sets, it is identical to the t-test.
To perform the ANOVA test, we calculate the following sums-of-squares
terms from the observations yij (i = class, j = observation number). yi* is the mean
of observations in class i, and ymean is the global average. There are ni observations
in each class, k total classes, and t total observations.
SST (total sum of squares of differences) = \sum_i \sum_j (y_{ij} - y_{mean})^2
SSA (sum of squares of the differences between the class means and the global mean) = \sum_i n_i \cdot (y_{i*} - y_{mean})^2
SSE (the remaining variation, i.e. that within the classes) = SST - SSA
From these, an F value is calculated as F = (SSA/ν1)/(SSE/ν2), where the degrees of freedom are ν1 = k - 1 and ν2 = t - k.
This value of F is then used to determine the probability that the observations in the k classes could have been selected randomly from a single parent population. If the value is less than the critical values shown in tables, then the difference
between the groups is not significant at the corresponding level of probability. Figure
8.13 shows critical values for F for the case of a = 0.05 (95% confidence that not all
of the data sets come from the same parent population).
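A sketch of the calculation for three small (invented) groups, following the sums-of-squares definitions above, with scipy's f_oneway as an independent check (scipy assumed available):

import numpy as np
from scipy.stats import f_oneway

classes = [np.array([19.8, 21.2, 20.4, 22.0]),           # hypothetical groups of
           np.array([20.1, 19.5, 21.0, 20.7, 19.9]),     # measurement values
           np.array([22.3, 21.8, 23.0, 22.5])]

all_values = np.concatenate(classes)
grand_mean = all_values.mean()
k, t = len(classes), len(all_values)

SST = ((all_values - grand_mean) ** 2).sum()                        # total
SSA = sum(len(c) * (c.mean() - grand_mean) ** 2 for c in classes)   # between classes
SSE = SST - SSA                                                     # within classes

nu1, nu2 = k - 1, t - k
F = (SSA / nu1) / (SSE / nu2)
print(f"F = {F:.2f} with {nu1} and {nu2} degrees of freedom")
print(f_oneway(*classes))          # same statistic, with its probability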
Nonparametric Comparisons
The t-test and ANOVA test are flawed if the data are not actually Gaussian
or “normal” since the mean and standard deviation do not then fully describe the
data. Nonparametric tests do not make the assumption of normality and do not
use “parameters” such as mean and standard deviation to characterize the data. One
set of methods is based on using the rank order of the measured data rather than
their actual numerical values. As applied to two data sets, the Wilcoxon test sorts
the two sets of values together into order based on the measurement values, as shown schematically in Figure 8.14. Then the positions of the observations from the two data sets in the sequence are examined. If all of the members of one set are sorted to one end of the stack, it indicates that the two groups are not the same, while if they are intimately mixed together the two groups are considered indistinguishable. This test is also called the Mann-Whitney test, which generalized the original method to deal with groups of different sizes.
Figure 8.14. Principle of the Wilcoxon or Mann-Whitney test based on rank order.
This idea is much the same as examining the sequence of red and black cards
in a deck of playing cards to decide whether it has been well shuffled. The binomial
theorem allows calculation of the probability of any particular sequence of red and
black cards occurring. Shuffles that separate the red and black cards are much less
likely to occur than ones with them mixed together, and will therefore happen only
rarely by chance. To illustrate the procedure, consider the two extreme cases of data
shown in Figure 8.15. Case A has the two groups well mixed and Case B is completely segregated. The sum of rank positions of the groups are tallied and
whichever is smaller is then used to calculate the test statistic U.
U = W_1 - \frac{n_1 (n_1 + 1)}{2}    (8.5)
where Wi is the sum of rank values and ni is the number of observations in the
groups. If the value of U is less than a critical value that depends on the number of
observations in the two data sets and on a, the probability of chance occurrence by
random sampling of values from a single population, then the two groups are considered to be different with the corresponding degree of confidence. In the example,
the two groups in Case A are not distinguishable (there is a 60% probability that
they could have come from the same population) and those in Case B are (there is
a 0.9% probability that they came from the same population). Figure 8.16 shows
the critical test values of U for a = 0.01 and a = 0.05 (99 and 95% confidence respectively) for several values of ni.
Rank      1   2   3   4   5   6   7   8   9   10
Case A    1   2   1   2   1   2   1   2   1   2
Case B    1   1   1   1   1   2   2   2   2   2

Case    W1 = Rank Sum    n1    U     a
A       25               5     10    0.6015
B       15               5     0     0.0090
Figure 8.15. Example of two extreme cases for Wilcoxon comparison.
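The rank-sum bookkeeping is only a few lines of Python (a sketch that ignores tied values for simplicity; scipy.stats.mannwhitneyu performs the full test, including ties and the probability calculation):

def mann_whitney_u(group1, group2):
    # rank the pooled values, sum the ranks belonging to group1,
    # and apply equation (8.5); ties are not handled here
    pooled = sorted(group1 + group2)
    rank_of = {value: rank for rank, value in enumerate(pooled, start=1)}
    w1 = sum(rank_of[v] for v in group1)
    n1 = len(group1)
    return w1 - n1 * (n1 + 1) // 2

# two invented samples: intimately mixed vs. completely segregated values
print(mann_whitney_u([1, 3, 5, 7, 9], [2, 4, 6, 8, 10]))   # U = 10, cf. Case A
print(mann_whitney_u([1, 2, 3, 4, 5], [6, 7, 8, 9, 10]))   # U = 0,  cf. Case B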
The Kruskal-Wallis test is a generalization of the Wilcoxon or Mann-Whitney test to more than two groups, in much the same way that the ANOVA test
generalizes the t-test. The data sets are sorted into one list and their rank positions
used to calculate the parameter H based on the number of groups k, the number of
objects n (and number in each group ni), and the rank order of each observation R
(summed for those in each group). This test value H is compared to the critical
values (which come from the chi-squared distribution) for the number of degrees of
freedom (k - 1) and the probability a that this magnitude of H could have occurred
purely by chance for observations all sampled randomly from one parent group.
H = \frac{12}{n(n+1)} \cdot \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3 \cdot (n+1)    (8.6)
Figure 8.16. Critical values for the Wilcoxon (Mann-Whitney) test.
Figure 8.17 shows a tiny data set to illustrate the principle of the Kruskal-Wallis test. Three sets of measurement values are listed with the groups identified.
Sorting them into order and summing the rank order for each group gives the values
shown. Based on the number of degrees of freedom (df = k - 1), the probability that
the calculated value of H could have occurred by chance in sampling from a single
parent population is 43.5%, so we would not conclude that the three data sets are
distinguishable. Figure 8.18 shows a table of critical values for H for several values
of df and different values of a. H must exceed the table value for the difference to
be considered significant at the corresponding level of confidence.
The Wilcoxon (Mann-Whitney) and Kruskal-Wallis tests rely on sorting the
measured values into order based on their numerical magnitude, but then using the
rank order of the values in the sorted list for analysis. Sorting of values into order
is a very slow task, even for a computer, when the number of observations becomes
large. For such cases there is a more efficient nonparametric test, the Kolmogorov-Smirnov test, that uses cumulative plots of variables and compares these plots for
two data sets to find the largest difference.
Group 1 (n = 5):  24.0, 16.7, 22.8, 19.8, 18.9
Group 2 (n = 6):  23.2, 19.8, 18.1, 17.6, 20.2, 17.8
Group 3 (n = 8):  18.4, 19.1, 17.3, 19.7, 18.9, 18.8, 19.3, 17.3

Class    ni    Rank sum
1        5     61
2        6     63.5
3        8     65.5

k = 3, df = 2, n = 19, H = 1.663, a = 0.4354
Figure 8.17. Kruskal-Wallis example (3 groups of data).
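scipy reproduces the values in Figure 8.17 (a sketch; scipy applies a small correction for the tied lengths, so H and the probability agree with the figure only to about three significant figures):

from scipy.stats import kruskal

group1 = [24.0, 16.7, 22.8, 19.8, 18.9]
group2 = [23.2, 19.8, 18.1, 17.6, 20.2, 17.8]
group3 = [18.4, 19.1, 17.3, 19.7, 18.9, 18.8, 19.3, 17.3]

H, p = kruskal(group1, group2, group3)
print(f"H = {H:.3f}, probability of chance occurrence = {p:.4f}")
# expected to be close to H = 1.663 and a = 0.4354 as given in Figure 8.17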
df     a = 0.05    a = 0.025   a = 0.01
1       3.841       5.024       6.635
2       5.991       7.378       9.210
3       7.815       9.348      11.345
4       9.488      11.143      13.277
5      11.070      12.832      15.086
6      12.592      14.449      16.812
7      14.067      16.013      18.475
8      15.507      17.535      20.090
9      16.919      19.023      21.666
10     18.307      20.483      23.209
Figure 8.18. Critical value table for the Kruskal-Wallis test.
As shown in Figure 8.19, the cumulative distribution plot shows the fraction
or percentage of observations that have a value (length in the example) at least as
great as the value along the axis. Cumulative plots can be constructed with binned
data or directly from the actual measurement values. Because they are drawn with
the vertical axis in percent rather than the actual number of observations, it becomes
possible to compare distributions that have different numbers of measurements.
Figure 8.19. Cumulative plot (b) compared with usual differential histogram display.
Figure 8.20 shows the step-by-step process of performing the Kolmogorov-Smirnov test. The two data sets are plotted as cumulative distributions showing the
fraction of values as a function of length. Once the plots have been obtained, the
greatest vertical difference between them is located. Since the vertical axis is a fraction, there are no units associated with the difference value. It does not matter where
along the horizontal axis the maximum difference occurs, so the actual magnitude
of the measurement values is unimportant.
The maximum difference is compared to a test value calculated from the
number of observations in the groups ni
S = A \cdot \left( \frac{n_1 + n_2}{n_1 \cdot n_2} \right)^{1/2}    (8.7)
where the parameter A is taken from the table in Figure 8.21 to correspond to
the desired degree of confidence. For the specific case of a = 0.05 (95% probability that the two data sets did not come from sampling the same population),
A is 1.22 and the test value is 0.521. Since the maximum difference of 0.214 is
less than the test value, the conclusion is that the two data sets cannot be said to be
distinguishable.
Data Set    Length Values
A           1.023, 1.117, 1.232, 1.291, 1.305, 1.413, 1.445, 1.518, 1.602, 1.781, 1.822, 1.889, 1.904, 1.967
B           1.019, 1.224, 1.358, 1.456, 1.514, 1.640, 1.759, 1.803, 1.872

n1 = 14, n2 = 9, Max. Diff. = 0.214, A (a = 0.05) = 1.22, Test Value = 0.521
Figure 8.20. Example of applying the Kolmogorov-Smirnov test.
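The same comparison using scipy (a sketch; ks_2samp locates the same maximum difference between the two cumulative distributions that was found graphically):

from scipy.stats import ks_2samp

set_a = [1.023, 1.117, 1.232, 1.291, 1.305, 1.413, 1.445, 1.518,
         1.602, 1.781, 1.822, 1.889, 1.904, 1.967]
set_b = [1.019, 1.224, 1.358, 1.456, 1.514, 1.640, 1.759, 1.803, 1.872]

result = ks_2samp(set_a, set_b)
print(f"maximum difference = {result.statistic:.3f}")    # about 0.214
print(f"probability        = {result.pvalue:.2f}")       # large, so not distinguishable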
a     0.10    0.05    0.025   0.010
A     1.07    1.22    1.36    1.52
Figure 8.21. Critical Values for the Kolmogorov-Smirnov test.
For large data sets, the Kolmogorov-Smirnov test is a more efficient nonparametric test than the Wilcoxon because much less work is required to construct
the cumulative distributions than to sort the values into order. Figure 8.22 shows
an example of a comparison of two large populations (n1 = 889, n2 = 523). The greatest difference in the cumulative histograms is 0.118 (11.8%). For a confidence level
of 99%, a = 0.01 and A = 1.52, so the test value S is 0.084. Since the measured difference is greater than the test value, we conclude that the two populations are distinguishable and not likely to have come from the same population.
Figure 8.22. Overlay comparison of cumulative plots of two populations for the Kolmogorov-Smirnov test.
For a complete discussion of these and other nonparametric analysis tools,
see J. D. Gibbons (1985) “Nonparametric Methods for Quantitative Analysis, 2nd
Edition” American Sciences Press, Columbus, OH. Nonparametric tests are not as
efficient at providing answers of a given confidence level as parametric tests, but of
course they can be applied to normally distributed data as well as to any other data
set. They will provide the same answers as the more common and better-known
parametric tests (e.g., the t-test) when the data are actually normally distributed,
but they generally require from 50% more to twice as many data values to achieve
the same confidence level. However, unless the measured data are known to be
Gaussian in distribution (such as count data when the numbers are reasonably
large), or have been tested to verify that they are normal, it is generally safer to use
a nonparametric test since the improper use of a parametric test can lead to quite
erroneous results if the data do not meet the assumed criterion of normality.
Linear Regression
When two or more measurement parameters are recorded for each object, it
is usually of interest to look for some relationship between them. If we recorded
the height, weight and grade average of each student we might search for correlations. It would be interesting to determine the probability that there is a real correlation between the first two variables but not between them and the third one. Again,
we will want to express this as a confidence limit or a probability that any observed
correlation did not arise from chance as the data were selected from a truly random
population.
The first tool used is generally to plot the variables in two (or more) dimensions to look for visual trends and patterns. Figure 8.23 shows some of the common
ways that this is done. The point plot works well if the individual points representing data measured from one object are not so dense that they overlap. Flower plots
are produced by binning the data into cells and counting the number within each
cell, while the two-way histogram shows quite graphically where the clusters of
points lie but in doing so hides some parts of the plot. With interactive computer
graphics, the scatter plot can be extended to handle more than two dimensions by
allowing the free rotation of the data space, and color coding of points allows comparison of two or more different populations in the same space.
Since the human visual system, abetted by some prior knowledge about the
problem being studied, is very good at detecting patterns in quite noisy data, this
approach is often a very useful starting point in data analysis. However, humans are
often a bit too good at finding patterns in data (for instance, constellations in the
starry sky), and so we would like to have more objective ways to evaluate whether
a statistically significant correlation exists.
Figure 8.23. Presentation modes for data correlating two variables: a) point plot, b) flower plot, c) two-way histogram.
Linear regression makes the underlying assumption that one variable (traditionally plotted on the vertical y axis) is a dependent function of the second (plotted on the horizontal x axis), and seeks to represent the relationship by an equation of the form y = m · x + b, which is the equation for a straight line. The process then determines the values of m and b which give the best fit of the data to the line in the specific sense of minimizing the sum of squares of the vertical deviations of the line from the points. For N data points xi, yi the optimal values of m and b are
m = \frac{N \cdot \sum x_i y_i - \sum x_i \cdot \sum y_i}{N \cdot \sum x_i^2 - \left( \sum x_i \right)^2}, \qquad
b = \frac{\sum y_i - m \cdot \sum x_i}{N}    (8.8)
This is conventional linear regression. When it is used with data points that do not all carry the same weight (weighted regression), or where the data have been previously transformed in some way (for example by taking the logarithm, equivalent to plotting the data on log paper and fitting a straight line), the procedure is a bit more complex and beyond the intended scope of this chapter. Figure 8.24 shows a computer procedure that will determine m and b and also calculate the standard deviation of both values.
But how good is the line as a description of the data (or in other words how
well does it fit the points)? Figure 8.25 shows two examples with the same number
of points. In both cases the procedure outlined above determines a best fit line, but
the fit is clearly better for one data set than the other. The parameter that is generally used to describe the goodness of fit is the correlation coefficient R or R2.
// assumes data are held in arrays x[1..N], y[1..N]
sum = 0;
sumx = 0;
sumy = 0;
sumx2 = 0;
sumy2 = 0;
sumxy = 0;
for i = 1 to N do
{
    sum = sum + 1;
    sumx = sumx + x[i];
    sumy = sumy + y[i];
    sumx2 = sumx2 + x[i]*x[i];
    sumy2 = sumy2 + y[i]*y[i];
    sumxy = sumxy + x[i]*y[i];
}
dx = sum*sumx2 - sumx*sumx;
dy = sum*sumy2 - sumy*sumy;
d2 = sum*sumxy - sumx*sumy;
m = d2/dx;                            // slope
b = (sumy - m*sumx)/sum;              // intercept
ss = 1.0/(sum-2)*(sumy2 + sum*b*b + m*m*sumx2 - 2*(b*sumy - m*b*sumx + m*sumxy));   // residual variance (N - 2 degrees of freedom for the two fitted parameters)
sigma_b = sqrt(ss*sumx2/dx);          // intercept standard deviation
sigma_m = sqrt(sum*ss/dx);            // slope standard deviation
r = d2/sqrt(abs(dx*dy));              // correlation coefficient R
Figure 8.24. Example computer procedure to calculate linear regression.
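In practice a library routine gives the same slope, intercept, standard errors and R directly; a quick cross-check of the procedure in Figure 8.24 (a recent scipy is assumed, for the intercept standard error attribute; the data points are invented):

import numpy as np
from scipy.stats import linregress

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])    # hypothetical, nearly linear data

fit = linregress(x, y)
print(f"slope m     = {fit.slope:.3f} +/- {fit.stderr:.3f}")
print(f"intercept b = {fit.intercept:.3f} +/- {fit.intercept_stderr:.3f}")
print(f"R           = {fit.rvalue:.4f}")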
Figure 8.25. Data sets with high and low correlation coefficients.
R (or its square, R2) is a dimensionless value that can vary between zero (no correlation whatever—any line would be as good) and one (a perfect fit with the line passing exactly through all of the points). The sign of the R value can be either positive (y increases with x) or negative (y decreases as x increases) but only the magnitude is considered in assessing the goodness of fit. R is calculated from the product of two slopes, one treating y as a dependent function of x (minimizing the squares of the vertical deviations of the line from the points) and one treating x as a dependent function of y (minimizing the sum of squares of the horizontal deviations).
R = \frac{N \cdot \sum x_i y_i - \sum x_i \cdot \sum y_i}{\sqrt{\left[ N \cdot \sum x_i^2 - \left( \sum x_i \right)^2 \right] \cdot \left[ N \cdot \sum y_i^2 - \left( \sum y_i \right)^2 \right]}}    (8.9)
For the examples in Figure 8.25, the “good fit” data have an R value of 0.939
and the “poor fit” data a value of 0.271. How are we to assess these values? It
depends upon the number of points used in calculating the fit as well as the magnitude of R, and again we use a as the probability that a fit with the same R might
occur by chance if we were fitting the line to truly random points like salt grains
sprinkled on the table. For the case of 12 data points, the value of 0.939 would be
expected to occur only once in 10,000 times (a = 0.01%), while a value of 0.271
would be expected to occur nearly 2 times in 5 (a = 39.4%). We would consider the
first to be statistically significant but not the second.
Figure 8.26 shows the relationship for evaluating the fit. For any given
number of points, an R value above the corresponding line means that the fit is significant with the corresponding level of confidence (i.e. did not occur by chance with
a probability a). As the number of points increases, the value of R required for any
selected degree of confidence is reduced. Alpha is the probability that the apparent
fit arose by chance from uncorrelated values. The R2 value is a measure of the percentage of the variation in the y (dependent variable) values that is “explained” in a statistical sense by the fit to the independent variable x.
Figure 8.26. Probability of significance of values of the linear correlation coefficient R for N observations.
Linear regression can be extended relatively straightforwardly to deal with
more than two variables. Multiple regression tries to express one dependent
parameter z as a linear combination of many others (z = a0 + a1 · x1 + a2 · x2 + . . .). The
procedure for efficiently determining the ai values is very similar to the matrix arithmetic usually employed to calculate analysis of variance for multiple parameters. If
the xi parameters are actually values of a single parameter raised to successive powers, then the relationship is z = a0 + a1 · x + a2 · x^2 + . . . and we have polynomial
regression. An excellent reference for regression methods is N. Draper, H. Smith
(1981) “Applied Regression Analysis, 2nd Edition” J. Wiley & Sons, New York.
Stepwise multiple regression starts with a list of independent variables x and
adds and removes them one at a time, keeping only those which have ai coefficient values that
are statistically significant. We will not cover the details of the calculation here but
many computer statistics packages provide the capability. A closely related technique provided by quite a few programs plots all of the data points in a high dimensionality space (corresponding to the number of measured parameters) and then
finds the orientation of axes that best projects the data onto planes that give high
correlation coefficients. The “principal components analysis” method is another
way, like stepwise multiple regression, that identifies which of the available independent variables actually correlate with the dependent variable.
The main difference is that stepwise regression treats the independent variables separately while principal components analysis groups them together algebraically. Neither method tells the user whether the correlation reveals any physical
significance, whether the assumed dependent variable is actually causally related to
the independent one, or vice versa, or whether both may be dependent on some
other (perhaps unmeasured) parameter. Attempting to infer causality from correlation is a common error that provides some of the more egregious examples of the
misuse of statistical analysis.
Nonlinear Regression
The previous section dealt with the mathematics of fitting straight lines to
data, and referenced methods that fit polynomials and other fixed mathematical
relationships. These all assume, of course, that the user has some idea of what the
appropriate relationship between the independent and dependent variables is (as
well as correctly identifying which is which). The assumption of linearity is convenient from a calculation standpoint, but there are few reasons to expect genuine
physical circumstances to correspond to it.
When there is no reason to expect a particular functional form to the data,
we would still like to have a tool to assess the degree to which the data are correlated (one increases monotonically with variation in the other). Spearman or rank
correlation accomplishes this in a typical nonparametric way by using the rank order
of the values rather than their numerical values. The data are sorted into rank order
and each point’s numerical values for the x and y parameters is replaced by the
integer rank position of the value for the corresponding parameter. These rank positions are then plotted against each other (as illustrated in Figure 8.27) and the R
value calculated and interpreted in the usual way. This method also has advantages
in analyzing data that are strongly clustered at one end of a plot or otherwise distributed nonuniformly along a conventional linear regression line.
Figure 8.27. Comparison of regression using actual values (a) and rank order of values (b).
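Spearman's rank correlation is available directly in scipy; a sketch (the data are invented to be perfectly monotonic but strongly nonlinear, which is exactly the situation where the rank-based coefficient is more informative than the linear one):

from scipy.stats import pearsonr, spearmanr

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1, 2, 4, 8, 16, 32, 64, 128]      # hypothetical: monotonic but far from linear

rho, p = spearmanr(x, y)
r, _ = pearsonr(x, y)
print(f"Spearman rank correlation = {rho:.3f}")   # 1.000, perfectly monotonic
print(f"linear (Pearson) R        = {r:.3f}")     # noticeably smaller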