Chapter 8
Statistical Interpretation of Data
Introduction—Sources of Variability in Measurement
It is commonplace to observe that repeated measurements of what seems to
be the same object or phenomenon do not produce identical results. Measurement
variation arises from a number of sources, but one root cause is often the finite precision of the measuring tool. If a simple yardstick is used to measure carpet, we
expect to obtain a result no better than perhaps 1/4 inch, the smallest division on
the measurement scale, but this is entirely adequate for our purpose. If we try to
measure the carpet to a higher resolution, say 1/64 inch, we are likely to find that
the roughness of the edge of the carpet causes uncertainty about just where the
boundary really is, and measurements at different locations produce different
answers. To push this particular example just a bit farther, if the measurement is
performed without taking care that the ruler is perpendicular to the edges of the
carpet, the measurement will be biased and the result too large.
The limitations of the measurement tool and the difficulty in defining just
what is to be measured are two factors that limit the precision and accuracy of measurements. Precision is defined in terms of repeatability, the ability of the measurement process to duplicate the same measurement and produce the same result.
Accuracy is defined as the agreement between the measurement and some objective
standard taken as the “truth.” In many cases, the latter requires standards which
are themselves defined and measured by some national standards body such as NIST
or NPL (for example, measures of dimension). Arguing with a traffic officer that
your speedometer is very precise because it always reads 55 mph when your car is
traveling at a particular velocity will not overcome his certainty that the accuracy
of his radar detector indicated you were actually going 60 mph (arguing about the
accuracy of the calibration of the radar gun may in fact be worthwhile; the calibration and standardization of such devices can be rather poor!). Figure 8.1 suggests the difference between precision and accuracy—a highly precise but inaccurate
measurement is generally quite useless.
In a few cases, such as counting of events, it is possible to imagine an absolute
truth, but even in such an apparently simple case the ability of our measurement
process to yield the correct answer is far from certain. In one classic example of
counting errors, George Moore from the National Bureau of Standards (before
it became NIST) asked participants to count the occurrences of the letter “e” in
a paragraph of text. Aside from minor ambiguities such as whether the intent
was to count the capital “E” as well as lower case, and whether a missing “e” in a
misspelled word should be included, the results (Figure 8.2) showed that many
people systematically undercounted the occurrences (perhaps because we read by recognizing whole words, and do not naturally dwell on the individual letters) while a significant number counted more occurrences than were actually present.
Figure 8.1. Effects of accuracy and precision in shooting at a target.
A computer spell checker program can do a flawless job of counting the
letter “e” in text of course, provided that there are no ambiguities about what is to
be counted. This highlights one of the advantages that computer-assisted measurements have over manual ones. But the problem remains one of identifying the
things that are to be counted. In most cases involving images of microstructure (particularly of biological structures which tend to be more complex than man-made
ones) this is not so easy. The human visual system, aided by knowledge about the
objects of interest, the system in which they reside, and the history of the sample,
can often outperform a computer program. It is, however, also subject to distractions, and results will vary from one person to another or from one time to the next. Tests in which the same series of images is presented to an experienced observer in a different orientation or sequence, yielding quite different results, are often reported.
Figure 8.2a reproduces the text used in the counting experiment:
Nearly all laboratories where there is occasion to use extensive manual measurement of micrographs, or counting of cells or particles therein, have reason to note that certain workers consistently obtain values which are either higher or lower than the average. While it is fairly easy to count all the objects in a defined field, either visually or by machine, it is difficult to count only objects of a single class when mixed with objects of other classes. Before impeaching the eyesight or arithmetic of your colleagues, see if you can correctly count objects which you have been trained to recognize for years, specifically the letter “E.” After you have finished reading these instructions, go back to the beginning of this paragraph and count all the appearances of the letter “e” in the body of this paragraph. You may use a pencil to guide your eyes along the lines, but do not strike out or mark E’s and do not tally individual lines. Go through the text once only; do not repeat or retrace. You should not require more than 2 minutes. Please stop counting here!
Please write down your count on a separate piece of paper before you forget it or are tempted to “improve” it. You will later be given an opportunity to compare your count with the results obtained by others. Thank you for your cooperation!
Figure 8.2. The “e” counting experiment (a) and typical results (b). The true answer is 115, and the mean (average) from many trials is about 103.

In addition to the basic idea that measurement values can vary due to finite precision of measurement, natural variations also occur due to finite sampling of results. There are practically no instances in which a measurement procedure can be applied to all members of a population. Instead, we sample from the entire collection of objects and measure a smaller number. Assuming (a very large assumption) that the samples are representative of the population, the results of our measurements allow us to infer something about the entire population. Achieving a representative sample is difficult, and many of the other chapters deal to greater or lesser degrees with how sampling should be done in order to eliminate bias.
The human population affords many illustrations of improper sampling.
It would not be reasonable to use the student body of a university as a sample to
determine the average age of the population, because as a group students tend to
be age selected. It would not be reasonable to use the population of a city to represent an entire nation to assess health, because both positive and negative factors
of environment, nutrition and health care distinguish them from rural dwellers. In
fact, it is very difficult to figure out a workable scheme by which to obtain a representative and unbiased sample of the diverse human population on earth. This has
been presented as one argument against the existence of flying saucers: If space
aliens are so smart that they can build ships to travel here, they must also understand statistics; if they understand the importance of proper random sampling, we
would not expect the majority of sightings and contact reports to come from people
in the rural south who drive pickup trucks and have missing teeth; Q.E.D.
For the purposes of this chapter, we will ignore the various sources of bias
(systematic variation or offset in results) that can arise from human errors, sampling
bias, etc. It will be enough to deal with the statistical characterization of our data
and the practical tests to which we can subject them.
Distributions of Values
Since the measurement results from repeated attempts to measure the same
thing naturally vary, either when the operation is performed on the same specimen
(due to measurement precision) or when multiple samples of the population are
taken (involving also the natural random variation of the samples), it is natural to
treat them as a frequency distribution. Plotting the number of times that each measured value is obtained as a function of the value, as was done in Figure 8.2 for the
“e” experiment, shows the distribution. In cases where the measurement value is
continuous rather than discrete, it is common to “bin” the data into ranges. Such a
distribution plot is often called a histogram.
Usually for convenience the bin ranges are either scaled to fit the actual range
of data values, so as to produce a reasonable number of bins (typically 10 to 20) for
plotting, or they are set up with boundary limits that are rounded off to an appropriate degree of precision. However this is done, the important goal is to be sure
that the bins are narrow enough and there are enough of them to reveal the actual
shape of the distribution, but there are not so many bins that there are only a few
counts in each, which as we will see means that the heights of the bins are not well
defined. Figure 8.3 shows cases of plotting the same data with different numbers of
bins. A useful guide is to use no more bins than about N/20 where N is the number
of data points, as this will put an average of 20 counts into each bin. Of course, it
is the variation in counts from one bin to another that makes distributions interesting and so there must be some bins with very few counts, so this offers only a
starting point.
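As a rough illustration of this guideline, a short Python sketch (numpy is assumed to be available; the function name and the 200 simulated values are purely illustrative, not part of the original figures) might choose the binning as follows:

import numpy as np

def choose_bins(values, counts_per_bin=20, max_bins=20):
    # rule of thumb from the text: roughly N/20 bins, i.e. about 20 counts
    # per bin on average, capped at a convenient number for plotting
    n_bins = max(1, min(len(values) // counts_per_bin, max_bins))
    return np.histogram(values, bins=n_bins)

rng = np.random.default_rng(0)
data = rng.normal(50.0, 3.0, 200)      # 200 simulated measurements, cf. Figure 8.3
counts, edges = choose_bins(data)
print(len(counts), "bins:", counts)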
Plotting histogram distributions of measured data is a very common technique that communicates in a more succinct way than tables of raw data how the
measurements vary. In some cases, the distribution itself is the only meaningful way
to show this. A complicated distribution with several peaks of different sizes and
shapes is not easily reduced to fewer numbers in a way that retains the complexity
and permits comparison with other populations.
Figure 8.3. Histogram plots of 200 data points in 4, 12 and 60 bins.
Usually the reason that we perform the measurements in the first place is to
have some means of comparing a set of data from one population with a set of data
from another group. This may be to compare two groups that are both being measured, or to compare one with data taken at another time and place, or to compare
real data with some theoretical model or design specification. In any case, plotting
the two histograms for visual comparison is not a very good solution. We would
prefer to have only a few numbers extracted from the histogram that summarize the
most important aspects of the data, and tools that we can employ to compare them.
The comparisons will most often be of the form that can be reported as a
probability—usually the probability that the observed difference(s) between the two
or more groups are considered to be real rather than chance. This is a surprisingly
difficult concept for some people to grasp. We can almost never make a statement
that group A is definitely different from group B based on a limited set of data that
have variations due to the measurement process and to the sampling of the populations. Instead we say that based on the sample size we have taken there is a probability that the difference we actually observe between our two samples is small
enough to have arisen purely by chance, due to the random nature of our sampling.
If this probability is very high, say 30% or 50%, then it is risky to conclude
that the two samples really are different. If it is very low, say 1% or 5%, then we
may be encouraged to believe that the difference is a real reflection of an underlying difference between the two populations. A 5% value in this case means that only
one time in twenty would we expect to find such a large difference due solely to the
chance variation in the measurements, if we repeated the experiment. Much of the
rest of this chapter will present tools for estimating such probabilities for our data,
which will depend on just how great the differences are, how many measurements
we have made, how variable the results within each group are, and what the nature
of the actual distribution of results is.
Let us consider two extreme examples by way of illustration. Imagine measuring the lengths of two groups of straws that are all chosen randomly from the
same bucket. Just because there is an inherent variation in the measurement results
(due both to the measurement process and the variation in the straws), the average
value of the length of straws in group A might be larger than group B. But we would
expect that as the number of straws we tested in each group increased, the difference would be reduced. We would expect the statistical tests to conclude that there
was a very large probability that the two sets of data were not distinguishable and
might actually have come from the same master population.
Next, imagine measuring the heights of male and female students in a class.
We expect that for a large test sample (lots of students) the average height of the
men will be greater than that for the women, because that is what is usually observed
in the adult human population. But if our sample is small, say twenty students with
11 men and 9 women, we may not get such a clear-cut result. The variation within
the groups (men and women) may be as large or larger than that between them. But
even for a small group the statistical tests should conclude that there is a low probability that the samples could have been taken from the same population.
When political polls quote an uncertainty of several percent in their results,
they are expressing this same idea in a slightly different way. Most of us would
recognize that preferences of 49 and 51% for two candidates are not really very different if the stated uncertainty of the poll was 4 percentage points, whereas preferences of 35 and 65% probably are different. Of course, this makes the implicit
assumption that the poll has been unbiased in how it selected people from the pool
of voters, in how it asked the questions, and so forth.
Expressing statistical probabilities in a meaningful and universally understandable way is not simple, and this is why the abuse of statistical data is so widespread. When the weatherman says that there is a 50% chance of showers tomorrow,
does this mean that a) 50% of the locations within the forecast area will see some
moment of rain during the day; b) every place will experience rain for 50% of the
day; c) every location has a 50% probability of being rained upon 50% of the time;
or d) something else entirely?
The Mean, Median and Mode
The simplest way to reduce all of the data from a series of measurements to
a single value is simply to average them. The mean value is the sum of all the measurements divided by the number of measurements, and is familiar to most people
as the average. It is often used because of a belief that it must be a better reflection
of the “true” value than any single measurement, but this is not always the case; it
depends on the shape of the distribution. Notice that for the data on the “e” experiment in Figure 8.2, the mean value is not a good estimate of the true value.
There are actually three parameters that are commonly used to describe a
histogram distribution in terms of one single value. The mean or average is one; the
median and mode are the others. For discrete measurements such as the “e” experiment the mode is the value that corresponds to the peak of the distribution (108
in that example). For a continuous measurement such as length, it is usually taken
as the central value within the bin that has the highest frequency. The meaning of
the mode is simple—it is the value that is most frequently observed.
The median is the value that lies in the middle of the distribution in the sense
that just as many measured values are larger than it as are smaller than it. This
brings in the idea of ranking in order rather than looking at the values themselves,
which will be used in several of the tests in this chapter.
If a distribution is symmetric and has a single peak, then the mean or average
does correspond to the peak of the distribution (the mode) and there are just as
many values to the right and left so this is also the median. So, for a simple symmetrical distribution the mean, median and mode are identical. For a distribution
that consists of a single peak but is not symmetric, the mode lies at the top of the
peak, and the median is always closer to the mode than the mean is. This is summarized in Figure 8.4.
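For a small, positively skewed set of values (invented here purely for illustration), Python's standard library shows the three measures separating in the expected order:

import statistics

values = [4, 5, 5, 5, 6, 6, 7, 8, 11, 15]     # hypothetical, positively skewed data
print("mean  :", statistics.mean(values))      # 7.2
print("median:", statistics.median(values))    # 6.0
print("mode  :", statistics.mode(values))      # 5
# mode < median < mean, as expected for a distribution with a tail to the right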
When distributions do not have very many data points, determining the
mode can be quite difficult. Even the distribution in Figure 8.2 is “noisy” so that
repeated experiments with new groups of people would probably show the mode
varying by quite a lot compared to the mean. The mean is the most stable of the
three values, and it is also the most widely used. This is partly because it is the easiest
to determine (it is just the average), and partly because many distributions of real data are symmetrical and in fact have a very specific shape that makes the mean a particularly robust way to describe the population.
Figure 8.4. Distribution showing the mean, median and mode.
The Central Limit Theorem and the Gaussian Distribution
One of the most common distributions that is observed in natural data has
a particular shape, sometimes called the “normal” or “bell” curve; the proper name
is Gaussian. Whether we are measuring the weights of bricks, human intelligence
(the so-called IQ test), or even some stereological parameters, the distribution of
results often takes on this form. This is not an accident. The Central Limit Theorem
in statistics states that whenever there are a very large number of independent causes
of variation, each producing fluctuations in the measured result, then regardless of
what the shape of each individual distribution may be, the effect of adding them all
together is to produce a Gaussian shape for the overall distribution.
The Gaussian curve that describes the probability of observing a result of
any particular value x, and which fits the shape of the histogram distribution, is
given by equation (8.1).
G(x, m, s) = \frac{1}{s\sqrt{2\pi}} \cdot e^{-(x - m)^2 / (2 s^2)}    (8.1)
The parameters m and s are the mean and standard deviation of the
distribution, which are discussed below. The Gaussian function is a continuous
probability of the value x, and when it is compared to a discrete histogram
produced by summing occurrences in bins or ranges of values, the match is only
approximate. As discussed below, the calculation of the m and s values from the
data should properly be performed with the individual values rather than the histogram as well, although in many real cases the differences are not large.
Figure 8.5 shows several histogram distributions of normally distributed (i.e.
Gaussian) data, with the Gaussian curve described by the mean and standard deviation of the data superimposed. There are three data sets represented, each containing 100 values generated by simulation with a Gaussian probability using a
random number generator. Notice first of all that several of the individual histogram
bins vary substantially from the smooth Gaussian curve. We will return to this variation to analyze its significance in terms of the number of observed data points in
each bin later.
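The comparison in Figure 8.5 is easy to reproduce numerically; the sketch below (numpy assumed available; the mean, standard deviation and sample size are those of Figure 8.5b) bins 100 simulated values and lists, beside each observed count, the count that the Gaussian of equation (8.1) predicts for that bin:

import numpy as np

rng = np.random.default_rng(1)
m, s, n = 50.0, 3.0, 100                      # population values as in Figure 8.5b
data = rng.normal(m, s, n)

counts, edges = np.histogram(data, bins=12)
centers = 0.5 * (edges[:-1] + edges[1:])
width = edges[1] - edges[0]

# expected counts per bin from equation (8.1): n * bin width * G(x, m, s)
gauss = np.exp(-(centers - m) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))
expected = n * width * gauss

for center, observed, exp_count in zip(centers, counts, expected):
    print(f"bin at {center:5.1f}: observed {observed:3d}, expected {exp_count:5.1f}")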
There are certainly many situations in which our measured data will not
conform to a Gaussian distribution, and later in this chapter we will deal with the
statistical tests that are properly used in those cases. But when the data do fit a
Gaussian shape, the statistical tools for describing and comparing sets of data are
particularly simple, well developed and analyzed in textbooks, and easily calculated
and understood.
Variance and Standard Deviation
In Figure 8.5, notice that the three distributions have different mean values
(locations of the peak in the distribution) and different widths. Two of the distributions (b and c) have the same mean value of 50, but one is much broader than the
other. Two of the distributions (a and b) have a similar breadth, but the mean values
differ. For the examples in the figure, the means and standard deviations of the populations from which the data were sampled are:
Figure    m      s
8.5a      53.0   3.0
8.5b      50.0   3.0
8.5c      50.0   1.0
The mean value, as noted above, is simply the average of all of the measurements. For a Gaussian distribution, since it is symmetrical, the mode and
median are equal to the mean. The only remaining parameter needed to describe
the Gaussian distribution is the width of the distribution, for which either the standard deviation (s in equation 8.1) or the variance (the square of the standard deviation) is used. These descriptive parameters for a distribution of values can be
calculated whether it is actually displayed as a histogram or not.
The standard deviation is defined as the square root of the mean (average)
value of the square of the difference between each value and the mean (hence the
name “root-mean-square” or rms difference). However, calculating it according to
this definition is not efficient. A procedure for doing so with a single pass through
the data is shown in Figure 8.6 (this also includes several parameters discussed in
the next section).
Figure 8.5. Distributions of normally distributed data with differing means and standard deviations.
// assume data are held in array value[1..N]
// sum[1..5] accumulates N, Sum(x), Sum(x^2), Sum(x^3) and Sum(x^4) in a single pass
for i = 1 to 5 do
    sum[i] = 0.0;
for j = 1 to N do
{
    temp = 1.0;
    for i = 1 to 5 do
    {
        sum[i] = sum[i] + temp;
        temp = temp * value[j];
    }
}
mean = sum[2]/sum[1];
variance = sum[3]/sum[1] - mean*mean;
std_dev = sqrt(variance);
skew = ((sum[4] - 3*mean*sum[3])/sum[1] + 2*mean**3)/variance**1.5;
kurtosis = ((sum[5] - 4*sum[4]*mean + 6*sum[3]*mean**2)/sum[1] - 3*mean**4)/variance**2;
Figure 8.6. Example computer procedure to calculate statistical parameters.
Because it has the same units as the mean, the standard deviation is often
used to denote the spread of the data in the form (e.g.) 50.0 ± 3.0 cm, the units in
which the value was measured. For data that are normally distributed, this means
that 68% of the measured values are expected to fall within the range from 47.0 (50
- 3.0) to 53.0 (50 + 3.0) cm, 95% within the range 44.0 (50 - 2 · 3.0) to 56.0 (50 +
2 · 3.0) cm, and 99% within the range 41.0 (50 - 3 · 3.0) to 59.0 (50 + 3 · 3.0) cm.
The values 68, 95 and 99% come directly from integrating the mathematical
shape of the Gaussian curve within limits of ±1, 2 and 3 times the standard deviation. As written in equation (8.1), the integral of the Gaussian curve is 1.0 (meaning
that the probability function is normalized and the total probability of all possible
measurement values being observed is unity, which makes the statisticians happy).
The definite integral between the limits of ±1, 2 and 3s gives results of 0.6827,
0.9545, and 0.9973, respectively. Tables of the value and integral of the Gaussian
function are widely available in statistics books, but will not be needed for the analytical tests discussed in this chapter.
The variance is just the square of the standard deviation and another
measure of the distribution width. It is used in some statistical tests discussed later
on. Because in most cases the analyzed data set is but a sample of the population
of all possible data, the estimated variance of the population is slightly greater than
that calculated in the procedure of Figure 8.6 for the data set. The variance of the
population is N/(N - 1) times the sample variance, where N is the number of values
in the analyzed data set.
The mean value calculated from the data values sampled from a larger population is also only an estimate of the true value of the entire population. For
example, the data shown in Figure 8.5b come from a simulated population with a
mean of exactly 50, but the sample of 100 data points have a calculated mean value
of 50.009. We also need to be able to evaluate how well our calculated mean value
estimates that of the population, and again the result depends on how many data
points we have taken. The standard error of the mean is just s/√N where s is the
calculated standard deviation of the sample and N is the number of values used. It
is used in the same way as the standard deviation, namely we expect to find the true
population mean within 1, 2 or 3 standard errors of the calculated sample mean 68,
95 and 99% of the time, respectively. It is important to understand the difference
between the standard deviation of our data sample (s) and the standard error of
the calculated mean value m.
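A minimal numpy sketch (simulated data; ddof=1 applies the N/(N - 1) correction described above) makes the distinction between the two quantities concrete:

import numpy as np

rng = np.random.default_rng(2)
sample = rng.normal(50.0, 3.0, 100)      # 100 values drawn from a population with mean 50, s = 3

mean = sample.mean()
s = sample.std(ddof=1)                   # sample standard deviation (divisor N - 1)
sem = s / np.sqrt(len(sample))           # standard error of the mean, s / sqrt(N)

print(f"calculated mean            = {mean:.3f}")
print(f"standard deviation of data = {s:.3f}")
print(f"standard error of the mean = {sem:.3f}")
# the true population mean (50.0 here) is expected to lie within about
# 2 standard errors of the calculated mean roughly 95% of the time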
The mean and standard deviation are also used when measurement data are combined. Adding together two sets of data is fairly straightforward. If one set of measurements on one population sample determines a mean and standard deviation for a particular event as m1 and s1, and a second set of measurements on another sample determines the mean and standard deviation for a second event as m2 and s2, then the mean for the combined event (either one occurring) is the sum of the two individual means (m1 + m2) and the standard deviation is the square root of the sum of the two variances, (s1² + s2²)^1/2. To give a concrete example, if the rate of vehicles passing on the highway is determined in one time period as 50 ± 6.5 buses per hour, and in a second time period we count 120 ± 8.4 trucks per hour, then the combined rate of “large vehicles” (buses plus trucks) would be estimated as 170 ± 10.6. The 170 is simply 50 + 120 and the 10.6 is the square root of 112.8 = (6.5)² + (8.4)².
The situation is a bit more complicated, and certainly much less favorable, if we try to combine two events to determine a difference. Imagine that we counted 120 ± 8.4 trucks per hour in one time period and then 75 ± 7.2 eighteen-wheel transports (a subclass of trucks) in a second period. Subtracting the second class from the first would give us the rate of small trucks, and the mean value is estimated simply as 45 per hour (= 120 - 75). But the standard deviation of the net value is determined just as it was for the case of addition; multiple sources of uncertainty always add together as the square root of the sum of squares. The standard deviation is thus 11.1 (the square root of 122.4 = (8.4)² + (7.2)²) for an estimated rate of 45 ± 11.1, which is a very large relative uncertainty. This highlights the importance of always trying to measure the things we are really interested in, and not determining them by difference between two larger values.
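The vehicle-counting arithmetic can be written out in a few lines (a sketch that simply reproduces the numbers quoted above; the uncertainties combine as the square root of the sum of squares whether the means are added or subtracted):

from math import hypot

def combine(mean1, sd1, mean2, sd2, subtract=False):
    # root-sum-of-squares combination of the two standard deviations
    mean = mean1 - mean2 if subtract else mean1 + mean2
    return mean, hypot(sd1, sd2)

print(combine(50, 6.5, 120, 8.4))                 # buses + trucks: (170, 10.6...)
print(combine(120, 8.4, 75, 7.2, subtract=True))  # trucks - transports: (45, 11.06...)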
Testing Distributions for Normality—Skew and Kurtosis
In order to properly use the mean and standard deviation as descriptors of
the distribution, it is required that the data actually have a normal or Gaussian distribution. As can be seen from the graphs in Figure 8.5, visual judgment is not a
reliable guide (especially when the total number of observations is not great). With
little additional computational effort we can calculate two additional parameters,
the skew and kurtosis, that can reveal deviations from normality.
The skew and kurtosis are generally described as the third and fourth
moments or powers of the distribution. Just as the variance uses the sum of
(x - m)2, the skew uses the average value of (x - m)3 and the kurtosis uses the
average value of (x - m)4 divided by s4. Viewed in this way, the mean is the first
moment (it describes how far the data lie on the average from zero on the measurement scale) and the variance is the second moment (it uses the squares of the
deviations from the mean as a measure of the spread of the data). The skew uses
the third powers and the kurtosis the fourth powers of the values, according to the
definitions:
\text{Skew} = \frac{m_3}{m_2^{3/2}} \quad (3\text{rd moment}), \qquad \text{Kurtosis} = \frac{m_4}{m_2^{2}} \quad (4\text{th moment})
where m_k = \sum (x_i - m)^k / N. An efficient calculation method is included in the procedure shown in Figure 8.6.
The skew is a measure of how symmetrical the distribution is. A perfectly
symmetrical distribution has a skew of zero, while positive and negative values indicate distributions that have tails extending to the right (larger values) and left
(smaller values) respectively. Kurtosis is a measure of the shape of the distribution.
A perfectly Gaussian distribution has a Kurtosis value of 3.0; smaller values indicate that the distribution is flatter-topped than the Gaussian, while larger values
result from a distribution that has a high central peak. A word of caution is needed:
some statistical analysis packages subtract 3.0 from the calculated kurtosis value so
that zero corresponds to a normal distribution, positive values to ones with a central
peak, and negative values to ones that are flat-topped.
It can be quite important to use these additional parameters to test a data set
to see whether it is Gaussian, before using descriptive parameters and tests that
depend on that assumption. Figure 8.7 shows an example of four sets of values that
have the same mean and standard deviation but are very different in the way the data
are actually distributed. One set (#3 in the figure) does contain values sampled from
a normal population. The others do not—one is uniformly spaced, one is bimodal
(with two peaks), and one has a tight cluster of values with a few outliers. The skew
and kurtosis reveal this. The outliers in data set #1 produce a positive skew and the
clustering produces a large kurtosis. The kurtosis values of the bimodal and uniformly distributed data sets are smaller than 3. For the sample of data taken from a
normal distribution the values are close to the ideal zero (skewness) and 3 (kurtosis).
But how close do they actually need to be, and how can we practically use
these values to test for normality? As usual, the results must be expressed as a probability. For a given value of skewness or kurtosis calculated for an actual data set,
what is the probability that it could have resulted due to the finite sampling of a
population that is actually normal? The graphs in Figures 8.8 and 8.9 show the
answers, as a function of the number of data values actually used. If we apply these
tests to the 100 data points shown in Figure 8.5b we calculate a skew value of 0.09
and a kurtosis of 2.742, well within the values of 0.4 and range of 2.4–3.6 that the
graphs predict can happen one time in twenty (5%). Consequently there is no reason
to expect that the data did not come from a normal population and we can use that
assumption in further analysis. As the data set grows larger, the constraints on skew
and kurtosis narrow.
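In practice these two moments are usually obtained from a library rather than from the procedure of Figure 8.6; a sketch using scipy (assumed available) follows. Note that, as cautioned above, scipy's kurtosis subtracts 3.0 by default, so fisher=False is needed to match the convention used here.

import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(3)
data = rng.normal(50.0, 3.0, 100)        # a simulated normal sample, cf. Figure 8.5b

print("skew    :", skew(data))                     # close to 0 for a normal sample
print("kurtosis:", kurtosis(data, fisher=False))   # close to 3 for a normal sample
# compare the results against the 5% limits read from Figures 8.8 and 8.9 for
# N = 100 before relying on tests that assume normality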
Data Set #        1       2       3        4
Mean = m          100     100     100      100
Variance = s2     140     140     140      140
Skew              1.68    0       -0.352   0
Kurtosis          5.06    1.12    2.929    1.62
Figure 8.7. Four data sets with the same mean and standard deviation values, but different shapes as revealed by higher moments (skew and kurtosis).
Figure 8.8. The probability that the absolute value of skewness will exceed various values for random samples from a Gaussian distribution, as a function of the number of observations.
Figure 8.9. The probability that the kurtosis will exceed various values for random samples from a Gaussian distribution, as a function of the number of observations.

Some Other Common Distributions
The Gaussian or normal curve is certainly not the only distribution that can arise from simple physical processes. As one important example, many of the
procedures we use to obtain stereological data are based on counting, and a proper
understanding of the statistics of counting is therefore important. Consider a simple
event counting process, such as determining the number of cars per hour that pass
on a particular freeway. We could count for an entire hour, but it seems simpler to
count for a short period of time and scale the results up. For example, if we counted
the number of cars in a 5 minute period (e.g., 75 cars) and multiplied by 12, it should
give the number of cars per hour (12 · 75 = 900). But of course, the particular five
minute period we chose cannot be perfectly representative. Counting for a different
5 minute period (even assuming that the average rate of traffic does not change)
would produce a slightly different result. To what extent can we predict the variation, or to put it another way how confident can we be that the result we obtain is
representative of the population?
When we count events, it is not possible to get a negative result. This is immediately a clue that the results cannot have a Gaussian or normal distribution,
because the tails of the Gaussian curve extend in both directions. Instead, counting
statistics are Poisson. The Poisson function is
P(x, m) = \frac{m^{x} \cdot e^{-m}}{x!}    (8.2)
where m is the mean value and x! indicates factorial. Notice that there is no s
term in this expression, as there was for the Gaussian distribution. The width of
the Poisson distribution is uniquely determined by the number of events counted;
it is the square root of the mean value.
This simple result means that if we counted 75 cars in our 5 minute period,
the standard deviation that allows us to predict the variation of results (68%
within 1 standard deviation, 95% within 2, etc.) is simply √75 = 8.66. The five minute result of 75 ± 8.66 scales up to an hourly rate of 900 ± 103.9, which is a much less precise result than we would have obtained by counting longer. Counting 900 cars in an hour would have given a standard deviation of √900 = 30, considerably better than 104. On the other hand, counting for 1 minute would have tallied approximately 900/60 = 15 cars, √15 = 3.88, and 15 ± 3.88 scales to an estimated hourly rate of 900 ± 232.8 which is much worse. This indicates that controlling the number of events (e.g., marks on grids) that we count is vitally important
to control the precision of the estimate that we obtain for the desired quantity being
measured.
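The same arithmetic in a short sketch (pure Python; the counts and time intervals are those used in the text):

from math import sqrt

def scaled_rate(count, minutes, per_minutes=60):
    # the standard deviation of a Poisson count is the square root of the count;
    # both the count and its uncertainty scale by the same factor
    factor = per_minutes / minutes
    return count * factor, sqrt(count) * factor

for minutes, count in [(5, 75), (60, 900), (1, 15)]:
    rate, sd = scaled_rate(count, minutes)
    print(f"{count:4d} counts in {minutes:2d} min -> {rate:.0f} +/- {sd:.1f} per hour")
# 75 counts in 5 min -> 900 +/- 103.9; 900 in 60 min -> 900 +/- 30.0;
# 15 in 1 min -> 900 +/- 232.4 (the text rounds sqrt(15) to 3.88, giving 232.8)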
Figure 8.10 shows a Poisson distribution for the case of m = 2 (a mean value
of two counts). The continuous line for the P(x, m) curve is only of mathematical
interest because counting deals only with integers. But the curve does show that the
mean value (2.0) is greater than the mode (the highest point on the curve), which is
always true for the Poisson distribution. So the Poisson function has a positive skew
(and also, incidentally, a kurtosis greater than three).
Fortunately, we usually deal with a much larger number of counts than 2.
When the mean of the Poisson function is large the distribution is indistinguishable
from the Gaussian (except of course that the standard deviation s is still given by
the square root of the mean m). This means that the function and distribution
becomes symmetrical and shaped like a Gaussian, and consequently the statistical
tools for describing and comparing data sets can be used.
Figure 8.10. Poisson function and distribution for a mean of 2.
However, there is one important arena in which the assumption of large
numbers is not always met. When a distribution of any measured function is plotted,
even if the number of bins is small enough that most of them contain a large number
of counts, there are usually some bins at the extremities of the distribution that have
only a few counts in them. If these bins are compared from one distribution to
another, it must be based on their underlying Poisson nature and not on an assumption of Gaussian behavior. Comparing the overall distributions by their means and
standard deviations is fine as a way to determine if the populations can be distinguished. Trying to detect small differences at the extremes of the populations, such
as the presence of a small fraction of larger or smaller members of the population,
is much more difficult.
Another distribution that shows up often enough in stereological
measurement to take note of is the log-normal distribution. This is a histogram
in which the horizontal axis is not the measured value but rather the logarithm
of the measured value, but the resulting shape of the distribution is normal or
Gaussian. Note that the bins in the distribution are still of uniform width, which
means they each cover the same ratio of size values (or whatever the measurement
records) rather than the same increment of sizes. This type of distribution is
often observed for particle sizes, since physical processes such as the grinding and
fracturing of brittle materials produce this behavior. Figure 8.11 shows a histogram
distribution for log-normally distributed data; on a linear scale it is positively
skewed.
For a thorough discussion of various distributions that occur and the tools
available to analyze them, see P. R. Bevington (1969) “Data Reduction and Error
Analysis for the Physical Sciences” McGraw Hill, New York.
Comparing Sets of Measurements—The T-Test
As mentioned at the beginning of this chapter, one common purpose of statistical analysis is to determine the likelihood that two (or more) sets of measurements (for example, on different specimens) are the same or different. The answer
to this question is usually expressed as a probability that the two samples of data
could have come from the same original population. If this probability is very low,
it is often taken as an indication that the specimens are really different. A probability of 5% (often written as a = 0.05) that two data sets could have come from the
same population and have slightly different descriptive parameters (mean, etc.)
purely due to sampling variability is often expressed as a 95% confidence level that
they are different. It is in fact much more difficult to prove the opposite position,
that the two specimens really are the same.
When the data are normally distributed, a very efficient comparison can be
made with student’s t-test. This compares the mean and standard deviation values
for two populations, and calculates a probability that the two data sets are really
drawn from the same master population and that the observed differences have
arisen by chance due to sampling. Remember that for a Gaussian distribution, the
mean and standard deviation contain all of the descriptive information that is
needed (the skew and kurtosis are fixed).
Figure 8.11. Linear and logarithmic scale histogram plots of the same set of log-normally distributed values.
Given two normal (Gaussian) sets of data, each with ni observations and
characterized by a mean value mi and a standard deviation si, what is the probability that they are significantly different? This is calculated from the difference in the
mean values, in terms of the magnitude of the standard deviations and number of
observations in each data set, as shown in equation (8.3); ν is the number of degrees of freedom, which is needed to assess the probabilities in Figure 8.12.

T = \frac{m_1 - m_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}, \qquad
\nu = \frac{\left( \dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2} \right)^{2}}{\dfrac{\left( s_1^2 / n_1 \right)^{2}}{n_1 - 1} + \dfrac{\left( s_2^2 / n_2 \right)^{2}}{n_2 - 1}}    (8.3)
Figure 8.12. Critical values for t-test comparison as a function of the number of degrees
of freedom.
The larger the difference of means (relative to the standard deviations) and
the larger the data sets, the more likely it is that the two sets of observations could
not have come from a single population by random selection. The procedure is to
compare the value of the parameter T to the table of student’s t values shown by
the graph in Figure 8.12 for the number of degrees of freedom and probability a.
If the magnitude of T is less than the table value, it indicates that the two data sets
could have come from the same population, with the differences arising simply from
random selection. If it exceeds the table value, it indicates the two sets are probably different at the corresponding level of confidence 100 · (1 - a); for example, a =
0.01 corresponds to a 99% confidence level that the two groups are not the same.
The value of a is given for a double-sided test (the two means are “different”). In a single sided test (deciding whether one mean is “greater” or “less” than
the second), the value of a is halved. Notice that the curves of critical values level
off quickly as the number of degrees of freedom increases. For typical data sets with
at least tens of observations in each, only the asymptotic limit values are needed.
Hence a value of T greater than 1.645 indicates that the groups are not the same with a confidence level of 90% (a = 0.10), and values greater than 1.960 and 2.576 give corresponding confidence levels of 95% (a = 0.05) and 99% (a = 0.01) respectively. Common and convenient use is made of the easily remembered approximate test value T ≥ 2.0 for 95% confidence.
Comparing the two sets of data in Figure 8.5a and 8.5b using the t-test gives
the following results:
Data Set     m        s       n
a            52.976   3.110   100
b            50.009   3.152   100
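The value of T can be reproduced directly from these summary values; a sketch (scipy assumed available, and its Welch option corresponds to the unequal-variance form of equation 8.3):

from math import sqrt
from scipy.stats import ttest_ind_from_stats

m1, s1, n1 = 52.976, 3.110, 100
m2, s2, n2 = 50.009, 3.152, 100

# direct evaluation of equation (8.3)
T = (m1 - m2) / sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
print(f"T = {T:.2f}")                              # roughly 6.7

# cross-check using scipy's t-test from summary statistics
t, p = ttest_ind_from_stats(m1, s1, n1, m2, s2, n2, equal_var=False)
print(f"t = {t:.2f}, two-sided probability = {p:.2g}")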
The calculated value of T is approximately 6.7. The number of degrees of freedom is large enough to consider just the asymptotic values, and since T far exceeds even the 99% critical value of 2.576, we conclude with better than 99% confidence that the two data sets are different.

          ν1 = 3   ν1 = 5   ν1 = 10   ν1 = 40   ν1 = ∞
ν2 = 3     9.28     9.01     8.79      8.59      8.53
ν2 = 5     5.41     5.05     4.74      4.46      4.36
ν2 = 10    3.71     3.33     2.98      2.66      2.54
ν2 = 40    2.84     2.45     2.08      1.69      1.51
ν2 = ∞     2.60     2.21     1.83      1.39      1.00
Figure 8.13. Critical values of F for a = 0.05 (ANOVA test).
The Analysis of Variance or ANOVA test is a generalization of the t-test to
more than 2 groups, still making the assumption that each of the groups has a
normal data distribution. The ANOVA test compares the differences between the
means of several classes, in terms of their number of observations and variances.
For the case of two data sets, it is identical to the t-test.
To perform the ANOVA test, we calculate the following sums-of-squares
terms from the observations yij (i = class, j = observation number). yi* is the mean
of observations in class i, and ymean is the global average. There are ni observations
in each class, k total classes, and t total observations.
SST (total sum of squares of differences) = \sum_i \sum_j (y_{ij} - y_{mean})^2
SSA (sum of squares of the differences between the class means and the global mean) = \sum_i n_i \cdot (y_{i*} - y_{mean})^2
SSE (the remaining variation, i.e. that within the classes) = SST - SSA
From these, an F value is calculated as F = (SSA/ν1)/(SSE/ν2), where the degrees of freedom are ν1 = k - 1 and ν2 = t - k.
This value of F is then used to determine the probability that the observations in the k classes could have been selected randomly from a single parent population. If the value is less than the critical values shown in tables, then the difference
between the groups is not significant at the corresponding level of probability. Figure
8.13 shows critical values for F for the case of a = 0.05 (95% confidence that not all
of the data sets come from the same parent population).
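A sketch of the calculation for three small (invented) groups, following the sums-of-squares definitions above, with scipy's f_oneway as an independent check (scipy assumed available):

import numpy as np
from scipy.stats import f_oneway

classes = [np.array([19.8, 21.2, 20.4, 22.0]),           # hypothetical groups of
           np.array([20.1, 19.5, 21.0, 20.7, 19.9]),     # measurement values
           np.array([22.3, 21.8, 23.0, 22.5])]

all_values = np.concatenate(classes)
grand_mean = all_values.mean()
k, t = len(classes), len(all_values)

SST = ((all_values - grand_mean) ** 2).sum()                        # total
SSA = sum(len(c) * (c.mean() - grand_mean) ** 2 for c in classes)   # between classes
SSE = SST - SSA                                                     # within classes

nu1, nu2 = k - 1, t - k
F = (SSA / nu1) / (SSE / nu2)
print(f"F = {F:.2f} with {nu1} and {nu2} degrees of freedom")
print(f_oneway(*classes))          # same statistic, with its probability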
Nonparametric Comparisons
The t-test and ANOVA test are flawed if the data are not actually Gaussian
or “normal” since the mean and standard deviation do not then fully describe the
data. Nonparametric tests do not make the assumption of normality and do not
use “parameters” such as mean and standard deviation to characterize the data. One
set of methods is based on using the rank order of the measured data rather than
their actual numerical values. As applied to two data sets, the Wilcoxon test sorts
the two sets of values together into order based on the measurement values, as shown schematically in Figure 8.14. Then the positions of the observations from the two data sets in the sequence are examined. If all of the members of one set are sorted to one end of the stack, it indicates that the two groups are not the same, while if they are intimately mixed together the two groups are considered indistinguishable. This test is also called the Mann-Whitney test, which generalized the original method to deal with groups of different sizes.
Figure 8.14. Principle of the Wilcoxon or Mann-Whitney test based on rank order.
This idea is much the same as examining the sequence of red and black cards
in a deck of playing cards to decide whether it has been well shuffled. The binomial
theorem allows calculation of the probability of any particular sequence of red and
black cards occurring. Shuffles that separate the red and black cards are much less
likely to occur than ones with them mixed together, and will therefore happen only
rarely by chance. To illustrate the procedure, consider the two extreme cases of data
shown in Figure 8.15. Case A has the two groups well mixed and Case B is completely segregated. The sum of rank positions of the groups are tallied and
whichever is smaller is then used to calculate the test statistic U.
U = W_1 - \frac{n_1 (n_1 + 1)}{2}    (8.5)
where Wi is the sum of rank values and ni is the number of observations in the
groups. If the value of U is less than a critical value that depends on the number of
observations in the two data sets and on a, the probability of chance occurrence by
random sampling of values from a single population, then the two groups are considered to be different with the corresponding degree of confidence. In the example,
the two groups in Case A are not distinguishable (there is a 60% probability that
they could have come from the same population) and those in Case B are (there is
a 0.9% probability that they came from the same population). Figure 8.16 shows
the critical test values of U for a = 0.01 and a = 0.05 (99 and 95% confidence respectively) for several values of ni.
Rank      1   2   3   4   5   6   7   8   9   10
Case A    1   2   1   2   1   2   1   2   1   2
Case B    1   1   1   1   1   2   2   2   2   2

Case    W1 = Rank Sum    n1    U     a
A       25               5     10    0.6015
B       15               5     0     0.0090
Figure 8.15. Example of two extreme cases for Wilcoxon comparison.
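The rank-sum bookkeeping is only a few lines of Python (a sketch that ignores tied values for simplicity; scipy.stats.mannwhitneyu performs the full test, including ties and the probability calculation):

def mann_whitney_u(group1, group2):
    # rank the pooled values, sum the ranks belonging to group1,
    # and apply equation (8.5); ties are not handled here
    pooled = sorted(group1 + group2)
    rank_of = {value: rank for rank, value in enumerate(pooled, start=1)}
    w1 = sum(rank_of[v] for v in group1)
    n1 = len(group1)
    return w1 - n1 * (n1 + 1) // 2

# two invented samples: intimately mixed vs. completely segregated values
print(mann_whitney_u([1, 3, 5, 7, 9], [2, 4, 6, 8, 10]))   # U = 10, cf. Case A
print(mann_whitney_u([1, 2, 3, 4, 5], [6, 7, 8, 9, 10]))   # U = 0,  cf. Case B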
The Kruskal-Wallis test is a generalization of the Wilcoxon or Mann-Whitney test to more than two groups, in much the same way that the ANOVA test
generalizes the t-test. The data sets are sorted into one list and their rank positions
used to calculate the parameter H based on the number of groups k, the number of
objects n (and number in each group ni), and the rank order of each observation R
(summed for those in each group). This test value H is compared to the critical
values (which come from the chi-squared distribution) for the number of degrees of
freedom (k - 1) and the probability a that this magnitude of H could have occurred
purely by chance for observations all sampled randomly from one parent group.
H = \frac{12}{n(n+1)} \cdot \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3 \cdot (n+1)    (8.6)
Figure 8.16. Critical values for the Wilcoxon (Mann-Whitney) test.
Figure 8.17 shows a tiny data set to illustrate the principle of the Kruskal-Wallis test. Three sets of measurement values are listed with the groups identified.
Sorting them into order and summing the rank order for each group gives the values
shown. Based on the number of degrees of freedom (df = k - 1), the probability that
the calculated value of H could have occurred by chance in sampling from a single
parent population is 43.5%, so we would not conclude that the three data sets are
distinguishable. Figure 8.18 shows a table of critical values for H for several values
of df and different values of a. H must exceed the table value for the difference to
be considered significant at the corresponding level of confidence.
The Wilcoxon (Mann-Whitney) and Kruskal-Wallis tests rely on sorting the
measured values into order based on their numerical magnitude, but then using the
rank order of the values in the sorted list for analysis. Sorting of values into order
is a very slow task, even for a computer, when the number of observations becomes
large. For such cases there is a more efficient nonparametric test, the Kolmogorov-Smirnov test, that uses cumulative plots of variables and compares these plots for
two data sets to find the largest difference.
Group 1 (n = 5):  24.0, 16.7, 22.8, 19.8, 18.9
Group 2 (n = 6):  23.2, 19.8, 18.1, 17.6, 20.2, 17.8
Group 3 (n = 8):  18.4, 19.1, 17.3, 19.7, 18.9, 18.8, 19.3, 17.3

Class    ni    Rank sum
1        5     61
2        6     63.5
3        8     65.5

k = 3, df = 2, n = 19, H = 1.663, a = 0.4354
Figure 8.17. Kruskal-Wallis example (3 groups of data).
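scipy reproduces the values in Figure 8.17 (a sketch; scipy applies a small correction for the tied lengths, so H and the probability agree with the figure only to about three significant figures):

from scipy.stats import kruskal

group1 = [24.0, 16.7, 22.8, 19.8, 18.9]
group2 = [23.2, 19.8, 18.1, 17.6, 20.2, 17.8]
group3 = [18.4, 19.1, 17.3, 19.7, 18.9, 18.8, 19.3, 17.3]

H, p = kruskal(group1, group2, group3)
print(f"H = {H:.3f}, probability of chance occurrence = {p:.4f}")
# expected to be close to H = 1.663 and a = 0.4354 as given in Figure 8.17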
df     a = 0.05    a = 0.025   a = 0.01
1       3.841       5.024       6.635
2       5.991       7.378       9.210
3       7.815       9.348      11.345
4       9.488      11.143      13.277
5      11.070      12.832      15.086
6      12.592      14.449      16.812
7      14.067      16.013      18.475
8      15.507      17.535      20.090
9      16.919      19.023      21.666
10     18.307      20.483      23.209
Figure 8.18. Critical value table for the Kruskal-Wallis test.
As shown in Figure 8.19, the cumulative distribution plot shows the fraction
or percentage of observations that have a value (length in the example) at least as
great as the value along the axis. Cumulative plots can be constructed with binned
data or directly from the actual measurement values. Because they are drawn with
the vertical axis in percent rather than the actual number of observations, it becomes
possible to compare distributions that have different numbers of measurements.
Figure 8.19. Cumulative plot (b) compared with usual differential histogram display.
Figure 8.20 shows the step-by-step process of performing the Kolmogorov-Smirnov test. The two data sets are plotted as cumulative distributions showing the
fraction of values as a function of length. Once the plots have been obtained, the
greatest vertical difference between them is located. Since the vertical axis is a fraction, there are no units associated with the difference value. It does not matter where
along the horizontal axis the maximum difference occurs, so the actual magnitude
of the measurement values is unimportant.
The maximum difference is compared to a test value calculated from the
number of observations in the groups ni
S = A \cdot \left( \frac{n_1 + n_2}{n_1 \cdot n_2} \right)^{1/2}    (8.7)
where the parameter A is taken from the table in Figure 8.21 to correspond to
the desired degree of confidence. For the specific case of a = 0.05 (95% probability that the two data sets did not come from sampling the same population),
A is 1.22 and the test value is 0.521. Since the maximum difference of 0.214 is
less than the test value, the conclusion is that the two data sets cannot be said to be
distinguishable.
Data Set    Length Values
A           1.023, 1.117, 1.232, 1.291, 1.305, 1.413, 1.445, 1.518, 1.602, 1.781, 1.822, 1.889, 1.904, 1.967
B           1.019, 1.224, 1.358, 1.456, 1.514, 1.640, 1.759, 1.803, 1.872

n1 = 14, n2 = 9, Max. Diff. = 0.214, A (a = 0.05) = 1.22, Test Value = 0.521
Figure 8.20. Example of applying the Kolmogorov-Smirnov test.
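The same comparison using scipy (a sketch; ks_2samp locates the same maximum difference between the two cumulative distributions that was found graphically):

from scipy.stats import ks_2samp

set_a = [1.023, 1.117, 1.232, 1.291, 1.305, 1.413, 1.445, 1.518,
         1.602, 1.781, 1.822, 1.889, 1.904, 1.967]
set_b = [1.019, 1.224, 1.358, 1.456, 1.514, 1.640, 1.759, 1.803, 1.872]

result = ks_2samp(set_a, set_b)
print(f"maximum difference = {result.statistic:.3f}")    # about 0.214
print(f"probability        = {result.pvalue:.2f}")       # large, so not distinguishable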
a     0.10    0.05    0.025   0.010
A     1.07    1.22    1.36    1.52
Figure 8.21. Critical Values for the Kolmogorov-Smirnov test.
For large data sets, the Kolmogorov-Smirnov test is a more efficient nonparametric test than the Wilcoxon because much less work is required to construct
the cumulative distributions than to sort the values into order. Figure 8.22 shows
an example of a comparison of two large populations (n1 = 889, n2 = 523). The greatest difference in the cumulative histograms is 0.118 (11.8%). For a confidence level
of 99%, a = 0.01 and A = 1.52, so the test value S is 0.084. Since the measured difference is greater than the test value, we conclude that the two populations are distinguishable and not likely to have come from the same population.
Figure 8.22. Overlay comparison of cumulative plots of two populations for the Kolmogorov-Smirnov test.
For a complete discussion of these and other nonparametric analysis tools,
see J. D. Gibbons (1985) “Nonparametric Methods for Quantitative Analysis, 2nd
Edition” American Sciences Press, Columbus, OH. Nonparametric tests are not as
efficient at providing answers of a given confidence level as parametric tests, but of
course they can be applied to normally distributed data as well as to any other data
set. They will provide the same answers as the more common and better-known
parametric tests (e.g., the t-test) when the data are actually normally distributed,
but they generally require from 50% more to twice as many data values to achieve
the same confidence level. However, unless the measured data are known to be
Gaussian in distribution (such as count data when the numbers are reasonably
large), or have been tested to verify that they are normal, it is generally safer to use
a nonparametric test since the improper use of a parametric test can lead to quite
erroneous results if the data do not meet the assumed criterion of normality.
Linear Regression
When two or more measurement parameters are recorded for each object, it
is usually of interest to look for some relationship between them. If we recorded
the height, weight and grade average of each student we might search for correlations. It would be interesting to determine the probability that there is a real correlation between the first two variables but not between them and the third one. Again,
we will want to express this as a confidence limit or a probability that any observed
correlation did not arise from chance as the data were selected from a truly random
population.
The first tool used is generally to plot the variables in two (or more) dimensions to look for visual trends and patterns. Figure 8.23 shows some of the common
ways that this is done. The point plot works well if the individual points representing data measured from one object are not so dense that they overlap. Flower plots
are produced by binning the data into cells and counting the number within each
cell, while the two-way histogram shows quite graphically where the clusters of
points lie but in doing so hides some parts of the plot. With interactive computer
graphics, the scatter plot can be extended to handle more than two dimensions by
allowing the free rotation of the data space, and color coding of points allows comparison of two or more different populations in the same space.
Since the human visual system, abetted by some prior knowledge about the
problem being studied, is very good at detecting patterns in quite noisy data, this
approach is often a very useful starting point in data analysis. However, humans are
often a bit too good at finding patterns in data (for instance, constellations in the
starry sky), and so we would like to have more objective ways to evaluate whether
a statistically significant correlation exists.
Figure 8.23. Presentation modes for data correlating two variables: a) point plot, b) flower plot, c) two-way histogram.
Linear regression makes the underlying assumption that one variable (traditionally plotted on the vertical y axis) is a dependent function of the second (plotted on the horizontal x axis), and seeks to represent the relationship by an equation of the form y = m · x + b, which is the equation for a straight line. The process then determines the values of m and b which give the best fit of the data to the line in the specific sense of minimizing the sum of squares of the vertical deviations of the line from the points. For N data points xi, yi the optimal values of m and b are
m = \frac{N \cdot \sum x_i y_i - \sum x_i \cdot \sum y_i}{N \cdot \sum x_i^2 - \left( \sum x_i \right)^2}, \qquad
b = \frac{\sum y_i - m \cdot \sum x_i}{N}    (8.8)
This is conventional linear regression. When it is used with data points that do not all carry the same weight (weighted regression), or where the data have been previously transformed in some way (for example by taking the logarithm, equivalent to plotting the data on log paper and fitting a straight line), the procedure is a bit more complex and beyond the intended scope of this chapter. Figure 8.24 shows a computer procedure that will determine m and b and also calculate the standard deviation of both values.
But how good is the line as a description of the data (or in other words how
well does it fit the points)? Figure 8.25 shows two examples with the same number
of points. In both cases the procedure outlined above determines a best fit line, but
the fit is clearly better for one data set than the other. The parameter that is generally used to describe the goodness of fit is the correlation coefficient R or R2.
// assumes data are held in arrays x[1..N], y[1..N]
sum = 0;
sumx = 0;
sumy = 0;
sumx2 = 0;
sumy2 = 0;
sumxy = 0;
for i = 1 to N do
{
    sum = sum + 1;
    sumx = sumx + x[i];
    sumy = sumy + y[i];
    sumx2 = sumx2 + x[i]*x[i];
    sumy2 = sumy2 + y[i]*y[i];
    sumxy = sumxy + x[i]*y[i];
}
dx = sum*sumx2 - sumx*sumx;
dy = sum*sumy2 - sumy*sumy;
d2 = sum*sumxy - sumx*sumy;
m = d2/dx;                            // slope
b = (sumy - m*sumx)/sum;              // intercept
ss = 1.0/(sum-2)*(sumy2 + sum*b*b + m*m*sumx2 - 2*(b*sumy - m*b*sumx + m*sumxy));   // residual variance (N - 2 degrees of freedom for the two fitted parameters)
sigma_b = sqrt(ss*sumx2/dx);          // intercept standard deviation
sigma_m = sqrt(sum*ss/dx);            // slope standard deviation
r = d2/sqrt(abs(dx*dy));              // correlation coefficient R
Figure 8.24. Example computer procedure to calculate linear regression.
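In practice a library routine gives the same slope, intercept, standard errors and R directly; a quick cross-check of the procedure in Figure 8.24 (a recent scipy is assumed, for the intercept standard error attribute; the data points are invented):

import numpy as np
from scipy.stats import linregress

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])    # hypothetical, nearly linear data

fit = linregress(x, y)
print(f"slope m     = {fit.slope:.3f} +/- {fit.stderr:.3f}")
print(f"intercept b = {fit.intercept:.3f} +/- {fit.intercept_stderr:.3f}")
print(f"R           = {fit.rvalue:.4f}")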
Figure 8.25. Data sets with high and low correlation coefficients.
R (or its square, R2) is a dimensionless value that can vary between zero (no correlation whatever—any line would be as good) and one (a perfect fit with the line passing exactly through all of the points). The sign of the R value can be either positive (y increases with x) or negative (y decreases as x increases) but only the magnitude is considered in assessing the goodness of fit. R is calculated from the product of two slopes, one treating y as a dependent function of x (minimizing the squares of the vertical deviations of the line from the points) and one treating x as a dependent function of y (minimizing the sum of squares of the horizontal deviations).
R = \frac{N \cdot \sum x_i y_i - \sum x_i \cdot \sum y_i}{\sqrt{\left[ N \cdot \sum x_i^2 - \left( \sum x_i \right)^2 \right] \cdot \left[ N \cdot \sum y_i^2 - \left( \sum y_i \right)^2 \right]}}    (8.9)
For the examples in Figure 8.25, the “good fit” data have an R value of 0.939
and the “poor fit” data a value of 0.271. How are we to assess these values? It
depends upon the number of points used in calculating the fit as well as the magnitude of R, and again we use a as the probability that a fit with the same R might
occur by chance if we were fitting the line to truly random points like salt grains
sprinkled on the table. For the case of 12 data points, the value of 0.939 would be
expected to occur only once in 10,000 times (a = 0.01%), while a value of 0.271
would be expected to occur nearly 2 times in 5 (a = 39.4%). We would consider the
first to be statistically significant but not the second.
Figure 8.26 shows the relationship for evaluating the fit. For any given
number of points, an R value above the corresponding line means that the fit is significant with the corresponding level of confidence (i.e. did not occur by chance with
a probability a). As the number of points increases, the value of R required for any
selected degree of confidence is reduced. Alpha is the probability that the apparent
fit arose by chance from uncorrelated values. The R2 value is a measure of the percentage of the variation in the y (dependent variable) values that is “explained” in a statistical sense by the fit to the independent variable x.
Figure 8.26. Probability of significance of values of the linear correlation coefficient R for N observations.
Linear regression can be extended relatively straightforwardly to deal with
more than two variables. Multiple regression tries to express one dependent
parameter z as a linear combination of many others (z = a0 + a1 · x1 + a2 · x2 + . . .). The
procedure for efficiently determining the ai values is very similar to the matrix arithmetic usually employed to calculate analysis of variance for multiple parameters. If
the xi parameters are actually values of a single parameter raised to successive powers, then the relationship is z = a0 + a1 · x + a2 · x^2 + . . . and we have polynomial
regression. An excellent reference for regression methods is N. Draper, H. Smith
(1981) “Applied Regression Analysis, 2nd Edition” J. Wiley & Sons, New York.
Stepwise multiple regression starts with a list of independent variables x and
adds and removes them one at a time, keeping only those which have ai coefficient values that
are statistically significant. We will not cover the details of the calculation here but
many computer statistics packages provide the capability. A closely related technique provided by quite a few programs plots all of the data points in a high dimensionality space (corresponding to the number of measured parameters) and then
finds the orientation of axes that best projects the data onto planes that give high
correlation coefficients. The “principal components analysis” method is another
way, like stepwise multiple regression, that identifies which of the available independent variables actually correlate with the dependent variable.
The main difference is that stepwise regression treats the independent variables separately while principal components analysis groups them together algebraically. Neither method tells the user whether the correlation reveals any physical
significance, whether the assumed dependent variable is actually causally related to
the independent one, or vice versa, or whether both may be dependent on some
other (perhaps unmeasured) parameter. Attempting to infer causality from correlation is a common error that provides some of the more egregious examples of the
misuse of statistical analysis.
Nonlinear Regression
The previous section dealt with the mathematics of fitting straight lines to
data, and referenced methods that fit polynomials and other fixed mathematical
relationships. These all assume, of course, that the user has some idea of what the
appropriate relationship between the independent and dependent variables is (as
well as correctly identifying which is which). The assumption of linearity is convenient from a calculation standpoint, but there are few reasons to expect genuine
physical circumstances to correspond to it.
When there is no reason to expect a particular functional form to the data,
we would still like to have a tool to assess the degree to which the data are correlated (one increases monotonically with variation in the other). Spearman or rank
correlation accomplishes this in a typical nonparametric way by using the rank order
of the values rather than their numerical values. The data are sorted into rank order
and each point’s numerical values for the x and y parameters is replaced by the
integer rank position of the value for the corresponding parameter. These rank positions are then plotted against each other (as illustrated in Figure 8.27) and the R
value calculated and interpreted in the usual way. This method also has advantages
in analyzing data that are strongly clustered at one end of a plot or otherwise distributed nonuniformly along a conventional linear regression line.
Figure 8.27. Comparison of regression using actual values (a) and rank order of values (b).
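Spearman's rank correlation is available directly in scipy; a sketch (the data are invented to be perfectly monotonic but strongly nonlinear, which is exactly the situation where the rank-based coefficient is more informative than the linear one):

from scipy.stats import pearsonr, spearmanr

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1, 2, 4, 8, 16, 32, 64, 128]      # hypothetical: monotonic but far from linear

rho, p = spearmanr(x, y)
r, _ = pearsonr(x, y)
print(f"Spearman rank correlation = {rho:.3f}")   # 1.000, perfectly monotonic
print(f"linear (Pearson) R        = {r:.3f}")     # noticeably smaller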