Comparing Methods Used to Combine Multi-level Samples of Compositional Data

Murray Todd Williams¹

Abstract. Accuracy assessment of categorical data with complex sampling designs is an important tool in the use of thematic maps and other applications. One notable development is the composite estimator, an application of the Kalman filter. This paper reviews the composite estimator and compares it with standard weighted least squares regression. In many cases the two methods produce identical results, although there are situations in which one has advantages over the other. The conclusion also raises some critical questions that arise from these classical methods, and it examines contemporary analyses of compositional data. In conjunction with the methodology described here, a public-domain software application has been developed.

INTRODUCTION

Much of the work discussed here emerged from natural resource issues, specifically the analysis of forest inventories carried out by the USDA Forest Service. However, the range of applications for this work has already proven to be quite broad. Let us broadly define the sort of data that can benefit from these analyses.

In single-level categorical data, measured quantities (frequently discrete) are each assigned to one of a group of mutually exclusive categories. One example is categorizing the type of forest in a section of land by tree species. Another example is a typical blood test in which the numbers of specific components (red blood cells, lymphocytes, etc.) are counted.

Multi-stage and multi-phase categorical sampling uses multiple classifiers to categorize each sampling unit.
For example, in the forest inventory situation above, the composition of a section of land can be measured using three different methods: satellite imaging, aerial photography (photo interpretation), and manual observations in the field. In this example, let a primary sampling unit be a section of land with an area of 47 acres. Table 1 shows an example where the three identification methods were used to classify each acre of land. Notice that 38 acres of land were classified the same by each method. For the other 9 acres, there is some degree of disagreement between categories. (These situations are shaded in the table.) In most situations a high level of agreement is considered optimal. Notice that Table 1 also displays the quantity of each multiple classification as a proportion of the total. This vector of proportions is referred to as a composition.

¹ Student, Department of Statistics, Colorado State University, Ft. Collins, CO

Table 1. Multi-level data in a typical primary sampling unit.

Ground Surveys   Satellite Imaging   Aerial Photography   Acres   Composition
Non-forest       Non-forest          Non-forest              12         0.255
Light forest     Light forest        Light forest            15         0.319
Heavy forest     Heavy forest        Heavy forest            11         0.234
[shaded rows for the 9 acres with disagreeing classifications are missing]
Total:                                                       47         1.000

There are many situations in which this type of data is encountered. Some of the possible applications are:

Multiple classification techniques.
In these situations, various techniques might have different levels of accuracy. Higher accuracy frequently comes at higher cost. In the above example, satellite data might be readily available and therefore inexpensive, but the degree of accuracy obtained may be fairly poor. Ground surveys are considered to be the most accurate, but they are also exceedingly expensive.

Multiple time periods.
A large database of monitoring information may be available for previous time periods (for example, large surveys taken in 1970 and 1980), whereas data for a recent time period (1990) are more limited. The larger two-period database may be combined with the small sample of plots that have all three time periods in order to obtain a better estimate of the forest composition in 1990.

Construction of the Sample Composition Estimates

Consider an $(n \times p)$ data matrix $X$ with $n$ independent primary sampling units (PSUs) and $p$ distinct categories. Note that the category labels resemble those in Table 1. Also let each row of $X$ represent a composition, thus summing to one. The calculations of the composition estimate $Y$ (the sample mean of the data matrix $X$) and its sample covariance are straightforward:

\[ Y = \frac{1}{n} X^T \mathbf{1} \quad \text{and} \quad \Sigma_X = \frac{1}{n} X^T \left( I - \frac{1}{n} \mathbf{1}\mathbf{1}^T \right) X \tag{1} \]

where $\mathbf{1}$ is the $n$-vector of ones.

COMPOSITE ESTIMATOR

The composite estimator was first applied to categorical analyses by Czaplewski (1992) as an application of the Kalman filter. Rather than motivate its derivation, I will simply describe its application.

Consider two composition estimates. The estimates may have any dimensionality, but for this example we will define a two-dimensional composition $Y_1$ and a three-dimensional composition $Y_2$, where the dimension is the number of classification methods. $Y_2$ is a vector with the unit-sum constraint whose elements correspond to the rows in Table 1. $Y_1$ has categories defined on only two classifications: satellite imaging and aerial photography (see Table 2). It is now necessary to construct a binary transformation matrix that defines the unique mapping between the categories of $Y_1$ and those of $Y_2$. Take this partition of the binary matrix and call it $H$. Because the estimator below uses the difference $Y_1 - H Y_2$, $H$ has one row for each category of $Y_1$ and one column for each category of $Y_2$. An example of $H$ is shown as a subset of the binary matrix in Table 3.

Table 2. A typical primary sampling unit for a two-dimensional composition $Y_1$.

Satellite Imaging   Aerial Photography         Acres   Composition
Non-forest          Non-forest                    19         0.268
Non-forest          Light forest                   8         0.113
Light forest        Light forest                  22         0.310
Light forest        Heavy forest         [illegible]   [illegible]
Heavy forest        Heavy forest                  15         0.211
Total:                                            71         1.000

From Czaplewski (1994) the composite estimator $Y$ can be expressed as

\[ Y = Y_2 + K (Y_1 - H Y_2) \quad \text{and} \quad \Sigma_Y = (I - K H)\, \Sigma_2 \tag{2} \]

where

\[ K = \Sigma_2 H^T \left( \Sigma_1 + H \Sigma_2 H^T \right)^{-1} \tag{3} \]

and $\Sigma_1$ and $\Sigma_2$ are the covariance matrices of $Y_1$ and $Y_2$.

The resulting composite estimator will have the same dimension and category labels as the larger composition $Y_2$. In a sense, the composite estimator can be considered a method of improving the estimate $Y_2$ by using another independent and unbiased vector estimate ($Y_1$) of lower dimensionality.

A problem that can sometimes occur with both the composite estimator and weighted least squares is that the resulting composition estimate $Y$ can have negative components. For the composite estimator, there is a partial remedy: multiply $K$ in the above equations by a scalar constant $a$. In optimal situations (where negative components do not result) we can let $a = 1$. If negative components of the composite estimator are observed, we can replace $a$ with the largest value in the interval [0,1] for which the resulting composite estimator has no negative components.

WEIGHTED LEAST SQUARES

The use of weighted (generalized) least squares allows a greater level of flexibility than the composite estimator. The only aspect of weighted least squares regression that is not straightforward is that the covariance matrices of the compositions are necessarily singular. Replacing the standard matrix inverse with the generalized inverse alleviates this problem.

Table 3. Creation of the design matrix X.
[Table 3 body: a binary matrix whose rows are the original categories of the two compositions (Satellite and Aerial for the first; Satellite, Aerial, and Ground, with some entries marked <missing>, for the second) and whose columns are the WLS regression categories (Satellite, Aerial, Ground). The cell entries are not recoverable.]

Let $Y_1$ and $Y_2$ be two compositions (unit-sum vectors) with corresponding covariance matrices $\Sigma_1$ and $\Sigma_2$, respectively. The two compositions can have different dimensionality. Let $Y_1$ have two classifications (Satellite and Aerial) and let $Y_2$ have three (Satellite, Aerial, and Ground). We can also introduce missing data with weighted least squares. Hence, we will let $Y_2$ have two missing categories. The categories of these compositions are shown on the left side of Table 3.

Let the design matrix $X$ be specified by the binary matrix depicted in Table 3. This matrix is easily formed by placing a 1 in each cell where all the available categories from the original data (rows) match their corresponding regression categories (columns). If we consider $Y_1$ and $Y_2$ as partitions of a single vector $Y$, and define

\[ Y = \begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix} \quad \text{and} \quad \Sigma = \begin{bmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{bmatrix}, \]

the generalized regression equation for the new WLS estimate ($W$) and its covariance matrix can be given by

\[ W = \left( X^T \Sigma^- X \right)^- X^T \Sigma^- Y \]

and

\[ \Sigma_W = \left( X^T \Sigma^- X \right)^- \]

where $A^-$ is the generalized inverse² of $A$.

DISCUSSION

We have shown that the composite estimator and weighted least squares are both capable of combining multi-level composition estimates of different dimensionality. Both methods require a binary transformation matrix which maps the categories of the corresponding compositions to those of the resulting estimate.

Weighted least squares has an advantage over the composite estimator when multiple estimates need to be combined. It is also unique in its ability to handle data with missing categories.
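
The two estimators compared here can be sketched numerically. The following Python (NumPy) example is only an illustration under stated assumptions, not the ACAS implementation: the data are simulated, all function names are hypothetical, the covariance is taken as the centered cross-product divided by n (an assumed normalization), and the last category of each composition is dropped so that ordinary inverses exist. Dropping a category is a device chosen for this sketch; the paper itself works with generalized inverses on the full compositions, and with a Moore-Penrose inverse on singular matrices the two estimators need not agree.

```python
import numpy as np

def mean_and_cov(X):
    """Row mean of X and centered cross-product divided by n (assumed normalization)."""
    n = X.shape[0]
    y = X.mean(axis=0)
    centered = X - y
    return y, centered.T @ centered / n

def mapping_matrix(coarse_of_fine, n_coarse):
    """Binary H with H[i, j] = 1 when fine category j belongs to coarse category i."""
    H = np.zeros((n_coarse, len(coarse_of_fine)))
    H[coarse_of_fine, np.arange(len(coarse_of_fine))] = 1.0
    return H

def composite(y1, s1, y2, s2, H):
    """Composite (Kalman-style) update of y2 using the coarser estimate y1."""
    K = s2 @ H.T @ np.linalg.inv(s1 + H @ s2 @ H.T)
    y = y2 + K @ (y1 - H @ y2)
    cov = (np.eye(len(y2)) - K @ H) @ s2
    return y, cov

def wls(y1, s1, y2, s2, H):
    """Weighted least squares on the stacked vector [y1; y2] with design matrix [H; I]."""
    p1, p2 = len(y1), len(y2)
    X = np.vstack([H, np.eye(p2)])
    y = np.concatenate([y1, y2])
    sigma = np.block([[s1, np.zeros((p1, p2))], [np.zeros((p2, p1)), s2]])
    sigma_inv = np.linalg.pinv(sigma)   # Moore-Penrose inverse as the generalized inverse
    cov = np.linalg.pinv(X.T @ sigma_inv @ X)
    return cov @ X.T @ sigma_inv @ y, cov

def largest_gain_scale(y2, step):
    """Largest a in [0, 1] such that y2 + a * step has no negative component."""
    shrinking = step < 0
    if not shrinking.any():
        return 1.0
    return float(min(1.0, np.min(y2[shrinking] / -step[shrinking])))

# Simulated data (hypothetical): 5 fine categories collapse onto 3 coarse ones.
rng = np.random.default_rng(0)
H = mapping_matrix(np.array([0, 0, 1, 1, 2]), 3)
y2, s2 = mean_and_cov(rng.dirichlet([4, 3, 5, 2, 6], size=30))  # fine, small sample
y1, s1 = mean_and_cov(rng.dirichlet([7, 7, 6], size=80))        # coarse, larger sample

# Drop the last category of each composition so the covariances are nonsingular.
# This is valid here because the last fine category alone maps to the last coarse one.
y1r, s1r, y2r, s2r, Hr = y1[:-1], s1[:-1, :-1], y2[:-1], s2[:-1, :-1], H[:-1, :-1]

comp, comp_cov = composite(y1r, s1r, y2r, s2r, Hr)
w, w_cov = wls(y1r, s1r, y2r, s2r, Hr)
full = np.append(comp, 1.0 - comp.sum())   # restore the dropped category
a = largest_gain_scale(y2, full - y2)      # scale for the gain, 1.0 if nothing is negative

print(np.allclose(comp, w), np.allclose(comp_cov, w_cov))
print(full, a)
```

With nonsingular covariance matrices the agreement of the two estimates follows from the matrix inversion lemma, which is one way to see why the methods often produce identical results. The helper `largest_gain_scale` corresponds to the scalar remedy for negative components described for the composite estimator.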
The composite estimator, however, is able to work with a composition estimate of zero covariance. (An example is a complete census of a population, where the population statistic has no sampling error.) In addition, the composite estimator is the only method which can be modified to assure that no components of the resulting estimate are negative.

Inherent Difficulties with Compositional Data

The unit-sum constraint is an intrinsic property of any composition. Little, if any, attention has been paid to this restriction despite the fact that it greatly complicates any analysis of correlation between categories. Karl Pearson (1897) first pointed out difficulties inherent in the interpretation of correlations between ratios whose numerators and denominators contain common parts. Since that time, difficulties encountered in interpreting these correlations have been described in papers by Chayes (1960, 1962), Krumbein (1962), and Mosimann (1962), and in books by Chayes (1971) and Le Maitre (1982). Aitchison (1986) points out four basic difficulties which arise from the use of a standard covariance matrix in analysis.

² By definition, $A^-$ is the generalized inverse of $A$ if $A = A A^- A$ and $A^- = A^- A A^-$. See Graybill (1976).

Negative bias difficulty

For a D-part composition $(x_1, \ldots, x_D)$, we have the restriction

\[ \operatorname{cov}(x_i, x_1) + \operatorname{cov}(x_i, x_2) + \cdots + \operatorname{cov}(x_i, x_D) = 0, \qquad i = 1, \ldots, D. \]

Although the correlations between the parts of a composition would be expected to range freely over the interval [-1,1], this restriction prevents it. Because $\operatorname{cov}(x_i, x_i) = \operatorname{var}(x_i)$ is positive, at least one of the covariances, and hence one of the correlations, in each row must be negative. Such forced negative correlation could cause some misinterpretation of the data.

Subcomposition difficulty

In most multivariate situations, the covariance structure of a subset of the variables is simply the corresponding part of the full covariance structure, so an m-part subset of an n-part composition would be expected to preserve the covariance structure. For example, consider a 4-part composition of categories A, B, C and D, and then a smaller subset of just the categories A, B, and C.
Table 4. Correlation matrices from a 4-category sample and its 3-category subsample.

[Table 4 body: left panel, correlation for {A,B,C,D}; right panel, correlation for {A,B,C}. The matrix entries are not recoverable apart from the two A-B correlations quoted below.]

The above data come from a hypothetical data set. The correlation matrix on the left is calculated from all four categories. If we instead consider the composition of only the first three categories, the resulting correlation matrix is the one on the right. Notice that the correlation between categories A and B is positive (0.267), but when the fourth category is removed, the correlation between A and B becomes negative (-0.066).

Basis difficulty

The above correlation matrices were computed from a multivariate sample of unit-sum vectors. When we consider the original data that might be measured (before the unit-sum normalization is performed), we might expect the sample correlation matrix of the original data to closely resemble the sample correlation matrix of the normalized data. This is not the case, and the two correlation matrices may be dramatically different. This suggests obvious problems when we try to use the covariance or correlation matrices of compositional data to interpret the relationship between two variables.

Null correlation difficulty

It is traditional to consider null correlation a good indicator of independence. However, in the situations already pointed out, a truly independent sample will necessarily have negative correlations between elements. It is possible to calculate the negative correlations that would occur naturally under an independence assumption and to compare these with experimental results, but this approach is also fraught with difficulty.

The Contemporary Analogs to a Composite Estimator or WLS

In his text, Aitchison (1986) provides a log-ratio approach for analyzing the correlation structure of compositional data.
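
The difficulties described above, and the way log-ratios avoid the subcomposition difficulty, can be checked numerically. The sketch below uses simulated data and hypothetical function names; it is not the data behind Table 4 and not ACAS code. It assumes independent lognormal amounts as the unnormalized basis and uses the variance of log(x_i / x_j) as one simple log-ratio summary in the spirit of Aitchison (1986).

```python
import numpy as np

def close(X):
    """Rescale each row to sum to one (unit-sum normalization)."""
    X = np.asarray(X, dtype=float)
    return X / X.sum(axis=1, keepdims=True)

def variation_matrix(X):
    """T[i, j] = var(log(x_i / x_j)), computed over the rows of X."""
    logX = np.log(X)
    diffs = logX[:, :, None] - logX[:, None, :]
    return diffs.var(axis=0)

# Hypothetical 4-part data: independent lognormal amounts, then normalized.
rng = np.random.default_rng(1)
log_mean = np.array([0.0, 0.5, 1.0, 2.0])
log_sd = np.array([0.3, 0.3, 0.3, 1.0])
basis = np.exp(log_mean + log_sd * rng.normal(size=(200, 4)))
full = close(basis)          # composition of {A, B, C, D}
sub = close(full[:, :3])     # subcomposition of {A, B, C}

# Negative bias: every row of the ordinary covariance matrix sums to zero.
row_sums = np.cov(full, rowvar=False).sum(axis=1)

# Basis and subcomposition: the ordinary A-B correlation depends on which
# version of the data is used, while the A-B log-ratio variance does not.
corr_basis = np.corrcoef(basis, rowvar=False)[0, 1]
corr_full = np.corrcoef(full, rowvar=False)[0, 1]
corr_sub = np.corrcoef(sub, rowvar=False)[0, 1]
tau_full = variation_matrix(full)[0, 1]
tau_sub = variation_matrix(sub)[0, 1]

print("covariance row sums:", row_sums)
print("A-B correlation (basis, full, sub):", corr_basis, corr_full, corr_sub)
print("A-B log-ratio variance (full, sub):", tau_full, tau_sub)
```

The log-ratio variance is unchanged because the ratio of two parts is the same before and after renormalization, whereas the ordinary correlation is computed from proportions that change whenever the set of categories changes.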
The natural progression for this field of complex sampling designs is to attempt to replace the analysis of covariance matrices with these more contemporary methods. It is currently unknown whether it is possible (and valid) to continue to use weighted least squares and the composite estimator, applying the techniques of Aitchison to approximations derived from the resulting covariance matrices, or whether these methods need to be completely replaced by new ones.

ACAS Software Application

The procedures described in this paper have been developed into a computer package titled "ACAS" (Williams and Beach 1995). This application has been written in C++ and is available to the public. Although ACAS is a usable, full-featured package, continual development is underway to provide more advanced tools. The ACAS system already includes some of the methodology of Aitchison and aspires to further develop contemporary methods based on his work.

ACKNOWLEDGEMENTS

The development of this material would not have been possible without the ACAS software project, which has been mostly written by David J. C. Beach. This work is currently funded by the State of Minnesota Department of Natural Resources and the USDA Forest Service.

REFERENCES

Aitchison, J. 1986. The statistical analysis of compositional data. Chapman and Hall. 416 p.
Chayes, F. 1960. On correlation between variables of constant sum. Journal of Geophysical Research, 65:4185-4193.
Chayes, F. 1962. Numerical correlation and petrographic variation. Journal of Geology, 70:440-452.
Chayes, F. 1971. Ratio Correlation. University of Chicago Press.
Czaplewski, R.L. 1992. Accuracy assessment of remotely sensed classifications with multi-phase sampling and the multivariate composite estimator. Hamilton, New Zealand. 16th International Biometric Conference. 2:22.
Czaplewski, R.L. 1996. Assessment of classification accuracy and extent estimates for a land cover map with double sample. Submitted to Forest Science.
Graybill, F.A. 1976. Theory and application of the linear model. Wadsworth & Brooks/Cole Adv. Books and Software. 704 p.
Grizzle, J.E., Starmer, C.F., and Koch, G.G. 1969. Analysis of categorical data by linear models. Biometrics, 25:489-504.
Krumbein, W.C. 1962. Open and closed number systems in stratigraphic mapping. Bull. Amer. Assoc. Petrol. Geologists, 46:2229-2245.
Le Maitre, R.W. 1982. Numerical Petrography. Amsterdam: Elsevier.
Lee, E.S., Forthofer, R.N., and Lorimor, R.J. 1989. Analyzing complex survey data. Sage University Paper series on Quantitative Applications in the Social Sciences, 07-071. Beverly Hills: Sage Pubns. 78 p.
Mosimann, J.E. 1962. On the compound multinomial distribution, the multivariate β-distribution and correlations among proportions. Biometrika, 49:65-82.
Pearson, K. 1897. Mathematical contributions to the theory of evolution. On a form of spurious correlation which may arise when indices are used in the measurement of organs. Proc. R. Soc., 60:489-498.
Williams, M.T., and Beach, D.J.C. 1995. ACAS 0.4: accuracy assessment system program manual. USDA Forest Service, Rocky Mountain Forest and Range Experiment Station, Fort Collins, CO. 33 p. + source code.

BIOGRAPHICAL SKETCH

Murray Todd Williams is currently finishing his M.S. degree in the Department of Statistics at Colorado State University. The work which appears in this paper, along with the development of the ACAS software application, is part of his thesis. Murray received his B.A. in Mathematics from Pomona College in Claremont, CA.