Comparing Methods Used to Combine Multi-level Samples of Compositional Data
Murray Todd Williams¹
assessment of categorical data with complex
sampling designs is an important tool in the use of thematic maps and
other applications. One important development of note is the composite
estimator, an application of the Kalman filter. This paper reviews the
composite estimator and compares it with standard weighted least
squares regression. In many cases the two methods produce identical
results, although there are situations in which one has advantages over
the other. The conclusion also raises some critical questions which arise
from these classical methods, and it examines contemporary analyses of
compositional data. In conjunction with the methodology described
here, a public-domain software application has been developed.
Much of the work discussed here emerged from natural resource issues,
specifically the analysis of forest inventories carried out by the USDA Forest
Service. However, the range of applications for this work has already proven to be quite
broad. Let us broadly define the sort of data which can benefit from these methods.
In single-level categorical data, measured quantities (frequently discrete) are
identified as one of a group of mutually exclusive categories. One example of this
is categorizing the type of forest in a section of land by tree species. Another
example would be a typical blood test in which the numbers of specific components (red blood cells, lymphocytes, etc.) are counted.
Multi-stage and multi-phase categorical sampling uses multiple classifiers to
categorize each sampling unit. For example, in the above forest inventory situation, the composition of a section of land can be measured using three different
methods: satellite imaging, aerial photography (photo interpretation), and manual
observations in the field.
In this example, let a primary sampling unit be a section of land with an area
of 47 acres. Table 1 shows an example where the three identification methods
were used to classify each acre of land. Notice that 38 acres of land were classified
the same by each method. For the other 9 acres, there is some degree of disagreement between categories. (These situations are shaded in the table.) In most
situations a high level of agreement is considered optimal.
¹ Student, Department of Statistics, Colorado State University, Ft. Collins, CO
Notice that Table 1 also displays the quantities of each multiple-classification
as proportions of the total. This vector of proportions is referred to as a composition.
Table 1.-Multi-level data in a typical primary sampling unit.
[Table body not recoverable. Column headings: Ground Surveys, Satellite Imaging, Aerial Photography; surviving category labels: Light forest, Heavy forest; one surviving proportion: 0.319.]
There are many situations in which this type of data is encountered. Some of
the possible applications become apparent:
Multiple classification techniques. In these situations, various techniques
might have different levels of accuracy. Such accuracy is frequently inversely
proportional to cost. In the above example, satellite data might be readily
available and therefore inexpensive, but the degree of accuracy obtained may
be fairly poor. Ground surveys are considered to be the most accurate, but
they are also exceedingly expensive.
Multiple time periods. A large database of monitoring information may be
available for previous time periods (for example, large surveys taken in
1970 and 1980), whereas data for a recent time period (1990) are more limited. The
larger two-period database may be combined with the
small sample of plots which have all three time periods in order to obtain
a better estimate of the forest composition in 1990.
Construction of the Sample Composition Estimates
Consider an (n × p) data matrix X with n independent primary sampling units
(PSUs) and p distinct categories. Note that the category labels resemble those in
Table 1. Also let each row of X represent a composition, thus summing to one.
The calculations of the composition estimate Y (the sample mean of the data
matrix X) and its sample covariance are straightforward:
\[ Y = \frac{1}{n} X^T \mathbf{1} \quad \text{and} \quad \hat{\Sigma}_X = \frac{1}{n-1} X^T \left( I - \frac{1}{n} \mathbf{1}\mathbf{1}^T \right) X, \]
where \mathbf{1} denotes the n-vector of ones.
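As a minimal numerical sketch of the mean and sample covariance just described (Python with NumPy; the function name `composition_estimate` and the three-PSU data are hypothetical), both quantities can be computed directly from X. The last step, dividing the between-PSU covariance by n to obtain a covariance for the mean Y, relies on the PSUs being independent and is an assumption added here rather than something spelled out in the text.

```python
import numpy as np

def composition_estimate(X):
    """Return (Y, S, cov_Y) for an (n x p) matrix X whose rows are compositions."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    ones = np.ones((n, 1))
    Y = (X.T @ ones / n).ravel()                 # sample mean of the rows
    centering = np.eye(n) - ones @ ones.T / n    # I - (1/n) 1 1^T
    S = X.T @ centering @ X / (n - 1)            # sample covariance among PSUs
    cov_Y = S / n                                # assumption: covariance of the mean Y
    return Y, S, cov_Y

# Hypothetical data: three PSUs, three categories, each row sums to one.
X = np.array([[0.5, 0.3, 0.2],
              [0.7, 0.2, 0.1],
              [0.6, 0.1, 0.3]])
Y, S, cov_Y = composition_estimate(X)
print(Y)
print(S.sum(axis=1))   # each row of S sums to (numerically) zero
```

The zero row sums are a direct consequence of the unit-sum constraint, which is also why these covariance matrices are singular.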
The composite estimator was first applied to categorical analyses by
Czaplewski (1992) as an application of the Kalman filter. Instead of motivating
its derivation I will just describe its application.
Consider two composition estimates. The estimates may have any dimensionality, but for this example we will define a two-dimensional composition Y1 and a
three-dimensional composition Y2. Y2 is a vector with the unit-sum
constraint, and its elements correspond to the rows in Table 1.
Y1 has categories defined on only two classifications: Satellite
imaging and Aerial photography. Refer to Table 2. It is necessary now to
construct a binary transformation matrix which defines the unique mapping between the categories of
Y1 and those of Y2. Take this partition of the binary matrix and call it H. An example of H
is shown as a subset of the binary matrix in Table 3.
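One way to build such a binary matrix is sketched below, under stated assumptions: the function name `mapping_matrix`, the three class labels, and the classifier ordering (satellite, aerial, ground) are all chosen for illustration and are not taken from Tables 1 to 3. Each category of Y2 is a triple of labels, each category of Y1 is a pair, and H marks which Y2 categories collapse into which Y1 category.

```python
from itertools import product
import numpy as np

CLASSES = ["Non-forest", "Light forest", "Heavy forest"]

def mapping_matrix(fine, coarse, keep):
    """H[i, j] = 1 when fine category j, restricted to the classifiers in
    `keep`, equals coarse category i; every other entry is 0."""
    row_of = {label: i for i, label in enumerate(coarse)}
    H = np.zeros((len(coarse), len(fine)), dtype=int)
    for j, label in enumerate(fine):
        H[row_of[tuple(label[k] for k in keep)], j] = 1
    return H

# Assumed ordering of classifiers: (satellite, aerial, ground).
fine = list(product(CLASSES, repeat=3))     # categories of Y2
coarse = list(product(CLASSES, repeat=2))   # categories of Y1
H = mapping_matrix(fine, coarse, keep=(0, 1))
print(H.shape)
print(H.sum(axis=0))   # every Y2 category maps to exactly one Y1 category
```

Because every column of H contains a single 1, the product of H with any unit-sum vector on the Y2 categories is again a unit-sum vector on the Y1 categories.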
Table 2.-A typical primary sampling unit for a two-dimensional composition Y1.
[Table body not recoverable. Column headings: Satellite Imaging, Aerial Photography; surviving category labels: Non-forest, Light forest, Heavy forest; surviving proportions: 0.113, 0.211, and a total of 1.000.]
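From Czaplewski (1994) the composite estimator Y can be expressed as
\[ Y = Y_2 + K (Y_1 - H Y_2), \qquad K = \Sigma_2 H^T (\Sigma_1 + H \Sigma_2 H^T)^{-1}, \]
where Σ1 and Σ2 are the covariance matrices of Y1 and Y2.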
The resulting composite estimator will have the same dimension and category
labels as the largest component Y2. In a sense, the composite estimator can be
considered a method of improving the estimate of Y2 by using another independent and unbiased vector estimate (Y1) of lower dimensionality.
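The following is a minimal sketch of this update, assuming the usual Kalman-filter form Y = Y2 + K(Y1 - H Y2) with the gain K given above. Several pieces are assumptions added for illustration: the function names, the made-up numbers, the multinomial-style covariances, the use of a pseudoinverse (unit-sum covariances make the matrix being inverted singular), and the covariance formula for the result, which is the general expression for a linear combination of two independent estimates.

```python
import numpy as np

def composite_estimate(y1, cov1, y2, cov2, H, alpha=1.0, rcond=1e-10):
    """Combine y2 (fine categories) with an independent y1 (coarse categories)."""
    y1, y2 = np.asarray(y1, float), np.asarray(y2, float)
    S = cov1 + H @ cov2 @ H.T
    # Assumption: pseudoinverse, because unit-sum covariances make S singular.
    K = alpha * cov2 @ H.T @ np.linalg.pinv(S, rcond=rcond)
    y = y2 + K @ (y1 - H @ y2)
    A = np.eye(len(y2)) - K @ H
    cov = A @ cov2 @ A.T + K @ cov1 @ K.T    # valid for any gain K
    return y, cov

def multinomial_cov(p, n):
    """Illustrative covariance of a proportion vector from n multinomial draws."""
    p = np.asarray(p, float)
    return (np.diag(p) - np.outer(p, p)) / n

# Hypothetical example: Y2 has 4 categories, Y1 merges them pairwise into 2.
H = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]])
y2 = np.array([0.40, 0.10, 0.15, 0.35])
y1 = np.array([0.55, 0.45])
y, cov = composite_estimate(y1, multinomial_cov(y1, 400),
                            y2, multinomial_cov(y2, 47), H)
print(y, y.sum())
```

Passing a zero matrix for `cov1` treats Y1 as known without sampling error; in this sketch the combined estimate then reproduces Y1 exactly when collapsed through H.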
A problem which can sometimes occur with both the composite estimator and
weighted least squares is that the resulting composition estimate Y can have
negative components. For the composite estimator, there is a partial remedy for
this problem: multiply K in the above equations by a scalar constant α.
In optimal situations (where negative components do not result) we can let
α = 1. If negative components of the composite estimator are observed, we can
replace α with the largest possible value in the interval [0,1] such that the resulting composite estimator is well-behaved.
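Because the scalar multiplies K, the estimate moves along a straight line from Y2 as it grows from 0 to 1, so the largest admissible value can be found component by component. The sketch below assumes that "well-behaved" means "no negative components" and that Y2 itself is non-negative; the function name `largest_alpha` and the numbers are hypothetical.

```python
def largest_alpha(y2, step):
    """Largest alpha in [0, 1] with y2[i] + alpha * step[i] >= 0 for all i.

    `step` is K (Y1 - H Y2) computed with alpha = 1; y2 is assumed non-negative.
    """
    alpha = 1.0
    for base, delta in zip(y2, step):
        if delta < 0:                      # only decreasing components can cross zero
            alpha = min(alpha, base / -delta)
    return alpha

# Hypothetical numbers: the full step would push the third component below zero.
y2 = [0.5, 0.3, 0.2]
step = [0.5, -0.1, -0.4]
alpha = largest_alpha(y2, step)
adjusted = [b + alpha * d for b, d in zip(y2, step)]
print(alpha, adjusted)
```

Since the step sums to zero, the adjusted estimate keeps the unit sum for any value of the scalar.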
The use of weighted (generalized) least squares allows a greater level of flexibility than the composite estimator. The only aspect of weighted least
squares regression that may not be considered straightforward is that the
covariance matrices of the compositions are necessarily singular. By replacing the standard matrix inverse with the generalized inverse, the problem is resolved.
Table 3.-Creation of the design matrix X.
[Table body not recoverable. Rows list the original categories of Y1 and Y2; columns list the WLS regression categories; each cell is 0 or 1.]
Let Y1 and Y2 be two compositions (unit-sum vectors) with corresponding
covariance matrices Σ1 and Σ2, respectively. The two compositions can have
different dimensionality. Let Y1 have two classifications (Satellite and Aerial) and
let Y2 have three. We can also introduce missing data with weighted
least squares; hence, we will let Y2 have two missing categories. The categories
of these compositions are shown on the left side of Table 3.
Let the design matrix X be specified by the binary matrix depicted in Table 3.
This matrix is easily formed by placing a 1 in each cell where all the available
categories from the original data (rows) match up with their corresponding regression categories (columns).
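The matching rule can be sketched in a few lines. In this illustration (function name, labels, classifier order, and the choice of which two Y2 categories are missing are all assumptions), a classifier that an original category does not record is written as `None` and matches anything.

```python
from itertools import product

def design_matrix(original, regression):
    """X[i][j] = 1 when every classifier recorded in original category i
    (entries that are not None) agrees with regression category j."""
    return [[int(all(o is None or o == r for o, r in zip(orig, reg)))
             for reg in regression]
            for orig in original]

CLASSES = ["Light forest", "Heavy forest"]
# Assumed classifier order: (satellite, aerial, ground).
regression = list(product(CLASSES, repeat=3))                     # columns
y1_rows = [(s, a, None) for s, a in product(CLASSES, repeat=2)]   # no ground data
y2_rows = regression[:-2]          # hypothetical: two categories of Y2 missing
X = design_matrix(y1_rows + y2_rows, regression)
for row in X:
    print(row)
```

Rows that come from Y1 contain several 1s (one for each ground class they cannot distinguish), while rows that come from Y2 contain exactly one.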
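If we consider Y1 and Y2 as partitions of a single vector Y, and define
\[ Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{pmatrix}, \]
the generalized regression equation for the new WLS estimate (W) and its
covariance matrix can be given by
\[ W = (X^T \Sigma^- X)^- X^T \Sigma^- Y, \qquad \Sigma_W = (X^T \Sigma^- X)^-, \]
where A^- is the generalized inverse² of A.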
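A hedged sketch of this regression follows, assuming the standard generalized least squares form W = (X' Σ⁻ X)⁻ X' Σ⁻ Y with a block-diagonal Σ for independent estimates, and using the Moore-Penrose pseudoinverse as the generalized inverse. The function names, the small three-category design, and the multinomial-style covariances are illustrative assumptions. Nothing in the sketch forces W to sum to one or to be non-negative, and for designs in which X' Σ⁻ X is itself singular the choice of generalized inverse matters; that case is not handled here.

```python
import numpy as np

def wls_combine(estimates, covariances, X, rcond=1e-10):
    """Generalized least squares estimate W and its covariance.

    estimates   : list of vectors Y1, Y2, ... stacked in the row order of X
    covariances : matching list of covariance matrices (assumed independent)
    """
    Y = np.concatenate([np.asarray(e, float) for e in estimates])
    sizes = [len(e) for e in estimates]
    Sigma = np.zeros((sum(sizes), sum(sizes)))
    start = 0
    for cov, k in zip(covariances, sizes):       # block-diagonal covariance
        Sigma[start:start + k, start:start + k] = cov
        start += k
    Sigma_inv = np.linalg.pinv(Sigma, rcond=rcond)
    cov_W = np.linalg.pinv(X.T @ Sigma_inv @ X, rcond=rcond)
    W = cov_W @ X.T @ Sigma_inv @ Y
    return W, cov_W

def multinomial_cov(p, n):
    """Illustrative covariance of a proportion vector from n multinomial draws."""
    p = np.asarray(p, float)
    return (np.diag(p) - np.outer(p, p)) / n

# Hypothetical design: regression categories are (Non-forest, Light, Heavy);
# Y1 only distinguishes Non-forest from Forest, Y2 reports all three.
X = np.array([[1, 0, 0],
              [0, 1, 1],
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]])
y1 = np.array([0.42, 0.58])
y2 = np.array([0.38, 0.33, 0.29])
W, cov_W = wls_combine([y1, y2],
                       [multinomial_cov(y1, 400), multinomial_cov(y2, 47)], X)
print(W, W.sum())
```

Further estimates can be combined by appending their vectors, covariance matrices, and rows of X, which is the flexibility the text attributes to weighted least squares.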
We have shown that the composite estimator and weighted least squares are
both capable of combining multi-level composition estimates of different dimensionality. Both methods require a binary transformation matrix which maps the
categories of the corresponding compositions to those of the resulting estimate.
Weighted least squares has an advantage over the composite estimator when
multiple estimates need to be combined. It is also unique in its ability to handle
data with missing categories. The composite estimator, however, is able to work
with a composition estimate of zero covariance. (An example of this is an entire
census of a population, where the population statistic has no sampling
error.) In addition, the composite estimator is the only method which can be
modified to assure that no components of the resulting estimate are negative.
Inherent Difficulties with Compositional Data
The unit-sum constraint is an intrinsic property of any composition. Little, if
any, attention has been paid to this restriction despite the fact that it greatly
complicates any analysis of correlation between categories. Karl Pearson (1897)
first pointed out difficulties inherent in the interpretation of correlations
between ratios whose numerators and denominators contain common parts.
Since that time, difficulties encountered while trying to interpret these correlations have been described in papers by Chayes (1960, 1962),
Krumbein (1962), Mosimann (1962), and in books by Chayes (1971) and Le
Maitre (1982).
² By definition, A^- is the generalized inverse of A if A A^- A = A and A^- A A^- = A^-. See Graybill (1976).
Aitchison (1986) points out four basic difficulties which arise from the use of
a standard covariance matrix in analysis.
Negative bias difficulty
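For a D-part composition (x_1, ..., x_D), we have the restriction
\[ \operatorname{cov}(x_1, x_2) + \operatorname{cov}(x_1, x_3) + \cdots + \operatorname{cov}(x_1, x_D) = -\operatorname{var}(x_1). \]
Although the correlations between all parts of a composition would be
expected to freely span the range [-1,1], this restriction requires that at least one of
the covariances in each row, and hence at least one of the correlations, assumes a negative value. Such forced negative correlation could
cause some misinterpretation of the data.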
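The restriction is easy to check numerically. The sketch below uses a small made-up data set (not data from this paper): it closes four raw measurements to unit sum and confirms that, for every part, the covariances with the other parts add up to minus that part's variance.

```python
import numpy as np

# Hypothetical 4-part data: close raw measurements so each row sums to one.
raw = np.array([[1.0, 1.0, 1.0, 1.0],
                [2.0, 1.0, 1.0, 4.0],
                [1.0, 2.0, 1.0, 2.0],
                [2.0, 2.0, 1.0, 9.0]])
X = raw / raw.sum(axis=1, keepdims=True)

cov = np.cov(X, rowvar=False)
off_diagonal_sums = cov.sum(axis=1) - np.diag(cov)

print(np.allclose(off_diagonal_sums, -np.diag(cov)))
print((cov - np.diag(np.diag(cov))).min(axis=1))   # most negative covariance per part
```

Since each variance is positive, each part must have at least one negative covariance with another part, whatever the underlying process.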
Subcomposition difficulty
In most multivariate situations, an m-part subset of an n-part composition
would be expected to preserve the covariance structure. For example, consider a
4-part composition of categories A, B, C and D, and then a smaller subset of just
the categories A, B, and C.
Table 4.--Correlation matrices from a 4-category sample and its 3-category subsample.
Correlation for {A,B,C,D}
Correlation for {A,B,C}
The above data come from a hypothetical data set. The correlation matrix on
the left side is computed from all four categories. If we instead consider the
composition of only the first three categories, the resulting correlation matrix is
given on the right side. Notice the correlation between categories A and B is
positive (0.267), but when removing the fourth category, the same correlation
between A and B becomes negative (-0.066).
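The same effect can be reproduced with any small data set. The sketch below uses made-up numbers (not the data behind Table 4): it computes the correlation between A and B in the full 4-part composition and again after the first three parts are re-closed to sum to one. With these particular numbers the sign does not flip, but the two correlations differ noticeably.

```python
import numpy as np

def close(parts):
    """Rescale each row so that it sums to one."""
    return parts / parts.sum(axis=1, keepdims=True)

# Hypothetical raw measurements; columns are A, B, C, D.
raw = np.array([[1.0, 1.0, 1.0, 1.0],
                [2.0, 1.0, 1.0, 4.0],
                [1.0, 2.0, 1.0, 2.0],
                [2.0, 2.0, 1.0, 9.0]])
full = close(raw)              # 4-part composition
sub = close(full[:, :3])       # 3-part subcomposition {A, B, C}

r_full = np.corrcoef(full[:, 0], full[:, 1])[0, 1]
r_sub = np.corrcoef(sub[:, 0], sub[:, 1])[0, 1]
print(r_full, r_sub)
```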
Basis difficulty
The above correlation matrices were computed from a multivariate sample of
unit-sum vectors. When we consider the original data which might be measured
(before the unit-sum normalization is performed), we might expect the sample
correlation matrix of the original data to closely resemble the sample correlation
matrix of the normalized data. This is not the case, and the two correlation matrices may be dramatically different. This suggests obvious problems when we try
to use the covariance or correlation matrices of compositional data to interpret the
relationship between two variables.
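A small constructed example makes the point concrete. In the made-up basis below, the first two measured quantities are exactly uncorrelated, yet after normalization to unit sum the corresponding parts of the composition are strongly negatively correlated.

```python
import numpy as np

# Hypothetical basis (raw measurements before normalization).
basis = np.array([[1.0, 1.0, 1.0],
                  [2.0, 1.0, 1.0],
                  [1.0, 2.0, 1.0],
                  [2.0, 2.0, 1.0]])
composition = basis / basis.sum(axis=1, keepdims=True)

r_basis = np.corrcoef(basis[:, 0], basis[:, 1])[0, 1]
r_comp = np.corrcoef(composition[:, 0], composition[:, 1])[0, 1]
print(r_basis, r_comp)
```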
Null correlation difficulty
It is traditional to consider null-correlation a good indicator of independence.
However, in the situations we've already pointed out, a truly independent sample
will necessarily have negative correlations between elements. It is possible to
calculate the naturally occurring negative correlations which would occur under
an independence assumption and use this to compare with experimental results,
but this problem is also fraught with difficulty.
The Contemporary Analogs to a Composite Estimator or WLS
In his text, Aitchison (1986) provides an approach for analyzing the correlation structure of compositional data using log-ratios. The natural
progression for this field of complex sampling designs is to attempt to replace the
analysis of covariance matrices with these more contemporary methods. It is currently unknown whether
it is possible (and valid) to continue to use weighted least squares and the
composite estimator, applying the techniques of Aitchison to approximations
derived from the resulting covariance matrices, or whether these methods will need to
be completely replaced by new ones.
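For orientation, the additive log-ratio transformation, one of the log-ratio transformations treated by Aitchison (1986), can be sketched as follows. The function names are hypothetical, and the sketch assumes every part is strictly positive, since a zero part has no logarithm.

```python
import math

def alr(x):
    """Additive log-ratio: log(x_i / x_D) for i = 1, ..., D-1 (all parts > 0)."""
    return [math.log(part / x[-1]) for part in x[:-1]]

def alr_inverse(y):
    """Map D-1 unconstrained reals back to a D-part composition."""
    expd = [math.exp(v) for v in y] + [1.0]
    total = sum(expd)
    return [e / total for e in expd]

x = [0.5, 0.3, 0.2]
y = alr(x)
print(y)
print(alr_inverse(y))
```

Any vector of real numbers maps back to a composition with positive parts and unit sum, so estimates formed on the log-ratio scale cannot produce the negative components discussed earlier.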
ACAS Software Application
The procedures described in this paper have been developed into a computer
package titled "ACAS" (Williams 1995). This application has been written in
C++ and is available to the public. Although ACAS is a usable, full-featured
package, continual development is underway to provide more advanced tools.
The ACAS system already includes some of the methodology of Aitchison and
aspires to further develop contemporary methods based on his work.
The development of this material would not be possible without the ACAS
software project, which has been mostly written by David J. C. Beach. This work
is currently funded by the State of Minnesota Department of Natural Resources
and the USDA Forest Service.
Murray Todd Williams is currently finishing his M.S. degree in the Department of Statistics at Colorado State University. The work which appears in this
paper, along with the development of the ACAS software application, is a part of
his thesis. Murray received his B.A. in Mathematics from Pomona College in
Claremont, CA.