Comments on "Application of the Conditional
Population-Mixture Model to Image Segmentation"
D. M. TITTERINGTON
Abstract-In the above correspondence¹ a maximum likelihood method is proposed for "estimating" class memberships and underlying statistical parameters, within the context of distribution mixtures. In the present comment it is pointed out that biases are incurred in parameter estimation, that the class memberships and parameters are conceptually different, and therefore that the so-called standard mixture likelihood is to be preferred. Also in the correspondence,¹ Akaike's information criterion (AIC) is used to choose the number of classes in the mixture. Here a brief theoretical caveat is issued.
Index Terms-Cluster analysis, image processing, image segmentation, maximum likelihood, mixtures of distributions, pattern recognition, pixel classification.

Manuscript received August 29, 1983. The author is with the Department of Statistics, University of Glasgow, Glasgow G12 8QW, Scotland.

¹S. L. Sclove, IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-5, pp. 428-433, July 1983.
I. INTRODUCTION
In the correspondence¹ the following problem is considered. At each of n pixels of a digital image a vector of p features is observed. The vector of features is X and the observed digital image is {x_t, t = 1, ..., n}. The image consists of several segments and each pixel belongs to exactly one segment. The set of labels which indicate to which segment any given pixel belongs is denoted by {y_t, t = 1, ..., n}. The complete data on the tth pixel is therefore a realization of (X_t, y_t). In the problem considered, y_t is unobservable. It is assumed that, for different pixels, the (X, y) are independent and that, given y = c, X has the conditional probability function f(x|c). Often parametric models are chosen for f, so that

$$f(x \mid c) = h(x; \beta_c)$$

where β_c is a (possibly vector) parameter.

In the correspondence¹ the following objectives are pursued, among others.

1) To develop a procedure for "estimating" the unknown y's and β_c's, for a given choice of k, the number of underlying segments.

2) To discuss methods for choosing the number of classes, giving particular mention to Akaike's information criterion AIC [2], [3].

The aim of the present note is to issue caveats on both these points.
II. ESTIMATION, GIVEN THE NUMBER OF CLASSES
As pointed out in the correspondence¹ it is convenient to replace y_t by a set of indicators {θ_ct, c = 1, ..., k}. If y_t = c', then θ_ct = 0 for all c ≠ c' and θ_c't = 1. This leads, in ¹, to the consideration of the likelihood

$$L = L(B, \theta) = \prod_{t=1}^{n} \prod_{c=1}^{k} \{h(x_t; \beta_c)\}^{\theta_{ct}} \tag{1}$$

where B denotes the set of β_c's and θ the set of θ_ct's. L is then maximized jointly with respect to B and θ, the latter being restricted to indicators of the type described above.
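Neither note reproduces the maximization of (1) in algorithmic detail, so the following is only a minimal sketch of one standard way of carrying it out, assuming univariate normal components with a common variance; in that case the maximization over the indicators reduces to nearest-mean assignment, and the scheme is essentially the isodata/K-means iteration referred to in the Author's Reply below. All names in the sketch are illustrative.

```python
import numpy as np

def classification_ml(x, k, iters=100, seed=0):
    """Alternately maximize the classification likelihood (1):
    given the means, the best indicators assign each x_t to the
    nearest mean; given the indicators, the best means are the
    group averages (univariate normals, common variance assumed)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    mu = rng.choice(x, size=k, replace=False)              # initial means
    for _ in range(iters):
        z = np.argmin((x[:, None] - mu[None, :]) ** 2, axis=1)  # indicators
        mu = np.array([x[z == c].mean() if np.any(z == c) else mu[c]
                       for c in range(k)])
    s2 = np.mean((x - mu[z]) ** 2)                          # within-group variance
    return z, mu, s2
```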
The resulting $\hat\theta$ defines a clustering or segmentation of the pixels into the k segments. Although segmentation is the major objective in the correspondence,¹ there could be said to be the implication that the corresponding $\hat B$ provides sensible estimates of B.
It is this latter implication that is misleading, because of
gross biases that occur. We illustrate this with a very simple
example.
Mixture of Two Univariate Normal Densities
with Common Variance
Suppose k = 2, p = 1, and

$$h_c(x) = h(x; \beta_c) = (2\pi\sigma^2)^{-1/2}\exp\{-\tfrac{1}{2}(x - \mu_c)^2/\sigma^2\}, \qquad c = 1, 2.$$
Suppose also, without loss of generality, that μ_1 < μ_2 and x_1 ≤ ... ≤ x_n. Any choice of θ defines a partitioning of (x_1, ..., x_n) and it turns out that the optimal partitioning is of the form (x_1, ..., x_r), (x_{r+1}, ..., x_n), for some r. Thus, the data are partitioned with all the "small" values in one group and all the "large" values in the other.

Let x_0 denote the partitioning point. If the partitioned sample is now analyzed as "ordinary" samples from the two normal components, estimating the means by the sample averages and estimating the common variance using the within-samples sum of squares, then it is clear that biases are introduced which do not even disappear asymptotically. Fig. 1, with the case of equal class probabilities in mind, indicates how this is caused by the overlap of the two densities and the truncation of both densities induced by the cutoff x_0. The estimate of μ_1 will, on average, be too low, that for μ_2 too high, and that for σ² also too low.
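The size of these biases is easy to check numerically. The following sketch (an illustration added here, not part of the original note) draws a large equal-probability sample from two unit-variance normals, splits it at the midpoint of the means, roughly the limiting cutoff x_0 in the equal-probability case, and forms the "ordinary" estimates described above; the group means over- and undershoot μ_2 and μ_1, and the pooled within-group variance understates σ², no matter how large n is.

```python
import numpy as np

rng = np.random.default_rng(1)
mu1, mu2, sigma, n = 0.0, 2.0, 1.0, 200_000
z = rng.integers(0, 2, n)                        # equal class probabilities
x = rng.normal(np.where(z == 0, mu1, mu2), sigma)

x0 = 0.5 * (mu1 + mu2)                           # partitioning point
lo, hi = x[x < x0], x[x >= x0]
m1, m2 = lo.mean(), hi.mean()                    # sample-average estimates
s2 = (np.sum((lo - m1) ** 2) + np.sum((hi - m2) ** 2)) / n  # within-samples SS / n
print(m1, m2, s2)   # roughly -0.17, 2.17, 0.64: biased even for very large n
```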
As far as asymptotically unbiased estimation of B is concerned, in general, use of what is called in the correspondence¹ the standard mixture model is much more appropriate. The corresponding likelihood is

$$L_M = L_M(B, \pi) = \prod_{t} \Big\{\sum_{c} h(x_t; \beta_c)\,\pi_c\Big\} \tag{2}$$

where π contains the k class probabilities. Once this is maximized, at $(\hat B, \hat\pi)$, say, discriminant functions can be set up to form the basis of randomized or decision-directed pixel classification. Furthermore, since, often, consistent estimators of (B, π) result [9], the discriminant functions will be, asymptotically, the optimal likelihood-ratio rules.
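No algorithm for maximizing (2) is spelled out in either note; the EM algorithm, examined at length in [9], is the standard choice. A minimal sketch for the two-component, common-variance case considered above (illustrative only):

```python
import numpy as np

def em_two_normal_common_var(x, iters=200):
    """EM for pi*N(mu1, s2) + (1-pi)*N(mu2, s2): maximizes the mixture
    likelihood (2) rather than the classification likelihood (1)."""
    x = np.asarray(x, dtype=float)
    mu1, mu2 = x.min(), x.max()                  # crude starting values
    s2, pi = x.var(), 0.5
    for _ in range(iters):
        # E-step: posterior probability that each x_t came from class 1
        # (the common normalizing constant cancels because s2 is shared)
        d1 = pi * np.exp(-(x - mu1) ** 2 / (2 * s2))
        d2 = (1 - pi) * np.exp(-(x - mu2) ** 2 / (2 * s2))
        w = d1 / (d1 + d2)
        # M-step: weighted updates of the parameters
        pi = w.mean()
        mu1 = np.sum(w * x) / np.sum(w)
        mu2 = np.sum((1 - w) * x) / np.sum(1 - w)
        s2 = np.sum(w * (x - mu1) ** 2 + (1 - w) * (x - mu2) ** 2) / len(x)
    return mu1, mu2, s2, pi
```

Applied to a large sample like the one simulated above, the estimates returned should lie close to μ_1, μ_2, and σ², in contrast with the partition-based estimates.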
The remarks in this section are not original. The inevitable
biases were pointed out in [8], a critique of the use in [10]
of the conditional mixture model. These comments were followed up and developed in [4] . More recently, [11] argues
the usefulness of maximizing L, in (1), as a clustering technique.
It is pointed out, however, that often unacceptably equal
clusters are generated. An improvement is possible if (1) is
replaced by
$$L_C(B, \pi, \theta) = \prod_{t} \prod_{c} \{\pi_c\, h(x_t; \beta_c)\}^{\theta_{ct}}. \tag{3}$$
This allows for possible inequality of the class probabilities. Clearly, setting π_c = k^{-1}, for all c, reduces (3) to (1).

Use of (3) still does not remove the biases from the estimates of B, for instance. There is a fundamental reason for using (2) instead. In the formulation of (1) and (3) it is assumed that B, π, and θ are conceptually the same and that they are all parameters, to be estimated by maximum likelihood. In fact, only B and π are unknown population-defined quantities and therefore to be regarded as parameters in this sense. The θ's are best treated as unobservable random variables associated with the data. The "correct" likelihood to maximize is that obtained from the observable data, that is, the x_t's. Given the relationship between likelihood and probability densities, we obtain the marginal probability density for the x_t's by writing down the joint probability density for the x_t's and θ_t's and then summing over the θ_t's. For mixture problems, the joint density in question is L_C and summation over θ gives L_M. We can now motivate the subscript of L_C: L_C is the "complete-data" likelihood.
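To make the summation step explicit (this display is added here for clarity; it is a routine identity rather than part of the original note): for each t exactly one of θ_1t, ..., θ_kt equals one, so

```latex
\sum_{\theta_t}\prod_{c=1}^{k}\{\pi_c\,h(x_t;\beta_c)\}^{\theta_{ct}}
   \;=\; \sum_{c=1}^{k} h(x_t;\beta_c)\,\pi_c ,
\qquad\text{whence}\qquad
\sum_{\theta} L_C(B,\pi,\theta) \;=\; L_M(B,\pi).
```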
The dangers of treating missing values as parameters in this way in other problems have been pointed out in [6]. The same difficulties arise in recursive algorithms for unsupervised learning. Use of L corresponds to the decision-directed method [12], where biases occur; use of L_M corresponds to the more reliable probabilistic-teacher and quasi-Bayes methods [7].
III. CHOICE OF NUMBER OF CLASSES

As far as the establishment of a general formal procedure is concerned, this remains an unsolved problem in the context of mixtures. Even the question of choosing between k - 1 and k classes is not easy. As pointed out in the correspondence,¹ use of the generalized likelihood ratio test in general fails. Given that the more parsimonious model is true, the test statistic usually does not, even asymptotically, have a chi-squared distribution. Certain of the necessary regularity conditions fail. In the correspondence¹ the use of Akaike's AIC criterion is proposed, but it has to be said that its theoretical foundation requires the same conditions as does the above "chi-squared" result. Although one cannot quarrel with the result of the application in ¹ of the method to Fisher's iris data, it still has to be regarded as another ad hoc approach in this context. At present, however, there is no real formal alternative. A computationally demanding Monte Carlo method is mentioned in [1].
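For concreteness, AIC for a k-component fit is -2 max log L_M + 2(number of free parameters), and k is chosen to minimize it. A minimal sketch follows, assuming scikit-learn's GaussianMixture with a tied covariance as a stand-in for the common-variance fits; the caveat above about the regularity conditions behind AIC applies unchanged.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 500),
                    rng.normal(2.5, 1.0, 500)]).reshape(-1, 1)

for k in range(1, 5):
    gm = GaussianMixture(n_components=k, covariance_type="tied",
                         random_state=0).fit(x)
    print(k, gm.aic(x))      # AIC = -2 max log-likelihood + 2 * n_parameters
```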
IV. DISCUSSION

Important recent references on mixtures are [9], where maximum likelihood methods in particular are thoroughly examined, the monograph [5], and a forthcoming monograph by U. E. Makov, A.F.M. Smith, and the present author.

REFERENCES
[1] M. Aitkin, D. Anderson, and J. Hinde, "Statistical modelling of data on teaching styles (with discussion)," J. R. Statist. Soc. A, vol. 144, pp. 419-461, 1981.
[2] H. Akaike, "A new look at statistical model identification," IEEE Trans. Automat. Contr., vol. AC-19, pp. 716-723, 1974.
[3] H. Akaike, "Likelihood of a model and information criteria," J. Econometr., vol. 16, pp. 1-14, 1981.
[4] P. Bryant and J. A. Williamson, "Asymptotic behaviour of classification maximum likelihood estimates," Biometrika, vol. 65, pp. 273-281, 1978.
[5] B. S. Everitt and D. J. Hand, Finite Mixture Distributions. London: Chapman and Hall, 1981.
[6] R.J.A. Little and D. B. Rubin, "On jointly estimating parameters and missing data by maximizing the complete-data likelihood," Amer. Statist., vol. 37, pp. 218-220, 1983.
[7] U. E. Makov and A.F.M. Smith, "A quasi-Bayes unsupervised learning procedure for priors," IEEE Trans. Inform. Theory, vol. IT-23, pp. 761-764, 1977.
[8] F.H.C. Marriott, "Separating mixtures of normal distributions," Biometrics, vol. 31, pp. 767-769, 1975.
[9] R. A. Redner and H. F. Walker, "Mixture densities, maximum likelihood, and the EM algorithm," SIAM Rev., vol. 26, pp. 195-239, 1984.
[10] A. J. Scott and M. J. Symons, "Clustering methods based on likelihood ratio criteria," Biometrics, vol. 27, pp. 387-397, 1971.
[11] M. J. Symons, "Clustering criteria and multivariate normal mixtures," Biometrics, vol. 37, pp. 35-43, 1981.
[12] T. Y. Young and A. A. Farjo, "On decision-directed estimation and stochastic approximation," IEEE Trans. Inform. Theory, vol. IT-18, pp. 671-673, 1972.
Author's Reply²

²Manuscript received September 29, 1983; revised December 15, 1983. The author is with the Department of Quantitative Methods, University of Illinois at Chicago, Chicago, IL 60680.

¹S. L. Sclove, IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-5, pp. 428-433, July 1983.

The points made by Titterington on parameter estimation are well taken. However, the emphasis in my correspondence¹ was on image segmentation. Although the standard mixture model may be more appropriate for parameter estimation-even this is debatable-it is not clear that it is to be preferred for segmentation. Even in the context of parameter estimation, it should be noted that unbiasedness is neither a necessary nor a sufficient condition for a procedure to be good in some larger sense, such as mean squared error. Of course, asymptotic bias prevents consistency, but consistency is a large-sample concept and it is not clear how meaningful it is in finite samples.

Now, having said the above, let me say that I am certainly sympathetic to use of the standard mixture model. In fact, in my own work, more recent than that reported in my article,¹ I have included the standard mixture model approach in a
Markov model for image segmentation. Here [2] the mixture
probabilities are dependent upon neighboring pixels.
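The construction actually used in [2] is not reproduced here; purely as an illustration of the idea of mixture probabilities that depend on neighboring pixels, one might raise the weight of classes that are common among a pixel's four nearest neighbours, as in the following sketch (all names illustrative).

```python
import numpy as np

def neighbor_mixing_probs(labels, k, beta=1.0):
    """Illustrative only: pixelwise mixing probabilities that favour
    classes occurring among the 4-neighbours of each pixel
    (periodic boundary handling, via np.roll, for brevity)."""
    counts = np.zeros(labels.shape + (k,))
    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        shifted = np.roll(labels, shift=(dy, dx), axis=(0, 1))
        for c in range(k):
            counts[..., c] += (shifted == c)
    w = np.exp(beta * counts)              # favour locally frequent classes
    return w / w.sum(axis=-1, keepdims=True)
```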
Let me make some more general remarks. Part of the thrust
of the work reported in my article¹ (and earlier in [1]) was to
provide a probabilistic interpretation for the widely used isodata and K-means procedures. That is, I set for myself the
question of whether I could find some probabilistic model for
the clustering problem such that isodata and K-means fell out
as corresponding to some method of estimation in that model.
I was able to show that they correspond to methods of iterative maximum-likelihood estimation in what I termed the conditional mixture model for those problems. (Colleagues have
told me that in doing this-the work was done in 1971-1972
and issued as a technical report, but was not published until
1977 [1]-I anticipated the EM algorithm.) Furthermore, the
interpretation of these algorithms in terms of a probabilistic
model shows how to generalize them and generate other algorithms which may at times be more appropriate. I think that
people have found these results to be illuminating.
REFERENCES
S. L. Sclove, "Population mixture models and clustering algorithms," Commun. Statist., vol. A6, pp. 417-434, 1977.
[2]
,"On segmentation of digital images using spatial and contextual information via a two-dimensional Markov model," in Proc.
Army Res. Office Conf Unsupervised Image Anal., D. B. Cooper,
R. Launer, and D. McClure, Eds. New York: Academic, to be
published.
Comments on "A Model for Radar Images and Its Application
to Adaptive Digital Filtering of Multiplicative Noise"
JOHN W. WOODS AND JAN BIEMOND

Abstract-In a recent paper [1] a model for radar images was derived and a method for mean-square error (MSE) filtering of noisy radar images was presented. The purpose of this correspondence is to point out that the filter in [1] is not optimum in the MSE sense and to show that the ultimate filter scheme is based on a very restricted model for radar images.

Manuscript received March 28, 1983; revised June 6, 1983. This work was supported in part by The Netherlands Organization for the Advancement of Pure Research (ZWO), and in part by the U.S. National Science Foundation under Grant ECS 80-12569.
J. W. Woods is with the Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, Troy, NY 12181.
J. Biemond is with the Information Theory Group, Department of Electrical Engineering, Delft University of Technology, Delft, The Netherlands.

Recently, Frost et al. [1] have addressed the problem of optimum minimum mean-square error (MMSE) estimation of images in multiplicative noise with particular application to synthetic aperture radar (SAR). Some filter parameters are then estimated on-line to make the filter adaptive. The purpose of this correspondence is to point out that the filter in [1] is not optimum in the MSE sense and to correct several mistakes in its derivation.

1) In [1] the following model is used for a recorded SAR image:

$$I(x) = [r(x) \cdot n(x)] \otimes h(x) \tag{1}$$

where r(x), with x = (x, y), is the desired image, n(x) is a signal-independent fading noise component, and h(x) is the point-spread function of the radar system (antenna, receiver, and correlator). Instead of developing an MMSE filter as was stated, the authors try to derive the best linear noncausal MSE filter under the assumption of wide-sense stationary image and noise data.
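As a purely numerical illustration of model (1), and not of the specific signal and noise statistics assumed in [1], one can synthesize an image by pointwise multiplication of a stand-in scene by unit-mean fading noise, followed by convolution with a small point-spread function; every choice below is an arbitrary assumption made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 128
r = rng.gamma(shape=4.0, scale=1.0, size=(H, W))    # stand-in scene r(x)
n = rng.gamma(shape=4.0, scale=0.25, size=(H, W))   # unit-mean fading noise n(x)

h = np.zeros((H, W))                                # small point-spread function h(x)
h[:3, :3] = np.outer([1.0, 2.0, 1.0], [1.0, 2.0, 1.0]) / 16.0

# I = (r . n) convolved with h, implemented as a circular convolution via the FFT
I = np.real(np.fft.ifft2(np.fft.fft2(r * n) * np.fft.fft2(h)))
```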
2) The first requirement on the best linear filter is that it
be a member of the class of unbiased linear estimators, i.e.,
the noncausal estimator should have the following form:
$$\hat r(x) = \bar r + m(x) \otimes [I(x) - \bar I]. \tag{2}$$

The mean-square error then becomes

$$\varepsilon^2 = E\big\{\big[r(x) - \bar r - (I(x) - \bar I) \otimes m(x)\big]^2\big\} \tag{3}$$

which should be compared to [1, (7)]. The optimal transfer function M(f) is then

$$M(f) = \frac{\bar n\, S_r(f)\, H^*(f) - \bar n\, \bar r^2\, H^*(0)\, \delta(f)}{|H(f)|^2\,\big[S_r(f) \circledast S_n(f) - \bar n^2 \bar r^2 \delta(f)\big]} \tag{4}$$

for the case H(f) ≠ 0, which should be compared to [1, (8)]. (See Appendix for derivation.) By substituting the signal and noise models as used in [1] into (4), it can easily be shown that the value $M(f) = 1/\bar n$ for f = 0 in the second part of (8) in [1] is not the optimal value for the transfer function.
In the case where nonzero means are present, it is generally
better to work with covariance functions instead of correlation
functions and define the power spectral density as the Fourier
transform of the covariance function. Rewriting (4) in terms
of these power spectral densities S_rr(f) and S_nn(f) (S_r(f) and S_n(f) refer to the spectra as used in [1]), we obtain
$$M(f) = \frac{\bar n\, S_{rr}(f)\, H^*(f)}{|H(f)|^2\,\big[S_{rr}(f) \circledast S_{nn}(f) + \bar r^2 S_{nn}(f) + \bar n^2 S_{rr}(f)\big]} \tag{5}$$
which is free of delta functions. We can rewrite (5) as

$$M(f) = \frac{1}{H(f)}\left[\frac{\bar n\, S_{rr}(f)}{S_{rr}(f) \circledast S_{nn}(f) + \bar r^2 S_{nn}(f) + \bar n^2 S_{rr}(f)}\right] \tag{6}$$
to display the "data dependent" part of the filter (in the brackets) denoted M'(f) by Frost et al. Equation (5) or (6) then
gives the correct best linear estimator whose form is specified
in (2).
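A minimal sketch of evaluating (5) on a discrete frequency grid follows; the function name, the inputs, and the use of a circular convolution of the sampled spectra for the term S_rr(f) ⊛ S_nn(f) (with normalization constants glossed over) are assumptions made for illustration, not part of [1] or of the derivation above.

```python
import numpy as np

def wiener_transfer(Srr, Snn, Hf, rbar, nbar):
    """Transfer function M(f) of (5) on an FFT grid.
    Srr, Snn   : power spectra of the zero-mean parts of r(x) and n(x)
    Hf         : frequency response of the system
    rbar, nbar : means of r(x) and n(x)
    The spectral convolution Srr (*) Snn is approximated by a circular
    convolution of the sampled spectra."""
    conv = np.real(np.fft.ifft2(np.fft.fft2(Srr) * np.fft.fft2(Snn)))
    denom = np.abs(Hf) ** 2 * (conv + rbar ** 2 * Snn + nbar ** 2 * Srr)
    return nbar * Srr * np.conj(Hf) / np.where(denom == 0, np.inf, denom)
```

The restored image then follows from (2), by applying M(f) to the zero-mean part of I(x) in the frequency domain and adding back the mean of r(x).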
3) The argument that H(f) is not data dependent and hence
can be assumed constant over some finite bandwidth is
misleading.
a) Due to the proposed signal and noise models having
possibly very large (infinite) bandwidth, theoretically there will
always be an unacceptable amplification of signal or noise due
to the zeros in H(f).
b) By leaving out H(f), actually by assuming H(f) = 1 as was done in [1, (9)], the filter problem reduces from a multiplicative-noise-type problem including some blur function to the problem of filtering multiplicative noise only, which has been addressed before by several authors [2]-[4].
c) It is worth noting that both [5] and [6] address the
problem of Wiener filtering multiplicative noise and blur.
d) If one has zeros of H(f) within the band of interest, then a possible data model would be

$$I(x) = [r(x) \cdot n(x)] \otimes h(x) + n_1(x) \tag{7}$$