Comments on "Application of the Conditional Population-Mixture Model to Image Segmentation"

D. M. TITTERINGTON

Abstract-In the above correspondence¹ a maximum likelihood method is proposed for "estimating" class memberships and underlying statistical parameters, within the context of distribution mixtures. In the present comment it is pointed out that biases are incurred in parameter estimation, that the class memberships and parameters are conceptually different, and therefore that the so-called standard mixture likelihood is to be preferred. Also in the correspondence,¹ Akaike's information criterion (AIC) is used to choose the number of classes in the mixture. Here a brief theoretical caveat is issued.

Index Terms-Cluster analysis, image processing, image segmentation, maximum likelihood, mixtures of distributions, pattern recognition, pixel classification.

Manuscript received August 29, 1983. The author is with the Department of Statistics, University of Glasgow, Glasgow G12 8QW, Scotland.
¹S. L. Sclove, IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-5, pp. 428-433, July 1983.

I. INTRODUCTION

In the correspondence¹ the following problem is considered. At each of n pixels of a digital image a vector of p features is observed. The vector of features is X and the observed digital image is {x_t, t = 1, ···, n}. The image consists of several segments and each pixel belongs to exactly one segment. The set of labels which indicate to which segment any given pixel belongs is denoted by {y_t, t = 1, ···, n}. The complete data on the t-th pixel are therefore a realization of (X_t, y_t). In the problem considered, y_t is unobservable. It is assumed that, for different pixels, the (X, y) are independent and that, given y = c, X has the conditional probability function f(x|c). Often parametric models are chosen for f, so that

f(x|c) = h(x; β_c)

where β_c is a (possibly vector) parameter. In the correspondence¹ the following objectives are pursued, among others.

1) To develop a procedure for "estimating" the unknown y's and β_c's, for a given choice of k, the number of underlying segments.
2) To discuss methods for choosing the number of classes, giving particular mention to Akaike's information criterion AIC [2], [3].

The aim of the present note is to issue caveats on both these points.

II. ESTIMATION, GIVEN THE NUMBER OF CLASSES

As pointed out in the correspondence,¹ it is convenient to replace y_t by a set of indicators {θ_ct, c = 1, ···, k}. If y_t = c', then θ_ct = 0 for all c ≠ c' and θ_c't = 1. This leads, in the correspondence,¹ to the consideration of the likelihood

L = L(B, θ) = ∏_{t=1}^{n} ∏_{c=1}^{k} {h(x_t; β_c)}^{θ_ct}    (1)

where B denotes the set of β_c's and θ the set of θ_ct's. L is then maximized jointly with respect to B and θ, the latter being restricted to indicators of the type described above.
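To make the joint maximization of (1) concrete, the following minimal sketch (Python/NumPy; the function and variable names are illustrative, not taken from the correspondence) alternates between choosing each indicator θ_ct to maximize h(x_t; β_c) and re-estimating the β_c's from the pixels currently assigned to class c, for univariate normal components with a common variance. It is a sketch of the classification (complete-data) maximum likelihood procedure discussed below, not of the standard mixture approach.

import numpy as np

def classification_ml(x, k, n_iter=50, seed=0):
    """Joint maximization of the classification likelihood (1) over the
    indicators and the parameters, for k univariate normal components
    with a common variance."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    mu = rng.choice(x, size=k, replace=False)        # initial class means
    for _ in range(n_iter):
        # theta-step: give x_t the label c maximizing h(x_t; beta_c);
        # with a common variance this is just the nearest-mean rule.
        labels = np.argmin((x[:, None] - mu[None, :]) ** 2, axis=1)
        # B-step: re-estimate each mean from its current class
        # (keeping the old value if a class happens to empty).
        for c in range(k):
            members = x[labels == c]
            if members.size:
                mu[c] = members.mean()
    # common variance from the within-class sum of squares
    sigma2 = np.mean((x - mu[labels]) ** 2)
    return mu, sigma2, labels

Because the common variance cancels in the assignment step, the iteration reduces to nearest-mean relabelling and class averaging, which is why, as the Author's Reply below points out, isodata and K-means emerge from this conditional-mixture formulation.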
The resulting θ̂ defines a clustering or segmentation of the pixels into the k segments. Although segmentation is the major objective in the correspondence,¹ there could be said to be the implication that the corresponding B̂ provides sensible estimates of B. It is this latter implication that is misleading, because of the gross biases that occur. We illustrate this with a very simple example.

Mixture of Two Univariate Normal Densities with Common Variance

Suppose k = 2, p = 1 and

h_c(x) = h(x; β_c) = (2πσ²)^{-1/2} exp{-(x - μ_c)²/(2σ²)}, c = 1, 2.

Suppose also, without loss of generality, that μ_1 < μ_2 and x_1 ≤ ··· ≤ x_n. Any choice of θ defines a partitioning of (x_1, ···, x_n), and it turns out that the optimal partitioning is of the form (x_1, ···, x_r), (x_{r+1}, ···, x_n), for some r. Thus, the data are partitioned with all the "small" values in one group and all the "large" values in the other. Let x_0 denote the partitioning point.

If the partitioned sample is now analyzed as "ordinary" samples from the two normal components, estimating the means by the sample averages and estimating the common variance using the within-samples sum of squares, then it is clear that biases are introduced which do not even disappear asymptotically. Fig. 1, with the case of equal class probabilities in mind, indicates how this is caused by the overlap of the two densities and the truncation of both densities induced by the cutoff x_0. The estimate of μ_1 will, on average, be too low, that for μ_2 too high, and that for σ² also too low.

[Fig. 1: the two overlapping normal densities, truncated at the cutoff x_0.]

As far as asymptotically unbiased estimation of B is concerned, in general, use of what is called in the correspondence¹ the standard mixture model is much more appropriate. The corresponding likelihood is

L_M = L_M(B, π) = ∏_{t=1}^{n} { Σ_{c=1}^{k} π_c h(x_t; β_c) }    (2)

where π contains the k class probabilities. Once this is maximized, at (B̂, π̂), say, discriminant functions can be set up to form the basis of randomized or decision-directed pixel classification. Furthermore, since, often, consistent estimators of (B, π) result [9], the discriminant functions will be, asymptotically, the optimal likelihood-ratio rules.

The remarks in this section are not original. The inevitable biases were pointed out in [8], a critique of the use in [10] of the conditional mixture model. These comments were followed up and developed in [4]. More recently, [11] argues the usefulness of maximizing L, in (1), as a clustering technique. It is pointed out, however, that clusters of unacceptably equal size are often generated. An improvement is possible if (1) is replaced by

L_C(B, π, θ) = ∏_t ∏_c {π_c h(x_t; β_c)}^{θ_ct}.    (3)

This allows for possible inequality of the class probabilities. Clearly, setting π_c = k^{-1}, for all c, reduces (3) to (1). Use of (3) still does not remove the biases from the estimates of B, for instance.

There is a fundamental reason for using (2) instead. In the formulation of (1) and (3) it is assumed that B, π, and θ are conceptually the same and that they are all parameters, to be estimated by maximum likelihood. In fact, only B and π are unknown population-defined quantities and therefore to be regarded as parameters in this sense. The θ's are best treated as unobservable random variables associated with the data. The "correct" likelihood to maximize is that obtained from the observable data, that is, the x_t's. Given the relationship between likelihoods and probability densities, we obtain the marginal probability density for the x_t's by writing down the joint probability density for the x_t's and θ_t's and then summing over the θ_t's. For mixture problems, the joint density in question is L_C, and summation over θ gives L_M. We can now motivate the subscript of L_C: L_C is the "complete data" likelihood.

The dangers of treating missing values as parameters in this way in other problems have been pointed out in [6]. The same difficulties arise in recursive algorithms for unsupervised learning. Use of L corresponds to the decision-directed method [12], where biases occur; use of L_M corresponds to the more reliable probabilistic-teacher and quasi-Bayes methods [7].
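For comparison with the classification procedure sketched above, the standard mixture likelihood (2) is usually maximized by an EM-type iteration (see [9]), in which the indicators θ_ct are replaced by their conditional expectations, i.e., posterior class probabilities, rather than by 0/1 assignments. The sketch below (Python/NumPy; names and the common-variance normal model are illustrative assumptions) does this for the univariate case.

import numpy as np

def mixture_ml_em(x, k=2, n_iter=200, seed=0):
    """EM iteration for the standard mixture likelihood (2):
    k univariate normal components with a common variance."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    mu = rng.choice(x, size=k, replace=False)
    sigma2 = np.var(x)
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: posterior probability that x_t belongs to class c.
        dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2 / sigma2) \
               / np.sqrt(2 * np.pi * sigma2)
        w = pi * dens
        w /= w.sum(axis=1, keepdims=True)
        # M-step: weighted (not truncated) estimates of pi, mu, sigma2.
        nc = w.sum(axis=0)
        pi = nc / len(x)
        mu = (w * x[:, None]).sum(axis=0) / nc
        sigma2 = (w * (x[:, None] - mu[None, :]) ** 2).sum() / len(x)
    return pi, mu, sigma2

Because every observation contributes to every component in proportion to its posterior probability, no hard cutoff x_0 is introduced; this is the mechanism behind the consistency results cited above.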
III. CHOICE OF NUMBER OF CLASSES

As far as the establishment of a general formal procedure is concerned, this remains an unsolved problem in the context of mixtures. Even the question of choosing between k - 1 and k classes is not easy. As pointed out in the correspondence,¹ use of the generalized likelihood ratio test in general fails: given that the more parsimonious model is true, the test statistic usually does not, even asymptotically, have a chi-squared distribution, because certain of the necessary regularity conditions fail. In the correspondence¹ the use of Akaike's AIC criterion is proposed, but it has to be said that its theoretical foundation requires the same conditions as does the above "chi-squared" result. Although one cannot quarrel with the result of the application in the correspondence¹ of the method to Fisher's iris data, it still has to be regarded as another ad hoc approach in this context. At present, however, there is no real formal alternative. A computationally demanding Monte Carlo method is mentioned in [1].
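Purely as an illustration of how AIC would be applied here (keeping the theoretical caveat above in mind), one fits the standard mixture model for successive values of k and compares AIC = -2 log L_M + 2d, where d is the number of free parameters. The sketch below (Python/NumPy) assumes the common-variance univariate normal mixture used earlier; the function name and the parameter count are illustrative assumptions, not taken from the correspondence.

import numpy as np

def mixture_aic(x, pi, mu, sigma2):
    """AIC = -2 max log L_M + 2 d for a fitted k-component, common-variance
    univariate normal mixture; d counts (k - 1) mixing proportions,
    k means, and one variance."""
    x = np.asarray(x, dtype=float)
    dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2 / sigma2) \
           / np.sqrt(2 * np.pi * sigma2)
    loglik = np.log((pi * dens).sum(axis=1)).sum()
    k = len(mu)
    d = (k - 1) + k + 1
    return -2.0 * loglik + 2.0 * d

# Usage: fit the standard mixture model (e.g., by the EM sketch above) for
# k = 1, 2, 3, ... and retain the k giving the smallest AIC value.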
IV. DISCUSSION

Important recent references on mixtures are [9], where maximum likelihood methods in particular are thoroughly examined, the monograph [5], and a forthcoming monograph by U. E. Makov, A.F.M. Smith, and the present author.

REFERENCES

[1] M. Aitkin, D. Anderson, and J. Hinde, "Statistical modelling of data on teaching styles (with discussion)," J. R. Statist. Soc. A, vol. 144, pp. 419-461, 1981.
[2] H. Akaike, "A new look at statistical model identification," IEEE Trans. Automat. Contr., vol. AC-19, pp. 716-723, 1974.
[3] -, "Likelihood of a model and information criteria," J. Econometr., vol. 16, pp. 1-14, 1981.
[4] P. Bryant and J. A. Williamson, "Asymptotic behaviour of classification maximum likelihood estimates," Biometrika, vol. 65, pp. 273-281, 1978.
[5] B. S. Everitt and D. J. Hand, Finite Mixture Distributions. London: Chapman and Hall, 1981.
[6] R.J.A. Little and D. B. Rubin, "On jointly estimating parameters and missing data by maximizing the complete-data likelihood," Amer. Statist., vol. 37, pp. 218-220, 1983.
[7] U. E. Makov and A.F.M. Smith, "A quasi-Bayes unsupervised learning procedure for priors," IEEE Trans. Inform. Theory, vol. IT-23, pp. 761-764, 1977.
[8] F.H.C. Marriott, "Separating mixtures of normal distributions," Biometrics, vol. 31, pp. 767-769, 1975.
[9] R. A. Redner and H. F. Walker, "Mixture densities, maximum likelihood, and the EM algorithm," SIAM Rev., vol. 26, pp. 195-239, 1984.
[10] A. J. Scott and M. J. Symons, "Clustering methods based on likelihood ratio criteria," Biometrics, vol. 27, pp. 387-397, 1971.
[11] M. J. Symons, "Clustering criteria and multivariate normal mixtures," Biometrics, vol. 37, pp. 35-43, 1981.
[12] T. Y. Young and A. A. Farjo, "On decision-directed estimation and stochastic approximation," IEEE Trans. Inform. Theory, vol. IT-18, pp. 671-673, 1972.

Author's Reply²

¹S. L. Sclove, IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-5, pp. 428-433, July 1983.
²Manuscript received September 29, 1983; revised December 15, 1983. The author is with the Department of Quantitative Methods, University of Illinois at Chicago, Chicago, IL 60680.

The points made by Titterington on parameter estimation are well taken. However, the emphasis in my correspondence¹ was on image segmentation. Although the standard mixture model may be more appropriate for parameter estimation (even this is debatable), it is not clear that it is to be preferred for segmentation. Even in the context of parameter estimation, it should be noted that unbiasedness is neither a necessary nor a sufficient condition for a procedure to be good in some larger sense, such as mean squared error. Of course, asymptotic bias prevents consistency, but consistency is a large-sample concept and it is not clear how meaningful it is in finite samples.

Now, having said the above, let me say that I am certainly sympathetic to use of the standard mixture model. In fact, in my own work, more recent than that reported in my article,¹ I have included the standard mixture model approach in a Markov model for image segmentation. Here [2] the mixture probabilities are dependent upon neighboring pixels.

Let me make some more general remarks. Part of the thrust of the work reported in my article¹ (and earlier in [1]) was to provide a probabilistic interpretation for the widely used isodata and K-means procedures. That is, I set for myself the question of whether I could find some probabilistic model for the clustering problem such that isodata and K-means fell out as corresponding to some method of estimation in that model. I was able to show that they correspond to methods of iterative maximum-likelihood estimation in what I termed the conditional mixture model for those problems. (The work was done in 1971-1972 and issued as a technical report, but was not published until 1977 [1]; colleagues have told me that in doing this I anticipated the EM algorithm.) Furthermore, the interpretation of these algorithms in terms of a probabilistic model shows how to generalize them and generate other algorithms which may at times be more appropriate. I think that people have found these results to be illuminating.

REFERENCES

[1] S. L. Sclove, "Population mixture models and clustering algorithms," Commun. Statist., vol. A6, pp. 417-434, 1977.
[2] -, "On segmentation of digital images using spatial and contextual information via a two-dimensional Markov model," in Proc. Army Res. Office Conf. Unsupervised Image Anal., D. B. Cooper, R. Launer, and D. McClure, Eds. New York: Academic, to be published.
Comments on "A Model for Radar Images and Its Application to Adaptive Digital Filtering of Multiplicative Noise" where r(x) with x = (x, y) is the desired image, n(x) is a signalindependent fading noise component, and h(x) is the pointspread function of the radar system (antenna, receiver, and correlator). Instead of developing an MMSE filter as was stated, the authors try to derive the best linear noncausal MSE filter under the assumption of wide-sense stationary image and noise data. 2) The first requirement on the best linear filter is that it be a member of the class of unbiased linear estimators, i.e., the noncausal estimator should have the following form: r(x)=r +m(x)Q [I(x)- /1. (2) The mean-square error then becomes C' =E[(r(x)- r - (I(x)- I) g m W) I (3) which should be compared to [ 1, (7)] . The optimal transfer function M(f) is then nSr(f) H*(f) - nr2 H*(O) 6(f) J H( Jr f) 12 Sr(f)(3Sn ( f) - n2 r2 () (f) (4) for the case H(f) # 0, which should be compared to [1, (8)] . (See Appendix for derivation.) By substituting the signal and noise models as used in [ 1 ] into (4), it can easily be shown that the value M(f) = 1 Ini for f = 0 in the second part of (8) in [ 1] is not the optimal value for the transfer function. In the case where nonzero means are present, it is generally better to work with covariance functions instead of correlation functions and define the power spectral density as the Fourier transform of the covariance function. Rewriting (4) in terms of these power spectral densities Srr(f) and S,n(f) (Sr(f) and Sn(f) refer to the spectra as used in [1]), we obtain M H()2[Sr(f in Srr(f) H*(f) Snn(f) + r2Snn(f) + n2Srr(f)] (5) Abstract-In a recent paper 1] a model for radar images was derived which is free of delta functions. We can rewrite (5) as and a method for mean-square error (MSE) filtering of noisy radar images J[Srr(f) JOHN W. WOODS AND JAN BIEMOND was presented. The purpose of this correspondence is to point out that the filter in [1] is not optimum in the MSE sense and to show that the ultimate filter scheme is based on a very restricted model for radar images. Recently, Frost et al. [ 1] have addressed the problem of optimum minimum mean-square error (MM SE) estimation of images in multiplicative noise with particular application to synthetic aperture radar (SAR). Some filter parameters are then estimated on-line to make the filter adaptive. The purpose of this correspondence is to point out that the filter in [ 1] is not optimum in the MSE sense and to correct several mistakes in its derivation. 1) In [1 the following model is used for a recorded SAR image: (1) I(x) = [r(x)- n(x)] ® h(x) Manuscript received March 28, 1983; revised June 6, 1983. This work was supported in part by The Netherlands Organization for the Advancement of Pure Research (ZWO), and in part by the U.S. National Science Foundation under Grant ECS 80-12569. J. W. Woods is with Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, Troy, NY 12181. J. Biemond is with the Information Theory Group, Department of Electrical Engineering, Delft University of Technology, Delft, The Netherlands. H(f) [Srr(f) ( Snn(f) + r2Snn(f) + n2 Srr(f) (6) to display the "data dependent" part of the filter (in the brackets) denoted M'(f) by Frost et al. Equation (5) or (6) then gives the correct best linear estimator whose form is specified in (2). 3) The argument that H(f) is not data dependent and hence can be assumed constant over some finite bandwidth is misleading. 
3) The argument that H(f) is not data dependent and hence can be assumed constant over some finite bandwidth is misleading.

a) Due to the proposed signal and noise models having possibly very large (infinite) bandwidth, theoretically there will always be an unacceptable amplification of signal or noise due to the zeros in H(f).

b) By leaving out H(f), actually by assuming H(f) = 1 as was done in [1, (9)], the filter problem reduces from a multiplicative-noise problem including some blur function to the problem of filtering multiplicative noise only, which has been addressed before by several authors [2]-[4].

c) It is worth noting that both [5] and [6] address the problem of Wiener filtering multiplicative noise and blur.

d) If one has zeros of H(f) within the band of interest, then a possible data model would be

I(x) = [r(x) · n(x)] ⊛ h(x) + n_1(x)    (7)
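Point a) can be checked numerically. Under the reconstruction of (5) given above, the braced data-dependent term M'(f) in (6) never exceeds 1/n̄ (its denominator contains the term n̄² S_rr(f)), so any unbounded growth of |M(f)| comes entirely from the 1/H(f) factor near the zeros of H(f). The tiny snippet below (Python/NumPy; the sinc-shaped H is an arbitrary illustrative choice) simply evaluates that factor.

import numpy as np

# Illustration of point a): near a zero of H(f) the factor 1/H(f) in (6),
# and hence the overall filter gain, becomes arbitrarily large.
f = np.linspace(-2.0, 2.0, 401)      # illustrative frequency axis
H = np.sinc(f)                       # illustrative transfer function with zeros at f = +/-1, +/-2
gain = np.where(np.abs(H) > 1e-12, 1.0 / np.abs(H), np.inf)
print(gain.max())                    # unbounded amplification at the zeros of H(f)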