GNGTS - Appendice Atti del 16° Convegno Nazionale - 06.07

F. Migliaccio, V. Tornatore and M.A. Brovelli
DIIAR - Sez. Rilevamento - Politecnico di Milano

CLUSTERS AND PROBABILISTIC MODELS: THE ONE-DIMENSIONAL, MULTI-GAUSSIAN CASE

Abstract. In this paper we specify the theory of the probabilistic approach to the problem of cluster analysis in the case where more than two gaussian distributions are involved, meaning that at least three clusters have been identified in the data. We will refer extensively to [6], where the principles and equations ruling this approach have been presented. However, the estimation theory is summarized again in the first section of the paper (Introduction).

Riassunto. In this paper we intend to extend the theory of the probabilistic approach to the problem of cluster analysis to the case where at least three normal distributions are present, which means that at least three internally homogeneous groups (clusters) have been identified in the data. We will refer extensively to the previous work [6], where the principles and the equations governing this type of approach were illustrated. Nevertheless, the estimation theory is summarized again in the introductory section.

INTRODUCTION: THE PROBABILISTIC APPROACH

Cluster analysis can be seen as a set of procedures which allow one to subdivide sets of data into groups with homogeneous characteristics. From our point of view, it is most fruitful to use a statistical model for the data. This means that the problem of cluster analysis can be stated in the following way: estimate a distribution which is a "mixture" of two or more distributions. Once the number of distributions and their family have been identified, the problem can be solved as a parameter estimation problem. A standard method to estimate the parameters would be the maximum likelihood criterion; unfortunately, it gives rise to equations of increasing complexity. A simpler method, such as the least squares principle, can be used instead, but it must be applied in a non-elementary way. It has been shown in [6] how a pseudo-least-squares method can be applied to the solution of a cluster analysis problem in which the data can be divided into two groups, both describing the theoretical approach and applying the results to the analysis of data representing grey values in two portions of SPOT images. Those results are summarized below. In general, the aim is to estimate a distribution resulting from the "mixture" of two or more distributions, according to the equation

$$L(x \mid p, \theta) = \sum_{j=1}^{M} p_j \, f_j(x \mid \theta^{(j)}) \qquad (1)$$

where:
M = number of distributions which form the "mixture" distribution;
p_j = probability of having data in the j-th distribution;
θ^(j) = parameters of the j-th distribution.

The parameters to be estimated are p and θ. The problem is quite complex, both because the probabilities p_j are involved and because the likelihood (1) does not belong to the exponential family.
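To fix ideas, the gaussian-component mixture density (1) can be evaluated with a few lines of Python. This is a minimal illustrative sketch (the component parameters in the example are made up), not code from the paper:

```python
import numpy as np

def mixture_pdf(x, p, mu, sigma):
    """Mixture density (1) with gaussian components:
    L(x | p, theta) = sum_j p_j * phi(x; mu_j, sigma_j), with sum_j p_j = 1."""
    x = np.asarray(x, dtype=float)
    dens = np.zeros_like(x)
    for pj, mj, sj in zip(p, mu, sigma):
        dens += pj * np.exp(-0.5 * ((x - mj) / sj) ** 2) / (sj * np.sqrt(2.0 * np.pi))
    return dens

# Example: a two-gaussian mixture (the case treated in [6]) evaluated on a grid
grid = np.linspace(-4.0, 10.0, 8)
print(mixture_pdf(grid, p=[0.6, 0.4], mu=[0.0, 5.0], sigma=[1.0, 1.5]))
```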
The solution can be achieved by defining a principle which is strictly analogous (both from the intuitive and from the statistical point of view) to least squares, in the form

$$\min_{(p,\theta)} \sum_{i=1}^{m} \frac{\left( N_i - \nu_i(p,\theta) \right)^2}{\nu_i(p,\theta)} \qquad (2)$$

In fact, (2) states that, dividing the real axis x into m intervals, the N_i (actual number of points falling in the i-th interval) have to be close to the ν_i (number of points falling in the i-th interval according to the theoretical distribution); besides, the target function has the same distribution as that of the ordinary least squares principle. It must be noticed (cf. [6]) that the least squares principle cannot be applied straightforwardly to the vector with components (N_i − ν_i), as such a vector (asymptotically) has a singular covariance matrix. It can be proved that the estimators $(\hat p, \hat\theta)$ so obtained are consistent, i.e. when N_tot → ∞, $(\hat p, \hat\theta) \to (p, \theta)$ in probability.

As an illustration of the above principle, a rather simple case was treated in [6], which however is already capable of modelling realistic situations. A series of N_tot data was assumed, known to belong to one of two gaussian distributions. This means that the likelihood family is of the form

$$L(x \mid p, \theta) = p \,\varphi_1(\mu_1, \sigma_1) + (1-p) \,\varphi_2(\mu_2, \sigma_2) \qquad (3)$$

where, according to the symbols adopted,

$$\theta^{(1)} = (\mu_1, \sigma_1), \quad p_1 = p; \qquad \theta^{(2)} = (\mu_2, \sigma_2), \quad p_2 = 1-p.$$

In this case, the "observation equations" can then be written as

$$\nu_i(x \mid p, \theta) = N_{tot} \int_{a_i}^{b_i} \left[ p \,\varphi_1(x; \mu_1, \sigma_1) + (1-p) \,\varphi_2(x; \mu_2, \sigma_2) \right] dx$$
$$= N_{tot}\, p \left[ \operatorname{erf}\!\left( \frac{b_i - \mu_1}{\sigma_1} \right) - \operatorname{erf}\!\left( \frac{a_i - \mu_1}{\sigma_1} \right) \right] + N_{tot}\,(1-p) \left[ \operatorname{erf}\!\left( \frac{b_i - \mu_2}{\sigma_2} \right) - \operatorname{erf}\!\left( \frac{a_i - \mu_2}{\sigma_2} \right) \right] \qquad (4)$$

where a_i, b_i are the extremes of the i-th interval and

$$\operatorname{erf}(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2} \, dt.$$

Equation (4) must be linearized and then inserted into the minimum principle (2). A normal system is obtained, which is typical of least squares; in this case, however, the parameter estimates are retrieved, but not their covariance matrix (which can be subsequently estimated).

THE ONE-DIMENSIONAL, THREE-GAUSSIANS CASE

We want here to extend the probabilistic theory summarized in the Introduction to the case where the underlying likelihood family is of the type

$$L(x \mid p, \theta) = p_1 \,\varphi_1(\mu_1, \sigma_1) + p_2 \,\varphi_2(\mu_2, \sigma_2) + (1 - p_1 - p_2) \,\varphi_3(\mu_3, \sigma_3) \qquad (5)$$

We assume to work with a one-dimensional series of N_tot data, of which nothing is known regarding the distribution, except the fact that they belong to one of three normal curves. Their distribution can be written in the explicit form of the probability density

$$L(x \mid p, \theta) = f_x(x; p_1, p_2, \mu_1, \mu_2, \mu_3, \sigma_1, \sigma_2, \sigma_3) = p_1 \frac{1}{\sqrt{2\pi}\,\sigma_1} e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2}} + p_2 \frac{1}{\sqrt{2\pi}\,\sigma_2} e^{-\frac{(x-\mu_2)^2}{2\sigma_2^2}} + (1 - p_1 - p_2) \frac{1}{\sqrt{2\pi}\,\sigma_3} e^{-\frac{(x-\mu_3)^2}{2\sigma_3^2}} \qquad (6)$$

The parameters to be estimated are in this case eight:

$$\theta^{(1)} = (\mu_1, \sigma_1), \quad \theta^{(2)} = (\mu_2, \sigma_2), \quad \theta^{(3)} = (\mu_3, \sigma_3), \quad p = (p_1, p_2).$$

From the previously described theory, one can derive the observation equations

$$\nu_i(x \mid p, \theta^{(1)}, \theta^{(2)}, \theta^{(3)}) = N_{tot} \int_{a_i}^{b_i} \left[ p_1 \,\varphi_1(x; \mu_1, \sigma_1) + p_2 \,\varphi_2(x; \mu_2, \sigma_2) + (1 - p_1 - p_2) \,\varphi_3(x; \mu_3, \sigma_3) \right] dx \qquad (7)$$

$$\nu_i(x \mid p, \theta^{(1)}, \theta^{(2)}, \theta^{(3)}) = N_{tot}\, p_1 \left[ \operatorname{erf}\!\left( \frac{b_i - \mu_1}{\sigma_1} \right) - \operatorname{erf}\!\left( \frac{a_i - \mu_1}{\sigma_1} \right) \right] + N_{tot}\, p_2 \left[ \operatorname{erf}\!\left( \frac{b_i - \mu_2}{\sigma_2} \right) - \operatorname{erf}\!\left( \frac{a_i - \mu_2}{\sigma_2} \right) \right] + N_{tot}\,(1 - p_1 - p_2) \left[ \operatorname{erf}\!\left( \frac{b_i - \mu_3}{\sigma_3} \right) - \operatorname{erf}\!\left( \frac{a_i - \mu_3}{\sigma_3} \right) \right] \qquad (8)$$

As (8) is not linear in the parameters p_1, p_2, μ_1, μ_2, μ_3, σ_1, σ_2, σ_3, it has to be linearized.
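Before carrying out the linearization, it may help to see how the observation equations (8) are evaluated numerically. The following is a minimal Python sketch (ours, not the paper's code), assuming the interval boundaries a_i, b_i are stored as an array of edges; note that the "erf" of the paper is the standard normal cumulative distribution function, available in SciPy as norm.cdf:

```python
import numpy as np
from scipy.stats import norm

def nu(edges, N_tot, p, mu, sigma):
    """Expected counts nu_i of eq. (8): N_tot times the probability mass
    of the three-gaussian mixture in each interval [a_i, b_i]. The paper's
    'erf' is the standard normal CDF (norm.cdf here); the weight of the
    third component is 1 - p1 - p2."""
    p = np.asarray(p, dtype=float)
    w = np.append(p, 1.0 - p.sum())
    a, b = edges[:-1], edges[1:]
    counts = np.zeros(len(a))
    for wj, m, s in zip(w, mu, sigma):
        counts += wj * (norm.cdf((b - m) / s) - norm.cdf((a - m) / s))
    return N_tot * counts

# Illustrative values, in the same spirit as case 1 of the simulations below
edges = np.linspace(-4.0, 16.0, 21)
print(nu(edges, N_tot=1000, p=[0.5, 0.2], mu=[0.0, 5.0, 12.0], sigma=[1.0, 1.5, 1.5]))
```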
We set:

$$p_1 = \tilde p_1 + \delta p_1, \quad \mu_1 = \tilde\mu_1 + \delta\mu_1, \quad \frac{1}{\sigma_1} = \frac{1}{\tilde\sigma_1} + \delta\!\left(\frac{1}{\sigma_1}\right),$$
$$p_2 = \tilde p_2 + \delta p_2, \quad \mu_2 = \tilde\mu_2 + \delta\mu_2, \quad \frac{1}{\sigma_2} = \frac{1}{\tilde\sigma_2} + \delta\!\left(\frac{1}{\sigma_2}\right),$$
$$\mu_3 = \tilde\mu_3 + \delta\mu_3, \quad \frac{1}{\sigma_3} = \frac{1}{\tilde\sigma_3} + \delta\!\left(\frac{1}{\sigma_3}\right).$$

The resulting linearized equation has the form

$$\nu_i = \tilde\nu_i + N_{tot} \left[ \operatorname{erf}\!\left( \frac{b_i - \tilde\mu_1}{\tilde\sigma_1} \right) - \operatorname{erf}\!\left( \frac{a_i - \tilde\mu_1}{\tilde\sigma_1} \right) - \operatorname{erf}\!\left( \frac{b_i - \tilde\mu_3}{\tilde\sigma_3} \right) + \operatorname{erf}\!\left( \frac{a_i - \tilde\mu_3}{\tilde\sigma_3} \right) \right] \delta p_1$$
$$+ \, N_{tot} \left[ \operatorname{erf}\!\left( \frac{b_i - \tilde\mu_2}{\tilde\sigma_2} \right) - \operatorname{erf}\!\left( \frac{a_i - \tilde\mu_2}{\tilde\sigma_2} \right) - \operatorname{erf}\!\left( \frac{b_i - \tilde\mu_3}{\tilde\sigma_3} \right) + \operatorname{erf}\!\left( \frac{a_i - \tilde\mu_3}{\tilde\sigma_3} \right) \right] \delta p_2$$
$$- \, N_{tot}\, \tilde p_1 \frac{1}{\sqrt{2\pi}\,\tilde\sigma_1} \left[ e^{-\frac{(b_i - \tilde\mu_1)^2}{2\tilde\sigma_1^2}} - e^{-\frac{(a_i - \tilde\mu_1)^2}{2\tilde\sigma_1^2}} \right] \delta\mu_1 \; - \; N_{tot}\, \tilde p_2 \frac{1}{\sqrt{2\pi}\,\tilde\sigma_2} \left[ e^{-\frac{(b_i - \tilde\mu_2)^2}{2\tilde\sigma_2^2}} - e^{-\frac{(a_i - \tilde\mu_2)^2}{2\tilde\sigma_2^2}} \right] \delta\mu_2$$
$$- \, N_{tot}\,(1 - \tilde p_1 - \tilde p_2) \frac{1}{\sqrt{2\pi}\,\tilde\sigma_3} \left[ e^{-\frac{(b_i - \tilde\mu_3)^2}{2\tilde\sigma_3^2}} - e^{-\frac{(a_i - \tilde\mu_3)^2}{2\tilde\sigma_3^2}} \right] \delta\mu_3$$
$$+ \, N_{tot}\, \tilde p_1 \frac{1}{\sqrt{2\pi}} \left[ (b_i - \tilde\mu_1)\, e^{-\frac{(b_i - \tilde\mu_1)^2}{2\tilde\sigma_1^2}} - (a_i - \tilde\mu_1)\, e^{-\frac{(a_i - \tilde\mu_1)^2}{2\tilde\sigma_1^2}} \right] \delta\!\left(\frac{1}{\sigma_1}\right)$$
$$+ \, N_{tot}\, \tilde p_2 \frac{1}{\sqrt{2\pi}} \left[ (b_i - \tilde\mu_2)\, e^{-\frac{(b_i - \tilde\mu_2)^2}{2\tilde\sigma_2^2}} - (a_i - \tilde\mu_2)\, e^{-\frac{(a_i - \tilde\mu_2)^2}{2\tilde\sigma_2^2}} \right] \delta\!\left(\frac{1}{\sigma_2}\right)$$
$$+ \, N_{tot}\,(1 - \tilde p_1 - \tilde p_2) \frac{1}{\sqrt{2\pi}} \left[ (b_i - \tilde\mu_3)\, e^{-\frac{(b_i - \tilde\mu_3)^2}{2\tilde\sigma_3^2}} - (a_i - \tilde\mu_3)\, e^{-\frac{(a_i - \tilde\mu_3)^2}{2\tilde\sigma_3^2}} \right] \delta\!\left(\frac{1}{\sigma_3}\right) \qquad (9)$$

From this equation, the elements of the pseudo least squares procedure can be derived as in [6] and subsequently applied to both simulated and real data, in order to test the effectiveness of the theory developed.

EXPERIMENTS WITH SIMULATED DATA

The least squares equations written for the three-gaussian case were first tested by using different sets of simulated data, in order to establish a correct procedure to treat the observations. Three simulations were performed, depicting various situations: in each of them, synthetic data were produced representing points belonging to one of three normal curves, and the corresponding histograms were plotted (Fig. 1, Fig. 2, Fig. 3). The first case was obviously the simplest to treat: one can well distinguish the three curves since there is little overlapping, while the overlapping is quite large in the second case; the third situation represents an intermediate case.

Fig. 1 - Histogram of simulated data: case 1.

Fig. 2 - Histogram of simulated data: case 2.

To start the least squares procedure, approximate values of the parameters were computed on the basis of the "threshold criterion" described in [6] (a sketch follows Table 1 below). Obviously, there must now be two minimum points in each histogram (thresholds $\tilde t_1$, $\tilde t_2$): their values are reported in Table 1.

Table 1 - Thresholds (minimum points) in the three cases.

Case    t̃_1     t̃_2
1       2.25    8.25
2       2.25    3.75
3       2.75    6.75
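The following Python sketch illustrates one plausible reading of the threshold criterion (the exact recipe is given in [6]): the sample is split at the histogram minima, and the empirical weights, means and standard deviations of the three groups serve as starting values. The data generation mimics case 1, with the "true" parameters of Table 2 (a):

```python
import numpy as np

def approx_values(data, t1, t2):
    """Starting values from the thresholds t1 < t2 (histogram minima):
    split the sample at the thresholds and take the empirical weights,
    means and standard deviations of the three groups. A sketch of the
    'threshold criterion'; the exact recipe is given in [6]."""
    data = np.asarray(data, dtype=float)
    groups = [data[data < t1], data[(data >= t1) & (data < t2)], data[data >= t2]]
    p = [len(g) / len(data) for g in groups]     # p3 = 1 - p1 - p2 is implied
    mu = [g.mean() for g in groups]
    sigma = [g.std(ddof=1) for g in groups]
    return p[:2], mu, sigma

# Synthetic data in the spirit of case 1 (true parameters of Table 2 (a))
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.0, 1.0, 500),
                       rng.normal(5.0, 1.5, 200),
                       rng.normal(12.0, 1.5, 300)])
print(approx_values(data, t1=2.25, t2=8.25))
```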
The computed approximate values of the parameters, along with the values used to simulate the data, can be found in Tables 2 (a), 2 (b), 2 (c), where the results obtained after the least squares estimation are also given.

Fig. 3 - Histogram of simulated data: case 3.

Table 2 (a) - Results of the experiment with simulated data, case 1: approximate values computed with the "threshold" criterion.

Parameter   "True" value   Approx. value   Estim. value   Error (%)
p_1         0.5000         0.4970          0.4946         1.08
p_2         0.2000         0.2015          0.2050         2.51
μ_1         0.0000         0.0088          0.0081         -
μ_2         5.0000         5.0752          5.0353         0.71
μ_3         12.0000        11.9376         11.9407        0.49
σ_1         1.0000         0.9524          0.9659         3.41
σ_2         1.5000         1.4393          1.5523         3.49
σ_3         1.5000         1.5602          1.4890         0.74

Table 2 (b) - Results of the experiment with simulated data, case 2: approximate values computed with the "threshold" criterion.

Parameter   "True" value   Approx. value   Estim. value   Error (%)
p_1         0.5600         0.6411          0.6512         17.22
p_2         0.3300         0.1450          0.0753         77.42
μ_1         0.0000         0.1985          0.2559         -
μ_2         3.0000         2.9799          2.9469         1.77
μ_3         6.0000         5.4098          4.8530         19.12
σ_1         1.0000         1.0428          1.1138         11.38
σ_2         1.5000         0.4580          0.4355         70.97
σ_3         1.0000         1.1175          1.4737         47.37

Table 2 (c) - Results of the experiment with simulated data, case 3: approximate values computed with the "threshold" criterion.

Parameter   "True" value   Approx. value   Estim. value   Error (%)
p_1         0.5000         0.5180          0.4907         1.87
p_2         0.2000         0.1720          0.2134         6.68
μ_1         0.0000         0.1015          0.0039         -
μ_2         4.5000         4.7132          4.5415         0.92
μ_3         9.0000         8.9313          8.9749         0.28
σ_1         1.0000         1.0424          0.9723         2.77
σ_2         1.5000         1.1347          1.6636         10.91
σ_3         1.0000         1.0531          0.9858         1.42

As one immediately sees from Table 2 (a), case 1 gives (as was expected) very good results, 3.49% being the maximum relative error. The parameter estimates in case 3 (Table 2 (c)) can also be considered satisfactory. On the contrary, in case 2 (representing a much more difficult situation, with a large overlapping between two consecutive gaussian curves) the parameter estimates are very bad and easily attain unacceptable relative errors, mainly regarding the central curve φ_2 (see Table 2 (b)).

To obtain better results, a stepwise procedure was devised, especially in order to compute better approximate values (from Table 2 (b) it is evident that these are bad to start with) from which to estimate the parameters. This procedure can be summarized as follows:
- compute the differences between the observed values and the corresponding points computed with the approximate values of the parameters of the first and third normal curves, $\varphi_1(\tilde p_1, \tilde\mu_1, \tilde\sigma_1)$ and $\varphi_3((1 - \tilde p_1 - \tilde p_2), \tilde\mu_3, \tilde\sigma_3)$; the differences (residuals) are shown in Fig. 4;
- compute new approximate values $\tilde p_2, \tilde\mu_2, \tilde\sigma_2$ using the (positive) residuals of the above histogram; the negative residuals are equally shared between φ_1 and φ_3, giving rise to new approximate values of $\tilde p_1$ and $(1 - \tilde p_1 - \tilde p_2)$;
- proceed with the least squares estimation of the parameters, dividing it into two steps: first estimate p_1, p_2, keeping the remaining parameters fixed at their approximate values; then estimate μ_1, μ_2, μ_3, σ_1, σ_2, σ_3, with p_1, p_2 fixed (a sketch of this two-step estimation is given at the end of this section).

Fig. 4 - Histogram of the differences (residuals) between the observed values and the corresponding points computed with the approximate values of the parameters of the first and third normal curves.

The results of this procedure are summarized in Table 3: they show an appreciable improvement with respect to the previously achieved results, especially regarding the values of p_2, σ_2, σ_3. A confirmation of this behaviour was obtained using a χ² test to check the goodness-of-fit of the estimated model to the observed values. The hypothesis was accepted at the 5% significance level (χ² = 20.28 with 17 degrees of freedom).

Table 3 - Results of the stepwise procedure experimented on the data of case 2.

Parameter   "True" value   New approx. value   Estimated value     Error (%)
p_1         0.5600         0.6233              0.5856  (step I)    5.42
p_2         0.3300         0.1806              0.2478  (step I)    25.65
μ_1         0.0000         0.1985              0.0958  (step II)   -
μ_2         3.0000         2.8908              2.9240  (step II)   2.53
μ_3         6.0000         5.4098              5.5848  (step II)   6.92
σ_1         1.0000         1.0428              1.0292  (step II)   2.92
σ_2         1.5000         1.3823              1.1672  (step II)   22.19
σ_3         1.0000         1.1175              1.1736  (step II)   17.36

Finally, it has to be remarked that the stepwise procedure described above was also tried on the data of case 3. Although we do not report the results here, they confirmed what Table 2 (c) shows for case 3, meaning that this procedure is a good way of processing cases where the available data are quite "mixed".
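The two-step estimation can be sketched in Python as follows. The residuals are those implied by the minimum principle (2), i.e. (N_i − ν_i)/√ν_i, so that their sum of squares is exactly the target function; a general-purpose solver from SciPy replaces the normal system of the paper, and all numerical values (data, starting values) are illustrative only, in the spirit of case 2:

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.stats import norm

def residuals(params, fixed, which, edges, N_obs, N_tot):
    """Pseudo-least-squares residuals (N_i - nu_i)/sqrt(nu_i) implied by (2).
    which='p': params = (p1, p2), fixed = (mu, sigma);
    which='theta': params = (mu1, mu2, mu3, s1, s2, s3), fixed = (p1, p2)."""
    if which == 'p':
        (p1, p2), (mu, sigma) = params, fixed
    else:
        mu, sigma, (p1, p2) = params[:3], params[3:], fixed
    w = (p1, p2, 1.0 - p1 - p2)
    a, b = edges[:-1], edges[1:]
    nu = N_tot * sum(wj * (norm.cdf((b - m) / s) - norm.cdf((a - m) / s))
                     for wj, m, s in zip(w, mu, sigma))
    nu = np.maximum(nu, 1e-9)          # guard against (nearly) empty intervals
    return (N_obs - nu) / np.sqrt(nu)

# Synthetic data resembling case 2; the starting values play the role of
# the "new approximate values" of Table 3 (all numbers are illustrative).
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0.0, 1.0, 560),
                       rng.normal(3.0, 1.5, 330),
                       rng.normal(6.0, 1.0, 110)])
edges = np.linspace(-4.0, 9.0, 27)
N_obs, _ = np.histogram(data, bins=edges)

theta0 = (np.array([0.2, 2.9, 5.4]), np.array([1.0, 1.4, 1.1]))
step1 = least_squares(residuals, x0=[0.62, 0.18],                       # step I
                      args=(theta0, 'p', edges, N_obs, len(data)))
step2 = least_squares(residuals, x0=[0.2, 2.9, 5.4, 1.0, 1.4, 1.1],     # step II
                      args=(tuple(step1.x), 'theta', edges, N_obs, len(data)))
print("p1, p2:", step1.x, "  mu, sigma:", step2.x)
```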
AN EXPERIMENT WITH DATA FROM A SPOT IMAGE

It was decided to work on the data representing a portion of a SPOT image already treated in [6], which was named "SPOT2" (71 rows × 82 columns = 5822 pixels), as we were left with the impression that in this case the homogeneous groups were indeed three. The image is shown again in Fig. 5, while Fig. 6 shows the histogram of the corresponding grey values. Looking at it, the approximate threshold values were identified: $\tilde t_1 = 34$, $\tilde t_2 = 66$. Using $\tilde t_1, \tilde t_2$, the approximate values of the parameters were computed to start the least squares estimation. As it did not turn out to converge, the stepwise procedure was applied, after computing better approximate values (as described in the previous section). The results are reported in Table 4, while Fig. 7 shows the histogram of the estimated normal curves.

Fig. 5 - The SPOT2 image.

Fig. 6 - Histogram of the SPOT2 image grey values data.

Table 4 - Results of the stepwise procedure for the SPOT2 data.

Parameter   "True" value   New approx. value   Estimated value
p_1         0.2807         0.2521              0.2200   (step I)
p_2         0.5240         0.5811              0.6254   (step I)
μ_1         28.4504        28.4504             28.1048  (step II)
μ_2         48.2933        49.6313             46.9284  (step II)
μ_3         81.8945        81.8945             82.9092  (step II)
σ_1         2.8174         2.8174              1.9212   (step II)
σ_2         8.5713         12.4050             12.2749  (step II)
σ_3         12.2390        12.2390             13.5827  (step II)

Using the estimated values of the parameters, it is then possible to go back and compute new threshold values (estimated values), which better discriminate the three clusters. The criterion used to estimate t_1 is: equate the probability that a point belongs to the first normal curve,

$$P(k=1 \mid x) = \frac{p_1 \, f(x \mid 1)}{\sum_{i=1}^{3} p_i \, f(x \mid i)} \qquad (10)$$

to the probability that the point belongs to the second or the third normal curve,

$$P(k=2 \mid x \;\vee\; k=3 \mid x) = \frac{p_2 \, f(x \mid 2) + p_3 \, f(x \mid 3)}{\sum_{i=1}^{3} p_i \, f(x \mid i)} \qquad (11)$$

The equation

$$P(k=1 \mid x) = P(k=2 \mid x \;\vee\; k=3 \mid x) \qquad (12)$$

is explicitly written as

$$p_1 \, e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2} - \ln \sigma_1} = p_2 \, e^{-\frac{(x-\mu_2)^2}{2\sigma_2^2} - \ln \sigma_2} + (1 - p_1 - p_2) \, e^{-\frac{(x-\mu_3)^2}{2\sigma_3^2} - \ln \sigma_3} \qquad (13)$$

or, in another form,

$$e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2} - \ln \sigma_1 + \ln p_1} - e^{-\frac{(x-\mu_2)^2}{2\sigma_2^2} - \ln \sigma_2 + \ln p_2} - e^{-\frac{(x-\mu_3)^2}{2\sigma_3^2} - \ln \sigma_3 + \ln (1 - p_1 - p_2)} = 0 \qquad (14)$$

In a similar way, to estimate t_2 another equation is written:

$$e^{-\frac{(x-\mu_3)^2}{2\sigma_3^2} - \ln \sigma_3 + \ln (1 - p_1 - p_2)} - e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2} - \ln \sigma_1 + \ln p_1} - e^{-\frac{(x-\mu_2)^2}{2\sigma_2^2} - \ln \sigma_2 + \ln p_2} = 0 \qquad (15)$$

Fig. 7 - Estimated normal curves for the SPOT2 image.

Equations (14) and (15) must be solved for x, thus obtaining $\hat t_1, \hat t_2$. Unfortunately, no analytical solution exists for these equations, so it is necessary to carry out a Taylor expansion to solve them. To obtain approximate values for t_1, t_2, the likelihood ratio can be used (cf. [6]): what is found is $\tilde t_1 = 31.88$, $\tilde t_2 = 64.47$. The final estimated threshold values are $\hat t_1 = 31.48$, $\hat t_2 = 69.90$. The approximate (see Fig. 8) and estimated (see Fig. 9) threshold values were used to plot the SPOT2 image, highlighting the three clusters with three grey values (namely black, white and an intermediate grey value).

Fig. 8 - SPOT2 image: cluster identification by means of the $\tilde t_1, \tilde t_2$ threshold values.

Fig. 9 - SPOT2 image: cluster identification by means of the $\hat t_1, \hat t_2$ threshold values.
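Equations (14) and (15) also lend themselves to a direct numerical solution. The sketch below (ours, not the paper's procedure: a bracketing root finder replaces the Taylor expansion) estimates the two thresholds from the step II parameter values of Table 4; the bracketing intervals are assumptions, chosen around the approximate thresholds:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def threshold(w, mu, sigma, k, lo, hi):
    """Numerical solution of eqs. (14)/(15): find the x at which the
    posterior probability of cluster k equals that of the other two,
    i.e. w_k*phi_k(x) - sum_{j != k} w_j*phi_j(x) = 0 (the common
    denominator of (10)-(11) cancels). A bracketing root finder is
    used here in place of the Taylor expansion of the paper."""
    def g(x):
        dens = [wj * norm.pdf(x, m, s) for wj, m, s in zip(w, mu, sigma)]
        return 2.0 * dens[k] - sum(dens)
    return brentq(g, lo, hi)

# Step II estimates from Table 4; brackets chosen around the
# approximate thresholds t~1 = 31.88, t~2 = 64.47.
w = (0.2200, 0.6254, 1.0 - 0.2200 - 0.6254)
mu = (28.1048, 46.9284, 82.9092)
sigma = (1.9212, 12.2749, 13.5827)
t1_hat = threshold(w, mu, sigma, k=0, lo=29.0, hi=40.0)   # eq. (14)
t2_hat = threshold(w, mu, sigma, k=2, lo=60.0, hi=80.0)   # eq. (15)
print(t1_hat, t2_hat)
```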
REFEREE

This paper was reviewed by Prof. Fausto Sacerdote, Dipartimento di Ingegneria Civile, Università di Firenze.

REFERENCES

[1] B. Crippa, L. Mussio (1993). "Data compression and evaluation by cluster analysis", Proceedings of the ISPRS Commission I Workshop on "Digital sensors and systems".
[2] A. de Haan (1991). "Fundamentals of cluster analysis", Proceedings of the ISPRS Tutorial on "Mathematical aspects of data analysis".
[3] R.O. Duda, P.E. Hart (1973). "Pattern classification and scene analysis", J. Wiley and Sons.
[4] L. Kaufman, P.J. Rousseeuw (1990). "Finding groups in data", J. Wiley and Sons.
[5] J.S. Lim (1990). "Two-dimensional signal and image processing", Prentice Hall.
[6] F. Migliaccio, F. Sansò, V. Tornatore (1998). "Clusters and probabilistic models for a refined estimation theory", Bollettino di Geodesia e Scienze Affini, Anno LVII, N. 3.
[7] A.M. Mood, F.A. Graybill, D.C. Boes (1983). "Introduction to the theory of statistics", McGraw-Hill.