GNGTS - Appendice Atti del 16° Convegno Nazionale - 06.07
F. Migliaccio, V. Tornatore e M.A. Brovelli
DIIAR - Sez. Rilevamento - Politecnico di Milano
CLUSTERS AND PROBABILISTIC MODELS:
THE ONE-DIMENSIONAL, MULTI-GAUSSIAN CASE
Abstract. In this paper we extend the theory of the probabilistic approach to the problem of cluster analysis to the case where more than two gaussian distributions are involved, meaning that at least three clusters have been identified in the data.
We will refer extensively to [6], where the principles and equations governing this approach were presented. However, the estimation theory is summarized again in the first section of the paper (Introduction).
CLUSTERS AND PROBABILISTIC MODELS:
THE ONE-DIMENSIONAL, MULTI-NORMAL CASE
Summary. In this paper we extend the theory of the probabilistic approach to the cluster analysis problem to the case where at least three normal distributions are present, which means that at least three internally homogeneous groups (clusters) have been identified in the data.
We will refer extensively to the previous work [6], where the principles and equations governing this type of approach were illustrated. However, the estimation theory is summarized again in the introductory section.
INTRODUCTION: THE PROBABILISTIC APPROACH
Cluster analysis can be seen as a set of procedures which make it possible to subdivide sets of data into groups with homogeneous characteristics. From our point of view, it is most fruitful to use a statistical model for the data. This means that the problem of cluster analysis can be defined in the following way: to estimate a distribution which is a "mixture" of two or more distributions. Once the number of distributions and their family have been identified, the problem can be solved as a parameter estimation problem.
A standard method to estimate the parameters could be the maximum likelihood criterion: unfortunately, it gives rise to equations of increasing complexity. A simpler method, such as the least squares principle, can be used, but it must be applied in a non-elementary way.
It has been shown in [6] how a pseudo-least-squares method can be applied to the solution of a cluster analysis problem in which the data can be divided into two groups, both describing the theoretical approach and applying the results to the analysis of data representing grey values in two portions of SPOT images. Those results are summarized here. In general, the aim is to estimate a distribution resulting from the "mixture" of two or more distributions according to the equation
$$L(x \mid p, \theta) = \sum_{j=1}^{M} p_j\, f_j(x \mid \theta^{(j)}) \qquad (1)$$

where:
$M$ = number of distributions which form the "mixture" distribution;
$p_j$ = probability of having data in the $j$-th distribution;
$\theta^{(j)}$ = parameters of the $j$-th distribution.
The parameters to be estimated are $p$ and $\theta$.
The problem is quite complex, both because the probabilities $p_j$ are involved and because the likelihood (1) does not belong to the exponential family. The solution can be achieved by defining a principle which is strictly analogous (both from the intuitive and from the statistical point of view) to the least squares, in the form

$$\Phi(x \mid p, \theta) = \sum_{i=1}^{m} \frac{(N_i - \nu_i(p, \theta))^2}{\nu_i(p, \theta)} = \min \qquad (2)$$
In fact, (2) states that, dividing the real axis $x$ into $m$ intervals, $N_i$ (the actual number of points falling in the $i$-th interval) has to be close to $\nu_i$ (the number of points falling in the $i$-th interval according to the theoretical distribution); besides, the target function has the same distribution as that of the ordinary least squares principle. It must be noticed (cf. [6]) that the least squares principle cannot be applied straightforwardly to the vector with components $(N_i - \nu_i)$, as such a vector (asymptotically) has a singular covariance matrix.
It can be proved that the estimators $(\hat p, \hat\theta)$ so obtained are consistent, i.e. when $N_{tot} \to \infty$, $(\hat p, \hat\theta) \to (p, \theta)$ in probability.
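As a purely illustrative aside (not part of the original derivation), the target function (2) is straightforward to code; the sketch below assumes numpy arrays of observed interval counts and of the corresponding model values $\nu_i(p, \theta)$:

```python
import numpy as np

def pseudo_ls_target(N, nu):
    """Chi-square-type target of eq. (2): sum_i (N_i - nu_i)^2 / nu_i."""
    N = np.asarray(N, dtype=float)    # observed counts per interval
    nu = np.asarray(nu, dtype=float)  # expected counts nu_i(p, theta)
    return np.sum((N - nu) ** 2 / nu)
```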
As an illustration of the above principle, a rather simple case was treated in [6], which however is already capable of modelling realistic situations. It was assumed that a series of $N_{tot}$ data was available, known to belong to one of two gaussian distributions, so that the likelihood family is of the form

$$L(x \mid p, \theta) = p\,\varphi_1(\mu_1, \sigma_1) + (1 - p)\,\varphi_2(\mu_2, \sigma_2) \qquad (3)$$

where, according to the symbols adopted,

$$\theta^{(1)} = (\mu_1, \sigma_1), \quad p_1 = p; \qquad \theta^{(2)} = (\mu_2, \sigma_2), \quad p_2 = 1 - p.$$
In this case, the "observation equations" can then be written as

$$\nu_i(x \mid p, \theta) = N_{tot} \int_{a_i}^{b_i} \left[ p\,\varphi_1(x; \mu_1, \sigma_1) + (1 - p)\,\varphi_2(x; \mu_2, \sigma_2) \right] dx =$$

$$= N_{tot}\, p \left[ \operatorname{erf}\!\left( \frac{b_i - \mu_1}{\sigma_1} \right) - \operatorname{erf}\!\left( \frac{a_i - \mu_1}{\sigma_1} \right) \right] + N_{tot}\,(1 - p) \left[ \operatorname{erf}\!\left( \frac{b_i - \mu_2}{\sigma_2} \right) - \operatorname{erf}\!\left( \frac{a_i - \mu_2}{\sigma_2} \right) \right] \qquad (4)$$

where:

$$b_i = a_{i+1}, \qquad \operatorname{erf}\!\left( \frac{x - \mu}{\sigma} \right) = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{x} e^{-\frac{(t - \mu)^2}{2\sigma^2}}\, dt$$
Equation (4) must be linearized, then inserted into the minimum principle (2). A normal system is obtained, which is typical of least squares; however, in this case only the parameter estimates are retrieved, not their covariance matrix (which can be subsequently estimated). A numerical sketch of this two-gaussian estimation is given below.
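The sketch below is an illustrative substitute, not the paper's procedure: it computes the expected counts (4) with the normal CDF (which plays the role of the paper's $\operatorname{erf}$) and hands the target (2) to a generic bound-constrained optimizer instead of linearizing and solving the normal system:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def nu_two_gauss(edges, n_tot, p, mu1, s1, mu2, s2):
    """Expected counts per interval, eq. (4); edges = a_1, b_1 = a_2, ..., b_m."""
    F = p * norm.cdf(edges, mu1, s1) + (1 - p) * norm.cdf(edges, mu2, s2)
    return n_tot * np.diff(F)

def fit_two_gauss(edges, counts, q0):
    """Minimize (2) starting from approximate values q0 = (p, mu1, s1, mu2, s2)."""
    def target(q):
        nu = np.clip(nu_two_gauss(edges, counts.sum(), *q), 1e-9, None)
        return np.sum((counts - nu) ** 2 / nu)
    bounds = [(1e-3, 1 - 1e-3), (None, None), (1e-3, None),
              (None, None), (1e-3, None)]
    return minimize(target, q0, bounds=bounds)
```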
THE ONE-DIMENSIONAL, THREE-GAUSSIAN CASE
We want here to extend the probabilistic theory summarized in Section 1 to the case where the underlying likelihood family is of the type

$$L(x \mid p, \theta) = p_1\,\varphi_1(\mu_1, \sigma_1) + p_2\,\varphi_2(\mu_2, \sigma_2) + (1 - p_1 - p_2)\,\varphi_3(\mu_3, \sigma_3) \qquad (5)$$

We assume to work with a one-dimensional series of $N_{tot}$ data, of which nothing is known regarding the distribution, except the fact that they belong to one of three normal curves.
Their distribution can be written in the explicit form of the probability density

$$L(x \mid p, \theta) = f_x(x; p_1, p_2, \mu_1, \mu_2, \mu_3, \sigma_1, \sigma_2, \sigma_3) =$$

$$= p_1 \frac{1}{\sqrt{2\pi}\,\sigma_1}\, e^{-\frac{(x - \mu_1)^2}{2\sigma_1^2}} + p_2 \frac{1}{\sqrt{2\pi}\,\sigma_2}\, e^{-\frac{(x - \mu_2)^2}{2\sigma_2^2}} + (1 - p_1 - p_2) \frac{1}{\sqrt{2\pi}\,\sigma_3}\, e^{-\frac{(x - \mu_3)^2}{2\sigma_3^2}} \qquad (6)$$
The parameters to be estimated in this case are eight:

$$\theta^{(1)} = (\mu_1, \sigma_1), \qquad \theta^{(2)} = (\mu_2, \sigma_2), \qquad \theta^{(3)} = (\mu_3, \sigma_3), \qquad p = (p_1, p_2)$$
From the previously described theory, one can derive the observation equations

$$\nu_i(x \mid p, \theta^{(1)}, \theta^{(2)}, \theta^{(3)}) = N_{tot} \int_{a_i}^{b_i} \left[ p_1 \varphi_1(x; \mu_1, \sigma_1) + p_2 \varphi_2(x; \mu_2, \sigma_2) + (1 - p_1 - p_2)\, \varphi_3(x; \mu_3, \sigma_3) \right] dx \qquad (7)$$

which, integrating, become

$$\nu_i(x \mid p, \theta^{(1)}, \theta^{(2)}, \theta^{(3)}) = N_{tot}\, p_1 \left[ \operatorname{erf}\!\left( \frac{b_i - \mu_1}{\sigma_1} \right) - \operatorname{erf}\!\left( \frac{a_i - \mu_1}{\sigma_1} \right) \right] + N_{tot}\, p_2 \left[ \operatorname{erf}\!\left( \frac{b_i - \mu_2}{\sigma_2} \right) - \operatorname{erf}\!\left( \frac{a_i - \mu_2}{\sigma_2} \right) \right] +$$

$$+ N_{tot}\,(1 - p_1 - p_2) \left[ \operatorname{erf}\!\left( \frac{b_i - \mu_3}{\sigma_3} \right) - \operatorname{erf}\!\left( \frac{a_i - \mu_3}{\sigma_3} \right) \right] \qquad (8)$$
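In code, the observation equations (8) reduce to differences of normal CDFs (which again play the role of the paper's $\operatorname{erf}$); a sketch, with `edges` the interval boundaries $a_1, b_1 = a_2, \ldots, b_m$:

```python
import numpy as np
from scipy.stats import norm

def nu_three_gauss(edges, n_tot, p1, p2, mu, s):
    """Expected counts per interval, eqs. (7)-(8); mu, s are length-3 arrays."""
    w = np.array([p1, p2, 1 - p1 - p2])
    F = sum(w_j * norm.cdf(edges, mu_j, s_j)
            for w_j, mu_j, s_j in zip(w, mu, s))
    return n_tot * np.diff(F)
```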
As (8) is not linear in the parameters $p_1, p_2, \mu_1, \mu_2, \mu_3, \sigma_1, \sigma_2, \sigma_3$, it has to be linearized. We set:

$$p_j = \tilde p_j + \delta p_j \quad (j = 1, 2), \qquad \mu_j = \tilde\mu_j + \delta\mu_j, \qquad \frac{1}{\sigma_j} = \frac{1}{\tilde\sigma_j} + \delta\!\left(\frac{1}{\sigma_j}\right) \quad (j = 1, 2, 3)$$
The resulting linearized equation has the form

$$\nu_i = \tilde\nu_i + N_{tot} \left[ \operatorname{erf}\!\left( \frac{b_i - \tilde\mu_1}{\tilde\sigma_1} \right) - \operatorname{erf}\!\left( \frac{a_i - \tilde\mu_1}{\tilde\sigma_1} \right) - \operatorname{erf}\!\left( \frac{b_i - \tilde\mu_3}{\tilde\sigma_3} \right) + \operatorname{erf}\!\left( \frac{a_i - \tilde\mu_3}{\tilde\sigma_3} \right) \right] \delta p_1 +$$

$$+ N_{tot} \left[ \operatorname{erf}\!\left( \frac{b_i - \tilde\mu_2}{\tilde\sigma_2} \right) - \operatorname{erf}\!\left( \frac{a_i - \tilde\mu_2}{\tilde\sigma_2} \right) - \operatorname{erf}\!\left( \frac{b_i - \tilde\mu_3}{\tilde\sigma_3} \right) + \operatorname{erf}\!\left( \frac{a_i - \tilde\mu_3}{\tilde\sigma_3} \right) \right] \delta p_2 +$$

$$+ N_{tot}\,\tilde p_1\,\frac{1}{\sqrt{2\pi}\,\tilde\sigma_1} \left[ e^{-\frac{(a_i - \tilde\mu_1)^2}{2\tilde\sigma_1^2}} - e^{-\frac{(b_i - \tilde\mu_1)^2}{2\tilde\sigma_1^2}} \right] \delta\mu_1 + N_{tot}\,\tilde p_2\,\frac{1}{\sqrt{2\pi}\,\tilde\sigma_2} \left[ e^{-\frac{(a_i - \tilde\mu_2)^2}{2\tilde\sigma_2^2}} - e^{-\frac{(b_i - \tilde\mu_2)^2}{2\tilde\sigma_2^2}} \right] \delta\mu_2 +$$

$$+ N_{tot}\,(1 - \tilde p_1 - \tilde p_2)\,\frac{1}{\sqrt{2\pi}\,\tilde\sigma_3} \left[ e^{-\frac{(a_i - \tilde\mu_3)^2}{2\tilde\sigma_3^2}} - e^{-\frac{(b_i - \tilde\mu_3)^2}{2\tilde\sigma_3^2}} \right] \delta\mu_3 +$$

$$+ N_{tot}\,\tilde p_1\,\frac{1}{\sqrt{2\pi}} \left[ e^{-\frac{(b_i - \tilde\mu_1)^2}{2\tilde\sigma_1^2}} (b_i - \tilde\mu_1) - e^{-\frac{(a_i - \tilde\mu_1)^2}{2\tilde\sigma_1^2}} (a_i - \tilde\mu_1) \right] \delta\!\left(\frac{1}{\sigma_1}\right) +$$

$$+ N_{tot}\,\tilde p_2\,\frac{1}{\sqrt{2\pi}} \left[ e^{-\frac{(b_i - \tilde\mu_2)^2}{2\tilde\sigma_2^2}} (b_i - \tilde\mu_2) - e^{-\frac{(a_i - \tilde\mu_2)^2}{2\tilde\sigma_2^2}} (a_i - \tilde\mu_2) \right] \delta\!\left(\frac{1}{\sigma_2}\right) +$$

$$+ N_{tot}\,(1 - \tilde p_1 - \tilde p_2)\,\frac{1}{\sqrt{2\pi}} \left[ e^{-\frac{(b_i - \tilde\mu_3)^2}{2\tilde\sigma_3^2}} (b_i - \tilde\mu_3) - e^{-\frac{(a_i - \tilde\mu_3)^2}{2\tilde\sigma_3^2}} (a_i - \tilde\mu_3) \right] \delta\!\left(\frac{1}{\sigma_3}\right) \qquad (9)$$
From this equation, the elements of the pseudo-least-squares procedure can be described as in [6] and subsequently applied to both simulated and real data, in order to test the effectiveness of the theory developed.
EXPERIMENTS WITH SIMULATED DATA
The least squares equations written for the three-gaussian case were tested at first on different sets of simulated data, in order to establish a correct procedure to treat the observations. Three simulations were performed, depicting various situations: in each of them, synthetic data were produced representing points belonging to one of three normal curves (a generation sketch is given below), and the corresponding histograms were plotted (Fig. 1, Fig. 2, Fig. 3).
The first case was obviously the simplest to treat: the three curves are well distinguishable, since there is little overlap, whereas the overlap is quite large in the second case; the third situation is intermediate.
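By way of illustration, such synthetic data can be produced as follows (a sketch using the "true" values of case 1 from Table 2 (a); the sample size of 5000 is our assumption, not stated in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_mixture(n_tot, p, mu, sigma):
    """Draw n_tot points from a three-gaussian mixture."""
    k = rng.choice(3, size=n_tot, p=p)   # component label of each point
    return rng.normal(np.asarray(mu)[k], np.asarray(sigma)[k])

# "true" values of case 1 (Table 2 (a)); assumed sample size
x = simulate_mixture(5000, [0.5, 0.2, 0.3], [0.0, 5.0, 12.0], [1.0, 1.5, 1.5])
counts, edges = np.histogram(x, bins=np.arange(-4.0, 16.5, 0.5))
```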
Fig. 1 - Histogram of simulated data: case 1.
Fig. 2 - Histogram of simulated data: case 2.
To start the least squares procedure, approximate values of the parameters were computed on the basis of the "threshold criterion" described in [6]. Obviously, there must now be two minimum points in each histogram (thresholds $\tilde t_1, \tilde t_2$): their values are reported in Table 1. A sketch of this minimum search is given after the table.
Table 1 - Thresholds (minimum points) in the three cases.

Case    t̃1      t̃2
1       2.25    8.25
2       2.25    3.75
3       2.75    6.75
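One hypothetical way to automate this search (the paper applies the criterion of [6]; the variant below simply scans the histogram for interior local minima) is:

```python
import numpy as np

def histogram_minima(counts, centers):
    """Return the bin centers at which the histogram has a local minimum."""
    c = np.asarray(counts, dtype=float)
    i = np.arange(1, len(c) - 1)
    is_min = (c[i] < c[i - 1]) & (c[i] <= c[i + 1])
    return np.asarray(centers)[i[is_min]]
```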
The computed approximate values of the parameters, along with the values used to simulate the data, can be found in Tables 2 (a), 2 (b) and 2 (c), where the results obtained after the least squares estimation are also given.
Fig. 3 - Histogram of simulated data: case 3.
Table 2 (a) - Results of the experiment with simulated data, case 1:
approximate values computed with the "threshold" criterion.

Parameter   "True" value   Approx. value   Estim. value   Error (%)
p1          0.5000         0.4970          0.4946         1.08
p2          0.2000         0.2015          0.2050         2.51
μ1          0.0000         0.0088          0.0081         -
μ2          5.0000         5.0752          5.0353         0.71
μ3          12.0000        11.9376         11.9407        0.49
σ1          1.0000         0.9524          0.9659         3.41
σ2          1.5000         1.4393          1.5523         3.49
σ3          1.5000         1.5602          1.4890         0.74
Table 2 (b) - Results of the experiment with simulated data, case 2:
approximate values computed with the "threshold" criterion.

Parameter   "True" value   Approx. value   Estim. value   Error (%)
p1          0.5600         0.6411          0.6512         17.22
p2          0.3300         0.1450          0.0753         77.42
μ1          0.0000         0.1985          0.2559         -
μ2          3.0000         2.9799          2.9469         1.77
μ3          6.0000         5.4098          4.8530         19.12
σ1          1.0000         1.0428          1.1138         11.38
σ2          1.5000         0.4580          0.4355         70.97
σ3          1.0000         1.1175          1.4737         47.37
Table 2 (c) - Results of the experiment with simulated data, case 3:
approximate values computed with the "threshold" criterion.

Parameter   "True" value   Approx. value   Estim. value   Error (%)
p1          0.5000         0.5180          0.4907         1.87
p2          0.2000         0.1720          0.2134         6.68
μ1          0.0000         0.1015          0.0039         -
μ2          4.5000         4.7132          4.5415         0.92
μ3          9.0000         8.9313          8.9749         0.28
σ1          1.0000         1.0424          0.9723         2.77
σ2          1.5000         1.1347          1.6636         10.91
σ3          1.0000         1.0531          0.9858         1.42
As one immediately sees from Table 2 (a), case 1 gives (as was expected) very good results, 3.49% being the maximum relative error. The parameter estimates in case 3 (Table 2 (c)) can also be considered satisfactory. On the contrary, in case 2 (representing a much more difficult situation, with a large overlap between two consecutive gaussian curves) the parameter estimates are very poor and easily attain unacceptable relative errors, mainly regarding the central curve $\varphi_2$ (see Table 2 (b)).
To obtain better results, a stepwise procedure was devised, especially in order to compute better approximate values (from Table 2 (b) it is evident that these can already be quite poor) from which to start the estimation of the parameters. This procedure can be summarized as follows:
- compute the differences between the observed values and the corresponding points computed with the approximate values of the parameters of the first and third normal curves, $\varphi_1(\tilde p_1, \tilde\mu_1, \tilde\sigma_1)$ and $\varphi_3((1 - \tilde p_1 - \tilde p_2), \tilde\mu_3, \tilde\sigma_3)$; the differences (residuals) are shown in Fig. 4;
Fig. 4 - Histogram of the differences (residuals) between the observed values and the corresponding points computed with the approximate values of the parameters of the first and third normal curves.
- compute new approximate values $\tilde p_2, \tilde\mu_2, \tilde\sigma_2$ using the (positive) residuals of the above histogram; the negative residuals are equally shared between $\varphi_1$ and $\varphi_3$, giving rise to new approximate values of $\tilde p_1$ and $(1 - \tilde p_1 - \tilde p_2)$;
- proceed with the least squares estimate of the parameters, dividing it into two steps: first estimate $p_1, p_2$, keeping the remaining parameters fixed at the approximate values; then estimate $\mu_1, \mu_2, \mu_3, \sigma_1, \sigma_2, \sigma_3$, with $p_1, p_2$ fixed (a sketch of these two steps is given below).
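The two estimation steps can be rendered with a generic optimizer as follows (an illustrative sketch, not the normal-system solution of the paper; `nu_three_gauss` is the function sketched earlier, and `edges`, `counts` are the histogram data):

```python
import numpy as np
from scipy.optimize import minimize

def target(q, edges, counts):
    """Value of (2) for q = (p1, p2, mu1, mu2, mu3, s1, s2, s3)."""
    nu = nu_three_gauss(edges, counts.sum(), q[0], q[1], q[2:5], q[5:8])
    nu = np.clip(nu, 1e-9, None)       # guard against empty intervals
    return np.sum((counts - nu) ** 2 / nu)

def stepwise_fit(q0, edges, counts):
    q = np.array(q0, dtype=float)
    # Step I: estimate p1, p2 with the gaussian parameters fixed
    res = minimize(lambda p: target(np.r_[p, q[2:]], edges, counts),
                   q[:2], bounds=[(1e-3, 1 - 1e-3)] * 2)
    q[:2] = res.x
    # Step II: estimate the means and sigmas with p1, p2 fixed
    res = minimize(lambda t: target(np.r_[q[:2], t], edges, counts),
                   q[2:], bounds=[(None, None)] * 3 + [(1e-3, None)] * 3)
    q[2:] = res.x
    return q
```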
The results of this procedure are summarized in Table 3: they show an appreciable improvement with respect to the previously achieved results, especially regarding the values of $p_2, \sigma_2, \sigma_3$.
A confirmation of this behaviour was obtained using a $\chi^2$ test to check the goodness-of-fit of the estimated model to the observed values. The hypothesis was accepted at the 5% significance level ($\chi^2 = 20.28$; 17 degrees of freedom).
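For the record, the acceptance can be reproduced from the quoted numbers (a sketch; the test statistic is the value of (2) at the estimates):

```python
from scipy.stats import chi2

chi2_value = 20.28              # value of the target (2) at the estimates
dof = 17                        # degrees of freedom quoted in the text
critical = chi2.ppf(0.95, dof)  # 5% significance level -> about 27.59
print(chi2_value <= critical)   # True: the hypothesis is accepted
```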
Table 3 - Results of the stepwise procedure experimented on the data of case 2.

Parameter   "True" value   New approx. value   Estim. (Step I)   Estim. (Step II)   Error (%)
p1          0.5600         0.6233              0.5856            -                  5.42
p2          0.3300         0.1806              0.2478            -                  25.65
μ1          0.0000         0.1985              -                 0.0958             -
μ2          3.0000         2.8908              -                 2.9240             2.53
μ3          6.0000         5.4098              -                 5.5848             6.92
σ1          1.0000         1.0428              -                 1.0292             2.92
σ2          1.5000         1.3823              -                 1.1672             22.19
σ3          1.0000         1.1175              -                 1.1736             17.36
Finally, it has to be remarked that the stepwise procedure described above was also tried on the data of case 3. Although we do not report the results here, they confirmed what Table 2 (c) shows for case 3, meaning that this procedure is a good way of processing cases where the available data are quite "mixed".
AN EXPERIMENT WITH DATA FROM A SPOT IMAGE
It was decided to work on the data representing a portion of a SPOT image already treated in [6], which was named "SPOT2" (71 rows × 82 columns = 5822 pixels), as we were left with the impression that in this case the homogeneous groups were indeed three. The image is again represented in Fig. 5, while Fig. 6 shows the histogram of the corresponding grey values. Looking at it, the approximate threshold values were identified: $\tilde t_1 = 34$, $\tilde t_2 = 66$.
Using $\tilde t_1, \tilde t_2$, the approximate values of the parameters were computed to start the least squares estimation. As this did not converge, the stepwise procedure was applied, after computing (as described in Section 3) better approximate values. The results are reported in Table 4, while Fig. 7 shows the histogram of the estimated normal curves.
Fig. 5 - The SPOT2 image.

Table 4 - Results of the stepwise procedure for the SPOT2 data.

Parameter   "True" value   New approx. value   Estim. (Step I)   Estim. (Step II)
p1          0.2807         0.2521              0.2200            -
p2          0.5240         0.5811              0.6254            -
μ1          28.4504        28.4504             -                 28.1048
μ2          48.2933        49.6313             -                 46.9284
μ3          81.8945        81.8945             -                 82.9092
σ1          2.8174         2.8174              -                 1.9212
σ2          8.5713         12.4050             -                 12.2749
σ3          12.2390        12.2390             -                 13.5827
Fig. 6 - Histogram of the SPOT2 image grey values data.
Using the estimated values of the parameters, it is then possible to go back to new (estimated) threshold values, which better discriminate the three clusters. The criterion used to estimate $t_1$ is: equate the probability that a point belongs to the first normal curve,

$$P(k = 1 \mid x) = \frac{p_1 f(x \mid 1)}{\sum_{i=1}^{3} p_i f(x \mid i)} \qquad (10)$$

to the probability that the point belongs to the second or third normal curve,

$$P(k = 2 \mid x \;\vee\; k = 3 \mid x) = \frac{p_2 f(x \mid 2) + p_3 f(x \mid 3)}{\sum_{i=1}^{3} p_i f(x \mid i)} \qquad (11)$$

The equation

$$P(k = 1 \mid x) = P(k = 2 \mid x \;\vee\; k = 3 \mid x) \qquad (12)$$

is explicitly written as

$$\frac{p_1\, e^{-\frac{(x - \mu_1)^2}{2\sigma_1^2} - \ln\sigma_1}}{p_2\, e^{-\frac{(x - \mu_2)^2}{2\sigma_2^2} - \ln\sigma_2} + (1 - p_1 - p_2)\, e^{-\frac{(x - \mu_3)^2}{2\sigma_3^2} - \ln\sigma_3}} = 1 \qquad (13)$$

or, in another form,

$$e^{-\frac{(x - \mu_1)^2}{2\sigma_1^2} - \ln\sigma_1 + \ln p_1} - e^{-\frac{(x - \mu_2)^2}{2\sigma_2^2} - \ln\sigma_2 + \ln p_2} - e^{-\frac{(x - \mu_3)^2}{2\sigma_3^2} - \ln\sigma_3 + \ln(1 - p_1 - p_2)} = 0 \qquad (14)$$
Fig. 7 - Estimated normal curves for the SPOT2 image.
In a similar way, to estimate $t_2$ another equation is written:

$$e^{-\frac{(x - \mu_3)^2}{2\sigma_3^2} - \ln\sigma_3 + \ln(1 - p_1 - p_2)} - e^{-\frac{(x - \mu_1)^2}{2\sigma_1^2} - \ln\sigma_1 + \ln p_1} - e^{-\frac{(x - \mu_2)^2}{2\sigma_2^2} - \ln\sigma_2 + \ln p_2} = 0 \qquad (15)$$

Fig. 8 - SPOT2 image: cluster identification by means of the $\tilde t_1, \tilde t_2$ threshold values.

Fig. 9 - SPOT2 image: cluster identification by means of the $\hat t_1, \hat t_2$ threshold values.
Equations (14) and (15) must be solved for $x$, thus obtaining $\hat t_1, \hat t_2$. Unfortunately, no analytical solution exists for these equations, so a Taylor expansion must be carried out to solve them. To obtain approximate values for $t_1, t_2$, the likelihood ratio can be used (cf. [6]): what is found is $\tilde t_1 = 31.88$, $\tilde t_2 = 64.47$.
The final estimated threshold values are $\hat t_1 = 31.48$, $\hat t_2 = 69.90$.
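Equivalently (a sketch outside the paper's Taylor-expansion procedure), $\hat t_1$ can be obtained with a bracketing root finder applied to condition (12), written here as the difference of the two posterior probabilities; the numerical values are rounded from Table 4:

```python
import numpy as np
from scipy.optimize import brentq

def posterior_balance(x, p1, p2, mu, s):
    """P(k=1|x) - P(k=2 or 3|x), up to the common positive denominator."""
    mu, s = np.asarray(mu), np.asarray(s)
    w = np.array([p1, p2, 1 - p1 - p2])
    phi = np.exp(-(x - mu) ** 2 / (2 * s ** 2)) / (np.sqrt(2 * np.pi) * s)
    terms = w * phi
    return terms[0] - terms[1] - terms[2]

# estimated parameters rounded from Table 4 (Steps I and II)
args = (0.2200, 0.6254, [28.10, 46.93, 82.91], [1.92, 12.27, 13.58])
t1_hat = brentq(posterior_balance, 28.1, 46.9, args=args)  # about 31.5
```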
The approximate (see Fig. 8) and estimated (see Fig. 9) threshold values were used
to plot the SPOT2 image highlighting the three clusters with three grey values (namely:
black, white and an intermediate grey value).
REFEREE
This article was reviewed by Prof. Fausto Sacerdote, Dipartimento di Ingegneria Civile, Università di Firenze.
REFERENCES
[1] B. Crippa, L. Mussio (1993). "Data compression and evaluation by cluster analysis", Proceedings of ISPRS Commission I Workshop on "Digital sensors and systems".
[2] A. de Haan (1991). "Fundamentals of cluster analysis", Proceedings of ISPRS Tutorial on "Mathematical aspects of data analysis".
[3] R.O. Duda, P.E. Hart (1973). "Pattern classification and scene analysis", J. Wiley and Sons.
[4] L. Kaufman, P.J. Rousseeuw (1990). "Finding groups in data", J. Wiley and Sons.
[5] J.S. Lim (1990). "Two-dimensional signal and image processing", Prentice Hall.
[6] F. Migliaccio, F. Sansò, V. Tornatore (1998). "Clusters and probabilistic models for a refined estimation theory", Bollettino di Geodesia e Scienze Affini, Anno LVII, N. 3.
[7] A.M. Mood, F.A. Graybill, D.C. Boes (1983). "Introduction to the theory of statistics", McGraw-Hill.