Supplement to “Simulating Data to Study Performance

advertisement
Supplement to “Simulating Data to Study Performance
of Finite Mixture Modeling and Clustering Algorithms”
published in the Journal of Computational and
Graphical Statistics
Ranjan Maitra and Volodymyr Melnykov
S-1.
GEOMETRY AND VARIABILITY OF SIMULATED REALIZATIONS
Figures S-1 and S-2 provide contour plots of four sample two-component two-dimensional mixtures, each for different values of overlap ω. As we can see, there is a clear correspondence
between visual overlap and its quantitative representation, indexed by ω. Note that there is also a
wide variety of overlaps and complexity of simulated mixtures, including how these two components interact with each other. Figure S-1 indicates that an overlap of no more than 0.01 can be
considered to be quite well-separated. On the other hand, 0.05 ≤ ω ≤ 0.10 suggests moderate separation between components. Figure S-2 indicates that an overlap of 0.15 or higher represents poor
separation of components. Note that there is barely any visual difference between the simulated
components for ω = 0.75: they appear to be practically indistinguishable.
Figures S-3 and S-4 provide contour plots of four sample six-component two-dimensional mixture models generated for different values of (ω̄, ω̌) obtained using our algorithm. The figures
provide a visual indication of the range of geometries and variability in the finite mixture models generated using our algorithm. Thus, the first two sets of contour plots represent very high
overlaps with ω̄ = 0.15 and ω̄ = 0.10 accordingly. For such ω̄ and for practical cases, ω̌ is
necessarily high and we are unable to visually distinguish all components successfully due to the
substantial overlap in many pairs. This is especially true for ω̄ = 0.15. Based on these plots,
the cases for which ω̄ = 0.05 are substantially clearer, but even here not all components can be
1
1.0
0.6
0.4
0.2
0.0
1.0
0.8
0.0
0.2
0.4
0.6
0.8
0.6
0.4
0.2
1.0
0.0
0.2
0.4
0.6
0.8
1.0
0.8
0.0
0.2
0.4
0.6
0.8
0.6
0.4
0.2
(n) ω = 0.10
1.0
(l) ω = 0.05
0.0
0.2
0.4
0.6
0.8
1.0
0.8
0.0
0.2
0.4
0.6
0.8
0.6
0.4
0.2
(m) ω = 0.10
(h) ω = 0.01
(k) ω = 0.05
1.0
(j) ω = 0.05
0.0
0.0
0.2
0.4
0.6
0.8
1.0
(i) ω = 0.05
(d) ω = 0.001
(g) ω = 0.01
1.0
(f) ω = 0.01
0.0
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.0
0.2
0.4
0.6
0.8
1.0
(c) ω = 0.001
1.0
(b) ω = 0.001
(e) ω = 0.01
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
0.8
0.6
0.4
0.2
0.0
0.0
0.2
0.4
0.6
0.8
1.0
(a) ω = 0.001
(o) ω = 0.10
(p) ω = 0.10
Figure S-1: Contour plots of sample two-component mixture distributions in two dimensions obtained using our algorithm for ω between 0.001 and 0.10.
2
1.0
0.6
0.4
0.2
0.0
1.0
0.8
0.0
0.2
0.4
0.6
0.8
0.6
0.4
0.2
1.0
0.0
0.2
0.4
0.6
0.8
1.0
0.8
0.0
0.2
0.4
0.6
0.8
0.6
0.4
0.2
(n) ω = 0.75
1.0
(l) ω = 0.50
0.0
0.2
0.4
0.6
0.8
1.0
0.8
0.0
0.2
0.4
0.6
0.8
0.6
0.4
0.2
(m) ω = 0.75
(h) ω = 0.25
(k) ω = 0.50
1.0
(j) ω = 0.50
0.0
0.0
0.2
0.4
0.6
0.8
1.0
(i) ω = 0.50
(d) ω = 0.15
(g) ω = 0.25
1.0
(f) ω = 0.25
0.0
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.0
0.2
0.4
0.6
0.8
1.0
(c) ω = 0.15
1.0
(b) ω = 0.15
(e) ω = 0.25
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
0.8
0.6
0.4
0.2
0.0
0.0
0.2
0.4
0.6
0.8
1.0
(a) ω = 0.15
(o) ω = 0.75
(p) ω = 0.75
Figure S-2: Contour plots of sample two-component mixture distributions in two dimensions obtained using our algorithm for ω between 0.15 and 0.75.
3
1.0
0.6
0.4
0.2
0.0
1.0
0.8
0.4
0.2
0.0
1.0
0.6
0.4
0.2
0.0
(k) ω̄ = 0.05, ω̌ = 0.135
(n) ω̄ = 0.05, ω̌ = 0.198
1.0
(l) ω̄ = 0.05, ω̌ = 0.135
0.8
0.8
0.6
0.4
0.2
0.2
0.0
0.0
0.0
0.2
0.4
0.6
0.4
0.6
0.8
1.0
0.8
0.6
0.4
0.2
0.0
(m) ω̄ = 0.05, ω̌ = 0.198
(h) ω̄ = 0.10, ω̌ = 0.274
0.8
0.8
0.6
0.4
0.2
0.0
(j) ω̄ = 0.05, ω̌ = 0.135
1.0
(i) ω̄ = 0.05, ω̌ = 0.135
1.0
0.0
0.0
0.2
0.2
0.4
0.4
0.6
0.6
0.8
0.8
1.0
1.0
(g) ω̄ = 0.10, ω̌ = 0.274
1.0
(f) ω̄ = 0.10, ω̌ = 0.274
(d) ω̄ = 0.15, ω̌ = 0.433
0.6
0.8
0.6
0.4
0.2
0.0
0.0
0.2
0.4
0.6
0.8
1.0
(c) ω̄ = 0.15, ω̌ = 0.433
1.0
(b) ω̄ = 0.15, ω̌ = 0.433
(e) ω̄ = 0.10, ω̌ = 0.274
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
0.8
0.6
0.4
0.2
0.0
0.0
0.2
0.4
0.6
0.8
1.0
(a) ω̄ = 0.15, ω̌ = 0.433
(o) ω̄ = 0.05, ω̌ = 0.198
(p) ω̄ = 0.05, ω̌ = 0.198
Figure S-3: Contour plots of sample six-component mixture distributions in two dimensions for
different values of (ω̄, ω̌) obtained using our algorithm.
4
1.0
1.0
0.8
0.6
0.6
0.0
0.0
0.2
0.2
0.4
0.4
0.6
0.4
0.2
0.0
1.0
0.6
0.4
0.2
0.0
0.2
0.4
0.6
0.8
1.0
(h) ω̄ = 0.01, ω̌ = 0.036
0.0
0.8
0.6
0.4
0.2
0.0
0.0
0.2
0.4
0.6
0.8
1.0
(g) ω̄ = 0.01, ω̌ = 0.036
1.0
(f) ω̄ = 0.01, ω̌ = 0.036
(d) ω̄ = 0.01, ω̌ = 0.027
0.8
1.0
0.8
0.2
0.0
0.0
0.2
0.4
0.6
0.4
0.6
0.8
1.0
0.8
0.6
0.4
0.2
0.0
1.0
0.8
0.6
0.4
0.2
(c) ω̄ = 0.01, ω̌ = 0.027
1.0
(b) ω̄ = 0.01, ω̌ = 0.027
(e) ω̄ = 0.01, ω̌ = 0.036
0.0
0.8
1.0
0.8
1.0
0.8
0.6
0.4
0.2
0.0
(a) ω̄ = 0.01, ω̌ = 0.027
1.0
0.8
0.6
0.2
0.0
0.4
1.0
0.2
0.0
0.4
0.6
0.8
1.0
0.8
0.6
0.4
0.2
0.0
0.0
0.2
0.4
0.6
0.8
1.0
(i) ω̄ = 0.001, ω̌ = 0.0027 (j) ω̄ = 0.001, ω̌ = 0.0027 (k) ω̄ = 0.001, ω̌ = 0.0027 (l) ω̄ = 0.001, ω̌ = 0.0027
(m) ω̄ = 0.001, ω̌ = 0.0036 (n) ω̄ = 0.001, ω̌ = 0.0036 (o) ω̄ = 0.001, ω̌ = 0.0036 (p) ω̄ = 0.001, ω̌ = 0.0036
Figure S-4: Contour plots of sample six-component mixture distributions in two dimensions for
different values of (ω̄, ω̌) obtained using our algorithm.
5
visually identified, owing to the presence of some large individual pairwise overlaps (dictated by
ω̌ = 0.135 and ω̌ = 0.198). Thus, mixtures with ω̄ equal to 0.05 and 0.01 can be considered
as moderately difficult cases; clustering performance will also be affected by ω̌. Overall, there is
increasing complexity in the simulated mixtures with a higher number of components for the same
level of ω̄ which is understandable since in two-component mixtures, ω̄ represents overlap coming
from the only pair of components, but for a six-component mixture, for example, this measures
the average overlap from all K(K − 1)/2 pairs of components. Thus, some pairwise overlaps are
higher than ω̄ and may result in quite extreme cases. This is what is regulated by the additional
characteristic ω̌. A larger value for ω̌ means that fewer pairs of clusters contribute to ω̄. Finally,
we note that ω̄ = 0.001 allows generation of reasonably well separated clusters for both choices of
ω̌. Overall, we can see that the produced mixtures have different patterns and geometries: indeed,
the generated components have very widely varying shapes, being controlled only by the maximal
eccentricity (set at 0.9, here) which prevents the appearance of very ”narrow” clusters.
We investigated variability in the simulated mixture models for different numbers of components and dimensions. We generated 25 datasets for each setting as mentioned in Section 3.2.4
of the paper. Expected Kullback-Leibler divergence (KL) was computed for all pairs of mixture
densities generated for a setting: summary measures are presented in Table S-1, which suggests
that the KL is higher for settings with smaller ω̄ and ω̌. In other words, when the overlap is high,
the difference between generated mixtures is smaller as opposed to the case when clusters are well
separated. Note that, for fairness of comparison, we have enforced the constraint that all simulated
mixture models have almost all their support in the same hypercube. Within ω̄, KL is higher for
the case when ω̌ is larger. When there are multiple pairwise overlaps in a mixture, KL is lower
than when for mixtures which have few overlapping components. This is because the density has
more gradual peaks for the former case than for when the components are well-separated. To assess
whether the calculated KL were reasonable, we also generated 25 six-component two-dimensional
datasets for the settings in Figures S-3 and S-4. In two dimensions, and for K = 6, we know from
these two figures that there is wide variability in the simulated mixture densities. The KL-measures
obtained here provide an idea of the values that are reasonable to assess variability: we note that
6
Table S-1: Kullback-Leibler measures between simulated mixtures for various parameter combinations: mean and standard deviation.
p
k
2
6
5
7
7
9
10
11
ω̄
0.15
0.10
ω̌
0.433
0.274
0.135
0.198
0.027
0.036
0.003
0.004
4.68 (2.45)
4.83 (2.55)
5.21 (2.26)
6.87 (3.63)
7.57 (2.49)
10.57 (5.68)
13.03 (4.33)
16.79 (7.70)
0.49
0.319
0.15
0.27
0.03
0.05
0.003
0.005
4.22 (1.52)
5.17 (1.38)
7.03 (1.27)
9.16 (1.99)
11.75 (2.13)
14.53 (3.72)
20.05 (4.07)
21.54 (3.68)
0.566
0.365
0.180
0.370
0.036
0.064
0.004
0.006
3.63 (0.74)
5.18 (1.34)
7.11 (1.12)
8.25 (1.50)
12.67 (2.11)
13.97 (2.21)
22.99 (3.38)
26.86 (4.33)
ω̌
ω̌
ω̌
0.05
0.01
0.001
0.62
0.41
0.20
0.44
0.04
0.08
0.004
0.008
3.33 (0.44)
4.75 (0.80)
7.28 (1.09)
7.69 (1.10)
13.01 (1.12)
14.87 (1.65)
24.35 (2.87)
27.01 (2.69)
the values for the other dimensions are of similar magnitude as the ones for p = 2.
S-2.
CONVERGENCE
Table S-2 presents the number of iterations necessary to reach a desired mixture for various combinations of p and K. A represents the number of failed “valid” initial values satisfying Step 3 of
the algorithm in Section 2.2.1, while B represents the number of reseeds that are needed in order
to get a realization that satisfies Step 2 of the algorithm in Section 2.2.2. The summaries suggest
that there are two cases that might require some time and computational power. The first case
happens when the average as well as maximal overlaps are both relatively small; this is especially
problematic for a high number of clusters. In this case it might be not easy to randomly simulate
a mixture with components that are separated enough to guarantee the small values for both ω̄ and
ω̌. On the other hand, cases with high ω̌ (such as in the case p = 10, K = 11, ω̄ = 0.15, ωσ = ω̄
above, for which ω̌ = 0.619, or p = 10, K = 11, ω̄ = 0.05, ωσ = 2ω̄, for which ω̌ = 0.441) can
also be problematic because of the difficulty to find a valid initial mixture satisfying Step 3 of the
algorithm in Section 2.2.1.
S-3.
EVALUATING INITIALIZATION STRATEGIES
Table S-3 provides details on the results of using the four different initialization strategies for
nine-component seven-dimensional datasets, and for different values of ω̌ and ω̄.
7
Table S-2: Convergence of the algorithm for different parameter settings: number of failures (A)
till obtaining the initial combination of clusters (median A 1 and interquartile range I A
1 ); number
2
2
of reseedings ( B) of all clusters (median B 1 and interquartile range I B1 ).
2
2
p
k
ω̄
0.15
0.10
ω̌
0.433
0.274
0.135
0.198
0.027
0.036
0.003
0.004
2
6
A 1 (I A
1 )
0 (0)
0 (0)
0 (0)
0 (0)
0 (0)
0 (0)
0 (0)
0 (0)
2
0 (0)
0 (0)
6 (15)
0 (0)
373 (311)
4 (11)
3,352 (4,861)
40 (91)
ω̌
0.49
0.319
0.15
0.27
0.03
0.05
0.003
0.005
(I A )
2 (3)
0 (0)
0 (0)
0 (0)
0 (0)
0 (0)
0 (0)
0 (0)
B 1 (I B
1)
0 (0)
0 (0)
0 (0)
0 (0)
17 (16)
0 (1)
375 (548)
3 (6)
ω̌
0.566
0.365
0.180
0.370
0.036
0.064
0.004
0.006
A 1 (I A
1 )
5 (9)
1 (2)
0 (0)
5 (7)
0 (0)
0 (0)
0 (0)
0 (0)
0 (0)
0 (0)
0 (1)
0 (0)
19 (30)
0 (1)
1,254 (1,225)
3 (4)
1
2
A1
1
2
2
2
7
9
2
2
2
B 1 (I B
1)
2
10
11
0.001
2
2
7
0.01
(I B )
B1
5
0.05
2
ω̌
0.62
0.41
0.20
0.44
0.04
0.08
0.004
0.008
A 1 (I A
1 )
734 (877)
14 (20)
0 (0)
239 (282)
0 (0)
0 (0)
0 (0)
0 (0)
0 (0)
0 (0)
0 (1)
0 (0)
21 (31)
0 (0)
2,324 (5,171)
2 (7)
2
B1
2
2
(I B )
1
2
8
Table S-3: Adjusted Rand (R) similarity measures of EM-cluster groupings and expected
Kullback-Leibler divergences (KL) of estimated mixture model densities obtained, using different
initialization strategies (starts), over 25 replications for different overlap characteristics. Summary
statistics represented are the median R (R 1 ) and median KL (KL 1 ) and corresponding interquar2
2
KL
tile ranges (I R
1 and I 1 ). Finally, R#1 and KL#1 represent the number of replications (out of 25)
2
2
for which the given initialization strategy did as well (in terms of higher R and lower KL) as the
Starts
best strategy.
p = 7, k = 9, n = 1, 000
ω̄
0.15
0.10
ω̌
0.566
0.365
0.180
0.370
0.036
0.064
0.004
0.006
R 21
0.31
0.40
0.66
0.63
0.82
0.85
0.90
0.89
R
0.17
0.18
0.13
0.09
0.16
0.10
0.09
0.12
emEM
I1
0.05
0.01
0.001
2
R#1
8
12
14
10
5
4
5
1
KL 21
0.31
0.32
0.36
0.33
0.35
0.31
0.36
0.32
I KL
1
0.05
0.05
0.08
0.07
0.08
0.08
0.09
0.11
KL#1
7
12
5
14
7
7
1
3
R 21
0.29
0.41
0.60
0.64
0.86
0.85
0.89
0.92
R
0.19
0.15
0.14
0.16
0.08
0.15
0.11
0.08
R#1
13
7
7
12
8
8
2
1
KL 21
0.32
0.34
0.32
0.33
0.34
0.32
0.36
0.35
KL
0.04
0.06
0.05
0.04
0.06
0.07
0.11
0.10
KL#1
13
12
14
9
7
7
1
1
R 21
0.27
0.39
0.60
0.61
0.87
0.87
1.0
1.0
R
0.16
0.19
0.18
0.17
0.14
0.16
0.00
0.00
R#1
3
5
2
2
11
11
21
21
KL 21
0.39
0.40
0.40
0.38
0.33
0.31
0.21
0.20
KL
0.06
0.06
0.08
0.10
0.11
0.15
0.03
0.04
KL#1
3
1
4
1
11
9
23
20
R 21
0.19
0.26
0.51
0.54
0.68
0.75
0.78
0.83
R
0.26
0.22
0.25
0.16
0.22
0.18
0.18
0.24
R#1
1
1
2
1
1
2
0
2
KL 21
0.38
0.41
0.39
0.37
0.49
0.47
0.57
0.52
KL
0.15
0.10
0.13
0.10
0.25
0.18
0.36
0.43
2
0
2
1
0
2
0
1
Rnd-EM
2
I1
2
I1
2
Mclust
I1
2
I1
Multi-staged
2
I1
2
I1
2
KL#1
9
Download