Extending metric multidimensional scaling with Bregman divergences
Jigang Sun and Colin Fyfe

Visualising 18 dimensional data

Outline
• Bregman divergence.
• Multidimensional scaling (MDS).
• Extending MDS with Bregman divergences.
• Relating the Sammon mapping to mappings with Bregman divergences.
• Comparison of effects and explanation.
• Conclusion

Strictly Convex function
F is strictly convex if, for any $q \neq p$ in its domain and any $0 < \eta < 1$,
$$F(\eta q + (1-\eta) p) < \eta F(q) + (1-\eta) F(p).$$
Pictorially, the strictly convex function F(x) lies below the segment connecting its values at the two points q and p.

Bregman Divergences
$$d_\phi(x, y) = \phi(x) - \phi(y) - \langle x - y, \nabla\phi(y) \rangle$$
is the Bregman divergence between x and y based on the convex function φ.
The Taylor series expansion is
$$\phi(x) = \phi(y) + (x - y)\,\phi'(y) + \frac{(x - y)^2}{2!}\,\phi''(y) + \dots$$
so $d_\phi(x, y)$ is what remains of the expansion after the first-order term.

Bregman Divergences
Squared Euclidean distance is a Bregman divergence, with $\phi(x) = \|x\|^2$:
$$d_\phi(x, y) = \|x\|^2 - \|y\|^2 - \langle x - y, 2y \rangle$$
$$d_\phi(x, y) = \|x\|^2 - 2\langle x, y \rangle + \|y\|^2$$
$$d_\phi(x, y) = \|x - y\|^2$$

Kullback Leibler Divergence
$$\phi(p) = \sum_{i=1}^{d} p_i \log_2 p_i$$
$$d_\phi(p, q) = \sum_{i=1}^{d} p_i \log_2 p_i - \sum_{i=1}^{d} q_i \log_2 q_i - \langle p - q, \nabla\phi(q) \rangle$$
$$d_\phi(p, q) = \sum_{i=1}^{d} p_i \log_2 p_i - \sum_{i=1}^{d} q_i \log_2 q_i - \sum_{i=1}^{d} (p_i - q_i)(\log_2 q_i + \log_2 e)$$
$$d_\phi(p, q) = \sum_{i=1}^{d} p_i \log_2 \frac{p_i}{q_i} - \log_2 e \sum_{i=1}^{d} (p_i - q_i)$$
$$d_\phi(p, q) = \sum_{i=1}^{d} p_i \log_2 \frac{p_i}{q_i}$$
The last step uses $\sum_i p_i = \sum_i q_i = 1$ for probability distributions.

Generalised Information Divergence
• $\phi(z) = z \log z$
$$d_\phi(x, y) = x \log x - y \log y - \langle x - y, \log y + 1 \rangle$$
$$d_\phi(x, y) = x \log x - x \log y - x + y$$
$$d_\phi(x, y) = x \log \frac{x}{y} - x + y$$

Other Divergences
• Itakura-Saito divergence: $\phi(x) = -\log x$
• Mahalanobis distance: $\phi(x) = x^T \Sigma^{-1} x$
• Logistic loss: $\phi(x) = x \log x + (1 - x) \log(1 - x)$
• Any convex function

Some Properties
• $d_\phi(x, y) \geq 0$, with equality iff $x = y$.
• Not a metric, since in general $d_\phi(x, y) \neq d_\phi(y, x)$.
• (Though $d(x, y) = d_\phi(x, y) + d_\phi(y, x)$ is symmetric.)
• Convex in the first parameter.
• Linear: $d_{\phi + a\gamma}(x, y) = d_\phi(x, y) + a\, d_\gamma(x, y)$.

Multidimensional Scaling
• Creates one latent point for each data point.
• The latent space is often 2 dimensional.
• Positions the latent points so that they best represent the data distances.
– Two latent points are close if the two corresponding data points are close.
– Two latent points are distant if the two corresponding data points are distant.

Classical/Basic Metric MDS
• We minimise the stress function
$$E_{BasicMDS} = \sum_{i=1}^{N} \sum_{j=i+1}^{N} (L_{ij} - D_{ij})^2 = \sum_{i=1}^{N} \sum_{j=i+1}^{N} E_{ij}^2$$
where the error is $E_{ij} = |L_{ij} - D_{ij}|$,
$D_{ij} = \|Y_i - Y_j\|$ is the distance between points $Y_i$ and $Y_j$ in data space, and
$L_{ij} = \|X_i - X_j\|$ is the mapped distance between points $X_i$ and $X_j$ in latent space.
[Figure: points $Y_i$, $Y_j$ at distance $D_{ij}$ in data space and their images $X_i$, $X_j$ at distance $L_{ij}$ in latent space.]

Sammon Mapping (1969)
$$E_{Sammon} = \frac{1}{C} \sum_{i=1}^{N} \sum_{j=i+1}^{N} \frac{(L_{ij} - D_{ij})^2}{D_{ij}} = \frac{1}{C} \sum_{i=1}^{N} \sum_{j=i+1}^{N} \frac{E_{ij}^2}{D_{ij}}$$
where the error is $E_{ij} = |L_{ij} - D_{ij}|$ and the normalisation scalar is
$$C = \sum_{i=1}^{N} \sum_{j=i+1}^{N} D_{ij}.$$
Focuses on small distances: for the same error, the smaller distance is given the bigger stress.

Possible Extensions
Bregman divergences in both data space and latent space:
$$J = \sum_{i,j=1}^{N} \left( d_{F_1}(x_i, x_j) - d_{F_2}(y_i, y_j) \right)^2$$
Or even
$$J = \sum_{i,j=1}^{N} d_{F_3}\left( d_{F_1}(x_i, x_j),\, d_{F_2}(y_i, y_j) \right)$$

Metric MDS with Bregman divergence between distances
$$J_{BMDS} = \sum_{i,j=1}^{N} d_{F_1}\left( d_{F_2}(x_i, x_j),\, d_{F_3}(y_i, y_j) \right)$$
$$L_{ij} = d_{F_2}(x_i, x_j), \qquad D_{ij} = d_{F_3}(y_i, y_j)$$
• Euclidean distance on latents.
• Any divergence on data.
• Itakura-Saito divergence between them (Sammon-like):
$$d_{IS}(L_{ij}, D_{ij}) = \frac{L_{ij}}{D_{ij}} - \log \frac{L_{ij}}{D_{ij}} - 1$$
• To minimise the divergence, use
$$\frac{\partial d_{IS}(L_{ij}, D_{ij})}{\partial L_{ij}} = \frac{1}{D_{ij}} - \frac{1}{L_{ij}}$$

Moving the Latent Points
$$J_{BMDS} = \sum_{i,j=1}^{N} d_{F_1}\left( d_{F_2}(x_i, x_j),\, d_{F_3}(y_i, y_j) \right)$$
$$\frac{\partial d_{IS}(L_{ij}, D_{ij})}{\partial L_{ij}} = \frac{1}{D_{ij}} - \frac{1}{L_{ij}}$$
$$\Delta x_i \propto \sum_{j=1}^{N} \left( \frac{1}{L_{ij}} - \frac{1}{D_{ij}} \right) (x_i - x_j)$$
$F_1$ for the I.S. divergence, $F_2$ for Euclidean, $F_3$ any divergence.

The algae data set

Two representations
The standard Bregman representation:
$$E_{BMDS}(Y) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} d_F(L_{ij}, D_{ij}) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} \left( F(L_{ij}) - F(D_{ij}) - (L_{ij} - D_{ij}) \nabla F(D_{ij}) \right)$$
Concentrating on the residual errors:
$$E_{BMDS}(Y) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} \left( \frac{1}{2!} \frac{d^2 F(D_{ij})}{dD_{ij}^2} (L_{ij} - D_{ij})^2 + \frac{1}{3!} \frac{d^3 F(D_{ij})}{dD_{ij}^3} (L_{ij} - D_{ij})^3 + \dots \right)$$

Basic MDS is a special BMMDS
• Base convex function is chosen as $F(x) = x^2$.
• And higher order derivatives are $\frac{d^2 F(x)}{dx^2} = 2$ and $\frac{d^k F(x)}{dx^k} = 0$ for $k \geq 3$.
• So the Basic MDS stress is derived as
$$E_{BMDS}(Y) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} (L_{ij} - D_{ij})^2 = E_{BasicMDS}.$$

Sammon Mapping
Select $F(x) = x \log x$, so that
$$\frac{dF(x)}{dx} = 1 + \log x, \qquad \frac{d^2 F(x)}{dx^2} = \frac{1}{x}.$$
Then
$$E_{BMDS}(Y) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} \left( \frac{1}{2!} \frac{d^2 F(D_{ij})}{dD_{ij}^2} (L_{ij} - D_{ij})^2 + \frac{1}{3!} \frac{d^3 F(D_{ij})}{dD_{ij}^3} (L_{ij} - D_{ij})^3 + \dots \right)$$
$$= \sum_{i=1}^{N} \sum_{j=i+1}^{N} \left( I^{Sammon}_{ij} + \frac{1}{3!} \frac{d^3 F(D_{ij})}{dD_{ij}^3} (L_{ij} - D_{ij})^3 + \dots \right)$$
where $I^{Sammon}_{ij} = \frac{(L_{ij} - D_{ij})^2}{2 D_{ij}}$ is the second-order term, which matches the Sammon stress term up to a constant factor.

Example 2: Extended Sammon
• Base convex function $F(x) = x \log x$, $x > 0$.
• This is equivalent to [equation missing].
• The Sammon mapping is rewritten as [equation missing].

Sammon and Extended Sammon
• The common term is the second-order term $\frac{(L_{ij} - D_{ij})^2}{D_{ij}}$, up to a constant factor.
• The Sammon mapping is thus an approximation to the Extended Sammon mapping via the common term.
• The Extended Sammon mapping makes further adjustments on the basis of the higher order terms.

An Experiment on Swiss roll data set
Distance preservation
Relative standard deviation
• On short distances, Sammon has smaller variance than BasicMDS, and Extended Sammon has smaller variance than Sammon, i.e. control of small distances is enhanced.
• Large distances are given more and more freedom in the same order as above.

LCMC: local continuity meta-criterion (L. Chen 2006)
• A common measure that assesses the projection quality of different MDS methods.
• In terms of neighbourhood preservation.
• Value between 0 and 1, the higher the better.

Quality assessed by LCMC

Why Extended Sammon outperforms Sammon
• Stress formation when [condition missing].

Features of the base convex function
• Recall that the base convex function for the Extended Sammon mapping is $F(x) = x \log x$, $x > 0$.
• Higher order derivatives are $\frac{d^k F(x)}{dx^k} = (-1)^k \frac{(k-2)!}{x^{k-1}}$ for $k \geq 2$.
• Even orders are positive and odd ones are negative.

Stress comparison between Sammon and Extended Sammon
Stress configured by Sammon, calculated and mapped by Extended Sammon
• The Extended Sammon mapping calculates stress on the basis of the configuration found by the Sammon mapping.
• For [condition missing], the mean stresses calculated by the Extended Sammon mapping are much higher than those mapped by the Sammon mapping.
• For [condition missing], the calculated mean stresses are clearly lower than those of the Sammon mapping.
• The Extended Sammon mapping makes short mapped distances even shorter and long ones even longer.

Stress formation by items

Generalisation: from MDS to Bregman divergences
• A group of MDS is generalised as [equation missing].
• C is a normalisation scalar which is used for quantitative comparison purposes. It does not affect the mapping results.
• Weight function for missing samples: [equation missing].
• The Basic MDS and the Sammon mapping belong to this group.
• If C=1, then set [equation missing].
• Then the generalised MDS is the first term of BMMDS and BMMDS is an extension of MDS.
• Recall that BMMDS is equivalent to
$$E_{BMDS}(Y) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} \left( \frac{1}{2!} \frac{d^2 F(D_{ij})}{dD_{ij}^2} (L_{ij} - D_{ij})^2 + \frac{1}{3!} \frac{d^3 F(D_{ij})}{dD_{ij}^3} (L_{ij} - D_{ij})^3 + \dots \right)$$

Criterion for base convex function selection
• In order to focus on local distances and concentrate less on long distances, the base convex function must satisfy [condition missing].
• Not all convex functions can be considered; F(x)=exp(x), for example, cannot.
• The 2nd order derivative is primarily considered. We wish it to be big for small distances and small for long distances. It represents the focusing power on local distances.

Two groups of Convex functions
• The even order derivatives are positive, odd order ones are negative.
• Function No. 1 is that of the Extended Sammon mapping.

Focusing power

Different strategies for focusing power
• Vertical axis is the logarithm of the 2nd order derivative.
• The two groups use different strategies for increasing focusing power.
• In the first group, the second order derivatives become progressively higher for small distances and progressively lower for long distances.
• In the second group, the second order derivatives have limited maximum values for very small distances, but fall off drastically for long distances as λ increases.
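The focusing-power idea can be illustrated with a minimal sketch, using only the three base convex functions whose formulas appear in these slides (x^2 for Basic MDS, x log x for Extended Sammon, -log x for Itakura-Saito). The dictionary keys and the helper name `log_focusing_power` are made up for this illustration, and the λ-parameterised functions of the two groups are not reproduced because their formulas are not given here.

```python
import math

# Second derivatives F''(x) of base convex functions named in the slides.
# The labels are hypothetical names chosen for this sketch.
SECOND_DERIVATIVES = {
    "basic_mds": lambda x: 2.0,               # F(x) = x^2
    "extended_sammon": lambda x: 1.0 / x,     # F(x) = x log x
    "itakura_saito": lambda x: 1.0 / x ** 2,  # F(x) = -log x
}


def log_focusing_power(name, distance):
    """Return log F''(distance), the quantity on the vertical axis."""
    return math.log(SECOND_DERIVATIVES[name](distance))


# Tabulate the log focusing power at a short, a medium and a long distance.
for distance in (0.1, 1.0, 10.0):
    row = {name: round(log_focusing_power(name, distance), 3)
           for name in SECOND_DERIVATIVES}
    print(distance, row)
```

Under these assumptions the Basic MDS weight is flat, while the other two grow as the distance shrinks and decay as it grows, with Itakura-Saito changing faster than Extended Sammon.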
Two groups of Bregman divergences
• Elastic scaling (Victor E McGee, 1966)

Experiment on Swiss roll: FirstGroup
• For Extended Sammon, Itakura-Saito, [item missing], local distances are mapped better and better, and long distances are stretched so that the unfolding trend is obvious.

Distance mapping: FirstGroup
Standard deviation: FirstGroup
LCMC measure: FirstGroup

Experiment on Swiss roll: SecondGroup
Distance mapping: SecondGroup
Standard deviation: SecondGroup
LCMC: SecondGroup

OpenBox, Sammon and FirstGroup
SecondGroup on OpenBox
Distance mapping: two groups
LCMC: two groups
Standard deviation: two groups

Swiss roll distances distribution
OpenBox distances distribution

Swiss roll vs OpenBox
• Distance distribution:
– Swiss roll: the proportion of longer distances is greater than that of the shorter distances.
– OpenBox: a very large number of medium distances, with small distances making up much of the rest.
• Mapping results:
– Swiss roll: long distances are stretched and local distances are usually mapped shorter.
– OpenBox: the longest distances are not obviously stretched, and are perhaps even compressed. Small distances are mapped longer than their original values in data space by some methods.
• Conclusion: a tug of war between local and long distances, each trying to be mapped to its original value in data space.

Left and right Bregman divergences
• All of this is with left divergences: latent points are in the left position in the divergence, ...
• We can show that right divergences produce extensions of curvilinear component analysis. (Sun et al, ESANN2010)

Conclusion
• Applied Bregman divergences to multidimensional scaling.
• Shown that basic MMDS is a special case and that the Sammon mapping approximates a BMMDS.
• Improved upon both with 2 families of divergences.
• Shown results on two artificial data sets.
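The two special-case claims in the conclusion can be checked numerically with a small sketch. It assumes scalar distances, sums over pairs i < j, and drops Sammon's normalisation scalar C. The function names (`bregman`, `bmds_stress`, `sammon_second_order`) and the toy matrices `D` and `L` are hypothetical, chosen only for this illustration, and are not taken from the authors' code.

```python
import math


def bregman(F, dF, x, y):
    """Scalar Bregman divergence d_F(x, y) = F(x) - F(y) - (x - y) F'(y)."""
    return F(x) - F(y) - (x - y) * dF(y)


# Each base convex function is stored as a pair (F, F').
SQUARE = (lambda x: x * x, lambda x: 2.0 * x)                      # Basic MDS
XLOGX = (lambda x: x * math.log(x), lambda x: 1.0 + math.log(x))  # Extended Sammon


def pairs(n):
    """All index pairs (i, j) with i < j."""
    return [(i, j) for i in range(n) for j in range(i + 1, n)]


def bmds_stress(base, L, D):
    """Sum of d_F(L_ij, D_ij) over i < j, latent distance in the left slot."""
    F, dF = base
    return sum(bregman(F, dF, L[i][j], D[i][j]) for i, j in pairs(len(D)))


def basic_mds_stress(L, D):
    """Sum of squared errors (L_ij - D_ij)^2 over i < j."""
    return sum((L[i][j] - D[i][j]) ** 2 for i, j in pairs(len(D)))


def sammon_second_order(L, D):
    """Second-order Taylor term for F(x) = x log x: (L - D)^2 / (2 D)."""
    return sum((L[i][j] - D[i][j]) ** 2 / (2.0 * D[i][j])
               for i, j in pairs(len(D)))


# Toy data-space distances D and slightly perturbed latent distances L.
D = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.5], [2.0, 1.5, 0.0]]
L = [[0.0, 1.05, 1.9], [1.05, 0.0, 1.55], [1.9, 1.55, 0.0]]

print("Basic MDS stress:        ", basic_mds_stress(L, D))
print("BMDS stress, F = x^2:    ", bmds_stress(SQUARE, L, D))
print("Sammon second-order term:", sammon_second_order(L, D))
print("BMDS stress, F = x log x:", bmds_stress(XLOGX, L, D))
```

With F(x) = x^2 the Bregman stress coincides with the Basic MDS stress up to rounding. With F(x) = x log x it stays close to the Sammon-like second-order term when the errors are small, and the remaining gap comes from the higher order terms.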