Extending metric multidimensional scaling with Bregman divergences

Jigang Sun and Colin Fyfe
Visualising 18-dimensional data
Outline
• Bregman divergence.
• Multidimensional scaling (MDS).
• Extending MDS with Bregman divergences.
• Relating the Sammon mapping to mappings with Bregman divergences; comparison of effects and explanation.
• Conclusion
Strictly Convex function
For any $q$ and $p$ in its domain, and any $0 < \eta < 1$,
$$F(\eta q + (1 - \eta) p) < \eta F(q) + (1 - \eta) F(p), \qquad q \neq p.$$
Pictorially, the strictly convex function $F(x)$ lies below the segment connecting the two points $(q, F(q))$ and $(p, F(p))$.
Bregman Divergences
d ( x, y)   ( x)   ( y)  x  y, ( y)
is the Bregman divergence between x and y based on
convex function, φ.
Taylor Series expansion is
 ( x)   ( y )  ( x  y ) ' ( y )  ( x  y )
2
 ' ' ( y)
2!
 ...
Bregman Divergences
Euclidean distance is a Bregman divergence:
$$\varphi(x) = x^2$$
$$d_\varphi(x, y) = x^2 - y^2 - (x - y) \cdot 2y$$
$$d_\varphi(x, y) = x^2 - 2xy + y^2$$
$$d_\varphi(x, y) = (x - y)^2$$
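A quick numerical check of the derivation above, reusing `bregman()` from the earlier sketch: with $\varphi(x) = \|x\|^2$, the divergence equals the squared Euclidean distance.

```python
phi  = lambda x: np.dot(x, x)   # phi(x) = ||x||^2
grad = lambda y: 2.0 * y        # its gradient

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
assert np.isclose(bregman(phi, grad, x, y), np.sum((x - y) ** 2))
```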
Kullback Leibler Divergence
$$\varphi(p) = \sum_{i=1}^d p_i \log_2 p_i$$
$$d_\varphi(p, q) = \sum_{i=1}^d p_i \log_2 p_i - \sum_{i=1}^d q_i \log_2 q_i - \langle p - q, \nabla\varphi(q) \rangle$$
$$d_\varphi(p, q) = \sum_{i=1}^d p_i \log_2 p_i - \sum_{i=1}^d q_i \log_2 q_i - \sum_{i=1}^d (p_i - q_i)(\log_2 q_i + \log_2 e)$$
$$d_\varphi(p, q) = \sum_{i=1}^d p_i \log_2 \frac{p_i}{q_i} - \log_2 e \sum_{i=1}^d (p_i - q_i)$$
For probability distributions, $\sum_i p_i = \sum_i q_i = 1$, so the last term vanishes and
$$d_\varphi(p, q) = \sum_{i=1}^d p_i \log_2 \frac{p_i}{q_i}$$
Generalised Information Divergence
• φ(z)=z log(z)
d ( x, y )  x log x  y log y  x  y, log y  1
d ( x, y )  x log x  x log y  x  y
d ( x, y )  x log xy  x  y
Other Divergences
• Itakura-Saito divergence: $\varphi(x) = -\log(x)$
• Mahalanobis distance: $\varphi(x) = x^T \Sigma^{-1} x$
• Logistic loss: $\varphi(x) = x \log x + (1 - x)\log(1 - x)$
• Any convex function
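Minimal sketches of $(\varphi, \nabla\varphi)$ pairs for these generators, usable with `bregman()` above. Elementwise generators are summed over vector components, and the Mahalanobis matrix $\Sigma^{-1}$ is assumed to be given.

```python
# Itakura-Saito: phi(x) = -sum_i log x_i, for positive x
itakura_saito = (lambda x: -np.sum(np.log(x)),
                 lambda y: -1.0 / y)

# Logistic loss: phi(x) = sum_i x_i log x_i + (1 - x_i) log(1 - x_i), 0 < x < 1
logistic = (lambda x: np.sum(x * np.log(x) + (1 - x) * np.log(1 - x)),
            lambda y: np.log(y / (1 - y)))

# Mahalanobis: phi(x) = x^T S_inv x for a positive definite S_inv
def mahalanobis_pair(S_inv):
    return (lambda x: x @ S_inv @ x,
            lambda y: 2.0 * (S_inv @ y))
```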
Some Properties
• $d_\varphi(x, y) \geq 0$, with equality iff $x = y$.
• Not a metric, since $d_\varphi(x, y) \neq d_\varphi(y, x)$ in general (though $d(x, y) = d_\varphi(x, y) + d_\varphi(y, x)$ is symmetric).
• Convex in the first parameter.
• Linear: $d_{\varphi + a\gamma}(x, y) = d_\varphi(x, y) + a \, d_\gamma(x, y)$.
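A small numerical illustration of the first two properties, reusing the Itakura-Saito pair from the previous sketch:

```python
phi, grad = itakura_saito
a, b = np.array([1.0, 2.0]), np.array([2.0, 0.5])

assert bregman(phi, grad, a, b) >= 0.0           # non-negativity
assert not np.isclose(bregman(phi, grad, a, b),
                      bregman(phi, grad, b, a))  # not symmetric
```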
Multidimensional Scaling
• Creates one latent point for each data point.
• The latent space is often 2 dimensional.
• Positions the latent points so that they best
represent the data distances.
– Two latent points are close if the two
corresponding data points are close.
– Two latent points are distant if the two
corresponding data points are distant.
Classical/Basic Metric MDS
• We minimise the stress function
$$E_{BasicMDS} = \sum_{i=1}^N \sum_{j=i+1}^N (L_{ij} - D_{ij})^2 = \sum_{i=1}^N \sum_{j=i+1}^N E_{ij}^2$$
where
error $E_{ij} = \mathrm{abs}(L_{ij} - D_{ij})$,
$D_{ij} = \|Y_i - Y_j\|$, the distance between points $Y_i$ and $Y_j$ in data space,
$L_{ij} = \|X_i - X_j\|$, the mapped distance between points $X_i$ and $X_j$ in latent space.
[Diagram: $Y_i$, $Y_j$ in data space at distance $D_{ij}$ map to $X_i$, $X_j$ in latent space at distance $L_{ij}$.]
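A minimal sketch of this stress function, assuming data `Y` of shape (N, d) and latent positions `X` of shape (N, 2); the names are illustrative, not from the slides.

```python
import numpy as np
from scipy.spatial.distance import pdist

def basic_mds_stress(X, Y):
    """Sum of squared differences between latent and data distances."""
    L = pdist(X)  # condensed vector of latent distances L_ij, i < j
    D = pdist(Y)  # condensed vector of data distances D_ij, i < j
    return np.sum((L - D) ** 2)
```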
Sammon Mapping (1969)
$$E_{Sammon} = \frac{1}{C} \sum_{i=1}^N \sum_{j=i+1}^N \frac{(L_{ij} - D_{ij})^2}{D_{ij}} = \frac{1}{C} \sum_{i=1}^N \sum_{j=i+1}^N \frac{E_{ij}^2}{D_{ij}}$$
where
error $E_{ij} = \mathrm{abs}(L_{ij} - D_{ij})$,
normalisation scalar $C = \sum_{i=1}^N \sum_{j=i+1}^N D_{ij}$.
Focuses on small distances: for the same error,
the smaller distance is given bigger stress.
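The corresponding sketch for the Sammon stress (reusing `pdist` and numpy from the previous sketch); the $1/D_{ij}$ weighting is what gives small distances their extra influence.

```python
def sammon_stress(X, Y):
    """Sammon stress: residuals weighted by 1/D_ij, normalised by C."""
    L, D = pdist(X), pdist(Y)
    C = np.sum(D)  # normalisation scalar
    return np.sum((L - D) ** 2 / D) / C
```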
Possible Extensions
Bregman divergences in both data space and latent space
$$J = \sum_{i,j=1}^N \left( d_{F_1}(x_i, x_j) - d_{F_2}(y_i, y_j) \right)^2$$
Or even
$$J = \sum_{i,j=1}^N d_{F_3}\left( d_{F_1}(x_i, x_j), d_{F_2}(y_i, y_j) \right)$$
Metric MDS with Bregman divergence between distances
$$J_{BMDS} = \sum_{i,j=1}^N d_{F_1}\left( d_{F_2}(x_i, x_j), d_{F_3}(y_i, y_j) \right)$$
$$L_{ij} = d_{F_2}(x_i, x_j), \qquad D_{ij} = d_{F_3}(y_i, y_j)$$
Euclidean distance on latents. Any divergence on data. Itakura-Saito divergence between them (Sammon-like):
$$d_{IS}(L_{ij}, D_{ij}) = \frac{L_{ij}}{D_{ij}} - \log\left( \frac{L_{ij}}{D_{ij}} \right) - 1$$
$$\frac{\partial d_{IS}(L_{ij}, D_{ij})}{\partial L_{ij}} = \frac{1}{D_{ij}} - \frac{1}{L_{ij}}$$
which is used to move $L_{ij}$ to minimise the divergence.
Moving the Latent Points
$$J_{BMDS} = \sum_{i,j=1}^N d_{F_1}\left( d_{F_2}(x_i, x_j), d_{F_3}(y_i, y_j) \right)$$
$$\frac{\partial d_{IS}(L_{ij}, D_{ij})}{\partial L_{ij}} = \frac{1}{D_{ij}} - \frac{1}{L_{ij}}$$
$$\Delta x_i = \alpha \sum_{j=1}^N (x_i - x_j) \left( \frac{1}{L_{ij}} - \frac{1}{D_{ij}} \right)$$
$F_1$ is the Itakura-Saito divergence, $F_2$ is Euclidean, and $F_3$ is any divergence.
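A minimal gradient-descent sketch of this update (Euclidean latent distances, Itakura-Saito divergence between distances). The learning rate `alpha` is an illustrative assumption, and the chain rule through $L_{ij} = \|x_i - x_j\|$ contributes an extra $1/L_{ij}$ factor to each pair's weight.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def bmmds_is_step(X, Y, alpha=0.01):
    """One descent step on J_BMDS with d_IS between the distances."""
    N = X.shape[0]
    L = squareform(pdist(X)) + np.eye(N)  # pad diagonals to avoid /0;
    D = squareform(pdist(Y)) + np.eye(N)  # the diagonal weights come out 0
    W = (1.0 / L - 1.0 / D) / L           # -dJ/dL_ij, chained through L_ij
    diff = X[:, None, :] - X[None, :, :]  # all pairwise (x_i - x_j)
    return X + alpha * np.einsum('ij,ijk->ik', W, diff)
```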
The algae data set
[Figures: mappings of the algae data set]
Two representations
The standard Bregman representation:
$$E_{BMDS}(Y) = \sum_{i=1}^N \sum_{j=i+1}^N d_F(L_{ij}, D_{ij}) = \sum_{i=1}^N \sum_{j=i+1}^N \left( F(L_{ij}) - F(D_{ij}) - (L_{ij} - D_{ij}) F'(D_{ij}) \right)$$
Concentrating on the residual errors:
$$E_{BMDS}(Y) = \sum_{i=1}^N \sum_{j=i+1}^N \left( \frac{1}{2!} \frac{d^2 F(D_{ij})}{dD_{ij}^2} (L_{ij} - D_{ij})^2 + \frac{1}{3!} \frac{d^3 F(D_{ij})}{dD_{ij}^3} (L_{ij} - D_{ij})^3 + \ldots \right)$$
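A sympy sketch verifying this identity for $F(x) = x \log x$: the Bregman divergence between $L$ and $D$ matches the Taylor-residual series (truncated here at twelve terms) to high accuracy near $L = D$.

```python
import sympy as sp

L, D = sp.symbols('L D', positive=True)
F = lambda t: t * sp.log(t)

breg = F(L) - F(D) - (L - D) * sp.diff(F(D), D)
series = sum(sp.diff(F(D), D, k) * (L - D) ** k / sp.factorial(k)
             for k in range(2, 12))

err = (breg - series).subs({L: 1.1, D: 1.0})
assert abs(float(err)) < 1e-12
```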
Basic MDS is a special BMMDS
• The base convex function is chosen as $F(x) = x^2$.
• Higher order derivatives are $F''(x) = 2$ and $F^{(n)}(x) = 0$ for $n \geq 3$.
• So only the quadratic term of the expansion survives.
• $E_{BMDS}(Y) = \sum_{i=1}^N \sum_{j=i+1}^N (L_{ij} - D_{ij})^2 = E_{BasicMDS}$ is derived.
Sammon Mapping
F ( x)  x log x,
Select
dF ( x)
d 2 F ( x) 1
 1  log x,

2
dx
dx
x
Then
2
3
d
F ( Dij )
1 d F ( Dij )
1
2
3
EBMDS (Y )   
(
L

D
)

(
L

D
)
 ...
ij
ij
ij
ij
2
3
dDij
3! dDij
i 1 j i 12!
N
N
3
d
F ( Dij )
1
Sammon
3
   I ij

(
L

D
)
 ...
ij
ij
3
3! dDij
i 1 j i 1
N
N
Example 2: Extended Sammon
• Base convex function $F(x) = x \log x$, $x > 0$.
• This is equivalent to the Generalised Information divergence between the distances:
$$E_{ExtSammon} = \sum_{i=1}^N \sum_{j=i+1}^N \left( L_{ij} \log \frac{L_{ij}}{D_{ij}} - L_{ij} + D_{ij} \right)$$
• The Sammon mapping is rewritten as
$$E_{Sammon} = \frac{1}{C} \sum_{i=1}^N \sum_{j=i+1}^N \frac{(L_{ij} - D_{ij})^2}{D_{ij}},$$
the quadratic term of the expansion above, up to a constant factor.
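A sketch of the Extended Sammon stress as the GID between distances, following the formula above (`pdist` and numpy as in the earlier sketches):

```python
def extended_sammon_stress(X, Y):
    """Sum over pairs of L*log(L/D) - L + D, the GID between distances."""
    L, D = pdist(X), pdist(Y)
    return np.sum(L * np.log(L / D) - L + D)
```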
Sammon and Extended Sammon
• The common term is the quadratic term $\frac{(L_{ij} - D_{ij})^2}{D_{ij}}$ (up to constants).
• The Sammon mapping is thus an approximation to the Extended Sammon mapping via the common term.
• The Extended Sammon mapping makes further adjustments on the basis of the higher order terms.
An Experiment on the Swiss roll data set
Distance preservation
Relative standard deviation
• On short distances, Sammon has smaller variance than BasicMDS, and Extended Sammon has smaller variance than Sammon; i.e. control of small distances is enhanced.
• Large distances are given more and more freedom, in the same order as above.
LCMC: local continuity meta-criterion
(L. Chen 2006)
• A common measure for assessing the projection quality of different MDS methods.
• Based on neighbourhood preservation.
• Takes values between 0 and 1; the higher the better.
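A minimal sketch of LCMC under the usual adjusted definition (an assumption here, after Chen & Buja): the average overlap of $k$-nearest-neighbour sets in data and latent space, minus the $k/(N-1)$ overlap expected of a random embedding.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def lcmc(X, Y, k=10):
    """Local continuity meta-criterion between latent X and data Y."""
    N = X.shape[0]
    # neighbours by distance rank; column 0 is the point itself, so skip it
    nbr_X = np.argsort(squareform(pdist(X)), axis=1)[:, 1:k + 1]
    nbr_Y = np.argsort(squareform(pdist(Y)), axis=1)[:, 1:k + 1]
    overlap = sum(len(set(a) & set(b)) for a, b in zip(nbr_X, nbr_Y))
    return overlap / (N * k) - k / (N - 1)
```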
Quality assessed by LCMC
Why Extended Sammon outperforms Sammon
• Stress formation [figure]
Features of the base convex function
• Recall that the base convex function for the Extended Sammon mapping is $F(x) = x \log x$.
• Higher order derivatives are $\frac{d^n F(x)}{dx^n} = (-1)^n \frac{(n-2)!}{x^{n-1}}$ for $n \geq 2$.
• Even orders are positive and odd ones are negative.
Stress comparison between Sammon and Extended Sammon
Stress configured by Sammon, calculated and mapped by Extended Sammon
• The Extended Sammon mapping calculates stress on the basis of the configuration found by the Sammon mapping.
• For short distances, the mean stresses calculated by the Extended Sammon are much higher than those mapped by the Sammon mapping.
• For long distances, the calculated mean stresses are clearly lower than those of the Sammon mapping.
• The Extended Sammon thus maps short distances even shorter and long distances even longer.
Stress formation by items
Generalisation: from MDS to Bregman
divergences
• A group of MDS methods is generalised as
$$E = \frac{1}{C} \sum_{i=1}^N \sum_{j=i+1}^N W(D_{ij}) (L_{ij} - D_{ij})^2$$
• $C$ is a normalisation scalar used for quantitative comparison purposes. It does not affect the mapping results.
• The weight function $W(D_{ij})$ can be set to zero for missing samples.
• The Basic MDS ($W = 1$, $C = 1$) and the Sammon mapping ($W(D_{ij}) = 1/D_{ij}$, $C = \sum D_{ij}$) belong to this group.
Generalisation: from MDS to Bregman
divergences
• If $C = 1$, then set $W(D_{ij}) = \frac{1}{2!} F''(D_{ij})$.
• Then the generalised MDS is the first term of BMMDS, and BMMDS is an extension of MDS.
• Recall that BMMDS is equivalent to
$$E_{BMDS}(Y) = \sum_{i=1}^N \sum_{j=i+1}^N \left( \frac{1}{2!} F''(D_{ij})(L_{ij} - D_{ij})^2 + \frac{1}{3!} F'''(D_{ij})(L_{ij} - D_{ij})^3 + \ldots \right)$$
Criterion for base convex function
selection
• In order to focus on local distances and concentrate less on long distances, the base convex function must have a large second derivative at small distances and a small one at long distances.
• Not all convex functions can be considered: $F(x) = \exp(x)$, for example, fails because $F''(x) = e^x$ grows with $x$.
• The 2nd order derivative is primarily considered. We wish it to be big for small distances and small for long distances; it represents the focusing power on local distances.
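A quick numerical illustration of the criterion (a sketch): $F(x) = x \log x$ has $F''(x) = 1/x$, which decays with distance and so qualifies, while $F(x) = \exp(x)$ has $F''(x) = e^x$, which grows and so lacks focusing power.

```python
import numpy as np

xs = np.linspace(0.1, 5.0, 50)
print(np.all(np.diff(1.0 / xs) < 0))    # True: 1/x decays -> qualifies
print(np.all(np.diff(np.exp(xs)) > 0))  # True: e^x grows -> rejected
```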
Two groups of Convex functions
• The even order derivatives are positive, and the odd order ones are negative.
• No. 1 is that of the Extended Sammon mapping.
Focusing power
Different strategies for focusing power
• The vertical axis is the logarithm of the 2nd order derivative.
• The two groups use different strategies for increasing focusing power.
• In the first group, the second order derivatives become ever higher for small distances and ever lower for long distances.
• In the second group, the second order derivatives have limited maximum values for very small distances, but fall drastically for long distances as λ increases.
Two groups of Bregman divergences
• Elastic scaling (Victor E. McGee, 1966)
Experiment on Swiss roll: FirstGroup
• For the Extended Sammon and Itakura-Saito mappings, local distances are mapped better and better, and long distances are stretched such that the unfolding trend is obvious.
Distance mapping: FirstGroup
Standard deviation: FirstGroup
LCMC measure: FirstGroup
Experiment on Swiss roll: SecondGroup
Distance mapping: SecondGroup
Standard deviation: SecondGroup
LCMC: SecondGroup
OpenBox, Sammon and FirstGroup
SecondGroup on OpenBox
Distance mapping: two groups
LCMC: two groups
Standard deviation: two groups
Swiss roll distances distribution
OpenBox distances distribution
Swiss roll vs OpenBox
• Distance distributions:
– Swiss roll: the proportion of longer distances is greater than that of shorter distances.
– OpenBox: a very large quantity of medium distances; small distances take up much of the rest.
• Mapping results:
– Swiss roll: long distances are stretched and local distances are usually mapped shorter.
– OpenBox: the longest distances are not stretched noticeably, and are perhaps even compressed. Small distances are mapped longer than their original values in data space by some methods.
• Conclusion: a tug of war between local and long distances, each trying to get the opportunity to be mapped to its original value in data space.
Left and right Bregman divergences
• All of this is with left divergences – the latent distances occupy the left position in the divergence, $d_F(L_{ij}, D_{ij})$.
• We can show that right divergences produce extensions of curvilinear component analysis (Sun et al., ESANN 2010).
Conclusion
• Applied Bregman divergences to multidimensional scaling.
• Shown that Basic MDS is a special case and that the Sammon mapping approximates a BMMDS.
• Improved upon both with two families of divergences.
• Shown results on two artificial data sets.