Information Bottleneck
presented by
Boris Epshtein & Lena Gorelick
Advanced Topics in Computer and Human Vision
Spring 2004
Agenda
• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
– Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
– iIB
– dIB
– aIB
• Application
Motivation
Clustering Problem
Motivation
• “Hard” Clustering – partitioning of the input
data into several exhaustive and mutually
exclusive clusters
• Each cluster is represented by a centroid
Motivation
• “Good” clustering – should group similar
data points together and dissimilar points
apart
• Quality of partition – average distortion
between the data points and corresponding
representatives (cluster centroids)
Motivation
• “Soft” Clustering – each data point is
assigned to all clusters with some
normalized probability
• Goal – minimize expected distortion between
the data points and cluster centroids
Motivation…
Complexity-Precision Trade-off
• Too simple model → poor precision
• Higher precision requires a more complex model
Motivation…
Complexity-Precision Trade-off
• Too simple model → poor precision
• Higher precision requires a more complex model
• Too complex model → overfitting
Motivation…
Complexity-Precision Trade-off
• Too complex model:
– can lead to overfitting → poor generalization
– is hard to learn
• Too simple model:
– cannot capture the real structure of the data
• Examples of approaches:
– SRM – Structural Risk Minimization
– MDL – Minimum Description Length
– Rate Distortion Theory
Agenda
• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
– Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
– iIB
– dIB
– aIB
• Application
Definitions…
Entropy
• The measure of uncertainty about a random variable $X$:
$$H(X) = -\sum_x p(x)\log p(x)$$
Definitions…
Entropy - Example
– Fair coin ($p = 0.5$): $H(X) = 1$ bit – maximal uncertainty
– Unfair coin (e.g., $p = 0.9$): $H(X) \approx 0.47$ bit – less uncertainty
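Not in the original slides: a minimal Python sketch of the coin example, computing $H(X)$ directly from the definition above.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                     # 0 * log(0) is taken as 0
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))   # fair coin      -> 1.0 bit
print(entropy([0.9, 0.1]))   # unfair coin    -> ~0.469 bits
print(entropy([1.0, 0.0]))   # certain outcome -> 0.0 bits
```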
Definitions…
Entropy - Illustration
[Plot of the binary entropy $H(p)$: highest (1 bit) at $p = 0.5$, lowest (0) at $p = 0$ and $p = 1$]
Definitions…
Conditional Entropy
• The measure of uncertainty about the random variable $X$ given the value of the variable $Y$:
$$H(X|Y) = -\sum_{x,y} p(x,y)\log p(x|y)$$
Definitions…
Conditional Entropy
Example
Definitions…
Mutual Information
• The reduction in uncertainty of $X$ due to the knowledge of $Y$:
$$I(X;Y) = H(X) - H(X|Y) = \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}$$
– Nonnegative: $I(X;Y) \ge 0$
– Symmetric: $I(X;Y) = I(Y;X)$
– Convex w.r.t. $p(y|x)$ for a fixed $p(x)$
Definitions…
Mutual Information - Example
Definitions…
Kullback-Leibler Distance
• A distance between distributions $p$ and $q$ over the same alphabet:
$$D_{KL}[p\|q] = \sum_x p(x)\log\frac{p(x)}{q(x)}$$
– Nonnegative: $D_{KL}[p\|q] \ge 0$, with equality iff $p = q$
– Asymmetric: $D_{KL}[p\|q] \ne D_{KL}[q\|p]$ in general
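Also not in the original deck: a small Python sketch showing that $I(X;Y)$ is exactly the KL distance between the joint $p(x,y)$ and the product of the marginals.

```python
import numpy as np

def kl(p, q):
    """D_KL[p || q] in bits; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def mutual_information(pxy):
    """I(X;Y) from a joint distribution table p(x, y)."""
    pxy = np.asarray(pxy, float)
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    # I(X;Y) = D_KL[ p(x,y) || p(x) p(y) ]
    return kl(pxy.ravel(), (px * py).ravel())

pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
print(mutual_information(pxy))   # ~0.278 bits
```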
Agenda
• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
– Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
– iIB
– dIB
– aIB
• Application
Rate Distortion Theory
Introduction
• Goal: obtain compact clustering of the
data with minimal expected distortion
• Distortion measure is a part of the
problem setup
• The clustering and its quality depend
on the choice of the distortion measure
Rate Distortion Theory
[Illustration: data points assigned to a fixed set of representatives]
• Obtain a compact clustering of the data with minimal expected distortion, given a fixed set of representatives
Cover & Thomas
Rate Distortion Theory - Intuition
• $T = X$ (every point is its own representative):
– zero distortion
– not compact
• $T$ = a single representative for all points:
– high distortion
– very compact
Rate Distortion Theory – Cont.
• The quality of clustering is determined by:
– Complexity, measured by $I(T;X)$ (a.k.a. Rate)
– Distortion, measured by $E[d(X,T)] = \sum_{x,t} p(x)\,p(t|x)\,d(x,t)$
Rate Distortion Plane
[Plot: expected distortion $E[d(X,T)]$ on one axis, rate $I(T;X)$ on the other; $D$ is the distortion constraint. The extremes are minimal distortion and maximal compression.]
Rate Distortion Function
• Let $D$ be an upper bound constraint on the expected distortion: $E[d(X,T)] \le D$
• Higher values of $D$ mean a more relaxed distortion constraint
⇒ stronger compression levels are attainable
• Given the distortion constraint $D$, find the most compact model (with smallest complexity $I(T;X)$)
Rate Distortion Function
• Given:
– Set of points $X$ with prior $p(x)$
– Set of representatives $T$
– Distortion measure $d(x,t)$
• Find:
– The most compact soft clustering $p(t|x)$ of the points of $X$ that satisfies the distortion constraint $E[d(X,T)] \le D$
• Rate Distortion Function:
$$R(D) = \min_{\{p(t|x)\,:\;E[d(X,T)] \le D\}} I(T;X)$$
Rate Distortion Function
$$\mathcal{F}[p(t|x)] = \underbrace{I(T;X)}_{\text{Complexity Term}} + \beta\,\underbrace{E[d(X,T)]}_{\text{Distortion Term}}$$
where $\beta$ is the Lagrange multiplier.
Minimize $\mathcal{F}[p(t|x)]$ over $p(t|x)$!
Rate Distortion Curve
[Plot of $R(D)$: rate $I(T;X)$ vs. expected distortion $E[d(X,T)]$, running between the minimal-distortion and maximal-compression extremes]
Rate Distortion Function
Minimize $\mathcal{F}[p(t|x)] = I(T;X) + \beta\,E[d(X,T)]$
Subject to the normalization $\sum_t p(t|x) = 1 \;\;\forall x$
The minimum is attained when
$$p(t|x) = \frac{p(t)\,e^{-\beta d(x,t)}}{Z(x,\beta)}, \qquad Z(x,\beta) = \sum_t p(t)\,e^{-\beta d(x,t)} \;\text{(normalization)}$$
Solution - Analysis
Solution: $p(t|x) = \dfrac{p(t)\,e^{-\beta d(x,t)}}{Z(x,\beta)}$
Known: $d(x,t)$, $\beta$, and $p(x)$; but $p(t) = \sum_x p(x)\,p(t|x)$ depends on $p(t|x)$
⇒ The solution is implicit
Solution - Analysis
Solution: $p(t|x) \propto p(t)\,e^{-\beta d(x,t)}$
For a fixed $t$: when $x$ is similar to $t$, the distortion $d(x,t)$ is small
⇒ closer points are attached to $t$ with higher probability
Solution - Analysis
Solution: $p(t|x) \propto p(t)\,e^{-\beta d(x,t)}$
Fix $t$: as $\beta \to 0$, the exponent reduces the influence of distortion
⇒ $p(t|x)$ does not depend on $x$; this + maximal compression ⇒ single cluster
Fix $x$: as $\beta \to \infty$, most of the conditional probability goes to some $t$ with the smallest distortion
⇒ hard clustering
Solution - Analysis
Solution: $p(t|x) \propto p(t)\,e^{-\beta d(x,t)}$
Varying $\beta$: intermediate values of $\beta$ ⇒ soft clustering with intermediate complexity
Agenda
• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
– Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
– iIB
– dIB
– aIB
• Application
Blahut – Arimoto Algorithm
Input: prior $p(x)$, distortion measure $d(x,t)$, tradeoff parameter $\beta$
Randomly init $p(t)$, then iterate until convergence:
$$p(t|x) \leftarrow \frac{p(t)\,e^{-\beta d(x,t)}}{Z(x,\beta)}, \qquad p(t) \leftarrow \sum_x p(x)\,p(t|x)$$
Optimize a convex function over convex sets ⇒ the minimum is global
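Not part of the original deck: a minimal Python sketch of the two alternating Blahut-Arimoto updates above; the toy points, representatives, and the choice $\beta = 1$ are illustrative assumptions.

```python
import numpy as np

def blahut_arimoto(px, d, beta, n_iter=200, seed=0):
    """Alternating minimization for the rate-distortion Lagrangian.
    px:   prior over X, shape (n,)
    d:    distortion matrix d[x, t], shape (n, m)
    beta: tradeoff parameter (larger -> less distortion, less compression)
    Returns soft assignments p(t|x), shape (n, m), and the marginal p(t)."""
    rng = np.random.default_rng(seed)
    pt = rng.random(d.shape[1])
    pt /= pt.sum()                               # random init of p(t)
    for _ in range(n_iter):
        w = pt[None, :] * np.exp(-beta * d)      # p(t) exp(-beta d(x,t))
        pt_x = w / w.sum(axis=1, keepdims=True)  # normalize by Z(x, beta)
        pt = px @ pt_x                           # p(t) = sum_x p(x) p(t|x)
    return pt_x, pt

# toy example: 5 points on a line, 2 representatives at 0 and 4
x = np.arange(5.0); t = np.array([0.0, 4.0])
d = (x[:, None] - t[None, :]) ** 2               # squared-distance distortion
px = np.full(5, 0.2)
pt_x, pt = blahut_arimoto(px, d, beta=1.0)
print(np.round(pt_x, 3))   # near-hard assignments at this beta
```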
Blahut-Arimoto Algorithm
Advantages:
• Obtains compact clustering of the data with
minimal expected distortion
• Optimal clustering given a fixed set of representatives
Blahut-Arimoto Algorithm
Drawbacks:
• Distortion measure is a part of the problem
setup
– Hard to obtain for some problems
– Equivalent to determining relevant features
• Fixed set of representatives
• Slow convergence
Rate Distortion Theory –
Additional Insights
– Another problem would be to find the optimal representatives given the clustering.
– Joint optimization of clustering and representatives doesn't have a unique solution (as in EM or K-means).
Agenda
• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
– Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
– iIB
– dIB
– aIB
• Application
Information Bottleneck
• Copes with the drawbacks of the Rate Distortion approach
• Compress the data while preserving “important”
(relevant) information
• It is often easier to define what information is
important than to define a distortion measure.
• Replace the distortion upper bound constraint by a
lower bound constraint over the relevant information
Tishby, Pereira & Bialek, 1999
Information Bottleneck - Example
Given:
– Words appearing in documents
– Topics
– Joint prior $p(\text{word}, \text{topic})$
Information Bottleneck - Example
Obtain: a partitioning of words into clusters that is compact (low $I(\text{Word};\text{Cluster})$) yet preserves the relevant information about topics (high $I(\text{Cluster};\text{Topic})$, out of the available $I(\text{Word};\text{Topic})$)
Information Bottleneck - Example
Extreme case 1: all words in a single cluster
$I(\text{Cluster};\text{Topic}) = 0$ ⇒ not informative
$I(\text{Word};\text{Cluster}) = 0$ ⇒ very compact
Information Bottleneck - Example
Extreme case 2: each word is its own cluster
$I(\text{Cluster};\text{Topic}) = \max$ ⇒ very informative
$I(\text{Word};\text{Cluster}) = \max$ ⇒ not compact
⇒ Minimize $I(\text{Word};\text{Cluster})$ & maximize $I(\text{Cluster};\text{Topic})$
Information Bottleneck
[Diagram: words $X$ → clusters $T$ → topics $Y$; $I(T;X)$ measures compactness, $I(T;Y)$ measures relevant information]
Relevance Compression Curve
[Plot: compression $I(T;X)$ vs. relevant information $I(T;Y)$; $D$ is the relevance constraint. The extremes are maximal compression and maximal relevant information.]
Relevance Compression Function
• Let $D$ be the minimal allowed value of the relevant information $I(T;Y)$
• Smaller $D$ means a more relaxed relevant information constraint
⇒ stronger compression levels are attainable
• Given the relevant information constraint $D$, find the most compact model (with smallest $I(T;X)$)
Relevance Compression Function
$$\mathcal{L}[p(t|x)] = \underbrace{I(T;X)}_{\text{Compression Term}} - \beta\,\underbrace{I(T;Y)}_{\text{Relevance Term}}$$
where $\beta$ is the Lagrange multiplier.
Minimize $\mathcal{L}[p(t|x)]$ over $p(t|x)$!
Relevance Compression Curve
[Plot: $I(T;X)$ vs. $I(T;Y)$, from maximal compression to maximal relevant information]
Relevance Compression Function
Minimize $\mathcal{L}[p(t|x)] = I(T;X) - \beta\,I(T;Y)$
Subject to the normalization $\sum_t p(t|x) = 1 \;\;\forall x$
The minimum is attained when
$$p(t|x) = \frac{p(t)\,e^{-\beta D_{KL}[p(y|x)\,\|\,p(y|t)]}}{Z(x,\beta)} \qquad (Z \text{ is the normalization})$$
Solution - Analysis
Solution: $p(t|x) = \dfrac{p(t)\,e^{-\beta D_{KL}[p(y|x)\|p(y|t)]}}{Z(x,\beta)}$
Known: $p(x,y)$ and $\beta$; but $p(t)$ and $p(y|t)$ depend on $p(t|x)$
⇒ The solution is implicit
Solution - Analysis
Solution: $p(t|x) \propto p(t)\,e^{-\beta D_{KL}[p(y|x)\|p(y|t)]}$
• The KL distance emerges as the effective distortion measure from the IB principle
• For a fixed $t$: when $p(y|x)$ is similar to $p(y|t)$, the KL is small ⇒ such points are attached to $t$ with higher probability
• The optimization is also over the cluster representatives $p(y|t)$
Solution - Analysis
Solution: $p(t|x) \propto p(t)\,e^{-\beta D_{KL}[p(y|x)\|p(y|t)]}$
Fix $t$: as $\beta \to 0$, the exponent reduces the influence of the KL
⇒ $p(t|x)$ does not depend on $x$; this + maximal compression ⇒ single cluster
Fix $x$: as $\beta \to \infty$, most of the conditional probability goes to some $t$ with the smallest KL
⇒ hard mapping
Relevance Compression Curve
[Plot: the curve runs from maximal compression ($\beta \to 0$) to maximal relevant information; hard mappings lie at the $\beta \to \infty$ end]
Agenda
• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
– Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
– iIB
– dIB
– aIB
• Application
Iterative Optimization Algorithm (iIB)
• Input: joint prior $p(x,y)$, tradeoff parameter $\beta$, cardinality $|T|$
• Randomly init $p(t|x)$
Pereira, Tishby & Lee, 1993; Tishby, Pereira & Bialek, 2001
Iterative Optimization Algorithm (iIB)
Iterate until convergence:
$$p(t|x) \leftarrow \frac{p(t)\,e^{-\beta D_{KL}[p(y|x)\|p(y|t)]}}{Z(x,\beta)} \quad \text{(cluster | word)}$$
$$p(t) \leftarrow \sum_x p(x)\,p(t|x) \quad \text{(cluster)}$$
$$p(y|t) \leftarrow \frac{1}{p(t)}\sum_x p(x,y)\,p(t|x) \quad \text{(topic | cluster)}$$
Pereira, Tishby & Lee, 1993
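A minimal Python sketch (not from the slides) of these three self-consistent updates; the small constant inside the logarithms is an assumption added to guard against zero probabilities.

```python
import numpy as np

def iib(pxy, n_clusters, beta, n_iter=300, init=None, seed=0):
    """Iterative IB: self-consistent updates for p(t|x), p(t), p(y|t).
    pxy: joint prior p(x, y), shape (n, k); beta: relevance tradeoff."""
    rng = np.random.default_rng(seed)
    n = pxy.shape[0]
    px = pxy.sum(axis=1)                          # p(x)
    py_x = pxy / px[:, None]                      # p(y|x)
    if init is None:
        pt_x = rng.random((n, n_clusters))        # random init of p(t|x)
        pt_x /= pt_x.sum(axis=1, keepdims=True)
    else:
        pt_x = init.copy()                        # warm start (used by dIB)
    for _ in range(n_iter):
        pt = px @ pt_x                                                  # p(t)
        py_t = (pt_x * px[:, None]).T @ py_x / (pt[:, None] + 1e-12)    # p(y|t)
        # KL[p(y|x) || p(y|t)] for every pair (x, t)
        logratio = np.log(py_x[:, None, :] + 1e-12) - np.log(py_t[None, :, :] + 1e-12)
        kl = np.sum(py_x[:, None, :] * logratio, axis=2)
        w = pt[None, :] * np.exp(-beta * kl)
        pt_x = w / w.sum(axis=1, keepdims=True)
    pt = px @ pt_x
    py_t = (pt_x * px[:, None]).T @ py_x / (pt[:, None] + 1e-12)
    return pt_x, pt, py_t
```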
iIB simulation
• Given:
– 300 instances of $X$ with prior $p(x)$
– Binary relevant variable $Y$
– Joint prior $p(x,y)$
– Tradeoff parameter $\beta$
• Obtain:
– Optimal clustering (with minimal $I(T;X)$)
iIB simulation…
[Figure: the $X$ points and their priors]
iIB simulation…
[Figure: given a point $x$, the conditional $p(y|x)$ is given by the color of the point on the map]
iIB simulation…
Single Cluster – Maximal Compression
[Sequence of figures: as $\beta$ increases, the single cluster progressively splits into more clusters]
iIB simulation…
Hard Clustering – Maximal Relevant Information
Iterative Optimization Algorithm (iIB)
Optimize a non-convex functional over 3 convex sets of distributions
⇒ the minimum is only local
• Analogy to K-means or EM
• Different local minima may correspond to a "semantic change" in the clustering solution
Iterative Optimization Algorithm (iIB)
Advantages:
• Defining a relevant variable is often easier and more intuitive than defining a distortion measure
• Guaranteed to converge to a (local) minimum
Iterative Optimization Algorithm (iIB)
Drawbacks:
• Finds only a local minimum (suboptimal solutions)
• Need to specify the parameters $\beta$ and $|T|$
• Slow convergence
• Large data sample is required
Agenda
• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
– Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
– iIB
– dIB
– aIB
• Application
Deterministic Annealing-like algorithm (dIB)
• Iteratively increase the parameter $\beta$, and then adapt the solution from the previous value of $\beta$ to the new one
• Track the changes in the solution as the system shifts its preference from compression to relevance
• Tries to reconstruct the relevance-compression curve
Slonim, Friedman, Tishby, 2002
Deterministic Annealing-like algorithm (dIB)
Solution from the previous step: $p(t|x)$, $p(t)$, $p(y|t)$
Deterministic Annealing-like algorithm (dIB)
Duplicate each cluster and apply a small perturbation to each copy
Deterministic Annealing-like algorithm (dIB)
Apply iIB using the duplicated cluster set as initialization
Deterministic Annealing-like algorithm (dIB)
if the two copies of a cluster are different → leave the split
else → use the old cluster
Deterministic Annealing-like algorithm (dIB)
Illustration
[Figure: which clusters split at which values of $\beta$]
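The duplication-and-split loop above can be sketched in Python, reusing the iib() sketch from the iIB section; the perturbation size eps and the split threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np

def dib(pxy, betas, eps=0.01, split_thresh=0.05, seed=0):
    """Annealing-like IB sketch: for a growing sequence of beta values,
    duplicate and perturb every cluster, warm-start iIB from the copies,
    and keep a split only if the two copies drift apart."""
    rng = np.random.default_rng(seed)
    n = pxy.shape[0]
    pt_x = np.ones((n, 1))                       # start from a single cluster
    for beta in betas:
        # duplicate every cluster and apply a small perturbation to each copy
        init = np.repeat(pt_x, 2, axis=1)
        init *= 1 + eps * rng.standard_normal(init.shape)
        init = np.clip(init, 1e-12, None)
        init /= init.sum(axis=1, keepdims=True)
        # apply iIB using the duplicated cluster set as initialization
        pt_x, pt, py_t = iib(pxy, init.shape[1], beta, init=init)
        keep = []
        for i in range(0, py_t.shape[0], 2):     # copies sit at columns 2i, 2i+1
            if np.abs(py_t[i] - py_t[i + 1]).sum() > split_thresh:
                keep += [i, i + 1]               # genuine split: keep both
            else:
                keep.append(i)                   # no split: keep one copy
        pt_x = pt_x[:, keep]
        pt_x /= pt_x.sum(axis=1, keepdims=True)
    return pt_x
```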
Deterministic Annealing-like algorithm (dIB)
Advantages:
• Finds a local minimum
• Speeds up convergence by adapting the previous solution
Deterministic Annealing-like algorithm (dIB)
Drawbacks:
• Need to specify and tune several parameters:
– perturbation size
– step size for $\beta$ (splits might be "skipped")
– similarity threshold for splitting
– may need to vary the parameters during the process
• Finds local minimum (suboptimal solutions)
• Large data sample is required
Agenda
• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
– Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
– iIB
– dIB
– aIB
• Application
Agglomerative Algorithm (aIB)
• Find a hierarchical clustering tree in a greedy bottom-up fashion
• Results in a different tree for each $\beta$
• Each tree is a range of clustering solutions at different resolutions
[Figure: same $\beta$, different resolutions along one tree]
Slonim & Tishby, 1999
Agglomerative Algorithm (aIB)
Fix $\beta$
Start with $T = X$: each point is its own singleton cluster
Agglomerative Algorithm (aIB)
For each pair $(t_i, t_j)$:
compute the cost $\Delta\mathcal{L}(t_i, t_j)$ of merging them
Merge the pair $t_i$ and $t_j$ that produces the smallest $\Delta\mathcal{L}$
Agglomerative Algorithm (aIB)
Continue merging until a single cluster is left
Agglomerative Algorithm (aIB)
[Figure: the resulting hierarchy of clusters]
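Below is a minimal Python sketch (not from the slides) of aIB in the hard-clustering limit, where the merge cost reduces to the weighted Jensen-Shannon divergence between the two cluster representatives (Slonim & Tishby, 1999).

```python
import numpy as np

def kl(a, b):
    """D_KL[a || b] in nats; assumes b > 0 wherever a > 0."""
    m = a > 0
    return np.sum(a[m] * np.log(a[m] / b[m]))

def merge_cost(p_i, p_j, py_i, py_j):
    """Information lost by merging clusters i and j:
    a prior-weighted Jensen-Shannon divergence of p(y|t_i), p(y|t_j)."""
    py_m = (p_i * py_i + p_j * py_j) / (p_i + p_j)   # p(y | merged cluster)
    return p_i * kl(py_i, py_m) + p_j * kl(py_j, py_m)

def aib(pxy):
    """Greedy bottom-up aIB: start with singleton clusters, repeatedly
    merge the pair losing the least relevant information."""
    px = pxy.sum(axis=1)
    pt = list(px)                                # cluster priors p(t)
    py_t = [row / s for row, s in zip(pxy, px)]  # representatives p(y|t)
    members = [[i] for i in range(len(px))]
    tree = []
    while len(pt) > 1:
        pairs = [(i, j) for i in range(len(pt)) for j in range(i + 1, len(pt))]
        i, j = min(pairs, key=lambda ij: merge_cost(pt[ij[0]], pt[ij[1]],
                                                    py_t[ij[0]], py_t[ij[1]]))
        py_t[i] = (pt[i] * py_t[i] + pt[j] * py_t[j]) / (pt[i] + pt[j])
        pt[i] += pt[j]
        members[i] = members[i] + members[j]
        tree.append(list(members[i]))            # record each merge
        for lst in (pt, py_t, members):
            lst.pop(j)
    return tree
```

Given a joint table pxy, tree records the member set created by each merge, yielding the full range of solutions from $|T| = |X|$ down to a single cluster.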
Agglomerative Algorithm (aIB)
Advantages:
• Non-parametric
• Full hierarchy of clusters for each $\beta$
• Simple
Agglomerative Algorithm (aIB)
Drawbacks:
• Greedy – is not guaranteed to extract even locally
minimal solutions along the tree
• Large data sample is required
Agenda
• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
– Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
– iIB
– dIB
– aIB
• Application
Applications…
Unsupervised Clustering of Images
Modeling assumption: for a fixed image, the colors and their spatial distribution are generated by a mixture of Gaussians in a 5-dimensional color-position space
Shiri Gordon et al., 2003
Applications…
Unsupervised Clustering of Images
Mixture of Gaussians model:
$$p(v) = \sum_j \alpha_j\,\mathcal{N}(v;\,\mu_j, \Sigma_j)$$
Apply the EM procedure to estimate the mixture parameters
Shiri Gordon et al., 2003
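A sketch of this modeling step, assuming scikit-learn's GaussianMixture for the EM estimation and Lab color plus pixel position as the 5 dimensions; both choices are illustrative assumptions rather than details fixed by the slides.

```python
import numpy as np
from sklearn.mixture import GaussianMixture  # assumed available

def image_to_gmm(image_lab, n_components=5):
    """Fit a 5-dim GMM (color + position) to one image, per the
    modeling assumption above. image_lab: (H, W, 3) array of Lab colors."""
    h, w, _ = image_lab.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # stack each pixel as a feature vector (L, a, b, x, y)
    feats = np.column_stack([image_lab.reshape(-1, 3),
                             xs.ravel(), ys.ravel()]).astype(float)
    gmm = GaussianMixture(n_components=n_components, covariance_type='full')
    gmm.fit(feats)        # EM estimation of weights, means, covariances
    return gmm
```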
Applications…
Unsupervised Clustering of Images
• Assume a uniform prior $p(x)$ over images
• Calculate the conditional $p(y|x)$
• Apply the aIB algorithm
Shiri Gordon et al., 2003
Applications…
Unsupervised Clustering of Images
[Figures: resulting image clusters]
Shiri Gordon et al., 2003
Summary
• Rate Distortion Theory
– Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
– iIB
– dIB
– aIB
• Application
Thank you
Blahut-Arimoto algorithm
[Diagram: alternating minimization of the distance between two convex sets of distributions, A and B]
When does it converge to the global minimum?
– When A and B are convex + some requirements on the distance measure
Csiszar & Tusnady, 1984
Blahut-Arimoto algorithm
[Diagram: reformulate the rate-distortion minimization as alternating projections between the convex sets A and B, using a suitable distance measure]
Rate Distortion Theory - Intuition
• $T = X$ (every point is its own representative):
– zero distortion
– not compact
– $I(T;X) = H(X)$
• $T$ = a single representative for all points:
– high distortion
– very compact
– $I(T;X) = 0$
Information Bottleneck - cont'd
• Assume the Markov relation $T \leftrightarrow X \leftrightarrow Y$:
– T is a compressed representation of X, thus independent of Y if X is given: $p(t|x,y) = p(t|x)$
– Information processing inequality: $I(T;Y) \le I(X;Y)$