Introduction to Graphical Models for Data Mining
Arindam Banerjee
banerjee@cs.umn.edu
Dept of Computer Science & Engineering
University of Minnesota, Twin Cities
16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
July 25, 2010
Introduction
• Graphical Models
– Brief Overview
• Part I: Tree Structured Graphical Models
– Exact Inference
• Part II: Mixed Membership Models
– Latent Dirichlet Allocation
– Generalizations, Applications
• Part III: Graphical Models for Matrix Analysis
– Probabilistic Matrix Factorization
– Probabilistic Co-clustering
– Stochastic Block Models
Graphical Models: What and Why
• Statistical Data Analysis
– Build diagnostic/predictive models from data
– Uncertainty quantification based on (minimal) assumptions
• The I.I.D. assumption
– Data is independently and identically distributed
– Example: Words in a doc drawn i.i.d. from the dictionary
• Graphical models
– Assume (graphical) dependencies between (random) variables
– Closer to reality, domain knowledge can be captured
– Learning/inference is much more difficult
Flavors of Graphical Models
• Basic nomenclature
– Node = random variable, maybe observed/hidden
– Edge = statistical dependency
• Two popular flavors: ‘Directed’ and ‘Undirected’
• Directed Graphs
– A directed graph between random variables, capturing causal dependencies
– Example: Bayesian networks, Hidden Markov Models
– Joint distribution is a product of P(child | parents)
• Undirected Graphs
– An undirected graph between random variables
– Example: Markov/Conditional random fields
– Joint distribution in terms of potential functions
(Figure: example graphs over nodes X1, ..., X5.)
Bayesian Networks
(Figure: a five-node Bayesian network over X1, ..., X5.)
• Joint distribution in terms of P(X | Parents(X)):
P(X1, ..., X5) = ∏_i P(Xi | Parents(Xi))
Example I: Burglary Network
This and several other examples are from the Russell-Norvig AI book
Computing Probabilities of Events
• Probability of any event can be computed:
P(B,E,A,J,M) = P(B) P(E|B) P(A|B,E) P(J|B,E,A) P(M|B,E,A,J)
= P(B) P(E) P(A|B,E) P(J|A) P(M|A)
• Example:
P(b, ¬e, a, ¬j, m) = P(b) P(¬e) P(a|b,¬e) P(¬j|a) P(m|a)
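As a quick check of this factorization, here is a minimal sketch in Python; the CPT values are the standard ones from the Russell-Norvig burglary example, so treat them as assumptions if your edition differs.

    # Joint probability in the burglary network via the factored form.
    P_B = 0.001                                        # P(burglary)
    P_E = 0.002                                        # P(earthquake)
    P_A = {(True, True): 0.95, (True, False): 0.94,    # P(alarm | B, E)
           (False, True): 0.29, (False, False): 0.001}
    P_J = {True: 0.90, False: 0.05}                    # P(JohnCalls | A)
    P_M = {True: 0.70, False: 0.01}                    # P(MaryCalls | A)

    def joint(b, e, a, j, m):
        """P(B=b, E=e, A=a, J=j, M=m) as a product of local conditionals."""
        p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
        p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
        p *= (P_J[a] if j else 1 - P_J[a]) * (P_M[a] if m else 1 - P_M[a])
        return p

    # P(b, ¬e, a, ¬j, m) = 0.001 * 0.998 * 0.94 * 0.10 * 0.70 ≈ 6.57e-05
    print(joint(True, False, True, False, True))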
Example II: Rain Network
Example III: “Car Won’t Start” Diagnosis
Inference
• Some variables in the Bayes net are observed
– the evidence/data, e.g., John has not called, Mary has called
• Inference
– How to compute value/probability of other variables
– Example: What is the probability of Burglary, i.e., P(b|¬j,m)
Inference Algorithms
• Graphs without loops: Tree-structured Graphs
– Efficient exact inference algorithms are possible
– Sum-product algorithm, and its special cases
• Belief propagation in Bayes nets
• Forward-Backward algorithm in Hidden Markov Models (HMMs)
• Graphs with loops
– Junction tree algorithms
• Convert into a graph without loops
• May lead to exponentially large graph
– Sum-product/message passing algorithm, ‘disregarding loops’
• Active research topic, correct convergence ‘not guaranteed’
• Works well in practice
– Approximate inference
Approximate Inference
• Variational Inference
– Deterministic approximation
– Approximate complex true distribution over latent variables
– Replace with family of simple/tractable distributions
• Use the best approximation in the family
– Examples: Mean-field, Bethe, Kikuchi, Expectation Propagation
• Stochastic Inference
– Simple sampling approaches
– Markov Chain Monte Carlo methods (MCMC)
• Powerful family of methods
– Gibbs sampling
• Useful special case of MCMC methods
Part I: Tree Structured Graphical Models
• The Inference Problem
• Factor Graphs and the Sum-Product Algorithm
• Example: Hidden Markov Models
• Generalizations
The Inference Problem
Complexity of Naïve Inference
Bayes Nets to Factor Graphs
Factor Graphs: Product of Local Functions
Marginalize Product of Functions (MPF)
• Marginalize product of functions
• Computing marginal functions
• The “not-sum” notation
MPF using Distributive Law
• We focus on two examples: g1(x1) and g3(x3)
• Main Idea: Distributive law
ab + ac = a(b+c)
• For g1(x1) and for g3(x3), the distributive law lets each sum be pushed inside the product, so every variable is summed out as early as possible (the resulting expressions are shown on the slide)
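A minimal sketch of this idea in Python, with made-up factors fA and fB over binary variables: the factored computation gives the same marginal as the brute-force sum, with fewer multiplications.

    import itertools

    # g(x1) = sum over x2, x3 of fA(x1, x2) * fB(x2, x3), with toy factors.
    fA = {(x1, x2): 1.0 + x1 + 2 * x2 for x1 in (0, 1) for x2 in (0, 1)}
    fB = {(x2, x3): 2.0 + x2 * x3 for x2 in (0, 1) for x3 in (0, 1)}

    # Brute force: sum the full product over all (x2, x3) assignments.
    g_brute = {x1: sum(fA[x1, x2] * fB[x2, x3]
                       for x2, x3 in itertools.product((0, 1), repeat=2))
               for x1 in (0, 1)}

    # Distributive law: push the sum over x3 inside, as in ab + ac = a(b + c).
    g_fact = {x1: sum(fA[x1, x2] * sum(fB[x2, x3] for x3 in (0, 1))
                      for x2 in (0, 1))
              for x1 in (0, 1)}

    assert g_brute == g_fact   # same marginals, fewer operations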
Computing Single Marginals
• Main Idea:
– Target node becomes the root
– Pass messages from leaves up to the root
Message Passing
At a variable node, compute the product of the messages from its descendants
At a factor node, compute the product of f with the messages from its descendants,
then do a not-sum over the parent
Example: Computing g1(x1)
Example: Computing g3(x3)
Hidden Markov Models (HMMs)
Latent variables: z_0, z_1, ..., z_{t-1}, z_t, z_{t+1}, ..., z_T
Observed variables: x_1, ..., x_{t-1}, x_t, x_{t+1}, ..., x_T
Inference Problems:
1. Compute p(x_{1:T})
2. Compute p(z_t | x_{1:T})
3. Find max over z_{1:T} of p(z_{1:T} | x_{1:T})
Similar problems arise for chain-structured Conditional Random Fields (CRFs)
The Sum-Product Algorithm
• To compute gi(xi), form a tree rooted at xi
• Starting from the leaves, apply the following two rules
– Product Rule:
At a variable node, take the product of descendants
– Sum-product Rule:
At a factor node, take the product of f with descendants;
then perform not-sum over the parent node
• To compute all marginals
– Can be done one at a time; repeated computations, not efficient
– Simultaneous message passing following the sum-product algorithm
– Examples: Belief Propagation, Forward-Backward algorithm, etc.
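The two rules are easiest to see on a tiny example. Below is a minimal sketch on a chain-structured factor graph x1 -- f12 -- x2 -- f23 -- x3 with made-up potentials; the marginal of the root x2 is the product of its incoming messages.

    import numpy as np

    rng = np.random.default_rng(0)
    f12 = rng.random((2, 2))     # factor over (x1, x2), 2 states per variable
    f23 = rng.random((2, 2))     # factor over (x2, x3)

    # Leaf-to-root messages, with x2 as the root.
    m_x1_to_f12 = np.ones(2)                 # product rule at leaf variable x1
    m_f12_to_x2 = f12.T @ m_x1_to_f12        # sum-product rule: not-sum over x1
    m_x3_to_f23 = np.ones(2)
    m_f23_to_x2 = f23 @ m_x3_to_f23          # not-sum over x3

    g2 = m_f12_to_x2 * m_f23_to_x2           # product of incoming messages
    g2 /= g2.sum()                           # normalize to a distribution

    # Sanity check against brute-force marginalization of the full product.
    joint = np.einsum('ab,bc->abc', f12, f23)
    brute = joint.sum(axis=(0, 2))
    assert np.allclose(g2, brute / brute.sum())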
Sum-Product Updates
Example: Step 1
Example: Step 2
Example: Step 3
Example: Step 4
Example: Step 5
Example: Termination
HMMs Revisited
Latent variables: z_0, z_1, ..., z_{t-1}, z_t, z_{t+1}, ..., z_T
Observed variables: x_1, ..., x_{t-1}, x_t, x_{t+1}, ..., x_T
Inference Problems:
1. Compute p(x_{1:T})
2. Compute p(z_t | x_{1:T})
The sum-product algorithm here is known as the 'forward-backward' algorithm; the same computation gives smoothing in Kalman filtering.
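A minimal forward-backward sketch for a discrete HMM, assuming the usual parameterization (A = transition matrix, B = emission matrix, pi = initial distribution); the toy parameters at the bottom are made up, and there is no scaling, so this only suits short sequences.

    import numpy as np

    def forward_backward(obs, A, B, pi):
        T, K = len(obs), A.shape[0]
        alpha = np.zeros((T, K))
        beta = np.zeros((T, K))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):                  # forward pass up the chain
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):         # backward pass
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        likelihood = alpha[-1].sum()           # p(x_{1:T})
        gamma = alpha * beta / likelihood      # row t is p(z_t | x_{1:T})
        return likelihood, gamma

    A = np.array([[0.7, 0.3], [0.4, 0.6]])
    B = np.array([[0.9, 0.1], [0.2, 0.8]])
    pi = np.array([0.5, 0.5])
    lik, gamma = forward_backward([0, 1, 0], A, B, pi)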
Distributive Law on Semi-Rings
• Idea can be applied to any commutative semi-ring
• Semi-ring 101
– Two operations (+,×): Associative, Commutative, Identity
– Distributive law: a×b + a×c = a×(b+c)
• Belief Propagation in Bayes nets
• MAP inference in HMMs
• Max-product algorithm
• Alternative to Viterbi Decoding
• Kalman Filtering
• Error Correcting Codes
• Turbo Codes
• …
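For instance, replacing (sum, product) by (max, product) in the same recursion gives the max-product algorithm, which on an HMM is Viterbi decoding. A sketch, reusing the toy HMM notation from the forward-backward example and working in log space, where the semi-ring becomes (max, +):

    import numpy as np

    def viterbi(obs, A, B, pi):
        T, K = len(obs), A.shape[0]
        delta = np.log(pi) + np.log(B[:, obs[0]])
        back = np.zeros((T, K), dtype=int)
        for t in range(1, T):
            scores = delta[:, None] + np.log(A)   # scores[i, j]: best path ending i -> j
            back[t] = scores.argmax(axis=0)       # best predecessor for each state j
            delta = scores.max(axis=0) + np.log(B[:, obs[t]])
        path = [int(delta.argmax())]
        for t in range(T - 1, 0, -1):             # trace the argmaxes backwards
            path.append(int(back[t][path[-1]]))
        return path[::-1]                         # most likely state sequence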
Message Passing in General Graphs
• Tree structured graphs
– Message passing is guaranteed to give correct solutions
– Examples: HMMs, Kalman Filters
• General Graphs
– Active research topic
• Progress has been made in the past 10 years
– Message passing
• May not converge
• May converge to a local minimum of the 'Bethe variational free energy'
– New approaches to convergent and correct message passing
• Applications
– TrueSkill: ranking system for Xbox Live
– Turbo Codes: 3G/4G phones, satellite communication, WiMAX, Mars orbiter
Part II: Mixed Membership Models
• Mixture Models vs Mixed Membership Models
• Latent Dirichlet Allocation
• Inference
– Mean-Field and Collapsed Variational Inference
– MCMC/Gibbs Sampling
• Applications
• Generalizations
Background: Plate Diagrams
(Figure: a node a with three children b1, b2, b3, and the equivalent plate notation with a single node b inside a plate labeled 3.)
Compact representation of large Bayesian networks
Model 1: Independent Features
(Plate diagram: each feature x is drawn independently given its parameter θ, with plates over the n data points and d features; the figure shows the d = 3, n = 1 case with one distribution per feature.)
Model 2: Naïve Bayes (Mixture Models)
(Plate diagrams: Naïve Bayes adds a per-point cluster indicator z ~ Discrete(π); the d features x are then drawn from the θ of the chosen cluster, with plates over n data points and k clusters.)
Naïve Bayes Model
(Figure: one distribution per mixture component; a data point is generated by first picking a single component z and then drawing all of its features from that component.)
Naïve Bayes Model
(Figure: a second data point generated the same way, again from a single mixture component.)
Model 3: Mixed Membership Model
(Plate diagrams: the mixed membership model draws a separate mixing vector π ~ Dirichlet(α) for each data point; each feature then gets its own assignment z ~ Discrete(π) and value x from the corresponding θ, with plates over d features, n data points, and k components.)
Mixed Membership Models
(Figure: under mixed membership, different features of the same data point can be drawn from different components.)
Mixed Membership Models
(Figure: a second data point, drawn with its own mixture over the same components.)
Mixture Model vs Mixed Membership Model
Mixture model: each data point is generated entirely by a single component (single component membership).
Mixed membership model: different features of the same data point can be generated by different components (multi-component mixed membership).
(Figure: sample points under each model.)
Latent Dirichlet Allocation (LDA)
α: Dirichlet prior
π(d) ~ Dirichlet(α): distribution over topics for each document
β(j), j = 1, ..., K: distribution over words for each topic
z_i ~ Discrete(π(d)): topic assignment for each word
x_i ~ Discrete(β(z_i)): word generated from the assigned topic
(Plate diagram: N_d words per document, D documents.)
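A sketch of this generative process in Python; K, V, D, N_d, and the hyperparameter values are made up, and the topic-word distributions β are simply sampled here rather than treated as fixed parameters.

    import numpy as np

    rng = np.random.default_rng(0)
    K, V, D, Nd, alpha = 3, 20, 5, 50, 0.1

    beta = rng.dirichlet(np.ones(V) * 0.05, size=K)    # topic-word distributions
    docs = []
    for d in range(D):
        pi_d = rng.dirichlet(np.ones(K) * alpha)       # topic mix for document d
        z = rng.choice(K, size=Nd, p=pi_d)             # topic assignment per word
        words = [rng.choice(V, p=beta[k]) for k in z]  # word from assigned topic
        docs.append(words)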
(Figure: unrolled graphical models. A mixture model has a single z ~ Discrete(π) generating x1, x2, x3; LDA instead draws π ~ Dirichlet(α) per document and a separate topic assignment z_i for every word x_i, with the word distributions β shared across documents.)
LDA Generative Model
(Figure: document1 and document2 generated as different mixtures of the same topics.)
Learning: Inference and Estimation
Variational Inference
Variational EM for LDA
E-step: Variational Distribution and Updates
M-step: Parameter Estimation
Results: Topics Inferred
Results: Perplexity Comparison
Aviation Safety Reports (NASA)
Results: NASA Reports I
Topic "Arrival/Departure": runway, approach, departure, altitude, turn, tower, air traffic control, heading, taxi way, flight
Topic "Passenger": passenger, attendant, flight, seat, medical, captain, attendants, lavatory, told, police
Topic "Maintenance": maintenance, engine, mel, zzz, air craft, installed, check, inspection, fuel, work
Results: NASA Reports II
Topic "Medical Emergency": medical, passenger, doctor, attendant, oxygen, emergency, paramedics, flight, nurse, aed
Topic "Wheel Maintenance": tire, wheel, assembly, nut, spacer, main, axle, bolt, missing, tires
Topic "Weather Condition": knots, turbulence, aircraft, degrees, ice, winds, wind, speed, air speed, conditions
Topic "Departure": departure, sid, dme, altitude, climbing, mean sea level, heading, procedure, turn, degree
Two-Dimensional Visualization for Reports
The pilot flies an owner's airplane with the owner as a passenger, and loses contact with the center during the flight.
While a sky-diving operation is in progress, a jet approaches at the same altitude, but an accident is avoided.
(Legend: Red = Flight Crew, Blue = Passenger, Green = Maintenance)
Two-Dimensional Visualization for Reports
The altimeter has a problem, but the pilot overcomes the difficulty during the flight.
During acceleration, a flap-retraction issue occurs; the pilot returns to base and lands, and the mechanic finds the problem.
(Legend: Red = Flight Crew, Blue = Passenger, Green = Maintenance)
Two-Dimensional Visualization for Reports
The captain has a medical emergency.
The pilot has a landing-gear problem; the maintenance crew joins the radio conversation to help.
(Legend: Red = Flight Crew, Blue = Passenger, Green = Maintenance)
Mixed Membership of Reports
Report 1: Flight Crew 0.7039, Passenger 0.0009, Maintenance 0.2953
Report 2: Flight Crew 0.2563, Passenger 0.6599, Maintenance 0.0837
Report 3: Flight Crew 0.1405, Passenger 0.0663, Maintenance 0.7932
Report 4: Flight Crew 0.0013, Passenger 0.0013, Maintenance 0.9973
(Legend: Red = Flight Crew, Blue = Passenger, Green = Maintenance)
Smoothed Latent Dirichlet Allocation
α: Dirichlet prior
π(d) ~ Dirichlet(α): distribution over topics for each document
φ(j) ~ Dirichlet(β): distribution over words for each topic
z_i ~ Discrete(π(d)): topic assignment for each word
x_i ~ Discrete(φ(z_i)): word generated from the assigned topic
(Plate diagram: N_d words per document, D documents, T topics.)
Stochastic Inference using Markov Chains
• Powerful family of approximate inference methods
– Markov Chain Monte Carlo, Gibbs Sampling
• The basic idea
– Need to marginalize over a complex latent variable distribution
p(x|θ) = ∫_z p(x,z|θ) = ∫_z p(x|θ) p(z|x,θ) = E_{z~p(z|x,θ)}[p(x|θ)]
– Draw 'independent' samples from p(z|x,θ)
– Compute a sample-based average instead of the full integral
• Main Issue: How to draw samples?
– Difficult to directly draw samples from p(z|x,θ)
– Construct a Markov chain whose stationary distribution is p(z|x,θ)
– Run the chain till 'convergence'
– Obtain samples from p(z|x,θ)
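The next two slides present Metropolis-Hastings; since their content is not in the transcript, here is a minimal random-walk Metropolis-Hastings sketch, where the log-density is a stand-in for log p(z|x,θ).

    import numpy as np

    def metropolis_hastings(log_target, z0, n_samples, step=0.5, seed=0):
        rng = np.random.default_rng(seed)
        z, lp = z0, log_target(z0)
        samples = []
        for _ in range(n_samples):
            z_new = z + step * rng.standard_normal()   # symmetric proposal
            lp_new = log_target(z_new)
            if np.log(rng.random()) < lp_new - lp:     # accept w.p. min(1, ratio)
                z, lp = z_new, lp_new
            samples.append(z)                          # keep current state either way
        return np.array(samples)

    # Toy target: a standard normal; the sample mean should approach 0.
    samples = metropolis_hastings(lambda z: -0.5 * z * z, z0=0.0, n_samples=5000)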
The Metropolis-Hastings Algorithm
The Metropolis-Hastings Algorithm (Contd)
The Gibbs Sampler
Collapsed Gibbs Sampling for LDA
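The slide content is a figure; as a stand-in, here is a sketch of the standard collapsed Gibbs update for LDA (Griffiths & Steyvers, cited in the references), with π and β integrated out. docs, K, V, alpha, beta are assumed to be defined as in the earlier generative sketch.

    import numpy as np

    def collapsed_gibbs_lda(docs, K, V, alpha, beta, n_iters=200, seed=0):
        rng = np.random.default_rng(seed)
        D = len(docs)
        ndk = np.zeros((D, K)); nkw = np.zeros((K, V)); nk = np.zeros(K)
        z = []
        for d, words in enumerate(docs):               # random initialization
            zd = rng.integers(K, size=len(words))
            z.append(zd)
            for w, k in zip(words, zd):
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
        for _ in range(n_iters):
            for d, words in enumerate(docs):
                for i, w in enumerate(words):
                    k = z[d][i]                        # remove this token's counts
                    ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                    # p(z_i = k | z_-i, x) ∝ (ndk + alpha)(nkw + beta)/(nk + V*beta)
                    p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                    k = rng.choice(K, p=p / p.sum())   # resample the topic
                    z[d][i] = k
                    ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
        return z, ndk, nkw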
Collapsed Variational Inference for LDA
Collapsed Variational Inference for LDA
Results: Comparison of Inference Methods
Results: Comparison of Inference Methods
Generalizations
• Generalized Topic Models
– Correlated Topic Models
– Dynamic Topic Models, Topics over Time
– Dynamic Topics with birth/death
• Mixed membership models over non-text data, applications
– Mixed membership naïve-Bayes
– Discriminative models for classification
– Cluster Ensembles
• Nonparametric Priors
– Dirichlet Process priors: Infer number of topics
– Hierarchical Dirichlet processes: Infer hierarchical structures
– Several other priors: Pachinko allocation, Gaussian Processes, IBP, etc.
CTM Results
DTM Results
DTM Results II
Mixed Membership Naïve Bayes
• For each data point,
– Choose π ~ Dirichlet(α)
• For each observed feature fn:
– Choose a class zn ~ Discrete (π)
– Choose a feature value xn from p(xn|zn,fn,Θ), which could be Gaussian,
Poisson, Bernoulli…
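A sketch of this generative process with Gaussian features; the number of classes k, the number of features d, and all parameter values are made up.

    import numpy as np

    rng = np.random.default_rng(0)
    k, d, alpha = 3, 4, 0.5
    mu = rng.normal(size=(k, d))       # per-class, per-feature Gaussian means
    sigma = np.ones((k, d))            # per-class, per-feature std deviations

    def sample_point():
        pi = rng.dirichlet(np.ones(k) * alpha)   # mixed membership for this point
        z = rng.choice(k, size=d, p=pi)          # one class per feature
        x = rng.normal(mu[z, np.arange(d)], sigma[z, np.arange(d)])
        return x, z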
MMNB vs NB: Perplexity Surfaces
(Figure: perplexity surfaces for NB and MMNB on three datasets.)
• MMNB typically achieves a lower perplexity than NB
• On the test set, NB shows overfitting, but MMNB is stable and robust
Discriminative Mixed Membership Models
Results: DLDA for text classification
Fast DLDA generally achieves higher accuracy on most of the datasets
Topics from DLDA
Topic 1: cabin, descent, pressurization, emergency, flight, aircraft, pressure, oxygen, atc, masks
Topic 2: flight, hours, time, crew, day, duty, rest, trip, zzz, minutes
Topic 3: ice, aircraft, flight, wing, captain, icing, engine, anti, time, maintenance
Topic 4: aircraft, gate, ramp, wing, taxi, stop, ground, parking, area, line
Topic 5: flight, smoke, cabin, passenger, aircraft, captain, cockpit, attendant, smell, emergency
Cluster Ensembles
• Combining multiple base clusterings of a dataset
(Figure: three base clusterings of the same dataset.)
• Robust and stable
• Distributed and scalable
• Knowledge reuse, privacy preserving
Problem Formulation
• Input & Output
(Figure: data points with several base clusterings as input and a consensus clustering as output.)
Results: State-of-the-art vs Bayesian Ensembles
Part III: Graphical Models for Matrix Analysis
• Probabilistic Matrix Factorizations
• Probabilistic Co-clustering
• Stochastic Block Structures
Matrix Factorization
• Singular value decomposition
(Figure: X ≈ U Σ V^T.)
• Problems
– Large matrices, with millions of rows/columns
• SVD can be rather slow
– Sparse matrices, most entries are missing
• Traditional approaches cannot handle missing entries
Matrix Factorization: “Funk SVD”
X̂_ij = u_i^T v_j (row i of U times column j of V^T)
error = (X_ij − X̂_ij)^2 = (X_ij − u_i^T v_j)^2
• Model X ∈ R^{n×m} as U V^T, where
– U ∈ R^{n×k}, V ∈ R^{m×k}
– Alternately optimize U and V
Matrix Factorization (Contd)
X̂_ij = u_i^T v_j, error = (X_ij − X̂_ij)^2
• Gradient descent updates, both driven by the residual X_ij − X̂_ij:
u_ik(t+1) = u_ik(t) + η (X_ij − X̂_ij) v_jk(t)
v_jk(t+1) = v_jk(t) + η (X_ij − X̂_ij) u_ik(t)
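A sketch of these stochastic gradient updates over the observed entries; the (i, j, rating) triples are made up, and the L2 regularizer lam is an addition (Funk's write-up also regularizes, but treat the exact form here as an assumption).

    import numpy as np

    def sgd_mf(triples, n, m, k=10, eta=0.01, lam=0.02, n_epochs=20, seed=0):
        rng = np.random.default_rng(seed)
        U = 0.1 * rng.standard_normal((n, k))
        V = 0.1 * rng.standard_normal((m, k))
        for _ in range(n_epochs):
            for i, j, x in triples:        # visit only the observed entries
                err = x - U[i] @ V[j]      # X_ij - u_i^T v_j
                U[i] += eta * (err * V[j] - lam * U[i])
                V[j] += eta * (err * U[i] - lam * V[j])
                # note: V[j] sees the already-updated U[i];
                # caching U[i] before the update is also common
        return U, V

    # Toy usage: a 3 x 3 matrix with five observed entries.
    triples = [(0, 0, 5.0), (0, 1, 3.0), (1, 1, 4.0), (2, 0, 1.0), (2, 2, 2.0)]
    U, V = sgd_mf(triples, n=3, m=3, k=2)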
Probabilistic Matrix Factorization (PMF)
u_i ~ N(0, σ_u² I)
v_j ~ N(0, σ_v² I)
X_ij ~ N(u_i^T v_j, σ²)
Inference using gradient descent (MAP estimation)
Bayesian Probabilistic Matrix Factorization
Gaussian-Wishart hyperpriors:
µ_u ~ N(µ₀, Λ_u), Λ_u ~ W(ν₀, W₀)
µ_v ~ N(µ₀, Λ_v), Λ_v ~ W(ν₀, W₀)
u_i ~ N(µ_u, Λ_u)
v_j ~ N(µ_v, Λ_v)
X_ij ~ N(u_i^T v_j, σ²)
Inference using MCMC
Results: PMF on the Netflix Dataset
Results: Bayesian PMF on Netflix
Co-clustering: Gene Expression Analysis
(Figure: the original and the co-clustered gene expression matrix.)
Co-clustering and Matrix Approximation
Probabilistic Co-clustering
(Figure: the matrix with its rows and columns grouped into row clusters and column clusters.)
Generative Process
• Assume a mixed membership for each row and column
• Assume a Gaussian for each co-cluster
1. Pick row/column clusters
2. Generate each entry of the matrix
Reduction to Mixture Models
Generative Process
• Assume a mixed membership for each row and column
• Assume a Gaussian for each co-cluster
1. Pick row/column clusters
2. Generate each entry of the matrix
Bayesian Co-clustering (BCC)
• A Dirichlet distribution over all possible mixed memberships
Learning: Inference and Estimation
• Learning
– Estimate model parameters (α₁, α₂, θ)
– Infer ‘mixed memberships’ of individual rows and columns
• Expectation Maximization
• Issues
– Posterior probability cannot be obtained in closed form
– Parameter estimation cannot be done directly
• Approach: Approximate inference
– Variational Inference
– Collapsed Gibbs Sampling, Collapsed Variational Inference
Variational EM
• Introduce a variational distribution q to approximate the true posterior
• Use Jensen's inequality to get a tractable lower bound
• Maximize the lower bound w.r.t. the variational parameters
– Equivalently, minimize the KL divergence between q and the true posterior
• Maximize the lower bound w.r.t. the model parameters
Variational Distribution
• The variational distribution has one set of parameters for each row and for each column
Collapsed Inference
• The latent mixed-membership distributions (π₁, π₂) can be exactly marginalized out
– Obtain p(X, z₁, z₂ | α₁, α₂, θ) in closed form
– Analysis assumes discrete/categorical entries
– Can be generalized to exponential family distributions
• Collapsed Gibbs Sampling
– Conditional distribution of (z₁uv, z₂uv) in closed form:
P(z₁uv = i, z₂uv = j | X, z₁₋uv, z₂₋uv, α₁, α₂, θ)
– Sample states, run the sampler till convergence
• Collapsed Variational Bayes
– Variational distribution q(z₁, z₂ | γ) = ∏_{u,v} q(z₁uv, z₂uv | γ_uv)
– Gaussian and Taylor approximations to obtain updates for γ_uv
Residual Bayesian Co-clustering (RBC)
• (z₁, z₂) determines the distribution of each entry
• (m₁, m₂): row/column means
• Users/movies may have bias
• (b_{m₁}, b_{m₂}): row/column bias
Results: Datasets
• Movielens: Movie recommendation data
– 100,000 ratings (1-5) for 1682 movies from 943 users (6.3%)
– Binarize: 0 (ratings 1-3), 1 (ratings 4-5)
– Discrete (original), Bernoulli (binary), Real (z-scored)
• Foodmart: Transaction data
– 164,558 sales records for 7803 customers and 1559 products (1.35%)
– Binarize: 0 (less than median), 1 (higher than median)
– Poisson (original), Bernoulli (binary), Real (z-scored)
• Jester: Joke rating data
– 100,000 ratings (-10.00,+10.00) for 100 jokes from 1000 users (100%)
– Binarize: 0 (lower than 0), 1 (higher than 0)
– Gaussian (original), Bernoulli (binary), Real (z-scored)
Perplexity Comparison with 10 Clusters
On Binary Data:

Training Set     MMNB      BCC       LDA
Jester           1.7883    1.8186    98.3742
Movielens        1.6994    1.9831    439.6361
Foodmart         1.8691    1.9545    1461.7463

Test Set         MMNB      BCC       LDA
Jester           4.0237    2.5498    98.9964
Movielens        3.9320    2.8620    1557.0032
Foodmart         6.4751    2.1143    6542.9920

On Original Data:

Training Set     MMNB      BCC
Jester           15.4620   18.2495
Movielens        3.1495    0.8068
Foodmart         4.5901    4.5938

Test Set         MMNB      BCC
Jester           39.9395   24.8239
Movielens        38.2377   1.0265
Foodmart         4.6681    4.5964
Co-embedding: Users
Co-embedding: Movies
RBC vs. other co-clustering algorithms
• RBC and RBC-FF perform better than BCC
• RBC and RBC-FF are also the best among the other co-clustering algorithms
(Figure: results on Jester.)
RBC vs. other co-clustering algorithms
(Figure: results on Movielens and Foodmart.)
RBC vs. SVD, NNMF, and CORR
• RBC and RBC-FF are competitive with the other algorithms
(Figure: results on Jester.)
RBC vs. SVD, NNMF and CORR
(Figure: results on Movielens and Foodmart.)
SVD vs. Parallel RBC
Parallel RBC scales well to large matrices
Inference Methods: VB, CVB, Gibbs
Mixed Membership Stochastic Block Models
• Network data analysis
– Relational View: Rows and Columns are the same entity
– Example: Social networks, Biological networks
– Graph View: (Binary) adjacency matrix
• Model: see the graphical model on the next slide
MMB Graphical Model
Variational Inference
• Variational lower bound
• Fully factorized variational distribution
• Variational EM
– E-step: Update variational parameters (γ, φ)
– M-step: Update model parameters (α, B)
Results: Inferring Communities
Original friendship matrix
Friendships inferred from the posterior, based on thresholding π_p^T B π_q and φ_p^T B φ_q respectively
Results: Protein Interaction Analysis
“Ground truth”: MIPS collection of protein interactions (yellow diamond)
Comparison with other models based on protein interactions and
microarray expression analysis
Non-parametric Bayes
Dirichlet Process Mixtures
Gaussian Processes
Hierarchical Dirichlet Processes
Chinese Restaurant Processes
Pitman-Yor Processes
Mondrian Processes
Indian Buffet Processes
References: Graphical Models
• S. Russell & P. Norvig, Artificial Intelligence: A Modern Approach,
Prentice Hall, 2009.
• D. Koller & N. Friedman, Probabilistic Graphical Models: Principles and
Techniques, MIT Press, 2009.
• C. Bishop, Pattern Recognition and Machine Learning, Springer, 2007.
• D. Barber, Bayesian Reasoning and Machine Learning, Cambridge
University Press, 2010.
• M. I. Jordan (Ed), Learning in Graphical Models, MIT Press, 1998.
• S. L. Lauritzen, Graphical Models, Oxford University Press, 1996.
• J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of
Plausible Inference, Morgan Kaufmann, 1988.
References: Inference
• F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, “Factor graphs and the
sum-product algorithm,” IEEE Transactions on Information Theory, vol. 47,
no. 2, 498–519, 2001.
• S. M. Aji and R. J. McEliece, “The generalized distributive law,” IEEE
Transactions on Information Theory, 46, 325–343, 2000.
• M. J. Wainwright and M. I. Jordan, “Graphical models, exponential
families, and variational inference,” Foundations and Trends in Machine
Learning, vol. 1, no. 1-2, 1-305, December 2008.
• C. Andrieu, N. De Freitas, A. Doucet, M. I. Jordan, “An Introduction to
MCMC for Machine Learning,” Machine Learning, 50, 5-43, 2003.
• J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Constructing free energy
approximations and generalized belief propagation algorithms,” IEEE
Transactions on Information Theory, vol. 51, no. 7, pp. 2282–2312, 2005.
References: Mixed-Membership Models
S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. “Indexing by
latent semantic analysis,” Journal of the Society for Information Science,
41(6):391–407, 1990.
T. Hofmann, “Unsupervised learning by probabilistic latent semantic analysis,”
Machine Learning, 42(1):177–196, 2001.
D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of
Machine Learning Research (JMLR), 3:993–1022, 2003.
T. L. Griffiths and M. Steyvers, “Finding scientific topics,” Proceedings of the
National Academy of Sciences, 101(Suppl 1): 5228–5235, 2004.
Y. W. Teh, D. Newman, and M. Welling. “A collapsed variational Bayesian
inference algorithm for latent Dirichlet allocation, ” Neural Information Processing
Systems (NIPS), 2007.
A. Asuncion, P. Smyth, M. Welling, Y.W. Teh, “On Smoothing and Inference for
Topic Models,” Uncertainty in Artificial Intelligence (UAI), 2009.
H. Shan, A. Banerjee, and N. Oza, “Discriminative Mixed-membership Models,”
IEEE Conference on Data Mining (ICDM), 2009.
References: Matrix Factorization
S. Funk, “Netflix update: Try this at home,”
http://sifter.org/~simon/journal/20061211.html
R. Salakhutdinov and A. Mnih. “Probabilistic matrix factorization,” Neural
Information Processing Systems (NIPS), 2008.
R. Salakhutdinov and A. Mnih. “Bayesian probabilistic matrix factorization using
Markov chain Monte Carlo,” International Conference on Machine Learning
(ICML), 2008.
I. Porteous, A. Asuncion, and M. Welling, “Bayesian matrix factorization with side
information and Dirichlet process mixtures,” Conference on Artificial Intelligence
(AAAI), 2010.
I. Sutskever, R. Salakhutdinov, and J. Tenenbaum. “Modelling relational data using
Bayesian clustered tensor factorization,” Neural Information Processing Systems
(NIPS), 2009.
A. Singh and G. Gordon, “A Bayesian matrix factorization model for relational
data,” Uncertainty in Artificial Intelligence (UAI), 2010.
References: Co-clustering, Block Structures
A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, D. Modha., “A Generalized
Maximum Entropy Approach to Bregman Co-clustering and Matrix
Approximation,” Journal of Machine Learning Research (JMLR), 2007.
M. M. Shafiei and E. E. Milios, “Latent Dirichlet Co-Clustering,” IEEE Conference
on Data Mining (ICDM), 2006.
H. Shan and A. Banerjee, “Bayesian co-clustering,” IEEE International Conference
on Data Mining (ICDM), 2008.
P. Wang, C. Domeniconi, and K. B. Laskey, “Latent Dirichlet Bayesian Co-Clustering,” European Conference on Machine Learning and Principles and
Practice of Knowledge Discovery in Databases (ECML/PKDD), 2009.
H. Shan and A. Banerjee, “Residual Bayesian Co-clustering for Matrix
Approximation,” SIAM International Conference on Data Mining (SDM), 2010.
T. A. B. Snijders and K. Nowicki, “Estimation and prediction for stochastic
blockmodels for graphs with latent block structure,” Journal of Classification,
14:75–100, 1997.
E.M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing, “Mixed-membership
stochastic blockmodels,” Journal of Machine Learning Research (JMLR), 9:1981–2014, 2008.
Acknowledgements
Hanhuai Shan
Amrudin Agovic