Relational Learning via Collective Matrix Factorization (SIGKDD 2008)
A Bayesian Matrix Factorization Model for Relational Data (UAI 2010)
Authors: Ajit P. Singh & Geoffrey J. Gordon
Presenter: Xian Xing Zhang

Basic ideas
• Collective matrix factorization (CMF) is proposed for relational learning when an entity participates in multiple relations.
• Several matrices (with different types of support) are factored simultaneously with shared parameters.
• CMF is extended to a hierarchical Bayesian model to enhance the sharing of statistical strength.

An example of application
• Functional Magnetic Resonance Imaging (fMRI):
– fMRI data can be viewed as a real-valued relation, Response(stimulus, voxel) ∈ [0, 1].
– Stimulus side-information is a binary relation, Co-occurs(word, stimulus) ∈ {0, 1}, which records whether the stimulus word co-occurs with other commonly used words in a large corpus.
– The goal is to predict unobserved values of the Response relation.

Basic model description
• In the fMRI example, the Co-occurs relation is an m×n matrix X; the Response relation is an n×r matrix Y.
• Each matrix has its own likelihood: Co-occurs (p_X) is modeled by a Bernoulli distribution, and Response (p_Y) is modeled by a Gaussian.

Hierarchical Collective Matrix Factorization
• Information between entities can only be shared indirectly, through another factor: e.g., in f(UV'), two distinct rows of U are correlated only through V.
• The hierarchical prior acts as a shrinkage estimator for the rows of a factor, pooling information indirectly, through Θ.

Bayesian Inference
• Hessian Metropolis-Hastings (HMH):
– Random-walk Metropolis-Hastings samples from a Gaussian proposal distribution whose mean is the sample at time t, F_i(t), and whose covariance matrix must be tuned by hand, which is problematic.
– HMH uses both the gradient and the Hessian to automatically construct a proposal distribution at each sampling step. This is claimed as the main technical contribution of the UAI 2010 paper.
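
To make the HMH idea concrete, here is a minimal sketch of one Metropolis-Hastings step whose Gaussian proposal is built from the gradient and Hessian of the log posterior. It is an illustration under stated assumptions, not the paper's exact construction: the target is one row u of U under the Bernoulli (Co-occurs) likelihood with a plain zero-mean Gaussian prior of precision `lam` instead of the hierarchical prior, the proposal mean is a Newton step, and the proposal covariance is the inverse negative Hessian. All function names are hypothetical.

```python
import numpy as np

def log_post(u, V, x, lam):
    """Unnormalised log posterior of one factor row u.
    Bernoulli likelihood with logits V @ u, Gaussian prior N(0, I / lam) (an assumption)."""
    theta = V @ u
    return float(x @ theta - np.sum(np.logaddexp(0.0, theta)) - 0.5 * lam * (u @ u))

def grad_hess(u, V, x, lam):
    """Gradient and Hessian of log_post with respect to u."""
    theta = V @ u
    p = 1.0 / (1.0 + np.exp(-theta))            # Bernoulli means
    grad = V.T @ (x - p) - lam * u
    hess = -(V.T * (p * (1.0 - p))) @ V - lam * np.eye(u.size)
    return grad, hess

def proposal(u, V, x, lam):
    """Gaussian proposal centred on a Newton step from u,
    with covariance equal to the inverse negative Hessian at u."""
    grad, hess = grad_hess(u, V, x, lam)
    cov = np.linalg.inv(-hess)
    cov = 0.5 * (cov + cov.T)                   # remove round-off asymmetry
    return u + cov @ grad, cov

def log_gauss(z, mean, cov):
    """Log density of N(mean, cov) evaluated at z."""
    d = z - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = d @ np.linalg.solve(cov, d)
    return float(-0.5 * (quad + logdet + d.size * np.log(2.0 * np.pi)))

def hmh_step(u, V, x, lam, rng):
    """One Metropolis-Hastings step with a gradient/Hessian-based proposal."""
    mean_f, cov_f = proposal(u, V, x, lam)
    cand = mean_f + np.linalg.cholesky(cov_f) @ rng.standard_normal(u.size)
    # The proposal depends on the current point, so it is asymmetric and the
    # reverse proposal density must enter the acceptance ratio.
    mean_b, cov_b = proposal(cand, V, x, lam)
    log_alpha = (log_post(cand, V, x, lam) - log_post(u, V, x, lam)
                 + log_gauss(u, mean_b, cov_b) - log_gauss(cand, mean_f, cov_f))
    if np.log(rng.uniform()) < log_alpha:
        return cand, True
    return u, False

def run_chain(V, x, lam=1.0, n_steps=200, seed=0):
    """Run a short chain from u = 0; returns the samples and the acceptance rate."""
    rng = np.random.default_rng(seed)
    u = np.zeros(V.shape[1])
    samples, accepted = [], 0
    for _ in range(n_steps):
        u, ok = hmh_step(u, V, x, lam, rng)
        samples.append(u)
        accepted += ok
    return np.array(samples), accepted / n_steps
```

Unlike random-walk Metropolis-Hastings, this step has no hand-set proposal covariance: the local curvature supplies it at every iteration.
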

Related work

Experiment setting
• The Co-occurs(word, stimulus) relation is collected by measuring whether or not the stimulus word occurs within five tokens of a word in the Google Tera-word corpus.
• Hold-out prediction:
• Fold-in prediction (to predict a new row in Y)

Experiment results

Discussions
• Existing methods force one to choose between ignoring parameter uncertainty or making Gaussianity assumptions.
• Non-Gaussian response types significantly improve predictive accuracy.
• While non-Gaussianity complicates the construction of proposal distributions for Metropolis-Hastings, it does have a significant impact on predictive accuracy.
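
The two prediction settings from the experiment slide can be sketched as follows. This is a point-estimate illustration only, not the sampling-based procedure of the UAI 2010 paper. It assumes that hold-out prediction means hiding a random subset of entries of Y, and that fold-in estimates the factor row of a new stimulus from its Co-occurs column alone, then predicts its row of Y as a Gaussian mean. The factors U (words) and Z (voxels), the ridge weight `lam`, and all function names are hypothetical.

```python
import numpy as np

def holdout_mask(shape, frac, rng):
    """Boolean mask over Y; True marks entries hidden during training."""
    return rng.random(shape) < frac

def fold_in_factor(U, x_new, lam=1.0, n_iter=25):
    """Estimate the factor row v of a new stimulus from its binary Co-occurs
    column x_new, using Newton iterations on a ridge-penalised Bernoulli
    log-likelihood with logits U @ v."""
    v = np.zeros(U.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(U @ v)))
        grad = U.T @ (x_new - p) - lam * v
        hess = -(U.T * (p * (1.0 - p))) @ U - lam * np.eye(v.size)
        v = v - np.linalg.solve(hess, grad)     # Newton update
    return v

def predict_response(v, Z):
    """Predicted row of Y for one stimulus: the Gaussian mean Z @ v."""
    return Z @ v
```

In this sketch the new stimulus has no observed Response values at all; everything known about it comes through the shared factor, which is the point of factoring X and Y collectively.
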