Introduction | GP | Our problem | Matrix variate GP | Conclusions

Learning of Model Parameters using Matrix Variate Gaussian Process

Dalia Chakrabarty (1), Sourabh Bhattacharya (2)
(1) University of Warwick, Department of Statistics
(2) Indian Statistical Institute, Bayesian & Interdisciplinary Research Unit
November 30, 2011

A problem that we often face

Estimate model parameters, $S$, given data $D$ on variable $V$:

\[ V = f(S) \tag{1} \]

where the unknown function $f : \mathcal{S} \to \mathcal{V}$, with $S \in \mathcal{S}$ and $V \in \mathcal{V}$. Thus, the learning of $S$ entails the inverse problem: we need to learn

\[ S = f^{-1}(V) \,|\, D \tag{2} \]

In general, $S$ could be a vector with $d$ components, so we need to learn $d$ model parameter values $\{s_1, \ldots, s_d\}$; at each $S$, $j$ values of $V$ are observed, where $V$ is a $k$-dimensional vector.

Examples - when the underlying functional form is unknown

• learn the required dose of a drug ($D \in \mathbb{R}$) using training data on cure rate (obtained from e.g. trials) by inverting the functional relationship $E = f(D)$, where $E$ is the effectiveness of the drug ... involvement of hyperparameters possible.
• learn the gravitational field of dark matter in galaxies as a function of the motion of stars moving in this field, given (small) measured data sets of one component of the stellar velocity vector.
• learn the price of a house, given training data comprising observations spanning over time, using the relationship between price and the time it takes to sell a house.
• learn 3-D shapes of particles by looking at their 2-D images: direct modelling based on geometrical assumptions possible; modelling using training data possible.
• learn the parameters $S \in \mathbb{R}^d$ of relevant features of our Galaxy, using data comprising velocity $V$ of stars that live in the neighbourhood of the Sun, by inverting the function $V = f(S)$.

Supervised learning

Thus, we are discussing learning of model parameters, as supervised by training data: for data $\{s^{(n)}, v_n\}_{n=1}^{N}$, we want to perform regression if $v_n \in \mathbb{R}$ or classification if $v_n \in \{0, 1\}$.

Let $f(s)$ describe the data. Then we want to infer $f(\cdot)$ given the data, i.e. predict the value of the measurement $v_{N+1}$ at a new point $s^{(N+1)}$.

Modelling f(s) - Gaussian process

• Inference of the (generally non-linear) function $f(s)$, given high-dimensional data that comprise training data on variable $V$.
• In the Bayesian paradigm: place a prior $\pi(f(s))$ on the space of functions. The simplest such prior is a Gaussian Process.
• A Gaussian Process (GP) is a Gaussian distribution over a space of functions (of infinite dimensions) ... it generates functions such that for $S \in [s_1, s_2]$, any finite set of values of $V$ follows a multivariate Gaussian distribution.
• Like a Gaussian distribution, a GP is fully specified by a mean and a covariance, except:
  1. the mean is a function, $\mu(s)$, often taken as zero;
  2. the covariance is a function, $k(s, s')$: the expected covariance between the values of $f(\cdot)$ at $s$ and $s'$.

Gaussian Process - noise free assumption

\[ f(s) \sim \mathcal{GP}(\mu(s), k(s, s')) \tag{3} \]

Covariance function chosen: the squared exponential is a popular choice,

\[ \mathrm{cov}(f(s), f(s')) = k(s, s') = \sigma^2 \exp\left[ \frac{-(s - s')^2}{2\ell^2} \right] \tag{4} \]

$\ell$ parametrises the effect of the separation between $s$ and $s'$.

Gaussian Process - noise free assumption

Our interest lies in harnessing the training data $D_s$ to help make a prediction of the model parameter at a new observation (test data $V^{(new)}$).

• How likely is the training data given the relevant process parameters, i.e. compute $[D_s | \phi]$.
• Use this in Bayes rule to get the posterior probability distribution of the relevant process parameters, conditional on test data and other process parameters → to be used later.
• Construct the augmented data set $D_{aug} = (D_s, D_{test})$ and use the likelihood of $D_{aug}$ given $\phi$ to get the posterior $[s^{(test)}, \phi | D_{aug}]$.
• Marginalise over $\phi$ (using the stored posterior of the relevant process parameters) to get the posterior predictive distribution $[s^{(test)} | D_{aug}]$.

Estimation of relevant Milky Way parameters $S \in \mathbb{R}^d$ in general (with $d = 2$ in Chakrabarty, 2007, 2011), given heliocentric, discrete, stellar velocity data $D_{test} := \{u_i, v_i\}_{i=1}^{N_{tot}}$, using a calibration method in which we compare estimates of the density of local velocity space: $f_0(U, V) | D_{test}$ obtained from the observed data and $f_i(U, V) | D_s$ obtained from the $i$-th simulated data set, $i = 1, \ldots, N$.

Generate $D_s^{(j)} := \{u_i^{(j)}, v_i^{(j)}\}_{i=1}^{N}$ $\forall\, S \in [s_{j-1}, s_j)$, using orbit simulations; $j = 1, \ldots, j_{max}$.

Figure: Left: The velocities recorded in the $j$-th $S$ cell are used to estimate $f_j(U, V)$ (overlaid in solid black contour lines over $f_0(U, V) | D$, in coloured contours). Middle: Distribution of the support in data $D$ for the null that the observed data are drawn from the $j$-th simulated phase space density, as the p-value of the test statistic, shown in gray-scale over the ranges of $s$ used in Chakrabarty (2007). Right: Estimated $S_1$ (solar radius), with 90% uncertainties.

Our work - when the target is a vector

• $V$ - a $k$-dimensional vector.
• $j$ stars have velocities measured, for each $s$.
• velocity information is $V$, a $j \times k$ matrix.
• $S \in \mathbb{R}^d$.

$v = \xi(s)$, represented as

\[
\begin{pmatrix}
v_{11} & v_{12} & \cdots & v_{1k} \\
v_{21} & v_{22} & \cdots & v_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
v_{j1} & v_{j2} & \cdots & v_{jk}
\end{pmatrix}
=
\begin{pmatrix}
\eta_{11}(s) & \eta_{12}(s) & \cdots & \eta_{1k}(s) \\
\eta_{21}(s) & \eta_{22}(s) & \cdots & \eta_{2k}(s) \\
\vdots & \vdots & \ddots & \vdots \\
\eta_{j1}(s) & \eta_{j2}(s) & \cdots & \eta_{jk}(s)
\end{pmatrix}
\tag{5}
\]

• $\xi^{(j \times k)}(\cdot) = \left( \zeta_1^{(j \times 1)}(\cdot) \,\vdots\, \cdots \,\vdots\, \zeta_k^{(j \times 1)}(\cdot) \right)$, where
• $\zeta_i(\cdot) = (\eta_{i1}(\cdot), \cdots, \eta_{ik}(\cdot))^T$ and
• $\eta_{it}(\cdot)$ is a Gaussian process, $t = 1, \ldots, k$, $i = 1, \ldots, j$; the unknown velocity function is a $j \times k$-variate GP.

Our work - matrix variate GP: inversion

• Posterior distributions of some process parameters, given training data, are computed.
• Likelihood of the augmented data $D_{aug} = (D_s, D_{test})$, given process parameters, is computed: matrix normal, with left and right covariance matrices written in terms of process parameters.
• Posterior predictive probability of a new value of $S$, given $D_{aug}$ and process parameters, is calculated using a simple non-informative prior on process parameters.
• Parameters are integrated out from this posterior: the already computed posterior distributions of some process parameters are invoked in this calculation.
• Marginalised posterior of $s_{new}$ is sampled from, using MCMC ...
• 95% highest probability density credible region of the two components of the Milky Way model parameter noted, for each of 4 dynamical simulations performed with a distinct ...

Figure (four panels, each showing scaled density against radius, 1.7 to 2.3): Plots of posterior probability density of the unknown model parameter $S_1$ that represents the radial coordinate of the Sun from the “centre” of the Milky Way, given observed stellar velocity data and training (simulated) data obtained by simulating from dynamical models of the Milky Way in which $S_1$ is a variable (along with $S_2$).
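
The highest probability density (HPD) credible regions quoted for these posteriors can be read off a density evaluated on a grid. The following is a minimal sketch under stated assumptions, not the procedure used in the talk: it assumes a uniform grid, and the function name `hpd_intervals` and the bimodal toy density are purely illustrative. It keeps the most probable grid cells until the requested mass is reached, then merges neighbouring cells, so a multimodal posterior yields a union of disjoint intervals.

```python
import numpy as np

def hpd_intervals(grid, density, mass=0.95):
    """HPD region of a density tabulated on a uniform grid, as a list of (lo, hi)."""
    grid = np.asarray(grid, dtype=float)
    density = np.asarray(density, dtype=float)
    prob = density / density.sum()            # probability carried by each grid cell
    order = np.argsort(prob)[::-1]            # cells from most to least probable
    cum = np.cumsum(prob[order])
    n_keep = np.searchsorted(cum, mass) + 1   # smallest number of cells reaching `mass`
    inside = np.zeros(grid.size, dtype=bool)
    inside[order[:n_keep]] = True

    # Merge runs of consecutive selected cells into intervals.
    intervals = []
    start = None
    for i, flag in enumerate(inside):
        if flag and start is None:
            start = i
        if start is not None and (not flag or i == grid.size - 1):
            end = i if flag else i - 1
            intervals.append((grid[start], grid[end]))
            start = None
    return intervals

# Illustrative (hypothetical) bimodal density, not data from the talk.
grid = np.linspace(0.0, 10.0, 1001)
density = (np.exp(-0.5 * ((grid - 3.0) / 0.3) ** 2)
           + np.exp(-0.5 * ((grid - 7.0) / 0.3) ** 2))
intervals = hpd_intervals(grid, density, mass=0.95)
print(intervals)
```

With MCMC output, the same function could be applied to a normalised histogram or kernel density estimate of the sampled values.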

Figure (four panels, each showing scaled density against azimuth, 0 to 90): Plots of posterior probability density of the unknown model parameter $S_2$ that represents the azimuthal coordinate of the Sun from the major axis of the central bar in the Milky Way, given observed stellar velocity data and training (simulated) data obtained by simulating from dynamical models of the Milky Way in which $S_2$ is a variable (along with $S_1$).

Summary of the posterior distribution of the unknown radial location $R$ ($\equiv S_1$), using training data simulated from the 4 dynamical models of the Galaxy.

R (simulation units)
Model       | Mode | 95% HPD                     | 50% HPD
bar 6       | 2.20 | [2.04, 2.30]                | [2.16, 2.24]
sp3bar 3    | 1.73 | [1.70, 2.26] ∪ [2.27, 2.28] | [1.71, 1.79] ∪ [1.96, 1.97] ∪ [1.99, 2.05] ∪ [2.10, 2.21]
sp3bar 3_18 | 1.76 | [1.70, 2.29]                | [1.72, 1.86] ∪ [1.98, 2.09]
sp3bar 3_25 | 1.95 | [1.70, 2.15]                | [1.86, 1.98]

Summary of the posterior distributions of the unknown azimuthal location $\Theta$ for the 4 models.

θ (degrees)
Model       | Mode  | 95% HPD        | 50% HPD
bar 6       | 23.50 | [21.20, 25.80] | [22.60, 24.30]
sp3bar 3    | 18.8  | [9.6, 61.5]    | [15.10, 22.50] ∪ [23.20, 27.80] ∪ [31.30, 35.50] ∪ [52.00, 57.80]
sp3bar 3_18 | 32.5  | [17.60, 79.90] | [27.9, 49.9]
sp3bar 3_25 | 37.6  | [28.80, 40.40] | [30.70, 31.50] ∪ [36.00, 39.60]

Conclusions

• Supervised learning of high-dimensional model parameters using training data, by imposing a GP as a prior on the unknown function between the measured variable and the unknown parameters and then inverting such a function.
• Gaussian Processes as a tool for Bayesian non-parametric regression.
• Application to the learning of Milky Way model parameters using high-dimensional training (stellar) data, and prediction of Milky Way parameters given measured data supplemented by training data (augmented data).
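
The idea of placing a GP prior on the unknown function and then inverting it can be illustrated in one dimension. What follows is a minimal sketch under stated assumptions, not the matrix-variate method of the talk: $s$ and $v$ are scalars, the squared-exponential hyperparameters `sigma` and `ell` are fixed at assumed values rather than marginalised, a small nugget `noise` is added for numerical stability, the prior on $s$ is uniform over a grid, and the training function `tanh(s - 1)` is a hypothetical stand-in for simulated data. All function names are illustrative.

```python
import numpy as np

def sq_exp_kernel(a, b, sigma=1.0, ell=1.0):
    """Squared-exponential covariance sigma^2 exp(-(s - s')^2 / (2 ell^2))."""
    d = a[:, None] - b[None, :]
    return sigma**2 * np.exp(-d**2 / (2.0 * ell**2))

def gp_predict(s_train, v_train, s_query, sigma=1.0, ell=1.0, noise=1e-6):
    """Zero-mean GP predictive mean and variance of v at each query point."""
    K = sq_exp_kernel(s_train, s_train, sigma, ell) + noise * np.eye(s_train.size)
    Ks = sq_exp_kernel(s_query, s_train, sigma, ell)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, v_train))  # K^{-1} v
    mean = Ks @ alpha
    w = np.linalg.solve(L, Ks.T)
    var = sigma**2 - np.sum(w**2, axis=0) + noise
    return mean, np.maximum(var, noise)       # guard against round-off

def posterior_over_s(s_train, v_train, v_new, s_grid, **kernel_args):
    """Grid posterior of s given one new measurement v_new, uniform prior on s_grid."""
    mean, var = gp_predict(s_train, v_train, s_grid, **kernel_args)
    log_post = -0.5 * (v_new - mean)**2 / var - 0.5 * np.log(var)
    post = np.exp(log_post - log_post.max())  # subtract max to avoid underflow
    return post / post.sum()

# Hypothetical training data: v = tanh(s - 1) observed at 9 values of s.
s_train = np.linspace(0.0, 2.0, 9)
v_train = np.tanh(s_train - 1.0)
s_grid = np.linspace(0.0, 2.0, 401)

# New measurement generated at s = 1.3; the posterior over s should peak nearby.
v_new = np.tanh(1.3 - 1.0)
post = posterior_over_s(s_train, v_train, v_new, s_grid, sigma=1.0, ell=0.5)
s_hat = s_grid[np.argmax(post)]
print(s_hat)
```

The grid evaluation stands in for the MCMC sampling of the marginalised posterior; in higher dimensions, or with hyperparameters to be integrated out, sampling would be needed instead.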