Learning of Model Parameters using Matrix Variate Gaussian Process

Dalia Chakrabarty¹   Sourabh Bhattacharya²

¹ University of Warwick, Department of Statistics
² Indian Statistical Institute, Bayesian & Interdisciplinary Research Unit

November 30, 2011
A problem that we often face
Estimate model parameters, S, given data D on variable V:

\[ V = f(S) \tag{1} \]

where the function $f : \mathcal{S} \longrightarrow \mathcal{V}$ is unknown, with $S \in \mathcal{S}$ and $V \in \mathcal{V}$.

Thus, the learning of S entails the inverse problem - we need to learn

\[ S = f^{-1}(V)\,|\,D \tag{2} \]

In general, S could be a vector with d components - we need to learn d model parameter values $\{s_1, \ldots, s_d\}$ - and at each S, j values of V are observed, where V is a k-dimensional vector.
Examples - when the underlying functional form is unknown
• learn the required dose of a drug (D ∈ ℝ) using training data on cure rate (obtained from e.g. trials), by inverting the functional relationship E = f(D), where E is the effectiveness of the drug . . . involvement of hyperparameters possible.
• learn the gravitational field of dark matter in galaxies as a function of the motion of stars moving in this field, given (small) measured data sets of one component of the stellar velocity vector.
• learn the price of a house, given training data comprising observations spanning over time, using the relationship between price and the time it takes to sell a house.
• learn 3-D shapes of particles by looking at their 2-D images - direct modelling based on geometrical assumptions possible; modelling using training data possible.
• learn the parameters S ∈ ℝᵈ of relevant features of our Galaxy, using data comprising velocities V of stars that live in the neighbourhood of the Sun, by inverting the function V = f(S).
Supervised learning
Thus, we are discussing learning of model parameters, as supervised by training data - for data $\{s^{(n)}, v_n\}_{n=1}^{N}$, we want to perform regression if $v_i \in \mathbb{R}$, or classification if $v_i \in \{0, 1\}$.

Let f(s) describe the data. Then we want to infer f(·) given the data, i.e. predict the value of the measurement $v_{n+1}$ at a new point $s^{(n+1)}$.
Modelling f(s) - Gaussian process
• Inference of the (generally non-linear) function f(s), given high-dimensional data that comprise training data on variable V.
• In the Bayesian paradigm - place a prior π(f(s)) on the space of functions. The simplest such prior is a Gaussian Process.
• A Gaussian Process (GP) is a Gaussian distribution over a space of functions (of infinite dimensions) . . . it generates functions such that for S ∈ [s₁, s₂], any finite subset of V follows a multivariate Gaussian distribution.
• Like a Gaussian distribution, a GP is fully specified by a mean and a covariance, except that
  1. the mean is a function, μ(s) - often taken as zero.
  2. the covariance is a function, k(s, s′) - the expected covariance between the values of f(·) at s and s′.
Gaussian Process - noise-free assumption

\[ f(s) \sim \mathcal{GP}(\mu(s), k(s, s')) \tag{3} \]

A covariance function must be chosen - the squared exponential is a popular choice:

\[ \mathrm{cov}(f(s), f(s')) = k(s, s') = \sigma^2 \exp\!\left[\frac{-(s - s')^2}{2\ell^2}\right] \tag{4} \]

ℓ parametrises the effect of the separation between s and s′.
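As an illustration (a minimal Python sketch, not the authors' code), the squared-exponential covariance of Eq. (4) can be evaluated on a finite grid of inputs and used to draw sample functions from the zero-mean GP prior of Eq. (3); by the definition above, the function values on any finite grid are jointly multivariate Gaussian:

```python
import numpy as np

def sq_exp_kernel(s, s_prime, sigma=1.0, ell=1.0):
    """Squared-exponential covariance of Eq. (4):
    k(s, s') = sigma^2 exp(-(s - s')^2 / (2 ell^2))."""
    return sigma**2 * np.exp(-(s - s_prime)**2 / (2.0 * ell**2))

# Evaluate the covariance on a finite grid of inputs.
s_grid = np.linspace(0.0, 5.0, 100)
K = sq_exp_kernel(s_grid[:, None], s_grid[None, :], sigma=1.0, ell=0.5)

# Zero mean function mu(s) = 0, as on the previous slide; a small jitter
# keeps K numerically positive definite. Draw three sample paths.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(s_grid)),
                                  K + 1e-8 * np.eye(len(s_grid)), size=3)
```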
Gaussian Process - noise-free assumption

Our interest lies in harnessing the training data $D_s$ to help make a prediction of the model parameter at a new observation (test data $V^{(new)}$).
• How likely is the training data, given the relevant process parameters? I.e. compute $[D_s \mid \phi]$.
• Use this in Bayes' rule to get the posterior probability distribution of the relevant process parameters, conditional on the test data and other process parameters −→ to be used later.
• Construct the augmented data set $D_{aug} = (D_s, D_{test})$ and use the likelihood of $D_{aug}$ given $\phi$ to get the posterior $[s^{(test)}, \phi \mid D_{aug}]$.
• Marginalise over $\phi$ (using the stored posterior of the relevant process parameters) to get the posterior predictive distribution $[s^{(test)} \mid D_{aug}]$ (the Gaussian conditioning underlying these steps is sketched below).
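The conditioning at the heart of this workflow is the standard Gaussian identity: the joint distribution over training and test function values is Gaussian, so the predictive distribution follows in closed form. A minimal noise-free sketch of that identity follows, with a toy kernel and toy data; the scheme on this slide additionally places a prior on, and marginalises over, the process parameters φ, which this sketch omits.

```python
import numpy as np

def sq_exp(a, b, sigma=1.0, ell=0.5):
    """Squared-exponential covariance of Eq. (4)."""
    return sigma**2 * np.exp(-(a - b)**2 / (2.0 * ell**2))

def gp_predict(s_train, v_train, s_test, kernel):
    """Noise-free GP predictive mean and covariance at test inputs:
    condition the joint Gaussian over (training, test) function
    values on the observed training values."""
    K = kernel(s_train[:, None], s_train[None, :])
    K_star = kernel(s_test[:, None], s_train[None, :])
    K_ss = kernel(s_test[:, None], s_test[None, :])
    L = np.linalg.cholesky(K + 1e-8 * np.eye(len(s_train)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, v_train))  # K^{-1} v
    mean = K_star @ alpha
    W = np.linalg.solve(L, K_star.T)
    cov = K_ss - W.T @ W
    return mean, cov

# Toy usage: training pairs (s_n, v_n) and a grid of test inputs.
s_train = np.linspace(0.0, 5.0, 8)
v_train = np.sin(s_train)
mean, cov = gp_predict(s_train, v_train, np.linspace(0.0, 5.0, 50), sq_exp)
```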
Estimation of relevant Milky Way parameters $S \in \mathbb{R}^d$ in general (with d = 2 in Chakrabarty, 2007, 2011), given heliocentric, discrete, stellar velocity data $D_{test} := \{u_i, v_i\}_{i=1}^{N_{tot}}$, using a calibration method in which we compare estimates of the density of the local velocity space: $f_0(U, V)|D_{test}$, obtained from the observed data, and $f_i(U, V)|D_s$, obtained from the $i$th simulated data set, $i = 1, \ldots, N$.

Generate $D_s^{(j)} := \{u_i^{(j)}, v_i^{(j)}\}_{i=1}^{N}$ ∀ $S \in [s_{j-1}, s_j)$, using orbit simulations; $j = 1, \ldots, j_{max}$.
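One way to realise the density comparison is sketched below, assuming simple kernel density estimates and synthetic placeholder velocities (the estimator and comparison statistic actually used in Chakrabarty 2007 may differ):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Placeholder samples standing in for the observed velocities (D_test)
# and for one simulated data set D_s^(j); real data replace these.
rng = np.random.default_rng(1)
u_obs, v_obs = rng.normal(0, 30, 500), rng.normal(-10, 25, 500)
u_sim, v_sim = rng.normal(5, 30, 500), rng.normal(-12, 25, 500)

# Kernel density estimates of f0(U, V)|D_test and fj(U, V)|D_s.
f0 = gaussian_kde(np.vstack([u_obs, v_obs]))
fj = gaussian_kde(np.vstack([u_sim, v_sim]))

# Compare the two density estimates on a common grid of the (U, V) plane.
uu, vv = np.meshgrid(np.linspace(-100, 100, 50), np.linspace(-100, 100, 50))
grid = np.vstack([uu.ravel(), vv.ravel()])
discrepancy = np.mean((f0(grid) - fj(grid))**2)
```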
Figure: Left: The velocities recorded in the $j$th S cell are used to estimate $f_j(U, V)$, overlaid in solid black contour lines over $f_0(U, V)|D$ (in coloured contours). Middle: Distribution of the support in the data D for the null hypothesis that the observed data are drawn from the $j$th simulated phase space density, as the p-value of the test statistic, shown in gray-scale over the ranges of s used in Chakrabarty (2007). Right: Estimated $S_1$ (solar radius), with 90% uncertainties.
Our work - when the target is a vector
• V - a k-dimensional vector.
• j stars have velocities measured, for each s.
• velocity information is V, a j × k matrix:

\[
\mathbf{V} = \begin{pmatrix}
v_{11} & v_{12} & \ldots & v_{1k} \\
v_{21} & v_{22} & \ldots & v_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
v_{j1} & v_{j2} & \ldots & v_{jk}
\end{pmatrix}
\]

• S ∈ ℝᵈ.

v = ξ(s), represented as

\[
\begin{pmatrix}
\eta_{11}(s) & \eta_{12}(s) & \ldots & \eta_{1k}(s) \\
\eta_{21}(s) & \eta_{22}(s) & \ldots & \eta_{2k}(s) \\
\vdots & \vdots & \ddots & \vdots \\
\eta_{j1}(s) & \eta_{j2}(s) & \ldots & \eta_{jk}(s)
\end{pmatrix} \tag{5}
\]

• $\xi^{(j \times k)}(\cdot) = \left(\zeta_1^{(j \times 1)}(\cdot) \,\vdots\, \cdots \,\vdots\, \zeta_k^{(j \times 1)}(\cdot)\right)$, where
• $\zeta_t(\cdot) = (\eta_{1t}(\cdot), \cdots, \eta_{jt}(\cdot))^T$ and
• $\eta_{it}(\cdot)$ is a Gaussian process, $t = 1, \ldots, k$, $i = 1, \ldots, j$; the unknown velocity function is a j × k-variate GP.
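At a single input s, the j × k variate above follows a matrix normal distribution when the covariance separates into a j × j left (row) and a k × k right (column) covariance matrix, as on the next slide. A minimal sketch of drawing such a variate, with placeholder covariances, via the identity vec(X) ~ N(vec(M), V ⊗ U):

```python
import numpy as np

rng = np.random.default_rng(2)
j, k = 5, 3  # j stars, k velocity components

# Placeholder left (row, j x j) and right (column, k x k) covariance
# matrices; in the actual model these are built from the GP kernels.
A = rng.normal(size=(j, j)); U = A @ A.T + j * np.eye(j)
B = rng.normal(size=(k, k)); V = B @ B.T + k * np.eye(k)

# X ~ MN_{j,k}(M, U, V)  iff  vec(X) ~ N(vec(M), V kron U),
# where vec stacks columns (Fortran order).
M = np.zeros((j, k))
x = rng.multivariate_normal(M.flatten(order="F"), np.kron(V, U))
X = x.reshape((j, k), order="F")  # one j x k matrix-normal draw
```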
Our work - matrix variate GP: inversion
• Posterior distributions of some process parameters, given the training data, are computed.
• The likelihood of the augmented data $D_{aug} = (D_s, D_{test})$, given the process parameters, is computed - matrix normal, with left and right covariance matrices written in terms of the process parameters.
• The posterior predictive probability of a new value of S, given $D_{aug}$ and the process parameters, is calculated using a simple non-informative prior on the process parameters.
• The process parameters are integrated out of this posterior - the already-computed posterior distributions of some process parameters are invoked in this calculation.
• The marginalised posterior of $s_{new}$ is sampled from, using MCMC . . . (see the sketch below).
• The 95% highest probability density credible region of the two components of the Milky Way model parameter vector is noted, for each of the 4 dynamical simulations, each performed with a distinct dynamical model.
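The final sampling step can be realised with any standard MCMC scheme. A minimal random-walk Metropolis sketch, with a toy bimodal stand-in for the marginalised posterior $[s_{new} \mid D_{aug}]$ (hypothetical - the real target comes from the matrix-normal likelihood above):

```python
import numpy as np

def log_marg_post(s):
    """Toy stand-in for the log marginalised posterior [s_new | D_aug];
    the real target follows from the matrix-normal likelihood."""
    return np.logaddexp(-0.5 * ((s - 1.8) / 0.05)**2,
                        -0.5 * ((s - 2.1) / 0.08)**2)

# Random-walk Metropolis on the scalar parameter s_new.
rng = np.random.default_rng(3)
s, chain = 2.0, []
for _ in range(20000):
    prop = s + rng.normal(scale=0.05)
    if np.log(rng.uniform()) < log_marg_post(prop) - log_marg_post(s):
        s = prop
    chain.append(s)
samples = np.array(chain[5000:])  # discard burn-in; summarise via HPD regions
```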
Figure: Plots of the posterior probability density (scaled density against radius, in four panels, one per dynamical model) of the unknown model parameter $S_1$ that represents the radial coordinate of the Sun from the "centre" of the Milky Way, given observed stellar velocity data and training (simulated) data obtained by simulating from dynamical models of the Milky Way in which $S_1$ is a variable (along with $S_2$).
Figure: Plots of the posterior probability density (scaled density against azimuth, in four panels, one per dynamical model) of the unknown model parameter $S_2$ that represents the azimuthal coordinate of the Sun from the major axis of the central bar in the Milky Way, given observed stellar velocity data and training (simulated) data obtained by simulating from dynamical models of the Milky Way in which $S_2$ is a variable (along with $S_1$).
Summary of the posterior distribution of the unknown radial location R (≡ $S_1$), using training data simulated from the 4 dynamical models of the Galaxy.

R (simulation units):

Model          Mode    95% HPD                        50% HPD
bar 6          2.20    [2.04, 2.30]                   [2.16, 2.24]
sp3bar 3       1.73    [1.70, 2.26] ∪ [2.27, 2.28]    [1.71, 1.79] ∪ [1.96, 1.97] ∪ [1.99, 2.05] ∪ [2.10, 2.21]
sp3bar 3_18    1.76    [1.70, 2.29]                   [1.72, 1.86] ∪ [1.98, 2.09]
sp3bar 3_25    1.95    [1.70, 2.15]                   [1.86, 1.98]
Summary of the posterior distributions of the unknown azimuthal location Θ for the 4 models.

Θ (degrees):

Model          Mode     95% HPD           50% HPD
bar 6          23.50    [21.20, 25.80]    [22.60, 24.30]
sp3bar 3       18.8     [9.6, 61.5]       [15.10, 22.50] ∪ [23.20, 27.80] ∪ [31.30, 35.50] ∪ [52.00, 57.80]
sp3bar 3_18    32.5     [17.60, 79.90]    [27.9, 49.9]
sp3bar 3_25    37.6     [28.80, 40.40]    [30.70, 31.50] ∪ [36.00, 39.60]
Conclusions
• Supervised learning of high-dimensional model parameters using training data, by imposing a GP as a prior on the unknown function between the measured variable and the unknown parameters, and then inverting that function.
• Gaussian Processes as a tool for Bayesian non-parametric regression.
• Application to the learning of Milky Way model parameters, using high-dimensional training (stellar) data, and prediction of the Milky Way parameters given measured data supplemented by training data (the augmented data).