Gaussian Processes and
Fast Matrix-Vector Multiplies
Iain Murray
Dept. Computer Science, University of Toronto
Work with Joaquin Quiñonero Candela, Carl Edward Rasmussen,
Edward Snelson and Chris Williams
GP regression model
[Figure: two panels illustrating the GP regression model]
f ∼ GP:  f ∼ N(0, Σ),  Σij = k(xi, xj)
y|x ∼ N(f(x), σn²)
GP posterior
[Figure: two panels — draws ∼ p(f|data); mean and error bars]
Standard matrix operations
Infer function at a point x∗:
p(f(x∗)|data) = N(m, s²)
Need covariances:
Kij = k(xi, xj ),
(k∗)i = k(x∗, xi)
Posterior available in closed-form:
M = K + σn² I
m = k∗ᵀ M⁻¹ y
s² = k(x∗, x∗) − k∗ᵀ M⁻¹ k∗
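For concreteness, a minimal NumPy sketch of these closed-form formulas; the kernel function and variable names here are illustrative, not from the talk:

```python
import numpy as np

def gp_predict(X, y, x_star, kernel, sigma_n):
    """Posterior mean m and variance s2 at a single test input x_star."""
    K = kernel(X, X)                           # K_ij = k(x_i, x_j)
    k_star = kernel(X, x_star[None, :])[:, 0]  # (k_*)_i = k(x_*, x_i)
    M = K + sigma_n**2 * np.eye(len(y))        # M = K + sigma_n^2 I
    # Solve with a Cholesky factor rather than forming M^{-1} explicitly.
    L = np.linalg.cholesky(M)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))          # M^{-1} y
    v = np.linalg.solve(L, k_star)
    m = k_star @ alpha                                            # k_*^T M^{-1} y
    s2 = kernel(x_star[None, :], x_star[None, :])[0, 0] - v @ v  # k(x_*,x_*) - k_*^T M^{-1} k_*
    return m, s2
```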
Learning (hyper-)parameters
k(xi, xj) = exp(−½ |xi − xj|²/ℓ²)
[Three example fits: ℓ = 0.1, σn = 0.01;  ℓ = 0.5, σn = 0.05;  ℓ = 1.5, σn = 0.15]
(Marginal) likelihood:
log p(y|X, ℓ, σn) = −½ y⊤M⁻¹y − ½ log|M| − (n/2) log 2π
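A hedged NumPy sketch of this log marginal likelihood, using the squared-exponential kernel defined above and reusing a Cholesky factor so log|M| comes from the diagonal of L (function and variable names are mine, for illustration only):

```python
import numpy as np

def gp_log_marginal_likelihood(X, y, lengthscale, sigma_n):
    sqdist = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    K = np.exp(-0.5 * sqdist / lengthscale**2)          # squared-exponential kernel
    M = K + sigma_n**2 * np.eye(len(y))
    L = np.linalg.cholesky(M)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # M^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))                 # = 0.5 * log|M|
            - 0.5 * len(y) * np.log(2 * np.pi))
```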
Exploding costs
GPs scale poorly with large datasets
O(n³) computation usually takes the blame:
M⁻¹ or M⁻¹y, M⁻¹k∗ and det(M)
Not the only story:
Kij = k(xi, xj )
O(dn²) computation
O(n²) memory
Large literature on GP approximations
Exploding costs
20,000 points, GBs of RAM, ∼10¹² floating-point operations
The “SoD Approximation” [1]
Trivial, obvious solution:
randomly throw away most of the data
[Figure: two panels illustrating the subset-of-data approximation]
e.g. keeping 1/20 points
[1] Rasmussen and Williams (2006)
Local Regression
[Figure: kernel value and output y against input x, centred on a test point x∗]
kd-trees for > 1 dimensions
Moore et al. (2007) put the data in a kd-tree
Set k(x∗, xi) equal for many xi in a common node
So far:
— GP review and scaling problems
— Simple alternatives
Next:
— Numerical methods for full GP
Iterative methods
Alternatives to straightforward O(n³) operations:
— Conjugate Gradients (CG), e.g., Mark Gibbs's thesis (1997)
— “Block Gauss–Seidel”, e.g., Li et al. (ICML 2007)
— Randomized approaches, e.g., Liberty et al. (PNAS 2007)
— ...
Matrix-vector multiplies dominate the cost, O(n²) each
Example: CG finds α = M⁻¹y by iterative optimization:
α = argmax_z F(z),  F(z) = y⊤z − ½ z⊤Mz
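A minimal SciPy sketch of this iterative route (names and the SciPy solver are my choices, not the talk's implementation): each CG iteration only touches M through a matrix-vector product Mv, so a fast approximate MVM could be dropped in for the lambda below.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def cg_solve(K, y, sigma_n, maxiter=200):
    n = len(y)
    mvm = lambda v: K @ v + sigma_n**2 * v        # M v = (K + sigma_n^2 I) v
    M_op = LinearOperator((n, n), matvec=mvm)
    alpha, info = cg(M_op, y, maxiter=maxiter)    # info == 0 on convergence
    return alpha
```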
Comparing iterative methods
Focus has been on mean prediction:
Taken from “Large-scale RLSC learning without agony”, Li et al. (2007)
Comparing iterative methods
Training on 16,384 datapoints from SARCOS
[Plot: relative residual vs. iteration number for CG, CG init, GS cluster, GS]
Comparing iterative methods
Test error progression
[Two plots: SMSE vs. iteration number]
So far:
— GP review and scaling problems
— Simple alternatives
— Numerical methods for full GP
Next:
— Fast MVMs intended to speed up GPs
Accelerating MVMs
— CG originally for sparse systems, MVMs in less than O(n²)
— GP kernel matrix is often not sparse
— Zeros in the covariance ⇒ marginal independence
— Short length scales usually don’t match my beliefs
— Empirically, I often learn lengthscales ≈ 1
[Figure: output y and kernel value against input x, around a test point x∗]
Fast, approximate MVMs
— MVMs and similar ops needed in many algorithms
— Fast MVMs involving kernels are actively researched
— Alternatives to CG also need MVMs
Simplest idea:
Give nearby points equal kernel values:
(Kα)j = Σi k(xj, xi) αi  ≈  ΣC k(xj, ⟨xi⟩C) Σi∈C αi
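A rough sketch of this merging idea, not the kd-tree code used in the talk's experiments: points are grouped into clusters (the labels could come from kd-tree leaves), and each cluster contributes one kernel evaluation at its centroid, weighted by the summed α's. Names are assumptions for illustration.

```python
import numpy as np

def merged_mvm(X, alpha, kernel, labels):
    """Approximate K @ alpha using one centroid per cluster label."""
    clusters = np.unique(labels)
    centroids = np.stack([X[labels == c].mean(axis=0) for c in clusters])
    alpha_sums = np.array([alpha[labels == c].sum() for c in clusters])
    # (K alpha)_j ~= sum_C k(x_j, <x>_C) * sum_{i in C} alpha_i
    K_jc = kernel(X, centroids)        # n x (number of clusters)
    return K_jc @ alpha_sums
```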
Simple kd-trees and GPs
Merging GP kernels in kd-trees doesn’t work.
Example: a single test-time MVM for 2-D synthetic data:
[Two plots vs. lengthscale ℓ: time /s and mean abs error, for the full GP and the kd-tree method]
Might better code, another recursion scheme, or another tree type work?
The merging idea is flawed
[Plot: data, full GP mean, merge method, subset of data]
α = M⁻¹y,   m∗ = Σi k(x∗, xi) αi
Improve test time by grouping the sum into pairs:
k(x∗, xi)αi + k(x∗, xi+1)αi+1 ≈ (αi + αi+1) k(x∗, (xi + xi+1)/2)
(I)FGT expansions
— Gaussian kernel only
— series-expand the MVM into terms involving single points
— Aim: m×n MVM, O(mn) → O(m+n)
— Only works in low dimensions¹
[Two plots vs. lengthscale ℓ: time /s and mean abs error, for the full GP, kd-tree, FGT and IFGT]
¹ Unless lengthscales are huge
Do we believe the GP anyway?
— Real data is nasty: thresholding, jumps, clumps, kinks
— Some sparse GP “approximations” introduce flexibility
— Local GPs are sometimes surprisingly good
— Mixtures of GPs
Summary
• Fast MVMs are required to leverage iterative methods
• Tree-based merging methods fail on simple problems
This isn’t acknowledged in the literature
• The IFGT is fast and accurate in low dimensions only
• Is it worth using so much data?
Is the model flexible enough to justify it?
Extra Slides
Folk Theorem
“When you have computational problems, often
there’s a problem with your model.”
e.g. Andrew Gelman:
http://www.stat.columbia.edu/~cook/movabletype/archives/2008/05/the_folk_theore.html
Inducing point methods
Approximate the GP prior:
p(f, f∗) = ∫ p(f, f∗|u) p(u) du
         ≈ q(f, f∗) = ∫ q(f|u) q(f∗|u) p(u) du
Several methods result from choosing different q’s:
SoR/DIC, PLV/PP/DTC, FI(T)C/SPGP and BCM
Quiñonero-Candela and Rasmussen (2005)
SoR/DIC, a finite linear model
q(f|u) deterministic:  u ∼ N(0, Kuu),  set f∗ = k∗ᵀ Kuu⁻¹ u
Draws from the prior:
[Plot: draws from the SoR prior]
Costs (m inducing u’s):
O(m²n) training
O(mn) covariances
O(m) mean prediction
O(m²) error bars
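A small sketch of drawing from this SoR prior, just to make the finite linear model explicit: draw u ∼ N(0, Kuu) at m inducing inputs and set f(x) = k(x, Xu) Kuu⁻¹ u. Anything outside the span of the m basis functions is lost. Names and the jitter term are assumptions, not from the talk.

```python
import numpy as np

def sor_prior_draw(X_star, X_u, kernel, jitter=1e-8):
    K_uu = kernel(X_u, X_u) + jitter * np.eye(len(X_u))
    u = np.linalg.cholesky(K_uu) @ np.random.randn(len(X_u))  # u ~ N(0, Kuu)
    K_su = kernel(X_star, X_u)
    return K_su @ np.linalg.solve(K_uu, u)                    # k_*^T Kuu^{-1} u
```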
Limited fitting power
Those SoR prior draws again:
[Plot: the SoR prior draws shown over a wider input range]
FIC / SPGP
q(f|u) = ∏i pGP(fi|u)
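A sketch of the approximate prior covariance this q implies, following Quiñonero-Candela and Rasmussen (2005): the low-rank Qff = Kfu Kuu⁻¹ Kuf with the exact marginal variances restored on the diagonal. The code names are my assumptions.

```python
import numpy as np

def fic_prior_cov(X, X_u, kernel, jitter=1e-8):
    K_ff = kernel(X, X)
    K_uu = kernel(X_u, X_u) + jitter * np.eye(len(X_u))
    K_fu = kernel(X, X_u)
    Q_ff = K_fu @ np.linalg.solve(K_uu, K_fu.T)
    return Q_ff + np.diag(np.diag(K_ff - Q_ff))   # FIC: Qff + diag(Kff - Qff)
```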
O(·) costs are the same as SoR
[If time or at end: discuss low noise behaviours]
Training and Test times
Training time: before looking at the test set
Method    Covariance   Inversion
Full GP   O(n²)        O(n³)
SoD       O(m²)        O(m³)
Sparse    O(mn)        O(mn²)
Test time: spent making predictions
Method    Mean   Error bar
Full GP   O(n)   O(n²)
SoD       O(m)   O(m²)
Sparse    O(m)   O(m²)
Test NLP vs. Training time
[Two plots: mean negative log probability vs. training time /s — 4D synthetic data; 21D real robot arm data]
Test NLP vs. Test time
[Two plots: mean negative log probability vs. test time per 10,000 test cases /s — 4D synthetic data; 21D real robot arm data. Methods: sod.thrs, dtc.thrs, dtc.thgs, dtc.lhgs, fitc.thrs, fitc.thgs, fitc.lhgs]
Summary
• When training time dominates: SoD can be best(!); FIC wins when good hyper-parameters need to be learned
• When test time dominates: FIC is best in practice and in theory (but see Ed’s thesis for updates)
Conjugate Gradients
Another way of saying α = M⁻¹y:
α = argmax_z F(z),  F(z) = y⊤z − ½ z⊤Mz
Each iteration:
— picks a new direction and optimizes along a line
— a matrix-vector multiply dominates the cost, O(n²)
Provable error tolerances are tricky, but have been studied
Numerical instability and accumulation of errors are hard to analyse
Conjugate Gradients results
(Trying hard to get a speedup)
Synthetic data: D = 4, σn = 0.1
[Two plots: SMSE vs. time /s for Chol. and CG — ℓ = 1; ℓ = 0.1]
Conjugate Gradients results
Real robot arm data: D = 21:
[Three plots: SMSE vs. time /s for Chol. and CG — plain; preconditioned with FIC; low-memory]
Some timings
Example training times for fixed hyper-parameters on
SARCOS. All times are in seconds on a 2.8 GHz AMD
Opteron with Matlab 7.4.
Subset size   Setup time   Chol. solving time   Total time
4096          3.1          6.5                  9.6
8192          9.0          55.2                 64.2
16384         28.9         375.8                404.7