Physics-Informed Gaussian Process Regression
Generalizes Linear PDE Solvers
arXiv:2212.12474v1 [cs.LG] 23 Dec 2022
Marvin Pförtner¹ (marvin.pfoertner@uni-tuebingen.de)
Ingo Steinwart² (ingo.steinwart@mathematik.uni-stuttgart.de)
Philipp Hennig¹ (philipp.hennig@uni-tuebingen.de)
Jonathan Wenger¹ (jonathan.wenger@uni-tuebingen.de)
¹ University of Tübingen, Tübingen AI Center
² University of Stuttgart
Abstract
Linear partial differential equations (PDEs) are an important, widely applied class of mechanistic models, describing physical processes such as heat transfer, electromagnetism, and
wave propagation. In practice, specialized numerical methods based on discretization are
used to solve PDEs. They generally use an estimate of the unknown model parameters
and, if available, physical measurements for initialization. Such solvers are often embedded into larger scientific models or analyses with a downstream application such that error
quantification plays a key role. However, by entirely ignoring parameter and measurement
uncertainty, classical PDE solvers may fail to produce consistent estimates of their inherent approximation error. In this work, we approach this problem in a principled fashion
by interpreting solving linear PDEs as physics-informed Gaussian process (GP) regression.
Our framework is based on a key generalization of a widely-applied theorem for conditioning GPs on a finite number of direct observations to observations made via an arbitrary
bounded linear operator. Crucially, this probabilistic viewpoint allows us to (1) quantify the
inherent discretization error; (2) propagate uncertainty about the model parameters to the
solution; and (3) condition on noisy measurements. Demonstrating the strength of this
formulation, we prove that it strictly generalizes methods of weighted residuals, a central
class of PDE solvers including collocation, finite volume, pseudospectral, and (generalized)
Galerkin methods such as finite element and spectral methods. This class can thus be
directly equipped with a structured error estimate and the capability to incorporate uncertain model parameters and observations. In summary, our results enable the seamless
integration of mechanistic models as modular building blocks into probabilistic models by
blurring the boundaries between numerical analysis and Bayesian inference.
Keywords: physics-informed machine learning, probabilistic numerics, partial differential
equations, Galerkin methods, Gaussian processes, bounded linear operator equations
1. Introduction
Partial differential equations (PDEs) are powerful mechanistic models of static and dynamic
systems with continuous spatial interactions (Borthwick, 2018). They are widely used in the
natural sciences, especially in physics, and in applied fields like engineering, medicine and
finance. Linear PDEs form a subclass describing physical phenomena such as heat diffusion (Fourier, 1822), electromagnetism (Maxwell, 1865) and continuum mechanics (Lautrup,
2005). Additionally, they are used in applications as diverse as computer graphics (Kazhdan
et al., 2006), medical imaging (Holder, 2005), or option pricing (Black and Scholes, 1973).
©2022 Marvin Pförtner, Ingo Steinwart, Philipp Hennig and Jonathan Wenger.
License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/.
Scientific inference with PDEs Given a mechanistic model of a (physical) system in
the form of a linear PDE D [u] = f , where D is a linear differential operator mapping
between vector spaces of functions, the system can be simulated by solving the PDE subject
to a set of linear boundary conditions (BC), given by a linear operator B and a function g
defined on the boundary of the domain, s.t. B [u] = g (Evans, 2010). For instance, given
all material parameters and heat sources involved, a PDE can describe the temperature
distribution in an electronic component, while the boundary conditions describe the heat
flux out of the component at the surface. Since hardly any practically relevant PDE can be
solved analytically (Borthwick, 2018), in practice, specialized numerical methods relying on
discretization are employed. Often such solvers are embedded into larger scientific models,
where model parameters are inferred from measurement and downstream analyses depend
on the resulting simulation. For example, we would like to model whether said electronic
component hits critical temperature thresholds during operation to assess its longevity.
Challenges when solving PDEs When performing scientific inference with PDEs via
numerical simulation, one is faced with three fundamental challenges.
(C1) Limited computation. Any numerically computed solution û ≈ u suffers from approximation error. In practice, a sufficiently accurate simulation often requires vast
amounts of computational resources.
(C2) Partially-known physics. While the underlying physical mechanism is encoded in the
formulation of the PDE, in practice, its exact parameters and boundary conditions are
often unknown. For example, the position and strength of heat sources f within the
aforementioned electric component are only approximately known. Similarly, material
parameters like thermal conductivity, which define D, can often only be estimated.
Finally, the initial or boundary conditions B [u] = g are also only partially known. For
example, it is often unclear how much heat an electrical component dissipates via its surface.
(C3) Error propagation. Limited computation and partially-known physics inevitably introduce error into the simulation. This resulting bias can fundamentally alter conclusions
drawn from downstream analysis steps, in particular if these are sensitive to input
variability. For example, an electronic component may be deemed safe based on the
simulation, although its true internal temperature hits safety-critical levels repeatedly.
Solving PDEs as a learning problem The challenges of scientific inference with PDEs
are fundamentally issues of partial information. Here, we interpret solving a PDE as a
learning problem, specifically as physics-informed regression, in the spirit of probabilistic
numerics (Hennig et al., 2015; Cockayne et al., 2019b; Oates and Sullivan, 2019; Owhadi
et al., 2019; Hennig et al., 2022). By leveraging the tools of Bayesian inference, we can tackle
the challenges (C1) to (C3). As illustrated in Figure 1(a), we model the solution of the PDE
with a Gaussian process, which we condition on observations of the boundary conditions,
the PDE itself and any physical measurements:
• Encoding prior knowledge. We can efficiently leverage any available computation by encoding inductive bias about the solution of the PDE. For example, we can identify the
solution space by “partial derivative counting”. Moreover, since PDEs typically model
physical systems, expert knowledge is often available. This includes known physical
properties of the system such as symmetries, as well as more subjective estimates from
previous experience with similar systems or computationally cheap approximations.
• Conditioning on the boundary conditions. The linear boundary conditions can be interpreted as measurements of the solution of the PDE on the boundary. By conditioning
on (some of) these measurements, we are not limited to satisfying the boundary conditions exactly, but can directly model uncertain constraints without having to resort
to point estimates. Instead, we propagate the uncertainty to the solution estimate.
This also allows us to handle cases where we do not have a functional form g of the
constraints, but only a discrete set of constraints at boundary points.
• Conditioning on the PDE. Conditioning a probability measure over the solution on
the analytic “observation” that the PDE holds is generally intractable. In the spirit of
classic approaches for solving PDEs, we relax the PDE-constraint by requiring only a
finite number of projections of the associated PDE residual onto carefully chosen test
functions to be zero. This choice of projections defines the discretization and allows for
control over the amount of expended computation. The resulting posterior quantifies
the algorithm’s uncertainty within a whole set of solution candidates.
• Conditioning on measurements. Finally, we can also condition on direct measurements
of the solution itself. This is especially useful if parameters of the differential operator
or boundary conditions are uncertain, or if the computational budget is restrictive.
The resulting posterior belief quantifies the uncertainty about the true solution induced
by limited computation and partially-known physics (see Figure 1(b)). By quantifying this
error probabilistically, we can propagate it to any downstream analysis or decision. For
example, to project the longevity of a newly designed electrical component, we want to
simulate how likely the component will hit a critical temperature threshold during operation. Given our posterior belief, we can simply compute the marginal probability instead of
performing Monte-Carlo sampling, which would require repeated PDE solves at significant
computational expense.
Contribution We introduce a probabilistic learning framework for the solution of (systems
of) linear PDEs, including elliptic, parabolic and hyperbolic linear PDEs. Our framework
can be viewed as physics-informed Gaussian process regression. It is based on a crucial generalization of a popular result on conditioning GPs on linear observations to observations
made via an arbitrary bounded linear operator (Theorem 1). This enables combined quantification of uncertainty from the inherent discretization error, uncertain initial or boundary
conditions, as well as noisy measurements of the solution. Our approach is a strict probabilistic generalization of methods of weighted residuals (Corollary 3.3), including collocation,
finite volume, (pseudo)spectral, and (generalized) Galerkin methods such as finite element
methods. In doing so, we demonstrate that this class can be equipped with a structured
error estimate and the capability to incorporate partially-known physics and experimental
measurements.
Figure 1: A physics-informed Gaussian process framework for the solution of linear PDEs.
(a) Learning to solve the Poisson equation. A problem-specific Gaussian process prior u is conditioned on partially-known physics, given by uncertain boundary conditions (BC) and a linear PDE, as well as on noisy physical measurements from experiment. The boundary conditions and the right-hand side of the PDE are not known but inferred from a small set of noise-corrupted measurements. The plots juxtapose the belief u | · · · with the true solution u* of the latent boundary value problem.
(b) Uncertainty quantification. Marginal posterior standard deviation after conditioning on uncertain boundary conditions, a linear PDE, and noisy (physical) measurements.
(c) Generalization of Classical Solvers. For certain priors our framework reproduces any method of weighted residuals, e.g. the finite element method, in its posterior mean.
2. Background
2.1 Linear Partial Differential Equations
A linear partial differential equation (PDE) is an equation of the form
D [u] = f,    (2.1)
where D : U → V is a linear differential operator (see Definition C.2) between a space U of
Rd′-valued functions and a space V of real-valued functions on a common domain D ⊂ Rd,
and f ∈ V is the so-called right-hand side function (Evans, 2010). Typically, systems
described by PDEs are further constrained via linear boundary conditions (BCs) B [u] = g
describing the system on the boundary ∂D, where B is a linear operator mapping functions
u ∈ U onto functions B [u] : ∂D → R defined on the boundary and g : ∂D → R. Common
types of boundary conditions are:
• Dirichlet: Specify the values of the solution on the boundary, i.e. B [u] = u|∂D .
• Neumann: Specify the exterior normal derivative on the boundary, i.e. B [u] (x) :=
∂ν(x) u (x), where ν(x) is the exterior normal vector at each point of the boundary.
A PDE and a set of boundary conditions is referred to as a boundary value problem (BVP). A
prototypical example of a linear PDE, used in thermodynamics, electrostatics, and Newtonian gravity, is the Poisson equation −∆u = f, where ∆u = Σ_{i=1}^d ∂²u/∂x_i² is the Laplacian.
2.1.1 Weak Formulation
Many models of physical phenomena are expressed as functions u, which are not (continuously) differentiable or even continuous (Evans, 2010; Borthwick, 2018; von Harrach, 2021).
In other words, they are not so-called strong solutions to any PDE. There are also PDEs
derived from established physical principles, which do not admit strong solutions at all.
To address this, one can weaken the notion of differentiability leading to the concept of
weak solutions. Many of the aforementioned physical phenomena are in fact weak solutions. As an example1 , consider the weak formulation of the stationary heat equation for
non-homogeneous media
− div (κ∇u) = q̇V .    (2.2)
Let D ⊂ Rd be an open and bounded domain and assume that u ∈ C 2 (D), κ ∈ C 1 (D),
and q̇V ∈ C 0 (D). If u is a solution to Equation (2.2), then we can integrate both sides of
the equation against a test function v ∈ Cc∞ (D), i.e. an infinitely smooth function with
compact support (see Definition C.5), which results in
−∫_D div (κ∇u) (x) v(x) dx = ∫_D q̇V (x) v(x) dx.

Since both u and v are sufficiently differentiable, we can apply integration by parts (Green's first identity) to the first integral to obtain

B [u, v] := ∫_D ⟨κ(x)∇u (x) , ∇v (x)⟩ dx = ∫_D q̇V (x) v(x) dx,    (2.3)
1. Our exposition is a strongly abbreviated version of Evans (2010, Section 6.1.2).
since v|∂D = 0. Note that this expression does not only make sense if u ∈ C 2 (D), but also if u
is once weakly differentiable (see Evans 2010, Section 5.2.1) with ∇u ∈ L2 (D)d . Intuitively
speaking, a weak derivative of a (classically non-differentiable) function “behaves like a
derivative” when integrated against a smooth test function. These relaxed requirements on
u are exactly the defining properties of the Sobolev space H 1 (D) ⊃ C 2 (D), i.e. it suffices
that u ∈ H 1 (D). Similarly, we can weaken all other assumptions to v ∈ H01 (D), f ∈ L2 (D)
and κ ∈ L∞ (D). Then, for u ∈ H 1 (D) and v ∈ H01 (D), Equation (2.3) is equivalent to
B [u, v] = ⟨q̇V , v⟩_{L²} .    (2.4)
We define a weak solution of Equation (2.2) as u ∈ H 1 (D) such that Equation (2.4), known
as the weak or variational formulation, holds for all v ∈ H01 (D).
Definition 2.1. A weak formulation of a linear PDE D [u] = f is an equation of the form
B [u, v] = l [v] ,    (2.5)
where B : U ×V → R is a bilinear form derived from the differential operator D and l : V → R
is a continuous linear functional induced by the right-hand side f . A vector u ∈ U is a weak
solution of the PDE if it solves Equation (2.5) for all test functions v ∈ V . In this context,
D [u] = f is called the strong formulation of the PDE and any solution to it is called a
strong or classical solution. We refer to a weak solution as strictly weak if it can not be
interpreted as the a solution to the strong formulation.
2.1.2 Methods of Weighted Residuals2
Unfortunately, linear PDEs both in weak and strong formulation are in general not analytically solvable, so approximate solutions are sought instead. Methods of weighted residuals (MWR) constitute a large family of popular numerical approximation schemes for linear PDEs, including collocation, finite volume, (pseudo)spectral, and (generalized) Galerkin
methods such as finite-element methods. Intuitively speaking, MWRs interpret a linear PDE
as a root-finding problem for the associated PDE residual, i.e. D [u] − f = 0. Note that this
problem consists of infinitely many equations for infinitely many unknowns. To render the
problem tractable, MWRs approximate the unknown solution function u via finite linear
combinations of trial functions φ1 , . . . , φm , i.e.

û := Σ_{i=1}^m ci φi ,    (2.6)

where c ∈ Rm is the coordinate vector of û in the finite-dimensional subspace Û :=
span (φ1 , . . . , φm ) ⊂ U . In the following, we will assume that the trial functions φi are
chosen such that the boundary conditions are met, i.e. we describe so-called interior methods.3 To reduce the number of equations, MWRs only require a finite number of projections
2. This section is loosely based on Fletcher (1984).
3. By stacking the residuals corresponding to the PDE and the boundary conditions, the approach outlined
here can be used to realize mixed methods, which solve the boundary value problem without requiring that û fulfills the boundary conditions by construction.
of the residual onto test functions ψ1 , . . . , ψn to be zero, i.e.
⟨ψi , D [û] − f ⟩V = ⟨ψi , D [û]⟩V − ⟨ψi , f ⟩V =: B [û, ψi ] − l [ψi ] = 0    (2.7)

for all i = 1, . . . , n, where ⟨·, ·⟩V is a (semi-definite) inner product on the function space V. A ubiquitous choice for ⟨·, ·⟩V is the L² inner product. In this case, the projected residual
can be interpreted as a weighted average of the residual, where the test function defines
the weight function, hence the name of the method. By substituting Equation (2.6) into
Equation (2.7) and rearranging terms, we can see that this approach leads to a linear system
B̂c = l̂, where B̂ij := B [φj , ψi ] and l̂i := l [ψi ]. Hence, the approximate solution function obtained from this method is given by

u^MWR = Σ_{i=1}^m c_i^MWR φi ,  where  c^MWR = B̂⁻¹ l̂,    (2.8)
assuming that B̂ is invertible. Note that Equation (2.7) is a weak formulation of the linear
PDE, restricted to the finite-dimensional subspaces Û ⊂ U and V̂ = span (ψ1 , . . . , ψn ) ⊂ V .
It is evident that the method described above can also be applied to weak formulations of
linear PDEs which were not obtained by projecting the residual onto the ψi as in Equation (2.7). Following Fletcher (1984), we will also refer to these methods as methods of
weighted residuals. Table 1 lists the aforementioned examples of MWRs together with the
corresponding trial and test functions that induce the method.
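To make this recipe concrete, the following minimal sketch assembles B̂ and l̂ by quadrature for the toy problem −u″ = x on (0, 1) with u(0) = u(1) = 0, using sine trial and test functions (an interior Bubnov-Galerkin/spectral choice); the problem and basis are illustrative, not taken from the paper.

```python
import numpy as np

# Method of weighted residuals for -u'' = f on (0, 1) with u(0) = u(1) = 0.
# Trial = test functions phi_i(x) = sin(i*pi*x), an interior (Bubnov-Galerkin) choice.
m = 8
xq, wq = np.polynomial.legendre.leggauss(64)        # Gauss-Legendre nodes on [-1, 1]
xq = 0.5 * (xq + 1.0); wq = 0.5 * wq                # rescale to [0, 1]

f     = lambda x: x                                  # right-hand side
phi   = lambda i, x: np.sin(i * np.pi * x)           # trial/test functions
D_phi = lambda i, x: (i * np.pi) ** 2 * np.sin(i * np.pi * x)   # D[phi_i] = -phi_i''

# B_hat[i, j] = <psi_i, D[phi_j]>_{L^2} and l_hat[i] = <psi_i, f>_{L^2}, cf. Eq. (2.7).
B_hat = np.array([[np.sum(wq * phi(i, xq) * D_phi(j, xq)) for j in range(1, m + 1)]
                  for i in range(1, m + 1)])
l_hat = np.array([np.sum(wq * phi(i, xq) * f(xq)) for i in range(1, m + 1)])

c_mwr = np.linalg.solve(B_hat, l_hat)                # c^MWR = B_hat^{-1} l_hat, Eq. (2.8)

u_mwr  = lambda x: sum(c_mwr[i - 1] * phi(i, x) for i in range(1, m + 1))
u_true = lambda x: (x - x ** 3) / 6.0                # exact solution of -u'' = x
xs = np.linspace(0.0, 1.0, 101)
print(np.max(np.abs(u_mwr(xs) - u_true(xs))))        # small approximation error
```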
2.2 Gaussian Processes
A Gaussian process h with index set X is a family {hx }x∈X of real-valued random variables on
a common probability space (Ω, B (Ω) , P) such that, for each finite set of indices x1 , . . . , xn ,
the joint distribution of hx1 , . . . , hxn is Gaussian. We also write h(x, ω) = hx (ω) and h(x) :=
h(x, ·). The function x ↦ E [h(x)] is called the mean (function) of h and the function (x1 , x2 ) ↦ Cov [h(x1 ), h(x2 )] is called the covariance function or kernel of h. For each ω ∈ Ω, the function h(·, ω) : X → R, x ↦ h(x, ω) is called a sample or (sample) path of the Gaussian process. The set paths(h) := {h(·, ω) | ω ∈ Ω} ⊂ R^X of sample paths is referred to as the
path space of h. Given the notion of a sample path, it is easy to see why we use Gaussian
processes as priors over unknown real-valued functions. However, many functions describing
physical systems such as vector fields take values in Rd′. Fortunately, the index set of a Gaussian process can be chosen freely, which means that we can "emulate" vector-valued GPs. More precisely, a function h : X → Rd′ can be equivalently viewed as a function h′ : {1, . . . , d′} × X → R, (i, x) ↦ h′(i, x) = hi (x). Applying this construction to a Gaussian
process leads to the notion of a multi-output Gaussian process.
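The following small sketch illustrates this index-set trick (an assumed toy construction, not the paper's implementation): a kernel defined on extended indices (i, x) induces a joint Gaussian over any finite set of outputs and inputs.

```python
import numpy as np

# "Emulate" a vector-valued GP by extending the index set to pairs (i, x),
# here with independent outputs sharing a squared-exponential kernel over x.
def multi_output_kernel(i1, x1, i2, x2, lengthscale=0.3):
    same_output = 1.0 if i1 == i2 else 0.0
    return same_output * np.exp(-(x1 - x2) ** 2 / (2 * lengthscale ** 2))

# Any finite set of extended indices yields a joint Gaussian distribution:
idx = [(i, x) for i in range(2) for x in np.linspace(0.0, 1.0, 5)]
K = np.array([[multi_output_kernel(i1, x1, i2, x2) for (i2, x2) in idx]
              for (i1, x1) in idx])
rng = np.random.default_rng(0)
sample = rng.multivariate_normal(np.zeros(len(idx)), K + 1e-9 * np.eye(len(idx)))
```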
3. Learning the Solution to a Linear PDE
Consider a linear partial differential equation D [u] = f subject to linear boundary conditions
B [u] = g as in Section 2.1. Our goal is to find a solution u ∈ U satisfying the PDE for
(partially) known (D, f ) and (B, g). In general, one cannot find a closed-form expression for
the solution u (Borthwick, 2018). Therefore, we aim to compute an accurate approximation
û ≈ u instead. Motivated by the challenges (C1) to (C3) of partial information inherent to
numerically solving PDEs, we approach the problem from a statistical inference perspective.
In other words, we will learn the solution of the PDE from multiple sources of information.
This way we can quantify the epistemic uncertainty about the solution at any time during
the computation, as Figure 1(a) illustrates.
Indirectly Observing the Solution of a PDE Typically, we think of observations as
a finite number of direct measurements u(xi ) = yi of the latent function u. As it turns
out, we can generalize this notion of a measurement and even interpret the PDE itself as
an (indirect) observation of u. As an example, consider the important case where u models
the state of a physical system. The laws of physics governing such a system are often
formulated as conservation laws in the language of PDEs. For example, they may require
physical quantities like mass, momentum, charge or energy to be conserved over time.
Example 3.1 (Thermal Conduction and the Heat Equation). Say we want to simulate
heat conduction in a solid object with shape D ⊂ R3 , i.e. we want to find the time-varying
temperature distribution u : [0, T ] × D → R. Neglecting radiation and convection, u(t, x)
is described by a linear PDE known as the heat equation (Lienhard and Lienhard, 2020).
Assuming spatially and temporally uniform material parameters cp , ρ, κ ∈ R, it reduces to
(cp ρ ∂/∂t − κ∆) u − q̇V = 0.    (3.1)
Thermal conduction is described by −κ∆u, while q̇V are local heat sources, e.g. from electric
currents. Any energy flowing into a region due to conduction or a heat source is balanced by
an increase in energy of the material. The net-zero balance shows that energy is conserved.
Notice how a conservation law is an observation of the behavior of the physical system!
To formalize this, we begin by rephrasing the classical notion of an observation at a point
xi as measuring the result of a specific linear operator applied to the solution u:
u(xi ) = yi ⇐⇒ δxi [u] = yi
where δxi is the evaluation functional. Now, the key idea is to generalize the notion of a direct
observation to collecting information about the solution via an arbitrary linear operator L
applied to the solution u, such that L [u] = y ⇐⇒ L [u] − y = 0. The affine operator
I [u] := L [u] − y    (3.2)
is a specific kind of information operator (Cockayne et al., 2019b). In this setting the
information operator may describe a conservation law as in Equation (3.1), a general linear
PDE of the form (2.1) or an arbitrary affine operator of choice mapping between vector
spaces (which may be linear function spaces). This generalized notion of an observation
turns out to be very powerful to incorporate different kinds of mathematical, physical, or
experimental properties of the solution. Since PDEs and conservation laws are often assumed
to hold exactly, we focused on noise-free observations above. However, generally we are not
limited to this case and can also model f as a random variable, in which case the information
operator I [u, f ] is a (jointly) linear function of the solution u and the right-hand side f .
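As a minimal illustration of the information-operator viewpoint (an assumed sketch, not the paper's code), both a direct measurement and a pointwise PDE residual fit the same pattern I[u] = L[u] − y; the second derivative is approximated by finite differences purely for demonstration.

```python
import numpy as np

# Information operators as callables: I[u] = L[u] - y.
def point_observation(x, y):
    return lambda u: u(x) - y                       # direct measurement at x

def pde_observation(f, x, h=1e-4):
    def I(u):                                       # residual of -u'' = f at x
        lap = (u(x + h) - 2.0 * u(x) + u(x - h)) / h ** 2
        return -lap - f(x)
    return I

u = np.sin                                          # candidate solution
print(point_observation(1.2, np.sin(1.2))(u))       # 0.0: measurement is satisfied
print(abs(pde_observation(np.sin, 1.2)(u)) < 1e-4)  # True: -u'' = sin holds for u = sin
```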
3.1 Solving PDEs as a Bayesian Inference Problem
One of the main challenges (C1) to (C3) outlined in the beginning is the limited computational budget available to us to approximate the solution. Fortunately, in practice, the
solution u is not hopelessly unconstrained, but we usually have a-priori information about
it. At the very least, we know the space of functions U in which to search for the solution.
Additionally, we might have expert knowledge about its rough shape and value range, or
solutions to related PDEs at our disposal. Now, the question becomes: How do we combine this prior knowledge with indirect observations of the solution through the information
operator I (3.2)? To do so, we turn to the Bayesian inference framework. This provides a
different perspective on the numerical problem of solving a linear PDE as a learning task.
Gaussian Process Inference We represent our belief about the solution of the linear
PDE via a (multi-output) Gaussian process
u ∼ GP (m, k)

with mean function m : D → Rd′ and covariance function or kernel k : D × D → Rd′×d′.
Gaussian processes are well-suited for this purpose since:
(i) For an appropriate choice of kernel, the Gaussian process defines a probability measure
over the function space in which the PDE’s solution is sought.
(ii) Kernels provide a powerful modeling toolkit to incorporate prior information (e.g. variability, periodicity, multi-scale effects, in- / equivariances, . . . ) in a modular fashion.
(iii) Measurement noise often follows a Gaussian distribution.
(iv) Conditioning a Gaussian process on observations made via a linear map again results
in a Gaussian process.
While the result in (iv) is used ubiquitously in the literature, its general form where observations are made via arbitrary linear operators as opposed to finite-dimensional linear maps,
has only been rigorously demonstrated for Gaussian measures on function spaces, not for
the Gaussian process perspective, to the best of our knowledge. The two perspectives are
closely related, but there are thorny technical difficulties to consider. We intentionally frame
the problem from the Gaussian process perspective to make use of the expressive modeling
capabilities provided by the kernel. Our framework at its very core relies on this result,
which we explain in detail in Section 4 and prove in Appendix B.4.
3.1.1 Encoding Prior Knowledge about the Solution
We can infer the solution of a linear PDE more quickly by specifying inductive biases in the
prior, which can encode both provable and approximately known properties of the solution.4
Function Space of the Solution The most basic known property derived from the
PDE is an appropriate choice of function space for the solution. This can be done by
inspecting the differential operator D and keeping track of the partial derivatives. In fact,
in implementation this can be automatically derived solely from the problem definition, e.g.
by compositionally defining differential operators and storing information on the necessary
differentiability. Let j ∈ N be the maximum order of the partial derivatives of the differential
operator D. If we choose a Matérn(ν) kernel with
ν = j + (d + 1)/2 + ε

with ε > 0, then under mild regularity conditions our prior defines a Gaussian measure over the space of solutions of the linear PDE.5 The choice ε = 1/2 allows particularly efficient kernel evaluations (Rasmussen and Williams, 2006).
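A small sketch of this smoothness rule (the helper name and defaults are hypothetical, not from the paper's code): pick ν from the derivative order j and the dimension d, and return the corresponding half-integer Matérn kernel.

```python
import numpy as np
from scipy.special import gamma, kv

def matern_kernel_for_pde(j, d, lengthscale=1.0, output_scale=1.0):
    """Matern kernel with nu = j + (d + 1)/2 + 1/2 (i.e. epsilon = 1/2)."""
    nu = j + (d + 1) / 2.0 + 0.5

    def k(x1, x2):
        r = np.linalg.norm(np.atleast_1d(x1) - np.atleast_1d(x2))
        if r == 0.0:
            return output_scale ** 2
        z = np.sqrt(2.0 * nu) * r / lengthscale
        return output_scale ** 2 * 2.0 ** (1.0 - nu) / gamma(nu) * z ** nu * kv(nu, z)

    return nu, k

# A second-order operator (j = 2) on a one-dimensional domain (d = 1) gives nu = 7/2.
nu, k = matern_kernel_for_pde(j=2, d=1)
print(nu, k(0.0, 0.1))
```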
Symmetries, In- and Equivariances Many solutions of PDEs exhibit a-priori known
symmetries. For example, to calculate the strength of a magnet rotated by R : R3 → R3 , one
can equivalently compute the field of the magnet in its original position and rotate the field,
i.e. u(Rx) = Ru(x). Corresponding inductive biases can be encoded via a kernel that is
invariant, i.e. k(ρg x0 , ρg x1 ) = k(x0 , x1 ), or equivariant k(ρg x0 , ρg x1 ) = ρg k(x0 , x1 )ρ∗g , where
ρg is a unitary group representation. The most commonly used kernels are stationary, i.e.
translation invariant, but one can also construct invariant kernels (Haasdonk and Burkhardt,
2007; Azangulov et al., 2022), as well as equivariant kernels (Reisert and Burkhardt, 2007;
Holderrieth et al., 2021) for many other group actions.
Related Problems If solutions from related problems are available, the prior mean function can be set to an appropriate combination of the available solutions, and the prior kernel
can be chosen to reflect how related the problems are. For example, if we have an approximate solution of the same PDE computed on a coarser mesh, we can condition our function
space prior on the coarse solution with a noise level reflecting the fidelity of the discretization. Similarly, if we solved the same PDE with different parameters, we can condition on
the available solutions with a noise level chosen according to how similar the parameters are
to the one of interest.
Domain Expertise Domain experts often have approximate knowledge of what solutions
can be expected, either from experience, previous experiments or familiarity with the physical interpretation of the solution u. For example, an electrical engineer who designs electrical
4. In the special case of GP regression, if the prior smoothness matches the smoothness of the target function
u, the convergence rate is optimal in the number of observations (Kanagawa et al., 2018, Thm. 5.1).
5. Technically, it is impossible to formulate a GP prior whose paths are elements of a Sobolev space, since
such spaces are spaces of equivalence classes. However, a similar intuition applies and can be formalized through a continuous embedding. See Appendix B.5 for details.
components is able to give realistic temperature ranges for a component, whose temperature distribution we aim to simulate. This can be included by choosing the (initial) kernel
hyperparameters, such as the output- and lengthscales based on this expertise.
3.1.2 (Indirectly) Observing the Solution
From a computational perspective, the most important reason for choosing Gaussian processes is that when conditioning on linear observations, the resulting posterior is again a
Gaussian process with closed form mean and covariance function (Bishop, 2006). We extend
this classic result from observations via a finite-dimensional linear map to general linear operators in Theorem 1. This is crucial to condition on the different types of observations,
most importantly the PDE itself, made via the information operator in (3.2). Given such an
affine observation defined via a linear operator L : U → Rn and an independent Gaussian random variable ε ∼ N (µ, Σ), we can condition our prior belief using Corollary 2 on the observations to obtain a posterior of the form

u | (L [u] + ε = y) ∼ GP(mu|y , ku|y)

with mean and covariance function given by

mu|y (x) = m(x) + L [k(·, x)]ᵀ (LkL∗ + Σ)⁻¹ (y − (L [m] + µ)) ,    (3.3)
ku|y (x1 , x2 ) = k(x1 , x2 ) − L [k(·, x1 )]ᵀ (LkL∗ + Σ)⁻¹ L [k(·, x2 )] .    (3.4)
We will now look more closely at how we can condition on the boundary conditions, the
PDE itself and direct measurements of the solution.
Observing the Solution via the PDE The differential operator D in Equation (2.1) is
linear and therefore we can (in theory) condition on I [u] = D [u] − f = 0 directly by using
Theorem 1 with L = D and y = f . However, it turns out that this is at least as hard as
solving the PDE directly and thus typically intractable in practice. This is because f is a
function and hence D [u] = f corresponds to an infinite number of observations. However, by
only enforcing the PDE at a finite number of points in the domain, we can immediately give a
canonical example of an approximation to this intractable information operator. Concretely,
we can condition u on the fact that the PDE holds at a finite sequence of well-chosen
domain points X = {xi}_{i=1}^n ⊂ int (D), i.e. we compute u | (D [u] (X) − f (X) = 0) by
choosing L = δX ◦ D and y = f (X). Intuitively speaking, if the set X of domain points
is dense enough, we obtain a good approximation to the exact conditional process. This
approach, known as the probabilistic meshless method (Cockayne et al., 2017), is analogous to
existing non-probabilistic approaches to solving PDEs, commonly referred to as collocation
methods, wherein the points X are called collocation points. Satisfying the PDE at a set of
collocation points is far from the only choice within our general framework. For example, we
can choose a set of test functions v ∈ V̂ , which we use to observe the PDE with, such that
L [u] = ⟨v, D [u]⟩V and y = ⟨v, f ⟩V . For efficient evaluation of the differential operator we can further represent the solution in a basis of trial functions from a subspace Û , resulting in L [u] = ⟨v, D[PÛ u]⟩V . This turns out to be very powerful and is analogous to how some of the most successful classical PDE solvers choose sets of basis functions for which to satisfy the PDE. In fact, for certain priors and choices of subspaces, our framework recovers several
important classic solvers in the posterior mean (see Section 3.3.4). Note that the above can
be applied to both time-dependent and time-independent PDEs and regardless of the type
of linear PDE (e.g. elliptic, parabolic, hyperbolic). Moreover, an extension to systems of
linear PDEs is straightforward.
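To make the collocation construction concrete, here is a minimal numerical sketch of conditioning on the PDE at collocation points together with Dirichlet boundary observations, for −u″ = f on [0, 1]; it uses a squared-exponential kernel instead of the Matérn kernel from Section 3.1.1 purely because its derivatives are short to write down, and all names and parameter values are illustrative.

```python
import numpy as np

# GP collocation for -u'' = f on [0, 1] with u(0) = u(1) = 0 (true solution sin(pi x)).
# With r = x1 - x2 and g(r) = sigma^2 exp(-r^2 / (2 l^2)):
#   k = g(r),  D applied to one argument gives -g''(r),  to both arguments gives g''''(r).
sigma, ell = 1.0, 0.2
g  = lambda r: sigma**2 * np.exp(-r**2 / (2 * ell**2))
g2 = lambda r: (r**2 / ell**4 - 1 / ell**2) * g(r)
g4 = lambda r: (3 / ell**4 - 6 * r**2 / ell**6 + r**4 / ell**8) * g(r)

k   = lambda x1, x2: g(np.subtract.outer(x1, x2))
Dk  = lambda x1, x2: -g2(np.subtract.outer(x1, x2))
DkD = lambda x1, x2: g4(np.subtract.outer(x1, x2))

f = lambda x: np.pi**2 * np.sin(np.pi * x)
X_pde = np.linspace(0.05, 0.95, 15)            # collocation points
X_bc  = np.array([0.0, 1.0])                   # boundary points

# Stacked observation operator: (-u''(X_pde), u(X_bc)), with a zero prior mean.
Lk  = lambda x: np.vstack([Dk(X_pde, x), k(X_bc, x)])
LkL = np.block([[DkD(X_pde, X_pde), Dk(X_pde, X_bc)],
                [Dk(X_bc, X_pde),   k(X_bc, X_bc)]])
y   = np.concatenate([f(X_pde), np.zeros(2)])

w = np.linalg.solve(LkL + 1e-8 * np.eye(len(y)), y)     # representer weights
post_mean = lambda x: Lk(x).T @ w

xs = np.linspace(0.0, 1.0, 101)
print(np.max(np.abs(post_mean(xs) - np.sin(np.pi * xs))))   # small approximation error
```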
Observing the Solution at the Boundary As for the PDE, we could attempt to directly
condition on the boundary conditions by choosing L = B and y = g. However, we are faced
with the same intractability issues that we discussed above. Instead, we observe that the
boundary conditions hold at a finite set of points XBC ⊂ ∂D, i.e. L = δXBC ◦ B and
y = g(XBC ). In practice, sometimes the boundary conditions are only known at a finite set
of points making this a natural choice.
Observing the Solution Directly Finally, as in standard GP regression, we can directly
condition on (noisy) measurements of the solution, for example from a real world experiment, by choosing L = δXMEAS and y = u∗ (XMEAS ).
In summary, the probabilistic viewpoint allows us to
• encode prior information about the solution,
• condition on various kinds of (partial) information, such as the boundary condition,
the PDE itself, or direct measurements, and
• output a structured error estimate, reflecting all obtained information and performed
computation.
We will now give concrete examples for some of the possible modeling choices described
above in a case study.
3.2 Case Study: Modeling the Temperature Distribution in a CPU
Central processing units (CPUs) are pieces of computing hardware that are constrained by
the vast amounts of heat they dissipate under computational load. Surpassing the maximum temperature threshold of a CPU for a prolonged period of time can result in reduced
longevity or even permanent hardware damage (Michaud, 2019). To counteract overheating,
cooling systems are attached to the CPU, which are controlled by digital thermal sensors
(DTS). For simplicity, assume that the CPU is under sustained computational load and that
the cooling device is controlled in a way such that the die reaches thermal equilibrium.
Example 3.2 (Stationary Heat Equation). The temperature distribution of a solid at thermal equilibrium, i.e. ∂u/∂t = 0 in Example 3.1, is described by the linear PDE

−κ∆u − q̇V = 0,    (3.5)

known as the stationary heat equation (Lienhard and Lienhard, 2020). For our choice of material parameters, Equation (3.5) is equivalent to the Poisson equation with f = q̇V /κ.
While the sensors control cooling, they only provide local, limited-precision measurements of the CPU temperature. This is problematic, since the chip may reach critical
temperature thresholds in unmonitored regions. Therefore, our goal will be to infer the
temperature in the entire CPU. We will use our framework to integrate the physics of heat flow, the controlled cooling at the boundary, and the noisy temperature measurements from the sensors. See Figure 2(b) for an illustration of the result. During manufacturing, the resulting belief over the temperature distribution could then help decide whether the CPU design needs to be changed to avoid premature failure. From here on out, we focus on a 1D slice across the CPU surface, as shown in Figure 2(a) (top), to easily visualize uncertainty.

Figure 2: Physics-informed Gaussian process model of the stationary temperature distribution in an idealized hexa-core CPU die under sustained computational load. (a) Top: CPU die with CPU cores as heat sources and uniform cooling over the whole surface. Bottom: Magnitude of heat sources and sinks q̇V in the 1D slice shown in the upper subplot. (b) Gaussian process integrating prior information about the temperature distribution, a mechanistic model of heat conduction in the form of a linear PDE, and empirical measurements (XDTS , uDTS ) taken by limited-precision sensors (DTS). The plot shows the GP mean and a 1D slice illustrating the posterior uncertainty along with a few samples.
Encoding Prior Knowledge By inspecting the PDE's differential operator D = −κ∆ = −κ Σ_{i=1}^d ∂²/∂x_i², we can deduce that the paths of our Gaussian process need to be twice differentiable. The construction in Section 3.1.1 results in a Matérn(ν) kernel with ν = j + (d + 1)/2 + 1/2 = 2 + (1 + 1)/2 + 1/2 = 7/2. Assume we also know what temperature ranges are plausible from similar CPU architectures, meaning that we set the kernel output scale to σ²out = 9. Figure 3 shows the prior process u along with its image D [u] ∼ GP(D [m] , σ²out DkD∗) under the differential operator. A draw from D [u] can be interpreted as the heat sources and sinks that generated the corresponding temperature distribution draw from u.
Figure 3: Prior model for the stationary temperature distribution of a CPU die under load. (a) Gaussian process prior with a Matérn-7/2 kernel over the temperature distribution of the CPU. (b) Prior under the differential operator D [u] = −κ∆u along with heat sources and sinks q̇V .
Figure 4: We integrate mechanistic knowledge about the system by conditioning on PDE observations −κ∆u (xPDE,i ) − q̇V (xPDE,i ) = 0 at the collocation points xPDE,i , resulting in the conditional process u | PDE. (a) Belief about the solution after conditioning on the PDE at a set of collocation points. (b) Belief about heat sources and sinks after conditioning on the PDE at collocation points. The large remaining uncertainty in Figure 4(a) illustrates that the PDE by itself does not identify a unique solution.
Conditioning on the PDE We can now inform our belief about the physics of heat
conduction using the mechanistic model defined by the stationary heat equation. We choose
a set of collocation points XPDE = {xPDE,i}_{i=1}^n and then condition on the observation that the PDE holds (exactly) at these points. In other words, we compute the physically-informed Gaussian process u | PDE := u | {−κ∆u (xPDE,i ) − q̇V (xPDE,i ) = 0}_{i=1}^n , visualized in Figure 4.
We can see that the resulting conditional process indeed satisfies the PDE exactly at the
collocation points (see Figure 4(b)). The remaining uncertainty in Figure 4(b) is due to the
approximation error introduced by only conditioning on a finite number of collocation points.
However, while the samples from our belief about the solution in Figure 4(a) exhibit much
more similarity to the mean function and less spatial variation, the marginal uncertainty
hardly decreases. The latter is explained by the PDE not identifying a unique solution, since
adding any affine function to u does not alter its image under the differential operator, i.e.
∆(aᵀx + b) = 0. There is an at least two-dimensional subspace of functions which cannot
be observed. This ambiguity can be resolved by introducing boundary conditions.
Conditioning on the Boundary Conditions We assume that the CPU cooler extracts
heat (approximately) uniformly from all exposed parts of the CPU, in particular also from
the sides, rather than just from the top. Instead of directly specifying the value of the
temperature distribution at the edge points of the CPU slice, we only approximately know
the density q̇A of heat flowing out of each point on the CPU’s boundary based on the
cooler specification. We can use another thermodynamical law to turn this assumption into
information about the temperature distribution u.
Example 3.1 (continuing from p. 8). Fourier’s law states that the local density of heat q̇A
flowing through a surface with normal vector ν is proportional to the inner product of the
negative temperature gradient and the surface normal ν, i.e.
q̇A = −κ ⟨ν, ∇u⟩ ,

where κ is the material's thermal conductivity in W m⁻¹ K⁻¹ (Lienhard and Lienhard, 2020).
Assuming sufficient differentiability of u, the inner product above is equal to the directional derivative ∂ν u of u in direction ν. We can assign an outward-pointing vector ν(x)
(almost) everywhere on the boundary of the domain. Since the boundary of the CPU
domain is its surface, we can summarize the above in a Neumann boundary condition
−κ∂ν(x) u (x) = q̇A (x) for x ∈ ∂D. However, in practice we only know the approximate
heat flow out of the CPU due to cooling. We therefore leverage our probabilistic viewpoint once more to incorporate the uncertainty about the true value of q̇A . To that end
assume a joint Gaussian process prior (u, q̇A ), where q̇A is the heat flow out of the CPU
at the border and q̇A ⫫ u. We can use Corollary 3 to condition u | PDE on the Neumann boundary condition, meaning we compute (u, q̇A | I PDE [u] = 0) | I NBC [(u, q̇A )] = 0,
where I NBC [(u, q̇A )] = −κ∂ν(x) u (XNBC ) − q̇A (XNBC ) with XNBC = {0, wCPU } describes the
boundary conditions. Then, we marginalize over q̇A in the conditional process to obtain a
belief over u. The result is visualized in Figure 5. The structure of the samples illustrates
that most of the remaining uncertainty about the solution lies in a one-dimensional subspace of U corresponding to constant functions. This is due to the fact that two Neumann
boundary conditions on both sides of the domain only determine the solution of the PDE
up to an additive constant. Hence, we need an additional source of information to address
the remaining degree of freedom.
Figure 5: The cooler of the CPU produces an approximately specified outgoing heat flux q̇A at the boundary of the CPU. (a) Belief about the solution after conditioning on the PDE and boundary conditions. (b) Belief about heat sources and sinks after conditioning on the PDE and boundary conditions. As Figure 5(a) illustrates, after conditioning on the resulting (approximate) Neumann boundary conditions, the solution of the PDE is identified up to an additive constant.

Figure 6: The digital thermal sensors (DTS) within the CPU cores provide us with limited-precision, local measurements of the temperature at locations xDTS,i . (a) Belief about the solution after conditioning on the PDE, BCs and noisy sensor data. (b) Belief about heat sources and sinks after conditioning on the PDE, BCs and noisy sensor data. Integrating these along with the PDE and boundary conditions identifies the solution up to noise from the different types of observations and discretization error.

Conditioning on Direct Measurements Fortunately, CPUs are equipped with digital thermal sensors (DTS) located close to each of the cores, which provide (noisy) local measurements of the core temperatures (Michaud, 2019). These measurements can be straightforwardly accounted for in our model by performing standard GP regression using u | PDE, NBC from Figure 5 as a prior. The resulting belief about the temperature distribution is visualized in Figure 6. We can see that integrating the interior measurements effectively reduces the uncertainty due to the remaining degree of freedom, albeit not completely. The remaining uncertainty is due to the model's consistent accounting for noise in the thermal sensor readings, the uncertainty about the cooling, i.e. the boundary conditions, and the discretization error incurred by only choosing a small set of collocation points.
Uncertainty in the Right-hand Side Above, we always assumed the true heat source
term q̇V , i.e. the right-hand side of the PDE, to be known exactly. However, in practice,
this assumption might also be violated, as was the case for the boundary conditions. A
straightforward relaxation of this assumption is to replace q̇V by a Gaussian process whose
mean is given by an estimate of q̇V .6 In the beginning of Section 3.2 we assumed that the
cooler is controlled in such a way, that the temperature distribution in the CPU does not
change over time. However, a naive prior over the heat flow q̇A out of the CPU may break
this assumption. We need to encode that the amount of heat entering the CPU is equal to
the amount of heat leaving the CPU via its boundary, i.e.
I STAT [q̇V , q̇A ] := ∫_D q̇V (x) dx − ∫_∂D q̇A (x) dA = 0.    (3.6)
The (jointly) linear information operator I STAT computes the net amount of thermal energy
that the CPU gains per unit time. Using Corollary 2 we can construct a joint GP prior for u,
q̇V and q̇A , which is consistent with the assumption of stationarity. We posit a multi-output
GP prior over (u, q̇V , q̇A ), and condition on I STAT [q̇V , q̇A ] = 0. In this section, we choose all
outputs to be independent. In the one-dimensional model, we can simplify Equation (3.6)
by assuming that heat is drawn uniformly from the sides of the CPU. In this case, the GP
prior over q̇A turns into a four-dimensional Gaussian random vector

(q̇A,N , q̇A,E , q̇A,S , q̇A,W )ᵀ ∼ N (mq̇A , Σq̇A )

and the information operator is equivalent to

I STAT [q̇V , q̇A ] = hCPU ∫_0^wCPU q̇V (x) dx − hCPU (q̇A,E + q̇A,W ) − wCPU (q̇A,N + q̇A,S ).    (3.7)
The effect of this information operator on the marginal process q̇V is visualized in Figure 7(b).
The conditional mean is the same as the prior mean, since the prior mean is explicitly
constructed to fulfill Equation (3.7). However, note that the samples and the marginal
credible interval change substantially. Prior samples in Figure 7(a) seem to lie consistently
above or below the mean, indicating that there is a net increase or decrease in thermal
energy. In contrast, each sample from the conditional process q̇V | STAT in Figure 7(b)
conserves thermal energy in the system.
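The following sketch mimics this construction in a crude, discretized form (the grid, quadrature rule, and all numbers are illustrative assumptions, not the paper's implementation): q̇V is represented on a grid, q̇A by its four face values, and the stacked Gaussian is conditioned on the noise-free linear constraint from Equation (3.7).

```python
import numpy as np

h_cpu, w_cpu = 2.3, 12.2                      # slice height and width in mm (illustrative)
n = 50
x = np.linspace(0.0, w_cpu, n)

# Independent priors: discretized GP over q_V, Gaussian over (q_A_N, q_A_E, q_A_S, q_A_W).
ell, sigma_qV = 2.0, 1.0
K_qV = sigma_qV**2 * np.exp(-(x[:, None] - x[None, :])**2 / (2 * ell**2))
m_qV = np.full(n, 0.5)
m_qA, K_qA = np.full(4, 0.2), 0.05**2 * np.eye(4)

m_z = np.concatenate([m_qV, m_qA])
K_z = np.block([[K_qV, np.zeros((n, 4))],
                [np.zeros((4, n)), K_qA]])

# Linear functional a^T z approximating Eq. (3.7): trapezoidal weights for the q_V
# integral, minus the outgoing boundary terms (ordered N, E, S, W).
quad = np.full(n, w_cpu / (n - 1)); quad[[0, -1]] *= 0.5
a = np.concatenate([h_cpu * quad, -np.array([w_cpu, h_cpu, w_cpu, h_cpu])])

# Condition z on the noise-free observation a^T z = 0 (standard Gaussian conditioning).
s = K_z @ a                                   # Cov(z, a^T z)
var = a @ s                                   # Var(a^T z)
m_post = m_z - s * (a @ m_z) / var
K_post = K_z - np.outer(s, s) / var

print(a @ m_post)                             # ~0: the posterior conserves thermal energy
```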
We can use Corollary 2 to condition our joint GP prior (u, q̇V , q̇A ) | STAT first on
I PDE [(u, q̇V , q̇A )] = 0 and then on I NBC [(u, q̇V , q̇A )] = 0 as above. It is important to
keep track of the correlations in (u, q̇V , q̇A ), since the outputs in (q̇V , q̇A ) | STAT are now
correlated. The resulting marginal conditional GP u | PDE, NBC, STAT after additionally
conditioning on sensor data is shown in Figure 8. Comparing Figures 6 and 8, we can see
that, due to the uncertainty in the right-hand side q̇V of the PDE, the samples of −κ∆u |
PDE, NBC, STAT, DTS exhibit much more spatial variation. Moreover, the samples of the
GP posterior over u now respect the stationarity constraint we imposed.
Stepping back, we can view the problem of modelling the CPU under computational
load as a scientific inference problem, where we need to aggregate heterogeneous sources of
information in a joint probabilistic model. This inference task is illustrated as a directed
graphical model in Figure 9. Our physics-informed regression framework is a local computation in the global inference procedure on the graph. Importantly, its implementation does
not change based on what happens to the solution estimate and the input data in either upstream or downstream computations. All this information is already handily encoded in the structured uncertainties of the Gaussian processes.

6. Technically speaking, if the right-hand side of the PDE is given as a Gaussian process, the PDE turns into a stochastic partial differential equation (SPDE).

Figure 7: Construction of a joint prior over the temperature distribution u, the volumetric heat source q̇V inside the CPU and the outgoing surface heat flux q̇A on its sides, which is consistent with the assumption of a stationary temperature distribution. (a) GP prior over the volumetric heat source q̇V , which is inconsistent with the assumption of a stationary temperature distribution. (b) Conditional GP q̇V | STAT obtained by conditioning the GP prior q̇V from Figure 7(a) on the stationarity constraint Equation (3.7).

Figure 8: We integrate information from the joint prior (u, q̇V , q̇A ) | STAT over the solution, the right-hand side of the PDE, and the values of the Neumann boundary conditions into our belief about the temperature distribution by conditioning on said PDE and boundary conditions. (a) Posterior belief about the temperature distribution, physically consistent with the assumption of stationarity. (b) Posterior belief about the heat sources and sinks after conditioning on the corresponding uncertain right-hand side q̇V of the PDE.

Figure 9: Representation of the CPU model as a directed graphical model. The inference procedure described in Section 3.2 is equivalent to the junction tree algorithm (Bishop, 2006, Section 8.4.6) applied to the graphical model above. This example shows that the language of information operators is a powerful tool for aggregating heterogeneous sources of partial information in a joint probabilistic model.
3.3 A General Class of Tractable Information Operators for Linear PDEs
In Section 3.1.2, we noted that conditioning on the information operator induced by the
linear PDE, i.e. I [u] = D [u] − f, is usually intractable. As a remedy, we approximated I
by a finite family {Ii}_{i=1}^n of tractable information operators with Ii [u] := D [u] (xi ) − f (xi )
with xi ∈ D. Crucially, this assumes that point evaluation on both D [u] and f is well-defined
and continuous, which means that this approach only applies to strong or classical solutions
of a PDE. In this section, we will extend this approximation scheme for I into a unifying
framework for tractable information operators aimed at approximating both (strictly) weak
and strong solutions to linear PDEs. Our framework is inspired by the method of weighted
residuals (MWR) (see Section 2.1.2), which is why we refer to these information operators
as MWR information operators. Indeed, in Section 3.3.4 we will show that GP inference
with information operators from our framework reproduces any weighted residual method
in the posterior mean while providing an estimate of the inherent approximation error.
In the following, we will consider a linear PDE in weak formulation, i.e. we want to solve
B [u, v] = l [v]   ∀v ∈ V    (3.8)
for u ∈ U . Equation (3.8) does not have to be a weak formulation in the sense of Section 2.1.1,
but it could also be a weighted strong formulation as in Equation (2.7). We additionally
require that B is continuous for fixed v ∈ V , i.e. for any v ∈ V there must be a constant C < ∞ such that |B [u, v]| ≤ C ‖u‖_U for all u ∈ U . Let u ∼ GP (mu , ku ) be a Gaussian
process prior over the weak solution u, whose path space can be continuously embedded
into the solution space U of the PDE (see Appendix B.5 for more details on the latter
assumption). As in Section 3.1.2, it is intractable to condition the GP prior on the full
information provided by the PDE via the family {Iv }v∈V of affine information operators
Iv [u] := B [u, v] − l [v] , since V is typically infinite-dimensional. To find tractable families
of information operators, we will take inspiration from the method of weighted residuals (see
Section 2.1.2).
3.3.1 Infinite-Dimensional Trial Function Spaces
By Corollary 2 it is tractable to condition on a finite subfamily {Iψi}_{i=1}^n ⊂ {Iv}_{v∈V} of information operators, where ψ1 , . . . , ψn is a finite subset of test functions, as long as we can compute Iψi [mu ], Lψi [ku (x, ·)], and Lψi ku L∗ψj , where Lψi := B [·, ψi ]. This might not
always be possible in closed-form, since B often involves computing integrals. However, in
these cases one could fall back to an efficient numeric quadrature method, since the integrals
are often low-dimensional (typically at most four-dimensional). A prominent example of this
approach is the probabilistic meshless method used in Section 3.
Example 3.3 (Symmetric Collocation). If the differential operator maps into a reproducing
kernel Hilbert space7 V, then, by the reproducing property, we know that there is a function
δ∗x ∈ V for every x ∈ D such that v(x) = δx [v] = ⟨δ∗x , v⟩V for all v ∈ V . Hence, if the weak formulation is given by Equation (2.7), and V is an RKHS, then the choice ψi = δ∗xi for xi ∈ D leads to

Iψi [u] = D [u] (xi ) − f (xi ),
i.e. we recover the probabilistic meshless method of Cockayne et al. (2017) and Section 3. Cockayne et al. (2017) show that this approach reproduces symmetric collocation (Fasshauer, 1997, 1999), a non-probabilistic approximation method for strong solutions of PDEs, in its conditional mean. Note that this family of information operators
can also be recovered without assuming that V is a Hilbert space. We only require u ↦ D [u] (xi ) − f (xi ) to be continuous.
Unfortunately, the probabilistic meshless method can only be applied to approximate
strong solutions of linear PDEs, since the test functions corresponding to point evaluation
functionals are usually not well-defined and continuous on the spaces V considered for finding a strictly weak solution. However, other choices of the vi lead to approximation schemes
for weak solutions. For instance, a weak solution of the stationary heat equation in inhomogeneous media from above can be approximated by choosing the Lagrange elements from
Figure 10 as test functions.
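As a sketch of such a weak-form observation (all choices here, including the single P1 "hat" test function, the conductivity, and the trapezoidal quadrature, are illustrative assumptions), the required quantities Lψ [ku (·, x)] and Lψ ku L∗ψ for the bilinear form B[u, v] = ∫ κ∇u · ∇v dx can be approximated by quadrature and plugged into Equation (3.3):

```python
import numpy as np

# A single weak observation B[u, psi] = <q_V, psi>_{L^2} for -(kappa u')' = q_V on [0, 1],
# with a piecewise-linear "hat" test function and a squared-exponential prior kernel.
sigma, ell = 1.0, 0.2
g  = lambda r: sigma**2 * np.exp(-r**2 / (2 * ell**2))
g1 = lambda r: -r / ell**2 * g(r)                       # g'
g2 = lambda r: (r**2 / ell**4 - 1 / ell**2) * g(r)      # g''

# Kernel derivatives: d/dx1 k(x1, x2) = g'(x1 - x2), d^2/(dx1 dx2) k = -g''(x1 - x2).
dk1  = lambda x1, x2: g1(np.subtract.outer(x1, x2))
dk12 = lambda x1, x2: -g2(np.subtract.outer(x1, x2))

kappa = lambda x: 1.0 + x                               # conductivity
q_V   = lambda x: 1.0 + 4.0 * x                         # heat source
psi       = lambda x: np.maximum(1 - np.abs(x - 0.5) / 0.25, 0.0)
psi_prime = lambda x: np.where(np.abs(x - 0.5) < 0.25, -np.sign(x - 0.5) / 0.25, 0.0)

xq = np.linspace(0.0, 1.0, 401); wq = np.full(401, 1 / 400); wq[[0, -1]] *= 0.5
wB = wq * kappa(xq) * psi_prime(xq)                     # quadrature weights for B[., psi]

# L_psi[k(., x)] = int kappa(s) psi'(s) d/ds k(s, x) ds, and the 1x1 Gram "matrix".
Lk  = lambda x: wB @ dk1(xq, np.atleast_1d(x))
LkL = wB @ dk12(xq, xq) @ wB
rhs = wq @ (q_V(xq) * psi(xq))                          # l[psi] = <q_V, psi>_{L^2}

# Posterior mean of a zero-mean GP prior after this single observation, cf. Eq. (3.3).
post_mean = lambda x: Lk(x) * rhs / (LkL + 1e-12)
```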
3.3.2 Finite-Dimensional Trial Function Spaces
As opposed to the methods outlined in Section 2.1.2, we did not need to choose a finitedimensional subspace of trial functions to arrive at tractable information operators in Section 3.3.1. Nevertheless, in practice, it might still be desirable to specify a finite-dimensional
trial function basis φ1 , . . . , φn , e.g. because
• we want to reproduce the output of a classical method in the posterior mean to use
the GP solver as an uncertainty-aware drop-in replacement (see Corollary 3.3);
• the trial basis encodes knowledge about the problem that is difficult to encode in the
prior; or
• we want to solve the problem in a coarse-to-fine scheme, allowing for mesh refinement
strategies, which are informed by the GP’s uncertainty estimation.
7. This is a reasonably weak assumption, since any Hilbert function space with continuous point evaluation
functionals is an RKHS (Steinwart and Christmann, 2008).
Naively, one might achieve this goal by defining the prior over u as a parametric Gaussian
process with features φi . However, this means the posterior cannot quantify the inherent
approximation error, since the GP has no support outside of the finite subspace of U spanned
by the trial functions. Consequently, we need to take a different approach. Starting from a
general, potentially nonparametric prior over u, we consider a bounded (potentially oblique)
projection PÛ : U → Û onto a subspace Û ⊂ U , i.e. PÛ2 = PÛ , PÛ < ∞, and ran(PÛ ) =
Û . In general, this subspace need not be finite-dimensional. We apply PÛ to our GP prior
over u, which, by Corollary 3, results in another GP
û := PÛ [u] ∼ GP PÛ [mu ] , PÛ ku PÛ∗ ,
with sample paths in Û . Note that this discards prior information about ker(PÛ ). Hence,
especially in case dim Û < ∞, applying the information operators Iψi from Section 3.3.1
directly to û would suffer from similar problems as choosing a parametric prior. However,
Iψi ,PÛ := Iψi ◦ PÛ = B PÛ [·] , ψi − l [ψi ]
is a valid information operator for u, which leads to a probabilistic generalization of the
method of weighted residuals. This is why we refer to Iψi ,PÛ as an MWR information
operator.
The similarity to the method of weighted residuals is particularly prominent if we choose
a finite-dimensional subspace Û = span (φ1 , . . . , φm ) as in Section 2.1.2. In this case, there
is a bounded linear operator PRm : U → Rm such that
PÛ [u] =
m
X
ci φi =: IRÛm [c] ,
i=1
where the c := PRm [u] ∈ Rm are the coordinates of PÛ [u] in Û and IRÛm : Rm → Û is the
canonical isomorphism between Rm and Û . Hence, we get the factorization
PÛ = IRÛm PRm ,
(3.9)
which implies that û is a parametric Gaussian process. Moreover, note that l [ψi ] = ˆli and
m
h
i X
Û
B IRm [c] , ψi =
ci B [φi , ψi ] = (B̂c)i
i=1
for c ∈ Rm , where B̂ and ˆl are defined as in Section 2.1.2. Consequently, the MWR information operator is given by Iψi ,PÛ [u] = (IRm ◦ PÛ ) [u]i , where IRm [c] := B̂c − ˆl. This
illustrates that we are dealing with the hierarchical model
u ∼ GP (mu , ku )
c | u ∼ δPRm [u]
with observations IRm [c] = 0, where c ∼ N (PRm [mu ] , PRm ku PR∗ m ). Inference in this
model can be broken down into two steps. First, we update our belief about the solution’s coordinates in Û by computing the conditional random variable c | IRm [c] = 0,
which is also Gaussian. If B̂ is invertible and c has full support on R^m, then the law of c | I_{R^m}[c] = 0 is a Dirac measure whose mean is given by the coordinates of the MWR approximation c^MWR = B̂^{−1} l̂ from Equation (2.8). Next, we can reuse precomputed quantities from the conditional moments of c | I_{R^m}[c] = 0, such as the representer weights w = (B̂ P_{R^m} k_u P_{R^m}^* B̂^T)^† (l̂ − B̂ P_{R^m}[m_u]), to efficiently compute the conditional random process

(u | (I_{R^m} ∘ P_{R^m})[u] = 0) = (u | {I_{ψ_i, P_Û}[u] = 0}_{i=1}^n),

i.e. the main object of interest. Assuming once more that B̂ is invertible and c has full support on R^m, the remaining uncertainty of the conditional process lies in the kernel of P_Û, since the law of c | I_{R^m}[c] = 0 is a Dirac measure and

(P_Û[u] | {I_{ψ_i, P_Û}[u] = 0}_{i=1}^n) = (I_{R^m}^Û[c] | I_{R^m}[c] = 0).

Thus, all remaining uncertainty must be due to (id − P_Û)[u] | {I_{ψ_i, P_Û}[u] = 0}_{i=1}^n. Note the striking similarity of this property to the notion of Galerkin orthogonality (Logg et al., 2012,
Equation 2.63).
A canonical choice for the projection PÛ would arguably be orthogonal projection w.r.t.
the RKHS inner product of the sample space of u (see e.g. Kanagawa et al. 2018). However,
this inner product is generally difficult to compute. Fortunately, we can use the L2 inner
products or Sobolev inner products on the samples to induce a (usually non-orthogonal)
projection PÛ .
Example 3.4. If the elements of U are square-integrable, then the linear operator

P_{R^m}[u] := P^{−1} ( ∫_D φ_i(x) u(x) dx )_{i=1}^m,    where    P_{ij} := ∫_D φ_i(x) φ_j(x) dx,

induces a projection P_Û = I_{R^m}^Û P_{R^m} onto Û ⊂ U, even if ⟨·, ·⟩_U ≠ ⟨·, ·⟩_{L^2}.
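A small numerical sketch of Example 3.4 (illustrative only; the grid, trial functions and the function u below are assumptions, not taken from the paper): the coordinates P_{R^m}[u] of the L² projection onto linear Lagrange elements are obtained by assembling the mass matrix P and the moment vector ∫_D φ_i(x) u(x) dx by quadrature and solving the resulting linear system.

```python
# Sketch of Example 3.4 (illustrative assumptions: hat trial functions on [-1, 1]
# and an arbitrary square-integrable u). The projection coordinates are
# P_Rm[u] = P^{-1} (∫ phi_i(x) u(x) dx)_i with mass matrix P_ij = ∫ phi_i phi_j dx.
import numpy as np
from scipy.integrate import quad

grid = np.linspace(-1.0, 1.0, 7)  # x_0 < ... < x_{m+1}, i.e. m = 5 interior nodes
m = len(grid) - 2
u = lambda x: np.sin(np.pi * x)   # example function to project

def hat(i, x):
    """Linear Lagrange element centered at grid[i] (1 <= i <= m)."""
    xl, xc, xr = grid[i - 1], grid[i], grid[i + 1]
    if xl <= x <= xc:
        return (x - xl) / (xc - xl)
    if xc < x <= xr:
        return (xr - x) / (xr - xc)
    return 0.0

P = np.zeros((m, m))
b = np.zeros(m)
for i in range(1, m + 1):
    lo, hi = grid[i - 1], grid[i + 1]          # support of phi_i
    b[i - 1] = quad(lambda x: hat(i, x) * u(x), lo, hi, points=[grid[i]])[0]
    for j in range(1, m + 1):
        P[i - 1, j - 1] = quad(lambda x: hat(i, x) * hat(j, x), lo, hi,
                               points=[grid[i]])[0]

coords = np.linalg.solve(P, b)  # coordinates of P_Û[u] in the hat-function basis
print(coords)
```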
At first glance, information operators restricting Û to be finite-dimensional might seem fundamentally inferior to the information operators from Section 3.3.1. However, note that the conditional mean of a Gaussian process prior conditioned on {I_ψi[u] = 0}_{i=1}^n is updated by a linear combination of n functions, while the covariance function receives an at most rank-n downdate. This means that, implicitly, these Gaussian process methods also have a finite-dimensional trial function space, which is constructed from the test function basis, the bilinear form B and the prior covariance function k_u.
MWR information operators with finite-dimensional trial function bases can be used to
realize a GP-based analogue of the finite element method.
Example 3.5 (A 1D Finite Element Method). Generally speaking, finite element methods
are (generalized) Galerkin methods, where the functions in the test and trial bases have
compact support, i.e. they are nonzero only in a highly localized region of the domain. The
archetype of a finite element method chooses linear Lagrange elements (Logg et al., 2012, Section 3.3.1) as test and trial functions.

Figure 10: Linear Lagrange elements are famous test and trial functions ψ_i = φ_i used in the finite element method. (a) Test/trial functions φ_i = ψ_i. The functions are defined on the whole interval [−1, 1], but only their non-zero parts are shown to avoid clutter. (b) The trial functions φ_i span the space of piecewise linear functions on the given grid.

Linear Lagrange elements are piecewise linear on
a triangulation of the domain. For instance, on a one-dimensional domain D = [−1, 1], this amounts to fixing a grid −1 = x_0 < · · · < x_{m+1} = 1 and then choosing the basis functions

φ_i(x) = ψ_i(x) = \begin{cases} (x − x_{i−1}) / (x_i − x_{i−1}) & \text{if } x_{i−1} ≤ x ≤ x_i, \\ (x_{i+1} − x) / (x_{i+1} − x_i) & \text{if } x_i ≤ x ≤ x_{i+1}, \\ 0 & \text{otherwise}, \end{cases}

for i = 1, . . . , m. Note that multiplying a coordinate vector c ∈ R^m with these basis functions leads to a piecewise linear interpolation between the points

(x_0, 0), (x_1, c_1), . . . , (x_m, c_m), (x_{m+1}, 0),

since, for x ∈ [x_i, x_{i+1}],

\sum_{j=1}^m c_j φ_j(x) = c_i \frac{x_{i+1} − x}{x_{i+1} − x_i} + c_{i+1} \frac{x − x_i}{x_{i+1} − x_i} = \left(1 − \frac{x − x_i}{x_{i+1} − x_i}\right) c_i + \frac{x − x_i}{x_{i+1} − x_i} c_{i+1}.
The basis functions and an element in their span are visualized in Figure 10. The Lagrange elements at the boundary of the domain can also easily be modified such that arbitrary piecewise linear boundary conditions are fulfilled by construction. The effect of MWR information operators based on this set of test and trial functions is visualized in Figure 11(a).
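As a concrete companion to Example 3.5, the following sketch (an illustrative stand-alone implementation, not the code accompanying the paper) assembles the Galerkin system B̂_ij = ∫ φ_j'(x) φ_i'(x) dx and l̂_i = ∫ f(x) φ_i(x) dx for the 1D Poisson problem −u'' = f on [−1, 1] with homogeneous boundary conditions and linear Lagrange elements as both trial and test functions; the coordinates c = B̂^{−1} l̂ are the classical MWR (finite element) point estimate that the GP posterior mean reproduces under the assumptions of Corollary 3.3.

```python
# Sketch of the archetypal 1D finite element method from Example 3.5
# (illustrative stand-alone code, not the implementation accompanying the paper).
# Solves -u'' = f on [-1, 1] with u(-1) = u(1) = 0 using linear Lagrange elements
# as both trial and test functions, i.e. a (Ritz-)Galerkin / FEM discretization.
import numpy as np

m = 20                                   # number of interior nodes
grid = np.linspace(-1.0, 1.0, m + 2)     # x_0 < ... < x_{m+1}
h = grid[1] - grid[0]                    # uniform mesh width

f = lambda x: np.pi**2 * np.sin(np.pi * x)   # example right-hand side

# Stiffness matrix B_ij = ∫ phi_j'(x) phi_i'(x) dx: for hat functions on a
# uniform grid this is tridiagonal with 2/h on and -1/h next to the diagonal.
B_hat = (np.diag(np.full(m, 2.0 / h))
         + np.diag(np.full(m - 1, -1.0 / h), k=1)
         + np.diag(np.full(m - 1, -1.0 / h), k=-1))

# Load vector l_i = ∫ f(x) phi_i(x) dx, approximated by f(x_i) * ∫ phi_i = f(x_i) h.
l_hat = h * f(grid[1:-1])

c_mwr = np.linalg.solve(B_hat, l_hat)    # coordinates of the FEM / MWR estimate
u_fem = lambda x: np.interp(x, grid, np.concatenate(([0.0], c_mwr, [0.0])))
print(u_fem(0.5), np.sin(np.pi * 0.5))   # compare with the exact solution sin(pi x)
```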
3.3.3 MWR Information Operators
Even though the class of information operators introduced above is constructed for weak
forms of linear PDEs, it can naturally be applied to the weak form of an arbitrary operator
equation.

Figure 11: Conditioning a Gaussian process prior on the MWR information operators {I_{ψ_i, P_Û}}_{i=1}^n corresponding to the weak formulation of the Poisson equation, i.e. Equation (2.3), and m = 3 linear Lagrange elements as test functions ψ_i and trial functions φ_i (see Example 3.5). The trial functions φ_1 and φ_m were modified to fulfill the non-zero boundary conditions exactly. (a) Posterior process corresponding to a Matérn-3/2 prior. The sample paths of the process embed continuously into the Sobolev space H^1(D) (see Appendix B.5). (b) Posterior process corresponding to an MWR recovery prior constructed from a Matérn-3/2 prior via Lemma 3.4. The posterior mean corresponds to the point estimate produced by the classical MWR.

In particular, we can use MWR information operators for the boundary conditions in a BVP. Moreover, it is straightforward to extend I_{ψ, P_Û} to a joint GP prior over (u, f) if the right-hand side f of the operator equation is unknown, particularly if l[v] = ⟨f, v⟩_V as in Section 2.1. In this case, I_{ψ, P_Û} is jointly linear in (u, f). Summarizing Sections 3.3.1 and 3.3.2 and incorporating the extensions discussed here, we give the following general definition of an MWR information operator:
Definition 3.1 (MWR Information Operator). Let B [u, v] = l [v] be an operator equation
in weak formulation. An MWR information operator for said operator equation is an affine
functional
I_{ψ, P_Û} := B[P_Û[·], ψ] − l[ψ],

parameterized by a test function ψ ∈ V and a bounded (potentially oblique) projection P_Û onto a subspace Û ⊂ U. We also write I_ψ := I_{ψ, id_U}. If l[v] = ⟨f, v⟩_V, then the input of I_{ψ, P_Û} can be extended to the right-hand side f of the operator equation, i.e.

I_{ψ, P_Û}[(u, f)] := B[P_Û[u], ψ] − ⟨f, ψ⟩_V,

which is jointly linear in (u, f).
3.3.4 Recovery of Classical Methods
In this section we will show that, under certain assumptions, the posterior mean of a GP
prior conditioned on a set of MWR information operators is identical to the approximation
generated by the corresponding traditional method of weighted residuals, examples of which are given in Table 1. More precisely, we will show that there is a flexible family of GP priors u ∼ GP(m_u, k_u) whose posterior means after conditioning on {I_{ψ_i, P_Û}}_{i=1}^m are identical to the corresponding classical MWR approximation u_MWR to the solution of the same weak-form linear PDE, where we use the same trial functions φ_1, . . . , φ_m and test functions ψ_1, . . . , ψ_n in both cases, i.e. Û = span(φ_1, . . . , φ_m). As in Section 2.1.2, we assume that the trial functions are already constructed in such a way that the boundary conditions are fulfilled. However, it is possible to extend the results below to the general case by adding MWR information operators corresponding to the boundary conditions and using

c^{MWR} = \begin{pmatrix} \hat{B}_{PDE} \\ \hat{B}_{BC} \end{pmatrix}^{-1} \begin{pmatrix} \hat{l}_{PDE} \\ \hat{l}_{BC} \end{pmatrix}

as coordinates for the reference solution generated by the traditional MWR.

Table 1: Overview of trial and test functions defining commonly used methods of weighted residuals. The table also shows whether the method is capable of approximating weak solutions. See Fletcher (1984) for more details.

Strong Solutions:
  Collocation: trial functions arbitrary; test functions ψ_i = δ_{x_i}^* for x_i ∈ D, so that B[u, ψ_i] = D[u](x_i).
  Subdomain (Finite Volume): trial functions arbitrary; test functions ψ_i = 1_{D_i} for D_i ⊂ D, so that B[u, ψ_i] = ∫_{D_i} D[u](x) dx.
  Pseudospectral: trial functions orthogonal and globally supported (e.g. Fourier basis or Chebyshev polynomials); test functions ψ_i = δ_{x_i}^* for x_i ∈ D, so that B[u, ψ_i] = D[u](x_i).

Weak & Strong Solutions:
  Generalized Galerkin: trial functions arbitrary; test functions arbitrary.
  Finite Element: trial functions locally supported (e.g. piecewise polynomial); test functions from the same class as the trial functions, but in general ψ_i ≠ φ_i.
  Spectral (Galerkin): trial functions orthogonal and globally supported (e.g. Fourier basis or Chebyshev polynomials); test functions from the same class as the trial functions, but in general ψ_i ≠ φ_i.
  (Ritz-)Galerkin: trial functions arbitrary; test functions ψ_i = φ_i.
Lemma 3.2. If B̂ ∈ R^{n×m} and Σ_c := P_{R^m} k_u P_{R^m}^* ∈ R^{m×m} are invertible, then

c | B̂c − l̂ = 0 ∼ δ_{c^{MWR}}

and the conditional mean m_{u|B̂,l̂} of u | B̂ P_{R^m}[u] − l̂ = 0 admits a unique additive decomposition

m_{u|B̂,l̂} = u_MWR + u_{ker(P_Û)}    (3.10)
with u_MWR ∈ Û and u_{ker(P_Û)} ∈ ker(P_Û).

Corollary 3.3. If, additionally, m_u ∈ Û and P_{ker(P_Û)} k_{uu} P_{R^m}^* = 0, then the conditional mean m_{u|B̂,l̂} is equal to the MWR approximation, i.e.

m_{u|B̂,l̂} = u_MWR.
It turns out that it is possible to transform any admissible GP prior over the (weak)
solution of the PDE into a prior that fulfills the assumptions of Corollary 3.3. We describe
this transformation in the following lemma.
Lemma 3.4 (MWR Recovery Prior). Let ũ ∼ GP(m̃_u, k̃_u) with mean and sample paths in U. Then u ∼ GP(m_u, k_u) with

m_u := P_Û[m̃_u]

and

k_{uu} := P_Û k̃_{uu} P_Û^* + P_{ker(P_Û)} k̃_{uu} P_{ker(P_Û)}^*
        = P_Û k̃_{uu} P_Û^* + (id_U − P_Û) k̃_{uu} (id_U − P_Û)^*
        = k̃_{uu} − P_Û k̃_{uu} − k̃_{uu} P_Û^* + 2 P_Û k̃_{uu} P_Û^*

has sample paths in U, m_u ∈ Û, and P_{ker(P_Û)} k_{uu} P_{R^m}^* = 0.
Figure 11(b) visualizes how a prior of this form reproduces a 1D finite element method
in the posterior mean and Figure 11 as a whole contrasts the difference between ũ and u.
Intuitively speaking, the construction for the covariance from Lemma 3.4 enforces statistical
independence between the subspaces Û and ker(PÛ ) of the GP’s path space. This way, an
observation of the GP prior in the subspace Û gains no information about ker(PÛ ), which
means that the posterior process will not be updated along ker(P_Û). Since m_u ∈ Û, i.e. P_{ker(P_Û)}[m_u] = 0, it follows that the posterior mean will also lie in Û. Even though this choice
of prior is somewhat restrictive, there are good reasons to use it in practice, arguably the
most important of which is that the uncertainty quantification provided by the GP can be
added on top of traditional MWR solvers in existing pipelines in a plug-and-play fashion.
This is due to the fact that, in this case, the mean estimate agrees with the point estimate
produced by the classical solver.
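To illustrate Lemma 3.4, here is a sketch under simplifying assumptions (everything is discretized on a dense evaluation grid, the base kernel is an RBF stand-in for a Matérn-3/2 kernel, and the projection is the L² projection from Example 3.4 onto hat functions; none of this is the paper's implementation) that builds the recovery-prior covariance k_u = k̃ − P_Û k̃ − k̃ P_Û^* + 2 P_Û k̃ P_Û^* as a matrix acting on function values.

```python
# Sketch of the MWR recovery prior from Lemma 3.4 (illustrative construction on a
# dense evaluation grid; not the paper's implementation). The projection P_U is
# represented by a matrix acting on function values, using the L2 projection onto
# hat functions from Example 3.4, and the base kernel is a simple RBF stand-in.
import numpy as np

xs = np.linspace(-1.0, 1.0, 201)                         # dense evaluation grid
w = np.full_like(xs, xs[1] - xs[0]); w[[0, -1]] *= 0.5   # trapezoidal weights
nodes = np.linspace(-1.0, 1.0, 7)                        # FEM grid with m = 5 hats

def hat(i, x):
    """Linear Lagrange element centered at nodes[i], evaluated on an array x."""
    xl, xc, xr = nodes[i - 1], nodes[i], nodes[i + 1]
    return np.clip(np.minimum((x - xl) / (xc - xl), (xr - x) / (xr - xc)), 0.0, None)

Phi = np.stack([hat(i, xs) for i in range(1, len(nodes) - 1)])   # shape (m, N)
P = (Phi * w) @ Phi.T                      # mass matrix P_ij = ∫ phi_i phi_j dx
Pi = Phi.T @ np.linalg.solve(P, Phi * w)   # (N, N) matrix representing P_U

k_tilde = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / 0.3**2)
K = k_tilde(xs, xs)                        # Gram matrix of the base prior

# Recovery-prior covariance k_u = k~ - P_U k~ - k~ P_U* + 2 P_U k~ P_U*.
K_u = K - Pi @ K - K @ Pi.T + 2.0 * Pi @ K @ Pi.T
print(np.allclose(K_u, K_u.T))             # symmetric by construction
```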
3.4 Algorithm
Algorithm 1 summarizes our framework from an algorithmic standpoint. It outlines how a
GP prior can be conditioned on heterogeneous sources of information such as mechanistic
knowledge given in the form of a linear boundary value problem, and noisy measurement
data by leveraging the notion of a linear information operator. All GP posteriors in this
article were computed by this algorithm with different choices of prior, PDE, boundary
conditions and policy.
Algorithm 1: Solving PDEs via Gaussian Process Inference

Input: Joint GP prior (u, f, g, ε) ∼ GP(m, k), linear PDE (D, f) or (B_PDE, f), boundary conditions (B, g) or (B_BC, g), (noisy) measurements (X_MEAS, Y_MEAS), . . .
Output: GP posterior u ∼ GP(m_i, k_i)

1   procedure LinPDE-GP(m, k, I^PDE, I^BC, X_MEAS, Y_MEAS)
2       i ← 0
3       (m_0, k_0) ← (m, k)
4       w_0 ← ()
5       G_0 ← ()
6       while not StoppingCriterion() do
7           i ← i + 1
8           (ψ_PDE, ψ_BC, P_Û, v_MEAS) ← Policy(m_i, k_i)                                    ▷ Action
9           I_i ← (u, f, g, ε) ↦ (I^PDE_{ψ_PDE, P_Û}[(u, f)], I^BC_{ψ_BC, P_Û}[(u, g)], . . . , ⟨v_MEAS, u(X_MEAS) + ε⟩)   ▷ Information operator
10          y_i ← (0, 0, . . . , ⟨v_MEAS, Y_MEAS⟩)^T                                          ▷ Observations
11          G_i ← (G_{i−1}, I_{1:i−1} k I_i^*; I_i k I_{1:i−1}^*, I_i k I_i^*) = I_{1:i} k I_{1:i}^*   ▷ Update Gram matrix
12          w_i ← G_i^† (y_{1:i} − I_{1:i}[m])                                                ▷ Update representer weights
13          m_i ← x ↦ m(x) + I_{1:i}[k(x, ·)]^T w_i                                           ▷ Belief update
14          k_i ← (x_1, x_2) ↦ k(x_1, x_2) − I_{1:i}[k(x_1, ·)]^T G_i^† I_{1:i}[k(·, x_2)]
15      return GP(m_i, k_i)
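The core belief update of Algorithm 1 (lines 11 to 14) is ordinary Gaussian conditioning on the accumulated affine observations. The following sketch (illustrative only; real information operators, policies and kernels are problem-specific) implements that update with NumPy and demonstrates it with the simplest possible information operators, point evaluations.

```python
# Sketch of the belief update in lines 11-14 of Algorithm 1 (illustrative only;
# actual information operators, policies and priors are problem-specific).
# Observations enter purely through L_{1:i}[m], L_{1:i}[k(x, .)] and the Gram
# matrix L_{1:i} k L_{1:i}^*, so the same code serves any affine observation.
import numpy as np

def gp_condition(m_grid, K_grid, LK_grid, gram, y, Lm):
    """Closed-form conditioning on affine observations (cf. Corollary 2)."""
    gram_pinv = np.linalg.pinv(gram)
    w = gram_pinv @ (y - Lm)                         # representer weights w_i
    mean = m_grid + LK_grid @ w                      # updated mean m_i
    cov = K_grid - LK_grid @ gram_pinv @ LK_grid.T   # updated covariance k_i
    return mean, cov

# Toy usage: condition a zero-mean RBF prior on two noiseless point evaluations,
# i.e. the information operators are simply L_j = delta_{x_j}.
k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / 0.2**2)
xs = np.linspace(-1.0, 1.0, 101)
X_obs = np.array([-0.5, 0.4])
mean, cov = gp_condition(m_grid=np.zeros_like(xs), K_grid=k(xs, xs),
                         LK_grid=k(xs, X_obs), gram=k(X_obs, X_obs),
                         y=np.array([0.3, -0.1]), Lm=np.zeros(2))
print(mean[50], cov[50, 50])
```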
Modeling uncertainty over the right-hand side f of the PDE, the boundary function g and the measurements Y_MEAS is achieved by specifying a joint prior over (u, f, g, ε). Therefore, Algorithm 1 also returns a multi-output Gaussian process posterior over (u, f, g, ε). This means that our method can be used to solve PDE-constrained Bayesian inverse problems for the right-hand side f and the boundary function g, while computing a consistent distributional estimate for the corresponding solution u of the forward problem. This is a generalization of a linear latent force model (Alvarez et al., 2009). If f and g are not uncertain, the corresponding covariance functions in the joint prior can simply be set to 0, which (in the absence of measurements) reduces the joint prior to a simple prior over the solution u. To condition the GP on the PDE and the boundary conditions, we make use of MWR information operators (see Definition 3.1), where the test functions and projection are chosen by an arbitrary policy in each iteration of the method. An example of such a policy, which reproduces Figure 1(c), chooses P_Û as the L^2 projection onto the basis from Example 3.5 in every iteration, the test functions ψ_BC ∈ {δ_{−1}^*, δ_1^*} and ψ_PDE = 0 in the first two iterations, and ψ_PDE = φ_{i−2} (and ψ_BC = 0) from iteration 3 onward. The ellipses in the information operator I_i and the observations y_i indicate that adding additional information
operators is possible in the same fashion. For instance, adding additional PDE information
operators enables the solution of systems of linear PDEs.
Performance Considerations Instead of naively conditioning the previous conditional
process on the new observation in each iteration, Algorithm 1 always conditions the prior
on the accumulated observations. This is because the naive expressions for the conditional
moments become more and more complex over time. While, in principle, it is possible to
use automatic differentiation (AD) to compute I_i[m_{i−1}], I_i[k_{i−1}(x, ·)], and I_i k_{i−1} I_i^* in each iteration and then evaluate Equations (4.15) and (4.16) naively, we found that this is detrimental to the performance of the algorithm. In Algorithm 1, we only need to compute I_i[m], I_i[k(x, ·)], and I_i k I_i^* on the prior moments, which are much less complex and cheaper
to evaluate. For maximum efficiency, for many information operator / kernel combinations
one can compute optimized closed-form expressions for these terms, alleviating the need for
automatic differentiation or quadrature. We can avoid unnecessary recomputation of the
representer weights at every iteration of the method by means of block-matrix inversion.
For instance, if a Cholesky decomposition is used to invert the Gramian Gi , we can use a
variant of the block Cholesky decomposition (Golub and Van Loan, 2013) to update the
Cholesky factor of Gi−1 .
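The block update mentioned above can be sketched as a generic rank extension of a Cholesky factor (this is not the specific routine used in the paper's implementation): given the factor of G_{i−1} and the new blocks I_{1:i−1} k I_i^* and I_i k I_i^*, the extended factor follows from a triangular solve and a small Cholesky factorization of the Schur complement.

```python
# Sketch of the block Cholesky update mentioned above (a generic rank extension;
# not the specific routine from the paper's implementation). Given the factor L
# of G_{i-1} and the new blocks B = I_{1:i-1} k I_i^* and C = I_i k I_i^*, the
# factor of the extended Gram matrix follows from triangular solves.
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def extend_cholesky(L, B, C):
    """Return the lower Cholesky factor of [[L @ L.T, B], [B.T, C]]."""
    S = solve_triangular(L, B, lower=True)          # S = L^{-1} B
    L_new = cholesky(C - S.T @ S, lower=True)       # factor of the Schur complement
    top = np.hstack([L, np.zeros((L.shape[0], C.shape[0]))])
    return np.vstack([top, np.hstack([S.T, L_new])])

# Toy check against a direct factorization of the full Gram matrix.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
G = A @ A.T + 5.0 * np.eye(5)
L_inc = extend_cholesky(cholesky(G[:3, :3], lower=True), G[:3, 3:], G[3:, 3:])
print(np.allclose(L_inc, cholesky(G, lower=True)))
```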
Code A Python implementation of Algorithm 1 based on ProbNum (Wenger et al., 2021)
and JAX (Bradbury et al., 2018) is available at:
https://github.com/marvinpfoertner/linpde-gp
3.5 Related Work
The area of physics-informed machine learning (Karniadakis et al., 2021) aims at augmenting machine learning models with mechanistic knowledge about physical phenomena, mostly
in the form of ordinary and partial differential equations. Recently, there has been growing interest in deep learning–based approaches (Raissi et al., 2019; Li et al., 2020, 2021).
However, this model choice makes it inherently difficult to quantify the uncertainty about
the solution induced by noise-corrupted input data and inevitable approximation error. Instead, we approach the problem through the lens of probabilistic numerics (Hennig et al.,
2015; Cockayne et al., 2019b; Oates and Sullivan, 2019; Owhadi et al., 2019; Hennig et al.,
2022), which frames numerical problems as statistical estimation tasks. Probabilistic numerical methods for the solution of PDEs are predominantly based on Gaussian process
priors. Our work builds upon and extends these works. Many existing methods aim to find
a strong solution to a linear PDE using a collocation scheme (e.g. Graepel 2003; Cockayne
et al. 2017; Raissi et al. 2017). Unfortunately, many practically relevant (linear) PDEs only
admit weak solutions. Our framework extends existing collocation approaches to weak formulations. Probabilistic numerical methods approximating weak formulations are primarily
based on discretization. For example, Cockayne et al. (2019a); Wenger and Hennig (2020)
apply a probabilistic linear solver to the linear system arising from discretization. Girolami
et al. (2021) propose a statistical version of the finite element method (statFEM), which
uses a specific parametric GP prior. However, these approaches do not quantify the inherent discretization error – often the largest source of uncertainty about the solution. In
contrast, our framework models this error and additionally admits a broader class of discretizations. Wang et al. (2021); Krämer et al. (2022) propose GP-based solvers for strong
formulations of time-dependent nonlinear PDEs by leveraging finite-difference approximations to the differential operator and linearization-based approximate inference. While it is
possible to apply such methods to linear PDEs, the finite difference approximation of the
differential operator introduces additional estimation error. By contrast, the evaluation of
the differential operator in our method is exact. Cockayne et al. (2017); Raissi et al. (2017);
Girolami et al. (2021) also apply their methods to solve PDE-constrained (Bayesian) inverse
problems. Särkkä (2011) directly infers the right-hand side of a linear PDE in strong formulation by observing measurements of the solution through the associated Green’s function.
Our approach also builds a belief over an unknown right-hand side without requiring access
to a Green’s function. The aforementioned methods use the closure of Gaussian processes
under conditioning on observations of the sample paths through a linear operator without
proof. Owhadi and Scovel (2018) show how to condition Gaussian measures on an orthogonal direct sum of separable Hilbert spaces on observations of one of the summands. Our
work extends these results to Gaussian processes with sample paths in separable reproducing
kernel Hilbert spaces by leveraging the dualities between these. Recent results about the
sample spaces of GPs (Steinwart, 2019; Kanagawa et al., 2018) ensure the applicability of
our work to practical GP regression problems. To our knowledge this is the first complete
proof of this widely used property of GPs. Thus, Theorem 1 provides the theoretical basis
for physics-informed GP regression, including the aforementioned methods for the solution
of PDEs. In our work, it enables conditioning on information operators constructed from
e.g. PDEs, integral equations, or boundary conditions.
4. Gaussian Process Inference with Affine Observations of Sample Paths
Our framework fundamentally relies on the fact that when a Gaussian process prior is
conditioned on affine observations of its paths, one obtains a closed-form posterior. This
section provides the theoretical foundation for this result. While this property is used widely
in the literature (see e.g. Graepel (2003); Rasmussen and Williams (2006); Särkkä (2011);
Särkkä et al. (2013); Cockayne et al. (2017); Raissi et al. (2017); Agrell (2019); Albert (2019);
Krämer et al. (2022)), to the best of our knowledge no proof exists of its general form, in which observations are made via bounded linear operators between separable Hilbert function spaces instead of via finite-dimensional linear maps applied to a finite number of point evaluations.
Owhadi and Scovel (2018) give a proof of a related property for Gaussian measures. Here, we
extend their results to the case of Gaussian processes. While these perspectives are closely
related, significant technical attention needs to be paid for this result to transfer to the GP
case. For our framework this is essential such that we can leverage the modelling capabilities
provided by specifying a kernel as described in Section 3.1.1.
To state the result, let f ∼ GP(m, k) be a (multi-output) GP prior with index set X ⊂ R^d, L : paths(f) → R^n a linear operator acting on the paths of f, and ε ∼ N(µ, Σ) a Gaussian random vector in R^n with ε ⊥⊥ f. We need to compute the conditional random process

f | L[f] + ε = y
for some y ∈ R^n. Formally, this object is defined as the family

(f | L[f] + ε = y) := {f(x, ·) | E}_{x∈X}

of conditional random variables,^8 where (Ω, B(Ω), P) is the probability space on which both f and ε are defined, E is the event E := h^{−1}({y}) ∈ B(Ω), and h is the random variable

h : Ω → R^n, ω ↦ L[f(·, ω)] + ε(ω).

We refer to Appendix B.1 for definitions of the objects mentioned above. For instance, in Section 3, we use L := (D[·](x_i))_{i=1}^n, where D is a linear differential operator, as well as L[f] := (f(x_i))_{i=1}^n, and, in Section 3.2, we additionally use

L[f] = ∫_D f(x) dx.
It is well-known that h is a Gaussian random vector

h ∼ N(L[m] + µ, L k L^* + Σ),

where L k L^* ∈ R^{n×n} with

(L k L^*)_{ij} = L[t ↦ L[k(t, ·)]_j]_i,

and that the conditional random process is a Gaussian process

f | L[f] + ε = y ∼ GP(m_{f|y}, k_{f|y})

with conditional moments given by

m_{f|y}(x) = m(x) + L[k(·, x)]^T (L k L^* + Σ)^{−1} (y − (L[m] + µ))

and

k_{f|y}(x_1, x_2) = k(x_1, x_2) − L[k(·, x_1)]^T (L k L^* + Σ)^{−1} L[k(·, x_2)].
Since the above are nontrivial claims about potentially ill-behaved infinite-dimensional objects, a proof is important, be it just to identify a precise set of assumptions about the
objects at play, which are required so that the result holds. For instance, it is possible
that h is not a random variable (because it might not be measurable), i.e. E might not be measurable. To remedy this situation, a major contribution of this work is Theorem 1 together with Corollaries 2 and 3 and their proof in Appendix B, which provide a sequence of increasingly specialized results capturing the claims above. Hence, besides being the theoretical
basis for this work, Theorem 1 and Corollaries 2 and 3 also provide theoretical backing
for many of the publications cited above. Our results identify a set of mild assumptions,
which are easy to verify and widely-applicable in practical applications. Assumption 1 constitutes the common set of assumptions shared by Theorem 1 and Corollaries 2 and 3. See
Appendix B.5 for information on how to verify Assumption 1 in a practical scenario.
8. Here, we need to work with regular conditional probability measures (Klenke, 2014), since the event E
typically has probability 0.
Assumption 1. Let f ∼ GP (mf , kf ) be a Gaussian process prior with index set X on the
Borel probability space (Ω, B (Ω) , P), whose mean function and sample paths lie in a real
separable RKHS H ⊂ RX with H ⊇ Hkf . Let L : H → HL be a bounded linear operator
mapping the paths of f into a separable Hilbert space HL .
We start our exposition here by presenting Theorem 1, our most general result. Using
Theorem 1, it is possible to condition Gaussian processes on affine observations of their
paths, which take values in arbitrary and potentially infinite-dimensional separable Hilbert
spaces. For instance, this means that conditioning on observations of a whole function is
well-defined, given that the assumptions of Theorem 1 are fulfilled. The formulation of this
theorem heavily relies on the theory of Gaussian measures on separable Hilbert spaces, some
of which is detailed in Appendix B.2.
Theorem 1 (Affine Gaussian Process Inference). Let Assumption 1 hold. Then ω ↦ f(·, ω) is an H-valued Gaussian random variable with mean m_f and covariance operator h ↦ C_f[h](x) = ⟨k_f(x, ·), h⟩_H. We also write f ∼ N(m_f, C_f). Let ε ∼ N(m_ε, C_ε) be an H_L-valued Gaussian random variable with ε ⊥⊥ f. Then

\begin{pmatrix} f \\ L[f] + ε \end{pmatrix} ∼ N\left( \begin{pmatrix} m_f \\ L[m_f] + m_ε \end{pmatrix}, \begin{pmatrix} C_f & C_f L^* \\ L C_f & L C_f L^* + C_ε \end{pmatrix} \right)    (4.1)

with values in H × H_L and hence

L[f] + ε ∼ N(L[m_f] + m_ε, L C_f L^* + C_ε).    (4.2)

If ran(L C_f L^* + C_ε) is closed, then, for all y ∈ H_L,

f | L[f] + ε = y ∼ GP(m_{f|y}, k_{f|y}),    (4.3)

where the conditional mean and covariance function are given by

m_{f|y}(x) = m_f(x) + ⟨L[k_f(·, x)], (L C_f L^* + C_ε)^† [y − (L[m_f] + m_ε)]⟩_{H_L}    (4.4)

and

k_{f|y}(x_1, x_2) = k_f(x_1, x_2) − ⟨L[k_f(·, x_1)], (L C_f L^* + C_ε)^† L[k_f(·, x_2)]⟩_{H_L},    (4.5)

respectively.
Unfortunately, especially in the context of PDEs, Theorem 1 is difficult to apply in
practice, since the operator L C_f L^* + C_ε is infinite-dimensional and its pseudoinverse (if it
exists) usually has no analytic form. However, as seen in Section 3, its corollaries can, in
practical scenarios, be applied to great effect. Corollary 2 enables affine observations, in
which the GP sample paths enter through one or multiple continuous linear functionals. For
example, we used Corollary 2 in Section 3.2 to condition on observations of a GP. To state
the result conveniently, we introduce some notation.
Notation 1. Let k : X × X → R be a positive-definite kernel and let L_i : H_k → R^{n_i} for i = 1, 2 be bounded linear operators. By L_1 k L_2^* ∈ R^{n_1 × n_2}, we denote the matrix with entries

(L_1 k L_2^*)_{ij} := L_1[x ↦ L_2[k(x, ·)]_j]_i.
Table 2: Theorem 1 provides the theoretical basis to condition on (affine) observations of a Gaussian process. While results like conditioning on derivative evaluations are used ubiquitously throughout the literature (e.g. monotonic GPs, Bayesian optimization, probabilistic numerical PDE solvers, . . . ), a complete proof does not exist in the literature, to the best of our knowledge.

  Observation                      Information operator              Reference
  Point evaluation                 f ↦ f(x)                          Bishop (2006)
  Affine finite-dim. operator      f ↦ A f(X) + b                    Bishop (2006)
  Point evaluation of derivative   f ↦ d/dx f(x) |_{x=x_0}           Corollary 3
  Integral                         f ↦ ∫_D f(x) dµ(x)                Corollary 2
  Derivative                       f ↦ d/dx f                        Theorem 1
  Integro-differential operator    f ↦ D[f]                          Theorem 1
  Affine operator                  f ↦ L[f] + b                      Theorem 1
It turns out that the order in which the operators L_1, L_2 are applied to the arguments of k does not matter, i.e.

(L_1 k L_2^*)_{ij} = L_1[x ↦ L_2[k(x, ·)]_j]_i = L_2[x ↦ L_1[k(·, x)]_i]_j

(see Lemma B.27). This motivates the parenthesis-free notation L_1 k L_2^* introduced above.
Corollary 2. Let Assumption 1 hold for H_L = R^n and let ε ∼ N(µ_ε, Σ_ε) be an R^n-valued Gaussian random variable with ε ⊥⊥ f. Then

L[f] + ε ∼ N(L[m_f] + µ_ε, L k_f L^* + Σ_ε)    (4.6)

and, for any y ∈ R^n,

f | L[f] + ε = y ∼ GP(m_{f|y}, k_{f|y}),    (4.7)

with conditional mean and covariance function given by

m_{f|y}(x) = m_f(x) + ⟨L[k_f(x, ·)], (L k_f L^* + Σ_ε)^† (y − (L[m_f] + µ_ε))⟩_{R^n}    (4.8)

and

k_{f|y}(x_1, x_2) = k_f(x_1, x_2) − ⟨L[k_f(x_1, ·)], (L k_f L^* + Σ_ε)^† L[k_f(·, x_2)]⟩_{R^n}.    (4.9)
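As a minimal worked instance of Corollary 2 (illustrative; the kernel, domain and noise level are assumptions), the sketch below conditions a zero-mean RBF-prior GP on a single noisy integral observation L[f] = ∫_D f(x) dx, approximating the required kernel functionals L[k(x, ·)] and L k L^* by numerical quadrature and then evaluating Equations (4.8) and (4.9) directly.

```python
# Minimal worked instance of Corollary 2 (illustrative assumptions: zero prior
# mean, RBF kernel, D = [-1, 1], a single noisy integral observation). The kernel
# functionals L[k(x, .)] and L k L^* are approximated by numerical quadrature.
import numpy as np
from scipy.integrate import quad, dblquad

ell = 0.3
k = lambda a, b: np.exp(-0.5 * (a - b) ** 2 / ell**2)

Lk = lambda x: quad(lambda t: k(t, x), -1.0, 1.0)[0]            # integral of k(t, x) over D
LkL = dblquad(lambda s, t: k(t, s), -1.0, 1.0, -1.0, 1.0)[0]    # double integral of k over D x D

y, noise_var = 0.7, 1e-4        # observed integral value and noise variance

def posterior_mean(x):
    """Equation (4.8) with m_f = 0 and a single observation."""
    return Lk(x) * y / (LkL + noise_var)

def posterior_var(x):
    """Equation (4.9) evaluated on the diagonal."""
    return k(x, x) - Lk(x) ** 2 / (LkL + noise_var)

print(posterior_mean(0.0), posterior_var(0.0))
```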
Finally, we turn to Corollary 3, which is the result that is most widely-used throughout
the literature (Graepel, 2003; Särkkä, 2011; Särkkä et al., 2013; Cockayne et al., 2017; Raissi
et al., 2017; Agrell, 2019; Albert, 2019; Krämer et al., 2022). It shows how Gaussian processes
can be conditioned on point evaluations of the image of their paths under a linear operator,
provided that the linear operator is bounded and maps into a Hilbert function space, on
which point evaluation is continuous. Moreover, it shows that, under these conditions, the
image of the GP under the linear operator is itself a Gaussian process. Again, we introduce
some notation to facilitate stating the result.
Notation 2. Let k : X × X → R be a positive-definite kernel and let L_i : H_k → H_i for i = 1, 2 be bounded linear operators mapping into real RKHSs H_i ⊂ R^{X_i}. In analogy to Notation 1, we define the bivariate functions

k L_2^* : X × X_2 → R, (x, x_2) ↦ L_2[k(x, ·)](x_2),    (4.10)
L_1 k : X_1 × X → R, (x_1, x) ↦ L_1[k(·, x)](x_1),    (4.11)

and

L_1 k L_2^* : X_1 × X_2 → R, (x_1, x_2) ↦ L_2[(L_1 k)(x_1, ·)](x_2) = L_1[(k L_2^*)(·, x_2)](x_1).    (4.12)
Corollary 3. Let Assumption 1 hold such that H_L is an RKHS H_L ⊂ R^{X'}. Then

L[f] ∼ GP(L[m_f], L k_f L^*).    (4.13)

Let ε ∼ N(µ_ε, Σ_ε) with values in R^n and ε ⊥⊥ f. Then, for X' = {x'_i}_{i=1}^n ⊂ X' and y ∈ R^n,

f | L[f](X') + ε = y ∼ GP(m_{f|y}, k_{f|y})    (4.14)

with

m_{f|y}(x) := m_f(x) + ⟨(k_f L^*)(x, X'), ((L k_f L^*)(X', X') + Σ_ε)^† (y − (L[m_f](X') + µ_ε))⟩_{R^n}    (4.15)

and

k_{f|y}(x_1, x_2) := k_f(x_1, x_2) − ⟨(k_f L^*)(x_1, X'), ((L k_f L^*)(X', X') + Σ_ε)^† (L k_f)(X', x_2)⟩_{R^n}.    (4.16)

If additionally X = X', then

\begin{pmatrix} f \\ L[f] \end{pmatrix} ∼ GP\left( \begin{pmatrix} m_f \\ L[m_f] \end{pmatrix}, \begin{pmatrix} k_f & k_f L^* \\ L k_f & L k_f L^* \end{pmatrix} \right).    (4.17)
This corollary is the theoretical basis for Section 3 and most of Section 3.2. Note that, for L = id_H, we recover standard GP regression as a special case of Corollary 3.
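A minimal worked instance of Corollary 3 (illustrative; the kernel and observation locations are assumptions) is conditioning on point evaluations of the derivative, L = d/dx: for an RBF kernel, the bivariate functions k L^* and L k L^* from Notation 2 are available in closed form, so Equations (4.15) and (4.16) can be evaluated directly.

```python
# Minimal worked instance of Corollary 3 with L = d/dx (illustrative assumptions:
# zero prior mean, RBF kernel, three noisy derivative observations). The bivariate
# functions k L^* and L k L^* from Notation 2 are the standard closed-form
# derivative kernels of the RBF kernel.
import numpy as np

ell = 0.3
k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)
kL = lambda a, b: ((a[:, None] - b[None, :]) / ell**2) * k(a, b)        # d/dx' k(x, x')
LkL = lambda a, b: (1.0 / ell**2 - (a[:, None] - b[None, :]) ** 2 / ell**4) * k(a, b)

X_obs = np.array([-0.5, 0.0, 0.5])      # derivative observation locations X'
y = np.array([1.0, 0.0, -1.0])          # observed derivative values
Sigma = 1e-6 * np.eye(3)                # observation noise covariance

xs = np.linspace(-1.0, 1.0, 5)
G = LkL(X_obs, X_obs) + Sigma                              # (L k L^*)(X', X') + noise
post_mean = kL(xs, X_obs) @ np.linalg.solve(G, y)          # Equation (4.15), m_f = 0
post_cov = k(xs, xs) - kL(xs, X_obs) @ np.linalg.solve(G, kL(xs, X_obs).T)  # (4.16)
print(post_mean)
```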
Remark 4.1 (Multi-Output Gaussian Processes). Theorem 1 and Corollaries 2 and 3 also
apply if the GPs involved are multi-output GPs. In this case, the sample paths are functions
I × X → R with I = {1, . . . , d} by Definition B.6. In order to apply linear operators defined
on functions X → Rd , we interpret a sample path f (·, ω) : I × X → R as a function
f̃(·, ω) : X → R^d, x ↦ (f((i, x), ω))_{i=1}^d ∈ R^d.    (4.18)
5. Conclusion
In this work, we developed a probabilistic framework for the solution of (systems of) linear partial differential equations, which can be interpreted as physics-informed Gaussian
process regression. It enables the seamless fusion of (1) a priori known, provable properties of the system of interest, (2) exact and partial mechanistic information, (3) subjective domain expertise, as well as (4) noisy empirical measurements into a unified scientific model.
This model outputs a consistent uncertainty estimate, which quantifies the inherent approximation error in addition to the uncertainty arising from partially-known physics, as well
as limited-precision measurements. Our framework fundamentally relies on the closure of
Gaussian processes under conditioning on observations of their sample paths through an
arbitrary bounded linear operator. While this result has been used ubiquitously in the literature, a rigorous proof for linear operator observations, as needed in the PDE setting, did
not exist prior to this work to the best of our knowledge. By choosing a specific prior and
information operator in our framework, it recovers methods of weighted residuals, a popular
family of numerical methods for the solution of (linear) PDEs, which includes generalized
Galerkin methods such as finite element and spectral methods. This demonstrates that
classical linear PDE solvers can be generalized in their functionality to include approximate
input data and equipped with a structured uncertainty estimate. Our work outlines a general framework for the integration of mechanistic building blocks in the form of information
operators derived from e.g. linear PDEs into probabilistic models. Our case study shows
that the language of information operators is a powerful toolkit for aggregating heterogeneous sources of partial information in a joint probabilistic model, especially in the context
of physics-informed machine learning. This opens up several interesting lines of research.
For example, the choice of prior and information operator is not fixed and can be specifically chosen for the problem at hand. The design of adaptive information operators, which actively collect information based on the current belief about the solution, could prove to
be a promising research direction. Further, the uncertainty estimate about the solution
could be used to inform experimental design choices. For example, in the case study from
Section 3.2, the posterior belief can be used to optimize the locations of the digital thermal
sensors in future CPU designs. Finally, it remains an open question whether this framework
can be adapted to nonlinear partial differential equations in a similar manner to how many
classic methods solve a sequence of linearized problems to approximate the solution of a
nonlinear PDE.
Acknowledgments
MP, PH and JW gratefully acknowledge financial support by the European Research Council
through ERC StG Action 757275 / PANAMA; the DFG Cluster of Excellence “Machine
Learning - New Perspectives for Science”, EXC 2064/1, project number 390727645; the
German Federal Ministry of Education and Research (BMBF) through the Tübingen AI
Center (FKZ: 01IS18039A); and funds from the Ministry of Science, Research and Arts of
the State of Baden-Württemberg. The authors thank the International Max Planck Research
School for Intelligent Systems (IMPRS-IS) for supporting MP and JW.
Finally, the authors are grateful to Filip Tronarp for many helpful discussions concerning
the theoretical part of this work.
Appendix A. Proofs for Section 3.3
Proof of Example 3.4

P_Û^2[u] = P_Û[ \sum_{i=1}^m P_{R^m}[u]_i φ_i ]    (A.1)
         = \sum_{i=1}^m P_{R^m}[u]_i P_Û[φ_i]    (A.2)
         = \sum_{i=1}^m ( \sum_{j=1}^m P_{R^m}[φ_i]_j φ_j ) P_{R^m}[u]_i    (A.3)
         = \sum_{j=1}^m φ_j \sum_{i=1}^m P_{R^m}[φ_i]_j P_{R^m}[u]_i    (A.4)
         = \sum_{j=1}^m φ_j \sum_{i=1}^m ( \sum_{k=1}^m (P^{−1})_{jk} ⟨φ_k, φ_i⟩_{L^2} ) P_{R^m}[u]_i    (A.5)
         = \sum_{j=1}^m φ_j \sum_{i=1}^m ( \sum_{k=1}^m (P^{−1})_{jk} P_{ki} ) P_{R^m}[u]_i    (A.6)
         = \sum_{j=1}^m φ_j \sum_{i=1}^m (P^{−1} P)_{ji} P_{R^m}[u]_i    (A.7)
         = \sum_{j=1}^m φ_j P_{R^m}[u]_j    (A.8)
         = P_Û[u]    (A.9)
Proof of Lemma 3.2  By Corollary 2, we have

m_{u|B̂,l̂}(x) = m_u(x) + (B̂ P_{R^m})[k_u(x, ·)]^T ((B̂ P_{R^m}) k_u (B̂ P_{R^m})^*)^{−1} (l̂ − B̂ P_{R^m}[m_u])
             = m_u(x) + P_{R^m}[k_u(x, ·)]^T B̂^T (B̂ Σ_c B̂^T)^{−1} B̂ (B̂^{−1} l̂ − P_{R^m}[m_u])
             = m_u(x) + P_{R^m}[k_u(x, ·)]^T Σ_c^{−1} (B̂^{−1} l̂ − P_{R^m}[m_u]).

Since P_Û is a bounded projection, we have

U = ran(P_Û) ⊕ ker(P_Û)    (A.10)    (Rudin 1991, Section 5.16)
  = Û ⊕ ker(P_Û),    (A.11)

where each u ∈ U decomposes uniquely into u = u_Û + u_Û^c with u_Û ∈ Û and u_Û^c ∈ ker(P_Û). It is clear that

u_Û = P_Û[u]

and

u_Û^c = (id − P_Û)[u] = P_{ker(P_Û)}[u].

This implies

P_{R^m}[m_{u|B̂,l̂}] = P_{R^m}[m_u] + P_{R^m} k_u P_{R^m}^* Σ_c^{−1} (B̂^{−1} l̂ − P_{R^m}[m_u])
                    = P_{R^m}[m_u] + B̂^{−1} l̂ − P_{R^m}[m_u]    (since P_{R^m} k_u P_{R^m}^* = Σ_c)
                    = B̂^{−1} l̂
                    = c^{MWR}.

Hence, we have

P_Û[m_{u|B̂,l̂}] = \sum_{i=1}^m P_{R^m}[m_{u|B̂,l̂}]_i φ_i = \sum_{i=1}^m c^{MWR}_i φ_i = u_MWR    (A.12)

and since U = Û ⊕ ker(P_Û), the statement follows. Moreover, note that P_{R^m}[m_{u|B̂,l̂}] is the mean of c | B̂c − l̂ = 0 and its covariance matrix is given by

Σ_{c|B̂,l̂} = Σ_c − Σ_c B̂^T (B̂ Σ_c B̂^T)^{−1} B̂ Σ_c
          = Σ_c − Σ_c B̂^T (B̂^T)^{−1} Σ_c^{−1} B̂^{−1} B̂ Σ_c
          = Σ_c − Σ_c Σ_c^{−1} Σ_c
          = 0.

Consequently, c | B̂c − l̂ = 0 ∼ δ_{c^{MWR}}.
Proof of Corollary 3.3

P_{ker(P_Û)}[m_{u|B̂,l̂}](x)
  = P_{ker(P_Û)}[m_u](x) + (δ_x ∘ P_{ker(P_Û)}) k_u (B̂ P_{R^m})^* ((B̂ P_{R^m}) k_u (B̂ P_{R^m})^*)^{−1} (l̂ − B̂ P_{R^m}[m_u])
  = δ_x (P_{ker(P_Û)} k_u P_{R^m}^*) B̂^T ((B̂ P_{R^m}) k_u (B̂ P_{R^m})^*)^{−1} (l̂ − B̂ P_{R^m}[m_u])
  = 0,

where the first summand vanishes since m_u ∈ Û and the second since P_{ker(P_Û)} k_u P_{R^m}^* = 0.
Proof of Lemma 3.4  Since P_Û is idempotent, we have

P_{ker(P_Û)} P_Û = P_Û − P_Û^2 = P_Û − P_Û = 0

and

P_{R^m} P_Û = (I_{R^m}^Û)^{−1} P_Û^2 = (I_{R^m}^Û)^{−1} P_Û = P_{R^m}.

It follows that

P_{ker(P_Û)} k_{uu} P_{R^m}^* = P_{ker(P_Û)} k̃_{uu} P_{R^m}^* − P_{ker(P_Û)} P_Û k̃_{uu} P_{R^m}^* − P_{ker(P_Û)} k̃_{uu} P_Û^* P_{R^m}^* + 2 P_{ker(P_Û)} P_Û k̃_{uu} P_Û^* P_{R^m}^*
                              = P_{ker(P_Û)} k̃_{uu} P_{R^m}^* − P_{ker(P_Û)} k̃_{uu} (P_{R^m} P_Û)^*
                              = P_{ker(P_Û)} k̃_{uu} P_{R^m}^* − P_{ker(P_Û)} k̃_{uu} P_{R^m}^*
                              = 0,

where the second and fourth term in the first line vanish since P_{ker(P_Û)} P_Û = 0, and the last step uses P_{R^m} P_Û = P_{R^m}.
Appendix B. Proofs for Section 4
This appendix constitutes a proof of Theorem 1 and Corollaries 2 and 3. More precisely,
Appendices B.1, B.2 and B.2.2 introduce the objects needed to formalize these results, while
Appendices B.2.1, B.2.3 and B.3 develop the machinery used to conduct their proof which
is given in Appendix B.4.
In the following, B (Ω) denotes the Borel σ-algebra on some topological space Ω. Let
H be a Hilbert space. For a linear functional l ∈ H^*, l^* ∈ H denotes the unique vector for which l[h] = ⟨l^*, h⟩_H for all h ∈ H. Similarly, for h ∈ H, h^* ∈ H^* denotes the linear functional h^*[h'] = ⟨h, h'⟩_H for h' ∈ H.
B.1 Gaussian Processes
We start by reviewing the definition and basic properties of Gaussian processes.
Definition B.1. A Gaussian process (GP) f with index set X is a family {f_x}_{x∈X} of R-valued random variables on a common probability space (Ω, B(Ω), P), such that, for each
finite set of indices x1 , . . . , xn , the joint distribution of fx1 , . . . , fxn is Gaussian. We also
write f (x) := fx and f (x, ω) := fx (ω).
Definition B.2. Let f be a Gaussian process on (Ω, B (Ω) , P) with index set X . The
function
m : X → R, x 7→ m(x) = EP [f (x)]
is called the mean (function) of f and the function
k : X × X → R, (x1 , x2 ) 7→ k(x1 , x2 ) = CovP [f (x1 ), f (x2 )]
is called the covariance function or kernel of f . We also often write f ∼ GP (m, k) if f is
a Gaussian process with mean m and kernel k.
We commonly use Gaussian processes to model our belief about unknown functions,
which can be motivated by interpreting their sample paths as function-valued random variables:
Definition B.3. Let f be a Gaussian process on (Ω, B (Ω) , P) with index set X . For each
ω ∈ Ω, the function
f (·, ω) : X → R, x 7→ f (x, ω)
is called a (sample) path of the Gaussian process. The set paths (f ) := {f (·, ω) : ω ∈ Ω} ⊆
RX containing all sample paths of f is referred to as the path space of f .
Lemma B.4. Let f be a Gaussian process on (Ω, B (Ω) , P) with index set X . Consider the
function
fX : Ω → paths (f ) , ω 7→ f (·, ω).
If there is a σ-algebra on paths (f ) such that fX is measurable, then fX is a function-valued
random variable with values in paths (f ). In the following, we will refer to function-valued
random variables as random functions, in analogy to the concept of a random variable.
Using the rules of linear-Gaussian inference (Bishop, 2006), we can easily see that, for f ∼ GP(m, k),

A f(X) ∼ N(A m(X), A k(X, X) A^T)    and    f | A f(X) + b = y ∼ GP(m_{f|y}, k_{f|y}),

where A ∈ R^{m×n}, X = {x_i}_{i=1}^n ⊂ X, b ∼ N(µ, Λ) with b ⊥⊥ f, and

m_{f|y}(x) := m(x) + k(x, X) A^T (A k(X, X) A^T + Λ)^† (y − (A m(X) + µ))
k_{f|y}(x_1, x_2) := k(x_1, x_2) − k(x_1, X) A^T (A k(X, X) A^T + Λ)^† A k(X, x_2).
It is tempting to think that the above also extends to more general linear transformations of
f such as differentiation and integration. Unfortunately, this is not the case, since the result
from (Bishop, 2006) heavily uses the fact that, by definition, evaluations of the Gaussian
process at a finite set of points follow a joint Gaussian distribution. However, differentiation
and integration are examples of linear operators, i.e. linear maps between vector spaces of
functions, which operate on an (uncountably) infinite subset of the random variables.
To generalize the result above to linear operators L (or more generally affine maps)
acting on the paths of f , we need to analyze the objects L [f ] and f | L [f ] = h. By L [f ],
we denote the function ω 7→ L [f (·, ω)] = (L ◦ fX )(ω), which is a random variable if there
is a σ-algebra on paths (f ) and the image of L such that L and fX are measurable. If we
understand the joint law of fX and L [f ], we can compute the conditional random variable
fX | L [f ] = h and the conditional random process f | L [f ] = h. This outlines the proof
strategy we will follow below. Specifically, we will
1. gain an understanding of the structure of the GP’s path space paths (f ) in order to
be able to decide whether fX is a random function, i.e. measurable. We will focus on
cases, in which we can continuously embed paths (f ) into a separable Hilbert space H,
which is a measurable space with respect to B (H). This will be useful when applying
linear operators to the GP, since it helps decide whether paths (f ) lies in the domain
of the linear operator and whether the linear operator is measurable.
2. analyze the law of the random function fX in order to understand the belief about the
sample paths encoded in P and fX . If paths (f ) is (a subset of) a separable Hilbert
space H, then the law of fX will turn out to be a Gaussian measure on H.
3. analyze the law of the random functions L ◦ fX and (fX , L ◦ fX ). We will assume that
L maps into some separable Hilbert space HL . Since Gaussian random variables on
separable Hilbert spaces are closed under continuous affine transformations between
such spaces, L ◦ fX and (fX , L ◦ fX ) are also Gaussian if L : H 7→ HL is bounded.
4. compute the conditional Gaussian measure fX | L [f ] = h by marginalizing over L ◦ fX
in (fX , L ◦ fX ) | L ◦ fX = h.
5. show how to transform Gaussian random variables on separable Hilbert spaces into
Gaussian processes. With this result we are then able to transform L ◦ fX and fX |
L ◦ fX = h back into Gaussian processes.
Fortunately, the first point has already been extensively addressed in the literature. See
Kanagawa et al. (2018, Section 4) for an overview.
Remark B.5. Let f ∼ GP (m, k) be a Gaussian process with index set X and let Hk be
the reproducing kernel Hilbert space (RKHS) of the covariance function or kernel k. If
dim H_k = ∞, then the sample paths of f almost surely do not lie in H_k. Fortunately, in many cases, there exists a larger related RKHS H_{k_0} ⊃ H_k, which contains the sample paths with probability 1.^9 We refer to (Kanagawa et al., 2018, Section 4) and Steinwart (2019)
for more details on sample path properties.
In Appendix B.5, we have already seen that Sobolev spaces can be obtained as path
spaces of Gaussian processes with Matérn covariance functions.
B.1.1 Multi-output Gaussian Processes
The sample paths of Gaussian processes as defined in Definition B.1 are always real-valued.
However, especially in the context of PDEs, vector-valued functions are ubiquitous, e.g. when dealing with vector fields such as the electric field. Fortunately, the index set of a Gaussian process can be chosen freely, which means that we can “emulate” vector-valued GPs. More precisely, a function f : X → R^{d'} can be equivalently viewed as a function f' : {1, . . . , d'} × X → R, (i, x) ↦ f'(i, x) = f_i(x). Applying this construction to a Gaussian process leads to the following definition of a multi-output Gaussian process:
Definition B.6 (Multi-output Gaussian Process). A d-output Gaussian process f with index set X on (Ω, B(Ω), P) is a Gaussian process with index set X' := {1, . . . , d} × X on the same probability space. With a slight abuse of notation, we write f_x(ω) := (f_{(i,x)}(ω))_{i=1}^d ∈ R^d,
etc. We also write the mean and covariance functions m and k of f as m : X → Rd and
k : X × X → R^{d×d}, where

m(x) = (m(1, x), . . . , m(d, x))^T    and    k(x_1, x_2) = \begin{pmatrix} k((1, x_1), (1, x_2)) & \cdots & k((1, x_1), (d, x_2)) \\ \vdots & \ddots & \vdots \\ k((d, x_1), (1, x_2)) & \cdots & k((d, x_1), (d, x_2)) \end{pmatrix}.
B.2 Gaussian Measures on Separable Hilbert Spaces
As stated before, we need to understand the law of the random function f_X. This amounts to analyzing the pushforward measure µ := P ∘ f_X^{−1}. In many cases, µ will turn out to be a Gaussian probability measure on a (usually) infinite-dimensional separable Hilbert function space H ⊇ paths(f) (see Proposition B.22 and Lemma B.13).

Definition B.7. Let H be a real separable Hilbert space. A probability measure µ on (H, B(H)) is called Gaussian if ⟨h, ·⟩_H is a univariate Gaussian random variable for all h ∈ H. An H-valued random variable is called Gaussian if its law is Gaussian.
Just as for probability measures on Euclidean vector space Rn , we can define a mean
and covariance (operator) for this more general class of probability measures.
9. In practice, f is virtually always implicitly defined via m and k without ever constructing the function
fX and the probability space. Hence, we can always choose fX and Ω such that f ∼ GP (m, k) where
f ∈ Hk0 even holds pathwise, i.e. f (·, ω) ∈ Hk0 for all ω ∈ Ω, instead of just with probability 1.
Definition B.8. Let X be a random variable on (Ω, B(Ω), P) with values in a real separable Hilbert space H. If ⟨h, X(·)⟩_H ∈ L^1(Ω, P) for all h ∈ H, and there is m_X ∈ H such that

⟨h, m_X⟩_H = E_X[⟨h, X⟩_H] = ∫_Ω ⟨h, X(ω)⟩_H dP(ω)    (B.1)

for all h ∈ H, then m_X is called the mean (vector) of X. Let X' be another random variable on (Ω, B(Ω), P) with values in a real separable Hilbert space H' and mean m_{X'}. If ⟨h, X(·)⟩_H ∈ L^2(Ω, P) for all h ∈ H, ⟨h', X'(·)⟩_{H'} ∈ L^2(Ω, P) for all h' ∈ H', and there is a linear operator C_{X,X'} : H → H' such that

⟨h', C_{X,X'}[h]⟩_{H'} = Cov_{X,X'}[⟨h, X⟩_H, ⟨h', X'⟩_{H'}] = ∫_Ω ⟨h, X(ω) − m_X⟩_H ⟨h', X'(ω) − m_{X'}⟩_{H'} dP(ω)    (B.2)

for all h ∈ H and h' ∈ H', then C_{X,X'} is called the cross-covariance operator of X and X'. If X = X', then C_{X,X} is referred to as the covariance operator of X.
Remark B.9. One can show that the existence of the mean vector and (cross-)covariance operator already follows from the given conditions. More precisely, the mean m_X exists if ⟨h, X(·)⟩_H ∈ L^1(Ω, P) for all h ∈ H, and the (cross-)covariance operator exists if ⟨h, X(·)⟩_H ∈ L^2(Ω, P) for all h ∈ H and ⟨h', X'(·)⟩_{H'} ∈ L^2(Ω, P) for all h' ∈ H'.
Remark B.10. One can show that covariance operators are self-adjoint and positive. Moreover, covariance operators are in the trace class (Maniglia and Rhandi, 2004, Section 1.2)
and hence compact and bounded.
Remark B.11. The mean and the covariance operator of a Gaussian random variable with
values in a separable Hilbert space always exist and they identify its law uniquely (Maniglia
and Rhandi, 2004, Theorem 1.2.5). Conversely, for every self-adjoint, positive, trace-class
operator C : H → H and m ∈ H, there is a Gaussian measure with mean m and covariance
operator C. Hence, we also often write N (m, C) to denote Gaussian measures on separable
Hilbert spaces.
Using the notion of a Bochner integral (Yosida, 1995, Section V.5), we can also give an equivalent definition of the mean and covariance operator, which is more similar to the finite-dimensional counterpart. For our purposes, Bochner integrals have the favorable property that they commute with bounded linear operators, i.e. if f : Ω → V is a Bochner integrable function mapping a measure space (Ω, B(Ω), µ) into a Banach space V and L : V → U is a bounded linear operator between V and another Banach space U, then ω ↦ L[f(ω)] is Bochner integrable and

∫_B L[f(ω)] dµ(ω) = L[ ∫_B f(ω) dµ(ω) ]    (B.3)

for B ∈ B(Ω) (Yosida, 1995, Section V.5, Corollary 2).
Lemma B.12. Let X ∼ N(m_X, C_X) be a Gaussian random variable on (Ω, B(Ω), P) with values in a real separable Hilbert space H. Then X is Bochner P-integrable and the mean m_X of X is given by the following Bochner integral

m_X = ∫_Ω X(ω) dP(ω).    (B.4)

Let X' ∼ N(m_{X'}, C_{X'}) be another Gaussian random variable on (Ω, B(Ω), P) with values in a real separable Hilbert space H'. Then the function ω ↦ ⟨h, X(ω) − m_X⟩_H (X'(ω) − m_{X'}) is Bochner P-integrable for any h ∈ H and the cross-covariance operator C_{X,X'} of X and X' is given by the Bochner integral

C_{X,X'}[h] := ∫_Ω ⟨h, X(ω) − m_X⟩_H (X'(ω) − m_{X'}) dP(ω).    (B.5)
Proof  By Maniglia and Rhandi (2004, Theorem 1.2.5), we have that ‖X(·)‖_H ∈ L^2(Ω, P). Hence,

∫_Ω ‖X(ω)‖_H dP(ω) = ∫_Ω 1 · ‖X(ω)‖_H dP(ω) ≤ \sqrt{∫_Ω 1 dP(ω)} · \sqrt{∫_Ω ‖X(ω)‖_H^2 dP(ω)} < ∞

by the Cauchy-Schwarz inequality in L^2(Ω, P) and the fact that P is a probability measure. Moreover, X is measurable and H ⊃ ran(X) is separable, which means that X is strongly measurable (Yosida, 1995, Section V.4, Pettis’ Theorem). It follows that X is Bochner integrable (Yosida, 1995, Section V.5, Theorem 1) and that

⟨h, m_X⟩_H = ∫_Ω ⟨h, X(ω)⟩_H dP(ω) = ⟨h, ∫_Ω X(ω) dP(ω)⟩_H

for h ∈ H (Yosida, 1995, Section V.5, Corollary 2), since ⟨h, ·⟩_H is continuous.

The function ω ↦ ⟨h, X(ω) − m_X⟩_H (X'(ω) − m_{X'}) is clearly weakly measurable and, since H' is separable, also strongly measurable (Yosida, 1995, Section V.4, Pettis’ Theorem). By the triangle inequalities in H and H' and the fact that P is a probability measure, we have ‖X(·) − m_X‖_H ∈ L^2(Ω, P) and ‖X'(·) − m_{X'}‖_{H'} ∈ L^2(Ω, P). Hence, for h ∈ H,

∫_Ω ‖⟨h, X(ω) − m_X⟩_H (X'(ω) − m_{X'})‖_{H'} dP(ω)
  = ∫_Ω |⟨h, X(ω) − m_X⟩_H| ‖X'(ω) − m_{X'}‖_{H'} dP(ω)
  ≤ ‖h‖_H ∫_Ω ‖X(ω) − m_X‖_H ‖X'(ω) − m_{X'}‖_{H'} dP(ω)
  = ‖h‖_H ⟨‖X(·) − m_X‖_H, ‖X'(·) − m_{X'}‖_{H'}⟩_{L^2(Ω,P)}
  < ∞

by the Cauchy-Schwarz inequality in H. It follows that ω ↦ ⟨h, X(ω) − m_X⟩_H (X'(ω) − m_{X'}) is Bochner integrable for any h ∈ H (Yosida, 1995, Section V.5, Theorem 1) and that

⟨h', C_{X,X'}[h]⟩_{H'} = ∫_Ω ⟨h, X(ω) − m_X⟩_H ⟨h', X'(ω) − m_{X'}⟩_{H'} dP(ω) = ⟨h', ∫_Ω ⟨h, X(ω) − m_X⟩_H (X'(ω) − m_{X'}) dP(ω)⟩_{H'}

for any h ∈ H and h' ∈ H' (Yosida, 1995, Section V.5, Corollary 2), where we used the fact that ⟨h', ·⟩_{H'} is continuous.
B.2.1 Continuous Affine Transformations
Just as their finite-dimensional counterparts, Gaussian random variables with values in separable Hilbert spaces are closed under continuous affine transformations, and the expressions for
the transformed mean and covariance operator are analogous to the finite-dimensional case.
In the following, we will use this result to compute the law of L ◦ fX .
Lemma B.13. Let L : H1 → H2 be a bounded linear operator between real separable Hilbert
spaces H1 , H2 and let b ∈ H2 . Let X ∼ N (m, C) be an H1 -valued Gaussian random variable.
Then L [X(·)] + b ∼ N (L [m] + b, LCL∗ ).
Proof See Lemma 1.2.7 in Maniglia and Rhandi (2004).
B.2.2 Joint Gaussian Measures on Separable Hilbert Spaces
In order to compute fX | L ◦ fX = h, we need access to the joint distribution of fX and
L ◦ fX . Using Lemma B.13 to apply the linear operator h 7→ (h, L [h]) to the Gaussian
random function fX , it becomes apparent that this joint distribution can be described by a
Gaussian measure on a Cartesian product H × HL of separable Hilbert spaces, where HL is
the codomain of L.
Remark B.14. The Cartesian product H_× := H_1 × · · · × H_n of a finite family {H_i}_{i=1}^n of real Hilbert spaces equipped with elementwise addition and scalar multiplication is a real Hilbert space with respect to the inner product

⟨h, h'⟩_{H_×} := \sum_{i=1}^n ⟨h_i, h'_i⟩_{H_i}.

Additionally, if every H_i for i = 1, . . . , n is separable, then H_× is separable (Adams and Fournier, 2003, Theorem 1.23). Unless stated otherwise, we will always equip Cartesian products of Hilbert spaces with the Hilbert space structure described above.
Lemma B.15. Let i ∈ {1, . . . , n}. The i-th projection map Π_i : H_× → H_i, h ↦ h_i on H_× is a bounded linear operator and

Π_i^*[h_i] = (0, . . . , 0, h_i, 0, . . . , 0),

where h_i appears in the i-th position.

Proof  Let h = (h_1, . . . , h_n) ∈ H_×. Then ‖Π_i[h]‖_{H_i}^2 = ‖h_i‖_{H_i}^2 ≤ \sum_{j=1}^n ‖h_j‖_{H_j}^2 = ‖h‖_{H_×}^2 and

⟨h_i, Π_i[h']⟩_{H_i} = ⟨h_i, h'_i⟩_{H_i} = ⟨h_i, h'_i⟩_{H_i} + \sum_{j≠i} ⟨0, h'_j⟩_{H_j} = ⟨(0, . . . , 0, h_i, 0, . . . , 0), h'⟩_{H_×}

for all h' ∈ H_×.
Notation B.16. For linear operators L : H → H' between Cartesian products H = H_1 × · · · × H_n and H' = H'_1 × · · · × H'_m of real Hilbert spaces, we introduce the notation

L[(h_1, . . . , h_n)] = (L_{11}[h_1] + · · · + L_{1n}[h_n], . . . , L_{m1}[h_1] + · · · + L_{mn}[h_n]) =: \begin{pmatrix} L_{11} & \cdots & L_{1n} \\ \vdots & \ddots & \vdots \\ L_{m1} & \cdots & L_{mn} \end{pmatrix} [(h_1, . . . , h_n)],

with L_{ij} := Π'_i L Π_j^* : H_j → H'_i, where Π_i and Π'_i denote the i-th projection maps on H and H', respectively. Lemma B.15 implies that L_{ij} is bounded if L is bounded. Specifically, for a covariance operator L = C (i.e. H = H'), we know that C is bounded and hence all blocks C_{ij} of the covariance operator are bounded. One can show that C_{ij} is the cross-covariance operator between entries i and j of the tuple and hence C_{ij} = C_{ji}^*.
We will refer to a Gaussian measure on a Cartesian product of separable Hilbert spaces
as a joint Gaussian measure on separable Hilbert spaces. In the remainder of this section,
we will show that joint Gaussian measures on separable Hilbert spaces share some important
properties with their finite-dimensional counterparts.
First of all, we can use orthogonal projections to marginalize over variables in a random
vector whose law is a joint Gaussian measure on separable Hilbert spaces.
Corollary B.17 (Marginalization in Joint Gaussian Measures). Let H1 , H2 be real separable
Hilbert spaces and let X ∼ N (m, C) be an H1 × H2 -valued Gaussian random variable. Then
Xi ∼ N (mi , Cii ) for i ∈ {1, 2}.
Proof This follows from Lemma B.13, since Xi = Πi ◦ X and Πi is linear and bounded.
The statistical independence properties of joint Gaussian measures on Hilbert spaces are
also analogous to the finite-dimensional case.
Proposition B.18 (Independence in Joint Gaussian Measures). Let H1 , H2 be real separable
Hilbert spaces and let X1 and X2 be independent random variables on (Ω, B (Ω) , P) with
values in H1 and H2 , respectively, where X1 ∼ N (m1 , C1 ) and X2 ∼ N (m2 , C2 ). Then
X : Ω → H, ω 7→ (X1 (ω), X2 (ω)) is a Gaussian random variable on (Ω, B (Ω) , P) with mean
m = (m1 , m2 ) and covariance operator
C := \begin{pmatrix} C_1 & 0 \\ 0 & C_2 \end{pmatrix}.
Proof  Let H := H_1 × H_2 and h^* ∈ H^*. Then h^* = h_1^* + h_2^*, where h_i^* := h^* ∘ Π_i^* ∈ H_i^* for i ∈ {1, 2}. X_1 and X_2 are Gaussian, which implies that h_1^* ∘ X_1 and h_2^* ∘ X_2 are Gaussian. Moreover, h_1^* ∘ X_1 ⊥⊥ h_2^* ∘ X_2, because X_1 ⊥⊥ X_2. Since the sum of independent (univariate) Gaussian random variables is Gaussian, it follows that h^* ∘ X is Gaussian. Hence, X is Gaussian. Π_i for i ∈ {1, 2} is bounded and thus, by Lemma B.13, we have that m = (m_1, m_2), C_{11} = C_1, and C_{22} = C_2. Let µ, µ_1 and µ_2 be the laws of X, X_1 and X_2, respectively. Then, X_1 ⊥⊥ X_2 implies µ = µ_1 ⊗ µ_2 and hence, for h_2 ∈ H_2,

C_{12}[h_2] = Π_1 C Π_2^*[h_2]
            = Π_1[ ∫_H ⟨(0, h_2), h' − m⟩_H (h' − m) dµ(h') ]
            = ∫_H ⟨h_2, h'_2 − m_2⟩_{H_2} (h'_1 − m_1) dµ(h')    (Yosida 1995, Section V.5, Corollary 2)
            = ∫_{H_2} ∫_{H_1} ⟨h_2, h'_2 − m_2⟩_{H_2} (h'_1 − m_1) dµ_1(h'_1) dµ_2(h'_2)
            = ∫_{H_2} ⟨h_2, h'_2 − m_2⟩_{H_2} [ ∫_{H_1} (h'_1 − m_1) dµ_1(h'_1) ] dµ_2(h'_2) = 0,

since the inner integral over H_1 vanishes, and C_{21} = C_{12}^* = 0^* = 0.
Note that analogous versions of these results also hold in joint Gaussian measures with
more than two components. This follows from Lemma B.13 and the fact that there are
isometries between H1 × · · · × Hn and arbitrary reorderings and/or parenthesizations of the
Cartesian product.
B.2.3 Conditional Gaussian Measures on Separable Hilbert Spaces
At the heart of Theorem 1 is the conditional random process f | L [f ] = h. We will compute
this process by conditioning the joint Gaussian measure (fX , L ◦ fX ) on a given value of
its second component. To do so, our main workhorse will be a result by Owhadi and
Scovel (2018) who show how to condition Gaussian measures on an orthogonal direct sum
of separable Hilbert spaces on observations in one of the two subspaces, i.e. they show
how to compute X | X2 = t, where X = X1 + X2 is a Gaussian random variable with
values in H1 ⊕ H2 . Unfortunately, Owhadi and Scovel (2018) do not give explicit expressions
for the conditional mean and covariance operator. In the following, we add to Theorem
3.3 in Owhadi and Scovel (2018) by constructing explicit expressions for the mean and
covariance operator of the conditional measure, which resemble the well-known expressions
for conditional Gaussian measures on finite-dimensional Euclidean vector spaces.
Theorem B.19. Let H1, H2 be real separable Hilbert spaces and let H := H1 × H2. Let X be an H-valued Gaussian random variable with mean m = (m1, m2) and covariance operator
\[
C := \begin{pmatrix} C_{11} & C_{12} \\ C_{12}^* & C_{22} \end{pmatrix} : H \to H
\]
such that ran(C22) is closed. Then X | X2 = t for any t ∈ H2 is an H-valued Gaussian random variable with mean
\[
m_{X \mid X_2 = t} := \begin{pmatrix} m_1 + C_{12} C_{22}^\dagger [t - m_2] \\ t \end{pmatrix}
\tag{B.6}
\]
and covariance operator
\[
C_{X \mid X_2 = t} := \begin{pmatrix} C_{11} - C_{12} C_{22}^\dagger C_{12}^* & 0 \\ 0 & 0 \end{pmatrix}.
\tag{B.7}
\]
Proof  H is an orthogonal direct sum of the (separable) subspaces
\[
\hat{H}_1 := \{(h_1, 0) \mid h_1 \in H_1\} = \Pi_1^*[H_1]
\quad \text{and} \quad
\hat{H}_2 := \{(0, h_2) \mid h_2 \in H_2\} = \Pi_2^*[H_2].
\]
Let Π̂i := Πi|Ĥi : Ĥi → Hi for i ∈ {1, 2} be the restriction of Πi to Ĥi. Note that the Π̂i are unitary. We have m = m̂1 + m̂2, where m̂1 := Π̂∗1[m1] and m̂2 := Π̂∗2[m2]. Using the block-matrix notation for operators on orthogonal direct sums of Hilbert spaces from Owhadi and Scovel (2018) and Anderson and Trapp (1975), the covariance operator can be represented by
\[
C = \begin{pmatrix} \hat{\Pi}_1^* C_{11} \hat{\Pi}_1 & \hat{\Pi}_1^* C_{12} \hat{\Pi}_2 \\ (\hat{\Pi}_1^* C_{12} \hat{\Pi}_2)^* & \hat{\Pi}_2^* C_{22} \hat{\Pi}_2 \end{pmatrix}_{\hat{H}_1 \oplus \hat{H}_2}
=: \begin{pmatrix} \hat{C}_{11} & \hat{C}_{12} \\ \hat{C}_{12}^* & \hat{C}_{22} \end{pmatrix}_{\hat{H}_1 \oplus \hat{H}_2}.
\]
Let t ∈ H2. Note that
\[
X \mid X_2 = t \;=\; X \mid (0, X_2) = (0, t) \;=\; X \mid \hat{X}_2 = \hat{t},
\]
where X̂2 := Π̂∗2[X2] and t̂ := Π̂∗2[t]. By Theorem 3.3 in Owhadi and Scovel (2018), X | X̂2 = t̂ is Gaussian, its covariance operator is the short of C to Ĥ2 (Anderson and Trapp, 1975), and, if a C-symmetric oblique projection Q onto Ĥ2 (Owhadi and Scovel, 2018) of the form
\[
Q[\hat{h}_1 + \hat{h}_2] = \hat{Q}_{21}[\hat{h}_1] + \hat{h}_2 \in \hat{H}_2
\]
for some Q̂21 : Ĥ1 → Ĥ2 exists, then its mean is given by
\[
(\hat{m}_1 + \hat{Q}_{21}^*[\hat{t} - \hat{m}_2]) + \hat{t}.
\]
In the following, we will show that the expressions for mX|X2 =t and CX|X2 =t from Equations (B.6) and (B.7) are indeed equal to the mean and covariance operator of X | X̂2 = t̂,
respectively.
We will first show that Q̂21 := Ĉ22† Ĉ12∗ defines a C-symmetric oblique projection Q onto Ĥ2. Evidently, Q is idempotent, i.e. Q² = Q. In Notation B.16, we noted that C12 and C22 are bounded and hence Ĉ12 and Ĉ22 are bounded. Since ran(C22) is closed and Π̂2 is unitary, ran(Ĉ22) is closed. It follows from Theorem 3 in Ben-Israel and Greville (2003, Section 8.3) that the Moore-Penrose pseudoinverse Ĉ22† exists and is bounded. Q̂21 and Q are bounded because Ĉ22† and Ĉ12 are. Moreover, ran(Q) = Q̂21[Ĥ1] + Ĥ2 = Ĥ2, since ran(Q̂21) ⊂ ran(Ĉ22†) ⊂ Ĥ2. It remains to show that Q∗C = CQ. Ĉ22 is bounded and thus closed (Yosida, 1995, Section II.6), which means that Q̂21∗ = Ĉ12 (Ĉ22†)∗ = Ĉ12 (Ĉ22∗)† = Ĉ12 Ĉ22† by Theorem 2 (g) from Ben-Israel and Greville (2003, Section 8.3) and the fact that Ĉ22 is self-adjoint. Consequently, the adjoint of Q is given by Q∗[ĥ1 + ĥ2] = Ĉ12 Ĉ22†[ĥ2] + ĥ2, because
\[
\begin{aligned}
\big\langle \hat{h}_1 + \hat{h}_2,\ Q[\hat{h}'_1 + \hat{h}'_2] \big\rangle_H
&= \big\langle \hat{h}_1 + \hat{h}_2,\ \hat{Q}_{21}[\hat{h}'_1] + \hat{h}'_2 \big\rangle_H \\
&= \big\langle \hat{h}_2,\ \hat{Q}_{21}[\hat{h}'_1] + \hat{h}'_2 \big\rangle_H && (\hat{H}_1 \perp \hat{H}_2) \\
&= \big\langle \hat{Q}_{21}^*[\hat{h}_2],\ \hat{h}'_1 \big\rangle_H + \big\langle \hat{h}_2,\ \hat{h}'_2 \big\rangle_H \\
&= \big\langle \hat{Q}_{21}^*[\hat{h}_2],\ \hat{h}'_1 \big\rangle_H + \big\langle \hat{h}_2,\ \hat{h}'_1 \big\rangle_H + \big\langle \hat{Q}_{21}^*[\hat{h}_2],\ \hat{h}'_2 \big\rangle_H + \big\langle \hat{h}_2,\ \hat{h}'_2 \big\rangle_H && (\hat{H}_1 \perp \hat{H}_2) \\
&= \big\langle \hat{Q}_{21}^*[\hat{h}_2] + \hat{h}_2,\ \hat{h}'_1 + \hat{h}'_2 \big\rangle_H \\
&= \big\langle \hat{C}_{12} \hat{C}_{22}^\dagger[\hat{h}_2] + \hat{h}_2,\ \hat{h}'_1 + \hat{h}'_2 \big\rangle_H
\end{aligned}
\]
for all ĥ1 + ĥ2, ĥ′1 + ĥ′2 ∈ H. Since Ĉ22 is bounded, self-adjoint and positive, its square root Ĉ22^{1/2} exists and is also bounded, self-adjoint and positive (Bernau, 1968, Theorem 4). Moreover, we have
\[
\operatorname{ran}(\hat{C}_{12}^*) \subset \operatorname{ran}(\hat{C}_{22}^{1/2}) = \operatorname{ran}(\hat{C}_{22}),
\tag{B.8}
\]
where the inclusion follows from Theorem 3 in Anderson and Trapp (1975) and the equality holds due to the fact that ran(Ĉ22) is closed (Dixmier, 1949; Tarcsay, 2014). Let ĥ1 + ĥ2 ∈ H.
If ĥ2 ∈ ran(Ĉ22), then we indeed find
\[
\begin{aligned}
Q^*C[\hat{h}_1 + \hat{h}_2]
&= Q^*\big[(\hat{C}_{11}[\hat{h}_1] + \hat{C}_{12}[\hat{h}_2]) + (\hat{C}_{12}^*[\hat{h}_1] + \hat{C}_{22}[\hat{h}_2])\big] \\
&= \hat{C}_{12}\hat{C}_{22}^\dagger\big[\hat{C}_{12}^*[\hat{h}_1] + \hat{C}_{22}[\hat{h}_2]\big] + (\hat{C}_{12}^*[\hat{h}_1] + \hat{C}_{22}[\hat{h}_2]) \\
&= \hat{C}_{12}\big[\hat{C}_{22}^\dagger\hat{C}_{12}^*[\hat{h}_1] + \hat{C}_{22}^\dagger\hat{C}_{22}[\hat{h}_2]\big] + (\hat{C}_{12}^*[\hat{h}_1] + \hat{C}_{22}[\hat{h}_2]) \\
&= \hat{C}_{12}\big[\hat{C}_{22}^\dagger\hat{C}_{12}^*[\hat{h}_1] + \hat{h}_2\big] + \hat{C}_{22}\big[\hat{C}_{22}^\dagger\hat{C}_{12}^*[\hat{h}_1] + \hat{h}_2\big] \\
&\qquad\quad \big(\hat{h}_2 \in \operatorname{ran}(\hat{C}_{22}),\ \hat{C}_{22}\hat{C}_{22}^\dagger|_{\operatorname{ran}(\hat{C}_{22})} = \operatorname{id}_{\operatorname{ran}(\hat{C}_{22})} \text{ and } \operatorname{ran}(\hat{C}_{12}^*) \subset \operatorname{ran}(\hat{C}_{22}) \text{ by Equation (B.8)}\big) \\
&= C\big[\hat{C}_{22}^\dagger\hat{C}_{12}^*[\hat{h}_1] + \hat{h}_2\big] \\
&= CQ[\hat{h}_1 + \hat{h}_2].
\end{aligned}
\tag{B.9}
\]
Now consider a general ĥ2 ∈ Ĥ2. Since ran(Ĉ22) is closed, we have
\[
\hat{H}_2 = \operatorname{ran}(\hat{C}_{22}) \oplus \operatorname{ran}(\hat{C}_{22})^\perp
\]
(Yosida, 1995, Section III.1, Theorem 1), which implies that there is a unique additive decomposition ĥ2 = ĥ2^∥ + ĥ2^⊥ with ĥ2^∥ ∈ ran(Ĉ22) and ĥ2^⊥ ∈ ran(Ĉ22)^⊥ = ker(Ĉ22). Moreover, ĥ2^⊥ ∈ ran(Ĉ22)^⊥ ⊂ ran(Ĉ12∗)^⊥ = ker(Ĉ12) by Equation (B.8), and hence ĥ2^⊥ ∈ ker(C). This implies that Q∗C[ĥ2^⊥] = Q∗[0] = 0 = C[ĥ2^⊥] = CQ[ĥ2^⊥], and hence
\[
\begin{aligned}
Q^*C\big[\hat{h}_1 + (\hat{h}_2^\parallel + \hat{h}_2^\perp)\big]
&= Q^*C\big[\hat{h}_1 + \hat{h}_2^\parallel\big] + Q^*C\big[\hat{h}_2^\perp\big] \\
&= CQ\big[\hat{h}_1 + \hat{h}_2^\parallel\big] + CQ\big[\hat{h}_2^\perp\big] \\
&= CQ\big[\hat{h}_1 + (\hat{h}_2^\parallel + \hat{h}_2^\perp)\big]
\end{aligned}
\]
by Equation (B.9), since ĥ2^∥ ∈ ran(Ĉ22). This concludes the proof that Q with this choice of Q̂21 is a C-symmetric oblique projection onto Ĥ2. By Theorem 3.3 in Owhadi and Scovel (2018), it follows that the mean of X | X̂2 = t̂ is given by
\[
\begin{aligned}
(\hat{m}_1 + \hat{Q}_{21}^*[\hat{t} - \hat{m}_2]) + \hat{t}
&= (\hat{m}_1 + \hat{C}_{12}\hat{C}_{22}^\dagger[\hat{t} - \hat{m}_2]) + \hat{t} \\
&= \big(\hat{\Pi}_1\big[\hat{m}_1 + \hat{C}_{12}\hat{C}_{22}^\dagger[\hat{t} - \hat{m}_2]\big],\ \hat{\Pi}_2[\hat{t}]\big) \\
&= \big(m_1 + C_{12}(\hat{\Pi}_2\hat{C}_{22}^\dagger\hat{\Pi}_2^*)[t - m_2],\ t\big).
\end{aligned}
\]
Since C22 is bounded and ran(C22) is assumed to be closed, C22† exists and is bounded (Ben-Israel and Greville, 2003, Section 8.3, Theorem 3). Moreover, Π̂2 is unitary. This means that conditions (2), (3), and (4) from Theorem 3.1 in Bouldin (1973) hold and thus
\[
\begin{aligned}
\hat{\Pi}_2\hat{C}_{22}^\dagger\hat{\Pi}_2^* &= \hat{\Pi}_2(\hat{\Pi}_2^* C_{22} \hat{\Pi}_2)^\dagger\hat{\Pi}_2^* \\
&= \hat{\Pi}_2\hat{\Pi}_2^\dagger C_{22}^\dagger(\hat{\Pi}_2^*)^\dagger\hat{\Pi}_2^* && \text{(Bouldin 1973, Theorem 3.1)} \\
&= \hat{\Pi}_2\hat{\Pi}_2^{-1} C_{22}^\dagger\hat{\Pi}_2\hat{\Pi}_2^{-1} && (\hat{\Pi}_2 \text{ is unitary}) \\
&= C_{22}^\dagger.
\end{aligned}
\tag{B.10}
\]
This shows that mX|X2 =t is indeed the mean of X | X2 = t.
By Theorem 3 in Anderson and Trapp (1975), if Ĉ12∗ = Ĉ22^{1/2} A, then the short S(C) of C to S = Ĥ2 is given by
\[
S(C) = \begin{pmatrix} \hat{C}_{11} - A^*A & 0 \\ 0 & 0 \end{pmatrix}_{\hat{H}_1 \oplus \hat{H}_2}.
\]
Since Ĉ22^{1/2} is bounded and ran(Ĉ22^{1/2}) = ran(Ĉ22) is closed, the pseudoinverse (Ĉ22^{1/2})† exists and is bounded (Ben-Israel and Greville, 2003, Section 8.3, Theorem 3). Let A := (Ĉ22^{1/2})† Ĉ12∗. Then
\[
\hat{C}_{22}^{1/2} A = \hat{C}_{22}^{1/2} (\hat{C}_{22}^{1/2})^\dagger \hat{C}_{12}^* = \hat{C}_{12}^*,
\]
since Ĉ22^{1/2}(Ĉ22^{1/2})†|ran(Ĉ22^{1/2}) = id_ran(Ĉ22^{1/2}) (Ben-Israel and Greville, 2003, Section 8.3, Definition 1) and ran(Ĉ12∗) ⊂ ran(Ĉ22^{1/2}) by Equation (B.8). Moreover, A∗ = Ĉ12 ((Ĉ22^{1/2})†)∗ = Ĉ12 ((Ĉ22^{1/2})∗)†, and
\[
A^*A = \hat{C}_{12}\big((\hat{C}_{22}^{1/2})^*\big)^\dagger(\hat{C}_{22}^{1/2})^\dagger\hat{C}_{12}^*
= \hat{C}_{12}\big((\hat{C}_{22}^{1/2})^*\hat{C}_{22}^{1/2}\big)^\dagger\hat{C}_{12}^*
= \hat{C}_{12}\hat{C}_{22}^\dagger\hat{C}_{12}^*
\]
by Theorem 2 (g) and (j) in Ben-Israel and Greville (2003, Section 8.3) and the fact that Ĉ22^{1/2} is self-adjoint. Consequently,
\[
\begin{aligned}
S(C) &= \begin{pmatrix} \hat{C}_{11} - \hat{C}_{12}\hat{C}_{22}^\dagger\hat{C}_{12}^* & 0 \\ 0 & 0 \end{pmatrix}_{\hat{H}_1 \oplus \hat{H}_2} \\
&= \begin{pmatrix} \hat{\Pi}_1\big(\hat{C}_{11} - \hat{C}_{12}\hat{C}_{22}^\dagger\hat{C}_{12}^*\big)\hat{\Pi}_1^* & 0 \\ 0 & 0 \end{pmatrix} \\
&= \begin{pmatrix} C_{11} - C_{12}\,\hat{\Pi}_2\hat{C}_{22}^\dagger\hat{\Pi}_2^*\,C_{12}^* & 0 \\ 0 & 0 \end{pmatrix} \\
&= \begin{pmatrix} C_{11} - C_{12}C_{22}^\dagger C_{12}^* & 0 \\ 0 & 0 \end{pmatrix} && \text{(by Equation (B.10))} \\
&= C_{X \mid X_2 = t}.
\end{aligned}
\]
Remark B.20. One can show that ran(C22 ) being closed is equivalent to C22 having finite
rank.
By applying Corollary B.17 to the conditional random variable from Theorem B.19, we find that X1 | X2 = t is an H1-valued Gaussian random variable with mean
\[
m_{X_1 \mid X_2 = t} := m_1 + C_{12} C_{22}^\dagger [t - m_2]
\tag{B.11}
\]
and covariance operator
\[
C_{X_1 \mid X_2 = t} := C_{11} - C_{12} C_{22}^\dagger C_{12}^*.
\tag{B.12}
\]
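In the finite-dimensional case H1 = R^{n1} and H2 = R^{n2}, the range condition holds automatically and Equations (B.11) and (B.12) reduce to the familiar Gaussian conditioning formulas with the Moore-Penrose pseudoinverse in place of the inverse. The following minimal NumPy sketch illustrates this; the matrices, means and the observed value t are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Joint Gaussian (X1, X2) with dim(X1) = 2 and dim(X2) = 3.
m1, m2 = np.zeros(2), np.ones(3)
A = rng.standard_normal((5, 5))
C = A @ A.T                              # symmetric positive semi-definite
C11, C12, C22 = C[:2, :2], C[:2, 2:], C[2:, 2:]

t = np.array([0.5, -1.0, 2.0])           # observed value of X2

# Equations (B.11) and (B.12) with the Moore-Penrose pseudoinverse C22^+.
C22_pinv = np.linalg.pinv(C22)
m_cond = m1 + C12 @ C22_pinv @ (t - m2)  # conditional mean of X1 | X2 = t
C_cond = C11 - C12 @ C22_pinv @ C12.T    # conditional covariance

print(m_cond)
print(np.linalg.eigvalsh(C_cond))        # eigenvalues are >= 0 up to rounding
```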
B.3 Gaussian Processes as Gaussian Random Functions
As mentioned before, the function ω ↦ f(·, ω) is often a Gaussian random variable with
values in a separable Hilbert space of real-valued functions on X . In the following, we will
make this statement precise and also give expressions for the mean and covariance operator
of the Gaussian random variable, which will depend on the mean and covariance functions
of the Gaussian process, respectively.
Assumption B.21. Let f ∼ GP (m, k) be a Gaussian process with index set X on a
Borel probability space (Ω, B (Ω) , P), whose mean and sample paths lie in a real separable
RKHS10 H ⊂ RX with Hk ⊂ H, i.e. m ∈ H and paths (f ) ⊂ H.
Proposition B.22. Let Assumption B.21 hold. Then ω ↦ f(·, ω) is an H-valued Gaussian random variable whose mean is given by the mean function m of the Gaussian process f and whose covariance operator is given by
\[
C_k : H \to H, \quad h \mapsto C_k[h](x) = \langle k(x, \cdot), h \rangle_H.
\tag{B.13}
\]
10. Any Hilbert function space H ⊂ R^X with continuous point evaluation functionals δx : H → R is an RKHS with kernel kH(x1, x2) = ⟨δ∗x1, δ∗x2⟩H (Steinwart and Christmann, 2008).
Proof  By definition, f(x, ·) is a Gaussian random variable for every x ∈ X. Hence, Corollary 12 in Berlinet and Thomas-Agnan (2004, Chapter 4, Section 2, p. 195) ensures that ω ↦ f(·, ω) is Borel measurable and thus a random variable, which is Gaussian by Theorem 91 in Berlinet and Thomas-Agnan (2004, Chapter 4, Section 3.1, p. 196).
Since ω ↦ f(·, ω) is Gaussian and H is separable, by Lemma B.12, it remains to show that m and Ck fulfill
\[
m = \int_\Omega f(\cdot, \omega) \, \mathrm{d}P(\omega)
\quad \text{and} \quad
C_k[h] = \int_\Omega \langle h, f(\cdot, \omega) - m \rangle_H \, (f(\cdot, \omega) - m) \, \mathrm{d}P(\omega)
\]
for all h ∈ H, which are both well-defined Bochner integrals. Consequently, for x ∈ X, we find that
\[
m(x) = \int_\Omega f(x, \omega) \, \mathrm{d}P(\omega) = \delta_x\left[ \int_\Omega f(\cdot, \omega) \, \mathrm{d}P(\omega) \right],
\]
where the last equation holds by Corollary 2 from Yosida (1995, Section V.5), since δx is continuous. Hence, by Lemma B.12, m ∈ H is the mean of ω ↦ f(·, ω). Moreover, for x1, x2 ∈ X, we have
\[
k(x_1, x_2) = \int_\Omega (f(x_1, \omega) - m(x_1))(f(x_2, \omega) - m(x_2)) \, \mathrm{d}P(\omega)
= \delta_{x_2}\left[ \int_\Omega \langle \delta_{x_1}^*, f(\cdot, \omega) - m \rangle_H \, (f(\cdot, \omega) - m) \, \mathrm{d}P(\omega) \right],
\]
and hence, for any h ∈ H,
\[
\begin{aligned}
C_k[h](x) &= \langle k(x, \cdot), h \rangle_H \\
&= \left\langle h,\ \int_\Omega \langle \delta_x^*, f(\cdot, \omega) - m \rangle_H \, (f(\cdot, \omega) - m) \, \mathrm{d}P(\omega) \right\rangle_H \\
&= \int_\Omega \langle \delta_x^*, f(\cdot, \omega) - m \rangle_H \, \langle h, f(\cdot, \omega) - m \rangle_H \, \mathrm{d}P(\omega) \\
&= \delta_x\left[ \int_\Omega \langle h, f(\cdot, \omega) - m \rangle_H \, (f(\cdot, \omega) - m) \, \mathrm{d}P(\omega) \right],
\end{aligned}
\]
where we applied Corollary 2 from Yosida (1995, Section V.5) repeatedly. This shows that Ck is indeed the covariance operator of ω ↦ f(·, ω).
The correspondence from Proposition B.22 also holds in reverse in the sense that a Gaussian random variable h with values in a separable Hilbert space H, together with a set X∗ ⊂ H∗ of continuous linear functionals on H, induces a Gaussian process on the same probability space as h, whose paths are given by x∗ ↦ x∗[h(ω)].
Lemma B.23. Let f ∼ N(m, C) be a Gaussian random variable on (Ω, B(Ω), P) with values in a real separable Hilbert space H. For every set X ⊂ H, the family {⟨x, f(·)⟩H}x∈X is a Gaussian process on (Ω, B(Ω), P) with mean function x ↦ ⟨x, m⟩H and covariance function (x1, x2) ↦ ⟨x1, C[x2]⟩H.
Proof  Since f is Gaussian, ⟨x, f(·)⟩H is Gaussian for all x ∈ X. Let X = {xi}ni=1 ⊂ X and
\[
\langle X, \cdot \rangle_H : H \to \mathbb{R}^n, \quad h \mapsto (\langle X, h \rangle_H)_i := \langle x_i, h \rangle_H.
\]
Then ⟨X, ·⟩H is continuous and thus Borel measurable. It follows that the function fX := ⟨X, f(·)⟩H is an Rⁿ-valued random variable. Moreover, since ⟨v, ⟨X, ·⟩H⟩Rⁿ ∈ H∗ for all v ∈ Rⁿ, fX is Gaussian. All in all, it follows that {⟨x, f(·)⟩H}x∈X is a Gaussian process on (Ω, B(Ω), P). Moreover, its mean function is given by
\[
x \mapsto \int_\Omega \langle x, f(\omega) \rangle_H \, \mathrm{d}P(\omega) = \langle x, m \rangle_H
\]
by Equation (B.1) and its covariance function is given by
\[
(x_1, x_2) \mapsto \int_\Omega (\langle x_1, f(\omega) \rangle_H - \langle x_1, m \rangle_H)(\langle x_2, f(\omega) \rangle_H - \langle x_2, m \rangle_H) \, \mathrm{d}P(\omega) = \langle x_1, C[x_2] \rangle_H
\]
by Equation (B.2).
Note that, unlike before, H is not necessarily a space of functions and the sample paths
of the resulting process are, generally speaking, not contained in H. However, if H is an
RKHS of real-valued functions on some domain X, then all point evaluation functionals δx for x ∈ X are continuous and Lemma B.23 produces Gaussian processes in the spirit of Assumption B.21. Hence, the following corollary is a more accurate converse of Proposition B.22 than Lemma B.23.
Corollary B.24. Let f ∼ N(m, C) be a Gaussian random variable on (Ω, B(Ω), P) with values in a real separable RKHS H ⊂ R^X. Then the family {ω ↦ f(ω)(x)}x∈X is a Gaussian process on (Ω, B(Ω), P) with paths in H. Its mean and covariance functions are given by m and k(x1, x2) := C[δ∗x2](x1), respectively. With a slight abuse of notation, we also write f ∼ GP(m, k).
We can also establish a similar correspondence between joint Gaussian measures on
separable Hilbert spaces and multi-output Gaussian processes.
Proposition B.25. Let {Hi ⊂ R^X}ni=1 be a family of real separable RKHSs and let H := H1 × · · · × Hn. Let f ∼ N(m, C) on (Ω, B(Ω), P) with values in H. Then the family {ω ↦ f(ω)i(x)}(i,x)∈I×X with I = {1, . . . , n} is an n-output Gaussian process with index set X on (Ω, B(Ω), P). Its mean and covariance functions are given by (i, x) ↦ mi(x) and ((i1, x1), (i2, x2)) ↦ Ci1,i2[δ∗x2](x1), respectively.
Proof  Let H̃ := {(i, x) ↦ hi(x) : h ∈ H} ⊂ R^{I×X}. Then H̃, equipped with pointwise addition and scalar multiplication and the inner product
\[
\langle \tilde{h}, \tilde{h}' \rangle_{\tilde{H}} := \sum_{i=1}^n \langle \tilde{h}(i, \cdot), \tilde{h}'(i, \cdot) \rangle_{H_i},
\]
is a Hilbert space, and the linear map I : H → H̃, h ↦ I[h](i, x) = hi(x) is the canonical isometry between H and H̃. Lemma B.13 implies that I ◦ f is a Gaussian random variable with mean (i, x) ↦ I[m](i, x) = mi(x) and covariance operator ICI∗. Since the point evaluation functionals on all Hi are continuous, it follows that the point evaluation functionals on H̃ are continuous. Hence, by Corollary B.24, {ω ↦ f(ω)i(x)}(i,x)∈I×X is indeed a Gaussian process with mean function (i, x) ↦ mi(x) and covariance function
\[
\begin{aligned}
((i_1, x_1), (i_2, x_2)) &\mapsto ICI^*\big[\delta_{(i_2, x_2)}^*\big](i_1, x_1) \\
&= I\big[C \Pi_{i_2}^*\big[\delta_{x_2}^*\big]\big](i_1, x_1) \\
&= I\big[\big(C_{1, i_2}\big[\delta_{x_2}^*\big], \ldots, C_{n, i_2}\big[\delta_{x_2}^*\big]\big)\big](i_1, x_1) \\
&= C_{i_1, i_2}\big[\delta_{x_2}^*\big](x_1).
\end{aligned}
\]
B.4 Proofs of Theorem 1 and its Corollaries
Using the results from Appendices B.2 and B.3, particularly Proposition B.22, Theorem B.19,
and Corollary B.24, we can now conduct the proof of Theorem 1 and Corollaries 2 and 3 as
outlined in Appendix B.1. All three results share a common set of assumptions.
Assumption 1. Let f ∼ GP (mf , kf ) be a Gaussian process prior with index set X on the
Borel probability space (Ω, B (Ω) , P), whose mean function and sample paths lie in a real
separable RKHS H ⊂ RX with H ⊇ Hkf . Let L : H → HL be a bounded linear operator
mapping the paths of f into a separable Hilbert space HL .
In the most general case, the linear operator L maps into a space, which is either not a
function space or a function space on which point evaluation is not a continuous functional.
This happens for instance when applying the differential operator of highest possible order
on a Sobolev path space, since then the resulting object will be an L2 function, which is not
pointwise defined.
Theorem 1 (Affine Gaussian Process Inference). Let Assumption 1 hold. Then ω ↦ f(·, ω) is an H-valued Gaussian random variable with mean mf and covariance operator h ↦ Cf[h](x) = ⟨kf(x, ·), h⟩H. We also write f ∼ N(mf, Cf). Let ε ∼ N(mε, Cε) be an HL-valued Gaussian random variable with ε ⊥⊥ f. Then
\[
\begin{pmatrix} f \\ L[f] + \epsilon \end{pmatrix}
\sim \mathcal{N}\left(
\begin{pmatrix} m_f \\ L[m_f] + m_\epsilon \end{pmatrix},\
\begin{pmatrix} C_f & C_f L^* \\ LC_f & LC_f L^* + C_\epsilon \end{pmatrix}
\right)
\tag{4.1}
\]
with values in H × HL and hence
\[
L[f] + \epsilon \sim \mathcal{N}(L[m_f] + m_\epsilon,\ LC_f L^* + C_\epsilon).
\tag{4.2}
\]
If ran(LCfL∗ + Cε) is closed, then, for all y ∈ HL,
\[
f \mid L[f] + \epsilon = y \sim \mathcal{GP}\big(m_{f \mid y}, k_{f \mid y}\big),
\tag{4.3}
\]
where the conditional mean and covariance function are given by
\[
m_{f \mid y}(x) = m_f(x) + \big\langle L[k_f(\cdot, x)],\ (LC_f L^* + C_\epsilon)^\dagger [y - (L[m_f] + m_\epsilon)] \big\rangle_{H_L}
\tag{4.4}
\]
and
\[
k_{f \mid y}(x_1, x_2) = k_f(x_1, x_2) - \big\langle L[k_f(\cdot, x_1)],\ (LC_f L^* + C_\epsilon)^\dagger L[k_f(\cdot, x_2)] \big\rangle_{H_L},
\tag{4.5}
\]
respectively.
Proof  f ∼ N(mf, Cf) follows from Proposition B.22. By Proposition B.18 we know that
\[
\begin{pmatrix} f \\ \epsilon \end{pmatrix}
\sim \mathcal{N}\left(
\begin{pmatrix} m_f \\ m_\epsilon \end{pmatrix},\
\begin{pmatrix} C_f & 0 \\ 0 & C_\epsilon \end{pmatrix}
\right)
\]
with values in H× = H × HL. Moreover, the map (h, hε) ↦ (h, L[h] + hε) is realized by the bounded linear operator
\[
\tilde{L} := \begin{pmatrix} \operatorname{id}_H & 0 \\ L & \operatorname{id}_{H_L} \end{pmatrix} : H_\times \to H_\times.
\]
Hence, Equation (4.1) follows from Lemma B.13 and Equation (4.2) follows from Corollary B.17. Under the assumption that ran(LCfL∗ + Cε) is closed, Theorem B.19 and Corollary B.17 imply that
\[
f \mid L[f] + \epsilon = y \sim \mathcal{N}\big(\tilde{m}_{f \mid y}, C_{f \mid y}\big)
\]
with
\[
\tilde{m}_{f \mid y} = m_f + C_f L^*(LC_f L^* + C_\epsilon)^\dagger[y - (L[m_f] + m_\epsilon)]
\quad \text{and} \quad
C_{f \mid y} = C_f - C_f L^*(LC_f L^* + C_\epsilon)^\dagger LC_f.
\]
Since point evaluation functionals on H are continuous, Corollary B.24 shows that {(f(x, ·) | L[f] + ε = y)}x∈X is a Gaussian process with mean function
\[
\begin{aligned}
\tilde{m}_{f \mid y}(x) &= m_f(x) + \big[C_f L^*(LC_f L^* + C_\epsilon)^\dagger[y - (L[m_f] + m_\epsilon)]\big](x) \\
&= m_f(x) + \big\langle L[k_f(\cdot, x)],\ (LC_f L^* + C_\epsilon)^\dagger[y - (L[m_f] + m_\epsilon)] \big\rangle_{H_L} \\
&= m_{f \mid y}(x)
\end{aligned}
\]
and covariance function
\[
\begin{aligned}
C_{f \mid y}\big[\delta_{x_2}^*\big](x_1) &= C_f\big[\delta_{x_2}^*\big](x_1) - \big[C_f L^*(LC_f L^* + C_\epsilon)^\dagger L\big[C_f\big[\delta_{x_2}^*\big]\big]\big](x_1) \\
&= k_f(x_1, x_2) - \big\langle L[k_f(\cdot, x_1)],\ (LC_f L^* + C_\epsilon)^\dagger L[k_f(\cdot, x_2)] \big\rangle_{H_L} \\
&= k_{f \mid y}(x_1, x_2),
\end{aligned}
\]
since Cf[δ∗x2](x1) = ⟨kf(x1, ·), δ∗x2⟩H = kf(x1, x2) and
\[
C_f L^*[h](x) = C_f[L^*[h]](x) = \langle k_f(x, \cdot), L^*[h] \rangle_{H} = \langle L[k_f(x, \cdot)], h \rangle_{H_L}
\]
for h ∈ HL. This proves Equations (4.3) to (4.5).
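As a sanity check, when HL = Rⁿ, L collects point evaluations L[f] = (f(x1), . . . , f(xn)) and ε has covariance σ²I, Equations (4.4) and (4.5) reduce to standard GP regression. The following NumPy sketch spells this out for a zero-mean prior; the squared-exponential kernel, lengthscale and data are arbitrary illustrative choices, not the setup of any experiment in this paper.

```python
import numpy as np

def k(x1, x2, l=0.2):
    """Squared-exponential kernel, chosen only for illustration."""
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / l**2)

rng = np.random.default_rng(1)
X_obs = np.linspace(0.0, 1.0, 8)                 # L[f] = (f(x_1), ..., f(x_n))
y = np.sin(2 * np.pi * X_obs) + 0.05 * rng.standard_normal(8)
sigma2 = 0.05**2                                 # noise covariance C_eps = sigma2 * I

X_test = np.linspace(0.0, 1.0, 200)

G = k(X_obs, X_obs) + sigma2 * np.eye(8)         # L k L* + C_eps
kxX = k(X_test, X_obs)                           # (k L*)(x, .) for each test point x

# Equations (4.4) and (4.5) with zero prior mean m_f = 0.
G_pinv = np.linalg.pinv(G)
post_mean = kxX @ G_pinv @ y
post_cov = k(X_test, X_test) - kxX @ G_pinv @ kxX.T
post_std = np.sqrt(np.clip(np.diag(post_cov), 0.0, None))
```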
The first corollary deals with the case where we observe the GP through a finite number of linear functionals. This happens when conditioning on integral observations or on
(Galerkin) projections as in Section 3.3.
Corollary 2. Let Assumption 1 hold for HL = Rⁿ and let ε ∼ N(µε, Σε) be an Rⁿ-valued Gaussian random variable with ε ⊥⊥ f. Then
\[
L[f] + \epsilon \sim \mathcal{N}(L[m_f] + \mu_\epsilon,\ Lk_fL^* + \Sigma_\epsilon)
\tag{4.6}
\]
and, for any y ∈ Rⁿ,
\[
f \mid L[f] + \epsilon = y \sim \mathcal{GP}\big(m_{f \mid y}, k_{f \mid y}\big),
\tag{4.7}
\]
with conditional mean and covariance function given by
\[
m_{f \mid y}(x) = m_f(x) + \big\langle L[k_f(x, \cdot)],\ (Lk_fL^* + \Sigma_\epsilon)^\dagger \big(y - (L[m_f] + \mu_\epsilon)\big) \big\rangle_{\mathbb{R}^n}
\tag{4.8}
\]
and
\[
k_{f \mid y}(x_1, x_2) = k_f(x_1, x_2) - \big\langle L[k_f(x_1, \cdot)],\ (Lk_fL^* + \Sigma_\epsilon)^\dagger L[k_f(\cdot, x_2)] \big\rangle_{\mathbb{R}^n}.
\tag{4.9}
\]
To prove Corollary 2, we first need to show that LCf L∗ = Lkf L∗ . We will prove a slightly
more general result, for which the following generalization of Notation 1 will prove useful.
Notation B.26. Let H1 ⊆ R^{X1} and H2 ⊆ R^{X2} be Hilbert spaces and let k : X1 × X2 → R such that k(·, x2) ∈ H1 for all x2 ∈ X2 and k(x1, ·) ∈ H2 for all x1 ∈ X1. Let Li : Hi → R^{ni} for i = 1, 2 be linear. By L1k, kL∗2 and L1kL∗2,¹¹ we denote the functions
\[
L_1k : X_2 \to \mathbb{R}^{n_1}, \quad x_2 \mapsto L_1[k(\cdot, x_2)],
\qquad
kL_2^* : X_1 \to \mathbb{R}^{n_2}, \quad x_1 \mapsto L_2[k(x_1, \cdot)],
\]
and the matrix L1kL∗2 ∈ R^{n1 × n2} with entries (L1kL∗2)ij := L1[(kL∗2)j]i, respectively.
Lemma B.27. Let H1 ⊆ R^{X1} and H2 ⊆ R^{X2} be RKHSs. Let k : X1 × X2 → R such that k(·, x2) ∈ H1 for all x2 ∈ X2 and k(x1, ·) ∈ H2 for all x1 ∈ X1, and let K : H2 → H1, K[h2](x1) = ⟨k(x1, ·), h2⟩H2. Finally, let L1 : H1 → R^{n1} and L2 : H2 → R^{n2} be linear and bounded. Then

(i) the adjoint of K is given by
\[
K^* : H_1 \to H_2, \quad K^*[h_1](x_2) = \langle k(\cdot, x_2), h_1 \rangle_{H_1},
\tag{B.14}
\]

(ii) and we have
\[
(L_1K)[h_2]_i = \langle (L_1k)_i, h_2 \rangle_{H_2}
\tag{B.15}
\]
for all h2 ∈ H2, and
\[
(KL_2^*)[v](x_1) = \langle (kL_2^*)(x_1), v \rangle_{\mathbb{R}^{n_2}}
\tag{B.16}
\]
for all v ∈ R^{n2}, and

(iii) L1KL∗2 ∈ R^{n1 × n2} with
\[
\begin{aligned}
(L_1KL_2^*)_{ij} &= L_2[(L_1k)_i]_j && \text{(B.17)} \\
&= L_1[(kL_2^*)_j]_i && \text{(B.18)} \\
&= (L_1kL_2^*)_{ij}. && \text{(B.19)}
\end{aligned}
\]

11. The omission of parentheses in L1kL∗2 is motivated by Equations (B.17) and (B.18) from Lemma B.27, which show that the order in which L1 and L2 are applied to k is irrelevant.
Proof
• (B.14): Let h1 ∈ H1 and x2 ∈ X2. Then
\[
K^*[h_1](x_2) = \langle \delta_{x_2}^*, K^*[h_1] \rangle_{H_2} = \langle h_1, K[\delta_{x_2}^*] \rangle_{H_1}
\quad \text{and} \quad
K[\delta_{x_2}^*](x_1) = \langle k(x_1, \cdot), \delta_{x_2}^* \rangle_{H_2} = k(x_1, x_2)
\]
for all x1 ∈ X1. This means that K∗[h1](x2) = ⟨h1, k(·, x2)⟩H1. Evidently, dom(K∗) = H1.
• (B.15): L1[·]i is a bounded linear functional and hence, by the Riesz representation theorem (Yosida, 1995, Section III.6), there is hL1,i ∈ H1 such that L1[h1]i = ⟨hL1,i, h1⟩H1 for all h1 ∈ H1. It follows that
\[
(L_1K)[h_2]_i = L_1[K[h_2]]_i = \langle h_{L_1,i}, K[h_2] \rangle_{H_1} = \langle K^*[h_{L_1,i}], h_2 \rangle_{H_2}
\]
for all h2 ∈ H2 and
\[
K^*[h_{L_1,i}](x_2) = \langle h_{L_1,i}, k(\cdot, x_2) \rangle_{H_1} = L_1[k(\cdot, x_2)]_i = (L_1k)_i(x_2)
\]
for all x2 ∈ X2. Hence, (L1K)[h]i = ⟨(L1k)i, h⟩H2 for all h ∈ H2.
• (B.16): Let v ∈ R^{n2} and x1 ∈ X1. Then
\[
(KL_2^*)[v](x_1) = \langle k(x_1, \cdot), L_2^*[v] \rangle_{H_2} = \langle L_2[k(x_1, \cdot)], v \rangle_{\mathbb{R}^{n_2}} = \langle (kL_2^*)(x_1), v \rangle_{\mathbb{R}^{n_2}}.
\]
• (B.17): Let ej ∈ R^{n2} such that ⟨ej, v⟩R^{n2} = vj. Then we have
\[
(L_1KL_2^*)_{ij} = L_1[K[L_2^*[e_j]]]_i = \langle (L_1k)_i, L_2^*[e_j] \rangle_{H_2} = \langle L_2[(L_1k)_i], e_j \rangle_{\mathbb{R}^{n_2}} = L_2[(L_1k)_i]_j
\]
by Equation (B.15).
• (B.18): Let ej ∈ R^{n2} such that ⟨ej, v⟩R^{n2} = vj. Then we have
\[
(L_1KL_2^*)_{ij} = L_1[(KL_2^*)[e_j]]_i = L_1[(kL_2^*)_j]_i
\]
by Equation (B.16).
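The order-independence expressed by Equations (B.17) and (B.18) is easy to check numerically when L1 and L2 are weighted sums of point evaluations, a representative (and here arbitrary) choice of bounded linear functionals; the two ways of assembling L1kL∗2 then correspond to the two associations of a matrix product.

```python
import numpy as np

def k(x1, x2, l=0.3):
    # Squared-exponential kernel (illustrative choice).
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / l**2)

rng = np.random.default_rng(2)
xi1, xi2 = rng.uniform(size=30), rng.uniform(size=40)   # evaluation nodes in X1, X2
W1, W2 = rng.standard_normal((3, 30)), rng.standard_normal((5, 40))

# L1[h] = W1 @ h(xi1) and L2[h] = W2 @ h(xi2), so that
# (L1 k)(x) = W1 @ k(xi1, x) and (k L2*)(x) = k(x, xi2) @ W2.T (Notation B.26).
via_B17 = (W1 @ k(xi1, xi2)) @ W2.T   # apply L2 to the rows of L1 k, Eq. (B.17)
via_B18 = W1 @ (k(xi1, xi2) @ W2.T)   # apply L1 to the columns of k L2*, Eq. (B.18)

assert np.allclose(via_B17, via_B18)  # same matrix L1 k L2*
```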
Corollary B.28. Let the assumptions of Lemma B.27 hold such that X := X1 = X2 and
H := H1 = H2 and let k be symmetric. Then K is self-adjoint. If n := n1 = n2 and
L := L1 = L2 , then LkL∗ ∈ Rn×n is symmetric. If K is additionally positive-(semi)definite,
then LkL∗ is positive-(semi)definite.
Proof  By the symmetry of k, for h ∈ H and x ∈ X, we have K∗[h](x) = ⟨h, k(·, x)⟩H = ⟨k(x, ·), h⟩H = K[h](x), i.e. K is symmetric. Obviously, H = dom(K) = dom(K∗). Consequently, K is self-adjoint. This implies that (LKL∗)ᵀ = (LKL∗)∗ = (L∗)∗K∗L∗ = LKL∗. Finally, if K is positive-semidefinite, then ⟨v, LKL∗[v]⟩Rⁿ = ⟨L∗[v], K[L∗[v]]⟩H ≥ 0 for all v ∈ Rⁿ, where the inequality is strict if K is (strictly) positive-definite.
Proof of Corollary 2  By Lemma B.27 we know that LCfL∗ = LkfL∗ and hence Equation (4.6) follows from Equation (4.2) in Theorem 1. Moreover, ran(LkfL∗ + Σε) is closed, since it is finite-dimensional. This means that Equations (4.7) to (4.9) also follow from Theorem 1.
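To illustrate Corollary 2, the following NumPy sketch conditions a zero-mean GP on observation functionals of the form Li[f] = Σj wij f(ξj), i.e. weighted sums of point evaluations, which arise for instance as quadrature approximations of integral or Galerkin projection observations. The kernel, quadrature rule, weights and data below are illustrative assumptions rather than choices made in the main text.

```python
import numpy as np

def k(x1, x2, l=0.25):
    # Squared-exponential kernel (illustrative choice).
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / l**2)

# Observation functionals L_i[f] = sum_j W[i, j] * f(xi[j]), here simple
# midpoint-style approximations of bin averages of f over [0, 0.25), [0.25, 0.5), ...
xi = np.linspace(0.0, 1.0, 41)                  # evaluation nodes
W = np.zeros((4, xi.size))
for i in range(4):
    mask = (xi >= 0.25 * i) & (xi < 0.25 * (i + 1))
    W[i, mask] = 1.0 / mask.sum()               # average of f over the i-th bin

y = np.array([0.1, 0.4, -0.2, -0.5])            # observed values of L[f] + eps
Sigma_eps = 1e-4 * np.eye(4)

# Since each L_i is a weighted sum of point evaluations, Lk and LkL* are the
# corresponding weighted sums of kernel evaluations (Notation B.26).
K_xi = k(xi, xi)
LkL = W @ K_xi @ W.T                            # L k L*
x_test = np.linspace(0.0, 1.0, 200)
kL = k(x_test, xi) @ W.T                        # (k L*)(x, .)

G_pinv = np.linalg.pinv(LkL + Sigma_eps)
post_mean = kL @ G_pinv @ y                     # Equation (4.8) with m_f = 0
post_cov = k(x_test, x_test) - kL @ G_pinv @ kL.T   # Equation (4.9)
```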
Finally, we address the archetypical case in which both the prior f and the prior predictive L[f] + ε are Gaussian processes. This happens if the linear operator maps into a function space in which point evaluation is continuous. In this article, this case occurred in Sections 3.1 and 3.2, where we inferred the strong solution of a PDE from observations of the PDE residual at a finite number of domain points.
Corollary 3. Let Assumption 1 hold such that HL is an RKHS HL ⊂ R^{X′}. Then
\[
L[f] \sim \mathcal{GP}(L[m_f],\ Lk_fL^*).
\tag{4.13}
\]
Let ε ∼ N(µε, Σε) with values in Rⁿ and ε ⊥⊥ f. Then, for X′ = {x′i}ni=1 ⊂ X′ and y ∈ Rⁿ,
\[
f \mid L[f](X') + \epsilon = y \sim \mathcal{GP}\big(m_{f \mid y}, k_{f \mid y}\big)
\tag{4.14}
\]
with
\[
m_{f \mid y}(x) := m_f(x) + \big\langle (k_fL^*)(x, X'),\ \big((Lk_fL^*)(X', X') + \Sigma_\epsilon\big)^\dagger \big(y - (L[m_f](X') + \mu_\epsilon)\big) \big\rangle_{\mathbb{R}^n}
\tag{4.15}
\]
and
\[
k_{f \mid y}(x_1, x_2) := k_f(x_1, x_2) - \big\langle (k_fL^*)(x_1, X'),\ \big((Lk_fL^*)(X', X') + \Sigma_\epsilon\big)^\dagger (Lk_f)(X', x_2) \big\rangle_{\mathbb{R}^n}.
\tag{4.16}
\]
If additionally X = X′, then
\[
\begin{pmatrix} f \\ L[f] \end{pmatrix} \sim \mathcal{GP}\left(
\begin{pmatrix} m_f \\ L[m_f] \end{pmatrix},\
\begin{pmatrix} k_f & k_fL^* \\ Lk_f & Lk_fL^* \end{pmatrix}
\right).
\tag{4.17}
\]
Proof  Since point evaluation on HL is continuous, we have
\[
(LC_fL^*)\big[\delta_{x'_2}^*\big](x'_1) = (Lk_fL^*)(x'_1, x'_2)
\]
by Lemma B.27. Consequently, Equation (4.13) follows from Equation (4.2) in Theorem 1 and Corollary B.24. Moreover, ran((LkfL∗)(X′, X′) + Σε) is closed, since it is finite-dimensional. This means that Equations (4.14) to (4.16) also follow from Theorem 1. Finally, Equation (4.17) follows from Equation (4.1) in Theorem 1 and Proposition B.25, where we used that, by Lemma B.27,
\[
(C_fL^*)\big[\delta_{x'_2}^*\big](x_1) = (k_fL^*)(x_1, x'_2), \qquad
(LC_f)\big[\delta_{x_2}^*\big](x'_1) = (Lk_f)(x'_1, x_2), \qquad \text{and} \qquad
(LC_fL^*)\big[\delta_{x'_2}^*\big](x'_1) = (Lk_fL^*)(x'_1, x'_2).
\]
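As a concrete instance of Corollary 3 in the spirit of Sections 3.1 and 3.2, the sketch below conditions a zero-mean GP prior on the boundary values and on point evaluations of the PDE residual of the one-dimensional Poisson problem −u″ = f on (0, 1) with u(0) = u(1) = 0 and f(x) = π² sin(πx), whose solution is u(x) = sin(πx). A squared-exponential kernel with an arbitrary lengthscale is used purely so that the required kernel derivatives have short closed forms; Section 3.2 uses a Matérn covariance function instead.

```python
import numpy as np

l = 0.2  # lengthscale of the squared-exponential kernel (illustrative)

def k(x, y):
    """k(x, y) = exp(-(x - y)^2 / (2 l^2))."""
    r = x[:, None] - y[None, :]
    return np.exp(-0.5 * r**2 / l**2)

def kD(x, y):
    """(k D*)(x, y) = -d^2/dy^2 k(x, y) for D = -d^2/dx^2 (even in x - y)."""
    r = x[:, None] - y[None, :]
    return np.exp(-0.5 * r**2 / l**2) * (1.0 / l**2 - r**2 / l**4)

def DkD(x, y):
    """(D k D*)(x, y) = d^2/dx^2 d^2/dy^2 k(x, y)."""
    r = x[:, None] - y[None, :]
    return np.exp(-0.5 * r**2 / l**2) * (3.0 / l**4 - 6.0 * r**2 / l**6 + r**4 / l**8)

# Poisson problem -u'' = f on (0, 1), u(0) = u(1) = 0, f(x) = pi^2 sin(pi x).
x_b = np.array([0.0, 1.0])                      # boundary points
x_c = np.linspace(0.05, 0.95, 15)               # collocation points
y = np.concatenate([np.zeros(2), np.pi**2 * np.sin(np.pi * x_c)])

# Gram matrix L k L* for L[u] = (u(0), u(1), -u''(x_c1), ..., -u''(x_cm)),
# assembled block-wise in the sense of Notation B.26.
G = np.block([[k(x_b, x_b),  kD(x_b, x_c)],
              [kD(x_c, x_b), DkD(x_c, x_c)]])

x = np.linspace(0.0, 1.0, 200)                  # evaluation grid
kL = np.hstack([k(x, x_b), kD(x, x_c)])         # (k L*)(x, .)

G_pinv = np.linalg.pinv(G + 1e-10 * np.eye(G.shape[0]))
u_mean = kL @ G_pinv @ y                        # posterior mean, Eq. (4.15), m_f = 0
u_cov = k(x, x) - kL @ G_pinv @ kL.T            # posterior covariance, Eq. (4.16)

# Deviation of the posterior mean from the true solution sin(pi x).
print(np.max(np.abs(u_mean - np.sin(np.pi * x))))
```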
B.5 On Prior Selection
A typical choice for the solution space U of a linear PDE, especially in the context of weak solutions (see Section 2.1.1), is a Sobolev space (Adams and Fournier, 2003). Unfortunately, it is impossible to formulate a Gaussian process prior u whose paths are elements of a Sobolev space U. This is due to the fact that Sobolev spaces are, technically speaking, not function spaces, but rather spaces of equivalence classes [f]∼ of functions which are equal almost everywhere (Adams and Fournier, 2003). By contrast, the path spaces of Gaussian processes are proper function spaces, which means that, in this setting, paths(u) ⊆ U is impossible.
Fortunately, if the path space can be continuously embedded in U, i.e. there is a continuous and injective linear operator ι : paths(u) → U, commonly referred to as an embedding, then the inference procedure above can still be applied. If such an embedding exists, we can interpret the paths of the GP as elements of U by applying ι implicitly. For instance, D[u] is then shorthand notation for D[ι[u]]. Since the embedding is assumed to be continuous, the conditions for GP inference with linear operator observations are still met when applying ι implicitly. The canonical choice for the embedding in the case of Sobolev spaces is ι[u] = [u]∼.
Example B.1 (Matérn covariances and Sobolev spaces). Kanagawa et al. (2018) show that, under certain assumptions, the sample spaces of GP priors with Matérn covariance functions (Rasmussen and Williams, 2006) are continuously embedded in Sobolev spaces whose smoothness depends on the parameter ν of the Matérn covariance function. To be precise, let D ⊂ R^d be open and bounded with Lipschitz boundary such that the cone condition (Adams and Fournier, 2003, Definition 4.6) holds. Denote by kν,l the Matérn kernel with smoothness parameter ν > 0 and lengthscale l > 0. Then, with probability 1, the sample paths of a Gaussian process f with covariance function kν,l are contained in any RKHS Hkν′,l′ with l′ > 0 and
\[
0 < \underbrace{\nu' + \tfrac{d}{2}}_{=:\, m'} < \nu
\tag{B.20}
\]
(Kanagawa et al., 2018, Corollary 4.15 and Remark 4.15), i.e. paths(f) ⊂ Hkν′,l′. Moreover, if m′ ∈ N, then the RKHS Hkν′,l′ is norm-equivalent to the Sobolev space H^{m′}(D) (Kanagawa et al., 2018, Example 2.6). This implies that the canonical embedding
\[
\iota : H_{k_{\nu', l'}} \to H^{m'}(D), \quad f(\cdot, \omega) \mapsto [f(\cdot, \omega)]_\sim
\tag{B.21}
\]
is continuous.
For U = H^{m′}(D), the example above shows that the Matérn covariance function kν,l with ν = m′ + ε for any ε > 0 leads to an admissible GP prior. The choice ε = 1/2 makes evaluating the covariance function particularly efficient (Rasmussen and Williams, 2006). However, note that the elements of the Sobolev space H^m(D) are only m-times weakly differentiable, which means that H²(D) is not an admissible choice in Sections 3.1 and 3.2.
Remark B.29 (Sobolev Spaces and Strong Derivatives). The Sobolev embedding theorem (Adams and Fournier, 2003, Theorem 4.12) gives conditions under which the elements of a Sobolev space are embedded into Banach spaces of continuously differentiable functions. Let D ⊂ R^d be open and bounded with Lipschitz boundary such that the cone condition (Adams and Fournier, 2003, Definition 4.6) holds. Let j ≥ 0, m ≥ 1 be integers. If m > d/2, then there is a continuous embedding
\[
\iota : H^{j+m}(D) \to C_B^j(D),
\tag{B.22}
\]
where C_B^j(D) is the space of j-times continuously differentiable functions with bounded derivatives, which is a Banach space under the norm
\[
\|f\|_{C_B^j(D)} = \max_{0 \le |\alpha| \le j} \sup_{x \in D} |D^\alpha f(x)|.
\tag{B.23}
\]
Moreover, point-evaluated partial derivatives on C_B^j(D) are continuous linear functionals, since, for any multi-index |α′| ≤ j and any x′ ∈ D, we have
\[
\big| D^{\alpha'}[f](x') \big| \le \sup_{x \in D} \big| D^{\alpha'} f(x) \big| \le \max_{0 \le |\alpha| \le j} \sup_{x \in D} |D^\alpha f(x)| = \|f\|_{C_B^j(D)}.
\tag{B.24}
\]
Example B.2 (Strong Derivatives in Matérn Sample Spaces). Under the assumptions of Example B.1, for a prior GP f with Matérn covariance function kν,l such that
\[
\nu := m + \frac{d+1}{2} + \epsilon,
\tag{B.25}
\]
where ε > 0, we have the following chain of continuous embeddings
\[
\operatorname{paths}(f) \subset H_{k_{\nu', l'}} \hookrightarrow H^{m+k}(D) \hookrightarrow C_B^m(D).
\tag{B.26}
\]
As noted in Remark B.29, point-evaluated partial derivatives of order ≤ m are continuous linear functionals on C_B^m(D). It follows that a point-evaluated differential operator D[·](x) of order ≤ m is a continuous linear functional on paths(f) if the two continuous embeddings are prepended.
In Section 3.2, we have d = 1 and a GP prior with Matérn covariance function, where ν = 7/2 = 2 + (d+1)/2 + 1/2. It follows that point-evaluated differential operators of order ≤ 2 are continuous linear functionals. Hence, the assumptions of Corollary 3 are fulfilled, which means that the inference procedure used in these sections is supported by our theoretical results above.
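The smoothness selection in Equation (B.25) is simple to mechanize; the following sketch (an illustration, not code from the paper) returns an admissible Matérn smoothness ν for a differential operator of order m on a domain in R^d and reproduces the choice ν = 7/2 from Section 3.2.

```python
def admissible_matern_nu(m: int, d: int, eps: float = 0.5) -> float:
    """Matern smoothness nu = m + (d + 1) / 2 + eps from Equation (B.25)."""
    return m + (d + 1) / 2 + eps

# Section 3.2: second-order operator (m = 2) in one spatial dimension (d = 1).
assert admissible_matern_nu(2, 1) == 3.5   # i.e. nu = 7/2
```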
Appendix C. Linear Partial Differential Equations
Definition C.1 (Multi-index). Using a d-dimensional multi-index α ∈ N₀^d, we can represent (mixed) partial derivatives of arbitrary order as
\[
\frac{\partial^{|\alpha|}}{\partial x^\alpha} := \frac{\partial^{|\alpha|}}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}},
\tag{C.1}
\]
where |α| := Σᵢ₌₁^d αᵢ. If the variables w.r.t. which we differentiate are clear from the context, we also denote this (mixed) partial derivative by D^α.
Definition C.2 (Linear differential operator). A linear differential operator D : U → V of order k between a space U of R^{d′}-valued functions and a space V of real-valued functions defined on some common domain Ω ⊂ R^d is a linear operator that linearly combines partial derivatives up to k-th order of its input function, i.e.
\[
\mathcal{D}[u] := \sum_{i=1}^{d'} \sum_{\alpha \in \mathbb{N}_0^d,\ |\alpha| \le k} A_{i,\alpha}\, D^\alpha u_i,
\tag{C.2}
\]
where Aᵢ,α ∈ R for every i ∈ {1, . . . , d′} and every multi-index α ∈ N₀^d with |α| ≤ k.
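For instance, the negative Laplacian acting on scalar functions (d′ = 1, k = 2) is recovered from Equation (C.2) by choosing Aᵢ,α = −1 for α ∈ {2e₁, . . . , 2e_d}, where eᵢ denotes the i-th standard basis multi-index, and Aᵢ,α = 0 otherwise:
\[
-\Delta u = -\sum_{i=1}^d \frac{\partial^2 u}{\partial x_i^2} = \sum_{i=1}^d (-1)\, D^{2 e_i} u.
\]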
Definition C.3 (Heat equation (Lienhard and Lienhard, 2020; Evans, 2010)). Let Ω ⊂ R^d be an open and bounded region and T > 0. The heat equation is given by
\[
\rho c_p \frac{\partial u}{\partial t} - \operatorname{div}(k \nabla u) = \dot{q}_V,
\tag{C.3}
\]
where k is an R^{d×d}-valued function with entries kᵢⱼ ∈ L^∞(Ω × (0, T]), ρ, c_p ∈ L^∞(Ω × (0, T]), and q̇_V ∈ L²(Ω × (0, T]).
Definition C.4 (Elliptic PDE in nondivergence form). Let Ω ⊂ R^d be an open and bounded region. The equation
\[
-\operatorname{div}(A \nabla u) + b^\mathsf{T} \nabla u + c u = f,
\tag{C.4}
\]
where Aᵢⱼ, bᵢ, c ∈ L^∞(Ω) and f ∈ L²(Ω), is called an elliptic PDE.
C.1 Weak Derivatives and Sobolev Spaces
Definition C.5 (Test Function). Let D ⊂ R^d be open and let
\[
C_c^\infty(D) := \{\varphi \in C^\infty(D, \mathbb{R}) \mid \operatorname{supp}(\varphi) \subset D \text{ is compact}\}
\tag{C.5}
\]
be the space of smooth functions with compact support in D. A function φ ∈ C_c^∞(D) is dubbed a test function and we refer to C_c^∞(D) as the space of test functions.
Theorem C.6 (Sobolev Spaces¹²). Let D ⊂ R^d be open, m ∈ N_{>0}, and p ∈ [1, ∞) ∪ {∞}. The functional
\[
\|u\|_{m,p,D} :=
\begin{cases}
\Big( \sum_{|\alpha| \le m} \|D^\alpha u\|_{L^p(D)}^p \Big)^{1/p} & \text{if } p < \infty, \\[2pt]
\max_{|\alpha| \le m} \|D^\alpha u\|_{L^\infty(D)} & \text{if } p = \infty,
\end{cases}
\tag{C.6}
\]
is called a Sobolev norm. A Sobolev norm ‖u‖_{m,p,D} is a norm on subspaces of L^p(D) on which the right-hand side is well-defined and finite. A Sobolev space of order m is defined as the subspace
\[
W^{m,p}(D) := \{u \in L^p(D) \mid D^\alpha u \in L^p(D) \text{ for } |\alpha| \le m\}
\tag{C.7}
\]
of L^p(D), where the D^α are weak partial derivatives. Sobolev spaces W^{m,p}(D) are Banach spaces under the Sobolev norm ‖·‖_{m,p,D}. The Sobolev space H^m(D) := W^{m,2}(D) is a separable Hilbert space with inner product
\[
\langle u_1, u_2 \rangle_{m,D} := \sum_{|\alpha| \le m} \langle D^\alpha u_1, D^\alpha u_2 \rangle_{L^2(D)}
\tag{C.8}
\]
and norm
\[
\|\cdot\|_{m,D} := \sqrt{\langle \cdot, \cdot \rangle_{m,D}} = \|\cdot\|_{m,2,D}.
\tag{C.9}
\]
12. This theorem is a summary of (Adams and Fournier, 2003, Definitions 3.1 and 3.2 and Theorems 3.3 and 3.6).
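For example, for m = 1 and p = 2, Equations (C.6) and (C.8) specialize to the familiar norm and inner product of H¹(D),
\[
\|u\|_{1,D}^2 = \|u\|_{L^2(D)}^2 + \sum_{i=1}^d \Big\| \frac{\partial u}{\partial x_i} \Big\|_{L^2(D)}^2,
\qquad
\langle u_1, u_2 \rangle_{1,D} = \langle u_1, u_2 \rangle_{L^2(D)} + \sum_{i=1}^d \Big\langle \frac{\partial u_1}{\partial x_i}, \frac{\partial u_2}{\partial x_i} \Big\rangle_{L^2(D)},
\]
where the partial derivatives are weak derivatives.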
Bibliography
Robert A. Adams and John J. F. Fournier. Sobolev Spaces, volume 140 of Pure and Applied
Mathematics. Elsevier, 2nd edition, 2003. ISBN 9780080541297.
Christian Agrell. Gaussian processes with linear operator inequality constraints. Journal
of Machine Learning Research, 20(135):1–36, 2019. URL http://jmlr.org/papers/v20/
19-065.html.
Christopher G. Albert. Gaussian processes for data fulfilling linear differential equations. Proceedings of the 39th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, 33(1), 2019. ISSN 2504-3900.
doi:10.3390/proceedings2019033005.
Mauricio Alvarez, David Luengo, and Neil D. Lawrence. Latent force models. In Proceedings
of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS),
volume 5, pages 9–16, Clearwater Beach, Florida, USA, 2009.
W. N. Anderson, Jr. and G. E. Trapp. Shorted operators. II. SIAM Journal on Applied
Mathematics, 28(1):60–71, 1975. doi:10.1137/0128007.
Iskander Azangulov, Andrei Smolensky, Alexander Terenin, and Viacheslav Borovitskiy.
Stationary kernels and Gaussian processes on Lie groups and their homogeneous spaces i:
the compact case. arXiv preprint arXiv:2208.14960, 2022.
Adi Ben-Israel and Thomas N.E. Greville. Generalized Inverses: Theory and Applications.
CMS Books in Mathematics. Springer, New York, 2nd edition, 2003. ISBN 978-0-387-21634-8. doi:10.1007/b97366.
Alain Berlinet and Christine Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer, first edition, 2004. ISBN 978-1-4613-4792-7. doi:10.1007/978-1-4419-9096-9.
S. J. Bernau. The square root of a positive self-adjoint operator. Journal of The Australian
Mathematical Society, 8(1):17–36, February 1968. doi:10.1017/S1446788700004560.
Christopher M. Bishop. Pattern Recognition and Machine Learning. Information Science
and Statistics. Springer, New York, first edition, 2006. ISBN 978-0387-31073-2.
Fischer Black and Myron Scholes. The pricing of options and corporate liabilities. Journal
of Political Economy, 81(3):637–654, 1973. doi:10.1086/260062.
David Borthwick. Introduction to Partial Differential Equations. Universitext. Springer,
first edition, 2018. ISBN 978-3-319-48936-0. doi:10.1007/978-3-319-48936-0.
Richard Bouldin. The pseudo-inverse of a product. SIAM Journal on Applied Mathematics,
24(4):489–495, 1973. doi:10.1137/0124051.
James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal
Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and
Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL
http://github.com/google/jax.
Jon Cockayne, Chris Oates, Tim Sullivan, and Mark Girolami. Probabilistic numerical
methods for PDE-constrained Bayesian inverse problems. In Geert Verdoolaege, editor,
Proceedings of the 36th International Workshop on Bayesian Inference and Maximum
Entropy Methods in Science and Engineering, volume 1853 of AIP Conference Proceedings,
pages 060001–1 – 060001–8, 2017. doi:10.1063/1.4985359.
Jon Cockayne, Chris J. Oates, Ilse C.F. Ipsen, and Mark Girolami. A Bayesian conjugate gradient method (with discussion). Bayesian Analysis, 14(3):937–1012, 2019a.
doi:10.1214/19-BA1145.
Jon Cockayne, Chris J. Oates, T. J. Sullivan, and Mark Girolami. Bayesian probabilistic
numerical methods. SIAM Review, 61(4):756–789, 2019b. doi:10.1137/17M1139357.
Jacques Dixmier. étude sur les variétés et les opérateurs de Julia, avec quelques applications. Bulletin de la Société Mathématique de France, 77:11–101, 1949. ISSN 0037-9484.
doi:10.24033/bsmf.1403.
Lawrence C. Evans. Partial Differential Equations: Second Edition, volume 19 of Graduate
Studies in Mathematics. American Mathematical Society, Providence, Rhode Island, 2nd
edition, 2010. ISBN 978-0-82-184974-3. URL https://bookstore.ams.org/gsm-19-r.
Gregory E. Fasshauer. Solving partial differential equations by collocation with radial basis
functions. In Alain Le Méhauté, Christophe Rabut, and Larry L. Schumaker, editors,
Surface Fitting and Multiresolution Methods, pages 131–138. Vanderbilt University Press,
Nashville, TN, 1997. ISBN 9780826512949.
Gregory E. Fasshauer. Solving differential equations with radial basis functions: multilevel
methods and smoothing. Advances in Computational Mathematics, 11:139–159, November
1999. doi:10.1023/A:1018919824891.
C. A. J. Fletcher. Computational Galerkin Methods. Scientific Computation. Springer,
Berlin, Heidelberg, 1 edition, 1984. ISBN 978-3-642-85949-6. doi:10.1007/978-3-64285949-6.
Jean Baptiste Joseph Fourier. Théorie analytique de la chaleur. Firmin Didot, 1822. doi:10.1017/CBO9780511693229.
Mark Girolami, Eky Febrianto, Yin Ge, and Fehmi Cirak. The statistical finite element method (statFEM) for coherent synthesis of observation data and model predictions. Computer Methods in Applied Mechanics and Engineering, 275:113533, 2021.
doi:10.1016/j.cma.2020.113533.
Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins Studies in
the Mathematical Sciences. The Johns Hopkins University Press, Baltimore, fourth edition, 2013. ISBN 978-1-4214-0794-4. URL https://www.press.jhu.edu/books/title/
10678/matrix-computations.
Thore Graepel. Solving noisy linear operator equations by Gaussian processes: Application
to ordinary and partial differential equations. In Proceedings of the 20th International
Conference on Machine Learning, pages 234–241. AAAI Press, 2003.
Bernard Haasdonk and Hans Burkhardt. Invariant kernel functions for pattern analysis and
machine learning. Machine learning, 68(1):35–61, 2007.
Philipp Hennig, Michael A. Osborne, and Mark Girolami. Probabilistic numerics and
uncertainty in computations. Proceedings of the Royal Society A, 471(2179), 2015.
doi:10.1098/rspa.2015.0142.
Philipp Hennig, Michael A. Osborne, and Hans P. Kersting. Probabilistic Numerics:
Computation as Machine Learning. Cambridge University Press, June 2022. ISBN
9781316681411. doi:10.1017/9781316681411.
David S. Holder, editor. Electrical Impedance Tomography: Methods, History and Applications. Institute of Physics Medical Physics Series. Institute of Physics Publishing, Bristol,
2005. ISBN 0750309520.
Peter Holderrieth, Michael J Hutchinson, and Yee Whye Teh. Equivariant learning of
stochastic fields: Gaussian processes and steerable conditional neural processes. In International Conference on Machine Learning, pages 4297–4307. PMLR, 2021.
Motonobu Kanagawa, Philipp Hennig, Dino Sejdinovic, and Bharath K. Sriperumbudur.
Gaussian processes and kernel methods: A review on connections and equivalences. arXiv
preprint arXiv:1807.02582, 2018.
George Em Karniadakis, Ioannis G. Kevrekidis, Lu Lu, Paris Perdikaris, Sifan Wang, and
Liu Yang. Physics-informed machine learning. Nature Reviews Physics, 3(6):422–440,
2021. doi:https://doi.org/10.1038/s42254-021-00314-5.
Michael Kazhdan, Matthew Bolitho, and Hugues Hoppe. Poisson surface reconstruction. In
Proceedings of the 4th Eurographics Symposium on Geometry Processing, volume 7, 2006.
Achim Klenke. Probability Theory: A Comprehensive Course. Universitext. Springer, London, second edition, 2014. doi:10.1007/978-1-4471-5361-0.
Nicholas Krämer, Jonathan Schmidt, and Philipp Hennig. Probabilistic numerical method of
lines for time-dependent partial differential equations. In Gustau Camps-Valls, Francisco
J. R. Ruiz, and Isabel Valera, editors, Proceedings of The 25th International Conference
on Artificial Intelligence and Statistics (AISTATS), volume 151, pages 625–639. PMLR,
2022. URL https://proceedings.mlr.press/v151/kramer22a.html.
Benny Lautrup. The PDE’s of continuum physics. In Proceedings of the Workshop on PDE
methods in Computer Graphics, 2005.
Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya,
Andrew Stuart, and Anima Anandkumar. Neural operator: Graph kernel network for
partial differential equations. In ICLR 2020 Workshop on Integration of Deep Neural
Models and Differential Equations, 2020. doi:10.48550/arXiv.2003.03485.
Zongyi Li, Nikola B. Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew M. Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. In International Conference on Learning Representations, 2021. doi:10.48550/arXiv.2010.08895.
John H. Lienhard, IV and John H. Lienhard, V. A Heat Transfer Textbook. Phlogiston
Press, Cambridge, MA, 5th edition, 2020. URL http://ahtt.mit.edu.
Anders Logg, Kent-Andre Mardal, and Garth Wells, editors. Automated Solution of Differential Equations by the Finite Element Method, volume 84 of Lecture Notes in Computational
Science and Engineering. Springer, Berlin, Heidelberg, 2012. ISBN 978-3-642-23099-8.
doi:10.1007/978-3-642-23099-8.
Stefania Maniglia and Abdelaziz Rhandi. Gaussian measures on separable Hilbert spaces
and applications, January 2004.
James Clerk Maxwell. A dynamical theory of the electromagnetic field. Philosophical transactions of the Royal Society of London, 155:459–512, 1865.
Pierre Michaud. A simple model of processor temperature for deterministic turbo clock frequency. Research Report RR-9308, Inria Rennes, 2019. URL https://hal.inria.fr/hal-02391970.
Chris J. Oates and Tim J. Sullivan. A modern retrospective on probabilistic numerics.
Statistics and Computing, 29:1335–1351, 2019. doi:10.1007/s11222-019-09902-z.
Houman Owhadi and Clint Scovel. Conditioning Gaussian measure on Hilbert space. Journal
of Mathematical and Statistical Analysis, 1(109), 2018.
Houman Owhadi, Clint Scovel, and Florian Schäfer. Statistical numerical approximation. Notices of the American Mathematical Society, 66(10):1608–1617, 2019.
doi:10.1090/noti1963.
Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Machine learning of linear
differential equations using Gaussian processes. Journal of Computational Physics, 348:
683–693, 2017. ISSN 0021-9991. doi:10.1016/j.jcp.2017.07.050.
Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Physics-informed neural
networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:
686–707, 2019. ISSN 0021-9991. doi:https://doi.org/10.1016/j.jcp.2018.10.045. URL
https://www.sciencedirect.com/science/article/pii/S0021999118307125.
Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine
Learning. MIT Press, London, England, 2006. ISBN 026218253X.
Marco Reisert and Hans Burkhardt. Learning equivariant functions with matrix valued
kernels. Journal of Machine Learning Research, 8(3), 2007.
Walter Rudin. Functional Analysis. International Series in Pure and Applied Mathematics.
McGraw-Hill, New York, second edition, 1991. ISBN 978-0-07-054236-5.
Simo Särkkä. Linear operators and stochastic partial differential equations in Gaussian
process regression. In Timo Honkela, Włodzisław Duch, Mark Girolami, and Samuel
Kaski, editors, Artificial Neural Networks and Machine Learning – ICANN 2011, pages
151–158, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg. doi:10.1007/978-3-64221738-8_20.
Simo Särkkä, Arno Solin, and Jouni Hartikainen. Spatiotemporal learning via infinitedimensional Bayesian filtering and smoothing: A look at Gaussian process regression through Kalman filtering. IEEE Signal Processing Magazine, 30(4):51–61, 2013.
doi:10.1109/MSP.2013.2246292.
Ingo Steinwart. Convergence types and rates in generic Karhunen-Loève expansions
with applications to sample path properties. Potential Analysis, 51:361–395, 2019.
doi:10.1007/s11118-018-9715-5.
Ingo Steinwart and Andreas Christmann. Support Vector Machines. Information Science and Statistics. Springer, New York, first edition, 2008. ISBN 978-0-387-77242-4.
doi:10.1007/978-0-387-77242-4.
Zsigmond Tarcsay. Closed range positive operators on Banach spaces. Acta Mathematica
Hungarica, 142:494–501, 2014. doi:10.1007/s10474-013-0380-2.
Bastian von Harrach. Numerik partieller differentialgleichungen. Lecture Notes, 2021. URL
https://www.math.uni-frankfurt.de/~harrach/lehre/Numerik_PDGL.pdf.
Junyang Wang, Jon Cockayne, Oksana Chkrebtii, Tim J. Sullivan, and Chris J. Oates.
Bayesian numerical methods for nonlinear partial differential equations. Statistics and
Computing, 31(55), 2021. doi:10.1007/s11222-021-10030-w.
Jonathan Wenger and Philipp Hennig. Probabilistic linear solvers for machine learning. In
Advances in Neural Information Processing Systems (NeurIPS), 2020.
Jonathan Wenger, Nicholas Krämer, Marvin Pförtner, Jonathan Schmidt, Nathanael Bosch,
Nina Effenberger, Johannes Zenn, Alexandra Gessner, Toni Karvonen, François-Xavier
Briol, Maren Mahsereci, and Philipp Hennig. ProbNum: Probabilistic numerics in python,
2021. URL http://arxiv.org/abs/2112.02100.
Kôsaku Yosida. Functional Analysis, volume 123 of Classics in Mathematics. Springer, 6th
edition, 1995. ISBN 978-3-540-58654-8. doi:10.1007/978-3-642-61859-8.