Physics-Informed Gaussian Process Regression Generalizes Linear PDE Solvers

Marvin Pförtner¹ (marvin.pfoertner@uni-tuebingen.de), Ingo Steinwart² (ingo.steinwart@mathematik.uni-stuttgart.de), Philipp Hennig¹ (philipp.hennig@uni-tuebingen.de), Jonathan Wenger¹ (jonathan.wenger@uni-tuebingen.de)
¹ University of Tübingen, Tübingen AI Center
² University of Stuttgart

Abstract

Linear partial differential equations (PDEs) are an important, widely applied class of mechanistic models, describing physical processes such as heat transfer, electromagnetism, and wave propagation. In practice, specialized numerical methods based on discretization are used to solve PDEs. They generally use an estimate of the unknown model parameters and, if available, physical measurements for initialization. Such solvers are often embedded into larger scientific models or analyses with a downstream application such that error quantification plays a key role. However, by entirely ignoring parameter and measurement uncertainty, classical PDE solvers may fail to produce consistent estimates of their inherent approximation error. In this work, we approach this problem in a principled fashion by interpreting solving linear PDEs as physics-informed Gaussian process (GP) regression. Our framework is based on a key generalization of a widely-applied theorem for conditioning GPs on a finite number of direct observations to observations made via an arbitrary bounded linear operator. Crucially, this probabilistic viewpoint allows us to (1) quantify the inherent discretization error; (2) propagate uncertainty about the model parameters to the solution; and (3) condition on noisy measurements. Demonstrating the strength of this formulation, we prove that it strictly generalizes methods of weighted residuals, a central class of PDE solvers including collocation, finite volume, pseudospectral, and (generalized) Galerkin methods such as finite element and spectral methods. This class can thus be directly equipped with a structured error estimate and the capability to incorporate uncertain model parameters and observations. In summary, our results enable the seamless integration of mechanistic models as modular building blocks into probabilistic models by blurring the boundaries between numerical analysis and Bayesian inference.

Keywords: physics-informed machine learning, probabilistic numerics, partial differential equations, Galerkin methods, Gaussian processes, bounded linear operator equations

1. Introduction

Partial differential equations (PDEs) are powerful mechanistic models of static and dynamic systems with continuous spatial interactions (Borthwick, 2018). They are widely used in the natural sciences, especially in physics, and in applied fields like engineering, medicine and finance. Linear PDEs form a subclass describing physical phenomena such as heat diffusion (Fourier, 1822), electromagnetism (Maxwell, 1865) and continuum mechanics (Lautrup, 2005). Additionally, they are used in applications as diverse as computer graphics (Kazhdan et al., 2006), medical imaging (Holder, 2005), or option pricing (Black and Scholes, 1973).
Scientific inference with PDEs

Given a mechanistic model of a (physical) system in the form of a linear PDE D[u] = f, where D is a linear differential operator mapping between vector spaces of functions, the system can be simulated by solving the PDE subject to a set of linear boundary conditions (BC), given by a linear operator B and a function g defined on the boundary of the domain, s.t. B[u] = g (Evans, 2010). For instance, given all material parameters and heat sources involved, a PDE can describe the temperature distribution in an electronic component, while the boundary conditions describe the heat flux out of the component at the surface. Since hardly any practically relevant PDE can be solved analytically (Borthwick, 2018), in practice, specialized numerical methods relying on discretization are employed. Often such solvers are embedded into larger scientific models, where model parameters are inferred from measurements and downstream analyses depend on the resulting simulation. For example, we would like to model whether said electronic component hits critical temperature thresholds during operation to assess its longevity.

Challenges when solving PDEs

When performing scientific inference with PDEs via numerical simulation, one is faced with three fundamental challenges.

(C1) Limited computation. Any numerically computed solution û ≈ u suffers from approximation error. In practice, a sufficiently accurate simulation often requires vast amounts of computational resources.

(C2) Partially-known physics. While the underlying physical mechanism is encoded in the formulation of the PDE, in practice, its exact parameters and boundary conditions are often unknown. For example, the position and strength of heat sources f within the aforementioned electronic component are only approximately known. Similarly, material parameters like thermal conductivity, which define D, can often only be estimated. Finally, the initial or boundary conditions B[u] = g are also only partially known, for example, how much heat an electrical component dissipates via its surface.

(C3) Error propagation. Limited computation and partially-known physics inevitably introduce error into the simulation. The resulting bias can fundamentally alter conclusions drawn from downstream analysis steps, in particular if these are sensitive to input variability. For example, an electronic component may be deemed safe based on the simulation, although its true internal temperature hits safety-critical levels repeatedly.

Solving PDEs as a learning problem

The challenges of scientific inference with PDEs are fundamentally issues of partial information. Here, we interpret solving a PDE as a learning problem, specifically as physics-informed regression, in the spirit of probabilistic numerics (Hennig et al., 2015; Cockayne et al., 2019b; Oates and Sullivan, 2019; Owhadi et al., 2019; Hennig et al., 2022). By leveraging the tools of Bayesian inference, we can tackle the challenges (C1) to (C3). As illustrated in Figure 1(a), we model the solution of the PDE with a Gaussian process, which we condition on observations of the boundary conditions, the PDE itself and any physical measurements:

• Encoding prior knowledge. We can efficiently leverage any available computation by encoding inductive bias about the solution of the PDE. For example, we can identify the solution space by “partial derivative counting”.
Moreover, since PDEs typically model physical systems, expert knowledge is often available. This includes known physical properties of the system such as symmetries, as well as more subjective estimates from previous experience with similar systems or computationally cheap approximations.

• Conditioning on the boundary conditions. The linear boundary conditions can be interpreted as measurements of the solution of the PDE on the boundary. By conditioning on (some of) these measurements, we are not limited to satisfying the boundary conditions exactly, but can directly model uncertain constraints without having to resort to point estimates. Instead, we propagate the uncertainty to the solution estimate. This also allows us to handle cases where we do not have a functional form g of the constraints, but only a discrete set of constraints at boundary points.

• Conditioning on the PDE. Conditioning a probability measure over the solution on the analytic “observation” that the PDE holds is generally intractable. In the spirit of classic approaches for solving PDEs, we relax the PDE constraint by requiring only a finite number of projections of the associated PDE residual onto carefully chosen test functions to be zero. This choice of projections defines the discretization and allows for control over the amount of expended computation. The resulting posterior quantifies the algorithm’s uncertainty within a whole set of solution candidates.

• Conditioning on measurements. Finally, we can also condition on direct measurements of the solution itself. This is especially useful if parameters of the differential operator or boundary conditions are uncertain, or if the computational budget is restrictive.

The resulting posterior belief quantifies the uncertainty about the true solution induced by limited computation and partially-known physics (see Figure 1(b)). By quantifying this error probabilistically, we can propagate it to any downstream analysis or decision. For example, to project the longevity of a newly designed electrical component, we want to simulate how likely the component is to hit a critical temperature threshold during operation. Given our posterior belief, we can simply compute the marginal probability instead of performing Monte-Carlo sampling, which would require repeated PDE solves at significant computational expense.

Contribution

We introduce a probabilistic learning framework for the solution of (systems of) linear PDEs, including elliptic, parabolic and hyperbolic linear PDEs. Our framework can be viewed as physics-informed Gaussian process regression. It is based on a crucial generalization of a popular result on conditioning GPs on linear observations to observations made via an arbitrary bounded linear operator (Theorem 1). This enables combined quantification of uncertainty from the inherent discretization error, uncertain initial or boundary conditions, as well as noisy measurements of the solution. Our approach is a strict probabilistic generalization of methods of weighted residuals (Corollary 3.3), including collocation, finite volume, (pseudo)spectral, and (generalized) Galerkin methods such as finite element methods. In doing so, we demonstrate that this class can be equipped with a structured error estimate and the capability to incorporate partially-known physics and experimental measurements.
Figure 1: A physics-informed Gaussian process framework for the solution of linear PDEs. (a) Learning to solve the Poisson equation. A problem-specific Gaussian process prior u is conditioned on partially-known physics, given by uncertain boundary conditions (BC) and a linear PDE, as well as on noisy physical measurements from experiment. The boundary conditions and the right-hand side of the PDE are not known but inferred from a small set of noise-corrupted measurements. The plots juxtapose the belief u | · · · with the true solution u⋆ of the latent boundary value problem. (b) Uncertainty quantification. Marginal posterior standard deviation after conditioning on uncertain boundary conditions, a linear PDE, and noisy (physical) measurements. (c) Generalization of classical solvers. For certain priors our framework reproduces any method of weighted residuals, e.g. the finite element method, in its posterior mean.

2. Background

2.1 Linear Partial Differential Equations

A linear partial differential equation (PDE) is an equation of the form

    D[u] = f,    (2.1)

where D : U → V is a linear differential operator (see Definition C.2) between a space U of R^{d'}-valued functions and a space V of real-valued functions on a common domain D ⊂ R^d, and f ∈ V is the so-called right-hand side function (Evans, 2010). Typically, systems described by PDEs are further constrained via linear boundary conditions (BCs) B[u] = g describing the system on the boundary ∂D, where B is a linear operator mapping functions u ∈ U onto functions B[u] : ∂D → R defined on the boundary and g : ∂D → R. Common types of boundary conditions are:

• Dirichlet: Specify the values of the solution on the boundary, i.e. B[u] = u|_∂D.

• Neumann: Specify the exterior normal derivative on the boundary, i.e. B[u](x) := ∂_ν(x) u(x), where ν(x) is the exterior normal vector at each point of the boundary.

A PDE and a set of boundary conditions is referred to as a boundary value problem (BVP). A prototypical example of a linear PDE, used in thermodynamics, electrostatics and Newtonian gravity, is the Poisson equation −∆u = f, where ∆u = Σ_{i=1}^d ∂²u/∂x_i² is the Laplacian.

2.1.1 Weak Formulation

Many models of physical phenomena are expressed as functions u, which are not (continuously) differentiable or even continuous (Evans, 2010; Borthwick, 2018; von Harrach, 2021). In other words, they are not so-called strong solutions to any PDE. There are also PDEs derived from established physical principles, which do not admit strong solutions at all. To address this, one can weaken the notion of differentiability, leading to the concept of weak solutions. Many of the aforementioned physical phenomena are in fact weak solutions. As an example¹, consider the weak formulation of the stationary heat equation for non-homogeneous media

    −div(κ∇u) = q̇_V.    (2.2)

Let D ⊂ R^d be an open and bounded domain and assume that u ∈ C²(D), κ ∈ C¹(D), and q̇_V ∈ C⁰(D). If u is a solution to Equation (2.2), then we can integrate both sides of the equation against a test function v ∈ C_c^∞(D), i.e. an infinitely smooth function with compact support (see Definition C.5), which results in

    −∫_D div(κ∇u)(x) v(x) dx = ∫_D q̇_V(x) v(x) dx.
Since both u and v are sufficiently differentiable, we can apply integration by parts (Green’s first identity) to the first integral to obtain

    ∫_D ⟨κ(x)∇u(x), ∇v(x)⟩ dx =: B[u, v] = ∫_D q̇_V(x) v(x) dx,    (2.3)

since v|_∂D = 0. Note that this expression does not only make sense if u ∈ C²(D), but also if u is once weakly differentiable (see Evans 2010, Section 5.2.1) with ∇u ∈ L²(D)^d. Intuitively speaking, a weak derivative of a (classically non-differentiable) function “behaves like a derivative” when integrated against a smooth test function. These relaxed requirements on u are exactly the defining properties of the Sobolev space H¹(D) ⊃ C²(D), i.e. it suffices that u ∈ H¹(D). Similarly, we can weaken all other assumptions to v ∈ H¹₀(D), f ∈ L²(D) and κ ∈ L^∞(D). Then, for u ∈ H¹(D) and v ∈ H¹₀(D), Equation (2.3) is equivalent to

    B[u, v] = ⟨q̇_V, v⟩_L².    (2.4)

We define a weak solution of Equation (2.2) as u ∈ H¹(D) such that Equation (2.4), known as the weak or variational formulation, holds for all v ∈ H¹₀(D).

1. Our exposition is a strongly abbreviated version of Evans (2010, Section 6.1.2).

Definition 2.1. A weak formulation of a linear PDE D[u] = f is an equation of the form

    B[u, v] = l[v],    (2.5)

where B : U × V → R is a bilinear form derived from the differential operator D and l : V → R is a continuous linear functional induced by the right-hand side f. A vector u ∈ U is a weak solution of the PDE if it solves Equation (2.5) for all test functions v ∈ V. In this context, D[u] = f is called the strong formulation of the PDE and any solution to it is called a strong or classical solution. We refer to a weak solution as strictly weak if it cannot be interpreted as a solution to the strong formulation.

2.1.2 Methods of Weighted Residuals²

Unfortunately, linear PDEs both in weak and strong formulation are in general not analytically solvable, so approximate solutions are sought instead. Methods of weighted residuals (MWR) constitute a large family of popular numerical approximation schemes for linear PDEs, including collocation, finite volume, (pseudo)spectral, and (generalized) Galerkin methods such as finite element methods. Intuitively speaking, MWRs interpret a linear PDE as a root-finding problem for the associated PDE residual, i.e. D[u] − f = 0. Note that this problem consists of infinitely many equations for infinitely many unknowns. To render the problem tractable, MWRs approximate the unknown solution function u via finite linear combinations of trial functions φ_1, . . . , φ_m, i.e.

    û := Σ_{i=1}^m c_i φ_i,    (2.6)

where c ∈ R^m is the coordinate vector of û in the finite-dimensional subspace Û := span(φ_1, . . . , φ_m) ⊂ U. In the following, we will assume that the trial functions φ_i are chosen such that the boundary conditions are met, i.e. we describe so-called interior methods.³

2. This section is loosely based on Fletcher (1984).
3. By stacking the residuals corresponding to the PDE and the boundary conditions, the approach outlined here can be used to realize mixed methods, which solve the boundary value problem without requiring that û fulfills the boundary conditions by construction.

To reduce the number of equations, MWRs only require a finite number of projections of the residual onto test functions ψ_1, . . . , ψ_n to be zero, i.e.
    ⟨ψ_i, D[û] − f⟩_V = ⟨ψ_i, D[û]⟩_V − ⟨ψ_i, f⟩_V =: B[û, ψ_i] − l[ψ_i] = 0    (2.7)

for all i = 1, . . . , n, where ⟨·, ·⟩_V is a (semi-definite) inner product on the function space V. A ubiquitous choice for ⟨·, ·⟩_V is the L² inner product. In this case, the projected residual can be interpreted as a weighted average of the residual, where the test function defines the weight function, hence the name of the method. By substituting Equation (2.6) into Equation (2.7) and rearranging terms, we can see that this approach leads to a linear system B̂c = l̂, where B̂_ij := B[φ_j, ψ_i] and l̂_i := l[ψ_i]. Hence, the approximate solution function obtained from this method is given by

    û^MWR = Σ_{i=1}^m c_i^MWR φ_i,  where  c^MWR = B̂^{−1} l̂,    (2.8)

assuming that B̂ is invertible. Note that Equation (2.7) is a weak formulation of the linear PDE, restricted to the finite-dimensional subspaces Û ⊂ U and V̂ = span(ψ_1, . . . , ψ_n) ⊂ V. It is evident that the method described above can also be applied to weak formulations of linear PDEs which were not obtained by projecting the residual onto the ψ_i as in Equation (2.7). Following Fletcher (1984), we will also refer to these methods as methods of weighted residuals. Table 1 lists the aforementioned examples of MWRs together with the corresponding trial and test functions that induce the method.
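To make the assembly of the linear system B̂c = l̂ concrete, the following minimal sketch (not taken from the paper) applies a collocation-type MWR to the 1D Poisson problem −u″ = f on (0, 1) with homogeneous Dirichlet boundary conditions; the sine trial basis, collocation points and right-hand side are illustrative assumptions.

```python
import numpy as np

# Hypothetical model problem: -u''(x) = f(x) on (0, 1) with u(0) = u(1) = 0.
# Trial functions phi_j(x) = sin(j*pi*x) satisfy the boundary conditions (an interior method);
# the test functionals are point evaluations at collocation points (a collocation-type MWR).
m = 5                                    # number of trial functions
x_col = np.linspace(0.05, 0.95, m)       # collocation points, one per unknown

def f(x):                                # right-hand side chosen so that u*(x) = sin(pi*x)
    return np.pi**2 * np.sin(np.pi * x)

def D_phi(j, x):                         # D[phi_j](x) = -phi_j''(x) = (j*pi)^2 sin(j*pi*x)
    return (j * np.pi) ** 2 * np.sin(j * np.pi * x)

# Assemble the system from Equations (2.7)/(2.8):
# B_hat[i, j] = B[phi_j, psi_i] = D[phi_j](x_i),  l_hat[i] = l[psi_i] = f(x_i).
B_hat = np.array([[D_phi(j, xi) for j in range(1, m + 1)] for xi in x_col])
l_hat = f(x_col)
c_mwr = np.linalg.solve(B_hat, l_hat)    # coordinates of the MWR approximation

u_hat = lambda x: sum(c_mwr[j - 1] * np.sin(j * np.pi * x) for j in range(1, m + 1))
print(np.round(c_mwr, 6))                # ~ [1, 0, 0, 0, 0]: u_hat recovers u*(x) = sin(pi*x)
```

Galerkin-type members of the family differ only in the test functionals: instead of point evaluations, the residual is integrated against test functions, so the entries of B̂ and l̂ become (typically low-dimensional) integrals.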
2.2 Gaussian Processes

A Gaussian process h with index set X is a family {h_x}_{x∈X} of real-valued random variables on a common probability space (Ω, B(Ω), P) such that, for each finite set of indices x_1, . . . , x_n, the joint distribution of h_{x_1}, . . . , h_{x_n} is Gaussian. We also write h(x, ω) = h_x(ω) and h(x) := h(x, ·). The function x ↦ E[h(x)] is called the mean (function) of h and the function (x_1, x_2) ↦ Cov[h(x_1), h(x_2)] is called the covariance function or kernel of h. For each ω ∈ Ω, the function h(·, ω) : X → R, x ↦ h(x, ω) is called a sample or (sample) path of the Gaussian process. The set paths(h) := {h(·, ω) | ω ∈ Ω} ⊂ R^X is referred to as the path space of h. Given the notion of a sample path, it is easy to see why we use Gaussian processes as priors over unknown real-valued functions. However, many functions describing physical systems such as vector fields take values in R^{d'}. Fortunately, the index set of a Gaussian process can be chosen freely, which means that we can “emulate” vector-valued GPs. More precisely, a function h : X → R^{d'} can be equivalently viewed as a function h' : {1, . . . , d'} × X → R, (i, x) ↦ h'(i, x) = h_i(x). Applying this construction to a Gaussian process leads to the notion of a multi-output Gaussian process.
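The following minimal sketch illustrates this index-set construction; the scalar kernel and the output covariance matrix coupling the two outputs are illustrative assumptions, not choices made in the paper.

```python
import numpy as np

# A d'-output GP on X ⊂ R represented as a scalar GP on the augmented index set {0, ..., d'-1} × X.
def k_scalar(x, y, lengthscale=0.5):
    return np.exp(-0.5 * (x - y) ** 2 / lengthscale**2)

# Hypothetical output covariance matrix coupling the d' = 2 outputs.
C = np.array([[1.0, 0.3],
              [0.3, 2.0]])

def k_multi(i, x, j, y):
    # Kernel on the augmented index set: k'((i, x), (j, y)) = C[i, j] * k(x, y).
    return C[i, j] * k_scalar(x, y)

# Joint covariance matrix of both outputs at a few inputs.
xs = np.linspace(0.0, 1.0, 4)
idx = [(i, x) for i in range(2) for x in xs]          # flattened augmented index set
K = np.array([[k_multi(i, x, j, y) for (j, y) in idx] for (i, x) in idx])

# Sampling the multi-output GP reduces to sampling a Gaussian vector on the augmented set.
rng = np.random.default_rng(0)
sample = rng.multivariate_normal(np.zeros(len(idx)), K)
print(sample.reshape(2, -1).shape)                    # (number of outputs, number of inputs)
```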
3. Learning the Solution to a Linear PDE

Consider a linear partial differential equation D[u] = f subject to linear boundary conditions B[u] = g as in Section 2.1. Our goal is to find a solution u ∈ U satisfying the PDE for (partially) known (D, f) and (B, g). In general, one cannot find a closed-form expression for the solution u (Borthwick, 2018). Therefore, we aim to compute an accurate approximation û ≈ u instead. Motivated by the challenges (C1) to (C3) of partial information inherent to numerically solving PDEs, we approach the problem from a statistical inference perspective. In other words, we will learn the solution of the PDE from multiple sources of information. This way we can quantify the epistemic uncertainty about the solution at any time during the computation, as Figure 1(a) illustrates.

Indirectly Observing the Solution of a PDE

Typically, we think of observations as a finite number of direct measurements u(x_i) = y_i of the latent function u. As it turns out, we can generalize this notion of a measurement and even interpret the PDE itself as an (indirect) observation of u. As an example, consider the important case where u models the state of a physical system. The laws of physics governing such a system are often formulated as conservation laws in the language of PDEs. For example, they may require physical quantities like mass, momentum, charge or energy to be conserved over time.

Example 3.1 (Thermal Conduction and the Heat Equation). Say we want to simulate heat conduction in a solid object with shape D ⊂ R³, i.e. we want to find the time-varying temperature distribution u : [0, T] × D → R. Neglecting radiation and convection, u(t, x) is described by a linear PDE known as the heat equation (Lienhard and Lienhard, 2020). Assuming spatially and temporally uniform material parameters c_p, ρ, κ ∈ R, it reduces to

    (c_p ρ ∂/∂t − κ∆) u − q̇_V = 0.    (3.1)

Thermal conduction is described by −κ∆u, while q̇_V are local heat sources, e.g. from electric currents. Any energy flowing into a region due to conduction or a heat source is balanced by an increase in energy of the material. The net-zero balance shows that energy is conserved.

Notice how a conservation law is an observation of the behavior of the physical system! To formalize this, we begin by rephrasing the classical notion of an observation at a point x_i as measuring the result of a specific linear operator applied to the solution u:

    u(x_i) = y_i  ⟺  δ_{x_i}[u] = y_i,

where δ_{x_i} is the evaluation functional. Now, the key idea is to generalize the notion of a direct observation to collecting information about the solution via an arbitrary linear operator L applied to the solution u, such that L[u] = y ⟺ L[u] − y = 0. The affine operator

    I[u] := L[u] − y    (3.2)

is a specific kind of information operator (Cockayne et al., 2019b). In this setting the information operator may describe a conservation law as in Equation (3.1), a general linear PDE of the form (2.1) or an arbitrary affine operator of choice mapping between vector spaces (which may be linear function spaces). This generalized notion of an observation turns out to be very powerful to incorporate different kinds of mathematical, physical, or experimental properties of the solution. Since PDEs and conservation laws are often assumed to hold exactly, we focused on noise-free observations above. However, generally we are not limited to this case and can also model f as a random variable, in which case the information operator I[u, f] is a (jointly) linear function of the solution u and the right-hand side f.

3.1 Solving PDEs as a Bayesian Inference Problem

One of the main challenges (C1) to (C3) outlined in the beginning is the limited computational budget available to us to approximate the solution. Fortunately, in practice, the solution u is not hopelessly unconstrained, but we usually have a-priori information about it. At the very least, we know the space of functions U in which to search for the solution. Additionally, we might have expert knowledge about its rough shape and value range, or solutions to related PDEs at our disposal. Now, the question becomes: How do we combine this prior knowledge with indirect observations of the solution through the information operator I (3.2)?

To do so, we turn to the Bayesian inference framework. This provides a different perspective on the numerical problem of solving a linear PDE as a learning task.

Gaussian Process Inference

We represent our belief about the solution of the linear PDE via a (multi-output) Gaussian process

    u ∼ GP(m, k)

with mean function m : D → R^{d'} and covariance function or kernel k : D × D → R^{d'×d'}. Gaussian processes are well-suited for this purpose since:

(i) For an appropriate choice of kernel, the Gaussian process defines a probability measure over the function space in which the PDE’s solution is sought.

(ii) Kernels provide a powerful modeling toolkit to incorporate prior information (e.g. variability, periodicity, multi-scale effects, in-/equivariances, . . . ) in a modular fashion.

(iii) Measurement noise often follows a Gaussian distribution.

(iv) Conditioning a Gaussian process on observations made via a linear map again results in a Gaussian process.

While the result in (iv) is used ubiquitously in the literature, its general form, where observations are made via arbitrary linear operators as opposed to finite-dimensional linear maps, has only been rigorously demonstrated for Gaussian measures on function spaces, not for the Gaussian process perspective, to the best of our knowledge. The two perspectives are closely related, but there are thorny technical difficulties to consider. We intentionally frame the problem from the Gaussian process perspective to make use of the expressive modeling capabilities provided by the kernel. Our framework at its very core relies on this result, which we explain in detail in Section 4 and prove in Appendix B.4.

3.1.1 Encoding Prior Knowledge about the Solution

We can infer the solution of a linear PDE more quickly by specifying inductive biases in the prior, which can encode both provable and approximately known properties of the solution.⁴

Function Space of the Solution

The most basic known property derived from the PDE is an appropriate choice of function space for the solution. This can be done by inspecting the differential operator D and keeping track of the partial derivatives. In fact, in an implementation this can be automatically derived solely from the problem definition, e.g. by compositionally defining differential operators and storing information on the necessary differentiability. Let j ∈ N be the maximum order of the partial derivatives of the differential operator D. If we choose a Matérn(ν) kernel with

    ν = j + (d + 1)/2 + ε

with ε > 0, then under mild regularity conditions our prior defines a Gaussian measure over the space of solutions of the linear PDE.⁵ The choice ε = 1/2 allows particularly efficient kernel evaluations (Rasmussen and Williams, 2006).

4. In the special case of GP regression, if the prior smoothness matches the smoothness of the target function u, the convergence rate is optimal in the number of observations (Kanagawa et al., 2018, Thm. 5.1).
5. Technically, it is impossible to formulate a GP prior whose paths are elements of a Sobolev space, since such spaces are spaces of equivalence classes. However, a similar intuition applies and can be formalized through a continuous embedding. See Appendix B.5 for details.
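A small sketch of this smoothness rule, assuming the 1D stationary heat equation used later in Section 3.2 (j = 2, d = 1) and the standard closed form of the Matérn kernel for half-integer ν; the helper name and parameters are our own illustration.

```python
import numpy as np

# Smoothness rule from this section: given the highest derivative order j in the differential
# operator and the domain dimension d, choose nu = j + (d + 1)/2 + 1/2.
def matern_smoothness(j, d):
    return j + (d + 1) / 2 + 0.5

nu = matern_smoothness(j=2, d=1)        # second derivatives in 1D -> nu = 7/2
assert nu == 3.5

def matern_72(x, y, lengthscale=1.0, output_scale=1.0):
    # Closed form of the Matern kernel for nu = 7/2 (half-integer, hence cheap to evaluate).
    s = np.sqrt(7.0) * np.abs(x - y) / lengthscale
    return output_scale**2 * (1.0 + s + 2.0 * s**2 / 5.0 + s**3 / 15.0) * np.exp(-s)

print(matern_72(0.0, 0.5))
```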
Symmetries, In- and Equivariances

Many solutions of PDEs exhibit a-priori known symmetries. For example, to calculate the strength of a magnet rotated by R : R³ → R³, one can equivalently compute the field of the magnet in its original position and rotate the field, i.e. u(Rx) = Ru(x). Corresponding inductive biases can be encoded via a kernel that is invariant, i.e. k(ρ_g x_0, ρ_g x_1) = k(x_0, x_1), or equivariant, i.e. k(ρ_g x_0, ρ_g x_1) = ρ_g k(x_0, x_1) ρ_g^∗, where ρ_g is a unitary group representation. The most commonly used kernels are stationary, i.e. translation invariant, but one can also construct invariant kernels (Haasdonk and Burkhardt, 2007; Azangulov et al., 2022), as well as equivariant kernels (Reisert and Burkhardt, 2007; Holderrieth et al., 2021) for many other group actions.

Related Problems

If solutions from related problems are available, the prior mean function can be set to an appropriate combination of the available solutions, and the prior kernel can be chosen to reflect how related the problems are. For example, if we have an approximate solution of the same PDE computed on a coarser mesh, we can condition our function space prior on the coarse solution with a noise level reflecting the fidelity of the discretization. Similarly, if we solved the same PDE with different parameters, we can condition on the available solutions with a noise level chosen according to how similar the parameters are to the one of interest.

Domain Expertise

Domain experts often have approximate knowledge of what solutions can be expected, either from experience, previous experiments or familiarity with the physical interpretation of the solution u. For example, an electrical engineer who designs electrical components is able to give realistic temperature ranges for a component, whose temperature distribution we aim to simulate. This can be included by choosing the (initial) kernel hyperparameters, such as the output- and lengthscales, based on this expertise.

3.1.2 (Indirectly) Observing the Solution

From a computational perspective, the most important reason for choosing Gaussian processes is that when conditioning on linear observations, the resulting posterior is again a Gaussian process with closed-form mean and covariance function (Bishop, 2006). We extend this classic result from observations via a finite-dimensional linear map to general linear operators in Theorem 1. This is crucial to condition on the different types of observations, most importantly the PDE itself, made via the information operator in (3.2). Given such an affine observation defined via a linear operator L : U → R^n and an independent Gaussian random variable ε ∼ N(µ, Σ), we can condition our prior belief on the observations using Corollary 2 to obtain a posterior of the form

    u | (L[u] + ε = y) ∼ GP(m_{u|y}, k_{u|y})

with mean and covariance function given by

    m_{u|y}(x) = m(x) + L[k(·, x)]^⊤ (LkL^∗ + Σ)^{−1} (y − (L[m] + µ)),    (3.3)
    k_{u|y}(x_1, x_2) = k(x_1, x_2) − L[k(·, x_1)]^⊤ (LkL^∗ + Σ)^{−1} L[k(·, x_2)].    (3.4)

We will now look more closely at how we can condition on the boundary conditions, the PDE itself and direct measurements of the solution.
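For finite-dimensional observations, Equations (3.3) and (3.4) can be implemented directly. The sketch below is a generic helper, with standard GP regression (L = δ_X) as a usage example; the function name and interface are our own illustration, not an API from the paper.

```python
import numpy as np

def condition_affine(m, k, Lm, Lk, LkL, y, mu=None, Sigma=None):
    """Sketch of Equations (3.3)-(3.4) for an observation L[u] + eps = y with eps ~ N(mu, Sigma).

    m, k  -- prior mean m(x) and kernel k(x1, x2) (callables)
    Lm    -- the vector L[m]
    Lk    -- callable x -> L[k(., x)], the vector of cross-covariances
    LkL   -- the matrix L k L*
    """
    n = len(Lm)
    mu = np.zeros(n) if mu is None else mu
    Sigma = np.zeros((n, n)) if Sigma is None else Sigma
    G = LkL + Sigma                                    # observation (Gram) matrix
    w = np.linalg.solve(G, y - (Lm + mu))              # representer weights
    post_mean = lambda x: m(x) + Lk(x) @ w
    post_cov = lambda x1, x2: k(x1, x2) - Lk(x1) @ np.linalg.solve(G, Lk(x2))
    return post_mean, post_cov

# Standard GP regression is the special case L = delta_X (point evaluation at training inputs).
k = lambda x1, x2: np.exp(-0.5 * (x1 - x2) ** 2)
m = lambda x: 0.0
X, y = np.array([0.0, 1.0, 2.0]), np.array([0.0, 0.8, 0.1])
post_mean, post_cov = condition_affine(
    m, k, Lm=np.zeros(3), Lk=lambda x: k(X, x), LkL=k(X[:, None], X[None, :]),
    y=y, Sigma=1e-4 * np.eye(3),
)
print(post_mean(1.0), post_cov(1.0, 1.0))
```

In the operator setting of this section, the same formulas apply once L[m], L[k(·, x)] and LkL* are available, e.g. via kernel derivatives (for differential operators) or quadrature (for integral functionals).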
Observing the Solution via the PDE

The differential operator D in Equation (2.1) is linear and therefore we can (in theory) condition on I[u] = D[u] − f = 0 directly by using Theorem 1 with L = D and y = f. However, it turns out that this is at least as hard as solving the PDE directly and thus typically intractable in practice. This is because f is a function and hence D[u] = f corresponds to an infinite number of observations. However, by only enforcing the PDE at a finite number of points in the domain, we can immediately give a canonical example of an approximation to this intractable information operator. Concretely, we can condition u on the fact that the PDE holds at a finite sequence of well-chosen domain points X = {x_i}_{i=1}^n ⊂ int(D), i.e. we compute u | (D[u](X) − f(X) = 0) by choosing L = δ_X ∘ D and y = f(X). Intuitively speaking, if the set X of domain points is dense enough, we obtain a good approximation to the exact conditional process. This approach, known as the probabilistic meshless method (Cockayne et al., 2017), is analogous to existing non-probabilistic approaches to solving PDEs, commonly referred to as collocation methods, wherein the points X are called collocation points. Satisfying the PDE at a set of collocation points is far from the only choice within our general framework. For example, we can choose a set of test functions v ∈ V̂, which we use to observe the PDE with, such that L[u] = ⟨v, D[u]⟩_V and y = ⟨v, f⟩_V. For efficient evaluation of the differential operator we can further represent the solution in a basis of trial functions from a subspace Û, resulting in L[u] = ⟨v, D[P_Û u]⟩_V. This turns out to be very powerful and is analogous to how some of the most successful classical PDE solvers choose sets of basis functions for which to satisfy the PDE. In fact, for certain priors and choices of subspaces, our framework recovers several important classic solvers in the posterior mean (see Section 3.3.4). Note that the above can be applied to both time-dependent and time-independent PDEs and regardless of the type of linear PDE (e.g. elliptic, parabolic, hyperbolic). Moreover, an extension to systems of linear PDEs is straightforward.

Observing the Solution at the Boundary

As for the PDE, we could attempt to directly condition on the boundary conditions by choosing L = B and y = g. However, we are faced with the same intractability issues that we discussed above. Instead, we observe that the boundary conditions hold at a finite set of points X_BC ⊂ ∂D, i.e. L = δ_{X_BC} ∘ B and y = g(X_BC). In practice, sometimes the boundary conditions are only known at a finite set of points, making this a natural choice.

Observing the Solution Directly

Finally, as in standard GP regression, we can directly condition on (noisy) measurements of the solution, for example from a real-world experiment, by choosing L = δ_{X_MEAS} and y = u⋆(X_MEAS).

In summary, the probabilistic viewpoint allows us to

• encode prior information about the solution,

• condition on various kinds of (partial) information, such as the boundary conditions, the PDE itself, or direct measurements, and

• output a structured error estimate, reflecting all obtained information and performed computation.

We will now give concrete examples for some of the possible modeling choices described above in a case study.
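As a self-contained illustration of the collocation-type information operator L = δ_X ∘ D described above, the following sketch conditions a zero-mean GP on the Poisson equation −u″ = f at a handful of collocation points plus two Dirichlet boundary observations. It uses a squared-exponential kernel, for which the required kernel derivatives have simple closed forms (the paper’s experiments use Matérn priors); all problem parameters are illustrative assumptions.

```python
import numpy as np

# Hypothetical problem: -u''(x) = f(x) on (0, 1), u(0) = u(1) = 0, f(x) = pi^2 sin(pi x),
# so the true solution is u*(x) = sin(pi x).
ell, sigma = 0.4, 1.0

def k(a, b):                        # squared-exponential kernel
    return sigma**2 * np.exp(-0.5 * (a - b) ** 2 / ell**2)

def k_d2(a, b):                     # d^2/da^2 k(a, b)  (= d^2/db^2 k(a, b); k is stationary)
    r = a - b
    return k(a, b) * (r**2 - ell**2) / ell**4

def k_d2d2(a, b):                   # d^4/(da^2 db^2) k(a, b)
    r = a - b
    return k(a, b) * (3.0 - 6.0 * r**2 / ell**2 + r**4 / ell**4) / ell**4

f = lambda x: np.pi**2 * np.sin(np.pi * x)
x_pde = np.linspace(0.1, 0.9, 9)    # collocation points
x_bc = np.array([0.0, 1.0])         # Dirichlet boundary points

# Gram matrix of the observations [ -u''(x_pde) ; u(x_bc) ] under the GP prior.
K_pp = k_d2d2(x_pde[:, None], x_pde[None, :])          # Cov[-u''(x_i), -u''(x_j)]
K_pb = -k_d2(x_pde[:, None], x_bc[None, :])            # Cov[-u''(x_i), u(x_b)]
K_bb = k(x_bc[:, None], x_bc[None, :])
G = np.block([[K_pp, K_pb], [K_pb.T, K_bb]]) + 1e-8 * np.eye(len(x_pde) + len(x_bc))
y = np.concatenate([f(x_pde), np.zeros(2)])
w = np.linalg.solve(G, y)                              # representer weights (zero prior mean)

def posterior_mean(x):
    cross = np.concatenate([-k_d2(x, x_pde), k(x, x_bc)])
    return cross @ w

xs = np.linspace(0.0, 1.0, 5)
print(np.round([posterior_mean(x) for x in xs], 3))    # close to sin(pi * xs)
```

The posterior covariance follows from Equation (3.4) with the same Gram matrix and cross-covariances; its diagonal is the discretization-error estimate discussed above.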
3.2 Case Study: Modeling the Temperature Distribution in a CPU

Central processing units (CPUs) are pieces of computing hardware that are constrained by the vast amounts of heat they dissipate under computational load. Surpassing the maximum temperature threshold of a CPU for a prolonged period of time can result in reduced longevity or even permanent hardware damage (Michaud, 2019). To counteract overheating, cooling systems are attached to the CPU, which are controlled by digital thermal sensors (DTS). For simplicity, assume that the CPU is under sustained computational load and that the cooling device is controlled in a way such that the die reaches thermal equilibrium.

Example 3.2 (Stationary Heat Equation). The temperature distribution of a solid at thermal equilibrium, i.e. ∂u/∂t = 0 in Example 3.1, is described by the linear PDE

    −κ∆u − q̇_V = 0,    (3.5)

known as the stationary heat equation (Lienhard and Lienhard, 2020). For our choice of material parameters, Equation (3.5) is equivalent to the Poisson equation with f = q̇_V / κ.

While the sensors control cooling, they only provide local, limited-precision measurements of the CPU temperature. This is problematic, since the chip may reach critical temperature thresholds in unmonitored regions. Therefore, our goal will be to infer the temperature in the entire CPU. We will use our framework to integrate the physics of heat flow, the controlled cooling at the boundary, and the noisy temperature measurements from the sensors. See Figure 2(b) for an illustration of the result. During manufacturing, the resulting belief over the temperature distribution could then help decide whether the CPU design needs to be changed to avoid premature failure. From here on out, we focus on a 1D slice across the CPU surface, as shown in Figure 2(a) (top), to easily visualize uncertainty.

Figure 2: Physics-informed Gaussian process model of the stationary temperature distribution in an idealized hexa-core CPU die under sustained computational load. (a) Top: CPU die with CPU cores as heat sources and uniform cooling over the whole surface. Bottom: Magnitude of heat sources and sinks q̇_V in the 1D slice in the upper subplot. (b) Gaussian process integrating prior information about the temperature distribution, a mechanistic model of heat conduction in the form of a linear PDE, and empirical measurements (X_DTS, u_DTS) taken by limited-precision sensors (DTS). The plot shows the GP mean and a 1D slice illustrating the posterior uncertainty along with a few samples.

Encoding Prior Knowledge

By inspecting the PDE’s differential operator D = −κ∆ = −κ Σ_{i=1}^d ∂²/∂x_i², we can deduce that the paths of our Gaussian process need to be twice differentiable. The construction in Section 3.1.1 results in a Matérn(ν) kernel with ν = j + (d + 1)/2 + 1/2 = 2 + (1 + 1)/2 + 1/2 = 7/2. Assume we also know what temperature ranges are plausible from similar CPU architectures, meaning that we set the kernel output scale to σ²_out = 9. Figure 3 shows the prior process u along with its image D[u] ∼ GP(D[m], σ²_out DkD∗) under the differential operator. A draw from D[u] can be interpreted as the heat sources and sinks that generated the corresponding temperature distribution draw from u.
Figure 3: Prior model for the stationary temperature distribution of a CPU die under load. (a) Gaussian process prior with a Matérn-7/2 kernel over the temperature distribution of the CPU. (b) Prior under the differential operator D[u] = −κ∆u along with heat sources and sinks q̇_V.

Figure 4: We integrate mechanistic knowledge about the system by conditioning on PDE observations −κ∆u(x_PDE,i) − q̇_V(x_PDE,i) = 0 at the collocation points x_PDE,i, resulting in the conditional process u | PDE. (a) Belief about the solution after conditioning on the PDE at a set of collocation points. (b) Belief about heat sources and sinks after conditioning on the PDE at collocation points. The large remaining uncertainty in Figure 4(a) illustrates that the PDE by itself does not identify a unique solution.

Conditioning on the PDE

We can now inform our belief about the physics of heat conduction using the mechanistic model defined by the stationary heat equation. We choose a set of collocation points X_PDE = {x_PDE,i}_{i=1}^n and then condition on the observation that the PDE holds (exactly) at these points. In other words, we compute the physically-informed Gaussian process u | PDE := u | {−κ∆u(x_PDE,i) − q̇_V(x_PDE,i) = 0}_{i=1}^n visualized in Figure 4. We can see that the resulting conditional process indeed satisfies the PDE exactly at the collocation points (see Figure 4(b)). The remaining uncertainty in Figure 4(b) is due to the approximation error introduced by only conditioning on a finite number of collocation points. However, while the samples from our belief about the solution in Figure 4(a) exhibit much more similarity to the mean function and less spatial variation, the marginal uncertainty hardly decreases. The latter is explained by the PDE not identifying a unique solution, since adding any affine function to u does not alter its image under the differential operator, i.e. ∆(a^⊤x + b) = 0. There is an at least two-dimensional subspace of functions which cannot be observed. This ambiguity can be resolved by introducing boundary conditions.

Conditioning on the Boundary Conditions

We assume that the CPU cooler extracts heat (approximately) uniformly from all exposed parts of the CPU, in particular also from the sides, rather than just from the top. Instead of directly specifying the value of the temperature distribution at the edge points of the CPU slice, we only approximately know the density q̇_A of heat flowing out of each point on the CPU’s boundary based on the cooler specification. We can use another thermodynamical law to turn this assumption into information about the temperature distribution u.

Example 3.1 (continued). Fourier’s law states that the local density of heat q̇_A flowing through a surface with normal vector ν is proportional to the inner product of the negative temperature gradient and the surface normal, i.e. q̇_A = −κ⟨ν, ∇u⟩, where κ is the material’s thermal conductivity in W m⁻¹ K⁻¹ (Lienhard and Lienhard, 2020). Assuming sufficient differentiability of u, the inner product above is equal to the directional derivative ∂_ν u of u in direction ν. We can assign an outward-pointing vector ν(x) (almost) everywhere on the boundary of the domain.
Since the boundary of the CPU domain is its surface, we can summarize the above in a Neumann boundary condition −κ∂_ν(x) u(x) = q̇_A(x) for x ∈ ∂D. However, in practice we only know the approximate heat flow out of the CPU due to cooling. We therefore leverage our probabilistic viewpoint once more to incorporate the uncertainty about the true value of q̇_A. To that end, assume a joint Gaussian process prior (u, q̇_A), where q̇_A is the heat flow out of the CPU at the boundary and q̇_A ⊥⊥ u. We can use Corollary 3 to condition u | PDE on the Neumann boundary condition, meaning we compute ((u, q̇_A) | I^PDE[u] = 0) | I^NBC[(u, q̇_A)] = 0, where I^NBC[(u, q̇_A)] = −κ∂_ν(x) u(X_NBC) − q̇_A(X_NBC) with X_NBC = {0, w_CPU} describes the boundary conditions. Then, we marginalize over q̇_A in the conditional process to obtain a belief over u. The result is visualized in Figure 5. The structure of the samples illustrates that most of the remaining uncertainty about the solution lies in a one-dimensional subspace of U corresponding to constant functions. This is due to the fact that two Neumann boundary conditions on both sides of the domain only determine the solution of the PDE up to an additive constant. Hence, we need an additional source of information to address the remaining degree of freedom.

Figure 5: The cooler of the CPU produces an approximately specified outgoing heat flux q̇_A at the boundary of the CPU. (a) Belief about the solution after conditioning on the PDE and boundary conditions. (b) Belief about heat sources and sinks after conditioning on the PDE and boundary conditions. As Figure 5(a) illustrates, after conditioning on the resulting (approximate) Neumann boundary conditions, the solution of the PDE is identified up to an additive constant.
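A minimal sketch of conditioning on such Neumann-type (derivative) observations in one dimension, using first and mixed derivatives of a squared-exponential kernel; the kernel choice, boundary points and flux values are illustrative assumptions rather than the settings of the case study.

```python
import numpy as np

ell, sigma, kappa = 0.3, 1.0, 1.0
k = lambda a, b: sigma**2 * np.exp(-0.5 * (a - b) ** 2 / ell**2)
dk_db = lambda a, b: (a - b) / ell**2 * k(a, b)           # d/db k(a, b)
d2k_dadb = lambda a, b: (1.0 / ell**2 - (a - b) ** 2 / ell**4) * k(a, b)

x_nbc = np.array([0.0, 1.0])          # boundary points of the 1D slice
nu = np.array([-1.0, 1.0])            # outward unit normals at the left/right boundary
qA = np.array([0.2, 0.2])             # assumed outgoing heat flux values

# Observations L_b[u] = -kappa * nu_b * u'(x_b); assemble their Gram matrix.
L = -kappa * nu
G = (L[:, None] * L[None, :]) * d2k_dadb(x_nbc[:, None], x_nbc[None, :]) + 1e-6 * np.eye(2)
w = np.linalg.solve(G, qA)                                # representer weights (zero prior mean)

posterior_mean = lambda x: (L * dk_db(x, x_nbc)) @ w      # Cov[u(x), L_b[u]] = -kappa*nu_b*d/db k(x, x_b)
print(posterior_mean(0.5))
```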
Conditioning on Direct Measurements

Fortunately, CPUs are equipped with digital thermal sensors (DTS) located close to each of the cores, which provide (noisy) local measurements of the core temperatures (Michaud, 2019). These measurements can be straightforwardly accounted for in our model by performing standard GP regression using u | PDE, NBC from Figure 5 as a prior. The resulting belief about the temperature distribution is visualized in Figure 6. We can see that integrating the interior measurements effectively reduces the uncertainty due to the remaining degree of freedom, albeit not completely. The remaining uncertainty is due to the model’s consistent accounting for noise in the thermal sensor readings, the uncertainty about the cooling, i.e. the boundary conditions, and the discretization error incurred by only choosing a small set of collocation points.

Figure 6: The digital thermal sensors (DTS) within the CPU cores provide us with limited-precision, local measurements of the temperature at locations x_DTS,i. (a) Belief about the solution after conditioning on the PDE, BCs and noisy sensor data. (b) Belief about heat sources and sinks after conditioning on the PDE, BCs and noisy sensor data. Integrating these along with the PDE and boundary conditions identifies the solution up to noise from the different types of observations and discretization error.

Uncertainty in the Right-hand Side

Above, we always assumed the true heat source term q̇_V, i.e. the right-hand side of the PDE, to be known exactly. However, in practice, this assumption might also be violated, as was the case for the boundary conditions. A straightforward relaxation of this assumption is to replace q̇_V by a Gaussian process whose mean is given by an estimate of q̇_V.⁶

6. Technically speaking, if the right-hand side of the PDE is given as a Gaussian process, the PDE turns into a stochastic partial differential equation (SPDE).

In the beginning of Section 3.2 we assumed that the cooler is controlled in such a way that the temperature distribution in the CPU does not change over time. However, a naive prior over the heat flow q̇_A out of the CPU may break this assumption. We need to encode that the amount of heat entering the CPU is equal to the amount of heat leaving the CPU via its boundary, i.e.

    I^STAT[q̇_V, q̇_A] := ∫_D q̇_V(x) dx − ∫_∂D q̇_A(x) dA = 0.    (3.6)

The (jointly) linear information operator I^STAT computes the net amount of thermal energy that the CPU gains per unit time. Using Corollary 2 we can construct a joint GP prior for u, q̇_V and q̇_A, which is consistent with the assumption of stationarity. We posit a multi-output GP prior over (u, q̇_V, q̇_A), whose outputs we choose to be independent in this section, and condition on I^STAT[q̇_V, q̇_A] = 0. In the one-dimensional model, we can simplify Equation (3.6) by assuming that heat is drawn uniformly from the sides of the CPU. In this case, the GP prior over q̇_A turns into a four-dimensional Gaussian random vector

    (q̇_A,N, q̇_A,E, q̇_A,S, q̇_A,W)^⊤ ∼ N(m_{q̇_A}, Σ_{q̇_A})

and the information operator is equivalent to

    I^STAT[q̇_V, q̇_A] = h_CPU ∫_0^{w_CPU} q̇_V(x) dx − h_CPU (q̇_A,E + q̇_A,W) − w_CPU (q̇_A,N + q̇_A,S).    (3.7)

The effect of this information operator on the marginal process q̇_V is visualized in Figure 7(b). The conditional mean is the same as the prior mean, since the prior mean is explicitly constructed to fulfill Equation (3.7). However, note that the samples and the marginal credible interval change substantially. Prior samples in Figure 7(a) seem to lie consistently above or below the mean, indicating that there is a net increase or decrease in thermal energy. In contrast, each sample from the conditional process q̇_V | STAT in Figure 7(b) conserves thermal energy in the system.

Figure 7: Construction of a joint prior over the temperature distribution u, the volumetric heat source q̇_V inside the CPU and the outgoing surface heat flux q̇_A on its sides, which is consistent with the assumption of a stationary temperature distribution. (a) GP prior over the volumetric heat source q̇_V, which is inconsistent with the assumption of a stationary temperature distribution. (b) Conditional GP q̇_V | STAT obtained by conditioning the GP prior q̇_V from Figure 7(a) on the stationarity constraint Equation (3.7).
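The following sketch shows how a single integral information operator of this kind can be handled numerically: the cross-covariance ∫ k(s, ·) ds and the variance ∫∫ k(s, t) ds dt of the functional are approximated by quadrature, and the prior is conditioned on the resulting scalar observation. The kernel, domain width and outflow value are illustrative assumptions, and the surface flux is treated as known for simplicity.

```python
import numpy as np

w_cpu, c_out = 1.0, 0.3                  # assumed domain width and known net outflow
ell, sigma = 0.2, 1.0
k = lambda a, b: sigma**2 * np.exp(-0.5 * (a - b) ** 2 / ell**2)

x = np.linspace(0.0, w_cpu, 201)         # quadrature nodes
h = x[1] - x[0]
w_quad = np.full(x.shape, h); w_quad[[0, -1]] = h / 2     # trapezoidal quadrature weights

K = k(x[:, None], x[None, :])
Lk = w_quad @ K                          # [ integral of k(s, x_j) ds ]_j: cross-covariance with the functional
LkL = Lk @ w_quad                        # double integral of k(s, t): variance of the functional

# Posterior mean of q_V (zero prior mean) given the single observation "integral of q_V = c_out".
post_mean = (Lk / LkL) * c_out
print(w_quad @ post_mean)                # ~ c_out: the conditional mean satisfies the constraint
```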
We can use Corollary 2 to condition our joint GP prior (u, q̇_V, q̇_A) | STAT first on I^PDE[(u, q̇_V, q̇_A)] = 0 and then on I^NBC[(u, q̇_V, q̇_A)] = 0 as above. It is important to keep track of the correlations in (u, q̇_V, q̇_A), since the outputs in (q̇_V, q̇_A) | STAT are now correlated. The resulting marginal conditional GP u | PDE, NBC, STAT after additionally conditioning on sensor data is shown in Figure 8. Comparing Figures 6 and 8, we can see that, due to the uncertainty in the right-hand side q̇_V of the PDE, the samples of −κ∆u | PDE, NBC, STAT, DTS exhibit much more spatial variation. Moreover, the samples of the GP posterior over u now respect the stationarity constraint we imposed.

Figure 8: We integrate information from the joint prior (u, q̇_V, q̇_A) | STAT over the solution, the right-hand side of the PDE, and the values of the Neumann boundary conditions into our belief about the temperature distribution by conditioning on said PDE and boundary conditions. (a) Posterior belief about the temperature distribution physically consistent with the assumption of stationarity. (b) Posterior belief about the heat sources and sinks after conditioning on the corresponding uncertain right-hand side q̇_V of the PDE.

Stepping back, we can view the problem of modelling the CPU under computational load as a scientific inference problem, where we need to aggregate heterogeneous sources of information in a joint probabilistic model. This inference task is illustrated as a directed graphical model in Figure 9. Our physics-informed regression framework is a local computation in the global inference procedure on the graph. Importantly, its implementation does not change based on what happens to the solution estimate and the input data in either upstream or downstream computations. All this information is already handily encoded in the structured uncertainties of the Gaussian processes.

Figure 9: Representation of the CPU model as a directed graphical model. The inference procedure described in Section 3.2 is equivalent to the junction tree algorithm (Bishop, 2006, Section 8.4.6) applied to the graphical model above. This example shows that the language of information operators is a powerful tool for aggregating heterogeneous sources of partial information in a joint probabilistic model.

3.3 A General Class of Tractable Information Operators for Linear PDEs

In Section 3.1.2, we noted that conditioning on the information operator induced by the linear PDE, i.e. I[u] = D[u] − f, is usually intractable. As a remedy, we approximated I by a finite family {I_i}_{i=1}^n of tractable information operators with I_i[u] := D[u](x_i) − f(x_i) with x_i ∈ D. Crucially, this assumes that point evaluation of both D[u] and f is well-defined and continuous, which means that this approach only applies to strong or classical solutions of a PDE. In this section, we will extend this approximation scheme for I into a unifying framework for tractable information operators aimed at approximating both (strictly) weak and strong solutions to linear PDEs.
Our framework is inspired by the method of weighted residuals (MWR) (see Section 2.1.2), which is why we refer to these information operators as MWR information operators. Indeed, in Section 3.3.4 we will show that GP inference with information operators from our framework reproduces any weighted residual method in the posterior mean while providing an estimate of the inherent approximation error. In the following, we will consider a linear PDE in weak formulation, i.e. we want to solve

    B[u, v] = l[v]  for all v ∈ V    (3.8)

for u ∈ U. Equation (3.8) does not have to be a weak formulation in the sense of Section 2.1.1, but it could also be a weighted strong formulation as in Equation (2.7). We additionally require that B is continuous for fixed v ∈ V, i.e. for any v ∈ V there must be a constant C < ∞ such that |B[u, v]| ≤ C ‖u‖_U for all u ∈ U. Let u ∼ GP(m_u, k_u) be a Gaussian process prior over the weak solution u, whose path space can be continuously embedded into the solution space U of the PDE (see Appendix B.5 for more details on the latter assumption). As in Section 3.1.2, it is intractable to condition the GP prior on the full information provided by the PDE via the family {I_v}_{v∈V} of affine information operators I_v[u] := B[u, v] − l[v], since V is typically infinite-dimensional. To find tractable families of information operators, we will take inspiration from the method of weighted residuals (see Section 2.1.2).

3.3.1 Infinite-Dimensional Trial Function Spaces

By Corollary 2 it is tractable to condition on a finite subfamily {I_{ψ_i}}_{i=1}^n ⊂ {I_v}_{v∈V} of information operators, where ψ_1, . . . , ψ_n is a finite subset of test functions, as long as we can compute I_{ψ_i}[m_u], L_{ψ_i}[k_u(x, ·)], and L_{ψ_i} k_u L*_{ψ_j}, where L_{ψ_i} = B[·, ψ_i]. This might not always be possible in closed form, since B often involves computing integrals. However, in these cases one could fall back to an efficient numeric quadrature method, since the integrals are often low-dimensional (typically at most four-dimensional). A prominent example of this approach is the probabilistic meshless method used in Section 3.

Example 3.3 (Symmetric Collocation). If the differential operator maps into a reproducing kernel Hilbert space⁷ V, then, by the reproducing property, we know that there is a function δ*_x ∈ V for every x ∈ D such that v(x) = δ_x[v] = ⟨δ*_x, v⟩_V for all v ∈ V. Hence, if the weak formulation is given by Equation (2.7), and V is an RKHS, then the choice ψ_i = δ*_{x_i} for x_i ∈ D leads to I_{ψ_i}[u] = D[u](x_i) − f(x_i), i.e. we recover the probabilistic meshless method from Cockayne et al. (2017) and Section 3. Cockayne et al. (2017) show that this approach reproduces symmetric collocation (Fasshauer, 1997, 1999), a non-probabilistic approximation method for strong solutions of PDEs, in the conditional mean. Note that this family of information operators can also be recovered without assuming that V is a Hilbert space. We only require u ↦ D[u](x_i) − f(x_i) to be continuous.

7. This is a reasonably weak assumption, since any Hilbert function space with continuous point evaluation functionals is an RKHS (Steinwart and Christmann, 2008).

Unfortunately, the probabilistic meshless method can only be applied to approximate strong solutions of linear PDEs, since the test functions corresponding to point evaluation functionals are usually not well-defined and continuous on the spaces V considered for finding a strictly weak solution. However, other choices of the test functions ψ_i lead to approximation schemes for weak solutions.
For instance, a weak solution of the stationary heat equation in inhomogeneous media from above can be approximated by choosing the Lagrange elements from Figure 10 as test functions.

3.3.2 Finite-Dimensional Trial Function Spaces

As opposed to the methods outlined in Section 2.1.2, we did not need to choose a finite-dimensional subspace of trial functions to arrive at tractable information operators in Section 3.3.1. Nevertheless, in practice, it might still be desirable to specify a finite-dimensional trial function basis φ_1, . . . , φ_m, e.g. because

• we want to reproduce the output of a classical method in the posterior mean to use the GP solver as an uncertainty-aware drop-in replacement (see Corollary 3.3);

• the trial basis encodes knowledge about the problem that is difficult to encode in the prior; or

• we want to solve the problem in a coarse-to-fine scheme, allowing for mesh refinement strategies, which are informed by the GP’s uncertainty estimation.

Naively, one might achieve this goal by defining the prior over u as a parametric Gaussian process with features φ_i. However, this means the posterior cannot quantify the inherent approximation error, since the GP has no support outside of the finite subspace of U spanned by the trial functions. Consequently, we need to take a different approach. Starting from a general, potentially nonparametric prior over u, we consider a bounded (potentially oblique) projection P_Û : U → Û onto a subspace Û ⊂ U, i.e. P_Û² = P_Û, ‖P_Û‖ < ∞, and ran(P_Û) = Û. In general, this subspace need not be finite-dimensional. We apply P_Û to our GP prior over u, which, by Corollary 3, results in another GP

    û := P_Û[u] ∼ GP(P_Û[m_u], P_Û k_u P_Û^∗),

with sample paths in Û. Note that this discards prior information about ker(P_Û). Hence, especially in case dim Û < ∞, applying the information operators I_{ψ_i} from Section 3.3.1 directly to û would suffer from similar problems as choosing a parametric prior. However,

    I_{ψ_i, P_Û} := I_{ψ_i} ∘ P_Û = B[P_Û[·], ψ_i] − l[ψ_i]

is a valid information operator for u, which leads to a probabilistic generalization of the method of weighted residuals. This is why we refer to I_{ψ_i, P_Û} as an MWR information operator. The similarity to the method of weighted residuals is particularly prominent if we choose a finite-dimensional subspace Û = span(φ_1, . . . , φ_m) as in Section 2.1.2. In this case, there is a bounded linear operator P_{R^m} : U → R^m such that

    P_Û[u] = Σ_{i=1}^m c_i φ_i =: I^Û_{R^m}[c],

where c := P_{R^m}[u] ∈ R^m are the coordinates of P_Û[u] in Û and I^Û_{R^m} : R^m → Û is the canonical isomorphism between R^m and Û. Hence, we get the factorization

    P_Û = I^Û_{R^m} P_{R^m},    (3.9)

which implies that û is a parametric Gaussian process. Moreover, note that l[ψ_i] = l̂_i and

    B[I^Û_{R^m}[c], ψ_i] = Σ_{j=1}^m c_j B[φ_j, ψ_i] = (B̂c)_i

for c ∈ R^m, where B̂ and l̂ are defined as in Section 2.1.2. Consequently, the MWR information operator is given by I_{ψ_i, P_Û}[u] = (I_{R^m} ∘ P_{R^m})[u]_i, where I_{R^m}[c] := B̂c − l̂. This illustrates that we are dealing with the hierarchical model

    u ∼ GP(m_u, k_u),  c | u ∼ δ_{P_{R^m}[u]},

with observations I_{R^m}[c] = 0, where c ∼ N(P_{R^m}[m_u], P_{R^m} k_u P*_{R^m}). Inference in this model can be broken down into two steps.
First, we update our belief about the solution’s coordinates in Û by computing the conditional random variable c | IRm [c] = 0, 21 Pförtner, Steinwart, Hennig and Wenger which is also Gaussian. If B̂ is invertible and c has full support on Rm , then the law of c | IRm [c] = 0 is a Dirac measure whose mean is given by the coordinates of the MWR approximation cMWR = B̂ −1 ˆl from Equation (2.8). Next, we can reuse precomputed quantities from the conditional moments of c | IRm [c] = 0 such as the representer weights w = (B̂PRm ku PR∗ m B̂ > )† (ˆl − B̂PRm [mu ]) to efficiently compute the conditional random process (u | (IRm ◦ PRm ) [c] = 0) = (u | {Iψi ,PÛ [u] = 0}ni=1 ), i.e. the main object of interest. Assuming once more that B̂ is invertible and c has full support on Rm , the remaining uncertainty of the conditional process lies in the kernel of Pû , since the law of c | IRm [c] = 0 is a Dirac measure and (PÛ [u] | {Iψi ,PÛ [u] = 0}ni=1 ) = (IRÛm [c] | IRm [c] = 0). Thus, all remaining uncertainty must be due to (id −PÛ ) [u] | {Iψi ,PÛ [u] = 0}ni=1 . Note the striking similarity of this property to the notion of Galerkin orthogonality (Logg et al., 2012, Equation 2.63). A canonical choice for the projection PÛ would arguably be orthogonal projection w.r.t. the RKHS inner product of the sample space of u (see e.g. Kanagawa et al. 2018). However, this inner product is generally difficult to compute. Fortunately, we can use the L2 inner products or Sobolev inner products on the samples to induce a (usually non-orthogonal) projection PÛ . Example 3.4. If the elements of U are square-integrable, then the linear operator m Z , φi (x)u(x) dx PRm [u]i := P −1 D where i=1 Z Pij := φi (x)φj (x) dx, D induces a projection PÛ = IRÛm PRm onto Û ⊂ U , even if h·, ·iU 6= h·, ·iL2 . At first glance, information operators restricting Û to be finite-dimensional might seem fundamentally inferior to the information operators from Section 3.3.1. However, note that the conditional mean of a Gaussian process prior conditioned on {Iψi [u] = 0}ni=1 is updated by a linear combination of n functions, while the covariance function receives an at most rank n downdate. This means that, implicitly, Gaussian process projection methods also have an implicit finite-dimensional trial function space, which is constructed from the test function basis, the bilinear form B and the prior covariance function ku . MWR information operators with finite-dimensional trial function bases can be used to realize a GP-based analogue of the finite element method. Example 3.5 (A 1D Finite Element Method). Generally speaking, finite element methods are (generalized) Galerkin methods, where the functions in the test and trial bases have compact support, i.e. they are nonzero only in a highly localized region of the domain. The archetype of a finite element method chooses linear Lagrange elements (Logg et al., 2012, 22 Physics-Informed GP Regression Generalizes Linear PDE Solvers 1.00 1.00 0.75 0.75 0.50 0.50 0.25 0.25 0.00 0.00 −1.0 −0.5 0.0 0.5 −1.0 1.0 −0.5 0.0 0.5 1.0 (a) Test/trial functions φi = ψi . The functions (b) The trial functions φi span the space of pieceare defined on the whole interval [-1, 1], but we wise linear functions on the given grid. only show the non-zero parts of the functions to avoid clutter in the figure above. Figure 10: Linear Lagrange elements are famous test and trial functions ψi = φi used in the finite element method. Section 3.3.1) as test and trial functions. 
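Before spelling out the finite element example in full, the following sketch illustrates the projection from Example 3.4 numerically: the coordinates P_{R^m}[u] = P^{-1} (∫_D φ_i u dx)_i are computed by quadrature for a small basis, and idempotence of the induced projection PÛ is checked. The basis, domain, test function and quadrature rule are arbitrary choices made only for this illustration.

```python
import numpy as np

a, b = -1.0, 1.0
xs = np.linspace(a, b, 2001)                    # quadrature grid on D = [-1, 1]

def quad(values):
    """Trapezoidal quadrature of sampled function values on the grid xs."""
    return float(np.sum(0.5 * (values[:-1] + values[1:]) * np.diff(xs)))

# A small trial basis Û = span(φ_1, ..., φ_m); here simply monomials up to degree 2.
basis = [lambda x: np.ones_like(x), lambda x: x, lambda x: x ** 2]
Phi = np.stack([phi(xs) for phi in basis])      # shape (m, len(xs))

P = np.array([[quad(Phi[i] * Phi[j]) for j in range(len(basis))]
              for i in range(len(basis))])      # P_ij = ∫ φ_i φ_j dx

def coords(u_values):
    """Coordinates c = P^{-1} (∫ φ_i u dx)_i of PÛ[u] with respect to the basis."""
    rhs = np.array([quad(Phi[i] * u_values) for i in range(len(basis))])
    return np.linalg.solve(P, rhs)

u = np.cos(np.pi * xs)                          # some square-integrable function on D
c = coords(u)
u_proj = c @ Phi                                # PÛ[u] = Σ_i c_i φ_i on the grid

# Idempotence: projecting PÛ[u] once more reproduces the same coordinates.
assert np.allclose(coords(u_proj), c, atol=1e-8)
print("coordinates of PÛ[u]:", np.round(c, 4))
```

The linear Lagrange elements described in the remainder of Example 3.5 are the canonical locally supported choice of such a basis.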
Linear Lagrange elements are piecewise linear on a triangulation of the domain. For instance, on a one-dimensional domain D = [−1, 1], this amounts to fixing a grid −1 = x0 < · · · < xm+1 = 1 and then choosing the basis functions φi (x) = ψi (x) = x−xi−1 xi −xi−1 xi+1 −x xi+1 −xi 0 if xi−1 ≤ x ≤ xi , if xi ≤ x ≤ xi+1 , otherwise. for i = 1, . . . , m. Note that multiplying a coordinate vector c ∈ Rm with these basis functions leads to a piecewise linear interpolation between the points (x0 , 0), (x1 , c1 ), . . . , (xn , cn ), (xn+1 , 0), since, for x ∈ [xi , xi+1 ], m X i=1 x − xi xi+1 − x ci φi (x) = ci + ci+1 = xi+1 − xi xi+1 − xi x − xi x − xi 1− ci + ci+1 . xi+1 − xi xi+1 − xi The basis functions and an element in their span are visualized in Figure 10. The Lagrange elements at the boundary of the can also be easily modified such that arbitrary piecewise linear boundary conditions are fulfilled by construction. The effect of MWR information operators based on this set of test and trial functions is visualized in Figure 11(a). 3.3.3 MWR Information Operators Even though the class of information operators introduced above is constructed for weak forms of linear PDEs, it can naturally be applied to the weak form of an arbitrary operator 23 Pförtner, Steinwart, Hennig and Wenger 1.5 1.5 1.0 1.0 0.5 0.5 ? u u | BC, PDE 0.0 −1.0 −0.5 0.0 0.5 u? u | BC, PDE 0.0 −1.0 1.0 (a) Posterior process corresponding to a Matérn3/2 prior. The sample paths of the process embed continuously into the Sobolev space H 1 (D) (see Appendix B.5). −0.5 0.0 0.5 1.0 (b) Posterior process corresponding to an MWR Recovery Prior constructed from a Matérn-3/2 prior via Lemma 3.4. The posterior mean corresponds to the point estimate produced by the classical MWR. Figure 11: Conditioning a Gaussian process prior on the MWR information operators {Iψi ,PÛ }ni=1 corresponding to the weak formulation of the Poisson equation, i.e. Equation (2.3), and m = 3 linear Lagrange elements as test functions ψi and trial functions φi (see Example 3.5). The trial functions φ1 and φm were modified to fulfill the non-zero boundary conditions exactly. equation. In particular, we can use MWR information operators for the boundary conditions in a BVP. Moreover, it is straightforward to extend Iψ,PÛ to a joint GP prior over (u, f ) if the right-hand side f of the operator equation is unknown, particularly if l [v] = hf, viV as in Section 2.1. In this case, Iψ,PÛ is jointly linear in (u, f ). Summarizing Sections 3.3.1 and 3.3.2 and incorporating the extensions discussed here, we give the following general definition of an MWR information operator: Definition 3.1 (MWR Information Operator). Let B [u, v] = l [v] be an operator equation in weak formulation. An MWR information operator for said operator equation is an affine functional Iψ,PÛ := B PÛ [·] , ψ − l [ψ] parameterized by a test function ψ ∈ V and a bounded (potentially oblique) projection PÛ onto a subspace Û ⊂ U . We also write Iψ := Iψ,idU . If l [v] = hf, viV , then the input of Iψ,PÛ can be extended to the right-hand side f of the operator equation, i.e. Iψ,PÛ [(u, f )] := B PÛ [u] , ψ − hf, ψiV , which is jointly linear in (u, f ). 
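To make the classical side of Example 3.5 concrete, the following sketch assembles the Galerkin system B̂c = l̂ for the weak Poisson problem ∫ u'(x) v'(x) dx = ∫ f(x) v(x) dx with homogeneous Dirichlet boundary conditions, using the hat functions above as both trial and test basis. The grid, right-hand side and quadrature are illustrative; in the notation of Section 2.1.2, the solve yields the coordinates c_MWR of the classical MWR approximation.

```python
import numpy as np

m = 9                                        # number of interior grid points
nodes = np.linspace(-1.0, 1.0, m + 2)        # -1 = x_0 < x_1 < ... < x_m < x_{m+1} = 1
h = nodes[1] - nodes[0]

def hat(i, x):
    """Linear Lagrange element φ_i centred at nodes[i], zero outside [x_{i-1}, x_{i+1}]."""
    left = (x - nodes[i - 1]) / h
    right = (nodes[i + 1] - x) / h
    return np.clip(np.minimum(left, right), 0.0, None)

f = lambda x: np.pi ** 2 / 4 * np.sin(np.pi / 2 * (x + 1))    # right-hand side of -u'' = f

# Stiffness matrix B̂_ij = ∫ φ_j'(x) φ_i'(x) dx; for hat functions on a uniform grid this is
# the familiar tridiagonal matrix with 2/h on the diagonal and -1/h next to it.
B_hat = (np.diag(2.0 * np.ones(m)) - np.diag(np.ones(m - 1), 1) - np.diag(np.ones(m - 1), -1)) / h

# Load vector l̂_i = ∫ f(x) φ_i(x) dx by trapezoidal quadrature on a fine grid.
xq = np.linspace(-1.0, 1.0, 4001)
wq = np.full(xq.shape, xq[1] - xq[0]); wq[[0, -1]] *= 0.5
l_hat = np.array([np.sum(wq * f(xq) * hat(i, xq)) for i in range(1, m + 1)])

c_mwr = np.linalg.solve(B_hat, l_hat)        # coordinates of the MWR approximation, Eq. (2.8)
u_mwr = lambda x: sum(c_mwr[i - 1] * hat(i, x) for i in range(1, m + 1))
print(np.round(u_mwr(np.array([-0.5, 0.0, 0.5])), 3))   # roughly matches u*(x) = sin(π (x + 1) / 2)
```

Section 3.3.4 shows under which conditions the posterior mean of the corresponding GP formulation coincides with exactly this c_MWR.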
3.3.4 Recovery of Classical Methods In this section we will show that, under certain assumptions, the posterior mean of a GP prior conditioned on a set of MWR information operators is identical to the approximation 24 Weak & Strong Solutions Strong Solutions Physics-Informed GP Regression Generalizes Linear PDE Solvers Method Trial Functions φi Test Functions ψi Collocation arbitrary ψi = δx∗i for xi ∈ D ⇒ B [u, ψi ] = D [u] (xi ) Subdomain (Finite Volume) arbitrary ψi = 1Di for DRi ⊂ D ⇒ B [u, ψi ] = Di D [u] (xi ) dx Pseudospectral orthogonal and globally supported (e.g. Fourier basis or Chebychev polynomials) ψi = δx∗i for xi ∈ D ⇒ B [u, ψi ] = D [u] (xi ) Generalized Galerkin arbitrary arbitrary Finite Element locally supported (e.g. piecewise polynomial) same class as trial functions, but in general ψi 6= φi Spectral (Galerkin) orthogonal and globally supported (e.g. Fourier basis or Chebychev polynomials) same class as trial functions, but in general ψi 6= φi (Ritz-)Galerkin arbitrary ψi = φi Table 1: Overview of trial and test functions defining commonly used methods of weighted residuals. The table also shows whether the method is capable of approximating weak solutions. See Fletcher (1984) for more details. generated by the corresponding traditional method of weighted residuals, examples of which are given in Table 1. More precisely, we will show that there is a flexible family of GP priors u ∼ GP (mu , ku ) whose posterior means after conditioning on {Iψi ,PÛ }m i=1 are identical to the corresponding classical MWR approximation uMWR to the solution of the same weak form linear PDE, where we use the same trial functions φ1 , . . . , φm and test functions ψ1 , . . . , ψn in both cases, i.e. Û = span (φ1 , . . . , φm ). As in Section 2.1.2, we assume that the trial functions are already constructed in such a way that the boundary conditions are fulfilled. However, it is possible to extend the results below to the general case by adding MWR information operators corresponding to the boundary conditions and using −1 ˆlPDE B̂PDE MWE c = ˆlBC B̂BC as coordinates for the reference solution generated by the traditional MWR. Lemma 3.2. If B̂ ∈ Rn×m and Σc := PRm ku PR∗ m ∈ Rm×m are invertible, then c | B̂c − ˆl = 0 ∼ δcMWR and the conditional mean mu|B̂,l̂ of u B̂PRm [u] − ˆl = 0 admits a unique additive decomposition mu|B̂,l̂ = uMWR + uker(PÛ ) (3.10) 25 Pförtner, Steinwart, Hennig and Wenger with uMWR ∈ Û and uker(PÛ ) ∈ ker(PÛ ). Corollary 3.3. If, additionally, mu ∈ Û and Pker(PÛ ) kuu PR∗ m = 0, then the conditional mean mu|B̂,l̂ is equal to the MWR approximation, i.e. mu|B̂,l̂ = uMWR . It turns out that it is possible to transform any admissible GP prior over the (weak) solution of the PDE into a prior that fulfills the assumptions of Corollary 3.3. We describe this transformation in the following lemma. Lemma 3.4 (MWR Recovery Prior). Let ũ ∼ GP m̃u , k̃u with mean and sample paths in U . Then u ∼ GP (mu , ku ) with mu := PÛ [m̃u ] and ∗ kuu := PÛ k̃uu PÛ∗ + Pker(PÛ ) k̃uu Pker(P = PÛ k̃uu PÛ∗ Û ) + (idU −PÛ )k̃uu (idU −PÛ )∗ = k̃uu − PÛ k̃uu − k̃uu PÛ∗ + 2PÛ k̃uu PÛ∗ has sample paths in U , mu ∈ Û , and Pker(PÛ ) kuu PR∗ n = 0. Figure 11(b) visualizes how a prior of this form reproduces a 1D finite element method in the posterior mean and Figure 11 as a whole contrasts the difference between ũ and u. Intuitively speaking, the construction for the covariance from Lemma 3.4 enforces statistical independence between the subspaces Û and ker(PÛ ) of the GP’s path space. 
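The following finite-dimensional sketch makes this construction and the resulting independence explicit. The GP is represented by its Gram matrix on a dense grid, PÛ is a discrete analogue of the L2 projection from Example 3.4 onto a small hat-function basis, and the modified covariance is assembled as P k̃ P* + (id − P) k̃ (id − P)*; all discretization choices here are made up for illustration and are not part of the paper's implementation.

```python
import numpy as np

xs = np.linspace(-1.0, 1.0, 401)                              # dense grid representing the domain
wq = np.full(xs.shape, xs[1] - xs[0]); wq[[0, -1]] *= 0.5     # trapezoidal quadrature weights

def matern32(x1, x2, ell=0.3):
    r = np.abs(x1[:, None] - x2[None, :]) / ell
    return (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

K_tilde = matern32(xs, xs)                                    # Gram matrix of k̃u on the grid

# Hat-function trial basis evaluated on the grid (columns of Phi), as in Example 3.5.
nodes = np.linspace(-1.0, 1.0, 7)
h = nodes[1] - nodes[0]
Phi = np.stack([np.clip(1.0 - np.abs(xs - xi) / h, 0.0, None) for xi in nodes[1:-1]], axis=1)

# Discrete analogue of the L2 projection from Example 3.4: P = Phi (Phiᵀ W Phi)^{-1} Phiᵀ W.
W = np.diag(wq)
P = Phi @ np.linalg.solve(Phi.T @ W @ Phi, Phi.T @ W)
assert np.allclose(P @ P, P, atol=1e-8)                       # P is a (non-orthogonal) projection

Id = np.eye(xs.size)
K = P @ K_tilde @ P.T + (Id - P) @ K_tilde @ (Id - P).T       # recovery-prior covariance (Lemma 3.4)

# Cross-covariance between Û and ker(PÛ) vanishes: observations made through PÛ cannot
# update the belief along ker(PÛ).
print(np.abs(P @ K @ (Id - P).T).max())                       # ≈ 0 up to round-off
```

The vanishing cross-covariance block is exactly the independence referred to above.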
This way, an observation of the GP prior in the subspace Û gains no information about ker(PÛ ), which means that the posterior process will not be updated along ker(PÛ ). Since mu ∈ Û , i.e. Pker(PÛ ) [mu ], it follows that the posterior mean will also lie in Û . Even though this choice of prior is somewhat restrictive, there are good reasons to use it in practice, arguably the most important of which is that the uncertainty quantification provided by the GP can be added on top of traditional MWR solvers in existing pipelines in a plug-and-play fashion. This is due to the fact that, in this case, the mean estimate agrees with the point estimate produced by the classical solver. 3.4 Algorithm Algorithm 1 summarizes our framework from an algorithmic standpoint. It outlines how a GP prior can be conditioned on heterogeneous sources of information such as mechanistic knowledge given in the form of a linear boundary value problem, and noisy measurement data by leveraging the notion of a linear information operator. All GP posteriors in this article were computed by this algorithm with different choices of prior, PDE, boundary conditions and policy. 26 Physics-Informed GP Regression Generalizes Linear PDE Solvers Algorithm 1: Solving PDEs via Gaussian Process Inference Input: Joint GP prior (u, f, g, ) ∼ GP (m, k), linear PDE (D, f ) or (BPDE , f ), boundary conditions (B, g) or (BBC , g), (noisy) measurements (XMEAS , YMEAS ), . . . Output: GP posterior u ∼ GP (mi , ki ) 1 procedure LinPDE-GP(m, k, I PDE , I BC , XMEAS , YMEAS ) 2 i←0 3 (m0 , k0 ) ← (m, k) 4 w0 ← () 5 G0 ← () 6 while not StoppingCriterion() do 7 i←i+1 8 (ψPDE , ψBC , PÛ , vMEAS ) ← Policy(mi , ki ) B Action IψPDE [(u, f )] PDE ,PÛ BC IψBC ,P [(u, g)] Û 9 Ii ← (u, f, g, ) 7→ B Information operator .. . hvMEAS , u(XMEAS ) + i > B Observations 10 yi ← 0 0 . . . hvMEAS , YMEAS i ∗ Gi−1 I1:i−1 kIi ∗ = I1:i kI1:i B Update Gram matrix 11 Gi ← ∗ Ii kI1:i−1 Ii kIi∗ 12 13 14 15 wi ← G†i (y1:i − I1:i [m]) B Update representer weights > mi ← x 7→ m(x) + I1:i [k(x, ·)] wi B Belief Update > † ki ← (x1 , x2 ) 7→ k(x1 , x2 ) − I1:i [k(x1 , ·)] Gi I1:i [k(·, x2 )] return GP (mi , ki ) Modeling uncertainty over the right-hand side f of the PDE, the boundary function g and the measurements YMEAS is achieved by specifying a joint prior over (u, f, g, ). Therefore, Algorithm 1 also returns a multi-output Gaussian process posterior over (u, f, g, ). This means that our method can be used to solve PDE-constrained Bayesian inverse problems for the right-hand side f and the boundary function g, while computing a consistent distributional estimate for the corresponding solution u of the forward problem. This is a generalization of a linear latent force model (Alvarez et al., 2009). If f and g are not uncertain, the corresponding covariance functions in the joint prior can simply be set to 0, which (in the absence of measurements) reduces the joint prior to a simple prior over the solution u. To condition the GP on the PDE and the boundary conditions, we make use of MWR information operators (see Definition 3.1), where the test functions and projection are chosen by an arbitrary policy in each iteration of the method. An example of such a policy which reproduces Figure 1(c) chooses PÛ as the L2 projection onto the basis from ∗ , δ ∗ } and ψ Example 3.5 in every iteration, the test functions ψBC ∈ {δ−1 PDE = 0 in the first 1 two iterations, and ψPDE = φi−2 (and ψBC = 0) from iteration 3 onward. 
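The sketch below mirrors the structure of Algorithm 1 in a deliberately simplified, grid-based form: every information operator is represented by a weight vector acting on the values of u on a dense grid, the "policy" is a fixed list of actions, and the observed values are arbitrary numbers chosen for illustration. In particular, the PDE and boundary information operators of the actual method are replaced here by simple point-evaluation and integral functionals; the purpose is only to show how the Gram matrix, its Cholesky factor and the representer weights are extended iteratively.

```python
import numpy as np

xs = np.linspace(-1.0, 1.0, 201)                       # dense grid standing in for the domain D
dx = xs[1] - xs[0]

def k(x1, x2, ell=0.3):                                # Matérn-3/2 prior covariance
    r = np.abs(np.subtract.outer(x1, x2)) / ell
    return (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

m_prior = np.zeros(xs.size)                            # prior mean on the grid
K_prior = k(xs, xs)

def point_eval(x0):                                    # u ↦ u(x0), snapped to the nearest grid node
    a = np.zeros(xs.size); a[np.argmin(np.abs(xs - x0))] = 1.0
    return a

def integral():                                        # u ↦ ∫_D u(x) dx via trapezoidal weights
    a = np.full(xs.size, dx); a[[0, -1]] *= 0.5
    return a

# Fixed "policy": two boundary observations, an integral observation, and a noisy point
# measurement. The observed values and noise levels are arbitrary illustrative numbers.
actions = [(point_eval(-1.0), 0.0, 0.0), (point_eval(1.0), 0.0, 0.0),
           (integral(), 0.7, 0.0), (point_eval(0.25), 0.9, 1e-2)]

A, y = np.empty((0, xs.size)), np.empty(0)             # stacked information operators and targets
L = np.empty((0, 0))                                   # Cholesky factor of the Gram matrix

for a_i, y_i, sigma2_i in actions:                     # "while not StoppingCriterion()"
    A, y = np.vstack([A, a_i]), np.append(y, y_i)
    # Extend the Cholesky factor by one block row instead of refactorizing the Gramian.
    cross = A[:-1] @ K_prior @ a_i                     # I_{1:i-1} k I_i*
    diag = a_i @ K_prior @ a_i + sigma2_i + 1e-10      # I_i k I_i* (plus jitter)
    s = np.linalg.solve(L, cross) if L.size else np.empty(0)
    L = np.block([[L, np.zeros((L.shape[0], 1))],
                  [s[None, :], np.sqrt(max(diag - s @ s, 1e-12)) * np.ones((1, 1))]])
    w = np.linalg.solve(L.T, np.linalg.solve(L, y - A @ m_prior))   # representer weights

posterior_mean = m_prior + K_prior @ A.T @ w           # belief update on the grid
V = np.linalg.solve(L, A @ K_prior)
posterior_std = np.sqrt(np.clip(np.diag(K_prior - V.T @ V), 0.0, None))  # clip tiny round-off
print(posterior_mean[::50], posterior_std[::50])
```

The incremental block update of the Cholesky factor corresponds to the block-matrix strategy mentioned under Performance Considerations below.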
The ellipses in the information operator I and the observations yi indicate that adding additional information 27 Pförtner, Steinwart, Hennig and Wenger operators is possible in the same fashion. For instance, adding additional PDE information operators enables the solution of systems of linear PDEs. Performance Considerations Instead of naively conditioning the previous conditional process on the new observation in each iteration, Algorithm 1 always conditions the prior on the accumulated observations. This is because the naive expressions for the conditional moments become more and more complex over time. While, in principle, it is possible to use automatic differentiation (AD) to compute Ii [mi ], Ii [ki−1 (x, ·)], and Ii ki−1 Ii∗ in each iteration and then evaluate Equations (4.15) and (4.16) naively, we found that this is detrimental to the performance of the algorithm. In Algorithm 1, we only need to compute Ii [m], and Ii [k(x, ·)], and Ii kIi∗ on the prior moments, which are much less complex and cheaper to evaluate. For maximum efficiency, for many information operator / kernel combinations one can compute optimized closed-form expressions for these terms, alleviating the need for automatic differentiation or quadrature. We can avoid unnecessary recomputation of the representer weights at every iteration of the method by means of block-matrix inversion. For instance, if a Cholesky decomposition is used to invert the Gramian Gi , we can use a variant of the block Cholesky decomposition (Golub and Van Loan, 2013) to update the Cholesky factor of Gi−1 . Code A Python implementation of Algorithm 1 based on ProbNum (Wenger et al., 2021) and JAX (Bradbury et al., 2018) is available at: https://github.com/marvinpfoertner/linpde-gp 3.5 Related Work The area of physics-informed machine learning (Karniadakis et al., 2021) aims at augmenting machine learning models with mechanistic knowledge about physical phenomena, mostly in the form of ordinary and partial differential equations. Recently, there has been growing interest in deep learning–based approaches (Raissi et al., 2019; Li et al., 2020, 2021). However, this model choice makes it inherently difficult to quantify the uncertainty about the solution induced by noise-corrupted input data and inevitable approximation error. Instead, we approach the problem through the lens of probabilistic numerics (Hennig et al., 2015; Cockayne et al., 2019b; Oates and Sullivan, 2019; Owhadi et al., 2019; Hennig et al., 2022), which frames numerical problems as statistical estimation tasks. Probabilistic numerical methods for the solution of PDEs are predominantly based on Gaussian process priors. Our work builds upon and extends these works. Many existing methods aim to find a strong solution to a linear PDE using a collocation scheme (e.g. Graepel 2003; Cockayne et al. 2017; Raissi et al. 2017). Unfortunately, many practically relevant (linear) PDEs only admit weak solutions. Our framework extends existing collocation approaches to weak formulations. Probabilistic numerical methods approximating weak formulations are primarily based on discretization. For example, Cockayne et al. (2019a); Wenger and Hennig (2020) apply a probabilistic linear solver to the linear system arising from discretization. Girolami et al. (2021) propose a statistical version of the finite element method (statFEM), which uses a specific parametric GP prior. 
However, these approaches do not quantify the inherent discretization error, which is often the largest source of uncertainty about the solution. In contrast, our framework models this error and additionally admits a broader class of discretizations. Wang et al. (2021); Krämer et al. (2022) propose GP-based solvers for strong formulations of time-dependent nonlinear PDEs by leveraging finite-difference approximations of the differential operator and linearization-based approximate inference. While it is possible to apply such methods to linear PDEs, the finite-difference approximation of the differential operator introduces additional estimation error. By contrast, the evaluation of the differential operator in our method is exact. Cockayne et al. (2017); Raissi et al. (2017); Girolami et al. (2021) also apply their methods to PDE-constrained (Bayesian) inverse problems. Särkkä (2011) directly infers the right-hand side of a linear PDE in strong formulation by observing measurements of the solution through the associated Green's function. Our approach also builds a belief over an unknown right-hand side, but without requiring access to a Green's function.

The aforementioned methods use the closure of Gaussian processes under conditioning on observations of their sample paths through a linear operator without proof. Owhadi and Scovel (2018) show how to condition Gaussian measures on an orthogonal direct sum of separable Hilbert spaces on observations of one of the summands. Our work extends these results to Gaussian processes with sample paths in separable reproducing kernel Hilbert spaces by leveraging the dualities between the two; recent results about the sample spaces of GPs (Steinwart, 2019; Kanagawa et al., 2018) ensure the applicability of our results to practical GP regression problems. To our knowledge, this yields the first complete proof of this widely used property of GPs. Thus, Theorem 1 provides the theoretical basis for physics-informed GP regression, including the aforementioned methods for the solution of PDEs. In our work, it enables conditioning on information operators constructed from e.g. PDEs, integral equations, or boundary conditions.

4. Gaussian Process Inference with Affine Observations of Sample Paths

Our framework fundamentally relies on the fact that, when a Gaussian process prior is conditioned on affine observations of its paths, one obtains a closed-form posterior. This section provides the theoretical foundation for this result. While this property is used widely in the literature (see e.g. Graepel (2003); Rasmussen and Williams (2006); Särkkä (2011); Särkkä et al. (2013); Cockayne et al. (2017); Raissi et al. (2017); Agrell (2019); Albert (2019); Krämer et al. (2022)), to the best of our knowledge no proof exists of its general form, in which observations are made via bounded linear operators between separable Hilbert function spaces instead of via finite-dimensional linear maps applied to a finite number of point evaluations. Owhadi and Scovel (2018) give a proof of a related property for Gaussian measures. Here, we extend their results to the case of Gaussian processes. While these perspectives are closely related, significant technical care is needed for the result to transfer to the GP case. For our framework this is essential, so that we can leverage the modelling capabilities provided by specifying a kernel as described in Section 3.1.1.
To state the result, let f ∼ GP(m, k) be a (multi-output) GP prior with index set X ⊂ R^d, let L : paths(f) → R^n be a linear operator acting on the paths of f, and let ε ∼ N(µ_ε, Σ_ε) be a Gaussian random vector in R^n with ε ⊥⊥ f. We need to compute the conditional random process

f | L[f] + ε = y

for some y ∈ R^n. Formally, this object is defined as the family

(f | L[f] + ε = y) := {f(x, ·) | E}_{x ∈ X}

of conditional random variables8, where (Ω, B(Ω), P) is the probability space on which both f and ε are defined, E is the event E := h^{-1}({y}) ∈ B(Ω), and h is the random variable

h : Ω → R^n, ω ↦ L[f(·, ω)] + ε(ω).

We refer to Appendix B.1 for definitions of the objects mentioned above. For instance, in Section 3, we use L := (D[·](x_i))_{i=1}^n, where D is a linear differential operator, as well as L[f] := (f(x_i))_{i=1}^n, and, in Section 3.2, we additionally use

L[f] = ∫_D f(x) dx.

It is well-known that h is a Gaussian random vector

h ∼ N(L[m] + µ_ε, LkL* + Σ_ε),

where LkL* ∈ R^{n×n} with

(LkL*)_{ij} = L_i[t ↦ L[k(t, ·)]_j],

and that the conditional random process is a Gaussian process

f | L[f] + ε = y ∼ GP(m_{f|y}, k_{f|y})

with conditional moments given by

m_{f|y}(x) = m(x) + L[k(·, x)]^⊤ (LkL* + Σ_ε)^{-1} (y − (L[m] + µ_ε)), and
k_{f|y}(x_1, x_2) = k(x_1, x_2) − L[k(·, x_1)]^⊤ (LkL* + Σ_ε)^{-1} L[k(·, x_2)].

Since the above are nontrivial claims about potentially ill-behaved infinite-dimensional objects, a proof is important, if only to identify a precise set of assumptions on the objects at play under which the result holds. For instance, it is possible that h is not a random variable (because it might not be measurable), i.e. E might not be a measurable event. To remedy this situation, a major contribution of this work is Theorem 1 together with Corollaries 2 and 3 and their proofs in Appendix B, which provide a sequence of increasingly specialized results capturing the claims above. Hence, besides being the theoretical basis for this work, Theorem 1 and Corollaries 2 and 3 also provide theoretical backing for many of the publications cited above. Our results identify a set of mild assumptions, which are easy to verify and widely applicable in practice. Assumption 1 constitutes the common set of assumptions shared by Theorem 1 and Corollaries 2 and 3. See Appendix B.5 for information on how to verify Assumption 1 in a practical scenario.

8. Here, we need to work with regular conditional probability measures (Klenke, 2014), since the event E typically has probability 0.

Assumption 1. Let f ∼ GP(m_f, k_f) be a Gaussian process prior with index set X on the Borel probability space (Ω, B(Ω), P), whose mean function and sample paths lie in a real separable RKHS H ⊂ R^X with H ⊇ H_{k_f}. Let L : H → H_L be a bounded linear operator mapping the paths of f into a separable Hilbert space H_L.

We start our exposition by presenting Theorem 1, our most general result. Using Theorem 1, it is possible to condition Gaussian processes on affine observations of their paths which take values in arbitrary, potentially infinite-dimensional separable Hilbert spaces. For instance, this means that conditioning on an observation of a whole function is well-defined, provided that the assumptions of Theorem 1 are fulfilled. The formulation of this theorem relies heavily on the theory of Gaussian measures on separable Hilbert spaces, some of which is detailed in Appendix B.2.
Theorem 1 (Affine Gaussian Process Inference). Let Assumption 1 hold. Then ω 7→ f (·, ω) is an H-valued Gaussian random variable with mean mf and covariance operator h 7→ Cf [h] (x) = hkf (x, ·), hiH . We also write f ∼ N (mf , Cf ). Let ∼ N (m , C ) be an HL -valued Gaussian random variable with ⊥ ⊥ f . Then f mf Cf Cf L∗ ∼N , , (4.1) L [mf ] + m LCf LCf L∗ + C L [f ] + with values in H × HL and hence L [f ] + ∼ N (L [mf ] + m , LCf L∗ + C ). (4.2) If ran(LCf L∗ + C ) is closed, then, for all y ∈ HL , (4.3) f | L [f ] + = y ∼ GP mf |y , kf |y , where the conditional mean and covariance function are given by D E mf |y (x) = mf (x) + L [kf (·, x)] , (LCf L∗ + C )† [y − (L [mf ] + m )] HL , (4.4) and D E kf |y (x1 , x2 ) = kf (x1 , x2 ) − L [kf (·, x1 )] , (LCf L∗ + C )† L [kf (·, x2 )] HL , (4.5) respectively. Unfortunately, especially in the context of PDEs, Theorem 1 is difficult to apply in practice, since the operator LCf L∗ + C is infinite-dimensional and its pseudoinverse (if it exists) usually has no analytic form. However, as seen in Section 3, its corollaries can, in practical scenarios, be applied to great effect. Corollary 2 enables affine observations, in which the GP sample paths enter through one or multiple continuous linear functionals. For example, we used Corollary 2 in Section 3.2 to condition on observations of a GPs. To state the result conveniently, we introduce some notation. Notation 1. Let k : X × X → R be a positive-definite kernel and let Li : Hk → Rni for i = 1, 2 be bounded linear operators. By L1 kL∗2 ∈ Rn1 ×n2 , we denote the matrix with entries h i (L1 kL∗2 )ij := L1 x 7→ L2 [k(x, ·)]j . i 31 Pförtner, Steinwart, Hennig and Wenger Table 2: Theorem 1 provides the theoretical basis to condition on (affine) observations of a Gaussian process. While results like conditioning on derivative evaluations are used ubiquitously throughout the literature (e.g. monotonic GPs, Bayesian optimization, probabilistic numerical PDE solvers, . . . ) a complete proof does not exist in the literature, to the best of our knowledge. Observation Information operator Point evaluation Affine finite-dim. operator Point evaluation of derivative Integral Derivative Integro-differential operator Affine operator f f f f f f f Proof known? 7→ f (x) 7→ Af (X) + b d 7→ Rdx f (x) x=x0 7→ D f (x) dµ (x) d 7→ dx f 7→ D [f ] 7→ L [f ] + b Reference Bishop (2006) Bishop (2006) Corollary 3 Corollary 2 Theorem 1 Theorem 1 Theorem 1 It turns out that the order in which the operators L1 , L2 are applied to the arguments of k does not matter, i.e. h i (L1 kL∗2 )ij = L1 x 7→ L2 [k(x, ·)]j = L2 [x 7→ L1 [k(·, x)]i ]j i (see Lemma B.27). This motivates the parenthesis-free notation L1 kL∗2 introduced above. Corollary 2. Let Assumption 1 hold for HL = Rn and let ∼ N (µ , Σ ) be an Rn -valued Gaussian random variable with ⊥ ⊥ f . Then L [f ] + ∼ N (L [mf ] + µ , Lkf L∗ + Σ ) (4.6) f | L [f ] + = y ∼ GP mf |y , kf |y , (4.7) and, for any y ∈ Rn , with conditional mean and covariance function given by D E mf |y (x) = mf (x) + L [kf (x, ·)] , (Lkf L∗ + Σ )† (y − (L [mf ] + µ )) and Rn D E kf |y (x1 , x2 ) = kf (x1 , x2 ) − L [kf (x1 , ·)] , (Lkf L∗ + Σ )† L [kf (·, x2 )] Rn , (4.8) . (4.9) Finally, we turn to Corollary 3, which is the result that is most widely-used throughout the literature (Graepel, 2003; Särkkä, 2011; Särkkä et al., 2013; Cockayne et al., 2017; Raissi et al., 2017; Agrell, 2019; Albert, 2019; Krämer et al., 2022). 
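Before doing so, the following sketch illustrates Corollary 2 for a single integral functional L[f] = ∫_0^1 f(x) dx observed with Gaussian noise. The quantities L[k(x, ·)] and LkL* are approximated by trapezoidal quadrature and, as a sanity check, the single kernel integral is compared against its erf-based closed form for the squared-exponential kernel; the kernel, noise level and observed value are illustrative assumptions, not taken from the paper's experiments.

```python
import numpy as np
from scipy.special import erf

ell, sigma2 = 0.2, 1e-4                                  # kernel lengthscale and observation noise
k = lambda x1, x2: np.exp(-0.5 * np.subtract.outer(x1, x2) ** 2 / ell ** 2)

xq = np.linspace(0.0, 1.0, 2001)                         # quadrature grid on D = [0, 1]
wq = np.full(xq.shape, xq[1] - xq[0]); wq[[0, -1]] *= 0.5

# L[k(x, ·)] by quadrature, checked against the closed form of ∫_0^1 k(x, y) dy.
Lk = k(xq, xq) @ wq
Lk_exact = ell * np.sqrt(np.pi / 2) * (erf((1.0 - xq) / (np.sqrt(2.0) * ell)) + erf(xq / (np.sqrt(2.0) * ell)))
assert np.allclose(Lk, Lk_exact, atol=1e-5)

LkL = wq @ k(xq, xq) @ wq                                # L k L* (a scalar for a single functional)

y = 0.3                                                  # observed value of ∫_0^1 f(x) dx + ε
gain = Lk / (LkL + sigma2)                               # (L k L* + Σ_ε)^† acting on L[k(·, x)]
mean_post = gain * y                                     # Equation (4.8) with zero prior mean
cov_post = k(xq, xq) - np.outer(gain, Lk)                # Equation (4.9)

# The posterior mean integrates to (almost) the observed value; the small gap is the
# shrinkage caused by the observation noise.
print(wq @ mean_post, "vs. observed value", y)
```

We now turn to Corollary 3.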
It shows how Gaussian processes can be conditioned on point evaluations of the image of their paths under a linear operator, provided that the linear operator is bounded and maps into a Hilbert function space, on which point evaluation is continuous. Moreover, it shows that, under these conditions, the image of the GP under the linear operator is itself a Gaussian process. Again, we introduce some notation to facilitate stating the result. 32 Physics-Informed GP Regression Generalizes Linear PDE Solvers Notation 2. Let k : X × X → R be a positive-definite kernel and let Li : Hk → Hi for i = 1, 2 be bounded linear operators mapping into real RKHSs Hi ⊂ RXi . In analogy to Notation 1, we define the bivariate functions kL∗2 : X ×X2 → R, (x, x2 ) 7→ L2 [k(x, ·)] (x2 ) , L1 k : X1 ×X → R, (x1 , x) 7→ L1 [k(·, x)] (x1 ) , L1 kL∗2 : (4.10) (4.11) and X1 ×X2 → R, (x1 , x2 ) 7→ L2 [(L1 k)(x1 , ·)] (x2 ) = L1 [(kL∗2 )(·, x2 )] (x1 ) . (4.12) 0 Corollary 3. Let Assumption 1 hold such that HL is an RKHS HL ⊂ RX . Then L [f ] ∼ GP (L [mf ] , Lkf L∗ ), (4.13) Let ∼ N (µ , Σ ) with values in Rn and ⊥ ⊥ f . Then, for X 0 = {x0i }ni=1 ⊂ X 0 and y ∈ Rn , f | L [f ] X 0 + = y ∼ GP mf |y , kf |y (4.14) with D E † mf |y (x) := mf (x) + (kf L∗ )(x, X 0 ), (Lkf L∗ )(X 0 , X 0 ) + Σ (y − (L [mf ] (X) + µ )) Rn (4.15) and D E † kf |y (x1 , x2 ) := kf (x1 , x2 ) − (kf L∗ )(x1 , X 0 ), (Lkf L∗ )(X 0 , X 0 ) + Σ (Lkf )(X 0 , x2 ) If additionally X = Rn . (4.16) X 0, then kf mf f , ∼ GP Lkf L [mf ] L [f ] kf L∗ Lkf L∗ . (4.17) This corollary is is the theoretical basis for Section 3 and most of Section 3.2. Note that, for L = idH , we recover standard GP regression as a special case in Corollary 3. Remark 4.1 (Multi-Output Gaussian Processes). Theorem 1 and Corollaries 2 and 3 also apply if the GPs involved are multi-output GPs. In this case, the sample paths are functions I × X → R with I = {1, . . . , d} by Definition B.6. In order to apply linear operators defined on functions X → Rd , we interpret a sample path f (·, ω) : I × X → R as a function f˜(·, ω) : X → Rd , x 7→ (f ((i, x), ω))di=1 ∈ Rd . (4.18) 5. Conclusion In this work, we developed a probabilistic framework for the solution of (systems of) linear partial differential equations, which can be interpreted as physics-informed Gaussian process regression. It enables the seamless fusion of (1) a-prior known, provable properties of the system of interest, (2) exact and partial mechanistic information, (3) subjective domain expertise, as well as, (4) noisy empirical measurements into a unified scientific model. 33 Pförtner, Steinwart, Hennig and Wenger This model outputs a consistent uncertainty estimate, which quantifies the inherent approximation error in addition to the uncertainty arising from partially-known physics, as well as limited-precision measurements. Our framework fundamentally relies on the closure of Gaussian processes under conditioning on observations of their sample paths through an arbitrary bounded linear operator. While this result has been used ubiquitously in the literature, a rigorous proof for linear operator observations, as needed in the PDE setting, did not exist prior to this work to the best of our knowledge. By choosing a specific prior and information operator in our framework, it recovers methods of weighted residuals, a popular family of numerical methods for the solution of (linear) PDEs, which includes generalized Galerkin methods such as finite element and spectral methods. 
This demonstrates that classical linear PDE solvers can be generalized in their functionality to include approximate input data and equipped with a structured uncertainty estimate. Our work outlines a general framework for the integration of mechanistic building blocks in the form of information operators derived from e.g. linear PDEs into probabilistic models. Our case study shows that the language of information operators is a powerful toolkit for aggregating heterogeneous sources of partial information in a joint probabilistic model, especially in the context of physics-informed machine learning. This opens up several interesting lines of research. For example, the choice of prior and information operator are not fixed and can be specifically chosen for the problem at hand. The design of adaptive information operators, which actively collect information based on the current belief about the solution could prove to be a promising research direction. Further, the uncertainty estimate about the solution could be used to inform experimental design choices. For example, in the case study from Section 3.2, the posterior belief can be used to optimize the locations of the digital thermal sensors in future CPU designs. Finally, it remains an open question whether this framework can be adapted to nonlinear partial differential equations in a similar manner to how many classic methods solve a sequence of linearized problems to approximate the solution of a nonlinear PDE. Acknowledgments MP, PH and JW gratefully acknowledge financial support by the European Research Council through ERC StG Action 757275 / PANAMA; the DFG Cluster of Excellence “Machine Learning - New Perspectives for Science”, EXC 2064/1, project number 390727645; the German Federal Ministry of Education and Research (BMBF) through the Tübingen AI Center (FKZ: 01IS18039A); and funds from the Ministry of Science, Research and Arts of the State of Baden-Württemberg. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting MP and JW. Finally, the authors are grateful to Filip Tronarp for many helpful discussions concerning the theoretical part of this work. 34 Physics-Informed GP Regression Generalizes Linear PDE Solvers Appendix A. Proofs for Section 3.3 Proof of Example 3.4 PÛ2 [u] = PÛ "m X # PRm [u]i φi (A.1) PRm [u]i PÛ [φi ] (A.2) i=1 = = = = = = m X i=1 m X i=1 m X (A.3) PRm [φi ]j PRm [u]i (A.4) j=1 φj m X j=1 i=1 m X m m X X φj j=1 i=1 m X m X j=1 m X φj φj j=1 = m X PRm [φi ]j φj PRm [u]i m X i=1 m X k=1 m X ! (P −1 )jk huk , φi iL2 PRm [u]i (A.5) ! (P −1 )jk Pki PRm [u]i (A.6) k=1 P −1 P ji PRm [u]i (A.7) i=1 (A.8) φj PRm [u]j j=1 (A.9) = PÛ [u] Proof of Lemma 3.2 By Corollary 2, we have −1 ˆl − B̂PRm [mu ] mu|B̂,l̂ (x) = mu (x) + (B̂PRm ) [ku (x, ·)]> (B̂PRm )ku (B̂PRm )∗ −1 = mu (x) + PRm [ku (x, ·)]> B̂ > B̂Σc B̂ > B̂ B̂ −1 ˆl − PRm [mu ] = mu (x) + PRm [ku (x, ·)]> Σ−1 B̂ −1 ˆl − PRm [mu ] . c Since PÛ is a bounded projection, we have U = ran(PÛ ) ⊕ ker(PÛ ) (A.10) (Rudin 1991, Section 5.16) = Û ⊕ ker(PÛ ), (A.11) where each u ∈ U decomposes uniquely into u = uÛ + ucÛ with uÛ ∈ Û and ucÛ ∈ ker(PÛ ). It is clear that uÛ = PÛ [u] , 35 Pförtner, Steinwart, Hennig and Wenger and ucÛ = id −PÛ [u] = Pker(PÛ ) [u] . This implies h i −1 ˆ m [mu ] PRm mu|B̂,l̂ = PRm [mu ] + PRm ku PR∗ m Σ−1 B̂ l − P R | {z } c =Σc = PRm [mu ] + B̂ = B̂ −1 ˆl −1 ˆ l − PRm [mu ] = cMWR . 
Hence, we have m m h i h i X X PRn mu|B̂,l̂ φi = cMWR φi = uMWR PÛ mu|B̂,l̂ = i i i=1 (A.12) i=1 h i and since U = Û ⊕ ker(PÛ ), the statement follows. Moreover, note that PRn mu|B̂,l̂ is the mean of c | B̂c − ˆl = 0 and its covariance matrix is given by −1 B̂Σc Σc|B̂,l̂ = Σc − Σc B̂ > B̂Σc B̂ > −1 = Σc − Σc B̂ > (B̂ > )−1 Σ−1 c B̂ B̂Σc = Σc − Σc Σ−1 c Σc = 0. Consequently, c | B̂c − ˆl = 0 ∼ δcMWR . Proof of Corollary 3.3 h i Pker(PÛ ) mu|B̂,l̂ (x) −1 ˆl − B̂PRm [mu ] = Pker(PÛ ) [mu ](x) + (δx ◦ Pker(PÛ ) )ku (B̂PRm )∗ (B̂PRm )ku (B̂PRm )∗ | {z } =0 −1 ˆl − B̂PRm [mu ] = δx (Pker(PÛ ) ku PR∗ m ) B̂ > (B̂PRm )ku (B̂PRm )∗ | {z } =0 =0 Proof of Lemma 3.4 Since PÛ is idempotent, we have Pker(PÛ ) PÛ = PÛ − PÛ2 = PÛ − PÛ = 0 36 Physics-Informed GP Regression Generalizes Linear PDE Solvers and PRn PÛ = (IRÛm )−1 PÛ2 = (IRÛm )−1 PÛ = PRn . It follows that Pker(PÛ ) ku PR∗ n = Pker(PÛ ) k̃u PR∗ n − Pker(PÛ ) PÛ k̃u | {z } =0 − Pker(PÛ ) k̃u PÛ∗ PR∗ n + 2 Pker(PÛ ) PÛ k̃u PÛ∗ PR∗ n | {z } =0 ∗ ∗ = Pker(PÛ ) k̃u PR∗ n − Pker(PÛ ) k̃u PRn PÛ | {z } =PRn = 0. 37 Pförtner, Steinwart, Hennig and Wenger Appendix B. Proofs for Section 4 This appendix constitutes a proof of Theorem 1 and Corollaries 2 and 3. More precisely, Appendices B.1, B.2 and B.2.2 introduce the objects needed to formalize these results, while Appendices B.2.1, B.2.3 and B.3 develop the machinery used to conduct their proof which is given in Appendix B.4. In the following, B (Ω) denotes the Borel σ-algebra on some topological space Ω. Let H be a Hilbert space. For a linear functional l ∈ H∗ , l∗ ∈ H denotes the unique vector for which l [h] = hl∗ , hiH for all h ∈ H. Similarly, for h ∈ H, h∗ ∈ H∗ denotes the linear functional h∗ [h0 ] = hh, h0 iH for h0 ∈ H. B.1 Gaussian Processes We start by reviewing the definition and basic properties of Gaussian processes. Definition B.1. A Gaussian process (GP) f with index set X is a family {fx }x∈X of Rvalued random variables on a common probability space (Ω, B (Ω) , P), such that, for each finite set of indices x1 , . . . , xn , the joint distribution of fx1 , . . . , fxn is Gaussian. We also write f (x) := fx and f (x, ω) := fx (ω). Definition B.2. Let f be a Gaussian process on (Ω, B (Ω) , P) with index set X . The function m : X → R, x 7→ m(x) = EP [f (x)] is called the mean (function) of f and the function k : X × X → R, (x1 , x2 ) 7→ k(x1 , x2 ) = CovP [f (x1 ), f (x2 )] is called the covariance function or kernel of f . We also often write f ∼ GP (m, k) if f is a Gaussian process with mean m and kernel k. We commonly use Gaussian processes to model our belief about unknown functions, which can be motivated by interpreting their sample paths as function-valued random variables: Definition B.3. Let f be a Gaussian process on (Ω, B (Ω) , P) with index set X . For each ω ∈ Ω, the function f (·, ω) : X → R, x 7→ f (x, ω) is called a (sample) path of the Gaussian process. The set paths (f ) := {f (·, ω) : ω ∈ Ω} ⊆ RX containing all sample paths of f is referred to as the path space of f . Lemma B.4. Let f be a Gaussian process on (Ω, B (Ω) , P) with index set X . Consider the function fX : Ω → paths (f ) , ω 7→ f (·, ω). If there is a σ-algebra on paths (f ) such that fX is measurable, then fX is a function-valued random variable with values in paths (f ). In the following, we will refer to function-valued random variables as random functions, in analogy to the concept of a random variable. 
38 Physics-Informed GP Regression Generalizes Linear PDE Solvers Using the rules of linear-Gaussian inference (Bishop, 2006), we can easily see that f ∼ GP (m, k) Af (X) ∼ N Am(X), Ak(X, X)A> f | Af (X) + b = y ∼ GP mf |y , kf |y , where A ∈ Rm×n , X = {xi }ni=1 ⊂ X , b ∼ N (µ, Λ) with b ⊥ ⊥ f and mf |y (x) := m(x) + k(x, X)A> (Ak(X, X)A> + Λ)† (y − (Am + µ)) kf |y (x1 , x2 ) := k(x1 , x2 ) − k(x1 , X)A> (Ak(X, X)A> + Λ)† Ak(X, x2 ). It is tempting to think that the above also extends to more general linear transformations of f such as differentiation and integration. Unfortunately, this is not the case, since the result from (Bishop, 2006) heavily uses the fact that, by definition, evaluations of the Gaussian process at a finite set of points follow a joint Gaussian distribution. However, differentiation and integration are examples of linear operators, i.e. linear maps between vector spaces of functions, which operate on an (uncountably) infinite subset of the random variables. To generalize the result above to linear operators L (or more generally affine maps) acting on the paths of f , we need to analyze the objects L [f ] and f | L [f ] = h. By L [f ], we denote the function ω 7→ L [f (·, ω)] = (L ◦ fX )(ω), which is a random variable if there is a σ-algebra on paths (f ) and the image of L such that L and fX are measurable. If we understand the joint law of fX and L [f ], we can compute the conditional random variable fX | L [f ] = h and the conditional random process f | L [f ] = h. This outlines the proof strategy we will follow below. Specifically, we will 1. gain an understanding of the structure of the GP’s path space paths (f ) in order to be able to decide whether fX is a random function, i.e. measurable. We will focus on cases, in which we can continuously embed paths (f ) into a separable Hilbert space H, which is a measurable space with respect to B (H). This will be useful when applying linear operators to the GP, since it helps decide whether paths (f ) lies in the domain of the linear operator and whether the linear operator is measurable. 2. analyze the law of the random function fX in order to understand the belief about the sample paths encoded in P and fX . If paths (f ) is (a subset of) a separable Hilbert space H, then the law of fX will turn out to be a Gaussian measure on H. 3. analyze the law of the random functions L ◦ fX and (fX , L ◦ fX ). We will assume that L maps into some separable Hilbert space HL . Since Gaussian random variables on separable Hilbert spaces are closed under continuous affine transformations between such spaces, L ◦ fX and (fX , L ◦ fX ) are also Gaussian if L : H 7→ HL is bounded. 4. compute the conditional Gaussian measure fX | L [f ] = h by marginalizing over L ◦ fX in (fX , L ◦ fX ) | L ◦ fX = h. 5. show how to transform Gaussian random variables on separable Hilbert spaces into Gaussian processes. With this result we are then able to transform L ◦ fX and fX | L ◦ fX = h back into Gaussian processes. 39 Pförtner, Steinwart, Hennig and Wenger Fortunately, the first point has already been extensively addressed in the literature. See Kanagawa et al. (2018, Section 4) for an overview. Remark B.5. Let f ∼ GP (m, k) be a Gaussian process with index set X and let Hk be the reproducing kernel Hilbert space (RKHS) of the covariance function or kernel k. If dim Hk = ∞, then the sample paths of f do almost surely not lie in Hk . 
Fortunately, in many cases, there exists a larger related RKHS Hk0 ⊃ Hk , which contains the sample paths with probability 19 . We refer to (Kanagawa et al., 2018, Section 4) and Steinwart (2019) for more details on sample path properties. In Appendix B.5, we have already seen that Sobolev spaces can be obtained as path spaces of Gaussian processes with Matérn covariance functions. B.1.1 Multi-output Gaussian Processes The sample paths of Gaussian processes as defined in Definition B.1 are always real-valued. However, especially in the context of PDEs, vector-valued functions are ubiquitous, e.g. when dealing with vector fields such as the electric field. Fortunately, the index set of a Gaussian process can be chosen freely, which means that we can “emulate” vector-valued 0 GPs. More precisely, a function f : X → Rd can be equivalently viewed as a function f 0 : {1, . . . , d0 } × X → R, (i, x) 7→ f 0 (i, x) = fi (x). Applying this construction to a Gaussian process leads to the following definition of a multi-output Gaussian process: Definition B.6 (Multi-output Gaussian Process). A d-output Gaussian process f with index set X on (Ω, B (Ω) , P) is a Gaussian process with index set X 0 := {1, . . . , d}×X on the same probability space. With a slight abuse of notation, we write fx (ω) := (f(i,x) (ω))di=1 ∈ Rd , etc. We also write the mean and covariance functions m and k of f as m : X → Rd and k : X × X → Rd×d , where m(1, x) k((1, x1 ), (1, x2 )) . . . k((1, x1 ), (d, x2 )) .. .. .. m(x) = ... and k(x1 , x2 ) = . . . . m(d, x) k((d, x1 ), (1, x2 )) . . . k((d, x1 ), (d, x2 )) B.2 Gaussian Measures on Separable Hilbert Spaces As stated before, we need to understand the law of the random function fX . This amounts to analyzing the pushforward measure µ := P ◦ fX−1 . In many cases, µ will turn out to be a Gaussian probability measure on a (usually) infinite-dimensional separable Hilbert function space H ⊇ paths (f ) (see Proposition B.22 and Lemma B.13). Definition B.7. Let H be a real separable Hilbert space. A probability measure µ on (H, B (H)) is called Gaussian if hh, ·iH is a univariate Gaussian random variable for all h ∈ H. An H-valued random variable is called Gaussian if its law is Gaussian. Just as for probability measures on Euclidean vector space Rn , we can define a mean and covariance (operator) for this more general class of probability measures. 9. In practice, f is virtually always implicitly defined via m and k without ever constructing the function fX and the probability space. Hence, we can always choose fX and Ω such that f ∼ GP (m, k) where f ∈ Hk0 even holds pathwise, i.e. f (·, ω) ∈ Hk0 for all ω ∈ Ω, instead of just with probability 1. 40 Physics-Informed GP Regression Generalizes Linear PDE Solvers Definition B.8. Let X be a random variable on (Ω, B (Ω) , P) with values in a real separable Hilbert space H. If hh, X(·)iH ∈ L1 (Ω, P) for all h ∈ H, and there is mX ∈ H such that Z hh, mX iH = EX [hh, XiH ] = Ω (B.1) hh, X(ω)iH dP (ω) for all h ∈ H, then m is called the mean (vector) of X. Let X 0 be another random variable on (Ω, B (Ω) , P) with values in a real separable Hilbert space H0 and mean mX 0 . If hh, X(·)iH ∈ L2 (Ω, P) for all h ∈ H, hh0 , X 0 (·)iH0 ∈ L2 (Ω, P) for all h0 ∈ H0 , and there is a linear operator CX,X 0 : H → H0 such that h0 , CX,X 0 [h] H0 = CovX,X 0 hh, XiH , h0 , X 0 H0 Z hh, X(ω) − miH h0 , X 0 (ω) − m0 = Ω (B.2) H0 dP (ω) for all h ∈ H and h0 ∈ H0 , then CX,X 0 is called the cross-covariance operator of X and X 0 . 
If X = X 0 , then C is referred to as the covariance operator of X. Remark B.9. One can show that the existence of the mean vector and (cross-)covariance operator already follows from the given conditions. More precisely, the mean mX exists if hh, X(·)iH ∈ L1 (Ω, P) for all h ∈ H, and the (cross-)covariance operator exists if hh, X(·)iH ∈ L2 (Ω, P) for all h ∈ H and hh0 , X 0 (·)iH0 ∈ L2 (Ω, P) for all h0 ∈ H0 . Remark B.10. One can show that covariance operators are self-adjoint and positive. Moreover, covariance operators are in the trace class (Maniglia and Rhandi, 2004, Section 1.2) and hence compact and bounded. Remark B.11. The mean and the covariance operator of a Gaussian random variable with values in a separable Hilbert space always exist and they identify its law uniquely (Maniglia and Rhandi, 2004, Theorem 1.2.5). Conversely, for every self-adjoint, positive, trace-class operator C : H → H and m ∈ H, there is a Gaussian measure with mean m and covariance operator C. Hence, we also often write N (m, C) to denote Gaussian measures on separable Hilbert spaces. Using the notion of a Bochner integral (Yosida, 1995, section V.5), we can also give an equivalent definition of the mean and covariance operator, which is more similar to the finitedimensional counterpart. For our purposes, Bochner integrals have the favorable property that they commute with bounded linear operators, i.e. if f : Ω → V is a Bochner integrable function mapping a measure space (Ω, B (Ω) , µ) into a Banach space V and L : V → U is a bounded linear operator between V and another Banach space U , then ω 7→ L [f (ω)] is Bochner integrable and Z Z L [f (ω)] dµ (ω) = L B f (ω) dµ (ω) B for B ∈ B (Ω) (Yosida, 1995, Section V.5, Corollary 2). 41 (B.3) Pförtner, Steinwart, Hennig and Wenger Lemma B.12. Let X ∼ N (mX , CX ) be a Gaussian random variable on (Ω, B (Ω) , P) with values in a real separable Hilbert space H. Then X is Bochner P-integrable and the mean m of X is given by the following Bochner integral Z X(ω) dP (ω) . (B.4) mX = Ω X0 Let ∼ N (mX 0 , CX 0 ) be another Gaussian random variable on (Ω, B (Ω) , P) with values in a real separable Hilbert space H0 . Then the function ω 7→ hh, X(ω) − mX iH (X 0 (ω) − mX 0 ) is Bochner P-integrable for any h ∈ H and the cross-covariance operator CX,X 0 of X and X 0 is given by the Bochner integral Z CX,X 0 [h] := hh, X(ω) − mX iH (X 0 (ω) − mX 0 ) dP (ω) . (B.5) Ω Proof By Maniglia and Rhandi (2004, Theorem 1.2.5), we have that kX(·)kH ∈ L2 (Ω, P). Hence, sZ sZ Z Z kX(ω)kH dP (ω) = 1 · kX(ω)kH dP (ω) ≤ 1 dP (ω) · kX(ω)k2H dP (ω) < ∞ Ω Ω Ω Ω by the Cauchy-Schwarz inequality in L2 (Ω, P) and the fact that P is a probability measure. Moreover, X is measurable and H ⊃ ran(X) is separable, which means that X is strongly measurable (Yosida, 1995, Section V.4, Pettis’ Theorem). It follows that X is Bochner integrable (Yosida, 1995, Section V.5, Theorem 1) and that Z Z hh, mX iH = hh, X(ω)iH dP (ω) = h, X(ω) dµ (ω) Ω Ω H for h ∈ H (Yosida, 1995, Section V.5, Corollary 2), since hh, ·iH is continuous. The function ω 7→ hh, X(ω) − mX iH (X 0 (ω) − mX 0 ) is clearly weakly measurable and, since H is separable, also strongly measurable (Yosida, 1995, Section V.4, Pettis’ Theorem). By the triangle inequalities in H and H0 and the fact that P is a probability measure, we have kX(·) − mX kH ∈ L2 (Ω, P) and kX 0 (·) − mX 0 kH0 ∈ L2 (Ω, P). 
Hence, for h ∈ H, Z hh, X(ω) − mX iH (X 0 (ω) − mX 0 ) H0 dP (ω) ZΩ = |hh, X(ω) − mX iH | X 0 (ω) − mX 0 H0 dP (ω) Ω Z ≤ khkH kX(ω) − mX kH X 0 (ω) − mX 0 H0 dP (ω) Ω = khkH kX(·) − mX kH , X 0 (·) − mX 0 H0 L2 (Ω,P) <∞, by the Cauchy-Schwarz inequality in H. It follows that ω 7→ hh, X(ω) − mX iH (X 0 (ω)−mX 0 ) is Bochner integrable for any h ∈ H (Yosida, 1995, Section V.5, Theorem 1) and that Z 0 h , CX,X 0 [h] H0 = hh, X(ω) − mX iH h0 , X 0 (ω) − mX 0 H0 dP (ω) Ω 42 Physics-Informed GP Regression Generalizes Linear PDE Solvers = 0 Z h, Ω 0 hh, X(ω) − mX iH (X (ω) − mX 0 ) dP (ω) H0 for any h ∈ H and h0 ∈ H0 (Yosida, 1995, Section V.5, Corollary 2), where we used the fact that hh0 , ·iH0 is continuous. B.2.1 Continuous Affine Transformations Just as their finite-dimensional counterparts, Gaussian random variables with values in separable Hilbert are closed under continuous affine transformations and the expressions for the transformed mean and covariance operator are analogous to the finite-dimensional case. In the following, we will use this result to compute the law of L ◦ fX . Lemma B.13. Let L : H1 → H2 be a bounded linear operator between real separable Hilbert spaces H1 , H2 and let b ∈ H2 . Let X ∼ N (m, C) be an H1 -valued Gaussian random variable. Then L [X(·)] + b ∼ N (L [m] + b, LCL∗ ). Proof See Lemma 1.2.7 in Maniglia and Rhandi (2004). B.2.2 Joint Gaussian Measures on Separable Hilbert Spaces In order to compute fX | L ◦ fX = h, we need access to the joint distribution of fX and L ◦ fX . Using Lemma B.13 to apply the linear operator h 7→ (h, L [h]) to the Gaussian random function fX , it becomes apparent that this joint distribution can be described by a Gaussian measure on a Cartesian product H × HL of separable Hilbert spaces, where HL is the codomain of L. Remark B.14. The Cartesian product H× := H1 × · · · × Hn of a finite family {Hi }ni=1 of real Hilbert spaces equipped with elementwise addition and scalar multiplication is a real Hilbert space with respect to the inner product h, h0 H× := n X hi , h0i Hi . i=1 Additionally, if every Hi for i = 1, . . . , n is separable, then H× is separable (Adams and Fournier, 2003, Theorem 1.23). Unless stated otherwise, we will always equip Cartesian products of Hilbert spaces with the Hilbert space structure described above. Lemma B.15. Let i ∈ {1, . . . , n}. The i-th projection map Πi : H× → Hi , h 7→ hi on H× is a bounded linear operator and Π∗i [hi ] = (0, . . . , 0 , hi , 0, . . . , 0). | {z } i−1 times P Proof Let h = (h1 , . . . , hn ) ∈ H× . Then kΠi [h]k2Hi = khi k2Hi ≤ nj=1 khj k2Hj = khk2H× and + * X 0 hi , Πi h H = hi , h0i H = hi , h0i H + 0, h0j H = (0, . . . , 0 , hi , 0, . . . , 0), h0 i i i j | {z } j6=i for all h0 ∈ H× . 43 i−1 times H× Pförtner, Steinwart, Hennig and Wenger Notation B.16. For linear operators L : H → H0 between Cartesian products H = H1 × 0 of real Hilbert spaces, we introduce the notation · · · × Hn and H0 = H10 × · · · × Hm L [(h1 , . . . , hn )] = (L11 [h1 ] + · · · + L1n [hn ] , . . . , Lm1 [h1 ] + · · · + Lmn [hn ]) L11 . . . L1n .. [(h , . . . , h )] , .. =: ... n . . 1 Lm1 . . . Lmn with Lij := Π0i LΠ∗j : Hj → Hi0 , where Πi and Π0i denote the i-th projection maps on H and H0 , respectively. Lemma B.15 implies that Lij is bounded if L is bounded. Specifically, for a covariance operator L = C (i.e. H = H0 ), we know that C is bounded and hence all blocks Cij of the covariance operator are bounded. One can show that Cij is the cross-covariance ∗. 
operator between entries i and j of the tuple and hence Cij = Cji We will refer to a Gaussian measure on a Cartesian product of separable Hilbert spaces as a joint Gaussian measure on separable Hilbert spaces. In the remainder of this section, we will show that joint Gaussian measures on separable Hilbert spaces share some important properties with their finite-dimensional counterparts. First of all, we can use orthogonal projections to marginalize over variables in a random vector whose law is a joint Gaussian measure on separable Hilbert spaces. Corollary B.17 (Marginalization in Joint Gaussian Measures). Let H1 , H2 be real separable Hilbert spaces and let X ∼ N (m, C) be an H1 × H2 -valued Gaussian random variable. Then Xi ∼ N (mi , Cii ) for i ∈ {1, 2}. Proof This follows from Lemma B.13, since Xi = Πi ◦ X and Πi is linear and bounded. The statistical independence properties of joint Gaussian measures on Hilbert spaces are also analogous to the finite-dimensional case. Proposition B.18 (Independence in Joint Gaussian Measures). Let H1 , H2 be real separable Hilbert spaces and let X1 and X2 be independent random variables on (Ω, B (Ω) , P) with values in H1 and H2 , respectively, where X1 ∼ N (m1 , C1 ) and X2 ∼ N (m2 , C2 ). Then X : Ω → H, ω 7→ (X1 (ω), X2 (ω)) is a Gaussian random variable on (Ω, B (Ω) , P) with mean m = (m1 , m2 ) and covariance operator C1 0 C := . 0 C2 Proof Let H := H1 × H2 and h∗ ∈ H∗ . Then h∗ = h∗1 + h∗2 , where h∗i := h∗ ◦ Π∗i ∈ Hi∗ for i ∈ {1, 2}. X1 and X2 are Gaussian, which implies that h∗1 ◦ X1 and h∗2 ◦ X2 are Gaussian. Moreover, h∗1 ◦ X1 ⊥ ⊥ h∗2 ◦ X2 , because X1 ⊥ ⊥ X2 . Since the sum of independent (univariate) Gaussian random variables is Gaussian, it follows that h∗ ◦ X is Gaussian. Hence, X is Gaussian. Πi for i ∈ {1, 2} is bounded and thus, by Lemma B.13, we have that m = (m1 , m2 ), C11 = C1 , and C22 = C2 . Let µ, µ1 and µ2 be the laws of f , X1 and X2 , respectively. Then, X1 ⊥ ⊥ X2 implies µ = µ1 ⊗ µ2 and hence, for h2 ∈ H2 , C12 [h2 ] = Π1 CΠ∗2 [h2 ] 44 Physics-Informed GP Regression Generalizes Linear PDE Solvers Z 0 0 0 = Π1 (0, h2 ), h − m H (h − m) dµ h H Z = h2 , h02 − m2 H2 (h01 − m1 ) dµ h0 H Z = ZH2 = H2 (Yosida 1995, Section V.5, Corollary 2) h2 , h02 − m2 H2 (h01 − m1 ) dµ1 h01 dµ2 h02 H1 Z 0 h2 , h2 − m2 H (h01 − m1 ) dµ1 h01 dµ2 h02 = 0 H | 1 {z } Z =0 ∗ = 0∗ = 0. and C21 = C12 Note that analogous versions of these results also hold in joint Gaussian measures with more than two components. This follows from Lemma B.13 and the fact that there are isometries between H1 × · · · × Hn and arbitrary reorderings and/or parenthesizations of the Cartesian product. B.2.3 Conditional Gaussian Measures on Separable Hilbert Spaces At the heart of Theorem 1 is the conditional random process f | L [f ] = h. We will compute this process by conditioning the joint Gaussian measure (fX , L ◦ fX ) on a given value of its second component. To do so, our main workhorse will be a result by Owhadi and Scovel (2018) who show how to condition Gaussian measures on an orthogonal direct sum of separable Hilbert spaces on observations in one of the two subspaces, i.e. they show how to compute X | X2 = t, where X = X1 + X2 is a Gaussian random variable with values in H1 ⊕ H2 . Unfortunately, Owhadi and Scovel (2018) don’t give explicit expressions for the conditional mean and covariance operator. 
In the following, we add to Theorem 3.3 in Owhadi and Scovel (2018) by constructing explicit expressions for the mean and covariance operator of the conditional measure, which resemble the well-known expressions for conditional Gaussian measures on finite-dimensional Euclidean vector spaces. Theorem B.19. Let H1 , H2 be real separable Hilbert spaces and let H := H1 × H2 . Let X be an H-valued Gaussian random variable with mean m = (m1 , m2 ) and covariance operator C11 C12 := C :H→H ∗ C12 C22 such that ran(C22 ) is closed. Then X | X2 = t for any t ∈ H2 is an H-valued Gaussian random variable with mean † m1 + C12 C22 [t − m2 ] := mX|X2 =t , (B.6) t and covariance operator CX|X2 =t := † ∗ C11 − C12 C22 C12 0 . 0 0 45 (B.7) Pförtner, Steinwart, Hennig and Wenger Proof H is an orthogonal direct sum of (separable) subspaces Ĥ1 := {(h1 , 0) | h1 ∈ H1 } = Π∗1 [H1 ] , Ĥ2 := {(0, h2 ) | h2 ∈ H2 } = and Π∗2 [H2 ] . Let Π̂i := Πi |Ĥi : Ĥi → Hi for i ∈ {1, 2} be the restriction of Πi to Ĥi . Note that the Π̂i are unitary. We have m = m̂1 + m̂2 , where m̂1 := Π̂∗1 [m1 ] and m̂2 := Π̂∗2 [m2 ]. Using the blockmatrix notation for operators on orthogonal direct sums of Hilbert spaces from Owhadi and Scovel (2018); Anderson and Trapp (1975), the covariance operator can be represented by ∗ Cˆ11 Cˆ12 Π̂∗1 C12 Π̂2 Π̂1 C11 Π̂1 =: ˆ∗ ˆ C= C12 C22 Ĥ ⊕Ĥ (Π̂∗1 C12 Π̂2 )∗ Π̂∗2 C22 Π̂2 Ĥ ⊕Ĥ 1 1 2 2 Let t ∈ H2 . Note that X | X2 = t = X | (0, X2 ) = (0, t) = X | X̂2 = t̂ where X̂2 := Π̂∗2 [X2 ], and t̂ := Π̂∗2 [X2 ]. By Theorem 3.3 in Owhadi and Scovel (2018), X | X̂2 = t̂ is Gaussian, its covariance operator is the short of C to Ĥ2 (Anderson and Trapp, 1975), and, if a C-symmetric oblique projection Q onto Ĥ2 (Owhadi and Scovel, 2018) of the form h i h i Q ĥ1 + ĥ2 = Q̂21 ĥ1 + ĥ2 ∈ Ĥ2 for some Q̂21 : Ĥ1 → Ĥ2 exists, then its mean is given by (m̂1 + Q̂∗21 t̂ − m̂2 ) + t̂. In the following, we will show that the expressions for mX|X2 =t and CX|X2 =t from Equations (B.6) and (B.7) are indeed equal to the mean and covariance operator of X | X̂2 = t̂, respectively. † ˆ∗ We will first show that Q̂21 := Cˆ22 C12 defines a C-symmetric oblique projection Q onto Ĥ2 . Evidently, Q is idempotent, i.e. Q2 = Q. In Notation B.16, we noted that C12 and C22 are bounded and hence Cˆ12 and Cˆ22 are bounded. Since ran(C22 ) is closed and Π̂2 is unitary, ran(Cˆ22 ) is closed. It follows from Theorem 3 in Ben-Israel and Greville (2003, † Section 8.3) that the Moore-Penrose pseudoinverse Cˆ22 exists and his bounded. Q̂21 and Q i † ˆ ˆ are bounded because C and C12 are. Moreover, ran(Q) = Q̂21 Ĥ1 + Ĥ2 = Ĥ2 , since 22 † ran(Q̂21 ) ⊂ ran(Cˆ22 ) ⊂ Ĥ2 . It remains to show that Q∗ C = CQ. Cˆ22 is bounded and thus † ∗ ∗ )† = Cˆ Cˆ† closed (Yosida, 1995, Section II.6), which means that Q̂∗21 = Cˆ12 (Cˆ22 ) = Cˆ12 (Cˆ22 12 22 ˆ22 is by Theorem 2 (g) from Ben-Israel and Greville (2003, Section 8.3) and the fact that C i h h i † self-adjoint. Consequently, the adjoint of Q is given by Q∗ ĥ1 + ĥ2 = Cˆ12 Cˆ22 ĥ2 + ĥ2 , because D h iE D h i E ĥ1 + ĥ2 , Q ĥ01 + ĥ02 = ĥ1 + ĥ2 , Q̂21 ĥ01 + ĥ02 H H D h i E 0 0 = ĥ2 , Q̂21 ĥ1 + ĥ2 H 46 Physics-Informed GP Regression Generalizes Linear PDE Solvers (Ĥ1 ⊥ Ĥ2 ) D E + h2 , h02 H = Q̂∗21 [h2 ] , h01 H D E D h i E + ĥ2 , ĥ01 = Q̂∗21 ĥ2 , ĥ01 D H E D h iH E ∗ 0 + ĥ2 , ĥ02 + Q̂21 ĥ2 , ĥ2 H H (Ĥ1 ⊥ Ĥ2 ) h i E D = Q̂∗21 ĥ2 + ĥ2 , ĥ01 + ĥ02 H D h i E † 0 ˆ ˆ = C12 C22 ĥ2 + ĥ2 , ĥ1 + ĥ02 H for all ĥ1 + ĥ2 , ĥ01 + ĥ02 ∈ H. 
Since $\hat C_{22}$ is bounded, self-adjoint and positive, its square root $\hat C_{22}^{1/2}$ exists and is also bounded, self-adjoint and positive (Bernau, 1968, Theorem 4). Moreover, we have
\[
\operatorname{ran}(\hat C_{12}^*) \subset \operatorname{ran}(\hat C_{22}^{1/2}) = \operatorname{ran}(\hat C_{22}), \tag{B.8}
\]
where the inclusion follows from Theorem 3 in Anderson and Trapp (1975) and the equality holds due to the fact that $\operatorname{ran}(\hat C_{22})$ is closed (Dixmier, 1949; Tarcsay, 2014). Let $\hat h_1 + \hat h_2 \in H$. If $\hat h_2 \in \operatorname{ran}(\hat C_{22})$, then we indeed find
\begin{align*}
Q^* C [\hat h_1 + \hat h_2]
&= Q^* \big[ (\hat C_{11}[\hat h_1] + \hat C_{12}[\hat h_2]) + (\hat C_{12}^*[\hat h_1] + \hat C_{22}[\hat h_2]) \big] \\
&= \hat C_{12} \hat C_{22}^\dagger \big[ \hat C_{12}^*[\hat h_1] + \hat C_{22}[\hat h_2] \big] + (\hat C_{12}^*[\hat h_1] + \hat C_{22}[\hat h_2]) \\
&= \big( \hat C_{12} \hat C_{22}^\dagger \hat C_{12}^*[\hat h_1] + \hat C_{12} \hat C_{22}^\dagger \hat C_{22}[\hat h_2] \big) + (\hat C_{12}^*[\hat h_1] + \hat C_{22}[\hat h_2]) \\
&= \hat C_{12} \big[ \hat C_{22}^\dagger \hat C_{12}^*[\hat h_1] + \hat h_2 \big] + \hat C_{22} \big[ \hat C_{22}^\dagger \hat C_{12}^*[\hat h_1] + \hat h_2 \big] \\
&\qquad \text{($\hat h_2 \in \operatorname{ran}(\hat C_{22})$, $\hat C_{22} \hat C_{22}^\dagger|_{\operatorname{ran}(\hat C_{22})} = \operatorname{id}_{\operatorname{ran}(\hat C_{22})}$ and $\operatorname{ran}(\hat C_{12}^*) \subset \operatorname{ran}(\hat C_{22})$ by Equation (B.8))} \\
&= C \big[ \hat C_{22}^\dagger \hat C_{12}^*[\hat h_1] + \hat h_2 \big] \\
&= C Q [\hat h_1 + \hat h_2]. \tag{B.9}
\end{align*}
Now consider a general $\hat h_2 \in \hat H_2$. Since $\operatorname{ran}(\hat C_{22})$ is closed, we have $\hat H_2 = \operatorname{ran}(\hat C_{22}) \oplus \operatorname{ran}(\hat C_{22})^\perp$ (Yosida, 1995, Section III.1, Theorem 1), which implies that there is a unique additive decomposition $\hat h_2 = \hat h_2^\parallel + \hat h_2^\perp$ with $\hat h_2^\parallel \in \operatorname{ran}(\hat C_{22})$ and $\hat h_2^\perp \in \operatorname{ran}(\hat C_{22})^\perp = \ker(\hat C_{22})$. Moreover, $\hat h_2^\perp \in \operatorname{ran}(\hat C_{22})^\perp \subset \operatorname{ran}(\hat C_{12}^*)^\perp = \ker(\hat C_{12})$ by Equation (B.8), and hence $\hat h_2^\perp \in \ker(C)$. This implies that $Q^* C[\hat h_2^\perp] = Q^*[0] = 0 = C[\hat h_2^\perp] = C Q[\hat h_2^\perp]$, and hence
\begin{align*}
Q^* C [\hat h_1 + (\hat h_2^\parallel + \hat h_2^\perp)]
&= Q^* C [\hat h_1 + \hat h_2^\parallel] + Q^* C [\hat h_2^\perp] \\
&= C Q [\hat h_1 + \hat h_2^\parallel] + C Q [\hat h_2^\perp] \\
&= C Q [\hat h_1 + (\hat h_2^\parallel + \hat h_2^\perp)]
\end{align*}
by Equation (B.9), since $\hat h_2^\parallel \in \operatorname{ran}(\hat C_{22})$. This concludes the proof that $Q$ with this choice of $\hat Q_{21}$ is a $C$-symmetric oblique projection onto $\hat H_2$.

By Theorem 3.3 in Owhadi and Scovel (2018), it follows that the mean of $X \mid \hat X_2 = \hat t$ is given by
\begin{align*}
(\hat m_1 + \hat Q_{21}^*[\hat t - \hat m_2]) + \hat t
&= (\hat m_1 + \hat C_{12} \hat C_{22}^\dagger[\hat t - \hat m_2]) + \hat t \\
&= \big( \hat\Pi_1 \big[ \hat m_1 + \hat C_{12} \hat C_{22}^\dagger[\hat t - \hat m_2] \big], \hat\Pi_2[\hat t] \big) \\
&= \big( m_1 + C_{12} (\hat\Pi_2 \hat C_{22}^\dagger \hat\Pi_2^*) [t - m_2], \, t \big).
\end{align*}
Since $C_{22}$ is bounded and $\operatorname{ran}(C_{22})$ is assumed to be closed, $C_{22}^\dagger$ exists and is bounded (Ben-Israel and Greville, 2003, Section 8.3, Theorem 3). Moreover, $\hat\Pi_2$ is unitary. This means that conditions (2), (3), and (4) from Theorem 3.1 in Bouldin (1973) hold and thus
\begin{align*}
\hat\Pi_2 \hat C_{22}^\dagger \hat\Pi_2^*
&= \hat\Pi_2 (\hat\Pi_2^* C_{22} \hat\Pi_2)^\dagger \hat\Pi_2^* \\
&= \hat\Pi_2 \hat\Pi_2^\dagger C_{22}^\dagger (\hat\Pi_2^*)^\dagger \hat\Pi_2^* && \text{(Bouldin 1973, Theorem 3.1)} \\
&= \hat\Pi_2 \hat\Pi_2^{-1} C_{22}^\dagger (\hat\Pi_2^*)^{-1} \hat\Pi_2^* && \text{($\hat\Pi_2$ is unitary)} \\
&= C_{22}^\dagger. \tag{B.10}
\end{align*}
This shows that $m_{X \mid X_2 = t}$ is indeed the mean of $X \mid X_2 = t$.

By Theorem 3 in Anderson and Trapp (1975), if $\hat C_{12}^* = \hat C_{22}^{1/2} A$, then the short $S(C)$ of $C$ to $S = \hat H_2$ is given by
\[
S(C) = \begin{pmatrix} \hat C_{11} - A^* A & 0 \\ 0 & 0 \end{pmatrix}_{\hat H_1 \oplus \hat H_2}.
\]
Since $\hat C_{22}^{1/2}$ is bounded and $\operatorname{ran}(\hat C_{22}^{1/2}) = \operatorname{ran}(\hat C_{22})$ is closed, the pseudoinverse $(\hat C_{22}^{1/2})^\dagger$ exists and is bounded (Ben-Israel and Greville, 2003, Section 8.3, Theorem 3). Let $A := (\hat C_{22}^{1/2})^\dagger \hat C_{12}^*$. Then
\[
\hat C_{22}^{1/2} A = \hat C_{22}^{1/2} (\hat C_{22}^{1/2})^\dagger \hat C_{12}^* = \hat C_{12}^*,
\]
since $\hat C_{22}^{1/2} (\hat C_{22}^{1/2})^\dagger|_{\operatorname{ran}(\hat C_{22}^{1/2})} = \operatorname{id}_{\operatorname{ran}(\hat C_{22}^{1/2})}$ (Ben-Israel and Greville, 2003, Section 8.3, Definition 1) and $\operatorname{ran}(\hat C_{12}^*) \subset \operatorname{ran}(\hat C_{22}^{1/2})$ by Equation (B.8). Moreover, $A^* = \hat C_{12} ((\hat C_{22}^{1/2})^\dagger)^* = \hat C_{12} ((\hat C_{22}^{1/2})^*)^\dagger$, and
\[
A^* A = \hat C_{12} ((\hat C_{22}^{1/2})^*)^\dagger (\hat C_{22}^{1/2})^\dagger \hat C_{12}^* = \hat C_{12} \big( \hat C_{22}^{1/2} (\hat C_{22}^{1/2})^* \big)^\dagger \hat C_{12}^* = \hat C_{12} \hat C_{22}^\dagger \hat C_{12}^*
\]
by Theorem 2 (g) and (j) in Ben-Israel and Greville (2003, Section 8.3) and the fact that $\hat C_{22}^{1/2}$ is self-adjoint. Consequently,
\begin{align*}
S(C) &= \begin{pmatrix} \hat C_{11} - \hat C_{12} \hat C_{22}^\dagger \hat C_{12}^* & 0 \\ 0 & 0 \end{pmatrix}_{\hat H_1 \oplus \hat H_2} \\
&= \begin{pmatrix} \hat\Pi_1 \big( \hat C_{11} - \hat C_{12} \hat C_{22}^\dagger \hat C_{12}^* \big) \hat\Pi_1^* & 0 \\ 0 & 0 \end{pmatrix} \\
&= \begin{pmatrix} C_{11} - C_{12} \hat\Pi_2 \hat C_{22}^\dagger \hat\Pi_2^* C_{12}^* & 0 \\ 0 & 0 \end{pmatrix} \\
&= \begin{pmatrix} C_{11} - C_{12} C_{22}^\dagger C_{12}^* & 0 \\ 0 & 0 \end{pmatrix} && \text{(by Equation (B.10))} \\
&= C_{X \mid X_2 = t}.
\end{align*}

Remark B.20. One can show that $\operatorname{ran}(C_{22})$ being closed is equivalent to $C_{22}$ having finite rank.

By applying Corollary B.17 to the conditional random variable from Theorem B.19, we find that $X_1 \mid X_2 = t$ is an $H_1$-valued Gaussian random variable with mean and covariance operator
\begin{align}
m_{X_1 \mid X_2 = t} &:= m_1 + C_{12} C_{22}^\dagger [t - m_2], \tag{B.11} \\
C_{X_1 \mid X_2 = t} &:= C_{11} - C_{12} C_{22}^\dagger C_{12}^*. \tag{B.12}
\end{align}

B.3 Gaussian Processes as Gaussian Random Functions

As mentioned before, the function $\omega \mapsto f(\cdot, \omega)$ is often a Gaussian random variable with values in a separable Hilbert space of real-valued functions on $X$. In the following, we will make this statement precise and give expressions for the mean and covariance operator of this Gaussian random variable, which depend on the mean and covariance functions of the Gaussian process, respectively.

Assumption B.21. Let $f \sim \mathcal{GP}(m, k)$ be a Gaussian process with index set $X$ on a Borel probability space $(\Omega, \mathcal{B}(\Omega), P)$, whose mean and sample paths lie in a real separable RKHS$^{10}$ $H \subset \mathbb{R}^X$ with $H_k \subset H$, i.e. $m \in H$ and $\operatorname{paths}(f) \subset H$.

Proposition B.22. Let Assumption B.21 hold. Then $\omega \mapsto f(\cdot, \omega)$ is an $H$-valued Gaussian random variable whose mean is given by the mean function $m$ of the Gaussian process $f$ and whose covariance operator is given by
\[
C_k : H \to H, \quad h \mapsto C_k[h](x) = \langle k(x, \cdot), h \rangle_H. \tag{B.13}
\]

10. Any Hilbert function space $H \subset \mathbb{R}^X$ with continuous point evaluation functionals $\delta_x : H \to \mathbb{R}$ is an RKHS with kernel $k_H(x_1, x_2) = \langle \delta_{x_1}^*, \delta_{x_2}^* \rangle_H$ (Steinwart and Christmann, 2008).

Proof By definition, $f(x, \cdot)$ is a Gaussian random variable for every $x \in X$. Hence, Corollary 12 in Berlinet and Thomas-Agnan (2004, Chapter 4, Section 2, p. 195) ensures that $\omega \mapsto f(\cdot, \omega)$ is Borel measurable and thus a random variable, which is Gaussian by Theorem 91 in Berlinet and Thomas-Agnan (2004, Chapter 4, Section 3.1, p. 196). Since $\omega \mapsto f(\cdot, \omega)$ is Gaussian and $H$ is separable, by Lemma B.12, it remains to show that $m$ and $C_k$ fulfill
\[
m = \int_\Omega f(\cdot, \omega) \, dP(\omega) \quad \text{and} \quad C_k[h] = \int_\Omega \langle h, f(\cdot, \omega) - m \rangle_H \, (f(\cdot, \omega) - m) \, dP(\omega)
\]
for all $h \in H$, which are both well-defined Bochner integrals. Consequently, for $x \in X$, we find that
\[
m(x) = \int_\Omega f(x, \omega) \, dP(\omega) = \delta_x \left[ \int_\Omega f(\cdot, \omega) \, dP(\omega) \right],
\]
where the last equation holds by Corollary 2 from Yosida (1995, Section V.5), since $\delta_x$ is continuous. Hence, by Lemma B.12, $m \in H$ is the mean of $\omega \mapsto f(\cdot, \omega)$. Moreover, for $x_1, x_2 \in X$, we have
\[
k(x_1, x_2) = \int_\Omega (f(x_1, \omega) - m(x_1))(f(x_2, \omega) - m(x_2)) \, dP(\omega) = \delta_{x_2} \left[ \int_\Omega \langle \delta_{x_1}^*, f(\cdot, \omega) - m \rangle_H \, (f(\cdot, \omega) - m) \, dP(\omega) \right],
\]
and hence, for any $h \in H$,
\begin{align*}
C_k[h](x) &= \langle k(x, \cdot), h \rangle_H \\
&= \left\langle h, \int_\Omega \langle \delta_x^*, f(\cdot, \omega) - m \rangle_H \, (f(\cdot, \omega) - m) \, dP(\omega) \right\rangle_H \\
&= \int_\Omega \langle \delta_x^*, f(\cdot, \omega) - m \rangle_H \, \langle h, f(\cdot, \omega) - m \rangle_H \, dP(\omega) \\
&= \delta_x \left[ \int_\Omega \langle h, f(\cdot, \omega) - m \rangle_H \, (f(\cdot, \omega) - m) \, dP(\omega) \right],
\end{align*}
where we applied Corollary 2 from Yosida (1995, Section V.5) repeatedly. This shows that $C_k$ is indeed the covariance operator of $\omega \mapsto f(\cdot, \omega)$.

The correspondence from Proposition B.22 also holds in reverse in the sense that a Gaussian random variable $h$ with values in a separable Hilbert space $H$ and a set $X^* \subset H^*$ of continuous linear functionals on $H$ induce a Gaussian process on the same probability space, whose paths are given by $x^* \mapsto x^*[h(\omega)]$.

Lemma B.23. Let $f \sim \mathcal{N}(m, C)$ be a Gaussian random variable on $(\Omega, \mathcal{B}(\Omega), P)$ with values in a real separable Hilbert space $H$.
For every set $X \subset H$, the family $\{ \langle x, f(\cdot) \rangle_H \}_{x \in X}$ is a Gaussian process on $(\Omega, \mathcal{B}(\Omega), P)$ with mean function $x \mapsto \langle x, m \rangle_H$ and covariance function $(x_1, x_2) \mapsto \langle x_1, C[x_2] \rangle_H$.

Proof Since $f$ is Gaussian, $\langle x, f(\cdot) \rangle_H$ is Gaussian for all $x \in X$. Let $x_1, \ldots, x_n \in X$ and define
\[
\langle X, \cdot \rangle_H : H \to \mathbb{R}^n, \quad h \mapsto (\langle X, h \rangle_H)_i := \langle x_i, h \rangle_H.
\]
Then $\langle X, \cdot \rangle_H$ is continuous and thus Borel measurable. It follows that the function $f_X := \langle X, f(\cdot) \rangle_H$ is an $\mathbb{R}^n$-valued random variable. Moreover, since $h \mapsto \langle v, \langle X, h \rangle_H \rangle_{\mathbb{R}^n}$ is a continuous linear functional on $H$ for all $v \in \mathbb{R}^n$, $f_X$ is Gaussian. All in all, it follows that $\{ \langle x, f(\cdot) \rangle_H \}_{x \in X}$ is a Gaussian process on $(\Omega, \mathcal{B}(\Omega), P)$. Moreover, its mean function is given by
\[
x \mapsto \int_\Omega \langle x, f(\omega) \rangle_H \, dP(\omega) = \langle x, m \rangle_H
\]
by Equation (B.1), and its covariance function is given by
\[
(x_1, x_2) \mapsto \int_\Omega \big( \langle x_1, f(\omega) \rangle_H - \langle x_1, m \rangle_H \big) \big( \langle x_2, f(\omega) \rangle_H - \langle x_2, m \rangle_H \big) \, dP(\omega) = \langle x_1, C[x_2] \rangle_H
\]
by Equation (B.2).

Note that, unlike before, $H$ is not necessarily a space of functions and the sample paths of the resulting process are, generally speaking, not contained in $H$. However, if $H$ is an RKHS of real-valued functions on some domain $X$, then all point evaluation functionals $\delta_x$ for $x \in X$ are continuous and Lemma B.23 produces Gaussian processes in the spirit of Assumption B.21. Hence, the following corollary is a more accurate converse of Proposition B.22 than Lemma B.23.

Corollary B.24. Let $f \sim \mathcal{N}(m, C)$ be a Gaussian random variable on $(\Omega, \mathcal{B}(\Omega), P)$ with values in a real separable RKHS $H \subset \mathbb{R}^X$. Then the family $\{ \omega \mapsto f(\omega)(x) \}_{x \in X}$ is a Gaussian process on $(\Omega, \mathcal{B}(\Omega), P)$ with paths in $H$. Its mean and covariance functions are given by $m$ and $k(x_1, x_2) := C[\delta_{x_2}^*](x_1)$, respectively. With a slight abuse of notation, we also write $f \sim \mathcal{GP}(m, k)$.

We can also establish a similar correspondence between joint Gaussian measures on separable Hilbert spaces and multi-output Gaussian processes.

Proposition B.25. Let $\{ H_i \subset \mathbb{R}^X \}_{i=1}^n$ be a family of real separable RKHSs and let $H := H_1 \times \cdots \times H_n$. Let $f \sim \mathcal{N}(m, C)$ on $(\Omega, \mathcal{B}(\Omega), P)$ with values in $H$. Then the family $\{ \omega \mapsto f(\omega)_i(x) \}_{(i, x) \in I \times X}$ with $I = \{1, \ldots, n\}$ is an $n$-output Gaussian process with index set $X$ on $(\Omega, \mathcal{B}(\Omega), P)$. Its mean and covariance functions are given by $(i, x) \mapsto m_i(x)$ and $((i_1, x_1), (i_2, x_2)) \mapsto C_{i_1, i_2}[\delta_{x_2}^*](x_1)$, respectively.

Proof Let $\tilde H := \{ (i, x) \mapsto h_i(x) : h \in H \} \subset \mathbb{R}^{I \times X}$. Then $\tilde H$, equipped with pointwise addition and scalar multiplication and the inner product $\langle \tilde h, \tilde h' \rangle_{\tilde H} := \sum_{i=1}^n \langle \tilde h(i, \cdot), \tilde h'(i, \cdot) \rangle_{H_i}$, is a Hilbert space, and the linear map $\mathcal{I} : H \to \tilde H$, $h \mapsto \mathcal{I}[h](i, x) = h_i(x)$ is the canonical isometry between $H$ and $\tilde H$. Lemma B.13 implies that $\mathcal{I} \circ f$ is a Gaussian random variable with mean $(i, x) \mapsto \mathcal{I}[m](i, x) = m_i(x)$ and covariance operator $\mathcal{I} C \mathcal{I}^*$. Since the point evaluation functionals on all $H_i$ are continuous, it follows that the point evaluation functionals on $\tilde H$ are continuous. Hence, by Corollary B.24, $\{ \omega \mapsto f(\omega)_i(x) \}_{(i, x) \in I \times X}$ is indeed a Gaussian process with mean function $(i, x) \mapsto m_i(x)$ and covariance function
\begin{align*}
((i_1, x_1), (i_2, x_2)) &\mapsto \mathcal{I} C \mathcal{I}^* \big[ \delta_{(i_2, x_2)}^* \big](i_1, x_1) \\
&= \mathcal{I} \big[ C \Pi_{i_2}^* [\delta_{x_2}^*] \big](i_1, x_1) \\
&= \mathcal{I} \big[ (C_{1, i_2}[\delta_{x_2}^*], \ldots, C_{n, i_2}[\delta_{x_2}^*]) \big](i_1, x_1) \\
&= C_{i_1, i_2}[\delta_{x_2}^*](x_1).
\end{align*}

B.4 Proofs of Theorem 1 and its Corollaries

Using the results from Appendices B.2 and B.3, particularly Proposition B.22, Theorem B.19, and Corollary B.24, we can now conduct the proofs of Theorem 1 and Corollaries 2 and 3 as outlined in Appendix B.1.
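Before turning to the proofs, it may help to see how the finite-dimensional case treated in Corollary 2 below is used in practice. The following Python sketch conditions a GP prior on a finite number of bounded linear functionals, here integral observations approximated by a quadrature rule. The kernel, weight functions, observed values and noise level are hypothetical choices of ours and not taken from the paper; only the conditioning formulas correspond to Equations (4.8) and (4.9) below.

```python
import numpy as np

# Matern-3/2 prior covariance and zero prior mean on [0, 1] (our choices).
def k(x1, x2, ell=0.2):
    r = np.abs(x1[:, None] - x2[None, :]) / ell
    return (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

m_f = lambda x: np.zeros_like(x)

# Observation functionals L_i[f] = int_0^1 f(x) w_i(x) dx, approximated by the
# trapezoidal rule on a fine grid, so L acts as an (n x N) matrix on grid values.
N = 400
xg = np.linspace(0.0, 1.0, N)
dx = xg[1] - xg[0]
w = np.vstack([np.sin(np.pi * xg), np.cos(2 * np.pi * xg), np.ones(N)])
quad = np.full(N, dx)
quad[[0, -1]] = dx / 2
L = w * quad                          # L @ f_grid ~ (L_1[f], L_2[f], L_3[f])

# Finite-dimensional quantities of Corollary 2: L m_f, L k L^* + Sigma, k L^*.
Sigma = 1e-4 * np.eye(3)              # independent Gaussian observation noise
y = np.array([0.7, -0.1, 0.3])        # hypothetical observed functional values
G = L @ k(xg, xg) @ L.T + Sigma       # (L k L^*) + Sigma
xs = np.linspace(0.0, 1.0, 50)        # prediction points
kLT = k(xs, xg) @ L.T                 # (k L^*)(x, .)

# Conditional mean and covariance, cf. Equations (4.8) and (4.9).
Ginv = np.linalg.pinv(G)
mean_post = m_f(xs) + kLT @ Ginv @ (y - L @ m_f(xg))
cov_post = k(xs, xs) - kLT @ Ginv @ kLT.T
print(mean_post[:5])
print(np.diag(cov_post)[:5])
```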
All three results share a common set of assumptions.

Assumption 1. Let $f \sim \mathcal{GP}(m_f, k_f)$ be a Gaussian process prior with index set $X$ on the Borel probability space $(\Omega, \mathcal{B}(\Omega), P)$, whose mean function and sample paths lie in a real separable RKHS $H \subset \mathbb{R}^X$ with $H \supseteq H_{k_f}$. Let $L : H \to H_L$ be a bounded linear operator mapping the paths of $f$ into a separable Hilbert space $H_L$.

In the most general case, the linear operator $L$ maps into a space which is either not a function space or a function space on which point evaluation is not a continuous functional. This happens, for instance, when applying the differential operator of highest possible order on a Sobolev path space, since then the resulting object will be an $L^2$ function, which is not pointwise defined.

Theorem 1 (Affine Gaussian Process Inference). Let Assumption 1 hold. Then $\omega \mapsto f(\cdot, \omega)$ is an $H$-valued Gaussian random variable with mean $m_f$ and covariance operator $h \mapsto C_f[h](x) = \langle k_f(x, \cdot), h \rangle_H$. We also write $f \sim \mathcal{N}(m_f, C_f)$. Let $\varepsilon \sim \mathcal{N}(m_\varepsilon, C_\varepsilon)$ be an $H_L$-valued Gaussian random variable with $\varepsilon \perp\!\!\!\perp f$. Then
\[
\begin{pmatrix} f \\ L[f] + \varepsilon \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} m_f \\ L[m_f] + m_\varepsilon \end{pmatrix}, \begin{pmatrix} C_f & C_f L^* \\ L C_f & L C_f L^* + C_\varepsilon \end{pmatrix} \right) \tag{4.1}
\]
with values in $H \times H_L$ and hence
\[
L[f] + \varepsilon \sim \mathcal{N}(L[m_f] + m_\varepsilon, \, L C_f L^* + C_\varepsilon). \tag{4.2}
\]
If $\operatorname{ran}(L C_f L^* + C_\varepsilon)$ is closed, then, for all $y \in H_L$,
\[
f \mid L[f] + \varepsilon = y \sim \mathcal{GP}(m_{f \mid y}, k_{f \mid y}), \tag{4.3}
\]
where the conditional mean and covariance function are given by
\[
m_{f \mid y}(x) = m_f(x) + \big\langle L[k_f(\cdot, x)], (L C_f L^* + C_\varepsilon)^\dagger [y - (L[m_f] + m_\varepsilon)] \big\rangle_{H_L} \tag{4.4}
\]
and
\[
k_{f \mid y}(x_1, x_2) = k_f(x_1, x_2) - \big\langle L[k_f(\cdot, x_1)], (L C_f L^* + C_\varepsilon)^\dagger L[k_f(\cdot, x_2)] \big\rangle_{H_L}, \tag{4.5}
\]
respectively.

Proof $f \sim \mathcal{N}(m_f, C_f)$ follows from Proposition B.22. By Proposition B.18 we know that
\[
\begin{pmatrix} f \\ \varepsilon \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} m_f \\ m_\varepsilon \end{pmatrix}, \begin{pmatrix} C_f & 0 \\ 0 & C_\varepsilon \end{pmatrix} \right)
\]
with values in $H_\times := H \times H_L$. Moreover, the map $(h, h_\varepsilon) \mapsto (h, L[h] + h_\varepsilon)$ is realized by the bounded linear operator
\[
\tilde L := \begin{pmatrix} \operatorname{id}_H & 0 \\ L & \operatorname{id}_{H_L} \end{pmatrix} : H_\times \to H_\times.
\]
Hence, Equation (4.1) follows from Lemma B.13 and Equation (4.2) follows from Corollary B.17. Under the assumption that $\operatorname{ran}(L C_f L^* + C_\varepsilon)$ is closed, Theorem B.19 and Corollary B.17 imply that $f \mid L[f] + \varepsilon = y \sim \mathcal{N}(\tilde m_{f \mid y}, C_{f \mid y})$ with
\begin{align*}
\tilde m_{f \mid y} &= m_f + C_f L^* (L C_f L^* + C_\varepsilon)^\dagger [y - (L[m_f] + m_\varepsilon)], \quad \text{and} \\
C_{f \mid y} &= C_f - C_f L^* (L C_f L^* + C_\varepsilon)^\dagger L C_f.
\end{align*}
Since point evaluation functionals on $H$ are continuous, Corollary B.24 shows that $\{ (f(x, \cdot) \mid L[f] + \varepsilon = y) \}_{x \in X}$ is a Gaussian process with mean function
\begin{align*}
\tilde m_{f \mid y}(x) &= m_f(x) + \big[ C_f L^* (L C_f L^* + C_\varepsilon)^\dagger [y - (L[m_f] + m_\varepsilon)] \big](x) \\
&= m_f(x) + \big\langle L[k_f(\cdot, x)], (L C_f L^* + C_\varepsilon)^\dagger [y - (L[m_f] + m_\varepsilon)] \big\rangle_{H_L} \\
&= m_{f \mid y}(x)
\end{align*}
and covariance function
\begin{align*}
C_{f \mid y}[\delta_{x_2}^*](x_1) &= C_f[\delta_{x_2}^*](x_1) - \big[ C_f L^* (L C_f L^* + C_\varepsilon)^\dagger L C_f[\delta_{x_2}^*] \big](x_1) \\
&= k_f(x_1, x_2) - \big\langle L[k_f(\cdot, x_1)], (L C_f L^* + C_\varepsilon)^\dagger L[k_f(\cdot, x_2)] \big\rangle_{H_L} \\
&= k_{f \mid y}(x_1, x_2),
\end{align*}
since $C_f[\delta_{x_2}^*](x_1) = \langle k_f(x_1, \cdot), \delta_{x_2}^* \rangle_H = k_f(x_1, x_2)$ and
\[
C_f L^*[h](x) = C_f[L^*[h]](x) = \langle k_f(x, \cdot), L^*[h] \rangle_H = \langle L[k_f(x, \cdot)], h \rangle_{H_L}
\]
for $h \in H_L$. This proves Equations (4.3) to (4.5).

The first corollary deals with the case where we observe the GP through a finite number of linear functionals. This happens when conditioning on integral observations or on (Galerkin) projections as in Section 3.3.

Corollary 2. Let Assumption 1 hold for $H_L = \mathbb{R}^n$ and let $\varepsilon \sim \mathcal{N}(\mu_\varepsilon, \Sigma_\varepsilon)$ be an $\mathbb{R}^n$-valued Gaussian random variable with $\varepsilon \perp\!\!\!\perp f$.
Then
\[
L[f] + \varepsilon \sim \mathcal{N}(L[m_f] + \mu_\varepsilon, \, L k_f L^* + \Sigma_\varepsilon) \tag{4.6}
\]
and, for any $y \in \mathbb{R}^n$,
\[
f \mid L[f] + \varepsilon = y \sim \mathcal{GP}(m_{f \mid y}, k_{f \mid y}), \tag{4.7}
\]
with conditional mean and covariance function given by
\[
m_{f \mid y}(x) = m_f(x) + \big\langle L[k_f(x, \cdot)], (L k_f L^* + \Sigma_\varepsilon)^\dagger (y - (L[m_f] + \mu_\varepsilon)) \big\rangle_{\mathbb{R}^n} \tag{4.8}
\]
and
\[
k_{f \mid y}(x_1, x_2) = k_f(x_1, x_2) - \big\langle L[k_f(x_1, \cdot)], (L k_f L^* + \Sigma_\varepsilon)^\dagger L[k_f(\cdot, x_2)] \big\rangle_{\mathbb{R}^n}. \tag{4.9}
\]

To prove Corollary 2, we first need to show that $L C_f L^* = L k_f L^*$. We will prove a slightly more general result, for which the following generalization of Notation 1 will prove useful.

Notation B.26. Let $H_1 \subseteq \mathbb{R}^{X_1}$ and $H_2 \subseteq \mathbb{R}^{X_2}$ be Hilbert spaces and let $k : X_1 \times X_2 \to \mathbb{R}$ such that $k(\cdot, x_2) \in H_1$ for all $x_2 \in X_2$ and $k(x_1, \cdot) \in H_2$ for all $x_1 \in X_1$. Let $L_i : H_i \to \mathbb{R}^{n_i}$ for $i = 1, 2$ be linear. By $L_1 k$, $k L_2^*$ and $L_1 k L_2^*$,$^{11}$ we denote the functions
\begin{align*}
L_1 k &: X_2 \to \mathbb{R}^{n_1}, \quad x_2 \mapsto L_1[k(\cdot, x_2)], \\
k L_2^* &: X_1 \to \mathbb{R}^{n_2}, \quad x_1 \mapsto L_2[k(x_1, \cdot)],
\end{align*}
and the matrix $L_1 k L_2^* \in \mathbb{R}^{n_1 \times n_2}$ with entries $(L_1 k L_2^*)_{ij} := L_1[(k L_2^*)_j]_i$, respectively.

11. The omission of parentheses in $L_1 k L_2^*$ is motivated by Equations (B.17) and (B.18) from Lemma B.27, which show that the order in which $L_1$ and $L_2$ are applied to $k$ is irrelevant.

Lemma B.27. Let $H_1 \subseteq \mathbb{R}^{X_1}$ and $H_2 \subseteq \mathbb{R}^{X_2}$ be RKHSs. Let $k : X_1 \times X_2 \to \mathbb{R}$ such that $k(\cdot, x_2) \in H_1$ for all $x_2 \in X_2$ and $k(x_1, \cdot) \in H_2$ for all $x_1 \in X_1$, and let $K : H_2 \to H_1$, $K[h_2](x_1) = \langle k(x_1, \cdot), h_2 \rangle_{H_2}$. Finally, let $L_1 : H_1 \to \mathbb{R}^{n_1}$ and $L_2 : H_2 \to \mathbb{R}^{n_2}$ be linear and bounded. Then

(i) the adjoint of $K$ is given by
\[
K^* : H_1 \to H_2, \quad K^*[h_1](x_2) = \langle k(\cdot, x_2), h_1 \rangle_{H_1}, \tag{B.14}
\]

(ii) we have
\[
(L_1 K)[h_2]_i = \langle (L_1 k)_i, h_2 \rangle_{H_2} \quad \text{for all } h_2 \in H_2, \tag{B.15}
\]
and
\[
(K L_2^*)[v](x_1) = \langle (k L_2^*)(x_1), v \rangle_{\mathbb{R}^{n_2}} \quad \text{for all } v \in \mathbb{R}^{n_2}, \tag{B.16}
\]

(iii) and $L_1 K L_2^* \in \mathbb{R}^{n_1 \times n_2}$ with
\begin{align}
(L_1 K L_2^*)_{ij} &= L_2[(L_1 k)_i]_j \tag{B.17} \\
&= L_1[(k L_2^*)_j]_i \tag{B.18} \\
&= (L_1 k L_2^*)_{ij}. \tag{B.19}
\end{align}

Proof
• (B.14): Let $h_1 \in H_1$ and $x_2 \in X_2$. Then $K^*[h_1](x_2) = \langle \delta_{x_2}^*, K^*[h_1] \rangle_{H_2} = \langle K[\delta_{x_2}^*], h_1 \rangle_{H_1}$ and
\[
K[\delta_{x_2}^*](x_1) = \langle k(x_1, \cdot), \delta_{x_2}^* \rangle_{H_2} = k(x_1, x_2)
\]
for all $x_1 \in X_1$, i.e. $K[\delta_{x_2}^*] = k(\cdot, x_2)$. This means that $K^*[h_1](x_2) = \langle k(\cdot, x_2), h_1 \rangle_{H_1}$. Evidently, $\operatorname{dom}(K^*) = H_1$.

• (B.15): $L_1[\cdot]_i$ is a bounded linear functional and hence, by the Riesz representation theorem (Yosida, 1995, Section III.6), there is $h_{L_1, i} \in H_1$ such that $L_1[h_1]_i = \langle h_{L_1, i}, h_1 \rangle_{H_1}$ for all $h_1 \in H_1$. It follows that
\[
(L_1 K)[h_2]_i = L_1[K[h_2]]_i = \langle h_{L_1, i}, K[h_2] \rangle_{H_1} = \langle K^*[h_{L_1, i}], h_2 \rangle_{H_2}
\]
for all $h_2 \in H_2$ and
\[
K^*[h_{L_1, i}](x_2) = \langle h_{L_1, i}, k(\cdot, x_2) \rangle_{H_1} = L_1[k(\cdot, x_2)]_i = (L_1 k)_i(x_2)
\]
for all $x_2 \in X_2$. Hence, $(L_1 K)[h_2]_i = \langle (L_1 k)_i, h_2 \rangle_{H_2}$ for all $h_2 \in H_2$.

• (B.16): Let $v \in \mathbb{R}^{n_2}$ and $x_1 \in X_1$. Then
\[
(K L_2^*)[v](x_1) = \langle k(x_1, \cdot), L_2^*[v] \rangle_{H_2} = \langle L_2[k(x_1, \cdot)], v \rangle_{\mathbb{R}^{n_2}} = \langle (k L_2^*)(x_1), v \rangle_{\mathbb{R}^{n_2}}.
\]

• (B.17): Let $e_j \in \mathbb{R}^{n_2}$ such that $\langle e_j, v \rangle_{\mathbb{R}^{n_2}} = v_j$. Then we have
\[
(L_1 K L_2^*)_{ij} = L_1[K[L_2^*[e_j]]]_i = \langle (L_1 k)_i, L_2^*[e_j] \rangle_{H_2} = \langle L_2[(L_1 k)_i], e_j \rangle_{\mathbb{R}^{n_2}} = L_2[(L_1 k)_i]_j
\]
by Equation (B.15).

• (B.18): Let $e_j \in \mathbb{R}^{n_2}$ such that $\langle e_j, v \rangle_{\mathbb{R}^{n_2}} = v_j$. Then we have
\[
(L_1 K L_2^*)_{ij} = L_1[(K L_2^*)[e_j]]_i = L_1[(k L_2^*)_j]_i
\]
by Equation (B.16).

Corollary B.28. Let the assumptions of Lemma B.27 hold such that $X := X_1 = X_2$ and $H := H_1 = H_2$, and let $k$ be symmetric. Then $K$ is self-adjoint. If $n := n_1 = n_2$ and $L := L_1 = L_2$, then $L k L^* \in \mathbb{R}^{n \times n}$ is symmetric. If $K$ is additionally positive-(semi)definite, then $L k L^*$ is positive-(semi)definite.

Proof By the symmetry of $k$, for $h \in H$ and $x \in X$, we have
\[
K^*[h](x) = \langle h, k(\cdot, x) \rangle_H = \langle k(x, \cdot), h \rangle_H = K[h](x),
\]
i.e. $K$ is symmetric. Obviously, $H = \operatorname{dom}(K) = \operatorname{dom}(K^*)$.
Consequently, $K$ is self-adjoint. This implies that
\[
(L K L^*)^\top = (L K L^*)^* = (L^*)^* K^* L^* = L K L^*.
\]
Finally, if $K$ is positive-semidefinite, then
\[
\langle v, L K L^*[v] \rangle_{\mathbb{R}^n} = \langle L^*[v], K[L^*[v]] \rangle_H \ge 0
\]
for all $v \in \mathbb{R}^n$, where the inequality is strict if $K$ is (strictly) positive-definite.

Proof of Corollary 2 By Lemma B.27 we know that $L C_f L^* = L k_f L^*$ and hence Equation (4.6) follows from Equation (4.2) in Theorem 1. Moreover, $\operatorname{ran}(L k_f L^* + \Sigma_\varepsilon)$ is closed, since it is finite-dimensional. This means that Equations (4.7) to (4.9) also follow from Theorem 1.

Finally, we address the archetypical case in which both the prior $f$ and its pushforward $L[f]$ are Gaussian processes. This happens if the linear operator maps into a function space in which point evaluation is continuous. In this article, this case occurred in Sections 3.1 and 3.2, where we inferred the strong solution of a PDE from observations of the PDE residual at a finite number of domain points.

Corollary 3. Let Assumption 1 hold such that $H_L$ is an RKHS $H_L \subset \mathbb{R}^{X'}$. Then
\[
L[f] \sim \mathcal{GP}(L[m_f], L k_f L^*). \tag{4.13}
\]
Let $\varepsilon \sim \mathcal{N}(\mu_\varepsilon, \Sigma_\varepsilon)$ with values in $\mathbb{R}^n$ and $\varepsilon \perp\!\!\!\perp f$. Then, for $X' = \{x_i'\}_{i=1}^n \subset X'$ and $y \in \mathbb{R}^n$,
\[
f \mid L[f](X') + \varepsilon = y \sim \mathcal{GP}(m_{f \mid y}, k_{f \mid y}) \tag{4.14}
\]
with
\[
m_{f \mid y}(x) := m_f(x) + \big\langle (k_f L^*)(x, X'), \big( (L k_f L^*)(X', X') + \Sigma_\varepsilon \big)^\dagger \big( y - (L[m_f](X') + \mu_\varepsilon) \big) \big\rangle_{\mathbb{R}^n} \tag{4.15}
\]
and
\[
k_{f \mid y}(x_1, x_2) := k_f(x_1, x_2) - \big\langle (k_f L^*)(x_1, X'), \big( (L k_f L^*)(X', X') + \Sigma_\varepsilon \big)^\dagger (L k_f)(X', x_2) \big\rangle_{\mathbb{R}^n}. \tag{4.16}
\]
If additionally $X = X'$, then
\[
\begin{pmatrix} f \\ L[f] \end{pmatrix} \sim \mathcal{GP} \left( \begin{pmatrix} m_f \\ L[m_f] \end{pmatrix}, \begin{pmatrix} k_f & k_f L^* \\ L k_f & L k_f L^* \end{pmatrix} \right). \tag{4.17}
\]

Proof Since point evaluation on $H_L$ is continuous, we have
\[
(L C_f L^*)[\delta_{x_2'}^*](x_1') = (L k_f L^*)(x_1', x_2')
\]
by Lemma B.27. Consequently, Equation (4.13) follows from Equation (4.2) in Theorem 1 and Corollary B.24. Moreover, $\operatorname{ran}((L k_f L^*)(X', X') + \Sigma_\varepsilon)$ is closed, since it is finite-dimensional. This means that Equations (4.14) to (4.16) also follow from Theorem 1. Finally, Equation (4.17) follows from Equation (4.1) in Theorem 1 and Proposition B.25, where we used that, by Lemma B.27,
\begin{align*}
(C_f L^*)[\delta_{x_2'}^*](x_1) &= (k_f L^*)(x_1, x_2'), \\
(L C_f)[\delta_{x_2}^*](x_1') &= (L k_f)(x_1', x_2), \quad \text{and} \\
(L C_f L^*)[\delta_{x_2'}^*](x_1') &= (L k_f L^*)(x_1', x_2').
\end{align*}

B.5 On Prior Selection

A typical choice for the solution space $U$ of a linear PDE, especially in the context of weak solutions (see Section 2.1.1), is a Sobolev space (Adams and Fournier, 2003). Unfortunately, it is impossible to formulate a Gaussian process prior $u$ whose paths are elements of a Sobolev space $U$. This is due to the fact that Sobolev spaces are, technically speaking, not function spaces, but rather spaces of equivalence classes $[f]_\sim$ of functions which are equal almost everywhere (Adams and Fournier, 2003). By contrast, the path spaces of Gaussian processes are proper function spaces, which means that, in this setting, $\operatorname{paths}(u) \subseteq U$ is impossible. Fortunately, if the path space can be continuously embedded in $U$, i.e. there is a continuous and injective linear operator $\iota : \operatorname{paths}(u) \to U$, commonly referred to as an embedding, then the inference procedure above can still be applied. If such an embedding exists, we can interpret the paths of the GP as elements of $U$ by applying $\iota$ implicitly. For instance, $D[u]$ is then a shorthand notation for $D[\iota[u]]$. Since the embedding is assumed to be continuous, the conditions for GP inference with linear operator observations are still met when applying $\iota$ implicitly.
The canonical choice for the embedding in the case of Sobolev spaces is $\iota[u] = [u]_\sim \in U$.

Example B.1 (Matérn covariances and Sobolev spaces). Kanagawa et al. (2018) show that, under certain assumptions, the sample spaces of GP priors with Matérn covariance functions (Rasmussen and Williams, 2006) are continuously embedded in Sobolev spaces whose smoothness depends on the parameter $\nu$ of the Matérn covariance function. To be precise, let $D \subset \mathbb{R}^d$ be open and bounded with Lipschitz boundary such that the cone condition (Adams and Fournier, 2003, Definition 4.6) holds. Denote by $k_{\nu, l}$ the Matérn kernel with smoothness parameter $\nu > 0$ and lengthscale $l > 0$. Then, with probability 1, the sample paths of a Gaussian process $f$ with covariance function $k_{\nu, l}$ are contained in any RKHS $H_{k_{\nu', l'}}$ with $l' > 0$ and
\[
0 < \underbrace{\nu' + \tfrac{d}{2}}_{=:\, m'} < \nu \tag{B.20}
\]
(Kanagawa et al., 2018, Corollary 4.15 and Remark 4.15), i.e. $\operatorname{paths}(f) \subset H_{k_{\nu', l'}}$. Moreover, if $m' \in \mathbb{N}$, then the RKHS $H_{k_{\nu', l'}}$ is norm-equivalent to the Sobolev space $H^{m'}(D)$ (Kanagawa et al., 2018, Example 2.6). This implies that the canonical embedding
\[
\iota : H_{k_{\nu', l'}} \to H^{m'}(D), \quad f(\cdot, \omega) \mapsto [f(\cdot, \omega)]_\sim \tag{B.21}
\]
is continuous.

For $U = H^{m'}(D)$, the example above shows that the Matérn covariance function $k_{\nu, l}$ with $\nu = m' + \epsilon$ for any $\epsilon > 0$ leads to an admissible GP prior. The choice $\epsilon = \frac{1}{2}$ makes evaluating the covariance function particularly efficient (Rasmussen and Williams, 2006). However, note that the elements of the Sobolev space $H^m(D)$ are only $m$-times weakly differentiable, which means that $H^2(D)$ is not an admissible choice in Sections 3.1 and 3.2.

Remark B.29 (Sobolev Spaces and Strong Derivatives). The Sobolev embedding theorem (Adams and Fournier, 2003, Theorem 4.12) gives conditions under which the elements of a Sobolev space embed into Banach spaces of continuously differentiable functions. Let $D \subset \mathbb{R}^d$ be open and bounded with Lipschitz boundary such that the cone condition (Adams and Fournier, 2003, Definition 4.6) holds. Let $j \ge 0$, $m \ge 1$ be integers. If $m > \frac{d}{2}$, then there is a continuous embedding
\[
\iota : H^{j + m}(D) \to C_B^j(D), \tag{B.22}
\]
where $C_B^j(D)$ is the space of $j$-times continuously differentiable functions with bounded derivatives, which is a Banach space under the norm
\[
\| f \|_{C_B^j(D)} = \max_{0 \le |\alpha| \le j} \sup_{x \in D} |D^\alpha f(x)|. \tag{B.23}
\]
Moreover, point-evaluated partial derivatives on $C_B^j(D)$ are continuous linear functionals, since, for any multi-index $|\alpha'| \le j$ and any $x_0 \in D$, we have
\[
\big| D^{\alpha'}[f](x_0) \big| \le \sup_{x \in D} \big| D^{\alpha'} f(x) \big| \le \max_{0 \le |\alpha| \le j} \sup_{x \in D} |D^\alpha f(x)| = \| f \|_{C_B^j(D)}. \tag{B.24}
\]

Example B.2 (Strong Derivatives in Matérn Sample Spaces). Under the assumptions of Example B.1, for a prior GP $f$ with Matérn covariance function $k_{\nu, l}$ such that
\[
\nu := m + \frac{d + 1}{2} + \epsilon, \tag{B.25}
\]
where $\epsilon > 0$, we have the following chain of continuous embeddings
\[
\operatorname{paths}(f) \subset H_{k_{\nu', l'}} \hookrightarrow H^{m + k}(D) \hookrightarrow C_B^m(D) \tag{B.26}
\]
(for a suitable integer $k > \frac{d}{2}$). As noted in Remark B.29, point-evaluated partial derivatives of order $\le m$ are continuous linear functionals on $C_B^m(D)$. It follows that a point-evaluated differential operator $D[\cdot](x)$ of order $\le m$ is a continuous linear functional on $\operatorname{paths}(f)$ if the two continuous embeddings are prepended. In Section 3.2, we have $d = 1$ and a GP prior with Matérn covariance function, where $\nu = \frac{7}{2} = 2 + \frac{d + 1}{2} + \frac{1}{2}$. It follows that point-evaluated differential operators of order $\le 2$ are continuous linear functionals; the sketch below illustrates how such operator observations are realized in practice.
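The following Python sketch shows one way to realize point-evaluated differential operator observations in code: derivatives of the prior covariance function are obtained by automatic differentiation and plugged into the conditioning formulas of Corollary 3. This is an illustration under assumptions of our own, not the paper's reference implementation: for simplicity it uses a squared-exponential kernel instead of a Matérn-7/2 prior (any sufficiently differentiable kernel follows the same pattern), the operator $D[u] = -u''$ and the right-hand side are hypothetical, and boundary condition observations are omitted.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)

# Prior covariance: a squared-exponential kernel (our choice, for simplicity).
ell = 0.2
def k(x1, x2):
    return jnp.exp(-0.5 * (x1 - x2) ** 2 / ell**2)

# Point-evaluated differential operator D[u](x) = -u''(x) (hypothetical example).
def D(f):
    return lambda x: -jax.grad(jax.grad(f))(x)

# Operator "cross-covariances" in the sense of Notation B.26, obtained by
# differentiating the kernel in its first and/or second argument.
kDs  = lambda x, xp: D(lambda z: k(x, z))(xp)      # (k D^*)(x, x')
Dk   = lambda xp, x: D(lambda z: k(z, x))(xp)      # (D k)(x', x)
DkDs = lambda a, b: D(lambda z: kDs(z, b))(a)      # (D k D^*)(x'_1, x'_2)

# Collocation points X' and noisy residual observations y_i ~ D[u](x'_i).
Xc = jnp.linspace(0.05, 0.95, 10)
y = jnp.pi**2 * jnp.sin(jnp.pi * Xc)               # hypothetical right-hand side
noise = 1e-4

G = jax.vmap(lambda a: jax.vmap(lambda b: DkDs(a, b))(Xc))(Xc)
G = G + noise * jnp.eye(Xc.shape[0])               # (D k D^*)(X', X') + Sigma

# Conditional mean and covariance at prediction points, cf. (4.15) and (4.16)
# with a zero prior mean.
xs = jnp.linspace(0.0, 1.0, 7)
kxD = jax.vmap(lambda a: jax.vmap(lambda b: kDs(a, b))(Xc))(xs)   # (k D^*)(x, X')
Dkx = jax.vmap(lambda a: jax.vmap(lambda b: Dk(a, b))(xs))(Xc)    # (D k)(X', x)
Kss = jax.vmap(lambda a: jax.vmap(lambda b: k(a, b))(xs))(xs)

mean_post = kxD @ jnp.linalg.solve(G, y)
cov_post = Kss - kxD @ jnp.linalg.solve(G, Dkx)
print(mean_post)
print(jnp.diag(cov_post))
```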
Hence, the assumptions of Corollary 3 are fulfilled, which means that the inference procedure used in these sections is supported by our theoretical results above.

Appendix C. Linear Partial Differential Equations

Definition C.1 (Multi-index). Using a $d$-dimensional multi-index $\alpha \in \mathbb{N}_0^d$, we can represent (mixed) partial derivatives of arbitrary order as
\[
\frac{\partial^{|\alpha|}}{\partial x^\alpha} := \frac{\partial^{|\alpha|}}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}, \tag{C.1}
\]
where $|\alpha| := \sum_{i=1}^d \alpha_i$. If the variables w.r.t. which we differentiate are clear from the context, we also denote this (mixed) partial derivative by $D^\alpha$.

Definition C.2 (Linear differential operator). A linear differential operator $D : U \to V$ of order $k$ between a space $U$ of $\mathbb{R}^{d'}$-valued functions and a space $V$ of real-valued functions defined on some common domain $\Omega \subset \mathbb{R}^d$ is a linear operator that linearly combines partial derivatives up to $k$-th order of its input function, i.e.
\[
D[u] := \sum_{i=1}^{d'} \sum_{\alpha \in \mathbb{N}_0^d, |\alpha| \le k} A_{i, \alpha} D^\alpha u_i, \tag{C.2}
\]
where $A_{i, \alpha} \in \mathbb{R}$ for every $i \in \{1, \ldots, d'\}$ and every multi-index $\alpha \in \mathbb{N}_0^d$ with $|\alpha| \le k$.

Definition C.3 (Heat equation (Lienhard and Lienhard, 2020; Evans, 2010)). Let $\Omega \subset \mathbb{R}^d$ be an open and bounded region and $T > 0$. The heat equation is given by
\[
\rho c_p \frac{\partial u}{\partial t} - \operatorname{div}(k \nabla u) = \dot q_V, \tag{C.3}
\]
where $k$ is $\mathbb{R}^{d \times d}$-valued with $\rho, c_p, k_{ij} \in L^\infty(\Omega \times (0, T])$, and $\dot q_V \in L^2(\Omega \times (0, T])$.

Definition C.4 (Elliptic PDE in divergence form). Let $\Omega \subset \mathbb{R}^d$ be an open and bounded region. An elliptic PDE in divergence form is an equation
\[
-\operatorname{div}(A \nabla u) + b^\top \nabla u + c u = f, \tag{C.4}
\]
where $A_{ij}, b_i, c \in L^\infty(\Omega)$ and $f \in L^2(\Omega)$.

C.1 Weak Derivatives and Sobolev Spaces

Definition C.5 (Test Function). Let $D \subset \mathbb{R}^d$ be open and let
\[
C_c^\infty(D) := \{ \varphi \in C^\infty(D, \mathbb{R}) \mid \operatorname{supp}(\varphi) \subset D \text{ is compact} \} \tag{C.5}
\]
be the space of smooth functions with compact support in $D$. A function $\varphi \in C_c^\infty(D)$ is dubbed a test function and we refer to $C_c^\infty(D)$ as the space of test functions.

Theorem C.6 (Sobolev Spaces$^{12}$). Let $D \subset \mathbb{R}^d$ be open, $m \in \mathbb{N}_{>0}$, and $p \in [1, \infty) \cup \{\infty\}$. The functional
\[
\| u \|_{m, p, D} := \begin{cases} \Big( \sum_{|\alpha| \le m} \| D^\alpha u \|_{L^p(D)}^p \Big)^{1/p} & \text{if } p < \infty, \\ \max_{|\alpha| \le m} \| D^\alpha u \|_{L^\infty(D)} & \text{if } p = \infty, \end{cases} \tag{C.6}
\]
is called a Sobolev norm. A Sobolev norm $\| u \|_{m, p, D}$ is a norm on subspaces of $L^p(D)$ on which the right-hand side is well-defined and finite. A Sobolev space of order $m$ is defined as the subspace
\[
W^{m, p}(D) := \{ u \in L^p(D) \mid D^\alpha u \in L^p(D) \text{ for } |\alpha| \le m \} \tag{C.7}
\]
of $L^p(D)$, where the $D^\alpha$ are weak partial derivatives. Sobolev spaces $W^{m, p}(D)$ are Banach spaces under the Sobolev norm $\| \cdot \|_{m, p, D}$. The Sobolev space $H^m(D) := W^{m, 2}(D)$ is a separable Hilbert space with inner product
\[
\langle u_1, u_2 \rangle_{m, D} := \sum_{|\alpha| \le m} \langle D^\alpha u_1, D^\alpha u_2 \rangle_{L^2(D)} \tag{C.8}
\]
and norm
\[
\| \cdot \|_{m, D} := \sqrt{\langle \cdot, \cdot \rangle_{m, D}} = \| \cdot \|_{m, 2, D}. \tag{C.9}
\]

12. This theorem is a summary of Adams and Fournier (2003, Definitions 3.1 and 3.2 and Theorems 3.3 and 3.6).
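As a brief illustration of Definitions C.1 and C.2, the following Python sketch implements a multi-index partial derivative and a constant-coefficient linear differential operator via automatic differentiation. It is our own example, not part of the paper's codebase; in particular, the dictionary encoding of the coefficients $A_{i, \alpha}$ is an arbitrary choice.

```python
import jax
import jax.numpy as jnp

# A multi-index partial derivative D^alpha (Definition C.1), built by repeatedly
# differentiating a scalar-valued function u: R^d -> R.
def partial_derivative(u, alpha):
    f = u
    for i, order in enumerate(alpha):
        for _ in range(order):
            f = (lambda g, j: lambda x: jax.grad(g)(x)[j])(f, i)
    return f

# A constant-coefficient linear differential operator (Definition C.2):
#   D[u](x) = sum_i sum_{|alpha| <= k} A_{i,alpha} * (D^alpha u_i)(x),
# with u given as a list of scalar-valued functions and the coefficients as a
# dict {(i, alpha): A_{i,alpha}}.
def linear_differential_operator(coeffs):
    def D(u):
        def Du(x):
            return sum(
                A * partial_derivative(u[i], alpha)(x)
                for (i, alpha), A in coeffs.items()
            )
        return Du
    return D

# Example: the negative Laplacian in d = 2 acting on a single output u_1.
coeffs = {(0, (2, 0)): -1.0, (0, (0, 2)): -1.0}
neg_laplace = linear_differential_operator(coeffs)

u = [lambda x: jnp.sin(x[0]) * jnp.cos(x[1])]
x0 = jnp.array([0.3, 1.2])
print(neg_laplace(u)(x0))            # ~ 2 * sin(0.3) * cos(1.2)
print(2 * jnp.sin(0.3) * jnp.cos(1.2))
```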
Bibliography

Robert A. Adams and John J. F. Fournier. Sobolev Spaces, volume 140 of Pure and Applied Mathematics. Elsevier, 2nd edition, 2003. ISBN 9780080541297.

Christian Agrell. Gaussian processes with linear operator inequality constraints. Journal of Machine Learning Research, 20(135):1–36, 2019. URL http://jmlr.org/papers/v20/19-065.html.

Christopher G. Albert. Gaussian processes for data fulfilling linear differential equations. Proceedings of the 39th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, 33(1), 2019. ISSN 2504-3900. doi:10.3390/proceedings2019033005.

Mauricio Alvarez, David Luengo, and Neil D. Lawrence. Latent force models. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 5, pages 9–16, Clearwater Beach, Florida, USA, 2009.

W. N. Anderson, Jr. and G. E. Trapp. Shorted operators. II. SIAM Journal on Applied Mathematics, 28(1):60–71, 1975. doi:10.1137/0128007.

Iskander Azangulov, Andrei Smolensky, Alexander Terenin, and Viacheslav Borovitskiy. Stationary kernels and Gaussian processes on Lie groups and their homogeneous spaces I: the compact case. arXiv preprint arXiv:2208.14960, 2022.

Adi Ben-Israel and Thomas N. E. Greville. Generalized Inverses: Theory and Applications. CMS Books in Mathematics. Springer, New York, 2nd edition, 2003. ISBN 978-0-387-21634-8. doi:10.1007/b97366.

Alain Berlinet and Christine Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer, first edition, 2004. ISBN 978-1-4613-4792-7. doi:10.1007/978-1-4419-9096-9.

S. J. Bernau. The square root of a positive self-adjoint operator. Journal of The Australian Mathematical Society, 8(1):17–36, February 1968. doi:10.1017/S1446788700004560.

Christopher M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, New York, first edition, 2006. ISBN 978-0-387-31073-2.

Fischer Black and Myron Scholes. The pricing of options and corporate liabilities. Journal of Political Economy, 81(3):637–654, 1973. doi:10.1086/260062.

David Borthwick. Introduction to Partial Differential Equations. Universitext. Springer, first edition, 2018. ISBN 978-3-319-48936-0. doi:10.1007/978-3-319-48936-0.

Richard Bouldin. The pseudo-inverse of a product. SIAM Journal on Applied Mathematics, 24(4):489–495, 1973. doi:10.1137/0124051.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

Jon Cockayne, Chris Oates, Tim Sullivan, and Mark Girolami. Probabilistic numerical methods for PDE-constrained Bayesian inverse problems. In Geert Verdoolaege, editor, Proceedings of the 36th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, volume 1853 of AIP Conference Proceedings, pages 060001-1–060001-8, 2017. doi:10.1063/1.4985359.

Jon Cockayne, Chris J. Oates, Ilse C. F. Ipsen, and Mark Girolami. A Bayesian conjugate gradient method (with discussion). Bayesian Analysis, 14(3):937–1012, 2019a. doi:10.1214/19-BA1145.

Jon Cockayne, Chris J. Oates, T. J. Sullivan, and Mark Girolami. Bayesian probabilistic numerical methods. SIAM Review, 61(4):756–789, 2019b. doi:10.1137/17M1139357.

Jacques Dixmier. Étude sur les variétés et les opérateurs de Julia, avec quelques applications. Bulletin de la Société Mathématique de France, 77:11–101, 1949. ISSN 0037-9484. doi:10.24033/bsmf.1403.

Lawrence C. Evans. Partial Differential Equations, volume 19 of Graduate Studies in Mathematics. American Mathematical Society, Providence, Rhode Island, 2nd edition, 2010. ISBN 978-0-82-184974-3. URL https://bookstore.ams.org/gsm-19-r.

Gregory E. Fasshauer. Solving partial differential equations by collocation with radial basis functions. In Alain Le Méhauté, Christophe Rabut, and Larry L. Schumaker, editors, Surface Fitting and Multiresolution Methods, pages 131–138. Vanderbilt University Press, Nashville, TN, 1997. ISBN 9780826512949.
Gregory E. Fasshauer. Solving differential equations with radial basis functions: multilevel methods and smoothing. Advances in Computational Mathematics, 11:139–159, November 1999. doi:10.1023/A:1018919824891.

C. A. J. Fletcher. Computational Galerkin Methods. Scientific Computation. Springer, Berlin, Heidelberg, 1st edition, 1984. ISBN 978-3-642-85949-6. doi:10.1007/978-3-642-85949-6.

Jean Baptiste Joseph Fourier. Théorie analytique de la chaleur. Firmin Didot, 1822. doi:10.1017/CBO9780511693229.

Mark Girolami, Eky Febrianto, Yin Ge, and Fehmi Cirak. The statistical finite element method (statFEM) for coherent synthesis of observation data and model predictions. Computer Methods in Applied Mechanics and Engineering, 275:113533, 2021. doi:10.1016/j.cma.2020.113533.

Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins Studies in the Mathematical Sciences. The Johns Hopkins University Press, Baltimore, fourth edition, 2013. ISBN 978-1-4214-0794-4. URL https://www.press.jhu.edu/books/title/10678/matrix-computations.

Thore Graepel. Solving noisy linear operator equations by Gaussian processes: Application to ordinary and partial differential equations. In Proceedings of the 20th International Conference on Machine Learning, pages 234–241. AAAI Press, 2003.

Bernard Haasdonk and Hans Burkhardt. Invariant kernel functions for pattern analysis and machine learning. Machine Learning, 68(1):35–61, 2007.

Philipp Hennig, Michael A. Osborne, and Mark Girolami. Probabilistic numerics and uncertainty in computations. Proceedings of the Royal Society A, 471(2179), 2015. doi:10.1098/rspa.2015.0142.

Philipp Hennig, Michael A. Osborne, and Hans P. Kersting. Probabilistic Numerics: Computation as Machine Learning. Cambridge University Press, June 2022. ISBN 9781316681411. doi:10.1017/9781316681411.

David S. Holder, editor. Electrical Impedance Tomography: Methods, History and Applications. Institute of Physics Medical Physics Series. Institute of Physics Publishing, Bristol, 2005. ISBN 0750309520.

Peter Holderrieth, Michael J. Hutchinson, and Yee Whye Teh. Equivariant learning of stochastic fields: Gaussian processes and steerable conditional neural processes. In International Conference on Machine Learning, pages 4297–4307. PMLR, 2021.

Motonobu Kanagawa, Philipp Hennig, Dino Sejdinovic, and Bharath K. Sriperumbudur. Gaussian processes and kernel methods: A review on connections and equivalences. arXiv preprint arXiv:1807.02582, 2018.

George Em Karniadakis, Ioannis G. Kevrekidis, Lu Lu, Paris Perdikaris, Sifan Wang, and Liu Yang. Physics-informed machine learning. Nature Reviews Physics, 3(6):422–440, 2021. doi:10.1038/s42254-021-00314-5.

Michael Kazhdan, Matthew Bolitho, and Hugues Hoppe. Poisson surface reconstruction. In Proceedings of the 4th Eurographics Symposium on Geometry Processing, volume 7, 2006.

Achim Klenke. Probability Theory: A Comprehensive Course. Universitext. Springer, London, second edition, 2014. doi:10.1007/978-1-4471-5361-0.

Nicholas Krämer, Jonathan Schmidt, and Philipp Hennig. Probabilistic numerical method of lines for time-dependent partial differential equations. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, Proceedings of The 25th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 151, pages 625–639. PMLR, 2022. URL https://proceedings.mlr.press/v151/kramer22a.html.

Benny Lautrup. The PDE's of continuum physics. In Proceedings of the Workshop on PDE methods in Computer Graphics, 2005.
Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Graph kernel network for partial differential equations. In ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations, 2020. doi:10.48550/arXiv.2003.03485.

Zongyi Li, Nikola B. Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew M. Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. In International Conference on Learning Representations, 2021. doi:10.48550/arXiv.2010.08895.

John H. Lienhard, IV and John H. Lienhard, V. A Heat Transfer Textbook. Phlogiston Press, Cambridge, MA, 5th edition, 2020. URL http://ahtt.mit.edu.

Anders Logg, Kent-Andre Mardal, and Garth Wells, editors. Automated Solution of Differential Equations by the Finite Element Method, volume 84 of Lecture Notes in Computational Science and Engineering. Springer, Berlin, Heidelberg, 2012. ISBN 978-3-642-23099-8. doi:10.1007/978-3-642-23099-8.

Stefania Maniglia and Abdelaziz Rhandi. Gaussian measures on separable Hilbert spaces and applications, January 2004.

James Clerk Maxwell. A dynamical theory of the electromagnetic field. Philosophical Transactions of the Royal Society of London, 155:459–512, 1865.

Pierre Michaud. A simple model of processor temperature for deterministic turbo clock frequency. Research report RR-9308, Inria Rennes, 2019. URL https://hal.inria.fr/hal-02391970.

Chris J. Oates and Tim J. Sullivan. A modern retrospective on probabilistic numerics. Statistics and Computing, 29:1335–1351, 2019. doi:10.1007/s11222-019-09902-z.

Houman Owhadi and Clint Scovel. Conditioning Gaussian measure on Hilbert space. Journal of Mathematical and Statistical Analysis, 1(109), 2018.

Houman Owhadi, Clint Scovel, and Florian Schäfer. Statistical numerical approximation. Notices of the American Mathematical Society, 66(10):1608–1617, 2019. doi:10.1090/noti1963.

Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Machine learning of linear differential equations using Gaussian processes. Journal of Computational Physics, 348:683–693, 2017. ISSN 0021-9991. doi:10.1016/j.jcp.2017.07.050.

Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019. ISSN 0021-9991. doi:10.1016/j.jcp.2018.10.045. URL https://www.sciencedirect.com/science/article/pii/S0021999118307125.

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, London, England, 2006. ISBN 026218253X.

Marco Reisert and Hans Burkhardt. Learning equivariant functions with matrix valued kernels. Journal of Machine Learning Research, 8(3), 2007.

Walter Rudin. Functional Analysis. International Series in Pure and Applied Mathematics. McGraw-Hill, New York, second edition, 1991. ISBN 978-0-07-054236-5.

Simo Särkkä. Linear operators and stochastic partial differential equations in Gaussian process regression. In Timo Honkela, Włodzisław Duch, Mark Girolami, and Samuel Kaski, editors, Artificial Neural Networks and Machine Learning – ICANN 2011, pages 151–158. Springer, Berlin, Heidelberg, 2011. doi:10.1007/978-3-642-21738-8_20.
Simo Särkkä, Arno Solin, and Jouni Hartikainen. Spatiotemporal learning via infinite-dimensional Bayesian filtering and smoothing: A look at Gaussian process regression through Kalman filtering. IEEE Signal Processing Magazine, 30(4):51–61, 2013. doi:10.1109/MSP.2013.2246292.

Ingo Steinwart. Convergence types and rates in generic Karhunen-Loève expansions with applications to sample path properties. Potential Analysis, 51:361–395, 2019. doi:10.1007/s11118-018-9715-5.

Ingo Steinwart and Andreas Christmann. Support Vector Machines. Information Science and Statistics. Springer, New York, first edition, 2008. ISBN 978-0-387-77242-4. doi:10.1007/978-0-387-77242-4.

Zsigmond Tarcsay. Closed range positive operators on Banach spaces. Acta Mathematica Hungarica, 142:494–501, 2014. doi:10.1007/s10474-013-0380-2.

Bastian von Harrach. Numerik partieller Differentialgleichungen. Lecture notes, 2021. URL https://www.math.uni-frankfurt.de/~harrach/lehre/Numerik_PDGL.pdf.

Junyang Wang, Jon Cockayne, Oksana Chkrebtii, Tim J. Sullivan, and Chris J. Oates. Bayesian numerical methods for nonlinear partial differential equations. Statistics and Computing, 31(55), 2021. doi:10.1007/s11222-021-10030-w.

Jonathan Wenger and Philipp Hennig. Probabilistic linear solvers for machine learning. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

Jonathan Wenger, Nicholas Krämer, Marvin Pförtner, Jonathan Schmidt, Nathanael Bosch, Nina Effenberger, Johannes Zenn, Alexandra Gessner, Toni Karvonen, François-Xavier Briol, Maren Mahsereci, and Philipp Hennig. ProbNum: Probabilistic numerics in Python, 2021. URL http://arxiv.org/abs/2112.02100.

Kôsaku Yosida. Functional Analysis, volume 123 of Classics in Mathematics. Springer, 6th edition, 1995. ISBN 978-3-540-58654-8. doi:10.1007/978-3-642-61859-8.