Gaussian Process Modeling of Large Scale Terrain Shrihari Vasudevan, Fabio Ramos, Eric Nettleton and Hugh Durrant-Whyte Australian Centre for Field Robotics The University of Sydney, NSW 2006, Australia {s.vasudevan,f.ramos,e.nettleton,hugh}@acfr.usyd.edu.au Abstract Building a model of large scale terrain that can adequately handle uncertainty and incompleteness in a statistically sound way is a challenging problem. This work proposes the use of Gaussian processes as models of large scale terrain. The proposed model naturally provides a multi-resolution representation of space, incorporates and handles uncertainties aptly and copes with incompleteness of sensory information. Gaussian process regression techniques are applied to estimate and interpolate (to fill gaps in occluded areas) elevation information across the field. The estimates obtained are the best linear unbiased estimates for the data under consideration. A single non-stationary (neural network) Gaussian process is shown to be powerful enough to model large and complex terrain, effectively handling issues relating to discontinuous data. A local approximation method based on a “moving window” methodology and implemented using KD-Trees is also proposed. This enables the approach to handle extremely large datasets, thereby completely addressing its scalability issues. Experiments are performed on large scale datasets taken from real mining applications. These datasets include sparse mine planning data, which is representative of a GPS based survey, as well as dense laser scanner data taken at different mine-sites. Further, extensive statistical performance evaluation and benchmarking of the technique has been performed through cross validation experiments. They conclude that for dense and/or flat data, the proposed approach will perform very competitively with grid based approaches using standard interpolation techniques and triangulated irregular networks using triangle based interpolation techniques; for sparse and/or complex data however, it would significantly outperform them. 1 Introduction Large scale terrain mapping is a difficult problem with wide-ranging applications in robotics, from space exploration to mining and more. For autonomous robots to function in such high-value applications, an 1 efficient, flexible and high-fidelity representation of space is critical. The key challenges posed by this problem are that of dealing with the problems of uncertainty, incompleteness and handling unstructured terrain. Uncertainty and incompleteness are virtually ubiquitous in robotics as sensor capabilities are limited. The problem is magnified in a field robotics scenario due to the significant scale of many domains. State-of-the-art surface mapping methods employ representations based on triangular tesselations. These representations, however, do not have a statistically sound way of incorporating and managing uncertainty. The assumption of spatially independent data is a further limitation of approaches based on these representations. While there are several interpolation techniques known, the independence assumption can lead to simplistic (simple averaging like) techniques resulting in inaccurate modeling of the terrain. Tesselation methods also do not handle incomplete data where areas of the terrain may not be visible from the sensor perspective. This paper proposes the use of Gaussian process models to represent large scale terrain. Gaussian process models provide a direct method of incorporating sensor uncertainty and using this in learning of the terrain model. Estimation of data at unknown locations is treated as a regression problem in 3D space and takes into account the correlated nature of the spatial data. This technique is also used to overcome sensor limitations and occlusions by filling the gaps in sensor data with a best linear unbiased estimate. The Gaussian process representation is a continuous domain, compact and non-parametric model of the terrain and thus can readily be used to create terrain maps at any required resolution. The main contribution of this paper is a novel approach for representing large scale terrain using Gaussian processes. Specifically, this paper shows that a single non-stationary kernel (neural-network) Gaussian process is successfully able to model large scale terrain data taking into account the local smoothness while preserving spatial features in the terrain. A further contribution of this paper is a local approximation method based on a “moving window” implemented using KD-Trees that incorporates the benefits of both stationary and non-stationary kernels in the terrain estimation process. The use of the local approximation technique addresses the issue of scalability of the Gaussian process model to large and complex terrain data sets. The end-result is a multi-resolution representation of space that incorporates and manages uncertainty in a statistically sound way and handles spatial correlation between data in an appropriate manner. Experiments conducted on large scale datasets obtained from real mining applications, including a sparse mine planning dataset that is representative of a GPS based survey as well as dense laser scanner based surveys, clearly demonstrate the viability and effectiveness of the Gaussian process representation of large-scale terrain. The paper also compares the performance of stationary and non-stationary kernels and details various finer points of the Gaussian process approach to terrain modeling. Extensive cross-validation experiments have been performed to statistically evaluate the method; they also compare the approach with several well known representation and interpolation methods, serving as benchmarks. These experiments further support the 2 claims made and enable a greater understanding of how different techniques compare against each other, the applicability of the different techniques and finally the Gaussian process modeling paradigm itself. Section 2 broadly reviews recent work in the areas of terrain representations, interpolation and finally Gaussian processes applied to the terrain modeling problem. The following section (3) presents the approach in detail, from problem formulation to modeling and learning and finally application. The next two sections (4 and 5) will discuss the kernels, the learning method used and the approximation methodology in more detail; perspectives and comparisons of the proposed approach in relation to specific works in the state-ofthe-art will also be drawn. These theoretical sections will be followed by a three part experimental section (6) that will provide an extensive evaluation of the proposed approach. The paper will conclude with a reiteration of the findings and contributions of this work. 2 Related Work State-of-the-art terrain representations used in applications such as mining, space exploration and other field robotics scenarios as well as in geospatial engineering are typically limited to elevation maps, triangulated irregular networks (TIN’s), contour models and their variants or combinations ((Durrant-Whyte, 2001) and (Moore et al., 1991)). Each of them have their own strengths and preferred application domains. The former two are more popular in robotics. The latter represents the terrain as a succession of “isolines” of specific elevation (from minimum to maximum). They are particularly suited for modeling hydrological phenomena but otherwise offer no particular computational advantages in the context of this paper. Each of them, in their native form, do not handle spatially correlated data effectively. Grid based methods represent space in terms of elevation data corresponding to each cell of a regularly spaced grid structure. The outcome is a 2.5D representation of space, also known as an elevation map. The main advantage of this representation is simplicity. Limitations include the inability to handle abrupt changes, the dependence on grid size and the issue of scalability in large environments. In robotics, grid based methods have been exemplified by numerous works such as (Krotkov and Hoffman, 1994), (Ye and Borenstein, 2003), (Lacroix et al., 2002) and more recently (Triebel et al., 2006). Both (Krotkov and Hoffman, 1994) and (Ye and Borenstein, 2003) use laser range finders to construct elevation maps. The latter also proposes a “certainty assisted spatial filter” that uses “certainty” information as well as spatial information (elevation map) to filter out erroneous terrain pixels. The authors of (Lacroix et al., 2002) developed stereo-vision based elevation maps and recognize that the main problem is uncertainty management. Addressing this issue, they proposed a heuristic data fusion algorithm based on the Dempster Shafer theory. The authors of (Triebel et al., 2006) propose an extension to standard elevation maps in order to handle multiple surfaces 3 and overhanging objects. The main weaknesses observed in grid based representations are the lack of a statistically direct way of incorporating and managing uncertainty and the inability to appropriately handle spatial correlation. Triangulated Irregular Networks (TIN) usually sample a set of surface specific points that capture important aspects of the terrain surface to be modeled - bumps/peaks, troughs, breaks for example. The representation typically takes the form of an irregular network of elevation points with each point linked to its immediate neighbors. This set of points is represented as a triangulated surface. TIN’s are able to more easily capture sudden elevation changes and are also more flexible and efficient than grid maps in relatively flat areas. In robotics, TIN’s have been used extensively, see (Leal et al., 2001) and (Rekleitis et al., 2008). The work (Leal et al., 2001), presents an approach to performing online 3D multi-resolution reconstruction of unknown and unstructured environments to yield a stochastic representation of space. Recent work in the space exploration domain presents a LIDAR based approach to terrain modeling using TIN’s (Rekleitis et al., 2008). This work decimates coplanar triangles into a single triangle to provide a more compact representation. TIN’s may be efficient from a survey perspective as few points are hand-picked. However, (Triebel et al., 2006) points out that for dense sensor data, while they are accurate and can easily be textured, they have a huge memory requirement which grows linearly with the number of scans. The assumption of spatial statistical independence may render the method ineffective at handling incomplete data as the elemental facet of the TIN (planar triangle) may approximate a complicated surface. This however, depends on the choice of the sensor and the data density obtained. There are several different kinds of interpolation strategies for grid data structures. The choice of interpolation method can have severe consequences on the accuracy of the model obtained. Kidner (Kidner et al., 1999) reviewed and compared numerous interpolation methods for grid based methods. He recommended the use of higher order (at least bi-cubic) polynomial interpolation methods as a requirement for grid data interpolation. Ye and Borenstein, in (Ye and Borenstein, 2003), used median filtering to fill in the missing data in an elevation map. The Locus algorithm (Kweon and Kanade, 1992) was used in (Krotkov and Hoffman, 1994). This interpolation method attempted to find an elevation estimate by computing the intersection of the terrain with the vertical line at the point in the grid - this was done in image space rather than Cartesian space. Gaussian processes (Rasmussen and Williams, 2006) (GP’s) are powerful non-parametric learning techniques that can handle many of the problems described above. They can provide a scalable multi-resolution model of large scale terrain. GP’s provide a continuous domain representation of the terrain data and hence can be easily sampled at any desired resolution. They incorporate and handle uncertainty in a statistically sound way and represent spatially correlated data in an appropriate manner. They model and use the 4 spatial correlation of the given data points to estimate the elevation values for other unknown points of interest. In an estimation sense, GP’s provide a best linear unbiased estimate (Kitanidis, 1997) based on the underlying stochastic model of spatial correlation between the data points. They basically implement an interpolation methodology called Kriging (Matheron, 1963) which is a standard interpolation technique used in geostatistics. GP’s thus handle both uncertainty and incompleteness effectively. In the recent past, Gaussian processes have been applied in the context of terrain modeling, see (Lang et al., 2007), (Plagemann et al., 2008a) and (Plagemann et al., 2008b). In all three cases, the GP’s are implemented using a non-stationary equivalent of a stationary squared exponential covariance function (Paciorek and Schervish, 2004) and incorporate kernel adaptation techniques in order to adequately handle both smooth surfaces as well as inherent (and characteristic) surface discontinuities. Whereas (Lang et al., 2007) initializes the kernel matrices evaluated at each point with parameters learnt for the corresponding stationary kernel and then iteratively adapts them to account for local structure and smoothness, (Plagemann et al., 2008a) and (Plagemann et al., 2008b) introduce the idea of a “hyper-GP” (using an independent stationary kernel GP) to predict the most probable length scale parameters to suit the local structure. These are used to make predictions on a non-stationary GP model of the terrain. (Plagemann et al., 2008b) proposes to model space as an ensemble of GP’s in order to reduce computational complexity. While the context of this work is similar, the approach differs in proposing the use of non-stationary, neural network kernels to model large scale discontinuous spatial data. The performance of GP’s using stationary (squared exponential) and non-stationary (neural network) kernels for modeling large scale terrain data are compared. Experiments to better understand various aspects of the modeling process using GP’s are also described. These demonstrate that using suitable non-stationary kernel can directly result in modeling both local structure and smoothness. A local approximation methodology that addresses scalability issues is proposed, which when used with the proposed kernel, emulates the locally adaptive effect of the techniques proposed in (Lang et al., 2007), (Plagemann et al., 2008a) and (Plagemann et al., 2008a). This approximation technique is based on a “moving window” methodology and implemented using an efficient hierarchical representation (KD-Tree) of the data. Thus, the scalability issues relating to the application of GP’s to large scale data sets are addressed. Further, extensive statistical evaluation and benchmarking experiments are performed, they compare GP’s using stationary and non-stationary kernel, grid representations using several well known interpolation methods and finally, TIN representations using triangle based interpolation techniques. These experiments further strengthen the claims on the effectiveness of the approach. The contribution of this work is thus the proposition of a novel way of representing large scale terrain using Gaussian processes. The representation naturally provides a multi-resolution model of the terrain, incorporates and handles uncertainty effectively and appropriately handles spatially correlated data. 5 3 Approach 3.1 Problem Definition The terrain modeling problem can be understood as follows - given a sensor that provides terrain data as a set of points (x, y, z) in 3D space, the objective is to: 1. Develop a multi-resolution representation that incorporates sensor uncertainty in an appropriate manner. 2. Handle in a statistically consistent manner, sensing limitations including incomplete sensor information and terrain occlusions. Terrain data can be obtained using numerous sensors including 3D laser scanners, radar or GPS. 3D laser scanners provide dense and accurate data whereas a GPS based survey typically comprise of a relatively sparse set of well chosen points of interest. The experiments reported in this work use data sets that represent both of these kinds of information. The overall process of using Gaussian processes to model large scale terrain is shown in Figure 1. The steps involved in learning the representation sought are shown in Figure 2. The process of applying the GP representation to create elevation/surface maps at any arbitrary resolution is described in Figure 3. Typically, for dense and large data sets, not all data is required for learning the GP model. Such an approach would not scale for reasons of computational complexity. Thus, a sampling step may need to be included. This sampling may be uniform or random and could also use heuristic approaches such as preferential sampling from areas of high gradient. The dataset is sampled into three subsets - training, evaluation and testing. The training data is used to learn the GP model corresponding to the data. The training and evaluation data together are stored in the KD-Tree (Preparata and Shamos, 1993) for later use in the application or inference process. The KD-Tree provides for rapid data access in the inference process. It is used in the local approximation method proposed in this work to address scalability issues related to applying the approach to large scale data sets. The testing data is the subset of data over which the GP model is evaluated. In this work, the mean squared error between the predicted and actual elevation, computed over the testing subset, is the performance metric used. 3.2 Model Description The use of Gaussian processes for modeling and representing terrain data is proposed. The steps below are directly based on (Rasmussen and Williams, 2006). Gaussian processes provide a powerful framework 6 Figure 1: The overall process of learning and using Gaussian processes to model terrain: Data from any sensor along with noise estimates are provided as the input to the process. The modeling step then learns the appropriate Gaussian process model (hyper-parameters of the kernel) and the result of the modeling step are the GP hyperparameters together with the training/evaluation data. These are then used to perform GP regression to estimate the elevation data across any grid of points, of the users interest. The application step produces a 2.5D elevation map of the terrain as well as an associated uncertainty for each point. 7 Figure 2: The training or modeling process: A sensor such as a laser scanner is used to provide 3D terrain data. A sample of the given terrain data is used to learn a GP representation of the terrain. Different sampling methods can be used. Training and evaluation data are stored in a KD-Tree for later use in the application stage. Learning the GP model of the terrain amounts to performing a maximum marginal likelihood estimation of the hyperparameters of the model. The outcomes of this process are the KD-Tree representation of the training/evaluation data and the GP model of the terrain. 8 Figure 3: The testing or application process: Given a desired region of interest where elevation needs to be estimated, applying the GP model learnt in the modeling stage amounts to sampling the continuous representation at the desired resolution, in the area of interest. A local approximation technique based on a “moving window” methodology and implemented using the KD-Tree obtained from the modeling stage addresses scalability issues related to the method being applied for large scale data. It also enables a tradeoff between smoothness and feature preservation in the terrain. The outcome of this process is an elevation map with the corresponding uncertainty for each point in it. 9 for learning models of spatially correlated and uncertain data. Gaussian process regression provides a robust means of estimation and interpolation of elevation information and can handle incomplete sensor data effectively. GP’s are non-parametric approaches in that they do not specify an explicit functional model between the input and output. They may be thought of as a Gaussian probability distribution in function space and are characterized by a mean function m(x) and the covariance function k(x, x0 ) where m(x) = E[f (x)] (1) k(x, x0 ) = E[(f (x) − m(x))(f (x0 ) − m(x0 ))] (2) such that the GP is written as f (x) ∼ GP(m(x), k(x, x0 )) (3) The mean and covariance functions together specify a distribution over functions. In the context of the problem at hand, each x ≡ (x, y) and f (x) ≡ z of the given data. The covariance function models the relationship between the random variables corresponding to the given data. Although not necessary, the mean function m(x) may be assumed to be zero by scaling the data appropriately such that it has an empirical mean of zero. There are numerous covariance functions (kernels) that can be used to model the spatial variation between the data points. The most popular kernel is the squared-exponential kernel given as 1 k(x, x0 ) = σf2 exp − (x − x0 )T Σ(x − x0 ) 2 lx where k is the covariance function or kernel; Σ = 0 (4) −2 0 is the length-scale matrix, a measure of ly how quickly the modeled function changes in the directions x and y; σf2 is the signal variance. The set of parameters lx , ly , σf are referred to as the kernel hyperparameters. The neural network kernel ((Neal, 1996), (Williams, 1998a) and (Williams, 1998b)) is the focus of this work and is specified as 0 k(x, x ) = σf2 arcsin β + 2xT Σx0 p (1 + β + 2xT Σx)(1 + β + 2x0T Σx0 ) ! (5) where β is a bias factor, Σ is the length scale matrix as described before and lx , ly , σf , β constitute the 10 kernel hyperparameters. The NN kernel represents the covariance function of a neural network with a single hidden layer between the input and output, infinitely many hidden nodes and using a Sigmoid as the transfer function (Williams, 1998a) for the hidden nodes. Hornik in (Hornik, 1993) showed that such neural networks are universal approximators of data and Neal (Neal, 1996) observed that the functions produced by such a network would tend to a Gaussian Process. The main difference between these two kernels is that the squared-exponential kernel, being a function of |x − x0 | is stationary (invariant to translation) whereas the neural network function is not so. In practice, the squared exponential function has a smoothing or averaging effect on the data. The neural network covariance function proves to be much more effective than the squared exponential covariance function in handling discontinuous (rapidly changing) data, as the experiments in this paper will show. This is the main reason why it proves to be effective in modeling complex terrain data. A more detailed analysis of these kernels and their properties will be given in Section 4. 3.3 Learning the hyperparameters Training the GP for a given data set is equivalent to choosing a kernel function and optimizing the hyperparameters for the chosen kernel. For the squared-exponential kernel, this amounts to determining the values θ = {lx , ly , σf , σn } and for the neural network kernel, the values θ = {lx , ly , σf , β , σn }, where σn2 is the noise variance of the data being modeled. This is performed by formulating the problem in a maximum marginal likelihood estimation framework and subsequently solving a non-convex optimization problem. Defining X = {xi }ni=1 = {(xi , yi )}ni=1 and z = {f (xi )}ni=1 = {zi }ni=1 as the sets of training inputs and outputs respectively of n instances, the log marginal likelihood of the training outputs (z) given the set of locations (X) and the set of hyperparameters θ is given by 1 1 n log p(z|X, θ) = − zT Kz−1 z − log |Kz | − log(2π) 2 2 2 (6) where Kz = Kf + σn2 I is the covariance matrix for all noisy targets z and Kf is covariance matrix for the noise-free targets obtained by applying either Equation 4 or 5 on the training data. The log marginal likelihood has three terms - the first describes the data fit, the second term penalizes model complexity and the last term is simply a normalization coefficient. Thus, training the model will involve searching for the set of hyperparameters that enables the best data fit while avoiding overly complex models. Occam’s razor principle (Mackay, 2002) is thus in-built in the optimization and prevention of over-fitting is guaranteed by the very formulation of the learning mechanism. Using this maximum marginal likelihood formulation, training 11 the GP model on a given set of data amounts to finding the optimal set of hyperparameters that maximize the log marginal likelihood. This can be performed using standard off-the-shelf optimization approaches. In this work, a combination of stochastic search (simulated annealing) and gradient based search (QuasiNewton optimization with BFGS Hessian update (Nocedal and Wright, 2006)) was found to produce the best results. Using a gradient based optimization approach leads to advantages in that convergence is achieved much faster. 3.4 Applying the GP model Applying the GP model amounts to using the learned GP model to estimate the elevation information across a region of interest, characterized by a grid of points at a desired resolution. The 2.5D elevation map can then be used directly or as a surface map for various applications. This is achieved by performing Gaussian process regression at a set of query points, given the training/evaluation data sets and the GP kernel with the learnt hyperparameters. The observed sensor noise is assumed to be Gaussian zero-mean and white, with unknown variance (it is learnt from the data) σn2 . Thus, the covariance on the observations is given as cov(zp , zq ) = k(xp , xq ) + σn2 δpq (7) where δpq is a Kroeneker Delta, δpq = 1 iff p = q and 0 otherwise. The joint distribution of any finite number of random variables of a GP is Gaussian. Thus, the joint distribution of the training outputs z and test outputs f∗ given this prior can be specified by σn2 I z K(X, X) + ∼ N 0 , f∗ K(X∗ , X) K(X, X∗ ) K(X∗ , X∗ ) (8) For n training points and n∗ test points, K(X, X∗ ) denotes the n × n∗ matrix of covariances evaluated at all pairs of training and test points. The terms K(X, X), K(X∗ , X∗ ) and K(X∗ , X) can be defined likewise. The function values (f∗ ) corresponding to the test locations (X∗ ) given the training inputs X, training outputs z and the covariance function (kernel) is given by f¯∗ = K(X∗ , X)[K(X, X) + σn2 I]−1 z. The function covariance is given by 12 (9) cov(f∗ ) = K(X∗ , X∗ ) − K(X∗ , X)[K(X, X) + σn2 I]−1 K(X, X∗ ). (10) Denoting K(X, X) by K, K(X, X∗ ) by K∗ and for a single test point x∗ , k(x∗ ) = k∗ is used to represent the vector of covariances between x∗ and the n training points, Equations 9 and 10 can be rewritten as: f¯∗ = k∗T (K + σn2 I)−1 z, (11) V [f∗ ] = k(x∗ , x∗ ) − k∗T (K + σn2 I)−1 k∗ . (12) and variance Equations 11 and 12 provide the basis for the elevation estimation process. The GP estimates obtained are a best linear unbiased estimate for the respective query points. Uncertainty is handled by incorporating the sensor noise model in the training data. The representation produced is multi-resolution in that a terrain model can be generated at any desired resolution using the GP regression equations presented above. Thus, the GP terrain modeling approach is a probabilistic, multi-resolution method that handles spatially correlated information. 3.5 Fast local approximation using KD-Trees During the inference process, the KD-Tree that initially stores the training and evaluation data is queried to provide a predefined number of spatially closest data, to the point for which the elevation must be estimated. The GP regression process then uses only these training exemplars to estimate the elevation at the point of interest. This is described in the zoomed in view of Figure 3. Note that the GP model itself is learnt from the set of all training data but is applied locally using the KD-Tree approximation. This approximation method is akin to a “moving window” methodology and has multiple benefits that are vital to the scalability of the GP method in a large scale application. The use of only the spatially closest neighbors for GP regression keeps the computational complexity low and bounded. As described earlier, the squared exponential kernel has a smoothing effect on the data whereas the neural-network kernel is much more effective in modeling discontinuous terrain data. The number of nearest neighbor exemplars used can be used to control the tradeoff between smoothness and feature preservation. Finally, the proposed local approximation method bounds (by selecting a fixed number of data points) the subset of data over which the neural network kernel is applied and thus effectively uses it. A more detailed explanation of these aspects is provided in Section 5. 13 4 On the Choice of Kernel (a) Squared Exponential (SQEXP) kernel behavior (b) Neural Network (NN) kernel behavior Figure 4: 4(a) depicts the SQEXP kernel (X = -3.0 : 0.1 : 3.0). Correlation between points is inversely proportional to the Euclidean distance between them, hence, points further away from the point under consideration are less correlated to it. 4(b) depicts the NN kernel for data (X = -3.0 : 0.1 : 3.0). The NN kernel is a function of the dot product between the points until it saturates. Spatial correlation between two points is dependent on the distance of the respective points from the coordinate origin and not their relative distance as in SQEXP. Thus, the center/query point does not bear any special significance as in SQEXP. The data will be such that the positive and negative correlation values together with the respective data will produce the correct estimate. 4.1 Squared Exponential Kernel The squared exponential (SQEXP) kernel (equation 4) is a function of | x − x0 | and is thus stationary and isotropic meaning that it is invariant to all rigid motions. The SQEXP function is also infinitely differentiable and thus tends to be smooth. The kernel is also called the Gaussian kernel. It functions similarly to a Gaussian filter being applied to an image. The point at which the elevation has to be estimated is treated as the central point/pixel. Training data at this point together with the those at points around it determine the elevation. As a consequence of being a function of the Euclidean distance between the points and the fact that the correlation is inversely proportional to the distance between them, the SQEXP kernel is prone to producing smoothened (possibly overly) models. Points that are far away from the central pixel contribute less to its final estimate whereas points nearby the central pixel have maximum leverage. The correlation measures of far off points no matter how significant they may be, are diminished. This is in accordance with the “bell-shaped” curve of the kernel. Figure 4(a) depicts the SQEXP kernel of a selection of data with itself. 14 4.2 Neural Network Kernel The NN kernel treats data as vectors and is a function of the dot product of the data points. Hence, the NN kernel is dependent on the coordinate origin and not the central point in a window of points, as explained about SQEXP above. Spatial correlation is a function of the distance of the points from the data origin until reaching a saturation point. This kernel behavior is exhibited in Figure 4(b). In the saturation region, the correlation of points are similar. It is this kernel behavior that enables the resulting GP model to adapt to rapid or large variations in data. This work uses a local approximation technique (Sections 3.5 and 5) to address scalability issues relating to large scale applications. The local approximation method also serves to bound the region over which the kernel is applied (hence, enabling effective usage of the kernel) as even two distant points lying in the saturation region of the kernel will be correlated. The effect of using the NN kernel alongside the local approximation leads to a form of locally non-stationary GP regression for large scale terrain modeling. The fact that correlation increases (until saturation) with distance in the NN kernel, although apparently counter intuitive, may be thought of as a “noise-filtering” mechanism, wherein, a greater emphasis is placed on the neighborhood data of a point under consideration. Such a behavior is not seen in the SQEXP kernel. In this work, it has been repeatedly observed from experiments in Section 6 that the NN kernel has been able to adapt to fast changing / large discontinuities in data. The NN kernel has been described in detail in (Neal, 1996), (Williams, 1998a), (Williams, 1998b) and (Rasmussen and Williams, 2006). The form used here is an exact equivalent and implementationally more convenient form to the classical form of the kernel. 4.3 Discussion Table 1: Comparison of Kernels Squared Exponential (SQEXP) Neural Network (NN) Stationary (translational invariance) Non-stationary (not translationally invariant) Isotropic (rotational invariance) Non-isotropic (not rotationally invariant) Negative values not allowed Negative values allowed Function of Euclidean distance between points Function of dot-product of the points until saturation Origin independent Origin dependent Correlation inversely proportional to Euclidean Correlation is a function of the distance of the distance between points points from the origin until it saturates A comparison between the SQEXP and NN kernels is detailed in Table 1. As noted in (Williams, 1998a), a direct analytical comparison of stationary and non-stationary kernel beyond its properties is not straight-forward. This paper compares them based on their properties and provides empirical results which 15 (a) GP modeling of data using SQEXP kernel (b) GP modeling of data using NN kernel Figure 5: Adaptive power of GP’s using stationary (SQEXP) and non-stationary (NN) kernels. For a set of chosen points (in 1 dimension) that represent an abrupt change (the plus marks), a GP is learnt using both the kernels and the GP based estimates are plotted (solid blue line) with the uncertainty being represented by the gray area around the solid line. The GP using the NN kernel is able to respond to the step-like jump in data almost spontaneously whereas the GP using the SQEXP smoothens out the curve (ie. is much more approximate) in order to best fit the data. The GP-NN estimates are also much more certain than the GP-SQEXP estimates. This adaptive power of the GP using the NN kernel is the main motivation of using it in this work in order to model large scale, terrain which may potentially be very unstructured - there may be large and rapid jumps in data and the GP/kernel will have to respond to them effectively. demonstrate that the non-stationary kernel performs much better than the stationary one for a spatial modeling task. There is no clear way of specifying which kinds of kernel behavior are “correct” or “incorrect” for a spatial modeling task. The only criteria of importance is the ability of the GP using the particular kernel to adapt to the data at hand. A simple but telling example of the adaptive power of the a GP using the non-stationary NN kernel versus that of one using a stationary SQEXP kernel is depicted in Figure 5. Another interesting aspect of a GP using the NN kernel in comparison with one using an SQEXP kernel is its superior ability to respond to sparse data availability. This is due to the fact that the correlation in the NN kernel saturates. Thus, even relatively far off points could be used to make useful predictions at the point of interest. In the case of the SQEXP kernel however, the correlation decreases with distance towards zero. Thus, far off data in sparse datasets become insignificant and are not useful for making inferences at the point of interest. Such a scenario could occur in large voids due to occlusions or due to fundamental sensor limitations. These aspects of the NN kernel taken together, make it a powerful method for use in complex terrain modeling using GP’s. While a GP based on the NN kernel is an effective model for terrain data as described and demonstrated in this paper, other kernels may work better than the NN in specific situations. Indeed, the GP with 16 NN kernel is a general purpose universal approximator. Domain specific kernels (kernels tailored to specific applications or datasets - for example a linear kernel in flat areas) could potentially perform better in specific situations. The essential issues with GP’s based on the SQEXP kernel for describing terrain data are the resulting stationarity and constant length scales of the model. These underscore the issue that the SQEXP treats different parts of the data in exactly the same way and to exactly the same extent. This makes GP’s based on the standard SQEXP kernel inappropriate for the task of modelling unstructured large-scale terrain. The works of Lang and Plagemann ((Lang et al., 2007), (Plagemann et al., 2008a) and (Plagemann et al., 2008b)) have attempted to address these issues by using the non-stationary equivalent of the SQEXP kernel ((Paciorek and Schervish, 2004)), defining Gaussian kernels for each point that account for the local smoothness or by using a hyper-GP to predict local length scales. This work attempts to use an alternative non-stationary kernel that is inherently immune to these issues. The Neural Network (NN) kernel is a function of the dot product of the points, thus, different parts of the data will be treated differently even if a single set of length scales are used. Treating the data points as vectors and using their dot product inherently captures the “local smoothness”. The interesting aspect is that a single kernel choice with a single set of hyperparameters will suffice to model the data - this is demonstrated in our experiments (Section 6). Thus, this work proposes an alternative methodology to address the core issues in GP based terrain modeling and further, it is also tailored to meet the demands of modeling large scale and unstructured terrain. 4.4 Learning methodology Once the kernel is chosen, the next important step in the GP modeling process is to estimate the hyperparameters for a given dataset. Different approaches have been attempted including a maximum marginal likelihood formulation (MMLE), generalized cross validation (GCV) and maximum a-posteriori estimation (MAP) using Markov chain Monte Carlo (MCMC) techniques (see (Williams, 1998b) and (Rasmussen and Williams, 2006)). The MMLE computes the hyperparameters that best explains the training data given the chosen kernel. This requires the solution of a non-convex optimization problem and may possibly be driven to locally optimal solutions. The MAP technique places a prior on the hyperparameters and computes the posterior distribution of the test outputs once the training data is well explained. The state-of-the-art in GP based terrain modeling typically used the latter approach. For large scale datasets, the GCV and MAP approaches are too computationally expensive. In the work described in this paper, the learning procedure is based on the MMLE method, the use of gradients and the learning of the hyperparameters that best explains 17 the training data through optimization techniques. Locally optimal solutions are avoided by combining a stochastic optimization procedure with a gradient based optimization. The former enables the optimizer to reach a reasonable solution and the latter focuses on fast convergence to the exact solution. 5 The local approximation strategy As described earlier, Equations 11 and 12 can be used to estimate the elevation and its uncertainty at any point, given a set of training data. The two equations involve the term (K + σn2 I)−1 which requires an inversion of the covariance matrix of the training data. This is an operation of cubic complexity O(n3 ) where n is the number of training data used. In this work, the training data together with the evaluation data is stored in a KD-Tree. Typically, the size of these data subsets is very large and thus using the entire set of data for elevation estimation is not practically possible. Thus, approximations to the GP regression problem are required. In the literature, several approximation techniques for Gaussian processes have been proposed to address this problem. Some well known examples include the Nystrom approximation, the subset of regressors and Bayesian committee machines. Extensive reviews of these techniques are provided in Chapter 8 of (Rasmussen and Williams, 2006) and (Candela et al., 2007). In this work, a local approximation strategy that is akin to a “moving window” approximation is proposed. The proposed approximation methodology is inspired from work in geostatistics where the “moving window” approach is well established ((Haas, 1990b), (Haas, 1990a) and (Wackernagel, 2003)). Such an approximation has also been suggested in the machine learning community by Williams in (Williams, 1998b). The work (Shen et al., 2006) proposed a different approximation method using KD-Trees. They used the idea that GP regression is a weighted summation. For monotonic isotropic kernels, they used KD-Trees to efficiently group training data points having similar weights with respect to a given query point in order to approximate and speed-up the regression process. In comparison, the method proposed in this work uses standard GP regression using any kernel, with a bounded neighborhood of points obtained by applying a moving window on data. The proposed local approximation strategy is based on the underlying idea that a point is most correlated with its immediate neighborhood. Thus, during the inference process, the KD-Tree, that initially stored the training and evaluation datasets, is queried to provide a predefined set of training data, spatially closest to the point for which the elevation must be estimated. The GP regression process then uses only these training exemplars to estimate the elevation at the point of interest. Note that the GP model itself is learnt from the set of all training data but is applied locally using the local approximation idea. In order to have an idea of the reduction in complexity due to the proposed approximation, a simple 18 example is considered. A dataset of N (for example, 1 million) points may be split into for instance ntrain (say, 3000) training points, ntest (say, 100000) test points and the remaining neval (897000) are evaluation points. The training points are those over which the GP is learnt, the test points are those over which the MSE is evaluated and the evaluation points together with the training points are used to make predictions at the test points. The KD-Tree data structure would thus comprise of a total of ntrain + neval points (a total of 900000 training and evaluation data). The complexity in the absence of the approximation strategy would be O(ntrain + neval )3 for each of the ntest points. However, with the proposed local approximation strategy, a predefined number m of spatially closest training and evaluation points is used to make a prediction at any point of interest. Typically, m << ntrain + neval . For instance, a neighborhood size of 100 points is used for experiments in this work. Thus the computational complexity would significantly decrease to O(m3 ) for each of the ntest points to be evaluated. As the experiments in this work will demonstrate, the results are still very good. Also note that the KD-Tree is just a convenient hierarchical data structure used to implement the fast “moving window” based local approximation method. In theory, any appropriate data structure could be used; the key idea is to use the local neighborhood only. The squared exponential kernel has a smoothing effect on the data whereas the neural-network kernel is much more effective in modeling discontinuous terrain data. The local approximation strategy defines the neighborhood over which the interpolation takes place and hence can serve as a control parameter to balance local smoothness with feature preservation in terrain modeling. The method of choosing the nearest neighbors can also be used to the same effect. For instance, 100 nearest neighbors mutually separated by a point will produce much smoother outputs as compared to the 100 immediate neighbors. In our experience, the immediate neighborhood produced the best results. The local approximation method can thus be used as a control or tradeoff mechanism between locally smooth outputs and feature preservation. The local approximation strategy is also useful in the application of the neural network kernel (NN). The NN kernel is capable of seamlessly adapting to rapidly and largely changing data. However, it also has the property that depending on the location of the origin and that of the data points with respect to it, the correlation increases with distance until it saturates. Particularly in the saturation region, the kernel is able to respond to discontinuous data very effectively as the correlation values become similar. However, this also implies that two distant points lying in the saturation region will be correlated. The proposed local approximation method acts as a bounding mechanism for the applicability of the kernel. This is because the NN only acts on the immediate neighborhood of the data, over which it is extremely effective in adapting. In the case of the neural network kernel, this amounts to adding the benefits of a stationary kernel to the highly adaptive power of a non-stationary kernel. Thus, the proposed local approximation methodology attempts to emulate the locally adaptive GP effect exemplified by the works (Lang et al., 2007), (Plagemann et al., 19 2008a) and (Plagemann et al., 2008b). 6 Experiments 6.1 Overview and Datasets Experiments were performed on three large scale data sets representative of different sensory modalities. The first data set comprises of a scan from a RIEGL LMSZ420 laser scanner at the Mt. Tom Price mine in Western Australia. The data set consists of over 1.8 million points (in 3D) spread over an area of 135 x 72 sq m, with a maximum elevation difference of about 13.8 m. The scan has both numerous voids of varying sizes as well as noisy data, particularly around the origin of the scan. The data set is shown in Figure 6 and is referred to as the “Tom Price” data set in the rest of this paper. The second data set comprises of a mine planning dataset, that is representative of a GPS based survey, from a large Kimberlite (diamond) mine. This data set is a very hard one in that it contains very sparse information (4612 point in total; the average spread of the data is about 33 m in x and y directions) spread over a very large geographical area (≈ 2.2 x 2.3 sq km), with a significant elevation of about 250 m. The data set is referred to as the “Kimberlite Mine” data set in the rest of this paper and is shown in Figure 7. A third data set also consists of a laser scan but over a much larger area than the first data set. This dataset is shown in Figure 8 and is referred to as the “West Angelas” dataset for the rest of this paper. It comprises of about 534,460 points spread over 1.865 x 0.511 sq km. The maximum elevation of this dataset is about 190 m. This dataset is both rich in features (step like wall formations, voids of varying sizes) as well and large in scale; a large part of the dataset is composed of only sparse data. Figure 6: Laser scanner data set taken at Mt. Tom Price mine in Western Australia using a RIEGL LMSZ420. The data set consists of over 1.8 million data points. The data set spans about 135 x 72 sq m. The focus of the experiments is three-fold. The first part will demonstrate GP terrain modeling and 20 Figure 7: Dataset representative of a GPS based survey of a large Kimberlite (diamond) mine. The data set comprised of just over 4600 points spanning about 2.2 x 2.3 sq km. The difference in elevation between the base and the top was about 250 m. The data set is very sparse with the average spread being about 33 m in x and y directions. Figure 8: Laser scanner data set taken at West Angelas mine in Western Australia using a RIEGL LMSZ420. The data set consists of about 534460 data points and spans about 1.865 x 0.511 sq km. 21 evaluate the non-stationary NN kernel against the stationary SQEXP kernel for this application. The second part will detail results from extensive cross-validation experiments done using all three datasets that aim to statistically evaluate the performance of the stationary and non-stationary GP and also compare them with several well known methods of interpolating data, even using alternative representations. This part aims at understanding how they compare against each other on the test datasets and which would be more effective in what circumstances. The last part will detail experiments that help understand finer aspects of the GP modeling process and explain some of the design decisions in this work - these include the choice of using two optimization techniques or 100 nearest neighbors for instance. In all experiments, the mean squared error (MSE) criterion was used as the performance metric (together with visual inspection) to interpret the results. Each dataset was split into three parts - training, evaluation and testing. The former-most one was used for learning the GP and the latter was used only for computing the mean squared error (MSE); the former two parts together were used to estimate the elevations for the test data. The mean squared error was computed as the mean of the squared differences between the GP elevation prediction at the test points and the ground truth specified by the test data subset. 6.2 Stationary vs Non-stationary GP terrain modeling In the first set of experiments, the approach as described in Section 3 was implemented and tested with the aforementioned datasets. The objective was to understand the if GP’s could be successfully used for a terrain modeling task, the benefits of using them and if the non-stationary (neural network) or stationary (squared exponential) kernel was more suited for terrain modeling. Table 2: Kernel performance - Tom Price dataset (1806944 data points over 135 x 72 sq m, 3000 training data, 10000 test points for MSE) Kernel Mean Squared Error (MSE) (sq m) Squared Exponential (SQEXP) 0.0136 Neural Network (NN) 0.0137 Table 3: Kernel performance - Kimberlite Mine dataset (4612 points spread over 2.17 x 2.28 sq km) Kernel Number of Mean Squared Error (MSE) training data (sq m) Squared Exponential (SQEXP) 1000 13.014 (over 3612 points) Neural Network (NN) 1000 8.870 (over 3612 points) Squared Exponential (SQEXP) 4512 4.238 (over 100 points) Neural Network (NN) 4512 3.810 (over 100 points) 22 Table 4: Kernel performance - West Angelas dataset (534460 data points over 1.865 x 0.511 sq km, 3000 training data, 106292 test points for MSE) Kernel Mean Squared Error (MSE) (sq m) Squared Exponential (SQEXP) 0.590 Neural Network (NN) 0.019 The results of GP modeling of the three data sets using are summarized in Tables 2, 3 and 4. The tests compared the stationary SQEXP kernel and the non-stationary NN kernel at Gaussian process modeling of large scale mine terrain data. Overall, the NN kernel outperformed the SQEXP and proved to be more suited to the terrain modeling task. In the case of the Tom Price dataset, as the dataset is relatively flat, both kernels performed relatively well. The Kimberlite Mine dataset being sparse, was better modeled by the NN kernel. The West Angelas dataset was more challenging in that it had large voids, large elevation and sparse data. The NN kernel clearly outperformed the SQEXP kernel. Further, not only was its MSE lower than that of the SQEXP but the MSE values were comparable to the known sensor noise of the RIEGL scanner. The final output for the Tom Price data set is depicted in Figures 9 and 10; the corresponding confidence estimate is depicted in Figure 11. The output surface and confidence estimates for the Kimberlite Mine data set are respectively depicted in Figures 12 and 13 respectively. Figures 14 and 15 show the output points and associated confidence for the West Angelas dataset. Figure 9: Outcome of applying a single neural network based GP to the Tom Price data set. The gray (darker) points constitute the given data set. The red points are the output points - the elevation data as estimated by the GP over a test grid of points, spanning the dataset. The figure demonstrates the ability of the GP to (1) reliably estimate the elevation data in known areas and (2) provide a good interpolated estimate in areas with occlusions. The experiments successfully demonstrated Gaussian process modeling of large scale terrain. Further, the SQEXP and NN kernel were compared. The NN kernel proved to be much better suited to modeling 23 Figure 10: The surface map generated using the output elevation map obtained by applying a single neural network GP to the Tom Price data set. Refer Figure 9. complex terrain as compared to the SQEXP kernel. The kernels performed very similarly on relatively flat data. Thus, the approach can 1. Model large scale sensory data acquired at different degrees of sparseness. 2. Provide an unbiased estimator for the data at any desired resolution. 3. Handle sensor incompleteness by providing an unbiased estimates of data that are unavailable (due to occlusions for instance). Note that no assumptions on the structure of the data are made, a best prediction is made using only the data at hand. 4. Handle sensor uncertainty by modeling the uncertainty in data in a statistically sound manner. 5. Handle spatial correlation between data. The result is a compact, non-parametric, multi-resolution, probabilistic representation of the large scale terrain that appropriately handles spatially correlated data as well as sensor incompleteness and uncertainty. 24 (a) Top-down view of Tom Price dataset (b) Uncertainty (standard deviation in m) estimates of output elevation map (Figure 9) Figure 11: 11(b) depicts the final uncertainty estimates of the output elevation map (see Figure 9) of the Tom Price dataset as obtained using the proposed GP approach. The uncertainty map corresponds to a top-down view of the dataset (Figure 6) as depicted in 11(a). Each point represents the uncertainty in the estimated elevation of the corresponding point in the elevation map. The uncertainty is clearly higher in those areas where there was no training data - typically in the outer fringe areas and in gaps. (min standard deviation = 0.05 m, max standard deviation = 11.98 m) 25 Figure 12: Final output surface of the Kimberlite Mine data set obtained using a single neural network kernel based GP. The surface map is obtained from the elevation map, in turn obtained by predicting the elevation for each point across a test grid of points spanning the dataset. The figure demonstrates the ability of the GP to (1) reliably estimate the elevation data in known areas (2) produce models that take into account the local structure so as to preserve the characteristics of the terrain being modeled (3) be able to model sparse and feature rich datasets. 26 (a) Top-down view of Kimberlite Mine dataset (b) Uncertainty (standard deviation in m) estimates of output elevation map (Figure 12) Figure 13: 13(b) depicts the final uncertainty estimates of the output elevation map (see Figure 12) of the Kimberlite Mine dataset as obtained using the proposed GP approach. The uncertainty map corresponds to a top-down view of the dataset (Figure 7) as depicted in 13(a). Each point represents the uncertainty in the estimated elevation of the corresponding point in the elevation map. The uncertainty is clearly higher in those areas where there was no training data - typically in the outer fringe areas and in gaps. (min standard deviation = 0.336 m, max standard deviation = 32.65 m) 27 Figure 14: Outcome of applying a single neural network based GP to the West Angelas data set. This elevation map is obtained by predicting the elevation for each point across a test grid of points spanning the dataset. The figure clearly demonstrates the ability of the GP to (1) reliably estimate the elevation data in known areas (2) produce models that take into account the local structure so as to preserve the characteristics of the terrain being modeled (3) be able to model sparse and feature rich datasets. 28 (a) Top-down view of West Angelas mine dataset (b) Uncertainty (standard deviation in m) of output elevation map (Figure 14) Figure 15: 15(b) depicts the final uncertainty estimates of the output elevation map (see Figure 14) of the West Angelas mine dataset as obtained using the proposed GP approach. The uncertainty map corresponds to a top-down view of the dataset (Figure 8) as depicted in 15(a). Each point represents the uncertainty in the estimated elevation of the corresponding point in the elevation map. The uncertainty is clearly higher in those areas where there was no training data - typically in the outer fringe areas and in gaps. (min standard deviation = 0.01 m, max standard deviation = 39.25 m) 29 6.3 Statistical performance evaluation, comparison and analysis The next set of experiments that were conducted were aimed at 1. Evaluating the GP modeling technique in a way that provides for statistically representative results. 2. Providing benchmarks to compare the proposed GP approach with grid/elevation maps using several standard interpolation techniques as well as TIN’s using triangle based interpolation techniques. 3. Understanding the strengths and applicability of different techniques. The method adopted to achieve these objectives was to perform cross validation experiments on each of the three datasets and interpret the results obtained. The GP modeling using the SQEXP and NN kernels are compared with each other as well as with parametric, non-parametric and triangle based interpolation techniques. A complete list and description of the techniques used in this experiment is included below. 1. GP modeling using SQEXP and NN kernels. Details on the approach and the kernels can be found in Sections 3 and 4. 2. Parametric interpolation methods - These methods attempt to fit the given data to a polynomial surface of the chosen degree. The polynomial coefficients, once found, can be used to interpolate or extrapolate as required. The parametric interpolation techniques used here include linear (plane fitting), quadratic and cubic. (a) Linear (plane fitting) - A polynomial of degree 1 (plane in 2D) is fit to the given data. The objective is to find the coefficients {ai }2i=0 for the equation z = a0 + a1 x + a2 y , that best describes the given data by minimizing the residual error. (b) Quadratic fitting: A polynomial of degree 2 is fit to the given data. The objective is to find the coefficients {ai }5i=0 for the equation z = a0 + a1 x + a2 y + a3 x2 + a4 xy + a5 y 2 , that best describes the given data by minimizing the residual error. (c) Cubic fitting - A polynomial of degree 3 is fit to the given data. The objective is to find the 30 coefficients {ai }9i=0 for the equation z = a0 + a1 x + a2 y + a3 x2 + a4 xy + a5 y 2 + a6 x3 + a7 x2 y + a8 xy 2 + a9 y 3 , that best describes the given data by minimizing the residual error. 3. Non-parametric interpolation methods - These techniques attempt to fit a smooth curve through the given data and do not attempt to characterize all of it using a parametric form, as was the case with the parametric methods. They include piecewise linear or cubic interpolation, biharmonic spline interpolation as well as techniques such as nearest neighbor interpolation and mean-of-neighborhood interpolation. (a) Piecewise linear interpolation - In this method, the data is gridded (associated with a suitable mesh structure) and for any point of interest, the 4 points of its corresponding grid cell are determined. A bilinear interpolation (Kidner et al., 1999) of these 4 points yields the estimate at the point of interest. This interpolation basically uses the form: z = a0 + a1 x + a2 y + a3 xy , where the coefficients are expressed in terms of the coordinates of the grid cell. It is thus truly linear only if its behavior is considered keeping one variable/dimension fixed and observing the other. It is also called linear spline interpolation wherein n + 1 points are connected by n splines such that (a) the resultant spline would constitute a polygon if the begin and end points were identical and (b) the splines would be continuous at each data point. (b) Piecewise cubic interpolation - This method extends the piecewise linear method to cubic interpolation. It is also called cubic spline interpolation wherein n + 1 points are connected by n splines such that (a) the splines are continuous at each data point and (b) the spline functions are twice continuous differentiable, meaning that both the first and second derivatives of the spline functions are continuous at the given data points. The spline function also minimizes the total curvature and thus produces the smoothest curve possible. It is akin to constraining an elastic strip to fit the given n + 1 points. (c) Biharmonic spline interpolation - The biharmonic spline interpolation is based on (Sandwell, 1987). It places Green functions at each data point and finds coefficients for a linear combination of these functions to determine the surface. The coefficients are found by solving a linear system 31 of equations. Both the slope and the values of the given data are used to determine the surface. The method is known to be relatively inefficient, possibly unstable but flexible (in the sense of the number of model parameters) in relation to the cubic spline method. In one or two dimensions, this is known to be equivalent to the cubic spline interpolation as it also minimizes the total curvature. (d) Nearest-neighbor interpolation - This method uses the elevation estimate of the nearest training/evaluation data as the estimate of the point of interest. (e) Mean-of-neighborhood interpolation - This method uses the average elevation of the set of training/evaluation data as the estimate of the point of interest. 4. Triangle based interpolation - All techniques mentioned thus far are applied on the exact same neighborhood of training and evaluation data that is used for GP regression for any desired point. In essence, thus far, the comparison can be likened to one of comparing GP’s with standard elevation maps using any of the aforementioned techniques, applied over exactly the same neighborhoods. Two other interpolation techniques are also considered for comparison - triangle based linear and cubic interpolation (Watson, 1992). These methods work on datasets that have been triangulated and involve fitting planar or polynomial (cubic) patches to each triangle. They are meant to compare the proposed GP approach with a TIN representation of the dataset, using the triangle based interpolation techniques. (a) Triangle based linear interpolation - In this method, a Delaunay triangulation is performed on the data provided and then the coordinates of the triangle corresponding to the point of interest (in Barycentric coordinates (Watson, 1992)) are subject to linear interpolation to estimate the desired elevation. (b) Triangle based cubic interpolation - In this method, a Delaunay triangulation is performed on the data provided and then the coordinates of the triangle corresponding to the point of interest (in Barycentric coordinates (Watson, 1992)) are subject to cubic interpolation to estimate the desired elevation. Taking inspiration from (Kohavi, 1995), a 10-fold cross validation was conducted using each dataset. Two kinds of cross validation techniques were considered - using uniform sampling to define the folds and using patch sampling to define the folds. In the uniform sampling method, each fold had approximately the same number of points and points were selected through a uniform sampling process - thus, the cross validation was also stratified in this sense. This was performed for all three datasets. Further to this, the patch sampling based cross validation was conducted for the two laser scanner datasets. In this case, the data 32 was gridded into 5 m (for Tom Price dataset) or 10 m (for West Angelas dataset) patches. Subsequently a 10 fold cross validation was performed by randomly selecting different sets of patches for testing and training in each cycle. The uniform sampling method was meant to evaluate the method as such. The patch sampling method was done with a view of testing the different techniques for their robustness to handling occlusions or voids in data. Figures 16 and 17 depict the two kinds of cross validation performed. Figure 16: Cross validation using uniform sampling. The figure shows a top - down / X (horizontal axis) - Y (vertical axis) view of the West Angelas dataset (Figure 8). Points are uniformly sampled from the dataset to produce 10 equally sized folds and cross validation is performed by testing each fold while learning from the remaining folds. The figure depicts the dataset (blue points) and test points from three of the folds (red, green and yellow points). The results of the cross validation with uniform sampling experiments are summarized in Tables 5, 6 and 7. The tabulated results depict the mean and standard-deviation in MSE values obtained across the 10 folds of testing performed. The results of the cross validation with patch sampling experiments performed on the Tom Price and West Angelas datasets are summarized in Tables 8 and 9 respectively. In each of these tests, the GP representation with its kriging interpolation technique, stationary or non-stationary kernel and local approximation strategy is compared with elevation maps using various parametric, non-parametric interpolation techniques as well as the exact same local approximation method as the GP and also with TIN’s using triangle based interpolation techniques. The former two approaches thus operate on exactly the same neighborhood of training/evaluation data during the estimation/interpolation process. In order to compare the proposed GP modeling approach with a TIN representation, the local approximation is not applied on the latter class of interpolation techniques. 33 (a) Patch sampling of West Angelas dataset (b) Patch sampling of Tom Price dataset Figure 17: Cross Validation using patch sampling. The figures show a top - down / X (horizontal axis) Y (vertical axis) view of the West Angelas (Figure 8) and Tom Price (Figure 6) datasets respectively. The datasets are gridded into 5m (for Tom Price) and 10m (for West Angelas) patches. 10 fold cross validation is done by randomly selecting patches for training and testing in each cycle. The figures depict the datasets (blue points) and test patches for three cycles of testing (red, green and yellow patches) in the two laser scanner datasets. 34 Table 5: Tom Price dataset 10-fold cross validation with uniform sampling. Comparison of Interpolation techniques 1000 test data per fold 10000 test data per fold Interpolation Method Mean MSE Std. Dev. MSE Mean MSE Std. Dev. MSE (sq m) (sq m) (sq m) (sq m) GP Neural Network 0.0107 0.0012 0.0114 0.0004 GP Squared Exponential 0.0107 0.0013 0.0113 0.0004 Nonparametric Linear 0.0123 0.0047 0.0107 0.0013 Nonparametric Cubic 0.0137 0.0053 0.0120 0.0017 Nonparametric Biharmonic 0.0157 0.0065 0.0143 0.0019 Nonparametric 0.0143 0.0010 0.0146 0.0007 Mean-of-neighborhood Nonparametric 0.0167 0.0066 0.0149 0.0017 Nearest-neighbor Parametric Linear 0.0107 0.0013 0.0114 0.0005 Parametric Quadratic 0.0110 0.0018 0.0104 0.0005 Parametric Cubic 0.0103 0.0018 0.0103 0.0005 Triangulation Linear 0.0123 0.0046 0.0107 0.0013 Triangulation Cubic 0.0138 0.0053 0.0120 0.0017 Table 6: Kimberlite Mine dataset 10-fold cross validation with uniform sampling. Comparison of Interpolation techniques Interpolation Method Mean MSE (sq m) Std. Dev. MSE (sq m) GP Neural Network 3.9290 0.3764 GP Squared Exponential 5.3278 0.3129 Nonparametric Linear 5.0788 0.6422 Nonparametric Cubic 5.1125 0.6464 Nonparametric Biharmonic 5.5265 0.5801 Nonparametric 132.5097 2.9112 Mean-of-neighborhood Nonparametric 20.4962 2.5858 Nearest-neighbor Parametric Linear 43.1529 2.2123 Parametric Quadratic 13.6047 0.9047 Parametric Cubic 10.2484 0.7282 Triangulation Linear 5.0540 0.6370 Triangulation Cubic 5.1091 0.6374 35 Table 7: West Angelas dataset 10-fold cross validation with uniform sampling. Comparison of Interpolation techniques 1000 test data per fold 10000 test data per fold Interpolation Method Mean MSE Std. Dev. MSE Mean MSE Std. Dev. MSE (sq m) (sq m) (sq m) (sq m) GP Neural Network 0.0166 0.0071 0.0219 0.0064 GP Squared Exponential 0.4438 1.0289 0.7485 0.7980 Nonparametric Linear 0.0159 0.0075 0.0155 0.0021 Nonparametric Cubic 0.0182 0.0079 0.0161 0.0021 Nonparametric Biharmonic 0.0584 0.0328 0.1085 0.1933 Nonparametric 0.9897 0.4411 0.9158 0.0766 Mean-of-neighborhood Nonparametric 0.1576 0.0271 0.1233 0.0048 Nearest-neighbor Parametric Linear 0.1019 0.0951 0.0927 0.0173 Parametric Quadratic 0.0458 0.0130 0.0390 0.0059 Parametric Cubic 0.0341 0.0109 0.0288 0.0038 Triangulation Linear 0.0162 0.0074 0.0157 0.0022 Triangulation Cubic 0.0185 0.0078 0.0166 0.0023 Table 8: Tom Price dataset 10-fold cross validation with patch sampling. Comparison of Interpolation techniques 1000 test data per fold 10000 test data per fold Interpolation Method Mean MSE Std. Dev. MSE Mean MSE Std. Dev. MSE (sq m) (sq m) (sq m) (sq m) GP Neural Network 0.0104 0.0029 0.0100 0.0021 GP Squared Exponential 0.0103 0.0029 0.0099 0.0021 Nonparametric Linear 0.0104 0.0047 0.0098 0.0040 Nonparametric Cubic 0.0114 0.0054 0.0108 0.0045 Nonparametric Biharmonic 0.0114 0.0046 0.0124 0.0047 Nonparametric 0.0143 0.0036 0.0142 0.0026 Mean-of-neighborhood Nonparametric 0.0139 0.0074 0.0132 0.0048 Nearest-neighbor Parametric Linear 0.0103 0.0029 0.0100 0.0021 Parametric Quadratic 0.0099 0.0029 0.0091 0.0022 Parametric Cubic 0.0097 0.0027 0.0605 0.1072 Triangulation Linear 0.0104 0.0047 0.0098 0.0039 Triangulation Cubic 0.0114 0.0054 0.0108 0.0045 36 Table 9: West Angelas dataset 10-fold cross validation with patch sampling. Comparison of Interpolation techniques 1000 test data per fold 10000 test data per fold Interpolation Method Mean MSE Std. Dev. MSE Mean MSE Std. Dev. MSE (sq m) (sq m) (sq m) (sq m) GP Neural Network 0.0286 0.0285 0.0291 0.0188 GP Squared Exponential 0.7224 1.1482 1.4258 2.0810 Nonparametric Linear 0.0210 0.0168 0.0236 0.0161 Nonparametric Cubic 0.0229 0.0175 0.0243 0.0165 Nonparametric Biharmonic 0.0555 0.0476 0.0963 0.0795 Nonparametric 1.6339 0.8907 1.7463 1.2831 Mean-of-neighborhood Nonparametric 0.1867 0.0905 0.2248 0.1557 Nearest-neighbor Parametric Linear 0.1400 0.0840 0.1553 0.1084 Parametric Quadratic 0.0612 0.0350 0.0658 0.0454 Parametric Cubic 0.0380 0.0196 0.0752 0.0880 Triangulation Linear 0.0212 0.0169 0.0235 0.0160 Triangulation Cubic 0.0232 0.0175 0.0245 0.0164 From the experimental results presented, many inferences on the applicability and usage of different techniques could be drawn. The results obtained validate the findings from the previous set of experiments. They show that the NN kernel is much more effective at modeling terrain data than the SQEXP kernel. This is particularly true in relatively sparse and/or complex datasets such as the Kimberlite Mine or West Angelas datasets as shown in Tables 6, 7 and 9. For relatively flat terrain such as the Tom Price scan data, the techniques compared performed more or less similarly. This is shown in Tables 5 and 8. From the GP modeling perspective, the choice of nonstationary or stationary kernel was less relevant in such a scenario. The high density of data in the scan was likely another contributive factor. For sparse datasets such as the Kimberlite Mine dataset, GP-NN easily outperformed all the other interpolation methods. This is shown in Table 6. This is quite simply because most other techniques imposed apriori models on the data whereas the GP-NN was actually able to model and adapt to the data at hand much more effectively. Further, GP-SQEXP is not as effective as the GP-NN in handling sparse or complex data. In the case of the West Angelas dataset, the data was complex in that it had a poorly defined structure and large areas of the data were sparsely populated. This was due to large occlusions as the scan was a bottom-up scan from one end of the mine pit. However, the dataset spanned over a large area and local neighborhoods (small sections of data) were relatively flat. Thus, while the GP-NN clearly outperformed the GP-SQEXP, it performed comparably with the piecewise linear/cubic and triangle based linear/cubic 37 methods and much better than the other techniques attempted. This is shown in the Tables 7 and 9. Thus, even in this case, the GP-NN proved to be a very competitive modeling option. GP-SQEXP was not able to handle sparse data or the complex structure (with an elevation difference of 190 m) effectively and hence performed poorly in relation to the aforementioned methods. As evidenced by Tables 6, 7 and 9, the non-parametric and triangle based linear/cubic interpolation techniques performed much better than the parametric interpolation techniques, which fared poorly overall. This is expected and intuitive as the parametric models impose apriori models over the whole neighborhood of data whereas the non-parametric and triangle based techniques basically split the data into simplexes (triangular or rectangular patches) and then operate only only one such simplex for a point of interest. Thus their focus is more towards smooth local data fitting. The parametric techniques only performed well in the Tom Price dataset tests (Tables 5 and 8) due to the flat nature of the dataset. Among the nonparametric techniques, the linear and cubic also performed better than the others. The nearest-neighbor and mean-of-neighborhood methods were too simplistic to be able to handle any complex or sparse data. The results prove that the GP-NN is a very versatile and competitive modeling option for a range of datasets, varying in complexity and sparseness. For dense datasets or relatively flat terrain (even locally flat terrain), the GP-NN will perform as well as the standard grid based methods employing any of the interpolation techniques compared or a TIN based representation employing triangle based linear or cubic interpolation. For sparse datasets or very bumpy terrain, the GP-NN will significantly outperform every other technique that has been tested in this experiment. When considering the other advantages of the GP approach such as Bayesian model fitting, ability to handle uncertainty and spatial correlation in a statistically sound manner and multi-resolution representation of data, the GP approach to terrain representation seems to have a clear advantage over standard grid based or TIN based approaches. Finally, note that in essence, these experiments compared the GP - Kriging interpolation technique with various other standard interpolation methods. The GP modeling approach can use any of the standard interpolation techniques through the use of suitably chosen/designed kernels. For instance, a GP-linear kernel performs the same interpolation as a linear interpolator, however, the GP also provides the corresponding uncertainty estimates among other things. The GP is more than just an interpolation technique. It is a Bayesian framework that brings together interpolation, linear least squares estimation, uncertainty representation, non-parametric continuous domain model fitting and the Occam’s Razor principle (thus preventing the over-fitting of data). Individually, these are well known techniques; the GP provides a flexible (in the sense of using different kernels) technique to do all of them together. Thus, this set of experiments statistically evaluated the proposed GP modeling approach, compared stationary and non-stationary GP’s for terrain modeling, compared the GP modeling / Kriging interpolation 38 approach to elevation maps using many standard interpolation techniques as well as TIN representations using triangle based interpolation techniques. 6.4 Further analysis The final set of experiments performed aimed at gaining a better understanding of some aspects of the modeling process, the factors influencing it and some of the design decisions made in this work. These experiments were mostly conducted on the Kimberlite Mine data set owing to its characteristic features (gradients, roadways, faces etc.). Only small numbers of training data have been used in these experiments in order to facilitate rapid computation. The mean squared error values will thus obviously not be near the best case scenario, however the focus of these experiments was not on improving accuracy but rather on observing trends and understanding dependencies. Among the experiments performed, three different optimization strategies were attempted - stochastic (simulated annealing), gradient based (Quasi-Newton optimization with BFGS Hessian update (Nocedal and Wright, 2006)) and the combination (first a simulated annealing, followed by the Quasi-Newton optimization with the output of the previous stage being the input to this one). The results obtained are shown in Table 10. The objective of this experiment was to start from a random set of hyper-parameters and observe which method converges to a better solution. Expectedly, it was found that the combined strategy performed best. The gradient based optimization requires a good starting point or the optimizer may get stuck in local minima. The stochastic optimization (simulated annealing) provides a better chance to recover from local minima as it allows for “jumps” in the parameter space during optimization. Thus, the combination first localizes a good parameter neighborhood and then zeros-down on the best parameters for the given scenario. Table 10: Optimization Strategy - GP-NN applied on Kimberlite Mine data set (Number of Training Data = 1000, optimization is started from a random start point) Optimization Method Mean Squared Error (sq m) Stochastic (Simulated Annealing) 94.30 Gradient based (Quasi-Newton Optimization) 150.48 Combined 4.61 Three different sampling strategies have been attempted in this work - uniform sampling, random sampling, heuristic sampling (implemented here as a preferential sampling around elevation gradients). All experiments in this paper have been performed using a uniform sampling strategy as it performed the best. The three sampling strategies were specifically tested on the Kimberlite Mine data set owing to its significant elevation gradient. The results obtained are shown in Table 11. Heuristic sampling did poorly in comparison to the other sampling methods because of the fact that the particular gradient based sampling approach 39 applied here resulted in the use of mostly gradient information and little or no face information. Increasing the number of training data increases the number of face related data points and reduces the amount of error. At the limit (number of training data = number of data), all three approaches would produce the same result. Table 11: Sampling Strategy - GP-NN applied on Kimberlite Mine data set Method Number(Train Data) = 1000 Number(Train Data) = 1500 Mean Squared Error (sq m) Mean Squared Error (sq m) Uniform 8.87 5.87 Random 10.17 7.89 Heuristic 32.55 11.48 Table 12: GP-NN applied on Kimberlite Mine data set (3000 training data, 1000 test data) Number of Neighbors MSE (sq m) 10 4.5038 50 4.2214 100 4.1959 150 4.1859 200 4.1811 250 4.1743 300 4.1705 350 4.1684 400 4.1664 450 4.1652 500 4.1642 550 4.1626 The number of nearest neighbors used for the local approximation step in the inference process is another important factor influencing the GP modeling process. Too many nearest neighbors results in excessive “smoothing” of the data whereas too few of them would result in a relatively “spiky” output. Further, the greater the number of neighbors considered, the more the time it takes to estimate/interpolate the elevation (an increase in computational complexity). The “correct” number of nearest neighbors was determined by computing the MSE for different neighborhood sizes and interpreting the results. This experiment was conducted on both the Kimberlite Mine and the West Angelas datasets. The results of this experiment are shown in Tables 12 and 13 respectively. With an aim of (a) choosing a reasonable sized neighborhood, (b) obtaining reasonably good MSE performance and time complexity and (c) if possible, using a single parameter across all experiments and datasets, a neighborhood size of 100 was used in this work. The size of the training data is obviously an important factor influencing the modeling process. However, this depends a lot on the data set under consideration. For the Tom Price laser scanner data, a mere 3000 40 Table 13: GP-NN applied on West Angelas Mine data set (3000 training data, 10000 test data) Number of Neighbors MSE (sq m) 10 0.082 50 0.017 100 0.020 150 0.023 200 0.026 250 0.027 300 0.028 of over 1.8 million data points were enough to produce centimeter level accuracy. This was due to the fact that this particular data set comprised of dense and accurate scanner data with a relatively lower change in elevation. The same number of training data also produced very satisfactory results in the case of the West Angelas dataset which is both large and has a significant elevation gradient - this is attributed to the uniform sampling of points across the whole dataset and adaptive power of the NN kernel. The Kimberlite Mine data set comprised of very sparse data spread over a very large area. This data set also had a lot of “features” such as roads, “crest-lines” and “toe-lines”. Thus, if the amount of training data is reduced (say 1000 of the 4612), while the terrain was still broadly reconstructed, the finer details - faces, lines, roads etc. were smoothened out. Using nearly the entire data set for training and then reconstructing the elevation/surface map at the desired resolution provided the best results. This section of experiments attempted to better understand the finer points and design decisions of the GP modeling approach proposed in this work. A combination of stochastic (simulated annealing) and gradient based (Quasi-Newton) optimization was found to be the best option to obtain the GP hyperparameters. For most datasets and particularly for dense and large ones, a uniform sampling procedure was a good option. For feature rich datasets, a heuristic sampling might be attempted but care must be taken to ensure that training data from the non-feature-rich areas are also obtained. The number of nearest neighbors affected both the time complexity as well as the MSE performance. A tradeoff between these was applied to empirically determine that the experiments in this work would use 100 nearest neighbors. Finally, the number of training data depends on the computational resources available and on the richness of the dataset. A dataset such as the Kimberlite Mine dataset needed more training data than the Tom Price dataset. 7 Conclusion This work proposed the use of Gaussian processes for modeling large scale and complex terrain. Specifically, it showed that a single neural network based Gaussian process (GP) was powerful enough to be able to 41 successfully model complex terrain data, taking into account the local structure and preserving much of the spatial features in the terrain. This work also proposed a fast local approximation technique, based on a “moving-window” methodology and implemented using KD-Trees, that was able to simultaneously address the issues of scalability of this approach to very large datasets, effective application of the neural network (NN) kernel and procuring a tradeoff between smoothness and feature preservation in the resulting output. This work also compared stationary and non-stationary kernel and studied various detailed aspects of the modeling process. Cross validation experiments were used to provide statistically representative performance measures, compare the GP approaches with several other interpolation methods using both grid/elevation maps as well as TIN’s and understand the strengths and applicability of different techniques. These experiments demonstrated that for dense and/or relatively flat terrain, the GP-NN would perform as well as the grid based methods using any of the standard interpolation techniques or TIN based methods using triangle based interpolation techniques. However, for sparse and/or complex terrain data, the GP-NN would significantly outperform alternative methods. Thus, the GP-NN proved to be a very versatile and effective modeling option for terrain of a range of sparseness and complexity. Experiments conducted on multiple real sensor data sets of varying sparseness, taken from a mining scenario clearly validated the claims made. The data sets included dense laser scanner data and sparse mine planning data that was representative of a GPS based survey. Not only are the kernel choice and local approximation method unique to this work on terrain modeling, the approach as a whole is novel in that it addressed the problem at a scale never attempted before with this particular technique. The model obtained naturally provided a multi-resolution representation of large scale terrain, effectively handled both uncertainty and incompleteness in a statistically sound way and provided a powerful basis to handle correlated spatial data in an appropriate manner. The Neural Network kernel was found to be much more effective than the squared exponential kernel in modeling terrain data. Further, the various properties of the neural network kernel including its adaptive power and of being a universal approximator make it an excellent choice for modeling data, particularly when no apriori information (eg. the geometry of the data) is available or when data is sparse, unstructured and complex. This has been consistently demonstrated throughout this paper. Where domain information is available (eg. a flat surface), a domain specific kernel (eg. linear kernel) may perform well but the NN kernel would still be very competitive. Thus, this work also suggests further research in the area of kernel modeling - particularly combinations of kernels and domain specific kernels. 42 Acknowledgements This work has been supported by the Rio Tinto Centre for Mine Automation and the ARC Centre of Excellence programme, funded by the Australian Research Council (ARC) and the New South Wales State Government. The authors acknowledge the support of Annette Pal, James Batchelor, Craig Denham, Joel Cockman and Paul Craine of Rio Tinto. References Candela, J. Q., Rasmussen, C. E., and Williams, C. K. I. (2007). Approximation methods for gaussian process regression. Technical Report MSR-TR-2007-124, Microsoft Research. Durrant-Whyte, H. (2001). A Critical Review of the State-of-the-Art in Autonomous Land Vehicle Systems and Technology. Technical Report SAND2001-3685, Sandia National Laboratories, USA. Haas, T. C. (1990a). Kriging and automated variogram modeling within a moving window. Atmospheric Environment, 24A:1759–1769. Haas, T. C. (1990b). Lognormal and moving window methods of estimating acid deposition. Journal of the American Statistical Association, 85:950–963. Hornik, K. (1993). Some new results on neural network approximation. Neural Networks, 6(8):1069–1072. Kidner, D., Dorey, M., and Smith, D. (1999). What’s the point? Interpolation and extrapolation with a regular grid DEM. In Fourth International Conference on GeoComputation, Fredericksburg, VA, USA. Kitanidis, P. K. (1997). Introdcution to Geostatistics: Applications in Hydrogeology. Cambridge University Press. Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI), volume 2 (12), page 1137:1143. Krotkov, E. and Hoffman, R. (1994). Terrain mapping for a walking planatary rover. IEEE Transactions on Robotics and Automation, 10(6):728–739. Kweon, I. S. and Kanade, T. (1992). High-Resolution Terrain Map from Multiple Sensor Data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):278–292. 43 Lacroix, S., Mallet, A., Bonnafous, D., Bauzil, G., Fleury, S., Herrb, M., and Chatila, R. (2002). Autonomous rover navigation on unknown terrains: Functions and Integration. International Journal of Robotics Research (IJRR), 21(10-11):917–942. Lang, T., Plagemann, C., and Burgard, W. (2007). Adaptive Non-Stationary Kernel Regression for Terrain Modeling. In Robotics, Science and Systems. Leal, J., Scheding, S., and Dissanayake, G. (2001). 3D Mapping: A Stochastic Approach. In Australian Conference on Robotics and Automation. Mackay, D. J. C. (2002). Information Theory, Inference & Learning Algorithms. Cambridge University Press. Matheron, G. (1963). Principles of Geostatistics. Economic Geology, 58:1246–1266. Moore, I. D., Grayson, R. B., and Ladson, A. R. (1991). Digital terrain modelling: A review of hydrological, geomorphological, and biological applications. Hydrological Processes, 5-1:3–30. Neal, R. M. (1996). Bayesian Learning for Neural Networks. Lecture Notes in Statistics 118. Springer, New York. Nocedal, J. and Wright, S. (2006). Numerical Optimization. Springer. Paciorek, C. J. and Schervish, M. J. (2004). Nonstationary Covariance Functions for Gaussian Process Regression. In Thrun, S., Saul, L., and Schölkopf, B., editors, Advances in Neural Information Processing Systems (NIPS) 16. MIT Press, Cambridge, MA. Plagemann, C., Kersting, K., and Burgard, W. (2008a). Nonstationary Gaussian Process Regression using Point Estimates of Local Smoothness. In European Conference on Machine Learning (ECML). Plagemann, C., Mischke, S., Prentice, S., Kersting, K., Roy, N., and Burgard, W. (2008b). Learning Predictive Terrain Models for Legged Robot Locomotion. In International Conference on Intelligent Robots and Systems (IROS). To appear. Preparata, F. P. and Shamos, M. I. (1993). Computational Geometry: An Introduction. Springer. Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press. Rekleitis, I., Bedwani, J., Gingras, D., and Dupuis, E. (2008). Experimental Results for Over-the-Horizon Planetary exploration using a LIDAR sensor. In Eleventh International Symposium on Experimental Robotics. 44 Sandwell, D. T. (1987). Biharmonic spline interpolation of geos-3 and seasat altimeter data. Geophysical Research Letters, 14,2:139–142. Shen, Y., Ng, A., and Seeger, M. (2006). Fast gaussian process regression using kd-trees. In Weiss, Y., Schölkopf, B., and Platt, J., editors, Advances in Neural Information Processing Systems (NIPS) 18, pages 1225–1232. MIT Press, Cambridge, MA. Triebel, R., Pfaff, P., and Burgard, W. (2006). Multi-Level Surface Maps for Outdoor Terrain Mapping and Loop Closing. In International Conference on Intelligent Robots and Systems (IROS), Beijing, China. Wackernagel, H. (2003). Multivariate geostatistics: an introduction with applications. Springer. Watson, D. E. (1992). Contouring: A Guide to the Analysis and Display of Spatial Data. Pergamon (Elsevier Science, Inc.). Williams, C. K. I. (1998a). Computation with infinite neural networks. Neural Computation, 10(5):1203– 1216. Williams, C. K. I. (1998b). Prediction with gaussian processes: From linear regression to linear prediction and beyond. In Jordan, M. I., editor, Learning in Graphical Models, pages 599–622. Springer. Ye, C. and Borenstein, J. (2003). A new terrain mapping method for mobile robots obstacle negotiation. In UGV Technology Conference at the SPIE AeroSense Symposium, volume 4, Orlando, FL, USA. 45