Data-driven Kriging models based on FANOVA decomposition

O. Roustant, Ecole des Mines de St-Etienne, www.emse.fr/~roustant
joint work with T. Muehlenstädt (1), L. Carraro (2) and S. Kuhnt (1)
(1) University of Dortmund - (2) Telecom St-Etienne
15th February 2011

Cliques of FANOVA graph and block additive decomposition
f(x) = cos(x1+x2+x3) + sin(x4+x5+x6) + tan(x3+x4)
f(x) = f1,2,3(x1,x2,x3) + f4,5,6(x4,x5,x6) + f3,4(x3,x4)
Cliques: {1,2,3}, {4,5,6}, {3,4}
Z(x) = Z1,2,3(x1,x2,x3) + Z4,5,6(x4,x5,x6) + Z3,4(x3,x4)
k(h) = k1,2,3(h1,h2,h3) + k4,5,6(h4,h5,h6) + k3,4(h3,h4)

This talk presents - with many pictures - the main ideas of the corresponding paper: we refer to it for details.

Introduction

Computer experiments
• A keyword associated with the analysis of time-consuming computer codes
(Diagram: inputs x1, x2, …, xd mapped by the code f to an output y)

Metamodeling and Kriging
• Metamodeling: construct a cheap-to-evaluate model of the simulator (which itself models the reality)
• Kriging: basically an interpolation method based on Gaussian processes

Kriging model (definition)
Y(x) = b0 + b1 g1(x) + … + bk gk(x) + Z(x)
linear trend (deterministic) + centered stationary Gaussian process (stochastic)
(Figure: some conditional simulations)

Kriging model (prediction)
(Figure: conditional mean and 95% confidence interval, with some conditional simulations)

Kriging model (kernel)
• The Kriging model is a kernel-based method:
K(x,x') = cov(Z(x), Z(x')) - FLEXIBLE (see below)
• When Z is stationary, K(x,x') depends only on h = x - x'; we denote k(h) = K(x,x')

Kriging model (kernel)
• "Making new kernels from old" (Rasmussen and Williams, 2006):
K1 + K2
cK, with c > 0
K1 K2
…

Kriging model (common choice)
• Tensor-product structure:
k(h) = k1(h1) k2(h2) … kd(hd)
with hi = xi - xi', and ki Gaussian, Matérn 5/2, …

The main idea on an example
• Ishigami function, defined on D = [-π,π]³, with A = 7, B = 0.1:
f(x) = sin(x1) + A sin²(x2) + B x3⁴ sin(x1)
• This is a block additive decomposition:
f(x) = f2(x2) + f1,3(x1,x3)
Z(x) = Z2(x2) + Z1,3(x1,x3)
k(h) = k2(h2) + k1,3(h1,h3)

The main idea on an example
• Comparison of the two Kriging models
– Training set: 100 points from a maximin Latin hypercube
– Test set: 1000 additional points drawn from a uniform distribution

The schema to be generalized
f → k = k2 + k1,3

Outline
Introduction [How to choose a Kriging model for the Ishigami function]
1. From FANOVA graphs to block additive kernels [Generalizes the introduction]
2. Estimation methodologies [With a new sensitivity index]
3. Applications
4. Some comments

From FANOVA graphs to block additive kernels

FANOVA decomposition (Efron and Stein, 1981)
• Assume that X1, …, Xd are independent random variables. Let f be a function defined on D1 × … × Dd and dν = dν1 … dνd an integration measure. Then:
f(X) = μ0 + Σ μi(Xi) + Σ μi,j(Xi,Xj) + Σ μi,j,k(Xi,Xj,Xk) + …
where all terms are centered and orthogonal to each other. They are given by:
μ0 := E(f(X))
μi(Xi) := E(f(X)|Xi) - μ0
μi,j(Xi,Xj) := E(f(X)|Xi,Xj) - μi(Xi) - μj(Xj) - μ0
and so on…
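As a numerical companion to this definition (and to the Ishigami example on the next slide), here is a minimal Python/numpy sketch, not part of the original deck: it recovers the FANOVA terms of the Ishigami function by midpoint quadrature for the uniform measure on [-π,π]³, and checks the constants and vanishing terms stated below.

```python
import numpy as np

A, B = 7.0, 0.1
n = 101
g = -np.pi + 2*np.pi*(np.arange(n) + 0.5)/n           # midpoint grid on [-pi, pi]
X1, X2, X3 = np.meshgrid(g, g, g, indexing="ij")
F = np.sin(X1) + A*np.sin(X2)**2 + B*X3**4*np.sin(X1)  # Ishigami on the grid

mu0 = F.mean()                                         # mu_0 = E(f(X))
m1 = F.mean(axis=(1, 2)) - mu0                         # mu_1(x1)
m2 = F.mean(axis=(0, 2)) - mu0                         # mu_2(x2)
m3 = F.mean(axis=(0, 1)) - mu0                         # mu_3(x3)
m12 = F.mean(axis=2) - m1[:, None] - m2[None, :] - mu0  # mu_{1,2}(x1,x2)
m13 = F.mean(axis=1) - m1[:, None] - m3[None, :] - mu0  # mu_{1,3}(x1,x3)
m23 = F.mean(axis=0) - m2[:, None] - m3[None, :] - mu0  # mu_{2,3}(x2,x3)

print("mu0 =", mu0, "vs A/2 =", A/2)                   # so a = 1/2 below
print("var(mu_3)  =", (m3**2).mean())                  # vanishes by averaging
print("var(mu_12) =", (m12**2).mean(), " var(mu_23) =", (m23**2).mean())  # always 0
print("var(mu_13) =", (m13**2).mean())                 # nonzero interaction {1,3}
print("var(mu_1)  =", (m1**2).mean(),
      "vs (1 + B*pi^4/5)^2 / 2 =", (1 + B*np.pi**4/5)**2/2)  # so b = pi^4/5
```

The grid size n and the midpoint rule are arbitrary choices here; any quadrature accurate for smooth integrands gives the same picture.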
FANOVA decomposition
• Example: Ishigami function, with uniform measure on D = [-π,π]³. With a = 1/2 and b = π⁴/5, we have:
f(x) = sin(x1) + A sin²(x2) + B x3⁴ sin(x1)
     = aA + sin(x1)(1 + bB) + A(sin²(x2) - a) + B(x3⁴ - b) sin(x1)
where aA = μ0, sin(x1)(1 + bB) = μ1(x1) (main effect), A(sin²(x2) - a) = μ2(x2) (main effect), and B(x3⁴ - b) sin(x1) = μ1,3(x1,x3) (2nd order interaction).

FANOVA decomposition
• Example (continued)
• Some terms can vanish due to averaging, such as μ3, or μ1 if B = -1/b. But this depends on the integration measure, and it only happens when there exist terms of higher order.
• On the other hand, and for the same reason, we always have μ1,2 = μ2,3 = 0 and, under mild conditions, μ1,3 ≠ 0.

FANOVA decomposition
• The name "FANOVA" comes from the relation on variances implied by orthogonality:
var(f(X)) = Σ var(μi(Xi)) + Σ var(μi,j(Xi,Xj)) + …
which measures the importance of each term.
• var(μJ(XJ)) / var(f(X)) is often called a Sobol index.

FANOVA graph
• Vertices: the variables
• Edges: drawn if there is at least one interaction (at any order) involving the two variables
• Widths: proportional to the variances
• Here (example above): μ1,2 = μ2,3 = 0
• The graph does not depend on the integration measure (under mild conditions)

FANOVA graph and cliques
• A complete subgraph: all edges exist
• A clique: a maximal complete subgraph
• Cliques here: {1,3} and {2}

Cliques of FANOVA graph and block additive decomposition
f(x) = cos(x1+x2+x3) + sin(x4+x5+x6) + tan(x3+x4)
f(x) = f1,2,3(x1,x2,x3) + f4,5,6(x4,x5,x6) + f3,4(x3,x4)
Cliques: {1,2,3}, {4,5,6}, {3,4}

Why cliques?
f(x) = cos(x1+x2+x3) + sin(x4+x5+x6) + tan(x3+x4)
Taking the whole vertex set {1,2,3,4,5,6} gives f(x) = f1,…,6(x1,…,x6) !!!
Incomplete subgraphs → rough model forms

Why cliques?
f(x) = cos(x1+x2+x3) + sin(x4+x5+x6) + tan(x3+x4)
Taking {1,2}, {2,3}, {1,3}, {3,4}, {4,5,6} gives
f(x) = f1,2(x1,x2) + f2,3(x2,x3) + f1,3(x1,x3) + f4,5,6(x4,x5,x6) + f3,4(x3,x4)
Non-maximality → wrong model forms

Cliques of FANOVA graph and Kriging models
Cliques: {1,2,3}, {4,5,6}, {3,4}
f(x) = f1,2,3(x1,x2,x3) + f4,5,6(x4,x5,x6) + f3,4(x3,x4)
Z(x) = Z1,2,3(x1,x2,x3) + Z4,5,6(x4,x5,x6) + Z3,4(x3,x4)
k(h) = k1,2,3(h1,h2,h3) + k4,5,6(h4,h5,h6) + k3,4(h3,h4)

Estimation methodologies

Graph estimation
• Challenge: estimate all interactions (at any order) involving two given variables
• Two issues:
1. The computer code is time-consuming
2. Huge number of combinations for the usual Sobol indices

Graph estimation
• Solutions:
1. Replace the computer code by a metamodel, for instance a Kriging model with a standard kernel
2. Fix x3, …, xd, and consider the 2nd order interaction of the 2-dimensional function
(x1, x2) ↦ f(x) = f0 + f1(x1) + f2(x2) + f1,2(x1, x2; x3,…,xd)
where each term also depends on the fixed values x3, …, xd.
Denote by D12(x3,…,xd) the corresponding unnormalized Sobol index, and define:
D12 = E(D12(X3,…,Xd))
Then D12 > 0 if and only if (1,2) is an edge of the FANOVA graph.

Graph estimation
• Comments:
– The new sensitivity index is computed by averaging 2nd order Sobol indices, and is thus numerically tractable
– In practice "D12 > 0" is replaced by "D12 > δ"; different thresholds give different FANOVA graphs
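To make this index concrete, here is a small self-contained Python sketch, not the authors' implementation, applied to the 6D analytical function that appears later in the applications. For each pair (i,j), the 2nd order interaction of the restricted 2-dimensional function is computed by grid quadrature and averaged over random values of the remaining coordinates. In the actual methodology, f would be the initial Kriging metamodel of the expensive code; the threshold δ, the grid size and the number of outer draws are arbitrary choices here.

```python
import itertools
import numpy as np

def f6(X):
    # 6D analytical test case used in the applications (X indexed on its last axis)
    return (np.cos(-0.8 - 1.1*X[..., 0] + 1.1*X[..., 1] + X[..., 2])
            + np.sin(-0.5 + 0.9*X[..., 3] + X[..., 4] - 1.1*X[..., 5])
            + (0.5 + 0.35*X[..., 2] - 0.6*X[..., 3])**2)

def D_index(f, i, j, d, rng, n_grid=32, n_outer=50, lo=-1.0, hi=1.0):
    # Average over x_{-{i,j}} of the unnormalized 2nd order Sobol index of the
    # 2-dimensional restriction (x_i, x_j) -> f(x), for the uniform measure.
    g = lo + (hi - lo)*(np.arange(n_grid) + 0.5)/n_grid   # midpoint grid
    XI, XJ = np.meshgrid(g, g, indexing="ij")
    vals = []
    for _ in range(n_outer):
        X = np.empty((n_grid, n_grid, d))
        X[...] = rng.uniform(lo, hi, d)     # fix the other coordinates
        X[..., i], X[..., j] = XI, XJ
        G = f(X)
        g0 = G.mean()
        gi = G.mean(axis=1) - g0            # main effect of x_i
        gj = G.mean(axis=0) - g0            # main effect of x_j
        inter = G - gi[:, None] - gj[None, :] - g0
        vals.append((inter**2).mean())      # D_ij(x_rest) for this draw
    return float(np.mean(vals))

rng = np.random.default_rng(1)
d, delta = 6, 1e-3                          # delta: arbitrary threshold
edges = [(i + 1, j + 1) for i, j in itertools.combinations(range(d), 2)
         if D_index(f6, i, j, d, rng) > delta]
print("estimated FANOVA graph edges:", edges)
# with these settings, the edges should recover the cliques {1,2,3}, {4,5,6}, {3,4}
```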
Kriging model estimation
• Assume that there are L cliques of sizes d1, …, dL. The total number of parameters to be estimated is:
ntrend + (d1 + 1) + … + (dL + 1)
(trend, "range" and variance parameters)
• MLE is used; 3 numerical procedures were tested

Kriging model estimation
• Isotropic kernels are useful in high dimensions
• Example: suppose that C1 = {1,2,3}, C2 = {4,5,6}, C3 = {3,4}, and that x7, …, x16 have a smaller influence
– 1st solution: C4 = {x7}, …, C13 = {x16}
N = ntrend + 4 + 4 + 3 + 10×2 = ntrend + 31
– 2nd solution: C4 = {x7, …, x16}, with an isotropic kernel
N = ntrend + 4 + 4 + 3 + 2 = ntrend + 13

Applications

A 6D analytical case
• f(x) = cos(-0.8 - 1.1x1 + 1.1x2 + x3) + sin(-0.5 + 0.9x4 + x5 - 1.1x6) + (0.5 + 0.35x3 - 0.6x4)²
• Domain: [-1,1]⁶
• Integration measure: uniform
• Training set: 100 points from a maximin LHD
• Test set: 1000 points drawn from a uniform distribution

A 6D analytical case
(Figures: estimated FANOVA graph, and a usual sensitivity analysis from the R package sensitivity)

A 6D analytical case
(Figure)

A 16D analytical case
• Consider the same function, but assume that it lives in a 16D space (with 10 additional inactive variables)
• Including all the inactive variables in one clique improves the prediction

A 6D case study
• Piston slap data set (Fang et al., 2006)
– Unwanted engine noise, simulated with a finite element method
• Training set: 100 points
• Test set: 12 points
• Leave-one-out is also considered

A 6D case study
(Figure)

A 6D case study
• Leave-one-out RMSE: 0.0864 (standard Kriging), 0.0371 (modified Kriging)

Some comments

Some comments
• Main strengths
– Adapting the kernel to the data in a flexible manner
– A substantial improvement may be expected in prediction, depending on the function complexity
• Some drawbacks
– Dependence on the initial metamodel
– Sometimes a large number of parameters to be estimated, which may decrease the prediction power

THANKS A LOT FOR ATTENDING!
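Appendix (not in the original deck): a toy end-to-end Python sketch of the introductory Ishigami comparison. Everything here is a simplified stand-in for the talk's methodology: within each clique a tensor product of 1-d Matérn 5/2 kernels is used, the range parameters are fixed by hand rather than estimated by MLE, the trend is a constant fixed at the sample mean, and the design is plain uniform sampling rather than a maximin Latin hypercube, so the printed RMSEs are only indicative.

```python
import numpy as np

def ishigami(X, A=7.0, B=0.1):
    return np.sin(X[:, 0]) + A*np.sin(X[:, 1])**2 + B*X[:, 2]**4*np.sin(X[:, 0])

def matern52_1d(h, theta):
    # 1-d Matern 5/2 correlation of a lag h with range theta
    r = np.abs(h)/theta
    return (1 + np.sqrt(5)*r + 5*r**2/3)*np.exp(-np.sqrt(5)*r)

def cross_diff(X1, X2, i):
    return X1[:, i][:, None] - X2[:, i][None, :]

def k_tensor(X1, X2, thetas):
    # standard choice: k(h) = k1(h1) k2(h2) k3(h3)
    K = np.ones((len(X1), len(X2)))
    for i, th in enumerate(thetas):
        K *= matern52_1d(cross_diff(X1, X2, i), th)
    return K

def k_block(X1, X2, th2, th1, th3):
    # block additive kernel for the cliques {2} and {1,3}:
    # k(h) = k2(h2) + k1,3(h1,h3), with a tensor product inside the clique
    return (matern52_1d(cross_diff(X1, X2, 1), th2)
            + matern52_1d(cross_diff(X1, X2, 0), th1)
              * matern52_1d(cross_diff(X1, X2, 2), th3))

def krige(kern, Xtr, ytr, Xte, nugget=1e-8):
    # simple Kriging predictor with a constant trend fixed at the sample mean
    K = kern(Xtr, Xtr) + nugget*np.eye(len(Xtr))
    alpha = np.linalg.solve(K, ytr - ytr.mean())
    return ytr.mean() + kern(Xte, Xtr) @ alpha

rng = np.random.default_rng(0)
Xtr = rng.uniform(-np.pi, np.pi, (100, 3))    # 100 training points
Xte = rng.uniform(-np.pi, np.pi, (1000, 3))   # 1000 test points
ytr, yte = ishigami(Xtr), ishigami(Xte)

pred_std = krige(lambda U, V: k_tensor(U, V, [2.0, 2.0, 2.0]), Xtr, ytr, Xte)
pred_add = krige(lambda U, V: k_block(U, V, 2.0, 2.0, 2.0), Xtr, ytr, Xte)
for name, p in [("tensor-product kernel", pred_std), ("block additive kernel", pred_add)]:
    print(name, "RMSE:", round(float(np.sqrt(np.mean((p - yte)**2))), 3))
```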