Model Averaging Techniques for Quantifying Conceptual Model Uncertainty

by Abhishek Singh (corresponding author: INTERA Inc., Austin, TX; (512) 425-2048; fax (512) 425-2099; asingh@intera.com), Srikanta Mishra (INTERA Inc., Austin, TX; smishra@intera.com), and Greg Ruskauff (INTERA Inc., Las Vegas, NV; greg.ruskauff@nv.doe.gov)

Abstract

In recent years a growing understanding has emerged regarding the need to expand the modeling paradigm to include conceptual model uncertainty for groundwater models. Conceptual model uncertainty is typically addressed by formulating alternative model conceptualizations and assessing their relative likelihoods using statistical model averaging approaches. Several model averaging techniques and likelihood measures have been proposed in the recent literature for this purpose, falling into two broad categories: Monte Carlo-based techniques such as Generalized Likelihood Uncertainty Estimation or GLUE (Beven and Binley 1992), and criterion-based techniques that use metrics such as the Bayesian and Kashyap Information Criteria (e.g., the Maximum Likelihood Bayesian Model Averaging or MLBMA approach proposed by Neuman 2003) or the Akaike Information Criterion (AIC-based model averaging, or AICMA; Poeter and Anderson 2005). These different techniques can often lead to significantly different relative model weights and ranks because of differences in the underlying statistical assumptions about the nature of model uncertainty. This paper provides a comparative assessment of four model averaging techniques (GLUE, MLBMA with KIC, MLBMA with BIC, and AICMA) for the purpose of quantifying the impacts of model uncertainty on groundwater model predictions. Pros and cons of each model averaging technique are examined from a practitioner's perspective using two groundwater modeling case studies, and recommendations are provided regarding the use of these techniques in groundwater modeling practice.

Introduction

Groundwater modeling and decision making are beset with uncertainty caused by incomplete knowledge of the underlying system and/or by natural variability in system processes and field conditions. The different sources of uncertainty in the modeling process can be categorized as follows:

• Conceptual uncertainty: The first step in modeling is to build a conceptual model of the underlying system. Decisions about the conceptual model are often made with imperfect or incomplete knowledge of the system, which leads to uncertainties in the conceptualization of the model itself. This is illustrated in Figure 1 by the multiple polynomial curves fit to a dataset (shown by black dots); each curve represents an alternative conceptualization of the relationship between the independent and state variables.

• Parametric uncertainty: A model can have numerous parameters that need to be specified, often in the absence of sufficient data, leading to parametric uncertainty. As shown in Figure 1, parametric uncertainty can be of two kinds:

◦ Unconditional uncertainty: Parameters that are directly specified (based on expert judgment or literature values) are uncertain because of lack of knowledge or insufficient data. Such uncertainty is often referred to as unconditional uncertainty, since it is not conditioned on field values, and is typically
characterized by a probability distribution based on subjective judgment (shown by the second part of Figure 1).

◦ Conditional uncertainty: Calibrated or conditioned parameters are those that lead to an acceptable degree of agreement between model behavior and field observations. Conditioning on past observations generally leads to improved predictive ability, unless the calibration space is substantially different from the predictive space (e.g., groundwater flow vs. reactive contaminant transport); more will be said on this issue later. Uncertainty in calibrated parameters can be due to (1) errors in the field data against which the parameters are being calibrated; (2) insensitivity of the model predictions to the parameters; and (3) correlations within parameter sets with respect to model predictions. These types of uncertainty are demonstrated by the bottom three plots in Figure 1: the first showing the mismatch between the true parameter and the calibrated parameter, the second showing the lack of sensitivity to certain parameters, and the third showing correlations between two parameters that can cause the response surface of the calibration objective to have multiple optima. Both insensitivity and correlation lead to certain parameters remaining uncalibrated. These two problems in model calibration lead to what is referred to as equifinality or nonuniqueness (Beven and Freer 2001). Insensitive parameters cannot, in essence, be calibrated because the model behavior is not constrained by them. Correlations among parameters can mean that while it may be possible to uniquely identify a group of parameters together, it is difficult to separate each parameter and give it a unique value. Errors in field data can, of course, lead to erroneous calibration of model parameters, which in turn adds to the uncertainty in these parameter values. Such uncertainties are also likely exacerbated by error in the model itself, although a discussion of the characterization of model structural error is beyond the scope of this paper.

Figure 1. Schematic for different types of uncertainty in modeling.

• Stochastic uncertainty: Even with a well-conceptualized and well-calibrated model, there exists natural variability in field conditions that can lead to uncertainty in predictions. To make robust decisions, this variability needs to be incorporated into the decision-making process, typically by considering stochastic realizations of the various model inputs.

Of the above-mentioned sources of uncertainty, the focus of uncertainty analysis in groundwater modeling has traditionally been parametric uncertainty. This paper, however, concerns itself with the more fundamental issue of conceptual model uncertainty. Conceptual model uncertainty in groundwater models typically arises due to (1) inadequate representation of physical processes; (2) incomplete understanding of the subsurface geologic framework; and (3) inability of the model to properly explain all of the available observations of state variables. The limited literature on the assessment of alternative models suggests that it is possible to develop models consistent with geologic data that yield very different hydrologic predictions.
This is particularly true for groundwater models, where the data used for calibration (typically hydraulic heads) may not be of the same scale or sensitivity as the predictions (often contaminant transport). For these reasons, a growing understanding has emerged in recent years regarding the need to expand the modeling paradigm to include more than one plausible conceptual model of the system. The need to move away from a single unique model toward a set of multiple models for prediction was identified early on by Delhomme (1979), Neuman (1982), Hoeksema and Kitanidis (1989), Wagner and Gorelick (1989), Beven (1993), Neuman and Wierenga (2003), and Poeter and Anderson (2005), among others. Beven (1993, 2000) laid out the argument that a unique model with an "optimal" set of parameters is inherently unknowable, arguing instead for a set of acceptable and realistic model representations that are consistent with the data. Work such as National Research Council (2001), Neuman and Wierenga (2003), Carrera and Neuman (1986), and Samper and Neuman (1989) has also shown that considering only one conceptual model for a particular site can lead to poorly informed decisions.

Given these multiple models, it becomes essential to assess the likelihood or probability of each model. Without such likelihood measures, all models would have to be assumed equally likely, and the resulting uncertainty could be much higher than reasonable. Once the likelihoods have been assessed, model predictions are based on a weighted average (proportional to the model likelihoods) over the ensemble of models. The task of model averaging is thus closely linked to the task of assessing the likelihood of alternative conceptual models.

To this end, several approaches have been proposed for dealing with model uncertainty (and averaging). There are two broad categories of methods: those that use Monte Carlo sampling across multiple model/parameter combinations to estimate the posterior probabilities, and those that use metrics such as the Akaike, Bayesian, or Kashyap Information Criteria for the same purpose (all of the criterion-based approaches calculate posterior probabilities in a similar way, differing only in the chosen criterion). Generalized Likelihood Uncertainty Estimation or GLUE (e.g., Beven and Binley 1992) is an example of the first type of approach. Examples of criterion-based model averaging include Maximum Likelihood Bayesian Model Averaging or MLBMA (e.g., Neuman 2003), which uses the Bayesian and Kashyap Information Criteria (BIC and KIC, respectively), and Akaike Information Criterion (AIC)-based model averaging (Poeter and Anderson 2005). While there are similarities among all these approaches, the major differences lie in the way they ascribe likelihood (or probability) to the different models being considered. Unfortunately, more often than not, different model averaging techniques lead to remarkably different model likelihoods (and hence ensemble predictions). The proponents of each technique have pointed to its theoretical and practical advantages, while contrary views have been expressed by other researchers. As such there is no consensus within the research community, and the modeler's dilemma remains: which technique (if any) to utilize for model averaging.

A note of caution is due at this stage.
In many instances, the calibration and testing space is different from the predictive space; that is, there are insufficient data or evidence to validate many of the assumptions and parameters used during the modeling process, especially with respect to the predictive behavior of the model. The likelihoods mentioned earlier are, of course, based on the same data that were used to calibrate and test the model. Thus, these likelihoods are at best "surrogates" for the true likelihoods of a given set of models. As the calibration and predictive spaces become more similar, these surrogates become more consistent with the true likelihoods of the models. The practitioner is thus encouraged to (1) consider different types of available data sources and formulations when assessing the likelihoods of the models and (2) approach these likelihoods with the requisite caution.

The objective of this paper is to provide some clarity to the practitioner through a comparative assessment of different model averaging techniques for the purpose of quantifying the impacts of model uncertainty on groundwater model predictions. We begin with a brief description of the theoretical background for each model averaging technique. Next, we present a case study applying these techniques to estimate the impacts of uncertainty on predictions from a groundwater flow and transport model of the Nevada Test Site. (A second case study, examining the impact of uncertainty in multiple recharge models for the Death Valley regional flow model, is provided in the Supporting Information.) Finally, some recommendations are provided regarding the use of model averaging in groundwater modeling practice.

Techniques for Model Averaging

Generalized Likelihood Uncertainty Estimation (GLUE)

GLUE was originally proposed for dealing with model nonuniqueness in catchment modeling. It is based on the concept of "equifinality," that is, the possibility that the same final state may be obtained from a variety of initial states (Beven and Binley 1992). In other words, a single set of observed data may be (nonuniquely) matched by multiple parameter sets that produce similar model predictions. In the GLUE framework, the feasible parameter space is first sampled to produce many equally likely parameter combinations (realizations), each of which can be thought of as an alternative conceptual model. Discrete alternatives can also be considered in lieu of alternative parameter sets. The output corresponding to each realization (or model alternative) is compared against actual observations. Only those realizations (or models) that satisfy some acceptable level of performance (e.g., a maximum sum-of-squared weighted residuals), also known as the behavioral threshold, are retained for further analysis; the nonbehavioral realizations (models) are rejected. A "likelihood" for each model is then computed as a function of the misfit between observations and model predictions, and the weights (or probabilities) for each model are estimated by normalizing the likelihoods.

One of the central features of GLUE is its flexibility with respect to the choice of likelihood measure. As the name "generalized likelihood" implies, any reasonable likelihood measure can be used, as long as it adequately represents the experts' understanding of the relative importance of the different data sources used to assess model accuracy.
In the literature, many different likelihood measures based on goodness-of-fit metrics have been proposed. One likelihood measure that has seen widespread usage in the GLUE literature is given by the inverse weighted variance:

L_j = \prod_l \left( \frac{\sigma_l^2}{\sigma_{e,j|l}^2} \right)^N    (1)

where L_j is the likelihood for model j, l indexes the state variables (data types), σ²_e,j|l is the variance of the errors for model j and data type l, σ²_l is the variance of the observations of data type l, and N is a shape factor such that values of N ≫ 1 tend to give higher weights (likelihoods) to models with better agreement with the data, and values of N ≪ 1 tend to make all models equally likely. The variance of the errors for model j and data type l is given by:

\sigma_{e,j|l}^2 = \frac{SSR_{j|l}}{n_l}    (2)

where SSR_j|l is the sum-of-squared residuals between the jth model's predictions and the observations of data type l, and n_l is the number of observations for data type l. Other forms of the likelihood function include the Nash-Sutcliffe efficiency index (Nash and Sutcliffe 1970):

L_j = \prod_l \left( 1 - \frac{\sigma_{e,j|l}^2}{\sigma_l^2} \right)^N    (3)

and the exponential likelihood function (Beven 2000):

L_j = \prod_l \exp\left( -N \frac{\sigma_{e,j|l}^2}{\sigma_l^2} \right)    (4)

Normalizing the likelihoods, so that their sum is equal to one, gives the GLUE weight for model j:

w_j(GLUE) = \frac{Pr_j L_j}{\sum_{j=1}^{n} Pr_j L_j}    (5)

where L_j is one of the likelihood functions described above, Pr_j is the prior weight given to each model (typically based on the modelers' expert judgment), and n is the total number of models being considered.

The GLUE approach can thus be considered a form of conditional uncertainty analysis, in which the unconditional predictions (based on equally likely parameter combinations) are conditioned on observations. The posterior probabilities for each realization can be used to weight the sampled parameter values, leading to a posterior distribution for each uncertain input that is likewise conditioned on observations.

GLUE is a generalizable framework and is applicable to almost all types of problems. However, certain aspects of the methodology have generated controversy in recent years (e.g., Mantovan and Todini 2006; Vogel et al. 2007). These include (1) a lack of statistical basis for the likelihood and threshold measures used for model selection and weighting; (2) the lack of dependence of most likelihoods on the number of data points (since the formulations in Equations 1 to 4 depend on the average residual, σ²_e,j|l in Equation 2 with n_l in the denominator, rather than the total residual, two models with the same average residual but different numbers of data points will be deemed equivalent by GLUE); (3) the computational burden imposed by the need for extensive Monte Carlo simulations; and (4) the fact that GLUE does not require the model structure and parameters to be optimized (calibrated), which could lead to overestimation of predictive uncertainty. Moreover, there is typically no acknowledgment of differences in model complexity in the likelihood functions used. This is in contrast to methods that use criterion-based likelihoods (discussed in later sections), where model complexity is an important component of the weight ascribed to a model.
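To make the GLUE weighting concrete, the sketch below computes the inverse-variance likelihood of Equation 1 (for a single data type) and the normalized weights of Equation 5. This is a minimal illustration, not software from the paper; the function name and the example numbers are our own assumptions.

```python
import numpy as np

def glue_weights(sse, n_obs, obs_var, N=1.0, priors=None, threshold=None):
    """GLUE weights from the inverse-variance likelihood (Equations 1, 2, and 5).

    sse       : (n_models,) sum-of-squared residuals per model (single data type)
    n_obs     : number of observations
    obs_var   : variance of the observations
    N         : GLUE shape factor (N >> 1 sharpens the weights, N << 1 flattens them)
    priors    : optional prior model weights Pr_j (default: uniform)
    threshold : optional behavioral threshold on sse; models above it get zero weight
    """
    sse = np.asarray(sse, dtype=float)
    err_var = sse / n_obs                       # Equation 2: average residual variance
    lik = (obs_var / err_var) ** N              # Equation 1 (one data type)
    if threshold is not None:
        lik[sse > threshold] = 0.0              # reject nonbehavioral models
    pr = np.ones_like(lik) if priors is None else np.asarray(priors, dtype=float)
    return pr * lik / np.sum(pr * lik)          # Equation 5

# Illustrative example: three models, 38 observations, unit observation variance.
w = glue_weights(sse=[11.0, 12.5, 55.0], n_obs=38, obs_var=1.0, N=1.0)
print(w.round(3))  # modest contrast between models for N = 1
```

Raising N in this sketch sharpens the contrast between models, which anticipates the shape-factor sensitivity analysis discussed later in the paper.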
Beven (2006) has answered some of these criticisms by contending that (1) formal Bayesian model averaging (BMA) approaches are a special case of GLUE, applicable only under certain strong assumptions, and (2) optimization or model selection can be used within the GLUE framework to reduce uncertainty. In recent years, the link between GLUE and optimization has become stronger with the work of Mugunthan and Shoemaker (2006), who showed that optimization can in fact be used to generate alternative models for GLUE, enhancing the efficiency of the GLUE framework by eliminating the need for Monte Carlo trials to generate model alternatives. Finally, with regard to the debate between GLUE and Bayesian methods, Beven (2008) argues that ". . . the best approach to estimating model uncertainties is a Bayesian statistical approach, but that will only be the case if all the assumptions associated with the error model can be justified," that "simple assumptions about the error term may be difficult to justify as more than convenient approximations to the real nature of the errors," and, finally, cautions that ". . . making convenient formal Bayesian assumptions may certainly result in over estimating the real information content of the data in conditioning the model space."

Bayesian Model Averaging Techniques

The BMA framework was propounded by Draper (1995), Kass and Raftery (1995), and Hoeting et al. (1999) and is based on a formal Bayesian formulation for the posterior probabilities of different conceptual models. The most commonly used Bayesian model averaging paradigm in hydrology is MLBMA (Neuman 2003). MLBMA is a special case of the BMA approach, in that it approximates the Bayesian posterior probability using the concept of "information criteria" rather than computing the posterior probabilities directly. In the Bayesian framework, the posterior weights (probabilities) for model M_j given the data D are calculated using Bayes' rule as follows:

p(M_j|D) = \frac{p(D|M_j) \, p(M_j)}{\sum_j p(D|M_j) \, p(M_j)}    (6)

where p(M_j) is the prior probability of model M_j (similar to Pr_j used in Equation 5 for GLUE) and p(D|M_j) is the model likelihood, reflecting the level of agreement (or lack thereof) between the predictions of model M_j and the observed data D. This model likelihood is given by:

p(D|M_j) = \int p(D|\theta_j, M_j) \, p(\theta_j|M_j) \, d\theta_j    (7)

Here θ_j is the parameter set associated with model j, p(θ_j|M_j) is the prior probability of the parameters, and p(D|θ_j, M_j) is the likelihood of model j with parameters θ_j, a function of the errors with respect to the field data D. The prior probabilities for the models, p(M_j), are typically obtained through expert elicitation (Ye et al. 2005, 2008b) or set to a noninformative prior (i.e., all models equiprobable). The prior probabilities for the parameters, p(θ_j|M_j), can either be calculated from the data or obtained through an expert elicitation process (if there are not enough data to infer the distribution). The BMA calculation requires the integral in Equation 7 to be evaluated, which is typically done through exhaustive Monte Carlo simulation over the parameter space θ. This can be computationally very demanding, and thus Neuman (2003) proposed a variant of the BMA approach called MLBMA.
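As a minimal sketch of how the integral in Equation 7 can be approximated by Monte Carlo sampling (the computationally demanding step that MLBMA is designed to avoid), the code below draws parameter sets from the prior and averages a Gaussian data likelihood over the draws. The simulator, prior, and error variance here are illustrative assumptions, not the paper's models.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_model_evidence(simulate, prior_sampler, data, err_var, n_draws=10_000):
    """Monte Carlo estimate of p(D|M) = ∫ p(D|θ, M) p(θ|M) dθ (Equation 7).

    simulate      : function θ -> model-predicted data vector
    prior_sampler : function m -> (m, dim) array of parameter draws from p(θ|M)
    data          : observed data vector D
    err_var      : assumed variance of the (Gaussian, uncorrelated) errors
    """
    thetas = prior_sampler(n_draws)
    log_liks = []
    for theta in thetas:
        resid = data - simulate(theta)
        # Gaussian log-likelihood with diagonal covariance err_var * I
        log_liks.append(-0.5 * np.sum(resid**2) / err_var
                        - 0.5 * len(resid) * np.log(2 * np.pi * err_var))
    # log-mean-exp for numerical stability: log( (1/m) Σ exp(log_lik) )
    log_liks = np.array(log_liks)
    return np.logaddexp.reduce(log_liks) - np.log(n_draws)

# Illustrative linear "model" with a single slope parameter.
x = np.linspace(0.0, 1.0, 20)
data = 2.0 * x + rng.normal(0.0, 0.1, size=x.size)
log_pD = mc_model_evidence(lambda th: th[0] * x,
                           lambda m: rng.normal(2.0, 1.0, size=(m, 1)),
                           data, err_var=0.01)
print(log_pD)  # log evidence; compare across models, then normalize via Equation 6
```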
MLBMA approximates this integral by using likelihood measures such as the Kashyap Information Criterion (KIC) (Kashyap 1982) or the Bayesian Information Criterion (BIC) (Schwarz 1978), evaluated for each model calibrated to the maximum likelihood estimate of its parameter set. The starting point for MLBMA is thus a collection of models that have been calibrated to the observed data using maximum likelihood estimation. The model likelihood is then estimated as:

p(D|M_j) \propto \exp\left( -\frac{\Delta_j}{2} \right)    (8)

with:

\Delta_j = BIC_j - BIC_{min}    (9)

or

\Delta_j = KIC_j - KIC_{min}    (10)

where Δ_j is the difference between the BIC or KIC measure for the jth model and the minimum BIC or KIC value among all competing models (BIC_min or KIC_min in Equations 9 and 10). Assuming a multi-Gaussian error distribution with unknown mean and variance for the model likelihood in Equation 7, the BIC and KIC terms can be written as (Ye et al. 2008b):

BIC_j = n \ln(\hat{\sigma}_{e,j}^2) + k_j \ln(n)    (11)

and

KIC_j = (n - k_j) \ln(\hat{\sigma}_{e,j}^2) - 2 \ln p(\hat{\theta}_j) - k_j \ln(2\pi) + \ln|X_j^T \omega X_j|    (12)

where n is the number of observations, k_j is the number of parameters for model j, θ̂_j is the maximum likelihood estimate of the parameters of model j, p(θ̂_j) is the prior probability (assessed either from field data or through expert elicitation) of that parameter estimate, and σ̂²_e,j is the maximum likelihood estimate of the variance of the error residuals (e), estimated from the weighted sum-of-squared residuals for model j evaluated at the maximum likelihood parameter estimate:

\hat{\sigma}_{e,j}^2 = \left. \frac{e_j^T \omega \, e_j}{n} \right|_{\theta_j = \hat{\theta}_j}    (13)

where e_j is the calibration error vector, n is the number of samples, and ω is a weight matrix, which theoretically is given by the covariance between the data points. It is common to assume uncorrelated data, leading to a diagonal matrix with the variances of the data points along the diagonal. In many cases, the unbiased "least-squares" formulation may be used, in which (n − k_j) replaces n in the denominator, k_j being the number of calibrated parameters in model j. Also note that, for the sake of simplicity and without loss of generality, we have assumed only a single data type (unlike the GLUE formulation presented in Equations 1 to 5, which allowed for multiple data types).

The last term in Equation 12, ln|X_j^T ω X_j|, involves the determinant of the Fisher information (FI) matrix, where X_j is the Jacobian (sensitivity) matrix, X_j^T is its transpose, and ω is the weight matrix. The Fisher matrix requires calculation of derivatives of the calibration measures with respect to the model parameters (a nontrivial task for highly parameterized models) and therefore represents the sensitivity of the model output to the parameters. Ye et al. (2004) have shown that the KIC metric gives a better (less biased) measure of the model likelihood than BIC. The metric takes into account the information content of the data, as given by the sensitivity of the model output with respect to the parameters, selecting more complex models (with a greater number of parameters) only when the data support such a choice. Ye et al. (2008b) also showed that, from a theoretical standpoint, BIC asymptotically converges to KIC as the number of calibration data increases relative to the number of parameters (i.e., n ≫ k).
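The following sketch computes BIC and KIC (Equations 11 to 13) for one calibrated model from its weighted residuals and Jacobian. A flat parameter prior is assumed (ln p(θ̂) = 0), and the inputs are placeholders for what a calibration code such as PEST would supply; this is an illustration of the formulas, not the paper's actual workflow.

```python
import numpy as np

def bic_kic(resid, jacobian, weights, log_prior_at_mle=0.0):
    """BIC and KIC for a calibrated model (Equations 11-13).

    resid            : (n,) calibration residual vector e_j at the ML estimate
    jacobian         : (n, k) sensitivity matrix X_j of outputs w.r.t. parameters
    weights          : (n,) diagonal of the weight matrix ω (uncorrelated data assumed)
    log_prior_at_mle : ln p(θ̂_j); 0.0 corresponds to a flat prior
    """
    n, k = jacobian.shape
    W = np.diag(weights)
    sigma2 = resid @ W @ resid / n                       # Equation 13
    bic = n * np.log(sigma2) + k * np.log(n)             # Equation 11
    fisher = jacobian.T @ W @ jacobian                   # X^T ω X
    sign, logdet = np.linalg.slogdet(fisher)             # ln|X^T ω X|, computed stably
    kic = ((n - k) * np.log(sigma2) - 2.0 * log_prior_at_mle
           - k * np.log(2.0 * np.pi) + logdet)           # Equation 12
    return bic, kic

# Illustrative numbers only: 38 observations, 3 effective parameters.
rng = np.random.default_rng(1)
X = rng.normal(size=(38, 3))
e = rng.normal(scale=0.5, size=38)
print(bic_kic(e, X, np.ones(38)))
```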
The MLBMA model weights using either BIC or KIC are then given by:

w_j(MLBMA) = \frac{\exp(-0.5 \Delta_j) \, p(M_j)}{\sum_k \exp(-0.5 \Delta_k) \, p(M_k)}    (14)

where Δ_j is given by Equation 9 or 10 and p(M_j) are the prior probabilities of the models (typically assigned by the expert, expressing his or her knowledge about the suitability of the different models). There are two key aspects of the KIC- and BIC-based model weights: (1) the Δ_j term, which can vary from 0 (for the model with the minimum KIC or BIC metric; see Equations 9 and 10) to many orders of magnitude higher (for the models with higher KIC and BIC metrics), and (2) the exponential weighting in Equation 14, which tends to apportion most of the posterior weight to the relatively few models exhibiting marginally better agreement with the data. The distribution of weights becomes narrower as the number of observations increases, since the value of n linearly affects the nominal values of BIC and KIC (per Equations 11 and 12). From a Bayesian standpoint this makes sense, as with more data there should be less uncertainty among competing models (Poeter and Hill [2007] emphasize this as well). However, Beven (2008) has pointed out that this is only desirable if the error structure assumed by the averaging technique is consistent with the "real" error structure. If this is not the case, then model averaging techniques such as MLBMA may overestimate the information content of the data while conditioning the model.

The additional FI term used in MLBMA (Equation 12) has been a source of much confusion and debate in the literature. Higher FI indicates that the model outputs (calibration measures) have higher Jacobians (sensitivities) with respect to the model parameters, which in turn indicates higher information content in the data points. From Equation 12, it is also apparent that increasing the FI term decreases the model likelihood (low KIC values correspond to higher likelihoods). This may be deemed a nonintuitive result, as for two models with the same accuracy (residuals) and complexity (number of parameters), KIC favors the model with the lower sensitivities. Ye et al. (2008b) explain this by pointing out that more "information content" in the observed data (i.e., higher FI values) should lead to improved model performance; if it does not, then the model has less basis to be selected (lower likelihood). In other words, the Fisher term resets the performance standard for a model: the higher the information content in the data vis-à-vis the model parameters, the better the model needs to perform for it to be given a higher likelihood by MLBMA. Yet another way to look at the Fisher term is as a means of supporting complexity in the model. Ye et al. (2008b) argue that KIC balances parsimony (as expressed by the penalty term for the number of parameters) with the expected information content of the data. Thus, a higher FI content in the calibration data indicates that more complex models can be supported by the data (and can be selected with high likelihoods), whereas low Fisher terms mean that the data do not support model complexity and simpler, less accurate models may be more appropriate.

BMA has been questioned by Domingos (2000), who has argued that model combination by its very nature works by enriching the space of model hypotheses, not by approximating a Bayesian distribution function.
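A minimal sketch of Equation 14, applied to hypothetical criterion values, illustrates how the exponential weighting concentrates posterior weight on the lowest-criterion model; the BIC values below are made up for illustration.

```python
import numpy as np

def mlbma_weights(criterion, priors=None):
    """Posterior model weights from BIC or KIC values (Equations 9/10 and 14)."""
    crit = np.asarray(criterion, dtype=float)
    delta = crit - crit.min()                 # Δ_j, Equation 9 or 10
    pr = np.ones_like(delta) if priors is None else np.asarray(priors, dtype=float)
    w = pr * np.exp(-0.5 * delta)
    return w / w.sum()

# Hypothetical BICs differing by only a few units still give lopsided weights:
print(mlbma_weights([100.0, 102.0, 106.0]).round(3))  # ≈ [0.705, 0.26, 0.035]
```

Because Δ_j scales linearly with n (Equations 11 and 12), doubling the number of observations roughly squares these weight ratios, which is the narrowing effect described above.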
In that study, Domingos (2000) compared BMA with other model averaging techniques and showed that BMA tends to underestimate the predictive uncertainty. However, others such as Minka (2000) have contended that these results are hardly surprising, because by definition techniques like BMA, and especially MLBMA, are built on the intrinsic assumption that there is a unique model of reality (i.e., there is only one mode in the conditional distribution, representing the most likely model). This is borne out in the original MLBMA paper by Neuman (2003), where he lays out the fundamental assumption of the technique: "only one of the (alternative) models is correct even in the event that some yield similar predictions for a given set of data." Thus, strictly speaking, MLBMA is more a model selection technique than a model combination methodology. Note that, unlike model averaging, model selection (or ranking) is based simply on the relative magnitude of the BMA criterion (either BIC or KIC) and thus is not affected by the exponential dependence on n.

It is worth noting that the formulations shown earlier require the models to be well calibrated (normally distributed errors, etc.) and the residual variance (σ̂²_e,j) to be assessed using the calibrated parameters. In fact, the error distribution used is typically unimodal, with the mode approximated by the "calibrated" model. In the case of highly parameterized models, there is bound to be nonuniqueness in the parameter domain (and thus multimodality in the calibration response surface), and the applicability of MLBMA and BMA in such cases is not clear. It is then advisable to reduce the dimensionality of the model parameters (thereby introducing some level of uniqueness in the calibrated parameter set) before applying the methodology. A final point about the formulation shown earlier is that most applications assume uncorrelated data points, leading to a diagonal weighting matrix. In reality, more often than not, the errors are correlated and there is often much redundancy in the data. In essence, this reduces the information content of the data points; it needs to be reflected in the weighting scheme used for the weighted sum-of-squared residuals calculation and may spread the weights across the different models (see Hill and Tiedeman [2007] for a discussion of the diagonal weight matrix assumption).

Variance Window-Based MLBMA

The previous section highlighted the tendency of MLBMA to distribute most of the posterior weight to the few models that exhibit marginally better calibration performance. Tsai and Li (2008) have proposed an approach to address this by using the concept of a "variance window" to modify the MLBMA scheme. The motivation for their work was the realization that BMA tended to assign most of the weight to a few models with marginally better calibration performance (due to the exponential weighting and the Δ_j term used in Equation 14; see the discussion in the preceding section). Tsai and Li (2008) contended that this stringency in the model averaging criteria is a result of the underlying assumption of "Occam's window" (Madigan and Raftery 1994), which only accepts models in a very narrow performance range.
Occam's window is defined by Raftery (1995) as the range within which the performance of two competing models is statistically indistinguishable; that is, if the difference between the calibration metrics of two models (with the same complexity) is less than the Occam's window, then both will be accepted. Raftery (1995) pointed out that for sample sizes between 30 and 50 data points, an Occam's window of 6 units in the BIC metric (ΔBIC in Equation 9) roughly corresponds to a significance level of 5% (in t statistics) in conventional hypothesis testing terms. Over the years there has been a growing realization that this Occam's window for model acceptance may be too restrictive, leading to biased results (see the appended comments to Hoeting et al. [1999]; Tsai and Li 2008). To reduce this overweighting and the resulting bias, Tsai and Li (2008) introduced the concept of a "variance window" as an alternative to the Occam's window for selection within BMA. The variance window is implemented by including a scaling factor α with BIC (and KIC), where α is given by:

\alpha = \frac{s_1}{s_2 \, \sigma_D}    (15)

where σ_D is the standard deviation of the chi-square distribution for the "goodness-of-fit" criterion used in formulating KIC or BIC (see Tsai and Li [2008] for details). The variance of the chi-square distribution is 2n (i.e., σ_D = √(2n)), where n is the number of observations; s_1 is the size of the Occam's window corresponding to the given significance level; and s_2 is the width of the variance window in terms of σ_D. As the width of the variance window becomes larger, α becomes progressively smaller than 1. Note that since the minimum size of the variance window is the Occam's window, the value of α is never larger than 1. When this variance window is incorporated into the model averaging process, the posterior model probabilities (also the model averaging weights) become:

w_j(MLBMA) = \frac{\exp\left(-\frac{1}{2} \alpha \Delta_j\right)}{\sum_k \exp\left(-\frac{1}{2} \alpha \Delta_k\right)}    (16)

where Δ_j is given by Equation 9 or 10. It can be seen that α is a multiplicative factor that, when multiplied with ΔBIC or ΔKIC (as the case may be), reduces the impact of the exponential term on the weighting. For α = 1, the weighting is identical to the BIC- or KIC-based weights, and for α = 0 all models are weighted equally irrespective of their calibration performance. Tsai and Li (2008) also provide recommended values of α corresponding to different significance levels and variance window sizes, which are shown in Table 1. The variance window concept was derived by Tsai and Li (2008) only for Bayesian model averaging. It is not entirely clear whether a similar α factor can be applied to AIC-based likelihoods (discussed in the next section), and if so, what significance level and variance window size such factors would correspond to. Thus, for this study the variance window concept has been used only with the KIC-based cumulative distribution function (CDF) (i.e., Δ_j is given by Equation 10).
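A short sketch of Equations 15 and 16, using the σ_D = √(2n) relation given above: the KIC values are hypothetical, s1 = 6 corresponds to the 5% significance level quoted from Raftery (1995), and for n = 38 the resulting α values match the 0.68/0.34/0.17 used later in the NTS application to within rounding.

```python
import numpy as np

def variance_window_alpha(n_obs, s1=6.0, s2=1.0):
    """Scaling factor α = s1 / (s2 * σ_D), with σ_D = sqrt(2n) (Equation 15)."""
    return s1 / (s2 * np.sqrt(2.0 * n_obs))

def mbma_weights(criterion, alpha):
    """Variance-window (modified) MLBMA weights (Equation 16)."""
    delta = np.asarray(criterion, dtype=float)
    delta = delta - delta.min()               # Δ_j from Equation 9 or 10
    w = np.exp(-0.5 * alpha * delta)
    return w / w.sum()

kic = [100.0, 102.0, 106.0]          # hypothetical KIC values
for s2 in (1.0, 2.0, 4.0):           # 1σ_D, 2σ_D, and 4σ_D variance windows
    a = variance_window_alpha(38, s1=6.0, s2=s2)
    print(f"s2={s2}: alpha={a:.2f}, weights={mbma_weights(kic, a).round(3)}")
```

Running this shows the weight distribution flattening as the window widens (α shrinking), which is the behavior reported for the mBMA CDFs below.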
Table 1
α-Values for Different Variance Window Sizes and Significance Levels (from Tsai and Li 2008)

Significance Level | σ_D window | 2σ_D window | 4σ_D window
5%                 | 4.24/√n    | 2.12/√n     | 1.06/√n
1%                 | 6.51/√n    | 3.26/√n     | 1.63/√n

Information Theory-Based Model Averaging

Information theory provides a rich literature for the assessment of relative model performance, as the likelihood of a model can be assumed to be related to the value of the "information" it provides. The most popular information theory-based measure in use is the AIC. A recently developed, publicly available model averaging software package called multimodel analysis or MMA (Poeter and Hill 2007) provides a generalized framework that can be used to rank models and calculate posterior model probabilities. While MMA allows the user to choose or define the model criterion and the model averaging equations (including the MLBMA formulation), for this work we implement the AIC component of MMA, which we refer to as Akaike Information Criterion-based model averaging (AICMA). The AICMA framework works similarly to the Bayesian framework, although there are significant philosophical differences between the two approaches. The AIC is used to approximate the Kullback-Leibler (K-L) metric, a measure of the loss of information when an imperfect model (M_j) is used to approximate the "real" (and unknown) model f. The K-L distance (I) between model M_j and f is defined as:

I[f, M_j] = \int f(x) \log \frac{f(x)}{p(M_j|\theta_j)} \, dx    (17)

where f(x) is the real distribution and p(M_j|θ_j) is the distribution of model M_j given the set of calibrated parameters θ_j. Obviously, since the real distribution f is not known, this term cannot be calculated. However, the relative K-L information can be approximated using the AIC (Akaike 1973), given by:

AIC_j = n \ln(\hat{\sigma}_{e,j}^2) + 2k    (18)

To further correct for the bias introduced by small sample sizes, a modified AIC equation (Hurvich and Tsai 1989; Poeter and Anderson 2005) has been proposed:

AICc_j = n \ln(\hat{\sigma}_{e,j}^2) + 2k + \frac{2k(k+1)}{n-k-1}    (19)

where the extra term in Equation 19, as compared to Equation 18, accounts for the second-order bias that may result from a limited number of observations, for example, when n/k < 40. This work uses the AICc metric as defined in Equation 19 for likelihood estimation. In a manner similar to Equation 14, the AICMA model weights can be written as:

w_j(AICMA) = \frac{\exp(-0.5 \, AICc_j) \, p(M_j)}{\sum_j \exp(-0.5 \, AICc_j) \, p(M_j)}    (20)

Theoretically, the fundamental difference between the AICMA and Bayesian approaches lies in their conception of a model. Since AICMA is based on an information-theoretic framework, it assumes that all models are approximations and that it is impossible to perfectly capture reality. While the goal of AICMA is therefore to select models of increasing complexity as the number of observations increases, the goal of MLBMA is to strive for models of consistent complexity (i.e., constant k), regardless of the number of observations (since its penalty term for model complexity is not dependent on the number of observations). Of course, use of the FI matrix in the KIC calculation leads to lower probabilities for more complex models if such complexity is not supported by the data, thereby alleviating some of the problems with the consistent-complexity assumption.
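A sketch of Equations 19 and 20 computing AICc-based weights from residual variances: to keep the exponentials finite, the implementation subtracts the minimum AICc before exponentiating, which leaves the normalized weights of Equation 20 unchanged. The input values are illustrative only.

```python
import numpy as np

def aicc(sigma2, n, k):
    """Small-sample-corrected Akaike criterion (Equation 19)."""
    return n * np.log(sigma2) + 2 * k + 2 * k * (k + 1) / (n - k - 1)

def aicma_weights(sigma2, n, k, priors=None):
    """AICMA weights (Equation 20), shifted by min(AICc) for numerical stability."""
    crit = aicc(np.asarray(sigma2, dtype=float), n, np.asarray(k))
    delta = crit - crit.min()
    pr = np.ones_like(delta) if priors is None else np.asarray(priors, dtype=float)
    w = pr * np.exp(-0.5 * delta)
    return w / w.sum()

# Illustrative: three models, n = 38 observations, k = 15 effective parameters each.
print(aicma_weights(sigma2=[0.29, 0.30, 0.83], n=38, k=[15, 15, 15]).round(3))
```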
Despite these differences, the AICMA approach shares some of the behavior of MLBMA in terms of posterior weight distribution, due to the exponential weighting of the criterion in Equation 20, which results in larger weights being given to models that exhibit optimal or near-optimal error residuals. The definition of AICc (like that of KIC and BIC) exhibits a linear dependence on n, which implies that the AICc weights are proportional to (1/σ̂²_e,j) raised to a power proportional to n, whereas the GLUE weights are proportional to (1/σ̂²_e,j) itself. This is the primary source of the difference in inferred posterior model probabilities between GLUE and MLBMA or AICMA.

Application

The various model averaging techniques were applied to two case studies: (1) a case study first presented by Ye et al. (2006), who used it to assess conceptual model uncertainty in the Death Valley regional flow system; and (2) a case study involving a groundwater model developed for one of the corrective action units at the Nevada Test Site. Details of the application to the Death Valley recharge model are given in Supporting Information, section A. The main conclusion drawn from that case study was a lack of consistency among the model rankings from the different model averaging schemes, with different techniques preferring different conceptual models as the top-ranked model. The model weights given by the different techniques were also disparate, with GLUE giving more uniformly distributed weights and the other techniques (AICMA, MLBMA-KIC, and MLBMA-BIC) giving most of the weight to one or two models.

The second case study used for testing the methodologies is based on a groundwater model developed for one of the corrective action units (at Frenchman Flat, an alkaline desert depression) at the Nevada Test Site. One of the objectives of the Frenchman Flat model is to provide an estimate of the vertical and horizontal extent of radionuclide migration for use in regulatory decision making (Belcher 2004; Stoller-Navarro Joint Venture 2006). The flow of water through the cavity of an underground nuclear test is the prediction of interest for this case study. Additional details are discussed in Supporting Information, section B. For this case study, there is considerable uncertainty about the underlying geology of Frenchman Flat. To address these uncertainties, nine alternative models of groundwater flow, reflecting a combination of uncertainties in geologic framework, parameters, and conceptualization of recharge, were developed (Bechtel Nevada 2005; Stoller-Navarro Joint Venture 2006). These are described in Table 2.

Table 2
Description of Frenchman Flat Model Cases

ANISO_FINAL: Permeability depth reduction tends to impose apparent anisotropy; such additional anisotropy may be overly constraining flow. Depth-limited anisotropy was developed to test whether this was the case.
FLOOR_FINAL: Indefinite permeability reduction with depth can effectively remove some parts of the geology from the flow system because they become impermeable. This approach imposes a lower limit, or floor, on permeability depth decay for the base framework.
BF_7_AV: Base framework model with prior data. Tests the influence of such data and model parameter stability.
NDD2: Base framework model with limited alluvium and volcanic rock permeability depth decay.
BASE_FINAL: Best-calibration base framework model.
DISP: This alternative is concerned with the locations and displacement of basin-forming faults. It juxtaposes shallow aquifers against deeper aquifers, allowing a hydraulic connection between volcanic aquifers underlying the AA in Frenchman Flat and carbonate aquifers east and south of the Rock Valley fault system. Juxtaposition removes zeolitic confining units from a potential flow path.
BASE_NODD: Base framework model without alluvium permeability depth decay. This tests whether the conceptual model of permeability reduction with depth can give feasible results.
BLFA: The BLFA HSU is modeled as a single continuous flow unit, rather than three separate zones. It is located at or near the water table, which may affect flow and transport of radionuclides away from underground nuclear tests in the Northern Testing Area. Conceptually, the BLFA is a fractured rock, so fracture/matrix processes act over a larger area.
CPBA: Some uncertainty exists in the distribution of pre-Tertiary HSUs, particularly the distribution of the UCCU beneath CP basin. This alternative results in a continuous sheet of UCCU beneath CP basin. It has no direct transport consequences in terms of materials, but broadly impacts the flow system.

Each model was calibrated using a combination of head and boundary flux data; a total of 38 head and flux measurements were used (i.e., n = 38 for all calculations). The inversion code Model-Independent Parameter Estimation (PEST) (Doherty 2004) was used to calibrate each model. The objective was to estimate the uncertainty (over the ensemble of alternative models) in predictions of cavity flow.

Note that for MLBMA and AICMA, k (in Equations 11, 12, 18, and 19) should theoretically correspond to the number of parameters that are uniquely estimated by the calibration process (the maximum likelihood estimate pertains only to such parameters). The Frenchman Flat models had more parameters than observations, so not all parameters could be uniquely identified. Sensitivity analysis of the parameter space was therefore undertaken to extract the subspace of the most sensitive parameters from the calibration process. This was done using singular value decomposition (SVD), which identifies the dominant linear combinations (eigenvectors) of the system parameters (see Moore and Doherty [2006] on how to calculate this subspace of sensitive parameters). In effect, these are the only parameters that can be calibrated uniquely, and thus the maximum likelihood estimates essentially pertain to these parameters. For the model averaging exercise, only this subspace of sensitive parameters was therefore considered for each model; in this case, the top 15 sensitive parameters were chosen for further analysis. All 38 data points were used for the sensitivity analysis and subsequent calculations.
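A minimal sketch of the kind of SVD-based reduction described above, under the assumption that a weighted Jacobian from the calibration is available: the right singular vectors with the largest singular values define the dominant parameter combinations (superparameters), and the truncation level (15 in this case study) sets the effective k. This is a schematic illustration, not the actual PEST workflow.

```python
import numpy as np

def dominant_parameter_subspace(jacobian, weights, n_keep):
    """Identify dominant parameter combinations via SVD of the weighted Jacobian.

    jacobian : (n_obs, n_par) sensitivity matrix X
    weights  : (n_obs,) diagonal observation weights ω
    n_keep   : number of singular vectors (superparameters) to retain
    Returns the (n_par, n_keep) matrix whose columns span the identifiable subspace.
    """
    Xw = np.sqrt(weights)[:, None] * jacobian        # ω^(1/2) X
    _, s, Vt = np.linalg.svd(Xw, full_matrices=False)
    print("singular values:", s.round(3))            # inspect the spectrum before truncating
    return Vt[:n_keep].T                             # dominant right singular vectors

# Illustrative: 38 observations, 40 raw parameters, keep the top 15 combinations.
rng = np.random.default_rng(2)
X = rng.normal(size=(38, 40))
V = dominant_parameter_subspace(X, np.ones(38), n_keep=15)
print(V.shape)  # (40, 15): each column is one calibratable superparameter direction
```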
Table 3 shows the results of applying GLUE (with a shape factor of N = 1), MLBMA-BIC, MLBMA-KIC, and AICMA to this test case.

Table 3
Model Weights and Ranks Using Superparameters (all 38 observations were used for all models; k = 15 effective parameters per model)

Model       | WSSR  | GLUE Wt | MLBMA-BIC Wt | MLBMA-KIC Wt | AICMA Wt | GLUE Rank | AICMA Rank | MLBMA-BIC Rank | MLBMA-KIC Rank
BLFA        | 434   | 0.76%   | 0.00%        | 0.00%        | 0.00%    | 8 | 8 | 8 | 6
BASE_NODD   | 394   | 0.84%   | 0.00%        | 0.00%        | 0.00%    | 7 | 7 | 7 | 8
CPBA        | 1503  | 0.22%   | 0.00%        | 0.00%        | 0.00%    | 9 | 9 | 9 | 9
DISP        | 298   | 1.10%   | 0.00%        | 0.00%        | 0.00%    | 6 | 6 | 6 | 7
FLOOR_FINAL | 11.44 | 28.78%  | 27.43%       | 99.55%       | 27.43%   | 2 | 2 | 2 | 1
ANISO_FINAL | 10.87 | 30.29%  | 72.44%       | 0.44%        | 72.44%   | 1 | 1 | 1 | 2
BF_7_CAV    | 15.15 | 21.73%  | 0.13%        | 0.00%        | 0.13%    | 3 | 3 | 3 | 3
NDD2        | 31.71 | 10.38%  | 0.00%        | 0.00%        | 0.00%    | 4 | 4 | 4 | 4
BASE_FINAL  | 55.85 | 5.90%   | 0.00%        | 0.00%        | 0.00%    | 5 | 5 | 5 | 5

As can be seen from the table, the GLUE weights are more uniformly distributed than those of the other approaches, with at least four models having weights above 10%. MLBMA-BIC and AICMA have nonnegligible weights for only two models, while MLBMA-KIC assigns most of the weight to a single model.

Unlike the previous test case (see Supporting Information, section A), the model ranks for this case study are more consistent. The ranks for GLUE, AICMA, and MLBMA-BIC are identical, while MLBMA-KIC shows a slight difference in relative ranks (discussed below). This is primarily because the Frenchman Flat models all have the same number of (effective) parameters. With different numbers of parameters, GLUE, AICMA, BIC, and KIC can produce discrepant model ranks because they give different levels of importance to model complexity in the posterior weights. Since the number of (sensitive) parameters for each model is the same (15), AICMA and MLBMA-BIC have identical model weights and ranks. In fact, the model weights for all of these approaches are based purely on the calibration residual, since the parsimony terms in Equations 11 and 19 cancel out when calculating ΔBIC and ΔAICc.

The KIC column in Table 3 shows that the relative order of model weights given by KIC differs from the GLUE, MLBMA-BIC, and AICMA weights. The KIC metric selects model "floor_final" as the best and "aniso_final" as the second-best model, whereas GLUE, MLBMA-BIC, and AICMA reverse this order. The additional sensitivity term tends to favor the model (floor_final) with the slightly higher calibration error, while the other model averaging techniques all favor the model (aniso_final) with the minimum calibration error. This is consistent with the discussion of the Fisher term (see the section on Bayesian Model Averaging Techniques): KIC requires a model with higher information content to have a correspondingly lower model error. In this case, the higher information content in the data is not adequately balanced by a lower calibration error for model aniso_final, and thus the model with the slightly higher calibration error but lower information content (floor_final) is selected.

The CDFs for the flow prediction are presented in Figure 2. The unconditional CDF corresponds to all models being weighted uniformly, with the weighted CDFs corresponding to the GLUE, MLBMA-BIC, and AICMA weights, respectively. As expected, the unconditional case has the largest spread. Conditioning with GLUE leads to a reduction in variance (i.e., the CDF is steeper than in the uncalibrated case), and most of the models participate in the model weighting process. On the other hand, application of AICMA, MLBMA-KIC, or MLBMA-BIC gives most of the posterior weight to a few (2 to 3) models, with zero weight assigned to the majority of the models. The CDFs for AICMA and MLBMA-BIC coincide, as expected.

Figure 2. Prediction uncertainty for cavity flow for different model averaging techniques.
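The weighted CDFs and standard deviations just discussed can be assembled directly from the per-model predictions and weights, as in the sketch below; the prediction values and weights are placeholders, not the case-study results.

```python
import numpy as np

def weighted_ensemble_stats(predictions, weights):
    """Weighted mean, standard deviation, and CDF over an ensemble of model predictions."""
    p = np.asarray(predictions, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    mean = np.sum(w * p)
    std = np.sqrt(np.sum(w * (p - mean) ** 2))
    order = np.argsort(p)                          # CDF: cumulative weight vs. sorted prediction
    return mean, std, p[order], np.cumsum(w[order])

# Placeholder cavity-flow predictions for nine models, with two weight sets.
preds = [5.1, 4.8, 9.7, 4.2, 2.1, 2.0, 2.6, 3.3, 3.9]
uniform = np.full(9, 1.0 / 9.0)                    # unconditional case
kic_like = [0, 0, 0, 0, 0.9955, 0.0044, 0, 0, 0]   # concentrated, KIC-style weights
for w in (uniform, kic_like):
    m, s, x, F = weighted_ensemble_stats(preds, w)
    print(f"mean={m:.2f}, std={s:.2f}")            # concentrated weights -> much smaller std
```

Plotting F against x for each weight set reproduces the kind of comparison shown in Figure 2.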
Variance Reduction with Different Averaging Techniques

As shown in Figure 2, the spread in predictions from the various model averaging techniques can be quite different. This is examined in detail by comparing the standard deviations of cavity flow shown in Figure 3. Not surprisingly, the highest uncertainty is associated with the uncalibrated case, with some reduction in variance for the GLUE case (because of conditioning). However, results for MLBMA-BIC, MLBMA-KIC, and AICMA show a significant reduction in predictive variance compared to the unconditional case, with MLBMA-BIC and AICMA leading to almost 95% reduction and MLBMA-KIC to almost 99% reduction, respectively. This is consistent with the distribution of weights for these respective model averaging techniques, shown earlier in Table 3.

Figure 3. Standard deviation for cavity flow predictions for different conceptual models.

It is interesting to note that MLBMA-KIC leads to the least variance among all the model averaging methodologies, which is of particular consequence when considering model uncertainty. As shown in Table 3, while MLBMA-KIC has almost the same rank order as GLUE, MLBMA-BIC, and AICMA, the difference between the best and second-best models is much greater for MLBMA-KIC than for the other aggregation schemes: the rank 1 model for MLBMA-KIC is two orders of magnitude more likely than the rank 2 model. MLBMA-KIC thus tends to concentrate the weight even further onto a single model, leading to much less predictive uncertainty than the other techniques.
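Because all nine models share n = 38 and k = 15, the parsimony terms cancel in the criterion differences, and the GLUE (N = 1) and MLBMA-BIC/AICMA weight columns of Table 3 can be reproduced from the WSSR column alone. The short sketch below does this; it recovers the published weights to within rounding.

```python
import numpy as np

# WSSR values from Table 3 (order: BLFA, BASE_NODD, CPBA, DISP, FLOOR_FINAL,
# ANISO_FINAL, BF_7_CAV, NDD2, BASE_FINAL); n = 38, k = 15 for every model.
wssr = np.array([434, 394, 1503, 298, 11.44, 10.87, 15.15, 31.71, 55.85])
n = 38

# GLUE with N = 1: weights proportional to 1/σ̂² = n/WSSR, i.e., to 1/WSSR.
glue = (1.0 / wssr) / np.sum(1.0 / wssr)

# MLBMA-BIC (= AICMA here, since k is identical across models):
# Δ_j = n ln(σ̂²_j) - n ln(σ̂²_min) = n ln(WSSR_j / WSSR_min).
delta = n * np.log(wssr / wssr.min())
bic = np.exp(-0.5 * delta) / np.sum(np.exp(-0.5 * delta))

print((100 * glue).round(2))  # ≈ [0.76 0.84 0.22 1.1 28.78 30.29 21.73 10.38 5.9]
print((100 * bic).round(2))   # ≈ [0. 0. 0. 0. 27.43 72.44 0.13 0. 0.]
```

The KIC column cannot be recovered this way because it additionally depends on each model's Fisher information term (Equation 12).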
Additional analysis was also conducted to examine the sensitivity of the GLUE CDFs to the shape factor; details of this analysis are presented in Supporting Information, section C. It was seen that increasing the shape factor led to more nonuniform GLUE weights, with better models being given progressively higher weights. In addition, as the shape factor was increased, the GLUE CDF tended to converge to the AICMA and MLBMA-BIC CDFs.

Evaluation of Modified BMA

This section presents the results for the modified BMA (mBMA) approach of Tsai and Li (2008) applied to the NTS case study. For this application, the modified approach was applied only with the KIC weighting scheme. The variance window factors (α-values) given in Table 1 were used with a significance level of 5%. With 38 observations (n), the α-value was determined to be 0.68 for a 1σ variance window, decreasing to 0.34 and 0.17 for 2σ and 4σ variance window sizes, respectively. Model weights under this modified BMA technique were then calculated using Equation 16 for different values of α and compared to GLUE and to MLBMA-BIC and MLBMA-KIC (both of the latter with the original Occam's window-based weighting). The results for cavity flow are shown in Figure 4. As expected, decreasing values of α result in a broadening of Occam's window and correspondingly smoother CDFs for the prediction uncertainty in cavity flow (for α = 1, mBMA is equivalent to MLBMA-KIC). Thus, the generalization of Occam's window to Tsai and Li's variance window allows additional plausible models to be effectively weighted in the model averaging process. In addition, with larger variance windows (lower α), the predictive CDF tends to become smoother, leading to higher predictive variance.

Figure 4. Sensitivity of uncertainty in model predictions for modified MLBMA to different variance windows.

Recall that the sensitivity analysis conducted with GLUE (section C, Supporting Information) showed that, for high values of the shape factor N, the GLUE CDF coincides with the AICMA- and MLBMA-BIC-based CDFs.
The impact of different variance window sizes on the CDF is somewhat different: in this case, even with very large variance window sizes, the mBMA CDFs (based on the KIC metric) never coincide with either the GLUE or the AICMA/MLBMA-BIC based CDFs. This is because of the difference in the ranks of the highly weighted models between the KIC scheme (on which mBMA is based) and the GLUE and AICMA/MLBMA-BIC schemes. If, for example, MLBMA-KIC and GLUE had the same relative ordering of models, some larger variance window size would lead to mBMA having the same weights as GLUE. Recall that the difference in the relative order of model weights across techniques arises from the parsimony and sensitivity terms included in the AICc, BIC, and KIC criteria. The GLUE weights, on the other hand, do not depend on the number of parameters in the model or the sensitivity of the model to its parameters, and hence may or may not produce the same relative order of models as the AICc-, BIC-, or KIC-based CDFs.

Concluding Remarks

This analysis provides a comparative assessment of different groundwater model averaging techniques for quantifying the impacts of model uncertainty on groundwater model predictions. These techniques include (1) GLUE, (2) MLBMA using both KIC and BIC, (3) AICMA (using AICc), and (4) a modified BMA using the variance window concept. Two groundwater modeling case studies are used to illustrate the performance of these different techniques and to provide some practical suggestions regarding their applicability. On the basis of the results presented in the previous sections, the following general conclusions are warranted.

Different model averaging techniques can lead to different relative rankings and, hence, significantly different weights for models. In particular, BMA using KIC or BIC leads to a concentration of weights in the top few models and a corresponding reduction in prediction uncertainty compared to the unconditional case (where all models are considered equally likely). However, the variance window modification to BMA provides an opportunity to expand Occam's window, thus expanding the hypothesis space by accepting multiple plausible models with a commensurate redistribution of model weights. Although AICMA is conceptually different from MLBMA in its use of AICc as the criterion of choice, the use of an exponential weighting term leads to a similar concentration of weights in the 1 or 2 models with the best agreement with the data. Such concentration of weights can also reduce the impact that expert-based priors have on the final predictive uncertainty bounds. The concentration of weights in a few top-performing models may be deemed acceptable by the expert, in which case BMA and/or AICMA yield the most statistically meaningful results; this essentially indicates that the modeler has full confidence in the calibration data and in their use for assigning likelihoods to the models in predictive space. If, on the other hand, the calibration data are not fully reliable or represent conditions sufficiently different from the predictive modeling environment, alternatives may need to be considered to expand the "hypothesis window" to include more alternative conceptualizations. The modeler and field experts are thus advised to exercise a certain level of judgment in the final model averaging process.
GLUE produces more uniformly distributed weights for an N factor of 1.0, with the degree of uniformity depending on the choice of the shape factor N. A large value of N (of the order of 20) leads to a concentration of weight in the model(s) with the best calibration performance. If the GLUE model weights have the same rank ordering as those of one of the other model averaging techniques (such as MLBMA or AICMA), then increasing the GLUE shape factor produces weights that converge to those of the other technique. In this sense, the empirical GLUE shape factor can be interpreted similarly to the variance window concept. It should also be noted that the criticism that certain GLUE likelihoods do not depend on the number of samples is valid from a statistical perspective. While GLUE provides a flexible framework, work still needs to be done on formulating statistically meaningful likelihood functions with appropriate dependence on model error, number of data points, and model complexity.

From a practical standpoint, the various model averaging techniques provide a useful framework for assigning probabilities to alternative conceptual models. As noted earlier, there are significant differences between the various approaches. Practitioners should therefore be cognizant of the fact that different model averaging techniques can lead to different predictive uncertainty bounds. They should apply and compare different model averaging techniques (a task facilitated by software tools such as MMA; Poeter and Hill 2007) before deciding which technique is most appropriate to their problem. They should also be aware of the assumptions and limitations, as well as the advantages, of the different techniques. The work presented in this paper is intended to bring forth some of the underlying assumptions and nuances of one method relative to another and to provide some practical guidelines. To that end, a preliminary set of recommendations is provided regarding the use of these model averaging techniques for the ultimate goal of quantifying uncertainty in model predictions.

• The starting point for any model averaging exercise should be an exhaustive set of alternative models that have been properly calibrated. Conclusions from the application of model averaging techniques are likely to be misleading if uncertainty in the model space has not been properly characterized. This is consistent with the observations of others (Beven 2009; Ye et al. 2008; Poeter and Hill 2007).

• In the case of overparameterized models, parametric sensitivity analysis should be undertaken to identify the "sensitive" parameters that can be properly calibrated with the available data. In the case of MLBMA-KIC, the FI criterion should be calculated for this set of sensitive parameters only.

• As a first step, the different models should be ranked using more than one information criterion (i.e., AICc, BIC, or KIC). As the rank ordering of models may differ from technique to technique depending on the balance of goodness-of-fit and model complexity, it may be useful to create a union of the top-ranked models across the various techniques.

• If there is consistency across model rankings, then predictions from KIC-, BIC-, or AICc-based techniques will be similar and will likely display a smaller variance than weighted GLUE predictions.
In cases where the calibration data are not very reliable, or where the calibration space is sufficiently different from the predictive space, the modeler may need to apply techniques to further expand the hypothesis window so that more conceptual models are given a nonzero weight. Within the Bayesian framework, this can be accomplished through the variance window concept. In GLUE, the shape factor (N) can be modified to distribute the weights across different models.

• If model rankings are in conflict, then the onus is on the analyst to determine which of the model averaging techniques is appropriate for the problem at hand, based on consistency with hydrogeologic considerations. For example, in case study 1, the analyst may use site-specific and domain-specific knowledge to decide between DPW1 and CMB2, given the lack of robustness among different techniques for assigning model weights. It is important that the modeler also examine other model performance measures (bias and correlation of errors, mass balance characteristics, matching of peak events, etc.).

• It is highly recommended to develop a CDF (as shown in Figure 4) that takes into account the prediction from each model and the weight assigned to that model. Such a CDF captures the full range of outcomes and their associated likelihoods, rather than aggregating the results in terms of the mean and standard deviation of model predictions. The decision maker is likely to benefit from a complete presentation of uncertainty propagation results as compared to just the first two statistical moments.

Acronyms and Symbols

AIC: Akaike Information Criterion
BIC: Bayesian Information Criterion
BMA: Bayesian Model Averaging
BMC: Bayesian Monte Carlo
CDF: Cumulative Distribution Function
GLUE: Generalized Likelihood Uncertainty Estimation
KIC: Kashyap Information Criterion
MLBMA: Maximum Likelihood Bayesian Model Averaging
MMA: Multi-Model Analysis
RMSE: Root Mean Square Error
WSSR: Weighted Sum of Squared Residuals
Mi: Model i
Li: Likelihood for Model i
θi: Parameter Vector for Model i
θ̂i: Maximum Likelihood Estimate for Parameter Vector θi
ki: Number of Parameters for Model i
σ²e,i: Variance of the Errors (Residuals) for Model i
σ²l: Variance of the Observations
N: GLUE Shape Factor
D: Observation Data
n: Number of Observations
ω: Observation Weight Matrix
p(x): Probability of x
p(x|y): Conditional Probability of x Given y
X: Sensitivity/Jacobian Matrix of Calibration with Respect to Parameters
|X|: Determinant of Matrix X
I[f, Mi]: Kullback-Leibler (K-L) Metric for the Loss of Information When Using Model Mi to Represent Reality (f)
s1: Size of Occam's Window
σD: Standard Deviation of the Chi-Square Distribution Used for the "Goodness-of-Fit" Criterion in KIC or BIC
s2: Width of the Variance Window in Terms of σD
α: Ratio of Occam's Window to Variance Window

Acknowledgments

We wish to thank the reviewers of this paper, Mary Hill, Eileen Poeter, Ming Ye, and Harihar Rajaram, for their comments and feedback, which helped improve the quality and readability of our manuscript.

Supporting Information

Additional Supporting Information may be found in the online version of this article. Supporting information has been provided in three supplemental sections. Section A provides details of applying the different model averaging techniques to the Death Valley regional flow model (Ye et al. 2006). Section B provides additional detail of the Nevada Test Site case study.
Section C provides information on the sensitivity analysis conducted with different GLUE shape factors for the NTS case study.

Please note: Wiley-Blackwell are not responsible for the content or functionality of any supporting materials supplied by the authors. Any queries (other than missing material) should be directed to the corresponding author for the article.

References

Akaike, H. 1973. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, ed. B.N. Petrov, 267–281. Budapest, Hungary: Akademiai Kiado.

Bechtel Nevada. 2005. A Hydrostratigraphic Framework Model and Alternatives for the Groundwater Flow and Contaminant Transport Model of Corrective Action Unit 98: Frenchman Flat, Clark, Lincoln and Nye Counties, Nevada, DOE/NV/11718–1064. Las Vegas, NV: Bechtel Nevada.

Belcher, W.R., ed. 2004. Death Valley regional groundwater flow system, Nevada and California—Hydrogeologic framework and transient ground-water flow model. U.S. Geological Survey Scientific Investigations Report 2004–5205, 408 p. Reston, Virginia: USGS.

Beven, K.J. 2008. Environmental Modelling: An Uncertain Future? An Introduction to Techniques for Uncertainty Estimation in Environmental Prediction. London: Routledge Publishing.

Beven, K.J. 2006. On undermining the science? Hydrological Processes 20, 3141–3146.

Beven, K.J. 2000. Uniqueness of place and process representations in hydrological modelling. Hydrology and Earth System Sciences 4, 203–213.

Beven, K.J. 1993. Prophecy, reality and uncertainty in distributed hydrological modeling. Advances in Water Resources 16, 41–51.

Beven, K., and J. Freer. 2001. Equifinality, data assimilation, and uncertainty estimation in mechanistic modelling of complex environmental systems using the GLUE methodology. Journal of Hydrology 249, 11–29.

Beven, K.J., and A. Binley. 1992. The future of distributed models: Model calibration and uncertainty prediction. Hydrological Processes 6, 279–298.

Carrera, J., and S.P. Neuman. 1986. Estimation of aquifer parameters under transient and steady state conditions: 1. Maximum likelihood method incorporating prior information. Water Resources Research 22, no. 2: 199–210.

Delhomme, J.P. 1979. Spatial variability and uncertainty in ground water flow parameters: A geostatistical approach. Water Resources Research 15, no. 2: 269–280.

Doherty, J. 2004. PEST: Model-independent parameter estimation, user manual, version 5. Brisbane, Australia: Watermark Numerical Computing.

Domingos, P. 2000. Bayesian averaging of classifiers and the overfitting problem. ICML'00. http://www.cs.washington.edu/homese/pedrod/mlc00b.ps.gz.

Draper, D. 1995. Assessment and propagation of model uncertainty. Journal of the Royal Statistical Society: Series B 57, no. 1: 45–97.

Hill, M.C., and C.R. Tiedeman. 2007. Effective Groundwater Model Calibration: With Analysis of Data, Sensitivities, Predictions, and Uncertainty. New York: Wiley and Sons.

Hoeksema, R.J., and P.K. Kitanidis. 1989. Predictions of transmissivities, heads, and seepage velocities using mathematical models and geostatistics. Advances in Water Resources 12, no. 2: 90–102.

Hoeting, J.A., D. Madigan, A.E. Raftery, and C.T. Volinsky. 1999. Bayesian model averaging: A tutorial. Statistical Science 14, no. 4: 382–417.

Hurvich, C.M., and C.-L. Tsai. 1989. Regression and time series model selection in small samples. Biometrika 76, no. 2: 297–307.

Kashyap, R.L. 1982. Optimal choice of AR and MA parts in autoregressive moving average models. IEEE Transactions on Pattern Analysis and Machine Intelligence 4, no. 2: 99–104.
Kass, R.E., and A.E. Raftery. 1995. Bayes factors. Journal of the American Statistical Association 90, 773–795.

Madigan, D., and A.E. Raftery. 1994. Model selection and accounting for model uncertainty in graphical models using Occam's window. Journal of the American Statistical Association 89, no. 428: 1535–1546.

Mantovan, P., and E. Todini. 2006. Hydrological forecasting uncertainty assessment: Incoherence of the GLUE methodology. Journal of Hydrology 330, 368–381.

Minka, T.P. 2000. Bayesian model averaging is not model combination. MIT Media Lab note (7/6/00). Available at http://research.microsoft.com/~minka/papers/minka-bmaisnt-mc.pdf.

Moore, C., and J. Doherty. 2006. The cost of uniqueness in groundwater model calibration. Advances in Water Resources 29, 605–623.

Mugunthan, P., and C.A. Shoemaker. 2006. Assessing the impacts of parameter uncertainty for computationally expensive ground water models. Water Resources Research 42, W10428. doi:10.1029/2005WR004640.

Nash, J., and J. Sutcliffe. 1970. River flow forecasting through conceptual models, 1. A discussion of principles. Journal of Hydrology 10, 282–290.

National Research Council. 2001. Conceptual Models of Flow and Transport in the Fractured Vadose Zone. Washington, DC: National Academy Press.

Neuman, S.P. 2003. Maximum likelihood Bayesian averaging of uncertain model predictions. Stochastic Environmental Research and Risk Assessment 17, no. 5: 291–305.

Neuman, S.P. 1982. Statistical characterization of aquifer heterogeneities: An overview. In Recent Trends in Hydrogeology, 81–102. Geological Society of America Special Paper 189. Boulder, Colorado: Geological Society of America.

Neuman, S.P., and P.J. Wierenga. 2003. A Comprehensive Strategy of Hydrogeologic Modeling and Uncertainty Analysis for Nuclear Facilities and Sites, NUREG/CR-6805. Washington, DC: U.S. Nuclear Regulatory Commission.

Poeter, E.P., and M.C. Hill. 2007. MMA, A computer code for Multi-Model Analysis. U.S. Geological Survey Techniques and Methods 6–E3. Reston, Virginia: USGS.

Poeter, E., and D. Anderson. 2005. Multimodel ranking and inference in ground water modeling. Ground Water 43, no. 4: 597–605.

Raftery, A.E. 1995. Bayesian model selection in social research. Sociological Methodology 25, 111–163.

Samper, F.J., and S.P. Neuman. 1989. Estimation of spatial covariance structures by adjoint state maximum likelihood cross validation: 1. Theory. Water Resources Research 25, 351–362.

Schwarz, G. 1978. Estimating the dimension of a model. Annals of Statistics 6, no. 2: 461–464.

Stoller-Navarro Joint Venture. 2006a. Phase II ground-water flow model of Corrective Action Unit 98: Frenchman Flat, Nevada Test Site, Nye County, Nevada. Stoller-Navarro Joint Venture Report S-N/99205-074, prepared for the U.S. Department of Energy. Available at http://www.osti.gov/bridge (accessed February 8, 2007).

Tsai, F.T.-C., and X. Li. 2008. Ground water inverse modeling for hydraulic conductivity estimation using Bayesian model averaging and variance window. Water Resources Research 44, no. 9: W09434. doi:10.1029/2007WR006576.

Vogel, R.M., R. Batchelder, and J.R. Stedinger. 2008. Appraisal of the Generalized Likelihood Uncertainty Estimation (GLUE) method. Water Resources Research 44, W00B06. doi:10.1029/2008WR006822.

Wagner, B.J., and S.M. Gorelick. 1989. Reliable aquifer remediation in the presence of spatially variable hydraulic conductivity: From data to design. Water Resources Research 25, no. 10: 2211–2225.
Ye, M., K.F. Pohlmann, and J.B. Chapman. 2008a. Expert elicitation of recharge model probabilities for the Death Valley regional flow system. Journal of Hydrology 354, 102–115. doi:10.1016/j.jhydrol.2008.03.001.

Ye, M., P.D. Meyer, and S.P. Neuman. 2008b. On model selection criteria in multimodel analysis. Water Resources Research 44, W03428. doi:10.1029/2008WR006803.

Ye, M., K. Pohlmann, J. Chapman, and D. Shafer. 2005. On evaluation of conceptual models: A priori and a posteriori. International High-Level Radioactive Waste Management Conference, April 30–May 4, Las Vegas, NV. Available at http://www.osti.gov/bridge/product.biblio.jsp?osti_id=875590.

Ye, M., S.P. Neuman, P.D. Meyer, and K.F. Pohlmann. 2005. Sensitivity analysis and assessment of prior model probabilities in MLBMA with application to unsaturated fractured tuff. Water Resources Research 41, W12429. doi:10.1029/2005WR004260.

Ye, M., S.P. Neuman, and P.D. Meyer. 2004. Maximum likelihood Bayesian averaging of spatial variability models in unsaturated fractured tuff. Water Resources Research 40, W05113.