Model Averaging Techniques for Quantifying Conceptual Model Uncertainty
by Abhishek Singh1, Srikanta Mishra2, and Greg Ruskauff3

1 Corresponding author: INTERA Inc., Austin, TX; (512) 425-2048; fax (512) 425-2099; asingh@intera.com
2 INTERA Inc., Austin, TX; smishra@intera.com
3 INTERA Inc., Las Vegas, NV; greg.ruskauff@nv.doe.gov

Received December 2008, accepted September 2009.
Copyright © 2009 The Author(s)
Journal compilation © 2009 National Ground Water Association.
doi: 10.1111/j.1745-6584.2009.00642.x
Abstract
In recent years a growing understanding has emerged regarding the need to expand the modeling paradigm to
include conceptual model uncertainty for groundwater models. Conceptual model uncertainty is typically addressed
by formulating alternative model conceptualizations and assessing their relative likelihoods using statistical model
averaging approaches. Several model averaging techniques and likelihood measures have been proposed in the
recent literature for this purpose with two broad categories—Monte Carlo-based techniques such as Generalized
Likelihood Uncertainty Estimation or GLUE (Beven and Binley 1992) and criterion-based techniques that use
metrics such as the Bayesian and Kashyap Information Criteria (e.g., the Maximum Likelihood Bayesian Model
Averaging or MLBMA approach proposed by Neuman 2003) and Akaike Information Criterion-based model
averaging (AICMA) (Poeter and Anderson 2005). These different techniques can often lead to significantly
different relative model weights and ranks because of differences in the underlying statistical assumptions about
the nature of model uncertainty. This paper provides a comparative assessment of the four model averaging
techniques (GLUE, MLBMA with KIC, MLBMA with BIC, and AIC-based model averaging) mentioned above
for the purpose of quantifying the impacts of model uncertainty on groundwater model predictions. Pros and
cons of each model averaging technique are examined from a practitioner’s perspective using two groundwater
modeling case studies. Recommendations are provided regarding the use of these techniques in groundwater
modeling practice.
Introduction
Groundwater modeling and decision making is beset
with uncertainty caused by incomplete knowledge of
the underlying system and/or uncertainty due to natural
variability in system processes and field conditions. The
different sources of uncertainty in the modeling process
can be categorized as follows:
• Conceptual uncertainty: The first step in modeling
is to build a conceptual model of the underlying
system. Decisions about the conceptual model are
often made with imperfect or incomplete knowledge
of the system, which leads to uncertainties in the
conceptualization of the model itself. This is shown
in Figure 1 by the multiple polynomial curves fit to a
dataset (shown by black dots). Each curve represents
an alternative conceptualization of the relationships
between the independent and state variables.
• Parametric uncertainty: A model can have numerous
parameters that need to be specified, often in the
absence of sufficient data, leading to parametric uncertainty. As shown in Figure 1, parametric uncertainty
can be of two kinds:
◦ Unconditional uncertainty: Parameters that are
directly specified (based on expert judgment or literature values) are uncertain because of the lack of
knowledge or insufficient data. Such uncertainty is
often referred to as unconditional uncertainty, since
it is not conditioned on field values, and is typically
characterized by a probability distribution based on
subjective judgment (shown by the second part of
Figure 1).
◦ Conditional uncertainty: Calibrated or conditioned
parameters are those that lead to an acceptable degree
of agreement between model behavior and field
observations. Conditioning on past observations generally leads to improved predictive ability—unless
the calibration space is substantially different from
the predictive space (e.g., groundwater flow vs. reactive contaminant transport). More will be said on
this issue later in the section. Uncertainty in calibrated parameters can be due to (1) errors in the field
data that the parameters are being calibrated against;
(2) insensitivity of the parameters to the model predictions; and (3) correlations within parameter sets
with respect to model predictions. These types of
uncertainties are demonstrated by the bottom three
plots in Figure 1—the first showing the mismatch
in the true parameter and the calibrated parameter,
the second showing the lack of sensitivity of certain
parameters, and the third showing the correlations
in two parameters that can lead to response surface
of the calibration objective to have multiple optima.
Both insensitivity and correlations in parameters lead
to certain parameters remaining uncalibrated. These
two problems in model calibration lead to what is
referred to as equifinality or nonuniqueness (Beven
and Freer 2001). Insensitive parameters cannot, in
essence, be calibrated because the model behavior
is not constrained by such parameters. Correlations
within parameters can mean that while it may be
possible to uniquely identify a group of parameters together, it would be difficult to separate each
parameter and give it a unique value. Errors in field
data can, of course, lead to erroneous calibration of
model parameters, which in turn adds to the uncertainty in these parameter values. Such uncertainties
are also likely exacerbated by error in the model
itself, although a discussion on the characterization
of model structural error is beyond the scope of this
paper.
• Stochastic uncertainty: Even with a well-conceptualized
and well-calibrated model, there exists natural variability in field conditions that can lead to uncertainty in
predictions. To make robust decisions, the variability
needs to be incorporated in the decision-making process. This is typically done by considering stochastic
realizations for the various model inputs.

Figure 1. Schematic for different types of uncertainty in modeling.
Of the above-mentioned sources of uncertainty, the
focus of uncertainty analysis in groundwater modeling has traditionally been parametric uncertainty. This
paper, however, concerns itself with the more fundamental issue of conceptual model uncertainty. Conceptual
model uncertainty in groundwater models typically arises
due to (1) inadequate representation of physical processes;
(2) incomplete understanding of the subsurface geologic
framework; and (3) inability of the model to properly
explain all of the available observations of state variables.
The limited literature on the assessment of alternative
models suggests that it is possible to develop models consistent with geologic data that yield very different hydrologic predictions. This is particularly true for groundwater
models, where the data used for calibration (typically
hydraulic heads) may not be of the same scale or sensitivity as the predictions (often contaminant transport). Owing
to these reasons, a growing understanding has emerged in
recent years regarding the need to expand the modeling
paradigm to include more than one plausible conceptual
model of the system.
The need to move away from one unique model to
a set of multiple models for predictions was identified
early on by Delhomme (1979), Neuman (1982), Hoeksema and Kitanidis (1989), Wagner and Gorelick (1989),
Beven (1993), Neuman and Wierenga (2003), and Poeter
and Anderson (2005) among others. Beven (1993, 2000)
laid out the argument that a unique model with an “optimal” set of parameters is inherently unknowable. Instead,
he argued for a set of acceptable and realistic model
representations that are consistent with the data. Work
such as National Research Council (2001), Neuman and
Wierenga (2003), Carrera and Neuman (1986), and Samper and Neuman (1989) have also shown that considering
only one conceptual model for a particular site can lead
to poorly informed decisions.
Given these multiple models, it becomes essential to
assess the likelihood or probability of each model. Without such likelihood measures, models would be assumed
to be equally likely and it is possible that the resulting
uncertainty is much higher than reasonable. Once the likelihoods have been assessed, model predictions would have
to be based on a weighted average (proportional to the
model likelihoods) over the ensemble of models. The task
of model averaging is thus closely linked to the task of
assessing the likelihood of alternative conceptual models.
To this end, several approaches have been proposed for dealing with model uncertainty (and averaging). These fall into two broad categories: methods that use Monte Carlo sampling across multiple model/parameter combinations to estimate the posterior probabilities, and methods that use metrics such as the Akaike, Bayesian, or Kashyap Information Criteria for this purpose (the criterion-based approaches all calculate posterior probabilities in a similar way, differing only in the chosen criterion). Generalized Likelihood Uncertainty Estimation or GLUE (e.g.,
Beven and Binley 1992) is an example of the first type of
approach. Examples of criterion-based model averaging
include Maximum Likelihood Bayesian Model Averaging
or MLBMA (e.g., Neuman 2003) that uses the Bayesian
and Kashyap Information Criteria (BIC and KIC, respectively) and Akaike Information Criterion (or AIC)-based
model averaging (Poeter and Anderson 2005). While there
are similarities within all these approaches, the major
differences lie in the way they ascribe likelihood (or probability) to the different models being considered. Unfortunately, more often than not, different model averaging
techniques lead to remarkably different model likelihoods
(and hence ensemble predictions). The proponents of each
technique have pointed to the theoretical and practical
advantages for each approach, while contrary views have
been expressed by other researchers. As such there is
no consensus within the research community, and the
modeler’s dilemma remains—which technique (if any)
to utilize for model averaging.
A note of caution is due at this stage. In many
instances, the calibration and testing space is often different from the predictive space—that is, there is insufficient
data or evidence to validate many of the assumptions and
parameters that have been used during the modeling process, especially with respect to the predictive behavior of
the model. The likelihoods mentioned earlier are, obviously, based on the same data that have been used to
calibrate and test the model. Thus, these likelihoods are at
best “surrogates” for the true likelihoods for a given set of
models. As the calibration and predictive space become
more similar, these surrogates become more consistent
with the true likelihood of the models. The practitioner is
thus encouraged to (1) consider different types of available data sources and formulations when assessing the
likelihoods of the models and (2) approach these likelihoods with the requisite caution.
The objective of this paper is to provide some clarity
to the practitioner by providing a comparative assessment
of different model averaging techniques for the purpose
of quantifying the impacts of model uncertainty on
groundwater model predictions. We begin with a brief
description of the theoretical background for each model
averaging technique. Next, we present a case study
applying these techniques for estimating the impacts of
uncertainty in predictions for a groundwater flow and
transport model of the Nevada Test Site.
(A second case study looking at the impact of uncertainty in multiple recharge models for the Death Valley
regional flow model has been provided in the Supporting
Information section.) Finally, some recommendations are
provided regarding the use of model averaging in groundwater modeling practice.
Techniques for Model Averaging
Generalized Likelihood Uncertainty Estimation (GLUE)
GLUE was originally proposed for dealing with
model nonuniqueness in catchment modeling. It is based
on the concept of “equifinality,” that is, the possibility
that the same final state may be obtained from a variety
of initial states (Beven and Binley 1992). In other words,
a single set of observed data may be (nonuniquely)
matched by multiple parameter sets that produce similar
model predictions. In the GLUE framework, the feasible
parameter space is first sampled to produce many equally
likely parameter combinations (realizations)—each of
which can be thought of as an alternative conceptual
model. Discrete alternatives can also be considered in lieu
of alternative parameter sets. The output corresponding to
each realization (or model alternative) is compared against
actual observations. Only those realizations (or models)
that satisfy some acceptable level of performance (e.g.,
a maximum sum-of-squared weighted residuals), also
known as the behavioral threshold, are retained for further
analysis, and the nonbehavioral realizations (models) are
rejected. A “likelihood” for each model is then computed
as a function of the misfit between observations and model
predictions. The weights (or probabilities) for each model
are estimated by normalizing the likelihoods.
One of the central features of GLUE is the flexibility
with respect to the choice of the likelihood measure. As
the name “generalized likelihood” implies, any reasonable
likelihood measure can be used appropriately as long
as it adequately represents the experts’ understanding of
the relative importance of different data sources used to
assess model accuracy. In the literature, many different
likelihood measures based on goodness-of-fit metrics have
been proposed. One likelihood measure that has seen
widespread usage in the GLUE literature is given by the
inverse weighted variance:
L_j = Π_l ( σ²_l / σ²_{e,j|l} )^N        (1)
where L_j is the likelihood for model j, l is the number of state variables (data types), σ²_{e,j|l} is the variance of the errors for model j and data type l (i.e., of the error residuals), σ²_l is the variance of the observations of data type l, and N is a shape factor such that values of N ≫ 1 tend to give higher weights (likelihoods) to models with better agreement with the data, and values of N ≪ 1 tend to make all models equally likely. The variance of the errors σ²_{e,j|l} for data type l is given by:
σ²_{e,j|l} = SSR_{j,l} / n_l        (2)
where SSR is the sum-of-squared residuals for the jth
model predictions and observations (of data type l), while
n is the number of observations (for data type l).
Other forms of the likelihood functions include the
Nash-Sutcliffe efficiency index (Nash and Sutcliffe 1970)
given by:
L_j = Π_l ( 1 − σ²_{e,j|l} / σ²_l )^N        (3)
and the exponential likelihood function (Beven 2000):
L_j = Π_l exp( −N σ²_{e,j|l} / σ²_l )        (4)
Normalizing the likelihoods, so that their sum is equal
to one, gives the GLUE weight for model j :
w_j(GLUE) = Pr_j L_j / Σ_{j=1}^{n} Pr_j L_j        (5)
where Lj is one of the likelihood functions described
above, Prj is the prior weight given to each model
(typically based on the modelers’ expert judgment), and
n is the total number of models being considered.
The GLUE approach can thus be considered as a
form of conditional uncertainty analysis, where the unconditional predictions (based on equally likely parameter
combinations) are conditioned by observations. The posterior probabilities for each realization can be used to weight
the sampled parameter values, leading to a posterior distribution for each uncertain input that is also conditioned
to observations.
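To make the weighting scheme concrete, the sketch below (Python with NumPy) computes GLUE weights from Equations 1, 2, and 5 for a single data type. The function and its inputs are illustrative assumptions, not part of any GLUE software; the behavioral threshold step mirrors the rejection of nonbehavioral realizations described above.

```python
import numpy as np

def glue_weights(ssr, n_obs, obs_var, N=1.0, priors=None, threshold=None):
    """GLUE weights from Equations 1, 2, and 5 (single data type).

    ssr       -- sum-of-squared residuals for each model
    n_obs     -- number of observations n
    obs_var   -- variance of the observations (sigma_l^2)
    N         -- GLUE shape factor
    priors    -- prior model weights Pr_j (uniform if None)
    threshold -- behavioral threshold on ssr; models above it are rejected
    """
    ssr = np.asarray(ssr, dtype=float)
    err_var = ssr / n_obs                       # Equation 2: variance of the errors
    likelihood = (obs_var / err_var) ** N       # Equation 1: inverse weighted variance
    if threshold is not None:
        likelihood[ssr > threshold] = 0.0       # reject nonbehavioral models
    if priors is None:
        priors = np.ones_like(ssr)
    weighted = priors * likelihood
    return weighted / weighted.sum()            # Equation 5: normalize to unit sum

# Illustrative WSSR values (loosely based on Table 3 later in the paper):
w = glue_weights(ssr=[11.4, 10.9, 15.2, 31.7, 55.9], n_obs=38, obs_var=4.0)
```

Note that with N = 1 the observation variance cancels in the normalization, so the weights depend only on the relative error variances of the models; larger N sharpens the weights toward the best-fitting models.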
GLUE is a generalizable framework and is applicable to almost all types of problems. However, certain
aspects of the methodology have generated controversy
in recent years (e.g., Mantovan and Todini 2006; Vogel
et al. 2007). These include (1) a lack of statistical basis
for the likelihood and threshold measures used for model
selection and weighting; (2) lack of dependence of most
likelihoods on the number of data points (since all formulations from Equations 1 to 4 depend on the average residual, σ²_{e,j|l} in Equation 2 with n in the denominator, not the total residual, two models with the same average residual but different numbers of data points will
be deemed equivalent by GLUE); (3) the computational
burden required due to the need for extensive Monte Carlo
simulations; and (4) the fact that GLUE does not require
the model structure and parameters to be optimized (calibrated), which could lead to overestimation of predictive
uncertainty. Moreover, there is typically no acknowledgment of differences in model complexity in the likelihood
functions used. This is in contrast to methods that use
criterion-based likelihoods (as discussed in later sections),
where model complexity is an important component of the
weight ascribed to the model.
Beven (2006) has answered some of these criticisms
by contending that (1) formal Bayesian model averaging
(BMA) approaches are a special case of GLUE and
are applicable under certain strong assumptions and
(2) optimization or model selection can be used within the
GLUE framework to reduce uncertainty. In recent years,
the link between GLUE and optimization has become
stronger with the work of Mugunthan and Shoemaker
(2006), who showed that optimization can in fact be
used to generate alternative models for GLUE, leading
to efficiency enhancements for the GLUE framework by
eliminating the need for Monte Carlo trials to generate
model alternatives. Finally, with regard to the debate
between the GLUE and Bayesian methods, Beven (2008)
argues that “. . . the best approach to estimating model
uncertainties is a Bayesian statistical approach, but that
will only be the case if all the assumptions associated
with the error model can be justified.”, and that “simple
assumptions about the error term may be difficult to
justify as more than convenient approximations to the real
nature of the errors,” finally cautioning that “. . . making
convenient formal Bayesian assumptions may certainly
result in over estimating the real information content of
the data in conditioning the model space.”
Bayesian Model Averaging Techniques
The BMA framework was propounded by Draper (1995),
Kass and Raftery (1995), and Hoeting et al. (1999)
and is based on a formal Bayesian formulation for the
posterior probabilities of different conceptual models.
The most commonly used Bayesian model averaging
paradigm in hydrology is MLBMA (Neuman 2003).
MLBMA is a special case of the BMA approach, in
that it approximates the Bayesian posterior probability by
using the concept of “information criteria” to calculate
the posterior probabilities rather than computing these
probabilities directly.
In the Bayesian framework, the posterior weights
(probabilities) for model Mj given the data (D) can be
calculated using Bayes’ rule as follows:
p(M_j|D) = p(D|M_j) p(M_j) / Σ_j p(D|M_j) p(M_j)        (6)
where p(Mj ) is the prior probability of model Mj (similar
to Prj used in Equation 5 for GLUE) and p(D|Mj ) is the
model likelihood reflected by the level of agreement (or
lack thereof) between predictions of the model Mj and
the observed data, D. This model likelihood is given by:
p(D|M_j) = ∫ p(D|θ_j, M_j) p(θ_j|M_j) dθ_j        (7)
Here θ_j is the parameter set associated with model j, p(θ_j|M_j) is the prior probability of the parameters, and p(D|θ_j, M_j) is the joint probability of model j and is a function of the errors with respect to the field data (D).
The prior probabilities for the model, p(Mj ), are typically
obtained using expert elicitation (Ye et al. 2005, 2008b)
or based on a noninformative prior (i.e., all models are
equiprobable). The prior probabilities for the parameters,
p(θ j |Mj ), can either be calculated from the data or also
through an expert elicitation process (if there are not
enough data to infer this distribution).
The BMA calculation requires the integral in Equation 7 to be evaluated, which is typically done through
exhaustive Monte Carlo simulations of the parameter
space θ . This can be computationally very demanding,
and thus Neuman (2003) proposed a variant of the
BMA approach called MLBMA. MLBMA approximates
this integral by using likelihood measures such as the
Kashyap Information Criterion (KIC) (Kashyap 1982) or
the Bayesian Information Criterion (BIC) (Schwarz 1978),
which are evaluated for each model calibrated to the
maximum likelihood estimator for the parameter set.
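As a hedged illustration of the integration step, the following sketch evaluates Equation 7 by brute-force Monte Carlo (Python with NumPy). The sample_prior and log_likelihood functions are hypothetical placeholders for a model-specific prior sampler and error likelihood; each likelihood evaluation would normally require a full forward model run, which is what makes this approach so demanding.

```python
import numpy as np

def bma_model_likelihood(sample_prior, log_likelihood, n_draws=10_000, seed=0):
    """Monte Carlo estimate of Equation 7:
    p(D|Mj) = integral of p(D|theta_j, Mj) p(theta_j|Mj) d theta_j.

    sample_prior   -- draws one parameter set from p(theta_j|Mj)
    log_likelihood -- returns ln p(D|theta_j, Mj) for one parameter set
    """
    rng = np.random.default_rng(seed)
    log_l = np.array([log_likelihood(sample_prior(rng)) for _ in range(n_draws)])
    # average the likelihood over prior draws, using log-sum-exp for stability
    m = log_l.max()
    return np.exp(m) * np.mean(np.exp(log_l - m))
```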
The starting point for MLBMA is a collection of
models that have been calibrated to observed data using
maximum likelihood estimation. The model likelihood is
then estimated using:
p(D|M_j) ∝ exp( −Δ_j / 2 )        (8)

with:

Δ_j = BIC_j − BIC_min        (9)

or

Δ_j = KIC_j − KIC_min        (10)
where Δ_j is the difference between the BIC or KIC
measure for the j th model and the minimum BIC or
KIC value among all competing models (given by BIC_min or KIC_min in Equations 9 and 10). Assuming a multi-Gaussian error distribution with unknown mean and
variance for the model likelihood in Equation 7, the BIC
and KIC terms can be written as (Ye et al. 2008b):
BIC_j = n ln(σ̂²_{e,j}) + k_j ln(n)        (11)

and

KIC_j = (n − k_j) ln(σ̂²_{e,j}) − 2 ln p(θ̂_j) − k_j ln(2π) + ln |X_jᵀ ω X_j|        (12)
where n is the number of observations, k_j is the number of parameters for model j, θ̂_j is the maximum likelihood estimator for the parameters from model j, p(θ̂_j) is
the prior probability (either assessed from field data or
through expert elicitation) for the parameter estimate, and
σ̂²_{e,j} is the maximum likelihood estimator for the variance
of the error residuals (e) estimated from the weighted
sum-of-squares residuals for model j with the maximum
likelihood estimator for the parameters as:
σ̂²_{e,j} = ( e_jᵀ ω e_j / n ) |_{θ_j = θ̂_j}        (13)
where ej is the calibration error vector, n is the number of
samples, θ̂ j is the maximum likelihood estimator for the
parameters, and ω is a weight factor, which theoretically
is given by the covariance between the data points. It
is common to assume uncorrelated data leading to a
diagonal matrix with the variance of the data points along
the diagonal. In many cases, the unbiased “least-square”
formulation may be used where, instead of n, (n − kj )
is used in the denominator, with kj being the number of
calibrated parameters in the model j . Also note that for
the sake of simplicity and without loss of generality, we
have assumed only a single data type (unlike the GLUE
formulation presented in Equations 1 to 5, which were for
multiple data types).
The last term in Equation 12, |X_jᵀ ω X_j|, is the determinant of the Fisher information (FI) matrix, where X_j is the Jacobian (sensitivity) matrix, X_jᵀ is its transpose, and ω is the weight matrix. The Fisher matrix requires
calculation of derivatives of the calibration measures with
respect to the model parameters (a nontrivial task for
highly parameterized models)—and therefore represents
the sensitivity of the model output to the parameters. Ye
et al. (2004) have shown that using the KIC metric gives
a better (more unbiased) measure of the model likelihood
than BIC. The metric takes into account the information
content in the data as given by the sensitivity of the model
output with respect to the parameters, selecting more
complex models (with a greater number of parameters)
only when the data support such a choice. Ye et al.
(2008b) also showed that from a theoretical standpoint,
BIC asymptotically converges to KIC as the number
of calibration data increases relative to the number of
parameters (i.e., n k).
The MLBMA model weights using either BIC or KIC
can be given by:
w_j(MLBMA) = exp(−0.5 Δ_j) p(M_j) / Σ_k exp(−0.5 Δ_k) p(M_k)        (14)
where Δ_j is given by Equation 9 or 10 and p(M_j)
are prior probabilities of the models (typically given by
the expert, expressing his or her knowledge about the
suitability of different models).
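Once each calibrated model's criterion value is in hand, the MLBMA weighting of Equations 8 through 14 reduces to a few lines. The sketch below (Python with NumPy) is a minimal illustration under the assumptions stated in the text (calibrated models, a single data type); it is not the implementation used in the case studies.

```python
import numpy as np

def bic(n, k, err_var):
    """Equation 11: BIC_j = n ln(sigma_hat^2_e,j) + k_j ln(n)."""
    return n * np.log(err_var) + k * np.log(n)

def mlbma_weights(criterion, priors=None):
    """Equation 14, given BIC_j or KIC_j values for all competing models."""
    ic = np.asarray(criterion, dtype=float)
    delta = ic - ic.min()                       # Equations 9-10: Delta_j
    if priors is None:
        priors = np.ones_like(ic)               # noninformative prior p(Mj)
    unnorm = np.exp(-0.5 * delta) * priors      # Equation 8 times the prior
    return unnorm / unnorm.sum()
```

Because Δ_j sits inside an exponential, a model whose criterion is only 10 units above the minimum already receives less than 1% of the best model's weight, which is the concentration effect discussed next.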
There are two key aspects of the KIC and BIC based
model weights: (1) the use of the Δ_j term, which can
vary from 0 (for the model with the minimum KIC or
BIC metric, see Equations 9 and 10) to many orders
of magnitude higher (for the models with higher KIC
and BIC metrics) and (2) the exponential weighting in
Equation 14 that tends to apportion most of the posterior
weights to relatively few models exhibiting marginally
better agreement with the data. The distribution of weights
becomes narrower as the number of observations increases,
since the value of n linearly affects the nominal values of
BIC and KIC (per Equations 11 and 12). From a Bayesian
standpoint this makes sense, as with more data there
needs to be less uncertainty amongst competing models
(Poeter and Hill [2007] emphasize this, as well). However,
Beven (2008) has pointed out that this is only desirable if
the error structure assumed by the averaging technique
is consistent with the “real” error structure. If this is
not the case, then model averaging techniques such as
MLBMA may overestimate the information content of the
data while conditioning the model.
The additional FI term used in MLBMA (Equation 12) has been a source of much confusion and debate
in the literature. Higher FIs indicate that the model outputs (calibration measures) have higher Jacobians (sensitivities) to the model parameters, which in turn indicates higher information content in the data points. From
Equation 12, it is also apparent that increasing the FI
term decreases the model likelihood (low KIC values
correspond to higher likelihoods). This may be deemed a
nonintuitive result, as for two models with the same accuracy (residuals) and complexity (number of parameters)
KIC favors the model with lower sensitivities. Ye et al.
(2008b) explain this by pointing out that more “information content” in the observed data (i.e., higher FI values)
should lead to improved model performance—if it does
not then the model has less basis to be selected (lower
likelihood). In other words, the Fisher term reestablishes
the performance standard for a model—the higher the
information content in the data vis-à-vis the model parameters, the better the model needs to perform for it to be
given a higher likelihood by MLBMA. Yet another way
to look at the Fisher term is to think of it as a means
of supporting complexity in the model. Ye et al. (2008b)
argue that KIC balances parsimony (as expressed by the
penalty term for the number of parameters) with expected
information content in the data. Thus, higher FI content in
the calibration data indicates that more complex models
can be supported by the data (and can be selected with
high likelihoods), whereas low Fisher terms mean that
the data do not support model complexity, and simpler,
less accurate models may be more appropriate.
BMA has been questioned by Domingos (2000), who
has argued that model combination by its very nature
works by enriching the space of model hypotheses, not
by approximating a Bayesian distribution function. In
that study, Domingos (2000) compared BMA with other
model averaging techniques and showed that BMA tends
to underestimate the predictive uncertainty. However, others such as Minka (2000) have contended that these results
are hardly surprising because by definition techniques like
BMA, and especially MLBMA, are built on the intrinsic assumption that there is a unique model of reality
(i.e., there is only one mode in the conditional distribution—representing the most likely model). This is borne
out in the original MLBMA paper by Neuman (2003),
where he lays out the fundamental assumption for this
technique—“only one of the (alternative) models is correct even in the event that some yield similar predictions
for a given set of data.” Thus, strictly speaking MLBMA
is more a model selection technique than a model combination methodology. Note that unlike model averaging,
model selection (or ranking) is simply based on the relative magnitude of the BMA criterion (either BIC or
KIC), and thus is not affected by the exponential dependence on n.
It is worth noting that the formulations shown earlier require the models to be well calibrated (normally
distributed errors, etc.) and the residual variance (σ̂²_{e,j})
assessed using the calibrated parameters. In fact, the
error distribution used is typically unimodal, with the
mode approximated by the “calibrated” model. In the
case of highly parameterized models, there is bound to be
nonuniqueness in the parameter domain (and thus multimodality in the calibration response surface). The applicability of MLBMA and BMA in such cases is not clear.
In such cases, it is advisable that the dimensionality of the
model parameters be reduced (thereby introducing some
level of uniqueness in the calibrated parameter set) before
applying this methodology.
The final point that needs to be made about the
formulation shown earlier is that most applications assume
uncorrelated data points leading to a diagonal weighting
matrix. In reality, more often than not, the errors are
correlated and there is often much redundancy in the
data. In essence, this reduces the information content
in the data points and needs to be reflected in the
weighting scheme used for the weighted sum-of-squared
residuals calculation and may spread the weights across
the different models (see Hill and Tiedeman [2007] for a
discussion on the diagonal weight matrix assumption).
Variance Window-Based MLBMA
The previous section highlighted the issue with
MLBMA distributing most of the posterior weights to
a few models that exhibited marginally better calibration performance. Tsai and Li (2008) have proposed an
approach to address this by using the concept of “variance
window” to modify the MLBMA scheme. The motivation for their work was the realization that BMA tended
to assign most of the weights to a few models that
exhibit marginally better calibration performance (due
to the exponential weighting and the Δ_j term used in
Equation 14—see discussion in the preceding section).
Tsai and Li (2008) contended that this stringency in the
model averaging criteria is a result of the underlying
assumption of “Occam’s windows” (Madigan and Raftery
1994) that only accepts models in a very narrow performance range. Occam’s window is defined by Raftery
(1995) as the range within which the model performance
of two competing models is statistically indistinguishable—that is, if the difference between the calibration
NGWA.org
metrics of two models (with the same complexity) is less
than the Occam’s window then they will both be accepted.
Raftery (1995) pointed out that for sample sizes between
30 and 50 data points, an Occam’s window of 6 units
in the BIC metric (Δ_j in Equation 9) roughly corresponded to a significance level of 5% (in t statistics) in
conventional hypothesis testing terms.
Over the years there has been growing realization
that this Occam’s window for model acceptance may be
too restrictive leading to biased results (see appended
comments to Hoeting et al. [1999]; Tsai and Li 2008). To
reduce this overweighting and the resulting bias, Tsai and
Li (2008) introduce the concept of a “variance window”
as an alternative to the Occam’s window for selection
with the BMA. The variance window is determined by
including a scaling factor α with BIC (and KIC), where
α is given by:
α = s_1 / (s_2 σ_D)        (15)
where σ D is the standard deviation of the chi-square
distribution for the “goodness-of-fit” criterion used in
formulating KIC or BIC (see Tsai and Li [2008] for
details). The variance of the chi-square distribution is given by 2n (i.e., σ_D = √(2n)), where n is the number
of observations, s1 is the size of the Occam’s window
corresponding to the given significance level, and s2 is
the width of the variance window in terms of σ D . As the
width of the variance window becomes larger, α becomes
progressively smaller than 1. Note that since the minimum
size of the variance window is the Occam’s window, the
value of α is never larger than 1. When the concept of this
variance window is incorporated into the model averaging
process, the posterior model probabilities (also the model
averaging weights) are given by:
w_j(MLBMA) = exp(−½ α Δ_j) / Σ_k exp(−½ α Δ_k)        (16)
where Δ_j is given by Equation 9 or 10. It can be seen
that α is a multiplicative factor that when multiplied
with BIC or KIC (as the case may be) reduces the
impact the exponential term has on the weighting. For
α = 1, the weighting is identical to the BIC or KIC based
weights, and for α = 0 all models are equally weighted
irrespective of their calibration performance. Tsai and Li
(2008) also provide a table for recommended values of α
corresponding to different significance levels and variance
window sizes, which are shown in Table 1.
The variance window concept was originally derived
only for Bayesian model averaging by Tsai and Li (2008).
It is not entirely clear if a similar α factor can be
applied to AIC-based likelihoods (to be discussed in the
next section), and if so then what significance level and
variance size would such factors correspond to. Thus, for
this study the variance window concept has only been
used with the KIC-based cumulative distribution function
(CDF) (i.e., Δ_j is given by Equation 10).
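The modification amounts to rescaling Δ_j by α before the exponential weighting. The sketch below (Python with NumPy) assumes s_1 = 6 criterion units for the 5% significance level, consistent with the Occam's window of 6 BIC units cited above; with σ_D = √(2n), this reproduces the tabulated α-values in Table 1 (note that 4.24 ≈ 6/√2).

```python
import numpy as np

def variance_window_alpha(n, s1=6.0, s2=1.0):
    """Equation 15: alpha = s1 / (s2 * sigma_D), with sigma_D = sqrt(2n).

    s1 -- Occam's window size (about 6 criterion units at the 5% level)
    s2 -- variance window width in multiples of sigma_D
    """
    sigma_d = np.sqrt(2.0 * n)
    return s1 / (s2 * sigma_d)

def mbma_weights(criterion, n, s2=1.0):
    """Equation 16: exponential weights on alpha * Delta_j."""
    ic = np.asarray(criterion, dtype=float)
    delta = ic - ic.min()
    unnorm = np.exp(-0.5 * variance_window_alpha(n, s2=s2) * delta)
    return unnorm / unnorm.sum()

# For n = 38 observations: alpha is about 0.69, 0.34, and 0.17 for variance
# windows of 1, 2, and 4 sigma_D, matching the application discussed later.
alphas = [variance_window_alpha(38, s2=s) for s in (1.0, 2.0, 4.0)]
```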
Table 1
α-Values for Different Variance Window Sizes and Significance Levels (from Tsai and Li 2008)

                          Variance Window Size
                          σ_D        2σ_D       4σ_D
Significance level 5%     4.24/√n    2.12/√n    1.06/√n
Significance level 1%     6.51/√n    3.26/√n    1.63/√n
Information Theory-Based Model Averaging
Information theory provides a rich literature for
the assessment of relative model performances as the
likelihood of a model can be assumed to be related to
the value of “information” it provides. The most popular
information theory-based measure in use is the AIC. A
recently developed publicly available model averaging
software called multimodel analysis or MMA (Poeter and
Hill 2007) provides a generalized framework that can
be used to rank models and calculate posterior model
probabilities. While MMA allows the user to choose
or define the model criterion and the model averaging
equations (including the MLBMA formulation), for this
work we implement the AIC component of MMA, which
we refer to as Akaike Information Criterion-based model
averaging (AICMA).
The AICMA framework works similarly to the
Bayesian framework, although there are significant philosophical differences between the two approaches. The
AIC is used to approximate the Kullback-Leibler (K-L)
metric, a measure of the loss of information when an
imperfect model (M_j) is used to approximate the "real" (and unknown) model f. The K-L distance (I) between model M_j and f is defined as:
I[f, M_j] = ∫ f(x) log( f(x) / p(M_j|θ_j) ) dx        (17)
where f(x) is the real distribution and p(M_j|θ_j) is the distribution of model M_j given the set of calibrated parameters θ_j. Obviously, since the real distribution f
is not known, this term cannot be calculated. However,
the relative K-L information can be approximated using
the AIC (Akaike 1973) given by:
AIC_j = n ln(σ̂²_{e,j}) + 2k_j        (18)
To further correct for the bias introduced from small
sample sizes, a modified AIC equation (Hurvich and Tsai
1989; Poeter and Anderson 2005) has been proposed as
follows:
AICc_j = n ln(σ̂²_{e,j}) + 2k_j + 2k_j(k_j + 1)/(n − k_j − 1)        (19)
where the extra term in Equation 19 as compared to
Equation 18 accounts for second-order bias that may
result from a limited number of observations, for example,
when n/k < 40. This work uses the AICc metric as
defined in Equation 19 for likelihood estimation.
In a manner similar to Equation 14, the AICMA model weights can be written as:

w_j(AICMA) = exp(−0.5 ΔAICc_j) p(M_j) / Σ_j exp(−0.5 ΔAICc_j) p(M_j)        (20)
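A matching sketch for the AICc-based weights (Equations 19 and 20), in the same Python/NumPy style and again assuming residual variances from calibrated models:

```python
import numpy as np

def aicc(n, k, err_var):
    """Equation 19: AICc with the second-order small-sample correction."""
    return n * np.log(err_var) + 2 * k + (2 * k * (k + 1)) / (n - k - 1)

def aicma_weights(aicc_values, priors=None):
    """Equation 20: exponential weights on Delta AICc."""
    ic = np.asarray(aicc_values, dtype=float)
    delta = ic - ic.min()
    if priors is None:
        priors = np.ones_like(ic)
    unnorm = np.exp(-0.5 * delta) * priors
    return unnorm / unnorm.sum()
```

Only differences in AICc matter, so any constant offset in the criterion leaves the weights unchanged.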
Theoretically, the fundamental difference between the
AICMA and Bayesian approaches lies in their conception
of a model. Since AICMA is based on an information
theoretic framework, it assumes that all models are
approximations and it is impossible to perfectly capture
reality. While the goal for AICMA therefore is to select
models with increasing complexity as the number of
observations increases, the goal for MLBMA is to strive
for models with consistent complexity (i.e., constant
k), regardless of the number of observations (since the
penalty term for model complexity is not dependent on the
number of observations). Of course, use of the FI matrix in
the KIC calculation leads to lower probabilities for more
complex models, if such complexity is not supported by
the data, thereby alleviating some of the problems with
the consistent complexity assumption.
Despite these differences, the AICMA approach
shares some of the behavior, in terms of posterior weight
distribution, of MLBMA due to the use of the Δ term
and exponential weighting in Equation 20, which results
in larger weights being given to models that exhibit optimal or near optimal error residuals. The definition of AICc
(like that of KIC and BIC) exhibits a linear dependence on
n, which implies that the AICc weights are proportional to (1/σ̂²_{e,j})^n, whereas the GLUE weights are proportional to (1/σ̂²_{e,j}). This is the primary source of difference in
inferring posterior model probabilities with GLUE vs.
MLBMA or AICMA.
Application
The various model averaging techniques were applied
to two case studies: (1) a case study first presented in
Ye et al. (2006), who used this case study to assess
conceptual model uncertainty in the Death Valley regional
flow system; and (2) a case study involving a groundwater
model developed for one of the corrective action units at
the Nevada Test Site.
Details of the application to the Death Valley recharge
model are given in Supporting Information, section A.
The main conclusion that could be drawn from this case
study was that there was a lack of consistency with
respect to model rankings for the different model averaging schemes, with different model averaging techniques
preferring different conceptual models as the top-ranked
model. The model weights given by different model averaging techniques were also disparate, with GLUE giving
more uniformly distributed weights and the other techniques (such as AICMA, MLBMA-KIC, and MLBMA-BIC) giving most of the weights to one or two models.
The second case study used for testing the methodologies is based on a groundwater model developed for
one of the corrective action units (at the Frenchman Flat,
an alkaline desert depression) at the Nevada Test Site.
One of the objectives for the Frenchman Flat model is to
provide an estimate of the vertical and horizontal extent of
radionuclide migration for use in regulatory decision making (Belcher 2004; Stoller-Navarro Joint Venture 2006).
The flow of water through the cavity of an underground
nuclear test is the prediction of interest for this case study.
Additional details of this case study are discussed in Supporting Information, section B.
For this case study, there is considerable uncertainty
about the underlying geology for the Frenchman Flat.
To address these uncertainties, nine alternative models
of groundwater flow—reflecting a combination of uncertainties in geologic framework, parameters, and conceptualization of recharge—were developed (Bechtel Nevada
2005; Stoller-Navarro Joint Venture 2006). These are
shown in Table 2. Each of the models was calibrated using
a combination of head and boundary flux data. A total of
38 head and flux measurements were used (i.e., n = 38 for
all calculations). The inversion code Model-Independent
Parameter Estimation (PEST) (Doherty 2004) was used
to calibrate each of the models. The objective was to
estimate the uncertainty (over the ensemble of alternative
models) in predictions of cavity flow.
Note that for MLBMA and AICMA, theoretically, k
(in Equations 11, 12, 18, and 19) should correspond to the
number of parameters that are uniquely estimated using
the calibration process (the maximum likelihood estimate
corresponds only to such parameters). The Frenchman
Flat models had more parameters than number of observations due to which not all could be uniquely identified.
Sensitivity analysis of the parameter space was undertaken
to extract the subspace of the most sensitive parameters
from the calibration process. This is referred to as singular value decomposition (SVD) and consists of identifying
the most dominant linear combinations (eigenvectors) of
the system parameters (see Moore and Doherty [2006]
on how to calculate this subspace of sensitive parameters). In effect, these are the only parameters that can
be calibrated uniquely, and thus the maximum likelihood
estimates essentially pertain to these parameters. Thus,
for the model averaging exercise, only the subspace of
sensitive parameters was considered for each model. For
this case, the top 15 sensitive parameters were chosen
for further analysis. All 38 data points were used for the
sensitivity analysis and subsequent calculations.
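The extraction of such a sensitive parameter subspace can be sketched with a truncated singular value decomposition of the weighted Jacobian. The sketch below (Python with NumPy) is a simplified, hypothetical stand-in for the SVD machinery in PEST, assuming the Jacobian and observation weights are available as arrays:

```python
import numpy as np

def sensitive_subspace(jacobian, weights, n_keep=15):
    """Dominant parameter combinations via SVD (after Moore and Doherty 2006).

    jacobian -- (n_obs x n_par) sensitivity matrix X
    weights  -- observation weights (diagonal of omega)
    n_keep   -- number of superparameters to retain (15 in this case study)
    """
    xw = np.sqrt(np.asarray(weights, dtype=float))[:, None] * np.asarray(jacobian)
    # right singular vectors give orthogonal parameter combinations,
    # ordered by their influence on the weighted observations
    _, s, vt = np.linalg.svd(xw, full_matrices=False)
    return vt[:n_keep], s[:n_keep]     # eigenvector basis and singular values
```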
Table 3 shows the results from an application of
GLUE (with a shape factor of N = 1), MLBMA-BIC,
MLBMA-KIC, and AICMA to this test case. As can be
seen from the table, GLUE weights are more uniformly
distributed as compared with the other approaches, with
at least four models having weights more than 10%.
MLBMA-BIC and AICMA have nonnegligible weights
for only two models. On the other hand, MLBMA-KIC
assigns most of the weight to a single model.
Unlike the previous test case (see Supporting Information, section A), the model ranks for this case study are
more consistent. The model ranks for GLUE, AICMA,
and MLBMA-BIC are identical, while MLBMA-KIC has
a slight difference in relative ranks (to be discussed later).
This is primarily because the Frenchman Flat models all
have the same number of (effective) parameters. With different numbers of parameters, GLUE, AICMA, BIC, and
KIC can have discrepancies in model ranks because they
give different levels of importance to model complexity in the posterior weights. As the number of (sensitive)
parameters for each model is set to be the same (15),
both AICMA and MLBMA-BIC have identical model
weights and ranks. In fact, the model weights for all these
approaches are based purely on the calibration residual,
since the parsimony terms in Equations 11 and 19 cancel out when calculating ΔBIC and ΔAICc.
The KIC column in Table 3 shows that the relative
order of model weights given by KIC is different from
the GLUE, MLBMA-BIC, and AICMA weights. Using
the KIC metric selects model “floor_final” as the best
and “aniso_final” as the second best model, compared to
GLUE, MLBMA-BIC, and AICMA where this order is
reversed. The additional sensitivity term tends to favor the
model (floor_final) with slightly higher calibration error,
while the other model averaging techniques all favor the
model (aniso_final) with minimum calibration error. This
is consistent with the discussion on the Fisher term (see
section on Bayesian Model Averaging), as KIC requires
the model with higher information content to have a correspondingly lower model error. In this case, higher information content in the data is not adequately balanced by
a lower calibration error (for model aniso_final), and thus
the model with slightly higher calibration error but lower
information content (floor_final) is selected.
The CDFs for the flow prediction are presented
in Figure 2. The unconditional CDF corresponds to
all the models being uniformly weighted, with the
weighted CDFs corresponding to GLUE, MLBMA-BIC,
and AICMA weights, respectively. As expected, the
unconditional case has the largest spread. Conditioning
with GLUE leads to a reduction in variance (i.e., the shape
of the CDF is steeper than the uncalibrated case), and most
of the models participate in the model weighting process.
On the other hand, application of AICMA, MLBMA-KIC,
or MLBMA-BIC leads to most of posterior weights given
to a few (2–3) models with zero weight assigned to a
majority of the models. As expected, the CDFs for both
AICMA and MLBMA-BIC coincide.
Variance Reduction with Different Averaging Techniques
As shown in Figure 2, the spread in predictions from
various model averaging techniques can be quite different.
This is examined in detail by comparing the standard deviations of cavity flow shown in Figure 3. Not surprisingly,
the highest uncertainty is associated with the uncalibrated
case, with some reduction in variance for the GLUE case
(because of conditioning). However, results for MLBMA-BIC, MLBMA-KIC, and AICMA show a significant
reduction in predictive variance compared to the unconditional case, with MLBMA-BIC and AICMA leading to
almost 95% reduction and MLBMA-KIC to almost 99% reduction, respectively.
Table 2
Description of Frenchman Flat Model Cases

ANISO_FINAL: Permeability depth reduction tends to impose apparent anisotropy. Such additional anisotropy may be overly constraining flow. Depth-limited anisotropy was developed to test if this was the case.

FLOOR_FINAL: Indefinite permeability reduction with depth can effectively remove some parts of the geology from the flow system because they become impermeable. This approach imposes a lower limit, or floor, on permeability depth decay for the base framework.

BF_7_CAV: Base framework model with prior data. Tests the influence of such data and model parameter stability.

NDD2: Base framework model with limited alluvium and volcanic rock permeability depth decay.

BASE_FINAL: Best calibration base framework model.

DISP: This alternative is concerned with the locations and displacement of basin-forming faults. It juxtaposes shallow aquifers against deeper aquifers, allowing a hydraulic connection between volcanic aquifers underlying the AA in Frenchman Flat to carbonate aquifers east and south from the Rock Valley fault system. Juxtaposition removes zeolitic confining units from a potential flow path.

BASE_NODD: Base framework model without alluvium permeability depth decay. This tests if the conceptual model of permeability reduction with depth can give feasible results.

BLFA: The BLFA HSU is modeled as a single continuous flow unit, rather than three separate zones. It is located at or near the water table, which may affect flow and transport of radionuclides away from underground nuclear tests in the Northern Testing Area. Conceptually, the BLFA is a fractured rock, thus fracture/matrix processes are acting over a larger area.

CPBA: Some uncertainty exists in the distribution of pre-Tertiary HSUs, particularly the distribution of UCCU beneath CP basin. This alternative results in a continuous sheet of UCCU beneath CP basin. There are no direct transport consequences in terms of materials, but it broadly impacts the flow system.
This is consistent with the distribution of weights for these respective model averaging
techniques, as shown earlier in Table 3. It is interesting to
note that MLBMA-KIC leads to the least variance among
all the model averaging methodologies. This is of particular consequence when considering model uncertainty. As
shown in Table 3, it can be seen that while MLBMA-KIC
has almost the same order of ranks as GLUE, MLBMA-BIC, and AICMA, the difference between the best and
the second best models for MLBMA-KIC is much higher
than the other aggregation schemes (the rank 1 model for
MLBMA-KIC is two orders of magnitude more likely
than the rank 2 model). MLBMA-KIC, thus, tends to
further concentrate the weight for just a single model,
leading to much less predictive uncertainty than the other
techniques.
Additional analysis was also conducted to look at the
sensitivity of the GLUE CDFs to the shape factor. Details
of this analysis are presented in Supporting Information,
section C. It was seen that increasing the shape factor led
to more nonuniform GLUE weights, with better models
being given progressively higher weights.
Table 3
Model Weights and Ranks using Superparameters

Models        WSSR    k    GLUE     MLBMA-BIC  AICMA    MLBMA-KIC  GLUE  AICMA  MLBMA-BIC  MLBMA-KIC
                           Wts      Wts        Wts      Wts        Rank  Rank   Rank       Rank
BLFA          434     15   0.76%    0.0000     0.0000   0.0000     8     8      8          6
BASE_NODD     394     15   0.84%    0.0000     0.0000   0.0000     7     7      7          8
CPBA          1503    15   0.22%    0.0000     0.0000   0.0000     9     9      9          9
DISP          298     15   1.10%    0.0000     0.0000   0.0000     6     6      6          7
FLOOR_FINAL   11.44   15   28.78%   27.43%     27.43%   99.55%     2     2      2          1
ANISO_FINAL   10.87   15   30.29%   72.44%     72.44%   0.44%      1     1      1          2
BF_7_CAV      15.15   15   21.73%   0.13%      0.13%    0.0000     3     3      3          3
NDD2          31.71   15   10.38%   0.0000     0.0000   0.0000     4     4      4          4
BASE_FINAL    55.85   15   5.90%    0.0000     0.0000   0.0000     5     5      5          5

All 38 observations were used for all models.
Figure 2. Prediction uncertainty for cavity flow for different model averaging techniques.
In addition, as the shape factor was increased, the GLUE CDF tended to
converge to the AICMA and MLBMA-BIC CDFs.
Evaluation of Modified BMA
This section presents the results for the modified
BMA (mBMA) approach of Tsai and Li (2008) applied
to the NTS case study. For this application, the modified
approach was only applied with the KIC weighting
scheme. The variance window factors (α-values) given
in Table 1 were used with a significance level of 5%.
With 38 observations (n), the α-value was determined to
be 0.68 for a 1σ variance window, decreasing to 0.34 and
0.17 for 2σ and 4σ variance window sizes, respectively.
Model weights as per this modified BMA technique
were then calculated using Equation 16 for different values of α and compared to GLUE, MLBMA-BIC, and MLBMA-KIC (the latter two with the
original Occam’s window-based weighting). The results
for cavity flow are shown in Figure 4. As expected,
Figure 4 shows that decreasing values of α result in
a broadening of Occam’s window and correspondingly
smoother CDFs for prediction uncertainty in cavity flow
(for α = 1, mBMA is equivalent to MLBMA-KIC). Thus,
the generalization of Occam’s window (to Tsai and Li’s
variance window) allows additional plausible models to
be effectively weighted in the model averaging process.
In addition, with larger variance windows (lower αs)
the predictive CDF tends to become smoother leading to
higher predictive variance.
Figure 3. Standard deviation for cavity flow predictions for different conceptual models.
NGWA.org
A. Singh et al. GROUND WATER 48, no. 5: 701–715
711
Figure 4. Sensitivity of uncertainty in model predictions for modified MLBMA to different variance windows.
Recall that sensitivity analysis (section C, Supporting
Information) conducted with GLUE showed that with high
values of the shape factor N, the GLUE CDF coincided with the AICMA and MLBMA-BIC based CDFs. The impact of
different variance window sizes on the CDF is somewhat
different—in this case even with very large variance
window sizes, the mBMA CDFs (based on the KIC
metric) never coincide with either the GLUE or the
AICMA/MLBMA-BIC based CDFs. This is because of
the difference in the ranks of the highly weighted models
between the KIC scheme (on which mBMA is based)
and the GLUE and AICMA/MLBMA-BIC schemes. If,
for example, MLBMA-KIC and GLUE had the same
relative ordering of models, a sufficiently large variance window size for mBMA would lead to mBMA having the same weights as GLUE.
Recall that the difference in the relative order of model
weights for different techniques is essentially because
of the parsimony and sensitivity terms included in the
AICc, BIC, and KIC criteria, respectively. The GLUE
weights, on the other hand, do not depend on the number
of parameters in the model or the sensitivity of the model
parameters, and hence may or may not lead to the same
relative order of models as AICc, BIC, or KIC based
CDFs.
Concluding Remarks
This analysis provides a comparative assessment
of different groundwater model averaging techniques
for quantifying the impacts of model uncertainty on
groundwater model predictions. These techniques include
(1) GLUE, (2) MLBMA using both KIC and BIC,
(3) AICMA (using AICc), and (4) a modified BMA using
the variance window concept. Two groundwater modeling case studies are used to illustrate the performance of
these different techniques and provide some practical suggestions regarding their applicability. On the basis of the
results presented in the previous sections, the following
general conclusions are warranted regarding the various
techniques.
Different model averaging techniques can lead to different relative ranking and, hence, significantly different
weights for models. In particular, BMA using KIC or BIC
leads to a concentration of weights in the top few models
and a corresponding reduction in prediction uncertainty
compared to the unconditional case (where all models are
considered equally likely). However, the variance window
modification to BMA provides an opportunity to expand
Occam’s window, thus expanding the hypothesis space
by accepting multiple plausible models with a commensurate redistribution of model weights. Although AICMA
is conceptually different from MLBMA in its use of the
AICc as the criterion of choice, the use of an exponential weighting term leads to a similar concentration of
weights in 1 or 2 models with the best agreement with
the data. Such concentration of weights can also reduce
the impact expert-based priors have on the final predictive
uncertainty bounds. As such, the concentration of weights
in a few top-performing models may be deemed acceptable by the expert, in which case BMA and/or AICMA
yield the most statistically meaningful results. This, essentially, indicates that the modeler has full confidence in
the data used for calibration and in its use for assigning likelihoods to models applied in the predictive
space. If, on the other hand, the calibration data are not
fully reliable or represent conditions sufficiently different
from the predictive modeling environment, alternatives
may need to be considered to expand the “hypothesis
window” to include more alternative conceptualizations.
Thus, the modeler and field experts are recommended to
exercise a certain level of judgment in the final model
averaging process.
GLUE produces more uniformly distributed weights
for an N factor of 1.0. The degree of uniformity depends
on the choice of the shape factor N . A large value of N
(of the order of 20) leads to a concentration of weighting
for the model(s) with the best calibration performance. If
the GLUE model weights have the same rank ordering
as one of the other model averaging (such as MLBMA
or AICMA) weights, then changing the GLUE shape
factor leads to weights that converge to the other model
averaging weights. Here, the empirical GLUE shape factor
could be interpreted similarly to the variance window
concept. It should also be noted that the criticism of
certain GLUE likelihoods not depending on the number
of samples is valid from a statistical perspective. While
GLUE provides a flexible framework, work still needs to
be done on formulating statistically meaningful likelihood
functions with appropriate dependence on model error,
number of data points, and model complexity.
From a practical standpoint, the various model averaging techniques provide a useful framework for assigning
probabilities to alternative conceptual models. As noted
earlier, there are significant differences between the various approaches. Practitioners should therefore be cognizant of the fact that different model averaging techniques can lead to different predictive uncertainty bounds.
They should apply and compare different model averaging techniques—a task facilitated by software tools such
as MMA (Poeter and Hill 2007)—before deciding which
technique is most appropriate to their problem. They
should also be aware of the assumptions and limitations,
as well as the advantages of different model averaging
techniques. The work presented in this paper is intended
to bring forth some of the underlying assumptions and
nuances of one method over another and also provide
some practical guidelines. To that end, a preliminary set
of recommendations is provided regarding the use of these
model averaging techniques for the ultimate goal of quantifying uncertainty in model predictions.
• The starting point for any model averaging exercise
should be an exhaustive set of alternative models that
have been properly calibrated. Conclusions from the
application of model averaging techniques are likely
to be misleading if uncertainty in the model space has
not been properly characterized. This is consistent with
the observations of others (Beven 2009; Ye et al. 2008;
Poeter and Hill 2007).
• In the case of overparameterized models, parametric
sensitivity analysis should be undertaken to identify the
“sensitive” parameters that can be properly calibrated
with the available data. In the case of MLBMA-KIC, the Fisher information (FI) term should be calculated for this set of sensitive parameters only.
• As a first step, different models should be ranked using more than one information criterion (e.g., AICc, BIC, and KIC). As the rank ordering of models may differ from technique to technique depending on the balance of goodness-of-fit and model complexity, it may be useful to create a union of the top-ranked models across the various techniques (a ranking sketch is given after this list).
• If there is consistency across model rankings, then
predictions from KIC-, BIC-, or AICc-based techniques
will be similar and likely display a smaller variance
than weighted GLUE predictions. In cases where the
calibration data are not very reliable or where the
calibration space is sufficiently different from the
predictive space, the modeler may need to apply
techniques to further expand the hypothesis window to
allow for more conceptual models to be given a nonzero
weight. Within the Bayesian framework, this can be
accomplished through the variance window concept.
In GLUE, the shape factor (N ) can be modified to
distribute the weights across different models.
• If model rankings are in conflict, then the onus is on
the analyst to determine which of the model averaging techniques is appropriate for the problem at hand
based on consistency with hydrogeologic considerations. For example, in case study 1, the analyst may use
site-specific and domain-specific knowledge to decide
between DPW1 and CMB2, given the lack of robustness among different techniques for assigning model
weights. It is important that the modeler also examine other model performance measures (bias and correlation of errors, mass balance characteristics, matching of peak events, etc.).
• It is highly recommended to develop a CDF (as shown
in Figure 4) that takes into account the prediction from
each model and the weight assigned to that model (a minimal construction sketch follows this list).
Such a CDF captures the full range of outcomes and
their associated likelihoods, rather than aggregating the
results in terms of the mean and standard deviation
of model predictions. The decision maker is likely to
benefit from a complete presentation of uncertainty
propagation results as compared to just the first two
statistical moments.
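To make the ranking and CDF recommendations above concrete, the following is a minimal sketch with hypothetical calibration statistics. It assumes the common least-squares forms AICc = n ln(WSSR/n) + 2k + 2k(k + 1)/(n − k − 1) and BIC = n ln(WSSR/n) + k ln(n); KIC would additionally require the Fisher information term and is omitted here:

    import numpy as np

    # Hypothetical calibration statistics for four alternative conceptual
    # models: weighted sum of squared residuals (WSSR), number of calibrated
    # parameters (k), and a point prediction of interest from each model.
    wssr  = np.array([42.0, 39.5, 36.8, 36.2])
    k     = np.array([3, 4, 6, 9])
    preds = np.array([12.0, 15.5, 9.8, 18.2])
    n = 50  # number of calibration observations

    # Least-squares forms of the criteria (terms common to all models
    # cancel when criterion differences are taken).
    aicc = n * np.log(wssr / n) + 2 * k + 2 * k * (k + 1) / (n - k - 1)
    bic  = n * np.log(wssr / n) + k * np.log(n)

    def ic_weights(ic):
        """Exponential weights: w_i = exp(-0.5*dIC_i) / sum_j exp(-0.5*dIC_j)."""
        w = np.exp(-0.5 * (ic - ic.min()))
        return w / w.sum()

    # Rank orderings may disagree: here AICc prefers the second model, while
    # BIC, which penalizes parameters more heavily at n = 50, prefers the first.
    print("AICc ranking:", np.argsort(aicc), " BIC ranking:", np.argsort(bic))

    # Weighted predictive CDF (here under the AICc weights): sort the model
    # predictions and accumulate the corresponding model weights.
    w = ic_weights(aicc)
    order = np.argsort(preds)
    for x, c in zip(preds[order], np.cumsum(w[order])):
        print(f"P(prediction <= {x:4.1f}) = {c:.3f}")

In practice each model would contribute a full predictive distribution (e.g., Monte Carlo realizations) rather than a single value; those samples would simply be pooled with their parent model’s weight before sorting.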
Acronyms and Symbols
AIC  Akaike Information Criterion
BIC  Bayesian Information Criterion
BMA  Bayesian Model Averaging
BMC  Bayesian Monte Carlo
CDF  Cumulative Distribution Function
GLUE  Generalized Likelihood Uncertainty Estimation
KIC  Kashyap Information Criterion
MLBMA  Maximum Likelihood Bayesian Model Averaging
MMA  Multi-Model Analysis
RMSE  Root Mean Square Error
WSSR  Weighted Sum of Squared Residuals
Mi  Model i
Li  Likelihood for Model i
θi  Parameter Vector for Model i
θ̂i  Maximum Likelihood Estimate for Parameter Vector θi
ki  Number of Parameters for Model i
σ²e,i  Variance of the Errors (Residuals) for Model i
σ²l  Variance of the Observations
N  GLUE Shape Factor
D  Observation Data
n  Number of Observations
ω  Observation Weight Matrix
p(x)  Probability of x
p(x|y)  Conditional Probability of x given y
X  Sensitivity/Jacobian Matrix of Calibration with Respect to Parameters
|X|  Determinant of Matrix X
I[f, Mi]  Kullback-Leibler (K-L) Metric for Loss of Information when Using Model Mi to Represent Reality (f)
s1  Size of Occam’s Window
σD  Standard Deviation of the Chi-Square Distribution Used for the “Goodness of Fit” Criterion in KIC or BIC
s2  Width of Variance Window in Terms of σD
α  Ratio of Occam’s Window to Variance Window
Acknowledgments
We wish to thank the reviewers of this paper—Mary
Hill, Eileen Poeter, Ming Ye, and Harihar Rajaram—for
their comments and feedback, which helped improve the
quality and readability of our manuscript.
Supporting Information
Additional Supporting Information may be found in
the online version of this article:
Supporting information has been provided in three
supplemental sections. Section A provides details of
applying the different model averaging techniques to
the Death Valley regional flow model (Ye et al. 2006).
Section B provides additional detail of the Nevada Test
Site case study. Section C provides information on
sensitivity analysis conducted with different GLUE shape
factors for the NTS case study.
Please note: Wiley-Blackwell are not responsible for
the content or functionality of any supporting materials
supplied by the authors. Any queries (other than missing
material) should be directed to the corresponding author
for the article.
References
Akaike, H. 1973. Information theory as an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, ed. B.N. Petrov, 267–281.
Budapest, Hungary: Akademiai Kiado.
Bechtel Nevada. 2005. A Hydrostratigraphic Framework Model and Alternatives for the Groundwater Flow and Contaminant Transport Model of Corrective Action Unit 98: Frenchman Flat, Clark, Lincoln and Nye Counties, Nevada, DOE/NV/11718-1064. Las Vegas, NV: Bechtel Nevada.
Belcher, W.R., ed. 2004. Death Valley regional groundwater flow system, Nevada and California—Hydrogeologic
framework and transient ground-water flow model.
U.S. Geological Survey Scientific Investigations Report
2004–5205, 408 p. Reston, Virginia: USGS.
Beven, K.J. 2009. Environmental Modelling: An Uncertain Future? An Introduction to Techniques for Uncertainty Estimation in Environmental Prediction. London: Routledge.
Beven, K.J. 2006. On undermining the science? Hydrological
Processes 20, 3141–3146.
Beven, K.J. 2000. Uniqueness of place and process representations in hydrological modelling. Hydrology and Earth
System Sciences 4, 203–213.
Beven, K.J. 1993. Prophecy, reality and uncertainty in
distributed hydrological modeling. Advances in Water
Resources 16, 41–51.
Beven, K., and J. Freer. 2001. Equifinality, data assimilation, and
uncertainty estimation in mechanistic modelling of complex
environmental systems using the GLUE methodology.
Journal of Hydrology 249, 11–29.
Beven, K.J., and A. Binley. 1992. The future of distributed models: Model calibration and uncertainty prediction. Hydrological Processes 6, 279–298.
Carrera, J., and S.P. Neuman. 1986. Estimation of aquifer
parameters under transient and steady state conditions. 1,
Maximum likelihood method incorporating prior information. Water Resources Research 22, no. 2: 199–210.
Delhomme, J.P. 1979. Spatial variability and uncertainty in
ground water flow parameters: A geostatistical approach.
Water Resources Research 15, no. 2: 269–280.
Doherty, J. 2004. PEST: Model-independent parameter estimation, user manual, version 5. Brisbane, Australia: Watermark Numerical Computing.
Domingos, P. 2000. Bayesian averaging of classifiers and the
overfitting problem. ICML’00. http://www.cs.washington.edu/homes/pedrod/mlc00b.ps.gz.
Draper, D. 1995. Assessment and propagation of model uncertainty. Journal of the Royal Statistical Society: Series B 57,
no. 1: 45–97.
Hill, M.C., and C.R. Tiedeman. 2007. Effective Groundwater
Model Calibration: With Analysis of Data, Sensitivities,
Predictions, and Uncertainty. New York: Wiley and Sons.
Hoeksema, R.J., and P.K. Kitanidis. 1989. Predictions of transmissivities, heads, and seepage velocities using mathematical models and geostatistics. Advances in Water Resources
12, no. 2: 90–102.
Hoeting, J.A., D. Madigan, A.E. Raftery, and C.T. Volinsky.
1999. Bayesian model averaging: A tutorial. Statistical
Science 14, no. 4: 382–417.
Hurvich, C.M., and C.-L. Tsai. 1989. Regression and time series model selection in small samples. Biometrika 76, no. 2: 297–307.
Kashyap, R.L. 1982. Optimal choice of AR and MA parts in
autoregressive moving average models. IEEE Transactions
on Pattern Analysis and Machine Intelligence 4, no. 2:
99–104.
Kass, R.E., and A.E. Raftery. 1995. Bayes factors. Journal of
the American Statistical Association 90, 773–795.
Madigan, D., and A.E. Raftery. 1994. Model selection and accounting for model uncertainty in graphical models using Occam’s window. Journal of the American Statistical Association 89, no. 428: 1535–1546.
Mantovan, P., and E. Todini. 2006. Hydrological forecasting
uncertainty assessment: Incoherence of the GLUE methodology. Journal of Hydrology 330, 368–381.
Minka, T.P. 2000. Bayesian model averaging is not model
combination, MIT Media Lab note (7/6/00). Available at
http://research.microsoft.com/~minka/papers/minka-bma-isnt-mc.pdf.
Moore, C., and J. Doherty. 2006. The cost of uniqueness
in groundwater model calibration. Advances in Water
Resources 29, 605–623.
Mugunthan, P., and C.A. Shoemaker. 2006. Assessing the
impacts of parameter uncertainty for computationally
expensive ground water models. Water Resources Research
42, W10428. doi:10.1029/2005WR004640.
Nash, J., and J. Sutcliffe. 1970. River flow forecasting through
conceptual models, 1. A discussion of principles.
Journal of Hydrology 10, 282–290.
National Research Council. 2001. Conceptual Models of Flow
and Transport in the Fractured Vadose Zone. Washington,
DC: National Academy Press.
Neuman, S.P. 2003. Maximum likelihood Bayesian averaging
of uncertain model predictions. Stochastic Environmental
Research and Risk Assessment 17, no. 5: 291–305.
Neuman, S.P. 1982. Statistical characterization of aquifer heterogeneities: An overview. In Recent Trends in Hydrogeology, 81–102. Geological Society of America Special
Paper 189. Boulder, Colorado: Geological Society of
America.
Neuman, S.P., and P.J. Wierenga. 2003. A Comprehensive Strategy of Hydrogeologic Modeling and Uncertainty Analysis
for Nuclear Facilities and Sites, NUREG/CR-6805. Washington, DC: U.S. Nuclear Regulatory Commission.
Poeter, E.P., and M.C. Hill. 2007. MMA, a computer code for multi-model analysis: U.S. Geological Survey Techniques and Methods 6-E3. Reston, Virginia: USGS.
Poeter, E., and D. Anderson. 2005. Multimodel ranking and
inference in ground water modeling. Ground Water 43,
no. 4: 597–605.
Raftery, A.E. 1995. Bayesian model selection in social research.
Sociological Methodology 25, 111–163.
Samper, F.J., and S.P. Neuman. 1989. Estimation of spatial covariance structures by adjoint state maximum likelihood cross validation: 1, Theory. Water Resources Research 25, 351–362.
Schwarz, G. 1978. Estimating the dimension of a model. Annals
of Statistics 6, no. 2: 461–464.
Stoller-Navarro Joint Venture. 2006a. Phase II ground-water
flow model of Corrective Action Unit 98-Frenchman Flat,
Nevada Test Site, Nye County, Nevada. Stoller-Navarro
Joint Venture Report S-N/99205-074 prepared for the U.S.
Department of Energy. Available at http://www.osti.gov/
bridge (accessed February 8, 2007).
Tsai, F.T.-C., and X. Li. 2008. Ground water inverse modeling
for hydraulic conductivity estimation using Bayesian model
averaging and variance window. Water Resources Research
44, no. 9: W09434. doi:10.1029/2007WR006576.
Vogel, R.M., R. Batchelder, and J.R. Stedinger. 2007. Appraisal
of the Generalized Likelihood Uncertainty Estimation
(GLUE) method. Water Resources Research 44, W00B06. doi:10.1029/2008WR006822.
Wagner, B.J., and S.M. Gorelick. 1989. Reliable aquifer remediation in the presence of spatially variable hydraulic conductivity: From data to design. Water Resources Research
25, no. 10: 2211–2225.
Ye, M., K.F. Pohlmann, and J.B. Chapman. 2008a. Expert
elicitation of recharge model probabilities for the Death
Valley regional flow system. Journal of Hydrology 354,
102–115. doi:10.1016/j.jhydrol.2008.03.001.
Ye, M., P.D. Meyer, and S.P. Neuman. 2008b. On model
selection criteria in multimodel analysis. Water Resources
Research 44, W03428. doi:10.1029/2008WR006803.
Ye, M., K. Pohlmann, J. Chapman, and D. Shafer. 2005. On
evaluation of conceptual models: a priori and a posteriori.
International High-Level Radioactive Waste Management
Conference, April 30 - May 4, Las Vegas, NV. Available
at http://www.osti.gov/bridge/product.biblio.jsp?osti_id=
875590.
Ye, M., S.P. Neuman, P.D. Meyer, and K.F. Pohlmann. 2005.
Sensitivity analysis and assessment of prior model probabilities in MLBMA with application to unsaturated
fractured tuff. Water Resources Research 41, W12429.
doi:10.1029/2005WR004260.
Ye, M., S.P. Neuman, and P.D. Meyer. 2004. Maximum likelihood Bayesian averaging of spatial variability models in
unsaturated fractured tuff. Water Resources Research 40,
W05113.