Designed Simulation Experiments, Part 2: DOE for the Digital Age

AIAA 2011-6625
AIAA Modeling and Simulation Technologies Conference
08 - 11 August 2011, Portland, Oregon

Terril N. Hurst∗, Chatterpaul S. Joseph†, Colin F. Pouchet£, Brett D. Collins€
Raytheon Missile Systems, Tucson, AZ 85756

∗ Sr. Principal Systems Engineer, Modeling, Simulation & Analysis Center, Bldg 805 M/S C1.
† Sr. Systems Engineer I, Modeling, Simulation & Analysis Center, Bldg 9030 M/S M30-S12.
£ Sr. Systems Engineer I, Guidance, Navigation and Control Center, Bldg M01 M/S 1.
€ Principal Systems Engineer, Modeling, Simulation & Analysis Center, Bldg M09 M/S 1.

Copyright © 2011 by the American Institute of Aeronautics and Astronautics, Inc. All rights reserved.
This paper illustrates ways in which conventional design of experiments (DOE), as developed during previous DOE eras, has been modified and expanded for successful application to the design and analysis of simulation experiments (DASE). DASE supports development of software-centric systems, using a complex, fast-evolving simulation for sample collection. A companion paper (Part 1) illustrates specific problems encountered in attempting to follow previously standardized DOE principles and methods within DASE; this paper suggests solutions to those problems.
I. Introduction
STATISTICS is a field with a long history including both spectacular successes and examples of being misapplied and misunderstood—see the satirical poem written in 1959 by Sir M.G. Kendall [1]. Part 1 of this two-paper presentation reviews the development of statistical design of experiments (DOE), characterized by Montgomery [2] as consisting of four distinct eras, beginning with R.A. Fisher in the 1920s. Part 1 highlights issues arising when a simulation is used to implement DOE for software-centric systems. This is a challenge for meaningful experimental design and inference, due to the dynamic complexity of both the system and the simulation. Applying DOE to such systems constitutes a new DOE era and the emergence of design and analysis of simulation experiments (DASE, pronounced “days”).
Box et al. [3] characterize DOE as an organized way of framing an imperfect window through which to observe what they call The True State of Nature. In DASE, a simulation serves as an imperfect proxy to
the True State of Nature. Thus, relevance of the sample depends not only on the quality of experimental
design and inference but also on the simulation’s accuracy and suitability for intended use. Since a
credible, deterministic simulation eliminates response variability that is due to setup or operator error,
Fisher’s principles of randomized run-order, replication, and blocking are irrelevant. Monte Carlo analysis
is employed to mimic process uncertainties in the system being simulated.
The paper reflects the authors’ vantage point as developers and users of deterministic simulations of
missile systems; however, the examples and lessons learned apply to many other cases where DOE and
simulation are jointly employed. Section II introduces DASE; four general categories of simulation
experiments are described, and a 7-step protocol is given for planning, executing, and outbriefing
simulation experiments. Section III describes some of the tools proven useful for DASE, including
alternative experimental designs, modeling and analysis techniques. Section IV briefly shows several
examples of using the techniques and protocol described in the previous sections. Section V concludes
the paper and briefly discusses some current challenges for advancing DASE.
II. DASE Categories and 7-Step Protocol
Although the ultimate goal of conventional DOE is usually represented as optimizing a system response
(e.g. performance or reliability) using response surface methodology (RSM), optimization is only one of
several reasons for conducting simulation experiments. This section describes four categories of
simulation objectives and how conventional DOE may or may not easily satisfy them. A 7-step protocol is
presented, which was developed in intensive collaboration with Raytheon customers who are interested
in the successful application of DOE to simulation.
A. Why Simulation Experiments are Conducted. Experience has led to identifying four distinctly
different DASE categories, shown in the following list and explained below. A single keyword is
underlined to label and distinguish each category.
1. Assess system performance (or another overall system response, e.g. reliability)
a) Examine/compare total effect of random variability on system(s) as a whole
b) Isolate & troubleshoot “outliers” from response probability distributions
c) Identify the dominant factors affecting the system response
2. Evaluate specific aspects of system design
a) Conduct trade studies involving system design/control variables in the presence of noise
b) Perform sensitivity analysis for cost/performance or margin allocation decisions
c) Create trustworthy surrogate models for well-defined purposes
3. Support tests (e.g. Bench Top, Hardware-In-The-Loop, Captive Carry, Flight)
a) Assist test scenario allocation (i.e. which cases to test)
b) Support pre-test activities (e.g. Range Safety Review)
c) Conduct post-test re-construction & data analysis (e.g. Failure Review Board)
4. Verify & validate the simulation (SimV&V)
a) Check assumptions and implementation of models & simulations
b) Compare simulation results with real-world data from bench/flight tests
Due to the Department of Defense policy of Simulation Based Acquisition (SBA) [4], the fundamental requirement for a missile system performance simulation is to verify a system’s performance (or to compare multiple systems) in lieu of running large numbers of costly flight tests. Category 1 objectives
are therefore focused on system performance across the full factor dimensionality and extent of
operational conditions. Contractually, none of the factors of this space may be screened; the full space
must be sampled in order to estimate the expected value of a system performance metric—denoted
generically in this paper as Probability of Success (Ps). Efficient sampling is desired to estimate Ps within
an acceptable confidence interval.
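As a hedged sketch of that last point (not the authors' tooling), the following Python fragment estimates Ps and a normal-approximation confidence interval from a batch of pass/fail outcomes; the sample size and success rate shown are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def estimate_ps(outcomes, confidence=0.90):
    """Estimate Probability of Success (Ps) and a normal-approximation
    confidence interval from binary pass/fail simulation outcomes."""
    outcomes = np.asarray(outcomes, dtype=float)
    n = outcomes.size
    ps_hat = outcomes.mean()                     # sample proportion
    se = np.sqrt(ps_hat * (1.0 - ps_hat) / n)    # standard error of the proportion
    z = norm.ppf(0.5 + confidence / 2.0)         # two-sided critical value
    return ps_hat, (ps_hat - z * se, ps_hat + z * se)

# Hypothetical usage: 2000 Monte Carlo runs with roughly 85% successes.
rng = np.random.default_rng(0)
outcomes = rng.random(2000) < 0.85               # stand-in for real pass/fail results
ps, (lo, hi) = estimate_ps(outcomes)
print(f"Ps = {ps:.3f}, 90% CI = ({lo:.3f}, {hi:.3f})")
```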
In contrast to performance assessment experiments, Category 2 simulation experiments involve
evaluating system design aspects. Conventional DOE and response surface methodology (RSM) are
most naturally applied here. However, due to the complex nature of system responses, alternatives to
RSM are popular when a response includes strong nonlinearities or discontinuities—a common situation
in software-centric systems such as guided missiles. In such cases, classification and regression trees,
kriging, or other types of surrogate models are often preferred to response surfaces. Whatever the form of
the surrogate model, interest is primarily in using the model to make sufficiently accurate predictions for
optimization. Therefore, as in conventional DOE, factor screening is essential for increasing the likelihood
of obtaining a useful surrogate model to achieve the experiment’s objective.
The purpose of Category 3 experiments is not system optimization; rather, simulation experiments are
conducted to support a specific test. Example objectives include establishing confidence in successful
test completion, performing test calibration, and explaining why a test failed.
Finally, Category 4 simulation experiments are conducted in order to verify or (incrementally) validate
a simulation—referred to as SimV&V. System optimization is not the goal of these experiments; the intent
is to verify that the simulation is implemented in a way that meets its requirements and to establish its
suitability for intended use. Simulation verification has a binary outcome—pass/fail—but validation
depends strongly on good judgment from subject matter experts. Although statistical analysis is
employed, the tests’ small sample sizes often limit the utility of DOE.
Experience with missile systems has shown that it is a serious mistake to design a single simulation
experiment to achieve objectives spanning these four categories. It is common for an initial Category 1
experiment to spawn one or more follow-on experiments of different categories. Section IV includes
examples of experiments for each of the four categories.
B. The 7-Step DASE Protocol. The following checklist has been proven effective in planning, executing,
and reporting results for all four categories of simulation experiments. In one sense, the checklist is quite
high-level in order to handle a diversity of cases, but in another, it constitutes a rigorous protocol to be
followed for each case. A single keyword is underlined in each step to emphasize its purpose.
1. Establish basis (sponsor, system requirements, subject matter experts, tools, …, time)
2. State this experiment’s specific objective & category (1 - 4)
3. Define experiment’s response(s) and practically discernible difference(s) δ
4. Define experiment’s factors being varied
   a) Control factors’ type (numeric/categorical), units, and ranges/levels
   b) Monte Carlo (MC) factors’ distributions & parameter values
5. Factor Screening (use as “dry run” for Category 1)
   a) Select experimental design (sampling strategy—e.g. fractional factorial, D-optimal)
   b) Choose number of MC replicates, considering δ, confidence interval, and statistical power
   c) Establish simulation run sequence, and execute/analyze the Screening runs
6. Empirical Modeling
   a) Select control factors to be held fixed (eliminated) for the modeling experiment
   b) Select model type & form (e.g. ANOVA, response surface, regression tree)
   c) Select experimental design (sampling strategy—e.g. Latin Hypercube sampling)
   d) Choose number of MC replicates, considering δ, confidence interval, and statistical power
   e) Construct and verify model(s)
7. Final Analysis and Results Presentation
   a) Use data & model(s) to draw conclusions and to present results for decision-making
   b) Determine whether follow-on experiments are required
The rationale for each of the seven steps is explained below.
1. Basis. Another label for Step 1 could be “context.” As mentioned previously, a simulation serves as
proxy for The True State of Nature and therefore determines what is observable under given sampling
conditions. This is why it is Step 1. The basis is established by a sponsor (e.g. customer or other
stakeholder) in accordance with one or more requirements, and involves specific subject matter experts,
tools, and a finite amount of time. The basis typically supports multiple experiments.
2. Objective. Step 2 involves stating as specifically and quantitatively as possible an objective to be
accomplished in the simulation experiment being designed. Specificity is crucial in order to set bounds on
the experiment and to know when it is completed. This step can be difficult and might be re-visited as
other details emerge regarding the experiment.
3. Response(s) and discernible difference(s). The simulation’s output model may include many
variables that could serve as a response, but typically, one or two are chosen as responses and others as
secondary indicators. It is important to state a numerical value for each response that is the discernible—
i.e. practical—difference. Doing this helps to define statistical power and prevents the expensive error of
seeking statistical confidence intervals that are smaller than what matters, based on the realities of
operational conditions and simulation accuracy.
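As a hedged illustration of how δ, confidence, and power interact when choosing Monte Carlo replicates (Steps 3 and 5b), the sketch below uses a normal-approximation sample-size formula for detecting a mean shift of δ when the response standard deviation σ is assumed known; the numbers are illustrative, not from the paper.

```python
import math
from scipy.stats import norm

def replicates_for_shift(delta, sigma, alpha=0.05, power=0.90):
    """Normal-approximation number of Monte Carlo replicates needed to detect
    a mean shift of `delta` in a response with standard deviation `sigma`,
    using a two-sided test at significance level `alpha`."""
    z_alpha = norm.ppf(1.0 - alpha / 2.0)   # critical value for the test
    z_beta = norm.ppf(power)                # quantile for the desired power
    return math.ceil(((z_alpha + z_beta) * sigma / delta) ** 2)

# Hypothetical example: discern a 2 m shift in miss distance when sigma = 5 m.
print(replicates_for_shift(delta=2.0, sigma=5.0))   # roughly 66 replicates
```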
4. Factors. The most intensive collaboration between simulation stakeholders occurs in Step 4, and
experience has shown that if consensus is not reached here, stakeholders will likely discount the
experiment’s results. Each control factor must be identified, established as being either numerical or
categorical, and given either upper and lower bounds (numerical) or categories (categorical). Until this is
done, it is not possible to select an experimental design or to identify nonsensical factor-level
combinations (treatments). During this step, it is often necessary to re-visit previous steps.
5. Screening. Except for Category 1 experiments—which require considering the system’s full factor
dimensionality—screening less dominant factors is crucial for constructing useful surrogate models.
6. Modeling. Sometimes called “the main DOE,” this step involves fixing values of factors that were
eliminated in Step 5 and sampling the remaining factors, usually at more levels. The intent is to generate
a surrogate model that can be used to complete the experiment.
7. Final Analysis and Results Presentation. Exploratory data analysis and statistical inference are
valuable tools for completing this step. Conclusions are drawn for subsequent action, possibly including
additional experiments.
Several tools have been developed to accomplish each of the seven steps. Only a few are described
within the next section; Section IV illustrates how these tools are used to conduct specific experiments.
III. Experimental Design and Analysis Techniques and Tools for DASE
Figure 1: 4-factor coverage example, Central Composite Design vs. Latin Hypercube Sampling.
A. Alternative Experimental Designs. Fisher popularized the use of factorial experiments, which are still in wide use [2]—especially 2- and 3-level, full- or fractional-factorial designs. Other popular design types include the central composite design (CCD) and the Box-Behnken design. Although all of these designs have several advantages, they are not space-filling but concentrate sample points in regular patterns. Figure 1 compares a 4-factor CCD example (left), which requires 25 points, with a 4-factor Latin hypercube sampling (LHS) design [6] (right)—for fair comparison also having 25 points. Each example’s diagonal subplots show the individual factor levels, and each off-diagonal subplot is a two-dimensional subspace projection of the same 25 sample points. Even though each sample size is 25, the CCD appears sparser because of the lattice-stacking within each projection. Due to the complex responses that are common in DASE, the space-filling nature of LHS frequently yields surprising discoveries.
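As a hedged sketch (using SciPy's quasi-Monte Carlo module rather than any tool named in this paper), a 4-factor, 25-point Latin hypercube can be generated and scaled to engineering units as follows; the factor bounds are hypothetical.

```python
import numpy as np
from scipy.stats import qmc

# Hypothetical bounds for a 4-factor study (engineering units).
lower = np.array([0.0, -10.0, 100.0, 0.1])
upper = np.array([1.0,  10.0, 500.0, 0.9])

sampler = qmc.LatinHypercube(d=4, seed=1)
unit_design = sampler.random(n=25)                 # 25 points in [0, 1)^4
design = qmc.scale(unit_design, lower, upper)      # map to factor ranges
print(design.shape)                                # (25, 4)
```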
Figure 2: Example Nested Latin Hypercube design, 3 layers for 3 factors.
A more recent sampling innovation that has proven useful for DASE is nested LHS [7]. The idea is to create space-filling designs that can be executed sequentially when further detail is deemed necessary, rather than starting over each time. For example, Figure 2 shows a 3-factor, 3-layer example. Layer 1 has only 4 sample points (red circles). Layer 2 adds 8 more points (green squares), interleaved within the original 4 points, giving a total of 12 (4 x 3) points. Finally, Layer 3 adds 12 more points (blue triangles), interleaved with the prior 12 points, giving a sample size of 24 (4 x 3 x 2) points. Nested LHS has proven useful for high-fidelity, scene-based missile simulations, which can require hours for a single run.
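The nested construction of Qian et al. [7] is not reproduced here; as a loosely related, hedged sketch of the sequential idea, the following code augments an already-executed point set by greedily adding candidates that maximize the minimum distance to existing points (a maximin augmentation, not a true nested LHS). The layer sizes mirror the Figure 2 example.

```python
import numpy as np
from scipy.stats import qmc

def augment_maximin(existing, n_new, n_candidates=2000, seed=2):
    """Greedily add n_new points that maximize the minimum distance to the
    points already executed. This mimics sequential refinement; it is NOT
    the nested-LHS construction of Qian et al."""
    d = existing.shape[1]
    candidates = qmc.LatinHypercube(d=d, seed=seed).random(n_candidates)
    points = list(existing)
    for _ in range(n_new):
        dists = np.min(
            np.linalg.norm(candidates[:, None, :] - np.asarray(points)[None, :, :], axis=2),
            axis=1)
        best = int(np.argmax(dists))            # candidate farthest from current set
        points.append(candidates[best])
        candidates = np.delete(candidates, best, axis=0)
    return np.asarray(points)

# Layer sizes mirror the 4 / +8 / +12 example of Figure 2 (3 factors).
layer1 = qmc.LatinHypercube(d=3, seed=0).random(4)
layer2 = augment_maximin(layer1, n_new=8)       # 12 points total
layer3 = augment_maximin(layer2, n_new=12)      # 24 points total
print(layer3.shape)                             # (24, 3)
```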
Another alternative sampling strategy is called PEM, for Point Estimation Method. PEM is a Raytheon-developed tool [8] that samples much more sparingly than Monte Carlo sampling. PEM is based on two basic ideas. First, the response’s expected value (mean µ_Y) and variance σ_Y² are each modeled using the expected value operator on a Taylor series approximation of the response function Y(x) [9]. The models depend on the k factors’ statistical moments (variance σ_i², skewness W_i, and kurtosis K_i) and the response Y’s first and second partial derivatives (S_i, T_i, i = 1 to k factors):

\mu_Y = Y_{X^*} + \frac{1}{2} \sum_{i=1}^{k} T_i \sigma_i^2    (1)

\sigma_Y^2 = \sum_{i=1}^{k} \left[ S_i^2 \sigma_i^2 + S_i T_i \sigma_i^3 W_i + \left( \frac{T_i \sigma_i^2}{2} \right)^2 (K_i - 1) \right]    (2)

Y_{X*} is the predicted response value when all k factors are at their nominal values, and the sum in Eqn. 1 predicts a mean shift which occurs in the presence of any nonlinear response terms (in which case, at least one second partial derivative T_i is non-zero). These equations are also the basis of a tolerance allocation tool developed at Raytheon, called RAVE (Raytheon Analysis of Variability Engine), which was used in the Category 2 example described in Section IV.
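As a hedged numerical sketch of Eqns. 1 and 2 (not the RAVE or PEM implementations), the code below propagates assumed factor moments and sensitivities through the Taylor-series expressions; all numbers are illustrative.

```python
import numpy as np

def taylor_mean_variance(S, T, sigma, W, K, y_nominal):
    """Second-order Taylor-series estimates of the response mean and variance
    (Eqns. 1 and 2). S, T: first and second partial derivatives of Y at the
    nominal point; sigma, W, K: factor standard deviations, skewness, kurtosis."""
    S, T, sigma, W, K = map(np.asarray, (S, T, sigma, W, K))
    mu_y = y_nominal + 0.5 * np.sum(T * sigma**2)                     # Eqn. 1
    var_y = np.sum(S**2 * sigma**2
                   + S * T * sigma**3 * W
                   + (T * sigma**2 / 2.0)**2 * (K - 1.0))             # Eqn. 2
    return mu_y, var_y

# Hypothetical 3-factor example: Gaussian factors (W = 0, K = 3).
mu, var = taylor_mean_variance(S=[1.5, -0.8, 0.3], T=[0.4, 0.0, -0.2],
                               sigma=[0.1, 0.2, 0.05], W=[0, 0, 0],
                               K=[3, 3, 3], y_nominal=10.0)
print(mu, np.sqrt(var))
```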
The second idea behind the PEM tool involves determining the locations x_k and weightings p_k of Gaussian quadrature points to approximate each factor’s first n moments (usually four):

\int_{-\infty}^{\infty} x^j f(x)\, dx = \sum_{k=1}^{n} p_k x_k^j, \quad j = 0, 1, \ldots, 2n-1    (3)

The underlying assertion is that, given representation of a random variable X by a sample of size n from its domain of definition, the method of Gaussian quadratures can be used to estimate which n values of x to choose, and how much weight (the p values) to associate with each x value in computing the response probability distribution. This set of paired p and x values enables reproducing the first n moments of X. Critical values {p_k, x_k} are selected for each factor by solving the set of 2n Eqns. 3, where n is the number of factor levels that will be used to represent the factor in the designed experiment.
Using the above equations, the PEM algorithm designs the experiment by allocating more levels to factors with higher sensitivities than to those with lower sensitivities. Accurate estimates of the response distribution have been shown possible with orders of magnitude fewer samples than Monte Carlo sampling [10].
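As a hedged illustration of Eqn. 3 for a standard normal factor (not the PEM tool itself), the sketch below obtains n quadrature locations and weights from Gauss-Hermite quadrature and checks that they reproduce the factor's moments up to order 2n - 1.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def normal_quadrature(n):
    """Locations x_k and weights p_k that reproduce the moments of a standard
    normal factor up to order 2n - 1 (change of variables from Gauss-Hermite)."""
    nodes, weights = hermgauss(n)
    x = nodes * np.sqrt(2.0)           # standard-normal abscissas
    p = weights / np.sqrt(np.pi)       # weights sum to 1
    return x, p

x, p = normal_quadrature(4)
for j in range(8):                     # moments j = 0, ..., 2n-1
    quad_moment = np.sum(p * x**j)     # right-hand side of Eqn. 3
    print(j, round(quad_moment, 6))    # matches 1, 0, 1, 0, 3, 0, 15, 0
```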
B. Alternative Modeling Strategies. As mentioned previously, basic response surface methodology
(RSM) involves building empirical models by using ordinary least-squares fitting to analytic forms—e.g.
low-order polynomials or transcendental functions. Within DASE, popular alternatives include logistic
regression and classification and regression trees.
Figure 3 illustrates a two-factor logistic regression surface, for a response that has a significant interaction between the two factors. This surface has the following equation:

\pi(x_1, x_2) \equiv \mathrm{Prob}(Y = 1) = \frac{1}{1 + \exp\left[ -(2 x_1 + 3 x_2 - 8 x_1 x_2) \right]}    (4)
Eqn. 4 states that the response random variable Y’s binomial parameter (denoted by the symbol π) varies
as a function of factors x1, x2, and their interaction. Logistic regression is appropriate when the response
can be transformed from a continuous response (e.g. miss distance) to a binomial response (e.g.
Probability of Success). Experience with DASE has shown that converting a response to a binomial
random variable often reduces the factor dimensionality required to construct an accurate, empirical
predictive model.
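A minimal sketch that evaluates the surface of Eqn. 4 on a grid follows; the coefficients come from Eqn. 4, while the grid and coded ranges are assumed.

```python
import numpy as np

def success_probability(x1, x2):
    """Logistic surface of Eqn. 4: Prob(Y = 1) as a function of two factors
    and their interaction."""
    logit = 2.0 * x1 + 3.0 * x2 - 8.0 * x1 * x2
    return 1.0 / (1.0 + np.exp(-logit))

# Evaluate on a coarse grid (coded ranges 0..1 assumed for illustration).
x1, x2 = np.meshgrid(np.linspace(0, 1, 5), np.linspace(0, 1, 5))
pi = success_probability(x1, x2)
print(pi.round(2))
```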
Figure 3: Example 2-factor logistic regression surface with strong factor interaction.
In DASE, a popular alternative to a smooth response surface model is a regression tree [11]. Rather than representing a response as an analytic function, a regression tree is a binary decision tree, with each decision point being a threshold value of one of the factors. Leaf nodes indicate the predicted response value (see Figure 4).
Figure 4: Example regression tree for response prediction.
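A hedged sketch of fitting such a regression tree with scikit-learn (not the authors' tool); the training data below are synthetic stand-ins for simulation results.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in for a designed-experiment data set: 5 coded factors.
rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(144, 5))
y = 15 + 4 * (X[:, 4] > -0.5) + 2 * X[:, 0] + rng.normal(0, 1, 144)  # toy response

tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=12)
tree.fit(X, y)
print(export_text(tree, feature_names=[f"X{i+1}" for i in range(5)]))
```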
IV. Example Applications of DASE
This section includes four examples, one per DASE category. The intent is not to provide in-depth
information on each example (see references for more details), but to illustrate usage of the tools and
methods described in previous sections.
A. Category 1 Example (Performance): Increased Coverage with Fewer Runs. As mentioned previously, Category 1 DASE is particularly challenging for applying conventional DOE. Typical missile
performance assessments involve 12-18 control factors (which define each engagement scenario) and
hundreds of Monte Carlo factors (which represent uncontrolled “noise” in the system and environment).
Performance should be assessed across the full space in order to estimate Probability of Success for
variously defined performance bins (e.g. by target type).
Prior to applying DOE, specific scenarios were hand-picked by subject matter experts to represent performance. As time passed, this set of hand-picked scenarios grew unacceptably large while still not covering the entire operational space. Therefore, the decision was made to replace the legacy performance scenario set with one based upon DOE principles, as now described.
Figure 5: Sets of Missile Engagement Scenarios for Performance Assessment (Candidate Points ⊃ Feasible-Point Population ⊃ Evaluation Sample).
Figure 5 illustrates a novel approach for defining a Category 1 sample space. First, LHS is used to
generate a large number of candidate points. Filters are established to exclude candidates that may be
either operationally unlikely (given engagement tactics) or kinematically infeasible (given missile
hardware limitations, e.g. rocket motor size, control authority, etc.). An example filter criterion is
Probability of Guide, i.e. the likelihood of guiding within a feasible terminal configuration for success. This
filtered set constitutes a finite population of feasible points (middle, green set) for performance
assessment. Once the population is established, an evaluation sample (inner, yellow set) can be drawn
from it. Sample size depends upon available computing power and time allotted, and on the experiment’s
specific purpose—e.g. a final or intermediate performance assessment, or quick regression-testing
following a software change (which should involve as small a sample set as possible). Random selection has been used, but the authors’ recent experience indicates that a better selection method is D-optimal sampling [2], to ensure that small sample sizes do not result in clumps or voids of scenarios within the sampled operational space.
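A hedged end-to-end sketch of the Figure 5 flow follows, with hypothetical factor names and a placeholder feasibility rule standing in for the Probability-of-Guide filter; D-optimal selection would replace the final random draw.

```python
import numpy as np
from scipy.stats import qmc

# Step 1: LHS candidates over a hypothetical 3-factor scenario space
# (range_km, altitude_kft, crossing_angle_deg) -- factor names are illustrative.
lower, upper = [5.0, 1.0, 0.0], [100.0, 40.0, 180.0]
candidates = qmc.scale(qmc.LatinHypercube(d=3, seed=4).random(5000), lower, upper)

# Step 2: filter to a feasible-point population (placeholder kinematic rule,
# standing in for a Probability-of-Guide criterion).
def is_feasible(pt):
    rng_km, alt_kft, angle_deg = pt
    return rng_km / 100.0 + alt_kft / 40.0 < 1.5

population = candidates[np.array([is_feasible(p) for p in candidates])]

# Step 3: draw an evaluation sample from the population (random draw here;
# D-optimal selection would replace this step).
rng = np.random.default_rng(5)
sample = population[rng.choice(len(population), size=50, replace=False)]
print(len(candidates), len(population), len(sample))
```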
B. Category 2 Example (Design): System Tolerance Allocation for Tracker Performance [10].
Figure 6: Functional relationship between the tracker algorithm and other system elements.
A systems engineer requested that system tolerances be established to maximize success of the
missile’s tracker algorithm. The tracker algorithm is crucial to missile performance: If a high probability of
target tracking is achieved, the guidance and control systems are likely able to manage the missile’s
energy well enough to assure successful target intercept. Two simulations were used. The first utilized a
statistical seeker model and required about 5 minutes to simulate a single fly-out, and the second
simulation used a high-fidelity, scene-based seeker model and required about 2.5 hours per fly-out. One
challenge, therefore, was successfully blending results from the two simulations.
Figure 6 is a high-level diagram of the control loop used in this experiment; the loop includes several
functional blocks and fourteen factors that were allocated tolerances based on results of several designed
experiments. Note the absence of guidance and control factors—i.e. their parameters were assumed
fixed for this study. The response for this study was median miss distance.
Table 1: Factor symbols, descriptions, and allocated values assigned during each phase of the study (Screening, Modeling, RAVE; values on a -1 to +1 scaling).

#   Symbol  Factor Name and Brief Description
1   CRP     Coherent Reference Point Error (1): a ranging error resulting in image formation errors.
2   blRg    RF Blind Range (2): range at which the radar stops dwelling.
3   bsAln   Boresight Alignment Error (1): misalignment between antenna & IR camera boresights.
4   fDOR    Frame Dropout Rate (3): fraction of dropped frames.
5   frLat   Frame Latency (3): elapsed time between taking and using a measurement for control.
6   frR     Frame Rate (3): tracker data sample rate.
7   gbGRN   Gimbal to GPS-Rcvr-NAV Alignment Error (1)
8   gbM     Gimbal Angle Measurement Error (1)
9   hoP     Target Hand Over Error, Position (2)
10  hoV     Target Hand Over Error, Velocity (2)
11  mpOff   Monopulse Offset Error (1): angle calc. term.
12  mpSlp   Monopulse Slope Error (1): angle calc. term.
13  rgRz    Range Resolution (3): minimum separation between distinguishable objects.
14  trRg    Transition Range (2): tracker start-up range.
Key: (1) Radar HW factor/error; (2) Flight-related factor/error; (3) Signal processing factor.
Table 1 gives a one-line description of each factor, along with scaled tolerance values that were
established for each factor during DASE Steps 5 and 6. Using the faster, statistical seeker simulation, 7 of
the 14 factors were eliminated during screening—i.e. set to values shown in the Screening column. The
modeling step also used the statistical seeker simulation, and a 7-factor regression model was created for
use in the RAVE tool to allocate tolerances for 4 of the 7 factors. This model was also used in the PEM
tool to allocate tolerances to the remaining 3 factors (#1, #9, and #10), and finally, the high-fidelity simulation was used to verify the entire tolerancing solution [10].
C. Category 3 Example (Test Support): RF seeker monopulse angle calibration. This Category 3
experiment involves monopulse angle calibration for a missile’s RF seeker. Due to imperfections in
manufacturing, the RF antenna is typically created with perturbed sum and delta channel antenna patterns. The perturbation is usually a shift from zero azimuth and elevation in the peak of the main-lobe sum channel pattern, and/or in the null of the delta channel patterns—see Figure 7.
In bench top testing, an RF seeker is placed at one end of an anechoic chamber, and a grid scan is made across an RF repeater, varying the azimuth and elevation angles. These coarse samples and true pointing angles are used to compute a least-squares linear fit to the antenna patterns, and the least-squares coefficients are used to estimate the RF monopulse angle.
To calibrate the high-fidelity seeker model, fine-grid sampling is conducted with arbitrary antenna
perturbation patterns against a point target. One advantage of using the simulation is that the antenna
pattern is available, which can be used to determine more precisely the monopulse angle coefficients.
The goal of monopulse angle calibration is to determine coefficients so that the angle error is within
specification. However, designed experiments are needed to cover the sample space (i.e. perturbation of
the antenna patterns) and to determine if the RF monopulse calibration process can achieve the specified
angle error. If the calibration process cannot achieve the specified angle error, adjustments are made to
fit the antenna patterns.
The simulation experiment's response is monopulse angle error, which must be within specifications
assumed by the algorithms. Factors being varied are perturbation shifts to each channel's antenna
pattern, which are interdependent. For each of the four quadrants Q1 - Q4, a phase and a gain
perturbation are applied independently, with upper limits for each of these factors based on maximum
expected manufacturing variations.
Figure 7: Sum and delta channel perturbations due to antenna manufacturing imperfections.
A Resolution V fractional factorial simulation experiment is conducted to identify significant two-way interactions [2]. No factors can be screened, and the data are used to quantify all factors’ sensitivity.
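As a hedged sketch of one way to construct such a design for the eight perturbation factors assumed above (phase and gain for quadrants Q1 - Q4), the code builds a 64-run 2^(8-2) fractional factorial from the generators G = ABCD and H = ABEF, which yield Resolution V; the factor names are illustrative.

```python
import itertools
import numpy as np

# 2^(8-2) Resolution V design: full factorial in six base factors (A-F),
# with the two remaining columns generated as G = ABCD and H = ABEF, so no
# main effect or two-factor interaction is aliased with another of either kind.
base = np.array(list(itertools.product([-1, 1], repeat=6)))   # factors A-F
A, B, C, D, E, F = base.T
G = A * B * C * D                                             # generator 1
H = A * B * E * F                                             # generator 2
design = np.column_stack([A, B, C, D, E, F, G, H])            # 64 coded runs

factor_names = ["Q1_phase", "Q1_gain", "Q2_phase", "Q2_gain",
                "Q3_phase", "Q3_gain", "Q4_phase", "Q4_gain"]  # illustrative
print(design.shape)                        # (64, 8), to be scaled to physical limits
print(dict(zip(factor_names, design[0])))  # first coded run
```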
Next, a Latin Hypercube Sampling experiment is performed to create a surrogate model using ordinary least-squares fitting. The surrogate model is verified to have the desired properties (unbiased, Gaussian residuals). The data are also examined for any outliers that fall outside of specification.
Finally, the surrogate model is compared against bench-top test data for the seeker. Strong agreement
between simulation and test results indicates that the monopulse calibration process is valid.
Furthermore, the surrogate model is used within the statistical seeker simulation to model expected factor
calibration error.
12
D. Category 4: Incremental SimV&V. As mentioned previously, simulation validation must be done
incrementally, due to the small number of actual flight tests conducted. Figure 8 is an example plot of two
flight test matching parameters (FMPs) for using a single flight test to incrementally validate a missile
performance simulation.
Figure 8: Example Percent-in-Envelope Analysis for two flight test matching parameters.
In this case, there is only one scenario being simulated—the test being flown—and therefore, all variations are due to Monte Carlo variables. By tracking the percent-in-envelope for FMPs, and by using flight test data as it becomes available, the accuracy of the simulation can be assessed and improved [12].
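As a hedged sketch of the percent-in-envelope bookkeeping (the envelope definition and data below are assumed, not from the referenced study):

```python
import numpy as np

def percent_in_envelope(mc_traces, lower, upper):
    """Fraction of Monte Carlo traces that stay inside the flight-test
    envelope [lower, upper] at every time step, for one FMP."""
    inside = (mc_traces >= lower) & (mc_traces <= upper)   # shape (runs, time)
    return inside.all(axis=1).mean()

# Hypothetical data: 500 Monte Carlo runs of one FMP over 200 time steps.
rng = np.random.default_rng(7)
t = np.linspace(0.0, 10.0, 200)
mc = np.sin(t) + rng.normal(0.0, 0.15, size=(500, t.size))
lo, hi = np.sin(t) - 0.4, np.sin(t) + 0.4                   # assumed envelope
print(percent_in_envelope(mc, lo, hi))
```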
V. Conclusions
This paper and its companion (Part 1) describe current issues and successes in applying DOE to simulation. Most engineers have had limited exposure to DOE and DASE, so one challenge is to
educate them. A common misconception is that DASE is only applicable to specific types of simulations;
however, DASE is a tool to assist in the most fundamental of engineering work: high-dimensional,
multidisciplinary trade studies, where the solution is not obvious and where multiple subject matter
experts must collaborate to integrate their work. One challenge in securing support for DASE is measuring its business impact. Those who have embraced DASE usually express confidence that
engineering designs are more robust due to the wider coverage of the design space. During a recent
program milestone review, the customer’s chief engineer made the comment that “by using DOE, this
program has achieved a level of thoroughness unmatched by any previous program with which I’ve been
associated.” This comment was reassuring, but until such benefits are established quantitatively, securing consistent management support for DOE will remain difficult.
Another challenge is how to facilitate commonality, reuse, and sharing of tools and analysis techniques
within an expanding community of DASE practitioners. Hopefully, these papers will stimulate discussions
that lead to improvements in the discipline and infrastructure of DASE.
References
[1] Kendall, M.G., “Hiawatha Designs an Experiment,” The American Statistician, Vol. 13, No. 5, 1959. Also available at http://www.the-aps.org/publications/tphys/legacy/1984/issue5/363.pdf.
[2] Montgomery, D.C., Design and Analysis of Experiments, Wiley, 2009.
[3] Box, G.E.P., Hunter, J.S., and Hunter, W.G., Statistics for Experimenters, Wiley, 2005.
[4] Zittel, R.C., “The Reality of Simulation-Based Acquisition—and an Example of U.S. Military Implementation,” Acquisition Quarterly Review, Summer 2001. Also available at http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.112.6349&rep=rep1&type=pdf.
[5] Gilmore, J.M., “Guidance on the use of Design of Experiments (DOE) in Operational Test and Evaluation,” memorandum from the Office of the Secretary of Defense, Washington, D.C., Oct. 19, 2010.
[6] McKay, M.D., Beckman, R.J., and Conover, W.J., “A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code,” Technometrics, Vol. 21, No. 2, May 1979, pp. 239-245.
[7] Qian, P.Z.G., Ai, M., and Wu, C.F.J., “Construction of Nested Space-Filling Designs,” The Annals of Statistics, Vol. 37, No. 6A, 2009, pp. 3616-3643.
[8] Alderman, J. and Mense, A., “Second Generation Point Estimation Methods,” Proceedings, 2006 U.S. Army Conference on Applied Statistics, Durham, NC.
[9] Hahn, G.J. and Shapiro, S.S., Statistical Models in Engineering, Wiley, 1994.
[10] Hurst, T.N., Joseph, C., Rhodes, J.S., and Vander Putten, K., “Novel Experimental Design & Analysis Methods for Simulation Experiments Involving Algorithms,” Proceedings, 2009 U.S. Army Conference on Applied Statistics, Cary, NC.
[11] Breiman, L., et al., Classification and Regression Trees, Chapman & Hall/CRC, 1998.
[12] Hurst, T.N. and Collins, B.D., “Simulation Validation Alternatives When Flight Test Data Are Severely Limited,” Proceedings, 2008 AIAA Conference on Modeling & Simulation Technologies (AIAA-6357).