AIAA 2011-6625
AIAA Modeling and Simulation Technologies Conference, 08 - 11 August 2011, Portland, Oregon

Designed Simulation Experiments, Part 2: DOE for the Digital Age

Terril N. Hurst ∗, Chatterpaul S. Joseph †, Colin F. Pouchet £, Brett D. Collins €
Raytheon Missile Systems, Tucson, AZ 85756

This paper illustrates ways in which conventional design of experiments (DOE), as developed during previous DOE eras, has been modified and expanded for successful application to the design and analysis of simulation experiments (DASE). DASE supports development of software-centric systems, using a complex, fast-evolving simulation for sample collection. A companion paper (Part 1) illustrates specific problems in attempting to follow previously standardized DOE principles and methods within DASE, and this paper suggests solutions to those problems.

I. Introduction

STATISTICS is a field with a long history including both spectacular successes and examples of being misapplied and misunderstood—see the satirical poem written in 1959 by Sir M.G. Kendall.1 Part 1 of this two-paper presentation reviews the development of statistical design of experiments (DOE), characterized by Montgomery2 as consisting of four distinct eras, beginning with R.A. Fisher in the 1920s. Part 1 highlights issues arising when a simulation is used to implement DOE for software-centric systems. This is a challenge for meaningful experimental design and inference, due to the dynamic complexity of both the system and the simulation. Applying DOE to such systems constitutes a new DOE era and has led to the emergence of design and analysis of simulation experiments (DASE, pronounced “days”).

Box et al.3 characterize DOE as an organized way of framing an imperfect window through which to observe what they call The True State of Nature. In DASE, a simulation serves as an imperfect proxy to the True State of Nature. Thus, relevance of the sample depends not only on the quality of experimental design and inference but also on the simulation’s accuracy and suitability for intended use. Since a credible, deterministic simulation eliminates response variability that is due to setup or operator error, Fisher’s principles of randomized run-order, replication, and blocking are irrelevant. Monte Carlo analysis is employed to mimic process uncertainties in the system being simulated. The paper reflects the authors’ vantage point as developers and users of deterministic simulations of missile systems; however, the examples and lessons learned apply to many other cases where DOE and simulation are jointly employed.

Section II introduces DASE; four general categories of simulation experiments are described, and a 7-step protocol is given for planning, executing, and outbriefing simulation experiments. Section III describes some of the tools that have proven useful for DASE, including alternative experimental designs and modeling and analysis techniques. Section IV briefly shows several examples of using the techniques and protocol described in the previous sections. Section V concludes the paper and briefly discusses some current challenges for advancing DASE.

II. DASE Categories and 7-Step Protocol

Although the ultimate goal of conventional DOE is usually represented as optimizing a system response (e.g. performance or reliability) using response surface methodology (RSM), optimization is only one of several reasons for conducting simulation experiments.
This section describes four categories of simulation objectives and how conventional DOE may or may not easily satisfy them. A 7-step protocol is presented, which was developed in intensive collaboration with Raytheon customers who are interested in the successful application of DOE to simulation.

∗ Sr. Principal Systems Engineer, Modeling, Simulation & Analysis Center, Bldg 805 M/S C1.
† Sr. Systems Engineer I, Modeling, Simulation & Analysis Center, Bldg 9030 M/S M30-S12.
£ Sr. Systems Engineer I, Guidance, Navigation and Control Center, Bldg M01 M/S 1.
€ Principal Systems Engineer, Modeling, Simulation & Analysis Center, Bldg M09 M/S 1.
Copyright © 2011 by the American Institute of Aeronautics and Astronautics, Inc. All rights reserved.

A. Why Simulation Experiments are Conducted. Experience has led to identifying four distinctly different DASE categories, shown in the following list and explained below. A single keyword is underlined to label and distinguish each category.

1. Assess system performance (or another overall system response, e.g. reliability)
   a) Examine/compare total effect of random variability on system(s) as a whole
   b) Isolate & troubleshoot “outliers” from response probability distributions
   c) Identify the dominant factors affecting the system response
2. Evaluate specific aspects of system design
   a) Conduct trade studies involving system design/control variables in the presence of noise
   b) Perform sensitivity analysis for cost/performance or margin allocation decisions
   c) Create trustworthy surrogate models for well-defined purposes
3. Support tests (e.g. Bench Top, Hardware-In-The-Loop, Captive Carry, Flight)
   a) Assist test scenario allocation (i.e. which cases to test)
   b) Support pre-test activities (e.g. Range Safety Review)
   c) Conduct post-test re-construction & data analysis (e.g. Failure Review Board)
4. Verify & validate the simulation (SimV&V)
   a) Check assumptions and implementation of models & simulations
   b) Compare simulation results with real-world data from bench/flight tests

Due to the Department of Defense policy of Simulation Based Acquisition (SBA),4 the fundamental requirement for a missile system performance simulation is to verify a system’s performance (or to compare multiple systems) in lieu of running large numbers of costly flight tests. Category 1 objectives are therefore focused on system performance across the full factor dimensionality and extent of operational conditions. Contractually, none of the factors of this space may be screened; the full space must be sampled in order to estimate the expected value of a system performance metric—denoted generically in this paper as Probability of Success (Ps). Efficient sampling is desired to estimate Ps within an acceptable confidence interval.

In contrast to performance assessment experiments, Category 2 simulation experiments involve evaluating system design aspects. Conventional DOE and response surface methodology (RSM) are most naturally applied here. However, due to the complex nature of system responses, alternatives to RSM are popular when a response includes strong nonlinearities or discontinuities—a common situation in software-centric systems such as guided missiles. In such cases, classification and regression trees, kriging, or other types of surrogate models are often preferred to response surfaces.
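As one illustration of such a tree-based surrogate, the minimal sketch below fits a shallow regression tree to a hypothetical two-factor response containing a step discontinuity. It uses scikit-learn; the response function, factor count, and sample sizes are invented for illustration and are not taken from any of the studies described in this paper.

```python
# Illustrative sketch: fitting a regression-tree surrogate to simulation output
# that contains a discontinuity. The response function is hypothetical; a DASE
# application would substitute its own simulation runs.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

# Pretend these are 200 simulation runs over two control factors scaled to [-1, 1].
X = rng.uniform(-1.0, 1.0, size=(200, 2))
miss_distance = np.where(X[:, 0] > 0.2, 5.0, 15.0) + 2.0 * X[:, 1] ** 2

# A shallow tree keeps the surrogate interpretable (cf. the tree in Figure 4).
surrogate = DecisionTreeRegressor(max_depth=3, min_samples_leaf=10)
surrogate.fit(X, miss_distance)

# Predict the response at new treatments, e.g. for trade studies or optimization.
new_points = np.array([[0.5, -0.3], [-0.4, 0.8]])
print(surrogate.predict(new_points))
```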
Whatever the form of the surrogate model, interest is primarily in using the model to make sufficiently accurate predictions for optimization. Therefore, as in conventional DOE, factor screening is essential for increasing the likelihood of obtaining a useful surrogate model to achieve the experiment’s objective.

The purpose of Category 3 experiments is not system optimization; rather, simulation experiments are conducted to support a specific test. Example objectives include establishing confidence in successful test completion, performing test calibration, and explaining why a test failed.

Finally, Category 4 simulation experiments are conducted in order to verify or (incrementally) validate a simulation—referred to as SimV&V. System optimization is not the goal of these experiments; the intent is to verify that the simulation is implemented in a way that meets its requirements and to establish its suitability for intended use. Simulation verification has a binary outcome—pass/fail—but validation depends strongly on good judgment from subject matter experts. Although statistical analysis is employed, the tests’ small sample sizes often limit the utility of DOE.

Experience with missile systems has shown that it is a serious mistake to design a single simulation experiment to achieve objectives spanning these four categories. It is common for an initial Category 1 experiment to spawn one or more follow-on experiments of different categories. Section IV includes examples of experiments for each of the four categories.

B. The 7-Step DASE Protocol. The following checklist has proven effective in planning, executing, and reporting results for all four categories of simulation experiments. In one sense, the checklist is quite high-level in order to handle a diversity of cases, but in another, it constitutes a rigorous protocol to be followed for each case. A single keyword is underlined in each step to emphasize its purpose.

1. Establish basis (sponsor, system requirements, subject matter experts, tools, …, time)
2. State this experiment’s specific objective & category (1 - 4)
3. Define experiment’s response(s) and practically discernible difference(s) δ
4. Define experiment’s factors being varied
   a) Control factors’ type (numeric/categorical), units, and ranges/levels
   b) Monte Carlo (MC) factors’ distributions & parameter values
5. Factor Screening (use as “dry run” for Category 1)
   a) Select experimental design (sampling strategy—e.g. fractional factorial, D-optimal)
   b) Choose number of MC replicates, considering δ, confidence interval, and statistical power (see the sketch following this checklist)
   c) Establish simulation run sequence, and execute/analyze the Screening runs
6. Empirical Modeling
   a) Select control factors to be held fixed (eliminated) for the modeling experiment
   b) Select model type & form (e.g. ANOVA, response surface, regression tree)
   c) Select experimental design (sampling strategy—e.g. Latin Hypercube sampling)
   d) Choose number of MC replicates, considering δ, confidence interval, and statistical power
   e) Construct and verify model(s)
7. Final Analysis and Results Presentation
   a) Use data & model(s) to draw conclusions and to present results for decision-making
   b) Determine whether follow-on experiments are required
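Steps 3, 5(b), and 6(d) tie the number of Monte Carlo replicates to the practically discernible difference δ, the desired confidence level, and statistical power. A minimal sketch of the textbook two-sample normal-approximation rule is shown below; the δ and standard-deviation values are placeholders, and the calculation is generic rather than any Raytheon-specific tool.

```python
# Illustrative sketch: Monte Carlo replicates needed to resolve a practically
# discernible difference (delta) between two treatments, using the standard
# two-sample normal approximation. Numbers below are placeholders.
import math
from scipy.stats import norm

def mc_replicates_per_treatment(delta, sigma, alpha=0.05, power=0.90):
    """Replicates per treatment to detect a mean shift of `delta` when the
    response standard deviation is `sigma`, at two-sided significance `alpha`
    and the requested statistical power."""
    z_alpha = norm.ppf(1.0 - alpha / 2.0)
    z_beta = norm.ppf(power)
    n = 2.0 * ((z_alpha + z_beta) * sigma / delta) ** 2
    return math.ceil(n)

# Example: resolve a 1.5 m change in median miss distance when the Monte Carlo
# scatter is about 4 m (illustrative values only).
print(mc_replicates_per_treatment(delta=1.5, sigma=4.0))   # 150 replicates
```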
The rationale for each of the seven steps is explained below.

1. Basis. Another label for Step 1 could be “context.” As mentioned previously, a simulation serves as proxy for The True State of Nature and therefore determines what is observable under given sampling conditions. This is why it is Step 1. The basis is established by a sponsor (e.g. customer or other stakeholder) in accordance with one or more requirements, and involves specific subject matter experts, tools, and a finite amount of time. The basis typically supports multiple experiments.

2. Objective. Step 2 involves stating as specifically and quantitatively as possible an objective to be accomplished in the simulation experiment being designed. Specificity is crucial in order to set bounds on the experiment and to know when it is completed. This step can be difficult and might be re-visited as other details emerge regarding the experiment.

3. Response(s) and discernible difference(s). The simulation’s output model may include many variables that could serve as a response, but typically, one or two are chosen as responses and others as secondary indicators. It is important to state a numerical value for each response that is the discernible—i.e. practical—difference. Doing this helps to define statistical power and prevents the expensive error of seeking statistical confidence intervals that are smaller than what matters, based on the realities of operational conditions and simulation accuracy.

4. Factors. The most intensive collaboration between simulation stakeholders occurs in Step 4, and experience has shown that if consensus is not reached here, stakeholders will likely discount the experiment’s results. Each control factor must be identified, established as being either numerical or categorical, and given either upper and lower bounds (numerical) or categories (categorical). Until this is done, it is not possible to select an experimental design or to identify nonsensical factor-level combinations (treatments). During this step, it is often necessary to re-visit previous steps.

5. Screening. Except for Category 1 experiments—which require considering the system’s full factor dimensionality—screening less dominant factors is crucial for constructing useful surrogate models.

6. Modeling. Sometimes called “the main DOE,” this step involves fixing values of factors that were eliminated in Step 5 and sampling the remaining factors, usually at more levels. The intent is to generate a surrogate model that can be used to complete the experiment.

7. Final Analysis and Results Presentation. Exploratory data analysis and statistical inference are valuable tools for completing this step. Conclusions are drawn for subsequent action, possibly including additional experiments.

Several tools have been developed to accomplish each of the seven steps. Only a few are described within the next section; Section IV illustrates how these tools are used to conduct specific experiments.

III. Experimental Design and Analysis Techniques and Tools for DASE

Figure 1: 4-factor coverage example, Central Composite Design vs. Latin Hypercube Sampling.

A. Alternative Experimental Designs. Fisher popularized the use of factorial experiments, which are still in wide use2—especially 2- and 3-level, full- or fractional-factorial designs. Other popular design types include the central composite design (CCD) and the Box-Behnken design. Although all of these designs have several advantages, they are not space-filling but concentrate sample points in regular patterns.
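For concreteness, 25-point versions of both kinds of 4-factor design (the case compared in Figure 1) can be generated in a few lines. The sketch below is a generic construction, a face-centered central composite design and a basic random Latin hypercube, and is not the specific design tool used to produce the figure.

```python
# Illustrative sketch: generating 25-point, 4-factor versions of the two design
# types discussed here -- a central composite design (corners + axial + center)
# and a space-filling Latin hypercube -- both scaled to [-1, 1].
import itertools
import numpy as np

def central_composite(k=4):
    corners = np.array(list(itertools.product([-1.0, 1.0], repeat=k)))  # 2^k points
    axial = np.zeros((2 * k, k))                                        # 2k points
    for i in range(k):
        axial[2 * i, i] = -1.0        # face-centered axial points
        axial[2 * i + 1, i] = 1.0
    center = np.zeros((1, k))                                           # 1 point
    return np.vstack([corners, axial, center])                          # 25 rows for k = 4

def latin_hypercube(n=25, k=4, rng=np.random.default_rng(0)):
    # One point per stratum in every one-dimensional projection.
    cells = (np.argsort(rng.random((n, k)), axis=0) + rng.random((n, k))) / n
    return 2.0 * cells - 1.0          # rescale [0, 1) -> [-1, 1)

print(central_composite().shape, latin_hypercube().shape)   # (25, 4) (25, 4)
```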
Figure 1 compares a 4-factor CCD example (left), which requires 25 points, with a 4-factor Latin hypercube sampling (LHS)6 design (right)—for fair comparison, also having 25 points. Each example’s diagonal subplots show the individual factor levels, and each off-diagonal subplot is a two-dimensional subspace projection of the same 25 sample points. Even though each sample size is 25, the CCD appears sparser because of the lattice-stacking within each projection. Due to the complex responses that are common in DASE, the space-filling nature of LHS frequently yields surprising discoveries.

Figure 2: Example Nested Latin Hypercube design, 3 layers for 3 factors.

A more recent sampling innovation that has proven useful for DASE is nested LHS.7 The idea is to create space-filling designs that can be executed sequentially when further detail is deemed necessary, rather than starting over each time. For example, Figure 2 shows a 3-factor, 3-layer example. Layer 1 has only 4 sample points (red circles). Layer 2 adds 8 more points (green squares), interleaved within the original 4 points, giving a total of 12 (4 x 3) points. Finally, Layer 3 adds 12 more points (blue triangles), interleaved with the prior 12 points, giving a sample size of 24 (4 x 3 x 2) points. Nested LHS has proven useful for high-fidelity, scene-based missile simulations, which can require hours for a single run.

Another alternative sampling strategy is called PEM, for Point Estimation Method. PEM is a Raytheon-developed tool8 that samples much more sparingly than Monte Carlo sampling. PEM is based on two basic ideas. First, the response’s expected value (mean µ_Y) and variance σ_Y^2 are each modeled by applying the expected value operator to a Taylor series approximation of the response function Y(x).9 The models depend on the k factors’ statistical moments (variance σ_i^2, skewness W_i, and kurtosis K_i) and on the response Y’s first and second partial derivatives (S_i, T_i, i = 1 to k factors):

\mu_Y = Y_{X^*} + \frac{1}{2} \sum_{i=1}^{k} T_i \sigma_i^2     (1)

\sigma_Y^2 = \sum_{i=1}^{k} \left[ S_i^2 \sigma_i^2 + S_i T_i \sigma_i^3 W_i + \left( \frac{T_i \sigma_i^2}{2} \right)^2 (K_i - 1) \right]     (2)

Y_{X^*} is the predicted response value when all k factors are at their nominal values, and the sum in Eqn. (1) predicts a mean shift that occurs in the presence of any nonlinear response terms (in which case, at least one second partial derivative T_i is non-zero). These equations are also the basis of a tolerance allocation tool developed at Raytheon, called RAVE (Raytheon Analysis of Variability Engine), which was used in the Category 2 example described in Section IV.

The second idea behind the PEM tool involves determining the locations x_k and weightings p_k of Gaussian quadrature points that match each factor’s lowest-order moments (usually the first four):

\int_{-\infty}^{\infty} x^j f(x)\, dx = \sum_{k=1}^{n} p_k x_k^j, \quad j = 0, 1, \ldots, 2n-1     (3)

The underlying assertion is that, given representation of a random variable X by a sample of size n from its domain, the method of Gaussian quadratures can be used to determine which n values of x to choose, and how much weight (the p values) to associate with each x value, in computing the response probability distribution. This set of paired p and x values reproduces the first 2n − 1 moments of X. Critical values {p_k, x_k} are selected for each factor by solving the set of 2n equations in Eqn. (3), where n is the number of factor levels that will be used to represent the factor in the designed experiment.
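A small numerical sketch of Eqns. (1)-(3) follows. It is not the PEM or RAVE implementation; the sensitivities and factor moments are placeholder values, and the three-point rule shown is the classical Gaussian-quadrature solution of Eqn. (3) for a standard normal factor.

```python
# Illustrative sketch of Eqns. (1)-(3): second-order moment propagation and a
# Gaussian-quadrature point set for one factor. Placeholder numbers throughout;
# this is not the PEM/RAVE implementation.
import numpy as np

def pem_mean_variance(y_nominal, S, T, sigma, W, K):
    """Eqns. (1)-(2): response mean and variance from each factor's first and
    second partial derivatives (S, T) and its moments (sigma, skew W, kurtosis K)."""
    S, T, sigma, W, K = map(np.asarray, (S, T, sigma, W, K))
    mu_y = y_nominal + 0.5 * np.sum(T * sigma**2)
    var_y = np.sum(S**2 * sigma**2
                   + S * T * sigma**3 * W
                   + (T * sigma**2 / 2.0)**2 * (K - 1.0))
    return mu_y, var_y

# Two hypothetical factors: mild curvature on the first, purely linear second.
print(pem_mean_variance(10.0, S=[1.0, 0.5], T=[0.4, 0.0],
                        sigma=[1.0, 2.0], W=[0.0, 0.0], K=[3.0, 3.0]))

# Eqn. (3) for a standard normal factor with n = 3 levels: the quadrature points
# {-sqrt(3), 0, +sqrt(3)} with weights {1/6, 2/3, 1/6} reproduce the moments of
# N(0, 1) up to order 2n - 1 = 5.
x = np.array([-np.sqrt(3.0), 0.0, np.sqrt(3.0)])
p = np.array([1/6, 2/3, 1/6])
exact = [1, 0, 1, 0, 3, 0]                      # E[X^j] for j = 0..5, X ~ N(0, 1)
print([round(float(np.sum(p * x**j)), 6) for j in range(6)], exact)
```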
Using the above equations, the PEM algorithm designs the experiment by allocating more levels to factors with higher sensitivities than to those with lower sensitivities. Accurate estimates of the response distribution have been shown possible with orders of magnitude fewer samples than Monte Carlo sampling.10

B. Alternative Modeling Strategies. As mentioned previously, basic response surface methodology (RSM) involves building empirical models by using ordinary least-squares fitting to analytic forms—e.g. low-order polynomials or transcendental functions. Within DASE, popular alternatives include logistic regression and classification and regression trees.

Figure 3 illustrates a two-factor logistic regression surface for a response that has a significant interaction between the two factors. This surface has the following equation:

\pi(x_1, x_2) \equiv \mathrm{Prob}(Y = 1) = \frac{1}{1 + \exp\left[ -\left( 2x_1 + 3x_2 - 8x_1 x_2 \right) \right]}     (4)

Eqn. (4) states that the response random variable Y’s binomial parameter (denoted by the symbol π) varies as a function of factors x1, x2, and their interaction. Logistic regression is appropriate when the response can be transformed from a continuous response (e.g. miss distance) to a binomial response (e.g. Probability of Success). Experience with DASE has shown that converting a response to a binomial random variable often reduces the factor dimensionality required to construct an accurate, empirical predictive model.

Figure 3: Example 2-factor logistic regression surface with strong factor interaction.

In DASE, a popular alternative to a smooth response surface model is a regression tree. Rather than representing a response as an analytic function, a regression tree is a binary decision tree, with each decision point being a threshold value of one of the factors. Leaf nodes indicate the predicted response value11 (see Figure 4).

Figure 4: Example regression tree for response prediction (root node: N = 144, mean = 14.65, first split at X5 < −0.5).

IV. Example Applications of DASE

This section includes four examples, one per DASE category. The intent is not to provide in-depth information on each example (see references for more details), but to illustrate usage of the tools and methods described in previous sections.

A. Category 1 Example (Performance): Increased Coverage with Fewer Runs. As mentioned previously, Category 1 DASE is more challenging than the other categories for applying conventional DOE. Typical missile performance assessments involve 12-18 control factors (which define each engagement scenario) and hundreds of Monte Carlo factors (which represent uncontrolled “noise” in the system and environment). Performance should be assessed across the full space in order to estimate Probability of Success for variously defined performance bins (e.g. by target type). Prior to applying DOE, specific scenarios were hand-picked by subject matter experts to represent performance. As time passed, this set of hand-picked scenarios grew unacceptably large while still not covering the entire operational space.
Therefore, the decision was made to replace the legacy performance scenario set with one based upon DOE principles, as now described.

Figure 5: Sets of missile engagement scenarios for performance assessment: candidate points, feasible-point population, and evaluation sample.

Figure 5 illustrates a novel approach for defining a Category 1 sample space. First, LHS is used to generate a large number of candidate points. Filters are established to exclude candidates that may be either operationally unlikely (given engagement tactics) or kinematically infeasible (given missile hardware limitations, e.g. rocket motor size, control authority, etc.). An example filter criterion is Probability of Guide, i.e. the likelihood of guiding within a feasible terminal configuration for success. The filtered set constitutes a finite population of feasible points (middle, green set) for performance assessment. Once the population is established, an evaluation sample (inner, yellow set) can be drawn from it. Sample size depends upon available computing power and time allotted, and on the experiment’s specific purpose—e.g. a final or intermediate performance assessment, or quick regression-testing following a software change (which should involve as small a sample set as possible). Random selection has been used, but the authors’ recent experience indicates that a better selection method is D-optimal sampling,2 which ensures that small sample sizes do not result in clumps or voids of scenarios within the sampled operational space.

B. Category 2 Example (Design): System Tolerance Allocation for Tracker Performance.10

Figure 6: Functional relationship between the tracker algorithm and other system elements (factors X1-X14 are annotated on the corresponding functional blocks).

A systems engineer requested that system tolerances be established to maximize success of the missile’s tracker algorithm. The tracker algorithm is crucial to missile performance: if a high probability of target tracking is achieved, the guidance and control systems are likely able to manage the missile’s energy well enough to assure successful target intercept. Two simulations were used. The first utilized a statistical seeker model and required about 5 minutes to simulate a single fly-out; the second used a high-fidelity, scene-based seeker model and required about 2.5 hours per fly-out. One challenge, therefore, was successfully blending results from the two simulations. Figure 6 is a high-level diagram of the control loop used in this experiment; the loop includes several functional blocks and fourteen factors that were allocated tolerances based on the results of several designed experiments. Note the absence of guidance and control factors—i.e. their parameters were assumed fixed for this study. The response for this study was median miss distance.

Table 1: Factor symbols, descriptions, and final values assigned during each phase of the study.
1. CRP, Coherent Reference Point Error (Radar HW): a ranging error resulting in image formation errors.
2. blRg, RF Blind Range (Flight-related): range at which the radar stops dwelling.
3. bsAln, Boresight Alignment Error (Radar HW): misalignment between antenna & IR camera boresights.
4. fDOR, Frame Dropout Rate (Signal processing): fraction of dropped frames.
5. frLat, Frame Latency (Signal processing): elapsed time between taking and using a measurement for control.
6. frR, Frame Rate (Signal processing): tracker data sample rate.
7. gbGRN, Gimbal to GPS-Rcvr-NAV Alignment Error (Radar HW).
8. gbM, Gimbal Angle Measurement Error (Radar HW).
9. hoP, Target Hand Over Error, Position (Flight-related).
10. hoV, Target Hand Over Error, Velocity (Flight-related).
11. mpOff, Monopulse Offset Error (Radar HW): angle calculation term.
12. mpSlp, Monopulse Slope Error (Radar HW): angle calculation term.
13. rgRz, Range Resolution (Signal processing): minimum separation between distinguishable objects.
14. trRg, Transition Range (Flight-related): tracker start-up range.
Allocated values (-1 to +1 scaling) by phase, in Screening, Modeling, and RAVE columns: - - 0.213 -1 -1 -1 - 0.610 0.610 +1 +1 +1 +1 +1 +1 0 +1 0 0 +1 0 0.674 0 0 +1 0 0.423 0.452 0.674 0 -1 -1 -1 - 0.413 0.413.

Table 1 gives a one-line description of each factor, along with the scaled tolerance values that were established for each factor during DASE Steps 5 and 6. Using the faster, statistical seeker simulation, 7 of the 14 factors were eliminated during screening—i.e. set to the values shown in the Screening column. The modeling step also used the statistical seeker simulation, and a 7-factor regression model was created for use in the RAVE tool to allocate tolerances for 4 of the 7 factors. This model was also used in the PEM tool to allocate tolerances to the remaining 3 factors (#1, #9, and #10), and finally, the high-fidelity simulation was used to verify the entire tolerancing solution.10

C. Category 3 Example (Test Support): RF Seeker Monopulse Angle Calibration. This Category 3 experiment involves monopulse angle calibration for a missile’s RF seeker. Due to imperfections in manufacturing, the RF antenna is typically created with perturbed sum and delta channel antenna patterns. The perturbation is usually a shift from zero azimuth and elevation in the peak of the main-lobe sum channel, and/or in the null of the delta channel patterns—see Figure 7. In bench-top testing, an RF seeker is placed at one end of an anechoic chamber, and a grid scan is made across an RF repeater, varying the azimuth and elevation angles. These coarse samples and the true pointing angles are used to compute a least-squares linear fit to the antenna patterns, and the least-squares coefficients are used to estimate the RF monopulse angle. To calibrate the high-fidelity seeker model, fine-grid sampling is conducted with arbitrary antenna perturbation patterns against a point target. One advantage of using the simulation is that the antenna pattern is available, which can be used to determine the monopulse angle coefficients more precisely. The goal of monopulse angle calibration is to determine coefficients such that the angle error is within specification. However, designed experiments are needed to cover the sample space (i.e. perturbation of the antenna patterns) and to determine whether the RF monopulse calibration process can achieve the specified angle error. If the calibration process cannot achieve the specified angle error, adjustments are made to fit the antenna patterns. The simulation experiment’s response is monopulse angle error, which must be within the specifications assumed by the algorithms. Factors being varied are the perturbation shifts to each channel’s antenna pattern, which are interdependent.
For each of the four quadrants Q1 - Q4, a phase and a gain perturbation are applied independently, with upper limits for each of these factors based on maximum expected manufacturing variations.

Figure 7: Sum and delta channel perturbations due to antenna manufacturing imperfections.

A Resolution V fractional factorial simulation experiment is conducted to identify significant two-way interactions.2 No factors can be screened, and the data are used to quantify all factors’ sensitivities. Next, a Latin hypercube sampling experiment is performed to create a surrogate model using ordinary least-squares fitting. The surrogate model is verified to have the desired properties (unbiased, Gaussian residuals). The data are also examined for any outlier found to be outside of specification. Finally, the surrogate model is compared against bench-top test data for the seeker. Strong agreement between simulation and test results indicates that the monopulse calibration process is valid. Furthermore, the surrogate model is used within the statistical seeker simulation to model expected factor calibration error.

D. Category 4 Example: Incremental SimV&V.12 As mentioned previously, simulation validation must be done incrementally, due to the small number of actual flight tests conducted. Figure 8 is an example plot of two flight test matching parameters (FMPs) used with a single flight test to incrementally validate a missile performance simulation.

Figure 8: Example Percent-in-Envelope Analysis for two flight test matching parameters.

In this case, there is only one scenario being simulated—the test being flown—and therefore all variations are due to Monte Carlo variables. By tracking the percent-in-envelope for the FMPs, and by using flight test data as they become available, the accuracy of the simulation can be assessed and improved.12

V. Conclusions

This paper and its companion (Part 1) describe current issues and successes in applying DOE to simulation. Most engineers have had limited exposure to DOE and DASE, so one challenge is to educate them. A common misconception is that DASE is only applicable to specific types of simulations; however, DASE is a tool to assist in the most fundamental of engineering work: high-dimensional, multidisciplinary trade studies, where the solution is not obvious and where multiple subject matter experts must collaborate to integrate their work.

One challenge in securing support for DASE is measuring its business impact. Those who have embraced DASE usually express confidence that engineering designs are more robust due to the wider coverage of the design space. During a recent program milestone review, the customer’s chief engineer commented that “by using DOE, this program has achieved a level of thoroughness unmatched by any previous program with which I’ve been associated.” This comment was reassuring, but until such benefits are established quantitatively, consistent management support for DOE will remain difficult to secure.

Another challenge is how to facilitate commonality, reuse, and sharing of tools and analysis techniques within an expanding community of DASE practitioners. Hopefully, these papers will stimulate discussions that lead to improvements in the discipline and infrastructure of DASE.

References

1. Kendall, M.G., “Hiawatha Designs an Experiment,” The American Statistician, Vol. 13, No. 5, 1959.
Also available at http://www.the-aps.org/publications/tphys/legacy/1984/issue5/363.pdf.
2. Montgomery, D.C., Design and Analysis of Experiments, Wiley, 2009.
3. Box, G.E.P., Hunter, J.S., and Hunter, W.G., Statistics for Experimenters, Wiley, 2005.
4. Zittel, R.C., “The Reality of Simulation-Based Acquisition—and an Example of U.S. Military Implementation,” Acquisition Review Quarterly, Summer 2001. Also available at http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.112.6349&rep=rep1&type=pdf.
5. Gilmore, J.M., “Guidance on the use of Design of Experiments (DOE) in Operational Test and Evaluation,” memorandum from the Office of the Secretary of Defense, Washington, D.C., Oct. 19, 2010.
6. McKay, M.D., Beckman, R.J., and Conover, W.J., “A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code,” Technometrics, Vol. 21, No. 2, May 1979, pp. 239-245.
7. Qian, P.Z.G., Ai, M., and Wu, C.F.J., “Construction of Nested Space-Filling Designs,” The Annals of Statistics, Vol. 37, No. 6A, 2009, pp. 3616-3643.
8. Alderman, J. and Mense, A., “Second Generation Point Estimation Methods,” Proceedings, 2006 U.S. Army Conference on Applied Statistics, Durham, NC.
9. Hahn, G.J. and Shapiro, S.S., Statistical Models in Engineering, Wiley, 1994.
10. Hurst, T.N., Joseph, C., Rhodes, J.S., and Vander Putten, K., “Novel Experimental Design & Analysis Methods for Simulation Experiments Involving Algorithms,” Proceedings, 2009 U.S. Army Conference on Applied Statistics, Cary, NC.
11. Breiman, L., et al., Classification and Regression Trees, Chapman & Hall/CRC, 1998.
12. Hurst, T.N. and Collins, B.D., “Simulation Validation Alternatives When Flight Test Data Are Severely Limited,” Proceedings, 2008 AIAA Modeling and Simulation Technologies Conference (AIAA-6357).