CS&E and Statistics

James Berger
Duke University and Statistical and Applied Mathematical Sciences Institute (SAMSI)

Outline
• A Glimpse of the World of Statistical Modeling in Science, Engineering and Society from the Viewpoint of a Statistician
• Bringing the CS&E and Statistics Communities Together
• Research Themes

I. An Idiosyncratic Glimpse of the World of Statistical Modeling in Science, Engineering, and Society
• Example 1: Predicting Fuel Economy Improvements
• Example 2: Understanding the Orbital Composition of Galaxies
• Example 3: Protecting Confidentiality in Government Databases, while Allowing for their Use in Research

Example 1: An early 90’s study of the potential available gain in fuel economy, to gauge the possibility of changing CAFE
• Statistical modeling of EPA data involved
  – physics/engineering-based data transformations
  – ‘multilevel random effects’ models, accounting for vehicle model effects, manufacturer effects, technology type, … (about 3000 parameters)
  – physics/engineering knowledge of effect on vehicle performance of technology changes, necessary to implement a ‘constant performance’ condition, some from simulation.
• Prediction of the effect of technology change (highly non-linear)
  – was done in a Bayesian fashion;
  – involved thousands of 3000-dimensional integrals;
  – utilized Markov Chain Monte Carlo methods.
• The total estimated fuel economy gains available by 1995 and 2001 were (within 2%)
  – 11% and 20% (Automobile)
  – 8% and 16% (Truck)
(Note that legislation had proposed CAFE increases of 20% by 1995 and 40% by 2001.)
See http://www.stat.duke.edu/~berger/papers/fuel.html

Example 2: Understanding the orbital composition of galaxies
• Consider a galaxy as made of a collection of ‘rings’ of orbiting stars; each ring specified by
  – its location
  – a given velocity for the stars in the ring.
• Available data is the luminosity in each (location, velocity) slit of the galaxy;
  – it is measured with noise.
• Goal: find the luminosity ‘weight’ of each ‘ring’.
• Finding the weights appears to be a linearly constrained quadratic minimization problem, but
  – there are many local minima, with nearly the same minimum value, so the actual minimum is unimportant
  – characterization of the uncertainty in the weights is crucial, leading to identification of the computationally ‘stable’ and ‘transient’ orbits.
• A solution is to employ Bayesian analysis, leading to the posterior distribution of weights:
  – here, dimensions of integration are roughly equal to the number of orbits considered;
  – new Markov Chain Monte Carlo methods for highly constrained spaces are required.

Example 3: Protecting Privacy in an Electronic, Post-9/11/01 World
• Underlying Tension: Federal statistical agencies must
  – protect confidentiality of data (and privacy of individuals and organizations),
  – disclose information to the public, researchers, …
• Current Milieu: Sophisticated ways to break confidentiality.
  – Example: Linkage to external databases (many) using powerful software tools.
• The need: equally powerful models and tools to protect confidentiality.
• Full Data: Large (e.g., 40 dimensions x 10 categories) contingency table corresponding to a categorical database; note that there are $10^{40}$ cells in the full table (but most are 0).
• To Disseminate to Researchers: Set of marginal sub-tables that maximize utility of released information subject to a risk disclosure constraint.
• The difficult computational challenges include
  – computation via MCMC or integer programming or ?? with huge contingency tables;
  – optimization of the utility, subject to the constraint;
  – determination of the statistical utility of sub-tables.
See NISS Digital Government Project: http://www.niss.org/dg

II.
Bringing the CS&E and Statistics Communities Together
• Example: Inverse problems and validation for complex computer models
• Barriers to closer association
• Mechanisms for closer association

Example: Development, Analysis and Validation of Computer Models
– Consider computer models of processes, created via applied mathematical modeling, statistical modeling, microsimulation, or another strategy.
– Collect data from the real process, to
  • Find unknown parameters of the computer model (the inverse problem), and characterize uncertainty
  • Find inadequacies of the computer model and suggest improvements
  • Predict accuracy of the computer model
[Figure: b(x) against x]

Illustration: Math modeling of vehicle crashes
• A finite element applied math model
  – 100,000 elements
  – developed using LS-DYNA
  – 12+ hours to run
• Accelerometer data is available at differing vehicle velocities
  – 36 computer runs
  – 36 field tests

Statistical modeling of velocities as a function of time:
$v_{\text{field}}(t) = v_{\text{true}}(t) + e(t)$,
$v_{\text{model}}(t) = v_{\text{true}}(t) + b(t)$,
where $e(t)$ is noise and $b(t)$ is computer model bias.

Analysis: Use Bayesian analysis and Markov Chain Monte Carlo implementation to
– provide estimates (with uncertainties) of unknown coefficients in the math model, e.g., damping;
– assess accuracy of predictions of the computer model (e.g., at initial velocity v = 30 mph, there is a 90% chance that the computer model prediction is within 1.5 of the true process value);
– predict key engineering quantities, such as CRITV, the airbag deployment time.
See: http://www.niss.org/technicalreports/tr128.pdf

Barriers to Bringing the CS&E and Statistics Communities Together
• To many disciplinary scientists
  – we are each ‘providers of tools they can use’
  – we are indistinguishable quantitative experts
• Program and project funding rarely encourages inclusion of both CS&E and statistical scientists.
• Our traditional application areas generally differ
  – CS&E tradition: physical sciences and engineering
  – Statistics tradition: strongest (as the statistics discipline) in social sciences, medical sciences, …
(This could be an organizational strength for the CS&E initiative, but is a barrier at the personal level.)

Mechanisms for Bringing the CS&E and Statistics Communities Together
• Most important is simply to bring them together on interdisciplinary teams.
• Institute programs (e.g., at SAMSI), for extended cooperation
  – joint workshops
  – joint working groups
• Emphasize need for joint funding on interdisciplinary projects.
• At Universities?

Organizing and Delivering Joint CS&E and Statistics Educational Programs
At SAMSI, we
– provide integrated courses, jointly taught;
– provide graduate students and postdocs with year-long exposure to joint programs;
– provide one-week outreach programs to undergraduates and high-school teachers, and two-week outreach programs to beginning graduate students, to introduce them to the CS&E and Statistics worlds;
– begin opening program workshops with extensive tutorials.

Research Challenges
• Statistical computational research challenges:
  – MCMC development and implementation
  – data confidentiality and large contingency tables
  – dealing with large data sets
    • in real time
    • off-line
  – bioinformatics, gene regulation, protein folding, …
  – data mining
  – utilizing multiscale data
  – data fusion, data assimilation
  – graphical models/causal networks
  – open source software environments
  – visualization
  – many, many more.
• Challenges in the synthesis of statistics and development of computer modeling:
  – Statistical analysis in non-linear situations can require thousands of model evaluations (e.g., using MCMC), so the ‘real’ computational problem is the product of two very intensive computational problems; this is needed for
    • designing effective evaluation experiments;
    • estimating unknown model parameters (inverse problem), with uncertainty evaluation;
    • assessing model bias and predictive capability of the model;
    • detecting inadequate model components.
  – Simultaneous use of statistical and applied mathematical modeling is needed for
    • effective utilization of many types of data, such as
      – data that occurs at multiple scales;
      – data/models that are individual-specific.
    • replacing unresolvable determinism by stochastic or statistically modeled components (parameterization).

This general area of validation of computer models should be a Grand Challenge.
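
To make the "product of two computational problems" point concrete, here is a minimal Python sketch of an inverse problem solved by random-walk Metropolis sampling, with a counter on the number of computer model runs. Everything in it is a hypothetical stand-in chosen for illustration, not taken from the studies described in this talk: the toy exponential-decay "computer model", the flat prior range for the damping parameter, the noise level, the synthetic field data, and all function and variable names.

```python
import math
import random


class CountingModel:
    """Hypothetical stand-in for an expensive simulator; counts how often it is run."""

    def __init__(self, v0=30.0):
        self.v0 = v0
        self.calls = 0

    def __call__(self, damping, times):
        self.calls += 1
        return [self.v0 * math.exp(-damping * t) for t in times]


def log_posterior(model, damping, times, field, sigma):
    """Gaussian log-likelihood with an assumed flat prior on damping in (0, 5)."""
    if not 0.0 < damping < 5.0:
        return -math.inf  # outside the prior support: no model run needed
    predicted = model(damping, times)
    sq_err = sum((y - p) ** 2 for y, p in zip(field, predicted))
    return -0.5 * sq_err / sigma ** 2


def metropolis(model, times, field, sigma, n_iter=5000, step=0.02, seed=0):
    """Random-walk Metropolis sampler for the single unknown damping parameter."""
    rng = random.Random(seed)
    current = 1.0  # arbitrary starting value inside the prior support
    current_lp = log_posterior(model, current, times, field, sigma)
    samples = []
    for _ in range(n_iter):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_posterior(model, proposal, times, field, sigma)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples


# Synthetic "field data": the toy model at a known damping value plus noise.
data_rng = random.Random(1)
times = [0.1 * i for i in range(1, 21)]
true_damping, sigma = 0.5, 0.5
field = [v + data_rng.gauss(0.0, sigma) for v in CountingModel()(true_damping, times)]

model = CountingModel()
samples = metropolis(model, times, field, sigma)
kept = samples[1000:]  # discard an assumed burn-in period
posterior_mean = sum(kept) / len(kept)
print(f"posterior mean of damping: {posterior_mean:.3f}")
print(f"computer model runs used:  {model.calls}")
```

Each Metropolis iteration costs one run of the computer model, so even this one-parameter toy problem uses thousands of runs. At the 12+ hours per run quoted for the crash model, a few thousand sequential runs would take years, which is the sense in which the statistical and simulation costs multiply.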