Statistics in CSME

CS&E and Statistics
James Berger
Duke University and
Statistical and Applied Mathematical
Sciences Institute (SAMSI)
Outline
• A Glimpse of the World of Statistical
Modeling in Science, Engineering and
Society from the Viewpoint of a Statistician
• Bringing the CS&E and Statistics
Communities Together
• Research Themes
I. An Idiosyncratic Glimpse of the
World of Statistical Modeling in
Science, Engineering, and Society
• Example 1: Predicting Fuel Economy
Improvements
• Example 2: Understanding the Orbital
Composition of Galaxies
• Example 3: Protecting Confidentiality in
Government Databases, while Allowing for
their Use in Research
Example 1: An early 90’s study of the
potential available gain in fuel economy, to
gauge the possibility of changing CAFE
• Statistical modeling of EPA data involved
– physics/engineering-based data transformations
– ‘multilevel random effects’ models, accounting
for vehicle model effects, manufacturer effects,
technology type, … (about 3000 parameters)
– physics/engineering knowledge of effect on
vehicle performance of technology changes,
necessary to implement a ‘constant
performance’ condition, some from simulation.
• Prediction of the effect of technology change
(highly non-linear)
– was done in a Bayesian fashion;
– involved thousands of 3000-dimensional integrals;
– utilized Markov Chain Monte Carlo methods.
• The total estimated fuel economy gains
available by 1995 and 2001 were (within 2%)
– 11% and 20% (Automobile)
– 8% and 16% (Truck)
(Note that legislation had proposed CAFE increases of
20% by 1995 and 40% by 2001.)
See http://www.stat.duke.edu/~berger/papers/fuel.html
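The role of Markov Chain Monte Carlo in evaluating such integrals can be illustrated on a toy problem. The sketch below is not the fuel economy model: it assumes a one-parameter normal model with made-up data, and all function names are hypothetical. It shows only how a posterior expectation of a non-linear function is approximated by averaging over MCMC draws.

```python
import math
import random


def log_posterior(theta, data, prior_sd=10.0, noise_sd=1.0):
    """Unnormalized log posterior for a normal mean with a normal prior."""
    log_prior = -0.5 * (theta / prior_sd) ** 2
    log_lik = sum(-0.5 * ((y - theta) / noise_sd) ** 2 for y in data)
    return log_prior + log_lik


def metropolis(data, n_draws=20000, step=0.5, seed=0):
    """Random-walk Metropolis draws from the posterior of theta."""
    rng = random.Random(seed)
    theta = 0.0
    current = log_posterior(theta, data)
    draws = []
    for _ in range(n_draws):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, data)
        # Accept with probability min(1, posterior ratio); 1 - random() lies in (0, 1].
        if math.log(1.0 - rng.random()) < candidate - current:
            theta, current = proposal, candidate
        draws.append(theta)
    return draws


def posterior_expectation(draws, g, burn_in=1000):
    """Approximate the integral of g(theta) against the posterior by an average."""
    kept = draws[burn_in:]
    return sum(g(t) for t in kept) / len(kept)


data = [2.1, 1.9, 2.4, 2.0, 1.6]  # made-up observations, for illustration only
draws = metropolis(data)
mean_theta = posterior_expectation(draws, lambda t: t)
mean_nonlinear = posterior_expectation(draws, lambda t: math.exp(0.1 * t))
print(mean_theta, mean_nonlinear)
```

In a 3000-parameter model the same averaging step applies, but theta becomes a vector and the proposal mechanism needs far more care than this one-dimensional random walk.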
Example 2: Understanding the orbital
composition of galaxies
• Consider a galaxy as made of a collection of
‘rings’ of orbiting stars; each ring specified by
– its location
– a given velocity for the stars in the ring.
• Available data is the luminosity in each
(location, velocity) slit of the galaxy;
– it is measured with noise.
• Goal: find the luminosity ‘weight’ of each ‘ring’.
• Finding the weights appears to be a linearly
constrained quadratic minimization problem, but
– there are many local minima, with nearly the same
minimum value, so the actual minimum is unimportant
– characterization of the uncertainty in the weights is
crucial, leading to identification of the computationally
‘stable’ and ‘transient’ orbits.
• A solution is to employ Bayesian analysis,
leading to the posterior distribution of weights:
– here, dimensions of integration are roughly equal to
the number of orbits considered;
– new Markov Chain Monte Carlo methods for highly
constrained spaces are required.
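To make the idea of a posterior distribution over constrained weights concrete, here is a minimal sketch. It assumes a toy linear model with two 'rings' and three observations, a non-negativity constraint on the weights, and a flat prior on the allowed region; the design matrix, data, and function names are all hypothetical. It uses plain random-walk Metropolis that rejects any proposal violating the constraint, which is adequate for two dimensions but is not the kind of new method the galaxy problem calls for.

```python
import math
import random


def log_post(w, A, y, noise_sd):
    """Log posterior of the weights: Gaussian likelihood, flat prior on w >= 0."""
    if any(wi < 0.0 for wi in w):
        return -math.inf  # outside the constrained region
    resid = 0.0
    for row, yj in zip(A, y):
        pred = sum(a * wi for a, wi in zip(row, w))
        resid += (yj - pred) ** 2
    return -0.5 * resid / noise_sd ** 2


def sample_weights(A, y, noise_sd=0.1, n_draws=20000, step=0.05, seed=1):
    """Single-coordinate random-walk Metropolis on the non-negative orthant."""
    rng = random.Random(seed)
    w = [1.0] * len(A[0])
    current = log_post(w, A, y, noise_sd)
    draws = []
    for _ in range(n_draws):
        i = rng.randrange(len(w))
        proposal = list(w)
        proposal[i] += rng.gauss(0.0, step)
        candidate = log_post(proposal, A, y, noise_sd)
        # Proposals that break the constraint have log posterior -inf and are never accepted.
        if math.log(1.0 - rng.random()) < candidate - current:
            w, current = proposal, candidate
        draws.append(list(w))
    return draws


A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical ring-to-slit contributions
y = [0.5, 0.0, 0.5]                        # hypothetical luminosities
kept = sample_weights(A, y)[2000:]         # discard burn-in
means = [sum(d[i] for d in kept) / len(kept) for i in range(2)]
sds = [math.sqrt(sum((d[i] - means[i]) ** 2 for d in kept) / len(kept)) for i in range(2)]
print(means, sds)
```

The posterior standard deviations, not just the means, are the output of interest here: they are what characterizes the uncertainty in each weight.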
Example 3: Protecting Privacy in an
Electronic, Post-9/11/01 World
• Underlying Tension: Federal statistical agencies
must
– protect confidentiality of data (and privacy of
individuals and organizations),
– disclose information to the public, researchers, …
• Current Milieu: Sophisticated ways to break
confidentiality.
– Example: Linkage to external databases (many) using
powerful software tools.
• The need: equally powerful models and tools to
protect confidentiality.
• Full Data: Large (e.g., 40 dimensions x 10
categories) contingency table corresponding to a
categorical database; note that there are 10^40
cells in the full table (but most are 0).
• To Disseminate to Researchers: Set of marginal
sub-tables that maximize utility of released
information subject to a risk disclosure constraint
• The difficult computational challenges include
– computation via MCMC or integer programming or ??
with huge contingency tables;
– optimization of the utility, subject to the constraint;
– determination of the statistical utility of sub-tables.
See NISS Digital Government Project: http://www.niss.org/dg
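The cell count and the notion of a marginal sub-table can be sketched in a few lines. The example below assumes a tiny three-variable database stored as one tuple per record, so that empty cells are never materialized. The small-count rule used as a disclosure flag is a hypothetical stand-in for a real risk constraint, and the function names are illustrative.

```python
from collections import Counter

# 40 categorical variables with 10 categories each give 10^40 cells.
n_cells = 10 ** 40


def marginal_table(records, dims):
    """Count records by their values on the chosen dimensions only."""
    return Counter(tuple(r[d] for d in dims) for r in records)


def small_cells(table, threshold=3):
    """Hypothetical risk flag: non-empty cells with fewer than `threshold` records."""
    return {cell: n for cell, n in table.items() if n < threshold}


# Made-up records over three variables (the full table would have many empty cells).
records = [(0, 1, 2), (0, 1, 0), (1, 1, 2), (0, 0, 2)]
table_01 = marginal_table(records, dims=(0, 1))
risky = small_cells(table_01, threshold=2)
print(n_cells, dict(table_01), risky)
```

Storing only the observed records keeps memory proportional to the database size rather than to the number of cells, which is the only feasible representation when most of the cells are 0.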
II. Bringing the CS&E and Statistics
Communities Together
• Example: Inverse problems and validation
for complex computer models
• Barriers to closer association
• Mechanisms for closer association
Example: Development, Analysis and
Validation of Computer Models
– Consider computer models of processes,
created via applied mathematical modeling,
statistical modeling, microsimulation, or other
strategy.
– Collect data from the real process, to
• Find unknown parameters of the computer model
(the inverse problem), and characterize uncertainty
• Find inadequacies of the computer model and
suggest improvements
• Predict accuracy of the computer model
[Figure: plot of b(x) against x]
Illustration: Math modeling of vehicle
crashes
• A finite element
applied math model
– 100,000 elements
– developed using
LS-DYNA
– 12+ hours to run
• Accelerometer data is
available at differing
vehicle velocities
– 36 computer runs
– 36 field tests
Statistical modeling of velocities as a function of time:
v_field(t) = v_true(t) + e(t), v_model(t) = v_true(t) + b(t),
where e(t) is noise and b(t) is computer model bias.
Analysis: Use Bayesian analysis and Markov Chain
Monte Carlo implementation to
– provide estimates (with uncertainties) of unknown
coefficients in the math model, e.g., damping;
– assess accuracy of predictions of the computer model
(e.g., at initial velocity v=30 mph, there is a 90% chance that the
computer model prediction is within 1.5 of the true process value)
– predict key engineering quantities, such as
CRITV, the airbag deployment time.
See: http://www.niss.org/technicalreports/tr128.pdf
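Subtracting the two modeling equations gives v_model(t) - v_field(t) = b(t) - e(t), so averaging that difference over replicated field tests estimates the bias at each time point. The sketch below illustrates this with simulated data. It is a simple moment estimate, not the Bayesian analysis described above, and the velocity curve, bias curve, noise level, and number of replicates are all made-up assumptions.

```python
import math
import random
import statistics


def estimate_bias(model_run, field_runs):
    """Estimate b(t) at each time index as the mean of v_model - v_field."""
    estimates = []
    for k, v_model_k in enumerate(model_run):
        diffs = [v_model_k - run[k] for run in field_runs]
        mean = statistics.mean(diffs)
        std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
        estimates.append((mean, std_err))
    return estimates


rng = random.Random(42)
times = [0.0, 1.0, 2.0, 3.0, 4.0]
v_true = [30.0 - 5.0 * t for t in times]   # hypothetical true velocity
bias = [0.5 * t for t in times]            # hypothetical model bias b(t)
v_model = [vt + b for vt, b in zip(v_true, bias)]
# 20 simulated field tests: v_field(t) = v_true(t) + e(t), with noise sd 0.2
field_runs = [[vt + rng.gauss(0.0, 0.2) for vt in v_true] for _ in range(20)]

estimates = estimate_bias(v_model, field_runs)
for t, (b_hat, se) in zip(times, estimates):
    print(f"t={t:.0f}  bias estimate={b_hat:.3f}  std. error={se:.3f}")
```

A full analysis would also treat unknown model parameters and the smoothness of b(t) over time, which is where the Bayesian machinery earns its keep.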
Barriers to Bringing the CS&E and Statistics
Communities Together
• To many disciplinary scientists
– we are each ‘providers of tools they can use’
– we are indistinguishable quantitative experts
• Program and project funding rarely encourage
inclusion of both CS&E and statistical scientists.
• Our traditional application areas generally differ
– CS&E tradition: physical sciences and engineering
– Statistics tradition: strongest – as the statistics
discipline – in social sciences, medical sciences,…
(This could be an organizational strength for the CS&E
initiative, but is a barrier at the personal level.)
Mechanisms for Bringing the CS&E and
Statistics Communities Together
• Most important is simply to bring them together
on interdisciplinary teams.
• Institute programs (e.g., at SAMSI), for extended
cooperation
– joint workshops
– joint working groups
• Emphasize need for joint funding on
interdisciplinary projects.
• At Universities?
Organizing and Delivering Joint CS&E and
Statistics Educational Programs
At SAMSI, we
– provide integrated courses, jointly taught;
– provide graduate students and postdocs with
year-long exposure to joint programs;
– provide 1-week outreach programs to
undergraduates and high-school teachers,
and 2-week outreach programs to beginning
graduate students, to introduce them to the
CS&E and Statistics worlds;
– begin opening program workshops with
extensive tutorials.
Research Challenges
• Statistical computational research challenges:
– MCMC development and implementation
– data confidentiality and large contingency tables
– dealing with large data sets
• in real time
• off-line
– bioinformatics, gene regulation, protein folding, …
– data mining
– utilizing multiscale data
– data fusion, data assimilation
– graphical models/causal networks
– open source software environments
– visualization
– many, many more.
• Challenges in the synthesis of statistics
and development of computer modeling:
– Statistical analysis in non-linear situations can
require thousands of model evaluations (e.g.,
using MCMC), so the ‘real’ computational
problem is the product of two very intensive
computational problems; this is needed for
• designing effective evaluation experiments;
• estimating unknown model parameters (inverse
problem), with uncertainty evaluation;
• assessing model bias and predictive capability of
the model;
• detecting inadequate model components.
– Simultaneous use of statistical and applied
mathematical modeling is needed for
• effective utilization of many types of data, such as
– data that occurs at multiple scales;
– data/models that are individual-specific.
• replacing unresolvable determinism by stochastic
or statistically modeled components
(parameterization)
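The 'product of two computational problems' can be made concrete with rough arithmetic. The sketch below takes the 12-hour run time from the crash model illustration and assumes, purely hypothetically, that an MCMC analysis needs 5,000 model evaluations run one after another.

```python
def serial_hours(hours_per_run, n_evaluations):
    """Total wall-clock hours if every model evaluation runs sequentially."""
    return hours_per_run * n_evaluations


hours = serial_hours(hours_per_run=12, n_evaluations=5000)  # 5000 is an assumed count
days = hours / 24
years = days / 365
print(f"{hours} hours = {days:.0f} days = {years:.1f} years")
```

Even under these modest assumptions the serial cost runs to years, which is why the statistical analysis and the computer model cannot be treated as separate problems.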
This general area of validation of computer
models should be a Grand Challenge.