Computational Model Discovery

The end of geographic theory?
Prospects for model discovery in the geographic
domain
Mark Gahegan
Centre for eResearch & Dept. Computer Science
University of Auckland, New Zealand
The holy grail of analytics
Analytical models that can explain their own
reasoning
– David Harvey
– Peter Gould
– Stan Openshaw
Computational Model Discovery (or Discovery
Informatics)
Recap: there are two kinds of
analytical models…
- Predictive models
- Descriptive models
In what way is this new?
Data mining & knowledge discovery
– Does not emphasize model comprehensibility
– Does not take advantage of prior knowledge
– Produces predictive models that do not connect
to existing knowledge
Computational Model Discovery
– Focus on interpretability of models by humans
– Interested in explanations by connecting
observations to theory
Explanation in Geography (Harvey 1969)
Examines the stages of geographic investigation and
how together they support explanation, via:
- methodological frameworks: the nature of
investigation and
- philosophy: the nature of the science process and its
various conceptual artifacts (includes ontology),
- which determine representation: how we abstract
and represent the world and
- analysis: how we model and analyze the world
- through to explanation: which uses theory to
describe what our analysis reveals.
Inductive learning of models
based on processes
• A process is a collection of related functions
– Differential or algebraic form
– Can be a single equation
• Can have unobserved variables
• Specifies a causal relationship between one or
more input and output variables
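A minimal sketch of how a process of this kind might be represented in code. The `Process` class, the `derivatives` helper, and the growth and loss examples are hypothetical names chosen for illustration; they are not taken from any particular model discovery system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, float]

@dataclass
class Process:
    """One process: a rate function plus its signed effect on each variable."""
    name: str
    rate: Callable[[State, Dict[str, float]], float]  # rate from state and parameters
    effects: Dict[str, float]                         # variable -> multiplier of the rate
    params: Dict[str, float]

def derivatives(processes: List[Process], state: State) -> State:
    """Sum every process's contribution to d(variable)/dt."""
    d = {var: 0.0 for var in state}
    for p in processes:
        r = p.rate(state, p.params)
        for var, multiplier in p.effects.items():
            d[var] += multiplier * r
    return d

# Two hypothetical single-equation processes acting on one variable "x".
growth = Process("growth", lambda s, k: k["r"] * s["x"], {"x": +1.0}, {"r": 0.5})
loss = Process("loss", lambda s, k: k["m"] * s["x"], {"x": -1.0}, {"m": 0.2})

print(derivatives([growth, loss], {"x": 10.0}))  # {'x': 3.0}
```

The state dictionary may hold variables that are never observed; they take part in the equations like any other variable.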
Computational Model Discovery
[Figure: Predator-prey ecosystem. Observed concentration (0 to 350) of prey and predator against time (days 10 to 26); observed series labelled Aur-Obs and Nas-OBS.]
Example Process Model (from SC-IPM, Bridewell, 2008)
Prey growth
Predation
Predator loss
Algebraic
process to
calculate
grazing rate
Bridewell et al, 2008
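The equations on this slide did not survive extraction. As a rough illustration only, the four named processes can be sketched with classic Lotka-Volterra forms. The linear grazing rate, the parameter names (`r`, `a`, `e`, `m`) and their values are assumptions, not the SC-IPM model of Bridewell et al.

```python
def grazing_rate(prey, a):
    """Algebraic process (assumed linear form): grazing per unit of predator."""
    return a * prey

def rates(prey, pred, p):
    """Combine the three dynamic processes into d(prey)/dt and d(pred)/dt."""
    growth = p["r"] * prey                         # prey growth
    predation = grazing_rate(prey, p["a"]) * pred  # predation
    loss = p["m"] * pred                           # predator loss
    return growth - predation, p["e"] * predation - loss

def simulate(prey, pred, p, days, dt):
    """Forward-Euler integration; returns a list of (time, prey, predator)."""
    trajectory = [(0.0, prey, pred)]
    for i in range(1, int(round(days / dt)) + 1):
        d_prey, d_pred = rates(prey, pred, p)
        prey, pred = prey + dt * d_prey, pred + dt * d_pred
        trajectory.append((i * dt, prey, pred))
    return trajectory

# Illustrative parameter values only; they are not fitted to any data.
params = {"r": 0.5, "a": 0.01, "e": 0.5, "m": 0.3}
trajectory = simulate(prey=100.0, pred=20.0, p=params, days=16, dt=0.01)
for t, prey, pred in trajectory[::400]:
    print(f"day {t:5.2f}: prey={prey:7.2f} predator={pred:7.2f}")
```

Forward Euler is used only to keep the sketch free of dependencies; a proper ODE solver would be the usual choice for real data.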
Inducing Process Models Summary
• Input
– Time-series data
– Domain knowledge
– Processes and constraints
• Structure search
– Combine processes together using constraints and an
evaluation strategy to limit the search
• Parameter search
– For a given structure, fit parameters and evaluate
• Output
– List of models ranked by score
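The summary above can be sketched as a toy search over a single variable. Everything here is an assumption made for illustration: the three-process `LIBRARY`, the constraint (exactly one source, at most one sink), the parameter grids, the sum-of-squares score, and the noise-free synthetic "observations". A real system would use a much richer library and a smarter parameter search.

```python
import itertools
import math

# Hypothetical library: name -> (kind, rate function of (x, parameter), parameter grid).
LIBRARY = {
    "exp_growth": ("source", lambda x, r: r * x, [0.2, 0.4, 0.6, 0.8, 1.0, 1.2]),
    "inflow": ("source", lambda x, c: c, [1.0, 5.0, 10.0, 20.0]),
    "crowding": ("sink", lambda x, k: -k * x * x, [0.005, 0.01, 0.02]),
}

def satisfies_constraints(names):
    """Assumed constraint: exactly one source process, at most one sink."""
    kinds = [LIBRARY[n][0] for n in names]
    return kinds.count("source") == 1 and kinds.count("sink") <= 1

def simulate(names, params, x0, dt, steps):
    """Forward-Euler simulation of the model built from the named processes."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        dx = sum(LIBRARY[n][1](x, p) for n, p in zip(names, params))
        xs.append(x + dt * dx)
    return xs

def sse(predicted, observed):
    """Sum of squared errors; non-finite results count as infinitely bad."""
    err = sum((p - o) * (p - o) for p, o in zip(predicted, observed))
    return err if math.isfinite(err) else math.inf

def induce(observed, x0, dt):
    steps = len(observed) - 1
    ranked = []
    # Structure search: enumerate process subsets, pruned by the constraints.
    for size in range(1, len(LIBRARY) + 1):
        for names in itertools.combinations(LIBRARY, size):
            if not satisfies_constraints(names):
                continue
            # Parameter search: exhaustive grid for this structure.
            grids = [LIBRARY[n][2] for n in names]
            score, params = min(
                (sse(simulate(names, ps, x0, dt, steps), observed), ps)
                for ps in itertools.product(*grids)
            )
            ranked.append((score, names, params))
    return sorted(ranked)  # output: models ranked by score, lowest error first

# Synthetic time series generated from a known structure, so the search can be checked.
observed = simulate(("exp_growth", "crowding"), (0.8, 0.01), x0=5.0, dt=0.05, steps=200)
for score, names, params in induce(observed, x0=5.0, dt=0.05):
    print(f"{score:14.3f}  {names}  {params}")
```

Because the observations come from the same simulator, the generating structure and parameters are recovered with zero error; with noisy data the ranking would also need a penalty for model complexity.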
Computational Model Discovery
Given:
– a methodology for the research;
– a meta-model for the process of the research;
– a set of representational forms for the observations (data);
– observations for a set of variables;
– a set of categories (entities) that the model may include;
– a set of generic processes that specify relations among
entities;
– a set of constraints that indicate plausible relations among
processes and entities;
Find:
– a specific process model and associated parameterization that
not only predicts the observed values but also explains them
EVE, a bench robot
for drug discovery?
Qi et al, 2010, Journal of Integrative Bioinformatics, 7(3):126, http://journal.imbio.de
GOES early fire detection system
Koltunov et al, 2012
So, how close are we, in GIScience, to
discovering process models?
Example domain model: OneGeology
Example library of analytical functions (PySAL)
One possible process for scientific
investigation
[Figure: cycle of stages in scientific investigation]
Stages (with types of activity):
- Exploration: exploring, discovering
- Synthesis: learning, categorizing
- Analysis: generalizing, modeling
- Evaluation: explaining, testing, generalizing
- Presentation: communicating, consensus-building
Artifacts shown in the figure: data, concept, hypothesis, category, relation, model, theory, results, explanation, confidence, map
Gahegan, 2005
CyberGIS Grand Challenge
Create a ‘Geographical Process Model Discovery
System’ that integrates:
– a science model
– a domain (data) model
– analysis software
– data
– (constraints)
Are there limits to what we can learn
from data?
• Yes, but our learned models may still be useful
• Yes, the model is—at best—as good as the
data
– But this still might be better than current theory
• Yes, but as data becomes ubiquitous, then
these limits will retreat
End
CyberGIS Workflow:
5 simple (and also very complicated) steps
1. Discover and gain access to, and – to some extent –
understand (e.g. the semantics, the provenance, the
limitations of) each dataset we intend to use.
2. Harmonize these datasets into a consistent form (data
model), for example by re-projecting, converting from raster
to vector and harmonizing the semantics. (Data Model
Integration)
3. Analyze the datasets via an analytical workflow of some kind.
(Software Integration)
4. Validate the accuracy and suitability of the results.
5. Publish the results back into the Infrastructure. The results
are of little value unless they maintain connections to the
above steps.
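A minimal sketch of the five steps as a pipeline that carries its provenance along, so that published results stay connected to the steps that produced them. The `Artifact` class, the step functions and the tiny in-memory datasets are hypothetical stand-ins; real steps would call catalogue, harmonization and analysis services.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List

@dataclass
class Artifact:
    """A value plus the chain of steps that produced it."""
    value: Any
    provenance: List[str] = field(default_factory=list)

def run_step(name: str, func: Callable[[Any], Any], artifact: Artifact) -> Artifact:
    """Apply one workflow step and append its name to the provenance chain."""
    return Artifact(func(artifact.value), artifact.provenance + [name])

def discover(_):
    # Stand-in for catalogue search: two datasets in different units.
    return [{"name": "rainfall", "unit": "mm", "values": [3.0, 5.0, 4.0]},
            {"name": "runoff", "unit": "cm", "values": [1.0, 2.0, 1.5]}]

def harmonize(datasets):
    # Bring every dataset to millimetres (one simple kind of harmonization).
    to_mm = {"mm": 1.0, "cm": 10.0}
    return [{"name": d["name"], "unit": "mm",
             "values": [v * to_mm[d["unit"]] for v in d["values"]]}
            for d in datasets]

def analyze(datasets):
    # Stand-in analysis: the mean of each dataset.
    return {d["name"]: sum(d["values"]) / len(d["values"]) for d in datasets}

def validate(result):
    # Stand-in check: means of these quantities must not be negative.
    if any(v < 0 for v in result.values()):
        raise ValueError("negative mean value")
    return result

def publish(result):
    return {"published": True, "result": result}

steps = [("discover", discover), ("harmonize", harmonize), ("analyze", analyze),
         ("validate", validate), ("publish", publish)]
artifact = Artifact(None)
for name, func in steps:
    artifact = run_step(name, func, artifact)

print(artifact.provenance)
print(artifact.value)
```

Recording only step names is the simplest possible provenance; a fuller record would also keep dataset identifiers, parameters and software versions.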
Learn a predictive model, even when
entire steps/states are missing?
Bayesian belief network learning
An example inferred model from
GIScience
The consumer wants fit-for-purpose data, but the task and
domain semantics are not given (latent variables).
Gahegan & Adams, 2014
The education of the GIScientist?
• Better data custodian skills
• Better scientific computing skills—but you
have to bring the geographic understanding
too
• Deeper awareness of the processes/philosophy
of our science
• A greater respect for data…
• An outward gaze…
…with types of inference and examples of visual and computational methods
[Figure: cycle of stages in scientific investigation, annotated with methods]
Stages (with types of inference):
- Exploration: exploring, discovering
- Synthesis: learning, categorizing
- Analysis: generalizing, modeling
- Evaluation: explaining, testing, generalizing
- Presentation: communicating, consensus-building
Artifacts shown in the figure: data, hypothesis, concept, map, category, relation, explanation, confidence, theory, model, results
Example methods (grouped as in the figure):
- Databases, digital libraries, clearinghouses
- Scatterplot, grand tour, projection pursuit, parallel coordinate plot, iconographic displays
- Self-organizing map, k-means, clustering, geographical analysis machine, data mining, concept learning
- Interactive visual classification, parallel coordinate plot, separability plots, graphs of relationships
- Statistical modeling, machine learning, maximum likelihood, decision trees, regression & correlation analysis
- Scene composition, information fusion, visual overlay
- Statistical testing, M-C simulation
- Uncertainty visualization
- Maps, navigable worlds, charts, immersive visualizations
The Evolving Paths to Knowledge
The First Paradigm:
Experiment/Measurement
The Second Paradigm:
Analytical Theory
The Third Paradigm:
Numerical Simulations
The Fourth Paradigm:
Data-Driven Science?
Data fusion + data mining
+ synthesis/learning +
explanation
(George Djorgovski, Caltech)
Building Explanatory Models from
Time-Series Data
• Process models are a natural choice
• Many ways to define process
• Processes are causal relations between one or
more input and output variables
• Processes represent knowledge in notation
familiar to scientists
– Helpful for explanation