The end of geographic theory? Prospects for model discovery in the geographic domain

Mark Gahegan
Centre for eResearch & Dept. Computer Science
University of Auckland, New Zealand

The holy grail of analytics
Analytical models that can explain their own reasoning
– David Harvey
– Peter Gould
– Stan Openshaw

Computational Model Discovery (or Discovery Informatics)

Recap: there are two kinds of analytical models…
- Predictive models
- Descriptive models

In what way is this new?
Data mining & knowledge discovery
– Does not emphasize model comprehensibility
– Does not take advantage of prior knowledge
– Produces predictive models that do not connect to existing knowledge
Computational Model Discovery
– Focus on interpretability of models by humans
– Interested in explanations by connecting observations to theory

Explanation in Geography (Harvey 1969)
Examines the stages of geographic investigation and how together they support explanation, via:
- methodological frameworks: the nature of investigation and
- philosophy: the nature of the science process and its various conceptual artifacts (includes ontology),
- which determine representation: how we abstract and represent the world and
- analysis: how we model and analyze the world
- through to explanation: which uses theory to describe what our analysis reveals.
Inductive learning of models based on processes
• A process is a collection of related functions
  – Differential or algebraic form
  – Can be a single equation
• Can have unobserved variables
• Specifies a causal relationship between one or more input and output variables

Computational Model Discovery
[Figure: Predator prey ecosystem. Concentration of prey and predator plotted against time (days); legend entries Aur - Obs and Nas-OBS.]

Example Process Model (from SC-IPM, Bridewell, 2008)
[Figure: process model annotated with prey growth, predation, predator loss, and an algebraic process to calculate grazing rate. Bridewell et al, 2008]

Inducing Process Models Summary
• Input
  – Time-series data
  – Domain knowledge
  – Processes and constraints
• Structure search
  – Combine processes together using constraints and an evaluation strategy to limit the search
• Parameter search
  – For a given structure, fit parameters and evaluate
• Output
  – List of models ranked by score

Computational Model Discovery
Given:
– a methodology for the research and
– a meta-model for the process of the research and
– a set of representational forms for the observations (data)
– observations for a set of variables;
– a set of categories (entities) that the model may include;
– a set of generic processes that specify relations among entities;
– a set of constraints that indicate plausible relations among processes and entities;
Find:
– a specific process model and associated parameterization that not only predicts the observed values but also explains them

EVE, a bench robot for drug discovery?
Qi et al, 2010, Journal of Integrative Bioinformatics, 7(3):126 http://journal.imbio.de

GOES early fire detection system
Koltunov et al, 2012

So, how close are we, in GIScience, to discovering process models?
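The "Inducing Process Models" summary in this part of the talk (structure search over constrained combinations of processes, parameter search for each structure, output ranked by score) can be sketched roughly as below. This is a toy illustration and not SC-IPM: the process library, the parameter grid, the single constraint, the exhaustive enumeration, the forward-Euler simulator and the squared-error score are all assumptions made for brevity, and the observations are synthetic.

```python
from itertools import combinations, product

# Hypothetical library of generic processes for one variable x.
# Each maps (x, parameter) to one term of dx/dt.
PROCESSES = {
    "growth": lambda x, k: k * x,
    "limitation": lambda x, k: -k * x * x,
    "loss": lambda x, k: -k * x,
}

# Assumed coarse grid of candidate parameter values per process.
GRID = {
    "growth": [0.25, 0.5, 1.0],
    "limitation": [0.005, 0.01, 0.02],
    "loss": [0.1, 0.2, 0.4],
}


def simulate(structure, params, x0, dt, steps):
    """Forward-Euler simulation of the model built from the named processes."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        dx = sum(PROCESSES[name](x, params[name]) for name in structure)
        xs.append(x + dt * dx)
    return xs


def score(predicted, observed):
    """Sum of squared errors; lower is better."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))


def satisfies_constraints(structure):
    """Example constraint: a plausible model must include a growth process."""
    return "growth" in structure


def fit_parameters(structure, observed, dt):
    """Parameter search: try every grid combination for a fixed structure."""
    best = None
    for values in product(*(GRID[name] for name in structure)):
        params = dict(zip(structure, values))
        predicted = simulate(structure, params, observed[0], dt, len(observed) - 1)
        err = score(predicted, observed)
        if best is None or err < best[0]:
            best = (err, params)
    return best


def induce_models(observed, dt):
    """Structure search: enumerate allowed process subsets, fit each, rank by score."""
    ranked = []
    for size in range(1, len(PROCESSES) + 1):
        for structure in combinations(PROCESSES, size):
            if satisfies_constraints(structure):
                err, params = fit_parameters(structure, observed, dt)
                ranked.append((err, structure, params))
    ranked.sort(key=lambda model: model[0])
    return ranked


# Synthetic observations generated from a known growth + limitation model.
observed = simulate(("growth", "limitation"),
                    {"growth": 0.5, "limitation": 0.01}, 5.0, 0.1, 100)
ranked = induce_models(observed, dt=0.1)
for err, structure, params in ranked:
    print(structure, params, err)
```

The constraint prunes structures before any fitting happens, which is where domain knowledge limits the search. A realistic system would replace the grid with a numerical optimizer and the exhaustive enumeration with a guided search.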
Example domain model: OneGeology

Example library of analytical functions (PySAL)

One possible process for scientific investigation
[Figure: cycle of investigation stages (Gahegan, 2005)]
• Exploration: EXPLORING, DISCOVERING
• Synthesis: LEARNING, CATEGORIZING
• Analysis: GENERALIZING, MODELING
• Evaluation: EXPLAINING, TESTING, GENERALIZING
• Presentation: COMMUNICATING, CONSENSUS-BUILDING
Labels on the diagram: Data, Hypothesis, Concept, Category, relation, Theory, Model, Results, Explanation confidence, Map

CyberGIS Grand Challenge
Create a ‘Geographical Process Model Discovery System’ that integrates:
– a science model
– a domain (data) model
– analysis software
– data
– (constraints)

Are there limits to what we can learn from data?
• Yes, but our learned models may still be useful
• Yes, the model is, at best, as good as the data
  – But this still might be better than current theory
• Yes, but as data becomes ubiquitous, these limits will retreat

End

CyberGIS Workflow: 5 simple (and also very complicated) steps
1. Discover and gain access to, and to some extent understand (e.g. the semantics, the provenance, the limitations of), each dataset we intend to use.
2. Harmonize these datasets into a consistent form (data model), for example by re-projecting, converting from raster to vector and harmonizing the semantics. (Data Model Integration)
3. Analyze the datasets via an analytical workflow of some kind. (Software Integration)
4. Validate the accuracy and suitability of the results and
5. Publish the results back into the Infrastructure.
The results are of little value unless they maintain connections to the above steps.

Learn a predictive model, even when entire steps/states are missing?
Bayesian belief network learning

An example inferred model from GIScience
The consumer wants fit-for-purpose data, but the task and domain semantics are not given (latent variables).
Gahegan & Adams, 2014

The education of the GIScientist?
• Better data custodian skills
• Better scientific computing skills, but you have to bring the geographic understanding too
• Deeper awareness of the processes/philosophy of our science
• A greater respect for data…
• An outward gaze…

…with types of inference and examples of visual and computational methods
[Figure: the cycle of investigation stages, annotated with example methods]
• Exploration: EXPLORING, DISCOVERING
• Synthesis: LEARNING, CATEGORIZING
• Analysis: GENERALIZING, MODELING
• Evaluation: EXPLAINING, TESTING, GENERALIZING
• Presentation: COMMUNICATING, CONSENSUS-BUILDING
Labels on the diagram: Data, Hypothesis, Concept, Category, relation, Theory, Model, Results, Explanation confidence, Map
Example methods shown around the diagram:
– Scatterplot, grand tour, projection pursuit, parallel coordinate plot, iconographic displays
– Databases, digital libraries, clearinghouses
– Maps, navigable worlds, charts, immersive visualizations
– Self organizing map, k-means, clustering, geographical analysis machine, data mining, concept learning
– Interactive visual classification, parallel coordinate plot, separability plots, graphs of relationships
– Statistical testing, M-C simulation
– Uncertainty visualization
– Scene composition, information fusion, visual overlay
– Statistical modeling, machine learning, maximum likelihood, decision trees, regression & correlation analysis

The Evolving Paths to Knowledge
• The First Paradigm: Experiment/Measurement
• The Second Paradigm: Analytical Theory
• The Third Paradigm: Numerical Simulations
• The Fourth Paradigm: Data-Driven Science?
  Data fusion + data mining + synthesis/learning + explanation
(George Djorgovski, Caltech)

Building Explanatory Models from Time-Series Data
• Process models are a natural choice
• Many ways to define process
• Processes are causal relations between one or more input and output variables
• Processes represent knowledge in notation familiar to scientists
  – Helpful for explanation
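To make the idea of a process as a relation between input and output variables concrete, here is a minimal sketch that composes three named processes into a predator-prey model and simulates it. The process names follow the labels used in the talk's process-model example (prey growth, predation, predator loss); the `Process` class, the equation forms, the rate constants, the initial values and the forward-Euler integration are all illustrative assumptions, not taken from Bridewell et al.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, float]


@dataclass
class Process:
    """A named process: maps the current state to rate-of-change contributions."""
    name: str
    rates: Callable[[State], State]


def simulate(processes: List[Process], state: State, dt: float, steps: int) -> List[State]:
    """Forward-Euler integration; the model's derivative is the sum over processes."""
    trajectory = [dict(state)]
    for _ in range(steps):
        deriv = {var: 0.0 for var in state}
        for process in processes:
            for var, rate in process.rates(state).items():
                deriv[var] += rate
        state = {var: value + dt * deriv[var] for var, value in state.items()}
        trajectory.append(state)
    return trajectory


# Assumed equation forms and constants, chosen only for illustration.
prey_growth = Process(
    "prey growth", lambda s: {"prey": 0.8 * s["prey"]})
predation = Process(
    "predation",
    lambda s: {"prey": -0.02 * s["prey"] * s["predator"],
               "predator": 0.01 * s["prey"] * s["predator"]})
predator_loss = Process(
    "predator loss", lambda s: {"predator": -0.5 * s["predator"]})

model = [prey_growth, predation, predator_loss]
trajectory = simulate(model, {"prey": 50.0, "predator": 20.0}, dt=0.01, steps=1000)
print(trajectory[-1])
```

Because each process is a separate, named object, a model reads as a list of mechanisms rather than a single opaque function, which is what makes adding, removing or swapping a process (and explaining the effect) straightforward.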