Discovery Informatics: AI Takes a Science-Centered View on Big Data
AAAI Technical Report FS-13-01
Invited Talks
A Data Lifecycle Approach to Discovery Informatics
Richard J. Doyle, NASA Jet Propulsion Laboratory
Although discovery is ostensibly a process that operates on data in hand, in the context of
space exploration it is natural to take a full lifecycle perspective that begins at the data collection point of a sensor or instrument. At each phase of the data lifecycle, important steps
can be taken to both enable and assist the objective of scientific discovery. For example, data
triage is concerned with efficient assessment of data while it is buffered at the collection
point, to address the harsh reality that for many emerging high-capacity sensors and instruments, not all data can be captured. Data visualization provides an array of tools for
abstracting large volumes of data, to gain insight and shape the next query. Model-to-data
reconciliation is at the heart of hypothesis formation and refinement, and relates to intelligent sampling strategies for maximizing information gain at multiple levels: (1) extracting
the next manageable slice from massive, distributed data sets, (2) updating models to generate improved predictions (synthetic data), and ultimately, (3) designing the next-generation
sensor, instrument and/or space mission, which brings the data lifecycle full circle. In this
talk, I will utilize the data lifecycle as a framework for exploring a range of techniques to
assist in scientific discovery, and will argue that the integrated lifecycle perspective leads to
increases in efficiency and effectiveness.
Generating Biomedical Hypotheses Using Semantic Web Technologies
Michel Dumontier, Stanford University
With its focus on investigating the nature and basis for the sustained existence of living systems, modern biology has always been a fertile, if not challenging, domain for formal
knowledge representation and automated reasoning. Over the past 15 years, hundreds of
projects have developed or leveraged ontologies for entity recognition and relation extraction, semantic annotation, data integration, query answering, consistency checking, association mining and other forms of knowledge discovery. In this talk, I will discuss our efforts
to build a rich foundational network of ontology-annotated linked data, discover significant
biological associations across these data using a set of partially overlapping ontologies, and
identify new avenues for drug discovery by applying measures of semantic similarity over
phenotypic descriptions. As the portfolio of semantic web technologies continues to mature
in terms of functionality, scalability and an understanding of how to maximize their value,
increasing numbers of biomedical researchers will be strategically poised to pursue increasingly sophisticated KR projects aimed at improving our overall understanding of the capability and behavior of biological systems.
Socially Intelligent Science
Haym Hirsh, Cornell University
“Standing on the shoulders of giants” is a metaphor for how science progresses: our knowledge grows by expanding and building off what others have learned and taught us in the
past. Implicit in this metaphor is that science is a social enterprise — we learn from others
and we relate what we do to what others have done. However, in recent decades scientists
have invented new ways to bring people together at unprecedented scale in the pursuit of
advancing science and pushing the thresholds of what we know. These new forms of social
enterprise — made possible by innovations in computing and the widespread reach of the
Internet — are facilitating discovery and innovation in a range of areas of science and technology. In this talk I will survey examples of these new forms of socially intelligent science,
while also providing a historical context that shows that elements of many of these ideas
predate the Internet era.
Representing and Reasoning with
Experimental and Quasi-Experimental Designs
David Jensen, University of Massachusetts at Amherst
The formulation and widespread adoption of the randomized controlled trial is one of the
most important intellectual achievements of the twentieth century. However, the precise
lected under alternative conditions, is not widely known or easily formalized. The language
of causal graphical models — a well-developed formalism from computer science — can
describe much of the logic behind experimental and quasi-experimental designs, and recent
extensions to that language can express an even wider array of designs. In addition, this formalization has revealed new types of designs and new opportunities for computational
assistance in the analysis of experimental and observational data.
Bioinformatics Computation of
Metabolic Models from Sequenced Genomes
Peter Karp, SRI International
The bioinformatics field has developed the ability to extract far more information from
sequenced genomes than was envisioned in the early days of the Human Genome Project.
By connecting a set of analytical programs into a computational pipeline, we can recognize
genes within a sequenced genome, assign functions to those genes, infer reactions catalyzed
by the gene products, arrange those reactions into metabolic pathways, and create a computational metabolic model of the organism. The computational methods used by pipeline
components include machine learning, pattern matching, inexact sequence matching, and
optimization. is success story can provide lessons to other areas of computational science,
and raises interesting questions about what it means for machines to make scientific discoveries.
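The staged flow described above, with each analysis program consuming the previous program's output, can be sketched as a simple function pipeline. The stage names and data shapes below are hypothetical placeholders for illustration, not the interfaces of any real annotation system.

```python
# Minimal pipeline skeleton: thread the data through each stage in order.
def run_pipeline(genome, stages):
    data = genome
    for stage in stages:
        data = stage(data)
    return data

# Toy stages mirroring the described flow:
# genes -> functions -> reactions -> pathways.
def find_genes(genome):
    return genome.split()                      # stand-in for gene recognition

def assign_functions(genes):
    return [(g, "kinase") for g in genes]      # stand-in for function assignment

def infer_reactions(annotated):
    return [f"reaction_catalyzed_by_{g}" for g, _ in annotated]

def group_into_pathways(reactions):
    return {"pathway_1": reactions}            # stand-in for pathway assembly

model = run_pipeline("geneA geneB",
                     [find_genes, assign_functions,
                      infer_reactions, group_into_pathways])
```

The design point is simply that each stage is an independent program with a well-defined input and output, so components (machine learning, pattern matching, optimization) can be swapped or improved in isolation.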
Climate Informatics: Recent Advances and Challenge
Problems for Machine Learning in Climate Science
Claire Monteleoni, George Washington University
The threat of climate change is one of the greatest challenges currently facing society. Given
the profound impact machine learning has made on the natural sciences to which it has
been applied, such as the field of bioinformatics, machine learning is poised to accelerate
discovery in climate science. Our recent progress on climate informatics reveals that collaborations with climate scientists also open interesting new problems for machine learning. I
will give an overview of challenge problems in climate informatics, and present recent work
from my research group in this nascent field. A key problem in climate science is how to
combine the predictions of the multimodel ensemble of global climate models that inform
the Intergovernmental Panel on Climate Change (IPCC). I will present three approaches to
this problem. Our Tracking Climate Models (TCM) work demonstrated the promise of an
algorithm for online learning with expert advice, for this task. Given temperature predictions from 20 IPCC global climate models, and over 100 years of historical temperature
data, TCM generated predictions that tracked the changing sequence of which model currently predicts best. On historical data, at both annual and monthly time-scales, and in
future simulations, TCM consistently outperformed the average over climate models, the
existing benchmark in climate science, at both global and continental scales. We then
extended TCM to take into account climate model predictions at higher spatial resolutions,
and to model geospatial neighborhood influence between regions. Our second algorithm
enables neighborhood influence by modifying the transition dynamics of the hidden
Markov model from which TCM is derived, allowing the performance of spatial neighbors
to influence the temporal switching probabilities for the best climate model at a given location. We recently applied a third technique, sparse matrix completion, in which we create a
sparse (incomplete) matrix from climate model predictions and observed temperature data,
and apply a matrix completion algorithm to recover it, yielding predictions of the unobserved temperatures.
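The "online learning with expert advice" setting behind TCM can be illustrated with a Fixed-Share-style forecaster, a standard tracking algorithm in this family. This is an illustrative sketch only: the actual TCM algorithm, its loss function, and its parameters differ, and `eta` and `alpha` here are arbitrary choices.

```python
import numpy as np

def fixed_share_forecast(preds, truth, eta=1.0, alpha=0.05):
    """Combine K 'expert' model predictions, tracking whichever
    model currently predicts best.

    preds: (T, K) array of per-step predictions from K models.
    truth: (T,) array of observations, revealed after each forecast.
    """
    T, K = preds.shape
    w = np.full(K, 1.0 / K)              # uniform prior over models
    out = np.empty(T)
    for t in range(T):
        out[t] = w @ preds[t]            # forecast before seeing truth[t]
        loss = (preds[t] - truth[t]) ** 2
        w *= np.exp(-eta * loss)         # downweight models that missed
        w /= w.sum()
        # 'share' step: leak a little weight back to every model, so the
        # forecaster can switch when a different model becomes best
        w = (1 - alpha) * w + alpha / K
    return out
```

The share step is what allows the forecaster to track a *changing* best model rather than locking onto whichever model was best historically.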
Predictive Modeling of Patient State and Therapy Optimization
Zoran Obradovic, Temple University
Uncontrolled inflammation accompanied by an infection that results in septic shock is the
most common cause of death in intensive care units and the tenth leading cause of death
overall. In principle, spectacular mortality rate reduction can be achieved by early diagnosis
and accurate prediction of response to therapy. This is a very difficult objective due to the
fast progression and complex multistage nature of acute inflammation. Our ongoing
DARPA DLT project is addressing this challenge by development and validation of effective
predictive modeling technology for analysis of temporal dependencies in high dimensional
multisource sepsis-related data. This lecture will provide an overview of the results of our
project, which show potential for significant mortality reduction in severe sepsis patients.
Case Studies in Data-Driven Systems:
Building Carbon Maps to Finding Neutrinos
Christopher Re, Stanford University
The question driving my work is, how should one deploy statistical data-analysis tools to
enhance data-driven systems? Even partial answers to this question may have a large impact
on science, government, and industry, each of which is increasingly turning to statistical
techniques to get value from their data. To understand this question, my group has built or
contributed to a diverse set of data-processing systems for scientific applications: a system,
called GeoDeepDive, that reads and helps answer questions about the geology literature and
a muon filter that is used in the IceCube neutrino telescope to process over 250 million
events each day in the hunt for the origins of the universe. This talk will give an overview of
the lessons that we learned in these systems, will argue that data systems research may play
a larger role in the next generation of these systems, and will speculate on the future challenges that such systems may face.
Computational Analysis of Complex Human Disorders
Andrey Rzhetsky, University of Chicago
Focusing on autism, bipolar disorder, and schizophrenia, my talk will touch on the following
questions. How can our understanding of the genetics and epidemiology of disease be advanced
through modeling and computational analysis of very large and heterogeneous datasets?
What are the bottlenecks in the analysis of complex human maladies? How can we model and
compute over multiple data types to narrow hypotheses about genetic causes of disease?
How can collaborations across multiple fields of science bring translational results to initially purely academic studies?
Look at is Gem: Automated Data Prioritization for Scientific
Discovery of Exoplanets, Mineral Deposits, and More
Kiri L. Wagstaff, NASA Jet Propulsion Laboratory
Inundated by terabytes of data flowing from telescopes, microscopes, DNA sequencers, etc.,
scientists in various disciplines have a need for automated methods for prioritizing data for
review. Which observations are most interesting or unusual, and why? I will describe
DEMUD (Discovery by Eigenbasis Modeling of Uninteresting Data), which iteratively prioritizes items from large data sets to provide a diverse traversal of interesting items. By
modeling what the user already knows and/or has already seen, DEMUD can focus attention on the unexpected, facilitating new discoveries. Uniquely, DEMUD also provides a
domain-relevant explanation for each selected item that indicates why it stands out.
DEMUD's explanations offer a first step towards automated interpretation of scientific data
discoveries. We are using DEMUD in collaboration with scientists from the Mars Science
Laboratory, the Mars Reconnaissance Orbiter, the Kepler exoplanet telescope, Earth
orbiters, and more. It provides scalable performance, interpretable output, and new insights
into very large data sets from diverse disciplines. This is joint work with James Bedell, Nina
L. Lanza, Tom G. Dietterich, Martha S. Gilmore, and David R. Thompson.
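The selection loop sketched in the abstract, iteratively modeling what has been seen and surfacing the item that the model explains worst, can be illustrated with a greedy residual-based ranking. This is an illustrative reading only, with arbitrary parameter choices, not the published DEMUD algorithm.

```python
import numpy as np

def demud_rank(X, k=2, n_select=3):
    """Greedy novelty prioritization in the spirit of DEMUD (sketch only).

    Repeatedly selects the item with the largest reconstruction error
    under a rank-k SVD model of the items selected so far; the residual
    vector serves as the 'explanation' of what was unexpected.
    """
    remaining = list(range(len(X)))
    selected, seen = [], []
    for _ in range(n_select):
        if not seen:
            # no model of 'uninteresting' data yet: start from the
            # largest-norm item
            best = max(remaining, key=lambda j: np.linalg.norm(X[j]))
            resid = X[best]
        else:
            S = np.asarray(seen)
            mu = S.mean(axis=0)
            # top-k right singular vectors of the data seen so far
            basis = np.linalg.svd(S - mu, full_matrices=False)[2][:k].T

            def residual(j):
                r = X[j] - mu
                return r - basis @ (basis.T @ r)

            best = max(remaining, key=lambda j: np.linalg.norm(residual(j)))
            resid = residual(best)
        selected.append((best, resid))
        seen.append(X[best])
        remaining.remove(best)
    return selected
```

Because each selected item is folded back into the model, the traversal is diverse by construction: a second item resembling the first is no longer surprising, and the residual returned alongside each index indicates which features made it stand out.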