Discovery Informatics: AI Takes a Science-Centered View on Big Data
AAAI Technical Report FS-13-01

Invited Talks

A Data Lifecycle Approach to Discovery Informatics
Richard J. Doyle, NASA Jet Propulsion Laboratory

Although discovery is ostensibly a process that operates on data in hand, in the context of space exploration it is natural to take a full lifecycle perspective that begins at the data collection point of a sensor or instrument. At each phase of the data lifecycle, important steps can be taken to both enable and assist the objective of scientific discovery. For example, data triage is concerned with efficient assessment of data while it is buffered at the collection point, to address the harsh reality that for many emerging high-capacity sensors and instruments, not all data can be captured. Data visualization provides an array of tools for abstracting large volumes of data, to gain insight and shape the next query. Model-to-data reconciliation is at the heart of hypothesis formation and refinement, and relates to intelligent sampling strategies for maximizing information gain at multiple levels: (1) extracting the next manageable slice from massive, distributed data sets, (2) updating models to generate improved predictions (synthetic data), and ultimately, (3) designing the next-generation sensor, instrument, and/or space mission, which brings the data lifecycle full circle. In this talk, I will use the data lifecycle as a framework for exploring a range of techniques to assist in scientific discovery, and will argue that the integrated lifecycle perspective leads to increases in efficiency and effectiveness.

Generating Biomedical Hypotheses Using Semantic Web Technologies
Michel Dumontier, Stanford University

With its focus on investigating the nature and basis for the sustained existence of living systems, modern biology has always been a fertile, if challenging, domain for formal knowledge representation and automated reasoning. Over the past 15 years, hundreds of projects have developed or leveraged ontologies for entity recognition and relation extraction, semantic annotation, data integration, query answering, consistency checking, association mining, and other forms of knowledge discovery. In this talk, I will discuss our efforts to build a rich foundational network of ontology-annotated linked data, discover significant biological associations across these data using a set of partially overlapping ontologies, and identify new avenues for drug discovery by applying measures of semantic similarity over phenotypic descriptions. As the portfolio of semantic web technologies continues to mature in terms of functionality, scalability, and an understanding of how to maximize their value, increasing numbers of biomedical researchers will be strategically poised to pursue increasingly sophisticated knowledge representation projects aimed at improving our overall understanding of the capability and behavior of biological systems.
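The last step above, applying semantic similarity over phenotypic descriptions, can be made concrete with a minimal sketch. It assumes each entity (a drug's phenotypic profile, a disease) is annotated with a set of ontology terms; the Human Phenotype Ontology identifiers below are purely illustrative, and Jaccard overlap stands in for the richer, hierarchy-aware measures (e.g., Resnik or Lin similarity) such projects typically use.

```python
# Minimal sketch: similarity between two entities represented as sets of
# ontology term annotations. Plain set overlap (Jaccard) is shown for
# brevity; real systems usually exploit the ontology's term hierarchy.

def jaccard_similarity(terms_a: set[str], terms_b: set[str]) -> float:
    """Fraction of shared annotations between two term sets."""
    if not terms_a and not terms_b:
        return 0.0
    return len(terms_a & terms_b) / len(terms_a | terms_b)

# Hypothetical annotations (HPO identifiers chosen for illustration only).
drug_profile = {"HP:0001250", "HP:0002315", "HP:0001337"}
disease_profile = {"HP:0001250", "HP:0001337", "HP:0002076"}

print(jaccard_similarity(drug_profile, disease_profile))  # prints 0.5
```

A high score between a drug's induced phenotypes and a disease's phenotypes is one simple signal that the drug may be a repurposing candidate for that disease.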
Socially Intelligent Science
Haym Hirsh, Cornell University

"Standing on the shoulders of giants" is a metaphor for how science progresses: our knowledge grows by expanding and building on what others have learned and taught us in the past. Implicit in this metaphor is that science is a social enterprise: we learn from others and we relate what we do to what others have done. However, in recent decades scientists have invented new ways to bring people together at unprecedented scale in the pursuit of advancing science and pushing the thresholds of what we know. These new forms of social enterprise, made possible by innovations in computing and the widespread reach of the Internet, are facilitating discovery and innovation in a range of areas of science and technology. In this talk I will survey examples of these new forms of socially intelligent science, while also providing a historical context that shows that elements of many of these ideas predate the Internet era.

Representing and Reasoning with Experimental and Quasi-Experimental Designs
David Jensen, University of Massachusetts at Amherst

The formulation and widespread adoption of the randomized controlled trial (RCT) is one of the most important intellectual achievements of the twentieth century. However, the precise logic of RCTs, and the extent to which similar logic can be extended to the analysis of data collected under alternative conditions, is not widely known or easily formalized. The language of causal graphical models, a well-developed formalism from computer science, can describe much of the logic behind experimental and quasi-experimental designs, and recent extensions to that language can express an even wider array of designs. In addition, this formalization has revealed new types of designs and new opportunities for computational assistance in the analysis of experimental and observational data.
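To make the graphical-model view of designs concrete, here is a minimal sketch (not the talk's formalism): a design is encoded as a parent map over variables, and randomization is expressed as graph surgery that deletes all incoming causes of the treatment variable. All names and structure are illustrative assumptions.

```python
# Minimal sketch: an observational design vs. an RCT, encoded as causal
# graphs (variable -> list of direct causes). Illustrative, not the talk's
# exact formalism.

# Observational design: a confounder influences both treatment and outcome,
# so treatment-outcome association mixes causal and confounded paths.
observational = {
    "confounder": [],
    "treatment": ["confounder"],           # treatment chosen by circumstance
    "outcome": ["treatment", "confounder"],
}

def randomize(graph: dict, variable: str) -> dict:
    """Return a copy of the graph in which `variable` has no parents,
    modeling random assignment (the do-operator applied at design time)."""
    surgered = {v: list(parents) for v, parents in graph.items()}
    surgered[variable] = []
    return surgered

# RCT: assignment is independent of the confounder, so the only remaining
# directed path from treatment to outcome is the causal one.
rct = randomize(observational, "treatment")
print(rct["treatment"])  # [] : no incoming edges, assignment is unconfounded
```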
Bioinformatics Computation of Metabolic Models from Sequenced Genomes
Peter Karp, SRI International

The bioinformatics field has developed the ability to extract far more information from sequenced genomes than was envisioned in the early days of the Human Genome Project. By connecting a set of analytical programs into a computational pipeline, we can recognize genes within a sequenced genome, assign functions to those genes, infer reactions catalyzed by the gene products, arrange those reactions into metabolic pathways, and create a computational metabolic model of the organism. The computational methods used by pipeline components include machine learning, pattern matching, inexact sequence matching, and optimization. This success story can provide lessons to other areas of computational science, and raises interesting questions about what it means for machines to make scientific discoveries.

Climate Informatics: Recent Advances and Challenge Problems for Machine Learning in Climate Science
Claire Monteleoni, George Washington University

The threat of climate change is one of the greatest challenges currently facing society. Given the profound impact machine learning has made on the natural sciences to which it has been applied, such as the field of bioinformatics, machine learning is poised to accelerate discovery in climate science. Our recent progress on climate informatics reveals that collaborations with climate scientists also open interesting new problems for machine learning. I will give an overview of challenge problems in climate informatics, and present recent work from my research group in this nascent field. A key problem in climate science is how to combine the predictions of the multi-model ensemble of global climate models that inform the Intergovernmental Panel on Climate Change (IPCC). I will present three approaches to this problem. Our Tracking Climate Models (TCM) work demonstrated the promise of an algorithm for online learning with expert advice for this task. Given temperature predictions from 20 IPCC global climate models, and over 100 years of historical temperature data, TCM generated predictions that tracked the changing sequence of which model currently predicts best. On historical data, at both annual and monthly time-scales, and in future simulations, TCM consistently outperformed the average over climate models, the existing benchmark in climate science, at both global and continental scales. We then extended TCM to take into account climate model predictions at higher spatial resolutions, and to model geospatial neighborhood influence between regions. Our second algorithm enables neighborhood influence by modifying the transition dynamics of the hidden Markov model from which TCM is derived, allowing the performance of spatial neighbors to influence the temporal switching probabilities for the best climate model at a given location. We recently applied a third technique, sparse matrix completion, in which we create a sparse (incomplete) matrix from climate model predictions and observed temperature data, and apply a matrix completion algorithm to recover it, yielding predictions of the unobserved temperatures.
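For readers unfamiliar with online learning with expert advice, the following is a generic Fixed-Share-style sketch in the spirit of TCM, treating each climate model as an "expert" whose weight rises and falls with its recent accuracy. The actual TCM algorithm is derived from a hidden Markov model and differs in detail; the squared-loss choice and the parameters eta and alpha here are assumptions for illustration.

```python
import numpy as np

# Generic sketch of online learning with expert advice: exponential-weights
# updates plus a fixed-share step that lets the best expert switch over time.
# Illustrative stand-in for TCM, not the authors' exact HMM-based algorithm.

def track_experts(predictions: np.ndarray, observations: np.ndarray,
                  eta: float = 0.5, alpha: float = 0.05) -> np.ndarray:
    """predictions: (T, K) forecasts from K models at T time steps;
    observations: (T,) observed values. Returns the (T,) combined forecasts.
    Assumes losses are roughly O(1); rescale eta for real temperature data."""
    T, K = predictions.shape
    w = np.full(K, 1.0 / K)                       # uniform prior over models
    combined = np.empty(T)
    for t in range(T):
        combined[t] = w @ predictions[t]          # forecast before seeing truth
        loss = (predictions[t] - observations[t]) ** 2
        w *= np.exp(-eta * loss)                  # exponential-weights update
        w /= w.sum()
        w = (1 - alpha) * w + alpha / K           # fixed-share: allow switching
    return combined
```

The fixed-share step is what distinguishes this family from plain weighted averaging: by redistributing a small mass alpha to all models each step, the learner can follow the changing sequence of which model currently predicts best.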
Predictive Modeling of Patient State and Therapy Optimization
Zoran Obradovic, Temple University

Uncontrolled inflammation accompanied by an infection that results in septic shock is the most common cause of death in intensive care units and the tenth leading cause of death overall. In principle, a dramatic reduction in mortality could be achieved by early diagnosis and accurate prediction of response to therapy. This is a very difficult objective due to the fast progression and complex multistage nature of acute inflammation. Our ongoing DARPA DLT project is addressing this challenge through the development and validation of effective predictive modeling technology for the analysis of temporal dependencies in high-dimensional, multi-source, sepsis-related data. This lecture will provide an overview of the results of our project, which show potential for significant mortality reduction in severe sepsis patients.

Case Studies in Data-Driven Systems: Building Carbon Maps to Finding Neutrinos
Christopher Ré, Stanford University

The question driving my work is: how should one deploy statistical data-analysis tools to enhance data-driven systems? Even partial answers to this question may have a large impact on science, government, and industry, each of which is increasingly turning to statistical techniques to get value from its data. To understand this question, my group has built or contributed to a diverse set of data-processing systems for scientific applications: a system called GeoDeepDive, which reads and helps answer questions about the geology literature, and a muon filter that is used in the IceCube neutrino telescope to process over 250 million events each day in the hunt for the origins of the universe. This talk will give an overview of the lessons that we learned in building these systems, will argue that data systems research may play a larger role in the next generation of these systems, and will speculate on the future challenges that such systems may face.

Computational Analysis of Complex Human Disorders
Andrey Rzhetsky, University of Chicago

Focusing on autism, bipolar disorder, and schizophrenia, my talk will touch on the following questions. How can our understanding of the genetics and epidemiology of disease be advanced through modeling and computational analysis of very large and heterogeneous datasets? What are the bottlenecks in the analysis of complex human maladies? How can we model and compute over multiple data types to narrow hypotheses about the genetic causes of disease? How can collaborations across multiple fields of science bring translational results to initially purely academic studies?

Look at This Gem: Automated Data Prioritization for Scientific Discovery of Exoplanets, Mineral Deposits, and More
Kiri L. Wagstaff, NASA Jet Propulsion Laboratory

Inundated by terabytes of data flowing from telescopes, microscopes, DNA sequencers, and other instruments, scientists in various disciplines need automated methods for prioritizing data for review. Which observations are most interesting or unusual, and why? I will describe DEMUD (Discovery by Eigenbasis Modeling of Uninteresting Data), which iteratively prioritizes items from large data sets to provide a diverse traversal of interesting items. By modeling what the user already knows and/or has already seen, DEMUD can focus attention on the unexpected, facilitating new discoveries. Uniquely, DEMUD also provides a domain-relevant explanation for each selected item that indicates why it stands out. DEMUD's explanations offer a first step towards automated interpretation of scientific data discoveries. We are using DEMUD in collaboration with scientists from the Mars Science Laboratory, the Mars Reconnaissance Orbiter, the Kepler exoplanet telescope, Earth orbiters, and more. It provides scalable performance, interpretable output, and new insights into very large data sets from diverse disciplines. This is joint work with James Bedell, Nina L. Lanza, Tom G. Dietterich, Martha S. Gilmore, and David R. Thompson.
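The core DEMUD loop can be sketched compactly: model previously seen ("uninteresting") data with a low-rank eigenbasis via SVD, select the item the model reconstructs worst, and report the residual as the explanation. This is an illustrative simplification; the published algorithm differs in detail (for example, it updates the SVD incrementally), and the rank k and the mean-seeded initialization below are assumptions.

```python
import numpy as np

# Sketch of DEMUD-style prioritization: a low-rank eigenbasis models what has
# been seen so far; the item with the largest reconstruction error is selected
# next, and its residual vector explains which features made it stand out.

def demud_sketch(X: np.ndarray, n_select: int = 5, k: int = 3):
    """X: (n_items, n_features) data matrix. Yields (index, residual) pairs."""
    seen = [X.mean(axis=0)]                        # seed the model with the mean
    remaining = set(range(len(X)))
    for _ in range(n_select):
        if not remaining:
            break
        S = np.vstack(seen)
        mu = S.mean(axis=0)
        # Top-k principal directions of the data seen so far.
        _, _, Vt = np.linalg.svd(S - mu, full_matrices=False)
        U = Vt[:k].T
        best, best_err, best_res = None, -1.0, None
        for i in remaining:
            centered = X[i] - mu
            residual = centered - U @ (U.T @ centered)   # reconstruction error
            err = float(residual @ residual)
            if err > best_err:
                best, best_err, best_res = i, err, residual
        yield best, best_res       # the residual says *why* the item stands out
        seen.append(X[best])       # selected item is now "uninteresting"
        remaining.discard(best)
```

Because each selected item is folded into the model of the already-seen, successive selections are pushed toward novel regions of the data, which is what produces the diverse traversal the abstract describes.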