Summary Report On Development of a Singular Presence in Data Analytics for The Ohio State University

To Executive Team

by
Janet Box-Steffensmeier, Vernal Riffe Professor of Political Science
Casey Hoy, Kellogg Endowed Chair in Agricultural Ecosystems Management
William Martin, Dean of Public Health
Ellen Peters, Professor of Psychology
Gary Hattery, Project Manager Support

February 2014

Data Analytics Collaborative Framework

Acknowledgements

The authors would like to acknowledge and thank the faculty and staff members who provided guidance, support, suggestions, and recommendations and helped provide the critical thinking necessary in the creation of this Data Analytics Collaborative document. In particular, we would like to thank the following participants in the process for their engagement, discussions, and willingness and dedication to helping craft this document:

Climate and Environment Team
• Dorota Grejner-Brzezinska, Civil, Environmental, & Geodetic Engineering
• Erich Grotewold, Plant Cell & Molecular Biology
• Joel Johnson, Electrical and Computer Engineering
• Jay Martin, Food, Agricultural & Biological Engineering

Complex Systems and Network Science Team
• John Casterline, Sociology
• Elena Irwin, Agricultural, Environmental, and Development Economics
• Srinivasan Parthasarathy, Computer Science & Engineering

Foundational Core Team
• Mark Berliner, Statistics
• Randy Olsen, Economics
• Peter Shane, Law
• David Tomasko, Chemical & Biomolecular Engineering

Health and Well Being Team
• Zhong-Lin Lu, Psychology
• Cynthia Carnes, Pharmacy
• Larry Schlesinger, Microbial Infection & Immunity
• Steven Schwartz, Food Science and Technology
• Peter Shields, Internal Medicine

Office of Academic Affairs Discovery Themes Support Team
• Stephen Myers, Associate Provost
• Rebecca Momany, Program Coordinator
• Mary White, Executive Assistant
• Marty Kress, Assistant VP for Research Business Development
• Varun Garg, Graduate Assistant, Fisher College of Business

In addition,
support from the Ohio Supercomputer Center (Pankaj Shah and associated faculty/staff) in providing recognition of the available data handling infrastructure around these efforts is gratefully acknowledged. Hundreds of faculty members, students, staff and others outside the University offered their time and talent to generate the ideas and insights that drove this process, and we wish to extend our appreciation to them as well for their contributions to this collaborative effort. We hope that each will benefit from new opportunities in data analytics.

EXECUTIVE SUMMARY

A new era exists in research and application known as data science, data analytics, or Big Data. Data streams at petabyte (10^15 bytes) and exabyte (10^18 bytes) scales are becoming common and widely available via warehouse-scale computers and the internet. Due to cloud computing, very large computers are also widely available. Big Data has joined theory and experiment to form a triumvirate basis for the practice of science in areas ranging from cosmology and climate system modeling to genomics and systems biology. Massive datasets on human behavior contain information relevant to understanding the big problems facing society and individuals today. Investment in data analytics will create a critical core capability at Ohio State as it develops initiatives in the three Discovery Areas and beyond.

In early December 2013, after review of a total of 49 Statements of Intent in response to the Discovery Themes inaugural RFP in data analytics, the Discovery Themes Executive Team identified four specific areas of strength, or clusters, for leveraging The Ohio State University’s (OSU) presence in data analytics. These clusters/cores are (1) Climate and Environment; (2) Foundations; (3) Health and Well-Being; and (4) Complex Systems and Network Science.
At this December 5, 2013 meeting, the Provost charged these teams with developing a Data Analytics Framework document that would serve as a basis for OSU creating “a singular presence in data analytics leveraged by areas of strength across the campus.” Leadership across these groups was provided by four (4) Conveners, a Project Manager supporting the organizational efforts, and support from the Office of Academic Affairs Discovery Themes Team office. Experience and resources were leveraged from some twenty teams that emerged from the Discovery Themes RFP. Brief descriptions of the four core areas follow below.

The main objectives of this document are to:
1) Identify key disciplinary gaps that, when filled, will establish OSU’s eminence in Data Analytics across the core areas.
2) Recommend a set of Guiding Principles of Implementation to be used to launch the Data Analytics Collaborative (hereafter referred to as the Collaborative) across the University.
3) Identify Evaluation Metrics that can be used to assess the progress that the Collaborative makes in its first 5 years in achieving a singular presence at OSU, advancing solutions to the Discovery Theme Challenges, and establishing OSU among the top Universities in Data Analytics.

Over the course of two months, including holidays and inclement weather, and through frequent meetings of the conveners, meetings within teams and meetings across teams, this Data Analytics Collaborative Framework document was developed. The DA Collaborative document consists of two phases. The first phase establishes a vision for key programmatic areas associated with developing the Collaborative, “Big Ideas” of institutional strength that can leverage the Collaborative, and disciplinary gaps to be addressed in order to achieve the full potential of the Collaborative.
Results of these efforts (see figure at right) showed how intra-cluster and inter-cluster identification of conceptual, discipline and resource overlaps and unique ideas resulted in the recommendation that around 100 positions with strengths in seven (7) methodological areas be considered for recruitment. The second phase, which this effort has addressed in order to provide some structure and suggestions for implementation of the Collaborative, identifies the immediate need for an internal leadership team, serving in an advisory capacity, to develop an implementation plan using the suggestions and recommendations provided in this Framework document as an initial guideline in creating a strategic vision, mission and operating plan for the Collaborative. This implementation team will ensure that momentum in developing the various aspects of the Collaborative, including the new undergraduate major, is maintained and enhanced as appropriate.

Guiding Principles of Implementation

The following principles provide guidance on implementing the Collaborative. They are not to be construed as policy. The principles were proposed and agreed upon by the faculty who developed the framework and are recommended to the ongoing faculty leadership and the Discovery Themes Executive Team as being key to implementing the Collaborative as a singular presence and achieving success as measured by the proposed metrics.

1. Faculty hires should each fit a personality and leadership profile that is consistent with an interdisciplinary team player: someone who actively seeks collaboration, someone with a history of promoting success in the teams and people with whom they collaborate, and someone with a history of putting the success of their collaborative groups above their own self-interest. A means of assessment for these qualities should be developed.
2. Recruiting rising stars from other institutions should be a priority (e.g., Associate Professors who are poised to quickly reach the top of their fields).
3. A process for continued faculty leadership of the collaborative should be put in place immediately. An Interim internal director with a Data Analytics Faculty Advisory Board should be formed to lead implementation. They should identify and consult with data analytics thought leaders at OSU.
4. Prioritization of faculty hires will be dependent on resolving the interests of the faculty leadership of the Collaborative, College and Department leaders, and the Discovery Themes Executive Team. These groups will need to negotiate priorities for hires and their potential TIU’s, and resolve issues of unequal ability among TIU’s to match DT support at present; identify members for and participate on cross-disciplinary search and recruiting committees that represent the TIU or TIU’s; review and approve detailed job descriptions; and provide consistency in information provided to candidates regarding the collaborative, expectations of new hires, and advantages of working at OSU.
5. Fairness to small and large colleges and TIU’s should be an important principle.
6. Consistent with the Framework, the hires made in data analytics should span the entire continuum from foundational to bridging to domain-focused scholars from the outset.
7. Leveraging of support from external (i.e., businesses, institutes, foundations) as well as internal (i.e., colleges) partners should be explored as an immediate next step, based on phase I position descriptions.
8. Support for the existing faculty who currently provide the strengths upon which the collaborative is built, and support for integrating new faculty into existing research and scholarly activity, should be addressed within the first year. Forums, internal sabbaticals, and bridging postdoctoral fellows are examples of tested mechanisms that can ensure rapid integration of new faculty and launch of collaborations.
9.
A University-wide effort should be launched immediately to convene faculty across the various areas discussed in the Framework, providing new opportunities to discuss the framework, plan participation in the collaborative, and begin planning for coursework, grant proposals, new collaborations, etc.
10. Because data analytics is expected to be shared by and integrated within disciplines throughout the University, all TIU’s, Schools and Colleges should share in data analytics teaching, research and outreach, and no TIU, School or College should be recognized as having any special claim on data analytics as their disciplinary area.

Indicators of success for the Data Analytics Collaborative as a Singular Presence

Finally, suggestions for evaluating the progress in creation of this “singular presence in data analytics leveraged by areas of strength across campus” were generated as an aid and a prompt for the implementation team. The following indicators of success are recommended as a set of measurable 5 year outcomes expected from implementing the Collaborative. Additional measures may be proposed and associated data gathered as the implementation proceeds and the Faculty Advisory Board and Executive Team find new specific opportunities for the Collaborative to excel.

Success in faculty, staff and student engagement in data analytics:
• Current faculty engagement with data analytics – measured with the number of applications for sabbaticals in data analytics and number of grants proposed
• Number of co-authorships between new hires and existing faculty on peer-reviewed publications and grant proposals
• Number of collaborations with industry and other external partners with data analytics as a key feature of the collaboration
Success in executing the framework and plan:
• First hires completed in 2014
• An interim Director is hired immediately
Success in teaching data analytics:
• Number of courses incorporating data analytics at the undergrad and graduate levels
• Number of students enrolled in the undergraduate major
• Number of MS Thesis and PhD Dissertations with data analytics in the title or key words
Success in national and international recognition:
• Grants and contracts awarded with data analytics in the title and key words
• Patent and license agreements related to data analytics and software or hardware developments
• Number of peer reviewed publications with data analytics in the title and key words
• Metrics used in the Battelle report, repeated in 5 years

The following four sections provide brief summaries of each of the cores: (1) Foundations; (2) Health and Well-Being; (3) Climate and Environment; and (4) Complex Systems and Network Science.

The Foundations of Data Analytics: Fundamental Needs for Advancing the Discovery Themes

The Foundations group provides the domain-independent core critical to current and future research in virtually all areas of data-science investigation and applications of that research to policy and decision making of societal importance. It consists of: (i) the theory and practice of data science that is the basis for cyber-enabled discovery; and (ii) the transformational and synergistic approaches to legal and regulatory issues, decision science, and social- and health-science Big Data resources that will guide, support and enhance data-analytic applications for humans and society. In the Foundations core, education of the current and future workforce and partnerships with a variety of institutions and businesses will be paramount. We will revolutionize data analytics as the emerging basis for science, policy, and decision making across the OSU campus and beyond.
The challenge for OSU is to lay a lasting foundation for collaborative data-analytic contributions to the big problems facing society and individuals today and tomorrow. Data-analytic approaches can catalyze solutions across levels of human activity from the individual decision maker, to microsystems (such as engineers and health care providers), to macrosystems (such as businesses and governments) that exist within an even larger geo-political context. The Foundations core will provide leadership and synergy in Data Analytics and its applications to real world problems.

The two primary foundational notions are set in a landscape of collaborations in the areas identified in this document and their critical roles in making OSU the leader in research on the Discovery Themes (see figure on the right). First, advances in data science will produce the demanding computing technologies, algorithms and software, and methodological techniques that will enable compilation and combination of data needed for application-specific research breakthroughs and to enable evidence-driven decision making. The second notion concerns humans and societies in the 21st Century and how they shape and can be affected by Data Analytics. Data are being collected, managed, and analyzed in ways never before imagined. These uses require improved tools and resources to influence decision making and policy across diverse societal issues from energy and the environment to health and well-being. At the same time, these breakthroughs question our notions of privacy, personal and national security, ethics, and social equity and require new regulatory and legal frameworks that recognize the realities of our new society.
Hence, though they appear quite diverse, the two primary components of Foundations interact as technological and data science advances are combined with recognition of emergent opportunities and challenges from the human side of the equation in ways that will ultimately allow us to transform society for the better. In addition, through training of students at all levels and retraining current workers and scholars, the Foundations team will support the development of a 21st century workforce.

Data Analytics of Health and Well-Being: From the Molecule to the Community

Our reality is that expanding human, animal and plant populations will continue to share a complex ecosystem on this small planet and increasingly stress human health and well-being from early child development to the end of life, made worse by limited resources, unhealthy behaviors and poverty. Health and Well-Being result from a healthy environment and lifestyle, access to health care, and opportunities for individuals and communities to be productive and happy. This is achieved by who we are, how we behave in our culture and environment, choices we make (e.g., food we eat, use of tobacco, etc.), with whom we coexist (e.g., from microbes to vertebrates), how we can improve health care, and how we care for ourselves as individuals and in our community.

The Health and Well-Being core, which builds on existing strengths at OSU, can counteract our future challenges and stressors by advances in science and technology through the use of big data characterizing multiple dimensions from the molecule to the community. Big data and systems analysis will allow us to understand how seemingly disparate factors move together to affect our health and well-being, and which interventions we can make to improve them. For example, changes in a person’s environmental exposure and diet (individual and population scale) alter the probability for health and disease.
These changes are mediated and can be predicted by an individual’s genetic make-up and gene expression, the metabolome for a given biochemical pathway and/or microbiome, psychological state and social context, which in turn can adversely affect health and well-being. For all of these pathways from the individual to the community, understanding and predicting the complexities of the overall ecosystem affecting health and well-being is essential to improving disease prevention, early treatment paradigms, and promotion of life-long quality of life for individuals and for the communities in which they live.

Overarching Conceptual Overlap: Control Theory for Health & Well-Being

The Health and Well-Being cluster has a common purpose to use a systems approach to develop new knowledge and fundamentally new strategies to improve health and well-being. We have a common intent to intervene in the systems equilibrium from the molecule to the community to improve health outcomes. The interventions in effect attempt to control or modify the existing systems equilibrium, and this represents a major conceptual overlap across the Health and Well-Being cluster. The fundamental premise of the theory on systems control, which is particularly appropriate to health and well-being problems, is that even mild inputs or stimuli, when properly administered, can be effective in changing the behavior of a system of components. In many applications a mild control is specifically required in order to tune the system without destroying its fundamental nature. Mild controls also typically have better potential to reduce interventions’ undesirable side effects and to improve system stability.
For instance, administering an antimicrobial drug to a microbiome or chemotherapy to a cancer patient, or enforcing the antismoking ban on the OSU campus, are all examples of mild interventions in the specific biosystems with the goal of altering their parameters just enough to achieve the desired change of equilibrium. Due to various technological and scientific limitations as well as the diversity and uniqueness of specific scientific problems, no unified methodological approach to such problems in the context of health and well-being currently exists. However, the advancements in analytical as well as computing methods within the last decade, together with the increased data collection capabilities in the life sciences, have made it possible to start asking general questions about the problem of controlling biological health systems.

The success of any effective approach to improving health and well-being rests, ultimately, on how well the available information about the health systems is collected, processed and utilized in order to make informed choices that will achieve an optimal homeostatic (healthy) state. As the data will differ substantially across various areas of health and well-being related activities, in order to develop proper strategies for complicated biological multicomponent systems, OSU needs to first develop a comprehensive and multifaceted approach to mining, analyzing and modeling the increasingly vast amount of data (local, national and global). The challenge is formidable, since the sheer amount and complexity of the information collected with various modern research tools, ranging from DNA sequencers to population surveys and satellite imaging and spanning multiple physical scales from the molecular to the individual level, defies any currently available standard approaches.
To give a simple example, consider the area of health promotion, where in order to promote the healthy behavior of a community one needs to analyze how personal level decisions contribute to the global changes in behavior. The mathematical theory required to answer such questions is known as the control of interacting particle systems. In this case the mathematical results describing the system behavior, when coupled with the appropriate data analytics methods, are likely to hold the key to informing policies that encourage consumers to increase the healthy behaviors needed to push the overall community towards a healthier state. The use of such approaches requires us to bridge the gap between the theory and practice of data science and their capacities to identify and understand options to make informed decisions.

Data Analytics in Climate and the Environment to meet Discovery Theme Challenges

The profusion of data that is being generated to measure trends in climate and environmental change spans from global to ecosystem to molecular scales, including weather and climate sensing and modeling, water and land use/land cover satellite data, aerial and ground based sensing systems, and multilevel –omics data. The big ideas proposed for a data analytics Climate and Environment core recognize and build upon these current and anticipated data streams by an existing network of over 200 faculty who contributed to big ideas and whose collaboration, as part of both intramural and extramural partnerships, will be enhanced to meet the Discovery Theme challenges.
Major earth systems proposed for focus and integration through data analytics include fundamental climate and weather processes; watersheds and the land-water interface; foodsheds and agriculture, particularly plant systems that are the foundation for all food production and greenhouse gas sequestration; cities and the built environment; and global to regional scale integration across these major systems. Specific research activities in data analytics include data generation, processing and manipulation, integration, analysis, modeling, decision support, and policy. The major systems and levels of data analytics, along with specific technologies and research themes, create a matrix framework for the research areas that will build data analytics capacity for climate and the environment. Proposed research themes will build and synergize current strengths and lead to global solutions to challenges in climate and the environment. Climate science has been about big data and models for decades. While its predictions are gaining credence and sophistication, research is needed to better understand the complex atmosphere to geosphere relationships and translate predictions to adaptation and mitigation strategies. Technological aspects of sensing and monitoring systems as well as data integration, analysis and modeling at global to local scales are key to the science envisioned for this cluster and provide strong linkage to the Complex Systems and Network Science core. Social science, including demographic and behavioral changes related to climate, ecosystem services and ethics of data ownership and integration among individuals and corporations, in precision agriculture for example, also form a strong link to the Foundations core. Health is impacted by diet and food security, environmental impacts on plants, pathogens, water quality and human demographics, creating strong linkage to the Health and Well-being core and related Discovery Theme. 
Areas of research more specific to climate and environment issues, and addressing the Energy and Environment and Food Production and Security Discovery Themes, include the interfaces between the atmosphere, oceans, and polar regions with terrestrial dynamics in watersheds, foodsheds and agriculture, built urban, and associated landscapes and land uses. Genomics and environmental data on the critical plant systems upon which food, materials and renewable energy rely provide opportunities for adapting plant systems to a changing climate and mitigating climate change through their role in C and N cycles. Cities and the built environment present their own set of data and challenges including optimizing energy and food distribution across foodsheds, while managing watersheds to provide downstream ecosystem services.

Proposed faculty hires will be expected to fill gaps and create synergy with the large existing faculty and partner network in the Climate and Environment core at The Ohio State University. Integration and interdisciplinary networking within and among the levels of analysis, major earth systems, technologies, and topical research areas described above will provide an exciting, compelling and productive environment for research, teaching and outreach, and position The Ohio State University for meeting the major challenges inherent in our Discovery Themes.

Complex Systems and Network Science: Moving Beyond Disciplinary Boundaries to Unlock Transformational Solutions

Complex Systems and Network Science constitute an interdisciplinary area of inquiry that transcends traditional knowledge domains by focusing on the fundamental interdependencies of components within systems. Examples abound from social networks to coupled human and natural systems, from financial networks to disease systems, and from telecommunication networks to energy and power systems. It is the interconnection among these components that often sits at the heart of our most vexing global grand challenge problems, including climate change, energy demands, food insecurity, health and wellness, and livelihoods and poverty. Accordingly, the study of complex systems and networks – understanding their intrinsic properties, changes to their structure over time or due to external factors, multi-scale behavior of individuals to coarser grained modular communities – can afford important insights about optimal strategies for tackling such grand challenges.

[Figure: Complex Systems and Network Science. Recoverable labels include Network Visualization; Data Mining & Statistical Learning on Networks; Network Data Management Systems; Coupled Human and Natural Systems; Sustainability Science; Biological and Health Systems & Networks; Social and Behavioral Systems & Networks; and Policy, Ethics, Governance.]

Complex systems are, by their very nature, dauntingly difficult to describe and even harder to explain. While scientific advances have improved our understanding of various component parts, our pursuit of deeper knowledge has been stymied by a failure to understand the connections across these component parts. Currently our ability to produce and store data that relate to such complex and networked systems has far outstripped our ability to analyze and utilize this data to derive actionable insight. The crucial insights in such data often reside in the implicit interconnections or explicit relationships among individual entities.
Further advances in the field of Complex Systems and Network Sciences depend on our ability to harness the increasing availability of vast amounts of data, along with more sophisticated computing power and modeling techniques, to develop and test new theories of interconnectedness in complex systems. The challenges are manifold and include the ability to manage, model, visualize, and analyze large scale complex systems and networks with rigor while facilitating effective decision making, the ability to abstract common concepts across fields to realize new ways of thinking about problems, and the ability to analyze such data in the presence of noise, variations in model assumptions, dynamics and uncertainty. Bridging the complexity chasm in Systems and Network Sciences requires not only the ability to handle the data deluge (storage and representation), although this is an essential prerequisite, but also the ability to integrate information in the presence of uncertainty and complex dynamics while making effective use of insights specific to relevant scientific fields of inquiry.

Complex Systems and Network Sciences represent one of the most promising scientific approaches for deriving transformational analytic solutions to 21st century problems. Confirming this judgment, Network Science is recognized as one of four research frontiers in the U.S. National Science Foundation’s agenda-setting statement Rebuilding the Mosaic (2011). The Executive Summary of this report additionally stresses that: “Future research will be interdisciplinary, data-intensive, and collaborative”, which is what this project sets into motion for Ohio State University.
The establishment of a Data Analytics Collaborative with Complex Systems and Network Sciences at its core will provide a mechanism for scholars to interact beyond disciplinary boundaries to identify, learn and develop new methods for data extraction and analysis that are guided by and inform Complex Systems and Network Sciences theories and methods. The overarching goal is that Ohio State becomes a leader in an area that is certain to be central to all domains of science during the next few decades. Doing so will be critical for transformational solutions to 21st century grand challenges in areas such as Climate Change, Energy and Sustainability, Food Security, and Health and Wellness. The figure represents an integration of ideas in Complex Systems and Network Science and lays out suggested hires that will come from a wide variety of disciplines. Central to accomplishing the goals and big ideas will be hires who can integrate both methods and substance in the quest for transformative solutions.