Headers will be added later

The rise of informatics as a research domain

Peter Fox1

1Tetherless World Constellation, Rensselaer Polytechnic Institute, 110 8th St., Troy, NY 12180 USA, pfox@cs.rpi.edu

Abstract: Over the past five years, data science has emerged as a means of conducting science across many disciplines and domains. Accompanying this emergence is the realization that distinct informatics approaches, i.e. the science of data and information underlying these developments, have also emerged independently and empirically in many areas, e.g. astro-, bio-, geo-, hydro- and ocean informatics, over different timescales, under different funding models and with correspondingly different appreciation by their communities. To fully enable interdisciplinary research, and to cope with increasingly complex data within domains, a move to a more repeatable and interworkable mode is required, i.e. adding a research component to the application component of informatics. This enables data science as a means to address integrative science grand challenge areas such as water, environment and climate, ultimately resulting in the discovery of new knowledge. This contribution details some key elements of research informatics, the class of people who appear in this discipline, and the state of some current research challenges.

Keywords: Informatics; Data Science; Provenance; Semantics.

1 EVOLVING CONDUCT OF SCIENCE

Recent advances in data acquisition techniques quickly provide massive amounts of complex data characterized by source heterogeneity, multiple modalities, high volume, high dimensionality, and multiple scales (temporal, spatial, and functional). In turn, science and engineering disciplines are rapidly becoming more and more data driven, with goals of higher sample throughput, better understanding and modeling of complex systems and their dynamics, and ultimately engineering products for practical applications.
However, analyzing libraries of complex data requires managing their complexity and integrating information and knowledge across multiple scales and different disciplines. For example, the reductionist approach to biological research has provided an understanding of how the linear arrangement of nucleotides encodes the linear arrangement of amino acids, and how proteins interact to form functional groups governing signal transduction, metabolic pathways, etc. But at each level of biological organization, new forms of complexity are encountered, such that we’ve taken our biological machines apart but can’t put them back together again (B. Yener, personal communication): our ability to accumulate reductionist data has outstripped our ability to understand it. Thus, we encounter a gap in the structure/function relationship; having accumulated an extraordinary amount of detailed information about biological, material, and environmental structures, we cannot assemble it in a way that explains the correspondingly complex functions these structures perform.

Educating and training data scientists with the necessary knowledge and skills to close this data-knowledge gap becomes a crucial requirement for successfully competing in science and engineering. Attention to data science has spread from being a discussion among researchers (Baker et al., 2008; Nativi and Fox, 2010), e.g. the Fourth Paradigm publication (Hey et al., 2009), into more general audiences, for example the recent Nature and Science special issues on “Data”. Not surprisingly, there is explicit emphasis on data and information in professional societies and international scientific unions (SCCID, 2011), in national and international agency programs, in foundations (for example the Keck Foundation and the Gordon and Betty Moore Foundation) and in corporations (IBM, Schlumberger, GE, Microsoft, etc.).
Surrounding this attention is a proliferation of studies, reports, conferences and workshops on data, data science and the workforce. Examples include: “Train a new generation of data scientists, and broaden public understanding” from an EU expert group (Riding the Wave, 2010); “…the nation faces a critical need for a competent and creative workforce in science, technology, engineering and mathematics (STEM)...” (NSF, 2011a); "We note two possible approaches to addressing the challenge of this transformation: revolutionary (paradigmatic shifts and systemic structural reform) and evolutionary (such as adding data mining courses to computational science education or simply transferring textbook organized content into digital textbooks).” (NSF, 2011a); and “The training programs that NSF establishes around such a data infrastructure initiative will create a new generation of data scientists, data curators, and data archivists that is equipped to meet the challenges and jobs of the future." (NSF, 2011b). Currently, there are very few graduate education and training programs in the world that can prepare students in a cohesive and interdisciplinary way to be productive data scientists. The emphasis on advanced degree programs thus requires a rich research agenda, one that draws on many disciplines.

2 INFORMATICS – BALANCING RESEARCH AND APPLICATION

2.1 Definition

One of the more useful definitions of informatics is: “Information science includes the science of (data and) information, the practice of information processing, and the engineering of information systems. Informatics studies the structure, behavior, and interactions of natural and artificial systems that store, process and communicate (data and) information. It also develops its own conceptual and theoretical foundations.
Since computers, individuals and organizations all process information, informatics has computational, cognitive and social aspects, including study of the social impact of information technologies.” (Wikipedia; italics added by the author to note added content). Informatics thus brings many disciplines together: science, engineering and theory.

2.2 Key elements of informatics

Since science informatics efforts have emerged largely in isolation across a number of disciplines, it is only recently that certain core elements have been recognized as increasingly common. Since informatics bridges to end-use, the ‘use case’ (or user scenario) is one of the most important. Introduced by Jacobson (1987) and popularized by Cockburn (2000) for the purposes of software development, a use case ‘… is a prose description of a system’s behavior when interacting with the outside world.’ Excellent coverage of the topic is presented by Bittner and Spence (2002), who outline key factors in the development of use cases, especially the ‘why’ in addition to the ‘what’ and ‘how’. The second core element is the use of conceptual information models, especially working with domain science, application and data experts, and matching and leveraging those models with others, especially ones that underlie key standards (see Research Challenges). There are other elements, but these two are critical. Thus an overall skillset for an informaticist features two main elements: information architecture - to sustainably generate information models, designs and architectures - and technology development - to understand and support the essential data and information needs of a wide variety of producers and consumers. In order to continue to meet evolving end-use needs at the current rate of technical and technological change, informaticists must be grounded in the underpinnings of informatics, including information systems theory, methods and best practices.
This combination of knowledge and skills currently resides in ~ 1 in 100 information practitioners in the sciences (purely empirically and ethnographically determined by the author). The balance of research, application and a strong interdisciplinary educational curriculum is needed for informatics to address the integrative science grand challenge problems that are the centre-piece of many international research organization missions, especially in the Earth system sciences.

3 EXEMPLAR APPLICATION IN A WIRADA CONTEXT

[Figure 1: Map of Tasmania with place marks indicating stream gauge sensor placements. Different colours represent different operating agencies. The South Esk area is in the ENE.]

[Figure 2: Information model for SEFF, beginning with the sensor and ending with a stream flow forecast. Rounded rectangles represent processes and square rectangles represent data and information products.]

During 2010-2011 the CSIRO Tasmanian Information and Communication Technologies Centre (ICTC) undertook an excellent application of many of the aforementioned informatics capabilities and, in doing so, encountered many of the research challenges discussed below. They developed many use cases and information models, evaluated language encodings for sensor networks, water domain vocabularies and provenance, and deployed the software implementation in support of two hydrological sensor webs: the South Esk Flow Forecast (SEFF; north-east Tasmania, see Figs. 1 and 2) and the Water Information Research and Development Flood Early Warning System (WIRADA-FEWS). Detailed descriptions of these systems are available in other papers both at the conference and in the proceedings.

4 RESEARCH CHALLENGES AND PATHS FORWARD

Seen collectively, prominent domains of informatics (De Roure and Goble, 2009) have two key things in common: i) a distinct shift towards systematic methodologies and a corresponding shift away from tight dependence on technologies, and ii) the importance of multi-disciplinary and collaborative approaches. Together, these changes point to a maturing of the discipline itself, with the aim of providing reproducible results, i.e. ones based on a solid underlying set of theoretical concepts rather than trial and error with various technologies. The current evolution in the capabilities being developed today is that many information systems theories (Shannon, 1948), semiotics (Peirce, 1898), and cognitive and architectural design principles (Dowell and Long, 1998) are all beginning to coalesce in the modern practice of informatics. The key new element is the fourth paradigm of science noted earlier. The problems to be solved are more about integration, crossing disciplines and responding to a much wider variety of stakeholders (SCCID, 2011); those that really need to use many forms of data and information. In practice this means that at the intersection of prior theoretical concepts (almost all of them developed in the analog world of data and information) and real applications lies a research environment that is proving fertile for the new breed of informaticists and, in turn, data scientists (Hey et al., 2009). Of the many current research challenges, several examples, which are applicable to water/hydro informatics, are discussed below.

Where’s the data and what happened to it?
One of the most important and increasingly evident challenges arises from the shift in where and how data are being made available. More and more ‘distance’, in the form of data and information coming from different computers and undergoing network transmission and format translation, means that without very careful curation of these processes, obfuscation results and the probability of information loss increases (Weaver and Shannon, 1963). The present worst-case scenario is that data may be used for a purpose that they are not fit for. While it is highly likely that this situation occurs frequently today, the consequences of incorrect conclusions or erroneous decisions are far reaching, especially when avoidable with a combination of methodological and technical means. The challenge then is to propagate relevant knowledge, information and/or metadata (or pointers to them) along with the data as it a) proceeds through a workflow, b) passes through an analysis pipeline, or c) is extracted, transformed and loaded into applications or transmitted over the Internet. The ultimate realization is that an informatics task involves constructing, when needed, a sense of ‘state’ from a highly distributed, non-uniform-quality knowledge base. As a whole, this important topic falls under the heading of provenance research, which has grown in significance and understanding over the last ~10 years (Buneman et al., 2001; Simmhan et al., 2005). Machine-processable provenance also provides an opportunity to instrument the scientific data ‘enterprise’, enabling explanation and verification use cases.

Allowing for unintended use?

A closely related challenge to the previous one, also focused on legitimate end use, is how to effectively increase the return on the scientific investment made in generating and processing data by enabling use in a discipline or application area different from the original purpose?
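As a concrete illustration of the provenance-propagation challenge described above, the following is a minimal sketch (not any particular provenance standard; the pipeline steps and field names are hypothetical) of metadata that travels with the data through a workflow, so that ‘what happened and when’ can be reconstructed at the end:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable

@dataclass
class ProvenancedData:
    """A value bundled with a record of every step that produced it."""
    value: Any
    provenance: list = field(default_factory=list)

    def apply(self, step: Callable[[Any], Any], description: str) -> "ProvenancedData":
        # Record what was done and when, so the history travels with
        # the data rather than being lost between systems.
        record = {
            "step": step.__name__,
            "description": description,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        return ProvenancedData(step(self.value), self.provenance + [record])

# Hypothetical pipeline: raw gauge readings -> quality control -> daily mean.
def quality_control(readings):
    # Drop missing values and readings outside a plausible range.
    return [r for r in readings if r is not None and 0.0 <= r < 500.0]

def daily_mean(readings):
    return sum(readings) / len(readings)

obs = ProvenancedData([1.0, None, 3.0, 700.0, 2.0])
result = obs.apply(quality_control, "drop missing/out-of-range gauge values") \
            .apply(daily_mean, "aggregate to a daily mean")

print(result.value)                            # 2.0
print([r["step"] for r in result.provenance])  # ['quality_control', 'daily_mean']
```

The design point is that the lineage is attached to the result itself, so a downstream consumer receiving only `result` can still answer the explanation and verification use cases mentioned above.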
This challenge adds a dimension beyond the view of provenance noted above, to include additional information and documentation along the lines of what is intended for digital data preservation. One of the biggest examples has occurred with the proliferation of ‘geobrowser’ applications (e.g. Microsoft Virtual Earthtm or Google Earthtm) making immense amounts of previously inaccessible geospatial data (often on hard copy maps or in proprietary Geographic Information Systems (GIS)) available and useable for very different purposes (Hayes et al., 2005), e.g. navigation and feature identification or recreation in the case of the general public, all the way to post-natural-disaster damage evaluation. Past approaches, which fall into the category of ‘save everything known about the data’, are increasingly untenable and are unlikely to scale at the present rate of increase of data volumes and diversity. While planning for the unknown is difficult, the small number of current strategies used (e.g. using common/standard data formats and metadata representations) must be augmented.

New tools or new again?

As informatics approaches to data science start to make their way into the modern design and implementation of information systems, some new opportunities for further enhancing the conduct of science emerge. The most obvious target for change is the (presumably) electronic device that the user employs to perform a task. Progress with relatively light-weight clients (hand-held devices, Web browsers) has meant that people can change the tools they use for certain tasks relatively frequently, e.g. communicating, seeking information, reviewing it, and perhaps retrieving a sample. This situation is desirable, as the rate of technical change means this market sees the greatest innovation and capability enhancement. However, as science or application research proceeds, different tools are used. Analysis tools such as electronic spreadsheets (e.g.
Microsoft Exceltm) or fourth generation language tools (e.g. Matlabtm), some of the most common tools, are the least flexible, the least integrated into newer information systems and, most importantly, the least able to handle increasing data volumes, complexity and heterogeneity (let alone semantics; see below). Continuing research is needed to understand how common modes of scientific investigation (analysis, modelling, visualization) can be brought closer to the end user and earlier in the process, especially for visualization (Fox and Hendler, 2011).

A role for standards?

Standards, both community standards and those certified by national and international bodies (e.g. the International Organization for Standardization; ISO), are one means for disparate information systems to reliably exchange data and information over the Internet (Percival, 2010; Woolf, 2010). However, while the development of standards is largely a methodological and technical exercise (organizational and political factors aside), the adoption and/or implementation of standards is very much a social (and organizational/political) one. Further, the need to consistently but rapidly evolve standard means of data and information exchange is often hard to meet without paying attention to the socio-technical aspects of the systems, or systems-of-systems. Again, informatics approaches are a valuable path forward, as the pragmatics of end-use and real implementations tend to drive different choices in deciding what needs to be agreed on and what does not (E. Christian, personal communication).

Encoding meaning?

As noted by Baker et al. (2008), the gap that informatics bridges is that between science and application areas and the underlying, discipline-agnostic ICT infrastructure (such as databases, web servers, wikis, etc.).
The informatics task then is to model and represent the semantics expressed in the use case, or at the very least in the questions that need answers (Fagin et al., 2005). Traditionally, this knowledge has been embedded in software implementations that operate on data and information, often with convoluted or complex logic and human reasoning. There are many consequences, but two prominent ones for this discussion are the inability to present sufficient provenance and explanation as to what happened and why (see the earlier discussion; Buneman et al., 2001), and the rigid nature of the encodings, which makes them harder to maintain, especially as technology changes and different people, with different understandings of the semantics, become involved. The latter circumstance highlights the important distinction between closed world and open world semantics, especially for data exchange (Libkin and Sirangelo, 2011). In short, the closed world assumption (CWA) means all knowledge is known and encoded, and whatever is not encoded is assumed false; a primary example is a relational database schema. In contrast, the open world assumption (OWA) fosters incomplete and evolving knowledge encodings, and what is not encoded is simply not known. As a result, another shift is occurring toward more explicit (declarative) semantics in the form of ontologies (Gruber, 1993) and the consequent use of semantic web technologies (Berners-Lee et al., 2001), with applications in diverse science areas (Fox et al., 2009). The challenges here are numerous (Fox and Hendler, 2009) and include forming and sustaining suitable collaborative teams that include domain and data experts, information and knowledge modellers, and software engineers; balancing what semantic expressivity to encode with how the encodings will be implemented and used; and managing how they are maintained and/or evolve (see the standards discussion).
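The CWA/OWA distinction above can be made concrete with a small illustrative sketch (Python; the gauge names and fact base are hypothetical): the same missing fact is ‘false’ under the closed world assumption but only ‘unknown’ under the open world assumption.

```python
# A tiny fact base: everything we have explicitly encoded.
facts = {("G1", "active"): True}

def cwa_query(entity, prop):
    # Closed world assumption: whatever is not encoded is false,
    # as in a relational database schema.
    return facts.get((entity, prop), False)

def owa_query(entity, prop):
    # Open world assumption: whatever is not encoded is simply unknown,
    # as in ontology-based (e.g. OWL) knowledge encodings.
    return facts.get((entity, prop), "unknown")

print(cwa_query("G2", "active"))  # False -- absence is treated as falsity
print(owa_query("G2", "active"))  # unknown -- absence carries no commitment
```

The practical consequence for data exchange is that an OWA encoding can be extended by different parties without silently contradicting earlier conclusions, whereas a CWA encoding must be complete to be correct.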
ACKNOWLEDGMENTS

The author acknowledges the invitation and support from the conference organizers (CSIRO), a Research Collaboration agreement between CSIRO/TasICTC and RPI, and stimulating research discussions with Stephen Giugni, Kerry Taylor and Andrew Terhorst. The hydroinformatics application contributions to the sensor webs mentioned in this paper are due to Quan Bai, Corne Kloppers, Brad Lee, Qing Liu, Chris Peters and Peter Taylor (and others); they are the informaticists of today.

REFERENCES

Baker D, Barton C, Peterson W and Fox P (2008) Informatics and the 2007–2008 Electronic Geophysical Year. Eos Trans. AGU 89(48), 485-486.

Bittner K and Spence I (2002) Use Case Modeling, Addison-Wesley Professional.

Buneman P, Khanna S and Wang-Chiew T (2001) Why and Where: A Characterization of Data Provenance, Lecture Notes in Computer Science, 1973, 316-330, doi: 10.1007/3-540-44503-X_20.

Cockburn A (2000) Writing Effective Use Cases, Addison-Wesley, Boston, MA.

De Roure D and Goble C (2009) Software Design for Empowering Scientists, IEEE Software, 26(1), pp. 88-95.

Dowell J and Long J (1998) A conception of the cognitive engineering design problem, Ergonomics, 41(2), 126-139.

Fagin R, Kolaitis Ph, Miller R and Popa L (2005) Data exchange: semantics and query answering. Theor. Comput. Sci. 336(1), 89–124.

Fox P, McGuinness DL, Cinquini L, West P, Garcia J, Benedict JL and Middleton D (2009) Ontology-supported scientific data frameworks: The Virtual Solar-Terrestrial Observatory experience, Computers & Geosciences, 35(4), 724-73.

Fox P and Hendler J (2009) Semantic eScience: Encoding Meaning in Next-Generation Digitally Enhanced Science, in The Fourth Paradigm: Data Intensive Scientific Discovery, Eds. T. Hey, S. Tansley and K. Tolle, Microsoft External Research, pp. 145-150.

Fox P and Hendler J (2011) Changing the Equation on Scientific Data Visualization, Science, 331(6018), pp.
705-708, doi: 10.1126/science.1197654, online at http://www.sciencemag.org/content/331/6018/705.full

Gruber TR (1993) Toward Principles for the Design of Ontologies Used for Knowledge Sharing, Formal Ontology in Conceptual Analysis and Knowledge Representation, Eds. N. Guarino and R. Poli, Kluwer Academic Publishers, pp. 1-23.

Hayes GS, Mayers SA and Pensack CE (2005) Using GIS to Find New Uses for Old Data, Government Engineering, Sept.-Oct. 2005, pp. 16-19.

Hey T, Tansley S and Tolle K, Eds. (2009) The Fourth Paradigm: Data Intensive Scientific Discovery, Microsoft External Research.

Jacobson I (1987) Object oriented development in an industrial environment, OOPSLA ’87: Object-Oriented Programming Systems, Languages and Applications, ACM SIGPLAN, pp. 183-191.

Libkin L and Sirangelo C (2011) Data Exchange and Schema Mappings in Open and Closed Worlds, Journal of Computer and System Sciences, 77(3), 542-571.

Nativi S and Fox P (2010) Advocating for the Use of Informatics in the Earth and Space Sciences, Eos Trans. AGU 91(8), 75-76, doi: 10.1029/2010EO080004.

Peirce CS (1898) Reasoning and the Logic of Things: The Cambridge Conference Series of 1898, Ed. K. Ketner, Harvard University Press (January 1, 1992).

Percival G (2010) The Application of Open Standards to Enhance the Interoperability of Geoscience Information, International Journal of Digital Earth, 3, Suppl. 1, 14-30, doi: 10.1080/17538941003792751.

NSF (2011a) Report of the National Science Foundation Advisory Committee for Cyberinfrastructure Task Force on Cyberlearning and Workforce Development, http://www.nsf.gov/od/oci/taskforces/

NSF (2011b) Report of the National Science Foundation Advisory Committee for Cyberinfrastructure Task Force on Data and Visualization.

Riding the Wave (2010) How Europe can gain from the rising tide of scientific data, final report of the High Level Expert Group on Scientific Data, a submission to the European Commission, October 2010, http://www.grdi2020.eu/Pages/Unlock.aspx
SCCID Interim Report (2011) The International Council for Science’s Strategic Coordinating Committee on Information and Data Interim Report, ICSU, April 2011, on the web at http://www.icsu.org/what-we-do/committees/information-data-sccid/?icsudocid=interim-report

Shannon CE (1948) A Mathematical Theory of Communication, Bell System Technical Journal, 27, pp. 379–423 & 623–656, July & October, 1948.

Simmhan YL, Plale B and Gannon D (2005) A survey of data provenance in e-science, ACM SIGMOD Record, 34(3), 31-36, doi: 10.1145/1084805.1084812.

Weaver W and Shannon CE (1963) The Mathematical Theory of Communication, Univ. of Illinois Press, ISBN 0252725484.

Woolf A (2010) Powered by Standards – New Data Tools for the Climate Sciences, International Journal of Digital Earth, 3, Suppl. 1, 85-102, doi: 10.1080/17538941003672268.