Converging Conversations and Repositories: Bringing Archival Data to Life

Susan Elliott Sim
Dept. of Informatics
University of California, Irvine
ses@uci.edu

Ann Zimmerman
School of Information
University of Michigan
asz@umich.edu

Bonnie Nardi
Dept. of Informatics
University of California, Irvine
nardi@uci.edu

ABSTRACT

Currently, there are two approaches to making ecological data more amenable to secondary use. The first approach involves social exchange. Personal relationships and knowledge transferred in informal settings are used to help interpret data and assess their quality. The second method relies on technical infrastructure, such as repositories and metadata standards, to make data easier to find, retrieve, and combine. These two approaches illustrate the tension between the rigor that comes with standardization and the richness of context that occurs through conversation. In this paper we describe a Conversation Environment for Ecological Data (CEED). Within this environment, we seek to: 1) incorporate conversation to support data sharing early in the data collection and analysis sequence and ensure that this information is added to the archive; and 2) make the creation of archival information part of the research process.

Author Keywords

ecoinformatics, data sharing, scientific collaboration

ACM Classification Keywords

H. Information Systems, H.4 INFORMATION SYSTEMS APPLICATIONS, H.4.3 Communications Applications

PROBLEM: SECONDARY USES OF DATA IN ECOLOGY

Ecologists are seeking to expand the spatial and temporal scales of their science in order to understand today's complex and global environmental problems [9]. A key part of this effort is the development of sophisticated means to share, manage, and integrate data. The goal of the emergent field of ecoinformatics is to design tools to support these activities and to make it possible for scientists to use data collected for one purpose to study a new problem.

However, most ecological data are not collected with the expectation that they will be shared and reused. In addition, many contextual and informal details about the data are lost on the way to creating publishable results, as is often the case with science [2, 6, 11].

Social Exchange

In comparison to some fields, ecologists have little experience with data sharing and that which does occur is usually between close associates [3]. There are many reasons for this, including a lack of standard methods for data collection and storage, limited funding for data management, and a culture that does not reward sharing. It is also the case that ecological data are unusually complex and present problems of interpretation and analysis because of the very nature of the material under study [5].

Reusing ecological data is hard and often requires either the informal knowledge that comes from having collected similar types of data or from interaction with the original data collector; sometimes both are necessary [12]. While the transfer of knowledge through informal exchange is precise and effective, it is inefficient at a large scale.

Technological Solutions

To help address these inefficiencies, software and technologies are being designed for use by ecologists [1, 4]. The primary goal of these tools is to introduce standards that will make it easier for ecologists to share their data and to find and reuse data collected by others. Although they attempt to accommodate ecologists' work practices, most of the existing tools are not widely used [1].

Sharing data requires extra overhead. Adding metadata takes time and can distract researchers whose efforts are focused on the work at hand. In addition, it is difficult to document data for unknown users and uses. As ecologists have noted, "There is no unique minimal and sufficient set of metadata for any given data set, since sufficiency depends on the use(s) to which the data are put" [8].

OUR SOLUTION: CONVERSATION ENVIRONMENT FOR ECOLOGICAL DATA (CEED)

In our view, conversation is critical to supporting secondary uses of data in ecology. It provides the contextual information necessary to perform subsequent analyses by restoring information that is lost as specimens move from field sites to laboratories to publications.

We propose to build a Conversation Environment for Ecological Data (CEED). This tool will serve as a bridge between current work practices and existing leading edge technology for ecological data. CEED will be part field notebook, part data analysis sandbox, and part repository.

Incorporating conversations into data is not as large a departure from current work practices as one might expect. Roth and Bowen conducted ethnographic studies of ecologists performing data collection in field sites [10]. They found that throughout a study, ecologists conducted monologues with themselves on the objectivity and acceptability of measurements to the wider research community. We want to capture some of these monologues as they occur and, in turn, change them into conversations that can then be archived along with the data.

Data repositories that incorporate informal knowledge make secondary use more tractable and can also serve as important resources during study design and data collection. In the following subsections, we present three usage scenarios to illustrate interactions with CEED.

Usage Scenario 1: Data Collection

Roth and Bowen [10] reported on Sam, an ecologist studying a sub-species of lizard at the edge of its geographical distribution. During the field season, Sam made many annotations to data collection sheets, trained students to assist her, and refined her measurement procedures. Assigning a color to a lizard proved to be highly problematic, due to mottling and shade variations over its body. Sam devised an elaborate system of identifying a lizard’s color using a special holding box and Munsell charts. Even with this method, there was low inter-rater reliability among her students. Sam decided to do all color measurements herself in order to improve her precision, reduce variability, and account for her work to others.

Fortunately, ecologists are working to standardize devices, measurements, and procedures [9]. An electronic tool such as CEED will be used to collect and store data (e.g. GPS coordinates for location, RGB values or photographs for colors), but more importantly it can help to capture the private annotations, notes, and instructions that are left out of public communications. For others who are studying lizards or using a similar procedure to assign color to specimens, it will be extremely useful to know how Sam made the decision to change the procedure and how difficult it was to use. Through CEED, this rich contextual information will be archived along with Sam's data.

Usage Scenario 2: Study Design

Juan Frost wants to study the effect of reduced snowfall and shorter winters in an alpine region on plant productivity and diversity. Although he’s read several papers, Juan is uncertain which variables are most important. Through CEED, Juan can access information to make decisions about his study design. He starts the conversation by “posting” a question. The replies to his query come from archived data that have been annotated using an ecological metadata standard. These data sets provide suggestions on which variables and measures to use. Other responses come from the conversations that have been archived with the data. These provide information on the procedures used to collect the data. In both cases, Juan has access to information that goes beyond what is normally found in a journal or conference publication. Finally, he can use CEED to explore metadata standards to develop a data collection sheet. Once Juan has started this thread of conversation, he can sustain it by taking CEED into the field with him, as in Usage Scenario 1.

Usage Scenario 3: Secondary Analysis of Field Data

Jane Globe is an earth sciences researcher performing a geographically dispersed study of air quality. Localized dissipation patterns are well-known, but global patterns are less clear. In order to investigate her questions, Jane needs access to far more data than she could collect on her own. Data can be downloaded directly from the repository or through CEED, which provides additional support for interpreting the data and combining data sets. Not only are there conversation threads from the individuals who collected the data, there are contributions from other people who reused the data previously that provide insights into how difficult the data are to work with and what kind of adjustments are needed to reconcile multiple data sets. As Jane works with the data, she in turn contributes to the domain knowledge in the conversations. These community-wide, ongoing threads of conversation that are archived with the data enable more sensitive and reliable analyses.

OPEN PROBLEMS IN THE IMPLEMENTATION OF CEED

CEED is an ambitious project. However, the challenges of data sharing and reuse have both technical and social components, so appropriate solutions must address these dual concerns.

The primary question this work addresses is: How to achieve standardization without loss of contextual richness? Our initial answer is to use CEED to support and sustain conversations through the various stages of research design, data collection, archiving, and secondary use. We are not just designing a tool, we are creating a mode of interaction.

A second practical question is how to encourage the use of CEED while the repository is sparse [7]. The preferred method is to have ecologists populate it with their conversations. Data from existing sources can serve as starting points, but the main contribution of CEED is that it preserves informal knowledge that hitherto has been lost.

Will this approach work? Where should the dividing line between informal knowledge and archived data fall? What are the costs and benefits of the different points of separation? These are substantive questions that need to be answered as we move beyond threaded conversation to archived knowledge and back again.

ACKNOWLEDGMENTS

This research has been funded by NSF Grant IIS-0438848 for the Collaboration in Ecology Workshop, and the Newkirk Center for Science and Society at the University of California, Irvine.

REFERENCES

[1] Andelman, S. J., Bowles, C. M., Willig, M. R., and Waide, R. B., “Understanding Environmental Complexity through a Distributed Knowledge Network,” BioScience, 54, 3, March, (2004), pp. 240-246.

[2] Berg, M., “Problems and Promises of the Protocol,” Social Science & Medicine, 44, 8, (1997), pp. 1081-1088.

[3] Committee on the Future of Long-Term Ecological Data, “Final Report of the Ecological Society of America Committee on the Future of Long-Term Ecological Data, Vols. 1-2,” Ecological Society of America, Washington, D.C., USA, 1995.

[4] “Knowledge Network for Biocomplexity,” http://knb.ecoinformatics.org, last accessed 24 March 2004.

[5] Lane, M. A., Edwards, J. L., and Nielsen, E., “Biodiversity Informatics,” Proc. 26th International Conference on Very Large Data Bases (VLDB 2000), Cairo, Egypt, pp. 729-732, 10-14 September 2000.

[6] Latour, B. and Woolgar, S., Laboratory Life. Beverly Hills, CA: Sage Publications, 1979.

[7] Ludford, P. J., Cosley, D., Frankowski, D., and Terveen, L., “Think Different: Increasing Online Community Participation Using Uniqueness and Group Dissimilarity,” Proc. CHI 2004, Vienna, Austria, pp. 631-638, 24-29 April 2004.

[8] Michener, W. K., Brunt, J. W., Helly, J. J., Kirchner, T. B., and Stafford, S. G., “Nongeospatial Metadata for Ecological Sciences,” Ecological Applications, 7, 1, (1997), pp. 330-342.

[9] National Research Council, NEON: Addressing the Nation's Environmental Challenges. Washington, DC: National Academy Press, 2003.

[10] Roth, W.-M. and Bowen, G. M., “Digitizing Lizards: The Topology of 'Vision' in Ecological Fieldwork,” Social Studies of Science, 29, 5, October, (1999), pp. 719-764.

[11] Star, S. L., “Scientific Work and Uncertainty,” Social Studies of Science, 15, (1985), pp. 391-427.

[12] Zimmerman, A., Data Sharing and Secondary Use of Scientific Data. Doctoral, Information and Library Studies, University of Michigan, 2003.