Bringing Archival Data to Life - Center for Organizational Research

advertisement
Converging Conversations and Repositories:
Bringing Archival Data to Life
Susan Elliott Sim
Dept. of Informatics
University of California, Irvine
ses@uci.edu
Ann Zimmerman
School of Information
University of Michigan
asz@umich.edu
ABSTRACT
Bonnie Nardi
Dept. of Informatics
University of California, Irvine
nardi@uci.edu
metadata standards, to make data easier to find, retrieve,
and combine. These two approaches illustrate the tension
between the rigor that comes with standardization and the
richness of context that occurs through conversation.
In this paper we describe a Conversation Environment for
Ecological Data (CEED). Within this environment, we seek
to: 1) incorporate conversation to support data sharing early
in the data collection and analysis sequence and ensure that
this information is added to the archive; and 2) make the
creation of archival information part of the research
process.
Social Exchange
In comparison to some fields, ecologists have little
experience with data sharing and that which does occur is
usually between close associates [3]. There are many
reasons for this, including a lack of standard methods for
data collection and storage, limited funding for data
management, and a culture that does not reward sharing. It
is also the case that ecological data are unusually complex
and present problems of interpretation and analysis because
of the very nature of the material under study [5].
Author Keywords
ecoinformatics, data sharing, scientific collaboration
ACM Classification Keywords
H. Information Systems, H.4 INFORMATION SYSTEMS
APPLICATIONS, H.4.3 Communications Applications
PROBLEM: SECONDARY USES OF DATA IN ECOLOGY
Reusing ecological data is hard and often requires either the
informal knowledge that comes from having collected
similar types of data or from interaction with the original
data collector; sometimes both are necessary [12]. While
the transfer of knowledge through informal exchange is
precise and effective, it is inefficient at a large scale.
Ecologists are seeking to expand the spatial and temporal
scales of their science in order to understand today's
complex and global environmental problems[9]. A key part
of this effort is the development of sophisticated means to
share, manage, and integrate data. The goal of the emergent
field of ecoinformatics is to design tools to support these
activities and to make it possible for scientists to use data
collected for one purpose to study a new problem.
Technological Solutions
To help address these inefficiencies, software and
technologies are being designed for use by ecologists [1, 4].
The primary goal of these tools is to introduce standards
that will make it easier for ecologists to share their data and
to find and reuse data collected by others. Although they
attempt to accommodate ecologists' work practices, most of
the existing tools are not widely used [1].
However, most ecological data are not collected with the
expectation that they will be shared and reused. In addition,
many contextual and informal details about the data are lost
on the way to creating publishable results, as is often the
case with science [2, 6, 11].
Currently, there are two approaches to making ecological
data more amenable to secondary use. The first approach
involves social exchange. Personal relationships and
knowledge transferred in informal settings are used to help
interpret data and assess their quality. The second method
relies on technical infrastructure, such as repositories and
Sharing data requires extra overhead. Adding metadata
takes time and can distract researchers whose efforts are
focused on the work at hand. In addition, it is difficult to
document data for unknown users and uses. As ecologists
have noted, "There is no unique minimal and sufficient set
of metadata for any given data set, since sufficiency
depends on the use(s) to which the data are put" [8].
OUR SOLUTION: CONVERSATION ENVIRONMENT FOR
ECOLOGICAL DATA (CEED)
In our view, conversation is critical to supporting secondary
uses of data in ecology. It provides the contextual
information necessary to perform subsequent analyses by
restoring information that is lost as specimens move from
field sites to laboratories to publications.
1
We propose to build a Conversation Environment for
Ecological Data (CEED). This tool will serve as a bridge
between current work practices and existing leading edge
technology for ecological data. CEED will be part field
notebook, part data analysis sandbox, and part repository.
metadata standard. These data sets provide suggestions on
which variables and measures to use. Other responses come
from the conversations that have been archived with the
data. These provide information on the procedures used to
collect the data. In both cases, Juan has access to
information that goes beyond what is normally found in a
journal or conference publication. Finally, he can use
CEED to explore metadata standards to develop a data
collection sheet. Once Juan has started this thread of
conversation, he can sustain it by taking CEED into the
field with him, as in Usage Scenario 1.
Incorporating conversations into data is not as large a
departure from current work practices as one might expect.
Roth and Bowen conducted ethnographic studies of
ecologists performing data collection in field sites [10].
They found that throughout a study, ecologists conducted
monologues with themselves on the objectivity and
acceptability of measurements to the wider research
community. We want to capture some of these monologues
as they occur and, in turn, change them into conversations
that can then be archived along with the data. Data
repositories that incorporate informal knowledge make
secondary use more tractable and can also serve as
important resources during study design and data collection.
In the following subsections, we present three usage
scenarios to illustrate interactions with CEED.
Usage Scenario 3: Secondary Analysis of Field Data
Jane Globe is an earth sciences researcher performing a
geographically dispersed study of air quality. Localized
dissipation patterns are well-known, but global patterns are
less clear. In order to investigate her questions, Jane needs
access to far more data than she could collect on her own.
Data can be downloaded directly from the repository or
through CEED, which provides additional support for
interpreting the data and combining data sets. Not only are
there conversation threads from the individuals who
collected the data, there are contributions from other people
who reused the data previously that provide insights into
how difficult the data are to work with and what kind of
adjustments are needed to reconcile multiple data sets. As
Jane works with the data, she in turn contributes to the
domain knowledge in the conversations. These communitywide, ongoing threads of conversation that are archived
with the data enable more sensitive and reliable analyses.
Usage Scenario 1: Data Collection
Roth and Bowen [10] reported on Sam, an ecologist
studying a sub-species of lizard at the edge of its
geographical distribution. During the field season, Sam
made many annotations to data collection sheets, trained
students to assist her, and refined her measurement
procedures.
Assigning a color to a lizard proved to be highly
problematic, due to mottling and shade variations over its
body. Sam devised an elaborate system of identifying a
lizard’s color using a special holding box and Munsell
charts. Even with this method, there was low inter-rater
reliability among her students. Sam decided to do all color
measurements herself in order to improve her precision,
reduce variability, and account for her work to others.
OPEN PROBLEMS IN THE IMPLEMENTATION OF CEED
CEED is an ambitious project. However, the challenges of
data sharing and reuse have both technical and social
components, so appropriate solutions must address these
dual concerns. Fortunately, ecologists are working to
standardize devices, measurements, and procedures [9].
An electronic tool such as CEED will be used to collect and
store data (e.g. GPS coordinates for location, RGB values
or photographs for colors), but more importantly it can help
to capture the private annotations, notes, and instructions
that are left out of public communications. For others who
are studying lizards or using a similar procedure to assign
color to specimens, it will be extremely useful to know how
Sam made the decision to change the procedure and how
difficult it was to use. Through CEED, this rich contextual
information will be archived along with Sam's data.
The primary question this work addresses is: How to
achieve standardization without loss of contextual richness?
Our initial answer is to use CEED to support and sustain
conversations through the various stages of research design,
data collection, archiving, and secondary use. We are not
just designing a tool, we are creating a mode of interaction.
A second practical question is how to encourage the use of
CEED while the repository is sparse [7]. The preferred
method is to have ecologists populate it with their
conversations. Data from existing sources can serve as
starting points, but the main contribution of CEED is that it
preserves informal knowledge that hitherto has been lost.
Usage Scenario 2: Study Design
Juan Frost wants to study the effect of reduced snowfall and
shorter winters in an alpine region on plant productivity and
diversity. Although he’s read several papers, Juan is
uncertain which variables are most important. Through
CEED, Juan can access information to make decisions
about his study design. He starts the conversation by
“posting” a question. The replies to his query come from
archived data that have been annotated using an ecological
Will this approach work? Where should the dividing line
between informal knowledge and archived data fall? What
are the costs and benefits of the different points of
separation? These are substantive questions that need to be
answered as we move beyond threaded conversation to
archived knowledge and back again.
2
ACKNOWLEDGMENTS
[6]
This research has been funded by NSF Grant IIS-0438848
for the Collaboration in Ecology Workshop, and the
Newkirk Center for Science and Society at the University
of California, Irvine.
[7]
REFERENCES
[1]
[2]
[3]
[4]
[5]
Andelman, S. J., Bowles, C. M., Willig, M. R., and
Waide, R. B., “Understanding Environmental
Complexity through a Distributed Knowledge
Network,” BioScience, 54, 3, March, (2004), pp.
240-246.
Berg, M., “Problems and Promises of the
Protocol,” Social Science & Medicine, 44, 8,
(1997), pp. 1081-1088.
Committee on the Future of Long-Term Ecological
Data, “Final Report of the Ecological Society of
America Committee on the Future of Long-Term
Ecological Data, Vols. 1-2.,” Ecological Society of
America, Washington, D.C., USA 1995.
“Knowledge Network for Biocomplexity,”
http://knb.ecoinformatics.org, last accessed 24
March 2004.
Lane, M. A., Edwards, J. L., and Nielsen, E.,
“Biodiversity Informatics,” Proc. 26th
International Conference on Very Large Data
Bases (VLDB 2000), Cairo, Egypt, pp. 729-732,
10-14 September 2000.
[8]
[9]
[10]
[11]
[12]
3
Latour, B. and Woolgar, S., Laboratory Life.
Beverly Hills, CA: Sage Publications, 1979.
Ludford, P. J., Cosley, D., Frankowski, D., and
Terveen, L., “Think Different: Increasing Online
Community Participation Using Uniqueness and
Group Dissimilarity,” Proc. CHI 2004, Vienna,
Austria, pp. 631-638, 24-29 April 2004.
Michener, W. K., Brunt, J. W., Helly, J. J.,
Kirchner, T. B., and Stafford, S. G.,
“Nongeospatial Metadata for Ecological
Sciences,” Ecological Applications, 7, 1, (1997),
pp. 330-342.
National Research Council, NEON: Addressing the
Nation's Environmental Challenges. Washington,
DC: National Academy Press, 2003.
Roth, W.-M. and Bowen, G. M., “Digitizing
Lizards: The Topology of 'Vision' in Ecological
Fieldwork,” Social Studies of Science, 29, 5,
October, (1999), pp. 719-764.
Star, S. L., “Scientific Work and Uncertainty,”
Social Studies of Science, 15, (1985), pp. 391-427.
Zimmerman, A., Data Sharing and Secondary Use
of Scientific Data. Doctoral, Information and
Library Studies, University of Michigan, 2003.
Download