Building FLOW: Federating Libraries on the Web Anna Keller Gold UCSD Libraries University of California, San Diego San Diego, CA 92093 858-534-1214 agold@ucsd.edu Karen Baker Scripps Institution of Oceanography University of California, San Diego San Diego, CA 92093-0218 858-534-2350 kbaker@ucsd.edu Kim Baldridge San Diego Supercomputer Center Integrative BioSciences University of California, San Diego San Diego, CA 92093 kimb@sdsc.edu Jean-Yves Le Meur European Center for Nuclear Research (CERN) CH-1211 Geneva 23 41-22-76-74745 Jean-Yves.Le.Meur@cern.ch ABSTRACT An individual scientist, a collaborative team and a research network have a variety of document management needs in common. The levels of research organization, when viewed as nested tiers, represent boundaries across which information can flow openly if technology and metadata standards are partnered to provide an accessible, interoperable digital framework. The CERN Document System (CDS), implemented by a research partnership at the San Diego Supercomputer Center (SDSC), establishes a prototype tiered repository system. An ongoing exploration of existing scientific research information infrastructure suggests modifications to enable cross-tier and cross-domain information flow across what could be represented as a metadata grid. Keywords document management system, bibliographic citations, eprints open archives, INTRODUCTION Solutions for managing documentation generally address the needs of a single tier in the research process (individual, group, or network). Desktop bibliographic citation software packages such as EndNote or Procite provide tools for individuals to manage personal citation libraries. Manually maintained web pages allow individuals or small groups to post selected documents and gather links to other documents. In-house bibliographies, whether as ongoing or ad hoc efforts, arise in response to immediate administrative or public reporting requirements and may take the form of centrally maintained spreadsheet or database tables. Far-flung disciplinary communities may have electronic repositories such as eprint libraries supporting networked exchange and management of a community’s research documentation [1]. Although open archive software for individuals is under development [2] and rapid progress in Open Archives and Open Eprint repositories is underway, to date there has been little focus on the integration among these types of tools. APPROACH A multi-tier view of information flow in research makes it possible to enhance the participation of scientists at different stages of the research process. This in turn will make it possible to capture a wider range of research products, resulting in further discovery. Such an approach requires protocols that authorize federation and interconnectivity, while preserving personal or institutional differentiation in the form of “streams” both into and out of the federated processes. The World Wide Web can be expected to serve as a common interface to each layer, enabling participation and management at all levels of organization, regardless of local technical infrastructure. A metadata architecture supporting multiple layers could be thought of as a metadata grid, similar to a computational grid architecture, with “a series of layers of different widths” at the center of which are resource and connectivity layers and protocols [3]. Research groups typically articulate the requirement to collect citation data from group members using web forms that would allow management of information about research publications, grants, and associated researchers. Requirements include the necessity to retrieve this data in multiple ways from a data management system that can be queried; and to create public exposure for data gathered in order to enhance the impact of the research group. The needs of research groups have much in common with the requirements of large-scale research networks, such as the national Long Term Ecological Research (LTER) Network, the National Partnership for Advanced Computational Infrastructure (NPACI) and the Integrative Biosciences group at the San Diego Supercomputer Center (SDSC) and the European Organization for Nuclear Research (CERN). A partnership representing researchers and programmers at these research organizations, in addition to the research libraries at the University of California, San Diego, identified the following common needs: (1) to gather (G) structured resource information, both in batch and individual submission modes; (2) to share (S) collections of information with research networks; and (3) to discover (D) relevant research, partners, updated results, and linkages. The targets of these activities (GSD) are the files and metadata representing preprints and technical reports, journal publications and book chapters, presentations, proposals, conference papers, courses taught, and eventually multimedia. This is a much wider range of ‘grey literature’ than has typically been tracked in bibliographic repositories. In addition, we consider it important to be able to gather, share, and discover information representing a researcher or research group. For example, a common request is to be able to identify all the documents associated with a particular research effort or supported by a particular research instrument. Figure 1: FLOW supporting grid metadata infrastructure. PROTOTYPE SYSTEM The immediate goal of our partnership is to develop a prototype capable of handling GSD functions for this wide range of metadata types. Requirements for the prototype are that it be standards-based, open, flexible, quick to implement, and that it allows search within and across collections. Functional desiderata were the ability to modify and update submissions; provide full text via local or remote files; search by fields or full text; extract and count citations within the system; handle various media and types of data; provide for administrative and peer review processes; permit customization and personalization; and support for alerting services. In short, our requirements described the need for a generalized tool easily adapted to the needs of diverse science researchers, organizations, and networks. In selecting a system for the prototype we reviewed a variety of community based bibliographic citation approaches and compared them systematically, identifying two systems designed to meet cross - community document repository requirements: OpenEprints software (http://www.eprints.org/) and the CERN Document System (CDS, http://cds.cern.ch). The CDS, under active development at CERN by a dedicated staff, serves an international research community and a large research center. While compliant with OAI protocols, it has capabilities that go beyond those of OpenEprints that are adaptable to the FLOW concept. In addition to basic submit, modify, and search modules, CDS includes a batch upload module [4], customization features; workflows supporting review processes; alert services, and diverse output and export options. It also has an automatic citation extraction and tracking capability. [5] CDS is distributed under the GNU license, but technical support is available under a Technology Transfer agreement. This means that our project programming staff can work in collaboration with CERN programmers to transfer, install, configure, and customize the modules of the CDS system on the local server. CONCLUSION The CDS system is implemented for local use at the San Diego Supercomputer Center, where SDSC and CERN project staff are working in partnership with researchers to explore workflows, develop protocols, and create new modules designed to support the concept of FLOW (integration of different layers in the research information process). Through collective practice and iterative design we are developing experience and protocols to support easy uploading of bibliographies created and/or maintained in EndNote, as well as the reverse (extraction from the repository to EndNote-compatible file formats). Further, the need has emerged to add a “person” module to the suite of “document-like” objects that can be managed by the system. While designing for local needs we plan to prototype a layered set of protocols that are general purpose and use established communication practices in research communities, and yet bridge individual, local project, and national needs. This partnership provides an arena for discovering and exploring the challenges of both cross-tier and crossdomain document work practices. REFERENCES 1.Suleman, H., and Fox, E. A. A Framework for Building Open Digital Libraries. D-Lib Magazine 7, 12 (December 2001). 2.Maly, K., et al. Kepler – An OAI Data/Service Provider for the Individual. D-Lib Magazine 7, 4 (April 2001). 3. Foster, I. The GRID: A New Infrastructure for 21st Century Science. Physics Today 55, 2 (February 2002), 42-47). 4. Pignard, N., et al. Automated treatment of electronic resources in the Scientific Information Service at CERN. HEP Libraries Webzine 2001, 3 ( March 2001). 5.Claivaz, J., et al. From Fulltext Documents to Structured Citations: CERN’s Automated Solution. HEP Libraries Webzine 2001, 5, (November 2001).