Collection Repository: Personal to Organizational
Introduction to the CERN Document Server at the San Diego Supercomputer Center (CDS @ SDSC)

Karen S. Baker, Anna Gold, Frank Sudholt

Abstract

An individual, a project, and an organization, whether an institution, a network or a discipline, have collection management needs in common. In an academic arena, a collection often consists of bibliographic citations but may also include the documents themselves as well as photographs, videos, course materials and/or artifacts. The multiple levels from personal to organizational (project, department, campus or network of campuses) represent nested tiers across which information can flow when technology and metadata standards are partnered to provide an accessible, interoperable digital framework for distributed collection management. We seek a generalized tool that can be easily adapted to multi-level needs for a full range of repository activities: gathering, sharing, and discovering materials. The gathering of objects into different collections may involve unique or re-used objects. An object is repurposed through use in multiple collections, where a collection may be viewed as a type of intellectual capital (Greenstein, 200x). A collection conveys information through its selections and omissions; it shares a view or piece of knowledge about a subject or an organization. This work is driven by recognition of the power of a collection to present information about individual entries and to convey insights resulting from an assembled whole.

Introduction

This report provides an overview, in theory and in practice, of a project seeking to define a document repository useful at multiple levels. The initial focus is on bibliographic materials and the system requirements to support coordinated repository efforts. Beyond theory, a field report and local observations are presented along with a consideration of next steps.
Progress in creating a prototype repository at the San Diego Supercomputer Center using CERN CDSware is detailed, including the reconciliation of divergent practices and motivations of target repository participants. We build from the assumptions that information flow is enhanced through use of the web as a common interface, that cooperation is facilitated through a central repository, and that the user is to be respected through coordination attentive to local practice. The goals that focus project activity are:

1) to create a repository supporting two-way information flow (ingest and export) involving exchange between an individual’s collection system and a centralized collection system through partnering of existing technology and metadata;
2) to develop a specific working environment while staying informed about compatibility with alternative developments in the metadata and archives arenas;
3) to consider information about collections in a broader context with reference to collectors and their organizational ties;
4) to employ an iterative prototyping process as a development environment in order to optimize product usefulness given existing practice.

With so many metadata and digital library efforts, one is prompted to ask “why another repository?” The response arises from recognition of the value in considering a diversity of approaches, given the complexity involved in first establishing and then maintaining a repository. A range of important differences arises when considering repository design issues such as:

- How things get in
- How things get out
- Who can put things in (and take them out)
- What things can be put in
- What linkages they have to other systems
- What protocols/standards they follow

There are multiple, differing roles and motivations for participation by individuals, groups, networks, institutions, and disciplines (Table 1, Appendix). The process observes protocols that enable federation and interconnectivity.
The process also allows personal or institutional differentiation or “streams” (both into and out of the federated processes).

Repository success depends on a good match between technical and social design. Both technical and social issues are complex. Good social design remains an unsolved research challenge as the cultural and management aspects of repository building are emerging as areas of investigation (ARL, 2002). In addition to the multiple levels of organization, it is important to take into consideration the divergence ingrained in more-or-less well-functioning workflows and practice. We recognize and are committed to research and discovery as a process, not a product (Floyd, 1987) and to the participation of the multiple voices from scientist to information manager to system designer.

Background

Discussions on digital repositories are ongoing (Peters, 2002). Visions often focus on the particular, i.e. individual, project, organizational, national or international levels. Repository models are often driven by issues of publishing and/or identity. Our own work is stimulated by the notion of a single collection as a representation of an ongoing instance of learning and of a diversity of collections as a presentation of multiple understandings, all important to preserve and to connect without limitations. Further, we focus on the flow between these levels, recognizing the importance of identity at all levels along with the critical need to engage the individual participant in the process of information gathering.

The growing work in cross-domain research prompts a refocus to the empirical understanding of the interdisciplinary research environment itself (Spanner, 2001) where the communications and infrastructures differ even between established and fringe interdisciplinary studies. In both, however, high priority is given to informal communications.
The personal repository may be unique in providing a mechanism to extend informal communications if collection-making tools are available. Collections may provide a diversity of individual views into a discipline that, when shared, serve as informal guides and communications for associates. They also serve as an important outlet similar to that provided by contribution to volunteer open-source efforts, where an individual is motivated to contribute by the ability to create according to a personal vision of a shared product.

Knowledge management and information flow become critical issues the moment a collection of objects is assembled. An individual insight is instantiated through selection and classification. Tools that facilitate aggregation create a knowledge heterogeneity that informs at multiple levels and from multiple views. Individual collections are a first step in a learning process (research cycle) involving document diversity, information flow, and knowledge heterogeneity. Our view emphasizes the 3R’s (research, relationships and reflexivity) with a focus on both infrastructure (documents, content, work practices) and cyber infrastructure (tools, methods, best practices).

As we situate our position within the world of repositories, archives and libraries, we start with a document-oriented view and extend it through integration of related information such as administrative tables, alternative media such as photographs, and associated services such as organizational metrics and report displays. We seek to create a process of learning and informing for participants by providing mechanisms that enable an individual’s work. The project name FLOW is a purposeful metaphor calling to mind the uncounted rivulets shaped by the local landscape that join to a river of information contributing to heterogeneous pools of knowledge.
Linking individuals and organizations to documents and collections contributes to identity, but technical, conceptual and social problems arise. Technical hurdles include resource support, open design, and implementation strategies. Social barriers include the need for a critical mass of participation: acceptance of the system depends on an activation energy, since people must be motivated to take time now in order to save time later. Conceptual difficulties involve articulating associations between individuals, materials and organizations in order to capture the relationships. When considering the federation of collections, complexity is also introduced by the multiple levels of relationships and by the need to reclassify as learning happens.

Benefits of a multiple-level, multi-participant approach include:

- new tools to facilitate information gathering and reporting at multiple levels of organization
- knowledge generation enabling participation from multiple individuals and groups
- data/information reuse providing accessibility for multiple use
- project definition through identity enhancement and multiple reflections

The project demonstrates that short-term/local approaches are not only compatible with long-term federation strategies but also critical to initiating information flow, contributing to knowledge diversity, and ensuring a continuing feedback process. Our design approach is to start small, ground the design in the local particular with an eye to federation, and implement at the organizational level. Critical system design elements include centralization by organization of individual, project, institution or partnership; federation through a common theme and across multiple locations; and openness with interoperability through protocols such as OAI. The approach is a reflexive process in which we act locally, think globally, and then reflect and react.
Reacting reinitiates the process of customized local actions, of experiments and experiences, thereby enabling learning and change.

In Partnership

This project supports partnering of the Long-Term Ecological Research (LTER) Palmer site information manager with participants from the San Diego Supercomputer Center (SDSC) and the University of California San Diego (UCSD) Libraries, and thereby creates a larger view of organizational infrastructure. The partnership role is to:

- Empower partners with broader vision
- Bridge individual, organizational, & national needs
- Define and meet current needs
- Provide arena for cross-domain work
- Bridge from present to future needs
- Anticipate change, optimizing for sustainability, and
- Work from bottom up toward repository system.

This team identified the European Organization for Nuclear Research (CERN) document system (CDS) as a powerful prototyping software package and installed the software locally at SDSC. Communications with the CERN software developers were initiated and a technology transfer agreement was established between SDSC and CERN. This collaboration (Figure 1) brings the active involvement of the CDS development team together with the UCSD CDS @ SDSC prototype team. To present a larger context, Figure 1 shows four tiers: the SDSC computational environment, the Semantics Group project, and the broader community of digital repositories and services divided into two categories, those compliant with the Open Archives Initiative (OAI) and those not compliant with OAI. Embedded within the SDSC computational environment is the CDSware software, whose storage may potentially be interfaced with the facility's Storage Resource Broker. The Storage Resource Broker does not address ingestion or workflow; it is a logical name space (rather than a database) for managing the storage of the data rather than the data itself.
During these times of rapid technological change and social transition, there is good purpose to multiple approaches, both as complementary, synergistic investigations and as training grounds for all participants. With the ability to preserve more information in the form of collections will arise new needs requiring more new services and tools than any one group can provide. Our semantics group partnership goals may be summarized as follows:

- Create personal citation libraries
- Create input & retrieval tools for users and administrators
- Assure discovery and output capabilities
- Design for interoperability through identification of national categories and use of national standards
- Design for local needs (define local categories)
  o Term lists (controlled vocabulary)
  o Keyword fields, i.e. by discipline, theme, grant support, bibliography
- Design for scalability through mapping (cross-walking)
- Consider sustainability of process

Existing participant approaches to bibliographic materials include:

- LTER: Using EndNote PC platform software to gather structured bibliographic data from disparate sites. Using proprietary software as a means to share data for discovery and networking.
- SDSC: Compiling bibliographic records and citation counts in order to testify to the research impact of funding for projects such as the National Partnership for Advanced Computational Infrastructure (NPACI). Gathering is manual, with no means to share or enable discovery.
- UCSD Departments: Maintaining large EndNote files of research references on a dedicated workstation. Lacks capability for partners overseas to share, interactively submit, or discover.
[Figure 1 (diagram not reproduced). Recoverable labels: non-OAI community (Reference Web Poster, EndNote, library catalogues); OAI repositories & services community (ARL/SPARC, CDL/eScholarship, Eprints, MIT/DSpace, OpenEprints, ARC, bepress); Semantics Project (UCSD Library, LTER, UCSD, SDSC) in CDS@SDSC dialogue with CERN; SDSC computational environment (Unix Solaris, ~60 hosts): Unix (Koshka) for CDSware db & OAI web server development, UNIX (IBS) for CDSware production and OAI, HPSS (High Performance Storage System), SAN, temporary/permanent storage system, SRB (Storage Resource Broker), ASC (Application Service Computing), HPC (High Performance Computing), network environment 10Mb-100Mb (Ethernet)-10Gb.]

Figure 1. CDS @ SDSC Computational and Partnership Environments

Federated Library on the Web (FLOW)

A stream metaphor (Figure 2a) is invoked in the conceptual schema, in contrast to the original notion for bibliographic citation entries (Figure 2b). Ease of input and output is critical to both models. The concept of a Federated Library on the Web, FLOW, develops from understanding the distinctive document management tools and practices used within each layer (individuals, group, center, network, discipline) and from recognizing that these layers represent boundaries across which information could flow openly if technology and metadata could provide an enabling digital framework (“metadata grid”).

Figure 2a: FLOW: Federating Libraries on the Web.

The immediate need for such a model was to gather data from research users, using web forms for input of information about publications, research grants, individuals and organizational context. A dual-mode function is desired for gathering entries in batches (EndNote, Web of Science searches) and one-by-one (individual submissions). These data could then be shared as collections of information and could be discovered and retrieved in multiple ways from a relational database. This approach, with a central repository, creates public exposure for data to enhance its impact.
[Figure 2b (diagram not reproduced). Recoverable labels by stage. INGEST: gather (web, desktop), controlled vocabulary, filter (*.enf), standardize (Keeper-of-the-Keys), QA/QC. STORAGE: EndNote (*.enl), term lists, filter (*.enf), biblio file (*.txt), style (*.ens). DISPLAY: public viewer, as formatted output; individual user, as *.txt or *.enl file. TRANSFER: XML, DTD; SDSC infrastructure interface, as Oracle import; application packages. The figure includes the following sample tagged record:

    %0 Report
    %A Baker, K S
    %T Partnership Science
    %D 2001
    %@ UCSD-SIO 01-03
    %2 M1037
    %4 SDSC
]

Figure 2b. Creating an organizational bibliography: Structured, Parsable, Loosely Federated, Multi-Level Approach

Functional criteria call for a system supporting the ability to:

- Modify and update submissions
- Provide full text via local or remote files
- Search by fields or full text
- Support citation counting
- Handle various media and data types
- Provide for a review process, and
- Offer customization (personalization) and alert services.

Implementation criteria for the system include:

- Standards-based
- Open source
- Flexible
- Fast
- Support search within or across collections.

Technical Issues

Design Decisions

The following protocols/standards were considered:

- OAI Protocol for Metadata Harvesting
- MARC 21
- Z39.80 (article databases, bibliographic software)
- Dublin Core

Additional design considerations included:

- Representing both people and digital objects in the system:
  o “creators” are considered both authors and people
  o integration with the personnel database was needed in order to enable organization views such as “all the people associated with XYZ research group”
- Incorporating records for non-document objects (events as well as groups, people, grants)
- Allowing a hybrid system of metadata with or without associated digital objects
- Planning for end-user upload from EndNote or similar commercial citation management software, and
- Creating genre-based views for the public, and organization views for internal institutional purposes.
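The tagged record shown in Figure 2b lends itself to simple line-by-line parsing, which is one way an end-user upload from EndNote might be ingested. The following is a minimal sketch only, not the project's import filter: the function name `parse_endnote_record` and the mapping of tags to Dublin Core-style field names are assumptions made for illustration, and tags with no assumed mapping (such as %@, %2 and %4) are kept under their original tag.

```python
# Hypothetical mapping from EndNote export tags to Dublin Core-style names.
# The choice of target names is an assumption for illustration.
TAG_TO_FIELD = {
    "%0": "type",
    "%A": "creator",
    "%T": "title",
    "%D": "date",
}


def parse_endnote_record(text):
    """Parse one tagged record into a dict of lists keyed by field name."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("%") or len(line) < 2:
            continue  # skip blank lines and anything that is not a tagged line
        tag, _, value = line.partition(" ")
        field = TAG_TO_FIELD.get(tag, tag)  # unmapped tags keep their tag name
        record.setdefault(field, []).append(value.strip())
    return record


# The sample record from Figure 2b.
SAMPLE = """%0 Report
%A Baker, K S
%T Partnership Science
%D 2001
%@ UCSD-SIO 01-03
%2 M1037
%4 SDSC"""

record = parse_endnote_record(SAMPLE)
```

Each field holds a list so that repeated tags, for example several %A author lines, are preserved in order.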
CERN Document System

Background

Some software options already existed, such as OpenEprints software and the CERN Document System (CDS, later CDSware). A comparison table of active potential repository systems is given in Table 2 (see appendix). The California Digital Library’s bepress software is still under development. The CERN Document System was found compatible with open eprints initiatives in research communities (OAI and OAI-PMH) but independent of OpenEprints / OAI priorities.

Running at CERN, the CERN Document System (CDS: http://www.cern.ch), revised and released in July 2002 as CDSware, is a program that allows the user to:

- Search a scientific publication database
- Submit objects into the database (metadata and document files)
- Personalize the user account (predefined searches, publication baskets etc.)

The public interface is the World Wide Web. The current CERN implementation of CDSware (http://cdsware.cern.ch) manages over 350 collections of data, consisting of over 550,000 bibliographic records, including 220,000 full-text documents: preprints, articles, books, journals, photographs. The capabilities of the CDSware system include:

- Batch uploading of bibliographic citations
- Submitting and modifying individual submissions
- Support for differentiated collections or “catalogues” that can be searched separately or together, and
- Support for implementing other modules, including personalization, alert services, and output and file format options.

The CERN Document System (CDS) was identified as available for rapid deployment and interactive feedback under the project name CDS @ SDSC. CDSware strengths were an active, ongoing development staff; compliance and evolution with existing international standards; and responsiveness to a user community during the development stage.
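Because OAI-PMH compatibility figured in the selection, it may help to see what a harvesting request looks like. The sketch below only constructs request URLs using standard OAI-PMH verbs and arguments (`Identify`, `ListRecords`, `metadataPrefix=oai_dc`, `from`); the base URL is a placeholder and not the address of the CDS @ SDSC or CERN servers, the helper name `oai_request` is hypothetical, and no network call is made.

```python
from urllib.parse import urlencode

# Placeholder endpoint: an assumption for illustration only.
BASE_URL = "http://repository.example.org/oai"


def oai_request(verb, **arguments):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    params = {"verb": verb}
    params.update(arguments)
    return BASE_URL + "?" + urlencode(params)


# Ask the repository to describe itself.
identify_url = oai_request("Identify")

# Harvest Dublin Core records changed since a given date. "from" is a
# Python keyword, so it is passed through a dict.
list_url = oai_request(
    "ListRecords", metadataPrefix="oai_dc", **{"from": "2002-07-01"}
)
```

A harvester would fetch such URLs and parse the XML responses; that step is omitted here because it depends on the server being harvested.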
Support has been available from the CERN technical staff for installation, configuration and modification of the CDS system, which combines a CERN-developed front end with off-the-shelf open source back-end software. System experts helped to configure CDSware, working with our local development team leader, who set up the server, updated supporting software (WML, C compiler, make, Perl and zlib), and ran the basic installations (MySQL, PHP, Apache and Python). The enhancement of import filters, development of export filters, and population of the CDS @ SDSC system will continue through the second year of this grant. The CDSware software has the advantages of:

- Proven institutional implementation at CERN
- Full implementation of extended features (personalization, review)
- OAI compliance
- Support for hybrid repository / bibliography
- Technical support and active development, and
- Open source distribution under GNU license.

CDSware presents a configurable portal-like interface for hosting various kinds of collections, and features:

- A powerful search engine with Google-like syntax,
- User personalization, including document baskets and email notification alerts,
- Electronic submission and upload of various types of documents,
- Compliance with OAI data and service provider protocols, enabling metadata exchange between heterogeneous repositories, and
- Automated citation recognition and linking.

There are two basic input/output forms: batch and individual. Separately, there are three configurable modes for submission: direct (no curation or intervention), reviewed (the submission goes to a staging area for approval before posting), and peer reviewed (more complicated routing and approval).
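The three submission modes can be pictured as a small routing rule that decides where a new submission first lands. This is a sketch of the idea only and is not CDSware code: the names `SubmissionMode` and `initial_state`, and the state labels, are hypothetical.

```python
from enum import Enum


class SubmissionMode(Enum):
    """Hypothetical names for the three configurable submission modes."""

    DIRECT = "direct"                # no curation or intervention
    REVIEWED = "reviewed"            # staged for approval before posting
    PEER_REVIEWED = "peer_reviewed"  # more complicated routing and approval


def initial_state(mode):
    """Return where a new submission lands under the given mode."""
    if mode is SubmissionMode.DIRECT:
        return "posted"
    if mode is SubmissionMode.REVIEWED:
        return "staging"
    return "peer_review_queue"
```

For example, `initial_state(SubmissionMode.REVIEWED)` returns `"staging"`, mirroring the staging area described for reviewed submissions.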
The following are the details for CDSware on the design issues listed earlier:

- What things can be put in:
  o Digital objects plus metadata
  o Metadata only
  o Document-like objects
  o Event records
  o People records (and associations with organizations and research groups)
- How things get in:
  o One-by-one item deposits
  o Batch uploading from local collections
  o Goal: to also populate the collection via intelligent spidering of designated open collections/documents (ResearchIndex does this now)
- How things get out:
  o Extract metadata to bibliographic software
  o Extract metadata as XML, MARC 21, or DC
  o Single items or groups of records can be extracted
  o Personal baskets can be established and shared
- Who can put things in (or take them out):
  o Organization affiliates (tracked by personnel database)
  o Registered affiliates (voluntary deposits), associated by research collaboration, or just research interest
  o Any interested parties (extract only)
  o Review options are available to manage deposits
- Data linkages with other systems:
  o Now: personnel database at SDSC
  o For consideration in future: NSF grants database, OpenURL, Storage Resource Broker (SRB)

Local Integration (People Database)

To be successfully federated and sustained, an individual bibliographic collection must be understood within its broader context, including affiliations with grants, projects and/or organizations. Collections work is a part of a site information system as reported in the SCI2002 proceedings (Melendez and Baker, 2002), which discuss a common information management framework (CIMF) concept. Identity is defined by important relationships between collections, individuals, and projects. This means that within an organizational infrastructure, administrative definitions of staff (or people) in general relate to bibliographic record collections in particular. At SDSC, online staff accounting developed over the last decade into an account information system (AIS).
The schema, originally developed to permit tracking of computer accounts, consists of a PC client application interfaced to Oracle forms on a Solaris system. In collaboration with efforts such as the CDS @ SDSC project, the original schema has been altered to increase flexibility and to address both scaling and interoperability issues, including:

- Update to include the latest software (from Oracle Forms 4.5 to Forms 6.0) so forms can be modified in a contemporary environment;
- Re-architecture to update sequencing and thus permit multiple users to add people simultaneously, eliminating a recognized single-user input bottleneck;
- Generalization to permit adding of any people rather than only people connected with Unix computer accounts;
- Development to include call procedures, thus providing a mechanism by which new applications can access the people table; and
- Extension with tables for institution id (populated with NSF organization identifications) and a grouping (currently expressing program/department and sub_department levels but extensible to include alternate groupings such as projects).

Although there are continuing issues such as non-unique entries requiring manual intervention, this year's modifications open the door to interfacing not only with our document collection efforts but also with future SDSC projects such as the Distributed Terascale Facility (DTF) and its follow-on Extensible Terascale Facility (ETF).

Design Development Process: Iterative Feed-back

Change is a part of the development phase, so modes for handling changes have been considered. Two communication modes have been identified: 1) feedback: suggestions by CERN partners become part of CDSware source code, and 2) user-designed functions are developed as part of a local library. For instance, changes were initiated by the partners in the submit process module of the CDSware code in order to permit the CDS software to interface with locally developed people, grant and organization tables.
On the other hand, to populate these tables, new user-designed functions that extend organizational functionality are being developed to work with the non-CDS databases.

Project Development

CDS @ SDSC Begins (January 2002)

The first steps carried out toward implementation included signing a collaboration agreement/contract with the CERN technology transfer department, followed by:

- Transfer, installation, configuration, and customization of the full software package on a local server
- Loading test data at the local site, and
- Testing the software with the real data.

CDS was originally composed of a group of individual modules which could be installed on any server. The initial module delivered and installed at SDSC was the CDS Submit module. This is the interface through which the user inputs individual publications to the server. It also provides the functionality to manage itself through an administrator tool. The second module, the CDS search engine, was not installed initially because CERN re-designed the CDS schema and produced an improved version. The CERN software was installed on a Unix Solaris platform. The software has been upgraded from its original multi-module (search and submit) format through several iterations to its current integrated version, known as CDSware, with search, submit and administer capabilities. The initial strategy was to control the CDS application with a web page driver, but the design evolved throughout this year, resulting in installation of an updated CDSware software package (v0.01-pre6).

Project Current Status (July 2002)

The improved CDS software is distributed as a single install package called CDSware. The CDSware development has proceeded as follows:

- v0.01-pre6 released 6/27/2002
- v0.01-pre4 released 5/31/2002
- v0.01-pre3 released 4/29/2002
- v0.01-pre2 released 4/11/2002

In the current v0.01-pre6 software, both search and submit modules are well integrated and packaged together.
In addition to this architectural change, the CDSware release in summer 2002 signals a change in CERN’s strategy for development and support of the code: CERN has established an open implementers (users) mailing list and a separate news mailing list for those interested only in tracking CDSware development and status (see: http://cdsware.cern.ch/news). There is compile-time configuration via GNU Autoconf and WML and runtime configuration via MySQL configuration tables. The package integrates with other platform-independent services (e.g. the CDS Conversion server for file format conversions) and enables the integration of other installation-specific applications (extensibility). Note that the MySQL database is adaptable to Oracle.

Local implementation details

The project tasks performed at SDSC can be divided into four types: installation, analysis, implementation and customization.

Installation refers to copying and compiling the CDSware source code as well as creating, installing and upgrading the software environment required by CDSware (see above).

Analysis accounts for the majority of the first year software effort. The goals were twofold: 1) understanding how CDSware works and 2) providing user feedback through identification of features that would better meet the needs of SDSC and perhaps other organizations as well. Having chosen the CDS package because of its match with our outlined SDSC conceptual model, we expected that modification suggestions would be small so that changes could fit smoothly into the distributed code of CDSware.

Implementation permits local testing of code changes prior to submitting a request to CERN either to integrate changes into the next release or to create user-defined functions in CDSware.

Customization represents a major share of the remaining effort required. The addition of items such as publication types and fields is foreseen.
As a result, the CDSware administration tools, which have been used only somewhat in the past, will be utilized more extensively. This type of development for the most part does not involve source coding.

Installation

Software used by CDSware during runtime includes:

- Apache web server (1.3.26)
- MySQL database (4.0.1-alpha)
- PHP apache module (4.2.2)
- PHP command line (4.0)
- Python (2.1.1)
- MySQL-python (0.9.1)

Software used by CDSware during installation includes:

- Common Unix installation tools
  - C/C++ compiler
  - Make
  - Perl
  - Various C libraries like zlib
- WML (2.0.8)

Analysis

Identification of SDSC design requirements: The SDSC conceptual model calls for a database with information about publications and people as well as affiliated organizations and grants. The database is envisioned as supporting search by author, year, group, program and funding source.

Physical resources: Hardware includes a networked UNIX Solaris database and web server (Figure 2); software includes CDSware and the SDSC administrative PEOPLE and GROUP tables; digital data include LTER and SDSC publication collections.

Issues: Develop a mechanism within CDSware so that it can link to and query data outside the package, such as a locally defined table. Our focus is on an external PEOPLE table which would identify an author uniquely within an organizational context.

Initial approach: As shown in Figure 3, the initial plan was to control the whole application with a driver web page. The user would have made queries against the people, grant, or group table; the information returned would then have been used in a subsequent search of the publication database.

[Figure 3 (diagram not reproduced). Recoverable labels: driver web page; trigger; CERN Document System with user functions (submit, search) and administration (admin); CDS repository holding metadata about registered users, html/form fields, and report numbers; publication database; query / update and search links; collector context consisting of People table, Group table, and Grant table.]

Figure 3. Initial collector context interface with CDS System.
Implementation version: With the new CDSware available in a more flexible version, our plan changed, since it is more efficient to use CDSware directly and drop the driver web page (see Figure 4). CDSware now coordinates and controls everything except a detailed view of the grant data. The grant table is unique in that it is actually a series of related tables which store more information than is required in the publication database. This detailed view has the advantage of protecting special kinds of data through separation (Anna Gold, personal communication).

[Figure 4 (diagram not reproduced). Recoverable labels: CDSware with user functions (submit, search) and admin; CDS repository; publication database; bibwords; query / update and link connections to People table, Group table, and Grant table; query to Details (protected).]

Figure 4. Existing CDSware context interface with CDS system

Implementation

To implement the dual functionalities of data input and data display, it is necessary to ensure that each document record inserted into CDSware contains information about the unique authors, groups (organization, program, research program), and grants. To accomplish this, the user form can structure the input of data into publication records. This is in the best interest of the user, who wants to have publications searchable and visible. In addition, it has to be ensured that this kind of data can be used to select special publications and display data. Through communication with Tibor Simko from CERN, it was learned that CDSware already provides the second functionality, data display, through virtual collections, bibwords, and bibformat. Parts of these tools are now included in CDSware v0.01-pre6.

CDSware did not provide the first functionality.
A current difficulty is that one can store a variety of fields in the CDS database by simply typing them in, but it is unacceptable to make the user type in data that are encoded and therefore not immediately meaningful, such as the arbitrary numbers that administrative tools assign as people-ids. On the other hand, these data are necessary for the full working of the system. The solution is to convert user-meaningful data, like names, into system data, like people-ids. To do this we have to query a non-CDS database during the submit process, and the query depends on data that were entered at an earlier time in the same process. A change in the CDSware source code was requested for this purpose. The submit process previously worked as shown in a simplified schema in Figure 5.

[Figure 5 (diagram not reproduced). Recoverable labels: Web Form; PHP/SHTML source code; CDS Repository (HTML code); user functions; CDS Publication database.]

Figure 5. CDSware Submit Process

For the conversion, a change request was submitted to CERN as shown in Figure 6. Fortunately, this change could be implemented by changing just three lines of the distributed code. Other requirements were addressed via user-defined functions (that are called by CDSware).

[Figure 6 (diagram not reproduced). Recoverable labels: Web Form; PHP/SHTML source code; PHP-source; CDS Repository HTML; user libraries; CDS Publication database; external database(s) (i.e. PEOPLE).]

Figure 6. CDSware Submit Process Modified for Database Access

In Table 3 the implemented functions are listed along with a short description. In addition to this list, several functions delivered with CDSware were slightly modified in order to better fit SDSC needs (e.g. Create_Modify_Interface). Every programmed function was tested in a module and/or integration test.
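The name-to-identifier conversion described in this section can be sketched as a lookup against an external table during submission. The example below uses an in-memory SQLite database as a stand-in for the external PEOPLE table; the table layout, column names, sample rows, and the function `resolve_people_id` are all assumptions for illustration and are not the functions listed in Table 3 or the SDSC schema.

```python
import sqlite3


def make_people_table():
    """Create a stand-in PEOPLE table with hypothetical columns and rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE people ("
        "people_id INTEGER PRIMARY KEY, last_name TEXT, first_name TEXT)"
    )
    conn.executemany(
        "INSERT INTO people (people_id, last_name, first_name) VALUES (?, ?, ?)",
        [(101, "Doe", "Jane"), (102, "Roe", "Ann"), (103, "Doe", "John")],
    )
    return conn


def resolve_people_id(conn, last_name, first_name):
    """Return the unique people_id for a name, or None if absent or ambiguous."""
    rows = conn.execute(
        "SELECT people_id FROM people WHERE last_name = ? AND first_name LIKE ?",
        (last_name, first_name + "%"),
    ).fetchall()
    if len(rows) == 1:
        return rows[0][0]
    # Missing or non-unique entries are left for manual intervention.
    return None


conn = make_people_table()
```

With these sample rows, a full name such as ("Roe", "Ann") resolves to one identifier, while an initial such as ("Doe", "J") matches two people and returns None. Returning None for ambiguous names is a design choice in this sketch: it avoids silently attaching a publication to the wrong person.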
Customization

Using the functions above and the CDSware administration tools, the following functionality was created in CDSware and tested (Integration test 2); some items are not fully completed:

* Batch upload of bibliographic information (nearly complete)
* Submission of grants (complete)
* Submission of people (complete)
* Definition of collections (complete, but more collections expected)
* Submission of published article (metadata)
* Modification of published article (metadata) (complete)
* Submission of published article file (needs debugging)
* Definition of bibformat (CDSware functionality tested, but not completely defined)
* Definition of bibconvert (complete for published articles)

Conclusions

Initiating a two-way process in which individual citation collections are coordinated with a central repository system is a complex task requiring attention to both international standards and local practices. Work by the UCSD team (CDS @ SDSC) using CDSware in collaboration with CERN partners is building a valuable experience base with a focus on local use, developing standards, and iterative design. As a result, local project understanding of the concept of organizational informatics is deeper and broader, yet grounded in site-based information management. Accomplishments to date include forming an interdisciplinary team that assessed the available repository software choices, implemented software locally, and maintained concern for grounding in local practices while balancing management demands. Upload from test citation management files has been demonstrated, while work continues on integrating the repository database with a local personnel database in order to link people with organizational units. The importance of staying current with developments across the field (Open Archives Initiative, Open ePrints, the California Digital Library's eScholarship, MIT's DSpace) is recognized, along with the need to acquire specific hands-on practical experience.
Specific activities to enhance communications have included development of a working website for the San Diego project group as well as attention to related communities of practice such as an SDSC semantics interest group and the digital library community (Gold et al, 2002).

Next Steps: Conceptual

* Further work is needed to address integration of repository building with researcher workflow.
* Further assessment is needed regarding the centrality of people and organizations in digital libraries / repositories.
* Further work is needed to elaborate the challenges and prospects of creating a metadata grid in which participation and flow are multilateral and multidirectional.

Next Steps: Technical

* Complete demonstration of submit and upload functions from citation management software and grants database
* Populate database using both individual and batch submissions
* Demonstrate internal views of data for program administrators
* Installation of CDSware v0.0.9, released 08/01/2002
* Migration to CDSware v0.0.9, due to new MARC XML schema
* Connection to Oracle
* Migration to new "group" table in Oracle
* Definition of remaining document types; create online document submission for all document types
* Change input of function deliver_aid to be more user friendly
* Move to production environment

Continued work is needed toward understanding the requirements of digital repositories, with continued attention to accommodating current practices at all levels and enhancing participation at all stages of the research / learning process.

Acknowledgements

The conceptual and programming support of Joshua Polterock (SDSC) as well as the contributions of the CERN CDS Team (Jean-Yves Le Meur, Tibor Simko, and Thomas Baron) are acknowledged. We benefited from UCSD institutional support of SDSC (Kim Baldridge, Integrative Computational Sciences; Phil Bourne, Integrative Biosciences), UCSD/Scripps Institution of Oceanography, and the UCSD Libraries.
The work was carried out with support from NSF Grants DBI-01-11544 and OPP-96-32763.

References

Semantics Interest Group: http://pal.lternet.edu/dm/projects/semantics
Bainbridge, Paynter and Boddie, 2002 (http://link.springer.de/link/service/series/0558/bibs/2458/24580390.htm)
CERN CDS: http://cds.cern.ch
CDSware: http://cdsware.cern.ch
CDS @ SDSC: http://koshka.sdsc.edu
Open ePrints: http://www.eprints.org/
ARL white paper on institutional repositories: http://www.arl.org/ir2002.html
eScholarship: http://escholarship.cdlib.org/
Floyd, C., 1987. Outline of a paradigm change in software engineering. In Computers and Democracy, G. Bjerknes, P. Ehn, and M. Kyng, eds. Aldershot, 191-210.
Gold, A., K.S. Baker, K. Baldridge, and J.Y. Le Meur, 2002. Building FLOW: Federating Libraries On the Web. Proceedings of the 2nd Association for Computing Machinery (ACM)/IEEE-CS Joint Conference on Digital Libraries.
Gold, Anna Keller. The Role of Documents in Knowledge Management.
Melendez-Colom, E. and K.S. Baker, 2002. Common information management framework: in practice. In Proceedings of the 6th World Multi-Conference on Systemics, Cybernetics and Informatics, 14-18 July 2002, Orlando, FL. N. Callaos, J. Porter, and N. Rishe, eds. IIIS 7, 385-389.
Peters, T.A. Digital Repositories: Individual, Discipline-based, Institutional, Consortial, or National?
Spanner, D., 2001. Border Crossings: Understanding the Cultural and Informational Dilemmas of Interdisciplinary Scholars. The Journal of Academic Librarianship 27(5):352-360.
Wilde, Eric, 2003. Towards Federated Referatories. ECDL 2003 - 7th European Conference on Research and Advanced Technology for Digital Libraries (http://wildesweb.com/glossary)
Wolff and Cremers, 1999 (http://citeseer.nj.nec.com/wolff99myview.html)

Table 1.
Multiple Levels of Practices and Motivations

Category: Individuals
  Practices:
    * Notebooks, articles, office files
    * Mail, email, in-person: circulate preprints by mail, email
    * Personal web pages (multi-format links)
    * Personal databases (e.g. flat files, citation managers: can extract from, download and import to)
    * Deposit to/extract from disciplinary repositories (e.g. arXiv)
  Motivations:
    * Track citation counts for reporting (maintain lists of peer-reviewed publications)
    * Manage knowledge for easy retrieval and discovery
    * Exchange with key colleagues
    * Participate in building shared knowledge

Category: Research Groups
  Practices:
    * Internal databases (shared)
    * Web sites with lists
  Motivations:
    * Manage knowledge
    * Track output (for funding agencies)
    * Track impact (greater exposure leads to greater impact)
    * Discovery

Category: Institutions
  Practices:
    * Publish (e.g. tech reports, conf. proceedings, journals)
    * Create internal databases
    * Establish repositories
    * Establish libraries
    * Hybrid library/repositories
  Motivations:
    * Sharing, discovery, and reputation
    * Management and reporting (including accountability to funding agencies)
    * Archiving

Category: Disciplines
  Practices:
    * Professional society databases, portals
    * Establish disciplinary repositories (may be distributed & federated or centralized, e.g. NCSTRL, arXiv)
  Motivations:
    * Sharing
    * Discovery

Table 2. Document System Comparisons

Systems compared: openEprints, CDSware, Reference Web Poster, Library catalogs.

1. How things get in
  openEprints:
    * Deposit by registered / authorized people
  CDSware:
    * Deposit by registered / authorized people
    * Upload from structured file
  Reference Web Poster:
    * Upload by administrator from one or more private citation libraries
    * Additions to citation library can be batch-extracted from commercial sources
    * Additions may also be individual entries by private library manager
  Library catalogs:
    * FTP of batch files consisting of individual entries or single record copies from bibliographic utilities

2. How things get out
  openEprints:
    * OAI metadata harvesting protocol
  CDSware:
    * OAI metadata harvesting protocol
    * Marked records can be downloaded singly or collectively
    * Personal "baskets" can be made, shared
    * Record output in XML, HTML, MARC, DC record formats
    * CERN applications support file format conversions
  Reference Web Poster:
    * Marked records may be downloaded to citation management software (Z39.80)
  Library catalogs:
    * Marked records may be extracted in printable or downloadable formats, e.g. to citation management software

3. Who can put things in
  openEprints:
    * Configurable: registered or authorized people; may include researcher direct deposits, or be configured to "flow" deposits through administrators
  CDSware:
    * Configurable: may be registered or authorized people: researchers in or outside the institution; may be linked to institutional ID
  Reference Web Poster:
    * Administrator with access to server and commercial software
  Library catalogs:
    * Specially trained and / or authorized staff, usually in libraries, using locally configured catalog software

4. What things can be put in
  openEprints:
    * Focus on preprints, working papers (full text)
    * Other uses: conference proceedings (CalTech); other monographs
  CDSware:
    * Configurable; current support for documents with metadata or metadata alone
    * CERN configured for preprints, commercial articles, books, photos, presentations, etc.
    * Developing "people" records
  Reference Web Poster:
    * Articles and conference proceedings are focus
  Library catalogs:
    * Monographic works and entire journals are primary focus

5. What linkages to other systems
  openEprints:
    * OAI supports cross-repository searching
  CDSware:
    * OAI supports cross-repository searching
    * Linkages created to local applications and databases, e.g. personnel database
    * Upload from citation management software OK
  Reference Web Poster:
    * Primarily to commercial article databases for which citation management download filters have been written
  Library catalogs:
    * Via Z39.50, federated search of other library catalogs; extraction and deposit to parallel collective catalogs (OCLC, union catalogs)

6. What protocols, standards followed
  openEprints:
    * DC
    * OAI-PMH
    * Crosswalks from other metadata formats
  CDSware:
    * OAI-PMH
    * DC
    * MARC21
    * Z39.80 (in dev.)
  Reference Web Poster:
    * Z39.80
    * MARC
  Library catalogs:
    * MARC
    * Z39.50
    * Z39.80

Table 3. SDSC User Defined Functions

Function: Description
  InsPid / InsPid2: Program written in Perl to insert people-ids and email addresses into an EndNote tagged file.
  look_for_author: Migration of InsPid from Perl to PHP.
  deliver_aid: Reads a name from the input (web form) and generates a select list on the web page with people-id and email address.
  deliver_pid: Variation of deliver_aid (for PIs).
  separate_names: Gets a name string and separates last name and initials; called by deliver_aid.
  check_first_name: Compares initials with first names found in the table; called by deliver_aid.
  Add_person: Reads the description from the given database tables, reads the input fields from $STORAGE, and uploads the data into the table; used to upload people data into the people table.
  double_check_person: Reads author names from $STORAGE and looks for matches in the author table.
  NewAuthor1Check: JS function that checks whether the user wants to add a new author.
  DatCheckUS: JS function that checks US date format (mm/dd/yyyy).
  get_next_id: Reads the description from the given database tables, determines the key fields, and writes the next available key into a file in the $STORAGE directory with the same name as the table column.
  change_to_mysql_fmt: Reads dates from a file, translates them from mm/dd/yyyy into yyyy-mm-dd or vice versa, and writes them back to the file.
  SelectGrant: Selects grant data from the tables where the author is PI or CoPI and displays a selection list to the user.
  SelectResearchPrg: Selects the research program from the group table and displays a selection list (needs to be modified due to table change).
  add_grant: Reads the input fields from $STORAGE and uploads the grant data into the GRANTS and GRANT_XREF tables.
  splitGrantPI: Reads the input file from $STORAGE and splits it into two files, which are then uploaded into the GRANTS table by another function.
  JoinBibFiles: Written to run CDS submit 3.0 with CDSware v0.01-pre4; deprecated since CDSware v0.01-pre6.
  CallBibConvert: Written to run CDS submit 3.0 with CDSware v0.01-pre4; deprecated since CDSware v0.01-pre6.
  CallBibFormat: Written to run CDS submit 3.0 with CDSware v0.01-pre4; deprecated since CDSware v0.01-pre6.
  CallBibUpload: Written to run CDS submit 3.0 with CDSware v0.01-pre4; deprecated since CDSware v0.01-pre6.
  Testfunction: Driver function written to test functions in command-line mode without invoking Apache or CDSware.
  XML2EndNote: Test program to convert CDSware search results into EndNote import files (for later use).
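The date handling that Table 3 attributes to DatCheckUS and change_to_mysql_fmt can be illustrated with a short sketch. It is written in Python rather than the JavaScript and PHP of the SDSC functions, it works on strings instead of files in $STORAGE, and the function names is_us_date, us_to_iso, and iso_to_us are hypothetical names chosen for this example only.

```python
from datetime import datetime

US_FORMAT = "%m/%d/%Y"   # mm/dd/yyyy, as typed into the web form
ISO_FORMAT = "%Y-%m-%d"  # yyyy-mm-dd, as stored by MySQL DATE columns


def is_us_date(text):
    """Return True if text is a valid calendar date in mm/dd/yyyy form."""
    try:
        datetime.strptime(text, US_FORMAT)
        return True
    except ValueError:
        return False


def us_to_iso(text):
    """Translate mm/dd/yyyy into yyyy-mm-dd; raises ValueError on bad input."""
    return datetime.strptime(text, US_FORMAT).strftime(ISO_FORMAT)


def iso_to_us(text):
    """Translate yyyy-mm-dd into mm/dd/yyyy; raises ValueError on bad input."""
    return datetime.strptime(text, ISO_FORMAT).strftime(US_FORMAT)


if __name__ == "__main__":
    print(us_to_iso("08/01/2002"))
    print(iso_to_us("2002-08-01"))
```

Parsing through a date object instead of rearranging substrings means impossible dates such as month 13 are rejected during conversion, so the check and the translation cannot disagree.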