The CERN Document Server at the San Diego Supercomputer Center

Collection Repository: Personal to Organizational
Introduction to the CERN Document Server
at the San Diego Supercomputer Center (CDS @ SDSC)
Karen S. Baker, Anna Gold, Frank Sudholt
Abstract
An individual, a project, and an organization, whether an institution, a network or a discipline,
have collection management needs in common. In an academic arena, a collection is often of
bibliographic citations but may focus also on the documents themselves as well as on photographs,
videos, course materials and/or artifacts. The multiple levels from personal to organizational
(project, department, campus or network of campuses) represent nested tiers across which
information can flow when technology and metadata standards are partnered to provide an
accessible, interoperable digital framework for distributed collection management.
We seek a generalized tool that can be easily adapted to the multi-level needs for a full range of
repository activities: gathering, sharing, and discovering materials. The gathering together of
objects into different collections may involve unique or re-used objects. An object is repurposed
through use in multiple collections where a collection may be viewed as a type of intellectual
capital (Greenstein, 200x). A collection conveys information by the selections and omissions; it
shares a view or piece of knowledge about a subject or an organization. This work is driven by
the recognition of the power of a collection to present information about individual entries and to
convey insights resulting from an assembled whole.
Introduction
This report provides an overview in theory and in practice of a project seeking to define a
document repository useful at multiple levels. The initial focus is on bibliographic materials and
system requirements to support coordinated repository efforts. Beyond theory, a field report and
local observations are presented in addition to a consideration of next steps. Progress in creating a
prototype repository at the San Diego Supercomputer Center using CERN CDSware is detailed
including the reconciliation of divergent practices and motivations of target repository
participants.
We build from the assumptions that information flow is enhanced through use of the web as a
common interface, cooperation is facilitated through a central repository, and the user is to be
respected through coordination with attention to local practice. The project goals that focus
project activity include: 1) to create a repository supporting two-way information flow (ingest and
export) involving exchange between an individual’s collection system and a centralized collection
system through partnering of existing technology and metadata; 2) to develop a specific working
environment while staying informed about compatibility with alternative developments in the
metadata and archives arenas; 3) to consider information about collections in a broader context
with reference to collectors and their organizational ties; 4) to employ an iterative prototyping
process as a development environment in order to optimize product usefulness given existing
practice.
With so many metadata and digital library efforts, one is prompted to ask “why another
repository?” The response arises from recognizing the value in considering a diversity of
approaches given the complexity involved in first establishing and then maintaining a repository.
A range of important differences occur when considering repository design issues such as:
• How things get in
• How things get out
• Who can put things in (and take out)
• What things can be put in
• What linkages they have to other systems
• What protocols/standards they follow
There are multiple, differing roles and motivations for participation by individuals, groups,
networks, institutions, disciplines (Table 1, Appendix). The process observes protocols that enable
federation and interconnectivity. The process also allows personal or institutional differentiation
or “streams” (both into and out of the federated processes).
Repository success depends on a good match between technical and social design. Both technical
and social issues are complex. Good social design remains an unsolved research challenge as the
cultural and management aspects of repository building are emerging as areas of investigation
(ARL, 2002). In addition to the multiple levels of organization, it is important to take into
consideration the divergence ingrained in more-or-less well-functioning workflows and practice.
We recognize and are committed to research and discovery as a process, not a product (Floyd,
1987) and to the participation of the multiple voices from scientist to information manager to
system designer.
Background
Discussions on digital repositories are ongoing (Peters, 2002). Visions often focus on the
particular, i.e. individual, project, organizational, national or international levels. Repository
models are often driven by issues of publishing and/or identity. Our own work is stimulated by
the notion of a single collection as a representation of an ongoing instance of learning and of a
diversity of collections as a presentation of multiple understandings, all important to preserve and
to connect without limitations. Further, we focus on the flow between these levels recognizing the
importance of identity at all levels along with the critical need to engage the individual participant
in the process of information gathering.
The growing work in cross-domain research prompts a refocus to the empirical understanding of
the interdisciplinary research environment itself (Spanner, 2001) where the communications and
infrastructures differ even between established and fringe interdisciplinary studies. In both,
however, high priority is given to informal communications. The personal repository may be
unique in providing a mechanism to extend informal communications if collection-making tools
are available. Collections may provide a diversity of individual views into a discipline that when
shared serve as informal guides and communications for associates. They also serve as an
important outlet similar to that provided by contribution to volunteer open-source efforts where an
individual is motivated to contribute by the ability to create according to a personal vision of a
shared product.
Knowledge management and information flow become critical issues the moment a collection of
objects is assembled. An individual insight is instantiated through selection and classification.
Tools that facilitate aggregation create a knowledge heterogeneity that informs at multiple levels
and from multiple views. Individual collections are a first step in a learning process (research
cycle) involving document diversity, information flow, and knowledge heterogeneity. Our view
emphasizes the 3R’s (research, relationships and reflexivity) with a focus on both infrastructure
(documents, content, work practices) and cyber infrastructure (tools, methods, best practices). As
we situate our position within the world of repositories, archives and libraries, we start with a
document oriented view and extend it through integration of related information such as
administrative tables, alternative media such as photographs, and associated services such as
organizational metrics and report displays. We seek to create a process of learning and informing
for participants by providing mechanisms that enable an individual’s work. The project name
FLOW is a purposeful metaphor calling to mind the uncounted rivulets shaped by the local
landscape that join to a river of information contributing to heterogeneous pools of knowledge.
Linking individuals and organizations to documents and collections contributes to identity, but
technical, conceptual and social problems arise. Technical hurdles include resource support, open
design, and implementation strategies. Social barriers include the need for a critical mass of
participation: acceptance of the system depends on an activation energy, since people must
be motivated to take time to make time. Conceptual difficulties involve articulating associations
between individuals, materials and organizations in order to capture the relationships. When
considering the federation of collections, complexity is also introduced by the multiple levels of
relationships and by reclassification requirements that change as learning happens.
Benefits of a multiple level, multi-participant approach include:
• new tools to facilitate information gathering and reporting at multiple levels of organization
• knowledge generation enabling participation from multiple individuals and groups
• data/information reuse providing accessibility for multiple use
• project definition through identity enhancement and multiple reflections
The project demonstrates that short-term/local approaches are not only compatible with long-term
federation strategies but also critical to initiating information flow, contributing to knowledge
diversity, and ensuring a continuing feed-back process.
Our design approach is to start small, design grounded in the local particular with an eye to
federation, and implement at the organizational level. Critical system design elements include:
• centralization by organization of individual, project, institution or partnership;
• federation through a common theme and across multiple locations;
• openness with interoperability through protocols such as OAI.
The approach is a reflexive process where we act locally, think globally, and then reflect and react.
Reacting reinitiates the process of customized local actions, of experiments and experiences,
thereby enabling learning and change.
In Partnership
This project, supporting partnering of the Long-Term Ecological Research (LTER) Palmer site
information manager with participants from the San Diego Supercomputer Center (SDSC) and the
University of California San Diego (UCSD) Libraries, creates a larger view of organizational
infrastructure. The partnership role is to:
• Empower partners with broader vision
• Bridge individual, organizational, & national needs
• Define and meet current needs
• Provide arena for cross-domain work
• Bridge from present to future needs
• Anticipate change, optimizing for sustainability, and
• Work from bottom up toward repository system.
This team identified the European Organization for Nuclear Research (CERN) document system
(CDS) as a powerful prototyping software package and installed the software locally at SDSC.
Communications with the CERN software developers were initiated and a technology transfer
agreement established between SDSC and CERN. This collaboration (Figure 1) brings the active
involvement of the CDS development team with the UCSD CDS @ SDSC prototype team.
To present a larger context, Figure 1 shows four tiers: the SDSC computational environment,
the Semantics Group project, and the broader community of digital repositories and services,
which is divided into two tiers, those compliant with the Open Archives Initiative (OAI) and
those not compliant with OAI. Embedded within the SDSC computational environment is the
CDSware software, whose storage could potentially be interfaced with the facility's Storage
Resource Broker (SRB). The SRB does not address ingestion or workflow; it is a
logical name space (rather than a database) for managing the storage of the data rather than the
data itself.
During these times of rapid technological change and social transition, there is good purpose to
multiple approaches both as complementary, synergistic investigations and as training grounds for
all participants. With the ability to preserve more information in the form of collections, new
needs will arise, requiring more new services and tools than any one group can provide.
Our semantics group partnership goals may be summarized as follows:
• Create personal citation libraries
• Create input & retrieval tools for users and administrators
• Assure discovery and output capabilities
• Design for interoperability through identification of national categories and use of national
standards
• Design for local needs (define local categories)
  o Term lists (controlled vocabulary)
  o Keyword fields, i.e. by discipline, theme, grant support, bibliography
• Design for scalability through mapping (cross-walking)
• Consider sustainability of process
Existing participant approaches to bibliographic materials include:
• LTER: Using EndNote PC platform software to gather structured bibliographic data from
disparate sites. Using proprietary software as means to share data for discovery and
networking.
• SDSC: Compiling bibliographic records and citation counts, in order to testify to research
impact of funding for projects such as the National Partnership for Advanced
Computational Infrastructure (NPACI). Gathering is manual, with no means to share or
enable discovery.
• UCSD Departments: Maintaining large EndNote files of references on research on a
dedicated workstation. Lacks capability to share, interactively submit, or discover by
partners overseas.
[Figure 1 diagram. Recoverable labels: digital repository communities, OAI and non-OAI
(Reference Web Poster, EndNote, library catalogues, ARL/SPARC, CDL/eScholarship, Eprints,
MIT/DSpace, OpenEprints, ARC, bepress); Semantics Project (UCSD Library, CERN, LTER, UCSD and
SDSC in dialogue around CDS@SDSC); SDSC Computational Environment (Unix Solaris, ~60 hosts;
CDSware development and production hosts with database and OAI web server; High Performance
Storage System (HPSS); SAN; temporary/permanent storage; Storage Resource Broker (SRB);
network environment 10Mb-100Mb (Ethernet)-10Gb; ASC: Application Service Computing;
HPC: High Performance Computing).]
Figure 1. CDS @ SDSC Computational and Partnership Environments
Federated Library on the Web (FLOW)
A stream metaphor (Figure 2a) is invoked in the conceptual schema, in contrast to the original
notion for bibliographic citation entries (Figure 2b). Ease of input and output is critical to both
models. The concept of a Federated Library on the Web, FLOW, develops from understanding the
distinctive document management tools and practices used within each layer (individual, group,
center, network, discipline) and from recognizing that these layers represent boundaries across
which information could flow openly if technology and metadata could provide an enabling digital
framework (“metadata grid”).
Figure 2a: FLOW: Federating Libraries on the Web.
The immediate need for such a model was to gather data from research users through web forms
for input of information about publications, research grants, individuals and organizational
context. A dual-mode function is desired for gathering entries in batches (EndNote, Web of
Science searches) and one by one (individual submissions). These data can then be shared as
collections of information and discovered and retrieved in multiple ways from a
relational database. This approach, with a central repository, creates public exposure for data to
enhance its impact.
[Figure 2b diagram. Recoverable labels: INGEST (gather from web and desktop; controlled
vocabulary; filter; standardize; QA/QC by a “Keeper-of-the-Keys”); STORAGE (EndNote library
*.enl, term lists, filter *.enf, bibliography file *.txt, style *.ens); DISPLAY (public viewer,
as formatted output; individual user, as *.txt or *.enl file); TRANSFER (XML with DTD for
application packages; Oracle import for the SDSC infrastructure interface).
Sample tagged record shown in the figure:
%0 Report
%A Baker, K S
%T Partnership Science
%D 2001
%@ UCSD-SIO 01-03
%2 M1037
%4 SDSC]
Figure 2b. Creating an organizational bibliography: Structured, Parsable, Loosely
Federated, MultiLevel Approach
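The tagged record shown in Figure 2b suggests how batch entries exported from EndNote could be read into structured records before loading. The following is a minimal sketch under stated assumptions, not the import filter used in the project: the function name `parse_tagged` and the `TAG_NAMES` lookup are hypothetical, only a few common tags are given readable names, and all other tags (such as `%@`, `%2` and `%4`) are kept under their raw tag.

```python
from collections import defaultdict

# Sample record in the tagged export format, copied from Figure 2b.
SAMPLE = """\
%0 Report
%A Baker, K S
%T Partnership Science
%D 2001
%@ UCSD-SIO 01-03
%2 M1037
%4 SDSC
"""

# Readable names for a few tags (illustrative; any other tag keeps its raw form).
TAG_NAMES = {"0": "type", "A": "author", "T": "title", "D": "year"}


def parse_tagged(text):
    """Split tagged text into records; each record maps a field to a list of values."""
    records, current = [], defaultdict(list)
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            # A blank line closes the current record.
            if current:
                records.append(dict(current))
                current = defaultdict(list)
            continue
        if line.startswith("%") and len(line) >= 2:
            if line[1] == "0" and current:
                # "%0" opens a new record even without a blank separator.
                records.append(dict(current))
                current = defaultdict(list)
            tag, value = line[1], line[2:].strip()
            current[TAG_NAMES.get(tag, "%" + tag)].append(value)
    if current:
        records.append(dict(current))
    return records


records = parse_tagged(SAMPLE)
print(records[0]["title"], records[0]["author"])
```

Values are stored as lists because a field such as author may repeat within one record.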
Functional criteria call for a system supporting the ability to:
• Modify and update submissions
• Provide full text via local or remote files
• Search by fields or full text
• Support citation counting
• Handle various media and data types
• Provide for a review process, and
• Offer customization (personalization) and alert services.
and implementation criteria for the system include:
• Standards-based
• Open source
• Flexible
• Fast
• Support search within or across collections.
Technical Issues
Design Decisions
The following protocols and standards were considered:
• OAI Protocol for Metadata Harvesting (OAI-PMH)
• MARC 21
• Z39.80 (article databases, bibliographic software)
• Dublin Core
Additional design considerations included:
• Representing both people and digital objects in the system:
  o “creators” are considered both authors and people
  o integration with the personnel database was needed in order to enable organization
    views such as “all the people associated with XYZ research group”
• Incorporating records for non-document objects (events as well as groups, people, grants)
• Allowing a hybrid system of metadata with or without associated digital objects
• Planning for end-user upload from EndNote or similar commercial citation management
software, and
• Creating genre-based views for public, and organization views for internal institutional
purposes.
CERN Document System Background
Several software options existed, such as OpenEprints software and the CERN Document
System (CDS, later CDSware). A comparison table of active potential repository systems is given
in Table 2 (see appendix). The California Digital Library’s bepress software is still under
development. The CERN Document System was found compatible with open eprints initiatives in
research communities (OAI and OAI-PMH) but independent of OpenEprints / OAI priorities.
Running at CERN, the CERN Document System (CDS: http://www.cern.ch), revised and released
in July 2002 as CDSware, is a program that allows the user to:
• Search a scientific publication database
• Submit objects into the database (metadata and document files)
• Personalize the user account (predefined searches, publication baskets etc.)
The public interface is the World Wide Web. The current CERN implementation of CDSware
(http://cdsware.cern.ch) manages over 350 collections of data, consisting of over 550,000
bibliographic records, including 220,000 full-text documents: preprints, articles, books, journals,
photographs. The capabilities of the CDSware system include:
• Batch uploading of bibliographic citations
• Submitting and modifying individual submissions
• Support for differentiated collections or “catalogues” that can be searched separately or
together, and
• Support for implementing other modules, including: personalization, alert services,
output and file format options.
The CERN Document System (CDS) was identified as available for rapid deployment and
interactive feedback under the project name CDS @ SDSC. CDSware strengths were an active,
ongoing development staff; compliance and evolution with existing international standards; and
responsiveness to a user community during the development stage. Support has been available
from the CERN technical staff for installation, configuration and modification of the CDS system,
which combines a CERN-developed front end with off-the-shelf open source back-end software. System
experts helped to configure CDSware, working with our local development team leader who set up
the server, updated supporting software (WML, C compiler, make, Perl and zlib), and ran the
basic installations (MySQL, PHP, Apache and Python). The enhancement of import filters,
development of export filters, and population of the CDS @ SDSC system will continue through
the second year of this grant.
The CDSware software has the advantages of:
• Proven institutional implementation at CERN
• Full implementation of extended features (personalization, review)
• OAI compliance
• Support for hybrid repository / bibliography
• Technical support and active development, and
• Open source distribution under GNU license.
CDSware presents a configurable portal-like interface for hosting various kinds of collections, and
features:
• A powerful search engine with Google-like syntax;
• User personalization, including document baskets and email notification alerts;
• Electronic submission and upload of various types of documents;
• Compliance with OAI data and service provider protocols, enabling metadata exchange
between heterogeneous repositories; and
• Automated citation recognition and linking.
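To illustrate what OAI compliance means for metadata exchange, the sketch below builds an OAI-PMH `ListRecords` request and reads Dublin Core fields out of a response. It assumes OAI-PMH version 2.0 namespaces; the base URL, the identifier and the trimmed sample response are hypothetical stand-ins (a real response carries further elements such as `responseDate` and `datestamp`), and nothing here is taken from the CDSware code.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces defined by OAI-PMH 2.0 and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for a repository's OAI base URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Trimmed, hypothetical response used in place of a live harvest.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Partnership Science</dc:title>
          <dc:creator>Baker, K S</dc:creator>
          <dc:date>2001</dc:date>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text):
    """Return identifier, titles and creators for each harvested record."""
    root = ET.fromstring(xml_text)
    harvested = []
    for rec in root.iterfind(".//oai:record", NS):
        harvested.append({
            "identifier": rec.findtext("oai:header/oai:identifier",
                                       default="", namespaces=NS),
            "title": [e.text for e in rec.iterfind(".//dc:title", NS)],
            "creator": [e.text for e in rec.iterfind(".//dc:creator", NS)],
        })
    return harvested


url = list_records_url("http://repository.example.org/oai")
harvested = parse_records(SAMPLE_RESPONSE)
print(url)
print(harvested)
```

Because every compliant repository answers the same request form, a harvester needs only the base URL of each repository to gather records from all of them.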
There are two basic input/output forms: batch and individual. Separately, there are
configurable modes for submission: direct (no curation or intervention), reviewed
(submission goes to a staging area for approval before posting), and peer reviewed (more
complicated routing and approval).
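The three submission modes can be pictured as workflows of different lengths. The sketch below is only an illustration of that idea: the status names (`staged`, `refereed`, `posted`) and the `Submission` class are assumptions introduced here, not CDSware's configuration or code.

```python
from enum import Enum


class Mode(Enum):
    DIRECT = "direct"
    REVIEWED = "reviewed"
    PEER_REVIEWED = "peer_reviewed"


# Hypothetical status sequences: each approval moves a submission one step on.
WORKFLOWS = {
    Mode.DIRECT: ["posted"],                             # no curation
    Mode.REVIEWED: ["staged", "posted"],                 # one approval
    Mode.PEER_REVIEWED: ["staged", "refereed", "posted"],  # longer routing
}


class Submission:
    def __init__(self, title, mode):
        self.title = title
        self.steps = WORKFLOWS[mode]
        self.position = 0

    @property
    def status(self):
        return self.steps[self.position]

    def approve(self):
        """Advance one step; a posted submission has nothing left to approve."""
        if self.status == "posted":
            raise ValueError("submission is already posted")
        self.position += 1
        return self.status


direct = Submission("Partnership Science", Mode.DIRECT)
reviewed = Submission("Partnership Science", Mode.REVIEWED)
print(direct.status, reviewed.status)
```

Keeping the workflow as data rather than code mirrors the point that the mode is a configuration choice per collection rather than a property of the software.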
Following are the CDSware details for the design issues listed earlier:
What things can be put in:
o Digital objects plus metadata
o Metadata only
o Document-like objects
o Event records
o People records (and associations with organizations and research groups)
How things get in:
o One-by-one item deposits
o Batch uploading from local collections
o Goal: to also populate the collection via intelligent spidering of designated open
collections/documents (ResearchIndex does this now)
How things get out:
o Extract metadata to bibliographic software
o Extract metadata as XML, MARC 21, or DC
o Single items or groups of records can be extracted
o Personal baskets can be established and shared
Who can put things in (or take them out)
o Organization affiliates (tracked by personnel database)
o Registered affiliates (voluntary deposits), associated by research collaboration, or just
research interest
o Any interested parties (extract only)
o Review options are available to manage deposits
Data linkages with other systems:
o Now: personnel database at SDSC
o For consideration in future:
   - NSF grants database
   - Open URL
   - Storage Resource Broker (SRB)
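As one illustration of how things get out, the sketch below maps a locally structured record onto Dublin Core elements and serializes it as XML. The local field names and the `CROSSWALK` table are assumptions made for this example; they are not the field definitions or export filters of CDSware.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
ET.register_namespace("dc", DC)
ET.register_namespace("oai_dc", OAI_DC)

# Illustrative crosswalk from local field names to Dublin Core element names.
CROSSWALK = {
    "author": "creator",
    "title": "title",
    "year": "date",
    "type": "type",
    "report_number": "identifier",
}


def to_dublin_core(record):
    """Serialize a record (field -> list of values) as an oai_dc XML string."""
    root = ET.Element("{%s}dc" % OAI_DC)
    for field, element in CROSSWALK.items():
        for value in record.get(field, []):
            child = ET.SubElement(root, "{%s}%s" % (DC, element))
            child.text = value
    return ET.tostring(root, encoding="unicode")


record = {
    "type": ["Report"],
    "author": ["Baker, K S"],
    "title": ["Partnership Science"],
    "year": ["2001"],
}
xml_text = to_dublin_core(record)
print(xml_text)
```

Fields without an entry in the crosswalk are simply left out, which is the usual cost of mapping a richer local schema onto the smaller Dublin Core set.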
Local Integration (People Database)
To be successfully federated and sustained, an individual bibliographic collection must be
understood within its broader context including affiliations with grants, projects and/or
organizations. Collections work is a part of a site information system as reported in the SCI2002
proceedings (Melendez and Baker, 2002) with discussion of a common information management
framework (CIMF) concept. Identity is defined by important relationships between collections,
individuals, and projects. This means that within an organizational infrastructure, administrative
definitions of staff (or people) in general relate to bibliographic record collections in particular.
At SDSC online staff accounting developed over the last decade into an account information
system (AIS). The schema, originally developed to permit tracking of computer accounts, consists
of a PC client application interfaced to Oracle forms on a Solaris system. In collaboration with
efforts such as the CDS@SDSC project, the original schema has been altered to increase
flexibility and to address both scaling and interoperability issues including:
• Update to include the latest software (from Oracle Forms 4.5 to Forms 6.0) so forms can be
modified in a contemporary environment;
• Re-architecture to update sequencing and thus permit multiple users to add people
simultaneously, eliminating a recognized single-user input bottleneck;
• Generalization to permit adding of any people rather than only people connected with Unix
computer accounts;
• Development to include call procedures, thus providing a mechanism such that new
applications can access the people table; and
• Extension with tables for institution id (populated with NSF organization identifications)
and a grouping (currently expressing program/department and sub_department levels but
extensible to include alternate groupings such as projects).
Although there are continuing issues such as non-unique entries requiring manual intervention,
this year's modifications open the door to interface with not only our documents collections efforts
but with future SDSC projects such as the distributed teraflop facility (DTF) and its follow-on
extended teraflop facility (ETF).
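The relationships described above (people, a grouping with department and sub_department levels, and an institution table) can be sketched as a small relational schema. The example uses SQLite from Python purely for illustration; the production AIS runs on Oracle, and every table name, column name and data value below is a hypothetical stand-in rather than the actual SDSC schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE institution (
    institution_id INTEGER PRIMARY KEY,   -- stand-in for an NSF organization id
    name           TEXT NOT NULL
);
CREATE TABLE org_group (
    group_id       INTEGER PRIMARY KEY,
    institution_id INTEGER REFERENCES institution(institution_id),
    department     TEXT NOT NULL,
    sub_department TEXT
);
CREATE TABLE people (
    -- Database-generated ids let several users add people at the same time.
    people_id  INTEGER PRIMARY KEY AUTOINCREMENT,
    last_name  TEXT NOT NULL,
    first_name TEXT NOT NULL,
    group_id   INTEGER REFERENCES org_group(group_id)
);
"""


def add_person(conn, last_name, first_name, group_id=None):
    """Insert a person and return the generated people_id."""
    cur = conn.execute(
        "INSERT INTO people (last_name, first_name, group_id) VALUES (?, ?, ?)",
        (last_name, first_name, group_id),
    )
    return cur.lastrowid


def people_in_group(conn, department):
    """Organization view: all the people associated with one group."""
    rows = conn.execute(
        "SELECT p.last_name, p.first_name FROM people p "
        "JOIN org_group g ON g.group_id = p.group_id "
        "WHERE g.department = ? ORDER BY p.last_name, p.first_name",
        (department,),
    )
    return [tuple(row) for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO institution VALUES (1, 'Example University')")
conn.execute("INSERT INTO org_group VALUES (10, 1, 'XYZ research group', NULL)")
first_id = add_person(conn, "Doe", "Jane", 10)
second_id = add_person(conn, "Roe", "Richard", 10)
members = people_in_group(conn, "XYZ research group")
print(members)
```

The two functions play the role that call procedures play in the text: other applications reach the people table through them rather than through the forms interface.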
Design Development Process: Iterative Feed-back
Change is a part of the development phase, so modes for handling changes have been considered.
Two communication modes have been identified: 1) feedback, in which suggestions from CERN's
partners become part of the CDSware source code, and 2) user-designed functions, which are developed as part of a
local library. For instance, changes were initiated by the partners in the CDSware code in the
submit process module in order to permit the CDS software to interface with locally developed
people, grant and organization tables. On the other hand, to populate these tables, new user-designed functions to extend organizational functionality are being developed to work with the
non-CDS databases.
Project Development
CDS @ SDSC Begins (January 2002)
The first steps carried out toward implementation included signing a collaboration
agreement/contract with the CERN technology transfer department, followed by:
• Transfer, installation, configuration, and customization of the full software package to a
local server
• Loading test data at the local site, and
• Testing the software with the real data.
CDS was originally composed of a group of individual modules which could be installed on any
server. The initial module delivered and installed at SDSC was the CDS Submit module. This is
the interface for the user to input individual publications to the server. It also provided the
functionality to manage itself through an administrator tool. The second module, the CDS search
engine, was not installed initially because CERN re-designed the CDS schema and produced an
improved version.
The CERN software was installed on a Unix Solaris platform. The software has been upgraded
from its original multi-module (search and submit) format through several iterations to its current
integrated contemporary version known as CDSware with search, submit and administer
capabilities. The initial strategy was to control the CDS application with a web page driver, but
the design evolved throughout this year, resulting in the installation of an updated CDSware
software package (v0.01-pre6).
Project Current Status (July 2002)
The improved CDS software is distributed as a single install package called CDSware. The
CDSware development has proceeded as follows:
v0.01-pre6 released 6/27/2002
v0.01-pre4 released 5/31/2002
v0.01-pre3 released 4/29/2002
v0.01-pre2 released 4/11/2002
In the current v0.01-pre6 software, both search and submit modules are well integrated and
packaged together. In addition to this architectural change, the CDSware release in summer 2002
signals a change in CERN’s strategy for development and support of the code, by establishing an
open implementers (users) mailing list, and a separate news mailing list for those interested only
in tracking CDSware development (see: http://cdsware.cern.ch/news) and information on
CDSware status. Compile-time configuration is handled via GNU Autoconf and WML, and runtime
configuration via MySQL configuration tables. The package integrates with other platform
independent services (e.g. the CDS Conversion server for file format conversions) and enables
the integration of other installation-specific applications (extensibility). Note that the MySQL
database is adaptable to Oracle.
Local implementation details
The project tasks performed at SDSC can be divided into four types: installation, analysis,
implementation and customization.
• Installation refers to copying and compiling the CDSware source code as well as creating,
installing and upgrading the software environment required by CDSware (see above).
• Analysis accounts for the majority of the first year software effort. The goals were
twofold: 1) understanding how CDSware works and 2) providing user feedback through
identification of features that would better meet the needs of SDSC and perhaps other
organizations as well. Having chosen the CDS package because of its match with our
outlined SDSC conceptual model, the expectation was that modification suggestions would
be small so changes could fit smoothly into the distributed code of CDSware.
• Implementation permits local testing of code changes prior to submitting a request to
CERN either to integrate changes into the next release or to create user defined functions
in CDSware.
• Customization represents a major share of the remaining effort required. The addition of
items such as publication types and fields is foreseen. As a result, the CDSware
administration tools, which have been used somewhat in the past, will be utilized more
extensively. This type of development does not include source coding for the most part.
Installation
Software used by CDSware during runtime includes:
• Apache web server (1.3.26)
• MySQL database (4.0.1-alpha)
• PHP Apache module (4.2.2)
• PHP command line (4.0)
• Python (2.1.1)
• MySQL-python (0.9.1)
Software used by CDSware during installation includes:
• Common Unix installation tools
  - C/C++ compiler
  - Make
  - Perl
  - Various C libraries like zlib
• WML (2.0.8)
Analysis
Identification of SDSC design requirements: The SDSC conceptual model calls for a database
with information about publications and people as well as affiliated organizations and grants. The
database is envisioned supporting search by author, year, group, program and funding source.
Physical resources: Hardware includes a networked UNIX Solaris database and web server
(Figure 2); software includes CDSware and the SDSC administrative PEOPLE and GROUP
tables; digital data include LTER and SDSC publication collections.
Issues: Develop a mechanism within CDSware so that it can link to and query resources outside
the package, such as a locally defined table. Our focus is on an external PEOPLE table which
would identify an author uniquely within an organizational context.
Initial approach: As shown in Figure 3, the initial plan was to control the whole application with
a driver web page. The user would have made queries against the people, grant, or group table;
information returned would be used in a subsequent search of the publication database.
[Figure 3 diagram. Recoverable labels: a driver web page triggers the CERN Document System
(admin, submit, search), which works with the CDS repository (metadata about registered users,
html/form fields, report numbers, user functions, administration) and the publication database;
the driver page also queries/updates and searches the collector context (People table, Group
table, Grant table).]
Figure 3. Initial collector context interface with CDS System.
Implementation version: With the new CDSware available in a more flexible version, our plan
changed, since it is more efficient to use CDSware directly and drop the driver web page (see
Figure 4). CDSware now coordinates and controls the application, except for a detailed view of
the grant data. The grant table is unique in that it is actually a series of related tables which
hold more information than is required for storage in the publication database. This detailed
view has the advantage of protecting special kinds of data through separation (Anna
Gold, personal communication).
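[Figure 4 diagram. Recoverable labels: CDSware (admin, submit, search) works with the CDS
repository (user functions) and the publication database (bibwords); it links to and
queries/updates the People table, Group table and Grant table, and queries the grant details,
which are protected.]
Figure 4. Existing CDSware context interface with CDS system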
Implementation
To implement the dual functionalities of data input and data display, it is necessary to ensure that
each document record inserted into CDSware contains information about the unique authors,
groups (organization, program, research program), and grants. To accomplish this, the user form
can structure input of data into publication records. This is in the best interest of the user, who
wants to have publications searchable and visible. In addition, it has to be ensured that this kind of
data can be used to select special publications and display data. Through communication with
Tibor Simko from CERN, it was learned that CDSware already provides the second functionality
of data display by virtual collections, bibwords, and bibformat. Parts of these tools are now
included in CDSware v0.01-pre6.
CDSware did not provide the first functionality. A current difficulty is that one can store a
variety of fields in the CDS database by simply typing them in, but it is unacceptable to ask
the user to type in encoded data that are not immediately meaningful (such as the arbitrary
numbers used as people-ids in administrative tools). On the other hand, these data are
necessary for the full working of the system. The solution is to convert data that are
meaningful to the user, like names, into system data, like people-ids. To do this we have to
query a non-CDS database during the submit process, using data that were entered earlier in the
same process. This required a change in the CDSware source code, which was requested. Before
the change, the submit process worked as shown in the simplified schema in Figure 5.
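The conversion from user-meaningful names to people-ids can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the SDSC implementation (which, per Table 3, was written in PHP and Perl): the `people` table layout, its column names, the matching rule, and the sample rows are all hypothetical, and SQLite stands in for the external database.

```python
import sqlite3


def separate_name(name):
    """Split 'Last, First M.' or 'First M. Last' into (last_name, initials)."""
    name = name.strip()
    if not name:
        raise ValueError("empty name")
    if "," in name:
        last, _, given = name.partition(",")
    else:
        parts = name.split()
        last, given = parts[-1], " ".join(parts[:-1])
    initials = "".join(w[0].upper() for w in given.replace(".", " ").split())
    return last.strip(), initials


def find_people_ids(conn, name):
    """Return candidate (people_id, first_name, last_name, email) rows for a typed name."""
    last, initials = separate_name(name)
    rows = conn.execute(
        "SELECT people_id, first_name, last_name, email FROM people "
        "WHERE lower(last_name) = lower(?) ORDER BY people_id",
        (last,),
    ).fetchall()
    # Keep rows whose first name starts with the first typed initial;
    # if no initials were typed, every last-name match is a candidate.
    return [r for r in rows if not initials or (r[1] or "")[:1].upper() == initials[0]]


def demo_connection():
    """Build a throwaway in-memory people table with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE people (people_id INTEGER PRIMARY KEY, "
        "first_name TEXT, last_name TEXT, email TEXT)"
    )
    conn.executemany(
        "INSERT INTO people VALUES (?, ?, ?, ?)",
        [
            (101, "Ada", "Example", "ada@example.org"),
            (102, "Alan", "Example", "alan@example.org"),
            (103, "Brian", "Example", "brian@example.org"),
        ],
    )
    return conn


if __name__ == "__main__":
    for candidate in find_people_ids(demo_connection(), "Example, A."):
        print(candidate)
```

Returning a list of candidates rather than a single id leaves the final choice to the user, for instance through a selection list on the web form.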
[Figure 5 diagram components: Web Form; PHP/SHTML source code (user functions); CDS Repository (HTML code); CDS Publication database.]
Figure 5. CDSware Submit Process
For the conversion, a change request was submitted to CERN as shown in Figure 6. Fortunately,
this change could be implemented by changing just three lines of the distributed code. Other
requirements were addressed via user-defined functions (that are called by CDSware).
[Figure 6 diagram components: Web Form; PHP/SHTML source code; PHP source of user libraries; external database(s) (i.e. PEOPLE); CDS Repository (HTML); CDS Publication database.]
Figure 6. CDSware Submit Process Modified for Database Access
Table 3 lists the implemented functions along with short descriptions. In addition to this
list, several functions delivered with CDSware were slightly modified to better fit SDSC
needs (e.g. Create_Modify_Interface). Every programmed function was tested in a module test,
an integration test, or both.
Customization
Using the functions above and the CDSware administration tools, the following functionality was
created in CDSware and tested (Integration test 2); some items are not fully completed:
* Batch upload of bibliographic information (nearly complete)
* Submission of grants (complete)
* Submission of people (complete)
* Definition of collections (complete, but more collections expected)
* Submission of published article (metadata)
* Modification of published article (metadata) (complete)
* Submission of published article file (needs debugging)
* Definition of bibformat (CDSware functionality tested, but not completely defined)
* Definition of bibconvert (complete for published articles)
Conclusions
The initiation of a two-way process for individual citation collection coordinated with a central
repository system is a complex task requiring attention to both international standards and local
practices. Work with the UCSD team (CDS @ SDSC) using CDSware in collaboration with
CERN partners is building a valuable experience base with a focus on local use, developing
standards, and iterative design. As a result, local project understanding of the concept of
organizational informatics is deeper and broader, yet grounded in site-based information
management.
Accomplishments to date include forming an interdisciplinary team that assessed available
repository software choices, implemented software locally, and maintained a concern for
grounding in local practices while balancing management demands. Upload from test citation
management files has been demonstrated, while work continues on integrating the repository
database with a local personnel database in order to link people with organizational units.
The importance of staying current with developments across the field (Open Archives Initiative,
Open ePrints, the California Digital Library’s eScholarship, MIT’s DSpace) is recognized, along
with the need to acquire specific hands-on practical experience. Specific activities to enhance
communications have included development of a working website for the San Diego project group
as well as attention to related communities of practice such as an SDSC semantics interest group
and the digital library (Gold et al., 2002).
Next Steps: Conceptual
* Further work is needed to address integration of repository building with researcher
workflow.
* Further assessment is needed regarding the centrality of people and organizations in digital
libraries / repositories.
* Further work is needed to elaborate the challenges and prospects of creating a metadata
grid in which participation and flow is multilateral and multidirectional.
Next Steps: Technical
* Complete demonstration of submit and upload functions from citation management
software and grants database
* Populate database using both individual and batch submissions
* Demonstrate internal views of data for program administrators
* Installation of CDSware v0.0.9 released 08/01/2002
* Migration to CDSware v0.0.9, due to new MARC XML schema
* Connection to Oracle
* Migration to new “group” table in Oracle
* Definition of remaining document types; create online document submission for all
document types
* Change input of function deliver_aid to be more user friendly
* Move to production environment
Continued work is needed toward understanding the requirements of digital repositories, with
continued attention to accommodating current practices at all levels and enhancing participation at
all stages of the research / learning process.
Acknowledgements
The conceptual and programming support of Joshua Polterock (SDSC) as well as the
contributions of the CERN CDS Team (Jean-Yves Le Meur, Tibor Simko, and Thomas Baron) are
gratefully acknowledged. We benefited from UCSD institutional support of SDSC (Kim Baldridge,
Integrative Computational Sciences; Phil Bourne, Integrative Biosciences), UCSD/Scripps
Institution of Oceanography, and the UCSD Libraries. The work was carried out with support
from NSF Grants DBI-01-11544 and OPP-96-32763.
References
Semantics Interest Group: http://pal.lternet.edu/dm/projects/semantics
Bainbridge, Paynter and Boddie, 2002
(http://link.springer.de/link/service/series/0558/bibs/2458/24580390.htm)
CERN CDS: http://cds.cern.ch
CDSware: http://cdsware.cern.ch
CDS @ SDSC: http://koshka.sdsc.edu
Open ePrints: http://www.eprints.org/
ARL white paper on institutional repositories: http://www.arl.org/ir2002.html
eScholarship: http://escholarship.cdlib.org/
A.Gold, K.S.Baker, K.Baldridge, J.Y. Le Meur, 2002, Building FLOW: Federating Libraries On
the Web. Proceedings of the 2nd Association for Computing Machinery (ACM)/IEEE-CS Joint
Conference on Digital Libraries.
Floyd, C., 1987. Outline of a paradigm change in software engineering, in Computers and
Democracy, G.Bjerknes, P.Ehn, and M.Kyng eds. Aldershot, 191-210p.
Gold, Anna Keller. The Role of Documents in Knowledge Management
Melendez-Colom, E. and K.S.Baker, 2002. Common information management framework: in
practice, in Proceedings of the 6th World Multi-Conference on Systemics, Cybernetics and
Informatics, 14-18 July 2002, Orlando, FL. N.Callaos, J.Porter, N.Rishe eds. IIIS 7, 385-389p.
Peters, T.A. Digital Repositories: Individual, Discipline-based, Institutional, Consortial, or
National?
Spanner, D. 2001. Border Crossings: Understanding the Cultural and Informational Dilemmas of
Interdisciplinary Scholars. The Journal of Academic Librarianship 27(5):352-360.
Wilde, Eric, 2003. Towards Federated Referatories. ECDL 2003 - 7th European
Conference on Research and Advanced Technology for Digital Libraries
(http://wildesweb.com/glossary)
Wolff and Cremers, 1999 (http://citeseer.nj.nec.com/wolff99myview.html)
Table 1. Multiple Levels of Practices and Motivations

Category: Individuals
Practices:
* Notebooks, articles, office files
* Mail, email, in-person: circulate preprints by mail, email
* Personal web pages (multi-format links)
* Personal databases (e.g. flat files, citation managers: can extract from, download and import to)
* Deposit to/extract from disciplinary repositories (e.g. arXiv)
Motivations:
* Track citation counts for reporting (maintain lists of peer-reviewed publications)
* Manage knowledge for easy retrieval and discovery
* Exchange with key colleagues
* Participate in building shared knowledge

Category: Research Groups
Practices:
* Internal databases (shared)
* Web sites with lists
Motivations:
* Manage knowledge
* Track output (for funding agencies)
* Track impact (greater exposure leads to greater impact)
* Discovery

Category: Institutions
Practices:
* Publish (e.g. tech reports, conf. proceedings, journals)
* Create internal databases
* Establish repositories
* Establish libraries
* Hybrid library/repositories
Motivations:
* Sharing, discovery, and reputation
* Management and reporting (including accountability to funding agencies)
* Archiving

Category: Disciplines
Practices:
* Professional society databases, portals
* Establish disciplinary repositories (may be distributed & federated or centralized, e.g. NCSTRL, arXiv)
Motivations:
* Sharing
* Discovery
Table 2. Document System Comparisons

1. How things get in
openEprints:
* Deposit by registered / authorized people
CDSware:
* Deposit by registered / authorized people
* Upload from structured file
Reference Web Poster:
* Upload by administrator from one or more private citation libraries
* Additions to citation library can be batch-extracted from commercial sources
* Additions may also be individual entries by private library manager
Library catalogs:
* FTP of batch files consisting of individual entries or single record copies from bibliographic utilities

2. How things get out
openEprints:
* OAI metadata harvesting protocol
CDSware:
* OAI metadata harvesting protocol
* Marked records can be downloaded singly or collectively
* Personal “baskets” can be made, shared
* Record output in XML, HTML, MARC, DC record formats
* CERN applications support file format conversions
Reference Web Poster:
* Marked records may be downloaded to citation management software (Z39.80)
Library catalogs:
* Marked records may be extracted in printable or downloadable formats, e.g. to citation management software

3. Who can put things in
openEprints:
* Configurable: registered or authorized people; may include researcher direct deposits, or be configured to “flow” deposits through administrators
CDSware:
* Configurable: may be registered or authorized people: researchers in or outside the institution; may be linked to institutional ID
Reference Web Poster:
* Administrator with access to server and commercial software
Library catalogs:
* Specially trained and / or authorized staff, usually in libraries, using locally configured catalog software

4. What things can be put in
openEprints:
* Focus on preprints, working papers (full text)
* Other uses: conference proceedings (CalTech); other monographs
CDSware:
* Configurable; current support for documents with metadata or metadata alone
* CERN configured for preprints, commercial articles, books, photos, presentations, etc.
* Developing “people” records
Reference Web Poster:
* Articles and conference proceedings are focus
Library catalogs:
* Monographic works and entire journals are primary focus

5. What linkages to other systems
openEprints:
* OAI supports cross-repository searching
CDSware:
* OAI supports cross-repository searching
* Linkages created to local applications and databases, e.g. personnel database
* Upload from citation management software OK
Reference Web Poster:
* Primarily to commercial article databases for which citation management download filters have been written
Library catalogs:
* Via Z39.50, federated search of other library catalogs; extraction and deposit to parallel collective catalogs (OCLC, union catalogs)

6. What protocols, standards followed
openEprints:
* DC
* OAI-PMH
* Crosswalks from other metadata formats
CDSware:
* OAI-PMH
* DC
* MARC21
* Z39.80 (in dev.)
Reference Web Poster:
* Z39.80
* MARC
Library catalogs:
* MARC
* Z39.50
* Z39.80
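
Since Table 2 lists the OAI metadata harvesting protocol (OAI-PMH) as a way records get out of both openEprints and CDSware, a harvesting request can be sketched as below. The verb `ListRecords` and the arguments `metadataPrefix`, `from`, `until`, and `set` are defined by OAI-PMH, and `oai_dc` is its standard Dublin Core prefix; the base URL and the function name are hypothetical and stand in for whatever endpoint a given repository exposes.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, until_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL (dates as YYYY-MM-DD)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical endpoint; replace with a real repository's OAI base URL.
    print(list_records_url("https://repository.example.org/oai", from_date="2002-08-01"))
```

A harvester would fetch this URL and parse the XML response; that step is omitted here because it depends on the repository.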
Table 3. SDSC User Defined Functions

Function: Description
InsPid / InsPid2: Program written in Perl to insert people ids and email into an EndNote tagged file.
look_for_author: Migration of InsPid from Perl to PHP.
deliver_aid: Reads name from input (web form) and generates a select list on the web page with people-id and email address.
deliver_pid: Variation of deliver_aid (for PIs).
separate_names: Gets name string and separates last name and initials; called by deliver_aid.
check_first_name: Compares initials with first names found in table; called by deliver_aid.
Add_person: Reads description from given database tables, reads the input fields from $STORAGE, and uploads the data into the table; used to upload people data into the people table.
double_check_person: Reads author names from $STORAGE and looks for matches in the author table.
NewAuthor1Check: JS function; checks if user wants to add a new author.
DatCheckUS: JS function; checks US date format (mm/dd/yyyy).
get_next_id: Reads description from given database tables, determines the key fields, and writes the next available key into a file in the $STORAGE dir with the same name as the table column.
change_to_mysql_fmt: Reads dates from file, translates from mm/dd/yyyy into yyyy-mm-dd or vice versa, and writes back to file.
SelectGrant: Selects grant data from tables where the author is PI or CoPI and displays a selection list to the user.
SelectResearchPrg: Selects research program from group table and displays a selection list; needs to be modified due to table change.
add_grant: Reads the input fields from $STORAGE and uploads the GRANT data into the GRANTS and GRANT_XREF tables.
splitGrantPI: Reads the input file from $STORAGE and splits it into two files. These will be uploaded into the GRANTS table by another function.
JoinBibFiles: Written to run CDS submit 3.0 with CDSware v0.01-pre4; deprecated since CDSware v0.01-pre6.
CallBibConvert: Written to run CDS submit 3.0 with CDSware v0.01-pre4; deprecated since CDSware v0.01-pre6.
CallBibFormat: Written to run CDS submit 3.0 with CDSware v0.01-pre4; deprecated since CDSware v0.01-pre6.
CallBibUpload: Written to run CDS submit 3.0 with CDSware v0.01-pre4; deprecated since CDSware v0.01-pre6.
Testfunction: Driver function, written to test functions in command line mode without invoking Apache or CDSware.
XML2EndNote: Test program to convert CDSware search results into EndNote import files. (For later use)
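
The date handling described for DatCheckUS and change_to_mysql_fmt can be illustrated with a short Python sketch. It is an assumption-laden stand-in rather than the SDSC code (the originals are a JavaScript check and a function that reads from and writes back to a file); the function names below are hypothetical and operate on strings only.

```python
from datetime import datetime

US_FORMAT = "%m/%d/%Y"     # mm/dd/yyyy, as entered on the web form
MYSQL_FORMAT = "%Y-%m-%d"  # yyyy-mm-dd, as stored in a MySQL DATE column


def is_us_date(text):
    """Return True if text is a valid calendar date in mm/dd/yyyy form."""
    try:
        datetime.strptime(text.strip(), US_FORMAT)
    except ValueError:
        return False
    return True


def us_to_mysql(text):
    """Convert mm/dd/yyyy to yyyy-mm-dd; raises ValueError on bad input."""
    return datetime.strptime(text.strip(), US_FORMAT).strftime(MYSQL_FORMAT)


def mysql_to_us(text):
    """Convert yyyy-mm-dd to mm/dd/yyyy; raises ValueError on bad input."""
    return datetime.strptime(text.strip(), MYSQL_FORMAT).strftime(US_FORMAT)


if __name__ == "__main__":
    print(us_to_mysql("08/01/2002"))
    print(mysql_to_us("2002-08-01"))
```

Parsing through `datetime` rather than rearranging substrings means impossible dates such as 02/30/2002 are rejected instead of being stored.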