Digital Archiving of Astronomical Data to Support Publication and
Long-term Preservation
Assessment of Need
One of the most fundamental aspects of scientific scholarly communication is the ability to cite and examine
data in a persistent manner. Without this ability, the very essence of the scientific method, with its requirement
of validating results, becomes compromised. Large-scale astronomy projects such as the Sloan Digital Sky
Survey (http://www.sdss.org) have gathered data at unprecedented rates, raising new challenges and
opportunities. This explosion in data-driven science has led to fundamental changes in practice and modes of
inquiry, prompting the National Science Foundation (NSF) to advance the evaluation and development of
Cyberinfrastructure to support large-scale, digital science projects. Both the Library of Congress' National
Digital Information Infrastructure and Preservation Program (NDIIPP at http://www.digitalpreservation.gov)
and the NSF Blue-Ribbon Panel on Cyberinfrastructure report (National Science Foundation 2003) stress the
essential aspect of digital archiving of datasets to ensure long-term access. Most importantly, this year’s
Institute of Museum and Library Services’ (IMLS) National Leadership Grant guidelines for demonstration
projects invite efforts to “develop pilot projects or programs in data curation.” This proposal directly addresses
this important and urgent priority. Without immediate action, we may find ourselves in a “digital dark age”
losing important, scholarly resources from the scientific domain.
The National Virtual Observatory (NVO) project is playing a leadership role in building services for the
astronomy community to access and analyze astronomical data (http://us-vo.org). For good reason, the NVO is
often cited as one of the quintessential cyberinfrastructure projects. With projects such as NVO, the astronomy
community has moved into the forefront of data-intensive digital science, providing a path for other disciplines
to consider. However, thus far the scope of the NVO has deliberately not included long-term data curation,
focusing instead on data location and data access standards and protocols. Based on extensive, ongoing
dialogue and communication, the NVO project team, led by researchers at Johns Hopkins University (JHU), has
concluded that academic research libraries represent the ideal home for long-term preservation and curation of
large-scale datasets to support persistent access and scholarly communication, given their expertise and long-term, sustainable support from universities.
NVO researchers are not only involved in this proposal, but they are driving the effort. The proposed work
does not rest upon assumptions or inferences related to digital science, but rather upon firsthand feedback from
and interaction with NVO. The work proposed herein reflects a pressing, clearly identified need with serious
implications for scientific research and scholarly communication. The Library's involvement in this effort does
not arise from an abstract or theoretical argument. The NVO research team has concluded that the Library
should move to the center of data curation for various reasons, including their confidence in the Digital
Knowledge Center (DKC) of the Sheridan Libraries, which combines the rich, historical principles of library
science with a leadership role for digital library research and development.
National Impact and Intended Results
In the astronomy community there is a long-established partnership between the dominant, non-profit publishers
such as the American Astronomical Society and its production partner the University of Chicago Press
(Astrophysical Journal, Astronomical Journal) and astronomy data centers and bibliographic services
(Astrophysics Data System, ADS, in the US; Centre de Données astronomique de Strasbourg, CDS, in Europe).
This proposal offers an opportunity to move libraries from the periphery of projects such as NVO to the center
of digital archiving and data curation efforts, and to establish a three-fold collaboration—publishers, an
association of libraries, and NVO—that assures universal and long-term access and preservation. By
incorporating NVO web services into a Fedora digital library framework (http://www.fedora.info/), we will
provide mechanisms for long-term digital archiving of astronomical tables, catalogs, spectra, images and
documents that facilitate data publishing for astronomers and scholarly journals. The proposed work will
achieve the following goals:
• Recognize university libraries' key role in digital archiving and data curation, and move libraries into the
  center of digital archiving and data curation efforts
• Deliver a functional system in the form of an appliance to be installed at partner libraries
• Demonstrate the viability of a Fedora-based repository as the foundation for a data and digital content
  curation infrastructure
• Provide long-term, persistent storage and access for cited datasets
• Develop services to place processed data online
• Supply a catalyst/template for other disciplines and organizations
• Increase integrity of scientific publication
• Propose a new model for data publishing (with libraries as digital annexes for journals)
With specific and substantive collaboration with the NVO and publishers, we will produce a human and
technology infrastructure that will result in data curation of processed, digital science datasets to support
publication and long-term preservation. While the proposed work focuses on astronomy, a discipline that is at
the forefront of data-intensive scholarship, the results of this effort will provide a blueprint for other disciplines
and a model for libraries to lead the efforts to curate data from large-scale, data-driven projects.
Project Design and Evaluation Plan
What does data curation mean for astronomers? Astronomers' data includes images, spectra, catalogs/tables,
and documents or free-form data. Individual astronomers or teams using ground-based and space-based
instruments capture or create a portion of these data that provide the foundation for research and publication.
Most of these data reside in systems optimized for the storage and retrieval of data from particular telescopes or
facilities, but lacking generic data access or query mechanisms (a situation the NVO is beginning to rectify). In
any case, major astronomy archives focus on standard data products, and less so on highly processed images or
spectra that are associated with peer-reviewed publications. Without an integrated system for individual
scientists to deposit these data into a persistent library-based archive or NVO-standard interfaces for access, it is
impossible to query these data, identify gaps in knowledge, cite the data within publications, or preserve them
for long-term access. The barrier for participation in such a system should be low. Ideally, individual scientists
should be able to check in their processed, high-level data into library archives as easily as they can create web
pages. This proposal describes several contributions that will result in such a capability.
We will develop a set of web services that link literature and reference materials to astronomical datasets.
These services will reflect actual use cases as defined by the NVO. For example, "identify all literature and
images within this portion of the sky." We will map these web services into a Fedora-based digital library
framework. Deposited data will ultimately be archived within the Library, which will serve as a digital annex
for publications. Fedora's ability to integrate web services and object discovery facilities is particularly
relevant in this regard. Through this effort, we will develop an appliance that will comprise both the hardware
and software to manage this service. This appliance will be installed at our partner libraries at the University of
Edinburgh and the University of Washington.
Fedora (Payette et al. 2003) is an open source repository system being actively developed by the University of
Virginia and Cornell University. Unlike some other repository applications, Fedora was designed to support the
association of behaviors with the digital objects it contains. These associations are called "disseminations" in
Fedora. So, instead of simply returning the content, as it was stored into the repository, Fedora can render the
content or act on it in different ways. This coupling of digital objects with behaviors will allow richer
interaction with the deposited content. The fact that Fedora is open source will give us the ability to modify the
system, as necessary, and to distribute the appliance without concerns about software copyright.
With the cooperation of the American Astronomical Society (AAS), the editorial staff of the Astrophysical
Journal and Astronomical Journal, and the University of Chicago Press (UCP), we will develop an
understanding with publishers to accept these data submission formats. Since the association of libraries will
manage and preserve the deposited datasets, we will reduce the participation and entry barrier for publishers.
Through these combined efforts, we will create a fully integrated network of processes, tools and systems to
support ingestion and preservation of processed datasets to support publication.
Integration with the Virtual Observatory
A primary goal of the Virtual Observatory is to provide integrated access to archival data and derived data
products: catalogs, tables, and highly processed images, spectra, and time series. The initial focus of NVO
development has been on providing access to the former—the archival data sets that are already available via
public interfaces on the web, but often with unique and incompatible interfaces. Derived data products are the
purview of either dedicated large projects (such as the 2MASS or SDSS sky surveys) or of individual
researchers. Large projects have thus far worked to provide access to these high level products. The valuable
data from individuals or small collaborations sometimes appear in the electronic journals and sometimes on
personal web sites, but most often these data are not available at all in any standard form or via any standard
interface.
We envision closing this gap in data access (to those data products that are most valuable for comparative,
multi-wavelength studies) through a technological partnership among the research astronomers, the peer-reviewed
journals, the university libraries, and the Virtual Observatory (VO). As a result of providing a simple
mechanism for researchers to upload, register, and annotate their processed data in a permanent digital archive,
we enable the VO to integrate the collection of processed data products into the general framework for data
discovery and access. The standard VO methods for data access (catalogs, spectra, images) can be implemented
as front-end services to the Fedora-based, distributed collection of high-level data products. A somewhat larger
challenge comes in the area of data discovery, which depends on the availability of reliable metadata describing
the datasets in a collection.
The other key to data discovery is the extraction of coordinate system metadata. Astronomers throughout the
world use the FITS data format standard (Hanisch et al. 2001) and the associated conventions for celestial
coordinate systems (Greisen and Calabretta 2002, A&A 395, 1061; Calabretta and Greisen 2002, A&A 395,
1077). FITS images and spectra that are uploaded to the repository can easily have the coordinate-related
metadata extracted into an associated metadata database. It is then a straightforward matter to implement the
coordinate-based Simple Image Access Protocol and Simple Spectrum Access Protocol developed by the NVO
so that all images and spectra in the collection can be located and accessed transparently by the research
community.
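The extraction step described above can be sketched as follows. This is a minimal illustration in Python using only the standard library; the function names (`parse_header`, `coordinate_metadata`) are hypothetical, the demonstration header is made up, and the parser is deliberately simplified (no CONTINUE cards, no slashes inside quoted strings). A production system would more likely rely on an established FITS library.

```python
from typing import Dict, Union

CARD_LEN = 80  # FITS header cards are fixed-width, 80 ASCII characters each
Value = Union[str, int, float, bool]

# Keywords of the basic celestial coordinate description in a 2-D image header.
WCS_KEYS = ("CTYPE1", "CTYPE2", "CRVAL1", "CRVAL2",
            "CRPIX1", "CRPIX2", "CDELT1", "CDELT2")


def _convert(text: str) -> Value:
    """Turn the value field of a card into a Python value."""
    if text.startswith("'"):
        return text.strip("'").strip()
    if text in ("T", "F"):
        return text == "T"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text  # leave anything unrecognized as raw text


def parse_header(raw: bytes) -> Dict[str, Value]:
    """Read keyword/value cards up to the END card (simplified parser)."""
    text = raw.decode("ascii")
    header: Dict[str, Value] = {}
    for start in range(0, len(text), CARD_LEN):
        card = text[start:start + CARD_LEN]
        keyword = card[:8].strip()
        if keyword == "END":
            break
        if card[8:10] != "= ":
            continue  # COMMENT, HISTORY, or blank card
        value_field = card[10:].split("/", 1)[0].strip()  # drop trailing comment
        header[keyword] = _convert(value_field)
    return header


def coordinate_metadata(header: Dict[str, Value]) -> Dict[str, Value]:
    """Keep only the coordinate-related keywords for the metadata database."""
    return {key: header[key] for key in WCS_KEYS if key in header}


def _card(keyword: str, value: str) -> str:
    """Build one 80-character card for the demonstration header below."""
    return f"{keyword:<8}= {value:>20}".ljust(CARD_LEN)


# A tiny, made-up header used only to exercise the functions above.
demo = "".join([
    _card("SIMPLE", "T"),
    _card("NAXIS", "2"),
    _card("CTYPE1", "'RA---TAN'"),
    _card("CTYPE2", "'DEC--TAN'"),
    _card("CRVAL1", "180.0"),
    _card("CRVAL2", "-1.5"),
    "END".ljust(CARD_LEN),
]).encode("ascii")

coords = coordinate_metadata(parse_header(demo))
```

The resulting dictionary is the kind of record that could be stored alongside the deposited file and later matched against a position-based query.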
Catalogs and tables constitute the other major type of derived data, and again the repository must provide a
mechanism for gathering and associating the relevant metadata. The Virtual Observatory has developed both a
standard data dictionary for tabular data, Uniform Content Descriptors (UCDs), and standard access methods
(OpenSkyQuery and the Virtual Observatory Query Language). Catalog/table upload will include a process for
associating UCDs with table columns, and the NVO can provide an OpenSkyQuery portal to the collection.
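The column-annotation step at upload could look roughly like the sketch below. The function name `annotate_columns`, the example column names, and the particular UCD strings are illustrative assumptions; the actual vocabulary and workflow would be settled with the NVO.

```python
from typing import Dict, List, Tuple


def annotate_columns(columns: List[str],
                     ucd_by_column: Dict[str, str]) -> List[Tuple[str, str]]:
    """Pair every table column with a UCD, refusing tables with unlabeled columns."""
    missing = [name for name in columns if name not in ucd_by_column]
    if missing:
        raise ValueError(f"no UCD assigned for columns: {', '.join(missing)}")
    return [(name, ucd_by_column[name]) for name in columns]


# Hypothetical catalog columns and illustrative UCD assignments.
columns = ["id", "ra", "dec", "vmag"]
ucds = {
    "id": "meta.id",
    "ra": "pos.eq.ra",
    "dec": "pos.eq.dec",
    "vmag": "phot.mag",
}

annotated = annotate_columns(columns, ucds)
```

Rejecting a table with unlabeled columns at deposit time is one possible design choice; it keeps every stored column queryable by meaning rather than by its author-chosen name.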
The DKC of the Sheridan Libraries, working with its library partners, will map the NVO tools, services, and
metadata into a Fedora-based framework. The NVO and DKC will conduct this work as part of their ongoing and
growing examination of data curation issues.
Fedora-based Mapping
The DKC represents a unique organization focused on digital library research and development. While the
DKC is housed physically and administratively within the Libraries, its staff includes individuals with
backgrounds in computer science, engineering, mathematics and cognitive science. This combination of
perspectives has resulted in a comprehensive, diverse approach to digital library development. Program officers
from NSF and IMLS have mentioned that the DKC is the only organization to receive grants from the NSF
Digital Libraries Initiative, Phase 2 (DLI-2), Information Technology Research (ITR), and National Science
Digital Library (NSDL) tracks, and IMLS' National Leadership Grant Program. DKC projects have focused on
digital workflow management, especially the ingestion of and access to large digital collections. Through
existing grants, the DKC has built both hardware (Suthakorn et al. 2003) and software (Droettboom et al. 2002),
including tools for automated metadata generation (DiLauro et al. 2001).
More recently, the DKC has focused on repository research, especially as it relates to repositories’ ability to
support a range of services, especially digital preservation. Led by the DKC, JHU participated in the Archive
Ingest Handling Test (AIHT), a program within the Library of Congress' NDIIPP framework (DiLauro et al.
2006). The AIHT provided practical experience with ingestion of an archive that features multiple file formats
and varying levels of metadata. It also offered an excellent opportunity to test the capabilities of
existing repository systems such as DSpace (Smith et al. 2003) and Fedora. JHU was the only institution that
evaluated multiple systems. Through a grant from the Mellon Foundation, JHU has conducted a comprehensive
and diverse technology analysis of repositories and services
(https://wiki.library.jhu.edu/display/RepoAnalysis/ProjectRepository).
In addition to these detailed explorations of repositories, the DKC is also involved in two UK-based repository
projects funded by the Joint Information Systems Committee (JISC). Through one of these JISC projects
(http://jiscstore.jot.com/WikiHome), the project team will further define and articulate the specific needs of
astronomers for data curation, especially as it relates to electronic publications. JISC has coordinated its Digital
Repository Programme training and development with JHU’s Mellon-funded repository analysis, and invited PI
Choudhury to two recent meetings in the UK and the Netherlands. Finally, the DKC evaluated LOCKSS
(Reich et al. 2001), and continues to manage a LOCKSS appliance. Through our extensive, rigorous and
objective evaluation, we have concluded that Fedora represents the best system for this particular effort.
As mentioned previously, Fedora disseminators enable the association of digital objects and behaviors or
renderings. A Fedora dissemination is created by associating a method definition (behavior definition or BDef)
and a corresponding method implementation (behavior mechanism or BMech) with a data object.
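The relationship among these three pieces can be pictured with a small conceptual model. The sketch below is not Fedora's API; the class and function names are hypothetical and exist only to show how a definition (a set of method names), a mechanism (implementations of those methods), and a data object combine to produce a dissemination.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet


@dataclass
class DataObject:
    """A stored object with named content streams (hypothetical model)."""
    pid: str
    datastreams: Dict[str, bytes] = field(default_factory=dict)


@dataclass(frozen=True)
class BehaviorDefinition:
    """Abstract contract: the method names an object promises to support."""
    methods: FrozenSet[str]


@dataclass
class BehaviorMechanism:
    """Concrete implementations for the methods named by a definition."""
    implementations: Dict[str, Callable[[DataObject], str]]

    def satisfies(self, bdef: BehaviorDefinition) -> bool:
        return bdef.methods.issubset(self.implementations)


def disseminate(obj: DataObject, bdef: BehaviorDefinition,
                bmech: BehaviorMechanism, method: str) -> str:
    """Run one defined method against one object and return the rendering."""
    if method not in bdef.methods:
        raise KeyError(f"{method!r} is not part of the behavior definition")
    if not bmech.satisfies(bdef):
        raise ValueError("mechanism does not implement every defined method")
    return bmech.implementations[method](obj)


# Illustrative definition and mechanism for an image object.
image_bdef = BehaviorDefinition(frozenset({"get_preview", "get_sky_link"}))
image_bmech = BehaviorMechanism({
    "get_preview": lambda o: f"preview of {o.pid} ({len(o.datastreams['IMAGE'])} bytes)",
    "get_sky_link": lambda o: f"sky view centered on {o.pid}",
})

obj = DataObject("demo:1", {"IMAGE": b"\x00" * 16})
preview = disseminate(obj, image_bdef, image_bmech, "get_preview")
```

Keeping the definition separate from the mechanism means the same contract could be backed by different implementations for different content types.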
In addition to providing a default dissemination based on content type (e.g., image, spectra, catalog), we will
use these facilities to link deposited content with appropriate NVO services. These interfaces will allow users
of a digital object to be drawn into a rich interactive experience that uses data from the selected object as the
starting point for further discovery.
For example, the image content (TIFF, JPG, etc.) of an image digital object might be displayed to the end user
with links to a web service that allows interaction with a larger portion of the sky. Data from the object would
be passed as parameters to the service, allowing the service to focus on the correct portion of the sky and,
perhaps, to highlight appropriate objects.
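Passing object data as service parameters might be as simple as the following sketch. The base URL is a placeholder, the function name is hypothetical, and the parameter names `POS` and `SIZE` follow the convention of the NVO Simple Image Access Protocol (a position given as right ascension and declination in degrees, plus a region size in degrees).

```python
from urllib.parse import urlencode


def sky_service_link(base_url: str, ra_deg: float, dec_deg: float,
                     size_deg: float) -> str:
    """Build a link that centers a sky-browsing service on an object's position."""
    if not 0.0 <= ra_deg < 360.0:
        raise ValueError("right ascension must be in [0, 360) degrees")
    if not -90.0 <= dec_deg <= 90.0:
        raise ValueError("declination must be in [-90, 90] degrees")
    query = urlencode({"POS": f"{ra_deg},{dec_deg}", "SIZE": f"{size_deg}"},
                      safe=",")  # keep the comma in POS unescaped
    return f"{base_url}?{query}"


# Placeholder service address; the coordinates would come from the object's metadata.
link = sky_service_link("https://example.org/sky", 180.0, -1.5, 0.25)
```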
To support search and discovery, technical and descriptive metadata will be extracted from deposited objects
and mapped to one or more common metadata formats. While the details will be developed over the course of
the project, it is anticipated that these formats will include simple Dublin Core (http://dublincore.org/) and a
format designed to support information specific to the astronomy domain. These will be made available using
the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH at
http://www.openarchives.org/OAI/openarchivesprotocol.html). Content will be harvested frequently so that
new data quickly become available to the astronomy community.
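As a rough illustration of the mapping to simple Dublin Core, the sketch below turns a dictionary of extracted fields into an `oai_dc` record. The field names on the left of `FIELD_TO_DC` and the function name are assumptions made for this example; only the two namespace URIs are taken from the Dublin Core and OAI-PMH specifications.

```python
import xml.etree.ElementTree as ET
from typing import Dict

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)

# Hypothetical mapping from extracted metadata fields to simple Dublin Core elements.
FIELD_TO_DC = {
    "object_name": "title",
    "observer": "creator",
    "date_obs": "date",
    "file_format": "format",
}


def to_dublin_core(extracted: Dict[str, str]) -> str:
    """Serialize the mapped fields as an oai_dc XML record."""
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for field_name, dc_element in FIELD_TO_DC.items():
        if field_name in extracted:
            child = ET.SubElement(root, f"{{{DC}}}{dc_element}")
            child.text = extracted[field_name]
    return ET.tostring(root, encoding="unicode")


# Made-up values standing in for metadata extracted from a deposited image.
record_xml = to_dublin_core({
    "object_name": "Example field image",
    "date_obs": "2005-03-14",
    "file_format": "image/fits",
})
```

A harvester would then retrieve such records with an OAI-PMH request using `verb=ListRecords` and `metadataPrefix=oai_dc`.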
An Association of Libraries
In the longstanding library tradition of collaborating on and distributing preservation responsibilities, we
will work with two library partners from the University of Edinburgh and the University of Washington. The
Fedora-based appliance that we will develop offers a low-maintenance, low-resource method for participation.
At the most basic level, we could develop this appliance and deliver it to our partners, who can manage it with
minimal effort. In fact, it will not even be necessary to have local Fedora expertise. Such low barriers to entry
and maintenance will bolster the prospects for other institutions to join our association.
However, in addition to an ideal geographical distribution, our initial library partners each bring distinct and
valuable perspectives that lead to a more substantive association. JHU and the University of
Edinburgh will build upon existing partnerships between both astronomy researchers and libraries. The JHU
astronomy team has worked closely with astronomers at Edinburgh. Andrew Lawrence, the Head of the School
of Physics at Edinburgh and Project Leader of the AstroGrid Project, has provided a letter of support that
demonstrates the enthusiasm and interest from Edinburgh (see supplementary documentation). Lawrence's
letter outlines the different expertise available at Edinburgh through the UK e-Science Center and the newly
formed and JISC-funded Digital Curation Centre (DCC). From the DCC's website (http://www.dcc.ac.uk):
The DCC's emphasis complements very well the Sheridan Libraries' emphasis on evaluation of digital
repository systems. Additionally, the Sheridan Libraries and Edinburgh University Library have a formal
partnership that has resulted in several, collaborative efforts including an existing NSF Information Technology
Research (ITR) grant. Sheila Cannell, the Director of the Edinburgh University Library has provided a letter of
support in this regard (see supplementary documentation). The University of Edinburgh brings an
international perspective to this effort, which is critical to considering digital archiving in its fullest sense.
The University of Washington Libraries, part of the DSpace Federation, has a production system in place
(https://digital.lib.washington.edu/dspace/index.jsp). Washington chose to participate as early adopters of
DSpace with the goals of digital preservation, influencing scholarly communication and considering possible
integration with the Digital Well, "a collaborative effort between ResearchChannel, UWTV, KEXP and UW
Computing & Communications Advanced Systems Technologies Group to explore discovery, distribution and
use technologies surrounding digital media collections on IP based networks" (http://digitalwell.org). With
funding from the Mellon Foundation, the University of Washington has examined digital scholarship. This
examination considered "creation of digital technology, tools and services to solve problems in scholarship"
(http://www.lib.washington.edu/digitalscholar/index.html). By running a Fedora appliance at one of the
DSpace federation libraries, we hope to persuade other libraries that running both DSpace and Fedora based
services is possible and appropriate. This association of libraries with the NVO represents an excellent
collaborative team; the connection to publishers provides the final component of our triad of partners.
Collaboration with Publishers
Cooperation from the key publishers in the astronomical community is essential for successful implementation
of this proposal. To address the goal of transforming scientific scholarly communication using archived
datasets, it is essential that the standards and protocols specifically developed in this proposal merge smoothly
with current usage at the major journals. Fortunately, this adoption is considerably simplified by the central role
played by a handful of publishers in the professional community. In North America, the publications sponsored
by the American Astronomical Society account for almost every significant professional venue. These journals
have been in the forefront of the switch to electronic publishing. For example, the Astrophysical Journal (ApJ),
published by the University of Chicago Press (UCP) for the AAS, now maintains an IT expert as a full time
staff member in the editorial office, and the archival version of the journal is the on-line version rather than the
paper version. This shift has opened the possibility of publishing extensive machine-readable data tables in the
on-line version. It has also raised worries about developing and maintaining standards for data publication, as
well as the connection between the articles, and the data on which they are based.
There are three points that require close cooperation with the AAS and UCP from the very beginning: the
development of keywords linking datasets with information about the particular instruments used to acquire the
data, finding acceptable standards for linking scientific articles with particular datasets, and the labeling and
storage of the datasets so as to make them a useful adjunct to the scientific literature.
Our team includes individuals with deep ties to the publishing work of the AAS: Robert Hanisch, who, in
addition to acting as Project Manager for the highly distributed NVO, has served as chair of the Publication
Board of the AAS, and Ethan Vishniac, who serves as Associate Editor-in-Chief for the Astrophysical Journal
(and previously served as Scientific Editor for seven years). The most relevant AAS employee is Greg Schwarz,
working at the Tucson office of the Astrophysical Journal, who has been responsible for developing machine-readable
formats for ApJ papers and is currently developing keywords for astronomical facilities. We will be in
regular contact with him, and any other relevant AAS employees, to make sure that the work done here is
consistent with the current usage, and future needs, of the AAS. Robert Milkey, Executive Director of the
AAS, and Julie Steffen, Associate Journals Manager and Director, Astronomy Journals at UCP have provided
letters of support that outline their respective organization’s endorsement of and commitment to work on this
project (see supplementary documentation). Steffen will also play a key role in the business model
development of this effort, as described in the Sustainability section.
The work outlined in this proposal represents a unique and groundbreaking effort. Currently, there is no data
curation infrastructure within a library that directly supports scientific researchers. The project team members
have an extensive network of contacts and professional commitments that provide ample evidence that we are
not replicating existing work. One project that emulates our organizational partnership is CLOCKSS
(http://www.lockss.org/clockss/), which “is a collaborative initiative by a group of organizations drawn from
publishers, libraries and learned societies.” However, there are noteworthy differences with our proposed
effort.
From a technological perspective, this proposal outlines a repository-based content storage layer, which will
support rich interaction with the content (beyond viewing of content only), and a data model and metadata. We
believe these components are necessary for full-fledged digital preservation, especially as it supports citation
within publications. CLOCKSS, as the name implies, is based on the LOCKSS technology
(http://www.lockss.org). As mentioned previously, the DKC is familiar with LOCKSS. While it represents a
valuable mechanism for creating distributed bit replication of content, it requires that the content already be
stored somewhere. This assumption may be reasonable for electronic articles, but it is not true for the
accompanying datasets that represent the focus of our effort. Additionally, LOCKSS does not include a
metadata component, unlike our proposed effort. The emphasis on datasets also differentiates our proposed
work from Portico (http://www.portico.org). PI Choudhury and Co-PI DiLauro have met with Portico officials,
who have confirmed that they are not currently focusing on data curation.
Even with the similar composition of organizational partners with CLOCKSS, this proposal differs in one major
manner: the direct and substantive involvement from the researchers who are creating the large, digital
scientific datasets (in addition to the relevant professional societies). These researchers, who are motivated by
an urgent and real need, will provide the expertise to ensure appropriate development of the Fedora-based
appliance, and the feedback necessary to evaluate the outcomes of this proposed effort.
Project Evaluation
IMLS’ Outcome-Based Evaluation (OBE) stresses the question: “What changed as a result of our work?” This
proposal addresses an urgent gap or need identified by the individuals (the astronomers) most affected by this
gap. Not surprisingly, they are in the best position to evaluate the results of this proposed work. Their direct
involvement in this proposal bolsters this prospect.
We can measure the specific outputs of this proposal fairly easily. For example, the DKC learned a great deal
about metrics or measurements related to large-scale, bulk ingestion of content as a result of AIHT. We will
use these lessons to evaluate the effectiveness of the ingestion of astronomy data into the Fedora-based
repository. The NVO team will assess whether the Fedora-based web services work properly to support the
NVO web-services framework. Our partner libraries at Edinburgh and Washington can provide feedback and
evaluation of the installation process and effort, and the relative ease of ongoing management of the Fedora-based appliance.
The system that is developed must be cost-efficient, extensible, and financially sustainable. We will be
analyzing the business model for digital data and content preservation, primarily with support from other
organizations and in-kind contributions from collaborators.
Having said this, this proposal offers the potential of far more significant outcomes. If successful, the proposed
work could change the nature of scholarly communication for astronomers by providing persistent access to
cited datasets. This work will move libraries to the center of data curation efforts, and establish a critical set of
partnerships between these libraries, publishers, professional societies and the researchers themselves. These
more significant outcomes are not as easily measured. Nonetheless, we will track the rate of data deposit into
this system by astronomers who are publishing papers. We will also examine whether other cyberinfrastructure
projects engage their institutional libraries in a similar manner to the NVO and JHU Libraries.
Data collection and analysis and integrity in data presentation are the central nervous system of modern
research. This project will help to indemnify the huge investments of public and private funds in scientific
research by establishing a means to preserve and protect digital content and underlying digital data. The
traditional unit of output is the journal article. The preservation issues surrounding this public record of results
and findings are just now being addressed. How we effectively manage the vital data upon which journal
articles depend has not yet been discussed.
Project Resources: Budget, Personnel, and Management Plan
The personnel for this one-year demonstration project comprise digital librarians, a metadata librarian,
programmers, astronomers and publishers from JHU, NVO, AAS and UCP. JHU represents the lead
organization given the central role of the library in building and maintaining this data curation infrastructure.
Personnel
Sayeed Choudhury, Associate Director for Library Digital Programs and Hodson Director of the Digital
Knowledge Center at JHU, will act as Administrative Head. Choudhury has been the Principal Investigator for
ten digital library projects. Most recently, he has been chosen as one of the technical auditors for the Center for
Research Libraries/Research Libraries Group (CRL/RLG) repository certification and audit exercise. The
budget request includes cost-sharing of 10% FTE salary, fringe benefits, and indirect costs per year for
Choudhury, an amount based on experience from his other grant-funded projects.
Tim DiLauro, Digital Library Architect, Library Digital Programs at JHU, will act as technical lead, a role he
has played for the digital library projects at JHU. Along with Choudhury, he has been chosen as the other
technical auditor for the CRL/RLG repository certification and audit exercise, and he acts as JHU’s
representative to the Library of Congress’ NDIIPP Preservation Partners planning meetings. The budget request
includes cost-sharing of 10% FTE salary, fringe benefits, and indirect costs per year for DiLauro, an amount
based on experience from his other grant-funded projects.
David Reynolds, Metadata Librarian at JHU, will lead the metadata development effort. Reynolds has provided
the metadata expertise for the digital library projects at JHU, including the AIHT project. The budget request
includes cost-sharing of 10% FTE salary, fringe benefits, and indirect costs per year for Reynolds, an amount
based on experience from his other grant-funded projects.
Alex Szalay, Alumni Centennial Professor of Physics and Astronomy at JHU, is the Principal Investigator for
the NVO. Szalay has extensive support from NSF for this work with both the Sloan Digital Sky Survey and the
NVO, encompassing both research and educational outreach. His leadership and eagerness to approach the
Library for data curation provide the inspiration for this proposal. The budget request does not include support
for Szalay because his contributions are consistent with his NSF-supported activity.
Ethan Vishniac, Professor of Physics and Astronomy and Director of the Center for Astrophysical Sciences at
JHU, is the Associate Editor-in-Chief for the Astrophysical Journal. Vishniac has received NSF funding for
work related to this proposal. He will act as the liaison with the publishers. The budget request does not
include support for Vishniac because his contributions are consistent with his editorial role with the Astrophysical
Journal.
Ann Lally, Head of Digital Library Initiatives, and Jennifer Ward, Head of Web Services, at the University of
Washington, will work with DiLauro toward the installation of the Fedora-based appliance at Washington.
Lally, who previously worked with the well-known digital library researcher Hsinchun Chen at the University of
Arizona, played a key role in the examination of digital scholarship. The budget request includes IMLS
funding for 2% FTE salary, fringe benefits and associated indirect costs for both Lally and Ward.
John MacColl, Sub-Librarian, Digital Library Division, at the University of Edinburgh, will work with DiLauro
toward the installation of the Fedora-based appliance at Edinburgh. MacColl leads the JISC-funded STORE
project that will identify scholar needs for data repositories. Additionally, MacColl has recently co-authored a
book on the institutional repository. MacColl will participate in this project without funding from IMLS,
relying upon JISC support instead.
The budget request also includes IMLS funding for portions of three programmers, one from the Libraries at
JHU, and two from Physics and Astronomy at JHU. These programmers, with specific, relevant experience and
expertise from AIHT and NVO, will focus primarily on the programming for the Fedora-based appliance
and for the NVO data ingestion. As mentioned previously, Robert Milkey’s (from AAS) and Julie Steffen’s
(from UCP) letters of support confirm their commitment to this project (without funding from this proposal).
In addition to the personnel outlined in this proposal, there are a few individuals who will work on related
activity with funding from existing sources or newly identified funding. Robert Hanisch, Space Telescope
Science Institute and Project Manager of NVO, Michael Kurtz from the Harvard-Smithsonian Astrophysical
Observatory, and Ray Plante, National Center for Supercomputing Applications (NCSA) will work on
astronomy-specific technical matters. Terry Ehling, Director of Innovative Publishing at Cornell University
Library, who oversees the development of DPubS, an open-source electronic publishing system (http://dpubs.org/),
will consider the connections to electronic publishing systems that will support data deposit procedures at the
point of article creation.
Budget
The total budget request for this proposal is $278,601, with $201,471 requested from IMLS and $77,130 (28%)
offered as cost sharing. The cost sharing takes the form of salary, fringe benefits and associated indirect costs
for the JHU Library staff, plus half of the equipment costs. The equipment comprises three servers and
associated hard disks for the astronomy datasets, to be installed at JHU, Edinburgh and Washington. All senior
personnel from JHU Library offer their
contributions as cost sharing, a decision that reflects the commitment by JHU Library to embrace data curation
as a core activity.
It should also be noted that this project team has approached both the Scholarly Publishing and Academic
Resources Coalition (SPARC at http://www.arl.org/sparc) and Microsoft for complementary funding.
Specifically, potential SPARC funding would support business model and sustainability efforts and Microsoft
funding would support astronomy-specific technical work. We are cautiously optimistic about both sources of
funding (which would support complementary activities, not the ones outlined in this proposal), especially
given feedback from both organizations. Rather than assume that IMLS should fund the entire effort, this
proposal focuses on the library-specific, core aspects of the data curation efforts, which represent the most
appropriate aspects for an NLG proposal.
The travel request of $15,000 may be higher than that of other NLG proposals. This amount reflects the involvement
of an international partner (Edinburgh), and the plans for dissemination at several conferences, including a
relevant, international conference. The budget descriptions include a spreadsheet outlining potential travel
costs.
Management Plan
The project team already holds regular teleconference calls, communicates via email, and convenes in-person
meetings (both at JHU and UCP). This proposal represents a longstanding dialogue and process, and a shared
understanding of needs, goals, objectives and delineation of responsibilities. If funded, this effort will benefit
from this prior collaboration and communication.
Given its existing collaborative grant-funded projects, the Library Digital Programs at JHU has developed an
extensive web-based project management system that includes a Confluence-based wiki for collaboration
(http://wiki.library.jhu.edu), and dotproject (http://www.dotproject.net), an open-source project management
tool. Both of these tools support document sharing, collaborative development of tasks, email notification and
reminders, event and time tracking, automatic Gantt chart generation, and to-do lists. Combined with JHU’s
continuing development of web-based access to administrative information, this project will possess a rich
technology infrastructure to augment the already established human connections. These tools are intended for
internal project management and communication, but they are complemented by a set of external web-based
resources described in the dissemination section.
Dissemination
The Library Digital Programs (LDP) at JHU has recently developed a web portal available at
http://ldp.library.jhu.edu, built using open-source technology. For existing JHU projects and items of interest,
we have developed RSS feeds that can be automatically read through RSS readers or the Firefox browser.
Additionally, LDP staff have used blogs to disseminate information during conferences, provide project updates,
and share ideas, and web-based forums to solicit, document, and respond to feedback from the broader community.
This DSS project will have an individual presence within the overall LDP web portal. The associated forum
will complement the wiki and dotproject communication and collaboration tools described earlier.
The members of the project team have an extensive track record of presentations, publications and involvement
in panels or forums related to digital libraries. It is worth mentioning that Alex Szalay has already presented at
the 2003 Web-Wise Conference regarding the importance of data curation and the possible role for libraries in
this regard. The results from this proposed effort represent an excellent follow-up to Szalay’s presentation.
JHU and the University of Washington are members of both the Coalition for Networked Information and the
Digital Library Federation. Choudhury, DiLauro and Lally have presented at both meetings, and will provide
presentations related to the results of this work. Choudhury attended Ensuring Long-term Preservation and
Adding Value to Scientific and Technical Data (PV 2005 at http://www.ukoln.ac.uk/events/pv-2005/), a
conference focused on international e-science issues and projects. One of the organizers of the conference
asked Choudhury to consider a presentation proposal for a future PV Conference related to the JHU Library’s
work with NVO data curation. Choudhury, DiLauro, and Reynolds have published in several forums including
D-Lib Magazine.
In addition to the digital library related venues and opportunities, we can rely upon the astronomy-related
prospects as well. As an example, the NVO team has published extensively regarding all aspects of their
project, including educational and outreach efforts. A list of these publications is available at
http://www.us-vo.org/pubs/index.cfm.
Sustainability
There are multiple aspects of sustainability associated with this proposal. First, one of the main reasons that the
NVO wishes to work with the Library is that libraries represent a sustainable organizational entity, as compared
to a project-based entity such as NVO. It should be noted that the fiscal year 2007 budget request from JHU
Library includes a request for a digital preservation specialist who would focus on data curation activities across
multiple disciplines. This proposed work would provide the foundation from which to build the expertise,
knowledge and infrastructure for this individual to continue and sustain. Second, the core technology for the
repository-based appliance is Fedora, which has built a diverse user and developer community. By building
upon a technology platform that has widespread adoption and interest, we bolster our possibilities for ongoing
support and development.
Perhaps most importantly, the development of this prototype appliance will provide insights into the costs of
developing, installing, and maintaining such an appliance, in terms of both human and technology infrastructure. UCP
provides appropriate expertise and experience in the development of business and financial models to analyze
and build upon these cost-related findings. UCP already provides the financial and business home for
astronomical journals, so it understands the domain well. This familiarity and understanding provide the final
piece to ensure that the important results and findings from this proposal will continue into the future, so
that scientists can look to libraries for leadership in data curation.