Final Report: Project to Implement a Scholarly Information

Implementation of a Scholarly Information Portal Using the Open Archives Initiative Protocol for Metadata Harvesting
University of Illinois at Urbana-Champaign
Final Report to the Andrew W. Mellon Foundation
25 July 2003

Timothy W. Cole, Principal Investigator
Thomas G. Habing, Co-Principal Investigator
William H. Mischo, Co-Principal Investigator
Christopher Prom, Co-Principal Investigator
Beth Sandore, Co-Principal Investigator
Joanne Kaczmarek, Visiting Project Coordinator, September 2001-August 2002
Sarah Shreeves, Visiting Project Coordinator, August 2002-October 2002

A Scholarly Information Portal using OAI, University of Illinois - Final Report, July 2003

Executive Summary

In June 2001, the University of Illinois at Urbana-Champaign was awarded funding from the Andrew W. Mellon Foundation to create a portal to facilitate access to scholarly cultural heritage information. The project was designed to investigate the efficacy of harvesting and aggregating metadata using the Open Archives Initiative (OAI) Protocol for Metadata Harvesting (PMH). Between its start date and its conclusion in May 2003, the project accomplished the following:

- Developed robust, scalable tools supporting the harvest and aggregation of metadata.
- Investigated issues relating to value-added metadata normalization, portal search interface design, presentation of search results, and integration of Encoded Archival Description (EAD) metadata with Dublin Core (DC) metadata.
- Sampled the range of metadata authoring practices in the domain of cultural heritage.
- Demonstrated the technical viability of search and retrieval across an aggregation of descriptive metadata harvested using the OAI-PMH.
- Identified critical issues relating to the use of OAI-PMH in the domain of cultural heritage.
- Tested the usefulness and usability of the search portal with one target user population.
- Implemented infrastructure to maintain the metadata search portal after the end of the project.
This final report provides a summary of activities and results over the entirety of the project, focusing especially on: technical implementation of metadata harvesting; efforts to integrate EAD and DC metadata; creation, development, and testing of the prototype search portal; and sustainability activities. While this report stands on its own as an overview of the project, the emphasis is on activities since publication of the interim project report in August 2002. The reader is referred to that report and to papers and presentations published over the entire course of the project for further details of accomplishments. Project-related papers and presentations through 31 August 2002 were included in the interim report supplement volume. Papers and presentations from 1 September 2002 are included in the supplement volume of this report.

Technical Results: In total, we collected metadata from thirty-nine repositories describing more than 2.5 million discrete items held by approximately 500 institutions worldwide. These institutions included museums, archives, historical societies, and academic and public libraries. Consistent with other OAI-PMH projects, we found the protocol robust and scalable for use with metadata collections of cultural heritage institutions. We demonstrated a capability to harvest hundreds of thousands of metadata records per hour. Harvest rates were found to be determined by the capacity of the metadata provider software. Optimally, a single harvesting workstation runs multiple harvests concurrently. The range of technical capability of those participating in OAI-PMH supports the claim that OAI-PMH is a low-barrier way to implement interoperability; however, we did note a segment of potential content providers not yet technically able to implement OAI-PMH. The soon-to-be-released OAI Static Repository Protocol for Metadata Harvesting promises to lower the technical barrier even further.

Research Investigations: As described in the interim project report, the scalability and robustness of the University of Michigan's DLXS/XPat search engine proved sufficient to search the metadata aggregated for this project; however, the heterogeneity of the metadata records harvested (in spite of the fact that they all described cultural heritage information resources) adversely affected the utility of the search portal. Quality and consistency of harvested metadata varied widely and in surprising ways. Efforts to mitigate the adverse impact of this heterogeneity on the utility of search and discovery services achieved only mixed results. We had success normalizing descriptive attributes such as date and resource content type, but were generally unable to normalize or collate described resources by subject or topic. Attempts to use automated subject analysis tools developed for full-text records (e.g., NCSA's Themeweaver tool) were made difficult by the sparseness of metadata records. Only 15% of the metadata records provided by academic libraries included one or more instances of DC subject or description fields. Even when information was included in these fields, it was generally brief (as would be expected for metadata records). Because the NCSA tool was optimized for longer texts, effective clustering could not be achieved. Additional customization of this tool might yield better outcomes, but generally our results support the hypothesis that human-generated descriptive metadata alone may not be enough for optimum search and discovery. Clearly also, while some enhancements can be made post-harvest, better metadata to start with, created with interoperability in mind, would reduce the heterogeneity of metadata and facilitate aggregation. We also had mixed results in trying to aggregate metadata derived from EAD finding aids with DC metadata.
While we demonstrated the viability of deriving multiple item-level descriptive metadata records from individual EAD finding aid files (while still retaining the ability to show retrieved hits in the context of the full EAD when desired), EAD-derived records were generally sparser than, and different in character from, natively harvested DC records or DC records derived from MARC catalog records. Repetition of information within the EAD description of subordinate component nodes tended to clutter the results returned for naïve searches. EAD authoring practices and traditions are unique. EAD and OAI-PMH are by no means antithetical, but further work is needed to integrate EAD and DC metadata more effectively.

Utility, Usability, & Sustainability: OAI-PMH has lived up to its billing as a low-barrier, low-cost way to harvest and aggregate descriptive metadata. While sufficient for the job, the protocol is focused in scope and rigorous and unambiguous in its requirements. This makes it easy and inexpensive to implement, and cheap to maintain. (The harvesting service implementation developed for this project will require 0.25 FTE recurring staff and amortization of IT hardware to maintain. We have undertaken to maintain our implementation indefinitely following the end of this project.) However, OAI-PMH by itself is not a magic bullet. OAI-PMH cannot compensate for poor-quality metadata. Limited testing with upperclass undergraduates who will soon be teaching high school social studies showed that many search interface and aggregation issues remain. The tendency to associate OAI-PMH almost exclusively with lowest-common-denominator simple DC makes it difficult to implement more advanced search interface features. Content providers should prefer more expressive metadata schemas such as MARC or qualified DC. Our testing also suggests that mixing metadata describing analog resources with metadata describing online resources is often undesirable.
Additional effort is still needed to improve metadata quality and consistency and to find ways to augment human-generated descriptive metadata. In sum, this project has contributed to our collective understanding of what is required to implement and sustain metadata harvesting services based on the OAI-PMH and has been instrumental in shaping and directing the evolution of the protocol and its community of use. OAI-PMH has proven to be technically viable and of good utility. But OAI-PMH is ultimately only as useful as the metadata it transports. The harvesting service and portal created for this project will be continued to provide a testbed for further research. The University of Illinois also has undertaken a similar project in the domain of engineering and science. New, related work has been undertaken at Illinois under the sponsorship of IMLS, NSF / NSDL, and the CIC. There is good opportunity during the next few years for follow-on research that will build on the results of this project.

Table of Contents

Executive Summary
Table of Contents
Summary of Project Activities
  A. Implementation of metadata harvesting service
  B. Efforts to integrate EAD metadata with DC metadata
  C. Creation, development, and testing of search portal
  D. Sustainability
Accomplishments and Activities Listed by Month
Proposal Objectives Accomplished
Metadata Providers Organized by Type of Institution (through March 2003)
Metadata Providers Organized by Type of Institution (April 2003- )
Bibliography of Publications & Selected Presentations
  A. Publications
  B. Presentations

Supplemental Materials (Included in separate volume)

Appendix A: Project proposal (as submitted April 2001)
Appendix B: Papers published since release of interim report
Appendix C: Presentations given since release of interim report
Appendix D: White paper on the usefulness of OAI-PMH based search portal for K-12 teachers
Appendix E: OAI Metadata Harvesting Service Workshop for CIC member libraries
Appendix F: Introduction to OAI-PMH, half-day tutorial given at JCDL 2003
Appendix G: Project software tools available on SourceForge.Net
  OAI-PMH harvesting & metadata provider Tools
  XSLT stylesheets for EAD to DC transformations
  ZMARCO Tool
Appendix H: Screenshots of Project Website
Appendix I: Illustrative Metadata Search

Summary of Project Activities

A.
Implementation of metadata harvesting service

OAI-PMH was designed to facilitate the sharing and discovery of scholarly information resources. Descriptive information (metadata) about many of these resources is contained in databases and XML documents not readily available to or easily indexed by current Web search engines. OAI-PMH can be used to convey metadata describing resources that are analog as well as digital, though the initial impetus for development of the protocol was to convey metadata describing born-digital primary sources (e.g., e-prints). As a starting point for our project we sought metadata describing materials that are culturally significant, such as rare books, manuscripts, and personal papers held by library archives and special collections or in museums and historical societies. We included metadata describing born-digital resources, resources having digital representations (e.g., scanned images and pages of printed or handwritten texts), and resources available only in analog format (e.g., only as hardcopy). We sought metadata in simple DC, DC variants, MARC, and EAD formats. To assess the efficacy of OAI-PMH in the domain of cultural heritage, we also focused on the development of a Web portal through which end-users could search aggregated metadata to discover resources of interest.

Characteristics of metadata aggregation: The metadata aggregated for our project came from 39 metadata providers. However, the number of providers does not offer a full picture of the heterogeneous nature of the aggregation. Figures 1 and 2 show the percentage breakdown of metadata harvested by metadata provider institution type and by type of resource described. Several of the data providers we harvested are aggregators themselves; that is, they collect the metadata they provide from multiple institutions.
Three of the repositories harvested (CIMI, the Online Archives of California, and the Colorado Digitization Project) are large-scale aggregators of metadata. Including them meant that our aggregation contained metadata describing content held in approximately 580 institutions worldwide. In addition to these aggregators, several metadata providers have made available several distinct and separately maintained collections of metadata using the "sets" concept inherent in the OAI-PMH. For example, the University of Tennessee Libraries makes available eleven distinct collections, ranging from an Appalachian photograph collection to scanned images of an emancipation newspaper to electronic theses. A harvesting service may choose to harvest all or just some of these collections. Where appropriate, we harvested from a given provider only those sets of metadata describing cultural heritage resources (broadly defined). The issue remains that each of these collections may use metadata differently from the others. Variations are most notable from institution to institution, but are sometimes present even within a single institution's metadata. Not all the metadata harvested was harvested directly from OAI-PMH compliant sites. EAD files (approximately 8,700), for instance, were obtained by FTP or were captured directly (with owner permission) from archive Web servers. Before indexing such 'captured' metadata records, we transformed them to simple DC as appropriate for the native schema in which we obtained the metadata. Transformed metadata records were then made available from surrogate OAI-PMH metadata provider sites running on University of Illinois servers. Generally, metadata obtained in this way was not updated over the course of the project; however, the use of these surrogate providers allowed us to better test the robustness and scalability of the protocol.

[Figure 1 – Breakdown of metadata providers by type of institution. Pie chart; categories: Academic Libraries, Digital Libraries, Museums/Cultural/Historical Orgs, Public Libraries.]

[Figure 2 – Breakdown of resource types described. Pie chart; categories: Text & Sheet Music, Images, Artifacts, Other.]

The full aggregation for this project contained 1,101,523 original item-level metadata records. Of these records, 339,331, or approximately 30%, provided a direct link to an online resource (e.g., a digitized image or a scanned page of text) via a hyperlink. We also obtained 8,730 EAD finding aid files. Each of these EAD files described a collection of items (e.g., a personal manuscript archive) rather than an individual item. Because metadata provided in Dublin Core format describes individual items, it was necessary to develop automated algorithms to tease out from collection-level EAD descriptions item-level metadata records describing individual collection components (see further discussion below). This process added another 1,524,325 item-level records to our aggregation. Almost none of the primary resources described by these EAD-derived item-level records had digital representations available online.

Metadata harvesting and provider software (see also Appendix G): To acquire metadata records, we developed software tools for harvesting metadata using OAI-PMH. As a way to facilitate participation in our project, we also enhanced tools previously created to help providers make metadata available via OAI-PMH. All software and middleware developed during this project has been released under an approved OpenSource software license and is available for download from the SourceForge.Net OpenSource software repository <http://sourceforge.net/projects/uilib-oai/> (see also Appendix G of this report). Our work spanned two versions of the OAI protocol: version 1.1
(released in July 2001) and version 2.0 (released in June 2002). Harvester and provider tools were built for and tested on both the Microsoft Windows and RedHat Linux operating system platforms, though the greatest range of development was done for the Microsoft Windows platform. The Microsoft Windows implementations rely on standard proprietary components from Microsoft (e.g., Microsoft Internet Information Server, Active Server Pages, and VBScript), while the Linux implementations rely entirely on freely available components (e.g., Apache Webserver, Tomcat Java servlet host, and Java). Provider tools were developed and modularized to support various metadata storage architectures. Our Linux metadata provider assumes metadata items are stored in a JDBC-accessible relational database (e.g., MySQL). Our Microsoft Windows provider tools support three storage architectures: one where metadata resides in an ODBC-accessible relational database like Microsoft Access or SQL Server, one where metadata resides in XML files on the server's file system, and one using a hybrid database-file system approach. For Microsoft Windows we also developed tools that extract metadata from <meta> elements embedded in HTML files and from Z39.50 servers compliant with certain Z39.50 application profiles. This latter tool, called ZMARCO, is a distinct project on SourceForge.net. While ZMARCO demonstrates that an OAI-PMH front-end can be implemented for some Z39.50 applications, the potential of this approach was found to be limited. Only Z39.50 implementations fully compliant with the Bath Profile or a similar Z39.50 application profile expose enough information and functionality to allow the add-on of an OAI-PMH gateway.
Specifically, the Z39.50 implementation must index and make searchable through Z39.50 a unique, persistent record identifier (a surprisingly large number of Z39.50 implementations do not); must return MARC format (or simple DC); must include in each returned record a date field indicating when the catalog record was last touched (i.e., created or modified); and must allow searches that can be used to systematically extract all records in the catalog (e.g., allow search by publication year, assign a publication year to all records, and not impose a cap on the number of records returned per query that is smaller than the result set of the largest single publication year search). The baseline harvesters developed (one for each operating system platform) include extensive feature sets that support (a) full and incremental harvests of complete and set-specific metadata; (b) selective filtering of harvested metadata records (i.e., before saving for purposes of indexing) based on field-specific regular expression pattern matches; and (c) harvests of records in specific metadata schemas. Harvesting schedules and most configuration parameters are controlled through a Web-based interface written in Java. Information about the harvesting activity and records harvested is maintained in a relational database. The harvesters tolerate individual XML metadata records that fail to validate during harvesting, and both can recover from short-duration, non-repeating network and provider service failures. We also have made available XSLT stylesheets for transforming Encoded Archival Description (EAD) metadata into OAI-compatible, simple DC metadata.
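
The field-specific filtering in item (b) above can be illustrated with a short Python sketch. This is not the project's harvester code (which was written in VBScript and Java); the function name matches_filter, the rule format (DC element name mapped to a regular expression), and the sample record are all assumptions made for illustration only.

```python
import re
import xml.etree.ElementTree as ET

# Namespace of the simple Dublin Core elements carried inside oai_dc records.
DC_NS = "http://purl.org/dc/elements/1.1/"


def matches_filter(record_xml, rules):
    """Return True when every rule is satisfied by the record.

    `rules` is a hypothetical structure: {dc_element_name: regex}.
    A rule is satisfied if at least one value of that DC element
    matches the regular expression (re.search semantics).
    """
    root = ET.fromstring(record_xml)
    for field, pattern in rules.items():
        values = [el.text or "" for el in root.iter(f"{{{DC_NS}}}{field}")]
        if not any(re.search(pattern, value) for value in values):
            return False
    return True


# Illustrative record only; not taken from any harvested repository.
SAMPLE_RECORD = """
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Appalachian photograph</dc:title>
  <dc:type>image</dc:type>
</oai_dc:dc>
"""

if __name__ == "__main__":
    # Keep only records whose dc:type is exactly "image" (case-insensitive).
    print(matches_filter(SAMPLE_RECORD, {"type": r"(?i)^image$"}))
```

A harvester applying such a test to every returned record does extra parsing and matching work per record, which is consistent with the observation elsewhere in this report that filtering at harvest time slows harvests.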
Our Linux harvesting tool was adopted for a related Mellon-funded project at the University of Michigan (a number of optimization and reliability improvements have been made to the harvesters in response to initial experiences and feedback from staff involved in the University of Michigan Mellon-funded OAI Harvesting project), and served also as a starting point for similar tools that have been developed elsewhere (e.g., the UCLA–Indiana University–Johns Hopkins OAI sheet music project). Additional harvesting tool components were created and made available for the Microsoft Windows platform. A general-purpose OAI-PMH harvesting DLL (dynamic link library) with an API (application programming interface) was made available to facilitate creation of customized harvesting implementations by third parties. In addition to the more complex, baseline Microsoft Windows harvester described above, a simple command-line harvesting tool (relying on the same DLL) was created to facilitate quick testing and harvesting of OAI-PMH provider services. A companion command-line utility for indexing harvested metadata in Microsoft SQL Server was also created.

Harvesting performance: Testing has shown that harvest times vary according to a few specific parameters. Harvesting time was consistently provider- or network-limited rather than harvester-limited, even when relatively modest harvesting hardware was used (e.g., a Pentium IV Windows 2000 workstation). Assuming no filtering at time of harvesting, up to 10 simultaneous harvests can be conducted from a single workstation without significant impact on the performance of any individual harvesting thread. Moderate to large blocks of records (e.g., 1,000 to 10,000 records transmitted at once) tend to reduce the time needed to harvest a collection.
Thus, harvests performed using the OAI ListRecords command (i.e., multiple records are delivered by the metadata provider in response to each harvester request) are typically an order of magnitude or more faster than those using the OAI GetRecord command (i.e., each record being harvested is requested and delivered individually). Our harvester can be configured to use either method. Filtering of records at time of harvesting (i.e., selectively saving certain records returned by the provider) tends to slow harvest times, sometimes by as much as an order of magnitude. Though harvest times vary due to variations in provider-side performance, typically more than 100,000 records can be harvested in an hour. Assuming five simultaneous harvest threads, this suggests that one workstation could easily harvest 20 million new records daily. Since few available providers have this many records available, it was not possible to verify this number directly; however, a single-threaded harvest of the OCLC theses provider site (more than 4 million records) was accomplished in less than 24 hours. This capacity is encouraging and implies that fairly aggressive harvesting schedules are feasible, even from multiple repositories. It also suggests considerable excess capacity for harvesting the metadata currently available in the cultural heritage domain (at most a few million records distributed across fewer than 100 repositories). Managing OAI metadata harvesting does require ongoing attention and human intervention, although the amount of time required is decreasing as more experience with the protocol is gained and providers become more robust and reliable. Once an initial test harvest of a site has been performed and the site has been incorporated into an established harvesting schedule, we've found that less than one day of staff time per week is needed to deal with anomalies and other problems that arise, even for reasonably ambitious operations harvesting 30 to 50 sites.
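
The batched ListRecords harvesting described above can be sketched as follows. This is a minimal Python illustration that assumes OAI-PMH version 2.0 request arguments and response namespace; it is not the project's harvester. The function names, the injectable fetch callable (any function that takes a URL and returns the response XML as a string), and the example base URL in the usage note are assumptions introduced here.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace of OAI-PMH 2.0 responses.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def list_records_url(base_url, metadata_prefix="oai_dc",
                     set_spec=None, from_date=None, token=None):
    """Build a ListRecords request URL.

    In OAI-PMH 2.0 a resumptionToken is an exclusive argument, so a
    follow-up request carries only the verb and the token.
    """
    if token:
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec      # selective harvesting by set
        if from_date:
            params["from"] = from_date    # incremental harvesting by date
    return base_url + "?" + urlencode(params)


def parse_page(response_xml):
    """Return (record header identifiers, resumption token or None)."""
    root = ET.fromstring(response_xml)
    ids = [el.text for el in root.iter(f"{{{OAI_NS}}}identifier")]
    token_el = root.find(f".//{{{OAI_NS}}}resumptionToken")
    has_token = token_el is not None and token_el.text
    return ids, (token_el.text.strip() if has_token else None)


def harvest(base_url, fetch, **kwargs):
    """Follow resumption tokens until the provider signals the last page.

    `fetch` is a caller-supplied function: URL in, response XML text out.
    """
    url = list_records_url(base_url, **kwargs)
    harvested = []
    while True:
        ids, token = parse_page(fetch(url))
        harvested.extend(ids)
        if not token:          # absent or empty token: list is complete
            return harvested
        url = list_records_url(base_url, token=token)
```

For a live harvest, fetch could be something like `lambda url: urllib.request.urlopen(url).read().decode("utf-8")`, called as `harvest("http://example.org/oai", fetch, set_spec="photos")` (the URL and set name are placeholders). Each request returns a whole block of records, so the number of HTTP round trips is far smaller than one GetRecord request per record.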
Some ongoing effort will always be required to deal with failed harvest jobs, update harvesting schedules and parameters over time, and identify new sites that should be harvested. The last task is currently the most open-ended, since the OAI-PMH community has not yet identified a reliable and systematic way to locate new, potentially relevant provider services.

Indexing and search system technologies: While preliminary estimates of harvesting scalability are encouraging, there remain scalability issues associated with trying to index and effectively search very large collections of aggregated metadata. We are using DLXS/XPat software (version 10) created and maintained by the University of Michigan, currently installed on our dual Pentium IV Linux server, as the indexing and search engine for our aggregated metadata. Though some compromises have been required, this indexing and search application has proven generally very robust and capable. For our application, the only serious limitation discovered to date has been the limit on the size of result set that the application will sort (currently 2,000 records). Improvement in this functionality over time is anticipated. For comparison, and in order to explore other indexing and search features, we have also indexed harvested metadata in Microsoft SQL Server. This application has certain advantages, especially in ranking and in certain full-text truncation and adjacency search features, but single-server implementations cannot search large sets (e.g., over a million metadata items) as fast. Our judgment is that while very large-scale OAI-PMH metadata harvesting services today are straining the limits of current state-of-the-art indexing and search tools, continued progress and improvement in index capacity seem likely.
For some OAI-PMH implementations search-engine limits will be a consideration, but over time this should be less of an issue, and is already not an issue for more modest OAI-PMH projects.

B. Efforts to integrate EAD metadata with DC metadata

As noted above, we obtained 8,730 EAD finding aid files describing cultural heritage collections held by 57 different institutions. Though the EAD schema supports linkages to digitized content, the vast majority of the resources described by finding aids available today are accessible in analog format only, i.e., are available only in hardcopy formats at the holding institutions. In order to index EAD metadata alongside metadata harvested in simple DC and other item-level metadata schemas like MARC, we developed algorithms to tease out from collection-level EAD finding aids item-level metadata records describing attributes of individual items in the collections defined. These item-level metadata records were then made available through surrogate OAI-PMH provider services on University of Illinois servers, harvested, and incorporated in our baseline metadata aggregation and search service.

Analysis of EAD schema and implications for use with OAI-PMH: The Encoded Archival Description (EAD) schema is one of the most widely used metadata formats employed by cultural heritage and archives projects. It is used to encode finding aid level information about a wide range of collective resources (e.g., archival manuscript holdings). EAD acts as a wrapper for collective archival description, but it does so with differing levels of specificity depending on the nature of the collection and the finding aid. Archival practice in constructing finding aids varies widely from institution to institution, and EAD was designed to accommodate differences while encouraging as much uniformity as possible by standardizing commonly used data elements.
There may be hundreds or even thousands of individual item-level information resources described in a single EAD finding aid. These aspects of the EAD metadata schema present challenges when aggregating EAD metadata with DC and MARC metadata, as in an OAI-PMH context. We explored several options for transforming EAD records for use with OAI-PMH, and developed and performed proof-of-concept testing of a procedure for creating useful, item-level simple DC metadata records from native EAD records. An EAD record has two main components: (1) metadata about the finding aid (i.e., the electronic document describing the collection), and (2) metadata about the collection described in the finding aid. Metadata about the finding aid is contained within required elements nested within the high-level element <eadheader>. The information encoded in <eadheader> provides summary information about the finding aid, including title, author, and date of creation. Metadata about the collection described in the finding aid is encoded in the high-level element <archdesc>. This element usually comprises the bulk of the finding aid and may include numerous subelements, most of which are repeatable. In general, however, <archdesc> uses two types of elements: those that describe the collection as a whole (generally immediate children of <archdesc>), and those that describe subordinate components of the collection, such as a series of files, an individual file, or an item (generally children, often multiple times removed, of the <dsc> element, which is itself an immediate child of <archdesc>). The EAD Application Guidelines include two recommended mappings to Dublin Core, one for the finding aid and another for the resources described in the finding aid.
In EAD, metadata about the finding aid is encoded in the <eadheader>, but while mapping <eadheader> attributes to DC can be useful, attributes of the individual resources contained in the collection described by the finding aid generally are not included in this mapping. Researchers searching a broad item-level metadata aggregation resource are likely to be more interested in the individual information resources described in the finding aid than in the finding aid itself. Mapping attributes from the <archdesc> is necessary to capture this information, but the mapping recommended by the EAD Application Guidelines, which is a one-to-one mapping only, focuses entirely on the top-level elements of the <dsc> node. As a result, most pertinent information about the individual items in the collection described is not included in the simple DC record created by this mapping. Neither mapping given by the EAD Application Guidelines will generate records adequate for use in a metadata aggregation containing a large number of DC and MARC item-level metadata records. We therefore chose to create an alternative, one-to-many mapping that mapped both top-level and <dsc> child elements of <archdesc> to DC. The issue of where (provider or harvester side) to transform XML metadata in EAD into Dublin Core remains unresolved. An advantage of mapping between XML metadata schemas on the harvesting side is that the harvester would have the option to display metadata records in native context. Thus, while item-level records generated from an EAD finding aid might be the most useful for indexing, search, and discovery, it can be desirable for the end user to be able to view (on the harvesting service site) item-level records in the context of the original EAD finding aid from which they were derived. This is the model we investigated.
Implementation Details: In essence, we decided that in order to manipulate and search the components of an EAD file, it would be necessary to produce many simple DC item-level records from each EAD source. We performed these transformations using XSLT (stylesheets available from http://uilib-ead.sourceforge.net/). For each EAD file, one XSLT stylesheet produced a "top level" DC record containing the collection-level description, and a second XSLT stylesheet generated a record for each child node of the EAD description of subordinate components (<dsc>) element. Such an approach runs the risk of losing context for the information drawn from the source EAD file. To mitigate this potential problem, we provided enough information to allow our search and discovery service to reconstruct and display hits in the context of the source EAD files. Each item-level DC file also includes a relation element providing a link to the parent node of the derived record in the original EAD file. Location of the hit in the original EAD file is retained implicitly, rather than explicitly, using XPointer syntax. XPointer is a current recommendation of the World Wide Web Consortium and provides syntax for identifying XML fragments using a superset of the XPath syntax. The identifier of each Dublin Core encoded metadata record produced from a subordinate component node within an EAD file points to the exact node of the EAD file that is described by the DC record. For example, an identifier that reads: <dc:identifier>http://xxx/1205.xml#xpointer(//dsc[1]/c01[7])</dc:identifier> points to the seventh <c01> element within the first <dsc> element in the file located at http://xxx/1205.xml. The file mapping that we developed, including the calculation of XPointers, was applied to 8,730 EAD finding aids contributed by 57 institutions. To allow for testing, these finding aids were aggregated into multiple surrogate OAI repositories (by institution) which were then harvested by our OAI harvester. 
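The one-to-many derivation and the XPointer-style identifiers described above can be illustrated with a short sketch. The project itself used XSLT stylesheets; the Python below is only an approximation written for illustration and is not the project's code. It assumes an EAD file without XML namespaces, takes titles only from <did>/<unittitle>, and uses hypothetical names throughout (item_records, resolve, SAMPLE_EAD, and the example.org URL).

```python
import re
import xml.etree.ElementTree as ET

# EAD component tags: unnumbered <c> or numbered <c01> ... <c12>
COMPONENT = re.compile(r"^c(\d\d)?$")

# Hypothetical, heavily abbreviated finding aid used only for illustration
SAMPLE_EAD = """<ead><eadheader><eadid>1205</eadid></eadheader>
<archdesc><did><unittitle>Example Papers</unittitle></did>
<dsc><c01><did><unittitle>Series 1</unittitle></did>
<c02><did><unittitle>Box 1</unittitle></did></c02>
<c02><did><unittitle>Box 2</unittitle></did></c02></c01>
<c01><did><unittitle>Series 2</unittitle></did></c01></dsc></archdesc></ead>"""


def item_records(ead_xml, base_url):
    """Derive one simple record per component node under the first <dsc>."""
    root = ET.fromstring(ead_xml)
    dsc = root.find(".//dsc")
    records = []
    if dsc is None:
        return records

    def walk(parent, parent_path):
        counts = {}  # position of each component among same-named siblings
        for child in parent:
            if not COMPONENT.match(child.tag):
                continue
            counts[child.tag] = counts.get(child.tag, 0) + 1
            path = f"{parent_path}/{child.tag}[{counts[child.tag]}]"
            records.append({
                "title": child.findtext("did/unittitle", default="").strip(),
                # identifier points at the node this record describes
                "identifier": f"{base_url}#xpointer({path})",
                # relation points at the parent node, preserving context
                "relation": f"{base_url}#xpointer({parent_path})",
            })
            walk(child, path)

    walk(dsc, "//dsc[1]")
    return records


def resolve(root, identifier):
    """Follow an identifier produced above back to its node in the EAD tree."""
    path = identifier.split("#xpointer(", 1)[1].rstrip(")")
    steps = path.lstrip("/").split("/")
    node = root.find(".//dsc")  # the first step is always dsc[1] in this sketch
    for step in steps[1:]:
        tag, index = re.match(r"(\w+)\[(\d+)\]", step).groups()
        node = [c for c in node if c.tag == tag][int(index) - 1]
    return node


if __name__ == "__main__":
    for record in item_records(SAMPLE_EAD, "http://example.org/1205.xml"):
        print(record["identifier"], "|", record["title"])
```

The resolve function handles only the restricted path form that item_records emits; it is not a general XPointer processor.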
OAI records for both the top level and the component levels of the finding aid were harvested by our OAI harvester. The procedure produced 8,730 top level records and 1,515,595 component level records. The XPointer created is only useful assuming the original EAD file is an accessible XML resource with a persistent identifier, and assuming scripts are in place that can use the XPointer provided to locate the correct spot in the EAD finding aid file. Though XPointer is now a recommendation of the W3C, there is little in the way of off-the-shelf software that understands XPointer syntax. We created our own server-side scripts (see Figure 3) to utilize XPointer strings generated in deriving item-level metadata records from EAD.

Figure 3: In-context result of a search for "gun control"

(For more details on how this process is currently implemented for use in our search portal, see the JCDL 2002 paper, available at http://dli.grainger.uiuc.edu/publications/jcdl2002/p14-prom.pdf, which was also included in the supplemental materials to the interim report.)

In general, the simple DC metadata records we derived from the child nodes of EAD finding aid <dsc> elements are very brief, typically containing only short, descriptive title and/or name information and little if anything else. DC type information may or may not be present (either explicitly or implicitly), depending (apparently) on whether the EAD author thought it necessary to label the content type of subordinate components. In the context of an overall finding aid, brief descriptive nodes within the <dsc> element make sense, but when indexed and searched alongside richer records, such brief records may be under-represented in search results. 
Conversely, each EAD finding aid, on average, generated large numbers of simple DC metadata records. Depending on how the EAD finding aid was created, many of the simple DC records derived from a given EAD finding aid may be redundant to one degree or another (e.g., many DC records derived from a given finding aid may repeat the same name). Since the nodes from which they are derived were intended to be interpreted in context, the descriptive titles assigned to subordinate component nodes may contain mostly common words. For these reasons, some searches will retrieve a disproportionate number of metadata records derived from a single EAD finding aid. Potentially this difficulty could be obviated by enhancements to the search and retrieval system that merge, for presentation purposes, those retrieved metadata records coming from a single EAD finding aid file. We did not have sufficient resources to pursue this approach. While further refinements to the transformation stylesheets developed for this project remain to be made, work done to date demonstrates an ability to transform between XML metadata formats, such as EAD and DC, and the potential to utilize a variety of metadata formats with OAI, at least from a technical perspective. However, for more effective interoperability across fundamentally different metadata schemas such as EAD and DC, communities of practice need to become more cognizant of the differences between them and adopt best practices that better facilitate interoperability wherever possible.

C. Creation, development, and testing of search portal

The metadata records harvested for this project were extensive in breadth, featuring highly heterogeneous content that originated in different communities. These traits allowed us to examine selected questions about how a harvesting service can best process metadata and present it to the end user. 
Generally this work suggests that metadata authors need to be more cognizant of best practices for interoperability when creating metadata. While a few DC elements (e.g., date and type) lend themselves to normalization that can improve discoverability for end users, normalization of more complex DC elements (e.g., subject and description) is not practical. Better consensus in the community on how to create metadata is needed. Design of a portal for effectively searching harvested metadata in this heterogeneous domain is a challenge. We approached this as an iterative process. Initial search portal design was repeatedly refined based on feedback from librarians and end users. Optimal indexing and search portal design features vary to a significant extent with type of end-user. We undertook limited in-depth testing of our aggregated metadata search service with one user population (middle and high school student teachers) and identified several desirable interface features in that context. Metadata Authoring: Metadata authoring practices, including decisions on how to map into DC, which DC elements to use and how, which controlled or local vocabularies to implement, and how deeply to describe resources, have an impact on both the discoverability and usefulness of the metadata within an aggregated resource like ours. In a paper presented at the 2002 Museums and the Web Conference ("Now That We’ve Found the ‘Hidden Web,’ What Can We Do With It? The Illinois Open Archives Initiative Metadata Harvesting Experience"), we documented the range of variations in the metadata we harvested. 
A more complete tabulation of the frequency with which individual Dublin Core fields are used in each participating repository’s metadata, and the frequency with which fields are repeated within records (on average), is discussed in a project white paper entitled “Analysis of Dublin Core Field Use by Repository Harvested.” This report was included as Appendix H of this project's interim report and is available on the project Website <http://oai.grainger.uiuc.edu/projectinfo.htm>. DC element usage varied greatly, even within communities that had similar cataloging traditions (e.g., from library to library). Many libraries in particular provided metadata records that were much less detailed and informative than the records these same institutions create when cataloging books or other traditional analog information resources. This variability in depth of description and in use of DC elements can create problems when searching and when determining how best to display metadata records. Similar variations in the controlled vocabularies used to describe resources make consistent and thorough collocation of like items difficult. We experimented with normalization techniques for specific fields to minimize the impact of these variations on searching and presentation (see following section), which partially (but only partially) offset some of these limitations. In addition to issues that arose because of differences in the ways communities of practice used, or failed to use, certain DC elements, differences in exactly what was being described by the metadata records harvested also affected the searchability of the metadata aggregation. As noted above, we harvested metadata describing both analog and digital resources. What to describe when creating metadata records for resources that only exist in analog format or only exist in digital format is relatively straightforward. 
Less clear cut is what to describe when creating metadata for an analog resource that has also a digital representation – e.g., an artifact for which a digital image exists. Practice is inconsistent in such cases. Some institutions provide metadata records that describe the analog object and make only passing reference to the digital surrogate (i.e., by providing a URL to the image of the artifact). Other institutions provide metadata records that describe primarily the digital representations of the artifact and only briefly mention the artifact (e.g., in the source field of the records). Figures 4 and 5 show excerpts from metadata records provided by two different museums.

Description: Digital image of a single-sized cotton coverlet for a bed with embroidered butterfly design. Handmade by Anna F. Ginsberg Hayutin.
Source: Materials: cotton and embroidery floss. Dimensions: 71 in. x 86 in. Markings: top right hand corner has 1 ½ in. x ½ in. label cut outs at upper left and right hand side for head board; fabric is woven in a variation of a rib weave; color each of yellow and gray; hand-embroidered cotton butterflies and flowers from two shades of each color of embroidery floss – blue, pink, green and purple and single top 20 in. bordered with blue and black cotton embroidery thread; stitches used for embroidery: running stitch, chain stitch, French knot and back stitches; selvage edges left unfinished; lower edges turned under and finished with large gray running stitches made with embroidery floss.
Format: Epson Expression 836 XL Scanner with Adobe Photoshop version 5.5; 300 dpi; 21-53K bytes. Available via the World Wide Web.
Coverage: --
Date Created: 2001-09-19 09:45:18; Updated: 20011107162451; Created: 200104-05; Created: 1912-1920?
Type: Image

Figure 4: Excerpt of record describing image of a quilt

Description: Materials: Textile--Multi, Pigment--Dye; Manufacturing Process: Weaving--Hand, Spinning, Dyeing, Hand-loomed blue wool and white linen coverlet, worked in overshot weave in plain geometric variant of a checkerboard pattern. Coverlet is constructed from finely spun, indigo-dyed wool and undyed linen, woven with considerable skill. Although the pattern is simpler, the overall craftsmanship is higher than 1934.01.0094A. - D. Schrishuhn, 11/19/99. This coverlet is an example of early "overshot" weaving construction, probably dating to the 1820's and is not attributable to any particular weaver. -- Georgette Meredith, 10/9/1973
Source: --
Format: 228 x 169 x 1.2 cm (1,629 g)
Coverage: Euro-American; America, North; United States; Indiana? Illinois?
Date: Early 19th c. CE
Type: cultural; physical object; original

Figure 5: Excerpt of record describing a quilt for which digital image exists

Metadata Normalization: Normalization can prove an effective means to provide context internally and to enhance discoverability of metadata records in a cross-collection repository; however, it is hard to do in automated fashion for some metadata fields. After considering all the DC elements, we investigated the potential to normalize the content of the Type, Date, Coverage, and Format DC elements as most amenable to normalization. For our metadata aggregation we found that normalization of Type and the temporal aspects of Date and Coverage was beneficial. Format was not normalized, for reasons discussed in the white papers on normalization included in Appendix D of this project's interim report. To effectively normalize metadata, it was necessary to:

- Understand how the element was interpreted by metadata providers and which elements in other metadata formats were mapped to the Dublin Core element.
- Identify which, if any, vocabularies were used by data providers for these fields.
- Determine whether there was an existing controlled vocabulary that could be successfully applied to all metadata providers, or, if not, create a vocabulary specific to our repository.
- Apply normalized vocabulary to the metadata to augment the ‘native’ vocabulary.
- Build mechanisms into the search interface that would take advantage of the normalization.
- Measure how well normalization improved the end user’s ability to discover resources.

These goals translate into a five-step process:

1. Extract and analyze the element values (content).
2. Determine how each element is interpreted and what controlled vocabulary, if any, is used.
3. Determine focus and vocabulary for normalization.
4. Normalize the data.
5. Provide services in the portal based on the normalization process.

Normalization white papers describing this process for Type element normalization and Date/Coverage element normalization were included in Appendix D of the interim report supplemental materials volume and are available on the project Website <http://oai.grainger.uiuc.edu/projectinfo.htm>. The paper presented at the 2002 Museums and the Web Conference ("Now That We’ve Found the ‘Hidden Web,’ What Can We Do With It?: The Illinois Open Archives Initiative Metadata Harvesting Experience") also provides more information about the normalization procedures we applied. Currently, the repository contains metadata for which the Date and Coverage elements have been normalized. The Type normalization process was determined to be redundant with the way we now arrange our collection of aggregated metadata for searching and browsing. (After last portal restructuring, indexes are based on type of material as analyzed top down by collection.) 
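As an illustration of the Date normalization discussed above, the following sketch reduces free-text date values, such as those in the Figure 4 and Figure 5 records, to a coarse range of years. It is not the project's normalization procedure, which is documented in the white papers; the function name normalize_date and the three pattern rules are assumptions chosen for this example, and qualifiers such as "early" or "?" are deliberately ignored.

```python
import re


def normalize_date(value):
    """Return a (start_year, end_year) tuple, or None if no year is found."""
    text = value.strip().lower()

    # Century phrases such as "early 19th c." or "19th century"
    match = re.search(r"(\d{1,2})(?:st|nd|rd|th)\s+c(?:\.|entury)?", text)
    if match:
        century = int(match.group(1))
        return ((century - 1) * 100 + 1, century * 100)

    # Explicit year ranges such as "1912-1920?"
    match = re.search(r"\b(\d{4})\s*-\s*(\d{4})\b", text)
    if match:
        return (int(match.group(1)), int(match.group(2)))

    # Otherwise take the first plausible four-digit year, which also
    # covers timestamps such as "2001-09-19 09:45:18" or "20011107162451"
    match = re.search(r"(?<!\d)(1\d{3}|20\d{2})", text)
    if match:
        year = int(match.group(1))
        return (year, year)

    return None


if __name__ == "__main__":
    for raw in ["2001-09-19 09:45:18", "1912-1920?", "Early 19th c. CE", "n.d."]:
        print(repr(raw), "->", normalize_date(raw))
```

A normalized range of this kind would augment, not replace, the provider's original value, in keeping with the goal of supplementing the native vocabulary.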
Grouping like materials together by collection for browse and search (rather than requiring explicit entry of a type search string) is both more time efficient and more global, as some data providers do not use the Type element. The degree to which end users will find this a more useful way to present our metadata aggregation needs further measurement. Initial Search Portal Interface Design: During the creation and development of our portal (http://nergal.grainger.uiuc.edu/search/), we focused on how to provide searching capability and present aggregated and heterogeneous metadata in a useful way. We examined basic interface features and usability issues that arise when constructing such a portal. Scenario-based design techniques (from the work of J.M. Carroll and others) were used during the initial planning of the portal’s interface. Scenario-based design is an iterative process that takes as its starting point an analysis of the work that targeted end users will conduct using the system. Several possible scenarios were constructed which focused on a variety of end users. The resulting design decisions affected both the preliminary display of the aggregated repository and the search results display options. Basic decisions included the renaming of Dublin Core elements to more commonly used terms ("Creator" to "Author/Artist") and providing a Google-like simple search entry option. Recognizing that the portal provides a gateway to selected resources owned by a large variety of institutions, we were concerned with how to best present the metadata within the context of its owning institution. We have attempted to allow end users to easily move from the metadata to the owning institution or to the digital object itself (if available) by providing hyperlinks to collections as well as online access links in the metadata. In addition, we have added an About Collections page that describes each collection and links to each contributing institution's Website. 
We initially provided a list of data providers as part of the initial results list (on the left-hand side of the screen), but found that this had two problems. The first was that the screen became cluttered and confusing to the end user. We conducted a series of preliminary usability tests with librarians at the University of Illinois and with interface design students in the Graduate School of Library and Information Science (GSLIS). The second group emphasized in particular their preference for a clean and uncluttered display. The second interface problem was that users had difficulty interpreting what the list of institutions meant. Users did not intuit that the repository aggregated resources from many different institutions. As a result of these preliminary usability tests, we removed the list of contributing data providers from the left-hand side and moved instead to an optional grouping (pull down menu) of search results by types of resource described (e.g., text, image, audio, archival, physical objects).

Targeted Usability Testing: We followed up initial usability testing with librarians, library school faculty, and library science students with more in-depth usability testing with a group of 23 college students in an honors-level curriculum and instruction course who were training to become middle school and high school social studies teachers. We chose this group because (1) we did not have the resources to identify and study working educators; (2) the professor and students of this class were willing and in fact eager to participate; and (3) these users were already comfortable using the Internet. During the fall of 2002, students were assigned to use the UIUC Digital Gateway to Cultural Heritage Materials to find primary sources for a lesson plan on a specific social sciences topic, and then to submit short papers about their experience. 
For purposes of this test, we created a duplicate portal for use by these students and we provided them with a unique URL. This enabled us to conduct a transaction log analysis after the test. Before beginning the assignment, users were introduced to the concept of metadata aggregation and were informed that the search portal would provide pointers to digital content held elsewhere. They were also told that some records referred to analog resources. After the students completed the assignment, we conducted focus group interviews. These interviews were taped, transcribed, and coded. We also received copies of the students’ papers (with names removed); however, because the papers reiterated comments made during focus groups, we did not code them. We found that, despite their prior introduction to the nature of the portal, in practice the test group expected all records to point directly to corresponding digital objects. They reported feelings of frustration in finding analog resources when they expected digital resources. This was exacerbated by the large number of item-level records derived from EAD files that described analog resources. Thus, a user who selected a result for “letters from a WWI soldier” might find that the record referred to the holding institution’s finding aid instead of to the letters themselves. Likewise, they reported a significant slowing of their efforts when the pointers (the URLs within the record) went to a top-level or intermediate page, where they might have to resubmit their request using the institution’s own search engine. The lack of a ranking facility in our portal resulted in the test group feeling overwhelmed by the quantity of unsorted results. Because of the lack of consistent metadata caused by variations in controlled vocabularies and disparities in the use of DC, we had enabled greater recall by designing the default search screen as a keyword search on all elements. This exacerbated the lack of a ranking facility. 
In an attempt to address these known limitations we provided an advanced search screen, which included standard methods for refining a search, such as restricting searches to specific groups of fields and setting limits. However, the test group seldom used the advanced search tools, and the few users who did attempt to refine their searches were unfamiliar with the types of entries required by metadata fields like “Format.” This suggests that a robust ranking facility is of great importance. We also found that the test group accorded equal credibility to all contributing collections. They reported that they made no decisions about which items to examine based on the name of the holding institutions. Feelings of frustration around failed searches were directed at the search portal rather than at individual institutions. Thus, users held the portal responsible for the usability of its aggregated metadata, even when that metadata originated elsewhere and remained outside the control of the Illinois project. As a result of these findings, we eliminated EAD records from the UIUC Gateway to Cultural Heritage Resources. We are currently investigating creating a portal that is specific to EAD finding aids that will enable further research into the use of EAD with OAI-PMH. In addition, the tests led to several changes in the interface. We combined the simple and advanced search screens, improved labeling, and combined several resource-type categories into a simpler set of options (see Figure 6).

Figure 6 – Revised search screen

The single Online Access Available link in search results was replaced by two more specifically worded links. (1) View Item was applied to resources that are directly viewable online from the search result. 
(2) Learn more about this item was applied to results that would lead the user to a collection’s web site or to descriptive information about the resource. We also attempted to clarify for users which resources offered direct online access and which did not (see Figure 7). Unfortunately, time did not allow us to conduct a second assessment of the portal’s usefulness once these changes were made. However, we continue to work with OAI-enabled aggregations of metadata, and we expect that future research will build on the baseline work done here.

Figure 7 - Revised wording in search results

Implications for Best Practices: Based on our experience, we feel able to make some observations about what would constitute best practices for both data providers and harvester services. Quality and consistency of metadata provided by metadata providers is key. We suggest that providers attend carefully to the task of assigning metadata to resources. Community metadata standards and controlled vocabularies should be adhered to. In addition, metadata providers need to have a clear understanding of the purpose of the metadata so that it can be valuable to both the local user community and to a wider audience. We also suggest that metadata providers utilize the option to divide their metadata into sets, as provided by the OAI-PMH. While there are any number of logical sets, we have found the most useful to be by subject area, sub-collection, or type of material. Metadata aggregators may also want to use sets to indicate the institutions included within their collection. However, this division of the collection may not be as useful for the end user. OAI harvesting services can be used for resources intended to be shared among narrowly defined community groups. They can also enable a more broad-reaching portal that serves a variety of users. 
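The set-based selective harvesting suggested above can be sketched as the construction of an OAI-PMH ListRecords request. The verb and the metadataPrefix, set, from, and resumptionToken arguments are defined by the protocol; the function name, the base URL, and the set name used below are hypothetical, and the sketch only builds the request URL without issuing it.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None,
                     from_date=None, resumption_token=None):
    """Build a ListRecords request URL for a (hypothetical) provider."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it is sent with the
        # verb alone when continuing an incomplete list
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec    # limit the harvest to one set
        if from_date:
            params["from"] = from_date  # incremental harvest since this date
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    # Incremental harvest of a single set from an assumed provider address
    print(list_records_url("http://example.org/oai",
                           set_spec="images", from_date="2003-06-01"))
```

Omitting from_date requests a full harvest of the set, while supplying it requests only records added or changed since that date.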
Whether a harvesting service targets a narrowly defined community or a broad audience, strategies for scheduling regular harvests and indexing of metadata need to be established. These strategies should take into consideration the frequency of updates made to the metadata provider sites as well as the amount of post-harvesting processing the data will require. While OAI minimizes the need for communications between metadata provider and harvesting service, discussion of scheduling issues with providers is desirable. A clear and obvious finding of our work is that, while the OAI-PMH itself is readily implemented, the challenges posed by large amounts of heterogeneous metadata are significant. Certainly the application of more sophisticated pre-processing tools as well as robust, scalable search tools and ranking of results would make the portal a more effective tool for users. Other options include the development of thematic exhibits (based on human and/or machine analysis of metadata) that would offer glimpses into the range and type of materials available, and offering users the ability to annotate individual records to highlight particularly useful resources. Providing a quick-browse feature to give users a preview of what is (and is not) available in the portal would make it more useful to educators. In general, the interface challenge extends to any tools that help adjust user expectations. The inclusion of EAD finding aids and their decomposed item-level records was an obstacle for these users. They did not understand why the records were included and were confused by opaque labels, such as “Box 23.” Several members of the test group commented that finding aids may be useful for researchers or scholars but not for educators. As a result of the test, we eliminated EAD records from the UIUC Gateway to Cultural Heritage Resources. 
We are currently investigating creating a portal that is specific to EAD finding aids that will enable further research into the use of EAD with OAI-PMH.

D. Sustainability

Research conducted during the course of this project has informed multiple follow-on activities at Illinois. Two aspects of sustainability are considered here: i) maintenance and continued development of the prototype portal for searching aggregated metadata describing cultural heritage resources; and ii) identification and exploration of additional, more in-depth research relating to aspects of OAI-PMH. Continuation of prototype search portal: Consistent with results from several other OAI-PMH projects, our research confirmed that an OAI-PMH based harvesting service is relatively easy and inexpensive to implement. Metadata provider services are within the means of many if not most libraries and museums to implement as an adjunct to the Web services they already offer. As more and more OAI-PMH freeware becomes available and as the protocol is integrated into commercial products, basic metadata harvesting services also will be increasingly within the means of many academic libraries to implement and maintain. As this project concluded we made changes in our prototype portal interface design, scope, coverage, and contents to facilitate sustainability. Most notably, we decided that the version of the portal that will be continued should focus on resources and a search interface design appropriate for supporting curricular needs (middle school through undergraduate). Also, the prototype portal developed as part of this project now focuses exclusively on metadata describing readily available digital cultural heritage resources and aggregates only metadata harvestable via OAI-PMH. This reduced the total number of metadata providers to 25 and the total number of metadata records indexed to approximately 412,000 as of June 2003. 
Incremental harvesting of provider services is done every three weeks, and full harvesting of each provider service is done quarterly. Scripts have been written to make routine the steps of harvesting, limited refinement of harvested metadata (e.g., date normalization), and metadata indexing. The Grainger Engineering Library Information Center has dedicated server space and software and ongoing resources equivalent to approximately 0.25 FTE for staff who will monitor harvesting processes (resolving issues as they arise and adding additional provider services that come on line as able). This maintained, live testbed of aggregated metadata, regularly refreshed, will be available to support ongoing research into metadata and metadata harvesting services. For instance, metadata from this harvesting service will be among the resources used by faculty at the Graduate School of Library and Information Science at Illinois to study to what extent and in what ways metadata providers modify their metadata records over time. Though not yet containing a sufficient critical mass of metadata content to qualify as a major search utility for the casual searcher, our portal will be a resource for those interested in more comprehensively identifying online cultural heritage resources. As additional content repositories come online as OAI-PMH metadata providers, the portal's importance in this regard will increase.

Related new research: Several internally and externally funded follow-on research activities have been undertaken at Illinois or are planned for the near future. We continue to study the integration of EAD and OAI. We are maintaining, and doing further research to upgrade and extend, the stylesheets developed for this project to transform EAD finding aids into multiple item-level records more suitable for use with OAI-PMH. 
We are also looking at how proposed changes in EAD authoring practice and in the EAD standard itself might affect ability to integrate EAD metadata with item-level metadata records such as those harvested using OAI-PMH. We will continue to provide feedback to the archival community regarding these issues. This work is centered within our University Archives unit, which also has ongoing research efforts looking at, among other things, the potential uses of OAI-PMH in the context of statewide and national information resource archiving and aggregating initiatives. We have migrated technologies and operational lessons from our Mellon-funded research into other subject domains of interest. Using the harvesting tools we developed to harvest metadata describing cultural heritage resources, the Grainger Engineering Library Information Center has created and is maintaining an OAI-PMH based portal for searching aggregated metadata describing physical science and engineering academic and research information resources. This project has provided an opportunity to experiment with Microsoft SQL server as an alternative tool for searching metadata harvested using OAI-PMH. (See http://g118.grainger.uiuc.edu/engroai/) Later this summer we will begin a three-year collaboration with the research libraries of 9 other Committee on Institutional Cooperation (CIC) institutions to study the potential of OAI-PMH to facilitate resource sharing within an academic consortium like CIC. The University of Illinois Library at Urbana-Champaign will host the harvest and search and discovery services for this project. This work will allow us to explore aspects of using OAI-PMH in a consortial context and will offer a way to explore reinventing the CIC's Virtual Electronic Library in order to better unlock the hidden Web of resources that are available at CIC institutions. 
Among other things, this collaboration will allow the consortium to:
• explore ways to improve access to selected resources at CIC member libraries;
• develop methods to better advertise these resources to end users both within and external to the consortium;
• test approaches for using OAI-PMH with licensed and restricted-access content and metadata;
• prepare member institutions for future grant-mandated OAI-based resource sharing; and
• create a unique metadata testbed and aggregation, useful for a range of more in-depth, fundamental metadata and OAI-PMH research, funded both internally and externally.

Since October 2002, the University of Illinois Library at Urbana-Champaign has also undertaken to build a collection registry and OAI-PMH-based item-level metadata repository describing digital collections created under the auspices of the Institute of Museum and Library Services National Leadership Grant (NLG) program since its inception in 1998. Given the range of content created under the NLG program (though still heavily biased towards the domain of cultural heritage), this project will allow us to greatly expand the scope of our OAI-PMH research and test preliminary models developed in the course of our Mellon-funded project. (See http://imlsdcc.grainger.uiuc.edu/)

These and other projects (e.g., our 2nd Generation Digital Mathematics Resources project being conducted under the auspices of the National Science Digital Library program, see http://nsdl.grainger.uiuc.edu/) suggest that OAI-PMH has matured to the point that it is taking its place as one among several essential tools with the potential to enhance discoverability of digital content.
The challenge has moved from focused proof of concept to learning how best to integrate and use OAI-PMH in conjunction with other tools and protocols (e.g., advanced search architectures, data warehousing models, data mining tools). Considerable work remains to be done in this regard, but a foundation of practice and success has now been laid, and the indications are that OAI-PMH will in fact play an important role in future digital library interoperability implementations.

Accomplishments and Activities Listed by Month

July 2001
• Project Start
• Project Research Programmer hired (YuPing Tseng)
• Work begins on OAI Harvesting tools
• Work begins on updating of OAI Provider tools developed during Alpha Test

August 2001
• Project Website established (http://oai.grainger.uiuc.edu/)
• Meeting in Ann Arbor with Michigan Project Team
• Test harvesting of Illinois & selected OAI-Registered Provider sites
• Project Research Assistant hired (Sarah Shreeves 50%)

September 2001
• Project Coordinator hired (Joanne Kaczmarek)
• Michigan’s DLXS/XPAT indexing software installed
• Initial interface for searching harvested metadata (http://oai.grainger.uiuc.edu/cgi/b/bib/bib-idx) available to project team
• Updated OAI Provider Tools made publicly available
• OAI Workshop at ACM SIGIR (http://bolder.grainger.uiuc.edu/AMSIGIR2001/UIUC_OAIExperiences_files/frame.htm)
• UIUC and Michigan host OAI Provider workshop
• Meeting of Steering Committee for Illinois & Michigan projects

October 2001
• Letter to CIC Library Directors from Paula Kaufman & Bill Gosling
• Began setting up surrogate OAI Provider Sites (see narrative)
• Began acquiring representative EAD finding aids from sites nationwide
• Created XSLT stylesheet (simplistic version) to transform EAD to DC
• Began production harvesting of OAI-registered & surrogate Providers
• Test harvest of relevant sites registered with http://www.openarchives.org

November 2001
• Alpha version of Harvester tools made available to Michigan
• Preliminary mock-ups of public search interface screens developed
• Preliminary Harvester Tools made available to Michigan
• Project update & demonstration at DLF Forum (http://oai.grainger.uiuc.edu/Events/IllinoisOAIMetadataHarvestingService.ppt)
• Began analysis of advanced EAD to Dublin Core transformations

December 2001
• Enhancements to Harvester tools for history and job tracking
• Updated releases of Data Provider tools (http://oai.grainger.uiuc.edu/ProviderTools/)
• Paper accepted for Museums and the Web 2002 conference

January 2002
• Released white paper describing harvester architecture: http://oai.grainger.uiuc.edu/Papers/Harvester_Architecture/Harvester_Architecture.htm
• Released simple search interface for the UIUC Cultural Heritage Repository: http://oai.grainger.uiuc.edu/search
• Developed UIUC normalization vocabulary for Dublin Core (DC) Type element: http://oai.grainger.uiuc.edu/projectinfo.htm
• Developed usability test for repository interface.
• Converted 900,000 Illinois State Library MARC records to DC.

February 2002
• Provided feedback on design of Michigan online survey directed at end users: http://oaister.umdl.umich.edu/surveyreport.html
• Created stylesheet to convert Harvard VIA record format to DC.
• Began usability testing for the repository interface.
• Released advanced search interface for the repository.
• Ported VisualBasic OAI harvester to Java and delivered to Michigan.

March 2002
• Added normalized DC Type element content to harvested records.
• Applied changes to end user interface as suggested by the usability test results.
• Updated Java harvester tools and delivered to Michigan.
• Presented as expert practitioner at the OCLC "Steering by Standards" videoconference on OAI: http://oai.grainger.uiuc.edu/Events/Kaczmarek_OAI_OCLC.ppt
• Presented to Information Systems Research Lab of the Graduate School of Library and Information Science (GSLIS) and the OAI Sheet Music Planning meeting: http://dli.grainger.uiuc.edu/publications/twcole/OAISheetMusic/Cole_OAITools.ppt

April 2002
• Project highlighted in the In Brief column of D-Lib Magazine: http://www.dlib.org/dlib/april02/04inbrief.html#KACZMAREK
• Presented to the CNI Spring Task Force Meeting and to the CIC Library Tech Directors: http://oai.grainger.uiuc.edu/oaicnibeth.ppt
• Presented paper to the Museums and the Web Conference 2002. Paper is published in the print proceedings. Presentation: http://oai.grainger.uiuc.edu/MWOAIblue.ppt. Paper: http://www.archimuse.com/mw2002/papers/cole/cole.html
• Repository interface design reviewed by an MLS class in Interfaces to Information Systems at GSLIS. Interface updated based on suggestions.
• Finalized Open Source license for UIUC.
• Released OAI 1.0 Harvester on SourceForge: http://www.sourceforge.net/projects/uiliboai/
• Developed alpha version of OAI 2.0 Metadata Provider tools for both VisualBasic and Java.
• Implemented OAI 2.0 alpha Metadata Provider service for Illinois (for alpha testing by harvesting services).
• Developed EAD stylesheet for viewing top level and item level records in context of the full finding aid.
• Modified Java harvester tools to handle redirect responses.

May 2002
• Presented project overview at DLF Spring Forum.
• Developed alpha version 2.0 of OAI Harvester tools for both VisualBasic and Java.
• Provided initial data dump for data mining research to NCSA.
• Added annotation box to end-user interface.

June 2002
• Presented at ALA Etext Discussion Group in Atlanta, GA: http://oai.grainger.uiuc.edu/ALA.ppt
• Released OAI 2.0 versions of Metadata Provider and Harvester tools on SourceForge.
• Released Z39.50 OAI Metadata Provider tool.
• Developed and applied Date and Coverage normalization scripts as a post-harvesting process.
• Developed "exhibits" search interface model.
• Began scalability testing.
• Added item level records from the EAD files to the repository. As a result, the number of records in the repository is more than three times what it was previously.

July 2002
• Presented at JCDL in Portland, Oregon: http://dli.grainger.uiuc.edu/publications/jcdl2002/p14-prom_files/frame.htm
• Presented at two Illinois Digitization Institute Workshops at Grainger Library at UIUC.
• Delivered the one-year report to the Mellon Foundation.

August 2002
• Met with staff at NCSA to discuss preliminary results of data mining and to gain better understanding of tools.
• NCSA data mining tools delivered to the Illinois project.
• Presented at the Society of American Archivists in Birmingham, AL.
• Met with Professor Brenda Trofanenko of the Education Department at UIUC to plan work with her Social Studies Education class.
• Met with Tom Peters of the CIC to plan a half-day conference on the Illinois and Michigan OAI projects.
• Sarah Shreeves, former graduate research assistant for project, joined as visiting project coordinator until October 2002.
• Christine Kirkham joined project as graduate research assistant.

September 2002
• Presented an overview of search portal to students in honors-level Curriculum & Instruction course.
• Suspended ongoing harvesting in order to assure static testbed during testing.
• Curriculum and Instruction students began testing usability of portal and utility of aggregated metadata.

October 2002
• With researchers from the University of Michigan’s OAIster project, presented preliminary findings to members of the Committee on Institutional Cooperation (CIC) Digital Library Initiatives Overview Committee. (See Appendix E)
• Presented XML tutorial to the Colorado Digitization Program and Colorado Alliance of Research Libraries at University of Denver: http://oai.grainger.uiuc.edu/Western_Trails.ppt
• Conducted focus groups with Curriculum and Instruction students.
• Project Coordinator Sarah Shreeves left project.

November 2002
• As part of project sustainability effort, submitted proposal for OAI-based CIC metadata harvesting, aggregation, and search service to CIC Library Directors.
• Per usability tests, fixed bugs and identified interface changes for search portal.

December 2002
• Transcribed tapes of usability focus groups and analyzed transaction logs. Implemented new OAI search portal: http://nergal.grainger.uiuc.edu/search/
• Submitted two articles on OAI-PMH to Library Hi Tech.
• Demonstrated redesigned search portal to Curriculum and Instruction students.

January 2003
• Attended the Open Forum on Metadata Registries, Santa Fe, NM.
• Coded transcriptions of focus groups and read student papers for usability testing.

February 2003
• Reinstated harvesting and identified collections for continued harvesting after May 31.
• Submitted short paper, “Utility of an OAI Service Provider Search Portal,” to the 2003 Joint Conference on Digital Libraries (JCDL). (See Appendix B for published paper.)

March 2003
• Continued to make improvements to OAI search portal.

April 2003
• Attended the 5th Annual Illinois State GILS Conference, Lisle, IL.

May 2003
• Presented tutorial, “Introduction to the Open Archives Initiative Protocol for Metadata Harvesting,” at the 2003 JCDL in Houston, TX. (See Appendix F.)
• Presented paper, “Utility of an OAI Service Provider Search Portal,” at 2003 JCDL: http://dli.grainger.uiuc.edu/Publications/JCDL2003/ShreevesJCDL.ppt

Proposal Objectives Accomplished

Our proposal narrative as submitted to the Andrew W. Mellon Foundation in May 2001 (see Appendix A in the supplemental materials volume) laid out several project objectives. Below is a brief summary highlighting the work done to complete each objective.

Objective: Construct and implement a standalone metadata middleware application ("spider") for harvesting metadata using the OAI protocols.
• Harvesters have been implemented at University of Illinois and University of Michigan.

Objective: Demonstrate viability of search and retrieval of metadata harvested using OAI.
• We have successfully demonstrated the technical ability to search and retrieve metadata harvested using OAI. The heterogeneity of the metadata records harvested adversely affected the utility of the search portal. Efforts to mitigate the adverse impact of this heterogeneity on the utility of search and discovery services achieved only mixed results. We had success normalizing descriptive attributes such as date and resource content type, but were generally unable to normalize or collate described resources by subject or topic.

Objective: Investigate feasibility of implementing a variety of basic and advanced search interface & value-added indexing features in the context of OAI metadata harvesting.
• Our repository presents both basic and advanced search functions.
XPat, the index and search engine system for the repository, allows truncation, Boolean, and exact-phrase searching. In addition, users can limit searches by specific date ranges or by type of material. Attempts to use automated subject analysis tools developed for full-text records (e.g., NCSA's Themeweaver tool) were made difficult by the sparseness of metadata records.

Objective: Investigate effective methods of presenting harvested metadata records and the linkages from those records to full-content and related external information resources.
• Clear linkages are provided from metadata records to both collection-level and, when available, item-level information online. Transformations developed for this project allow end users to effectively view metadata describing components contained in EAD finding aids in the context of parent finding aids.

Objective: Identify critical concerns and issues that arise when using OAI to reveal items that are part of scholarly manuscript archives and digitized collections of cultural heritage information.
• Through our work with a number of different cultural heritage communities we have been able to identify some key issues. The quality and consistency of metadata and adherence to community metadata and vocabulary standards (whether museum, archive, library, etc.) are key to facilitating the creation of a useful aggregated database. Strategies to map other metadata schemas to the requisite Dublin Core are essential to cleanly and accurately present metadata from a variety of communities. Issues remain regarding how best to ensure presentation of metadata records in accord with providers’ wishes and how best to deal with providers’ rights and permissions.

Objective: Define areas where "best practice" covenants and conventions could beneficially supplement OAI protocols.
• As outlined above, the use of standards in metadata authoring can only benefit interoperability efforts. In addition, we have found that the use of sets, particularly subject sets, within the OAI protocol aided our indexing efforts.

Objective: Document usage patterns and benefits of OAI metadata harvesting approach through evaluation, interactions with end-users, and analysis of detailed transaction logs.
• Conducted usability and utility study with targeted group of end users through focus groups and transaction log analysis. See Appendix D for full discussion.

Metadata Providers Organized by Type of Institution through March 2003

1,126,789 records (excluding DC records derived from EADs)
39 metadata providers (both OAI-compliant and not)

Museums

American Museum of Natural History
• 2,004 records
• Photographs

Consortium of Museum Intelligence (CIMI) Demonstration Repository
• 197,233 records from 479 institutions (museums and historical societies)
• Artifacts, paintings
• OAI-compliant data provider

Spurlock Museum (University of Illinois at Urbana-Champaign)
• 46,612 records
• Artifacts, images

Academic Libraries

Bentley Historical Library (University of Michigan)
• 412 EAD collection level records
• 187,977 EAD collection and item level records
• EAD finding aids

Cornell University Library Rare & Manuscript Collections
• 88 EAD collection level records
• 34,341 EAD collection and item level records
• EAD finding aids

Harry Ransom Humanities Center (University of Texas)
• 91 EAD collection level records
• 13,574 EAD collection and item level records
• EAD finding aids

Harvard University Libraries
• 150,421 records
    • 418 EAD collection level records
        o 190,540 EAD collection and item level records
    • 150,003 Visual Information Access records
• EAD finding aids, paintings, photographs, slides, and other images

Indiana University Digital Library Project
• 938 records
• Photographs

Iowa Women's Archives (University of Iowa)
• 118 EAD collection level records
• 3,695 EAD collection and item level records
• EAD finding aids

Michigan State University Libraries
• 934 EAD collection level records
• 8,075 EAD collection and item level records
• EAD finding aids

Northwestern University Library
• 3,458 records
• Posters from various collections

Pennsylvania State University Libraries
• 40 records
• Top level information from finding aids

University of Chicago Library
• 100 EAD collection level records
• 27,047 EAD collection and item level records
• EAD finding aids

University of Illinois Library
• 97,927 records
    • 95,712 records - sheet music
    • 1,141 records - Teaching with Digital Content
    • 1,025 records - aerial photographs
    • 49 EAD collection level records
        o 27,564 EAD collection and item level records
• Sheet music collection, photographs, artifacts, EAD finding aids
• OAI-compliant data provider

University of Michigan Digital Library Text Collections
• 114,212 records
• Text
• OAI-compliant data provider

University of Minnesota Libraries
• 14,409 records
    • 14,307 records - IMAGES database
    • 102 EAD collection level records
        o 10,437 EAD collection and item level records
• Paintings, photographs, and other images, EAD finding aids
• OAI-compliant data provider - IMAGES collection only

University of Tennessee Special Collections
• 379 records
• Photographs, text, finding aids (not in EAD)
• OAI-compliant data provider

University of Wisconsin-Madison Library
• 4,244 records
• Photographs, text, EAD finding aids
• OAI-compliant data provider

Washington State University
• 1,600 records
• Photographs

Cultural and Historical Societies

American Numismatic Society
• 2,799 records
• Artifacts - coins
• OAI-compliant data provider

American Philosophical Society
• 6,576 records
• Text (letters, monographs, etc.)
• OAI-compliant data provider

Minnesota Historical Society
• 487 EAD collection level records
• 100,079 EAD collection and item level records
• EAD finding aids

Ohio Historical Society
• 768 records
• Photographs from the OhioPix collection

Public Libraries

Illinois Library System Records provided by the Illinois State Library
• 321,028 records (filtered from 1.1 million metadata records)
    • 252,271 records from the Alliance Library System
    • 68,757 records from the Lincoln Trails Library System
• Monographs

Tacoma Public Library Photograph Collection
• 24,200 records
• Images (photograph collection of Pacific Northwest history)

Digital Collections

Ackerman Archives
• 151 records
• Letters, photographs from a personal archive
• OAI-compliant data provider

AIM25 - Archives in London and the M25 Area
• 4,474 records
• Collection level descriptions of archival collections
• OAI-compliant data provider

Celebration of Women Writers
• 189 records
• Texts
• OAI-compliant data provider

Colorado Digitization Project
• 12,631 records from 17 different institutions
    • Auraria Library
    • Canon City Public Library
    • Colorado College, Tutt Library
    • Colorado State Archives
    • Colorado Historical Society
    • Colorado Springs Pioneers Museum
    • Crow Canyon Archaeological Center
    • Colorado State University Libraries
    • Denver Museum of Nature and Science
    • Lafayette Public Library and Lafayette Miners Museum
    • Larimer County Digitization Project
    • Pikes Peak Library District
    • Pueblo City-County Library District
    • Pueblo Weisbrod Aircraft Museum
    • University of Colorado, Boulder
    • University of Denver
    • University of Northern Colorado
• Photographs

David Rumsey Map Collection
• 6,375 records
• Maps
• OAI-compliant data provider

Formations
• 23 records
• Articles
• OAI-compliant data provider

ibiblio
• 262 records
• Descriptions of web resources

Illinois Alive!
• 186 records
• Descriptions of web resources

Library of Congress American Memory Project
• 103,077 records
• Photographs, images, text
• OAI-compliant data provider

National Library of Australia
• 235 records
• Descriptions of online collections

Online Archive of California
• 5,931 EAD collection level records from 47 different institutions
• 2,265,692 EAD collection and item level records
    • Berkeley Art Museum/Pacific Film Archive
    • California Historical Society
    • California Institute of Technology
    • California State Archives
    • California State Library
    • California State Railroad Museum Library
    • Cal State Chico
    • Cal State Dominguez Hills
    • Cal State Fresno
    • Center for Mennonite Brethren Studies
    • Fowler Museum of Cultural History
    • Fresno City and County Historical Society
    • Gay, Lesbian, Bisexual, Transgender Historical Society
    • Getty Research Institute
    • Graduate Theological Union
    • Grunwald Center for Graphic Arts
    • Historical Sites Society of Arcata
    • Holocaust Center of No. California
    • Hoover Institution
    • Humboldt State University
    • Huntington Library
    • Japanese American National Museum
    • Labor Archives and Research Center
    • Mills College
    • NASA Ames History
    • Oakland Museum of California
    • Pasadena Historical Museum
    • Sacramento Archives and Museum Collection Center
    • San Diego State Univ.
    • San Francisco Maritime National Historical Park
    • San Francisco Public Library
    • San Joaquin County Historical Society and Museum
    • Santa Clara Univ.
    • Sonoma State Univ.
    • Southern California Library for Social Studies
    • Unemployment Insurance Division Library
    • UC Berkeley
    • UC Davis
    • UC Irvine
    • UCLA
    • UC Riverside
    • UC Riverside Museum of Photography
    • UC San Diego
    • UCSF
    • UC Santa Barbara
    • UC Santa Cruz
    • USC
    • Western Jewish History Center
• EAD finding aids

Open Video Project
• 1,654 records
• Moving images from several special collections
• OAI-compliant data provider

Perseus Digital Library
• 1,407 records
• Text
• OAI-compliant data provider

Schoenberg Center for Electronic Text/Image (University of Pennsylvania)
• 54 records
• Text
• OAI-compliant data provider

Metadata Providers Organized by Type of Institution April 2003 and forward

413,563 records from 25 OAI-compliant metadata providers

Academic Libraries

Auburn University Digital Library
• 14 records
• Archival finding aids (collection level)

Indiana University Digital Library Project
• 21,271 records
• Images, sheet music

Michigan State University Libraries
• 1,409 records
• Text

University of Illinois Library
• 99,130 records
• Sheet music collection, images, artifacts, archival finding aids (collection level)

University of Michigan Digital Library Text Collections
• 112,665 records
• Text, images

University of Minnesota Libraries
• 1,786 records
• Paintings, photographs, and other images

University of North Carolina at Chapel Hill Manuscripts
• 4,791 records
• Archival finding aids (collection level)

University of Tennessee Special Collections
• 2,148 records
• Images, text, archival finding aids (collection level)

University of Wisconsin at Madison Library
• 4,151 records
• Images

Cultural and Historical Societies

American Numismatic Society
• 3,127 records
• Artifacts - coins, text

Indiana Historical Society
• 961 records
• Images and artifacts (ephemera)

Digital Collections

Ackerman Archives
• 151 records
• Text, images from a personal archive

AIM25 - Archives in London and the M25 Area
• 5,701 records
• Collection level descriptions of archival collections

Alex Catalogue of Electronic Texts
• 677 records
• Text

Celebration of Women Writers
• 416 records
• Text

Heritage Colorado
• 18,813 records from 17 institutions (academic and public libraries, historical societies)
• Images

Documenting the American South
• 3,161 records
• Images, text

David Rumsey Map Collection
• 2,820 records
• Maps

Formations
• 23 records
• Text

ibiblio
• 418 records
• Descriptions of web resources

Library of Congress American Memory Project
• 126,045 records
• Images, text

Mundus: Gateway to Missionary Collections in the UK
• 447 records
• Archival finding aids (collection level)

Open Video Project
• 1,938 records
• Moving images

Perseus Digital Library
• 1,446 records
• Text

Schoenberg Center for Electronic Text/Image (University of Pennsylvania)
• 54 records
• Text

Bibliography of Publications & Selected Presentations

A. Publications

Cole, T.W., Kaczmarek, J., Marty, P.F., Prom, C.J., Sandore, B. and Shreeves, S.L. 2002. Now that we've found the ‘hidden web’ what can we do with it? The Illinois Open Archives Initiative Metadata Harvesting experience. In D. Bearman and J. Trant (eds.), Museums and the Web 2002: selected papers from an international conference. Pittsburgh, PA: Archives & Museum Informatics, 63-72. Available: http://www.archimuse.com/mw2002/papers/cole/cole.html (accessed 20 July 2003).

Prom, C. J. and Habing, T. G. 2002. Using the Open Archives Initiative Protocols with EAD. In G. Marchionini & W. Hersch (eds.), JCDL 2002: Proceedings of the Second ACM/IEEE-CS Joint Conference on Digital Libraries, July 14-18, 2002. New York: Association for Computing Machinery, 171-180.
Available: http://dli.grainger.uiuc.edu/publications/jcdl2002/p14-prom.pdf (accessed 20 July 2003).

Shreeves, Sarah L., Kirkham, Christine, Kaczmarek, Joanne, and Cole, Timothy W. 2003. Utility of an OAI service provider search portal. In Catherine C. Marshall, Geneva Henry, and Lois Delcambre (eds.), Proceedings of the 2003 Joint Conference on Digital Libraries, May 27-31, 2003. Los Alamitos, CA: Institute of Electrical and Electronics Engineers, Inc., 306-308.

Cole, Timothy W. 2003. OAI: Innovations in the Sharing of Scholarly Information. Library Hi Tech 21, no. 2: 115-117.

Prom, Christopher J. 2003. Reengineering archival access through the OAI protocols. Library Hi Tech 21, no. 2: 199-209.

Shreeves, Sarah L., Kaczmarek, Joanne S., and Cole, Timothy W. 2003. Harvesting cultural heritage metadata using the OAI protocol. Library Hi Tech 21, no. 2: 159-169.

Prom, Christopher J. Forthcoming. Does EAD play well with other metadata standards? Searching and retrieving EAD using the OAI protocols. Journal of Archival Organization 1, no. 3: 51-72.

Cole, Timothy W. and Shreeves, Sarah L. Forthcoming. Lessons learned from the Illinois OAI Metadata Harvesting Project. In Diane Hillman and Elaine Westbrooks (eds.), Metadata in Practice. Chicago, IL: ALA Editions.

B. Presentations

Cole, T.W. and Habing, T.G. September 13, 2001. Experiences Implementing OAI Provider Services. ACM SIGIR: Open Archives: Community, Interoperability, and Services, New Orleans, LA.

Cole, T.W. November 18, 2001. University of Illinois OAI Metadata Harvesting Service. Digital Library Federation Fall Forum, Pittsburgh, PA.

Kaczmarek, J. March 26, 2002. Can OAI Enhance the Quality of Everyday Life? Steering by Standards: OCLC Videoconference: "A New Harvest: Revealing Hidden Resources with the Open Archives Metadata Harvesting Protocol," Dublin, OH.

Cole, T.W. March 28, 2002.
OAI Tools and OAI Protocol Version 2. OAI Standards for Sheet Music - Planning Meeting, Bloomington, IN.

Sandore, B., Cole, T.W., Kaczmarek, J., Mischo, B., Prom, C.J., and Habing, T.G. April 16, 2002. Developing a Domain Specific Metadata Search & Retrieval System Using OAI-PMH. CNI Spring 2002 Task Force Meeting, Washington, D.C.

Cole, T.W., Kaczmarek, J., Sandore, B., Marty, P., Prom, C.J., and Shreeves, S.L. April 18, 2002. Now That We've Found the 'Hidden Web' What Can We Do With It?: The Illinois Open Archives Initiative Metadata Harvesting Experience. Museums and the Web 2002, Boston, MA.

Kaczmarek, J. June 15, 2002. University of Illinois Experiences Using OAI Protocol for Metadata Harvesting. E-Text Discussion Group, American Library Association Annual Meeting, Atlanta, GA.

Prom, C.J. and Habing, T.G. July 16, 2002. Using the Open Archives Initiative Protocols with EAD. 2nd Annual Joint Conference on Digital Libraries, Portland, OR.

Prom, C.J. August 22, 2002. Does EAD Play Well with Other Metadata Standards? Searching and Retrieving EAD Using the OAI Protocols. Society of American Archivists Annual Conference, Birmingham, AL.

Cole, T.W. April 9, 2003. OAI: What it is & what it could mean for GILS projects. 5th Annual States Government Information Locator Service Conference, Lisle, IL.

Shreeves, S.L. May 15, 2003. Green flags and yellow flags: opportunities and challenges of implementing OAI services. Ohio Valley Group of Technical Services Librarians Conference, Terre Haute, IN. (Invited Presentation)

Kirkham, C. and Shreeves, S.L. May 29, 2003. The utility of an OAI service provider search portal. 3rd Joint Conference on Digital Libraries, Houston, TX.

Cole, T.W. June 22, 2003. Using OAI-PMH to aggregate metadata describing cultural heritage resources. Combined American Library Association / Canadian Library Association Annual Conference, Toronto, Canada.
