IU Digital Library Brown Bag October 19, 2011

Overview of IU Digital Collections Search

Hui Zhang

Jon Dunn

Indiana University Digital Library Program

Outline

Introduction and motivation – Jon

Demo – Jon

Technical implementation – Hui

Next steps and future work – Jon

Why cross-collection search?

Support discovery across multiple content formats, collections, and repositories at IU

Use cases:

◦ Multiple formats/collections within a single thematic grouping (e.g. Hoagy Carmichael)

◦ Show off the richness and diversity of IU’s digital collections (PR – see open.iu.edu)

◦ Find digital content at IU for teaching or research use

Digital collections evolution:

Discrete collection web sites

Digital collections evolution:

Services

◦ METS Navigator

◦ Archives Online

◦ PhotoCat

◦ Video Streaming Service

◦ Variations

Digital collections evolution:

Services

Advantages

◦ Can develop workflows for content ingestion and description that are both optimized and scalable

◦ Content stored in a common repository (Fedora)

◦ Can develop discovery interfaces optimized for particular content (e.g. images vs. music)

◦ Common services to expose content into other platforms (e.g. Google)

Disadvantages

◦ “Siloing” discovery by content type can be an issue

Cross-collection search:

First iteration

Only selected collections with metadata in Fedora

◦ Includes Archives Online and most image collections

◦ Not video streaming, Variations, encoded text, IUScholarWorks, or various “legacy” collections

Metadata only (MODS)

◦ Stored natively as MODS in Fedora

◦ Disseminated on the fly from other formats (PhotoCat2)

◦ Transformed via XSLT from EAD (Archives Online)

Cross-collection search:

First iteration

Demonstration

Challenge:

Item-level records from EAD

Apache Solr Overview

An open source, Java-based search server that runs as a web application, with Apache Lucene at its core

Demonstration

Solr vs. relational database

• Pros: full-text search, text analysis, flexible fields

• Cons: no relational operations (such as joins) across fields

Solr vs. Lucene

• Pros: web application, centralized configuration, faceting

• Cons: security, slower

Solr Schema and Configuration

Schema: specifies how the index is built

◦ field, field type

◦ dynamicField, copyField, uniqueKey

◦ Text analysis: stop, stem, synonym, tokenization

Configuration: specifies settings for Solr itself, query handling, and data import
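The text-analysis steps named in the schema bullets (tokenization, stop words, stemming, synonyms) can be illustrated with a toy pipeline. This is only a conceptual sketch in Python, not Solr's own analyzers: the stop-word list, the synonym map, and the plural-stripping "stemmer" are all assumptions invented for the example.

```python
import re

# Hypothetical stop-word list and synonym map, chosen only for illustration.
STOP_WORDS = {"a", "an", "and", "of", "the"}
SYNONYMS = {"photo": "photograph"}


def naive_stem(token):
    """Strip a trailing plural 's' (a toy stand-in for a real stemmer)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def analyze(text):
    """Tokenize and lowercase, drop stop words, stem, then map synonyms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [naive_stem(t) for t in tokens]
    return [SYNONYMS.get(t, t) for t in tokens]


if __name__ == "__main__":
    # Indexed text and query text pass through the same chain,
    # so "Photos" in a query can match "Photographs" in a record.
    print(analyze("Photos of the Women Medical Students"))
    print(analyze("Photographs"))
```

Running the same chain at index time and at query time is what lets differently worded text meet on a common token.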

Converting MODS to Solr XML

Solr XML

◦ <add><doc><field>…</field>…</doc></add>

◦ Can simply be POSTed to the Solr index over HTTP

Translation of MODS to Solr XML

◦ Use XSLT

◦ Called by the indexing program

Extract facet values

◦ Format: MODS typeOfResource

◦ Collection: customized based on the item’s Fedora PID

<add>
  <doc>
    <field name="id">iudl:10000</field>
    <field name="title_t">Women Medical Students</field>
    <field name="name_t">Photographic Services, Photographer</field>
    <field name="name_facet">Photographic Services</field>
    <field name="subject_topic_t">Medical students</field>
    <field name="subject_topic_facet">Medical students</field>
    <field name="subject_city_t">Bloomington</field>
    <field name="subject_city_facet">Bloomington</field>
    <field name="subject_state_t">Indiana</field>
    <field name="subject_state_facet">Indiana</field>
    <field name="type_of_resource_t">still image</field>
    <field name="type_of_resource_facet">still image</field>
    <field name="genre_t">Photographs</field>
    <field name="genre_facet">Photographs</field>
    <field name="w3c_taken_date">04-13-1956</field>
    <field name="year">1956</field>
    <field name="item_id">P0028020</field>
    <field name="coll_id_mods">/archives/photos/</field>
  </doc>
</add>
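As a rough illustration of the MODS-to-Solr-XML step, here is a minimal Python sketch. It swaps in the standard-library ElementTree for the XSLT that the slides describe, so it is not the DLP implementation. The sample MODS record, the function names, and the Solr URL are assumptions made up for the example; only the field names come from the record above.

```python
import urllib.request
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# A tiny MODS record invented for illustration.
SAMPLE_MODS = """<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:titleInfo><mods:title>Women Medical Students</mods:title></mods:titleInfo>
  <mods:typeOfResource>still image</mods:typeOfResource>
  <mods:genre>Photographs</mods:genre>
</mods:mods>"""


def mods_to_solr_xml(pid, mods_xml):
    """Build an <add><doc> document from a few MODS elements."""
    mods = ET.fromstring(mods_xml)
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")

    def put(name, value):
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value

    put("id", pid)
    title = mods.findtext("mods:titleInfo/mods:title", namespaces=MODS_NS)
    if title:
        put("title_t", title.strip())
    # Each of these values is indexed twice: a searchable text field
    # (_t) and an exact-value facet field (_facet).
    for path, base in (("mods:typeOfResource", "type_of_resource"),
                       ("mods:genre", "genre")):
        for node in mods.findall(path, MODS_NS):
            value = (node.text or "").strip()
            if value:
                put(base + "_t", value)
                put(base + "_facet", value)
    return ET.tostring(add, encoding="unicode")


def build_update_request(solr_update_url, solr_xml):
    """Prepare (but do not send) the HTTP POST that carries the document."""
    return urllib.request.Request(
        solr_update_url,
        data=solr_xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


if __name__ == "__main__":
    xml = mods_to_solr_xml("iudl:10000", SAMPLE_MODS)
    # Hypothetical local Solr URL; passing the request to
    # urllib.request.urlopen would actually send it.
    req = build_update_request("http://localhost:8983/solr/update", xml)
    print(req.get_method(), req.full_url)
    print(xml)
```

The sketch stops short of sending the request so that it runs without a Solr server. In a real setup a commit would also be needed before the new document becomes searchable.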

Solr Indexing

Carried out by two Java programs running under DLP’s Fedora Index Service framework

The service can be invoked by a RESTful HTTP request; Solr indexing is triggered based on conditions specified in the properties file

The MODS records are either extracted from the Fedora repository (where stored natively) or generated by the getMODS disseminator (PhotoCat2 collections)

Overview of Blacklight

An open source project developed for libraries, with many potential uses:

◦ As a library catalog

◦ As the discovery interface to a digital repository

Optimized to handle diverse content (facet browsing)

Originally developed at the University of Virginia; has a growing community of active contributors and users

Now part of the Hydra Project

Written in Ruby, runs on Rails, requires Solr

Customize Blacklight for DLP Collections

Integrate Blacklight with MODS-based index

◦ Blacklight by default expects MARC fields

New functions and features

◦ Render thumbnail in result view

◦ Use collection website as the landing page

Style and layout

◦ Standard IU banner and footer

◦ Color, font, and window size

Future Improvements

Automatic update of Solr index

◦ Fedora repository notifies the Solr indexing program of item updates via JMS

Include full-text content

◦ It is challenging to have full-text content and metadata in one index

◦ Optimize the indexing and search algorithms

◦ Search against full-text and use metadata as facets
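The last bullet, searching text while using metadata as facets, maps onto Solr's standard faceting parameters (`q`, `facet`, `facet.field`, `facet.mincount`, `fq`). The Python sketch below only builds the query string. The function name is hypothetical, the facet field names are borrowed from the sample record shown earlier, and the filter quoting is naive (values containing double quotes are not escaped).

```python
from urllib.parse import urlencode


def build_facet_query(user_query, facet_fields, filters=None):
    """Return a Solr select query string with facet counts requested."""
    params = [
        ("q", user_query),          # the user's search terms
        ("wt", "json"),             # ask for a JSON response
        ("facet", "true"),          # turn faceting on
        ("facet.mincount", "1"),    # hide facet values with zero hits
    ]
    # One facet.field parameter per metadata field to count.
    params += [("facet.field", name) for name in facet_fields]
    # Each facet value the user has clicked becomes a filter query.
    for field, value in (filters or {}).items():
        params.append(("fq", '{}:"{}"'.format(field, value)))
    return urlencode(params)


if __name__ == "__main__":
    print(build_facet_query(
        "medical students",
        ["genre_facet", "type_of_resource_facet"],
        {"genre_facet": "Photographs"},
    ))
```

The resulting string would be appended to a Solr select URL. Keeping the selected facet in `fq` rather than in `q` narrows the result set without changing how the text match is scored.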

Future Improvements (cont’d)

Add more collections

◦ Other collections from Fedora

◦ Non-Fedora DLP collections

◦ Archives of Institutional Memory

◦ IUScholarWorks Repository?

◦ IUPUI Digital Collections (ContentDM)?

Conduct usability evaluation

Explore integration with new Blacklight-based discovery layer for IUCAT

Variations on Video IMLS grant

◦ Hydra/Blacklight-based discovery on PBCore

Questions?

Beta: http://webapp1.dlib.indiana.edu/dcs/

Send comments to: diglib@indiana.edu
