Institutional Repository Open Source Software Packages: A Comparative Study

Shazia Khan
Junior Research Fellow, Department of Library & Information Science, AMU, Aligarh

Institutional repositories are becoming powerful tools for making an institution's intellectual output freely available on the web. Nearly all leading research organizations, academic institutions and universities across the globe are trying to make their scholarly output openly accessible, and many of them already have institutional repositories functioning on the web. Besides funding and other issues, a number of technological aspects need to be considered when setting up an institutional repository (IR), including hardware and software. Software is the essential part of establishing an institutional repository, and choosing suitable software for developing IRs is a painstaking task. A large number of open source institutional repository software packages are available. The repository manager should be aware of the technological aspects before selecting the software that fits the needs and requirements of the IR.

Open Source Software

Open source software is free to download, and its source code is open to the public for modification, improvement and redistribution by the development community. It can be easily downloaded from the web, but it needs some level of expertise to handle. Open source solutions are now widely sought because of their wider benefits compared with proprietary software packages. A number of open source packages are available in every field, including institutional repository development, e.g. DSpace, EPrints, CDSware, i-Tor and Greenstone. It is difficult to choose the appropriate software package for the institutional repository of a given institution or organization.
The present study gives a comparative analysis of four major institutional repository software packages: Fedora, DSpace, EPrints and Greenstone.

Comparative Analysis

The comparison [1] is made under the following headings: General Features, Content Management, Content Acquisition, Classification, Search and Retrieval, Access Control, User Authentication and Authorization, Metadata, Interoperability, Environment and Infrastructure Compatibility, User Interface, Digital Preservation, Import from/Export to, and Other Features.

[1] Comparison criteria are taken from: (a) Repositories Software Survey (2010, November). Retrieved from http://www.rsp.ac.uk/start/software-survey/results-2010/ (b) Masrek, M.N. & Hakimjavadi, H. (2012). Evaluation of three open source software in terms of managing repositories of electronic theses and dissertations: A comparative study. J. Basic Appl. Sci. Res., 2(11), 10843-10852.

1. General Features

Features | DSpace | EPrints | Fedora | Greenstone
Host | MIT & HP Labs | University of Southampton | University of Virginia and Cornell University | University of Waikato, New Zealand
Product type | Open source software | Open source software | Open source software | Open source software
License | BSD License | GNU Public License | Apache License | GNU Public License
Latest version [2] | 3.1 | 3.3.10 | 3.6.2 | 3

Table 1: General Features of Software Packages

[2] Data were collected on 09/03/2013.

2. Content Acquisition

DSpace: The basic entity in DSpace is the item, which contains both metadata and digital content. DSpace allows adding all types of digital documents, ranging from books, reports, journal articles, lecture notes, technical reports, theses, images and audio/video files to data sets. By default, DSpace supports uploading all common formats such as PDF, Microsoft Word, JPEG, TIFF and HTML. Each item receives its own accession number, called an internal ID.

EPrints: The basic entity in EPrints is the data object, which is a record containing metadata.
EPrints supports adding all types of digital documents such as articles, book sections, monographs, conference or workshop items, patents, theses, images and video. By default, EPrints supports all common formats including PDF, JPEG, TIFF, HTML, MPEG and Microsoft Word. EPrints creates a unique numeric ID for each document added to the repository.

Fedora: The basic entity in the Fedora repository system is the digital object. The internal structure of a digital object is determined by Fedora Object XML (FOXML), which is based on the Metadata Encoding and Transmission Standard (METS). Fedora supports uploading conventional digital objects such as books and other text documents, learning objects, geospatial data, images, maps, videos and numeric data sets. Fedora accepts files by MIME type, including text/xml, text/plain and text/html for text, and image/jpeg, image/jp2, image/tiff and audio/mpeg for multimedia. It supports either a custom or a default accession number, and each digital object is identified by a Persistent Identifier (PID).

Greenstone: The basic entity in Greenstone is the document, which is expressed in XML format. The Greenstone Digital Library Software supports adding all types of documents such as books, reports, journal and newspaper articles, notes, learning objects, theses, images, audio/video and visual art files. Greenstone supports uploading several types of digital formats through plug-ins, including ZIP, GA (Greenstone archive), text, HTML, PDF, RTF, image, MP3, OpenDocument, LOM and BibTeX. Greenstone assigns an object identifier (OID) to every digital document added to the repository.

3. Content Management

DSpace: DSpace provides good workflow management. It generates authority files and shows the strength of each collection on the website.

EPrints: EPrints also provides workflow management to some extent. It does not generate authority files.

Fedora: Fedora does not provide any workflow management.
It does not generate authority files.

Greenstone: The Greenstone Digital Library Software does not provide any workflow management. It generates authority files.

4. Classification

DSpace: DSpace supports any administrator-defined controlled vocabulary, but it does not support adding class numbers to digital objects.

EPrints: EPrints supports grouping digital objects according to the Library of Congress subject heading lists.

Fedora: Fedora does not support any classification system out of the box, but it is fully extensible and can accommodate any user-defined classification system.

Greenstone: Greenstone supports entering a classification number.

5. Search and Retrieval

DSpace: DSpace supports full-text searching. It allows Boolean, proximity, advanced, wildcard and fuzzy searches. DSpace supports browsing by title, author, community and collection, year, etc. (extensible).

EPrints: EPrints provides a full-text searching facility. It supports the same kinds of search as DSpace except proximity, wildcard and fuzzy searches. EPrints allows browsing by title, author, collection, subject, year and academic unit (fixed).

Fedora: Fedora has a generic search service, part of the Fedora search framework, that supports full-text searching. It supports all kinds of searches. Fedora allows browsing by title, author, collection, subject, year and academic unit (extensible).

Greenstone: Greenstone also supports full-text searching and all kinds of searches. It allows browsing by title, author, collection, subject and year. Searching is also possible within defined sections of a document (title, chapter, paragraph).

6. Access Control

DSpace: DSpace creates an e-person for every member who registers through the web browser; the registered user's area is called My DSpace. It supports adding, editing and deleting user profiles. DSpace does not keep detailed information on every user registered in the repository.
EPrints: EPrints has a limited set of roles. It allows creating user, editor and repository administrator roles. It also supports adding, editing and deleting user profiles. EPrints keeps detailed information on every user registered in the repository.

Fedora: Fedora supports only one user account, Fedora-Admin, and only the Fedora-Admin user is allowed to carry out transactions in Fedora. It does not keep detailed information on every user.

Greenstone: Greenstone supports adding different users through its web interface, called the 'Collector'. It does not keep detailed information on every user.

7. User Authentication and Authorization

DSpace: DSpace has well-designed authentication and authorization. It has built-in LDAP [3] and Shibboleth [4] authentication mechanisms.

EPrints: EPrints supports setting authorization policies, but with limited support. It has built-in LDAP and add-in Shibboleth authentication mechanisms.

Fedora: Fedora does not support user-level authentication and authorization; only Fedora-Admin can submit documents to the repository. Fedora has built-in LDAP and add-in Shibboleth authentication mechanisms.

Greenstone: Authentication can be set at the collection level as well as at the individual document level, but the feature does not work reliably. Greenstone does not have built-in LDAP or Shibboleth authentication mechanisms.

[3] LDAP: Lightweight Directory Access Protocol is a protocol that enables organizations to arrange and access directory information in a hierarchy.
[4] Shibboleth: The Shibboleth System is a standards-based, open source software package for web single sign-on across or within organizational boundaries.

8. Metadata Formats

DSpace: DSpace uses qualified Dublin Core metadata by default. It also supports simple Dublin Core and METS metadata. It can import and export content in other metadata formats, including MODS and PREMIS, through fully customizable XML.
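To make the distinction between simple and qualified Dublin Core concrete, the following is a minimal sketch in Python that builds one small XML record. It is only an illustration under stated assumptions: the wrapper element name `record`, the function name `build_dc_record` and all field values are hypothetical, and the output is not the import format of DSpace or of any other package discussed here. Only the two namespace URIs are the standard Dublin Core ones.

```python
import xml.etree.ElementTree as ET

# Standard Dublin Core namespaces: the 15 simple elements and the
# qualified terms (refinements such as "issued", a refinement of "date").
DC_NS = "http://purl.org/dc/elements/1.1/"
DCTERMS_NS = "http://purl.org/dc/terms/"

ET.register_namespace("dc", DC_NS)
ET.register_namespace("dcterms", DCTERMS_NS)


def build_dc_record(simple, qualified):
    """Build a hypothetical <record> holding simple and qualified DC fields.

    simple    -- list of (element_name, value) pairs in the dc namespace
    qualified -- list of (term_name, value) pairs in the dcterms namespace
    """
    root = ET.Element("record")  # wrapper name is an assumption
    for name, value in simple:
        ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    for name, value in qualified:
        ET.SubElement(root, f"{{{DCTERMS_NS}}}{name}").text = value
    return root


if __name__ == "__main__":
    record = build_dc_record(
        simple=[("title", "Example thesis title"), ("creator", "Doe, Jane")],
        qualified=[("issued", "2013-03-09")],  # placeholder values only
    )
    print(ET.tostring(record, encoding="unicode"))
```

A simple Dublin Core record would carry only the `dc:` elements; the `dcterms:issued` field shows how a qualifier narrows the meaning of the general date element.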
EPrints: EPrints supports Dublin Core, METS and MPEG-21 metadata. It can also import and export content in other metadata formats through fully customizable XML.

Fedora: Fedora supports Dublin Core, METS, MARC21, MARCXML, MODS, EAD, ONIX and TEI metadata. It can import and export content in any XML format.

Greenstone: Greenstone supports Dublin Core metadata. It also offers fully customizable XML.

9. Interoperability

DSpace: DSpace supports the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and OAI-ORE. It also supports the SWORD protocol, Unicode, SRU/SRW and OpenURL search.

EPrints: EPrints supports OAI-PMH and OAI-ORE. It also supports the SWORD protocol and Unicode, but of the search protocols only OpenURL. It additionally supports PKP harvesting.

Fedora: Fedora supports OAI-PMH and the SWORD protocol. OAI-ORE support is optional, while SRU/SRW and OpenURL search are supported. Unicode is supported in content but not in file names.

Greenstone: Greenstone supports OAI-PMH and OAI-ORE, and also the Z39.50 protocol for harvesting metadata. It supports the SWORD protocol and SRU/SRW search, but not OpenURL search. Its Unicode support is good, and it provides ready-to-use multilingual interfaces that are already translated into many languages.

10. User Interface

DSpace: DSpace has a well-developed user interface concept based on CSS and Manakin templates. CSS (Cascading Style Sheets) is a style sheet language used for describing the look and formatting of a document written in a markup language. Its most common application is to style web pages written in HTML and XHTML. It can also be applied to any kind of XML document.
Manakin (XMLUI) is a web-based user interface for DSpace that introduces a modular interface layer, enabling an institution to customize the interface easily according to the specific needs of a particular repository, community or collection. DSpace also supports localizing the user interface in any language.

EPrints: The user interface of EPrints is easily and completely customizable, provided the end user has knowledge of Perl. EPrints also supports localizing the user interface in any language.

Fedora: Fedora is the weakest in this category. The Fedora interface is not easy for librarians to use, and the software does not provide any help for the end user. It does not provide multilingual support for the user interface.

Greenstone: Greenstone also provides an easily and completely customizable user interface, provided the end user has knowledge of Perl. It provides multilingual support.

11. Digital Preservation

DSpace: DSpace is meant for long-term physical storage and management of digital data in a secure, professionally managed repository, including standard operating procedures such as backup, refreshing of media and disaster recovery.

EPrints: EPrints is not meant for long-term preservation but for providing web access to materials. Tools and services are being developed to support digital preservation for EPrints repositories.

Fedora: Fedora enables long-term preservation of digital objects. It has a rebuilder utility for disaster recovery and data migration.

Greenstone: Greenstone (GSDL) does not support digital preservation.

12. Import from/Export to

All four packages (DSpace, EPrints, Fedora and Greenstone) support batch and bulk importing and exporting of documents as well as metadata. They also support uploading and downloading of compressed files.

13. Environment and Infrastructure Compatibility

Features | DSpace | GNU EPrints | Fedora | Greenstone
Minimum hardware requirements | Not specified | Not specified | Not specified | Not specified
Automatic installation script | No | No | No | Yes. A single installation file installs all related components.
Environments needed for installation | Java, Tomcat and Ant environments need to be set | No environments are required to be set | Java, Tomcat and Fedora home environments need to be set | JRE environment needs to be set
Operating systems on which the software can be installed | Linux, Sun Solaris, IBM AIX, BSD, HP/UX, MS Windows (with limited support), Mac OS | Linux, Solaris, BSD and OS X as well as Windows | Linux, Sun Solaris, IBM AIX, BSD, HP/UX, MS Windows (with limited support), Mac OS | All 32-bit Windows (95/98/2000/XP), all POSIX (Linux/BSD/UNIX-like OSes), OS X
Programming languages used | Java | Perl, with JavaScript and AJAX as scripting languages | Java, with JavaScript and AJAX as scripting languages | C++, Java and Perl
Web server used | Jakarta Tomcat | Apache web server | Jakarta Tomcat | Apache/IIS web server
Ease of system administration | The system administrator can easily configure the software for different users. | EPrints can be configured easily for different users. | Fedora is not meant for different users. Only fedoraAdmin can submit documents to the repository. | Greenstone can be configured for different users, but the feature does not work properly through the 'Collector'.

Table 2: Environment and Infrastructure Compatibility

14. Some Other Features

Features | DSpace | GNU EPrints | Fedora | Greenstone
RSS | Yes | Yes | Yes | Yes
Data migration | No | Yes | No | Yes
Upgrading software | Not simple (requires more knowledge of backend technology) | Slight | Slight | Slight
Help features | DSpace help features provide general help, but the software does not support any technical help feature. | EPrints provides general help features for end users but does not give any technical answers. | No help features are provided with the user interface in Fedora. | Greenstone does not provide an extensive help feature.
Table 3: Other Features

The comparative analysis shows that there is no great distinction among the four institutional repository software packages; each is suitable, in its own way, for building digital libraries and institutional repositories. The results revealed certain strengths and weaknesses of the selected packages, as follows.

Greenstone: Greenstone is very easy to install and works on any version of Windows, UNIX and Mac OS X. It can be a very good solution for small libraries with limited staff and budget. One of its unique characteristics is that it allows browsing by the table of contents and sections of books; hence it is well suited to building collections of digital books. Its weaknesses are that it does not support any workflow and does not allow different authorization policies to be set. It can handle various file formats, but it has no digital preservation capability. It does not support self-archiving.

GNU EPrints: EPrints is chiefly designed for authors to self-archive their pre-prints or post-prints in order to gain more access and visibility for their work. The characteristic that distinguishes it from the other packages is that it tracks all changes and actions on all documents deposited in the repository. One of its drawbacks is limited workflow management. EPrints does not support digital preservation.

Fedora Commons: Fedora is essentially designed for managing a huge number of documents, with long-term preservation of the documents added to the repository.
One of the main features of Fedora is that it allows end users to build a collection of digital objects stored on local servers or on HTTP servers. It can also redirect to an object that is available online elsewhere and provide a link to that object through the Fedora repository; the object does not need to reside on Fedora's own site. The major drawback of the Fedora repository system is that it depends on third-party systems that can be integrated with it, such as Fez [5], Valet, Muradora and Elated, to obtain additional features. Advanced search and other search features require configuring a separate search tool, Fedora GSearch.

[5] The latest version of Fedora does not support Fez.

DSpace: The number of installations shows that DSpace is the most widely used of the open source digital library software packages (OSS-DL) available today. DSpace is considered the best suited and most trusted solution for the long-term preservation of repositories. It enables institutions to offer a range of services: long-term access to the scholarly output of faculty members; increased visibility for faculty members through a good self-archiving facility, which provides an equivalent publishing channel; different authorization policies; and a persistent identifier, i.e. a unique accession number, for each document acquired by the repository. One of the major drawbacks of DSpace is that its installation is complicated.

Pirounakis [6] gave the following guidelines for selecting a suitable package for different organizations:

1. Consider a case where an institution or university needs a digital repository for research papers and dissertations produced by students and staff. In that case, the most appropriate DL system is DSpace, since it by default represents communities (e.g. university departments) and collections (e.g.
papers and dissertations), while the workflow management it supports is important for item submission by individuals.

2. Consider a case where an organization needs one digital collection to publish its digital content in a simple form, within strict time limits. In addition, the organization prefers to integrate the web interfaces of the DL with a portal-like website. In that case the most appropriate DL system is EPrints, since it separates the concerns of presentation and storage, is not bound to specific metadata standards and provides simple web interfaces for the submission and presentation of documents and metadata.

3. Consider a case where an organization is responsible for digitizing collections from libraries, archives and museums and hosting them in a single DL system. The organization has the human resources and the time to customize the DL system and develop extra modules. The highest-priority needs are support for preservation, the use of multiple metadata standards and different formats of digital content. In that case the most suitable DL system is Fedora, since it provides a very customizable modular architecture. Although it does not provide easy-to-use web interfaces or built-in functionality, it is the best choice where many collections and different kinds of material must be hosted.

4. Consider a case where an organization wants to publish books electronically in an easy-to-use, customizable DL system. In that case the most appropriate DL system is Greenstone, since it makes it easy to represent books hierarchically, using tables of contents, while the full text of chapters remains searchable.

[6] Pirounakis, G. & Nikolaidou, M. (n.d.). Retrieved from www.dit.hua.gr/~mara/publications/ideaDL09a.pdf

Conclusion

The comparative analysis of the four open source digital library software products reveals that each package has its own strengths and weaknesses, and none can be called simply good or bad.
Each of these packages was developed to fulfill the requirements of its parent institution, and its source code was then made free and open to the public so that others could modify it and make better use of it. Hence, it is the responsibility of the repository manager to choose the best suited software for establishing the institutional repository and to modify it to meet the repository's requirements.

References

Pirounakis, G. & Nikolaidou, M. (n.d.). Retrieved from www.dit.hua.gr/~mara/publications/ideaDL09a.pdf
Masrek, M.N. & Hakimjavadi, H. (2012). Evaluation of three open source software in terms of managing repositories of electronic theses and dissertations: A comparative study. J. Basic Appl. Sci. Res., 2(11), 10843-10852.
Repositories Software Survey. (2010, November). Retrieved from http://www.rsp.ac.uk/start/software-survey/results-2010/

Websites consulted

www.dspace.org/
www.eprints.org/
www.greenstone.org/
www.duraspace.org/
www.fedora-commons.org/