A brief history of the
Greenstone Digital Library Software
Ian H. Witten and David Bainbridge
University of Waikato, Hamilton, New Zealand
At the time of writing (January 2007) Greenstone—a versatile open source multilingual digital library
environment with over a decade of pedigree—has a user base hailing from over 70 countries, is downloaded 4,500
times a month, runs on all popular operating systems (even the iPod!), and has a reader's interface in over 40
languages. How did this software project and the research team behind it reach this point? Team members often
give anecdotal stories about life behind the scenes at conferences and workshops; this article gives a more
definitive and coherent account of the project.
The New Zealand Digital Library project grew out of research on text compression (Bell et al., 1990) and,
later, index compression (Witten et al., 1994). Around this time we heard of digital libraries, and pointed out the
potential advantages of compression at the first-ever digital library conference (Bell et al., 1994). The New
Zealand Digital Library Project was established in 1995, beginning with a collection of 50,000 computer science
technical reports downloaded from the Internet (Witten et al., 1995). At the time several research groups in
computer science departments were collecting technical reports and making them available on the web: our main
contribution was the use of full-text indexing for effective search. We were assisted by equipment funding from
the New Zealand Lotteries Board and operating funding from the New Zealand Foundation for Research, Science
and Technology (1996–1998 and 2002–2007).
In 1997 we began to work with Human Info NGO to help them produce fully-searchable CD-ROM
collections of humanitarian information. This necessitated making our server (and in particular the full-text search
engine it used), which had been developed under Linux, run on Windows machines—including the early
Windows 3.1 and 3.11 because, although by then obsolete, they were prevalent in developing countries. This was
demanding but largely uninteresting technically: we had to develop expertise in long-forgotten software systems,
and it was hard to find suitable compilers (eventually we obtained a "second-hand" one from a software auction).
The first publicly available CD-ROM, the Humanity Development Library 1.3, was issued in April 1998. A
French collection, UNESCO's Sahel point Doc, appeared a year later; all the documents, along with the entire
interface, help text, and full-text search mechanism, were in French. The first multilingual collection came six
months later: a Spanish/English Biblioteca Virtual de Desastres/Virtual Disaster Collection. Since then about 40
CD-ROM collections have been published. They are produced by Human Info in Romania: we wrote the software
and were heavily involved in preparing the first few CD-ROMs, and then transferred the technology to them so
that they could proceed independently. At this point we realized that we did not aspire to be a digital library site
ourselves, but rather to develop software that others could use for their own digital libraries.
Towards the end of 1997 we adopted the term Greenstone: we decided that "New Zealand Digital Library
Software" was not only clumsy but could impede international acceptance and therefore sought a new
name. "Greenstone" turned out to be an inspired choice: snappy, memorable, and un-nationalistic but with strong
national connotations within New Zealand—a form of nephrite jade, greenstone is a hallowed substance for
Māori, valued more highly than gold. Moreover, it is easy to spell and pronounce. Our earlier Weka (think mecca)
machine learning workbench, an acronym that in Māori spells the name of a flightless native bird, suffers from
being mispronounced weaka by some. And the term Greenstone is not overly common—today we are the number
one Google hit for it. The decision to issue the software as open source, and to use the GNU General Public
Licence, was made around the same time. We did not discuss this with University of Waikato authorities—New
Zealand universities are obsessed with commercialization and we would have been forced into an endless round
of deliberations on commercial licensing—but simply began to release under GPL. Early releases were posted on
our website greenstone.org (which was registered on 13 August 1998), but in November 2000 we moved to the
SourceForge site for distribution (partly due to the per-megabyte charging scheme that our university levied for
both outgoing and incoming web traffic). Our employers were not particularly happy when our licensing fait
accompli became apparent years later, but have grown to accept (and perhaps even appreciate) the status quo
because of our evident international success.
An early in-house project utilizing Greenstone was the Niupepa collection of Māori-language newspapers.
We began the work of OCRing 20,000 page images in 1998, and made an initial demonstration collection. In
2000–2001 we received (retrospective!) funding from the Ministry of Education to continue the work. Virtually
the entire Niupepa was available online early in 2001, but the collection was not officially launched until March
2002 at the Annual General meeting of Te Rūnanga o Ngā Kura Kaupapa Māori (the controlling body of Māori
medium/theology schools). Niupepa is still the largest collection of on-line Māori-language documents, and is
extensively used; Apperley et al. (2002) gives a comprehensive description of how it was developed. On 13
November 2000, in a moving ceremony, the Māori people presented our project with a ceremonial toki (adze) as a
gift in recognition of our contributions to indigenous language preservation (see Figure 1).
In 1999 the BBC in London were concerned about the threat of Y2K bugs on their database of one million
lengthy metadata records for radio and television programmes. They decided to augment their heavy-duty
mainframe database with a fully-searchable Greenstone system that could run on ordinary desktop machines. A
Greenstone collection was duly built and delivered (within two days of receiving the full dataset). We tried to get
them to the point where they could maintain it themselves, but they were not interested: instead we updated it for
them regularly (incidentally providing us with a useful small source of revenue). They eventually moved to
different technology in early 2006, with the aim of making the metadata (and ultimately the programme content)
publicly available online in a way that resembles what Amazon does for books—something that we think requires
a tailor-made portal rather than a general-purpose digital library system.
We became acquainted with UNESCO through Human Info's long-term relationship with them. Although
they supported Human Info's goal of producing humanitarian CD-ROMs and distributing them in developing
countries, UNESCO were really interested in sustainable development, which requires empowering people in
those countries to produce and distribute their own digital library collections—following that old Chinese proverb
about giving a man fish versus teaching him to fish.[1] We had by then transferred our collection-building
technology to Human Info, and tried (though without success) to transfer it to the BBC, but this was a completely
different proposition: to put the power to build collections into the hands of those other than IT specialists,
typically librarians. We began by packaging up our Perl scripts and documenting them so that others could use
them, and slowly, painfully, came to terms with the fact that operating at this level is anathema to librarians. In
2001 we produced a web-based system called the "Collector" that was announced in a paper whose title proudly
proclaimed "Power to the people: end-user building of digital library collections" (Witten et al., 2001). However,
this was never a great success: web-based submission to repository systems (including Greenstone collections) is
commonplace today, but we were trying to allow users to design and configure digital library collections over the
web as well as populate them. The next year we began a Java development that became known as the Greenstone
Librarian Interface (Bainbridge et al., 2003), which grew over the years into a comprehensive system for
designing and building collections and includes its own metadata editor.
From the outset, UNESCO's goal was to produce CD-ROMs containing the entire Greenstone software (not
just individual collections plus the run-time system, as in Human Info's products), so that it could be used by
people in developing countries who did not have ready access to the Internet.[2] These were the tangible outcomes
of a series of small contracts with UNESCO: we felt that the CD-ROMs were more of symbolic than actual
significance because in practice they rapidly became outdated by frequent new releases of the software appearing
on the Internet. They were produced every year from 2002 to 2006. The CD-ROMs contained all the auxiliary
software needed to run Greenstone as well, which are not included in the Internet distributions because they can
be obtained from other sources (links are provided). When we and others started to give workshops, tutorials, and
courses on Greenstone we adopted a policy of putting all instructional material (PowerPoint slides, exercises,
sample files for projects) on a workshop CD-ROM, and began to include this auxiliary material on the UNESCO
distributions. This ultimately led to their downfall, for the company producing the CD-ROMs began to question
the provenance of some of the sample files they contained, and ultimately demanded explicit proof of permission
to reproduce all the information and software. Although everything was, in principle, open source, so much had to
be stripped out that the 2006 CD-ROM distribution was seriously emasculated. CD-ROM distributions for
workshops, however, continue because they are produced on a far more limited scale.
[1] In New Zealand, by the way, they say "give a man a fish and he'll eat for a day; teach a man to fish and he'll sit in a boat and
drink beer for the rest of his life."
[2] Incidentally, UNESCO refused to use our toki logo on the CD-ROMs because they feel that in some developing countries
axes are irrevocably linked to genocide. Our protests that this object is clearly ceremonial fell on deaf ears. Dealing with
international agencies is sometimes very frustrating.
Good documentation was (rightly!) seen by UNESCO as crucial. They were keen to make the Greenstone
technology available in Spanish, French, and Russian (Arabic and Chinese are also official UNESCO languages,
but for some reason never figured in our discussions). We already had versions of the interface in these (and many
other) languages, but UNESCO wanted everything to be translated—not just the documentation, which was
extensive (four substantial manuals) but all the installation instructions, README files, example collections, etc.
We might have demurred had we realized the extent to which such a massive translation effort would threaten to
hobble the potential for future development, and have since suffered mightily in getting everything—including
last-minute interface tweaks—translated for each upcoming UNESCO CD-ROM release. The cumbersome
process of maintaining up-to-date translations in the face of continual evolution of the software—which is, of
course, to be expected in open source systems—led us to devise a scheme for maintaining all language fragments
in a version control system so that the system could tell what needed updating. This resulted in the Greenstone
Translator's Interface, a web portal where officially registered translators can examine the status of the language
interface for which they are responsible, and update it (Bainbridge et al., 2003). Today the interface has been
translated into 43 languages (with a further 8 in progress), 28 of which have a designated volunteer maintainer.
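The idea of letting the system work out which translations need updating can be illustrated with a minimal sketch. It is not the Translator's Interface code: the Fragment class, the needs_update function, the string keys, and the revision numbers are all hypothetical. It assumes only that every language fragment records the version-control revision at which its text last changed.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One interface string plus the revision at which it last changed (hypothetical model)."""
    key: str
    text: str
    revision: int


def needs_update(source, translation):
    """Return the keys whose translation is missing or older than the source text."""
    stale = []
    for key, src in source.items():
        tr = translation.get(key)
        if tr is None or tr.revision < src.revision:
            stale.append(key)
    return sorted(stale)


# Illustrative data only: keys, texts, and revision numbers are invented.
english = {
    "search.button": Fragment("search.button", "Search", 3),
    "help.title": Fragment("help.title", "Help", 7),
    "prefs.language": Fragment("prefs.language", "Interface language", 5),
}
french = {
    "search.button": Fragment("search.button", "Rechercher", 4),
    "help.title": Fragment("help.title", "Aide", 6),
}

stale_keys = needs_update(english, french)
print(stale_keys)
```

Under these assumptions a translator would be shown "help.title" (the English text changed after the French one was written) and "prefs.language" (never translated), while "search.button" is treated as up to date.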
Most people are surprised by the small size of the Greenstone team. Historically, for most of the duration of
the project we have employed 1–2 programmers, although recently the number has crept up to 3–4. Several
faculty involved in aspects of digital library research are associated with the project, but only two have viewed the
Greenstone software as their main interest—partly because although the work is ground-breaking the research
outputs are of questionable value in the university evaluation and promotion process. Graduate students rarely
contribute to the code base directly because of concerns about retaining the production-level code quality and
programming conventions painstakingly acquired over many years, although several students work in areas
cognate to digital libraries. Our external users tend to be librarians rather than software specialists and we have
received few major contributions or bug fixes from them. To summarize, the Greenstone digital library software
has been created by a couple of skilled people working over a 10-year period—and along the way there have been
several changes of personnel. It's amazing what excellent programmers can do.
With UNESCO's encouragement (and occasional sponsorship), we have worked to enable developing
countries to take advantage of digital library technology by running hands-on workshops. This has enabled team
members to travel to many interesting places. In what other area, for example, might a computer science professor
get the opportunity to spend a week giving a course at the UN International Criminal Tribunal for Rwanda in
Arusha, Tanzania, at the foot of Mount Kilimanjaro—or in Havana, Cuba? Recognizing that devolution is
essential for sustainability, we are now attempting to distribute this effort by establishing regional Greenstone
Support Groups: the first, for South Asia, was launched in April 2006.
Greenstone won the 2004 IFIP Namur award, which recognizes recipients for raising awareness
internationally of the social implications of information and communication technologies; and was a finalist for
the 2006 Stockholm Challenge, the world's leading ICT Prize for entrepreneurs who use ICT to improve living
conditions and increase economic growth. Our project received the Vannevar Bush award for the best paper at the
ACM Digital Libraries Conference in 1999, the Literati Club Highly Commended Award in 2003, and the best
international paper award at the Joint Conference on Digital Libraries in 2004.
Greenstone is promoted by UNESCO (Paris) under its Information for All programme. It is distributed with
the FAO's (Rome) Information Management Resource Kit (2005), along with tutorial information on its use. It
forms the basis of the Institute for Information Technology in Education's course on Digital Libraries in
Education (2006). An extensive early description appears in Witten and Bainbridge's book How to build a digital
library (Witten and Bainbridge, 2003). In 2002–2003 our principal developer at that time left the project to form
DL Consulting, an enterprise that specializes in building and customizing Greenstone collections and has won
several awards as the region's fastest-growing exporter and ICT company.
Many early digital library projects focused on interoperability. Although this is clearly a very important
issue, we felt that this attention was premature—we well remember a digital library conference where interest was
so strong that there were two panel discussions on interoperability, the only catch being that they were parallel
sessions, which permitted no, er, interoperability. We adopted the informal motto "first operability, then
interoperability" and focused on other issues such as ingesting documents and metadata in a very wide variety of
formats. More recently we have added many interoperability features, which, as we had expected, were not hard
to retrofit: communication with Z39.50, SRW, OAI-PMH, DSpace, and METS are just a few examples
(Bainbridge et al., 2006).
We continually struggle with the fundamental conflict between stability and evolution. We place a strong
emphasis on backwards compatibility: it is rare for new software releases to have any effect at all on existing
collections, and then only in minor respects. Only recently have we made a concession to hardware obsolescence
by making alterations that no longer allow standard Greenstone collections to be served on Windows 3.1/3.11.
In order to take advantage of new developments in software technology we began a new project, Greenstone
3, which is a complete redesign and reimplementation of the original digital library software (Greenstone 2). It
incorporates all the features of the existing system, and is backwards compatible: that is, it can build and run
existing collections without modification. It is structured as a network of independent modules that communicate
using XML: thus it runs in a distributed fashion and can be spread across different servers as necessary. This
modular design increases the flexibility and extensibility of Greenstone. However, although initial versions of
Greenstone 3 have been released, continual demands from users for further development of Greenstone 2 have
delayed progress on the new version.
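To make the modular design concrete, here is a minimal sketch of two components that exchange XML messages. It is an illustration only: the element and attribute names (message, request, response, param, document) and the functions make_request and search_module are hypothetical, and are not Greenstone 3's actual message format or API.

```python
import xml.etree.ElementTree as ET


def make_request(module, query):
    """Build an XML request addressed to a named module (hypothetical message format)."""
    msg = ET.Element("message")
    req = ET.SubElement(msg, "request", {"to": module, "type": "query"})
    ET.SubElement(req, "param", {"name": "query", "value": query})
    return ET.tostring(msg, encoding="unicode")


def search_module(xml_text, documents):
    """Stand-in for an independent module: parse an XML request, return an XML response."""
    req = ET.fromstring(xml_text).find("request")
    query = req.find("param[@name='query']").get("value").lower()
    msg = ET.Element("message")
    resp = ET.SubElement(msg, "response", {"from": req.get("to")})
    for doc_id, text in documents.items():
        # Naive substring match stands in for a real full-text index.
        if query in text.lower():
            ET.SubElement(resp, "document", {"id": doc_id})
    return ET.tostring(msg, encoding="unicode")


# Illustrative documents only.
docs = {"d1": "Text compression methods", "d2": "Maori-language newspapers"}
reply = search_module(make_request("search", "compression"), docs)
hits = [d.get("id") for d in ET.fromstring(reply).iter("document")]
print(hits)
```

Because the two sides share nothing but the serialized XML string, the same exchange would work whether the module runs in the same process or on another server. That is the property the distributed design relies on.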
Greenstone 3 was originally envisaged purely as a research framework: backwards compatibility would be
possible but required IT skills. We have achieved this aim: it is now much easier for graduate and undergraduate
project students to build upon the digital library core (e.g. the Language Learning Digital Library, Wu and Witten
2006). However, we have found that maintaining two independent versions of Greenstone—in particular,
ensuring backwards compatibility when new and enhanced features are added to Greenstone 2—is beyond our
resources. Consequently we have committed to a new vision: to develop Greenstone 3 to the point that, by default,
its installation and operation are, to the user, indistinguishable from Greenstone 2. This work will be included in the
next release of Greenstone 3, scheduled for March 2007.
REFERENCES
Apperley, M., Keegan, T.T., Cunningham, S.J. and Witten, I.H. (2002) "Delivering the Maori-language
newspapers on the Internet." Rere atu, taku manu! Discovering history, language and politics in the Maori-language newspapers, edited by J. Curnow, N. Hopa and J. McRae. Auckland University Press: 211-232.
Bainbridge, D., Thompson, J. and Witten, I.H. (2003) "Assembling and enriching digital library
collections." Proc Joint Conference on Digital Libraries, Houston, Texas.
Bainbridge, D., Edgar, K.D., McPherson, J.R. and Witten, I.H. (2003) "Managing change in a digital library
system with many interface languages." Proc European Conference on Digital Libraries ECDL2003, Trondheim,
Norway.
Bainbridge, D., Ke, K.-Y.J. and Witten, I.H. (2006) "Document level interoperability for collection
creators." Proc Joint Conference on Digital Libraries, pp. 105-106, Chapel Hill, NC.
Bell, T.C., Moffat, A. and Witten, I.H. (1994) "Compressing the digital library." Proc Digital Libraries '94, pp.
41-46, College Station, Texas, June.
Bell, T.C., Cleary, J.G. and Witten, I.H. (1990) Text compression. Prentice Hall, Englewood Cliffs, NJ.
Witten, I.H., Moffat, A. and Bell, T.C. (1994) Managing gigabytes: compressing and indexing documents
and images. Van Nostrand Reinhold, New York.
Witten, I.H., Cunningham, S.J., Vallabh, M. and Bell, T.C. (1995) "A New Zealand digital library for
computer science research." Proc Digital Libraries '95, pp. 25-30, Austin, Texas, June.
Witten, I.H., Bainbridge, D. and Boddie, S.J. (2001) "Power to the people: end-user building of digital library
collections." Proc Joint Conference on Digital Libraries, Roanoke, VA.
Witten, I.H. and Bainbridge, D. (2003) How to build a digital library. Morgan Kaufmann, San Francisco,
CA.
Wu, S. and Witten, I.H. (2006) "Towards a digital library for language learning." Proc European Conference
on Digital Libraries, Alicante, Spain.
Timeline of significant events
2007
Greenstone distributed with IITE's course Digital Libraries in Education
2006
May Finalist for the Stockholm Challenge
Apr Greenstone Support Group for South Asia launched
2005
Nov Initial release of Greenstone3
Feb Greenstone distributed with FAO's Information Management Resource Kit
2004
Jan IFIP Namur award
2003
Jun DL Consulting incorporated
Begin development of the Greenstone Translator's Interface
2002
Jun First UNESCO Greenstone CD-ROM
Apr Began development of Greenstone3
Mar Official opening of the Niupepa collection
Begin development of the Greenstone Librarian Interface
2001
Development of the Collector
2000
Nov Begin to distribute software on SourceForge
Nov Toki presented to the NZ Digital Library project on behalf of the entire Māori people
Aug Formally established cooperative effort with UNESCO and Human Info NGO
Apr Greenstone mailing list started
1999
Dec BBC collection established
1998
Aug Greenstone.org website established
Apr First CD-ROM collection released: Humanity Development Library
1997
Decision to use the GPL; name "Greenstone" adopted
Began work with Human Info NGO to produce humanitarian CD-ROMs
1995
May Digital library of Computer Science Technical Reports
Greenstone releases
2006
2005
2004
2003
2002
2001
Dec
Oct
2.72
2.71
Mar
2.70
Jan
2.63
Jun
Apr
2.62
2.60
Mar
2.53
Oct
Jun
2.52
2.51
Feb
2.50
Dec
Jun
2.41
2.40
Mar
2.39
Jan
Oct
Jun
2.38
2.37
2.36
May
2.35
Apr
2.33
Feb
2.31
2000
Feb
2.30
Dec
Sep
2.30
2.27
Jul
2.25
Jun
2.23
Jun
2.22
Apr
2.21
Feb
2.12
UNESCO Greenstone CD-ROMs
These contain the entire Greenstone software, and are intended for use in developing countries with limited access to the Internet.
2006
May UNESCO CD-ROM v2.7 (Greenstone v2.70), English/French/Spanish/Russian
2005
May UNESCO CD-ROM v2.6 (Greenstone v2.60), English/French/Spanish/Russian
2004
Mar UNESCO CD-ROM v2.0 (Greenstone v2.50), English/French/Spanish/Russian
2003
Mar UNESCO CD-ROM v1.1 (Greenstone v2.39), English/French/Spanish
2002
Jun UNESCO CD-ROM v1.0 (Greenstone v2.38), English
Human Info NGO CD-ROMs
Prior to the year 2000 we worked with Human Info NGO to help them produce humanitarian CD-ROMs using Greenstone. (Many more
have been produced since; a total of about 40 to date)
2006
2005
2004
Apr
May
???
Appropriate Technology Knowledge Collection
Gender and HIV/AIDS Electronic Library
Textes de Base sur L'Environment au Senegal (French)
Jan
Educational Aids/Lehr- und Lernmittel/Moyens didactiques/Material didáctico v3.0
(English/German/French/Spanish)
Africa Collection for Transition: From Relief to Development v1.01
UNECE Committee for Trade, Industry and Enterprise Development (English/French
/Russian)
INEE Technical Kit on Education in Emergencies and Early Recovery
Nov
Sep
???
Jan
2003
???
Oct
Educational Aids/Lehr- und Lernmittel/Moyens didactiques/Material didáctico
(English/German/French/Spanish)
Education, Work and the Future/Education Travail et Avenir (English/French) v2.0
Revised Curricula for Technical Colleges and Polytechnics
Jul
UNAIDS Library v2.0 (English/French/Spanish/Russian)
May
Biblioteca Virtual de Salud para des Desastres/Health Library for Disasters v2.0
(Spanish/English)
Food and Nutrition Library v2.2
Mar
???
2002
2001
2000
1999
Jan
Educational Aids/Lehr- und Lernmittel/Moyens didactiques/Material didáctico v2.0
(English/German/French/Spanish)
ICT Training Kit and Digital Library for African Educators
v1.0
Aug
Jul
Community Development Library for Sustainable Development and Basic Human Needs v2.1
Food and Nutrition Library v2.0
Mar
UNDP Energy for Sustainable Development Library
Dec
Oct
UNAIDS Library of Current Documents v1.1 (English/French/Spanish/Russian)
East African Development Library
???
Safe Motherhood Strategies (English/French/Spanish)
Jul
Researching Education Development
Jun
Biblioteca Virtual de Salud para des Desastres/Health Library for Disasters (Spanish/English)
Jun
WHO Medicines Bookshelf
Jan
Africa Collection for Transition
Dec
???
World Environmental Library v1.1
Sahel point Doc v2.0 (French)
Jan
Food and Nutrition Library v1.0
Dec
Dec
Medical and Health Library v1.0
Bibliothèque pour le Développement Durable et des Besoins Essentials v1.0 (French)
Nov
Biblioteca Virtual de Desastres/Virtual Disaster Library (Spanish, some English)
1998
???
UNU Collection on Critical Global Issues v2.0
Mar
Sahel point Doc (French)
Feb
Humanity Development Library v2.0
???
Apr
UNU Collection on Critical Global Issues v1.0
Humanity Development Library v1.3
Greenstone workshops
As well as tutorials at conferences in the US and Europe, many workshops have been given on Greenstone in developing countries. Here
are some that have been given by people closely associated with the project; there have been many others. They range from half a day to 6
days; most are 1–3 days. Many have been sponsored by UNESCO.
2007
May Trinidad and Tobago National Library
Feb Vellore, India
2006
Dec Calcutta, India
Dec New Delhi, India
Nov–Dec Kozhikode, India
2005
2004
2003
Oct Vladimir, Russia
Aug Tirunelvelli, India
Jun Hawaii, US
Mar–Apr Madras, India
Mar Durban, South Africa
Feb Bangkok, Thailand
Nov Cape Town, South Africa
Nov–Dec Arusha, Tanzania
Sep Suva, Fiji
Aug Bangalore, India
July Siena, Italy
May Ho Chi Minh City, Vietnam
May Kozhikode, India
??? Bombay, India
Havana, Cuba
??? Trivandrum, Kerala
Aug–Sep Windhoek, Namibia
Jul Suva, Fiji
Jun Cape Town, South Africa
Mar Dakar, Senegal
Mar Cape Town, South Africa
Feb Gaborone, Botswana
Feb Almaty, Kazakhstan
Nov Dakar, Senegal
Nov Suva, Fiji
May Bangalore, India (IISC)
http://wiki.greenstone.org/wiki/gsdoc/others/Greenstone_history.htm
This toki (adze) was a gift from the Māori people in
recognition of our project's contributions to indigenous
language preservation, and resides in the project
laboratory at the University of Waikato. In Māori
culture there are several kinds of toki, with different
purposes. This one is a ceremonial adze, toki pou
tangata, a symbol of chieftainship. The rau (blade) is
sharp, hard, and made of pounamu or greenstone—
hence the Greenstone software, at the cutting edge of
digital library technology. There are three figures
carved into the toki. The forward-looking one looks
out to where the rau is pointing to ensure that the toki
is appropriately targeted. The backward-looking one at
the top is a sentinel that guards where the rau can't
see. There is a third head at the bottom of the handle
which makes sure that the chief's decisions (to which
the toki lends authority) are properly grounded in
reality. The name of this taonga, or art-treasure, is
Toki Pou Hinengaro, which translates roughly as "the
adze that shapes the excellence of thought."
Figure 1. The Greenstone toki