A brief history of the Greenstone Digital Library Software

Ian H. Witten and David Bainbridge
University of Waikato, Hamilton, New Zealand

At the time of writing (January 2007) Greenstone, a versatile open source multilingual digital library environment with over a decade of pedigree, has a user base hailing from over 70 countries, is downloaded 4,500 times a month, runs on all popular operating systems (even the iPod!), and has a reader's interface in over 40 languages. How did this software project and the research team behind it reach this point? At conferences and workshops, team members often tell anecdotes about life behind the scenes; this article gives a more definitive and coherent account of the project.

The New Zealand Digital Library project grew out of research on text compression (Bell et al., 1990) and, later, index compression (Witten et al., 1994). Around this time we heard of digital libraries, and pointed out the potential advantages of compression at the first-ever digital library conference (Bell et al., 1994). The New Zealand Digital Library Project was established in 1995, beginning with a collection of 50,000 computer science technical reports downloaded from the Internet (Witten et al., 1995). At the time several research groups in computer science departments were collecting technical reports and making them available on the web: our main contribution was the use of full-text indexing for effective search. We were assisted by equipment funding from the New Zealand Lotteries Board and operating funding from the New Zealand Foundation for Research, Science and Technology (1996–1998 and 2002–2007).

In 1997 we began to work with Human Info NGO to help them produce fully-searchable CD-ROM collections of humanitarian information.
This necessitated making our server (and in particular the full-text search engine it used), which had been developed under Linux, run on Windows machines, including the early Windows 3.1 and 3.11 because, although by then obsolete, they were prevalent in developing countries. This was demanding but largely uninteresting technically: we had to develop expertise in long-forgotten software systems, and it was hard to find suitable compilers (eventually we obtained a "second-hand" one from a software auction). The first publicly available CD-ROM, the Humanity Development Library 1.3, was issued in April 1998. A French collection, UNESCO's Sahel point Doc, appeared a year later; all the documents, along with the entire interface, help text, and full-text search mechanism, were in French. The first multilingual collection came six months later: a Spanish/English Biblioteca Virtual de Desastres/Virtual Disaster Collection. Since then about 40 CD-ROM collections have been published. They are produced by Human Info in Romania: we wrote the software and were heavily involved in preparing the first few CD-ROMs, and then transferred the technology to them so that they could proceed independently.

At this point we realized that we did not aspire to be a digital library site ourselves, but rather to develop software that others could use for their own digital libraries. Towards the end of 1997 we adopted the term Greenstone: we decided that "New Zealand Digital Library Software" was not only clumsy but could impede international acceptance and therefore sought a new name. "Greenstone" turned out to be an inspired choice: snappy, memorable, and un-nationalistic but with strong national connotations within New Zealand. A form of nephrite jade, greenstone is a hallowed substance for Māori, valued more highly than gold. Moreover, it is easy to spell and pronounce.
Our earlier Weka (think mecca) machine learning workbench, an acronym that in Māori spells the name of a flightless native bird, suffers from being mispronounced weaka by some. And the term Greenstone is not overly common: today we are the number one Google hit for it.

The decision to issue the software as open source, and to use the GNU General Public Licence, was made around the same time. We did not discuss this with University of Waikato authorities (New Zealand universities are obsessed with commercialization and we would have been forced into an endless round of deliberations on commercial licensing) but simply began to release under GPL. Early releases were posted on our website greenstone.org (which was registered on 13 August 1998), but in November 2000 we moved to the SourceForge site for distribution (partly due to the per-megabyte charging scheme that our university levied for both outgoing and incoming web traffic). Our employers were not particularly happy when our licensing fait accompli became apparent years later, but have grown to accept (and perhaps even appreciate) the status quo because of our evident international success.

http://wiki.greenstone.org/wiki/gsdoc/others/Greenstone_history.htm

An early in-house project utilizing Greenstone was the Niupepa collection of Māori-language newspapers. We began the work of OCRing 20,000 page images in 1998, and made an initial demonstration collection. In 2000–2001 we received (retrospective!) funding from the Ministry of Education to continue the work. Virtually the entire Niupepa was available online early in 2001, but the collection was not officially launched until March 2002 at the Annual General Meeting of Te Rūnanga o Ngā Kura Kaupapa Māori (the controlling body of Māori medium/theology schools). Niupepa is still the largest collection of on-line Māori-language documents, and is extensively used; Apperley et al. (2002) give a comprehensive description of how it was developed.
On 13 November 2000, in a moving ceremony, the Māori people presented our project with a ceremonial toki (adze) as a gift in recognition of our contributions to indigenous language preservation (see Figure 1).

In 1999 the BBC in London were concerned about the threat that Y2K bugs posed to their database of one million lengthy metadata records for radio and television programmes. They decided to augment their heavy-duty mainframe database with a fully-searchable Greenstone system that could run on ordinary desktop machines. A Greenstone collection was duly built and delivered (within two days of receiving the full dataset). We tried to get them to the point where they could maintain it themselves, but they were not interested: instead we updated it for them regularly (incidentally providing us with a useful small source of revenue). They eventually moved to different technology in early 2006, with the aim of making the metadata (and ultimately the programme content) publicly available online in a way that resembles what Amazon does for books, something that we think requires a tailor-made portal rather than a general-purpose digital library system.

We became acquainted with UNESCO through Human Info's long-term relationship with them. Although they supported Human Info's goal of producing humanitarian CD-ROMs and distributing them in developing countries, UNESCO were really interested in sustainable development, which requires empowering people in those countries to produce and distribute their own digital library collections, following that old Chinese proverb about giving a man fish versus teaching him to fish.[1] We had by then transferred our collection-building technology to Human Info, and tried (though without success) to transfer it to the BBC, but this was a completely different proposition: to put the power to build collections into the hands of those other than IT specialists, typically librarians.
We began by packaging up our Perl scripts and documenting them so that others could use them, and slowly, painfully, came to terms with the fact that operating at this level is anathema for librarians. In 2001 we produced a web-based system called the "Collector" that was announced in a paper whose title proudly proclaimed "Power to the people: end-user building of digital library collections" (Witten et al., 2001). However, this was never a great success: web-based submission to repository systems (including Greenstone collections) is commonplace today, but we were trying to allow users to design and configure digital library collections over the web as well as populate them. The next year we began a Java development that became known as the Greenstone Librarian Interface (Bainbridge et al., 2003), which grew over the years into a comprehensive system for designing and building collections and includes its own metadata editor.

From the outset, UNESCO's goal was to produce CD-ROMs containing the entire Greenstone software (not just individual collections plus the run-time system, as in Human Info's products), so that it could be used by people in developing countries who did not have ready access to the Internet.[2] These were the tangible outcomes of a series of small contracts with UNESCO: we felt that the CD-ROMs were more of symbolic than actual significance because in practice they rapidly became outdated by frequent new releases of the software appearing on the Internet. They were produced every year from 2002 to 2006.
The CD-ROMs contained all the auxiliary software needed to run Greenstone as well, which is not included in the Internet distributions because it can be obtained from other sources (links are provided).

[1] In New Zealand, by the way, they say "give a man a fish and he'll eat for a day; teach a man to fish and he'll sit in a boat and drink beer for the rest of his life."
[2] Incidentally, UNESCO refused to use our toki logo on the CD-ROMs because they felt that in some developing countries axes are irrevocably linked to genocide. Our protests that this object is clearly ceremonial fell on deaf ears. Dealing with international agencies is sometimes very frustrating.

When we and others started to give workshops, tutorials, and courses on Greenstone we adopted a policy of putting all instructional material (PowerPoint slides, exercises, sample files for projects) on a workshop CD-ROM, and began to include this auxiliary material on the UNESCO distributions. This ultimately led to their downfall, for the company producing the CD-ROMs began to question the provenance of some of the sample files they contained, and ultimately demanded explicit proof of permission to reproduce all the information and software. Although everything was, in principle, open source, so much had to be stripped out that the 2006 CD-ROM distribution was seriously emasculated. CD-ROM distributions for workshops, however, continue because they are produced on a far more limited scale.

Good documentation was (rightly!) seen by UNESCO as crucial. They were keen to make the Greenstone technology available in Spanish, French, and Russian (Arabic and Chinese are also official UNESCO languages, but for some reason never figured in our discussions).
We already had versions of the interface in these (and many other) languages, but UNESCO wanted everything to be translated: not just the documentation, which was extensive (four substantial manuals), but all the installation instructions, README files, example collections, etc. We might have demurred had we realized the extent to which such a massive translation effort would threaten to hobble the potential for future development, and have since suffered mightily in getting everything, including last-minute interface tweaks, translated for each upcoming UNESCO CD-ROM release.

The cumbersome process of maintaining up-to-date translations in the face of continual evolution of the software (which is, of course, to be expected in open source systems) led us to devise a scheme for maintaining all language fragments in a version control system so that the system could tell what needed updating. This resulted in the Greenstone Translator's Interface, a web portal where officially registered translators can examine the status of the language interface for which they are responsible, and update it (Bainbridge et al., 2003). Today the interface has been translated into 43 languages (with a further 8 in progress), 28 of which have a designated volunteer maintainer.

Most people are surprised by the small size of the Greenstone team. Historically, for most of the duration of the project we have employed 1–2 programmers, although recently the number has crept up to 3–4. Several faculty involved in aspects of digital library research are associated with the project, but only two have viewed the Greenstone software as their main interest, partly because, although the work is ground-breaking, the research outputs are of questionable value in the university evaluation and promotion process.
Graduate students rarely contribute to the code base directly because of concerns about retaining the production-level code quality and programming conventions painstakingly acquired over many years, although several students work in areas cognate to digital libraries. Our external users tend to be librarians rather than software specialists and we have received few major contributions or bug fixes from them. To summarize, the Greenstone digital library software has been created by a couple of skilled people working over a 10-year period, and along the way there have been several changes of personnel. It's amazing what excellent programmers can do.

With UNESCO's encouragement (and occasional sponsorship), we have worked to enable developing countries to take advantage of digital library technology by running hands-on workshops. This has enabled team members to travel to many interesting places. In what other area, for example, might a computer science professor get the opportunity to spend a week giving a course at the UN International Criminal Tribunal for Rwanda in Arusha, Tanzania, at the foot of Mount Kilimanjaro, or in Havana, Cuba? Recognizing that devolution is essential for sustainability, we are now attempting to distribute this effort by establishing regional Greenstone Support Groups: the first, for South Asia, was launched in April 2006.

Greenstone won the 2004 IFIP Namur award, which recognizes recipients for raising awareness internationally of the social implications of information and communication technologies; and was a finalist for the 2006 Stockholm Challenge, the world's leading ICT Prize for entrepreneurs who use ICT to improve living conditions and increase economic growth. Our project received the Vannevar Bush award for the best paper at the ACM Digital Libraries Conference in 1999, the Literati Club Highly Commended Award in 2003, and the best international paper award at the Joint Conference on Digital Libraries in 2004.
Greenstone is promoted by UNESCO (Paris) under its Information for All programme. It is distributed with the FAO's (Rome) Information Management Resource Kit (2005), along with tutorial information on its use. It forms the basis of the Institute for Information Technology in Education's course on Digital Libraries in Education (2006). An extensive early description appears in Witten and Bainbridge's book How to build a digital library (Witten and Bainbridge, 2003). In 2002–2003 our principal developer at that time left the project to form DL Consulting, an enterprise that specializes in building and customizing Greenstone collections and has won several awards as the region's fastest-growing exporter and ICT company.

Many early digital library projects focused on interoperability. Although this is clearly a very important issue, we felt that this attention was premature. We well remember a digital library conference where interest was so strong that there were two panel discussions on interoperability, the only catch being that they were parallel sessions, which permitted no inter-interoperability. We adopted the informal motto "first operability, then interoperability," and focused on other issues such as ingesting documents and metadata in a very wide variety of formats. More recently we have added many interoperability features, which, as we had expected, were not hard to retrofit: communication with Z39.50, SRW, OAI-PMH, DSpace, and METS are just a few examples (Bainbridge et al., 2006).

We continually struggle with the fundamental conflict between stability and evolution. We place a strong emphasis on backwards compatibility: it is rare for new software releases to have any effect at all on existing collections, and then only in minor respects.
Only recently have we made a concession to hardware obsolescence by making alterations that no longer allow standard Greenstone collections to be served on Windows 3.1/3.11.

In order to take advantage of new developments in software technology we began a new project, Greenstone 3, which is a complete redesign and reimplementation of the original digital library software (Greenstone 2). It incorporates all the features of the existing system, and is backwards compatible: that is, it can build and run existing collections without modification. It is structured as a network of independent modules that communicate using XML: thus it runs in a distributed fashion and can be spread across different servers as necessary. This modular design increases the flexibility and extensibility of Greenstone. However, although initial versions of Greenstone 3 have been released, continual demands from users for further development of Greenstone 2 have delayed progress on the new version.

Greenstone 3 was originally envisaged purely as a research framework: backwards compatibility would be possible but would require IT skills. We have achieved this aim: it is now much easier for graduate and undergraduate project students to build upon the digital library core (e.g. the Language Learning Digital Library, Wu and Witten 2006). However, we have found that maintaining two independent versions of Greenstone (in particular, ensuring backwards compatibility when new and enhanced features are added to Greenstone 2) is beyond our resources. Consequently we have committed to a new vision: to develop Greenstone 3 to the point that, by default, its installation and operation are, to the user, indistinguishable from Greenstone 2. This work will be included in the next release of Greenstone 3, scheduled for March 2007.

REFERENCES

Apperley, M., Keegan, T.T., Cunningham, S.J. and Witten, I.H. (2002) "Delivering the Maori-language newspapers on the Internet." Rere atu, taku manu! Discovering history, language and politics in the Maori-language newspapers, edited by J. Curnow, N. Hopa and J. McRae. Auckland University Press: 211-232.
Bainbridge, D., Thompson, J. and Witten, I.H. (2003) "Assembling and enriching digital library collections." Proc Joint Conference on Digital Libraries, Houston, Texas.
Bainbridge, D., Edgar, K.D., McPherson, J.R. and Witten, I.H. (2003) "Managing change in a digital library system with many interface languages." Proc European Conference on Digital Libraries ECDL2003, Trondheim, Norway.
Bainbridge, D., Ke, K.-Y.J. and Witten, I.H. (2006) "Document level interoperability for collection creators." Proc Joint Conference on Digital Libraries, pp. 105-106, Chapel Hill, NC.
Bell, T.C., Moffat, A. and Witten, I.H. (1994) "Compressing the digital library." Proc Digital Libraries '94, pp. 41-46, College Station, Texas, June.
Bell, T.C., Cleary, J.G. and Witten, I.H. (1990) Text compression. Prentice Hall, Englewood Cliffs, NJ.
Witten, I.H., Moffat, A. and Bell, T.C. (1994) Managing gigabytes: compressing and indexing documents and images. Van Nostrand Reinhold, New York.
Witten, I.H., Cunningham, S.J., Vallabh, M. and Bell, T.C. (1995) "A New Zealand digital library for computer science research." Proc Digital Libraries '95, pp. 25-30, Austin, Texas, June.
Witten, I.H., Bainbridge, D. and Boddie, S.J. (2001) "Power to the people: end-user building of digital library collections." Proc Joint Conference on Digital Libraries, Roanoke, VA.
Witten, I.H. and Bainbridge, D. (2003) How to build a digital library. Morgan Kaufmann, San Francisco, CA.
Wu, S. and Witten, I.H. (2006) "Towards a digital library for language learning." Proc European Conference on Digital Libraries, Alicante, Spain.
Timeline of significant events

Greenstone distributed with IITE's course Digital Libraries in Education 2007 2006 May Apr Finalist for the Stockholm Challenge Greenstone Support Group for South Asia launched 2005 Nov Feb Initial release of Greenstone3 Greenstone distributed with FAO's Information Management Resource Kit 2004 2002 Jan Jun IFIP Namur award DL Consulting incorporated Begin development of the Greenstone Translator's Interface 2002 Apr Mar Began development of Greenstone3 Official opening of the Niupepa collection Begin development of the Greenstone Librarian Interface Jun 2001 2000 1999 1998 First UNESCO Greenstone CD-ROM Development of the Collector Nov Nov Begin to distribute software on SourceForge Toki presented to the NZ Digital Library project on behalf of the entire Māori people Aug Formally established cooperative effort with UNESCO and Human Info NGO Apr Greenstone mailing list started Dec Aug Apr BBC collection established Greenstone.org website established First CD-ROM collection released: Humanity Development Library Decision to use the GPL; name "Greenstone" adopted 1997 Began work with Human Info NGO to produce humanitarian CD-ROMs 1995 May Digital library of Computer Science Technical Reports

Greenstone releases

2006 2005 2004 2003 2002 2001 Dec Oct 2.72 2.71 Mar 2.70 Jan 2.63 Jun Apr 2.62 2.60 Mar 2.53 Oct Jun 2.52 2.51 Feb 2.50 Dec Jun 2.41 2.40 Mar 2.39 Jan Oct Jun 2.38 2.37 2.36 May 2.35 Apr 2.33 Feb 2.31 2000 Feb 2.30 Dec Sep 2.30 2.27 Jul 2.25 Jun 2.23 Jun 2.22 Apr 2.21 Feb 2.12

UNESCO Greenstone CD-ROMs

These contain the entire Greenstone software, and are intended for use in developing countries with limited access to the Internet.
2006 May   UNESCO CD-ROM v2.7 (Greenstone v2.70)   English/French/Spanish/Russian
2005 May   UNESCO CD-ROM v2.6 (Greenstone v2.60)   English/French/Spanish/Russian
2004 Mar   UNESCO CD-ROM v2.0 (Greenstone v2.50)   English/French/Spanish/Russian
2003 Mar   UNESCO CD-ROM v1.1 (Greenstone v2.39)   English/French/Spanish
2002 Jun   UNESCO CD-ROM v1.0 (Greenstone v2.38)   English

Human Info NGO CD-ROMs

Prior to the year 2000 we worked with Human Info NGO to help them produce humanitarian CD-ROMs using Greenstone. (Many more have been produced since; a total of about 40 to date.)

2006 2005 2004 Apr May ??? Appropriate Technology Knowledge Collection Gender and HIV/AIDS Electronic Library Textes de Base sur L'Environment au Senegal (French) Jan Educational Aids/Lehr- und Lernmittel/Moyens didactiques/Material didáctico v3.0 (English/German/French/Spanish) Africa Collection for Transition: From Relief to Development v1.01 UNECE Committee for Trade, Industry and Enterprise Development (English/French/Russian) INEE Technical Kit on Education in Emergencies and Early Recovery Nov Sep ??? Jan 2003 ??? Oct Educational Aids/Lehr- und Lernmittel/Moyens didactiques/Material didáctico (English/German/French/Spanish) Education, Work and the Future/Education Travail et Avenir (English/French) v2.0 Revised Curricula for Technical Colleges and Polytechnics Jul UNAIDS Library v2.0 (English/French/Spanish/Russian) May Biblioteca Virtual de Salud para des Desastres/Health Library for Disasters v2.0 (Spanish/English) Food and Nutrition Library v2.2 Mar ???
2002 2001 2000 1999 didáctico v2.0 Jan Educational Aids/Lehr- und Lernmittel/Moyens didactiques/Material (English/German/French/Spanish) ICT Training Kit and Digital Library for African Educators v1.0 Aug Jul Community Development Library for Sustainable Development and Basic Human Needs v2.1 Food and Nutrition Library v2.0 Mar UNDP Energy for Sustainable Development Library Dec Oct UNAIDS Library of Current Documents v1.1 (English/French/Spanish/Russian) East African Development Library ??? Safe Motherhood Strategies (English/French/Spanish) Jul Researching Education Development Jun Biblioteca Virtual de Salud para des Desastres/Health Library for Disasters (Spanish/English) Jun WHO Medicines Bookshelf Jan Africa Collection for Transition Dec ??? World Environmental Library v1.1 Sahel point Doc v2.0 (French) Jan Food and Nutrition Library v1.0 Dec Dec Medical and Health Library v1.0 Bibliothèque pour le Développement Durable et des Besoins Essentials v1.0 (French) Nov Biblioteca Virtual de Desastres/Virtual Disaster Library (Spanish, some English) 1998 ??? UNU Collection on Critical Global Issues v2.0 Mar Sahel point Doc (French) Feb Humanity Development Library v2.0 ??? Apr UNU Collection on Critical Global Issues v1.0 Humanity Development Library v1.3

Greenstone workshops

As well as tutorials at conferences in the US and Europe, many workshops have been given on Greenstone in developing countries. Here are some that have been given by people closely associated with the project; there have been many others. They range from half a day to 6 days; most are 1–3 days. Many have been sponsored by UNESCO.
2007 May Feb Trinidad and Tobago National Library Vellore, India 2006 Dec Dec Calcutta, India New Delhi, India Nov–Dec Kozhikode, India 2005 2004 2003 Oct Vladimir, Russia Aug Tirunelvelli, India Jun Hawaii, US Mar–Apr Madras, India Mar Durban, South Africa Feb Bangkok, Thailand Nov Cape Town, South Africa Nov–Dec Arusha, Tanzania Sep Suva, Fiji Aug Bangalore, India July Siena, Italy May Ho Chi Minh City, Vietnam May Kozhikode, India ??? Bombay, India Havana, Cuba ??? Trivandrum, Kerala Aug–Sep Windhoek, Namibia Jul Suva, Fiji Jun Cape Town, South Africa Mar Dakar, Senegal Mar Cape Town, South Africa Feb Gaborone, Botswana Feb Almaty, Kazakhstan Nov Nov Dakar, Senegal Suva, Fiji May Bangalore, India (IISC)

This toki (adze) was a gift from the Māori people in recognition of our project's contributions to indigenous language preservation, and resides in the project laboratory at the University of Waikato. In Māori culture there are several kinds of toki, with different purposes. This one is a ceremonial adze, toki pou tangata, a symbol of chieftainship. The rau (blade) is sharp, hard, and made of pounamu or greenstone, hence the Greenstone software, at the cutting edge of digital library technology. There are three figures carved into the toki. The forward-looking one looks out to where the rau is pointing to ensure that the toki is appropriately targeted. The backward-looking one at the top is a sentinel that guards where the rau can't see. There is a third head at the bottom of the handle which makes sure that the chief's decisions (to which the toki lends authority) are properly grounded in reality. The name of this taonga, or art-treasure, is Toki Pou Hinengaro, which translates roughly as "the adze that shapes the excellence of thought."

Figure 1. The Greenstone toki