The Citation Database - Daedalus: Projects in Digital Humanities

I. Introduction
In the history of English lexicography, one visual image dominates all others: James
Murray – the editor of the Oxford English Dictionary – standing in his Oxford ‘Scriptorium’,
wearing his academic robe and holding a book in one hand as he peers carefully at a small piece
of paper. Presumably, this piece of paper is one of the lexicographic slips – note cards that
contain quotations of a particular word – and Murray is engaged in the act of pondering the
precise shade of meaning of a specific word. The remarkable aspect of this picture is not,
however, our glimpse into the idealized view of a lexicographer ‘defining’ a word, but rather the
background to this image. Murray is standing in a room surrounded by pigeonholes overflowing
with pieces of paper; these papers, known as slips, formed the core source material of the Oxford
English Dictionary and were the result of a massive distributed reading effort in which volunteer
readers were given lists of books to peruse and asked to note interesting and unusual usages of
words. (Cite actual letter here). These volunteer readers then returned their papers to Murray,
who – along with a small group of subeditors – would sit perched at the high table in the middle
of this image and compose a definition from the raw materials before him. In this single image,
we see the formation of three essential components of an idealized vision of the lexicographer’s
task: the principle that lexicographers should begin by consulting words in context rather than
other dictionaries, a visual representation of the massive task of actually gathering, organizing
and accessing the raw material for a dictionary, and the implicit notion that an intrepid reader
or scholar could return to these source materials later to revise or refine definitions in the
dictionary.
Figure 1: James Murray in His Scriptorium
Of course, the actual practice of lexicography rarely – if ever – conforms to this idealized
vision. There are at least three practical obstacles to this approach associated with managing
and assimilating data. The image of Murray in his study examining a single specific paper
drawn from the vast nest of pigeonholes behind him must give at least some pause to academics
who can’t find an article offprint or student paper from the previous week in the various stacks in
their office. Indeed, when Murray took on the editorship of the Oxford English Dictionary, the
slips were in such disarray that the decision was made to abandon previous efforts rather than
trying to re-sort and re-catalog the slips already in their possession. Further, all of the labor
involved in creating the slips and a corresponding system for storage and access is only the first
step in the lexicographer’s task – order must be brought to the chaos; the slips must be organized
to correspond to word senses, the lexicographer must decide whether to order the information
chronologically, semantically, etc. and the actual writing remains to be done. Finally, the slip
archive itself is more of an ideal than a practical reality; Murray himself consulted other
dictionaries for the OED and practical constraints of time, cost and labor compel most
lexicographers to work with existing dictionaries as they try to revise, refine, or build new
dictionaries.
The difference between this idealized vision and the actual practice of lexicography is
amply illustrated by the fact that there is no equivalent image in the history of the lexicography
of Ancient Greek to match the famous image of Murray. Indeed, an image of Henry Liddell
from the 1875 issue of the British social magazine Vanity Fair offers only a caricature of an
Oxford don in his robes with no books in sight and a caption that mentions only his college,
Christ Church, while an image from a later memoir shows Henry Liddell simply reading an
unidentified book at his desk. In fact, the best equivalent image to reflect the intellectual
shifts of Greek lexicography is not that of a single man in a room, but rather that of a stack of
dictionaries, each pointing to the one that follows it. (BRUCE EXTRACT) This line of
dictionaries begins with the Thesaurus Graecae Linguae of Stephanus in 1572 and runs in
unbroken succession to the dictionaries most commonly in use today: the Liddell-Scott-Jones
Greek lexicon, originally published in 1843, revised some nine times until 1940 and further
augmented with three supplemental volumes, the most recent in 1994; the Italian Vocabolario
della lingua greca (GI); the Spanish Diccionario griego-español (DGE); and the German Lexikon
des frühgriechischen Epos (LFgrE), now in progress.
In this progression, citations of actual passages where words appear have become an
increasingly important and distinctive component of lexicon entries. In the work of Stephanus,
where words were grouped by ‘family resemblance’, brief phrases were given as examples,
without line or chapter references (although authors, and sometimes works, were cited). The
later, alphabetic, editions of the Thesaurus (Valpy and Barker 1816-28, Hase 1831-65)
introduced referenced citations, but these were very brief: often just one-word quotations from
the early grammarians and lexicographers, rather than illustrations of usage.
The first (modern) alphabetic Greek dictionary, and the first dictionary from Greek to a
modern language, Schneider (1797-8), used more extensive citations, mostly from early epic, as
examples. These provided the core source-material for subsequent Greek lexica: Passow (1831)
drew on them for his citations, and Liddell and Scott (1843) in turn used his material as the basis
for their own.
In their seven subsequent editions, Liddell and Scott steadily increased the number and
range of quotations, drawing on the alphabetic Thesaurus of Valpy and Barker, and then on a
variety of later sources, as the discoveries and textual editions of the nineteenth century
unearthed new attestations, until the accretion of new material made a complete reworking
necessary.1
In 1904, a proposal was made to the British Academy for the creation of a new
Thesaurus, in order to organise the newly-discovered material.2 However, in a memorable
phrase, which has frequently been cited, Hermann Diels (1905: 693) compared the task of
collating the citations from the full corpus of ancient Greek literature as equivalent to ‘in dieses
Chaos den Nus hineinzubringen’,3 and the task was eventually abandoned as unfeasible, in
1
Zgusta (1987: 264-72) and Glare (1987) give contrasting accounts of the changes in
Liddell and Scott’s approach. Their last (eighth) edition was published in 1897, the year of
Scott’s death and a year before Liddell’s.
2
For a brief account of the discussions, see LSJ (1925: iv-vii).
3
‘Bringing Νοῦς into this Chaos.’ The expression is cited in LSJ (1925: v), Berkowitz
favour of a further revision of Liddell and Scott’s lexicon, which was published in ten parts from
1925 to 1940 as its ninth edition, LSJ.
This great work has proved to be the foundation of subsequent Greek lexicography, but it
may perhaps be described as a magnificent failure, because so much new material has been
incorporated into the structure of the eighth edition that the clarity of the semantic descriptions is
often overwhelmed: see Zgusta (1987, 271-2), Glare (1987) and Chadwick (1994). Since then,
the ever-increasing volume of new material has been collected in independent volumes: new
citations were published in Supplements to LSJ (1968, 1996), and the historical range was
extended by Lampe (1961-8) and Trapp (1994-9).
(END BRUCE EXTRACT)
The practical, accretive nature of Greek lexicography and the more
idealized vision of the lexicographer’s task encapsulated in the picture of Murray in his study
converge with the emergence of large corpora of digitized literature. These corpora allow
lexicographers to find and analyze lexicographic source material in ways that simply were not
possible for Murray, Liddell or any other lexicographer of the pre-digital era. At its simplest
level, the computer can automate the basic tasks of identifying the words in a corpus,
constructing an index, and presenting passages where the words appear so that lexicographers
can write the definitions. Electronic text corpora such as the one contained in the Thesaurus
Linguae Graecae digital collection, founded in 1972, and the Perseus digital library can
substantially ease the task of locating the passages where words are used and can transform the
questions that a lexicographer can ask.4 If we can automate the task of executing searches in an
and Squitier (1990: vii), Pantelia (2000: title).
4
Crane 1999.
electronic corpus and compiling the results in a useful fashion, lexicographers can spend more of
their time doing the intellectual work necessary to thoroughly consider words and their meanings
for a new lexicon. The creation of a citation file only begins to exploit the possibilities that
electronic text corpora can contribute to the practice of lexicography and philology. We can also
help lexicographers begin to provide answers to questions that would be difficult or impossible
to obtain without computational techniques. For example: How common or rare is a particular
word? Is a word associated with a specific work, author, or genre? What grammatical or
morphological features are commonly associated with different verbs?
Beginning as an NEH-funded post-doctoral researcher at the Perseus Project in 2001 and
continuing as a professor at the University of Missouri-Kansas City, I have been engaged in the
practical task of creating just such a database in partnership with a team at Cambridge University
who are writing a new intermediate-level Greek-English lexicon. In our work, we created a
database that allows the lexicographic team to complete a project that could not have been done
by a staff of the same size, if at all, in a pre-digital era. It provides us with a method for
examining all attestations of a word in our corpus, along with an overview of the authors, time
periods, and genres where those words appear, and with tools that allow us to manage
the chaos of information overload that comes from a massive unsorted list of citations. In this
paper, I will describe the slips themselves, the two elements of general system architecture that
have been essential for the creation of this system, and the merging of old and new resources that
we have used to manage the potentially overwhelming abundance of information contained in
the new lexicographic database.
The Citation Database
As noted above, any lexicon is based on the collections of word attestations that serve as
the raw materials for the lexicographer, whether gathered from scratch as Murray’s team did or
built from previous lexica as is the case in the history of Ancient Greek-English lexicography. To
create our citation file, we extracted a key-word-in-context listing for every occurrence of every
word in the corpus and matched the Ancient Greek passage with a parallel English translation
from the Perseus digital library wherever such a translation was available. The key-word-in-context index is built using the Greek morphological analysis engine known as Morpheus.
Greek is a highly inflected language and many inflected forms share few if any surface features
with their dictionary form. In order to identify the words contained in a corpus, we must take
advantage of the Perseus morphological analysis system that allows us to determine, for
example, that moloumetha is a future form of the Greek verb blôskô (to come or go) or even that
mêtri is a form of the noun mêtêr (mother). These determinations are made using the Greek
morphological analysis engine that is integrated with the Perseus digital library. This
morphological analysis engine was developed for Greek texts by Greg Crane beginning in 1985
and it has been refined and extended over the years for Latin and other languages. Morpheus
works by breaking down words into component parts and comparing these parts to
morphological databases of stems and endings. Anne Mahoney describes the morphological
analysis system as follows:
The original implementation, Greek Morpheus, can handle regular verbs and
nouns, irregular verbs (in Greek, mostly suppletive) and nouns, verb prefixes (a very
common kind of derivation), and the various dialects of Greek in common use in the
archaic and classical periods. Virtually all inflections in Greek are endings, though many
past-tense verb forms take a prefix (the "temporal augment") and some stems are formed
by reduplication of the first consonant. Morpheus therefore assumes that inflected words
can be divided into stems and endings. The stems are related to lexical headwords (e.g.
the stems pemp- and pepomph- belong to the verb pempô, "send") so that tools using
Morpheus can offer definitions as well as morphological analyses. For each stem,
moreover, Morpheus knows the relevant grammatical category (the "conjugation" or
"declension"), which determines the possible endings. It can then recognize that
pempoimi is a valid form, but pempeiên is not: both use endings for the first person
singular, present optative active, but only the first of these endings is appropriate for the
verb pempô.5
Once these determinations have been made, it is then possible for us to create an index
with each dictionary form and the passages where a word such as pempô might appear.
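The stem-and-ending approach described in the quotation, and the index built from its results, can be illustrated with a minimal sketch. Everything in it is a toy assumption: the transliterated stem and ending tables, the class names "thematic", "athematic" and "perfect", the function names, and the citation labels are hypothetical and are not taken from Morpheus itself.

```python
from collections import defaultdict

# Hypothetical toy tables: each stem maps to (headword, inflection class).
STEMS = {
    "pemp": ("pempô", "thematic"),
    "pepomph": ("pempô", "perfect"),
}

# Hypothetical endings permitted for each inflection class.
ENDINGS = {
    "thematic": {"ô": "1st sg pres ind act", "oimi": "1st sg pres opt act"},
    "athematic": {"eiên": "1st sg pres opt act"},
    "perfect": {"a": "1st sg perf ind act"},
}


def analyze(form):
    """Return (headword, description) for every valid stem + ending split."""
    analyses = []
    for cut in range(1, len(form)):
        stem, ending = form[:cut], form[cut:]
        if stem in STEMS:
            headword, infl_class = STEMS[stem]
            description = ENDINGS[infl_class].get(ending)
            if description is not None:
                analyses.append((headword, description))
    return analyses


def build_index(passages):
    """Map each headword to the citations whose text contains one of its forms."""
    index = defaultdict(list)
    for citation, text in passages:
        for token in text.split():
            for headword, _ in analyze(token):
                if citation not in index[headword]:
                    index[headword].append(citation)
    return dict(index)
```

In this sketch `analyze("pempoimi")` yields one analysis, while `analyze("pempeiên")` yields none, because the ending belongs to a class that the stem does not use. `build_index` then gathers citations under the dictionary form rather than under the inflected form.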
Following the general model of the Liddell and Scott dictionaries, these citations are
sorted in chronological order (with one notable exception described below) and accompanied by
frequency charts that show how often these words appear in different authors and genres and
links to an on-line edition of the Liddell-Scott-Jones Greek-English Lexicon. In the example
below, we find the top of the slip for the Greek word κλέπτω that illustrates these features.
5
http://www.ldc.upenn.edu/exploration/expl2000/papers/mahoney/mahoney.htm See
also Generating and Parsing Classical Greek, Helma’s article at
cybergreek.uchicago.edu/Bootstrapping.pdf and http://portal.acm.org/citation.cfm?id=1596347
In this citation file, the lexicographic team can clearly see that this word can be used in
poetic registers, appearing frequently in works by authors such as Homer, Aristophanes,
Euripides, Aeschylus, etc. but they can also see potentially interesting clusters in prose authors
such as Aristotle and Xenophon.
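A minimal sketch of how a slip's citations might be ordered chronologically and summarised by author or genre follows. The record layout, the rough dates (negative numbers stand for BCE), and the placeholder references are illustrative assumptions and do not describe the actual database schema.

```python
from collections import Counter

# Hypothetical records; dates are rough illustrative values only.
SAMPLE = [
    {"author": "Xenophon", "year": -370, "genre": "prose", "ref": "ref-4"},
    {"author": "Homer", "year": -750, "genre": "poetry", "ref": "ref-1"},
    {"author": "Aristophanes", "year": -420, "genre": "poetry", "ref": "ref-3"},
    {"author": "Aeschylus", "year": -460, "genre": "poetry", "ref": "ref-2"},
]


def chronological(records):
    """Return the records ordered from earliest to latest."""
    return sorted(records, key=lambda record: record["year"])


def frequency_by(records, field):
    """Count how many citations fall under each value of the given field."""
    return Counter(record[field] for record in records)
```

Calling `frequency_by(SAMPLE, "genre")` gives the kind of tally that a frequency chart on the slip would display, and `chronological(SAMPLE)` gives the display order of the passages.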
This example also provides a concrete illustration of the problem of scale that
accompanies the ability to computationally generate a lexicographic slip. While it is a great
luxury for lexicographers to have all of this source material at their disposal, it also introduces a
secondary problem of potential information overload. In order to have any hope of completing a
dictionary in a timely manner, lexicographers must also be able to move through their source
material quickly and they may not have the time to analyze and categorize all 295 passages
where this word appears. Consider the example of the existing intermediate Liddell-Scott lexicon
with approximately 32,000 entries. If the lexicographer is able to read all of the source material,
analyze it, and write a definition in thirty minutes, a first draft of the complete dictionary will
take almost eight person-years to complete, while an hour on each definition stretches the time
required to create a first draft to sixteen person-years. Even with the more generous allocation of
sixty minutes per entry, the lexicographer is granted something on the order of 10 seconds to
read each citation and then some ten minutes to write the definition itself. If one assumes that each
citation will take a minute to read and contextualize, almost five hours would be devoted to
simply reading the citations before beginning to write.
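The arithmetic behind these estimates can be checked with a short sketch. The figure of 2,000 working hours per person-year is an assumption introduced here (the text does not state one), chosen because it reproduces the estimates above; the function names are likewise illustrative.

```python
# Assumption: one person-year is about 2,000 working hours (50 weeks x 40 hours).
HOURS_PER_PERSON_YEAR = 2000


def person_years(entries, minutes_per_entry):
    """Person-years needed to draft every entry at a fixed time per entry."""
    return entries * minutes_per_entry / 60 / HOURS_PER_PERSON_YEAR


def seconds_per_citation(citations, reading_minutes):
    """Reading time available per citation, in seconds."""
    return reading_minutes * 60 / citations


half_hour_draft = person_years(32000, 30)       # thirty minutes per entry
one_hour_draft = person_years(32000, 60)        # sixty minutes per entry
reading_budget = seconds_per_citation(295, 50)  # 50 of 60 minutes spent reading
reading_hours = 295 * 1 / 60                    # one minute per citation
```

Under this assumption the two drafts come to eight and sixteen person-years, the reading budget for a 295-citation slip is roughly ten seconds per citation, and reading at one minute per citation takes just under five hours.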
Consider further the example of the Greek letter Pi. This letter is the most common
initial letter in Classical Greek; the entries for it take up some 131 pages in the current
intermediate Greek Lexicon, some ____ pages in the large Liddell, Scott, and Jones lexicon, and
there are some 15,964 distinct lemmas in our lexicographic database for words that begin
with Pi. The scope of Pi is such that, according to a memoir TITLE, Liddell wrote to his
collaborator Scott in 1842 about the impending completion of the letter Pi.
“You will be glad to hear that I have all but finished Π, that two legged monster, who
must in ancient times have worn his legs astraddle else he never could have strode
over so enormous a space as he has occupied and will occupy in Lexicons.” He then
inserted a drawing of the creature in human shape, adding, “Behold the monster, as he has
been mocking my waking and sleeping visions for the last many months.”6
Without some method to manage the volume of citations that we can generate, the Pi monster
threatens to break the bounds of its initial letter and consume the entire lexicographic project.
Clearly a citation file of this sort requires tools that allow lexicographers to optimize the
amount of time they spend analyzing citations and to help them identify more interesting and
useful citations where they can devote their time. Indeed, the key-word-in-context index brings
us back to the point of aporia where Diels found himself in 1905, wondering how to bring order
into the chaos. Indeed, while large computational corpora create for us the chance to revise and
start from scratch, this approach digitizes chaos rather than simplifying matters. Our approach to
bringing initial order to the chaos has focused on integrating the long tradition of lexicographic
research into our citation file. We have done this in three ways: 1) by separating out passages
that were cited in the large Liddell-Scott-Jones dictionary in both the slips and in the frequency
counts, 2) by integrating the cited passages from our citation file into the on-line edition of the
LSJ dictionary, and 3) by integrating the citation file into the larger architecture of the Perseus
Digital Library.
<BRUCE EXCERPT> These three approaches combine to create three essential features
that make the citation file a highly effective lexicographic tool: (1) the separate collection of citations
from LSJ in the slips and frequency counts; (2) a new digital edition of the Liddell-Scott-Jones
lexicon that integrates cited passages; and (3) integration with the Perseus Digital Library
architecture in a way that allows lexicographers to quickly check ambiguous lemma forms,
missed LSJ citations, and citations from multiple editions and collections of texts.
6.1. The LSJ collection: the ‘weave’
6
Page 19 from Bruce’s scan
In order to make maximum use of the semantic sorting which has already been performed
on the LSJ citations, we also display them in what we call a ‘weave’: that is, interwoven with the
text of the LSJ entry itself. The start of the weave display for the same word as shown in Figure
1, θέατρον, is shown in Figure 2:
This display is more informative than the ‘list’ format illustrated in Figure 1, in two
ways. Firstly, it gives us a check on accuracy: we can easily see whether any citations are
missing. Secondly, it gives us semantic information: we can see the LSJ definitions next to each
passage, and so compare their interpretations with ours. Because the citations are given in the
order of the semantic groups of LSJ, we can benefit from the semantic sorting which has already
been done, and make it the reference-point for our own revision. Three senses are visible here:
the basic meaning of theatre as a place for dramatic performances (Herodotus), its use for
political meetings (Thucydides), and a more abstract sense, the stage, the theatre, referring to the
representations (Isocrates). The illustration does not show the full HTML page, which includes a
fourth, collective, sense, spectators, audience.7
<JEFF WRITING>This process is facilitated by a core architectural element of the
Perseus Digital Library that we have termed an Abstract Bibliographic Object (ABO). An ABO
is simply an abstract identifier for a work that is not connected to any particular instantiation of
that work. In terms of modern library cataloguing, our ABO corresponds to the concept from
FRBR cataloguing known as the ‘work’.8 Since long before the digital era, Classicists have
referred to ancient texts by abstract citation schemes rather than referencing specific printed
editions (although some of these schemes – such as the ones for Plato and Aristotle – have their
origins in early printed editions). For example, in the sample slip for θέατρον above, the first
cited passage is listed as Hdt. 6.67. The reader knows that Hdt. refers to the Greek historian
7
The non-LSJ slips show that the two concrete senses appear throughout Greek, while
the abstract sense is much less common. The development of the collective sense is especially
interesting, being the usual sense in Aristophanes and in Plato, who gives it a much more general
application, to any kind of audience or group of spectators. A fifth sense, what is seen, spectacle,
is not identified in LSJ, but appears in the New Testament.
8
Cite Allison Babeau’s Perseus paper on FRBR – David Mimno should have something
here too.
Herodotus while 6.67 directs him or her to section sixty-seven in book six of this work. The
reader who wants to read the passage in context can go to any edition of Herodotus either in the
original or a modern translation and locate this passage. We implemented this abstract
referencing system in the underlying architecture of the Perseus Digital Library so that users
could move between different translations or Greek and English versions of texts within the
library interface. This, in turn, allowed for the interconnection of texts and lexica within the
digital library environment, so that when citations in reference works and grammars such as Liddell,
Scott and Jones were encountered, they could be turned into bi-directional hyperlinks between
the two sources, allowing readers to jump from lexicon to source text or source text to lexicon.9
The architecture that allowed for this bi-directional linking made it possible for us to take every
citation in the Liddell-Scott lexicon and integrate the appropriate text fragment with the digital
edition of the lexicon.
This same architecture also allows us to integrate multiple editions of different texts and
texts from different collections. Because each edition of a text is tied together by the abstract
bibliographic object and the stable citation system used in Classical texts, we are able to display
– for example – an English translation of book six, line 140 of Homer’s Iliad with any other
edition that uses this same citation scheme. When each word-form is identified and the chunk of
surrounding text selected, that specific sentence-file in Perseus is matched to the corresponding
sentence-file in the English text. This enables a matching passage of English text to be displayed
below each Greek one, helping the lexicographers to scan quickly through the texts.
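The role of the abstract identifier can be sketched as a lookup keyed by work and canonical reference rather than by edition. The class and field names below are hypothetical, and the bracketed strings are placeholders standing in for real text; this is an illustration of the idea, not the Perseus implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    work: str       # abstract work identifier, independent of any edition
    reference: str  # canonical citation within the work, e.g. "6.67"


class Library:
    """Stores the text of each passage once per edition or translation."""

    def __init__(self):
        self._editions = {}

    def add(self, passage, edition, text):
        self._editions.setdefault(passage, {})[edition] = text

    def parallel(self, passage):
        """Return every available version of the same abstract passage."""
        return dict(self._editions.get(passage, {}))


library = Library()
hdt = Passage(work="Hdt.", reference="6.67")
library.add(hdt, "greek", "[Greek text of the passage]")
library.add(hdt, "english", "[English translation of the passage]")
```

Because the key is the abstract citation, `library.parallel(Passage("Hdt.", "6.67"))` returns the Greek and English versions together, which is the behaviour needed to show a translation below each Greek passage.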
This architecture also allows us to work with documents encoded against various DTDs.
Anne Mahoney describes the core Perseus architecture as follows:
9
Cite Lexicon to Commentary and New Technologies for Reading
The Perseus text processing system manages XML and SGML texts encoded
according to various different DTDs. The key to the system is the mapping of specific
SGML elements to abstract structural elements. If a user wishes to read Our Mutual
Friend, book 3, chapter 6, or if a commentary refers to Iliad, book 22, line 361, the
document management system can identify this section of the text by its citation scheme
(by book and chapter, or book and line), no matter what DTD was used for Dickens or for
Homer.
…
Using our system, digital librarians create partial mappings between elements in a
DTD (e.g., div1, div2, and lb) and abstract structural elements (act, scene, and line) from
which the text processing system generates lookup tables (indices) of the elements so
mapped. Thus what is encoded as <div2 type="scene"> in one document and as <scene>
in another are both indexed as an abstract, structural "scene." This mapping hides the use
of different DTDs from the higher-level processing routines.10
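The mapping described in this quotation can be sketched with Python's standard `xml.etree.ElementTree` module. The two sample documents, the mapping tables, and the use of an `n` attribute to number each unit are assumptions made for illustration; the lookup table is deliberately flat and ignores nesting.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents encoded in two different ways.
DOC_A = '<text><div1 type="act" n="1"><div2 type="scene" n="2">Scene text A</div2></div1></text>'
DOC_B = '<play><act n="1"><scene n="2">Scene text B</scene></act></play>'

# Hypothetical partial mappings: (tag, required type attribute or None) -> abstract unit.
MAPPINGS = {
    "doc_a": {("div1", "act"): "act", ("div2", "scene"): "scene"},
    "doc_b": {("act", None): "act", ("scene", None): "scene"},
}


def build_lookup(xml_text, mapping):
    """Index every mapped element by (abstract unit, value of its n attribute)."""
    table = {}
    root = ET.fromstring(xml_text)
    for element in root.iter():
        for (tag, type_value), unit in mapping.items():
            type_matches = type_value is None or element.get("type") == type_value
            if element.tag == tag and type_matches:
                table[(unit, element.get("n"))] = "".join(element.itertext()).strip()
    return table


table_a = build_lookup(DOC_A, MAPPINGS["doc_a"])
table_b = build_lookup(DOC_B, MAPPINGS["doc_b"])
```

Both tables can then be queried with the same key, `("scene", "2")`, so code that consumes the lookup never needs to know which encoding a given document used.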
In 1999 and 2000, when this system was initially being implemented with NEH support, we
found that we were able to use this system to quickly integrate texts from other collections such
as the Library of Congress American Memory Collection and the Greek texts found in the
Thesaurus Linguae Graecae collection. The TLG group gave permission for their collections to
be used in the NEH project that initially funded this work, thereby allowing those authors and
works which are not stored in Perseus to be included in their correct positions in the display.
This seamless transition between the Greek texts from multiple sources along with English
translations ensures that we have complete coverage of our corpus texts.
6.2. Checking ambiguous lemma-forms and missed citations
10
Cite Anne’s ‘Generalizing the Perseus Document Manager’
The Perseus architecture also allows us to provide a mechanism to check for situations
where the automatic integration of the citations has failed, either because the slip-generation
routine fails to recognise the passage corresponding to a LSJ citation, or the morphological
analysis engine fails to parse the correct lemma from which an inflectional form is derived. If
such failure leads to serious loss of time, then the archive will be, in practical terms, of limited
value. In order for it to be a usable research tool, we need to have facilities to cope immediately
with the failures.
The most common problem is failure of lemma-identification. This has two possible
causes. Firstly, the morphological analyser cannot identify every word-form. It is limited by the
size of its index, which includes about 97,000 Greek stems and 14,000 inflections. This enables
it to recognise 69% of the word-forms in the Perseus texts, constituting about 99% of the
attestations. That gives a level of accuracy of about 85%: a good percentage, but still resulting in
a substantial number of unresolved forms and missed citations. The second possible cause of
failure is that the process of lemmatisation is itself fundamentally limited by the presence of
ambiguous forms: ἄνα, for example, could be the vocative of ἄναξ, the Aeolic feminine of ἄνη,
or the anastrophic form of ἀνά (or perhaps even a neuter plural of ἄνοος). However, we find
that, in practice, homonyms like ἄνα or λῆξις cause least difficulty, and complexities of verb
inflection cause most.
To meet these eventualities, the program is therefore designed to give us automatic
feedback, by identifying the level of certainty in lemma-identification, and assigning a ‘weight’,
or probability-number, to each attestation, which is based on the number of possible lemmas
from which the form could be derived (as far as the program recognises). This is the basis for the
totals of ‘unambiguous’ and ‘ambiguous’ citations shown in Figure 1.
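One simple way to realise such a weight is the reciprocal of the number of candidate lemmas. This is an assumption made for illustration: the text says only that the weight is based on that number, not how it is computed. The function names are hypothetical.

```python
from collections import Counter


def weight(candidate_lemmas):
    """Assumed weighting: 1 / number of distinct candidate lemmas (0.0 if none)."""
    distinct = set(candidate_lemmas)
    return 1.0 / len(distinct) if distinct else 0.0


def classify(candidate_lemmas):
    """Label an attestation by how certain its lemma identification is."""
    value = weight(candidate_lemmas)
    if value == 0.0:
        return "unparsed"
    return "unambiguous" if value == 1.0 else "ambiguous"


def totals(attestations):
    """Count attestations in each certainty category."""
    return Counter(classify(candidates) for candidates in attestations)
```

With the four candidate lemmas listed above for ἄνα, this weighting assigns the form 0.25, while a form with a single candidate receives 1.0 and is counted as unambiguous.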
The ambiguous forms must then be lemmatised manually. In practice, this does not take
long: the eye can very quickly scan down a page of chronologically-arranged citations. In the
initial conception of the slips, we had intended to disambiguate these words and enter corrections
into Morpheus; however, the time involved drew too much labour away from the core
lexicographic task. Indeed, the Perseus group has taken this up as a new separate project to
manually create parse-trees for a vast corpus of ancient Greek texts that can then be used to train
a probabilistic parser.
However, we needed to use the archive immediately, and so we required a strategy to
cope with identification failures. Our solution was to combine the feedback with text-links.
Every failure-report is accompanied by a hyperlink to the passage which was searched, so that
we can check the text, by clicking on the link. The small horizontal lines preceding all the text
passages shown in Figures 1 and 2 are the hyperlinks. We have, as it were, embedded the slips
archive within the digital library of texts. This allows us to check problems immediately,
reducing the times when we have to leave our work-stations and consult the print editions.
A similar procedure is used for failed identification of LSJ citations. The program
indicates to us where it has failed to find the word-form in the cited passage, and we can then
immediately check the text. This feature can be illustrated for the word ἀβᾰκής, speechless,
calm, whose LSJ entry is shown in Figure 3:
We can see from the absence of an inserted passage that Morpheus has missed Sappho
fragment 72, and the feedback at the bottom of the page confirms this. By clicking on the
hyperlink, the underlined ‘Sappho 72’, we move directly to the fragment, which is shown in
Figure 4:
In this fragment, the words which the analyser has identified are all underlined as parsed,
and we can see that ἀβάκην is in fact there, but unrecognised (because it is a paroxytone
accusative form not listed in the Morpheus index). So we still have fast access to the correct
citation, even when the program has failed to identify the form. The consequent saving in time is
substantial: this feature transforms the slips database from an ancillary tool with excellent but
limited coverage, into a dependable, ‘all-weather’ reference system.
6.4. Citation matching
In order to identify all the LSJ citations, we also need to match any variations in
numbering. In general, the citation systems for Greek texts are remarkably stable: the LSJ line
numbers for Homer and the tragedians, and the section numbers for the prose texts, are much the
same in modern editions. However, the texts of many early poets, especially the lyricists, have
been republished in new editions which give different fragment numbers. We have therefore
compiled a concordance from LSJ to the modern editions of the lyric and iambic poets, and also
to epic, comic, and tragic fragments, where modern editions differ from LSJ.
This ‘poetry map’ is integrated in the electronic database. Its use can be demonstrated
from the citation from Sappho shown in Figure 4 above. LSJ cites this as fragment 72 in Bergk’s
Poetae Lyrici Graeci, while TLG uses Lobel-Page’s Poetarum Lesbiorum Fragmenta, where it is
fragment 120. By tagging it with the LSJ number, and also mapping that to the modern edition
number, we can ensure that the LSJ citation is always recognised, even in cases like this where
Morpheus fails to find the target-word.11
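A concordance of this kind can be sketched as a simple lookup table. Only the Sappho entry below comes from the text; the key layout and the function name are hypothetical, and the sketch assumes that a citation with no recorded mapping keeps its LSJ numbering.

```python
# (author, LSJ edition, LSJ number) -> (modern edition, modern number).
# Only this one entry is taken from the text; the layout is an assumption.
POETRY_MAP = {
    ("Sappho", "Bergk", "72"): ("Lobel-Page", "120"),
}


def resolve(author, edition, number):
    """Return the modern edition and number, or the original pair if unmapped."""
    return POETRY_MAP.get((author, edition, number), (edition, number))
```

Here `resolve("Sappho", "Bergk", "72")` returns the Lobel-Page number, so the passage can be fetched from the modern text while remaining tagged with the number that LSJ prints.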
6.6. The slips: summary of lexicographic functions
At this point, this database is in active use as the lexicon team continues its writing work.
As the database nears completion, however, we need to begin work on the next two phases –
publication and a long-term archival strategy. The lexicon team has a contract with Cambridge
11
Cite Poetry Map
University Press that allows for simultaneous publication of the lexicon in print form and as part
of the freely available collections in the Perseus Digital Library. We also need to begin to think
about the long term preservation of the slip archive itself. The bulk of the slip archive was
designed as static HTML with Unicode text with an eye to long-term accessibility on different
platforms and operating systems. Only the facilities for checking the missing citations require a
functional digital library system behind it. We must, therefore, now consider whether we should
create static versions of these dynamically generated pages while also beginning to work with
digital repositories in libraries to find a suitable long term home for this data. The lexicon itself
is being authored in XML according to a DTD that lends itself to long term preservation and that
will also facilitate the process of digital preservation. At the same time, however, the XML also
contains additional working notes within a <NOTE> tag that are not intended for publication but
rather serve as an archive of the working papers of the lexicon team over more than a decade of
work on the dictionary. These also should find a long term stable repository.
The archive gives us a digital library tailored to our needs, with exceptionally fast access,
because it displays the results of millions of searches, with the words collated with their contexts
and indexed for reference. A lexicographically-useful size of passage is selected, set at three
sentences, which gives us enough context to evaluate the word meanings. The database is
proving indispensable in the writing of our lexicon articles, and has transformed the nature of the
project, by allowing us to examine the texts as we write, and to compare the LSJ citations with
the others. Pre-searching has proved to be a highly-effective way of utilising the limited time
available for writing the dictionary. The HTML format is also very user-friendly: we can
navigate very quickly between the two components of the double archive (the LSJ citations and
the others). The failures of identification cause minimal problems, because every page of the
archive is linked to the full texts. In sum, without this resource, it would have been impossible to
write fresh definitions, unless we had a much larger team of writers and much more time.
IMAGE CREDIT
http://commons.wikimedia.org/wiki/File:James-Murray.jpg – Accessed January 10, 2010