How to document the creation of digital language resources

- a case study
Jan Engh
Oslo University Library, Norway
jan.engh@ub.uio.no
47 22 85 77 17
47 22 84 45 81 (fax)
Abstract Natural language processing is not only a question of new algorithms and new
technologies for representing linguistic data. It is also about how to enter and cater for
language data on a broad scale. This will still, to a great extent, imply conversion of
analogue sources and, in the case of many languages, completion and simply
registration of language information. Although these activities in general are a question
of development rather than research, such a seemingly trivial task may have its
theoretical challenges relating to the language involved. And, above all, no full-scale
natural language processing will be possible without it. However, while there is a
flourishing literature on the more formal aspects and the technical innovation part of
natural language processing, documentation on how the basic language resources were
and partly still are established is scarce, and existing documentation may be ignored. The
present article is a case study of the first full-scale encoding of the intricacies of the
Norwegian lexicon and morphology in a digital format. At the same time, it contains a
discussion as to how such projects could and ought to be documented, especially with a
view to preventing later conjectures and allegations about the origin of the resulting
resources. In doing so, it will provide a special insight into the often uneasy relationship
between linguistic research and development in the private and public sectors.
Keywords Natural language processing ∙ Language resource creation ∙ Lexicon ∙
Morphology ∙ Norwegian
1 Introduction
After decades of work, researchers and developers may now find more and more basic
linguistic resources available to them in what used to be called machine-readable format even for less important languages such as Norwegian and its two written standards. In
spite of the expanding activity of harvesting linguistic data from the internet and a
somewhat more positive attitude among publishers towards putting digital texts at the
disposal of researchers,1 this still means that the major part of the language information
accessible has been converted from printed sources - or simply created by individual
1 See for instance The English-Norwegian Parallel Corpus at http://www.hf.uio.no/ilos/english/services/omc/enpc/index.html [Accessed 27 March 2013].
linguists, drawing on their competence as native language users. In the latter case, one
may talk about completion or even registration of language data from scratch. In fact,
existing digital language resources are the result of a process where all the approaches are
involved, although to a varying degree.
Generally, such processes have been poorly documented. While there are conferences
and extensive publishing about how to build linguistic resources from data on the web,
hardly anything is said or written about how one actually went about creating digital
resources from printed or “innate” sources.2 This reflects the fact that this process suffers
from a certain lack of status in the research community for being a straightforward,
technical matter, and because “everybody” knows the language. However, the reason
may even be a sentiment that conversion and the less conspicuous completion and
registration are phenomena of the past, a task that has been accomplished once and for
all.3 In fact, it is a transitory phase of linguistics, since new resources are created in a
digital medium already from the start. So, in a way, this entire process belongs to the
Bronze Age of natural language processing. But that is exactly one of the reasons why the
creation of digital language resources needs to be documented. Not only as an important,
although transitory phase of linguistics as a discipline, but also for quality and legal
reasons: Documentation as to how and when the creation took place and by whom
provides important information about the quality of the product and what you can
reasonably expect from it. It is even important information from an intellectual property
point of view. Whose material is it anyway? Although nobody has the copyright of a
language, the value-adding processing of language information gives the producer certain
rights. Proper documentation may provide a lead as to what one can do with the material
as far as business is concerned, and what you can take academic credit for.
In addition to the technological and legal aspects of digitalisation, documentation of
the creation of digitalised language data may shed light on various aspects of a linguistic
nature: Even what appear to be the simplest conversions in the first place tend to reveal
vagueness and cases of doubt. In fact, few conversions are really that simple, especially
when a complex written material is concerned, implying the compilation of information
from sources as disparate as taped lists in various formats, printed dictionaries, printed
articles or monographs, typed or handwritten filing cards, as well as footnotes in the local
academy’s annual report and oral sources. In fact, conversion of linguistic resources
cannot be carried out in an entirely automatic way, in the sense of just converting what is
written or printed to a digital form without human intervention. In most cases,
conversions imply a completion process as well.
1.1 Resource creation – research and/or development?
During the process of conversion of non-digital language information, it is not
uncommon that considerable information is added too. This happens basically because the
computer is a pedant. It demands unambiguous, explicit data. Not only are errors
identified; implementation of morphological rules makes inadequacies, inconsistencies,
and contradictions visible as well. As a consequence, the linguists and native language
users involved in the process have to carry out corrections, interpret the norm and make
2 One exception is Santos 1996. Interestingly, it is written by an engineer.
3 Although there are still quite a few printed dictionaries left to be converted, for instance.
extrapolations to cases the authors of the orthography never had in mind, only to detect
contradictions that must be resolved etc. Depending on the language structure and the
breadth and thoroughness of the lexicographical tradition, conversion entails completion.4
Also, one simply has to enter new words (lemmas), and the importance of completion in
connection with such registration is equally evident.
To be more explicit, the completion of information about a particular written
language may consist in various sub-operations: First of all, controlling whether the
traditional paradigms fit the inflected forms of the class of lexemes they are supposed to:
Does every lemma belonging to the same paradigm of traditional, usually history-based
morphology actually inflect according to the rule given? To the extent that such
information is not systematically provided in the morphology that is converted, it has to
be added. Furthermore, traditional morphology usually takes a lot for granted as far as
linguistic competence is concerned. Such knowledge has to be identified and stated
explicitly when morphological data are digitalised. How, is a matter of discretion.
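The first of these sub-operations can be illustrated with a minimal sketch in Python. It is not the procedure used in the project, which is not documented at this level of detail. The paradigm label `m1`, its suffix table and the function names are assumptions made purely for illustration. The sketch expands a stem according to a paradigm and reports where the predicted and the attested forms diverge.

```python
# Hypothetical, simplified paradigm table (an assumption for illustration):
# suffixes for indefinite singular, definite singular,
# indefinite plural and definite plural.
PARADIGMS = {
    "m1": ("", "en", "er", "ene"),
}


def expand(stem, paradigm):
    """Generate the forms that the paradigm predicts for a stem."""
    return [stem + suffix for suffix in PARADIGMS[paradigm]]


def check_lemma(stem, paradigm, attested):
    """Return two sorted lists: forms predicted but not attested,
    and forms attested but not predicted."""
    predicted = set(expand(stem, paradigm))
    attested = set(attested)
    return sorted(predicted - attested), sorted(attested - predicted)


# "bil" 'car' inflects according to the rule; "lærer" 'teacher' does not.
print(check_lemma("bil", "m1", ["bil", "bilen", "biler", "bilene"]))
print(check_lemma("lærer", "m1", ["lærer", "læreren", "lærere", "lærerne"]))
```

A lemma for which both lists are empty fits its paradigm. Any other result marks a case where information has to be added or the paradigm has to be split.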
During this process, one inevitably has to take a stand as far as the more peripheral
parts of paradigms and defective paradigms are concerned. This can be referred to as
raising morphological awareness: Do the words, or rather lexemes, have a complete
paradigm? Are all imaginable forms actually used, and, in the negative case, what
consequences ought to be drawn for the morphology? For instance, do all adjectives
compare? What adjectives require a periphrastic comparison, and when is it optional? Do
all past participles and adjectives derived from past participles have a complete paradigm
as far as attributive forms are concerned? What about regular past participles and
attributive forms? Not only is such information broadly ignored by standardisation
authorities, it even belongs to the very periphery of native language users’ knowledge of
their own language. The vagueness as to the “existence” of these forms matches their
infrequency. In this respect, creating digital language resources may represent an
investigation into the border areas of the language in question.
One practical consequence of completion will even be detecting obvious non-standard
lexemes or word forms that are frequently used. So, a by-product of the development of a
“descriptive” morphology is necessarily the exploration of the actual differences between
descriptive and normative linguistics in general for the language in question. Thus,
completion even means that normative linguistics becomes an object of investigation.
These aspects represent only a limited selection of possible completion tasks. In fact,
the ones mentioned above are some of the ways of completing the lexicon and the
morphology that are relevant for Norwegian.
After completion, enrichment of the linguistic information represents a natural next
step. By enrichment is meant adding all sorts of semantic and syntactic information to all
the lemmas of the lexicon and to particular inflected forms as well, in cases of divergence
between the lemma form and its word forms. Such information is usually not
systematically represented in printed dictionaries, if mentioned at all. And when it is,
it tends to be inaccurate. For instance:
4 Not to mention the feedback it may give to the relevant normalisation authority.
Valency of verbs and adjectives
Types of verb complements: Infinitives? With or without the infinitival marker?
Past participles that may occur in an attribute position5
The possibility of adjectives to be used in an attribute position and/or as a predicative
Adverbs that cannot form an adverb phrase alone, adverbs limited to certain positions
Nouns with, typically, an animate reference only etc.
Although “everybody” knows the language, even the native language linguist doesn’t
know everything about it, and, necessarily, has to rely on other sources for verification, as
already alluded to: Conferring with others – even with the codified experience of others
in terms of analogue media. The native language competence of the linguist is
complemented by written sources.
Even though completion is important in connection with the digitalisation of the
lexicon and the morphology of many languages, it is generally disparaged if not ignored
as a process by linguists without any experience from the field. One plausible explanation
is the relationship to normative linguistics. Although one does not directly perform
normative activities,6 it is necessary to study and to implement the standardisation of the
language in question. Now, normative linguistics is generally not well regarded by theoretical
(descriptive) linguists. Another explanation is that descriptive morphology itself and,
especially, lexicography, which represents the context of the morphology development,
are not particularly trendy parts of linguistics. Enrichment, on the other hand, seems to be
slightly more appealing, perhaps because of its closer relationship to semantics and
syntax, which have been the more fashionable parts of linguistics since the 1950s.
It goes without saying that although the bulk of the work behind the creation of
digital language resources can be characterised as development, the research component
can be strong,7 depending on the language and the corresponding linguistic culture. The
object of this research will not only be the standardisation of the language in question, but
even the language structure itself, especially to the extent that there is no adequate
description in advance, e.g. of morphological sub-systems. Whether this research is
carried out as a part of industrial research and development or at a public sector research
institution is immaterial, as long as fundamental linguistic tenets are observed.
1.2 How to document natural language processing
No project of language resource creation can document itself in the way that is customary
in linguistics as well as in computational linguistics literature. Here, the research work as
such and its documentation are woven together as the research is carried out in some sort
of spiral: One presents good arguments in favour of an algorithm or a rule, tests it against
new material, e.g. example sentences, and rejects or performs modifications whenever
necessary. Then, still new material is taken into consideration etc. This is a process that,
at least to some extent, is carried out during the writing process itself and the progression
of the text often reflects how the conception of the problem under scrutiny actually
developed. Or, at least, that is often the intention.
5 Contrary to common belief, this property is not automatically tied to valency, cf. Akø 1992.
6 The result of the inquiry may be used as input for standardisation, but that is irrelevant in this connection.
7 As even suggested by the title of Johannessen and Fjeld 2008.
Seen from a different angle, a linguistics monograph or article represents, ideally
speaking, a piece of reasoning - with documentation at the linguist’s discretion,
depending on what approach the author/researcher has chosen, what school (s)he
professes, and always with available publishing space as an upper limit. The
representation of lexicographical material is generally quite different: Neither an
extensive lexicon nor a morphology is something you just publish in the shape of one line
of reasoning. It is a result. How it was obtained is simply a different matter and can be
accounted for elsewhere. To the extent that lexica or morphologies can be formulated in
terms of lists and tables, it will always be possible to represent them as such - either in
print or on the screen - even if the sheer quantity may make it an impossibility in practice.
(Or at least it will make the result impractical.) This holds for “normal” morphological
overviews and dictionaries intended for common use by ordinary people, i.e. texts and
graphs that can be printed or otherwise published. However, similar publishing is not
possible in the case of virtual morphology and dictionaries, entities that are designed to
be implemented as a part of a program – not just published as is by means of a program.
Or to state it differently: Many digital language resources, including those of a
lexicographical nature, simply cannot be graphically represented in practice, a property
they share with hypertext or any computer program. These resources work – with an
associated program/interpreter. What you actually can publish are, as already mentioned,
rules and tables etc.
Since the existence of lexicographical material, i.e. all language resources at word
level and “below”, is principally external to any documentation in the way outlined
above, it seems natural to document the production of this type of digital language
resources in more or less the same manner as any other engineering product, a bridge, for
instance: First of all, describing the external circumstances of the production (the
“engineering” part of it) as well as its result: quantities, qualities and models in various
formal formats. Secondly, by discussing relevant problems encountered and solved
during the creation process.
Personally,8 I have tried to document the lexicographical projects I was involved in at
IBM Norway in the 1980s the engineering way, so to say: Engh 1991a, Engh 1992a,
Engh 1992b, but especially Engh 1991b and Engh 1994, Engh 2009 and Engh 2011.
These can be summarised as external descriptions of the respective objectives, of how
and when the project work was carried out, with information about the raw materials,
about the point of departure or base and, finally, they contain an account of the resulting
quantities. Internal aspects, such as linguistic analyses and assessments are only
mentioned superficially and as specimens in order to give a flavour of the work involved
- always within the limits of what could be made known without disclosing corporate
secrets.
The trustworthiness of this type of documentation can be assessed by the exactness of
the description of the process, by the lack of descriptive incoherence, by the truth of
every controllable fact mentioned, including the chronology, and by the reasonable
relationship between the effort invested and the actual result, as well as by the references to
8 One may ask whether one is justified in writing one’s own history. Again, this is the engineering way: Engineers mostly document their own activities and the products that result from them. Anyway, for a linguist it should not be that shocking after all, since both linguists and computational linguists always report their own findings and thoughts themselves.
the linguists who were involved. It certainly is documentation, but does it actually work
as documentation?
Unfortunately, there are signs today, more than twenty years after the discontinuation
of the Norwegian IBM project, that such an approach is not sufficient. The project is
hardly ever mentioned in the relevant literature9 and even allegations of fraud have been
levelled against it. Now, you can always point at a bridge. It is there, or at least you can
see a picture of it. Digital language resources are less tangible, and may even have
disappeared from the public eye in their original form,10 which is more or less the case of
the Norwegian digital language resources created by IBM in the 1980s. But they are still
there, although integrated in a different context.
In what follows, I am going to give an outline of how the lexicographical language
resources were actually created at IBM Norway in the 1980s, in order to refute
allegations about their origin and their very nature based on what is, to adopt a
benevolent interpretation, a selective misinterpretation of earlier documentation
attempts.
2 The project
The low academic status of lexical and morphological resources creation is in strong
opposition to its importance and to the quality required. In fact, high quality is a
prerequisite for anything else than computational linguistic toy systems. Now, toy
systems usually have no practical value – and are of no interest in a business context. So,
at a time when most computational linguistic research was concerned with small systems
working for a small subset of words only,11 it was only natural for IBM to advance on a
broad front when the company decided to create language sensitive software: In a first
phase, a list of all possible word forms of the written language was to be compiled, then
an adequate, extensive, and correct12 lexicon and a corresponding morphology, further a
synonyms dictionary as well as additionally “enriched” dictionaries. Later, these
resources constituted the basis for grammar development (intended for writing support
and style critiquing systems to start with) and, in the end, a system for machine
translation. (See Engh 1992c.) Many other types of applications were planned, but never
realised. The intention was to create linguistic “engines” for many kinds of user programs
that, eventually, would be put on the market for practical use. Thus, the question of
intellectual property was important already at the outset – with all the significance that
American companies attach to it. And it was all shrouded in corporate secrecy.13
9 IBM linguistic projects and their results have been largely ignored in Academia. For instance, neither IBM’s Norwegian project nor the corresponding projects for the other Scandinavian languages were even mentioned in the survey section about digital processing of the Scandinavian languages in the relevant volume of the reference series Handbücher zur Sprach- und Kommunikationswissenschaft (Allén 2002). One reason may be, though, that the Norwegian project was the only one to have lasting effects for the local linguistics community.
10 Although copies of the original files may be kept on tapes, of course, as long as they last.
11 With poor prospects of working in a real-life, scaled-up version.
12 From a normative point of view, in this case.
13 Only the project leader and his management at IBM had a general view of all the details and knew what was the objective of the development work at each stage. The linguists participating in the sub-projects only knew what they needed to know in order to perform their respective tasks. The very existence of the projects was only made public after the announcement of the software products of which they were a part.
2.1 Initial stage – the base dictionary
The project was initiated in 1984, when IBM launched a corporate offensive to create
language sensitive software in parallel for all the major written languages of Europe.
Since IBM Norway had no staff with the necessary linguistic competence at the time, it
turned to the University of Oslo for assistance. A joint project was organised with the
specific aim of developing necessary Norwegian language resources. The author of the
present article was contracted as leader of the project.14
The official start of the project was 1 July 1984. After necessary education and a
workshop session at IBM facilities in Gaithersburg (Maryland), the practical work started
in late August 1984. To supervise the project, a joint steering committee was appointed,
consisting of Even Hovdhaugen (professor of general linguistics, dean of the faculty of
arts and humanities), Jo Terje Ydstie (lecturer, applied linguistics), and Jan Hølen
(advisory systems engineer, IBM Norway) in addition to the project leader. The
committee had one meeting, where the project leader’s implementation plans and
linguistic guidelines were approved. After 15 February 1985, no formal ties existed
between IBM Norway and the University of Oslo. However, Jan Engh continued as the
project leader, now as a regular IBM employee, keeping informal contact with the
university and in particular with Dag Gundersen (professor of Norwegian lexicography).
The project took place on IBM premises and it was 100% funded by IBM, which also had
the exclusive right to the research and development results. The entire project was carried
out on IBM mainframes.15
A brief sketch of the technology involved and of the technicalities relating to the
progress of the project has been given most recently in Engh 2009. Here, I shall give an
account of the practical aspects of the linguistic development. The linguists taking part in
IBM Norway’s lexicographical activities in addition to the group leader are all mentioned
in the appendix, with indications of their particular role in the respective sub-projects.
The objective of the initial phase of the project was to cater for the Bokmål variety of
Norwegian, according to the official standard laid down by Norsk språkråd.16 From the
beginning, all possible lexical variants were included, even “permitted non-standard”
word forms (“valgfrie former”). In the last version, however, only the regular, optional
forms of the official “textbook standard” (“læreboknormalen”) were included.17 The
development of Norwegian Nynorsk resources did not start until 1988.18 Thus, it is
natural to focus on the Bokmål development, and to mention just a few critical facts
about the Nynorsk part of the project, since, in principle, the same kind of problems were
encountered during the development of resources for both standards of Norwegian.
Furthermore, the Bokmål project is the one whose existence has been most seriously
contested afterwards.
14 As senior scientific officer, “førsteamanuensis”, and university employee, “oppdragsforsker”.
15 Under VM on System/370.
16 The ‘Norwegian language council’, the standardisation authority for the Norwegian language. Since 2005 known as Språkrådet.
17 This reduced the size of the implemented lexicographic products (see below), but had an insignificant effect on their coverage in actual texts.
18 Vikør 2001 describes the relationship between Bokmål and Nynorsk.
The practical goal of the very first part of the project was, basically, to create one
extensive list of correct word forms. In a next step, the word forms documented were to
be sorted and classified according to a set of technical specifications. Several sub-lists
based on frequencies and numerous smaller word lists were to be developed in order to
satisfy specific system needs.
By far, most resources were spent creating the main list of word forms, the “base
dictionary”. There were two main requirements: Firstly, the coverage was to be
extensive, catering for the Norwegian vocabulary in general. Secondly, it should contain
unique forms only. The former requirement implied that the kernel vocabulary of the
language was to be covered, adding as many other words as possible – without an eye to
technical terms (computer terms, business terms etc.). Furthermore, all correct forms of
all lemmas should be entered. The latter requirement meant that the list had to contain no
duplicates/homographs and that this had to be prevented already in the input files, cf. p.
11. As a consequence, the result was not lemmatised, and did not satisfy the criteria of a
morphology, linguistically speaking. This made it practically useless for any other
purpose than the one it had been designed for. Thus, there would have been no reason to
attach importance to this part of the project from a purely linguistics point of view, if it
had not been for the question of lemma selection. This is an important aspect, though,
since the Bokmål base dictionary later served as the foundation for the development of a
lexicon and a morphology for Bokmål. So, the challenges of lemma selection will be
discussed in connection with the base dictionary for practical reasons, leaving the
question of how the word forms of each lexeme were processed to the subsequent lexicon
and morphology chapter.
2.1.1 A pioneering task …
The linguistic compilation of the base dictionary turned out to be a far greater task than
the computationally oriented American management had thought in advance. This was
mainly due to two sets of factors. First of all, one extra-linguistic factor: The general lack
of available Norwegian on-line dictionary data. IBM could not rely on electronic
language resources created by other private companies or by public research institutions.
In the second place, three linguistic factors: Firstly, the current state of the
standardisation of Norwegian. Characterised by a high and complicated degree of
variability, it was, and in fact still is, principally different from the standardisation of
other European languages. Unfortunately, its standardisation was also found to be much
more incoherent than expected. Secondly, the number of homographs of the Norwegian
base vocabulary. Finally, as a third complicating factor: The problems encountered while
trying to apply the rather simplistic IBM design of word compounding tagging to
Norwegian word forms. In fact, these linguistic factors alone were sufficient to explain
why the project took nearly one year to accomplish. In the present context, however, the
lack of digital language resources is what merits a more thorough account, given its
consequences for the entire project.
In 1984, there was no language resource centre providing lexicon, morphologies etc.
in addition to dictionaries and texts for Norwegian.19 Furthermore, of the scarce and
sparse digital resources that existed at that time, hardly anything was available for
19 Excepting the embryonic Norsk tekstarkiv, with a comparatively small number of texts. See below.
commercial use. So, a natural point to start would have been the processing of running
texts. There were many obstacles, however. In the mid-1980s, it was still
unrealistic on technical grounds to establish extensive electronic text archives or corpora
like those we take for granted today. Creation of corpora was difficult, even when
computer storage was available, since computer composition was not common. Instead,
printed texts had to be converted, a rather expensive operation. The texts had to be
entered either manually or by means of primitive optical character recognition, which required
extensive proofreading. Conversion would also have been generally illegal: All the major
Norwegian publishing houses at that time were strictly opposed to any sale of the right to
their texts – even for non-commercial, internal use only.
For IBM Norway, the solution was to use internal texts of any kind as the point of
departure in addition to the corporation’s own wordlist, NORFRQ LIST. Native language
user linguists supplemented this material to the best of their linguistic competence,
adding new words while consulting printed dictionaries when necessary - by looking up
single words, one by one.
2.1.2 Selection and entering of words
At the outset, the selection of entry words was based on NORFRQ LIST. This was a
frequency list of obscure corporate origin, probably compiled at IBM’s Austin (Texas)
laboratory in the beginning of the 1980s as input for the spelling checker of IBM’s
dedicated text processor, marketed in Norway as Serie 80. The list contained 50,000
unique word forms and was apparently based on a relatively big sample of technical texts
and business correspondence to and from IBM Norway. Its content, however, was of a
rather heterogeneous nature and it had obviously not been subjected to any screening by a
linguist: Quite a number of the words were misspelled, and there were a great number of non-standard (obsolete) and utterly infrequent word forms, e.g. “syv”, “tyve”, “hverken”20
and “asbesthanskeprodusentenes”,21 and, surprisingly, quite a few words pertaining to
Norwegian Nynorsk. Despite a certain imbalance as far as technical terms and business
terminology were concerned, NORFRQ LIST did contain a high number of word forms
belonging to what one would usually count as the kernel vocabulary. However, quite a
few important words were missing, the most prominent being “rød”, ‘red’…
So, it was clear already from the beginning of the project that NORFRQ LIST was
insufficient as a base. Consequently, the selection of words for the base dictionary was
extended to the few electronic texts available to the project – mainly of corporate internal
origin - in addition to extensive excerption/registration. All electronic internal
communication, e.g. business announcements, instructions and directives, social
information etc. was processed in the quest for additional words, and the project members
regularly scanned their own correspondence. As soon as practically possible, the staff
made diligent use of IBM’s spelling checker for the 370 mainframe environment,
PROOF, in order to detect unrecognised words.
Meanwhile, a systematic registration of lexical fields was carried out: E.g. parts of the
body, colours, names of most chemical elements, car parts, furniture, measures,
construction terms, fishing equipment, maritime terminology, sports, sports gear and
20 Instead of “sju” ‘seven’, “tjue” ‘twenty’, and “verken” ‘neither’ of the official Bokmål standard.
21 ‘of the producers of asbestos gloves’.
other fields of a concrete nature in addition to more elusive lexical fields such as
thought/meaning/opinion, truth and falsity, love/likes and hate/dislikes etc.
Once the kernel vocabulary was considered to be properly covered, the project staff
simply endeavoured to include as many other correct word forms as possible, by casual
excerption at the lexicographer’s discretion.
As the base dictionary was supposed to contain frequent names, the more common
given names and surnames were entered, in addition to the names of the 100 biggest
private companies in Norway, a selection of geographical names in Norway and abroad, a
list of post offices (including neighbourhood names), and the complete list of Norwegian
townships. The sources were:
Stemshaug, Ola et al.: Norsk personnamnleksikon. Samlaget, Oslo 1982
Utvalg av slektsnavn som hører til de mer vanlige. Justis- og politidepartementet, Oslo 1983
Norges største bedrifter. Oslo 1984
Cappelens skoleatlas. Verdensatlas for grunnskolen. Oslo 1984
Postadressebok. Postdirektoratet, Oslo 1984
Standard for kommuneklassifisering. Statistisk sentralbyrå, Oslo 1985
The entire word list was finally supplemented with available frequency data: Remaining
words were entered, and the frequency information of NORFRQ LIST served as a point
of departure for the compilation of various frequency components. During this process,
Heggstad, Kolbjørn: Norsk frekvensordbok. De 10000 vanligste ord fra norske aviser.
Universitetsforlaget: Bergen 1982
was consulted as a supplement.22
For a later version of the base dictionary, IBM Norway was able to purchase a
handful of texts in machine-readable format: A small collection of texts from Norsk
tekstarkiv and a collection of laws and law related texts from Lovdata. These texts were
simply converted to word lists, duplicates were eliminated, and the result compared with
the working copy of the base dictionary at its current phase. Similarly, the base dictionary
was checked against
Hanssen, Eskil A.: “Ordforrådet i et talespråksmateriale”. In Hanssen, Eskil A., Ernst Håkon Jahr,
Olaug Rekdal, and Geirr Wiggen (eds.): Artikler 1-4. Talemålsundersøkelsen i Oslo (TAUS),
Oslo 1986
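The comparison step described above (texts converted to word lists, duplicates eliminated, the result compared with the working copy of the base dictionary) can be sketched as follows. This is a minimal modern illustration, not the software actually used in the project; the function names, the tokenisation rule and the sample data are assumptions made for the example.

```python
import re

def word_list(text):
    """Convert running text to a set of unique, lower-cased word forms."""
    # [^\W\d_] matches any Unicode letter (including æ, ø, å);
    # hyphenated words are kept as single tokens.
    return set(re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text.lower()))

def missing_from_dictionary(text, base_dictionary):
    """Return the word forms of the text that the dictionary does not cover."""
    return sorted(word_list(text) - base_dictionary)

# Hypothetical sample data, for illustration only.
base_dictionary = {"rød", "vann", "er", "i"}
text = "Saltet er løselig i vann. Vann er vått."

candidates = missing_from_dictionary(text, base_dictionary)
print(candidates)
```

The resulting candidates would still have to be judged by a lexicographer before being entered, as described in the text.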
Word forms were identified with respect to their lexeme; the lexeme was expanded to
all conceivable forms,23 which were finally entered manually: Every entry word and its
inflected forms were typed in one by one by the lexicographers along with the required
information about part of speech and combinability.24 Also, the constituents of all
compound words found were properly identified and entered with all possible inflected
22
The usefulness of Heggstad 1982 was clearly limited. The data were evidently based on a rather small
corpus, and a biased one as well: The prominence of such words as “Saigon” confirmed its newspaper
origin from an intense phase of the war in Viet Nam.
23
Including the genitive (with the -S suffix) of every form of nouns, adjectives and past participles of transitive
verbs.
24
I.e. their ability to occur as constituent in a compound word – in front, in the middle, at the end, or
anywhere etc. See Engh 2009, p. 263.
forms in the base dictionary as separate entries in addition to all imaginable combinations
in the form of (other) compounds. This even included all variants of the words involved.
Thus, in addition to VANN ‘water’, LØSELIG ‘soluble’, and VANNLØSELIG ‘soluble
in water’, VATN ‘water’ and LØYSELIG ‘soluble’ were entered, and so were
VANNLØYSELIG, VASSLØSELIG, and VASSLØYSELIG25 all meaning ‘soluble in
water’.
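The registration of all variant combinations of a compound can be illustrated by a small sketch. In the project the forms were entered manually; the code below only shows the combinatorics involved, assuming that the combining forms of each constituent are listed by hand (VASS being the form VATN takes as an initial constituent). The variable and function names are hypothetical.

```python
from itertools import product

# Hand-listed combining forms of each constituent (assumption for the example).
FIRST_CONSTITUENT = ["VANN", "VASS"]
SECOND_CONSTITUENT = ["LØSELIG", "LØYSELIG"]

def compound_variants(first_forms, second_forms):
    """Concatenate every first-constituent form with every second-constituent form."""
    return [first + second for first, second in product(first_forms, second_forms)]

variants = compound_variants(FIRST_CONSTITUENT, SECOND_CONSTITUENT)
print(variants)
```

With two variants per constituent, four compound entries result, which is why the number of entries grew quickly once variants were taken into account.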
Finally, every word form of the base dictionary was provided with marks indicating
all correct hyphenation breakpoints.
2.1.3 First version ready
The first version of the base dictionary was completed on 21 June 1985.26 Technically
speaking, it consisted of two main files, a stems file and an endings file, in addition to a
number of minor files with a more technical content. Together, they provided input for
the “building” of a component that in its compacted form would drive a spelling checker
and a function for automatic hyphenation.
The base dictionary contained 292,190 unique word forms, which secured a fairly
high coverage of word forms in running Bokmål texts.27 Moreover, the number of
lemmas covered was comparatively higher than one would expect from this figure,
although the precise coverage was unknown, due to the way the requirement of unique
word forms was implemented. For instance, “for” as a form of the noun FOR n
(‘lining’; ‘fodder’), of the verbs FORE (‘line’; ‘feed’) and FARE (‘go, travel’), and of the
adverb, conjunction, and preposition FOR (‘for’; ‘because’; ‘for’) was counted as one
entry, regardless of the different meanings and the different parts of speech.28 Consequently,
it is impossible to know how many lexemes were actually covered, i.e. how many
lexemes were fully covered intentionally – and how many lexemes were fully or
partially covered unintentionally.
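Why lexeme coverage cannot be read off the number of entries may be seen from a minimal sketch of the collapsing of homographs into one entry with a combined part-of-speech set. The tuple layout and the function name are assumptions; only the codes A, N, R and V and the resulting set “ANRV” are taken from the description of the base dictionary.

```python
from collections import defaultdict

# (word form, lexeme, part-of-speech code); layout assumed for the example.
READINGS = [
    ("for", "FOR", "N"),   # noun
    ("for", "FORE", "V"),  # verb
    ("for", "FARE", "V"),  # verb
    ("for", "FOR", "A"),   # adverb
    ("for", "FOR", "R"),   # preposition
    ("fore", "FORE", "V"),
]

def collapse(readings):
    """Merge all readings of a word form into one entry with a sorted POS set."""
    pos_by_form = defaultdict(set)
    for form, _lexeme, pos in readings:
        pos_by_form[form].add(pos)  # the lexeme is discarded at this point
    return {form: "".join(sorted(tags)) for form, tags in pos_by_form.items()}

entries = collapse(READINGS)
print(entries)
```

Six readings belonging to several lexemes end up as two entries. Since the lexeme is not stored, the number of lexemes behind an entry cannot be recovered afterwards.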
In this connection, it is also important to stress that the stems and endings of the input
files were “technical” stems and endings, not linguistic ones. The endings could be
regular inflectional or derivational suffixes or clusters of both. One example relating to
“for” mentioned above: On the basis of “for” as a stem, one could generate not only
“fore” (infinitive of FORE), but also “foret”, “forene” (definite form singular and definite
form plural of FOR) etc., and even “forets” (definite form singular plus genitive), “forer”
and “foreren” (indefinite form singular and definite form singular of FORER ‘feeder,
supplier’), “foring” and “foringas” (indefinite form singular and definite form singular
plus genitive of FORING f ‘lining’; ‘feeding’).
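The principle of “technical” stems and endings can be sketched as plain string concatenation followed by a set lookup. The list of endings below is restricted to the forms mentioned in the text; the actual file format and the compaction used by the spelling checker are not reproduced here, and the names are hypothetical.

```python
# Technical stem and endings taken from the example in the text.
STEM = "for"
ENDINGS = ["", "e", "et", "ene", "ets", "er", "eren", "ing", "ingas"]

def generate_forms(stem, endings):
    """Build word forms by attaching each technical ending to the stem."""
    return [stem + ending for ending in endings]

FORMS = generate_forms(STEM, ENDINGS)
VALID_WORDS = set(FORMS)

def is_valid(word):
    """Verify a character sequence by simple lookup among the generated forms."""
    return word.lower() in VALID_WORDS

print(FORMS)
print(is_valid("Foringas"), is_valid("forsk"))
```

Note that the endings "er" and "eren" belong to the lexeme FORER and "ing" to FORING. The stem file makes no distinction between inflection and derivation, which is the sense in which the stems were technical rather than linguistic.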
The motivation for the non-lemmatised base dictionary was computational efficiency:
Compaction and minimisation of the number of operations – and the time needed - to
verify a given character sequence as a valid Norwegian word according to the current
25
“vass-” is the proper form of VATN as an initial constituent of a binary compound.
26
A second, extended version was ready for implementation on 14 December 1985.
27
In fact, much higher than what was usual for spelling checkers at that time. According to Fjeldvig and
Golden 1989:126, for instance, the spelling checker of WordPerfect was based on a list of 60 000 unique
word forms only.
28
In this particular case, the part of speech was indicated as the set “ANRV”. A ‘adverb’, N ‘noun’, R
‘preposition’, and V ‘verb’.
standard. The whole point of the technology adapted was fast pattern matching with
documented valid words.
The base dictionary was certainly not a morphology. However, it turned out to be
very important later on as the foundation of IBM’s morphology for Norwegian Bokmål.
But at the time of the base dictionary development, neither IBM Norway’s natural
language development group nor local management had any knowledge of future
corporate plans for a morphology.29
2.2 Morphology
Lemmatisation became a requirement only in the third update of the base dictionary,
when a synonyms dictionary was implemented. A morphology was necessary as a
“bridge” between the word forms identified by the base dictionary and the lemma forms
of the synonyms dictionary entries. (The synonyms dictionary contained no inflected
forms.) So, the third version of IBM’s lexicographical component for Norwegian Bokmål
was to contain a morphology.
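The “bridge” function of the morphology can be sketched as a two-step lookup: from an inflected word form to its lemma or lemmas, and from each lemma to the synonyms entry. The tables below are tiny hypothetical samples, and the synonyms listed are illustrative only, not taken from the IBM synonyms dictionary.

```python
# Word form -> possible lemmas (the role played by the morphology).
FORM_TO_LEMMAS = {
    "kastes": ["KASTE"],
    "kastet": ["KASTE"],
    "for": ["FOR", "FORE", "FARE"],
}

# Lemma -> synonyms; entries contain lemma forms only (illustrative data).
SYNONYMS = {
    "KASTE": ["HIVE", "SLENGE"],
}

def synonyms_for(word_form):
    """Look up synonyms for an inflected form via its lemma(s)."""
    found = {}
    for lemma in FORM_TO_LEMMAS.get(word_form.lower(), []):
        if lemma in SYNONYMS:
            found[lemma] = SYNONYMS[lemma]
    return found

print(synonyms_for("kastes"))
print(synonyms_for("for"))
```

Without the first table, a user who selects “kastes” in a text would find nothing, since the synonyms dictionary only lists KASTE.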
In 1986, there was no digital morphology for any variety of Norwegian that could be
purchased off the shelf. In fact, there was none at all that would satisfy IBM’s
requirement for broad coverage. There was even little other material available in general
to build a digital morphology on. During the last 50 years, morphology had been a
neglected part of Norwegian linguistics, in the sense that focus had been on limited
subsystems only. No adequate account had been given of the totality of Norwegian
morphology. Also, morphological information of printed dictionaries was inadequate and
not extensive. So, an entirely new lexicon and morphology had to be developed by IBM
Norway. (See Engh 1991b, 57.)
The basic linguistic part of the morphology development was carried out by the
project leader, who also took part in the development and adaptation of the morphology
software in cooperation with Beverly Knystautas (of IBM’s Gaithersburg laboratory).30
The work started in 1986.
When delivered in the winter of 1986-87, the first version of IBM’s morphology
consisted of two files, one inflections input file and one lexicon input file. They
comprised 550 different paradigms, mapping to 46,220 lexemes.31
29
This was the reason why no statistics about lexeme coverage were gathered during the compilation
process.
30
Assisted by Carmen Valladares of Centro Científico de IBM (Madrid), among others.
31
Additionally, the total quantity of words had increased compared to the earlier versions of IBM’s
Norwegian Bokmål lexicographic project: On the basis of the expanded lexicon and morphology, 418,922
unique word forms could be generated, which were all covered by the updated base dictionary.
2.2.1 The format
Was IBM’s morphology a morphology in the habitual meaning of the word? Yes and no.
It was a morphology in the sense that it gave an explicit account of every word form of
every lemma. It differed from a customary linguistic morphology in two respects:
Complex endings and one special requirement as far as the lemma was concerned. The
endings could be more than suffixes of the normal linguistic kind, as they included both
suffixes and sequences of suffixes plus yet another suffix. For instance, an inflectional
suffix (e, en or ede) plus a medio-passive S suffix or an S genitive suffix, i.e. es (as in
“kastes” ‘is thrown’), ens (as in “mannens” ‘the man’s’), and edes (as in “forkastedes” ‘of [that
which is] thrown’), were each counted as one ending. As for the lemma, it
could only belong to one paradigm: One lemma could not be inflected according to two
or more paradigms and there were no “exceptions”: no secondary or tertiary rules. In a
way, the morphology format could be thought of as “flat” or “two-dimensional”.
On the one hand, the verb SPREDE ‘spread’, for instance, would be conjugated
according to a paradigm whose past slot had two options, -de and -te, producing
“spredde” and “spredte”, not as belonging to the same paradigm as either KREVE
“krevde” or NEVNE “nevnte”, with just one optional past form. In traditional
morphology, this would have been an option. On the other hand, a verb like KVINE
‘shriek’ belonged to a paradigm with the following past options: {apophony I>EI + 0
ending} and -te, producing “kvein” and “kvinte”. According to traditional morphology,
KVINE would be conjugated according to one of two separate paradigms, either as
KLIPE ‘pinch’ (apophony I>EI) or as FORELESE (past form “foreleste”). Of course, this
linkage between the lemmas and their respective unique paradigms was not only confined
to verbs. For instance, nouns that can be inflected as masculine, feminine and/or neuter
were, in principle, handled in the same manner. To mention a few examples: SKJELL
‘shell’ fn32, SNØRR ‘snot’ fn, FLO ‘flow tide’ fn; NYRE ‘kidney’ fmn, HENGSEL
‘hinge’ fmn, GARDIN ‘curtain’ fmn; KRANGEL ‘quarrel’ mn, FJØS ‘cowshed’ mn,
HENSEENDE ‘respect’ mn. All were attached to one, individual paradigm.
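The “flat” organisation, with exactly one paradigm per lemma and optional forms listed inside the paradigm slot, may be sketched as follows. The paradigm identifiers, the slot name and the data layout are assumptions made for illustration; they do not reproduce the format of the IBM input files, and only the past slot is shown.

```python
# Each paradigm lists the possible endings per slot.
PARADIGMS = {
    "V_PAST_DE": {"past": ["de"]},
    "V_PAST_TE": {"past": ["te"]},
    "V_PAST_DE_TE": {"past": ["de", "te"]},  # both options within one paradigm
}

# A dictionary key can map to one value only: one lemma, one paradigm.
LEXICON = {
    "KREVE": ("krev", "V_PAST_DE"),
    "NEVNE": ("nevn", "V_PAST_TE"),
    "SPREDE": ("spred", "V_PAST_DE_TE"),
}

def inflect(lemma, slot):
    """Return all forms of a lemma for the given slot."""
    stem, paradigm_id = LEXICON[lemma]
    return [stem + ending for ending in PARADIGMS[paradigm_id][slot]]

print(inflect("SPREDE", "past"))
print(inflect("KREVE", "past"))
```

In a rule-based format SPREDE could instead have pointed to both V_PAST_DE and V_PAST_TE. In the flat format, a third paradigm is needed, which is one reason why the number of paradigms grew far beyond the number of traditional conjugations.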
There was one more important consequence of the IBM morphology format’s lack of
secondary rules etc.: “Local” adaptive rules that would adjust the stem to a typical ending
when needed were equally out of the question. For instance, STOL ‘chair’ and
KONFERANSE ‘conference’ are said to belong to the very same paradigm by traditional
morphology, cf. “m1” of Bokmålsordboka (Landrø, Wangensteen et al. 1986). However,
the final E of KONFERANSE has to be reduced before attaching the -EN definite article,
in contrast to nouns such as KNE ‘knee’, “kneet”, and HÅNDKLE ‘towel’, “håndkleet”,
where the definite singular neuter -ET suffix is added to the stem with an E final. Other
types of stem adaptations in the case of nouns are doubling of final single consonant,
KAM ‘comb’ plural indefinite “kammer”, elision of unaccented vowel in the last syllable,
REGEL ‘rule’ plural indefinite “regler”, and elision of unaccented vowel in the last
syllable and reduction of stem consonant, TITTEL ‘title’ plural indefinite “titler”.
In all these cases, a special paradigm was defined to cater for these and other lemmas
sharing the same characteristics. This was of course also the case with lemmas displaying
32
“f” for ‘feminine’ because the only definite article beside a possible neuter one is the suffix –a. While
most nouns of Bokmål usually considered to be feminine may equally have the –en suffix, which is
typically masculine, -a is unambiguously feminine.
stem alterations such as STRYKE ‘stroke, pat etc.’ and SKYVE ‘push’, which are
traditionally considered to belong to the same conjugation. Yet, they differ with respect
to the J that has to be inserted before an Ø after a SK-cluster (pronounced ʃ): STRYKE -
“strauk”/“strøk”, SKYVE - “skauv”/“skjøv”. In the IBM morphology, STRYKE and
SKYVE were attributed to two different paradigms.
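How stem adaptations can be handled without adaptive rules may be illustrated by a sketch in which the variable part of the word is moved into the endings of a separate paradigm. The way the technical stems are cut here (e.g. "reg" for REGEL) is an assumption made for the example, as are the paradigm identifiers and slot names; the source does not specify the actual segmentation.

```python
# One paradigm per stem-adaptation type; endings absorb the alternating part.
PARADIGMS = {
    "N_STOL": {"sg_indef": "", "sg_def": "en", "pl_indef": "er"},
    "N_KONFERANSE": {"sg_indef": "e", "sg_def": "en", "pl_indef": "er"},
    "N_KAM": {"sg_indef": "", "sg_def": "men", "pl_indef": "mer"},
    "N_REGEL": {"sg_indef": "el", "sg_def": "elen", "pl_indef": "ler"},
    "N_TITTEL": {"sg_indef": "tel", "sg_def": "telen", "pl_indef": "ler"},
}

# Lemma -> (technical stem, paradigm); segmentation is hypothetical.
LEXICON = {
    "STOL": ("stol", "N_STOL"),
    "KONFERANSE": ("konferans", "N_KONFERANSE"),
    "KAM": ("kam", "N_KAM"),
    "REGEL": ("reg", "N_REGEL"),
    "TITTEL": ("tit", "N_TITTEL"),
}

def forms(lemma):
    """Return the slot -> word form table of a lemma."""
    stem, paradigm_id = LEXICON[lemma]
    return {slot: stem + ending for slot, ending in PARADIGMS[paradigm_id].items()}

for lemma in LEXICON:
    print(lemma, forms(lemma))
```

Five nouns that a traditional description would assign to one class, with a few adjustment rules, thus require five paradigms.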
Not only did one lemma belong to one paradigm only; complementarily, what one
would customarily need one single paradigm to describe in a mainstream linguistic
morphology, would be accounted for by means of two or more in the IBM format. Why?
Although IBM’s morphology was a morphology in the true, linguistic sense of the word,
it was organised according to special computational requirements. This was a technical
necessity because of the systems architecture for which it was intended. It would have
been perfectly possible from an isolated computational point of view to implement a
morphology in a more traditional linguistic format, with rules applying to rules etc.33
However, the very basic technical requirements were not a subject of discussion, and the
Norwegian natural language processing group simply had to satisfy them. Yet, the IBM
morphology was an adequate, extensive, and correct morphology for Norwegian Bokmål.
And it was the first of its kind – not only technically speaking, but also in so far as it gave
a complete description of the linguistic facts on a strictly synchronic basis. The result
could easily be transformed into a linguistic morphology of a conventional format.
Moreover, the creation of the complete morphology on the basis of discontinuous and
often inconsistent morphological information from printed sources, among other things,
was far from trivial.
2.2.2 The linguistic development process
As already mentioned, the linguistic part of the development was carried out by the
project leader alone, mainly on the basis of his language competence as a linguist and his
experience from the base dictionary. First, a draft version was developed that catered for
as many different paradigms as possible. Later, the morphology was constantly revised in
confrontation with the words of the base dictionary.
One by one, the words of the base dictionary were categorised and the morphology
adjusted as required and more paradigms and even categories were added until the entire
base dictionary had been processed - and the morphology had reached a stable state. Then
the lemmas of the lexicon were expanded by means of the morphology in order to
produce all word forms. The result was printed out, and the printouts were proofread.
Whenever an irregularity was detected, the lexicon and/or the morphology were corrected
accordingly, and, where necessary, the proofreading process was repeated.34
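The expansion and checking cycle can be sketched as follows. In the project the generated forms were proofread on paper; the automatic comparison with the base dictionary shown here is merely an illustration of the same idea, and all names and sample data are assumptions.

```python
# Minimal hypothetical lexicon and morphology.
PARADIGMS = {"V_DE_TE": {"infinitive": ["e"], "past": ["de", "te"]}}
LEXICON = {"SPREDE": ("spred", "V_DE_TE")}
BASE_DICTIONARY = {"sprede", "spredde", "spredte", "rød"}

def expand(lexicon, paradigms):
    """Generate every word form of every lemma in the lexicon."""
    generated = set()
    for stem, paradigm_id in lexicon.values():
        for endings in paradigms[paradigm_id].values():
            generated.update(stem + ending for ending in endings)
    return generated

def compare(generated, base_dictionary):
    """List the forms found on one side only, as input for manual review."""
    return {
        "not_generated": sorted(base_dictionary - generated),
        "not_in_base": sorted(generated - base_dictionary),
    }

REPORT = compare(expand(LEXICON, PARADIGMS), BASE_DICTIONARY)
print(REPORT)
```

A form under "not_generated" points to a lemma or paradigm still missing from the lexicon and morphology, while a form under "not_in_base" points either to an error in a paradigm or to a gap in the base dictionary.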
Basically, the morphology development consisted in registration, and in certain
respects, the entire process represented a reverse of the original base dictionary creation
process: Every word form generated by the base dictionary was attributed to its lemma or
33
It would not have been possible to assign such rules automatically, though. Cf. the case of last syllable
vowel elision, where any clear principle seems to be missing, at least from a synchronic perspective. E.g.
“ankeret” and “begeret”, definite singular forms of ANKER ‘anchor’ and BEGER ‘cup’, respectively – in
contrast to “alteret” or “altret” and “mønsteret” or “mønstret” of ALTER ‘altar’ and MØNSTER ‘pattern’.
34
After the morphology had reached a stable state, its maintenance and expansion were gradually taken
over by Jørn-Otto Akø. As for the other linguists involved, see the appendix.
to all possible lemmas (like in the case of FOR35), and all lemmas were subsequently
attributed to their respective correct paradigms. This implied that the last operations
performed during the creation of the base dictionary, singling out the correct word forms
etc., had to be repeated in principle, since no records had been kept. Cf. p. 12.
The following written sources were consulted:
Sverdrup, Jakob, Marius Sandvei and Bernt Fossestøl: 1983, Tanums store rettskrivningsordbok.
Bokmål. 6. edition. Oslo: Tanum-Norli
Landrø, Marit Ingebjørg, Boye Wangensteen et al.: 1986, Bokmålsordboka. Bergen:
Universitetsforlaget
Norsk språkråd: 1972-, Årsmelding. [Annual reports from The Norwegian Language Council] Oslo:
Norsk språkråd
Here “consulted” means that the printed dictionaries were used the way their authors and
publishers had intended: Words were looked up in order to verify the linguist’s
competence when necessary.36 In this connection, it is important to know that Tanums
store rettskrivningsordbok and Bokmålsordboka had - and in fact still have - a semi-official
status as the documentation of the official norm, as defined by Norsk språkråd. In
fact, there was hardly any other place where the norm was systematically documented,
though not without numerous defects.
The linguistic challenge of this phase of the process was related to completion. The
special technical format forced the linguists to check whether the traditional paradigms
fitted the inflected forms of the class of lexemes they were supposed to cover before they were
implemented. It also demanded a heightened awareness of peripheral forms of the respective
paradigms. As a result, the linguists had to draw the consequences of the current official
standardisation of Norwegian Bokmål to an extent not represented in any printed source.
By doing so, every inconsistency was cleared up and every error encountered corrected,
in addition to the detection and charting of non-standardised parts of the lexicon.
In practice, this task implied that the linguists had to infer how each word ought to be
inflected, one word after the other, and to systematise the result in relationship to
traditional paradigms – as well as to adapt the result to all possible words requiring not
yet recognised exceptions and additional rules. This was far from a trivial task, given the
often confusing information of the relevant dictionaries.
A few typical cases of such vagueness: What was the correct form of the verbal noun
corresponding to verbs with a “floating” J: SVELGJE ‘swallow’ and SVELGING f
‘swallowing’ with J elision, in contrast to BØLGJE ‘wave, roll’ and BØLGJING f
‘waving, rolling’, i.e. without J elision? Should the present participles of the short and
long versions of what is historically and functionally the same verb, e.g. BE/BEDE ‘ask;
beg’, SKA/SKADE ‘damage; hurt’, KLE/KLEDE ‘dress’, AVKLE (but not
*AVKLEDE) ‘strip; lay bare’ etc., contain a stem final consonant or not: “beende” and/or
“bedende”, “skaende” and/or “skadende”, “kleende” and/or “kledende”, “avkleende”
and/or “avkledende”, respectively? These and numerous other such cases simply had to be
handled one by one.
Moreover, awareness of peripheral forms even has another aspect, not necessarily
linked to normative linguistics. Two prominent instances are plural of nouns and
35
See p. 12.
36
Cf. p. 3.
comparison of adjectives. Separate paradigms were created for nouns without a plural,
e.g. FEBER ‘fever’, MAT ‘food’, LYKKE ‘happiness’, and GODFOT ‘[idiomatically]
healthy foot’, for nouns with no singular, e.g. INNVOLL ‘intestines’, BOMPENGER
‘toll’, HVETEBRØDSDAGER ‘honeymoon’, and for adjectives without a comparison,
e.g. ABSOLUTT ‘absolute’, KJEMPEFLOTT ‘excellent’, GLITRENDE ‘brilliant’,
GRØNNFARGET ‘(dyed) green’.37
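Separate paradigms for defective lexemes can be thought of as paradigms in which certain slots are simply absent. The following sketch assumes hypothetical paradigm identifiers, slot names and stem segmentation; it only illustrates the principle.

```python
# Paradigms differ in which slots they contain at all.
PARADIGMS = {
    "N_REGULAR": {"sg_indef": "", "pl_indef": "er"},
    "N_NO_PLURAL": {"sg_indef": ""},
    "N_NO_SINGULAR": {"pl_indef": "er"},
}

LEXICON = {
    "STOL": ("stol", "N_REGULAR"),
    "MAT": ("mat", "N_NO_PLURAL"),
    "BOMPENGER": ("bompeng", "N_NO_SINGULAR"),
}

def form(lemma, slot):
    """Return the form for a slot, or None if the paradigm lacks that slot."""
    stem, paradigm_id = LEXICON[lemma]
    ending = PARADIGMS[paradigm_id].get(slot)
    return None if ending is None else stem + ending

print(form("STOL", "pl_indef"), form("MAT", "pl_indef"), form("BOMPENGER", "sg_indef"))
```

A spelling checker built on such paradigms would not accept a plural of MAT, since no such form is ever generated.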
The completion process was carried out in close cooperation with Norsk språkråd. In
quite a number of cases, its Bokmål section manager, Arnljot Thoresen, and its
lexicographers were consulted directly.38 Norsk språkråd later received a list of errors
and unclear information in Tanums rettskrivningsordbok detected during this process,
which at least potentially represented regular morphological research.
The distance between traditional morphology and the result of the standardisation and
the completion carried out to its ultimate consequences under the IBM morphology
project can be measured by the number of conjugations. Norwegian Bokmål is usually
said to comprise 10 conjugations39 (cf. for instance Berulfsen 1967, 141ff.) while, in
practice, the verbs could be categorised in no fewer than 243 distinct classes according to
IBM’s conception of digital morphology. (See Engh 1996.) In addition to the partly
defective standardisation and the inaccuracies of traditional morphologies and to the fact
that the morphology was conceived without a view to diachrony, this disproportionate
relationship between the numbers of conjugations was to a great extent due to the internal
variation characterising Norwegian. In the case of languages with a regular spelling and
regular inflections, and, above all, no internal variation, conversion and registration may
be a trivial task as far as linguistics is concerned. At least in theory, neither corrections
nor completion etc. will be needed.
Symptomatically, a similar effort was apparently not needed in order to implement a
morphology for any other language that was part of IBM’s project for the creation of
lexicographical language resources.
2.2.3 The last version of the Bokmål morphology
When the last official version of the lexicon and corresponding morphology was finished,
it comprised 705 paradigms mapping to 65,128 lexemes. In a pre-release to a next
version, which was never implemented in any software product, the number of lexemes
had risen to 121,577. This sharp increase was due to the fact that “new” words from one
year of the major newspaper Aftenposten were added and classified.40 On the basis of this
material, it was possible to generate more than 1.1 million unique word forms.
37
The soundness of such a classification and its relativity in relationship to application software where it
may be implemented etc. are discussed in Engh 1993.
38
Especially by telephone in what probably can be described as an unprecedented series of intense
consultations for the institution.
39
4 regular and 6 irregular conjugations. Of course, exclusive of their exceptions, of optionality etc.
40
Including an extensive number of proper names.
2.3 Other lexicographic products
As mentioned earlier, the lexicographical activities of IBM Norway’s natural language
processing group were more comprehensive than just the lexicon and the morphology for
Bokmål.
The corresponding Nynorsk project has only been mentioned in passing. In principle,
however, it was developed along the same lines as its Bokmål analogue - with the notable
exception that the linguists were able to develop it in a “natural” sequence, first lexicon
and morphology and then the “base dictionary”. This brought about a considerable
improvement in production efficiency. The linguistic problems encountered during this
process were similar to those of the Bokmål process, though.
As a point of departure, the existing Bokmål files were simply “translated” into
Nynorsk.41 Then the result was completed, following the same routines as in the case of
Bokmål. However, only the regular, optional forms of the official “textbook standard”
were included and no genitives with the -S suffix, except for proper names.42 This
implied that the Nynorsk paradigms were constituted by fewer categories, and that the
total number of word forms generated was relatively lower compared to its Bokmål
counterpart, without imperilling the coverage.
The development started 4 January 1988, and the first version was shipped to the
IBM laboratory at the end of January 1989. In the second and last delivery of 12 June
1990, however, the lexicon had been expanded to 110,412 entries while the number of
paradigms was 576.
Further, synonyms dictionaries were compiled for both Bokmål and for Nynorsk.43
The former consisted of 17,337 separate entries when its second and last version was
included in relevant application software. The latter contained more than 25,000 entries
in its only version. As a by-product of the compilation process, “new” words were added
to the corresponding lexicon and morphology.
A number of other lexicographical products were prepared, as well. Of special
interest in the present context is the enrichment of the Bokmål vocabulary in terms of
information about semantic and syntactic properties (cf. p. 3). All lemmas were classified
with respect to 180 characteristics (Engh 1994, 111ff.). One sample of syntactic
information44 added to adverbs: systematic information about positional properties.
41
I.e. corresponding stems were listed together with different stems with the same meaning as stems in the
Bokmål list, and every lemma form was attributed to a particular paradigm etc.
42
In accordance with the mainstream interpretation of the Nynorsk norm.
43
The existing printed synonyms dictionaries were not ideal for IBM’s primary purpose, to assist users
in improving their Norwegian, and probably not for implementation in any search program either. They were
more or less designed for solving crossword puzzles (Gundersen 1984) or for helping Bokmål users to
write Nynorsk (Rommetveit 1986).
44
Based on characteristics attributed according to a Poul Diderichsen style sentence model or position
grammar (Diderichsen 1946), which turned out to be very useful when the Norwegian PLNLP grammar –
the first broad coverage analytic grammar for Norwegian Bokmål - was developed (Engh 1994).
BAREPART - can only appear as an ADVP when functioning as a verbal particle
FRIADV - light adverb that may appear in the beginning of an NP or an ADVP
IKKEADVPS - cannot form an ADVP alone
FMADV - may only appear as the nucleus of the grounding field or the nexus field
FSADV - may only appear as the nucleus of the grounding field or the content field
MADV - may only appear as the nucleus of the nexus field
FADV - may only appear as the nucleus of the grounding field
SMADV - may only appear as the nucleus of the content field
MSADV - may only appear as the nucleus of the nexus or the content field
IKKEPRMJ - cannot modify an ADJP as a premodifier
IKKEPRMA - cannot modify an ADVP as a premodifier
Examples of other types of syntactic properties, general semantic properties and
information about style level and special lexical status:
V0 - zero valency
V2E - bivalent verb with two obligatory arguments
OBJINF - verb with possible object infinitive
A0 - may occur as finite verb of a complex passive phrase
KOLLEK - collective term
MASSE - mass term
TELLELIG - countable term
FRANSK - French loanword with authentic orthography
FYSJOM - “indecent” word
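Characteristics of this kind lend themselves to representation as a set of feature codes per lemma. The sketch below uses three of the codes listed above; the assignment of codes to the sample lemmas is the present illustration's own guess, not data from the IBM lexicon, and the function name is hypothetical.

```python
# Lemma -> set of feature codes (illustrative assignments only).
FEATURES = {
    "VANN": {"MASSE"},     # 'water', assumed mass term
    "STOL": {"TELLELIG"},  # 'chair', assumed countable term
    "REGNE": {"V0"},       # 'rain', assumed zero valency
}

def lemmas_with(feature):
    """Return all lemmas carrying the given feature code, in alphabetical order."""
    return sorted(lemma for lemma, codes in FEATURES.items() if feature in codes)

def has_feature(lemma, feature):
    """Check a single lemma for a feature; unknown lemmas carry no features."""
    return feature in FEATURES.get(lemma, set())

print(lemmas_with("MASSE"))
print(has_feature("STOL", "TELLELIG"), has_feature("STOL", "MASSE"))
```

A grammar component could query such sets in order to accept or reject a syntactic analysis, for instance by checking a positional code before placing an adverb in a given field.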
Still, the Bokmål lexicon and morphology constituted the most important part of the
IBM Norway lexicography project, since it was a first, since it was the most
comprehensive part of the project - but also because it became disputed.
2.4 Bokmålsordboka - a digression
By the end of 1989, more than five years after the start of IBM Norway’s natural
language processing project, its lexicographical resources for Norwegian Bokmål had
reached a stable phase. The corporate attention was now focused on the next linguistic
modules that were to be developed. The first priority as far as Bokmål was concerned was
to develop a grammar that would be implemented in a system for writing support and
style critique. More applications were planned for, and IBM Norway was asked to
procure digital material that could be relevant one way or the other in this connection.
The idea was to arrange for internal-use-only rights in the first place, in order to take a
closer look at the format and, if needed, test it in the original or a modified format in an
IBM development environment. Once its appropriateness had been established and, most
importantly, once the concrete need for its implementation arose, IBM Norway would
engage in negotiations for a licence for commercial use. Unfortunately, IBM Norway
abandoned natural language processing development and the development group was
dissolved before such a stage of development was reached.
Despite overt misgivings from major publishers, the project succeeded in acquiring
the current digital versions of the following titles:
Bjarne Berulfsen and Torkjell K. Berulfsen: 1989, Engelsk-norsk blå ordbok. (Kunnskapsforlagets blå
ordbøker) 5. edition. Oslo: Kunnskapsforlaget
Kirkeby, Willy A. and H. Scavenius: 1989, Norsk-engelsk ordbok. (Kunnskapsforlagets blå ordbøker)
5. edition. Oslo: Kunnskapsforlaget
Berulfsen, Bjarne and Dag Gundersen: 1986, Fremmedordbok. (Kunnskapsforlagets blå ordbøker) 15.
edition. Oslo: Kunnskapsforlaget
Rommetveit, Magne: 1986, På godt norsk. Synonymordbok med omsetjingar frå bokmål til nynorsk. 2.
edition. Oslo: NKS-forlaget
Landrø, Marit Ingebjørg, Boye Wangensteen et al.: 1986, Bokmålsordboka. Bergen:
Universitetsforlaget
These titles were later made available for IBM internal use only as electronic dictionaries.
This meant that IBM employees45 at the Norwegian headquarters could look up entry
words on their terminals and PCs.
The case of Bokmålsordboka deserves a closer comment. IBM Norway acquired the
right to internal use of Bokmålsordboka for a five-year period for NOK 100,000 plus a
yearly fee of NOK 10,000.46 In addition to the appropriateness tests, there were even
plans to analyse the entry words and the words in the definitions, including the examples,
in order to find words not yet covered by the base dictionary or the morphology. None of
these plans were ever implemented. Instead, the digital copy of Bokmålsordboka was put
at the disposal of an engineering student of the then Oslo ingeniørhøgskole,47 Jostein
Baustad. He worked as an unpaid trainee at IBM Norway in order to perform the practical
work for his BSc thesis. Although probably unheard of in Norway at that time, this type
of student internship was common practice in IBM internationally as a contribution to
university education in natural language processing.
By means of IBM software, Baustad converted the content of Bokmålsordboka in
structured text format to two subsequent database formats: DAM (Dictionary Access
Method) and LDB (Lexical DataBase format). The former was a simple format enabling
look-ups in the electronic dictionary, just like in any printed one – one entry at a time –
in WordSmith.48 The latter facilitated every type of (combined) search in the electronic
dictionary as a true database by means of LQL (Lexical Query Language).49
So, it is important to stress that, apart from being used as an ordinary electronic
dictionary by IBM Norway employees, the digital copy of Bokmålsordboka was never
made direct use of – neither as a database nor in any other format.
45
Translators, salesmen, engineers etc.
46
Cf. contract of 22 November 1989. There were also negotiations with Samlaget about a similar contract
concerning Bokmålsordboka’s parallel, Nynorskordboka. However, the publisher demanded the staggering
amount of NOK 1,000,000 and negotiations were abandoned.
47
‘Oslo college of engineering’.
48
Cf. Neff and Byrd 1988.
49
Baustad also converted the four smaller dictionaries mentioned above to DAM format for WordSmith
look-ups by IBM employees.
2.5 Discontinuation of the project, results, and documentation
Due to the international financial crisis of the late 1980s, IBM’s development division
discontinued all natural language processing projects. As a consequence, no corporate
funding for dictionary work was provided for 1991.50 On the basis of local funding, the
activities continued at a somewhat reduced pace till the end of 1991. The natural
language processing group was dissolved, and its leader left IBM.
At the time of its termination, IBM Norway’s lexicographical activities had been
documented in several ways. Preliminary results had been presented on one occasion to
linguists of the National Research Council’s computing centre (Bergen)51 in the first
move to make the group’s activities known to the public. Later, linguists from the
University of Oslo were invited on several occasions, individually and in groups, to see
the products at IBM premises (Kolbotn). At that time, these products had already been
implemented in various types of IBM software products. (They were never marketed
separately.) The most important event of this kind, though, was the Nordic conference of
lexicography in May 1991. One conference session was located at the IBM Norway
headquarters at Kolbotn, where a broad presentation of IBM Norway’s lexicographical
activities and results was given: The project leader gave a general overview, paying
special attention to problems identified by the natural language processing group (Engh
1992a), Jostein Baustad presented his engineering BSc thesis (Baustad 1992), and at a
session on university premises, Jørn-Otto Akø spoke about one particular problem of
lexicographical interest related to IBM Norway’s linguistic research and development:
the vagueness of the current Bokmål standard, particularly as codified in Tanums
rettskrivningsordbok and Bokmålsordboka (Akø 1992). (All three papers appear in the
conference proceedings, edited by Ruth Vatvedt Fjeld et al.) Additionally, the project
leader wrote comprehensive documentation of the entire lexicographical project during
the last months of 1991 (Engh 1991b),52 and one month before the end of the project, he
gave a presentation at the annual national conference for Norwegian linguistics, MONS
(printed in Engh 1992b).
After the conclusion of IBM’s natural language activities for Norwegian, the
lexicography project was mentioned in Engh 1993, and served as the base for Engh
1996.53 Furthermore, it is briefly described at http://folk.uio.no/janengh/IBMnorsk.html.54
Later, IBM’s research division sold the penultimate version of the lexica and the
morphologies to software companies,55 while IBM Norway transferred the most recent
files to Dokumentasjonsprosjektet56 (University of Oslo) for a symbolic sum.57 For
unknown reasons, Academia never showed any interest in the “enrichment” part of IBM
Norway’s lexicographical products: information about a variety of semantic and syntactic
properties of words, to which one could also add information about word compounding
and hyphenation.58 Of IBM’s two synonyms dictionaries, the one for Nynorsk was never
used by the corporation. Neither the Bokmål nor the Nynorsk synonyms dictionaries were
published in any other way.
50
The funding of the parallel grammar project had already been halted as of 3 August 1990.
51
NAVFs EDB-senter i Bergen.
52
This report and one on the parallel grammar development project of IBM Norway (Engh 1994) were later
deposited at the Norwegian National Library.
53
See p. 17.
54
Accessed 27 March 2013, but available since the late 1990s.
55
Inso used the Norwegian lexica both for spell-checking and for enhancing search applications.
From Inso, the lexica and morphologies even reached Microsoft, and through Inxight they found their way
to functions in software from Oracle, Microsoft, Yahoo and many others. Finally, Xerox Research Centre
Europe in Grenoble had a research license to this material. (Ian Hersey – originally IBM, later Inso, Inxight
etc. - personal communication.)
56
Literally ‘The documentation project’, a central public institution for converting non-digital cultural
resources to electronic form through the years 1992-1997.
In 1996, Kristin Hagen59 of the University of Oslo revised the IBM lexicon and
morphology for both Bokmål and Nynorsk with three objectives: firstly, to implement
the latest changes in the orthography of Norwegian, to correct a very small number of
errors, and to make minor technical alterations; secondly, to adjust the parts of speech
following the norm of the recently published Norsk referansegrammatikk (Faarlund et al.
1997); thirdly, to make certain adjustments as far as the extension was concerned: to
remove all genitive forms and to expand both morphologies with all imaginable forms of
each lemma, both morphological variants and orthographical variants in addition to
non-standard forms.60 The result was linked to Bokmålsordboka and Nynorskordboka and
complemented with lemmas from these two dictionaries. This later became the basis
of Norsk ordbank,61 a service from the University of Oslo, incorporated in the newly
established Norsk språkbank62 under the auspices of Språkrådet.63 Thus IBM Norway’s
lexica and morphologies constitute an important part of the base of today’s electronic
infrastructure for the Norwegian language,64 unfortunately not generally acknowledged as
such.65
57
Relevant information about the linguistic aspects of IBM’s lexica and morphologies, both Bokmål and
Nynorsk, can be found on http://folk.uio.no/janengh/IBMmorf.html. [Accessed 27 March 2013]
58
A curiosity: After the presentation at the national conference for Norwegian linguistics, MONS, in 1991
(cf. Engh 1992b), the next speaker presented a project that was to start at the Technical University of
Trondheim, NorKompLeks. (It was realised later on as a three years project 1996-1998.) Apart from
creating a machine-readable lexicon and a morphology, its objective was to classify Norwegian words with
respect to valency etc. In other words exactly what had already been carried out by IBM’s natural language
processing group (with the exception of information about pronunciation). Cf.
http://www.forskningsradet.no/servlet/Satellite?c=Prosjekt&cid=1193731511032&pagename=ForskningsradetNorsk/Hovedsidemal&p=1181730334233 [Accessed 27 March 2013]
59
Former temporary IBM employee. See Appendix.
60
For instance, “mjølkene” ‘the milks’ of MJØLK ‘milk’ and “fantastiskere” ‘more fantastic/amazing’ of
FANTASTISK ‘fantastic; amazing’, “permitted non-standard” word forms, and word forms such as “efter”
‘after’ and “gutta” ‘the lads’, words that were catered for in a different way, outside the lexicon and the
morphology in the IBM model of lexicography. Furthermore, common misspellings and abbreviations were
added. [Kristin Hagen, email dated 24 June 2013.]
61
Literally ‘the Norwegian bank of words’. Cf.
http://www.hf.uio.no/iln/om/organisasjon/edd/forsking/norsk-ordbank/ [Accessed 27 March 2013].
62
Literally ‘The Norwegian language bank’, cf. “Språkbanken - a language technology resource collection
for Norwegian”, available at http://www.nb.no/English/Collection-and-Services/Spraakbanken [Accessed
27 March 2013].
63
Formerly called Norsk språkråd.
64
Apart from its use in (computational) linguistics research, both directly and indirectly, for instance via
the Oslo-Bergen-tagger, it has been acquired for commercial use by Abilia, iFinger Ltd., Innovit AS, Lingit
AS, Microværkstedet/Mikroverkstedet, Oribi AB, Ovitas AS, Nynodata AS, Rescudo, and Sticos AS,
according to Norsk språkbank [Accessed 5 April 2013]. Additionally, it has been used by Wordfeud.
65
With the notable exception of Dokumentasjonsprosjektet and its successor Eining for digital
dokumentasjon, ‘Unit for Digital Documentation’ (University of Oslo), the technical host of Ordbanken.
3 Allegations
On the face of it, documentation of digital language resource projects like the one dealt
with above from the beginning of the 1990s would seem adequate. In 2002, however, this
author became aware of the following paper, which had been read at the 2000
EURALEX conference and later reproduced in its proceedings:
“On the basis of the lemmas in Bokmålsordboka the IBM-company [sic!] has made
lexical full form lists, which have been further developed through the so called
Documentation project as full fledged morphological bases of the standard forms in
Norwegian bokmål.” (Fjeld 2000, 672)
The author of the paper was informed that IBM did not make lexical full form lists on
the basis of the lemmas of Bokmålsordboka, and that the files received by “the so called
Documentation project” (University of Oslo) already represented a full-fledged
morphological base of Norwegian Bokmål (and Nynorsk as well). She maintained that
there was a lack of documentation as far as the IBM project was concerned, and that there
was nobody there to ask.
Later the same year, a new paper by the same author appeared, in which she claimed:
“I 1989 sammenliknet man ved IBMs språkavdeling det ordforrådet som lå inne i en
elektronisk versjon av Bokmålsordboka med de formene som faktisk var i bruk i
tekster. (---) IBM arbeidet imidlertid med et språkteknologisk mål for øye, (---) og
utarbeidet deskriptive lister av ordforrådet i løpende tekster. IBM-listene ble seinere
utgangspunktet for de morfologiske basene som Tekstlaboratoriet og
Dokumentasjonsprosjektet ved Universitetet i Oslo har utviklet (---).” (Fjeld 2002,
139)
‘In 1989, IBM’s section for natural languages compared the vocabulary of an
electronic version of Bokmålsordboka with the word forms actually used in texts.
(---) However, IBM had a language technology perspective, (---) and compiled
descriptive lists of the vocabulary in running texts. Later, the IBM lists served as the
point of departure for the morphological bases that were developed by
Tekstlaboratoriet and Dokumentasjonsprosjektet (University of Oslo) (---)’
For anyone without specific knowledge of the reality, the only natural interpretation of
this paragraph will be that IBM compiled lists on the basis of the vocabulary of an
electronic version of Bokmålsordboka, and that these lists served as the point of departure
for the morphology developed by the University of Oslo. Again, this is contrary to the
facts of the case.
This incident provoked two conference papers (later published as Engh 2009 and
2011) where things were put straight.
Then, in 2013, this author came across material from the CLARIN meeting of 2008,
where the following contention appears in a prominent presentation:
Success:
research org. as developers,
commercial org. as users
• UiO: (Bokmålsordboka + Nynorskordboka
Norw.BM + NN dictionary). Used by:
– IBM to develop their own lexicon
– NorKompLeks (NTNU+Telenor)
– Nyno translation system
– Several bilingual dictionaries: (No-Bulgarian, No-Lithuanian)
– Connexor (Finnish language technology)
– Lingsoft (Finnish language technology)
– Gule sider (Yellow pages)
– Mikroværkstedet (Danish language technology)
(Johannessen and Fjeld 2008)
In the light of the heading, emphasising the research organisations’ role as developers and
the commercial organisations’ role as users, “Used by: IBM to develop their own
lexicon” can only be construed to mean that ‘IBM made their own lexicon on the basis of
Bokmålsordboka and Nynorskordboka’ - the old groundless assertion in new disguise.
Together, these quotations contain one contention and one insinuation. The contention is
that IBM did not develop its own lexicon and morphology from scratch, but contented
itself with reformatting a digital version of a published dictionary. The insinuation is that
IBM made illicit use of somebody else’s intellectual property, since IBM did not have the
right to commercial use of the published dictionary. Both allegations are false.
Not the slightest proof of copying or illicit use is presented: no qualitative
comparison of the information in the IBM lexicon and morphology with that of
Bokmålsordboka, and not even a simple comparison of the number of lexical items in the
IBM lexicon and morphology and in Bokmålsordboka.
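To indicate how little such a simple quantitative comparison would have required, here is a minimal Python sketch. The function name `compare_lemma_lists`, the normalisation (trimming and lower-casing) and the sample words are assumptions made for illustration; they are not taken from the IBM files or from Bokmålsordboka. Two descriptions of the same language will necessarily share many lemmas, so figures of this kind are only a first step before any qualitative comparison.

```python
def compare_lemma_lists(lexicon_a, lexicon_b):
    """Compare two lemma lists by size and overlap.

    Lemmas are normalised by stripping whitespace and lower-casing
    (an assumption; real lexica may need more careful normalisation).
    """
    a = {lemma.strip().lower() for lemma in lexicon_a if lemma.strip()}
    b = {lemma.strip().lower() for lemma in lexicon_b if lemma.strip()}
    shared = a & b
    union = a | b
    return {
        "size_a": len(a),
        "size_b": len(b),
        "shared": len(shared),
        "only_a": sorted(a - b),  # lemmas found in the first list only
        "only_b": sorted(b - a),  # lemmas found in the second list only
        # Jaccard index: shared lemmas divided by all distinct lemmas.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


if __name__ == "__main__":
    # Invented toy data, for demonstration only.
    report = compare_lemma_lists(["gutt", "hus", "bil"],
                                 ["hus", "bil", "bok", "tre"])
    for key, value in report.items():
        print(key, value)
```

The lists of lemmas found in only one of the two sources are the natural starting point for the qualitative comparison mentioned above.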
3.1 Discussion
What can be the origin of these allegations?66 The only clue may be the fact that IBM
Norway, in a contract of fixed duration, bought the right to internal use only of a digital
copy of Bokmålsordboka in 1989.67 However, having a digital copy of a dictionary at
one’s disposal does not imply that it is used illicitly. In the case of IBM and
Bokmålsordboka, this is sustained by both external and internal, linguistic circumstances.
66
And, in fact, others, e.g. Fjeld and Henriksen 2012, where the importance of IBM’s contribution to Norsk
ordbank is depreciated ad absurdum.
67
See p. 21.
First: For IBM, it was of the utmost importance not to infringe the intellectual property rights
of others. The corporation was extremely mindful of its image in society - in general, as a
major player in the industry, but also because of the anti-trust case instituted against it a
few years earlier. So, the corporation had, above all, very strict, almost “religious” rules
of conduct, also as far as copyrights were concerned. Using a third party’s software without
a licence etc. was a reason for immediate dismissal. What is more important: These rules were
not pure formalities; they were enforced in practice. In fact, the project leader had to
put in a great deal of work on documenting to IBM lawyers that the products were really
made by IBM Norway’s own natural language processing group and not copied from
anybody else’s original works. Ethics apart, as a corporation holding tens of thousands of
patents, and even being engaged in a legal dispute with Fujitsu over patents at that time,
IBM would never have seen any advantage in violating the copyright of anyone else.
Secondly, when IBM Norway’s natural language processing group finally got access
to an electronic copy of Bokmålsordboka, the group had been active for more than five
years and, most importantly: IBM’s lexicon and morphology for Bokmål had been
developed already – as described and documented. Even when the first printed versions
of Bokmålsordboka (and Nynorskordboka) appeared in 1986, IBM’s morphology was
well under way.
Third, any copying based on printed dictionaries is out of the question on practical
grounds. Typing in the entire dictionary or copying it by means of optical character
recognition - which inevitably would have implied man-years of proofreading and
correction - would have taken quite a time to complete.68 It would also have cost a
fortune, and there is no sign of such an extravaganza in IBM’s records.
In the fourth place, there exists written documentation about how the work was
carried out, and none of those who took part in the lexicography project has ever raised
doubt about it. No copying is ever mentioned.
Even for internal, linguistic reasons, copying was out of the question. First of all,
because Bokmålsordboka did not conform to the quality standards required by IBM. As
pointed out above, it was partially imprecise, its morphology markings, “m1” etc. were in
part inaccurate and, to a considerable extent, defective. Also, it represented, at its best,
the incoherent standardisation of Norwegian Bokmål. An extensive completion was
necessary in order to obtain the result that eventually was made available to the
University of Oslo. Under all circumstances, the IBM Bokmål lexicon and morphology
represent a significant research-based added value compared to the relevant information
of Bokmålsordboka.
Secondly: Although IBM’s morphology, in its peculiar format, could be rather easily
transformed to a linguistic morphology of a conventional format, the converse was not
the case. As described above, the creation of the complete morphology on the basis of
discontinuous and often inconsistent morphological information from printed sources etc.
was far from trivial. The same goes for the accurate representation of “normal” usage.69 It
would have been no more trivial on the basis of a digital version, and – above all – it could
not have been carried out automatically.70
68
Cf. the experience of Dokumentasjonsprosjektet five years later - when optical character recognition
techniques were supposedly five years better. See Ore and Kristiansen 1998, p. 40 et passim.
69
No plural of MJØLK ‘milk’ etc. Cf. p. 21n.
70
Cf. p. 14.
Those who contest the origin of IBM’s lexicon and morphology for Bokmål should
also consider why it took so many years to create it if it was just a reformatted version of
Bokmålsordboka. Further, they should have asked those who worked at Norsk språkråd
at that time about the nature of our discussions. There were extensive consultations with
the linguists of Norsk språkråd. Why discuss details of standardisation if the material
involved was just a matter of reformatting?
Finally, it remains an open question why the University of Oslo group that eventually
received a copy of IBM’s lexicon and morphology after the discontinuation of IBM
Norway’s natural language processing group invested considerable resources in
reviewing this material, comparing the lexicon with the one of Bokmålsordboka, and
creating an augmented electronic dictionary on the basis of it and the definitions of
Bokmålsordboka, adding what IBM had considered defective forms, among other things.
This would have been a rather absurd activity if IBM’s material only represented a formatting of
the relevant data of Bokmålsordboka in the first place.
And, above all, the detractors should read the documentation and compare the
description of the work, the information about who carried out what task and when, the
quantities and the dates etc. in order to see that it represents a coherent description of how
the Bokmål lexicon and morphology were developed.
Interestingly enough, Nynorskordboka was mentioned in the last quotation above as
well (Johannessen and Fjeld 2008). As far as this dictionary is concerned, the case is
simpler still: IBM never had access to any digital copy of this dictionary.
When all is said and done, the allegations testify to an astonishing lack of knowledge
about practical lexicographical development as well as to a lack of ethical scientific
conduct.
3.2 Documentation
Was the documentation insufficient? Could the allegations mentioned above have been
avoided with more documentation or with a different kind of documentation? Probably
not.
First of all, there was accessible documentation. The information had been published,
at first in print; later, digital information also became available on the
internet. Still more information was available in open, albeit formally unpublished,
sources. The presentation above of how the lexicon and the morphology were made can
be regarded as some sort of summary of the relevant parts of the collected documentation
of the project from the early 1990s till 2009.
Secondly, there are witnesses. People were invited to demonstrations and
presentations at the time of the project, and linguists at Norsk språkråd were continually
consulted. And, above all, those who carried out the lexicon and morphology projects are,
with one exception, still alive and professionally active – with no ties to IBM whatsoever.
They are not far away. They may be consulted in person. Equally important, they have
been able to verify the documentation produced by the project leader and
others. In fact, some of them have even read the present article.71
71
Which should effectively eliminate the extreme possibility that the entire documentation is a fraud, and
that neither the final products nor the activities leading up to them ever existed ... (After all, we know from
science that such things may happen. From the Korean cow to our own Norwegian scientist and his
cheating with cancer data in The Lancet. Cf. http://news.bbc.co.uk/2/hi/asia-pacific/4554422.stm and
http://www.uniforum.uio.no/nyheter/2006/01/angrende-forsker-innroemmer-mer-fusk.html [both accessed
27 March 2013].)
In sum, the documentation exists and is easy to find.
What about the quality of the documentation? Is the existing documentation
adequate? By common engineering practice, it is. On the one hand, all
external data of importance are listed. On the other, all phases of the development and all
operations involved in order to obtain the result are described, and the kind of data that
served as input etc. is specified.
What more can one reasonably expect from documentation - necessarily different
from what it is supposed to document? In addition, by presenting problems and their
solutions as far as completion is concerned, the documentation has been brought as close
to standard linguistic presentations as possible. What more would be necessary in order to
document the creation of one particular set of digital language resources?
Finally, one is entitled to question whether the documentation is trustworthy. Any
competent linguist with practical experience of creating digital language resources
and the slightest insight into contemporary Norwegian standardisation will see that the
documentation is rational and conforms to reality. A thorough description of what was
actually accomplished and how during each phase of the project, and of how one phase of
the project led to the next, can hardly be the invention of somebody who did not have
exactly this experience: The base dictionary from start to ready product. From base
dictionary to lexicon and morphology. Morphology in its initial phase, through draft of a
tentative morphology to processing of all possible words that were contained in the base
dictionary etc. Detailed samples of inventories of all problems encountered and solved –
as well as accounts of how they were solved. In fact, there should be no need to even talk
about presumptive evidence. The description is internally coherent down to the smallest
detail, e.g. practicalities such as proofreading. Which again brings us back to those who
actually carried out the project: Ask the proofreaders!
There is even one more obvious external argument in support of the authenticity of
the project, since it was carried out in the private sector of society: Despite the theoretical
possibility of reengineering, no business organisation will ever spend a small fortune and
contract quite a number of qualified persons over a number of years to carry out an
unnecessary task. This is the private sector research and development version of Occam’s
razor.
4 At the end of the day ...
Since one cannot just point to a morphology like one points to another engineering
product, such as a bridge, there will always be plenty of room for allegations, no matter
the realities. Even (false) allegations of fraud may occur, despite the documentation at
hand. So, another type of documentation is not likely to make any difference at all.
The reasons for the allegations mentioned above must be found somewhere else.
Even if one accepts that they were not made with the intention to compromise the
intellectual property of others, they are certainly based on bad memory and/or lack of
competence. Perhaps even a desire for academic credit is part of the underlying cause,
besides a corresponding need to undermine the academic credibility of others? By
dismissing resource creation as mere formatting, its merits as linguistic research would be
effectively reduced, if not destroyed.
At this point, it is not out of place to redirect the attention to scientific documentation
itself: It is striking that academic referees have accepted papers to be published in
academic venues without proper checking of the facts. In fact, this casts a new light on
the limitations of peer-review screening.
Thus, instead of looking for new ways of documenting the creation of digital
language resources, what really needs to change are the attitudes of the relevant circles of
Academia, individual dishonesty aside: It has to be generally acknowledged that creating
digital language resources can be as much research as it is development, and that it, as
such, deserves the same attention and exact references as any other research activity. It
also has to be acknowledged that even if digital language resources need to be constantly
maintained in a way fundamentally different from for instance a bridge - due to the
dynamic nature of natural languages - their basic structure and importance do not
disappear with time. Ironically, the whole problem of allegations concerning IBM’s
lexica and morphologies would not have been brought forth if it were not for the fact that
they had certain qualities and are still in use. Furthermore, it has to be acknowledged that
high quality research is a possibility in the private sector, even in the field of language
processing and linguistics, and not just in disciplines such as medicine and in science.
Academic credit is a question of relevance and quality, not of the sector of
society in which the work originated.
Acknowledgements
Thanks to Jørn-Otto Akø, Marie Sundve Sannan, and Tor Ulset for taking part in the
project described above and for reading this paper, to Eskil Hanssen for reading it as
well, to Kristin Hagen and Håvard Hjulstad for providing useful information about a few
details, and to Diana Santos, who watched it all from the sideline.
References
Akø, Jørn-Otto: 1992, Gråsoner i norske ordbøker. [‘Grey areas/vagueness in Norwegian dictionaries’] In
Fjeld, Ruth Vatvedt (Ed.), Nordiske studier i leksikografi. Rapport fra Nordisk konferanse i
leksikografi i Oslo, mai 1991. Oslo, 65-75.
Allén, Sture: 2002, Nordic language history and computer-aided lexical research. In Bandle, Oskar (Ed.):
The Nordic languages. An international handbook of the history of the North Germanic languages I.
(Handbücher zur Sprach- und Kommunikationswissenschaft 22) Berlin: de Gruyter, 268-271.
Baustad, Jostein: 1992, Automatisk analyse av maskinleselige ordbøker til bruk i en orddatabase.
[‘Automatic analysis of machine readable dictionaries for the creation of a word database’] In Fjeld,
Ruth Vatvedt (Ed.), Nordiske studier i leksikografi. Rapport fra Nordisk konferanse i leksikografi i
Oslo, mai 1991. Oslo, 423-431.
Berulfsen, Bjarne: 1967, Norsk grammatikk. Ordklassene. [‘Norwegian grammar. The parts of speech’]
Oslo: Aschehoug.
Diderichsen, Paul: 1946, Elementær dansk Grammatik. [‘Elementary Danish grammar’] København:
Gyldendal.
Engh, Jan: 1991a, IBM Norway’s Database for Present-day Norwegian. IBM Norway, Kolbotn.
Engh, Jan: 1991b, IBM’s Norwegian Lexicon Projects 1984-91. Unpublished report, IBM Norway,
Kolbotn.
Engh, Jan: 1992a, Leksikografi i IBM Norge. [‘Lexicography at IBM Norway’] In Fjeld, Ruth Vatvedt
(Ed.), Nordiske studier i leksikografi. Rapport fra Nordisk konferanse i leksikografi i Oslo, mai 1991.
Oslo, 409-22.
Engh, Jan: 1992b, Språkforskning i IBM Norge. [‘Linguistic research at IBM Norway’] Paper read at Møte
om norsk språk [The national conference for Norwegian linguistics] (MONS) IV, Oslo 15.-17.11.1991.
Published in NORSKRIFT 72, 16-36.
Engh, Jan: 1992c, Use of PORTUGA for the two Norwegian written standards. (In collaboration with
Diana Santos). INESC Journal of Research and Development 1, 1992, 54-59. Reprinted in Jensen,
Karen, George E. Heidorn, and Stephen D. Richardson (Eds.): Natural Language Processing: The
PLNLP Approach. Hingham (Mass.)/Dordrecht: Kluwer 1993, 115-118.
Engh, Jan: 1993, Linguistic normalisation in language industry: Some normative and descriptive aspects of
dictionary development. Hermes 1, 53-64.
Engh, Jan: 1994, Developing Grammar at IBM Norway 1988-91. Unpublished report. Oslo.
Engh, Jan: 1996, Bokmålsverb: En oversikt over hvordan verb bøyes i bokmål. [‘Verbs of Bokmål: A
survey of how verbs of Bokmål are conjugated’] (Universitetsbiblioteket i Oslo, Skrifter 28) Oslo.
Engh, Jan: 2009, Lexicography for IBM. Developing Norwegian linguistic resources in the 1980s. In
Impagliazzo, John, Timo Järvi and Petri Paju (Eds.): History of Nordic computing 2. Second IFIP WG
9.7 conference, HINC2, Turku, Finland, August 21-23, 2007, revised selected papers. (IFIP advances
in information and communication technology 303) Boston: Springer, 258-270.
Engh, Jan: 2011, IBM’s Norwegian grammar project 1988-91. In Impagliazzo, John, Per Lundin and Benkt
Wangler (Eds.): History of Nordic computing 3. Third IFIP WG 9.7 Conference, HiNC 3, Stockholm,
Sweden, October 18-20, 2010. Revised Selected Papers. (IFIP Advances in Information and
Communication Technology 350) Heidelberg: Springer, 137-149.
Faarlund, Jan Terje, Svein Lie, and Kjell Ivar Vannebo: 1997, Norsk referansegrammatikk. Oslo:
Universitetsforlaget.
Fjeld, Ruth Vatvedt: 2000, An outline of Norwegian Lexical Database (LDB) and its classification of
adjectives. In Ulrich Heid et al. (Eds.). Proceedings of the 9th Euralex International Congress,
EURALEX 2000, Stuttgart, Germany, August 8th - 12th, 2000, 671-677.
Fjeld, Ruth Vatvedt: 2002, Normering i klemme mellom språkteknologiske og pedagogiske ordbøker.
[‘Standardisation in tight squeeze between language technological and pedagogical dictionaries’]
LexicoNordica 9, 131-148.
Fjeld, Ruth Vatvedt and Petter Henriksen: 2012, The BRO-project, a bridge in the wild, Norwegian
linguistic landscape. In Euralex 2012 Proceedings, Proceedings from the 15th EURALEX International
Congress, University of Oslo 7-11 August, available at
http://www.euralex.org/elx_proceedings/Euralex2012/pp936-946%20Fjeld%20and%20Henriksen.pdf
[Accessed 27 March 2013].
Fjeldvig, Tove and Anne Golden: 1989, Utvikling av språkbaserte metoder for behandling av tekst
[‘Development of language based methods for text processing’] Humanistiske data 1/2, 122-130.
Gundersen, Dag: 1984, Norsk synonymordbok (Kunnskapsforlagets blå ordbøker) 2nd edition. Oslo:
Kunnskapsforlaget.
Johannessen, Janne Bondi and Ruth Vatvedt Fjeld: 2008, Infrastructure building as a research task and a
necessity for language and speech R&D. Presentation delivered at the Clarin meeting, Bergen 15.-16.
December 2008. Available at http://clarin.b.uib.no/files/2010/02/janne.pdf [Accessed 27 March 2013].
Landrø, Marit Ingebjørg and Boye Wangensteen et al.: 1986, Bokmålsordboka. [‘The dictionary of Bokmål’]
Oslo: Universitetsforlaget.
Neff, Mary S. and Roy J. Byrd: 1988, Wordsmith User's Guide. [Version 2.0 IBM Research report]
Yorktown Heights (NY): T.J. Watson Research Center.
Ore, Christian Emil and Nina Kristiansen: 1998, Sluttrapport 1992-1997. Dokumentasjonsprosjektet.
[University of Oslo]. Oslo: Universitetets reprosentral. Also available at
http://www.dokpro.uio.no/sluttrapp.pdf [Accessed 27 March 2013].
Santos, Diana: 1996, Português Computacional. In Duarte, Inês and Isabel Leiria (Eds.): Actas do Congresso
Internacional sobre o português, Universidade de Lisboa, 11 a 15 de Abril de 1994. Lisboa: Edições
Colibri/Associação Portuguesa de Linguística, III, 167-184.
Sverdrup, Jakob, Marius Sandvei, and Bernt Fossestøl: 1983, Tanums store rettskrivningsordbok. Bokmål. 6th
edition. Oslo: Tanum-Norli.
Vikør, Lars: 2001, The Nordic languages. Their status and interrelations. 3rd edition. (Nordic language
Secretariat. Publication 14). Oslo: Novus.
Appendix
Staff
Apart from the project leader, 22 linguists participated in phases of IBM Norway’s lexicography project for
shorter or longer periods as part-time supplementaries, temporaries, vendors etc.:
Jørn-Otto Akø, Anneke Askeland, Heidi A.C. Christophersen, Hans-Olav Enger, Ingrid Bjorvand,
Hildegunn Kolle Flom, Dag Gundersen, Kristin Hagen, Eva Halvorsen, †Trond Kirkeby-Garstad,
Kristian Emil Kristoffersen, †Marit Ingebjørg Landrø, †Ola Lykkjen, Åsta Norheim, Linda
Salomonsen, Marie Sundve Sannan, Mildrid Solli, Andreas Sveen, Tor Ulset, Ivar Utne, Torbjørn
Vike, Dagfinn Worren
Of those, the following 8 worked for a longer time and even had direct development responsibilities for
certain sub-projects:
Jørn-Otto Akø, Marie Sundve Sannan, Marit Ingebjørg Landrø, Kristin Hagen, Eva Halvorsen, Trond
Kirkeby-Garstad, Åsta Norheim, Tor Ulset
The entire staff distributed by sub-projects and tasks:
Bokmål base dictionary
Jørn-Otto Akø (lexicographer; proofreading, control etc.)
Marie Sundve Sannan (proofreading, control etc.)
Bokmål morphology
Jørn-Otto Akø (lexicographer)
Ingrid Bjorvand (proofreading, control etc.)
Kristin Hagen (proofreading, control etc.)
Linda Salomonsen (proofreading, control etc.)
Andreas Sveen (proofreading, control etc.)
Bokmål synonyms dictionary
Jørn-Otto Akø (lexicographer)
Marit Ingebjørg Landrø (lexicographer)
Marie Sundve Sannan (lexicographer)
Bokmål enrichment
Anneke Askeland (lexicographer)
Hans-Olav Enger (lexicographer)
Hildegunn Kolle Flom (lexicographer)
Kristian Emil Kristoffersen (lexicographer)
Marie Sundve Sannan (lexicographer)
Nynorsk base dictionary
Jørn-Otto Akø (proofreading, control etc.)
Marie Sundve Sannan (proofreading, control etc.)
Tor Ulset (proofreading, control etc.)
Nynorsk morphology
Heidi A.C. Christophersen (proofreading, control etc.)
Kristin Hagen (proofreading, control etc.)
Eva Halvorsen (lexicographer)
Trond Kirkeby-Garstad (lexicographer)
Åsta Norheim (lexicographer)
Tor Ulset (proofreading, control etc.)
Dagfinn Worren (proofreading, control etc.)
Nynorsk synonyms dictionary
Eva Halvorsen (lexicographer)
Ola Lykkjen (lexicographer)
Mildrid Solli (lexicographer)
Tor Ulset (lexicographer)
Torbjørn Vike (lexicographer)
Dagfinn Worren (proofreading, control etc.)
Additional consultant work and minor projects
Jørn-Otto Akø
Dag Gundersen
Ivar Utne