‘Virtual’ Text Corpora and their Management
Khurshid Ahmad and Paul Holmes-Higgin
University of Surrey
October 1988
Contents
Abstract
Introduction
Text Typologies, Language Varieties and Corpus Building
    Lancaster-Oslo/Bergen Corpus: The pioneer corpus randomly populated
    Birmingham collection of English Text: Specially selected text corpora for lexicography
    Longman/Lancaster English Language Corpus: Deliberately and randomly selected corpus
    British National Corpus Initiative
    Linguistically grounded text typology
    Diachronic Corpora and their Structure
    PISA’s Bilingual Reference Corpora
Special Language Corpora and ‘whole texts’
    What is a special language
        Specialist texts, cohesion and coherence
        Specialist texts: A speech act analysis
        Language of Science
        Specialist texts or terminological texts
    British National Corpus and Terminology
Organisation of Text Corpora and text typology
    Well grounded text typologies and text representation
    Coding, Description and Representation of Texts
        Coding of texts
        Description of Texts
        Representation of Texts
A virtual corpus management system
Conclusion
Bibliography
Abstract
Text corpora come in different shapes and forms: the design is typically motivated by a task in
hand – reference base for a lexicography project, teaching and learning of second languages, and
corpora of historical and anthropological interest. Texts in a given corpus have many different
attributes and one or many of the attributes can be chosen to virtually create a (sub-)corpus. This
virtual corpus is inspired by the notion of the so-called virtual machines in computing science.
Introduction
The use of text corpora, particularly computerised text corpora, has had a beneficial
effect on the study of languages and, perhaps to a lesser extent, on the teaching and
learning of languages.
Following on from Quirk, his colleagues Greenbaum and Leech, and his student Svartvik, and
from Francis and Kucera, the use of text corpora has given a substantial boost to the ‘descriptive
studies’ of lexis, syntax, discourse and prosody in English. These studies reflect the state of the
language, that of its users, and the uses to which the language was put. Furthermore, the use of
text corpora has given a quantitative focus to the study of the ‘international’ varieties of English
and a similar focus to the historical/diachronic analysis of language change.
The acquisition, organisation and subsequent dissemination of computerised British English text
corpora was, in part, the work of dictionary publishers Longman and Collins and in part the work
of the International Computer Archive of Modern English (ICAME). The motivation of the dictionary
publishers was to produce dictionaries of English. A range of dictionaries is clustered
around the Longman/Lancaster 30 million word corpus of contemporary English, including
Longman’s Contemporary Dictionary of English (Summers 1991) and the more recent ‘concept-based’
Longman’s Language Activator (Summers 1993). The Collins Birmingham Collection of
English Text, comprising 20 million words, has been used to produce the Collins COBUILD
Dictionary and Collins COBUILD Phrasal Verbs.
Text Typologies, Language Varieties and Corpus Building
In this section we present a brief description of how texts have been categorised and these
categories used to develop text corpora. There are open questions in corpus linguistics related to
how texts should be selected and, perhaps, more importantly, for what purpose. Some argue that
lexicographers and linguists should choose the texts themselves with some advice from teachers
of English (Sinclair and colleagues in Sinclair 1987), whilst the corpus linguistics pioneers used a
random-selection approach (cf. Lancaster Oslo Bergen Corpus and the Brown Corpus). Still
others have argued that there should be an equal mixture of deliberately selected text and
randomly selected text (see, for instance, Summers 1991).
We hope that the discussion of how text is organised and, indeed, how representative text is
chosen, will motivate the reader to consider various parameters that can label a text. These
parameters may include the medium in which the text is delivered - books, magazines, journals,
leaflets, letters; the genre of the text, fiction or non-fiction, whether it is imaginative or
informative, persuasive or instructional. The register and the domain of the text are equally
important parameters. Furthermore, there are some atomic features of a text including author’s
age and sex, publication period, language variety and so on. (One might consider the use of
‘contextual correlates’ described by Halliday to categorise texts in terms of their tenor and field,
given that the mode of the language in the text corpora is textual).
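The idea that one or more of these parameters can be used to select a (sub-)corpus can be sketched in a few lines of Python. This is only an illustrative sketch: the record fields, the function name virtual_corpus and the three sample texts are assumptions made for the example, not part of any corpus described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRecord:
    """Descriptive attributes of one text (field names are illustrative)."""
    title: str
    medium: str       # e.g. "book", "journal", "letter"
    genre: str        # e.g. "fiction", "non-fiction"
    domain: str       # subject field of the text
    author_sex: str
    year: int         # publication year
    variety: str      # e.g. "British", "American"


def virtual_corpus(corpus, **criteria):
    """Return the texts whose attributes satisfy every criterion.

    A criterion is either a plain value (tested for equality) or a
    callable predicate applied to the attribute value.
    """
    def matches(text):
        for name, wanted in criteria.items():
            actual = getattr(text, name)
            ok = wanted(actual) if callable(wanted) else actual == wanted
            if not ok:
                return False
        return True

    return [text for text in corpus if matches(text)]


# Hypothetical holdings, invented for the example.
corpus = [
    TextRecord("Text A", "book", "fiction", "general", "F", 1975, "British"),
    TextRecord("Text B", "journal", "non-fiction", "physics", "M", 1982, "American"),
    TextRecord("Text C", "book", "non-fiction", "physics", "F", 1968, "British"),
]

british_books = virtual_corpus(corpus, medium="book", variety="British")
recent_physics = virtual_corpus(corpus, domain="physics", year=lambda y: y >= 1980)
print([t.title for t in british_books])
print([t.title for t in recent_physics])
```

The stored texts are never copied or reorganised; each selection is simply a view computed from the attributes, which is the sense in which such a sub-corpus is ‘virtual’.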
Lancaster-Oslo/Bergen Corpus: The pioneer corpus randomly populated
The Lancaster-Oslo/Bergen (LOB) Corpus [1] aimed at a general representation of text types, for
research on a broad range of texts. Its texts were selected from four ‘media’: books, newspapers,
periodicals, and government documents. Titles for the LOB corpus were randomly selected
from The British National Bibliography Cumulated Subject Index (1960-64) for books; from
Willing’s Press Guide (1961) for newspapers and periodicals; and from the Catalogue of
Government Publications (1961: London, HMSO) for government documents. The ‘media’
typology and the random selection methodology were pioneered by Francis and Kucera for the
Brown Corpus of American English.
[1] The LOB Corpus originated in 1970 at the University of Lancaster (UK) under the direction of Geoffrey N Leech.
In 1977 the project was transferred to Oslo and Bergen in Norway under the direction of Knut Hofland and S
Johansson.
The LOB corpus was categorised into informative and imaginative texts. The latter category
contains mainly works of fiction, ranging from detective fiction to science fiction, from adventure
and ‘Western fiction’ to general fiction, and romantic texts and humour. Figure 1 shows the
structure of the LOB corpus.
Brown/LOB Corpus
    Informative
        Press: Reportage, Editorial, Review
        Religion
        Skills, Trades, Hobbies
        Popular Lore
        Belles Lettres, Biography, Essays
        Government documents
        Learned and Scientific Writing
    Imaginative
        Fiction: General; Mystery, Detective; Science; Adventure, Western; Romance, Love Story
        Humour
Figure 1: LOB Corpus Structure
Biber (1988) has added two more categories to those used in the LOB corpus whilst discussing
variation across speech and writing samples of English. The first of Biber’s additions is professional
letters written in an academic context and comprising only administrative matters; the second addition
is personal letters written to friends or relatives. The first category is defined as ‘informational
and interactional’ and the second as ranging from ‘intimate to friendly’ (Biber 1988: 67). Presumably
both can be added to the informative category introduced by the designers of the LOB corpus.
Birmingham collection of English Text: Specially selected text corpora for lexicography
The Birmingham collection of English Text was compiled under the guidance of John Sinclair, in
close collaboration with Collins Publishers, and served as a source of ‘sufficient and relevant
textual evidence’ (Renouf, 1987: 1) for the production of ‘the first wholly new dictionary for
many years’ (Sinclair, 1987: vii): a dictionary not based solely on the introspection of
lexicographers and their advisers but based rather on how authors of a wide variety of texts (and
speakers partaking in conversation and delivering speeches with and to others) use words and
phrases. The COBUILD corpus contains 20 million words of current English in its computer
store.
The COBUILD corpus excluded certain categories of text included in LOB, such as poetry, and
excluded drama as not an example of ‘naturally occurring texts’. The focus of the COBUILD
team was on texts published between 1960 and 1985; the team preferred general language text
rather than ‘technical language’. The COBUILD corpus designers, with advice from teachers of
English in the UK and abroad, and from British Council Libraries abroad, selected texts
themselves. This method of text selection was different to the random-selection approach used by
the LOB corpus - that is, randomly selected titles from bibliographies, and so on. However, in
order to check the ‘relevance’ and ‘influence’ of a given text, the COBUILD team regularly
checked bestseller lists in newspapers as well as catalogues from leading publishers.
The text in the COBUILD corpus is not split along LOB’s informative/imaginative axis; rather,
the textual ‘medium’ is taken as a base classifier: books, newspapers, magazines, brochures and
leaflets, and personal correspondence are used to define the text typology. The book variety
dominates the corpus, contributing around 214 texts, with the rest contributing 70 texts. The book
variety is subdivided into fiction and non-fiction (‘imaginative’ and ‘informative’, respectively),
and the former dominates, contributing 177 texts out of 214 in the book category. The authorship
of books is 75% male and 25% female; British English text accounts for 70% of texts, whilst 20%
is American English and the rest other varieties. The text corpus comprises a range of topics
from American Indians to Vietnam, from childcare to sex through to the Third World, and from myths and
cults to natural history. The structure of the Birmingham Collection (Renouf, 1987: 23-32) is
shown in Figure 2.
[Figure 2: tree diagram of the Collins-Birmingham collection. Media: Book (Fiction, Non-fiction, Mixed),
Newspaper, Magazine, Brochures and Leaflets, Correspondence. Other node labels recoverable from the
diagram: Survey, Procedure, Argument, Narrative; extended texts, articles and collections (by topic);
handbooks and guidebooks (by subject area); positional/hortatory, balanced; biography, autobiography,
travelogue; novels, short stories, other extended texts; General, Historical (novels), Thriller (novels),
Academic (novels), Future worlds/Fantasy (novels), Humour; Daily, Weekly, Monthly, Topic; Government,
Banks, Holidays, Academic, others; Personal letter.]
Figure 2: The Structure of the Collins-Birmingham Collection of English Texts
Note the fine-grained organisation of books. Argument texts may be positional or hortatory: the ‘positional’
author puts forward his or her case in relation to a particular topic, and the ‘hortatory’ author exhorts the
reader to do or become something. The Birmingham collection also distinguishes between handbooks
and guidebooks, and maintains the distinction between varieties of narrative text: travelogue, biography and
autobiography.
Longman/Lancaster English Language Corpus: Deliberately and randomly selected corpus
Summers (1991) has argued that the motivation for creating the Longman/Lancaster Corpus was
to provide lexicographers and linguists with ‘an entirely new, conceived from scratch, corpus of
English that could serve a number of purposes and be organised according to objective criteria’
(ibid: 1). The Longman/Lancaster team acknowledges the influence of Geoffrey Leech and
Douglas Biber. The primary purpose of this 30 million word corpus was ‘to provide an objective
source of language data from which reliable linguistic judgments about the meaning and typical
behaviour of words and phrases can be made as a basis for dictionaries, grammars and language
books of all kinds’ (ibid: 3).
What distinguishes the Longman/Lancaster Corpus from the LOB or the Brown Corpus is that the
former is ‘topic driven’ whilst the latter are ‘genre driven’. Recall from Figure 1 that LOB
distinguishes ‘academic discourse’ from ‘press reportage’, ‘press editorial’ from ‘arts’,
and so on. The ‘topic driven’ texts in the Longman/Lancaster Corpus are categorised in 10
superfields, as shown in Figure 3.
Science
    Natural, Pure (6%)
    Applied (4.3%)
    Social (14.1%)
World Affairs (10.9%)
Commerce, finance (4.4%)
Arts (7.9%)
Beliefs, thoughts (4.7%)
Leisure (5.7%)
Fiction (40%)
Poetry, drama and humour (2.3%)
Figure 3: ‘Superfields’ in Longman Lancaster Corpus
The lexicographic argument for choosing the topic-based approach, pioneered by Michael
Rundell of Longman Dictionaries, was that ‘it was more likely to produce text categories that
were lexically homogenous’ (Summers 1991: 7).
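As a rough worked example, the percentages in Figure 3 can be turned into approximate word counts for a 30 million word corpus. The sketch below is only arithmetic on the figures quoted above; the function name word_targets is an assumption for illustration. Note that the quoted shares add up to 100.3%, presumably because of rounding in the source figures.

```python
# Superfield shares (per cent) as quoted in Figure 3.
SUPERFIELDS = {
    "Science: natural, pure": 6.0,
    "Science: applied": 4.3,
    "Science: social": 14.1,
    "World affairs": 10.9,
    "Commerce, finance": 4.4,
    "Arts": 7.9,
    "Beliefs, thoughts": 4.7,
    "Leisure": 5.7,
    "Fiction": 40.0,
    "Poetry, drama and humour": 2.3,
}
CORPUS_SIZE = 30_000_000  # the 30 million word figure quoted in the text


def word_targets(shares, total_words):
    """Convert percentage shares into approximate word counts."""
    return {name: round(total_words * pct / 100) for name, pct in shares.items()}


targets = word_targets(SUPERFIELDS, CORPUS_SIZE)

# Combined share of the three science superfields, and the grand total.
science_share = sum(pct for name, pct in SUPERFIELDS.items()
                    if name.startswith("Science"))
total_share = sum(SUPERFIELDS.values())

for name, words in targets.items():
    print(f"{name:28s} {words:>12,d}")
print(f"science share: {science_share:.1f}%  total: {total_share:.1f}%")
```

On these figures the three science superfields together account for roughly a quarter of the corpus, and fiction for about 12 million words.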
In contrast to the Birmingham Collection of English Text, which comprises mainly 1970s texts, the
Longman/Lancaster Corpus not only contains ‘drama’ as a subject (or text type), but also contains
roughly equal samples of English from 1900-49 and 1950-69 (30% each); the remaining 40% of the
Longman/Lancaster content is post-1970. As such, the diachronic breadth of the corpus is
greater than that of the Birmingham collection.
Texts in Longman/Lancaster are divided into two metacategories, informative and imaginative,
which are subdivided into superfields. As in the LOB corpus, informative texts comprise books,
newspapers and journals, and unpublished material and ephemera; imaginative texts are mainly works
of fiction in book form.
There are four ‘external factors’ that form the basis of text categorisation in Longman/Lancaster:
‘region’, including language varieties; ‘time’, the corpus being diachronic and containing text published
between 1900 and the 1980s; ‘medium’, that is, the ‘sources’ of texts: books (80%), periodicals
(13.3%) and ephemera (6.7%); and finally, the ‘level’ of text. For informative texts there are
three levels: ‘technical’, ‘lay’ and ‘popular’. Similarly, the imaginative texts are divided into
‘literary’, ‘middle’ and ‘popular’.
The ‘axiomatic’ features of texts in Longman/Lancaster include authors’ gender and country of
origin, target age and gender, number of words in total, title, and so on. Most texts in
Longman/Lancaster are about 40,000 words long, which is smaller than Birmingham’s
(c. 70,000 words where possible, including some whole texts). No whole texts were included because the
‘emphasis was on many sources rather than the completeness of texts’ (Summers, 1991).
The Longman/Lancaster corpus design is such that half of the 30 million words (c. 15 million) are derived from
carefully selected texts, the ‘selective texts’, and the other half from randomly selected
individual titles, collectively known as the ‘microcosmic texts’. Like the LOB corpus,
Longman/Lancaster used a book catalogue, Whitaker’s Books in Print, and selected texts
originally published in English (in English-speaking countries) from 1900 onwards, excluding
dictionaries and reference works and works for children. The selective texts are generally well-known
texts, whereas the microcosmic texts include technical texts that are not very well known.
The structure of Longman/Lancaster is shown in Figure 4.
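[Figure 4: tree diagram of the Longman/Lancaster corpus. The corpus divides into Selective texts and
Microcosmic texts (random selection). Node labels recoverable from the diagram: Imaginative (Books;
Fiction: General, Historical, Science, Romantic, Mystery, Adventure; ‘Not Fiction’: Drama, Humour);
Informative (Books; Newspapers/Journals; Unpublished and ephemera), with entries S1 to S8; Medium,
Region, Time, Level; Influence, Popularity, Educational, Recommended.]
Figure 4: Longman/Lancaster Corpus Structure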
British National Corpus Initiative
Atkins, Clear and Ostler, key players in the British National Corpus Initiative, have identified
‘principal features in corpus design’ (1992). Their paper contains the definition and elaboration
of a number of terms used in corpus linguistics, particularly corpus-based lexicography. Atkins et
al distinguish between a text archive, an electronic text library (ETL), and a corpus. The first is
essentially a collection of readable electronic texts contributed by others, including the authors of the texts. The
second, the ETL, is a collection of electronic texts in a standardised format, organised loosely on the
text content but without any selectional constraints. A corpus is a collection that has been put
together for a specific purpose, either for lexicography, for teaching and learning, or for studying
syntax, lexis or morphology.
Texts organised in a corpus can still be differentiated on a number of counts - the so-called
‘contrastive parameters’ of corpus typology. Recall that LOB comprised texts that were only
2,000 words long; Longman/Lancaster texts could be up to 40,000 words, but there was no insistence on
the texts being complete; similarly, the Birmingham Collection comprised texts that were about
70,000 words long. Such corpora are called ‘sample corpora’, as opposed to full text corpora.
Greenbaum’s International Corpus of English, a corpus of English from East Africa, Hong Kong,
India, Jamaica, New Zealand, the Philippines, Singapore and Wales, is partly based on the LOB
model, with 2,000 words per text, and partly on the notion of full texts.
Historical linguists and etymologists keen to study language change and eager to investigate the
life cycle of a ‘loan word’ can benefit from a text corpus containing texts that were
written/published at different times, covering decades and, in some cases, 2000 years. The
Longman/Lancaster developers regard the corpus as more diachronic, with roughly the same
distribution of texts from 1900 - 1985, than Birmingham, whose collection comprises a majority
(about 77%) of its articles published between 1970 and 1981, with 1980 - 81 articles making up
37% of the text collection. Indeed, the diachronic and dialectal Helsinki Corpus of English Texts
comprises 2100 text samples, ranging from 2500 to 10,000 words, claimed to be written between
850 - 1720. The dialects studied include south west Midlands, Mercian, Cornish, Devon,
Somerset, Wiltshire and South Avon (see, for instance, Rissanen (1991) for the diachronic
organisation of Helsinki Corpus and Ihalainen (1991) for dialectal variation within the corpus).
In the description of the Longman/Lancaster corpus it was noted (Figure 3) that there are 10
topics in the corpus, with fiction contributing 40% of the text, science-based text
about 25%, and world affairs about 11%. The focus of this corpus was on general
language, rather than subject specific languages. Table 1 introduces a corpus typology developed
by Atkins, Clear and Ostler (1992: 5-9):
Variant | ‘Real World like’ | Filtered Texts
Length | Full text archived permanently | Sample: only certain samples archived permanently. Monitor: full text analysed but not stored; only statistics stored.
Time | Diachronic | Synchronic
Topics/Subject fields | All subjects/topics | General language: non-technical, broad readership
Language Composition | Multilingual | Multilingual: all texts ‘shadowed’ with equivalents in other languages. Standalone.
Language Variety | All varieties: national varieties, dialectal, etc. | ‘Core’ texts: text types common to all varieties. ‘Periphery’: text types specific to some varieties.
Hardware/Storage | All texts | ‘Central’: selected body of texts, manageable in terms of on-line storage. ‘Shell’: all other texts, accessible but not stored on-line.
Table 1: Contrastive Parameters of corpus typology (after Atkins, Clear and
Ostler, 1992): Status of individual texts in the Corpus.
Linguistically grounded text typology
Biber has discussed the typology of English texts at length (Biber, 1988, 1989) and has attempted
to shift the basis of text typologies from practical to linguistic grounds. Functional text
typologies are based on one or two particular functional dichotomies, such as formal versus informal,
involvement versus detachment, integration versus detachment, and use these points of
distinction to describe text ‘types’. Whilst Biber does not detract from the utility of functional
distinctions, he argues that these typologies leave much to be desired ‘in identifying the salient
LINGUISTIC differences among texts in English’ (1989: 5).
Biber has pointed to the considerable linguistic variation within a given functional type of text, whilst
there is a systematic co-occurrence of linguistic features across the functional types. The
linguistically grounded typology is based on ‘sets of syntactic and lexical features that co-occur
frequently in texts’. These so-called ‘dimensions’ of variation are identified empirically by
multivariate statistical methods (factor analysis, to be precise), such that ‘texts in each type are
maximally similar in their linguistic characteristic, while the different types are maximally
distinct from one another’ (Biber, 1989: 5). Such a typology implies important functional
differences, as lexical and syntactic features are used to indicate common functions.
Biber has analysed the co-occurrence distributions of 67 linguistic features, divided into 16 [2]
major grammatical categories used in tagging the LOB corpus and the London-Lund Speech corpus,
across 481 text samples (written and spoken texts of contemporary British English) spread over
23 genres, including academic prose, forms of press reportage, telephone conversations, letters,
and so on. These 67 features make varying degrees of contribution to each of the dimensions: that
is, a number of features co-occur, this co-occurrence can be detected empirically, and each such
cluster of features is labelled as a dimension. Biber then goes on to interpret the functional import of that
cluster. For example, Biber distinguishes narrative text from non-narrative text (expository,
descriptive, and so on) by arguing that narrative concerns are marked by ‘considerable reference
to past time [past tense verbs], third person animate referents [3rd person pronoun] and reported
speech [public verbs]; non-narrative concerns [...] are marked by immediate time [present tense
verbs] and attributive nominal elaboration [attributive adjectives]’ (Biber, 1989: 9-10).
Biber identifies five [3] dimensions that underlie each co-occurrence pattern:
1. Involved versus informational production;
2. Narrative versus non-narrative concerns;
3. Elaborated versus situation-dependent reference;
4. Overt expression of persuasion;
5. Abstract versus non-abstract style.
[2] These are: (i) tense and aspect markers; (ii) place and time adverbials; (iii) pronouns and pro-verbs; (iv) questions;
(v) nominal forms; (vi) passives; (vii) statives; (viii) subordination features; (ix) prepositional phrases, adjectives and
adverbs; (x) lexical specificity; (xi) lexical classes; (xii) modals; (xiii) specialised verb classes; (xiv) reduced forms and
dispreferred structures; (xv) co-ordination; and (xvi) negation.
[3] In his 1989 paper, Biber notes that ‘[T]o date, five major dimensions of variation have been identified in English.
Biber (1988) [‘Variation across speech and writing’] presents a unified description of genre variation in English in
terms of his five dimensional model’ (1989: 7). The only problem is that Biber (1988) opens the discussion with a
seven-factor (later christened ‘dimension’) model and argues that the seventh factor does not seem to account for much
variance in the data and will be ignored: ‘Factor 7 is not strong enough for firm interpretation, and this factor will
therefore not be considered further in the present study’ (1988: 114). From then on, Biber concentrates on the first six
factors of the seven identified, and suggests ‘interpretative labels for each factor, to describe the underlying functional
dimension’ (1988: 115). The sixth dimension, ‘On-line Informational Elaboration’, which distinguishes
fragmented informational discourse from integrated or non-informational discourse, was not
discussed in the 1989 paper, and no reason was given for its omission.
Each dimension has a different number of lexical and syntactic features (Table 2), which are usually
unique to the dimension, although some features are shared across dimensions. Some dimensions
have complementary features, as is the case with the narrative versus non-narrative dimension: the
former pole is distinguished by the co-occurrence of third person pronouns, past tense verbs and so on,
the latter by co-occurrence amongst present tense verbs and attributive adjectives.
The complementary groups in the other dimensions reflect similar functional relations.
Others, such as ‘overt expression of persuasion’ and ‘abstract versus non-abstract style’, do not
have complementary features; for instance, an abstract text type will show a marked co-occurrence
of conjuncts, agentless passives and past-participle clauses, whereas a non-abstract (that is, non-technical
and informal) text type will be marked by an absence of such co-occurrence.
Dimension | Features
Involved -v- informational | 30 (6 complementary)
Narrative -v- non-narrative | 8 (2 complementary)
Explicit -v- situation dependent | 8 (3 complementary)
Overt expression of persuasion | 7
Abstract -v- non-abstract style | 6
Table 2: Dimensions and number of lexical and syntactic features
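How a text might be scored on one such dimension can be sketched as follows. This is a simplified illustration, not a reproduction of Biber’s procedure: it assumes that feature counts are normalised per 1,000 words, standardised across the texts, and then summed, with the complementary features subtracted. The feature names, the function names and the counts for the three sample texts are all invented for the example.

```python
from statistics import mean, pstdev


def per_thousand(count, n_words):
    """Normalise a raw feature count to a rate per 1,000 words."""
    return 1000.0 * count / n_words


def standardise(values):
    """z-scores of a list of values (population standard deviation)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def dimension_scores(texts, positive, negative):
    """One score per text: summed z-scores of the positive features
    minus summed z-scores of the complementary (negative) features."""
    z = {}
    for name in positive + negative:
        rates = [per_thousand(t["counts"].get(name, 0), t["n_words"])
                 for t in texts]
        z[name] = standardise(rates)
    return [
        sum(z[f][i] for f in positive) - sum(z[f][i] for f in negative)
        for i in range(len(texts))
    ]


# Invented counts for three hypothetical 2,000-word samples.
texts = [
    {"id": "story", "n_words": 2000,
     "counts": {"past_tense": 160, "third_person_pronoun": 90,
                "present_tense": 40, "attributive_adj": 50}},
    {"id": "report", "n_words": 2000,
     "counts": {"past_tense": 100, "third_person_pronoun": 50,
                "present_tense": 90, "attributive_adj": 90}},
    {"id": "paper", "n_words": 2000,
     "counts": {"past_tense": 30, "third_person_pronoun": 10,
                "present_tense": 150, "attributive_adj": 140}},
]

narrative = dimension_scores(
    texts,
    positive=["past_tense", "third_person_pronoun"],
    negative=["present_tense", "attributive_adj"],
)
for text, score in zip(texts, narrative):
    print(f"{text['id']:8s} {score:+.2f}")
```

With these invented counts the ‘story’ sample scores highest and the ‘paper’ sample lowest, which mirrors the narrative versus non-narrative contrast described above. A full analysis would first derive the feature groupings by factor analysis rather than fixing them by hand.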
Biber uses the dimensional statistics to group LOB and London-Lund texts into ‘clusters’: ‘Texts
that are similar with respect to the dimension but very different with respect to other dimensions
are likely to be grouped into different clusters’ (1989: 17). The texts in LOB and London-Lund
corpora appear to form eight ‘distinct’ clusters and their dimensional scores on each of the five
dimensions indicate the interacting functional nature of each cluster. Each cluster is characterised
either by a dominant dimension or by the absence of such a characteristic. Table 3 shows the composition
of the clusters according to dimensionality:
CLUSTERS | Involved -v- Informational | Narrative -v- Non-narrative | Elaborate -v- Situated ref. | Overt Persuasion | Abstract -v- Non-abstract
1. Intimate Interpersonal Interaction | Extremely involved | Not marked | Situated | Not marked | Non-abstract
2. Informational Interaction | Less involved | Not marked | Situated | Not marked | Non-abstract
3. Exposition/Scientific | Extremely informational | Non-narrative | Highly elaborated | Non-persuasive | Extremely abstract
4. Learned | Extremely informational | Non-narrative | Highly elaborated | Non-persuasive | Moderately abstract
5. Narrative/Imaginative | Moderately involved | Extremely narrative | Situated | Not marked | Non-abstract
6. General Narrative expository | Informational | Narrative | Not marked | Not marked | Not marked
7. Situated Reportage | Not marked | Non-narrative | Situated | Non-persuasive | Non-abstract
8. Involved Persuasion | Moderately involved | Non-narrative | Moderately elaborated | Extremely persuasive | Moderately non-abstract
Table 3: Clusters expressed in terms of their dimensions
These clusters are organised along functional categories, but the functions are interpreted in terms
of dimensions and are no longer ad hoc labels. The dimensions, in turn, are ‘computed’ by
calculating the co-occurrence of lexical and syntactic features of texts organised in genres or
subgenres. For instance, texts in the academic prose subgenres are distributed across four clusters:
two ‘obvious’ ones categorised under exposition, scientific and learned, which account for 75%
of the texts, and two less ‘obvious’ ones, ‘general narrative exposition’ accounting for 17% of
academic texts, and ‘involved persuasion’ for 9% of the texts. In contrast, 92% of the ‘romantic
fiction genre’ in LOB is in the ‘imaginative narrative’ cluster and the rest, 8%, in ‘general
narrative exposition’ (Biber 1989: 21-22).
Biber treats the ‘distance’ between clusters as an indicator of ‘similarity’: functional, lexical and
syntactic similarity. There are no ‘surprises’ here: clusters 1 and 2, comprising intimate
interpersonal and informational interaction, are proximate, as are the ‘expository’ scientific and
learned clusters (3 and 4). The situated reportage and involved persuasion clusters (7 and 8) are
proximate to the imaginative narrative cluster (5). Cluster 6, general narrative exposition, is
proximate to cluster 4.
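The notion of distance between clusters can be illustrated with a small sketch that treats each cluster as a point in the five-dimensional space and measures Euclidean distance between points. The centroid values below are invented purely for illustration and are not Biber’s published scores; the function name nearest is likewise an assumption.

```python
import math

# Hypothetical cluster centroids: one score per dimension, in the order
# (involved, narrative, elaborated, persuasion, abstract). Invented values.
centroids = {
    "intimate interpersonal interaction": (45.0, -1.0, -5.0, 0.0, -3.0),
    "informational interaction": (20.0, -1.0, -4.0, 0.5, -2.0),
    "scientific exposition": (-18.0, -3.0, 5.0, -2.0, 9.0),
    "learned exposition": (-20.0, -2.5, 6.0, -1.5, 3.0),
}


def nearest(name, points):
    """Return the cluster whose centroid is closest to that of `name`."""
    others = (other for other in points if other != name)
    return min(others, key=lambda o: math.dist(points[name], points[o]))


for cluster in centroids:
    print(f"{cluster} -> {nearest(cluster, centroids)}")
```

With these invented centroids the two interactional clusters are each other’s nearest neighbours, as are the two expository clusters, which is the kind of proximity relation described above.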
On the whole, the typology presented by Biber (Figure 5) is a synthesis of the typology used by
the LOB corpus researchers, that is, the broad classification into informative and imaginative text,
and the typology espoused by the ‘functionalists’, who insisted on classifying text according to its
communicative function: narrative, expository, hortatory and so forth (cf. Birmingham Collection
of English Text). The difference here is that Biber has used statistics related to lexical and
syntactic feature distribution across texts, thereby grounding the typology well.
[Figure 5: flow diagram. Text → lexical and syntactic features → compute variance → five dimensions
(Involved/Informational; Narrative/Non-narrative; Elaborate/Situated; Overt persuasion;
Abstract/Non-abstract) → map similarities → cluster labels (Interactional, Informational, Expository,
Learned, Scientific, Narrative, General, Imaginative, Reportage, Persuasion) → classify text.]
Figure 5: Biber’s text typology based on statistical distribution.
Diachronic Corpora and their Structure
The Helsinki Corpus of English Texts comprises a diachronic part and a dialectal part. The
diachronic part is divided into Old English, Middle English and Early Modern English, spanning
a period from about 750 AD to the beginning of the eighteenth century (Kytö and Rissanen
1988; Rissanen 1991). The compilers of the corpus attempted to make the three periods
compatible in terms of the amount of text, text types, registers and genre. The corpus contains
about 1.5 million words, and texts vary from 2,500 to 10,000 words. Kytö and Rissanen
have defined nine major text prototypes: correspondence, drama, instruction (secular and
religious), narration (fiction and non-fiction), proceedings, science and stipulation.
The Helsinki Corpus, especially along its diachronic variation, shows the growth and influence of
science, and the development of drama in the Middle to Modern Age. Figure 6 shows the structure
of the Helsinki Corpus.
[Figure 6 (diagram): text types in the Old English, Middle English and Early Modern English parts
of the corpus. Labels recoverable from the diagram: Stipulation (law, document); Science
(astronomy, medicine, other sciences); Instruction, secular (recipe, prognostication, handbook,
education) and religious (homily, rules, religious treatise, sermon); Narrative, non-fiction
(history, travelogue, diary, biography) and fiction (romance, travelogue); Correspondence (preface;
private and non-private letters); Drama (miracle and morality plays, comedy); Proceeding
(deposition, trial, meeting).]
Figure 6: Helsinki corpus structure.
PISA’s Bilingual Reference Corpora
The use of bilingual corpora in corpus-based lexicography and in collecting ‘large quantities of
documented evidence on the possible realisation of a concept in two languages’ has been reported
by Marinai, Peters and Picchi (1991: 64) in the context of Italian and English.
These authors have described a ‘parallel text retrieval’ system that can be used to query both the
source and target language text corpora for the purposes of creating a bilingual dictionary.
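The idea of querying both sides of a parallel corpus can be sketched as follows. This is an illustration only, not the system described by Marinai, Peters and Picchi: the sentence pairs are invented, and the alignment is assumed to have been done already at the sentence level.

```python
import re

# Hypothetical, already aligned sentence pairs (Italian source, English target).
ALIGNED = [
    ("Il sistema memorizza i testi.", "The system stores the texts."),
    ("I testi sono allineati frase per frase.", "The texts are aligned sentence by sentence."),
    ("Il dizionario elenca le parole.", "The dictionary lists the words."),
]


def tokens(sentence):
    """Lower-cased word tokens of a sentence."""
    return re.findall(r"\w+", sentence.lower())


def query(term, side="source"):
    """Return every aligned pair in which `term` occurs on the chosen side."""
    index = 0 if side == "source" else 1
    term = term.lower()
    return [pair for pair in ALIGNED if term in tokens(pair[index])]
```

A lexicographer could call `query("testi")` to see how a source word is rendered across the aligned target sentences, or `query("texts", side="target")` to search in the other direction.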
Marinai, Peters and Picchi have created an interesting Italian-English Bilingual Reference corpus
whose translations range from ‘literal translations’, where the translator keeps close to the
source text, to ‘free translation’, where the translator feels ‘free to render the concepts in what
he feels is the most appropriate manner’. The composition of the corpus is given in the following
table (Table 4).
Text Type: Genre | Text Type: Sub Genre | Topic | Source Language | Target Language
Informative | Learned Papers [4] | Computer Science | Italian | English
Informative | Popular (Excerpts) | ‘Inflight Magazine’ | English [5] | Italian
Informative | Text Book | ‘Scientific’ | American English | Italian
Instructive | (Brief Extracts) | - | Italian | English
Imaginative | Fiction - short story extracts; Poetry | - | English | Italian
Table 4: The composition of texts in the Pisa Bilingual Reference Corpus
(derived from Marinai, Peters and Picchi 1991: 64)
Unfortunately, details of this ‘reference’ corpus are sketchy. We are not given any dates, but since
the authors talk about ‘well-known American [sic] scientific text books’ used by Italian
university students, and mention texts ‘donated [...] by colleagues’, we have surmised that the
texts in their corpora are of a recent vintage. There is no description of language variety
(American English, British English and so on), nor do we have any data about the length of the
texts, except that the texts, other than possibly those in the learned paper and poetry sub-genres,
are ‘excerpts’ or brief extracts. Nevertheless, the authors do convey the essence of the problems
faced by lexicographers in compiling bilingual dictionaries: the availability of a corpus is a
boon if the texts are roughly equivalent, but it demands considerable vigilance when the
translations are creative.
[4] The authors obtained these texts from colleagues, so we have surmised that these must be research papers.
[5] Inflight magazines are usually published by a select few organisations based in the English-speaking
world; we have extrapolated that the source language is English.
Special Language Corpora and ‘whole texts’
What is a special language
Human enterprises are characterised usually by a group of people having a specific, carefully
delineated interest. The ‘enterprise’ of philosophy, for instance, involves a community of
philosophers having their separate university departments, learned societies, journals, newsletters
and so on. The ‘enterprise’ of cricket involves a tightly-knit group of players, managers, support
staff, entrepreneurs promoting matches; this group has its clubs, galas, magazines, rule-books,
biographies and travelogues. Cultural anthropologists would have us believe that shamanistic
magical practices are undertaken by a group of shamans who communicate with each other
through icons, that there are novice ‘shamans’ taught by the expert in a ‘strange language’. The
twentieth century has seen the development of supra-national business and finance communities
which undertake not only business but also the training of young people through university
courses, seminars, conferences, books, magazines and so forth. The distinguishing feature of
each of these enterprises is the way they use their first or second language to communicate with
their fellow community members to transfer facts, heuristics, concepts and ideas.
One might argue that as geographically determined communities of people communicate through
a dialect of a ‘national’ language, or as socially stratified groups use sociolects of their
dialects/national language, the ‘epistemologically determined’ specialist human enterprises
motivate at times, force at others, the involved community to use an enterprise-specific special
language. We have, rather reluctantly, used the term ‘epistemology’; our use is based
on the definition of epistemology as dealing with ‘enquiry into the nature and ground of
experience, belief and knowledge’ (Lacey, 1986:63). It is our view that the human enterprises,
and the involved communities, do share beliefs and knowledge and have similar experiences, or
aspire to do so. Epistemology addresses questions about the related topics of the nature of
knowledge, its different types, what is known and the origin of knowledge. Therefore, we believe
that special languages, as they are used to express experience, belief and knowledge, can be
regarded as ‘epistemologically determined’.
This ‘epistemologically determined’ community is characterised by its attempt to use many
lexical items, comprising a more highly nominalized vocabulary than, say, is the case with the
community’s everyday, spontaneous natural language and, indeed, than is the case with more
imaginative forms of writing, such as ‘newspaper-speak’, fiction, and personal letters.
Furthermore, on the whole a special language appears to comprise fewer elements of metaphor
and simile than, say, dialects and sociolects may have: the endeavour of the specialist human
enterprise is to avoid polysemy, to avoid, or to eliminate, culture-bound references and so on.
There are some syntactic features that distinguish special language from ‘general’ language: for
instance, special languages, particularly the language used by scientists and engineers, contain
many more instances of passives than one would find in a quality newspaper (cf. Svartvik 1966),
and there is a surfeit of (complex) nominals that may constitute as much as 60% of a given
clause or sentence (cf. Halliday 1993). Table 5 compares three varieties of ‘national’
languages, that is, sublanguages of the general language.
Language variant | Demarcation | Linguistic Characteristics | Source
Dialects | Geographically determined | Phonological features; few lexical items | Regional history, Geomorphology, Flora & Fauna; Metaphor, Similes, Polysemy
Sociolect | ‘Socially’ determined | Syntagmatic features; some lexical items | Anthropology, National history, socio-economics; Metaphor, Similes, Polysemy
Special Language | Epistemologically determined | Many lexical items; Some syntactic features | Derivation, Foreign languages, Artificial symbol sets; Semantics
Table 5: A tentative classification of sublanguages based on their ‘demarcation’,
linguistic features and some of the principal sources. Based on a discussion in
Sager, Dungworth and McDonald (1980:63-69)
The differences between the various sub-languages, or special languages, and the language used in
literature and poetry have been noted at length by a number of language researchers, including
Halliday (1993), de Beaugrande and Dressler (1981), and Sager, Dungworth and McDonald
(1980). These differences are particularly accentuated for the ‘language of science’. Indeed, a
number of well-known scientists have commented on the language of science; prominent among
them is one of the founders of quantum theory and modern physics, Werner Heisenberg.
Specialist texts, cohesion and coherence
According to de Beaugrande and Dressler, ‘literary and poetic texts could be seen in opposition to
text types intended to increase and distribute knowledge about the currently acceptable ‘real
world’’ (1981: 186). This is particularly true of scientific texts, which disseminate the results of
explorations that extend and clarify existing knowledge through the presentation and examination
of evidence inferred from observing natural phenomena or gathered from documented knowledge.
Text Types | Principal Entities | Frequent Conceptual Relations | Surface Text Characteristics | Pattern
Descriptive | Objects and situations | Attributes, states, instances, specifications | Density of modifiers | Frame
Narrative | Actions and events | Cause, reason, purpose, enablement and time proximity | Subordinates | Schema
Argumentative | Evaluation of beliefs and ideas (±) T(F) | Reason, significance, volition, value and opposition | Emphasis and ???; recurrence, parallelism, paraphrase | Plan
Table 6: A text typology along functional lines, i.e., according to the contributions
of texts to human interaction
We might succeed in distinguishing texts purely on their statistical characteristics. However, as
de Beaugrande and Dressler have argued, statistical linguistics should be augmented by
consideration of the functions of text in communication and the pursuit of human goals (1981:
182-185).
Specialist texts: A speech act analysis
Special language, according to Sager, Dungworth and McDonald, uses ‘special text forms, a
restricted syntax and arguably a limited morphology’ (1980: 40). Except for patents and legal
texts, special languages do not in any significant fashion ‘redesignate syntactic forms’. It is on
the semantic level that special language can be differentiated from other languages, as each
specialism requires ‘a system of references which organises and structures’ its object of study and
the processes involved. The object may comprise materials, instruments, mechanisms and
artefacts such as machines, products and parts, and installations, together with their properties,
qualities and states. The specialist language attempts to organise and structure processes and also
the methods of these processes.
Sager, Dungworth and McDonald (1980) were inspired by, though do not wholly subscribe to, the
‘speech act’ approach to language pioneered by Austin (1962) and Searle (1969). A speech act is
the result of ‘the convergence of a speaker (or writer), a listener (or reader) and a topic (an area of
reference at a particular time and place in a specific situation)’ (1980: 22). A philosophy
undergraduate reading commentaries on Aristotle written by a medical scholar in a present setting
is one speech act, whilst the same undergraduate reading a newspaper feature article on the ethical
issues faced by, say, a government committee on abortion is another speech act. The four
simultaneous aspects of a speech act, ‘utterance’, ‘propositional’, ‘illocutionary’ and
‘perlocutionary’ acts are ‘fundamental to the understanding of communication’ (1980: 23). The
‘intentions’ underlying the difference between the ‘illocutionary act’, indicated by punctuation,
mood of verb, word order, and ‘perlocutionary act’ indicated by the effect the utterance has on the
reader, are classified by Sager et al into four types: informative, evaluative, directive and phatic
intentions. These intentions can be differentiated and defined in relation to the four principal
features of the speech act for written texts: writer, reader, topic, and time and place. Sager et al’s
approach to special language texts appears to incorporate the linguistic features with
communicative intent and the action the communication precipitates.
Language of Science
Savory (1967) has discussed ‘the language of science’ with particular emphasis on the etymology
of scientific words, including the ‘processes’ by which scientists build up the vocabulary of their
sciences and how the language of science grows, what is its character, and how, Savory
speculates, the literature of science comes into being. According to Savory, words of science fall
into three major categories - depending on how they were ‘coined’: borrowed words, imported
words and invented words (1967: 32-34). Scientists borrow words from their own language:
sometimes they give a precise meaning to the word, such as salt and atom; at other times the
meaning of these words is just as vague and uncertain as in the ordinary language - words like
‘life’ and ‘bug’ fall into this category; and yet there are some words that keep a vague shadow of
their original (general language) meaning, cf. work, force and power. Importation from other
languages is quite common amongst the scientists; here the meaning of some of these words is
preserved in translation, like radius and femur, and for some even mythological references are
preserved: Atlas, the name of the vertebra on which the skull rests, reflects the Greek
mythological story of Atlas carrying the world on his shoulders.
Scientists also add to the overall stock of their languages by ‘inventing’ new words. Some
invented words were carefully defined, and care has been exercised in preserving their
original meaning: hormone is a good example. Yet there are invented words that started their life
very much like hormone, but they caught the public imagination and, according to Savory, their
meaning has deteriorated: allergy is quoted by him as an example. Table 7 summarises the above
discussion.
Type | Assimilation/Invention | Meaning of the word | Exemplar
Borrowed Words | taken by scientists from ordinary language | Uncertain and obscure both in LGP and LSP | ‘life’, ‘bug’
Borrowed Words | | More precise in LSP than in LGP | ‘salt’, ‘atom’
Borrowed Words | | Precise in LSP but still a vague shadow of its LGP meaning | ‘work’, ‘force’, ‘power’
Imported Words | taken by scientists from languages other than their first language | LSP meaning the same as the source LGP meaning | ‘femur’, ‘fulcrum’, ‘radius’
Imported Words | | LSP word that has a metaphysical connection with its source LGP meaning | ‘Atlas’ (name of the vertebra corresponds to Atlas carrying the world on his shoulders)
Invented Words | ad-hoc creation by scientists | LSP words that were carefully defined and care taken to preserve original meaning | ‘Hormone’
Invented Words | | LSP words carefully devised but caught popular imagination and are exposed to abuse | ‘Allergy’
Table 7: Etymology of words.
Specialist texts or terminological texts
The word ‘terminology’ and its derivative - terminological - have been used much in the same
way as many terms and words that were not precisely defined at first, and have since deviated in
meaning.
In Artificial Intelligence (AI) literature, one finds extensive discussions about terminological
knowledge representation formalisms. These formalisms are the basis for developing programs
that represent on a computer system not only the definition of a term but also various
assertions about the term. In a limited way such formalisms support the intensional and
extensional attributes of an object, process, action or event. The question here is how one
distinguishes between a term, knowledge of a domain and formalism primitives: these are
overlapping categories. Terminological knowledge representation perhaps conveys another
intuition of the AI community, the so-called ‘concept logic’ or ‘term logic’, but it cannot be held
up as an example of terminological exactitude.
The term ‘terminological corpora’ has been used by corpus linguists (cf. Atkins, Clear, Ostler,
1991) to express the intuition that, somehow, texts that may be regarded as an ‘evidence’ and
‘reference’ source for terms can be isolated from general language texts and used for acquiring
terminological data.
British National Corpus and Terminology
Recently, Atkins, Clear and Ostler (1992, 1:16) have discussed a number of design criteria which
they think may be relevant to the construction of text corpora used for lexicography,
computational linguistics, machine translation and knowledge engineering.
Atkins et al have outlined ‘certain contrastive parameters of corpus typology per se’ (1992: 5).
These parameters include length of the constituent texts, that is from unabridged to a small
excerpt, onto ‘monitor’ text corpora where the text is stored only temporarily, the time-dimension
of the texts, whether the texts were selected for a fixed date - synchronic - or the selection was
diachronic, the languages that comprise texts in the corpora, for example is the text corpus
monolingual or bilingual, and if bilingual, whether there are parallel texts or not. Amongst the
nine contrastive parameters, Atkins et al introduce the contrast between ‘general’ and
‘terminological’ types of texts, and note that ‘terminologists must define conditions which must
obtain if a corpus is to be used for terminological use’. The notion of the ‘terminological corpora’
is not elaborated by Atkins et al except when they discuss what kinds of attributes a text may
have, a total of 29 as it turns out, and terminology is referred to in two attributes of a text:
‘technicality’ and ‘native language of authors’. The ‘technicality’ attribute is a 3-valued
attribute that depends on ‘degree of specialist/technical knowledge of the author and
target/readership audience’ (1992: 8). For instance, a ‘technical’ text is written for and by
specialists, a ‘semi-technical’ text for non-specialists by specialists, and a ‘general’ text is
written usually by non-specialists
for non-specialists. The authors note that ‘these [general, semi-technical, technical] must be
external variables, not linguistic style variables. This particular attribute is highly important for
terminological corpora’ (ibid.). This classification has been used by Yang whilst analysing a
corpus of scientific English for identifying terms (1988).
The description of the ‘native language of the author’ attribute is rather incomplete in Atkins,
Clear and Ostler; nevertheless, the authors note that if there is no information forthcoming about this
attribute, we should assume that the native language is the language of the text. Furthermore,
‘this attribute [is] of particular importance to terminological corpora. Precise information about
first language is in many cases unavailable or irrelevant’ (1992: 8).
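The two attributes discussed above can be sketched as a small record type. The class and field names are our own illustrative assumptions, not taken from Atkins, Clear and Ostler; the sketch encodes only the three technicality values and the default rule for the author's native language as summarised in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextAttributes:
    """Hypothetical record holding two of the text attributes discussed above."""

    language: str
    author_is_specialist: bool
    readers_are_specialists: bool
    author_native_language: Optional[str] = None

    def technicality(self) -> str:
        """Derive the 3-valued attribute from author and readership (external variables)."""
        if self.author_is_specialist and self.readers_are_specialists:
            return "technical"
        if self.author_is_specialist:
            return "semi-technical"
        # Assumption: any text by a non-specialist author is treated as general.
        return "general"

    def native_language(self) -> str:
        """Fall back to the language of the text when no information is available."""
        return self.author_native_language or self.language


paper = TextAttributes("English", author_is_specialist=True, readers_are_specialists=True)
primer = TextAttributes("English", True, False, author_native_language="Italian")
```

Here `paper.technicality()` yields the ‘technical’ value and `paper.native_language()` defaults to the language of the text, whereas `primer` carries an explicitly recorded first language.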
Organisation of Text Corpora and text typology
Any collection of texts reflects, howsoever infinitesimally, the state of a language (or languages)
used by the authors of the texts. A text corpus is an image of language. But part of the image is
shaped by how the corpus was organised, the texts chosen either in part or whole, the texts
selected from one period of time or over a continuum, the topics discussed in the texts all
belonging to one subject or many and so on. Lexicographers may be happy with text samples of
everyday language over a period of time whereas a terminologist may require full texts in a
specific subject area after a particular date only.
One can infer from the discussion in sections 1 and 2 that, whilst there may be broad agreement in
corpus-based lexicography as to what constitutes a text type, there remain open questions and
fundamental differences in how a text is viewed. The typologies discussed above do work in that
they have been used in producing lexica, and in helping a number of researchers gain insight
into the (English) language in a relatively introspection-free manner. The basis of such a typology
is really heuristic: a rule of thumb devised by the expert for solving a relatively open-ended
problem.
The discussion in the previous section showed that the texts comprising a corpus may be
classified and organised in many different ways depending on the use to which the corpus is to be
put. For instance, lexicographers may be concerned about the variety of national language, or they
may wish to specify a period of time which they regard as contemporaneous with the
lifetime of the lexica they wish to produce. Then again, a gender-conscious lexicographer or social
historian may regard the gender of the author(s) of the text as more important than the 29 or so
text attributes specified by Atkins, Clear and Ostler (1992) or the attributes specified by Summers
(1991) that emphasise diachronic variation and so on.
Well grounded text typologies and text representation
The concerns of a text linguist, for example de Beaugrande and Dressler (1981), are related to
‘the contributions of texts to human interaction’ (ibid.: 184); in other words, they run along
functional lines. Here, therefore, the primitives of the text typology (description, argumentation
and narration) comprise, to varying degrees, literary, scientific and didactic texts. Text linguists
and systemic function-oriented linguists would prefer a different typology of text from that
preferred by, say, a lexicographer or social historian.
Technical writing and ‘special English’ experts do sometimes take their cue from research in
speech acts (Searle 1969) and from the work of Halliday (1973, 1975) and Hasan (1976), and have
developed a very broad-based typology that embraces a much broader category of texts than
the typologies used by lexicographers and theoretical text linguists. Sager,
Dungworth and McDonald’s ‘typology of special forms - the traditional forms which occur in
special communication’ embraces the informative form, including announcement and news-flash;
the evaluative form, including commentary, test report, testimonial and so on; the directive form,
including instructions, directives and regulations; and the phatic form, including presentation
(1980: 124-181).
Given that texts may have a whole range of attributes that are given, like age of the author, his or
her first language, length of the text and so on, that each text in a corpus may (or may not) have
been written by its author with an exclusive function in mind, and given that each may be
classified according to its classifier’s intellectual orientation and his or her immediate need, the
question here is this: how are the texts to be organised in computerised text corpora such that the
corpus users can access the texts according to their own intellectual orientation and immediate
needs?
Coding, Description and Representation of Texts
Computer programs are essential for the creation of a text corpus, and are equally important for
retrieving one or more texts, or one or more text fragments. Generally, corpus creation programs
are based on their programmers’ understanding of the salient properties of the texts that are to be
stored in a corpus, and on their understanding of how a corpus is to be structured. The
programmers, in turn, learn about the properties and the structure from the organisers of the
corpus: these organisers can be literary experts, linguists, lexicographers, information scientists
and so on.
Once a corpus is created then a potentially well-defined group of users access the stored texts
through another set of computer programs. These retrieval programs are based on a model of the
user’s requirements as understood by programmers working under the direction of the corpus
organisers.
The creation and the subsequent usage of a text corpus appear, then, to depend on the shared
knowledge of the corpus organisers, the programmers and the end-users. This shared knowledge
has to be simple and formal enough such that it can be used in the corpus creation and retrieval
programs. This shared knowledge has to be articulated with a zero-intelligence device, a
computer, in mind. Furthermore, the texts are to be stored in an electronic medium as distinct
from the usual ‘graphic substance’ used in the print medium, that is, marks on a
surface. The graphetic (physical) properties of the symbols that constitute writing systems have to
be transformed from print to the electronic medium for the purposes of storage, whilst the reverse
process has to be performed for retrieval. The efficient storage and retrieval of texts depends on
how texts and text corpora are dealt with in three overlapping, yet clearly identifiable, tasks: the
coding of the texts; the description of the texts; and the representation of texts. In other words,
how texts are to be coded or, more precisely, encrypted for use in an electronic medium, how a
particular class of texts and text corpora is to be described, and how a class of texts and text
corpora is to be represented on a computer system. The coding, description and representation
comprise some of the shared knowledge mentioned above, and it is the articulation of this
knowledge that will concern us in the following subsections.
Coding of texts
Coding involves the encryption of words, sentences and paragraphs of a text. For encrypting
words, binary equivalents of individual letters and numbers comprising the words are used and
the text possibly ‘compressed’ by allocating shorter binary equivalents to the more frequent
tokens and longer to the less frequent ones.
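The allocation of shorter binary equivalents to more frequent tokens can be illustrated with Huffman coding. This is a minimal sketch of one well-known frequency-based scheme, chosen by us for illustration; the text above does not name a particular compression method, and the function names are our own.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(tokens):
    """Map each distinct token to a bit string; frequent tokens get shorter codes."""
    freq = Counter(tokens)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct token still needs a one-bit code.
        return {next(iter(freq)): "0"}
    tie = count()  # tie-breaker so the heap never has to compare dictionaries
    heap = [(n, next(tie), {tok: ""}) for tok, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, codes1 = heapq.heappop(heap)
        n2, _, codes2 = heapq.heappop(heap)
        # Merge the two least frequent subtrees, prefixing their codes with 0 and 1.
        merged = {tok: "0" + code for tok, code in codes1.items()}
        merged.update({tok: "1" + code for tok, code in codes2.items()})
        heapq.heappush(heap, (n1 + n2, next(tie), merged))
    return heap[0][2]


def encode(tokens, codes):
    """Concatenate the codes of the tokens into one bit string."""
    return "".join(codes[tok] for tok in tokens)


sample = "the cat saw the dog and the cat".split()
codes = huffman_codes(sample)
bits = encode(sample, codes)
```

In this toy sample the most frequent token, `the`, receives a code no longer than that of any token occurring once, and the encoded sample is shorter than a fixed three-bit-per-token encoding would be.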
A given text may comprise all, some or at least one of the ‘modes of graphic expression’: ‘verbal’
(written), ‘pictorial’ (drawings, photographs) and ‘schematic’ (charts, diagrams, scientific and
technical notation, formulae and equations) (cf. Crystal 1987). In the print medium, the verbal
mode relies heavily on understanding of the physical properties of the symbols, including font
details, shapes, sizes and so on. Furthermore, in the print medium units of discourse, such as
sentences and paragraphs, are demarcated by layout configuration and punctuation marks.
Graphic mark-up languages are used to encode aspects of the verbal, pictorial and schematic
modes of expression: an example of such a language is the Standard Generalised Mark-up
Language (SGML). The language provides a notation convention and a grammar that is used to
‘mark up’ text fragments comprising a text. This marked up text then can be processed by a
computer program that understands the notation and the grammar of SGML: the program can
distinguish text headers from the body of the text, sentences from paragraphs and so on.
Furthermore, the attributes of a text can also be ‘annotated’ in the text through mark up languages
by the use of mnemonics to indicate, say, author name, publication date, text type and so on.
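A program that processes marked-up text might look like the following sketch. The tag names (`text`, `header`, `author`, `date`, `type`, `body`, `p`, `s`) are hypothetical mnemonics invented for this example, and Python's `html.parser` is used only as a convenient stand-in: a conforming SGML application would validate the text against a document type definition, which this sketch does not attempt.

```python
from html.parser import HTMLParser

# Hypothetical marked-up text; the tag set is invented for illustration.
SAMPLE = """<text>
<header><author>A. N. Author</author><date>1988</date><type>informative</type></header>
<body><p><s>Corpora are collections of texts.</s><s>They can be marked up.</s></p></body>
</text>"""


class MarkupReader(HTMLParser):
    """Separate header attributes from the sentences in the body of a text."""

    HEADER_FIELDS = {"author", "date", "type"}

    def __init__(self):
        super().__init__()
        self.stack = []        # currently open tags
        self.attributes = {}   # header annotations such as author and date
        self.sentences = []    # content of each <s> element

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if not self.stack or not data.strip():
            return
        current = self.stack[-1]
        if current in self.HEADER_FIELDS and "header" in self.stack:
            self.attributes[current] = data.strip()
        elif current == "s":
            self.sentences.append(data.strip())


reader = MarkupReader()
reader.feed(SAMPLE)
```

After `feed`, `reader.attributes` holds the annotated text attributes and `reader.sentences` the individual sentences of the body.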
Description of Texts
The ‘description’ of individual texts or a group of texts is essentially a set of conventions used to
describe some particular texts. Corpus organisers attach a variety of descriptive labels to
individual texts. These labels may be used to express, say, functional typology: Longman’s
typology is at some variance with that of the Birmingham Collection of English Text, and the two
are at variance with de Beaugrande and Dressler (1981) and with Sager, Dungworth and McDonald’s
typology.
The description of a text in a corpus is usually at a meta-level as compared to the coding of the
text. This description can be the description of the text’s functional type, for example, whether
the text is imaginative, informative, evaluative or phatic: this description may be what Sager,
Dungworth and McDonald call the ‘description [...] of traditional forms’ (1980: 148-181); this
description may involve the subject domain members responsible for producing the text. The
advent of tagged corpora allows for a lexico-grammatical description of the text in that all known
and current tokens in the text are tagged according to the categories and features of a given
grammar.
Therefore, in a modern computerised corpus of texts one may find meta-textual data, that is,
additional data that is used to code and to describe each individual text in a corpus. The question
here is this: can this meta-textual data be used to build a taxonomy that has the functional
typology at its apex, that is, as the superordinate level, with text-forms at a subordinate
level, domain-specific data subordinated to the text form, and the rest of the attributes at the
instance level? Then again, one may regard the domain or subject specificity as being the
superordinate level rather than the functional typology, or indeed any other attribute.
We believe that almost all of the corpus management systems are used in large measure by an
enthusiastic, well-motivated community that shares the knowledge of the meta-textual data with
the developers of corpora and software engineers that build the corpus management systems. The
corpus and software designers, and the users, have a tacit understanding of how the meta-textual
level data has been used in specifying a taxonomy. The shared knowledge mentioned above is the
shared knowledge of the taxonomy.
However, the establishment and the consolidation of corpus linguistics would mean that methods
may be developed by a team that has little or no contact with those identifying new techniques or
building tools for analysing corpora. The growth of corpus linguistics means that the first victim
of the success would be the shared knowledge of text taxonomies. The loss of the shared
knowledge also may give rise to the opinion that a corpus management system has some
privileged data about a given taxonomy and yet may not have any data about any other taxonomy.
We believe that a careful and systematic approach for collecting the meta-textual data about
individual texts in a corpus, together with the development of programs that can manipulate this
meta-textual data to create a number of different taxonomies and are capable of providing access
for a user to all the texts for a given taxonomy, is not only necessary but holds the key to a
successful user-driven exploitation of text corpora.
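The idea of deriving several taxonomies from one set of meta-textual records can be sketched as follows. The records, the attribute names and the `_texts` key are hypothetical; the point is only that the same data yields a different hierarchy for each ordering of the attributes.

```python
# Hypothetical meta-textual records; the attribute names are invented for illustration.
TEXTS = [
    {"id": "t1", "function": "informative", "form": "report", "domain": "physics"},
    {"id": "t2", "function": "informative", "form": "report", "domain": "medicine"},
    {"id": "t3", "function": "directive", "form": "regulation", "domain": "medicine"},
]


def build_taxonomy(texts, order):
    """Nest the texts under the given attributes; the first attribute is the apex."""
    tree = {}
    for text in texts:
        node = tree
        for attribute in order:
            node = node.setdefault(text[attribute], {})
        # Text identifiers are stored at the instance level under a reserved key.
        node.setdefault("_texts", []).append(text["id"])
    return tree


# Functional typology at the apex, then text form, then domain.
by_function = build_taxonomy(TEXTS, ["function", "form", "domain"])
# Domain at the apex instead, from the very same records.
by_domain = build_taxonomy(TEXTS, ["domain", "function"])
```

A user interested in medicine can then navigate `by_domain["medicine"]`, while a user interested in directive texts navigates `by_function["directive"]`, without the stored records being reorganised.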
Representation of Texts
The coding of the texts, and the description of the texts, implicitly assumes ‘shared knowledge’,
about the code and the descriptors, between the people that build the corpora, people that program
corpus management software, and the end-users of the corpora. This shared knowledge helps
these three groups to resolve ambiguities in the code and the descriptors, and, indeed, to improve
the code and the descriptors so as to make these easy to use. As luck would have it, in most of the
major corpus initiatives (cf. Longman’s, Birmingham) these three groups were well co-ordinated
and were extremely well motivated. In ‘Looking Up’, an account of the COBUILD project, corpus
builders, systems analysts, programmers and lexicographers give an enthusiastic and
well-dovetailed account of the project. But assuming that the use of corpora, just in lexicography
alone, takes off and researchers and lexicographers can access data more cheaply, freely and
easily, as compared to now, then how will this implicit ‘shared’ knowledge be articulated? There
is considerable evidence that the take-up of corpora in lexicography, language research and
language teaching, is increasing rapidly. And, with this rapid increase, one must define a
computer-based set of conventions for describing not just some particular text but a class of texts,
not just an encryption algorithm based on implicit knowledge but a simple reasoning strategy
that is capable of inferencing over the attributes of texts organised into classes of texts, that
is, over explicated knowledge. The devising of conventions for describing a class of texts, or
more generally a class of things, with the sole purpose of making the ‘shared knowledge’ explicit
is fashionably termed a problem of representation.
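A simple reasoning strategy over classes of texts can be sketched as attribute inheritance: a class of texts answers a query from its own attributes if it can, and otherwise defers to its superordinate class. The class names and attribute values below are hypothetical and serve only to illustrate the strategy.

```python
class TextClass:
    """A node in a hypothetical hierarchy of text classes."""

    def __init__(self, name, parent=None, **attributes):
        self.name = name
        self.parent = parent
        self.attributes = attributes

    def lookup(self, attribute):
        """Return the nearest value of `attribute`, searching up the hierarchy."""
        node = self
        while node is not None:
            if attribute in node.attributes:
                return node.attributes[attribute]
            node = node.parent
        raise KeyError(attribute)


special = TextClass("special-language text", medium="written", technicality="technical")
report = TextClass("test report", parent=special, function="evaluative")
popular = TextClass("popular science article", parent=special, technicality="semi-technical")
```

Here `report` inherits its technicality from the superordinate class, whereas `popular` overrides the inherited value with its own.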
The ‘representation of texts’ or the ‘representation of a text’ has been discussed by prominent
linguists, litterateurs, literary critics and by philosophers, psychologists and sociologists, either
within erstwhile ideological frameworks or within frameworks that are gaining currency.
The question here is this: given that texts may have a whole range of attributes, may have been
written with an exclusive function by their authors, and, indeed, may be classified according to
the classifier’s intellectual orientation and his or her immediate need, how is one to organise texts
in computerised text corpora such that the corpus users can access the texts according to their
own intellectual orientation and immediate needs? Our question is all the more pertinent in that
substantial volumes of textual data are currently being collected by national initiatives, for
instance the British National Corpus initiative and the Uppsala-based Russian Corpus (Maier, 1994),
and by transnational collaborations, for example the Association for Computational Linguistics
Data Collection Initiative (ACL/DCI) (Church and Liberman, 1991).
It appears, therefore, that texts in a corpus can be described through the use of a variety of labels.
Indeed, one can create a hierarchy of these labels. There can be pressing lexicographical reasons
for considering the medium as the apex of the hierarchy, followed by national language variant,
for example, British English and American English, then by date and so on, or for gender studies
students the apex would be the gender of the author, followed by date of publication, and
language variants. The labels can be arranged, or in some cases have to be arranged, to suit the
needs of the investigator and his or her own particular niche.
The descriptive taxonomy provides the nodes and links of a network - a tree - that describes how
texts are related to each other. The nodes are named after the labels and the links provide
conduits of properties that can be inherited from the superordinate by the subordinate nodes. For
example, the language variant node can be construed as a node that can navigate a user through
all the texts that were written in the particular variant; the topic-node can be used to collate texts
according to topics. The order of these nodes or the taxonomy then depends upon the individual
investigator’s niche. The taxonomy chosen by a lexicographer may not suit the needs of a
grammarian, and the taxonomy chosen for stylistic studies would be wasted on a historical
linguist for example. Indeed, we describe below that even in a niche area, like lexicography,
there is no agreement on the descriptive taxonomy.
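The dependence of the tree on the order of the labels can be illustrated with a minimal sketch. The text records, the attribute names (medium, variant, gender, year) and the function build_tree below are hypothetical and serve only to show that one set of texts yields different trees under different label orders; this is not the implementation of any system described here.

```python
from collections import defaultdict

# Hypothetical text records; the attribute names and values are illustrative only.
TEXTS = [
    {"id": "t1", "medium": "book", "variant": "British English", "gender": "female", "year": 1985},
    {"id": "t2", "medium": "book", "variant": "American English", "gender": "male", "year": 1985},
    {"id": "t3", "medium": "newspaper", "variant": "British English", "gender": "female", "year": 1990},
]


def build_tree(texts, label_order):
    """Nest the texts under one tree level per label, in the order given."""
    if not label_order:
        return [t["id"] for t in texts]  # leaves are text identifiers
    head, rest = label_order[0], label_order[1:]
    groups = defaultdict(list)
    for text in texts:
        groups[text[head]].append(text)
    return {value: build_tree(members, rest) for value, members in groups.items()}


# Two investigators, two taxonomies, one set of texts.
lexicographer_tree = build_tree(TEXTS, ["medium", "variant", "year"])
gender_studies_tree = build_tree(TEXTS, ["gender", "year", "variant"])
```

In this sketch the same text t1 is reached through book, British English, 1985 in the first tree and through female, 1985, British English in the second; no text is moved or copied.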
But whatever taxonomy is chosen, in the context of a computer-based corpus this
hierarchy must form the basis of the organisation of texts within a computer’s file system. Any
change in the taxonomy then implies the reorganisation of the corpus at the file system level: a
complex task at the best of times and, we believe, a task that should be performed by computer
systems. In order to explicate the notion of a configurable taxonomy we have introduced the term
virtual corpus. The adjective virtual has been borrowed from computing science, specifically
operating systems, where it is used to describe how the entire resources of a computer system are
replicated by a program and made available to individual users. The users of this replication are
the users of a virtual machine: each believes that he or she has access to the whole system, whilst in
reality each user’s access to, and use of, the machine is limited to very short intervals of time.
The notion of virtual corpus is similar to that of a virtual machine: there is in reality only one
corpus, but the users can arrange the nodes and links as they wish and create for themselves a
corpus, or more accurately, a corpus organisation, based on an actually physically extant set of
texts, for the duration of their use. Thus every corpus user will perceive, and have access to, all or
parts of a corpus that they have themselves configured. And, continuing the operating systems
analogy, such a configurable taxonomy will have to be made available through the agency of a
program, within a suite of corpus management programs, that is capable of producing this virtual
corpus. The specification and operation of such a program that can create virtual corpora is the
focus of the rest of this paper.
A virtual corpus management system
The design of corpora, and more so their management, which may include storage and retrieval of
texts, navigation mechanisms, and strict integrity and security checks, determines to a large extent
the efficacy of the corpora for various end users - which may be lexicographers, translators, or
linguists. Most existing corpus management systems have been developed in conjunction with a
particular corpus and have consequently taken a fairly literal approach to the implementation of a
corpus on a computer. This has resulted in software that directly maps the structure of a corpus as
described by the corpus designers onto a computer-based file or database management system
structure. In the following section we are interested in the coding of corpora that allows different
corpus designers to structure texts as they feel appropriate. We feel that any user of a corpus can
be viewed as a corpus designer.
There have been two main approaches to the storage, retrieval and navigation of texts in a corpus:
an explicit text taxonomy, as in LOB and Brown, expressed in terms of file-store structure; or an implicit
text taxonomy, as in Longman, expressed in terms of attributes used in the text “headers”. There are
benefits and limitations with both approaches. With an explicit taxonomy, storage of texts
requires a corpus management system to decide where a text should be placed in its file-store,
whereas the attribute-based system can keep the texts anywhere. The main differences in the two
approaches are in text retrieval, and in this it is useful to think of navigation around a corpus as
highly interactive text retrieval. An explicit taxonomy allows texts to be retrieved quickly by
following the appropriate branches through the taxonomy, without needing to consider or refer to
the corpus as a whole. The criteria for selecting a text from an explicit taxonomy can be viewed
as a “path” traversing the taxonomic structure. Also, an explicit taxonomy provides a means of
navigation through a corpus that computer users find reasonably intuitive. In contrast, an
attribute-based system may need to search for the required criteria in the attributes of all texts in
the corpus, and is likely to be query-based. For user navigation, query-based retrieval usually
means the user has to learn a query language, which some users do not find straightforward.
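The contrast between the two approaches can be sketched as follows. The nested dictionary standing in for a file-store, the header attributes and both function names are assumptions made for illustration; they do not reproduce the actual layout of LOB, Brown or the Longman corpus.

```python
# Explicit taxonomy: a nested structure standing in for a file-store layout.
# Category and text names are hypothetical.
FILE_STORE = {
    "press": {"reportage": ["A01", "A02"], "editorial": ["B01"]},
    "fiction": {"general": ["K01"]},
}

# Implicit taxonomy: every text carries its own header attributes.
HEADERS = {
    "A01": {"category": "press", "genre": "reportage"},
    "A02": {"category": "press", "genre": "reportage"},
    "B01": {"category": "press", "genre": "editorial"},
    "K01": {"category": "fiction", "genre": "general"},
}


def retrieve_by_path(store, path):
    """Follow one branch per step; texts on other branches are never visited."""
    node = store
    for step in path:
        node = node[step]
    return node


def retrieve_by_query(headers, **criteria):
    """Inspect the header of every text and keep those matching all criteria."""
    return sorted(
        text_id
        for text_id, attrs in headers.items()
        if all(attrs.get(name) == value for name, value in criteria.items())
    )
```

Here retrieve_by_path(FILE_STORE, ["press", "reportage"]) touches only one branch, whereas retrieve_by_query(HEADERS, category="press", genre="reportage") has to look at every header to arrive at the same two texts.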
An important issue for corpus management systems is the type of retrieval requests that a user is
likely to make. A frequent use of corpora is for the statistical analysis and comparison of sub-corpora, so it is important for a corpus management system to let users extract sub-corpora in an intuitive manner.
The retrieval benefit of using an explicit taxonomy completely disappears if a number of texts (or
sub-corpus) are required that occur in different parts of the taxonomy, which may be considered
as the case when incomplete paths are being specified as the retrieval criteria. With an attribute-based approach, sub-corpora can be easily retrieved. The aim of virtual corpus management is to
provide the flexibility of the attribute-based approach, but with the intuitive functionality of the
explicit taxonomy approach. This is achieved by allowing users to define a ‘virtual taxonomy’
for a corpus of texts, with any number of different virtual taxonomies being concurrently
available over the same corpus. The term “virtual taxonomy” has been defined by Woods in the
context of descriptions of concepts in knowledge representation systems such that whenever a
system “constructs an explicit collection of concept nodes ... the result is a subgraph of the virtual
taxonomy” (Woods, 1991:80). Woods’ motivation for viewing a collection of ‘descriptions’ this
way is that “although its structure is important, one never wants to make it explicit in the memory
of a computer” (Woods, ibid).
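A minimal sketch of how an incomplete path might be resolved against attributes rather than against a stored tree is given below. The records, the attribute names and the function select are hypothetical; None stands for a level the user has left unspecified, and the taxonomy itself is never materialised.

```python
# Hypothetical attribute records; texts can live anywhere in the file-store.
TEXTS = [
    {"id": "t1", "language": "EN", "domain": "physics", "text_type": "book"},
    {"id": "t2", "language": "EN", "domain": "automotive", "text_type": "article"},
    {"id": "t3", "language": "DE", "domain": "physics", "text_type": "article"},
]


def select(texts, level_order, path):
    """Return ids of texts matching a path through a virtual taxonomy.

    `level_order` names the attribute used at each level; `path` gives the
    value chosen at each level, with None for a level left unspecified.
    """
    wanted = {attr: value for attr, value in zip(level_order, path) if value is not None}
    return sorted(t["id"] for t in texts if all(t[a] == v for a, v in wanted.items()))


# Two virtual taxonomies coexist over the same texts.
translator_order = ["language", "domain", "text_type"]
specialist_order = ["domain", "language", "text_type"]

physics_any_language = select(TEXTS, translator_order, [None, "physics", None])
physics_via_domain = select(TEXTS, specialist_order, ["physics"])
```

Both calls return the same sub-corpus, although in the first taxonomy the physics texts sit on different language branches.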
The Virtual Corpus Manager within System Quirk has been implemented such that lexicographers
and terminologists can view corpora on the basis of the ‘pragmatic attributes’ of the texts within a
corpus. Viewing these pragmatic attributes at an abstract level, we have divided them into six
categories: text, authorship, publication, language, domain, copyright status. The Virtual Corpus
Manager is shown in Figure 7.
Figure 7: Virtual Corpus Manager main screen.
The Virtual Corpus Manager introduces a shift from the usual pre-defined and explicit corpus
hierarchy approach, in that it allows the definition of virtual hierarchies. The Virtual Corpus
Manager supports corpora that are coded as explicit taxonomies and corpora whose descriptions
are attribute-based. It does this by allowing texts to be stored anywhere in a file-system and
maintaining attributes describing the texts. Retrieval of the texts can then be made using the
attributes directly, or by imposing a virtual hierarchy over the attributes.
Earlier in Figure 2 we showed the structure of the Collins-Birmingham corpus which incorporates
a static organisation of texts. The hierarchy has text type (including ‘Newspaper’, ‘Brochures’,
‘Book’, ‘Magazine’ and ‘Correspondence’) at the meta-level and the terminal node of the
hierarchical tree usually refers to ‘topics’. We argue that more than one profile of the same
corpus of texts can be generated by implementing a virtual corpus hierarchy, for instance the
corpus hierarchy shown in Figure 8, which is a variation of the corpus hierarchy shown in Figure
2.
[Tree diagram; node labels: Book; Fiction; Non-fiction; extended texts; articles; Survey; Argument; Narrative; positional/hortatory; balanced; travelogue; biography.]
Figure 8: Section of Collins-Birmingham Corpus
According to the above corpus hierarchy the non-fiction texts are first distinguished by the
complementary attributes ‘extended texts’ and ‘articles’, and afterwards the original distinction
between narrative, survey and argument texts is maintained. By modifying the original corpus
hierarchy in this way the user can now retrieve all non-fiction extended texts or all non-fiction
articles. The selection of texts can be further constrained by choosing among survey,
argument and narrative texts, and so on. In a similar manner, texts in the same corpus can again be reorganised by implementing another corpus hierarchy (shown in Figure 9) where the non-fiction
texts can further be distinguished on the basis of the text ‘topic’.
[Tree diagram; node labels: Book; Fiction; Non-fiction; extended texts; articles; topics; Survey; Argument; Narrative; positional/hortatory; balanced; travelogue; biography.]
Figure 9: The ‘topic’ sub-corpus from the Collins-Birmingham Corpus.
Figure 10 shows a corpus hierarchy resulting from a combination of the corpus hierarchies shown
in Figure 8 and Figure 9.
[Tree diagram; node labels: Book; Fiction; Non-fiction; extended texts; articles; topics; Survey; Argument; Narrative; positional/hortatory; balanced; travelogue; biography.]
Figure 10: Sub-corpus based on merged hierarchies in Figure 8 and Figure 9.
Whilst setting the corpus hierarchy for browsing purposes the user is allowed to include only
relevant values of a pragmatic attribute in the corpus hierarchy. For instance, based on the
organisation of texts shown in Figure 8, a typical corpus hierarchy including relevant values of a
pragmatic attribute would be determined as follows (see Figure 11):
[Tree diagram; node labels: Book; Fiction; Non-fiction; extended texts; articles; Survey; Argument; Narrative; positional/hortatory; balanced; travelogue; biography.]
Figure 11: Hierarchy when only positional/hortatory and biography attributes are
explicitly selected from structure in Figure 8.
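Restricting a level of the hierarchy to selected values can be sketched as below. The records and the attribute names form, kind and subtype are hypothetical stand-ins for the labels of Figure 8, and build_tree is an illustrative function, not part of the Virtual Corpus Manager.

```python
from collections import defaultdict

# Hypothetical records for non-fiction texts, loosely following the labels of Figure 8.
TEXTS = [
    {"id": "n1", "form": "extended texts", "kind": "Argument", "subtype": "positional/hortatory"},
    {"id": "n2", "form": "articles", "kind": "Argument", "subtype": "balanced"},
    {"id": "n3", "form": "extended texts", "kind": "Narrative", "subtype": "biography"},
    {"id": "n4", "form": "articles", "kind": "Narrative", "subtype": "travelogue"},
]


def build_tree(texts, levels):
    """`levels` is a list of (attribute, allowed_values); None allows every value."""
    if not levels:
        return [t["id"] for t in texts]
    (attribute, allowed), rest = levels[0], levels[1:]
    groups = defaultdict(list)
    for text in texts:
        if allowed is None or text[attribute] in allowed:
            groups[text[attribute]].append(text)
    tree = {value: build_tree(members, rest) for value, members in groups.items()}
    # Prune branches that end up holding no texts.
    return {value: subtree for value, subtree in tree.items() if subtree}


selected = build_tree(
    TEXTS,
    [("form", None), ("kind", None), ("subtype", {"positional/hortatory", "biography"})],
)
```

With these sample records only the extended texts branch survives, because neither of the two articles carries one of the selected values.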
Some examples of different virtual hierarchies of the same texts are illustrated in Figure 12. In a
dynamic fashion the users can define their own text classification hierarchy or ‘corpus hierarchy’
from the list of pragmatic attributes, where each level of the hierarchy corresponds to one of these
attributes. Additionally, the user is also allowed to include only relevant values of a pragmatic
attribute in the corpus hierarchy (Figure 12b).
This results in a corpus hierarchy that is specific to the users’ requirements, as opposed to a
hierarchy defined in common for all users. For instance, translators may like the top-most level to
be ‘language’, whereas specialist text users may want a hierarchy that has ‘domain’ as the entry
point in the corpus (Figure 12c). Similarly, ‘origination date’ with a specification of a range of
dates would be the text classification basis for diachronically oriented text research (Figure 12d).
[Four tree diagrams, panels (a) to (d), each descending from a ‘Top level’ node. Level labels shown: Language, Region, Domain, Text Type, Origination date. Value labels shown: EN, DE, ES, AE, IT, PH and ‘All Values’.]
Figure 12 a-d: AE refers to Automotive Engineering; IT to Information
Technology; PH to Physics
The navigation mechanism implemented in the virtual corpus manager is novel and has three
main advantages:
a) The navigation is based on a user-defined hierarchy, so various profiles of the corpora can be
viewed by changing the corpus hierarchy (Figure 13a).
b) At each level more than one path can be selected concurrently, allowing sub-corpora to be
browsed in parallel (Figure 13b). For example, just the American (US) and British
(GB) variants of English could be selected.
c) At any level only known values for texts are available for determining the path. This
ensures that the user cannot take a path that leads to a dead-end. For instance, at the
language level texts are classified into four languages, ‘English’, ‘German’, ‘Italian’ and
‘Spanish’; however, if there are no Italian texts in the corpus, this
path would not be available when browsing down.
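Points (b) and (c) can be sketched in a few lines. The catalogue and the function names available_values and descend are hypothetical; the sketch only shows that offering values computed from the texts still in view rules out dead-ends, and that choosing a set of values keeps several paths open at once.

```python
# Hypothetical catalogue; note that it holds no Italian ("IT") texts.
TEXTS = [
    {"id": "t1", "language": "EN", "region": "GB"},
    {"id": "t2", "language": "EN", "region": "US"},
    {"id": "t3", "language": "DE", "region": "DE"},
    {"id": "t4", "language": "ES", "region": "ES"},
]


def available_values(texts, attribute):
    """Offer only values that occur in the texts still in view."""
    return sorted({t[attribute] for t in texts})


def descend(texts, attribute, chosen):
    """Keep texts on any of the chosen branches, so several paths stay open at once."""
    return [t for t in texts if t[attribute] in chosen]


languages = available_values(TEXTS, "language")
english = descend(TEXTS, "language", {"EN"})
regions = available_values(english, "region")
both_variants = descend(english, "region", {"GB", "US"})
```

Because the sample catalogue contains no Italian texts, the Italian branch is never offered at the language level, and selecting both GB and US keeps the two English sub-corpora in view together.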
[Tree diagrams for panels (a) and (b); node labels: EN, US, GB, Texts.]
Figure 13 a-b
The browser for navigating a virtual hierarchy is shown in Figure 14.
Figure 14: Virtual Corpus Browser with the virtual hierarchy defined in Figure
12.
The Virtual Corpus Manager provides a mechanism that allows the user to specify various
constraints in a simple interactive manner, without recourse to a query language (Figure 15), and
then retrieves all texts satisfying the user’s constraints. The collection of texts retrieved that
satisfy these constraints could be envisaged as a ‘constrained’ corpus. We argue that the actual
corpus containing all texts can be considered as the ‘mother corpus’, whereas the constrained
corpus, which in fact partitions the corpus based on certain user-defined constraints, can be
regarded as the ‘daughter corpus’. Furthermore, our approach to corpus management
incorporates the notion that texts in a corpus can be related to other texts, for example as
‘shadows’ (translations), annotations and so on.
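The relation between a mother corpus and a daughter corpus, and the idea of related texts, might be sketched as follows. The records, the representation of constraints as predicates, the triple format for relations and both function names are assumptions made for illustration only; the Virtual Corpus Manager itself collects constraints interactively rather than through code.

```python
# Hypothetical mother corpus; attribute names are illustrative.
MOTHER = [
    {"id": "t1", "language": "EN", "domain": "PH", "year": 1985},
    {"id": "t2", "language": "EN", "domain": "IT", "year": 1991},
    {"id": "t3", "language": "DE", "domain": "PH", "year": 1990},
]

# Relations between texts, here: t3 is a 'shadow' (translation) of t1.
RELATIONS = [("t3", "shadow", "t1")]


def daughter_corpus(mother, constraints):
    """Keep texts for which every constraint (attribute -> predicate) holds."""
    return [t for t in mother if all(test(t[attr]) for attr, test in constraints.items())]


def related(text_id, relation):
    """Ids of texts standing in `relation` to `text_id`."""
    return [source for source, rel, target in RELATIONS if rel == relation and target == text_id]


physics_1980_to_1990 = daughter_corpus(
    MOTHER,
    {"domain": lambda d: d == "PH", "year": lambda y: 1980 <= y <= 1990},
)
```

The daughter corpus is only a selection over the mother corpus: the underlying records are shared, not copied.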
Figure 15: Text selection by attribute query.
Conclusion
The discussion above covered the various exemplar corpora used extensively in corpus
linguistics together with our views on corpus taxonomies. We focused on how a corpus
taxonomy can be made flexible such that each individual user of the corpus can impose his or
her own structure on the corpus for the purposes of pursuing their own investigation. We
believe that much of the debate on text typologies is descriptive and it is not possible to put a
value on any one text typology: the notion of virtual taxonomies and associated
implementations (like the Virtual Corpus Manager) will introduce some degree of objectivity
in that one can evaluate the efficacy of one typology against another.
Bibliography
Ahmad, K., Davies, A., Hughes, M., Fulford, H., Rogers, M., and Thomas, P. (1990), A
methodology for building multilingual termbases and special-purpose lexica, Translator’s
Workbench Project, ESPRIT II, No. 2315, Report for Workpackage 1.1, Guildford: University
of Surrey
Ahmad, K., Fulford, H., Griffin, S., and Holmes-Higgin, P., (1991), ‘Text-based Knowledge
Acquisition —A Language for Special Purposes Perspective’. In (Eds.) I.M. Graham & R.W.
Milne, Research and Development in Expert Systems VIII, pp 148-162. Cambridge:
Cambridge University Press.
Ahmad, K., Fulford, H., Holmes-Higgin, P., Rogers, M., and Thomas, P. (1990), ‘The
Translator’s Workbench Project’, in Translating and the Computer 11, ed. C. Picken, London:
Aslib
Aijmer, K. and Altenberg, B. (Eds). (1991). English Corpus Linguistics -- Studies in Honour
of Jan Svartvik. London: Longman Group.
Atkins, Sue, Clear, Jeremy and Ostler, Nicholas (1992). Corpus Design Criteria, Literary and
Linguistic Computing, Vol. 7 (No. 17) pp1-16.
de Beaugrande, Robert. and Dressler, Wolfgang. (1981). ‘Introduction to Text Linguistics’.
London and New York: Longman.
Biber, Douglas (1988). Variation across speech and writing. Cambridge: Cambridge University
Press (there is a 1991 paperback edition of this book from where the citations are taken).
Biber, Douglas (1989). ‘A Typology of English Texts’. Linguistics, Vol. 27, pp3-43.
Church, Kenneth W., and Libermann, Mark Y. (1991). ‘A status report on the ACL/DCI’. In
(Ed.) Linda M. Jones. pp84-91.
Crystal, David (1987). The Encyclopaedia of Language. Cambridge: Cambridge University
Press.
Garside, R., Leech, G. and Sampson, G. (Eds). (1987). The Computational Analysis of
English. London: Longman Group.
Greenbaum, Sidney (1991). The development of the International Corpus of English. In (Eds.)
Karin Aijmer, Bengt Altenberg. pp 83-91.
Halliday, Michael A.K., (1975). Learning How to Learn. London: XXX
Halliday, Michael A.K. and Hasan, Ruqaiya. (1976). Cohesion in English. London: Longman.
Holmes-Higgin, P. and Ahmad, K. (1992). ‘The Machine Assisted Terminology Elicitation
Environment: Text and Data Processing and Management in Prolog’. Technical Report CS-9211. Dept. of Mathematical and Computing Sciences, University of Surrey, Guildford.
Holmes-Higgin, P., Griffin, S., Hook, S. and Abidi S.R. (1993). ‘System Quirk Reference
Guide’. Final Report for Workpackage 5.5, Multilex Project, ESPRIT II, No. 5304.
Ihalainen, Ossi. (1991). A point of verb syntax in South-Western British English: Analysis of a
dialect Continuum. In (Eds.) Karin Aijmer and Bengt Altenberg. pp280-302.
Kytö, Merja. and Rissanen, Matti. (1988). ‘The Helsinki Corpus of English Texts - Classifying
and Coding the Diachronic Patterns’. In (Eds.) Merja Kytö and Matti Rissanen. pp169-179.
Kytö, Merja., Ihalainen, Ossi., and Rissanen, Matti. (Eds.) (1988). Corpus Linguistics: Hard and
Soft. (Proceedings of the 8th International Conference on English Language Research on
Computerised Corpora). Amsterdam: Rodopi.
Lacey, A.R. (1986) ‘A Dictionary of Philosophy’ (Second Edition). London: Routledge &
Kegan Paul.
Leech, G. (1991). ‘The state of the art in corpus linguistics’. In (Eds) Aijmer and Altenberg.
pp 8-29.
Maier, Ingrid. (1994). ‘Uppsala Russian Corpus’. (An electronic mail announcement from the
University of Uppsala based on a summary in ‘A Frequency Dictionary of Modern Russian’
(1993). Acta Universitatis Upsaliensis, Studia Slavica Upsaliensia 32. pp185-188).
Marinai, Elisabetta., Peters, Carol., and Picchi, Eugenio. (1991). ‘Bilingual Reference Corpora:
A System for Parallel Text Retrieval’. In (Ed.) Linda M. Jones. pp61-62.
Renouf, Antoinette (1987). ‘Corpus Development’. In (Ed.) John Sinclair (1987). pp1-40.
Rissanen, Matti. (1991). On the history of that/Zero as object clause links in English. In (Eds.) Karin
Aijmer and Bengt Altenberg (1991). pp 272-289.
Sager, Juan C., Dungworth, David., and McDonald, Peter F. (1980). ‘English Special Languages:
Principles and practice in science and technology’. Wiesbaden: Oscar Brandstetter Verlag KG.
Savory, Theodore H. (1967). The Language of Science. London: Andre Deutsch Ltd. (2nd
edition).
Searle, John (1969). Speech Acts. Cambridge: Cambridge University Press.
Summers, D. (1991). ‘Longman/Lancaster English Language Corpus: Criteria and Design’.
Unpublished manuscript.
Tompa, W.F. and Raymond, R.D. (1991). ‘Database Design for a Dynamic Dictionary’. In
(Eds.) Susan Hockey and Nancy Ide. Research in Humanities Computing: Selected Papers from
ALLC/ACH Conference, Toronto (June 1989). Oxford: Clarendon Press. pp 257-272.
Woods, W.A. (1991). ‘Understanding Subsumption and Taxonomy: A Framework for Progress’.
In (Ed.) John F. Sowa. Principles of Semantic Networks. Morgan Kaufmann Publishers,
California. pp 45-94.
Yang, Huizhong (1986). A new technique for identifying scientific/technical terms and
describing scientific texts. An interim report in Literary and Linguistic Computing. Vol 1 (No. 2)
1986. pp93-103.
                         Default
1    Mode                Written                          Spoken, Written-to-be-read, Spoken-to-be-written
2*   Text Origin         Unclassified                     Single, several, joint, corporate, ..........
3    Participation       1-person                         many
4    Constitution        Single
5    Preparedness        Prepared                         from-notes
6    Medium              Print                            Book, periodical, ephemera
7    Style               Prose                            Verse, rhyme
8    Genre               Novel, short story, play,        Advertisement, regulations/law, article,
                         poem, essay, letter              announcement
9    Factuality          Fact                             Faction*, Fiction
10   Setting             Unclassified                     Home, education, work, personal, public affairs
11   Function            Unmarked                         Narrative, informative, expository,
                                                          hortatory/persuasive, regulatory/instructional,
                                                          reflective, entertaining
12   Topic               General                          Science, the arts, belief and values, the
                                                          environment
13   Technicality        General                          Technical, semitechnical
14   Date                Date of publications             Revised edition
15*  Text Status         Original/Reprint                 Updated, revised
16   Language
17*  Language links      Single                           Parallel
18*  Language status     Source                           Translation: Multiple release
19   Authorship
20     Name
21     Gender
22     Age
23     Region
24*    First Language
25     Nationality
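Columns: Cluster; ‘Prototype label’ and Exemplars in LOB; Dimensional Characteristics; Proximate Prototype and Exemplars in LOB
Not important for terminology

Cluster 1
  Prototype label and exemplars: Intimate, interpersonal interaction (face to face interviews, phone conversations, friends)
  Dimensional characteristics: Situated, non-abstract, very involved
  Proximate prototype and exemplars: Informational interaction (face to face conversations, interviews, phone conversations)
Cluster 2
  Prototype label and exemplars: Informational interaction
  Dimensional characteristics: Situated, non-abstract but less involved
  Proximate prototype and exemplars: Intimate, interpersonal
Cluster 3
  Prototype label and exemplars: ‘Scientific’ exposition (Academic prose, official documents)
  Dimensional characteristics: ‘Extremely’ informational, highly elaborated, non-narrative, non-persuasive, extremely abstract
  Proximate prototype and exemplars: Learned exposition (Academic prose, official documents, press reportage and review, professional letters)
Cluster 4
  Prototype label and exemplars: ‘Learned’ exposition
  Dimensional characteristics: Like scientific exposition but only moderately abstract
  Proximate prototype and exemplars: Scientific exposition
Cluster 5
  Prototype label and exemplars: Imaginative narrative (Fiction - romance, general, mystery, adventure, science-fiction - prepared speech)
  Dimensional characteristics: Extremely narrative, moderately involved, situated, non-abstract
  Proximate prototype and exemplars: Involved persuasion (interviews, spontaneous speech, popular lore, professional letters)
Cluster 6
  Prototype label and exemplars: General narrative exposition (Press reportage and editorials, general fiction, biography, humour)
  Dimensional characteristics: Informational, narrative
  Proximate prototype and exemplars: ‘Learned exposition’
Cluster 7
  Prototype label and exemplars: Situated Reportage (sports broadcasts, science fiction)
  Dimensional characteristics: Situated, non-narrative, non-persuasive and non-abstract
  Proximate prototype and exemplars: Imaginative narrative
Cluster 8
  Prototype label and exemplars: Involved persuasion (Interviews, popular lore, letters, academic prose)
  Dimensional characteristics: Extremely persuasive, moderately involved, non-narrative and elaborated
  Proximate prototype and exemplars: Imaginative narrative

Biber (1989: 17-19)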