‘Virtual’ Text Corpora and their Management

Khurshid Ahmad and Paul Holmes-Higgin
University of Surrey
October 1988

Contents

Abstract
Introduction
Text Typologies, Language Varieties and Corpus Building
    Lancaster-Oslo/Bergen Corpus: The pioneer corpus randomly populated
    Birmingham collection of English Text: Specially selected text corpora for lexicography
    Longman/Lancaster English Language Corpus: Deliberately and randomly selected corpus
    British National Corpus Initiative
    Linguistically grounded text typology
    Diachronic Corpora and their Structure
    PISA’s Bilingual Reference Corpora
Special Language Corpora and ‘whole texts’
    What is a special language
        Specialist texts, cohesion and coherence
        Specialist texts: A speech act analysis
        Language of Science
        Specialist texts or terminological texts
    British National Corpus and Terminology
Organisation of Text Corpora and text typology
    Well grounded text typologies and text representation
    Coding, Description and Representation of Texts
        Coding of texts
        Description of Texts
        Representation of Texts
A virtual corpus management system
Conclusion
Bibliography

Abstract

Text corpora come in different shapes and forms: the design is typically motivated by a task in hand – reference base for a lexicography project, teaching and learning of second languages, and corpora of historical and anthropological interest.
Texts in a given corpus have many different attributes, and one or more of these attributes can be chosen to create a virtual (sub-)corpus. This virtual corpus is inspired by the notion of the so-called virtual machines in computing science.

Introduction

The use of text corpora, particularly computerised text corpora, has had a notably beneficial effect on the study of languages and, perhaps to a lesser extent, on the teaching and learning of languages. Following on from Quirk, his colleagues Greenbaum and Leech, and his student Svartvik, and from Francis and Kucera, the use of text corpora has given a substantial boost to the ‘descriptive studies’ of lexis, syntax, discourse and prosody in English. These studies reflect the state of the language, that of its users, and the uses to which the language was put. Furthermore, the use of text corpora has given a quantitative focus to the study of the ‘international’ varieties of English and a similar focus to the historical/diachronic analysis of language change.

The acquisition, organisation and subsequent dissemination of computerised British English text corpora was, in part, the work of the dictionary publishers Longman and Collins and, in part, the work of the International Computer Archive of Modern English (ICAME). The motivation of the dictionary publishers was to produce dictionaries of English. There is a range of dictionaries clustered around the Longman/Lancaster 30 million word corpus of contemporary English - including Longman’s Contemporary Dictionary of English (Summers 1991) and the more recent ‘concept-based’ Longman’s Language Activator (Summers 1993) - and around the Collins Birmingham Collection of English Text, which comprises 20 million words and has been used to produce the Collins COBUILD Dictionary and Collins COBUILD Phrasal Verbs.
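The notion of a virtual corpus stated in the abstract, a (sub-)corpus defined by choosing text attributes rather than by copying texts, might be sketched as follows. This is only an illustrative sketch: the class names `Text` and `VirtualCorpus`, the attribute names and the three sample records are assumptions made for the example and are not taken from any existing corpus management system.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List


@dataclass(frozen=True)
class Text:
    """One text held in the physical corpus, with illustrative attributes."""
    title: str
    medium: str    # e.g. "book", "newspaper"
    genre: str     # e.g. "fiction", "non-fiction"
    year: int      # year of publication
    variety: str   # e.g. "British", "American"


class VirtualCorpus:
    """A view over a corpus: it stores a selection rule, not copies of texts."""

    def __init__(self, texts: List[Text], selector: Callable[[Text], bool]):
        self._texts = texts
        self._selector = selector

    def __iter__(self) -> Iterator[Text]:
        # Membership is evaluated each time the view is iterated,
        # so texts added to the underlying corpus later are picked up.
        return (t for t in self._texts if self._selector(t))


# Hypothetical records, invented purely for illustration.
corpus = [
    Text("A", "book", "fiction", 1975, "British"),
    Text("B", "newspaper", "non-fiction", 1981, "British"),
    Text("C", "book", "non-fiction", 1962, "American"),
]

# A virtual sub-corpus defined by two attributes: medium and language variety.
british_books = VirtualCorpus(
    corpus, lambda t: t.medium == "book" and t.variety == "British"
)
titles = [t.title for t in british_books]
```

Because the view holds only a rule, several overlapping virtual corpora can be defined over the same stored texts without duplicating them, which is the analogy with virtual machines sharing one physical machine.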
Text Typologies, Language Varieties and Corpus Building

In this section we present a brief description of how texts have been categorised and how these categories have been used to develop text corpora. There are open questions in corpus linguistics related to how texts should be selected and, perhaps more importantly, for what purpose. Some argue that lexicographers and linguists should choose the texts themselves with some advice from teachers of English (Sinclair and colleagues in Sinclair 1987), whilst the corpus linguistics pioneers used a random-selection approach (cf. Lancaster Oslo Bergen Corpus and the Brown Corpus). Still others have argued that there should be an equal mixture of deliberately selected text and randomly selected text (see, for instance, Summers 1991).

We hope that the discussion of how text is organised and, indeed, how representative text is chosen, will motivate the reader to consider various parameters that can label a text. These parameters may include the medium in which the text is delivered - books, magazines, journals, leaflets, letters - and the genre of the text: fiction or non-fiction, whether it is imaginative or informative, persuasive or instructional. The register and the domain of the text are equally important parameters. Furthermore, there are some atomic features of a text, including author’s age and sex, publication period, language variety and so on. (One might consider the use of ‘contextual correlates’ described by Halliday to categorise texts in terms of their tenor and field, given that the mode of the language in the text corpora is textual.)

Lancaster-Oslo/Bergen Corpus: The pioneer corpus randomly populated

The Lancaster-Oslo/Bergen (LOB) Corpus [1] was aimed at a general representation of text types for research on a broad range of text types, and its texts were selected from four ‘media’: books, newspapers and periodicals, and government documents.
Titles for the LOB corpus were randomly selected from The British National Bibliography Cumulated Subject Index (1960-64) for books; from the Willing’s Press Guide (1961) for newspapers and periodicals; and from the Catalogue of Government Publications (1961: London, HMSO) for government documents. The ‘media’ typology and the random selection methodology were pioneered by Francis and Kucera for the Brown Corpus of American English. The LOB corpus was categorised into informative and imaginative texts. The latter category contains mainly works of fiction, ranging from detective fiction to science fiction, from adventure and ‘Western fiction’ to general fiction, and romantic texts and humour. Figure 1 shows the structure of the LOB corpus.

[1] The LOB Corpus originated in 1970 at the University of Lancaster (UK) under the direction of Geoffrey N Leech. In 1977 the project was transferred to Oslo and Bergen in Norway under the direction of Knut Hofland and S Johansson.

Brown/LOB Corpus
    Informative
        Press: Reportage
        Press: Editorial
        Press: Review
        Religion
        Skills, Trades, Hobbies
        Popular Lore
        Belles Lettres, Biography, Essays
        Government documents
        Learned and Scientific Writing
    Imaginative
        General Fiction
        Mystery, Detective
        Science Fiction
        Adventure, Western
        Romance, Love Story
        Humour

Figure 1: LOB Corpus Structure

Biber (1988) has added two more categories to those used in the LOB corpus whilst discussing variation across speech and writing samples of English. The first of Biber’s additions is professional letters written in an academic context and comprising only administrative matters; the second is personal letters written to friends or relatives. The first category is defined as ‘informational and interactional’ and the second as from ‘intimate to friendly’ (Biber 1988: 67). Presumably both can be added to the informative category introduced by the designers of the LOB corpus.
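The random-selection methodology described above, in which titles are drawn at random from a published catalogue for each medium, can be sketched roughly as below. The function name `select_titles`, the catalogue contents and the quotas are all placeholders assumed for illustration; they are not the actual LOB catalogues or sample sizes.

```python
import random
from typing import Dict, List


def select_titles(catalogues: Dict[str, List[str]],
                  quota: Dict[str, int],
                  seed: int = 0) -> Dict[str, List[str]]:
    """Draw a random sample of titles from each catalogue, without replacement."""
    rng = random.Random(seed)  # fixed seed so that a selection can be reproduced
    return {medium: rng.sample(titles, quota[medium])
            for medium, titles in catalogues.items()}


# Placeholder catalogues standing in for a bibliography, a press guide
# and a catalogue of government publications.
catalogues = {
    "books": [f"book-{i}" for i in range(100)],
    "newspapers and periodicals": [f"periodical-{i}" for i in range(50)],
    "government documents": [f"document-{i}" for i in range(20)],
}

# Hypothetical number of titles wanted from each medium.
quota = {"books": 5, "newspapers and periodicals": 3, "government documents": 2}

selection = select_titles(catalogues, quota)
```

Sampling within each medium separately keeps the ‘media’ typology fixed by design while leaving the choice of individual titles to chance, which is the contrast with the deliberate selection used for the Birmingham collection.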
Birmingham collection of English Text: Specially selected text corpora for lexicography

The Birmingham collection of English Text was compiled under the guidance of John Sinclair, in close collaboration with Collins Publishers, and served as a source of ‘sufficient and relevant textual evidence’ (Renouf, 1987: 1) for the production of ‘the first wholly new dictionary for many years’ (Sinclair, 1987: vii): a dictionary based not solely on the introspection of lexicographers and their advisers but rather on how authors of a wide variety of texts (and speakers partaking in conversation and delivering speeches with and to others) use words and phrases. The COBUILD corpus contains 20 million words of current English in its computer store. The COBUILD corpus excluded certain categories of text included in LOB, such as poetry, and excluded drama as not an example of ‘naturally occurring texts’. The focus of the COBUILD team was on texts published between 1960 and 1985, and the team preferred general language text to ‘technical language’.

The COBUILD corpus designers, with advice from teachers of English in the UK and abroad, and from British Council Libraries abroad, selected texts themselves. This method of text selection was different to the random-selection approach used by the LOB corpus - that is, randomly selected titles from bibliographies, and so on. However, in order to check the ‘relevance’ and ‘influence’ of a given text, the COBUILD team regularly checked bestseller lists in newspapers as well as catalogues from leading publishers.

The text in the COBUILD corpus is not split along LOB’s informative/imaginative axis; rather, the textual ‘medium’ is taken as a base classifier: books, newspapers, magazines, brochures and leaflets, and personal correspondence are used to define the text typology. The book variety dominates the corpus, contributing around 214 texts, with the rest contributing 70 texts.
The book variety is subdivided into fiction and non-fiction (‘imaginative’ and ‘informative’, respectively) and the former dominates, contributing 177 texts out of 214 in the book category. The authorship of books is 75% male and 25% female; British English text accounts for 70% of texts, whilst 20% is American English and the rest other varieties. The corpus covers a range of topics from American Indians to Vietnam, from childcare to sex through to the Third World, and from myths and cults to natural history. The structure of the Birmingham Collection (Renouf, 1987: 23-32) is shown in Figure 2.

[Figure 2: The Structure of the Collins-Birmingham Collection of English Texts. Tree diagram; surviving node labels: Book, Newspaper, Magazine, Brochures/Leaflets and Correspondence (personal letter) at the level of medium; Survey, Procedure (handbooks, guidebooks), Argument (positional/hortatory, balanced), Narrative (biography, autobiography, travelogue), Academic and Humour for non-fiction; novels (historical, thriller, academic, future worlds/fantasy), short stories and other extended texts for fiction.]

Note the fine-grained organisation of books: there are positional and hortatory texts, where the ‘positional’ author puts forward his or her case in relation to a particular topic and the ‘hortatory’ author exhorts the reader to do or become something; the Birmingham collection distinguishes between handbooks and guidebooks; and the distinction between a variety of narrative texts (travelogue, biography and autobiography) is maintained.
Longman/Lancaster English Language Corpus: Deliberately and randomly selected corpus

Summers (1991) has argued that the motivation for creating the Longman/Lancaster Corpus was to provide lexicographers and linguists with ‘an entirely new, conceived from scratch, corpus of English that could serve a number of purposes and be organised according to objective criteria’ (ibid: 1). The Longman/Lancaster team acknowledges the influence of Geoffrey Leech and Douglas Biber. The primary purpose of this 30 million word corpus was ‘to provide an objective source of language data from which reliable linguistic judgments about the meaning and typical behaviour of words and phrases can be made as a basis for dictionaries, grammars and language books of all kinds’ (ibid: 3).

What distinguishes the Longman/Lancaster Corpus from the LOB or the Brown Corpus is that the former is ‘topic driven’ whilst the latter are ‘genre driven’. Recall from Figure 1 that LOB distinguishes ‘academic discourse’ from ‘press reportage’, ‘press editorial’ from ‘arts’ and so on. The ‘topic driven’ texts in the Longman/Lancaster Corpus are categorised in 10 superfields, as shown in Figure 3.

    Science: Natural, Pure       6%
    Science: Applied             4.3%
    Science: Social              14.1%
    World Affairs                10.9%
    Commerce, finance            4.4%
    Arts                         7.9%
    Beliefs, thoughts            4.7%
    Leisure                      5.7%
    Fiction                      40%
    Poetry, drama and humour     2.3%

Figure 3: ‘Superfields’ in Longman Lancaster Corpus

The lexicographic argument for choosing the topic-based approach, pioneered by Michael Rundell of Longman Dictionaries, was that ‘it was more likely to produce text categories that were lexically homogenous’ (Summers 1991: 7).

In contrast to the Birmingham Collection of English Text, which comprises mainly 1970s texts, the Longman/Lancaster Corpus not only contains ‘drama’ as a subject (or text type), but also contains roughly equal samples of English dating from 1900-49 and 1950-69 (30% each), while 40% of the Longman/Lancaster content is post-1970.
As such, the diachronic breadth of the corpus is greater than that of Birmingham’s. Texts in Longman/Lancaster are divided into metacategories, informative and imaginative, which are subdivided into superfields. As in the LOB corpus, informative texts comprise books, newspapers and journals, and unpublished material and ephemera, whereas imaginative texts are mainly works of fiction in book form. There are four ‘external factors’ that form the basis of text categorisation in Longman/Lancaster: ‘region’, including language varieties; ‘time’, covering text published between 1900 and the 1980s; ‘medium’, that is, the ‘sources’ of texts: books (80%), periodicals (13.3%) and ephemera (6.7%); and finally, the ‘level’ of text. For informative texts there are three levels: ‘technical’, ‘lay’ and ‘popular’. Similarly, the imaginative texts are divided into ‘literary’, ‘middle’ and ‘popular’. The ‘axiomatic’ features of texts in Longman/Lancaster include authors’ gender and country of origin, target age and gender, number of words in total, title, and so on.

Most texts in Longman/Lancaster are about 40,000 words long, which appears smaller than Birmingham’s (c. 70,000, where possible, including some whole texts). No whole texts were included because the ‘emphasis was on many sources rather than the completeness of texts’ (Summers, 1991). The Longman/Lancaster corpus design is such that half of the 30 million words (c. 15 million) are derived from carefully selected texts - the ‘selective texts’ - and the other half from randomly selected individual titles, collectively known as the ‘microcosmic texts’. Like the LOB corpus, Longman/Lancaster have used a book catalogue - Whitaker’s Books in Print - and selected texts originally published in English (in English-speaking countries) after 1900, excluding dictionaries and reference works and works for children.
The selective texts are generally well-known texts, whereas the microcosmic texts include technical texts that are not very well-known. The structure of Longman/Lancaster is shown in Figure 4.

[Figure 4: Longman/Lancaster Corpus Structure. Tree diagram; surviving node labels: Imaginative and Informative metacategories; Selective and Microcosmic halves; Fiction (General, Historical, Science, Romantic, Mystery, Adventure), Drama, Humour; ‘Not Fiction’ superfields S1 to S8; Books, Newspapers/Journals, Unpublished & ephemera; selection factors Medium, Influence, Region, Popularity, Time, Educational Level, Recommended, Random Selection.]

British National Corpus Initiative

Atkins, Clear and Ostler, key players in the British National Corpus Initiative, have identified ‘principal features in corpus design’ (1992). Their paper contains the definition and elaboration of a number of terms used in corpus linguistics, particularly corpus-based lexicography. Atkins et al distinguish between a text archive, an electronic text library (ETL), and a corpus. The first is essentially a collection of readable electronic texts contributed by others, including the authors of the texts. The second, the ETL, is a collection of electronic texts in a standardised format, organised loosely on the text content but without any selectional constraints. A corpus is a collection that has been put together for a specific purpose, either for lexicography, for teaching and learning, or for studying syntax, lexis or morphology.

Texts organised in a corpus can still be differentiated on a number of counts - the so-called ‘contrastive parameters’ of corpus typology. Recall that LOB comprised texts that were only 2,000 words long and that Longman/Lancaster texts could be up to 40,000 words, with no insistence on the texts being complete; similarly, the Birmingham Collection comprised texts that were about 70,000 words long. Such corpora are called ‘sample corpora’, as opposed to full text corpora.
Greenbaum’s International Corpus of English - a corpus of English from East Africa, Hong Kong, India, Jamaica, New Zealand, the Philippines, Singapore and Wales - is partly based on the LOB model, with 2000 words per text, and partly on the notion of full texts.

Historical linguists and etymologists keen to study language change and eager to investigate the life cycle of a ‘loan word’ can benefit from a text corpus containing texts that were written/published at different times, covering decades and, in some cases, 2000 years. The Longman/Lancaster developers regard their corpus, with roughly the same distribution of texts from 1900 to 1985, as more diachronic than Birmingham’s, in which a majority (about 77%) of the articles were published between 1970 and 1981, with 1980-81 articles making up 37% of the text collection. Indeed, the diachronic and dialectal Helsinki Corpus of English Texts comprises 2100 text samples, ranging from 2500 to 10,000 words, claimed to have been written between 850 and 1720. The dialects studied include south west Midlands, Mercian, Cornish, Devon, Somerset, Wiltshire and South Avon (see, for instance, Rissanen (1991) for the diachronic organisation of the Helsinki Corpus and Ihalainen (1991) for dialectal variation within the corpus).

In the description of the Longman/Lancaster corpus it was noted in section 1.3 that there were 10 topics in the corpus, with fiction contributing 40% of the text and science-based text about 25%, followed by world affairs (10%). The focus of this corpus was on general language, rather than subject specific languages. Table 1 introduces a corpus typology developed by Atkins, Clear and Ostler (1992: 5-9):

Texts Length
    ‘Real World like’: Full text archived permanently
    Filtered: Sample: only certain samples archived permanently. Monitor: full text analysed but not stored - only statistics stored.
Time
    ‘Real World like’: Diachronic
    Filtered: Synchronic
Topics/Subject fields
    ‘Real World like’: All subjects/topics
    Filtered: General language: non-technical, broad-readership
Language Composition
    ‘Real World like’: Multilingual, standalone
    Filtered: Multilingual: all texts ‘shadowed’ with equivalents in other languages
Language Variety
    ‘Real World like’: All varieties: national varieties, dialectal, etc.
    Filtered: ‘Core’ texts: text types common to all varieties. ‘Periphery’: text types specific to some varieties.
Hardware/Storage
    ‘Real World like’: All texts
    Filtered: ‘Central’: selected manageable - in terms of on-line storage - body of texts. ‘Shell’: all other texts, accessible but not stored on-line.

Table 1: Contrastive Parameters of corpus typology (after Atkins, Clear and Ostler, 1992): Status of individual texts in the Corpus.

Linguistically grounded text typology

Biber has discussed the typology of English texts at length (Biber, 1988, 1989) and has attempted to shift the basis of text typologies from practical to linguistic grounds. Functional typologies are based on one or two particular functional dichotomies, such as formal versus informal, involvement versus detachment, or integration versus fragmentation, and use these points of distinction to describe text ‘types’. Whilst Biber does not detract from the utility of functional distinctions, he argues that these typologies leave much to be desired ‘in identifying the salient LINGUISTIC differences among texts in English’ (1989: 5). Biber has pointed to the considerable linguistic variation within a given functional type of text, whilst there is a systematic co-occurrence of linguistic features across the functional types.
The linguistically grounded typology is based on ‘sets of syntactic and lexical features that co-occur frequently in texts’. These so-called ‘dimensions’ of variation are identified empirically by multivariate statistical methods (factor analysis, to be precise), and text types are then defined so that the texts in each type ‘are maximally similar in their linguistic characteristic, while the different types are maximally distinct from one another’ (Biber, 1989: 5). Such a typology implies important functional differences, as lexical and syntactic features are used to indicate common functions.

Biber has analysed the co-occurrence distributions of 67 linguistic features, divided into 16 [2] major grammatical categories used in tagging the LOB corpus and the London-Lund Speech corpus, across 481 text samples (written and spoken texts of contemporary British English) spread over 23 genres, including academic prose, forms of press reportage, telephone conversations, letters, and so on. These 67 features make varying degrees of contribution to each of the dimensions; that is, a number of features co-occur, and this can be detected empirically such that the greater cluster is labelled as a dimension. Biber then goes on to interpret the functional import of that cluster. For example, Biber distinguishes narrative text from non-narrative text (expository, descriptive, and so on) by arguing that narrative concerns are marked by ‘considerable reference to past time [past tense verbs], third person animate referents [3rd person pronoun] and reported speech (public verbs): non-narrative concerns [...] are marked by immediate time [present tense verbs] and attributive nominal elaboration [attributive adjectives]’ (Biber, 1989: 9-10). Biber identifies five [3] dimensions that underlie each co-occurrence pattern:

1. Involved versus informational production;
2. Narrative versus non-narrative concerns;
3. Elaborated versus situation-dependent reference;
4. Overt expression of persuasion;
5. Abstract versus non-abstract style.

[2] These are: (i) tenses and aspect members; (ii) place and time adverbials; (iii) pronouns and pro-verbs; (iv) questions; (v) nominal forms; (vi) passives; (vii) statives; (viii) subordination features; (ix) prepositional phrases, adjectives and adverbs; (x) lexical specificity; (xi) lexical classes; (xii) modals; (xiii) specialised verb classes; (xiv) reduced forms and dispreferred structures; (xv) co-ordination; and (xvi) negation.

[3] In his 1989 paper, Biber notes that ‘[T]o date, five major dimensions of variation have been identified in English. Biber (1988) [‘Variations across speech and writing’] presents a unified description of genre variation in English in terms of his five dimensional model’ (1989: 7). The only problem is that Biber opens the discussion with a seven-factor - later christened ‘dimension’ - model and argues that the seventh factor does not seem to account for much variance in the data and will be ignored: ‘Factor 7 is not strong enough for firm interpretation, and this factor will therefore not be considered further in the present study’ (1989: 114). From then on, Biber concentrates on the first six factors, of the seven identified, and suggests ‘interpretative labels for each factor, to describe the underlying functional dimension’ (1989: 115). The sixth dimension, ‘On-line Informational Elaboration’, which distinguishes fragmented informational discourse from discourse with integrated or no informational content, was not discussed in the 1989 paper and no reason was given for the omission.

Each dimension has a different number of lexical and syntactic features (Table 2) that are usually unique to the dimension (although some features are shared across dimensions).
Some dimensions have complementary features, as is the case with the narrative versus non-narrative features: the former is distinguished by the co-occurrence of third person pronouns, past tense verbs and so on, the latter by the co-occurrence of present tense verbs and attributive adjectives. However, the complementary groups in the other factors reflect similar functional relations. Others, such as ‘overt expression of persuasion’ and ‘abstract versus non-abstract style’, do not have complementary features; for instance, the abstract text type will show a marked co-occurrence of conjuncts, agentless passives and past-participle clauses, whereas the non-abstract, that is non-technical and informal, text type will be marked by an absence of such co-occurrence.

    Dimension                            Features
    Involved -v- informational           30 (6 complementary)
    Narrative -v- non-narrative          8 (2 complementary)
    Explicit -v- situation dependent     8 (3 complementary)
    Overt expression of persuasion       7
    Abstract -v- non-abstract style      6

Table 2: Dimensions and number of lexical and syntactic features

Biber uses the dimensional statistics to group LOB and London-Lund texts into ‘clusters’: ‘Texts that are similar with respect to the dimension but very different with respect to other dimensions are likely to be grouped into different clusters’ (1989: 17). The texts in the LOB and London-Lund corpora appear to form eight ‘distinct’ clusters, and their dimensional scores on each of the five dimensions indicate an interacting functional nature of each cluster. Each cluster is characterised either by a dominant dimension or by the absence of such a characteristic. Table 3 shows the composition of the clusters according to dimensionality:

    Cluster                                  Involved -v-              Narrative -v-        Elaborated -v-           Overt                  Abstract -v-
                                             Informational             Non-narrative        Situated ref.            Persuasion             Non-abstract
    1. Intimate Interpersonal Interaction    Extremely involved        Not marked           Situated                 Not marked             Non-abstract
    2. Informational Interaction             Less involved             Not marked           Situated                 Not marked             Non-abstract
    3. Scientific Exposition                 Extremely informational   Non-narrative        Highly elaborated        Non-persuasive         Extremely abstract
    4. Learned Exposition                    Extremely informational   Non-narrative        Highly elaborated        Non-persuasive         Moderately abstract
    5. Imaginative Narrative                 Moderately involved       Extremely narrative  Situated                 Not marked             Non-abstract
    6. General Narrative Exposition          Informational             Narrative            Not marked               Not marked             Not marked
    7. Situated Reportage                    Not marked                Non-narrative        Situated                 Non-persuasive         Non-abstract
    8. Involved Persuasion                   Moderately involved       Non-narrative        Moderately elaborated    Extremely persuasive   Moderately non-abstract

Table 3: Clusters expressed in terms of their dimensions

These clusters are organised along functional categories, but the functions are interpreted in terms of dimensions and are no longer ad-hoc labels. The dimensions, in turn, are ‘computed’ by calculating the co-occurrence of lexical and syntactic features of texts organised in genres or sub-genres. For instance, texts in the academic prose sub-genres are distributed across four clusters: two ‘obvious’ ones categorised under exposition - scientific and learned - which account for 75% of the texts, and two less ‘obvious’ ones, ‘general narrative exposition’, accounting for 17% of academic texts, and ‘involved persuasion’, for 9% of the texts. In contrast, 92% of the ‘romantic fiction genre’ in LOB is in the ‘imaginative narrative’ cluster and the rest, 8%, in ‘general narrative exposition’ (Biber 1989: 21-22).

Biber treats the ‘distance’ between clusters as an indicator of ‘similarity’: functional, lexical and syntactic similarity. There are no ‘surprises’ here: clusters 1 and 2, comprising intimate interpersonal and informational interaction, are proximate, as are the ‘expository’ scientific and learned clusters (3, 4). The situated reportage and involved persuasion clusters (7 and 8) are proximate to the imaginative narrative cluster (5).
Cluster 6, general expository narrative, is proximate to cluster 4. On the whole, the typology presented by Biber (Figure 5) is a synthesis of the typology used by the LOB corpus researchers, that is, the broad classification into informative and imaginative text, and the typology espoused by the ‘functionalists’, who insisted on classifying text according to its communicative function - narrative, expository, hortatory and so forth (cf. Birmingham Collection of English Text). The difference here is that Biber has used statistics related to lexical and syntactic feature distribution across texts, thereby grounding this typology well.

[Figure 5: Biber’s text typology based on statistical distribution. Flow diagram: lexical and syntactic features of a text are used to compute variance along the five dimensions (Involved/Informational, Narrative/Non-narrative, Elaborate/Situated, Overt persuasion, Abstract/Non-abstract); similarities are then mapped onto the clusters (Interactional, Informational, Scientific and Learned Expository, Imaginative and General Narrative, Reportage, Persuasion) in order to classify the text.]

Diachronic Corpora and their Structure

The Helsinki Corpus of English Text comprises a diachronic part and a dialectal part. The diachronic part is divided into Old English, Middle English and Early Modern English, spanning a period from about 750 AD to the beginning of the eighteenth century (Kytö and Rissanen 1988; Rissanen 1991). The compilers of the corpus attempted to make the three periods compatible in terms of the amount of text, text types, registers and genre. The corpus contains about 1.5 million words, and texts vary between 2,500 and 10,000 words. Kytö and Rissanen have defined nine major text prototypes: correspondence, drama, instruction (secular and religious), narration (fiction and non-fiction), proceedings, science and stipulation. The Helsinki Corpus, especially along its diachronic variation, shows the growth and influence of science, and the development of drama from the Middle to the Modern Age.
Figure 6 shows the structure of the Helsinki Corpus.

[Figure 6: Helsinki corpus structure. Chart relating the text prototypes (Stipulation, Science, Instruction, Narrative, Correspondence, Drama, Proceeding) to the text types recorded for Old English, Middle English and Early Modern English; surviving labels include law and document; astronomy, medicine and other sciences; recipe, prognostication, handbook, education; homily, rules, religious treatise, sermon; history, travelogue, diary, biography, romance, fiction; preface, private and non-private letters; miracle and morality plays, comedy; deposition, trial, meeting.]

PISA’s Bilingual Reference Corpora

The use of bilingual corpora in corpus-based lexicography and in collecting ‘large quantities of documented evidence on the possible realisation of a concept in two languages’ has been reported by Marinai, Peters and Picchi (1991: 64) in the context of Italian and English. These authors have described a ‘parallel text retrieval’ system that can be used to query both the source and target language text corpora for the purposes of creating a bilingual dictionary. Marinai, Peters and Picchi have created an interesting Italian-English Bilingual Reference corpus that ranges from ‘literal translations’, where the translator keeps close to the source text, to ‘free translation’, where the translator feels ‘free to render the concepts in what he feels is the most appropriate manner’. The composition of the corpus is given in the following table (Table 4).
| Text type: genre | Text type: sub-genre | Topic | Source language | Target language |
|---|---|---|---|---|
| Informative | Learned papers (note 4) | Computer Science | Italian | English |
| Informative | Popular (excerpts) | ‘Inflight Magazine’ | English (note 5) | Italian |
| Informative | Text book | ‘Scientific’ | American English | Italian |
| Instructive | (Brief extracts) | - | Italian | English |
| Imaginative | Fiction - short story extracts | - | English | Italian |
| Imaginative | Poetry | | | |

Table 4: The composition of texts in the Pisa Bilingual Reference Corpus (derived from Marinai, Peters and Picchi 1991: 64)

4 The authors obtained these texts from colleagues, so we have surmised that these must be research papers.
5 Inflight Magazines are usually published by a select few organisations based in the English speaking world - we have extrapolated that the source is English.

Unfortunately, details of this ‘reference’ corpus are sketchy. We are not given any dates, but since the authors talk about ‘well-known American [sic] scientific text books’ used by Italian University students, and mention texts ‘donated [...] by colleagues’, we have surmised that the texts in their corpora are of a recent vintage. There is no description of language variety (American English, British English and so on), nor do we have any data about the length of the texts, except that the texts, with the possible exception of the learned paper and poetry sub-genres, are ‘excerpts’ or brief extracts. Nevertheless, the authors do convey the essence of the problems faced by lexicographers in compiling bilingual dictionaries: the availability of a corpus is a boon if the texts are roughly equivalent, but it demands considerable vigilance when the translations are creative.

Special Language Corpora and ‘whole texts’

What is a special language

Human enterprises are characterised usually by a group of people having a specific, carefully delineated interest. The ‘enterprise’ of philosophy, for instance, involves a community of philosophers having their separate university departments, learned societies, journals, newsletters and so on.
The ‘enterprise’ of cricket involves a tightly-knit group of players, managers, support staff and entrepreneurs promoting matches; this group has its clubs, galas, magazines, rule-books, biographies and travelogues. Cultural anthropologists would have us believe that shamanistic magical practices are undertaken by a group of shamans who communicate with each other through icons, and that there are novice ‘shamans’ taught by the expert in a ‘strange language’. The twentieth century has seen the development of supra-national business and finance communities which undertake not only business but also the training of young people through university courses, seminars, conferences, books, magazines and so forth. The distinguishing feature of each of these enterprises is the way its members use their first or second language to communicate with their fellow community members to transfer facts, heuristics, concepts and ideas. One might argue that as geographically determined communities of people communicate through a dialect of a ‘national’ language, or as socially stratified groups use sociolects of their dialects/national language, the ‘epistemologically determined’ specialist human enterprises motivate at times, force at others, the involved community to use an enterprise-specific special language. We have, rather reluctantly, used the term ‘epistemology’; our use is based on the definition of epistemology as dealing with ‘enquiry into the nature and ground of experience, belief and knowledge’ (Lacey, 1986:63). It is our view that the human enterprises, and the involved communities, do share beliefs and knowledge and have similar experiences, or aspire to do so. Epistemology addresses questions about the related topics of the nature of knowledge, its different types, what is known and the origin of knowledge. Therefore, we believe that special languages, as they are used to express experience, belief and knowledge, can be regarded as ‘epistemologically determined’.
This ‘epistemologically determined’ community is characterised by its attempt to use many lexical items and a more highly nominalized vocabulary than is the case with the community’s everyday, spontaneous natural language or, indeed, with more imaginative forms of writing, such as ‘newspaper-speak’, fiction, and personal letters. Furthermore, on the whole a special language appears to comprise fewer elements of metaphor and simile than, say, dialects and sociolects may have: the endeavour of the specialist human enterprise is to avoid polysemy, to avoid, or to eliminate, culture-bound references and so on. There are some syntactic features that distinguish special language from ‘general’ language: for instance, special languages, particularly the language used by scientists and engineers, contain many more instances of passives than one would find in a quality newspaper (cf. Svartvik 1966), and there would be a surfeit of (complex) nominals that may constitute as much as 60% of a given clause or sentence (cf. Halliday 1993). Table 5 compares the three varieties of ‘national’ languages - sublanguages of the general languages.

| Language variant | Demarcation | Linguistic features | Source | Characteristics |
|---|---|---|---|---|
| Dialects | Geographically determined | Phonological features; few lexical items | Regional history; geomorphology; flora & fauna | Metaphor; similes; polysemy |
| Sociolect | ‘Socially’ determined | Syntagmatic features; some lexical items | Anthropology; national history; socio-economics | Metaphor; similes; polysemy |
| Special Language | Epistemologically determined | Many lexical items; some syntactic features | Derivation; semantics; foreign languages; artificial symbol sets | |

Table 5: A tentative classification of sublanguages based on their ‘demarcation’, linguistic features and some of the principal sources.
Based on a discussion in Sager, Dungworth and McDonald (1980: 63-69).

The differences between the various sub-languages, or special languages, and the language used in literature and poetry have been noted at length by a number of language researchers, including Halliday (1993), de Beaugrande and Dressler (1981), and Sager, Dungworth and McDonald (1980). These differences are particularly accentuated for the ‘language of science’. Indeed, a number of well-known scientists have commented on the language of science, prominent among them Werner Heisenberg, one of the founders of quantum theory and modern physics.

Specialist texts, cohesion and coherence

According to de Beaugrande and Dressler, ‘literary and poetic texts could be seen in opposition to text types intended to increase and distribute knowledge about the currently acceptable ‘real world’’ (1981: 186). This is particularly so for scientific texts, which disseminate the results of explorations to extend and clarify existing knowledge through the presentation and examination of evidence inferred from observing natural phenomena or evidence gathered from documented knowledge.

| Text type | Principal entities | Frequent conceptual relations | Surface text characteristics | Pattern |
|---|---|---|---|---|
| Descriptive | Objects and situations | Attributes, states, instances, specifications | Density of modifiers | Frame |
| Narrative | Actions and events | Cause, reason, purpose, enablement and time proximity | Subordinates | Schema |
| Argumentative | Evaluation of beliefs and ideas (±) T(F) | Reason, significance, volition, value and opposition | Emphasis and ???; recurrence, parallelism, paraphrase | Plan |

Table 6: A text typology along functional lines, i.e., according to the contributions of texts to human interaction

We might succeed in distinguishing texts purely on their statistical characteristics.
However, as de Beaugrande and Dressler have argued, statistical linguistics should be augmented by consideration of the functions of text in communication and the pursuit of human goals (1981: 182-185).

Specialist texts: A speech act analysis

Special language, according to Sager, Dungworth and McDonald, uses ‘special text forms, a restricted syntax and arguably a limited morphology’ (1980: 40). Except for patents and legal texts, special languages do not in any significant fashion ‘redesignate syntactic forms’. It is on the semantic level that special language can be differentiated from other languages, as each specialism requires ‘a system of references which organises and structures’ its object of study and the processes involved. The object may comprise materials, instruments, mechanisms and artefacts such as machines, products and parts, and installations, together with their properties, qualities and states. The specialist language attempts to organise and structure processes and also the methods of these processes. Sager, Dungworth and McDonald (1980) were inspired by, though they do not wholly subscribe to, the ‘speech act’ approach to language pioneered by Austin (1962) and Searle (1969). A speech act is the result of ‘the convergence of a speaker (or writer), a listener (or reader) and a topic (an area of reference at a particular time and place in a specific situation)’ (1980: 22). A philosophy undergraduate reading commentaries on Aristotle written by a medical scholar in a present setting is one speech act, whilst the same undergraduate reading a newspaper feature article on the ethical issues faced by, say, a government committee on abortion is another speech act. The four simultaneous aspects of a speech act, the ‘utterance’, ‘propositional’, ‘illocutionary’ and ‘perlocutionary’ acts, are ‘fundamental to the understanding of communication’ (1980: 23).
The ‘intentions’ underlying the difference between the ‘illocutionary act’, indicated by punctuation, mood of verb and word order, and the ‘perlocutionary act’, indicated by the effect the utterance has on the reader, are classified by Sager et al. into four types: informative, evaluative, directive and phatic intentions. These intentions can be differentiated and defined in relation to the four principal features of the speech act for written texts: writer, reader, topic, and time and place. Sager et al.’s approach to special language texts appears to combine linguistic features with communicative intent and the action the communication precipitates.

Language of Science

Savory (1967) has discussed ‘the language of science’ with particular emphasis on the etymology of scientific words, including the ‘processes’ by which scientists build up the vocabulary of their sciences, how the language of science grows, what its character is, and how, Savory speculates, the literature of science comes into being. According to Savory, words of science fall into three major categories, depending on how they were ‘coined’: borrowed words, imported words and invented words (1967: 32-34). Scientists borrow words from their own language. Sometimes they give a precise meaning to the word, as with salt and atom; at other times the meaning of these words is just as vague and uncertain as in the ordinary language - words like ‘life’ and ‘bug’ fall into this category; and yet there are some words that keep a vague shadow of their original (general language) meaning, cf. work, force and power. Importation from other languages is quite common amongst scientists; here the meaning of some of these words is preserved in translation, like radius and femur, and for some even mythological references are preserved: Atlas, the name of the vertebra on which the skull rests, reflects the Greek mythological story of Atlas carrying the world on his shoulders.
Scientists add to the overall stock of their languages by ‘inventing’ new words. There is a group of invented words that were carefully defined and whose original meaning has been carefully preserved: hormone is a good example. Yet there are invented words that started their life very much like hormone, but they caught the public imagination and, according to Savory, their meaning has deteriorated: allergy is quoted by him as an example. Table 7 summarises the above discussion.

| Type | Assimilation/Invention | Meaning of the word | Exemplar |
|---|---|---|---|
| Borrowed | Words taken by scientists from ordinary language | Uncertain and obscure both in LGP and LSP | ‘life’, ‘bug’ |
| Borrowed | | More precise in LSP than in LGP | ‘salt’, ‘atom’ |
| Borrowed | | Precise in LSP but still a vague shadow of its LGP meaning | ‘work’, ‘force’, ‘power’ |
| Imported | Words taken by scientists from languages other than their first language | LSP meaning the same as the source LGP meaning | ‘femur’, ‘fulcrum’, ‘radius’ |
| Imported | | LSP word that has a metaphysical connection with its source LGP meaning | ‘Atlas’ (name of the vertebra corresponds to Atlas carrying the world on his shoulders) |
| Invented | Words: ad-hoc creation by scientists | LSP words that were carefully defined and care taken to preserve original meaning | ‘Hormone’ |
| Invented | | LSP words carefully devised but caught popular imagination and are exposed to abuse | ‘Allergy’ |

Table 7: Etymology of words.

Specialist texts or terminological texts

The word ‘terminology’ and its derivative, ‘terminological’, have fared much as many terms and words that were not precisely defined at first and have since deviated in meaning. In Artificial Intelligence (AI) literature, one finds extensive discussions about terminological knowledge representation formalisms. These formalisms are the basis for developing programs that represent on a computer system not only the definition of a term, but also various assertions about the term.
In a limited way such formalisms support the intensional and extensional attributes of an object, process, action or event. The question here is how one distinguishes between a term, knowledge of a domain and formalism primitives: these are overlapping categories, and perhaps terminological knowledge representation does convey another intuition of the AI community, the so-called ‘concept logic’ or ‘term logic’, but it cannot be held up as an example of terminological exactitude. The term ‘terminological corpora’ has been used by corpus linguists (cf. Atkins, Clear, Ostler, 1991) to express the intuition that, somehow, texts that may be regarded as an ‘evidence’ and ‘reference’ source for terms can be isolated from general language texts and used for acquiring terminological data.

British National Corpus and Terminology

Recently, Atkins, Clear and Ostler (1992, 1:16) have discussed a number of design criteria which they think may be relevant to the construction of text corpora used for lexicography, computational linguistics, machine translation and knowledge engineering. Atkins et al. have outlined ‘certain contrastive parameters of corpus typology per se’ (1992: 5). These parameters include: the length of the constituent texts, from unabridged texts through small excerpts to ‘monitor’ text corpora where the text is stored only temporarily; the time-dimension of the texts, that is, whether the texts were selected for a fixed date (synchronic) or over a period (diachronic); and the languages of the texts in the corpora, for example whether the text corpus is monolingual or bilingual and, if bilingual, whether it contains parallel texts. Amongst the nine contrastive parameters, Atkins et al. introduce the contrast between ‘general’ and ‘terminological’ types of texts, and note that ‘terminologists must define conditions which must obtain if a corpus is to be used for terminological use’.
The notion of ‘terminological corpora’ is not elaborated by Atkins et al. except when they discuss what kinds of attributes a text may have, a total of 29 as it turns out, and terminology is referred to in two attributes of a text: ‘technicality’ and ‘native language of authors’. The ‘technicality’ attribute is a three-valued attribute that depends on ‘degree of specialist/technical knowledge of the author and target/readership audience’ (1992: 8). For instance, a ‘technical’ text is written for and by specialists, a ‘semi-technical’ text for non-specialists by specialists, and a ‘general’ text is written usually by non-specialists for non-specialists. The authors note that ‘these [general, semi-technical, technical] must be external variables, not linguistic style variables. This particular attribute is highly important for terminological corpora’ (ibid.). This classification has been used by Yang whilst analysing a corpus of scientific English for identifying terms (1988). The description of the native language of the author attribute is rather incomplete in Atkins, Clear and Ostler; nevertheless, the authors note that if there is no information forthcoming about this attribute, we should assume that the native language is the language of the text. Furthermore, ‘this attribute [is] of particular importance to terminological corpora. Precise information about first language is in many cases unavailable or irrelevant’ (1992: 8).

Organisation of Text Corpora and text typology

Any collection of texts reflects, howsoever infinitesimally, the state of a language (or languages) used by the authors of the texts. A text corpus is an image of language. But part of the image is shaped by how the corpus was organised: whether the texts were chosen in part or in whole, whether they were selected from one period of time or over a continuum, whether the topics discussed in the texts all belong to one subject or to many, and so on.
Lexicographers may be happy with text samples of everyday language over a period of time, whereas a terminologist may require full texts in a specific subject area after a particular date only. One can infer from the discussion in sections 1 and 2 that whilst there may be broad agreement in corpus-based lexicography as to what constitutes a text type, there remain open questions and fundamental differences in how a text is viewed. The typologies discussed above do work, in that they have been used in producing lexica and in helping a number of researchers gain insight into the (English) language in a relatively introspection-free manner. The basis of such a typology, then, is really heuristic - a rule of thumb devised by the expert for solving a relatively open-ended problem. The discussion in the previous section showed that the texts comprising a corpus may be classified and organised in many different ways depending on the use to which the corpus is to be put. For instance, lexicographers may be concerned about the variety of national language, or they may wish to specify a period of time which they regard as contemporaneous with the life-time of the lexica they wish to produce. Then again, a gender-conscious lexicographer or social historian may regard the gender of the author(s) of the text as more important than the 29 or so text attributes specified by Atkins, Clear and Ostler (1992), or the attributes specified by Summers (1991) that emphasise diachronic variation, and so on.

Well grounded text typologies and text representation

The concerns of a text linguist, for example de Beaugrande and Dressler (1981), are related to ‘the contributions of texts to human interaction’ (ibid.: 184), in other words, they run along functional lines. Therefore, here, the primitives of the text typology, namely description, argumentation and narration, comprise, to varying degrees, literary, scientific and didactic texts.
Text linguists and systemic function-oriented linguists would prefer a different typology of text from that of, say, a lexicographer or social historian. Technical writing and ‘special English’ experts do sometimes take their cue from research in speech acts (Searle 1969) and from the work of Halliday (1973, 1975) and Hasan (1976), and have developed a very broad-based typology that embraces a much broader category of texts than the typologies used by lexicographers and theoretical text linguists: Sager, Dungworth and McDonald’s ‘typology of special forms - the traditional forms which occur in special communication’ embraces the informative form, including announcement and news-flash; the evaluative form - commentary, test report, testimonial and so on; the directive form, including instructions, directives and regulations; and the phatic form, including presentation (1980: 124-181). Given that texts may have a whole range of attributes that are given, like the age of the author, his or her first language, the length of the text and so on, that each text in a corpus may (or may not) have been written by its author with an exclusive function in mind, and that each may be classified according to its classifier’s intellectual orientation and his or her immediate need, the question here is this: how are the texts to be organised in computerised text corpora such that the corpus users can access the texts according to their own intellectual orientation and immediate needs?

Coding, Description and Representation of Texts

Computer programs are essential for the creation of a text corpus, and are equally important for retrieving one or more texts, or one or more text fragments. Generally, corpus creation programs are based on their programmers’ understanding of the salient properties of the texts that are to be stored in a corpus, and on their understanding of how a corpus is to be structured.
The programmers, in turn, learn about the properties and the structure from the organisers of the corpus: these organisers can be literary experts, linguists, lexicographers, information scientists and so on. Once a corpus is created, a potentially well-defined group of users accesses the stored texts through another set of computer programs. These retrieval programs are based on a model of the user’s requirements as understood by programmers working under the direction of the corpus organisers. The creation and the subsequent usage of a text corpus appear, then, to depend on the shared knowledge of the corpus organisers, the programmers and the end-users. This shared knowledge has to be simple and formal enough to be used in the corpus creation and retrieval programs. It has to be articulated with a zero-intelligence device, a computer, in mind. Furthermore, the texts are to be stored in an electronic medium as distinct from the usual ‘graphic substance’ used in the print medium, that is, marks on a surface. The graphetic (physical) properties of the symbols that constitute writing systems have to be transformed from the print to the electronic medium for the purposes of storage, whilst the reverse process has to be performed for retrieval. The efficient storage and retrieval of texts depends on how texts and text corpora are dealt with in three overlapping, yet clearly identifiable, tasks: the coding of the texts; the description of the texts; and the representation of texts. In other words, how texts are to be coded or, more precisely, encrypted for use in an electronic medium, how a particular class of texts and text corpora is to be described, and how a class of texts and text corpora is to be represented on a computer system. The coding, description and representation comprise some of the shared knowledge mentioned above, and it is the articulation of this knowledge that will concern us in the following subsections.
Coding of texts

Coding involves the encryption of the words, sentences and paragraphs of a text. For encrypting words, binary equivalents of the individual letters and numbers comprising the words are used, and the text is possibly ‘compressed’ by allocating shorter binary equivalents to the more frequent tokens and longer ones to the less frequent. A given text may comprise all, some or at least one of the ‘modes of graphic expression’: ‘verbal’ (written), ‘pictorial’ (drawings, photographs) and ‘schematic’ (charts, diagrams, scientific and technical notation, formulae and equations) (cf. Crystal 1987). In the print medium, the verbal mode relies heavily on an understanding of the physical properties of the symbols, including font details, shapes, sizes and so on. Furthermore, in the print medium units of discourse, such as sentences and paragraphs, are demarcated by layout configuration and punctuation marks. Graphic mark-up languages are used to encode aspects of the verbal, pictorial and schematic modes of expression: an example of such a language is the Standard Generalised Mark-up Language (SGML). The language provides a notation convention and a grammar that is used to ‘mark up’ the text fragments comprising a text. This marked-up text can then be processed by a computer program that understands the notation and the grammar of SGML: the program can distinguish text headers from the body of the text, sentences from paragraphs and so on. Furthermore, the attributes of a text can also be ‘annotated’ in the text through mark-up languages by the use of mnemonics to indicate, say, author name, publication date, text type and so on.

Description of Texts

The ‘description’ of individual texts or a group of texts is essentially a set of conventions used to describe some particular texts. Corpus organisers attach a variety of descriptive labels to individual texts.
These labels may be used to express, say, functional typology: Longman’s typology is at some variance with that of the Birmingham Collection of English Text, and the two are at variance with the typologies of de Beaugrande and Dressler (1981) and of Sager, Dungworth and McDonald. The description of a text in a corpus is usually at a meta-level as compared to the coding of the text. This description can be the description of the text’s functional type, for example, whether the text is imaginative, informative, evaluative or phatic; it may be what Sager, Dungworth and McDonald call the ‘description [...] of traditional forms’ (1980: 148-181); or it may involve the members of the subject domain responsible for producing the text. The advent of tagged corpora allows for a lexico-grammatical description of the text, in that all known and current tokens in the text are tagged according to the categories and features given by a grammar. Therefore, in a modern computerised corpus of texts one may find meta-textual data, that is, additional data that is used to code and to describe each individual text in a corpus. The question here is this: can this meta-textual data be used to build a taxonomy that contains the functional typology at its apex, that is, functional typology at the superordinate level, text-forms at a subordinate level, domain-specific data subordinated to the text form, and the rest of the attributes at the instance level? Then again, one may regard the domain or subject specificity as being the superordinate level rather than the functional typology, or indeed any other attribute. We believe that almost all of the corpus management systems are used in large measure by an enthusiastic, well-motivated community that shares the knowledge of the meta-textual data with the developers of corpora and the software engineers that build the corpus management systems.
The corpus and software designers, and the users, have a tacit understanding of how the meta-textual level data has been used in specifying a taxonomy. The shared knowledge mentioned above is the shared knowledge of the taxonomy. However, the establishment and the consolidation of corpus linguistics would mean that methods may be developed by a team that has little or no contact with those identifying new techniques or building tools for analysing corpora. The growth of corpus linguistics means that the first victim of its success would be the shared knowledge of text taxonomies. The loss of the shared knowledge may also give rise to the opinion that a corpus management system has some privileged data about a given taxonomy and yet may not have any data about any other taxonomy. We believe that a careful and systematic approach to collecting the meta-textual data about individual texts in a corpus, together with the development of programs that can manipulate this meta-textual data to create a number of different taxonomies and that are capable of providing a user with access to all the texts for a given taxonomy, is not only necessary but holds the key to a successful user-driven exploitation of text corpora.

Representation of Texts

The coding of the texts, and the description of the texts, implicitly assume ‘shared knowledge’ about the code and the descriptors between the people that build the corpora, the people that program corpus management software, and the end-users of the corpora. This shared knowledge helps these three groups to resolve ambiguities in the code and the descriptors and, indeed, to improve the code and the descriptors so as to make these easy to use. As luck would have it, in most of the major corpus initiatives (cf. Longman’s, Birmingham) these three groups were well co-ordinated and were extremely well motivated.
In ‘Looking Up’, an account of the COBUILD project, corpus builders, systems analysts, programmers and lexicographers give an enthusiastic and well-dovetailed account of the project. But assuming that the use of corpora, in lexicography alone, takes off, and that researchers and lexicographers can access data more cheaply, freely and easily than now, how will this implicit ‘shared’ knowledge be articulated? There is considerable evidence that the take-up of corpora in lexicography, language research and language teaching is increasing rapidly. And, with this rapid increase, one must define a computer-based set of conventions for describing not just some particular text but a class of texts; not just an encryption algorithm based on implicit knowledge but a simple reasoning strategy, based on explicated knowledge, that is capable of inferencing over the attributes of texts organised into classes of texts. The devising of conventions for describing a class of texts, or more generally a class of things, with the sole purpose of making the ‘shared knowledge’ explicit is fashionably termed a problem of representation. The ‘representation of texts’ or the ‘representation of a text’ has been discussed by prominent linguists, litterateurs, literary critics and by philosophers, psychologists and sociologists, in erstwhile ideological frameworks or in frameworks that are gaining currency. The question here is this: given that texts may have a whole range of attributes, may have been written with an exclusive function by their authors and, indeed, may be classified according to the classifier’s intellectual orientation and his or her immediate need, how is one to organise texts in a computerised text corpus such that the corpus users can access the texts according to their own intellectual orientation and immediate needs?
Our question is all the more pertinent in that substantial volumes of textual data are currently being collected by national initiatives, for instance the British National Corpus initiative and the Uppsala-based Russian Corpus (Maier, 1994), and by transnational collaborations, for example the Association for Computational Linguistics Data Collection Initiative (ACL/DCI) (Church and Libermann, 1991). It appears, therefore, that texts in a corpus can be described through the use of a variety of labels. Indeed, one can create a hierarchy of these labels. There can be pressing lexicographical reasons for considering the medium as the apex of the hierarchy, followed by national language variant, for example British English and American English, then by date and so on; for students of gender studies the apex would be the gender of the author, followed by date of publication and language variant. The labels can be arranged, or in some cases have to be arranged, to suit the needs of the investigator and his or her own particular niche. The descriptive taxonomy provides the nodes and links of a network - a tree - that describes how texts are related to each other. The nodes are named after the labels, and the links provide conduits for properties that can be inherited from the superordinate by the subordinate nodes. For example, the language variant node can be construed as a node that can navigate a user through all the texts that were written in the particular variant; the topic node can be used to collate texts according to topics. The order of these nodes, or the taxonomy, then depends upon the individual investigator’s niche. The taxonomy chosen by a lexicographer may not suit the needs of a grammarian, and the taxonomy chosen for stylistic studies would be wasted on a historical linguist, for example. Indeed, we describe below that even in a niche area, like lexicography, there is no agreement on the descriptive taxonomy.
But whatever taxonomy is chosen, in the context of a computer-based corpus this hierarchy must form the basis of the organisation of texts within a computer’s file system. Any change in the taxonomy then entails the reorganisation of the corpus at the file system level: a complex task at the best of times and, we believe, a task that should be performed by computer systems. In order to explicate the notion of a configurable taxonomy we have introduced the term virtual corpus. The adjective virtual has been borrowed from computing science, specifically operating systems, and is used to describe how the entire resources of a computer system are replicated by a program and made available to individual users. The users of this replication are the users of a virtual machine: each believes that he or she has access to the whole system, whilst in reality each user’s access to the machine is limited to very short intervals of time. The notion of virtual corpus is similar to that of a virtual machine: there is in reality only one corpus, but the users can arrange the nodes and links as they wish and create for themselves a corpus, or more accurately, a corpus organisation, based on an actually physically extant set of texts, for the duration of their use. Thus every corpus user will believe that they have access to all or parts of a corpus that they have themselves configured. And, continuing the operating systems analogy, such a configurable taxonomy will have to be made available through the agency of a program, within a suite of corpus management programs, that is capable of producing this virtual corpus. The specification and operation of such a program that can create virtual corpora is the focus of the rest of this paper.
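The cost of tying a taxonomy to the file system can be illustrated with a minimal sketch. It assumes, purely for illustration, that each level of the taxonomy becomes one directory level; the text identifiers, attribute names and the functions `physical_path` and `files_to_move` are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Sequence

# Hypothetical attribute records for three texts.
TEXTS: Dict[str, Dict[str, str]] = {
    "t1": {"medium": "book", "variant": "GB", "gender": "F"},
    "t2": {"medium": "book", "variant": "US", "gender": "M"},
    "t3": {"medium": "periodical", "variant": "GB", "gender": "F"},
}


def physical_path(text_id: str, attrs: Dict[str, str],
                  taxonomy: Sequence[str]) -> PurePosixPath:
    """Directory path of a text when each taxonomy level is a directory."""
    return PurePosixPath(*(attrs[level] for level in taxonomy), text_id)


def files_to_move(texts: Dict[str, Dict[str, str]],
                  old: Sequence[str], new: Sequence[str]) -> List[str]:
    """Texts whose physical location changes when the taxonomy changes."""
    return [tid for tid, attrs in texts.items()
            if physical_path(tid, attrs, old) != physical_path(tid, attrs, new)]


lexicographer = ("medium", "variant")
gender_studies = ("gender", "variant")

print(physical_path("t1", TEXTS["t1"], lexicographer))   # book/GB/t1
print(physical_path("t1", TEXTS["t1"], gender_studies))  # F/GB/t1
print(files_to_move(TEXTS, lexicographer, gender_studies))
```

In this toy corpus every text would have to be moved when the apex changes from medium to gender, which is the reorganisation a virtual corpus is meant to avoid.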
A virtual corpus management system

The design of corpora, and more so their management, which may include storage and retrieval of texts, navigation mechanisms, and strict integrity and security checks, determines to a large extent the efficacy of the corpora for various end users, who may be lexicographers, translators, or linguists. Most existing corpus management systems have been developed in conjunction with a particular corpus and have consequently taken a fairly literal approach to the implementation of a corpus on a computer. This has resulted in software that maps the structure of a corpus, as described by the corpus designers, directly onto a computer-based file or database management system structure. In the following section we are interested in the coding of corpora that allows different corpus designers to structure texts as they feel appropriate. We feel that any user of a corpus can be viewed as a corpus designer. There have been two main approaches to the storage, retrieval and navigation of texts in a corpus: an explicit text taxonomy, as in LOB and Brown, expressed in terms of file-store structure; or an implicit text taxonomy, as in Longman, expressed in terms of attributes used in the text “headers”. There are benefits and limitations with both approaches. With an explicit taxonomy, storage of texts requires a corpus management system to decide where a text should be placed in its file-store, whereas the attribute-based system can keep the texts anywhere. The main differences between the two approaches are in text retrieval, and here it is useful to think of navigation around a corpus as highly interactive text retrieval. An explicit taxonomy allows texts to be retrieved quickly by following the appropriate branches through the taxonomy, without needing to consider or refer to the corpus as a whole. The criteria for selecting a text from an explicit taxonomy can be viewed as a “path” traversing the taxonomic structure.
Also, an explicit taxonomy provides a means of navigation through a corpus that computer users find reasonably intuitive. In contrast, an attribute-based system may need to search for the required criteria in the attributes of all texts in the corpus, and is likely to be query-based. For user navigation, query-based retrieval usually means the user has to learn a query language, which some users do not find straightforward. An important issue for corpus management systems is the type of retrieval requests that a user is likely to make. A frequent use of corpora is for the statistical analysis and comparison of sub-corpora, so it is important for a corpus management system to let a user extract sub-corpora in an intuitive manner. The retrieval benefit of using an explicit taxonomy disappears completely if the texts required (a sub-corpus) occur in different parts of the taxonomy, which is the case when incomplete paths are specified as the retrieval criteria. With an attribute-based approach, sub-corpora can be easily retrieved. The aim of virtual corpus management is to provide the flexibility of the attribute-based approach, but with the intuitive functionality of the explicit taxonomy approach. This is achieved by allowing users to define a ‘virtual taxonomy’ for a corpus of texts, with any number of different virtual taxonomies being concurrently available over the same corpus. The term “virtual taxonomy” has been defined by Woods in the context of descriptions of concepts in knowledge representation systems such that whenever a system “constructs an explicit collection of concept nodes ... the result is a subgraph of the virtual taxonomy” (Woods, 1991:80). Woods’ motivation for viewing a collection of ‘descriptions’ this way is that “although its structure is important, one never wants to make it explicit in the memory of a computer” (Woods, ibid).
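The combination of attribute-based storage with taxonomy-style access can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any system discussed here: the attribute records, the identifiers `t1` to `t4`, and the functions `virtual_taxonomy` and `select` are all hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence

# Hypothetical attribute records; the texts themselves could be stored anywhere.
CORPUS: List[Dict[str, str]] = [
    {"id": "t1", "language": "EN", "region": "GB", "domain": "PH"},
    {"id": "t2", "language": "EN", "region": "US", "domain": "IT"},
    {"id": "t3", "language": "DE", "region": "DE", "domain": "AE"},
    {"id": "t4", "language": "EN", "region": "GB", "domain": "IT"},
]


def virtual_taxonomy(texts: List[Dict[str, str]], levels: Sequence[str]) -> Any:
    """Nest text ids under the attribute values named, in order, by `levels`."""
    if not levels:
        return sorted(t["id"] for t in texts)
    groups: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for t in texts:
        groups[t[levels[0]]].append(t)
    return {value: virtual_taxonomy(members, levels[1:])
            for value, members in sorted(groups.items())}


def select(texts: List[Dict[str, str]], **criteria: Any) -> List[str]:
    """Retrieve a sub-corpus from a partial specification of attributes.

    Each criterion is a single value or a collection of acceptable values,
    so an 'incomplete path' is just a query that names only some attributes.
    """
    def matches(t: Dict[str, str]) -> bool:
        for key, wanted in criteria.items():
            allowed = wanted if isinstance(wanted, (set, frozenset, list, tuple)) else {wanted}
            if t[key] not in allowed:
                return False
        return True
    return [t["id"] for t in texts if matches(t)]


# Two virtual taxonomies over the same four texts.
print(virtual_taxonomy(CORPUS, ["language", "domain"]))
print(virtual_taxonomy(CORPUS, ["domain", "language"]))
# A sub-corpus that would be scattered across the first taxonomy.
print(select(CORPUS, domain="IT"))
```

Neither call to `virtual_taxonomy` moves or copies a text; each merely regroups the same attribute records, which is the sense in which the taxonomy is virtual.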
The Virtual Corpus Manager within System Quirk has been implemented such that lexicographers and terminologists can view corpora on the basis of the ‘pragmatic attributes’ of the texts within a corpus. Viewing these pragmatic attributes at an abstract level, we have divided them into six categories: text, authorship, publication, language, domain and copyright status. The Virtual Corpus Manager is shown in Figure 7.

Figure 7: Virtual Corpus Manager main screen.

The Virtual Corpus Manager introduces a shift from the usual pre-defined and explicit corpus hierarchy approach, in that it allows the definition of virtual hierarchies. The Virtual Corpus Manager supports corpora that are coded as explicit taxonomies and corpora whose descriptions are attribute-based. It does this by allowing texts to be stored anywhere in a file system and by maintaining attributes describing the texts. Retrieval of the texts can then be made using the attributes directly, or by imposing a virtual hierarchy over the attributes. Earlier, in Figure 2, we showed the structure of the Collins-Birmingham corpus, which incorporates a static organisation of texts. The hierarchy has text type (including ‘Newspaper’, ‘Brochures’, ‘Book’, ‘Magazine’ and ‘Correspondence’) at the meta-level, and the terminal nodes of the hierarchical tree usually refer to ‘topics’. We argue that more than one profile of the same corpus of texts can be generated by implementing a virtual corpus hierarchy, for instance the corpus hierarchy shown in Figure 8, which is a variation of the corpus hierarchy shown in Figure 2.
Figure 8: Section of Collins-Birmingham Corpus. (Tree diagram; node labels: Book; Fiction, Non-fiction; extended texts, articles; Survey, Argument, Narrative; positional/hortatory, balanced; travelogue, biography.)

According to the above corpus hierarchy the non-fiction texts are first distinguished by the complementary attributes ‘extended texts’ and ‘articles’, and afterwards the original distinction between narrative, survey and argument texts is maintained. By modifying the original corpus hierarchy in this way the user can now retrieve all non-fiction extended texts or all non-fiction articles. The selection of texts can be further constrained by choosing among survey, argument and narrative texts, and so on. In a similar manner, texts in the same corpus can again be reorganised by implementing another corpus hierarchy (shown in Figure 9) where the non-fiction texts can further be distinguished on the basis of the text ‘topic’.

Figure 9: The ‘topic’ sub-corpus from the Collins-Birmingham Corpus. (Tree diagram; node labels: Book; Fiction, Non-fiction; extended texts, articles, topics; Survey, Argument, Narrative; positional/hortatory, balanced; travelogue, biography.)

Figure 10 shows a corpus hierarchy resulting from a combination of the corpus hierarchies shown in Figure 8 and Figure 9.

Figure 10: Sub-corpus based on merged hierarchies in Figure 8 and Figure 9. (Tree diagram; node labels: Book; Fiction, Non-fiction; extended texts, articles, topics; Survey, Argument, Narrative; positional/hortatory, balanced; travelogue, biography.)

Whilst setting the corpus hierarchy for browsing purposes the user is allowed to include only relevant values of a pragmatic attribute in the corpus hierarchy.
For instance, based on the organisation of texts shown in Figure 8, a typical corpus hierarchy including only the relevant values of a pragmatic attribute would be determined as shown in Figure 11.

Figure 11: Hierarchy when only positional/hortatory and biography attributes are explicitly selected from structure in Figure 8. (Tree diagram; node labels: Book; Fiction, Non-fiction; extended texts, articles; Survey, Argument, Narrative; positional/hortatory, balanced; travelogue, biography.)

Some examples of different virtual hierarchies of the same texts are illustrated in Figure 12. In a dynamic fashion the users can define their own text classification hierarchy or ‘corpus hierarchy’ from the list of pragmatic attributes, where each level of the hierarchy corresponds to one of these attributes. Additionally, the user is also allowed to include only relevant values of a pragmatic attribute in the corpus hierarchy (Figure 12b). This results in a corpus hierarchy that is specific to the users’ requirements, as opposed to a common defined hierarchy for all users. For instance, translators may like the top-most level to be ‘language’, whereas specialist text users may want a hierarchy that has ‘domain’ as the entry point in the corpus (Figure 12c); similarly, ‘origination date’ with a specification of a range of dates would be the text classification basis for diachronically oriented text research (Figure 12d).

Figure 12 a-d: AE refers to Automotive Engineering; IT to Information Technology; PH to Physics. (Four hierarchy diagrams; level labels: Top level, Language, Region, Domain, Text Type, Origination date; values shown: ES, DE, EN, AE, IT, PH, All Values.)

The navigation mechanism implemented in the Virtual Corpus Manager is novel and has three main advantages:
a) The navigation is based on a user-defined hierarchy, so various profiles of the corpora can be viewed by changing the corpus hierarchy (Figure 13a).
b) At each level more than one path can be selected concurrently, allowing sub-corpora to be browsed in parallel (Figure 13b). For example, just the American (US) and British (GB) variants of English could be selected.
c) At any level only known values for texts are available for determining the path. This ensures that the user cannot take a path that leads to a dead end. For instance, at the language level texts are classified into four languages, ‘English’, ‘German’, ‘Italian’ and ‘Spanish’; however, if there are no Italian texts in the corpus, this path would not be available when browsing down.

Figure 13 a-b. (Two diagrams; node labels: EN, US, GB, US Texts, GB Texts.)

The browser for navigating a virtual hierarchy is shown in Figure 14.

Figure 14: Virtual Corpus Browser with the virtual hierarchy defined in Figure 12.

The Virtual Corpus Manager provides a mechanism that allows the user to specify various constraints in a simple interactive manner, without recourse to a query language (Figure 15), and then retrieves all texts satisfying the user’s constraints. The collection of retrieved texts that satisfy these constraints could be envisaged as a ‘constrained’ corpus. We argue that the actual corpus containing all texts can be considered as the ‘mother corpus’, whereas the constrained corpus, which in fact partitions the corpus on the basis of certain user-defined constraints, can be regarded as the ‘daughter corpus’. Furthermore, our approach to corpus management incorporates the notion that texts in a corpus can be related to other texts, for example as ‘shadows’ (translations), annotations and so on.

Figure 15: Text selection by attribute query.

Conclusion

The discussion above covered the various exemplar corpora used extensively in corpus linguistics together with our views on corpus taxonomies.
We focused on how a corpus taxonomy can be made flexible such that each individual user of the corpus can impose his or her own structure on the corpus for the purposes of pursuing their own investigation. We believe that much of the debate on text typologies is descriptive and it is not possible to put a value on any one text typology: the notion of virtual taxonomies and associated implementations (like the Virtual Corpus Manager) will introduce some degree of objectivity in that one can evaluate the efficacy of one type of typology against another.

Bibliography

Ahmad, K., Davies, A., Hughes, M., Fulford, H., Rogers, M., and Thomas, P. (1990). A methodology for building multilingual termbases and special-purpose lexica. Translator’s Workbench Project, ESPRIT II, No. 2315, Report for Workpackage 1.1. Guildford: University of Surrey.
Ahmad, K., Fulford, H., Griffin, S., and Holmes-Higgin, P. (1991). ‘Text-based Knowledge Acquisition: A Language for Special Purposes Perspective’. In (Eds.) I.M. Graham and R.W. Milne. Research and Development in Expert Systems VIII. Cambridge: Cambridge University Press. pp148-162.
Ahmad, K., Fulford, H., Holmes-Higgin, P., Rogers, M., and Thomas, P. (1990). ‘The Translator’s Workbench Project’. In (Ed.) C. Picken. Translating and the Computer 11. London: Aslib.
Aijmer, K. and Altenberg, B. (Eds.) (1991). English Corpus Linguistics: Studies in Honour of Jan Svartvik. London: Longman Group.
Atkins, Sue, Clear, Jeremy and Ostler, Nicholas (1992). ‘Corpus Design Criteria’. Literary and Linguistic Computing, Vol. 7 (No. 1). pp1-16.
de Beaugrande, Robert and Dressler, Wolfgang (1981). Introduction to Text Linguistics. London and New York: Longman.
Biber, Douglas (1988). Variation across speech and writing. Cambridge: Cambridge University Press. (Citations are taken from the 1991 paperback edition.)
Biber, Douglas (1989). ‘A Typology of English Texts’. Linguistics, Vol. 27. pp3-43.
Church, Kenneth W., and Liberman, Mark Y. (1991). ‘A status report on the ACL/DCI’. In (Ed.) Linda M. Jones. pp84-91.
Crystal, David (1987). The Encyclopaedia of Language. Cambridge: Cambridge University Press.
Garside, R., Leech, G. and Sampson, G. (Eds.) (1987). The Computational Analysis of English. London: Longman Group.
Greenbaum, Sidney (1991). ‘The development of the International Corpus of English’. In (Eds.) Karin Aijmer and Bengt Altenberg. pp83-91.
Halliday, Michael A.K. (1975). Learning How to Learn. London: XXX
Halliday, Michael A.K. and Hasan, Ruqaiya (1976). Cohesion in English. London: Longman.
Holmes-Higgin, P. and Ahmad, K. (1992). ‘The Machine Assisted Terminology Elicitation Environment: Text and Data Processing and Management in Prolog’. Technical Report CS-9211. Dept. of Mathematical and Computing Sciences, University of Surrey, Guildford.
Holmes-Higgin, P., Griffin, S., Hook, S. and Abidi, S.R. (1993). ‘System Quirk Reference Guide’. Final Report for Workpackage 5.5, Multilex Project, ESPRIT II, No. 5304.
Ihalainen, Ossi (1991). ‘A point of verb syntax in South-Western British English: Analysis of a dialect continuum’. In (Eds.) Karin Aijmer and Bengt Altenberg. pp280-302.
Kytö, Merja and Rissanen, Matti (1988). ‘The Helsinki Corpus of English Texts: Classifying and Coding the Diachronic Patterns’. In (Eds.) Merja Kytö, Ossi Ihalainen and Matti Rissanen. pp169-179.
Kytö, Merja, Ihalainen, Ossi, and Rissanen, Matti (Eds.) (1988). Corpus Linguistics: Hard and Soft. (Proceedings of the 8th International Conference on English Language Research on Computerised Corpora.) Amsterdam: Rodopi.
Lacey, A.R. (1986). A Dictionary of Philosophy (Second Edition). London: Routledge & Kegan Paul.
Leech, G. (1991). ‘The state of the art in corpus linguistics’. In (Eds.) Karin Aijmer and Bengt Altenberg. pp8-29.
Maier, Ingrid (1994). ‘Uppsala Russian Corpus’.
(An electronic mail announcement from the University of Uppsala based on a summary in ‘A Frequency Dictionary of Modern Russian’ (1993). Acta Universitatis Upsaliensis, Studia Slavica Upsaliensia 32. pp185-188.)
Marinai, Elisabetta, Peters, Carol, and Picchi, Eugenio (1991). ‘Bilingual Reference Corpora: A System for Parallel Text Retrieval’. In (Ed.) Linda M. Jones. pp61-62.
Renouf, Antoinette (1987). ‘Corpus Development’. In (Ed.) John Sinclair (1987). pp1-40.
Rissanen, Matti (1991). ‘On the history of that/zero as object clause links in English’. In (Eds.) Karin Aijmer and Bengt Altenberg (1991). pp272-289.
Sager, Juan C., Dungworth, David, and McDonald, Peter F. (1980). English Special Languages: Principles and practice in science and technology. Wiesbaden: Oscar Brandstetter Verlag KG.
Savory, Theodore H. (1967). The Language of Science (2nd edition). London: Andre Deutsch Ltd.
Searle, John (1969). Speech Acts. Cambridge: Cambridge University Press.
Summers, D. (1991). ‘Longman/Lancaster English Language Corpus: Criteria and Design’. Unpublished manuscript.
Tompa, W.F. and Raymond, R.D. (1991). ‘Database Design for a Dynamic Dictionary’. In (Eds.) Susan Hockey and Nancy Ide. Research in Humanities Computing: Selected Papers from ALLC/ACH Conference, Toronto (June 1989). Oxford: Clarendon Press. pp257-272.
Woods, W.A. (1991). ‘Understanding Subsumption and Taxonomy: A Framework for Progress’. In (Ed.) John F. Sowa. Principles of Semantic Networks. Morgan Kaufmann Publishers, California. pp45-94.
Yang, Huizhong (1986). ‘A new technique for identifying scientific/technical terms and describing scientific texts’. An interim report in Literary and Linguistic Computing, Vol. 1 (No. 2). pp93-103.

No. | Attribute | Default | Other values
1 | Mode | Written | Spoken, Written-to-be-read, Spoken-to-be-written
2* | Text Origin | Unclassified | Single, several, joint, corporate, ..........
3 | Participation | 1-person | many
4 | Constitution | Single |
5 | Preparedness | Prepared | from-notes
6 | Medium | Print | Book, periodical, ephemera
7 | Style | Prose | Verse, rhyme
8 | Genre | | Novel, short story, play, poem, essay, letter; Advertisement, regulations/law, article, announcement
9 | Factuality | | Fact, Fiction, Faction*
10 | Setting | Unclassified | Home, education, work, personal, public affairs
11 | Function | Unmarked | Narrative, informative, expository, hortatory/persuasive, regulatory/instructional, reflective, entertaining
12 | Topic | General | Science, the arts, belief and values, the environment
13 | Technicality | General | Technical, semitechnical
14 | Date | Date of publications | Revised edition
15* | Text Status | Original/Reprint | Updated, revised
16 | Language | |
17* | Language links | Single | Parallel
18* | Language status | Source | Translation: Multiple release
19 | Authorship | |
20 | Name | |
21 | Gender | |
22 | Age | |
23 | Region | |
24* | First Language | |
25 | Nationality | |

Not important for terminology

Cluster | ‘Prototype label’ and Exemplars in LOB | Dimensional Characteristics | Proximate Prototype and Exemplars in LOB
1 | Intimate, interpersonal interaction (face to face interviews, phone conversations, friends) | Situated, non-abstract, very involved | Informational interaction (face to face conversations, interviews, phone conversations)
2 | Informational interaction | Situated, non-abstract but less involved | Intimate, interpersonal
3 | ‘Scientific’ exposition (Academic prose, official documents) | ‘Extremely’ informational, highly elaborated, non-narrative, non-persuasive, extremely abstract | Learned exposition (Academic prose, official documents, press reportage and review, professional letters)
4 | ‘Learned’ exposition | Like scientific exposition but only moderately abstract | Scientific exposition
5 | Imaginative narrative (Fiction: romance, general, mystery, adventure, science-fiction; prepared speech) | Extremely narrative, moderately involved, situated, non-abstract | Involved persuasion (interviews, spontaneous speech, popular lore, professional letters)
6 | General narrative exposition (Press reportage and editorials, general fiction, biography, humour) | Informational, narrative | ‘Learned exposition’
7 | Situated Reportage (sports broadcasts, science fiction) | Situated, non-narrative, non-persuasive and non-abstract | Imaginative narrative
8 | Involved persuasion (Interviews, popular lore, letters, academic prose) | Extremely persuasive, moderately involved, non-narrative and elaborated | Imaginative narrative

Biber (1989: 17-19)