Corpus design and types of corpora
Corpus Linguistics
Richard Xiao
lancsxiaoz@googlemail.com

Outline of the session
• Corpus design issues
  – Corpus representativeness
  – Corpus balance
  – Sampling
  – Corpus size
  – Types of corpora
• Introducing some well-known English corpora of different types

Representativeness
• A corpus is a collection of (1) machine-readable (2) authentic texts (including transcripts of spoken data) which is (3) sampled to be (4) representative of a particular language or language variety
• A corpus is different from a random collection of texts or an archive
• Representativeness is a defining feature of a corpus
• As language is infinite but a corpus has to be finite in size, we sample and proportionally include a wide range of text types to ensure maximum balance and representativeness

Some definitions …
• “generally assembled with particular purposes in mind, and are often assembled to be (informally speaking) representative of some language or text type” (Leech 1992: 116)
• “…selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language” (Sinclair 1996)
• “A well-organized collection of data” (McEnery 2003)
• “gathered according to explicit design criteria” (Tognini-Bonelli 2001: 2)
• “built according to explicit design criteria for a specific purpose” (Atkins et al. 1992)
• texts selected and put together “in a principled way” (Johansson 1998: 3)

What is representativeness?
• “A corpus is thought to be representative of the language variety it is supposed to represent if the findings based on its contents can be generalized to the said language variety” (Leech 1991)
• Representativeness refers to the extent to which a sample includes the full range of variability in a population (Biber 1993)

What is representativeness?
• Representativeness is a fluid concept closely related to your research questions
  – If you want a corpus which is representative of general English, a corpus representative of newspapers will not do
  – If you want a corpus representative of newspapers, a corpus representative of The Times will not do

Two types of representativeness
• The representativeness of general corpora and of (domain- or genre-specific) specialized corpora is achieved and measured in different ways
  – General corpora
    • Balance: the range of genres included in a corpus and their proportions
    • Sampling: how the text chunks for each genre are selected
  – Specialized corpora
    • Degree of closure/saturation: closure/saturation for a particular linguistic feature (e.g. size of lexicon) of a variety of language (e.g. computer manuals) means that the feature appears to be finite or is subject to very limited variation beyond a certain point, i.e. the curve of lexical growth flattens out

Why should we care about representativeness?
• Reader of corpus-based studies (assessment)
  – To interpret the results of corpus research with caution, considering whether the corpus data and the method used in the study were appropriate
• Corpus user (assessment)
  – Important to “know your corpus”
  – To decide whether a given corpus is appropriate for their specific research question
  – To make appropriate claims on the basis of such a corpus
• Corpus creator (assessment?)
  – To make their corpus as representative as possible of the language (variety) it is claimed to represent
  – To document design criteria explicitly and make the documentation available to corpus users

Criteria for text selection
• The criteria used to select texts for a corpus are principally external
  – The distinction between external and internal criteria corresponds to Biber’s (1993: 243) situational vs. linguistic perspectives
    • External criteria are defined situationally, irrespective of the distribution of linguistic features
    • Internal criteria are defined linguistically, taking into account the distribution of such features
• It is circular to use internal criteria like the distribution of words or grammatical features as the primary parameters for the selection of corpus data
  – If the distribution of linguistic features is pre-determined when the corpus is designed, there is no point in analyzing such a corpus to discover naturally occurring linguistic feature distributions
  – The corpus is problematic as it is skewed by design

Criteria for text selection
• Time?
  – If a corpus is not regularly updated, it rapidly becomes unrepresentative (Hunston 2002)
• The relevance of permanence in corpus design depends on whether we view a corpus as a static or a dynamic language model
  – Static model: sample corpora (nearly all existing corpora, e.g. BNC, LOB/FLOB)
  – Dynamic model: monitor corpora (e.g. Bank of English)

Criteria for text selection
• Tips
  – “Criteria for determining the structure of a corpus should be small in number, clearly separate from each other, and efficient as a group in delineating a corpus that is representative of the language or variety under examination.” (Sinclair 2005)

Corpus balance
• A balanced corpus covers a wide range of text categories which are supposed to be representative of the language (variety) under consideration
• The proportions of different kinds of text it contains should correspond with informed and intuitive judgements
• There is no scientific measure for balance, just best estimation
• The acceptable balance is determined by the intended use, i.e. your research questions

The BNC model
• Generally accepted as being a balanced corpus
• Has been followed in the construction of a number of corpora
• 4,124 texts (including transcripts of recordings)
• ca. 100 million words: 90% Written + 10% Spoken
• Three criteria for Written
  – Domain: the content type (i.e. subject field)
  – Time: the period of text production
  – Medium: the type of text publication (books, periodicals etc.)
• Two criteria for Spoken
  – Demographic: informal conversations by speakers selected by age group, sex, social class and geographical region
  – Context-governed: formal encounters such as meetings, lectures and radio broadcasts recorded in 4 broad context categories

[Slide: Written BNC (content not preserved)]

[Slide: Spoken BNC (content not preserved)]

BNC vs. balance
• The design criteria of the BNC illustrate the notion of corpus balance/representativeness very well
  – “In selecting texts for inclusion in the corpus, account was taken of both production, by sampling a wide variety of distinct types of material, and reception, by selecting instances of those types which have a wide distribution. Thus, having chosen to sample such things as popular novels, or technical writing, bestseller lists and library circulation statistics were consulted to select particular examples of them.” (Aston and Burnard 1998: 28)

Pragmatics in corpus design
• “Most general corpora of today are badly balanced because they do not have nearly enough spoken language in them; estimates of the optimal proportion of spoken language range from 50% - the neutral option - to 90%, following a guess that most people experience many times as much speech as writing” (Sinclair 2005)
• The written BNC is nine times as large as the spoken BNC
  – Is speech less frequent or less important than writing?

Pragmatics in corpus design
• Absolutely not!
• …but writing typically has a larger audience than speech
• …also, collecting spoken data costs 10 times as much as collecting written data
• …it takes 10 hours to transcribe one hour of recording
• Pragmatic considerations also mean that balance is a more important issue for a static sample corpus than for a dynamic monitor corpus
  – As a monitor corpus is frequently updated, it is usually “impossible to maintain a corpus that also includes text of many different types, as some of them are just too expensive or time consuming to collect on a regular basis.” (Hunston 2002: 30-31)

Corpus balance: Some tips
• “The corpus builder should retain, as target notions, representativeness and balance. While these are not precisely definable and attainable goals, they must be used to guide the design of a corpus and the selection of its components.” (Sinclair 2005)
• “It would be short-sighted indeed to wait until one can scientifically balance a corpus before starting to use one, and hasty to dismiss the results of corpus analysis as ‘unreliable’ or ‘irrelevant’ because the corpus used cannot be proved to be ‘balanced’.” (Atkins et al. 1992: 6)

Sampling in corpus creation
• Language is infinite, but a corpus is finite in size, so sampling is inescapable in corpus building
  – “Some of the first considerations in constructing a corpus concern the overall design: for example, the kinds of texts included, the number of texts, the selection of particular texts, the selection of text samples from within texts, and the length of text samples. Each of these involves a sampling decision, either conscious or not.” (Biber 1993)
• Population (language/variety) vs. sample (corpus)
  – The aim of sampling “is to secure a sample which, subject to limitations of size, will reproduce the characteristics of the population, especially those of immediate interest, as closely as possible” (Yates 1965: 9)
  – A sample is a scaled-down version of a larger population
  – A sample is representative if what we find for the sample also holds for the general population
• Corpus representativeness and balance rely heavily on sampling
  – A corpus is a sample of a given population (language or language variety)

Sampling in corpus creation
• Sampling unit
  – For written text, it could be a book (chapter), periodical or newspaper (article)
• Sampling frame
  – A list of sampling units
• Population
  – The language or language variety under consideration
  – The assembly of all sampling units, which can be defined in terms of
    • Language production (demographic: speakers and writers)
    • Language reception (demographic: audience and readers)
    • Language as a product (registers and genres)

Examples of Brown and LOB
• Brown
  – Population: Written English text published in the United States in 1961
  – Sampling frame: A list of the collection of books and periodicals in the Brown University Library and the Providence Athenaeum
  – Sampling unit: each book/periodical within the sampling frame
• LOB
  – Population: Written English text published in the UK around 1961
  – Sampling frame: The British National Bibliography Cumulated Subject Index 1960–1964 (for books) and Willing’s Press Guide 1961 (for periodicals)
  – Sampling unit: each book/periodical within the sampling frame

Sampling techniques
• Simple random sampling
  – All sampling units within the sampling frame are numbered and the sample is chosen by use of a table of random numbers
    • The chance of an item being selected correlates positively with its frequency in the population, so rare features may not be included
• Stratified random sampling
  – The population is divided into relatively homogeneous groups (i.e. strata), and each stratum is then sampled at random
    • Never less representative than simple random sampling

Stratified random sampling
• The whole population for the Brown/LOB corpus was divided into 15 text categories and samples were then drawn from each category at random
• In demographic sampling for collecting spoken data, individuals (sampling units) in the population are first divided into different groups on the basis of demographic variables such as speaker/writer age, sex and social class, and then samples are taken at random from each group

Size of samples
• Full texts or text segments?
  – “Samples of language for a corpus should wherever possible consist of entire documents or transcriptions of complete speech events” (Sinclair 2005)
    • Good for studying textual organization
  – A full-text corpus may be inappropriate or problematic
    • Peculiarity of an individual style or topic may occasionally show through
    • There are copyright issues in including full texts
• Frequent linguistic features are quite stable in their distributions and hence short text chunks (e.g. 2,000 running words) are usually sufficient
• Text initial, middle or end chunks?
  – Text initial, middle, and end samples must be taken in a balanced way

Proportion of samples
• In stratified random sampling, how many samples should be taken for each category?
  – The numbers of samples across text categories should be proportional to their frequencies and/or weights in the target population in order for the resulting corpus to be considered representative
  – Difficult to determine objectively; in practice a well-informed and intuitive guess

Proportion of genres in Brown
• Constant sample size: ca. 2,000 words

“Relatively speaking…”
• Any claim of corpus representativeness and balance must be interpreted in relative terms
  – There is no objective way to balance a corpus or to measure its representativeness
  – Any claim for representativeness is an act of faith rather than a statement of fact
• Corpus balance and representativeness are fluid concepts
  – The research question that one has in mind when building/choosing a corpus determines what an acceptable balance is for the corpus one should use and whether it is suitably representative
• Corpus balance is also influenced by practical considerations
  – How easily can data of different types be collected?

Corpus size
• How large should a corpus be?
  – There is no easy answer to this question.
    • Krishnamurthy (2001): “Size matters.”
    • Leech (1991): “Size is not all-important.”
• The size of the corpus needed depends upon the purpose for which it is intended as well as a number of practical considerations
  – The kind of query that is anticipated from users
    • Are you studying common or rare linguistic features?
  – The methodology they use to study the data
    • How much work can be done by the machine and how much has to be done by hand?
  – For corpus creators, also the source of data
    • Are the data in electronic form readily available at a reasonable cost?
    • Can copyright permissions be granted easily, if at all?

Corpus size
• Corpus size increases with the development of technology
  – 1960s-70s
    • Brown and LOB: one million words
  – 1980s
    • The Birmingham/Cobuild corpora: 20 M words
  – 1990s
    • The British National Corpus: 100 M words
  – Early 21st century
    • The Bank of English: 645 M words

Corpus size
• Is a large corpus really what you want?
  – The size of the corpus needed to explore a research question depends on the frequency and distribution of the linguistic features under consideration, i.e. on your research question
    • Corpora for lexical studies are usually much larger than those for grammatical studies
  – Specialized corpora serve a very different yet important purpose from large multi-million-word corpora
  – Corpora that need extensive manual annotation or analysis are necessarily small
  – Many corpus tools set a ceiling on the number of concordances that can be extracted
• The optimum size of a corpus is determined by the research question the corpus is intended to address as well as practical considerations

Exploring existing English corpora
• To learn how corpora can be classified
• To learn about design decisions in creating different kinds of corpora
• To become familiar with a range of well-known and influential corpora
• Corpus survey: “Well-known and influential corpora”

Types of corpora, different uses
• General/reference vs. specialized corpora
• Written vs. spoken corpora
• Synchronic vs. diachronic corpora
• Monolingual vs. multilingual corpora
• Comparable vs. parallel corpora
• Native vs. learner corpora
• Developmental vs. learner/interlanguage corpora
• Raw vs. annotated corpora
• Static/sample vs. dynamic/monitor corpora
• …

Monitor corpora
• Constantly updated and growing in size
  – Much larger corpus size
  – Often contain full text
  – Always up-to-date
  – Often only admit new material which has new features not already present in the corpus
  – Used to track changes across different periods of time
  – Monitor corpora could be a series of static corpora
• Disadvantages
  – No attempt to balance the corpus
  – Text availability can become an issue (e.g. copyright)
  – Confusing to indicate a specific corpus version (token number)
  – Cannot easily compare results obtained from corpora of different sizes

Some well-known English corpora
• The British National Corpus (BNC)
• The Bank of English (BoE)
• BYU American English corpus (COCA)
• Corpora of the Brown family (Brown, LOB, FLOB, Frown)
• ICE corpora (GB, EA, HK, Singapore, Philippines, New Zealand etc.)
• London-Lund corpus of spoken English
• SBCSAE
• The Helsinki Diachronic Corpus of English Texts (8th-18th century, ca. 1.5 million words)
• The International Corpus of Learner English (ICLE)
• MICASE

The BNC
• First and best-known national corpus (sample corpus)
• 100 M word balanced corpus of written (90%) and spoken (10%) British English in current use
• 1960 to the early 1990s (1960-1974, 1975-1984, 1985-1993)
• Rich metadata encoded for language variation studies
• POS tagged
• Accessing the BNC
  – BYU-BNC: http://corpus.byu.edu/bnc/
  – BNC Online: http://www.natcorp.ox.ac.uk/getting/index.xml.ID=order_online
  – Lancaster BNCWeb CQP edition: http://bncweb.lancs.ac.uk/bncwebSignup/user/login.php
  – BNC Baby: http://www.natcorp.ox.ac.uk/corpus/baby/index.html
  – Sketch Engine: http://www.sketchengine.co.uk/
  – BNC PIE: http://pie.usna.edu/

The BoE
• Best-known monitor corpus
• 645 M words (and still growing) of present-day English
• 75% written and 25% spoken
• 70% BrE, 20% AmE and 10% other English varieties
• Particularly useful for lexical and lexicographic studies, e.g. tracking new words, new uses or meanings of old words, and words falling out of use
• Access to the BoE
  – A 56 M word sampler: http://www.collins.co.uk/books.aspx?group=153

Corpus of Contemporary American English (COCA)
• 385+ M words of American English
• 20 M words per year for 1990-2008
• Equally divided among spoken, fiction, popular magazines, newspapers, and academic texts
• Updated every 6-9 months
• Useful for studying variation across genres and over time
• Free online access
  – http://www.americancorpus.org/

Corpora of the Brown family
• Brown: Written AmE in 1961
• LOB: Written BrE in 1961
• FLOB: Written BrE in 1991
• Frown: Written AmE in 1991
• Common corpus design
  – One M words each
  – 500 samples (ca. 2,000 words each)
  – Same proportions from the same 15 text categories
• Useful for synchronic and diachronic comparison of BrE and AmE
• Further information
  – ICAME CD: http://khnt.hit.uib.no/icame/manuals/
  – Extended Brown family: http://cqpweb.lancs.ac.uk (an access account must be applied for)

The ICE corpora
• 20 balanced corpora of one M words each
  – E.g. Britain, Ireland, US, Canada, Hong Kong, Singapore, India, the Philippines, East Africa
• Common corpus design
  – 500 samples (ca. 2,000 words each)
  – 60% spoken + 40% written
  – 12 genres
  – 1990-1994
• Designed for the synchronic study of “world Englishes”
• More information
  – http://www.ucl.ac.uk/english-usage/ice/

The London-Lund Corpus
• First electronic corpus of spontaneous language
• A corpus of spoken British English recorded between 1953 and 1987
• 100 texts, each of 5,000 words, totaling half a million running words
• Both dialogue (e.g. face-to-face conversations, telephone conversations, and public discussion) and monologue (both spontaneous and prepared)
• Speaker information (gender, age, occupation)
• Annotated with prosodic information
• Further information
  – http://khnt.hit.uib.no/icame/manuals/

SBCSAE (Santa Barbara Corpus of Spoken American English)
• Based on hundreds of recordings of spontaneous speech from all over the United States
• Representing a wide variety of people of different regional origins, ages, occupations, and ethnic and social backgrounds
• Each of the 60 transcripts is time stamped and accompanied by a digital audio file
• Free download
  – http://talkbank.org/data/CABank/SBCSAE.zip

Helsinki Corpus of English Texts
• Best-known historical corpus
• 1.5 million words of English in 400 text samples dating from the 8th to the 18th century
• Divided into three periods (Old, Middle, and Early Modern English) and 11 sub-periods
• Socio-historical variation and a wide range of text types for each specific period
• Allows researchers to go beyond simply dating and reporting language change by combining diachronic, sociolinguistic and genre studies
• Further information
  – Oxford Text Archive: http://ota.oucs.ox.ac.uk/headers/1477.xml

The ICLE corpus
• First and best-known learner English corpus
• Comprising argumentative essays written by advanced learners of English (i.e. university students of English as a foreign language (EFL) in their 3rd or 4th year of study)
• Over 2.5 million words in 3,640 texts ranging from 500 to 1,000 words in length
• 11 L1 backgrounds and still expanding with 8 additional L1s
• Useful in investigating the interlanguage of foreign language learners
• Further information: http://cecl.fltr.ucl.ac.be/

MICASE (Michigan Corpus of Academic Spoken English)
• ca. 1.8 M words in 152 transcripts of nearly 200 hours of recordings of 1,571 speakers
• Focusing on contemporary university speech within the domain of the University of Michigan
• Encoded with speaker information (age, academic role, language status)
• Free online search or transcript download
  – http://quod.lib.umich.edu/m/micase/
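
Appendix: stratified random sampling, a sketch
Returning to the slides on sampling techniques and proportion of samples: the Python sketch below shows one way stratified random sampling with proportional allocation could be coded. It is an illustration under stated assumptions, not the procedure actually used for Brown, LOB or the BNC. The function names (allocate_samples, stratified_sample), the three strata and their integer weights are all hypothetical, and the leftover samples are distributed by a simple largest-remainder rule chosen only for the example.

```python
import random
from collections import defaultdict


def allocate_samples(weights, total):
    """Split `total` samples across strata in proportion to integer weights.

    Uses largest-remainder rounding so the counts always add up to `total`.
    """
    weight_sum = sum(weights.values())
    counts = {}
    remainders = {}
    for stratum, weight in weights.items():
        # Integer arithmetic keeps the allocation exact.
        counts[stratum], remainders[stratum] = divmod(total * weight, weight_sum)
    leftover = total - sum(counts.values())
    # Give the remaining samples to the strata with the largest remainders.
    ranked = sorted(remainders, key=remainders.get, reverse=True)
    for stratum in ranked[:leftover]:
        counts[stratum] += 1
    return counts


def stratified_sample(frame, weights, total, seed=0):
    """Draw a stratified random sample from a sampling frame.

    `frame` is a list of (unit_id, stratum) pairs, i.e. the list of
    sampling units, each labelled with its text category.
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    units_by_stratum = defaultdict(list)
    for unit_id, stratum in frame:
        units_by_stratum[stratum].append(unit_id)

    sample = {}
    for stratum, n in allocate_samples(weights, total).items():
        units = units_by_stratum[stratum]
        if n > len(units):
            raise ValueError(f"stratum {stratum!r} has only {len(units)} units")
        # Simple random sampling without replacement inside each stratum.
        sample[stratum] = rng.sample(units, n)
    return sample


# Toy sampling frame: 50 press, 30 fiction and 20 learned texts (made up).
toy_weights = {"press": 5, "fiction": 3, "learned": 2}
toy_frame = (
    [(f"press_{i}", "press") for i in range(50)]
    + [(f"fiction_{i}", "fiction") for i in range(30)]
    + [(f"learned_{i}", "learned") for i in range(20)]
)

toy_sample = stratified_sample(toy_frame, toy_weights, total=10, seed=42)
print({stratum: len(units) for stratum, units in toy_sample.items()})
# prints {'press': 5, 'fiction': 3, 'learned': 2}
```

Because every stratum is guaranteed its share of the sample, a small category such as "learned" cannot be missed by chance, which is the advantage over simple random sampling noted on the sampling techniques slide. How the weights themselves are set remains the informed, intuitive judgement discussed under corpus balance.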