Corpus design and types of corpora

Corpus Linguistics
Richard Xiao
Outline of the session
• Corpus design issues
– Corpus representativeness
– Corpus balance
– Sampling
– Corpus size
– Types of corpora
• Introducing some well-known English
corpora of different types
Representativeness
• A corpus is a collection of (1) machine-readable
(2) authentic texts (including transcripts of
spoken data) which is (3) sampled to be (4)
representative of a particular language or
language variety
• A corpus is different from a random collection of
texts or an archive
• Representativeness is a defining feature of a
corpus
• As language is infinite but a corpus has to be
finite in size, we sample and proportionally
include a wide range of text types to ensure
maximum balance and representativeness
Some definitions …
• “generally assembled with particular purposes in mind,
and are often assembled to be (informally speaking)
representative of some language or text type” (Leech
1992: 116)
• “…selected and ordered according to explicit linguistic
criteria in order to be used as a sample of the language”
(Sinclair 1996)
• “A well-organized collection of data” (McEnery 2003)
• “gathered according to explicit design criteria” (Tognini-Bonelli 2001: 2)
• “built according to explicit design criteria for a specific
purpose” (Atkins et al 1992)
• texts selected and put together “in a principled way”
(Johansson 1998: 3)
What is representativeness?
• “A corpus is thought to be representative
of the language variety it is supposed to
represent if the findings based on its
contents can be generalized to the said
language variety” (Leech 1991)
• Representativeness refers to the extent to
which a sample includes the full range of
variability in a population (Biber 1993)
What is representativeness?
• Representativeness is a fluid concept
closely related to your research questions
– If you want a corpus which is representative
of general English, a corpus representative of
newspapers will not do
– If you want a corpus representative of
newspapers, a corpus representative of The
Times will not do
Two types of representativeness
• The representativeness of general corpora and
(domain- or genre-specific) specialized corpora
is achieved and measured in different ways
– General corpora
• Balance: The range of genres included in a corpus and their
proportion
• Sampling: How the text chunks for each genre are selected
– Specialized corpora
• Degree of closure/saturation: Closure/saturation for a
particular linguistic feature (e.g. size of lexicon) of a variety
of language (e.g. computer manuals) means that the
feature appears to be finite or is subject to very limited
variation beyond a certain point, i.e. the curve of lexical
growth is flattening out
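The idea of closure can be made concrete by plotting how many distinct word types have been seen as more tokens are read. The following is only a minimal sketch: the function names, the `step` parameter, and the toy text are assumptions for illustration, and tokenization is reduced to whitespace splitting.

```python
def lexical_growth(tokens, step=1000):
    """Return (tokens_seen, types_seen) pairs recorded every `step` tokens."""
    seen = set()
    curve = []
    for i, token in enumerate(tokens, start=1):
        seen.add(token.lower())
        # Record a point at every step and once more at the very end.
        if i % step == 0 or i == len(tokens):
            curve.append((i, len(seen)))
    return curve


def new_types_per_step(curve):
    """Number of new types added between consecutive points on the curve."""
    previous = 0
    gains = []
    for _, types_seen in curve:
        gains.append(types_seen - previous)
        previous = types_seen
    return gains


# Toy data standing in for a specialized variety with a very small vocabulary.
toy_tokens = ("press the power button to restart the printer " * 50).split()
curve = lexical_growth(toy_tokens, step=100)
print(curve)                      # [(100, 7), (200, 7), (300, 7), (400, 7)]
print(new_types_per_step(curve))  # [7, 0, 0, 0]
```

When the gains stay at or near zero, the curve has flattened out for that feature. A real study would use a proper tokenizer and far more data.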
Why should we care about
representativeness?
• Reader of corpus-based studies (assessment)
– To interpret the results of corpus research with caution,
considering whether the corpus data and the method used in the
study were appropriate
• Corpus user (assessment)
– Important to “know your corpus”
– To decide whether a given corpus is appropriate for their specific
research question
– To make appropriate claims on the basis of such a corpus
• Corpus creator (assessment?)
– To make their corpus as representative as possible of the
language (variety) it claims to represent
– To document design criteria explicitly and make the
documentation available to corpus users
Criteria for text selection
• The criteria used to select texts for a corpus are
principally external
– The distinction between external and internal criteria corresponds to
Biber’s (1993: 243) situational vs. linguistic perspectives
• External criteria are defined situationally irrespective of the
distribution of linguistic features
• Internal criteria are defined linguistically, taking into account the
distribution of such features
• It is circular to use internal criteria like the distribution of
words or grammatical features as the primary
parameters for the selection of corpus data
– If the distribution of linguistic features is pre-determined when
the corpus is designed, there is no point in analyzing such a
corpus to discover naturally occurring linguistic feature
distributions
– The corpus is problematic as it is skewed by design
Criteria for text selection
• Time?
– If a corpus is not regularly updated, it rapidly
becomes unrepresentative (Hunston 2002)
• The relevance of permanence in corpus
design actually depends on how we view a
corpus - a static or dynamic language
model
– Static model: sample corpora (nearly all
existing corpora, BNC, LOB/FLOB)
– Dynamic model: monitor corpora (e.g. Bank of
English)
Criteria for text selection
• Tips
– “Criteria for determining the structure of a
corpus should be small in number, clearly
separate from each other, and efficient as a
group in delineating a corpus that is
representative of the language or variety
under examination.” (Sinclair 2005)
Corpus balance
• A balanced corpus covers a wide range of text
categories which are supposed to be
representative of the language (variety) under
consideration
• The proportions of different kinds of text it
contains should correspond with informed and
intuitive judgements
• There is no scientific measure for balance – just
best estimation
• The acceptable balance is determined by the
intended use – your research questions
The BNC model
• Generally accepted as being a balanced corpus
• Has been followed in the construction of a number of
corpora
• 4,124 texts (including transcripts of recordings)
• ca. 100 million words: 90% Written + 10% Spoken
• Three criteria for Written
– Domain: the content type (i.e. subject field)
– Time: the period of text production
– Medium: the type of text publication (books, periodicals, etc.)
• Two criteria for Spoken
– Demographic: informal conversations by speakers selected by
age group, sex, social class and geographical region
– Context-governed: formal encounters such as meetings, lectures
and radio broadcasts recorded in 4 broad context categories
Written BNC
Spoken BNC
BNC vs. balance
• The design criteria of the BNC illustrate the
notion of corpus balance/representativeness
very well
– “In selecting texts for inclusion in the corpus, account
was taken of both production, by sampling a wide
variety of distinct types of material, and reception, by
selecting instances of those types which have a wide
distribution. Thus, having chosen to sample such
things as popular novels, or technical writing, bestseller lists and library circulation statistics were
consulted to select particular examples of them.”
(Aston and Burnard 1998: 28)
Pragmatics in corpus design
• “Most general corpora of today are badly
balanced because they do not have nearly
enough spoken language in them; estimates of
the optimal proportion of spoken language range
from 50% - the neutral option - to 90%, following
a guess that most people experience many
times as much speech as writing” (Sinclair 2005)
• The written BNC is nine times as large as the
spoken BNC
– Is speech less frequent or important than writing?
Pragmatics in corpus design
• Absolutely not!
• …but writing typically has a larger audience than speech
• …also, collecting spoken data costs 10 times as much
as collecting written data
• …it takes 10 hours to transcribe one hour of recording
• Pragmatic considerations also mean that balance is a
more important issue for a static sample corpus than for
a dynamic monitor corpus
– As a monitor corpus is frequently updated, it is usually
“impossible to maintain a corpus that also includes text of many
different types, as some of them are just too expensive or time
consuming to collect on a regular basis.” (Hunston 2002: 30-31)
Corpus balance: Some tips
• “The corpus builder should retain, as target
notions, representativeness and balance. While
these are not precisely definable and attainable
goals, they must be used to guide the design of
a corpus and the selection of its components.”
(Sinclair 2005)
• “It would be short-sighted indeed to wait until
one can scientifically balance a corpus before
starting to use one, and hasty to dismiss the
results of corpus analysis as ‘unreliable’ or
‘irrelevant’ because the corpus used cannot be
proved to be ‘balanced’.” (Atkins et al 1992: 6)
Sampling in corpus creation
• Language is infinite, but a corpus is finite in size, so sampling is
inescapable in corpus building
– “Some of the first considerations in constructing a corpus concern the
overall design: for example, the kinds of texts included, the number of
texts, the selection of particular texts, the selection of text samples from
within texts, and the length of text samples. Each of these involves a
sampling decision, either conscious or not.” (Biber 1993)
• Population (language/variety) vs. sample (corpus)
– The aim of sampling “is to secure a sample which, subject to limitations
of size, will reproduce the characteristics of the population, especially
those of immediate interest, as closely as possible” (Yates 1965: 9)
– A sample is a scaled-down version of a larger population
– A sample is representative if what we find for the sample also holds for
the general population
• Corpus representativeness and balance rely heavily on sampling
– A corpus is a sample of a given population (language or language
variety)
Sampling in corpus creation
• Sampling unit
– For written text, it could be a book (chapter), periodical or
newspaper (article)
• Sampling frame
– A list of sampling units
• Population
– Languages, language, or language variety under consideration
– The assembly of all sampling units, which can be defined in
terms of
• Language production (demographic: speakers and writers)
• Language reception (demographic: audience and readers)
• Language as a product (registers and genres)
Examples of Brown and LOB
• Brown
– Population: Written
English text published in
the United States in 1961
– Sampling frame: A list of
the collection of books and
periodicals in the Brown
University Library and the
Providence Athenaeum
– Sampling unit: each
book/periodical within the
sampling frame
• LOB
– Population: Written
English text published in
the UK around 1961
– Sampling frame: The
British National
Bibliography Cumulated
Subject Index 1960–1964
(for books) and Willing’s
Press Guide 1961 (for
periodicals)
– Sampling unit: each
book/periodical within the
sampling frame
Sampling techniques
• Simple random sampling
– All sampling units within the sampling frame are
numbered and the sample is chosen by use of a table
of random numbers
• The chance of a feature being sampled correlates positively with
its frequency in the population, so rare features may not be included
• Stratified random sampling
– The population is divided into relatively homogeneous
groups (i.e. strata), and each stratum is then sampled
at random
• Never less representative than simple random sampling
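The two techniques can be contrasted in a short sketch. Everything named here is an assumption for illustration: the sampling frame is a hypothetical list of (title, category) pairs, and Python's `random` module stands in for a table of random numbers.

```python
import random


def simple_random_sample(frame, k, seed=0):
    """Draw k sampling units from the whole frame, ignoring categories."""
    rng = random.Random(seed)
    return rng.sample(frame, k)


def stratified_random_sample(frame, stratum_of, k_per_stratum, seed=0):
    """Group units into strata, then sample at random within each stratum."""
    rng = random.Random(seed)
    strata = {}
    for unit in frame:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = {}
    for name, units in strata.items():
        # Never ask for more units than a stratum actually contains.
        k = min(k_per_stratum, len(units))
        sample[name] = rng.sample(units, k)
    return sample


# Hypothetical frame: 90 press titles and only 10 science fiction titles.
frame = [(f"news_{i}", "press") for i in range(90)]
frame += [(f"scifi_{i}", "science fiction") for i in range(10)]

simple = simple_random_sample(frame, 10)
stratified = stratified_random_sample(frame, lambda unit: unit[1], 5)

# A simple random sample may contain few or no science fiction titles;
# the stratified sample is guaranteed to contain 5 from each category.
print(sum(1 for _, category in simple if category == "science fiction"))
print({name: len(units) for name, units in stratified.items()})
```

With a rare category, the simple random sample depends on luck, whereas stratification fixes the coverage of each category in advance.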
Stratified random sampling
• The whole population for the Brown/LOB corpus
is divided into 15 text categories, and
samples are then drawn from each category at
random
• In demographic sampling for collecting spoken
data, individuals (sampling units) in the
population are first divided into different groups
on the basis of demographic variables such as
speaker/writer age, sex and social class, and
then samples are taken at random from each
group
Size of samples
• Full texts or text segments?
– “Samples of language for a corpus should wherever
possible consist of entire documents or transcriptions
of complete speech events” (Sinclair 2005)
• Good for studying textual organization
– A full-text corpus may be inappropriate or
problematic
• Peculiarity of an individual style or topic may occasionally
show through
• There are copyright issues in including full texts
• Frequent linguistic features are quite stable in their
distributions and hence short text chunks (e.g. 2,000 running
words) are usually sufficient
• Text initial, middle or end chunks?
– Text initial, middle, and end samples must be taken in
a balanced way
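One way to picture balanced chunking is to rotate the extraction position across texts. This is a sketch under stated assumptions: `extract_chunk` and `assign_positions` are hypothetical helpers, texts are already tokenized into lists, and chunks are cut at exact token offsets (real corpora usually end a sample at a sentence boundary).

```python
def extract_chunk(tokens, position, size=2000):
    """Return a chunk of about `size` tokens from the start, middle or end."""
    if position not in ("initial", "middle", "end"):
        raise ValueError(f"unknown position: {position}")
    if len(tokens) <= size:
        return list(tokens)  # text is shorter than one chunk: keep it whole
    if position == "initial":
        start = 0
    elif position == "middle":
        start = (len(tokens) - size) // 2
    else:
        start = len(tokens) - size
    return tokens[start:start + size]


def assign_positions(n_texts):
    """Rotate through the three positions so each is used about equally."""
    positions = ("initial", "middle", "end")
    return [positions[i % 3] for i in range(n_texts)]


tokens = list(range(10))  # stand-in for a tokenized text
print(extract_chunk(tokens, "initial", size=4))  # [0, 1, 2, 3]
print(extract_chunk(tokens, "middle", size=4))   # [3, 4, 5, 6]
print(extract_chunk(tokens, "end", size=4))      # [6, 7, 8, 9]
print(assign_positions(4))  # ['initial', 'middle', 'end', 'initial']
```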
Proportion of samples
• In stratified random sampling, how many
samples should be taken for each
category?
– The numbers of samples across text
categories should be proportional to their
frequencies and/or weights in the target
population in order for the resulting corpus to
be considered as representative
– Difficult to determine objectively; in practice this is a well-informed and intuitive estimate
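Once weights have been estimated, turning them into whole numbers of samples is simple arithmetic. The sketch below is an assumption-laden illustration: the category weights are invented, and the largest-remainder rounding is just one reasonable way to make the counts add up to the total.

```python
def allocate_samples(weights, total_samples):
    """Split total_samples across categories in proportion to their weights."""
    weight_sum = sum(weights.values())
    exact = {c: total_samples * w / weight_sum for c, w in weights.items()}
    counts = {c: int(x) for c, x in exact.items()}  # round down first
    leftover = total_samples - sum(counts.values())
    # Hand the remaining samples to the categories with the largest remainders.
    by_remainder = sorted(weights, key=lambda c: exact[c] - counts[c], reverse=True)
    for category in by_remainder[:leftover]:
        counts[category] += 1
    return counts


# Hypothetical weights, not the actual Brown or LOB proportions.
weights = {"press": 3, "fiction": 2, "academic": 1}
print(allocate_samples(weights, 10))   # {'press': 5, 'fiction': 3, 'academic': 2}
print(allocate_samples(weights, 500))  # {'press': 250, 'fiction': 167, 'academic': 83}
```

The hard part remains choosing the weights; the code only makes the consequences of a given choice explicit.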
Proportion of genres in Brown
Constant sample size: ca. 2,000 words
“Relatively speaking…”
• Any claim of corpus representativeness and balance
must be interpreted in relative terms
– There is no objective way to balance a corpus or to measure its
representativeness
– Any claim for representativeness is an act of faith rather than a
statement of fact
• Corpus balance and representativeness are fluid
concepts
– The research question that one has in mind when
building/choosing a corpus determines what an acceptable
balance is for the corpus one should use and whether it is
suitably representative
• Corpus balance is also influenced by practical
considerations
– How easily can data of different types be collected?
Corpus size
• How large should a corpus be?
– There is no easy answer to this question.
• Krishnamurthy (2001): “Size matters.”
• Leech (1991): “Size is not all-important.”
• The size of the corpus needed depends upon the
purpose for which it is intended as well as a number of
practical considerations
– The kind of query that is anticipated from users
• Are you studying common or rare linguistic features?
– The methodology they use to study the data
• How much work can be done by the machine and how much has to
be done by hand?
– For corpus creators, also the source of data
• Are the data in electronic form readily available at a reasonable
cost?
• Can copyright permissions be granted easily if at all?
Corpus size
• Corpus size increases with the
development of technology
– 1960s-70s
• Brown and LOB: one million words
– 1980s
• The Birmingham/Cobuild corpora: 20 M words
– 1990s
• The British National Corpus: 100 M words
– Early 21st Century
• The Bank of English: 645 M words
Corpus size
• Is a large corpus really what you want?
– The size of the corpus needed to explore a research question
depends on the frequency and distribution of the linguistic
features under consideration in that corpus – your research
question
• Corpora for lexical studies are usually much larger than those for
grammatical studies
– Specialized corpora serve a very different yet important purpose
from large multi-million-word corpora
– Corpora that need extensive manual annotation or analysis are
necessarily small
– Many corpus tools set a ceiling on the number of concordances
that can be extracted
• The optimum size of a corpus is determined by the
research question the corpus is intended to address
as well as practical considerations
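The link between feature frequency and corpus size can be checked with back-of-the-envelope arithmetic. This sketch assumes a feature's rate per million words is already known from elsewhere and that it is spread evenly, which real features rarely are; the function names and figures are illustrative only.

```python
import math


def expected_hits(freq_per_million, corpus_size):
    """Expected number of occurrences in a corpus of corpus_size words."""
    return freq_per_million * corpus_size / 1_000_000


def corpus_size_needed(freq_per_million, min_hits):
    """Smallest corpus size (in words) expected to yield min_hits occurrences."""
    return math.ceil(min_hits * 1_000_000 / freq_per_million)


# A hypothetical feature occurring 5 times per million words:
print(expected_hits(5, 1_000_000))    # 5.0 in a Brown-sized corpus
print(expected_hits(5, 100_000_000))  # 500.0 in a BNC-sized corpus
print(corpus_size_needed(5, 100))     # 20000000 words for about 100 examples
```

A frequent grammatical feature reaches a workable number of examples in a small corpus, whereas a rare word needs a far larger one.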
Exploring existing English corpora
• To learn how corpora can be classified
• To learn about design decisions in
creating different kinds of corpora
• To become familiar with a range of well-known and influential corpora
• Corpus survey: “Well-known and influential
corpora”
Types of corpora, different uses
• General/reference vs. specialized corpora
• Written vs. spoken corpora
• Synchronic vs. diachronic corpora
• Monolingual vs. multilingual corpora
• Comparable vs. parallel corpora
• Native vs. learner corpora
• Developmental vs. learner/interlanguage corpora
• Raw vs. annotated corpora
• Static/sample vs. dynamic/monitor corpora
• …
Monitor corpora
• Constantly updated and growing in size
– Much larger corpus size
– Often contain full text
– Always up-to-date
– Often only admit new material which has new features not already present in the corpus
– Used to track changes across different periods of time
– Monitor corpora could be a series of static corpora
• Disadvantages
– No attempt to balance the corpus
– Text availability can become an issue (e.g. copyrights)
– Confusing to indicate specific corpus version (token number)
– Cannot easily compare results obtained from corpora of different sizes
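Raw counts from corpora of different sizes only become comparable after normalization to a common base. A minimal sketch, assuming a per-million-words base and invented counts:

```python
def per_million(raw_count, corpus_size):
    """Normalize a raw frequency to occurrences per million words."""
    return raw_count * 1_000_000 / corpus_size


# Hypothetical counts for the same word in two corpora of different sizes.
small = per_million(300, 1_000_000)        # 1 M word corpus
large = per_million(20_000, 100_000_000)   # 100 M word corpus
print(small, large)  # 300.0 200.0
```

Here the larger corpus has many more raw hits, yet the word is relatively more frequent in the smaller one. Normalization does not remove differences in corpus composition, so such comparisons still need caution.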
Some well-known English corpora
• The British National Corpus (BNC)
• The Bank of English (BoE)
• BYU American English corpus
• Corpora of the Brown family (Brown, LOB, FLOB, Frown)
• ICE corpora (GB, EA, HK, Singapore, Philippines, New Zealand etc)
• London-Lund corpus of spoken English
• SBCSAE
• The Helsinki Diachronic Corpus of English Texts (8th-18th century, ca. 1.5 million words)
• The International Corpus of Learner English (ICLE)
• MICASE
The BNC
• First and best-known national corpus (sample corpus)
• 100 M word balanced corpus of written (90%) and
spoken (10%) British English in current use
• 1960 - early 1990s (1966-1974, 1974-1984, 1985-1993)
• Rich metadata encoded for language variation studies
• POS tagged
• Accessing the BNC
– BYU-BNC: http://corpus.byu.edu/bnc/
– BNC Online:
http://www.natcorp.ox.ac.uk/getting/index.xml.ID=order_online
– Lancaster BNCWeb CQP edition
http://bncweb.lancs.ac.uk/bncwebSignup/user/login.php
– BNC Baby: http://www.natcorp.ox.ac.uk/corpus/baby/index.html
– Sketch Engine: http://www.sketchengine.co.uk/
– BNC PIE: http://pie.usna.edu/
The BoE
• Best known monitor corpus
• 645 M words (and still growing) of present-day English
• 75% written and 25% spoken
• 70% BrE, 20% AmE and 10% other English
varieties
• Particularly useful for lexical and lexicographic
studies, e.g. tracking new words, new uses or
meanings of old words, and words falling out of
use
• Access to the BoE
– A 56 M word sampler:
http://www.collins.co.uk/books.aspx?group=153
Corpus of Contemporary
American English (COCA)
• 385+ M words of American English
• 20M per year for 1990-2008
• Equally divided among spoken, fiction, popular
magazines, newspapers, and academic texts
• Updated every 6-9 months
• Useful for studying variation across genres and
over time
• Free online access
– http://www.americancorpus.org/
Corpora of the Brown family
• Brown: Written AmE in 1961
• LOB: Written BrE in 1961
• FLOB: Written BrE in 1991
• Frown: Written AmE in 1991
• Common corpus design
– One M word each
– 500 samples (ca. 2000 words each)
– Same proportions from the same 15 text categories
• Useful for synchronic and diachronic comparison of BrE and AmE
• Further information
– ICAME CD: http://khnt.hit.uib.no/icame/manuals/
– Extended Brown family: http://cqpweb.lancs.ac.uk (an access account must be applied for)
The ICE corpora
• 20 one M word balanced corpora
– E.g. Britain, Ireland, US, Canada, Hong Kong,
Singapore, India, the Philippines, East Africa
• Common corpus design
– 500 samples (ca. 2000 words each)
– 60% spoken + 40% written
– 12 Genres
– 1990-1994
• Designed for the synchronic study of “world
Englishes”
• More information
– http://www.ucl.ac.uk/english-usage/ice/
The London-Lund Corpus
• First electronic corpus of spontaneous language
• A corpus of spoken British English recorded from 1953 to 1987
• 100 texts, each of 5,000 words, totaling half a million
running words
• Both dialogue (e.g. face-to-face conversations,
telephone conversations, and public discussion) and
monologues (both spontaneous and prepared)
• Speaker information (gender, age, occupation)
• Annotated with prosodic information
• Further information
– http://khnt.hit.uib.no/icame/manuals/
SBCSAE
• Based on hundreds of recordings of
spontaneous speech from all over the United
States
• Representing a wide variety of people of
different regional origins, ages, occupations, and
ethnic and social backgrounds
• Each of the 60 transcripts is time stamped and
accompanied by a digital audio file
• Free download
– http://talkbank.org/data/CABank/SBCSAE.zip
Helsinki Corpus of English Texts
• Best-known historical corpus
• 1.5 million words of English in 400 text samples dating
from the 8th to 18th centuries
• Divided into three periods (Old, Middle, and Early
Modern English) and 11 sub-periods
• Socio-historical variation and a wide range of text types
for each specific period
• Allows researchers to go beyond simply dating and
reporting language change by combining diachronic,
sociolinguistic and genre studies
• Further information
– Oxford Text Archive: http://ota.oucs.ox.ac.uk/headers/1477.xml
The ICLE corpus
• First and best-known learner English corpus
• Comprising argumentative essays written by
advanced learners of English (i.e. university
students of English as a foreign language (EFL)
in their 3rd or 4th year of study)
• Over 2.5 million words in 3,640 texts ranging
from 500 to 1,000 words in length
• 11 L1 backgrounds and still expanding with 8
additional L1s
• Useful for investigating the interlanguage of
foreign language learners
• Further information: http://cecl.fltr.ucl.ac.be/
MICASE
• ca. 1.8 M words in 152 transcripts of
nearly 200 hours of recordings of 1,571
speakers
• Focusing on contemporary university
speech within the domain of the University
of Michigan
• Encoded with speaker information (age,
academic role, language status)
• Free online search or transcript download
– http://quod.lib.umich.edu/m/micase/