The BNC XML edition
Guy Aston
guy@sslmit.unibo.it
The BNC
100 million words of late 20th century
British English (written and spoken)
• Synchronic: begun in 1991, completed in 1994
• Slightly revised for the 2nd edition BNC World
(2001), and the 3rd edition BNC XML (2007)
• Sub-corpora releases:
– BNC Sampler (samples of one million written words, one
million spoken)
– BNC Baby (four one-million word samples from four
different genres: academic, non-academic, newspaper,
conversation)
BNC-XML (published last week)
• Single user licence £60 (£400 for 10) + VAT
NB: requires Windows XP
• Network licence £350 + VAT
• Free online service with limited query options
– www.natcorp.ox.ac.uk
• Prices valid till 01/06/2007
The BNC consortium
• Oxford University Press
• Addison-Wesley Longman
• Larousse Kingfisher Chambers
• Oxford University Computing Services
• University Centre for Computer Corpus Research on
Language, University of Lancaster
• British Library Research and Innovation Centre
• Funded by the commercial partners, the Science and
Engineering Research Council (now EPSRC) and the DTI under
the Joint Framework for Information Technology programme.
• Additional support: British Library and British Academy
Selection criteria: written texts
Domain
• 75% from informative writing: roughly equal
quantities from the fields of applied sciences,
arts, belief & thought, commerce & finance,
leisure, natural & pure science, social science,
world affairs
• 25% from imaginative writing, i.e. literary and
creative works
Selection criteria: written texts
Medium
• 60% from books
• 25% from periodicals (newspapers etc.)
• 5-10% from miscellaneous published material
(brochures, advertising leaflets, etc)
• 5-10% from unpublished material (personal letters
and diaries, essays and memoranda, etc)
• <5% from written-to-be-spoken material (political
speeches, play texts, broadcast scripts, etc.)
Selection criteria: written texts
Time
• post-1975
• a few imaginative works date back to 1964,
because of their continued sales/popularity
Further classification criteria:
written texts (original criteria)
• Sample size (number of words) and extent (start
and end points)
• Topic or subject of the text
• Author's name, age, gender, region of origin, and
domicile
• Target age group and gender
• "Level" of writing (reading difficulty) : the more
literary or technical a text, the "higher" its level
New classification criteria
(written texts: derived from Lee 2001)
• Academic writing
• Non-academic prose and biography
• Fiction and verse
• Newspapers
• Other published written material
• Unpublished written material
Selection criteria: spoken texts
roughly equal quantities of:
demographic (spoken conversation)
• transcriptions of spontaneous natural conversations made
by members of the public
context-governed (other spoken material)
• transcriptions of recordings made at specific types of
meeting and event.
The original recordings transcribed for inclusion in the BNC have
been deposited at the National Sound Archive of the British
Library.
Spoken texts: demographic
124 volunteers
• males and females of a wide range of ages and social
groupings, living in 38 different locations across the UK
• similar numbers of men and women, from each age and from
each social grouping
• conversations recorded unobtrusively over two or three days
• permissions obtained after each conversation
• participants' age, sex, accent, occupation, relationship
recorded if possible as classification criteria
Spoken texts: context-governed
Four broad categories of social context, roughly
equal quantities of speech
• Educational and informative events, such as lectures,
news broadcasts, classroom discussion, tutorials
• Business events such as sales demonstrations, trades
union meetings, consultations, interviews
• Institutional and public events, such as sermons,
political speeches, council meetings
• Leisure events, such as sports commentaries, after-dinner
speeches, club meetings, radio phone-ins
Specific type of event as classification criterion
BNC-XML: Composition
                               texts    w-units      %
• Spoken demographic             153    4233955   4.30
• Spoken context-governed        755    6175896   6.27
• Written books/periodicals     2685   79238146  80.55
• Written-to-be-spoken            35    1278618   1.29
• Written miscellaneous          421    7437168   7.56
• TOTAL                         4049   98363783
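The composition figures can be checked with a few lines of Python. This is only a sketch: the counts are copied from the slide, and the helper names (`totals`, `truncated_share`) are made up for illustration. The slide's percentages appear to be truncated rather than rounded to two decimals (which would explain why they sum to 99.97), so the sketch truncates too.

```python
# Counts copied from the composition slide: (category, texts, w-units).
COMPOSITION = [
    ("Spoken demographic", 153, 4233955),
    ("Spoken context-governed", 755, 6175896),
    ("Written books/periodicals", 2685, 79238146),
    ("Written-to-be-spoken", 35, 1278618),
    ("Written miscellaneous", 421, 7437168),
]


def totals(rows):
    """Return (total number of texts, total number of w-units)."""
    return sum(r[1] for r in rows), sum(r[2] for r in rows)


def truncated_share(count, total):
    """Share of the total in hundredths of a percent, truncated, not rounded."""
    return count * 10000 // total


n_texts, n_words = totals(COMPOSITION)
for name, texts, words in COMPOSITION:
    share = truncated_share(words, n_words)
    print(f"{name:<26} {texts:>5} {words:>9} {share / 100:6.2f}")
print(f"{'TOTAL':<26} {n_texts:>5} {n_words:>9}")
```

Rounding instead of truncating would give 6.28, 80.56 and 1.30 for three of the rows, which is worth bearing in mind when comparing figures across sources.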
BNC-XML
• same texts, same part-of-speech tagging as BNC World
• not checked against original texts/recordings
• numbers hopefully more accurate
– odd duplicate texts and parts of texts eliminated
– text categorisation errors corrected
– tokenisation/segmentation errors corrected
– multi-word tokens eliminated
– non-linguistic and paralinguistic descriptions standardised
• query software improved (Xaira)
BNC-XML corpus structure
• 1 corpus header
– information about corpus and corpus markup
• 1 bibliography file
– information about single text documents
• 4049 text documents
BNC-XML document structure
<bncDoc>
<teiHeader>
the header
</teiHeader>
<wtext> (written) or <stext> (spoken)
the text
</wtext> or </stext>
</bncDoc>
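Because every document has the same outline, a script can tell written from spoken texts by looking at which child follows the header. A minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `text_mode` and the two tiny documents are invented for illustration and are not real BNC files.

```python
import xml.etree.ElementTree as ET


def text_mode(doc_xml):
    """Return 'written' or 'spoken' depending on the child of <bncDoc>."""
    root = ET.fromstring(doc_xml)
    if root.tag != "bncDoc":
        raise ValueError("not a bncDoc document")
    if root.find("wtext") is not None:
        return "written"
    if root.find("stext") is not None:
        return "spoken"
    return "unknown"


# Skeleton documents following the outline on the slide (invented content).
WRITTEN = "<bncDoc><teiHeader/><wtext><p>the text</p></wtext></bncDoc>"
SPOKEN = "<bncDoc><teiHeader/><stext><u>the text</u></stext></bncDoc>"

print(text_mode(WRITTEN), text_mode(SPOKEN))
```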
BNC-XML: header
• Textual metadata
– file description
– source (bibliographic data)
– selection and classification categorisations
– participant data (speech)
– other things …
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>[ACET factsheets &amp; newsletters]. Sample containing about 6688 words
      of miscellanea (domain: social science)</title>
      <respStmt> <resp>Data capture and transcription</resp>
      <name>Oxford University Press</name> </respStmt>
    </titleStmt>
    <extent>6688 tokens; 6708 w-units; 423 s-units</extent>
    <publicationStmt>
      <distributor>Distributed under licence by Oxford University Computing
      Services on behalf of the BNC Consortium.</distributor>
      <availability>This material is protected by international copyright laws
      and may not be copied or redistributed in any way. </availability>
      <idno type="bnc">A00</idno>
    </publicationStmt>
    <sourceDesc> <bibl> <title>[ACET factsheets &amp; newsletters].</title>
      <publisher>Aids Care Education &amp; Training</publisher>
      <pubPlace>London </pubPlace><date value="1991-09">1991-09</date></bibl>
    </sourceDesc>
  </fileDesc>
  <profileDesc>
    <creation date="1991">1991-09</creation>
    <textClass>
      <catRef targets="WRI ALLTIM3 ALLAVA2 ALLTYP5 WRIAAG0 WRIAD0
        WRIASE0 WRIATY2 WRIAUD3 WRIDOM4 WRILEV2 WRIMED3 WRIPP5
        WRISAM5 WRISTA2 WRITAS3" />
      <classCode scheme="DLEE">W nonAc: medicine</classCode>
      <keywords><term>Health</term> <term>Sex</term> </keywords>
    </textClass>
  </profileDesc>
  …
</teiHeader>
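The header is ordinary XML, so its metadata can be read with any XML parser. The sketch below uses Python's standard library on a trimmed copy of the A00 header shown above (several elements are left out to keep it short); `header_summary` and its dictionary keys are illustrative names, not part of the BNC or of Xaira.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the A00 header from the slide; "&" must be written "&amp;".
HEADER = """<teiHeader>
  <fileDesc>
    <extent>6688 tokens; 6708 w-units; 423 s-units</extent>
    <publicationStmt><idno type="bnc">A00</idno></publicationStmt>
    <sourceDesc><bibl>
      <title>[ACET factsheets &amp; newsletters].</title>
    </bibl></sourceDesc>
  </fileDesc>
  <profileDesc>
    <textClass>
      <classCode scheme="DLEE">W nonAc: medicine</classCode>
      <keywords><term>Health</term> <term>Sex</term></keywords>
    </textClass>
  </profileDesc>
</teiHeader>"""


def header_summary(header_xml):
    """Collect a few metadata fields from a BNC-XML style header."""
    root = ET.fromstring(header_xml)
    return {
        "id": root.findtext(".//idno[@type='bnc']"),
        "title": root.findtext(".//sourceDesc/bibl/title"),
        "genre": root.findtext(".//classCode[@scheme='DLEE']"),
        "keywords": [term.text for term in root.iter("term")],
    }


print(header_summary(HEADER))
```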
BNC-XML: text elements
• wtext or stext
• div = section
• p = paragraph or u = utterance
• s = sentence
• w = word and c = punctuation
• also: head, note, caption, event, gap, vocal …
• word attributes
– c5 = CLAWS5
– pos = part-of-speech
– hw = headword (lemma)
<wtext type="NONAC"><div level="1" n="1" type="leaflet">
<head type="MAIN"><s n="1"><w c5="NN1" hw="factsheet"
pos="SUBST">FACTSHEET</w> <w c5="DTQ" hw="what"
pos="PRON">WHAT</w> <w c5="VBZ" hw="be" pos="VERB">IS</w> <w
c5="NN1" hw="aids" pos="SUBST">AIDS</w><c c5="PUN">?</c>
</s> </head>
<p><s n="2"><hi rend="bo"> <w c5="NN1" hw="aids" pos="SUBST">AIDS</w> <c
c5="PUL">(</c><w c5="VVN-AJ0" hw="acquire" pos="VERB">Acquired</w> <w
c5="AJ0" hw="immune" pos="ADJ">Immune</w> <w c5="NN1" hw="deficiency"
pos="SUBST">Deficiency</w> <w c5="NN1" hw="syndrome"
pos="SUBST">Syndrome</w><c c5="PUR">)</c></hi> <w c5="VBZ" hw="be"
pos="VERB">is</w> <w c5="AT0" hw="a" pos="ART">a</w> <w c5="NN1"
hw="condition" pos="SUBST">condition</w> <w c5="VVN" hw="cause"
pos="VERB">caused</w> <w c5="PRP" hw="by" pos="PREP">by</w> <w
c5="AT0" hw="a" pos="ART">a</w> <w c5="NN1" hw="virus"
pos="SUBST">virus</w> <w c5="VVN" hw="call" pos="VERB">called</w> <w
c5="NP0" hw="hiv" pos="SUBST">HIV</w> <c c5="PUL">(</c> <w
c5="AJ0-NN1" hw="human" pos="ADJ">Human</w> <w c5="NN1" hw="immuno"
pos="SUBST">Immuno</w> <w c5="NN1" hw="deficiency"
pos="SUBST">Deficiency</w> <w c5="NN1" hw="virus"
pos="SUBST">Virus</w><c c5="PUR">)</c><c c5="PUN">.</c> </s> … </p>
… </div></wtext>
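To show what the word-level markup gives you, here is a small sketch that walks the heading from the sample above and lists each token with its attributes. It uses only Python's standard library; the function name `tokens` is an illustrative choice.

```python
import xml.etree.ElementTree as ET

# The heading of text A00, copied from the sample above (on one line).
SAMPLE = (
    '<head type="MAIN"><s n="1">'
    '<w c5="NN1" hw="factsheet" pos="SUBST">FACTSHEET</w> '
    '<w c5="DTQ" hw="what" pos="PRON">WHAT</w> '
    '<w c5="VBZ" hw="be" pos="VERB">IS</w> '
    '<w c5="NN1" hw="aids" pos="SUBST">AIDS</w><c c5="PUN">?</c>'
    '</s></head>'
)


def tokens(xml_fragment):
    """Return (form, c5, hw, pos) for every <w> and <c> in document order."""
    root = ET.fromstring(xml_fragment)
    result = []
    for el in root.iter():
        if el.tag in ("w", "c"):
            # <c> (punctuation) carries only c5, so hw and pos come back as None.
            result.append((el.text.strip(), el.get("c5"), el.get("hw"), el.get("pos")))
    return result


for token in tokens(SAMPLE):
    print(token)
```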
ok, so what?
• you don't have to see it like that - do you prefer this?
FACTSHEET WHAT IS AIDS?
AIDS (Acquired Immune Deficiency Syndrome) is
a condition caused by a virus called HIV
(Human Immuno Deficiency Virus).
• but the markup is what enables you to (a) see it like
this; (b) do interesting things, e.g.
– distinguish aids=SUBST from aids=VERB, aids=NN1
from aids=NN2
– distinguish occurrences in writing from ones in speech
– distinguish occurrences in headings from ones in text
paragraphs
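Both points can be sketched in Python: (a) stripping the markup to get the readable text, and (b) using it to separate noun from verb uses and headings from paragraphs. The sample is shortened from text A00, and its last sentence ("It aids recovery.") is invented purely to provide a verb example; `plain_text` and `count_form` are illustrative names.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Shortened from text A00; sentence n="3" is INVENTED for illustration.
SAMPLE = (
    '<div>'
    '<head><s n="1"><w c5="NN1" hw="factsheet" pos="SUBST">FACTSHEET</w> '
    '<w c5="DTQ" hw="what" pos="PRON">WHAT</w> '
    '<w c5="VBZ" hw="be" pos="VERB">IS</w> '
    '<w c5="NN1" hw="aids" pos="SUBST">AIDS</w><c c5="PUN">?</c></s></head> '
    '<p><s n="2"><w c5="NN1" hw="aids" pos="SUBST">AIDS</w> '
    '<w c5="VBZ" hw="be" pos="VERB">is</w> '
    '<w c5="AT0" hw="a" pos="ART">a</w> '
    '<w c5="NN1" hw="condition" pos="SUBST">condition</w>'
    '<c c5="PUN">.</c></s> '
    '<s n="3"><w c5="PNP" hw="it" pos="PRON">It</w> '
    '<w c5="VVZ" hw="aid" pos="VERB">aids</w> '
    '<w c5="NN1" hw="recovery" pos="SUBST">recovery</w>'
    '<c c5="PUN">.</c></s></p>'
    '</div>'
)


def plain_text(xml_fragment):
    """(a) Drop all markup and keep only the character data."""
    return "".join(ET.fromstring(xml_fragment).itertext())


def count_form(xml_fragment, form):
    """(b) Count a word form by (containing element, pos value)."""
    root = ET.fromstring(xml_fragment)
    counts = Counter()
    for container in ("head", "p"):
        for block in root.iter(container):
            for w in block.iter("w"):
                if w.text.strip().lower() == form:
                    counts[(container, w.get("pos"))] += 1
    return counts


print(plain_text(SAMPLE))
print(count_form(SAMPLE, "aids"))
```

A plain-text search for "aids" would lump all three occurrences together; with the markup they fall into heading noun, paragraph noun, and paragraph verb.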
Out of date?
• new/obsolete text types
– e-mail
– web pages / blogs
– SMS
– personal letters
• new/obsolete topics
– globalization
– internet
– Elvis
– Word Perfect
• new/obsolete language (especially in speech, by age
group)
Out of date?
• Results always need interpreting bearing in mind
the composition of the corpus
• There aren't many alternatives
– Web-as-corpus: 85% of written texts aren't on the web; and
what about spoken texts?
– Results from monitor corpora non-replicable
– Copyright permissions unrepeatable
• Surprising how few things don't occur in the BNC
• Quantitative/qualitative evaluations will arrive …
Xaira
• XML-aware indexing and retrieval
application
• Not corpus- or language-specific
• Comes free with BNC-XML
• Lots of alternatives, but not many that allow
use of XML markup (Zurich, CWB)
Xaira improvements
• Standalone use (Windows XP)
• Frequency/distribution data by text mode/class
• Query by text mode/class
• Exclude occurrences in headers (Oxford)
• Easier use of word-class data
– Lemma queries (eg hw="be")
– Addkey queries (eg pos="VERB")
– Lemma collocations, POS colligations
Oh, and you can examine the whole text if you want
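To make the idea of lemma and addkey queries concrete, here is a rough Python imitation that filters `<w>` elements on their attributes. This is not Xaira's query language or API, only a sketch of the same idea; the `query` function is hypothetical, and the sample strings together word elements from the earlier slides out of context.

```python
import xml.etree.ElementTree as ET

# Word elements taken from the A00 sample, strung together out of context.
SAMPLE = (
    '<s><w c5="VBZ" hw="be" pos="VERB">IS</w> '
    '<w c5="NN1" hw="aids" pos="SUBST">AIDS</w> '
    '<w c5="VBZ" hw="be" pos="VERB">is</w> '
    '<w c5="VVN" hw="cause" pos="VERB">caused</w> '
    '<w c5="VVN" hw="call" pos="VERB">called</w></s>'
)


def query(xml_fragment, **attrs):
    """Return the forms of all <w> elements whose attributes match attrs."""
    root = ET.fromstring(xml_fragment)
    return [
        w.text.strip()
        for w in root.iter("w")
        if all(w.get(name) == value for name, value in attrs.items())
    ]


print(query(SAMPLE, hw="be"))     # lemma query: every form of "be"
print(query(SAMPLE, pos="VERB"))  # word-class query: every verb
```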
What can you do with it in
teaching/learning?
• And why should you want to?
• A few examples
A problem-solving and problem-discovering resource for
• Preparation (teacher)
• Classroom use (teacher/learner)
• Self-study (learner)
• Complements (and corrects) intuition
• Increases learner autonomy
• Critiques the myth of the native speaker
The ins and outs of autonomous use
• focus on patterns which recur, without necessarily
trying to explain all the data (patterns not rules)
• notice memorable instances
• DON'T overgeneralise - take something you can use
• look for
– collocations
– colligations (including position in structural unit - Hoey)
– semantic preferences
– semantic prosodies/pragmatic associations (apathetic)
– associations with particular genres/domains
• be curious: browse the context, investigate exceptions
What are "ins and outs"?
• 50 occurrences, sort left 2
• colligation: (all) the ins and outs of
• semantic preference: know/learn/understand/keep
up with/get to grips with/get down to/forget;
explain/teach/guide through/give/look at
• semantic prosody: difficulty(?)
• analysis - mainly spoken conversation, but
numbers too small for reliable inference
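The "sort left 2" step can be imitated in a few lines of Python, reading it as "sort the concordance on the second word to the left of the node" (an assumption about what the setting means). The three sentences are invented, not BNC data, and `kwic` is an illustrative name.

```python
def kwic(sentences, phrase, sort_left=2):
    """Concordance lines for phrase, sorted on the nth word to its left."""
    target = phrase.lower().split()
    n = len(target)
    lines = []
    for sentence in sentences:
        toks = sentence.split()
        # Lower-cased, punctuation-stripped copies used for matching and sorting.
        low = [t.lower().strip(".,;:!?") for t in toks]
        for i in range(len(toks) - n + 1):
            if low[i:i + n] == target:
                key = low[i - sort_left] if i >= sort_left else ""
                left = " ".join(toks[:i])
                node = " ".join(toks[i:i + n])
                right = " ".join(toks[i + n:])
                lines.append((key, left, node, right))
    lines.sort(key=lambda line: line[0])
    return [(left, node, right) for _, left, node, right in lines]


# Invented example sentences (NOT taken from the BNC).
SENTENCES = [
    "You learn the ins and outs quickly.",
    "She knows all the ins and outs of the job.",
    "He explained the ins and outs of the system.",
]

for left, node, right in kwic(SENTENCES, "ins and outs"):
    print(f"{left:>20} | {node} | {right}")
```

Sorting this way lines up the words that come before "the ins and outs", which is how patterns such as know/learn/explain become visible.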
Grammar:
the aim to do or the aim of doing
• aim NN1
• aim n same frequency as aim v
• aim + of /4: 2004; aim + to /4: 1610
• to - with possessive (POS/DPS) + be; or of NP
• of - with the aim of VVG [good things!]
• chief/main (is to), stated
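Counts like "aim + of /4" can be imitated with a simple window search. The sketch assumes "/4" means "within the next four words to the right of the node", which may not match the query setting actually used; the token sequence is invented and `count_within` is a hypothetical helper.

```python
def count_within(tokens, node, collocate, span=4):
    """Count occurrences of node with collocate among the next span tokens."""
    hits = 0
    for i, tok in enumerate(tokens):
        if tok == node and collocate in tokens[i + 1:i + 1 + span]:
            hits += 1
    return hits


# Invented running text (NOT BNC data), already lower-cased and tokenised.
TOKENS = (
    "our aim is to reduce costs with the aim of improving access "
    "the aim was not really ever to win"
).split()

print("aim + of /4:", count_within(TOKENS, "aim", "of"))
print("aim + to /4:", count_within(TOKENS, "aim", "to"))
```

In the third clause "to" is five words after "aim", so it falls outside the window; widening the span changes the counts, which is why the span should always be reported with the figures.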
aim + to
• colligation: possessive/the aim BE to INF
• semantic prosody: positively evaluated
outcome (cf right collocates - next slide)
aim + of
• with the aim of V+ing (colligation)
• main/sole/stated/specific (semantic preference)