The BNC XML edition
Guy Aston
guy@sslmit.unibo.it

The BNC
100 million words of late 20th century British English (written and spoken)
• Synchronic: begun in 1991, completed in 1994
• Slightly revised for the 2nd edition BNC World (2001), and the 3rd edition BNC XML (2007)
• Sub-corpora releases:
  – BNC Sampler (samples of one million written words, one million spoken)
  – BNC Baby (four one-million word samples from four different genres: academic, fiction, newspaper, conversation)

BNC-XML (published last week)
• Single user licence £60 (£400 for 10) + VAT
  NB: requires Windows XP
• Network licence £350 + VAT
• Free online service with limited query options
  – www.natcorp.ox.ac.uk
• Prices valid till 01/06/2007

The BNC consortium
• Oxford University Press
• Addison-Wesley Longman
• Larousse Kingfisher Chambers
• Oxford University Computing Services
• University Centre for Computer Corpus Research on Language, University of Lancaster
• British Library Research and Innovation Centre
• Funded by the commercial partners, the Science and Engineering Research Council (now EPSRC) and the DTI under the Joint Framework for Information Technology programme
• Additional support: British Library and British Academy

Selection criteria: written texts
Domain
• 75% from informative writing: roughly equal quantities from the fields of applied sciences, arts, belief & thought, commerce & finance, leisure, natural & pure science, social science, world affairs
• 25% from imaginative writing, i.e. literary and creative works

Selection criteria: written texts
Medium
• 60% from books
• 25% from periodicals (newspapers etc.)
• 5-10% from miscellaneous published material (brochures, advertising leaflets, etc.)
• 5-10% from unpublished material (personal letters and diaries, essays and memoranda, etc.)
• <5% from written-to-be-spoken material (political speeches, play texts, broadcast scripts, etc.)

Selection criteria: written texts
Time
• post-1975
• a few imaginative works date back to 1964, because of their continued sales/popularity

Further classification criteria: written texts (original criteria)
• Sample size (number of words) and extent (start and end points)
• Topic or subject of the text
• Author's name, age, gender, region of origin, and domicile
• Target age group and gender
• "Level" of writing (reading difficulty): the more literary or technical a text, the "higher" its level

New classification criteria (written texts: derived from Lee 2001)
• Academic writing
• Non-academic prose and biography
• Fiction and verse
• Newspapers
• Other published written material
• Unpublished written material

Selection criteria: spoken texts
Roughly equal quantities of:
demographic (spoken conversation)
• transcriptions of spontaneous natural conversations made by members of the public
context-governed (other spoken material)
• transcriptions of recordings made at specific types of meeting and event
The original recordings transcribed for inclusion in the BNC have been deposited at the National Sound Archive of the British Library.

Spoken texts: demographic
124 volunteers
• males and females of a wide range of ages and social groupings, living in 38 different locations across the UK
• similar numbers of men and women, from each age group and each social grouping
• conversations recorded unobtrusively over two or three days
• permissions obtained after each conversation
• participants' age, sex, accent, occupation and relationship recorded where possible, as classification criteria

Spoken texts: context-governed
Four broad categories of social context, roughly equal quantities of speech
• Educational and informative events, such as lectures, news broadcasts, classroom discussion, tutorials
• Business events, such as sales demonstrations, trades union meetings, consultations, interviews
• Institutional and public events, such as sermons, political speeches, council meetings
• Leisure events, such as sports commentaries, after-dinner speeches, club meetings, radio phone-ins
Specific type of event as classification criterion

BNC-XML: Composition
                          texts     w-units       %
• Spoken demographic        153     4233955    4.30
• Spoken context-gov        755     6175896    6.27
• Written books/period     2685    79238146   80.55
• Written-to-be-spoken       35     1278618    1.29
• Written miscellaneous     421     7437168    7.56
• TOTAL                    4049    98363783

BNC-XML
• same texts, same part-of-speech tagging as BNC World
• not checked against original texts/recordings
• numbers hopefully righter
  – odd duplicate texts and parts of texts eliminated
  – text categorisation errors corrected
  – tokenisation/segmentation errors corrected
  – multi-word tokens eliminated
  – non-linguistic and paralinguistic descriptions standardised
• query software improved (Xaira)

BNC-XML corpus structure
• 1 corpus header
  – information about corpus and corpus markup
• 1 bibliography file
  – information about single text documents
• 4049 text documents

BNC-XML document structure
<bncDoc>
  <teiHeader> the header </teiHeader>
  <wtext> (written)    or <stext> (spoken)
    the text
  </wtext>             or </stext>
</bncDoc>

BNC-XML: header
• Textual metadata
  – file description
  – source (bibliographic data)
  – selection and classification categorisations
  – participant data (speech)
  – other things …

<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>[ACET factsheets &amp; newsletters]. Sample containing about 6688 words of miscellanea (domain: social science)</title>
      <respStmt>
        <resp>Data capture and transcription</resp>
        <name>Oxford University Press</name>
      </respStmt>
    </titleStmt>
    <extent>6688 tokens; 6708 w-units; 423 s-units</extent>
    <publicationStmt>
      <distributor>Distributed under licence by Oxford University Computing Services on behalf of the BNC Consortium.</distributor>
      <availability>This material is protected by international copyright laws and may not be copied or redistributed in any way.</availability>
      <idno type="bnc">A00</idno>
    </publicationStmt>
    <sourceDesc>
      <bibl>
        <title>[ACET factsheets &amp; newsletters].</title>
        <publisher>Aids Care Education &amp; Training</publisher>
        <pubPlace>London</pubPlace>
        <date value="1991-09">1991-09</date>
      </bibl>
    </sourceDesc>
  </fileDesc>
  <profileDesc>
    <creation date="1991">1991-09</creation>
    <textClass>
      <catRef targets="WRI ALLTIM3 ALLAVA2 ALLTYP5 WRIAAG0 WRIAD0 WRIASE0 WRIATY2 WRIAUD3 WRIDOM4 WRILEV2 WRIMED3 WRIPP5 WRISAM5 WRISTA2 WRITAS3" />
      <classCode scheme="DLEE">W nonAc: medicine</classCode>
      <keywords><term>Health</term> <term>Sex</term></keywords>
    </textClass>
  </profileDesc>
  …
</teiHeader>

BNC-XML: text elements
• wtext or stext
• div = section
• p = paragraph or u = utterance
• s = sentence
• w = word and c = punctuation
  also: head, note, caption, event, gap, vocal …
• word attributes
  – c5 = CLAWS5 tag
  – pos = part-of-speech
  – hw = headword (lemma)

<wtext type="NONAC"><div level="1" n="1" type="leaflet">
<head type="MAIN"><s n="1">
  <w c5="NN1" hw="factsheet" pos="SUBST">FACTSHEET</w>
  <w c5="DTQ" hw="what" pos="PRON">WHAT</w>
  <w c5="VBZ" hw="be" pos="VERB">IS</w>
  <w c5="NN1" hw="aids" pos="SUBST">AIDS</w><c c5="PUN">?</c>
</s></head>
<p><s n="2"><hi rend="bo">
  <w c5="NN1" hw="aids" pos="SUBST">AIDS</w>
  <c c5="PUL">(</c><w c5="VVN-AJ0" hw="acquire" pos="VERB">Acquired</w>
  <w c5="AJ0" hw="immune" pos="ADJ">Immune</w>
  <w c5="NN1" hw="deficiency" pos="SUBST">Deficiency</w>
  <w c5="NN1" hw="syndrome" pos="SUBST">Syndrome</w><c c5="PUR">)</c></hi>
  <w c5="VBZ" hw="be" pos="VERB">is</w>
  <w c5="AT0" hw="a" pos="ART">a</w>
  <w c5="NN1" hw="condition" pos="SUBST">condition</w>
  <w c5="VVN" hw="cause" pos="VERB">caused</w>
  <w c5="PRP" hw="by" pos="PREP">by</w>
  <w c5="AT0" hw="a" pos="ART">a</w>
  <w c5="NN1" hw="virus" pos="SUBST">virus</w>
  <w c5="VVN" hw="call" pos="VERB">called</w>
  <w c5="NP0" hw="hiv" pos="SUBST">HIV</w>
  <c c5="PUL">(</c>
  <w c5="AJ0-NN1" hw="human" pos="ADJ">Human</w>
  <w c5="NN1" hw="immuno" pos="SUBST">Immuno</w>
  <w c5="NN1" hw="deficiency" pos="SUBST">Deficiency</w>
  <w c5="NN1" hw="virus" pos="SUBST">Virus</w><c c5="PUR">)</c><c c5="PUN">.</c>
</s> … </p>
…
</div></wtext>

ok, so what?
• you don't have to see it like that - do you prefer this?

  FACTSHEET WHAT IS AIDS?
  AIDS (Acquired Immune Deficiency Syndrome) is a condition caused by a virus called HIV (Human Immuno Deficiency Virus).

• but the markup is what enables you to (a) see it like this; (b) do interesting things, e.g.
  – distinguish aids=SUBST from aids=VERB, aids=NN1 from aids=NN2
  – distinguish occurrences in writing from ones in speech
  – distinguish occurrences in headings from ones in text paragraphs

Out of date?
• new/obsolete text types
  – e-mail
  – web pages / blogs
  – SMS
  – personal letters
• new/obsolete topics
  – globalization
  – internet
  – Elvis
  – Word Perfect
• new/obsolete language (especially in speech age groups)

Out of date?
• Results always need interpreting bearing in mind the composition of the corpus
• There aren't many alternatives
  – Web-as-corpus: 85% of written texts aren't on the web; and spoken texts?
  – Results from monitor corpora non-replicable
  – Copyright permissions unrepeatable
• Surprising how few things don't occur in the BNC
• Quantitative/qualitative evaluations will arrive …

Xaira
• XML-aware indexing and retrieval application
• Not corpus- or language-specific
• Comes free with BNC-XML
• Lots of alternatives, but not many that allow use of XML markup (Zurich, CWB)

Xaira improvements
• Standalone use (Windows XP)
• Frequency/distribution data by text mode/class
• Query by text mode/class
• Exclude occurrences in headers (Oxford)
• Easier use of word-class data
  – Lemma queries (e.g. hw="be")
  – Addkey queries (e.g. pos="VERB")
  – Lemma collocations, POS colligations
Oh, and you can examine the whole text if you want

What can you do with it in teaching/learning?
• And why should you want to?
• A few examples

A problem-solving and problem-discovering resource for
• Preparation (teacher)
• Classroom use (teacher/learner)
• Self-study (learner)
• Complements (and corrects) intuition
• Increases learner autonomy
• Critiques the myth of the native speaker

The ins and outs of autonomous use
• focus on patterns which recur, without necessarily trying to explain all the data (patterns not rules)
• notice memorable instances
• DON'T overgeneralise - take something you can use
• look for
  – collocations
  – colligations (including position in structural unit - Hoey)
  – semantic preferences
  – semantic prosodies/pragmatic associations (apathetic)
  – associations with particular genres/domains
• be curious: browse the context, investigate exceptions

What are "ins and outs"?
• 50 occurrences, sort left 2
• colligation: (all) the ins and outs of
• semantic preference: know/learn/understand/keep up with/get to grips with/get down to/forget; explain/teach/guide through/give/look at
• semantic prosody: difficulty(?)
• analysis - mainly spoken conversation, but numbers too small for reliable inference

Grammar: the aim to do or the aim of doing
• aim NN1
• aim n same frequency as aim v
• aim + of /4: 2004; + to /4: 1610
• to - with possessive (POS/DPS) + be; or of NP
• of - with the aim of VVG [good things!]
• chief/main (is to), stated aim + to
• colligation: possessive/the aim BE to INF
• semantic prosody: positively evaluated outcome (cf right collocates - next slide)

aim + of
• with the aim of V+ing (colligation)
• main/sole/stated/specific (semantic preference)
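
A rough sketch of how the hw and pos attributes support a count like "aim + of /4" versus "+ to /4", for anyone who wants to try this outside Xaira. It is not how Xaira works internally. The fragment below is invented for illustration (it is not a BNC text, and its attribute values are only meant to look like BNC-XML), the function name right_collocates is hypothetical, and reading "/4" as "within four w-units to the right, inside the same s-unit" is an assumption.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented fragment in the style of BNC-XML markup (NOT taken from the corpus).
SAMPLE = """
<wtext type="NONAC"><p>
<s n="1"><w c5="PRP" hw="with" pos="PREP">With </w><w c5="AT0" hw="the" pos="ART">the </w>
<w c5="NN1" hw="aim" pos="SUBST">aim </w><w c5="PRF" hw="of" pos="PREP">of </w>
<w c5="VVG" hw="help" pos="VERB">helping</w><c c5="PUN">.</c></s>
<s n="2"><w c5="DPS" hw="our" pos="PRON">Our </w><w c5="NN1" hw="aim" pos="SUBST">aim </w>
<w c5="VBZ" hw="be" pos="VERB">is </w><w c5="TO0" hw="to" pos="PREP">to </w>
<w c5="VVI" hw="help" pos="VERB">help</w><c c5="PUN">.</c></s>
<s n="3"><w c5="PNP" hw="they" pos="PRON">They </w><w c5="VVB" hw="aim" pos="VERB">aim </w>
<w c5="TO0" hw="to" pos="PREP">to </w><w c5="VVI" hw="win" pos="VERB">win</w><c c5="PUN">.</c></s>
</p></wtext>
"""


def right_collocates(root, node_hw, node_pos, span=4):
    """Count headwords found up to `span` w-units to the right of the node.

    The node is matched on lemma (hw) and word class (pos), so the noun
    "aim" and the verb "aim" are kept apart. Punctuation (<c>) is skipped
    and the window never crosses an <s> boundary.
    """
    counts = Counter()
    for s in root.iter("s"):
        words = list(s.iter("w"))  # w-units only, in document order
        for i, w in enumerate(words):
            if w.get("hw") == node_hw and w.get("pos") == node_pos:
                for other in words[i + 1 : i + 1 + span]:
                    counts[other.get("hw")] += 1
    return counts


root = ET.fromstring(SAMPLE)
noun_counts = right_collocates(root, "aim", "SUBST")
verb_counts = right_collocates(root, "aim", "VERB")
print("aim (noun) + of /4:", noun_counts["of"])
print("aim (noun) + to /4:", noun_counts["to"])
print("aim (verb) + to /4:", verb_counts["to"])
```

Without the pos filter, the third sentence would be counted as another "aim + to", which is exactly the aids=SUBST versus aids=VERB distinction the markup makes possible.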