intro - British National Corpus

Introduction: corpora, corpus use,
and the British National Corpus
Dr. Ylva Berglund Prytz
ylva.berglund@oucs.ox.ac.uk
http://www.natcorp.ox.ac.uk/
Outline






Presentation: Corpora, corpus use, and the BNC
Demonstration: How to use BNC with Xaira
Hands-on: BNC with Xaira
Presentation: Using the BNC for teaching and
research
More hands-on: exploring more
Questions and answers
At the end of today you should

have a basic working knowledge about
 corpora and corpus use
 the BNC
 Xaira

feel confident using Xaira
be able to explore the area on your own

know where to turn for help and advice

Approaches to linguistic study
Intuition
• “Feel” what is
right/wrong/possible
• One person’s
language
• Subjective
Study of usage
• Examine what is
actually said/written
• Several people
• Objective
How do you study usage?
Examine naturally occurring language
 Draw conclusions

Need a sample of language,
produced by different people
in various contexts
Find a corpus!
What is a corpus?



A collection of naturally occurring language data
compiled to mirror a language/language variety
(Usually) computer-readable
(Usually) contains more than text (annotation,
meta-data)
What is a corpus? – some definitions
A corpus is a collection of naturally-occurring
language text, chosen to characterise a state or variety
of language. (Sinclair 1991: 171)
A corpus can be defined as a collection of texts
assumed to be representative of a given language.
(Tognini-Bonelli 2001: 2)
All the material included in a corpus, whether spoken,
written […] is assumed to be taken from genuine
communications of people going about their normal
business. (ibid: 55)
How can a corpus help?

Look for patterns to see regularities

Quantify

See several examples

Real language – language in use

Based on a variety of sources
Types of corpora
• Balanced corpora (= Reference or general corpora)
• Specialised corpora
 Genre-specific, LSP (e.g. English for Academic Purposes)
…
 Varieties (dialectal, social, historical)
 Learner language, English as a Lingua Franca
• Multilingual corpora
 Parallel corpora (translations; alignable)
 Comparable corpora (similar texts)
• Fixed size / monitor corpora
• Mode and medium
 Written, spoken and transcribed, spoken with audio, video
Famous corpora

Brown family (Brown, LOB, FLOB)
 1 million words, different text categories

Bank of English
 Monitor corpus, grows with time

International Corpus of English (ICE)
 Different national varieties of English. 1 million words written
and spoken

British National Corpus
 Reference corpus, fixed, 100 million words, written and
spoken
British National Corpus (BNC)
What is the BNC?


A snapshot of British English, taken at the end of
the 20th century
100 million words in approx 4,000 different text
samples, both spoken (10%) and written (90%)

Synchronic (1960-93), sampled, general purpose
corpus

Available under licence; latest edition is BNC XML
edition (March 2007)
More than text

Metadata
 About text, author/speaker, audience

Structural & typographical information
 Paragraph, sentence, heading, list, bolds

Extra-linguistic information
 Voice quality, noise, pauses, overlap

Linguistic information
 Part-of-speech
Who produced the BNC and why?

a consortium of dictionary publishers and
academic researchers
 OUP, Longman, Chambers
 OUCS, UCREL, BL R&D

with funding from DTI/SERC under JFIT 1990-1994

Lexicographers, NLP researchers,
 But not language teachers!
Stated Project Goals

A synchronic (1990-4) corpus of samples both
spoken and written from the full range of British
English language production
of non-opportunistic design, for generic applicability

with word class annotation

and contextual information

Actual (?) project goals

Better ELT dictionaries
 authoritative
 both speech and writing

A model for European corpus work
 design, and encoding
 Industrial-academic co-operation

A REALLY BIG corpus
Production of the BNC



took three years (at least)
cost GBP 1.6 million (at least)
came about through an unusual coincidence of
interests amongst:
 Lexicographical publishers
 Government (DTI)
 Science and Engineering Research Council (SERC)
Project consequences

industrial-scale text production system
necessary compromises?

technically over-ambitious?

IPR and profitability

The BNC looks back to Brown and LOB in its
design and markup, and forward to the Web in
its scope and indeterminacy
How was the corpus created?
1. Corpus design
2. Text selection
3. Clearance
4. Capture
5. Add additional information
6. Merge
7. (documentation)
8. Distribution
The BNC “sausage machine”
Selection, clearance, and capture
 Written (OUP/Chambers)
 Spoken (Longman)
Enrichment and encoding
 Initial CDIF conversion and validation (OUCS)
 Word class annotation (UCREL)
 Header generation and final validation (OUCS)
Documentation, distribution, maintenance
Text selection
1. Design criteria
 Types of texts
 Sources
 Number of samples
 Size of samples
2. Descriptive criteria
 Additional information where available
Selection criteria: written texts
Domain
imaginative (c 25%)
informative
Medium
Book, periodicals, misc. published, unpublished,
written to be spoken
Time
1985-1993
(1960-75, 1975-84)
“Descriptive” criteria: written texts

Sample size (number of words) and extent (start and end
points)

Topic or subject of the text

Author's name, age, gender, region of origin, and domicile

Target age group and gender

"Level" of writing (reading difficulty) : the more literary or
technical a text, the "higher" its level
Selection criteria: spoken texts
demographic (spoken conversation)
 transcriptions of spontaneous natural
conversations made by recruited volunteers
 original recordings are available from British
Library
context-governed (other spoken material)
 transcriptions of recordings made at specific types
of meeting and event.
Spoken texts: context-governed
Four broad categories of social context:
• Educational and informative events, such as
lectures, news broadcasts, classroom discussion,
tutorials
• Business events such as sales demonstrations,
trades union meetings, consultations, interviews
• Institutional and public events, such as sermons,
political speeches, council meetings
• Leisure events, such as sports commentaries, after-dinner
speeches, club meetings, radio phone-ins
Descriptive criteria: spoken texts





Features relating to the speaker (age, sex, social
class, dialect)
Context of recording (place, time)
Features of the recording (non-verbal events,
paralinguistic phenomena, unclear instances)
Included when known
Sometimes provided by respondent
What is the result?
What is the BNC?




4,000+ texts
Ca. 100,000,000 words
10% spoken
Information about
 the texts
 the speakers/writers
 the words

Delivered with a search tool: XAIRA
What's in the BNC?
Words per category (chart):
Books and Periodicals      79,238,146
Other written               8,715,786
Spoken Context Governed     6,175,896
Spoken Demographic          4,233,955
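The "10% spoken" figure can be checked against the chart numbers with a few lines of arithmetic. This is a minimal sketch; it assumes the pairing of labels and counts shown in the dictionary (the largest count belongs to Books and Periodicals, the two spoken counts are the smallest), which is an interpretation of the chart rather than something the slide states explicitly.

```python
# Word counts per category; the label/number pairing is an assumption
# read off the chart, not spelled out on the slide.
WORD_COUNTS = {
    "Books and Periodicals": 79_238_146,
    "Other written": 8_715_786,
    "Spoken Context Governed": 6_175_896,
    "Spoken Demographic": 4_233_955,
}

# Total size of the corpus in words.
total = sum(WORD_COUNTS.values())

# Words in the two spoken categories.
spoken = sum(n for label, n in WORD_COUNTS.items() if label.startswith("Spoken"))

# Spoken share as a percentage of the whole corpus.
spoken_share = 100 * spoken / total

print(f"total words: {total:,}")
print(f"spoken share: {spoken_share:.1f}%")
```

Under that pairing the spoken share comes out slightly above the round 10% quoted on the slides.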
What topics?
Words per topic (chart):
Imaginative       16,496,420
Applied Science    7,174,152
Arts               6,574,857
Scientific         3,821,902
World Affairs     17,244,534
Belief             3,037,533
Social Science    14,025,537
Commerce           7,341,163
Leisure           12,237,834
Post-hoc text-type classification
Academic
Literary
Press
Nonfiction
Unpublished
Conversation
Other spoken
...sentences
...words
Format
Corpus header (1)
Corpus texts (4,000+), each with a text header and a text
<corpus>
<corpusHeader></corpusHeader>
<corpusText>
<textHeader></textHeader>
<text></text>
</corpusText>
<corpusText>
<textHeader></textHeader>
<text></text>
</corpusText>
…
</corpus>
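The nesting above (one corpus header, then many texts, each with its own header) can be walked with a standard XML parser. A minimal sketch, assuming the schematic element names from this slide (corpus, corpusHeader, corpusText, textHeader, text); the element names in the distributed BNC-XML files may differ, and the sample content is invented.

```python
import xml.etree.ElementTree as ET

# Schematic corpus using the element names from the slide.
# The text inside the elements is invented for illustration.
SCHEMATIC = """<corpus>
<corpusHeader>about the corpus</corpusHeader>
<corpusText>
<textHeader>header of text A</textHeader>
<text>body of text A</text>
</corpusText>
<corpusText>
<textHeader>header of text B</textHeader>
<text>body of text B</text>
</corpusText>
</corpus>"""


def list_texts(xml_string):
    """Return one (header, body) pair per corpusText element."""
    root = ET.fromstring(xml_string)
    pairs = []
    for corpus_text in root.findall("corpusText"):
        header = corpus_text.findtext("textHeader", default="")
        body = corpus_text.findtext("text", default="")
        pairs.append((header, body))
    return pairs


for header, body in list_texts(SCHEMATIC):
    print(header, "->", body)
```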
Annotation, encoding, markup
• A means of making explicit, and thus processable:
 structure
• texts, sections, paragraphs, turns, sentences,
words...
 metadata
• text-type, situational parameters, context
 analysis
• morphology, syntactic function, translation
Word class annotation


CLAWS (Leech, Garside et al) approach
What counts as a word?
This isn't prima facie obvious, in
spite of spelling conventions.

In BNC-XML, each word is explicitly marked
and annotated with
 a root form or lemma
 an automatically assigned C5 word class code
 a simplified POS code
Example: word class annotation
<s n="11">
<w c5="NN1" hw="difficulty" pos="SUBST">Difficulty </w>
<w c5="VBZ" hw="be" pos="VERB">is </w>
<w c5="VBG" hw="be" pos="VERB">being </w>
<w c5="VVN" hw="express" pos="VERB">expressed </w>
<w c5="PRP" hw="with" pos="PREP">with </w>
<w c5="AT0" hw="the" pos="ART">the </w>
<w c5="NN1" hw="method" pos="SUBST">method </w>
<w c5="TO0" hw="to" pos="PREP">to </w>
<w c5="VBI" hw="be" pos="VERB">be </w>
<w c5="VVN" hw="use" pos="VERB">used </w>
<w c5="TO0" hw="to" pos="PREP">to </w>
<w c5="VVI" hw="launch" pos="VERB">launch </w>
<w c5="AT0" hw="the" pos="ART">the </w>
<w c5="NN1" hw="scheme" pos="SUBST">scheme</w>
<c c5="PUN">.</c>
</s>
c5 = detailed part-of-speech
hw = head word (new)
pos = simple part-of-speech (new)
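Because every word carries its annotation as attributes, the sentence above can be read with any XML parser. A minimal sketch using Python's standard library; the function name read_tokens is made up for this example, and the input is the sentence from this slide.

```python
import xml.etree.ElementTree as ET

# The example sentence from the slide, as one XML fragment.
SENTENCE = (
    '<s n="11">'
    '<w c5="NN1" hw="difficulty" pos="SUBST">Difficulty </w>'
    '<w c5="VBZ" hw="be" pos="VERB">is </w>'
    '<w c5="VBG" hw="be" pos="VERB">being </w>'
    '<w c5="VVN" hw="express" pos="VERB">expressed </w>'
    '<w c5="PRP" hw="with" pos="PREP">with </w>'
    '<w c5="AT0" hw="the" pos="ART">the </w>'
    '<w c5="NN1" hw="method" pos="SUBST">method </w>'
    '<w c5="TO0" hw="to" pos="PREP">to </w>'
    '<w c5="VBI" hw="be" pos="VERB">be </w>'
    '<w c5="VVN" hw="use" pos="VERB">used </w>'
    '<w c5="TO0" hw="to" pos="PREP">to </w>'
    '<w c5="VVI" hw="launch" pos="VERB">launch </w>'
    '<w c5="AT0" hw="the" pos="ART">the </w>'
    '<w c5="NN1" hw="scheme" pos="SUBST">scheme</w>'
    '<c c5="PUN">.</c>'
    '</s>'
)


def read_tokens(xml_fragment):
    """Return (word form, head word, C5 code, simple POS) for each <w>."""
    root = ET.fromstring(xml_fragment)
    return [
        (w.text.strip(), w.get("hw"), w.get("c5"), w.get("pos"))
        for w in root.iter("w")
    ]


tokens = read_tokens(SENTENCE)

# Plain text is recovered by joining the content of <w> and <c> in order.
plain_text = "".join(el.text for el in ET.fromstring(SENTENCE) if el.text)

for form, head_word, c5, pos in tokens:
    print(f"{form:12} {head_word:12} {c5:4} {pos}")
print(plain_text)
```

Note that spacing is stored inside the word elements, so the original text can be rebuilt without guessing where the blanks go.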
Some BNC-XML elements






<wtext> or <stext>
<div> = section
<p> = paragraph or <u> = utterance
<s> = “sentence”
<w> = word and <c> = punctuation
<mw> = multiword unit
What is the markup for?

It makes it possible for you to
 distinguish aids=SUBST from aids=VERB
 distinguish occurrences in writing from ones in speech
 distinguish occurrences in headings from ones in
paragraphs
 identify contextual units like sentences and paragraphs
 FACTSHEET WHAT IS AIDS?
AIDS (Acquired Immune Deficiency Syndrome) is a
condition caused by a virus called HIV (Human Immuno
Deficiency Virus).
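The first point (aids=SUBST versus aids=VERB) can be illustrated by counting a word form per simple POS code. A minimal sketch: the two-sentence sample is invented in the style of BNC-XML and is not real corpus data, the attribute values in it are assumptions, and count_by_pos is a made-up helper name.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample in the style of BNC-XML; not taken from the corpus.
SAMPLE = """<div>
<s n="1"><w c5="NN1" hw="aids" pos="SUBST">AIDS </w><w c5="VBZ" hw="be" pos="VERB">is </w><w c5="AT0" hw="a" pos="ART">a </w><w c5="NN1" hw="condition" pos="SUBST">condition</w><c c5="PUN">.</c></s>
<s n="2"><w c5="PNP" hw="it" pos="PRON">It </w><w c5="VVZ" hw="aid" pos="VERB">aids </w><w c5="NN1" hw="recovery" pos="SUBST">recovery</w><c c5="PUN">.</c></s>
</div>"""


def count_by_pos(xml_string, form):
    """Count occurrences of a word form, grouped by the pos attribute."""
    root = ET.fromstring(xml_string)
    counts = Counter()
    for w in root.iter("w"):
        # Compare case-insensitively, ignoring the trailing space in <w>.
        if w.text and w.text.strip().lower() == form:
            counts[w.get("pos")] += 1
    return counts


counts = count_by_pos(SAMPLE, "aids")
print(counts)
```

A plain text search would return both sentences; the markup is what lets the noun and the verb be told apart.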
Who uses the BNC (and how?)

Linguists
 Research on (English) language
Teachers
 Reference, Generate teaching materials, In classroom
Lexicographers, Publishers
 Dictionaries, EFL text books
NLP researchers, Language engineers
 Language + computer tools, AI, NLP
Students/language learners
Computer scientists
 Information retrieval
Psychologists/neurologists
 General ‘norm’ or reference
What makes the BNC so special?

Size
Design
General availability

Standardized markup system


 Structural annotation
 Word class annotation
 Contextual information

Model for other projects
...in these respects, the BNC remains
distinctive, twenty years on!
How to use the BNC (with Xaira)
The BNC can be used in different ways
and with different tools

User needs to know
 What information is available
 Where/how is information coded
XAIRA can help
Search for

Words or phrases
Word class information
Annotation/mark-up

or a combination of them


Display

Search term with context
 with or without mark-up



Information about text
Collocations (co-occurring words)
Distribution across parts of the corpus
and much more
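Two of the displays listed above, search term with context and collocations, are simple to sketch on a plain list of words. This is only an illustration of the ideas, not how Xaira computes them; the function names kwic and collocates, the sample sentence, and the window sizes are all assumptions made for the example.

```python
from collections import Counter


def kwic(tokens, node, width=3):
    """Key word in context: (left context, hit, right context) per match."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def collocates(tokens, node, window=2):
    """Count words occurring within `window` positions of the node word."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(w.lower() for w in neighbours)
    return counts


# Invented sample text, split on whitespace for simplicity.
TEXT = "the method to be used to launch the scheme is the method we prefer"
tokens = TEXT.split()

for left, hit, right in kwic(tokens, "method", width=2):
    print(f"{left:>10} [{hit}] {right}")
print(collocates(tokens, "method", window=1).most_common())
```

Raw co-occurrence counts favour frequent words such as "the"; real collocation displays usually add a statistical score to correct for that.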
XAIRA – XML-aware retrieval application




Searches an index of the corpus
Uses information in the headers and the texts
Often more than one way to make a search
Can be used with other corpora (if they are
indexed first)
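Searching an index rather than the texts themselves is what makes lookups fast on 100 million words. The general idea can be sketched as an inverted index mapping each word to the places where it occurs. This is a generic illustration only; Xaira's actual index format is not described in these slides, and the names build_index and TEXTS are invented.

```python
from collections import defaultdict


def build_index(texts):
    """Map each lower-cased word to a list of (text id, position) pairs."""
    index = defaultdict(list)
    for text_id, text in texts.items():
        for position, token in enumerate(text.lower().split()):
            index[token].append((text_id, position))
    return index


# Two invented mini-texts standing in for corpus texts.
TEXTS = {
    "A": "the scheme was launched",
    "B": "launch the scheme now",
}

index = build_index(TEXTS)

# A lookup is now a dictionary access, not a scan of every text.
hits = index.get("scheme", [])
print(hits)
```

The index has to be built once per corpus, which is why other corpora must be indexed before they can be searched.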