Uploaded by Muhammad Haseeb

BNC

What sort of corpus is the BNC?
Monolingual: It deals with modern British English, not other
languages used in Britain. However, non-British English and foreign
language words do occur in the corpus.
Synchronic: It covers British English of the late twentieth century,
rather than the historical development which produced it.
General: It includes many different styles and varieties, and is not
limited to any particular subject field, genre or register. In
particular, it contains examples of both spoken and written
language.
Sample: For written sources, samples of 45,000 words are taken
from various parts of single-author texts. Shorter texts, up to a
maximum of 45,000 words, and multi-author texts such as magazines
and newspapers are included in full. Sampling allows for wider
coverage of texts within the 100-million-word limit and avoids over-representing idiosyncratic texts.
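The sampling rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure the BNC compilers actually used: the function name select_words, its parameters, and the choice of a single contiguous sample starting at a random position are all hypothetical.

```python
import random

SAMPLE_LIMIT = 45_000  # words, the figure given in the description above


def select_words(words, multi_author=False, limit=SAMPLE_LIMIT, rng=None):
    """Return the part of a text that would enter the corpus (illustrative only)."""
    # Shorter texts and multi-author texts are included in full.
    if multi_author or len(words) <= limit:
        return list(words)
    # Assumption: one contiguous sample taken from a random position in the text.
    rng = rng or random.Random()
    start = rng.randrange(len(words) - limit + 1)
    return list(words[start:start + limit])


long_text = ["w"] * 120_000
sample = select_words(long_text, rng=random.Random(0))
```

Capping each long single-author text at the same size is what keeps any one text from dominating the word counts.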
British National Corpus
What is the British National Corpus?
The British National Corpus (BNC) is a 100-million-word collection of samples of written and
spoken British English from the later part of the 20th century.
The BNC consists of a larger written part (90 %, e.g. newspapers, academic books, letters,
essays) and a smaller spoken part (the remaining 10 %, e.g. informal conversations, radio
shows). The spoken part is also available in audio format.
The corpus texts carry a large amount of metadata, so users can restrict searches by
criteria such as time of publication, the region where a spoken text was recorded, type of
medium, text domain, or David Lee’s classification, a detailed genre specification.
The official website: http://www.natcorp.ox.ac.uk
Tools to work with the British National Corpus
A complete set of tools is available to work with the British National Corpus to generate:

word sketch – English collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
keywords – terminology extraction of one-word and multi-word units
word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency lists of multi-word units
concordance – examples in context
trends – diachronic analysis that automatically identifies neologisms and changes in use
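
To make two of these outputs concrete, here is a minimal Python sketch of an n-gram frequency list and a keyword-in-context concordance over a toy token list. It does not reflect how Sketch Engine implements these tools; the function names ngram_counts and concordance, the width parameter, and the sample sentence are assumptions made for illustration.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, width=3):
    """Return (left context, hit, right context) for each occurrence of keyword."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Toy data, not taken from the BNC.
tokens = "the cat sat on the mat and the cat slept".split()
bigrams = ngram_counts(tokens, 2)
lines = concordance(tokens, "cat", width=2)
```

Here bigrams.most_common() gives the frequency-ordered list of two-word units, and each entry in lines is one concordance line with up to two words of context on either side.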
Part-of-speech tagsets in BNC
Sketch Engine offers the BNC tagged with two different POS tagsets:

Penn TreeBank tagset
the tagset used in the CLAWS POS tagger version 5, with specific attributes:

ambtag: the ambivalent part of speech tag (all tags before disambiguation)

pos: one-letter abbreviation of the part of speech (the second part of lempos)

An attribute can refer to either:
a positional attribute – information added to each token in a corpus, e.g. its lemma or part of speech
a structure attribute – information added to a structure in a corpus, often called metadata
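
The difference between positional and structure attributes can be sketched as a small Python data model. This is a hedged illustration, not Sketch Engine's storage format: the class and field names are hypothetical, the sample values are invented, and the code assumes that lempos has the form lemma, hyphen, one-letter part of speech (for example house-n).

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """Positional attributes: one value per token."""
    word: str
    lemma: str
    tag: str     # disambiguated POS tag (illustrative CLAWS5-style value)
    ambtag: str  # tag(s) before disambiguation (illustrative value)
    lempos: str  # assumed form: lemma + "-" + one-letter POS

    @property
    def pos(self):
        # The one-letter POS is taken as the second part of lempos.
        return self.lempos.rsplit("-", 1)[1]


@dataclass
class Structure:
    """Structure attributes: metadata attached to a span of tokens."""
    name: str
    attributes: dict
    tokens: list = field(default_factory=list)


tok = Token(word="houses", lemma="house", tag="NN2", ambtag="NN2-VVZ", lempos="house-n")
# Hypothetical structure name and metadata key.
doc = Structure(name="doc", attributes={"medium": "book"}, tokens=[tok])
```

In this model a search on pos or ambtag looks at every token, while a search on medium filters whole documents before any token is examined.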