Definitions of corpus from OED 2

(Any comments directed to you appear in red font - also your two practical tasks for next
week)
CORPUS DEFINITION; COMPILATION CRITERIA; TYPOLOGY OF CORPORA
Synchronic perspective (no special ref to the history of corpora or corpus linguistics)
http://main.amu.edu.pl/~przemka/diplsem2002-3/diplsem2.html
http://xrefer.com
Why do we need corpora?
Four kinds of observational techniques (after Chafe 1992, in Svartvik ed. 1992 Directions in
Corpus Linguistics)
                 artificial                     natural
behavioral       experiments, elicitation       corpora, ethnography
introspective    semantics, judgments           daydreaming
                 regarding invented language
Why bother with a corpus?
Expert speakers have only partial knowledge
Corpus is more comprehensive and balanced
Expert speakers think of what is possible
Corpus shows us what is common and typical
Expert speakers cannot quantify their knowledge
Corpus can give us fairly accurate statistics
Expert speakers cannot make up natural examples
Corpus can give us many natural examples
"Korpus"
Kaszubski (forthcoming):
In Polish, the term „językoznawstwo korpusowe” ("corpus linguistics") sounds very awkward. The
word "korpus" has no established meaning in the Polish lexicon relating to a collection of data
or a collection of texts. The online version of the Słownik języka polskiego (Dictionary of the
Polish Language) published by PWN, Poland's leading academic publisher, omits this sense [1],
even though the publisher devotes some of its web pages directly to the so-called „Korpus
Języka Polskiego” (KJP, Corpus of the Polish Language) [2], where it even gives a definition of
the linguistic sense of the term korpus! [3] The adjective „korpusowy”, in turn, counting all
its inflected forms, appears on the Polish Internet about a hundred times, most of them in
fields connected with mechanical engineering and the military, and only just under 10 per cent
in the linguistic sense [4]. There can therefore be no doubt that the words „korpus” and
„korpusowy”, used with reference to a collection of texts, or even to a collection of data in
general, are meaningful in Poland only to a narrow circle of specialists, whereas to a wider
audience they will sound incomprehensible or even comical, because of the obvious association
with the main sense of the noun („tułów”, 'torso').

[1] At http://sjp.pwn.pl/ we learn that korpus is: "1. the body of a human or animal apart from
the limbs and head; the torso; 2. → garmond [= printing: a type size equal to ten typographic
points, PK]; 3. the main part of a building; in palace architecture: the central, formal part of
the building; in church architecture: the nave section of a church; 4. techn. the main part of a
device, machine, instrument etc., forming a whole; the body or casing; 5. mil. a large tactical
unit consisting of several divisions or brigades; it forms part of an army or may operate
independently". The adjective „korpusowy” is described very briefly as derived from the noun
„korpus”.

[2] Http://slowniki.pwn.pl/korpus/ or http://korpus.pwn.pl.

[3] The definition is in any case imprecise ("A corpus is any collection of texts in which we
look for something"), since it assumes that the material may be chosen arbitrarily and thereby
plays down the importance of (statistical) representativeness, which ought to characterize a
language corpus as opposed to an ordinary text archive of the kind used, for example, in natural
language processing, which I discuss a little further on. Only on one of PWN's later pages
(http://korpus.pwn.pl/powstawanie.php) does a fuller discussion of this issue appear ("A corpus
of texts must be suitably balanced in terms of genre, chronology, style, territory and in other
respects, e.g. the age and sex of the authors. It is precisely the structure assumed in advance
and the type of search engine that distinguish scholarly corpora from other large collections of
texts, such as the online archives of daily newspapers or the general resources of the web.").
In my opinion, however, this explanation comes too late, since representativeness is a defining
feature of a corpus and should be included in the main definition.

[4] On 2 August 2002 the Google search engine returned exactly 90 hits, which included: 1)
phrases relating to the main machine-related sense (= „kadłub”, 'body, casing'), such as
„ciśnienie korpusowe” (in an engine), „(hydrauliczna) prasa korpusowa”, „płytka korpusowa” (of a
guitar, also of cutting inserts), „części korpusowe” (for a tractor etc.), „detale korpusowe”
(for production/machining on grinders), „elementy korpusowe” (of an engine, machine tools etc.),
„połączenia korpusowe” (referring to furniture carcasses), „konstrukcja korpusowa” (of the
carcass of a free-standing wardrobe; may also refer to fittings for folding-sliding door
systems); 2) related uses connected with jewellery-making and the manufacture of smaller
utility objects: „artykuły korpusowe” (i.e. anything placed on a table, e.g. cruet sets,
candlesticks), „świecznik korpusowy”, „srebra korpusowe”, „wyroby/obiekty korpusowe” (referring
to massive jewellery); 3) phrases connected with the military: „korpusowa stacja telefoniczna”
(i.e. belonging to a given army corps), „korpusowy pododdział”, „oficerowie korpusowi”, „okręg
korpusowy”, „dowódca Okręgu Korpusowego”. Linguistic occurrences include e.g. PWN's „strony
korpusowe” ('corpus pages') (http://slowniki.pwn.pl/poradnia/lista.php?kat=16&od=20); „uboga
reprezentacja korpusowa” ('poor corpus representation'; in Marek Łaziński's publication
„Rozróżnianie znaczeń synonimicznych w korpusie tekstów”,
http://sjikp.us.edu.pl/ps/ps_29_05.html). Interestingly, the phrase „lingwistyka korpusowa” can
also be found on the Internet (e.g. on the pages of the Polish-English PELCRA project,
http://www1.uni.lodz.pl/pelcra/index-pl.htm), while „językoznawstwo korpusowe” is almost
entirely absent; the distinction between synonyms of this kind is, however, a topic for a
separate corpus analysis.
"Corpus" (Kaszubski forthcoming - ctnd):
Unlike Polish, the English lexicon records two senses of the word corpus that justify the term
corpus linguistics. In the second edition of the Oxford English Dictionary (OED 2nd ed.) we
read that a corpus is, first, a collection or set of texts or other materials by a given author
(e.g. the Shakespeare corpus), or else the whole of the literature written on a given subject
("A body or collection of writings or the like; the whole body of literature on any subject").
The second sense defines corpus as a collection of written or spoken language material used as
the basis for research on language ("the body of written or spoken material upon which a
linguistic analysis is based") [5]. It is from this sense that the term corpus linguistics
directly derives (Aston & Burnard 1998: 4), although it should be stressed that, as I have
mentioned, corpus linguistics narrows the definition of the term "corpus" still further. A
language corpus is not just any collection of language data but a sample of language
representative of a specified population or group, stored and studied in the form in which it
was naturally (spontaneously) used, often with information about its sociological context
preserved. A collection of texts called a corpus must therefore be characterized not only by
sufficient size, measured in number of words, to guarantee the accuracy of quantitative
research, but also by quality of selection, that is, by conformity to extra-textual criteria
specifying which variety of language the corpus represents. A well-compiled corpus allows the
quantitative data obtained from it to be properly interpreted and extrapolated to the
linguistic reality beyond the corpus.

[5] The second sense sometimes overlaps with the first, e.g. when we analyse the linguistic
features characterizing the writing style of a particular author.
OED 2 (excerpts):
corpus ... pl. corpora ... 1. The body of a man or animal. (Cf. corpse.) Formerly
frequent; now only humorous or grotesque. 1854 Villikins & his Dinah (in Mus. Bouquet, No.
452), He kissed her cold corpus a thousand times o'er. 2. Phys. A structure of a special
character or function in the animal body, as corpus callosum [= ciało modzelowate,
spoidło wielkie mózgu, PK], the transverse commissure connecting the cerebral hemispheres;
corpus luteum L. luteus [= ciałko żółte, PK], --um yellow (pl. corpora lutea), a yellowish
body developed in the ovary from the ruptured Graafian follicle after discharge of the ovum.
3. A body or complete collection of writings or the like; the whole body of
literature on any subject. 1727-51 Chambers Cycl. s.v., Corpus is also used in matters of
learning, for several works of the same nature, collected, and bound together.. We have also a
corpus of the Greek poets.. The corpus of the civil law is composed of the digest, code, and
institutes. 1865 Mozley Mirac. i. 16 Bound up inseparably with the whole corpus of Christian
tradition. 4. The body of written or spoken material upon which a linguistic
analysis is based. 1956 W. S. Allen in Trans. Philol. Soc. 128 The analysis here presented
is based on the speech of a single informant.. and in particular upon a corpus of material, of
which a large proportion was narrative, derived from approximately 100 hours of listening.
1964 E. Palmer tr. Martinet's Elem. General Linguistics ii. 40 The theoretical objection one
may make against the `corpus' method is that two investigators operating on the same
language but starting from different `corpuses', may arrive at different descriptions of the
same language. 1983 G. Leech et al. in Trans. Philol. Soc. 25 We hope that this will be
judged.. as an attempt to explore the possibilities and problems of corpus-based research by
reference to first-hand experience, instead of by a general survey. 5. The body or material
substance of anything; principal, as opposed to interest or income. 1884 Law Rep.
25 Chanc. Div. 711 If these costs were properly incurred they ought to be paid out of corpus
and not out of income. phr. corpus delicti (see quot. 1832); also, in lay use, the concrete
evidence of a crime, esp. the body of a murdered person.
Corpus - contemporary dictionary definitions
Cobuild 1987:
corpus, corpora, corpuses. (...) 1. A corpus is a large number of articles, books, magazines, etc
that have been deliberately collected together for some purpose; a formal or technical word.
EG We have been trying to collect a corpus of listening comprehension materials... ... that
classic corpus of law, the Code Napoleon. 2. See also habeas corpus.
OALD 2000:
corpus (pl. corpora or corpuses) technical a collection of written or spoken texts: a corpus of
100 million words of spoken English; the whole corpus of Renaissance poetry -- see also
HABEAS CORPUS.
CORPUS - final definition:
"a collection of pieces of language, selected and ordered according to explicit linguistic
criteria in order to be used as a sample of the language" (Sinclair 1996 - EAGLES96).
Defining characteristics (PK):
- naturally-occurring / authentic text/discourse
- usu. machine-readable / electronic / computer-stored and -processable
- compiled according to criteria (usu. external: e.g. geographical, sociological, medium type
  etc.), thus (ideally) representative of the sampled language variety
Compilation criteria
SOME OVERALL CONSIDERATIONS regarding representativeness:
- corpus size (type of corpus! / research purpose: representative of what?)
- (sizes of) subcorpora (if any)
- sampling (fragments or whole texts? - may depend on text genre)
- which text categories (production factor)
- which media (spoken/written - vehicle for transmission of language) (production factor)
- which mode (= format of presentation, e.g. dialogue vs monologue) (production factor)
- distribution of text audience sizes and types (reception factor) [cf. Sinclair's diagram in
  EAGLES96 text typology recommendations]
- author/speaker type: age / social background / education / gender etc.
- copyright (Kennedy: 76-8 + recent corpora list posting)
Text typology: internal & external criteria (EAGLES96 site
http://www.ilc.pi.cnr.it/EAGLES/home.html):
The earliest electronic corpora were designed using external criteria—reference to
institutionalised types of text, or features of the nonlinguistic environment or society in which
the texts occurred. More recently, some internal criteria — differentiating features of the
language of the texts — have been offered by researchers. This work suggests that a
thorough classification of texts, an adequate typology, will eventually consist of a
balanced combination of the two types of criteria. Many internal and external criteria
reflect each other. A text that showed a high average sentence length would be likely to be of
one kind of book or magazine rather than another.
MOST CORPORA STILL USE EXTERNAL CRITERIA ONLY (PK)
Categories of spoken text (see e.g. Kennedy 1998: 72; see also BNC spoken part composition
below)
Example: Brown corpus (Kucera & Francis 1964): basic criteria
a) one million words
b) divided roughly evenly into genres (15 text categories: see g:\corpora\texts\icamepk\brown1).
Homework 1: identify the 15 text categories represented in the Brown corpus, using the data on
the IFA local network.
c) 500 samples
d) 2,000 words in each
e) written published sources
(more in Kennedy 1998: 24-6; also there -- structure of SEU: 18, Longman-Lancaster Corpus:
49, some BNC: 51; ICE: 55)
Example: The Bank of English: reference corpus of 167 million words (1996).
Subcorpora:
- Newspapers: 43 million words
- Books: 37 million
- Magazines: 38 million
- Radio: 39 million
- Ephemera (a wide variety of pamphlets and brochures): 1.5 million
- Informal Spoken: 8.5 million
Further subcorpora, e.g. UK is a subcorpus of Newspapers; then Components, such as The
Times (10 million), The Guardian (12 million)
Example: British National Corpus (BNC):
The 100-million-word BNC would take 4 years to read aloud, at 8 hours a day. The Associated Press
newswire, by comparison, generates some 50 million words per year. The overall size of the
BNC corresponds to roughly 10 years of linguistic experience of the average speaker in terms
of quantity --- though not, of course, in quality, given that it aims to sample the language as a
whole, rather than that experienced by any particular type of speaker.
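The size comparisons above are plain arithmetic, so they can be checked with a few lines of Python. This is only a sketch: the function name and the assumption of 365 reading days per year are mine, not part of the BNC documentation.

```python
def implied_reading_rate(total_words, years, hours_per_day, days_per_year=365):
    """Words per minute needed to read `total_words` aloud in the given time."""
    minutes = years * days_per_year * hours_per_day * 60
    return total_words / minutes


BNC_WORDS = 100_000_000          # size of the BNC as quoted above
AP_WORDS_PER_YEAR = 50_000_000   # Associated Press newswire output as quoted above

# Speaking rate implied by "4 years at 8 hours a day".
rate = implied_reading_rate(BNC_WORDS, years=4, hours_per_day=8)
print(f"Implied reading rate: {rate:.1f} words per minute")

# How many years of AP newswire output the BNC corresponds to.
ap_years = BNC_WORDS / AP_WORDS_PER_YEAR
print(f"BNC = {ap_years:.0f} years of AP newswire output")
```

The implied rate comes out at a little over 140 words per minute, which is a plausible pace for reading aloud, so the "4 years" figure looks internally consistent.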
Most samples in the BNC are of between 40,000 and 50,000 words; published texts are rarely
complete. There is, however, considerable variation in size, caused by the exigencies of
sampling and availability. In particular, most spoken demographic texts, which consist of
casual conversations, are rather longer, since they were formed by grouping together all the
speech recorded by a single informant. Conversely, several texts containing samples of
written unpublished materials such as school essays or office memoranda are very short.
Corpus composition The BNC was designed to characterize the state of contemporary
British English in its various social and generic uses. In selecting texts for inclusion in the
corpus, account was taken of both production, by sampling a wide variety of distinct types of
material, and reception, by selecting instances of those types which have a wide distribution.
Written texts Ninety per cent of the BNC is made up of written texts, chosen according to
three selection features: domain (subject field), time (within certain dates) and medium
(book, periodical, unpublished, etc.). In this way, it was hoped to maximize variety in the
language styles represented, both so that the corpus could be regarded as a microcosm of
current British English in its entirety, and so that different styles might be compared and
contrasted. Each selection feature was divided into classes and target percentages were set
for each class. Thus for the selection feature `medium', five classes (books, periodicals,
miscellaneous published, miscellaneous unpublished, and written-to-be-spoken) were
identified. Samples were then selected in the following proportions: 60 per cent from books,
30 per cent from periodicals, 10 per cent from the remaining three miscellaneous sources.
Similarly, for the selection feature `domain', 75 per cent of the samples were drawn from
texts classed as `informative', and 25 per cent from texts classed as `imaginative'.
The evidence from catalogues of books and periodicals suggests that imaginative texts
account for less than 25 per cent of published output. Correspondence, reference works,
unpublished reports, etc. add further to the bulk of informative text which is produced and
consumed. Nevertheless, the overall distribution between informative and imaginative text
samples in the BNC was set to reflect the influential cultural role of literature and creative
writing. The target percentages for the eight informative domains were arrived at by
consensus within the project, based loosely upon the pattern of book publishing in the UK
during the past 20 years or so.
DOMAIN                       texts  % of texts       words  % of words
Imaginative                    625       19.47    19664309       21.91
Arts                           259        8.07     7253846        8.08
Belief_and_thought             146        4.54     3053672        3.40
Commerce_and_finance           284        8.85     7118321        7.93
Leisure                        374       11.65     9990080       11.13
Natural_and_pure_science       144        4.48     3752659        4.18
Applied_science                364       11.34     7369290        8.21
Social_science                 510       15.89    13290441       14.80
World_affairs                  453       14.11    16507399       18.39
Unclassified                    50        1.55     1740527        1.93

TIME                         texts  % of texts       words  % of words
1960-1974                       53        1.65     2036939        2.26
1975-1993                     2596       80.89    80077473       89.23
Unclassified                   560       17.45     7626132        8.49

MEDIUM                       texts  % of texts       words  % of words
Book                          1488       46.36    52574506       58.58
Periodical                    1167       36.36    27897931       31.08
Misc. published                181        5.64     3936637        4.38
Misc. unpublished              245        7.63     3595620        4.00
To-be-spoken                    49        1.52     1370870        1.52
Unclassified                    79        2.46      364980        0.40
`Miscellaneous published' includes brochures, leaflets, manuals, advertisements.
`Miscellaneous unpublished' includes letters, memos, reports, minutes, essays.
`Written-to-be-spoken' includes scripted television material, play scripts etc.
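The percentage columns in the written-text figures above can be recomputed from the raw counts. The sketch below does this for the MEDIUM figures; the counts are copied from the handout, while the dictionary layout and the function name are my own. The published percentages appear to be truncated rather than rounded to two decimals (1488 of 3209 texts is 46.3696... per cent, given as 46.36), so the sketch truncates too.

```python
# BNC written texts by medium: (number of texts, number of words),
# copied from the MEDIUM figures above.
MEDIUM = {
    "Book": (1488, 52574506),
    "Periodical": (1167, 27897931),
    "Misc. published": (181, 3936637),
    "Misc. unpublished": (245, 3595620),
    "To-be-spoken": (49, 1370870),
    "Unclassified": (79, 364980),
}


def truncated_shares(counts):
    """Return each category's share of the total in per cent,
    truncated (not rounded) to two decimal places."""
    total = sum(counts.values())
    # Integer arithmetic avoids floating-point surprises when truncating.
    return {name: (10000 * n // total) / 100 for name, n in counts.items()}


text_shares = truncated_shares({k: v[0] for k, v in MEDIUM.items()})
word_shares = truncated_shares({k: v[1] for k, v in MEDIUM.items()})

for name in MEDIUM:
    print(f"{name:<20}{text_shares[name]:>8.2f}{word_shares[name]:>8.2f}")
```

Comparing the two columns shows why both are reported: books make up under half of the written texts but well over half of the words, because book samples are longer than the others.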
BNC corpus composition - cntd
Written texts are further classified in the corpus according to sets of descriptive
features, e.g.
Attribute                              Values
written_age (of author)                under 15; 15-24; 25-34; 35-44; 45-59; 60 or over
written_audience                       child; teenage; adult; any
written_domain                         imaginative; natural and pure sciences; applied sciences;
                                       social science; world affairs; commerce and finance;
                                       arts; belief and thought; leisure
written_domicile (of author)           country or region
written_gender (of target audience)    male; female; mixed; unknown
written_level (of circulation)         low; medium; high
written_medium                         book; periodical; misc published; misc unpublished;
                                       to-be-spoken
written_place (of publication)         country or region
written_pubstatus                      published; unpublished
written_sample                         whole text; beginning sample; middle sample; end sample;
                                       composite
written_selection                      selective; random
written_sex (of author)                male; female; mixed; unknown
written_status (of reception)          low; medium; high
written_time                           1960-1974; 1975-1993
written_type (of author)               corporate; multiple; sole; unknown
Spoken texts
Ten percent of the BNC is made up of transcribed spoken material, totalling about 10 million
words. Roughly equal quantities were collected in each of two different ways:
- a demographic component of informal encounters recorded by a socially-stratified sample of
  respondents, selected by age group, sex, social class and geographic region;
- a context-governed component of more formal encounters (meetings, debates, lectures,
  seminars, radio programmes and the like), categorized by topic and type of interaction.
The classifications apply to both the demographic and context-governed components:
Region_where_text_captured   texts  % of texts       words  % of words
South                          296       32.34     4728472       45.61
Midlands                       208       22.73     2418278       23.33
North                          334       36.50     2636312       25.43
Unclassified                    77        8.41      582402        5.61

Interaction_type             texts  % of texts       words  % of words
Monologue                      218       23.82     1932225       18.64
Dialogue                       672       73.44     7760753       74.87
Unclassified                    25        2.73      672486        6.48
The context-governed component consists of 762 texts (6.1 million words). Main criterion: domain.

DOMAIN                       texts  % of texts       words  % of words
Educational_and_informative    144       18.89     1265318       20.56
Business                       136       17.84     1321844       21.47
Institutional                  241       31.62     1345694       21.86
Leisure                        187       24.54     1459419       23.71
Unclassified                    54        7.08      761973       12.38
Each of these categories was divided into the subcategories monologue (40 per cent) and
dialogue (60 per cent), and within each category a range of contexts defined as follows:
educational and informative   Lectures, talks and educational demonstrations; news
                              commentaries; classroom interaction etc.
business                      Company and trades union talks or interviews; business
                              meetings; sales demonstrations etc.
institutional                 Political speeches; sermons; local and national
                              governmental proceedings etc.
leisure                       Sports commentaries; broadcast chat shows and phone-ins;
                              club meetings and speeches etc.
The overall aim was to achieve a balanced selection within each category, taking into account such features as
region, level, gender of speakers, and topic. Since the length of these text types varies considerably --- news
commentaries may be only a few minutes long, while some business meetings and parliamentary proceedings
may last for hours --- an upper limit of 10,000 words per text was generally imposed.
KORPUS JĘZYKA POLSKIEGO PWN (http://korpus.pwn.pl/) - homework 2: determine and report on the
composition of the corpus (the online sample of the PWN corpus); how do you assess the
representativeness of the corpus, given that:
"Compared with other corpora around the world, our collection contains quite a lot of literary
texts. This is because we decided to take into account the tradition, particularly alive in
Poland, of cultural authority as a criterion of linguistic correctness. The initial core of our
corpus consisted of several dozen works of twentieth-century literary classics: prose, drama,
and also poetry (although poetic texts are often omitted from other corpora as unnatural)"
(this section will be covered by me in next class, but read it please)
From: Adam Kilgarriff <adam.kilgarriff@itri.brighton.ac.uk>
Subject: [Corpora-List] Legal aspects of corpora compiling
On 25 Sept Rafal L. Górski asked:
>
> Does anybody know about research on legal aspects of corpora compiling
> (copyright restrictions).
A short answer:
to be unequivocally, completely, totally in the clear you need to get copyright clearance from all
copyright holders (publishers and/or authors, all speakers for spoken material). Some will give it to
you, others won't, and it is a lot of work to gather. (I attended a rather nice talk on BNC copyright
issues titled "Ladies love lupins". Sometimes, the only way to get the copyright clearance sought was
to take the lady concerned a bunch of flowers.)
HOWEVER
the law is in its infancy and there is very little which is obviously right or wrong/legal or illegal. If
you have an enemy with rich enough lawyers, you will always be found in the wrong (cf Napster -
when you're up against the music business it's apparently illegal even to tell someone where they
might find something) so it's pointless viewing the law as a set of rules. Rather, you have to avoid
doing things which someone who is rich and inclined to sue is going to view as provocative.
Considerations:
1) PUBLISHING
the issue is heavier if you are going to publish/ copy on the data than if you are not. If it's only for
in-house use, then one simple issue is "who will ever know", and it is not clear that, eg, downloading a
report onto your PC's desktop is any different to downloading it into a corpus. Copyright law is in
general about the case where someone makes money from selling intellectual property: if you are
going to sell a corpus, the issues need taking very seriously, as people will be upset by you making
money out of selling their text (unless you give them a share).
2) EXTRACT SIZE
the issue is heavier, the larger the extracts you take. There is a traditional exemption from copyright
for short extracts, so eg you can take brief quotes, eg in a review or academic book, without asking
permission. There are different opinions about how much you can quote. If you are quoting a short
poem, you couldn't quote it all on the grounds that there weren't many words, so the definition of
'short' has to do with 'as a proportion of the whole' as well as absolute length. As a general principle,
keep extracts short. (In one project, we used "3000 words or one third of the document, whichever is
the shorter")
3) BE COOPERATIVE
avoid including anything where there is an explicit reason not to. In the context of the web, 'no robots'
convention allows authors to say they don't want their page to be viewed by robots. One should also
read this as "keep off" from the point of view of corpus compilation. Some literary authors are
notoriously litigious.
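Two of the considerations in the message above are mechanical enough to sketch in code: the extract-size rule ("3000 words or one third of the document, whichever is the shorter") and the 'no robots' convention. The sketch uses Python's standard `urllib.robotparser` module; the function names, the agent name "corpus-builder" and the example.org addresses are invented for illustration, and none of this is legal advice.

```python
from urllib.robotparser import RobotFileParser


def max_extract_words(document_words, cap=3000, fraction=3):
    """Longest extract allowed under the rule quoted above:
    `cap` words or 1/`fraction` of the document, whichever is shorter."""
    return min(cap, document_words // fraction)


def allowed_by_robots(robots_lines, url, agent="corpus-builder"):
    """Check a URL against the lines of a site's robots.txt file.
    `agent` is a hypothetical name for our own collection script."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)


# A long report is capped at 3000 words; a short leaflet at one third.
print(max_extract_words(12000))
print(max_extract_words(600))

# Invented robots.txt content asking all robots to keep out of /private/.
robots = ["User-agent: *", "Disallow: /private/"]
print(allowed_by_robots(robots, "http://example.org/private/essay.html"))
print(allowed_by_robots(robots, "http://example.org/public/essay.html"))
```

In a real collection script the robots.txt lines would be downloaded from the site rather than typed in. The message also recommends treating a 'no robots' request as a general "keep off", so a page that fails the check is best left out even if it is collected by hand.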
Corpus size vs dictionary size (this section will be covered by me in next class, but read it
please):
From: "Ramesh Krishnamurthy" <ramesh@easynet.co.uk>
Subject: [Corpora-List] Corpus size for lexicography
Date: Tue, 1 Oct 2002 00:32:07 +0100
Dear Robert Amsler
I am concerned that your statements regarding corpus sizes for lexicographic purposes might
be *highly* misleading, at least for English:
> 1 million words we now know to be quite small
> (adequate only for a Pocket Dictionary worth of entries).
> Collegiate dictionaries require at least a 10 million word corpus, and
> Unabridged dictionaries at least 100 million words (the target of the ANC).
1. From my experience while working for Cobuild at Birmingham University:
a) approx. half of the types/wordforms in most corpora have only one token (i.e. occur only
once): e.g. 213,684 out of 475,633 in the 121m corpus (1993); 438,647 out of 938,914 in the
418m corpus (2000).
b) dictionary entries cannot be based on one example; so let us say you need at least 10
examples (a very modest figure; in fact, as our corpus has grown, and our software and
understanding has become more sophisticated, the minimum threshold increases for some
linguistic phenomena, as we find that we often require many more examples before particular
features/patterns even become apparent, or certain statistics become reliable)
c) many types with 10+ tokens will not be included in most dictionaries (e.g. numerical
entities, proper names, etc; some may be included in the dictionary, e.g. 24-7, the White
House, etc, depending on editorial policy; the placement problem for numerical
entities is a separate issue)
d) there are roughly 2.2 types per lemma (roughly equal to a dictionary headword) in English
(the lemma "be" has c. 18 types, including some archaic ones and contractions; most verbs
have 4 or 5 types; at the other end of the scale, many uncount nouns and adjectives, most
adverbs and grammatical words, have only one type); of course some types may belong to
lemmas, but will need to be treated as headwords in their own right, for sound lexicographic
reasons.
2. Calculating potential dictionary headwords from corpus facts and figures:
a) In the 18m Cobuild corpus (1986), there were 43,579 types with 10+ tokens.
Dividing by 2.2, we get c. 19,800 lemmas with 10+ tokens, i.e. potential dictionary
headwords
b) In the 120m Cobuild Bank of English corpus (1993), there were 99,326 types with 10+
tokens = c. 45,150 headwords
c) In the 450m Bank of English corpus (2001), there were 204,626 types with 10+ tokens = c.
93,000 headwords
I don't think the Cobuild corpora are untypical for such rough calculations.
3. Some dictionary figures:
It is difficult to gauge from dictionary publishers' marketing blurbs exactly how many
headwords are in their dictionaries, but here are a few figures taken from the Web today
(unless otherwise stated).
a) Pocket:
Webster's New World Pocket = 37,000 entries
b) Collegiate:
New Shorter OED: 97,600 entries
Oxford Concise: 220,000 words, phrases and meanings
Webster's New World College: 160,000 entries
(cf Collins English Dictionary 1992: 180,000 references)
c) Unabridged:
OED: 500,000 entries
Random House Webster's Unabridged: 315,000 entries
(cf American Heritage 1992: 350,000 entries/meanings)
d) EFL Dictionaries
(cf Longman 1995: 80,000 words/phrases)
(cf Oxford 1995: 63,000 references)
(cf Cambridge 1995: 100,000 words/phrases)
(cf Cobuild 1995: 75,000 references)
4. So, by my reckoning, the 100m-word ANC corpus (yielding less than 45,000 potential
headwords) will be adequate for a Pocket Dictionary, but will struggle to meet Collegiate
requirements, and will be totally inadequate as the sole basis for an Unabridged Dictionary (if
that really is the ANC's aim).
Surely we will need corpora in the billions of words range before we can start to compile truly
corpus-based Unabridged dictionaries. Until then, corpora can assist us in most lexicographic
and linguistic enterprises, but we cannot say that they are adequate in size. It is no
coincidence that corpora first became used for EFL lexicography, where the requirement in
number of headwords is more modest. But even here, it took much larger corpora to give us
reliable evidence of the range of meanings, grammatical patterning and collocational
behaviour of all but the most common words.
I have no wish to disillusion lexicographers working with smaller corpora. Cobuild's initial
attempts in corpus lexicography entailed working with evidence from corpora of 1m and 7m
words. Many of those analyses remain valid in essence, even when checked in our 450m word
corpus. But we now have a better overview, and many more accurate details. Smaller corpora
can be adequate for more restricted investigations, such as domain-specific dictionaries, local
grammars, etc. But for robust generalizations about the entire lexicon, the bigger the corpus
the better.
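The headword estimates in point 2 of the message above follow from one division (types with 10+ tokens divided by 2.2 types per lemma), and the "approx. half" claim in point 1a from another. A minimal sketch that reproduces them; all figures are copied from the message, while the variable and function names are mine.

```python
TYPES_PER_LEMMA = 2.2  # rough English average quoted in point 1d

# Types with 10 or more tokens, as reported in point 2.
TYPES_WITH_10_PLUS = {
    "18m Cobuild corpus (1986)": 43_579,
    "120m Bank of English (1993)": 99_326,
    "450m Bank of English (2001)": 204_626,
}


def potential_headwords(types_with_10_plus, types_per_lemma=TYPES_PER_LEMMA):
    """Estimate the number of lemmas (potential dictionary headwords)."""
    return types_with_10_plus / types_per_lemma


def hapax_share(hapax_types, all_types):
    """Percentage of types that occur only once in the corpus."""
    return 100 * hapax_types / all_types


for corpus, types in TYPES_WITH_10_PLUS.items():
    print(f"{corpus}: c. {potential_headwords(types):,.0f} potential headwords")

# Point 1a: types occurring once, out of all types.
print(f"121m corpus: {hapax_share(213_684, 475_633):.1f}% of types occur once")
print(f"418m corpus: {hapax_share(438_647, 938_914):.1f}% of types occur once")
```

The results agree with the rounded figures in the message (c. 19,800, 45,150 and 93,000). Note that a 25-fold increase in corpus size, from 18m to 450m words, yields fewer than five times as many potential headwords, which is the core of the argument for very large corpora.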
corpus vs WWW
corpus linguistics vs NLP
usage vs use (OALD 2000)
usage (n): 1. the way in which words are used in a language: current English usage; It's not a
word in common usage; 2. the fact of sth being used; how much sth is used: land usage
use (n): 1. the act of using sth; the state of being used: the use of chemical weapons; 2. a
purpose for which sth is used; a way in which sth is or can be used: I'm sure you'll think of a
use for it; a wide variety of uses;
--------------------
Oxford Companion to the English Language 1992
USAGE
[13c: an adoption of Old French usage, corresponding to medieval Latin usaticum, from Latin
usus use]. (1) In general terms, the customary or habitual way of doing something; what is
done in practice as distinct from principle or theory; an instance or application of this, a
customary practice, as in religion, commerce, military matters, and diplomacy. (2) In
linguistic terms, the way in which the elements of language are customarily used to produce
meaning; this includes accent, pronunciation, spelling, punctuation, words, and idioms. It
occurs neutrally in such terms as formal usage, disputed usage, and local usage, and it has
strong judgemental and prescriptive connotations in such terms as bad usage, correct usage,
usage and abusage, and usage controversies.
CORPUS EVIDENCE:
Lexical information (in case sth isn't clear, bring it up in next class):
- How often does a particular word-form, or group of forms (such as the various forms
  of the verb 'start': 'start', 'starts', 'starting', 'started') appear in the corpus? Is 'start'
  more or less common than 'begin'? [z-score]
- With what meanings is a particular word-form, or group of forms, used? Is 'back'
  more frequently used with reference to a part of the body or a direction? Do we 'start'
  and 'begin' the same sorts of things?
- How often does a particular word-form, or group of forms, appear near to other
  particular word-forms, which collocate with it within a given distance? Does
  'immemorial' always have 'time' as a collocate? Is it more common for 'prices' to
  'rise' or to 'increase'? Do different senses of the same word have different collocates?
- How often does a particular word-form, or group of forms, appear in particular
  grammatical structures, which colligate with it? Is it more common to 'start to do
  something', or to 'start doing it'? Do different senses of the same word have different
  colligates?
- How often does a particular word-form, or group of forms, appear in a certain
  semantic environment, showing a tendency to have positive or negative connotations?
  Does the intensifier 'totally' always modify verbs and adjectives with a negative
  meaning, such as 'fail' and 'ridiculous'? [semantic prosody, PK]
- How often does a particular word-form, or group of forms, appear in a particular type
  of text, or in a particular type of speaker or author's language? Is 'little' or 'small' more
  common in conversation? Do women say 'sort of' more than men? Does the word
  'wicked' always have positive connotations for the young? Is the word 'predecease'
  found outside legal texts and obituaries? Do lower-class speakers use more (or
  different) expletives?
- Whereabouts in texts does a particular word-form, or group of forms, tend to occur?
  Does its meaning vary according to its position? How often does it occur within notes
  or headings, following a pause, near the end of a text, or at the beginning of a
  sentence, paragraph or utterance? And is it in fact true that 'and' never begins a
  sentence?
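The frequency and collocation questions above can be tried out on a small scale. What follows is only a minimal sketch, assuming a toy corpus held in a single string and very naive tokenisation; the helper names (tokenize, form_frequency, collocates) and the sample sentences are invented for illustration and do not come from any corpus tool mentioned in these notes.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into word tokens (naive: letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def form_frequency(tokens, forms):
    """Count how often any of the given word-forms occurs in the token list."""
    wanted = set(forms)
    return sum(1 for tok in tokens if tok in wanted)

def collocates(tokens, node, span=4):
    """Count the words found within `span` tokens to the left or right of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Toy corpus, invented for illustration only (an assumption, not real corpus data).
corpus = (
    "Prices rise every year. Prices rise when demand starts to grow. "
    "She started the car and prices began to increase."
)
tokens = tokenize(corpus)

# Word-forms grouped by hand; a real study would use a lemmatised corpus.
start_forms = ["start", "starts", "starting", "started"]
begin_forms = ["begin", "begins", "beginning", "began", "begun"]

print("start:", form_frequency(tokens, start_forms))
print("begin:", form_frequency(tokens, begin_forms))
print("collocates of 'prices':", collocates(tokens, "prices", span=2).most_common(3))
```

Raw counts like these only become meaningful on a corpus of realistic size, and a window-based count does not by itself show that a collocation is statistically significant.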
Morphosyntactic information (in case sth isn't clear, bring it up in next class):
- How frequent is a particular morphological form or grammatical structure? How much
  more common are clauses with active than with passive main verbs? What proportion
  of passive forms have the agent specified in a following 'by' phrase?
- With what meanings is a particular structure used? Is there a difference between 'I
  hope that' and 'I hope to'?
- How often does a particular structure occur with particular collocates or colligates? Is
  'if I was you' or 'if I were you' more common?
- How often does a particular structure appear in a particular type of text, or in a
  particular type of speaker or author's language? Are passives more common in
  scientific texts? Is the subjunctive used less by younger speakers?
- Whereabouts in texts does a particular structure tend to occur? Do writers and
  speakers tend to switch from the past tense to the 'historic present' at particular points
  in narratives?
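A question such as whether 'if I was you' or 'if I were you' is more common can be approximated by counting the two competing phrases. The sketch below assumes a plain-text (unannotated) sample; the function count_phrase and the sample text are hypothetical and invented for illustration. Questions about passives or agents would need a POS-tagged or parsed corpus instead.

```python
import re

def count_phrase(text, phrase):
    """Count case-insensitive, whole-word occurrences of a multi-word phrase."""
    words = [re.escape(w) for w in phrase.split()]
    # Allow any amount of whitespace (including line breaks) between the words.
    pattern = r"\b" + r"\s+".join(words) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

# Invented sample text (an assumption, not real corpus data).
sample = (
    "If I were you, I would wait. Well, if I was you I'd go now. "
    "If I were you I'd ask first."
)

were = count_phrase(sample, "if I were you")
was = count_phrase(sample, "if I was you")
total = were + was

# Report each variant as a count and as a share of all occurrences of the structure.
print(f"if I were you: {were} ({were / total:.0%})")
print(f"if I was you:  {was} ({was / total:.0%})")
```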
Semantic or pragmatic information (in case sth isn't clear, bring it up in next class):
- What tools are most frequently referred to in texts talking about gardening?
- What fields of metaphor are employed in economic discourse?
- Do the upper-middle classes talk differently about universities from the working
  classes?
- How do speakers close conversations, or open lectures? How do chairpersons switch
  from one point to another in meetings?
- Are pauses in conversation more common between utterances than within them?
- What happens when conversationalists stop laughing?
COMPARING CORPORA (in case sth isn't clear, bring it up in next class):
While the examples just cited have all concerned analyses within a particular corpus, it is
evident that all these areas can also be examined contrastively, comparing data from corpora
of different languages, historical periods, dialects or geographical varieties, modes (spoken or
written), or registers. By comparing one of the standard corpora collected twenty years ago
with an analogous corpus of today, it is possible to investigate recent changes in English. By
comparing corpora collected in different parts of the world, it is possible to investigate
differences between, for instance, British and Australian English. By comparing a corpus of
translated texts with one of texts originally created in the target language, it is possible to
identify linguistic properties peculiar to translation. By comparing a small homogeneous
corpus of some particular kind of material with a large balanced corpus (such as the BNC), it
is possible to identify the distinctive linguistic characteristics of [the special type/genre of
text, PK].
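Because corpora differ in size, comparing them usually starts from normalised frequencies rather than raw counts. The sketch below shows one possible way to do this, assuming we already know a word's raw frequency and the total token count in each corpus. It uses the log-likelihood statistic, which is a common choice in corpus comparison but is not prescribed by these notes; the function names and all the figures are invented for illustration.

```python
import math

def per_million(freq, corpus_size):
    """Normalise a raw frequency to occurrences per million tokens."""
    return freq * 1_000_000 / corpus_size

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood score for one word's frequency in corpus A versus corpus B."""
    total = freq_a + freq_b
    # Expected frequencies if the word were equally common in both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    if freq_a > 0:
        score += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        score += freq_b * math.log(freq_b / expected_b)
    return 2 * score

# Invented figures: a small specialised corpus against a large reference corpus.
small_freq, small_size = 120, 50_000
large_freq, large_size = 1_500, 100_000_000

print("specialised corpus, per million:", per_million(small_freq, small_size))
print("reference corpus, per million:  ", per_million(large_freq, large_size))
print("log-likelihood:", round(log_likelihood(small_freq, small_size, large_freq, large_size), 2))
```

A score of zero means the word is proportionally equally frequent in both corpora; the larger the score, the less likely it is that the difference is due to chance.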
Typology of corpora (this section will be covered by me in next class):
(sources: Sinclair 1996, http://www.ilc.pi.cnr.it/EAGLES/corpustyp/corpustyp.html (Project EAGLES: Expert
Advisory Group on Language Engineering Standards); BNC handbook; Kennedy 1998: 13-23):
1) what they represent
2) how they are compiled
3) how they are stored

- pre-electronic (e.g. in biblical and literary studies, early lexicography, LT &
  grammar)
- electronic / machine-readable

- sample
- balanced
- sample-text vs full-text corpus

- synchronic (static) vs diachronic (dynamic) corpora
- monitor corpus

- written
- spoken
- mixed
- multimedia

- general or reference corpus (= balanced; also called core)
- specialised (particular research projects)
- NLP: training or test corpus

- geographical varieties (UK, US, Australian etc: NO corpus of ALL English, but ICE)
- historical varieties
- register/dialect/genre/topic etc-specific

- special (deviant from general norm - Sinclair 1996)
- child language
- learner language

- monolingual
- bilingual
- parallel / translation
- comparable

- non-annotated or unannotated or raw
- annotated (POS-tagged, parsed, error tagged etc.)