(Any comments directed to you appear in red font - also your two practical tasks for next week)

CORPUS DEFINITION; COMPILATION CRITERIA; TYPOLOGY OF CORPORA
Synchronic perspective (no special ref to the history of corpora or corpus linguistics)
http://main.amu.edu.pl/~przemka/diplsem2002-3/diplsem2.html
http://xrefer.com

Why do we need corpora?
Four kinds of observational techniques (after Chafe 1992, in Svartvik ed. 1992 Directions in Corpus Linguistics)

                 artificial                                         natural
behavioral       experiments, elicitation                           corpora, ethnography
introspective    semantics, judgments regarding invented language   daydreaming

Why bother with a corpus?

Expert speakers have only partial knowledge          Corpus is more comprehensive and balanced
Expert speakers think of what is possible            Corpus shows us what is common and typical
Expert speakers cannot quantify their knowledge      Corpus can give us fairly accurate statistics
Expert speakers cannot make up natural examples      Corpus can give us many natural examples

"Korpus"
Kaszubski (forthcoming): In Polish, the term „językoznawstwo korpusowe” ('corpus linguistics') sounds very awkward. The word "korpus" has no established meaning in the Polish lexicon that refers to a collection of data or a collection of texts. The online Słownik języka polskiego (Dictionary of the Polish Language) of PWN, Poland's most important academic publisher, omits this sense [1], even though the publisher devotes part of its website directly to the so-called „Korpus Języka Polskiego” (KJP) [2], where it even gives a definition of the linguistic meaning of the term korpus! [3] The adjective „korpusowy”, in turn, counting all its inflected forms, occurs on the Polish Internet about a hundred times; most of these occurrences concern machine construction and the military, and only just under 10 per cent carry the linguistic sense [4]. There can therefore be no doubt that the words „korpus” and „korpusowy”, used with reference to a collection of texts, or even to a collection of data in general, are meaningful in Poland only to a narrow circle of specialists, whereas to a wider audience they will sound incomprehensible, or even ridiculous, because of the inevitable associations with the main sense of the noun („tułów”, 'torso').

[1] At http://sjp.pwn.pl/ we learn that korpus is: "1. «the body of a human or animal apart from the limbs and head; the trunk»; 2. → garmond [= printing «a type size equal to ten typographic points», PK]; 3. «the main part of a building; in palace architecture: the central, representative part of the building; in church architecture: the nave of a church»; 4. techn. «the main part, forming a whole, of a device, machine, instrument etc.; the body (kadłub)»; 5. mil. «a large tactical unit consisting of several divisions or brigades; it forms part of an army or may operate independently»". The adjective „korpusowy” is described very briefly as derived from the noun „korpus”.
[2] http://slowniki.pwn.pl/korpus/ or http://korpus.pwn.pl.
[3] The definition is, moreover, imprecise ("Korpus to dowolny zbiór tekstów, w którym czegoś szukamy", i.e. "A corpus is any collection of texts in which we look for something"), since it assumes that the material may be selected arbitrarily and thereby plays down the importance of (statistical) representativeness, which should characterise a language corpus as opposed to an ordinary text archive of the kind used, for example, in natural language processing, which I discuss a little further on. Only on one of the later PWN pages (http://korpus.pwn.pl/powstawanie.php) does a fuller discussion of this issue appear ("A corpus of texts must be appropriately balanced in terms of genre, chronology, style, territory and in other respects, e.g. the age and sex of the authors. It is precisely the structure assumed in advance, and the type of search engine, that distinguish scholarly corpora from other large collections of texts, such as internet archives of daily newspapers or the general resources of the web."). This explanation, however, comes too late in my opinion, since representativeness is a defining feature of a corpus and should be included in the main definition.
[4] On 2 August 2002 the „Google” search engine returned exactly 90 hits, among which were: 1) phrases concerning the main machine-related sense (= „kadłub”, 'body, casing'), such as „ciśnienie korpusowe” (in an engine), „(hydrauliczna) prasa korpusowa”, „płytka korpusowa” (of a guitar, also of cutting inserts), „części korpusowe” (for a tractor etc.), „detale korpusowe” (for manufacture/machining on grinders), „elementy korpusowe” (of an engine/machine tools etc.), „połączenia korpusowe” (concerning furniture carcasses), „konstrukcja korpusowa” (of the carcass of a free-standing wardrobe; may also concern fittings for folding-sliding door systems); 2) related uses connected with jewellery and the making of smaller utility objects: „artykuły korpusowe” (i.e. everything that is placed on a table, e.g. cruet sets, candlesticks), „świecznik korpusowy”, „srebra korpusowe”, „wyroby/obiekty korpusowe” (concerning massive jewellery); 3) phrases connected with the military: „korpusowa stacja telefoniczna” (i.e. belonging to a given army corps), „korpusowy pododdział”, „oficerowie korpusowi”, „okręg korpusowy”, „dowódca Okręgu Korpusowego”. Linguistic occurrences include e.g. „strony korpusowe” of PWN (http://slowniki.pwn.pl/poradnia/lista.php?kat=16&od=20) and „uboga reprezentacja korpusowa” (in Marek Łaziński's publication „Rozróżnianie znaczeń synonimicznych w korpusie tekstów”, http://sjikp.us.edu.pl/ps/ps_29_05.html). Interestingly, one can also find on the Internet the phrase „lingwistyka korpusowa” (e.g. on the pages of the Polish-English PELCRA project, http://www1.uni.lodz.pl/pelcra/index-pl.htm), against an almost complete absence of „językoznawstwo korpusowe”; the distinction between synonyms of this kind is, however, a topic for a separate corpus analysis.

"Corpus"
(Kaszubski forthcoming - ctnd): Unlike Polish, the English lexicon records two meanings of the word corpus which sanction the appropriateness of the term corpus linguistics. In the second edition of the Oxford English Dictionary (OED 2nd ed.) we read that a corpus is, first, a collection or set of texts or other materials by a given author (e.g. the Shakespearean corpus), or else the whole of the literature written on a given subject („A body or collection of writings or the like; the whole body of literature on any subject”). The second meaning defines corpus as a collection of written or spoken language material used as a basis for research on language („the body of written or spoken material upon which a linguistic analysis is based”) [5]. It is from this sense that the term corpus linguistics directly derives (Aston & Burnard 1998: 4), although it should be stressed that, as I mentioned, corpus linguistics narrows the definition of the term "corpus" still further. A language corpus is not just any collection of linguistic data but a sample of language that is representative of an assumed population or group, stored and studied in the form in which it was naturally (spontaneously) used, often with information about the sociological context preserved. A collection of texts called a corpus must therefore be characterised not only by sufficient size, measured in number of words, which guarantees the accuracy of quantitative research, but also by the quality of its selection, that is, by conformity to extra-textual criteria that determine which variety of language the corpus represents. A well-compiled corpus allows the quantitative data obtained from it to be appropriately interpreted and extrapolated to the linguistic reality outside the corpus. The second sense sometimes overlaps with the first, e.g. when we analyse linguistic elements characteristic of the writing style of a particular author.

[5] OED 2 (excerpts): corpus ... pl. corpora ...
1. The body of a man or animal. (Cf. corpse.)
Formerly frequent; now only humorous or grotesque.
   1854 Villikins & his Dinah (in Mus. Bouquet, No. 452), He kissed her cold corpus a thousand times o'er.
2. Phys. A structure of a special character or function in the animal body, as corpus callosum [Polish: ciało modzelowate, spoidło wielkie mózgu, PK], the transverse commissure connecting the cerebral hemispheres; corpus luteum [L. luteus, -um yellow; Polish: ciałko żółte, PK] (pl. corpora lutea), a yellowish body developed in the ovary from the ruptured Graafian follicle after discharge of the ovum.
3. A body or complete collection of writings or the like; the whole body of literature on any subject.
   1727-51 Chambers Cycl. s.v., Corpus is also used in matters of learning, for several works of the same nature, collected, and bound together.. We have also a corpus of the Greek poets.. The corpus of the civil law is composed of the digest, code, and institutes.
   1865 Mozley Mirac. i. 16 Bound up inseparably with the whole corpus of Christian tradition.
4. The body of written or spoken material upon which a linguistic analysis is based.
   1956 W. S. Allen in Trans. Philol. Soc. 128 The analysis here presented is based on the speech of a single informant.. and in particular upon a corpus of material, of which a large proportion was narrative, derived from approximately 100 hours of listening.
   1964 E. Palmer tr. Martinet's Elem. General Linguistics ii. 40 The theoretical objection one may make against the `corpus' method is that two investigators operating on the same language but starting from different `corpuses', may arrive at different descriptions of the same language.
   1983 G. Leech et al. in Trans. Philol. Soc. 25 We hope that this will be judged.. as an attempt to explore the possibilities and problems of corpus-based research by reference to first-hand experience, instead of by a general survey.
5. The body or material substance of anything; principal, as opposed to interest or income.
   1884 Law Rep. 25 Chanc. Div. 711 If these costs were properly incurred they ought to be paid out of corpus and not out of income.
   phr. corpus delicti (see quot. 1832); also, in lay use, the concrete evidence of a crime, esp. the body of a murdered person.

Corpus - contemporary dictionary definitions

Cobuild 1987: corpus, corpora, corpuses. (...) 1. A corpus is a large number of articles, books, magazines, etc that have been deliberately collected together for some purpose; a formal or technical word. EG We have been trying to collect a corpus of listening comprehension materials... ... that classic corpus of law, the Code Napoleon. 2. See also habeas corpus.

OALD 2000: corpus (pl. corpora or corpuses) technical a collection of written or spoken texts: a corpus of 100 million words of spoken English; the whole corpus of Renaissance poetry -- see also HABEAS CORPUS.

CORPUS - final definition: "a collection of pieces of language, selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language" (Sinclair 1996 - EAGLES96).

Defining characteristics (PK):
- naturally-occurring / authentic text/discourse
- usu. machine-readable / electronic / computer-stored and -processable
- compiled according to criteria (usu. external: e.g. geographical, sociological, medium type etc.), thus (ideally) representative of the sampled language variety

Compilation criteria

SOME OVERALL CONSIDERATIONS regarding representativeness:
- corpus size (type of corpus! / research purpose: representative of what?)
- (sizes of) subcorpora (if any)
- sampling (fragments or whole texts? - may depend on text genre)
- which text categories (production factor)
- which media (spoken/written - vehicle for transmission of language) (production factor)
- which mode (= format of presentation, e.g. dialogue vs monologue) (production factor)
- distribution of text audience sizes and types (reception factor) [cf. Sinclair's diagram in EAGLES96 text typology recommendations]
- author/speaker type: age / social background / education / gender etc.
- copyright (Kennedy: 76-8 + recent corpora list posting)

Text typology: internal & external criteria (EAGLES96 site http://www.ilc.pi.cnr.it/EAGLES/home.html):
The earliest electronic corpora were designed using external criteria: reference to institutionalised types of text, or features of the nonlinguistic environment or society in which the texts occurred. More recently, some internal criteria (differentiating features of the language of the texts) have been offered by researchers. This work suggests that a thorough classification of texts, an adequate typology, will eventually consist of a balanced combination of the two types of criteria. Many internal and external criteria reflect each other. A text that showed a high average sentence length would be likely to be of one kind of book or magazine rather than another.

MOST CORPORA STILL USE EXTERNAL CRITERIA ONLY (PK)

Categories of spoken text (see e.g. Kennedy 1998: 72; see also BNC spoken part composition below)

Example: Brown corpus (Kucera & Francis 1964): basic criteria
a) one million words
b) divided roughly evenly into genres (15 text categories: see g:\corpora\texts\icamepk\brown1; homework 1: identify the 15 text categories represented in the Brown corpus, using the data on the IFA local network)
c) 500 samples
d) 2,000 words in each
e) written published sources
(more in Kennedy 1998: 24-6; also there -- structure of SEU: 18, Longman-Lancaster Corpus: 49, some BNC: 51; ICE: 55)

Example: The Bank of English: reference corpus of 167 million words (1996).
Subcorpora (million words):
  Newspapers                                                  43
  Books                                                       37
  Magazines                                                   38
  Radio                                                       39
  Ephemera (a wide variety of pamphlets and brochures)       1.5
  Informal Spoken                                            8.5
Further subcorpora, e.g. UK is a subcorpus for Newspapers; then Components, such as The Times (10 million), The Guardian (12 million).

Example: British National Corpus (BNC):
The 100-million-word BNC would take 4 years to read aloud, at 8 hours a day. The Associated Press newswire, by comparison, generates some 50 million words per year. The overall size of the BNC corresponds to roughly 10 years of linguistic experience of the average speaker in terms of quantity, though not, of course, in quality, given that it aims to sample the language as a whole, rather than that experienced by any particular type of speaker. Most samples in the BNC are of between 40,000 and 50,000 words; published texts are rarely complete. There is, however, considerable variation in size, caused by the exigencies of sampling and availability. In particular, most spoken demographic texts, which consist of casual conversations, are rather longer, since they were formed by grouping together all the speech recorded by a single informant. Conversely, several texts containing samples of written unpublished materials such as school essays or office memoranda are very short.

Corpus composition
The BNC was designed to characterize the state of contemporary British English in its various social and generic uses. In selecting texts for inclusion in the corpus, account was taken of both production, by sampling a wide variety of distinct types of material, and reception, by selecting instances of those types which have a wide distribution.

Written texts
Ninety per cent of the BNC is made up of written texts, chosen according to three selection features: domain (subject field), time (within certain dates) and medium (book, periodical, unpublished, etc.).
In this way, it was hoped to maximize variety in the language styles represented, both so that the corpus could be regarded as a microcosm of current British English in its entirety, and so that different styles might be compared and contrasted.

Each selection feature was divided into classes and target percentages were set for each class. Thus for the selection feature `medium', five classes (books, periodicals, miscellaneous published, miscellaneous unpublished, and written-to-be-spoken) were identified. Samples were then selected in the following proportions: 60 per cent from books, 30 per cent from periodicals, 10 per cent from the remaining three miscellaneous sources. Similarly, for the selection feature `domain', 75 per cent of the samples were drawn from texts classed as `informative', and 25 per cent from texts classed as `imaginative'.

The evidence from catalogues of books and periodicals suggests that imaginative texts account for less than 25 per cent of published output. Correspondence, reference works, unpublished reports, etc. add further to the bulk of informative text which is produced and consumed. Nevertheless, the overall distribution between informative and imaginative text samples in the BNC was set to reflect the influential cultural role of literature and creative writing. The target percentages for the eight informative domains were arrived at by consensus within the project, based loosely upon the pattern of book publishing in the UK during the past 20 years or so.

DOMAIN                     texts  percentage      words  percentage
Imaginative                  625       19.47   19664309       21.91
Arts                         259        8.07    7253846        8.08
Belief_and_thought           146        4.54    3053672        3.40
Commerce_and_finance         284        8.85    7118321        7.93
Leisure                      374       11.65    9990080       11.13
Natural_and_pure_science     144        4.48    3752659        4.18
Applied_science              364       11.34    7369290        8.21
Social_science               510       15.89   13290441       14.80
World_affairs                453       14.11   16507399       18.39
Unclassified                  50        1.55    1740527        1.93

TIME                       texts  percentage      words  percentage
1960-1974                     53        1.65    2036939        2.26
1975-1993                   2596       80.89   80077473       89.23
Unclassified                 560       17.45    7626132        8.49

MEDIUM                     texts  percentage      words  percentage
Book                        1488       46.36   52574506       58.58
Periodical                  1167       36.36   27897931       31.08
Misc._published              181        5.64    3936637        4.38
Misc._unpublished            245        7.63    3595620        4.00
To-be-spoken                  49        1.52    1370870        1.52
Unclassified                  79        2.46     364980        0.40

`Miscellaneous published' includes brochures, leaflets, manuals, advertisements. `Miscellaneous unpublished' includes letters, memos, reports, minutes, essays. `Written-to-be-spoken' includes scripted television material, play scripts etc.

BNC corpus composition - cntd
Written texts are further classified in the corpus according to sets of descriptive features, e.g.

Attribute                              Values
written_age (of author)                under 15; 15-24; 25-34; 35-44; 45-59; 60 or over
written_audience                       child; teenage; adult; any
written_domain                         imaginative; natural and pure sciences; applied sciences; social science; world affairs; commerce and finance; arts; belief and thought; leisure
written_domicile (of author)           country or region
written_gender (of target audience)    male; female; mixed; unknown
written_level (of circulation)         low; medium; high
written_medium                         book; periodical; misc published; misc unpublished; to-be-spoken
written_place (of publication)         country or region
written_pubstatus                      published; unpublished
written_sample                         whole text; beginning sample; middle sample; end sample; composite
written_selection                      selective; random
written_sex (of author)                male; female; mixed; unknown
written_status (of reception)          low; medium; high
written_time                           1960-1974; 1975-1993
written_type (of author)               corporate; multiple; sole; unknown

Spoken texts
Ten percent of the BNC is made up of transcribed spoken material, totalling about 10 million words. Roughly equal quantities were collected in each of two different ways:
- a demographic component of informal encounters recorded by a socially-stratified sample of respondents, selected by age group, sex, social class and geographic region;
- a context-governed component of more formal encounters (meetings, debates, lectures, seminars, radio programmes and the like), categorized by topic and type of interaction.

The following classifications apply to both the demographic and context-governed components:

Region_where_text_captured   texts  percentage     words  percentage
South                          296       32.34   4728472       45.61
Midlands                       208       22.73   2418278       23.33
North                          334       36.50   2636312       25.43
Unclassified                    77        8.41    582402        5.61

Interaction_type             texts  percentage     words  percentage
Monologue                      218       23.82   1932225       18.64
Dialogue                       672       73.44   7760753       74.87
Unclassified                    25        2.73    672486        6.48

The context-governed component consists of 762 texts (6.1 million words). Main criterion: domain.

DOMAIN                       texts  percentage     words  percentage
Educational_and_informative    144       18.89   1265318       20.56
Business                       136       17.84   1321844       21.47
Institutional                  241       31.62   1345694       21.86
Leisure                        187       24.54   1459419       23.71
Unclassified                    54        7.08    761973       12.38

Each of these categories was divided into the subcategories monologue (40 per cent) and dialogue (60 per cent), and within each category a range of contexts defined as follows:

educational and informative   Lectures, talks and educational demonstrations; news commentaries; classroom interaction etc.
business                      Company and trades union talks or interviews; business meetings; sales demonstrations etc.
institutional                 Political speeches; sermons; local and national governmental proceedings etc.
leisure                       Sports commentaries; broadcast chat shows and phone-ins; club meetings and speeches etc.

The overall aim was to achieve a balanced selection within each category, taking into account such features as region, level, gender of speakers, and topic.
Since the length of these text types varies considerably (news commentaries may be only a few minutes long, while some business meetings and parliamentary proceedings may last for hours), an upper limit of 10,000 words per text was generally imposed.

KORPUS JĘZYKA POLSKIEGO PWN (http://korpus.pwn.pl/) - homework 2: establish and report on the composition of the corpus (of the internet sample of the PWN corpus); how do you assess the representativeness of the corpus, bearing in mind that: "Compared with other corpora around the world, our collection contains quite a lot of literary texts. We decided to take into account the tradition, particularly alive in Poland, of cultural authority as a criterion of linguistic correctness. The first core of our corpus consisted of several dozen works of twentieth-century literary classics: prose, drama, and also poetry (although poetic texts are often omitted from other corpora as unnatural)"

(this section will be covered by me in next class, but read it please)

From: Adam Kilgarriff <adam.kilgarriff@itri.brighton.ac.uk>
Subject: [Corpora-List] Legal aspects of corpora compiling

On 25 Sept Rafal L. Górski asked:
> Does anybody know about research on legal aspects of corpora compiling
> (copyright restrictions).

A short answer: to be unequivocally, completely, totally in the clear you need to get copyright clearance from all copyright holders (publishers and/or authors, all speakers for spoken material). Some will give it to you, others won't, and it is a lot of work to gather. (I attended a rather nice talk on BNC copyright issues titled "Ladies love lupins". Sometimes, the only way to get the copyright clearance sought was to take the lady concerned a bunch of flowers.)

HOWEVER the law is in its infancy and there is very little which is obviously right or wrong/legal or illegal. If you have an enemy with rich enough lawyers, you will always be found in the wrong (cf. Napster: when you're up against the music business, it's apparently illegal even to tell someone where they might find something), so it's pointless viewing the law as a set of rules. Rather, you have to avoid doing things which someone who is rich and inclined to sue is going to view as provocative.

Considerations:

1) PUBLISHING
The issue is heavier if you are going to publish/copy on the data than if you are not. If it's only for in-house use, then one simple issue is "who will ever know", and it is not clear that, e.g., downloading a report onto your PC's desktop is any different to downloading it into a corpus. Copyright law is in general about the case where someone makes money from selling intellectual property: if you are going to sell a corpus, the issues need taking very seriously, as people will be upset by you making money out of selling their text (unless you give them a share).

2) EXTRACT SIZE
The issue is heavier, the larger the extracts you take. There is a traditional exemption from copyright for short extracts, so e.g. you can take brief quotes, e.g. in a review or academic book, without asking permission. There are different opinions about how much you can quote. If you are quoting a short poem, you couldn't quote it all on the grounds that there weren't many words, so the definition of 'short' has to do with 'as a proportion of the whole' as well as absolute length. As a general principle, keep extracts short. (In one project, we used "3000 words or one third of the document, whichever is the shorter".)

3) BE COOPERATIVE
Avoid including anything where there is an explicit reason not to. In the context of the web, the 'no robots' convention allows authors to say they don't want their page to be viewed by robots. One should also read this as "keep off" from the point of view of corpus compilation. Some literary authors are notoriously litigious.
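The extract-size rule quoted under 2) above ("3000 words or one third of the document, whichever is the shorter") can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, "one third" is rounded down to a whole word, and taking the extract from the beginning of the text is an assumption, since the posting does not say which part of a document was sampled.

```python
def max_extract_words(document_words: int, cap: int = 3000) -> int:
    """Extract limit: `cap` words or one third of the document,
    whichever is the shorter (one third is rounded down)."""
    if document_words < 0:
        raise ValueError("document_words must be non-negative")
    return min(cap, document_words // 3)


def take_extract(tokens: list[str], cap: int = 3000) -> list[str]:
    """Keep only the opening tokens of a text, up to the extract limit.

    Sampling from the start of the text is an assumption made for this
    sketch; a project could equally sample from the middle or the end.
    """
    return tokens[: max_extract_words(len(tokens), cap)]
```

For a 6,000-word report the limit works out at 2,000 words (one third), while for a 30,000-word book the absolute cap of 3,000 words applies.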
Corpus size vs dictionary size (this section will be covered by me in next class, but read it please):

From: "Ramesh Krishnamurthy" <ramesh@easynet.co.uk>
Subject: [Corpora-List] Corpus size for lexicography
Date: Tue, 1 Oct 2002 00:32:07 +0100

Dear Robert Amsler

I am concerned that your statements regarding corpus sizes for lexicographic purposes might be *highly* misleading, at least for English:

> 1 million words we now know to be quite small
> (adequate only for a Pocket Dictionary worth of entries).
> Collegiate dictionaries require at least a 10 million word corpus, and
> Unabridged dictionaries at least 100 million words (the target of the ANC).

1. From my experience while working for Cobuild at Birmingham University:
a) approx. half of the types/wordforms in most corpora have only one token (i.e. occur only once): e.g. 213,684 out of 475,633 in the 121m corpus (1993); 438,647 out of 938,914 in the 418m corpus (2000).
b) dictionary entries cannot be based on one example; so let us say you need at least 10 examples (a very modest figure; in fact, as our corpus has grown, and our software and understanding has become more sophisticated, the minimum threshold increases for some linguistic phenomena, as we find that we often require many more examples before particular features/patterns even become apparent, or certain statistics become reliable)
c) many types with 10+ tokens will not be included in most dictionaries (e.g. numerical entities, proper names, etc; some may be included in the dictionary, e.g. 24-7, the White House, etc, depending on editorial policy; the placement problem for numerical entities is a separate issue)
d) there are roughly 2.2 types per lemma (roughly equal to a dictionary headword) in English (the lemma "be" has c. 18 types, including some archaic ones and contractions; most verbs have 4 or 5 types; at the other end of the scale, many uncount nouns and adjectives, most adverbs and grammatical words, have only one type); of course some types may belong to lemmas, but will need to be treated as headwords in their own right, for sound lexicographic reasons.

2. Calculating potential dictionary headwords from corpus facts and figures:
a) In the 18m Cobuild corpus (1986), there were 43,579 types with 10+ tokens. Dividing by 2.2, we get c. 19,800 lemmas with 10+ tokens, i.e. potential dictionary headwords
b) In the 120m Cobuild Bank of English corpus (1993), there were 99,326 types with 10+ tokens = c. 45,150 headwords
c) In the 450m Bank of English corpus (2001), there were 204,626 types with 10+ tokens = c. 93,000 headwords
I don't think the Cobuild corpora are untypical for such rough calculations.

3. Some dictionary figures:
It is difficult to gauge from dictionary publishers' marketing blurbs exactly how many headwords are in their dictionaries, but here are a few figures taken from the Web today (unless otherwise stated).
a) Pocket:
   Webster's New World Pocket = 37,000 entries
b) Collegiate:
   New Shorter OED: 97,600 entries
   Oxford Concise: 220,000 words, phrases and meanings
   Webster's New World College: 160,000 entries
   (cf Collins English Dictionary 1992: 180,000 references)
c) Unabridged:
   OED: 500,000 entries
   Random House Webster's Unabridged: 315,000 entries
   (cf American Heritage 1992: 350,000 entries/meanings)
d) EFL Dictionaries
   (cf Longman 1995: 80,000 words/phrases)
   (cf Oxford 1995: 63,000 references)
   (cf Cambridge 1995: 100,000 words/phrases)
   (cf Cobuild 1995: 75,000 references)

4. So, by my reckoning, the 100m-word ANC corpus (yielding less than 45,000 potential headwords) will be adequate for a Pocket Dictionary, but will struggle to meet Collegiate requirements, and will be totally inadequate as the sole basis for an Unabridged Dictionary (if that really is the ANC's aim). Surely we will need corpora in the billions of words range before we can start to compile truly corpus-based Unabridged dictionaries. Until then, corpora can assist us in most lexicographic and linguistic enterprises, but we cannot say that they are adequate in size.

It is no coincidence that corpora first became used for EFL lexicography, where the requirement in number of headwords is more modest. But even here, it took much larger corpora to give us reliable evidence of the range of meanings, grammatical patterning and collocational behaviour of all but the most common words.

I have no wish to disillusion lexicographers working with smaller corpora. Cobuild's initial attempts in corpus lexicography entailed working with evidence from corpora of 1m and 7m words. Many of those analyses remain valid in essence, even when checked in our 450m word corpus. But we now have a better overview, and many more accurate details. Smaller corpora can be adequate for more restricted investigations, such as domain-specific dictionaries, local grammars, etc. But for robust generalizations about the entire lexicon, the bigger the corpus the better.

corpus vs WWW
corpus linguistics vs NLP
usage vs use

(OALD 2000)
usage (n): 1. the way in which words are used in a language: current English usage; It's not a word in common usage; 2. the fact of sth being used; how much sth is used: land usage
use (n): 1. the act of using sth; the state of being used: the use of chemical weapons; 2. a purpose for which sth is used; a way in which sth is or can be used: I'm sure you'll think of a use for it; a wide variety of uses;

Oxford Companion to the English Language 1992
USAGE [13c: an adoption of Old French usage, corresponding to medieval Latin usaticum, from Latin usus use]. (1) In general terms, the customary or habitual way of doing something; what is done in practice as distinct from principle or theory; an instance or application of this, a customary practice, as in religion, commerce, military matters, and diplomacy. (2) In linguistic terms, the way in which the elements of language are customarily used to produce meaning; this includes accent, pronunciation, spelling, punctuation, words, and idioms. It occurs neutrally in such terms as formal usage, disputed usage, and local usage, and it has strong judgemental and prescriptive connotations in such terms as bad usage, correct usage, usage and abusage, and usage controversies.

CORPUS EVIDENCE: lexical information (in case sth isn't clear, bring it up in next class)

How often does a particular word-form, or group of forms (such as the various forms of the verb `start': `start', `starts', `starting', `started') appear in the corpus?
  Is `start' more or less common than `begin'? [z-score]
With what meanings is a particular word-form, or group of forms, used?
  Is `back' more frequently used with reference to a part of the body or a direction? Do we `start' and `begin' the same sorts of things?
How often does a particular word-form, or group of forms, appear near to other particular word forms, which collocate with it within a given distance?
  Does `immemorial' always have `time' as a collocate? Is it more common for `prices' to `rise' or to `increase'? Do different senses of the same word have different collocates?
How often does a particular word-form, or group of forms, appear in particular grammatical structures, which colligate with it?
Is it more common to `start to do something', or to `start doing it'? Do different senses of the same word have diVerent colligates? How often does a particular word-form, or group of forms, appear in a certain semantic environment, showing a tendency to have positive or negative connotations? Does the intensifier `totally' always modify verbs and adjectives with a negative meaning, such as `fail' and `ridiculous'? [semantic prosody, PK] How often does a particular word-form, or group of forms, appear in a particular type of text, or in a particular type of speaker or author's language? Is `little' or `small' more common in conversation? Do women say `sort of ' more than men? Does the word `wicked' always have positive connotations for the young? Is the word `predecease' found outside legal texts and obituaries? Do lower-class speakers use more (or different) expletives? Whereabouts in texts does a particular word form, or group of forms, tend to occur? Does its meaning vary according to its position? How often does it occur within notes or headings, following a pause, near the end of a text, or at the beginning of a sentence, paragraph or utterance? And is it in fact true that `and' never begins a sentence? Morphosyntactic information (in case sth isn't clear, bring it up in next class): How frequent is a particular morphological form or grammatical structure? How much more common are clauses with active than with passive main verbs? What proportion of passive forms have the agent specified in a following `by' phrase? With what meanings is a particular structure used? Is there a diVerence between `I hope that' and `I hope to'? How often does a particular structure occur with particular collocates or colligates? Is `if I was you' or `if I were you' more common? How often does a particular structure appear in a particular type of text, or in a particular type of speaker or author's language? Are passives more common in scientific texts? 
Is the subjunctive used less by younger speakers?

Whereabouts in texts does a particular structure tend to occur?
Do writers and speakers tend to switch from the past tense to the `historic present' at particular points in narratives?

Semantic or pragmatic information (in case sth isn't clear, bring it up in next class):

What tools are most frequently referred to in texts talking about gardening?
What fields of metaphor are employed in economic discourse?
Do the upper-middle classes talk differently about universities from the working classes?
How do speakers close conversations, or open lectures?
How do chairpersons switch from one point to another in meetings?
Are pauses in conversation more common between utterances than within them?
What happens when conversationalists stop laughing?

COMPARING CORPORA (in case sth isn't clear, bring it up in next class):

While the examples just cited have all concerned analyses within a particular corpus, it is evident that all these areas can also be examined contrastively, comparing data from corpora of different languages, historical periods, dialects or geographical varieties, modes (spoken or written), or registers.

By comparing one of the standard corpora collected twenty years ago with an analogous corpus of today, it is possible to investigate recent changes in English.
By comparing corpora collected in different parts of the world, it is possible to investigate differences between, for instance, British and Australian English.
By comparing a corpus of translated texts with one of texts originally created in the target language, it is possible to identify linguistic properties peculiar to translation.
By comparing a small homogeneous corpus of some particular kind of material with a large balanced corpus (such as the BNC), it is possible to identify the distinctive linguistic characteristics of [the special type/genre of text, PK].
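
The frequency and collocation questions under CORPUS EVIDENCE, and the corpus comparisons just described, can be tried out on a small scale. The block below is only a minimal sketch: the two "corpora" are invented toy sentences, the tokenizer is deliberately crude, and the function names (tokenize, group_frequency, per_million, collocates) are made up for this illustration and do not come from any corpus tool. Real work would use a proper corpus and a concordancer.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (crude: letters and apostrophes only)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def group_frequency(tokens, forms):
    """Count how often any word-form from the given group occurs."""
    forms = set(forms)
    return sum(1 for tok in tokens if tok in forms)


def per_million(count, corpus_size):
    """Normalise a raw count to occurrences per million tokens, so corpora of different sizes can be compared."""
    return count * 1_000_000 / corpus_size if corpus_size else 0.0


def collocates(tokens, node_forms, span=4):
    """Count words found within `span` tokens to the left or right of a node form.

    Note: the window ignores sentence boundaries, which is a simplification.
    """
    node_forms = set(node_forms)
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok in node_forms:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts


# Invented toy texts standing in for two corpora (assumption: not real corpus data)
corpus_a = "Prices rise every year. Prices rise when demand is high. Prices increase slowly."
corpus_b = "Costs increase. Prices increase again. We start to work and they start working."

tokens_a = tokenize(corpus_a)
tokens_b = tokenize(corpus_b)

# A group of forms of the verb `start'
START_FORMS = ["start", "starts", "starting", "started"]
print("start-forms in B:", group_frequency(tokens_b, START_FORMS))

# Which words occur immediately next to `prices' in corpus A?
print("collocates of 'prices' in A:", collocates(tokens_a, ["prices"], span=1).most_common())

# Compare the normalised frequency of `increase' across the two corpora
for name, toks in (("A", tokens_a), ("B", tokens_b)):
    freq = per_million(group_frequency(toks, ["increase"]), len(toks))
    print(f"'increase' in corpus {name}: {freq:.0f} per million tokens")
```

Normalising to a per-million rate is what makes a comparison between corpora of unequal size meaningful. With texts this small the figures are of course meaningless; the point is only the procedure.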
Typology of corpora (this section will be covered by me in next class):
(sources: Sinclair 1996 http://www.ilc.pi.cnr.it/EAGLES/corpustyp/corpustyp.html - Project EAGLES - Expert Advisory Group on Language Engineering Standards; BNC handbook; Kennedy 1998: 13-23):

1) what they represent
2) how they are compiled
3) how they are stored

pre-electronic (e.g. in biblical and literary studies, early lexicography, LT & grammar)
electronic / machine-readable
sample
balanced
sample-text vs full-text corpus
synchronic (static) vs diachronic (dynamic) corpora
monitor corpus
written
spoken
mixed
multimedia
general or reference corpus (= balanced; also called core)
specialised (particular research projects)
NLP: training or test corpus
geographical varieties (UK, US, Australian etc: NO corpus of ALL English, but ICE)
historical varieties
register/dialect/genre/topic etc-specific
special (deviant from general norm - Sinclair 1996)
child language
learner language
monolingual
bilingual
parallel / translation
comparable
non-annotated or unannotated or raw
annotated (POS-tagged, parsed, error tagged etc.)