Language Corpora: Seeing Real English Grammar

What’s wrong with these examples?
Conjunctions:
  Marsha ordered a double latte, for she had a long night ahead of her.
  He is always boasting; however, no one seems to mind.
Sentence types:
  The wolf wailed in an awful way.
  The jolly Santa smiled cheerfully.
Subordinate clauses:
  Joan grimaced noticeably when Eric began his speech.
  Lorry’s father loves his garage in which he builds models of prehistoric animals.

The Problem
• Linguistics aims to describe real language, not rules made up by ‘language police’
• Even good grammar textbooks cannot represent all of real language. (They’d weigh a ton!)
• Even good grammar textbooks tend to ‘bend’ the language to get the grammar rules across.

Problem with Textbook Examples
• Stilted language
  Marsha ordered a double latte, for she had a long night ahead of her.
• Mixing of genres in a single sentence
  loves his garage [conversational] in which he builds models of prehistoric animals [written]

One Solution
• Look at large amounts of real language: the corpus linguistics approach
  – Enabled by computers with large memory capacity
  – (But the Oxford English Dictionary was built on the same principle)

Types of Corpora
• A corpus is a collection of written or spoken language
  – Charles Dickens’ A Christmas Carol
  – The New York Times online
  – The Santa Barbara Corpus of Spoken American English
• A representative corpus includes samples from the various types of language usage
  – The Brown Corpus
  – The British National Corpus
  – MICASE

The Brown Corpus: 1st representative corpus
• The Brown Corpus consists of 500 text samples
• Each sample consists of just over 2,000 words
• Types of language usage include:
  A. PRESS: REPORTAGE (44 texts)
  B. PRESS: EDITORIALS (27 texts)
  C. PRESS: REVIEWS (17 texts)
  D. RELIGION (17 texts)
  E. SKILL AND HOBBIES (36 texts)
  F. POPULAR LORE (48 texts)
  G. BELLES-LETTRES (75 texts)
  H. MISCELLANEOUS: GOVT (30 texts)
  J. LEARNED (80 texts)
  K. FICTION: GENERAL (29 texts)
  L. FICTION: MYSTERY (24 texts)
  M. FICTION: SCIENCE (6 texts)
  N. FICTION: ADVENTURE (29 texts)
  P. FICTION: ROMANCE (29 texts)
  R. HUMOR (9 texts)

A Simple Example
• Shifting word meaning
  go to: http://chss.montclair.edu/linguistics/corpus.tutorial.htm

More Sample Corpus Applications
• Co-occurrence Restrictions
• Part of Speech Identification
• More POS: -ly words
• Intransitive sentences with good vs. well
• Syntactic Construction: the passive

Use the right corpus for your query
For a query about . . .          Look at
• current standard English       • up-to-date, written
• current everyday usage         • up-to-date, spoken
• frequency of a word/phrase     • a large corpus
• a single author’s writing      • Project Gutenberg
• word pair comparisons          • a concordancer that lists collocates

Copyright Issues
• Current Copyright Law:
  – For works created after January 1, 1978, copyright protection will endure for the life of the author plus an additional 70 years.
  – For pre-1978 works still in their original or renewal term of copyright, the total term is extended to 95 years from the date that copyright was originally secured.
• This means that much 20th century literature is unavailable on the web, except when permission has been obtained to put it there. For example, Sylvia Plath’s poems are available.

Know your data collection. Some collections won’t have enough examples.
• Sample topic: “Looking at a form like progress, button, or butter . . .”

Corpus          # of words     progress   button   butter
Brown           1.3 million    120        13       27
Times (1/95)    3.5 million    268        51       65
Health          200,000        12         1        13

Know your data collection. Its size may be important.

Doc           Words        Raw count for true   Normed count for true (per 100,000 words)
Times 1/95    3,567,629    518                  14
Starr Rept    98,856       28                   27.67

Even though there are many more occurrences of true in a month of The Times than in the Starr Report, true appears more often per 100,000 words in the Starr Report.
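
The normed count is the raw count divided by the number of words in the collection, scaled to a common base of 100,000 words. A minimal sketch of that arithmetic follows, using the word totals and raw counts from the table above. The function name normed_count and the dictionary layout are illustrative choices, not part of any corpus tool. The results should land near the table's figures, though not necessarily on them exactly; any small gap presumably reflects rounding or a slightly different word total.

```python
def normed_count(raw_count: int, corpus_words: int, per: int = 100_000) -> float:
    """Return occurrences per `per` words of running text."""
    if corpus_words <= 0:
        raise ValueError("corpus_words must be positive")
    return raw_count / corpus_words * per


# Word totals and raw counts for "true", copied from the table above.
corpora = {
    "Times 1/95": {"words": 3_567_629, "raw": 518},
    "Starr Rept": {"words": 98_856, "raw": 28},
}

# Normalize each raw count so the two collections can be compared directly.
normed = {
    name: normed_count(c["raw"], c["words"]) for name, c in corpora.items()
}

for name, value in normed.items():
    print(f"{name}: {value:.2f} per 100,000 words")
```

Normalizing this way is what makes a 98,856-word document comparable with a 3.5-million-word one: the raw counts differ by a factor of about 18, yet the rate in the smaller document is roughly twice as high.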

Know your tools
• A concordancer
• A collocate list
• A part of speech tagger
• Searching
• Browsing
• Sorting

Hong Kong Web Concordancer
• If you ask for a word that has many hits in the data, it will give you the first 2001 hits
• You can search for prefixes, suffixes, etc.
  – “Search string: equal to, starts with, ends with, contains”
• You can search for phrases: go to, was seen
• You can sort the output
  – by word to the left of the hit (good if you’re looking for specifiers: determiners, auxiliaries, etc.)
  – or word to the right (good if you’re looking for complements)

HK Concordancer gives collocates (words in the neighborhood of the keyword)

Concordances for was seen = 5
1 erms; even marriage and the family was seen as a contractual arrangement. It i
2  faire was a conscious policy. Law was seen as an emanation of the "sovereign
3 ent of each of the sample children was seen in the home. The parent was asked
4  Hearst changed to concern when it was seen that he had strong support in many
5  , another change in muscle nuclei was seen, usually occurring in fibers that

Right collocates for 'was seen'
as       2
in       1
that     1
usually  1

References
Ball, Catherine. Concordances and Corpora.
Biber, Douglas, Susan Conrad, and Randi Reppen. 1998. Corpus Linguistics: Investigating Language Structure and Use. Cambridge University Press.
U.S. Copyright Office. Frequently Asked Questions. (http://www.copyright.gov/faq.html#q46)