Corpus Linguistics
Lecture 1
Albert Gatt

Contact details
- My email: albert.gatt@um.edu.mt
- Drop me a line with queries etc., and to arrange meetings.

Course web page
- http://staff.um.edu.mt/albert.gatt/home/teaching/corpusLing.html
- Details of tutorials, lectures etc. will always be on the web page.

Readings for the lecture
- Downloadable lecture notes (available after the lecture)
- Suggested text: T. McEnery and A. Wilson (2001). Corpus Linguistics. Edinburgh University Press.
- NB: Over the course of these lectures, other readings will also be proposed and made available, usually online.

Lectures and assessment
- Structure of lectures:
  - all lectures will take place in the lab
  - usually, about half the lecture (1hr) will be devoted to practical work
- Course assessment:
  - assignment
  - Final essay (ca. 1500-2000 words)
  - Essay topics will involve research on corpora!

Questions… ?

What is corpus linguistics?
- A new theory of language?
  No. In principle, any theory of language is compatible with corpus-based research.
- A separate branch of linguistics (in addition to syntax, semantics…)?
  No. Most aspects of language can be studied using a corpus (in principle).
- A methodology to study language in all its aspects?
  Yes! The most important principle is that aspects of language are studied empirically by analysing natural data using a corpus.
- A corpus is an electronic, machine-readable collection of texts that represent “real life” language use.

Goals of this lecture
- To define the terms:
  - corpus linguistics
  - corpus
- To give an overview of the history of corpus linguistics
- To contrast the corpus-based approach with other methodologies used in the study of language

An initial example
- Suppose you’re a linguist interested in the syntax of verb phrases.
- Some verbs are transitive, some intransitive:
  - I ate the meat pie (transitive)
  - I swam (intransitive)
- What about:
  - quiver
  - quake
- Most traditional grammars characterise these as intransitive.
- Are these really intransitive?
One possible methodology…
The standard method relies on the linguist’s intuition:
- I never use quiver/quake with a direct object.
- I am a native speaker of this language.
- All native speakers have a common mental grammar or competence (Chomsky).
- Therefore, my mental grammar is the same as everyone else’s.
- Therefore, my intuition accurately reflects English speakers’ competence.
- Therefore, quiver/quake are intransitive.
NB: The above is a gross simplification! E.g. linguists often rely on judgements elicited from other native speakers.

Another possible methodology…
This one relies on data:
- I may never use quiver/quake with a direct object, but…
- …other people might.
- Therefore, I’ll get my hands on a large sample of written and/or spoken English and check.

Quiver/quake: the corpus linguist’s answer
- A study by Atkins and Levin (1995) found that quiver and quake do occur in transitive constructions:
  - the insect quivered its wings
  - it quaked his bowels (with fear)
- They used a corpus of 50 million words to find examples of the verbs.
- With sufficient data, you can find examples that your own intuition won’t give you…

Example II: lexical semantics
- Quasi-synonymous lexical items exhibit subtle differences in context:
  - strong
  - powerful
- A fine-grained theory of lexical semantics would benefit from data about these contextual cues to meaning.

Example II continued
Some differences between strong and powerful (source: British National Corpus):
- strong: wind, feeling, accent, flavour
- powerful: tool, weapon, punch, engine
The differences are subtle, but examining their collocates helps.

Some preliminary definitions
The second approach is typical of the corpus-based methodology:
- Corpus: A large, machine-readable collection of texts. Often, in addition to the texts themselves, a corpus is annotated with relevant linguistic information.
- Corpus-based methodology: An approach to Natural Language analysis that relies on generalisations made from data.
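
To make the idea of "examining collocates" in Example II concrete, here is a minimal sketch that counts the word immediately following strong and powerful. The six sentences are invented for illustration and are not taken from the BNC; the function name `right_collocates` is likewise hypothetical. A real study would run this kind of count over a large corpus and would normally use a wider window and a statistical association measure.

```python
from collections import Counter
import re

# Toy sentences invented for illustration (NOT drawn from the BNC).
SENTENCES = [
    "A strong wind blew across the bay.",
    "She spoke with a strong accent.",
    "The car has a powerful engine.",
    "It is a powerful tool for analysis.",
    "He has a strong feeling about it.",
    "A powerful engine needs fuel.",
]


def right_collocates(sentences, node):
    """Count the word immediately to the right of each occurrence of `node`."""
    counts = Counter()
    for sentence in sentences:
        # Very rough tokenisation: lower-case runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", sentence.lower())
        for current, following in zip(tokens, tokens[1:]):
            if current == node:
                counts[following] += 1
    return counts


strong = right_collocates(SENTENCES, "strong")
powerful = right_collocates(SENTENCES, "powerful")
print("strong:", strong.most_common())
print("powerful:", powerful.most_common())
```

Even on this tiny sample the two words pick out different neighbours, which is the kind of contextual cue the slide refers to.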
Example (British National Corpus)
British National Corpus (BNC):
- 100 million words of English
- 90% written, 10% spoken
- Designed to be representative and balanced.
- Texts from different genres (literature, news, academic writing…)
- Annotated: Every single word is accompanied by part-of-speech information.

Example (continued)
A sentence in the BNC: Explosives found on Hampstead Heath.
<s> <w NN2>Explosives <w VVD>found <w PRP>on <w NP0>Hampstead <w NP0>Heath <PUN>.

Example (continued)
Explosives found on Hampstead Heath
<s>                  new sentence
<w NN2>Explosives    plural noun
<w VVD>found         past tense verb
<w PRP>on            preposition
<w NP0>Hampstead     proper noun
<w NP0>Heath         proper noun
<PUN>.               punctuation

Important to note
- This is not “raw” text.
  - Annotation means we can search for particular patterns.
  - E.g. for the quiver/quake study: “find all occurrences of quiver which are verbs, followed by a determiner and a noun”
- The collection is very large.
  - Only in very large collections are we likely to find rare occurrences.
- Corpus search is done by computer. You can’t trawl through 100 million words manually!

The practical objections…
- But we’re linguists, not computer scientists! Do I have to write programs?
  No, there are literally dozens of available tools to search in a corpus.
- Are all corpora good for all purposes?
  No. Some are “general-purpose”, like the BNC. Others are designed to address specific issues.

The theoretical objections…
- What guarantee do we have that the texts in our corpus are “good data”, quality texts, written by people we can trust?
- How do I know that what I find isn’t just a small, exceptional case? E.g. quiver in a transitive construction could really be a one-off!
- Just because there are a few examples of something doesn’t mean that all native speakers use a certain construction!
- Do we throw intuition out of the window?
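
Returning to the annotated BNC sentence and the quiver/quake query above ("find all occurrences of quiver which are verbs, followed by a determiner and a noun"), the search can be sketched as follows. This is an illustrative sketch, not how any particular corpus tool works. Assumptions: the tagged format is exactly as shown on the slide; the second sentence is hand-tagged for this example and is not a real BNC entry; the set of determiner tags (AT0, DT0, DPS) is a choice made here; and the names `parse` and `verb_det_noun` are hypothetical.

```python
import re

# Matches "<w TAG>word" and "<PUN>." as written on the slide.
# The lower-case sentence marker "<s>" does not match and is skipped.
TOKEN = re.compile(r"<(?:w )?([A-Z0-9]+)>([^<\s]+)")

# Assumption: tags treated as "determiner" for this query.
DETERMINER_TAGS = {"AT0", "DT0", "DPS"}


def parse(tagged):
    """Turn a tagged string into a list of (word, tag) pairs."""
    return [(word, tag) for tag, word in TOKEN.findall(tagged)]


def verb_det_noun(tokens, stem):
    """Find a lexical verb starting with `stem`, then a determiner, then a noun."""
    hits = []
    for i in range(len(tokens) - 2):
        (w1, t1), (w2, t2), (w3, t3) = tokens[i:i + 3]
        if (w1.lower().startswith(stem) and t1.startswith("VV")
                and t2 in DETERMINER_TAGS and t3.startswith("NN")):
            hits.append((w1, w2, w3))
    return hits


# The sentence from the slide.
bnc = ("<s> <w NN2>Explosives <w VVD>found <w PRP>on "
       "<w NP0>Hampstead <w NP0>Heath <PUN>.")
# Hand-tagged for illustration only (not a real BNC entry).
made_up = ("<s> <w AT0>the <w NN1>insect <w VVD>quivered "
           "<w DPS>its <w NN2>wings <PUN>.")

print(parse(bnc))
print(verb_det_noun(parse(made_up), "quiver"))
```

Matching on the tag rather than on the word alone is what lets the query ignore uses of quiver as a noun, which is the point of the annotation.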
Part 2
A brief history of corpus linguistics

Language and the cognitive revolution
- Before the 1950s, the linguist’s task was:
  - to collect data about a language;
  - to make generalisations from the data (e.g. “In Maltese, the verb always agrees in number and gender with the subject NP”)
- The basic idea: language is “out there”, the sum total of things people say and write.
- After the 1950s: the so-called “cognitive revolution”
  - language treated as a mental phenomenon
  - no longer about collecting data, but explaining what mental capabilities speakers have

The 19th & early 20th Century
- Many early studies relied on corpora.
  - Language acquisition research was based on collections of child data.
  - Anthropologists collected samples of unknown languages.
  - Comparative linguists used large samples from different languages.
- A lot of work done on frequencies:
  - frequency of words…
  - frequency of grammatical patterns…
  - frequency of different spellings…
- All of this was interrupted around 1955.

Chomsky and the cognitive turn
- Chomsky (1957) was primarily responsible for the new, cognitive view of language.
- He distinguished (1965):
  - Descriptive adequacy: describing language, making generalisations such as “X occurs more often than Y”
  - Explanatory adequacy: explaining why some things are found in a language, but not others, by appealing to speakers’ competence, their mental grammar
- He made several criticisms of corpus-based approaches.

Criticisms of corpora (I)
- Competence vs. performance: To explain language, we need to focus on the competence of an idealised speaker-hearer.
- Competence = internalised, tacit knowledge of language
- Performance (the language we speak/write) is not a good mirror of our knowledge:
  - it depends on situations
  - it can be degraded
  - it can be influenced by other cognitive factors beyond linguistic knowledge

Criticisms of corpora (II)
- Early work using corpora assumed that the number of sentences of a language is finite (so we can get to know everything about language if the sample is large enough).
- But actually, it is impossible to count the number of sentences in a language.
- Syntactic rules make the possibilities literally infinite:
  - the man in the house (NP -> NP + PP)
  - the man in the house on the beach (PP -> PREP + NP)
  - the man in the house on the beach by the lake
  - …
- So what use is a corpus? We’re never going to have an infinite corpus.

Criticisms of corpora (III)
- A corpus is always skewed, i.e. biased in favour of certain things.
  - Certain obvious things are simply never said. E.g. we probably won’t find a dog is a dog in our corpus.
- A corpus is always partial:
  - We will only find things in a corpus if they are frequent enough.
  - A corpus is necessarily only a sample.
  - Rare things are likely to be omitted from a sample.

Criticisms of corpora (IV)
- Why use a corpus if we already know things by introspection?
- How can a corpus tell us what is ungrammatical?
  - Corpora won’t contain “disallowed” structures, because these are by definition not part of the language.
  - So a corpus contains exclusively positive evidence: you only get the “allowed” things.
  - But if X is not in the corpus, this doesn’t mean it’s not allowed. It might just be rare, and your corpus isn’t big enough. (Skewness)

Refutations
Corpora can be better than introspective evidence because:
- They are public; other people can verify and replicate your results (the essence of scientific method).
- Some kinds of data are simply not available to introspection. E.g. people aren’t good at estimating the frequency of words or structures.
- Skewness can itself be informative: If X occurs more frequently than Y in a corpus, that in itself is an interesting fact.

Refutations (II)
- By the way, nobody’s saying “throw introspection out the window”…
  - There is no reason not to combine the corpus-based and the introspection-based method.
- Many other objections can be overcome by using large enough corpora.
- Pre-1950, most corpus work was done manually, so it was error-prone.
- Machine-readable corpora mean we have a great new tool to analyse language very efficiently!

Corpora in the late 20th Century
- Corpus linguistics enjoyed a revival with the advent of the digital personal computer.
  - Kucera and Francis: the Brown Corpus, one of the first
  - Svartvik: the London-Lund Corpus, which built on Brown
- These were rapidly followed by others…
- Today, corpora are firmly back on the linguistic landscape.

Summary
- Introduced the notion of corpus and corpus-based research
- Gave a quick overview of the history of this methodology
- Looked at some possible objections to corpus-based methods, and some possible counter-arguments

Next lecture
We look more closely at some important properties of a corpus:
- Machine-readability
- Balance
- Representativeness
- …