Corpus Linguistics

Lecture 1
Albert Gatt

Contact details
- My email: albert.gatt@um.edu.mt
- Drop me a line with queries etc., and to arrange meetings.

Course web page
- Course web page: http://staff.um.edu.mt/albert.gatt/home/teaching/corpusLing.html
- Details of tutorials, lectures etc. will always be on the web page.
  - Readings for the lecture
  - Downloadable lecture notes (available after the lecture)

Suggested text
- T. McEnery and A. Wilson. (2001). Corpus Linguistics. Edinburgh University Press
- NB: Over the course of these lectures, other readings will also be proposed and made available, usually online.

Lectures and assessment
- Structure of lectures:
  - all lectures will take place in the lab
  - usually, about half the lecture (1hr) will be devoted to practical work
- Course assessment: assignment
  - Final essay (ca. 1500-2000 words)
  - Essay topics will involve research on corpora!

Questions… ?

What is corpus linguistics?
- A new theory of language?
  - No. In principle, any theory of language is compatible with corpus-based research.
- A separate branch of linguistics (in addition to syntax, semantics…)?
  - No. Most aspects of language can be studied using a corpus (in principle).
- A methodology to study language in all its aspects?
  - Yes! The most important principle is that aspects of language are studied empirically by analysing natural data using a corpus.
- A corpus is an electronic, machine-readable collection of texts that represent “real life” language use.

Goals of this lecture
- To define the terms:
  - corpus linguistics
  - corpus
- To give an overview of the history of corpus linguistics
- To contrast the corpus-based approach with other methodologies used in the study of language

An initial example
- Suppose you’re a linguist interested in the syntax of verb phrases.
- Some verbs are transitive, some intransitive
  - I ate the meat pie (transitive)
  - I swam (intransitive)
- What about:
  - quiver
  - quake
- Most traditional grammars characterise these as intransitive
- Are these really intransitive?

One possible methodology…
- The standard method relies on the linguist’s intuition:
  - I never use quiver/quake with a direct object.
  - I am a native speaker of this language.
  - All native speakers have a common mental grammar or competence (Chomsky).
  - Therefore, my mental grammar is the same as everyone else’s.
  - Therefore, my intuition accurately reflects English speakers’ competence.
  - Therefore, quiver/quake are intransitive.
- NB: The above is a gross simplification! E.g. linguists often rely on judgements elicited from other native speakers.

Another possible methodology…
- This one relies on data:
  - I may never use quiver/quake with a direct object, but…
  - …other people might
  - Therefore, I’ll get my hands on a large sample of written and/or spoken English and check.

Quiver/quake: the corpus linguist’s answer
- A study by Atkins and Levin (1995) found that quiver and quake do occur in transitive constructions:
  - the insect quivered its wings
  - it quaked his bowels (with fear)
- Used a corpus of 50 million words to find examples of the verbs.
- With sufficient data, you can find examples that your own intuition won’t give you…

Example II: lexical semantics
- Quasi-synonymous lexical items exhibit subtle differences in context.
  - strong
  - powerful
- A fine-grained theory of lexical semantics would benefit from data about these contextual cues to meaning.

Example II continued
- Some differences between strong and powerful (source: British National Corpus):
  - strong: wind, feeling, accent, flavour
  - powerful: tool, weapon, punch, engine
- The differences are subtle, but examining their collocates helps.
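
To make the idea of "examining collocates" concrete, here is a minimal Python sketch that counts the word immediately to the right of strong and powerful. The six sentences are a toy corpus invented for illustration (they are not taken from the BNC), and the function name right_collocates is an assumption of this sketch, not part of any corpus tool.

```python
from collections import Counter

# Toy sentences invented for illustration; they are NOT taken from the BNC.
TOY_CORPUS = [
    "a strong wind blew all night",
    "she has a strong accent",
    "the cheese has a strong flavour",
    "a powerful engine drives the car",
    "the hammer is a powerful tool",
    "he threw a powerful punch",
]


def right_collocates(sentences, node):
    """Count the word immediately following each occurrence of `node`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        # Stop one token early so tokens[i + 1] always exists.
        for i, token in enumerate(tokens[:-1]):
            if token == node:
                counts[tokens[i + 1]] += 1
    return counts


strong = right_collocates(TOY_CORPUS, "strong")
powerful = right_collocates(TOY_CORPUS, "powerful")
print(sorted(strong))    # ['accent', 'flavour', 'wind']
print(sorted(powerful))  # ['engine', 'punch', 'tool']
```

On a real corpus one would typically look at a wider window on both sides of the word and compare frequencies statistically, but even this one-word window shows how the two adjectives keep different company.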

Some preliminary definitions
- The second approach is typical of the corpus-based methodology:
  - Corpus: A large, machine-readable collection of texts.
    - Often, in addition to the texts themselves, a corpus is annotated with relevant linguistic information.
  - Corpus-based methodology: An approach to Natural Language analysis that relies on generalisations made from data.

Example (British National Corpus)
- British National Corpus (BNC):
  - 100 million words of English
  - 90% written, 10% spoken
  - Designed to be representative and balanced.
  - Texts from different genres (literature, news, academic writing…)
  - Annotated: Every single word is accompanied by part-of-speech information.

Example (continued)
- A sentence in the BNC:
  - Explosives found on Hampstead Heath.
  - <s> <w NN2>Explosives <w VVD>found <w PRP>on <w NP0>Hampstead <w NP0>Heath <PUN>.

Example (continued)
- Explosives found on Hampstead Heath

  Markup              Meaning
  <s>                 new sentence
  <w NN2>Explosives   plural noun
  <w VVD>found        past tense verb
  <w PRP>on           preposition
  <w NP0>Hampstead    proper noun
  <w NP0>Heath        proper noun
  <PUN>.              punctuation

Important to note
- This is not “raw” text.
  - Annotation means we can search for particular patterns.
  - E.g. for the quiver/quake study: “find all occurrences of quiver which are verbs, followed by a determiner and a noun”
- The collection is very large
  - Only in very large collections are we likely to find rare occurrences.
- Corpus search is done by computer. You can’t trawl through 100 million words manually!

The practical objections…
- But we’re linguists, not computer scientists! Do I have to write programs?
  - No, there are literally dozens of available tools to search in a corpus.
- Are all corpora good for all purposes?
  - No. Some are “general-purpose”, like the BNC. Others are designed to address specific issues.

The theoretical objections…
- What guarantee do we have that the texts in our corpus are “good data”, quality texts, written by people we can trust?
- How do I know that what I find isn’t just a small, exceptional case? E.g. quiver in a transitive construction could really be a one-off!
- Just because there are a few examples of something doesn’t mean that all native speakers use a certain construction!
- Do we throw intuition out of the window?

Part 2
A brief history of corpus linguistics

Language and the cognitive revolution
- Before the 1950s, the linguist’s task was:
  - to collect data about a language;
  - to make generalisations from the data (e.g. “In Maltese, the verb always agrees in number and gender with the subject NP”)
  - The basic idea: language is “out there”, the sum total of things people say and write.
- After the 1950s:
  - the so-called “cognitive revolution”
  - language treated as a mental phenomenon
  - no longer about collecting data, but explaining what mental capabilities speakers have

The 19th & early 20th Century
- Many early studies relied on corpora.
  - Language acquisition research was based on collections of child data.
  - Anthropologists collected samples of unknown languages.
  - Comparative linguists used large samples from different languages.
- A lot of work done on frequencies:
  - frequency of words…
  - frequency of grammatical patterns…
  - frequency of different spellings…
- All of this was interrupted around 1955.

Chomsky and the cognitive turn
- Chomsky (1957) was primarily responsible for the new, cognitive view of language.
- He distinguished (1965):
  - Descriptive adequacy: describing language, making generalisations such as “X occurs more often than Y”
  - Explanatory adequacy: explaining why some things are found in a language, but not others, by appealing to speakers’ competence, their mental grammar
- He made several criticisms of corpus-based approaches.

Criticisms of corpora (I)
- Competence vs. performance:
  - To explain language, we need to focus on competence of an idealised speaker-hearer.
  - Competence = internalised, tacit knowledge of language
  - Performance (the language we speak/write) is not a good mirror of our knowledge
    - it depends on situations
    - it can be degraded
    - it can be influenced by other cognitive factors beyond linguistic knowledge

Criticisms of corpora (II)
- Early work using corpora assumed that:
  - the number of sentences of a language is finite (so we can get to know everything about language if the sample is large enough)
- But actually, it is impossible to count the number of sentences in a language.
- Syntactic rules make the possibilities literally infinite:
  - the man in the house (NP -> NP + PP)
  - the man in the house on the beach (PP -> PREP + NP)
  - the man in the house on the beach by the lake
  - …
- So what use is a corpus? We’re never going to have an infinite corpus.

Criticisms of corpora (III)
- A corpus is always skewed, i.e. biased in favour of certain things.
  - Certain obvious things are simply never said. E.g. we probably won’t find a dog is a dog in our corpus.
- A corpus is always partial: we will only find things in a corpus if they are frequent enough.
  - A corpus is necessarily only a sample.
  - Rare things are likely to be omitted from a sample.

Criticisms of corpora (IV)
- Why use a corpus if we already know things by introspection?
- How can a corpus tell us what is ungrammatical?
  - Corpora won’t contain “disallowed” structures, because these are by definition not part of the language.
  - So a corpus contains exclusively positive evidence: you only get the “allowed” things
  - But if X is not in the corpus, this doesn’t mean it’s not allowed.
  - It might just be rare, and your corpus isn’t big enough. (Skewness)

Refutations
- Corpora can be better than introspective evidence because:
  - They are public; other people can verify and replicate your results (the essence of scientific method).
  - Some kinds of data are simply not available to introspection. E.g. people aren’t good at estimating the frequency of words or structures.
  - Skewness can itself be informative: If X occurs more frequently than Y in a corpus, that in itself is an interesting fact.

Refutations (II)
- By the way, nobody’s saying “throw introspection out the window”…
  - There is no reason not to combine the corpus-based and the introspection-based method.
- Many other objections can be overcome by using large enough corpora.
- Pre-1950, most corpus work was done manually, so it was error prone.
- Machine-readable corpora mean we have a great new tool to analyse language very efficiently!

Corpora in the late 20th Century
- Corpus linguistics enjoyed a revival with the advent of the digital personal computer.
  - Kucera and Francis: the Brown Corpus, one of the first
  - Svartvik: the London-Lund Corpus, which built on Brown
- These were rapidly followed by others…
- Today, corpora are firmly back on the linguistic landscape.

Summary
- Introduced the notion of corpus and corpus-based research
- Gave a quick overview of the history of this methodology
- Looked at some possible objections to corpus-based methods, and some possible counter-arguments

Next lecture
- We look more closely at some important properties of a corpus:
  - Machine-readability
  - Balance
  - Representativeness
  - …
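
Looking back at the annotated BNC sentence from earlier in the lecture, the kind of search described there ("find all occurrences of quiver which are verbs, followed by a determiner and a noun") can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it reads the simplified markup shown on the slide rather than the real BNC file format, the tagged quiver sentence is hand-tagged here for the example, the set of determiner tags is a guess at a reasonable selection from the BNC tagset, and the names parse and verb_det_noun are made up for this sketch.

```python
import re

# Matches the simplified slide markup, e.g. "<w NN2>Explosives" or "<PUN>.".
# Lower-case structural tags such as "<s>" are deliberately not matched.
TOKEN = re.compile(r"<(?:[wc] )?([A-Z0-9]+)>(\S+)")

# The sentence exactly as shown on the slide.
BNC_SENTENCE = ("<s> <w NN2>Explosives <w VVD>found <w PRP>on "
                "<w NP0>Hampstead <w NP0>Heath <PUN>.")

# Hand-tagged for this sketch (an assumption, not copied from the BNC).
QUIVER_SENTENCE = ("<s> <w AT0>the <w NN1>insect <w VVD>quivered "
                   "<w DPS>its <w NN2>wings <PUN>.")

# Assumed determiner-like tags: article, general determiner, possessive.
DETERMINER_TAGS = {"AT0", "DT0", "DPS"}


def parse(tagged):
    """Turn a tagged string into a list of (word, tag) pairs."""
    return [(word, tag) for tag, word in TOKEN.findall(tagged)]


def verb_det_noun(tokens, stem):
    """Find a verb starting with `stem`, followed by a determiner and a noun."""
    hits = []
    for (w1, t1), (w2, t2), (w3, t3) in zip(tokens, tokens[1:], tokens[2:]):
        if (w1.lower().startswith(stem) and t1.startswith("VV")
                and t2 in DETERMINER_TAGS and t3.startswith("NN")):
            hits.append((w1, w2, w3))
    return hits


print(parse(BNC_SENTENCE)[:2])
print(verb_det_noun(parse(QUIVER_SENTENCE), "quiver"))
```

In practice a linguist would use one of the existing corpus search tools rather than write this by hand, but the sketch shows why part-of-speech annotation matters: without the tags, there would be no way to separate quiver the verb from quiver the noun.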
