Computational Linguistics

LELA 30922
Lecture 2
Corpus-based research in Linguistics
See esp. Meyer pp. 11-29
What is corpus linguistics?
• Not a branch of linguistics, like socio~,
psycho~, …
• Not a theory of linguistics
• A set of tools and methods (and a
philosophy) to support linguistic
investigation across all branches of the subject
• Assessment for this course is to use
corpus/corpora to investigate something
• This lecture may give you some ideas of the
kind of thing you can do
Applications of corpus linguistics
Grammatical studies
Study of language variation
Historical linguistics
Contrastive analysis and translation theory
Study of language acquisition (psycholinguistics)
Language teaching
• Study of behaviour of individual words
• Particularly useful for dictionary
construction (lexicography)
• Can identify more and less frequently
occurring words
• More interesting is HOW words are used
– Syntax
– Meaning
• Most frequent words are function words
(the, of, and, to, a are the 5 most frequent words
in LOB)
• If corpus is small, it can only give an
indicative “snapshot” of word usage
• LOB (1m words): hundreds of words occur
fewer than 10 times
• For dictionary construction, need bigger corpus
• “Monitor” corpus, constantly updated and added to
• Traditional lexicography: collection of “slips” by
– OED took 50 years and includes 5m citations, sorted
and edited manually
• Same idea, but more systematic
• Dictionary as descriptive rather than pre- (or pro-)scriptive
• Collins COBUILD
– Birmingham corpus (20m words, 1980s)
– Bank of English corpus (415m words in Oct 2000)
• 70m words of transcripts of BBC broadcasts
• Used as basis of BBC English dictionary
• Cambridge Language Survey
• Longman’s corpus of American English, and use
of BNC for (BrE) dictionary
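The frequency points made earlier in this section (function words dominate; many words are rare) can be checked with a simple frequency list. A minimal sketch follows; the sample text is an invented stand-in for a corpus, not LOB data, and the tokenisation rule (runs of letters and apostrophes) is an assumption.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Lower-case the text, split it into word tokens, and count each token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Invented toy text standing in for a corpus (illustrative only).
sample = ("The dog chased the cat and the cat ran to the end of the garden. "
          "A dog is a friend to the owner of the dog.")

freq = word_frequencies(sample)
top5 = freq.most_common(5)                             # most frequent words first
rare = sorted(w for w, n in freq.items() if n == 1)    # words seen only once

print(top5)
print(rare)
```

Even in a text this small the most frequent item is a function word, and most content words occur only once, which is why a small corpus gives only a snapshot.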
Lexicography: how do corpora help?
• Concordancing
– Lists occurrences of word in context
– Identify syntactic use of word
– Identify range of meanings
– Identify relative frequency of different uses/meanings
• Collocation
– What words occur together?
– Compare distribution of close synonyms
• Dictionaries can be subjective
– Can be interesting to compare meanings/uses given by dictionaries with
actual usage in corpora
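Concordancing as described above can be sketched as a keyword-in-context (KWIC) listing. This is a minimal illustration, not the output format of any particular concordancer; the function name, the window width and the sample sentence are all assumptions.

```python
def concordance(tokens, target, width=3):
    """Return one KWIC line per hit: `width` tokens either side of the target."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the target words line up.
            lines.append(f"{left:>25}  [{tok}]  {right}")
    return lines

# Invented sample sentence (illustrative only).
text = "she sat on the river bank and he went to the bank to open a bank account"
hits = concordance(text.split(), "bank")

for line in hits:
    print(line)
```

Reading down the aligned hits makes it easier to separate the different meanings and syntactic uses of the word (here, bank as noun vs noun modifier, river bank vs financial bank).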
Collocation example: target word = dog, significance measure = t-score
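One common approximation of the collocation t-score compares the observed co-occurrence count O with the count E expected if the two words were independent: t = (O - E) / sqrt(O), with E = f(node) x f(collocate) / N. The sketch below assumes that formulation (concordancers differ in how they adjust for window size), and all counts are invented for illustration.

```python
import math

def t_score(pair_count, node_count, collocate_count, corpus_size):
    """Approximate t-score: (observed - expected) / sqrt(observed)."""
    expected = node_count * collocate_count / corpus_size
    return (pair_count - expected) / math.sqrt(pair_count)

# Invented counts for a 1m-word corpus in which "dog" occurs 1,000 times.
N = 1_000_000
dog = 1_000

t_barked = t_score(pair_count=40, node_count=dog, collocate_count=200, corpus_size=N)
t_the = t_score(pair_count=70, node_count=dog, collocate_count=60_000, corpus_size=N)

print(round(t_barked, 2), round(t_the, 2))
```

With these invented figures the function word co-occurs with dog more often in raw terms, yet scores lower, because most of those co-occurrences are expected by chance. As a rule of thumb, scores above roughly 2 are often treated as worth a closer look.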
Grammatical studies
• Study of a particular grammatical construction
– Restrictions on form, meaning or context
– Overall frequency (eg relative to alternative constructions)
– Use in different registers (eg narrative vs
argumentative) or modes (eg written vs spoken)
Examples of grammatical studies
• Appositives
– eg George Bush, US president or US president George Bush
– See CF Meyer “Can you really study language variation in
linguistic corpora?” American Speech 79.4 (2004) 339-355
– Genuine titles, “pseudotitles”, descriptives
• Junichiro Koizumi, the Japanese prime minister
• Gerald Ford, former president of the USA
• Osama bin Laden, America’s no.1 enemy
– Looked at how appositives (esp. pseudotitles) are used differently
in newspaper reports from different countries, and how
descriptives become pseudotitles
Examples of grammatical studies
• Clefts and pseudoclefts
– It’s linguistics that interests me most.
– What interests me most is linguistics.
– Linguistics interests me most.
• Infinitival complement clauses
– I hope to go ~ I hope that I can go
– I’m happy to go ~ I’m happy that I can go
– … the proposal to go ~ the proposal that I go
• Simple past vs perfective verb forms
• Use of modals can~may, shall~will
• Use of passive, and means/reasons to avoid
– eg especially in translation
Grammatical studies
• Most try to investigate the factors that
determine choice of one construction over another
Grammatical studies
• Corpus needs to be sufficiently marked up
and tools need to be available for examples to
be extracted
• Corpus may need to be sufficiently large to
get good number of examples
• If comparing registers/subject
domains/modes, corpus needs to reflect these
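The point about mark-up can be illustrated with a part-of-speech-tagged text: once tokens carry tags, candidate examples of a construction can be pulled out mechanically. The sketch below assumes Penn-Treebank-style tags (VBN = past participle) and a deliberately crude pattern for the passive, a form of be immediately followed by a participle; the tagged sentences are invented.

```python
BE_FORMS = {"be", "am", "is", "are", "was", "were", "been", "being"}

def find_passives(tagged):
    """Return (be-form, participle) pairs where a form of 'be' directly precedes a VBN."""
    hits = []
    for (w1, _t1), (w2, t2) in zip(tagged, tagged[1:]):
        if w1.lower() in BE_FORMS and t2 == "VBN":
            hits.append((w1, w2))
    return hits

# Invented, hand-tagged example (tags are an assumption about the corpus format).
tagged = [
    ("The", "DT"), ("report", "NN"), ("was", "VBD"), ("written", "VBN"),
    ("by", "IN"), ("a", "DT"), ("student", "NN"), (".", "."),
    ("She", "PRP"), ("has", "VBZ"), ("written", "VBN"),
    ("two", "CD"), ("more", "JJR"), (".", "."),
]

passives = find_passives(tagged)
print(passives)
```

The tags let the search separate was written (passive) from has written (perfect), which a plain word search for written could not do. A real study would also need to handle intervening adverbs and similar cases.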
Study of language variation
• Both lexical and grammatical studies often
contrast usage by mode, domain, register etc.
• Sociolinguists often interested in other aspects, eg
sex, age, social class of author or audience;
historical linguists interested in change over time
• Recent corpora (eg BNC) have included this
information in header mark-up
• Simple examples
– lovely used more by females than males
– What does cool mean?
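A comparison such as the lovely example above needs frequencies normalised by the amount of text in each group, since the groups will rarely be the same size. A minimal sketch, assuming each text comes with one metadata value (here the speaker's sex, as might be read from header mark-up); the utterances are invented and prove nothing about real usage.

```python
def per_million(count, total_tokens):
    """Scale a raw count to a rate per million tokens."""
    return count * 1_000_000 / total_tokens

def rate_by_group(texts, word):
    """Rate per million of `word` for each metadata group in (group, text) pairs."""
    totals, hits = {}, {}
    for group, text in texts:
        tokens = text.lower().split()
        totals[group] = totals.get(group, 0) + len(tokens)
        hits[group] = hits.get(group, 0) + tokens.count(word)
    return {g: per_million(hits[g], totals[g]) for g in totals}

# Invented utterances with invented metadata (illustrative only).
texts = [
    ("female", "what a lovely day it is"),
    ("male", "it is a nice day"),
    ("female", "that was lovely thanks"),
    ("male", "lovely weather today is it not"),
]

rates = rate_by_group(texts, "lovely")
print(rates)
```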
Genre classification
• Are there lexical and grammatical factors that can help us
to classify text genres?
• Biber used statistical measures to identify stylistic factors
that co-occurred, and could therefore be definitional of text
types and genres
– Eg conjuncts like therefore, nevertheless and use of passive
together indicate more formal style
• Factor analysis
– choose a range of features to measure, see which ones are correlated
– does not (necessarily) predetermine analysis (except obviously you
have to decide what features might be significant)
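The first step behind this kind of analysis, measuring several features per text and seeing which ones rise and fall together, can be sketched with pairwise correlations. This is not a full factor analysis and not Biber's procedure; the feature names and the per-1,000-word rates for five texts are invented.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented rates per 1,000 words in five hypothetical texts.
features = {
    "conjuncts":    [1.0, 2.0, 6.0, 7.0, 8.0],
    "passives":     [3.0, 4.0, 11.0, 12.0, 15.0],
    "contractions": [20.0, 18.0, 5.0, 4.0, 2.0],
}

# Correlate every pair of features once (alphabetical order avoids duplicates).
corr = {(a, b): pearson(features[a], features[b])
        for a in features for b in features if a < b}

for pair, r in corr.items():
    print(pair, round(r, 2))
```

In this toy data conjuncts and passives are strongly positively correlated and both are negatively correlated with contractions, the kind of pattern that factor analysis would group into a single formal vs informal dimension.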
Historical linguistics
• Similar things can be done with historical texts,
though (obviously) these are more limited in terms
of genre
• Also, diachronic studies can compare texts from
different periods (again as long as you compare
like for like as much as possible)
• Topics:
– Change in lexical meaning/usage
– Change/emergence of grammatical constructions
Example of historical study
• Nevalainen in J. Engl. Ling (2000) used Corpus of
Early English Correspondence (U. Helsinki) to
track sex roles in linguistic innovation
• Popular theory that females more innovative, and
males follow trends
• She analysed sex-of-author differences in three
linguistic changes between the 16th and 17th centuries:
– Replacement of ye by you in subject position
– Replacement of 3rd-person verb suffix -th by -s
– Reduction in use of multiple negatives and use of any
and ever instead
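A study of this kind comes down to counting competing variants per period and per author group. The sketch below shows only that tabulation step, under the assumption that each corpus hit has already been coded as (period, sex of author, variant). The records and period labels are invented and do not represent Nevalainen's data or results.

```python
from collections import defaultdict

def variant_share(records, new_variant):
    """Share of `new_variant` among all coded hits, per (period, sex) cell."""
    counts = defaultdict(lambda: [0, 0])   # [new-variant uses, all uses]
    for period, sex, variant in records:
        cell = counts[(period, sex)]
        cell[1] += 1
        if variant == new_variant:
            cell[0] += 1
    return {key: new / total for key, (new, total) in counts.items()}

# Invented coded hits for subject ye vs you (illustrative only).
records = [
    ("1500-1549", "F", "ye"),  ("1500-1549", "F", "you"),
    ("1500-1549", "M", "ye"),  ("1500-1549", "M", "ye"),
    ("1550-1599", "F", "you"), ("1550-1599", "F", "you"),
    ("1550-1599", "M", "you"), ("1550-1599", "M", "ye"),
]

shares = variant_share(records, "you")
for cell in sorted(shares):
    print(cell, shares[cell])
```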
Contrastive analysis, translation theory
• Parallel corpora
– texts + their translations
– preferably “aligned”
• Comparable corpora
– Texts in different languages but of a similar
– What parallels are there in genre
Use of parallel corpora
• Aligned corpus allows search for word or phrase and its translation
– How is it translated?
– Is it translated consistently?
• Of interest in studies of “translationese”
– Translated text too influenced by original
– Certain constructions more prevalent in translation than in native text
• Evidence of “explicitation”
– Translation is often more explicit than original
– Sometimes, explanation added for foreign reader
– But often, just a reflection of the translator’s effort (eg replacement
of pronoun by more explicit referent)
• Also can be used as a tool for translators
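Searching an aligned corpus for a word and its translations can be sketched as follows. The example assumes sentence-level alignment stored as (source, target) pairs and a list of candidate translations supplied by the researcher; the English-French pairs are invented.

```python
from collections import Counter

def translations_of(aligned_pairs, source_word, candidates):
    """Count which candidate target words appear in sentences aligned with source_word."""
    counts = Counter()
    for src, tgt in aligned_pairs:
        if source_word in src.lower().split():
            tgt_tokens = tgt.lower().split()
            for cand in candidates:
                if cand in tgt_tokens:
                    counts[cand] += 1
    return counts

# Invented sentence-aligned pairs (illustrative only).
pairs = [
    ("the house is old", "la maison est vieille"),
    ("he went home", "il est rentré à la maison"),
    ("my house is small", "ma maison est petite"),
    ("she left the house", "elle a quitté le domicile"),
]

counts = translations_of(pairs, "house", ["maison", "domicile"])
print(counts)
```

The resulting counts answer both questions on the slide: how the word is translated, and how consistently.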
Language acquisition
• First-language acquisition
– CHILDES database (Child Language Data Exchange System)
– Transcriptions of conversations with (and between)
young children
– Includes software to help extract data
• Second-language acquisition
– Learner corpora, notably ICLE