Computational Linguistics

LELA 30922
Lecture 2
Corpus-based research in Linguistics
See esp. Meyer pp. 11-29
1/23
What is corpus linguistics?
• Not a branch of linguistics, like socio~,
psycho~, …
• Not a theory of linguistics
• A set of tools and methods (and a
philosophy) to support linguistic
investigation across all branches of the
subject
2/23
Reminder
• Assessment for this course is to use
corpus/corpora to investigate something
• This lecture may give you some ideas of the
kind of thing you can do
3/23
Applications of corpus linguistics
• Lexicology
• Grammatical studies
• Study of language variation
• Historical linguistics
• Contrastive analysis and translation theory
• Study of language acquisition (psycholinguistics)
• Language teaching
4/23
Lexicology
• Study of behaviour of individual words
• Particularly useful for dictionary
construction (lexicography)
• Can identify more and less frequently
occurring words
• More interesting is HOW words are used
– Syntax
– Meaning
5/23
Lexicology
• Most frequent words are function words
(the, of, and, to, a are 5 most frequent words
in LOB)
• If corpus is small, it can only give an
indicative “snapshot” of word usage
• LOB (1m words): hundreds of words occur
less than 10 times
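A frequency list of this kind can be sketched in a few lines of Python. This is only an illustration: the sample text is invented and stands in for a real corpus such as LOB, and the tokeniser (lower-cased runs of letters and apostrophes) is a simplifying assumption.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count lower-cased word tokens (letters and apostrophes only)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Toy text standing in for a corpus (invented for illustration).
sample = (
    "The dog chased the cat and the cat ran to the end of the garden. "
    "A neighbour saw the dog and called to the owner of the cat."
)

freqs = word_frequencies(sample)

# The most frequent items: function words are expected to dominate.
top_five = freqs.most_common(5)

# Words seen only once: in a small corpus these are a large share of the vocabulary.
rare = sorted(word for word, count in freqs.items() if count == 1)

print(top_five)
print(rare)
```

Even in this tiny sample the top of the list is headed by "the", while most content words turn up only once.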
6/23
Lexicography
• For dictionary construction, need bigger corpus
• “Monitor” corpus, constantly updated and added
to
• Traditional lexicography: collection of “slips” by
experts
– OED took 50 years and includes 5m citations, sorted
and edited manually
• Same idea, but more systematic
• Dictionary as descriptive rather than pre- (or pro-)
scriptive
7/23
Lexicography
• Collins COBUILD
– Birmingham corpus (20m words, 1980s)
– Bank of English corpus (415m words in Oct 2000)
• 70m words of transcripts of BBC broadcasts
• Used as basis of BBC English dictionary
• Cambridge Language Survey
• Longman’s corpus of American English, and use
of BNC for (BrE) dictionary
8/23
Lexicography: how do corpora help?
• Concordancing
– Lists occurrences of word in context
– Identify syntactic use of word
– Identify range of meanings
– Identify relative frequency of different uses/meanings
• Collocation
– What words occur together?
– Compare distribution of close synonyms
• Dictionaries can be subjective
– Can be interesting to compare meanings/uses given by dictionaries with
actual usage in corpora
http://www.collins.co.uk/corpus/CorpusSearch.aspx
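As a rough idea of what a concordancer does, here is a minimal keyword-in-context (KWIC) sketch. The function name `concordance`, the `width` parameter and the sample sentences are all made up for illustration. Real tools work on tokenised, marked-up corpora and respect sentence boundaries, which this sketch does not.

```python
import re

def concordance(text, target, width=3):
    """Return (left context, keyword, right context) for each hit of `target`.

    Context is `width` word tokens on each side; it may cross sentence
    boundaries because punctuation is discarded by the simple tokeniser.
    """
    tokens = re.findall(r"[A-Za-z']+", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Invented sample: "dog" appears twice as a noun and once as a verb.
sample = (
    "The old dog slept by the fire. She walked the dog every morning. "
    "Reporters dog the minister wherever he goes."
)

hits = concordance(sample, "dog")
for left, key, right in hits:
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>25}  {key}  {right}")
```

Reading down the keyword column is what lets the lexicographer separate the noun uses from the verb use and estimate how common each is.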
9/23
10/23
[Screenshot: collocation results for target word "dog"; significance measure: t-score]
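The slide names the t-score but does not give a formula. The sketch below assumes the approximation commonly used in corpus work, t = (O - E) / sqrt(O), where O is the observed count of the word pair and E is the count expected if the two words were independent. It looks only at adjacent pairs (w1 immediately followed by w2), and the token list is invented. Corpus query tools may use a wider window and a different expected-frequency calculation.

```python
import math
from collections import Counter

def t_score(tokens, w1, w2):
    """Approximate t-score for the adjacent pair (w1, w2).

    Assumes t = (O - E) / sqrt(O), with E = f(w1) * f(w2) / N.
    Returns 0.0 when the pair is never observed (a simplification).
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    observed = bigrams[(w1, w2)]
    if observed == 0:
        return 0.0
    expected = unigrams[w1] * unigrams[w2] / n
    return (observed - expected) / math.sqrt(observed)

# Toy token list, invented for illustration.
tokens = ("the dog barked and the dog barked again while "
          "the cat slept and the dog ran").split()

for pair in [("dog", "barked"), ("the", "dog"), ("cat", "barked")]:
    print(pair, round(t_score(tokens, *pair), 3))
```

With so few tokens the scores mean little. On a real corpus, pairs are ranked by score and the top of the list is read as the node word's strongest collocates.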
11/23
Grammatical studies
• Study of a particular grammatical
construction
– Restrictions on form, meaning or context
– Overall frequency (eg relative to alternative
constructions)
– Use in different registers (eg narrative vs
argumentative) or modes (eg written vs spoken)
12/23
Examples of grammatical studies
• Appositives
– eg George Bush, US president or US president George Bush
– See CF Meyer “Can you really study language variation in
linguistic corpora?” American Speech 79.4 (2004) 339-355
– Genuine titles, “pseudotitles”, descriptives
• Junichiro Koizumi, the Japanese prime minister
• Gerald Ford, former president of the USA
• Osama bin Laden, America’s no.1 enemy
– Looked at how appositives (esp. pseudotitles) are used differently
in newspaper reports from different countries, and how
descriptives become pseudotitles
13/23
Examples of grammatical studies
• Clefts and pseudoclefts
– It’s linguistics that interests me most.
– What interests me most is linguistics.
– Linguistics interests me most.
• Infinitival complement clauses
– I hope to go ~ I hope that I can go
– I’m happy to go ~ I’m happy that I can go
– … the proposal to go ~ the proposal that I go
• Simple past vs perfective verb forms
• Use of modals can~may, shall~will
• Use of passive, and means of/reasons for avoiding it
– eg especially in translation
14/23
Grammatical studies
• Most try to investigate the factors that
determine choice of one construction over
another
– Lexical
– Grammatical
– Stylistic
– etc
15/23
Grammatical studies
• Corpus needs to be sufficiently marked up
and tools need to be available for examples to
be extracted
• Corpus may need to be sufficiently large to
get good number of examples
• If comparing registers/subject
domains/modes, corpus needs to reflect these
16/23
Study of language variation
• Both lexical and grammatical studies often
contrast usage by mode, domain, register etc.
• Sociolinguists often interested in other aspects, eg
sex, age, social class of author or audience;
historical linguists interested in change over time
• Recent corpora (eg BNC) have included this
information in header mark-up
• Simple examples
– lovely used more by females than males
– What does cool mean?
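When comparing usage across speaker groups, raw counts mislead if the subcorpora differ in size, so counts are usually normalised, for example to occurrences per million words. A minimal sketch follows. The counts and subcorpus sizes are hypothetical, invented purely for illustration, and are not BNC figures.

```python
def per_million(count, corpus_size):
    """Normalise a raw count to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size

# Hypothetical figures (NOT real corpus data): hits for one word in two
# subcorpora of different sizes, split by a header attribute such as sex.
subcorpora = {
    "female": {"hits": 120, "tokens": 400_000},
    "male": {"hits": 90, "tokens": 600_000},
}

rates = {
    group: per_million(data["hits"], data["tokens"])
    for group, data in subcorpora.items()
}

for group, rate in rates.items():
    print(f"{group}: {rate:.1f} per million words")
```

In this made-up case the raw counts differ by a third, but the normalised rates differ by a factor of two, because the second subcorpus is larger.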
17/23
Genre classification
• Are there lexical and grammatical factors that can help us
to classify text genres?
• Biber used statistical measures to identify stylistic factors
that co-occurred, and could therefore be definitional of text
types and genres
– Eg conjuncts like therefore, nevertheless and use of passive
together indicate more formal style
• Factor analysis
– choose a range of features to measure, see which ones are
correlated
– does not (necessarily) predetermine analysis (except obviously you
have to decide what features might be significant)
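The first step, seeing which features are correlated, can be sketched as below. This is not factor analysis itself, only the pairwise correlation it starts from. The feature names and per-text rates are hypothetical values invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical rates per 1,000 words in five texts (invented, not measured).
features = {
    "passives": [18, 15, 4, 3, 12],
    "conjuncts": [9, 8, 1, 2, 6],
    "contractions": [1, 2, 14, 16, 4],
}

r_formal = pearson(features["passives"], features["conjuncts"])
r_mixed = pearson(features["passives"], features["contractions"])

print(f"passives ~ conjuncts:    {r_formal:.2f}")
print(f"passives ~ contractions: {r_mixed:.2f}")
```

Features that rise and fall together across texts (here passives and conjuncts) are candidates for a shared underlying factor, while a strong negative correlation suggests opposite ends of the same dimension.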
18/23
Historical linguistics
• Similar things can be done with historical texts,
though (obviously) these are more limited in terms
of genre
• Also, diachronic studies can compare texts from
different periods (again as long as you compare
like for like as much as possible)
• Topics:
– Change in lexical meaning/usage
– Change/emergence of grammatical constructions
19/23
Example of historical study
• Nevalainen in J. Engl. Ling (2000) used Corpus of
Early English Correspondence (U. Helsinki) to
track sex roles in linguistic innovation
• Popular theory that females more innovative, and
males follow trends
• She analysed sex-of-author differences in three
linguistic changes between 16th and 20th century:
– Replacement of ye by you in subject position
– Replacement of 3rd-person verb suffix -th by -s
– Reduction in use of multiple negatives and use of any
and ever instead
20/23
Contrastive analysis, translation theory
• Parallel corpora
– texts + their translations
– preferably “aligned”
• Comparable corpora
– Texts in different languages but of a similar
nature
– What parallels are there in genre
characteristics?
21/23
Use of parallel corpora
• Aligned corpus allows search for word or phrase and its
translation
– How is it translated?
– Is it translated consistently?
• Of interest in studies of “translationese”
– Translated text too influenced by original
– Certain constructions more prevalent in translation than in native
text
• Evidence of “explicitation”
– Translation is often more explicit than original
– Sometimes, explanation added for foreign reader
– But often, just a reflection of the translator’s effort (eg replacement
of pronoun by more explicit referent)
• Also can be used as a tool for translators
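Searching an aligned corpus for a word and its translations might look like the sketch below. The sentence pairs are invented, and the helper names `find_aligned` and `count_rendered_as` are hypothetical. A real parallel corpus would supply the alignment, and matching translations reliably needs more than a simple word lookup.

```python
import re

# Invented English-French sentence pairs standing in for an aligned corpus.
aligned = [
    ("I like the house.", "J'aime la maison."),
    ("The house is old.", "La maison est vieille."),
    ("He went home to his house.", "Il est rentré chez lui."),
]

def words(sentence):
    """Lower-cased word tokens of a sentence."""
    return re.findall(r"\w+", sentence.lower())

def find_aligned(pairs, source_word):
    """Return the pairs whose source sentence contains `source_word`."""
    return [(s, t) for s, t in pairs if source_word.lower() in words(s)]

def count_rendered_as(pairs, source_word, candidate):
    """Count hits whose target sentence contains the candidate translation."""
    hits = find_aligned(pairs, source_word)
    return sum(1 for _, t in hits if candidate.lower() in words(t))

hits = find_aligned(aligned, "house")
with_maison = count_rendered_as(aligned, "house", "maison")

print(f"'house' found in {len(hits)} pairs, rendered as 'maison' in {with_maison}")
for source, target in hits:
    print(f"  {source}  |  {target}")
```

The pair where the expected translation is missing is the interesting one: it shows the translator choosing a different construction, which is the kind of evidence used in studies of translation consistency.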
22/23
Language acquisition
• First-language acquisition
– CHILDES database (Child Language Data Exchange
System) http://childes.psy.cmu.edu/
– Transcriptions of conversations with (and between)
young children
– Includes software to help extract data
• Second-language acquisition
– Learner corpora, notably ICLE
– http://www.fltr.ucl.ac.be/FLTR/GERM/ETAN/CECL/Cecl-Projects/Icle/icle.htm
23/23