Computational Linguistics

LELA 30922 Lecture 2
Corpus-based research in Linguistics
See esp. Meyer pp. 11-29

What is corpus linguistics?
• Not a branch of linguistics, like socio~, psycho~, …
• Not a theory of linguistics
• A set of tools and methods (and a philosophy) to support linguistic investigation across all branches of the subject

Reminder
• Assessment for this course is to use corpus/corpora to investigate something
• This lecture may give you some ideas of the kind of thing you can do

Applications of corpus linguistics
• Lexicology
• Grammatical studies
• Study of language variation
• Historical linguistics
• Contrastive analysis and translation theory
• Study of language acquisition (psycholinguistics)
• Language teaching

Lexicology
• Study of behaviour of individual words
• Particularly useful for dictionary construction (lexicography)
• Can identify more and less frequently occurring words
• More interesting is HOW words are used
  – Syntax
  – Meaning

Lexicology
• Most frequent words are function words (the, of, and, to, a are 5 most frequent words in LOB)
• If corpus is small, it can only give an indicative “snapshot” of word usage
• LOB (1m words): hundreds of words occur less than 10 times

Lexicography
• For dictionary construction, need bigger corpus
• “Monitor” corpus, constantly updated and added to
• Traditional lexicography: collection of “slips” by experts
  – OED took 50 years and includes 5m citations, sorted and edited manually
• Same idea, but more systematic
• Dictionary as descriptive rather than pre- (or pro-) scriptive

Lexicography
• Collins COBUILD
  – Birmingham corpus (20m words, 1980s)
  – Bank of English corpus (415m words in Oct 2000)
• 70m words of transcripts of BBC broadcasts
• Used as basis of BBC English dictionary
• Cambridge Language Survey
• Longman’s corpus of American English, and use of BNC for (BrE) dictionary

Lexicography: how do corpora help?
• Concordancing
  – Lists occurrences of word in context
  – Identify syntactic use of word
  – Identify range of meanings
  – Identify relative frequency of different uses/meanings
• Collocation
  – What words occur together?
  – Compare distribution of close synonyms
• Dictionaries can be subjective
  – Can be interesting to compare meanings/uses given by dictionaries with actual usage in corpora
http://www.collins.co.uk/corpus/CorpusSearch.aspx

[Figure: Target word = dog; significance measure: t-score]

Grammatical studies
• Study of a particular grammatical construction
  – Restrictions on form, meaning or context
  – Overall frequency (eg relative to alternative constructions)
  – Use in different registers (eg narrative vs argumentative) or modes (eg written vs spoken)

Examples of grammatical studies
• Appositives
  – eg George Bush, US president or US president George Bush
  – See CF Meyer “Can you really study language variation in linguistic corpora?” American Speech 79.4 (2004) 339-355
  – Genuine titles, “pseudotitles”, descriptives
    • Junichiro Koizumi, the Japanese prime minister
    • Gerald Ford, former president of the USA
    • Osama bin Laden, America’s no.1 enemy
  – Looked at how appositives (esp. pseudotitles) are used differently in newspaper reports from different countries, and how descriptives become pseudotitles

Examples of grammatical studies
• Clefts and pseudoclefts
  – It’s linguistics that interests me most.
  – What interests me most is linguistics.
  – Linguistics interests me most.
• Infinitival complement clauses
  – I hope to go ~ I hope that I can go
  – I’m happy to go ~ I’m happy that I can go
  – … the proposal to go ~ the proposal that I go
• Simple past vs perfective verb forms
• Use of modals can~may, shall~will
• Use of passive, and means/reasons to avoid
  – eg especially in translation

Grammatical studies
• Most try to investigate the factors that determine choice of one construction over another
  – Lexical
  – Grammatical
  – Stylistic
  – etc

Grammatical studies
• Corpus needs to be sufficiently marked up and tools need to be available for examples to be extracted
• Corpus may need to be sufficiently large to get good number of examples
• If comparing registers/subject domains/modes, corpus needs to reflect these

Study of language variation
• Both lexical and grammatical studies often contrast usage by mode, domain, register etc.
• Sociolinguists often interested in other aspects, eg sex, age, social class of author or audience; historical linguists interested in change over time
• Recent corpora (eg BNC) have included this information in header mark-up
• Simple examples
  – lovely used more by females than males
  – What does cool mean?

Genre classification
• Are there lexical and grammatical factors that can help us to classify text genres?
• Biber used statistical measures to identify stylistic factors that co-occurred, and could therefore be definitional of text types and genres
  – Eg conjuncts like therefore, nevertheless and use of passive together indicate more formal style
• Factor analysis
  – choose a range of features to measure, see which ones are correlated
  – does not (necessarily) predetermine analysis (except obviously you have to decide what features might be significant)

Historical linguistics
• Similar things can be done with historical texts, though (obviously) these are more limited in terms of genre
• Also, diachronic studies can compare texts from different periods (again as long as you compare like for like as much as possible)
• Topics:
  – Change in lexical meaning/usage
  – Change/emergence of grammatical constructions

Example of historical study
• Nevalainen in J. Engl. Ling (2000) used Corpus of Early English Correspondence (U. Helsinki) to track sex roles in linguistic innovation
• Popular theory that females more innovative, and males follow trends
• She analysed sex-of-author differences in three linguistic changes between 16th and 20th century:
  – Replacement of ye by you in subject position
  – Replacement of 3rd-person verb suffix -th by -s
  – Reduction in use of multiple negatives and use of any and ever instead

Contrastive analysis, translation theory
• Parallel corpora
  – texts + their translations
  – preferably “aligned”
• Comparable corpora
  – Texts in different languages but of a similar nature
  – What parallels are there in genre characteristics?

Use of parallel corpora
• Aligned corpus allows search for word or phrase and its translation
  – How is it translated?
  – Is it translated consistently?
• Of interest in studies of “translationese”
  – Translated text too influenced by original
  – Certain constructions more prevalent in translation than in native text
• Evidence of “explicitation”
  – Translation is often more explicit than original
  – Sometimes, explanation added for foreign reader
  – But often, just a reflection of the translator’s effort (eg replacement of pronoun by more explicit referent)
• Also can be used as a tool for translators

Language acquisition
• First-language acquisition
  – CHILDES database (Child Language Data Exchange System) http://childes.psy.cmu.edu/
  – Transcriptions of conversations with (and between) young children
  – Includes software to help extract data
• Second-language acquisition
  – Learner corpora, notably ICLE
  – http://www.fltr.ucl.ac.be/FLTR/GERM/ETAN/CECL/Cecl-Projects/Icle/icle.htm
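
Illustration: frequency lists, concordancing and collocation
The three lexicographic uses of a corpus listed earlier (counting word frequencies, listing a word in context, and scoring which words occur together) can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the toy text, the crude tokenizer, the function names and the window sizes are all invented for illustration. The t-score uses one common textbook formulation, t = (O - E) / sqrt(O) with E = f(node) x f(collocate) x window size / N, which is not necessarily what the Collins search tool computes.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def concordance(tokens, node, width=3):
    """Return key-word-in-context lines: up to `width` tokens either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


def t_scores(tokens, node, span=2):
    """Score every word that co-occurs with `node` within `span` tokens either side.

    Assumed formulation: t = (O - E) / sqrt(O), where O is the observed
    co-occurrence count and E = f(node) * f(collocate) * (2 * span) / N
    is the count expected if the two words were unrelated.
    """
    n = len(tokens)
    freq = Counter(tokens)
    observed = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            observed.update(window)
    scores = {}
    for word, o in observed.items():
        expected = freq[node] * freq[word] * (2 * span) / n
        scores[word] = (o - expected) / math.sqrt(o)
    return scores


# Toy "corpus" (hypothetical; a real study would load LOB, the BNC, etc.)
text = (
    "The dog barked at the cat. The old dog slept. "
    "A dog barked again and the cat ran."
)
tokens = tokenize(text)

# Frequency list: function words tend to come out on top
print(Counter(tokens).most_common(3))

# Concordance for the target word
hits = concordance(tokens, "dog", width=2)
for line in hits:
    print(line)

# Collocates of the target word, strongest first
scores = t_scores(tokens, "dog", span=2)
for word, t in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(f"{word}\t{t:.2f}")
```

With a corpus this small the scores mean nothing statistically; the point is only the shape of the calculation. Note that a frequent function word such as "the" scores lower than "barked" even though both occur near "dog" equally often, because its expected count is higher.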
