Chapter 3: An Introduction to Corpus Linguistics
Compiled by: Sajjad Ghadamyari, Farhad Ghiasvand
Presentation Date: Monday, Dec. 8, 2014

What is Corpus Linguistics?
- Is it a branch of linguistics, like phonology, syntax, semantics, etc.?
- Or is it a methodology for studying language?

At an international conference on corpus linguistics held in the US in 2005, views differed:
- Some said it is an empirical method of linguistic analysis and description that uses real-life examples of language from corpora.
- Others said it is an approach or methodology for studying language use.
- Some viewed it as a theory, much more than a methodology.

Teubert and Krishnamurthy's view:
It is a bottom-up approach that looks at the evidence of the corpus and analyses that evidence with the aim of finding probabilities and patterns, i.e., searching behind the curtain of language data for a system that would explain those data.

Corpus design and construction
A corpus is a collection of naturally occurring language texts, spoken or written.
A corpus is designed and compiled according to the following principles:
- The contents of the corpus are selected based on their communicative purpose.
- The subject matter of the corpus is controlled by external criteria.
- The criteria that determine the structure of the corpus are small in number and separate from one another.
- The language samples in the corpus consist of entire texts.
- The design and composition of the corpus are fully documented, with full justification.

Common external criteria
- Mode: spoken, written
- Date: period of time
- Type: book, journal, etc.
- Location: UK, USA, etc.
- Domain: academic, etc.
- Language: British, American, etc.

Types of Corpora
Criteria for classifying corpora:
- Medium
- Periods of time
- Number of languages
- Text selection procedure
- Purpose
- Type of speaker
- Annotation

Types of corpora in terms of purpose
General vs. Specialized/Domain-specific
- General corpora are bigger than specialized ones. They aim to examine patterns of language use for a language as a whole. Example: BNC (British National Corpus).
- Specialized/domain-specific corpora aim to describe language use in a specific variety, register or genre. Example: JDEST (Computer Corpus of Text in English for Science and Technology).
- Selecting the contents of a specialized corpus requires the corpus linguist to seek advice from experts in the field to ensure its representativeness and balance.

A. Types of corpora in terms of text selection procedure
Sample vs. Full-text
- Sample corpora consist of text samples of approximately the same length. Example: SEU (Survey of English Usage corpus).
- Full-text corpora consist of full texts. Example: English Poetry Full-Text Database.

B. Types of corpora in terms of periods of time
Closed/Static vs. Open/Dynamic
- In closed/static corpora, no more texts are added once the corpus is completed.
- In open/dynamic corpora, new materials are continually added and older materials are discarded. Example: Bank of English (University of Birmingham).

A. Types of corpora in terms of medium
Written vs. Spoken
- Written corpora contain only written texts. Example: Brown.
- Spoken corpora contain spoken materials, concentrating on stress, intonation, etc. Example: MARSEC (Machine Readable Spoken English Corpus).
- Mixed corpora contain both written and spoken materials. Example: ICE (International Corpus of English).

B. Types of corpora in terms of number of languages
Monolingual vs. Multilingual (Parallel, Translation)
- Monolingual corpora are made up of samples of only one language.
- Multilingual corpora are made up of samples of more than one language (the same samples in different languages). Example: English-Norwegian Parallel Corpus.

A. Types of corpora in terms of type of speaker
Native vs. Learner
- Native corpora contain texts written by native English speakers.
- Learner corpora contain texts written by learners of English.

B. Types of corpora in terms of annotation
Plain/Unannotated vs. Annotated
- Plain/unannotated corpora contain only the text samples. Example: Project Gutenberg.
- Annotated corpora contain text samples plus some explicit linguistic information, e.g., genre, register, etc.

Corpus application
Applications:
- Dictionary production
- Tracking changes and variations
- Reference materials production
- Language research

Tracking of English language variations and changes
Leech (2007) suggests four main reasons for all changes:
- Grammaticalization: a process of language change by which words representing objects and actions (i.e., nouns and verbs) become grammatical markers (affixes, prepositions, etc.). Example: "You will/do let us go."
- Colloquialization: a tendency for written norms to become more informal and move closer to speech. For example, there has been a decrease in the use of passive forms, and a shift from the "of" genitive ("the defeat of Liverpool") to the "'s" genitive ("Liverpool's defeat").
- Americanization: new norms that develop in the US lead to increased use of the American term. Examples: "person" to "guy", "doctor" to "doc", "wireless" to "radio".
- Democratization: the removal of linguistic inequalities in society. For example, a decrease in the use of honorifics such as "Mr." and "Madam" and an increase in camaraderie terms, e.g., "Mr." to "dude" or "guy".

Production of dictionaries
Production of other reference materials: practice books, grammar books, etc.
Because corpus samples are simple, they can be used in producing dictionaries, pedagogical coursebooks and reference materials that improve understanding for students, learners and other users.

Using a corpus for language research
A major application of corpora is the study of different aspects of language, including:
- Lexis: Corpora reveal which words have higher frequency and provide information about word formation, derivation and compounds.
- Grammar: Corpora provide information about how sentences and utterances are formed.
- Phraseology: In linguistics, phraseology is the study of set or fixed expressions, such as idioms, collocations, phrasal verbs, and other types of multi-word lexical units, in which meaning is created by the co-selection of the component words. Many phraseological units are sequences of adjacent words, called n-grams, e.g., "thank you". Another kind of phraseology involves non-adjacent words, e.g., "the … of …".
- Literal and metaphorical meanings: Corpora show the metaphorical meanings of a term alongside its literal meaning. For example, "fruit" can mean something to be eaten (literal) or results (metaphorical).
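
The lexis and phraseology points above can be illustrated with a small sketch. It is not taken from the slides: the three-sentence toy corpus and the function names (`tokenize`, `word_frequencies`, `ngrams`, `frame_fillers`) are assumptions made for illustration, and the frame search fills only the first slot of "the … of …". A real study would use a full corpus and a proper tokenizer.

```python
import re
from collections import Counter

# Hypothetical toy corpus; a real study would load many texts from files.
TOY_CORPUS = [
    "Thank you for the end of the story.",
    "Thank you, and thank you again for the rest of the day.",
    "The fruit of the work is the end of the task.",
]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(texts):
    """Lexis: count how often each word occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def ngrams(texts, n=2):
    """Phraseology: count sequences of n adjacent words within each text."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def frame_fillers(texts, first="the", last="of"):
    """Phraseology: count the words filling the one-word gap in 'first _ last'."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for i in range(len(tokens) - 2):
            if tokens[i] == first and tokens[i + 2] == last:
                counts[tokens[i + 1]] += 1
    return counts


if __name__ == "__main__":
    print(word_frequencies(TOY_CORPUS).most_common(3))
    print(ngrams(TOY_CORPUS, 2)[("thank", "you")])
    print(frame_fillers(TOY_CORPUS))
```

On this toy corpus the bigram "thank you" occurs three times, and the frame "the … of" is filled by "end" twice and by "rest" and "fruit" once each. Counting per text, rather than over one joined string, keeps n-grams from spanning text boundaries.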