An Introduction to Corpus Linguistics

Chapter 3: An Introduction to Corpus Linguistics
Compiled by: Sajjad Ghadamyari and Farhad Ghiasvand
Presentation Date: Monday, Dec. 8, 2014
What is Corpus Linguistics?
 Is it a branch of linguistics, like phonology, syntax, or semantics?
 Or is it a methodology for studying language?
 At an international conference on corpus linguistics held in the
US in 2005, some participants described it as an empirical method of
linguistic analysis and description that uses real-life examples of
language from corpora.
 Others described it as an approach or methodology for studying
language use.
 Some viewed it as a theory, much more than a methodology.
 Teubert and Krishnamurthy's view: It is a bottom-up approach
that looks at the evidence of the corpus and analyses it
with the aim of finding probabilities and patterns, i.e., searching
behind the curtain of language data for a system which would
explain those data.
Corpus design and construction
A corpus is a collection of naturally occurring language texts,
spoken or written.
A corpus is designed and compiled according to the following principles:
 Corpus contents are selected based on their communicative function.
 The subject matter of a corpus is controlled by external criteria.
 The criteria that determine the structure of the corpus are few in
number and clearly separate from one another.
 Samples of language for a corpus consist of entire texts.
 The design and composition of the corpus are fully documented
with full justification.
Common external criteria
Period of time
Types of Corpora
(Diagram: corpora classified by periods of time, number of languages, and type of speaker)
Types of corpora in terms of purpose
General vs. Specialized/Domain-specific
 General corpora are bigger than specialized ones. They aim to
examine patterns of language use for a language as a whole.
 BNC (British National Corpus).
 Specialized/Domain-specific corpora aim to describe language
use in a specific variety, register or genre.
 JDEST (Computer Corpus of Texts in English for Science and Technology)
 The selection of the contents of a specialized corpus requires the
corpus linguist to seek advice from the experts of the field to
ensure its representativeness and balance.
A. Types of corpora in terms of text selection procedure
B. Types of corpora in terms of periods of time
A. Sample vs. Full-text
 Sample corpora consist of text samples of approximately the same length.
 SEU (Survey of English Usage corpus)
 Full-text corpora consist of full texts.
 English Poetry Full-Text Database
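The difference between the two designs can be sketched in code. The following is a minimal illustration, not a description of how SEU or any real corpus was compiled: the function names, the whitespace tokenization, and the choice to sample from the start of each text are all assumptions made for the example.

```python
def take_sample(text, sample_size=2000, start=0):
    """Return up to `sample_size` whitespace-separated tokens, beginning at token `start`."""
    tokens = text.split()
    return tokens[start:start + sample_size]


def build_sample_corpus(texts, sample_size=2000):
    """Take one equal-length sample from the start of each text.

    Texts shorter than `sample_size` tokens are skipped so that every
    sample in the corpus has the same length (an assumed policy).
    """
    corpus = []
    for text in texts:
        sample = take_sample(text, sample_size)
        if len(sample) == sample_size:
            corpus.append(sample)
    return corpus


def build_full_text_corpus(texts):
    """Keep every text whole, whatever its length."""
    return [text.split() for text in texts]
```

With `sample_size=3`, the texts "a b c d e", "x y" and "p q r s" would yield two three-word samples in the sample corpus, while the full-text corpus would keep all three texts at their original lengths.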
B. Closed/Static vs. Open/Dynamic
 In closed/static corpora, once the corpus is completed, no more
texts are added.
 In open/dynamic corpora, new materials are continually added and
older materials are discarded.
 Bank of English (University of Birmingham)
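The behaviour of an open/dynamic corpus can be sketched as a bounded queue. This is only an illustration under stated assumptions: the class name `MonitorCorpus`, the size limit counted in texts, and the oldest-first discard policy are hypothetical and do not describe how the Bank of English is maintained.

```python
from collections import deque


class MonitorCorpus:
    """A toy open/dynamic corpus that holds at most `max_texts` texts."""

    def __init__(self, max_texts):
        # A deque with maxlen silently drops the oldest item when full.
        self.texts = deque(maxlen=max_texts)

    def add(self, text):
        """Add a new text; the oldest text is discarded if the corpus is full."""
        self.texts.append(text)

    def word_count(self):
        """Total number of whitespace-separated tokens currently in the corpus."""
        return sum(len(text.split()) for text in self.texts)


corpus = MonitorCorpus(max_texts=2)
for new_text in ["one two", "three", "four five six"]:
    corpus.add(new_text)
```

A closed/static corpus, by contrast, would simply be a fixed list that is never appended to after compilation.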
A. Types of corpora in terms of medium
B. Types of corpora in terms of numbers of languages
A. Written vs. Spoken
 Written corpora only contain written texts.
 Brown
 Spoken corpora contain spoken materials, concentrating on features such as stress and intonation.
 MARSEC (Machine Readable Spoken English Corpus)
 Mixed corpora contain both written and spoken materials.
 ICE (International Corpus of English)
B. Monolingual vs. Multilingual (Parallel, Translation)
 Monolingual corpora are made of samples of only one language.
 Multilingual corpora are made of samples of more than one
language (the same samples in different languages).
 English-Norwegian Parallel Corpus
A. Types of corpora in terms of type of speaker
B. Types of corpora in terms of annotation
A. Native vs. Learner
 Native corpora consist of texts produced by native speakers of English.
 Learner corpora consist of texts produced by learners of English.
B. Plain/Unannotated vs. Annotated
 Plain/Unannotated corpora contain only the text samples.
 Project Gutenberg
 Annotated corpora contain samples of texts plus some explicit
linguistic information, e.g., genre, register, etc.
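The contrast between plain and annotated material can be sketched as a data structure. This is a minimal illustration, assuming a hypothetical `AnnotatedText` record: the field names, the example sentence, the genre and register labels, and the hand-assigned word-class tags are all invented for the example and do not follow any particular annotation scheme.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedText:
    """A text sample plus explicit linguistic information about it."""
    text: str
    genre: str
    register: str
    # Optional word-level annotation as (word, tag) pairs; tags are illustrative.
    tokens: list = field(default_factory=list)


# A plain/unannotated sample is just the text itself.
plain = "The results were published."

# The same sample with annotation added by hand.
annotated = AnnotatedText(
    text=plain,
    genre="academic",
    register="formal",
    tokens=[("The", "DET"), ("results", "NOUN"), ("were", "AUX"),
            ("published", "VERB"), (".", "PUNCT")],
)


def filter_by_genre(corpus, genre):
    """Return only the annotated texts carrying the given genre label."""
    return [item for item in corpus if item.genre == genre]
```

The annotation is what makes a query such as `filter_by_genre(corpus, "academic")` possible; with plain text alone, the genre would have to be inferred or looked up elsewhere.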
Corpus application
Tracking of English language variations and changes
Leech (2007) suggests four main reasons for these changes:
 Grammaticalization: a process of language change by which
words representing objects and actions (i.e., nouns and verbs)
become grammatical markers (affixes, prepositions, etc.).
 You will/do let us go.
 Colloquialization: a tendency for written norms to become
more informal and move closer to speech.
 As an example, there has been a decrease in the use of passive
forms, and a shift from the "of" genitive ("the defeat of Liverpool")
to the "'s" genitive ("Liverpool's defeat").
 Americanization: new norms that develop in the language
in the US spread, resulting in an increase in the use of the
American term.
 “person” to “guy”, “doctor” to “doc”, “wireless” to
 Democratization: Removing linguistic inequalities in
 A decrease in the use of honorifics such as "Mr." and "Madam"
and an increase in camaraderie terms, for example "Mr." giving
way to "dude" or "guy".
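Tracking changes of this kind usually means comparing how often a form occurs in corpora from different periods. Because corpora differ in size, raw counts are typically normalized first, for instance to occurrences per million words. The sketch below shows that arithmetic; the function names are hypothetical and the counts are invented purely for illustration, not taken from Leech (2007) or any real corpus.

```python
def per_million(count, corpus_size):
    """Normalize a raw count to occurrences per million words."""
    return count / corpus_size * 1_000_000


def percent_change(old_rate, new_rate):
    """Relative change from the old rate to the new rate, in percent."""
    return (new_rate - old_rate) / old_rate * 100


# Invented example figures: occurrences of some form in an earlier
# and a later corpus of slightly different sizes.
earlier = per_million(count=12_000, corpus_size=1_000_000)
later = per_million(count=10_500, corpus_size=1_050_000)

print(round(earlier), round(later), round(percent_change(earlier, later), 1))
```

Comparing normalized rates rather than raw counts keeps a larger later corpus from masking a real decline in use.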
Production of dictionaries
Production of other reference materials:
practice books, grammar books, etc.
 Because the samples in corpora are simple, they can be
used in producing dictionaries, pedagogical coursebooks and
reference materials that help students, learners and other
users understand the language better.
Using a corpus for language research
A major application of corpora is to study different aspects of linguistics:
 Lexis: Corpora reveal which words have higher frequency and provide
information about word formation, derivation and compounding.
 Grammar: It provides information about how sentences and utterances are
 Phraseology: In linguistics, phraseology is the study of set or fixed
expressions, such as idioms, collocations, phrasal verbs, and other types of
multi-word lexical units in which the meaning of the expression
is created by the co-selection of its component words.
 Many phraseological units are sequences of adjacent words, called n-grams,
e.g., "thank you". Another kind involves non-adjacent words, e.g., "the … of …".
 Literal and metaphorical meanings: Corpora reveal the metaphorical
meanings of a term alongside its literal meaning.
 Example: "fruit" as something to be eaten (literal) and as results (metaphorical).
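The lexis and phraseology points above can be sketched with simple counting. The following is a minimal illustration on an invented two-sentence text: the regular-expression tokenizer, the function names, and the treatment of "the … of …" as a frame with exactly one word in the gap are simplifying assumptions, not a standard procedure.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return all sequences of n adjacent tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frame_fillers(tokens, first="the", last="of"):
    """Count the words that fill the single gap in a 'first X last' frame."""
    return Counter(t[1] for t in ngrams(tokens, 3)
                   if t[0] == first and t[2] == last)


# Invented sample text standing in for a corpus.
text = "Thank you for the end of the story. Thank you for the rest of it."
tokens = tokenize(text)

word_freq = Counter(tokens)               # lexis: which words are frequent
bigram_freq = Counter(ngrams(tokens, 2))  # adjacent words (2-grams)
fillers = frame_fillers(tokens)           # non-adjacent frame "the ... of"

print(word_freq.most_common(3))
print(bigram_freq[("thank", "you")])
print(fillers)
```

On a real corpus the same counts would be computed over millions of tokens, and the frame would normally allow more than one word in the gap.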