1. History of corpus linguistics

Introduction
Corpus linguistics is a part of computational linguistics and a part of general language
studies. It deals with large collections of written and spoken texts of up to hundreds of
millions of words. Corpus linguistics has two main areas of interest. The first is building,
maintaining and operationalizing such collections. The second is analysing and interpreting
language features and phenomena that are usually not visible on a smaller scale.
Thus, the first part of this work presents computer technology that provides tools for
language analysis and for the formal description of language as a system that can be digitized,
stored and accessed with electronic tools. This requires designing effective storage
systems and user-friendly interfaces that make the texts available for research and analysis.
Storing letters and words was not a problem even for early computers, but storing
descriptions of various aspects of words and finding multidimensional links between
elements of the language is still challenging for language technologists. This is the domain
of computer science. Methods of analysis come from statistics and matrix analysis, because
language is multidimensional and many aspects have to be taken into consideration at the
same time.
The second part deals with the interpretation of data searched for and obtained from a large
body of evidence of language use. This part approaches language as a social, cultural and
psychological phenomenon. It is the domain of psycholinguistics, sociolinguistics,
cultural studies and other linguistic areas of interest such as lexical studies, semantics,
pragmatics, stylistics, speech analysis and, last but not least, teaching and translation.
Results in this area are both academic, that is, growth in knowledge about the language, its
interrelations with other languages and the way it functions, and practical, that is, the
provision of databases and tools for learners, teachers and translators, both human and machine.
The third part is an introduction to corpus analysis and statistics. Some basic concepts of
corpus linguistics as an empirical domain are presented. Counts, frequencies and statistical
tests illustrate the methodology of corpus analysis.
Last but not least, part four gives examples of the use and application of the tools in
linguistics and suggests ideas for further research and development. With such a large
collection of texts showing language in use at one's disposal, human imagination seems to be
the only limit to research.
Part 1. Corpora
1. History of corpus linguistics
Linguists have always been interested in how people use languages, all the languages that are
available to them: native, second and foreign languages they learn. These real-life acts of
linguistic behaviour can be observed by investigating corpora. For years linguistics was seen
as an empirical science. That is why research on collections of texts has a long tradition.
Without computers, handling a collection of texts bigger than one book was extremely
time-consuming. What is more, there was a fashionable tendency in linguistics against corpus
linguistics, which prevailed for twenty years from the late 1950s. However, contrary to this
popular trend, many linguists continued to work on corpora for thirty years. Their work
enabled corpus linguistics to start developing fast in the eighties.
Under Chomsky’s influence, linguistics dealt for many years with “potential language”, i.e.
the internal, not the real, language of a linguist. In effect, such work is a case study of
one linguist’s potential language. To compare this with other social studies: it is certainly
interesting to study people’s dreams, but it is much more interesting to observe their
behaviour and their actual acts and reactions. Both approaches can find their place in the
field of research called linguistics, and neither of them is superior to the other.
Let us define a corpus linguist as a linguist whose main interest is the language as an existing
phenomenon expressed orally or with the use of a tool, for example: paper, screen, silk or
stone.
Neither an “armchair linguist” nor a corpus linguist can describe a language in its entirety.
The former can always find utterances that have never been uttered, while the latter can find
phrases actually used that he could never have imagined being uttered. The limits in the first
case are set by the imagination of the linguist, and in the second case by the basic reasons
for which humans use language (Jacobson).
Technology makes corpora feasible for research without wasting researchers’ time on tediously
sorting the words again and again. It is worth pointing out that the first personal computers
appeared in 1976. Constant interest in the real use of language and the availability of
user-friendly tools for research are the key factors in the development of corpus linguistics,
which gives a broader picture of what users do with the language, far beyond what a linguist
can ever imagine.
Early corpus linguistics – before the Chomskyan revolution in the 1950s
Year | Researchers and a short description of their corpora and research | Area of study
1897 | Käding created an 11-million-word corpus of German to study frequency distribution and sequences of letters in German (5,000 analysts were used to study it) | spelling
1898 | Preyer - studies on first language acquisition based on parental diaries | acquisition
1921 | Thorndike’s word counts were important in defining the goals of the vocabulary movement in second language pedagogy | pedagogy
1924 | Stern - studies on first language acquisition based on parental diaries | acquisition
1931 | Palmer’s word counts were important in defining the goals of the vocabulary movement in second language pedagogy | pedagogy
1940 | Fries and Traver - vocabulary lists for foreign learners | pedagogy
1940 | Eaton - comparing the frequency of word meanings in Dutch, French, German, and Italian | comparative studies
1947 | Bongers’ vocabulary lists for foreign learners | pedagogy
1949 | Father Busa started creating a computerized corpus of medieval philosophy | computerized corpora
1949 | Lorge - semantic frequency list | syntax and semantics
1952 | Fries - descriptive grammar of English based on a corpus | syntax
1954 | McCarthy - corpora of children’s utterances were gathered to establish the stages of linguistic development of a child | acquisition
1956 | Gougenheim et al. - a study of high-frequency lexical and grammatical choices based on a transcribed corpus of spoken French | syntax and semantics
1956 | Julliand started developing machine-readable corpora, sampling techniques, annotations | mechanolinguistics
1957 | Firth, in a series of writings from the 1930s, 1940s and 1950s, established the basic terminology used in corpus linguistics. Neo-Firthians (Halliday, Hoey, Sinclair) continued the tradition. | corpus linguistics
1960 | Quirk started the Survey of English Usage | grammar
1960 | Francis and Kucera began the Brown Corpus | grammar
1964 | Julliand and Chang Rodriguez - report on corpus construction | mechanolinguistics
1967 | Father Busa finished his project on medieval philosophy | computerized corpora
1970 | Bloom - longitudinal studies on language acquisition | acquisition
1973 | Brown - longitudinal studies on language acquisition | acquisition
Renaissance of corpus linguistics
1985 | Quirk et al. - A Comprehensive Grammar of the English Language | syntax
2. Building a corpus
2.1. Definition of a corpus
The word corpus comes from Latin, where it means body. The plural is usually corpora; a newer
plural is corpuses.
A corpus is normally defined as a collection of texts, spoken and/or written, which has been
designed and compiled based on a set of clearly defined criteria.
Here are some definitions of corpora from various sources:
“A collection of texts, especially if complete and self-contained: the corpus of Anglo-Saxon
verse. In linguistics and lexicography, a body of texts, utterances, or other specimens
considered more or less representative of a language, and usually stored as an electronic
database. Currently, computer corpora may store many millions of running words, whose
features can be analysed by means of tagging (the addition of identifying and classifying tags
to words and other formations) and the use of concordancing programs. Corpus linguistics
studies data in any such corpus.”
(The Oxford Companion to the English Language, ed. McArthur & McArthur, 1992)
“A collection of linguistic data, either written texts or a transcription of recorded speech,
which can be used as a starting-point of linguistic description or as a means of verifying
hypotheses about a language.”
(David Crystal, A Dictionary of Linguistics and Phonetics, Blackwell, 3rd Edition, 1991)
“A collection of naturally occurring language text, chosen to characterize a state or variety of
a language.”
(John Sinclair, Corpus, Concordance, Collocation, OUP, 1991)
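The first definition above mentions concordancing programs. As a rough illustration only, the following Python sketch shows the basic idea of a keyword-in-context (KWIC) concordance: every occurrence of a search word is listed together with a fixed amount of text on either side. The function name kwic, the width parameter and the sample text are assumptions made for this example and are not taken from any particular concordancer.

```python
import re


def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per occurrence of `keyword`.

    `width` is the number of characters of context kept on each side.
    Matching is case-insensitive and restricted to whole words.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so that the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines


# Hypothetical two-sentence "corpus" used only for demonstration.
sample = (
    "A corpus is a collection of texts. "
    "Each corpus is compiled according to clearly defined criteria."
)

concordance = kwic(sample, "corpus", width=20)
for line in concordance:
    print(line)
```

A real concordancer would work on tokenized and possibly tagged text rather than on raw characters, but the aligned display of the search word is the same in principle.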
2.2. Legal matters
Collections of texts are usually created for research and practical applications. Examples are
presented in Part 4. Parts of the texts from a corpus are quoted in reports, dictionaries and
other publications. Thus, copyright issues arise. The corpus holder should obtain permission
from the authors of the texts to use them.
“Language cannot be invented; it can only be captured.”
(Sinclair, 1997: 31)
The number of utterances in any language is unlimited. However, the data collected is limited
and finite. Thus the problem of representativeness arises.
2.3. Types of corpora
There are different categories in which corpora can be described.
The first distinction can be drawn between a monitor corpus and a sample corpus (Sinclair
1991: 23-26).
The monitor corpus attempts to be a representative cross-section of the spoken and/or written
language to be studied (e.g. the Bank of English (COBUILD) and the British National
Corpus) and by its very nature it has to be very large (the Bank of English is about 400
million words of written and spoken texts and continues to grow). Monitor corpora have to be
continually updated with 'new' texts, and 'old' texts must be discarded if they are to be truly
representative.
The sample corpus does not pretend to be representative of the whole spoken and/or written
forms of the language to be investigated. Sample corpora are much more common and they
are the norm in most corpus-based studies (e.g. International Corpus of English and the Hong
Kong Corpus of Spoken English at PolyU).
In order to create a corpus, it is vital to establish a set of homogeneous criteria, which
should be applied consistently to all items in the corpus. The type of corpus reflects the
research aims it is intended to serve.
There are various categories related to text origin, authorship, size, language, social and
cultural contexts, time, etc. that need to be taken into consideration.
Speech and sign language are inherently human, that is, humans do not need anything but their
body to produce language. Written language and netspeak are tool-dependent, that is, one
needs some surface and a tool to write with.
The categories presented below show various aspects and dimensions of corpora.
Types of corpora based on text origin, with examples:
1. Spoken corpus recorded as sound files: media recordings - TV (various subtypes); media recordings - radio; private conversation; business conversation
2. Spoken corpus transcribed into written form: media recordings - TV (various subtypes); media recordings - radio; private conversation; business conversation
3. Written corpus: printed in a book (various subtypes); printed in a journal; private correspondence; business correspondence
4. Netspeak: www sites (various subtypes); electronic media; private e-mail; business e-mail
Types of corpora based on authors of the utterances:
1 | Gender | Female spoken or written | Male spoken or written
2 | Age | Adult | Children, teenagers
3 | Special features | Person or people without language abnormalities | Person or people with special educational needs (dyslexic, blind, mentally handicapped native speakers or non-native learners)
4 | Unified group or culture | National | Subcultures (various types)
5 | Nationality | One nation | Multinational
6 | Linguistic origin | Monolingual | Bilingual or multilingual
7 | Linguistic unity | Monolingual | Parallel bilingual or multilingual
8 | Linguistic experience | Native | Learner
9 | Number of authors | One | Many
10 | Source of the text | Original creation | Translation
Types of corpora based on their size and openness:
1 | Monitor corpus | Sample corpus
2 | Closed corpus (one writer’s productions, e.g. a Shakespeare corpus, a Mallory corpus) | Open - can be developed
Types of corpora based on location, place, geographical features:
Language(s) spoken in a selected area:
District
Town
Region
Country
Types of corpora based on content:
Various, as in monitor corpus
Restricted e.g.
One artist’s texts (writer, poet, singer)
Translations of legal documents
Essays of students from one class.
Types of corpora based on time:
1 | Length of period of collection | Long time | Short time
2 | Ways of collecting in time | Continuous | Sampled
3 | History | Synchronic | Diachronic
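The dimensions listed in this section can be recorded as metadata for each corpus, so that collections can be compared and selected by type. The following Python sketch is only one possible way of doing this. The class CorpusProfile, its field names, the helper select and the two example corpora are all hypothetical and introduced purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusProfile:
    """Hypothetical metadata record covering some dimensions from this section."""
    name: str
    medium: str        # "spoken-audio", "spoken-transcribed", "written" or "netspeak"
    monitor: bool      # True = monitor corpus, False = sample corpus
    open_ended: bool   # True = open (can be developed), False = closed
    languages: tuple   # language codes of the texts
    parallel: bool     # True = parallel bilingual or multilingual corpus
    learner: bool      # True = learner texts, False = native texts
    diachronic: bool   # True = diachronic, False = synchronic


def select(profiles, **criteria):
    """Return the profiles whose fields equal all the given criteria."""
    return [
        p for p in profiles
        if all(getattr(p, field) == value for field, value in criteria.items())
    ]


# Invented examples; they do not describe any existing corpus.
profiles = [
    CorpusProfile(
        name="essays of students from one class",
        medium="written", monitor=False, open_ended=False,
        languages=("en",), parallel=False, learner=True, diachronic=False,
    ),
    CorpusProfile(
        name="translations of legal documents",
        medium="written", monitor=False, open_ended=True,
        languages=("en", "pl"), parallel=True, learner=False, diachronic=False,
    ),
]

learner_corpora = select(profiles, learner=True)
parallel_corpora = select(profiles, parallel=True, medium="written")
print([p.name for p in learner_corpora])
print([p.name for p in parallel_corpora])
```

Keeping each dimension as a separate field means that a single corpus can be described along several categories at once, for example as written, closed, monolingual and learner-produced.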