corpus

As the main carrier of semantic information, the word is the central element of
the utterance. Modern theoretical and applied research, from logical to
morphological, pays great attention to it because no other linguistic unit has such
unity of form and content or plays such an important role. Language, and
especially its vocabulary, is constantly evolving: words take on new meanings and
old meanings disappear. In most cases, new words reflect emerging concepts in
science, technology, everyday life, social relations, politics and economics.
The present linguistic situation, with the computerization of society and the
so-called "information explosion" that has led to a sharp expansion of channels
of communication, forces us to pay special attention to shifts in lexical meaning.
Modern research methods and techniques produce new tools which were
unknown to linguists of the past and which help us verify hypotheses and the
results of studies.
In modern linguistics the research tool is not only the dictionary, which
registers a word in its paradigmatic and syntagmatic relations, but also the
concordance, based on a representative sample of texts. Although a change in the
meaning of even a single word is a problem in its own right, a collection of words
belonging to a specific domain can be represented as a system in which the
meanings of words are connected in a certain way.
In general, the applied aspects of linguistics that support various spheres of
human activity concentrate primarily on one general problem: the processing of
information in society. This information includes not only written texts
but also oral speech, the most usual means of communication. New
information technologies enable us to study language from various sources
(dictionaries, fiction, newspapers, etc.) and to enter and process large
amounts of text, known as text corpora, with the help of computers. Computer
processing and special programs are extremely important for lexicology and
lexicography because they make compiling dictionaries, writing textbooks and
carrying out literary research on ancient and modern authors much easier and more
productive. Studies of semantics and syntax have also undergone serious changes
brought about by these new possibilities.
Corpus linguistics (which originated in applied linguistics) is based on the
corpus: a large amount of living language material which can be extracted from
various sources and entered into a computer. It studies speech and language
from a different angle, opening up a huge vocabulary for research and providing
powerful new tools for scientists.
At present corpus linguistics, which studies the distribution of linguistic
phenomena in different languages and makes it possible to obtain new and
objective linguistic data, is becoming very important. The advantage of this
approach is that it is based on objective information and so avoids the
subjectivity typical of traditional linguistics.
Some basic features of corpus linguistics have been known for a long time, for
example distributional methodology and the creation of concordances. However, as
an independent linguistic discipline it took shape relatively recently.
According to P. Baker, “in linguistics a corpus is a collection of texts (a
‘body’ of language) stored in an electronic database” (Baker 2006).
The corpus (text corpus) of any language is a collection of texts in that
language which is represented in electronic form and provided with an “apparatus
criticus” (scientific definitions, literature cited, references). This “apparatus
criticus” built into the corpus is called the “markup” or “annotation” of the
corpus. If the markup is correct, it is easy to find any word, phrase or
grammatical structure in the corpus that is needed for linguistic analysis. A text
corpus is used to carry out statistical analysis, to check occurrences of
linguistic rules, to determine the usage of a particular sound, word or syntactic
construction, to research phraseology and to gain an overview of a word in its
linguistic environment. A corpus helps to examine how people use a given word
(lemma) in spoken and written language.
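The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular corpus tool: the token format (word form, lemma, part-of-speech tag), the sample sentences and the function name `concordance` are all assumptions made for the example.

```python
from typing import List, Tuple

# A toy annotated corpus: each token is (word form, lemma, part-of-speech tag).
# The sentences and tags are invented for illustration only.
Token = Tuple[str, str, str]

CORPUS: List[Token] = [
    ("She", "she", "PRON"), ("uses", "use", "VERB"), ("a", "a", "DET"),
    ("corpus", "corpus", "NOUN"), (".", ".", "PUNCT"),
    ("They", "they", "PRON"), ("used", "use", "VERB"), ("two", "two", "NUM"),
    ("corpora", "corpus", "NOUN"), (".", ".", "PUNCT"),
]


def concordance(tokens: List[Token], lemma: str, width: int = 2) -> List[str]:
    """Return keyword-in-context lines for every token whose lemma matches."""
    lines = []
    for i, (form, tok_lemma, _) in enumerate(tokens):
        if tok_lemma == lemma:
            # Take up to `width` word forms on each side of the match.
            left = " ".join(w for w, _, _ in tokens[max(0, i - width):i])
            right = " ".join(w for w, _, _ in tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{form}] {right}")
    return lines


for line in concordance(CORPUS, "use"):
    print(line)
```

Because the search runs on the lemma annotation rather than on the surface form, one query finds both "uses" and "used", which is the practical benefit of correct markup.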
Corpus research also helps to reveal the wide semantic range of polysemous
words in a broad context and supports the identification of a word's meaning
in a particular act of communication. Any investigation of a text
corpus yields substantial material not only for compiling a dictionary but also
for differentiating between varieties of a language. The following types of
corpora can be distinguished:
1) Annotated Corpus;
2) Comparable (reference) Corpus;
3) Monitor Corpus;
4) Monolingual Corpus;
5) Multilingual Corpus;
6) Parallel (aligned) Corpus;
7) Reference Corpus;
8) Spoken Corpus;
9) Unannotated Corpus;
10) Speech Corpus.
The British National Corpus (BNC) is one of the first large corpora in the
world, created by specialists in lexicography. The BNC contains more than one
hundred million word tokens. The corpus also includes metatextual and
part-of-speech markup and a subcorpus of oral speech. The corpus mostly comprises
different types of written and spoken British English. The BNC also contains a
great number of speech phenomena.
Corpus linguistics provides material for different studies of a language and its
varieties, and defines the basic method for analysing texts on the basis of
corpora (the corpus-based approach). This approach, or the method of linguistic
research based on text corpora, is focused on the applied study of language and
its functioning in a real environment, which is important for language teaching.
For example, lexicographic analysis based on corpora helps to reveal
the contextual use of words, especially synonyms (for example, small/little,
big/large), their frequency, their combinability with other words and their
distribution across different styles, and to define their semantics accurately.
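As a rough illustration of how the combinability of near-synonyms such as small/little can be compared, the sketch below counts the words that immediately follow each of them. The sentences are invented for the example and the helper name `right_collocates` is hypothetical; a real study would run the same kind of count over a corpus such as the BNC.

```python
from collections import Counter
from typing import List

# Invented example sentences; they stand in for real corpus data.
SENTENCES = [
    "a small amount of money",
    "a little boy with a small dog",
    "the little girl saw a little bird",
    "a small number of small firms",
]


def right_collocates(sentences: List[str], node: str) -> Counter:
    """Count the words that immediately follow `node` in the sentences."""
    counts: Counter = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        # Pair every word with its right-hand neighbour.
        for current, following in zip(words, words[1:]):
            if current == node:
                counts[following] += 1
    return counts


for word in ("small", "little"):
    print(word, dict(right_collocates(SENTENCES, word)))
```

In this toy sample "small" is followed by quantity and object nouns while "little" is followed by nouns for living beings. That is a property of the invented data only, but it shows the kind of pattern such counts are meant to expose.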
The corpus-based approach provides an opportunity to observe the
behavior of words, phrases, grammatical categories, syntactic constructions, etc.
in a natural language environment, that is, in real rather than artificially
constructed contexts.
In addition, by applying statistical methods, corpus studies make it possible to
formulate, prove or disprove a hypothesis about a particular linguistic
phenomenon on the basis of a large amount of material.
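One way such a hypothesis might be tested is sketched below. The text does not name a specific statistic, so the log-likelihood (G2) score used here is an assumption, chosen because it is widely used for comparing word frequencies between two corpora; the version shown is the simplified two-term form, and all counts are invented.

```python
import math


def log_likelihood(freq_a: int, size_a: int, freq_b: int, size_b: int) -> float:
    """Simplified two-term log-likelihood (G2) for one word in two corpora.

    freq_a, freq_b: occurrences of the word in corpus A and corpus B.
    size_a, size_b: total number of word tokens in each corpus.
    """
    total_freq = freq_a + freq_b
    total_size = size_a + size_b
    # Expected counts if the word were equally frequent in both corpora.
    expected_a = size_a * total_freq / total_size
    expected_b = size_b * total_freq / total_size
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# Hypothetical counts: 30 hits in 1,000 tokens versus 10 hits in 1,000 tokens.
score = log_likelihood(30, 1000, 10, 1000)
print(round(score, 2))
```

A score above 3.84 corresponds to p < 0.05 with one degree of freedom, so in this invented case the hypothesis that the word is equally frequent in both corpora would be rejected.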
Moreover, researchers who use an existing corpus completely bypass
the long and time-consuming phase of collecting material (surveying informants,
working with dictionary card files or written texts, etc.).