
CORPORA OF THE RUSSIAN LANGUAGE
Victor Zakharov
Saint-Petersburg State University, Saint-Petersburg, Russia
vz1311@yandex.ru
The paper describes corpora of the Russian language and the state of the art of Russian corpus
linguistics.
The earliest Russian corpus was built in the 1980s at the University of Uppsala (Sweden). But other
projects influenced this direction in Russia so much that they should be mentioned. In the 1960s–70s the
Frequency Dictionary of Russian was created by L.N. Zasorina. The text database for the dictionary comprised
about 1 million tokens. In the 1980s the Computer Fund of the Russian Language project started. The idea
belonged to academician Andrei Yershov. The Fund was to include, among various databases, a collection of
texts, i.e. a corpus.
Among the first Russian corpora we should name the Tübingen Russian corpora, the Russian newspaper corpus
(MSU), and the Helsinki Annotated Corpus.
The National Corpus of the Russian Language (http://ruscorpora.ru, RNC) is the most popular one
among linguists because of the opportunities it offers. The main RNC corpus (230 million tokens) represents
texts in proportion to their share in real-life usage. Every text included in the main corpus is subject to
metatagging and morphological tagging. Morphological tagging is carried out automatically. In a small part
of the main corpus, grammatical homonyms are disambiguated by hand. The main corpus of the RNC also
contains semantic annotation, a unique feature that distinguishes the RNC from other national corpora.
The RNC semantic annotation could provide the rules for associating context tags of special semantic
classes with different meanings.
Other RNC subcorpora are the Corpus of Spoken Russian, the Deeply Annotated Corpus (treebank),
the Parallel text corpus, the Multimodal/multimedia corpus, etc.
For search, the RNC uses the Yandex-Server search engine adapted to corpus needs. The search can be
limited to a chosen subcorpus. One can search by an exact form, by a set phrase, by lexico-grammatical and
semantic features, and by specified position features (before or after punctuation marks, at the beginning
or at the end of a sentence, capitalization, etc.).
The remarkable Charts service of the RNC is similar to the Google Books Ngram Viewer. It shows the chronological distribution of lexical units (text forms, phrases) found in the main corpus of the RNC.
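
The idea behind such a chart can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the RNC implementation: the `docs` list, its (year, text) format, the naive tokenizer, and the function name `yearly_ipm` are all hypothetical. The sketch counts occurrences of a form per year and normalizes them to instances per million tokens, so that years with different amounts of text remain comparable.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into word tokens (naive; real corpora use proper tokenizers)."""
    return re.findall(r"\w+", text.lower())


def yearly_ipm(docs, form):
    """Return {year: frequency of `form` per million tokens} for dated texts.

    docs: iterable of (year, text) pairs, a hypothetical input format.
    """
    hits = Counter()    # occurrences of the form per year
    totals = Counter()  # total number of tokens per year
    form = form.lower()
    for year, text in docs:
        tokens = tokenize(text)
        totals[year] += len(tokens)
        hits[year] += tokens.count(form)
    return {
        year: hits[year] * 1_000_000 / totals[year]
        for year in sorted(totals)
        if totals[year]
    }


# Invented mini-corpus of dated texts, for illustration only.
docs = [
    (1900, "корпус текстов и корпус слов"),
    (1950, "словарь и корпус"),
    (1950, "новый словарь языка"),
]
print(yearly_ipm(docs, "корпус"))
```

Plotting the returned values against the years would give a curve of the kind the Charts service displays; phrase search would require matching token sequences rather than single tokens.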
Among other corpora of Russian we can list the Leeds University corpora, the Moshkov's Library corpus
(http://aot.ru/search1.html), and the Sketch Engine corpora (http://sketchengine.co.uk). Besides standard
search with concordance output, the Sketch Engine system produces lists of collocations based on individual
syntactic models (word sketches), builds a word frequency list, groups lexical units into lexico-semantic fields with internal clustering, and shows the strength of syntagmatic relations between lexemes.
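
The strength of a syntagmatic relation between two lexemes is usually expressed by an association measure. The minimal sketch below uses logDice, the score documented for Sketch Engine word sketches. The function name, the node frequency, and the candidate counts are invented for illustration and are not taken from any corpus.

```python
import math


def log_dice(f_xy, f_x, f_y):
    """logDice = 14 + log2(2 * f_xy / (f_x + f_y)).

    f_xy: co-occurrence count of the pair; f_x, f_y: counts of each lexeme.
    """
    if f_xy <= 0:
        return float("-inf")  # the pair never co-occurs
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Hypothetical counts: (collocate, pair frequency, collocate frequency).
node_freq = 1000
candidates = [
    ("крепкий", 50, 200),
    ("горячий", 120, 1500),
    ("новый", 30, 20000),
]

# Rank the collocates of the node word from strongest to weakest association.
ranked = sorted(
    candidates,
    key=lambda c: log_dice(c[1], node_freq, c[2]),
    reverse=True,
)
for word, f_xy, f_y in ranked:
    print(word, round(log_dice(f_xy, node_freq, f_y), 2))
```

The score is bounded above by 14, reached when two lexemes occur only together, and it depends only on the three frequencies, not on corpus size. A very frequent collocate such as the third candidate is ranked low even though its raw pair count is not negligible.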
Oral speech, and especially nonpublic improvised oral speech, is, according to many scholars,
the most important variety of language, the closest to its "kernel", and the one that shows the most characteristic models of language. The paper also deals with the main Russian speech corpora and some other special corpora.