Search results can be presented in 2 forms

CORPORA OF THE RUSSIAN LANGUAGE (Russian title: КОРПУСЫ РУССКОГО ЯЗЫКА)
Victor Zakharov (Viktor Pavlovich Zakharov)
Saint-Petersburg State University (SPbU), Saint-Petersburg, Russia
vz1311@yandex.ru

The paper describes corpora of the Russian language and the state of the art of Russian corpus linguistics. The earliest Russian corpus was built in the 1980s at the University of Uppsala (Sweden). However, other projects influenced the field in Russia so strongly that they should be mentioned as well. In the 1960s and 70s the Frequency Dictionary of Russian was created by L.N. Zasorina. The text database for the dictionary contained about 1 mln tokens. In the 1980s the Computer Fund of the Russian Language project started. The idea belonged to the academician Andrei Yershov. Among its various databases, the Fund was to include a collection of texts, i.e. a corpus. Among the first Russian corpora were the Tübingen Russian corpora, the Russian newspaper corpus (MSU), and the Helsinki Annotated Corpus.

The National Corpus of the Russian Language (http://ruscorpora.ru, RNC) is the most popular among linguists because of the opportunities it offers. The main RNC corpus (230 mln tokens) represents texts in proportion to their share in real-life usage. Every text included in the main corpus is subject to metatagging and morphological tagging. Morphological tagging is carried out automatically. In a small part of the main corpus grammatical homonyms are disambiguated by hand. The main corpus of the RNC also contains semantic annotation, a unique feature that distinguishes it from other national corpora. The RNC semantic annotation could provide the rules for associating context tags of special semantic classes with different meanings. Other RNC subcorpora are the Corpus of Spoken Russian, the Deeply Annotated Corpus (treebank), the Parallel text corpus, the Multimodal/multimedia corpus, etc.

For search, the RNC uses the Yandex-Server search engine adapted for corpus needs. The search can be limited to a chosen subcorpus. One can search by an exact form, by a set phrase, by lexico-grammatical and semantic features, and by specified position features (before or after punctuation marks, at the beginning or at the end of a sentence, capitalization, etc.). The remarkable Charts service of the RNC is similar to the Google Books Ngram Viewer. It shows the chronological distribution of lexical units (text forms, phrases) found in the main corpus of the RNC.

Among other corpora of Russian we can list the Leeds University corpora, Moshkov's Library corpus (http://aot.ru/search1.html), and the Sketch Engine corpora (http://sketchengine.co.uk). Besides standard search with concordance output, the Sketch Engine system produces lists of collocations based on individual syntactic models (word sketches), builds a word frequency list, groups lexical units into lexico-semantic fields with internal clustering, and shows the strength of syntagmatic relations between lexemes.

Oral speech, and especially nonpublic improvised oral speech, is according to many scientists the most important variety of language, the closest to its "kernel" and showing the most characteristic models of language. The paper deals with the main Russian speech corpora and some other special corpora.
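The abstract mentions search by exact form with concordance output. As a rough illustration of what that output looks like, here is a minimal keyword-in-context (KWIC) sketch in Python. It is not how the RNC, Yandex-Server, or Sketch Engine implement search; the function name `kwic`, the `width` parameter, the regex tokenizer, and the sample sentence are all assumptions made for this example.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left context, match, right context) for every exact-form hit.

    Hypothetical helper: tokenization is a naive split into lowercase word tokens,
    far simpler than the morphological tagging used in real corpora.
    """
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    rows = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


# Invented sample text, used only to show the output format.
sample = "The corpus is large. A corpus can be searched by word form."
for left, kw, right in kwic(sample, "corpus", width=2):
    print(f"{left:>20} [{kw}] {right}")
```

Each hit is shown with a fixed number of tokens on either side, which is the basic layout of a concordance line. Real corpus managers add lemma-based and grammatical queries on top of this exact-form matching.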
