Russian National Corpus
today:
overview and perspectives
http://ruscorpora.ru
Vladimir A. Plungian
(Moscow)
Outline
• RNC: current state of the art
• Search possibilities and special
properties
• Applications
• Further development
The goals of RNC
• Supporting all kinds of linguistic research,
both descriptive and theoretical,
synchronic and diachronic
– Lexicographic and morphosyntactic studies
• Observing language change, especially
small and gradual
– Discourse and sociolinguistic studies
• Assisting teaching and learning Russian
Corpus as ideology
• One of the first corpora designed by
linguists and for linguists
• Towards a usage-based linguistic model,
inter alia:
– From prescriptive to descriptive attitudes
– From binary to gradual grammaticality
judgments
– From single-system view to synchronic and
diachronic variation
– From egalitarian to quantitative approach to
linguistic form
The RNC project
(Russian Academy of Sciences)
• Started in 2003 (preparatory studies since 2001)
• Available on the internet since April 2004
• Main participants: Vinogradov Institute for
Russian Language (Moscow, RAS); Institute for
Linguistic Research (St. Petersburg, RAS);
Moscow State Lomonosov U.; Voronezh State U.
• Technically supported by Yandex® – the biggest
Russian internet resource with one of the most
powerful and innovative search engines
Composition and structure:
the main corpus
1. Morphologically annotated texts in written and
spoken Standard Russian of XVIII-XXI cent.
– Late Modern Russian (written texts from the 2nd half
of XX century up to the present day): 100 mln
– The modern newspapers corpus: 100 mln
– The corpus of oral texts (the same period): 6 mln
– Early Modern Russian (written XVIII, XIX and early
XX-century texts): 80 mln
– The corpus of Russian poetry: 4 mln
– The corpus for accentological studies [oral + poetry]
Composition and structure:
minor sub-corpora
2. The parallel corpora
1. English-Russian [10 mln]
2. German-Russian [2 mln]
3. Ukrainian-Russian [1 mln]
4. Polish-Russian [1 mln, in preparation]
3. The small corpus of dialect texts: 0.2 mln
4. The small syntactically annotated corpus:
0.5 mln
5. The small learner’s corpus: 7 mln
The main corpus
• Circa 300 mln tokens
• All types of written texts
– fiction (both prose and drama), poetry, memoirs,
newspaper accounts and reviews, advertisements,
texts on education, engineering, science, philosophy,
religion, business, law, as well as texts of private use
not intended for publication (diaries, private
correspondence, etc.)
• Spontaneous oral texts, public performances,
movie transcripts
Annotation in RNC
Major types:
• meta-textual annotation
• morphological annotation
• accentual annotation
• semantic annotation
+ poetic annotation (metrics, strophics,
rhyme types, etc.)
Meta-textual annotation
• Primary text descriptors: author (name, sex,
age), title, creation date, size (number of
words)
• For fiction: genre (e.g. humour, fantasy), text
type (e.g. novel, essay), time and place
described (e.g. Soviet Union, 1930s)
• For non-fiction: functional sphere (e.g. religion,
law), text type (e.g. report, advertisement),
subject (e.g. sports, science)
• All meta-textual parameters are searchable
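The descriptors listed above can be pictured as a per-text record that a search filters on. The following is only a minimal sketch in Python: the field names (`author_sex`, `creation_year`, `sphere`, and so on), the helper `select_subcorpus`, and the sample records are hypothetical and do not reflect the actual RNC schema or interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextMeta:
    """Hypothetical meta-textual record for one corpus text."""
    author: str
    author_sex: str                  # "m" or "f"
    title: str
    creation_year: int
    size_words: int
    is_fiction: bool
    genre: Optional[str] = None      # fiction only, e.g. "fantasy"
    sphere: Optional[str] = None     # non-fiction only, e.g. "law"
    text_type: Optional[str] = None  # e.g. "novel", "report"


def select_subcorpus(texts, **criteria):
    """Return the texts whose fields equal every given criterion."""
    return [
        t for t in texts
        if all(getattr(t, name) == value for name, value in criteria.items())
    ]


# Invented sample records, for illustration only.
SAMPLE = [
    TextMeta("Author A", "f", "Text 1", 1935, 52000, True,
             genre="humour", text_type="novel"),
    TextMeta("Author B", "m", "Text 2", 1998, 1200, False,
             sphere="law", text_type="report"),
    TextMeta("Author C", "f", "Text 3", 2005, 800, False,
             sphere="religion", text_type="essay"),
]

female_nonfiction = select_subcorpus(SAMPLE, author_sex="f", is_fiction=False)
print([t.title for t in female_nonfiction])
```

Because every descriptor is an ordinary field, any combination of them can serve as a filter, which is the sense in which all parameters are searchable.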
Morphological annotation
• Automatic parsing (without
disambiguation)
• Manual disambiguation and accentuation
in a relatively small sub-section (ca. 7 mln
tokens)
• Morphological information: part of speech,
inflectional categories, non-standard forms
(distorted or anomalous)
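Parsing without disambiguation means that an ambiguous word form keeps all of its candidate analyses. A small sketch of what this implies for search, assuming a hypothetical parse table and illustrative tag names (not RNC's own tag set); only two of the possible analyses of the form "стали" are shown:

```python
# Hypothetical candidate analyses; "стали" can be a form of the noun
# "сталь" ('steel') or of the verb "стать" ('become').
PARSES = {
    "стали": [
        {"lemma": "сталь", "pos": "noun", "gramm": {"genitive", "singular"}},
        {"lemma": "стать", "pos": "verb", "gramm": {"past", "plural"}},
    ],
}


def matches_any(form, pos):
    """Without disambiguation, a token matches if ANY candidate parse fits."""
    return any(parse["pos"] == pos for parse in PARSES.get(form, []))


print(matches_any("стали", "noun"))  # True
print(matches_any("стали", "verb"))  # True
```

In the manually disambiguated sub-section each token would keep a single analysis, so a query for nouns would no longer return verbal uses of the same form.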
Semantic annotation
• Lexicon-based annotation
• Specific sets of values for different
lexical classes:
– verbs, adjectives, adverbs,
numerals, pronouns, predicate nouns,
non-predicate nouns, proper nouns
(names, surnames and patronymics)
Semantic annotation: values
• Include primarily taxonomic parameters
(e.g. ‘motion’, ‘speech’, ‘colour’,
‘instrument’, ‘person’, etc.), as well as:
– Mereology (sets & parts ~ wholes)
– Some derivational features (diminutives,
augmentatives, attenuatives, semelfactives,
etc.)
Searching on a semantic basis:
an example
Construction of the type
<в ночь> с четверга на пятницу ≈ ‘Thursday
night’
Query:
preposition С
+ noun, GRAMM: ‘genitive’, SEM: ‘span of time’
+ preposition НА
+ noun, GRAMM: ‘accusative’, SEM: ‘span of time’
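The query above is a sequence of four slots, each constraining the lemma, part of speech, grammatical features or semantic class of one token. A minimal sketch of how such a query could be matched against annotated tokens follows; the `Token` and `Slot` classes, the tag names and the hand-annotated sample phrase are assumptions made for illustration, not the RNC query engine.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Token:
    """One annotated token; the tag names are illustrative only."""
    form: str
    lemma: str
    pos: str
    gramm: FrozenSet[str] = frozenset()
    sem: FrozenSet[str] = frozenset()


@dataclass(frozen=True)
class Slot:
    """Constraints on one query position; None means 'anything'."""
    lemma: Optional[str] = None
    pos: Optional[str] = None
    gramm: Optional[str] = None
    sem: Optional[str] = None

    def matches(self, tok):
        return ((self.lemma is None or tok.lemma == self.lemma)
                and (self.pos is None or tok.pos == self.pos)
                and (self.gramm is None or self.gramm in tok.gramm)
                and (self.sem is None or self.sem in tok.sem))


def find_matches(tokens, query):
    """Yield start indices where consecutive tokens satisfy every slot."""
    n = len(query)
    for i in range(len(tokens) - n + 1):
        if all(slot.matches(tok) for slot, tok in zip(query, tokens[i:i + n])):
            yield i


def tok(form, lemma, pos, gramm=(), sem=()):
    return Token(form, lemma, pos, frozenset(gramm), frozenset(sem))


# Hand-annotated sample: "в ночь с четверга на пятницу".
PHRASE = [
    tok("в", "в", "prep"),
    tok("ночь", "ночь", "noun", ["accusative"], ["span of time"]),
    tok("с", "с", "prep"),
    tok("четверга", "четверг", "noun", ["genitive"], ["span of time"]),
    tok("на", "на", "prep"),
    tok("пятницу", "пятница", "noun", ["accusative"], ["span of time"]),
]

QUERY = [
    Slot(lemma="с", pos="prep"),
    Slot(pos="noun", gramm="genitive", sem="span of time"),
    Slot(lemma="на", pos="prep"),
    Slot(pos="noun", gramm="accusative", sem="span of time"),
]

hits = list(find_matches(PHRASE, QUERY))
print(hits)  # [2]
```

Since the noun slots constrain only case and semantic class, the same query would also match other pairs of time expressions in this frame, not just days of the week.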
Syntactic corpus:
sample search
Applications
• Linguistic research
• Including non-linguist students’ research
activities
• Education materials
• Reference tool for non-experts
Applications:
research
Actual language usage (as opposed to
grammars)
Short-term grammatical changes
Including evolution of word meanings
and usage
Applications:
students’ activities
Getting young people interested in
language as a phenomenon: from small
toy research projects to full-fledged
investigations.
Not necessarily linguistics
students!
Applications:
educational materials
Russian linguistic education is
traditionally oriented towards classical
literature and based on a fixed set of
examples that wander from one textbook to
another.
Result: negative attitudes towards Russian
courses among younger people
Applications:
educational materials
The Corpus provides instruments and
resources to switch to (a) usage-based
and (b) domain-specific linguistic
training.
Applications:
reference tool
The Corpus provides quick answers to
many expert and non-expert questions.
Especially convenient for simple lexical
queries: word history.
When (first) and in what sense
was the word used?
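A word-history question of this kind reduces to sorting the hits for a word by the creation date of their texts. The sketch below assumes the hits have already been retrieved as (creation year, context) pairs; the data are invented placeholders and the function names are hypothetical.

```python
from collections import Counter

# Invented placeholder hits for one word: (creation year of text, context).
HITS = [
    (1987, "context C"),
    (1912, "context A"),
    (1954, "context B"),
    (1989, "context D"),
]


def first_attestation(hits):
    """Return the earliest hit, i.e. the first dated use in the sample."""
    return min(hits, key=lambda hit: hit[0])


def hits_per_decade(hits):
    """Count hits by decade to show how usage is spread over time."""
    return Counter(year // 10 * 10 for year, _context in hits)


print(first_attestation(HITS))  # (1912, 'context A')
print(hits_per_decade(HITS)[1980])  # 2
```

The earliest hit only shows the first use within the corpus, so it gives an upper bound on when the word entered the language rather than an exact date.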
Further development
• Oral and poetic texts
• Multi-media corpus (annotated movies)
• Full derivational annotation (searching by
derivational parameters)
• Improving statistics and frequency
modules
• Emphasis on parallel corpora
• Slavic parallel corpora?