Russian National Corpus today: overview and perspectives
http://ruscorpora.ru
Vladimir A. Plungian (Moscow)

Outline
• RNC: current state of the art
• Search possibilities and special properties
• Applications
• Further development

The goals of RNC
• Supporting all kinds of linguistic research, both descriptive and theoretical, synchronic and diachronic
  – Lexicographic and morphosyntactic studies
• Observing language change, especially small and gradual
  – Discourse and sociolinguistic studies
• Assisting teaching and learning Russian

Corpus as ideology
• One of the first corpora designed by linguists and for linguists
• Towards a usage-based linguistic model, inter alia:
  – From prescriptive to descriptive attitudes
  – From binary to gradual grammaticality judgments
  – From single-system view to synchronic and diachronic variation
  – From egalitarian to quantitative approach to linguistic form

The RNC project (Russian Academy of Sciences)
• Started in 2003 (preparatory studies since 2001)
• Available on the internet since April 2004
• Main participants: Vinogradov Institute for Russian Language (Moscow, RAS); Institute for Linguistic Research (St. Petersburg, RAS); Moscow State Lomonosov U.; Voronezh State U.
• Technically supported by Yandex®, the biggest Russian internet resource, with one of the most powerful and innovative search engines

Composition and structure: the main corpus
1. Morphologically annotated texts in written and spoken Standard Russian of the XVIII-XXI centuries
  – Late Modern Russian (written texts from the 2nd half of the XX century up to the present day): 100 mln
  – The modern newspapers corpus: 100 mln
  – The corpus of oral texts (the same period): 6 mln
  – Early Modern Russian (written XVIII, XIX and early XX-century texts): 80 mln
  – The corpus of Russian poetry: 4 mln
  – The corpus for accentological studies [oral + poetry]

Composition and structure: minor sub-corpora
2. The parallel corpora
  1. English-Russian [10 mln]
  2. German-Russian [2 mln]
  3. Ukrainian-Russian [1 mln]
  4. Polish-Russian [1 mln, in preparation]
3. The small corpus of dialect texts: 0.2 mln
4. The small syntactically annotated corpus: 0.5 mln
5. The small learner’s corpus: 7 mln

The main corpus
• Circa 300 mln tokens
• All types of written texts
  – fiction (both prose and drama), poetry, memoirs, newspaper accounts and reviews, advertisements, texts on education, engineering, science, philosophy, religion, business, law, as well as texts of private use not intended for publication (diaries, private correspondence, etc.)
• Spontaneous oral texts, public performances, movie transcripts

Annotation in RNC
Major types:
• meta-textual annotation
• morphological annotation
• accentual annotation
• semantic annotation
• + poetic annotation (metrics, strophics, rhyme types, etc.)

Meta-textual annotation
• Primary text descriptors: author (name, sex, age), title, creation date, size (number of words)
• For fiction: genre (e.g. humour, fantasy), text type (e.g. novel, essay), time and place described (e.g. Soviet Union, 1930s)
• For non-fiction: functional sphere (e.g. religion, law), text type (e.g. report, advertisement), subject (e.g. sports, science)
• All meta-textual parameters are searchable

Morphological annotation
• Automatic parsing (without disambiguation)
• Manual disambiguation and accentuation in a relatively small sub-section (ca 7 mln tokens)
• Morphological information: part of speech, inflectional categories, non-standard forms (distorted or anomalous)

Semantic annotation
• Lexicon-based annotation
• Specific sets of values for different lexical classes:
  – verbs, adjectives, adverbs, numerals, pronouns, predicate nouns, non-predicate nouns, proper nouns (names, surnames and patronymics)

Semantic annotation: values
• Include primarily taxonomic parameters (e.g. ‘motion’, ‘speech’, ‘colour’, ‘instrument’, ‘person’, etc.), as well as:
  – Mereology (sets & parts ~ wholes)
  – Some derivational features (diminutives, augmentatives, attenuatives, semelfactives, etc.)

Searching on semantic base, an example
Construction of the type <в ночь> с четверга на пятницу (lit. ‘<on the night> from Thursday to Friday’) ≈ ‘Thursday night’
Query:
  preposition С + noun, GRAMM: ‘genitive’, SEM: ‘span of time’
  + preposition НА + noun, GRAMM: ‘accusative’, SEM: ‘span of time’

Syntactic corpus: sample search

Applications
• Linguistic research
• Including non-linguist students’ research activities
• Education materials
• Reference tool for non-experts

Applications: research
• Actual language usage (as opposed to grammars)
• Short-term grammatical changes
• Including evolution of word meanings and usage

Applications: students’ activities
Getting young people interested in language as a phenomenon: from small toy research projects to full-fledged investigations. Not necessarily linguistics students!

Applications: educational materials
Russian linguistic education is traditionally oriented towards classical literature and based on a fixed set of examples wandering from one manual to another. This leads to negative attitudes towards Russian courses among younger people.
The Corpus provides instruments and resources to switch to (a) usage-based and (b) domain-specific linguistic training.

Applications: reference tool
The Corpus provides quick answers to many expert and non-expert questions. It is especially convenient for simple lexical queries such as word history: when was the word first used, and in what sense?

Further development
• Oral and poetic texts
• Multi-media corpus (annotated movies)
• Full derivational annotation (searching for derivational parameters)
• Improving statistics and frequency modules
• Emphasis on parallel corpora
• Slavic parallel corpora?
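
Appendix: the semantic query as a sketch
The semantic search example earlier in the deck (с четверга на пятницу) combines lexical, grammatical and semantic constraints on four adjacent tokens. The Python sketch below shows one way such a query could be evaluated over annotated tokens. It is an illustration only, not the RNC implementation: the names `Token`, `Constraint` and `find_matches` are hypothetical, and the tag strings ("prep", "noun", "genitive", "accusative", "span of time") are taken from the slide wording rather than from the actual RNC tagset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    """One annotated token (hypothetical structure, not the RNC format)."""
    form: str
    lemma: str
    pos: str
    gramm: frozenset = frozenset()   # grammatical tags, e.g. {"genitive"}
    sem: frozenset = frozenset()     # semantic tags, e.g. {"span of time"}


@dataclass(frozen=True)
class Constraint:
    """Conditions on one token position; an unset field matches anything."""
    lemma: Optional[str] = None
    pos: Optional[str] = None
    gramm: frozenset = frozenset()
    sem: frozenset = frozenset()

    def matches(self, tok: Token) -> bool:
        return ((self.lemma is None or tok.lemma == self.lemma)
                and (self.pos is None or tok.pos == self.pos)
                and self.gramm <= tok.gramm   # all required tags present
                and self.sem <= tok.sem)


def find_matches(tokens, pattern):
    """Return every run of adjacent tokens that satisfies the pattern."""
    n = len(pattern)
    hits = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if all(c.matches(t) for c, t in zip(pattern, window)):
            hits.append(" ".join(t.form for t in window))
    return hits


TIME = frozenset({"span of time"})

# The query from the slide: С + noun (genitive, time) + НА + noun (accusative, time)
PATTERN = [
    Constraint(lemma="с", pos="prep"),
    Constraint(pos="noun", gramm=frozenset({"genitive"}), sem=TIME),
    Constraint(lemma="на", pos="prep"),
    Constraint(pos="noun", gramm=frozenset({"accusative"}), sem=TIME),
]

# Hand-annotated toy data: "в ночь с четверга на пятницу"
TIME_PHRASE = [
    Token("в", "в", "prep"),
    Token("ночь", "ночь", "noun", frozenset({"accusative"}), TIME),
    Token("с", "с", "prep"),
    Token("четверга", "четверг", "noun", frozenset({"genitive"}), TIME),
    Token("на", "на", "prep"),
    Token("пятницу", "пятница", "noun", frozenset({"accusative"}), TIME),
]

# Same syntactic frame, no time semantics: "со стола на пол"
SPACE_PHRASE = [
    Token("со", "с", "prep"),
    Token("стола", "стол", "noun", frozenset({"genitive"})),
    Token("на", "на", "prep"),
    Token("пол", "пол", "noun", frozenset({"accusative"})),
]

if __name__ == "__main__":
    print(find_matches(TIME_PHRASE, PATTERN))
    print(find_matches(SPACE_PHRASE, PATTERN))
```

The second phrase has the same prepositions and cases but lacks the ‘span of time’ tag, so only the semantic constraint separates the two constructions; this is the kind of distinction a purely morphological query could not make.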