System of Pronominal Words in Czech with Respect to German and English

Magda Razímová
Institute of Formal and Applied Linguistics
Charles University
Prague, Czech Republic
razimova@ufal.mff.cuni.cz
SLE 2006


Outline of the talk

• Introduction
• Pronouns in the Prague Dependency Treebank 2.0
  • Personal pronouns
  • Other pronoun types
  • Pro-adverbs and pro-numerals
• Application of the presented scheme to English and German
• Final remarks


Introduction

Pronouns and other pro-forms
• replace or substitute for other words, phrases, or sentences
• have anaphoric and deictic functions
• comprise pronouns as well as pro-adjectives, pro-adverbs, and pro-numerals
• form 'closed' classes
• show semantically relevant regularities within the sub-classes (nobody-never-nowhere; everybody-always-everywhere)

Pro-forms in the Prague Dependency Treebank 2.0
• a formal linguistic system for the annotation of pro-forms
• makes the regularities present in the system explicit
• part of the deep-syntactic layer (tectogrammatical layer, t-layer)
• representation by a reduced set of (underlying) lemmas in combination with relevant attributes


PDT project – historical background

mid 1960's  Functional Generative Description (Petr Sgall et al.)
1994        Czech National Corpus
1995        PDT started
1998        PDT 0.5 pre-release
2001        PDT 1.0 released by LDC (LDC2001T10): manual annotation of morphology and surface syntax
2006        PDT 2.0 (just) released by LDC (LDC2006T01): interlinked morphological, surface-syntactic and complex deep-syntactic annotation


PDT 2.0: Layers of annotation

• tectogrammatical layer: deep-syntactic dependency tree; 59 % of the a-layer data (3,165 doc., 49,431 sent., 833,195 tokens)
• analytical layer: surface-syntactic dependency tree; 75 % of the m-layer data (5,330 doc., 87,913 sent., 1,503,739 tokens)
• morphological layer: m-lemma and m-tag associated with each token; 7,110 textual documents (115,844 sent. with 1,957,247 tokens)
• word layer: original text, segmented on word boundaries

Example sentence:
lit.: He-was would went toforest.
He would have gone to the forest.


Pronouns in the Prague Dependency Treebank 2.0

• at the t-layer, personal pronouns are treated separately from the other pronoun types
• pro-adverbs and pro-numerals are represented in the same way as indefinite, negative, etc. pronouns
• semantic features originally present in the word form are extracted and stored as values of inner attributes of the t-node that corresponds to the given word form


Personal pronouns in the PDT 2.0

• all personal pronouns (whether present in the sentence or restored at the t-layer) are represented by nodes labeled with a single, 'artificial' lemma #PersPron
• the grammatical information that a personal pronoun expresses in the sentence is stored in the node attributes person, number, and gender
• the attribute politeness distinguishes honorific from non-honorific usage
• for example, the pronoun vy in vy jste přišel (you came, said politely to a single person) is represented as follows:
  #PersPron + 2nd (person) + singular + masc.anim. + polite
• possessive pronouns which correspond to personal pronouns (jeho (his), náš (our)) are represented in the same way


Personal pronouns and co-reference

• at the t-layer, the representation of personal pronouns was completed with the annotation of co-reference (i.e. relations between nodes referring to the same entity)

Tím, že Evropská unie nechala ve rwandské operaci Francii na holičkách, podle Léotarda ukázala, že její politika nemá žádný africký rozměr.
(According to Léotard, by the fact that the European Union left France in the lurch concerning the Rwanda operation, [it] has shown that its politics has no African dimension.)


Other pronoun types in the PDT 2.0

• indefinite, negative, interrogative, and relative pronouns
• in the Czech pronoun system, individual meanings are expressed regularly by a relatively small group of prefixes that combine with a small set of bases
• transparent correspondence between the semantic features and the formal composition of pronouns:
  • indefinite prefix ně-: někdo (somebody) – něco (something) – nějaký (some)
  • negative prefix ni-: nikdo (nobody) – nic (nothing) …
• at the t-layer, pronouns with the same base element are grouped together; each pronoun group is represented by the lemma corresponding to the respective relative pronoun
  • e.g. někdo (somebody) and nikdo (nobody) are represented by the lemma kdo (who)
• corresponding possessive pronouns are represented in the same way
• the semantic feature completing the reduced lemma is stored in the indeftype attribute


Other pronoun types and the indeftype attribute

• all indefinite, negative, interrogative, and relative pronouns are represented by only four lemmas at the t-layer
• the reduced lemmas are completed by a value of the indeftype attribute (11 values in total)


Pro-adverbs and pro-numerals in the PDT 2.0

• in Czech, pro-adverbs (e.g. nikde (nowhere), nějak (somehow)) and pro-numerals (e.g. několik (a few)) express the same semantic features as pronouns
• at the t-layer, they are represented in the same way as indefinite, negative, interrogative, and relative pronouns
• a further derivational relation holds between pro-adverbs with directional meaning and those of location; for example, the adverb odněkud (from somewhere) is represented as follows:
  lemma kde (where) + indef value (of the indeftype attribute) + functor DIR1 capturing the directional meaning

Zakládá-li si někdo na tom, že se vyhýbá cizím slovům, pak udělá nejlíp, když se nikdy nepodívá do Etymologického slovníku jazyka českého.
(If someone finds it important that [he] avoids foreign words, then the best thing [he] can do is never to look in the Etymological Dictionary of the Czech Language.)


Application of the presented scheme to English and German

• in other languages, too, indefinite, negative, interrogative, and relative pronouns and other pro-forms are unproductive classes with (at least to a certain extent) transparent derivational relations
• preliminary sketch of several English and German pronouns [table not preserved]
• still not solved: English anybody, German niemand and nirgendjemand …


Related conception: Helbig's MultiNet (i)

• a similar treatment of indefinite and negative pronouns as two subtypes of the same entity was also introduced in the MultiNet knowledge representation system (Helbig, H. (2001), Die semantische Struktur natürlicher Sprache, Springer)

(Negators and their antonyms, in Helbig (2001), p. 164)


Related conception: Helbig's MultiNet (ii)

• Sentential negation with kein (no): Peter lent his tools to nobody.
• Constituent negation with kein (no): Peter buys no motorcycle, but a bike.

(in Helbig (2001), p. 170)


Final remarks

achievements:
• all pro-forms in Czech are divided into two groups:
  • personal (and corresponding possessive) pronouns
  • other pronoun types (and corresponding possessive pronouns) and other pro-forms
• several pro-form analogies crossing the part-of-speech boundaries are explicitly marked in the annotation
• the annotation scheme has been verified on large-scale data

future work:
• to elaborate the scheme for other languages in more detail, taking into consideration specific phenomena of the respective language
• to describe the relations among the Czech and other pro-form systems (for example, for the purposes of machine translation)

http://ufal.mff.cuni.cz/pdt2.0/
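

Illustrative sketch of the representation

The reduction of pro-forms to an underlying lemma plus attributes, as described in the talk, can be sketched in a few lines of Python. This is a hypothetical illustration only, not the PDT 2.0 tooling or data format. The class TNode, the function reduce_proform, the attribute spellings, the value name "negat", and the list of bases are all assumptions made for this example. Only #PersPron, the lemmas kdo and kde, the indef value, the functor DIR1, and the attribute names come from the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TNode:
    """Hypothetical, simplified t-layer node: a lemma plus a few attributes."""
    lemma: str
    indeftype: Optional[str] = None
    functor: Optional[str] = None
    person: Optional[str] = None
    number: Optional[str] = None
    gender: Optional[str] = None
    politeness: Optional[str] = None


# Prefix -> indeftype value. "indef" appears in the slides; "negat" is assumed.
PREFIXES = {"ně": "indef", "ni": "negat"}

# Assumed bases (relative/interrogative forms) used for this toy example.
BASES = {"kdo", "co", "jaký", "kde", "kdy", "jak", "kolik"}

# Forms that are not a plain prefix + base concatenation.
IRREGULAR = {
    "nic": TNode("co", indeftype="negat"),
    # Directional pro-adverb: location lemma plus a directional functor.
    "odněkud": TNode("kde", indeftype="indef", functor="DIR1"),
}


def reduce_proform(form: str) -> Optional[TNode]:
    """Map a Czech pro-form to a reduced lemma plus indeftype, if recognized."""
    if form in IRREGULAR:
        return IRREGULAR[form]
    for prefix, value in PREFIXES.items():
        base = form[len(prefix):]
        if form.startswith(prefix) and base in BASES:
            return TNode(base, indeftype=value)
    return None  # not covered by this toy table


# Polite "vy" addressed to one person, following the example in the slides.
polite_vy = TNode("#PersPron", person="2nd", number="singular",
                  gender="masc.anim.", politeness="polite")

if __name__ == "__main__":
    for word in ["někdo", "nikdo", "nikde", "několik", "odněkud"]:
        print(word, reduce_proform(word))
    print("vy", polite_vy)
```

The lookup keeps the regular prefix + base cases separate from irregular forms, which mirrors the point that the correspondence between form and meaning is transparent only to a certain extent.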