Peter Grzybek
Estonian Proverbs: Searching for regularities
www.peter-grzybek.eu
Kriq 75, August 18/19, 2014

How long is a proverb?
How long are words in proverbs?
Does word length depend on proverb length?
Is word length independent of within-text position?

How to measure the length of linguistic units and entities?

Memo: "There are no positive facts in language." (Saussure)
There is always more than one definition.

I. Define the entity you want to measure.
   If you want to measure sentence length, define 'sentence'.
   If you want to measure word length, define 'word'.
II. Determine the measuring units in which you want to measure.
   E.g., sentence length: number of clauses, phrases, words, syllables, morphemes, …?
   E.g., word length: number of syllables, morphemes, letters, graphemes, phonemes, …?
III. Define the measuring units.
   Define 'clause', 'phrase', 'syllable', 'morpheme', 'phoneme', 'grapheme', 'letter', …

Rule in Quantitative Linguistics: Take direct constituents as measuring units.

How long are proverbs?

Sentence length (one proverb = one sentence): how many X per proverb?

X = clauses, phrases
  + syntactic analysis
  - dependent on syntax theory; reduced number of clauses/phrases in proverbs (lack of variation)
X = words, stems
  + sufficient variation
  - dependent on lexical theory (orthographic word, phonological word, etc.)
X = syllables, morphemes
  + lexical analysis; rhythmic structure
  - dependent on morphology and phonotactics; high degree of variation

Orthographic problems:
  Mother-in-law: isn't that a problem?
  В этом доме. [In this house.]
  в кратцу - вкратце [in brief]
Phonological word (tact group):
  Ná mostu. [On the bridge.]

In agglutinative languages …
  … stems do not change,
  … affixes do not fuse with other affixes,
  … affixes do not change form conditioned by other affixes.

How long are words?
How many X per word?

X = letters, graphemes
  + easy (automatic) analysis
  - high degree of alphabetic arbitrariness; high degree of variation
X = phonemes
  + better linguistic justification
  - dependent on phonological theory; high degree of variation; neglect of quantity
X = syllables, morphemes
  + lexical analysis; rhythmic structure
  - high degree of variation

Estonian phonemes: three degrees of phonemic length (consonants and vowels)

[o]    (short o)         koli     = "rubbish"
[oˈ]   (long o)          kooli    = "school"
[oː]   (extra long o)    kooli"   = "to school"

Decisions / Definitions (in accordance with Kriq 1967)

Linguistic unit    Definition
Sentence           One proverb
  Length           Number of words / stems
Word               Orthographic
  Length           Number of syllables

Examples

Üks riisub rihaga, teine pühib luuaga. (EV 15016)
[One rakes with the rake, the other sweeps with the broom.]
Üks rii-sub ri-ha-ga, tei-ne pü-hib luu-a-ga.
Wo: 6, St: 6, Sy: 13

Isi puu, isi puuke. (EV 2245)
[One is the tree, the other is the little tree.]
I-si puu, i-si puu-ke.
Wo: 4, St: 4, Sy: 7

Erna Normann (1955), Valimik eesti vanasõnu
3576 proverbs, ca. end of the 19th / early 20th century

Length       3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21
Frequency   93  619  518  710  511  518  218  175   81   62   40   13    5    5    2    4    0    1    1

[Figure: frequency of proverbs by length (3 to 21)]

Comparisons: Old (17th/18th century) and Contemporary

Length    Normann    Old    New
3              93     21     26
4             619    103     89
5             518     80     52
6             710    129     39
7             511     86     17
8             518     96     44
9             218     39     12
10            175     21      9
11             81     21      1
12             62     14      5
13             40      4      0
14             13      3      1
15              5      0      0
16              5      1      0
17              2      0
18              4      1
19              0
20              1
21              1
Proverbs     3576    618    294

[Figure: relative frequency by proverb length for the Contemporary, Normann, and Old samples]

Bimodal distributions: additional peaks (6 / 8).
Question: Does the word-stem distinction explain the bimodality?

Eesti vanasõnad (12921 proverbs)

[Figure: stems per proverb against orthographic words per proverb]
[Figure: number of proverbs by (orthographic) words per proverb, and by stems per proverb]

Words and stems: linear relation! Hence: concentration on words.

Some in-between conclusions

1. Bimodality seems to originate in the characteristics of the proverb material; this phenomenon needs more detailed study.
2. It seems reasonable to assume that the overall picture results from differences between syntactically different proverbs, e.g., "simple" (unipartite proverbs without hypotaxis) vs. "complex" (n-partite proverbs with hypotaxis).
3. As long as the relevant data are not available, data pooling seems to be an appropriate procedure for making the forest visible before the trees.

Pooling data: intervals 2-3, 4-5, 6-7, …

[Figure: number of proverbs by (orthographic) words per proverb, pooled data]

Is there a way to find a theoretical model for sentence length frequencies?

Assumptions:
1. The distribution of length is organized in a law-like manner.
2. It is sufficient to make assumptions about the relative difference D of two neighboring frequencies (probabilities):

$$D = \frac{\Delta P_{x-1}}{P_{x-1}} = \frac{P_x - P_{x-1}}{P_{x-1}}$$

Which factors influence D? D is modelled by a ratio of two linear functions of x, $(a + bx)/(cx + d)$, whose constants a, b, c, d stand for:
  language-specific factors
  production-specific factors
  norming forces
  level-specific factors (words vs. phrases)

Since $P_x / P_{x-1} = 1 + D$ is then again a ratio of two linear functions of x, the recurrence can be written with reparametrized constants k, m, q:

$$P_x = \frac{k + x - 1}{m + x - 1}\, q\, P_{x-1}$$

$$P_x = \frac{\binom{k+x-1}{x}}{\binom{m+x-1}{x}}\, q^x P_0$$

Hyperpascal distribution (Beta-binomial d.)

Eesti vanasõnad: testing the hyperpascal distribution

$$P_x = \frac{\binom{k+x-1}{x}}{\binom{m+x-1}{x}}\, q^x P_0$$

k = 1.21, m = 0.07, q = 0.39, C = X²/N = 0.0193

1. Length of Estonian proverbs is regularly organized.
2. The well-known hyperpascal distribution is a good model.

Is there a regularity of word length in Estonian proverbs?

Normann (21038 words)

Syllables per word (x)    Number of words (f_x)
1                          6648
2                         10573
3                          2730
4                           920
5                           149
6                            16
7                             2

[Figure: number of words by syllables per word]

In search of a word length model

$$P_x = g(x)\, P_{x-1}$$

With $g(x) = a/x$, i.e. $P_x = \frac{a}{x} P_{x-1}$:

Poisson distribution
$$P_x = \frac{e^{-a} a^x}{x!}, \quad x = 0, 1, 2, \dots$$

1-displaced Poisson distribution ("Fucks distribution")
$$P_x = \frac{e^{-a} a^{x-1}}{(x-1)!}, \quad x = 1, 2, 3, \dots$$

[Figure: observed and fitted number of words by syllables per word]

C = X²/N = 0.08. No good model!

An alternative model for word length in Estonian (proverbs)

$$P_x = g(x)\, P_{x-1}$$

With a constant $g(x) = q$, i.e. $P_x = q\, P_{x-1}$:

Geometric distribution
$$P_x = p\, q^x, \quad x = 0, 1, 2, \dots$$

1-displaced geometric distribution
$$P_x = p\, q^{x-1}, \quad x = 1, 2, 3, \dots$$

1-displaced Shenton-Skees geometric distribution
$$P_x = p\, q^{x-1} \left[ 1 + a \left( x - \frac{1}{p} \right) \right], \quad x = 1, 2, 3, \dots$$

Word stems:           p = 0.85, a = 3.49, C = 0.0062
Orthographic words:   p = 0.88, a = 4.71, C = 0.0023

Word length in Eesti vanasõnad (88296 words)

Syllables per word (x)    Number of words (f_x)
1                         27272
2                         43696
3                         12127
4                          4185
5                           822
6                           165
7                            32
8                             6
9                             1

$$P_x = p\, q^{x-1} \left[ 1 + a \left( x - \frac{1}{p} \right) \right], \quad x = 1, 2, 3, \dots$$

p = 0.84, a = 3.30, C = 0.0074

Proverb length and word length (Normann)

Proverb length           T3       T4       T5       T6       T7       T8       T9       T10
Mean word length (x̄)    2.2652   1.9939   1.9830   1.9554   1.9642   1.8434   1.8507   1.8217

Menzerath-Altmann law (Altmann 1980)

"The longer (more complex) a linguistic construct, the shorter (less complex) its constituents."

Example: The longer a sentence, the shorter the clauses constituting the sentence.

NB: The law concerns direct relations only (in the classical structuralist paradigm), i.e., the relation of a construct to its immediate constituents. The relation between entities from indirectly related levels (e.g., between sentences and words, leapfrogging the intermediate level of sub-sentential constructs like clauses or phrases) is expected to show different (more complex) tendencies.

Basic form:

$$\frac{y'}{y} = \frac{a}{x} \quad\Rightarrow\quad y = K x^{a}, \qquad \text{e.g. } WoL = K \cdot SeL^{a}$$

y: constituent length (dependent variable); x: construct length (independent variable); K: integration constant; a: parameter determining the steepness of the decrease (for a < 0).
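As an illustration of the basic form, the following minimal Python sketch fits y = K x^a to the Normann mean word lengths for proverb lengths T3 to T10, using ordinary least squares on log-transformed values. The fitting method and the function name `fit_power_law` are assumptions made for this example; the slides do not report a fit of the basic form, so the resulting K and a are not results of the talk.

```python
import math

# Mean word length (syllables per word) by proverb length (words per proverb),
# Normann sample, proverb lengths T3..T10 as given in the talk.
proverb_length = [3, 4, 5, 6, 7, 8, 9, 10]
mean_word_length = [2.2652, 1.9939, 1.9830, 1.9554, 1.9642, 1.8434, 1.8507, 1.8217]


def fit_power_law(xs, ys):
    """Fit y = K * x**a by least squares on ln y = ln K + a * ln x.

    Hypothetical helper for illustration; returns the pair (K, a).
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_lx = sum(log_x) / n
    mean_ly = sum(log_y) / n
    # Slope of the regression line in log-log space is the exponent a.
    s_xy = sum((u - mean_lx) * (v - mean_ly) for u, v in zip(log_x, log_y))
    s_xx = sum((u - mean_lx) ** 2 for u in log_x)
    a = s_xy / s_xx
    # Intercept in log-log space is ln K.
    K = math.exp(mean_ly - a * mean_lx)
    return K, a


K, a = fit_power_law(proverb_length, mean_word_length)
print(f"K = {K:.3f}, a = {a:.3f}")
```

For these eight means the estimated exponent a comes out negative, which is the decreasing tendency the law predicts. Fitting in log space weights relative rather than absolute deviations; a nonlinear fit on the original scale would give somewhat different estimates.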
Full form:

$$\frac{y'}{y} = \frac{a}{x} + b \quad\Rightarrow\quad y = K x^{a} e^{bx}$$

Extended form (Wimmer-Altmann law):

$$\frac{y'}{y} = \frac{a}{x} + b - \frac{c}{x^{2}} \quad\Rightarrow\quad y = K x^{a} e^{bx} e^{c/x}$$

Proverb length and word length

Normann
[Figure: word length (syllables per word) against proverb length (words per sentence)]
$$y = K e^{c/x}$$
K = 1.68, c = -0.84, R² = 0.90

Eesti vanasõnad
[Figure: word length against proverb length]
$$y = K x^{a} e^{c/x}$$
K = 1.71, a = 0.18, c = -1.05, R² = 0.98

Word length and syllable length

Eesti vanasõnad
[Figure: syllable length (letters per syllable) against word length (syllables per word)]
$$y = K e^{c/x}$$
K = 2.02, c = 0.42, R² = 0.96

Positional aspects of word length

Within-proverbial position    Pos1     Pos2     Pos3     Pos4     Pos5     Pos6     Pos7     Pos8     Pos9     Pos10
Mean word length (x̄)         1.9765   1.9608   1.8943   1.8852   1.7980   1.9756   2.0373   1.9704   1.9771   2.1714

[Figure: mean word length against position]

Fourier series, R² = 0.99:

$$f(x) = k + a \sin(bx) + c \cos(dx) + e \sin(fx) + g \cos(hx)$$

In the two approaches discussed above, the analyses concerned:
• the dependence of word length on sentence length, with no attention to within-sentence position;
• the dependence of word length on within-proverb position, ignoring the specific proverb length.

[Figure: mean word lengths by position, sentence-length specific, for proverb lengths 3 to 10]

Unipartite proverbs with length T3-T5
  Decrease, then increase
  Minimum at 2nd position
  Maximum at last position
Bipartite proverbs with length T6-T10
  Cycle I: unipartite proverbs (T6)
  Cycle II: T7, T9, and T10 unipartite proverbs
  T6, T8 = monotonic increase

What causes proverbs to be long(er) or short(er)?
From internal synergetic to external factors

... Thank you for your patience and attention ...

Familiarity and frequency

[Figure: familiarity (PTP) against corpus-based frequency, observed and theoretical values; panels for German data and American data]

Sentence length and familiarity (German data: N = 11,355; excluding zero-familiarity, f > 100)

[Figure: sentence length (SL) against familiarity (FAM)]
$$SeL = 8.40 \cdot FRQ^{-0.09}$$
R² = 0.89

Desiderata for Estonian Paremiology

Variants vs. types; frequency; familiarity

1. Linguistic forms of variants
2. Frequency
   1. of variants
   2. of types
3. Familiarity
   1. of variants
   2. of types

"It seems preposterous even to ask where the 'variants of one proverb' end and the 'variants of another proverb' begin, or how many 'different proverbs' could be found within such a thicket."

Frequency distribution of 'variants' (unreliable data for f > 10)

Zipf distribution
$$P_x = \frac{x^{-a}}{T(a)}, \quad x = 1, 2, 3, \dots, \qquad T(a) = \sum_{j=1}^{\infty} j^{-a}$$
a = 2.08, C = X²/N = 0.06

Right-truncated Zipf distribution
$$P_x = \frac{x^{-a}}{F(R)}, \quad x = 1, 2, 3, \dots, R, \qquad F(R) = \sum_{i=1}^{R} i^{-a}$$
a = 1.91, R = 9, C = X²/N = 0.0032

Number of variants and proverb length

[Figure: proverb length against number of variants]
$$y = K e^{c/x}$$
K = 6.52, c = 0.07, R² = 0.96

July 21, 1939: Arvo Arnol'dovič Krikmann
  Belgian National Day
  Village of Pudivere (German: Poidifer); Estonian writer Eduard Vilde (1865-1933)
  Simuna Parish: important point in F.G.W. Struve's Geodetic Arc, a chain of triangulations (1827)

July 21, 1940: President Konstantin Päts affirmed the government of Johannes Vares (appointed by Andrej Ždanov), accompanied by the arrival of Soviet demonstrators and Red Army troops, the replacement of the flag of Estonia by the Red flag on Pikk Hermann, and the meeting of the newly elected parliament (Riigikogu) on July 21.

July 21, 1944: Count Claus von Stauffenberg and his fellow conspirators were executed in Berlin for the plot to assassinate Adolf Hitler.

July 21, 1949: The United States Senate ratified the North Atlantic Treaty.