Word - Eesti Teaduste Akadeemia

Peter Grzybek
Estonian Proverbs:
Searching for regularities
www.peter-grzybek.eu
Peter Grzybek
Kriq 75, August 18/19, 2014
• How long is a proverb?
• How long are words in proverbs?
• Does word length depend on proverb length?
• Is word length independent of within-text position?
How to measure the length of linguistic units and entities?
Memo:
„There are no positive facts in language.“ (Saussure)
→ There is always more than one definition.
I. Define the entity you want to measure.
• If you want to measure sentence length, define ‚sentence‘.
• If you want to measure word length, define ‚word‘.
II. Determine the measuring units in which you want to measure.
• E.g., sentence length: number of clauses, phrases, words, syllables, morphemes, …?
• E.g., word length: number of syllables, morphemes, letters, graphemes, phonemes, …?
III. Define the measuring units.
• Define ‘clause’, ‘phrase’, ‘syllable’, ‘morpheme’, ‘phoneme’, ‘grapheme’, ‘letter’, …
Rule in Quantitative Linguistics:
Take direct constituents as measuring units.
How long are proverbs?
One proverb, one sentence
Sentence length: how many XY per proverb?
XY = clauses, phrases
  + syntactic analysis
  - dependent on syntax theory; reduced number of clauses/phrases in proverbs (lack of variation)
XY = words, stems
  + sufficient variation
  - dependent on lexical theory (orthographic word, phonological word, etc.)
XY = syllables, morphemes
  + lexical analysis; rhythmic structure
  - dependent on morphology and phonotactics; high degree of variation
Orthographic problems:
Mother-in-law - isn't that a problem?
В этом доме. [In this house.]
в кратцу - вкратце
Phonological word (tact group):
Ná mostu. [On the bridge.]
In agglutinative languages …
… stems do not change,
… affixes do not fuse with other affixes,
… affixes do not change form conditioned by other affixes.
How long are words?
Word length: how many XY per word?
XY = letters, graphemes
  + easy (automatic) analysis
  - high degree of alphabetic arbitrariness; high degree of variation
XY = phonemes
  + better linguistic justification
  - dependent on phonological theory; high degree of variation; neglect of quantity
XY = syllables, morphemes
  + lexical analysis; rhythmic structure
  - high degree of variation
Estonian phonemes:
Three degrees of phonemic length (consonants and vowels)
[o]    (short o)         koli      = "rubbish"
[oˈ]   (long o)          kooli     = "school"
[oː]   (extra long o)    kooli"    = "to school"
Decisions / Definitions
(In accordance with Kriq 1967)
Linguistic unit    Definition      Length
Sentence           One proverb     Number of words / stems
Word               Orthographic    Number of syllables
(Wo = words, St = stems, Sy = syllables)
Üks riisub rihaga, teine pühib luuaga. (EV 15016)
[One rakes with the rake, the other sweeps with the broom.]
Üks rii-sub ri-ha-ga, tei-ne pü-hib luu-a-ga.
Wo:6 – St:6 – Sy:13
Isi puu, isi puuke. (EV 2245)
[The one is the tree, the other is the little tree.]
I-si puu, i-si puu-ke.
Wo:4 – St:4 – Sy:7
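The word and syllable counts of the two examples can be checked mechanically. The following is a minimal sketch, assuming that syllable boundaries are already marked with hyphens as in the examples; the function name `proverb_lengths` is hypothetical, and automatic syllabification of Estonian is not attempted.

```python
import re

def proverb_lengths(hyphenated: str) -> tuple[int, int]:
    """Return (orthographic words, syllables) for a proverb with '-' marking syllable breaks."""
    # An orthographic word is a run of letters, optionally joined by syllable hyphens.
    words = re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", hyphenated)
    # Each hyphen inside a word separates two syllables.
    syllables = sum(word.count("-") + 1 for word in words)
    return len(words), syllables

print(proverb_lengths("Üks rii-sub ri-ha-ga, tei-ne pü-hib luu-a-ga."))  # (6, 13)
print(proverb_lengths("I-si puu, i-si puu-ke."))                         # (4, 7)
```

Stems (St) are not counted here, because separating stems requires a morphological analysis that the hyphenated form does not provide.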
Erna Normann (1955): Valimik eesti vanasõnu
3576 proverbs, ca. end of the 19th / early 20th century
Length    Frequency
3         93
4         619
5         518
6         710
7         511
8         518
9         218
10        175
11        81
12        62
13        40
14        13
15        5
16        5
17        2
18        4
19        0
20        1
21        1
[Figure: histogram of frequencies (0 to 800) by proverb length (3 to 21)]
Comparisons
Old (17th/18th century) and Contemporary
Number of proverbs: Normann 3576, Old 618, Contemporary (New) 294
Frequencies
Length    Normann    Old    New
3         93         21     26
4         619        103    89
5         518        80     52
6         710        129    39
7         511        86     17
8         518        96     44
9         218        39     12
10        175        21     9
11        81         21     1
12        62         14     5
13        40         4      0
14        13         3      1
15        5          0      0
16        5          1      0
17        2          0
18        4          1
19        0
20        1
21        1
[Figure: relative frequency (0.00 to 0.35) by length (2 to 22) for Contemporary, Normann and Old]
Bimodal distributions:
Additional peaks (6 / 8)
Question:
Does the word-stem distinction explain the bi-modality?
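The additional peaks can be located directly in the Normann frequencies listed earlier. The sketch below is an illustration only: the helper `local_peaks` and its `min_count` threshold (used to ignore noise in the sparsely populated tail) are assumptions, not part of the presentation.

```python
# Normann frequencies by proverb length (lengths 3 to 21), as listed in the slides.
normann = dict(zip(range(3, 22),
                   [93, 619, 518, 710, 511, 518, 218, 175, 81, 62,
                    40, 13, 5, 5, 2, 4, 0, 1, 1]))

def local_peaks(freq: dict[int, int], min_count: int = 50) -> list[int]:
    """Lengths whose frequency exceeds both neighbours and is at least min_count."""
    lengths = sorted(freq)
    peaks = []
    for prev, cur, nxt in zip(lengths, lengths[1:], lengths[2:]):
        if freq[cur] >= min_count and freq[cur] > freq[prev] and freq[cur] > freq[nxt]:
            peaks.append(cur)
    return peaks

print(local_peaks(normann))  # [4, 6, 8]
```

Besides the main mode at length 4, the counts show local maxima at lengths 6 and 8, which is what the slide refers to as additional peaks.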
Eesti vanasõnad
(12921 proverbs)
[Figure: scatter plot of stems per proverb (0 to 35) against orthographic words per proverb (0 to 35)]
Words and stems: linear relation!
[Figure: two histograms of the number of proverbs, by (orthographic) words per proverb and by stems per proverb]
→ Concentration on words
Some in-between conclusions
1. Bi-modality seems to originate in the characteristics of the proverb material; this phenomenon needs more detailed study.
2. It seems reasonable to assume that the overall picture results from differences between syntactically different proverbs: e.g., „simple“ (unipartite proverbs without hypotaxis) vs. „complex“ (n-partite proverbs with hypotaxis).
3. As long as relevant data are not available, data pooling seems to be an appropriate procedure to make the forest visible before the trees.
Pooling data: intervals 2-3, 4-5, 6-7, …
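Pooling into intervals of two can be sketched as follows. Since the Eesti vanasõnad counts are not given in the slides, the example uses the Normann frequencies listed earlier as stand-in data; the function name `pool_pairs` is hypothetical.

```python
# Normann frequencies by proverb length (lengths 3 to 21), used here as stand-in data.
normann = dict(zip(range(3, 22),
                   [93, 619, 518, 710, 511, 518, 218, 175, 81, 62,
                    40, 13, 5, 5, 2, 4, 0, 1, 1]))

def pool_pairs(freq: dict[int, int]) -> dict[tuple[int, int], int]:
    """Pool frequencies into the intervals (2, 3), (4, 5), (6, 7), ..."""
    pooled: dict[tuple[int, int], int] = {}
    for length, count in sorted(freq.items()):
        low = length - (length % 2)      # 2 or 3 -> 2, 4 or 5 -> 4, ...
        key = (low, low + 1)
        pooled[key] = pooled.get(key, 0) + count
    return pooled

for interval, count in pool_pairs(normann).items():
    print(interval, count)
```

For these data the pooled counts rise to a single maximum in the interval 6-7 and fall afterwards, so the local peaks of the unpooled counts no longer appear.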
[Figure: histogram of the number of proverbs (0 to 3000) by (orthographic) words per proverb (0 to 35), pooled intervals]
Is there a way to find a theoretical model for sentence length frequencies?
Assumptions:
1. The distribution of length is organized in a law-like manner.
2. It is sufficient to make assumptions about the difference D of two neighboring frequencies (probabilities):
$D = \frac{\Delta P_{x-1}}{P_{x-1}} = \frac{P_x - P_{x-1}}{P_{x-1}}$
Which factors influence D?
Parameters a, b, c, d: language-specific factors, production-specific factors, norming forces, level-specific factors (words vs. phrases).
These enter through the proportionality function $\frac{a + bx}{cx + d}$:
$P_x = \frac{a + bx}{cx + d} \, P_{x-1}$
With parameters k, m and q obtained from a, b, c and d, the solution is
$P_x = \frac{\binom{k + x - 1}{x}}{\binom{m + x - 1}{x}} \, q^x \, P_0$
Hyperpascal distribution
(Beta-binomial d.)
Eesti vanasõnad
Testing the hyperpascal distribution
$P_x = \frac{\binom{k + x - 1}{x}}{\binom{m + x - 1}{x}} \, q^x \, P_0$
k = 1.21
m = 0.07
q = 0.39
C = X²/N = 0.0193
1. The length of Estonian proverbs is regularly organized.
2. The well-known hyperpascal distribution is a good model.
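The hyperpascal probabilities can be computed from the ratio of neighboring classes, which follows from the binomial-coefficient form: P_x / P_{x-1} = q (k + x - 1) / (m + x - 1). The sketch below is an illustration under assumptions: the function name `hyperpascal_pmf`, the numerical normalization over a long tail, and the indexing of classes from x = 0 are not taken from the slides, which do not state how the pooled length classes were numbered.

```python
def hyperpascal_pmf(k: float, m: float, q: float,
                    n_classes: int, tail: int = 1000) -> list[float]:
    """Return P_0 .. P_{n_classes-1} of the hyperpascal distribution (0 < q < 1)."""
    # Unnormalized weights via the recurrence
    # P_x = q * (k + x - 1) / (m + x - 1) * P_{x-1}, starting from weight 1 for x = 0.
    weights = [1.0]
    for x in range(1, n_classes + tail):
        weights.append(weights[-1] * q * (k + x - 1) / (m + x - 1))
    total = sum(weights)              # P_0 = 1 / total (tail truncated numerically)
    return [w / total for w in weights[:n_classes]]

# Parameter values reported on the slide; the class indexing is an assumption.
probs = hyperpascal_pmf(k=1.21, m=0.07, q=0.39, n_classes=10)
for x, p in enumerate(probs):
    print(x, round(p, 4))
```

Working with the recurrence avoids evaluating generalized binomial coefficients for non-integer k and m.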
Is there a regularity of word length in Estonian proverbs?
Normann (21038 words)
Syllables per word (x)    Number of words (fx)
1                         6648
2                         10573
3                         2730
4                         920
5                         149
6                         16
7                         2
[Figure: bar chart of the number of words (0 to 15000) by syllables per word (1 to 7)]
In search of a word length model
$P_x = g(x) \, P_{x-1}$ with $g(x) = \frac{a}{x}$, i.e. $P_x = \frac{a}{x} \, P_{x-1}$
Poisson distribution:
$P_x = \frac{e^{-a} a^x}{x!}, \quad x = 0, 1, 2, \ldots$
1-displaced Poisson distribution („Fucks distribution“):
$P_x = \frac{e^{-a} a^{x-1}}{(x-1)!}, \quad x = 1, 2, 3, \ldots$
[Figure: observed and fitted number of words (0 to 15000) by syllables per word (1 to 7)]
C = X²/N = 0.08
→ No good model!
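The poor fit can be reproduced from the Normann word length counts given earlier. The following sketch assumes a moment estimate of the parameter (a = mean - 1); the slides do not say how the parameter was estimated, and the function names are hypothetical.

```python
import math

# Normann data: syllables per word -> number of words.
observed = {1: 6648, 2: 10573, 3: 2730, 4: 920, 5: 149, 6: 16, 7: 2}

def displaced_poisson(x: int, a: float) -> float:
    """1-displaced Poisson probability P_x = e^(-a) a^(x-1) / (x-1)!, x = 1, 2, ..."""
    return math.exp(-a) * a ** (x - 1) / math.factorial(x - 1)

def discrepancy(obs: dict[int, int], probs: dict[int, float]) -> float:
    """Discrepancy coefficient C = X^2 / N over the observed classes."""
    n = sum(obs.values())
    chi2 = sum((f - n * probs[x]) ** 2 / (n * probs[x]) for x, f in obs.items())
    return chi2 / n

n = sum(observed.values())
mean = sum(x * f for x, f in observed.items()) / n
a_hat = mean - 1  # assumption: moment estimate, since the mean of this distribution is a + 1
probs = {x: displaced_poisson(x, a_hat) for x in observed}
print(round(discrepancy(observed, probs), 2))  # 0.08
```

Under this estimate the model overestimates one-syllable words and underestimates two-syllable words, which accounts for most of the discrepancy.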
An alternative model for word length in Estonian (proverbs)
$P_x = g(x) \, P_{x-1}$ with $g(x) = \frac{a}{b} = q$, i.e. $P_x = q \, P_{x-1}$
Geometric distribution:
$P_x = p q^x, \quad x = 0, 1, 2, \ldots$
1-displaced geometric distribution:
$P_x = p q^{x-1}, \quad x = 1, 2, 3, \ldots$
1-displaced Shenton-Skees geometric distribution:
$P_x = p q^{x-1} \left[ 1 + a \left( x - \frac{1}{p} \right) \right], \quad x = 1, 2, 3, \ldots$
Word stems / Orthographic words:
p = 0.85, a = 3.49, C = 0.0062
p = 0.88, a = 4.71, C = 0.0023
Word length in Eesti vanasõnad
(88296 words)
Syllables per word (x)    Number of words (fx)
1                         27272
2                         43696
3                         12127
4                         4185
5                         822
6                         165
7                         32
8                         6
9                         1
$P_x = p q^{x-1} \left[ 1 + a \left( x - \frac{1}{p} \right) \right], \quad x = 1, 2, 3, \ldots$
p = 0.84
a = 3.30
C = 0.0074
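The fitted distribution can be evaluated against the Eesti vanasõnad counts. This is a sketch under assumptions: it uses the rounded parameter values from the slide, takes N as the sum of the tabulated counts, and does not pool sparse classes, so the resulting discrepancy coefficient need not match the reported C exactly. The function names are hypothetical.

```python
# Eesti vanasõnad data: syllables per word -> number of words.
observed = {1: 27272, 2: 43696, 3: 12127, 4: 4185, 5: 822,
            6: 165, 7: 32, 8: 6, 9: 1}

def shenton_skees_geometric(x: int, p: float, a: float) -> float:
    """1-displaced Shenton-Skees geometric: p q^(x-1) [1 + a (x - 1/p)], x = 1, 2, ..."""
    q = 1.0 - p
    return p * q ** (x - 1) * (1.0 + a * (x - 1.0 / p))

def discrepancy(obs: dict[int, int], probs: dict[int, float]) -> float:
    """Discrepancy coefficient C = X^2 / N over the observed classes."""
    n = sum(obs.values())
    chi2 = sum((f - n * probs[x]) ** 2 / (n * probs[x]) for x, f in obs.items())
    return chi2 / n

p, a = 0.84, 3.30  # rounded values as reported on the slide
probs = {x: shenton_skees_geometric(x, p, a) for x in observed}
print(round(discrepancy(observed, probs), 4))
```

The correction factor in square brackets sums to zero against the geometric weights, so the probabilities still add up to one.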
Proverb length and word length
(Normann)
Proverb length    x̄ (word length)
T3                2.2652
T4                1.9939
T5                1.9830
T6                1.9554
T7                1.9642
T8                1.8434
T9                1.8507
T10               1.8217
Menzerath-Altmann law (Altmann 1980)
»The longer (more complex) a linguistic construct, the shorter (less complex) its constituents.«
Example: The longer a sentence, the shorter the clauses constituting the sentence.
NB: Direct relations (in the classical structuralist paradigm) only, i.e., the relation of a construct to its immediate constituents; the relation between entities from indirectly related levels (e.g., between sentences and words, leapfrogging the intermediate level of sub-sentential constructs like clauses or phrases) is expected to show different (more complex) tendencies.
Basic form:
$\frac{y'}{y} = \frac{a}{x} \quad \Rightarrow \quad y = K \cdot x^a$, e.g. $WoL = K \cdot SeL^a$
y: constituent length (dependent variable),
x: construct length (independent variable),
K: integration constant,
a: parameter determining the steepness of the decrease (for a < 0).
Full form:
$\frac{y'}{y} = \frac{a}{x} + b \quad \Rightarrow \quad y = K \cdot x^a \cdot e^{bx}$
Extended form (Wimmer-Altmann law):
$\frac{y'}{y} = \frac{a}{x} + b + \frac{c}{x^2} \quad \Rightarrow \quad y = K \cdot x^a \cdot e^{bx} \cdot e^{-c/x}$
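As an illustration of the basic form, y = K · x^a can be fitted to the Normann mean word lengths for proverb lengths T3 to T10 reported earlier. The sketch uses ordinary least squares on the log-log scale, which is an assumption made here for simplicity and not the fitting procedure of the presentation; `fit_power_law` is a hypothetical helper.

```python
import math

# Mean word length (syllables per word) by proverb length (words), Normann, T3..T10.
data = {3: 2.2652, 4: 1.9939, 5: 1.9830, 6: 1.9554,
        7: 1.9642, 8: 1.8434, 9: 1.8507, 10: 1.8217}

def fit_power_law(points: dict[int, float]) -> tuple[float, float]:
    """Least-squares fit of ln y = ln K + a ln x; returns (K, a)."""
    xs = [math.log(x) for x in points]
    ys = [math.log(y) for y in points.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = (sum((u - mean_x) * (v - mean_y) for u, v in zip(xs, ys))
         / sum((u - mean_x) ** 2 for u in xs))
    K = math.exp(mean_y - a * mean_x)
    return K, a

K, a = fit_power_law(data)
print(f"K = {K:.3f}, a = {a:.3f}")
```

A negative exponent a corresponds to the decrease of word length with increasing proverb length that the law predicts.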
Proverb length and word length
[Figure, two panels: word length (syllables per word, 1.5 to 3.5) against proverb length (words per sentence), observed values and fitted curve]
Normann: $y = K \cdot e^{c/x}$; K = 1.68, c = -0.84; R² = 0.90
Eesti vanasõnad: $y = K \cdot x^a \cdot e^{c/x}$; K = 1.71, a = 0.18, c = -1.05; R² = 0.98
Word length and syllable length
Eesti vanasõnad
[Figure: syllable length (letters per syllable, 2.0 to 3.4) against word length (syllables per word, 0 to 10), observed values and fitted curve]
$y = K \cdot e^{c/x}$; K = 2.02, c = 0.42; R² = 0.96
Positional aspects of word length
Within-proverbial position    x̄ (word length)
Pos1                          1.9765
Pos2                          1.9608
Pos3                          1.8943
Pos4                          1.8852
Pos5                          1.7980
Pos6                          1.9756
Pos7                          2.0373
Pos8                          1.9704
Pos9                          1.9771
Pos10                         2.1714
[Figure: mean word length (1.6 to 2.4) against position (1 to 10), observed values and fitted curve]
Fourier series:
$f(x) = k + a \sin(bx) + c \cos(dx) + e \sin(fx) + g \cos(hx)$
R² = 0.99
In the two approaches discussed above, analyses concerned:
• the dependence of word length on sentence length → no attention to within-sentence position,
• the dependence of word length on within-proverb position → ignoring the specific proverb length.
[Figure: mean word lengths (1.6 to 2.4) against position (sentence-length specific), for proverb lengths 3 to 10]
Unipartite proverbs with length T3–T5
  Decrease – increase
  Minimum at 2nd position
  Maximum at last position
Bipartite proverbs with length T6–T10
  Cycle I: ≈ unipartite proverbs (T6)
  Cycle II: T7, T9, and T10 ≈ unipartite proverbs; T6, T8 = monotonic increase
What causes proverbs to be long(er) or short(er)?
From internal synergetic to external factors
... Thank you for your patience and attention ...
Familiarity and frequency
[Figure, two panels (German data, American data): familiarity (PTP) against corpus-based frequency, observed and theoretical values]
Sentence length and familiarity
(German data: N = 11,355; excluding zero-familiarity, f > 100)
[Figure: sentence length SL (6.00 to 8.50) against FAM (0 to 20)]
$SeL = 8.40 \cdot FRQ^{-0.09}$
R² = 0.89
Desiderata for Estonian Paremiology
• Variants vs. Types
• Frequency
• Familiarity
1. Linguistic forms of variants
2. Frequency
   1. of variants
   2. of types
3. Familiarity
   1. of variants
   2. of types
“It seems preposterous even to ask where the 'variants of one proverb' end and the 'variants of another proverb' begin, or how many 'different proverbs' could be found within such a thicket.”
Frequency distribution of ‚variants‘
(Unreliable data for f > 10)
Zipf distribution:
$P_x = \frac{x^{-a}}{T(a)}, \quad x = 1, 2, 3, \ldots, \qquad T(a) = \sum_{j=1}^{\infty} j^{-a}$
a = 2.08; C = X²/N = 0.06
Right-truncated Zipf distribution:
$P_x = \frac{x^{-a}}{F(R)}, \quad x = 1, 2, 3, \ldots, R, \qquad F(R) = \sum_{i=1}^{R} i^{-a}$
a = 1.91, R = 9; C = X²/N = 0.0032
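The right-truncated Zipf probabilities follow directly from the definition. A minimal sketch with the parameter values reported on the slide; the function name `right_truncated_zipf` is hypothetical, and the observed variant frequencies are not reproduced because they are not given in the slides.

```python
def right_truncated_zipf(a: float, R: int) -> list[float]:
    """Probabilities P_1 .. P_R with P_x = x^(-a) / F(R), F(R) = sum_{i=1}^{R} i^(-a)."""
    norm = sum(i ** -a for i in range(1, R + 1))   # normalizing constant F(R)
    return [x ** -a / norm for x in range(1, R + 1)]

probs = right_truncated_zipf(a=1.91, R=9)
for x, p in enumerate(probs, start=1):
    print(x, round(p, 4))
```

Truncating at R replaces the infinite normalizing sum T(a) by the finite sum F(R), so all probability mass is distributed over the classes 1 to R.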
[Figure: proverb length (6.0 to 7.4) against number of variants (2 to 8), observed values and fitted curve]
$y = K \cdot e^{c/x}$; K = 6.52, c = 0.07; R² = 0.96
July 21, 1939:
Arvo Arnol'dovič Krikmann
Belgian National Day
Village Pudivere (German: Poidifer)
Estonian writer Eduard Vilde (1865-1933)
Simuna Parish
Important point in F.G.W. Struve's Geodetic Arc, a chain of triangulations (1827)
July 21, 1940:
President Konstantin Päts affirmed the government of Johannes Vares (appointed by Andrej Ždanov), accompanied by the arrival of Soviet demonstrators and Red Army troops and the replacement of the flag of Estonia by the Red flag on Pikk Hermann; the newly elected parliament (Riigikogu) met on July 21.
July 21, 1944:
Count Claus von Stauffenberg and his fellow conspirators were executed in Berlin for the plot to assassinate Adolf Hitler.
July 21, 1949:
The United States Senate ratifies the North Atlantic Treaty.