Word Lists, Concordances, Text
Comparison, and Morphology in
Latin and Ancient Greek Texts
by Robert Maier, Freising
robert@maierphil.de
http://www.maierphil.de
• Historical Overview
• PHI and TLG Latin and Greek Texts
• Beyond PHI and TLG: Other Digitization Projects
• State of the Art
• Future Requirements
• Conclusion
Historical Overview
From the 16th to the 20th century:
• Dictionarium seu Latinae Linguae Thesaurus (first edition 1531 by Robert Estienne)
• Thesaurus Graecae Linguae (first edition 1572 by Henri Estienne, alias Henricus Stephanus)
• Thesaurus Linguae Latinae (Bayerische Akademie der Wissenschaften, 1894-??, currently at letter P)
Historical Overview
Electronic lexica and text collections in the 20th century:
"Grease. Greasy. Greased. Various forms and applications of the root, literal and metaphorical. I didn't believe him at first. I laughed in his face. Then he pressed a button and the machine began listing all the phrases in my work in which the word grease appears in one form or another. There they were, streaming across the screen in front of me, faster than I could read them, with page references and line numbers." (from Small World by David Lodge, 1984)
• Thesaurus Linguae Graecae (TLG)
• Packard Humanities Institute (PHI)
• CETEDOC Library of Christian Latin Texts (CLCLT)
PHI and TLG Latin and Greek Texts
License Model:
• Initially:
  • Data stored on CD-ROM
  • Periodic subscription fee
  • Free access for software developers
• Now:
  • PHI: Data stored on CD-ROM, no subscription fee
  • TLG: online
PHI and TLG Latin and Greek Texts
Text Format:
• Text: Beta Code (7-bit ASCII)
• Level encoding: binary values (MSB = 1)
Byte no.  Hexadecimal view         ASCII view
0x0000    EF 80 B0 B0 B0 B7 FF EF  °°·ÿï
0x0008    81 B0 B0 B1 FF EF 82 D4  °°±ÿï‚Ô
0x0010    E8 E5 F3 FF AF F4 FF 40  èåóÿ¯ôÿ@
0x0018    40 40 7B 31 24 32 30 2A  @@{1$20*
0x0020    51 2A 48 2A 53 2A 45 2A  Q*H*S*E*
0x0028    55 2A 53 20 2A 4B 2A 41  U*S *K*A
0x0030    2A 49 20 2A 52 2A 57 2A  *I *R*W*
0x0038    4D 2A 55 2A 4C 2A 4F 2A  M*U*L*O*
0x0040    53 24 7D 31 20 A1 40 2A  S$}1░¡@*
0x0048    28 2F 57 53 50 45 52 20  (/WSPER░
0x0050    45 29 4E 20 54 41 49 3D  E)N░TAI=
0x0058    53 20 47 45 57 47 52 41  S░GEWGRA
0x0060    46 49 2F 41 49 53 2C 20  FI/AIS,░
0x0068    57 29 3D 20 2A 53 4F 2F  W)=░*SO/
0x0070    53 53 49 45 20 2A 53 45  SSIE░*SE
0x0078    4E 45 4B 49 2F 57 4E 2C  NEKI/WN,
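The MSB rule can be tried out directly on the first bytes of the dump. The following is only a rough sketch: the names SAMPLE, split_by_msb and readable_fields are made up for illustration, and the heuristic in readable_fields (split on 0xFF, skip the first byte of each group, clear the top bit) is an assumption that happens to fit this sample, not a parser for the PHI/TLG level format.

```python
from itertools import groupby

# Rows 0x0000 to 0x0020 of the hex dump (40 bytes).
SAMPLE = bytes.fromhex(
    "EF 80 B0 B0 B0 B7 FF EF "
    "81 B0 B0 B1 FF EF 82 D4 "
    "E8 E5 F3 FF AF F4 FF 40 "
    "40 40 7B 31 24 32 30 2A "
    "51 2A 48 2A 53 2A 45 2A"
)


def split_by_msb(data):
    """Group consecutive bytes into ('level', run) and ('text', run) pairs.

    A byte counts as level encoding when its most significant bit is set.
    """
    return [
        ("level" if high else "text", bytes(run))
        for high, run in groupby(data, key=lambda b: b >= 0x80)
    ]


def readable_fields(level_run):
    """Heuristic for this sample only (assumption, see lead-in)."""
    fields = []
    for chunk in level_run.split(b"\xff"):
        # Skip the first byte of the group, clear the MSB of the rest,
        # and keep only printable ASCII characters.
        chars = [chr(b & 0x7F) for b in chunk[1:] if 0x20 <= (b & 0x7F) < 0x7F]
        if chars:
            fields.append("".join(chars))
    return fields


runs = split_by_msb(SAMPLE)
level_fields = readable_fields(runs[0][1])
text_run = runs[1][1].decode("ascii")
```

With the top bit cleared, the level bytes of this sample spell out the parts of the citation (author 0007, work 001, abbreviation Thes, level t), while the 7-bit run holds the Beta Code text.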
0007, 001, Thes, t
ΘΗΣΕΥΣ ΚΑΙ ΡΩΜΥΛΟΣ
0007, 001, Thes, 1, 1, 1
Ὥσπερ ἐν ταῖς γεωγραφίαις,
ὦ Σόσσιε Σενεκίων,
(From Plutarch: Theseus and Romulus)
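The step from Beta Code to the Greek display can be sketched in a few lines. This is a minimal illustration, not the converter used by any of the packages discussed here: it covers only the 24 basic letters, the asterisk for capitals, and the common diacritic signs, and it assumes that formatting codes such as @, {1 or $20 have already been removed. The names LETTERS, MARKS and beta_to_unicode are made up for this example.

```python
import unicodedata

# Beta Code letter -> Greek lowercase letter (basic alphabet only).
LETTERS = dict(zip("ABGDEZHQIKLMNCOPRSTUFXYW", "αβγδεζηθικλμνξοπρστυφχψω"))

# Beta Code diacritic sign -> Unicode combining character.
MARKS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
    "=": "\u0342",   # circumflex
    "+": "\u0308",   # diaeresis
    "|": "\u0345",   # iota subscript
}


def beta_to_unicode(beta):
    """Convert a plain Beta Code string to precomposed Unicode Greek."""
    out = []
    i = 0
    while i < len(beta):
        ch = beta[i].upper()
        if ch == "*":
            # Capital: diacritics stand between the asterisk and the letter.
            i += 1
            marks = ""
            while i < len(beta) and beta[i] in MARKS:
                marks += MARKS[beta[i]]
                i += 1
            if i < len(beta) and beta[i].upper() in LETTERS:
                out.append(LETTERS[beta[i].upper()].upper() + marks)
                i += 1
            continue
        if ch in LETTERS:
            nxt = beta[i + 1].upper() if i + 1 < len(beta) else ""
            letter = LETTERS[ch]
            # Simplification: a lowercase sigma not followed by a letter is final.
            if ch == "S" and nxt not in LETTERS:
                letter = "ς"
            out.append(letter)
        else:
            # Diacritic after a lowercase letter, or punctuation kept as-is.
            out.append(MARKS.get(ch, beta[i]))
        i += 1
    # NFC combines base letters and combining marks into single code points.
    return unicodedata.normalize("NFC", "".join(out))


title = beta_to_unicode("*Q*H*S*E*U*S *K*A*I *R*W*M*U*L*O*S")
opening = beta_to_unicode("*(/WSPER E)N TAI=S GEWGRAFI/AIS, W)= *SO/SSIE *SENEKI/WN,")
```

Applied to the two Beta Code strings from the dump, title and opening should correspond to the Greek title and first line shown above. Composing with NFC keeps the mapping table small, at the price of relying on Unicode normalization for the precomposed forms.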
PHI and TLG Latin and Greek Texts
Software development:
PHI and TLG Software Projects:
• Pandora (for Macintosh)
• PHI / TLG Workplace
• Musaios
• SNS Greek & Latin
• Diogenes
• LECTOR
PHI and TLG Latin and Greek Texts
Software Features:
Feature               Musaios  SNS Greek & Latin  Diogenes  Workplace Pack 2007  LECTOR
Greek text display    X        X                  X         X                    X
Text export           X        X                  X         X                    X
Word searches         X        X                  X         X                    X
Concordances          O        O                  -         O                    X
Morphology            -        -                  X*        O*                   O
Dictionary            -        -                  X*        O*                   O
Text comparison       -        -                  -         -                    X
Statistics            -        -                  -         -                    O
Text to Speech (TTS)  -        -                  -         -                    X
* The morphology and dictionary tools are provided by the Perseus project.
Beyond PHI and TLG
Other Digitization Projects:
• Bibliotheca Teubneriana Latina (BTL)
• Library of Latin Texts (CLCLT)
• Perseus Project (various texts, morphology)
• CAMENA (early modern period, morphology)
• eAQUA (text mining, cooccurrence analysis, morphology, NER)
• Special Projects:
  • Heidelberger Papyri
  • Electronic Archive of Greek and Latin Epigraphy (EAGLE, Heidelberg)
State of the Art
Concordances:
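A keyword-in-context (KWIC) listing is one common form of concordance. The following is a minimal sketch and does not reproduce the output of any of the packages named in this talk; the function name kwic, the width parameter and the constant SENTENCE are illustrative. The sample sentence is Caes, Gal, 2, 5, 6, 4.

```python
import re

SENTENCE = (
    "castra in altitudinem pedum XII vallo fossaque "
    "duodeviginti pedum muniri iubet."
)


def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for every exact match.

    Matching is case-insensitive and on the surface form only; inflected
    forms of the same lemma are not found.
    """
    tokens = re.findall(r"[^\W\d_]+", text)  # runs of letters
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


for left, word, right in kwic(SENTENCE, "pedum", width=2):
    print(f"{left:>25} [{word}] {right}")
```

Because only surface forms are compared, finding all forms of a word this way would additionally need a morphological analysis.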
State of the Art
Frequencies:
#   Caesar  %      Hirtius  %
1   in      2.25%  in       2.52%
2   et      2.06%  cum      2.01%
3   ad      1.59%  ad       1.27%
4   cum     1.08%  et       1.20%
5   ex      1.05%  qui      0.70%
6   atque   1.01%  non      0.69%
7   ut      0.91%  ex       0.67%
8   se      0.86%  quod     0.66%
9   quod    0.79%  ut       0.63%
10  ab      0.68%  quae     0.61%
11  qui     0.68%  se       0.58%
12  non     0.64%  atque    0.58%
13  a       0.53%  quam     0.55%
14  neque   0.52%  esse     0.54%
15  caesar  0.49%  caesar   0.52%
16  quae    0.48%  aut      0.46%
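A relative frequency list of this kind can be sketched as follows. The tokenization (lowercased runs of letters, no lemmatization) is an assumption and not necessarily the one behind the figures above; the names relative_frequencies and SENTENCE are illustrative, and the sample is a single sentence (Caes, Gal, 2, 5, 6, 4) rather than a full corpus.

```python
import re
from collections import Counter

SENTENCE = (
    "castra in altitudinem pedum XII vallo fossaque "
    "duodeviginti pedum muniri iubet."
)


def relative_frequencies(text, top=None):
    """Return (word form, percentage of all tokens), most frequent first."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    return [(word, 100.0 * n / total) for word, n in counts.most_common(top)]


for rank, (word, percent) in enumerate(relative_frequencies(SENTENCE, top=3), 1):
    print(f"{rank:>2} {word:<12} {percent:.2f}%")
```

Running the same function over the complete works of two authors and placing the results side by side would give a table of the shape shown above.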
State of the Art
Text Comparison (Cooccurrence Analysis):
Caesar / Hirtius:

Caes, Gal, 2, 5, 6, 4: castra in altitudinem pedum XII vallo fossaque duodeviginti pedum muniri iubet.
Hirt, Gal, 8, 9, 3, 2: haec imperat vallo pedum duodecim muniri, loriculam pro hac ratione eius altitudinis inaedificari, ...

Caes, Gal, 3, 17, 4, 3: magnaque praeterea multitudo undique ex Gallia perditorum hominum latronumque convenerat, ...
Hirt, Gal, 8, 30, 1, 3: ..., qui, ut primum defecerat Gallia, collectis undique perditis hominibus, servis ad libertatem vocatis, ...

Caes, Civ, 1, 3, 4, 2: omnes amici consulum, necessarii Pompei atque ii, qui veteres inimicitias cum Caesare gerebant, in senatum coguntur.
Hirt, Gal, 8, 52, 5, 3: quod ne fieret, consules amicique Pompei evicerunt atque ita rem morando discusserunt.
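How such parallels might be detected can be illustrated with a deliberately crude sketch. It compares the first five letters of each word as a stand-in for a proper morphological analysis, so that perditorum and perditis or hominum and hominibus count as the same word. This is an assumption made for the example, not the method used to find the passages above; the names crude_stems, shared_stems, CAESAR and HIRTIUS are illustrative.

```python
import re

CAESAR = (
    "magnaque praeterea multitudo undique ex Gallia "
    "perditorum hominum latronumque convenerat"
)
HIRTIUS = (
    "qui, ut primum defecerat Gallia, collectis undique "
    "perditis hominibus, servis ad libertatem vocatis"
)


def crude_stems(text, prefix_len=5):
    """Set of word prefixes; words shorter than prefix_len are ignored."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return {word[:prefix_len] for word in words if len(word) >= prefix_len}


def shared_stems(a, b, prefix_len=5):
    """Sorted prefixes that occur in both passages."""
    return sorted(crude_stems(a, prefix_len) & crude_stems(b, prefix_len))


print(shared_stems(CAESAR, HIRTIUS))
```

For this pair the shared prefixes correspond to Gallia, undique, perdit- and homin-. A real comparison would replace the prefix rule with lemmatization and score candidate passages by the number and rarity of the shared words.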
State of the Art
Statistics:
[Figure: mean word length (y-axis, 1 to 9) plotted against the normalized rel. no. of occurrences, ln((N[w]-1)/SUM(N[w])+1/N[0]) (x-axis, -10 to -3)]
Prose: ■ Tacitus (Lat.), ● Herodotus (Gr.)
Poetry: ▲ Vergil (Lat.), ▼ Homer (Gr.)
Future Requirements
• Integration of different text corpora
• Named Entity Recognition (NER)
• Cross-language text comparison
• Advanced statistical methods
• Morphology
• Lexica and translation tools
• Advanced stylistic searches
• Text to Speech (TTS)
Conclusion
• The text collections of PHI and TLG are well known.
• Several software packages have been designed for use with the PHI and TLG texts.
• Future developments will require the integration of more advanced text analysis methods.
• Some of those methods will require text mining services.
THE END
Questions / Remarks?
Acknowledgement
Thanks to Reinhard Gruhl and Marco Büchler
for helpful discussions.