Word Lists, Concordances, Text Comparison, and Morphology in Latin and Ancient Greek Texts

by Robert Maier, Freising
robert@maierphil.de
http://www.maierphil.de

• Historical Overview
• PHI and TLG Latin and Greek Texts
• Beyond PHI and TLG: Other Digitalisation Projects
• State of the Art
• Future Requirements
• Conclusion

Historical Overview

From the 16th to the 20th century:
Dictionarium seu Latinae Linguae Thesaurus (first edition 1531 by Robert Estienne)
Thesaurus Graecae Linguae (first edition 1572 by Henri Estienne, alias Henricus Stephanus)
Thesaurus Linguae Latinae (Bayerische Akademie der Wissenschaften, 1894-??, currently at the letter P)

Electronic lexica and text collections in the 20th century:

"Grease. Greasy. Greased. Various forms and applications of the root, literal and metaphorical. I didn't believe him at first. I laughed in his face. Then he pressed a button and the machine began listing all the phrases in my work in which the word grease appears in one form or another. There they were, streaming across the screen in front of me, faster than I could read them, with page references and line numbers."
(from Small World by David Lodge, 1984)

Thesaurus Linguae Graecae (TLG)
Packard Humanities Institute (PHI)
CETEDOC Library of Christian Latin Texts (CLCLT)

PHI and TLG Latin and Greek Texts

License Model:
Initially:
    Data stored on CD-ROM
    Periodic subscription fee
    Free access for software developers
Now:
    PHI: Data stored on CD-ROM, no subscription fee
    TLG: online

Text Format:
Text: Beta Code (7-bit ASCII)
Level Encoding: Binary values (MSB = 1)

Byte no.
          Hexadecimal view           ASCII view
0x0000    EF 80 B0 B0 B0 B7 FF EF    °°·ÿï
0x0008    81 B0 B0 B1 FF EF 82 D4    °°±ÿï‚Ô
0x0010    E8 E5 F3 FF AF F4 FF 40    èåóÿ¯ôÿ@
0x0018    40 40 7B 31 24 32 30 2A    @@{1$20*
0x0020    51 2A 48 2A 53 2A 45 2A    Q*H*S*E*
0x0028    55 2A 53 20 2A 4B 2A 41    U*S *K*A
0x0030    2A 49 20 2A 52 2A 57 2A    *I *R*W*
0x0038    4D 2A 55 2A 4C 2A 4F 2A    M*U*L*O*
0x0040    53 24 7D 31 20 A1 40 2A    S$}1░¡@*
0x0048    28 2F 57 53 50 45 52 20    (/WSPER░
0x0050    45 29 4E 20 54 41 49 3D    E)N░TAI=
0x0058    53 20 47 45 57 47 52 41    S░GEWGRA
0x0060    46 49 2F 41 49 53 2C 20    FI/AIS,░
0x0068    57 29 3D 20 2A 53 4F 2F    W)=░*SO/
0x0070    53 53 49 45 20 2A 53 45    SSIE░*SE
0x0078    4E 45 4B 49 2F 57 4E 2C    NEKI/WN,

0007, 001, Thes, t          ΘΗΣΕΥΣ ΚΑΙ ΡΩΜΥΛΟΣ
0007, 001, Thes, 1, 1, 1    Ὥσπερ ἐν ταῖς γεωγραφίαις, ὦ Σόσσιε Σενεκίων,
(From Plutarch: Theseus and Romulus)

Software development:

Software Projects:
Pandora (for Macintosh)
PHI / TLG Workplace
Musaios
SNS Greek & Latin
Diogenes
LECTOR

Software Features:

                        Musaios   SNS Greek   Diogenes   Workplace   LECTOR
                                  & Latin                Pack        2007
Greek text display      X         X           X          X           X
Text export             X         X           X          X           X
Word Searches           X         X           X          X           X
Concordances            O         O           -          O           X
Morphology              -         -           X*         O*          O
Dictionary              -         -           X*         O*          O
Text comparison         -         -           -          -           X
Statistics              -         -           -          -           O
Text to Speech (TTS)    -         -           -          -           X

* The morphology and dictionary tools are provided by the Perseus project.
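
The byte dump on the Text Format slide can be read programmatically. The following is a minimal sketch, not the method used by any of the packages listed above: it separates bytes with the most significant bit set (level encoding) from 7-bit ASCII bytes (Beta Code text) and converts a basic subset of Beta Code to Unicode Greek. The function names split_runs and beta_to_unicode are hypothetical, the handling of formatting escapes (@, $20, {1, }1) is a deliberate simplification that just discards them, and the level bytes are only separated, not interpreted.

```python
import re
import unicodedata

# Beta Code letters -> lowercase Greek base letters.
LETTERS = dict(zip("ABGDEZHQIKLMNCOPRSTUFXYW", "αβγδεζηθικλμνξοπρστυφχψω"))

# Beta Code diacritics -> Unicode combining marks.
MARKS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
    "=": "\u0342",   # circumflex (perispomeni)
    "+": "\u0308",   # diaeresis
    "|": "\u0345",   # iota subscript
}

# The 128 bytes shown in the hexadecimal view, two dump rows per line.
DUMP = bytes.fromhex(
    "EF 80 B0 B0 B0 B7 FF EF  81 B0 B0 B1 FF EF 82 D4 "
    "E8 E5 F3 FF AF F4 FF 40  40 40 7B 31 24 32 30 2A "
    "51 2A 48 2A 53 2A 45 2A  55 2A 53 20 2A 4B 2A 41 "
    "2A 49 20 2A 52 2A 57 2A  4D 2A 55 2A 4C 2A 4F 2A "
    "53 24 7D 31 20 A1 40 2A  28 2F 57 53 50 45 52 20 "
    "45 29 4E 20 54 41 49 3D  53 20 47 45 57 47 52 41 "
    "46 49 2F 41 49 53 2C 20  57 29 3D 20 2A 53 4F 2F "
    "53 53 49 45 20 2A 53 45  4E 45 4B 49 2F 57 4E 2C "
)


def split_runs(data: bytes) -> list[tuple[str, bytes]]:
    """Group consecutive bytes into 'level' runs (MSB = 1) and 'text' runs (7-bit ASCII)."""
    runs: list[tuple[str, bytearray]] = []
    for b in data:
        kind = "level" if b & 0x80 else "text"
        if runs and runs[-1][0] == kind:
            runs[-1][1].append(b)
        else:
            runs.append((kind, bytearray([b])))
    return [(kind, bytes(chunk)) for kind, chunk in runs]


def beta_to_unicode(beta: str) -> str:
    """Convert Beta Code (basic letters and diacritics only) to Unicode Greek."""
    # Simplification: discard formatting escapes such as @, $20, {1 and }1.
    beta = re.sub(r"[@${}]\d*", "", beta)
    out = []
    upper = False   # set by '*', cleared after the next letter
    pending = ""    # marks written between '*' and an uppercase letter
    for i, ch in enumerate(beta):
        if ch == "*":
            upper = True
        elif ch in MARKS:
            if upper:
                pending += MARKS[ch]
            else:
                out.append(MARKS[ch])
        elif ch in LETTERS:
            letter = LETTERS[ch]
            if ch == "S" and not upper and not beta[i + 1 : i + 2].isalpha():
                letter = "ς"  # word-final sigma
            out.append((letter.upper() if upper else letter) + pending)
            upper, pending = False, ""
        else:
            out.append(ch)
    # NFC composes base letters and combining marks into precomposed characters.
    return unicodedata.normalize("NFC", "".join(out))


for kind, chunk in split_runs(DUMP):
    if kind == "text":
        print(beta_to_unicode(chunk.decode("ascii")).strip())
```

Run on the dump, the two text runs yield the title line and the first line of the Plutarch passage shown above. In this sample, clearing the top bit of the level bytes (b & 0x7F) also exposes the strings 0007, 001 and Thes that appear in the citation lines.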

Beyond PHI and TLG

Other Digitalisation Projects:
Bibliotheca Teubneriana Latina (BTL)
Library of Latin Texts (CLCLT)
Perseus Project (various texts, morphology)
CAMENA (early modern period, morphology)
eAQUA (text mining, cooccurrence analysis, morphology, NER)

Special Projects:
Heidelberger Papyri
Electronic Archive of Greek and Latin Epigraphy (EAGLE, Heidelberg)

State of the Art

Concordances:

Frequencies:

 #    Caesar            Hirtius
 1    in      2.25%     in      2.52%
 2    et      2.06%     cum     2.01%
 3    ad      1.59%     ad      1.27%
 4    cum     1.08%     et      1.20%
 5    ex      1.05%     qui     0.70%
 6    atque   1.01%     non     0.69%
 7    ut      0.91%     ex      0.67%
 8    se      0.86%     quod    0.66%
 9    quod    0.79%     ut      0.63%
10    ab      0.68%     quae    0.61%
11    qui     0.68%     se      0.58%
12    non     0.64%     atque   0.58%
13    a       0.53%     quam    0.55%
14    neque   0.52%     esse    0.54%
15    caesar  0.49%     caesar  0.52%
16    quae    0.48%     aut     0.46%

Text Comparison (Cooccurrence Analysis):

Caesar (Caes, Gal, 2, 5, 6, 4): castra in altitudinem pedum XII vallo fossaque duodeviginti pedum muniri iubet.
Hirtius (Hirt, Gal, 8, 9, 3, 2): haec imperat vallo pedum duodecim muniri, loriculam pro hac ratione eius altitudinis inaedificari, ...

Caesar (Caes, Gal, 3, 17, 4, 3): magnaque praeterea multitudo undique ex Gallia perditorum hominum latronumque convenerat, ...
Hirtius (Hirt, Gal, 8, 30, 1, 3): ..., qui, ut primum defecerat Gallia, collectis undique perditis hominibus, servis ad libertatem vocatis, ...

Caesar (Caes, Civ, 1, 3, 4, 2): omnes amici consulum, necessarii Pompei atque ii, qui veteres inimicitias cum Caesare gerebant, in senatum coguntur.
Hirtius (Hirt, Gal, 8, 52, 5, 3): quod ne fieret, consules amicique Pompei evicerunt atque ita rem morando discusserunt.

Statistics:

Legend: Prose: ■ Tacitus (Lat.), ● Herodotus (Gr.); Poetry: ▲ Vergil (Lat.), ▼ Homer (Gr.)
y axis: mean word length (1 to 9)
x axis (-10 to -3): normalized rel. no.
of occurrences: ln((N[w]-1)/SUM(N[w])+1/N[0])

Future Requirements

Integration of different text corpora
Named Entity Recognition (NER)
Cross-language text comparison
Advanced statistical methods
Morphology
Lexica and translation tools
Advanced stylistic searches
Text to Speech (TTS)

Conclusion

The text collections of PHI and TLG are well known.
Several software packages have been designed for use with the PHI and TLG texts.
Future developments will require the integration of more advanced text analysis methods.
Some of those methods will require text mining services.

THE END

Questions / Remarks?

Acknowledgement

Thanks to Reinhard Gruhl and Marco Büchler for helpful discussions.
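
The frequency table and the Caesar/Hirtius passage pairs in the State of the Art slides can be approximated with a few lines of code. The following is a rough sketch under simplifying assumptions, not the procedure behind the figures shown: it counts inflected surface forms only, so it would not link perditorum with perditis or consulum with consules, which requires the morphological analysis listed under Future Requirements. The function names are hypothetical, and the sample passages are the first pair quoted above.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic word forms."""
    return re.findall(r"[a-z]+", text.lower())


def relative_frequencies(text: str, top: int = 5) -> list[tuple[str, float]]:
    """Return the `top` most frequent word forms with their share of all tokens in percent."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    return [(word, 100.0 * n / len(tokens)) for word, n in counts.most_common(top)]


def shared_forms(a: str, b: str) -> set[str]:
    """Return the word forms that occur in both passages."""
    return set(tokenize(a)) & set(tokenize(b))


# First passage pair from the Text Comparison slide.
CAESAR = "castra in altitudinem pedum XII vallo fossaque duodeviginti pedum muniri iubet."
HIRTIUS = ("haec imperat vallo pedum duodecim muniri, loriculam pro hac ratione "
           "eius altitudinis inaedificari")

print(sorted(shared_forms(CAESAR, HIRTIUS)))  # ['muniri', 'pedum', 'vallo']
for word, percent in relative_frequencies(CAESAR, top=3):
    print(f"{word:10s} {percent:5.2f}%")
```

On a full corpus the same counting would produce a ranked list like the Caesar and Hirtius columns, and passages sharing several forms become candidates for the kind of parallel shown in the comparison slide.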