Christina Tånnander, TPB

The Danish text corpus

The text corpus was put together by Christian Wallin, NOTA, and consists of more than 10 000 000 words from Danish newspapers and popular science. The corpus has been cleaned and processed by Christina Tånnander, TPB:

1. Conversion from UTF-8 to ASCII (manual with UltraEdit) (>korpus.txt)
2. Cleaned and <p>-chunked by (>clean_corpus.txt)
3. Sentence-chunked by (>sent_corpus.txt)
4. Statistics calculated by (>stat_corpus.txt)

Sentence, word and character statistics. Lc = lowercase.

Total sentences:                  627 360
Unique sentences:                 552 793
Unique lc sentences:              552 054

Total words:                   10 042 857
Unique words:                     387 877
Unique lc words:                  346 261

Total characters:              50 728 569
Unique characters:                    169
Unique lc characters:                 136

The recording manuscript should consist of sentences of a certain maximum length. Counting syllables or vowels in the sentences gives us a hint of how many sentences in the text corpus we can use for the recording manuscript. If the maximum length is 30 vowels, we have about 350 000 sentences to choose from.

Average words / sentence:           16.01
Average characters / sentence:      80.86
Average characters / word:           5.05
Average vowels / sentence:          30.29

Sentences with fewer than 10 vowels:  133 267
Sentences with fewer than 20 vowels:  245 635
Sentences with fewer than 30 vowels:  352 341
Sentences with fewer than 40 vowels:  443 061

The pronunciation dictionaries should cover most of the words with a certain frequency in the text corpus. This table shows the number of words with exactly one occurrence in the corpus (f1), with two or more occurrences (f2), and so on.

Words == f1:                      211 429
Words >= f2:                      176 448
Words >= f5:                       83 518
Words >= f10:                      50 161
Words >= f25:                      25 077
Words >= f50:                      14 479
Words >= f100:                      8 149
Words >= f250:                      3 504
Words >= f500:                      1 790
Words >= f1000:                       844
Words >= f5000:                       182
Words >= f10000:                       91

12 779 of the words in the Danish text corpus exist in the Swedish name dictionary.
About 7 000 of them occur 1-5 times in the corpus and about 6 000 occur six times or more:

Words in TPB name lexicon:             12 779
Words in TPB name lexicon (f 1-5):      6 925
Words in TPB name lexicon (f 6-):       5 854
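
The counts reported above could be reproduced along the following lines. This is only a minimal sketch in Python, not the scripts actually used for the corpus, which the report does not name. It rests on several assumptions: sentences are already available as a list of strings (for example one per line of sent_corpus.txt), words are plain whitespace-separated tokens, spaces are not counted as characters, and the vowel set is "aeiouyæøå". All function names are hypothetical.

```python
from collections import Counter

# Assumption: vowel letters of Danish orthography; the report does not list them.
VOWELS = set("aeiouyæøå")


def count_vowels(sentence):
    """Number of vowel letters, a rough stand-in for the syllable count."""
    return sum(1 for ch in sentence.lower() if ch in VOWELS)


def corpus_stats(sentences):
    """Total, unique and unique-lowercase counts plus simple averages."""
    words = [w for s in sentences for w in s.split()]  # whitespace tokens
    chars = [c for w in words for c in w]  # assumption: spaces are not counted
    n = len(sentences)
    return {
        "total_sentences": n,
        "unique_sentences": len(set(sentences)),
        "unique_lc_sentences": len({s.lower() for s in sentences}),
        "total_words": len(words),
        "unique_words": len(set(words)),
        "unique_lc_words": len({w.lower() for w in words}),
        "total_chars": len(chars),
        "unique_chars": len(set(chars)),
        "unique_lc_chars": len({c.lower() for c in chars}),
        "avg_words_per_sentence": len(words) / n if n else 0.0,
        "avg_vowels_per_sentence": sum(map(count_vowels, sentences)) / n if n else 0.0,
    }


def sentences_under(sentences, limits):
    """For each limit, count the sentences with fewer than `limit` vowels."""
    vowel_counts = [count_vowels(s) for s in sentences]
    return {limit: sum(1 for v in vowel_counts if v < limit) for limit in limits}


def words_at_or_above(word_freq, thresholds):
    """For each threshold f, count the distinct words occurring at least f times."""
    return {f: sum(1 for c in word_freq.values() if c >= f) for f in thresholds}


def lexicon_overlap(word_freq, lexicon, split=5):
    """Corpus words found in a lexicon: (total, frequency <= split, frequency > split)."""
    found = [c for w, c in word_freq.items() if w in lexicon]
    low = sum(1 for c in found if c <= split)
    return len(found), low, len(found) - low


# Tiny illustrative sample, not taken from the corpus.
sample = ["Det er en hund.", "det er en hund.", "Det er en kat."]
word_freq = Counter(w for s in sample for w in s.split())
stats = corpus_stats(sample)
short_enough = sentences_under(sample, [10, 20, 30, 40])
coverage = words_at_or_above(word_freq, [1, 2, 5, 10])
```

The word frequencies are kept case-sensitive here because the f1 and f2 rows of the frequency table add up to the number of unique (not lowercased) words. A real run would also need a decision on how punctuation attached to words is handled, which this sketch leaves untouched.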