The Danish text corpus

Christina Tånnander, TPB
The text corpus was compiled by Christian Wallin, NOTA, and consists of more than 10 000 000
words from Danish newspapers and popular science texts. The corpus has been cleaned and processed
by Christina Tånnander, TPB, in four steps:
1. Conversion from utf8 to ascii (manual with UltraEdit) (>korpus.txt)
2. Cleaned and <p>-chunked by CleanCorpus.pl (>clean_corpus.txt)
3. Sentence-chunked by SentCorpus.pl (>sent_corpus.txt)
4. Statistics calculated by StatCorpus.pl (>stat_corpus.txt)
Sentence, word and character statistics. Lc = lowercase.
Total sentences:           627 360
Unique sentences:          552 793
Unique lc sentences:       552 054

Total words:            10 042 857
Unique words:              387 877
Unique lc words:           346 261

Total characters:       50 728 569
Unique characters:             169
Unique lc characters:          136
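
The counts above could be reproduced along the following lines. This is only a minimal Python sketch, not the actual StatCorpus.pl (which is a Perl script whose contents are not shown here). The function name corpus_stats is hypothetical, and two assumptions are made: words are whitespace-separated tokens, and characters are counted inside tokens only (whitespace excluded).

```python
def corpus_stats(sentences):
    """Total / unique / unique-lowercase counts for sentences, words, characters.

    Assumptions (not confirmed by the source): words are whitespace-separated
    tokens, and characters are counted inside tokens only.
    """
    words = [w for s in sentences for w in s.split()]
    chars = [c for w in words for c in w]

    def triple(items):
        return {
            "total": len(items),
            "unique": len(set(items)),
            "unique_lc": len({x.lower() for x in items}),
        }

    return {
        "sentences": triple(sentences),
        "words": triple(words),
        "characters": triple(chars),
    }


# Tiny made-up sample, one sentence per list item.
sample = ["Det er en test.", "det er en test.", "Det er en test."]
stats = corpus_stats(sample)
for level, counts in stats.items():
    print(level, counts)
```

For the real corpus, the input would be the lines of sent_corpus.txt, assuming that file holds one sentence per line.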
The recording manuscript should consist of sentences below a certain maximum length. Counting
syllables or vowels in each sentence indicates how many sentences in the text corpus can be used
for the recording manuscript. If the maximum length is 30 vowels, we have about 350 000 sentences
to choose from.
Average words / sentence:        16.01
Average characters / sentence:   80.86
Average characters / word:        5.05
Average vowels / sentence:       30.29
Sentences with fewer than 10 vowels:   133 267
Sentences with fewer than 20 vowels:   245 635
Sentences with fewer than 30 vowels:   352 341
Sentences with fewer than 40 vowels:   443 061
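
The vowel-based length filter can be sketched as follows. This is an illustration in Python under stated assumptions, not the script used for the figures above: the vowel set (a, e, i, o, u, y, æ, ø, å) is an assumption, and since the corpus was converted to ASCII, the real data may represent æ, ø and å differently. The names VOWELS, count_vowels and sentences_below are hypothetical.

```python
# Assumed Danish vowel letters; the ASCII-converted corpus may encode
# æ, ø and å in some other way that is not described in the source.
VOWELS = set("aeiouyæøå")


def count_vowels(sentence):
    """Number of vowel letters in a sentence (case-insensitive)."""
    return sum(1 for ch in sentence.lower() if ch in VOWELS)


def sentences_below(sentences, limits=(10, 20, 30, 40)):
    """For each limit, count the sentences with fewer vowels than that limit."""
    counts = [count_vowels(s) for s in sentences]
    return {limit: sum(1 for c in counts if c < limit) for limit in limits}


# Made-up sample sentences.
sample = ["Jeg bor i Århus.", "Rødgrød med fløde"]
print([count_vowels(s) for s in sample])
print(sentences_below(sample, limits=(3, 6)))
```

Because each count is cumulative, a maximum length of 30 vowels corresponds directly to the "fewer than 30" row.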
The pronunciation dictionaries should cover most of the words above a certain frequency in the
text corpus. This table shows the number of words with exactly one occurrence in the corpus (f1),
with two or more occurrences (f2), and so on.
Words == f1:       211 429
Words >= f2:       176 448
Words >= f5:        83 518
Words >= f10:       50 161
Words >= f25:       25 077
Words >= f50:       14 479
Words >= f100:       8 149
Words >= f250:       3 504
Words >= f500:       1 790
Words >= f1000:        844
Words >= f5000:        182
Words >= f10000:        91
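
A frequency table of this kind can be computed from a word list as sketched below. This is a hedged Python illustration, not the original script. The function name frequency_bands is hypothetical, and counting is case-sensitive here, which is an assumption suggested by the fact that the first two rows add up to the number of unique words (211 429 + 176 448 = 387 877).

```python
from collections import Counter


def frequency_bands(words, thresholds=(2, 5, 10)):
    """Count word types occurring exactly once, and at least t times per threshold."""
    freq = Counter(words)  # word type -> number of occurrences
    result = {"== f1": sum(1 for n in freq.values() if n == 1)}
    for t in thresholds:
        result[f">= f{t}"] = sum(1 for n in freq.values() if n >= t)
    return result


# Made-up sample: a occurs 5 times, b twice, c and d once each.
sample = "a a a a a b b c d".split()
bands = frequency_bands(sample)
print(bands)
```

The "== f1" and ">= f2" counts always sum to the number of distinct word types, which gives a quick sanity check on any such table.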
Almost 12 800 of the words in the Danish text corpus exist in the Swedish name dictionary. About
7 000 of them occur 1-5 times in the corpus and about 6 000 occur six or more times:
Words in TPB name lexicon:           12 779
Words in TPB name lexicon (f 1-5):    6 925
Words in TPB name lexicon (f 6-):     5 854
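
The overlap with a name lexicon, split by corpus frequency, might be computed as in this sketch. It is an illustration only: the function name lexicon_overlap, the exact-match (case-sensitive) lookup and the sample data are assumptions, and the format of the TPB name lexicon is not described in the source.

```python
from collections import Counter


def lexicon_overlap(words, lexicon, split_at=5):
    """Count corpus word types found in the lexicon, split by corpus frequency.

    'low' covers types occurring 1..split_at times, 'high' covers the rest.
    Matching is exact and case-sensitive (an assumption).
    """
    freq = Counter(words)
    hits = {w: n for w, n in freq.items() if w in lexicon}
    low = sum(1 for n in hits.values() if n <= split_at)
    return {"total": len(hits), "low": low, "high": len(hits) - low}


# Made-up sample data.
words = ["Anna"] * 6 + ["Lund"] * 2 + ["hus"] * 3
lexicon = {"Anna", "Lund", "Berg"}
overlap = lexicon_overlap(words, lexicon)
print(overlap)
```

The two frequency bands partition the matched words, just as 6 925 + 5 854 = 12 779 in the table above.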