COMPLEXITY „H“ – program for analysis of human texts

Michaela Zemková – Charles University in Prague, Faculty of Science – Department of Philosophy and History of Science
Daniel Zahradník – Czech University of Life Sciences, Prague, Faculty of Forestry and Wood Sciences – Department of Forest Management

Index
1. Basic information about the programme
1.1 Procedure of the computation
1.2 Loading the data
2. Computations
2.1 Decomposition of the text to n-grams (n-grams)
2.1.1 List of n-grams of chosen length
2.1.2 Simple listing of the vocabulary
2.2 Linguistic characteristics
2.2.1 Linguistic complexity (CL)
2.2.2 Linguistic complexity for whole words (CL – whole words)
2.2.3 Trifonov's Complexity (CT)
2.2.4 Trifonov's Complexity for whole words (CT – whole words)
2.2.5 Shannon's entropy (CE)
2.2.6 Markovian entropy model (CM)
2.2.7 Transition matrix
3. Other functions
3.1 The conversion of the text to a sequence of consonants and vowels
3.2 Random selection of sample
3.3 Comparison with random model – the Monte Carlo simulations
4. Example of computation

1. Basic information about the programme

The “Complexity“ programme was developed as an extension of the program Complexity - Software tools for linguistic based analysis of genetic sequences. In addition to calculations that apply to both genetic sequences and human texts (entropy, linguistic complexity, transition matrix, n-gram based text decomposition and so on), it offers functions made especially for the analysis of human texts: decomposition of the text into its vocabulary and conversion of the text into a sequence of consonants and vowels. Sequences in txt format can be used as input data.

At present the programme provides the following procedures:
- Linguistic complexity (CL)
  - for single letters
  - for whole words (CL – whole words)
- Linguistic complexity suggested by E.N. Trifonov (CT)
  - for single letters
  - for whole words (CT – whole words)
- Shannon's entropy (CE)
- Markov's model of entropy (CM)
- Transition matrix (one step of the Markov entropy procedure)
- n-gram based text decomposition (n-grams)

Other functions:
- Comparison with a random model using Monte Carlo simulations
- Random selection of a sample of given size from the investigated text
- The conversion of text to a sequence of consonants and vowels

1.1 Procedure of the computation

The procedure can be divided into the following steps:
- Loading the data /see chapter 1.2/
- Setting the length of an n-gram (the item „n-gram“ on the main form) and the window size /see 2.1/
- Special settings in the menu, such as selection of a random sample or conversion of the text into a sequence of consonants and vowels /see 3.1, 3.2/
- Activation of the random model on the main form (if requested) /see 3.3/
- Calculation (main menu – computation) /see 2./

The calculation can also be started directly from the form.

1.2 Loading the data

Data are loaded from the main menu by File → open. Sequences in the txt format can be loaded. To load several files as one dataset (for example several chapters of a text), use the option “Open all files in folder“: click on the first file in the folder and all other files are loaded automatically.

2. Computations

All calculations are available in the main menu under computation. There is only one adjustable item, the window size, which generally reflects the maximal length of “words”; it is set wherever the calculation allows it to vary. The alphabet size is recognized automatically.

2.1 Decomposition of the text to n-grams (n-grams)

The method suggested by Shannon (1948) is based on decomposition of the text into so-called n-grams, the words of length n. The program shows the list of n-grams of a given length, which is set by hand in the form. Each unique n-gram is listed with its frequency; the list is arranged in descending order.
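The n-gram listing can be sketched in a few lines of Python. This is only an illustration, not the program's own code: the function names `letter_ngrams` and `word_ngrams` are hypothetical, and the sketch assumes that letter n-grams are taken over the upper-cased letters only (blanks and punctuation dropped), that words are units separated by blanks, and that "descending order" means descending frequency.

```python
from collections import Counter


def letter_ngrams(text, n):
    """Return (n-gram, frequency) pairs for single letters, most frequent first.

    Assumption: only letters count; they are upper-cased, and blanks and
    punctuation are removed before the text is cut into n-grams.
    """
    letters = "".join(ch for ch in text.upper() if ch.isalpha())
    grams = [letters[i:i + n] for i in range(len(letters) - n + 1)]
    return Counter(grams).most_common()


def word_ngrams(text, n):
    """Return (collocation, frequency) pairs for whole words, most frequent first.

    Assumption: a word is any unit separated by blanks; punctuation
    attached to a word is stripped.
    """
    words = ["".join(ch for ch in w.upper() if ch.isalpha()) for w in text.split()]
    words = [w for w in words if w]  # drop tokens that held no letters
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return Counter(grams).most_common()


if __name__ == "__main__":
    # Five most frequent two-letter n-grams of a short sample.
    for gram, count in letter_ngrams("It was brillig, and the slithy toves", 2)[:5]:
        print(gram, count)
```

Calling `word_ngrams(text, 1)` gives a plain vocabulary listing with frequencies, which mirrors the use of the whole-word variant with n equal to 1.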

Generally, an n-gram is a substring of length n of the sequence whose occurrence is given by a certain probability. In this sense it is used in language modelling, where n usually means the number of words in a collocation (a connection of words) instead of the number of letters, and the probability is thus connected with the collocation. In a broader sense, an n-gram is any sequence of n symbols in the text. If we take a whole word as a symbol, then, for n > 1, an n-gram is a connection of n words.

2.1.1 List of n-grams of chosen length

The length of n-grams is set manually in the main form using the item n-gram. The program offers a version for single letters (the function n-grams) and a version for whole words (the function n-grams-whole words, where n corresponds to the number of words in the collocation).

2.1.2 Simple listing of the vocabulary

As is clear from the above description, the listing of the vocabulary used in a given text is obtained using the function n-grams-whole words with the value of n equal to 1 (set in the item n-gram).

2.2 Linguistic characteristics

Evaluation of linguistic characteristics as they were defined for genetic sequences (e.g. strings of nucleotides or amino acids) is somewhat more problematic when we come to human texts. For genetic sequences we supposed that each letter is a symbol carrying some meaning and that complexity is a matter of the combinatorics of letters. This is not the case in natural human languages, where real complexity (e.g. richness of vocabulary, how diversified the language is) depends on the combinatorics of units carrying lexical meaning. Such a unit is usually a word. That is why we implemented two variants in our software for linguistic analysis of human texts: the first, where a letter is taken to be a meaningful symbol (the original concept used for genetic sequences), and the second, where the whole word is the meaningful unit. This concept is, of course, simplified.
On a closer look it is obvious that the unit carrying meaning is probably something between a letter (or syllable) and a word. According to linguistic theories such a unit is the morpheme, the smallest linguistic unit that has semantic meaning. (Words themselves are often compounds of other words and contain affixes such as prefixes and so on.) Implementing an algorithm based on recognition of morphemes would be difficult. That is why we offer only the simplified versions for letters and for whole words. For the same reason we ignore the fact that written language is not a phonetic transcription. An ideal model should work with texts transcribed from phonetic form according to universal rules, which is, as a matter of fact, an almost impossible requirement.

The output of the evaluation of linguistic characteristics is a numerical value and a graph. The x axis shows the window size and the y axis shows the values of the given linguistic characteristic. Further options are available by right-clicking the graph and choosing a command (such as extract of values, steps of the computation or saving the image). If the item n-gram is set, an extract of n-grams of the given length is offered.

2.2.1 Linguistic complexity (CL)

The linguistic complexity suggested by Kolmogorov is the simplest concept of measuring vocabulary usage in a given sequence:

$$ CL = \frac{\sum_{i} V_i}{\sum_{i} V_{\max,i}} $$

where V_i is the number of different words of length i and V_max,i is the maximal possible number of words of length i; the sums run over the word lengths i considered. This simple concept of complexity therefore reflects how many words of length i, out of the possible vocabulary, occur in the given text. The composition of i-length words is reflected as follows:

Tab. 1.
Length of word               2     3
Number of i-length words     10    40
Maximal vocabulary usage     50    50
CL = (10+40)/(50+50) = 0.5

The output of the measure is a value of CL and a graph.
Adjustable item: window size (preset to 20; higher values also produce higher CL). The item n-gram has no effect on the computation.
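
The arithmetic of Tab. 1 can be reproduced with a short Python sketch. It is not the program's implementation: the names `cl_from_counts` and `linguistic_complexity` are hypothetical, and the text-based variant rests on two assumptions that the manual does not state for CL, namely that only upper-cased letters are counted and that the maximal vocabulary for length i is the smaller of (alphabet size)^i and the number of positions in the text.

```python
def cl_from_counts(observed, maximal):
    """CL as the sum of observed vocabulary sizes over the sum of maximal ones."""
    return sum(observed) / sum(maximal)


def linguistic_complexity(text, window, alphabet_size=26):
    """Sketch of CL for single letters, for word lengths 1..window.

    Assumptions (not taken from the manual): letters only, upper-cased;
    maximal vocabulary = min(alphabet_size ** i, number of positions).
    """
    letters = "".join(ch for ch in text.upper() if ch.isalpha())
    observed, maximal = [], []
    for i in range(1, window + 1):
        positions = len(letters) - i + 1
        if positions <= 0:
            break  # the text is shorter than the word length
        observed.append(len({letters[j:j + i] for j in range(positions)}))
        maximal.append(min(alphabet_size ** i, positions))
    return cl_from_counts(observed, maximal) if maximal else 0.0


if __name__ == "__main__":
    print(cl_from_counts([10, 40], [50, 50]))  # the numbers of Tab. 1
    print(linguistic_complexity("It was brillig, and the slithy toves", 3))
```

Keeping the counting step separate from the ratio makes it easy to swap in a different rule for the maximal vocabulary if the program uses another one.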

2.2.2 Linguistic complexity for whole words (CL – whole words)

The principle of computation is the same as for CL described above, but the whole word stands here as one unit instead of a single letter. (A word means a unit delimited by blanks.) The output of the computation is a numerical value of CL and the graph (x axis = window size, y axis = CL). In addition, there is a list of words, or of collocations if the value of n is more than 1. The adjustable items are thus the window size and also the n-gram. Both of them have some influence on the computation. Graphs show that window sizes bigger than 20 do not affect CL much.

2.2.3 Trifonov's Complexity (CT)

The complexity suggested by E.N. Trifonov reflects the richness (diversification) of the vocabulary. The concept takes into account both frequency and composition. Complexity defined by Trifonov (1990) for any sequence of length L is given by the product:

$$ CT = \prod_{i=1}^{n} U_i $$

where U_i is the ratio of the actual to the maximal vocabulary size for word length i. The algorithm runs in steps for i = 1...n and does not take all combinations of words in one “package”; it reflects the distribution of all words of length i. The final value can be compared with the CT value of a text of different length. The algorithm is sensitive to the presence of repetitive strings, which cause a rapid decrease of the CT value. A suspiciously low CT value could indicate the presence of copies of longer parts of the text.

Example:
Tab. 2 – The difference between CL and CT
Length of word               2     3
Number of i-length words     25    25
Maximal vocabulary usage     50    50
CL = (25+25)/(50+50) = 0.5
CT = 25/50 * 25/50 = 0.25

2.2.4 Trifonov's Complexity for whole words (CT – whole words)

As with CL, the CT – whole words algorithm treats the whole word as one unit: each factor U_i (actual vocabulary usage) is the ratio of the number of unique words or collocations to the total number of words or collocations in the given text, which for two-word collocations is the number of words minus 1 (see the example in chapter 4).
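
The product of vocabulary usages can be sketched in Python for both variants. The function name `trifonov_ct` and its parameters are hypothetical, not part of the program. The sketch assumes, following the example in chapter 4, that the maximal vocabulary for length i is the number of positions in the text, capped by (alphabet size)^i when single letters are analysed; the word list below is the chapter 4 text as tokenised there, with WABEALL as one unit.

```python
from math import prod


def trifonov_ct(symbols, window, alphabet_size=None):
    """Product of vocabulary usages U_i for i = 1..window.

    symbols       -- a string of letters or a list of words
    alphabet_size -- if given, the maximal vocabulary for length i is
                     min(alphabet_size ** i, number of positions);
                     if None (whole words), it is the number of positions.
    """
    symbols = list(symbols)
    factors = []
    for i in range(1, window + 1):
        positions = len(symbols) - i + 1
        if positions <= 0:
            break  # no n-grams of this length fit into the text
        unique = len({tuple(symbols[j:j + i]) for j in range(positions)})
        maximal = positions
        if alphabet_size is not None:
            maximal = min(alphabet_size ** i, positions)
        factors.append(unique / maximal)
    return prod(factors)


if __name__ == "__main__":
    words = ("IT WAS BRILLIG AND THE SLITHY TOVES DID GYRE AND GIMBLE IN THE "
             "WABEALL MIMSY WERE THE BOROGOVES AND THE MOME RATHS OUTGRABE").split()
    # 18/23 * 21/22, the whole-word value reported in chapter 4
    print(round(trifonov_ct(words, window=2), 6))  # 0.747036
```

Because every collocation of three or more words is unique in this short text, larger window sizes multiply the result by factors of 1 and leave it unchanged.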

Outputs: CT value, graph of CT and graph of vocabulary usage (available by right-clicking the graph and choosing the vocabulary usage command).
Adjustable item: window size, preset to 20; however, values higher than 3 have an insignificant effect on the result. Setting the item n-gram has no effect on the resulting value of CT; it influences only the list of words or collocations for the given n.
For literary texts with a length of thousands of characters, the results of the CT – whole words procedure are usually several orders of magnitude higher than CT for single letters /see chapter 2.2.3/.

2.2.5 Shannon's entropy (CE)

Shannon's entropy measures the information potentially contained in a message without considering how much real information the message actually has. A sequence that repeats one character (AAAA...) has zero entropy. In a sequence where all possible symbols occur evenly, the entropy comes near to 1. Shannon's information entropy is defined as:

$$ H' = -\sum_{i=1}^{K} \frac{n_i}{N} \log_K \frac{n_i}{N} $$

where N is the length of the sequence (whole text), n_i is the frequency of the i-th symbol in the text and K is the alphabet size.

Shannon's information entropy, however, does not reflect the real composition of the words; it reflects the potential amount of information in the sequence, i.e. the frequency of symbols only. The sequence AAAACCCCCC has the same entropy as the sequence ACACACCACC: H' = -(4/10 * log(4/10) + 6/10 * log(6/10)). Two symbols are used in a sequence of length N = 10 (A occurs 4 times and C occurs 6 times), thus n1 = 4, n2 = 6.

2.2.6 Markovian entropy model (CM)

The Markovian model is an alternative entropy measure that also takes into account the real combinations and relations of letters within the text (the inner composition of the vocabulary). The actual value of CM changes with the given length of words. In its definition, M is the number of possible words, M = K^m, where K is the alphabet size, m is the window size (maximal length of words) and N is the length of the whole text.
The length of words varies from 1 to m. The model contains the probability of combinations of letters, which is reflected in the transition matrix.
Output: result value and graph
1) only one value for words of length 2
2) the extract of all values for words of length 1 to m, available by right-clicking the graph and choosing the write values command
Adjustable items: window size (the calculation of course takes longer for higher numbers). The item n-gram has no effect on the results; it only provides an extract of n-grams of the given length n.

2.2.7 Transition matrix

The matrix is one of the steps in the CM measure; it simply shows how many times the various combinations of letters occur. Thus it provides transparent information about preferences between individual letters.
Example of output:
Tab. 3 – Representation of the text using the transition matrix. (The poem of L. Carroll – Jabberwock, length of the text – 206 characters; the most frequent combination of letters is, as is typical for the English language, “TH”.) In this case, the window size has no effect.

3. Other functions

3.1 The conversion of the text to a sequence of consonants and vowels

The function is activated in the main menu by edit → convert phones to consonants/vowels. Any list of n-grams produced afterwards is already converted into a sequence of C/V. The ratio of consonants and vowels in the whole analysed text is obtained simply by setting n=1. Syllabic consonants such as L and R are not treated specially; they are always considered consonants.

3.2 Random selection of sample

The function provides a random selection of given length from the sample. It is available in the main menu in settings and has to be set each time after the data are loaded.

3.3 Comparison with random model – the Monte Carlo simulations

The comparison with a randomly generated model is possible for the linguistic characteristics.
If the random model is activated in the main form, it is always shown in the graph as a blue line and represents a random sequence with the same parameters (alphabet size, length of the text, window size) as the investigated sequence. Please note that activating the random model requires a longer computation time!

4. Example of computation

Let us take an example: the poem Jabberwock by Lewis Carroll, of length 122 characters:

It was brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe.

Linguistic characteristics:

CT for single letters: There are 18 different one-character words in our text (I, T, W, A, S…), whereas the maximal possible vocabulary size is 26. Thus, the vocabulary usage is U1 = 18/26 = 0.692308. Our text contains 71 different (unique) two-character words (IT, TW, WA, AS…). The maximal number of two-character words is 26^2 = 676, but the text is only 100 characters long, so the maximal vocabulary size is 99 in our case. Thus U2 = 71/99 = 0.717172. There are 88 different three-character words out of a maximal possible vocabulary size of 98, and so on. The final result is the product of the vocabulary usages Ui:
CT = 18/26 * 71/99 * 88/98 * 93/97 * 94/96 * 94/95 * 94/94 * 93/93 = 0.414144

CT for whole words: Instead of counting unique n-letter combinations we count unique connections of words (collocations), as we can see in Table 1. In this case CT = 0.747036. There are only 2 factors Ui (for n=1 and n=2), because all n-grams with n>2 are unique, so Ui would be equal to 1.

Tab. 1.
Word length              1              2
Unique combinations      18             21
Max. vocabulary size     23             22
Vocabulary usage Ui      0.782608696    0.954545455
CT                       0.747036

(U = unique, marked at the first occurrence; - = repeated)
Word         U    Collocation      U
IT           U    ITWAS            U
WAS          U    WASBRILLIG       U
BRILLIG      U    BRILLIGAND       U
AND          U    ANDTHE           U
THE          U    THESLITHY        U
SLITHY       U    SLITHYTOVES      U
TOVES        U    TOVESDID         U
DID          U    DIDGYRE          U
GYRE         U    GYREAND          U
AND          -    ANDGIMBLE        U
GIMBLE       U    GIMBLEIN         U
IN           U    INTHE            U
THE          -    THEWABEALL       U
WABEALL      U    WABEALLMIMSY     U
MIMSY        U    MIMSYWERE        U
WERE         U    WERETHE          U
THE          -    THEBOROGOVES     U
BOROGOVES    U    BOROGOVESAND     U
AND          -    ANDTHE           -
THE          -    THEMOME          U
MOME         U    MOMERATHS        U
RATHS        U    RATHSOUTGRABE    U
OUTGRABE     U

The activation of the random model: on the main form, item Monte carlo. The parameters of the random model are preset; for a standard computation it is not necessary to change them.

Example: the calculation of CT. The x axis shows the length of words (i.e. the window size) and the y axis shows the value of complexity CT (graph 1) or the value of vocabulary usage (graph 2). The random model is marked in blue; thinner lines mark the 95% range of probability.

Graph 1: Complexity CT. Graph 2: Vocabulary usage. (Both graphs: x axis from 2 to 18, y axis from 0 to 1.)

The ratio of consonants and vowels:
→ First the data are loaded, then the conversion of the text to a sequence of C/V is done (edit → convert phones to consonants/vowels)
→ Setting the length of an n-gram to n=1 (on the main form, item „n-gram“)
→ Calculation using the n-grams function
The result:
C: 132
V: 74
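
The same steps (conversion to C/V, then counting n-grams with n=1) can be sketched in Python. The names `VOWELS`, `to_consonants_vowels` and `cv_counts` are hypothetical. The manual does not list which letters the program treats as vowels, so the set A, E, I, O, U is an assumption, and the counts produced by this sketch need not match the figures reported above.

```python
# Assumption: the program's own vowel set is not documented; every other
# letter (including L, R and Y) is treated as a consonant here.
VOWELS = set("AEIOU")


def to_consonants_vowels(text):
    """Replace every letter by C or V; blanks and punctuation are dropped."""
    letters = (ch for ch in text.upper() if ch.isalpha())
    return "".join("V" if ch in VOWELS else "C" for ch in letters)


def cv_counts(text):
    """Count consonants and vowels, i.e. the 1-grams of the converted text."""
    converted = to_consonants_vowels(text)
    return {"C": converted.count("C"), "V": converted.count("V")}


if __name__ == "__main__":
    sample = "It was brillig, and the slithy toves"
    print(to_consonants_vowels(sample))
    print(cv_counts(sample))
```

Longer n-grams of the converted string show typical consonant and vowel patterns of the text in the same way as the program's n-grams function after conversion.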