COMPLEXITY „H“ – a program for the analysis of human texts
Michaela Zemková – Charles University in Prague, Faculty of Science, Department of Philosophy and History of Science
Daniel Zahradník – Czech University of Life Sciences, Prague, Faculty of Forestry and Wood Sciences, Department of Forest Management
Index
1. Basic information about the programme
1.1 Procedure of the computation
1.2 Loading the data
2. Computations
2.1 Decomposition of the text to n-grams (n-grams)
2.1.1 List of n-grams of chosen length
2.1.2 Simple listing of the vocabulary
2.2 Linguistic characteristics
2.2.1 Linguistic complexity (CL)
2.2.2 Linguistic complexity for whole words (CL – whole words)
2.2.3 Trifonov's complexity (CT)
2.2.4 Trifonov's complexity for whole words (CT – whole words)
2.2.5 Shannon's entropy (CE)
2.2.6 Markovian entropy model (CM)
2.2.7 Transition matrix
3. Other functions
3.1 The conversion of the text to a sequence of consonants and vowels
3.2 Random selection of a sample
3.3 Comparison with a random model – the Monte Carlo simulations
4. Example of computation
1. Basic information about the programme
The Complexity „H“ programme was developed as an extension of the program Complexity – software tools for linguistics-based analysis of genetic sequences. In addition to calculations that can be applied both to genetic sequences and to human texts, such as entropy, linguistic complexity, the transition matrix or n-gram based text decomposition, it offers several functions intended specifically for the analysis of human texts (decomposition of the text into a vocabulary, conversion of the text into a sequence of consonants and vowels).
Sequences in txt format are used as input data.
At present the programme provides the following procedures:
• Linguistic complexity (CL)
- for single letters
- for whole words (CL – whole words)
• Linguistic complexity suggested by E. N. Trifonov (CT)
- for single letters
- for whole words (CT – whole words)
• Shannon's entropy (CE)
• Markov's model of entropy (CM)
• Transition matrix / one step of the Markov entropy procedure
• n-gram based text decomposition (n-grams)
Other functions:
• Comparison with a random model using Monte Carlo simulations
• Random selection of a sample of given size from the investigated text
• Conversion of the text into a sequence of consonants and vowels
1.1 Procedure of the computation
The procedure can be divided into the following steps:
• Loading the data /see chapter 1.2/
• Setting the length of an n-gram (the item „n-gram“ on the main form) and the window size /see 2.1/
• Special settings in the menu, such as selection of a random sample or conversion of the text into a sequence of consonants and vowels /see 3.1, 3.2/
• Activation of the random model on the main form (if requested) /see 3.3/
• Calculation (main menu – Computation) /see 2/
It is possible to start the calculation directly from the following form:
1.2 Loading the data
Data are loaded in the main menu by File → Open. It is possible to load sequences in the txt format. To load several files as one dataset (for example several chapters of a text), there is the option "Open all files in folder": just click on the first file in the folder and all other files will be loaded automatically.
2. Computations
All calculations are available in the main menu under Computation. There is only one adjustable item – the window size, which generally reflects the maximal length of "words"; it can be adjusted where applicable. The alphabet size is recognized automatically.
2.1 Decomposition of the text to n-grams (n-grams)
The method suggested by Shannon (1948) is based on decomposition of the text into so-called n-grams – words of length n. The program shows the list of n-grams of a given length, which is set by hand in the form. For unique n-grams the frequency is reported; the list is arranged in descending order of frequency.
Generally, an n-gram is a substring of length n of the sequence, occurring with a certain probability. In this sense it is used in language modelling, where n usually means the number of words in a collocation (a connection of words) rather than the number of letters, and the probability is thus attached to the collocation.
In a broader sense, an n-gram is any sequence of n symbols in the text. If we take a whole word as a symbol, then, for n > 1, an n-gram is a connection of n words.
2.1.1 List of n-grams of chosen length
The length of the n-grams is set manually in the main form using the item n-gram. The program offers a version for single letters (the function n-grams) and for whole words (the function n-grams – whole words, where n corresponds to the number of words in the collocation).
2.1.2 Simple listing of the vocabulary
As is clear from the description above, the listing of the vocabulary used in a given text is obtained with the function n-grams – whole words and the value of n equal to 1 (set in the item n-gram).
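The manual contains no code; the following Python sketch only illustrates how such a decomposition can be computed, with an illustrative example string – it is not part of the Complexity „H“ software:

from collections import Counter

def ngram_list(symbols, n):
    # return (n-gram, frequency) pairs, sorted by descending frequency
    grams = [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]
    return Counter(grams).most_common()

text = "it was brillig and the slithy toves did gyre and gimble in the wabe"

# n-grams of single letters (spaces ignored), here of length n = 2
letters = [c for c in text if c.isalpha()]
print(ngram_list(letters, 2)[:5])

# n-grams of whole words; n = 1 gives the simple listing of the vocabulary
words = text.split()
print(ngram_list(words, 1))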
2.2 Linguistic characteristics
Evaluation of the linguistic characteristics as they were defined for genetic sequences (i.e. strings of nucleotides or amino acids) becomes somewhat more problematic when we come to human texts. In the case of genetic sequences we assumed that each letter is a symbol carrying some meaning, so that complexity is a matter of the combinatorics of letters. This is not the case in human natural languages, where the real complexity (e.g. the richness of the vocabulary, how diversified the language is) depends on the combinatorics of units carrying lexical meaning. Such a unit is usually a word. That is why we implemented two variants in our software for linguistic analysis of human texts: the first one, where a letter is supposed to be a meaningful symbol (the original concept used for genetic sequences), and the second one, where the whole word is the meaningful unit.
This concept is, of course, simplified. A closer look shows that the unit carrying meaning is probably something between a letter (or syllable) and a word. According to linguistic theories such a unit is the morpheme – the smallest linguistic unit that has semantic meaning. (Words themselves are often compounds of other words and contain affixes, prefixes and so on.) Implementation of an algorithm based on recognition of morphemes would be difficult, which is why we offer only the simplified choice of letters vs. whole words.
For the same reason we ignore the fact that written language is not a phonetic transcription. An ideal model would work with texts transcribed from their phonetic form according to universal rules, which is, as a matter of fact, an almost impossible requirement.
The output of the evaluation of a linguistic characteristic is a numerical value and a graph. The x axis shows the window size and the y axis the values of the given linguistic characteristic.
Further options for the graph are available by right-clicking the graph and choosing a command (such as an extract of the values, the steps of the computation, or saving the image). If the item n-gram is set, an extract of the n-grams of the given length is also offered.
2.2.1 Linguistic complexity (CL)
The linguistic complexity suggested by Kolmogorov is the simplest concept of measuring vocabulary usage in a given sequence:
CL = (V1 + V2 + … + Vm) / (Vmax,1 + Vmax,2 + … + Vmax,m)
where Vi is the number of different words of length i, Vmax,i is the maximal possible number of words of length i, and m is the window size. This simple concept of complexity therefore reflects how many words of length i out of the possible vocabulary size occur in the given text. The computation for i-length words is illustrated as follows:
Tab. 1
Length of word i    Number of i-length words Vi    Maximal vocabulary size Vmax,i
2                   10                             50
3                   40                             50
CL = (10 + 40) / (50 + 50) = 0,5
The output of the measure is the value of CL and a graph.
Adjustable item: window size (preset to 20; higher values also produce higher CL).
The item n-gram has no effect on the computation.
2.2.2 Linguistic complexity for whole words (CL – whole words)
The principle of the computation is the same as for CL described above, but a whole word stands here as one unit instead of a single letter (a word is a unit delimited by blanks).
The output of the computation is the numerical value of CL and the graph (x axis = window size, y axis = CL). In addition, there is also a list of words (or collocations) if the value of n is greater than 1.
Adjustable items are thus the window size and also the n-gram; both of them have some influence on the computation. The graphs show that window sizes larger than 20 do not affect CL much.
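A minimal sketch of the CL computation, usable both for single letters and for whole words (the input is simply a list of symbols); the function name, the 26-letter alphabet and the treatment of Vmax,i as the smaller of K^i and the number of word positions are assumptions of this illustration, not taken from the programme:

def linguistic_complexity(symbols, window, alphabet_size):
    # CL = sum of actual vocabulary sizes / sum of maximal vocabulary sizes
    # for word lengths i = 1..window (compare Tab. 1)
    actual, maximal = 0, 0
    for i in range(1, window + 1):
        grams = {tuple(symbols[j:j + i]) for j in range(len(symbols) - i + 1)}
        actual += len(grams)                                      # Vi
        maximal += min(alphabet_size ** i, len(symbols) - i + 1)  # Vmax,i
    return actual / maximal

text = "it was brillig and the slithy toves did gyre and gimble in the wabe"

letters = [c for c in text if c.isalpha()]
print(linguistic_complexity(letters, 20, 26))     # CL for single letters

words = text.split()                              # CL - whole words: the word
print(linguistic_complexity(words, 3, 10**9))     # vocabulary is practically
                                                  # unbounded, so only the text
                                                  # length limits Vmax,i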
2.2.3 Trifonov's complexity (CT)
The complexity suggested by E. N. Trifonov reflects the richness of the vocabulary (its diversification). The concept takes into account both frequency and composition. The complexity defined by Trifonov (1990) for any sequence of length L is given by the product
CT = U1 · U2 · … · Un
where Ui corresponds to the ratio of the actual to the maximal vocabulary size for word length i. The algorithm runs in steps for i = 1...n; it does not take all combinations of words in one "package" but reflects the distribution of the words of each length i.
The final value can be compared with the CT value of a text of different length. The algorithm is sensitive to the presence of repetitive strings, which cause a rapid decrease of the CT value. A suspiciously low CT value may therefore indicate the presence of copies of longer parts of the text.
Example:
Tab. 2 – The difference between CL and CT
Length of word i    Number of i-length words    Maximal vocabulary size
2                   25                          50
3                   25                          50
CL = (25 + 25) / (50 + 50) = 0,5
CT = 25/50 · 25/50 = 0,25
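A corresponding sketch of the CT product over a generic list of symbols (letters or whole words); the function name, the example string and the alphabet are only illustrative:

def trifonov_complexity(symbols, window, alphabet_size):
    # CT = U1 * U2 * ... * Un, where Ui is the ratio of the actual
    # to the maximal vocabulary size for word length i
    ct = 1.0
    for i in range(1, window + 1):
        grams = {tuple(symbols[j:j + i]) for j in range(len(symbols) - i + 1)}
        maximal = min(alphabet_size ** i, len(symbols) - i + 1)
        ct *= len(grams) / maximal        # factor Ui
    return ct

letters = [c for c in "it was brillig and the slithy toves" if c.isalpha()]
print(trifonov_complexity(letters, 8, 26))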
2.2.4 Trifonov's complexity for whole words (CT – whole words)
Similarly to CL, the CT – whole words algorithm works in such a way that the factors Ui (actual vocabulary usage) are the ratios of the number of unique words or collocations of length i to the total number of such collocations occurring in the text (see the example in chapter 4).
Outputs: the CT value, a graph of CT and of the vocabulary usage (available by right-clicking the graph and choosing the vocabulary usage command).
Adjustable item: the window size is preset to 20; however, values higher than 3 have an insignificant effect on the result. Setting the item n-gram has no effect on the resulting CT value; it influences only the list of words/collocations according to the given n.
For literary texts a few thousand characters long, the results of the CT – whole words procedure are usually higher by several orders of magnitude than CT for single letters /see chapter 2.2.3/.
2.2.5 Shannon's entropy (CE)
Shannon's entropy measures the information potentially contained in a message without considering how much real information the message actually carries. A sequence repeating one character (AAAA...) has zero entropy; in a sequence where all possible symbols occur evenly, the entropy comes close to 1. Shannon's information entropy is defined as
CE = −Σi (ni/N) · log(ni/N) / log K
where N is the length of the sequence (the whole text), ni is the frequency of the i-th symbol in the text and K is the alphabet size; the division by log K normalizes the value to the range from 0 to 1.
Shannon's information entropy, however, does not reflect the real composition of the words; it reflects only the potential amount of information in the sequence, i.e. the frequencies of the symbols. The sequence AAAACCCCCC therefore has the same entropy as the sequence ACACACCACA: H' = −(4/10 · log(4/10) + 6/10 · log(6/10)) – there are two symbols (A occurs 4 times and C occurs 6 times) in a sequence of length N = 10, thus n1 = 4, n2 = 6.
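A minimal sketch of the normalized entropy described above; the normalization by log K (so that an even distribution gives a value close to 1) is inferred from the description, and the function name is illustrative:

import math
from collections import Counter

def shannon_entropy(symbols, alphabet_size):
    # CE: frequency-based entropy, normalized by log(K) to the range 0..1
    n = len(symbols)
    counts = Counter(symbols)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(alphabet_size)

# both sequences from the example give the same value (about 0.971),
# because only the symbol frequencies matter (4 x A, 6 x C)
print(shannon_entropy("AAAACCCCCC", 2))
print(shannon_entropy("ACACACCACA", 2))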
2.2.6 Markovian entropy model (CM)
The Markovian model is an alternative entropy measure that also takes into account the real combinations and relations of letters within the text (the inner composition of the vocabulary). The actual value of CM changes with the given length of words.
In the model, M denotes the number of possible words, M = K^m, where K is the alphabet size, m is the window size (the maximal length of words) and N is the length of the whole text. The length of words varies from 1 to m. The model works with the probabilities of combinations of letters, which are reflected in the transition matrix.
Output: the resulting value and a graph
1) a single value for the given length of words,
2) an extract of all the values for word lengths 1 to m, available by right-clicking the graph and choosing the write values command.
Adjustable items: window size (the calculation of course takes longer for higher values); n-gram has no effect on the results, it only provides an extract of the n-grams of the given length n.
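The following sketch only illustrates the underlying idea – an entropy that takes the transitions between letters into account – and does not claim to reproduce the exact CM formula; it computes a first-order conditional entropy from the same pair counts that enter the transition matrix of section 2.2.7, and the function name is illustrative:

import math
from collections import Counter

def markov_conditional_entropy(symbols):
    # first-order conditional entropy H(next letter | current letter),
    # computed from letter-pair counts; an illustration of the idea only,
    # not necessarily the CM formula used by the programme
    pairs = Counter(zip(symbols, symbols[1:]))
    firsts = Counter(symbols[:-1])
    total = sum(pairs.values())
    h = 0.0
    for (a, b), c in pairs.items():
        h -= (c / total) * math.log(c / firsts[a])   # p(a,b) * log p(b|a)
    return h

print(markov_conditional_entropy("ACACACCACA"))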
2.2.7 Transition matrix
The matrix is one of the steps of the CM measure; it simply records how many times the various combinations of two letters occur. It thus provides transparent information about the preferred transitions between letters.
Example of output:
Tab. 3 – Representation of the text using the transition matrix (the poem of L. Carroll – Jabberwock, text length 206 characters; the most frequent combination of letters is, as is typical for the English language, "TH").
(In this case, the window size has no effect.)
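A sketch of how such a matrix of letter-pair counts can be assembled and printed; the example string and the layout are illustrative:

from collections import Counter

def transition_matrix(text):
    # count how many times each ordered pair of letters occurs
    letters = [c for c in text.upper() if c.isalpha()]
    counts = Counter(zip(letters, letters[1:]))
    alphabet = sorted(set(letters))
    print("    " + "".join(f"{b:>4}" for b in alphabet))      # column headers
    for a in alphabet:
        print(f"{a:>4}" + "".join(f"{counts[(a, b)]:>4}" for b in alphabet))
    return counts

counts = transition_matrix("It was brillig, and the slithy toves")
print(counts.most_common(3))   # the most frequent letter pairs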
3. Other functions
3.1. The conversion of the text to a sequence of consonants and vowels
The function is activated in the main menu by Edit → convert phones to consonants/vowels. Any subsequent list of n-grams is then produced from the converted C/V sequence.
The ratio of consonants and vowels in the whole analysed text is obtained simply by setting n = 1.
Syllabic consonants such as L and R are not treated specially; they are always counted as consonants.
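A possible sketch of the conversion; the vowel set used below (A, E, I, O, U, Y) is an assumption of this illustration, since the manual does not specify how Y is classified:

def to_cv_sequence(text, vowels=frozenset("AEIOUY")):
    # convert letters to 'V' (vowel) or 'C' (consonant); other characters are dropped
    # the vowel set, including Y, is an assumption made for this illustration
    return "".join("V" if ch in vowels else "C" for ch in text.upper() if ch.isalpha())

cv = to_cv_sequence("It was brillig, and the slithy toves")
print(cv)
print("C:", cv.count("C"), "V:", cv.count("V"))   # the C/V ratio (n = 1)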
3.2. Random selection of a sample
The function selects a random sample of a given length from the loaded text. It is available in the main menu under Settings and is always set after the data are loaded.
3.3. Comparison with a random model – the Monte Carlo simulations
A comparison with a randomly generated model is possible for the linguistic characteristics. If activated in the main form, it is always visualised in the graph as a blue line and shows the behaviour of a random sequence with the same parameters (alphabet size, length of the text, window size) as the investigated sequence. Please note that activation of the random model requires a longer computation time.
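The idea of the comparison can be sketched as follows: random sequences with the same alphabet size and length are generated, the chosen characteristic is computed for each of them, and the 2.5% and 97.5% quantiles give the 95% band. All names and the example statistic below are illustrative:

import random

def monte_carlo_band(length, alphabet_size, statistic, runs=200):
    # evaluate a statistic on random sequences with the same parameters
    # and return its mean together with an approximate 95% band
    alphabet = [chr(ord("a") + i) for i in range(alphabet_size)]
    values = sorted(
        statistic([random.choice(alphabet) for _ in range(length)])
        for _ in range(runs)
    )
    low, high = values[int(0.025 * runs)], values[int(0.975 * runs)]
    return sum(values) / runs, low, high

# example statistic: vocabulary usage U2 (distinct 2-grams / possible positions)
def u2(symbols):
    grams = {tuple(symbols[i:i + 2]) for i in range(len(symbols) - 1)}
    return len(grams) / (len(symbols) - 1)

print(monte_carlo_band(100, 26, u2))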
4. Example of computation
Let us take an example: the poem written by Lewis Carroll – Jabberwock – with a length of 122 characters:
It was brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe.
Linguistic characteristics:
CT for single letters:
There are 18 different one-character words in our text (I, T, W, A, S, ...), whereas the maximal possible vocabulary size is 26. Thus the vocabulary usage is U1 = 18/26 = 0,692308.
Our text contains 71 different (unique) two-character words (IT, TW, WA, AS, ...). The maximal number of two-character words is 26² = 676, but the text is only 100 characters long, so the maximal vocabulary size is 99 in our case. Thus U2 = 71/99 = 0,717172.
There are 88 different three-character words out of a maximal possible vocabulary size of 98, and so on. The final result is the product of the vocabulary usages Ui:
CT = 18/26 · 71/99 · 88/98 · 93/97 · 94/96 · 94/95 · 94/94 · 93/93 = 0,414144
CT for whole words:
Instead of counting unique n-letter combinations we count unique connections of words (collocations), as shown in Tab. 1 below. In this case CT = 0,747036; only two factors Ui enter the product (for n = 1 and n = 2), because all n-grams with n > 2 are unique and the corresponding Ui would be equal to 1.
Tab. 1
Word length    Unique combinations    Max. vocabulary size    Vocabulary usage Ui
1              18                     23                      0,782608696
2              21                     22                      0,954545455
CT = 0,747036
List of the words (1-grams) and collocations (2-grams) of the text; unique items are marked U:
IT (U)           ITWAS (U)
WAS (U)          WASBRILLIG (U)
BRILLIG (U)      BRILLIGAND (U)
AND (U)          ANDTHE (U)
THE (U)          THESLITHY (U)
SLITHY (U)       SLITHYTOVES (U)
TOVES (U)        TOVESDID (U)
DID (U)          DIDGYRE (U)
GYRE (U)         GYREAND (U)
AND              ANDGIMBLE (U)
GIMBLE (U)       GIMBLEIN (U)
IN (U)           INTHE (U)
THE              THEWABEALL (U)
WABEALL (U)      WABEALLMIMSY (U)
MIMSY (U)        MIMSYWERE (U)
WERE (U)         WERETHE (U)
THE              THEBOROGOVES (U)
BOROGOVES (U)    BOROGOVESAND (U)
AND              ANDTHE
THE              THEMOME (U)
MOME (U)         MOMERATHS (U)
RATHS (U)        RATHSOUTGRABE (U)
OUTGRABE (U)
(18 of the 23 words and 21 of the 22 collocations are unique.)
Activation of the random model: on the main form, check the item Monte Carlo. The parameters of the random model are preset; for a standard computation it is not necessary to change them.
Example: the calculation of CT. The x axis shows the length of words (i.e. up to the window size), and the y axis shows the value of the complexity CT (graph 1) and the value of the vocabulary usage (graph 2). The random model is marked in blue; the thinner lines mark the 95% probability band.
Graph 1: Complexity CT (word length 2–18 on the x axis, values 0–1 on the y axis)
Graph 2: Vocabulary usage (word length 2–18 on the x axis, values 0–1 on the y axis)
The ratio of consonants and vowels:
→ First, the data are loaded, then the text is converted to a C/V sequence (Edit → convert phones to consonants/vowels).
→ The length of the n-gram is set to n = 1 (on the main form, item „n-gram“).
→ The calculation is run using the n-grams function.
The result:
C ...... 132
V ...... 74