COMPLEXITY - Software tools for linguistics-based analysis of genetic sequences
Michaela Zemková - Charles University in Prague, Faculty of Science, Department of Philosophy and History of Science
Daniel Zahradník - Czech University of Life Sciences, Prague, Faculty of Forestry and Wood Sciences, Department of Forest Management
1. Basic information about the programme
1.1 Loading the data
1.1.1 Filtration of sequences
1.1.2 Parameters of filtration
1.2. Data saving
2. Computations
2.1 Linguistic characteristics
2.1.1 Linguistic complexity (CL)
2.1.2 Trifonov's Complexity (CT)
2.1.3 Shannon's entropy (CE)
2.1.4 Markovian entropy model (CM)
2.1.5 Transition matrix
2.1.6 Wootton-Federhen index (CWF)
2.2 Calculations using n-grams
2.2.1. Decomposition of the text to n-grams
2.2.2 The automated calculation of the ratio of n-grams potentially forming amphipathic alpha helices
3.3 Other functions
3.3.1 The conversion of proteome to a sequence of polar and non-polar amino acids
3.3.2 Detection of n-grams potentially forming amphipathic alpha helices
3.3.3 Random selection of sample
3.3.4 Comparison with random model – the Monte Carlo simulations
4. Example of computation
1. Basic information about the programme
The "Complexity" programme was originally developed for calculating linguistic complexity according to E.N. Trifonov; it has since been extended with other algorithms based on decomposing the text into potential words of length n (also known as Shannon's n-grams). Sequences in txt, fasta or faa format can be used as input data. In addition to the linguistic measures, the programme also contains a procedure for predicting structures that potentially form amphipathic alpha-helices.
At present it provides the following procedures:
- Linguistic complexity (CL)
- Linguistic complexity suggested by E.N. Trifonov (CT)
- Shannon's entropy (CE)
- Markov's model of entropy (CM)
- Transition matrix (one step of the Markov entropy procedure)
- Wootton-Federhen index (CWF)
- n-gram based text decomposition (n-grams)
- Automated measure of the ratio of structures potentially forming amphipathic alpha-helices (n-grams auto)
Other functions:
- Comparison with a random model using Monte Carlo simulations
- Random selection of a sample of given size from the investigated text
- Removal of repetitions, based on the method of k identical substrings of length n, and extraction of the list of excluded parts of the text
- Selection of proteins of interest according to their names
- Conversion of proteomes to a sequence of polar and non-polar amino acids
- Search for structures potentially forming amphipathic alpha-helices
The calculations can be run directly from the following form:
1.1 Loading the data
Data can be loaded from the main menu via File → open. Sequences in txt, fasta, faa or faa2 format can be loaded. To load several files as one dataset (for example several chromosomes of one organism), use the option "Open all files in folder": click on the first file in the folder and all other files are loaded automatically.
1.1.1 Filtration of sequences
Before the linguistic analysis, the proteome sequence should be cleared of repetitions, and special characters and comments should be removed. Proteomes contain identical or very similar protein sequences (differing from each other in a few letters); it is better to exclude them, as otherwise the linguistic results could be biased.
In the first step of the filtration, the programme offers to remove comments and some characters that sometimes occur in a sequence and are not part of the standard amino-acid table, such as B, J, X, Z.
These characters denote gaps with an unidentified amino acid (X), or an amino acid of known polarity (B, Z: polar; J: non-polar). If these characters were included, the alphabet size would be larger and the linguistic characteristics would be biased, because these characters are not part of the alphabet in the proper sense.
1.1.2 Parameters of filtration
Filtration parameters can be set in the form that is offered automatically after the data are selected:
The proteins are loaded successively. From each new protein, k randomly selected samples of length n are taken, and the programme determines how many of them occur in the proteins already read. If this number is higher than the permitted number ("Allowed number of identical samples"), the new protein is considered a replicate of an already read protein and is rejected.
The default value of k is 5. Whether this parameter is large enough can be checked with Trifonov's complexity measure. If the vocabulary usage in the graph for n = 16 is equal to one, then no loaded protein contains a repeated substring of length 16. If the vocabulary usage is only close to one, then at least one substring of that length occurs repeatedly, and the filtration parameters should be set higher, i.e. the number of compared samples should be increased.
(The disadvantage of a higher setting is a longer computation time.)
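The sampling filter described above can be sketched as follows. This is a minimal illustration and not the programme's own code: the function name `filter_proteins`, its parameter names, and the treatment of proteins shorter than n are assumptions made for the example.

```python
import random

def filter_proteins(proteins, n=16, k=5, allowed=0, seed=0):
    """Return (kept, excluded) lists of protein sequences.

    n       : length of each sample
    k       : number of random samples taken from each new protein
    allowed : permitted number of samples found in already read proteins
    """
    rng = random.Random(seed)
    kept, excluded = [], []
    for seq in proteins:
        if len(seq) < n:
            # Assumption: proteins too short to sample are simply kept.
            kept.append(seq)
            continue
        starts = [rng.randrange(len(seq) - n + 1) for _ in range(k)]
        samples = [seq[s:s + n] for s in starts]
        # Count the samples that occur in any protein read so far.
        hits = sum(any(sample in old for old in kept) for sample in samples)
        if hits > allowed:
            excluded.append(seq)   # treated as a replicate
        else:
            kept.append(seq)
    return kept, excluded
```

Every sample is searched in all proteins kept so far, which is why larger values of k lengthen the computation.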
An example:
Fig. 1. Vocabulary usage of the proteome of Trypanosoma, k = 5
[Figure 1: plot of vocabulary usage (y axis, 0 to 1) against word length n (x axis, 1 to 20)]
The x axis contains the values of n, the length of words. The y axis contains the values of vocabulary usage (the ratio of actual to maximal possible vocabulary usage in the given sequence). For strings of length 16, the vocabulary usage is only close to one, which means that some identical strings of length 16 remain and the complexity value would be lowered. The problem is eliminated by setting k to 20.
This level of filtration provides a list of excluded proteins, which allows the filtration to be checked afterwards. The list is available in the menu edit → excluded proteins.
!! Filtration is a very time-consuming procedure, depending on the length of the sequence and on the parameters (it can take a few hours). For a proteome of size 5 MB and parameters 16_5_0, the treatment takes approximately 10 minutes.
The "merge identical proteins" item has an effect only in the case of random selection of a sample (see 3.3.3).
1.1.3. Special selection
Special selection is the last phase of data filtration. It selects proteins of interest according to a substring of the protein name. The function is available from the menu edit in the form "data selection". It is possible to select all proteins, to write the name of the protein of interest in the form, or to select all proteins except the specified ones.
1.2. Data saving
The filtered genetic text (or the text selected according to the given parameters) can be saved via File → save. The programme automatically offers saving in the faa2 format (saving as txt is also possible). In this way, the sequence of letters is saved without blanks, line breaks, special characters and so on. The advantage of this kind of saving is that the sequence does not need to be filtered every time. (It is reasonable to reflect the filtration parameters in the name of the new filtered sequence, for example: plasmodium 16_10_0.faa2.)
For later checking, the list of excluded proteins can also be saved in any appropriate format with the standard copy and paste tools. (There is no special function for saving the list of excluded proteins.)
2. Computations
2.1 Linguistic characteristics
All calculations are available in the main menu under computation. There is only one adjustable item, the window size, which generally reflects the maximal length of "words"; it can be adjusted wherever it is a variable of the calculation. The alphabet size is recognized automatically. In the loading procedure, it is necessary to choose the item "remove characters B,J,Z,U,X". Leaving these characters in the text would cause an undesirable change in alphabet size and hence in the result of the linguistic measure (see 1.1.1).
2.1.1 Linguistic complexity (CL)
The linguistic complexity suggested by Kolmogorov is the simplest concept for measuring vocabulary usage in a given sequence:
$$C_L = \frac{\sum_i V_i}{\sum_i V_{\max,i}}$$
where V_i is the number of different words of length i and V_max,i is the maximal possible number of words of length i. This simple concept of complexity therefore reflects how many words of length i, out of the possible vocabulary, occur in the given text. The composition of i-length words is reflected as follows:
Example:
length of word              2     3
number of i-length words    10    40
maximal vocabulary usage    50    50
CL = (10+40)/(50+50) = 0.5
The output of the measure is a value of CL and a graph.
Adjustable item: window size (preset to 20)
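As an illustration, CL can be computed from a text along the following lines. This is a sketch under stated assumptions: words are taken as overlapping substrings, and the maximal vocabulary for length i is taken as the smaller of K^i and the number of positions in the text. The function name and its parameters are hypothetical, not the programme's interface.

```python
def linguistic_complexity(text, window=20, alphabet_size=None):
    """Ratio of observed to maximal vocabulary, summed over word lengths 1..window."""
    if not text:
        return 0.0
    # Assumption: the alphabet size defaults to the letters seen in the text.
    k = alphabet_size or len(set(text))
    observed = 0
    possible = 0
    for i in range(1, min(window, len(text)) + 1):
        positions = len(text) - i + 1
        words = {text[j:j + i] for j in range(positions)}
        observed += len(words)
        # Assumption: at most K^i different words, and no more than fit in the text.
        possible += min(k ** i, positions)
    return observed / possible
```

For example, `linguistic_complexity("AAAA", window=2, alphabet_size=4)` counts one word of each length against a maximum of 4 and 3 respectively.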
2.1.2 Trifonov's Complexity (CT)
The complexity suggested by E.N. Trifonov reflects the richness (diversification) of the vocabulary. The concept takes into account both frequency and composition. The complexity defined by Trifonov (1990) for any sequence of length L is given by the product:
$$C_T = \prod_{i=1}^{n} U_i$$
where U_i is the ratio of the actual to the maximal vocabulary size for word length i. The algorithm runs in steps for i = 1...n; it does not take all combinations of words in one "package" but reflects the distribution of words of each length i.
The final value can be compared with the CT value of a sequence of a different length. The disadvantage of the concept is its sensitivity to repetitions, which decrease the result drastically. To evaluate the CT value of a proteome, the sequence has to be cleaned of repetitions (those that are artefacts of the database) in advance.
An example: The difference between CL and CT
length of word    number of i-length words    maximal vocabulary usage
2                 25                          50
3                 25                          50
CL = (25+25)/(50+50) = 0.5
CT = 25/50 * 25/50 = 0.25
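The arithmetic of the two examples can be reproduced with a few lines. The helper name `cl_and_ct` is hypothetical; it only takes the word counts and the maximal vocabulary sizes per word length, as in the tables, and applies the sum-ratio (CL) and the product of ratios (CT).

```python
from math import prod

def cl_and_ct(observed, maximal):
    """observed[i], maximal[i]: actual and maximal vocabulary for one word length."""
    cl = sum(observed) / sum(maximal)                          # one pooled ratio
    ct = prod(v / vmax for v, vmax in zip(observed, maximal))  # product of per-length ratios
    return cl, ct

# Counts from the CL example (10 and 40 words) and from the CL/CT example (25 and 25).
print(cl_and_ct([10, 40], [50, 50]))
print(cl_and_ct([25, 25], [50, 50]))
```

Both inputs give the same CL, while CT differs, because CT is sensitive to how the vocabulary usage is distributed over the word lengths.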
2.1.3 Shannon's entropy (CE)
Shannon's entropy measures the information potentially contained in a message, without considering how much real information the message actually carries. A sequence that repeats one character (AAAA...) has zero entropy. In a sequence where all possible symbols occur evenly, the entropy comes near to 1. Shannon's information entropy is defined as:
$$H' = -\sum_{i=1}^{K} \frac{n_i}{N} \log \frac{n_i}{N}$$
where N is the length of the sequence (the whole text), n_i is the frequency of the i-th symbol in the text and K is the alphabet size.
Shannon's information entropy does not, however, reflect the real composition of the words; it reflects the potential amount of information in the sequence, i.e. the frequency of symbols only. The sequence AAAACCCCCC has the same entropy as the sequence ACACCACCAC: H' = -(4/10 * log(4/10) + 6/10 * log(6/10)). Two symbols are used in a sequence of length N = 10 (A occurs 4 times and C occurs 6 times), thus n1 = 4, n2 = 6.
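A short sketch of the entropy calculation follows. It assumes that the logarithm is taken to base K (the alphabet size), which is what makes evenly used symbols give a value of 1; the base actually used by the programme is not stated here, and the function name is hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(text, alphabet_size=None):
    """Entropy of symbol frequencies, with logarithms to base K (assumption)."""
    counts = Counter(text)
    n = len(text)
    k = alphabet_size or len(counts)
    if n == 0 or k < 2:
        return 0.0  # a one-letter alphabet carries no information
    return -sum(c / n * math.log(c / n, k) for c in counts.values())

# Only symbol frequencies matter, not their order:
print(shannon_entropy("AAAACCCCCC") == shannon_entropy("ACACCACCAC"))
```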
2.1.4 Markovian entropy model (CM)
The Markovian model is an alternative entropy measure that also takes into account the real combinations and relations of letters within the text (the inner composition of the vocabulary). The actual value of CM changes with the given length of words.
Here M is the number of possible words, M = K^m, where K is the alphabet size, m is the window size (the maximal length of words) and N is the length of the whole text. The length of words varies from 1 to m. The model contains the probabilities of letter combinations, which are reflected in the transition matrix.
Output: result value and graph
1) a single value, for words of length 2
2) an extract of all values for words of length 1 to m, available by right-clicking the graph and choosing the write values command
Adjustable item: window size (the calculation takes longer for higher values)
2.1.5 Transition matrix
The matrix is one of the steps of the CM measure; it simply reflects how many times the various combinations of letters occur. It thus gives a transparent overview of the preferences between individual letters.
Example of output:
(In this case, the window size has no effect)
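A matrix of this kind can be sketched by counting adjacent letter pairs. The function name and the row/column layout (rows: first letter, columns: following letter, both in alphabetical order) are assumptions made for the illustration; the programme's own output layout may differ.

```python
from collections import Counter

def transition_counts(text):
    """Return (letters, matrix) where matrix[a][b] counts letter a followed by letter b."""
    pairs = Counter(zip(text, text[1:]))   # all adjacent letter pairs
    letters = sorted(set(text))
    matrix = [[pairs[(a, b)] for b in letters] for a in letters]
    return letters, matrix

letters, matrix = transition_counts("AABAB")
for letter, row in zip(letters, matrix):
    print(letter, row)
```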
2.1.6 Wootton-Federhen index (CWF)
The algorithm is incorporated in the BLAST algorithm, where it serves to find low-complexity regions.
The output is the CWF value itself; a low value corresponds to low complexity.
The choice of window size has no effect.
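A minimal sketch of such an index is given below. It assumes the commonly cited form of the Wootton-Federhen complexity, (1/L) log_K (L! / (n_1! n_2! ... n_K!)), where L is the sequence length and n_i are the letter counts; whether the programme uses exactly this form is an assumption, and the function name is hypothetical.

```python
import math
from collections import Counter

def wootton_federhen(text, alphabet_size=20):
    """(1/L) * log_K( L! / prod(n_i!) ), computed with log-gamma to avoid huge factorials."""
    length = len(text)
    if length == 0:
        return 0.0
    # lgamma(x + 1) equals log(x!)
    log_omega = math.lgamma(length + 1) - sum(
        math.lgamma(c + 1) for c in Counter(text).values()
    )
    return log_omega / (length * math.log(alphabet_size))

print(wootton_federhen("AAAAAAAAAA"))   # a single repeated letter: lowest complexity
```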
2.2 Calculations using n-grams
The variable of these calculations is the length of the n-gram (i.e. the length of words), which can be set in the form as the "word list" item.
2.2.1. Decomposition of the text to n-grams
The method suggested by Shannon (1948) is based on decomposing the text into so-called n-grams, the words of length n. The programme shows the list of n-grams of a given length, which is set by hand in the form. The shortest length for genetic sequences is 4.
(This is because the procedure was primarily programmed for the analysis of amphipathic structures, where a value lower than 4 is unreasonable. A function providing all lengths from 1 to n is available in the version Complexity_H for the analysis of human languages.)
Each unique n-gram is listed with its frequency; the list is arranged in descending order.
An example of the list of 4-grams:
NNNN    108767
QQQQ    25156
SSSS    11673
TTTT    11199
NNNS    5493
SNNN    5210
….
The list also contains information about the ratio of potentially amphipathic and non-amphipathic n-grams (see 3.3.2).
Generally, an n-gram is a substring of length n in a given sequence, characterized by its probability of occurrence; it is used in language modelling. There, however, n usually counts words rather than letters, and the probability is thus the probability of a particular connection of words.
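The decomposition into n-grams with frequencies in descending order can be sketched as below. Overlapping windows shifted by one letter are an assumption of this example, as is the function name.

```python
from collections import Counter

def ngram_list(text, n=4):
    """Return [(ngram, frequency), ...] sorted by descending frequency."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return counts.most_common()

for ngram, freq in ngram_list("NNNNNS", 4):
    print(ngram, freq)
```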
2.2.2 The automated calculation of the ratio of n-grams potentially
forming amphipathic alpha helices
The automated calculation is a simplified way to follow how the ratio of potentially amphipathic n-grams develops for word lengths from 4 to n (see 3.3). The proteome is divided into a potentially amphipathic and a non-amphipathic part. The programme distinguishes the number of unique n-gram types (each type counted just once, so that this number reflects the real diversity of the vocabulary) and the total number of n-grams.
The upper limit of n is set in the form. The programme then shows the extract for all n from 4 to this maximum. (The function "n-grams" /2.2.1/ shows the same values in the item "test", but only for one given length.) In contrast, the function "n-grams auto" does not show the list of particular n-grams.
Example of an output: the extract for n from 4 to 8. The left part shows the number of unique potentially amphipathic and non-amphipathic peptides in the proteome; the right part shows the total number of each. (The ratio can be plotted e.g. in Excel.)
3.3 Other functions
3.3.1 The conversion of proteome to a sequence of polar and non-polar
amino acids
The function is activated in the main menu by edit → convert amino acids to polar/nonpolar. Any list of n-grams produced afterwards is already converted into a sequence of N/P.
The transformation is given by the following table:
Table of polar and non-polar amino acids:
Non-polar    G, A, V, L, I, P, M, F, W, C (J)
Polar        S, T, Y, N, Q, D, E, K, R, H (B, Z)
Do not remove the characters B, J, Z, U, X, which denote unidentified amino acids of known polarity, in the loading procedure!! (These characters should be removed only for the linguistic measures, where the alphabet size is reflected in the result.)
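The conversion table can be expressed as a small lookup. The sketch below follows the table above; skipping characters that appear in neither row (such as X or U) is an assumption of this example, and the names are hypothetical.

```python
NON_POLAR = set("GAVLIPMFWC") | {"J"}
POLAR = set("STYNQDEKRH") | {"B", "Z"}

def to_polarity(sequence):
    """Convert an amino-acid sequence to a string of N (non-polar) and P (polar)."""
    out = []
    for aa in sequence.upper():
        if aa in NON_POLAR:
            out.append("N")
        elif aa in POLAR:
            out.append("P")
        # Assumption: characters in neither row of the table are skipped.
    return "".join(out)

print(to_polarity("GASTNP"))
```

Note that the output letters are polarity classes, not amino acids: asparagine (N) is polar and becomes P, while proline (P) is non-polar and becomes N.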
An example of the list: 4-grams converted into sequences of P/N amino acids; potentially amphipathic n-grams are marked.
PPPP    969848
PPNP    489240    Amphipatic
PNPP    486666    Amphipatic
NPPP    465192
PPPN    464924
…
3.3.2 Detection of n-grams potentially forming amphipathic alpha
helices
So-called amphipathic n-grams are recognized in the sequence according to the necessary (and at the same time sufficient) condition that the period of polar and non-polar amino acids is 3.5 (i.e. 3 or 4).
A (P,N) sequence thus belongs to amphipathic helices if P, PP or PPP alternate with N, NN or NNN in such a way that there is a P (call its coordinate 0) for which another P is found at position 3 or 4 (or both), yet another one at position 7, and another one at position 10 or 11 (or both). The same can be demanded for N.
The above definition is applicable to lengths at which the periodicity of 3.5 can be demonstrated at all. Therefore, 4 is the minimal length: it shows at least one period and could be part of a longer helix of several periods.
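One possible reading of this rule is sketched below. It is an interpretation for illustration, not the programme's implementation: runs of equal letters are limited to three, both letters must occur, and an anchor letter must recur at offsets 3 or 4, 7, and 10 or 11 wherever those positions still lie inside the n-gram. All names are hypothetical.

```python
from itertools import groupby, product

def is_amphipathic(s):
    """Check a P/N string against one reading of the 3.5-period rule."""
    runs = [len(list(group)) for _, group in groupby(s)]
    if len(runs) < 2 or max(runs) > 3:
        return False                      # no alternation, or a run longer than 3
    n = len(s)
    for anchor in range(n - 3):           # offset 3 must lie inside the string
        letter = s[anchor]
        ok = True
        for offsets in ((3, 4), (7,), (10, 11)):
            inside = [anchor + o for o in offsets if anchor + o < n]
            if inside and not any(s[p] == letter for p in inside):
                ok = False
                break
        if ok:
            return True
    return False

def count_amphipathic(n):
    """Number of amphipathic types among all 2**n P/N strings of length n."""
    return sum(is_amphipathic("".join(t)) for t in product("PN", repeat=n))

print(count_amphipathic(4))
```

Under this reading, the 4-grams PPNP and PNPP qualify while PPPP, NPPP and PPPN do not, which matches the marked entries in the example list of 3.3.1.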
Besides the list of n-grams for the given n (in the form, item "data"), there is also an extract of the numbers of unique amphipathic (and non-amphipathic) n-grams and their total numbers. (Numbers and statistics are available in the item "test"; see 2.2.2.)
An example of the list: the extract of all 4-grams of the organism Dictyostelium discoideum. There are 6 possible (and, in this case, existing) amphipathic and 10 non-amphipathic unique types of n-grams. The statistical test shows whether the distribution of amphipathic n-grams (unique and also total) in the whole proteome is uniform. The test used is the Kolmogorov-Smirnov test; the hypothesis of uniformity is rejected if the test statistic exceeds the critical value. The level of significance is 0.05.
The distribution of amphipathic n-grams itself is visible from the graph (in the form, item "image"). The blue line always denotes the uniform distribution with the corresponding parameters. The red line denotes the distribution of n-grams in the given sequence. In this case, the proteome of Dictyostelium discoideum, the hypothesis of uniformity is rejected at the significance level of 0.05. The position of the distribution curve of the investigated organism above or below the uniform distribution curve reflects an accumulation or a lack of amphipathic n-grams in the sequence of n-grams arranged in descending order.
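A Kolmogorov-Smirnov-type check of this kind can be sketched as follows, assuming that the compared curves are the cumulative share of amphipathic n-grams along the frequency-ranked list and the straight line of a uniform spread. The critical value 1.36/sqrt(m) is the usual large-sample approximation for the 0.05 level; how the programme computes its statistic and critical value is not specified here, and the function name is hypothetical.

```python
import math

def ks_uniformity(flags):
    """flags[i] is True if the n-gram at rank i (descending frequency) is amphipathic.

    Returns (statistic, critical_value, rejected).
    """
    total = len(flags)
    m = sum(flags)
    if m == 0:
        raise ValueError("no amphipathic n-grams in the list")
    seen = 0
    statistic = 0.0
    for rank, flag in enumerate(flags, start=1):
        seen += flag
        # Distance between the empirical curve and the uniform line at this rank.
        statistic = max(statistic, abs(seen / m - rank / total))
    critical = 1.36 / math.sqrt(m)        # approximate 0.05 critical value
    return statistic, critical, statistic > critical
```

A curve that stays above the uniform line means that amphipathic n-grams accumulate among the most frequent n-grams; a curve below it means that they are lacking there.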
3.3.3 Random selection of sample
The function provides a random selection of a given length from the sample (which can be useful for comparing proteomes of originally different lengths).
It is available in the main menu in settings. The adjustable item "n-grams can be located inside proteins only" guarantees that each random sample is selected within one protein, so that a sequence cannot be joined from adjacent proteins. The selection is made without replacement.
It is good to check the "merge identical proteins" checkbox during the loading procedure, because a random sample could otherwise be selected within adjacent identical proteins; however, this function should have no significant effect if the analyzed sequence is filtered.
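Random selection restricted to single proteins can be sketched as below. The example draws n-gram positions without replacement from the positions that lie fully inside one protein; the function name and parameters are hypothetical, and the programme's own sampling unit may differ.

```python
import random

def random_sample(proteins, n, count, seed=None):
    """Draw up to `count` n-grams without replacement, each inside one protein."""
    rng = random.Random(seed)
    # Every (protein index, start) pair whose window fits inside that protein.
    windows = [
        (p, start)
        for p, seq in enumerate(proteins)
        for start in range(len(seq) - n + 1)
    ]
    chosen = rng.sample(windows, min(count, len(windows)))
    return [proteins[p][start:start + n] for p, start in chosen]

print(random_sample(["MKVLAAGIV", "MSTNPKPQR"], n=4, count=3, seed=1))
```

Listing all windows keeps the sketch short but needs memory proportional to the proteome size.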
3.3.4 Comparison with random model – the Monte Carlo simulations
The comparison with a randomly generated model is possible for the linguistic characteristics. If activated in the main form, it is always shown in the graph as a blue line and represents a random sequence with the same parameters (alphabet size, length of the text, window size) as the investigated sequence. Please note that activating the random model requires a longer computation time!
The graph of complexity CT of the mouse protein actin, with a length of 370 amino acids and window size 10. (The x axis shows the length of words, i.e. the window size, and the y axis shows the value of complexity CT.) The random model is marked in blue; the thinner lines mark the 95% range of probability.
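The idea of the random model can be sketched with a small Monte Carlo loop. The assumptions here are that random sequences use all letters with equal probability, that CT is the product of vocabulary-usage ratios with a maximal vocabulary of min(K^i, number of positions), and that the 95% range is taken from the 2.5% and 97.5% quantiles of the simulated values. Function names and the number of runs are hypothetical.

```python
import random

def trifonov_ct(text, window, k):
    """Product of vocabulary-usage ratios for word lengths 1..window."""
    ct = 1.0
    for i in range(1, min(window, len(text)) + 1):
        positions = len(text) - i + 1
        words = {text[j:j + i] for j in range(positions)}
        ct *= len(words) / min(k ** i, positions)
    return ct

def random_model_band(length, alphabet, window, runs=200, seed=0):
    """Return the (lower, upper) 95% range of CT for random sequences."""
    rng = random.Random(seed)
    values = sorted(
        trifonov_ct("".join(rng.choices(alphabet, k=length)), window, len(alphabet))
        for _ in range(runs)
    )
    lower = values[int(0.025 * (runs - 1))]
    upper = values[int(0.975 * (runs - 1))]
    return lower, upper

# Same parameters as the actin example: length 370, 20 amino acids, window size 10.
print(random_model_band(370, "ACDEFGHIKLMNPQRSTVWY", 10))
```

Each run recomputes the complexity of a whole random sequence, which is why activating the random model lengthens the computation.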
4. Example of computation
Let's determine the ratio and distribution of potentially amphipathic peptides in the proteome of the organism Plasmodium yoelii. The procedure can be divided into 5 steps:
1. Loading the data (possibly with filtration) /see chapters 1.1.1, 1.1.2/
2. Setting the length of an n-gram (on the main form, item "word list")
3. Selection of a random sample (according to the situation) /see 3.3.3/
4. Conversion of the sequence of amino acids into an N/P sequence (main menu: edit → convert amino acids to polar/nonpolar)
5. Calculation using the n-grams functions.
(The list of N/P-converted n-grams of the given length is displayed, together with the numbers of unique amphipathic and non-amphipathic n-grams, their total numbers, the statistics of the n-gram distribution and the graph. The development of the distribution is visible from the graph.)