COMPLEXITY - Software tools for linguistic-based analysis of genetic sequences

Michaela Zemková - Charles University in Prague, Faculty of Science, Department of Philosophy and History of Science
Daniel Zahradník - Czech University of Life Sciences, Prague, Faculty of Forestry and Wood Sciences, Department of Forest Management

1. Basic information about the programme
   1.1 Loading the data
       1.1.1 Filtration of sequences
       1.1.2 Parameters of filtration
   1.2. Data saving
2. Computations
   2.1 Linguistic characteristics
       2.1.1 Linguistic complexity (CL)
       2.1.2 Trifonov's Complexity (CT)
       2.1.3 Shannon's entropy (CE)
       2.1.4 Markovian entropy model (CM)
       2.1.5 Transition matrix
       2.1.6 Wootton-Federhen index (CWF)
   2.2 Calculations using n-grams
       2.2.1. Decomposition of the text to n-grams
       2.2.2 The automated calculation of the ratio of n-grams potentially forming amphipathic alpha helices
   3.3 Other functions
       3.3.1 The conversion of proteome to a sequence of polar and non-polar amino acids
       3.3.2 Detection of n-grams potentially forming amphipathic alpha helices
       3.3.3 Random selection of sample
       3.3.4 Comparison with random model - the Monte Carlo simulations
4. Example of computation

1. Basic information about the programme

The "Complexity" programme was originally developed for calculating linguistic complexity according to E.N. Trifonov; it has since been extended with other algorithms based on decomposing the text into potential words of length n (also known as Shannon's n-grams). Sequences in txt, fasta or faa format can be used as input data. In addition to the linguistic measures, the programme contains a procedure for predicting structures that potentially form amphipathic alpha-helices.

At present it provides the following procedures:

- Linguistic complexity (CL)
- Linguistic complexity suggested by E.N. Trifonov (CT)
- Shannon's entropy (CE)
- Markov's model of entropy (CM)
- Transition matrix (one step of the Markov entropy procedure)
- Wootton-Federhen index (CWF)
- n-gram based text decomposition (n-grams)
- Automated measure of the ratio of structures potentially forming amphipathic alpha-helices (n-grams auto)

Other functions:

- Comparison with a random model using Monte Carlo simulations
- Random selection of a sample of given size from the investigated text
- Removal of repetitions, based on the method of k identical substrings of length n, and extraction of the list of excluded parts of the text
- Selection of proteins of interest according to their names
- Conversion of proteomes to a sequence of polar and non-polar amino acids
- Search for structures potentially forming amphipathic alpha-helices

The calculations are run directly from the programme's main form.

1.1 Loading the data

Data are loaded from the main menu via File → open. Sequences can be loaded in txt, fasta, faa or faa2 format. To load several files as one dataset (for example, several chromosomes of one organism), use the option "Open all files in folder": click on the first file in the folder and all the other files are loaded automatically.

1.1.1 Filtration of sequences

Before the linguistic analysis, the proteome sequence should be cleared of repetitions, and special characters and comments should be removed. Proteomes contain identical or very similar protein sequences (differing from each other in a few letters); it is better to exclude them, as otherwise the results of the linguistic measures could be biased. In the first step of the filtration, the programme offers to remove comments and some characters that occasionally occur in sequences but are not part of the standard amino-acid table, such as B, J, X, Z. These characters denote gaps with an unidentified amino acid (X) or an amino acid of known polarity (B, Z: polar; J: non-polar).
If these characters were included, the alphabet size would be larger and the linguistic characteristics would be biased, because these characters are not part of the alphabet in the proper sense.

1.1.2 Parameters of filtration

The filtration parameters are set in the form that is offered automatically after the data are selected.

The proteins are loaded successively. From each new protein, k randomly selected samples of length n are taken, and it is determined how many of them occur among the proteins already read. If this number is higher than the permitted number ("Allowed number of identical samples"), the new protein is considered a replicate of the respective already-read protein and is rejected. The default value of k is 5.

The optimal value of this parameter can be checked with Trifonov's complexity measure. If the vocabulary usage in the graph for n = 16 is equal to one, then no loaded protein contains a repeated substring of length 16. If the vocabulary usage is only close to one, at least one substring of that length occurs repeatedly, and the filtration parameters should be set higher, i.e. the number of samples compared should be increased. (The disadvantage is a longer computation time.)

An example:

Fig. 1. Vocabulary usage of the proteome of Trypanosoma, k = 5. The x axis shows n, the length of words (1 to 20). The y axis shows the vocabulary usage (the ratio of actual to maximal possible vocabulary usage in the given sequence), from 0 to 1.

For strings of length 16 the vocabulary usage is only close to one, which means that some identical strings of length 16 remain and the complexity value would be lowered. The problem is eliminated by setting k to 20.

This level of filtration provides a control list of excluded proteins, which allows the result to be checked afterwards.
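The sampling filter described in this section can be sketched roughly as follows. This is only an illustration under stated assumptions, not the programme's own code: the function name `filter_replicates`, the `seed` argument and the use of one shared set of substrings (instead of a comparison with each already-read protein separately) are hypothetical, and the reading of a parameter string such as 16_5_0 as n, k and the allowed number of identical samples is likewise an assumption.

```python
import random

def filter_replicates(proteins, n=16, k=5, allowed=0, seed=0):
    """Split proteins into (kept, excluded) using k random samples of length n.

    Assumption: a sample counts as a hit if the same substring of length n
    occurs anywhere in the proteins kept so far.
    """
    rng = random.Random(seed)
    kept, excluded = [], []
    seen = set()  # every substring of length n from the proteins kept so far
    for protein in proteins:
        last_start = len(protein) - n
        hits = 0
        if last_start >= 0:
            for _ in range(k):
                start = rng.randint(0, last_start)
                if protein[start:start + n] in seen:
                    hits += 1
        if hits > allowed:
            excluded.append(protein)  # treated as a replicate
        else:
            kept.append(protein)
            seen.update(protein[j:j + n] for j in range(last_start + 1))
    return kept, excluded

original = "ACDEFGHIKLMNPQRSTVWY"
kept, excluded = filter_replicates([original, original, original[::-1]])
print(len(kept), len(excluded))
```

Raising k makes it more likely that a partially identical protein is caught, at the price of more lookups per protein, which matches the trade-off described above.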
The list of excluded proteins is available in the menu under edit → excluded proteins.

!! Filtration is a very time-consuming procedure; its duration depends on the length of the sequence and on the parameters, and it can take a few hours. For a proteome of size 5 MB and parameters 16_5_0, the treatment takes approximately 10 minutes.

The "merge identical proteins" item has an effect only when a random sample is selected (see 3.3.3).

1.1.3. Special selection

Special selection is the last phase of data filtration; it selects proteins of interest according to a substring in the name of the protein. The function is available from the edit menu in the form "data selection". It is possible to select all proteins, to write the name of the protein of interest into the form, or to select all proteins except the specified ones.

1.2. Data saving

The filtered genetic text (or the text selected according to the given parameters) can be saved via File → save. The programme automatically offers saving in the faa2 format (saving as txt is also possible). In this way the sequence of letters is saved without blanks, line breaks, special characters and so on. The advantage is that the sequence does not have to be filtered every time. It is reasonable to reflect the filtration parameters in the name of the new filtered sequence, for example: plasmodium 16_10_0.faa2. For later checking, the list of excluded proteins can also be saved in any appropriate format using standard copy and paste. (There is no special function for saving the list of excluded proteins.)

2. Computations

2.1 Linguistic characteristics

All calculations are available in the main menu under computation. There is only one adjustable item, the window size, which generally reflects the maximal length of "words"; it can be adjusted where the measure depends on it. The alphabet size is recognized automatically. In the loading procedure, it is necessary to choose the item "remove characters B,J,Z,U,X".
Leaving these characters in the text would cause an undesirable change in the alphabet size and hence in the result of the linguistic measures (see 1.1.1).

2.1.1 Linguistic complexity (CL)

The linguistic complexity suggested by Kolmogorov is the simplest concept of measuring vocabulary usage in a given sequence:

$$C_L = \frac{\sum_i V_i}{\sum_i V_{\max,i}}$$

where Vi is the number of different words of length i and Vmax,i is the maximal possible number of words of length i. This simple concept of complexity therefore reflects how many words of length i from the possible vocabulary occur in the given text. The contribution of the words of each length is shown in the following example:

length of word    number of i-length words    maximal vocabulary usage
2                 10                          50
3                 40                          50

CL = (10+40)/(50+50) = 0.5

The output of the measure is the value of CL and a graph.
Adjustable item: window size (preset to 20)

2.1.2 Trifonov's Complexity (CT)

The complexity suggested by E.N. Trifonov reflects the richness (diversification) of the vocabulary. The concept takes into account both frequency and composition. The complexity defined by Trifonov (1990) for any sequence of length L is given by the product

$$C_T = \prod_{i=1}^{n} U_i$$

where Ui is the ratio of the actual to the maximal vocabulary size for word length i. The algorithm runs in steps for i = 1...n; it does not take all combinations of words in one "package" but reflects the distribution of the words of each length i. The final value can be compared with the CT value of a sequence of different length. The disadvantage of the concept is its sensitivity to repetitions, which decrease the result drastically. To evaluate the CT value of a proteome, it is therefore necessary to clean the sequence of repetitions (those that are artefacts of the database) in advance.
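Both measures can be tried on a short string with the sketch below. It is a hedged illustration, not the programme's implementation: the function names are hypothetical, words are assumed to be read with a window sliding by one letter, and the maximal vocabulary for word length i is assumed to be the smaller of K^i and the number of window positions, which the manual does not state explicitly.

```python
from math import prod

def vocabulary(seq, i, alphabet_size):
    """Return (observed, maximal) vocabulary size for words of length i."""
    positions = len(seq) - i + 1
    observed = len({seq[j:j + i] for j in range(positions)})
    # Assumption: no more than K**i different words exist, and no more than
    # one word per window position can occur.
    maximal = min(alphabet_size ** i, positions)
    return observed, maximal

def complexities(seq, window=20, alphabet_size=None):
    """Return (CL, CT) for word lengths 1..window."""
    if not seq:
        raise ValueError("empty sequence")
    k = alphabet_size or len(set(seq))
    pairs = [vocabulary(seq, i, k) for i in range(1, min(window, len(seq)) + 1)]
    cl = sum(v for v, _ in pairs) / sum(m for _, m in pairs)  # ratio of sums
    ct = prod(v / m for v, m in pairs)                        # product of ratios
    return cl, ct

cl, ct = complexities("ACAC", window=2)
print(round(cl, 3), round(ct, 3))  # 0.8 0.667
```

For "ACAC" the words of length 1 fill their vocabulary (2 of 2) while the words of length 2 do not (2 of 3), so the ratio of sums gives 4/5 and the product of ratios gives 2/3.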
An example: the difference between CL and CT

length of word    number of i-length words    maximal vocabulary usage
2                 25                          50
3                 25                          50

CL = (25+25)/(50+50) = 0.5
CT = 25/50 * 25/50 = 0.25

2.1.3 Shannon's entropy (CE)

Shannon's entropy measures the information potentially contained in a message, without considering how much real information the message actually has. A sequence consisting of one repeated character (AAAA...) has zero entropy. In a sequence where all possible symbols occur evenly, the entropy approaches 1. Shannon's information entropy is defined as:

$$H' = -\sum_{i=1}^{K} \frac{n_i}{N} \log \frac{n_i}{N}$$

where N is the length of the sequence (whole text), ni is the frequency of the i-th symbol in the text and K is the alphabet size. Shannon's information entropy does not, however, reflect the real composition of the words; it reflects the potential amount of information in the sequence, i.e. the frequency of symbols only. The sequence AAAACCCCCC has the same entropy as the sequence ACACACCACA:

H' = -(4/10 * log(4/10) + 6/10 * log(6/10))

Two symbols are used in a sequence of length N = 10 (A occurs 4 times and C occurs 6 times), thus n1 = 4, n2 = 6.

2.1.4 Markovian entropy model (CM)

The Markovian model is an alternative entropy measure that also takes into account the actual combinations and relations of letters within the text (the inner composition of the vocabulary). The actual value of CM changes with the given length of words. In the model, M is the number of possible words, M = K^m, where K is the alphabet size, m is the window size (maximal length of words) and N is the length of the whole text. The length of words varies from 1 to m. The model contains the probabilities of combinations of letters, which are reflected in the transition matrix.

Output: result value and graph
1) only one value, for words of length 2
2) the extract of all values for words of length 1 to m.
Available by right-clicking the graph and choosing the "write values" command.

Adjustable item: window size (the calculation takes longer for higher values)

2.1.5 Transition matrix

The matrix is one of the steps of the CM measure; it simply shows how many times the various combinations of letters occur. It thus gives transparent information about the preferences between individual letters. (In this case, the window size has no effect.)

2.1.6 Wootton-Federhen index (CWF)

The index is incorporated in the BLAST algorithm, where it serves to find low-complexity regions. The output is the CWF value itself; a low value corresponds to low complexity. The choice of window size has no effect.

2.2 Calculations using n-grams

The variable in these calculations is the length of the n-gram (i.e. the length of words), which is set in the form under "word list".

2.2.1. Decomposition of the text to n-grams

The method suggested by Shannon (1948) is based on decomposing the text into so-called n-grams, the words of length n. The programme shows the list of n-grams of the given length, which is set by hand in the form. The shortest length for genetic sequences is 4. (This is because the procedure was primarily programmed for the analysis of amphipathic structures, where a value lower than 4 is unreasonable. A function providing all lengths from 1 to n is available in the version Complexity_H for the analysis of human languages.) For each unique n-gram the frequency is given; the list is arranged in descending order of frequency.

An example of the list of 4-grams:

NNNN    108767
QQQQ    25156
SSSS    11673
TTTT    11199
NNNS    5493
SNNN    5210
…

The list also gives the ratio of potentially amphipathic and non-amphipathic n-grams (see 3.3.2).

Generally, an n-gram is a substring of length n in a given sequence, characterized by its probability of occurrence; it is used in language modelling.
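A frequency list of this kind can be produced with a few lines of Python. The sketch assumes that n-grams are read with a window sliding by one letter (overlapping words); the manual does not say how the programme steps through the text, and the function name `ngram_list` is hypothetical.

```python
from collections import Counter

def ngram_list(seq, n):
    """Return (n-gram, frequency) pairs in descending order of frequency."""
    counts = Counter(seq[j:j + n] for j in range(len(seq) - n + 1))
    return counts.most_common()

for gram, freq in ngram_list("NNNNNS", 4):
    print(gram, freq)
# NNNN 2
# NNNS 1
```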
In language modelling, however, n is usually counted in words, and the probability is then the probability of a connection of words.

2.2.2 The automated calculation of the ratio of n-grams potentially forming amphipathic alpha helices

The automated calculation is a simplified way to follow the development of the ratio of potentially amphipathic n-grams for words of length 4 to n (see 3.3.2). The proteome is divided into a potentially amphipathic and a non-amphipathic part. The programme distinguishes the number of unique types of n-grams (each type counted once, so that this number reflects the real diversity of the vocabulary) and the total number of n-grams. The upper limit of n is set in the form. The programme then shows the extract for all n from 4 to this maximum. (The function "n-grams" (2.2.1) shows the same values in the item "test", but only for one given length.) In contrast, the function "n-grams auto" does not show the list of particular n-grams.

Example of an output: the extract for n from 4 to 8. The left part gives the number of unique potentially amphipathic and non-amphipathic peptides in the proteome; the right part gives the total number of each. (The ratio can be plotted, e.g. in Excel.)

3.3 Other functions

3.3.1 The conversion of proteome to a sequence of polar and non-polar amino acids

The function is activated in the main menu via edit → convert amino acids to polar/nonpolar. The subsequent list of n-grams is then already converted into a sequence of N/P. The transformation is given by the following table.

Table of polar and non-polar amino acids:

Non-polar    G, A, V, L, I, P, M, F, W, C (J)
Polar        S, T, Y, N, Q, D, E, K, R, H (B, Z)

Do not remove the characters B,J,Z,U,X, which denote unidentified amino acids of known polarity, in the loading procedure!!
(These characters should be removed for the linguistic measures, where the alphabet size is reflected in the result.)

An example of the list: 4-grams converted into the sequence of P/N amino acids; potentially amphipathic n-grams are marked.

PPPP    969848
PPNP    489240    Amphipatic
PNPP    486666    Amphipatic
NPPP    465192
PPPN    464924
…

3.3.2 Detection of n-grams potentially forming amphipathic alpha helices

So-called amphipathic n-grams are recognized in the sequence according to the essential (and at the same time sufficient) condition that the period of polar and non-polar amino acids is 3.5 (i.e. 3 or 4). A (P,N) sequence thus belongs to amphipathic helices if P, PP or PPP alternates with N, NN or NNN in such a way that there is a P (call its coordinate 0) such that another P is found at position 3 or 4 (or both), yet another at position 7, and another at position 10 or 11 (or both). The same can be demanded for N.

The above definition is applicable to lengths at which the periodicity of 3.5 can be demonstrated at all. Therefore 4 is the minimal length: it shows at least one period and could be part of a longer helix of several periods.

Besides the list of n-grams for the given n itself (in the form, item "data"), there is also the extract of the numbers of unique amphipathic (and non-amphipathic) n-grams and their total numbers. (Numbers and statistics are available in the item "test"; see 2.2.2.)

An example of the list: the extract of all 4-grams of the organism Dictyostelium discoideum. There are 6 possible (and, in this case, existing) amphipathic and 10 non-amphipathic unique types of n-grams.

The statistical test shows whether the distribution of amphipathic n-grams (unique and also total) in the whole proteome is uniform. The test used is the Kolmogorov-Smirnov test; the hypothesis of uniformity is rejected if the test statistic exceeds the critical value. The level of significance is 0.05.
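For the minimal length n = 4, one reading of the rule that reproduces the split into 6 amphipathic and 10 non-amphipathic types is: the first and the last letter are equal (the same polarity recurs at position 3) and at least one letter of the opposite polarity lies between them. This reading is an assumption inferred from the examples in this manual (PPNP and PNPP marked, PPPP, NPPP and PPPN not marked); it is not the programme's documented algorithm and it does not cover longer n-grams. The function name is hypothetical.

```python
from itertools import product

def is_amphipathic_4gram(gram):
    """Hypothetical rule for 4-grams over the letters P and N."""
    if len(gram) != 4 or set(gram) - {"P", "N"}:
        raise ValueError("expected a 4-gram over the letters P and N")
    # Same polarity at positions 0 and 3, and both polarities present.
    return gram[0] == gram[3] and len(set(gram)) == 2

all_4grams = ["".join(letters) for letters in product("PN", repeat=4)]
amphipathic = [g for g in all_4grams if is_amphipathic_4gram(g)]
print(len(amphipathic), len(all_4grams) - len(amphipathic))  # 6 10
```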
The actual distribution of amphipathic n-grams is visible from the graph (in the form, the item "image"). The blue line always denotes the uniform distribution for the corresponding parameters. The red line denotes the distribution of n-grams in the given sequence. In this case, the proteome of Dictyostelium discoideum, the hypothesis of uniformity is rejected at the significance level of 0.05. The position of the distribution curve of the investigated organism above or below the uniform distribution curve reflects an accumulation or a lack of amphipathic n-grams in the sequence of n-grams arranged in descending order.

3.3.3 Random selection of sample

The function provides a random selection of a given length from the sample (which can be useful for comparing proteomes of originally different lengths). It is available in the main menu under settings. The item "n-grams can be located inside proteins only" guarantees that the random sample is selected within one protein only, so that the sequence cannot be joined from adjacent proteins. The selection is made without replacement. It is good to check the "merge identical proteins" checkbox during the loading procedure, because a random sample could otherwise be selected within adjacent identical proteins; however, this option should have no significant effect if the analyzed sequence is filtered.

3.3.4 Comparison with random model - the Monte Carlo simulations

The comparison with a randomly generated model is possible for the linguistic characteristics. If activated in the main form, the random model is always shown in the graph as a blue line; it shows the situation for a random sequence with the same parameters (alphabet size, length of the text, window size) as the investigated sequence. Please note that activating the random model requires a longer computation time!

Figure: the graph of complexity CT of the mouse protein actin, with a length of 370 amino acids and window size 10. The x axis shows the length of words (window size) and the y axis shows the value of complexity CT. The random model is marked in blue; the thinner lines mark the 95% probability range.

4. Example of computation

Let us determine the ratio and distribution of potentially amphipathic peptides in the proteome of the organism Plasmodium yoelii. The procedure can be divided into 5 steps:

1. Loading the data (possibly with filtration) (see 1.1.1, 1.1.2)
2. Setting the length of the n-gram (on the main form, item "word list")
3. Selection of a random sample (according to the situation) (see 3.3.3)
4. Conversion of the sequence of amino acids into an N/P sequence (main menu: edit → convert amino acids to polar/nonpolar)
5. Calculation using the n-grams functions. (The list of N/P converted n-grams of the given length is displayed, together with the numbers of unique amphipathic and non-amphipathic n-grams and their total numbers, the statistics of the distribution of n-grams and the graph. The development of the distribution is visible from the graph.)
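Steps 1, 2, 4 and 5 can be imitated outside the programme on a toy input, as in the sketch below. It is an illustration under assumptions, not a reimplementation: the two short "proteins" are invented, the helper names are hypothetical, characters outside the polarity table (such as X and U) are simply skipped, n-grams are read with a sliding window inside each protein only, and the random selection of step 3 and the statistical test are left out.

```python
from collections import Counter

NON_POLAR = set("GAVLIPMFWCJ")   # table in 3.3.1, J = non-polar
POLAR = set("STYNQDEKRHBZ")      # table in 3.3.1, B and Z = polar

def parse_fasta(text):
    """Step 1: return the sequences of a FASTA string, without header lines."""
    proteins, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if current:
                proteins.append("".join(current))
                current = []
        elif line:
            current.append(line)
    if current:
        proteins.append("".join(current))
    return proteins

def to_polarity(seq):
    """Step 4: map amino acids to N (non-polar) or P (polar)."""
    out = []
    for aa in seq.upper():
        if aa in NON_POLAR:
            out.append("N")
        elif aa in POLAR:
            out.append("P")
        # Assumption: any other character is dropped.
    return "".join(out)

def count_ngrams(proteins, n):
    """Step 5: count n-grams that lie inside a single protein."""
    counts = Counter()
    for protein in proteins:
        for j in range(len(protein) - n + 1):
            counts[protein[j:j + n]] += 1
    return counts

fasta = ">protein_a\nGASKGA\n>protein_b\nSKGAS\n"   # invented toy data
n = 4                                               # step 2
converted = [to_polarity(p) for p in parse_fasta(fasta)]
counts = count_ngrams(converted, n)
for gram, freq in counts.most_common():
    print(gram, freq)
print("unique:", len(counts), "total:", sum(counts.values()))
```

The two toy proteins become NNPPNN and PPNNP, which gives four unique 4-grams and five 4-grams in total, with PPNN occurring twice.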