moses - Algorithms reference Terminology: "track" each calculation step in the analysis of a nucleotide sequence in moses is stored in a track. Each track holds a data matrix, which is a matrix of float (real) values. "data lines" one line of values in the data matrix. If tracks are used as input for algorithms, each data line is expected to hold certain data so the algorithm works correctly. If wrong source tracks are specified the result will be unusable. "Annotated sequence track" - track type in moses that holds a sequences and annotations, when loading a Genbank file using "load track" the track will be of this type. "Analysis tracks" all other tracks in moses are analysis track. "stencil track" a track that contains region information. The information is coded by consecutive values above zero. These tracks can be used to restrict calculations to subregions of a genome, for instance. Often ncRNA detection is only performed for intergenic regions. "no value" means a parameter should not be given a value, e.g. left blank. General parameters: generally a parameter in moses can be one of the following: - a float value ( param = -123.23) - a string value ( param = not) - a list of float values ( param = -12.3, 23.4, 65.78 ) - a list of string values ( param = a,t,c,g) - a matrix of float values ( param = 1.0, 1.2, 1.3; 2.3, 2.4, 2.5; 3.4, 3.1, 3.3 ) - a matrix of string values ( param = a,t,c,g; t, c, g, a; g, c, c, t) windowHalfSize = for algorithms using sliding windows, this is half the size of the window windowStep = for algorithms using sliding windows, this is the number of base pairs the window is shifted in each step windowOffset = for algorithms using sliding windows, this is position where the first window starts in the first step calculateUsingStencil = can be 0 or 1. for algorithms supporting the use of a stencil track to restrict calculations to a subregion of the analysed data this options defines if the stencil track is used or ignored. numberOfShuffledInputs= for algorithms supporting the use of a shuffled input data (to test statistical significance) this option defines the number of shuffled samples. Currently this is only used for the RNAFold-Algorithm Annotation filter This module is used to create analysis tracks based on the annotations in an annotated sequence track. source tracks 1 - annotated sequence track with the annotated sequence track parameters annotationTypesToExtract = you can select the type of annotations by specifying a comma seperated list of integers. Each integer is one type as they are given in the statistics part of the data info panel for an Annotated sequence track but the first type is number 0 not 1. regionToExtract = one of the following: left, right, start, end or no value. no value means the full annotations are extracted. left/right mean the flanks of the annotations e.g. before/after a gene. regionMayOverlapOtherAnnotations = can be 0 or 1. Specifies if the extracted regions should be cut off if another overlapping annotation is detected. regionToExtractSize = size of the region to extract, for instance if regionToExtract = left and regionToExtractSize = 200 the output will be a track where the the first 200base pairs before each annotations are highlighted. This can be useful if this is the only approximation for putative promoter regions. invertOutput = can be 0 or 1. To easily extract intergenic regions the output of the algorithm can be inverted, e.g. 0 is replaced with 1 and 1 with 0. Example 1: "Highlight the first 500 bp before each rRNA" source track: "Pyrococcus furiosus genome" annotationTypesToExtract = 3,4 (if 3 is the number of +rRNA and 4 of -rRNA, this depends on the annotated sequence track used) regionToExtract = left regionMayOverlapOtherAnnotations = 1 (means if there are no unannotated 500bp the highlighted region will overlap the annotations that come before the 500bp have ended) regionToExtractSize = 500 invertOutput = 0 Example 2: "Highlight all intergenic regions, only protein genes count as genes" source track: "Pyrococcus furiosus genome" annotationTypesToExtract = 0,1 (if 0,1 correspond to +gene, -gene this depends on the Genbank file used.) regionToExtract = regionMayOverlapOtherAnnotations = regionToExtractSize = invertOutput = 1 Basic Operations This module is used to process and prepare the output of other modules. source tracks depend on the operation used. parameters value = any real value. uses as user-specified value for some operations. operation = see below. stepSize = any real value. see discription of "steps" below. steps = positive integer value. if steps is above zero the operation will be performed several times as specified by the value of steps. Each time after the first run the parameter "value" will be increased by "stepSize". The result of each run is stored in new data lines in the resulting data matrix. operations: - output1 means output data line 1, input1, input2 ... mean input data lines. - arithmetic and logical operations are carried out by base-pair - for the operations and, or, *, /, +, >, <, =: if only one source track is given with only 1 data line, an additional data line used as the first data line were all values are as specified by the parameter "value", the second data line is the one given by the source track. logical operations: - values above 0 will be interpreted as 1, all others as 0. and: output1 = input1 AND input 2 AND input3 ... or: output1 = input1 OR input 2 OR input3 ... not: output1 = NOT ( input1 OR input 2 OR input3 ...) arithmetic operations - v = value defined by the parameter "value". +: output1 = input1 + input2 + ... -: output1 = input1 - input2 - ... *: output1 = input1 * input2 - ... /: output1 = input1 / input2 (only input1 and 2 are used!) >: output1 = 1 if input1 > input2 > input3 >... holds <: output1 = 1 if input1 < input2 < input3 >... holds =: output1 = 1 if input1 = input2 = input3 =... holds max: output1 = max( input1, input2, ...) max: output1 = min( input1, input2, ...) invert: output1 = maxValue1 - input1 (only input 1 is used, maxValue1 is the maximum of all values in data line 1) abs: output1 = absolute value (input1) (only input 1 is used) normalize: each data line is normalized, that means for each value the normalized value normalized = ( value - mininumOfDataLine ) / maximumOfDataLine is calculated. linear combination: output1 = input1 + input2 * v above linear combination: output1 = 1 if input3 > (input1 + input2 * v) below linear combination: output1 = 1 if input3 < (input1 + input2 * v) count if above: for each position in the input data lines the output is the number of data lines with values above v at that position. count if below: for each position in the input data lines the output is the number of data lines with values above v at that position. fill: output1 = v random: fills the output data line randomly with "1"s the rate for "1"s is given by v. randomize: if one input line is given, this operation fills the output data line randomly with "1"s the rate for "1"s is given by the number of values above zero in the input data line divided by the length of the input data line. if two input lines are given, this operation is performed interval-wise with intervals defined by the second input line. Intervals are defined by consecutive values above zero. arithmetic operations for one data line accumulate: calculates the sum of the values of data line 1 from beginning to end. The sum at each position is stored in the output data line. See example: input data line: 5,3, 1, -4, 0, 5, 3, 2 output data line: 5,8,9, 5, 5, 10,13,15 position: output the position at each position of the output data line. This is useful to define intervals using a subsequence BasicOperationsAlgorithm with ">" or "<". Example: output data line: 1,2,3,4,5,6... (length defined by source track) track manipulations merge: creates a new data matrix that contains all data lines from all the source tracks. select: creates a new data matrix that contains only the data line from the data lines of all source tracks specified by the parameter "value". extract:creates a track where the output data line is created by copying certain parts of input data line 1. The parts to be copied are defined by input data line 2, each value where input data line 2 is above zero has to be copied from data line 1. The copied values are not stored at there original positions in the output data line but they are stored back to back. sequence operations: complement, reverse: calculate the complement or reverse sequence. The sequence is exspected in data line 1 in ascii format. (Annotated sequence tracks have the sequence in data line 1 in ascii format.) interval operations Annotations and regions in a sequence can be seen as intervals. For interval operations data line is interpreted as definition of a number of intervals. An interval is defined by consecutive values above 0. The output is a data line where the manipulated intervals are stored using consecutive "1"'s. v = value defined by the parameter "value". extend left: the start of all intervals is decreased by v extend right: the end of all intervals is increased by v extend: the start of all intervals is decreased by v and the end of all intervals is increased by v extend to: the start of all intervals is decreased and the end of all intervals is increased so that the resulting length of the interval is v. If the length was larger than v the interval is not changed. shrink left: the start of all intervals is increased by v shrink right: the end of all intervals is decreased by v shrink: the start of all intervals is increased by v and the end of all intervals is decreased by v shrink to: the start of all intervals is increased and the end of all intervals is decreased so that the resulting length of the interval is v. If the length was smaller than v the interval is not changed. length above: delete all intervals that are not longer than v length below: delete all intervals that are not shorter than v Example "Threshold filter" source track "GC-content of Pyrococcus Furiosus, complete genome" value = 0.5 operation = < Example "Count number of predictions" source tracks "signal by RNAz", "signal by RNAFold", "signal by Dynalign", "Terminators" value = operation = count if above Example "find genes longer than 1000bp" source track "Annotation filter on Pyrococcus Furiosus, complete genome" value = 1000 operation = length above BLAST - [1] This module can BLAST a genome against a genome, query against a genome or genome against a database. Query-Genome will be used if the parameter querySequence is not empty. Genome-Genome will be used if querySequence is empty and two genomes have been selected as source tracks. Genome-Database will be used else source tracks 1 - Genome that is analysed for conserved regions (Annotated Sequence track) 2 - optional: Genome to be BLASTed against (Annotated Sequence track) parameters eValueCutOff = E-Value cut off for the BLAST algorithm. commandLine = additional command line options to be used to run BLAST. blastDatabase = Genome-Database mode will be used if no querySequence is supplied and only 1 genome is selected as source track, this has to be the filename of the corresponding database files. The database has to be prepared using the formatDB tool that comes with BLAST. If the created database files are named C:\temp\database.nhr, C:\temp\database.nin and C:\temp\database.nsq the parameter has to be C:\\temp\\database. querySequence = If a query sequence is supplied this query will be searched in the genome given as the first source track. outputOption = one of the following: eValue, score, ln score, identity, eValue only, score only, ln score only, identity only. The use of the "only" option will result in only one output line. Without "only" the second output line will contain location (in base pairs) where the BLAST hits are found in the reference genome if genomegenome BLAST is used. Example "Search motif in genome" sourceTrack: "Pyrococcus Furiosus, complete genome" eValueCutOff = 10 querySequence = "actgcatgctgac" outputOption = score only Example "Look for hits to Rfam dump" sourceTrack: "Pyrococcus Furiosus, complete genome" eValueCutOff = 10 blastDatabase = C:\\Rfam.fasta outputOption = score only Example "Look for best hit to reference Genome" sourceTrack: "Pyrococcus Furiosus, complete genome", "Reference genome" eValueCutOff = 10 outputOption = score DNAProperties - [2] This module calculates one of the available thermodynamic properties for sliding windows of an analyzed DNA sequence: source tracks 1 - annotated sequence track with the annotated sequence track. OR 1 - Track with sequence parameters windowoffset/windowstep/windowsize wordSize = if a custom conversion table is to be used the word size for the conversion, can be 2 or 3 conversionTable = if a custom conversion table is to be used (property = custom) the conversion table, the format is simply a list of float values, corresponding to the di/trinucleotides. The order of the di/trinucleotides is a follows: di: aa, at, ag, ac, ta, tt, tg, tc, ga, gt, gg, gc, ca, ct, cg, cc, tri: aaa, aat, aag, aac, ata, att, atg, atc, aga, agt, agg, agc, aca, act, acg, acc, taa, tat, tag, tac, tta, ttt, ttg, ttc, tga, tgt, tgg, tgc, tca, tct, tcg, tcc, gaa, gat, gag, gac, gta, gtt, gtg, gtc, gga, ggt, ggg, ggc, gca, gct, gcg, gcc, caa, cat, cag, cac, cta, ctt, ctg, ctc, cga, cgt, cgg, cgc, cca, cct, ccg, ccc, property = one of the following: aphilicity, base stacking, b-dna twist, bendability, bending stiffness, cpg islands, cpnpgcpg islands, cpnpgislands, dna denaturation, duplex stability disrupt energy, duplex stability free energy, nucleosome position, propellor twist, protein deformation, protein dna twist, z dna Example: "Calcuate bending stiffness in a sliding window of size 51" source track: "Pyrococcus furiosus complete genome" windowoffset = 0 windowStep = 1 windowSize = 25 property = bending stiffness Dynalign - [3] Dynalign is a tool that calculates the secondary structure common to two sequences. moses uses for each window of an analyzed genome the most similar sequence from references genomes for alignment. source tracks 1 - Annotated sequence track with Analyzed genome 2 - Annotated sequence track with reference genome 1 3 - BLAST track for Analyzed genome against reference genome 1 4 - Annotated sequence track with reference genome 2 5 - BLAST track for Analyzed genome against reference genome 3 ... ... last - optional stencil track the BLAST tracks may not use the "only" option in the parameter output option so that location information is not discared. parameters windowoffset/windowstep/windowsize/calculateUsingStencil Example "Use Dynalign on EColi using four reference Genomes" source tracks "Ecoli genome", "Entero genome", "BLAST Ecoli vs Entero", "Erwinia genome", "BLAST Ecoli vs Erwinia", "Klebsiella genome", "BLAST Ecoli vs Klebsiella", "Salmonella genome", "BLAST Ecoli vs Salmonella", "Ecoli intergenic regions" windowoffset = 0 windowStep = 1 windowSize = 37 calculateUsingStencil = 1 Hidden Markov Models - untested ! This module can be used to see if sliding windows of an analyzed nucleotide or numerical sequence matches a given hidden markov model (mode = match) or it can calculate for each window the most probable state at the center of the window (mode = decode). source track 1 - analysed nucleotide sequence or numerical sequence parameters mode = can be match or decode transitionProbabilities = transition probabilites matrix. emissionProbabilities = emission probablities matrix. startProbabilities = start probabilites vector. stateLabels = state labels; a list of strings. emissionLabels = emission labels; a list of strings. Example "find motif" windowoffset = 0 windowStep = 1 windowSize = 25 mode = match transitionProbabilities = 0.5,0.5,0,0; 0,0.5,0.5,0; 0,0,0.5,0.5; 0,0,0,1 emissionProbabilities = 0.9, 0.033, 0.033, 0.033; 0.033, 0.9, 0.033, 0.033; 0.033, 0.033, 0.9, 0.033; 0.033, 0.033, 0.033, 0.9 startProbabilities = 0.9, 0.033, 0.033, 0.033 stateLabels = a,t,g,c emissionLabels = a,t,g,c ( the Hidden markov has to model a sequence motif) Length Filter The length filter highlights consecutive or almost consecutive values above zero in the source track. Gap sizes can be allowed in the sequence of values. In the output data line regions will be coded by consecutive "1"'s that define the regions that pass the filter. source tracks 1 - Analysis track to filter parameters minimalRequiredLength = the number of values above zero that have to be found for the region to pass the filter. maximumSingleGapSize = maximum size of a single gap inside the region with the sufficient number of values above zero. maximumGapSize = maximum number of Example "find intervals longer than 100bp" minimalRequiredLength = 101 maximumSingleGapSize = 0 maximumGapSize = 0 Example "cluster ncRNA signals" minimalRequiredLength = 101 maximumSingleGapSize = 30 maximumGapSize = 50 Match Signal To Annotated Gene This module is used to classify predictions into confirmations of already known ncRNA-Genes or as novel findings. The output will contain a 1 for a prediction that overlaps an annotation, 2 for a prediction close to the start of an annotation, 3 if the prediction is close to the end of an annotation and 4 if the prediction is not close to any annotation. The output of this track can be used to generate reports of the results of a prediction method to have list of gene candidates that can be used for experimental verification. source tracks input data line 1 - signal data input data line 2 - sequence (not used) input data line 3 - annotations input data line 4 - custom annotations - The input format of this module was chosen to make this module conveniently usable with annotated sequence track, which contain one data line with the sequence and a second with the annotations, parameters maxDistanceToAcceptMatchLeft = definition of the neighborhood of annotations, signals that are closer to the start of an annotation than this value will be considered in the neighborhood of this annotation, the corresponding output value is 2. maxDistanceToAcceptMatchRight = definition of the neighborhood of annotations, signals that are closer to the end of an annotation than this value will be considered in the neighborhood of this annotation, the corresponding output value is 3. maxGapSizeInSignal = A signal will be classified as a whole. The signal can have gaps but it can be useful not to consider it two distinct signals just because of a small gap. The length filter algorithm and other methods should be used in favor of this simple method to cluster signals. output = can be signal or gene. "signal" will results in the behavior described above. The use of gene will generate an output data line were all predicted genes are marked by "1"'s. Example "Mark genes that were predicted by RNAz" source tracks "RNAz above 0.995", "Pyrococcus Furiosus complete genome" maxDistanceToAcceptMatchLeft = 5 maxDistanceToAcceptMatchRight = 5 maxGapSizeInSignal = 5 output = "gene" Example "Classify RNAz predictions" source tracks "RNAz above 0.995", "Pyrococcus Furiosus complete genome" maxDistanceToAcceptMatchLeft = 5 maxDistanceToAcceptMatchRight = 5 maxGapSizeInSignal = 5 output = "signal" Random Sequence This module generates random sequences. source tracks 1 - sequence track to define the nucleotide composition and length of the random sequence. parameters windowed = can be 1 or 0. if the value is 1 the nucleotide composition will be sampled using sliding windows from the sequence in the source track. Otherwise the composition of the whole sequence is used length = length of the new random sequence if no source track is used. aRatio = ratio of nucleotide "a" if no source track is used. gRatio = ratio of nucleotide "g" if no source track is used. tRatio = ratio of nucleotide "t" if no source track is used, ratio of nucleotide "c" is the difference to one from the sum of aRatio, gRatio and tRatio. RNAFold - [4] moses uses RNAFold to predict the se condary structures of sliding windows of a nucleotide sequence. source tracks 1 - Annotated sequence track 2 - optional stencil track parameters windowoffset/windowstep/windowsize/calculateUsingStencil/numberOfShuffledInputs Example "Calculate RNAFold profile" windowoffset = 0 windowStep = 1 windowSize = 37 calculateUsingStencil = 1 numberOfShuffledInputs = 0 Example "Calculate RNAFold mean and standard deviation for shuffled slidind windows" windowoffset = 0 windowStep = 1 windowSize = 37 calculateUsingStencil = 1 numberOfShuffledInputs = 100 RNAz - [5] moses allows to run RNAz using this modul. source tracks 1 - Annotated sequence track with Analyzed genome 2 - Annotated sequence track with reference genome 1 3 - BLAST track for Analyzed genome against reference genome 1 4 - Annotated sequence track with reference genome 2 5 - BLAST track for Analyzed genome against reference genome 3 ... ... last - optional stencil track the BLAST tracks may not use the "only" option in the parameter output option so that location information is not discared. parameters windowoffset/windowstep/windowsize/calculateUsingStencil Example "Use RNAz on EColi using four reference Genomes" source tracks "Ecoli genome", "Entero genome", "BLAST Ecoli vs Entero", "Erwinia genome", "BLAST Ecoli vs Erwinia", "Klebsiella genome", "BLAST Ecoli vs Klebsiella", "Salmonella genome", "BLAST Ecoli vs Salmonella", "Ecoli intergenic regions" windowoffset = 0 windowStep = 1 windowSize = 37 calculateUsingStencil = 1 Statistical Filter This module calculates statistics for numerical profiles. source tracks 1 - Analysis track parameters windowoffset/windowstep/windowsize output = one of the following: mean, std, min, max, jump, peak, above mean, below mean, local statistics, local statistics view, above local mean, below local mean. - mean, std, min, max calculate the corresponding figure for all values in input data line 1. All positions in the output data line will contain that value. - jump, peak perform jump detection / peak detection - local statistics calculates mean and standard deviation for sliding windows each in one output data line. - local statistics view calculates output data line1: (local mean of window) + l * (standard deviation of window) output data line2: (local mean of window) output data line3: (local mean of window) - l * (standard deviation of window) Where l is the value of parameter "limitInStandardDeviationsFromMean". This is useful using display mode 2 and 4 (see user reference) to view the characteristics of the numerical profile - above/below local mean calculates a "1" if the value at the center of the sliding window is above/below the mean of the window. limitInStandardDeviationsFromMean = a parameter used for jump/peak detection and the functions above/below local mean. WordAnalysis This module calculates the ratio of the number of occurence of a word to the maximally possible number of occurences. Multiple words can be used at once, the corresponding ratios are summed. The typical use for this module is calculation of GC-content or detection of Start/Stop codons. parameters windowoffset/windowstep/windowsize WordsToLookFor = words for which to perform the analysis. References [1] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman, “Basic local alignment search tool.,” J Mol Biol, vol. 215, Oct. 1990, pp. 403–410. [2] T. Abeel, Y. Saeys, E. Bonnet, P. Rouzé, and Y. Van de Peer, “Generic eukaryotic core promoter prediction using structural features of DNA,” Genome Research, vol. 18, Feb. 2008, pp. 310-323. [3] D.H. Mathews, “Predicting a set of minimal free energy RNA secondary structures common to two sequences,” Bioinformatics, vol. 21, May. 2005, pp. 2246-2253. [4] I. Hofacker, W. Fontana, P. Stadler, S. Bonhoeffer, M. Tacker, and P. Schuster, “Fast Folding and Comparison of RNA Secondary Structures,” Monatsh. Chem., vol. 125, 1994, pp. 188, 167. [5] S. Washietl, I.L. Hofacker, and P.F. Stadler, “Fast and reliable prediction of noncoding RNAs,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, Feb. 2005, pp. 2454-2459.