moses - Algorithms reference

advertisement
moses - Algorithms reference
Terminology:
"track" each calculation step in the analysis of a nucleotide sequence in moses is stored in a track.
Each track holds a data matrix, which is a matrix of float (real) values.
"data lines" one line of values in the data matrix. If tracks are used as input for algorithms, each
data line is expected to hold certain data so the algorithm works correctly. If wrong source tracks
are specified the result will be unusable.
"Annotated sequence track" - track type in moses that holds a sequences and annotations, when
loading a Genbank file using "load track" the track will be of this type.
"Analysis tracks" all other tracks in moses are analysis track.
"stencil track" a track that contains region information. The information is coded by consecutive
values above zero. These tracks can be used to restrict calculations to subregions of a genome, for
instance. Often ncRNA detection is only performed for intergenic regions.
"no value" means a parameter should not be given a value, e.g. left blank.
General parameters:
generally a parameter in moses can be one of the following:
- a float value ( param = -123.23)
- a string value ( param = not)
- a list of float values ( param = -12.3, 23.4, 65.78 )
- a list of string values ( param = a,t,c,g)
- a matrix of float values ( param = 1.0, 1.2, 1.3; 2.3, 2.4, 2.5; 3.4, 3.1, 3.3 )
- a matrix of string values ( param = a,t,c,g; t, c, g, a; g, c, c, t)
windowHalfSize = for algorithms using sliding windows, this is half the size of the window
windowStep = for algorithms using sliding windows, this is the number of base pairs the window
is shifted in each step
windowOffset = for algorithms using sliding windows, this is position where the first window
starts in the first step
calculateUsingStencil = can be 0 or 1. for algorithms supporting the use of a stencil track to
restrict calculations to a subregion of the analysed data this options defines if the stencil track is
used or ignored.
numberOfShuffledInputs= for algorithms supporting the use of a shuffled input data (to test
statistical significance) this option defines the number of shuffled samples. Currently this is only
used for the RNAFold-Algorithm
Annotation filter
This module is used to create analysis tracks based on the annotations in an annotated sequence
track.
source tracks
1 - annotated sequence track with the annotated sequence track
parameters
annotationTypesToExtract = you can select the type of annotations by specifying a comma
seperated list of integers. Each integer is one type as they are given in the statistics part of the
data info panel for an Annotated sequence track but the first type is number 0 not 1.
regionToExtract = one of the following: left, right, start, end or no value. no value means the
full annotations are extracted. left/right mean the flanks of the annotations e.g. before/after a
gene.
regionMayOverlapOtherAnnotations = can be 0 or 1. Specifies if the extracted regions should
be cut off if another overlapping annotation is detected.
regionToExtractSize = size of the region to extract, for instance if regionToExtract = left and
regionToExtractSize = 200 the output will be a track where the the first 200base pairs before each
annotations are highlighted. This can be useful if this is the only approximation for putative
promoter regions.
invertOutput = can be 0 or 1. To easily extract intergenic regions the output of the algorithm can
be inverted, e.g. 0 is replaced with 1 and 1 with 0.
Example 1: "Highlight the first 500 bp before each rRNA"
source track: "Pyrococcus furiosus genome"
annotationTypesToExtract = 3,4 (if 3 is the number of +rRNA and 4 of -rRNA, this depends on
the annotated sequence track used)
regionToExtract = left
regionMayOverlapOtherAnnotations = 1 (means if there are no unannotated 500bp the
highlighted region will overlap the annotations that come before the 500bp have ended)
regionToExtractSize = 500
invertOutput = 0
Example 2: "Highlight all intergenic regions, only protein genes count as
genes"
source track: "Pyrococcus furiosus genome"
annotationTypesToExtract = 0,1 (if 0,1 correspond to +gene, -gene this depends on the Genbank
file used.)
regionToExtract =
regionMayOverlapOtherAnnotations =
regionToExtractSize =
invertOutput = 1
Basic Operations
This module is used to process and prepare the output of other modules.
source tracks
depend on the operation used.
parameters
value = any real value. uses as user-specified value for some operations.
operation = see below.
stepSize = any real value. see discription of "steps" below.
steps = positive integer value. if steps is above zero the operation will be performed several times
as specified by the value of steps. Each time after the first run the parameter "value" will be
increased by "stepSize". The result of each run is stored in new data lines in the resulting data
matrix.
operations:
- output1 means output data line 1, input1, input2 ... mean input data lines.
- arithmetic and logical operations are carried out by base-pair
- for the operations and, or, *, /, +, >, <, =: if only one source track is given with only 1 data line,
an additional data line used as the first data line were all values are as specified by the parameter
"value", the second data line is the one given by the source track.
logical operations:
- values above 0 will be interpreted as 1, all others as 0.
and: output1 = input1 AND input 2 AND input3 ...
or:
output1 = input1 OR input 2 OR input3 ...
not: output1 = NOT ( input1 OR input 2 OR input3 ...)
arithmetic operations
- v = value defined by the parameter "value".
+:
output1 = input1 + input2 + ...
-:
output1 = input1 - input2 - ...
*:
output1 = input1 * input2 - ...
/:
output1 = input1 / input2
(only input1 and 2 are used!)
>:
output1 = 1 if input1 > input2 > input3 >... holds
<:
output1 = 1 if input1 < input2 < input3 >... holds
=:
output1 = 1 if input1 = input2 = input3 =... holds
max: output1 = max( input1, input2, ...)
max: output1 = min( input1, input2, ...)
invert: output1 = maxValue1 - input1 (only input 1 is used, maxValue1 is the maximum of all
values in data line 1)
abs:
output1 = absolute value (input1) (only input 1 is used)
normalize:
each data line is normalized, that means for each value the normalized value
normalized = ( value - mininumOfDataLine ) / maximumOfDataLine is
calculated.
linear combination:
output1 = input1 + input2 * v
above linear combination:
output1 = 1 if input3 > (input1 + input2 * v)
below linear combination:
output1 = 1 if input3 < (input1 + input2 * v)
count if above: for each position in the input data lines the output is the number of data lines with
values above v at that position.
count if below: for each position in the input data lines the output is the number of data lines with
values above v at that position.
fill:
output1 = v
random:
fills the output data line randomly with "1"s the rate for "1"s is given by v.
randomize:
if one input line is given, this operation fills the output data line randomly with
"1"s the rate for "1"s is given by the number of values above zero in the input data
line divided by the length of the input data line.
if two input lines are given, this operation is performed interval-wise with
intervals defined by the second input line. Intervals are defined by consecutive
values above zero.
arithmetic operations for one data line
accumulate: calculates the sum of the values of data line 1 from beginning to end. The sum at
each position is stored in the output data line. See example:
input data line:
5,3, 1, -4, 0, 5, 3, 2
output data line:
5,8,9, 5, 5, 10,13,15
position:
output the position at each position of the output data line. This is useful to define
intervals using a subsequence BasicOperationsAlgorithm with ">" or "<".
Example:
output data line:
1,2,3,4,5,6... (length defined by source track)
track manipulations
merge: creates a new data matrix that contains all data lines from all the source tracks.
select: creates a new data matrix that contains only the data line from the data lines of all source
tracks specified by the parameter "value".
extract:creates a track where the output data line is created by copying certain parts of input data
line 1. The parts to be copied are defined by input data line 2, each value where input data
line 2 is above zero has to be copied from data line 1. The copied values are not stored at
there original positions in the output data line but they are stored back to back.
sequence operations:
complement, reverse: calculate the complement or reverse sequence. The sequence is exspected
in data line 1 in ascii format. (Annotated sequence tracks have the
sequence in data line 1 in ascii format.)
interval operations
Annotations and regions in a sequence can be seen as intervals. For interval operations data line
is interpreted as definition of a number of intervals. An interval is defined by consecutive values
above 0. The output is a data line where the manipulated intervals are stored using consecutive
"1"'s.
v = value defined by the parameter "value".
extend left: the start of all intervals is decreased by v
extend right: the end of all intervals is increased by v
extend:
the start of all intervals is decreased by v and the end of all intervals is increased
by v
extend to:
the start of all intervals is decreased and the end of all intervals is increased so that
the resulting length of the interval is v. If the length was larger than v the interval
is not changed.
shrink left:
the start of all intervals is increased by v
shrink right: the end of all intervals is decreased by v
shrink:
the start of all intervals is increased by v and the end of all intervals is decreased
by v
shrink to:
the start of all intervals is increased and the end of all intervals is decreased so that
the resulting length of the interval is v. If the length was smaller than v the interval
is not changed.
length above: delete all intervals that are not longer than v
length below: delete all intervals that are not shorter than v
Example "Threshold filter"
source track "GC-content of Pyrococcus Furiosus, complete genome"
value = 0.5
operation = <
Example "Count number of predictions"
source tracks "signal by RNAz", "signal by RNAFold", "signal by Dynalign", "Terminators"
value =
operation = count if above
Example "find genes longer than 1000bp"
source track "Annotation filter on Pyrococcus Furiosus, complete genome"
value = 1000
operation = length above
BLAST - [1]
This module can BLAST a genome against a genome, query against a genome or genome against
a database.
Query-Genome
will be used if the parameter querySequence is not empty.
Genome-Genome
will be used if querySequence is empty and two genomes have been
selected as source tracks.
Genome-Database
will be used else
source tracks
1 - Genome that is analysed for conserved regions (Annotated Sequence track)
2 - optional: Genome to be BLASTed against (Annotated Sequence track)
parameters
eValueCutOff = E-Value cut off for the BLAST algorithm.
commandLine = additional command line options to be used to run BLAST.
blastDatabase = Genome-Database mode will be used if no querySequence is supplied and only 1
genome is selected as source track, this has to be the filename of the
corresponding database files. The database has to be prepared using the formatDB
tool that comes with BLAST. If the created database files are named
C:\temp\database.nhr, C:\temp\database.nin and C:\temp\database.nsq the
parameter has to be C:\\temp\\database.
querySequence = If a query sequence is supplied this query will be searched in the genome given
as the first source track.
outputOption = one of the following: eValue, score, ln score, identity, eValue only, score only, ln
score only, identity only. The use of the "only" option will result in only one
output line. Without "only" the second output line will contain location (in base
pairs) where the BLAST hits are found in the reference genome if genomegenome BLAST is used.
Example "Search motif in genome"
sourceTrack: "Pyrococcus Furiosus, complete genome"
eValueCutOff = 10
querySequence = "actgcatgctgac"
outputOption = score only
Example "Look for hits to Rfam dump"
sourceTrack: "Pyrococcus Furiosus, complete genome"
eValueCutOff = 10
blastDatabase = C:\\Rfam.fasta
outputOption = score only
Example "Look for best hit to reference Genome"
sourceTrack: "Pyrococcus Furiosus, complete genome", "Reference genome"
eValueCutOff = 10
outputOption = score
DNAProperties - [2]
This module calculates one of the available thermodynamic properties for sliding windows of an
analyzed DNA sequence:
source tracks
1 - annotated sequence track with the annotated sequence track. OR 1 - Track with sequence
parameters
windowoffset/windowstep/windowsize
wordSize = if a custom conversion table is to be used the word size for the conversion, can be 2
or 3
conversionTable = if a custom conversion table is to be used (property = custom) the conversion
table, the format is simply a list of float values, corresponding to the di/trinucleotides. The order
of the di/trinucleotides is a follows:
di: aa, at, ag, ac,
ta, tt, tg, tc, ga, gt, gg, gc, ca, ct, cg, cc,
tri:
aaa, aat, aag, aac,
ata, att, atg, atc,
aga, agt, agg, agc,
aca, act, acg, acc,
taa, tat, tag, tac,
tta, ttt, ttg, ttc,
tga, tgt, tgg, tgc,
tca, tct, tcg, tcc,
gaa, gat, gag, gac,
gta, gtt, gtg, gtc,
gga, ggt, ggg, ggc,
gca, gct, gcg, gcc,
caa, cat, cag, cac,
cta, ctt, ctg, ctc,
cga, cgt, cgg, cgc,
cca, cct, ccg, ccc,
property = one of the following: aphilicity, base stacking, b-dna twist, bendability, bending
stiffness, cpg islands, cpnpgcpg islands, cpnpgislands, dna denaturation, duplex stability disrupt
energy, duplex stability free energy, nucleosome position, propellor twist, protein deformation,
protein dna twist, z dna
Example: "Calcuate bending stiffness in a sliding window of size 51"
source track: "Pyrococcus furiosus complete genome"
windowoffset = 0
windowStep = 1
windowSize = 25
property = bending stiffness
Dynalign - [3]
Dynalign is a tool that calculates the secondary structure common to two sequences. moses uses
for each window of an analyzed genome the most similar sequence from references genomes for
alignment.
source tracks
1 - Annotated sequence track with Analyzed genome
2 - Annotated sequence track with reference genome 1
3 - BLAST track for Analyzed genome against reference genome 1
4 - Annotated sequence track with reference genome 2
5 - BLAST track for Analyzed genome against reference genome 3
...
...
last - optional stencil track
the BLAST tracks may not use the "only" option in the parameter output option so that location
information is not discared.
parameters
windowoffset/windowstep/windowsize/calculateUsingStencil
Example "Use Dynalign on EColi using four reference Genomes"
source tracks "Ecoli genome", "Entero genome", "BLAST Ecoli vs Entero", "Erwinia genome",
"BLAST Ecoli vs Erwinia", "Klebsiella genome", "BLAST Ecoli vs Klebsiella", "Salmonella
genome", "BLAST Ecoli vs Salmonella", "Ecoli intergenic regions"
windowoffset = 0
windowStep = 1
windowSize = 37
calculateUsingStencil = 1
Hidden Markov Models - untested !
This module can be used to see if sliding windows of an analyzed nucleotide or numerical
sequence matches a given hidden markov model (mode = match) or it can calculate for each
window the most probable state at the center of the window (mode = decode).
source track
1 - analysed nucleotide sequence or numerical sequence
parameters
mode = can be match or decode
transitionProbabilities = transition probabilites matrix.
emissionProbabilities = emission probablities matrix.
startProbabilities = start probabilites vector.
stateLabels = state labels; a list of strings.
emissionLabels = emission labels; a list of strings.
Example "find motif"
windowoffset = 0
windowStep = 1
windowSize = 25
mode = match
transitionProbabilities = 0.5,0.5,0,0; 0,0.5,0.5,0; 0,0,0.5,0.5; 0,0,0,1
emissionProbabilities = 0.9, 0.033, 0.033, 0.033; 0.033, 0.9, 0.033, 0.033; 0.033, 0.033, 0.9,
0.033; 0.033, 0.033, 0.033, 0.9
startProbabilities = 0.9, 0.033, 0.033, 0.033
stateLabels = a,t,g,c
emissionLabels = a,t,g,c
( the Hidden markov has to model a sequence motif)
Length Filter
The length filter highlights consecutive or almost consecutive values above zero in the source
track. Gap sizes can be allowed in the sequence of values. In the output data line regions will be
coded by consecutive "1"'s that define the regions that pass the filter.
source tracks
1 - Analysis track to filter
parameters
minimalRequiredLength = the number of values above zero that have to be found for the region
to pass the filter.
maximumSingleGapSize = maximum size of a single gap inside the region with the sufficient
number of values above zero.
maximumGapSize = maximum number of
Example "find intervals longer than 100bp"
minimalRequiredLength = 101
maximumSingleGapSize = 0
maximumGapSize = 0
Example "cluster ncRNA signals"
minimalRequiredLength = 101
maximumSingleGapSize = 30
maximumGapSize = 50
Match Signal To Annotated Gene
This module is used to classify predictions into confirmations of already known ncRNA-Genes or
as novel findings. The output will contain a 1 for a prediction that overlaps an annotation, 2 for a
prediction close to the start of an annotation, 3 if the prediction is close to the end of an
annotation and 4 if the prediction is not close to any annotation.
The output of this track can be used to generate reports of the results of a prediction method to
have list of gene candidates that can be used for experimental verification.
source tracks
input data line 1 - signal data
input data line 2 - sequence (not used)
input data line 3 - annotations
input data line 4 - custom annotations
- The input format of this module was chosen to make this module conveniently usable with
annotated sequence track, which contain one data line with the sequence and a second with the
annotations,
parameters
maxDistanceToAcceptMatchLeft = definition of the neighborhood of annotations, signals that are
closer to the start of an annotation than this value will be
considered in the neighborhood of this annotation, the
corresponding output value is 2.
maxDistanceToAcceptMatchRight = definition of the neighborhood of annotations, signals that
are closer to the end of an annotation than this value will be
considered in the neighborhood of this annotation, the
corresponding output value is 3.
maxGapSizeInSignal = A signal will be classified as a whole. The signal can have gaps but it can
be useful not to consider it two distinct signals just because of a small gap.
The length filter algorithm and other methods should be used in favor of
this simple method to cluster signals.
output = can be signal or gene. "signal" will results in the behavior described above. The use of
gene will generate an output data line were all predicted genes are marked by "1"'s.
Example "Mark genes that were predicted by RNAz"
source tracks "RNAz above 0.995", "Pyrococcus Furiosus complete genome"
maxDistanceToAcceptMatchLeft = 5
maxDistanceToAcceptMatchRight = 5
maxGapSizeInSignal = 5
output = "gene"
Example "Classify RNAz predictions"
source tracks "RNAz above 0.995", "Pyrococcus Furiosus complete genome"
maxDistanceToAcceptMatchLeft = 5
maxDistanceToAcceptMatchRight = 5
maxGapSizeInSignal = 5
output = "signal"
Random Sequence
This module generates random sequences.
source tracks
1 - sequence track to define the nucleotide composition and length of the random sequence.
parameters
windowed = can be 1 or 0. if the value is 1 the nucleotide composition will be sampled using
sliding windows from the sequence in the source track. Otherwise
the composition of the whole sequence is used
length = length of the new random sequence if no source track is used.
aRatio = ratio of nucleotide "a" if no source track is used.
gRatio = ratio of nucleotide "g" if no source track is used.
tRatio = ratio of nucleotide "t" if no source track is used, ratio of nucleotide "c" is the difference
to one from the sum of aRatio, gRatio and tRatio.
RNAFold - [4]
moses uses RNAFold to predict the se
condary structures of sliding windows of a nucleotide sequence.
source tracks
1 - Annotated sequence track
2 - optional stencil track
parameters
windowoffset/windowstep/windowsize/calculateUsingStencil/numberOfShuffledInputs
Example "Calculate RNAFold profile"
windowoffset = 0
windowStep = 1
windowSize = 37
calculateUsingStencil = 1
numberOfShuffledInputs = 0
Example "Calculate RNAFold mean and standard deviation for shuffled
slidind windows"
windowoffset = 0
windowStep = 1
windowSize = 37
calculateUsingStencil = 1
numberOfShuffledInputs = 100
RNAz - [5]
moses allows to run RNAz using this modul.
source tracks
1 - Annotated sequence track with Analyzed genome
2 - Annotated sequence track with reference genome 1
3 - BLAST track for Analyzed genome against reference genome 1
4 - Annotated sequence track with reference genome 2
5 - BLAST track for Analyzed genome against reference genome 3
...
...
last - optional stencil track
the BLAST tracks may not use the "only" option in the parameter output option so that location
information is not discared.
parameters
windowoffset/windowstep/windowsize/calculateUsingStencil
Example "Use RNAz on EColi using four reference Genomes"
source tracks "Ecoli genome", "Entero genome", "BLAST Ecoli vs Entero", "Erwinia genome",
"BLAST Ecoli vs Erwinia", "Klebsiella genome", "BLAST Ecoli vs Klebsiella", "Salmonella
genome", "BLAST Ecoli vs Salmonella", "Ecoli intergenic regions"
windowoffset = 0
windowStep = 1
windowSize = 37
calculateUsingStencil = 1
Statistical Filter
This module calculates statistics for numerical profiles.
source tracks
1 - Analysis track
parameters
windowoffset/windowstep/windowsize
output = one of the following: mean, std, min, max, jump, peak, above mean, below mean, local
statistics, local statistics view, above local mean, below local mean.
- mean, std, min, max calculate the corresponding figure for all values in input data line 1. All
positions in the output data line will contain that value.
- jump, peak perform jump detection / peak detection
- local statistics calculates mean and standard deviation for sliding windows each in one output
data line.
- local statistics view calculates
output data line1: (local mean of window) + l * (standard deviation of window)
output data line2: (local mean of window)
output data line3: (local mean of window) - l * (standard deviation of window)
Where l is the value of parameter "limitInStandardDeviationsFromMean".
This is useful using display mode 2 and 4 (see user reference) to view the characteristics of the
numerical profile
- above/below local mean calculates a "1" if the value at the center of the sliding window is
above/below the mean of the window.
limitInStandardDeviationsFromMean = a parameter used for jump/peak detection and the
functions above/below local mean.
WordAnalysis
This module calculates the ratio of the number of occurence of a word to the maximally possible
number of occurences. Multiple words can be used at once, the corresponding ratios are summed.
The typical use for this module is calculation of GC-content or detection of Start/Stop codons.
parameters
windowoffset/windowstep/windowsize
WordsToLookFor = words for which to perform the analysis.
References
[1]
S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman, “Basic local alignment
search tool.,” J Mol Biol, vol. 215, Oct. 1990, pp. 403–410.
[2]
T. Abeel, Y. Saeys, E. Bonnet, P. Rouzé, and Y. Van de Peer, “Generic eukaryotic core
promoter prediction using structural features of DNA,” Genome Research, vol. 18, Feb. 2008,
pp. 310-323.
[3]
D.H. Mathews, “Predicting a set of minimal free energy RNA secondary structures
common to two sequences,” Bioinformatics, vol. 21, May. 2005, pp. 2246-2253.
[4]
I. Hofacker, W. Fontana, P. Stadler, S. Bonhoeffer, M. Tacker, and P. Schuster, “Fast
Folding and Comparison of RNA Secondary Structures,” Monatsh. Chem., vol. 125, 1994, pp.
188, 167.
[5]
S. Washietl, I.L. Hofacker, and P.F. Stadler, “Fast and reliable prediction of noncoding
RNAs,” Proceedings of the National Academy of Sciences of the United States of America, vol.
102, Feb. 2005, pp. 2454-2459.
Download