Classes - Information & Computational Sciences

TOPALi v2 is split into many Java packages. Ideally everything would live in a single
“TOPALi” package, but given the number of classes it is easier to create various
subpackages for the separate areas of the program.
topali.analysis
Contains (non GUI) analysis code that is only ever run client side (such as parameter
estimation).
topali.cluster
Contains webservices code for responding to requests to a Tomcat server, along with
all the cluster-side analysis code for the methods (including those that can also run
client side).
Subpackages also exist for the actual analysis-related webservices:
topali.cluster.dss
topali.cluster.hmm
topali.cluster.lrt
topali.cluster.pdm
topali.cluster.trees
topali.cluster.jobs
Contains code for (client side) setup of jobs to be submitted as webservices, including
code to submit, monitor and retrieve the results.
topali.cluster.sge
Contains code specific to running TOPALi analysis methods on a Sun Grid Engine
(SGE) enabled cluster, mainly to deal with job status monitoring.
topali.data
Contains all the data structures that make up a TOPALi project and are written to
XML as part of a save operation. Data objects are a mixture of analysis-result-related
objects and GUI save states (positions of components, last used values, etc.).
topali.fileio
Contains code to handle file input and output, mainly reading/writing the various
multiple alignment formats that are supported.
topali.gui
Contains code relating to the main client interface.
topali.gui.dialog
Contains code for the various client-side dialog boxes and input frames that appear.
topali.gui.dialog.hmm
Contains additional code relating to the Run HMM dialog (subpackaged because the
dialog is complex enough to need many classes).
topali.gui.dialog.settings
Contains code for the Analysis|Settings dialog box. This is expected to contain a
number of JPanels (likely to be one-per-class) and was therefore subpackaged.
topali.gui.nav
Contains code related to the navigation tree (visible down the left hand side of the
client interface).
topali.gui.results
Contains code for displaying GUI components related to results that have been
obtained from running analysis methods (mainly alignment graphs to date).
topali.gui.tree
Contains GUI code for dealing with phylogenetic trees.
topali.mod
Contains code that has been obtained elsewhere and modified for use within TOPALi.
topali.vamsas
Contains code that is related to the VAMSAS integration within TOPALi (so far, code
to read/write/import/export VAMSAS XML documents).
Webservices and analyses are set up in a generic way. Using DSS as an example, the
package will contain:
DSSWebService
Contains the Axis web service code that extends from topali.cluster.WebService. It is
responsible for starting a thread that runs:
RunDSS
Performs initial DSS setup (in DSS’s case, breaking the full analysis into [N] jobs and
simulating alignments for each job). The job (or job array) is then submitted to the
cluster via DRMAA or qsub.
The job startup scripts are located in /WEB-INF/cluster/sge – a DSS job is an array
job calling DSSAnalysis on each alignment.
DSSAnalysis
Performs the actual analysis over an alignment, which is mainly a case of working out
windows for that alignment and running the DSS method on each window.
DSS
This is the class that performs the method itself.
Other classes may be involved for other web services too.
The final common class is a collation class, responsible for checking on the progress
of a job type when requested by a client-server call.
CollateDSS
Monitors the directories in use by the job, looking for completed subtasks and
calculating a percentage complete based on the information gained. It is up to the job
type itself to decide how and where to write percentage information that the collation
class can monitor, but most tasks create a “percent” directory in the job’s home
directory and write one file per percentage point complete, in the form p1, p2, p3, etc.
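The “percent” directory convention described above could be read back along the following lines. This is only a minimal sketch, not the actual CollateDSS code: the class name PercentCounter, the method name, and the choice to count files matching p<number> are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical helper (not part of TOPALi): reads progress from a job directory.
final class PercentCounter {

    // Counts marker files named p1, p2, p3, ... inside <jobDir>/percent.
    // One file is written per percentage point, so the count is the percentage.
    static int percentComplete(Path jobDir) throws IOException {
        Path percentDir = jobDir.resolve("percent");
        if (!Files.isDirectory(percentDir)) {
            return 0; // the job has not reported any progress yet
        }
        try (Stream<Path> files = Files.list(percentDir)) {
            long count = files
                .filter(f -> f.getFileName().toString().matches("p\\d+"))
                .count();
            return (int) Math.min(100, count);
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate a job that has written three percentage markers.
        Path jobDir = Files.createTempDirectory("job");
        Path percentDir = Files.createDirectory(jobDir.resolve("percent"));
        for (int i = 1; i <= 3; i++) {
            Files.createFile(percentDir.resolve("p" + i));
        }
        System.out.println(percentComplete(jobDir) + "% complete");
    }
}
```

Counting files rather than parsing the highest file name keeps the check cheap and tolerant of markers arriving out of order, which may matter when several cluster nodes write to the same directory.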
[Figure: call hierarchy. DSSWebService -> RunDSS -> DSSAnalysis (x [N] bootstraps) -> DSS (x [N] windows) -> RunFitch]
PDM2Analysis
PDMAnalysis.run()
We loop over the sub-partition of the alignment that this job is responsible for,
moving the window along it one step at a time.
For each window:
1. Run it through MrBayes
2. Take its results, and rewrite them into a win[n].txt file
3. Once data for a pair of windows is available, calculate a PDM score for them
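The window loop above can be pictured as generating a series of start and end positions over the job's sub-partition. The sketch below is an assumption-laden illustration only: the class name, the 1-based inclusive coordinates, and the windowSize and stepSize parameters are hypothetical and are not taken from the TOPALi source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not TOPALi code): enumerates window positions in a partition.
final class SlidingWindows {

    // Returns {start, end} pairs (1-based, inclusive) for every window that
    // fits completely inside [partitionStart, partitionEnd].
    static List<int[]> windows(int partitionStart, int partitionEnd,
                               int windowSize, int stepSize) {
        if (windowSize <= 0 || stepSize <= 0) {
            throw new IllegalArgumentException("windowSize and stepSize must be positive");
        }
        List<int[]> result = new ArrayList<>();
        for (int start = partitionStart;
             start + windowSize - 1 <= partitionEnd;
             start += stepSize) {
            result.add(new int[] { start, start + windowSize - 1 });
        }
        return result;
    }

    public static void main(String[] args) {
        // Each window would be run through MrBayes, then written to win[n].txt.
        int n = 1;
        for (int[] w : windows(1, 10, 4, 2)) {
            System.out.println("win" + n++ + ": " + w[0] + "-" + w[1]);
        }
    }
}
```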
For 2)
PDM.saveWindowResults() is used.
This writes the same file to two locations – the local “working” directory (on the
cluster node) and the job’s “run” directory on the head node.
As these files are being created, a float[] array holding each probability score is
created for the PDM class. After two windows have been run, two of these arrays will
exist: set1 (window1) and set2 (window2).
Window1
set2 created
Window2
set2 copied to set1
new set2 created
Window3
set2 copied to set1
new set2 created
This means that for any window pair there is always a set1 and set2.
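The hand-over between set1 and set2 might look something like the sketch below. Only the names set1 and set2 come from the description above; the class WindowSets and its methods are hypothetical, and the real PDM class may well organise this differently.

```java
// Hypothetical sketch (not the real PDM class): keeps the probability arrays
// for the two most recent windows.
final class WindowSets {
    private float[] set1; // previous window (null until two windows have run)
    private float[] set2; // most recent window

    // Called once per window with that window's tree probabilities.
    void addWindow(float[] probs) {
        set1 = set2;          // the old set2 becomes set1
        set2 = probs.clone(); // the new window's scores become set2
    }

    // True once a full window pair is available for scoring.
    boolean hasPair() {
        return set1 != null && set2 != null;
    }

    float[] set1() { return set1; }
    float[] set2() { return set2; }

    public static void main(String[] args) {
        WindowSets sets = new WindowSets();
        sets.addWindow(new float[] { 0.6f, 0.4f });        // Window1: set2 created
        System.out.println("pair ready: " + sets.hasPair());
        sets.addWindow(new float[] { 0.5f, 0.3f, 0.2f });  // Window2: set2 copied to set1
        System.out.println("pair ready: " + sets.hasPair());
    }
}
```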
The file written to disk takes “pdm.nex.trprobs” from the MrBayes output and
rewrites it so each line contains
tree_string_1 prob_score_1
tree_string_2 prob_score_2
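Reading such a rewritten window file back into a float[] could be sketched as follows. This assumes that the tree string and the probability are separated by whitespace and that the probability is the last token on the line; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch (not TOPALi code): extracts the probability column
// from lines of the form "tree_string prob_score".
final class WindowFileReader {

    static float[] readProbabilities(List<String> lines) {
        List<Float> values = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore blank lines
            }
            String[] parts = trimmed.split("\\s+");
            // Assumption: the probability is the last whitespace-separated token.
            values.add(Float.parseFloat(parts[parts.length - 1]));
        }
        float[] probs = new float[values.size()];
        for (int i = 0; i < probs.length; i++) {
            probs[i] = values.get(i);
        }
        return probs;
    }

    public static void main(String[] args) {
        // Made-up example trees and scores, for illustration only.
        List<String> lines = Arrays.asList("(1,(2,3),4); 0.6", "(1,(2,4),3); 0.4");
        System.out.println(Arrays.toString(readProbabilities(lines)));
    }
}
```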
For 3)
The PDM class holds the maximum Robinson-Foulds distance we expect (rfMax), which
is calculated as 2n-6, where n is the number of sequences.
(http://www.bio.umontreal.ca/casgrain/en/labo/robinson_foulds.html)
(Robinson DF & Foulds LR (1981) Comparison of phylogenetic trees. Math Biosci 53:131-147)
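As a quick worked example of the 2n-6 bound, the sketch below computes rfMax for a given number of sequences. The class and method names are hypothetical, and the guard for fewer than three sequences is an added assumption.

```java
// Hypothetical sketch (not the real PDM class): maximum Robinson-Foulds distance.
final class RobinsonFoulds {

    // rfMax = 2n - 6, where n is the number of sequences in the alignment.
    static int rfMax(int sequenceCount) {
        if (sequenceCount < 3) {
            // Assumption: the bound is only meaningful for three or more sequences.
            throw new IllegalArgumentException("need at least 3 sequences");
        }
        return 2 * sequenceCount - 6;
    }

    public static void main(String[] args) {
        for (int n : new int[] { 4, 10, 50 }) {
            System.out.println(n + " sequences: rfMax = " + rfMax(n));
        }
    }
}
```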
We run TreeDist on the two window files win[n-1].txt and win[n].txt (where n is the
last run windowed position along the alignment). TreeDist compares all trees to
calculate an RF distance for each pair. The output file is “outfile” and is formatted, e.g.:
1 4 0 (tree1 in window1 against tree1 in window2)
1 5 2 (tree1 in window1 against tree2 in window2)
1 6 1 (tree1 in window1 against tree3 in window2)
This file compares a) against b) but also b) against a) so only the first half of it is
needed.
Using this data, we then compute a summed PDM score:
PDM.doCalculations()
For each line in the file:
Is it the first line? If so, we can determine the starting index (win2Index) of the
2nd window’s trees by looking at the 2nd column (4 in the example above). We
now know that to find tree “4” in set2 we have to use set2[i-4], where i is the
2nd column value.
Is the 2nd column value equal to 1? If so, then we know we’ve reached the
half-way point and can stop (as it’s window2_tree1 against window1_tree1).
Calculate a PDM score for this line and add it to the current total (see below).
To calculate a score for a given line we use:
pK is the probability for win1, tree[n] – found in set1[index-1]
qK is the probability for win2, tree[n] – found in set2[index-win2Index]
Probability scores in set1 and set2 are given as:
set1:
tree1 0.01
tree2 0.02
tree3 0.1
tree4 0.04
set2:
tree5 0.02
tree6 0.03
TreeDist comparison provides:
tree1 tree5 rfValue1/5
tree1 tree6 rfValue1/6
tree2 tree5 rfValue2/5
tree2 tree6 rfValue2/6
tree3 tree5 rfValue3/5
tree3 tree6 rfValue3/6
tree4 tree5 rfValue4/5
tree4 tree6 rfValue4/6
<and duplicates where 5 against 1, 5 against 2 etc>
For each line, we then compute a PDM score, such that the first two scores will be:
pK tree1 (0.01) and qK tree5 (0.02) and rfValue1/5 are the inputs
pK tree1 (0.01) and qK tree6 (0.03) and rfValue1/6 are the inputs
This score is summed over all iterations.
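The traversal of “outfile” described above could be sketched as below. The per-line scoring formula is deliberately left as a pluggable function (the hypothetical LineScore interface) because it is not reproduced here; only the indexing (set1[index-1], set2[index-win2Index]), the win2Index detection, and the half-way stop follow the description. All class and method names are assumptions.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch (not PDM.doCalculations itself): walks the TreeDist
// output and sums a caller-supplied score over the first half of the file.
final class PdmSummation {

    // Stand-in for the real per-line PDM score, which is not shown here.
    interface LineScore {
        double score(float pK, float qK, int rfValue);
    }

    static double sum(List<String> outfileLines, float[] set1, float[] set2,
                      LineScore scorer) {
        double total = 0.0;
        int win2Index = -1; // unknown until the first line has been read
        for (String line : outfileLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] cols = trimmed.split("\\s+");
            int first = Integer.parseInt(cols[0]);
            int second = Integer.parseInt(cols[1]);
            int rfValue = Integer.parseInt(cols[2]);
            if (win2Index < 0) {
                win2Index = second; // first line: start of window 2's trees
            } else if (second == 1) {
                break; // half-way point: window2_tree1 against window1_tree1
            }
            float pK = set1[first - 1];          // probability for the win1 tree
            float qK = set2[second - win2Index]; // probability for the win2 tree
            total += scorer.score(pK, qK, rfValue);
        }
        return total;
    }

    public static void main(String[] args) {
        float[] set1 = { 0.01f, 0.02f, 0.1f, 0.04f };
        float[] set2 = { 0.02f, 0.03f };
        // RF values below are made up for illustration.
        List<String> lines = Arrays.asList(
            "1 5 0", "1 6 2", "2 5 1", "2 6 1",
            "3 5 2", "3 6 0", "4 5 1", "4 6 2",
            "5 1 0", "5 2 1");
        // Placeholder scorer that just counts the lines that would be scored.
        System.out.println(sum(lines, set1, set2, (pK, qK, rfValue) -> 1.0));
    }
}
```

Passing the scorer in as a function keeps the file traversal testable on its own, independent of whichever scoring formula is plugged in.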