
RNACluster
An integrated tool for RNA secondary structure comparison and
clustering
http://csbl.bmb.uga.edu/publication/materials/qiliu/RNACluster.html
Qi Liu, V. Olman, Huiqing Liu, Xiuzi Ye, Shilun Qiu, Ying Xu. RNACluster: An integrated tool for RNA secondary structure
comparison and clustering
To be submitted to the Journal of Computational Chemistry
Version 1.0 October 2007
Qi Liu
Computational Systems Biology Lab
The University of Georgia
Athens, GA 30602
USA
Contents
1. Overview
2. Obtaining and Installation of RNACluster
2.1. Obtaining RNACluster
2.2. Installing RNACluster
3. Tutorial
3.1. Functions and features
3.2. Inputs
3.3. Outputs
3.4. Interface
3.5. A simple example
4. References
RNACluster documentation (2/6/16)
1. Overview
RNACluster is an integrated software tool that implements six common structure distances for measuring
the (dis)similarity of RNA secondary structures: base pair distance (Ding et al., 2005); mountain distance
(Vincent et al., 2000); morphological distance (Vincent et al., 2000; Björn et al., 2004); tree edit distance
(Shapiro B A and Zhang K, 1990); string edit distance (Shapiro, 1988; Shapiro B A and Zhang K, 1990); and
our in-house structure matrix distance (Qi Liu et al., Fuzzy kernel clustering of RNA secondary structures
using a novel similarity metric, paper submitted). It also provides an effective algorithm for clustering
structure samples, based on the minimum spanning tree concept from graph theory (Ying et al., 2002;
Olman V et al., 2003). The tool can be used to study the characteristics of RNA secondary structures, RNA
structure conformational switches, and RNA conformational energy landscapes, and to predict RNA secondary
structure by clustering structure samples.
RNACluster is compiled for both Windows and Linux and runs on either platform.
Contact: qiliu@csbl.bmb.uga.edu
2. Obtaining and Installation of RNACluster
2.1. Obtaining RNACluster
RNACluster can be downloaded at:
http://csbl.bmb.uga.edu/publications/materials/qiliu/RNACluster.html
The website contains:
Documentation (as a PDF file and as a Word file).
Three versions of RNACluster: Windows console version, Windows graphical version, and Linux version.
Other related materials.
2.2. Installing RNACluster
Linux users: download the Linux version package and follow the README file to install it.
Windows users: to install the graphical version, download setup.exe, run it, and follow the prompts. The
installer also creates shortcuts to the main program on the desktop and in the Start menu. The setup
interface is shown in Figure 1.
Figure 1: Interface of the RNACluster setup program
3. Tutorial
3.1. Functions and features
1. Integrates 6 different distances to measure the (dis)similarity of RNA secondary structures.
2. Clusters RNA structure samples based on the minimum spanning tree (MST) algorithm.
3. Visualizes the MST construction procedure and plots the edge distances in the order in which Prim's
algorithm selects them (the MST curve).
4. Identifies all possible clusters for the given structure ensemble and derives useful information about
them (number of clusters, size of each cluster, p-value of each cluster, parent cluster index / number of
children of each cluster, centroid of each cluster, lowest-energy structure in each cluster, compactness
of each cluster, etc.).
5. Uses multithreading to exploit Windows systems with multiple CPUs or multi-core CPUs, which speeds up
distance calculation and clustering for large sets of structure samples.
6. Provides a friendly graphical interface.
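To make the first feature concrete, here is a minimal Python sketch of the simplest of the six metrics, base pair distance, under its usual definition: the number of base pairs present in one structure but not in the other. The function names are illustrative and not part of RNACluster, and RNACluster's own implementation may differ in detail (for example in normalization).

```python
def parse_pairs(structure):
    """Return the set of (i, j) base pairs encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for pos, ch in enumerate(structure):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % pos)
            pairs.add((stack.pop(), pos))
        elif ch != ".":
            raise ValueError("unexpected character %r at position %d" % (ch, pos))
    if stack:
        raise ValueError("unbalanced '(' at position %d" % stack[-1])
    return pairs


def base_pair_distance(a, b):
    """Count the base pairs found in exactly one of the two structures."""
    if len(a) != len(b):
        raise ValueError("structures must have the same length")
    return len(parse_pairs(a) ^ parse_pairs(b))


def distance_matrix(structures):
    """Build the full symmetric matrix of pairwise base pair distances."""
    pair_sets = [parse_pairs(s) for s in structures]
    n = len(pair_sets)
    return [[len(pair_sets[i] ^ pair_sets[j]) for j in range(n)] for i in range(n)]
```

For example, `base_pair_distance("((..))", "(....)")` compares a structure with two nested pairs against one that keeps only the outer pair, so the two differ by exactly one base pair.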
3.2. Inputs
The input to RNACluster is an RNA structure ensemble containing a set of suboptimal secondary structures.
The suboptimal structures can be sampled within a user-defined energy range above the minimum free
energy (MFE) or sampled randomly according to their Boltzmann weights (Ding et al., 2005). Several tools
can generate such suboptimal structures, including the Vienna RNA package (Hofacker et al., 1994),
MFold (Zuker, 1981), RNAshapes (Giegerich et al., 2004), and SFold (Ding et al., 2001).
The input file format is straightforward, as shown in Figure 2: the RNA sequence on the first line, followed
by the dot-bracket notation of each structure on the following lines.
Figure 2: An example of an input file
Note:
1). The first line of the input file must be the RNA sequence, with no tabs, spaces, or line wraps in the
sequence (make sure to leave no tabs or spaces at the end of the line).
2). The following lines contain the structure samples to be analyzed, in dot-bracket notation, one structure
per line (make sure to leave no tabs or spaces at the end of the line).
3). The sequence consists of the characters A/a, U/u, G/g, C/c, and T/t. Any other character in the
sequence causes the program to report an error.
4). Click the “OPEN FILE” button to open the dialogue for selecting the input file.
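The format rules in the notes above can be checked before a file is loaded. The following Python sketch is a hypothetical pre-check and not part of RNACluster. It assumes, in addition to the stated rules, that every structure has the same length as the sequence and has balanced brackets, which follows from dot-bracket notation but is not stated to be enforced by RNACluster.

```python
VALID_BASES = set("AUGCTaugct")      # characters allowed in the sequence line
VALID_STRUCTURE = set("().")         # characters allowed in dot-bracket lines


def check_input_lines(lines):
    """Return a list of problems found; an empty list means the lines look valid.

    `lines` holds the file's lines with the trailing newline already removed.
    """
    if not lines:
        return ["file is empty"]
    problems = []
    sequence = lines[0]
    bad = sorted(set(sequence) - VALID_BASES)
    if not sequence or bad:
        problems.append("line 1: sequence is empty or contains invalid characters %r" % bad)
    if len(lines) < 2:
        problems.append("no structure lines found")
    for number, structure in enumerate(lines[1:], start=2):
        if set(structure) - VALID_STRUCTURE:
            # Catches tabs and trailing spaces as well as stray symbols.
            problems.append("line %d: characters other than '(', ')' and '.'" % number)
            continue
        if len(structure) != len(sequence):
            problems.append("line %d: length differs from the sequence" % number)
        depth = 0
        for ch in structure:
            depth += (ch == "(") - (ch == ")")
            if depth < 0:
                break
        if depth != 0:
            problems.append("line %d: unbalanced brackets" % number)
    return problems
```

A file can be checked with `check_input_lines(open(path).read().splitlines())`, where `path` is the input file chosen by the user.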
3.3. Outputs
All output files are stored in the same directory as the input file.
There are two types of output files: distance matrix output and the final clustering result.
1. Distance matrix output is written to a separate file for each distance metric. The output files for the
individual distance metrics are:
a). Base pair distance: basepair_distance.txt
b). Mountain distance: mountain_distance.txt
c). Morphological distance: morphological_distance.txt
d). Tree edit distance: treeedit_distance.txt
e). String edit distance: stringedit_distance.txt
f). Structure matrix distance: structurematrix_distance.txt
2. The clustering result is written to “cluster_result.txt”. This file presents useful information about the
clustering of the given structure ensemble.
3. A list box on the main interface displays information about the running status of the software.
Note:
The distance matrix output files follow the input format of the programs fitch, kitsch and neighbor in the
PHYLIP package (Felsenstein, 1989), which can be used to construct a phylogenetic tree of the input RNA
structure samples.
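For readers who want to produce or inspect such files themselves, the Python sketch below writes a square distance matrix in the PHYLIP layout: the number of items on the first line, then one row per item with a name padded to 10 characters followed by the distances. The row labels (S1, S2, ...) and the number formatting are assumptions made for illustration; the labels and precision RNACluster actually writes are not specified here.

```python
def format_phylip(matrix, names=None):
    """Format a square distance matrix as PHYLIP distance-matrix text."""
    n = len(matrix)
    if names is None:
        # Hypothetical labels: structures numbered in input order.
        names = ["S%d" % (i + 1) for i in range(n)]
    if len(names) != n:
        raise ValueError("need exactly one name per matrix row")
    lines = ["%5d" % n]
    for name, row in zip(names, matrix):
        if len(row) != n:
            raise ValueError("matrix must be square")
        label = name[:10].ljust(10)   # PHYLIP reserves 10 characters for the name
        lines.append(label + " ".join("%8.4f" % d for d in row))
    return "\n".join(lines) + "\n"
```

Writing the returned text to a file gives an input that can be passed to a distance-based PHYLIP program such as neighbor.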
3.4. Interface
The software provides a friendly interface with 3 operation areas (Figure 3):
MAIN WINDOWS:
OPEN FILE: read the input file
PARAMETER SETTING: choose the distance metric and set other clustering parameters
DISTANCE CALCULATION: compute the distance matrix
CLUSTERING: cluster the given structure ensemble based on the minimum spanning tree
algorithm.
CLUSTER VISUALIZATION: draw the MST curve.
CLEAR: clear the running information in the list box
PROGRESS:
A progress indicator shows the progress of the distance calculation or clustering
OPERATION:
ONLINE HELP: link to the online help file
ABOUT: a brief introduction to the software
EXIT: exit
Figure 3: Main interface of RNACluster
3.5. A simple example
1. Start RNACluster by clicking the software icon on the desktop. This opens the main window of
RNACluster.
2. Click “OPEN FILE” to select and read the input file. For this example, use the file “example.txt” in the
program's installation directory, which contains 13 structure samples of an RNA attenuator.
3. After the input file has been read, the “DISTANCE CALCULATION” button is activated. Click it to
compute the distance matrix for the given ensemble. The default distance metric is base pair distance; click
the “PARAMETER SETTING” button to change the distance metric (Figure 4). Users can also set the minimum
cluster size before clustering. In addition, if a “FOCUSED STRUCTURE ID” is designated (0 means no
designation), the point for the focused structure is highlighted when the MST curve is drawn
(Figure 5).
4. After the distance calculation finishes, the “CLUSTERING” and “CLUSTER VISUALIZATION” buttons are
activated. Click them to cluster the given structure ensemble or to view the MST curve of the clustering
(Figure 5).
5. The output files for the distance matrix and the clustering result are written to the same directory
as the input file.
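The MST curve mentioned in the steps above plots edge distances in the order Prim's algorithm selects them. The Python sketch below shows one way to compute that sequence from a distance matrix. It is a plain textbook version of Prim's algorithm with an arbitrary start vertex, offered as an illustration only; RNACluster's own start vertex, tie-breaking, and cluster detection on the curve are not described here and may differ.

```python
def mst_curve(dist, start=0):
    """Run Prim's algorithm on a symmetric distance matrix.

    Returns (order, curve): `order` lists the vertices in the order they join
    the tree, and `curve` lists the weight of the edge that attached each
    vertex after the first, i.e. the y-values of the MST curve.
    """
    n = len(dist)
    in_tree = [False] * n
    best = [float("inf")] * n    # cheapest known edge from each vertex to the tree
    best[start] = 0.0
    order, curve = [], []
    for _ in range(n):
        # Pick the vertex outside the tree that is closest to it.
        v = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[v] = True
        order.append(v)
        if len(order) > 1:
            curve.append(best[v])
        # The new vertex may offer shorter edges to the remaining vertices.
        for w in range(n):
            if not in_tree[w] and dist[v][w] < best[w]:
                best[w] = dist[v][w]
    return order, curve
```

Plotting `curve` against its index gives a curve of the kind shown in Figure 5. As a general property of this construction, a run of small values tends to correspond to a group of mutually close structures.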
Figure 4: Parameter setting dialogue
Figure 5: MST Curve
4. References
Björn Voß, Carsten Meyer, Robert Giegerich (2004), Evaluating the Predictability of Conformational
Switching in RNA, Bioinformatics, Pages: 1573–1582.
Ding, Y. and Lawrence, C.E. (2001). Statistical prediction of single stranded regions in RNA secondary
structure and application to predicting effective antisense target sites and beyond. Nucleic Acids Res. 29:
1034–1046.
Ding, Y., Chi Yu Chan, and Charles E. Lawrence (2005), RNA secondary structure prediction by centroids in
a Boltzmann weighted ensemble, RNA, 11:1157–1166.
Felsenstein, J. (1989), PHYLIP -- Phylogeny Inference Package (Version 3.2). Cladistics 5: 164-166.
Ivo L. Hofacker, Walter Fontana, Peter F. Stadler et al., (1994), Fast Folding and Comparison of RNA
Secondary Structures, Monatsh.Chem. 125: 167-188.
Olman V, Xu D, Xu Y. (2003), CUBIC: identification of regulatory binding sites through data clustering.
J Bioinform Comput Biol. Apr;1(1):21-40.
Robert Giegerich, Björn Voß and Marc Rehmsmeier, (2004), Abstract shapes of RNA, Nucleic Acids
Research, Vol. 32, No. 16 4843–4851.
Shapiro B A, (1988), An algorithm for comparing multiple RNA secondary structures, CABIOS 4, 381-393.
Shapiro B A, Zhang K (1990), Comparing multiple RNA secondary structures using tree comparison,
CABIOS 6, 309-318
Vincent Moulton, Michael Zuker, Michael Steel, et al. (2000), Metrics on RNA Secondary Structures, Journal of
Computational Biology, pp. 277–292.
Ying X., Olman, V., Dong X. (2002), Clustering Gene Expression Data Using a Graph-theoretic Approach:
an Application of Minimum Spanning Trees. Bioinformatics, 18, 536-545.
Zuker M. (1981), Optimal computer folding of large RNA sequence using thermodynamics and auxiliary
information, Nucl. Acids Res., 9:133-148.