RNACluster
An integrated tool for RNA secondary structure comparison and clustering
http://csbl.bmb.uga.edu/publication/materials/qiliu/RNACluster.html

Qi Liu, V. Olman, Huiqing Liu, Xiuzi Ye, Shilun Qiu, Ying Xu. RNACluster: An integrated tool for RNA secondary structure comparison and clustering. To be submitted to the Journal of Computational Chemistry.

Version 1.0
October 2007

Qi Liu
Computational System Biology Lab
The University of Georgia
Athens, GA 30602
USA

Contents
1. Overview
2. Obtaining and installation of RNACluster
2.1. Obtaining RNACluster
2.2. Installing RNACluster
3. Tutorial
3.1. Functions and features
3.2. Inputs
3.3. Outputs
3.4. Interface
3.5. A simple example
4. References

RNACluster documentation (2/6/16)

1. Overview
RNACluster is an integrated software tool that implements six common structure distances for measuring the (dis)similarity of RNA secondary structures: base pair distance (Ding et al., 2005); mountain distance (Moulton et al., 2000); morphological distance (Moulton et al., 2000; Voß et al., 2004); tree edit distance (Shapiro and Zhang, 1990); string edit distance (Shapiro, 1988; Shapiro and Zhang, 1990); and our in-house structure matrix distance (Qi Liu et al., Fuzzy kernel clustering of RNA secondary structures using a novel similarity metric, paper submitted). It also implements one effective algorithm for clustering structure samples, based on the minimum spanning tree concept of graph theory (Xu et al., 2002; Olman et al., 2003).

This tool can be used to study the characteristics of RNA secondary structures, RNA conformational switches, RNA conformational energy landscapes, and RNA secondary structure prediction based on the clustering of structure samples. RNACluster has been compiled for Windows and Linux and runs on both platforms.

Contact: qiliu@csbl.bmb.uga.edu

2. Obtaining and Installation of RNACluster
2.1.
Obtaining RNACluster
RNACluster can be downloaded at: http://csbl.bmb.uga.edu/publications/materials/qiliu/RNACluster.html

The website contains:
Documentation (as a PDF file and as a Word file).
Three versions of RNACluster: Windows console version, Windows graphical version, and Linux version.
Other related materials.

2.2. Installing RNACluster
Linux users: download the Linux version package and follow the README file to install it.
Windows users (graphical version): download setup.exe, run it, and follow its instructions. The installer also creates links to the main program on the desktop and in the Start menu. The installer interface is shown in Figure 1.

Figure 1: Interface of RNACluster Setup program

3. Tutorial
3.1. Functions and features
1. Integrates six different distances for measuring the (dis)similarity of RNA secondary structures.
2. Clusters RNA structure samples with the minimum spanning tree (MST) algorithm.
3. Visualizes the MST construction procedure and plots the edge distances in the order in which Prim's algorithm selects them (the MST curve).
4. Identifies all possible clusters in the given structure ensemble and reports useful information about them (cluster number, size of each cluster, p-value of each cluster, parent cluster index / number of children of each cluster, centroid of each cluster, lowest-energy structure in each cluster, compactness of each cluster, etc.).
5. Uses multithreading to exploit multiple CPUs or multi-core CPUs under Windows, which speeds up distance calculation and clustering for large sets of structure samples.
6. Friendly graphical interface.

3.2. Inputs
The input to RNACluster should be an RNA structure ensemble containing a set of suboptimal secondary structures.
The suboptimal secondary structures can be sampled within a user-defined energy range above the minimum free energy (MFE) or sampled randomly according to their Boltzmann weights (Ding et al., 2005). Several tools can generate such suboptimal structures, including the Vienna RNA package (Hofacker et al., 1994), MFold (Zuker, 1981), RNAshapes (Giegerich et al., 2004), and SFold (Ding et al., 2001), among others. The input file format is straightforward, as shown in Figure 2: the first line holds one RNA sequence, and each following line holds one structure in dot-bracket notation.

Figure 2: An example of input file

Note:
1) The first line of the input file should be the RNA sequence, with no tabs, spaces, or line wraps inside the sequence (make sure to leave no tabs or spaces at the end of the line).
2) The following lines contain the structure samples to be analyzed, in dot-bracket notation, one structure per line (make sure to leave no tabs or spaces at the end of each line).
3) The sequence may contain only the characters A/a, U/u, G/g, C/c, and T/t. Any other character in the sequence causes the program to report an error.
4) Click the “OPEN FILE” button to open the dialogue for selecting the input file.

3.3. Outputs
All output files are stored in the same directory as the input file. There are two types of output file: the distance matrix and the final clustering result.
1. A separate distance matrix output file is written for each distance metric:
a) Base pair distance: basepair_distance.txt
b) Mountain distance: mountain_distance.txt
c) Morphological distance: morphological_distance.txt
d) Tree edit distance: treeedit_distance.txt
e) String edit distance: stringedit_distance.txt
f) Structure matrix distance: structurematrix_distance.txt
2. The clustering result is written to “cluster_result.txt”.
This file presents useful information about the clustering of the given structure ensemble.
3. A list box on the main interface of the software displays information about the running status of the software.

Note: The distance matrix output files follow the input format of the programs fitch, kitsch, and neighbor in the PHYLIP package (Felsenstein, 1989), so they can be used to construct a phylogenetic tree of the input RNA structure samples.

3.4. Interface
The software provides a friendly interface with three operation areas (Figure 3):

MAIN WINDOWS:
  OPEN FILE: read the input file
  PARAMETER SETTING: choose the distance and set other clustering parameters
  DISTANCE CALCULATION: compute the distance matrix
  CLUSTERING: cluster the given structure ensemble with the minimum spanning tree algorithm
  CLUSTER VISUALIZATION: draw the MST curve
  CLEAR: clear the running information in the list box
PROGRESS:
  A progress indicator shows the progress of the distance calculation or clustering
OPERATION:
  ONLINE HELP: link to the online help file
  ABOUT: a brief introduction to the software
  EXIT: exit the program

Figure 3: Main interface of RNACluster

3.5. A simple example
1. Start RNACluster by clicking the software icon on the desktop. This opens the main window of RNACluster.
2. Click “OPEN FILE” to select the input file and read it. For this example, use the file “example.txt” in the installed program directory, which contains 13 structure samples of an RNA attenuator.
3. After the input file has been read, the “DISTANCE CALCULATION” button is activated. Click this button to compute the distance matrix for the given ensemble. The default distance metric is base pair distance. Click the “PARAMETER SETTING” button to change the distance metric (Figure 4). Users can also set the minimum cluster size before clustering.
Also, if a “FOCUSED STRUCTURE ID” is designated (0 means no designation), the point for the focused structure is highlighted on the MST curve when the curve is drawn (Figure 5).
4. After the distance calculation finishes, the “CLUSTERING” and “CLUSTER VISUALIZATION” buttons are activated. Click them to cluster the given structure ensemble or to view the MST curve of the clustering (Figure 5).
5. The output files for the distance matrix and the cluster result are written to the same directory as the input file.

Figure 4: Parameter setting dialogue

Figure 5: MST Curve

4. References
Voß, B., Meyer, C., and Giegerich, R. (2004). Evaluating the predictability of conformational switching in RNA. Bioinformatics, pp. 1573-1582.
Ding, Y. and Lawrence, C.E. (2001). Statistical prediction of single stranded regions in RNA secondary structure and application to predicting effective antisense target sites and beyond. Nucleic Acids Res. 29: 1034-1046.
Ding, Y., Chan, C.Y., and Lawrence, C.E. (2005). RNA secondary structure prediction by centroids in a Boltzmann weighted ensemble. RNA 11: 1157-1166.
Felsenstein, J. (1989). PHYLIP -- Phylogeny Inference Package (Version 3.2). Cladistics 5: 164-166.
Hofacker, I.L., Fontana, W., Stadler, P.F., et al. (1994). Fast folding and comparison of RNA secondary structures. Monatsh. Chem. 125: 167-188.
Olman, V., Xu, D., and Xu, Y. (2003). CUBIC: identification of regulatory binding sites through data clustering. J. Bioinform. Comput. Biol. 1(1): 21-40.
Giegerich, R., Voß, B., and Rehmsmeier, M. (2004). Abstract shapes of RNA. Nucleic Acids Research 32(16): 4843-4851.
Shapiro, B.A. (1988). An algorithm for comparing multiple RNA secondary structures. CABIOS 4: 381-393.
Shapiro, B.A. and Zhang, K. (1990). Comparing multiple RNA secondary structures using tree comparison. CABIOS 6: 309-318.
Moulton, V., Zuker, M., Steel, M., et al. (2000). Metrics on RNA secondary structures. Journal of Computational Biology, pp. 277-292.
Xu, Y., Olman, V., and Xu, D. (2002). Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning trees. Bioinformatics 18: 536-545.
Zuker, M. (1981). Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucl. Acids Res. 9: 133-148.
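
For readers who want to experiment outside the graphical interface, the input rules of Section 3.2, the base pair distance, the MST curve of Section 3.1, and the PHYLIP-style matrix layout of Section 3.3 can be illustrated with a minimal Python sketch. This is not RNACluster's own code. All function names are hypothetical. The base pair distance is taken here to be the number of base pairs present in exactly one of the two structures, which is the usual definition but is not confirmed by this manual. The requirement that each structure has the same length as the sequence, the row labels S1, S2, ..., and the four decimal places are likewise assumptions made for illustration.

```python
from typing import List, Set, Tuple

ALLOWED_BASES = set("AUGCTaugct")


def parse_ensemble(text: str) -> Tuple[str, List[str]]:
    """Split input text into (sequence, structures) and validate both."""
    lines = text.splitlines()
    if len(lines) < 2:
        raise ValueError("need a sequence line and at least one structure")
    sequence, structures = lines[0], lines[1:]
    if not sequence or set(sequence) - ALLOWED_BASES:
        raise ValueError("sequence may contain only A, U, G, C, T")
    for s in structures:
        # Assumption: a structure must match the sequence length exactly.
        if len(s) != len(sequence) or set(s) - set("()."):
            raise ValueError(f"bad structure line: {s!r}")
    return sequence, structures


def base_pairs(structure: str) -> Set[Tuple[int, int]]:
    """Return the set of (i, j) positions joined by matching brackets."""
    stack: List[int] = []
    pairs: Set[Tuple[int, int]] = set()
    for j, ch in enumerate(structure):
        if ch == "(":
            stack.append(j)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')'")
            pairs.add((stack.pop(), j))
    if stack:
        raise ValueError("unbalanced '('")
    return pairs


def base_pair_distance(a: str, b: str) -> int:
    """Count base pairs that occur in exactly one of the two structures."""
    return len(base_pairs(a) ^ base_pairs(b))


def distance_matrix(structures: List[str]) -> List[List[int]]:
    """Full symmetric matrix of pairwise base pair distances."""
    n = len(structures)
    return [[base_pair_distance(structures[i], structures[j])
             for j in range(n)] for i in range(n)]


def mst_curve(dist: List[List[float]], start: int = 0) -> List[float]:
    """Prim's algorithm; edge lengths in the order they are selected."""
    n = len(dist)
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known link from the tree to each node
    best[start] = 0
    curve = []
    for step in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if step > 0:  # the start node joins the tree without an edge
            curve.append(best[u])
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
    return curve


def to_phylip(dist: List[List[float]]) -> str:
    """Square PHYLIP-style matrix: count line, then 10-character names."""
    rows = [str(len(dist))]
    for i, row in enumerate(dist):
        name = f"S{i + 1}".ljust(10)  # hypothetical row labels
        rows.append(name + " ".join(f"{d:.4f}" for d in row))
    return "\n".join(rows) + "\n"


demo = "GGGAAACCC\n(((...)))\n((.....))\n.........\n"
seq, structs = parse_ensemble(demo)
matrix = distance_matrix(structs)
print(matrix)             # [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
print(mst_curve(matrix))  # [1, 2]
print(to_phylip(matrix), end="")
```

In the MST curve, a run of short edges followed by a jump to a long edge suggests that Prim's algorithm has just left one tight group of structures and crossed to another, which is the intuition behind using the curve to spot clusters.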