A Tutorial on Using GenomeBuilder v1.0
A Tool for Visualisation and Analysis of Sequence Collections and Assemblies

GenomeBuilder Application by Juha Muilu
GenomeBuilder Tutorial by Alan Robinson

European Bioinformatics Institute, EMBL Outstation - Hinxton, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK. CB10 1SD

Introduction:

This document provides a tutorial on using GenomeBuilder 1.0. By the end of this practical, you should have grasped the basics of using GenomeBuilder and how CORBA can provide easy access to biological resources. (You do not need to know CORBA to understand and use GenomeBuilder, just as you do not need to understand HTTP in order to use the Web. All the CORBA happens behind the scenes.)

GenomeBuilder is a Java application that has been produced within the Industry Programme of the EMBL European Bioinformatics Institute (http://industry.ebi.ac.uk/). It is designed for the analysis and visualisation of collections and assemblies of sequences, e.g. from EST and high-throughput genomic sequencing.

To access databases of sequence data and sequence analysis applications, GenomeBuilder uses CORBA to connect to database servers located at the EBI. By using CORBA, GenomeBuilder can easily and silently connect to these other resources and applications on the network. If users have their own databases and analysis applications, then GenomeBuilder may also access these, provided they implement the appropriate CORBA interface. (Instructions on how to do this will be described in a more advanced tutorial.)

Basic functionality that is part of GenomeBuilder includes:

- Sequences can be either labelled or colour-coded according to their properties.
- Regions of similarity between sequences can be highlighted.
- Sequences in an assembly can be edited and moved.
- Linked windows mean an overall view of the assembly can be maintained in one window, while viewing at a higher resolution in another.

During the course of this tutorial you will use GenomeBuilder to retrieve clusters of EST sequences from an EST cluster database at the EBI (EuroGeneIndex [1]), retrieve annotation about the sequences from an EST database (dbEST), and then use the CAP3 program to align these sequences.

For further details of GenomeBuilder technical specifications and functionality, please see the "GenomeBuilder User Guide" that is available on the web: http://industry.ebi.ac.uk/~muilu/GBuilder/

[1] The EuroGeneIndex database is a collection of DNA sequence clusters produced by the JESAM software. JESAM takes novel approaches to both sequence comparison and clustering: alignments are calculated using a PVM process farm, and published via a CORBA server at the EBI. The EuroGeneIndex superclusters complement rather than supersede the UniGene databases. See http://corba.ebi.ac.uk/~jparsons/packages/jesam/jesam_paper.html for further details.

Getting Started:

For this tutorial, we will be using version 1.0 of GenomeBuilder, available on the web at http://corba.ebi.ac.uk/EST/GB_CLU/. (N.B. The URL http://corba.ebi.ac.uk/ is the general place to find information about CORBA and CORBA services at the EBI.) If you have not already opened the GenomeBuilder page, do so now.

Once GenomeBuilder has loaded, you will see two windows opened (see Figure 1). The window with a series of pull-down menus ("File", "Edit", "Option", etc.) is the main GenomeBuilder window. The second window is the console in which messages about the status of GenomeBuilder are displayed.

Figure 1: The starting windows of GenomeBuilder: Main GenomeBuilder window (top) and message console window (bottom).

If you do not manage to open the applet, the most likely problem is that your site has a firewall that does not support the CORBA protocol. Before proceeding, you will need to either talk to your IT department about configuring your firewall to allow IIOP (the Internet Inter-ORB Protocol used by CORBA), or find a computer that is outside of your firewall.

Loading in Sequences

As our first exercise, we will assume that we have discovered a mouse sequence that has a high similarity to the mouse EST sequence with EMBL accession number AA387471. The first objective will be to search and retrieve the supercluster from the EuroGeneIndex database that contains the sequence AA387471, along with any other EST sequences with which it has been clustered.

The first operation will be to connect to the database from which we can retrieve EST superclusters. From the main GenomeBuilder window, choose:

    "File" → "New…"

A new window will appear with the caption "No connection to sequence database". In order to connect to a database, you will need to specify the database type, the site where the database is located and the specific organism for which you wish to retrieve sequence clusters.

Currently, only the EuroGeneIndex EST sequence cluster database is available publicly. In the new database connection window:

    <Select database> → Select "EuroGeneIndex (JESAM)" [2]

This database is available only from the EBI currently. (If other sites provide mirrors or alternatives, then they can also be available as options, e.g. an in-house database.) In the database connection window:

    <Select site> → Select "EBI"

There are EuroGeneIndex EST clusters for "Mouse", "Nidulans", "Rat" and "Zebra fish". In the database connection window:

    <Select organism> → Select "Mouse"

[2] In early versions of GenomeBuilder, this may be specified as "ClusterDB (JESAM)".

Upon selecting the appropriate "database", "site" and "organism", GenomeBuilder will try to establish a CORBA connection to the appropriate database. You can follow the progress in the GenomeBuilder message console window:

    Connecting to the database. Please wait....
    GOT IOR:IOR:000000000000002749444c3a656d626c2f6562692f6573742f636c75737465722f436c757374657244423a312e30000000000001000000000000002c000100000000000e3139332e36322e3139362e36330085da000000100000000037a17efb0001ccf000000000
    GOT IOR:IOR:000000000000002b49444c3a656d626c2f6562692f6573742f616c69676e6d656e742f416c69676e6d656e7444423a312e30000000000001000000000000002c000100000000000e3139332e36322e3139362e36330085d5000000100000000037a17e7200086b9b00000000
    Database: gb_clusterdb, Mouse
    http://bach.ebi.ac.uk/EST/IOR/clusterDB_mouse.ior
    Connection OK

Now that the connection to the database has been established, the supercluster containing the sequence AA387471 and any other sequences with which it has been clustered can be retrieved:

    Enter AA387471 as the "EMBL accession" in the database connection window
    Press "Retrieve cluster".

Using CORBA, GenomeBuilder will now connect to the EuroGeneIndex database and retrieve the given EST and any other EST sequences with which it has been clustered. Again, you can watch the progress in the GenomeBuilder message console window:

    query: AA387471 =
    dbname: 971231.embl
    cluster accession: AA097459
    type: file
    Cluster size is 4
    download ok
    Sequences found 4
    New virtual sequence created

In the main GenomeBuilder window (which you may have to resize), four sequences should now be presented at the resolution of the nucleotide sequence (see Figure 2).

Figure 2: The main window of the GenomeBuilder showing the four sequences retrieved from the EuroGeneIndex EST cluster containing AA387471.

Changing the View and Adding Descriptions

You may scroll along the sequences using the scroll bar at the bottom of the main GenomeBuilder window.
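
As an aside on the connection step: the "GOT IOR" lines printed in the console are stringified CORBA object references (IORs), and they can be decoded to see which interface and server GenomeBuilder has connected to. The following is a minimal Python sketch, assuming the standard IOR layout (a byte-order flag, a type id string, then tagged profiles, of which tag 0 is an IIOP profile holding host and port). The function names are illustrative and are not part of GenomeBuilder; the hex string is the first IOR from the console output, with the line-wrap whitespace removed.

```python
import struct


def _align(pos, boundary):
    """Round pos up to the next multiple of boundary (CDR alignment)."""
    return (pos + boundary - 1) // boundary * boundary


def _read_ulong(buf, pos, order):
    pos = _align(pos, 4)
    (value,) = struct.unpack_from(order + "I", buf, pos)
    return value, pos + 4


def _read_string(buf, pos, order):
    # A CDR string is a ulong length (including the trailing NUL) plus bytes.
    length, pos = _read_ulong(buf, pos, order)
    raw = buf[pos:pos + length]
    return raw.rstrip(b"\x00").decode("ascii"), pos + length


def _parse_iiop_profile(body):
    """Extract (host, port) from an IIOP profile encapsulation."""
    order = "<" if body[0] else ">"
    host, pos = _read_string(body, 3, order)  # skips flag and version bytes
    pos = _align(pos, 2)
    (port,) = struct.unpack_from(order + "H", body, pos)
    return host, port


def parse_ior(text):
    """Return (type_id, [(host, port), ...]) for a stringified IOR."""
    compact = "".join(text.split())  # drop whitespace from line wrapping
    if not compact.startswith("IOR:"):
        raise ValueError("not a stringified IOR")
    buf = bytes.fromhex(compact[4:])
    order = "<" if buf[0] else ">"
    type_id, pos = _read_string(buf, 1, order)
    count, pos = _read_ulong(buf, pos, order)
    endpoints = []
    for _ in range(count):
        tag, pos = _read_ulong(buf, pos, order)
        length, pos = _read_ulong(buf, pos, order)
        if tag == 0:  # TAG_INTERNET_IOP
            endpoints.append(_parse_iiop_profile(buf[pos:pos + length]))
        pos += length
    return type_id, endpoints


# First IOR from the console output, copied in the chunks it was printed in.
CONSOLE_IOR = (
    "IOR:000000000000002749444c3a656d626c2f6562692f6573742f636c757"
    "37465722f436c757374657244423a312e30000000000001000000000000002c00"
    "0100000000000e3139332e36322e3139362e36330085da000000100000000037a"
    "17efb0001ccf000000000"
)

if __name__ == "__main__":
    type_id, endpoints = parse_ior(CONSOLE_IOR)
    print(type_id)    # IDL:embl/ebi/est/cluster/ClusterDB:1.0
    print(endpoints)  # [('193.62.196.63', 34266)]
```

Decoding is only informational; GenomeBuilder handles these references itself, and the host and port shown are those recorded when the tutorial was written.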
The magnification level of the window can be re-scaled dynamically using the scrollbar underneath the pull-down menus at the top of the GenomeBuilder window:

    Drag the scrollbar to the right so that the complete length of all the sequences is displayed and the accession numbers of the sequences are displayed (see Figure 3).

Figure 3: The main GenomeBuilder window with the four EST sequences re-scaled.

The accession numbers of the sequences are not very meaningful. As the next step, we will display annotation about the sequences as contained in the EuroGeneIndex. In the main GenomeBuilder window:

    "View" → "Show" → "Annotation" → "Description"

A short description of each sequence will now be displayed, as annotated in EuroGeneIndex.

We will now reset some of the display properties of GenomeBuilder. From the main window of GenomeBuilder, select:

    "View" → "Format…"
    Set "Line size" to 4 and click "OK"

The sequences are now more separated along the vertical axis. If you wish, change the "Font size" too (see Figure 4).

Figure 4: The main GenomeBuilder window with annotation about the sequences from EuroGeneIndex.

Adding and Displaying More Information about the Sequences

Since GenomeBuilder is CORBA aware, it can connect to other CORBA-aware databases of sequence and mapping information so as to retrieve and display further information about the sequences it is displaying. We will establish a connection to an EST sequence database at the EBI [3]. From the main GenomeBuilder window:

    Select "Databases" → "Sequence" → "ESTDB (EBI)"

In the message console window, the following should appear:

    Connecting to the database. Please wait....
    OK
    Database: gb_simpleEst, EST Database
    http://bach.ebi.ac.uk/EST/IOR/SimpleEstDB_2.IOR
    Connection OK

Now that the connection to the database has been made, it is necessary to retrieve the information about the sequences loaded currently into GenomeBuilder from the ESTDB database.
From the main GenomeBuilder window:

    "Edit" → "Annotate all from…" → "EST Database"

[3] This ESTDB sequence database is essentially the dbEST EST database, but the information is made available through a CORBA server.

Now, the information about the EST sequences can be displayed, and used to colour the sequences. From the main GenomeBuilder window:

    "View" → "Show" → "Annotation" → "Clone library"
    "View" → "Colour" → "Colour by clone library"

We can also sort the sequences so that those from the same clone library are neighbours. From the main GenomeBuilder window:

    "View" → "Sort/Align" → "Sort by library"

By now your display should look like Figure 5.

Figure 5: The main GenomeBuilder window with the sequences coloured and labelled according to their clone library.

Using GenomeBuilder to Perform Sequence Analysis

The EBI has made available some routine sequence analysis programs over the Internet using CORBA. AppLab is used to provide the connection to these analysis tools through CORBA (please see http://industry.ebi.ac.uk/applab/ for further information). Here are some features and highlights of AppLab:

- AppLab defines and automatically generates the IDL interfaces for invoking and controlling remote applications and for browsing their results through CORBA.
- AppLab implements a mechanism that makes it possible to plug in new applications without any additional programming effort.
- AppLab is easily extensible for problem-specific application sets (such as the special data structures needed in bioinformatics).

Since GenomeBuilder is CORBA aware, it can:

1. Connect to these analysis servers available through CORBA using AppLab
2. Submit sequences to these analysis servers
3. Retrieve and display the results from these analyses

An analysis program that is available currently via a CORBA server is the sequence assembly program, CAP3. We will use CAP3 to re-assemble our cluster of EST sequences.
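
As a brief aside on the "Sort by library" and "Colour by clone library" steps described above, the effect on the display can be pictured as a stable sort on the clone-library annotation followed by assigning one colour per library. The following Python sketch is only an illustration under stated assumptions: the library names, the palette and the function name are hypothetical, and this is not GenomeBuilder's own code. Only the accession numbers are taken from this tutorial.

```python
# Hypothetical palette; GenomeBuilder's actual colours are not documented here.
PALETTE = ["red", "blue", "green", "orange"]


def sort_and_colour(records):
    """Sort records by clone library and assign one colour per library.

    records is a list of (accession, library) pairs. The sort is stable,
    so sequences from the same library keep their original relative order.
    """
    ordered = sorted(records, key=lambda rec: rec[1])
    colours = {}
    for _, library in ordered:
        if library not in colours:
            colours[library] = PALETTE[len(colours) % len(PALETTE)]
    return [(acc, lib, colours[lib]) for acc, lib in ordered]


if __name__ == "__main__":
    # Library names below are made up for the example.
    cluster = [
        ("AA387471", "library_B"),
        ("AA097459", "library_A"),
        ("AA541835", "library_B"),
    ]
    for row in sort_and_colour(cluster):
        print(row)
```

After sorting, sequences from the same library sit next to each other and share a colour, which is the arrangement shown in Figure 5.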
From the main GenomeBuilder window:

    "Analysis" → "Assembly" → "Cap3"

After a short time, a new window will open entitled "Cap3" (see Figure 6).

Figure 6: The CAP3 window generated by AppLab. The CAP3 analysis is run on a server at the EBI, and the results returned to the client.

For this tutorial, we will be using the default values for CAP3. However, if you wish to change the parameters used by the program, you can do so from the "Options" menu. In the CAP3 window, select:

    "Options" → "Show optional parameters"

To run CAP3, the sequences must be sent off to the CORBA server at the EBI. From the CAP3 window, select:

    "Actions" → "Run Cap3"

Whilst the job is running, the GenomeBuilder application will be unavailable. Upon completion, a message box will appear. The GenomeBuilder display will then change and show the sequences in their assembled position, plus the new consensus contig constructed by CAP3. A new window, "AppLab stdout", will also appear, displaying the output of the CAP3 program (see Figure 7).

Figure 7: Results of CAP3 assembly: The sequences are aligned in the main GenomeBuilder window with the consensus contig labelled and shown in black (top), and the text output of the CAP3 program is shown in a separate window (bottom).

Thus, using GenomeBuilder and the EuroGeneIndex, we have a cluster of ESTs and a consensus contig that extends our original sequence, which showed similarity to the small AA387471 fragment.

IMPORTANT - Before re-running CAP3, you will need to select and delete the CONTIG generated by CAP3. (To delete the CONTIG sequence, select it by clicking on it with the left mouse button, then select "Delete selected" from the "Edit" menu.)

Selecting Sequences

Sequences may be 'selected' so that operations are only carried out on those 'selected' sequences. Selection is done by clicking the left mouse button over the sequences to be selected. A selected sequence is highlighted by a red box around the sequence.
Selection is a toggle operation, thus clicking the left mouse button over a previously selected sequence will deselect it.

Displaying the Quality of the Assembly

The nucleotide matches between the sequences in the assembly can be displayed by GenomeBuilder. In the main GenomeBuilder window, select:

    "View" → "Show" → "Similarity" → "Show (dis)similarity"

Slide the scale scrollbar at the top of the display so that the sequences appear at the resolution of single nucleotides (see Figure 8).

Figure 8: The aligned supercluster of sequences from the EuroGeneIndex database with identical nucleotides between sequence pairs highlighted (top) and only mismatched (dissimilar) nucleotides highlighted (bottom).

Given that the sequences are assembled quite well, it is more meaningful to show the dissimilarities between the sequences (see Figure 8). From the main GenomeBuilder window, select:

    "Options" → "Similarity" → "Show dissimilar"

This option has a toggle behaviour. To return to the "similarity" view, from the main GenomeBuilder window, select:

    "Options" → "Similarity" → "√ Show dissimilar"

It is also possible to examine the (dis)similarity between individual pairs of sequences. First, turn off the "show similarity" option for all sequences. From the main GenomeBuilder window, select:

    "View" → "Show" → "Similarity" → "Hide (dis)similarity"

Now select any two sequences by clicking on them using the left mouse button, e.g. "CONTIG 1" and "AA097459". These selected sequences will now have a red box surrounding them. If you have a three-button mouse, use the middle mouse button to bring up a pop-up menu. (If you have a two-button mouse, press <Ctrl> with either mouse button):

    <pop-up menu> → "Show (dis)similarity between selected"

You may toggle between the similarity and dissimilarity views using:

    "Options" → "Similarity" → "Show dissimilar"

As you select and deselect other sequences, their (dis)similarity will also be shown (see Figure 9).
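
To make the two views concrete, the comparison behind "Show (dis)similarity" can be thought of as a column-by-column check of two sequences over the region where they overlap in the assembly. The following Python sketch is an illustration only, assuming each sequence is placed at a horizontal offset in the display; the function name, the offset parameter and the toy sequences are hypothetical and do not come from GenomeBuilder.

```python
def compare_placed(seq_a, seq_b, offset_b=0):
    """Compare two sequences column by column.

    seq_a starts at display column 0 and seq_b at column offset_b.
    Returns (matches, mismatches): lists of display columns where the
    overlapping nucleotides are identical or different (case-insensitive).
    """
    start = max(0, offset_b)
    end = min(len(seq_a), offset_b + len(seq_b))
    matches, mismatches = [], []
    for column in range(start, end):
        base_a = seq_a[column].lower()
        base_b = seq_b[column - offset_b].lower()
        (matches if base_a == base_b else mismatches).append(column)
    return matches, mismatches


if __name__ == "__main__":
    # Toy sequences, not taken from the EuroGeneIndex cluster.
    same, different = compare_placed("ACGTACGT", "GTTCG", offset_b=2)
    print("similar columns:   ", same)       # [2, 3, 5, 6]
    print("dissimilar columns:", different)  # [4]
```

The "similarity" view corresponds to highlighting the first list and the "dissimilar" view to highlighting the second; for a well-assembled cluster the second list is short, which is why the dissimilarity view is usually easier to read.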
Figure 9: Highlighting the similarities between the calculated CONTIG and AA097459.

In conclusion, you can switch between match and mismatch views using the "Options" → "Similarity" → "Show dissimilar" toggle. You may view either the (mis)match between all neighbouring sequences using "View" → "Show" → "Similarity" → "Show (dis)similarity", or the (mis)match between selected sequences using <pop-up menu> → "Show (dis)similarity between selected".

IMPORTANT - It is possible to display the (mis)match between all neighbouring pairs and between all selected sequences simultaneously. This can lead to a confusing display, so ensure that only one of these options is turned on at any time.

Moving Sequences

GenomeBuilder provides the facility to move sequences relative to each other. The default is that sequences may be moved along the horizontal axis.

IMPORTANT: Moving sequences in the horizontal direction will negate any previous alignments. CAP3 (or equivalent) will have to be re-run in order to re-generate the alignment (see below).

To enable the "move sequence" option, in the main GenomeBuilder window, select:

    "Edit" → "Enable move selected"

If sequences are selected by clicking over them with the left mouse button, then all selected sequences may be moved by clicking and dragging with the left mouse button anywhere in the main GenomeBuilder window. If the "Show (dis)similarity" option is selected for all sequences, then as the selected sequence is moved along the horizontal axis, the changing pattern of (mis)matches between sequences can be observed.

To turn off the sequence move option, from the main GenomeBuilder window, select:

    "Edit" → "Disable move selected"

If you have moved the sequences along the horizontal direction, and hence lost the CAP3 alignment, this can be easily regenerated:

    Select only the black "CONTIG1" sequence by clicking the left mouse button over it so that it is highlighted by a red box.
    Remove any other selections on other sequences by clicking over them again with the left mouse button.
    Using the middle mouse button in the main GenomeBuilder window, activate the pop-up menu: <pop-up menu> → "Delete selected"
    Now run CAP3 as described previously. If you try to run CAP3 without first deleting the CONTIG, GenomeBuilder will warn you.

Sequences may also be moved in the vertical direction. This does not destroy any previous global alignment of the supercluster of EST sequences. Ensure that "Enable move selected" has been chosen. In the main GenomeBuilder window, select:

    "Options" → "Move vertical"

If a sequence is selected (using the left mouse button), then it can be clicked and dragged in the vertical direction. To turn off vertical motion, select:

    "Options" → "√ Move vertical"

To turn off the sequence move option, from the main GenomeBuilder window, select:

    "Edit" → "Disable move selected"

By selecting more than one sequence, whole groups of sequences can be moved together.

Editing Sequences

GenomeBuilder provides basic sequence editing capabilities. Select the sequence that you want to edit (IMPORTANT - only ONE sequence must be selected). For example, let us pretend that we go back to the trace for AA097459 and decide that the unassigned nucleotide at position 66 (shown by a "_") is actually a guanine.

    First select the AA097459 sequence using the left mouse button
    Ensure that the view in the main GenomeBuilder window is at the level of the nucleotide
    With the middle mouse button, bring up the pop-up menu, and select "Edit selected"

The sequence AA097459 can now be edited dynamically. In the "Sequence editor" window, the position of the text cursor along the sequence is also shown in the main GenomeBuilder window by a magenta tick mark.

    In the "Sequence editor" window, click next to the "_" character at position 66.
    Note the appearance of the tick mark in the main GenomeBuilder window
    Delete the "_" character (note that the sequence in the main GenomeBuilder window is updated immediately)
    Now insert a "g" at this position (note that the sequence in the main GenomeBuilder window is updated immediately)
    Repeat the procedure for the first "_" character in the AA541835 sequence, and replace it with a "g" too.

Finally, we shall re-run the CAP3 alignment with these new sequences.

    Select only the black "CONTIG1" sequence by clicking the left mouse button over it so that it is highlighted by a red box.
    Remove any other selections on other sequences by clicking over them again with the left mouse button.
    Using the middle mouse button in the main GenomeBuilder window, activate the pop-up menu: <pop-up menu> → "Delete selected"
    Now run CAP3 as described previously. The final result is shown in Figure 10. If you try to run CAP3 without first deleting the CONTIG, GenomeBuilder will warn you.

Figure 10: The result of running CAP3 on the edited sequences with mismatches between sequences highlighted (compare with Figure 8).

Zooming

It is quite usual to want to examine sequences at the level of the single nucleotide, whilst maintaining a global perspective of the sequence in the context of the others. This is achieved in GenomeBuilder through the use of dynamically linked windows. From the main GenomeBuilder window, select:

    "View" → "New window"

A second window will appear which is similar to the main GenomeBuilder window, except that it lacks menus. This second window has all the scrolling and zooming features of the main window. Furthermore, the outline of a black box is shown in the main GenomeBuilder window. This box shows the portion of the main window that is projected into the second "zoom" window (see Figure 11).

Figure 11: The main GenomeBuilder window (top), and the secondary "zoom" window (bottom).
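
One common way to keep two such windows in step is for both to observe a single shared description of the visible region. The following Python sketch illustrates that idea under stated assumptions: all class and method names are hypothetical, the sequence is a toy example, and nothing here is taken from GenomeBuilder's actual implementation.

```python
class Viewport:
    """Shared model: the stretch of the assembly shown in the zoom window."""

    def __init__(self, start, width):
        self.start = start
        self.width = width
        self._listeners = []

    def subscribe(self, callback):
        self._listeners.append(callback)
        callback(self)  # bring the new listener up to date immediately

    def move_to(self, start, width=None):
        self.start = start
        if width is not None:
            self.width = width
        for callback in self._listeners:
            callback(self)


class MainWindow:
    """Draws the box outlining the region shown in the zoom window."""

    def __init__(self, viewport):
        self.box = None
        viewport.subscribe(self.redraw_box)

    def redraw_box(self, viewport):
        self.box = (viewport.start, viewport.start + viewport.width)


class ZoomWindow:
    """Shows only the nucleotides inside the box."""

    def __init__(self, viewport, sequence):
        self.sequence = sequence
        self.visible = ""
        viewport.subscribe(self.redraw)

    def redraw(self, viewport):
        end = viewport.start + viewport.width
        self.visible = self.sequence[viewport.start:end]


if __name__ == "__main__":
    viewport = Viewport(start=0, width=4)
    main = MainWindow(viewport)
    zoom = ZoomWindow(viewport, "ACGTACGTAC")  # toy sequence
    viewport.move_to(3)  # dragging the box or scrolling the zoom window
    print(main.box, zoom.visible)
```

Because both windows react to the same model, it does not matter which one initiates the change: moving the box or scrolling the zoom window both end up updating the other view.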
The view in the "zoom" window reflects the content within the black box in the main GenomeBuilder window. The view in the "zoom" window can be changed in three ways:

1. By moving the box in the main GenomeBuilder window - Click on the boundary of the black box; the outline should now become red. Clicking and dragging with the left mouse button will cause the box to move around the main GenomeBuilder screen. Clicking on the border of the box again will cause the box to become immovable.
2. In the main GenomeBuilder window, place the mouse cursor to the upper left of the point that is of interest and press the middle mouse button. Select "zoom here".
3. By using the scroll and zoom bars of the "zoom" window.

In all cases, as the view is changed in one window, it will be dynamically updated in the other. It is now a simple procedure to examine the consensus at the level of the assembly in the main GenomeBuilder window whilst simultaneously viewing the quality of the assembly at the nucleotide level in the "zoom" window.

Saving Sequences

In this current version of GenomeBuilder (v1.0), sequence files are saved using the FASTA format. This does not retain assembly information. [N.B. The ability to save the assembly information of a supercluster will be implemented in version 1.1.]

To write out sequences in FASTA format, select those sequences in which you are interested, e.g. CONTIG1. In the main GenomeBuilder window, select:

    "View" → "Display selected in FASTA"

A new window will appear with the sequence in FASTA format (see Figure 12). This data can be saved in a variety of ways:

1. From the "FASTA" window, select "File" → "Save…" to save the data to a file on the local disk. (N.B. If you are running GenomeBuilder as an applet, this option may not work because of Java security mechanisms.)
2. Copy and paste the text to a text editor and save the file.
3. Copy and paste the text to an application, e.g. FASTA or BLAST at the EBI.
4. Mail the data to yourself.
In the "FASTA" window, select "File" → "Email address…" and enter your e-mail address. Next, select "File" → "Mail", and the data will be emailed to the e-mail address given.

Figure 12: The FASTA window of GenomeBuilder.

Cleaning Up

To remove sequences in GenomeBuilder:

To delete single sequences, select the sequences, and then select "Edit" → "Delete selected" in the main GenomeBuilder window (N.B. you may also use the pop-up menu in GenomeBuilder).

To remove all sequences, select "File" → "Remove everything".

Using other Databases - RHalloc

GenomeBuilder may also connect to other databases. The next brief exercise will retrieve a mouse cluster and study which sequences are allocated to the RH mapping database as annotated in the RHalloc database.

Using what you've already learnt, load the mouse cluster from the EBI EuroGeneIndex that contains the sequence R74857.

Now connect to the mapping database, RHalloc, that has a CORBA service at the EBI. Select:

    "Databases" → "Mapping" → "RHalloc (EBI)"

Now we need to retrieve the annotation from RHalloc about the current sequences. Select:

    "Edit" → "Annotate all from" → "RHalloc"

Next we need to display this information. Select:

    "View" → "Colour" → "Colour by Property" → "RHalloc"
    "View" → "Show" → "Annotation" → "Property" → "RHalloc"

You may also wish to run CAP3 on this cluster, as it has an interesting feature.

Exiting GenomeBuilder

To exit GenomeBuilder, from the main window, select:

    "File" → "Exit"

You may have to shut down the CAP3 window separately, since this is a program running separately from GenomeBuilder.

Conclusion

The EBI is investigating the use of CORBA (Common Object Request Broker Architecture) to resolve some of the current IT problems in bioinformatics (see http://corba.ebi.ac.uk/). Distributed computing using CORBA provides a standard means to connect computational resources on different machines over the Internet.
Computer applications can access these resources if they implement the programming interface to a CORBA server (specified using the Interface Definition Language (IDL)). Currently, CORBA interfaces to EST sequence (dbEST), EST cluster (EuroGeneIndex) and radiation hybrid databases (RHdb) are implemented and available in GenomeBuilder. New databases can be added dynamically to GenomeBuilder if they provide an IDL interface and a CORBA server.

Within GenomeBuilder, the AppLab program is used to generate client and server side components of external analysis applications (see http://industry.ebi.ac.uk/applab/). Currently, two applications have been added to GenomeBuilder using AppLab: CAP3 for making multiple alignments and CLEANUP for removing redundant sequences.

Availability

For the latest information on GenomeBuilder, see http://industry.ebi.ac.uk/~muilu/GBuilder/.

GenomeBuilder is available as an applet from http://corba.ebi.ac.uk/EST/GB_CLU using the Sun AppletViewer (recommended) or the Netscape browser (version 4.5 or higher).

It is also possible to run the program as a standalone application. You can get the latest release of GenomeBuilder from ftp://ftp.ebi.ac.uk/pub/contrib/muilu/GB/gb.tar. The installation instructions are available from ftp://ftp.ebi.ac.uk/pub/contrib/muilu/GB/README

Acknowledgements

GenomeBuilder is developed and written by Juha Muilu of the Industry Programme of the EMBL European Bioinformatics Institute (http://industry.ebi.ac.uk/) with input from Tom Flores, Alan Robinson, Patricia Rodriguez-Tome, Martin Senger, Alphonse Thanaraj and Matteo di Tommaso.