GenomeBuilder

A Tutorial on Using GenomeBuilder v1.0
A Tool for Visualisation and Analysis of Sequence
Collections and Assemblies
GenomeBuilder Application by Juha Muilu
GenomeBuilder Tutorial by Alan Robinson
European Bioinformatics Institute, EMBL Outstation - Hinxton, Wellcome Trust
Genome Campus, Hinxton, Cambridge, UK. CB10 1SD
Introduction:
This document provides a tutorial on using GenomeBuilder 1.0. By the end of this practical, you
should have grasped the basics of using GenomeBuilder and how CORBA can provide easy
access to biological resources. (You do not need to know CORBA to understand and use
GenomeBuilder, just as you do not need to understand HTTP in order to use the Web. All the
CORBA happens behind the scenes.)
GenomeBuilder is a Java application that has been produced within the Industry Programme of
the EMBL European Bioinformatics Institute (http://industry.ebi.ac.uk/). It is designed for the
analysis and visualisation of collections and assemblies of sequences, e.g. from EST and
high-throughput genomic sequencing. To access databases of sequence data and sequence analysis
applications, GenomeBuilder uses CORBA to connect to database servers located at the EBI.
By using CORBA, GenomeBuilder can easily and silently connect to these other resources and
applications on the network. If users have their own databases and analysis applications, then
GenomeBuilder may also access these if they have implemented the appropriate CORBA
interface. (Instructions on how to do this will be described in a more advanced tutorial).
Basic functionality that is part of GenomeBuilder includes:

• Sequences can be either labelled or colour-coded according to their properties.
• Regions of similarity between sequences can be highlighted.
• Sequences in an assembly can be edited and moved.
• Linked windows mean an overall view of the assembly can be maintained in one
  window, while viewing at a higher resolution in another.
During the course of this tutorial you will be using GenomeBuilder to retrieve clusters of EST
sequences from an EST cluster database at the EBI (EuroGeneIndex [1]), retrieve annotation about
the sequences from an EST database (dbEST), and then use the CAP3 program to align these
sequences.
For further details of GenomeBuilder technical specifications and functionality, please see the
"GenomeBuilder User Guide" that is available on the web:
http://industry.ebi.ac.uk/~muilu/GBuilder/
Getting Started:
For this tutorial, we will be using version 1.0 of GenomeBuilder, available on the web at
http://corba.ebi.ac.uk/EST/GB_CLU/. N.B. The URL http://corba.ebi.ac.uk/ is the general place to
find information about CORBA and CORBA services at the EBI.
If you have not already opened this page, do so now. Once GenomeBuilder has loaded, you will
see two windows opened (see Figure 1). The window with a series of pull-down menus ("File",
"Edit", "Option", etc.) is the main GenomeBuilder window. The second window is the console in
which messages about the status of GenomeBuilder are displayed.
[1] The EuroGeneIndex database is a collection of DNA sequence clusters produced by the
JESAM software. JESAM takes novel approaches to both sequence comparison and clustering:
alignments are calculated using a PVM process farm, and published via a CORBA server at the
EBI. The EuroGeneIndex superclusters complement rather than supersede the UniGene
databases. See http://corba.ebi.ac.uk/~jparsons/packages/jesam/jesam_paper.html for further
details.
Figure 1: The starting windows of GenomeBuilder: Main GenomeBuilder window (top) and
message console window (bottom).
If you do not manage to open the applet, the most likely problem is that your site has a firewall
that does not support the CORBA protocol. Before proceeding, you will need to either talk to your
IT department about configuring your firewall to allow the IIOP protocol, or find a computer that is
outside of your firewall.
Loading in Sequences
As our first exercise, we will assume that we have discovered a mouse sequence that has a high
similarity to the mouse EST sequence with EMBL accession number AA387471.
The first objective will be to search and retrieve the supercluster from the EuroGeneIndex
database that contains the sequence AA387471, along with any other EST sequences with which
it has been clustered.
The first operation will be to connect to the database from which we can retrieve EST
superclusters. From the main GenomeBuilder window, choose:
"File" → "New…"
A new window will appear with the caption "No connection to sequence database". In order to
connect to a database, you will need to specify the database type, the site where the database is
located and the specific organism for which you wish to retrieve sequence clusters.
Currently, only the EuroGeneIndex EST sequence cluster database is available publicly. In the
new database connection window:
<Select database> → Select "EuroGeneIndex (JESAM)" [2]
This database is currently available only from the EBI. (If other sites provide mirrors or
alternatives, e.g. an in-house database, these can also be offered as options.) In the
database connection window:
<Select site> → Select "EBI"
There are EuroGeneIndex EST clusters for "Mouse", "Nidulans", "Rat" and "Zebra fish". In the
database connection window:
<Select organism> → Select "Mouse"
[2] In early versions of GenomeBuilder, this may be specified as "ClusterDB (JESAM)".
Upon selecting the appropriate "database", "site" and "organism", GenomeBuilder will try to
establish a CORBA connection to the appropriate database. You can follow the progress in the
GenomeBuilder message console window:
Connecting to the database. Please wait....
GOT
IOR:IOR:000000000000002749444c3a656d626c2f6562692f6573742f636c757
37465722f436c757374657244423a312e30000000000001000000000000002c00
0100000000000e3139332e36322e3139362e36330085da000000100000000037a
17efb0001ccf000000000
GOT
IOR:IOR:000000000000002b49444c3a656d626c2f6562692f6573742f616c696
76e6d656e742f416c69676e6d656e7444423a312e300000000000010000000000
00002c000100000000000e3139332e36322e3139362e36330085d500000010000
0000037a17e7200086b9b00000000
Database: gb_clusterdb, Mouse
http://bach.ebi.ac.uk/EST/IOR/clusterDB_mouse.ior
Connection OK
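The long hexadecimal strings in the console are stringified CORBA object references (IORs). As a rough illustration of what they contain, the sketch below decodes the interface type ID and the first IIOP profile (host and port) from the first IOR printed above. It is a minimal reader written for this tutorial and is not part of GenomeBuilder: the names `CdrReader` and `decode_ior` are hypothetical, and it assumes the console line breaks are simple wrapping of one continuous hex string.

```python
import struct


class CdrReader:
    """Minimal reader for a CDR encapsulation (first octet is the byte-order flag)."""

    def __init__(self, data: bytes):
        self.data = data
        self.endian = "<" if data[0] else ">"  # 0 = big-endian, 1 = little-endian
        self.pos = 1

    def _align(self, size: int) -> None:
        # CDR aligns each primitive on a multiple of its own size.
        self.pos += (-self.pos) % size

    def octets(self, count: int) -> bytes:
        chunk = self.data[self.pos:self.pos + count]
        self.pos += count
        return chunk

    def ushort(self) -> int:
        self._align(2)
        return struct.unpack(self.endian + "H", self.octets(2))[0]

    def ulong(self) -> int:
        self._align(4)
        return struct.unpack(self.endian + "I", self.octets(4))[0]

    def string(self) -> str:
        length = self.ulong()  # the length includes the trailing NUL
        return self.octets(length)[:-1].decode("ascii")


def decode_ior(text: str) -> dict:
    """Return the type ID plus host and port of the first IIOP profile."""
    # Tolerate the doubled "IOR:IOR:" prefix seen in the console output.
    hex_part = text.strip().rsplit("IOR:", 1)[-1]
    reader = CdrReader(bytes.fromhex(hex_part))
    info = {"type_id": reader.string(), "host": None, "port": None}
    for _ in range(reader.ulong()):           # number of tagged profiles
        tag = reader.ulong()
        body = reader.octets(reader.ulong())  # profile data is its own encapsulation
        if tag == 0:                          # TAG_INTERNET_IOP
            profile = CdrReader(body)
            profile.octets(2)                 # IIOP version (major, minor)
            info["host"] = profile.string()
            info["port"] = profile.ushort()
            break
    return info


# The first IOR from the console log above, joined across the wrapped lines.
CONSOLE_IOR = (
    "IOR:IOR:000000000000002749444c3a656d626c2f6562692f6573742f636c757"
    "37465722f436c757374657244423a312e30000000000001000000000000002c00"
    "0100000000000e3139332e36322e3139362e36330085da000000100000000037a"
    "17efb0001ccf000000000"
)

print(decode_ior(CONSOLE_IOR))
```

For this IOR the decoder reports the interface `IDL:embl/ebi/est/cluster/ClusterDB:1.0`, served from host 193.62.196.63 on port 34266. The port is one that a firewall would have to let through for the connection to succeed.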
Now that the connection to the database has been established, the supercluster containing the
sequence AA387471 and any other sequences with which it has been clustered can be retrieved:
Enter AA387471 as the "EMBL accession" in the database connection window
Press "Retrieve cluster".
Using CORBA, GenomeBuilder will now connect to the EuroGeneIndex database and retrieve the
given EST and any other EST sequences with which it has been clustered. Again, you can watch
the progress in the GenomeBuilder message console window:
query: AA387471 = dbname: 971231.embl cluster accession: AA097459
type: file
Cluster size is 4
download ok
Sequences found 4
New virtual sequence created
In the main GenomeBuilder window (which you may have to resize), four sequences should now
be presented at the resolution of the nucleotide sequence (see Figure 2).
Figure 2: The main window of the GenomeBuilder showing the four sequences retrieved from the
EuroGeneIndex EST cluster containing AA387471.
Changing the View and Adding Descriptions
You may scroll along the sequences using the scroll bar at the bottom of the main
GenomeBuilder window.
The magnification level of the window can be re-scaled dynamically using the scrollbar
underneath the pull-down menus at the top of the GenomeBuilder window:
Drag the scrollbar to the right so that the complete length of all the sequences is
displayed and the accession numbers of the sequences are displayed (see Figure 3).
Figure 3: The main GenomeBuilder window with the four EST sequences re-scaled.
The accession numbers of the sequences are not very meaningful. As the next step, we will
display annotation about the sequences as contained in the EuroGeneIndex. In the main
GenomeBuilder window:
"View" → "Show" → "Annotation" → "Description"
A short description of each sequence will now be displayed, as annotated in EuroGeneIndex.
We will now reset some of the display properties of the GenomeBuilder. From the main window of
GenomeBuilder, select:
"View" → "Format…"
Set "Line size" to 4 and click "OK"
The sequences are now more separated along the vertical axis. If you wish, change the "Font
size" too (see Figure 4).
Figure 4: The main GenomeBuilder window with annotation about the sequences from
EuroGeneIndex.
Adding and Displaying More Information about the Sequences
Since GenomeBuilder is CORBA aware, it can connect to other CORBA aware databases of
sequence and mapping information so as to retrieve and display further information about
sequences it is displaying.
We will establish a connection to an EST sequence database at the EBI [3]. From the main
GenomeBuilder window:
Select "Databases" → "Sequence" → "ESTDB (EBI)"
In the message console window, the following should appear:
Connecting to the database. Please wait....
OK
Database: gb_simpleEst, EST Database
http://bach.ebi.ac.uk/EST/IOR/SimpleEstDB_2.IOR
Connection OK
Now that the connection to the database has been made, it is necessary to retrieve the
information about the sequences loaded currently into GenomeBuilder from the ESTDB
database. From the main GenomeBuilder window:
[3] This ESTDB sequence database is essentially the dbEST EST database, but the information is
made available through a CORBA server.
"Edit" → "Annotate all from…" → "EST Database"
Now, the information about the EST sequences can be displayed, and used to colour the
sequences. From the main GenomeBuilder window:
"View" → "Show" → "Annotation" → "Clone library"
"View" → "Colour" → "Colour by clone library"
We can also sort the sequences so that those from the same clone library are neighbours. From
the main GenomeBuilder window:
"View" → "Sort/Align" → "Sort by library"
By now your display should look like Figure 5.
Figure 5: The main GenomeBuilder window with the sequences coloured and labelled according
to their clone library.
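The effect of "Sort by library" and "Colour by clone library" can be pictured as ordering the sequence records by one annotation field and assigning one colour per distinct value. The sketch below is only an illustration of that idea, not GenomeBuilder's code: the function names, the palette and the library names are made-up placeholders (only the accession numbers come from this tutorial).

```python
from itertools import cycle

# Placeholder palette; the colours GenomeBuilder actually uses are not given here.
PALETTE = ["red", "blue", "green", "orange"]


def sort_by_library(records):
    """Order records so that sequences from the same clone library are neighbours."""
    return sorted(records, key=lambda rec: (rec["library"], rec["accession"]))


def colour_by_library(records):
    """Assign one colour per distinct library, reusing colours if the palette runs out."""
    colours = {}
    palette = cycle(PALETTE)
    for rec in sort_by_library(records):
        if rec["library"] not in colours:
            colours[rec["library"]] = next(palette)
    return colours


# Accessions appear in this tutorial; the library names are hypothetical.
records = [
    {"accession": "AA387471", "library": "library B"},
    {"accession": "AA097459", "library": "library A"},
    {"accession": "AA541835", "library": "library B"},
]

ordered = sort_by_library(records)
colours = colour_by_library(records)
for rec in ordered:
    print(rec["accession"], rec["library"], colours[rec["library"]])
```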
Using GenomeBuilder to Perform Sequence Analysis
The EBI has made available some routine sequence analysis programs over the Internet using
CORBA. AppLab is used to provide the connection to these analysis tools through CORBA
(Please see http://industry.ebi.ac.uk/applab/ for further information). Here are some features and
highlights of AppLab:

• AppLab defines and automatically generates the IDL interfaces for invoking and
  controlling remote applications and for browsing their results through CORBA
• AppLab implements a mechanism that makes it possible to plug-in new applications
  without any additional programming effort
• AppLab is easily extensible for problem-specific application sets (such as the special data
  structures needed in bioinformatics)
Since GenomeBuilder is CORBA aware, it can:
1. Connect to these analysis servers available through CORBA using AppLab
2. Submit sequences to these analysis servers
3. Retrieve and display the results from these analyses
An analysis program that is available currently via a CORBA server is the sequence assembly
program, CAP3. We will use CAP3 to re-assemble our cluster of EST sequences. From the main
GenomeBuilder window:
"Analysis" → "Assembly" → "Cap3"
After a short time, a new window will open entitled "Cap3" (see Figure 6).
Figure 6: The CAP3 window generated by AppLab. The CAP3 analysis is run on a server at the
EBI, and the results returned to the client.
For this tutorial, we will be using the default values for CAP3. However, if you wish to change the
parameters used by the program, you can do so from the "Options" menu. In the
CAP3 window, select:
"Options" → "Show optional parameters"
To run CAP3, the sequences must be sent off to the CORBA server at the EBI. From the CAP3
window, select:
"Actions" → "Run Cap3"
Whilst the job is running, the GenomeBuilder application will be unavailable. Upon completion, a
message box will appear.
The GenomeBuilder display will then change and show the sequences in their assembled
position, plus the new consensus contig constructed by CAP3. A new window will also appear,
"AppLab stdout" that displays the output information of the CAP3 program (see Figure 7).
Figure 7: Results of CAP3 assembly: The sequences are aligned in the main GenomeBuilder
window with the consensus contig labelled and shown in black (top), and the text output of the
CAP3 program is shown in a separate window (bottom).
Thus, using GenomeBuilder and the EuroGeneIndex, we have assembled a cluster of ESTs into a
consensus contig that extends our original sequence, which showed similarity to the short
AA387471 fragment.
IMPORTANT - Before re-running CAP3, you will need to select and delete the CONTIG
generated by CAP3. (To delete the CONTIG sequence, select it by clicking on it with the left mouse
button, then select "Delete selected" from the "Edit" menu.)
Selecting Sequences
Sequences may be 'selected' so that operations are only carried out on those 'selected'
sequences. Selection is done by clicking the left mouse button over the sequences to be
selected. A selected sequence is highlighted by a red box around the sequence.
Selection is a toggle operation, thus clicking the left mouse button over a previously selected
sequence will deselect it.
Displaying the Quality of the Assembly
The nucleotide matches between the sequences in the assembly can be displayed by
GenomeBuilder. In the main GenomeBuilder window, select:
"View" → "Show" → "Similarity" → "Show (dis)similarity"
Slide the scale scrollbar at the top of the display so that the sequences appear at the resolution of
single nucleotides (see Figure 8).
Figure 8: The aligned supercluster of sequences from the EuroGeneIndex database with
identical nucleotides between sequence pairs highlighted (top) and only unmatched
(dissimilar) nucleotides highlighted (bottom).
Given that the sequences are assembled quite well, it is more meaningful to show the
dissimilarities between the sequences (see Figure 8). From the main GenomeBuilder window,
select:
"Options" → "Similarity" → "Show dissimilar"
This option has a toggle behaviour. To return to the "similarity" view, from the main
GenomeBuilder window, select:
"Options" → "Similarity" → "√ Show dissimilar"
It is also possible to examine the (dis)similarity between individual pairs of sequences. First, turn
off the "show similarity" option for all sequences. From the main GenomeBuilder window, select:
"View" → "Show" → "Similarity" → "Hide (dis)similarity"
Now select any two sequences by clicking on them using the left mouse button, e.g. "CONTIG 1"
and "AA097459". These selected sequences will now have a red box surrounding them.
If you have a three-button mouse, use the middle mouse button to bring up a pop-up menu. (If
you have a two-button mouse, press <Ctrl> with either mouse button):
<pop-up menu> → "Show (dis)similarity between selected"
You may toggle between the similarity and dissimilarity views using:
"Options" → "Similarity" → "Show dissimilar"
As you select and deselect other sequences their (dis)similarity will also be shown (see Figure 9).
Figure 9: Highlighting the similarities between the calculated CONTIG and AA097459.
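The match and mismatch views amount to a column-by-column comparison of two aligned sequences. The sketch below illustrates that comparison with a hypothetical function, `highlight_positions`, whose `show_dissimilar` flag mirrors the menu toggle. The two short sequences are toy data, and treating the unassigned "_" character as an ordinary mismatch is an assumption; the tutorial does not say how GenomeBuilder handles it.

```python
def highlight_positions(seq_a, seq_b, show_dissimilar=False):
    """Return 1-based columns where two aligned sequences match.

    With show_dissimilar=True, return the columns where they differ instead.
    Only the overlapping columns (the shorter length) are compared.
    """
    positions = []
    for column, (a, b) in enumerate(zip(seq_a, seq_b), start=1):
        same = a.lower() == b.lower()
        if same != show_dissimilar:
            positions.append(column)
    return positions


# Toy aligned sequences, not real data from the cluster.
contig = "acgtgacc"
est = "acgt_acg"

print("similar:   ", highlight_positions(est, contig))
print("dissimilar:", highlight_positions(est, contig, show_dissimilar=True))
```

When sequences agree at most positions, as after a good assembly, the dissimilar list is the shorter one, which is why the mismatch view is the easier one to read.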
In conclusion, you can switch between match and mismatch views using the "Options" →
"Similarity" → "Show dissimilar" toggle. You may view either the (mis)match between all
neighbouring sequences using "View" → "Show" → "Similarity" → "Show (dis)similarity", or you
may view the (mis)match between selected sequences using the <pop-up menu> → "Show
(dis)similarity between selected".
IMPORTANT - It is possible to display the (mis)matches between all neighbouring pairs and
between all selected sequences simultaneously. This can lead to a confusing display, so ensure
that only one of these options is turned on at any time.
Moving Sequences
GenomeBuilder provides the facility to move sequences relative to each other. The default is that
sequences may be moved along the horizontal axis.
IMPORTANT: Moving sequences in the horizontal direction will negate any previous alignments.
CAP3 (or equivalent) will have to be re-run in order to re-generate the alignment (see below).
To enable the "move sequence" option, in the main GenomeBuilder window, select:
"Edit" → "Enable move selected"
If sequences are selected by clicking over them with the left mouse button, then all selected
sequences may be moved by clicking and dragging with the left mouse button anywhere in the
main GenomeBuilder window.
If the "Show (dis)similarity" option is selected for all sequences, then as the selected sequence is
moved along the horizontal axis, the changing pattern of (mis)matches between sequences can
be observed.
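Why the mismatch pattern changes as a sequence is dragged can be seen by counting mismatches in the overlap at each horizontal offset. The sketch below is a toy illustration with a hypothetical helper, `mismatches_at_offset`, and invented sequences. It does not describe how GenomeBuilder or CAP3 computes alignments.

```python
def mismatches_at_offset(fixed, moved, offset):
    """Count mismatching columns when `moved` starts `offset` columns to the
    right of the start of `fixed` (negative offsets shift it to the left).
    Columns where the two sequences do not overlap are ignored."""
    count = 0
    for i, base in enumerate(moved):
        j = i + offset
        if 0 <= j < len(fixed) and fixed[j] != base:
            count += 1
    return count


# Toy sequences: `moved` matches `fixed` exactly when placed at offset 2.
fixed = "acgtacgtac"
moved = "gtacg"

for offset in range(-1, 5):
    print(offset, mismatches_at_offset(fixed, moved, offset))
```

In this toy case the count drops to zero only at offset 2, so shifting by a single column in either direction turns a clean alignment into one where every overlapping column mismatches.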
To turn off the sequence move option, from the main GenomeBuilder window, select:
"Edit" → "Disable move selected"
If you have moved the sequences along the horizontal direction, and hence lost the CAP3
alignment, this can be easily regenerated:
Select only the black "CONTIG1" sequence by clicking the left mouse button over it so
that it is highlighted by a red box. Remove any other selections on other sequences by
clicking over them again with the left mouse button.
Using the middle mouse button in the main GenomeBuilder window, activate the pop-up menu:
<pop-up menu> → "Delete selected"
Now run CAP3 as described previously.
If you try to run CAP3 without first deleting the CONTIG, GenomeBuilder will warn you.
Sequences may also be moved in the vertical direction. This does not destroy any previous global
alignment of the supercluster of EST sequences. Ensure that "Enable move selected" has
been chosen. In the main GenomeBuilder window, select:
"Options" → "Move vertical"
If a sequence is selected (using the left mouse button), then it can be clicked and dragged in the
vertical direction.
To turn off vertical motion, select "Options" → "√ Move vertical"
To turn off the sequence move option, from the main GenomeBuilder window, select:
"Edit" → "Disable move selected"
By selecting more than one sequence, whole groups of sequences can be moved together.
Editing Sequences
GenomeBuilder provides basic sequence editing capabilities. Select the sequence that you want
to edit (IMPORTANT - only ONE sequence may be selected).
For example, let us pretend that we go back to the trace for AA097459 and decide that the
unassigned nucleotide at position 66 (shown by a "_") is actually a guanine.
First select the AA097459 sequence using the left mouse button
Ensure that the view in the main GenomeBuilder window is at the level of the nucleotide
With the middle mouse button, bring up the pop-up menu, and select "Edit selected"
The sequence AA097459 can now be edited dynamically. In the "Sequence editor" window, the
position of the text cursor along the sequence is also shown in the main GenomeBuilder window
by a magenta tick mark.
In the "Sequence editor" window, click next to the "_" character at position 66. Note the
appearance of the tick mark in the main GenomeBuilder window
Delete the "_" character (note that the sequence in the main GenomeBuilder window is
updated immediately)
Now insert a "g" at this position (note that the sequence in the main GenomeBuilder
window is updated immediately)
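The edit just made, replacing an unassigned "_" at a known position with a called base, can be sketched as a small string operation. The function name `replace_base` and the example sequence are hypothetical; the real AA097459 sequence is not reproduced here. Positions are counted from 1, as in the tutorial.

```python
def replace_base(sequence, position, new_base, expected="_"):
    """Replace the character at a 1-based position, checking first that it is
    the character we expect to overwrite (by default the unassigned "_")."""
    if not 1 <= position <= len(sequence):
        raise IndexError(f"position {position} is outside the sequence")
    index = position - 1
    if sequence[index] != expected:
        raise ValueError(
            f"expected {expected!r} at position {position}, found {sequence[index]!r}"
        )
    return sequence[:index] + new_base + sequence[index + 1:]


# Toy sequence with an unassigned base at position 4.
original = "acg_tt"
edited = replace_base(original, 4, "g")
print(original, "->", edited)
```

Checking the expected character before overwriting guards against editing the wrong column, for example when the view has scrolled since the position was noted.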
Repeat the procedure for the first "_" character in the AA541835 sequence, and replace it with a
"g" too.
Finally, we shall re-run the CAP3 alignment with these new sequences.
Select only the black "CONTIG1" sequence by clicking the left mouse button over it so
that it is highlighted by a red box. Remove any other selections on other sequences by
clicking over them again with the left mouse button.
Using the middle mouse button in the main GenomeBuilder window, activate the pop-up menu:
<pop-up menu> → "Delete selected"
Now run CAP3 as described previously. The final result is shown in Figure 10.
If you try to run CAP3 without first deleting the CONTIG, GenomeBuilder will warn you.
Figure 10: The result of running CAP3 on the edited sequences with mismatches between
sequences highlighted (compare with Figure 8).
Zooming
It is quite usual to want to be able to examine sequences at the level of the single nucleotide,
whilst maintaining a global perspective of the sequence in the context of the others. This is
achieved in GenomeBuilder through the use of dynamically linked windows.
From the main GenomeBuilder window, select:
"View" → "New window"
A second window will appear which is similar to the main GenomeBuilder window, except that it
lacks menus. This second window has all the scrolling and zooming features of the main window.
Furthermore, the outline of a black box is shown in the main GenomeBuilder window. This box
shows the portion of the main window that is projected into the second "zoom" window (see
Figure 11).
Figure 11: The main GenomeBuilder window (top), and the secondary "zoom" window (bottom).
The view in the "zoom" window reflects the content within the black box in the main
GenomeBuilder window.
The view in the "zoom" window can be changed in three ways:
1. By moving the box in the main GenomeBuilder window - Click on the boundary of the
black box; the outline should now become red. Clicking and dragging with the left
mouse button will cause the box to move around the main GenomeBuilder screen.
Clicking on the border of the box again will cause the box to become immovable.
2. In the main GenomeBuilder window, place the mouse cursor to the upper left of the
point that is of interest and press the mouse middle button. Select "zoom here".
3. By using the scroll and zoom bars of the "zoom" window.
In all cases, as the view is changed in one window, it will be dynamically updated in the other.
It is now a simple procedure to examine the consensus at the level of the assembly in the main
GenomeBuilder window whilst simultaneously being able to view the quality of the assembly at
the nucleotide level in the "zoom" window.
Saving Sequences
In this current version of GenomeBuilder (v1.0), sequence files are saved using the FASTA
format. This does not retain assembly information. [N.B. The ability to save the assembly
information of a supercluster will be implemented in version 1.1].
To write out sequences in FASTA format, select those sequences in which you are interested,
e.g. CONTIG1. In the main GenomeBuilder window, select:
"View" → "Display selected in FASTA"
A new window will appear with the sequence in FASTA format (see Figure 12). This data can be
saved in a variety of ways:
1. From the "FASTA" window, select "File" → "Save…" to save the data to a file on the
local disk. (N.B. If you are running GenomeBuilder as an applet, this option may not
work because of Java security mechanisms).
2. Copy and paste the text to a text editor and save the file.
3. Copy and paste the text to an application, e.g. FASTA or BLAST at the EBI.
4. Mail the data to yourself. In the "FASTA" window, select "File" → "Email address…"
and enter your e-mail address. Next, select "File" → "Mail", and the data will be emailed to the e-mail address given.
Figure 12: The FASTA window of GenomeBuilder.
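For readers unfamiliar with FASTA, the sketch below writes and reads the format: a header line starting with ">" followed by the residues, wrapped to a fixed width. The function names and the 60-column width are assumptions made for illustration; the tutorial does not state how GenomeBuilder wraps its output. Note that only a name and the residues are stored, which is why assembly positions are lost on saving.

```python
def to_fasta(name, sequence, width=60):
    """Format one sequence as a FASTA record with lines of at most `width` residues."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([">" + name] + lines) + "\n"


def parse_fasta(text):
    """Parse FASTA text into a dict mapping each record name to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # first word of the header is the name
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


# Toy 80-residue sequence standing in for the CONTIG1 consensus.
toy_contig = "acgt" * 20
fasta_text = to_fasta("CONTIG1", toy_contig)
print(fasta_text, end="")
```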
Cleaning Up
To remove sequences in GenomeBuilder:

• To delete single sequences, select the sequences, and then select "Edit" → "Delete
  selected" in the main GenomeBuilder window (N.B. you may also use the pop-up
  menu in GenomeBuilder).
• To remove all sequences, select "File" → "Remove everything".
Using other Databases - RHalloc
GenomeBuilder may also connect to other databases. The next brief exercise will retrieve a
mouse cluster and study which sequences are allocated to the RH mapping database as
annotated in the RHalloc database.
Using what you've already learnt, load the mouse cluster from the EBI EuroGeneIndex that
contains the sequence R74857.
Now connect to the mapping database, RHalloc, that has a CORBA service at the EBI. Select:
"Databases" → "Mapping" → "RHalloc (EBI)"
Now we need to retrieve the annotation from RHalloc about the current sequences. Select:
"Edit" → "Annotate all from" → "RHalloc"
Next we need to display this information. Select:
"View" → "Colour" → "Colour by Property" → "RHalloc"
"View" → "Show" → "Annotation" → "Property" → "RHalloc"
You may also wish to run CAP3 on this cluster, as it has an interesting feature.
Exiting GenomeBuilder
To exit GenomeBuilder, from the main window, select:
"File" → "Exit"
You may have to shut down the CAP3 window separately, since this is a program running
separately from GenomeBuilder.
Conclusion
The EBI is investigating the use of CORBA (Common Object Request Broker Architecture) to
resolve some of the current IT problems in bioinformatics (see http://corba.ebi.ac.uk/). Distributed
computing using CORBA provides a standard means to connect computational resources on
different machines over the Internet. Computer applications can access these resources if they
implement the programming interface to a CORBA server (specified using the Interface Definition
Language (IDL)).
Currently, CORBA interfaces to EST sequence (dbEST), EST cluster (EuroGeneIndex) and
radiation hybrid databases (RHdb) are implemented and available in GenomeBuilder. New
databases can be added dynamically to GenomeBuilder if they provide an IDL interface and a
CORBA server.
Within GenomeBuilder, the AppLab program is used to generate client and server side
components of external analysis applications (see http://industry.ebi.ac.uk/applab/). Currently,
two applications have been added to GenomeBuilder using AppLab: CAP3 for making multiple
alignments and CLEANUP for removing redundant sequences.
Availability
For the latest information on GenomeBuilder, see http://industry.ebi.ac.uk/~muilu/GBuilder/.
GenomeBuilder is available as an applet from http://corba.ebi.ac.uk/EST/GB_CLU and can be run
using the Sun AppletViewer (recommended) or the Netscape browser (version 4.5 or higher).
It is also possible to run the program as a standalone application. You can get the latest release
of GenomeBuilder from ftp://ftp.ebi.ac.uk/pub/contrib/muilu/GB/gb.tar. The installation instructions
are available from ftp://ftp.ebi.ac.uk/pub/contrib/muilu/GB/README
Acknowledgements
GenomeBuilder is developed and written by Juha Muilu of the Industry Programme of the EMBL
European Bioinformatics Institute (http://industry.ebi.ac.uk/) with input from Tom Flores, Alan
Robinson, Patricia Rodriguez-Tome, Martin Senger, Alphonse Thanaraj and Matteo di Tommaso.