Working with the Conifer_dbMagic
database: A short tutorial on mining
conifer assembly data.
This tutorial is designed to be used in a “follow along” fashion. You will need to have
the Conifer_dbMagic database launched to replicate steps shown in this tutorial.
If you do not already have the conifer_dbMagic.jnlp (Java web start) file on your
desktop, use the following URL to download and launch the file now:
http://ancangio.uga.edu/ng-genediscovery/conifer_dbMagic.jnlp
Upon launching the program, the Assemblies
menu will appear.
The drop down menu is used to select the
species and the assembly that you wish to
query.
Each species ID is followed by an extension that identifies the
assembler used for de novo transcriptome assembly.
For example, the _MIRA, _NGEN, and _NBLR extensions
indicate that miraEST, NGen, or Newbler, respectively, was
used to assemble the transcript data.
The three P. taeda libraries are listed slightly differently, i.e.
as PtMIRA, PtNBLR1, and PtNGen2.
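The naming convention above can be sketched as a small lookup. This is only an illustration: the function name and the table are hypothetical, and it assumes that an assembly ID is a species ID, an underscore, and then the extension, as in C.atl_MIRA. The P. taeda IDs do not follow that pattern, so the sketch returns None for them rather than guessing.

```python
from typing import Optional

# Hypothetical lookup table built from the extensions named in the tutorial.
ASSEMBLER_BY_EXTENSION = {
    "MIRA": "miraEST",
    "NGEN": "NGen",
    "NBLR": "Newbler",
}


def assembler_for(assembly_id: str) -> Optional[str]:
    """Return the assembler implied by an ID such as 'C.atl_MIRA', or None."""
    # Split on the last underscore: species ID on the left, extension on the right.
    _species, separator, extension = assembly_id.rpartition("_")
    if not separator:
        return None  # no underscore, e.g. the P. taeda IDs such as 'PtMIRA'
    return ASSEMBLER_BY_EXTENSION.get(extension.upper())


print(assembler_for("C.atl_MIRA"))
```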
We have selected the C.atl_MIRA assembly for our example.
Now click on the “Submit” button to open the Assembly
Display Screen.
This is the main Assembly Display panel. Two tabs are on
the upper right: Search UniScript and Blast Annotation.
Note that C.atlantica is listed in the box on the right under
“Genotypes.” Do not click on it.
Click on the Submit button in the center of the window.
IMPORTANT: Do not click anywhere within the “Select
Contigs containing all these genotypes” box. This
feature is not used in conifer_dbMagic, since
no genotype information is parsed into the
database.
We see that there are 30,658 matches found (total clusters) in this assembly.
Each of the four populated columns can now be sorted (in increasing or decreasing order)
simply by clicking on its column header:
Num- database numeric identifier
UniScript- cluster name
UniScript Length- total consensus length in bases
Total Seq- total number of sequence reads associated with each cluster (note that there are
“clusters” of one, i.e. singletons).
*If you want to open a new Assemblies menu to select a different species or a different assembler,
you can simply click on “New Display” located at the top left of this window.
*In this and in all other windows, the term “UniScript” in column 2 is a legacy term meaning unique
transcript; here it is simply the contig name (or isotig name in the case of Newbler assemblies)
associated with each cluster in the database.
To search the assembly by UniScript or Sequence Name, or
to filter the assembly by either the UniScript length, or by
the number of sequence reads in a cluster, use the
“UniScript filters” box seen at the upper left.
Two drop down menus are available: First, select UniScript
Length “between x,y” and then type in a range of 2000 to
3000 bases. Next, select Number of Sequences >= and type
in the number 10.
Now click the Submit button.
The result is 768 clusters that have consensus sequences between
2000 and 3000 bases, and have at least 10 sequence reads per
cluster.
Note that all of the column values have also changed to
reflect the new query results.
Now click twice on the Total Seq column header to sort from
highest to lowest values.
After sorting, we will click on the first row to highlight
cluster C.atlantica_rep_c103, which has the largest number
of total reads (303).
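For readers who later work with an exported table, the filter and sort just performed can be sketched as follows. This is a minimal illustration, not the database's own query: the Cluster class, the function name, and the sample rows are all hypothetical, and it assumes that “between x,y” includes both end points, which the tutorial does not state.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cluster:
    num: int        # "Num" column
    uniscript: str  # "UniScript" column
    length: int     # "UniScript Length" column, in bases
    total_seq: int  # "Total Seq" column


def filter_and_sort(clusters: List[Cluster], min_len: int, max_len: int,
                    min_reads: int) -> List[Cluster]:
    """Keep clusters in the length range with enough reads, most reads first."""
    kept = [
        c for c in clusters
        # Assumption: the length range is inclusive at both ends.
        if min_len <= c.length <= max_len and c.total_seq >= min_reads
    ]
    # Equivalent to clicking the Total Seq header until it sorts high to low.
    return sorted(kept, key=lambda c: c.total_seq, reverse=True)


# Made-up rows for illustration only; not taken from the database.
example = [
    Cluster(1, "example_c1", 2500, 12),
    Cluster(2, "example_c2", 1800, 40),   # too short
    Cluster(3, "example_c3", 2999, 303),
    Cluster(4, "example_c4", 2100, 9),    # too few reads
]
result = filter_and_sort(example, 2000, 3000, 10)
print([c.uniscript for c in result])
```

The first element of the returned list corresponds to the row that would sit at the top of the table after sorting.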
Next, click on “View Alignment” at the bottom of the
window to see the cluster alignment.
*Multiple clusters can be selected here and multiple
alignment windows can be opened for viewing or comparing
several clusters at once.
A new UniScript Alignment window now appears with
the consensus sequence shown at the top, and a pileup
view of all aligned sequences listed below. Individual
sequence read names are seen on the left.
The red blocks indicate inconsistencies among the
sequenced reads and the consensus sequence (some of
these may be interpreted as possible indel/SNP
containing reads).
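What the red blocks represent can be illustrated with a short comparison of one aligned read against the consensus. This is a sketch under stated assumptions, not the viewer's actual logic: it assumes both rows are already aligned to the same length and that gaps are written as "-", and the function name is hypothetical.

```python
from typing import List, Tuple


def classify_differences(consensus: str, read: str,
                         gap: str = "-") -> List[Tuple[int, str]]:
    """Return (column, kind) for every column where the read differs.

    Columns are 0-based. kind is "indel" when either row has a gap
    character at that column, otherwise "substitution".
    """
    if len(consensus) != len(read):
        raise ValueError("aligned rows must have equal length")
    differences = []
    pairs = zip(consensus.upper(), read.upper())
    for column, (cons_base, read_base) in enumerate(pairs):
        if cons_base == read_base:
            continue  # read agrees with the consensus here
        kind = "indel" if gap in (cons_base, read_base) else "substitution"
        differences.append((column, kind))
    return differences


# Made-up aligned rows for illustration only.
print(classify_differences("ACGTACGT", "ACGAAC-T"))
```

A difference seen in a single read is more likely a sequencing error; the same difference recurring across many reads is what makes a column a candidate SNP or indel.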
The slider bars located on the bottom and right side of
the window are used to scroll through the alignment.
Now, return to the Assembly Display window by
clicking on it, and then click on “Blast Annotation” at
the bottom of the window.
The view switches to the Blast Annotation tab (one can also
go here directly as will be shown later).
The UniScript Name for the cluster we identified in the
“Search UniScript” tab has been auto-filled with a database
generated ID.
Next, click to highlight a target blast database (NCBI NR) in
the Select Target Database(s) panel.
Click “Submit” to see the blastx returns for the selected
contig.
Here we see the blastx results panel; 10 records have been returned for the
C.atlantica_rep_c103 cluster. Just as in the Search UniScript tables, one can sort the blast data
table columns by clicking on any column header.
Column widths can also be modified by clicking on the dividing line and dragging to the
desired width.
In any list obtained from the database, e.g. in the Search UniScript or the Blast Annotation tabs,
one can highlight contiguous or multiple, separated rows of interest using standard Windows
Shift or Ctrl key/mouse click combinations. Use Ctrl C to copy a highlighted table or individual
rows of table data for pasting into text or Excel files.
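Once copied rows have been pasted into a text file, they can be read back for further processing. The sketch below assumes the pasted text is tab-delimited with one table row per line, which is common for Java tables but is not stated in this tutorial; the function name and sample text are hypothetical.

```python
import csv
import io
from typing import List


def parse_copied_rows(text: str) -> List[List[str]]:
    """Split pasted table text into rows of cell strings.

    Assumption: cells are separated by tabs and rows by newlines.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [row for row in reader if row]  # skip blank lines


# Made-up pasted text for illustration only.
pasted = "1\texample_c1\t2500\t12\n2\texample_c3\t2999\t303\n"
for row in parse_copied_rows(pasted):
    print(row)
```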
Next, we will click on the “Expect” column and sort the blast data by their
expect values.
Note that whenever a row is highlighted, the amino acid alignment
between the query sequence and the target sequence appears at the
bottom of the window, which itself can be scrolled through using the slider
bar.
Now, click the “Reset” button to clear this query result.
Next, type the word “actin” in the Annotation box,
select < from the drop down menu next to Expect Val, and
type 1e-75 in the Expect Val box.
Click to highlight the TAIR_9 database.
Click Submit.
We see that 442 records are returned whose TAIR blast description records
contain the term “actin,” and that also have expect values < 1e-75.
*Note that any record whose description contains the string “actin” is returned,
whether or not it is actually an actin gene; e.g. the highlighted row, Num=9, matches
because of the word “interacting.” Also note that up to five different blast records may
be returned for any given cluster.
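The “interacting” hits follow from plain substring matching. The sketch below contrasts that behaviour with a stricter whole-word match that could be applied to an exported result list. It is an illustration only: the function names and sample descriptions are hypothetical, and case-insensitive matching is an assumption, since the tutorial does not say how the Annotation box treats case.

```python
import re
from typing import List, Tuple


def matches_substring(description: str, term: str) -> bool:
    """True if term occurs anywhere in the description (case-insensitive)."""
    return term.lower() in description.lower()


def matches_whole_word(description: str, term: str) -> bool:
    """True only if term occurs as a separate word (case-insensitive)."""
    pattern = rf"\b{re.escape(term)}\b"
    return re.search(pattern, description, re.IGNORECASE) is not None


def filter_hits(hits: List[Tuple[str, float]], term: str, max_expect: float,
                whole_word: bool = False) -> List[Tuple[str, float]]:
    """Keep (description, expect) pairs matching term with expect < max_expect."""
    match = matches_whole_word if whole_word else matches_substring
    return [(d, e) for d, e in hits if match(d, term) and e < max_expect]


# Made-up blast records for illustration only.
hits = [
    ("ACT1 (ACTIN 1)", 1e-120),
    ("kinase interacting protein", 1e-90),
    ("actin-related protein", 1e-40),   # fails the expect cutoff
]
print(len(filter_hits(hits, "actin", 1e-75)))
print(len(filter_hits(hits, "actin", 1e-75, whole_word=True)))
```

The whole-word version drops descriptions such as “interacting” but still keeps actin-related or actin-binding proteins, so manual review of the Seq Description column remains necessary.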
Now, we will sort the blast data by clicking on the “Match Length” column, sorting
from highest to lowest values.
Next, scroll down and highlight the first entry for ACT1 in the Seq Description
column (you will need to increase this column width to see it): record Num=46,
cluster C.atlantica_rep_c1017.
Now click “Search UniScript” at the bottom of the window.
We are returned to the Search UniScript tab and the
“UniScript Name(s)” box has been auto-filled with a
database generated ID.
Click Submit and the information for the ACT1 cluster is
returned.
Click to highlight the UniScript row.
Now, we can either click to view the alignment of the cluster, as we
saw previously, or we can click on “Make Fasta.”
After clicking Make Fasta, a dialog box appears for selection of
either just the consensus sequence, or the consensus sequence
plus all individual sequence reads associated with it.
The fasta file can then be downloaded to a local directory of
choice.
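The two choices in the dialog box correspond to two shapes of FASTA file, sketched below. This is not the code behind the Make Fasta button: the function name, the 60-character line width, and the sample records are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def make_fasta(consensus_name: str, consensus_seq: str,
               reads: Optional[List[Tuple[str, str]]] = None,
               width: int = 60) -> str:
    """Build FASTA text for a consensus, optionally followed by its reads.

    reads is a list of (name, sequence) pairs; pass None for consensus only.
    """
    records = [(consensus_name, consensus_seq)]
    if reads:
        records.extend(reads)
    lines = []
    for name, sequence in records:
        lines.append(f">{name}")
        # Wrap each sequence at a fixed width (assumed value, not from the tool).
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


# Made-up sequences for illustration only.
print(make_fasta("example_c1", "ACGTACGT", reads=[("read_1", "ACGT")]), end="")
```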
This concludes the conifer_dbMagic tutorial.
Here are some helpful commands for working in or copying
information from Java database tables:
Ctrl A = all rows selected.
Click/Shift/Click = a defined group of rows with the range
selected using the mouse.
Click/Ctrl/Click = multiple, ungrouped rows selected using
the mouse.
Ctrl C = copy rows that have been highlighted.