Gap4 is a Genome Assembly Program

BIT 150 - Lab 4
Sequence Assembly and Annotation
A. Sequence assembly
An introduction to Pregap4 and Gap4
(http://staden.sourceforge.net/manual/master_unix_brief.html)
Gap4 is a Genome Assembly Program. Gap4 stores the data for an assembly project in
a Gap4 database, but the data must first be passed through preassembly steps via
Pregap4. Pregap4 operates on batches of files. These files can be binary trace files (in
ABI, ALF or SCF format), Experiment Files, or plain text, and do not need to all be in the
same format.
Objective: Use Pregap4 and Gap4 to assemble a sample of subclones from Triticum
monococcum L. BAC clone 322N9, and edit the assembly.
Activities:
Copy the folder ‘08_Lab4’ from Z: to C:.
Both Pregap4 and Gap4 are part of the ‘Staden Package’. To open either of them, go to
Start/Programs/BioInformatics/Staden Package.
I. Pregap4
1. Open Pregap4:
1.1. The ‘Files to Process’ tab is used to select the files to be processed; the
processed files are then used by Gap4 for the assembly.
- Go to Add files and Open the folder /08_Lab4/322N9/. In ‘Files of type’ select
Any (*.*). Select ALL the files contained in the folder /322N9/, and click on Open.
1.2. In the ‘Configure Modules’ tab you will find:
General Configuration: Sets general parameters that affect several other modules. If you
do NOT select the option Get entry names from trace files, the sequences’ names will not
be changed.
- Select No.
Estimate Base Accuracies: Analyses the traces at each base call to estimate a confidence
value for the called base, by simply looking at the area underneath the trace for the called
base and dividing this by the highest area under the trace for the three uncalled bases.
This is a very simplistic statistic. If another program (such as Phred) is available, then
this should be used in preference. If the Logarithmic (Phred) scale is selected, the
confidence value for the called base will be a log-transformed error probability
[-10*log10(Pe)], i.e. a value of 20 will mean a 1/100 error probability.
- Select Logarithmic (Phred) scale.
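The relationship between a Phred-scaled confidence value Q and the error probability Pe, Q = -10*log10(Pe), can be checked with a few lines of Python. This is only an illustrative sketch; the function names are hypothetical and are not part of the Staden Package.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred-scaled confidence value Q to an error probability Pe."""
    return 10 ** (-q / 10)

def error_prob_to_phred(pe):
    """Convert an error probability Pe to a Phred-scaled confidence value Q."""
    return -10 * math.log10(pe)

# Print the error probability for a few common confidence values.
for q in (10, 20, 30):
    print(q, phred_to_error_prob(q))
```

A confidence value of 20 corresponds to an error probability of 1/100, and a value of 30 to 1/1000.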
Trace Format Conversion: Converts files between the various trace formats. It can read
ABI, ALF, SCF, CTF and ZTR formats, and can write SCF, CTF, and ZTR formats. Of
these formats, ZTR typically produces the smallest files and is fast because it uses its
own internal compression routines.
- Leave default parameters.
Initialize Experiment Files: Creates an ‘experiment file’ from a trace file. This module is
mandatory for many subsequent modules. There are no adjustable parameters for this
module.
Augment Experiment Files: Adds further data to the ‘experiment file’, with additional
information obtained from external sources. We will not add any extra information.
Quality Clip: Determines where the sequence quality is too poor to be used for reliable
assembly. Its default quality evaluation is based on the range of values produced by the
Estimate Base Accuracies module.
- Clip mode: The ‘by base call’ mode is equivalent to the Uncalled Base Clip module.
The ‘by confidence’ mode uses Phred-scaled confidence values to determine
the quality for clipping.
- Minimum extent: The lowest allowable 5' clip position.
- Maximum extent: The largest allowable 3' clip position.
- Minimum length: If after quality clipping the good portion of a sequence is
shorter than the specified length, then this file will be rejected with the
message ‘qclip: Sequence too short’.
The ‘by confidence’ mode:
- Window length: The window length over which the confidence will be
averaged.
- Average confidence: The minimum average confidence (over the ‘window
length’ bases) for a sequence to be accepted as good-quality sequence.
The ‘by base call’ mode:
- Start offset: The base number to start the 5' and 3' good quality searches
from.
- 3' window length: The window length in which to count uncalled bases.
- 3' number of uncalled bases: The maximum allowed count of uncalled bases
in a single window length.
- 5' window length: The window length in which to count uncalled bases.
- 5' number of uncalled bases: The maximum allowed count of uncalled bases
in a single window length.
- Select Clip mode by confidence, and leave default parameters for the rest of the
options.
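To make the ‘by confidence’ parameters more concrete, here is a minimal sketch of window-based quality clipping. It is NOT the algorithm used by the Pregap4 Quality Clip module; it only illustrates how a window length, an average confidence threshold and a minimum length could interact. The function name and the default values are assumptions chosen for illustration.

```python
def clip_by_confidence(confidences, window_length=50,
                       average_confidence=15.0, minimum_length=0):
    """Return (left, right) bounds of the good-quality region, or None.

    Bounds are 0-based and half-open. A window is 'good' when the mean
    confidence of its bases is at least average_confidence. The good region
    runs from the start of the first good window to the end of the last one.
    (Hypothetical helper; defaults are illustrative, not Pregap4 defaults.)
    """
    n = len(confidences)
    if window_length <= 0 or n < window_length:
        return None
    window_sum = sum(confidences[:window_length])
    first_good = last_good = None
    for start in range(n - window_length + 1):
        if start > 0:
            # Slide the window one base to the right using a running sum.
            window_sum += confidences[start + window_length - 1]
            window_sum -= confidences[start - 1]
        if window_sum / window_length >= average_confidence:
            if first_good is None:
                first_good = start
            last_good = start
    if first_good is None:
        return None
    left, right = first_good, last_good + window_length
    if right - left < minimum_length:
        return None  # comparable to rejecting a sequence as too short
    return left, right

# Toy read: poor quality at both ends, good quality in the middle.
read = [5] * 20 + [30] * 100 + [5] * 20
print(clip_by_confidence(read, window_length=10, average_confidence=20))
```

A larger window smooths over isolated low-confidence bases, while a higher average confidence trims the ends more aggressively.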
Sequencing Vector Clip: Uses the vector_clip program to identify and mark the
sequencing vector.
- Select Vector-primer subset: Used in conjunction with the Vector-primer
filename, to indicate which of the vector-primer pairs listed in this file should
be used.
- In Select Vector-primer subset, select pBS/HindIII.
Screen for Unclipped Vector Clip: Identifies undetected segments of sequencing vector.
After searching and marking sequencing vector, any further strong matches to the
sequencing vector indicate a possible problem.
- Select this option.
Cloning Vector Clip: Searches for non-sequencing vectors used in the shotgunning
process, such as BACs, Cosmids or YACs. Any fragment in any orientation of this vector
could be present, so there is no need for the cut sites to be known.
- Unselect this option.
Gap4 Shotgun Assembly: Assembles the processed sequences into Gap4 using Gap4's
own assembly engine.
- In Gap4 database name, enter a name for your output (such as ‘Lab4’), and make
sure you change the Gap4 database version of the output every time you perform a new
assembly with different parameters. Select Create new database, and click on RUN.
1.3. The ‘Textual Output’ tab will show you your results in the Output window,
whereas any errors encountered during the assembly will be reported in the Error window.
II. Gap4
1. Open Gap4:
Your output file will be stored in /08_Lab4/322N9/ (the same folder where the files
processed by Pregap4 are located). Take a look at your output file by opening the file
with the name you created and the extension ‘.aux’. If you are asked to select a program
from a list to open this file, select Gap4.
2. Analyzing the Gap4 assembly:
2.1. Assembly information:
Look at the Gap4 main window:
- How many reads were taken to perform the assembly?
- How many contigs were created?
- What is the total length of the assembly?
2.2. Contig information:
In the Gap4 main window, go to Contig selector, which displays your assembly
graphically.
If you place the mouse on one contig, it will tell you the name of the contig (which is the
name of the left-most reading), the length (in bp), and the number of reads included in
that contig. The same information can be viewed via View|Contig list.
- What is the length of each contig?
- How many reads were included in each contig?
2.3. Quality of the assembly:
a. In the Gap4 main window, go to View|List consensus confidence, select all contigs,
and click on OK.
Look at the Confidence Values in your Gap4 Output window. For example, if 50 bases
(Frequency) in the consensus have a Confidence Value of 10, we would expect those 50
bases (with an error rate of 1/10) to contain 5 errors; and if 200 bases have a Confidence
Value of 20, we would expect them to contain 2 errors.
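The arithmetic in this example can be written as a short Python sketch: each group of bases contributes its frequency multiplied by the error probability 10^(-Q/10) for its Confidence Value Q. The function name and the dictionary layout are assumptions for illustration; Gap4 does not produce output in this form.

```python
def expected_errors(confidence_counts):
    """Sum the expected number of errors over all confidence values.

    confidence_counts maps a Confidence Value (Phred scale) to the number
    of consensus bases (Frequency) reported with that value.
    """
    return sum(count * 10 ** (-q / 10)
               for q, count in confidence_counts.items())

# The two groups from the example: 50 bases at value 10, 200 bases at value 20.
counts = {10: 50, 20: 200}
print(expected_errors(counts))  # about 7: 5 errors + 2 errors
```

Applying the same sum to every row of the List consensus confidence output gives the total number of errors expected in the consensus.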
- How many errors are expected to be in the consensus sequence?
How many bases in the consensus have a Confidence Value of 10?
How many errors would we expect to be contained in those bases?
How many bases were assembled with a Confidence Value of 20 or less?
b. In the Gap4 main window, go to View|Confidence values graph, select all contigs,
and click on OK. You can see a graph of the confidence values for the consensus
sequence.
By double-clicking on one region of the graph, the Contig Editor showing the aligned
sequences in that given region of the contig will be displayed.
By double-clicking on the consensus sequence, the chromatograms corresponding to each
aligned sequence will be displayed. You can select the number of Columns and Rows
showing the chromatograms you want to have displayed.
By selecting Show confidence, the confidence value with which each base was called will
be shown.
c. In the Gap4 main window, go to View|Quality plot. For each base in the consensus
sequence, a quality is computed based on the accuracy of the data on each strand. This
information is then plotted using color and height to distinguish between the different
quality assignments. If you right-click on the plot and go to Information, the percentages
of the contig with a given quality will be shown.
Color   Height     Meaning
Grey    0 to 0     OK on both strands, both agree
Blue    0 to 1     OK on plus strand only
Green   -1 to 0    OK on minus strand only
Red     -1 to 1    Bad on both strands
Black   -2 to 2    OK on both strands but they disagree
- Open the Quality plot for the larger contig.
For how many bases was the quality of the consensus sequence OK on both
strands?
For how many bases was the quality of the consensus sequence bad on both
strands?
2.4. Saving the consensus sequence:
In the Gap4 main window, go to File|Save consensus|Normal, select all contigs, name
consensus by ‘left-most reading’, select FASTA format, give a name to the Output file,
and click on OK. You can open the output file with Word.
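Besides opening the file in Word, the saved FASTA file can be inspected with a short script, for example to confirm the length of each contig. The sketch below is a minimal FASTA reader written for illustration; the contig names and sequences in the example are made up, and the function is not part of Gap4.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA-format lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

# Made-up example records. For a real file, pass an open file object instead,
# e.g. with open(path_to_your_consensus_file) as handle: read_fasta(handle)
example = [">contigA", "ACGTACGT", "ACGT", ">contigB", "GGGTTT"]
for contig_name, sequence in read_fasta(example):
    print(contig_name, len(sequence))
```

The lengths printed for your own file can be compared with the values shown in the Contig selector.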
2.5. Joining contigs:
If it is known that all the reads used to perform the assembly belong to a single contig
even though they were grouped into separate contigs, the separate contigs can be joined.
In the Gap4 main window, go to View|Find internal joins. Select Input contigs from all
contigs, Use standard consensus, Maximum alignment length to list (bp): 4000, Use
hidden data: YES, Word length: 4, Minimum overlap: 20, Maximum percentage
mismatch: 30.00, Alignment algorithm: sensitive, Diagonal threshold: 1.0e-8.
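To illustrate what the Minimum overlap and Maximum percentage mismatch settings mean, the sketch below applies the two thresholds to an overlap that has already been aligned. It is a simplified illustration only: it does not reproduce Gap4's word-based search or its alignment algorithm, and the function names are hypothetical.

```python
def percent_mismatch(aligned_a, aligned_b):
    """Percentage of alignment columns where the two sequences differ."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("empty alignment")
    mismatches = sum(1 for a, b in zip(aligned_a, aligned_b) if a != b)
    return 100.0 * mismatches / len(aligned_a)

def passes_join_filter(aligned_a, aligned_b,
                       minimum_overlap=20, maximum_percent_mismatch=30.0):
    """Check an aligned overlap against the two thresholds used in this lab."""
    return (len(aligned_a) >= minimum_overlap
            and percent_mismatch(aligned_a, aligned_b) <= maximum_percent_mismatch)

# A 24-column overlap with a single mismatch (about 4.2% mismatch).
end_of_contig_1 = "ACGT" * 6
start_of_contig_2 = "TCGT" + "ACGT" * 5
print(percent_mismatch(end_of_contig_1, start_of_contig_2))
print(passes_join_filter(end_of_contig_1, start_of_contig_2))
```

With a permissive mismatch threshold such as 30%, weak overlaps are also reported, so each candidate join should be inspected before it is accepted.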
The Contig Comparator showing the results will be displayed.
- What does this window look like?
Go to View|Display diagonal.
- Can you also see another region of overlap besides the main diagonal? What
does it mean?
Double-click on this region, and the Join Editor will be displayed. You can edit the
overlapping region, and if you decide to join the contigs, click on Join/Quit. Take a
look at the Contig Selector.
- How many contigs do you have now?
Save this final consensus sequence.
2.6. Inspecting chromatograms and editing:
In the Gap4 main window, go to Edit|Edit contig, enter the name of the contig you want
to open, and click on OK. The Contig Editor window will be displayed. Go to Edit
Modes, and select Mode set 2. You can insert/delete bases, perform searches, create tags,
and select primers, among other editing options.
B. Sequence annotation
Objective: Annotate the T. monococcum BAC clone 322N9 fragment assembled in A.
Annotate genes and repetitive elements using a combination of graphic and word-based
alignment tools, gene-finding programs, and BLAST searches.
Activities:
Open the final consensus sequence you obtained after the assembly performed in A. in
Word.
1. Dotter
Dotter helps to identify repetitive elements present in a sequence.
1.1. Use Dotter to align the T. monococcum sequence with itself.
- Can you identify any repeat?
Are they direct or inverted repeats?
To facilitate annotation, let us divide the sequence into two separate regions: the
one where repeats were identified, and the one with no repeats, where genes will
potentially be found.
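The idea behind a self-comparison dot plot can be sketched in Python: every pair of positions that share an identical word is a dot, and dots found against the reverse complement indicate inverted repeats. This is only an exact-word illustration under assumed names and parameters; Dotter itself uses a scoring scheme rather than exact word matches, and the sequence is assumed to be uppercase A, C, G, T.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def word_matches(seq_a, seq_b, word_length):
    """Return (i, j) pairs where seq_a[i:i+w] equals seq_b[j:j+w]."""
    index = defaultdict(list)
    for j in range(len(seq_b) - word_length + 1):
        index[seq_b[j:j + word_length]].append(j)
    matches = []
    for i in range(len(seq_a) - word_length + 1):
        for j in index.get(seq_a[i:i + word_length], ()):
            matches.append((i, j))
    return matches

def self_dotplot(seq, word_length=8):
    """Return (direct, inverted) lists of word-start pairs (i, k) with i < k."""
    direct = [(i, j) for i, j in word_matches(seq, seq, word_length) if i < j]
    n = len(seq)
    inverted = []
    for i, j in word_matches(seq, reverse_complement(seq), word_length):
        k = n - j - word_length  # start of the matching word on the forward strand
        if i < k:
            inverted.append((i, k))
    return direct, inverted

# Toy sequences: the same unit repeated directly, then as an inverted repeat.
unit = "GATTACAGT"
print(self_dotplot(unit + "CCCCC" + unit, word_length=9))
print(self_dotplot(unit + "CCCCC" + reverse_complement(unit), word_length=9))
```

In a plotted dot matrix, direct repeats appear as lines parallel to the main diagonal, and inverted repeats as lines perpendicular to it.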
2. Annotating REPETITIVE ELEMENTS
2.1. Perform a blastn alignment with the region of the sequence where the repeats were
identified, using the nucleotide collection database.
- Click on the best blastn alignment. What is the accession/version number?
- Open the flat file in a new tab. Go to the ‘features’ section to see the annotation of
the element present in this region of the subject sequence. What is it? Using the
coordinates of the alignment between your query sequence and the subject
sequence from the database, find and highlight this element in your sequence in
Word.
- Now divide the region of the sequence where the repeats were identified into
two approximate halves, and align the two halves using blast2sequences to help
identify the LTRs of the repetitive element found. Mark the LTRs in
bold. Can you identify the host duplication and the inverted repeats
flanking the LTRs? Highlight them in your sequence in Word. Highlight the
inverted repeats flanking the LTRs with the same color used to highlight the
complete repetitive element, but use a different color to highlight the host
duplication, since it is not part of the repetitive element.
Annotate the repetitive elements in the T. monococcum sequence in Word.
2.1.1. TREP: BLAST repeats
Go to http://wheat.pw.usda.gov/ggpages/Repeats/blastrepeats3.html . This specialized
repeat database allows you to discover and annotate the repetitive elements present in a
sequence. Copy the region of the sequence where the repeats were identified and paste it
in the TREP window. Select the blastn program and the Cereal repeat sequences, complete
set database. Click on Search.
- Do you get any significant hit? What is it? Does it agree with your previous
finding?
3. Annotating GENES
3.1. Perform a blastn alignment with the region of the sequence with no repeats, using
the est_others database.
- How many exons would you predict the gene present in this region has? Highlight
them in your sequence in Word.
3.3.1. Gene prediction programs
3.3.1.1. FGENESH
Go to
http://linux1.softberry.com/berry.phtml?topic=fgenesh&group=programs&subgroup=gfind .
Copy the region of the sequence with no repeats and paste it in the FGENESH
window. Select Monocot plants as organism, and click on Search. Take a look at the
predicted genes by clicking on Show picture of predicted genes in PDF file.
- How many genes are predicted? How many exons?
Using a combination of your blastn search and the gene finding programs,
identify the start codon, splice sites, exons, stop codon, and PolyA, and highlight
them in your sequence in Word.
3.3.1.2. You can also go to GeneSeqer at
http://www.plantgdb.org/PlantGDBcgi/GeneSeqer/PlantGDBgs.cgi and GENSCAN at
http://genes.mit.edu/GENSCAN.html, and compare the results of the different gene
prediction programs.
- How many genes/exons are identified by each of the programs?
3.3.2. Gene annotation
3.3.2.1. BLASTP (protein blast) and BLASTX
Go to http://www.ncbi.nlm.nih.gov/BLAST/ and then to BLASTP (protein blast) to
perform a search with the translated protein (after translation using GeneTool), or to
BLASTX to search the protein database using the cDNA as the query.
- Do you get any significant hit? What is it?
Do you find any conserved domain? What is it?
Annotate the gene in the T. monococcum sequence in Word.
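The translation step that precedes the BLASTP search (done with GeneTool in this lab) can be sketched in Python using the standard genetic code. This is a minimal illustration, not a substitute for GeneTool: the function name is hypothetical, translation starts at the first base of the sequence, and any codon containing an ambiguous base is reported as X.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(cdna):
    """Translate a cDNA from its first base up to the first stop codon."""
    cdna = cdna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cdna) - len(cdna) % 3, 3):
        amino_acid = CODON_TABLE.get(cdna[i:i + 3], "X")
        if amino_acid == "*":
            break  # stop codon reached
        protein.append(amino_acid)
    return "".join(protein)

print(translate("ATGGCCTAA"))
```

The cDNA passed to the function should start at the predicted start codon, with the introns already removed.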