Laboratory 2 - Plant Sciences

BIT150 - Lab 2
Sequence Assembly and Database Searches
Copy the directory ‘Z:\10_Lab2’ into your own directory in ‘C:\YourLastname\10_Lab2’
A. Sequence assembly
Objective: Use Pregap4 and Gap4 to assemble a sample of subclones from Triticum
monococcum L. BAC clone 322N9, and edit the assembly.
Additional Resource: “An introduction to Pregap4 and Gap4”
( http://staden.sourceforge.net/manual/master_unix_brief.html )
Gap4 is a Genome Assembly Program. Gap4 stores the data for an assembly project in a Gap4
database, but first the data must be passed through preassembly steps via Pregap4. Pregap4
operates on batches of files. These files can be binary trace files (in ABI, ALF or SCF format),
Experiment Files, or plain text, and do not need to all be in the same format. Both Pregap4 and
Gap4 are part of the ‘Staden Package’. To open either of them, go to
Start/Programs/BioInformatics/Staden Package.
I. Pregap4
1. Open Pregap4:
1.1. The ‘Files to Process’ tab is used to define the files that Pregap4 will process and that
Gap4 will then use for the assembly.
- Go to Add files and Open the folder /10_Lab2/322N9/. In ‘Files of type’ select Any
(*.*). Select ALL the files contained in the folder /322N9/, and click on Open.
1.2. In the ‘Configure Modules’ tab you will find:
General Configuration: Sets general parameters that affect several other modules. If you do NOT select the
option Get entry names from trace files, the sequences’ names will not be changed.
- Select No.
Estimate Base Accuracies: Analyses the traces at each base call to estimate a confidence value for the called
base, by simply looking at the area underneath the trace for the called base and dividing this by the highest area
under the trace for the three uncalled bases. This is a very simplistic statistic. If another program (such as Phred) is
available, then this should be used in preference. If the Logarithmic (Phred) scale is selected, the confidence value
for the called base will be a log-transformed error probability [-10*log10(Pe)], i.e. a value of 20 will mean a 1/100
error probability.
- Select Logarithmic (Phred) scale.
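The logarithmic scale can be checked with a few lines of Python. This is only a sketch of the formula quoted above, -10*log10(Pe); the function names are illustrative and are not part of Pregap4.

```python
import math


def phred_to_error_probability(q):
    """Convert a Phred-scaled confidence value Q into an error probability Pe."""
    return 10 ** (-q / 10)


def error_probability_to_phred(pe):
    """Convert an error probability Pe into a Phred-scaled confidence value."""
    return -10 * math.log10(pe)


# Print the error probability for a few confidence values.
for q in (10, 20, 30):
    print(q, phred_to_error_probability(q))
```

A confidence value of 20 gives an error probability of about 0.01, which matches the 1/100 stated above.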
Trace Format Conversion: Converts files between the various trace formats. It can read ABI, ALF, SCF, CTF
and ZTR formats, and can write SCF, CTF, and ZTR formats. Of these formats, ZTR typically represents the
smallest size and is fast due to its own internal compression routines.
- Leave default parameters.
Initialize Experiment Files: Creates an ‘experiment file’ from a trace file. This module is mandatory for many
subsequent modules. There are no adjustable parameters for this module.
Augment Experiment Files: Adds further data to the ‘experiment file’, with additional information obtained
from external sources. We will not add any extra information.
Quality Clip: Determines where the sequence quality is too poor to be used for reliable assembly. Its default
quality evaluation is based on the range of values produced by the Estimate Base Accuracies module.
- Clip mode: The ‘by base call’ mode clips according to the number of uncalled bases in a window.
The ‘by confidence’ mode uses Phred-scaled confidence values to determine the quality for clipping.
- Minimum extent: The lowest allowable 5' clip position.
- Maximum extent: The largest allowable 3' clip position.
- Minimum length: If after quality clipping the good portion of a sequence is shorter than the specified
length, then this file will be rejected with the message ‘qclip: Sequence too short’.
The ‘by confidence’ mode:
- Window length: The window length over which the confidence will be averaged.
- Average confidence: The minimum average confidence (over the ‘window length’ bases) for a
sequence to be accepted as good-quality sequence.
The ‘by base call’ mode:
- Start offset: The base number to start the 5' and 3' good quality searches from.
- 3' window length: The window length in which to count uncalled bases.
- 3' number of uncalled bases: The maximum allowed count of uncalled bases in a single window
length.
- 5' window length: The window length in which to count uncalled bases.
- 5' number of uncalled bases: The maximum allowed count of uncalled bases in a single window
length.
- Select Clip mode by confidence, and leave default parameters for the rest of the options.
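The idea behind the ‘by confidence’ mode (averaging the confidence over a window and comparing it with a minimum) can be sketched in Python. This is a simplified illustration and not the algorithm that Quality Clip actually implements; the function name, the example values, and the rule of keeping everything between the first and last good window are assumptions made for the sketch.

```python
def clip_by_confidence(confidences, window_length, min_average):
    """Return (start, end) of the span covered by windows whose average
    confidence is at least min_average (end is exclusive), or None if
    no window qualifies.

    Simplified illustration only, not the Pregap4 implementation.
    """
    n = len(confidences)
    if window_length <= 0 or n < window_length:
        return None
    # Start positions of all windows that meet the average threshold.
    good_starts = [
        i
        for i in range(n - window_length + 1)
        if sum(confidences[i:i + window_length]) / window_length >= min_average
    ]
    if not good_starts:
        return None
    # Keep everything from the first good window to the end of the last one.
    return good_starts[0], good_starts[-1] + window_length


# Hypothetical per-base confidence values: poor ends, good middle.
example = [5, 5, 5, 30, 30, 30, 30, 30, 5, 5]
print(clip_by_confidence(example, window_length=3, min_average=20))
```

A longer window smooths over isolated poor bases, while a higher minimum average trims more of the read ends.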
Sequencing Vector Clip: Uses the vector_clip program to identify and mark the sequencing vector.
- Select Vector-primer subset: Used in conjunction with the Vector-primer filename, to indicate
which of the vector-primer pairs listed in this file should be used.
- In Select Vector-primer subset, select pBS/HindIII.
Screen for Unclipped Vector Clip: Identifies undetected segments of sequencing vector. After searching and
marking sequencing vector, any further strong matches to the sequencing vector indicate a possible problem.
- Select this option.
Cloning Vector Clip: Searches for non-sequencing vectors used in the shotgunning process, such as BACs,
Cosmids or YACs. Any fragment in any orientation of this vector could be present, so there is no need for the cut
sites to be known.
- Unselect this option.
Gap4 Shotgun Assembly: Assembles the processed sequences into Gap4 using Gap4's own assembly engine.
- In Gap4 database name, enter a name for your output (such as ‘Lab4’), and make sure
you change the Gap4 database version of the output every time you perform a new assembly
with different parameters. Select Create new database, and click on RUN.
1.3. The ‘Textual Output’ tab will show you your results in the Output window, whereas any
errors encountered during the assembly will be reported in the Error window.
II. Gap4
1. Open Gap4:
Your output file will be stored in /10_Lab2/322N9/ (the same folder where the files processed by
Pregap4 are located). Take a look at your output file by opening the file with the name you
created and the extension ‘.aux’. If you are asked to select a program from a list to open this file,
select Gap4.
2. Analyzing the Gap4 assembly:
2.1. Assembly information:
Look at the Gap4 main window:
- How many reads were taken to perform the assembly?
- How many contigs were created?
- What is the total length of the assembly?
2.2. Contig information:
In the Gap4 main window, go to Contig selector, which displays your assembly graphically.
If you place the mouse on one contig, it will tell you the name of the contig (which is the name
of the left-most reading), the length (in bp), and the number of reads included in that contig. The
same information can be viewed by View|Contig list in the main Gap4 window.
- What is the length of each contig?
- How many reads were included in each contig?
2.3. Quality of the assembly:
a. In the Gap4 main window, go to View|List consensus confidence, select all contigs, and
click on OK. It will provide you the total length and the expected total errors.
Look at the Confidence Values in your Gap4 Output window.
A confidence value of 10 corresponds to an error rate of 1/10; 20 to 1/100; 30 to 1/1000; etc.
For example, if 50 bases (Frequency) in the consensus have a Confidence Value of 10, we would
expect those 50 bases (with an error rate of 1/10) to contain 5 errors; and if 200 bases have a
Confidence Value of 20, we would expect them to contain 2 errors.
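The same arithmetic can be applied to a whole table of confidence values. The sketch below assumes the table has been typed in as a dictionary mapping each Confidence Value to its Frequency; the function name is illustrative and the numbers are the ones from the example above, not real output.

```python
def expected_errors(histogram):
    """Sum the expected number of errors over a confidence histogram.

    histogram maps a confidence value to the number of consensus bases
    (Frequency) that have that value.
    """
    return sum(count * 10 ** (-conf / 10) for conf, count in histogram.items())


# Values from the worked example: 50 bases at 10, 200 bases at 20.
example = {10: 50, 20: 200}
print(expected_errors(example))
```

For the example the result is approximately 7, that is 5 expected errors from the first group plus 2 from the second.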
b. In the Gap4 main window, go to View|Confidence values graph, select all contigs, and click
on OK. You can see a graph of the confidence values for the consensus sequence.
- By double-clicking on one region of the graph, the Contig Editor showing the aligned
sequences in that given region of the contig will be displayed.
- By double-clicking on the consensus sequence, the chromatograms corresponding to each
aligned sequence will be displayed. You can select the number of Columns and Rows showing
the chromatograms you want to have displayed.
- By selecting Show confidence, the confidence value with which each base was called will be
shown.
In the Gap4 main window, go to View|Quality plot. For each base in the consensus sequence, a
quality is computed based on the accuracy of the data on each strand. This information is then
plotted using color and height to distinguish between the different quality assignments. If you
right-click on the plot and go to Information, the percentages of the contig with a given quality
will be shown.
Color   Height    Meaning
Grey    0 to 0    OK on both strands, both agree
Blue    0 to 1    OK on plus strand only
Green   -1 to 0   OK on minus strand only
Red     -1 to 1   Bad on both strands
Black   -2 to 2   OK on both strands but they disagree
- Open the Quality plot for the larger contig.
- For how many bases was the quality of the consensus sequence OK on both strands?
- For how many bases was the quality of the consensus sequence bad on both strands?
2.4. Saving the consensus sequence:
In the Gap4 main window, go to File|Save consensus|Normal, select all contigs, name
consensus by ‘left-most reading’, select FASTA format, give a name to the Output file, and click
on OK. You can open the output file with Word.
2.5. Joining contigs:
If it is known that all the reads used to perform the assembly belong to a single contig
although they were grouped into separate contigs, the separate contigs can be joined.
In the Gap4 main window, go to View|Find internal joins. Select Input contigs from all
contigs, Use standard consensus, Maximum alignment length to list (bp): 4000, Use hidden data:
YES, Word length: 4, Minimum overlap: 20, Maximum percentage mismatch: 30.00, Alignment
algorithm: sensitive, Diagonal threshold: 1.0e-8.
The Contig Comparator showing the results will be displayed.
- What does this window look like?
Go to View|Display diagonal.
- Can you also see another region of overlap besides the main diagonal? What does it mean?
Double-click on this region, and the Join Editor will be displayed. You can edit the overlapping
region, and if you decide to join the contigs, click on Join/Quit. Take a look at the Contig
Selector.
- How many contigs do you have now?
Save this final consensus sequence.
2.6. Inspecting chromatograms and editing:
In the Gap4 main window, go to Edit|Edit contig, write the name of the contig you want to
open, and click on OK. The Contig Editor window will be displayed. Go to Edit Modes, and
select Mode set 2. You can insert/delete bases, perform searches, create tags, select primers,
among other editing options.
B. BLAST SEARCH
Open the link: http://www.ncbi.nlm.nih.gov/blast/
- BLASTN: Go to nucleotide blast (blastn).
3.1. Randomly type in a 50-bp DNA sequence, choose database Others\Nucleotide
collection (nr/nt), optimize for blastn, then click on BLAST.
Look at the E values.
3.2. Compare it with a real search using the Acyl Co-A synthetase DNA sequence.
- BLASTP: Go to protein blast (blastp).
3.3. Do the same as in 3.1. and 3.2., but now using a randomly generated 50-aa sequence
and the Acyl Co-A synthetase protein sequence, selecting algorithm blastp.
- BLASTN: Go to nucleotide blast (blastn).
3.4. Choose database Others/Expressed sequence tags (est), in Organism select ‘Hordeum’,
and optimize for blastn. Search with the Acyl Co-A synthetase DNA sequence. Good ESTs
show alignment to exons in the Acyl Co-A synthetase DNA sequence.
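For the random queries in steps 3.1 and 3.3, typing by hand tends to produce repetitive patterns. As an alternative, a short Python sketch can draw each residue uniformly at random; the function and constant names are illustrative, and the protein alphabet is assumed to be the 20 standard amino acids.

```python
import random

DNA_ALPHABET = "ACGT"
# Assumption: the 20 standard amino acids in one-letter code.
PROTEIN_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def random_sequence(alphabet, length, seed=None):
    """Return a random sequence of the given length drawn from alphabet."""
    rng = random.Random(seed)
    return "".join(rng.choice(alphabet) for _ in range(length))


# A 50-bp DNA query and a 50-aa protein query to paste into BLAST.
print(random_sequence(DNA_ALPHABET, 50))
print(random_sequence(PROTEIN_ALPHABET, 50))
```

Passing a fixed seed makes the same random sequence reproducible, which is convenient when comparing results with classmates.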
C. ENTREZ
Help file: http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helpentrez.chapter.EntrezHelp
Open the link: http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi
4.1 Get information about ‘Acyl CoA synthetase’.
4.2 Click on the number near Nucleotide.
History function:
Go to ‘History’; you will see the history of all your searches, and you can type the
number given to them in the query box for future searches.
4.3. Boolean operators:
Perform the following searches and report the number of Nucleotides found:
‘Acyl CoA Synthetase’
Acyl CoA Synthetase AND Hordeum
Acyl CoA Synthetase AND Hordeum [ORGANISM]
Acyl CoA Synthetase AND (Hordeum OR Triticum)
Acyl CoA Synthetase AND Hordeum OR Triticum
Since the name of this enzyme can appear also as ‘synthase’, perform your search using
truncation:
“Acyl CoA” synth*
Boolean operators in Entrez must be entered in UPPERCASE.
Restricted searches may also be entered manually by following the term with the name of
the field in square brackets “[ ]” (e.g. horse[ORGANISM]).
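The last two searches in the list in 4.3 differ only in their parentheses. Entrez evaluates Boolean operators from left to right unless parentheses group them, so the two queries can return different numbers of records. The sketch below imitates this with Python sets; the record numbers are made up purely for illustration and do not correspond to real database entries.

```python
# Hypothetical record IDs matching each search term.
acyl_coa_synthetase = {1, 2, 3, 4}
hordeum = {3, 4, 5}
triticum = {4, 6, 7}

# Acyl CoA Synthetase AND (Hordeum OR Triticum)
grouped = acyl_coa_synthetase & (hordeum | triticum)

# Acyl CoA Synthetase AND Hordeum OR Triticum, read left to right:
# (Acyl CoA Synthetase AND Hordeum) OR Triticum
left_to_right = (acyl_coa_synthetase & hordeum) | triticum

print(sorted(grouped))
print(sorted(left_to_right))
```

In the ungrouped query every Triticum record is returned whether or not it mentions the enzyme, which is why that search is expected to report the larger count.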
D. Local BLAST
The ability to harness the computational power of a BLAST search on a local scale provides a
huge advantage for searching unpublished databases or collections of sequences, and can even be
used on portions of larger databases that have been downloaded to a local server.
Here we will use the program BioEdit to create and search by BLAST against a local database
file of sequences from a next generation Illumina sequencing run, which produces approximately
80 million reads, each about 100 bp long. This gives us 8,000 megabases (Mb) of sequence. After
screening for quality, 95% of these reads were assembled into 29,189 contigs, which collectively
span 64.8 Mb. The reads in this example are from a plant pathogen that causes stripe rust in
cereal crops, known as Puccinia striiformis. P. striiformis has a genome size of 73 Mb, which
means that this database covers approximately 89% of the genome (64.8 Mb / 73 Mb).
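The figures in this paragraph can be reproduced with a short calculation. The sketch below uses only the numbers quoted above; the variable names are illustrative.

```python
reads = 80_000_000        # approximate number of reads in the run
read_length_bp = 100      # approximate read length in bp

# Total sequence produced, converted from bp to megabases (Mb).
total_mb = reads * read_length_bp / 1_000_000

assembled_mb = 64.8       # combined length of the 29,189 contigs
genome_mb = 73            # genome size of P. striiformis

# Fraction of the genome represented by the assembled contigs.
genome_fraction = assembled_mb / genome_mb

print(total_mb)
print(round(genome_fraction * 100))
```

Note that the 8,000 Mb of raw sequence is roughly 110 times the 73 Mb genome, so each position is read many times over before assembly.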
Researchers have identified a gene in a related pathogen, Puccinia graminis, which is known to
affect the pathogen’s ability to successfully infect the host with stem rust. You would like to
identify a homologous gene in the Puccinia striiformis database by BLAST search.
5.1 Create local database for BLAST.
Open BioEdit from the Bioinformatics folder.
Click Accessory Application -> BLAST -> Create a local nucleotide database file.
Select the file Bioedit_database_PST130_contigs.fsa . Files can be formatted as a .txt file
or as a FASTA file.
*Requires that you have permission to write to files in the Windows folder.*
5.2 Perform a local BLAST on your database.
Click Accessory Application -> BLAST -> Local BLAST.
- Select the appropriate type of BLAST to be performed.
- Select the newly created Nucleotide Database.
- Paste in the Query sequence you would like to use to search the database.
- Click Do Search.
For this example we will perform a blastn search. Use the following Query sequence:
>Stem_rust_Coding_region_P.graminis
AGAGCATTCATCCATCTCACTGTCCTAGCTGATCCTGCTTCTTCTCACCTCTTCCCTCCAGAACTCGCAACAGCATG
ACTAAGAAATTCGACGATCGGGTGATCCCACTTGGCGAGCCTGCGACCACCAGAGCCAAACTGGTCGGCGTCGCCAT
GGCCTTGTTTGCGGCCTTTGGAGGATTCTTATACGGCTATGACACTGGTTACATATCCGGAACCAAGGAGATGGCTT
ACTGGAAGTCGCTCTTTGGAGATCAGATTGCCGACGGAAGTTACATCCTCACCACTGCCAACGACTCCCTCGTCACT
TCCATCCTGTCAGCAGGAACCTTCACGGGTGCCCTCCTAGCATATCCTTTCGGTGATCGCCTCGGACGAAGATGGGG
TGTCATCGTCGCCTGTCTCATCTTCTGCATCGGCGTGGCTCTTCAGACCGCATCAACTGACATCCCAGTCTTTGCCG
TCGGCCGTGTGTTTGCAGGCTTGGGTGTCGGGATGACCTCATGTCTTGTGCCCATGTACCAATCAGAATGCGCCCCG
AAATGGATTCGAGGTGCTGTGGTGGCCTGCTACCAATGGGCTATCACCATCGGACTCTTGGTTGCAGCAATCGTCGT
CAACGCTACTCAAGACATCAACAATGCCAGCTCCTACCGAATCCCCATCGGGATCCAATTTGTTTGGGCCGTGATCC
TATCCTTAGGACTGTACATTCTACCCGAGTCGCCTAAGTATCTGATCCTCAAGGGGCGGGAAGAAGAAGCCAAGAAG
TCCCTCTCCAGACTGCTCTCCATCCCTGCCACCTCCCCGCAAGTCCTGAGCGAGTATGACGAAGTCTGCGAAAGCCT
GCGCGCCGAGCGCGCCATGGGCACCTCCACTTATGCAGACTGCTTCAAATCTGGACCGGGCAAATACCGGCTCAGGA
CTTTGACCGGGATGGGAATCCAAGCTCTCCAACAACTCACTGGTATCAACTTCATCTTCTATTACGGAACAACTTTC
TTCAAGAACAGTGGAATCAAGGAGGCATTCACAATCACCATTATCACCAATGTTGTCAACGTAGTCATGACGATCCC
CGGGATTTGGCTGGTCGACAAGGCCGGCCGTCGATCGCTGCTATTGACCGGCGCTGCCATCATGTGTGTCTGTGAAT
TCATCGTCGCCATCATCGGACTCAAGCTCGAAAGCTCCAACCTTGCTGGCCAACGCGCCTTGATCTCCCTCGTCTGT
ATCTACATCGGAGCGTTTGCGGCAACTTGGGGTCCCATCGCATGGGTTGTGACGAGTGAGATTTACCCGTTGGCCAT
CCGAGCCAAGGCGATAAACTTCGCCATCGGATACTCGACGCCATATTTAGTGGATGTCGGTCCAGGTAAGGCGGGGC
TCCAATCGAACGTGTTCTTCATCTGGGGCGCATGCTGTGGCTTGTGCTTTTTGTTCACCTTTTTCTGTATTCCGGAG
ACCAAGGGGCTGTCGCTCGAGCAGGTCGATCAACTCTACATGAACAGCTCCATCCTCGGCTCCAATGCCTATCGTCG
GCGACTCGTCAATGGCGAGTTCGACACCCATCAGCCTACGACCCCGCTCGGCTCGATCGTCGAAGACGACCGGGCCG
ACAAGCTCGCTAAAGTCGCTTGATCGATCCCTCCCAATAAACAAAAATTGACACTCTCATATCCTTACCTCCTTATA
ACCAAACGATCCGCTATTATACTAGTAGTAGAACTTGTTCCCCTTTCACCTAATATACTCTCTATTGTGACTTGTTA
GGTTTTTTTTTTGTTTTTTGTTTCTTCTTATGTACCTTAGGTAGGAGTGGGAGGGATCGATAAGCATGGTTGAGCTC
GAGGCGTGGTTTACATACCTGAATCATCACAGACTTCAGTTTCTCTTCTTCTTTCTCCACTTGGCAGCCGGGCTAGC
GGATCCATTCTTCCTTTCTCATCATCCCCATCTACCTTCCTCTCCCTCTCCCTACACACACACAACACAAAAAAAAA
CTTGCTCCAGCCACAACAAAAAAGATATGCCAGTTCTCGCGCTCGTTGTATTATTATCTTATATTGCCCGTCTCTTT
TTTTTGTCTTCGGTTCCTCCTCTTTGTTCTCATCTCTTTCTCTGTTTCCAGCTTTCTTTTATGATCTCTCTATATGA
AAACTCTTGAAATAAAATCAAAATCAAAAATCAAAAATTGATGATGATGATTTCGT


5.3 Look at the high E value hits…
Repeat the BLASTN search, this time clicking the low complexity filter.
Retrieve matching sequences from your database.
a) Note the matching contig name and its orientation (Plus or Minus) in your search
results. Open the database file using Notepad. Right click on the file, Click Open
With, and select Notepad. (Important! The database file is too large to open with
Microsoft Word, and will likely crash that program, causing you to lose any work you
haven’t saved. Be safe and use Notepad to open the database file!)
b) Use the Find feature to locate the matching contig in the file (click Edit -> Find or press
Ctrl+F) using its name. Copy it to your MCBS Word Add-In Document. If the match was in the
Minus orientation, you will need to reverse complement the sequence using the Antisense
Sequence Manipulation tool from the Add-In.
Here we find PSTcontig_7550 is the best match to our Query, and likely contains the
Puccinia striiformis homolog to our known Puccinia graminis gene. We can use this
complete sequence for continued molecular studies of the gene without the need to clone
and sequence it.
c) The orientation is Plus/ Minus :: Query/Contig, so we will need to reverse
complement this contig to find the matching sequence to our Query. Use the
Antisense tool to reverse complement the sequence, then highlight the portions of the
sequence that aligned to our Query sequence in the BLAST search.
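Steps a) to c) can also be done without Notepad or the Add-In. The Python sketch below pulls one record out of a FASTA file by name and reverse complements it. It assumes that the first word of each header line (after ‘>’) is the contig name and that the sequence uses only the letters A, C, G, T and N; the function names are illustrative and are not part of BioEdit.

```python
# Translation table mapping each base to its complement (N stays N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def fetch_fasta_record(path, name):
    """Return the sequence of the first record named `name`, or None.

    Assumes the record name is the first word of the header line.
    The file is read line by line, so large databases are not a problem.
    """
    chunks = []
    in_record = False
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if in_record:
                    break  # reached the next record, stop reading
                fields = line[1:].split()
                in_record = bool(fields) and fields[0] == name
            elif in_record:
                chunks.append(line)
    return "".join(chunks) if in_record else None


# Example use with the names from this lab (file must be in the working folder):
# contig = fetch_fasta_record("Bioedit_database_PST130_contigs.fsa", "PSTcontig_7550")
# print(reverse_complement(contig))
```

Because the match here is Plus/Minus, the reverse-complemented contig is the strand that reads in the same direction as the Query.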