practical4 - School of Science and Technology

advertisement
Bioinformatics Practical 4:
BLAST in practice
Karen Marshall
Phone: 6773 2264
Room: #33 in the homestead building, School of Rural Science and Agriculture
Email: kmarsha2@metz.une.edu.au
The aim of this practical is to examine how the BLAST parameters (such as the word
length and scoring system) affect the alignment outcome.
Understanding BLAST
parameters is particularly relevant when doing alignments when the expected level of
homology is around 70-85%, such as sheep or cattle sequences to human sequences. This
is because the default BLAST parameters in many programs (e.g. NCBI BLAST) are
often optimized for high homology alignments. Use of the default ‘high homology
parameters’ for a lower homology alignment could result in many ‘hits’ being missed.
In this practical only nucleic acid alignments are considered. Specifically we will
Part 1

Simulate a set of sequences at different PAM (point accepted mutation) distances,
which correspond to different levels of sequence homology

Perform alignments using the above sequences and various BLAST parameters,
using the BLASTN software (version 2.2.8)
Part 2

Perform BLASTS of human and mouse interferon-gamma to the human genomic
sequence surrounding this gene, applying the concepts learnt in part 1
For assessment you are required to submit the BLASTN outputs from either the
simulated sequences (Part 1) or the real sequences (Part 2), and provide a summary /
discussion of up to 500 words. Please ensure the BLASTN outputs are concatenated into
one file, and sufficiently annotated (e.g. brief description at start of each original file).
Email files to kmarsha2@metz.une.edu.au.
BACKGROUND
Markov mutational models and PAM distances.
The PAM (point accepted mutation) model of molecular evolution was originally
developed from observations of residue replacements in closely related proteins (Dayhoff
et al. 1978, see http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html).
One
“PAM” corresponds to an average change in 1% of all amino acids present. A Markov
mutational model is used, assuming the mutation events are random and independent.
The same model was applied to DNA by States et al. (1991, Methods in Enzymology, 3:
66-70). Quoting from this paper: “A matrix M of probabilities for substituting base i by
base j after any given amount of evolution can be calculated by successive iteration of a
reference mutation matrix: Mn=(M1)n.
M1 is a matrix reflecting 99% sequence
conservation and one point accepted mutation (1 PAM) per 100 bases.
Mn then
represents the substitution probabilities after n PAMs. To model the case for which all
base substitutions are equally likely, the diagonal elements of M1 are all 0.99, whilst all
off-diagonal elements are 0.00333”.
Nucleic acid scoring schemes. Scoring matrices for nucleic acid sequence comparisons
were determined by States et al. (1991) for different PAM distances under either a
uniform or biased mutational model. Key figures from this paper are reproduced in
Figure 1 below. Under the uniform mutation model it was assumed that mutation events
were random and independent. Under the biased mutation model a transition (G ↔ A
and C ↔ T) was assumed to be three-times more likely than a transversion (G ↔ C and
A ↔ T).
The efficiency of these scoring schemes was examined in their paper.
Conclusions were a) “to achieve optimal sensitivity it is necessary to use scores relevant
to the specific question being asked” i.e. use a scoring scheme relevant to the expected
level of sequence homology and b) “in noncoding regions, the use of scores based on a
biased as opposed to uniform mutational model may substantially improve the search
sensitivity” i.e. a scores matrix that can simplify to a value for a match and a value for a
mismatch may not be optimal.
Figure 1. Key figures from States et al. (1991, Methods in Enzymology, 3: 66-70)
Expected levels of sequence homologies.
The level of homology between two
sequences is usually not known prior to performing a BLAST. However, an idea of the
level of homology can be gained from the literature. Expected homology levels will
differ depending on the species being compare (higher for e.g. human to mouse than
human to cattle), and on the sequence type (coding, intronic, 5’ or 3’ UTRs, promoter,
intergenic etc., see Figure 2) .
Figure 2. Homology in different parts of the gene, human to mouse. From …
BLASTN. In this practical BLASTN (2.2.8: Altschul et al., 1997, Nucleic Acids Res.
25:3389-3402) will be used on our own computers.
Key features of BLASTN
(reproduced with permission from McEwan 2004 “Bioinformatics for quantitative
geneticists” course notes) are given below:




A suffix tree is formed of the locations of all short “words” of sequence in one of the files.
Typically this is the larger file and is termed the database. This look up table allows the user to
quickly find perfect matches of a certain length. These are the “seeds” from which the alignment is
extended to form a high scoring pair (HSP) which is retained if it exceeds some threshold typically
the threshold expectation value used. This reduces the search space dramatically and the reduction
is greater the longer the seed length.
These HSPs are then joined together if the combined score including gap penalties is significantly
greater when they are combined.
It has features such as “DUST” which guard against false matches due to low complexity
sequence.
The seed lengths and their format depend on the level of homology. The default of 11 (and even
higher for Megablast) is not suited for low homology matches. The minimum of 7 dramatically
increases execution time but is required when homology drops below ~82%.



The scoring regime has to be altered for low homology comparisons by default it is optimized for
near perfect matches. The “magic” options
-W 7 -r 17 -q -21 -f 280 -G 29 -E 22 -X 240 are the most sensitive available. However these
options mean the E values are now only approximate.
More recently BLAST has introduced discontiguous seeds. These are seeds where only a pattern
of bases has to match. The theory suggests that these seeds are more sensitive to low homology
matches while retaining the benefits of speed observed with longer seed lengths.
The most often options used are:
o -W is the seed word length
o –r reward for a match default =1
o –q penalty for a mismatch default =3
o –G cost to open a gap
o –E cost to extend a gap
o –F often use the magic option –F “m D”. This prevents using low complexity or
softmasked sequence to initiate a seed match but allows the match to be extended through
the region.
o –e to set the threshold expectation. Note this is the threshold for the HSP BEFORE gaps
are included!
o –m to specify the output options the most used are:
 –m 3 this prints out all the query and aligns matches below it with dots for
identities
 -m 4 is the same as –m 3 but prints matching section in full great for cutting and
pasting regions for primer design
 -m 8 results in a tab delimited file that can be loaded into excel. This is the
format to use if you want the results to be displayed in a genome browser like
UCSC.
o –U allows soft masking with lower case if the T option is chosen, very useful.
Using BLASTN you can search with more than one sequence at a time. Steps for a
BLASTN search are:
1. Format one file into a database with the command
Formatdb –i dbfile.txt –p F –oT
This file can contain multiple sequences (in FASTA format), but the formatdb
command will fail if there are duplicated sequences. Check the format.log file to see
if the process was successful.
2. Perform a BLAST with the command
blastall –p blastn –d dbfile.txt –i comp.txt –o out.txt
where –p is followed by the program (BLASTN) –d
the name of the file
formatted in step 1, -i the other (smaller) sequence file, and –o the output file.
Other options can also be included e.g. –r for match score etc.
THE PRACTICAL: PART 1
1. Examine the spreadsheet markov.xls which shows an initial DNA sequence,
and corresponding sequences at PAM distances of 1  100. The simulation
can be rerun by Ctrl-Alt-F9. Ensure you understand the simulation, and note
the consequences of the mutational process (saturation, forward and backward
mutation).
2. Extract the original sequence plus sequences at PAM distances of 10, 20, 30,
40 and 50 for five different replicas.
Include these in a single file
‘markov.seq’ in FASTA format ready for BLAST. e.g.
>Ori_seq
AGATTCACTGGTGTGGCAA ….
>Rep1_Pam10
AGATTCACTGGTGTGGCAA ….
(if you don’t have the scripting skills to do this quickly, use the
‘markov.seq’ file provided).
3. Prepare to BLAST the original sequence back to this file
a. Create a file containing just the original sequence ‘markov.ori’
b. Format the file ‘markov.seq’ using the formatdb command (see above)
4. Perform a blast using the command below, and examine the output
blastall -p blastn -d markov.seq -i markov.ori -m 3 -o markov1.out
Note that this uses default parameters for all but the output. Use blastall to
view default parameters; also also given in Appendix 1.
5. Perform blast, using increasingly more optimized parameters for this
simulation, and examine the output.
a. remove ‘softmasking’
blastall -p blastn -d markov.seq -i markov.ori -m 3 -F "m D"
-o markov.out
b. reduce wordlength
blastall -p blastn -d markov.seq -i markov.ori -m 3 -F "m D" -W 7
-o markov.out
c. use ‘sensitive’ parameters
blastall -p blastn -d markov.seq -i markov.ori -m 3 -F "m D" -W 7
-r 17 -q -21 -f 280 -G 29 -E 22 -X 240 -o markov.out
d. change the expectation value
blastall -p blastn -d markov.seq -i markov.ori -m 3 -F "m D" -W 7
-r 17 -q -21 -f 280 -G 29 -E 22 -X 240 -e 1e-2 -o markov.out
THE PRACTICAL: PART 2
In this part of the practical you will align the mRNA sequences of human and cattle
interferon-gamma (INFG) against the human genomic sequence surrounding this gene.
Two files are available:

INFG_refseq.txt containing the reference sequences for human (NM_000619)
and cattle (NM_174086) INFG mRNA. These sequences are in FASTA format
and were obtained from the NCBI website using Batch Entrez

hs_chr12_subseq.txt containing human genomic sequence surrounding the INFG
gene (chr12:66,589,493-67,085,092 ~ ½ Mb) . This sequence is in FASTA
format with repeats masked to lower case, and was obtained from the USCS
‘golden path’ website.
1. BLAST using parameters that will favour the alignment of the human INFG
mRNA sequence to the human genomic sequence. You should be able to identify
the 4 exons (the output switch –m9 provides a useful table).
The exon / intron report from NCBI ‘AceView’ for NM_000619 is as follows:
In
variant
Length
& DNA
Coordinates
on gene
Supporting
clone (s)
Exon 1
243bp
1 to 243
M29383
Intron [gt-ag]
1242bp
244 to 1485
M29383
and 32 others
Exon 2
69bp
1486 to 1554
NM_000619
and 32 others
Intron [gt-ag]
95bp
1555 to 1649
NM_000619
and 33 others
Exon 3
183bp
1650 to 1832
NM_000619
and 24 others
Intron [gt-ag]
2425bp
1833 to 4257
NM_000619
and 24 others
Exon 4
725bp
4258 to 4982
2. BLAST using parameters that will favour the alignment of cattle (bos Taurus)
INFG mRNA sequence to the human genomic sequence. The expected homology
level for human coding region to cattle coding region is around 85%.
REFERENCES
An excellent BLAST tutorial, written by one of the developers of BLAST, can be found
at
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html.
This is strongely
recommended as further reading, and has a comprehensive reference list.
States, D.J., Gish, W. & Altschul, S.F. (1991) "Improved sensitivity of nucleic acid
database searches using application-specific scoring matrices." Methods in Enzymology
3:66-70.
McEwan (2004) “Bioinformatics for quantitative geneticists” course notes. Found online at http://www-personal.une.edu.au/~jvanderw/aabc_materials2004.htm#ModuleC
Appendix 1. blastall 2.2.7 arguments.
-p
-d
Program Name [String]
Database [String]
default = nr
-i Query File [File In]
default = stdin
-e Expectation value (E) [Real]
default = 10.0
-m alignment view options:
0 = pairwise,
1 = query-anchored showing identities,
2 = query-anchored no identities,
3 = flat query-anchored, show identities,
4 = flat query-anchored, no identities,
5 = query-anchored no identities and blunt ends,
6 = flat query-anchored, no identities and blunt ends,
7 = XML Blast output,
8 = tabular,
9 tabular with comment lines
10 ASN, text
11 ASN, binary [Integer]
default = 0
-o BLAST report Output File [File Out] Optional
default = stdout
-F Filter query sequence (DUST with blastn, SEG with others)
[String]
default = T
-G Cost to open a gap (zero invokes default behavior) [Integer]
default = 0
-E Cost to extend a gap (zero invokes default behavior) [Integer]
default = 0
-X X dropoff value for gapped alignment (in bits) (zero invokes
default behavior)
blastn 30, megablast 20, tblastx 0, all others 15 [Integer]
default = 0
-I Show GI's in deflines [T/F]
default = F
-q Penalty for a nucleotide mismatch (blastn only) [Integer]
default = -3
-r Reward for a nucleotide match (blastn only) [Integer]
default = 1
-v Number of database sequences to show one-line descriptions for
(V) [Integer]
default = 500
-b Number of database sequence to show alignments for (B) [Integer]
default = 250
-f Threshold for extending hits, default if zero
blastp 11, blastn 0, blastx 12, tblastn 13
tblastx 13, megablast 0 [Integer]
default = 0
-g Perfom gapped alignment (not available with tblastx) [T/F]
default = T
-Q Query Genetic code to use [Integer]
default = 1
-D DB Genetic code (for tblast[nx] only) [Integer]
default = 1
-a Number of processors to use [Integer]
default = 1
-O SeqAlign file [File Out] Optional
-J
Believe the query defline [T/F]
default = F
-M Matrix [String]
default = BLOSUM62
-W Word size, default if zero (blastn 11, megablast 28, all others
3) [Integer]
default = 0
-z Effective length of the database (use zero for the real size)
[Real]
default = 0
-K Number of best hits from a region to keep (off by default, if
used a value of 100 is recommended) [Integer]
default = 0
-P 0 for multiple hit, 1 for single hit [Integer]
default = 0
-Y Effective length of the search space (use zero for the real size)
[Real]
default = 0
-S Query strands to search against database (for blast[nx], and
tblastx)
3 is both, 1 is top, 2 is bottom [Integer]
default = 3
-T Produce HTML output [T/F]
default = F
-l Restrict search of database to list of GI's [String] Optional
-U Use lower case filtering of FASTA sequence [T/F] Optional
default = F
-y X dropoff value for ungapped extensions in bits (0.0 invokes
default behavior)
blastn 20, megablast 10, all others 7 [Real]
default = 0.0
-Z X dropoff value for final gapped alignment in bits (0.0 invokes
default behavior)
blastn/megablast 50, tblastx 0, all others 25 [Integer]
default = 0
-R PSI-TBLASTN checkpoint file [File In] Optional
-n MegaBlast search [T/F]
default = F
-L Location on query sequence [String] Optional
-A Multiple Hits window size, default if zero (blastn/megablast 0,
all others 40 [Integer]
default = 0
-w Frame shift penalty (OOF algorithm for blastx) [Integer]
default = 0
-t Length of the largest intron allowed in tblastn for linking HSPs
(0 disables linking) [Integer]
default = 0
-B Number of concatenated queries, for blastn and tblastn [Integer]
Optional
default = 0
Download