Bioinformatics Practical 4: BLAST in practice

Karen Marshall
Phone: 6773 2264
Room: #33 in the homestead building, School of Rural Science and Agriculture
Email: kmarsha2@metz.une.edu.au

The aim of this practical is to examine how the BLAST parameters (such as the word length and scoring system) affect the alignment outcome. Understanding BLAST parameters is particularly relevant when the expected level of homology is around 70-85%, for example when aligning sheep or cattle sequences to human sequences. This is because the default BLAST parameters in many programs (e.g. NCBI BLAST) are often optimized for high homology alignments. Using these default ‘high homology’ parameters for a lower homology alignment could result in many ‘hits’ being missed. In this practical only nucleic acid alignments are considered. Specifically we will:

Part 1
o Simulate a set of sequences at different PAM (point accepted mutation) distances, which correspond to different levels of sequence homology
o Perform alignments using the above sequences and various BLAST parameters, using the BLASTN software (version 2.2.8)

Part 2
o Perform BLASTs of human and cattle interferon-gamma to the human genomic sequence surrounding this gene, applying the concepts learnt in Part 1

For assessment you are required to submit the BLASTN outputs from either the simulated sequences (Part 1) or the real sequences (Part 2), and provide a summary / discussion of up to 500 words. Please ensure the BLASTN outputs are concatenated into one file and sufficiently annotated (e.g. a brief description at the start of each original file). Email files to kmarsha2@metz.une.edu.au.

BACKGROUND

Markov mutational models and PAM distances. The PAM (point accepted mutation) model of molecular evolution was originally developed from observations of residue replacements in closely related proteins (Dayhoff et al. 1978, see http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html).
One “PAM” corresponds to an average change in 1% of all amino acids present. A Markov mutational model is used, assuming the mutation events are random and independent. The same model was applied to DNA by States et al. (1991, Methods in Enzymology, 3: 66-70). Quoting from this paper: “A matrix M of probabilities for substituting base i by base j after any given amount of evolution can be calculated by successive iteration of a reference mutation matrix: M_n = (M_1)^n. M_1 is a matrix reflecting 99% sequence conservation and one point accepted mutation (1 PAM) per 100 bases. M_n then represents the substitution probabilities after n PAMs. To model the case for which all base substitutions are equally likely, the diagonal elements of M_1 are all 0.99, whilst all off-diagonal elements are 0.00333”.

Nucleic acid scoring schemes. Scoring matrices for nucleic acid sequence comparisons were determined by States et al. (1991) for different PAM distances under either a uniform or a biased mutational model. Key figures from this paper are reproduced in Figure 1 below. Under the uniform mutation model it was assumed that mutation events were random and independent. Under the biased mutation model a transition (G ↔ A and C ↔ T) was assumed to be three times more likely than a transversion (any purine ↔ pyrimidine change, e.g. G ↔ C and A ↔ T). The efficiency of these scoring schemes was examined in their paper. The conclusions were:
a) “to achieve optimal sensitivity it is necessary to use scores relevant to the specific question being asked”, i.e. use a scoring scheme relevant to the expected level of sequence homology, and
b) “in noncoding regions, the use of scores based on a biased as opposed to uniform mutational model may substantially improve the search sensitivity”, i.e. a scoring matrix that simplifies to one value for a match and one value for a mismatch may not be optimal.

Figure 1. Key figures from States et al. (1991, Methods in Enzymology, 3: 66-70)

Expected levels of sequence homologies.
The level of homology between two sequences is usually not known prior to performing a BLAST. However, an idea of the level of homology can be gained from the literature. Expected homology levels will differ depending on the species being compared (higher for e.g. human to mouse than human to cattle), and on the sequence type (coding, intronic, 5’ or 3’ UTRs, promoter, intergenic etc., see Figure 2).

Figure 2. Homology in different parts of the gene, human to mouse. From …

BLASTN. In this practical BLASTN (2.2.8: Altschul et al., 1997, Nucleic Acids Res. 25:3389-3402) will be used on our own computers. Key features of BLASTN (reproduced with permission from McEwan 2004 “Bioinformatics for quantitative geneticists” course notes) are given below:

A suffix tree is formed of the locations of all short “words” of sequence in one of the files. Typically this is the larger file and is termed the database.
This look-up table allows the user to quickly find perfect matches of a certain length. These are the “seeds” from which the alignment is extended to form a high scoring pair (HSP), which is retained if it exceeds some threshold, typically the threshold expectation value used. This reduces the search space dramatically, and the reduction is greater the longer the seed length.
These HSPs are then joined together if the combined score, including gap penalties, is significantly greater when they are combined.
It has features such as “DUST” which guard against false matches due to low complexity sequence.
The seed lengths and their format depend on the level of homology. The default of 11 (and even higher for Megablast) is not suited to low homology matches. The minimum of 7 dramatically increases execution time but is required when homology drops below ~82%.
The scoring regime has to be altered for low homology comparisons; by default it is optimized for near perfect matches. The “magic” options -W 7 -r 17 -q -21 -f 280 -G 29 -E 22 -X 240 are the most sensitive available.
However, with these options the E values are only approximate.
More recently BLAST has introduced discontiguous seeds. These are seeds where only a pattern of bases has to match. The theory suggests that these seeds are more sensitive to low homology matches while retaining the speed benefits observed with longer seed lengths.
The most often used options are:
o -W the seed word length
o -r reward for a match (default = 1)
o -q penalty for a mismatch (default = -3)
o -G cost to open a gap
o -E cost to extend a gap
o -F filtering; the “magic” option -F "m D" is often used. This prevents low complexity or softmasked sequence being used to initiate a seed match, but allows the match to be extended through the region.
o -e sets the threshold expectation. Note this is the threshold for the HSP BEFORE gaps are included!
o -m specifies the output format. The most used are:
    -m 3 prints out the whole query and aligns matches below it, with dots for identities
    -m 4 is the same as -m 3 but prints the matching section in full; great for cutting and pasting regions for primer design
    -m 8 results in a tab-delimited file that can be loaded into Excel. This is the format to use if you want the results to be displayed in a genome browser like UCSC.
o -U allows soft masking with lower case if the T option is chosen; very useful.

Using BLASTN you can search with more than one sequence at a time. Steps for a BLASTN search are:

1. Format one file into a database with the command
    formatdb -i dbfile.txt -p F -o T
This file can contain multiple sequences (in FASTA format), but the formatdb command will fail if there are duplicated sequences. Check the formatdb.log file to see if the process was successful.

2. Perform a BLAST with the command
    blastall -p blastn -d dbfile.txt -i comp.txt -o out.txt
where -p is followed by the program (blastn), -d by the name of the file formatted in step 1, -i by the other (smaller) sequence file, and -o by the output file. Other options can also be included, e.g. -r for the match score.
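To give a feel for what the PAM distances used in Part 1 mean in terms of percent identity, the Markov model quoted in the Background can be iterated directly. The following is a minimal sketch in Python, assuming the uniform model from States et al. (diagonal 0.99, off-diagonal 0.01/3); the function names are made up for illustration, and the markov.xls spreadsheet may implement its simulation differently.

```python
def uniform_pam1(conserved=0.99):
    """4x4 1-PAM matrix for the uniform model: all substitutions equally likely."""
    off = (1.0 - conserved) / 3.0
    return [[conserved if i == j else off for j in range(4)] for i in range(4)]


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def pam_matrix(n, m1=None):
    """Return M_n = (M_1)^n by successive multiplication (n >= 1)."""
    m1 = m1 if m1 is not None else uniform_pam1()
    result = m1
    for _ in range(n - 1):
        result = mat_mul(result, m1)
    return result


def expected_identity(n):
    """Probability that a base is the same as the original after n PAMs.

    This includes sites that mutated and then mutated back.
    """
    return pam_matrix(n)[0][0]


# PAM distances extracted in Part 1 of the practical
for n in (10, 20, 30, 40, 50):
    print(f"PAM {n:3d}: expected identity {100 * expected_identity(n):.1f}%")
```

Under these assumptions the expected identity falls more slowly than the PAM number suggests (a PAM distance of 50 leaves roughly 63% identity, not 50%), because some sites are hit more than once or mutate back to the original base.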
THE PRACTICAL: PART 1

1. Examine the spreadsheet markov.xls, which shows an initial DNA sequence and corresponding sequences at PAM distances of 1 to 100. The simulation can be rerun with Ctrl-Alt-F9. Ensure you understand the simulation, and note the consequences of the mutational process (saturation, forward and backward mutation).

2. Extract the original sequence plus sequences at PAM distances of 10, 20, 30, 40 and 50 for five different replicas. Include these in a single file ‘markov.seq’ in FASTA format ready for BLAST, e.g.
    >Ori_seq
    AGATTCACTGGTGTGGCAA ….
    >Rep1_Pam10
    AGATTCACTGGTGTGGCAA ….
(If you don’t have the scripting skills to do this quickly, use the ‘markov.seq’ file provided.)

3. Prepare to BLAST the original sequence back to this file
a. Create a file ‘markov.ori’ containing just the original sequence
b. Format the file ‘markov.seq’ using the formatdb command (see above)

4. Perform a BLAST using the command below, and examine the output
    blastall -p blastn -d markov.seq -i markov.ori -m 3 -o markov1.out
Note that this uses default parameters for all but the output. Use blastall to view the default parameters; these are also given in Appendix 1.

5. Perform BLASTs using increasingly optimized parameters for this simulation, and examine the output.
a. remove ‘softmasking’
    blastall -p blastn -d markov.seq -i markov.ori -m 3 -F "m D" -o markov.out
b. reduce the word length
    blastall -p blastn -d markov.seq -i markov.ori -m 3 -F "m D" -W 7 -o markov.out
c. use ‘sensitive’ parameters
    blastall -p blastn -d markov.seq -i markov.ori -m 3 -F "m D" -W 7 -r 17 -q -21 -f 280 -G 29 -E 22 -X 240 -o markov.out
d. change the expectation value
    blastall -p blastn -d markov.seq -i markov.ori -m 3 -F "m D" -W 7 -r 17 -q -21 -f 280 -G 29 -E 22 -X 240 -e 1e-2 -o markov.out

THE PRACTICAL: PART 2

In this part of the practical you will align the mRNA sequences of human and cattle interferon-gamma (INFG) against the human genomic sequence surrounding this gene.
Two files are available:
o INFG_refseq.txt, containing the reference sequences for human (NM_000619) and cattle (NM_174086) INFG mRNA. These sequences are in FASTA format and were obtained from the NCBI website using Batch Entrez.
o hs_chr12_subseq.txt, containing human genomic sequence surrounding the INFG gene (chr12:66,589,493-67,085,092, ~ ½ Mb). This sequence is in FASTA format with repeats masked to lower case, and was obtained from the UCSC ‘golden path’ website.

1. BLAST using parameters that will favour the alignment of the human INFG mRNA sequence to the human genomic sequence. You should be able to identify the 4 exons (the output switch -m 9 provides a useful table). The exon / intron report from NCBI ‘AceView’ for NM_000619 is as follows:

    In variant        Length & DNA   Coordinates on gene   Supporting clone(s)
    Exon 1            243bp          1 to 243              M29383
    Intron [gt-ag]    1242bp         244 to 1485           M29383 and 32 others
    Exon 2            69bp           1486 to 1554          NM_000619 and 32 others
    Intron [gt-ag]    95bp           1555 to 1649          NM_000619 and 33 others
    Exon 3            183bp          1650 to 1832          NM_000619 and 24 others
    Intron [gt-ag]    2425bp         1833 to 4257          NM_000619 and 24 others
    Exon 4            725bp          4258 to 4982

2. BLAST using parameters that will favour the alignment of the cattle (Bos taurus) INFG mRNA sequence to the human genomic sequence. The expected homology level for human coding region to cattle coding region is around 85%.

REFERENCES

An excellent BLAST tutorial, written by one of the developers of BLAST, can be found at http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html. This is strongly recommended as further reading, and has a comprehensive reference list.

States, D.J., Gish, W. & Altschul, S.F. (1991) “Improved sensitivity of nucleic acid database searches using application-specific scoring matrices.” Methods in Enzymology 3:66-70.

McEwan (2004) “Bioinformatics for quantitative geneticists” course notes. Found online at http://www-personal.une.edu.au/~jvanderw/aabc_materials2004.htm#ModuleC

Appendix 1. blastall 2.2.7 arguments.
-p  Program Name [String]
-d  Database [String] default = nr
-i  Query File [File In] default = stdin
-e  Expectation value (E) [Real] default = 10.0
-m  alignment view options:
      0 = pairwise,
      1 = query-anchored showing identities,
      2 = query-anchored no identities,
      3 = flat query-anchored, show identities,
      4 = flat query-anchored, no identities,
      5 = query-anchored no identities and blunt ends,
      6 = flat query-anchored, no identities and blunt ends,
      7 = XML Blast output,
      8 = tabular,
      9 = tabular with comment lines,
      10 = ASN, text,
      11 = ASN, binary
    [Integer] default = 0
-o  BLAST report Output File [File Out] Optional default = stdout
-F  Filter query sequence (DUST with blastn, SEG with others) [String] default = T
-G  Cost to open a gap (zero invokes default behavior) [Integer] default = 0
-E  Cost to extend a gap (zero invokes default behavior) [Integer] default = 0
-X  X dropoff value for gapped alignment (in bits) (zero invokes default behavior) blastn 30, megablast 20, tblastx 0, all others 15 [Integer] default = 0
-I  Show GI's in deflines [T/F] default = F
-q  Penalty for a nucleotide mismatch (blastn only) [Integer] default = -3
-r  Reward for a nucleotide match (blastn only) [Integer] default = 1
-v  Number of database sequences to show one-line descriptions for (V) [Integer] default = 500
-b  Number of database sequence to show alignments for (B) [Integer] default = 250
-f  Threshold for extending hits, default if zero: blastp 11, blastn 0, blastx 12, tblastn 13, tblastx 13, megablast 0 [Integer] default = 0
-g  Perform gapped alignment (not available with tblastx) [T/F] default = T
-Q  Query Genetic code to use [Integer] default = 1
-D  DB Genetic code (for tblast[nx] only) [Integer] default = 1
-a  Number of processors to use [Integer] default = 1
-O  SeqAlign file [File Out] Optional
-J  Believe the query defline [T/F] default = F
-M  Matrix [String] default = BLOSUM62
-W  Word size, default if zero (blastn 11, megablast 28, all others 3) [Integer] default = 0
-z  Effective length of the database (use zero for the real size) [Real] default = 0
-K  Number of best hits from a region to keep (off by default, if used a value of 100 is recommended) [Integer] default = 0
-P  0 for multiple hit, 1 for single hit [Integer] default = 0
-Y  Effective length of the search space (use zero for the real size) [Real] default = 0
-S  Query strands to search against database (for blast[nx], and tblastx); 3 is both, 1 is top, 2 is bottom [Integer] default = 3
-T  Produce HTML output [T/F] default = F
-l  Restrict search of database to list of GI's [String] Optional
-U  Use lower case filtering of FASTA sequence [T/F] Optional default = F
-y  X dropoff value for ungapped extensions in bits (0.0 invokes default behavior) blastn 20, megablast 10, all others 7 [Real] default = 0.0
-Z  X dropoff value for final gapped alignment in bits (0.0 invokes default behavior) blastn/megablast 50, tblastx 0, all others 25 [Integer] default = 0
-R  PSI-TBLASTN checkpoint file [File In] Optional
-n  MegaBlast search [T/F] default = F
-L  Location on query sequence [String] Optional
-A  Multiple Hits window size, default if zero (blastn/megablast 0, all others 40) [Integer] default = 0
-w  Frame shift penalty (OOF algorithm for blastx) [Integer] default = 0
-t  Length of the largest intron allowed in tblastn for linking HSPs (0 disables linking) [Integer] default = 0
-B  Number of concatenated queries, for blastn and tblastn [Integer] Optional default = 0
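
The tabular views listed above (-m 8, or -m 9 with comment lines) are convenient for picking out exons in Part 2, since each hit is one line with its query and subject coordinates. The following is a rough sketch in Python of reading such a file and ordering the hits along the query; the field names, function names and example lines are made up for illustration (the example lines are not real BLAST results), and the column order assumed is the standard 12-column tabular layout.

```python
import csv
import io

# Labels chosen here for the 12 tabular columns (not names used by BLAST itself).
FIELDS = ["query", "subject", "pct_identity", "aln_length", "mismatches",
          "gap_opens", "q_start", "q_end", "s_start", "s_end", "evalue", "bit_score"]
INT_FIELDS = ("aln_length", "mismatches", "gap_opens",
              "q_start", "q_end", "s_start", "s_end")
FLOAT_FIELDS = ("pct_identity", "evalue", "bit_score")


def to_float(text):
    """Convert a number; as a precaution, accept a bare exponent such as 'e-100'."""
    return float("1" + text) if text.startswith("e") else float(text)


def parse_tabular(handle):
    """Yield one dict per hit, skipping blank lines and '#' comment lines (-m 9)."""
    rows = (line for line in handle if line.strip() and not line.startswith("#"))
    for record in csv.reader(rows, delimiter="\t"):
        hit = dict(zip(FIELDS, record))
        for key in INT_FIELDS:
            hit[key] = int(hit[key])
        for key in FLOAT_FIELDS:
            hit[key] = to_float(hit[key])
        yield hit


def hits_by_query_position(handle, query):
    """Return the hits for one query, ordered along the query (mRNA) sequence."""
    hits = [h for h in parse_tabular(handle) if h["query"] == query]
    return sorted(hits, key=lambda h: h["q_start"])


# Made-up example input, for illustration only.
EXAMPLE = (
    "# comment lines like this one appear with -m 9\n"
    "mRNA_A\tgenomic_X\t100.00\t200\t0\t0\t301\t500\t9001\t9200\t1e-100\t396\n"
    "mRNA_A\tgenomic_X\t98.00\t300\t6\t0\t1\t300\t5001\t5300\t1e-150\t540\n"
)

for hit in hits_by_query_position(io.StringIO(EXAMPLE), "mRNA_A"):
    print(hit["q_start"], hit["q_end"], hit["s_start"], hit["s_end"], hit["pct_identity"])
```

With a real output file, the same functions could be called on an open file handle instead of the in-memory example; consecutive hits along the mRNA that map to separated genomic intervals would then correspond to candidate exons.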