HW 2 Answer Key

BCB 444/544 Fall 07 Sept5
BCB 444/544
Homework 2 (20 pts)
Due Fri Sept 14
HW2 KEY p 1
ANSWER KEY
(please bring hard copy to class or deliver to MBB 106 by 5 PM)
Note: You may work with other students to solve these problems, but each student must submit separate answers
in his/her own words. If necessary, use additional paper for your answers.
Objectives:
1. Understand how to use dot plots and interpret their results
2. Understand how dynamic programming works
3. Gain experience using different types of substitution matrices
Problem 1 - (3 pts total)
1. Suppose we are given two identical 100 kb DNA sequences A and B. A dot plot comparing these two sequences
would have the pattern shown below:
[Figure: dot plot of sequence A (1 to 100 kb) against sequence B (1 to 100 kb)]
Draw simple diagrams of dot plots that would result from each of the following comparisons.
Be sure to label both axes, as in the example shown above.
1a) (1 pt) A 120 kb DNA sequence, A, and a 100 kb DNA sequence, B, identical to A,
except that A has a 20 kb segment duplicated near the beginning, relative to B.
1b) (1 pt) Two 100 kb DNA sequences, identical except that B has a 40 kb
segment inverted relative to A, near the center.
1c) (1 pt) A 10 kb gene sequence, A, and the 8 kb mRNA, B, produced from it.
There is a single 2 kb intron located 1 kb from the beginning of the A gene (i.e.,
from the 5'end).
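The idea behind these dot plots can be sketched in a few lines of Python. This is a minimal illustration, not part of the assignment: the function name `dot_plot`, the `word` parameter, and the short toy sequences are all assumptions made for the example, scaled down from the kb-sized sequences in the problem. A dot is recorded wherever a word of length `word` starting at position i in A matches a word starting at position j in B.

```python
def dot_plot(a, b, word=1):
    """Return the set of (i, j) pairs where a[i:i+word] == b[j:j+word]."""
    # Index every word of b by its start position (hypothetical helper layout).
    index = {}
    for j in range(len(b) - word + 1):
        index.setdefault(b[j:j + word], []).append(j)
    dots = set()
    for i in range(len(a) - word + 1):
        for j in index.get(a[i:i + word], []):
            dots.add((i, j))
    return dots


def render(a, b, word=1):
    """Draw the plot as text: A runs left to right, B runs top to bottom."""
    dots = dot_plot(a, b, word)
    width = len(a) - word + 1
    height = len(b) - word + 1
    return "\n".join(
        "".join("*" if (i, j) in dots else "." for i in range(width))
        for j in range(height)
    )


# Toy stand-ins (invented for illustration) for a gene with one intron
# and the mRNA produced from it, as in problem 1c.
exon1, intron, exon2 = "ATGGCCAT", "GTAAGTTTCAG", "TGCACCTGA"
gene = exon1 + intron + exon2
mrna = exon1 + exon2

print(render(gene, mrna, word=4))
```

With these toy sequences the plot shows two diagonal segments: one starting at the origin for the first exon, and a second that is shifted along the gene axis by the intron length. Comparing a sequence with itself gives the single unbroken diagonal described in the example at the start of the problem.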
Problem 2 - (5 pts total)
2. Re-read pages 41-49 of your textbook re: substitution matrices and statistical significance of alignment.
Recall that to evaluate similarities between divergent proteins, a PAM matrix with a larger "index" but a BLOSUM
matrix with a smaller "index" should be used. In this exercise, you will compare different matrices to test whether
they really give different results in a BLAST search. Your task is to determine whether any yeasts have an
enzyme similar to the human telomerase enzyme:
>gi|109633031|ref|NP_937983.2| telomerase reverse transcriptase isoform 1 [Homo sapiens]
MPRAPRCRAVRSLLRSHYREVLPLATFVRRLGPQGWRLVQRGDPAAFRALVAQCLVCVPWDARPPPAAPSFRQVSCLKELVARVLQRLCERGAKNVLAF
GFALLDGARGGPPEAFTTSVRSYLPNTVTDALRGSGAWGLLLRRVGDDVLVHLLARCALFVLVAPSCAYQVCGPPLYQLGAATQARPPPHASGPRRRLG
CERAWNHSVREAGVPLGLPAPGARRRGGSASRSLPLPKRPRRGAAPEPERTPVGQGSWAHPGRTRGPSDRGFCVVSPARPAEEATSLEGALSGTRHSHP
SVGRQHHAGPPSTSRPPRPWDTPCPPVYAETKHFLYSSGDKEQLRPSFLLSSLRPSLTGARRLVETIFLGSRPWMPGTPRRLPRLPQRYWQMRPLFLEL
LGNHAQCPYGVLLKTHCPLRAAVTPAAGVCAREKPQGSVAAPEEEDTDPRRLVQLLRQHSSPWQVYGFVRACLRRLVPPGLWGSRHNERRFLRNTKKFI
SLGKHAKLSLQELTWKMSVRDCAWLRRSPGVGCVPAAEHRLREEILAKFLHWLMSVYVVELLRSFFYVTETTFQKNRLFFYRKSVWSKLQSIGIRQHLK
RVQLRELSEAEVRQHREARPALLTSRLRFIPKPDGLRPIVNMDYVVGARTFRREKRAERLTSRVKALFSVLNYERARRPGLLGASVLGLDDIHRAWRTF
VLRVRAQDPPPELYFVKVDVTGAYDTIPQDRLTEVIASIIKPQNTYCVRRYAVVQKAAHGHVRKAFKSHVSTLTDLQPYMRQFVAHLQETSPLRDAVVI
EQSSSLNEASSGLFDVFLRFMCHHAVRIRGKSYVQCQGIPQGSILSTLLCSLCYGDMENKLFAGIRRDGLLLRLVDDFLLVTPHLTHAKTFLRTLVRGV
PEYGCVVNLRKTVVNFPVEDEALGGTAFVQMPAHGLFPWCGLLLDTRTLEVQSDYSSYARTSIRASLTFNRGFKAGRNMRRKLFGVLRLKCHSLFLDLQ
VNSLQTVCTNIYKILLLQAYRFHACVLQLPFHQQVWKNPTFFLRVISDTASLCYSILKAKNAGMSLGAKGAAGPLPSEAVQWLCHQAFLLKLTRHRVTY
VPLLGSLRTAQTQLSRKLPGTTLTALEAAANPALPSDFKTILD
First, use the sequence above as the Query sequence in a BLASTp search (http://www.ncbi.nlm.nih.gov/BLAST/),
using default parameters to search the non-redundant protein sequences database (nr) for similar sequences
in yeasts.
Hints:
1- Give this search the JobTitle "Default" (or "BLOSUM62")
2- Set Organism = "yeast taxid:4932"
Next, run 4 additional BLAST searches using each of the 4 alternative substitution matrices listed in 2a.
More Hints:
3- Click on Algorithm Parameters to reveal choices for Substitution Matrices
4- Click on Edit and Resubmit link at top of output page to change parameters
5- Use the Recent Results tab to review & compare your results
2a) (1 pt) How many hits did you obtain?
Matrix               Hits
BLOSUM62 (default)   14
BLOSUM45             21
BLOSUM80             19
PAM30                26
PAM70                38
2b) (1 pt) Describe & explain differences you observe in results obtained with BLOSUM45 vs BLOSUM80.
BLOSUM45 found 2 more hits than BLOSUM80, which we expected because BLOSUM45 should be able to
find more divergent sequences. Based on the E-values, the first 14 hits from both (which are the same 14
hits found by using the BLOSUM62 matrix) are very likely to be related to our query sequence, while the
other hits may or may not be. Because the E-values are high (>1) for the hits after the top-ranking 14, we conclude
that these are most likely random hits.
But: Why does BLOSUM62 result in only 14 hits instead of about 20 (i.e., between the 21 from BLOSUM45 and the 19
from BLOSUM80)? To provide meaningful results, BLAST has been "tuned" to automatically adjust other scoring
parameters when the substitution matrix is changed; these changes most likely explain the lower number of hits with
BLOSUM62 in examples such as this one. If we were to perform a series of BLAST searches in which we systematically
changed the BLOSUM index (and made no other changes to parameter settings), we would generally expect the
number of hits to decrease as the BLOSUM index increases, because a matrix with a higher BLOSUM index is built from
more closely related sequences and therefore requires "closer" matches.
2c) (1 pt) Describe & explain differences you observe in results obtained with PAM30 vs PAM70.
Because PAM matrices are based on an evolutionary model and a higher index corresponds to more
evolutionary time (i.e., more expected point mutations), we would expect to see more hits using PAM70 than
PAM30, which is what we observed. Also, the alignments found using PAM30 were very short, and the best E-value
was only 0.11. With PAM70, the alignments were longer and the best E-value was 2e-04. This also meets our
expectations because we used a human query sequence to search for yeast sequences, and we know that
humans and yeast are quite distant on the evolutionary scale!
2d) (1 pt) Taken together, do these results make sense, given what you've learned about PAM and
BLOSUM matrices? Explain.
The results make sense in that we found more hits using the BLOSUM matrices with lower numbers (more
divergent sequences used to build them) and using the PAM matrices with higher numbers (longer
evolutionary time scale in model). The only odd-ball thing was the BLOSUM62 result (see above).
2e) (1 pt) Do yeasts have a telomerase enzyme? Explain.
Is the default BLOSUM62 matrix "the best" for answering this question? Explain.
Yes, it appears that yeasts have a telomerase enzyme. All of our searches found significant hits. However,
we note that high sequence similarity does not guarantee functional similarity and we should confirm that
yeast have a telomerase enzyme by doing wet lab experiments (FYI - these have been done!). Based on the
very significant E-values, we might be willing to "bet" a few dollars that the EST2 protein is a telomerase
enzyme!
Using the BLOSUM62 matrix appeared to identify better "hits" than the PAM matrices, but the results from all
three BLOSUM matrices indicate that yeast is likely to have a telomerase enzyme. Our results confirm that
different matrices give different results. If the BLAST default matrix does not provide sufficient hits or
results that make sense to you, don’t stop there. Try some different matrices and scoring parameters to see
whether your results make more sense.
Problem 3 - (12 pts total)
3. Consider the following sequences for a "toy" alignment problem in which you will perform & score both a
global & a local alignment:
x = ACCTT
y = ACTTG
3a) (4 pts) Complete the Global Alignment Dynamic Programming matrix below (with initial values already
entered). Use the following scoring scheme:
Reward for matches: +10
Mismatch penalty: -2
Space penalty: -5
         -     A     C     C     T     T
   -     0    -5   -10   -15   -20   -25
   A    -5    10     5     0    -5   -10
   C   -10     5    20    15    10     5
   T   -15     0    15    18    25    20
   T   -20    -5    10    13    28    35
   G   -25   -10     5     8    23    30
3b) (1 pt) What is the score of the optimal global alignment(s)?
30
3c) (1 pt) Draw the alignment(s) that give this score.
ACCTT-
A-CTTG

ACCTT-
AC-TTG
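The global alignment above can be checked with a short Python sketch of the standard dynamic programming (Needleman-Wunsch) recurrence, using the scoring scheme from 3a. The function names `global_matrix` and `global_alignments` are illustrative choices, not from the course materials. Rows correspond to y and columns to x, as in the matrix for 3a, and the traceback follows every predecessor that explains a cell's value so that all optimal alignments are reported.

```python
def global_matrix(x, y, match=10, mismatch=-2, gap=-5):
    """Fill the global alignment matrix; rows follow y, columns follow x."""
    rows, cols = len(y) + 1, len(x) + 1
    F = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        F[0][j] = j * gap          # leading spaces in y
    for i in range(1, rows):
        F[i][0] = i * gap          # leading spaces in x
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if y[i - 1] == x[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align y[i-1] with x[j-1]
                          F[i - 1][j] + gap,     # space in x
                          F[i][j - 1] + gap)     # space in y
    return F


def global_alignments(x, y, match=10, mismatch=-2, gap=-5):
    """Return (optimal score, sorted list of all optimal (x, y) alignments)."""
    F = global_matrix(x, y, match, mismatch, gap)
    results = []

    def walk(i, j, ax, ay):
        # ax and ay are built back to front and reversed at the end.
        if i == 0 and j == 0:
            results.append((ax[::-1], ay[::-1]))
            return
        if i > 0 and j > 0:
            s = match if y[i - 1] == x[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                walk(i - 1, j - 1, ax + x[j - 1], ay + y[i - 1])
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            walk(i - 1, j, ax + "-", ay + y[i - 1])
        if j > 0 and F[i][j] == F[i][j - 1] + gap:
            walk(i, j - 1, ax + x[j - 1], ay + "-")

    walk(len(y), len(x), "", "")
    return F[-1][-1], sorted(results)


score, alignments = global_alignments("ACCTT", "ACTTG")
print(score)
for ax, ay in alignments:
    print(ax)
    print(ay)
    print()
```

For x = ACCTT and y = ACTTG this gives a score of 30 and the two alignments drawn in 3c. The two answers arise because the single C in y can pair with either C in x at the same cost.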
3d) (4 pts) Complete the Local Alignment DP matrix below (with initial values already entered).
Use the following scoring scheme:
Match: +2
Mismatch and space: -1
         -     A     C     C     T     T
   -     0     0     0     0     0     0
   A     0     2     1     0     0     0
   C     0     1     4     3     2     1
   T     0     0     3     3     5     4
   T     0     0     2     2     5     7
   G     0     0     1     1     4     6
3e) (1 pt) What is the score of the optimal local alignment(s)?
7
3f) (1 pt) Draw the alignment(s) that give this score.
ACCTT
A-CTT

ACCTT
AC-TT
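The local alignment can be checked the same way with a sketch of the Smith-Waterman recurrence, which differs from the global case in two respects: cell values are never allowed to fall below 0, and the traceback starts at the highest-scoring cell and stops at the first 0. The function names `local_matrix` and `local_alignments` are illustrative, and the defaults follow the scoring scheme from 3d.

```python
def local_matrix(x, y, match=2, mismatch=-1, gap=-1):
    """Fill the local alignment matrix; rows follow y, columns follow x."""
    rows, cols = len(y) + 1, len(x) + 1
    H = [[0] * cols for _ in range(rows)]   # first row and column stay 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if y[i - 1] == x[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # align y[i-1] with x[j-1]
                          H[i - 1][j] + gap,     # space in x
                          H[i][j - 1] + gap)     # space in y
    return H


def local_alignments(x, y, match=2, mismatch=-1, gap=-1):
    """Return (best score, sorted list of all optimal local alignments)."""
    H = local_matrix(x, y, match, mismatch, gap)
    best = max(max(row) for row in H)
    results = set()

    def walk(i, j, ax, ay):
        if H[i][j] == 0:                    # a local alignment starts here
            results.add((ax[::-1], ay[::-1]))
            return
        # H[i][j] > 0 implies i > 0 and j > 0, so the indices below are valid.
        s = match if y[i - 1] == x[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            walk(i - 1, j - 1, ax + x[j - 1], ay + y[i - 1])
        if H[i][j] == H[i - 1][j] + gap:
            walk(i - 1, j, ax + "-", ay + y[i - 1])
        if H[i][j] == H[i][j - 1] + gap:
            walk(i, j - 1, ax + x[j - 1], ay + "-")

    if best > 0:
        for i, row in enumerate(H):
            for j, value in enumerate(row):
                if value == best:
                    walk(i, j, "", "")
    return best, sorted(results)


best, alignments = local_alignments("ACCTT", "ACTTG")
print(best)
for ax, ay in alignments:
    print(ax)
    print(ay)
    print()
```

For the toy sequences this gives a best score of 7 and the two alignments drawn in 3f. Unlike the global case, the trailing G of y is left out because including it would only lower the score.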