BCB 444/544 Fall 07 Sept 21 Exam 1 KEY
BCB 444/544 - F07 Exam 1 (100 pts)
Name: ANSWER KEY

A. Databases & Literature Resources for Bioinformatics (10 pts TOTAL)

A1. (2pts) In your undergraduate research project, you have identified an especially interesting and, so far, unannotated gene in bacteria, which you have named "BCB1." Your experimental results demonstrate that BCB1 is an essential gene: mutations that knock out its function are lethal. You have a hunch it must be conserved among all life forms. To obtain support for this hypothesis, you would like to identify a homolog of this gene in humans. You log on to the BLAST page at NCBI and choose to run a basic protein BLAST search against only human proteins. However, you obtain no significant hits!!! How should you change your search parameters to increase your chances of detecting a potential human homolog?

Change the default BLAST substitution matrix to one that will take into consideration greater evolutionary divergence, such as BLOSUM45. (or try PSI-BLAST!)

A2. (1pt) Despite changing parameters as described above, you were unable to identify a putative human homolog. You decide to change your strategy and run a BLAST search against proteins from all organisms. Great! You find an extensive list of potential homologs across many forms of life -- but you still did not identify any potential homologs in human. As you sit in frustration, your thoughts drift back to your glory days in BCB 444/544 and you remember an alternative BLAST program that takes advantage of a profile or PSSM in an iterative search procedure, thus providing more sensitivity for detecting remote homologs. What is this specific BLAST program called?

PSI-BLAST

A3. (2pts) You tap your foot and wait for your browser to refresh. You recall a few "suggestions" & "caveats" about effective use of PSI-BLAST ; ) Hmmm… What is one "tip" for effective use of PSI-BLAST?
When in doubt, leave it out: remove any "suspicious" hits obtained after each iteration, so they won't contaminate the profile (or PSSM)
Use stringent parameters during the first iteration
Run 3-5 iterations - not more
Others?

A4. (2pts) At last, there it is: you’ve found a significant "hit" in human! A homolog! This seems like an excellent and fitting end to a long and exciting search! You pat yourself on the back and are just about to go out to celebrate with a few beers, when your lab partner takes a look at the annotation for your putative human homolog and says: "Hey! I think you've been scooped! I saw a paper describing a human protein with the same annotation from Drena Dobbs's lab last year - it was in Science or Nature, I think--or maybe it was in NAR, no - it was Proteins, maybe 2 years ago! You'd better check it out!" Aaargh… it is 2 AM & the library is closed… Which online resource would you use to find all papers published by Dobbs in biomedical journals during the past 5 years?

PubMed or NCBI ENTREZ - other correct answers are possible!

A5. (3pts) Darn! That Dobbs lab must have some amazing students! They did identify your gene in humans -- and actually found two very similar genes. They said one of them is the ortholog of the gene you found in bacteria and the other is actually a paralog. What is an ortholog and how does it differ from a paralog?

Orthologs are the same genes in different species; they are the result of common ancestry, and the corresponding proteins have the same function. Paralogs are similar genes within a species; they are the result of gene duplication events, and the corresponding proteins have similar functions.

B. Dynamic Programming (20 pts TOTAL)

You think Dobbs made an error -- it looks like she confused the ortholog & paralog! A vital piece of evidence that could prove this is an optimal global pairwise alignment between your prokaryotic gene and each of the human homologs.
You would love to prove Dobbs wrong, so despite the late hour, you decide to compare the two alignments (in the bar, where you are now drowning your sorrows, while surfing the web on your laptop). Aaaarrrgh! Your battery just died - and you left your charger in lab!! You must perform the alignment by hand. Demonstrate your prowess by reproducing a portion of that global alignment below.

B1. (8pts) Fill out the dynamic programming matrix for determining an optimal global alignment between the sequences TCG and TCCAG. Scoring: +5 for matches; -3 for mismatches and spaces.

            T    C    C    A    G
       0   -3   -6   -9  -12  -15
  T   -3    5    2   -1   -4   -7
  C   -6    2   10    7    4    1
  G   -9   -1    7    7    4    9

B2. (2pts) Where is the score of the optimal alignment(s) located in the DP matrix? (Circle it)

(In the bottom right corner of the matrix)

B3. (4pts) There are 2 optimal alignments. For full credit, draw both of them & show your traceback arrows.

   T   -   C   -   G
   T   C   C   A   G
  +5  -3  +5  -3  +5   = 9

   T   C   -   -   G
   T   C   C   A   G
  +5  +5  -3  -3  +5   = 9

B4. (4pts) You don't want to go home yet, so you decide it would be entertaining to set up a DP matrix for local alignment, using the BLOSUM62 matrix (attached to this Exam). But you were able to fill in only the first two rows before the bar closed. Show what you accomplished in the matrix below:

            T    C    C    A    G
       0    0    0    0    0    0
  T    0    5    2    0    0    0

B5. (2pts) Walking home with a bit of a buzz, it occurs to you that the "rule" for initializing a DP matrix for global alignment - which can cause "end-gap" penalties to accumulate if sequences are of different lengths - would be a problem if you wanted to use global alignment to assemble a set of overlapping sequences into a single long sequence. How would you initialize a DP matrix to identify the region(s) of overlap between two long sequences (a & b), which are known to overlap, but each of which is expected to have some unique sequences on one end?

a) -----------------------||||||||||||||||||||||||||
This type of alignment is referred to as "end-gap free" alignment.
b) -------------------------
Scoring is the same as for global alignment, except that there are no penalties for gaps at the ends of sequences (so the DP matrix is initialized as for local alignment, with all zeros).

C. PSSMs & PSI-BLAST (25 pts TOTAL)

C1. (10pts) PSSM matrix - The alignment of four DNA sequences is shown below.

CAACTG
CAGCTG
CAGGTG
CAGCTT

Which of the position-specific score matrices (PSSMs) shown above is most likely to be correct? Explain.

PSSM-2 is most likely correct. PSSM-1 shows that position 5 is almost always a G, but our alignment shows that we have T’s there. PSSM-3 shows that position 6 is almost always a T, but our alignment shows mostly G’s there. Only PSSM-2 fits with the alignment.

C2. (5pts) Briefly describe how the PAM and BLOSUM scoring matrices are derived and how they are different.

PAM matrices are based on an evolutionary model for frequencies of amino acid substitutions (based on data from very closely related sequences), whereas BLOSUM matrices are based on observed frequencies of amino acid substitutions in alignments of more distantly related protein sequences. One other important difference is that a higher numeric index for a PAM matrix corresponds to more divergent sequences, whereas a higher index for a BLOSUM matrix corresponds to more similar sequences.

C3. (5pts) In evaluating the results of a database search using BLAST, why is it sometimes important to consider the bit score, S', instead of only the E-value?

The E-value is directly proportional to the size of the database and the length of the query sequence. The S' score is a "normalized" version of the raw alignment score and is not dependent on sequence length or database size. Thus, to compare the significance of alignments obtained from searches in which the query sequences are of different lengths, or databases are of different sizes, the S' or bit score is more reliable.

C4.
(2pts) In what sense is the Smith-Waterman (local alignment) DP algorithm better than BLAST?

Smith-Waterman is guaranteed to find the sequence with the optimal alignment score because it examines every possible alignment. BLAST cannot guarantee this because it uses a heuristic to speed up the search.

C5. (3pts) Everything else being equal, when does BLAST produce a more significant E-value: when searching a database of size 500,000 or when searching a database of size 1,000,000? Explain.

Because the E-value is directly proportional to the size of the database, E-values for results of a BLAST search using the same query sequence would be greater when searching a large database than when searching a small database. Thus, we would expect to see a smaller (and more significant) E-value for a search performed against the smaller database of 500,000 sequences.

D. Dot Plots & Misc. (20 pts TOTAL)

D1. Suppose we are given 2 DNA sequences A and B. Draw a simple diagram of dot plots that would result from the following comparison. To receive full credit, be sure to label both axes.

a) (5pts) DNA sequence A is 1000 bp in length and is identical to sequence B, which is 800 bp in length, except that A has a single 200 bp segment duplicated near the 3' end (right end).

[Dot plot: axes labeled A (1000) and B (800)]

b) (5pts) Explain what the dot plot pattern shown below represents:

Two sequences of the same length, identical except that one of them has an inverted segment near the center.

D2. (5pts) Which lab did you like best? Why?

Most anything you wrote here was given credit - and your feedback was much appreciated!

D3. (5pts) (From Sean Eddy's paper - and discussed in lecture) Why is "dynamic programming" called that? What does the name mean? Why did Richard Bellman at RAND give it this name?
Bellman called it "dynamic programming" to obscure the true subject of his research (mathematics) and to make it sound impressive to senators who controlled RAND funding. The "dynamic" part came from Bellman’s research on time series (and Bellman thought "dynamic" could never be used in a "pejorative sense"), and "programming" actually referred to planning, not computer programming.

E. Molecular Biology & Bioinformatics Terms (20 pts TOTAL)

(1pt each) Fill in the box beside each definition with one term that corresponds to the definition provided.

Term: Definition
E1. Orthologs: Genes in different species that evolved from a common ancestral gene and have similar functions
E2. Motif: A nucleotide or amino-acid sequence pattern that is often conserved and has, or is conjectured to have, functional significance
E3. Phenotype: Observable characteristics of an organism
E4. Transcription: Process mediated by RNA polymerase in which information in DNA is copied into RNA
E5. Introns: Sections of eukaryotic genes that are transcribed, but spliced out of mature mRNA
E6. PAM: A type of substitution matrix that relies on an explicit evolutionary model and is based on observed differences in closely related proteins
E7. ORF: A region of a DNA sequence that begins with a START codon and ends with a STOP codon
E8. CLUSTAL: Software that uses progressive alignment heuristics to generate a multiple sequence alignment of related sequences
E9. PSSM: An n x m matrix of log-odds scores, derived from an MSA of related protein sequences, which can be used to represent a (gapless) sequence motif
E10. Heuristic: A computational "shortcut" or "rule-of-thumb" that can dramatically shorten the "runtime" required to solve a problem, but cannot guarantee an optimal solution

(2pts each) Short answer: Answer each of the following questions (one phrase or sentence should be sufficient).

E11. What is RNA splicing?
RNA splicing is RNA processing: the process of removing introns from a pre-mRNA and "splicing" together the remaining exons to form a mature mRNA

E12. What is meant by 6-frame translation?

There are 6 possible reading frames for any DNA sequence, 3 forward (from one strand) and 3 reverse (from the complementary strand). 6-frame translation means determining the 6 different amino acid sequences that would result if both "theoretical" RNAs encoded by the 2 strands of a DNA molecule were translated in all 6 possible reading frames.

E13. What is an affine gap penalty?

A gap penalty in which gap initiation (opening) is given a higher penalty than gap extension (continuing an already existing gap).

E14. Why do we need/use heuristics for aligning sequences?

For speed: dynamic programming for alignment can take a long time when sequences are long

E15. What are 3 basic computational methods for sequence alignment?

Dot matrices, dynamic programming, & word or k-tuple approaches

F. The Question I Didn't Ask (5 pts TOTAL)

Describe something you have learned from your reading, lectures or labs that was not asked on this Exam - and that you think is worth 5 pts!

Any reasonable answer was awarded 5 pts.

[Attached: BLOSUM62 matrix]
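
The hand-filled global alignment matrix in B1 and the two tracebacks in B3 can be checked with a short script. The following is a minimal sketch in Python, not part of the original exam; the function names `nw_matrix` and `tracebacks` are hypothetical, and the scoring (+5 match, -3 mismatch, -3 space) is taken from B1.

```python
def nw_matrix(a, b, match=5, mismatch=-3, gap=-3):
    """Fill the global-alignment DP matrix (rows follow a, columns follow b)."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    # Global alignment: end gaps are penalized, so the first row and column
    # accumulate the gap penalty.
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # diagonal: align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,     # up: a[i-1] against a space
                          F[i][j - 1] + gap)     # left: b[j-1] against a space
    return F


def tracebacks(a, b, F, match=5, mismatch=-3, gap=-3):
    """Enumerate every optimal alignment by following all tied traceback moves."""
    results = []

    def walk(i, j, top, bottom):
        if i == 0 and j == 0:
            # Strings were built from the end backwards, so reverse them.
            results.append((top[::-1], bottom[::-1]))
            return
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                walk(i - 1, j - 1, top + a[i - 1], bottom + b[j - 1])
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            walk(i - 1, j, top + a[i - 1], bottom + "-")
        if j > 0 and F[i][j] == F[i][j - 1] + gap:
            walk(i, j - 1, top + "-", bottom + b[j - 1])

    walk(len(a), len(b), "", "")
    return results


if __name__ == "__main__":
    F = nw_matrix("TCG", "TCCAG")
    for row in F:
        print(row)
    print("optimal score:", F[-1][-1])   # bottom right corner, as in B2
    for top, bottom in tracebacks("TCG", "TCCAG", F):
        print(top)
        print(bottom)
        print()
```

Run on TCG and TCCAG, the sketch reproduces the B1 matrix, the score of 9 in the bottom right corner, and the two alignments from B3. Setting the first row and column to zero instead of multiples of the gap penalty is the starting point for the end-gap free variant described in B5.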