BCB 444/544 Fall 07 Sept 21 Exam 1 KEY
BCB 444/544 - F07
Exam 1 (100 pts)
Name_____ANSWER KEY_________________
A. Databases & Literature Resources for Bioinformatics (10 pts TOTAL)
A1. (2pts) In your undergraduate research project, you have identified an especially interesting and, so far,
unannotated gene in bacteria, which you have named "BCB1." Your experimental results demonstrate that BCB1 is an
essential gene: mutations that knock-out its function are lethal. You have a hunch it must be conserved among all life
forms. To obtain support for this hypothesis, you would like to identify a homolog of this gene in humans. You
log on to the BLAST page at NCBI and choose to run a basic protein BLAST search against only human proteins.
However, you obtain no significant hits!!! How should you change your search parameters to increase your
chances of detecting a potential human homolog?
Change the default BLAST substitution matrix to one that will take into consideration greater
evolutionary divergence, such as BLOSUM45. (or try PSI-BLAST!)
A2. (1pt) Despite changing parameters as described above, you were unable to identify a putative human homolog.
You decide to change your strategy and run a BLAST search against proteins from all organisms. Great! You find an
extensive list of potential homologs across many forms of life -- but you still did not identify any potential homologs in
human. As you sit in frustration your thoughts drift back to your glory days in BCB 444/544 and you remember an
alternative BLAST program that takes advantage of a profile or PSSM in an iterative search procedure, thus
providing more sensitivity for detecting remote homologs. What is this specific BLAST program called?
PSI-BLAST
A3. (2pts) You tap your foot and wait for your browser to refresh. You recall a few "suggestions" & "caveats" about
effective use of PSI-BLAST ; ) Hmmm… What is one "tip" for effective use of PSI-BLAST?
- When in doubt, leave it out: remove any "suspicious" hits obtained after each iteration,
  so they won't contaminate the profile (or PSSM)
- Use stringent parameters during first iteration
- Run 3-5 iterations - not more
- Others?
A4. (2pts) At last, there it is: you've found a significant "hit" in human, a putative homolog! This seems like an excellent and
fitting end to a long and exciting search! You pat yourself on the back and are just about to go out to celebrate with a
few beers, when your lab partner takes a look at the annotation for your putative human homolog and says: "Hey! I
think you've been scooped! I saw a paper describing a human protein with the same annotation from Drena Dobbs's
lab last year - it was in Science or Nature, I think--or maybe it was in NAR, no - it was Proteins, maybe 2 years ago!
You'd better check it out!" Aaargh… it is 2 AM & the library is closed…Which online resource would you use to
find all papers published by Dobbs in biomedical journals during the past 5 years?
PubMed or NCBI ENTREZ - other correct answers are possible!
A5. (3pts) Darn! That Dobbs lab must have some amazing students! They did identify your gene in humans -- and
actually found two very similar genes. They said one of them is the ortholog of the gene you found in bacteria and the
other is actually a paralog. What is an ortholog and how does it differ from a paralog?
Orthologs are the same genes in different species; they are the result of common ancestry, and
the corresponding proteins have the same function. Paralogs are similar genes within a
species; they are the result of gene duplication events, and the corresponding proteins have
similar functions.
B. Dynamic Programming (20 pts TOTAL)
You think Dobbs made an error -- it looks like she confused the ortholog & paralog! A vital piece of evidence that
could prove this is an optimal global pairwise alignment between your prokaryotic gene and each of the human
homologs. You would love to prove Dobbs wrong, so despite the late hour, you decide to compare the two alignments
(in the bar, where you are now drowning your sorrows, while surfing the web on your laptop). Aaaarrrgh! Your battery
just died - and you left your charger in lab!! You must perform the alignment by hand. Demonstrate your prowess by
reproducing a portion of that global alignment below.
B1. (8pts) Fill out the dynamic programming matrix for determining an optimal global alignment between the
sequences TCG and TCCAG. Scoring: +5 for matches; -3 for mismatches and spaces.

          T    C    C    A    G
     0   -3   -6   -9  -12  -15
T   -3    5    2   -1   -4   -7
C   -6    2   10    7    4    1
G   -9   -1    7    7    4    9
B2. (2pts) Where is the score of the optimal alignment(s) located in the DP matrix? (Circle it)
(In the bottom right corner of the matrix)
B3. (4pts) There are 2 optimal alignments. For full credit, draw both of them & show your traceback arrows.
Alignment 1:
 T   C   C   A   G
 T   -   C   -   G
+5  -3  +5  -3  +5  = 9

Alignment 2:
 T   C   C   A   G
 T   C   -   -   G
+5  +5  -3  -3  +5  = 9
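The matrix in B1 and the two tracebacks in B3 can be checked with a short script. This is only a minimal sketch of the standard global alignment recurrence with the scoring given in B1 (+5 match, -3 mismatch, -3 per space); the function names `global_matrix` and `tracebacks` are made up for illustration.

```python
def global_matrix(rows, cols, match=5, mismatch=-3, gap=-3):
    """Fill the global-alignment DP matrix (rows down the side, cols across the top)."""
    F = [[0] * (len(cols) + 1) for _ in range(len(rows) + 1)]
    for i in range(1, len(rows) + 1):
        F[i][0] = i * gap          # leading gaps are penalized in global alignment
    for j in range(1, len(cols) + 1):
        F[0][j] = j * gap
    for i in range(1, len(rows) + 1):
        for j in range(1, len(cols) + 1):
            s = match if rows[i - 1] == cols[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # diagonal: align two residues
                          F[i - 1][j] + gap,     # up: gap in the column sequence
                          F[i][j - 1] + gap)     # left: gap in the row sequence
    return F


def tracebacks(F, rows, cols, match=5, mismatch=-3, gap=-3):
    """Return every optimal alignment as a (cols_line, rows_line) pair."""
    found = []

    def walk(i, j, top, bottom):
        if i == 0 and j == 0:
            found.append((top[::-1], bottom[::-1]))  # strings were built backwards
            return
        if i > 0 and j > 0:
            s = match if rows[i - 1] == cols[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                walk(i - 1, j - 1, top + cols[j - 1], bottom + rows[i - 1])
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            walk(i - 1, j, top + "-", bottom + rows[i - 1])
        if j > 0 and F[i][j] == F[i][j - 1] + gap:
            walk(i, j - 1, top + cols[j - 1], bottom + "-")

    walk(len(rows), len(cols), "", "")
    return found


F = global_matrix("TCG", "TCCAG")
alignments = tracebacks(F, "TCG", "TCCAG")
for top, bottom in alignments:
    print(top, bottom, sep="\n", end="\n\n")
print("optimal score:", F[-1][-1])
```

The optimal score is read from the bottom right cell, as in B2, and the traceback follows every predecessor cell that could have produced the current value, which is why two alignments are reported.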
B4. (4pts) You don't want to go home yet, so decide it would be entertaining to set up a DP matrix for local
alignment, using the BLOSUM62 matrix (attached to this Exam). But, you were able to fill in only the first two
rows before the bar closed. Show what you accomplished in the matrix below:
          T    C    C    A    G
     0    0    0    0    0    0
T    0    5    2    0    0    0
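The T row above can be reproduced with the local alignment recurrence, in which every cell is floored at zero. The sketch below makes two assumptions: the letters are scored as amino acids using the BLOSUM62 entries for threonine (T/T = 5, T/C = -1, T/A = 0, T/G = -2), and the gap penalty of -3 is carried over from B1, which is consistent with the value 2 in the key. The name `local_row` is hypothetical.

```python
# BLOSUM62 scores for T (threonine) against each residue that occurs in TCCAG.
BLOSUM62_T = {"T": 5, "C": -1, "A": 0, "G": -2}


def local_row(prev_row, scores_for_row_residue, cols, gap=-3):
    """Compute one row of a local-alignment DP matrix from the row above it."""
    row = [0]                                   # first column is always zero
    for j, residue in enumerate(cols, start=1):
        row.append(max(0,                       # local alignment never goes negative
                       prev_row[j - 1] + scores_for_row_residue[residue],
                       prev_row[j] + gap,
                       row[j - 1] + gap))
    return row


top_row = [0] * (len("TCCAG") + 1)              # initialization row: all zeros
t_row = local_row(top_row, BLOSUM62_T, "TCCAG")
print(t_row)
```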
B5. (2pts) Walking home with a bit of a buzz, it occurs to you that the "rule" for initializing a DP matrix for global
alignment - which can cause "end-gap" penalties to accumulate if sequences are of different lengths - would be a
problem if you wanted to use global alignment to assemble a set of overlapping sequences into a single long sequence.
How would you initialize a DP matrix to identify the region(s) of overlap between two long sequences (a & b),
which are known to overlap, but each of which is expected to have some unique sequences on one end?
a) -----------------------
                  ||||||||
b)                -------------------------
This type of alignment is referred to as "end-gap free" alignment.
Scoring is the same as for global alignment, except that there are no
penalties for gaps at the ends of the sequences, so the DP matrix is initialized as
for local alignment (with zeros in the first row and first column).
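A minimal sketch of this idea follows. It assumes the B1 scoring scheme and reads the best score from the last row or last column, so that trailing end gaps are also free; the function name `overlap_score` and the two example sequences are invented for illustration.

```python
def overlap_score(a, b, match=5, mismatch=-3, gap=-3):
    """Best end-gap free alignment score between sequences a and b."""
    # First row and first column stay at zero: leading end gaps cost nothing.
    F = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    # Trailing end gaps are free too, so take the best cell on the
    # bottom row or the rightmost column rather than the corner.
    return max(max(F[-1]), max(row[-1] for row in F))


# The 3' end of a ("CGT") overlaps the 5' end of b ("CGT").
best = overlap_score("AAAACGT", "CGTGGGG")
print(best)
```

With ordinary global initialization the four unmatched residues on each end would each cost -3, which would obscure the clean three-residue overlap.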
C. PSSMs & PSI-BLAST (25 pts TOTAL)
C1. (10pts) PSSM matrix - The alignment of four DNA sequences is shown below.
CAACTG
CAGCTG
CAGGTG
CAGCTT
Which of the position-specific score matrices (PSSMs) shown above is most likely to be correct ? Explain.
PSSM-2 is most likely correct. PSSM-1 shows that position 5 is almost always a G, but our alignment shows
that we have T’s there. PSSM-3 shows that position 6 is almost always a T, but our alignment shows mostly G’s
there. Only PSSM-2 fits with the alignment.
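The reasoning above can be checked by tabulating the alignment column by column. The sketch below counts bases per position and converts them to log-odds scores; the uniform background of 0.25 and the pseudocount of 1 are assumptions chosen for illustration, not values taken from the exam's PSSMs.

```python
import math
from collections import Counter

ALIGNMENT = ["CAACTG", "CAGCTG", "CAGGTG", "CAGCTT"]


def column_counts(seqs):
    """Count each base at each alignment position."""
    return [Counter(column) for column in zip(*seqs)]


def log_odds_pssm(seqs, alphabet="ACGT", background=0.25, pseudocount=1.0):
    """Log2-odds score of each base at each position (assumed uniform background)."""
    n = len(seqs)
    pssm = []
    for counts in column_counts(seqs):
        column = {}
        for base in alphabet:
            freq = (counts[base] + pseudocount) / (n + pseudocount * len(alphabet))
            column[base] = math.log2(freq / background)
        pssm.append(column)
    return pssm


counts = column_counts(ALIGNMENT)
pssm = log_odds_pssm(ALIGNMENT)
for position, column in enumerate(pssm, start=1):
    print(position, {base: round(score, 2) for base, score in column.items()})
```

Position 5 is T in all four sequences and position 6 is G in three of four, so a correct PSSM should score T highest at position 5 and G highest at position 6.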
C2. (5pts) Briefly describe how the PAM and BLOSUM scoring matrices are derived and how they are
different.
PAM matrices are based on an evolutionary model for frequencies of amino acid substitutions (based on data
from very closely related sequences) whereas BLOSUM matrices are based on observed frequencies of amino
acid substitutions in alignments of more distantly related protein sequences. One other important difference is
that a higher numeric index for a PAM matrix corresponds to more divergent sequences, whereas a higher
index for a BLOSUM matrix corresponds to more similar sequences.
C3. (5pts) In evaluating the results of a database search using BLAST, why is it sometimes important to
consider the bit score, S', instead of only E-value?
The E-value is directly proportional to the size of the database and the length of the query sequence. The S' score
is a "normalized" version of the raw alignment score and is not dependent on sequence length or database size.
Thus, to compare the significance of alignments obtained from searches in which the query sequences are of
different lengths, or databases are of different sizes, the S' or bit score is more reliable.
C4. (2pts) In what sense is the Smith-Waterman (local alignment) DP algorithm better than BLAST?
Smith-Waterman is guaranteed to find the sequence with the optimal alignment score because it examines every
possible alignment. BLAST cannot guarantee this because it uses a heuristic to speed up the search.
C5. (3pts) Everything else being equal, when does BLAST produce a more significant E-value, when
searching a database of size 500,000 or when searching a database of size 1,000,000? Explain.
Because the E-value is directly proportional to the size of the database, E-values for results of a BLAST
search using the same query sequence would be greater when searching a large database than when searching a
small database. Thus, we would expect to see a smaller (and more significant) E-value for a search performed
against the smaller database of 500,000 sequences.
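The answers to C3 and C5 both follow from the relation E = m * n * 2^(-S'), where m is the query length, n is the database size, and S' is the bit score. The sketch below simply evaluates that relation; the query length of 250 and bit score of 40 are made-up values, and treating the sizes in the question directly as n is a simplification (BLAST itself works with effective lengths).

```python
def expected_hits(bit_score, query_len, db_size):
    """E-value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_size * 2.0 ** (-bit_score)


# Hypothetical query and score; only the database size changes.
e_small = expected_hits(bit_score=40, query_len=250, db_size=500_000)
e_large = expected_hits(bit_score=40, query_len=250, db_size=1_000_000)
print(f"E (500,000):   {e_small:.3g}")
print(f"E (1,000,000): {e_large:.3g}")
print("ratio:", e_large / e_small)
```

The bit score is the same in both searches, while the E-value doubles with the database size, so the smaller database gives the more significant E-value.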
D. Dot Plots & Misc. (20 pts TOTAL)
D1. Suppose we are given 2 DNA sequences A and B. Draw a simple diagram of dot plots that would result from the
following comparison. To receive full credit, be sure to label both axes.
a) (5pts ) DNA sequence A is 1000 bp in length and is identical to sequence B,
which is 800 bp in length, except that A has a single 200 bp segment duplicated
near the 3' end (right end).
[Dot plot figure: axes labeled A (1000 bp) and B (800 bp).]
b) (5pts) Explain what the dot plot pattern shown below represents:
Two sequences of the same length, identical except that one of them
has an inverted segment near the center.
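The pattern expected in part (a) can be illustrated with a scaled-down example. The sketch below marks a dot wherever a word of length k is shared between the two sequences; the short sequences stand in for the 800 bp and 1000 bp sequences in the question, and the name `dot_plot` and the word size of 3 are assumptions for illustration.

```python
def dot_plot(a, b, k=3):
    """Return the set of (i, j) where the k-mer at a[i:] equals the k-mer at b[j:]."""
    positions = {}
    for j in range(len(b) - k + 1):
        positions.setdefault(b[j:j + k], []).append(j)
    dots = set()
    for i in range(len(a) - k + 1):
        for j in positions.get(a[i:i + k], []):
            dots.add((i, j))
    return dots


B = "AACAGATCCG"        # stand-in for the shorter sequence
A = B + B[6:10]         # same sequence with its last segment duplicated at the 3' end
dots = dot_plot(A, B)

# Text rendering: A along the horizontal axis, B along the vertical axis.
for j in range(len(B) - 2):
    print("".join("*" if (i, j) in dots else "." for i in range(len(A) - 2)))
```

The main diagonal shows the shared sequence, and the duplicated segment appears as a second, shorter diagonal shifted to the right, at the same B coordinates as the original copy.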
D2. (5pts) Which lab did you like best? Why?
Most anything you wrote here was given credit - and your feedback was much appreciated!
D3. (5pts) (From Sean Eddy's paper - and discussed in lecture) Why is "dynamic programming" called that?
What does the name mean? Why did Richard Bellman at RAND give it this name?
Bellman called it "dynamic programming" to obscure the true subject of his research (mathematics) and to
make it sound impressive to senators who controlled RAND funding. The dynamic part came from Bellman’s
research on time series (and Bellman thought "dynamic" could never be used in a "pejorative sense") and
programming was actually from planning, not computer programming.
E. Molecular Biology & Bioinformatics Terms (20 pts TOTAL)
(1pt each) Fill in the box beside each definition with one term that corresponds to the definition provided.
Term: Definition
E1. Orthologs: Genes in different species that evolved from a common ancestral gene and have similar
functions
E2. Motif: A nucleotide or amino-acid sequence pattern that is often conserved and has, or is conjectured to
have, functional significance
E3. Phenotype: Observable characteristics of an organism
E4. Transcription: Process mediated by RNA polymerase in which information in DNA is copied into RNA.
E5. Introns: Sections of eukaryotic genes that are transcribed, but spliced out of mature mRNA
E6. PAM: A type of substitution matrix that relies on an explicit evolutionary model and is based on
observed differences in closely related proteins
E7. ORF: A region of a DNA sequence that begins with a START codon and ends with a STOP codon
E8. CLUSTAL: Software that uses progressive alignment heuristics to generate a multiple sequence alignment
of related sequences
E9. PSSM: An n x m matrix of log-odds scores, derived from a MSA of related protein sequences, which
can be used to represent a (gapless) sequence motif
E10. Heuristic: A computational "shortcut" or "rule-of-thumb" that can dramatically shorten the "runtime"
required to solve a problem, but cannot guarantee an optimal solution
( 2pts each) Short answer: Answer each of the following questions (one phrase or sentence should be sufficient).
E11. What is RNA splicing?
RNA splicing is RNA processing: the process of removing introns from a pre-mRNA and "splicing"
together the remaining exons to form a mature mRNA
E12. What is meant by 6-frame translation?
There are 6 possible reading frames for any DNA sequence, 3 forward (from one strand) and 3 reverse
(from the complementary strand). 6-frame translation means determining the 6 different amino acid
sequences that would result if both "theoretical" RNAs encoded by the 2 strands of a DNA molecule
were translated into all 6 possible reading frames.
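As a rough illustration of the answer to E12, the sketch below translates a DNA string in all six frames using the standard genetic code. The function names and the example sequence are invented, and the code assumes the input contains only A, C, G and T.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Sequence of the complementary strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translations for frames +1, +2, +3 (given strand) and -1, -2, -3 (reverse complement)."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


frames = six_frames("ATGGCCTAA")
for name in ("+1", "+2", "+3", "-1", "-2", "-3"):
    print(name, frames[name])
```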
E13. What is an affine gap penalty?
A gap penalty in which gap initiation (opening) is given a higher penalty than gap extension (continuing
an already existing gap).
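A small numeric comparison may help here. The sketch below uses one common convention, in which the first gap position costs the opening penalty and each further position costs the extension penalty; the values -5 and -1 are made up for illustration, and other conventions (for example, charging the extension penalty on the first position as well) are also in use.

```python
def linear_gap_penalty(length, gap=-3):
    """Every gap position costs the same amount."""
    return gap * length


def affine_gap_penalty(length, gap_open=-5, gap_extend=-1):
    """The first gap position costs gap_open; each additional position costs gap_extend."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


for length in (1, 2, 4, 8):
    print(length, linear_gap_penalty(length), affine_gap_penalty(length))
```

Under the affine scheme one long gap is penalized less than several short gaps of the same total length, which favors alignments with fewer, longer insertions or deletions.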
E14. Why do we need/use heuristics for aligning sequences?
For speed: dynamic programming for alignment can take a long time when sequences are long
E15. What are 3 basic computational methods for sequence alignment?
Dot matrices, dynamic programming, & word or k-tuple approaches
F. The Question I Didn't Ask (5 pts TOTAL)
Describe something you have learned from your reading, lectures or labs that was not asked on this Exam - and
that you think is worth 5 pts!
Any reasonable answer was awarded 5 pts.
Blosum62 matrix