BCB 444/544 - F07 Exam 2 (100 pts) Name

BCB 444/544 Fall 07 Oct 26 Exam 2
Name: KEY. Please note revised answers for Questions C2 & F.
A. Cells, Nucleus, Genetic Code, Transcription Factors, etc. (15 pts TOTAL)
Below the DNA sequence shown, write the RNA sequence that would be transcribed from the top
strand of DNA, assuming it is copied completely from the 3' to the 5' end. On the RNA sequence,
circle the START and STOP codons for translation. Translate the RNA sequence into amino acids
and write the deduced protein sequence below, too.
DNA
3'-TATATCGTACAGACGATTTTGATCTATTGACTGG-5'
5'-ATATAGCATGTCTGCTAAAACTAGATAACTGACC-3'
RNA
5'-AUAUAGCAUGUCUGCUAAAACUAGAUAACUGACC-3'
Protein:
NH2-Met-Ser-Ala-Lys-Thr-Arg-Stop
Suppose the DNA sequence in another individual is different, due to a single base-pair deletion,
resulting in the sequence shown below. What effects (if any) would you expect this mutation to have
on the sequence and function of the expected protein product?
3'-TATATCGTACGACGATTTTGATCTATTGACTGG-5'
5'-ATATAGCATGCTGCTAAAACTAGATAACTGACC-3'
RNA
5'-AUAUAGCAUGCUGCUAAAACUAGAUAACUGACC-3'
Protein:
NH2-Met-Leu-Leu-Lys-Leu-Asp-Asn-Stop
The single base deletion caused a frameshift, resulting in a completely different protein
sequence. I would expect this mutation to make a protein that did not have the same function
as the one above, and it may not function at all.
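The two translations above can be checked with a short script. This is a minimal sketch, assuming the standard genetic code and translation from the first AUG to the first in-frame stop; the function names `transcribe` and `translate` are illustrative and not part of the exam.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (DNA alphabet).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys", "Q": "Gln",
    "E": "Glu", "G": "Gly", "H": "His", "I": "Ile", "L": "Leu", "K": "Lys",
    "M": "Met", "F": "Phe", "P": "Pro", "S": "Ser", "T": "Thr", "W": "Trp",
    "Y": "Tyr", "V": "Val", "*": "Stop",
}

def transcribe(template_3to5):
    """RNA copied from a template strand written 3'->5'; the result reads 5'->3'."""
    pair = {"A": "U", "T": "A", "G": "C", "C": "G"}
    return "".join(pair[base] for base in template_3to5)

def translate(rna):
    """Translate from the first AUG up to and including the first in-frame stop."""
    start = rna.find("AUG")
    if start < 0:
        return ""
    residues = []
    for i in range(start, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3].replace("U", "T")]
        residues.append(THREE_LETTER[aa])
        if aa == "*":
            break
    return "-".join(residues)

# Top (template) strands from the exam, written 3'->5'.
WILD_TYPE = "TATATCGTACAGACGATTTTGATCTATTGACTGG"
DELETION = "TATATCGTACGACGATTTTGATCTATTGACTGG"

for name, template in [("wild type", WILD_TYPE), ("deletion", DELETION)]:
    rna = transcribe(template)
    print(name, rna, translate(rna))
```

Every codon downstream of the deleted base is read in a shifted frame, which is why the two proteins share only the initial Met.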
B. HMM (20 pts TOTAL)
Consider the occasionally dishonest casino example discussed in class.
The system has 3 states:
B denotes the start state
F denotes the state when a fair die is used
L denotes the state when a loaded die is used
The transition probabilities between these states are shown in the diagram
(B→F = 1/2, B→L = 1/2, F→F = 0.99, F→L = 0.01, L→F = 0.2, L→L = 0.8).
The emission probabilities are:
for state F, eF(1) = eF(2) = … = eF(6) = 1/6
for state L, eL(1) = eL(2) = … = eL(5) = 0.1, eL(6) = 0.5
1. What is the most probable sequence of states, starting from state B, to produce the sequence of
die tosses 1,6? For full credit, you must show your work and fill in the table below.
Roll     B    F                  L
start    1    0                  0
1        0    ½ * 1/6 = 1/12     ½ * 1/10 = 1/20
6        0    0.0138             0.02

F after roll 6: max {1/12 * 0.99 * 1/6 = 0.0138, 1/20 * 0.2 * 1/6 = 0.001667} = 0.0138
L after roll 6: max {1/12 * 0.01 * 1/2 = 0.00042, 1/20 * 0.8 * 1/2 = 0.02} = 0.02
The most probable sequence of states is: B – L – L.
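The table can be reproduced with a few lines of code. This is a minimal sketch of the Viterbi recursion; the transition probabilities are inferred from the arithmetic in the answer above (the diagram itself is not available), so treat them as assumptions, and the names `START`, `TRANS`, `EMIT`, and `viterbi` are illustrative.

```python
# Transition probabilities inferred from the worked answer (assumption).
START = {"F": 0.5, "L": 0.5}                      # B -> F, B -> L
TRANS = {"F": {"F": 0.99, "L": 0.01},
         "L": {"F": 0.2, "L": 0.8}}
EMIT = {"F": {r: 1 / 6 for r in range(1, 7)},
        "L": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5}}

def viterbi(rolls):
    """Return (most probable state path starting at B, its probability)."""
    v = {s: START[s] * EMIT[s][rolls[0]] for s in START}
    backpointers = []
    for roll in rolls[1:]:
        prev, v, ptr = v, {}, {}
        for s in START:
            # Best predecessor of state s, ignoring the emission (same for all).
            best = max(prev, key=lambda p: prev[p] * TRANS[p][s])
            v[s] = prev[best] * TRANS[best][s] * EMIT[s][roll]
            ptr[s] = best
        backpointers.append(ptr)
    last = max(v, key=v.get)
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return ["B"] + path[::-1], v[last]

path, prob = viterbi([1, 6])
print(path, prob)
```

For the rolls 1,6 this traces back from L to L, giving the path B, L, L with probability 0.02.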
2. What is the total probability of the sequence 1,6? Show your work and fill in the table below.
Roll     B    F                  L
start    1    0                  0
1        0    ½ * 1/6 = 1/12     ½ * 1/10 = 1/20
6        0    0.015417           0.020417

F after roll 6: sum {1/12 * 0.99 * 1/6 = 0.01375, 1/20 * 0.2 * 1/6 = 0.001667} = 0.015417
L after roll 6: sum {1/12 * 0.01 * 1/2 = 0.000417, 1/20 * 0.8 * 1/2 = 0.02} = 0.020417
The total probability of the sequence 1,6 is: 0.015417 + 0.020417 = 0.035833
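The same calculation can be written as the forward algorithm, which replaces the max in the Viterbi recursion with a sum. This is a minimal sketch; as before, the transition probabilities are inferred from the worked arithmetic and are assumptions, and the names are illustrative.

```python
# Transition probabilities inferred from the worked answer (assumption).
START = {"F": 0.5, "L": 0.5}                      # B -> F, B -> L
TRANS = {"F": {"F": 0.99, "L": 0.01},
         "L": {"F": 0.2, "L": 0.8}}
EMIT = {"F": {r: 1 / 6 for r in range(1, 7)},
        "L": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5}}

def forward(rolls):
    """Total probability of the observed rolls, summed over all state paths."""
    f = {s: START[s] * EMIT[s][rolls[0]] for s in START}
    for roll in rolls[1:]:
        # Sum over every predecessor state instead of keeping only the best one.
        f = {s: EMIT[s][roll] * sum(f[p] * TRANS[p][s] for p in f) for s in START}
    return sum(f.values())

print(forward([1, 6]))
```

The result for the rolls 1,6 is roughly 0.036. Carrying unrounded intermediate values through the sum avoids small discrepancies in the later decimal places.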
C. Motifs, Domains, Structure & Structure Prediction (10 pts TOTAL)
C1. (2 pts) Why is a profile more sensitive and flexible than a PSSM for detecting sequence
motifs in proteins?
A profile allows for gaps in the sequence while a PSSM does not.
C2. (4 pts) Briefly explain the roles of base-pairing vs base-stacking interactions in RNA
structure prediction.
Both base-pairing and base-stacking interactions stabilize RNA structures, but the major
energetic contribution is from base-stacking. Most prediction methods attempt to identify
potential structures by optimizing base-pairing, but the energies and ranking of potential
structures are calculated based on nearest-neighbor base-stacking interactions.
C3. (2 pts) Suggest one physical explanation for the fact that secondary structure prediction
algorithms can more accurately identify alpha-helical segments than beta-strands/sheets.
Secondary structure prediction algorithms identify alpha-helical segments more accurately
because alpha helices are local structures: they are stabilized by hydrogen bonds between
residues that are near neighbors in the amino acid sequence. Beta-strands/sheets are harder to
predict because they involve long-range interactions, stabilized by hydrogen bonds between
portions of the sequence that can be far apart.
C4. (2 pts) According to the paper by Ginalski et al., why are meta predictors better than
individual methods for predicting the tertiary structure of proteins?
Meta predictors benefit from being able to choose among multiple predicted conformations.
They can identify the structure (or portions of a structure) that occurs more often than
expected in the set of structures returned. This gives meta predictors a significant
advantage over individual methods and results in better models.
D. Longer answers/problems (20 pts TOTAL)
D1. (10 pts)
a) Briefly outline the steps used to predict a protein structure by threading.
1. Align the target sequence with all template structures
2. Calculate the energy score to evaluate how well the sequence fits on the structure
3. Rank models based on energy scores.
b) What were the key "simplifications" exploited in the Ho method to make it fast enough to
be used for genome-wide threading?
1. Simplify the template structure representation by using the contact matrix (2D) instead of
the full 3D structure. This representation is further reduced by considering only the dominant
eigenvector of the contact matrix, resulting in a one-dimensional representation of the template
structure.
2. Simplify the target sequence representation by using the Li-Tang-Wingreen representation.
3. Simplify the energy function by only counting the number of contacts between hydrophobic
residues.
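The first simplification (3D structure to contact matrix to dominant eigenvector) can be illustrated on a toy chain. This is only a sketch of that reduction, not the Ho group's implementation: the coordinates, the 4.0 distance cutoff, and the function names are hypothetical, and the power iteration is applied to the matrix plus the identity simply to guarantee convergence.

```python
import math

def contact_matrix(coords, cutoff):
    """Binary 2D contact matrix: 1 if two different residues lie within cutoff."""
    n = len(coords)
    return [[1.0 if i != j and math.dist(coords[i], coords[j]) <= cutoff else 0.0
             for j in range(n)] for i in range(n)]

def dominant_eigenvector(matrix, iterations=500):
    """Power iteration on (matrix + I).

    The shift leaves the eigenvectors unchanged and avoids oscillation when
    eigenvalues come in +/- pairs, as they do for some contact graphs.
    """
    n = len(matrix)
    v = [1.0] * n
    for _ in range(iterations):
        w = [v[i] + sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical six-residue hairpin: two antiparallel three-residue strands.
TOY_COORDS = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
              (7.6, 3.8, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]

profile = dominant_eigenvector(contact_matrix(TOY_COORDS, cutoff=4.0))
print([round(x, 3) for x in profile])
```

The output is a single number per residue, largest for the residues with the most contacts, which is the kind of one-dimensional template profile a sequence can be aligned against.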
D2. (10 pts) Given a single RNA sequence, GGCGCGGCACCGUCCGCGGAACAAACGG, we predict the
structure shown below. We then perform a database search and discover 5 homologous sequences
and align them. The MSA is shown below with nucleotides that are base-paired in the structure
highlighted and numbered so that base-paired positions have the same number above them.
a) Based on the information in this MSA, is our predicted structure likely to be
correct? Explain.
Maybe. Most likely, the base pair at position 3 below does not form in all of the other
sequences. Also, all of the other sequences can form base pairs with the bases immediately
before and after the numbered region – see the UC and the GA bases that are present in
all other sequences besides our first sequence.
b) Are there any base-pairs in the predicted structure that are unlikely to form in
structures corresponding to the additional sequences in the MSA? Explain.
Yes, the base pair at position 3 is not likely to form because the bases at those two positions
vary among the other sequences without conserving the ability to base pair.
  12345      54321
GGCGCGGCACCGUCCGCGGAACAAACGG
UCCGGGUCACCGUACGCGGAACAAACGG
UCCGUGCCACCGUGCGCGGAACAAACGG
UCCGUGACACCGUUCCCGGAACAAACGG
UCCGAGUCACCGUACGCGGAACAAACGG
UCCGCGUCACCGUACGCGGAACAAACGG
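The reasoning in parts a) and b) can be checked column by column. This is a minimal sketch: the column positions of the numbered stem (alignment columns 3 to 7 paired with columns 18 down to 14, written 0-based in the code) are inferred from the sequences and the answers, so they are an assumption, as is the choice to count G-U wobble pairs as pairable.

```python
MSA = [
    "GGCGCGGCACCGUCCGCGGAACAAACGG",  # query sequence
    "UCCGGGUCACCGUACGCGGAACAAACGG",
    "UCCGUGCCACCGUGCGCGGAACAAACGG",
    "UCCGUGACACCGUUCCCGGAACAAACGG",
    "UCCGAGUCACCGUACGCGGAACAAACGG",
    "UCCGCGUCACCGUACGCGGAACAAACGG",
]

# Assumed 0-based column pairs for the numbered stem positions 1..5.
STEM = {1: (2, 17), 2: (3, 16), 3: (4, 15), 4: (5, 14), 5: (6, 13)}
# Columns immediately outside the numbered region (the UC ... GA bases).
FLANK = {"outer-1": (1, 18), "outer-2": (0, 19)}

# Watson-Crick pairs plus G-U wobble (assumption).
PAIRABLE = {"AU", "UA", "GC", "CG", "GU", "UG"}

def summarize(msa, pairs):
    """For each column pair, list the observed bases and test for covariation."""
    summary = {}
    for label, (i, j) in pairs.items():
        observed = [seq[i] + seq[j] for seq in msa]
        ok = [pair in PAIRABLE for pair in observed]
        summary[label] = {
            "observed": observed,
            "n_pairable": sum(ok),
            # Covariation: the bases change, yet every sequence can still pair.
            "covaries": all(ok) and len(set(observed)) > 1,
        }
    return summary

for label, info in summarize(MSA, STEM).items():
    print(label, info)
for label, info in summarize(MSA[1:], FLANK).items():
    print(label, info)
```

Under these assumptions, position 3 is pairable in only half of the sequences, position 5 changes in a compensating way, and the flanking UC and GA bases can pair in every homolog but not in the query.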
E. Molecular Biology & Bioinformatics Terms (20 pts TOTAL)
(1pt each) Fill in the box beside each definition with one term or acronym that corresponds to the
definition provided. (Some have more than one correct answer).
E1. Term: Viterbi algorithm
Definition: Algorithm used to determine the most probable path for a sequence of
observed variables from a HMM
E2. Term: GOR V, PHD, PSIPRED, etc.
Definition: A program for predicting protein secondary structure
E3. Term: CpG Island
Definition: Genomic region with a high frequency of CG dinucleotides, likely to be near the
transcription start site for genes
E4. Term: CATH, SCOP
Definition: A protein structure classification database
E5. Term: PyMol, MolMol, Cn3D, etc.
Definition: A program for visualizing protein structures
E6. Term: CASP
Definition: A protein structure prediction "contest"
E7. Term: Cytoskeleton
Definition: Internal structural scaffold that organizes the cytoplasm in eukaryotic cells
E8. Term: Peptide bond
Definition: Type of covalent bond that links amino acids in a polypeptide chain
E9. Term: Protein domain
Definition: Independent structural or functional unit of a protein
E10. Term: NMR Spectroscopy, X-ray crystallography
Definition: Experimental method for determining the 3-D structure of a macromolecule
(2pts each) Short answer: Answer each of the following questions (one phrase or sentence should
be sufficient in most cases).
E11. What is the difference between a protein motif and a protein domain?
A protein motif is a short, conserved sequence pattern. A protein domain is a larger unit,
corresponding to a longer sequence that usually represents a functionally or structurally
independent unit (and may contain one or more motifs).
E12. What is "hidden" in a hidden Markov model?
The actual state is hidden. We can’t "see" the underlying state that emits the observed
variables.
E13. What is meant by co-variation in the context of RNA structure prediction?
Co-variation is used to determine which residues are more likely to be base-paired. In an
MSA, if every time one position changes from an A to a C, another position changes from a
U to a G, those two positions are likely to be base-paired because the mutation/variations
change the sequence of the RNA, but conserve the ability to base-pair.
E14. Name the 3 basic computational approaches for protein structure prediction.
Homology modeling, threading, and ab initio.
E15. Which experimental protein structure determination method can provide
information about protein dynamics?
NMR spectroscopy
F. Something a bit more thought provoking!! (10 pts TOTAL)
You are given a "mystery" gene sequence (M gene) from a newly discovered bacterium. The M gene has
only one large open reading frame (ORF), which would encode a protein ~200 amino acids in length.
Outline and briefly describe the types of computational analyses you would perform to try to
annotate this gene and its potential product(s). Be sure to provide relevant details and explain
how you would proceed if a proposed approach does not provide useful information. You should
begin like this -- but please replace underlined bits and "x"s with your own words:
First, I would perform a "xBLASTx" search using the M gene sequence to query the "x" database.
If no hits with e-values better than "x" are returned using default parameters, I would….
1) BLAST, if no significant hits, adjust default parameters, including substitution matrices
2) If still no hits, try PSI-BLAST to identify potential remote homologs
3) If still no hits, search for protein sequence or portions of it in Pfam & other protein family
databases
4) Try to predict protein structure – protein may have function related to template protein
chosen by homology modeling or threading program
5) Search for:
a. functional motifs or domains: e.g., kinase or DNA binding motifs
b. potentially co-regulated groups of proteins, based on microarray or proteomics data
c. references to the protein in the literature
Other ideas?
G. The Question I Didn't Ask (5 pts TOTAL)
Describe something you have learned about an ISU scientist or research project, somehow related
to bioinformatics or computational biology (from lectures, reading, labs, seminars), which you think
is worth 5 pts!
We have mentioned quite a few of these in class. Some examples are the Ho group’s threading
method for protein structure prediction and the Jernigan group’s GOR V, CDM, and FDM methods
for protein secondary structure prediction, and the Brendel group’s methods for gene prediction.
There are lots of other possibilities as well.
Genetic Code Table