BCB 444/544- F07 Study Guide #2 – For Exam 2 (Oct 26)

BCB 444/544 Fall 07
Study Guide #2 - KEY - Oct 24
p 1 of 9
BCB 444/544- F07
Study Guide #2 – KEY (Partial Answers)
For Exam 2 (Oct 26)
Complete answers will be discussed in Review Session on Thurs Oct 25
General comments
• Exam 2 will cover all topics covered in class, lab and assigned readings:
   • Lectures 13 - 26 (Wed Sept 19 thru Mon Oct 22)
   • Labs 5 - 8
   • HW# 3 & 4 (not 5)
   • All assigned reading & URLs indicated in PPTs, including:
      Xiong: Chps 6 (beginning with HMMs), 7, 8, 12 - 15 (not 10 & 11)
      Eddy: What is a hidden Markov Model?
      Ginalski: Practical Lessons from Protein Structure Prediction
• This study guide covers ~90% of material important for Exam 2 - no
guarantees about other 10%!
• Exam 2 will be a closed-book, closed-notes, 50-minute exam.
• Some questions will involve computation; bring your calculators if you like.
• All required formulae or tables will be provided.
• Some questions will require short essay-like answers that demonstrate your
understanding of key concepts covered in the course.
Topics & Study Questions:
• Review: Nucleus, Chromosomes, Genes
o Name two primary differences in the organization of eukaryotic versus
prokaryotic cells
• Review: RNA, Proteins, Promoters, Transcription factors
o What is the relationship between protein sequence – structure – function?
o Eukaryotic gene structure: Introns vs Exons
o Regulation of gene expression
• What is a promoter? An enhancer?
• What is a transcription factor?
• PSSMs, Profiles & HMMs
o What are sequence logos?
o What is the main difference between a PSSM and a profile?
o How do you calculate the probability of a given path through an HMM?
o How do you calculate the most probable path through an HMM?
o How do you calculate the total probability of an observed sequence from an
HMM?
• Protein Structure: Basics, Visualization, Classification
o Amino acid properties: Why is Glycine "special?" Proline? Cysteine?
o What are the 4 main levels of protein structure?
o What are 3 main types of secondary structural elements?
o Name 2 databases in which proteins are categorized on the basis of their
structural class.
o What is the major database for protein structure information?
o Name 2 protein structure visualization tools.
o What are the 2 main methods used to obtain high-resolution experimental
structures of proteins?
• Protein Secondary Structure Prediction
o Why are different programs needed to predict the secondary structures
of globular vs. membrane proteins?
o What features of membrane proteins make them "easier" targets for
protein secondary structure prediction?
• Protein Tertiary Structure & Prediction
o What is meant by the "protein folding" problem?
o What is meant by the "inverse folding" problem?
o What is the primary goal of the Structural Genomics Project?
o What are the 3 major methods for protein tertiary structure prediction?
o What is the CASP contest?
o Name 1 program for protein structure comparison.
• RNA Structure & Prediction
o List 4 different types of functional RNAs
o Why are there so many different types of base-pairs in RNA structures?
o What types of bonds/molecular interactions are primarily responsible for
stabilizing RNA structures?
o What are 3 main methods for RNA structure prediction?
o How is covariation used in RNA structure prediction?
o What type of protein structure prediction method was developed by
Kai-Ming Ho's group at ISU?
• Gene Structure and (just a little bit on) Prediction
o Name 3 differences in the structural features of genes in eukaryotes vs
prokaryotes
o Name 4 DNA sequence "signals" in genes that are often used in
computational gene prediction approaches.
Sample Questions/Problems
1. Label each of the following statements either True (T) or False (F).
a. F  The cytoskeleton in eukaryotic cells organizes the extracellular space.
      (Correction: the extracellular matrix organizes the extracellular space.)
b. T  The process by which information in RNA is used to make proteins
      is called translation.
c. T  The process by which information in DNA is copied into RNA is called
      transcription.
d. F  Peptide bonds are both planar and flexible.
      (Correction: peptide bonds are planar and rigid.)
e. F  Enzymes that catalyze reactions in the cell are always proteins.
      (Correction: they are not always proteins.)
f. F  Protein interactions are not required for the functions of most proteins.
g. F  An exon is a segment of a eukaryotic gene that does not encode protein.
      (Correction: an exon encodes protein.)
h. T  In eukaryotes, one gene can sometimes encode several proteins.
i. T  Transcription factors are proteins that often bind specific DNA
      sequences and promote the initiation of transcription.
j. T  Non-coding RNAs can play important roles in cells, even though they do
      not produce proteins.
2. Short answer questions
a. Briefly describe how PSSMs and profiles are generated, and how they differ.
Name two applications for which either a PSSM or a profile could be used.
Why are PSSMs or profiles used (i.e., what advantage do they confer over a
single consensus sequence)?
PSSMs and profiles are generated from multiple sequence alignments of similar
sequences. Both capture the frequency of amino acid substitutions observed
at each position of the protein sequence in the MSA. The main difference
between a PSSM and a profile is that a profile allows gaps and a PSSM does
not.
PSSMs or profiles can be used to identify and characterize protein families,
discover conserved regions of a sequence, generate consensus sequences, and
find conserved sequences in a database, etc.
The advantage of PSSMs and profiles over a consensus sequence is that they
store information about the relative importance of each position of the
sequence, which cannot be captured by a single sequence.
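The answer above can be illustrated with a small sketch that builds a PSSM from an ungapped alignment. The four aligned sequences, the pseudocount of 1, and the uniform background frequency of 1/20 are made-up assumptions for illustration, not values from the course, and the function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log2-odds score."""
    n_seqs = len(alignment)
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    total = n_seqs + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total  # pseudocount avoids log(0)
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

def score(pssm, seq):
    """Sum the per-position scores of an ungapped sequence."""
    return sum(col[aa] for col, aa in zip(pssm, seq))

def consensus(pssm):
    """Highest-scoring residue at each position."""
    return "".join(max(col, key=col.get) for col in pssm)

# Hypothetical ungapped alignment (a PSSM does not allow gaps).
toy_alignment = ["GKSTL", "GKTTL", "GRSTV", "GKSAL"]
pssm = build_pssm(toy_alignment)
print(consensus(pssm))
print(round(pssm[0]["G"], 2), round(pssm[1]["K"], 2))
```

The consensus string treats the invariant G at position 1 and the mostly conserved K at position 2 identically, whereas the PSSM gives G a higher score than K. That per-position weighting is the information a single consensus sequence loses.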
b. What are the 4 main hierarchical levels of protein structure? What types of
bonds (covalent or non-covalent) are most important for stabilizing structure
at each level?
Primary structure – linear sequence of amino acids; stabilized by covalent bonds
(peptide bonds between amino acids)
Secondary structure – alpha-helices, beta-strands & loops formed via
short-range interactions between amino acids; stabilized by non-covalent bonds
(mostly hydrogen bonds)
Tertiary structure – overall 3-D fold of a single polypeptide chain; stabilized
mainly by non-covalent bonds (disulfide bonds are the exception here:
they are covalent bonds that stabilize tertiary structure in some proteins)
Quaternary structure – interactions between different polypeptide chains or
subunits to form a functional multisubunit protein; stabilized by
non-covalent bonds
c. What is the difference between a protein motif and a protein domain?
A protein motif is a short, conserved sequence pattern. A protein domain is a
larger unit, corresponding to a longer sequence, that usually represents a
functionally or structurally independent unit (and may contain one or more
motifs).
d. What is an HMM? Why is it more "powerful" and "flexible" than either a PSSM
or a profile?
An HMM is a hidden Markov model. HMMs are mathematical models with special
properties, including “hidden” states. In our toy example of the fair and
loaded dice, the hidden state was whether the fair or loaded die was being
tossed. We cannot see the hidden state, only the observed variables; in the
die example, we only see what number is rolled.
One reason an HMM is more powerful and flexible than either a PSSM or profile
is that HMMs are mathematical models that explicitly include and distinguish
between insertions and deletions, whereas PSSMs do not allow gaps, and
profiles allow gaps, but don't distinguish insertions vs deletions.
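For the fair/loaded dice example, the probability of one particular hidden path and the total probability of the observed rolls can be sketched as follows. All numbers here are assumptions chosen for illustration (a loaded die that rolls 6 half the time, switching probabilities of 0.05 and 0.10, and equal start probabilities); they are not necessarily the values used in lecture.

```python
FAIR, LOADED = "F", "L"

# Assumed parameters (hypothetical, for illustration only).
START = {FAIR: 0.5, LOADED: 0.5}
TRANS = {
    FAIR: {FAIR: 0.95, LOADED: 0.05},
    LOADED: {FAIR: 0.10, LOADED: 0.90},
}
EMIT = {
    FAIR: {r: 1 / 6 for r in range(1, 7)},
    LOADED: {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}

def path_probability(rolls, path):
    """Joint probability of the rolls and one given hidden path:
    the product of the start, transition and emission probabilities."""
    prob = START[path[0]] * EMIT[path[0]][rolls[0]]
    for i in range(1, len(rolls)):
        prob *= TRANS[path[i - 1]][path[i]] * EMIT[path[i]][rolls[i]]
    return prob

def forward_probability(rolls):
    """Total probability of the rolls, summed over all hidden paths
    (forward algorithm)."""
    f = {s: START[s] * EMIT[s][rolls[0]] for s in START}
    for r in rolls[1:]:
        f = {s: EMIT[s][r] * sum(f[p] * TRANS[p][s] for p in f) for s in START}
    return sum(f.values())

rolls = [6, 6, 1]
print(path_probability(rolls, "LLF"))
print(forward_probability(rolls))
```

The most probable path is found with the same recursion as `forward_probability`, but taking the maximum over previous states instead of the sum (the Viterbi algorithm).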
3. "Problem Solving" & Short Essay
A. Protein Structure Basics & Prediction
According to Ginalski et al, 2005, what are the most important problems that
must be addressed to improve protein tertiary structure prediction?
Some important areas for improvement are:
1. Increasing the number of known protein structures through structural
genomics projects. This will be the most helpful advance for protein structure
prediction because homology modeling and threading rely on known structures.
2. Improving accuracy of energy functions. Ab initio methods, especially, are
hampered by inaccurate energy functions and the huge search space of possible
conformations. Improvements in both ab initio and comparative methods will
require better energy functions, increased computational power, and more
efficient algorithms for exploring the conformational search space.
3. Improving structure comparison methods. The difficulty in comparing predicted
structures with actual structures is another major problem at present. It is
hard to improve your prediction method if you cannot accurately determine
where you were wrong in the first place.
B. RNA Structure Basics & Prediction
What are the three main approaches to RNA secondary structure prediction?
Explain the advantages and disadvantages (if any) of each.
1. Ab initio –
• Advantage – requires only a single RNA sequence.
• Disadvantage – relies on finding the minimum free energy structure
based on energy models that may not be accurate
2. Comparative –
• Advantage – uses multiple sequences to find covariation information or
consensus structures between sequences
• Disadvantage – still relies on energy models and multiple sequences
provided as input must be biologically related; in other words, if you give
these programs irrelevant information as input, they cannot produce
good output.
3. Combined Computational and Experimental –
• Advantage – can be much more accurate due to experimental evidence.
• Disadvantage – requires wet-lab experiments that can be expensive and
time-consuming, or digging through the literature to find experimental
evidence that cost somebody else a lot of time and money.
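One common way to quantify the covariation used by comparative methods is the mutual information between two alignment columns: columns that change together (for example, to preserve a base pair) score high. The sketch below is a minimal illustration under that assumption; the toy alignment is made up, and real programs combine this kind of signal with other information.

```python
import math
from collections import Counter

def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_i)
    fi, fj = Counter(col_i), Counter(col_j)
    fij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in fij.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((fi[a] / n) * (fj[b] / n)))
    return mi

# Hypothetical alignment: the first and last columns always form a
# Watson-Crick pair (G-C, C-G, A-U, U-A); the middle columns are invariant.
toy_alignment = [
    "GACUC",
    "CACUG",
    "AACUU",
    "UACUA",
]
columns = list(zip(*toy_alignment))
print(mutual_information(columns[0], columns[4]))  # covarying pair
print(mutual_information(columns[0], columns[1]))  # invariant column
```

An invariant column carries no covariation signal even if it is base-paired, which is one reason the input sequences need to be related but not identical.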
C. Gene Structure Basics & Prediction
Using this hidden Markov model, calculate the most probable path for sequence
ACTG.
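Your probability table will begin like this (rows are states, columns are the
symbols of the sequence):

State    Start   A      C       T        G        End
Begin    1       0      0       0        0        0
E        0       0.25   0.056   0.0126   0.0028   0
5        0       0      0       0        0.0012   0
I        0       0      0       0        0        0
End      0       0      0       0        0        0

Calculations for the entries that are worked out:
E, A:  1 * 0.25 = 0.25
E, C:  0.25 * (0.9 * 0.25) = 0.056
E, T:  0.056 * (0.9 * 0.25) = 0.0126
E, G:  0.0126 * (0.9 * 0.25) = 0.0028
5, C:  0.25 * 0.1 * 0 = 0
5, T:  0.056 * 0.1 * 0 = 0
5, G:  0.0126 * 0.1 * 0.95 = 0.0012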
Sorry about this problem! Because of the structure of this HMM, we cannot make it all
the way from Start to End with this sequence. We can’t even make it to state I; we can
get to either E or 5. The most probable path for this sequence through this model is:
Start - E - E - E - E, and then we are stuck because we can’t get all the way to End!
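The table can be checked with a short Viterbi sketch. The HMM figure is not reproduced in this key, so the parameters below are partly assumptions: the values that appear in the calculations above (Start to E = 1, E to E = 0.9, E to 5 = 0.1, E emits each base with 0.25, state 5 emits G with 0.95 and C or T with 0) are taken from the table, and the remaining values are assumed to follow the toy splice-site model in the Eddy reading.

```python
# Emission probabilities; values for state I and for A in state 5 are assumed.
EMIT = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "5": {"A": 0.05, "C": 0.0, "G": 0.95, "T": 0.0},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}
# Transition probabilities; the 5 -> I and I -> I/End values are assumed.
TRANS = {
    "Start": {"E": 1.0},
    "E": {"E": 0.9, "5": 0.1},
    "5": {"I": 1.0},
    "I": {"I": 0.9, "End": 0.1},
}
STATES = list(EMIT)

def viterbi(seq):
    """Return one column per symbol: state -> (best path probability, previous state)."""
    col = {s: (TRANS["Start"].get(s, 0.0) * EMIT[s][seq[0]], None) for s in STATES}
    table = [col]
    for sym in seq[1:]:
        prev = table[-1]
        col = {}
        for s in STATES:
            # Best predecessor of s, then multiply by the emission probability.
            best = max(STATES, key=lambda p: prev[p][0] * TRANS[p].get(s, 0.0))
            col[s] = (prev[best][0] * TRANS[best].get(s, 0.0) * EMIT[s][sym], best)
        table.append(col)
    return table

def best_path(table):
    """Trace back from the most probable state in the last column."""
    last = max(table[-1], key=lambda s: table[-1][s][0])
    path = [last]
    for col in reversed(table[1:]):
        path.append(col[path[-1]][1])
    return path[::-1], table[-1][last][0]

def end_probability(table):
    """Best probability of a path that also makes the final transition to End."""
    return max(table[-1][s][0] * TRANS[s].get("End", 0.0) for s in STATES)

table = viterbi("ACTG")
path, prob = best_path(table)
print(path, prob)
print(end_probability(table))
```

Under these assumptions the traceback stays in state E for all four symbols, and the probability of reaching End is zero because state I is never entered.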
D. Transcription/translation:
Below the DNA sequence shown, write the RNA sequence that would be
transcribed from the top strand of DNA, assuming it is copied completely from the
3' to the 5' end. On the RNA sequence, circle the START and STOP codons for
translation. Translate the RNA sequence into amino acids and write the deduced
protein sequence below, too.
DNA
3'-TATATCGCGTTACGATCTGCTGAAGATCATC-5'
5'-ATATAGCGCAATGCTAGACGACTTCTAGTAG-3'
RNA
5'-AUAUAGCGCAAUGCUAGACGACUUCUAGUAG-3'
Codons read from the START codon: AUG CUA GAC GAC UUC UAG (followed by a second UAG)
Protein:
NH2-Met-Leu-Asp-Asp-Phe-STOP (the next codon is also a stop codon)
START codon: AUG
STOP codon: UAG
Suppose the DNA sequence in another individual is different, due to a single
base-pair substitution, as shown below. What effects (if any) would you expect
this SNP to have on the sequence and function of the expected protein product?
DNA
3'-TATATCGCGTTACGATCTGCTGAAGGTCATC-5'
5'-ATATAGCGCAATGCTAGACGACTTCCAGTAG-3'
RNA
5'-AUAUAGCGCAAUGCUAGACGACUUCCAGUAG-3'
Protein:
NH2-Met-Leu-Asp-Asp-Phe-Gln-STOP
START codon: AUG
STOP codon: UAG
The peptide encoded by the 2nd gene would have one additional amino acid (Gln),
relative to the 1st. With only the information provided, it is not possible to
determine whether this change would alter its function.
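The transcription and translation steps in this problem can be checked with a short sketch. It uses the standard genetic code; the function names are hypothetical, and the template strand is derived here by complementing the bottom (5' to 3') strand shown in the problem.

```python
# Standard genetic code, built from the usual U, C, A, G ordering of codons.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
RNA_COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}

def template_from_coding(coding_5to3):
    """Template strand written 3'->5', base-paired with the coding strand."""
    return "".join(DNA_COMPLEMENT[b] for b in coding_5to3)

def transcribe(template_3to5):
    """RNA written 5'->3', made by copying the template strand from 3' to 5'."""
    return "".join(RNA_COMPLEMENT[b] for b in template_3to5)

def translate(rna):
    """Translate from the first AUG up to and including the first stop ('*')."""
    start = rna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        protein.append(aa)
        if aa == "*":
            break
    return "".join(protein)

coding = "ATATAGCGCAATGCTAGACGACTTCTAGTAG"      # bottom strand, 5'->3'
snp_coding = "ATATAGCGCAATGCTAGACGACTTCCAGTAG"  # same strand carrying the SNP

for strand in (coding, snp_coding):
    rna = transcribe(template_from_coding(strand))
    print(rna, translate(rna))
```

In one-letter code the two products are MLDDF and MLDDFQ, each followed by a stop; the SNP turns the first UAG stop codon into CAG (Gln), so translation continues to the next UAG.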
Important "Bioinformatics" vocabulary:
New Molecular Biology Jargon:
CpG island
RNA polymerase
Promoter
Enhancer
Post-transcriptional control
Primary, secondary, tertiary, quaternary structure
Hydrophobic/ hydrophilic
Motif
Domain
Covalent bond
Non-covalent bond
Peptide bond
Hydrogen bond (H-bond)
Base-stacking interactions
Rotamer
Ramachandran plot
Energy minimization
X-ray crystallography
NMR spectroscopy
AFM
Co-variation
Experimental constraints (for RNA structure prediction)
cDNA
EST
Bioinformatics Jargon:
Log odds score
PSSM
Profile
HMM
Markov model
1st, 2nd, 3rd order MMs
Observed vs hidden variables
Emission vs transition probabilities
Viterbi algorithm
Profile HMM
Regularization
Pseudocounts
Dirichlet mixture
Regular expression
PDB, MMDB, MSD
PYMOL, Cn3D
CATH, SCOP, DALI
GOR V, CMD
Q3 SCORE
Phobius
Protein Structure Prediction
1) Ab initio
2) Threading or fold recognition
3) Homology modeling
Miyazawa-Jernigan (MJ) model/potentials
SWISS-MODEL
3-D JURY
RNA structure prediction
1) Ab initio (thermodynamic)
2) Comparative
3) Combined computational & experimental
Gene structure prediction
1) Ab initio
2) Similarity based
3) Combined