BCB 444/544- F07 Study Guide #1 -

advertisement
BCB 444/544 Fall 07
Study Guide #1 KEY - Sept 16
p 1 of 10
BCB 444/544- F07
Study Guide #1 - ANSWERS (PARTIAL)
For Exam 1 (Fri Sept 21) - Answers will be discussed in Lab/Review Session on Thurs Sept 20
General comments
•
•
•
•
•
•
Exam 1 will cover all topics covered in class, lab and assigned readings:
• Lectures 2-12 (thru Mon Sept 17)
• Labs 1-4
• HW2
• All assigned reading & URLs indicated in PPTs, including:
Xiong: Chps 2-6 (but not HMMs)
Eddy: What is Dynamic Programming
This study guide covers ~90% of material important for Exam 1 - no guarantees about other 10%!
Exam 1 will be a closed-book, closed-notes, 50-minute exam.
Some questions will involve computation; therefore, bring your calculators if you like.
All required formulae or tables (except the dynamic programming equations) will be provided.
Some questions will require short essay-like answers that demonstrate your understanding of key
concepts covered in the course.
Topics & Study Questions:
Resources for Bioinformatics (in ISU Library)
•
Name 5 resources for Bioinformatics provided by NCBI (ENTREZ).
•
Which online resource (available through ISU's library) would you use to find papers that cite one of
your published papers?
•
Where (on NCBI website) would you go to find free full-text copies of textbooks related to molecular
biology, genetics, etc.?
Molecular Biology
•
•
Eukaryotic vs prokaryotic cells/organisms
o
Name 3 differences between them
o
Name 1 example of each type of organism
Central Dogma of Molecular Biology
o
DNA Replication
o
Transcription
o
Translation
•
What is splicing (RNA splicing)?
•
What is an Exon? an Intron?
•
o
Which is present in pre-mRNA?
o
Which is present in mature mRNA?
What is an ORF?
BCB 444/544 Fall 07
Study Guide #1 KEY - Sept 16
p 2 of 10
•
What is meant by 6-frame translation?
•
What is the difference between genotype and phenotype?
•
Which can be expressed quantitatively: similarity or homology?
•
What is the difference between an ortholog and a paralog?
Sequence Alignment
•
What are 3 basic computational methods for sequence alignment?
•
Why do we need/use heuristics for aligning sequences?
•
Global vs local alignments (see HW2 examples) &:
o
What are differences in filling DP matrix?
o
What are differences in traceback & scoring?
o
When should you use each type of alignment method?
o
Whose implementation of DP algorithm for global alignment is most widely used?

For local alignment?
•
What is an affine gap penalty? Why is it often better to use than constant gap penalty?
•
Dot matrices (see HW2 examples) &:
o
•
Dynamic programming (DP)
o
•
•
•
Name 2 alignment programs that use this method.
Scoring matrices (PAM and BLOSUM)
o
Which type of matrix is based on an evolutionary model?
o
Which type of matrix is used as default matrix in NCBI's BLAST?
o
When/why would you use a BLOSUM matrix with a higher index, e.g., BLOSUM90?
o
When/why would you use a BLOSUM matrix with a lower index, e.g., BLOSUM45?
Database searching with BLAST:
o
•
Explain the basic idea behind DP
What is a word (k-tuple) method?
o
•
What does a series of parallel diagonal lines in a dot matrix pattern usually represent?
Which flavor of BLAST should be used when searching for highly divergent sequences?

for long nearly identical related sequences?

for DNA sequences similar to your query DNA sequence?

for protein sequence similar that encoded by your query DNA sequence?
Significance of BLAST "hits"
o
In general, what range of E-values suggests that a "hit" is significant?
o
In general, what range of E-values suggests that a "hit" is no better than random?
o
Why is it sometimes important to consider the bit score, S', instead of only E-value?
Advantages/disadvantages of BLAST vs FASTA vs DP
BCB 444/544 Fall 07
Study Guide #1 KEY - Sept 16
p 3 of 10
Sample Questions/Problems
1. Answer True or False or fill in blank to complete the following statements.
a. It is correct to say: These two sequences are 30% homologous." False
b. Explain. Homology is a qualitative term: when we say two sequences are homologous, we mean they
are "evolutionarily-related." Homology can be inferred if the degree of sequence similarity/identity
shared by sequences is high enough. Identity and similarity are quantitative terms.
c. Homologous protein sequences usually exhibit more than 25 - 30% sequence identity.
d. A(n) _Open Reading Frame_or ORF_ includes all codons between 2 stop codons (or all codons
between a START codon (AUG) and a STOP codon) in the same frame of an mRNA sequence.
e. Phenotype refers to the observable (e.g., physical) characteristics of an organism; an organism's
genotype is its genetic makeup, which largely determines its phenotype. True
f. Only a very small fraction of human genes are alternatively spliced to result in the expression of more
than one mature mRNA. False
g. Explain. Based on recent results, it is believed that >50% of human genes are alternatively spliced!
h. An _intron__ is usually removed from the pre-mRNA transcribed from a gene, and the amino acid
sequences corresponding to it do not usually appear in the final expressed protein product of a gene.
i. Usually, a pairwise alignment can provide just as much information as a multiple sequence alignment.
False
j. Explain. MSAs provide much more information than pairwise alignment, regarding conserved residues,
possible functional motifs, and potential homologous relationships.
k. Psi-BLAST is valuable for identifying remotely homologous sequences. In each iteration, a MSA is
used to generate a PSSM that is used instead of the original query sequence to search a database.
True.
l. Explain. The iterative use of PSSMs gives Psi-BLAST enhanced ability to detect remote homologs,
relative to "ordinary" BLAST.
2. Short answer questions (Answers provided below are sometimes longer than required!)
a. Briefly describe how the PAM and BLOSUM scoring matrices are derived and how they are different.
PAM matrices are based on an evolutionary model. PAM1 is a matrix of log odds scores derived
by calculating substitution statistics from an alignment of closely related sequences. Other PAM
matrices are obtained from PAM1 by extrapolation. A PAM matrix with a lower index should be
used for comparing closely related sequences (e.g., PAM1). BLOSUM matrices are log odds
score matrices, too, but they are derived by calculating substitution statistics based on BLOCKs
of conserved sequences from alignments of evolutionarily divergent sequences. A BLOSUM
martrix with a higher index should be used for comparing closely related sequences (e.g.,
BLOSUM90).
b. In what sense is BLAST better than the Smith-Waterman (local alignment) DP algorithm?
BLAST is faster.
BCB 444/544 Fall 07
Study Guide #1 KEY - Sept 16
p 4 of 10
c. What is the difference between an affine gap penalty and a constant linear gap penalty?
A constant linear gap penalty uses the same penalty for every gap position.
An affine gap penalty uses different penalties for opening a gap and for extending the gap;
usually the opening penalty is greater than the extension penalty.
d. Everything else being equal, when does BLAST produce a more significant E-value, when searcing a
database of size 500,000 or when searching a database of size 1,000,000? Explain your answer.
Because the E-value is directly proportionally to the size of the database, E-values for results of
a BLAST search using the same query sequence would typically be greater when searching a
large database than when searching a small database. Thus, we would expect to see a smaller
(and more significant) E-value for a search performed against the smaller database of 500,000
seqences.
e. In pairwise alignment, how would you go about modifying scoring schemes to accommodate different
evolutionary distances? For example, if you need to globally align two sequences, how would you modify
the gap and match penalties if you knew that they were closely related? If they were more distantly
related?
Closely related: increase gap penalty and choose a lower PAM or higher BLOSUM matrix.
Distantly related: decrease the gap penalty and choose a higher PAM or a lower BLOSUM matrix
BCB 444/544 Fall 07
Study Guide #1 KEY - Sept 16
p 5 of 10
3. Dot plots. Below is a dot plot comparing two 5,000 bp DNA sequences. For part a) you can think purely in
terms of sequence information. For part b) you should think biologically (what functional features could explain
the observed pattern).
1
2
3
a) Interpret the pattern, describing what events happened during the divergence of sequences A and B.
"Divergence" implies the two sequences are believed to be derived from a single common
ancestor. They appear to have undergone a combination of single nucleotide changes and
insertion-deletion (indel) events. The nucleotide changes explain why only some parts of the
sequences give rise to extended diagonal lines. Also, it appears that there was an insertion in
sequence A or perhaps a deletion in sequence B (or both) between diagonals 1 and 2. Between
diagonals 2 and 3, the opposite pattern of indels appears to have occurred.
b) Suppose you know these 2 sequences each include coding regions for only 1 eukaryotic gene.
• Describe what the matching regions are most likely to represent and why.
The general answer is that matching regions are sequences that are evolving more slowly
(under more selective constraints). A specific answer is that each matching region corresponds
to a protein-encoding exon and these exons are separated by introns (because exons are
almost always more conserved than introns).
• Describe what the regions at the northwest and southeast (the parts beyond the matching
diagonals) are likely to represent and explain your reasoning.
Possibilities include:
i) they are intergenic regions (beyond the ends of the genes) that have undergone
extensive single nucleotide changes over time
ii) one sequence or the other underwent rearrangement, e.g., acquired deletions or was
moved (by recombination) into a new chromosomal location, such that any shared
sequences outside the region defined by diagonals were "removed" in a single event.
BCB 444/544 Fall 07
Study Guide #1 KEY - Sept 16
p 6 of 10
4) Dynamic Programming - Global alignment
4a) Fill out the dynamic programming matrix for determining the optimal global alignment between the two
sequences, CGGA and ACTG. Scoring: Match = +3; Mismatches and Spaces = -1.
λ
C
G
G
A
λ
0
-1
-2
-3
-4
A
-1
-1
-2
-3
0
C
-2
2
1
0
-1
T
-3
1
1
0
-1
G
-4
0
4
4
3
4b) What is the optimal score for the alignment(s)?
3
4c) Draw the optimal alignment(s) corresponding to this score.
(if there is more than one, you must include all for complete credit!)
A
C
C
G
T
G
G
A
-
-1
+3
-1
+3
-1
=
3
BCB 444/544 Fall 07
Study Guide #1 KEY - Sept 16
p 7 of 10
5. Dynamic Programming for Local Alignment, using BLOSUM matrix
Use the Smith-Waterman (local alignment) DP algorithm with a constant linear gap penalty of -3 and the
BLOSUM62 scoring matrix (below) to fill in ONLY the first two columns of the following matrix. Include
traceback arrows.
λ
λ
0
C
0
C
0
9
V
0
6
E
0
3
H
0
0
S
0
0
E
V
G
BCB 444/544 Fall 07
Study Guide #1 KEY - Sept 16
p 8 of 10
6. Position-Specific Scoring Matrices (PSSMs)
An analysis of 77 DNA binding sites for a specific transcription factor (TF) yielded the following PSSM:
A
C
G
T
37
10
13
17
0
76
0
1
0
0
0
77
0
1
76
0
7
4
9
57
34
11
9
23
The 3 sequence fragments given below contain a TF binding site.
Calculate which of these has the strongest and which has the weakest match.
Show your calculations and ranking.
fragment 1:
fragment 2:
fragment 3:
A C C T G C
C A C T G T
T G C T G A
We calculate the score for each fragment, based on the PSSM matrix:
fragment1 =
A
C
C
T
G
C
37 + 76 + 0 + 0 + 9 + 11
fragment2 =
C
A
C
T
G
T
10 + 0 + 0 + 0 + 9 + 23
fragment3 =
T
G
C
T
G
A
17 + 0 + 0 + 0 + 9 + 34
Fragment 1 is the strongest because it has the highest score.
Fragment 2 is the weakest because it has the lowest score.
Ranking:
1 >
3
>
2
BCB 444/544 Fall 07
Study Guide #1 KEY - Sept 16
p 9 of 10
Important "molecular biology" and "bioinformatics" vocabulary:
Molecular Biology Jargon:
Central Dogma of Molecular Biology
DNA
RNA
Protein
Chromosome
DNA Replication
Transcription
Translation
DNA polymerase
RNA polymerase (RNAP)
Ribosome
Genome
Genotype
Phenotype
Eukaryote
Prokaryote
Gene
Exon
Intron
Splicing
Alternative splicing
Messenger RNA (mRNA)
Pre-mRNA
6-frame translation
Open Reading Frame (ORF)
Homolog
Ortholog
Paralog
BCB 444/544 Fall 07
Study Guide #1 KEY - Sept 16
p 10 of 10
Mutation
Synonymous
Non-synonymous
Homology
Similarity
Bioinformatics Jargon:
Annotation
Algorithm
Exhaustive method
Heuristic method
Alignment methods
1) Dot matrix analysis
2) Dynamic programming (DP)
3) Word or k-tuple
Global alignment
Local alignment
Pairwise alignment
Multiple sequence alignment (MSA)
BLAST
FASTA
BLOSUM matrix
PAM matrix
Motif
PSSM
Psi-BLAST
Needleman-Wunsch algorithm (NW)
Smith-Waterman algorithm (SW)
Clustal W
Download