BCB 444/544- F07 Study Guide #1

BCB 444/544 Fall 07
Study Guide #1 - Sept 16
p 1 of 9
For Exam 1 (Fri Sept 21) - Answers will be discussed in Lab/Review Session on Thurs Sept 20
General comments






Exam 1 will cover all topics covered in class, lab and assigned readings:
 Lectures 2-12 (thru Mon Sept 17)
 Labs 1-4
 HW2
 All assigned reading & URLs indicated in PPTs, including:
Xiong: Chps 2-6 (but not HMMs)
Eddy: What is Dynamic Programming
This study guide covers ~90% of material important for Exam 1 - no guarantees about other 10%!
Exam 1 will be a closed-book, closed-notes, 50-minute exam.
Some questions will involve computation; therefore, bring your calculators if you like.
All required formulae or tables (except the dynamic programming equations) will be provided.
Some questions will require short essay-like answers that demonstrate your understanding of key
concepts covered in the course.
Topics & Study Questions:
Resources for Bioinformatics (in ISU Library)

Name 5 resources for Bioinformatics provided by NCBI (ENTREZ).

Which online resource (available through ISU's library) would you use to find papers that cite one of
your published papers?

Where (on NCBI website) would you go to find free full-text copies of textbooks related to molecular
biology, genetics, etc.?
Molecular Biology

Eukaryotic vs prokaryotic cells/organisms
Name 3 differences between them
Name 1 example of each type of organism

Central Dogma of Molecular Biology
DNA Replication
Transcription
Translation

What is splicing (RNA splicing)?

What is an Exon? an Intron?
Which is present in pre-mRNA?
Which is present in mature mRNA?

What is an ORF?

What is meant by 6-frame translation?

What is the difference between genotype and phenotype?

Which can be expressed quantitatively: similarity or homology?

What is the difference between an ortholog and a paralog?
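To make the 6-frame translation and ORF questions above concrete, here is a minimal Python sketch. It assumes the standard genetic code, upper-case DNA input containing only A, C, G and T, and marks stop codons with `*`. The function names are illustrative and do not come from the course materials.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate one reading frame; 1-2 leftover bases at the end are ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def six_frame_translation(dna):
    """Translate the 3 frames of each strand; keys are '+1'..'+3' and '-1'..'-3'."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

In each translated frame, a stretch between two `*` characters (or from an `M` to the next `*`) is a candidate ORF.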
Sequence Alignment

What are 3 basic computational methods for sequence alignment?

Why do we need/use heuristics for aligning sequences?

Global vs local alignments (see HW2 examples):
What are differences in filling DP matrix?
What are differences in traceback & scoring?
When should you use each type of alignment method?
Whose implementation of DP algorithm for global alignment is most widely used?
For local alignment?

What is an affine gap penalty? Why is it often better to use than a constant gap penalty?

Dot matrices (see HW2 examples):
What does a series of parallel diagonal lines in a dot matrix pattern usually represent?

Dynamic programming (DP)
Explain the basic idea behind DP

What is a word (k-tuple) method?
Name 2 alignment programs that use this method.

Scoring matrices (PAM and BLOSUM)
Which type of matrix is based on an evolutionary model?
Which type of matrix is used as default matrix in NCBI's BLAST?
When/why would you use a BLOSUM matrix with a higher index, e.g., BLOSUM90?
When/why would you use a BLOSUM matrix with a lower index, e.g., BLOSUM45?

Database searching with BLAST:
Which flavor of BLAST should be used when searching for highly divergent sequences?
for long nearly identical related sequences?
for DNA sequences similar to your query DNA sequence?
for protein sequences similar to that encoded by your query DNA sequence?

Significance of BLAST "hits"
In general, what range of E-values suggests that a "hit" is significant?
In general, what range of E-values suggests that a "hit" is no better than random?
Why is it sometimes important to consider the bit score, S', instead of only the E-value?

Advantages/disadvantages of BLAST vs FASTA vs DP
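For the global vs local alignment topic above, here is a minimal sketch of the Smith-Waterman (local) fill step. It assumes simple match/mismatch scores and a linear gap penalty; the default values are illustrative, not taken from the course. Two differences from a global fill are visible in the code: the border cells stay at 0 instead of accumulating gap penalties, and every cell is floored at 0. The local score is the largest cell anywhere in the matrix, not the bottom-right cell.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best Smith-Waterman local alignment score (linear gaps)."""
    rows, cols = len(a) + 1, len(b) + 1
    # First row and column stay 0: a local alignment may start anywhere.
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # restart the alignment here
                H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,    # gap in b
                H[i][j - 1] + gap,    # gap in a
            )
            best = max(best, H[i][j])
    return best
```

Traceback (not shown) would start at the cell holding `best` and stop at the first cell equal to 0.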
Sample Questions/Problems
1. Answer True or False or fill in blank to complete the following statements.
a. It is correct to say: "These two sequences are 30% homologous." True or False?
b. Explain.
c. Homologous protein sequences usually exhibit more than _____% sequence identity.
d. A(n) _____________ includes all codons between 2 stop codons (or all codons between a START
codon (AUG) and a STOP codon) in the same frame of an mRNA sequence.
e. Phenotype refers to the observable (e.g., physical) characteristics of an organism; an organism's
genotype is its genetic makeup, which largely determines its phenotype. True or False?
f. Only a very small fraction of human genes are alternatively spliced to result in the expression of more
than one mature mRNA. True or False?
g. Explain.
h. An ________ is usually removed from the pre-mRNA transcribed from a gene, and the amino acid
sequences corresponding to it do not usually appear in the final expressed protein product of a gene.
i. Usually, a pairwise alignment can provide just as much information as a multiple sequence alignment.
True or False?
j. Explain.
k. Psi-BLAST is valuable for identifying remotely homologous sequences. In each iteration, a MSA is
used to generate a PSSM that is used instead of the original query sequence to search a database.
True or False?
l. Explain.
2. Short answer questions.
a. Briefly describe how the PAM and BLOSUM scoring matrices are derived and how they are different.
b. In what sense is BLAST better than the Smith-Waterman (local alignment) DP algorithm?
c. What is the difference between an affine gap penalty and a constant linear gap penalty?
d. Everything else being equal, when does BLAST produce a more significant E-value, for a database of
size 500,000 or for a database of size 1,000,000? Explain your answer.
e. In pairwise alignment, how would you go about modifying scoring schemes to accommodate different
evolutionary distances? For example, if you need to globally align two sequences, how would you modify
the gap and match penalties if you knew that they were closely related? If they were more distantly
related?
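For questions about E-values, bit scores and database size, the following sketch may help with checking reasoning numerically. It assumes the standard Karlin-Altschul relation E = m * n * 2^(-S'), where m is the query length, n is the total database length in residues, and S' is the bit score. The function name is illustrative, and real BLAST uses effective (edge-corrected) lengths, so reported values differ slightly.

```python
def expected_hits(bit_score, query_length, database_length):
    """E = m * n * 2**(-S') for query length m, database length n, bit score S'."""
    return query_length * database_length * 2.0 ** (-bit_score)
```

Calling `expected_hits` with the same bit score and query length but two different database lengths shows how the search space alone changes E.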
3. Dot plots. Below is a dot plot comparing two 5,000 bp DNA sequences. For part a) you can think purely in
terms of sequence information. For part b) you should think biologically (what functional features could explain
the observed pattern).
a) Interpret the pattern, describing what events happened during the divergence of sequences A and B.
b) Suppose you know that these 2 sequences each include coding regions for only 1 eukaryotic gene.
Describe what the matching regions are most likely to represent and why.
Describe what the regions at the northwest and southeast (the parts beyond the matching
diagonals) are likely to represent and explain your reasoning.
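To experiment with how dot plot patterns arise, here is a minimal sketch that builds a dot matrix from two sequences. It assumes exact word matches of length `window` (no mismatch threshold); the function names and the text rendering are illustrative only.

```python
def dot_matrix(seq_a, seq_b, window=1):
    """Return the set of (i, j) where the length-`window` words match exactly."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            if seq_a[i:i + window] == seq_b[j:j + window]:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the matrix as text: seq_a across the top, seq_b down the side."""
    lines = ["  " + seq_a]
    for j, base in enumerate(seq_b):
        row = "".join("*" if (i, j) in dots else "." for i in range(len(seq_a)))
        lines.append(base + " " + row)
    return "\n".join(lines)
```

Trying it on sequences with an inserted, deleted, or repeated segment is a quick way to see shifted or parallel diagonals appear.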
4) Dynamic Programming - Global alignment
4a) Fill out the dynamic programming matrix for determining the optimal global alignment between the two
sequences, CGGA and ACTG. Scoring: Match = +3; Mismatches and Spaces = -1.

            C     G     G     A
      ___   ___   ___   ___   ___
A     ___   ___   ___   ___   ___
C     ___   ___   ___   ___   ___
T     ___   ___   ___   ___   ___
G     ___   ___   ___   ___   ___

4b) What is the optimal score for the alignment(s)?
4c) Draw the optimal alignment(s) corresponding to this score.
(if there is more than one, you must include all for complete credit!)
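To check a hand-filled matrix for problem 4, here is a minimal sketch of the Needleman-Wunsch (global) fill step. The defaults follow the scoring stated in 4a (match +3, mismatch -1, space -1); the function name is illustrative, and traceback of all optimal alignments is not implemented.

```python
def global_alignment_matrix(a, b, match=3, mismatch=-1, gap=-1):
    """Fill the global-alignment DP matrix; rows follow `a`, columns follow `b`."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    # Border cells accumulate gap penalties: leading gaps are charged.
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                F[i - 1][j] + gap,    # gap in b
                F[i][j - 1] + gap,    # gap in a
            )
    return F
```

The optimal global score is the bottom-right cell, `F[-1][-1]`. Ties among the three candidates in a cell indicate more than one optimal path through that cell.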
5. Dynamic Programming for Local Alignment, using BLOSUM matrix
Use the Smith-Waterman (local alignment) DP algorithm with a constant linear gap penalty of -3 and the
BLOSUM62 scoring matrix (below) to fill in ONLY the first two columns of the following matrix. Include
trace back arrows.


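[DP matrix to fill in. Sequence labels as extracted, in order: C V E H S C E V G. BLOSUM62 table not reproduced here.]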
6. Position-Specific Scoring Matrices (PSSMs)
An analysis of 77 DNA binding sites for a specific transcription factor (TF) yielded the following PSSM:
Position    1     2     3     4     5     6
A          37     0     0     0     7    34
C          10    76     0     1     4    11
G          13     0     0    76     9     9
T          17     1    77     0    57    23
The 3 sequence fragments given below each contain a TF binding site.
Calculate which of these has the strongest and which has the weakest match.
Show your calculations and ranking.
fragment 1: ACCTGC
fragment 2: CACTGT
fragment 3: TGCTGA
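One way to score a candidate site against the PSSM above is a log-odds sum, sketched below. This is an assumption about the scoring rule, not the method specified for the exam: it uses a pseudocount of 1 and a uniform background of 0.25 per base, and it reads each consecutive group of four counts as one site position in the order A, C, G, T (each group sums to 77). The names `COUNTS` and `log_odds_score` are illustrative.

```python
import math

# Counts for A, C, G, T at each of the 6 site positions (77 sites in total).
COUNTS = [
    {"A": 37, "C": 10, "G": 13, "T": 17},
    {"A": 0, "C": 76, "G": 0, "T": 1},
    {"A": 0, "C": 0, "G": 0, "T": 77},
    {"A": 0, "C": 1, "G": 76, "T": 0},
    {"A": 7, "C": 4, "G": 9, "T": 57},
    {"A": 34, "C": 11, "G": 9, "T": 23},
]


def log_odds_score(site, counts=COUNTS, pseudocount=1.0, background=0.25):
    """Sum of log2(observed frequency / background) over the site positions."""
    if len(site) != len(counts):
        raise ValueError("site length must equal the number of PSSM positions")
    score = 0.0
    for base, column in zip(site, counts):
        total = sum(column.values()) + 4 * pseudocount
        freq = (column[base] + pseudocount) / total
        score += math.log2(freq / background)
    return score
```

Higher scores mean a closer match to the observed sites. A simpler alternative, if the course uses it, is to multiply the raw frequencies (count / 77) at each position; without pseudocounts, any zero count then gives a product of zero.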
Important "molecular biology" and "bioinformatics" vocabulary:
Molecular Biology Jargon:
Central Dogma of Molecular Biology
DNA
RNA
Protein
Chromosome
DNA Replication
Transcription
Translation
DNA polymerase
RNA polymerase (RNAP)
Ribosome
Genome
Genotype
Phenotype
Eukaryote
Prokaryote
Gene
Exon
Intron
Splicing
Alternative splicing
Messenger RNA (mRNA)
Pre-mRNA
6-frame translation
Open Reading Frame (ORF)
Homolog
Ortholog
Paralog
Mutation
Synonymous
Non-synonymous
Homology
Similarity
Bioinformatics Jargon:
Annotation
Algorithm
Exhaustive method
Heuristic method
Alignment methods
1) Dot matrix analysis
2) Dynamic programming (DP)
3) Word or k-tuple
Global alignment
Local alignment
Pairwise alignment
Multiple sequence alignment (MSA)
BLAST
FASTA
BLOSUM matrix
PAM matrix
Motif
PSSM
Psi-BLAST
Needleman-Wunsch algorithm (NW)
Smith-Waterman algorithm (SW)
Clustal W