BCB 444/544 Fall 07 Study Guide #1 KEY - Sept 16

BCB 444/544 - F07 Study Guide #1 - ANSWERS (PARTIAL)
For Exam 1 (Fri Sept 21) - Answers will be discussed in Lab/Review Session on Thurs Sept 20

General comments

• Exam 1 will cover all topics covered in class, lab and assigned readings:
    o Lectures 2-12 (thru Mon Sept 17)
    o Labs 1-4
    o HW2
    o All assigned reading & URLs indicated in PPTs, including:
        Xiong: Chps 2-6 (but not HMMs)
        Eddy: What is Dynamic Programming?
• This study guide covers ~90% of material important for Exam 1 - no guarantees about other 10%!
• Exam 1 will be a closed-book, closed-notes, 50-minute exam.
• Some questions will involve computation; therefore, bring your calculators if you like.
• All required formulae or tables (except the dynamic programming equations) will be provided.
• Some questions will require short essay-like answers that demonstrate your understanding of key concepts covered in the course.

Topics & Study Questions:

Resources for Bioinformatics (in ISU Library)

• Name 5 resources for Bioinformatics provided by NCBI (ENTREZ).
• Which online resource (available through ISU's library) would you use to find papers that cite one of your published papers?
• Where (on NCBI website) would you go to find free full-text copies of textbooks related to molecular biology, genetics, etc.?

Molecular Biology

• Eukaryotic vs prokaryotic cells/organisms
    o Name 3 differences between them
    o Name 1 example of each type of organism
• Central Dogma of Molecular Biology
    o DNA Replication
    o Transcription
    o Translation
• What is splicing (RNA splicing)?
• What is an exon? An intron?
    o Which is present in pre-mRNA?
    o Which is present in mature mRNA?
• What is an ORF?
• What is meant by 6-frame translation?
• What is the difference between genotype and phenotype?
• Which can be expressed quantitatively: similarity or homology?
• What is the difference between an ortholog and a paralog?
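The ORF and 6-frame translation questions above can be made concrete with a small sketch. The Python below is an illustration only, not part of the course materials: the function names are hypothetical, and it assumes the standard genetic code and a DNA string containing only A, C, G and T. It translates the three reading frames of the given strand and the three frames of its reverse complement; stop codons appear as "*", so an ORF shows up as a run of residues between two "*" characters (or between an "M" and a "*") within one frame.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGAAATAG").items():
        print(frame, protein)
```

For the toy input ATGAAATAG, frame +1 reads ATG AAA TAG, i.e. a start codon, one lysine codon, and a stop codon, which is the smallest possible example of an ORF bounded by START and STOP in the same frame.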
Sequence Alignment

• What are 3 basic computational methods for sequence alignment?
• Why do we need/use heuristics for aligning sequences?
• Global vs local alignments (see HW2 examples):
    o What are differences in filling DP matrix?
    o What are differences in traceback & scoring?
    o When should you use each type of alignment method?
    o Whose implementation of DP algorithm for global alignment is most widely used? For local alignment?
• What is an affine gap penalty? Why is it often better to use than constant gap penalty?
• Dot matrices (see HW2 examples):
    o What does a series of parallel diagonal lines in a dot matrix pattern usually represent?
• Dynamic programming (DP)
    o Explain the basic idea behind DP
• What is a word (k-tuple) method?
    o Name 2 alignment programs that use this method.
• Scoring matrices (PAM and BLOSUM)
    o Which type of matrix is based on an evolutionary model?
    o Which type of matrix is used as default matrix in NCBI's BLAST?
    o When/why would you use a BLOSUM matrix with a higher index, e.g., BLOSUM90?
    o When/why would you use a BLOSUM matrix with a lower index, e.g., BLOSUM45?
• Database searching with BLAST:
    o Which flavor of BLAST should be used when searching:
        for highly divergent sequences?
        for long, nearly identical related sequences?
        for DNA sequences similar to your query DNA sequence?
        for protein sequences similar to the one encoded by your query DNA sequence?
• Significance of BLAST "hits"
    o In general, what range of E-values suggests that a "hit" is significant?
    o In general, what range of E-values suggests that a "hit" is no better than random?
    o Why is it sometimes important to consider the bit score, S', instead of only E-value?
• Advantages/disadvantages of BLAST vs FASTA vs DP

Sample Questions/Problems

1. Answer True or False or fill in blank to complete the following statements.

a. It is correct to say: "These two sequences are 30% homologous." False
b. Explain.
Homology is a qualitative term: when we say two sequences are homologous, we mean they are "evolutionarily related." Homology can be inferred if the degree of sequence similarity/identity shared by the sequences is high enough. Identity and similarity are quantitative terms.
c. Homologous protein sequences usually exhibit more than 25 - 30% sequence identity.
d. A(n) _Open Reading Frame (or ORF)_ includes all codons between 2 stop codons (or all codons between a START codon (AUG) and a STOP codon) in the same frame of an mRNA sequence.
e. Phenotype refers to the observable (e.g., physical) characteristics of an organism; an organism's genotype is its genetic makeup, which largely determines its phenotype. True
f. Only a very small fraction of human genes are alternatively spliced to result in the expression of more than one mature mRNA. False
g. Explain.
Based on recent results, it is believed that >50% of human genes are alternatively spliced!
h. An _intron_ is usually removed from the pre-mRNA transcribed from a gene, and the amino acid sequences corresponding to it do not usually appear in the final expressed protein product of a gene.
i. Usually, a pairwise alignment can provide just as much information as a multiple sequence alignment. False
j. Explain.
MSAs provide much more information than pairwise alignments regarding conserved residues, possible functional motifs, and potential homologous relationships.
k. Psi-BLAST is valuable for identifying remotely homologous sequences. In each iteration, an MSA is used to generate a PSSM that is used instead of the original query sequence to search a database. True
l. Explain.
The iterative use of PSSMs gives Psi-BLAST enhanced ability to detect remote homologs, relative to "ordinary" BLAST.

2. Short answer questions (Answers provided below are sometimes longer than required!)

a. Briefly describe how the PAM and BLOSUM scoring matrices are derived and how they are different.
PAM matrices are based on an evolutionary model. PAM1 is a matrix of log odds scores derived by calculating substitution statistics from an alignment of closely related sequences. Other PAM matrices are obtained from PAM1 by extrapolation. A PAM matrix with a lower index should be used for comparing closely related sequences (e.g., PAM1).
BLOSUM matrices are log odds score matrices, too, but they are derived by calculating substitution statistics based on BLOCKs of conserved sequences from alignments of evolutionarily divergent sequences. A BLOSUM matrix with a higher index should be used for comparing closely related sequences (e.g., BLOSUM90).

b. In what sense is BLAST better than the Smith-Waterman (local alignment) DP algorithm?
BLAST is faster.

c. What is the difference between an affine gap penalty and a constant linear gap penalty?
A constant linear gap penalty uses the same penalty for every gap position. An affine gap penalty uses different penalties for opening a gap and for extending the gap; usually the opening penalty is greater than the extension penalty.

d. Everything else being equal, does BLAST produce a more significant E-value when searching a database of size 500,000 or when searching a database of size 1,000,000? Explain your answer.
Because the E-value is directly proportional to the size of the database, E-values for results of a BLAST search using the same query sequence would typically be greater when searching a large database than when searching a small database. Thus, we would expect to see a smaller (and more significant) E-value for a search performed against the smaller database of 500,000 sequences.

e. In pairwise alignment, how would you go about modifying scoring schemes to accommodate different evolutionary distances? For example, if you need to globally align two sequences, how would you modify the gap and match penalties if you knew that they were closely related?
If they were more distantly related?
Closely related: increase the gap penalty and choose a lower PAM or a higher BLOSUM matrix.
Distantly related: decrease the gap penalty and choose a higher PAM or a lower BLOSUM matrix.

3. Dot plots.
Below is a dot plot comparing two 5,000 bp DNA sequences. For part a) you can think purely in terms of sequence information. For part b) you should think biologically (what functional features could explain the observed pattern).

[Figure: dot plot of sequence A against sequence B showing three diagonal segments, labeled 1, 2 and 3]

a) Interpret the pattern, describing what events happened during the divergence of sequences A and B.
"Divergence" implies the two sequences are believed to be derived from a single common ancestor. They appear to have undergone a combination of single nucleotide changes and insertion-deletion (indel) events. The nucleotide changes explain why only some parts of the sequences give rise to extended diagonal lines. Also, it appears that there was an insertion in sequence A or perhaps a deletion in sequence B (or both) between diagonals 1 and 2. Between diagonals 2 and 3, the opposite pattern of indels appears to have occurred.

b) Suppose you know these 2 sequences each include coding regions for only 1 eukaryotic gene.
• Describe what the matching regions are most likely to represent and why.
The general answer is that matching regions are sequences that are evolving more slowly (under more selective constraints). A specific answer is that each matching region corresponds to a protein-encoding exon and these exons are separated by introns (because exons are almost always more conserved than introns).
• Describe what the regions at the northwest and southeast (the parts beyond the matching diagonals) are likely to represent and explain your reasoning.
Possibilities include:
i) they are intergenic regions (beyond the ends of the genes) that have undergone extensive single nucleotide changes over time
ii) one sequence or the other underwent rearrangement, e.g., acquired deletions or was moved (by recombination) into a new chromosomal location, such that any shared sequences outside the region defined by the diagonals were "removed" in a single event.

4) Dynamic Programming - Global alignment

4a) Fill out the dynamic programming matrix for determining the optimal global alignment between the two sequences, CGGA and ACTG. Scoring: Match = +3; Mismatches and Spaces = -1.

         λ    C    G    G    A
    λ    0   -1   -2   -3   -4
    A   -1   -1   -2   -3    0
    C   -2    2    1    0   -1
    T   -3    1    1    0   -1
    G   -4    0    4    4    3

4b) What is the optimal score for the alignment(s)?
3

4c) Draw the optimal alignment(s) corresponding to this score. (If there is more than one, you must include all for complete credit!)

    CGGA:     -    C    G    G    A
    ACTG:     A    C    T    G    -
    score:   -1   +3   -1   +3   -1   = 3

5. Dynamic Programming for Local Alignment, using BLOSUM matrix

Use the Smith-Waterman (local alignment) DP algorithm with a constant linear gap penalty of -3 and the BLOSUM62 scoring matrix (below) to fill in ONLY the first two columns of the following matrix. Include traceback arrows.

         λ    C    E    V    G
    λ    0    0
    C    0    9
    V    0    6
    E    0    3
    H    0    0
    S    0    0

Traceback: the 9 comes from the diagonal (C aligned with C, BLOSUM62 score 9); the 6 and the 3 each come from the cell above (9 - 3 and 6 - 3).

6. Position-Specific Scoring Matrices (PSSMs)

An analysis of 77 DNA binding sites for a specific transcription factor (TF) yielded the following PSSM (one row per position in the binding site, one column per base; each row sums to 77):

    Position    A    C    G    T
    1          37   10   13   17
    2           0   76    0    1
    3           0    0    0   77
    4           0    1   76    0
    5           7    4    9   57
    6          34   11    9   23

The 3 sequence fragments given below contain a TF binding site. Calculate which of these has the strongest and which has the weakest match. Show your calculations and ranking.
    fragment 1:  A C C T G C
    fragment 2:  C A C T G T
    fragment 3:  T G C T G A

We calculate the score for each fragment by summing the PSSM entries for its base at each position:

    fragment 1 = A C C T G C:   37 + 76 + 0 + 0 + 9 + 11 = 133
    fragment 2 = C A C T G T:   10 +  0 + 0 + 0 + 9 + 23 = 42
    fragment 3 = T G C T G A:   17 +  0 + 0 + 0 + 9 + 34 = 60

Fragment 1 is the strongest because it has the highest score. Fragment 2 is the weakest because it has the lowest score.
Ranking: 1 > 3 > 2

Important "molecular biology" and "bioinformatics" vocabulary:

Molecular Biology Jargon:

• Central Dogma of Molecular Biology
• DNA
• RNA
• Protein
• Chromosome
• DNA Replication
• Transcription
• Translation
• DNA polymerase
• RNA polymerase (RNAP)
• Ribosome
• Genome
• Genotype
• Phenotype
• Eukaryote
• Prokaryote
• Gene
• Exon
• Intron
• Splicing
• Alternative splicing
• Messenger RNA (mRNA)
• Pre-mRNA
• 6-frame translation
• Open Reading Frame (ORF)
• Homolog
• Ortholog
• Paralog
• Mutation
• Synonymous
• Non-synonymous
• Homology
• Similarity

Bioinformatics Jargon:

• Annotation
• Algorithm
• Exhaustive method
• Heuristic method
• Alignment methods
    1) Dot matrix analysis
    2) Dynamic programming (DP)
    3) Word or k-tuple
• Global alignment
• Local alignment
• Pairwise alignment
• Multiple sequence alignment (MSA)
• BLAST
• FASTA
• BLOSUM matrix
• PAM matrix
• Motif
• PSSM
• Psi-BLAST
• Needleman-Wunsch algorithm (NW)
• Smith-Waterman algorithm (SW)
• Clustal W
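
As a cross-check on problem 4, the global alignment matrix can be recomputed with a short script. This is a minimal sketch of the Needleman-Wunsch recurrence under the scoring stated in that problem (match = +3, mismatches and spaces = -1), not an official course implementation. The function names are illustrative, and the traceback returns only one optimal alignment (ties are broken in the order diagonal, up, left), so it will not enumerate co-optimal alignments when they exist.

```python
def global_alignment_matrix(seq_cols, seq_rows, match=3, mismatch=-1, gap=-1):
    """Fill the Needleman-Wunsch matrix; seq_cols labels columns, seq_rows labels rows."""
    n_rows, n_cols = len(seq_rows) + 1, len(seq_cols) + 1
    F = [[0] * n_cols for _ in range(n_rows)]
    # First row and column: aligning a prefix against nothing costs one gap per character.
    for j in range(1, n_cols):
        F[0][j] = j * gap
    for i in range(1, n_rows):
        F[i][0] = i * gap
    for i in range(1, n_rows):
        for j in range(1, n_cols):
            s = match if seq_rows[i - 1] == seq_cols[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # diagonal: align the two characters
                          F[i - 1][j] + gap,    # up: gap in the column sequence
                          F[i][j - 1] + gap)    # left: gap in the row sequence
    return F


def traceback(F, seq_cols, seq_rows, match=3, mismatch=-1, gap=-1):
    """Recover one optimal alignment by walking back from the bottom-right cell."""
    i, j = len(seq_rows), len(seq_cols)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq_rows[i - 1] == seq_cols[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            top.append(seq_cols[j - 1]); bottom.append(seq_rows[i - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append("-"); bottom.append(seq_rows[i - 1])
            i -= 1
        else:
            top.append(seq_cols[j - 1]); bottom.append("-")
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    F = global_alignment_matrix("CGGA", "ACTG")
    for label, row in zip("λACTG", F):
        print(label, row)
    print("score:", F[-1][-1])
    print(*traceback(F, "CGGA", "ACTG"), sep="\n")
```

For CGGA and ACTG this reproduces the matrix in 4a, the optimal score of 3, and the alignment -CGGA / ACTG-.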