BCB 444/544 Fall 07 Study Guide Final KEY – Dec 5 p 1 of 13 BCB 444/544 - F07 Study Guide Final - KEY For Final Exam (Dec 10) Answers will be discussed in Review Session on Thurs Dec 6 FINAL EXAM Mon Dec 10 9:45 – 11:45 AM in MBB 1420 & 1340 NOTE CHANGE: ENTIRE FINAL EXAM WILL BE OPEN BOOK/NOTES!! General Comments: Final Exam will include 100 pts and contain 2 sections: 1) a 50-minute written Exam, open-book, open-notes, open-computer 20 pts In Class: Comprehensive 40 pts In Class: New material (since Exam 2) All topics covered in class, lab and assigned readings: Lectures 27-38, HW5-6, Chps 9-11 & 17-19, Labs 9-11 2) a 50-minute lab practical Exam, open-book, open-notes, using computers 40 pts In Lab: Practical (Comprehensive) Some questions will involve computation; bring your calculators if you like. All required formulae or tables will be provided. Some questions will require short essay-like answers that demonstrate your understanding of key concepts covered in the course. How to study: Review all topics, problems & correct answers included in: o Study Guides 1 & 2 & 3/Final (this document, see below) o Exams 1 & 2 o Graded HW Assignments Review topics on which you missed points or need review in: o Lecture PPTs (1 - 38) (39 will not be covered) Spend time reviewing procedures and answers for: o Lab Exercises (1-11) (Lab 12 will not be covered) BCB 444/544 Fall 07 Study Guide Final KEY – Dec 5 p 2 of 13 Hints: For “Comprehensive” part of Final Exam, focus will be on key vocabulary, concepts and problems covered on Exams 1 & 2 (Lectures 1-26) and skills covered in Lab Exercises 1-8. For example, everyone should know: differences between eukaryotic and prokaryotic cells differences between replication, transcription, translation how to read a genetic code table how to fill in a simple dynamic programming matrix differences between PAM and BLOSUM scoring matrices how to retrieve sequences and structures from online databases how to visualize and manipulate protein structures with PyMol how to predict genes in a given DNA sequence how to predict protein function from sequence Strong hints: transcription/translation, dynamic programming & HMM problems similar to those on Exams 1 & 2 are almost guaranteed to appear!! For the "New Material" part of Final Exam, focus will be on material covered in: Lectures 27 - 38 (not Lecture 39) Lab Exercises 9 – 11 (not Lab 12) Final Exam WILL include question(s) based on BCB 544 Project Presentations Strong hints: You should understand basic principles of: Phylogenetic analysis Machine learning algorithms Microarray and proteomics techniques & analysis You should solve practice problems included in this Study Guide: New Material (Part IB – New Material) - below BCB 444/544 Fall 07 Study Guide Final KEY – Dec 5 p 3 of 13 Practice Problems for Part IA - Comprehensive Section (20 pts) IA-1. Using this hidden Markov model and assuming that you start in state 1 calculate the most probable path for sequence AGAT. The most probable path is: : 1–1–1-1 For complete credit, show your work by completing this probability table: A 1 2 3 G A T 0.25*max{0.25*0.5 0.25*max{0.03125*0.5 0*0.2 0.03*0.2 0*0} 0.005*0} = 0.03125 = 0.00391 0.25*max{0.00391*0.5 0.0018*0.2 0.0025*0} = 0.00049 0 0.4*max{0.25*0.3 0*0.6 0*1} = 0.03 0.1*max{0.03125*0.3 0.03*0.6 0.005*1} = 0.0018 0.1*max{0.00391*0.3 0.0018*0.6 0.0025*1} = 0.00025 0 0.1*max{0.25*0.2 0*0.2 0*0} = 0.005 0.4*max{0.03125*0.2 0.03*0.2 0.005*0} = 0.0025 0.4*max{0.00390*0.2 0.0018*0.2 0.0025*0} = 0.00031 0.25 BCB 444/544 Fall 07 Study Guide Final KEY – Dec 5 p 4 of 13 IA-2.1 Fill out the dynamic programming matrix for determining an optimal global alignment between the sequences TCG and TCCAG. Scoring: +3 for matches; -2 for mismatches and spaces. T C C A G 0 -2 -4 -6 -8 -10 T -2 3 1 -2 -4 -6 C -4 1 6 4 2 0 G -6 -2 4 4 2 5 2.2 What is the score(s) of the optimal alignment(s)? 5 matrix) (Circle in DP 2.3. There are 2 optimal alignments. For full credit, draw both of them below & show your traceback arrows in the DP matrix above. T C C A G T C C A G T - C - G T C - - G +3 -2 +3 -2 +3 = 5 +3 +3 -2 -2 +3 = 5 Practice Problems & Review Questions for Part IB – New Material (40 pts) BCB 444/544 Fall 07 Study Guide Final KEY – Dec 5 p 5 of 13 IB-1. Phylogenetic Analysis Find the parsimony score for the tree below using Fitch’s algorithm. For full credit, show your work. The parsimony score for this tree is 9 Reminder re: how Fitch’s algorithm works: v in a tree has a set X v . If v is a leaf, X v is the nucleotide (or amino acid for protein based trees) observed at v . If v is a node with descendants u and w : o Let Y X u X w o If Y make X v Y o If Y make X v X u X w and count one evolutionary step. Each node Basically, we examine the characters at children of each internal node. If they have some character(s) in common, we write down the common character(s). If there is no overlap, we write down all of the characters and count an evolutionary step (noted with a *). At the end, we count up all of the *’s to get the total number of evolutionary steps. C A G T {A,G}* G C {G,T}* T A {C,T}* T T {A,T}* A G C {C,G}* {A,C,G}* {A,C,G}* {A,C,G,T}* {A,T} {T} {T} {A,C,G,T}* BCB 444/544 Fall 07 Study Guide Final KEY – Dec 5 p 6 of 13 IB-2. Microarray Analysis – Clustering 2.1 Using hierarchical clustering, how would you clusters genes (A,B,C,D)? A 1.00 0.95 0.50 0.25 A B C D B 0.95 1 0.70 0.50 C 0.50 0.70 1 0.65 D 0.25 0.50 0.65 1 You may find it helpful to use the following table to calculate your clusters: Iteration 1 2 3 Clusters = [AB] Object 1 A C [AB] & Object 2 B D [CD] Correlation 0.95 0.65 0.49 New Object [AB] [CD] [ABCD] [CD] 2.2 Draw a simple tree that illustrates this grouping of the genes. A C B D BCB 444/544 Fall 07 Study Guide Final KEY – Dec 5 p 7 of 13 IB-3. Vocabulary & Short Answers Questions What is/are: A perceptron? A single layer neural network that takes a set of (weighted) inputs and maps them to an output value by applying a function; the value of the function is compared with a threshold to determine whether the perceptron “fires” (output=1) or not (output =0). A kernel function? o A function that map inputs into a higher dimensional “feature space” in which the members of different classes are (hopefully) linearly separable. Parsimony? o The parsimony principle means that the simplest explanation is the best. Parsimony is often used in phylogenetics to select a tree that explains the observed data with the smallest number of evolutionary steps. Bootstrapping? o A method for estimating how many positions in a multiple sequence alignment support a particular branch point in a phylogenetic tree. Basically, bootstrapping estimates how much of the input data supports each branch point. The Jukes-Cantor model? o A simple model of evolution in which each position in a sequence has the same probability of mutating, and when a mutation occurs, every possible mutation is equally likely. A new high-throughput “massively parallel” sequencing technique? o 454 or Pyro sequencing The HAPMAP project? o Project to identify and catalog common genetic variants (e.g. SNPs) in humans. HAPMAP stands for “haplotype map;” a haplotype is a set of SNPs that are inherited together. SAGE? o Serial Analysis of Gene Expression - a high throughput method for analyzing RNA expression; provides a “snapshot” of RNAs expressed in a particular sample (similar to that provided by microarrays) The two main microarray platforms? o cDNA arrays - probes are cDNAs spotted onto glass slides o Oligo arrays – probes are short synthetic oligonucleotides synthesized directly on chips BCB 444/544 Fall 07 Study Guide Final KEY – Dec 5 p 8 of 13 Two types of machine learning algorithms used to recognize patterns in microarray data? o Clustering algorithms (e.g., hierarchical, k-means) o Classification algorithms (e.g., k-nearest neighbor, SVM) What are the main differences between: Brendel’s (ISU) GeneSeqer & Burge’s (MIT) GenScan programs for gene prediction? o GeneSeqer uses splice site prediction & spliced alignment; it does not depend on level of sequence homology between query sequence and known genes; GenScan relies on HMMs and works best for sequences that share sequence similarity with known genes Supervised vs unsupervised machine learning algorithms? o Supervised algorithms require the data to have labels, for example binding or non-binding, and learn from these labeled examples. Our lab exercise on machine learning used only supervised algorithms (Naïve Bayes, SVM, decision trees). Unsupervised algorithms work with unlabeled data and attempt to discover correlations in the data without the guidance of labels. Our lab exercise on microarray analysis used examples of unsupervised algorithms (clustering). Distance-based & parsimony-based phylogenetic tree-building algorithms o The main difference is the input used by the programs. Distancebased methods use a distance matrix, which is very quick to compute for any sequences. Parsimony-based methods use a character matrix derived from a multiple sequence alignment. The distance matrix condenses all of the information about differences between two sequences into a single number, whereas the multiple sequence alignment contains information about every specific difference. Regulation of gene expression in prokaryotes vs eukaryotes? o Everything is much more complicated in eukaryotes. In prokaryotes, almost all regulation of gene expression occurs at the level of transcription initiation. In eukaryotes, regulation occurs at multiple levels, many of which occur after transcription. This includes processes such as RNA processing, transport, and stability, & protein processing, post-translational modification, transport, and stability. BCB 444/544 Fall 07 Study Guide Final KEY – Dec 5 p 9 of 13 Hierarchical and k-means clustering? o Hierarchical clustering proceeds by grouping the two most similar objects and repeating until all objects are in a single cluster; thus the final number of clusters is determined by choosing where to “cut” the hierarchy. K-means clustering starts with k randomly chosen points (centroids), computes the distance from each object to these k centroids, and assigns each object to the closest centroid. New centroids are then computed by averaging all points assigned to each centroid. Then the distance from each object to each new centroid is calculated, and each object is reassigned to the closest centroid. This process is repeated until the centroids don’t move very much. The number of clusters is determined by value of k chosen. Promoters & enhancers? o In eukaryotes, both promoters and enhancers are binding sites for transcription factors. Promoters are essential for initiation of transcription and are located relatively close to the transcription start site. Enhancers are needed for regulated transcription and can be very far from the transcription start site. Cis- vs trans- acting regulatory factors/elements? o Cis-acting regulatory elements are signals in DNA (or RNA) sequences and include promoters, enhancers, and silencers. Transacting regulatory factors are “diffusible” and recognize such signals; they usually are proteins, such as transcription factors or chromatin remodeling complexes, but some small regulatory RNAs are also trans-acting. Briefly list & describe the major sequence signals used in gene prediction software. o These include regulatory signals for: Transcription: TF binding sites, promoters, initiation sites, terminators, CpG islands, etc. RNA Processing: Splice donor/acceptor sites, polyA signals Translation: Start (AUG = Met) & stop (UGA,UUA, UAG) codons, ORFs, codon usage What are the two major types of distance-based methods for generating phylogenetic trees – and what are the relative advantages and disadvantages of each? Clustering methods and optimality methods. Clustering methods attempt to build a tree in which the most similar sequences are close together in the tree. Advantage – very fast. Disadvantage - produce a single tree, without BCB 444/544 Fall 07 Study Guide Final KEY – Dec 5 p 10 of 13 consideration of alternative tree topologies. Optimality methods compare all possible trees and select the tree that best fits the distance matrix. Advantage – consider multiple tree topologies. Disadvantage – computationally slow. Why are SNPs important? SNPs are Single Nucleotide Polymorphisms or single-base variations in genomic DNA sequences that are present in at least 1% of the population. They are important because specific SNPs are associated with predisposition for certain diseases or can be indicative an individual’s response to a particular therapy. Which student project presentation did you find most interesting? Why? (Be specific) What is the most important “new” thing you learned in this course? Explain. This Genetic Code table will be provided for Part I of the Final. BCB 444/544 Fall 07 Study Guide Final KEY – Dec 5 p 11 of 13 Study Suggestions for Part II – Lab Practical (40 pts) You will be given an amino acid sequence for a “mystery” protein. You will be required to use servers/programs used in lab (such as BLAST) and to gather and analyze information to characterize this mystery protein. Knowledge of which servers perform which function will also be tested. There will not be sufficient time to perform any analysis of actual microarray data as part of the practical exam. However, you should be prepared to discuss in a paragraph or two some of the important issues in setting up a microarray experiment, and how to normalize and filter your data prior to subsequent analysis (e.g. clustering/machine learning etc.). This may be included as part of the mystery protein portion of the practical, or as a standalone question or three. For the practical, you will not be expected to discuss the clustering or visualization portions of the lab, as there was less guidance about the best ways to do this. You will be expected to know how to retrieve structure files from PDB using keyword search, direct retrieval by PDB id, or by BLAST query (either at PDB, under Advanced Search, or using NCBI blastp, by specifying PDB as the database to query). You will also be expected to use PyMOL to perform very basic manipulations of how such a structure file is displayed. The following list of links will be provided for Part II of the Final. Entrez Web site (http://www.ncbi.nlm.nih.gov/entrez/) BLAST at NCBI (http://www.ncbi.nlm.nih.gov/blast/Blast.cgi) OMIM (Online Mendelian Inheritance in MAN) UniProt (http://www.pir.uniprot.org/) READSEQ (http://thr.cit.nih.gov/molbio/readseq/) SRS server (http://srs6.ebi.ac.uk/ ) Dothelix (www.genebee.msu.su/services/dhm/advanced.html) LALIGN (www.ch.embnet.org/software/LALIGN_form.html) BCB 444/544 Fall 07 Study Guide Final KEY – Dec 5 p 12 of 13 SSEARCH (http://pir.georgetown.edu/pirwww/search/pairwise.html) PRSS (http://fasta.bioch.virginia.edu/fasta/prss.htm) Biology Workbench ClustalW (http://www.ebi.ac.uk/Tools/clustalw/) TCoffee (http://igs-server.cnrs-mrs.fr/Tcoffee/tcoffee_cgi/index.cgi) MultAlign (http://prodes.toulouse.inra.fr/multalin/multalin.html) DCA server (http://bibiserv.techfak.uni-bielefeld.de/dca/submission.html) PRRN server (http://prrn.ims.u-tokyo.ac.jp/) ORF Finder at NCBI (http://www.ncbi.nlm.nih.gov/gorf/gorf.html) Gene Seqer (http://www.plantgdb.org/) GeneMark (http://opal.biology.gatech.edu/GeneMark/genemark24.cgi) PROSITE (http://ca.expasy.org/prosite/) Super Family (http://supfam.org/SUPERFAMILY/hmm.html) JAFA meta server (http://jafa.burnham.org/) PDB (http://pdb.org) CDM PSIPRED Proteus TMPRED Phobius SWISS-MODEL server WHAT IF server 3-D Jury (BioInfoBank) META server mfold Web server BCB 444/544 Fall 07 Study Guide Final KEY – Dec 5 p 13 of 13 RNAfold server RNAalifold UCSC genome browser (http://genome.ucsc.edu/) MEME/MAST (http://meme.sdsc.edu/meme/) TESS Gene Expression Pattern Analysis Suite (GEPAS, at http://gepas.bioinfo.cipf.es/) Expression Profiler (http://www.ebi.ac.uk/microarray/index.html)