BCB 444/544 Lecture 33 Genomics #33_Nov09
BCB 444/544 F07 ISU Dobbs#33 - Genomics 11/09/07 1

Required Reading (before lecture)
√ Mon Nov 5 - Lecture 31 Phylogenetics - Parsimony and ML
  • Chp 11 - pp 142-169
√ Wed Nov 7 - Lecture 32 Machine Learning
Fri Nov 9 - Lecture 33 Functional and Comparative Genomics
  • Chp 17 and Chp 18

Assignments & Announcements
Fri Nov 9 - HW#6 (will be posted this weekend)
HW#6 - More fun with Machine Learning!!
Due: Fri Nov 16 (or sometime before Mon Nov 26)

Seminars this Week
BCB List of URLs for Seminars related to Bioinformatics: http://www.bcb.iastate.edu/seminars/index.html
• Nov 7 Wed - BBMB Seminar 4:10 in 1414 MBB
  • Sharon Roth Dent, MD Anderson Cancer Center
  • Role of chromatin and chromatin modifying proteins in regulating gene expression
• Nov 8 Thurs - BBMB Seminar 4:10 in 1414 MBB
  • Jianzhi George Zhang, U. Michigan
  • Evolution of new functions for proteins
• Nov 9 Fri - BCB Faculty Seminar 2:10 in 102 SciI
  • Amy Andreotti, ISU
  • T cell signaling: insights from protein NMR spectroscopy

Chp 11 - Phylogenetic Tree Construction Methods and Programs
SECTION IV MOLECULAR PHYLOGENETICS
Xiong: Chp 11 Phylogenetic Tree Construction Methods and Programs
• Distance-Based Methods
• Character-Based Methods
• Phylogenetic Tree Evaluation
• Phylogenetic Programs

Machine Learning
• What is learning?
• What is machine learning?
• Learning algorithms
• Machine learning applied to bioinformatics and computational biology
• Some slides adapted from Dr. Vasant Honavar and Dr.
Byron Olson

Examples of Machine Learning Algorithms
• Naïve Bayes (NB)
• Bayes Theorem
• Neural network (NN) or Artificial Neural Net (ANN)
• Perceptrons
• Support Vector Machine (SVM)
• Kernel functions
Lab - WEKA: Decision Trees (DT), NB, SVM

An Application: Predicting RNA Binding Sites in Proteins
• Problem: Given an amino acid sequence, classify each residue as RNA binding or non-RNA binding
• Input to the classifier is a string of amino acid identities
• Output from the classifier is a class label, either binding or not

Bayes Theorem Applied to RNA Binding Site Prediction
$$P(\text{binding} \mid \text{aa seq}) = \frac{P(\text{binding})\, P(\text{aa seq} \mid \text{binding})}{P(\text{aa seq})}$$
$$P(c = 1 \mid X = x) = \frac{P(c = 1)\, P(X = x \mid c = 1)}{P(X = x)}$$
$$P(c = 0 \mid X = x) = \frac{P(c = 0)\, P(X = x \mid c = 0)}{P(X = x)}$$

Naïve Bayes for Binary Classification
Assign c = 1 if
$$P(c = 1 \mid X = x) \ge P(c = 0 \mid X = x)$$
Otherwise, assign c = 0

Example: Is ARG 6 RNA-binding or not?
ARG 6: TSKKKRQRGSR
$$\frac{p(X_1 = T \mid c = 1)\, p(X_2 = S \mid c = 1) \cdots}{p(X_1 = T \mid c = 0)\, p(X_2 = S \mid c = 0) \cdots} \ge \theta$$

Predicted vs Actual RNA Binding for Ribosomal protein L15 (PDB ID 1JJ2:K)
[Figure panels: Predicted, Actual]

Artificial Neural Networks (ANNs or NNs)
• Neural networks classify "input vectors" or "examples" into categories (2 or more)
• They are loosely based on biological neurons
• Some of the most successful methods for predicting secondary structure are based on neural networks:
• Neural networks are trained to recognize amino acid patterns corresponding to known secondary structure elements; these patterns are used to predict secondary structure type for aa sequences in proteins of unknown structure

Biological Neurons "Sum" Input Signals & Generate Output Signal
Dendrites receive inputs, Axon sends output
Image from Christos Stergiou and Dimitrios Siganos http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html

Simple Neuron = "Perceptron"
Perceptron is "Simplest ANN" = feed-forward NN = linear classifier
Image from Christos Stergiou and Dimitrios Siganos http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html

The Perceptron
Input X: $X_1, X_2, \ldots, X_N$
Weights W: $w_1, w_2, \ldots, w_N$
Summation S:
$$S = \sum_{i=1}^{N} X_i W_i$$
Threshold T
Output F:
$$F = \begin{cases} 1 & S > T \\ 0 & S < T \end{cases}$$
The perceptron combines the inputs $X_1 \ldots X_N$ with the weights W, compares the "sum" S with a threshold T, and generates an output class label: either 1 or 0.
If the weights W and threshold T are not known in advance, the perceptron must be trained. Ideally, the perceptron is trained to return the correct answer for all training examples, and to perform well on test examples it has never seen. The training set must contain both classes of data (i.e., with "1" and "0" output).
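The perceptron described above can be sketched in a few lines of Python. This is only an illustration, not code from the lecture: the function names, the AND-gate training data, and the tie convention (S equal to T gives output 0) are assumptions made for the example. Training here uses the classic perceptron learning rule (nudge W and T after each misclassified example), chosen because it works directly with the 1/0 step output; it is a swapped-in technique, not necessarily the procedure used in the course.

```python
def perceptron_output(x, w, threshold):
    """Return 1 if the weighted sum S = sum(X_i * W_i) exceeds the threshold T, else 0."""
    s = sum(xi * wi for xi, wi in zip(x, w))
    return 1 if s > threshold else 0


def train_perceptron(examples, n_inputs, learning_rate=1.0, max_epochs=20):
    """Classic perceptron learning rule (an assumption for this sketch).

    examples: list of (input_tuple, target) pairs with targets 1 or 0.
    Returns the learned weights and threshold.
    """
    w = [0.0] * n_inputs
    threshold = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in examples:
            error = target - perceptron_output(x, w, threshold)
            if error != 0:
                mistakes += 1
                # Move the weights toward (or away from) the misclassified input.
                w = [wi + learning_rate * error * xi for wi, xi in zip(w, x)]
                # Raising the output means lowering the threshold, and vice versa.
                threshold -= learning_rate * error
        if mistakes == 0:  # every training example is classified correctly
            break
    return w, threshold


# Hypothetical training set: logical AND, which contains both classes (1 and 0).
AND_EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, threshold = train_perceptron(AND_EXAMPLES, n_inputs=2)
predictions = [perceptron_output(x, weights, threshold) for x, _ in AND_EXAMPLES]
print(predictions)  # prints [0, 0, 0, 1]
```

AND is linearly separable, so a single perceptron can learn it; a problem such as XOR would not converge with this sketch, since a perceptron is a linear classifier.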

Perceptron "Sums" Inputs by Computing Dot Product S = X · W
• Input is a vector X; the weights are another vector W
• Perceptron Summation S computes the dot product, $S = X \cdot W$
• Perceptron Output F is a function of S: it is often discrete (1 or 0), in which case the function is a step function
• For continuous output, a sigmoidal function is often used:
$$F(X) = \frac{1}{1 + e^{-X}}$$
[Plot: F(X) rises from 0 to 1, with F(0) = 1/2]

Training a Perceptron
Find the weights W that minimize the error function E:
$$E = \sum_{i=1}^{P} \left( F(X_i \cdot W) - t(X_i) \right)^2$$
P: number of training examples
$X_i$: training vectors
$F(X_i \cdot W)$: output of perceptron
$t(X_i)$: target value for $X_i$
Use steepest descent:
- compute gradient:
$$\nabla E = \left( \frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \frac{\partial E}{\partial w_3}, \ldots, \frac{\partial E}{\partial w_N} \right)$$
- update weight vector:
$$W_{new} = W_{old} - \eta \nabla E \qquad (\eta\text{: learning rate})$$
- iterate

Artificial Neural Network (ANN)
Artificial neural network
• Set of perceptrons interconnected such that outputs of some units become inputs of other units
• Many topologies are possible!
• Can have multiple layers
Neural networks are trained in the same way perceptrons are trained, by minimizing an error function:
$$E = \sum_{i=1}^{P} \left( NN(X_i) - t(X_i) \right)^2$$

Support Vector Machines - SVMs
Image from http://en.wikipedia.org/wiki/Support_vector_machine

SVM Finds Maximum-Margin Hyperplane
(i.e., hyperplane that provides maximum separation between two classes of instances in dataset)
Image from http://en.wikipedia.org/wiki/Support_vector_machine

Kernel "Trick"

Kernel Function

Take Home Messages
• Must consider how to set up the learning problem (supervised or unsupervised, generative or discriminative, classification or regression, etc.)
• Lots of algorithms out there
• No algorithm performs best on all problems

Genomics - for excellent overview lectures, see these posted by NHGRI & Pevsner:
1- Genomic sequencing: Mapping and Sequencing, Eric Green, NHGRI (CTGA2005Lecture1.pdf)
2- Human genome project: The Human Genome, Jonathan Pevsner, Kennedy Krieger Institute (2005-10-19_ch17.pdf)
3- SNPs: Studying Genetic Variation II: Computational Techniques, Jim Mullikin, NHGRI (TGA2005Lecture13.pdf)
4- Comparative Genomics: Comparative Sequence Analysis, Elliott Margulies, NHGRI (CTGA2005Lecture8.pdf)

1- Genomic sequencing
Many thanks to: Eric Green, NHGRI for the following slides extracted from his lecture on: Mapping and Sequencing (CTGA2005Lecture1.pdf)

Genomic Sequencing - Brief Review
[Figure: E Green 2005]

Comparison of Sequenced Genome Sizes
[Figure: E Green 2005]

Comparison of Genetic & Physical Maps
[Figure: E Green 2005]

STSs: Provide common markers for "linking" genetic & physical maps
[Figure: E Green 2005]

With complete genomes (now), why bother to generate physical maps?
[Figure: E Green 2005]

Genomic sequencing requires assembly of sequences obtained from cloned DNA
[Figure: E Green 2005]

Human Genome Sequencing
Two approaches:
• Public (government) - International Consortium (6 countries, NIH-funded in US)
  • "Hierarchical" cloning & BAC-by-BAC sequencing
  • Map-based assembly
• Private (industry) - Celera (Craig Venter)
  • Whole genome random "shotgun" sequencing
  • Computational assembly (took advantage of public maps & sequences, too)
Guess which human genome Celera sequenced?

NIH: "Hierarchical" BAC-by-BAC Sequencing
[Figure: E Green 2005]

"Hierarchical" Subcloning Strategy
[Figure: E Green 2005]

Celera: Whole-Genome "Shotgun" Sequencing
[Figure: E Green 2005]

"Shotgun" Sequencing Strategy
[Figure: E Green 2005]

Either Strategy: Sequence "Finishing" = Hardest part !!
[Figure: E Green 2005]

Advances in DNA Sequencing Technology
[Figure: E Green 2005]

Sequencing Method #1: Maxam-Gilbert "Chemical Degradation"
[Figure: E Green 2005]

Sequencing Method #2: Sanger "Di-deoxy Chain Termination"
[Figure: E Green 2005]

Automated Sequencing for Genome Projects: Sanger method - with improvements
Another "recent" improvement: rapid & high resolution separation of fragments in capillaries instead of gels (E Yeung, Ames Lab, ISU)
[Figures: E Green 2005]

Recent technologies? Pyro- & 454 Sequencing

1st Eukaryotic Genome Sequence: S. cerevisiae
[Figure: E Green 2005]

1st Animal Genome Sequence: C. elegans
[Figure: E Green 2005]

Timetable for Human Genome Sequencing: Faster than expected!
[Figures: E Green 2005]

1st Draft Human Genome: "Complete" in 2001
[Figure: E Green 2005]

Public Sequencing - International Consortium
[Figure: E Green 2005]

"Finishing" the Human Genome - continues…
[Figure: E Green 2005]

After "Complete" Human Genome Sequence: What next?
[Figure: E Green 2005]

Interpreting the Human Genome Sequence!
[Figure: E Green 2005]

Comparative Genomics: now with complete genomic sequences
[Figure: E Green 2005]

Comparing Genomes: Functional Elements
[Figure: E Green 2005]

ENCODE Project
[Figure: E Green 2005]

ENCODE - Web Sites
[Figure: E Green 2005]

ENCODE - Results? June 2007
http://www.nature.com/nature/journal/v447/n7146/full/nature05874.html

Eric Green's Genomic Sequencing Challenges (2005 List)
[Figure: E Green 2005]