#32 - Machine Learning 11/07/07
BCB 444/544 F07 ISU Terribilini
BCB 444/544 Fall 07 Dobbs

Required Reading BCB 444/544 (before lecture)
• Fri Oct 30 - Lecture 30: Phylogenetics – Distance-Based Methods (Chp 11, pp 142-169)
• Mon Nov 5 - Lecture 31: Phylogenetics – Parsimony and ML (Chp 11, pp 142-169)
• Wed Nov 7 - Lecture 32: Machine Learning
• Fri Nov 9 - Lecture 33: Functional and Comparative Genomics (Chp 17 and Chp 18)

BCB 544 Only: New Homework Assignment
544 Extra#2 Due:
• √ PART 1 - ASAP
• PART 2 - meeting prior to 5 PM Fri Nov 2
• Part 1 - Brief outline of project; email to Drena & Michael; after response/approval, then:
• Part 2 - More detailed outline of project; read a few papers and summarize the status of the problem; schedule a meeting with Drena & Michael to discuss ideas

Seminars this Week
BCB list of URLs for seminars related to bioinformatics:
http://www.bcb.iastate.edu/seminars/index.html
• Nov 7 Wed - BBMB Seminar, 4:10 in 1414 MBB: Sharon Roth Dent (MD Anderson Cancer Center), role of chromatin and chromatin-modifying proteins in regulating gene expression
• Nov 8 Thurs - BBMB Seminar, 4:10 in 1414 MBB: Jianzhi George Zhang (U. Michigan), evolution of new functions for proteins
• Nov 9 Fri - BCB Faculty Seminar, 2:10 in 102 SciI: Amy Andreotti (ISU), something about NMR

Chp 11 – Phylogenetic Tree Construction Methods and Programs
• Distance-Based Methods
• Character-Based Methods
• Phylogenetic Tree Evaluation
• Phylogenetic Programs

Phylogenetic Tree Evaluation
• Bootstrapping
• Jackknifing
• Bayesian Simulation
• Statistical difference tests (are two trees significantly different?)
  - Kishino-Hasegawa Test (paired t-test)
  - Shimodaira-Hasegawa Test (χ² test)
(Xiong, Section IV: Molecular Phylogenetics; Chp 11: Phylogenetic Tree Construction Methods and Programs)

Bootstrapping
• A bootstrap sample is obtained by sampling sites randomly with replacement
• This yields a data matrix with the same number of taxa and characters as the original one
• Construct trees for the bootstrap samples
• For each branch in the original tree, compute the fraction of bootstrap samples in which that branch appears
• This assigns a bootstrap support value to each branch
• Idea: if a grouping has a lot of support, it will be supported by at least some positions in most of the bootstrap samples

Bootstrapping Comments
• Bootstrapping doesn't really assess the accuracy of a tree; it only indicates the consistency of the data
• To get reliable statistics, bootstrapping needs to be done on your tree 500-1000 times; this is a big problem if your tree took a few days to construct

Phylogenetic Programs
• Huge list at: http://evolution.genetics.washington.edu/phylip/software.html
• PAUP* – one of the most popular programs; commercial, Mac and Unix only, nice user interface
• PHYLIP – free, multiplatform; a bit difficult to use, but web servers make it easier
• WebPhylip – another online interface for PHYLIP

Bayesian Simulation
• Using a Bayesian ML method to produce a tree automatically calculates the probability of many trees during the search
• Most trees sampled in the Bayesian ML search are near an optimal tree
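As a sketch of the bootstrap procedure above, the resampling-and-support loop can be written in a few lines of Python. Everything here is a toy: the alignment is made up, and `closest_pair` is a hypothetical stand-in for a real tree-building step (a real analysis would rebuild the full tree, e.g. with PHYLIP, for every replicate).

```python
import random

def bootstrap_sample(alignment, rng):
    """Resample alignment columns (sites) with replacement.
    alignment: dict taxon -> sequence string, all the same length."""
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[i] for i in cols) for taxon, seq in alignment.items()}

def closest_pair(alignment):
    """Toy stand-in for tree building: return the two taxa with the
    smallest Hamming distance (the grouping whose support we score)."""
    taxa = sorted(alignment)
    def dist(a, b):
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))
    return min((frozenset((a, b)) for i, a in enumerate(taxa) for b in taxa[i + 1:]),
               key=lambda pair: dist(*sorted(pair)))

def bootstrap_support(alignment, n_replicates=200, seed=0):
    """Fraction of bootstrap replicates in which the original grouping reappears."""
    rng = random.Random(seed)
    original = closest_pair(alignment)
    hits = sum(closest_pair(bootstrap_sample(alignment, rng)) == original
               for _ in range(n_replicates))
    return hits / n_replicates

# Four made-up taxa, nine sites each
alignment = {"A": "AAAAATTTT", "B": "AAAAATTAT", "C": "CCCCCGGGG", "D": "CCCCCGGGA"}
support = bootstrap_support(alignment)
```

The same skeleton applies to a real pipeline: only `closest_pair` would be replaced by an actual tree-construction program, which is exactly why 500-1000 replicates become expensive.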
Jackknifing
• Another resampling technique
• Randomly delete half of the sites in the dataset
• Construct a new tree from this smaller dataset, and see how often taxa are grouped together
• Advantage – sites aren't duplicated
• Disadvantage – again, this really only measures the consistency of the data

Phylogenetic Programs (continued)
• TREE-PUZZLE – uses a heuristic to allow ML on large datasets; also available as a web server
• PHYML – web based; uses a genetic algorithm
• MrBayes – Bayesian program; fast, can handle large datasets; multiplatform download
• BAMBE – web-based Bayesian program

Final Comments on Phylogenetics
• No method is perfect
• Different methods make very different assumptions
• If multiple methods using different assumptions come up with similar results, we should trust those results more than the output of any single method

Machine Learning
• What is learning?
• What is machine learning?
• Learning algorithms
• Machine learning applied to bioinformatics and computational biology
• Some slides adapted from Dr. Vasant Honavar and Dr. Byron Olson
Types of Learning
• Rote learning – useful when it is less expensive to store and retrieve some information than to compute it
• Learning from instruction – transform instructions into useful knowledge
• Learning from examples – extract predictive or descriptive regularities from data
• Learning from deduction – generalize instances of deductive problem solving
• Learning from exploration – learn to choose actions that maximize reward

What is Machine Learning?
• Machine learning is an area of artificial intelligence concerned with the development of techniques that allow computers to "learn"
• Machine learning is a method for creating computer programs by the analysis of data sets
• We understand a phenomenon when we can write a computer program that models it at the desired level of detail

Contributing Disciplines
• Computer Science – artificial intelligence, algorithms and complexity, databases, data mining
• Statistics – statistical inference, experimental design, exploratory data analysis
• Mathematics – abstract algebra, logic, information theory, probability theory
• Psychology and neuroscience – behavior, perception, learning, memory, problem solving
• Philosophy – ontology, epistemology, philosophy of mind, philosophy of science
What is Learning?
• Learning is a process by which the learner improves performance on a task, or a set of tasks, as a result of experience within some environment

Machine Learning Applications
• Bioinformatics and Computational Biology
• Environmental Informatics
• Medical Informatics
• Cognitive Science
• E-Commerce
• Human Computer Interaction
• Robotics
• Engineering

Machine Learning Algorithms
• There are many types of algorithms, differing in the structure of the learning problem as well as in the approach to learning used:
  - Regression vs. Classification
  - Supervised vs. Unsupervised
  - Generative vs. Discriminative
  - Linear vs. Non-Linear

Generative vs. Discriminative
• A philosophical difference
• Generative models attempt to recreate or understand the process that generated the data
• Discriminative models simply attempt to separate, or determine the class of, input data without regard to the generating process

Supervised vs. Unsupervised
• A difference in the data
• Supervised learning uses input/output pairs (labeled pairs, often denoted {X, Y}) to learn an input-output mapping
• Unsupervised learning examines input data to find patterns (clustering)

Regression vs. Classification
• A structural difference
• Regression algorithms attempt to map inputs to continuous outputs (integers, real numbers, etc.)
• Classification algorithms attempt to map inputs into one of a set of classes (colors, cellular locations, good and bad credit risks, etc.)

Linear vs. Non-linear
• A modeling difference
• Linear models involve only linear combinations of the input variables
• Non-linear models are not restricted in their form (they commonly include exponential or quadratic terms)

Summary of Machine Learning Algorithms
• This is only the tip of the iceberg
• No single algorithm works best for every application
• Some simple algorithms are effective on many data sets
• Better results can often be obtained by preprocessing the data to suit the algorithm, or by adapting the algorithm to the characteristics of the data

Trade-off Between Specificity and Sensitivity
• The classification threshold controls a trade-off between specificity and sensitivity
• High specificity – predict fewer instances, with higher confidence
• High sensitivity – predict more instances, with lower confidence
• The trade-off is commonly shown as a Receiver Operating Characteristic (ROC) curve

Measuring Performance
• Using any single measure of performance is problematic
• Accuracy can be misleading – when 95% of examples are negative, we can achieve 95% accuracy by predicting all negative.
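These points are easy to see numerically. The sketch below (toy scores and labels, not real classifier output) computes accuracy, sensitivity, and specificity, and shows both the imbalanced-data pitfall and how moving the threshold exchanges sensitivity for specificity; each threshold yields one point on the ROC curve.

```python
def rates(predictions, labels):
    """Accuracy, sensitivity (true positive rate), specificity (true negative rate)."""
    tp = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(predictions, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(predictions, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(predictions, labels))
    accuracy = (tp + tn) / len(labels)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return accuracy, sensitivity, specificity

# 95% negative data: predicting "all negative" looks 95% accurate
# but finds none of the positives (sensitivity = 0).
labels = [1] * 5 + [0] * 95
acc, sens, spec = rates([0] * 100, labels)

# Moving the decision threshold on a scored classifier trades
# sensitivity against specificity.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2]   # toy confidence scores
truth  = [1,   1,   0,   1,   0,   0]
lenient = rates([1 if s >= 0.25 else 0 for s in scores], truth)  # high sensitivity
strict  = rates([1 if s >= 0.75 else 0 for s in scores], truth)  # high specificity
```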
We are 95% accurate, but 100% wrong on the positive examples

Machine Learning in Bioinformatics
• Gene finding
• Binding site identification
• Protein structure prediction
• Protein function prediction
• Genetic network inference
• Cancer diagnosis
• etc.

Some Examples of Algorithms
• Naïve Bayes
• Neural network
• Support Vector Machine

Sample Learning Scenario – Protein Function Prediction
Predicting RNA binding sites in proteins:
• Problem: given an amino acid sequence, classify each residue as RNA binding or non-RNA binding
• The input to the classifier is a string of amino acid identities
• The output from the classifier is a class label, either binding or not

Bayes Theorem
P(A|B) = P(A) P(B|A) / P(B)
• P(A) = prior probability
• P(A|B) = posterior probability

Bayes Theorem Applied to Classification
P(binding | aa seq) = P(binding) P(aa seq | binding) / P(aa seq)

Naïve Bayes Algorithm
P(c=1 | X=x) = P(c=1) P(X=x | c=1) / P(X=x)
P(c=0 | X=x) = P(c=0) P(X=x | c=0) / P(X=x)

P(c=1 | X=x) / P(c=0 | X=x)
  = [P(X1=x1, X2=x2, ..., Xn=xn | c=1) × P(c=1)] / [P(X1=x1, X2=x2, ..., Xn=xn | c=0) × P(c=0)]
  = [P(c=1) ∏(i=1..n) P(Xi=xi | c=1)] / [P(c=0) ∏(i=1..n) P(Xi=xi | c=0)]

Assign c = 1 if P(c=1 | X=x) / P(c=0 | X=x) ≥ θ

Example: ARG 6 in the window TSKKKRQRGSR
• Compute [p(X1=T | c=1) × p(X2=S | c=1) × …] / [p(X1=T | c=0) × p(X2=S | c=0) × …] and assign "binding" if this ratio is ≥ θ
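A minimal sketch of such a classifier in Python, assuming fixed-length sequence windows. The training windows, the smoothing constant `alpha`, and all names here are made-up illustrations, not the actual RNA-binding data or model from the lecture.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def train_naive_bayes(windows, labels, alpha=1.0):
    """Estimate P(c) and per-position P(X_i = aa | c), with Laplace smoothing.
    windows: fixed-length amino-acid strings; labels: 1 = binding, 0 = not."""
    width = len(windows[0])
    prior = {c: (labels.count(c) + alpha) / (len(labels) + 2 * alpha) for c in (0, 1)}
    counts = {c: [Counter() for _ in range(width)] for c in (0, 1)}
    for w, c in zip(windows, labels):
        for i, aa in enumerate(w):
            counts[c][i][aa] += 1
    def cond(c, i, aa):
        total = sum(counts[c][i].values())
        return (counts[c][i][aa] + alpha) / (total + alpha * len(AMINO_ACIDS))
    return prior, cond

def posterior_ratio(window, prior, cond):
    """P(c=1|x) / P(c=0|x) under the naive per-position independence assumption."""
    num, den = prior[1], prior[0]
    for i, aa in enumerate(window):
        num *= cond(1, i, aa)
        den *= cond(0, i, aa)
    return num / den

# Hypothetical toy training windows: basic residues as "binding", acidic as "not"
windows = ["KKR", "KRK", "RKK", "ADE", "DEA", "EAD"]
labels  = [1, 1, 1, 0, 0, 0]
prior, cond = train_naive_bayes(windows, labels)
assign_binding = posterior_ratio("KRR", prior, cond) >= 1.0  # decision rule with theta = 1
```

Raising the threshold θ above 1 makes the classifier more specific; lowering it makes it more sensitive, which is the same trade-off discussed for performance measurement.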
Predictions for Ribosomal Protein L15 (PDB ID 1JJ2:K)
(figure: actual vs. predicted binding residues)

Neural Networks
• The most successful methods for predicting secondary structure are based on neural networks. The overall idea is that neural networks can be trained to recognize amino acid patterns in known secondary structure units, and to use these patterns to distinguish between the different types of secondary structure.
• Neural networks classify "input vectors" or "examples" into categories (2 or more).
• They are loosely based on biological neurons.

Biological Neurons
• Dendrites receive inputs; the axon gives output
(Image from Christos Stergiou and Dimitrios Siganos, http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html)

Artificial Neuron – "Perceptron"
(Image from Christos Stergiou and Dimitrios Siganos, http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html)

The Perceptron
• The input is a vector X (components X1, X2, ..., XN with weights w1, w2, ..., wN), and the weights can be stored in another vector W
• The perceptron computes the dot product S = X·W = Σ(i=1..N) Xi wi
• The output F is a function of S: it is often discrete (i.e., 1 or 0), in which case the function is the step function:
  F = 1 if S > T; F = 0 if S < T (T is the threshold of the unit)
• For continuous output, often use a sigmoid
• The perceptron classifies the input vector X into two categories. If the weights and threshold T are not known in advance, the perceptron must be trained.
• Ideally, the perceptron is trained to return the correct answer on all training examples, and to perform well on examples it has never seen
• The sigmoid used for continuous output:
  F(X) = 1 / (1 + e^(-X))
• The training set must contain both types of data (i.e., examples with "1" output and with "0" output)

Training a Perceptron
• Find the weights W that minimize the error function:
  E = Σ(i=1..P) (F(Xi·W) - t(Xi))²
  where P: number of training data; Xi: training vectors; F(W·Xi): output of the perceptron; t(Xi): target value for Xi
• Use steepest descent:
  - compute the gradient: ∇E = (∂E/∂w1, ∂E/∂w2, ∂E/∂w3, ..., ∂E/∂wN)
  - update the weight vector: Wnew = Wold - ε∇E (ε: learning rate)
  - iterate

Biological Neural Network
(Image from http://en.wikipedia.org/wiki/Biological_neural_network)

Artificial Neural Network
• A complete neural network is a set of perceptrons interconnected such that the outputs of some units become the inputs of other units. Many topologies are possible!
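The steepest-descent recipe above can be sketched directly. The AND-gate data, learning rate, and epoch count below are arbitrary choices for the demo; a bias input fixed at 1.0 stands in for the threshold T, so the threshold is learned as just another weight.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def train_perceptron(X, t, epochs=5000, eps=0.5):
    """Minimize E = sum_i (F(W . x_i) - t_i)^2 by steepest descent."""
    X = [x + [1.0] for x in X]          # append the bias input
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, target in zip(X, t):
            f = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            delta = 2.0 * (f - target) * f * (1.0 - f)   # chain rule through the sigmoid
            for j, xj in enumerate(x):
                grad[j] += delta * xj                    # accumulate dE/dw_j
        w = [wi - eps * g for wi, g in zip(w, grad)]     # W_new = W_old - eps * grad E
    return w

def predict(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x + [1.0])))

# Toy training set (logical AND): contains both "1" and "0" outputs, as required
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
t = [0.0, 0.0, 0.0, 1.0]
w = train_perceptron(X, t)
```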
• Neural networks are trained just like the perceptron, by minimizing an error function:
  E = Σ(i=1..Ndata) (NN(Xi) - t(Xi))²

Support Vector Machines (SVMs)
• SVM finds the maximum margin hyperplane
(Image from http://en.wikipedia.org/wiki/Support_vector_machine)

Kernel Function
(figure; image from http://en.wikipedia.org/wiki/Support_vector_machine)

Take Home Messages
• You must consider how to set up the learning problem (supervised or unsupervised, generative or discriminative, classification or regression, etc.)
• There are lots of algorithms out there
• No algorithm performs best on all problems
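One concrete way to see what a kernel function does: a kernel computes, in the original input space, the dot product that an explicit feature map would produce in a higher-dimensional space, which is where the SVM then finds its maximum-margin hyperplane. The degree-2 polynomial kernel below is a standard textbook example; the choice of kernel and the test points are purely illustrative.

```python
import math

def poly2_kernel(x, y):
    """k(x, y) = (x . y)^2: a dot product in a quadratic feature space,
    computed without ever building that space explicitly."""
    return sum(a * b for a, b in zip(x, y)) ** 2

def phi(x):
    """The explicit quadratic feature map for 2-D inputs that poly2_kernel implies."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = (1.0, 2.0), (3.0, 0.5)
lhs = poly2_kernel(x, y)       # cheap: computed in the 2-D input space
rhs = dot(phi(x), phi(y))      # same value, computed in the 3-D feature space
```

Because only these dot products enter the SVM optimization, the "kernel trick" lets the maximum-margin hyperplane live in the feature space while all arithmetic stays in the input space.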