BCB 444/544 Fall 07 (ISU, Dobbs) - Lecture 38: Proteomics & Protein Interactions (11/28/07)
Thanks to Doina Caragea, KSU

Required Reading (before lecture)
• Mon Nov 26 - Lecture 37: Clustering & Classification Algorithms (Chp 18: Functional Genomics)
• Wed Nov 28 - Lecture 38: Proteomics & Protein Interactions (Chp 19: Proteomics)
• Thurs Nov 30 - Lab 12: R Statistical Computing & Graphics (Garrett Dancik) http://www.r-project.org/
• Fri Dec 1 - Lecture 39 (Last Lecture!): Systems Biology (& a bit of Metabolomics & Synthetic Biology)

Assignments & Announcements
• Mon Nov 26 - HW#6 Due (5 PM Mon Nov 26, or ASAP)
• Mon Dec 3 - BCB 544 Project Reports Due (NO CLASS that day!!)
• ALL BCB 444 & 544 students are REQUIRED to attend ALL project presentations next week!!!
  Tentative schedule:
  - Wed Dec 5: #1: Xiong & Devin (~20'); #2: Tonia (10-15')
  - Fri Dec 7: #3: Kendra & Drew (~20'); #4: Addie (10-15')
• Thurs Dec 6 - Optional Review Session for Final Exam
• Mon Dec 10 - BCB 444/544 Final Exam (9:45 - 11:45 AM). Will include:
  - 40 pts In Class: New material (since Exam 2)
  - 20 pts In Class: Comprehensive
  - 40 pts In Lab: Practical (Comprehensive)

Seminars this Week
BCB list of URLs for seminars related to Bioinformatics: http://www.bcb.iastate.edu/seminars/index.html
• Nov 29 Thurs - Baker Center Seminar, 2:10, Howe Hall Auditorium: Greg Voth, Univ. of Utah, "Multiscale Challenge for Biomolecular Systems: A Systematic Approach"
• Nov 29 Thurs - BBMB Seminar, 4:10, 1414 MBB: Sue Gibson, Univ. of Minnesota, "How do soluble sugar levels help regulate plant development, carbon partitioning and gene expression?"
• Nov 30 Fri - BCB Faculty Seminar, 2:10, 102 ScI: Shashi Gadia, ComS, ISU, "Harnessing the Potential of XML"
• Nov 30 Fri - GDCB Seminar, 4:10, 1414 MBB: John Abrams, Univ. Texas Southwestern Medical Center, "Dying Like Flies: Programmed & Unprogrammed Cell Death"

SECTION V: GENOMICS & PROTEOMICS
Chp 18 - Functional Genomics (Review: Microarrays)
Xiong, Chp 18: Functional Genomics
• Sequence-based Approaches
• Microarray-based Approaches
• Comparison of SAGE & DNA Microarrays

Gene Expression Analysis

Pattern Recognition in Microarray Analysis
• Clustering (unsupervised learning)
  - Uses primary data to group measurements, with no information from other sources
• Classification (supervised learning)
  - Uses known groups of interest (from other sources) to learn features associated with these groups in primary data and to create rules for associating data with those groups

Microarray Analysis - Questions & Answers
• How do hierarchical clustering algorithms work?
• How do we measure the distance between two clusters? (similarity criteria)
  - Single link
  - Complete link
  - Average link
• What are "good clusters"?
  - Clusters with a big difference between INTRA-cluster distance and INTER-cluster distance, i.e., INTRA-cluster distance is minimized while INTER-cluster distance is maximized
• What are the pros & cons of:
  - Hierarchical vs K-means clustering?
  - Clustering vs Classification?

Clustering Metrics
• A key issue in clustering is deciding which similarity/distance metric to use
• Often, the metric has a bigger effect on the results than the actual clustering algorithm used!
• When choosing the metric, we should take into account our assumptions about the data and the goal of the clustering

How to Determine Distances?
• Intra-cluster distance
  - Min/Max/Avg of the distances between all pairs of points in the cluster, OR between the centroid and all points in the cluster
• Inter-cluster distance
  - Single link: distance between the two most similar (closest) members
  - Complete link: distance between the two least similar (most distant) members
  - Average link: average distance over all pairs
  - Centroid distance
• What is the centroid? The "average" of all points of X: the centroid of a finite set of points can be computed as the arithmetic mean of each coordinate of the points (Wikipedia).
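The linkage criteria above can be made concrete in a few lines of code. Below is a minimal sketch (not from the original slides) of single-link, complete-link, average-link, and centroid distances between two clusters, assuming Euclidean distance over numeric profiles; the function names and toy data are illustrative only.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length numeric profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(cluster):
    """Arithmetic mean of each coordinate over all points in the cluster."""
    n = len(cluster)
    return [sum(point[i] for point in cluster) / n for i in range(len(cluster[0]))]

def inter_cluster_distance(c1, c2, link="single"):
    """Distance between clusters c1 and c2 under a chosen linkage criterion."""
    pair_dists = [euclidean(p, q) for p in c1 for q in c2]
    if link == "single":      # two most similar (closest) members
        return min(pair_dists)
    if link == "complete":    # two least similar (most distant) members
        return max(pair_dists)
    if link == "average":     # average over all pairs
        return sum(pair_dists) / len(pair_dists)
    if link == "centroid":    # distance between the two cluster centroids
        return euclidean(centroid(c1), centroid(c2))
    raise ValueError("unknown linkage: " + link)

# Toy example: two small clusters of 2-D "expression profiles"
c1 = [[1.0, 2.0], [1.5, 1.8]]
c2 = [[5.0, 8.0], [6.0, 9.0]]
for link in ("single", "complete", "average", "centroid"):
    print(link, round(inter_cluster_distance(c1, c2, link), 3))
```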
INTRA- vs INTER-Cluster Distances
[Figure: small intra-cluster distances with large inter-cluster distances are labeled "Good!"; the reverse is labeled "Bad!"]

Methods for Clustering (Unsupervised Learning)
• Hierarchical Clustering
• K-Means
• Self-Organizing Maps (in lab; won't discuss in lecture)
• …many others…

Hierarchical Clustering*
*This method was illustrated in Lecture 36, Tables 6.1-6.4
• Probably the most popular clustering algorithm for microarray analysis
• First presented in this context by Eisen et al. in 1998
• Nodes = genes or groups of genes
• Agglomerative (bottom-up):
  0. Initially, each item is its own cluster
  1. Compute the distance matrix
  2. Find the two closest nodes (most similar clusters)
  3. Merge them
  4. Compute distances from the merged node to all others
  5. Repeat until all nodes are merged into a single node
[Figure: agglomerative clustering of expression profiles into a dendrogram. Copyright: Russ Altman]

Hierarchical Clustering: Strengths & Weaknesses
• Computationally attractive!
• Bottom-up (agglomerative) is the most commonly used method
  - Can also perform top-down (divisive) clustering, which requires splitting a large group successively
• Easy to understand & implement
• Can decide how big to make clusters by choosing the cut level of the hierarchy
• Can be sensitive to bad data
• Can have problems interpreting the tree
• Can have local minima

K-Means Clustering (Model-based)
Steps in K-means clustering:
1. Choose random points (cluster centers, or centroids) in k dimensions
2. Compute the distance from each data point to the centroids
3. Assign each data point to its closest centroid
4. Compute each new cluster centroid as the average of the points assigned to that cluster
5. Loop to (2); stop when the cluster centroids do not move very much

K-Means Clustering for k=2: a more realistic example
[Figure: data points with two features, f1 (x-coordinate) & f2 (y-coordinate); initial centroids A and B move to 2nd centroids A and B as points are re-assigned]

K-Means Clustering Example, for k=2 (from S. Mooney)
For simplicity, assume k=2 and the objects are 1-dimensional (numerical difference is used as the distance):
0. Objects: 1, 2, 5, 6, 7
1. Randomly select 5 and 6 as centers (centroids)
2. Calculate distances from points to centroids & assign points to clusters: {1,2,5} & {6,7}
3. Compute new cluster centroids: C1 = 8/3 = 2.7, C2 = 13/2 = 6.5
4. Calculate distances from points to the new centroids & re-assign points to clusters: {1,2} & {5,6,7}
5. Compute new cluster centroids: C1 = 1.5, C2 = 6.0
6. No change? Converged! => Final clusters = {1,2} & {5,6,7}
[Figure: pick seeds, assign clusters, compute centroids (2.7, 6.5), re-assign clusters, compute centroids (1.5, 6.0), converged]
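To make the worked example concrete, here is a minimal 1-D K-means sketch (not part of the original slides) that reproduces the steps above; the function name, convergence tolerance, and iteration cap are illustrative assumptions.

```python
def kmeans_1d(points, centroids, max_iter=100, tol=1e-6):
    """Simple 1-D K-means: assign points to the nearest centroid, then
    recompute each centroid as the mean of its assigned points."""
    for _ in range(max_iter):
        # Assignment step: each point goes to the closest centroid
        clusters = [[] for _ in centroids]
        for x in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # Update step: recompute centroids (keep old value if a cluster is empty)
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if all(abs(a - b) < tol for a, b in zip(centroids, new_centroids)):
            break  # centroids did not move: converged
        centroids = new_centroids
    return clusters, centroids

# The example from the slide: objects 1, 2, 5, 6, 7 with seeds 5 and 6
clusters, centroids = kmeans_1d([1, 2, 5, 6, 7], [5.0, 6.0])
print(clusters)   # [[1, 2], [5, 6, 7]]
print(centroids)  # [1.5, 6.0]
```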
K-Means Clustering: Strengths & Weaknesses
• Fast: roughly linear in the number of data points, O(N) per iteration
• Hard to know which K to choose
  - Try several values of K and assess cluster quality
  - Helpful to have additional information to aid evaluation of clusters
• Hard to know where to seed the clusters
• Results can change drastically with different initial choices of centroids, as shown in the example below

Example Illustrating Sensitivity to Seeds
[Figure: six data points labeled A-F]
• If we start with B and E as centroids, K-means converges to {A,B,C} and {D,E,F}
• If we start with D and F, it converges to {A,B,D,E} and {C,F}

Hierarchical Clustering vs K-Means
                 Hierarchical Clustering                 K-Means
Running time     Slower                                  Faster
Assumptions      Requires a distance metric              Requires a distance metric
Parameters       None                                    K (number of clusters)
Clusters         Subjective (only a tree is returned)    Exactly K clusters

Clustering vs Classification
• Clustering (unsupervised learning)
  - Uses primary data to group measurements, with no information from other sources
• Classification (supervised learning)
  - Uses known groups of interest (from other sources) to learn features associated with these groups in primary data and to create rules for associating data with those groups

Classification: Supervised Learning Task
• Given: a set of microarray experiments, each done with mRNA from a different patient (but from the same cell type for every patient)
  - The patient's expression values for each gene constitute the features, and the patient's disease constitutes the class
• Do: Learn a model that accurately predicts the class based on the features
• Outcome: Predict the class value of a patient based on the expression levels of his/her genes

Methods for Classification
• K-nearest neighbors (KNN)
• Linear models
• Logistic regression
• Naive Bayes
• Decision trees
• Support vector machines

K-Nearest Neighbor (KNN)
• Idea: Use the k closest labeled neighbors to label new data points (e.g., for k = 4)

Basic KNN Algorithm
INPUT:
• Set of data with labels (training data)
• K
• Set of data needing labels
• Distance metric
1. For each unlabeled data point, compute the distance to all labeled data
2. Sort the distances and determine the closest K neighbors (smallest distances)
3. Use majority voting among those K neighbors to predict the label of the unlabeled data point
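The basic algorithm above translates almost directly into code. Below is a minimal KNN sketch (not from the slides), assuming Euclidean distance over numeric feature vectors; the function name knn_classify and the toy training data are illustrative.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Euclidean distance between two numeric feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(training_data, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest
    labeled neighbors. `training_data` is a list of (features, label) pairs."""
    # 1. Compute distance from the query to every labeled point
    distances = [(euclidean(features, query), label) for features, label in training_data]
    # 2. Sort and keep the K closest neighbors
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # 3. Majority vote over the neighbors' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy example: two-gene "expression profiles" labeled by disease status
training = [([0.2, 0.1], "healthy"), ([0.3, 0.2], "healthy"),
            ([0.9, 0.8], "disease"), ([1.0, 0.7], "disease")]
print(knn_classify(training, [0.85, 0.75], k=3))  # -> "disease"
```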
Variations on KNN
• Can classify into multiple classes easily
• Weighted KNN: can weight the votes of nearby training samples based on their distance from the unknown sample
• Can set a threshold, p, for the number of votes needed to win (if there is no winner, then either return a NULL result or set a default winner)

Compare in Graphical Representation
[Figure: the same data points grouped by Clustering vs separated by Classification; classification applies external labels: RED group & BLUE group]

Tradeoffs for Clustering vs Classification
• Clustering is not biased by previous knowledge, but therefore needs a stronger signal to discover clusters
• Classification uses previous knowledge, so it can detect a weaker signal, but it may be biased by WRONG previous knowledge

SECTION V: GENOMICS & PROTEOMICS
Chp 19 - Proteomics
Xiong, Chp 19: Proteomics
• Technology of Protein Expression Analysis
• Post-translational Modification
• Protein Sorting
• Protein-Protein Interactions

Proteomics: What do all those proteins do??
[Figure: biological processes for yeast proteins. Copyright © 2006 A. Malcolm Campbell]

ISU Proteomics Resources & Researchers
• Facilities:
  - Proteomics Facility (Carver Co-lab) http://www.plantgenomics.iastate.edu/proteomics/
  - Protein Facility (MBB) http://www.protein.iastate.edu/
• Experiments:
  - Plant: Rodermel, Wise, Voytas
  - Animal: Greenlee, perhaps others soon?
• Computational Analysis: Honavar, Wise, Dobbs

Proteome Analysis: "Traditionally" using Two-dimensional (2D) gels
• 1st D: Isoelectric focusing (IEF) in a pH gradient: proteins migrate to their isoelectric points & stop moving
• 2nd D: SDS-PAGE (SDS detergent, polyacrylamide gel electrophoresis): proteins migrate according to molecular weight
(Copyright © 2006 A. Malcolm Campbell)

Proteins identified on 2D gels (IEF/SDS-PAGE)
• Direct protein microsequencing by Edman degradation
  - Done at facilities (here at ISU)
  - Typically need ~5 picomoles
  - Often get 10 to 20 amino acids of sequence
• Protein mass analysis by MALDI-TOF
  - Matrix-Assisted Laser Desorption/Ionization Time-Of-Flight mass spectrometry
  - Done at facilities (here at ISU)
  - Can often detect post-translational modifications (such as phosphorylated Ser, Thr, Tyr)

Evaluation of 2D gels (IEF/SDS-PAGE)
• Advantages:
  - Visualize hundreds to thousands of proteins
  - Improved identification of protein spots
• Disadvantages:
  - Limited number of samples can be processed
  - Mostly abundant proteins are visualized
  - Technically difficult
(Jonathan Pevsner, pp. 250-251)

Tandem Mass Spectrometry (MS/MS) to Identify Proteins
Figure 8.19: Tandem mass spectrometry for protein identification
a) ESI creates ionized proteins, represented by colored shapes with positive charges. Each shape represents many copies of identical proteins.
b) Ionized proteins are separated based on their mass-to-charge ratio (m/z) and sent one at a time into the activation chamber. Separation and selection take place in the first of the two MS devices. The solid purple protein has been selected for analysis; the other three are temporarily stored for later analysis.
c) The group of m/z-selected ionized proteins enters a collision cell filled with inert argon gas. Gas molecules collide with the proteins, which causes them to break into two peptide pieces (labeled b and y).
d) The ionized peptide pieces are sent into the second MS device, which again measures the m/z ratio. A computer compares the spectrum of peptide pieces to a database of ideal spectra to identify the original group of identical proteins.
(Pevsner, p. 251; Copyright © 2006 A. Malcolm Campbell)
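For a flavor of what the "ideal spectra" used in that comparison look like, here is a small sketch (not from the slides) that computes approximate singly charged b- and y-ion m/z values for a short peptide. The monoisotopic residue masses are standard values, but the peptide, the function name, and the restriction to a single charge state are illustrative assumptions.

```python
# Approximate monoisotopic residue masses (Da) for a few amino acids
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
    "E": 129.04259, "F": 147.06841, "R": 156.10111,
}
PROTON = 1.00728   # mass of a proton (Da)
WATER = 18.01056   # mass of H2O (Da)

def by_ions(peptide):
    """Singly charged b- and y-ion m/z values for a peptide:
    b_i = sum of the first i residue masses + proton
    y_i = sum of the last i residue masses + water + proton"""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b = [sum(masses[:i]) + PROTON for i in range(1, len(peptide))]
    y = [sum(masses[-i:]) + WATER + PROTON for i in range(1, len(peptide))]
    return b, y

# Hypothetical tryptic peptide
b_ions, y_ions = by_ions("PEPTLK")
print([round(m, 3) for m in b_ions])
print([round(m, 3) for m in y_ions])
```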
MS data: Protein identification through peptide fragment identification & separation
Figure 8.20: When a group of identical proteins is broken into peptide pieces, more than one pair of b and y peptides will be formed.
a) One protein sequence and its calculated mass on top, with the b peptides/masses (gray) and the y peptides/masses (purple) below.
b) An experimentally determined mass/charge spectrum from the peptide in panel a). Some peaks are higher than others, which means that some b/y peptide pieces were more abundant than others. The spectrum is used to determine each peptide's amino acid sequence and protein identity.
(Copyright © 2006 A. Malcolm Campbell)

Databases of 2D Gel Information
http://ca.expasy.org/ch2d/2d-index.html
(Jonathan Pevsner)
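As a closing illustration of the database-comparison step described for Figures 8.19 and 8.20, here is a toy sketch (not from the slides) that scores candidate peptides against an observed peak list by counting how many theoretical fragment masses fall within a mass tolerance of an observed peak. The peak values, tolerance, and scoring scheme are invented for illustration; the candidate fragment lists are the (rounded) b-ion values produced by the sketch above, and this is only a crude stand-in for real search-engine scoring.

```python
def match_score(theoretical_mz, observed_mz, tolerance=0.5):
    """Count how many theoretical fragment m/z values are matched by an
    observed peak within +/- tolerance Da."""
    matched = 0
    for t in theoretical_mz:
        if any(abs(t - o) <= tolerance for o in observed_mz):
            matched += 1
    return matched

# Hypothetical candidate peptides with their theoretical b-ion m/z values
candidates = {
    "PEPTLK": [98.06, 227.10, 324.16, 425.20, 538.29],
    "GASPVK": [58.03, 129.07, 216.10, 313.15, 412.22],
}
# Hypothetical observed peak list from the second MS stage
observed = [98.1, 227.0, 425.3, 538.2, 640.1]

for peptide, frags in candidates.items():
    print(peptide, match_score(frags, observed))
# The candidate with the most matched fragments is the best identification.
```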