BCB 444/544 Lecture 38
Review: Microarrays
Proteomics
#38_Nov28
Thanks to Doina Caragea, KSU
BCB 444/544 F07 ISU Dobbs #38 - Proteomics 11/28/07

Required Reading (before lecture)
√ Mon Nov 26 - Lecture 37: Clustering & Classification Algorithms
  • Chp 18 Functional Genomics
Wed Nov 28 - Lecture 38: Proteomics & Protein Interactions
  • Chp 19 Proteomics
Thurs Nov 29 - Lab 12: R Statistical Computing & Graphics (Garrett Dancik)
  http://www.r-project.org/
Fri Nov 30 - Lecture 39 (Last Lecture!): Systems Biology (& a bit of Metabolomics & Synthetic Biology)

Assignments & Announcements
Mon Nov 26 - HW#6 Due (5 PM Mon Nov 26 or ASAP)
Mon Dec 3 - BCB 544 Project Reports Due (NO CLASS that day!!)
ALL BCB 444 & 544 students are REQUIRED to attend ALL project presentations next week!!!
Tentative Schedule:
  Wed Dec 5: #1: Xiong & Devin (~20’); #2: Tonia (10-15’)
  Fri Dec 7: #3: Kendra & Drew (~20’); #4: Addie (10-15’)
Thurs Dec 6 - Optional Review Session for Final Exam
Mon Dec 10 - BCB 444/544 Final Exam (9:45 - 11:45 AM)
Will include:
  40 pts In Class: New material (since Exam 2)
  20 pts In Class: Comprehensive
  40 pts In Lab Practical (Comprehensive)

Seminars this Week
BCB List of URLs for Seminars related to Bioinformatics: http://www.bcb.iastate.edu/seminars/index.html
Nov 29 Thurs - Baker Center Seminar 2:10 Howe Hall Auditorium
  • Greg Voth, Univ. of Utah
  • Multiscale Challenge for Biomolecular Systems: A Systematic Approach
Nov 29 Thurs - BBMB Seminar 4:10 in 1414 MBB
  • Sue Gibson, Univ. of Minnesota
  • How do soluble sugar levels help regulate plant development, carbon partitioning and gene expression?
Nov 30 Fri - BCB Faculty Seminar 2:10 in 102 ScI
  • Shashi Gadia, ComS, ISU
  • Harnessing the Potential of XML
Nov 30 Fri - GDCB Seminar 4:10 in 1414 MBB
  • John Abrams, Univ Texas Southwestern Medical Center
  • Dying Like Flies: Programmed & Unprogrammed Cell Death

Chp 18 - Functional Genomics
SECTION V GENOMICS & PROTEOMICS
Xiong: Chp 18 Functional Genomics
• Sequence-based Approaches
• Microarray-based Approaches
• Comparison of SAGE & DNA Microarrays

Gene Expression Analysis

Pattern Recognition in Microarray Analysis
• Clustering (unsupervised learning)
  • Uses primary data to group measurements, with no information from other sources
• Classification (supervised learning)
  • Uses known groups of interest (from other sources) to learn features associated with these groups in primary data and create rules for associating data with groups of interest

Microarray Analysis - Questions & Answers
• How do hierarchical clustering algorithms work?
• How do we measure the distance between two clusters? (similarity criteria)
  • Single link
  • Complete link
  • Average link
• What are “good clusters”?
  • Big difference between INTRA-cluster distance and INTER-cluster distance, i.e., INTRA-cluster distance is minimized while INTER-cluster distance is maximized
• What are pros & cons of:
  • Hierarchical vs K-means clustering
  • Clustering vs Classification

Clustering Metrics
• A key issue in clustering is determining which similarity / distance metric to use
• Often, the metric has a bigger effect on the results than the actual clustering algorithm used!
• When determining the metric, we should take into account our assumptions about the data and the goal of the clustering

How to Determine Distances?
Intra-cluster distance
• Min/Max/Avg of the distance between
  - all pairs of points in the cluster, OR
  - the centroid and all points in the cluster
Inter-cluster distance
• Single link
  • distance between the two most similar (closest) members
• Complete link
  • distance between the two least similar (most distant) members
• Average link
  • average distance over all pairs
• Centroid distance
What is the centroid? The "average" of all points of X. The centroid of a finite set of points can be computed as the arithmetic mean of each coordinate of the points. (Wikipedia)

INTRA- vs INTER-Cluster Distances
[Figure: two example clusterings, one labeled "Good!" and one labeled "Bad!"]

Methods for Clustering (Unsupervised Learning)
• Hierarchical Clustering
• K-Means
• Self Organizing Maps
  • (in lab, won’t discuss in lecture)
• …many others….

Hierarchical Clustering*
*This method was illustrated in Lecture 36, Tables 6.1-MM6.4
• Probably most popular clustering algorithm for microarray analysis
• First presented in this context by Eisen et al. in 1998
• Nodes = genes or groups of genes
Agglomerative (bottom up)
0. Initially each item is a cluster
1. Compute distance matrix
2. Find two closest nodes (most similar clusters)
3. Merge them
4. Compute distances from merged node to all others
5.
Repeat until all nodes merged into a single node

[Figure] Copyright: Russ Altman

Hierarchical Clustering: Strengths & Weaknesses
• Easy to understand & implement
• Can decide how big to make clusters by choosing cut level of hierarchy
• Can be sensitive to bad data
• Can have problems interpreting tree
• Can have local minima
Bottom-up is the most commonly used method
• Can also perform top-down, which requires splitting a large group successively

K-Means Clustering (Model-based)
Computationally attractive!
1. Choose K random points (cluster centers or centroids) in the feature space
2. Compute distance from each data point to centroids
3. Assign each data point to closest centroid
4. Compute new cluster centroid as average of points assigned to cluster
5. Loop to (2), stop when cluster centroids do not move very much
[Figure: K = 2 with two features, f1 (x-coordinate) & f2 (y-coordinate); labels: Initial Centroid A, Initial Centroid B, 2nd Centroid A, 2nd Centroid B]

K-Means Clustering Example, for k=2
For simplicity, assume k=2 & objects are 1-dimensional (numerical difference is used as distance)
Steps in K-means clustering:
0. Objects: 1, 2, 5, 6, 7
1. Randomly select 5 and 6 as centers (centroids)
2. Calculate distance from points to centroids & assign points to clusters: {1,2,5} & {6,7}
3. Compute new cluster centroids: (C1) = 8/3 = 2.7, (C2) = 13/2 = 6.5
4. Calculate distance from points to new centroids & assign data points to new clusters: {1,2} & {5,6,7}
5. Compute new cluster centroids: (C1) = 1.5, (C2) = 6.0
6. Reassign points to the new centroids. No change? Converged!
=> Final clusters = {1,2} & {5,6,7}
[Figure: the objects 1, 2, 5, 6, 7 on a number line, with the centroids from each step (2.7 & 6.5, then 1.5 & 6)]

K-Means Clustering for k=2
A more realistic example
[Figure panels: Pick seeds; Assign clusters; Compute centroids (marked x); Re-assign clusters; Compute centroids (marked x); Re-assign clusters; Converged!]
From S. Mooney

K-Means Clustering: Strengths & Weaknesses
• Fast, O(N) per iteration (for a fixed K and number of features)
• Hard to know which K to choose
  • Try several and assess cluster quality
• Hard to know where to seed the clusters
  • Results can change drastically with different initial choices for centroids - as shown in example:

Example Illustrating Sensitivity to Seeds
[Figure: points A through F]
In the figure, if we start with B and E as centroids, K-means will converge to {A,B,C} and {D,E,F}.
If we start with D and F, it will converge to {A,B,D,E} and {C,F}.

Choice of K?
Helpful to have additional information to aid evaluation of clusters

Hierarchical Clustering vs K-Means
                Hierarchical Clustering                 K-Means
Running Time    Slower                                  Faster
Assumptions     Requires distance metric                Requires distance metric
Parameters      None                                    K (number of clusters)
Clusters        Subjective (only a tree is returned)    Exactly K clusters

Clustering vs Classification
• Clustering (unsupervised learning)
  • Uses primary data to group measurements, with no information from other sources
• Classification (supervised learning)
  • Uses known groups of interest (from other sources) to learn features associated with these groups in primary data and create rules for associating data with groups of interest

Classification: Supervised Learning Task
• Given: a set of microarray experiments, each done with mRNA from a different patient (but from the same cell type from every patient). Patient’s expression values for each
gene constitute the features, and patient’s disease constitutes the class
• Do: Learn a model that accurately predicts class based on features
• Outcome: Predict class value of a patient based on expression levels of his/her genes

Methods for Classification
• K-nearest neighbors (KNN)
• Linear Models
• Logistic Regression
• Naive Bayes
• Decision Trees
• Support Vector Machines

K-Nearest Neighbor (KNN)
• Idea: Use k closest neighbors to label new data points (e.g., for k = 4)

Basic KNN Algorithm
INPUT:
• Set of data with labels (training data)
• K
• Set of data needing labels
• Distance metric
1. For each unlabeled data point, compute distance to all labeled data
2. Sort distances, determine closest K neighbors (smallest distances)
3. Use majority voting to predict label of unlabeled data point

Variations on KNN
• Can classify into multiple classes easily
• Weighted KNN - can weight votes of nearby training samples based on their distance from the unknown sample
• Can set a threshold, p, for the # of votes needed to win.
(If no winner, then either NULL result or set default winner)

Compare in Graphical Representation
[Figure: Clustering vs Classification. Apply external labels: RED group & BLUE group]

Tradeoffs for Clustering vs Classification
• Clustering is not biased by previous knowledge, but therefore needs stronger signal to discover clusters
• Classification uses previous knowledge, so can detect weaker signal, but may be biased by WRONG previous knowledge

Chp 19 - Proteomics
SECTION V GENOMICS & PROTEOMICS
Xiong: Chp 19 Proteomics
• Technology of Protein Expression Analysis
• Post-translational Modification
• Protein Sorting
• Protein-Protein Interactions

ISU Proteomics Resources & Researchers
Facilities:
  Proteomics Facility (Carver Co-lab) http://www.plantgenomics.iastate.edu/proteomics/
  Protein Facility (MBB) http://www.protein.iastate.edu/
Experiments:
  Plant: Rodermel, Wise, Voytas
  Animal: Greenlee, perhaps others soon?
Computational Analysis: Honavar, Wise, Dobbs

Proteomics: What do all those proteins do??
Biological processes for yeast proteins
Copyright © 2006 A. Malcolm Campbell

Proteome Analysis: “Traditionally” using Two-dimensional (2D) gels
1st D: Isoelectric focusing (IEF) in pH gradient: Proteins migrate to isoelectric points & stop moving
2nd D: SDS-PAGE (SDS detergent, polyacrylamide gel electrophoresis): Proteins migrate according to molecular weight
Copyright © 2006 A.
Malcolm Campbell

Proteins identified on 2D gels (IEF/SDS-PAGE)
Direct protein microsequencing by Edman degradation
-- done at facilities (here at ISU)
-- typically need 5 picomoles
-- often get 10 to 20 amino acids of sequence
Protein mass analysis by MALDI-TOF
-- Matrix-Assisted Laser Desorption/Ionization Time-Of-Flight Mass Spectrometry
-- done at facilities (here at ISU)
-- often detect post-translational modifications (such as phosphorylated Ser, Thr, Tyr)
Page 250-1

Evaluation of 2D gels (IEF/SDS-PAGE)
Advantages:
  Visualize hundreds to thousands of proteins
  Improved identification of protein spots
Disadvantages:
  Limited number of samples can be processed
  Mostly abundant proteins visualized
  Technically difficult
Jonathan Pevsner, Page 251

Tandem Mass Spectrometry (MS/MS) to Identify Proteins
Figure 8.19 Tandem mass spectrometry for protein identification
a) ESI creates ionized proteins, represented by colored shapes with positive charges. Each shape represents many copies of identical proteins.
b) Ionized proteins are separated based on their mass to charge ratio (m/z) and sent one at a time into the activation chamber. Separation and selection take place in the first of the two MS devices. The solid purple protein has been selected for analysis; the other three are temporarily stored for later analysis.
c) The group of m/z selected ionized proteins enters a collision cell that is filled with inert argon gas. Gas molecules collide with proteins, which causes them to break into two peptide pieces (labeled b and y).
d) Ionized peptide pieces are sent into the second MS device, which again measures the m/z ratio. A computer compares the spectrum of peptide pieces to a database of ideal spectra to identify the original group of identical proteins.
Copyright © 2006 A.
Malcolm Campbell

MS data: Protein identification through peptide fragment identification & separation
Figure 8.20 When a group of identical proteins is broken into peptide pieces, more than one pair of b and y peptides will be formed.
a) One protein sequence and its calculated mass on top, with the b peptides/masses (gray) and the y peptides/masses (purple) below.
b) An experimentally determined mass/charge spectrum from the peptide in panel a). Some peaks are higher than others, which means that some b/y peptide pieces were more abundant than others. The spectrum is used to determine each peptide’s amino acid sequence and protein identity.
Copyright © 2006 A. Malcolm Campbell

Databases of 2D Gel Information
http://ca.expasy.org/ch2d/2d-index.html
Jonathan Pevsner