#37- Clustering & Classification Algorithms 11/26/07

BCB 444/544
Lecture 37
Brief Review: Microarrays
Clustering & Classification Algorithms
#37_Nov26
Thanks to: Doina Caragea, KSU; Dan Nettleton, ISU
BCB 444/544 F07 ISU Dobbs #37- Clustering 11/26/07

Required Reading (before lecture)
Mon Nov 26 - Lecture 37 Clustering & Classification Algorithms
• Chp 18 Functional Genomics
Wed Nov 28 - Lecture 38 Proteomics & Protein Interactions
• Chp 19 Proteomics
Thurs Nov 29 - Lab 12 R Statistical Computing & Graphics (Garrett Dancik)
• http://www.r-project.org/
Fri Nov 30 - Lecture 39 Systems Biology (& a bit of Metabolomics & Synthetic Biology)

Assignments & Announcements
Mon Nov 26 - HW#6 Due (sometime before 5 PM Mon Nov 26)
Mon Dec 3 - BCB 544 Project Reports Due (but no class!)
ALL BCB 444 & 544 students are REQUIRED to attend ALL project presentations next week!!!
Tentative Schedule:
• Wed Dec 5: #1: Xiong & Devin (~20’), #2: Tonia (10-15’)
• Fri Dec 7: #3: Kendra & Drew (~20’), #4: Addie (10-15’)
Thurs Dec 6 - Optional Review Session for Final Exam
Mon Dec 10 - BCB 444/544 Final Exam (9:45 - 11:45AM)
Will include:
• 40 pts In Class: New material (since Exam 2)
• 20 pts In Class: Comprehensive
• 40 pts In Lab Practical (Comprehensive)

Seminars this Week
BCB List of URLs for Seminars related to Bioinformatics: http://www.bcb.iastate.edu/seminars/index.html
Nov 29 Thurs - Baker Center Seminar 2:10 Howe Hall Auditorium
• Greg Voth, Univ. of Utah
• Multiscale Challenge for Biomolecular Systems: A Systematic Approach
Nov 29 Thurs - BBMB Seminar 4:10 in 1414 MBB
• Sue Gibson, Univ. of Minnesota
• How do soluble sugar levels help regulate plant development, carbon partitioning and gene expression?
Nov 30 Fri - BCB Faculty Seminar 2:10 in 102 ScI
• Shashi Gadia, ComS, ISU
• Harnessing the Potential of XML
Nov 30 Fri - GDCB Seminar 4:10 in 1414 MBB
• John Abrams, Univ Texas Southwestern Medical Center
• Dying Like Flies: Programmed & Unprogrammed Cell Death

Chp 18 – Functional Genomics
SECTION V GENOMICS & PROTEOMICS
Xiong: Chp 18 Functional Genomics
• Sequence-based Approaches
• Microarray-based Approaches
• Comparison of SAGE & DNA Microarrays

Transcriptome Analysis
Transcriptome = complete collection of all RNAs in a cell at a given time
High-throughput analysis of RNA expression:
• Microarrays - "Gene Chips" most popular
Other related methods:
• SAGE = Serial Analysis of Gene Expression
• MPSS = Massively Parallel Signature Sequencing

Microarray Analysis
Which RNAs are detected?
• mRNAs (& pre-RNAs), alternatively spliced mRNAs
• rRNAs, tRNAs
• miRNAs, siRNAs, other regulatory RNAs
2 Major Types of DNA Microarrays:
• cDNA = "spotted" = low density, glass slides = Southern blot on a slide
• oligo = "DNA chip" = high density, photolithography; "Affy" chip; computationally designed
Both types can be made here, in ISU facilities

"Guilt by Association" - Similar expression patterns suggest potential functions for novel proteins
• TF is induced 2X & is known to activate genes G1 and G2, both of which are induced 6X. G3 is induced 6X, too. Is it regulated by TF?
• Clustering of gene expression patterns (with known genes) suggests potential functions for unknown genes - additional experiments are required to test these hypothesized functions.
Copyright © 2006 A. Malcolm Campbell

Gene Expression Pattern Clusters: for several thousand genes!!
• Each row represents a different gene
• Each column represents a different time point
• Green indicates repression (decrease in RNA)
• Red indicates induction (increase in RNA)
• Genes have been clustered so they are near other genes with similar expression patterns. Notice that the genes at the bottom were repressed for the first few time points.
Copyright © 2006 A. Malcolm Campbell

ISU Microarray Researchers & Facilities
Microarray Facilities:
• Center for Plant Genomics (ISU PSI) - Pat Schnable in Carver Co-Lab
• GeneChip Facility (ISU Biotech & PSI) - Steve Whitham in MBB
Research Labs:
• Pat Schnable (Agron/GDCB) - Facilities for cDNA microarrays
• Steve Whitham (PlPath) - Facilities for oligo microarrays
Lots more:
• Jo Anne Powell-Coffman, GDCB: genes induced under oxidative stress
• Roger Wise, Rico Caldo, Plant Pathology: interaction between multiple isolates of powdery mildew and multiple genotypes of barley
• Chris Tuggle, Animal Science: genes controlling mammalian embryo development
Google "microarrays" from ISU website>>>

ISU Microarray Design & Analysis
ISU Course: Stat 416/516X (Nettleton) Statistical Design & Analysis of Microarray Experiments
• Dan Nettleton (Stat) - Experimental design & statistical analyses
• Hui-Hsien Chou (Com S) - "Picky" software for designing oligos
• Di Cook (Stat) - "exploRase" software for high-dimensional data analysis & visualization for systems biology
ISU has several Microarray Analysis Suites

Gene Expression Analysis
• Experimental Design is critical
• Tools from Statistics & Machine Learning are needed
ISU Experts: Dan Nettleton & Di Cook, Stat; Vasant Honavar, Com S
Statistics:
• ANOVA (Analysis of Variance)
• R Statistics package
ML:
• Clustering & Classification Algorithms
• WEKA package
GEPAS
Many additional resources & tools available online
Microarray Analysis - Questions:
• How do hierarchical clustering algorithms work?
• How do we measure the distance between two clusters? (similarity criteria)
• What are “good clusters”?

Data Analysis Considerations
• Normalization
• Combining results from replicates
• Identifying differentially expressed genes
• Dealing with missing values
• Static vs. time series

Pattern Recognition in Microarray Analysis
• Clustering (unsupervised learning)
  - Uses primary data to group measurements, with no information from other sources
• Classification (supervised learning)
  - Uses known groups of interest (from other sources) to learn features associated with these groups in primary data and create rules for associating data with groups of interest

Two Views of same Microarray Experiment
• Data points are genes
  - Represented by expression levels across different samples/experiments/conditions (i.e., features = samples)
  - Goal: categorize genes
• Data points are samples (e.g., patients)
  - Represented by expression levels of different genes (i.e., features = genes)
  - Goal: categorize samples

Two Ways to View Microarray Data

Person \ Gene   A28202_ac   AB00014_at   AB00015_at   ...
Person 1        1142.0      321.0        2567.2       ...
Person 2        586.3       586.1        759.0        ...
Person 3        105.2       559.3        3210.7       ...
Person 4        42.8        692.1        812.0        ...
...             ...         ...          ...          ...

Data Points are Genes
[same table as above]

Data Points are Samples
[same table as above]

Clustering: Unsupervised Learning Task 1
• Given: a set of microarray results in which gene expression levels are measured under different experimental conditions
• Do: Cluster the genes, where a gene is described by its expression levels under different conditions
• Outcome: Groups genes into clusters, where the expression of all members of a cluster tends to go up or down together

Example: Groups of Genes are Clustered
[Figure: clustered expression data; axes: Genes, Experiments (Samples); Green = up-regulated, Red = down-regulated]

Visualizing Expression Patterns for Different Clusters
[Figure: Normalized expression vs. Time (10-minute intervals) for Gene Cluster 1, size=20, and Gene Cluster 2, size=43 (from Sharan & Shamir, 2000)]

Clustering: Unsupervised Learning Task 2
• Given: a set of microarray results in which experimental samples correspond to different patients
• Do: Cluster the experiments
• Outcome: Groups samples according to similarities in gene expression profiles

Examples
• Cluster samples from mice subjected to a variety of toxic compounds
• Cluster samples from cancer patients to discover different subtypes of a cancer
• Cluster samples taken at different timepoints
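The two views of one experiment (genes as data points for Task 1, samples as data points for Task 2) amount to reading the same matrix by columns or by rows. A minimal sketch, using the four-person table from the slides; the function names are made up for illustration, and a real analysis would more likely use R or a matrix library:

```python
# Expression table from the slides: rows are samples (persons), columns are genes.
genes = ["A28202_ac", "AB00014_at", "AB00015_at"]
samples = ["Person 1", "Person 2", "Person 3", "Person 4"]
expression = [
    [1142.0, 321.0, 2567.2],
    [586.3, 586.1, 759.0],
    [105.2, 559.3, 3210.7],
    [42.8, 692.1, 812.0],
]


def samples_as_points(matrix, sample_names):
    """Each sample is a data point; its features are gene expression levels."""
    return {name: list(row) for name, row in zip(sample_names, matrix)}


def genes_as_points(matrix, gene_names):
    """Each gene is a data point; its features are its levels across samples."""
    columns = zip(*matrix)  # transpose: rows <-> columns
    return {name: list(col) for name, col in zip(gene_names, columns)}


by_sample = samples_as_points(expression, samples)
by_gene = genes_as_points(expression, genes)
print(by_sample["Person 2"])   # three features, one per gene
print(by_gene["AB00014_at"])   # four features, one per sample
```

Clustering `by_gene` groups genes; clustering `by_sample` groups patients. The algorithm can be the same in both cases, only the orientation of the matrix changes.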
Supervision: Add Class Values

Person \ Gene   A28202_ac   AB00014_at   AB00015_at   ...   Class
Person 1        1142.0      321.0        2567.2       ...   normal
Person 2        586.3       586.1        759.0        ...   cancer
Person 3        105.2       559.3        3210.7       ...   normal
Person 4        42.8        692.1        812.0        ...   cancer
...             ...         ...          ...          ...   ...

Classification: Supervised Learning Task
• Given: a set of microarray experiments, each done with mRNA from a different patient (but from the same cell type from every patient). A patient’s expression values for each gene constitute the features, and the patient’s disease constitutes the class
• Do: Learn a model that accurately predicts class based on features
• Outcome: Predict class value of a patient based on expression levels of his/her genes

Methods for Clustering
• Hierarchical Clustering
• K-Means
• Self Organizing Maps
• …many others….
(in lab, won’t discuss in lecture)

Clustering Metrics
• A key issue in clustering is to determine what similarity / distance metric to use
• Often, the metric has a bigger effect on the results than the actual clustering algorithm used!
• When determining the metric, we should take into account our assumptions about the data and the goal of the clustering

Distance Metrics for 2 n-Dimensional Vectors
(e.g., for a series of expression measurements)
• Euclidean distance
  $D(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2}$
• Correlation coefficient
  $\rho(x, y) = \frac{\mathrm{cov}(x, y)}{\mathrm{std}(x)\,\mathrm{std}(y)} = \frac{\frac{1}{n}\sum_i (x_i - \mu_x)(y_i - \mu_y)}{\sigma_x \sigma_y}$
  where $\sigma_x = \sqrt{E(x^2) - E(x)^2}$ and $E(x)$ is the expected value of $x$
• Other metrics are also used…

Measuring Quality of Clusters
• Compare INTRA-cluster distances with INTER-cluster distances. Good clusters should have a big difference
• Compare computed clusters with known clusters (if there are any) to see how closely they match. Good clusters will contain all known and no “wrong” cluster members

INTRA- vs INTER-Cluster Distances
[Figure: intra-cluster distance vs inter-cluster distance, with examples labeled “Good!” and “Bad!”]

How to Determine Distances?
Intra-cluster distance
• Min/Max/Avg the distance between
  - All pairs of points in the cluster OR
  - Between centroid and all points in the cluster
Inter-cluster distance
• Single link
  - distance between two most similar members
• Complete link
  - distance between two least similar members
• Average link
  - Average distance of all pairs
• Centroid distance
What is the centroid? The "average" of all points of X. The centroid of a finite set of points can be computed as the arithmetic mean of each coordinate of the points. (Wikipedia)

Similarity Criterion: Single Link
• Cluster similarity = similarity of two most similar members
• Potentially long and skinny clusters

Similarity Criterion: Complete Link
• Cluster similarity = similarity of two least similar members
• Tight clusters

Similarity Criterion: Average Link
• Cluster similarity = average similarity of all pairs
• This is perhaps the most widely used similarity criterion

Hierarchical Clustering*
*This method was illustrated in Lecture 36, Tables 6.1-MM6.4
• Probably most popular clustering algorithm for microarray analysis
• First presented in this context by Eisen et al. in 1998
• Nodes = genes or groups of genes
• Agglomerative (bottom up):
  0. Initially each item is a cluster
  1. Compute distance matrix
  2. Find two closest nodes (most similar clusters)
  3. Merge them
  4. Compute distances from merged node to all others
  5. Repeat until all nodes merged into a single node

Hierarchical Clustering Example: Using Single Link Criterion to Iteratively “Combine” Data Points
[Figure] Copyright: Russ Altman

Hierarchical Clustering: Strengths & Weaknesses
Computationally attractive!
• Bottom-up is most commonly used method
• Can also perform top-down, which requires splitting a large group successively

K-Means Clustering
1. Choose k random points (cluster centers or centroids)
2. Compute distance from each data point to centroids
3. Assign each data point to closest centroid
4. Compute new cluster centroid as average of points assigned to cluster
5. Loop to (2); stop when cluster centroids do not move very much
[Figure, for K = 2 with two features, f1 (x-coordinate) & f2 (y-coordinate): Initial Centroid A, 2nd Centroid A, Initial Centroid B, 2nd Centroid B]

For simplicity, assume k=2 & objects are 1-dimensional (numerical difference is used as distance)
[Figure: Assign clusters, Compute centroids, Re-assign clusters, Compute centroids, Re-assign clusters, Converged!]
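The K-means loop can be sketched for the 1-dimensional case as follows. This is a minimal illustration, not the implementation used in lab. Two simplifications are assumptions of the sketch: the seeds are passed in explicitly instead of being chosen at random, and "do not move very much" is replaced by "do not move at all". The objects 1, 2, 5, 6, 7 with seeds 5 and 6 serve as sample data.

```python
def kmeans_1d(points, seeds, max_iter=100):
    """Plain K-means on 1-D data, using absolute difference as the distance."""
    centroids = list(seeds)
    clusters = []
    for _ in range(max_iter):
        # Assign each point to its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # New centroid = mean of assigned points (keep the old one if a cluster is empty).
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        # Stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return clusters, centroids


clusters, centroids = kmeans_1d([1, 2, 5, 6, 7], seeds=[5, 6])
print(clusters)    # [[1, 2], [5, 6, 7]]
print(centroids)   # [1.5, 6.0]
```

The first pass gives {1,2,5} and {6,7}; once the centroids move to about 2.7 and 6.5, the point 5 switches clusters and the loop settles on {1,2} and {5,6,7}.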
K-Means Clustering (Model-based)

K-Means Clustering Example, for k=2
Steps in K-means clustering:
0. Objects: 1, 2, 5, 6, 7
1. Randomly select 5 and 6 as centers (centroids)
2. Calculate distance from points to centroids & assign points to clusters: {1,2,5} & {6,7}
3. Compute new cluster centroids: (C1) = 8/3 = 2.7, (C2) = 13/2 = 6.5
4. Calculate distance from points to new centroids & assign data points to new clusters: {1,2} & {5,6,7}
5. Compute new cluster centroids: (C1) = 1.5, (C2) = 6.0
6. No change? Converged! => Final clusters = {1,2} & {5,6,7}
[Figure: objects and centroids at each stage, starting with “Pick seeds”]

K Means Clustering for k=2: A more realistic example
[Figure] From S. Mooney

Hierarchical Clustering: Strengths & Weaknesses (cont.)
• Easy to understand & implement
• Can decide how big to make clusters by choosing cut level of hierarchy
• Can be sensitive to bad data
• Can have problems interpreting tree
• Can have local minima

K-Means Clustering: Strengths & Weaknesses
• Fast, O(N) per iteration (for a fixed K)
• Hard to know which K to choose
  - Try several and assess cluster quality
  - Choice of K? Helpful to have additional information to aid evaluation of clusters
• Hard to know where to seed the clusters
• Results can change drastically with different initial choices for centroids - as shown in example:

Example Illustrating Sensitivity to Seeds
[Figure: points A, B, C, D, E, F]
• In the above, if we start with B and E as centroids, K-means will converge to {A,B,C} and {D,E,F}
• If we start with D and F, it will converge to {A,B,D,E} and {C,F}

Hierarchical Clustering vs K-Means

                 Hierarchical Clustering                  K-Means
Running Time     Slower                                   Faster
Assumptions      Requires distance metric                 Requires distance metric
Parameters       None                                     K (number of clusters)
Clusters         Subjective (only a tree is returned)     Exactly K clusters

Clustering vs Classification
• Clustering (unsupervised learning)
  - Uses primary data to group measurements, with no information from other sources
• Classification (supervised learning)
  - Uses known groups of interest (from other sources) to learn features associated with these groups in primary data and create rules for associating data with groups of interest

Compare in Graphical Representation
• Clustering
• Classification: Apply external labels: RED group & BLUE group
[Figure]

Tradeoffs
• Clustering is not biased by previous knowledge, but therefore needs stronger signal to discover clusters
• Classification uses previous knowledge, so can detect weaker signal, but may be biased by WRONG previous knowledge

Methods for Classification
• K-nearest neighbors
• Linear Models
• Logistic Regression
• Naive Bayes
• Decision Trees
• Support Vector Machines

K-Nearest Neighbor (KNN)
• Idea: Use k closest neighbors to label new data points (e.g., for k = 4)
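The KNN idea fits in a few lines: measure the distance from the unlabeled point to every labeled point, keep the k closest, and let them vote. The sketch below is illustrative only. The two-feature training points are hypothetical, Euclidean distance is an assumed choice of metric, and k = 3 is used here (instead of the k = 4 mentioned on the slide) so that a two-class vote cannot tie.

```python
import math
from collections import Counter


def knn_predict(training, query, k):
    """Label `query` by majority vote among its k nearest labeled points."""
    # Distance from the query to every labeled point (Euclidean, by assumption).
    distances = [(math.dist(features, query), label) for features, label in training]
    # Sort by distance and keep the k closest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbors' labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data: (expression of two genes, class label).
training = [
    ((1.0, 1.0), "normal"),
    ((1.5, 2.0), "normal"),
    ((2.0, 1.5), "normal"),
    ((6.0, 6.5), "cancer"),
    ((7.0, 6.0), "cancer"),
    ((6.5, 7.0), "cancer"),
]

print(knn_predict(training, (2.0, 2.0), k=3))   # normal
print(knn_predict(training, (6.0, 6.0), k=3))   # cancer
```

With real microarray data each sample would have thousands of features rather than two, which is one reason the choice of distance metric matters so much.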
Basic KNN Algorithm
INPUT:
• Set of data with labels (training data)
• K
• Set of data needing labels
• Distance metric
1. For each unlabeled data point, compute distance to all labeled data
2. Sort distances, determine closest K neighbors (smallest distances)
3. Use majority voting to predict label of unlabeled data point

SLIDES FOLLOWING THIS ONE WERE NOT SHOWN IN LECTURE
• Some of this is material I discussed or wrote on blackboard
• It is provided here for your information & for future reference
• It will not be covered on the Final Exam!

Microarray Technology
Details re: 2 types of arrays
• cDNA “Slides”
• Short-oligonucleotide “Chips”
A few words about microarray terminology:
• Probes refers to cDNAs or DNA oligos attached to slide or chip
• Target refers to labeled mRNA or cRNA in solution, which is hybridized to probes attached to slide or chip
Note: this is the opposite of the terminology used in discussing Southern blots, etc., in which target is DNA attached to a solid matrix & probe is labeled RNA or cDNA in solution, which is hybridized to targets attached to the matrix

cDNA Microarrays
• Glass slides or similar supports containing cDNA sequences that serve as probes for measuring mRNA levels in target samples
• cDNAs are arrayed on each slide in a grid of spots.
• Each spot contains thousands of copies of a sequence that matches a segment of a gene’s coding sequence.
• A sequence and its complement are present in the same spot.
• Different spots typically represent different genes, but some genes may be represented by multiple spots
Dan Nettleton, ISU Statistics 416/516X

[Figure: cDNA microarray slide 1 and cDNA microarray slide 2, each with a spot for gene 201 (GATATG...) and a spot for gene 576 (TTCCAG...)]
Each spot contains many copies of a sequence along with its complement (not shown).

cDNA Microarray Probes
• Expressed Sequence Tags (ESTs) commonly serve as probes on cDNA microarrays.
• ESTs are small pieces of cDNA sequence (usually 200 to 500 nucleotides long) that have been reverse-transcribed from mRNA
[Figure: mRNA (AAAAAAAAA...A), cDNA (TTTTTTTTTT...T), ESTs]

Spotting cDNA Probes on Microarrays
• Solutions containing probes are transferred from a plate to a microarray slide by a robotic arrayer.
• The robot picks up a small amount of solution containing a probe by dipping a pin into a well on a plate.
• The robot then deposits a small drop of the solution on the microarray slide by touching the pin onto the slide.
• The pin is washed and the process is repeated for a different probe.
• Most arrayers use several pins so that multiple probes are spotted simultaneously on a slide.
• Most arrayers print multiple slides together so that probes are deposited on several slides prior to washing.

Spotting Probes on the Microarray
[Figure: 8 X 4 Print Head; plate with wells holding probes in solution; microarray slide]
All spots of the same color are made at the same time. All spots in the same sector are made by the same pin.

cDNA Microarrays to Measure mRNA Levels
• RNA is extracted from a target sample of interest.
• mRNAs are reverse transcribed into cDNA.
• The resulting cDNAs are labeled with a fluorescent dye and are incubated with the microarray slide.
• Dyed cDNA sequences hybridize to complementary probes spotted on the array.
• A laser excites the dye and a scanner records an image of the slide.
• The image is quantified to obtain measures of fluorescence intensity for each pixel.
• Pixel values are processed to obtain measures of mRNA abundance for each probe on the array.

cDNA Microarrays to Measure mRNA Levels (cont.)
• Usually two samples, dyed with different dyes, are hybridized to a single slide.
• The dyes fluoresce at different wavelengths so it is possible to get separate images for each dye.
• Images from the scanner are black and white, but it is typical to display Cy3 images as green and Cy5 images as red.
• It is common to superimpose the two images, using yellow to indicate a mixture of green and red.
Problems with cDNA Microarrays: Difficult to Make Meaningful Comparisons between Genes
• Measures of mRNA levels are affected by several factors that are partly or completely confounded with genes (e.g., EST source plate, EST well, print pin, slide position, length of mRNA sequence, base composition of mRNA sequence, specificity of probe sequence, etc.).
• Within-gene comparisons of multiple cell types or across multiple treatment conditions are much more meaningful.

cDNA Microarrays to Measure mRNA Levels:
Step 1: Prepare Microarray Slide & Sample mRNAs
[Figure: Microarray Slide with Spots (Probes) ACCTG...G, TTCTG...A, GGCTT...C, ATCTA...A, ACGGG...T, CGATA...G; Sample 1 and Sample 2 contain Unknown mRNA Sequences (Target)]

cDNA Microarrays to Measure mRNA Levels:
Step 2: Convert mRNA to cDNA & label with Fluorescent Dyes
[Figure: Sample 1 and Sample 2]

cDNA Microarrays to Measure mRNA Levels:
Step 3: Mix Labeled cDNA and Hybridize to Slide
[Figure: Sample 1 and Sample 2 hybridized to the spotted probes]

cDNA Microarrays to Measure mRNA Levels:
Step 5: Excite Dye with Laser, Scan & Quantify Signals
[Figure: quantified signals for Sample 1 and Sample 2; extracted values in source order: ACCTG...G, TTCTG...A: 7652, 138, 5708, 4388; GGCTT...C, ATCTA...A: 8566, 765, 1208, 13442; ACGGG...T, CGATA...G: 6784, 9762, 67, 239]

Pros/Cons of Spotted cDNA Arrays
• Flexible, can put any cDNA on slide
• Many sources of variation in the manufacture of these arrays, print tips, lab, etc.
• Contamination
• Uneven distribution

DNA Oligonucleotide “Chips”
• An oligonucleotide microarray is a microarray whose probes consist of synthetically created DNA oligonucleotides.
• Probe sequences are chosen to have good and relatively uniform hybridization characteristics
• A probe is chosen to match a portion of its target mRNA transcript that is unique to that sequence.
• Oligo probes can distinguish among multiple mRNA transcripts with similar sequences.
[Figures: www.affymetrix.com]

Simplified Example
• ...gene 1... oligo probe for gene 1: ATTACTAAGCATAGATTGCCGTATA
• ...gene 2... oligo probe for gene 2: GCGTATGGCATGCCCGGTAAACTGG
Shared green regions indicate high degree of sequence similarity throughout much of the transcript

Oligo Microarray Fabrication
• Oligo sequences can be synthesized on a slide or chip using various commercial technologies.
• In one approach, sequences are synthesized on a slide using inkjet technology similar to that used in color printers. Separate cartridges for the four bases (A, C, G, T) are used to build oligonucleotides on a slide.
• Oligos can be synthesized and stored in solution for spotting as is done with cDNA microarrays.
• Affymetrix uses a photolithographic approach.

Affymetrix GeneChips
• Affymetrix (www.affymetrix.com) manufactures GeneChips, oligonucleotide arrays.
• Each gene (or sequence of interest or feature) is represented by multiple short (25-nucleotide) oligo probes.
• Some GeneChips include probes for around 60,000 genes.
• mRNA that has been extracted from a biological sample can be labeled (dyed) and hybridized to a GeneChip in a manner similar to that described for cDNA microarrays.
• Only one sample is hybridized to each GeneChip rather than two as in the case of cDNA microarrays.

Affymetrix Probe Sets
• A probe set is used to measure mRNA levels of a single gene.
• Each probe set consists of multiple probe cells.
• Each probe cell contains millions of copies of one oligo.
• Each oligo is intended to be 25 nucleotides in length.
• Probe cells in a probe set are arranged in probe pairs.
• Each probe pair contains a perfect match (PM) probe cell and a mismatch (MM) probe cell.
• A PM oligo perfectly matches part of a gene sequence.
• A MM oligo is identical to a PM oligo except that the middle nucleotide (13th of 25) is intentionally replaced by its complementary nucleotide.

Affymetrix GeneChips
[Figure: mRNA reference sequence, 5’ to 3’, with spaced DNA probe pairs; Probe Set, Probe Pair, Probe Cell, PM, MM]
Reference Sequence:    …TGTGATGGTGGGAATGGGTCAGAAGGGACTCCTATGTGGGTGACGAGGCC…
Perfect match oligo:   TTACCCAGTCTTCCCTGAGGATACAC
Mismatch oligo:        TTACCCAGTCTTGCCTGAGGATACAC

Different Probe Pairs Represent Different Parts of the Same Gene
[Figure: gene sequence with PM and MM probes]
• PM - 25 bases complementary to gene
• MM - Middle base is different
Probes are selected to be specific to the target gene and have good hybridization characteristics.

Obtaining Labeled Target for Affy Chips
1. RNA → single-stranded cDNA
2. Single-stranded cDNA → double-stranded cDNA
3. Double-stranded cDNA → labeled single-stranded cRNA complementary to coding sequence
Number of copies of each sequence gets amplified in conversion to cRNA.

Pros/Cons GeneChip Arrays
• Consistent manufacture -> good standardization
• Comparable across experiments
• Design is time-consuming, good for large sets of chips
• Can only see what is on the chip
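The PM/MM rule stated for Affymetrix probe pairs (MM is the PM oligo with its 13th nucleotide replaced by the complementary base) is easy to express in code. A small sketch, purely for illustration: the function name is hypothetical, and the 25-mer is the "oligo probe for gene 1" from the Simplified Example slide, used here only as a convenient input rather than as an actual GeneChip probe.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatch_probe(pm_oligo):
    """Return the MM oligo: the PM oligo with its middle (13th) base complemented."""
    if len(pm_oligo) != 25:
        raise ValueError("expected a 25-nucleotide PM oligo")
    middle = 12  # 13th nucleotide, as a zero-based index
    return pm_oligo[:middle] + COMPLEMENT[pm_oligo[middle]] + pm_oligo[middle + 1:]


pm = "ATTACTAAGCATAGATTGCCGTATA"
mm = mismatch_probe(pm)
print(pm)
print(mm)   # differs from pm only at position 13 (A -> T)
```

Because the MM probe differs at the central base only, it is meant to capture nonspecific hybridization to a nearly identical sequence.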
Affymetrix Data Processing Pipeline
• Experiment preparation: *.exp file
• Image of the scanned probe array: *.dat file
• Probe Cell Intensity file: *.cel file
• Analysis output: *.chp file
MicroArray Suite or other analysis software

Reminder: Why do microarray experiments?
• Compare two (or more) conditions to identify differentially expressed genes
  - Control/treatment
  - Disease/normal
• Exploratory analysis
  - What genes are expressed in response to drought stress?
  - What gene expression changes occur during normal retinal development?
• Diagnostic & prognostic tool development:
  - Can we predict certain conditions (breast cancer vs normal)?
  - Can we identify patterns of gene expression that predict a patient’s response to treatment/drug?

Differential Gene Expression
• Are there significant differences in expression level between the conditions?
• Analysis of Variance (ANOVA)
[Figure labels: Mutant 1, Mutant 2; Control, Inoculated]

Exploratory Analysis
• Find patterns in data to see what genes are expressed under different conditions
• Analysis includes clustering methods
• Used when little or no prior knowledge exists about the problem

Classification
• Learn characteristic patterns from a training set and evaluate with a test set.
• Classify tumor types based on expression patterns
• Predict disease susceptibility, stages, etc.

Microarray data analysis
• Preprocessing: normalization, scatter plots
• Inferential statistics: t-test, ANOVA
• Exploratory (descriptive) statistics: distances, clustering, principal components analysis (PCA)

Pre-processing
Main goal of data preprocessing is to remove any systematic bias in the data as completely as possible, while preserving variation in gene expression that occurs because of biologically relevant changes in transcription.
Observed differences in gene expression could be due to transcriptional changes, or they could be caused by artifacts such as:
• different labeling efficiencies of Cy3, Cy5
• uneven spotting of DNA onto an array surface
• variations in RNA purity or quantity
• variations in washing efficiency
• variations in scanning efficiency
(Page 191)

Inferential statistics
Inferential statistics are used to make inferences about a population from a sample. Hypothesis testing is a common form of inferential statistics. A null hypothesis is stated, such as: “There is no difference in signal intensity for the gene expression measurements in normal and diseased samples.” The alternative hypothesis is that there is a difference.
We use a test statistic to decide whether to accept or reject the null hypothesis. For many applications, we set the significance level α to 0.05, i.e., we reject the null hypothesis when p < 0.05.
(Page 199)

Descriptive statistics
Microarray data are highly dimensional: there are many thousands of measurements made from a small number of samples.
Descriptive (exploratory) statistics help you to find meaningful patterns in the data.
A first step is to arrange the data in a matrix. Next, use a distance metric to define the relatedness of the different data points. Two commonly used distance metrics are:
-- Euclidean distance
-- Pearson coefficient of correlation
(Page 203)

Limitations of Microarrays
• Link between proteins and expressed RNA not always clear
• Difficult to compare between microarray platforms
• Only see what is on the microarray
• Gene finding is still an art
• Other coding regions, “dark matter” on genome
• But now microarrays for these are being developed, too!
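The two distance metrics named on the Descriptive statistics slide can behave very differently on the same data. The sketch below is a minimal illustration using hypothetical expression profiles. It computes Pearson correlation from population moments, i.e., dividing by n, which is an assumption of the sketch; the result is undefined for a profile with zero variance.

```python
import math


def euclidean_distance(x, y):
    """D(x, y) = sqrt(sum_i (x_i - y_i)^2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    """rho(x, y) = cov(x, y) / (std(x) * std(y)), using population moments."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (std_x * std_y)


# Hypothetical expression profiles across four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]   # same pattern as gene_a, 10x the magnitude
gene_c = [4.0, 3.0, 2.0, 1.0]       # opposite pattern, same magnitude

print(euclidean_distance(gene_a, gene_b), euclidean_distance(gene_a, gene_c))
print(pearson_correlation(gene_a, gene_b), pearson_correlation(gene_a, gene_c))
```

By Euclidean distance, gene_a is closer to gene_c; by correlation, gene_a matches gene_b and is anti-correlated with gene_c. This illustrates how the choice of metric can matter more than the choice of clustering algorithm.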