BCB 444/544
Lecture 37
Brief Review: Microarrays
Clustering & Classification Algorithms
#37_Nov26
Thanks to: Doina Caragea, KSU; Dan Nettleton, ISU
BCB 444/544 F07 ISU Dobbs #37- Clustering 11/26/07

Required Reading (before lecture)
Mon Nov 26 - Lecture 37: Clustering & Classification Algorithms
• Chp 18 Functional Genomics
Wed Nov 28 - Lecture 38: Proteomics & Protein Interactions
• Chp 19 Proteomics
Thurs Nov 29 - Lab 12: R Statistical Computing & Graphics (Garrett Dancik) http://www.r-project.org/
Fri Nov 30 - Lecture 39: Systems Biology (& a bit of Metabolomics & Synthetic Biology)

Assignments & Announcements
Mon Nov 26 - HW#6 Due Mon Dec 3 (sometime before 5 PM Mon Nov 26) - BCB 544 Project Reports Due (but no class!)
ALL BCB 444 & 544 students are REQUIRED to attend ALL project presentations next week!!!
Tentative Schedule:
Wed Dec 5: #1: Xiong & Devin (~20’); #2: Tonia (10-15’)
Fri Dec 7: #3: Kendra & Drew (~20’); #4: Addie (10-15’)
Thurs Dec 6 - Optional Review Session for Final Exam
Mon Dec 10 - BCB 444/544 Final Exam (9:45 - 11:45 AM)
Will include:
• 40 pts In Class: New material (since Exam 2)
• 20 pts In Class: Comprehensive
• 40 pts In Lab Practical (Comprehensive)

Seminars this Week
BCB List of URLs for Seminars related to Bioinformatics: http://www.bcb.iastate.edu/seminars/index.html
Nov 29 Thurs - Baker Center Seminar, 2:10, Howe Hall Auditorium
• Greg Voth, Univ. of Utah
• Multiscale Challenge for Biomolecular Systems: A Systematic Approach
Nov 29 Thurs - BBMB Seminar, 4:10 in 1414 MBB
• Sue Gibson, Univ. of Minnesota
• How do soluble sugar levels help regulate plant development, carbon partitioning and gene expression?
Nov 30 Fri - BCB Faculty Seminar, 2:10 in 102 ScI
• Shashi Gadia, ComS, ISU
• Harnessing the Potential of XML
Nov 30 Fri - GDCB Seminar, 4:10 in 1414 MBB
• John Abrams, Univ Texas Southwestern Medical Center
• Dying Like Flies: Programmed & Unprogrammed Cell Death

Chp 18 - Functional Genomics
SECTION V GENOMICS & PROTEOMICS
Xiong: Chp 18 Functional Genomics
• Sequence-based Approaches
• Microarray-based Approaches
• Comparison of SAGE & DNA Microarrays

Transcriptome Analysis
Transcriptome = complete collection of all RNAs in a cell at a given time
High-throughput analysis of RNA expression:
• Microarrays - "Gene Chips" (most popular)
Other related methods:
• SAGE = Serial Analysis of Gene Expression
• MPSS = Massively Parallel Signature Sequencing

Microarray Analysis
Which RNAs are detected?
• mRNAs (& pre-RNAs), alternatively spliced mRNAs
• rRNAs, tRNAs
• miRNAs, siRNAs, other regulatory RNAs
2 Major Types of DNA Microarrays:
• cDNA = "spotted" = low density, glass slides = Southern blot on a slide
• oligo = "DNA chip" = high density, photolithography; "Affy" chip; computationally designed
• Both types can be made here, in ISU facilities

"Guilt by Association" - Similar expression patterns suggest potential functions for novel proteins
TF is induced 2X & is known to activate genes G1 and G2, both of which are induced 6X. G3 is induced 6X, too. Is it regulated by TF?
Clustering of gene expression patterns (with known genes) suggests potential functions for unknown genes. Additional experiments are required to test these hypothesized functions.
Copyright © 2006 A. Malcolm Campbell

Gene Expression Pattern Clusters: for several thousand genes!!
• Each row represents a different gene
• Each column represents a different time point
• Green indicates repression (decrease in RNA)
• Red indicates induction (increase in RNA)
Genes have been clustered so they are near other genes with similar expression patterns. Notice that the genes at the bottom were repressed for the first few time points.
Copyright © 2006 A. Malcolm Campbell

ISU Microarray Researchers & Facilities
Microarray Facilities:
• Center for Plant Genomics (ISU PSI) - Pat Schnable in Carver Co-Lab
• GeneChip Facility (ISU Biotech & PSI) - Steve Whitham in MBB
Research Labs:
• Pat Schnable (Agron/GDCB) - Facilities for cDNA microarrays
• Steve Whitham (PlPath) - Facilities for oligo microarrays
Google "microarrays" from ISU website >>> Lots more:
• Jo Anne Powell-Coffman, GDCB: genes induced under oxidative stress
• Roger Wise, Rico Caldo, Plant Pathology: interaction between multiple isolates of powdery mildew and multiple genotypes of barley
• Chris Tuggle, Animal Science: genes controlling mammalian embryo development

ISU Microarray Design & Analysis
• Experimental Design is critical
  ISU Course: Stat 416/516X (Nettleton), Statistical Design & Analysis of Microarray Experiments
• Dan Nettleton (Stat) - Experimental design & statistical analyses
• Hui-Hsien Chou (Com S) - "Picky" software for designing oligos
• Di Cook (Stat) - "exploRase" software for high-dimensional data analysis & visualization for systems biology
• Tools from Statistics & Machine Learning are needed
  ISU Experts: Dan Nettleton & Di Cook, Stat; Vasant Honavar, Com S
  Statistics: ANOVA (Analysis of Variance); R Statistics package
  ML: Clustering & Classification Algorithms; WEKA package; GEPAS
• Many additional resources & tools available online
• ISU has several Microarray Analysis Suites

Gene Expression Analysis
Doina Caragea
Microarray Analysis - Questions:
• How do hierarchical clustering algorithms work?
• How do we measure the distance between two clusters? (similarity criteria)
• What are “good clusters”?
Doina Caragea

Data Analysis Considerations
• Normalization
• Combining results from replicates
• Identifying differentially expressed genes
• Dealing with missing values
• Static vs. time series
Doina Caragea

Pattern Recognition in Microarray Analysis
• Clustering (unsupervised learning)
  • Uses primary data to group measurements, with no information from other sources
• Classification (supervised learning)
  • Uses known groups of interest (from other sources) to learn features associated with these groups in primary data and create rules for associating data with groups of interest
Doina Caragea

Two Views of same Microarray Experiment
• Data points are genes
  • Represented by expression levels across different samples/experiments/conditions (ie, features=samples)
  • Goal: categorize genes
• Data points are samples (eg, patients)
  • Represented by expression levels of different genes (ie, features=genes)
  • Goal: categorize samples
Doina Caragea

Two Ways to View Microarray Data
Doina Caragea
Person \ Gene   A28202_ac   AB00014_at   AB00015_at   ...
Person 1        1142.0      321.0        2567.2       ...
Person 2        586.3       586.1        759.0        ...
Person 3        105.2       559.3        3210.7       ...
Person 4        42.8        692.1        812.0        ...
...             ...         ...          ...          ...

Data Points are Genes
Doina Caragea
Person \ Gene   A28202_ac   AB00014_at   AB00015_at   ...
Person 1        1142.0      321.0        2567.2       ...
Person 2        586.3       586.1        759.0        ...
Person 3        105.2       559.3        3210.7       ...
Person 4        42.8        692.1        812.0        ...
...             ...         ...          ...          ...
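The two views of the expression table differ only in which axis supplies the data points. A minimal Python sketch, using just the four persons and three genes visible in the table (the variable names are illustrative, not from the lecture), shows that switching views amounts to transposing the matrix:

```python
# Expression matrix as shown on the slide: rows = persons (samples), columns = genes.
genes = ["A28202_ac", "AB00014_at", "AB00015_at"]
persons = ["Person 1", "Person 2", "Person 3", "Person 4"]
by_sample = [
    [1142.0, 321.0, 2567.2],
    [586.3, 586.1, 759.0],
    [105.2, 559.3, 3210.7],
    [42.8, 692.1, 812.0],
]

# View 1: data points are samples; the features of a sample are its gene values.
sample_view = dict(zip(persons, by_sample))

# View 2: data points are genes; transposing makes the features of a gene
# its expression values across samples.
by_gene = [list(column) for column in zip(*by_sample)]
gene_view = dict(zip(genes, by_gene))

print(sample_view["Person 3"])
print(gene_view["AB00014_at"])
```

Any clustering or classification routine can then be run on either view, depending on whether the goal is to categorize samples or genes.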
Data Points are Samples
Doina Caragea
Person \ Gene   A28202_ac   AB00014_at   AB00015_at   ...
Person 1        1142.0      321.0        2567.2       ...
Person 2        586.3       586.1        759.0        ...
Person 3        105.2       559.3        3210.7       ...
Person 4        42.8        692.1        812.0        ...
...             ...         ...          ...          ...

Clustering: Unsupervised Learning Task 1
• Given: a set of microarray results in which gene expression levels are measured under different experimental conditions
• Do: Cluster the genes, where a gene is described by its expression levels under different conditions
• Outcome: Groups genes into clusters, where expression of all members of a cluster tends to go up or down together
Doina Caragea

Example: Groups of Genes are Clustered
[Figure: heat map; rows = genes, columns = experiments (samples); Green = up-regulated, Red = down-regulated]
Doina Caragea

Visualizing Expression Patterns for Different Clusters
[Figure: normalized expression vs. time (10-minute intervals) for Gene Cluster 1 (size=20) and Gene Cluster 2 (size=43); from Sharan & Shamir, 2000]
Doina Caragea

Clustering: Unsupervised Learning Task 2
• Given: a set of microarray results in which experimental samples correspond to different patients
• Do: Cluster the experiments
• Outcome: Groups samples according to similarities in gene expression profiles
Doina Caragea

Examples
• Cluster samples from mice subjected to a variety of toxic compounds
• Cluster samples from cancer patients to discover different subtypes of a cancer
• Cluster samples taken at different timepoints
Doina Caragea

Supervision: Add Class Values
Doina Caragea
Person \ Gene   A28202_ac   AB00014_at   AB00015_at   ...   Class
Person 1        1142.0      321.0        2567.2       ...   normal
Person 2        586.3       586.1        759.0        ...   cancer
Person 3        105.2       559.3        3210.7       ...   normal
Person 4        42.8        692.1        812.0        ...   cancer
...             ...         ...          ...          ...   ...

Classification: Supervised Learning Task
• Given: a set of microarray experiments, each done with mRNA from a different patient (but from same cell type from every patient). Patient’s expression values for each gene constitute the features, and patient’s disease constitutes the class
• Do: Learn a model that accurately predicts class based on features
• Outcome: Predict class value of a patient based on expression levels of his/her genes
Doina Caragea

Methods for Clustering
• Hierarchical Clustering
• K-Means
• Self Organizing Maps
  • (in lab, won’t discuss in lecture)
• …many others….
Doina Caragea

Clustering Metrics
• A key issue in clustering is to determine what similarity / distance metric to use
• Often, the metric has a bigger effect on the results than the actual clustering algorithm used!
• When determining the metric, we should take into account our assumptions about the data and the goal of the clustering
Doina Caragea

Distance Metrics for 2 n-Dimensional Vectors (e.g., for a series of expression measurements)
• Euclidean distance
$$D(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2}$$
• Correlation coefficient
$$\rho(x, y) = \frac{\operatorname{cov}(x, y)}{\operatorname{std}(x)\,\operatorname{std}(y)} = \frac{\frac{1}{n}\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \sigma_y}$$
where $\sigma_x = \sqrt{E(x^2) - E(x)^2}$ and $E(x)$ is the expected value of $x$
• Other metrics are also used…
Doina Caragea

Measuring Quality of Clusters
• Compare INTRA-cluster distances with INTER-cluster distances.
  Good clusters should show a big difference between the two
• Compare computed clusters with known clusters (if there are any) to see how closely they match.
  Good clusters will contain all known and no “wrong” cluster members
Doina Caragea

INTRA- vs INTER-Cluster Distances
[Figure: two panels of clustered points, labeled "Good!" and "Bad!"]
Doina Caragea

How to Determine Distances?
Intra-cluster distance
• Min/Max/Avg the distance between
  - All pairs of points in the cluster, OR
  - Between centroid and all points in the cluster
Inter-cluster distance
• Single link
  • distance between two most similar members
• Complete link
  • distance between two least similar members
• Average link
  • Average distance of all pairs
• Centroid distance
What is the centroid? The "average" of all points of X. The centroid of a finite set of points can be computed as the arithmetic mean of each coordinate of the points. (Wikipedia)
Doina Caragea

Similarity Criterion: Single Link
• Cluster similarity = similarity of two most similar members
Potentially long and skinny clusters
Doina Caragea

Similarity Criterion: Complete Link
• Cluster similarity = similarity of two least similar members
Tight clusters
Doina Caragea

Similarity Criterion: Average Link
• Cluster similarity = average similarity of all pairs
This is perhaps the most widely used similarity criterion
Doina Caragea

Hierarchical Clustering*
*This method was illustrated in Lecture 36, Tables 6.1-MM6.4
• Probably most popular clustering algorithm for microarray analysis
• First presented in this context by Eisen et al. in 1998
• Nodes = genes or groups of genes
Agglomerative (bottom up)
0. Initially each item is a cluster
1. Compute distance matrix
2. Find two closest nodes (most similar clusters)
3. Merge them
4. Compute distances from merged node to all others
5. Repeat until all nodes merged into a single node
Doina Caragea

Hierarchical Clustering Example: Using Single Link Criterion to Iteratively “Combine” Data Points
Doina Caragea

[Figure. Copyright: Russ Altman]

Hierarchical Clustering: Strengths & Weaknesses
• Easy to understand & implement
• Can decide how big to make clusters by choosing cut level of hierarchy
• Can be sensitive to bad data
• Can have problems interpreting tree
• Can have local minima
Bottom-up is most commonly used method
• Can also perform top-down, which requires splitting a large group successively
Doina Caragea

K-Means Clustering (Model-based)
Computationally attractive!
1. Choose k random points (cluster centers or centroids) in the feature space
2. Compute distance from each data point to centroids
3. Assign each data point to closest centroid
4. Compute new cluster centroid as average of points assigned to cluster
5. Loop to (2), stop when cluster centroids do not move very much
[Figure, for K = 2 with two features, f1 (x-coordinate) & f2 (y-coordinate): Initial Centroid A, Initial Centroid B, 2nd Centroid A, 2nd Centroid B]
Doina Caragea

K-Means Clustering Example, for k=2
For simplicity, assume k=2 & objects are 1-dimensional (numerical difference is used as distance)
Steps in K-means clustering:
0. Objects: 1, 2, 5, 6, 7
1. Randomly select 5 and 6 as centers (centroids)
2. Calculate distance from points to centroids & assign points to clusters: {1,2,5} & {6,7}
3. Compute new cluster centroids: (C1) = 8/3 = 2.7, (C2) = 13/2 = 6.5
4. Calculate distance from points to new centroids & assign data points to new clusters: {1,2} & {5,6,7}
5. Compute new cluster centroids: (C1) = 1.5, (C2) = 6.0
6. No change in cluster assignments? Converged! => Final clusters = {1,2} & {5,6,7}
[Figure: number-line diagrams of the points 1, 2, 5, 6, 7 with the centroids at each step (5 and 6; then 2.7 and 6.5; then 1.5 and 6)]
Doina Caragea

K Means Clustering for k=2: A more realistic example
[Figure: panels labeled Pick seeds; Assign clusters; Compute centroids; Re-assign clusters; Compute centroids; Re-assign clusters; Converged! From S. Mooney]

K-Means Clustering: Strengths & Weaknesses
• Fast: each iteration is O(N) in the number of data points N (for a fixed K and number of features)
• Hard to know which K to choose
  • Try several and assess cluster quality
• Hard to know where to seed the clusters
• Results can change drastically with different initial choices for centroids, as shown in the example:
Example Illustrating Sensitivity to Seeds
In the example figure (points A-F), if we start with B and E as centroids, K-means will converge to {A,B,C} and {D,E,F}. If we start with D and F, it will converge to {A,B,D,E} and {C,F}.
Doina Caragea

Choice of K?
Helpful to have additional information to aid evaluation of clusters
Doina Caragea

Hierarchical Clustering vs K-Means
               Hierarchical Clustering                  K-Means
Running Time   Slower                                   Faster
Assumptions    Requires distance metric                 Requires distance metric
Parameters     None                                     K (number of clusters)
Clusters       Subjective (only a tree is returned)     Exactly K clusters
Doina Caragea

Clustering vs Classification
• Clustering (unsupervised learning)
  • Uses primary data to group measurements, with no information from other sources
• Classification (supervised learning)
  • Uses known groups of interest (from other sources) to learn features associated with these groups in primary data and create rules for associating data with groups of interest
Doina Caragea

Compare in Graphical Representation
[Figure: the same data shown for Clustering and for Classification; in Classification, external labels are applied: RED group & BLUE group]
Doina Caragea
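The 1-dimensional K-means example worked through earlier (objects 1, 2, 5, 6, 7 with initial centroids 5 and 6) can be reproduced with a short Python sketch. This is only an illustration under simple assumptions: the function name `kmeans_1d` is made up here, distance is the absolute numerical difference as in the example, and iteration stops when the centroids no longer change.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Plain k-means on 1-D points, using |a - b| as the distance."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # nothing moved: converged
            break
        centroids = new_centroids
    return clusters, centroids


clusters, centroids = kmeans_1d([1, 2, 5, 6, 7], [5, 6])
print(clusters, centroids)
```

Calling the same function with different starting centroids is a quick way to explore the seed sensitivity noted on the strengths and weaknesses slide.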
Tradeoffs
• Clustering is not biased by previous knowledge, but therefore needs stronger signal to discover clusters
• Classification uses previous knowledge, so can detect weaker signal, but may be biased by WRONG previous knowledge
Doina Caragea

Methods for Classification
• K-nearest neighbors
• Linear Models
• Logistic Regression
• Naive Bayes
• Decision Trees
• Support Vector Machines
Doina Caragea

K-Nearest Neighbor (KNN)
• Idea: Use k closest neighbors to label new data points (e.g., for k = 4)
Doina Caragea

Basic KNN Algorithm
INPUT:
• Set of data with labels (training data)
• K
• Set of data needing labels
• Distance metric
1. For each unlabeled data point, compute distance to all labeled data
2. Sort distances, determine closest K neighbors (smallest distances)
3. Use majority voting to predict label of unlabeled data point.
Doina Caragea

SLIDES FOLLOWING THIS ONE WERE NOT SHOWN IN LECTURE
• Some of this is material I discussed or wrote on blackboard
• It is provided here for your information & for future reference
• It will not be covered on the Final Exam!
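As a companion to the Basic KNN Algorithm slide from the main lecture, the three steps (compute distances, take the K closest, majority vote) might be sketched in Python as follows. The function name `knn_predict`, the choice of Euclidean distance, and the tiny two-feature training set are all assumptions made for illustration; they are not data from the lecture.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Label `query` by majority vote among its k nearest labeled points.

    train: list of (feature_vector, label) pairs; query: a feature vector.
    """
    # Steps 1-2: sort labeled points by Euclidean distance to the query.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Step 3: majority vote among the k closest neighbors.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data: two expression features per patient.
train = [
    ((1.0, 1.0), "normal"), ((1.2, 0.8), "normal"), ((0.9, 1.3), "normal"),
    ((3.0, 3.2), "cancer"), ((3.1, 2.9), "cancer"), ((2.8, 3.0), "cancer"),
]
print(knn_predict(train, (1.1, 1.0), k=3))
print(knn_predict(train, (3.0, 3.0), k=3))
```

An odd K avoids tied votes when there are two classes; with an even K, some tie-breaking rule would have to be chosen.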
Microarray Technology
Details re: 2 types of arrays
• cDNA “Slides”
• Short-oligonucleotide “Chips”
A few words about microarray terminology:
• Probe refers to cDNAs or DNA oligos attached to slide or chip
• Target refers to labeled mRNA or cRNA in solution, which is hybridized to probes attached to slide or chip
Note: this is the opposite of the terminology used in discussing Southern blots, etc, in which target is DNA attached to solid matrix & probe is labeled RNA or cDNA in solution, which is hybridized to targets attached to matrix

cDNA Microarrays
• Glass slides or similar supports containing cDNA sequences that serve as probes for measuring mRNA levels in target samples
• cDNAs are arrayed on each slide in a grid of spots.
• Each spot contains thousands of copies of a sequence that matches a segment of a gene’s coding sequence.
• A sequence and its complement are present in the same spot.
• Different spots typically represent different genes, but some genes may be represented by multiple spots
Dan Nettleton, ISU Statistics 416/516X

cDNA Microarray Probes
• Expressed Sequence Tags (ESTs) commonly serve as probes on cDNA microarrays.
• ESTs are small pieces of cDNA sequence (usually 200 to 500 nucleotides long) that have been reverse-transcribed from mRNA
[Figure: mRNA with poly-A tail (AAAAAAAAA...A), its cDNA (TTTTTTTTTT...T), and ESTs]
Dan Nettleton, ISU Statistics 416/516X

[Figure: cDNA microarray slide 1 and cDNA microarray slide 2, each with a spot for gene 201 and a spot for gene 576; spots hold many copies of sequences such as GATATG... and TTCCAG...]
Each spot contains many copies of a sequence along with its complement (not shown).
Dan Nettleton, ISU Statistics 416/516X

Spotting Probes on the Microarray
[Figure: 8 X 4 print head; plate with wells holding probes in solution; microarray slide]
• All spots of the same color are made at the same time.
• All spots in the same sector are made by the same pin.
Dan Nettleton, ISU Statistics 416/516X

cDNA Microarrays to Measure mRNA Levels
• RNA is extracted from a target sample of interest.
• mRNAs are reverse transcribed into cDNA.
• The resulting cDNAs are labeled with a fluorescent dye and are incubated with the microarray slide.
• Dyed cDNA sequences hybridize to complementary probes spotted on the array.
• A laser excites the dye and a scanner records an image of the slide.
• The image is quantified to obtain measures of fluorescence intensity for each pixel.
• Pixel values are processed to obtain measures of mRNA abundance for each probe on the array.
Dan Nettleton, ISU Statistics 416/516X

cDNA Microarrays to Measure mRNA Levels (cont.)
• Usually two samples, dyed with different dyes, are hybridized to a single slide.
• The dyes fluoresce at different wavelengths, so it is possible to get separate images for each dye.
• Images from the scanner are black and white, but it is typical to display Cy3 images as green and Cy5 images as red.
• It is common to superimpose the two images, using yellow to indicate a mixture of green and red.
Dan Nettleton, ISU Statistics 416/516X

Problems with cDNA Microarrays: Difficult to Make Meaningful Comparisons between Genes
• Measures of mRNA levels are affected by several factors that are partly or completely confounded with genes (e.g., EST source plate, EST well, print pin, slide position, length of mRNA sequence, base composition of mRNA sequence, specificity of probe sequence, etc.).
• Within-gene comparisons of multiple cell types or across multiple treatment conditions are much more meaningful.
Dan Nettleton, ISU Statistics 416/516X

cDNA Microarrays to Measure mRNA Levels: Step 1: Prepare Microarray Slide & Sample mRNAs
[Figure: microarray slide with spots (probes) ACCTG...G, TTCTG...A, GGCTT...C, ATCTA...A, ACGGG...T, CGATA...G; Sample 1 and Sample 2 hold unknown mRNA sequences (target)]
Dan Nettleton, ISU Statistics 416/516X

cDNA Microarrays to Measure mRNA Levels: Step 2: Convert mRNA to cDNA & label with Fluorescent Dyes
[Figure: Sample 1 and Sample 2 cDNAs, shown as unknown sequences (??????????), next to the slide's probes]
Dan Nettleton, ISU Statistics 416/516X

cDNA Microarrays to Measure mRNA Levels: Step 3: Mix Labeled cDNA and Hybridize to Slide
[Figure: labeled cDNAs from Sample 1 and Sample 2 hybridizing to the probes on the slide]
Dan Nettleton, ISU Statistics 416/516X
cDNA Microarrays to Measure mRNA Levels: Step 5: Excite Dye with Laser, Scan & Quantify Signals
[Figure: slide hybridized with Sample 1 and Sample 2; two quantified signal values per spot: ACCTG...G 7652, 138; TTCTG...A 5708, 4388; GGCTT...C 8566, 765; ATCTA...A 1208, 13442; ACGGG...T 6784, 9762; CGATA...G 67, 239]
Dan Nettleton, ISU Statistics 416/516X

Pros/Cons of Spotted cDNA Arrays
• Many sources of variation in the manufacture of these arrays: print tips, lab, etc.
• Contamination
• Uneven distribution
• Flexible, can put any cDNA on slide
Dan Nettleton, ISU Statistics 416/516X

DNA Oligonucleotide “Chips”
• An oligonucleotide microarray is a microarray whose probes consist of synthetically created DNA oligonucleotides.
• Probe sequences are chosen to have good and relatively uniform hybridization characteristics
• A probe is chosen to match a portion of its target mRNA transcript that is unique to that sequence.
• Oligo probes can distinguish among multiple mRNA transcripts with similar sequences.
Dan Nettleton, ISU Statistics 416/516X

[Figures: images from www.affymetrix.com]

Simplified Example
[Figure: transcripts of gene 1 and gene 2; shared green regions indicate a high degree of sequence similarity throughout much of the transcript]
oligo probe for gene 1: ATTACTAAGCATAGATTGCCGTATA
oligo probe for gene 2: GCGTATGGCATGCCCGGTAAACTGG
Dan Nettleton, ISU Statistics 416/516X

Oligo Microarray Fabrication
• Oligos can be synthesized and stored in solution for spotting, as is done with cDNA microarrays.
• Oligo sequences can be synthesized on a slide or chip using various commercial technologies.
• In one approach, sequences are synthesized on a slide using inkjet technology similar to that used in color printers. Separate cartridges for the four bases (A, C, G, T) are used to build nucleotides on a slide.
• Affymetrix uses a photolithographic approach.
Dan Nettleton, ISU Statistics 416/516X

Affymetrix GeneChips
• Affymetrix (www.affymetrix.com) manufactures GeneChips, oligonucleotide arrays.
• Each gene (or sequence of interest or feature) is represented by multiple short (25-nucleotide) oligo probes.
• Some GeneChips include probes for around 60,000 genes.
• mRNA that has been extracted from a biological sample can be labeled (dyed) and hybridized to a GeneChip in a manner similar to that described for cDNA microarrays.
• Only one sample is hybridized to each GeneChip rather than two as in the case of cDNA microarrays.
Dan Nettleton, ISU Statistics 416/516X

Affymetrix Probe Sets
• A probe set is used to measure mRNA levels of a single gene.
• Each probe set consists of multiple probe cells.
• Each probe cell contains millions of copies of one oligo.
• Each oligo is intended to be 25 nucleotides in length.
• Probe cells in a probe set are arranged in probe pairs.
• Each probe pair contains a perfect match (PM) probe cell and a mismatch (MM) probe cell.
• A PM oligo perfectly matches part of a gene sequence.
• A MM oligo is identical to a PM oligo except that the middle nucleotide (13th of 25) is intentionally replaced by its complementary nucleotide.
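The PM/MM rule in the last bullet is easy to state in code. The Python sketch below is illustrative only: the function name `mismatch_probe` is hypothetical, and the 25-mer used as input is the gene 1 oligo from the Simplified Example slide, reused here purely as a convenient 25-nucleotide string rather than as an actual Affymetrix probe.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatch_probe(pm):
    """Build the MM oligo for a 25-nt PM oligo by complementing the 13th base."""
    if len(pm) != 25:
        raise ValueError("expected a 25-nucleotide oligo")
    middle = 12  # 0-based index of the 13th nucleotide
    return pm[:middle] + COMPLEMENT[pm[middle]] + pm[middle + 1:]


pm = "ATTACTAAGCATAGATTGCCGTATA"  # 25-mer borrowed from the Simplified Example slide
mm = mismatch_probe(pm)
print(pm)
print(mm)
```

Only the middle position differs between the two strings, which is what lets the MM cell serve as a control for non-specific hybridization to the PM cell.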
Dan Nettleton, ISU Statistics 416/516X

Affymetrix GeneChips
[Figure: mRNA reference sequence (5’ to 3’) with spaced DNA probe pairs; labels: Probe Set, Probe Pair, PM Probe Cell, MM Probe Cell]
Reference sequence:  …TGTGATGGTGGGAATGGGTCAGAAGGGACTCCTATGTGGGTGACGAGGCC…
Perfect match oligo: TTACCCAGTCTTCCCTGAGGATACAC
Mismatch oligo:      TTACCCAGTCTTGCCTGAGGATACAC
PM - 25 bases complementary to gene
MM - Middle base is different
Dan Nettleton, ISU Statistics 416/516X

Different Probe Pairs Represent Different Parts of the Same Gene
[Figure: probe pairs positioned along a gene sequence]
Probes are selected to be specific to the target gene and have good hybridization characteristics.
Dan Nettleton, ISU Statistics 416/516X

Obtaining Labeled Target for Affy Chips
1. RNA -> single-stranded cDNA
2. Single-stranded cDNA -> double-stranded cDNA
3. Double-stranded cDNA -> labeled single-stranded cRNA complementary to coding sequence
Number of copies of each sequence gets amplified in conversion to cRNA.
Dan Nettleton, ISU Statistics 416/516X

Pros/Cons GeneChip Arrays
• Consistent manufacture -> good standardization
• Comparable across experiments
• Design is time-consuming, good for large sets of chips
• Can only see what is on the chip
Dan Nettleton, ISU Statistics 416/516X

Affymetrix Data Processing Pipeline
• Experiment preparation: *.exp file
• Image of the scanned probe array: *.dat file
• Probe Cell Intensity file: *.cel file
• Analysis output: *.chp file
• MicroArray Suite or other analysis software
Dan Nettleton, ISU Statistics 416/516X

Reminder: Why do microarray experiments?
• Compare two (or more) conditions to identify differentially expressed genes
  • Control/treatment
  • Disease/normal
• Exploratory analysis
  • What genes are expressed in response to drought stress?
  • What gene expression changes occur during normal retinal development?
• Diagnostic & prognostic tool development:
  • Can we predict certain conditions (breast cancer vs normal)?
  • Can we identify patterns of gene expression that predict a patient’s response to treatment/drug?
Doina Caragea

Differential Gene Expression
• Are there significant differences in expression level between the conditions?
• Analysis of Variance (ANOVA)
[Figure: Mutant 1 (Inoculated, Control) and Mutant 2 (Inoculated, Control)]
Doina Caragea

Exploratory Analysis
• Find patterns in data to see what genes are expressed under different conditions
• Analysis includes clustering methods
• Used when little or no prior knowledge exists about the problem
Doina Caragea

Classification
• Learn characteristic patterns from a training set and evaluate with a test set.
• Classify tumor types based on expression patterns
• Predict disease susceptibility, stages, etc.
Doina Caragea

Microarray data analysis
• Preprocessing
  • normalization
  • scatter plots
• Inferential statistics
  • t-test
  • ANOVA
• Exploratory (descriptive) statistics
  • distances
  • clustering
  • principal components analysis (PCA)
Doina Caragea

Pre-processing
The main goal of data preprocessing is to remove any systematic bias in the data as completely as possible, while preserving the variation in gene expression that occurs because of biologically relevant changes in transcription.
Observed differences in gene expression could be due to transcriptional changes, or they could be caused by artifacts such as:
• different labeling efficiencies of Cy3, Cy5
• uneven spotting of DNA onto an array surface
• variations in RNA purity or quantity
• variations in washing efficiency
• variations in scanning efficiency
Doina Caragea (Page 191)

Inferential statistics
Inferential statistics are used to make inferences about a population from a sample. Hypothesis testing is a common form of inferential statistics. A null hypothesis is stated, such as: “There is no difference in signal intensity for the gene expression measurements in normal and diseased samples.” The alternative hypothesis is that there is a difference. We use a test statistic to decide whether to accept or reject the null hypothesis. For many applications, we set the significance level α to 0.05, that is, we reject the null hypothesis when p < 0.05.
Doina Caragea (Page 199)

Descriptive statistics
Microarray data are highly dimensional: there are many thousands of measurements made from a small number of samples. Descriptive (exploratory) statistics help you to find meaningful patterns in the data. A first step is to arrange the data in a matrix. Next, use a distance metric to define the relatedness of the different data points. Two commonly used distance metrics are:
-- Euclidean distance
-- Pearson coefficient of correlation
Doina Caragea (Page 203)

Limitations of Microarrays
• Link between proteins and expressed RNA not always clear
• Difficult to compare between microarray platforms
• Only see what is on the microarray
  • Gene finding is still an art
  • Other coding regions, “dark matter” on genome
  • But now microarrays for these are being developed, too!
Doina Caragea
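For reference, the two distance metrics named on the Descriptive statistics slide can be computed directly from their definitions. This is a minimal Python sketch with hypothetical expression profiles (the numbers are invented for illustration); the Pearson function uses population-style averages over the n measurements.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson correlation: cov(x, y) / (std(x) * std(y))."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (std_x * std_y)


# Hypothetical profiles: gene_b has the same shape as gene_a at 10x the level.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]
print(euclidean(gene_a, gene_b))
print(pearson(gene_a, gene_b))
```

The two profiles are far apart in Euclidean terms yet perfectly correlated, which illustrates why the choice of metric can change which genes end up clustered together.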