BCB 444/544 Fall 07 Nov 15 Lab 11

BCB 444/544 Lab 11
Microarray Analysis

Objectives

1. Learn how to use a set of tools to analyze raw data from a microarray experiment
2. Understand the reasoning behind each step, and the potential pitfalls of each
3. Gain a better understanding of potential applications of microarray experiments

Introduction

One of the primary developments resulting from (and contributing to) the many completed and ongoing genome sequencing projects is microarray technology. DNA microarrays allow researchers to rapidly perform thousands of gene-level experiments in parallel, as part of a genome-scale experiment. These high-throughput experiments allow genome-wide analysis of how a cell of interest reacts (at the transcriptome level) to various experimental conditions.

The Gene Expression Pattern Analysis Suite (GEPAS, at http://gepas.bioinfo.cipf.es/) is a wonderful collection of online tools for the analysis of microarray data. The primary intent of this site is to be a powerful research resource, by allowing researchers to upload their raw data and pipe it through a variety of modules to help them analyze their results. The site also provides a wealth of tutorial information and several real datasets, which make it an excellent educational resource. (Note: at the time this lab was updated, v4.0 of GEPAS still seemed a bit buggy in places.)

Today’s lab will show you how to go from raw data collected from a microarray chip to determining which genes are induced or repressed under experimental conditions. Due to time limitations we will not be able to cover each step in detail, and will focus primarily on preprocessing the data, followed by several options for clustering and then visualizing the gene expression profiles. You are of course encouraged to explore the other tutorials and exercises in more detail to gain a better understanding of the full path from data to results.
Keep in mind that this site could be an excellent platform for staging a research project. This lab is intended as a rudimentary introduction to the type of data used and some basic methods for manipulating it. For more detailed information, you are strongly encouraged to contact Pat Schnable and Dan Nettleton, both ISU professors with strong expertise in microarray experiment design.

Expression Profiler (http://www.ebi.ac.uk/microarray/index.html) at the European Bioinformatics Institute may provide similar and/or complementary functionality.

Local resources:

ISU has two microarray facilities that you can use to perform your own microarray experiments at a very reasonable cost:

Center for Plant Genomics (ISU PSI) - Pat Schnable - in Carver Co-Lab (http://www.plantgenomics.iastate.edu/microarray/)
GeneChip Facility (ISU Biotech & PSI) - Steve Whitham in MBB (http://www.biotech.iastate.edu/facilities/genechip/Genechip.htm)

ISU’s DNA facility (http://www.dna.iastate.edu/frame_microarray_ser.html) can assist you in synthesizing samples necessary for spotting a microarray chip.

Hui-Hsien Chou’s oligo design software, Picky (http://www.complex.iastate.edu/download/Picky/index.html), can help choose optimal primers for use in an experiment.

Di Cook’s exploRase (http://croc.vrac.iastate.edu/explorase/index.html) is a powerful visualization tool that can be used to visually find datapoints and patterns of interest.

Dan Nettleton (http://www.public.iastate.edu/~dnett/) is ISU’s leading expert in both experimental design and analysis of microarrays.

Exercises

Required questions are in red. Email completed exercise to: petez@iastate.edu by 5 p.m. on Monday, Nov. 26 (due to Thanksgiving break)

NOTE: For some reason, certain parts of this website behave oddly under Safari, so you should use Firefox instead.

Exercise 1 – Introduction to GEPAS

Most of today’s lab will be performed using GEPAS (http://gepas.bioinfo.cipf.es/).
Several links in the new v4.0 version of GEPAS appear to be incorrect, so we will use version 3.1 instead, located at http://gepas3.bioinfo.cipf.es/. (Within v3.1, several of the internal links start with http://gepas… and lead to pages that have text but no images. If you place a 3 immediately after gepas in the address bar and then hit enter, the correct page will load.)

Start by going to that link and familiarizing yourself with each of the four sections: Tools, Documentation, dataSets, and Publications.

In the Tools section, notice that there are several possible pathways through the workflow. We will be starting with two-color array information, and after preprocessing we will follow the bottom path of clustering, followed by visualization in CAAT. You could then continue to analysis of gene lists with tools such as FatiGO+.

In the Documentation section you will see a list of tutorials available for the various tools. If you click on one, you will land on a fairly detailed introduction page for that tool. At the top of the page, there are links to exercises and, in some cases, to other examples of how to use the tool.

Exercise 2 – Data Normalization and Preprocessing

After obtaining microarray data, there are several preprocessing steps that are usually taken to help ensure that we obtain reliable results. The first step is called data normalization, which can be done via the Diagnosis and Normalization for MicroArray Data (DNMAD) tool. We will not be doing this today, but it is important to understand why it is necessary. Read the first several paragraphs of the introduction document for this tool.

1) In your own words, describe in several sentences what normalization is (with respect to microarray data), and why we need to do it prior to any further processing of the data.
After normalization, we are left with our table of red/green ratios, where values between 0 and 1 represent repression of a gene under a particular experimental condition, relative to a control condition, while values greater than 1 represent induction of a gene, relative to the control. Before we can begin analyzing this data, however, it is important to do a bit more preprocessing to ensure that our analysis is meaningful.

First, read the tutorial on gene expression pattern preprocessing, then click on the Exercises link at the top of the tutorial. Answer the questions for exercise 1.1, which are listed as questions 2.1 – 2.7 on the submission form. (Note: I have tweaked several of the questions on the submission form, due to confusion last year, so make sure you read those questions, rather than the ones at the website.) The link to the dataset to be used in the exercise is incorrect; use this one instead: http://gepas3.bioinfo.cipf.es/cgi-bin/sets/diauxic

Before we proceed to clustering, we will take a detour to gain a better understanding of how the Pearson correlation coefficient is used to measure similarity between gene expression profiles, by experimenting with a toy data set in an Excel spreadsheet.

Exercise 3 - Exploring Correlation

The accompanying Excel spreadsheet (correl_explore_scenarios.xls) illustrates the concept of the Pearson correlation coefficient as a measurement of similarity between gene expression patterns. Each of the four scenarios in the spreadsheet begins with log-transformed gene expression ratios of two genes, as measured in eight different samples. We will refer to the set of eight numbers for a particular gene as a “gene expression pattern,” or simply “pattern.” The correlation coefficient between the two gene expression patterns is calculated by Excel and displayed in the grey-shaded area to the right of the pattern data.
The first graph for each scenario (on the left hand side) plots the gene expression pattern for each of the two genes. One way to think about the correlation coefficient is as a measure of how well the two patterns “track” each other. If the two patterns tend to go up and down together, from one sample to the next, then the patterns are highly positively correlated. The patterns in Scenario II have a fairly large positive correlation. The largest possible value for correlation is 1, and this occurs when the change from one sample to the next for one gene, divided by the change from one sample to the next for the other gene, is always the same positive number. In other words, the two gene expression patterns do not have to be the same order of magnitude to be highly correlated. For example, one gene may have values between –1 and 1, while the other gene has values between –100 and 100. If the two patterns tend to be opposites of one another, i.e. one goes up while the other goes down as you move from one sample to another, then the patterns are highly negatively correlated. The smallest possible value for correlation is –1.

The second graph for each scenario (on the right hand side) plots the log-transformed gene expression ratio for each sample as a point in the plane. The horizontal axis represents Gene 1, and the vertical axis represents Gene 2. The line of best fit (i.e. regression line) is shown on each graph of this type. If the line of best fit has a negative slope, the two patterns are negatively correlated; if the line has a positive slope, the two patterns are positively correlated. Note that the slope of the line does not measure the magnitude of the correlation. Rather, the magnitude of the correlation is determined by how close the points are to the line of best fit. If they are very close, the magnitude is large (near 1 or –1). If they are scattered far from the line, the magnitude is near 0. The patterns in Scenario I have a correlation near 0.
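
The properties described above can also be checked outside of Excel. The following is a minimal Python sketch, not part of the original lab materials: the pearson helper and all of the toy patterns are hypothetical illustrations and are not the values found in correl_explore_scenarios.xls. It assumes the standard definition of the Pearson coefficient (covariance divided by the product of the standard deviations).

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length patterns (hypothetical helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        # A constant pattern has zero variance, so the coefficient is undefined.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


# Made-up log-ratio patterns with eight samples each.
gene1 = [-1.0, -0.5, 0.0, 0.25, 0.5, 0.75, 1.0, 0.5]
gene2 = [100 * v for v in gene1]   # same shape, 100 times the magnitude
gene3 = [-v for v in gene1]        # mirror image of gene1
flat = [4.0] * 8                   # constant across samples

r_scaled = pearson(gene1, gene2)    # patterns track each other despite scale
r_opposite = pearson(gene1, gene3)  # patterns move in opposite directions
r_flat = pearson(gene1, flat)       # undefined, returned as NaN

# Two patterns that do not track each other at all, then one extreme sample.
gene_a = [1, -1, 1, -1, 1, -1, 1, -1]
gene_b = [1, 1, -1, -1, 1, 1, -1, -1]
r_base = pearson(gene_a, gene_b)
gene_a_outlier = gene_a[:7] + [100]  # change only sample 8 of gene_a
r_outlier = pearson(gene_a_outlier, gene_b)

print(r_scaled, r_opposite, r_flat, r_base, r_outlier)
```

Running the sketch and comparing r_base with r_outlier shows how a single extreme sample can pull an uncorrelated pair of patterns toward a sizeable correlation, which is the kind of behavior the spreadsheet scenarios are meant to explore.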
The following exercises guide you through a brief exploration of the correlation coefficient.

3.1 In Scenario I, a single number can be changed for Gene 1 that results in dramatic changes in the correlation. Use the two graphs for the scenario to guide your experimentation with the following changes.
  a. Change a single sample for Gene 1 that causes the correlation to jump up to approximately 0.68, and list the change made.
  b. Change Scenario I, Gene 1, Sample 8, from 100 to –150. Note that the correlation jumps down to approximately –0.63. Explain this jump by seeing what changes in each of the two graphs.

3.2 To help answer the following, first notice that in Scenario II, the pattern for Gene 2 is evenly spaced between 10 and 80, changing in increments of 10.
  a. Change the pattern for Gene 1 in Scenario II such that the correlation is exactly 1, and list the changes made. You will need to change all but one or two of the values.
  b. Change the pattern for Gene 1 in Scenario II such that the correlation is exactly –1, and list the changes made. You will need to change all but one or two of the values.

3.3 Scenario III illustrates how sensitive the correlation can be to small changes. Here we examine a gene whose log ratio changes substantially across samples and a gene with essentially constant log ratio across samples.
  a. Find a pair of samples for which Gene 2 can be changed from 7 to 6, resulting in a much larger positive correlation, and list the changes made.
  b. Return the two samples found in part (a) to their original values of 7, and find a new pair of samples for which Gene 2 can be changed from 7 to 6, resulting in a fairly large negative correlation, and list the changes made.

3.4 Scenario IV shows that correlation is undefined if one of the patterns is constant across samples. As in the previous scenario, changing just one of the values for Gene 2 has a significant effect on the correlation.
  a. Change the value for sample 1 from 4 to 3, and record the effect on correlation.
  b. Change the value for sample 8 from 4 to 3, and record the effect on correlation.
  c. Explain why one of these changes has a greater magnitude effect than the other.
  d. Which single change from 4 to 3 would give the correlation nearest to 0? Why?

Exercise 4 – Clustering and visualization

In this section we will focus on comparing a few of the different clustering methods using default settings, and using CAAT to visualize and compare the methods.

First, however, a few notes on some oddities in CAAT.

First, you can apparently only have one CAAT session running on a computer at a time. If you try to send results from a different experiment to CAAT while you still have a CAAT session running, you will simply see another window displaying your previous CAAT results. For this reason, it may be easiest to work with a partner on this portion of the lab (or on two computers if you come in later and there is nobody to work with), with each person having one of the CAAT windows open at a time. After you are finished with a given analysis, make sure to click Exit near the top of the CAAT viewer to terminate the session.

Second, as described in the CAAT tutorial, one way you can get a sense of the quality of each cluster - if you don’t already know which genes should be clustered together - is a measure called the silhouette index, which ranges from -1 to 1, with 1 being the best. When viewing the summary view of trees, this value should be displayed for each cluster, next to the red S, but for some reason this initially displays NA for each cluster. I did, however, discover a workaround that lets this value display properly. By clicking on a cluster, you will pop up a window with information on that cluster, including its silhouette index.
If you now click a button to Expand or Collapse one of the clusters, the correct S value will display for any clusters you have manually retrieved info from.

Finally…we are ready to begin this exercise. We will compare agglomerative hierarchical clustering (cluster), the self-organizing tree algorithm (SOTA), self-organizing maps (SOM) and k-means clustering for the diauxic shift dataset: http://gepas3.bioinfo.cipf.es/cgi-bin/sets/diauxic. For the first three methods, you can simply click the send to: link on the data sets page, using the data_norm.txt and data_norm.dat files. For the k-means clustering, you will need to download the data_norm.txt file, then upload it again to the k-means clustering utility under Tools. For all of these, you can simply use the default settings. You are of course free to experiment on your own to see what effects various parameters have on each method.

4. Write a couple of paragraphs comparing and contrasting the different methods. For the tree results, be sure to expand the tree several levels to get an idea of how well the genes are clustered at different levels. Did any of the four methods perform particularly poorly? How about particularly well?

You are not required to do this next bit, but it is very easy once you have some clustered results. After clicking on a cluster to get information on it, you can click “Send to FatiGO+” to send the list of genes in that cluster for analysis of various functional descriptors of each gene. Note the visual appearance of the expression profiles in each cluster. For which clusters would you expect to find that most of the genes share similar function(s)?

This lab includes material from: Campbell, M., & Heyer, L. (2006) Discovering Genomics, Proteomics, & Bioinformatics

Key concepts you should understand after completing this lab:

What is the relationship between the raw data that is initially scanned (i.e. the results from scanning with the green laser, and with the red laser), and the resulting table of numbers that are presented as the “raw” data?

What do the numbers in this raw data table represent?

What types of processing are performed on this raw data prior to clustering analysis, and why?

What are the major differences between the various clustering methods?