BCB 444/544 Fall 07 Nov 15
Lab 11
p. 1
BCB 444/544
Lab 11
Microarray Analysis
Objectives
1. Learn how to use a set of tools to analyze raw data from a microarray experiment
2. Understand the reasoning behind each step, and the potential pitfalls of each
3. Gain a better understanding of potential applications of microarray experiments
Introduction
One of the primary developments resulting from (and contributing to) the many completed and ongoing
genome sequencing projects is microarray technology. DNA microarrays allow researchers to rapidly perform
thousands of gene-level experiments in parallel, as part of a genome-scale experiment. These high throughput
experiments allow genome-wide analysis of how a cell of interest reacts (at the transcriptome level) to various
experimental conditions.
The Gene Expression Pattern Analysis Suite (GEPAS, at http://gepas.bioinfo.cipf.es/) is a wonderful
collection of online tools for the analysis of microarray data. The primary intent of this site is to be a powerful
research resource, by allowing researchers to upload their raw data and pipe it through a variety of modules to
help them analyze their results. The site also provides a wealth of tutorial information and several real datasets,
making it an excellent educational resource. (Note: at the time this lab was updated, v4.0 of GEPAS still
seemed a bit buggy in places.)
Today’s lab will show you how to go from raw data collected from a microarray chip, to determining
which genes are induced or repressed under experimental conditions. Due to time limitations we will not be able
to cover each step in detail, and will focus primarily on preprocessing the data, followed by several options on
how to cluster and then visualize the gene expression profiles. You are of course encouraged to explore the
other tutorials and exercises in more detail to gain a better understanding of the full path from data to results.
Keep in mind that this site could be an excellent platform for staging a research project.
This lab is intended as a rudimentary introduction to the type of data used and some basic methods for
manipulating it. For more detailed information, you are strongly encouraged to contact Pat Schnable and Dan
Nettleton, both ISU professors with strong expertise in microarray experiment design.
Expression Profiler (http://www.ebi.ac.uk/microarray/index.html) at the European Bioinformatics Institute may
provide similar and/or complementary functionality.
Local resources:
ISU has two Microarray Facilities that you can utilize to perform your own microarray experiments at a very
reasonable cost:
Center for Plant Genomics (ISU PSI) - Pat Schnable - in Carver Co-Lab
(http://www.plantgenomics.iastate.edu/microarray/)
GeneChip Facility (ISU Biotech & PSI) - Steve Whitham in MBB
(http://www.biotech.iastate.edu/facilities/genechip/Genechip.htm)
ISU’s DNA facility (http://www.dna.iastate.edu/frame_microarray_ser.html) can assist you in synthesizing
samples necessary for spotting a microarray chip.
Hui-Hsien Chou’s oligo design software, Picky (http://www.complex.iastate.edu/download/Picky/index.html)
can help choose optimal primers for use in an experiment.
Di Cook’s exploRase (http://croc.vrac.iastate.edu/explorase/index.html) is a powerful visualization tool that can
be used to visually find datapoints and patterns of interest.
Dan Nettleton (http://www.public.iastate.edu/~dnett/) is ISU’s leading expert in both experimental design and
analysis of microarrays.
Exercises
Required questions are in red. Email completed exercise to: petez@iastate.edu by 5 p.m. on Monday, Nov. 26
(due to Thanksgiving break)
NOTE: For some reason, certain parts of this website behave oddly under Safari, so you should use Firefox
instead.
Exercise 1 – Introduction to GEPAS
Most of today’s lab will be performed using GEPAS (http://gepas.bioinfo.cipf.es/). Several links in the new
v4.0 version of GEPAS appear to be incorrect, so we will use version 3.1 instead, located at
http://gepas3.bioinfo.cipf.es/. (Within v3.1, several of the internal links start with http://gepas… and lead to
pages that have text but no images. If you place a 3 immediately after gepas in the address bar and hit Enter, the
correct page will load.) Start by going to that link and familiarizing yourself with each of the four sections:
Tools, Documentation, dataSets, and Publications. In the Tools section, notice that there are several possible
pathways through the workflow. We will be starting with two-color array information, and after preprocessing
we will follow the bottom path of clustering, followed by visualization in CAAT. You could then continue to
analysis of gene lists with tools such as FatiGO+. In the Documentation section you will see a list of tutorials
available for the various tools. If you click on one, you will find that you are on a fairly detailed introduction
page to that tool. At the top of the page, you may notice that you can click on links to exercises and in some
cases to other examples of how to use the tool.
Exercise 2 – Data Normalization and Preprocessing
After obtaining microarray data, there are several preprocessing steps that are usually taken to help ensure that
we obtain reliable results. The first step is called data normalization, which can be done via the Diagnosis
and Normalization for MicroArray Data (DNMAD) tool. We will not be doing this today, but it is important to
understand why it is necessary. Read the first several paragraphs of the introduction document for this tool.
1) In your own words, describe in several sentences what normalization is (with respect to microarray data), and
why we need to do it prior to any further processing of the data.
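As a rough illustration of the mechanics only (this is not the DNMAD procedure, which you should read about in the tool's own documentation), one of the simplest normalizations shifts all log ratios on an array so that their median is zero. It rests on the assumption that most genes are not differentially expressed. The values and the function name below are hypothetical.

```python
import statistics

def median_center(log_ratios):
    """Shift log ratios so that their median becomes zero."""
    shift = statistics.median(log_ratios)
    return [value - shift for value in log_ratios]

# Hypothetical log2(red/green) values from one array; most sit near +0.5,
# which suggests a systematic bias rather than real induction.
raw = [0.5, 0.7, 0.4, 2.5, -1.5, 0.6, 0.5]
centered = median_center(raw)

print("median before:", statistics.median(raw))
print("median after: ", statistics.median(centered))
```

After centering, the two genes with large log ratios still stand out, while the bulk of the genes sit near zero.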
After normalization, we are left with our table of red/green ratios, where values between 0 and 1 represent
repression of a gene under a particular experimental condition, relative to a control condition, while values
greater than 1 represent induction of a gene, relative to the control. Before we can begin analyzing this data,
however, it is important to do a bit more preprocessing to ensure that our analysis is meaningful. First, read the
tutorial on gene expression pattern preprocessing, then click on the Exercises link at the top of the tutorial.
Answer the questions for exercise 1.1, which are listed as questions 2.1 – 2.7 on the submission form. (Note: I
have tweaked several of the questions on the submission form, due to confusion last year, so make sure you
read those questions, rather than the ones at the website.) The link to the dataset to be used in the exercise
is incorrect; use this one instead: http://gepas3.bioinfo.cipf.es/cgi-bin/sets/diauxic
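One preprocessing step worth seeing in numbers is the log transformation of the ratios. Raw ratios squeeze all repression into the interval between 0 and 1 while induction can run from 1 upward without limit. Taking log2 makes a 4-fold induction and a 4-fold repression equally far from zero. The sketch below uses made-up intensities and assumes red is the experimental channel and green is the control, as implied by the red/green ratios described above.

```python
import math

def log2_ratio(red, green):
    """Return log2(red / green) for one spot; both intensities must be positive."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(red / green)

# Hypothetical background-corrected intensities: (red, green) per gene.
spots = {
    "geneA": (800.0, 200.0),  # 4-fold induced
    "geneB": (200.0, 800.0),  # 4-fold repressed
    "geneC": (500.0, 500.0),  # unchanged
}

for name, (red, green) in spots.items():
    print(name, "ratio =", red / green, "log2 ratio =", log2_ratio(red, green))
```

The ratios 4.0 and 0.25 become 2.0 and -2.0, and the unchanged gene becomes 0.0.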
Before we proceed to clustering, we will take a detour to gain a better understanding of how the Pearson
correlation coefficient is used to measure similarity between gene expression profiles, by way of experimenting
with a toy data set in an Excel spreadsheet.
Exercise 3 - Exploring Correlation
The accompanying Excel spreadsheet (correl_explore_scenarios.xls) illustrates the concept of the Pearson
correlation coefficient as a measurement of similarity between gene expression patterns. Each of the four
scenarios in the spreadsheet begins with log-transformed gene expression ratios of two genes, as measured in
eight different samples. We will refer to the set of eight numbers for a particular gene as a “gene expression
pattern,” or simply “pattern.” The correlation coefficient between the two gene expression patterns is calculated
by Excel and displayed in the grey-shaded area to the right of the pattern data.
The first graph for each scenario (on the left hand side) plots the gene expression pattern for each of the two
genes. One way to think about the correlation coefficient is as a measure of how well the two patterns “track”
each other.
If the two patterns tend to go up and down together, from one sample to the next, then the patterns are highly
positively correlated. The patterns in Scenario II have a fairly large positive correlation. The largest possible
value for correlation is 1, and this occurs when the change from one sample to the next for one gene, divided by
the change from one sample to the next for the other gene, is always the same positive number. In other words, the
two gene expression patterns do not have to be the same order of magnitude to be highly correlated. For
example, one gene may have values between –1 and 1, while the other gene has values between –100 and 100.
If the two patterns tend to be opposites of one another, i.e. one goes up while the other goes down, as you move
from one sample to another, then the patterns are highly negatively correlated. The smallest possible value for
correlation is –1.
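If you would like to check these statements outside of Excel, the Pearson correlation coefficient is short enough to write by hand. The sketch below uses invented patterns (not the spreadsheet's) to show that scaling one pattern by 100 leaves the correlation at 1, and that flipping its sign gives –1. The function name is ours.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length patterns (NaN if either is constant)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("patterns must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)

# Invented log ratios for eight samples.
gene1 = [-1.0, -0.5, 0.0, 0.5, 1.0, 0.5, 0.0, -0.5]
gene2 = [100 * v for v in gene1]   # same shape, 100 times the magnitude
gene3 = [-v for v in gene1]        # mirror image of gene1

print("gene1 vs gene2:", pearson(gene1, gene2))
print("gene1 vs gene3:", pearson(gene1, gene3))
```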
The second graph for each scenario (on the right hand side) plots the log-transformed gene expression ratio for
each sample as a point in the plane. The horizontal axis represents Gene 1, and the vertical axis represents Gene 2.
The line of best fit (i.e. regression line) is shown on each graph of this type. If the line of best fit has a negative
slope, the two patterns are negatively correlated; if the line has a positive slope, the two patterns are positively
correlated. Note that the slope of the line does not measure the magnitude of the correlation. Rather, the
magnitude of the correlation is determined by how close the points are to the line of best fit. If they are very close,
the magnitude is large (near 1 or –1). If they are scattered far from the line, the magnitude is near 0. The patterns
in Scenario I have a correlation near 0.
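The difference between the slope of the line and the magnitude of the correlation is easy to confirm numerically. In this sketch (invented numbers, hypothetical function name), one Gene 2 pattern lies exactly on a shallow line and another is scattered around a much steeper line. The steeper line has the smaller correlation.

```python
import math

def slope_and_r(x, y):
    """Return (least-squares slope of y on x, Pearson correlation).

    Assumes neither pattern is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

gene1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
tight = [v / 2 for v in gene1]                                   # exactly on a shallow line
scattered = [40.0, -10.0, 60.0, 10.0, 80.0, 30.0, 100.0, 50.0]   # noisy, steep trend

tight_slope, tight_r = slope_and_r(gene1, tight)
scattered_slope, scattered_r = slope_and_r(gene1, scattered)

print("tight:     slope = %.2f, r = %.2f" % (tight_slope, tight_r))
print("scattered: slope = %.2f, r = %.2f" % (scattered_slope, scattered_r))
```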
The following exercises guide you through a brief exploration of the correlation coefficient.
3.1 In Scenario I, a single number can be changed for Gene 1 that results in dramatic changes in the
correlation. Use the two graphs for the scenario to guide your experimentation of the following changes.
a. Change a single sample for Gene 1 that causes the correlation to jump up to approximately 0.68,
and list the change made.
b. Change Scenario I, Gene 1, Sample 8, from 100 to –150. Note that the correlation jumps down to
approximately –0.63. Explain this jump by seeing what changes in each of the two graphs.
3.2 To help answer the following, first notice that in Scenario II, the pattern for Gene 2 is evenly spaced
between 10 and 80, changing in increments of 10.
a. Change the pattern for Gene 1 in Scenario II such that the correlation is exactly 1, and list the
changes made. You will need to change all but one or two of the values.
b. Change the pattern for Gene 1 in Scenario II such that the correlation is exactly –1, and list the
changes made. You will need to change all but one or two of the values.
3.3 Scenario III illustrates how sensitive the correlation can be to small changes. Here we examine a gene
whose log ratio changes substantially across samples and a gene with essentially constant log ratio
across samples.
a. Find a pair of samples for which Gene 2 can be changed from 7 to 6, resulting in a much larger
positive correlation, and list the changes made.
b. Return the two samples found in part (a) to their original values of 7, and find a new pair of
samples for which Gene 2 can be changed from 7 to 6, resulting in a fairly large negative
correlation, and list the changes made.
3.4 Scenario IV shows that correlation is undefined if one of the patterns is constant across samples. As in the
previous scenario, changing just one of the values for Gene 2 has a significant effect on the correlation.
a. Change the value for sample 1 from 4 to 3, and record the effect on correlation.
b. Change the value for sample 8 from 4 to 3, and record the effect on correlation.
c. Explain why one of these changes has a greater magnitude effect than the other.
d. Which single change from 4 to 3 would give the correlation nearest to 0? Why?
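The reason correlation is undefined for a constant pattern can be read off the formula: the denominator contains each pattern's spread about its own mean, and that spread is zero when every value is the same. Here is a minimal sketch, using made-up numbers rather than the spreadsheet's and a hypothetical helper that returns None where Excel would report an error.

```python
import math

def pearson_or_none(x, y):
    """Pearson correlation, or None when either pattern has zero spread."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    if spread_x == 0 or spread_y == 0:
        return None  # division by zero: correlation is undefined
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return covariation / math.sqrt(spread_x * spread_y)

varying = [2.0, -1.0, 3.0, 0.0, -2.0, 1.0, 4.0, -3.0]
constant = [5.0] * 8
nudged = constant[:3] + [4.0] + constant[4:]  # one value moved off the constant

print(pearson_or_none(varying, constant))  # None
print(pearson_or_none(varying, nudged) is not None)  # True
```

As soon as a single value moves, the spread is no longer zero and the correlation is determined entirely by where that one value sits.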
Exercise 4 – Clustering and visualization
In this section we will focus on comparing a few of the different clustering methods using default settings, and
utilizing Caat to visualize and compare the methods. Before we begin, note a few oddities in Caat. First, you
can apparently only have one Caat session running on a computer at a time. If you try to send results from a
different experiment to Caat, while you still have a Caat session running, you will simply see another window
displaying your previous Caat results. For this reason, it may be easiest to work with a partner on this portion of
the lab (or on two computers if you come in later and there is nobody to work with), with each person having
one of the Caat windows open at a time. After you are finished with a given analysis, make sure to click Exit
near the top of the Caat viewer to terminate the session. Second, as described in the Caat tutorial, one way you
can get a sense of the quality of each cluster - if you don’t already know which genes should be clustered
together - is a measure called the silhouette index, which ranges from -1 to 1, with 1 being the best. When
viewing the summary view of trees, this value should be displayed for each cluster, next to the red S, but for
some reason this initially displays NA for each cluster. I did, however, discover a workaround that lets
this value display properly. Clicking on a cluster pops up a window with information on that cluster,
including its silhouette index. If you now click a button to Expand or Collapse one of the clusters, the correct S
value will display for any clusters you have manually retrieved info from. Finally…we are ready to begin this
exercise.
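To get a feel for what the silhouette index measures, here is a sketch of the standard textbook definition. For each gene, a is its average distance to the other genes in its own cluster and b is its average distance to the nearest other cluster, and the silhouette value is (b - a) / max(a, b). We have not checked how Caat computes its value (distance measure, per-cluster averaging), so treat the Euclidean distance, the function names, and the toy profiles below as assumptions.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def silhouette_values(profiles, labels):
    """Return one silhouette value per profile (needs at least two clusters)."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    values = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            values.append(0.0)  # common convention for a single-member cluster
            continue
        a = sum(euclidean(profiles[i], profiles[j]) for j in own) / len(own)
        b = min(
            sum(euclidean(profiles[i], profiles[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        values.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return values

# Toy profiles: two rising genes and two falling genes.
profiles = [[1.0, 2.0, 3.0], [1.1, 2.1, 3.1], [3.0, 2.0, 1.0], [3.1, 2.1, 1.1]]
good = silhouette_values(profiles, [0, 0, 1, 1])  # rising together, falling together
bad = silhouette_values(profiles, [0, 1, 0, 1])   # each cluster mixes both shapes

print("sensible clustering:", [round(v, 2) for v in good])
print("mixed-up clustering:", [round(v, 2) for v in bad])
```

Values near 1 mean a gene sits much closer to its own cluster than to any other; negative values mean it would fit better somewhere else.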
We will compare agglomerative hierarchical clustering (cluster), the self-organizing tree algorithm (SOTA),
self-organizing maps (SOM) and k-means clustering for the diauxic shift dataset:
http://gepas3.bioinfo.cipf.es/cgi-bin/sets/diauxic. For the first three methods, you can simply click the send to:
link on the data sets page, using the data_norm.txt and data_norm.dat files. For the k-means clustering, you will
need to download the data_norm.txt file, then upload it again to the k-means clustering utility under Tools. For
all of these, you can simply use the default settings. You are of course free to experiment on your own to see
what effects various parameters have on each method.
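To give a sense of what one of these methods does under the hood, here is a bare-bones sketch of k-means (Lloyd's algorithm) on a handful of invented profiles. It is not the GEPAS implementation: the squared Euclidean distance, the choice of the first k profiles as starting centroids, and all names are assumptions made for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_profile(rows):
    """Column-wise mean of a list of equal-length profiles."""
    return [sum(column) / len(rows) for column in zip(*rows)]

def k_means(profiles, k, max_iter=100):
    """Cluster profiles into k groups; returns (labels, centroids)."""
    centroids = [list(p) for p in profiles[:k]]  # arbitrary starting point
    labels = None
    for _ in range(max_iter):
        # Assignment step: each profile joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in profiles
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(profiles, labels) if label == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = mean_profile(members)
    return labels, centroids

# Invented log ratios over four time points: three rising genes, two falling.
profiles = [
    [0.1, 1.0, 2.0, 2.9],
    [-0.1, -1.0, -2.0, -3.1],
    [0.0, 1.1, 2.1, 3.0],
    [0.0, -0.9, -2.1, -3.0],
    [0.2, 0.9, 1.9, 3.1],
]
labels, centroids = k_means(profiles, k=2)
print(labels)  # [0, 1, 0, 1, 0]
```

Unlike the tree-based methods, k-means needs the number of clusters up front, and a different starting point can lead to a different result. Both are worth keeping in mind when you compare the methods.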
4. Write a couple of paragraphs comparing and contrasting the different methods. For the tree results, be sure to
expand the tree several levels to get an idea of how well the genes are clustered at different levels. Did any of
the four methods perform particularly poorly? How about particularly well?
You are not required to do this next bit, but it is very easy once you have some clustered results. After clicking
on a cluster to get information on it, you can click “Send to FatiGO+” to send the list of genes in that cluster to
analyze various functional descriptors of each gene. Note the visual appearance of the expression profiles in
each cluster. For which clusters would you expect to find that most of the genes share similar function(s)?
This lab includes material from:
Campbell, M., & Heyer, L. (2006). Discovering Genomics, Proteomics, & Bioinformatics.
Key concepts you should understand after completing this lab:
What is the relationship between the raw data that is initially scanned (i.e. the results from scanning with the green laser, and with the
red laser), and the resulting table of numbers that are presented as the “raw” data? What do the numbers in this raw data table
represent? What types of processing are performed on this raw data prior to clustering analysis, and why? What are the major
differences between the various clustering methods?