Lecturer's notes on microarray analysis

Microarray Analysis
Microarray analysis is a method to measure changes in gene expression (actually RNA
content) over a large number of genes at the same time. DNA representing each gene
(called a "probe") is placed on very small spots on a solid support. The mRNA (called a
"target") is converted to fluorescently labeled cDNA and hybridized to the array.
Analysis then reveals the genes that are expressed at a higher or lower level in one
target than in the other. Although the target is typically mRNA, microarray
experiments are also used to quantify microRNAs, to clarify splicing patterns, or, applied
directly to genomic DNA, to measure amplification of genes in cancer cells. The underlying motive
when measuring gene expression is that more mRNA usually means that more protein is
made. The mechanism could either be a change in the rate of transcription, or a change
in the rate of mRNA turnover. One always has to keep in mind that translational control,
alteration of protein turnover, or regulation by protein modification may also be
occurring, and will not be reflected by the amount of mRNA present.
The history of microarray analysis is summarized in:
• Fodor SP et al., Science 251:767-73 (1991).
• Schena M et al., Science 270:467-70 (1995).
Producing the targets
As with all RNA work, it is essential to avoid contamination with RNase when preparing
the mRNA. This requires RNase inhibitors, specially cleaned reagents, fastidious
technique, and an assay for RNA degradation. The classical assay for RNA degradation
is a Northern blot. More recently, microfluidic electrophoresis instruments such as the
Agilent Bioanalyzer have been used.
This image from Agilent's Bioanalyzer web advertisement emphasizes the relative
intensity of the 18S and 26S rRNA bands as an indicator of degradation. Note, however,
that rRNA is much more resistant to RNase than is mRNA. There would be no mRNA
at all surviving in the sample to the right.
Typically, polyadenylated RNA is subjected to reverse transcription primed with oligo
dT. The label can be incorporated as dye-conjugated dUTP, although each label has a
different efficiency of incorporation. It is also possible to incorporate amino allyl-dUTP,
and conjugate the dye to the amino group after the cDNA has been made.
cDNA can be accumulated by linear amplification. For prokaryotic mRNA, random
hexamer primers are used to prime conversion to cDNA. Cycling in this instance risks
generating great variation in the representation of different mRNA species.
Because the method requires reproducible kinetics of hybridization, achieving a
comparable concentration of cDNA in each of the samples is critical for producing usable
results.
Probes
The probes are tightly arrayed on a chip. The different variations can be divided into
short oligos (typically 25 nt), long oligos (typically 60 nt), and full cDNAs.
Different Microarray Instruments.
(Miller and Tang, [2009] Clin. Microbiol Rev. 22:611-633).
In situ synthesis
The best known microarray system is made by Affymetrix (http://www.affymetrix.com).
The oligonucleotide probes are synthesized on the solid support by a photolithographic
technique. This makes use of a blocking group on each added nucleotide that is removed
by a photo-induced reaction, allowing the next nucleotide to add. A "mask" is applied so
that only the spots scheduled to receive a particular base at this position (e.g. a T below)
are illuminated. Then the chip is washed with the reagent to add a T (a nucleotidyl
phosphoramidite). The mask is then shifted to deprotect the spots that are to get another
base (e.g. a C), and then that reagent is added. This is repeated 4 times at each position
(once for each of the four bases) until an array of 25 nt long oligonucleotides is built up
on the chip. These are typically called "short oligo" arrays.
Affymetrix tries to put ~11 oligos per gene, all from the 3' UTR, on each chip, and pairs
each of them with an oligo that has a mismatched base in the center. The idea is that the
signal from the mismatched probe can be used to subtract non-specific hybridization from
the signal from the perfectly matched probe. This is called PM-MM scoring. However, it
has been shown that correcting by the mismatched signal actually produces noisier data
than ignoring the mismatched signal (Millenaar et al., BMC Bioinformatics 7:137 [2006]).
So most users ignore the MM signal. A widely used method for reanalyzing the data
without PM-MM subtraction is RMA (Robust Multi-array Average). The average fluorescence
intensity after excluding outliers (probes whose signals do not change in the same pattern
as the others) is called the "expression value".
Short oligo arrays have less sensitivity than long oligo arrays. For low concentration
targets (e.g. microbiological samples) a preliminary PCR step may be required to amplify
the target. Discrimination of a single mismatch becomes better at shorter lengths. The
signal from a single mismatch is ~ 25% at a probe length of 19 nt. The lithographic
method can produce very high density arrays with a million probes per chip.
Another manufacturer of high density in situ synthesized microarrays is Roche NimbleGen. It gets around the Affymetrix patent on photolithographic synthesis by
using micro-mirrors to focus light on the spots to be extended rather than a lithographic
mask. Agilent uses high density in situ synthesized microarrays where the reagents to
add each successive base are supplied in a focused way by an inkjet printing process.
NimbleGen and Agilent equipment supports two color analysis, whereas Affymetrix
equipment supports only one color analysis.
Spotted (printed) arrays
Before microarrays, people made hybridization arrays by simply spotting DNA on a glass
slide. This strategy has been scaled down through the use of robotic spotting machines
that dip an array of needles into a microplate with solutions of different probes and then
touch them to a glass slide or other solid support. Spotted arrays can be made directly
from denatured cDNAs (or more commonly PCR amplicons from cDNAs), in which case
the DNA sticks to the glass support by electrostatic interaction. However, in order to
distinguish between closely related members of gene families, they are usually made
from synthetic oligonucleotides which are typically 60 nt long and correspond to 3'
untranslated regions. Oligos are bound to the support through a modified 5' or 3' end. To
distinguish from the lithographically synthesized arrays, printed oligo arrays are often
called "long oligo" arrays, although they can be made with oligos of any length. Printed
arrays usually have a density of only 10,000 - 30,000 spots, and have less redundancy per
gene.
The best known vendor of printed microarray equipment is Agilent. UTHSCSA has
Agilent equipment used under the supervision of Dr. Yidong Chen at CCRI. Core
facilities can provide substantial savings by buying a set of oligos, printing the arrays
locally, and distributing the cost over multiple users.
Most spotted array equipment allows two dye measurements. In two dye (or two
channel) measurements, two different targets are labeled with dyes of different colors.
The most commonly used dyes are named Cy3 (green) and Cy5 (red). Both targets are
hybridized at once to the chip, and the colors are analyzed separately. This has the
advantage of automatically normalizing for variation in the amount of probe from spot to
spot on the array. For two color analysis the raw readout is M = log2(R/G) = log2(R) -
log2(G), where R and G are the red and green intensities of a spot. A plot of M for each
gene against A = 1/2 (log2(R) + log2(G)) is called an MA plot.
(MA plot figure from Wikipedia.)
The MA plots will typically be subjected to low level normalization based on the
assumption that the average gene is not differentially expressed.
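The M and A values defined above can be computed in a couple of lines of R. The following is only a minimal sketch: the intensities are made-up numbers, and centering on the median is one simple way (an assumption here, not necessarily what any particular package does) to impose the idea that the typical gene is unchanged. Real pipelines often use an intensity-dependent correction instead.

```r
# Hypothetical red (Cy5) and green (Cy3) intensities for five spots.
R <- c(1200, 850, 4000, 300, 960)
G <- c(1000, 900, 1000, 600, 1000)

M <- log2(R) - log2(G)          # log ratio, the y axis of an MA plot
A <- (log2(R) + log2(G)) / 2    # average log intensity, the x axis

# Simple global normalization (an assumption for this sketch): shift M so
# that the median spot has M = 0.
M_norm <- M - median(M)

print(round(data.frame(A, M, M_norm), 3))
```

A spot with R four times G gets M = 2 before normalization, and one with R half of G gets M = -1.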
Bead Arrays
Illumina mounts their oligos on 3 micron beads, that fit into wells in a microplate such
that fiber optic sensors attach to each bead. Their two platforms are called Sentrix Array
Matrixes (SAM) or Sentrix BeadChips. The beads have to go through a process called
"decoding" to figure out what oligo sequence became attached to each sensor. Basically,
when the probe oligos are initially synthesized they are concatenated to a 29 nt sequence
designed to be easily identifiable by a series of hybridizations to short oligonucleotides.
Hybridization to these oligos is carried out first, identifying the address attached to each
optical fiber, and hence the associated probe sequence. The Illumina technology is
related to their next generation sequencing technology, and they sell a platform that will
do both kinds of analysis. Bead Arrays are configured for fewer probes and many more
target samples.
Sources of noise in microarray experiments.
Technical noise.
Probe efficiency variation: There can be variation in the amount of probe per spot, or
the efficiency of hybridization due to formation of hairpins within probes, or due to the
Tm being out of range. The first step of analysis is typically an imaging of the hybridized
spots and the exclusion of spots that are misshapen or compromised by dust or other
inappropriate fluorescent signals in the image.
Image of a section of a microarray from Howlader & Chaubey, IEEE Trans Image
Process 19:1953-1967 (2010).
Nonlinear hybridization kinetics:
To understand the capabilities and limitations of a microarray experiment, a comparison
to its forerunner, the Northern Blot, is given below.
Northern Blot
In a Northern Blot, the RNA from a cellular preparation is run on a denaturing agarose
gel and then the pattern is transferred to a filter to which the RNA becomes permanently
affixed. For a given gene, a cDNA is prepared and either radiolabeled or fluorescently
labeled. The cDNA is called the "probe" in this case. The probe is hybridized to the
filter over a long period of time until all complementary RNA on the filter is duplexed.
After a wash to eliminate probe that is not duplexed, the filter is imaged. Since the
hybridization was to completion, the amount of signal is proportional to the amount of
the cognate RNA on the filter.
Image from Wikimedia Commons.
The strengths of the Northern Blot are that quantification is accurate, degradation of the
RNA (or the lack of it) is apparent, and it has a large dynamic range. Its weakness is that for
each gene, one has to prepare a separate labeled probe, strip the filter, and then hybridize
again. Hence Northern blots are not applicable to analyzing any appreciable number of
different genes. For analyzing large numbers of genes (up to 23,000) a microarray
experiment is used. For analyzing smaller numbers (up to 100) qPCR is now the
preferred method.
Kinetics of microarray hybridization
In microarray analysis, a probe for each gene is placed on a spot on a supporting material,
such that thousands of genes may be represented on a small area (called a "chip").
Fluorescently labeled cDNA corresponding to total RNA is prepared and hybridized to
the chip. The RNA is called the "target" or sometimes the "treatment". Typically one
will be comparing the amount of RNA present under two different circumstances (e.g.,
with and without application of a drug to cells in culture), so the usual outcome is to
identify genes which are more or less heavily represented in the RNA of cells for
treatment A vs. treatment B. In this case, if the target cDNA were hybridized to
completion, then all quantitative information would be lost. The signal intensity would
represent the amount of probe on the chip, not the amount of RNA in the target. Instead
of hybridizing to completion, the target is hybridized for a fixed amount of time so that
each spot is partially duplexed. Since the on rate for hybridization is related to the
concentration of the hybridizing species, the signal produced in a fixed hybridization time
reflects the concentration of the complementary cDNA in the target.
The accumulation of target cDNA on the probe is linear up to a point and then the spot
becomes saturated. cDNAs present at low levels may not exceed the amount of probe.
Those (red line in the fig. above) will begin to saturate at lower levels because the off rate
becomes significant as the amount of free cDNA approaches zero. The presence of nonspecific opportunities for low level cDNAs to interact with the chip may differ from spot
to spot on the chip and produce spot to spot variation in the signal due to localized
exhaustion of cDNA. The end result is that the signal tends to have poor linearity,
expression changes may be attenuated by saturation, and random variation can be
introduced into the results. Significant expression differences can be lost in the random
background from nonspecific interaction with the chip. This problem is discussed by
Chudin et al., (2002) Genome Biol. 3(1), RESEARCH0005, and Ono et al.,
Bioinformatics. (2008) 24:1278-85.
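The argument for stopping hybridization at a fixed time can be illustrated with a toy kinetic model. The sketch below assumes simple Langmuir binding with a constant free target concentration (so it ignores the local depletion effects just described), and the rate constants and concentrations are arbitrary illustrative values, not measurements.

```r
# Fraction of probe duplexed after time t under a simple Langmuir model:
#   d(theta)/dt = k_on * conc * (1 - theta) - k_off * theta
# Assumes the free target concentration stays constant (no depletion).
duplexed_fraction <- function(conc, t, k_on = 1, k_off = 0.01) {
  rate <- k_on * conc + k_off
  (k_on * conc / rate) * (1 - exp(-rate * t))
}

conc <- c(0.001, 0.002, 0.1, 0.2)   # hypothetical relative target concentrations

short <- duplexed_fraction(conc, t = 1)     # fixed, short hybridization time
long  <- duplexed_fraction(conc, t = 1e4)   # essentially run to completion

# Signal ratios for a two-fold difference in target concentration
print(short[2] / short[1])   # close to 2: signal tracks concentration
print(long[4] / long[3])     # close to 1: both spots are nearly saturated
```

Under these assumptions the short hybridization preserves the two-fold difference, while the hybridization run to completion compresses it toward a ratio of 1.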
Noise with bioinformatics sources.
Probes are usually placed in the 3' untranslated regions of genes for two reasons: 1)
Many genes fall in families with sufficiently closely related paralogs that hybridization to
a coding region oligonucleotide would not distinguish between the family members. 2)
The cDNA is usually primed with oligo dT, and the amount of cDNA produced falls off
with distance from the polyA tail. Hence it is necessary to accurately predict the
polyadenylation sites of each gene. Two computer programs are in use for this purpose:
1) polyadq, and 2) Genescan. Neither is 100% reliable, and sometimes gene expression
is missed because the probe was placed downstream of the polyA addition site. Different
placement of the probe from one chip design to the next, coupled to different potentials
for cross hybridization to related sequences and the generally nonlinear response of the
method have historically led to extensive discrepancies in expression profiles from one
experiment to another. These properties are exacerbated when working from a genomic
sequence that is still in early stages of finishing.
Noise due to specifics of different oligos.
Variation in signal intensity among different oligos for the same gene (modified from Li & Wong
(2001) PNAS 98, 31-36). Each point is a different oligo in the same gene. The variation in oligo
response is generally greater than the differences from one treatment to another. An oligo's
response can be low because of hairpin formation or because its Tm is out of range. Oligo
response can be high due to cross hybridization to other species in the cDNA.
Many of the technical sources of noise should not affect a relative change in expression
level, and two color measurements should help.
In principle, replicate microarray hybridizations (for both treatments) could be used to
drive down the noise. However, microarray experiments typically cost $1000 - $3000
per chip, so most experimenters do one chip per treatment, accept that the results are
noisy, and then proceed to use qPCR to filter out some genes with actual expression
changes. Some microarray setups allow measuring two colors. If treatment A is labeled
in one color and treatment B in another, then the ratio A/B measured over each spot
should be free of variation due to the amount of probe present on the spot.
Another problem is that the total amount of RNA for treatment A and B may not have
been the same, or the cDNA synthesis may not have occurred comparably. Typically
there will be a normalization intended to enforce the assumption that most RNAs did not
change intensity between the two treatments.
Noise from Biological Variation
Deciding that there is confidence in a finding of differential gene expression requires a
statistical analysis that apportions the variation observed between differences in gene
expression and sources of random variation. Consider for example that RNA
preparations were made from the livers of several mice, some receiving no treatment (A)
and some treated with a drug (B). Consider the following experiment that would require
8 mice, four chips, a two color instrument, and 8 labeling reactions:
Here, if we had only done one chip with one treated and one untreated mouse (replicate
1), we might have concluded that gene 1 expression was suppressed by the drug while
gene 2 expression was unaffected. After running four replicates, we would realize that
expression of gene 1 was subject to extensive biological variation, whereas gene 2 is
relatively steadily expressed. The statistics to determine whether the average change
between A and B is significant given the variation from replicate to replicate are
beyond the scope of this course. But clearly without enough replicates there will be
genes identified as being affected by the drug treatment that are, in fact, not.
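Although the proper statistics are not covered here, the basic idea can be sketched with the simplest possible test. The example below uses made-up log ratios for four replicate chips and a plain one-sample t-test per gene. Gene 3 is a hypothetical extra gene added for contrast, and this is a stand-in for illustration, not the moderated statistics that dedicated microarray packages use.

```r
# Hypothetical log2(B/A) ratios from four two-color chips (one per mouse pair).
M <- rbind(
  gene1 = c(-2.0,  0.5,  1.2, -0.3),    # looks suppressed on chip 1 only
  gene2 = c( 0.05, -0.02, 0.03, -0.04), # steady, no change
  gene3 = c( 1.1,  0.9,  1.0,  1.2)     # hypothetical consistently induced gene
)

# One-sample t-test per gene: is the mean log ratio different from 0?
p <- apply(M, 1, function(m) t.test(m, mu = 0)$p.value)
print(signif(p, 2))
```

Judged on the first chip alone, gene 1 would look strongly suppressed, but across the four replicates only gene 3 shows a change that stands out from its replicate-to-replicate variation.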
At first we may wonder if the "biological" variation was really variation of expression
among the livers of these different mice, or technical variation in how much total mRNA
was recovered from each mouse, or how efficiently each total mRNA sample was
converted to labeled target. However, if there are a large number of genes like gene 2 whose
expression appears very steady across replicates, that would suggest that our procedure
for labeled target production was consistent. More often, the RNA preparations won't be
as consistent as we'd like, but it would be possible to notice that some large majority of
genes varied consistently in expression from replicate to replicate, and then we would
normalize all the signals based on the assumption most genes are expressed the same
from mouse to mouse. Similarly, the steadily expressed set of genes may show a
consistent difference between treatment A and B. Commonly a normalization is imposed
to enforce the assumption that the expression of most genes is not different between
treatment A and B.
Alternatively, two chips can be devoted to the same A and B samples, only with
the target labeling (red vs. green) reversed. This is called a dye swap.
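The reason a reversed-label pair of chips helps can be shown with a small numerical sketch. The values are invented, and the additive model (observed log ratio = true effect plus a gene-specific dye bias) is an assumption made for illustration.

```r
# Hypothetical log2(red/green) values for three genes on two chips.
# Chip 1: B labeled red, A labeled green. Chip 2: dyes swapped.
M_chip1 <- c(geneX = 1.3,  geneY = 0.3, geneZ = -0.7)
M_chip2 <- c(geneX = -0.7, geneY = 0.3, geneZ = 1.3)

# A true B/A effect changes sign on the swapped chip; a dye bias does not.
effect   <- (M_chip1 - M_chip2) / 2   # estimated log2(B/A)
dye_bias <- (M_chip1 + M_chip2) / 2   # estimated gene-specific dye bias

print(rbind(effect, dye_bias))
```

Here geneY would look induced on either chip alone, but the pair shows that its apparent change is entirely dye bias.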
The proper statistics for dealing with data of this type are beyond the scope of this (or any
introductory statistics) course, but exist in a number of varieties that go by names like
"linear model fitting" and "empirical Bayesian analysis". The mainline software
package for finding confidence levels that expression has changed is limma, which is
free software that runs within the R programming language and is distributed as part of
the Bioconductor project (http://www.bioconductor.org/).
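An example of R code to produce the MA plots with normalization seen above:
# Load the affy package and, if available, its example data set
library(affy)
if (require(affydata))
{
data(Dilution)
}
# Raw probe intensities for two of the arrays
y <- (exprs(Dilution)[, c("20B", "10A")])
dev.new()  # open a new plot window (portable replacement for x11())
# MA plot before normalization: A on the x axis, M on the y axis
ma.plot( rowMeans(log2(y)), log2(y[, 1])-log2(y[, 2]), cex=1 )
title("Dilutions Dataset (array 20B v 10A)")
library(preprocessCore)
#do a quantile normalization of the raw intensities
x <- normalize.quantiles(y)
dev.new()
# MA plot after normalization
ma.plot( rowMeans(log2(x)), log2(x[, 1])-log2(x[, 2]), cex=1 )
title("Post Norm: Dilutions Dataset (array 20B v 10A)")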
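To make the quantile normalization step less of a black box, here is a minimal base R sketch of the idea: every array is forced to have the same distribution of intensities by replacing each value with the mean of the values that share its rank. The function name and the toy matrix are made up for this illustration, and tie handling is simplified, so results may differ slightly from a package implementation.

```r
# Minimal quantile normalization sketch (hypothetical helper, base R only).
quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")  # rank within each array
  sorted <- apply(x, 2, sort)                          # each column sorted
  target <- rowMeans(sorted)                           # mean of each quantile
  out <- apply(ranks, 2, function(r) target[r])        # map ranks to the means
  dimnames(out) <- dimnames(x)
  out
}

# Hypothetical intensities: 4 probes (rows) on 3 arrays (columns)
x <- cbind(a1 = c(5, 2, 3, 4), a2 = c(4, 1, 4, 2), a3 = c(3, 4, 6, 8))
xn <- quantile_normalize(x)

print(apply(xn, 2, sort))   # every column now has the same sorted values
```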
Limma will want to fit a "model". The model is the experimental design with respect to
replicate samples, and is best understood by some examples. If one made cDNA from
one untreated mouse and labeled it green, and from one treated mouse and labeled it red
and then did a single dual color hybridization experiment, limma would be given a
"model" that specified that experimental arrangement. One would probably specify some
sort of invariance selection algorithm, in which some of the probes are assumed to detect
RNAs that are expressed the same in both mice, and these are used to normalize for
differences in the amount of cDNA made, the detection efficiency of the two dyes, and
similar systematic differences between the two target cDNAs.
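In practice the "model" is handed over as a design matrix. The sketch below builds one with base R for a hypothetical single-channel layout of three untreated and three treated mice (a simpler layout than the two-color example just described). The sample names and the `expr` matrix mentioned in the comments are assumptions for illustration.

```r
# Hypothetical single-channel layout: 3 untreated (A) and 3 treated (B) mice.
targets <- data.frame(
  sample    = paste0("mouse", 1:6),
  treatment = factor(c("A", "A", "A", "B", "B", "B"), levels = c("A", "B"))
)

# Columns: intercept (mean of A) and treatmentB (B minus A, on the log scale).
design <- model.matrix(~ treatment, data = targets)
print(design)

# With the limma package installed, a matrix of log expression values
# (here called `expr`, hypothetical) could then be analyzed along these lines:
#   fit <- limma::eBayes(limma::lmFit(expr, design))
#   limma::topTable(fit, coef = "treatmentB")
```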
Another program used to determine confidence in expression differences is SAM.
• SAM runs as a tool within Microsoft Excel version 2000 or above, and only on a
Microsoft Windows operating system (Windows 2000 or above).
• It acts as an interface to the R programming system, which must also be installed.
• The user must provide for normalization of the data and export to an Excel
spreadsheet format.
• As with limma, the user must describe the model of the experiment to SAM.
• SAM can be obtained from http://www-stat.stanford.edu/~tibs/SAM/
• The literature reference is V. Tusher, R. Tibshirani, and G. Chu. Significance
analysis of microarrays applied to transcriptional responses to ionizing radiation.
Proc. Natl. Acad. Sci. USA, 98:5116–5121, 2001.
• An example of SAM summary output (from the SAM web site):
Clustering
In considering whether genes are differentially expressed, there is usually an attempt to
consider them in groups that coordinate to carry out some function. Placing genes in
groups is called "classification" or "clustering".
Two different clustering strategies are in use:
1) Unsupervised: Enumerate genes showing a coordinated change between two targets.
Ask what gene ontologies are in common (more than by random chance) among that set
of genes.
This illustrates clustering and the use of a heat map for visualization of a gene's
expression profile. In this case the expression being measured is degree of induction (or
repression) at times after initiation of sporulation in yeast. For easy visualization, the
numeric values are converted to a shade of red or green indicating degree of induction or
repression. See the example on the left. Then the genes are stacked in an order that puts
the most similar expression patterns together. That's at right. Each gene is a thin slice of
this heat map. Genes that don't show much change at all are left out to compress the
display.
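The stacking of genes by similarity of expression pattern can be sketched with the clustering functions that ship with R. The expression values below are invented, and the choice of correlation distance with average linkage is an assumption for this example rather than the method used for the figure.

```r
# Hypothetical log2 induction ratios: 6 genes (rows) at 4 time points (columns).
expr <- rbind(
  g1 = c( 0.1,  1.0,  2.0,  2.5),
  g2 = c( 0.0, -1.1, -2.0, -2.4),
  g3 = c( 0.2,  1.2,  1.9,  2.6),
  g4 = c(-0.1, -0.9, -2.1, -2.6),
  g5 = c( 0.0,  0.9,  2.2,  2.4),
  g6 = c( 0.1, -1.0, -1.8, -2.5)
)
colnames(expr) <- c("t0", "t1", "t2", "t3")

# Distance = 1 - Pearson correlation between expression profiles.
d  <- as.dist(1 - cor(t(expr)))
hc <- hclust(d, method = "average")

ordered <- expr[hc$order, ]    # rows stacked so similar profiles are adjacent
groups  <- cutree(hc, k = 2)   # split the tree into two clusters
print(groups)

# heatmap(expr, Colv = NA) would draw a clustered heat map of the same data.
```

With these numbers the induced genes (g1, g3, g5) fall in one cluster and the repressed genes (g2, g4, g6) in the other.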
For association with function, the genes are broken into groups representing a similarity
of expression pattern. These are compared to groups of genes defined as to commonality
of function (Gene Ontology groups, or GO groups). Each observed group of genes will
overlap each preconceived functionally defined group by a certain number of genes. A
statistic is assigned expressing the expectation that a randomly picked set of genes of the
same size would overlap by the same number of genes. The GO groups are then sorted with
the ones whose overlap is least likely to have arisen by chance listed first:
From the DAVID EASE web site: genes with GO terms enriched in a set of 400 genes
whose expression was judged to be changed in peripheral blood mononuclear cells
incubated with HIV envelope protein.
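One common way to put a number on "more overlap than expected by chance" is the hypergeometric distribution. The sketch below uses invented counts and is only meant to show the shape of the calculation. Individual tools may use a modified version of this test.

```r
# Hypothetical counts for one GO group.
total_genes <- 10000   # genes represented on the array
in_go_group <- 200     # of those, genes annotated with the GO term
in_hit_list <- 400     # genes judged to be differentially expressed
overlap     <- 20      # genes in both the hit list and the GO group

# Overlap expected if the hit list were a random draw from the array
expected <- in_hit_list * in_go_group / total_genes

# Probability of an overlap at least this large by chance
p_value <- phyper(overlap - 1,
                  m = in_go_group,
                  n = total_genes - in_go_group,
                  k = in_hit_list,
                  lower.tail = FALSE)

print(c(expected = expected, p_value = p_value))
```

Repeating this for every GO group and sorting by p-value gives a ranked list of the kind described above. Because many groups are tested, the p-values would normally also be corrected for multiple testing.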
There are a variety of commercial tools that paint expression changes on maps of specific
pathways. A non-commercial tool of this kind is GenMAPP
(http://www.genmapp.org/default.html).
A demonstration image from the GenMAPP web site showing in red the proteins in the
yeast galactose metabolism pathway for which gene induction with galactose was
observed in a microarray experiment.
A more complex example showing differential expression among a number of different
human embryonic stem cell lines:
2) Supervised: Evaluate for preconceived sets of genes thought to collaborate on a
function whether their expression has changed in aggregate between the two targets. A
tool for this is GSEA (Gene Set Enrichment Analysis).
In this analysis, every gene in the preconceived GO group (which they prefer to call a
gene set) is marked in the clustered heat map. This is a difference from the unsupervised
method, where genes that didn't change much were not explicitly accounted for. Since
the heat map is ordered so that most strongly induced genes are at the top and the most
strongly repressed ones are at the bottom, a gene set that was coordinately induced or
repressed would have its members clustered near the top or bottom, respectively, of the
heat map. Here we see a partial correlation. More of this gene set's members are
clustered near the top than would be expected by random chance. The analysis then
proceeds by trying to suggest a subset within the gene set that is correlated with the
expression data. Unlike the unsupervised method, this method could detect that a subset
of genes from a group is significantly clustered in the highly induced set even though
the overall set is not significantly correlated. For each subset of the gene set, starting with
the most induced and including genes in order of induction strength, a statistic (the
enrichment score) is calculated that expresses how much more the subgroup is clustered
toward the top than expected by random chance. When that statistic reaches its maximum
and starts to decline, a boundary is established for the correlated subset.
Figure 1 from Subramanian et al., PNAS 102:15545 (2005) illustrating the GSEA
supervised classifier. The paper didn't really explain what "phenotype A and B" were,
but I'm thinking of them as replicates of treatment A and B.
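The running statistic can be sketched in a few lines. This is a simplified, unweighted version written for illustration: the gene names and set membership are invented, every set member contributes an equal step, and no permutation test for significance is included, so it should not be taken as a reimplementation of the GSEA software.

```r
# Genes ranked from most induced (first) to most repressed (last). Names and
# gene set membership are hypothetical.
ranked_genes <- paste0("g", 1:20)
gene_set     <- c("g1", "g2", "g4", "g5", "g9", "g18")

in_set <- ranked_genes %in% gene_set
n_hit  <- sum(in_set)
n_miss <- length(ranked_genes) - n_hit

# Walk down the ranked list: step up at each set member, down otherwise.
running <- cumsum(ifelse(in_set, 1 / n_hit, -1 / n_miss))

es   <- running[which.max(abs(running))]   # enrichment score
peak <- which.max(running)                 # position where the sum peaks

# Set members at or before the peak form the correlated subset.
leading_edge <- ranked_genes[in_set & seq_along(ranked_genes) <= peak]

print(es)
print(leading_edge)
```

In this toy ranking the running sum peaks at the fifth gene, so g1, g2, g4 and g5 form the subset, while g9 and g18 fall outside the boundary. A gene set enriched among repressed genes would instead need the minimum of the running sum.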
Overall data processing pipeline:
The following is a series of steps that might be used in interpretation of microarray data.
Typically, earlier steps might be done with instrumentation-specific software, whereas
later steps would typically be done with R/Bioconductor, SAM, and/or other third-party
packages. In some cases, the instructions for the downstream package will specify not to
do one or more of the more primitive corrections, because that has already been written
into the downstream analysis.
• Image inspection: removal of spots that have been compromised.
• Background subtraction: removal of some amount of intensity from each spot
representing non-specific interaction with the probe.
• Signal averaging and outlier exclusion: If there are multiple probes per gene,
probes that give results discordant with the others are excluded, and then the
remaining signals are averaged.
• Normalization: Imposing the assumption that the average gene is not
differentially expressed.
• Statistical analysis: Estimation of the within group variation and estimation of
confidence in between group variation. This usually involves specifying an
acceptable false discovery rate, and may involve specifying a minimal between
group differential expression.
• Clustering/Classification: Identification of groups of genes that collaborate on a
common function that have varied coordinately with the experimental variable
(e.g. drug vs. no drug; normal vs. diseased).
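As an illustration of the false discovery rate step, the sketch below applies the Benjamini-Hochberg adjustment built into R to a handful of invented p-values. Benjamini-Hochberg is used here as one common choice, and other packages (SAM, for instance) estimate the false discovery rate in their own way.

```r
# Hypothetical p-values for five genes from a per-gene test.
p <- c(geneA = 0.001, geneB = 0.01, geneC = 0.02, geneD = 0.03, geneE = 0.5)

# Benjamini-Hochberg adjusted p-values. Calling a gene significant when the
# adjusted value is below 0.05 targets a false discovery rate of 5%.
q <- p.adjust(p, method = "BH")
selected <- names(q)[q < 0.05]

print(round(q, 4))
print(selected)
```

A minimum fold change could be imposed in the same way, by additionally requiring the absolute log ratio of each selected gene to exceed a chosen cutoff.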
Validation
Given the large potential for noise to obscure microarray results, essentially all genes said
to be differentially expressed should be subjected to validation. The most common
method is qPCR. Other possibilities are Northern blots or in situ hybridization. qPCR
can easily handle many more replicates, so a common strategy is to do relatively minimal
replicates by microarray (keeping the cost down), and then do more biological replicates
by qPCR.
Glossary of Microarray Jargon:
Biological variation: The degree of variation in results observed if RNAs from several
individuals thought to be in the same physiological state are examined.
Classifier: An algorithm to group data from genes according to functionally related
clusters of genes.
• Unsupervised classification - Without prior information, genes found to have a
similar expression profile in a given data set are grouped together.
• Supervised classification - The coordinated changes in expression of
preconceived groups of genes are examined in a given data set.
Dye-swap: Repeating a two color experiment (two targets labeled with different color
dyes and hybridized to one chip) with the dye switched between the two targets.
False Discovery Rate: The fraction of genes above a given expression ratio expected to
have been placed there as a result of noise.
Feature: a spot containing an amount of a specific probe sequence.
GSEA (Gene Set Enrichment Analysis): The practice of targeting measurement of
differential expression of preconceived sets of genes thought to collaborate on a function
(i.e. a supervised classifier). Also the name of a specific software solution for carrying out
this analysis (Subramanian et al., [2005] PNAS 102:15545-15550;
http://www.broadinstitute.org/gsea).
Miss rate: The false negative rate. The expected number of genes missing from the set
discovered to meet a threshold of significance.
Model: A description of which dyes were used, which targets were hybridized at the
same time, how experimental variables (e.g. drug vs. no drug; time points; control vs.
diseased), and replicates are represented within the number of targets hybridized.
MSigDB: The database of gene sets used by GSEA. The genes are organized into sets
according to one of the following principles:
• Positional gene sets - grouped by human chromosome and cytogenetic band.
• Curated gene sets - based on online pathway databases, publications in PubMed,
and knowledge of domain experts.
• Motif gene sets - based on conserved cis-regulatory motifs.
• Computational gene sets - defined by a tendency for coexpression with one of 803
cancer-associated genes.
• GO gene sets - annotated by the same Gene Ontology terms.
Over fitting: Attributing meaning to a pattern of expression that fails to replicate. Too
few biological replicates in an experiment leads to a high risk of over fitting.
Technical variation: The degree of variation in results observed if the same RNA is
reanalyzed.