Microarray Data Analysis

Tutorial at ISMB 2008
Mark Reimers
Virginia Commonwealth University
Quality assessment
Normalizing expression arrays
Normalizing other array types
Selecting genes
Identifying functional groups
Array Quality Assessment
Approaches to Quality Control
Careful Technique
Testing the RNA quality
Examining the chips and images
Spot metrics
Statistical approaches
– Examining deviations as functions of technical variables
Simple QA for Spotted Arrays
• Spot Measures
– Signal/Noise
• Foreground / background or
– foreground / SD
– Uniformity
– Spot Area
• Global Measures
– Qualitative assessments
– Averages of spot measures
• Inspect images for artifacts
– Streaks of dye, scratches etc.
• Are there biases in regions?
Using Statistics for QA
• Significant artifacts may not be obvious
from visual inspection or bulk statistics
• General approach: plot deviations from
average or residuals from fit against any
technical variable:
– Intensity
– CG content or Tm
– Probe position relative to 3’ end (for poly-T
– Location on chip
• Color-code deviations on chip image
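The general approach above can be sketched in a few lines. This is only an illustration under stated assumptions: each chip is a plain list of log intensities in the same probe order, and the helper names `deviations_from_average` and `binned_mean` are hypothetical, not taken from any package. A trend in the binned averages against a technical variable (intensity, Tm, probe position) suggests a technical bias on that chip.

```python
from statistics import mean

def deviations_from_average(chips):
    """chips: list of per-chip lists of log intensities (same probe order).
    Returns, for each chip, its deviation from the across-chip average."""
    n_probes = len(chips[0])
    avg = [mean(chip[p] for chip in chips) for p in range(n_probes)]
    return [[chip[p] - avg[p] for p in range(n_probes)] for chip in chips]

def binned_mean(deviation, covariate, n_bins=4):
    """Average deviation within equal-count bins of a technical covariate.
    Returns a list of (mean covariate, mean deviation) pairs, one per bin."""
    order = sorted(range(len(covariate)), key=lambda p: covariate[p])
    size = len(order) / n_bins
    summary = []
    for b in range(n_bins):
        members = order[int(b * size):int((b + 1) * size)]
        summary.append((mean(covariate[p] for p in members),
                        mean(deviation[p] for p in members)))
    return summary

# Toy data: two chips whose deviations drift in opposite directions with Tm.
tm = [50, 52, 54, 56, 58, 60, 62, 64]            # hypothetical probe Tm values
chips = [[8 + 0.05 * (t - 57) for t in tm],
         [8 - 0.05 * (t - 57) for t in tm]]
trend_chip0 = binned_mean(deviations_from_average(chips)[0], tm)
```

Plotting the pairs in `trend_chip0` (or the raw deviations) against the covariate gives the kind of diagnostic plot described above.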
Simple Portrait - Boxplots
• Boxplot of 16 chips from Cheung et al
Nature 2005
Another Portrait - Densities
Density Plots:
before and after
Ratio vs Intensity Plots:
Saturation & Quenching
• Saturation
– Decreasing rate of
binding of RNA at
higher occupancies on
• Quenching:
– Light emitted by one
dye molecule may be
re-absorbed by a
nearby dye molecule
– Then lost as heat
– Effect proportional to
square of density
Plot of log ratio against
average log intensity
across chips
GSM25377 from the
CEPH expression data
How Much Variability on R-I?
• Ratio-Intensity plots for six arrays at
random from Cheung et al Nature (2005)
Covariation with Probe Tm
• MAQC project
• Agilent 44K
– Array 1C3
– Performed by
•Plot of log ratios to average against Tm
•Bimodal distribution because two
samples are very different
Covariation with Probe Position
• RNA degrades
from 5’ end
• Intensity should
decrease from
3’ end uniformly
across chips
• affyRNAdeg
plots in affy
Plot of average intensity for
each probe position across all
genes against probe position
Spatial Variation Across Chips
Red/Green ratios
show variation
Ratios of ratios
on slide to ratios
on standard show
consistent biases
In House Spotted Arrays
Ratio of ratios shows
much clearer
concentration of red
spots on some slides
Note non-random but
highly irregular
concentration of red
Background Subtraction (1)
• We think that local background
contributes to bias
• Does subtracting
background remove the bias?
Local off-spot background may
not be the best estimate of
spot background (nonspecific hyb)
Background Subtraction (2)
Raw spot ratios show
a mild bias relative to
After subtracting a
high green bg in the
center a red bias
Raw Ratios Background
Other Bias Patterns
This spotted oligo array
shows strong biases at
the beginning and end of
each print-tip group
The background shows
a milder version of this
Subtracting background
compensates for about
half this effect
Raw Spot
Local Bias on Affymetrix Chips
Image of raw data on a log2
scale shows striations but no
obvious artifacts
Image of ratios of probes to
standard shows a smudge
Images show high values as red, low values as yellow
Variation in Affy Chips
QC in Bioconductor
• Robust Multi-chip Analysis (RMA)
– fits a linear model to each probe set
– High residuals show regional patterns
High residuals in green
Available in affyPLM package
at www.bioconductor.org
Portion of dChip QA image
High residuals in pink
Affy QC Metrics in Bioconductor
• affyPLM package fits
probe level model to
Affymetrix raw data
• NUSE - Normalized
Unscaled Standard Error
– normalized relative to
each gene
• How many big errors?
• Technical bias is a significant source of
error in microarray studies
• Technical bias can be visualized and
• Normalization can only partially
compensate these problems
• It is best to drop chips with extreme
technical deviations
Are Microarrays Reliable?
• Microarray studies have an uneven track
record of replication
• Huang et al (2003) replicated few of the markers
for breast cancer survival identified by van 't Veer et
al (2002)
• Two Science papers in 2003 on stem cell gene
expression describing parallel experiments identify
only 2 genes in common out of hundreds
• The MAQC project papers in fall 2006 claimed to
validate microarrays. Measures of consistency
were highly variable across any two platforms
• Maybe both are true:
– careful microarray studies are accurate
Further Questions
Normalizing Expression Data
QA and Normalization
• Technical differences cause changes in
• QA flags big technical differences in arrays
– Often sporadic – e.g. scratches on chip
• Normalization attempts to compensate for
modest but systematic differences
– e.g. intensity-dependent bias (quenching)
• Both must be done together
Variations in Technique with Broad
Consequences for Measures
Temperature of hybridization
Amount of RNA
Degradation of RNA
Yield of conversion to cDNA or cRNA
Yield of labeling reaction
Strength of ionic buffers
Stringency of wash
Many Normalization Approaches
• One Parameter
– Total or median brightness
• Two parameter
– Variance stabilizing
– Lowess for two-color arrays
• Non-parametric
– Distribution (quantile) matching
• Regression on technical covariates
One Parameter – Overall Mean
• Can only measure relative levels of expression: per
mg RNA
• Assume: only difference between chips is due to
different weights of RNA hybridized
• Set:
$$C_{red} = \sum_{i=1}^{N} f_i^{red} ; \quad C_{green} = \sum_{i=1}^{N} f_i^{green}$$
• For each chip, normalized values are:
$$f_i^{*,red} = f_i^{red} / C_{red} ; \quad f_i^{*,green} = f_i^{green} / C_{green}$$
• For 2-color cDNA use separate constant for each
channel on each chip
• More consistent results if use a robust estimator,
such as median or 1/3 – trimmed mean
(Quackenbush): take mean of middle 2/3 of probes
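A minimal sketch of one-parameter scaling with the robust estimator mentioned above (mean of the middle 2/3 of probes). The function names are hypothetical and the intensities are made-up numbers; for two-color arrays the same call would be applied to each channel of each chip separately.

```python
def middle_two_thirds_mean(values):
    """1/3-trimmed mean: drop the lowest and highest sixth, average the rest."""
    ordered = sorted(values)
    cut = len(ordered) // 6              # number of probes dropped from each tail
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)

def scale_normalize(intensities, estimator=middle_two_thirds_mean):
    """Divide every intensity on one chip (or channel) by a single constant."""
    c = estimator(intensities)
    return [f / c for f in intensities]

chip = [120.0, 95.0, 4000.0, 210.0, 180.0, 60.0]   # one saturated outlier
normalized = scale_normalize(chip)
```

Because the outlier at 4000 falls in the trimmed tail, it does not affect the scaling constant, which is the reason for preferring a robust estimator here.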
One-Parameter Limitations
Density Plots:
before and after
A centering transform can
shift the density of log2 data but
can’t get around the differences in shape of distributions
Intensity Dependent Bias
Ratio – Intensity (M-A) plot of raw data:
M = log2(R/G) ; A = (log2(R) + log2(G)) / 2
Global (lowess) Normalization
Same data set normalized by: Mnorm = M-c(A)
where c(A) is an intensity dependent function
estimated by local regression.
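The following sketch shows the idea of Mnorm = M - c(A) on toy data. It is a simplified stand-in, not the lowess routine used in any particular package: c(A) is estimated by a tricube-weighted local straight-line fit without the robustness iterations of full lowess, and the function names and the `frac` default are assumptions for illustration.

```python
import math

def ma_values(red, green):
    """Return (M, A): M = log2(R/G), A = (log2(R) + log2(G)) / 2."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [(math.log2(r) + math.log2(g)) / 2 for r, g in zip(red, green)]
    return m, a

def local_linear_fit(x, y, x0, frac=0.5):
    """Tricube-weighted straight-line fit around x0, evaluated at x0."""
    k = max(2, int(math.ceil(frac * len(x))))
    dists = sorted(abs(xi - x0) for xi in x)
    h = dists[k - 1] or 1e-12            # bandwidth: distance to k-th nearest point
    w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)

def lowess_normalize(red, green, frac=0.5):
    """Return (M_norm, A) where M_norm = M - c(A)."""
    m, a = ma_values(red, green)
    c = [local_linear_fit(a, m, a0, frac) for a0 in a]
    return [mi - ci for mi, ci in zip(m, c)], a

# Toy two-color data with an intensity-dependent dye bias.
red = [150.0, 400.0, 900.0, 2500.0, 6000.0, 15000.0, 30000.0, 52000.0]
green = [100.0, 300.0, 800.0, 2400.0, 6500.0, 18000.0, 40000.0, 75000.0]
m_norm, a = lowess_normalize(red, green)
```

This version costs O(n^2) and is only meant to make the definition concrete; on real arrays a library lowess would be used.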
Print-tip Normalization
Print-tip layout
Separate lowess curves for
each of 16 print-tips
Box plots for log ratios in
each of 16 print-tip groups
Scaled Print-tip Normalization
Box plot after print-tip normalization
Box plot after scaled print-tip normalization
There still remain apparent technical artifacts
Try to fix them by scaling after print-tip lowess: Mp,norm = sp·(Mp-cp(A));
Spatial Effects
No normalization
Global normalization
Print-tip normalization
Scaled Print-tip normalization
This is too Complex!
• We are piling fudge factor upon fudge factor
• We don’t know what errors these fudges
are introducing…
• … and still haven’t removed all artifacts
• How about something simpler!
Quantile Normalization
• Currently most widely used ‘best’ method
• Implemented in BioC affy, oligo, limma
• Ignores causes of variation and technical
• Shoehorns all data into the same shape
distribution – matching quantiles
Motivation: Probe Intensities in 23
Replicates on Affy chips
Densities of intensities
from GeneLogic spikein
study; black is composite
Quantile Normalization
(Irizarry et al 2002)
1. Map values in each curve separately to
their quantile within the distribution
2. Map quantiles of each distribution to
quantiles of the reference curve
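The two steps can be sketched as follows. As an assumption, the reference curve is taken to be the mean of the sorted values across chips (a common choice); ties are broken by position, which a production implementation would handle more carefully. The function name is hypothetical.

```python
def quantile_normalize(chips):
    """chips: list of equal-length lists of intensities, one list per chip."""
    n = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    # Reference distribution: mean of the k-th smallest value across chips.
    reference = [sum(s[k] for s in sorted_chips) / len(chips) for k in range(n)]
    normalized = []
    for chip in chips:
        order = sorted(range(n), key=lambda i: chip[i])   # probe indices by rank
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]      # step 2: rank -> reference quantile
        normalized.append(out)
    return normalized

chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 9.0]]
normalized = quantile_normalize(chips)
```

After normalization every chip has exactly the same set of values; only the assignment of values to probes differs between chips.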
The mapping by
quantile normalization
Ratio Intensity Correction?
M-A plots of raw data
from Affy chip pairs
Ratio-Intensity: After Quantile Norm
Critiques of Quantile Normalization
• Artificially compresses variation of highly
expressed genes
• Confounds systematic changes due to
cross-hybridization with changes in
abundance to genes of low expression
• Induces artifactual correlations in low-intensity genes
Technical Variable Regression
1. Most technical variation between chips is
caused by a few systemic factors
2. Probes with similar technical characteristics
(Tm, position in gene, location on chip,
intensity) will be distorted by similar amounts
Normalization: estimate bias due to
technical factors by local averaging of
deviations from reference profile
Further Questions
Models for Multiple-Probe
Affymetrix and NimbleGen Data
Many Probes for One Gene
Gene 5´
oligo probes
Perfect Match
How to combine signals from multiple probes
into a single gene abundance estimate?
Probe Variation
• Individual probes don’t agree on fold
• Probes vary by two orders of magnitude
on each chip
– CG content is most important factor in signal strength
Signal from 16 probes
along one gene on
one chip
Probe Measure Variation
•Typical probes are two orders of magnitude different!
•CG content is most important factor
•RNA target folding also affects hybridization
Many Approaches
• Affymetrix MicroArray Suite
• dChip - Li and Wong, HSPH
• Bioconductor:
– RMA - Bolstad, Irizarry, Speed, et al
– affyPLM – Bolstad
– gcRMA – Wu
• Physical chemistry models – Zhang et al
Critique of Averaging (MAS5)
• Not clear what an average of different
probes should mean
• Tukey bi-weight can be unstable when
data cluster at either end – frequently the
conditions here
• No ‘learning’ based on cross-chip
performance of individual probes
Motivation for multi-chip models:
Probe level data from spike-in study ( log scale )
note parallel trend of all probes
Courtesy of Terry Speed
Model for Probe Signal
• Each probe signal is proportional to
– i) the amount of target sample – a
– ii) the affinity of the specific probe sequence to the
target – f
• NB: High affinity is not the same as Specificity
– Probe can give high signal to intended target and also
to other transcripts
[Figure: probes 1, 2, 3 measured on chip 1 and chip 2; probe affinities f1, f2, f3]
Multiplicative Model
• For each gene, a set of probes p1,…,pk
• Each probe pj binds the gene with
efficiency fj
• In each sample there is an amount ai.
• Probe intensity should be proportional to fj × ai
• Always some noise!
Robust Linear Models
• Criterion of fit
– Least median squares
– Sum of weighted squares
– Least squares and throw out outliers
• Method for finding fit
– High-dimensional search
– Iteratively re-weighted least squares
– Median Polish
Bolstad, Irizarry, Speed – (RMA)
• For each probe set, take the log transform of
$PM_{ij} = a_i f_j$:
$$\log(PM_{ij}) = \log(a_i) + \log(f_j)$$
• i.e. fit the model:
$$\log(PM_{ij} - bg) = a_i + b_j + \epsilon_{ij}$$
After normalization!
• Fit this additive model by iteratively reweighted least-squares or median polish
Critique: Model assumes probe noise is
constant (homoscedastic) on log scale
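As an illustration of fitting the additive model by median polish, here is a minimal sketch for a single probe set. It assumes the input is already background-corrected, normalized and log-transformed, with rows as chips (i) and columns as probes (j); the function name, iteration limit and tolerance are assumptions, and this is not the code of any Bioconductor package.

```python
from statistics import median

def median_polish(table, max_iter=10, tol=1e-6):
    """Fit table[i][j] ~ overall + row_eff[i] + col_eff[j] by median sweeps.
    Returns (overall, row_eff, col_eff, residuals)."""
    rows, cols = len(table), len(table[0])
    resid = [list(r) for r in table]
    overall, row_eff, col_eff = 0.0, [0.0] * rows, [0.0] * cols
    for _ in range(max_iter):
        change = 0.0
        for i in range(rows):                       # sweep out row medians
            m = median(resid[i])
            resid[i] = [v - m for v in resid[i]]
            row_eff[i] += m
            change += abs(m)
        m = median(col_eff)                         # keep column effects centered
        col_eff = [c - m for c in col_eff]
        overall += m
        for j in range(cols):                       # sweep out column medians
            m = median(resid[i][j] for i in range(rows))
            for i in range(rows):
                resid[i][j] -= m
            col_eff[j] += m
            change += abs(m)
        m = median(row_eff)                         # keep row effects centered
        row_eff = [r - m for r in row_eff]
        overall += m
        if change < tol:
            break
    return overall, row_eff, col_eff, resid

# Toy probe set: 3 chips x 4 probes of log2(PM - bg), one outlying probe value.
log_pm = [[7.0, 8.1, 9.0, 11.0],
          [8.0, 9.0, 10.1, 12.0],
          [10.0, 11.0, 15.0, 14.0]]
overall, chip_eff, probe_eff, resid = median_polish(log_pm)
expression = [overall + a for a in chip_eff]        # per-chip summary (log scale)
```

The per-chip estimate is `overall + chip_eff[i]`; large entries in `resid` flag probes that disagree with the fit on a given chip.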
Comparing Measures
Green: MAS5.0; Black: Li-Wong; Blue, Red: RMA
20 replicate arrays – variance should be small
Standard deviations of expression estimates on arrays
arranged in four groups of genes
Courtesy of Terry Speed
by increasing mean expression level
Current Multi-chip Approaches
• PLIER – Affymetrix’ own multi-chip model
• RMA – Bioconductor standard
– Affy package
– To be replaced by oligo package?
• affyPLM
– More sophisticated RMA
– Careful with more sophisticated models!
• gcRMA
– Estimates background
Further Questions
Outline of Section
General Issues
Normalizing CGH arrays
Normalizing ChIP-chip arrays
Normalizing SNP arrays
Normalizing methylation (HELP) arrays
General Issues
• Quantile (and other expression normalizations)
assume that overall distribution of values is
invariant across samples
– Roughly true for most expression sets
– Untrue for copy number, DNA methylation, ChIP-chip
• Many choices for expression probes
• Tiling, exon, and promoter arrays must work with
what is there
Array CGH Technology
Array CGH Data:
Chromosome 8 (241 BAC’s)
10 cell lines and 45 tumor samples
CGH Array Pre-processing
• Dye bias effects are present but because
the ranges are usually smaller they have
less impact
– Most programs don’t do a dye bias
• Probe bias effects
– Most probes show consistent biases in
relation to their neighbors
– About 50% of total variance is such bias
– Few programs adjust for that
– See Bilke et al (Bioinformatics 2005)
CGH Array Segmentation
• Key idea: Most probe targets have same
copy number as their next neighbors
• Can average over neighbors
• Key issue: when is a difference real?
• Recommended Programs:
• DNACopy – Solid statistical basis; slow
• StepGram – Heuristic ; fast
Further Questions
Chromatin Immuno-precipitation
Tiling Array
• One probe every n base pairs over some
length of chromosome
What the data look like
Superimposing channels
Giresi et al, Genome Res. 10
Methods and Issues
• Normalization
– Different enrichment ratios
– Different probe thermodynamics
– Dye and probe bias
• Estimation
– Categorical or continuous?
– Individual values are noisy:
• Where is the peak?
• Ignores normalization
• ‘Shrinkage’ estimator of variance
– Improves individual scores
• Smooths noise by moving average
• Alternative HMM
• One color array
– Needs separate array for comparison
• Normalizes probe thermodynamics &
enrichment ratio
• Estimation by (robust) moving average
Further Questions
SNP chips
Affy SNP chips
SNP Chip Probe Design
• 10 25-mers
overlapping the SNP
• Alleles A & B
• Sense and Anti-sense
– or PM and MM (old)
RMA for SNP chips
• Early Affy software wasn’t very accurate
• Rabbee & Speed (2006) proposed RLMM,
an RMA-like method using:
– Quantile normalization
– Two variables ( A & B signals)
– Discriminant analysis (Mahalanobis metric) to
tell apart alleles
• Much better than Affy software
• Has been adopted by Affy
Discriminating SNPs
SNP A 1753037
SNP A 1698299
• RLMM still had problems with several
hundred SNPs
• Also problems comparing across labs
• CRLMM addresses these by careful
CRLMM Overview
• Model intensity on each chip separately by
• Summarize θA+, θB+, θA−, θB− by median polish: M+ = θA+ −
θB+ ; M− = θA− − θB−
• Model log ratio bias on each chip by
• Estimate log ratio bias using E-M
– Zi indexes which SNP state is likely
Normalization – Step 1
• Regress (PM) intensity on sequence
predictors and fragment length
Normalization – Step 1
• Too many hb(t)’s
– Impose constraint:
• hb(t) is a cubic spline with 5 df on [1,25]
• Forces neighboring values of h to be close
• Allows variation in smoothness (unlike loess)
• Subtract fitted values from signal
• BUT: bias still present
Summary Method
Median Polish
For arrays of numbers
Iterative method
Subtract medians of each row and each
column (and accumulate) until the medians converge
Normalization Step 2
• Fit bias function:
• But what is k?
E-M Algorithm
Systematic way to ‘guess and improve’
Start with putative assignments to classes
Estimate bias and fit
Use residuals from fit to classify again
Final Step: Calling
• Separation in two-dimensional log-ratio
Further Questions
Exploratory Data Analysis
• Can be useful to detect unsuspected affiliations
among clinical samples or genes
• For most designed experiments significance
testing or multivariate analysis are better choices
When is Clustering Useful?
• To examine data for artifacts
• To try to identify sub-categories in
undifferentiated clinical samples
• To impute roles / functions for genes by
Caveats for Clustering
• Ordering of samples in dendrogram is
arbitrary – not meaningful relation
Samples 3 & 6 could
be on the right
• Distance metrics are usually arbitrary
• Correlation ‘distance’ usually improves if
increase dynamic range
Multivariate Exploration
Example – Ape Brain Gene Exp.
• Enard et al, Science
• Nice separation
between species
Caveats for PCA
• Multivariate methods are sensitive
– Pick up subtle patterns
– Can easily pick up artifacts in data
1. The clustering does
not reflect phylogeny
2. The principal components
don’t load heavily onto
distinct subsets of genes
Further Questions
Selecting Genes
Multiple Comparisons Issues
• Multiple Testing
– Family wide error rates
– False discovery rates
• Application to microarray data
– Practical issues
– Computing FDR by permutation procedures
– Conditioning t-scores
• To identify genes most likely to be
changed or affected
• To prioritize candidates for focused follow-up studies
• To characterize functional changes
consequent on changes in gene
Individual Variation Within Groups
• Numerous genes show high levels of inter-individual variation
• Level of variation depends on tissue also
• Donors or experimental animals may be infected
or under social stress
• Tissues are hypoxic or ischemic for variable
times before freezing
• Tissues are mixtures of different types of cells –
proportions may vary
• Inflammatory cells recruited differently
Technical Variation
Temperature of hybridization
Amount of RNA
Degradation of RNA
Yield and lengths of cDNA conversion
Strength of ionic buffers
Stringency of wash
Scratches on chip
Random variations in number of molecules
Multiple comparisons
• Suppose no genes really changed
– (as if random samples from same population)
• 10,000 genes on a chip
• Each gene has a 5% chance of exceeding
the threshold at a p-value of .05
– Type I error
• The test statistics for 500 genes should
exceed .05 threshold ‘by chance’
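The arithmetic above (10,000 genes × 0.05 = 500) can be checked with a small simulation. This is a sketch that assumes independent tests, so that null p-values are uniform on [0, 1]; the seed and variable names are arbitrary.

```python
import random

random.seed(0)
n_genes, alpha = 10_000, 0.05

# Under the null hypothesis (no gene really changed) p-values are uniform.
null_p = [random.random() for _ in range(n_genes)]

false_positives = sum(p < alpha for p in null_p)   # genes passing 'by chance'
expected = n_genes * alpha                          # about 500 for these settings
```

The simulated count fluctuates around the expected value from run to run, which is the motivation for the error-rate definitions that follow.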
Distributions of p-values
Random Data
Real Microarray Data
Characterizing False Positives
• Family-Wide Error Rate (FWE)
– probability of at least one false positive
arising from the selection procedure
• Strong control of FWE:
– Bound on the probability
• False Discovery Rate:
– Proportion of false positives arising from
selection procedure
‘Corrected’ p-Values for FWE
• Sidak (exact correction for independent tests)
– pi* = 1 – (1 – pi)^N
– Still conservative if genes are co-regulated
– pi* = 1 – (1 – Npi + …) gives Bonferroni
• Bonferroni correction
– pi* = Npi, if Npi < 1, otherwise 1
– Too conservative!
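A short sketch of the two corrections, following the formulas above with N the number of tests. The function names are hypothetical.

```python
def sidak(p, n_tests):
    """Sidak-corrected p-value: 1 - (1 - p)^N (exact for independent tests)."""
    return 1 - (1 - p) ** n_tests

def bonferroni(p, n_tests):
    """Bonferroni-corrected p-value: N * p, capped at 1."""
    return min(1.0, n_tests * p)

# For small p the two corrections nearly coincide; Bonferroni is never smaller.
p_raw, n_tests = 1e-6, 10_000
p_sidak = sidak(p_raw, n_tests)
p_bonf = bonferroni(p_raw, n_tests)
```

With 10,000 genes a raw p-value has to be around 5e-6 before either corrected value drops below 0.05, which illustrates why these corrections are considered too conservative for microarray data.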
Increasing Power by Lower N
• If we select a small subset of genes for
tests we will have better p-values
• Selecting well-measured or biologically
more probable genes is reasonable
– e.g. high S/N ratio, high variance
• Selecting genes by statistical criteria
related to the test that will be performed is
not reasonable.
– e.g. selecting genes obtained by clustering
Simes’ Lemma
• Suppose we order the p-values from N
independent tests using random data:
– p(1), p(2), …, p(N) are order statistics from
• Pick a target threshold α:
– Then: P( p(1) < α/N or p(2) < 2α/N or p(3) < 3α/N or … ) = α
• Hochberg’s procedure:
– If any p(k) < kα/N
– Then select genes (1) to (k)
False Discovery Rate
• At threshold t* what fraction of genes are
likely to be true positives?
• Illustration: 10,000 independent genes
In practice use permutation algorithm to compute FDR
• Can’t know what FDR is for a particular sample
• B-H suggest procedure specifying Average FDR
– Order the p-values : p(1), p(2), …, p(N)
– If any p(k) < kα/N
– Then select genes (1) to (k)
• q-value: smallest FDR at which the gene
becomes ‘significant’
• NB: acceptable FDR may be much larger than
acceptable p-value (e.g. 0.10 )
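The step-up procedure and the q-value definition can be sketched as below. The comparison uses `<=`, as in the usual statement of the Benjamini-Hochberg rule; the function names and the toy p-values are assumptions for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.10):
    """Indices of genes selected at average FDR alpha (step-up rule)."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    k_max = 0
    for k, i in enumerate(order, start=1):
        if p_values[i] <= k * alpha / n:   # largest k with p(k) <= k*alpha/N
            k_max = k
    return sorted(order[:k_max])           # select genes (1) to (k)

def q_values(p_values):
    """Smallest FDR level at which each gene would be selected."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    q = [0.0] * n
    running = 1.0
    for k in range(n, 0, -1):              # walk from largest p to smallest
        i = order[k - 1]
        running = min(running, p_values[i] * n / k)
        q[i] = running
    return q

p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
selected = benjamini_hochberg(p, alpha=0.05)
q = q_values(p)
```

Selecting the genes with q-value at or below the target FDR gives the same set as running the step-up rule at that level.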
Practical Issues
• Actual proportion of false positives varies
from data set to data set
• Mean FDR could be low but could be high
in your data set
The Effect of Correlation
• If all genes are uncorrelated, Sidak is exact
• If all genes were perfectly correlated
– p-values for one are p-values for all
– No multiple-comparisons correction needed
• Typical gene data is highly correlated
– First eigenvalue of SVD may be more than half
the variance
– Distribution of p-values may differ from uniform
• True FDR more variable
P-Values from Correlated
Null distribution from
Null distribution from
Null distribution from
independent genes
perfectly correlated genes
highly correlated genes
Rows: genes; columns: samples;
entries: p-values from randomized distribution
Re-formulating the Question
• Independent: ~5% of genes exceed .05
threshold, all the time
• Perfectly correlated: all genes exceed .05
threshold ~5% of the time
• Partially correlated: .05 < f1 < 1 of genes
exceeds .05 threshold, .05 < f2 < 1 of the cases
• New question: for a given f1 and α, how likely is
it that a fraction f1 of genes will exceed the α threshold?
Symptoms of Correlated Tests
Typical P-value histograms with NO real changes
P-values from random
but correlated data
P-values from random but
correlated data -- permuted
Distributions of numbers of p-values
below threshold
• 10,000 genes;
• 10,000 random drawings
• L: Uncorrelated R: Highly correlated
Permutation Tests
• We don’t know the true distribution of gene
expression measures within groups
• We simulate the distribution of samples
drawn from the same group by pooling the
two groups, and selecting randomly two
groups of the same size we are testing.
• Need at least 5 in each group to do this!
Permutation Tests – How To
• Suppose samples 1,2,…,10 are in group 1
and samples 11 – 20 are from group 2
• Permute 1,2,…,20: say
– 13,4,7,20,9,11,17,3,8,19,2,5,16,14,6,18,12,15,10,1
– Construct t-scores for each gene based on
these groups
• Repeat many times to obtain Null
distribution of t-scores
• This will be a t-distribution if the original
distribution has no outliers
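The steps above can be sketched for a single gene as follows. The equal-variance t-statistic, the number of permutations and the function names are assumptions for illustration; in a microarray analysis the same permuted labels would be applied to every gene at once so that the correlation structure is preserved.

```python
import random

def t_score(x, y):
    """Two-sample t-statistic with pooled variance."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    ss = sum((v - mx) ** 2 for v in x) + sum((v - my) ** 2 for v in y)
    pooled_var = ss / (nx + ny - 2)
    return (mx - my) / (pooled_var * (1 / nx + 1 / ny)) ** 0.5

def permutation_p_value(values, n_group1, n_perm=2000, seed=0):
    """values: group 1 samples followed by group 2 samples for one gene."""
    rng = random.Random(seed)
    observed = abs(t_score(values[:n_group1], values[n_group1:]))
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)              # reassign samples to pseudo-groups
        t = abs(t_score(shuffled[:n_group1], shuffled[n_group1:]))
        hits += t >= observed
    return (hits + 1) / (n_perm + 1)       # never report exactly zero

group1 = [0.1 * i for i in range(10)]          # samples 1..10
group2 = [5 + 0.1 * i for i in range(10)]      # samples 11..20, clearly shifted
p_shifted = permutation_p_value(group1 + group2, n_group1=10)
```

The smallest p-value this can report is 1 / (n_perm + 1), so the number of permutations limits the resolution of the test.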
Multivariate Permutation Tests
• Want a null distribution with same
correlation structure as given data but no
real differences between groups
• Permute group labels among samples
• redo tests with pseudo-groups
• repeat ad infinitum (10,000 times)
Critiques of Permutations
• Variances of permuted values for truly
changed genes are inflated
– artificially low p-values
• Permuted t -scores for many genes may
be lower than from random samples from
the same population
Recurrent Aberrations
• Many genetic diseases are highly variable
• Large numbers of genes show changes:
– mutation (Sjoblom et al, Science 2006)
– differential expression (many)
– amplifications/ deletions (many)
– methylation (Figueroa & Reimers, submitted)
• Few show consistency!
Categorical Analysis: Mutations
• Mutations occur randomly
– But at different rates in different samples
• Null model IID occurrence: Poisson
– Estimated parameters: sample mutation rate λ
• Compare numbers of genes mutated in k
samples to prediction
Copy Number Patterns
• Neighboring loci are dependent because
long regions typically get amplified/deleted
• Different samples have widely differing
rates of genomic aberrations
• Null model (like permutation)
– same number of aberrant regions and lengths
as actually occur in each sample
– but occurring at random positions
Segmented Data
Recurrent Loci
Statistical Significance
• ‘Permute’ aberrations in all samples
• For each n: what fraction of loci end up
with > n aberrations in each permutation?
• Select n such that fraction exceeding n is
less than x% of that observed in y% of the
Further Questions
• Characterize biological meaning of joint
changes in gene expression
• Organize expression (or other) changes
into meaningful ‘chunks’ (themes)
• Identify crucial points in process where
intervention could make a difference
• Why? Biology is Redundant! Often sets of
genes doing related functions are changed
Gene Sets
• Gene Ontology
– Biological Process
– Molecular Function
– Cellular Location
• Pathway Databases
– BioCarta
– Broad Institute
Other Gene Sets
• Transcription factor targets
– All the genes regulated by particular TF’s
• Protein complex components
– Sets of genes whose protein products
function together
• Ion channel receptors
• RNA / DNA Polymerase
• Paralogs
– Families of genes descended (in eukaryotic
times) from a common ancestor
• Univariate:
– Derive summary statistics for each gene
– Group statistics of genes by gene group
• Multivariate:
– Analyze covariation of genes in groups across
– More adaptable to continuous statistics
Univariate Approaches
• Discrete tests: enrichment for groups in
gene lists
– Select genes differentially expressed at some cutoff
– For each gene group cross-tabulate
– Test for significance (Hyper-geometric or Fisher test)
• Continuous tests: from gene scores to
group scores
– Compare distribution of scores within each group to
random selections
– GSEA (Gene Set Enrichment Analysis)
– PAGE (Parametric Analysis of Gene Expression)
Multivariate Approaches
• Classical multivariate methods
– Multi-dimensional Scaling
– Hotelling’s T2
• Informativeness
– Topological score relative to network
– Prediction by machine learning tool
• e.g. ‘random forest’
Contingency Table:
Category X Significance
Signif. NS
Genes Genes
Group of
(N-n)- N-n
for each table
Categorical Analysis
• Fisher’s Exact Test
– Of all tables with same margins, how many show as
much or more enrichment?
• Tedious to compute when n or k are large
• Approximations
– Binomial (when k / n is small)
– Chi-square (when expected values > 5 )
– G2 (log-likelihood ratio; compare to χ2)
• Approx. not needed for modern computers
Signif. genes
NS Genes
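A sketch of the one-sided Fisher (hypergeometric) enrichment test described above, computed exactly with integer binomial coefficients. The argument names are my own labels for the table margins and are not the notation of the slide.

```python
from math import comb

def enrichment_p_value(n_total, n_signif, group_size, group_signif):
    """P(X >= group_signif) when group_size genes are drawn at random from
    n_total genes, of which n_signif are significant (fixed margins)."""
    upper = min(group_size, n_signif)
    denom = comb(n_total, group_size)
    tail = sum(comb(n_signif, x) * comb(n_total - n_signif, group_size - x)
               for x in range(group_signif, upper + 1))
    return tail / denom

# Toy example: 10,000 genes, 500 significant, a 40-gene group with 8 significant
# (2 would be expected by chance).
p_enrich = enrichment_p_value(10_000, 500, 40, 8)
```

Exact computation is cheap with arbitrary-precision integers, which is why the approximations listed above are rarely needed today. The result still has to be corrected for the number of gene groups tested.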
Issues in Assessing Significance
• P-value or FDR?
– FDR more powerful but strong correlations
• What is appropriate Null Distribution?
– Random sets of genes? Or
– Random assignments of samples?
• If a child category is significant, how to
assess significance of parent category?
– Include child category
– Consider only genes outside child category
Critiques of Discrete Approach
• Discrete approaches are simple
– often good for a first pass at a novel situation
• No use of size of change information
• Continuous procedures have twice the
power of analogous discrete procedures
on discretized continuous data
• No account of gene covariation
– Covariation changes p-values greatly
Continuous Scores
• Gene Set Enrichment Analysis
– Subramanian et al (2005) improves by hack
– Mootha et al (2003) uses distributional test
– Simple z-score test
– Adds up changes in all genes in a pathway
• T-profiler
– T-version of PAGE
• Uses Kolmogorov-Smirnov (K-S) test of
distribution equality to compare t-scores
for selected gene group with all genes
Group Z- or T- Scores
• Form z-scores for each gene:
– Z = ( mG – m ) / SAll
• If z-scores are assigned randomly to groups, each gene’s z-score (zi) is distributed N(0,1)
– Hence the sum over the n genes in a group G, scaled by 1/√n:
$$z_G = \frac{1}{\sqrt{n}} \sum_{i \in G} z_i \sim N(0,1)$$
• Compare group scores to Normal distribution
– Remember multiple comparisons! (many groups)
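A minimal sketch of the group score: the sum of the per-gene z-scores in a group divided by the square root of the group size, compared to a standard Normal. It assumes the gene scores are independent N(0,1) under the null, which is exactly the assumption criticised elsewhere in this tutorial; the function names are hypothetical.

```python
from math import sqrt, erfc

def group_z(z_scores):
    """Sum of gene z-scores scaled by 1/sqrt(n); N(0,1) under the null."""
    return sum(z_scores) / sqrt(len(z_scores))

def two_sided_p(z):
    """Two-sided tail probability of a standard Normal."""
    return erfc(abs(z) / sqrt(2))

# Sixteen genes, each only mildly shifted, give a strong group-level score.
pathway = [0.8] * 16
z_g = group_z(pathway)        # 0.8 * 16 / 4 = 3.2
p_g = two_sided_p(z_g)
```

No single gene in this toy pathway would pass a gene-level threshold, but the group score does, which is the appeal of continuous group tests. With many groups the resulting p-values still need a multiple-comparisons correction.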
Issues for Pathway Methods
• How to assess significance?
– Null distribution by permutations
– Permute genes or samples?
• How to handle activators and inhibitors in the
same pathway?
– Variance Test
– Other approaches
• Critique: all genes treated as independent
– No account of gene covariation
– Covariation changes p-values greatly
Further Questions
Multivariate Approaches to
Gene Set Selection
Key Multivariate Ideas
• PCA (Principal Components Analysis)
– Identify major directions of covariation in data
• MDS (Multi-dimensional Scaling)
– Easy graphical representation of data in terms
of many variables
• Hotelling T2
– Test for differences between two sample
Three correlated variables
PCA1 lies along the direction of
maximal correlation; PCA 2 at
right angles with the next highest
MDS Representation of Expression
in Pathways
• BAD pathway
Other BC
• Clear separation
between groups
• Variation
Hotelling’s T2
• Compute distance between sample means
using (common) metric of covariation
• Where
• Multidimensional analog of t (actually F)
Principles of Kong et al Method
• Normal covariation generally acts to
preserve homeostasis
• The transcription of genes that participate
in many processes will be changed
• The joint changes in genes will be most
distinctive for those genes active in
pathways that are working differently
Critiques of Hotelling’s T2
• Small samples: unreliable S estimates
• Estimates of Δx and S not robust to outliers
• Assumes same covariance in each sample
– S1 = S2 ? Usually not in disease
– Kong et al propose analog of Welch t-test
– Permutation in samples for significance
Making it Stable
1. Insufficient information to capture all
relationships – too much correlation!
– Power of Hotelling’s method comes from
identifying directions of rare variation
– Many (spurious) directions of 0 variation
2. Random variation in data leads to
random variation in PCA
• Regularization strategy: force covariance
to be more like independent variables
Making it Robust
• Microarray data has many outliers
• Multivariate methods are very much
distorted by outliers
• Robust estimates of covariance could give
robust PCA
• Simple approach: trim outliers
Handling Changes of Covariance
• Power of Hotelling’s method comes from
identifying directions of rare variation
• If one group shows little covariation in one
direction but the other does – how to test
for changes?
• If one group is control then its covariance
should be taken as standard
– Robust measure of means in both groups
• All arrays subject to systemic error
– Often hidden from view!
• Best normalizations for expression arrays
• Normalization methods for non-expression
arrays very different
• Selecting individual genes problematic in
presence of correlated variation
• Selection by pathways often more
powerful but often spurious
Additional Information
• An Opinionated Guide to Microarray Data
Analysis (coming August 2008)
• www.people.vcu.edu/~mreimers/OGMDA
• BioConductor: www.bioconductor.org