Statistical Methods in Microarray
Data Analysis
Mark Reimers,
Genomics and Bioinformatics,
Karolinska Institute
Four Recent Contributions
• Exploratory graphics
• Multiple comparisons corrections
– Randomization-based significance tests
• Normalization
– loess normalization for cDNA microarray
• Models for probe-level Affymetrix data
– Robust estimation
Multiple comparisons
• Under the null hypothesis, each gene has a 5% chance of passing a p-value threshold of .05
– Type I error
• 10,000 genes on a chip
• About 500 genes are expected to pass the .05 threshold by chance alone
Corrections to p-Value
• Bonferroni correction
– $p_i^* = N p_i$ if $N p_i < 1$, otherwise 1
– Too conservative!
• Sidak
– $p_i^* = 1 - (1 - p_i)^N$
– Still conservative if genes are co-regulated
(correlated)
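The two single-step corrections above can be sketched in a few lines. This is a minimal illustration, not code from the talk: the function names and the four raw p-values are hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: N * p, capped at 1."""
    n = len(p_values)
    return [min(n * p, 1.0) for p in p_values]


def sidak(p_values):
    """Sidak-adjusted p-values: 1 - (1 - p)^N."""
    n = len(p_values)
    return [1.0 - (1.0 - p) ** n for p in p_values]


# Hypothetical raw p-values for a tiny "chip" of 4 genes.
raw = [0.001, 0.01, 0.04, 0.5]
bonf = bonferroni(raw)
sid = sidak(raw)
for p, b, s in zip(raw, bonf, sid):
    print(f"raw={p:.3f}  bonferroni={b:.4f}  sidak={s:.4f}")
```

The Sidak value is never larger than the Bonferroni value for the same raw p-value, which is why it is slightly less conservative.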
Step-Down p-Values
• p-values for many genes: $p_1, \dots, p_N$
• Order the smallest k as $p_{(1)}, \dots, p_{(k)}$
• How likely are we to get k p-values this
small by chance?
• An improvement in power over single-step
procedures
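The slide does not name a specific step-down procedure. As one well-known example, Holm's method is sketched here; choosing it is an assumption, and the function name and input values are hypothetical.

```python
def holm_step_down(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    n = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (N - k + 1).
        candidate = min((n - rank) * p_values[idx], 1.0)
        # Enforce monotonicity: adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


raw = [0.04, 0.001, 0.5, 0.01]  # hypothetical raw p-values
holm = holm_step_down(raw)
print(holm)
```

Only the smallest p-value receives the full factor N; later ones are multiplied by progressively smaller factors, which is the source of the gain in power over a single-step correction.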
Quantile Plot
• Plot sample t-scores against t-scores under random hypothesis
• Statistically significant genes stand out
[Figure: quantile plot of sample t-scores against corresponding quantiles of t-distribution, with changed genes marked]
Volcano Plot
• Displays both biological importance and statistical significance
[Figure axes: log2(fold change) against log2(p-value) or t-score]
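A minimal sketch of how the two coordinates of a volcano plot could be computed for each gene, using the t-score option named on the slide. The Welch form of the t-score, the function name, and the three example genes are assumptions for illustration only.

```python
import math
from statistics import mean, variance


def volcano_point(control_log2, treated_log2):
    """Return (log2 fold change, Welch t-score) for one gene.

    Inputs are log2 expression values, so the difference of means
    is the log2 fold change.
    """
    diff = mean(treated_log2) - mean(control_log2)
    se = math.sqrt(variance(control_log2) / len(control_log2)
                   + variance(treated_log2) / len(treated_log2))
    return diff, diff / se


# Hypothetical genes: (control replicates, treated replicates).
genes = {
    "gene_a": ([8.0, 8.1, 7.9], [10.0, 10.2, 9.8]),      # large, consistent change
    "gene_b": ([6.0, 7.5, 4.5], [7.0, 8.5, 5.5]),        # sizeable but noisy change
    "gene_c": ([9.00, 9.02, 8.98], [9.10, 9.12, 9.08]),  # tiny but consistent change
}
points = {name: volcano_point(c, t) for name, (c, t) in genes.items()}
for name, (lfc, t) in points.items():
    print(f"{name}: log2 fold change={lfc:.2f}, t-score={t:.2f}")
```

In this example gene_b has a larger fold change than gene_c but a smaller t-score, which is the kind of disagreement between biological and statistical significance the plot is meant to expose.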
Normalization: Comparing Chips
• Measures differ consistently between chips due to:
– Different amounts of RNA
– Hybridization conditions
– Scanner settings
– Murphy’s Law
• Normalization: compensate for systematic
technical differences in measurement process
• Re-scaling to mean or median leaves strong
evidence of systematic technical variation
Normalization: Signal Distributions
• Distributions of log intensity of all probes among a
set of 21 replicate chips
Each color represents probe density on one chip
Re-scaling would shift distribution shape to right or left on this plot
Quantile Normalization
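• Formula: $x_{norm} = F_2^{-1}(F_1(x))$
– $F_1(x)$: distribution function of the raw data
– $F_2(x)$: reference distribution
• Assumes: gene distribution changes little
[Figure: density function and distribution function of the raw data, with the reference distribution]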
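The mapping through F1 and the inverse of F2 can be sketched with ranks: each value on a chip is replaced by the reference value that holds the same rank. This sketch assumes the chip and the reference have the same number of probes and no ties; the function name and the numbers are hypothetical.

```python
def quantile_normalize(chip, reference):
    """Map each value of `chip` to the reference value at the same rank."""
    if len(chip) != len(reference):
        raise ValueError("this sketch assumes equal numbers of probes")
    ref_sorted = sorted(reference)
    # Indices of the chip values from smallest to largest.
    order = sorted(range(len(chip)), key=lambda i: chip[i])
    normalized = [0.0] * len(chip)
    for rank, idx in enumerate(order):
        normalized[idx] = ref_sorted[rank]
    return normalized


chip = [5.0, 2.0, 9.0, 4.0]        # hypothetical raw intensities
reference = [1.0, 3.0, 6.0, 10.0]  # hypothetical reference distribution
normalized = quantile_normalize(chip, reference)
print(normalized)
```

After this step the chip has exactly the reference distribution, while the ordering of its probes is unchanged.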
Visible Effect of Quantile Norm.
• Ratio-Intensity plots are straightened as byproduct
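For reference, the two coordinates of a ratio-intensity plot for a pair of chips can be computed as below. This is a small sketch with hypothetical names and values: the ratio axis is the log2 ratio and the intensity axis is the average log2 intensity.

```python
import math


def ratio_intensity(x, y):
    """Return (log2 ratios, average log2 intensities) for paired signals."""
    ratios = [math.log2(a) - math.log2(b) for a, b in zip(x, y)]
    intensities = [(math.log2(a) + math.log2(b)) / 2 for a, b in zip(x, y)]
    return ratios, intensities


chip1 = [100.0, 400.0, 1600.0]  # hypothetical signals on chip 1
chip2 = [50.0, 400.0, 3200.0]   # the same genes on chip 2
m_values, a_values = ratio_intensity(chip1, chip2)
print(m_values, a_values)
```

If most genes are unchanged, the cloud of points should lie along the horizontal line at ratio 0; curvature in the cloud indicates intensity-dependent technical bias.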
Current Work
• Hybridization reaction varies across some chips
• Very common on cDNA
• 10%-20% of well-done Affy chips
Synthetic image of ratio of individual probes to their median across chips:
Yellow areas show ratios more than twice those of red areas
Models: Many Probes for One Gene
[Figure: gene sequence from 5´ to 3´ covered by multiple oligo probes, each in a Perfect Match and a Mismatch version]
How to combine signals from multiple probes
into a single gene abundance estimate?
Probe Variation
• Individual probes don’t agree on fold
changes
• Probes vary by two orders of magnitude on
each chip
– CG content is most important factor in signal strength
[Figure: signal from 16 probes along one gene on one chip]
Models for Multiple Probes
• Issues:
– Accuracy – does the model give accurate
estimates of relative gene expression, when this
is known?
– Noise – what is the variance of replicates?
– Theoretical basis – do we understand why we
are doing what we do?
• Statistical experience with methodology
• Theory of hybridization process underlying
observations
Three Competing Models
• Affymetrix MicroArray Suite
– versions 4 and 5
• dChip
– Li and Wong, HSPH
• Bioconductor: affy package (RMA)
– Bolstad, Irizarry, Speed, et al
Model 1: MicroArray Suite – Version 4
• GeneChip® older software uses Avg.diff
$\text{Avg.diff} = \frac{1}{|A|} \sum_{j \in A} (PM_j - MM_j)$
with A a set of suitable pairs chosen by software
– 30%-40% of probe differences can be negative
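Avg.diff is an average of PM minus MM differences over a chosen set of probe pairs. A minimal sketch follows. How the software selects the set A is not described on the slide, so the `pairs` argument, like the signal values, is purely hypothetical.

```python
def avg_diff(pm, mm, pairs=None):
    """Mean of PM - MM over the probe pairs in `pairs` (default: all pairs)."""
    pairs = list(range(len(pm))) if pairs is None else list(pairs)
    return sum(pm[j] - mm[j] for j in pairs) / len(pairs)


pm = [1200.0, 300.0, 950.0, 410.0]  # hypothetical perfect-match signals
mm = [400.0, 520.0, 300.0, 380.0]   # hypothetical mismatch signals

all_pairs = avg_diff(pm, mm)
subset = avg_diff(pm, mm, pairs=[0, 2, 3])  # hypothetical choice of A
negative_fraction = sum(p < m for p, m in zip(pm, mm)) / len(pm)
print(all_pairs, subset, negative_fraction)
```

The second probe pair has MM larger than PM, so its difference is negative and pulls the average down; with enough such pairs the average itself can become negative.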
Model 2: MicroArray Suite – Version 5
• MicroArray Suite version 5 uses
$\text{signal} = \text{TukeyBiweight}\{\log(PM_j - MM_j^*)\}$
• MM* is an adjusted MM that is never bigger than PM
• Tukey biweight is a robust average procedure with weights: $f(x) = \frac{c^2}{6}\left[1 - \left(1 - x^2/s^2\right)^3\right]$; $|x| < c$
PM-MM values for probe pairs
For this (typical) example, it is not clear what the average would mean
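A one-step Tukey biweight average can be sketched as follows. This is an illustration under stated assumptions rather than the MAS 5 implementation: it uses the common bisquare weight (1 - u^2)^2, which is not the f(x) written on the slide, and the tuning constant c = 5, the small epsilon, and the example values are all assumed.

```python
from statistics import mean, median


def tukey_biweight(values, c=5.0, epsilon=1e-4):
    """One-step Tukey biweight: a weighted mean that ignores far outliers."""
    m = median(values)
    # Median absolute deviation as a robust scale estimate.
    s = median([abs(v - m) for v in values])
    weights = []
    for v in values:
        u = (v - m) / (c * s + epsilon)
        # Bisquare weight: near 1 close to the median, 0 beyond c scale units.
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical log-scale probe-pair values with one outlier.
log_values = [7.0, 7.2, 6.9, 7.1, 12.0]
robust = tukey_biweight(log_values)
plain = mean(log_values)
print(robust, plain)
```

The outlying value receives weight zero, so the biweight estimate stays among the four consistent probes while the plain mean is dragged upward.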
Linear Models
• Extension of linear regression
• Essential features:
– variance constant
– errors independent
– Small number of factors combine in algebraic
form to give levels
• frequently additive
Model for Probe Signal
• Each probe signal is proportional to
– i) the amount of target sample
– ii) the hybridization efficiency of the specific probe
sequence to the target
– Each probe has a specific affinity to its gene target
• NB: Sensitivity need not imply Specificity
[Figure: signals of probes 1, 2, 3 on chip 1 (q1) and chip 2 (q2)]
Robust Statistics
• Outlier: a measure that is far beyond the typical
random variation
– common in biological measures
– 10-15% in Affy probe sets
• Robust methods try to fit the majority of data
points
– Issue is to identify which points to down-weight or
ignore
• Median is very robust – but inefficient
– Trimmed means are almost as robust and much more
efficient
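A small comparison of the mean, the median, and a trimmed mean on data with one outlier. The signals and the 20% trimming proportion are hypothetical choices for illustration.

```python
from statistics import mean, median


def trimmed_mean(values, proportion=0.2):
    """Mean after dropping the given proportion of values from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    return mean(ordered[k:len(ordered) - k])


# Nine consistent signals and one outlier (hypothetical values).
signals = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 25.0]
plain = mean(signals)
mid = median(signals)
trimmed = trimmed_mean(signals)
print(plain, mid, trimmed)
```

The single outlier shifts the plain mean well away from 10, while the median and the trimmed mean both stay close to the bulk of the data. The trimmed mean still averages six of the ten values, which is the source of its better efficiency.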
Robust Linear Models
• Criterion of fit
– Least median squares
– Sum of weighted squares
– Least squares and throw out outliers
• Method for finding fit
– High-dimensional search
– Iteratively re-weighted least squares
– Median Polish
Why Robust Models for GeneChips?
• 10% - 15% of individual signals in a probe
set deviate greatly from pattern
• Often outliers lie close together
• Causes:
– Scratches
– Proximity to heating elements
– Uneven fluid flow
Li & Wong (dChip)
• Model: $PM_{ij} = q_i f_j + e_{ij}$
– Original model (dChip 1.0) used $PM_{ij} - MM_{ij} = q_i f_j + e_{ij}$ by analogy with Affy MAS 4
• Outlier removal:
– Identify extreme residuals
– Remove
– Re-fit
– Iterate
• Distribution of errors $e_{ij}$ assumed independent of signal strength
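The multiplicative model can be fitted by alternating least squares: solve for the chip amounts with the probe affinities fixed, then the reverse. The sketch below is not the dChip code. It omits the outlier-removal loop, and the scaling constraint (sum of squared affinities equal to the number of probes), the function name, and the data are assumptions.

```python
def fit_multiplicative(pm, iterations=50):
    """Fit pm[i][j] ~ q[i] * f[j] by alternating least squares.

    pm[i][j] is the PM signal of probe j on chip i. Returns (q, f).
    """
    n_chips, n_probes = len(pm), len(pm[0])
    f = [1.0] * n_probes
    q = [0.0] * n_chips
    for _ in range(iterations):
        # Least-squares chip amounts given the probe affinities.
        ff = sum(v * v for v in f)
        q = [sum(pm[i][j] * f[j] for j in range(n_probes)) / ff
             for i in range(n_chips)]
        # Least-squares probe affinities given the chip amounts.
        qq = sum(v * v for v in q)
        f = [sum(pm[i][j] * q[i] for i in range(n_chips)) / qq
             for j in range(n_probes)]
        # Fix the scale (assumed constraint): sum of f_j^2 = number of probes.
        scale = (sum(v * v for v in f) / n_probes) ** 0.5
        f = [v / scale for v in f]
        q = [v * scale for v in q]
    return q, f


# Hypothetical noise-free data built from known amounts and affinities.
q_true = [100.0, 200.0, 50.0]
f_true = [0.5, 1.0, 1.5, 2.0]
pm = [[qi * fj for fj in f_true] for qi in q_true]
q_hat, f_hat = fit_multiplicative(pm)
print(q_hat, f_hat)
```

Only the products of amounts and affinities are identifiable, which is why a scaling constraint is needed. With real data, the residuals from this fit are what the outlier-removal steps would inspect.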
Robust Multi-chip Analysis
• Each probe responds roughly linearly
– over a moderate range
– some probes are outliers
• Linear Model:
– signal $= q_i f_j + e$
• $q_i$: amount of transcript in sample i;
• $f_j$: amplification of probe j
• Robust Fit:
– identify outliers by heuristic – remove
– standard robust method – iteratively re-weighted least
squares
Bolstad, Irizarry, Speed – (RMA)
• For each probe set, re-write $PM_{ij} = q_i f_j$ as:
$\log(PM_{ij}) = \log(q_i) + \log(f_j)$
• Fit this additive model by iteratively re-weighted least-squares or median polish
• In practice, fit:
$\text{nlog}(PM_{ij} - bg) = a_i + b_j + e_{ij}$
Where nlog() stands for logarithm after normalization
NB. Now homoscedastic on log scale
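Median polish, one of the two fitting methods named above, can be sketched as below for a single probe set. The input table is assumed to be already background-corrected, normalized, and log-transformed. The function name and the data are hypothetical, and this is not the Bioconductor implementation.

```python
from statistics import median


def median_polish(table, iterations=10):
    """Fit table[i][j] ~ overall + row[i] + col[j] by Tukey's median polish.

    table[i][j] is the log-scale signal of probe j on chip i.
    Returns (overall, row_effects, col_effects, residuals).
    """
    n_rows, n_cols = len(table), len(table[0])
    resid = [row[:] for row in table]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(iterations):
        # Sweep the median out of each row (chip).
        for i in range(n_rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        m = median(col_eff)
        overall += m
        col_eff = [v - m for v in col_eff]
        # Sweep the median out of each column (probe).
        for j in range(n_cols):
            m = median([resid[i][j] for i in range(n_rows)])
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        m = median(row_eff)
        overall += m
        row_eff = [v - m for v in row_eff]
    return overall, row_eff, col_eff, resid


# Hypothetical additive data: 3 chips x 5 probes, plus one outlying probe.
chip_levels = [8.0, 9.0, 10.0]
probe_effects = [-1.0, -0.5, 0.0, 0.5, 1.0]
table = [[a + b for b in probe_effects] for a in chip_levels]
table[1][3] += 4.0  # a single corrupted measurement

overall, row_eff, col_eff, resid = median_polish(table)
expression = [overall + r for r in row_eff]  # per-chip expression estimate
print(expression, resid[1][3])
```

The corrupted measurement ends up entirely in the residuals, and the per-chip estimates are unaffected by it.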
It Makes a Difference
Two fairly consistent genes in each of 71 samples
[Figure panels: dChip values; MAS 5 values]
Models Compared on Gene Variance
Std Dev of gene measures from 20 replicate arrays
[Figure: horizontal axis is abundance, from low to high. Green: MAS5.0; Black: Li-Wong; Blue, Red: RMA]
Courtesy of Terry Speed
Improvement in Models
• Affymetrix Suite gets better every year
– MAS 7 is expected to be a multi-chip model
• MAS 5.0 estimation does a reasonable job on
probe sets that are bright
– Metabolic and structural genes
– These are most often reported in papers
• dChip and RMA do better on genes that are
less abundant
– Signalling proteins
– transcription factors
Expression Comparison 1 – MAS 4
Ratio-Intensity Plot comparing two chips from spike-in experiment
White dots represent unchanged genes
Red numbers flag spike-in genes
Courtesy of Terry Speed
Expression Comparison 2 – MAS 5
[Figure: t-scores against the theoretical t-distribution, with changed genes marked]
Expression Comparison 3 – Li-Wong
Courtesy of Terry Speed
Expression Comparison 4 - RMA
Courtesy of Terry Speed
Current Work: Improving the Model
• How to use the MM information profitably
– Combine estimates from PM and MM probes?
• Assessments of probe quality
• Accurate estimates of probe background
• Normalization method based on 2-d loess to
correct spatial inhomogeneity
Relation Between PM and MM
Across One Experiment Set
Colored symbols are one probe
Probe Specific Background
Fitted Data
Probe BG subtracted
Horizontal lines represent probes; colored symbols correspond to arrays
After subtracting individual backgrounds, ratios between corresponding
arrays are more consistent between probes
Where Are We?
• Affymetrix almost finished?
– Probe variation ~40% => gene variation ~ 10%
– RMA gives ~20%
• Work to be done:
– Systematic biases for cDNA arrays
– Platform reconciliation
– Using QC and variation measures for individual probes
in combined expression measures
• Frontiers:
– Image analysis
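The slide does not say where the step from about 40% probe variation to about 10% gene variation comes from. One plausible reading, assuming a gene estimate that averages n = 16 probes (the probe-set size shown in the earlier probe-variation example) with independent errors, is the usual square-root rule:

```latex
\sigma_{\text{gene}} \approx \frac{\sigma_{\text{probe}}}{\sqrt{n}}
  = \frac{40\%}{\sqrt{16}} = 10\%
```

Under that reading, the roughly 20% reported for RMA would be about twice this idealized bound.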
Near Term Work to be Done
• New hybridization technologies for measuring
gene expression
• Protein chips
– More complex cross-hybridization
• Other high-throughput technologies
– eg RNAi chips
– Cell arrays
• Using sequence information to understand cross-hybridization
Integrated Analysis
• Integrating statistical measures of data
uncertainty in machine-learning techniques
for network analysis
• Statistical inference for pathways and gene
ontology categories
• Robust data analysis to mine for genome-scale patterns in expression
Acknowledgements
• KI
– Karin Dahlman
– Yudi Pawitan
– Arief Gusnanto
– Lennie Fredriksson
• Berkeley
– Terry Speed
– Ben Bolstad
• Johns Hopkins
– Rafael Irizarry
Affymetrix Arrays
[Figure: GeneChip Probe Array and a Hybridized Probe Cell, showing single stranded, fluorescently labeled DNA target bound to an oligonucleotide probe; scale labels 20µm and 1.28cm]
Each probe cell or feature contains millions of copies of a specific oligonucleotide probe
Over 400,000 different probes complementary to genetic information of interest
Image of Hybridized Probe Array
Evidence for Spatial Variation
Synthetic image of Affy chip
Loess Normalization for Areas
Fit two-parameter loess smoother with 5-10 df