Array normalization, Gene expression index

Microarray Normalization
Xiaole Shirley Liu
STAT115 / STAT215
Affymetrix Microarray Image Analysis
• Affymetrix GeneChip Operating System
(GCOS)
– Gridding: based on spike-in DNA
– cel file
    X     Y     MEAN    STDV   NPIXELS
    701   523   311.0   76.5   16
    702   523   48.0    10.5   16
– cdf file
• Maps each probe at (X,Y) to its probe sequence and targeted transcript
• The MM probe always sits at (X,Y+1) relative to its PM probe
Normalization
• Try to preserve biological variation and
minimize experimental variation, so
different experiments can be compared
• Assumption: most genes / probes don’t
change between two conditions
• Normalization can have a larger effect on the
analysis than downstream steps (e.g. group
comparisons)
Median Scaling
• Linear scaling
– Ensure the different arrays have the same
median value and same dynamic range
– X' = (X – c1) * c2
[Figure: array1 vs. array2 scatter plots]
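A minimal sketch of the linear scaling X' = (X – c1) * c2, assuming the interquartile range stands in for "dynamic range" and that one array is picked as the reference. The slides fix neither choice, and the names `median_scale` and `iqr` are illustrative only.

```python
import statistics

def iqr(values):
    """Interquartile range, used here as a simple measure of dynamic range (an assumption)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

def median_scale(x, reference):
    """Return X' = (X - c1) * c2 so that x matches the reference median and IQR."""
    med_x = statistics.median(x)
    med_ref = statistics.median(reference)
    c2 = iqr(reference) / iqr(x)      # scale factor (fails if x has zero IQR)
    c1 = med_x - med_ref / c2         # shift chosen so the medians coincide
    return [(v - c1) * c2 for v in x]

# Hypothetical intensities: array2 is a linearly distorted copy of array1.
array1 = [120.0, 85.0, 300.0, 45.0, 210.0, 150.0, 95.0, 500.0]
array2 = [2.0 * v + 10.0 for v in array1]
scaled = median_scale(array2, array1)
```

Because the distortion in this toy example is purely linear, `scaled` recovers `array1`; real arrays usually differ nonlinearly, which is what motivates LOESS and quantile normalization.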
LOESS
• LOWESS: LOcally WEighted Scatterplot Smoothing;
the more general form is LOESS
• Fit a smooth curve
– Use robust local linear fits
– Effectively applies different scaling factors at
different intensity levels
– Y = f(X)
– Transform X to X' = f(X)
– Y and X' are comparable
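A simplified sketch of the curve fit Y = f(X), assuming tricube weights over the nearest fraction of points and a single weighted linear fit at each point. It omits the robustness iterations that a full LOWESS uses, and `loess_fit` and `frac` are illustrative names.

```python
def loess_fit(x, y, frac=0.5):
    """Local linear regression of y on x with tricube weights.

    Returns the fitted value f(x_k) at every input point x_k.
    No robustness iterations are performed (simplification).
    """
    n = len(x)
    k = max(2, int(round(frac * n)))          # number of neighbours per local fit
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12             # bandwidth: distance to k-th neighbour
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

# Hypothetical log intensities of the same probes on two arrays.
x = [float(i) for i in range(20)]
y = [3.0 * xi + 1.0 for xi in x]
x_prime = loess_fit(x, y)   # X' = f(X), now comparable with Y
```

Since each point gets its own local line, the fit effectively applies a different scaling factor at each intensity level.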
Quantile Normalization
• Bolstad et al., Bioinformatics 2003
– Currently considered the best normalization method
– Assume most of the probes/genes don’t change between samples
• Calculate the mean for each quantile and reassign each probe
the mean of its quantile
• No experiment retains its original values, but all experiments
end up with exactly the same distribution
[Figure: probes × experiments matrix with the mean taken across experiments at each quantile]
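The quantile-mean reassignment can be sketched as below, assuming every experiment has the same number of probes and that ties are broken by input order (a simplification). The function name and the toy data are illustrative.

```python
def quantile_normalize(arrays):
    """arrays: one list of probe intensities per experiment, all of equal length."""
    n_probes = len(arrays[0])
    n_arrays = len(arrays)
    sorted_arrays = [sorted(a) for a in arrays]
    # Mean across experiments at each rank (quantile).
    quantile_means = [
        sum(s[rank] for s in sorted_arrays) / n_arrays for rank in range(n_probes)
    ]
    normalized = []
    for a in arrays:
        order = sorted(range(n_probes), key=lambda i: a[i])   # probe indices by rank
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = quantile_means[rank]                 # reassign by quantile mean
        normalized.append(out)
    return normalized

# Three hypothetical experiments with four probes each.
experiments = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 4.0, 2.0], [3.0, 4.0, 6.0, 8.0]]
normalized = quantile_normalize(experiments)
```

After this step every experiment contains the same set of values, only assigned to different probes according to each probe's original rank.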
How to Visualize Microarray Normalization?
Dilution Series
• RNA sample in 5 different concentrations
• 5 replicates scanned on 5 different scanners
• Before and after quantile normalization
MvA Plot
• Scatter plot: log2R vs log2G
– Values should lie on the diagonal
• MvA plot: M = log2R - log2G, A = (log2R + log2G)/2
– Values should scatter around M = 0
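A short sketch of the M and A coordinates, assuming strictly positive intensities. `ma_values` and the sample numbers are illustrative.

```python
import math

def ma_values(r, g):
    """Return (M, A) lists for paired intensities r and g (all > 0)."""
    m = [math.log2(ri) - math.log2(gi) for ri, gi in zip(r, g)]
    a = [(math.log2(ri) + math.log2(gi)) / 2 for ri, gi in zip(r, g)]
    return m, a

# Hypothetical intensities: the first probe differs 4-fold, the second is unchanged.
m, a = ma_values([8.0, 4.0], [2.0, 4.0])
```

An unchanged probe gives M = 0 regardless of its intensity, so a well-normalized pair of arrays shows points scattered around the horizontal line M = 0.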
Before Normalization
• Pairwise MA plots for 5 arrays, PM probes
$M = \log_2(PM_i / PM_j)$
$A = \frac{1}{2}\log_2(PM_i \cdot PM_j)$
After Normalization
• Pairwise MA plots for 5 arrays, PM probes
$M = \log_2(PM_i / PM_j)$
$A = \frac{1}{2}\log_2(PM_i \cdot PM_j)$
Gene Expression Index
Affymetrix Microarray Expression Index
• How to summarize probes in a probeset?
• A brighter PM probe usually carries more information, but
not always (e.g. cross-hybridization)
MAS4
• The older GeneChip® software, Microarray
Suite 4.0 (MAS4), uses AvgDiff
$\text{AvgDiff} = \frac{1}{|A|} \sum_{j \in A} (PM_j - MM_j)$
• A: a set of suitable pairs chosen by software
– Remove highest/lowest
– Calculate mean, sd from remaining probes
– Eliminate probes more than 3 sd from mean
• Drawback (naïve algorithm):
– Can omit 30-40% probes
– Can give negative values
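A sketch of AvgDiff under one reading of the selection rule above: the mean and sd come from the differences with the single highest and lowest removed, and then every pair within 3 sd of that mean is kept. The exact rule used by the software may differ, and `avg_diff` is an illustrative name.

```python
import statistics

def avg_diff(pm, mm):
    """Average PM - MM difference over the retained probe pairs (set A)."""
    diffs = [p - m for p, m in zip(pm, mm)]
    trimmed = sorted(diffs)[1:-1]              # remove highest and lowest
    mean = statistics.mean(trimmed)
    sd = statistics.stdev(trimmed)
    kept = [d for d in diffs if abs(d - mean) <= 3 * sd]   # set A
    return statistics.mean(kept)

# Hypothetical probe set where MM is brighter than PM for every pair.
negative = avg_diff([10.0, 12.0, 11.0, 13.0], [20.0, 22.0, 21.0, 23.0])
```

The second drawback is visible directly: whenever MM exceeds PM on average, the index is negative, which has no meaning as an expression level.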
MAS5
• GeneChip® newest version
$\text{signal} = \text{TukeyBiweight}\{\log(PM_j - CT_j^*)\}$
• Tukey biweight down-weights points far from the
estimated center of the data scatter; it is a robust
statistic, resistant to outliers
• CT* (change threshold) is a version of MM that is
never bigger than PM
– If MM < PM, CT* = MM
– If MM > PM, estimate the MM of a typical case (Tukey
biweight) for that PM (~70% of PM)
– If the typical MM is also > PM, set CT* slightly below PM
(PM minus a small positive amount)
• Works OK but ad hoc
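A sketch of a one-step Tukey biweight mean, assuming the commonly described form with tuning constant c = 5 and a small eps guarding against a zero MAD. These constants are assumptions, not taken from the slides, and the CT* values are supplied by hand.

```python
import math
import statistics

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: weighted mean that down-weights points far from the median."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    weights = []
    for v in values:
        u = (v - med) / (c * mad + eps)        # scaled distance from the center
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Hypothetical probe set: PM and CT* with CT* < PM, one outlying probe.
pm = [200.0, 220.0, 210.0, 190.0, 5000.0]
ct = [100.0, 110.0, 105.0, 95.0, 100.0]
signal = tukey_biweight([math.log2(p - t) for p, t in zip(pm, ct)])
```

The outlying probe lies more than c MADs from the median and therefore gets weight zero, so `signal` stays close to the log2 values of the other four probes.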
Li & Wong (dChip)
Important observation: the relative values of probes within
a probeset are very stable across multiple samples.
Model-Based Expression Index
• Look at multiple samples at a time, give different
probes a different weight
• Each probe signal is proportional to
– Amount of target sample: θi
– Affinity of specific probe sequence to the target: φj
[Figure: probes 1, 2, 3 with affinities φ1, φ2, φ3; samples 1 and 2 with amounts θ1, θ2]
Li & Wong (dChip)
• Model: $PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon_{ij}$
– θi: concentration; φj: probe affinity; εij: error
• Iteratively estimate θi and φj to minimize εij

             Probe1 (φ1)   Probe2 (φ2)   Probe3 (φ3)   …
Sample1 θ1   (PM-MM)11     (PM-MM)12     (PM-MM)13     …
Sample2 θ2   (PM-MM)21     (PM-MM)22     (PM-MM)23     …
Sample3 θ3   (PM-MM)31     (PM-MM)32     (PM-MM)33     …
…            …             …             …
• Try to minimize the sum of squared errors
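The iterative fit can be sketched as alternating least squares: fix the probe affinities and solve for the sample amounts, then the reverse. This sketch assumes the scaling constraint that the squared affinities sum to the number of probes, and it leaves out the outlier handling of the full dChip method. `fit_li_wong` is an illustrative name.

```python
def fit_li_wong(y, n_iter=50):
    """y[i][j] = PM - MM for sample i, probe j. Returns (theta, phi)."""
    n_samples, n_probes = len(y), len(y[0])
    phi = [1.0] * n_probes

    def solve_theta(phi):
        # Least-squares theta_i given phi: sum_j y_ij * phi_j / sum_j phi_j^2
        phi_sq = sum(p * p for p in phi)
        return [sum(y[i][j] * phi[j] for j in range(n_probes)) / phi_sq
                for i in range(n_samples)]

    for _ in range(n_iter):
        theta = solve_theta(phi)
        theta_sq = sum(t * t for t in theta)
        # Least-squares phi_j given theta
        phi = [sum(y[i][j] * theta[i] for i in range(n_samples)) / theta_sq
               for j in range(n_probes)]
        # Fix the scale (assumed constraint): sum_j phi_j^2 = number of probes
        scale = (sum(p * p for p in phi) / n_probes) ** 0.5
        phi = [p / scale for p in phi]
    return solve_theta(phi), phi

# Hypothetical noise-free data generated from known amounts and affinities.
theta_true = [100.0, 200.0, 50.0]
phi_true = [0.5, 1.0, 1.5]
y = [[t * p for p in phi_true] for t in theta_true]
theta, phi = fit_li_wong(y)
```

Without a scale constraint θ and φ are only identified up to a constant factor, which is why the sketch rescales φ on every pass.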
RMA = Robust Multi-chip Analysis
• Irizarry & Speed, 2003
• 1: Probe intensity background adjustment
• 2: Quantile-normalize the log-transformed,
background-adjusted PM values
• 3: Robust probe summary
RMA Background Subtraction
• Observed PM = Signal + Background noise
• Signal ~ exponential; BG ~ normal
• Background estimated from MM
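Under the stated model (exponential signal with rate α, normal background with mean μ and sd σ), the expected signal given an observed PM is the mean of a normal distribution truncated at zero. The sketch below computes that conditional mean and takes μ, σ and α as given; estimating them is not shown, and published RMA implementations may differ in detail.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_signal(observed, mu, sigma, alpha):
    """E[Signal | Observed] for Signal ~ Exp(rate alpha), Background ~ N(mu, sigma^2)."""
    a = observed - mu - sigma ** 2 * alpha     # center of the truncated normal
    # Mean of N(a, sigma^2) truncated to signal > 0; can fail numerically
    # when a / sigma is extremely negative (norm_cdf underflows to 0).
    return a + sigma * norm_pdf(a / sigma) / norm_cdf(a / sigma)

# Hypothetical parameters: background around 100 +/- 20, mean signal 1 / 0.01 = 100.
low = expected_signal(100.0, mu=100.0, sigma=20.0, alpha=0.01)
high = expected_signal(1000.0, mu=100.0, sigma=20.0, alpha=0.01)
```

Unlike subtracting a fixed background, this adjustment stays positive even when the observed PM is at or below the background mean, so the log transform in the next step is always defined.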
Why Log(PM)
• Captures the fact that higher value probes are
more variable
• Assume probe noise is comparable on log scale
RMA
• For each probe set, $PM_{ij} = \theta_i \phi_j$
$\log(PM_{ij}) = \log(\theta_i) + \log(\phi_j)$
• Fit the model:
$\log_2 n(PM_{ij} - bg) = a_i + b_j + \varepsilon_{ij}$
– a_i is the expression index, b_j is the probe effect
– log2n() stands for the logarithm after quantile
normalization of n samples
RMA
• Examples…
• Iteratively refit a_i and b_j using median
polish
– Alternately remove (subtract) row and column
medians until sum of absolute residuals
converges
– For complex data structures, can efficiently find
a “general picture” of the data
– Robust to outliers in large data sets
• Similar to dChip, but minimizes error on the
log(PM) scale, so large PMs get less weight
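A sketch of median polish on a samples × probes table of log intensities, assuming rows are samples (index i) and columns are probes (index j). The stopping rule and the names are illustrative.

```python
import statistics

def median_polish(table, max_iter=10, tol=1e-6):
    """Fit table[i][j] ~ overall + row_eff[i] + col_eff[j] by alternating medians."""
    n_rows, n_cols = len(table), len(table[0])
    resid = [list(row) for row in table]
    overall, row_eff, col_eff = 0.0, [0.0] * n_rows, [0.0] * n_cols
    prev_total = None
    for _ in range(max_iter):
        for i in range(n_rows):                       # sweep out row medians
            m = statistics.median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        m = statistics.median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        for j in range(n_cols):                       # sweep out column medians
            m = statistics.median([resid[i][j] for i in range(n_rows)])
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        m = statistics.median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
        total = sum(abs(v) for row in resid for v in row)
        if prev_total is not None and abs(prev_total - total) <= tol:
            break                                     # sum of |residuals| converged
        prev_total = total
    return overall, row_eff, col_eff, resid

# Hypothetical log2 data: 3 samples x 3 probes, with one outlying probe value.
table = [[7.0, 8.0, 59.0], [9.0, 10.0, 11.0], [8.0, 9.0, 10.0]]
overall, row_eff, col_eff, resid = median_polish(table)
expression = [overall + r for r in row_eff]           # a_i for each sample
```

The outlying cell ends up entirely in the residuals, leaving the expression index of its sample untouched, which illustrates the robustness claim above.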
Gene Expression Index
Method Comparison
Method Comparison Standard
• Spike-ins: introduce markers with known
concentration (intensity) to RNA samples
– Should cover a broad range of concentrations
– Run two samples with and without spike-in, see
whether algorithm can detect the spike-in (differential
expression)
• Dilutions:
– Serial dilutions: 1:2, 1:4, 1:8…
• Latin square spike-in captures both approaches
above
• Compare methods both qualitatively (detection accuracy)
and quantitatively (expression index)
Latin Square Spike-ins
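The slide does not spell out the design, so the sketch below builds a generic cyclic Latin square: each spiked gene appears at every concentration exactly once across the arrays, and each array contains every concentration exactly once. The concentrations are hypothetical.

```python
def latin_square_design(concentrations):
    """design[array][gene] = concentration of spiked gene `gene` on array `array`."""
    n = len(concentrations)
    return [[concentrations[(gene + array) % n] for gene in range(n)]
            for array in range(n)]

# Hypothetical serial dilution levels (arbitrary units).
levels = [0.25, 0.5, 1.0, 2.0]
design = latin_square_design(levels)
```

Comparing any two arrays gives known fold changes for every spiked gene (the spike-in approach), while following one gene across arrays traces a full dilution series.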
Method Comparison of Spike-in
[Figure: spike-in results in four panels: MAS4, MAS5, dChip, RMA. Red numbers indicate spiked genes.]
Summary
• Cel file and cdf file.
• Array normalization: Loess, qnorm
– Assumptions
• Normalization visualization: MA plots
• Gene Expression Index
– RMA models probe effect in expression arrays
– Use MM to correct background
– Qnorm log(PM)
– Median polish, model probe behavior to get expression index
• Method comparison