L5-AFFY_PRE.pptx

Lecture Topic 5
Pre-processing AFFY data
Probe Level Analysis
• The Purpose
– Calculate an expression value for each probe set (gene) from the 11-25 PM and MM intensities
– Critical for later analysis: avoids GIGO (garbage in, garbage out)
– A VERY recent area, but one that has made significant progress
Difficulties
• Large variability
• Few measurements per probe set (11-25 at most)
• MM is very complex: it is signal plus background
• Signal has to be SCALED
• Probe-level effects
Different Methods
• MAS 4 Affymetrix 1996
• MAS 5 Affymetrix 2002
• Model Based Expression Index (MBEI) Li
and Wong 2001
• Robust Multichip Analysis (RMA) 2002
• GC-RMA 2004
MAS 4
$$\text{AvgDiff} = \frac{1}{|A|} \sum_{j \in A} (PM_j - MM_j)$$
A: the set of probe pairs selected
Avg Diff
• Calculated by taking the difference between PM and MM for every probe pair and averaging over the probe pairs
– Excluded OUTLIER pairs if PM-MM deviated by more than 3 SD from the mean
– Was NOT a robust average
– NOT log-transformed
– COULD be negative (about 1/3 of the time)
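A minimal Python sketch of the AvgDiff calculation. The names `avg_diff` and `n_sd` are illustrative, and the outlier rule (drop pairs whose PM-MM lies more than 3 SD from the mean difference) is an assumed reading of the exclusion step, not the exact MAS 4 procedure.

```python
from statistics import mean, stdev


def avg_diff(pm, mm, n_sd=3.0):
    """Average PM-MM difference over the retained probe pairs (the set A)."""
    if len(pm) != len(mm) or len(pm) < 2:
        raise ValueError("need at least two PM/MM pairs of equal length")
    diffs = [p - m for p, m in zip(pm, mm)]
    center = mean(diffs)
    spread = stdev(diffs)
    # Assumed outlier rule: keep pairs within n_sd standard deviations of the mean.
    kept = [d for d in diffs if abs(d - center) <= n_sd * spread]
    # Plain (non-robust) average on the raw scale: no log transform is applied.
    return mean(kept)
```

Because this is a plain average of untransformed differences, a probe set whose MM intensities exceed its PM intensities returns a negative value.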
MAS 5
• Signal = TukeyBiweight{log2(PM_j - IM_j)}
• Discussed this earlier.
• Requires calculating IM (the Ideal Mismatch)
• The adjusted PM-MM differences are log-transformed, and the Tukey biweight makes the estimate robust to outlying observations.
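A rough Python sketch of a one-step Tukey biweight applied to log2(PM - IM). The function names are illustrative, the tuning constants `c=5` and `eps=1e-4` are assumed defaults, and the IM values are taken as given (the Ideal Mismatch calculation itself is not shown here).

```python
import math
from statistics import median


def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean that down-weights outliers."""
    center = median(values)
    mad = median([abs(v - center) for v in values])  # median absolute deviation
    num = 0.0
    den = 0.0
    for v in values:
        u = (v - center) / (c * mad + eps)  # scaled distance from the median
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0  # zero weight far out
        num += w * v
        den += w
    return num / den


def mas5_log_signal(pm, im):
    """TukeyBiweight{log2(PM_j - IM_j)}; assumes IM_j < PM_j for every pair."""
    return tukey_biweight([math.log2(p - i) for p, i in zip(pm, im)])
```

Values far from the median get weight 0, so a single aberrant probe pair does not move the estimate, unlike the plain average used in MAS 4.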
Robust Multichip Analysis
ONLY uses PM and ignores MM
SACRIFICES accuracy but gives major gains in PRECISION
• Basic Steps:
– 1. Calculate chip background (*BG) and subtract from PM
– 2. Carry out intensity-dependent normalization for PM-*BG
• Lowess
• Quantile Normalization (Discussed before)
– 3. Normalized PM-*BG are log transformed
– 4. Robust multichip analysis of all probes in the set, using Tukey's median polish procedure. Signal is the antilog of the result.
RMA- Step 1: Background Correction
• Irizarry et al. (2003)
• Finds the conditional expectation of the TRUE signal given the observed signal (which is assumed to be the true signal plus noise)
• $E(s_i \mid s_i + b_i)$
• Here, $s_i$ is assumed to follow an Exponential distribution with parameter $\theta$.
• $b_i$ is assumed to follow $N(\mu_e, \sigma_e^2)$
• Estimate $\mu_e$ and $\sigma_e$ as the mean and standard deviation of empty spots
$$\hat{\theta} = \frac{1}{\bar{y} - \mu_e}$$
RMA- BG Corrected Value
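$$\hat{y}_i = a_i + \sigma_e \, \frac{\phi\left(\frac{a_i}{\sigma_e}\right) - \phi\left(\frac{y_i - a_i}{\sigma_e}\right)}{\Phi\left(\frac{a_i}{\sigma_e}\right) + \Phi\left(\frac{y_i - a_i}{\sigma_e}\right) - 1}, \qquad a_i = y_i - \mu_e - \sigma_e^2 \theta$$
Here $\phi$ and $\Phi$ are the standard normal density and distribution function, $\theta$ is the exponential parameter of the signal, and $\mu_e$, $\sigma_e$ are the mean and standard deviation of the background noise. Note that $y_i - a_i = \mu_e + \sigma_e^2 \theta$.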
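A minimal Python sketch of this background correction, written as the mean of a normal distribution centred at a = y - mu_e - sigma_e^2 * theta and truncated to the interval [0, y]. Function names are illustrative, and the parameters are assumed to have been estimated already (mu_e and sigma_e from empty spots, theta from the moment estimate).

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal distribution function (erfc keeps lower-tail precision)."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def estimate_theta(observed, mu_e):
    """Moment estimate: theta_hat = 1 / (mean(y) - mu_e)."""
    return 1.0 / (sum(observed) / len(observed) - mu_e)


def background_correct(y, mu_e, sigma_e, theta):
    """E(signal | observed y) for exponential signal plus normal noise."""
    a = y - mu_e - sigma_e ** 2 * theta
    z_a = a / sigma_e
    z_b = (y - a) / sigma_e  # equals (mu_e + sigma_e^2 * theta) / sigma_e
    numerator = normal_pdf(z_a) - normal_pdf(z_b)
    # Phi(z_a) + Phi(z_b) - 1, rewritten as Phi(z_a) - Phi(-z_b) for accuracy.
    denominator = normal_cdf(z_a) - normal_cdf(-z_b)
    return a + sigma_e * numerator / denominator
```

For an intensity far above background the correction is close to y - mu_e - sigma_e^2 * theta, while for an intensity below the background mean the result stays positive, which is what allows the later log transform.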
RMA-Normalization
Use the background corrected intensities B(PM) to carry out
normalization
– Lowess (for Spatial effects)
– Quantile Normalization (to allow comparability
amongst replicate slides)
– Normalized B(PM) are log transformed
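A small Python sketch of quantile normalization across arrays. The function name and the list-of-lists layout (one inner list of background-corrected PM values per array) are assumptions for illustration, and ties are handled in the simplest way (by sort order) rather than by averaging.

```python
def quantile_normalize(arrays):
    """Give every array the same empirical distribution of intensities."""
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must contain the same number of probes")
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean across arrays of the r-th smallest value.
    rank_means = [
        sum(s[r] for s in sorted_arrays) / len(arrays) for r in range(n)
    ]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices by rank
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]  # replace value by the reference quantile
        normalized.append(out)
    return normalized
```

After this step each array has exactly the same set of values, only assigned to different probes; the log2 transform is then applied to the normalized values.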
RMA summarization
• Use MEDIAN POLISH to fit a linear model
• Given a MATRIX of data:
– Data= overall effects+row effects + column effects +
residual
• Find row and column effects by subtracting the medians of
row and column successively till all the medians are less
than some epsilon
• Gives estimated row, column and overall effect when done
Median Polish of RMA
• For each probe set we have a matrix (probes in rows and
arrays in columns)
• We assume:
• Signal = probe affinity effect + log-scale expression level + error
• Also assume the sum of probe affinities is 0
• Use MEDIAN polish to estimate the expression level in
each array
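A minimal Python sketch of median polish on one probe set, with probes in rows and arrays in columns. The function names, the stopping rule (all row and column medians below `eps`) and the `max_iter` cap are illustrative choices; the input is assumed to be background-corrected, normalized, log2-transformed PM values.

```python
from statistics import median


def median_polish(matrix, eps=1e-6, max_iter=50):
    """Fit data = overall + row effect + column effect + residual."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    resid = [[float(v) for v in row] for row in matrix]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Sweep row medians out of the residuals into the row effects.
        row_med = [median(r) for r in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            resid[i] = [v - row_med[i] for v in resid[i]]
        shift = median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # Sweep column medians out of the residuals into the column effects.
        col_med = [median([resid[i][j] for i in range(n_rows)])
                   for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
        resid = [[resid[i][j] - col_med[j] for j in range(n_cols)]
                 for i in range(n_rows)]
        shift = median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift
        if max(abs(m) for m in row_med + col_med) < eps:
            break
    return overall, row_eff, col_eff, resid


def rma_expression(log2_probe_by_array):
    """Log-scale expression per array: overall effect plus column effect."""
    overall, _, col_eff, _ = median_polish(log2_probe_by_array)
    return [overall + c for c in col_eff]
```

Here the row effects play the role of the probe affinities and the column effects carry the array-to-array differences in expression. Note that median polish centres the probe effects at median 0, a robust stand-in for the sum-to-zero constraint.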
GC-RMA the Basic Idea of Background
• Uses MM and PM in a more statistical framework.
– $PM = O_{PM} + N_{PM} + S$
– $MM = O_{MM} + N_{MM} + \phi S$
• O represents optical noise,
• N represents non-specific binding (NSB) noise, and
• S is a quantity proportional to RNA expression (the quantity of interest).
• The parameter $0 < \phi < 1$ accounts for the fact that for some probe pairs the MM detects signal.
Distributional Assumptions
Assume
• O follows a log-normal distribution
• $\log(N_{PM})$ and $\log(N_{MM})$ follow a bivariate normal distribution with means $\mu_{PM}$ and $\mu_{MM}$, common variance $\mathrm{var}[\log(N_{PM})] = \mathrm{var}[\log(N_{MM})] = \sigma^2$, and correlation $\rho$ constant across probes.
• $\mu_{PM} = h(\alpha_{PM})$ and $\mu_{MM} = h(\alpha_{MM})$,
• with h a smooth (almost linear) function and the probe affinities $\alpha$ defined next
• Because we do not expect NSB to be affected by optics, we assume O and N are independent
• The parameters $\mu_{PM}$, $\mu_{MM}$, $\rho$, and $\sigma^2$ can be estimated from the large amount of data.
A background adjustment procedure can then be formalized as the statistical problem of predicting S given that we observed PM and MM and assuming we know h, $\rho$, $\sigma^2$ and $\phi$.
GC-RMA
• Naef and Magnasco (2003) defined
$$\alpha = \sum_{k=1}^{25} \sum_{j \in \{A,T,G,C\}} \mu_{j,k} \, I_{b_k = j}$$
• $\mu_{j,k}$: modeled as a spline with 5 degrees of freedom in k
• $\hat{\mu}_{j,k}$: least-squares (LS) estimate
• where k = 1,…,25 indicates the position along the probe, j indicates the base letter, $b_k$ represents the base at position k, and $I_{b_k = j}$ is an indicator function that is 1 when the k-th base is of type j and 0 otherwise,
• $\mu_{j,k}$ represents the contribution to affinity of base j in position k.
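A short Python sketch of the affinity sum. The function name and the `mu` layout (a dict mapping each base to a list of per-position contributions) are hypothetical, and real contribution values would come from the least-squares fit rather than being typed in by hand.

```python
BASES = "ATGC"


def probe_affinity(sequence, mu):
    """alpha = sum over positions k of the contribution mu[b_k][k].

    The indicator in the double sum selects exactly one base per position,
    so the sum over j collapses to a lookup of the base actually present.
    """
    if any(base not in BASES for base in sequence):
        raise ValueError("sequence may only contain A, T, G, C")
    return sum(mu[base][k] for k, base in enumerate(sequence))
```

Because the contribution depends on position as well as base, two probes with the same base composition can have different affinities.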
Assumptions and Notations needed for applying GC-RMA
1. $\phi$ is 0.
(Although we know $\phi > 0$).
2. O is an array-dependent constant.
Notations:
Let m be the minimum value allowed for S (generally 0),
and let h, $\rho$, $\sigma^2$ be replaced by plug-in estimates
MLE estimates
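• Under the assumptions described above, the maximum likelihood estimate (MLE) of S is
$$\hat{S} = \begin{cases} PM - O - \hat{N}_{PM} & \text{if } PM - O - \hat{N}_{PM} > m \\ m & \text{otherwise} \end{cases}$$
$$\hat{N}_{PM} = \exp\left\{ \rho \log(MM - O) + \mu_{PM} - \rho \mu_{MM} - (1 - \rho^2)\sigma^2 \right\}$$
where $\rho$ and $\sigma^2$ are the correlation and variance of the log-scale NSB noise, and $\mu_{PM}$, $\mu_{MM}$ are its means for the PM and MM probes.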
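A minimal Python sketch of the GC-RMA estimate of S for a single probe pair. Function and argument names are illustrative; the optical noise `o`, the means `mu_pm` and `mu_mm`, the correlation `rho` and the variance `sigma2` are assumed to be plug-in estimates obtained elsewhere, and MM is assumed to exceed `o` so that the logarithm is defined.

```python
import math


def nsb_estimate(mm, o, mu_pm, mu_mm, rho, sigma2):
    """N_hat_PM = exp{rho*log(MM - O) + mu_PM - rho*mu_MM - (1 - rho^2)*sigma^2}."""
    return math.exp(
        rho * math.log(mm - o) + mu_pm - rho * mu_mm - (1.0 - rho ** 2) * sigma2
    )


def gcrma_signal(pm, mm, o, mu_pm, mu_mm, rho, sigma2, m=0.0):
    """S_hat = PM - O - N_hat_PM, truncated below at the minimum value m."""
    s_hat = pm - o - nsb_estimate(mm, o, mu_pm, mu_mm, rho, sigma2)
    return s_hat if s_hat > m else m
```

With rho = 0 the MM intensity drops out and the NSB estimate depends only on the probe's affinity through mu_pm; with rho close to 1 and equal means, the estimate approaches MM - O, which is a plain PM - MM subtraction.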