Lecture Topic 5: Pre-processing AFFY Data

Probe Level Analysis
• The Purpose
– Calculate an expression value for each probe set (gene) from the 11-25 PM and MM intensities
– Critical for later analysis: avoiding GIGO (garbage in, garbage out)
– A VERY recent field, but one that has made significant progress

Difficulties
• Large variability
• Few measurements (11-25 at most)
• MM is very complex: it is signal plus background
• Signal has to be SCALED
• Probe-level effects

Different Methods
• MAS 4, Affymetrix, 1996
• MAS 5, Affymetrix, 2002
• Model Based Expression Index (MBEI), Li and Wong, 2001
• Robust Multichip Analysis (RMA), 2002
• GC-RMA, 2004

MAS 4
$$\text{AvgDiff} = \frac{1}{|A|} \sum_{j \in A} (PM_j - MM_j)$$
where $A$ is the set of probe pairs selected.

Avg Diff
• Calculated by taking the difference PM − MM for every probe pair and averaging over the selected probe pairs
– Excluded OUTLIER pairs if PM − MM > 3 SD
– Was NOT a robust average
– NOT log-transformed
– COULD be negative (about 1/3 of the time)

MAS 5
• $\text{Signal} = \text{TukeyBiweight}\{\log_2(PM_j - IM_j)\}$
• Discussed this earlier.
• Requires calculating IM (the ideal mismatch)
• The adjusted differences PM − IM are log-transformed and averaged robustly against outlying observations using the Tukey biweight.

Robust Multichip Analysis
• ONLY uses PM and ignores MM
• SACRIFICES accuracy, but with major gains in PRECISION
• Basic Steps:
– 1. Calculate chip background (*BG) and subtract it from PM
– 2. Carry out intensity-dependent normalization for PM − *BG
   • Lowess
   • Quantile Normalization (discussed before)
– 3. Normalized PM − *BG are log-transformed
– 4. Robust multichip analysis of all probes in the set, using the Tukey median polish procedure. Signal is the antilog of the result.

RMA- Step 1: Background Correction
• Irizarry et al. (2003)
• Looks at finding the conditional expectation of the TRUE signal given the observed signal (which is assumed to be the true signal plus noise)
• $E(s_i \mid s_i + b_i)$
• Here, $s_i$ is assumed to follow an exponential distribution with parameter $\theta$.
• $b_i$ is assumed to follow $N(\mu_e, \sigma_e^2)$
• Estimate $\mu_e$ and $\sigma_e$ as the mean and standard deviation of empty spots, and estimate $\theta$ from the mean observed intensity $\bar{y}$:
$$\hat{\theta} = \frac{1}{\bar{y} - \mu_e}$$

RMA- BG Corrected Value
With $a_i = y_i - \mu_e - \sigma_e^2 \theta$, the background-corrected value is
$$\hat{y}_i = a_i + \sigma_e \, \frac{\varphi\!\left(\frac{a_i}{\sigma_e}\right) - \varphi\!\left(\frac{y_i - a_i}{\sigma_e}\right)}{\Phi\!\left(\frac{a_i}{\sigma_e}\right) + \Phi\!\left(\frac{y_i - a_i}{\sigma_e}\right) - 1}$$
where $\varphi$ and $\Phi$ are the standard normal density and distribution function, and $y_i - a_i = \mu_e + \sigma_e^2 \theta$.

RMA-Normalization
• Use the background-corrected intensities B(PM) to carry out normalization
– Lowess (for spatial effects)
– Quantile Normalization (to allow comparability amongst replicate slides)
– Normalized B(PM) are log-transformed

RMA summarization
• Use MEDIAN POLISH to fit a linear model
• Given a MATRIX of data:
– Data = overall effect + row effects + column effects + residual
• Find the row and column effects by subtracting the row medians and the column medians in turn, repeating until all the medians are less than some epsilon
• Gives the estimated row, column and overall effects when done

Median Polish of RMA
• For each probe set we have a matrix (probes in rows and arrays in columns)
• We assume:
– Signal = probe affinity effect + log-scale expression level + error
• Also assume the sum of the probe affinities is 0
• Use MEDIAN polish to estimate the expression level in each array

GC-RMA: the Basic Idea of Background
• Uses MM and PM in a more statistical framework.
– $PM = O_{PM} + N_{PM} + S$
– $MM = O_{MM} + N_{MM} + \phi S$
• $O$ represents optical noise, $N$ represents NSB (non-specific binding) noise, and $S$ is a quantity proportional to RNA expression (the quantity of interest).
• The parameter $0 < \phi < 1$ accounts for the fact that for some probe pairs the MM detects signal.

Distributional Assumptions
Assume
• $O$ follows a log-normal distribution
• $\log(N_{PM})$ and $\log(N_{MM})$ follow a bivariate normal distribution with means $\mu_{PM}$ and $\mu_{MM}$, common variance $\text{var}[\log(N_{PM})] = \text{var}[\log(N_{MM})] = \sigma^2$, and correlation $\rho$ that is constant across probes.
• $\mu_{PM} = h(\alpha_{PM})$ and $\mu_{MM} = h(\alpha_{MM})$, with $h$ a smooth (almost linear) function and the $\alpha$ defined next
• Because we do not expect NSB to be affected by optics, we assume $O$ and $N$ are independent
• The parameters $\mu_{PM}$, $\mu_{MM}$, $\rho$ and $\sigma^2$ can be estimated from the large amount of data.
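To make the GC-RMA model and its distributional assumptions concrete, here is a minimal simulation sketch in Python. It draws probe pairs from the additive PM/MM model with bivariate-normal log NSB noise. Everything numeric is an assumption for illustration only: the function name `simulate_probe_pairs`, the log-normal parameters for optical noise and signal, the name `phi` for the fraction of signal seen by MM, and the simplification that the NSB means are the same for every probe (in the lecture they depend on the probe affinity through h).

```python
import numpy as np

def simulate_probe_pairs(n_probes, mu_pm, mu_mm, sigma, rho, phi, seed=0):
    """Draw (PM, MM, S, log N) from the additive GC-RMA-style model.

    Simplification (assumption): mu_pm and mu_mm are constants here,
    not functions h(affinity) of each probe's sequence.
    """
    rng = np.random.default_rng(seed)

    # Optical noise O: log-normal; location and scale are hypothetical.
    o_pm = rng.lognormal(mean=3.0, sigma=0.1, size=n_probes)
    o_mm = rng.lognormal(mean=3.0, sigma=0.1, size=n_probes)

    # NSB noise: (log N_PM, log N_MM) is bivariate normal with a common
    # variance sigma^2 and correlation rho.
    cov = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])
    log_n = rng.multivariate_normal([mu_pm, mu_mm], cov, size=n_probes)
    n_pm = np.exp(log_n[:, 0])
    n_mm = np.exp(log_n[:, 1])

    # Signal S, proportional to expression: hypothetical log-normal draw.
    s = rng.lognormal(mean=6.0, sigma=1.0, size=n_probes)

    # PM sees the full signal; MM sees only the fraction phi of it.
    pm = o_pm + n_pm + s
    mm = o_mm + n_mm + phi * s
    return pm, mm, s, log_n

pm, mm, s, log_n = simulate_probe_pairs(
    n_probes=20000, mu_pm=5.0, mu_mm=4.8, sigma=0.5, rho=0.7, phi=0.2
)
# The sample correlation of the log NSB noise should be close to rho.
sample_rho = np.corrcoef(log_n[:, 0], log_n[:, 1])[0, 1]
```

Because the noise terms are strictly positive, every simulated PM exceeds its true signal S, which is why some form of background adjustment is needed before summarization.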
• A background adjustment procedure can then be formalized as the statistical problem of predicting $S$ given that we observed PM and MM, assuming we know $h$, $\rho$, $\sigma^2$ and $\phi$.

GC-RMA
• Naef and Magnasco (2003) defined the probe affinity
$$\alpha = \sum_{k=1}^{25} \sum_{j \in \{A,T,G,C\}} \mu_{j,k} \, I_{b_k = j}$$
with $\mu_{j,k}$ modeled as a spline in $k$ with 5 degrees of freedom, and $\hat{\mu}_{j,k}$ obtained as a least squares estimate
• where $k = 1, \ldots, 25$ indicates the position along the probe, $j$ indicates the base letter, $b_k$ represents the base at position $k$, and $I_{b_k = j}$ is an indicator function that is 1 when the $k$-th base is of type $j$ and 0 otherwise
• $\mu_{j,k}$ represents the contribution to affinity of base $j$ in position $k$.

Assumptions and Notation needed for applying GC-RMA
1. $\phi$ is 0 (although we know $\phi > 0$).
2. $O$ is an array-dependent constant.
Notation: let $m$ be the minimum value allowed for $S$ (generally $m = 0$); $h$, $\rho$ and $\sigma^2$ are replaced by plug-in estimates.

MLE estimates
• Under the assumptions described above, the maximum likelihood estimate (MLE) of $S$ is
$$\hat{S} = \begin{cases} PM - O - \hat{N}_{PM} & \text{if } PM - O - \hat{N}_{PM} > m \\ m & \text{otherwise} \end{cases}$$
with
$$\hat{N}_{PM} = \exp\{\rho \log(MM - O) + \mu_{PM} - \rho \mu_{MM} - (1 - \rho^2)\sigma^2\}$$
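The MLE above can be evaluated directly once plug-in values are available. The following is a minimal Python sketch, not the reference GC-RMA implementation: the function name `gcrma_mle_signal`, the scalar treatment of the means (in the lecture they vary by probe through h), and the small positive floor applied to MM − O before taking the logarithm are all assumptions made for illustration.

```python
import numpy as np

def gcrma_mle_signal(pm, mm, o, mu_pm, mu_mm, rho, sigma2, m=0.0):
    """Background-adjusted signal following the MLE formula above.

    pm, mm : observed perfect-match and mismatch intensities
    o      : optical noise, treated as an array-dependent constant
    mu_pm, mu_mm, rho, sigma2 : plug-in estimates of the NSB parameters
    m      : minimum value allowed for the signal
    """
    pm = np.asarray(pm, dtype=float)
    mm = np.asarray(mm, dtype=float)

    # log(MM - O) needs MM > O; the floor is an assumption of this sketch.
    mm_adj = np.maximum(mm - o, 1e-8)

    # Predicted NSB noise on the PM probe, given the MM reading.
    n_pm_hat = np.exp(
        rho * np.log(mm_adj) + mu_pm - rho * mu_mm - (1.0 - rho**2) * sigma2
    )

    # Subtract optical and NSB noise, then truncate from below at m.
    s_hat = pm - o - n_pm_hat
    return np.where(s_hat > m, s_hat, m)

# Hypothetical values: with rho = 0 the MM reading carries no information
# and the predicted NSB noise reduces to exp(mu_pm - sigma2).
example = gcrma_mle_signal(
    pm=[200.0, 40.0], mm=[60.0, 60.0], o=30.0,
    mu_pm=3.0, mu_mm=3.0, rho=0.0, sigma2=0.0,
)
```

In the example the first probe keeps a positive adjusted signal, while the second falls below the floor and is set to m. A larger correlation gives the MM reading more weight in predicting the noise on its PM partner.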