Microarray Normalization
Xiaole Shirley Liu
STAT115 / STAT215

Affymetrix Microarray Image Analysis
• Affymetrix GeneChip Operating System (GCOS)
    – Gridding: based on spike-in DNA
    – cel file

          X     Y     MEAN    STDV   NPIXELS
          701   523   311.0   76.5   16
          702   523    48.0   10.5   16

    – cdf file
        • Which probe at (X,Y) corresponds to which probe sequence and targeted transcript
        • The MM probe is always at (X,Y+1) relative to its PM probe

Normalization
• Try to preserve biological variation and minimize experimental variation, so different experiments can be compared
• Assumption: most genes / probes don't change between two conditions
• Normalization can have a larger effect on the analysis than downstream steps (e.g. group comparisons)

Median Scaling
• Linear scaling
    – Ensure the different arrays have the same median value and the same dynamic range
    – X' = (X – c1) * c2
[Figure: scatterplots of array1 vs. array2]

LOESS
• LOWESS: LOcally WEighted Scatterplot Smoothing; the more general form is LOESS
• Fit a smooth curve
    – Use robust local linear fits
    – Effectively applies different scaling factors at different intensity levels
    – Y = f(X)
    – Transform X to X' = f(X)
    – Y and X' are comparable

Quantile Normalization
• Bolstad et al., Bioinformatics 2003
    – Currently considered the best normalization method
    – Assume most of the probes/genes don't change between samples
• Calculate the mean for each quantile and reassign each probe the mean of its quantile
• No experiment retains its original values, but all experiments end up with exactly the same distribution
[Figure: probes (rows) by experiments (columns) matrix, with a Mean column]

How to Visualize Microarray Normalization?
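As a starting point for the plots, here is a minimal sketch of the quantile normalization step described above, together with the M and A values (log ratio and average log intensity) commonly used to check its effect. The function names, the toy intensities, and the tie handling (ties are broken by probe order rather than averaged) are assumptions made for illustration, not part of the lecture.

```python
import math

def quantile_normalize(arrays):
    """Quantile-normalize equal-length arrays (one list of intensities per array)."""
    n_probes = len(arrays[0])
    # For each array, list the probe indices from dimmest to brightest.
    orders = [sorted(range(n_probes), key=lambda p, arr=arr: arr[p]) for arr in arrays]
    # Mean across arrays at each rank: the "quantile mean".
    rank_means = [
        sum(arr[order[r]] for arr, order in zip(arrays, orders)) / len(arrays)
        for r in range(n_probes)
    ]
    # Reassign each probe the mean that belongs to its rank in its own array.
    normalized = []
    for order in orders:
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = rank_means[rank]
        normalized.append(out)
    return normalized

def ma_values(x, y):
    """Return (M, A): log2 ratio and average log2 intensity for two arrays."""
    m = [math.log2(a) - math.log2(b) for a, b in zip(x, y)]
    a_avg = [(math.log2(a) + math.log2(b)) / 2 for a, b in zip(x, y)]
    return m, a_avg

# Hypothetical intensities: 3 arrays x 4 probes; the second array is roughly 2x brighter.
toy = [
    [120.0, 40.0, 300.0, 80.0],
    [260.0, 90.0, 700.0, 150.0],
    [100.0, 70.0, 310.0, 30.0],
]
normalized = quantile_normalize(toy)

m_before, _ = ma_values(toy[0], toy[1])
m_after, _ = ma_values(normalized[0], normalized[1])
print("M before:", [round(m, 2) for m in m_before])
print("M after: ", [round(m, 2) for m in m_after])
```

In this toy case the first two arrays rank their probes identically, so after normalization their M values collapse to zero. Real arrays would instead scatter around zero.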
Dilution Series
• RNA sample in 5 different concentrations
• 5 replicates scanned on 5 different scanners
• Before and after quantile normalization

MvA Plot
[Figure: log2 R vs. log2 G; values should lie on the diagonal]
[Figure: M = log2 R - log2 G vs. A = (log2 R + log2 G)/2; values should scatter around 0]

Before Normalization
• Pairwise MA plot for 5 arrays, probe (PM)
    $M = \log_2(PM_i / PM_j)$,  $A = \tfrac{1}{2}\log_2(PM_i \cdot PM_j)$

After Normalization
• Pairwise MA plot for 5 arrays, probe (PM)
    $M = \log_2(PM_i / PM_j)$,  $A = \tfrac{1}{2}\log_2(PM_i \cdot PM_j)$

Gene Expression Index

Affymetrix Microarray Expression Index
• How to summarize probes in a probeset?
• A brighter PM usually carries more information, but not always (cross-hybridization)

MAS4
• GeneChip® older software, Microarray Suite 4.0, uses AvgDiff
    $\mathrm{AvgDiff} = \frac{1}{|A|} \sum_{j \in A} (PM_j - MM_j)$
• A: a set of suitable probe pairs chosen by the software
    – Remove highest/lowest
    – Calculate mean, sd from remaining probes
    – Eliminate probes more than 3 sd from mean
• Drawback (naïve algorithm):
    – Can omit 30-40% of probes
    – Can give negative values

MAS5
• GeneChip® newest version
    $\mathrm{signal} = \mathrm{TukeyBiweight}\{\log(PM_j - CT_j^*)\}$
• Tukey biweight down-weights points far from the estimated center of the data scatter; it is a robust statistic, resistant to outliers
• CT* (change threshold): a version of MM that is never bigger than PM
    – If MM < PM, CT* = MM
    – If MM > PM, estimate the typical-case (Tukey biweight) MM for the PM (~70% of PM)
    – If the typical MM is also > PM, set CT* = PM -
• Works OK but ad hoc

Li & Wong (dChip)
Important observation: the relative values of probes within a probeset are very stable across multiple samples.
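Returning to the MAS5 summary above, the following is a minimal sketch of a one-step Tukey biweight, showing how points far from the center are down-weighted. The tuning constant `c = 5`, the small `eps` that guards against division by zero, and the toy values (standing in for the log(PM - CT*) values of one probe set) are assumptions for illustration and are not taken from the slides.

```python
import statistics

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean that ignores far-out points."""
    center = statistics.median(values)
    # Median absolute deviation: a robust measure of spread.
    mad = statistics.median([abs(v - center) for v in values])
    weights = []
    for v in values:
        u = (v - center) / (c * mad + eps)  # scaled distance from the center
        # Bisquare weight: near 1 close to the center, 0 once |u| >= 1.
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Hypothetical log-scale probe values with one cross-hybridizing outlier.
log_values = [8.1, 7.9, 8.3, 8.0, 8.2, 12.5]
robust_signal = tukey_biweight(log_values)
plain_mean = statistics.mean(log_values)
print(f"Tukey biweight: {robust_signal:.2f}, plain mean: {plain_mean:.2f}")
```

The outlier at 12.5 receives zero weight, so the biweight estimate stays close to the bulk of the probes while the plain mean is pulled upward.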
Model-Based Expression Index
• Look at multiple samples at a time, give different probes a different weight
• Each probe signal is proportional to
    – Amount of target in the sample: θi
    – Affinity of the specific probe sequence to the target: φj
[Figure: probes 1, 2, 3 with affinities φ1, φ2, φ3, measured in sample 1 (θ1) and sample 2 (θ2)]

Li & Wong (dChip)
• Model:
    $PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon_{ij}$
    (θi: concentration; φj: probe affinity; εij: error)
• Iteratively estimate θi and φj to minimize εij

                     Probe 1          Probe 2          Probe 3          …
                     φ1               φ2               φ3               …
    Sample 1   θ1    (PM − MM)_11     (PM − MM)_12     (PM − MM)_13     …
    Sample 2   θ2    (PM − MM)_21     (PM − MM)_22     (PM − MM)_23     …
    Sample 3   θ3    (PM − MM)_31     (PM − MM)_32     (PM − MM)_33     …
    …          …     …                …                …                …

• Try to minimize the sum of squared errors

RMA = Robust Multi-chip Analysis
• Irizarry & Speed, 2003
• 1: Probe intensity background adjustment
• 2: Quantile normalize the log-transformed, background-adjusted PM
• 3: Robust probe summary

RMA Background Subtraction
• Observed PM = Signal + Background noise
[Figure: observed = signal + background]
• Signal ~ exponential; BG ~ normal
• Background estimated from MM

Why Log(PM)
• Captures the fact that higher-value probes are more variable
• Assume probe noise is comparable on the log scale

RMA
• For each probe set, $PM_{ij} = \theta_i \phi_j$, so $\log(PM_{ij}) = \log(\theta_i) + \log(\phi_j)$
• Fit the model: $\log_2 n(PM_{ij} - bg) = a_i + b_j + \varepsilon_{ij}$
    – a_i is the expression index of sample i, b_j is the probe effect of probe j
    – log2n() stands for the logarithm after quantile normalization of n samples

RMA
• Examples…
• Iteratively refit a_i and b_j using median polish
    – Alternately remove (subtract) row and column medians until the sum of absolute residuals converges
    – For complex data structures, can efficiently find a "general picture" of the data
    – Robust to outliers in large data sets
• Similar to dChip, but minimizes the error on log PM, so puts less weight on large PMs

Gene Expression Index Method Comparison

Method Comparison Standard
• Spike-ins: introduce markers with known concentration (intensity) to RNA samples
    – Should cover a broad range of concentrations
    – Run two samples with and without spike-in and see whether the algorithm can detect the spike-in
      (differential expression)
• Dilutions:
    – Serial dilutions: 1:2, 1:4, 1:8…
• Latin square spike-in captures both approaches above
• Compare both accuracy qualitatively and expression index quantitatively

Latin Square Spike-ins
[Figure: Latin square spike-in design]

Method Comparison of Spike-in
[Figure: panels for MAS4, MAS 5, dChip and RMA; red numbers indicate spiked genes]

Summary
• Cel file and cdf file
• Array normalization: Loess, qnorm
    – Assumptions
• Normalization visualization: MA plots
• Gene Expression Index
    – RMA models probe effect in expression arrays
    – Use MM to correct background
    – Qnorm log (PM)
    – Median polish, model probe behavior to get expression index
• Method comparison
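The median polish step listed in the summary can be sketched as follows for a single probe set. This is a generic Tukey median polish under stated assumptions, not the exact RMA implementation: rows are taken to be samples and columns probes, the values are assumed to be already background-adjusted, quantile-normalized and log-transformed, and the names `median_polish`, `max_iter`, `tol` and the toy table are hypothetical.

```python
import statistics

def median_polish(table, max_iter=10, tol=1e-6):
    """Fit table[i][j] ~ overall + row_eff[i] + col_eff[j] by sweeping out medians."""
    n_rows, n_cols = len(table), len(table[0])
    resid = [list(row) for row in table]
    overall, row_eff, col_eff = 0.0, [0.0] * n_rows, [0.0] * n_cols
    prev_total = None
    for _ in range(max_iter):
        # Sweep row medians out of the residuals into the row (sample) effects.
        for i in range(n_rows):
            m = statistics.median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        m = statistics.median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep column medians out into the column (probe) effects.
        for j in range(n_cols):
            m = statistics.median([resid[i][j] for i in range(n_rows)])
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        m = statistics.median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
        # Stop once the sum of absolute residuals no longer changes.
        total = sum(abs(v) for row in resid for v in row)
        if prev_total is not None and abs(prev_total - total) <= tol:
            break
        prev_total = total
    return overall, row_eff, col_eff, resid

# Hypothetical log-scale data: 3 samples x 4 probes, additive except for one outlier.
sample_levels = [7.0, 9.0, 8.0]
probe_effects = [-1.0, 0.0, 1.0, 0.5]
table = [[a + b for b in probe_effects] for a in sample_levels]
table[1][2] += 5.0  # one aberrant probe measurement

overall, row_eff, col_eff, resid = median_polish(table)
expression_index = [overall + r for r in row_eff]  # one value per sample
print("expression index per sample:", expression_index)
print("residual at the outlier:", resid[1][2])
```

Because medians rather than means are swept out, the single aberrant cell ends up in the residuals instead of distorting the per-sample expression index.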