A "Consistency" Test for Determining the Significance of Gene Expression Changes on Replicate Samples and Two Convenient Variance-stabilizing Transformations

Peter J. Munson, Ph.D.
Mathematical and Statistical Computing Laboratory
DCB, CIT, NIH
munson@helix.nih.gov

P. J. Munson, National Institutes of Health, Nov. 2001

Introduction
• Math. Stat. Comp. Lab. at NIH
• Run Affy LIMS database
  – Started Dec 2000, stores >700 chips
  – Serves 3 core facilities at NIH
• Study 1
  – 2 treatments, 5 time points, 6 subjects, 60 U95A chips, PBMC cells
• Study 2
  – 3 treatments, 5 time points, 5 subjects, 75 Hu6800 chips, human cells in culture
• Study 3
  – 4 doses, 2 time points, 20 subjects, 20 RG U34A chips, blood cells

Outline
• Development of Consistency Test
• Variance-stabilizing transforms
  – Generalized Logarithm, GLog
  – Adaptive transform for Average Diff, TAD
• Normalization
  – Normal quantile + adaptive transform
• Application
• Probe-pair data visualization:
  – Parallel Axis Coordinate Display

Comparing Two Cell Lines
• Don't subtract background
• Ignore background-level points
• Calibrate on median intensity of each cell type
• Over 3-fold change = outside dashed lines
• Are these expression level changes significant? Real?
Data from Carlisle et al., Mol. Carcinogen., 2000

Duplicate Experiments and "Consistency" Plot Identifies Real Changes in Expression
[Figure: consistency plot; labeled points: Keratin 5, Vimentin]

Replication Permits Calculation of Significance (P-values)
4 false positives out of 5760 spots: P ≈ 4/5760 = 0.0007

Consistency Plot
• Compare duplicate experiments, Log Ratio scale
• Set cutoffs for over-, underexpression
• Calculate number detected, D
• Assume independence, calculate expected number, E, above both, below both cutoffs
• Estimate false positive rate, E/D
[Figure: consistency plot of L21b**exp45 vs. L12b**exp44, both axes from -1 to 1. Annotations: D=24, E=0.6, E/D=3%; D=16, E=0.6.]

p53 +/+ cells 6 hrs, replicate reciprocal experiment
[Figure: consistency plots, axes from -1 to 1; axis labels L21**exp64, L12**exp44, L12**exp63.]

Consistency Test on Relative Expression
DEFINE:
  $x(g, i)$ = relative expression value for gene $g$ ($= 1, \dots, n$) in experiment $i$ ($= 1, \dots, m$)
  $F_i(X)$ = empirical cdf of $x_i$ across genes (spots)
  $c = \min_j x(g, j)$, across experiments
THEN, assuming that $\{ x(g, i),\ g = 1, \dots, n \}$ are an independent sample from distribution $F_i$, the probability that $x(g, i)$ is consistently large is:
  $p_{up}(g) = \Pr(X_i \ge c \text{ for all } i) = \prod_i \left(1 - F_i(c)\right)$

Consistency Test on Relative Expression - 2
DEFINE:
  $x(g, i)$ = relative expression value for gene $g$ ($= 1, \dots, n$) in experiment $i$ ($= 1, \dots, m$)
  $p_{up}(g) = \prod_i \left(1 - F_i\left(\min_j x(g, j)\right)\right)$
  $p_{dn}(g) = \prod_i F_i\left(\max_j x(g, j)\right)$
THEN the expected number of false positives is:
  $E(g) = n \cdot p(g)$

Assumptions of Consistency Test
• Independence between experiments
• "Exchangeability" of genes
• Homogeneity of variance across genes (i.e. across expression intensity)
Does NOT require:
• Identical distribution in separate experiments
But, variance homogeneity violated for Affy Avg. Diff. data
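
The consistency test defined on the preceding slides can be sketched in a few lines of NumPy. This is a minimal illustration, not the author's implementation: the function names `consistency_pvalues` and `expected_false_positives` are hypothetical, and the handling of ties is an assumption. In particular, the sketch uses the fraction of genes with a value at or above c (matching Pr(X_i ≥ c)) in place of 1 − F_i(c), so that a gene's own value is counted and no p-value is exactly zero.

```python
import numpy as np

def consistency_pvalues(x):
    """Consistency-test p-values for an (n genes) x (m experiments) array.

    Returns (p_up, p_dn), one value per gene. Tie handling (>= and <=)
    is an assumption of this sketch, not something stated on the slides.
    """
    x = np.asarray(x, dtype=float)
    n, m = x.shape
    c_min = x.min(axis=1)   # min_j x(g, j): least extreme "up" value per gene
    c_max = x.max(axis=1)   # max_j x(g, j): least extreme "down" value per gene
    p_up = np.ones(n)
    p_dn = np.ones(n)
    for i in range(m):
        col = np.sort(x[:, i])
        # Fraction of genes in experiment i with value >= c_min (stands in for 1 - F_i)
        frac_ge = (n - np.searchsorted(col, c_min, side="left")) / n
        # Empirical cdf F_i evaluated at c_max: fraction of genes with value <= c_max
        frac_le = np.searchsorted(col, c_max, side="right") / n
        p_up *= frac_ge
        p_dn *= frac_le
    return p_up, p_dn

def expected_false_positives(p):
    """E(g) = n * p(g), where n is the number of genes."""
    p = np.asarray(p, dtype=float)
    return p.size * p
```

A gene that is the largest in every experiment gets a small p_up, while a gene that is high in one experiment and low in the other gets a p_up near 1, which is the behavior the consistency plot is meant to capture.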

Variance Stabilizing Transformations
• Logarithm
• Box-Cox, power
• Generalized Logarithm, GLog
• Adaptive, TAD

Model Variance as Function of Mean AD
  $\mathrm{Var}(y) = a_0$
  $\mathrm{Var}(y) = a_0 + a_1 y$
  $\mathrm{Var}(y) = a_0 + a_1 y + a_2 y^2$
  $\mathrm{Var}(y) = a_2 y^2$  =>> use logarithms
What about:
  $\mathrm{Var}(y) = a_0 + a_2 y^2$

Generalized Log Transform (G-Log)
  $\mathrm{Var}(y) = a_0 + a_2 y^2 = a_0 \left(1 + (y/c)^2\right)$, where $c = \sqrt{a_0 / a_2}$
  $\mathrm{GLog}(y; c) = \mathrm{sign}(y) \, \ln\left\{ |y/c| + \sqrt{1 + y^2/c^2} \right\}$
e.g. c = s.d. at y = 0 / CV = 10 / 0.1 = 100

Quantile Normalization for AD (before)

Quantile Normalization for AD (after)

Normal Quantile Transform after GLog(AD)
(it's almost linear)

Adaptive Transform of AD (TAD) - 1
• Model variance (over many replicates) vs. mean AD
• Plot: Log(SD) or Wilson-Hilferty SD^(2/3) transform vs. mean of NQ(AD)
• Fit smooth function, g, which predicts SD

Adaptive Transform of AD (TAD) - 2
  $T(X) = \int_{-\infty}^{X} \frac{1}{g(u)} \, du$

Adaptive Transform of AD (TAD)

Consistency Test p-values
[Figure: histograms of consistency-test p-values (count vs. p-value, 0 to 1); panels: Time 1 vs. Time 0 and Time 2 vs. Time 0, for Treatment and Sham.]

Results of Study 1 (5 time points, 2 treatments, 6 subjects)

Table 1.
Number of genes detected by consistency test with expected false positives set to 1.0

Group      Any time   1-0   2-0   3-0   4-0
Treated       385      13   340    22    19
Controls       83      21    23    26    24
Both            2       0     1     2     1

Table 3. Number of genes detected by Maximum TAD greater than 1

Group      Any time   1-0   2-0   3-0   4-0
Treated       275       5   264     4     5
Controls        6       1     2     4     4
Both            1       0     0     0     1

Probe Pair Data, Delta TAD = 2
Parallel Axis Coordinate Display

Probe Pair Data, Delta TAD = 0.5

Probe Pair Data, Delta TAD = -1.5

Probe Pair Data, Delta TAD = -0.5

Acknowledgements
Lynn Young, MSCL
Vinay Prabhu, MSCL
Jennifer Barb, MSCL
Howard Shindel, MSCL
Andrew Schwartz, CIT
Steve Bailey, CIT
Sayed Daoud, NCI
Yves Pommier, NCI
John Weinstein, NCI
Robert Danner, CC
Anthony Suffredini, CC
Peter Eichacker, CC
James Shelhamer, CC
Eric Gerstenberger, CC
David Rocke, UC Davis
David Krizman, NCI
Alex Carlisle, NCI
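
Supplementary sketch for the "Generalized Log Transform (G-Log)" and "Adaptive Transform of AD (TAD) - 2" slides: the two transforms can be written down directly from their formulas. This is an illustration under stated assumptions, not the author's code. The function names `glog` and `tad` are hypothetical, the lower integration limit of minus infinity is replaced by the first grid point, and the integral is approximated with the trapezoid rule. The check at the end uses the quadratic variance model from the slides (s.d. 10 at y = 0, CV 0.1, so c = 100), for which integrating 1/g gives a rescaled GLog.

```python
import numpy as np

def glog(y, c):
    """GLog(y; c) = sign(y) * ln(|y/c| + sqrt(1 + y^2/c^2)).

    Mathematically this equals arcsinh(y / c).
    """
    y = np.asarray(y, dtype=float)
    r = y / c
    return np.sign(y) * np.log(np.abs(r) + np.sqrt(1.0 + r ** 2))

def tad(x_grid, g):
    """Approximate T(X) = integral of 1/g from the first grid point up to X.

    x_grid: increasing 1-D grid of intensities.
    g: callable returning the predicted SD at each grid point (must be > 0).
    Assumption: the lower limit -inf is replaced by x_grid[0], so T is
    defined only up to an additive constant.
    """
    x_grid = np.asarray(x_grid, dtype=float)
    integrand = 1.0 / g(x_grid)
    # Trapezoid rule on each grid interval, then cumulative sum
    steps = 0.5 * np.diff(x_grid) * (integrand[1:] + integrand[:-1])
    return np.concatenate(([0.0], np.cumsum(steps)))

# Quadratic variance model: Var(y) = a0 + a2 * y^2
a0, a2 = 10.0 ** 2, 0.1 ** 2       # s.d. at y = 0 is 10, CV is 0.1
c = np.sqrt(a0 / a2)               # c = 10 / 0.1 = 100
grid = np.linspace(0.0, 1000.0, 10001)
t_values = tad(grid, lambda y: np.sqrt(a0 + a2 * y ** 2))
```

For this variance model the numerically integrated `t_values` agree with `glog(grid, c) / sqrt(a2)`, which illustrates why GLog stabilizes variance when Var(y) = a0 + a2·y²; for an empirically fitted g the integral generally has no closed form and the numerical version is the one that applies.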