Analysis of time-course gene expression data

Shyamal D. Peddada
Biostatistics Branch
National Inst. Environmental Health Sciences (NIH)
Research Triangle Park, NC

Outline of the talk
Some objectives for performing “long series” time-course experiments
A. Single cell-cycle experiment
– A nonlinear regression model
– Phase angle of a cell-cycle gene
– Inference
– Open research problems
B. Multiple cell-cycle experiments
– “Coherence” between multiple cell-cycle experiments
– Illustration
– Open research problems

Objectives
Some genes play an important role during the cell division cycle. They are known as “cell-cycle genes”.
Objectives: Investigate various characteristics of cell-cycle and/or circadian genes, such as:
– Amplitude of initial expression
– Period
– Phase angle of expression (angle of maximum expression for a cell-cycle gene)

Phases in cell division cycle

A brief description
• G1 phase: “GAP 1”. For many cells, this phase is the major period of cell growth during the cell's lifespan.
• S (“Synthesis”) phase: DNA replication occurs.
• G2 phase: “GAP 2”. Cells prepare for M phase. The G2 checkpoint prevents cells from entering mitosis if DNA has been damaged since the last division, providing an opportunity for DNA repair and stopping the proliferation of damaged cells.
• M (“Mitosis”) phase: Nuclear division (chromosomes separate) and cytoplasmic division (cytokinesis) occur. Mitosis is further divided into 4 phases.

Single, long series experiment …
Whitfield et al. (Molecular Biology of the Cell, 2002)
Basic design is as follows:
– Experimental units: human cancer cells (HeLa)
– Microarray platform: cDNA chips with approx. 43,000 probes (i.e. roughly 29,000 genes)
– 3 different patterns of time points (i.e. 3 different experiments)
One of the goals of these experiments was to identify periodically expressed genes.

Whitfield et al.
(Molecular Biology of the Cell, 2002)
Experiment 1 (26 time points)
HeLa cancer cells arrested in the S phase using a double thymidine block.
Sampling times after arrest (hrs):
– 0 1 2 3 4 5 6 7 8 9 10 11 12 14 15 16 18 20 22 24 26 28 32 36 40 44.

Whitfield et al. (2002)
Experiment 2 (47 time points)
HeLa cancer cells arrested in the S phase using a double thymidine block.
Sampling times after arrest (hrs):
– every hour between 0 and 46.

Whitfield et al. (2002)
Experiment 3 (19 time points)
HeLa cancer cells arrested in the M phase using thymidine followed by nocodazole.
Sampling times after arrest (hrs):
– 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36.

Whitfield et al. (2002)
Phase marker genes:
Cell Cycle Phase    Genes
----------------    -----
G1/S                CCNE1, CDC6, PCNA, E2F1
S                   RFC4, RRM2
G2                  CDC2, TOP2A, CCNA2, CCNF
G2/M                STK15, CCNB1, PLK, BUB1
M/G1                VEGFC, PTTG1, CDKN3, RAD21

Questions
– Can we describe the gene expression of a cell-cycle gene as a function of time?
– Can we determine the phase angle for a given cell-cycle gene? i.e. can we quantify the previous table in terms of angles on a circle?
– What is the period of expression for a given gene?
– Can we test the hypothesis that all cell-cycle genes share the same time period?
– Etc.

[Figure: Profile of PCNA based on experiment 2 data]

Some important observations
1. Gene expression has a sinusoidal shape.
2. Gene expression for a given gene is an average of mRNA levels across a large number of cells.
3. The duration of the cell cycle varies stochastically across cells.
4. Initially the cells are synchronized, but over time they fall out of synchrony.
5. The expression of a cell-cycle gene is therefore expected to “decrease/decay” over time. This is because of items 2 and 4 listed above!
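The averaging argument in items 2 to 5 can be illustrated with a small simulation. This is only a sketch under stated assumptions: each cell is given a pure cosine profile, the per-cell periods are drawn from a log-normal distribution, and the values `mean_period=15.0` and `sigma=0.1` are hypothetical numbers chosen for illustration, not estimates from the Whitfield et al. data.

```python
import numpy as np

def population_average(times, mean_period=15.0, sigma=0.1, n_cells=5000, seed=0):
    """Average cos(2*pi*t/T_i) over cells whose periods T_i vary from cell to cell."""
    rng = np.random.default_rng(seed)
    # Assumed (hypothetical) period distribution: T_i = mean_period * exp(sigma * z_i), z_i ~ N(0, 1)
    periods = mean_period * np.exp(sigma * rng.standard_normal(n_cells))
    # Rows index cells, columns index time points; all cells start in phase at t = 0
    signals = np.cos(2 * np.pi * times[None, :] / periods[:, None])
    # The array measures the population average, not a single cell
    return signals.mean(axis=0)

times = np.linspace(0.0, 46.0, 461)          # hours, same span as Experiment 2
avg = population_average(times)
early_peak = np.abs(avg[times <= 15.0]).max()  # cells still synchronized
late_peak = np.abs(avg[times >= 30.0]).max()   # cells have drifted out of phase
print(f"early amplitude {early_peak:.2f}, late amplitude {late_peak:.2f}")
```

Every simulated cell keeps oscillating with full amplitude; only the average decays, because cells with slightly different periods drift out of phase with one another.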
Random Periods Model (PNAS, 2004)

$$f(t) = a + bt + \frac{K}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \cos\left(\frac{2\pi t}{T \exp(\sigma z)} + \phi\right) \exp\left(-\frac{z^{2}}{2}\right) dz$$

• $a$ and $b$: background drift parameters
• $K$: the initial amplitude
• $T$: the average period
• $\sigma$: the attenuation parameter
• $\phi$: the phase angle

[Figure: Fitted curves for some phase marker genes]

Whitfield et al. (2002)
Phase marker genes:
Phase    Genes                          Phase angles (radians)
-----    -----                          ----------------------
G1/S     CCNE1, CDC6, PCNA, E2F1        0.56, 5.96, 5.87, 5.83
S        RFC4, RRM2                     5.47, 5.36
G2       CDC2, TOP2A, CCNA2, CCNF       4.24, 3.74, 3.55, 3.25
G2/M     STK15, CCNB1, PLK, BUB1        3.06, 2.67, 2.61, 2.51
M/G1     VEGFC, PTTG1, CDKN3, RAD21     2.66, 2.40, 2.25, 1.81

A hypothesis of biological interest
Do all cell-cycle genes have the same $T$ and the same $\sigma$, while the other 4 parameters are gene specific? i.e.
$$H_0: T_g = T,\ \sigma_g = \sigma \quad \text{for all genes } g$$

An Important Feature
Correlated data
– Temporal correlation within gene
– Gene-to-gene correlations

Test Statistic
Wald statistic for heteroscedastic linear and nonlinear models
– Zhang, Peddada and Rogol (2000)
– Shao (1992)
– Wu (1986)

The Null Distribution
Due to the underlying correlation structure:
– The asymptotic $\chi^{2}$ approximation is not appropriate.
– Use the moving-blocks bootstrap technique on the residuals of the nonlinear model. Künsch (1989)

Moving-blocks Bootstrap
Step 1: Fit the null model to the data and compute the residuals.
Step 2: Draw a simple random sample (with replacement) from all possible blocks, of a specific size, of consecutive residuals.
Step 3: Add these residuals to the fitted curve under the null hypothesis to obtain the bootstrap data set.
Step 4: Using the bootstrap data, fit the model under the alternative hypothesis and compute the Wald statistic.
Step 5: Repeat the above steps a large number of times.
Step 6: The bootstrap p-value is the proportion of the above Wald statistics that exceed the Wald statistic determined from the actual data.
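Steps 1 to 6 can be sketched for a single residual series as follows. This is a minimal illustration, not the authors' implementation: the function names, the default `block_size=4`, and the `statistic` callable (a stand-in for "fit the model under the alternative hypothesis and compute the Wald statistic") are all assumptions introduced here.

```python
import numpy as np

def moving_blocks_resample(residuals, block_size, rng):
    """Step 2: concatenate randomly chosen blocks of consecutive residuals."""
    residuals = np.asarray(residuals, dtype=float)
    n = residuals.size
    n_possible_blocks = n - block_size + 1      # blocks start at 0, ..., n - block_size
    n_blocks_needed = -(-n // block_size)       # ceiling of n / block_size
    starts = rng.integers(0, n_possible_blocks, size=n_blocks_needed)
    pieces = [residuals[s:s + block_size] for s in starts]
    return np.concatenate(pieces)[:n]           # trim to the original length

def bootstrap_p_value(fitted_null, residuals, statistic, observed,
                      block_size=4, n_boot=1000, seed=0):
    """Steps 3-6: proportion of bootstrap statistics exceeding the observed one.

    `statistic` is a user-supplied callable (hypothetical placeholder) that
    refits the alternative model to a data vector and returns the test statistic.
    """
    rng = np.random.default_rng(seed)
    fitted_null = np.asarray(fitted_null, dtype=float)
    exceed = 0
    for _ in range(n_boot):
        # Step 3: bootstrap data = null fit + resampled residual blocks
        y_star = fitted_null + moving_blocks_resample(residuals, block_size, rng)
        # Step 4: statistic on the bootstrap data
        if statistic(y_star) > observed:
            exceed += 1
    return exceed / n_boot
```

Resampling whole blocks rather than individual residuals keeps the short-range temporal correlation within each block, which is the reason the moving-blocks scheme is used here instead of an ordinary residual bootstrap.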
Analysis of experiment 2
The bootstrap p-value for testing $H_0: T_g = T,\ \sigma_g = \sigma$ using the Experiment 2 data of Whitfield et al. (2002) is 0.12. Since the null hypothesis is not rejected at the 0.05 level, our model is biologically plausible.

Multiple experiments
Statistical inferences on the phase angle

Some questions of interest
How to evaluate or combine results from multiple cell division cycle experiments?
– Are the results “consistent” across experiments? How to evaluate this? What could be a possible criterion?

Data
$\hat{\phi}_{g,i}$: RPM estimate of the phase angle of a cell-cycle gene g from the i-th experiment.

Representation using a circle
Consider 4 cell-cycle genes A, B, C, D. The vertical line in the circle denotes the reference line. The angles are measured in the counter-clockwise direction. Thus the sequential order of expression in this example is A, B, D, C.
[Figure: circle with the genes A, B, D, C marked on it]

“Coherence” in multiple cell-cycle experiments
A group of cell-cycle genes is said to be coherent across experiments if the sequential order of their phase angles is preserved across experiments.
[Figure: three circles, Exp 1, Exp 2 and Exp 3, each with the genes A, B, D, C marked on it]

Geometric Representation
We shall represent phase angles from multiple cell-cycle experiments using concentric circles. Each circle represents an experiment. The same gene from a pair of experiments is connected by a line segment.
– A figure with non-intersecting lines indicates perfect coherence.
– If there is no coherence at all, then there will be many intersecting lines.

[Figures: two examples that are perfectly coherent, and one example with no coherence]

Estimated Phase Angles
Due to statistical errors in estimation, the estimated phase angles from multiple cell-cycle experiments need not preserve the sequential order even though the true phase angles are in a sequential order.
How to evaluate coherence?
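For error-free angles, the definition of coherence above reduces to comparing cyclic orders. The sketch below checks exactly that; the function names and the example angles are hypothetical, and ties between angles are not handled. It does not address estimation error, which is the harder question raised on the last slide.

```python
import numpy as np

def cyclic_order(angles, genes):
    """Genes sorted counter-clockwise by angle, rotated to start at genes[0]."""
    wrapped = np.mod(np.asarray(angles, dtype=float), 2 * np.pi)
    order = [genes[i] for i in np.argsort(wrapped, kind="stable")]
    # A cyclic order has no natural starting point, so fix one for comparison
    start = order.index(genes[0])
    return order[start:] + order[:start]

def is_coherent(angles_by_experiment, genes):
    """True if every experiment yields the same cyclic order of the genes."""
    orders = [cyclic_order(a, genes) for a in angles_by_experiment]
    return all(o == orders[0] for o in orders[1:])

genes = ["A", "B", "C", "D"]
exp1 = [0.5, 1.5, 4.0, 3.0]     # hypothetical angles (radians); order A, B, D, C
exp2 = [2.5, 3.5, 6.0, 5.0]     # exp1 rotated by 2.0 radians
exp3 = [3.5, 4.5, 7.0, 6.0]     # exp1 rotated by 3.0 radians (7.0 wraps past 2*pi)
shuffled = [0.5, 3.0, 4.0, 1.5]  # B and D swapped relative to exp1

print(cyclic_order(exp1, genes))
print(is_coherent([exp1, exp2, exp3], genes))
print(is_coherent([exp1, shuffled], genes))
```

Rotating every angle in an experiment by the same amount leaves the result unchanged, which matches the idea that the reference line may differ between experiments.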
Some background on regression for circular data
[Figure: two circles, Experiment A with estimated phase angles $\hat{\phi}_{1,1}, \hat{\phi}_{2,1}, \hat{\phi}_{3,1}$ and Experiment B with $\hat{\phi}_{1,2}, \hat{\phi}_{2,2}, \hat{\phi}_{3,2}$]
Question: Can we determine a rotation matrix A such that we can rotate the circle representing Experiment A to obtain the circle representing Experiment B?

Angle of rotation for a rigid body
Yes! By solving the following minimization problem:
$$\min_{A \in S} \sum_{g=1}^{n} \left\| \hat{\phi}_{g,2} - A\,\hat{\phi}_{g,1} \right\|^{2}$$
$$\hat{A} = \begin{pmatrix} \cos\hat{\psi}_{v|u} & -\sin\hat{\psi}_{v|u} \\ \sin\hat{\psi}_{v|u} & \cos\hat{\psi}_{v|u} \end{pmatrix}$$
(Here each angle $\hat{\phi}$ is read as the point $(\cos\hat{\phi}, \sin\hat{\phi})'$ on the unit circle, and $S$ is the set of 2x2 rotation matrices.)

Determination of Coherence Across “k” Experiments

The Basic Idea
Consider a rigid body rotating in a plane. Suppose the body is perfectly rigid with no deformations. Let $A_{i,i+1}$ denote the 2x2 rotation matrix from experiment i to i+1 (with k+1 = 1). Then
$$A_{12} A_{23} A_{34} \cdots A_{k-1,k} = A_{1k}$$
Alternatively,
$$A_{12} A_{23} A_{34} \cdots A_{k-1,k} A'_{1k} = I$$
$$A_{12} A_{23} A_{34} \cdots A_{k-1,k} A_{k1} = I$$

The Basic Idea
Equivalently, if
$$A_{i,i+1} = \begin{pmatrix} \cos\hat{\psi}_{i+1|i} & -\sin\hat{\psi}_{i+1|i} \\ \sin\hat{\psi}_{i+1|i} & \cos\hat{\psi}_{i+1|i} \end{pmatrix}$$
then under perfect rigid body motion we should have
$$\cos\left(\sum_{i=1}^{k} \psi_{i+1|i}\right) = 1$$

Problem!
In the present context we do NOT necessarily have a rigid body!
– Not all experiments are performed with the same precision.
– The time axis may not be constant across experiments.
– The number of time points may not be the same across experiments.
– Etc.

[Figure: Example: Not a rigid motion but perfectly coherent]

Consequence
The rotation matrix A alone may not be enough to bring two circles to congruence! An additional “association/scaling” parameter may be needed, as seen in the previous figure!

Circular-Circular regression model for a pair of experiments (Downs and Mardia, 2002)
For $g = 1, 2, \ldots, G$, let $(\hat{\phi}_{g,1}, \hat{\phi}_{g,2})$ denote a pair of angular variables.
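Suppose $\hat{\phi}_{g,2} \mid \hat{\phi}_{g,1}$ is von Mises distributed with mean direction $\mu$ and concentration parameter $\kappa$.

Circular-Circular Regression Model (Downs and Mardia, 2002)
The regression model is given by the link function
$$\tan\left(\frac{\mu - \beta_{2|1}}{2}\right) = \omega_{2|1} \tan\left(\frac{\hat{\phi}_{g,1} - \alpha_{2|1}}{2}\right),$$
where
– $\psi_{2|1} = \beta_{2|1} - \alpha_{2|1}$: the angle of rotation
– $\omega_{2|1}$: “association parameter”, $0 \le \omega_{2|1} \le 1$

Back to the toy examples
$(\hat{\omega}_{B|A}, \hat{\omega}_{C|B}, \hat{\omega}_{A|C}) = (1, 1, 1)$, $\ |\hat{\psi}_{B|A} + \hat{\psi}_{C|B} + \hat{\psi}_{A|C}| = 0$
$(\hat{\omega}_{B|A}, \hat{\omega}_{C|B}, \hat{\omega}_{A|C}) = (.64, .34, .20)$, $\ |\hat{\psi}_{B|A} + \hat{\psi}_{C|B} + \hat{\psi}_{A|C}| = 0$
$(\hat{\omega}_{B|A}, \hat{\omega}_{C|B}, \hat{\omega}_{A|C}) = (0, 0, 0)$, $\ |\hat{\psi}_{B|A} + \hat{\psi}_{C|B} + \hat{\psi}_{A|C}| = 2.2$

Determination Of Coherence
Suppose we have K experiments, labeled 1, 2, 3, …, K. Let $\hat{\psi}_{i|j}$ denote the angle of rotation for the regression of i on j for a group of g genes.
Compute
$$\left| \sum_{i=1}^{K} \hat{\psi}_{i|i+1} \right|$$
Note: $K + 1 \equiv 1$.

Determination Of Coherence
We expect $\left| \sum_{i=1}^{K} \hat{\psi}_{i|i+1} \right|$ under no coherence to be “stochastically” larger than $\left| \sum_{i=1}^{K} \hat{\psi}_{i|i+1} \right|$ under coherence.

[Figure: Comparison of Cumulative Distribution Functions. Blue line: Coherence. Pink line: No Coherence.]

Determination Of Coherence
For a given data set compute
$$c = \left| \sum_{i=1}^{K} \hat{\psi}_{i|i+1} \right|$$
Generate the bootstrap distribution of $\left| \sum_{i=1}^{K} \hat{\psi}_{i|i+1} \right|$ under the null hypothesis of no coherence.

Bootstrap P-value For Coherence
Let $\hat{\psi}^{*}_{i|i+1}$ denote the angle of rotation using the bootstrap sample. Then the P-value is:
$$P\left( \left| \sum_{i=1}^{K} \hat{\psi}^{*}_{i|i+1} \right| \le c \right)$$

Illustration: Whitfield et al. data
There are 3 experiments. The phase angle of each gene was estimated using the Liu et al. (2004) model. A total of 47 common cell-cycling genes were selected from the three experiments.

Estimates
The estimated values of interest are
$(\hat{\omega}_{2|1}, \hat{\omega}_{3|2}, \hat{\omega}_{1|3}) = (0.67, 0.70, 0.64)$,
$(\hat{\psi}_{B|A}, \hat{\psi}_{C|B}, \hat{\psi}_{A|C}) = (0.5, -3.03, 2.59)$
Note that $|\hat{\psi}_{2|1} + \hat{\psi}_{3|2} + \hat{\psi}_{1|3}| = 0.06$ radians
$$P\left( |\hat{\psi}^{*}_{2|1} + \hat{\psi}^{*}_{3|2} + \hat{\psi}^{*}_{1|3}| \le 0.06 \right) = 0.029$$

Conclusion
Since the bootstrap P-value < 0.05, we conclude that the three experiments are coherent.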
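The arithmetic behind the reported statistic can be checked directly from the three estimated rotation angles. The sketch below is illustrative only: the function names are hypothetical, wrapping the sum to (-pi, pi] before taking the absolute value is an assumption made here (the slides write only the absolute value of the sum), and the generation of bootstrap statistics under "no coherence" is left to the caller.

```python
import math

def wrap_to_pi(angle):
    """Map an angle in radians to the interval (-pi, pi]."""
    wrapped = math.fmod(angle + math.pi, 2 * math.pi)
    if wrapped <= 0:
        wrapped += 2 * math.pi
    return wrapped - math.pi

def coherence_statistic(rotation_angles):
    """|sum of rotation angles around the loop of experiments|, wrapped (assumption)."""
    return abs(wrap_to_pi(sum(rotation_angles)))

def coherence_p_value(observed, bootstrap_statistics):
    """Proportion of bootstrap statistics (generated under no coherence) that are <= observed."""
    hits = sum(1 for s in bootstrap_statistics if s <= observed)
    return hits / len(bootstrap_statistics)

# Rotation angles reported for the Whitfield et al. illustration
c = coherence_statistic([0.5, -3.03, 2.59])
print(round(c, 2))
```

Small values of the statistic count as evidence for coherence, so the p-value is a lower-tail proportion rather than the usual upper-tail one.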
                               Phase (rad)              Res (rad)              Dispersion (rad)
Accession  Gene Symbol       A      B      C      B - B|A  C - C|B  A - A|C   Cir_dist
---------  -------------   -----  -----  -----    -------  -------  -------   --------
AA135809   EST             0.882  0.040  3.399     -0.29     0.66    -0.10      0.04
W93120     EST             0.260  0.427  2.580      0.52    -0.58     0.53      0.21
T54121     CCNE1*          1.191  0.559  2.661      0.02    -0.65     1.35      0.33
AA131908   FLJ10540        3.534  2.220  6.186     -0.65     0.65    -0.08      0.25
AA088457   EST             2.613  2.373  5.700      0.66    -0.02    -0.68      0.08
AA464019   E2-EPF          3.478  2.464  5.798     -0.33    -0.02     0.12      0.07
AA430092   BUB1            3.566  2.510  6.132     -0.41     0.26    -0.01      0.11
AA425404   FLJ10156        3.508  2.519  6.241     -0.32     0.36    -0.14      0.12
H73329     C20orf1         3.494  2.594  5.873     -0.22    -0.09     0.08      0.04
AA629262   PLK             3.314  2.613  5.888      0.05    -0.10    -0.11      0.02
AA157499   MAPK13          3.390  2.615  5.784     -0.05    -0.20     0.04      0.02
AA282935   MPHOSPH1        3.826  2.667  6.233     -0.64     0.19     0.18      0.12
AA053556   MKI67           3.600  2.731  5.665     -0.24    -0.44     0.33      0.06
AA279990   TACC3           3.804  2.810  0.275     -0.46     0.37    -0.05      0.13
AA402431   CENPE           3.556  2.892  5.939     -0.01    -0.33     0.10      0.01
R11407     STK15           3.484  2.940  5.869      0.14    -0.44     0.08      0.01
AA598776   CDC20           3.355  2.957  5.854      0.34    -0.47    -0.04      0.00
AA262211   KIAA0008        3.457  2.989  5.918      0.23    -0.44     0.02      0.00
AA421171   NUF2R           3.785  3.000  5.679     -0.24    -0.69     0.50      0.10
AA010065   CKS2            3.341  3.030  5.826      0.43    -0.57    -0.04      0.02
AA292964   CKS2            3.312  3.037  5.980      0.48    -0.42    -0.17      0.01
AA430511   FLJ14642        4.170  3.244  1.653     -0.57     1.35    -0.74      0.70
AA430511   FLJ14642        4.170  3.244  1.474     -0.57     1.17    -0.57      0.55
AA676797   CCNF            4.024  3.249  1.170     -0.35     0.86    -0.46      0.36
AA458994   PMSCL1          0.841  3.387  0.298     -0.15    -0.13     0.12      0.01
AA235662   FLJ14642        3.653  3.396  1.278      0.35     0.85    -0.92      0.51
N63744     FLJ10468        3.864  3.511  0.637      0.15     0.11    -0.23      0.07
AA620485   ANKT            3.709  3.531  0.923      0.40     0.38    -0.59      0.24
AA608568   CCNA2           3.857  3.541  6.133      0.19    -0.70     0.28      0.05
R96941     C20orf129       3.751  3.546  0.667      0.36     0.11    -0.36      0.11
AA504625   KNSL1           4.107  3.551  0.410     -0.17    -0.15     0.17      0.00
AI053446   EST             4.348  3.612  1.256     -0.45     0.65    -0.21      0.21
R22949     EST             4.164  3.631  0.161     -0.17    -0.46     0.39      0.02
AA452513   KNSL5           3.915  3.730  0.192      0.29    -0.50     0.12      0.03
T66935     DKFZp762E1312   4.193  3.884  0.800      0.04    -0.01    -0.01      0.03
AA099033   USP1*           5.000  4.760  2.876     -0.12     1.43    -1.45      0.71
AA485454   EST             4.886  5.086  0.891      0.33    -0.79     0.61      0.24
AA485454   EST             4.275  5.086  0.891      1.12    -0.79     0.00      0.44
AA485454   EST             4.886  5.235  0.891      0.48    -0.90     0.61      0.32
AA485454   EST*            4.275  5.235  0.891      1.27    -0.90     0.00      0.55
AA620553   FEN1            5.897  5.510  3.028     -0.21     1.02    -0.79      0.23
AA425120   CHAF1B          5.697  5.714  1.685      0.16    -0.49     0.76      0.16
N57722     MCM6            0.047  5.817  2.568     -0.23     0.31     0.34      0.00
AA450264   PCNA            0.195  5.858  2.438     -0.29     0.14     0.67      0.02
H51719     ORC1L           5.906  5.917  2.889      0.19     0.54    -0.57      0.14
H59203     CDC6            0.551  5.968  2.723     -0.43     0.33     0.61      0.04
R06900     RAMP            0.243  6.049  2.889     -0.13     0.42     0.06      0.00

Statistical inferences on the phase angle - Some open problems

Estimation subject to inequality constraints
It is reasonable to hypothesize that for a normal cell division cycle, the p phase marker genes must express in an order around the unit circle. Thus they must satisfy:
$$0 \le \phi_1 \le \phi_2 \le \cdots \le \phi_p \le 2\pi$$

Open problems - data from single experiment
How to estimate the phase angles subject to the simple order restriction?
$$0 \le \phi_1 \le \phi_2 \le \cdots \le \phi_p \le 2\pi$$
More generally, how to estimate the phase angles subject to the isotropic simple order restriction?
$$\phi_1 \le \phi_2 \le \cdots \le \phi_p$$
How to test the above hypothesis? What are the null and alternative hypotheses?

Open problems – data from multiple experiments
How do we estimate the phase angles from multiple experiments under the order restriction on the phase angles of cell-cycle genes?
What are the statistical errors associated with such an estimator?
How to construct confidence intervals and test hypotheses?

Acknowledgments
Delong Liu (former Post-doc at NIEHS)
David Umbach (NIEHS)
Leping Li (NIEHS)
Clare Weinberg (NIEHS)
Pat Crocket (Constella Group)
Cristina Rueda (Univ. of Valladolid, Spain)
Miguel Fernandez (Univ. of Valladolid, Spain)