Periodicity Analysis in a microarray time-course study

Xin Zhao1, J.S. Marron2, Martin T. Wells1
1 Department of Statistics, Cornell University
2 Department of Statistics, University of North Carolina

Abstract: Microarray time-course genome-wide data are typically HDLSS (High Dimension Low Sample Size). Gene expression profiles over time can be viewed as functional data, and the functional approach can provide powerful new insights for this type of data. Successful data reduction from a functional viewpoint is used in an analysis of periodicities for a microarray gene expression data set. For the purpose of analyzing periodicity, an appropriate Fourier transformation followed by PCA (Principal Component Analysis) reduces the dimension of the data from 18 to 2. The 2-dimensional Fourier subspace spanned by the sine and cosine functions with 2 periods captures the main feature of periodicity in the data. The distance to the origin in this subspace can be used to measure the degree of periodicity of a gene.

Introduction: Identifying cell cycle-related genes is helpful in understanding the mechanisms that maintain order during cell division and in studying cancer. Cell cycle-related genes show periodic variation during the cell cycle. An α factor-based synchronization experiment was conducted by Spellman et al. (1998) to study yeast genome-wide gene expression during two cell cycles. Gene expression was measured for 6,178 genes over 18 equally spaced time points (covering 2 cell cycles). After pre-processing of the data by removing poor-quality observations, 4,489 genes have no missing values.

Objective: Identify cell cycle-related genes in the yeast genome, i.e., genes that are expressed periodically over the cell cycle.

Methods and Results:

1. Missing data imputation: KNN method
For the 1,689 genes with missing values, missing data points were estimated using the KNN method with k = 12.
To impute the missing value at a time point for a gene, we selected 12 genes with expression profiles similar to that gene's profile. The weighted average of these 12 genes' expression values at that time point is used as the estimate of the missing value.

2. Analysis of periodicity

I. PCA (Principal Component Analysis) on raw data
- Figure 1 does not reveal periodic structure in the raw data.
- Figure 2 shows the percentage of variation explained by each PC.
- Figures 3 and 4 show that PCA on the raw data does not reveal the frequency-2 periodic structure expected from the two-cell-cycle experimental design.

Fig 1: raw data, 6,178 gene expression time series. x-axis is time point, y-axis is log2(gene expression ratio)
Fig 2: power plot of PCA on raw data
Fig 3: projections of the data on the 1st PC direction
Fig 4: projections of the data on the 2nd PC direction

II. Project the data onto an appropriate Fourier subspace to reveal periodicity structure
Fourier basis: $B = \{\sin(i\omega t), \cos(i\omega t) : i = 2, 4, 6, 8\}$, where $\omega = 2\pi/T$, $T = 18$, $t = 1, 2, \ldots, 18$.
Projection: $B (B^{T} B)^{-1} B^{T} \cdot (\text{data matrix})$, where both the data matrix and its projection are $18 \times 6{,}178$.

III. PCA on the projected data
- Figure 5 shows the two-cell-cycle structure, but the periodic structure is still not clearly apparent.
- Figure 6 shows that the first two PCs explain about 60% of the total variation.
- Figures 7 and 8 show that the 1st (2nd) PC direction is similar to a sine (cosine) wave over two periods.

Fig 5: projected data
Fig 6: power plot of PCA on the projected data
Fig 7: projections of the projected data on the 1st PC direction
Fig 8: projections of the projected data on the 2nd PC direction

IV. Periodicity of genes in the 2-dim Fourier subspace spanned by $\{\sin(2\omega t), \cos(2\omega t)\}$
Project the data onto the 2-dim Fourier subspace, with the x (y)-axis representing the cosine (sine) direction. The distance to the origin in the subspace is a metric for the periodicity of a gene.

Figure 9: scatter plot of genes in the subspace.
x-axis is proj_cos, y-axis is proj_sin

Conclusions:
• The functional approach proved powerful for dimension reduction when the goal is to find a pattern of interest in HDLSS data.
• The main feature of periodicity in the data is preserved by the 2-dimensional Fourier subspace spanned by periodic functions of frequency 2.
• The distance to the origin in the subspace is an appropriate metric for the periodicity of genes.
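
The Fourier projection (step II) and the distance-to-origin metric (step IV) from Methods and Results can be illustrated with a minimal NumPy sketch. It is not the authors' code. The function names (`fourier_basis`, `project_onto`, `periodicity_score`) and the two synthetic "genes" are hypothetical, and scaling the sine and cosine directions to unit length before projecting is an assumption, since the text does not state a normalization.

```python
import numpy as np


def fourier_basis(freqs=(2, 4, 6, 8), n_time=18):
    """Columns are sin(i*w*t) and cos(i*w*t) for i in freqs, w = 2*pi/n_time, t = 1..n_time."""
    t = np.arange(1, n_time + 1)
    omega = 2 * np.pi / n_time
    cols = []
    for i in freqs:
        cols.append(np.sin(i * omega * t))
        cols.append(np.cos(i * omega * t))
    return np.column_stack(cols)  # shape: n_time x 2*len(freqs)


def project_onto(B, X):
    """Apply B (B^T B)^{-1} B^T to X; least squares avoids forming the inverse explicitly."""
    coef, *_ = np.linalg.lstsq(B, X, rcond=None)
    return B @ coef


def periodicity_score(X, freq=2):
    """Coordinates of each column of X along the cos/sin directions, plus distance to origin.

    Assumption: both directions are scaled to unit length before projecting.
    """
    n_time = X.shape[0]
    t = np.arange(1, n_time + 1)
    omega = 2 * np.pi / n_time
    c = np.cos(freq * omega * t)
    s = np.sin(freq * omega * t)
    c = c / np.linalg.norm(c)
    s = s / np.linalg.norm(s)
    proj_cos = c @ X  # x-axis coordinate of each gene
    proj_sin = s @ X  # y-axis coordinate of each gene
    return proj_cos, proj_sin, np.hypot(proj_cos, proj_sin)


# Synthetic demonstration data (not from the yeast experiment): rows are time points,
# columns are genes, matching the 18 x (number of genes) layout in the text.
t = np.arange(1, 19)
periodic_gene = np.cos(2 * (2 * np.pi / 18) * t + 0.7)  # two full cycles, arbitrary phase
flat_gene = np.full(18, 1.5)                            # constant profile, no periodicity
X_demo = np.column_stack([periodic_gene, flat_gene])

B = fourier_basis()
X_proj = project_onto(B, X_demo)
proj_cos, proj_sin, score = periodicity_score(X_proj)
print(score)  # the periodic gene lies far from the origin, the flat gene near it
```

On this toy input the gene with two full cycles is left unchanged by the projection and receives a large score, while the constant profile projects to essentially zero. With real data, `X_demo` would be replaced by the imputed 18 x 6,178 expression matrix.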