Searching for Periodic Gene Expression Patterns Using Lomb-Scargle Periodograms

Earl F. Glynn, Stowers Institute
Arcady R. Mushegian, Stowers Institute & Univ. of Kansas Medical Center
Jie Chen, Stowers Institute & Univ. of Missouri Kansas City

http://research.stowers-institute.org/efg/2004/CAMDA
Critical Assessment of Microarray Data Analysis Conference, November 11, 2004

Outline
• Periodic Patterns in Biology
• Introduction to Lomb-Scargle Periodogram
• Data Pipeline
• Application to Bozdech’s Plasmodium dataset
• Conclusions

Periodic Patterns in Biology
A vertebrate’s body plan: a segmented pattern. Segmentation is established during somitogenesis.
[Photograph taken at Reptile Gardens, Rapid City, SD, www.reptile-gardens.com]

Periodic Patterns in Biology: Intraerythrocytic Developmental Cycle of Plasmodium falciparum
[Figure from Bozdech, et al, Fig. 1A, PLoS Biology, Vol 1, No 1, Oct 2003, p 3.]
Expression Ratio = Cy5 / Cy3 = (RNA from parasitized red blood cells) / (RNA from all development cycles)
Values for Log2(Expression Ratio) are approximately normally distributed.
Assume gene expression reflects observed biological periodicity.

Simple Periodic Gene Expression Model
[Figure: expression vs. time, alternating between “On” and “Off” states with period T.]
frequency = 1 / period, f = 1 / T
ω = angular frequency = 2πf
Gene Expression = Constant × Cosine(2πf t)
“Periodic” if only observed over a single cycle?

Introduction to Lomb-Scargle Periodogram
• What is a Periodogram?
• Why Lomb-Scargle Instead of Fourier?
• Example Using Cosine Expression Model
• Mathematical Details
• Mathematical Experiments
  - Single Dominant Frequency
  - Multiple Frequencies
  - Mixtures: Signal and Noise

What is a Periodogram?
• A graph showing frequency “power” for a spectrum of frequencies
• A “peak” in the periodogram indicates a frequency with significant periodicity
[Figure: periodic signal (Log2(Expression) vs. time), passed through the periodogram computation, gives the periodogram (spectral “power” vs. frequency).]

Why Lomb-Scargle Instead of Fourier?
• Missing data handled naturally
• No data imputation needed
• Any number of points can be used
• No need for 2^N data points like with FFT
• Lomb-Scargle periodogram has known statistical properties
Note: The Lomb-Scargle algorithm is NOT equivalent to the conventional periodogram analysis based on Fourier analysis.

Lomb-Scargle Periodogram: Example Using Cosine Expression Model
[Figure: cosine curve (N = 48), expression vs. time [hours], evenly-spaced time points; Lomb-Scargle periodogram (normalized power spectral density vs. frequency [1/hour]); peak significance (probability vs. frequency [1/hour]) with reference levels from p = 0.05 down to p = 1e-06.]
Period at peak = 48 hours (T = 1/f); p = 3.3e-009 at peak.
A small value for the false-alarm probability indicates a highly significant periodic signal.

Lomb-Scargle Periodogram: Example Using Noisy Cosine Expression Model
[Figure: cosine curve + noise (N = 48), expression vs. time [hours], unevenly-spaced time points; time interval variability (frequency vs. log10(delta T)); Lomb-Scargle periodogram; peak significance.]
Period at peak = 45.7 hours; p = 2.54e-007 at peak.

Lomb-Scargle Periodogram: Example Using Noise
[Figure: noise (N = 48), expression vs. time [hours]; time interval variability (frequency vs. log10(delta T)); Lomb-Scargle periodogram; peak significance.]
Period at peak = 7.4 hours; p = 0.973 at peak.

Lomb-Scargle Periodogram: Mathematical Details
P_N(ω) has an exponential probability distribution with unit mean.
Source: Numerical Recipes in C (2nd Ed), p. 577

Mathematical Experiment: Single Dominant Frequency
Expression = Cosine(2πt/24)
[Figure: cosine curve (N = 48), expression vs. time [hours]; Lomb-Scargle periodogram; peak significance.]
Period at peak = 24 hours; p = 3.3e-009 at peak.
Single “peak” in periodogram. Single “valley” in significance curve.

Mathematical Experiment: Multiple Frequencies
Expression = Cosine(2πt/48) + Cosine(2πt/24) + Cosine(2πt/8)
[Figure: sum of 3 cosines (N = 48), expression vs. time [hours]; Lomb-Scargle periodogram; peak significance.]
Period at peak = 21.8 hours; p = 0.00246 at peak.
Multiple peaks in periodogram. Corresponding valleys in significance curve.
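
The periodogram and false-alarm probability used in the experiments above can be sketched in a few lines. This is a minimal pure-Python sketch of the standard normalized Lomb-Scargle form (the one given in Numerical Recipes), not the authors' code. The function names, the frequency grid (spacing 1/192 per hour) and the choice of M = N independent frequencies are assumptions made for illustration, so the p value it prints need not match the slides.

```python
import math


def lomb_scargle_power(times, values, freqs):
    """Normalized Lomb-Scargle power P_N at each frequency in freqs.

    times, values: observed time points and measurements (any spacing;
    missing observations are simply left out of both lists).
    freqs: positive frequencies in cycles per time unit.
    """
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    powers = []
    for f in freqs:
        w = 2.0 * math.pi * f  # angular frequency
        # Offset tau makes the result independent of the time origin:
        # tan(2*w*tau) = sum(sin(2*w*t)) / sum(cos(2*w*t))
        sin2 = sum(math.sin(2.0 * w * t) for t in times)
        cos2 = sum(math.cos(2.0 * w * t) for t in times)
        tau = math.atan2(sin2, cos2) / (2.0 * w)
        yc = ys = cc = ss = 0.0
        for t, v in zip(times, values):
            c = math.cos(w * (t - tau))
            s = math.sin(w * (t - tau))
            yc += (v - mean) * c
            ys += (v - mean) * s
            cc += c * c
            ss += s * s
        # Skip a term whose denominator vanishes (can happen at special
        # frequencies for evenly spaced data).
        cos_term = yc * yc / cc if cc > 1e-12 else 0.0
        sin_term = ys * ys / ss if ss > 1e-12 else 0.0
        powers.append((cos_term + sin_term) / (2.0 * variance))
    return powers


def false_alarm_probability(peak_power, n_independent):
    """P(some power > peak_power) = 1 - (1 - exp(-z))**M for pure noise."""
    if peak_power <= 0.0:
        return 1.0
    # log1p/expm1 keep precision when exp(-z) is tiny.
    return -math.expm1(n_independent * math.log1p(-math.exp(-peak_power)))


# Example: Expression = Cosine(2*pi*t/24), sampled hourly, N = 48.
times = [float(t) for t in range(48)]
values = [math.cos(2.0 * math.pi * t / 24.0) for t in times]
freqs = [k / 192.0 for k in range(1, 39)]  # assumed grid, up to ~0.2 per hour

powers = lomb_scargle_power(times, values, freqs)
peak = max(range(len(powers)), key=powers.__getitem__)
peak_period = 1.0 / freqs[peak]
p_value = false_alarm_probability(powers[peak], len(values))  # assumes M = N
print("period at peak [hours]:", peak_period, " p at peak:", p_value)
```

Because the sums run over whatever time points are present, the same function applies unchanged to unevenly spaced profiles or profiles with missing points.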
Mathematical Experiment: Multiple Frequencies
Expression = 3*Cosine(2πt/48) + Cosine(2πt/24) + Cosine(2πt/8)
[Figure: sum of 3 cosines (N = 48), expression vs. time [hours]; Lomb-Scargle periodogram; peak significance.]
Period at peak = 48 hours; p = 2.37e-007 at peak.
“Weaker” periodicities cannot always be resolved statistically.

Mathematical Experiment: Multiple Frequencies: “Duty Cycle”
[Figure: two expression profiles (N = 48), one with duty cycle 1/2 (50%) and one with duty cycle 2/3 (66.6%, e.g., human sleep cycle), each with its Lomb-Scargle periodogram and peak significance curve over frequencies 0.0 to 0.5 [1/hour].]
Period at peak = 24 hours in both panels; peak significances shown are p = 2.54e-007 and p = 5.06e-006.
One peak with symmetric “duty cycle”. Multiple peaks with asymmetric cycle.

Mathematical Experiment: Mixtures: Periodic Signal vs. Noise
“p” histogram
[Figure: three 'p' histograms for 5000 simulated expression profiles (N = 48), frequency vs. log10(p), where p corresponds to the maximum periodogram power spectral density. Panels: 100% periodic genes; 50% periodic, 50% noise; 100% noise.]

Mathematical Experiment: Mixtures: Periodic Signal vs. Noise
Multiple-Hypothesis Testing
[Figure: multiple testing correction methods for 50% simulated periodic genes (50% periodic, 50% noise): log10(p) vs. rank order of sorted p values for Bonferroni, Holm, Hochberg, Benjamini & Hochberg FDR, and None. The methods range from more false negatives (Bonferroni) to more false positives (None).]

Data Pipeline to Apply to Bozdech’s Data
1. Apply quality control checks to data
2. Apply Lomb-Scargle algorithm to all expression profiles
3. Apply multiple hypothesis testing to define “significant” genes
4. Analyze biological significance of significant genes

Bozdech’s Plasmodium dataset: 1. Apply Quality Control Checks
Global views of experiment. Remove certain outliers.
Many missing data points require imputation for Fourier analysis.

Bozdech’s Plasmodium dataset: 2.
Apply Lomb-Scargle Algorithm: Mean Expression Profile
[Figure: mean expression profile (N = 46), expression vs. time [hours]; time interval variability (frequency vs. log10(delta T)); Lomb-Scargle periodogram (normalized power spectral density vs. frequency [1/hour]); peak significance (probability vs. frequency [1/hour]).]
Period at peak = 27.4 hours; p = 0.0581 at peak.
A weak diurnal period is visible in the “mean” data profile.

Bozdech’s Plasmodium dataset: 2. Apply Lomb-Scargle Algorithm: Periodic Expression Patterns
[Figure: probes i3518_1 and opfi17638 (N = 46 each), each with expression vs. time [hours], time interval variability, Lomb-Scargle periodogram, and peak significance.]
Period at peak = 45.7 hours for both probes; peak significances are p = 1.48e-008 and p = 1.19e-008.
Examples of highly-significant periodic expression profiles.

Bozdech’s Plasmodium dataset: 2.
Apply Lomb-Scargle Algorithm: Aperiodic/Noise Expression Patterns
[Figure: probes j167_5 and f35105_2 (N = 35 and N = 45), each with expression vs. time [hours], time interval variability, Lomb-Scargle periodogram, and peak significance.]
Periods at peak: 17.8 hours and 32 hours; peak significances: p = 0.998 and p = 0.516.

Bozdech’s Plasmodium dataset: 2. Apply Lomb-Scargle Algorithm: Small “N”
[Figure: probes n170_1 and f58149_1 (N = 39 and N = 32), each with expression vs. time [hours], time interval variability, Lomb-Scargle periodogram, and peak significance.]
Periods at peak: 64 hours and 48 hours; peak significances: p = 2.74e-005 and p = 8.54e-006.

Bozdech’s Plasmodium dataset: 2.
Apply Lomb-Scargle Algorithm: Signal and Noise Mixture
'p' histogram
[Figure: histogram of log10(p) for the complete Bozdech set of 6875 probes (number of probes vs. log10(p)), with regions labeled “Periodic Probes” and “Aperiodic Probes or Noise”.]

Bozdech’s Plasmodium dataset: 3. Apply Multiple-Hypothesis Testing
[Figure: multiple testing correction methods (using R's p.adjust methods): log10(p) vs. rank order of sorted p values for Bonferroni, Holm, Hochberg, Benjamini & Hochberg FDR, and None, with the significance level α = 1E-4 marked. The methods range from more false negatives (Bonferroni) to more false positives (None).]

Bozdech’s Plasmodium dataset: 3. Apply Multiple-Hypothesis Testing
Number of significant probes by p adjustment method and significance level:

p Adjustment Method         0.05    0.01    0.001   0.0001  0.00001
Bonferroni                  3707    3050    1461      13       0
Holm                        3995    3351    1705      13       0
Hochberg                    4009    3359    1723      15       0
Benjamini & Hochberg FDR    5618    5315    4906    4358    3584
None                        5648    5351    4961    4456    3823

A priori plan: Use Benjamini & Hochberg FDR level of 0.0001.
The observed number of periodic probes is consistent with the biological observation of ~60% of the Plasmodium genome being transcriptionally active during the intraerythrocytic developmental cycle.

Bozdech’s Plasmodium dataset: 4. Analyze Biological Significance
Lomb-Scargle: 4358 probes, α = 1E-4 significance
Comparison with Bozdech’s Results

Dataset                              N time series points   Probes   Lomb-Scargle Periodic   %
Bozdech Complete                     43 .. 46               5080     4115                    81.0%
(Bozdech Quality Control Dataset)    32 .. 42               1795      243                    13.5%
Total                                                       6875     4358                    63.4%

While Lomb-Scargle identified 243 new low “N” periodic probes, the low percentage in that group may indicate some other problem.
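
The multiple-hypothesis testing step (pipeline step 3) can be sketched as follows. The slides used R's p.adjust; this is a pure-Python sketch of the same standard adjustments under the method names shown in the plot legend ("bonferroni", "holm", "hochberg", "fdr", "none"). The function name `adjust_p_values` and the five example p values are hypothetical and chosen only for illustration; they are not taken from the Bozdech data.

```python
def adjust_p_values(p_values, method):
    """Return adjusted p values in the same order as the input list."""
    m = len(p_values)
    if method == "none":
        return list(p_values)
    if method == "bonferroni":
        return [min(1.0, p * m) for p in p_values]

    order = sorted(range(m), key=lambda i: p_values[i])  # indices, ascending p
    adjusted = [0.0] * m
    if method == "holm":
        # Step-down: multiply the k-th smallest p by (m - k + 1),
        # then enforce monotonicity with a running maximum.
        running = 0.0
        for rank, i in enumerate(order):
            running = max(running, (m - rank) * p_values[i])
            adjusted[i] = min(1.0, running)
        return adjusted
    if method in ("hochberg", "fdr"):
        # Step-up: start from the largest p and keep a running minimum.
        # Hochberg uses factor (m - k + 1); Benjamini & Hochberg uses m / k.
        running = 1.0
        for rank in range(m - 1, -1, -1):
            i = order[rank]
            factor = (m - rank) if method == "hochberg" else m / (rank + 1)
            running = min(running, factor * p_values[i])
            adjusted[i] = running
        return adjusted
    raise ValueError("unknown method: " + method)


# Hypothetical p values, one per expression profile (illustration only).
example_p = [0.041, 0.001, 0.6, 0.039, 0.008]
methods = ("bonferroni", "holm", "hochberg", "fdr", "none")
adjusted = {name: adjust_p_values(example_p, name) for name in methods}

alpha = 0.05
significant = {name: sum(p <= alpha for p in adjusted[name]) for name in methods}
for name in methods:
    print(name, significant[name], "significant at", alpha)
```

Counting the adjusted p values at or below each significance level, per method, gives a table of the same shape as the one above.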
Analyze Biological Significance
Lomb-Scargle: 4358 probes, α = 1E-4 significance
Comparison with Bozdech’s Results

Dataset             Probes   Lomb-Scargle Periodic
Bozdech Overview    3711     3611

Unclear how to apply Bozdech’s ad hoc “Overview” criteria for use with the Lomb-Scargle method: “70% power in max frequency with top 75% of max frequency magnitude.”
The best 3711 Lomb-Scargle “p” values contained 3449 (92.9%) of the Overview probes.

Bozdech’s Plasmodium dataset: 4. Analyze Biological Significance
“Phaseograms”
[Figure: two phaseograms, probes ordered by phase vs. time. Panels: Lomb-Scargle results, 4358 probes; Bozdech “Overview” dataset, 2714 genes, 3395 probes.]

Bozdech’s Plasmodium dataset: 4. Analyze Biological Significance
Lomb-Scargle: 4358 probes, α = 1E-4 significance
Periodogram Map
[Figure: periodogram map, probes ordered by peak frequency, with frequency and period axes.]
• Shows periodograms, not expression profiles
• Shows frequency space, not time
• Dominant frequency band corresponds to 48-hr period
• Are “weak” bands indicative of complex expression, perhaps a diurnal component, or an asymmetric “duty cycle”?

Summary

Lomb-Scargle Method                              Fourier Method
Weights data points                              Weights frequency intervals
No special requirement                           Requires uniform spacing
No special processing                            Missing data imputed
No special requirement                           2^N points for FFT; 0 padding
Known statistical properties                     Permutation tests needed to assess statistical properties
Use “p” values                                   Ad hoc scoring rules
Need estimate of number of “independent          Usually only look at “independent” Fourier
frequencies” but explore using continuum         frequencies

Conclusions
• Lomb-Scargle periodogram is an effective tool to identify periodic gene expression profiles
• Results comparable with Fourier analysis
• Lomb-Scargle can help when data are missing or not evenly spaced
We wanted to validate the Lomb-Scargle method before applying it to our somitogenesis problem, since the Fourier technique would be difficult to use.
Scargle (1982): “surprising result is that the … spectrum of a process can be estimated … [with] only the order of the samples ...”

Conclusions
• Conclusions should not be drawn using the individual p-value calculated for each profile. A multiple comparison procedure, such as the False Discovery Rate (FDR), must be used to control the error rate.
• Expression profiles may be more complex than simple cosine curves
• Power spectra of non-sinusoid rhythms are more difficult to interpret

Supplementary Information
http://research.stowers-institute.org/efg/2004/CAMDA

Acknowledgements
Stowers Institute for Medical Research
Pourquie Lab: Olivier Pourquie, Mary-Lee Dequeant