Using Lomb-Scargle Periodograms to Identify Periodic Genes in Somitogenesis
Earl F. Glynn, Scientific Programmer
Stowers Institute for Medical Research
29 March 2006

• Periodic Patterns in Biology
• Simple Periodic Gene Expression Model
• Introduction to Lomb-Scargle Periodogram
• Data Pipeline
• Methodology Validation Study: Bozdech’s Plasmodium dataset
• Application to Mary-Lee’s somitogenesis dataset
• Conclusions

Periodic Patterns in Biology
A vertebrate’s body plan: a segmented pattern.
Segmentation is established during somitogenesis.
[Photograph taken at Reptile Gardens, Rapid City, SD, www.reptile-gardens.com]

Simple Periodic Gene Expression Model
[Figure: Expression vs. Time, alternating between “On” and “Off”, with the period (T) marked between successive “On” states]
frequency = 1/period: f = 1/T
ω = angular frequency = 2πf
Gene Expression = Amplitude × Cosine(2πf t + PhaseShift)
“Periodic” if only observed over a single cycle?

Introduction to Lomb-Scargle Periodogram
• What is a Periodogram?
• Why Lomb-Scargle Instead of Fourier?
• Example Using Cosine Expression Model
• Mathematical Details
• Lomb-Scargle Experiments
  - Single Dominant Frequency
  - Multiple Frequencies
  - Mixtures: Signal and Noise
  - Multiple Hypothesis Testing

What is a Periodogram?
• A graph showing frequency “power” for a spectrum of frequencies
• “Peak” in periodogram indicates a frequency with significant periodicity
[Figure: Periodic Signal (Log2(Expression) vs. Time), then Computation, then Periodogram (Spectral “Power” vs. Frequency)]

Why Lomb-Scargle Instead of Fourier?
Lomb-Scargle Method | Fourier Method
Weights data points | Weights frequency intervals
Data can be unevenly sampled | Requires uniform spacing
No data imputation | Missing data imputed
Any number of data points | 2^N points for FFT; 0 padding
Known statistical properties; “p” value | Permutation tests needed to assess statistical properties; ad hoc scoring rules
Need estimate of number of “independent frequencies”, but explore using continuum | Usually only look at “independent” Fourier frequencies

Lomb-Scargle Periodogram Example Using Cosine Expression Model
[Figure, three panels: Cosine Curve (N = 48), Expression vs. Time [hours]; Lomb-Scargle Periodogram, Normalized Power Spectral Density vs. Frequency [1/hour], Period at Peak = 48 hours; Peak Significance, Probability vs. Frequency [1/hour], p = 3.3e-009 at Peak. Reference lines at p = 0.05, 0.01, 0.001, 1e-04, 1e-05, 1e-06.]
T = 1/f
A small value for the false-alarm probability indicates a highly significant periodic signal.
Evenly-spaced time points

Lomb-Scargle Periodogram Example Using Noisy Cosine Expression Model
[Figure, four panels: Cosine Curve + Noise (N = 48), Expression vs. Time [hours]; Time Interval Variability, Frequency vs. log10(delta T); Lomb-Scargle Periodogram, Normalized Power Spectral Density vs. Frequency [1/hour], Period at Peak = 45.7 hours; Peak Significance, Probability vs. Frequency [1/hour], p = 2.54e-007 at Peak.]
Unevenly-spaced time points

Lomb-Scargle Periodogram Example Using Noise
[Figure: Lomb-Scargle Periodogram, Normalized Power Spectral Density vs. Frequency [1/hour], Period at Peak = 7.4 hours; Peak Significance, Probability vs. Frequency [1/hour], p = 0.973 at Peak.]
[Figure panels, continued: Noise (N = 48), Expression vs. Time [hours]; Time Interval Variability, Frequency vs. log10(delta T).]

Lomb-Scargle Periodogram Mathematical Details
P_N(ω) has an exponential probability distribution with unit mean.
Source: Numerical Recipes in C (2nd Ed), p. 577

Lomb-Scargle Periodogram Experiment: Single Dominant Frequency
Expression = Cosine(2πt/24)
[Figure, three panels: Cosine Curve (N = 48), Expression vs. Time [hours]; Lomb-Scargle Periodogram, Normalized Power Spectral Density vs. Frequency [1/hour], Period at Peak = 24 hours; Peak Significance, Probability vs. Frequency [1/hour], p = 3.3e-009 at Peak.]
Single “peak” in periodogram. Single “valley” in significance curve.

Lomb-Scargle Periodogram Experiment: Multiple Frequencies
Expression = Cosine(2πt/48) + Cosine(2πt/24) + Cosine(2πt/8)
[Figure, three panels: Sum of 3 Cosines (N = 48), Expression vs. Time [hours]; Lomb-Scargle Periodogram, Normalized Power Spectral Density vs. Frequency [1/hour], Period at Peak = 21.8 hours; Peak Significance, Probability vs. Frequency [1/hour], p = 0.00246 at Peak.]
Multiple peaks in periodogram. Corresponding valleys in significance curve.
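As an illustration of the mathematical details and the cosine experiments above, the following Python sketch computes the normalized Lomb-Scargle periodogram in the form given in Numerical Recipes, with the false-alarm probability 1 - (1 - exp(-z))^M for a peak of height z. It is not the code used for the talk. The function names, the frequency grid, and the choice M = N for the number of independent frequencies are assumptions made for this example, so the p-value it prints need not reproduce the values shown on the slides.

```python
import math

def lomb_scargle(times, values, freqs):
    """Normalized Lomb-Scargle power at each frequency (cycles per time unit)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    powers = []
    for f in freqs:
        w = 2.0 * math.pi * f  # angular frequency
        # tau shifts the time origin so the sine and cosine terms decouple.
        tau = math.atan2(sum(math.sin(2 * w * t) for t in times),
                         sum(math.cos(2 * w * t) for t in times)) / (2 * w)
        cos_t = [math.cos(w * (t - tau)) for t in times]
        sin_t = [math.sin(w * (t - tau)) for t in times]
        power = 0.0
        for basis in (cos_t, sin_t):
            den = sum(b * b for b in basis)
            if den > 0.0:
                num = sum((v - mean) * b for v, b in zip(values, basis))
                power += num * num / den
        powers.append(power / (2.0 * var))
    return powers

def false_alarm_probability(z, m):
    """Chance that the highest of m independent powers reaches z under pure noise."""
    if z <= 0.0:
        return 1.0
    # Numerically stable form of 1 - (1 - exp(-z))**m.
    return -math.expm1(m * math.log1p(-math.exp(-z)))

# Hourly samples of Cosine(2*pi*t/24) with N = 48, as in the single-frequency experiment.
times = [float(t) for t in range(48)]
values = [math.cos(2.0 * math.pi * t / 24.0) for t in times]
freqs = [k / 480.0 for k in range(1, 97)]  # assumed grid: 0 < f <= 0.2 per hour

powers = lomb_scargle(times, values, freqs)
peak = max(range(len(freqs)), key=lambda i: powers[i])
peak_period = 1.0 / freqs[peak]
# Assumption: number of independent frequencies M taken equal to N.
p_value = false_alarm_probability(powers[peak], len(times))
print(f"period at peak = {peak_period:.1f} hours, p = {p_value:.3g}")
```

Because the function takes the sample times explicitly, the same call works for unevenly spaced or incomplete time series: pass only the time points that were actually observed.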
Lomb-Scargle Periodogram Experiment: Multiple Frequencies
Expression = 3*Cosine(2πt/48) + Cosine(2πt/24) + Cosine(2πt/8)
[Figure, three panels: Sum of 3 Cosines (N = 48), Expression vs. Time [hours]; Lomb-Scargle Periodogram, Normalized Power Spectral Density vs. Frequency [1/hour], Period at Peak = 48 hours; Peak Significance, Probability vs. Frequency [1/hour], p = 2.37e-007 at Peak.]
“Weaker” periodicities cannot always be resolved statistically.

Lomb-Scargle Periodogram Experiment: Multiple Frequencies: “Duty Cycle”
50% | 66.6% (e.g., human sleep cycle)
[Figure, two columns of panels: duty cycle: 1/2 and duty cycle: 2/3 (N = 48 each), Expression vs. Time [hours]; Lomb-Scargle Periodogram, Normalized Power Spectral Density vs. Frequency [1/hour], Period at Peak = 24 hours in both; Peak Significance, Probability vs. Frequency [1/hour]. Peak significances reported: p = 2.54e-007 and p = 5.06e-006.]
One peak with symmetric “duty cycle”.
Multiple peaks with asymmetric cycle.

Lomb-Scargle Experiment: Mixtures: Periodic Signal Vs.
Noise
log(p) histogram
[Figure, three histograms of Frequency vs. log10(p), each titled 'p' Histogram for 5000 Simulated Expression Profiles (N = 48), p corresponding to max Periodogram Power Spectral Density: 100% simulated periodic genes; 50% simulated periodic genes; 0% simulated periodic genes.]
100% periodic genes | 50% periodic, 50% noise | 100% noise

Lomb-Scargle Experiment: Mixtures: Periodic Signal Vs. Noise
Multiple-Hypothesis Testing
[Figure: Multiple Testing Correction Methods, 50% simulated periodic genes; Log10(p) vs. Rank Order of Sorted p Values; curves for bonferroni, holm, hochberg, fdr (Benjamini & Hochberg FDR), and none. The Bonferroni end is labeled “More False Negatives” and the None end “More False Positives”.]
50% periodic, 50% noise

Data Pipeline to Apply to Microarray Dataset
1. Apply quality control checks to data
2. Apply Lomb-Scargle algorithm to all expression profiles
3. Apply multiple hypothesis testing to define “significant” periodic genes
4. Analyze biological significance of periodic genes

Methodology Validation Study: Bozdech’s Plasmodium dataset
Intraerythrocytic Developmental Cycle of Plasmodium falciparum
From Bozdech, et al, Fig. 1A, PLoS Biology, Vol 1, No 1, Oct 2003, p 3.
Critical Assessment of Microarray Data Analysis Conference (CAMDA), Nov 2004
http://research.stowers-institute.org/efg/2005/LombScargle/

Bozdech’s Plasmodium dataset: 1. Apply Quality Control Checks
Global views of experiment. Remove certain outliers.

Bozdech’s Plasmodium dataset: 1. Apply Quality Control Checks
Many missing data points require imputation for Fourier analysis.

Bozdech’s Plasmodium dataset: 2.
Apply Lomb-Scargle Algorithm
Periodic Expression Patterns
[Figure, two probes: i3518_1 and opfi17638 (N = 46 each). Panels for each probe: expression profile with Phase; Time Interval Variability; Lomb-Scargle Periodogram vs. Frequency, Period at Peak = 45.7 hours for both; Peak Significance vs. Frequency. Peak significances reported: p = 1.19e-008 and p = 1.48e-008.]
Examples of highly-significant periodic expression profiles.

Bozdech’s Plasmodium dataset: 2. Apply Lomb-Scargle Algorithm
Aperiodic/Noise Expression Patterns
[Figure, two probes: j167_5 and f35105_2 (N = 45 and N = 35 shown). Panels for each probe: expression profile; Time Interval Variability; Lomb-Scargle Periodogram vs. Frequency, with Period at Peak = 17.8 hours and Period at Peak = 32 hours; Peak Significance vs. Frequency, with p = 0.998 at Peak and p = 0.516 at Peak.]

Bozdech’s Plasmodium dataset: 2.
Apply Lomb-Scargle Algorithm
Small “N”
[Figure, two probes: n170_1 and f58149_1 (N = 39 and N = 32 shown). Panels for each probe: expression profile; Time Interval Variability; Lomb-Scargle Periodogram vs. Frequency, with Period at Peak = 48 hours and Period at Peak = 64 hours; Peak Significance vs. Frequency, with p = 8.54e-006 at Peak and p = 2.74e-005 at Peak.]

Bozdech’s Plasmodium dataset: 2. Apply Lomb-Scargle Algorithm
Signal and Noise Mixture
[Figure: 'p' histogram for the complete Bozdech set of 6875 probes, Number of Probes vs. log10(p), with regions labeled “Periodic Probes” and “Aperiodic Probes or Noise” and the annotation “Missing Noise Spike?”. Inset for comparison: 'p' Histogram for 5000 Simulated Expression Profiles (N = 48), 0% simulated periodic genes.]
histogram -log10p.pdf 2004-11-06 10:26

Bozdech’s Plasmodium dataset: 3. Apply Multiple-Hypothesis Testing
[Figure: Multiple Testing Correction Methods (Using R's p.adjust methods); Log10(p) vs. Rank Order of Sorted p Values; curves for bonferroni, holm, hochberg, fdr (Benjamini & Hochberg FDR), and none; Significance level α = 1E-4. The Bonferroni end is labeled “More False Negatives” and the None end “More False Positives”.]
p-adjust.pdf 2004-11-06 10:12

Bozdech’s Plasmodium dataset: 4. Analyze Biological Significance
Bozdech Complete Set: 6875 probes
Quality Control Set: 5080
Small "N": 1795
Lomb-Scargle Periodic: 4355
Bozdech Periodic: 3719
[Venn diagram of the Lomb-Scargle Periodic, Bozdech Periodic, and Overview sets, with region counts 243, 501, 3611, 108]

Bozdech’s Plasmodium dataset: 4.
Analyze Biological Significance
“Phaseograms”
[Figure, two panels, each showing Probes Ordered by Phase vs. Time: Lomb-Scargle Results, 4355 Probes; Bozdech: “Overview” Dataset, 2714 genes, 3395 probes.]

Bozdech’s Plasmodium dataset: Bozdech’s Ad Hoc Scoring
[Figure: powerMAX vs Power at Peak Frequency; Periodic Genes]

Bozdech’s Plasmodium dataset: Bozdech’s Ad Hoc Scoring Vs Lomb-Scargle p values

Bozdech’s Plasmodium dataset: Bozdech’s “Phase” Vs. Peak of Smoothed Time Series

Mary-Lee’s Somitogenesis Dataset
We wanted to validate the Lomb-Scargle method with Bozdech’s dataset before applying it to our somitogenesis problem, since the Fourier technique could not be used:
Scargle (1982): “surprising result is that the … spectrum of a process can be estimated … [with] only the order of the samples …”

Somitogenesis Dataset
[Figure, four panels: Cosine (N = 10), Cosine Curve vs. Time [minutes]; Time Interval Variability, Frequency vs. log10(delta T); Lomb-Scargle Periodogram, Normalized Power Spectral Density vs. Frequency [1/minute], period = 119 minutes; Peak Significance, Probability vs. Frequency [1/minute], p = 0.1058.]
N | p
10 | 0.1
17 | 0.006
20 | 0.002
48 | 3E-9

Somitogenesis Dataset
[Two figure slides; credit: Mary-Lee, Science Club, 2004]

Somitogenesis Dataset
Do we care about time order, or only periodic genes?
22,690 Affy Probesets

Somitogenesis Dataset
Do we care about time order, or only periodic genes?
Reorder Plan: Use only known periodic genes
TimeOrderingExperiment1.doc, Nov 2004

Somitogenesis Dataset
Perturbation (“Jitter”) and Permutation Tests
Order Nveau.ppt
Perturbation (times):
27.75443 33.08618 42.6607 11.04462 -0.6288767 18.55415 12.93994 44.90501 59.98529 67.2538
46.87758 55.91625 74.72591 78.9921 83.98062 104.9910 78.12926 77.38472 111.5297 109.2331
Permutation (order for fixed times):
7 5 6 8 4 3 2 1 | 11 12 9 10 | 13 | 17 15 16 18 14 | 20 19

Somitogenesis Dataset
Perturbation (“Jitter”) and Permutation Tests
[Figure, two sets of histograms, each running from Best to Worst Rank Product: JitterSpreadHistograms.pdf, Feb 2005; PermuteSpreadHistograms.pdf]

Somitogenesis Dataset
Perturbation (“Jitter”) and Permutation Tests
PermutePerturbComparison.xls, Jan 2005

Somitogenesis Dataset
Perturbation (“Jitter”) and Permutation Tests
Top 50: Somites Vs Random
[Figure: Top 50 Score vs. Rank for two series, Somites and Random]
StatComparison1000.xls: Top50, Jan 2005

Somitogenesis Dataset
Perturbation (“Jitter”) and Permutation Tests
Somites Vs Random
[Figure: LogRankSum vs. Rank for two series, Random and Somites]
StatComparison1000.xls: TopLogRankSum

Somitogenesis Dataset
How many periodic genes are in the dataset?
[Figure annotation: Approx Noise Cutoff]

Somitogenesis Dataset
What periodicities are present?
Periodogram Clustering
[Figure: Hierarchical Clustering of periodograms across frequency bins F1 to F37, with 119 min and 59 min marked; values shown in the figure: 1, 4, 2, 7, 24, 93]
PeriodogramClustering.doc Aug 2004

Somitogenesis Dataset
Unconstrained Permutations to Estimate False Hit Rate
(10,000 Permutations, 17-point time series, 7544 Affy probes)
Use p-value cutoffs from “baseline” with permutations
About 35-40 cyclic genes detected?
FinalStatsPeriodRange.xls, March 2006

Conclusions
• The Lomb-Scargle periodogram is an effective tool to identify periodic gene expression profiles
• Results are comparable with Fourier analysis
• Lomb-Scargle can help when data are missing or not evenly spaced
• Conclusions should not be drawn using the individual p-value calculated for each profile. A multiple comparison procedure, such as the False Discovery Rate (FDR), must be used to control the error rate.
• Expression profiles may be more complex than simple cosine curves
• Power spectra of non-sinusoid rhythms may be difficult to interpret

Acknowledgements
Pourquié Lab: Olivier Pourquié, Mary-Lee Dequéant
Bioinformatics: Arcady Mushegian, Galina Glasko, Jie Chen
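The multiple-comparison step called for in the conclusions can be sketched in a few lines. The talk reports using R's p.adjust for this; the Python below is only an illustrative equivalent of two of those methods (Bonferroni and Benjamini & Hochberg FDR). The function names and the sample p-values are made up for the example and are not data from the study.

```python
def bonferroni(pvals):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def benjamini_hochberg(pvals):
    """Benjamini & Hochberg FDR adjustment (the 'fdr' method in R's p.adjust)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-profile peak p-values (illustration only).
raw_p = [0.01, 0.04, 0.03, 0.005]
bonf_p = bonferroni(raw_p)
fdr_p = benjamini_hochberg(raw_p)

alpha = 0.05  # assumed significance level for this toy example
n_bonf = sum(p <= alpha for p in bonf_p)
n_fdr = sum(p <= alpha for p in fdr_p)
print(f"significant after Bonferroni: {n_bonf}, after FDR: {n_fdr}")
```

On the same input, Bonferroni never declares more profiles significant than the FDR adjustment does, which is the “more false negatives” versus “more false positives” trade-off shown in the multiple-testing figures.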