Using Lomb-Scargle Periodograms to Identify Periodic Genes in Somitogenesis

Stowers Institute for Medical Research
Earl F. Glynn
Scientific Programmer
29 March 2006
• Periodic Patterns in Biology
• Simple Periodic Gene Expression Model
• Introduction to Lomb-Scargle Periodogram
• Data Pipeline
• Methodology Validation Study: Bozdech’s Plasmodium dataset
• Application to Mary-Lee’s somitogenesis dataset
• Conclusions
Periodic Patterns in Biology
A vertebrate’s body plan: a segmented pattern.
Segmentation is established during somitogenesis.
Photograph taken at Reptile Gardens, Rapid City, SD
www.reptile-gardens.com
Simple Periodic Gene Expression Model
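[Figure: expression vs. time for a cosine-shaped profile that alternates between “On” (peaks) and “Off” (troughs); the period (T) is marked between successive “On” states.]
frequency = 1 / period, i.e. f = 1/T
ω = angular frequency = 2πf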
Gene Expression = Amplitude × Cosine(2πf t + PhaseShift)
“Periodic” if only observed over a single cycle?
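The cosine model above can be written as a small function. This is a minimal sketch: the function name `expression`, its default arguments and the 24-hour period are illustrative assumptions, not taken from the slides.

```python
import math


def expression(t, amplitude=1.0, period=24.0, phase_shift=0.0):
    """Cosine expression model: Amplitude * cos(2*pi*f*t + PhaseShift)."""
    frequency = 1.0 / period            # f = 1/T
    omega = 2.0 * math.pi * frequency   # angular frequency, omega = 2*pi*f
    return amplitude * math.cos(omega * t + phase_shift)


# With a zero phase shift the gene is "On" at t = 0 and t = T,
# and "Off" half a period later.
times = [0.0, 12.0, 24.0]
values = [expression(t) for t in times]
```

Sampling only one cycle gives just a single "On" to "On" interval, which is the concern behind the question on this slide.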
Introduction to Lomb-Scargle Periodogram
• What is a Periodogram?
• Why Lomb-Scargle Instead of Fourier?
• Example Using Cosine Expression Model
• Mathematical Details
• Lomb-Scargle Experiments
  - Single Dominant Frequency
  - Multiple Frequencies
  - Mixtures: Signal and Noise
  - Multiple Hypothesis Testing
What is a Periodogram?
• A graph of spectral “power” over a range of frequencies
• A “peak” in the periodogram indicates a frequency with significant periodicity
[Figure: a periodic signal (Log2(Expression) vs. Time) is passed through the periodogram computation to give spectral “power” vs. Frequency.]
Why Lomb-Scargle Instead of Fourier?
Lomb-Scargle Method
• Weights data points
• Data can be unevenly sampled
• No data imputation
• Any number of data points
• Known statistical properties
• “p” value
• Need estimate of number of “independent frequencies” but explore using continuum
Fourier Method
• Weights frequency intervals
• Requires uniform spacing
• Missing data imputed
• 2^N points for FFT; 0 padding
• Permutation tests needed to assess statistical properties
• Ad hoc scoring rules
• Usually only look at “independent” Fourier frequencies
Lomb-Scargle Periodogram
Example Using Cosine Expression Model
[Figure: three panels. “Cosine Curve (N=48)”: Expression vs. Time [hours], evenly-spaced time points. “Lomb-Scargle Periodogram”: Normalized Power Spectral Density vs. Frequency [1/hour], with significance levels from p = 0.05 down to p = 1e-06; Period at Peak = 48 hours (T = 1/f). “Peak Significance”: Probability vs. Frequency [1/hour]; p = 3.3e-009 at Peak.]
A small value for the false-alarm probability indicates a highly significant periodic signal.
Lomb-Scargle Periodogram
Example Using Noisy Cosine Expression Model
[Figure: four panels. “Cosine Curve + Noise (N=48)”: Expression vs. Time [hours], unevenly-spaced time points. “Time Interval Variability”: histogram of log10(delta T). “Lomb-Scargle Periodogram”: Normalized Power Spectral Density vs. Frequency [1/hour], with significance levels from p = 0.05 down to p = 1e-06; Period at Peak = 45.7 hours. “Peak Significance”: Probability vs. Frequency [1/hour]; p = 2.54e-007 at Peak.]
Lomb-Scargle Periodogram
Example Using Noise
[Figure: four panels. “Noise (N=48)”: Expression vs. Time [hours]. “Time Interval Variability”: histogram of log10(delta T). “Lomb-Scargle Periodogram”: Normalized Power Spectral Density vs. Frequency [1/hour], with significance levels from p = 0.05 down to p = 1e-06; Period at Peak = 7.4 hours. “Peak Significance”: Probability vs. Frequency [1/hour]; p = 0.973 at Peak.]
Lomb-Scargle Periodogram Mathematical Details
PN(ω) has an exponential probability distribution with unit mean.
Source: Numerical Recipes in C (2nd Ed), p. 577
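The equations on this slide did not survive extraction. For reference, the normalized periodogram in the cited Numerical Recipes section is usually written as below. The symbols (samples h_j at times t_j, offset τ, and M independent frequencies) follow that text and are assumptions here, not recovered from the slide.

```latex
\begin{aligned}
\bar{h} &= \frac{1}{N}\sum_{j=1}^{N} h_j, \qquad
\sigma^2 = \frac{1}{N-1}\sum_{j=1}^{N}\left(h_j-\bar{h}\right)^2 \\
P_N(\omega) &= \frac{1}{2\sigma^2}\left\{
\frac{\left[\sum_j (h_j-\bar{h})\cos\omega(t_j-\tau)\right]^2}{\sum_j \cos^2\omega(t_j-\tau)}
+\frac{\left[\sum_j (h_j-\bar{h})\sin\omega(t_j-\tau)\right]^2}{\sum_j \sin^2\omega(t_j-\tau)}
\right\} \\
\tan(2\omega\tau) &= \frac{\sum_j \sin 2\omega t_j}{\sum_j \cos 2\omega t_j} \\
P(>z) &= 1-\left(1-e^{-z}\right)^{M}
\end{aligned}
```

Because P_N(ω) has an exponential distribution with unit mean when the data are pure Gaussian noise, the chance that one frequency exceeds z is e^{-z}. The last line is then the false-alarm probability for the highest of M independent frequencies.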
Lomb-Scargle Periodogram Experiment:
Single Dominant Frequency
Expression = Cosine(2πt/24)
[Figure: three panels. “Cosine Curve (N=48)”: Expression vs. Time [hours]. “Lomb-Scargle Periodogram”: Normalized Power Spectral Density vs. Frequency [1/hour], with significance levels from p = 0.05 down to p = 1e-06; Period at Peak = 24 hours. “Peak Significance”: Probability vs. Frequency [1/hour]; p = 3.3e-009 at Peak.]
Single “peak” in periodogram. Single “valley” in significance curve.
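The single-frequency experiment can be reproduced with a short pure-Python periodogram. This is a sketch under stated assumptions, not the author's code: hourly sampling, the frequency grid, the function names, and the choice of M = N independent frequencies for the false-alarm probability are all illustrative, so the p-value will not match the slide exactly.

```python
import math


def lomb_scargle(times, values, frequencies):
    """Normalized Lomb-Scargle power at each frequency (cycles per time unit)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    powers = []
    for f in frequencies:
        omega = 2.0 * math.pi * f
        # The offset tau makes the result independent of the time origin.
        s2 = sum(math.sin(2.0 * omega * t) for t in times)
        c2 = sum(math.cos(2.0 * omega * t) for t in times)
        tau = math.atan2(s2, c2) / (2.0 * omega)
        cc = ss = yc = ys = 0.0
        for t, v in zip(times, values):
            c = math.cos(omega * (t - tau))
            s = math.sin(omega * (t - tau))
            cc += c * c
            ss += s * s
            yc += (v - mean) * c
            ys += (v - mean) * s
        power = yc * yc / cc
        if ss > 1e-12:  # skip the sine term if it vanishes at this frequency
            power += ys * ys / ss
        powers.append(power / (2.0 * variance))
    return powers


def false_alarm_probability(power, n_independent):
    """P(>z) = 1 - (1 - exp(-z))**M, evaluated stably for tiny exp(-z)."""
    if power <= 0.0:
        return 1.0
    return -math.expm1(n_independent * math.log1p(-math.exp(-power)))


# 48 samples of Expression = Cosine(2*pi*t/24); hourly spacing is assumed.
times = [float(t) for t in range(48)]
values = [math.cos(2.0 * math.pi * t / 24.0) for t in times]

# Hypothetical frequency grid: 1/480 to 0.2 cycles per hour.
frequencies = [k / 480.0 for k in range(1, 97)]
powers = lomb_scargle(times, values, frequencies)

peak_index = max(range(len(powers)), key=powers.__getitem__)
peak_period = 1.0 / frequencies[peak_index]  # 24 hours for this input
# Assumption: M = N independent frequencies.
p_peak = false_alarm_probability(powers[peak_index], len(times))
```

For a noise-free cosine sampled over whole cycles, the peak power reaches its upper bound of (N - 1)/2, which is why the false-alarm probability is so small.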
Lomb-Scargle Periodogram Experiment:
Multiple Frequencies
Expression = Cosine(2πt/48) + Cosine(2πt/24) + Cosine(2πt/8)
[Figure: three panels. “Sum of 3 Cosines (N=48)”: Expression vs. Time [hours]. “Lomb-Scargle Periodogram”: Normalized Power Spectral Density vs. Frequency [1/hour], with significance levels from p = 0.05 down to p = 1e-06; Period at Peak = 21.8 hours. “Peak Significance”: Probability vs. Frequency [1/hour]; p = 0.00246 at Peak.]
Multiple peaks in periodogram. Corresponding valleys in significance curve.
Lomb-Scargle Periodogram Experiment:
Multiple Frequencies
Expression = 3*Cosine(2πt/48) + Cosine(2πt/24) + Cosine(2πt/8)
[Figure: three panels. “Sum of 3 Cosines (N=48)”: Expression vs. Time [hours]. “Lomb-Scargle Periodogram”: Normalized Power Spectral Density vs. Frequency [1/hour], with significance levels from p = 0.05 down to p = 1e-06; Period at Peak = 48 hours. “Peak Significance”: Probability vs. Frequency [1/hour]; p = 2.37e-007 at Peak.]
“Weaker” periodicities cannot always be resolved statistically.
Lomb-Scargle Periodogram Experiment:
Multiple Frequencies: “Duty Cycle”
[Figure: two rows of panels (N = 48 each), one for duty cycle 1/2 (50%) and one for duty cycle 2/3 (66.6%, e.g., human sleep cycle). Each row shows Expression vs. Time [hours], the Lomb-Scargle Periodogram (Normalized Power Spectral Density vs. Frequency [1/hour], with significance levels from p = 0.05 down to p = 1e-06) and Peak Significance (Probability vs. Frequency [1/hour]). Period at Peak = 24 hours in both rows; the peak p-values shown are 5.06e-006 and 2.54e-007.]
One peak with symmetric “duty cycle”.
Multiple peaks with asymmetric cycle.
Lomb-Scargle Experiment:
Mixtures: Periodic Signal Vs. Noise
[Figure: three log(p) histograms, each titled “'p' Histogram for 5000 Simulated Expression Profiles (N = 48)”, where p corresponds to the maximum periodogram power spectral density. Axes: Frequency (count) vs. log10(p). Panels: 100% periodic genes; 50% periodic, 50% noise; 100% noise.]
Lomb-Scargle Experiment:
Mixtures: Periodic Signal Vs. Noise
Multiple-Hypothesis Testing
[Figure: “Multiple Testing Correction Methods”, 50% simulated periodic genes (50% periodic, 50% noise). Log10(p) vs. Rank Order of Sorted p Values (0 to 5000), with one curve per method: Bonferroni, Holm, Hochberg, Benjamini & Hochberg FDR, and None. The Bonferroni side is labelled “More False Negatives”; the None (uncorrected) side is labelled “More False Positives”.]
Data Pipeline to Apply to Microarray Dataset
1. Apply quality control checks to data
2. Apply Lomb-Scargle algorithm to all
expression profiles
3. Apply multiple hypothesis testing to
define “significant” periodic genes
4. Analyze biological significance of
periodic genes
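Step 3 of the pipeline adjusts the per-profile p-values for multiple testing. The slides use R's p.adjust; the sketch below is a pure-Python re-implementation of four of its named procedures. The function name `p_adjust` and the example p-values are illustrative assumptions.

```python
def p_adjust(p_values, method="fdr"):
    """Adjust p-values for multiple testing.

    method: "bonferroni", "holm", "hochberg", "fdr" (Benjamini & Hochberg)
    or "none". Results are returned in the input order.
    """
    n = len(p_values)
    if method == "none":
        return list(p_values)
    if method == "bonferroni":
        return [min(1.0, n * p) for p in p_values]

    order = sorted(range(n), key=lambda i: p_values[i])  # ascending p
    adjusted = [0.0] * n
    if method == "holm":
        # Step-down: running maximum from the smallest p upward.
        running = 0.0
        for rank, idx in enumerate(order):
            running = max(running, (n - rank) * p_values[idx])
            adjusted[idx] = min(1.0, running)
        return adjusted
    if method in ("hochberg", "fdr"):
        # Step-up: running minimum from the largest p downward.
        running = float("inf")
        for rank in range(n - 1, -1, -1):
            idx = order[rank]
            factor = (n - rank) if method == "hochberg" else n / (rank + 1)
            running = min(running, factor * p_values[idx])
            adjusted[idx] = min(1.0, running)
        return adjusted
    raise ValueError("unknown method: " + method)


# Hypothetical p-values for five expression profiles.
example = [0.04, 0.01, 0.03, 0.05, 0.02]
adjusted = {m: p_adjust(example, m)
            for m in ("bonferroni", "holm", "hochberg", "fdr", "none")}
```

Bonferroni is the most conservative of these (more false negatives), while leaving p-values unadjusted gives the most false positives; the other methods fall in between.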
Methodology Validation Study:
Bozdech’s Plasmodium dataset
From Bozdech, et al, Fig. 1A, PLoS Biology, Vol 1, No 1, Oct 2003, p 3.
Intraerythrocytic Developmental Cycle of Plasmodium falciparum
Critical Assessment of Microarray Data Analysis Conference (CAMDA), Nov 2004
http://research.stowers-institute.org/efg/2005/LombScargle/
Bozdech’s Plasmodium dataset:
1. Apply Quality Control Checks
Global views of experiment.
Remove certain outliers.
Bozdech’s Plasmodium dataset:
1. Apply Quality Control Checks
Many missing data points require imputation for Fourier analysis.
Bozdech’s Plasmodium dataset:
2. Apply Lomb-Scargle Algorithm
Periodic Expression Patterns
[Figure: two probes, i3518_1 and opfi17638 (N = 46 each). Panels for each probe: Phase, Time Interval Variability, Lomb-Scargle Periodogram (Normalized Power Spectral Density vs. frequency, with significance levels from p = 0.05 down to p = 1e-06) and Peak Significance. Period at Peak = 45.7 hours for both; the peak p-values shown are 1.19e-008 and 1.48e-008.]
Examples of highly-significant periodic expression profiles.
Bozdech’s Plasmodium dataset:
2. Apply Lomb-Scargle Algorithm
Aperiodic/Noise Expression Patterns
[Figure: two probes, f35105_2 and j167_5 (N = 45 and N = 35 are shown). Panels for each probe: expression profile, Time Interval Variability, Lomb-Scargle Periodogram (with significance levels from p = 0.05 down to p = 1e-06) and Peak Significance. Periods at peak shown: 32 hours and 17.8 hours; peak p-values shown: 0.998 and 0.516.]
Bozdech’s Plasmodium dataset:
2. Apply Lomb-Scargle Algorithm
Small “N”
[Figure: two probes, n170_1 and f58149_1 (N = 39 and N = 32 are shown). Panels for each probe: expression profile, Time Interval Variability, Lomb-Scargle Periodogram (with significance levels from p = 0.05 down to p = 1e-06) and Peak Significance. Periods at peak shown: 64 hours and 48 hours; peak p-values shown: 8.54e-006 and 2.74e-005.]
Bozdech’s Plasmodium dataset:
2. Apply Lomb-Scargle Algorithm
Signal and Noise Mixture
[Figure: 'p' histogram for the complete Bozdech set of 6875 probes (Number of Probes vs. log10(p)), annotated “Periodic Probes”, “Aperiodic Probes or Noise” and “Missing Noise Spike?”. Shown for comparison: the 'p' histogram for 5000 simulated expression profiles (N = 48) with 0% simulated periodic genes (Frequency vs. log10(p)). Source file: histogram -log10p.pdf, 2004-11-06.]
Bozdech’s Plasmodium dataset:
3. Apply Multiple-Hypothesis Testing
[Figure: “Multiple Testing Correction Methods (Using R's p.adjust methods)”. Log10(p) vs. Rank Order of Sorted p Values (0 to 7000), with one curve per method: Bonferroni, Holm, Hochberg, Benjamini & Hochberg FDR, and None, and a significance line at α = 1E-4. The Bonferroni side is labelled “More False Negatives”; the None (uncorrected) side is labelled “More False Positives”. Source file: p-adjust.pdf, 2004-11-06.]
Bozdech’s Plasmodium dataset:
4. Analyze Biological Significance
[Venn diagram]
Bozdech Complete Set: 6875 probes
Quality Control Set: 5080
Small "N": 1795
Lomb-Scargle Periodic: 4355
Bozdech Periodic: 3719
Overview Set
Other region counts shown in the diagram: 243, 501, 3611, 108
Bozdech’s Plasmodium dataset:
4. Analyze Biological Significance
“Phaseograms”
[Figure: two phaseograms, each with axes “Probes Ordered by Phase” and “Time”.]
Lomb-Scargle Results: 4355 Probes
Bozdech “Overview” Dataset: 2714 genes, 3395 probes
Bozdech’s Plasmodium dataset:
Bozdech’s Ad Hoc Scoring
powerMAX vs Power at Peak Frequency
[Figure: plot with the “Periodic Genes” region labelled.]
Bozdech’s Plasmodium dataset:
Bozdech’s Ad Hoc Scoring Vs Lomb-Scargle p values
Bozdech’s Plasmodium dataset:
Bozdech’s “Phase” Vs. Peak of Smoothed Time Series
Mary-Lee’s Somitogenesis Dataset
We wanted to validate the Lomb-Scargle method with
Bozdech’s dataset before applying it to our somitogenesis
problem, since the Fourier technique could not be used:
Scargle (1982):
“surprising result is that the … spectrum of a process can
be estimated … [with] only the order of the samples …”
Somitogenesis Dataset
[Figure: four panels. “Cosine Curve”, Cosine (N = 10), period = 119 minutes: expression vs. Time [minutes]. “Time Interval Variability”: histogram of log10(delta T). “Lomb-Scargle Periodogram”: Normalized Power Spectral Density vs. Frequency [1/minute]. “Peak Significance”: Probability vs. Frequency [1/minute]; p = 0.1058.]
Peak p-value for a cosine curve as a function of the number of points N:
N     p
10    0.1
17    0.006
20    0.002
48    3E-9
Somitogenesis Dataset
Mary-Lee, Science Club, 2004
Somitogenesis Dataset
Do we care about time order, or only periodic genes?
22,690
Affy Probesets
Somitogenesis Dataset
Do we care about time order, or only periodic genes?
Reorder
Plan: Use only known periodic genes
TimeOrderingExperiment1.doc, Nov 2004
Somitogenesis Dataset
Perturbation (“Jitter”) and Permutation Tests
Order Nveau.ppt
Perturbation (times):
27.75443 33.08618 42.6607 11.04462 -0.6288767 18.55415 12.93994 44.90501 59.98529 67.2538 46.87758 55.91625 74.72591
78.9921 83.98062 104.9910 78.12926 77.38472 111.5297 109.2331
Permutation (order for fixed times):
7 5 6 8 4 3 2 1 | 11 12 9 10 | 13 | 17 15 16 18 14 | 20 19
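The two kinds of test can be sketched as follows. The slides do not say how the perturbed times were generated, so the Gaussian noise model, its standard deviation, the nominal sampling times and both function names are assumptions. Only the block sizes (8, 4, 1, 5, 2) are read off the permutation example above.

```python
import random


def jitter_times(times, sigma, rng):
    """Perturb each sampling time with Gaussian noise (assumed noise model)."""
    return [t + rng.gauss(0.0, sigma) for t in times]


def permute_within_blocks(block_sizes, rng):
    """Shuffle the sample order inside consecutive blocks.

    Samples are numbered from 1; each block keeps its position, so only
    the order within a block changes.
    """
    order = []
    start = 1
    for size in block_sizes:
        block = list(range(start, start + size))
        rng.shuffle(block)
        order.extend(block)
        start += size
    return order


rng = random.Random(2005)

# Block sizes taken from the example: 8 | 4 | 1 | 5 | 2 (20 samples).
block_sizes = [8, 4, 1, 5, 2]
order = permute_within_blocks(block_sizes, rng)

# Hypothetical nominal times in minutes, jittered with an assumed sigma.
nominal_times = [6.0 * i for i in range(20)]
jittered = jitter_times(nominal_times, sigma=5.0, rng=rng)
```

Re-running the periodogram on many jittered or permuted versions of the series indicates how sensitive a gene's ranking is to uncertainty in the sample times or order.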
Somitogenesis Dataset
Perturbation (“Jitter”) and Permutation Tests
[Figure: two sets of rank-product histograms, each running from “Best” to “Worst” rank product. Perturbation test: JitterSpreadHistograms.pdf, Feb 2005. Permutation test: PermuteSpreadHistograms.pdf.]
Somitogenesis Dataset
Perturbation (“Jitter”) and Permutation Tests
PermutePerturbComparison.xls, Jan 2005
Somitogenesis Dataset
Perturbation (“Jitter”) and Permutation Tests
[Figure: “Top 50: Somites Vs Random”. Top 50 Score (0 to 10000) vs. Rank (1 to 97), with one curve for Somites and one for Random. Source: StatComparison1000.xls: Top50, Jan 2005.]
Somitogenesis Dataset
Perturbation (“Jitter”) and Permutation Tests
[Figure: “Somites Vs Random”. LogRankSum (0 to 35000) vs. Rank (1 to 97), with one curve for Somites and one for Random. Source: StatComparison1000.xls: TopLogRankSum.]
Somitogenesis Dataset
How many periodic genes are in the dataset?
Approx Noise Cutoff
Somitogenesis Dataset
What periodicities are present?
Periodogram Clustering
[Figure: hierarchical clustering of periodograms across frequency bins F1 to F37, with the periods 119 min and 59 min marked. Source: PeriodogramClustering.doc, Aug 2004.]
Somitogenesis Dataset
Unconstrained Permutations to Estimate False Hit Rate
(10,000 Permutations, 17-point time series, 7544 Affy probes)
Use p-value cutoffs from “baseline” with permutations
About 35-40 cyclic genes detected?
FinalStatsPeriodRange.xls, March 2006
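A permutation baseline of this kind can be illustrated on a single profile: shuffle the expression values without constraint, recompute the peak periodogram power, and count how often the shuffled peak matches or beats the observed one. This is a simplified sketch, not the procedure behind the slide's numbers. The sampling times, frequency grid, synthetic 119-minute cosine, 200 permutations and function names are all assumptions.

```python
import math
import random


def peak_power(times, values, frequencies):
    """Largest normalized Lomb-Scargle power over the given frequencies."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    best = 0.0
    for f in frequencies:
        omega = 2.0 * math.pi * f
        s2 = sum(math.sin(2.0 * omega * t) for t in times)
        c2 = sum(math.cos(2.0 * omega * t) for t in times)
        tau = math.atan2(s2, c2) / (2.0 * omega)
        cc = ss = yc = ys = 0.0
        for t, v in zip(times, values):
            c = math.cos(omega * (t - tau))
            s = math.sin(omega * (t - tau))
            cc += c * c
            ss += s * s
            yc += (v - mean) * c
            ys += (v - mean) * s
        power = yc * yc / cc
        if ss > 1e-12:  # skip the sine term if it vanishes
            power += ys * ys / ss
        best = max(best, power / (2.0 * variance))
    return best


def permutation_p_value(times, values, frequencies, n_perm, rng):
    """Empirical p-value of the peak power under unconstrained shuffling."""
    observed = peak_power(times, values, frequencies)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if peak_power(times, shuffled, frequencies) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


rng = random.Random(0)
times = [7.0 * i for i in range(17)]                # hypothetical 17 points, minutes
values = [math.cos(2.0 * math.pi * t / 119.0) for t in times]
frequencies = [k / 1000.0 for k in range(2, 31)]    # 0.002 to 0.030 per minute
p_empirical = permutation_p_value(times, values, frequencies, 200, rng)
```

Applying the same p-value cutoff to many permuted datasets and counting how many probes still pass gives an estimate of the false hit rate, which is the use described on this slide.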
Conclusions
• The Lomb-Scargle periodogram is an effective tool for identifying periodic gene expression profiles
• Results are comparable with Fourier analysis
• Lomb-Scargle can help when data are missing or not evenly spaced
• Conclusions should not be drawn from the individual p-value calculated for each profile. A multiple comparison procedure (False Discovery Rate, FDR) must be used to control the error rate.
• Expression profiles may be more complex than simple cosine curves
• Power spectra of non-sinusoidal rhythms may be difficult to interpret
Acknowledgements
Pourquie Lab
Olivier Pourquié
Mary-Lee Dequéant
Bioinformatics
Arcady Mushegian
Galina Glasko
Jie Chen