Searching for Periodic Gene Expression Patterns Using Lomb-Scargle Periodograms

Earl F. Glynn (Stowers Institute)
Arcady R. Mushegian (Stowers Institute & Univ. of Kansas Medical Center)
Jie Chen (Stowers Institute & Univ. of Missouri Kansas City)
http://research.stowers-institute.org/efg/2004/CAMDA
Critical Assessment of Microarray Data Analysis Conference
November 11, 2004
Searching for Periodic Gene Expression Patterns Using Lomb-Scargle Periodograms
• Periodic Patterns in Biology
• Introduction to Lomb-Scargle Periodogram
• Data Pipeline
• Application to Bozdech’s Plasmodium dataset
• Conclusions
Periodic Patterns in Biology
A vertebrate’s body plan: a segmented pattern.
Segmentation is established during somitogenesis.
Photograph taken at Reptile Gardens, Rapid City, SD
www.reptile-gardens.com
Periodic Patterns in Biology
Intraerythrocytic Developmental Cycle of Plasmodium falciparum
From Bozdech, et al, Fig. 1A, PLoS Biology, Vol 1, No 1, Oct 2003, p 3.
Expression Ratio = Cy5 / Cy3 = (RNA from parasitized red blood cells) / (RNA from all development cycles)
Values for Log2(Expression Ratio) are approximately normally distributed.
Assume gene expression reflects observed biological periodicity.
Simple Periodic Gene Expression Model
[Figure: Expression vs. Time, alternating between “On” and “Off” states, with the period (T) marked on the curve.]
frequency = 1 / period, i.e. f = 1 / T
ω = angular frequency = 2πf
Gene Expression = Constant × Cosine(2πf t)
“Periodic” if only observed over a single cycle?
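The cosine model can be written down directly. The following is a minimal Python sketch, assuming hourly sampling over 48 hours and a 48-hour period; these values and the function name `cosine_expression` are illustrative assumptions, not taken from the slide.

```python
import math

def cosine_expression(times, period, amplitude=1.0):
    """Return amplitude * cos(2*pi*f*t) for each time t, with f = 1 / period."""
    f = 1.0 / period               # frequency = 1 / period
    omega = 2.0 * math.pi * f      # angular frequency = 2*pi*f
    return [amplitude * math.cos(omega * t) for t in times]

# Assumed sampling: one point per hour for 48 hours (N = 48).
times = list(range(48))
expression = cosine_expression(times, period=48.0)
```

With these assumed values the observation window covers exactly one cycle, which is the situation the question above asks about.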
Introduction to Lomb-Scargle Periodogram
• What is a Periodogram?
• Why Lomb-Scargle Instead of Fourier?
• Example Using Cosine Expression Model
• Mathematical Details
• Mathematical Experiments
  - Single Dominant Frequency
  - Multiple Frequencies
  - Mixtures: Signal and Noise
What is a Periodogram?
• A graph showing frequency “power” for a spectrum of frequencies
• “Peak” in periodogram indicates a frequency with significant periodicity
[Figure: a Periodic Signal (Log2(Expression) vs. Time) is passed through the Periodogram Computation to give a periodogram (Spectral “Power” vs. Frequency).]
Why Lomb-Scargle Instead of Fourier?
• Missing data handled naturally
• No data imputation needed
• Any number of points can be used
• No need for 2^N data points like with FFT
• Lomb-Scargle periodogram has known statistical properties
Note: The Lomb-Scargle algorithm is NOT equivalent to the conventional periodogram analysis based on Fourier analysis.
Lomb-Scargle Periodogram
Example Using Cosine Expression Model
[Figure, three panels. Cosine Curve (N = 48): Expression vs. Time [hours]. Lomb-Scargle Periodogram: Normalized Power Spectral Density vs. Frequency [1/hour]; Period at Peak = 48 hours (T = 1/f). Peak Significance: Probability vs. Frequency [1/hour]; p = 3.3e-009 at Peak. Significance levels p = 0.05, 0.01, 0.001, 1e-04, 1e-05 and 1e-06 are marked.]
A small value for the false-alarm probability indicates a highly significant periodic signal.
Evenly-spaced time points
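The false-alarm probability reported at the peak can be sketched as follows, assuming the Numerical Recipes form P(> z) = 1 − (1 − e^(−z))^M, where z is the normalized peak power and M is the number of independent frequencies. The function name and the value M = 96 in the usage lines are assumptions for illustration; the slides do not state which M was used.

```python
import math

def false_alarm_probability(z, m):
    """Probability that pure noise gives a peak of height >= z among m independent frequencies."""
    if z <= 0.0:
        return 1.0
    # 1 - (1 - e^-z)^m, written with log1p/expm1 so tiny probabilities keep their precision.
    return -math.expm1(m * math.log1p(-math.exp(-z)))

# Assumed M = 96 (2N for N = 48); a taller peak gives a smaller probability.
p_low_peak = false_alarm_probability(5.0, 96)
p_high_peak = false_alarm_probability(20.0, 96)
```

For large z the result is close to M·e^(−z), so each extra unit of peak height lowers the probability by roughly a factor of e.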
Lomb-Scargle Periodogram
Example Using Noisy Cosine Expression Model
[Figure, four panels. Cosine Curve + Noise (N = 48): Expression vs. Time [hours]. Time Interval Variability: histogram of log10(delta T) (y-axis: Frequency). Lomb-Scargle Periodogram: Normalized Power Spectral Density vs. Frequency [1/hour]; Period at Peak = 45.7 hours. Peak Significance: Probability vs. Frequency [1/hour]; p = 2.54e-007 at Peak. Significance levels p = 0.05, 0.01, 0.001, 1e-04, 1e-05 and 1e-06 are marked.]
Unevenly-spaced time points
Lomb-Scargle Periodogram
Example Using Noise
[Figure, four panels. Noise (N = 48): Expression vs. Time [hours]. Time Interval Variability: histogram of log10(delta T) (y-axis: Frequency). Lomb-Scargle Periodogram: Normalized Power Spectral Density vs. Frequency [1/hour]; Period at Peak = 7.4 hours. Peak Significance: Probability vs. Frequency [1/hour]; p = 0.973 at Peak. Significance levels p = 0.05, 0.01, 0.001, 1e-04, 1e-05 and 1e-06 are marked.]
Lomb-Scargle Periodogram Mathematical Details
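For reference, the normalized periodogram in the Numerical Recipes section cited on this slide can be written as follows, for N measurements h_j taken at times t_j. The symbols h_j, \bar{h}, \sigma^2 and \tau follow that source and are restated here as a reading aid.

```latex
\bar{h} = \frac{1}{N}\sum_{j=1}^{N} h_j ,
\qquad
\sigma^2 = \frac{1}{N-1}\sum_{j=1}^{N} \left(h_j - \bar{h}\right)^2

P_N(\omega) = \frac{1}{2\sigma^2}
\left\{
  \frac{\left[\sum_j (h_j - \bar{h}) \cos\omega(t_j - \tau)\right]^2}
       {\sum_j \cos^2\omega(t_j - \tau)}
  +
  \frac{\left[\sum_j (h_j - \bar{h}) \sin\omega(t_j - \tau)\right]^2}
       {\sum_j \sin^2\omega(t_j - \tau)}
\right\}

\tan(2\omega\tau) = \frac{\sum_j \sin 2\omega t_j}{\sum_j \cos 2\omega t_j}
```

The offset \tau makes P_N(\omega) unchanged when all times are shifted by the same constant.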
P_N(ω) has an exponential probability distribution with unit mean.
Source: Numerical Recipes in C (2nd Ed), p. 577
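A minimal pure-Python sketch of the normalized periodogram as defined in the cited Numerical Recipes section is shown below. The function name, the test data (a 24-hour cosine with every fifth hourly point missing) and the list of trial periods are assumptions chosen for illustration; this is a direct evaluation, not the fast algorithm, and not the authors' own code.

```python
import math

def lomb_scargle_power(times, values, omega):
    """Normalized Lomb-Scargle power at angular frequency omega for unevenly sampled data."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((h - mean) ** 2 for h in values) / (n - 1)

    # Offset tau: tan(2*omega*tau) = sum(sin 2wt) / sum(cos 2wt).
    s2 = sum(math.sin(2.0 * omega * t) for t in times)
    c2 = sum(math.cos(2.0 * omega * t) for t in times)
    tau = math.atan2(s2, c2) / (2.0 * omega)

    cos_num = sin_num = cos_den = sin_den = 0.0
    for t, h in zip(times, values):
        c = math.cos(omega * (t - tau))
        s = math.sin(omega * (t - tau))
        cos_num += (h - mean) * c
        sin_num += (h - mean) * s
        cos_den += c * c
        sin_den += s * s

    return (cos_num ** 2 / cos_den + sin_num ** 2 / sin_den) / (2.0 * variance)

# Assumed data: hourly samples over 48 hours with every fifth point missing.
times = [t for t in range(48) if t % 5 != 0]
values = [math.cos(2.0 * math.pi * t / 24.0) for t in times]

trial_periods = [12.0, 16.0, 24.0, 32.0, 48.0]
powers = {T: lomb_scargle_power(times, values, 2.0 * math.pi / T) for T in trial_periods}
best_period = max(powers, key=powers.get)
```

No imputation or resampling is needed for the missing points; the sums simply run over whatever samples are present.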
Mathematical Experiment:
Single Dominant Frequency
Expression = Cosine(2πt/24)
[Figure, three panels. Cosine Curve (N = 48): Expression vs. Time [hours]. Lomb-Scargle Periodogram: Normalized Power Spectral Density vs. Frequency [1/hour]; Period at Peak = 24 hours. Peak Significance: Probability vs. Frequency [1/hour]; p = 3.3e-009 at Peak. Significance levels p = 0.05, 0.01, 0.001, 1e-04, 1e-05 and 1e-06 are marked.]
Single “peak” in periodogram. Single “valley” in significance curve.
Mathematical Experiment:
Multiple Frequencies
Expression = Cosine(2πt/48) + Cosine(2πt/24) + Cosine(2πt/8)
[Figure, three panels. Sum of 3 Cosines (N = 48): Expression vs. Time [hours]. Lomb-Scargle Periodogram: Normalized Power Spectral Density vs. Frequency [1/hour]; Period at Peak = 21.8 hours. Peak Significance: Probability vs. Frequency [1/hour]; p = 0.00246 at Peak. Significance levels p = 0.05, 0.01, 0.001, 1e-04, 1e-05 and 1e-06 are marked.]
Multiple peaks in periodogram. Corresponding valleys in significance curve.
Mathematical Experiment:
Multiple Frequencies
Expression = 3*Cosine(2πt/48) + Cosine(2πt/24) + Cosine(2πt/8)
[Figure, three panels. Sum of 3 Cosines (N = 48): Expression vs. Time [hours]. Lomb-Scargle Periodogram: Normalized Power Spectral Density vs. Frequency [1/hour]; Period at Peak = 48 hours. Peak Significance: Probability vs. Frequency [1/hour]; p = 2.37e-007 at Peak. Significance levels p = 0.05, 0.01, 0.001, 1e-04, 1e-05 and 1e-06 are marked.]
“Weaker” periodicities cannot always be resolved statistically.
Mathematical Experiment:
Multiple Frequencies: “Duty Cycle”
[Figure: two columns of panels, one for duty cycle 1/2 (50%) and one for duty cycle 2/3 (66.6%, e.g., human sleep cycle), N = 48 each. Each column shows Expression vs. Time [hours], the Lomb-Scargle Periodogram (Normalized Power Spectral Density vs. Frequency [1/hour], 0.0 to 0.5) and Peak Significance (Probability vs. Frequency [1/hour]). Period at Peak = 24 hours in both columns; the peak p-values shown are 5.06e-006 and 2.54e-007.]
One peak with symmetric “duty cycle”.
Multiple peaks with asymmetric cycle.
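One way to see why the duty cycle changes the number of peaks is the Fourier series of an idealized rectangular on/off wave: the k-th harmonic has amplitude (2 / (kπ)) · |sin(kπd)| for duty cycle d. The sketch below evaluates this for the two duty cycles on the slide. Treating the expression profile as a perfect 0/1 rectangular wave is an assumption, and the function name is hypothetical.

```python
import math

def harmonic_amplitude(k, duty_cycle):
    """Amplitude of the k-th Fourier harmonic of a 0/1 rectangular wave."""
    return 2.0 / (k * math.pi) * abs(math.sin(k * math.pi * duty_cycle))

# First three harmonics for the two duty cycles shown on the slide.
amplitudes = {
    duty: [harmonic_amplitude(k, duty) for k in (1, 2, 3)]
    for duty in (1.0 / 2.0, 2.0 / 3.0)
}
```

For d = 1/2 the even harmonics vanish, so the fundamental dominates. For d = 2/3 the second harmonic is present; with a 24-hour period it falls at a 12-hour period, which would appear as an additional periodogram peak.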
Mathematical Experiment:
Mixtures: Periodic Signal Vs. Noise
“p” histogram
[Figure, three panels, each a 'p' Histogram for 5000 Simulated Expression Profiles (N = 48): Frequency vs. log10(p), where p corresponds to the max Periodogram Power Spectral Density. Panels: 100% periodic genes; 50% periodic, 50% noise; 100% noise.]
Mathematical Experiment:
Mixtures: Periodic Signal Vs. Noise
Multiple-Hypothesis Testing
[Figure: Multiple Testing Correction Methods, 50% simulated periodic genes. Log10(p) vs. Rank Order of Sorted p Values (0 to 5000), with one curve per method: Bonferroni, Holm, Hochberg, Benjamini & Hochberg FDR, and None. The plot is annotated from “More False Negatives” (Bonferroni) to “More False Positives” (None).]
50% periodic, 50% noise
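The correction methods compared in the plot can be sketched in pure Python as below. The function is a hypothetical re-implementation written to follow the standard definitions of the Bonferroni, Holm, Hochberg and Benjamini & Hochberg adjustments; it is not the code used for the slides, and the five p-values are made-up illustration data.

```python
def adjust_p_values(p_values, method):
    """Return adjusted p-values in the original order for the named method."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices, smallest p first
    if method == "none":
        return list(p_values)
    if method == "bonferroni":
        return [min(1.0, n * p) for p in p_values]
    adjusted = [0.0] * n
    if method == "holm":
        running = 0.0
        for rank, idx in enumerate(order):               # step down from the smallest p
            running = max(running, (n - rank) * p_values[idx])
            adjusted[idx] = min(1.0, running)
        return adjusted
    if method in ("hochberg", "fdr"):
        running = float("inf")
        for rank in range(n - 1, -1, -1):                # step up from the largest p
            idx = order[rank]
            factor = (n - rank) if method == "hochberg" else n / (rank + 1)
            running = min(running, factor * p_values[idx])
            adjusted[idx] = min(1.0, running)
        return adjusted
    raise ValueError("unknown method: " + method)

# Made-up p-values for illustration only.
p = [0.01, 0.02, 0.03, 0.04, 0.05]
adjusted = {m: adjust_p_values(p, m) for m in ("bonferroni", "holm", "hochberg", "fdr", "none")}
```

A probe is then called significant when its adjusted p-value is at or below the chosen level. For any given input, Bonferroni gives the largest adjusted values and the uncorrected p-values the smallest, which matches the false-negative to false-positive ordering annotated on the plot.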
Data Pipeline to Apply to Bozdech’s Data
1. Apply quality control checks to data
2. Apply Lomb-Scargle algorithm to all expression profiles
3. Apply multiple hypothesis testing to define “significant” genes
4. Analyze biological significance of significant genes
Bozdech’s Plasmodium dataset:
1. Apply Quality Control Checks
Global views of experiment.
Remove certain outliers.
Bozdech’s Plasmodium dataset:
1. Apply Quality Control Checks
Many missing data points require imputation for Fourier analysis.
Bozdech’s Plasmodium dataset:
2. Apply Lomb-Scargle Algorithm
Mean Expression Profile
[Figure, four panels (N = 46). Expression vs. Time [hours]. Time Interval Variability: histogram of log10(delta T) (y-axis: Frequency). Lomb-Scargle Periodogram: Normalized Power Spectral Density vs. Frequency [1/hour]; Period at Peak = 27.4 hours. Peak Significance: Probability vs. Frequency [1/hour]; p = 0.0581 at Peak. Significance levels p = 0.05, 0.01, 0.001, 1e-04, 1e-05 and 1e-06 are marked.]
A weak diurnal period is visible in “mean” data profile.
Bozdech’s Plasmodium dataset:
2. Apply Lomb-Scargle Algorithm
Periodic Expression Patterns
[Figure: two probes, i3518_1 and opfi17638 (N = 46 each), each shown with four panels: Expression vs. Time [hours]; Time Interval Variability (Frequency vs. log10(delta T)); Lomb-Scargle Periodogram (Normalized Power Spectral Density vs. Frequency [1/hour]); Peak Significance (Probability vs. Frequency [1/hour]). Period at Peak = 45.7 hours for both; the peak p-values shown are 1.48e-008 and 1.19e-008.]
Examples of highly-significant periodic expression profiles.
Bozdech’s Plasmodium dataset:
2. Apply Lomb-Scargle Algorithm
Aperiodic/Noise Expression Patterns
[Figure: two probes, f35105_2 and j167_5 (N = 45 and N = 35 are shown), each with four panels: Expression vs. Time [hours]; Time Interval Variability (Frequency vs. log10(delta T)); Lomb-Scargle Periodogram (Normalized Power Spectral Density vs. Frequency [1/hour]); Peak Significance (Probability vs. Frequency [1/hour]). Periods at Peak shown: 32 hours and 17.8 hours; peak p-values shown: 0.516 and 0.998.]
Bozdech’s Plasmodium dataset:
2. Apply Lomb-Scargle Algorithm
Small “N”
[Figure: two probes, n170_1 and f58149_1 (N = 39 and N = 32 are shown), each with four panels: Expression vs. Time [hours]; Time Interval Variability (Frequency vs. log10(delta T)); Lomb-Scargle Periodogram (Normalized Power Spectral Density vs. Frequency [1/hour]); Peak Significance (Probability vs. Frequency [1/hour]). Periods at Peak shown: 64 hours and 48 hours; peak p-values shown: 2.74e-005 and 8.54e-006.]
Bozdech’s Plasmodium dataset:
2. Apply Lomb-Scargle Algorithm
Signal and Noise Mixture
'p' histogram
[Figure: histogram of log10(p) for the complete Bozdech set of 6875 probes (y-axis: Number of Probes), with one region labeled “Periodic Probes” and the other “Aperiodic Probes or Noise”.]
Bozdech’s Plasmodium dataset:
3. Apply Multiple-Hypothesis Testing
[Figure: Multiple Testing Correction Methods (Using R's p.adjust methods). Log10(p) vs. Rank Order of Sorted p Values (0 to 7000), with one curve per method: Bonferroni, Holm, Hochberg, Benjamini & Hochberg FDR, and None. The significance level α = 1E-4 is marked. The plot is annotated from “More False Negatives” (Bonferroni) to “More False Positives” (None).]
Bozdech’s Plasmodium dataset:
3. Apply Multiple-Hypothesis Testing
p Adjustment Method         α Significance Level
                            0.05    0.01    0.001   0.0001  0.00001
Bonferroni                  3707    3050    1461    13      0
Holm                        3995    3351    1705    13      0
Hochberg                    4009    3359    1723    15      0
Benjamini & Hochberg FDR    5618    5315    4906    4358    3584
None                        5648    5351    4961    4456    3823
A priori plan: Use Benjamini & Hochberg FDR level of 0.0001.
Observed number of periodic probes is consistent with the biological observation of ~60% of the Plasmodium genome being transcriptionally active during the intraerythrocytic developmental cycle.
Bozdech’s Plasmodium dataset:
4. Analyze Biological Significance
Lomb-Scargle: 4358 Probes, α = 1E-4 significance
Comparison with Bozdech’s Results
Dataset: Bozdech Complete
N (time series points)                        Total Probes   Lomb-Scargle Periodic   %
43 .. 46 (Bozdech Quality Control Dataset)    5080           4115                    81.0
32 .. 42                                      1795           243                     13.5
Total                                         6875           4358                    63.4
While Lomb-Scargle identified 243 new low “N” periodic probes, the
low percentage in that group may indicate some other problem.
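The percentages on this slide follow directly from the probe counts, as the short check below shows. The group labels are descriptive names chosen here; the counts are the ones listed on the slide.

```python
# (total probes, Lomb-Scargle periodic probes) for each range of N on the slide
groups = {
    "N = 43..46": (5080, 4115),
    "N = 32..42": (1795, 243),
}

total_probes = sum(total for total, _ in groups.values())
total_periodic = sum(periodic for _, periodic in groups.values())

percent = {name: 100.0 * periodic / total for name, (total, periodic) in groups.items()}
percent["all"] = 100.0 * total_periodic / total_probes
```

The two groups add up to the complete set of 6875 probes and to the 4358 probes called periodic, and the low-N group contributes only about one periodic probe in seven.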
Bozdech’s Plasmodium dataset:
4. Analyze Biological Significance
Lomb-Scargle: 4358 Probes, α = 1E-4 significance
Comparison with Bozdech’s Results
Dataset            Probes   Lomb-Scargle Periodic
Bozdech Overview   3711     3611
Unclear how to apply Bozdech’s ad hoc “Overview” criteria for use with Lomb-Scargle method: “70% power in max frequency with top 75% of max frequency magnitude.”
The best 3711 Lomb-Scargle “p” values contained 3449 (92.9%) of the Overview probes.
Bozdech’s Plasmodium dataset:
4. Analyze Biological Significance
“Phaseograms”
[Figure: two panels, each plotting Probes Ordered by Phase against Time.]
Lomb-Scargle Results: 4358 Probes
Bozdech: “Overview” Dataset
2714 genes, 3395 probes
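Ordering probes by phase requires a phase estimate for each profile. The slide does not say how the phases were computed, so the sketch below is only one plausible approach: a least-squares fit of a·cos(ωt) + b·sin(ωt) at a fixed period, with the phase taken as atan2(b, a). The function name, probe names and synthetic profiles are all hypothetical.

```python
import math

def fit_phase(times, values, period):
    """Least-squares phase (radians, in [0, 2*pi)) of a sinusoid with the given period."""
    omega = 2.0 * math.pi / period
    mean = sum(values) / len(values)
    cc = cs = ss = yc = ys = 0.0
    for t, y in zip(times, values):
        c, s = math.cos(omega * t), math.sin(omega * t)
        cc += c * c
        cs += c * s
        ss += s * s
        yc += (y - mean) * c
        ys += (y - mean) * s
    det = cc * ss - cs * cs                  # solve the 2x2 normal equations
    a = (yc * ss - ys * cs) / det
    b = (ys * cc - yc * cs) / det
    return math.atan2(b, a) % (2.0 * math.pi)

# Synthetic profiles (assumed): 48-hour cosines with known phase offsets.
times = list(range(48))
true_phase = {"probe_a": 2.0, "probe_b": 0.5, "probe_c": 4.0}
profiles = {
    name: [math.cos(2.0 * math.pi * t / 48.0 - phi) for t in times]
    for name, phi in true_phase.items()
}
ordered = sorted(profiles, key=lambda name: fit_phase(times, profiles[name], 48.0))
```

Because the fit does not assume evenly spaced samples, the same function also works on profiles with missing time points.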
Bozdech’s Plasmodium dataset:
4. Analyze Biological Significance
Lomb-Scargle: 4358 Probes, α = 1E-4 significance
Periodogram Map
[Figure: periodogram map of Probes Ordered by Peak Frequency, with Frequency and Period axes.]
• Shows periodograms, not expression profiles
• Shows frequency space, not time
• Dominant frequency band corresponds to 48-hr period
• Are “weak” bands indicative of complex expression, perhaps a diurnal component, or an asymmetric “duty cycle”?
Summary
Lomb-Scargle Method                      Fourier Method
Weights data points                      Weights frequency intervals
No special requirement                   Requires uniform spacing
No special processing                    Missing data imputed
No special requirement                   2^N points for FFT; 0 padding
Known statistical properties             Permutation tests needed to assess statistical properties
Use “p” values                           Ad hoc scoring rules
Need estimate of number of               Usually only look at
“independent frequencies” but            “independent” Fourier
explore using continuum                  frequencies
Conclusions
• Lomb-Scargle periodogram is an effective tool to identify periodic gene expression profiles
• Results comparable with Fourier analysis
• Lomb-Scargle can help when data are missing or not evenly spaced
We wanted to validate the Lomb-Scargle method before applying it to our somitogenesis problem, since the Fourier technique would be difficult to use.
Scargle (1982): “surprising result is that the … spectrum of a process can be estimated … [with] only the order of the samples ...”
Conclusions
• Conclusions should not be drawn using the individual p-value calculated for each profile. A multiple comparison procedure such as the False Discovery Rate (FDR) must be used to control the error rate.
• Expression profiles may be more complex than simple cosine curves
• Power spectra of non-sinusoid rhythms are more difficult to interpret
Supplementary Information
http://research.stowers-institute.org/efg/2004/CAMDA
Acknowledgements
Stowers Institute for Medical Research
Pourquie Lab
Olivier Pourquie
Mary-Lee Dequeant