Environmental Data Analysis with MatLab
2nd Edition
Lecture 26: Confidence Limits of Spectra; Bootstraps
SYLLABUS
1. Using MatLab
2. Looking At Data
3. Probability and Measurement Error
4. Multivariate Distributions
5. Linear Models
6. The Principle of Least Squares
7. Prior Information
8. Solving Generalized Least Squares Problems
9. Fourier Series
10. Complex Fourier Series
11. Lessons Learned from the Fourier Transform
12. Power Spectra
13. Filter Theory
14. Applications of Filters
15. Factor Analysis
16. Orthogonal Functions
17. Covariance and Autocorrelation
18. Cross-correlation
19. Smoothing, Correlation and Spectra
20. Coherence; Tapering and Spectral Analysis
21. Interpolation
22. Linear Approximations and Non-Linear Least Squares
23. Adaptable Approximations with Neural Networks
24. Hypothesis Testing
25. Hypothesis Testing, continued; F-Tests
26. Confidence Limits of Spectra; Bootstraps
In this lecture we continue to develop a way to assess the significance of a spectral peak, and we develop the Bootstrap Method of determining confidence intervals.
Assessing the confidence level of a spectral peak

What does confidence in a spectral peak mean? You observe a short time window of an indefinitely long phenomenon (one that looks "noisy," with no obvious periodicities). You compute the p.s.d. and detect a peak. You then ask: would this peak still be there if I observed some other time window, or did it arise from random variation?
example

[Figure: a random-looking time series d(t), for 0 <= t <= 1000, and the power spectral densities, plotted against frequency f, of several different time windows of it. A given spectral peak is present in some windows (marked Y) but absent in others (marked N); in a second realization of the time series, the peak is present in every window (Y).]
The spectral peak can be explained by random variation in a time series that consists of nothing but random noise: a random time series that is Normally-distributed, uncorrelated, with zero mean and a variance that matches the power of the time series under consideration.

So what is the probability density function p(s^2) of points in the power spectral density s^2 of such a time series?
Chain of Logic, Part 1

The time series is Normally-distributed. The Fourier Transform is a linear function of the time series. Linear functions of Normally-distributed variables are Normally-distributed, so the Fourier Transform is Normally-distributed, too. For a complex FT, the real and imaginary parts are individually Normally-distributed.
Chain of Logic, Part 2

The time series has zero mean. The Fourier Transform is a linear function of the time series. The mean of a linear function is the linear function of the mean, so the mean of the FT is zero. For a complex FT, the means of the real and imaginary parts are individually zero.
Chain of Logic, Part 3

The time series is uncorrelated. The Fourier Transform has [G^T G]^-1 proportional to I. So, by the usual rules of error propagation, the Fourier Transform is uncorrelated, too. For a complex FT, the real and imaginary parts are uncorrelated.
Chain of Logic, Part 4

The power spectral density is proportional to the sum of squares of the real and imaginary parts of the Fourier Transform. The sum of squares of two uncorrelated, Normally-distributed variables with zero mean and unit variance is chi-squared distributed with two degrees of freedom. Once the p.s.d. is scaled to have unit variance, it is chi-squared distributed with two degrees of freedom.
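The last step of the chain can be checked numerically. The sketch below (plain Python, as a stand-in for the book's MatLab scripts) draws many pairs of zero-mean, unit-variance Normal variables and compares the empirical distribution of their sum of squares with the chi-squared CDF for two degrees of freedom, P(x) = 1 - exp(-x/2):

```python
import math
import random

random.seed(0)
Ntrials = 200_000

# s2 = sum of squares of two uncorrelated, zero-mean, unit-variance Normal
# variables (standing in for the real and imaginary parts of one scaled
# Fourier coefficient)
s2 = [random.gauss(0, 1) ** 2 + random.gauss(0, 1) ** 2 for _ in range(Ntrials)]

# chi-squared with 2 degrees of freedom has the simple CDF P(x) = 1 - exp(-x/2)
x = 3.0
empirical = sum(1 for v in s2 if v < x) / Ntrials
predicted = 1.0 - math.exp(-x / 2.0)  # about 0.777

print(empirical, predicted)
```

The empirical fraction of values below x agrees with the chi-squared(2) prediction to within sampling error.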
In the text, it is shown that s^2/c is chi-squared distributed with two degrees of freedom, where the scaling factor c depends on:

σ_d^2, the variance of the data;
N_f, the length of the p.s.d.;
Δf, the frequency sampling; and
f_f, the variance of the taper, which adjusts for the effect of tapering.
example 1: a completely random time series

[Figure: (A) the tapered time series, with bounds of ±2σ_d marked, plotted against time t in seconds; (B) its power spectral density against frequency f (0 to 20 Hz), with the mean and 95% levels marked.]

example 1: histogram of spectral values

[Figure: histogram of power spectral density values s^2(f), with the mean and 95% levels marked.]
example 2: random time series consisting of 5 Hz cosine plus noise

[Figure: (A) the tapered time series, with bounds of ±2σ_d marked, plotted against time t in seconds; (B) its power spectral density against frequency f (0 to 20 Hz), with the mean and 95% levels marked; a prominent peak occurs at 5 Hz.]
example 2: histogram of spectral values

[Figure: histogram of power spectral density values s^2(f), with the mean and 95% levels marked; the 5 Hz peak lies far out in the tail of the distribution.]
So how confident are we of a peak at 5 Hz?

P = 0.99994: the p.s.d. is predicted to be less than the level of the peak 99.994% of the time.
Two versions of the Null Hypothesis:

(1) A peak of the observed amplitude at 5 Hz is caused by random variation. A peak of the observed amplitude or greater occurs only 1 - 0.99994 = 0.006% of the time, so the Null Hypothesis can be rejected to high certainty.

(2) A peak of the observed amplitude somewhere in the p.s.d. is caused by random variation. This is much more likely, since the p.s.d. has many frequency points (513 in this case): a peak of the observed amplitude occurs 1 - (0.99994)^513 = 3% of the time. The Null Hypothesis can still be rejected, but only to acceptable certainty.
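The arithmetic behind these two rejection levels is easy to reproduce. A minimal sketch (in Python rather than the book's MatLab), using the values quoted above and the chi-squared(2) CDF P(x) = 1 - exp(-x/2):

```python
import math

P_single = 0.99994  # P(p.s.d. value < peak level) at one frequency (from above)
N_f = 513           # number of frequency points in the p.s.d.

# (1) peak at the one pre-selected frequency, 5 Hz
p_at_5Hz = 1.0 - P_single            # = 0.00006, i.e. 0.006% of the time

# (2) peak of that amplitude anywhere among the N_f frequency points
p_anywhere = 1.0 - P_single ** N_f   # about 0.03, i.e. about 3% of the time

# for reference, the scaled peak level implied by P_single, inverting the
# chi-squared(2) CDF P(x) = 1 - exp(-x/2)
x_peak = -2.0 * math.log(1.0 - P_single)

print(p_at_5Hz, p_anywhere, x_peak)
```

Note how drastically the "anywhere in the p.s.d." version weakens the rejection: 0.006% at one frequency becomes roughly 3% over 513 frequencies.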
The Bootstrap Method

What do you do when you have a statistic that can test a Null Hypothesis, but you don't know its probability density function? If you could repeat the experiment many times, you could address the problem empirically:

repeat: perform the experiment; calculate the statistic, s
make a histogram of the s's
normalize the histogram into an empirical p.d.f.

The problem is that it's not usually possible to repeat an experiment many times over. The bootstrap idea: create approximate repeat datasets by randomly resampling (with duplications) the one existing data set.
index   original data set   random integer in range 1-6   resampled data set
1       1.4                 3                             3.8
2       2.1                 1                             1.4
3       3.8                 3                             3.8
4       3.1                 2                             2.1
5       1.5                 5                             1.5
6       1.7                 1                             1.4

Resampling mixes the samples and duplicates some of them (while omitting others); the resampled data set is drawn from an approximation p'(d) to the original distribution p(d).
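The resampling step in the table above can be sketched in a few lines (a Python stand-in for the book's MatLab; the data values are the six-sample example, and the seed is arbitrary):

```python
import random

random.seed(1)
original = [1.4, 2.1, 3.8, 3.1, 1.5, 1.7]
N = len(original)

# N random integers in the range 0..N-1 (the slide uses 1..6)
idx = [random.randrange(N) for _ in range(N)]

# resampled data set: same length, drawn with replacement, so some values
# are duplicated and others omitted
resampled = [original[i] for i in idx]

print(resampled)
```

Each resampled data set has the same length as the original, but its composition varies from draw to draw, which is what generates the spread of the statistic.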
What is p(b), where b is the slope of a linear fit?

[Figure: data d(t) plotted against time t, in hours, with a fitted straight line.]

This is a good test case, because we know the answer: if the data are Normally-distributed and uncorrelated with variance σ_d^2, and given the linear problem d = Gm where m = [intercept, slope]^T, then the slope is also Normally-distributed, with a variance that is the lower-right element of σ_d^2 [G^T G]^-1.
The MatLab script: creates a resampled data set (using a function that returns N random integers from 1 to N); runs the usual code for a least-squares fit of a line; saves the slopes; and integrates p(b) to P(b).
[Figure: the empirical p.d.f. p(b) of the slope b (which ranges from about 0.50 to 0.56), computed by the bootstrap, compared with the Normal p.d.f. from standard error propagation; the two agree closely. The 95% confidence limits are marked.]
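A compact version of the bootstrap loop for the slope, as a Python sketch; the synthetic data here (true slope 0.53, intercept 1.0, noise level 0.5) are invented for illustration, not taken from the book's dataset:

```python
import random

random.seed(0)
N = 50
t = [0.5 * i for i in range(N)]  # time in hours (hypothetical)
# hypothetical data: intercept 1.0, slope 0.53, Normally-distributed noise
d = [1.0 + 0.53 * ti + random.gauss(0.0, 0.5) for ti in t]

def lsq_line(t, d):
    """Ordinary least-squares fit of d = intercept + slope * t."""
    n = len(t)
    mt, md = sum(t) / n, sum(d) / n
    slope = (sum((ti - mt) * (di - md) for ti, di in zip(t, d))
             / sum((ti - mt) ** 2 for ti in t))
    return md - slope * mt, slope

# bootstrap: resample (t, d) pairs with replacement, refit, save the slopes
Nboot = 2000
slopes = []
for _ in range(Nboot):
    idx = [random.randrange(N) for _ in range(N)]
    _, b = lsq_line([t[i] for i in idx], [d[i] for i in idx])
    slopes.append(b)

# 95% confidence limits read straight off the empirical distribution of b
slopes.sort()
b_lo, b_hi = slopes[int(0.025 * Nboot)], slopes[int(0.975 * Nboot)]

print(b_lo, b_hi)
```

Sorting the saved slopes and reading off the 2.5% and 97.5% quantiles is the discrete equivalent of integrating p(b) to P(b) and inverting it at the confidence levels.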
p(r), where r is the CaO/Na2O ratio of the second varimax factor of the Atlantic Rock dataset

[Figure: the empirical p.d.f. p(r) of the CaO/Na2O ratio r (which ranges from about 0.45 to 0.52), with the mean and 95% confidence limits marked.]
We can use this histogram to write confidence intervals for r: r has a mean of 0.486, and there is a 95% probability that r is between 0.458 and 0.512. Roughly, since p(r) is approximately symmetrical,

r = 0.486 ± 0.025 (95% confidence)