Environmental Data Analysis with MatLab
Lecture 24:
Confidence Limits of Spectra; Bootstraps
Housekeeping
This is the last lecture
The final presentations are next week
The last homework is due today
SYLLABUS
Lecture 01  Using MatLab
Lecture 02  Looking At Data
Lecture 03  Probability and Measurement Error
Lecture 04  Multivariate Distributions
Lecture 05  Linear Models
Lecture 06  The Principle of Least Squares
Lecture 07  Prior Information
Lecture 08  Solving Generalized Least Squares Problems
Lecture 09  Fourier Series
Lecture 10  Complex Fourier Series
Lecture 11  Lessons Learned from the Fourier Transform
Lecture 12  Power Spectral Density
Lecture 13  Filter Theory
Lecture 14  Applications of Filters
Lecture 15  Factor Analysis
Lecture 16  Orthogonal functions
Lecture 17  Covariance and Autocorrelation
Lecture 18  Cross-correlation
Lecture 19  Smoothing, Correlation and Spectra
Lecture 20  Coherence; Tapering and Spectral Analysis
Lecture 21  Interpolation
Lecture 22  Hypothesis testing
Lecture 23  Hypothesis Testing continued; F-Tests
Lecture 24  Confidence Limits of Spectra, Bootstraps
purpose of the lecture
continue to develop a way to assess the significance of a spectral peak
and
develop the Bootstrap Method of determining confidence intervals
Part 1
assessing the confidence level of a spectral peak
what does confidence in a spectral
peak mean?
one possibility
indefinitely long phenomenon
you observe a short time window
(looks “noisy” with no obvious periodicities)
you compute the p.s.d. and detect a peak
you ask
would this peak still be there if I observed some
other time window?
or did it arise from random variation?
example
[Figure: a long, noisy-looking time series (t from 0 to 1000) divided into several short windows; the a.s.d. of each window is plotted (f from 0 to 0.5). In the first case, the spectral peak is present in only one window (marked Y) and absent in the others (marked N); in the second case, the peak is present in every window (all marked Y).]
Null Hypothesis
The spectral peak can be explained by random
variation in a time series that consists of
nothing but random noise.
Easiest Case to Analyze
Random time series that is:
Normally-distributed
uncorrelated
zero mean
variance that matches the power of the time series under consideration
So what is the probability density function p(s²) of points in the power spectral density s² of such a time series?
Chain of Logic, Part 1
The time series is Normally-distributed
The Fourier Transform is a linear function of the
time series
Linear functions of Normally-distributed variables
are Normally-distributed, so the Fourier
Transform is Normally-distributed too
For a complex FT, the real and imaginary parts are
individually Normally-distributed
Chain of Logic, Part 2
The time series has zero mean
The Fourier Transform is a linear function of the time
series
The mean of a linear function is the linear function of the mean, so the mean of the FT is zero
For a complex FT, the means of the real and
imaginary parts are individually zero
Chain of Logic, Part 3
The time series is uncorrelated
The Fourier Transform has [GTG]-1 proportional to I
So by the usual rules of error propagation, the
Fourier Transform is uncorrelated too
For a complex FT, the real and imaginary parts are
uncorrelated
Chain of Logic, Part 4
The power spectral density is proportional to the sum of
squares of the real and imaginary parts of the Fourier
Transform
The sum of squares of two uncorrelated Normally-distributed variables with zero mean and unit variance is chi-squared distributed with two degrees of freedom.
Once the p.s.d. is scaled to have unit variance, it is chi-squared distributed with two degrees of freedom.
so
s²/c is chi-squared distributed
where c is a yet-to-be-determined scaling factor
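To see this distributional claim empirically, here is a minimal MatLab sketch (not from the text; the p.s.d. normalization is one common convention, and the rescaling by the sample mean stands in for the factor c). It computes the p.s.d. of Normally-distributed white noise, rescales it to have mean 2, and compares its histogram to the chi-squared p.d.f. with two degrees of freedom, p(x) = ½ exp(−x/2):

% sketch: p.s.d. values of uncorrelated, zero-mean, Normally-distributed
% noise, once scaled, follow a chi-squared distribution with 2 dof
N = 1024; Dt = 1.0;
d = randn(N,1);                        % the random time series
dtilde = Dt*fft(d);                    % its Fourier transform
Nf = N/2+1;
s2 = (2/(N*Dt))*abs(dtilde(1:Nf)).^2;  % one-sided power spectral density
s2 = s2(2:Nf-1);                       % drop f=0 and Nyquist (real-only, 1 dof)
x = 2*s2/mean(s2);                     % scale so the mean is 2, like chi-squared(2)
[counts, bins] = hist(x, 50);          % empirical p.d.f. of the scaled values
Db = bins(2)-bins(1);
pemp = counts/(Db*sum(counts));
ptrue = 0.5*exp(-bins/2);              % chi-squared p.d.f. with nu=2
plot(bins, pemp, 'ko', bins, ptrue, 'k-');
xlabel('scaled p.s.d.'); ylabel('p');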
in the text, the scaling factor c is derived in terms of:
σd², the variance of the data
Nf, the length of the p.s.d.
Δf, the frequency sampling
ff, the variance of the taper, which adjusts for the effect of tapering
example 1: a completely random time series
[Figure: (A) tapered time series d(i), with the ±2σd levels marked, time t from 0 to 30 seconds; (B) its power spectral density s²(f), frequency f from 0 to 20 Hz, with the mean and 95% levels marked]
example 1: histogram of spectral values
[Figure: histogram of the power spectral density values s²(f), with the mean and 95% levels marked]
example 2: random time series consisting of 5 Hz cosine plus noise
[Figure: (A) tapered time series d(i), with the ±2σd levels marked, time t from 0 to 30 seconds; (B) its power spectral density s²(f), frequency f from 0 to 20 Hz, with the mean and 95% levels marked and a prominent peak at 5 Hz]
example 2: histogram of spectral values
[Figure: histogram of the power spectral density values s²(f), with the mean and 95% levels marked; the value at the 5 Hz peak lies far beyond the 95% level]
so how confident are we of a peak at 5 Hz?
P = 0.99994
the p.s.d. is predicted to be less than the level of the peak 99.994% of the time
But here we must be very careful
two alternative Null Hypotheses
1. a peak of the observed amplitude at 5 Hz is caused by random variation
2. a peak of the observed amplitude somewhere in the p.s.d. is caused by random variation
the second is much more likely, since the p.s.d. has many frequency points (513 in this case)
for Hypothesis 1, a peak of the observed amplitude or greater occurs only 1 − 0.99994 = 0.006% of the time, so the Null Hypothesis can be rejected to high certainty
for Hypothesis 2, a peak of the observed amplitude or greater occurs 1 − (0.99994)^513 ≈ 3% of the time, so the Null Hypothesis can be rejected only to acceptable certainty
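Both probabilities follow directly from the chi-squared cumulative distribution with two degrees of freedom, P(x) = 1 − exp(−x/2). A worked sketch, where x is assumed to hold the peak p.s.d. value scaled by c:

% sketch: probability calculations for the two Null Hypotheses
Nf = 513;                   % number of frequency points in the p.s.d.
P1 = 1 - exp(-x/2);         % chi-squared(2) CDF; 0.99994 in this example
p_at_5Hz   = 1 - P1;        % such a peak at one given frequency: 0.006%
p_anywhere = 1 - P1^Nf;     % such a peak anywhere in the p.s.d.: about 3%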
Part 2
The Bootstrap Method
The Issue
What do you do when you have a statistic that can test a Null Hypothesis, but you don't know its probability density function?
If you could repeat the experiment many
times, you could address the problem
empirically
repeat
perform experiment
calculate statistic, s
make histogram of s’s
normalize histogram into empirical p.d.f.
The problem is that it’s not usually possible
to repeat an experiment many times over
Bootstrap Method
create approximate repeat datasets
by randomly resampling (with duplications)
the one existing data set
example of resampling

index   original data set   random integers in range 1-6   resampled data set
1       1.4                 3                              3.8
2       2.1                 1                              1.4
3       3.8                 3                              3.8
4       3.1                 2                              2.1
5       1.5                 5                              1.5
6       1.7                 1                              1.4
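In MatLab, one resampling pass is just indexing with random integers. A minimal sketch, where dorig is assumed to hold the original N data values:

% sketch: resample a data set with duplications
N = length(dorig);
j = floor(N*rand(N,1)) + 1;   % N random integers in the range 1 to N
dresampled = dorig(j);        % resampled data set, with duplications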
interpretation of resampling
[Figure: the original data are viewed as samples drawn from a p.d.f. p(d); mixing and sampling with duplication yields a new data set drawn from an approximating p.d.f. p′(d)]
Example
what is p(b), where b is the slope of a linear fit?
[Figure: data d(i) plotted against time t, in hours, with a straight-line fit]
This is a good test case, because we know the answer:
if the data are Normally-distributed and uncorrelated with variance σd², and given the linear problem
d = G m, where m = [intercept, slope]T
then the slope is also Normally-distributed, with a variance given by the lower-right element of
σd² [GTG]-1
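A sketch of this analytic answer in MatLab, assuming t holds the N observation times and sd2 the data variance σd²:

% sketch: variance of the estimated slope for the line fit d = G*m
G = [ones(N,1), t];      % linear model, m = [intercept; slope]
Cm = sd2 * inv(G'*G);    % covariance of the least squares solution
sigmab2 = Cm(2,2);       % variance of the slope: lower-right element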
the bootstrap code, step by step (a MatLab sketch follows):
create a resampled data set, using a function that returns N random integers from 1 to N
run the usual code for a least squares fit of a line
save the slope, and repeat many times
make a histogram of the slopes
integrate p(b) to P(b)
read off the 2.5% and 97.5% bounds
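A minimal MatLab sketch of these steps (the variable names t, d, and the repetition count Nr are assumptions, not the book's code):

% sketch: bootstrap estimate of p(b), the p.d.f. of the slope b
Nr = 10000;                        % number of bootstrap repetitions
N = length(d);
bsave = zeros(Nr,1);
for ir = 1:Nr
    j = floor(N*rand(N,1)) + 1;    % N random integers from 1 to N
    tr = t(j);  dr = d(j);         % resampled data set (with duplications)
    G = [ones(N,1), tr];           % usual least squares fit of a line
    m = (G'*G)\(G'*dr);
    bsave(ir) = m(2);              % save the slope
end
% histogram of slopes, normalized into an empirical p.d.f. p(b)
[counts, bins] = hist(bsave, 100);
Db = bins(2) - bins(1);
pb = counts/(Db*sum(counts));
% integrate p(b) to P(b) and read off the 2.5% and 97.5% bounds
Pb = Db*cumsum(pb);
blo = bins(find(Pb>=0.025, 1));
bhi = bins(find(Pb>=0.975, 1));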
[Figure: bootstrap estimate of p(b) (histogram) compared with the p(b) from standard error propagation (curve), slope b from 0.50 to 0.56, with the 95% confidence interval marked]
a more complicated example
p(r), where r is the CaO to Na2O ratio of the second varimax factor of the Atlantic Rock dataset
[Figure: bootstrap estimate of p(r), CaO/Na2O ratio r from 0.45 to 0.52, with the mean and the 95% confidence interval marked]
we can use this histogram to write
confidence intervals for r
r has a mean of 0.486
95% probability that r is between 0.458 and 0.512
and roughly, since p(r) is approximately symmetrical
r = 0.486 ± 0.025 (95% confidence)
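The same bounds can be read directly off the sorted bootstrap values. A sketch, assuming rsave holds the bootstrap estimates of r:

% sketch: confidence limits straight from the bootstrap values
rbar = mean(rsave);         % mean, about 0.486 here
rs = sort(rsave);
Nr = length(rs);
rlo = rs(ceil(0.025*Nr));   % 2.5% bound, about 0.458
rhi = rs(ceil(0.975*Nr));   % 97.5% bound, about 0.512

For this dataset, these give the values quoted above.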