
Environmental Data Analysis with MatLab

2nd Edition

Lecture 26:

Confidence Limits of Spectra; Bootstraps

SYLLABUS

Lecture 01  Using MatLab
Lecture 02  Looking At Data
Lecture 03  Probability and Measurement Error
Lecture 04  Multivariate Distributions
Lecture 05  Linear Models
Lecture 06  The Principle of Least Squares
Lecture 07  Prior Information
Lecture 08  Solving Generalized Least Squares Problems
Lecture 09  Fourier Series
Lecture 10  Complex Fourier Series
Lecture 11  Lessons Learned from the Fourier Transform
Lecture 12  Power Spectra
Lecture 13  Filter Theory
Lecture 14  Applications of Filters
Lecture 15  Factor Analysis
Lecture 16  Orthogonal Functions
Lecture 17  Covariance and Autocorrelation
Lecture 18  Cross-correlation
Lecture 19  Smoothing, Correlation and Spectra
Lecture 20  Coherence; Tapering and Spectral Analysis
Lecture 21  Interpolation
Lecture 22  Linear Approximations and Non-Linear Least Squares
Lecture 23  Adaptable Approximations with Neural Networks
Lecture 24  Hypothesis Testing
Lecture 25  Hypothesis Testing continued; F-Tests
Lecture 26  Confidence Limits of Spectra; Bootstraps

purpose of the lecture

continue to develop a way to assess the significance of a spectral peak, and develop the Bootstrap Method of determining confidence intervals

Part 1

assessing the confidence level of a spectral peak

what does confidence in a spectral peak mean?

one possibility: the data are a short time window of an indefinitely long phenomenon. The window looks “noisy”, with no obvious periodicities, yet when you compute the p.s.d. you detect a peak. You then ask: would this peak still be there if I had observed some other time window, or did it arise from random variation?

example

[Figure: two cases. In each, a long time series d(t) (t = 0 to 1000, amplitudes of about ±10) is shown, together with the power spectral densities s²(f) (f = 0 to 0.5) of several short time windows drawn from it. Each p.s.d. panel is labeled Y or N according to whether the peak appears in that window; in the first case only one window is labeled Y, in the second all are labeled Y.]

Null Hypothesis

The spectral peak can be explained by random variation in a time series that consists of nothing but random noise.

Easiest Case to Analyze

a random time series that is:

Normally-distributed
uncorrelated
zero mean
with variance that matches the power of the time series under consideration

So what is the probability density function p(s²) of the points in the power spectral density s² of such a time series?

Chain of Logic, Part 1

The time series is Normally-distributed

The Fourier Transform is a linear function of the time series

Linear functions of Normally-distributed variables are Normally-distributed, so the Fourier Transform is Normally-distributed too

For a complex FT, the real and imaginary parts are individually Normally-distributed

Chain of Logic, Part 2

The time series has zero mean

The Fourier Transform is a linear function of the time series

The mean of a linear function is the function of the mean value, so the mean of the FT is zero

For a complex FT, the means of the real and imaginary parts are individually zero

Chain of Logic, Part 3

The time series is uncorrelated

The Fourier Transform has [G^T G]^-1 proportional to I

So, by the usual rules of error propagation, the Fourier Transform is uncorrelated too

For a complex FT, the real and imaginary parts are uncorrelated

Chain of Logic, Part 4

The power spectral density is proportional to the sum of squares of the real and imaginary parts of the Fourier Transform

The sum of squares of two uncorrelated, Normally-distributed variables with zero mean and unit variance is chi-squared distributed with two degrees of freedom.

Once the p.s.d. is scaled to have unit variance, it is chi-squared distributed with two degrees of freedom.
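This chain of logic can be checked numerically. The sketch below (a minimal illustration, not code from the text, and written in Python/NumPy rather than MatLab so that it is self-contained) generates many realizations of an uncorrelated, Normally-distributed, zero-mean time series and confirms that the scaled p.s.d. values behave like a chi-squared variable with two degrees of freedom:

```python
import numpy as np

rng = np.random.default_rng(0)
N, ntrials = 1024, 2000

# many realizations of an uncorrelated, zero-mean, unit-variance time series
d = rng.standard_normal((ntrials, N))

# Fourier transform: for 0 < k < N/2 the real and imaginary parts of D[k]
# are uncorrelated Normal variables, each with variance N/2, so the sum of
# their squares divided by N/2 is chi-squared with 2 degrees of freedom
D = np.fft.rfft(d, axis=1)
s2 = np.abs(D[:, 1:-1])**2 / (N / 2)   # drop the zero and Nyquist frequencies

# a chi-squared variable with 2 degrees of freedom is exponential with
# mean 2, so its 95th percentile is -2 ln(0.05), about 5.99
print(s2.mean())                        # close to 2
print((s2 > -2 * np.log(0.05)).mean())  # close to 0.05
```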

so s²/c is chi-squared distributed, where c is a yet-to-be-determined scaling factor

in the text, a formula for the scaling factor c is derived in terms of:

σd², the variance of the data
Nf, the length of the p.s.d.
Δf, the frequency sampling
the variance of the taper, which adjusts for the effect of tapering

example 1: a completely random time series

[Figure: (A) tapered time series d(t), 0 to 25 s, shown with its ±2σd bounds; (B) its power spectral density s²(f), 0 to 20 Hz, with the mean and 95% levels marked.]

example 1: histogram of spectral values

[Figure: histogram of the power spectral density values s²(f), for s² between about 1 and 8, with the mean and 95% levels marked.]

example 2: random time series consisting of 5 Hz cosine plus noise

[Figure: (A) tapered time series d(t), 0 to 25 s, shown with its ±2σd bounds; (B) its power spectral density s²(f), 0 to 20 Hz, with the mean and 95% levels marked and a pronounced peak at 5 Hz.]

example 2: histogram of spectral values

[Figure: histogram of the power spectral density values s²(f), for s² between about 2 and 18, with the mean and 95% levels marked; the value at the peak lies far beyond the 95% level.]

so how confident are we of a peak at 5 Hz?

P(s² < s²peak) = 0.99994

the p.s.d. is predicted to be less than the level of the peak 99.994% of the time
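As a check on this number (a small illustrative sketch in Python, not code from the text): since a chi-squared variable with two degrees of freedom is exponentially distributed, its CDF is F(x) = 1 − exp(−x/2), and the 0.99994 confidence level can be inverted to find the scaled p.s.d. value the peak must reach.

```python
import math

# CDF of a chi-squared variable with 2 degrees of freedom
# (an exponential distribution with mean 2)
def chi2_2_cdf(x):
    return 1.0 - math.exp(-x / 2.0)

# invert F(x) = 0.99994 to find the scaled peak value
# (inferred here; this value is not quoted in the lecture)
x_peak = -2.0 * math.log(1.0 - 0.99994)
print(x_peak)               # about 19.4
print(chi2_2_cdf(x_peak))   # recovers 0.99994
```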

But here we must be very careful: there are two alternative Null Hypotheses:

1. a peak of the observed amplitude at 5 Hz is caused by random variation
2. a peak of the observed amplitude somewhere in the p.s.d. is caused by random variation

The second is much more likely, since the p.s.d. has many frequency points (513 in this case).

Under the first hypothesis, a peak of the observed amplitude or greater occurs only 1 − 0.99994 = 0.006% of the time, so the Null Hypothesis can be rejected to high certainty.

Under the second hypothesis, a peak of the observed amplitude or greater occurs 1 − (0.99994)^513 ≈ 3% of the time, so the Null Hypothesis can be rejected only to acceptable certainty.
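The arithmetic behind the two rejection levels can be sketched in a few lines (Python, for illustration):

```python
# probability that one pre-chosen frequency point exceeds the peak level
p_single = 1 - 0.99994            # about 0.00006, i.e. 0.006% of the time

# probability that at least one of the 513 frequency points in the p.s.d.
# exceeds that level, assuming the points are independent
p_anywhere = 1 - 0.99994**513     # about 0.03, i.e. roughly 3% of the time
print(p_single, p_anywhere)
```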

Part 2

The Bootstrap Method

The Issue

What do you do when you have a statistic that can test a Null Hypothesis, but you don’t know its probability density function?

If you could repeat the experiment many times, you could address the problem empirically:

repeat:
    perform the experiment
    calculate the statistic, s
make a histogram of the s’s
normalize the histogram into an empirical p.d.f.

The problem is that it’s not usually possible to repeat an experiment many times over.

Bootstrap Method

create approximate repeat datasets by randomly resampling (with duplications) the one existing data set

example of resampling

index   original data set   random integer (1-6)   resampled data set
1       1.4                 3                      3.8
2       2.1                 1                      1.4
3       3.8                 3                      3.8
4       3.1                 2                      2.1
5       1.5                 5                      1.5
6       1.7                 1                      1.4

each random integer selects the element of the original data set that becomes the corresponding element of the resampled (new) data set; some values (e.g. 3.8) are duplicated and others (e.g. 3.1 and 1.7) are omitted
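The resampling step can be sketched as follows (Python/NumPy for illustration; MatLab’s random integers run from 1 to N, while Python’s indices run from 0 to N−1):

```python
import numpy as np

rng = np.random.default_rng()
d = np.array([1.4, 2.1, 3.8, 3.1, 1.5, 1.7])   # the original data set above
N = len(d)

# N random integers used as indices into the original data set, so some
# values are duplicated and others omitted
idx = rng.integers(0, N, size=N)
d_resampled = d[idx]
```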

interpretation of resampling

[Diagram: the resampling process, mixing, sampling and duplication, transforms the empirical distribution p(d) into an approximating distribution p’(d).]

Example

what is p(b), where b is the slope of a linear fit?

[Figure: the data plotted against time t, in hours.]

This is a good test case, because we know the answer: if the data are Normally-distributed and uncorrelated with variance σd², and given the linear problem d = Gm, where m = [intercept, slope]^T, then the slope is also Normally-distributed, with a variance equal to the lower-right element of σd² [G^T G]^-1.
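That analytic answer can be evaluated directly. The sketch below (Python/NumPy, with made-up sampling times rather than the lecture’s dataset) forms G for a straight-line fit and reads off the slope variance as the lower-right element of σd² [G^T G]^-1:

```python
import numpy as np

sigma_d = 0.1                              # assumed data standard deviation
t = np.linspace(0.0, 10.0, 50)             # hypothetical sampling times
G = np.column_stack([np.ones_like(t), t])  # linear model d = G m, m = [intercept, slope]

# covariance of the model parameters; slope variance is the lower-right element
C = sigma_d**2 * np.linalg.inv(G.T @ G)
var_slope = C[1, 1]

# for a straight-line fit this equals sigma_d^2 / sum((t - mean(t))^2)
print(var_slope)
```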

the MatLab script proceeds in three steps:

create the resampled data set, using a function that returns N random integers from 1 to N
the usual code for a least-squares fit of a line, saving the slopes
integrate p(b) to P(b)
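Those steps can be sketched end-to-end (a Python/NumPy stand-in for the MatLab code, run on synthetic data with an assumed true slope of 0.52, not the lecture’s dataset):

```python
import numpy as np

rng = np.random.default_rng(1)

# synthetic straight-line data (an assumption for illustration)
N = 50
t = np.linspace(0.0, 10.0, N)
d = 1.0 + 0.52 * t + 0.1 * rng.standard_normal(N)
G = np.column_stack([np.ones(N), t])       # d = G m, m = [intercept, slope]

Nboot = 2000
slopes = np.empty(Nboot)
for i in range(Nboot):
    idx = rng.integers(0, N, size=N)       # resample the data, with duplication
    m, *_ = np.linalg.lstsq(G[idx], d[idx], rcond=None)  # least-squares line fit
    slopes[i] = m[1]                       # save the slope

# 95% confidence limits read straight from the empirical distribution of slopes
lo, hi = np.percentile(slopes, [2.5, 97.5])
print(lo, hi)
```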

[Figure: empirical p.d.f. p(b) of the slope b, for b between 0.50 and 0.56, computed both by standard error propagation and by the bootstrap, with the 95% confidence limits marked.]

a more complicated example

p(r), where r is the CaO to Na2O ratio of the second varimax factor of the Atlantic Rock dataset

[Figure: histogram (empirical p.d.f.) of the bootstrap values of the CaO/Na2O ratio r, for r between 0.45 and 0.52, with the mean and the 95% confidence limits marked.]

we can use this histogram to write confidence intervals for r:

r has a mean of 0.486

there is a 95% probability that r is between 0.458 and 0.512

and roughly, since p(r) is approximately symmetric:

r = 0.486 ± 0.025 (95% confidence)
