Introduction to resampling techniques for generating confidence measures

Resampling techniques
1) Randomization
– Resampling without replacement (re-ordering, permutations)
2) Jackknife
– Leaving one data point out at a time (not good for small sample sizes);
in paleobiology usually used for phylogenetic analyses
3) Sampling Standardization
– When comparing samples of different sizes
4) Bootstrap
– Parametric
• Generate datasets from a parametrized model and compare
these with empirical data
– Non-parametric
• Most common in paleobiology
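
The first three techniques differ mainly in how the data are redrawn. A minimal Python sketch of each follows (the slides point to R, so treat this as an illustration rather than the demo code). The function names, the difference-in-means statistic, and the taxon-count statistic are hypothetical choices made only for this example.

```python
import random
import statistics


def permutation_diffs(group_a, group_b, n_perm, rng):
    """Randomization: shuffle the pooled data (no replacement) and re-split.

    Returns the difference in means for each shuffled re-split.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    diffs = []
    for _ in range(n_perm):
        rng.shuffle(pooled)  # re-ordering only; every value is used exactly once
        diffs.append(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
    return diffs


def jackknife_estimates(data, statistic):
    """Jackknife: recompute the statistic leaving out one data point at a time."""
    data = list(data)
    return [statistic(data[:i] + data[i + 1:]) for i in range(len(data))]


def standardized_richness(occurrences, quota, rng):
    """Sampling standardization: draw `quota` occurrences without replacement
    and count the distinct taxa, so samples of different sizes are comparable."""
    return len(set(rng.sample(list(occurrences), quota)))


rng = random.Random(42)
a = [2.1, 3.4, 2.9, 3.8]
b = [1.2, 1.9, 2.2, 1.5]
observed = statistics.mean(a) - statistics.mean(b)
diffs = permutation_diffs(a, b, 1000, rng)
# Share of shuffles at least as extreme as the observed difference
p_value = sum(abs(d) >= abs(observed) for d in diffs) / len(diffs)
```

The bootstrap differs from all three in that it redraws samples of the full original size with replacement.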
[Figure: Randomization. Empirical data are reshuffled into randomized samples 1, 2, 3, … N.]
[Figure: Jackknife. Empirical data yield a series of jackknife samples, up to sample N.]
[Figure: Sampling standardization. Empirical data sets 1, 2 and 3 yield standardized samples 1, 2, … N.]
[Figure: Non-parametric bootstrap. Empirical data are resampled into bootstrapped samples 1, 2, 3, … N.]
[Figure: Non-parametric versus parametric bootstrap. Non-parametric panel: empirical data, bootstrap samples, estimate parameters. Parametric panel: empirical data, estimate parameters (model), simulated samples, estimate parameters.]
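
The contrast between the two bootstrap flavours can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the normal distribution used as the "parametrized model" is an arbitrary stand-in for whatever model suits the data.

```python
import random
import statistics


def nonparametric_bootstrap(data, statistic, n_boot, rng):
    """Resample the empirical data with replacement, then estimate."""
    n = len(data)
    return [statistic(rng.choices(data, k=n)) for _ in range(n_boot)]


def parametric_bootstrap(data, statistic, n_boot, rng):
    """Fit a model to the empirical data, simulate from it, then estimate.

    Assumption: a normal model, chosen purely for illustration.
    """
    mu = statistics.mean(data)      # model parameters estimated from the data
    sigma = statistics.stdev(data)
    n = len(data)
    return [
        statistic([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(n_boot)
    ]


rng = random.Random(0)
data = [1.0, 2.0, 3.0, 4.0, 5.0]
np_means = nonparametric_bootstrap(data, statistics.mean, 200, rng)
p_means = parametric_bootstrap(data, statistics.mean, 200, rng)
```

In the non-parametric version every simulated value is one of the observed values; in the parametric version the simulated values come from the fitted model and can fall outside the observed range.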
Why resampling (now)
• Underlying distribution of data not well understood and/or
complex
• Convenient way to generate uncertainty measures
• Computer intensive (possible only with faster computers)
Bootstrapping
• Construct an estimate of the frequency distributions expected from a “generative
process”
• Equivalent to generating replicate outcomes from an experiment (doing
something many times to see the range of results)
• Assumption: the data are a representative sample of independent observations
drawn randomly from the statistical population under study
Bootstrap error estimates
• Estimate standard error by resampling from the single sample we have.
• This approach uses sampling with replacement from the observed sample to
simulate sampling without replacement from the underlying distribution.
Procedure
• Start with an observed sample of size n and the observed sample statistic; call it
Z.
• Randomly pick a sample of size n, with replacement, from the
observed sample.
• Calculate the sample statistic of interest on this random sample; call
it Zboot.
• Repeat many times (generally hundreds to thousands, ideally
until the estimate of SE stabilizes).
• Calculate the standard deviation of the Zboot values.
• This is an estimate of the standard error of the observed sample statistic
Z: SD(Zboot) ≈ SE(Z)
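
The procedure above can be sketched in Python as follows. This is a minimal illustration, not the R demo from the talk: the function name `bootstrap_se`, the default of 2000 replicates, and the example data are all assumptions made for this sketch.

```python
import random
import statistics


def bootstrap_se(sample, statistic, n_boot=2000, seed=None):
    """Estimate the standard error of `statistic` by non-parametric bootstrap."""
    rng = random.Random(seed)
    n = len(sample)
    z_boot = []
    for _ in range(n_boot):
        # Draw n values with replacement from the observed sample
        resample = rng.choices(sample, k=n)
        # Compute the statistic of interest on the resample (Zboot)
        z_boot.append(statistic(resample))
    # SD(Zboot) approximates SE(Z)
    return statistics.stdev(z_boot)


# Made-up observations, for illustration only
observed = [4.2, 5.1, 3.9, 6.3, 5.5, 4.8, 5.0, 6.1]
z = statistics.median(observed)                        # observed statistic Z
se = bootstrap_se(observed, statistics.median, seed=1)
```

To check whether the estimate has stabilized, rerun with a larger `n_boot` and compare the results.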
Example (sampling standardization)
Alroy et al. 2008. Phanerozoic trends in the global diversity of
marine invertebrates. Science 321:97-100
Example (non-parametric bootstrap)
Foote, M. 2006. Substrate affinity and
diversity dynamics of Paleozoic marine
animals. Paleobiology 32:345-366.
Example (non-parametric bootstrap)
Liow et al. 2009. Lower extinction
risk in Sleep-or-Hide Mammals.
Am Nat 173:264–272.
R demo
• Packages (e.g. boot, bootstrap)
• Write your own: use the function sample
Nice help
http://www.ats.ucla.edu/stat/r/library/bootstrap.htm
Links
• http://www.paleo.geos.vt.edu/MK/Kowalewski_PNG_2010.pdf
• http://www.stat.cmu.edu/~cshalizi/402/lectures/08-bootstrap/lecture-08.pdf