Mothematical Modeling - Oregon State University

Erin Childs (Pomona College), Andrew Calderon (Heritage University), Evan Goldman (Bard College,
Boston University), Molly O’Neill (Lehigh University), Clay Showalter (Evergreen University), with the
help of Olivia Poblacion (Oregon State University)
Acknowledgements
Dr. Dietterich, CS Professor
Dr. Wong, CS Professor
Steven Highland, Geosciences PhD Candidate
Jorge Ramirez, Math Professor
Dan Sheldon, CS Post-doc
Julia Jones, Geosciences Professor
Rebecca Hutchinson, CS Post-doc
Javier Illan, PhD, Post-doc
Studying Climate Change:
Lepidoptera

Why are Lepidoptera good indicators of climate change?

Past studies on Lepidoptera


Woiwod 1996: Detecting the effects of climate change on
Lepidoptera
Dewar and Watt 1992: Predicted changes in the synchrony of
larval emergence and budburst under climatic warming
Research Questions
1) How is vegetation related
to moth species
distribution and
composition?
2) How does climate affect
moth phenology?
Study Site
 H.J. Andrews Experimental Forest
http://andrewsforest.oregonstate.edu/about.cfm?topnav=2
Vegetation Surveying: Methods
 GPS coordinates
 Walked out 30m and
100m radius in all
directions
 Presence/absence of
71 species of known
host plants
Moth Trapping: Methods
 Moth Trapping
 9 sites selected
 Equipment used
 Moth preservation
Methods
 Moth Identification
Moth Trapping Results
Semiothisa signaria
Pero occidentalis
Overview: Is vegetation a good predictor of
moth species presence/absence?
• Develop software tools for exploring/analyzing data
• Run generalized boosted regression models (GBMs) for each moth species
• Create GIS layers for the predicted locations of each moth species
Software Tasks for Data Exploration
• Format data
• Compare the similarities and differences between sites, moths and vegetation
• Discover correlations between vegetation and moth species
• Calculate marginal probabilities of plant occurrences
• Visualize results
Measuring Similarity: Hamming Distance
• Hamming distance is the number of covariates that differ between sample sets
• A smaller number means the sets are more similar
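A minimal sketch of the Hamming distance described above, assuming each site is recorded as an equal-length presence/absence vector. The function name `hamming_distance` and the example vectors are hypothetical, not taken from the project's software.

```python
def hamming_distance(a, b):
    """Count the positions where two equal-length presence/absence vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical presence/absence records (1 = present) for five species at two sites.
site_a = [1, 0, 1, 1, 0]
site_b = [1, 1, 1, 0, 0]
print(hamming_distance(site_a, site_b))  # 2
```

The two sites differ in the second and fourth species, so the distance is 2; identical sites would give 0.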
Marginal Probabilities

Using the vegetation data collected at 20 sites, generate marginal probabilities for plant occurrences.
Example: if huckleberry (VAHU) is found at a site, what is the probability of finding thimbleberry (RUPA) but not licorice root (!LIGR) at that site?
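The huckleberry question above can be answered by counting sites. The sketch below assumes each site is stored as a set of plant codes; the function `conditional_probability` and the five example sites are hypothetical and are not the survey data.

```python
def conditional_probability(sites, given, present=(), absent=()):
    """Estimate P(all `present` found and no `absent` found | `given` found)."""
    with_given = [s for s in sites if given in s]
    if not with_given:
        return None  # the conditioning species was never observed
    hits = [
        s for s in with_given
        if all(p in s for p in present) and not any(a in s for a in absent)
    ]
    return len(hits) / len(with_given)


# Hypothetical survey: each set lists the plant codes recorded at one site.
sites = [
    {"VAHU", "RUPA"},
    {"VAHU", "RUPA", "LIGR"},
    {"VAHU"},
    {"RUPA"},
    {"VAHU", "RUPA"},
]
p = conditional_probability(sites, given="VAHU", present=["RUPA"], absent=["LIGR"])
print(p)  # 0.5
```

Four of the example sites contain VAHU, and two of those contain RUPA without LIGR, which gives 0.5.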
Canonical Correlation Analysis (CCA)
• Canonical correlation analysis aims to highlight correlations between two data sets
• Gives us a way of making sense of cross-covariance matrices
• Allows ecologists to relate the abundance of species to environmental variables
• Using CCA, we analyzed our vegetation data and moth data
X-correlation: highlights any correlations among only moth species (422x422)
Y-correlation: highlights any correlations among only plant species (71x71)
Cross-correlation: highlights any correlations between both data sets (71x422)
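The three correlation blocks listed above can be illustrated with plain Pearson correlations. This sketch builds only the correlation matrices, not the full CCA decomposition, and the tiny data set (two moth species, three plant species, four sites) is hypothetical.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    if sd_u == 0 or sd_v == 0:
        return 0.0
    return cov / (sd_u * sd_v)


def correlation_block(cols_a, cols_b):
    """Matrix of correlations between every column in cols_a and every column in cols_b."""
    return [[pearson(a, b) for b in cols_b] for a in cols_a]


# Hypothetical data: one list per species, one entry per site.
moths = [[1, 0, 1, 0], [0, 1, 1, 0]]
plants = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]]

x_corr = correlation_block(moths, moths)      # moths x moths
y_corr = correlation_block(plants, plants)    # plants x plants
cross_corr = correlation_block(plants, moths) # plants x moths
print(len(cross_corr), len(cross_corr[0]))    # 3 2
```

With 422 moth species and 71 plant species the same calls would give blocks of the sizes shown on the slide.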
Generalized Boosted Regression Models (GBMs)
• Regression analysis allows us to explore the relationships between individual moth presence/absence (dependent variable) and various characteristics of each site (independent variables)
• The goal is to minimize the loss function, which represents the loss associated with an estimate being different from the true value
• Basis functions are elements of a set of vectors that, in linear combination, can represent every vector in a given vector space
• Every function in that space can be represented as a linear combination of basis functions
• Boosting is the process of iteratively adding basis functions in a greedy fashion so that each additional basis function further reduces the selected loss function
• The model is run several times with different values for the tuning parameters to determine the best values
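The greedy boosting loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not the GBM software used in the project: it uses one predictor, squared-error loss, and one-split regression stumps as basis functions, and the elevation and presence values are hypothetical. `n_rounds` and `learning_rate` stand in for the tuning parameters.

```python
def fit_stump(xs, residuals):
    """Find the split on x that best fits the residuals with two constants."""
    mean = sum(residuals) / len(residuals)
    best = (float("inf"), max(xs), mean, mean)  # fallback when no split is possible
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if err < best[0]:
            best = (err, threshold, lv, rv)
    return best[1:]


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Greedily add stumps, each one fitted to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps, losses = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        preds = [p + learning_rate * (lv if x <= t else rv) for x, p in zip(xs, preds)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, learning_rate, stumps, losses


def predict(model, x):
    """Sum the base value and the shrunken contribution of every stump."""
    base, learning_rate, stumps, _ = model
    return base + sum(learning_rate * (lv if x <= t else rv) for t, lv, rv in stumps)


# Hypothetical data: site elevation and presence (1) or absence (0) of one moth species.
elevation = [400, 600, 800, 1000, 1200, 1400]
presence = [0, 0, 0, 1, 1, 1]
model = boost(elevation, presence)
print(model[3][0], model[3][-1])  # training loss after the first and last round
```

Each round reduces the training loss a little; a real presence/absence model would more likely use a Bernoulli loss and many predictors.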
Validating the GBM
• All available regressors are used in the model, meaning that the choice of independent variables is not supported by theory
• The standard approach to validating models is to split the data into a training and a test data set
• The model is fit on the training data, then used to make predictions on the test data
• This checks that the model generalizes and is not overfit
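The train/test procedure above can be sketched as follows. The split fraction, the seed, the single elevation cutoff used as a stand-in model, and the twelve example sites are all assumptions made for illustration.

```python
import random


def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and hold out a fraction for testing."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(cut, data):
    """Share of sites where 'elevation above cut' matches observed presence."""
    return sum((x > cut) == bool(y) for x, y in data) / len(data)


def fit_threshold(train):
    """Pick the elevation cutoff that classifies the training sites best."""
    return max((cut for cut, _ in train), key=lambda cut: accuracy(cut, train))


# Hypothetical sites: (elevation, presence), with the species present above 900.
records = [(e, int(e > 900)) for e in range(400, 1600, 100)]
train, test = train_test_split(records)
cut = fit_threshold(train)          # fit on the training data only
print(accuracy(cut, train), accuracy(cut, test))  # compare fit with held-out performance
```

A large gap between the two accuracies would suggest the model is overfit to the training sites.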
Running the Model
• Ran the model for individual moth species using all 256 trap sites at HJA, using moth trapping data collected from 2004 to 2008
• Did not include vegetation data, since we only collected it at 20 sites
• The GBM lays a grid over the Andrews forest and calculates the predicted probability of the moth species being present for each grid cell
Visualizing GBM Results
Thermal Climate of the H.J. Andrews Experimental Forest
PRISM estimated mean monthly maximum and minimum temperature maps showing topographic effects of radiation and sky view
factors. Provided by Jonathan W. Smith
• Daily temperatures at climate stations
• Mean monthly temperatures at climate stations
• Mean monthly temperatures at trap sites
• Correlation between climate stations and trap sites
• Daily temperatures at trap sites
• Degree day curve for trap sites
[Figure: Degree Day Accumulation, Trap Site 16B. Cumulative degree days (0 to 7000) by date, 31-Dec through 30-Nov, one curve per year for 2003 to 2007.]
[Figure: Estimated Degree Day Accumulation, Trap Site 16B (0 to 7000) and Trap Site 3K (-500 to 4000), one curve per year for 2003 to 2007.]
Degree Day Curve
• Use a linear regression model to interpolate the degree days for a given trap site on specific days of the year
• Parameterize temperature so that it can later be included in the temporal model
• Produce degree day curves for any trap site
Multi-Linear Regression Analysis
Find Coefficients
Each Trap_ID will have two sets of coefficients
(Maximum and Minimum)
Predicting Daily Temp
Linear Interpolation
•Fill gaps in the daily temperature data
In goes the trap_ID,
start_date and end_date
Out comes the min and
max for the given day(s)
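The gap filling and the degree day curves described above can be sketched together. The slides do not state the degree day formula, so the daily-mean method and the base temperature used here are assumptions, as are the function names and the four-day example.

```python
def fill_gaps(temps):
    """Linearly interpolate None entries that lie between two known values."""
    filled = list(temps)
    known = [i for i, t in enumerate(filled) if t is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            frac = (i - left) / (right - left)
            filled[i] = filled[left] + frac * (filled[right] - filled[left])
    return filled  # leading or trailing gaps are left as None


def degree_day_curve(t_min, t_max, base=0.0):
    """Cumulative degree days, adding max(0, daily mean - base) for each day."""
    total, curve = 0.0, []
    for low, high in zip(t_min, t_max):
        total += max(0.0, (low + high) / 2 - base)
        curve.append(total)
    return curve


# Hypothetical daily readings (deg C) for one trap site, with missing days as None.
daily_max = fill_gaps([10.0, None, None, 16.0])
daily_min = fill_gaps([2.0, None, 6.0, 8.0])
curve = degree_day_curve(daily_min, daily_max, base=5.0)
print(curve)
```

The curve never decreases, since days with a mean below the base temperature add nothing.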
Temporal Distribution of
Moths
The Problem
 Year-round distribution of moths
 Limited observation points
 Unseen, unmeasurable data
 Catching probabilities
 Total moth population
Example: Flight times
 Consider 3 trapping times and 4 associated intervals,
and moths with flight times as follows
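[Figure: timeline I0 | t1 | I1 | t2 | I2 | t3 | I3, with trapping times t1 to t3 separating intervals I0 to I3 and each moth's flight time drawn across the intervals]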
Example: Distribution
This gives us a distribution table:
      I0  I1  I2  I3
I0     1   2   4   1
I1     0   2   3   3
I2     0   0   1   2
I3     0   0   0   1
Example (cont.)
This gives us a distribution table … and flight counts
      I0  I1  I2  I3
I0     1   2   4   1
I1     0   2   3   3
I2     0   0   1   2
I3     0   0   0   1

f1 = 7, f2 = 11, f3 = 6
Example: Flight Counts
• When trapping moths, all we see is flight counts
• Given flight counts, we want to predict moth distribution
f1 = 7, f2 = 11, f3 = 6
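The flight counts in the example follow from the distribution table by counting the moths alive at each trapping time. The sketch below assumes rows are emergence intervals and columns are death intervals, a reading that reproduces the counts on the slide; the function name `flight_counts` is hypothetical.

```python
def flight_counts(table):
    """table[j][k] = moths emerging in interval j and dying in interval k.

    A moth is flying at trapping time t_i if it emerged in an interval
    before t_i (j < i) and dies in an interval after t_i (k >= i).
    """
    n = len(table)
    counts = []
    for i in range(1, n):
        alive = sum(table[j][k] for j in range(i) for k in range(i, n))
        counts.append(alive)
    return counts


# Distribution table from the example (intervals I0 to I3).
table = [
    [1, 2, 4, 1],
    [0, 2, 3, 3],
    [0, 0, 1, 2],
    [0, 0, 0, 1],
]
print(flight_counts(table))  # [7, 11, 6]
```

Going from the table to the counts is easy; the modeling problem is the reverse direction, since many different tables give the same counts.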
Maximum Likelihood Model
• Maximize Prob(Data | Parameters)
• Data = moth trapping
  moths trapped: f = (f1, f2, …, fT)
  times trapped: t = (t1, t2, …, tT)
• Parameters = probability distribution of emergence time and life span
  Emergence and life span are assumed to be Gaussian with parameters µE, σE, µS, σS
  Emergence ~ N(µE, σE)
  Life Span ~ N(µS, σS)
Moth Distribution
 Use distributions to calculate p(j,k), the probability of a
moth emerging in interval j and dying in interval k
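[Figure: timeline showing a moth that emerges at time r in interval Ij (between tj and tj+1) and dies at time d in interval Ik (between tk and tk+1), with life span s]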
Calculating Probabilities
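\[
p(j,k) = \int_{t_j}^{t_{j+1}} p_E(r \mid \mu_E, \sigma_E)
         \left[ \int_{t_k - r}^{t_{k+1} - r} p_S(s \mid \mu_S, \sigma_S)\, ds \right] dr
       = \int_{t_j}^{t_{j+1}} \int_{t_k - r}^{t_{k+1} - r} p_E(r \mid \theta)\, p_S(s \mid \theta)\, ds\, dr
\]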
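One way to evaluate p(j,k) numerically is sketched below, assuming Gaussian emergence time r and life span s as stated earlier. The inner integral over s is taken in closed form with the normal CDF and the outer integral over r uses a midpoint rule; the function names, the step count, and the example parameters are hypothetical. The Gaussian life span is not truncated at zero here.

```python
from math import erf, exp, pi, sqrt


def normal_pdf(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))


def normal_cdf(x, mu, sigma):
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))


def p_jk(t_j, t_j1, t_k, t_k1, mu_e, sigma_e, mu_s, sigma_s, steps=2000):
    """Probability of emerging in [t_j, t_j1] and dying in [t_k, t_k1]."""
    h = (t_j1 - t_j) / steps
    total = 0.0
    for i in range(steps):
        r = t_j + (i + 0.5) * h  # midpoint of the i-th slice of emergence time
        # Death at r + s falls in [t_k, t_k1] when s lies in [t_k - r, t_k1 - r].
        inner = normal_cdf(t_k1 - r, mu_s, sigma_s) - normal_cdf(t_k - r, mu_s, sigma_s)
        total += normal_pdf(r, mu_e, sigma_e) * inner * h
    return total


# Hypothetical trapping times (days) and parameters.
t = [0, 10, 20, 30]
print(p_jk(t[0], t[1], t[1], t[2], mu_e=5, sigma_e=3, mu_s=12, sigma_s=4))
```

Summing p(j,k) over every death interval k recovers the probability of emerging in interval j, which gives a quick sanity check on the integration.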
Probability Table
                          Emergence Interval
                    I0        I1       …       IT
Death       I0    P(0,0)    P(0,1)     …     P(0,T)
Interval    I1    P(1,0)    P(1,1)     …     P(1,T)
            …       …         …        …       …
            IT    P(T,0)    P(T,1)     …     P(T,T)
Multinomial Distribution
 All moths fall into one of the probability squares
 Moths have a multinomial distribution
\[
P(F = f) = \frac{n!}{n_1!\, n_2! \cdots n_k!}\; p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}
\]
 Approximate this with a multivariate Gaussian (or normal)
Approximation Error
• What is the error associated with this approximation?
\[
m! = s(m)\left(1 + O\!\left(\frac{1}{m}\right)\right), \quad \text{approximated as } m! \approx s(m)
\]
• Error of \( O\!\left(\frac{T^2}{N}\right) \)
Likelihood
\[
L = \prod_{i=1}^{100} P(F = f_i \mid \theta_i)
\]
\[
\ln L = l = \sum_{i=1}^{100} \ln P(F = f_i \mid \theta_i)
\]
\[
\theta = \{\mu_E, \sigma_E, \mu_S, \sigma_S\}
\]
Log Loss
[Figure: likelihood surface plotted over µs and µe]
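The multinomial probability and the summed log-likelihood from the preceding slides can be sketched as follows. This is a simplification: it assumes the count in each probability square is observed directly, whereas the trapping data only give flight counts and the slides use a Gaussian approximation. The function names and example numbers are hypothetical.

```python
from math import exp, lgamma, log


def multinomial_log_pmf(counts, probs):
    """ln of n! / (n1! ... nk!) * p1^n1 ... pk^nk, computed with log-gamma."""
    total = lgamma(sum(counts) + 1)  # ln n!
    for c, p in zip(counts, probs):
        if c > 0:
            if p <= 0:
                return float("-inf")  # an impossible outcome was observed
            total += c * log(p)
        total -= lgamma(c + 1)  # ln c!
    return total


def log_likelihood(observations):
    """Sum of ln P(F = f_i | theta_i) over (counts, probs) pairs."""
    return sum(multinomial_log_pmf(c, p) for c, p in observations)


# Hypothetical observations: cell counts and the cell probabilities implied by theta.
observations = [
    ([1, 1], [0.5, 0.5]),
    ([2, 0, 1], [0.5, 0.25, 0.25]),
]
print(log_likelihood(observations))
```

Working with logs keeps the sum numerically stable; evaluating it over a grid of µs and µe values would trace out a likelihood surface.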
Results
[Figure: counted moths and predicted emergence behavior for Semiothisa signaria, Trap 38B, 2005. x-axis: days since May 1st; y-axis: # of moths (0 to 90).]
Results
[Figure: Effects of elevation on moth emergence for Semiothisa signaria. x-axis: elevation (0 to 6000); y-axis: predicted average emergence date (0 to 120). R² = 0.23, p < 0.01.]
Results
[Figure: Effects of air temperature on moth emergence for Semiothisa signaria. x-axis: average monthly max temp for June (°C), 15 to 27; y-axis: predicted average emergence date (0 to 120). R² = 0.21, p < 0.01.]
Synthetic Data
[Figure: |life span error| (0 to 40) versus trapping interval length (0 to 40). Series: abs mu.s, averages.]
Model Limitations:
The “hidden” population and sample size
[Figure: Trap 13B. Mean life span (0 to 40) versus N (0 to 2500). Series: 2007 n=9, 2006 n=28, 2005 n=87.]
Model Limitations:
Sample Size
[Figure: histogram of frequency (0 to 14) by # of moths observed in year (5 to 60). Series: Bad, Good.]
Estimating “Hidden” Moth Population
[Figure: estimate of N (0 to 700) versus observed count for year (0 to 300). Series: µ.s=10.5, µ.s=7, µ.s=14. Fit: y = 1.6011x + 0.0292, R² = 0.94.]
How is vegetation related to moth species distribution and composition?
• CCA and Hamming distance show a strong correlation between vegetation and moth species
• For the future: vegetation surveys at other trap sites would help improve the performance of the model
How does climate affect moth phenology?
• Moth emergence shows a strong correlation with the local temperature
• For the future: incorporating the degree day curves we calculated for each site should make the model more robust
Questions?