Erin Childs (Pomona College), Andrew Calderon (Heritage University), Evan Goldman (Bard College, Boston University), Molly O’Neill (Lehigh University), Clay Showalter (Evergreen University), with the help of Olivia Poblacion (Oregon State University)

Acknowledgements
• Dr. Dietterich, CS Professor
• Dr. Wong, CS Professor
• Steven Highland, Geosciences PhD Candidate
• Jorge Ramirez, Math Professor
• Dan Sheldon, CS Post-doc
• Julia Jones, Geosciences Professor
• Rebecca Hutchinson, CS Post-doc
• Javier Illan, PhD, Post-doc

Studying Climate Change: Lepidoptera
• Why are Lepidoptera a good indicator of climate change?
• Past studies on Lepidoptera
• Woiwod 1996: Detecting the effects of climate change on Lepidoptera
• Dewar and Watt 1992: Predicted changes in the synchrony of larval emergence and budburst under climatic warming

Research Questions
1) How is vegetation related to moth species distribution and composition?
2) How does climate affect moth phenology?

Study Site
H.J. Andrews Experimental Forest
http://andrewsforest.oregonstate.edu/about.cfm?topnav=2

Vegetation Surveying: Methods
• GPS coordinates
• Walked out 30 m and 100 m radii in all directions
• Presence/absence of 71 species of known host plants

Moth Trapping: Methods
• Moth trapping
• 9 sites selected
• Equipment used
• Moth preservation

Methods
• Moth identification

Moth Trapping Results
• Semiothisa signaria
• Pero occidentalis

Overview: Is vegetation a good predictor of moth species presence/absence?
• Develop software tools for exploring/analyzing data
• Run generalized boosted regression models (GBMs) for each moth species
• Create GIS layers for the predicted locations of each moth species

Software Tasks for Data Exploration
• Format data
• Compare the similarities and differences between sites, moths and vegetation
• Discover correlations between vegetation and moth species
• Calculate marginal probabilities of plant occurrences
• Visualize results

Measuring Similarity: Hamming Distance
• Hamming distance is the number of covariates that differ between two sample sets
• A smaller number means the sets are more similar

Marginal Probabilities
• Using the vegetation data collected at 20 sites, generate marginal probabilities for plant occurrences
• If huckleberry (VAHU) is found at a site, what is the probability of finding thimbleberry (RUPA) but not licorice root (!LIGR) at that site?

Canonical Correlation Analysis (CCA)
• Canonical correlation analysis aims to highlight correlations between two data sets
• Gives us a way of making sense of cross-covariance matrices
• Allows ecologists to relate the abundance of species to environmental variables
• Using CCA we analyzed our vegetation data and moth data
• X-correlation: highlights correlations among moth species only (422x422)
• Y-correlation: highlights correlations among plant species only (71x71)
• Cross-correlation: highlights correlations between the two data sets (71x422)

Generalized Boosted Regression Models (GBMs)
• Regression analysis allows us to explore the relationships between individual moth presence/absence (dependent variable) and various characteristics of each site (independent variables)
• The goal is to minimize the loss function, which represents the loss associated with an estimate differing from the true value
• A basis function is an element of a set of vectors that, in linear combination, can represent every vector in a given vector space
• Every function in that space can be represented as a linear combination of basis functions
• Boosting is the process of iteratively adding basis functions in a greedy fashion so that each additional basis function further reduces the selected loss function
• The model is run several times with different values for the tuning parameters to determine the best values

Validating the GBM
• All available regressors are used in the model, meaning that the choice of independent variables is not supported by theory
• The standard approach to validating models is to split the data into a training and a test data set
• The model is fit on the training data, then used to make predictions on the test data
• This checks that the model generalizes and is not overfit

Running the Model
• Ran the model for individual moth species using all 256 trap sites at HJA, using moth trapping data collected from 2004 to 2008
• Did not include vegetation data, since we only collected it at 20 sites
• The GBM lays a grid over the Andrews Forest and calculates the predicted probability of the moth species being present for each grid cell

Visualizing GBM Results

Thermal Climate of the H.J. Andrews Experimental Forest
PRISM estimated mean monthly maximum and minimum temperature maps showing topographic effects of radiation and sky view factors. Provided by Jonathan W.
Smith

Daily temperatures at climate stations

Mean monthly temperatures at climate stations

Mean monthly temperatures at trap sites

Correlation between climate stations and trap sites

Daily temperatures at trap sites

Degree day curve for trap sites
[Figure: Degree Day Accumulation, Trap Site 16B. Accumulated degree days (0 to 7000) by date (31-Dec to 30-Nov), one curve per year, 2003 to 2007]
[Figure: Estimated Degree Day Accumulation, Trap Site 16B. One curve per year, 2003 to 2007]
[Figure: Estimated Degree Day Accumulation, Trap Site 3K. One curve per year, 2003 to 2007]

Degree Day Curve
• Use a linear regression model to interpolate the temperature at a given trap site for specific days of the year
• Parameterize temperature so that it can later be included in the temporal model
• Produce degree day curves for any trap site

Multi-Linear Regression Analysis
• Find coefficients
• Each Trap_ID will have two sets of coefficients (maximum and minimum)
• Predicting daily temperature

Linear Interpolation
• Fill gaps in the daily temperature data
• In goes the trap_ID, start_date and end_date
• Out comes the min and max temperature for the given day(s)

Temporal Distribution of Moths

The Problem
• Year-round distribution of moths
• Limited observation points
• Unseen, unmeasurable data
• Catching probabilities
• Total moth population

Example: Flight times
Consider 3 trapping times and 4 associated intervals, and moths with flight times as follows
[Diagram: moth flight times drawn over the timeline I0, t1, I1, t2, I2, t3, I3]

Example: Distribution
This gives us a distribution table (rows: interval of emergence; columns: interval of death):

      I0   I1   I2   I3
I0     1    2    4    1
I1     0    2    3    3
I2     0    0    1    2
I3     0    0    0    1

Example, continued
This gives us a distribution table … and flight counts:

f1 = 7
f2 = 11
f3 = 6

These counts match summing the table cells for moths that emerged before trap time ti and died after it: f1 = 2 + 4 + 1 = 7, f2 = (4 + 1) + (3 + 3) = 11, f3 = 1 + 3 + 2 = 6.

Example: Flight Counts
• When trapping moths, all we see is flight counts
• Given flight counts, we want to predict moth distribution
• f1 = 7, f2 = 11, f3 = 6

Maximum Likelihood Model
• Maximize Prob(Data | Parameters)
• Data = moth trapping
• Moths trapped: f = (f1, f2, … fT)
• Times trapped: t = (t1, t2, … tT)
• Parameters = probability distribution of emergence time and life span
• Emergence and life span assumed to be Gaussian with parameters µE, σE, µS, σS
• Emergence ~ N(µE, σE)
• Life Span ~ N(µS, σS)

Moth Distribution
Use distributions to calculate p(j,k), the probability of a moth emerging in interval j and dying in interval k
[Diagram: a moth emerges at time r in interval Ij (between tj and tj+1), lives for life span s, and dies at time d in interval Ik (between tk and tk+1)]

Calculating Probabilities
\[ p(j,k) = \int_{r=t_j}^{t_{j+1}} \int_{s=t_k-r}^{t_{k+1}-r} p_E(r \mid \theta)\, p_S(s \mid \theta)\, ds\, dr \]
where p_E(r | θ) = P(r | µE, σE) and p_S(s | θ) = P(s | µS, σS)

Probability Table
P(j,k) for emergence interval j and death interval k, each ranging over I0 … IT:

P(0,0)  P(0,1)  …  P(0,T)
P(1,0)  P(1,1)  …  P(1,T)
…
P(T,0)  P(T,1)  …  P(T,T)

Multinomial Distribution
• All moths fall into one of the probability squares
• Moths have a multinomial distribution
\[ P(F = f) = \frac{n!}{n_1!\, n_2! \cdots n_k!}\; p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k} \]
• Approximate this with a multivariate Gaussian (or normal)

Approximation Error
What is the error associated with this approximation?
\[ m! = s(m)\left(1 + O\left(\frac{1}{m}\right)\right) \quad \text{approximated as} \quad m! = s(m) \]
Error of \( O\left(\frac{T^2}{N}\right) \)

Likelihood
\[ L = \prod_{i=1}^{100} P(F = f_i \mid \theta_i) \]
\[ \ln L = l = \sum_{i=1}^{100} \ln P(F = f_i \mid \theta_i) \]
\[ \theta = \{\mu_E, \sigma_E, \mu_S, \sigma_S\} \]

Log Loss
[Figure: likelihood surface over µs and µe]

Results
[Figure: counted moths and predicted emergence behavior; y-axis: # of moths; x-axis: days since May 1st]
• Semiothisa signaria
• Trap 38B
• 2005

Results: Effects of elevation on moth emergence
[Figure: predicted average emergence date vs. elevation, Semiothisa signaria; R² = 0.23, p < 0.01]

Results: Effects of air temperature on moth emergence
[Figure: predicted average emergence date vs. average monthly max temp for June (°C), Semiothisa signaria; R² = 0.21, p < 0.01]

Synthetic Data
[Figure: |life span error| vs. trapping interval length; series: abs mu.s, averages]

Model Limitations: The “hidden” population and sample size
[Figure: Trap 13B, mean life span vs. N; series: 2005 (n=87), 2006 (n=28), 2007 (n=9)]

Model Limitations: Sample Size
[Figure: frequency vs. # of moths observed in year; series: Bad, Good]

Estimating “Hidden” Moth Population
[Figure: estimate of N vs. observed count for year; series: µ.s = 7, µ.s = 10.5, µ.s = 14; fit y = 1.6011x + 0.0292, R² = 0.94]

How is vegetation related to moth species distribution and composition?
• CCA and Hamming distance show a strong correlation between vegetation and moth species
• For the future: vegetation surveys at other trap sites would help improve the performance of the model

How does climate affect moth phenology?
• Moth emergence shows a significant correlation with local temperature (R² = 0.21, p < 0.01)
• For the future: incorporating the degree day curves we calculated for each site should make the model more robust

Questions?
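
Backup: Hamming distance sketch
A minimal Python illustration of the similarity measure from the Hamming Distance slide. The function name and the three site vectors are made up for illustration; they are not survey data, and this is not necessarily how the project's tools were written.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length presence/absence vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical presence (1) / absence (0) of five plant species at three sites.
site_a = [1, 0, 1, 1, 0]
site_b = [1, 0, 0, 1, 0]
site_c = [0, 1, 0, 0, 1]

print(hamming_distance(site_a, site_b))  # 1: the sites differ in one species
print(hamming_distance(site_a, site_c))  # 5: the sites differ in every species
```

Under this measure, site_a is far more similar to site_b than to site_c.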
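
Backup: conditional plant probability sketch
A small Python sketch of the question on the Marginal Probabilities slide: given huckleberry (VAHU) at a site, how often is thimbleberry (RUPA) present and licorice root (LIGR) absent? The five site records and the function name are hypothetical; only the species codes come from the slides.

```python
def conditional_probability(sites, given, present, absent):
    """Estimate P(`present` found and `absent` not found | `given` found).

    `sites` is a list of dicts mapping species code -> 1 (present) or 0 (absent).
    Returns None when no site contains the `given` species.
    """
    matching = [s for s in sites if s[given]]
    if not matching:
        return None
    hits = sum(1 for s in matching if s[present] and not s[absent])
    return hits / len(matching)


# Hypothetical survey records, for illustration only.
sites = [
    {"VAHU": 1, "RUPA": 1, "LIGR": 0},
    {"VAHU": 1, "RUPA": 1, "LIGR": 1},
    {"VAHU": 1, "RUPA": 0, "LIGR": 0},
    {"VAHU": 0, "RUPA": 1, "LIGR": 0},
    {"VAHU": 1, "RUPA": 1, "LIGR": 0},
]

p = conditional_probability(sites, "VAHU", "RUPA", "LIGR")
print(p)  # 0.5: two of the four VAHU sites have RUPA without LIGR
```

With only 20 surveyed sites, estimates like this rest on very few matching records, so they should be read as rough frequencies.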
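
Backup: flight counts from the distribution table
A short Python check of the worked example, assuming a moth is counted at trap time ti when it emerged in an interval before ti and died in an interval after it. The function name is illustrative; the table is the one from the Example slides, and the result reproduces the flight counts shown there.

```python
def flight_counts(table):
    """Compute flight counts from an emergence-by-death table.

    table[j][k] is the number of moths that emerged in interval j and died
    in interval k. Trap time t_i separates interval i-1 from interval i, so
    a moth is flying at t_i if it emerged in some j < i and died in some k >= i.
    """
    n = len(table)
    counts = []
    for i in range(1, n):
        counts.append(sum(table[j][k] for j in range(i) for k in range(i, n)))
    return counts


# Distribution table from the example (rows: emergence I0..I3, columns: death I0..I3).
table = [
    [1, 2, 4, 1],
    [0, 2, 3, 3],
    [0, 0, 1, 2],
    [0, 0, 0, 1],
]

print(flight_counts(table))  # [7, 11, 6], i.e. f1, f2, f3
```

The mapping is many-to-one: different tables can give the same flight counts, which is why the distribution has to be inferred through a likelihood model.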
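
Backup: numerical sketch of p(j,k)
A rough Python sketch of the probability that a moth emerges in one interval and dies in another, assuming Gaussian emergence time and life span as on the Maximum Likelihood Model slides. The function names, the midpoint-rule integration, and the parameter values in the usage lines are all assumptions for illustration, not the project's implementation. The inner integral over life span is evaluated with the normal CDF.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mu, sigma):
    """Cumulative distribution of N(mu, sigma) at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def p_jk(t_j, t_j1, t_k, t_k1, mu_e, sigma_e, mu_s, sigma_s, steps=2000):
    """Midpoint-rule estimate of P(emerge in [t_j, t_j1] and die in [t_k, t_k1])."""
    h = (t_j1 - t_j) / steps
    total = 0.0
    for i in range(steps):
        r = t_j + (i + 0.5) * h  # emergence time at the midpoint of this slice
        # Life span s must fall in [t_k - r, t_k1 - r] for death to land in the interval.
        inner = normal_cdf(t_k1 - r, mu_s, sigma_s) - normal_cdf(t_k - r, mu_s, sigma_s)
        total += normal_pdf(r, mu_e, sigma_e) * inner * h
    return total


# Hypothetical values: emergence ~ N(30, 5) days, life span ~ N(10, 3) days.
print(p_jk(25, 35, 35, 45, mu_e=30, sigma_e=5, mu_s=10, sigma_s=3))
```

A Gaussian life span puts a small amount of probability on negative values, so a fuller treatment would truncate it at zero.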