October 20th meeting - Scenarios Network for Alaska + Arctic Planning

Alaska Biome Shifts
Strategy and Planning Meeting
October 20th 2010
USFWS office, Anchorage AK
In attendance:
Karen Murphy, USFWS
Evie Whitten, TNC
Nancy Fresco, SNAP, UAF
Michael Lindgren, EWHALE, UAF
Joel Reynolds, USFWS
Joel Clement, Wilburforce
Via teleconference:
Falk Huettmann
Synopsis
The group went over the results of modeling efforts thus far, including difficulties
imposed by the requirements of the project, the limitations of the software, and the
limitations of the hardware. We discussed potential fixes and compromises for these
difficulties.
Overview of modeling efforts to date
Michael presented the latest clustering efforts, which now include all of Alaska, with a
range of 4 to 10 clusters. He also created projections in RandomForest based on
these clusters. From a subjective perspective, these results look good. The clusters
appear logical, based on our “expert knowledge” of the landscape. At the lower cluster
numbers, the landscape is separated into arctic, boreal, western coastal, southeastern and
Aleutian regions; higher cluster numbers create more diverse categories that appear to
account for mountain ranges and other known features. There appears to be reasonable
congruence with some of our land cover maps, and with the biome classifications used in
Phase I of the project, although this has yet to be analyzed
mathematically. However, Canada has not yet been included in clustering efforts, and
there are concerns about resolution. Although projections were made at 2 km, the
clustering was done at a resolution of 10 km (essentially 5 × 5 blocks of 2 km pixels, 25 pixels per square).
Michael did 25 repeats to try to overcome this problem. Even this was a vast
improvement over previous efforts, which resulted in continual computer crashes.
Michael is now getting help from Dustin Rice (SNAP IT person) and is using a
combination of SNAP and EWHALE computing power in order to maximize RAM,
pagefile capacity, processing speed and capacity, and file storage space.
Issues of scale and resolution
A lengthy discussion ensued regarding resolution and modeling limitations. Why can't
we create clusters at 2 km?
Karen expressed concern regarding the utility of the eventual outputs of the project. She
pointed out that our goal is to create projections that are useful to land managers,
particularly from the perspective of managing protected areas and refuges. These
managers may need to decide whether existing protected areas are adequate, and provide
landscape connectivity. Even 5 km grids are rather broad for this kind of assessment, and
create "smeared" edge effects. At this scale, cluster boundaries span landscapes far larger
than individual conservation units, and may be poor indicators of what is or is not within
a refuge. A great deal of detail is lost.
Michael, Joel, and Falk together helped explain the reasons for the limitations, and the
group discussed work-around options. The problem is not in the projections using
RandomForest, but in the creation of clusters. This is because of the way clusters are
defined and created.
Basics of clustering (see also ppt)
• Cluster analysis is the assignment of a set of observations into subsets (called
clusters) so that observations in the same cluster are similar in some sense.
• The choice of which clusters to merge or split is determined by a linkage
criterion, which is a function of the pairwise distances between observations.
• We are clustering using the PAM (partitioning around medoids) algorithm, which
operates on the dissimilarity matrix of the given data set. The dissimilarity matrix
(also called distance matrix) describes pairwise distinction between objects.
• The PAM algorithm first computes representative objects, called medoids (the number
of medoids depends on the number of clusters). A medoid can be defined as the
object of a cluster whose average dissimilarity to all the objects in the cluster is
minimal. After finding the set of medoids, each object of the data set is assigned
to the nearest medoid.
• PAM is a robust algorithm, because it minimizes a sum of dissimilarities, and is
thus not strongly skewed by outliers. The medoid for each cluster will be more
like a median than a mean (thus the name).
• Thus, in order to cluster using PAM, the algorithm must compare every data point
(grid cell) with every other one. The number of pairwise comparisons grows with the
square of the number of grid cells: going from 10 km to 2 km multiplies the number of
cells by 25 and the number of comparisons by roughly 625. At 2 km this is a VAST
undertaking. Even with all of Michael, Falk, and Dustin's efforts, it may
simply be impossible (see the sketch below).
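To make the cost concrete, here is a minimal k-medoids sketch in Python, in the spirit of PAM but using a simplified alternating update rather than the full BUILD/SWAP procedure of the real algorithm. The synthetic data, Euclidean distance, and all names are illustrative stand-ins for the project's 24 climate variables per grid cell and its actual dissimilarity measure. The quadratic cost lives in the construction of the dissimilarity matrix D.

```python
# Minimal k-medoids sketch (simplified alternating variant, not full PAM).
# Synthetic data and Euclidean distance are stand-ins for the project's
# 24 climate variables and its actual dissimilarity measure.
import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    n = len(X)
    # The O(n^2) step: every grid cell compared with every other cell.
    # Going from 10 km to 2 km multiplies n by 25 and this matrix by ~625.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    rng = np.random.default_rng(seed)
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)  # assign to nearest medoid
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            # The medoid is the member with minimal summed dissimilarity to
            # its cluster, i.e. median-like and hence robust to outliers.
            new_medoids[c] = members[
                np.argmin(D[np.ix_(members, members)].sum(axis=1))
            ]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, np.argmin(D[:, medoids], axis=1)

# Toy run: 500 "cells" with 24 variables, 8 clusters.
X = np.random.default_rng(1).normal(size=(500, 24))
medoids, labels = k_medoids(X, k=8)
```

Even just holding D in memory scales with the square of the cell count, which is consistent with the computer crashes described above.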
Scale of Canadian data
Nancy reminded everyone that for much of Canada (all except for the Yukon and BC)
PRISM data is unavailable, meaning that the best resolution for SNAP data is 10 minutes
lat/long. There was some confusion about what this translates to in km. Nancy checked
later and found that it is approximately an 18.4 km by 18.4 km grid (a rough conversion
is sketched below). The only other
option for historical data for Canada would be to use the dataset we used last time, based
on mean values for each ecoregion, and artificially imposing a grid on this data.
However, the SNAP data downscaled with CRU is the only dataset available for future
projections. The historical CRU data are the obvious choice for cluster modeling, since
they are on the same grid as these future projections, and will provide better regularity,
wider choices for baseline years, and in many cases better resolution than the ecoregion data.
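As a back-of-envelope check of that conversion (assuming a spherical Earth at roughly 111.32 km per degree of latitude; the 18.4 km figure presumably reflects the dataset's own projection):

```python
# Rough conversion of a 10-minute grid to km, assuming a spherical Earth.
import math

KM_PER_DEG_LAT = 111.32
ns_km = 10 / 60 * KM_PER_DEG_LAT          # north-south spacing, ~18.6 km
print(f"N-S spacing: {ns_km:.1f} km")
# East-west spacing shrinks with the cosine of latitude:
for lat in (55, 60, 65, 70):
    ew_km = ns_km * math.cos(math.radians(lat))
    print(f"E-W spacing at {lat}N: {ew_km:.1f} km")
```

Note that on a lat/long grid the east-west dimension shrinks toward the pole, so at Alaskan latitudes the cells are considerably narrower than their nominal size.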
Testing multiple methods to accommodate scale issues
Perhaps clustering at a coarser resolution than 2 km will be perfectly adequate. If this
is the case, it will save a lot of time and trouble, and lessen concerns over trans-boundary
scale incongruity. It will also provide feedback for the Canadian project, which will use
much coarser resolution (18 km) by default.
The group agreed that we should test various methods:
1) Run all the 2 km pixels, on a smaller area of the state
2) Run a random subset, choosing one 2 km grid cell from each 10 km by 10 km
square
3) Run a regular subset, e.g. the southeast pixel from each 10 by 10 square
4) Average all the 2 km pixels in a 10 km by 10 km grid to create a coarser grid.
We agreed that Michael would do a test run using all four approaches within a relatively
small area in the southeastern portion of the "main" part of the state
(Tok/Valdez/Cordova/McCarthy area). For the sake of simplicity, each of these test runs
would use an 8-cluster model, and every run would use the same area. For methods #2
and #3, multiple runs would be done – ideally 25 runs each, such that the total area
sampled would add up to the complete area in question. It was also suggested that
Michael run a test at 50 km resolution. However, given the resolution of the Canadian
data, it might make more sense to run a test at 18 km (or 10 minutes). (The four schemes
are sketched in code below.)
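A sketch of the four schemes, assuming a hypothetical 2 km raster held as a (rows, cols, variables) numpy array in which each 10 km block is a 5 × 5 window of pixels; the array name, sizes, and orientation (which corner counts as "southeast") are illustrative only.

```python
# Four subsampling/aggregation schemes on a toy 2 km raster.
import numpy as np

rng = np.random.default_rng(0)
grid = rng.normal(size=(100, 100, 24))  # toy 200 km x 200 km area, 24 variables
B = 5                                   # one 10 km block = 5 x 5 pixels of 2 km

# 1) All 2 km pixels within a smaller test area.
subarea = grid[:50, :50].reshape(-1, 24)

# 2) Random subset: one 2 km pixel drawn from each 10 km block.
#    (Repeated runs with fresh draws approximate full coverage.)
blocks = grid.reshape(20, B, 20, B, 24)
r = rng.integers(0, B, size=(20, 20))
c = rng.integers(0, B, size=(20, 20))
i, j = np.meshgrid(np.arange(20), np.arange(20), indexing="ij")
random_subset = blocks[i, r, j, c].reshape(-1, 24)

# 3) Regular subset: the same corner pixel (e.g. "southeast") of every block.
regular_subset = blocks[:, B - 1, :, B - 1].reshape(-1, 24)

# 4) Aggregation: average the 25 pixels in each block to a 10 km grid.
coarse = blocks.mean(axis=(1, 3)).reshape(-1, 24)
```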
Analyzing clusters
There was some discussion of how to compare or combine these results, since each
clustering attempt is individual, and results cannot really be averaged. However, it was
agreed that results from multiple runs and multiple methods should be similar enough
that it should be easy to see which clusters are analogous, meaning that they can then be
matched up mathematically according to % same vs % different. Results can also be
analyzed subjectively, simply by looking at the resulting maps to see if they appear
similar or different.
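One way to do that matching, sketched below under the assumption that runs are compared pairwise: build a contingency table of shared grid cells, find the relabeling that maximizes overlap (here via scipy's Hungarian solver), and report the percent of cells assigned the "same" cluster. This is an illustration, not a method the group settled on.

```python
# Match cluster numbers between two runs and report % agreement.
import numpy as np
from scipy.optimize import linear_sum_assignment

def agreement(labels_a, labels_b, k=8):
    # counts[i, j] = cells in cluster i of run A and cluster j of run B
    counts = np.zeros((k, k), dtype=int)
    np.add.at(counts, (labels_a, labels_b), 1)
    # Relabeling of run B that maximizes overlap with run A.
    row, col = linear_sum_assignment(-counts)
    pct_same = counts[row, col].sum() / len(labels_a) * 100
    return pct_same, dict(zip(col, row))

# Toy example: the same partition of 10,000 cells, labels shuffled.
rng = np.random.default_rng(2)
a = rng.integers(0, 8, size=10_000)
b = rng.permutation(8)[a]
pct, mapping = agreement(a, b)
print(f"{pct:.1f}% of cells match after relabeling")  # 100.0%
```

The same contingency table also gives a per-cluster % different, which may help flag which clusters are unstable across runs or methods.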
We discussed various ways to look at clusters. In the last project iteration, we created
scatter plots to compare several biomes in the context of two variables. We also created
box plots for each variable and each biome. With 24 variables instead of 4, it will be
much harder to “see” the comparisons between clusters. However, we agreed that
Michael should create box plots for every variable and every cluster (numbering clusters
rather than naming them) for his 8-cluster pilot runs. He should make sure that the
clusters are ordered the same for every box plot, so that a quick comparison will show
which of the 24 variables are causing certain clusters to be considered “different” from
others.
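A sketch of that box-plot layout with placeholder data: one figure per variable, with the eight clusters numbered and held in the same order on every plot, so the 24 figures can be scanned side by side. File names and array shapes are illustrative.

```python
# Box plots per variable with clusters in a fixed order (placeholder data).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
n_cells, n_vars, k = 5000, 24, 8
data = rng.normal(size=(n_cells, n_vars))   # stand-in climate variables
labels = rng.integers(0, k, size=n_cells)   # stand-in cluster assignments

for v in range(n_vars):
    groups = [data[labels == c, v] for c in range(k)]  # fixed cluster order
    fig, ax = plt.subplots()
    ax.boxplot(groups)
    ax.set_xticklabels([f"cluster {c}" for c in range(k)])
    ax.set_title(f"variable {v}")
    fig.savefig(f"boxplot_var{v:02d}.png")
    plt.close(fig)
```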
Baseline years
We agreed that because 1971-2000 are the most recent three decades available, and
because working with complete decades aids in simplifying data and matching with other
studies, these years could be used as a baseline for climate data, for the purposes of
clustering. There was some discussion of whether these years were unusually warm,
compared to years prior to 1970, but we decided that since climate is a moving target, it
would be hard to find any period that would be unassailable.
Scope: How much of Canada to include?
There was a discussion of whether we should try to do the AK modeling and Canadian
modeling all as one linked effort. It was agreed that this makes the most sense; however,
there is also concern that we don't want to sacrifice resolution for AK if programming
constraints would limit us, and we don't want to confuse the model with ecosystems
that are very unlikely to move up into Alaska, such as Atlantic or Hudson Bay
ecosystems. Another concern was the Arctic Shield, where granitic bedrock drives the
vegetation as much as – or more than – climate.
Either way, we will need to find “break points” that are defensible, since even the
Canadian model doesn’t need far eastern ecosystems. SNAP only has projection data for
these areas:
After the meeting, Nancy and Karen looked at Ecozone and Ecoregion maps for Canada
(see below) and decided that it would be logical to exclude the Hudson Plains (driven by
proximity of the Bay), as well as the eastern halves of the Taiga Shield and Southern
Arctic; the Northern Arctic; the Arctic Cordillera; the Atlantic Maritime; and the
Mixedwood Plains. We thought we should include part of the Boreal Shield, with the
break point determined along ecoregion lines, including regions 87, 88, 89, and possibly
90, 91, and 95, but excluding the remainder, due to lake effects or simply being too far
east. We might also come up with a logical break point based on a literature review.
[maps: Canadian Ecozone and Ecoregion boundaries]
Next steps and homework
We agreed that we need to hold a December meeting during the week of the 13th. This
meeting would be a full day for the core team, with others joining for as long as possible
– perhaps only a half day. The morning of the 16th is not good for Karen. This meeting
will include the Canada team via teleconference (or in person if possible). At this
meeting we will go over all the clustering methods and decide on the steps forward,
including ensemble approaches, different models, and different emission levels. Nancy will
schedule this meeting, with feedback from all potential participants.
Suggested participants: Michael, Falk, Karen, Joel, Evie, Dave Verbyla, Tom Paragi,
Dave Douglas, Wendy Loya, Philip Martin (or another Arctic LCC representative, perhaps
Jennifer Jenkins?), Jennifer Barnes (née Allen)?, Dustin Rice (SNAP IT), Canadian reps (everyone
on Evie’s email from last week), plus Troy Hegel (Whitehorse – a former student of
Falk’s).
Nancy will work on a ppt for next Tuesday morning’s phone meeting with the Canadian
group, and will get a draft of these notes to Evie by COB Friday. She will call Evie
Monday at 10:00.
We will have a group teleconference on Wednesday the 27th at 10:00, and Nancy will send a reminder
to the group, including the call-in number.
Nancy will talk to Dustin at SNAP about getting Michael a space to work every Friday at
the SNAP office – this has already been done. Nancy will try to find out more about Rich
Boone’s work on the Saskatoon system ending up in AK – based on soil science? Nancy
will email the news links to everyone from press coverage for phase I of the project.