Vegetation Modeling
1
Outline
Model types
Predictive models
Predictor data
Predictive model types (parametric/nonparametric)
Model Example (Tree-based/Random Forests)
Modeling dataset
• Response/Predictor data
• Discussion – scale of predictors
• Predictor data extraction
Data exploration
• Summary statistics/NA values
• Attaching data frame
• Predictor variables
• Response variables (binary, continuous)
Model generation
• Tree-based models
• Random Forests (classification/regression trees)
• Variable importance/proximity
Model prediction
• Polygon data
• Predictor data (clipping, stacking)
• Apply model and display map
Modelmap
• Build model
• Model diagnostics
• Make map
2
Vegetation Modeling
## In general, there are 3 reasons we model vegetation:
# 1. Explanatory: to explain why something is happening… find a pattern or a cause.
# 2. Descriptive: to see an association between variables, not aimed at prediction.
# 3. Predictive: to predict an occurrence based on known attributes.
## Examples:
Is the height of a tree related to its diameter?
Did the last 20 years of drought affect tree mortality rates?
Are forest disturbances changing the carbon cycle?
Can we predict the distribution of vegetation 100 years from now based on climate models?
Can we predict the current distribution of vegetation across the landscape from the spectral
signature of remotely-sensed data?
## An article on different reasons for modeling.
Shmueli, G. 2010. To Explain or to Predict? Statistical Science 25(3):289-310.
3
Predictive Vegetation Models
## Predicting the distribution of vegetation across the landscape from available maps of
environmental variables, such as geology, topography, and climate and spectral data
from remotely-sensed imagery or products.
## Statistical models are built by finding relationships between vegetation data and the
available digital data and then applied across given digital landscapes.
Integrate forest inventory data..
..with available digital data:
• Satellite imagery
• DEMs
• Soils
..using flexible statistical models and GIS tools..
..and generate maps of forest attributes:
y = f(x1 + x2 + … + xn)
4
Predictor Data
## Digital environmental data and remotely-sensed products are becoming increasingly
more available at many different scales.
Remotely-sensed data and derived products
• Snapshot of what is on the ground
• Caution: represents current vegetation distributions based on reflectance patterns, but
does not explain potential occurrences of vegetation.
Topography variables (elevation, aspect, slope)
• No direct physiological influence on vegetation
• Caution: these variables are local to the model domain; be careful when extrapolating
over space or time.
Climate variables (temperature, precipitation)
• Directly related to physiological responses of vegetation
• Surrogates for topographical variables
• Caution: these variables are better for extrapolating over space and time, but recognize
other limitations: current location, dispersal range, and species competition may also
change through time.
Other surrogates (geology, soils, soil moisture availability, solar radiation)
• Resources directly used by plants for growth
• Caution: similar to climate variables
5
Predictive Model Types
Model Objective:
Find a relationship between vegetation data and predictor data for prediction purposes.
The model, in its simplest form, looks like the following:
y ~ f (x1 + x2 + x3 + …) + ε,
where y is a response variable, the x's are predictor variables, and ε is the associated error.
There are many different types of models
Parametric models – make assumptions about the underlying distribution of the data.
• Maximum likelihood classification
• Discriminant analysis
• General linear models (ex. linear regression)
• ...
Nonparametric models – make no assumptions about the underlying data distributions.
• Generalized additive models
• Machine-learning models
• Artificial neural networks
• ...
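A minimal sketch of the distinction, using simulated data (not the workshop dataset): a
parametric model assumes a functional form, while a nonparametric smoother lets the data
determine the shape.
# Simulated nonlinear data
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd=0.3)
plot(x, y)
abline(lm(y ~ x), col="red")             # parametric: assumes a straight line
lines(loess.smooth(x, y), col="blue")    # nonparametric: shape learned from the data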
6
Predictive Model Types
Parametric Models
Parametric models – make assumptions about the underlying distribution of the data
Assumptions:
• The shape of the distribution of the underlying population is bell-shaped (normal).
• Errors are independent.
• Errors are normally distributed with mean of 0 and a constant variance.
Advantages:
• Easy to interpret
• High power with low sample sizes
Disadvantages:
• If sample data are not from a normally-distributed population, a parametric model may
lead to incorrect conclusions.
Examples:
• Maximum likelihood classification
• Discriminant Analysis
• Linear Regression
• Multiple linear regression
• Generalized linear models (parametric/nonparametric)
7
Predictive Model Types
Parametric Models – Regression Example
Previous example: Using regression to fill in missing values.
## Import data and subset for sp19 only
path <- "C:/Peru/Rworkshop" # Where workshop material is
setwd(path)
tree <- read.csv("PlotData/tree.csv", stringsAsFactors=FALSE)
sp19 <- tree[( tree$SPCD==19 & !is.na(tree$DIA) ),]
# Start off with a scatter plot
par(mfrow=c(1,2))
plot(sp19$DIA, sp19$HT, xlab ="Diameter", ylab ="Height",
main = "Subalpine Fir")
abline(lm(sp19$HT ~ sp19$DIA))
# We saw some heteroscedasticity (unequal variance), and transformed data to log scale.
plot(log(sp19$DIA), log(sp19$HT), xlab ="Diameter", ylab ="Height",
main = "Subalpine Fir, log scale")
abline(lm(log(sp19$HT) ~ log(sp19$DIA)))
# We looked at summary of models and saw lower residual error and higher R2 values.
r.mod <- lm(HT~DIA,data=sp19)
summary(r.mod)
r.mod.log.ht.dia <- lm(log(HT)~log(DIA),data=sp19)
summary(r.mod.log.ht.dia)
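A side note (not on the original slide): predictions from the log-log model are on the log
scale. A simple back-transform uses exp(); be aware this estimates the median rather than
the mean height, so it slightly underestimates on average. The diameters below are
arbitrary example values.
ht.log <- predict(r.mod.log.ht.dia, newdata=data.frame(DIA=c(5, 10, 20)))
exp(ht.log)   # predicted heights back on the original scale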
8
Predictive Model Types
Parametric Models – Regression Example
Previous example cont..
# Then we looked at the residuals versus the fitted values and normal QQ-plots.
par(mfrow=c(2,2))
plot(r.mod$fitted, r.mod$residuals, xlab="Fitted", ylab="Residuals",
main ="Fitted versus Residuals")
abline(h=0)
qqnorm(r.mod$residuals, main="Normal Q-Q Plot")
qqline(r.mod$residuals)
plot(r.mod.log.ht.dia$fitted,r.mod.log.ht.dia$residuals,
xlab="Fitted",ylab="Residuals",
main ="Fitted versus Residuals, log scale")
abline(h=0)
qqnorm(r.mod.log.ht.dia$residuals, main="Normal Q-Q Plot, log scale")
qqline(r.mod.log.ht.dia$residuals)
# NOTE:
# Using transformations, we were able to fit a parametric model to data that were
# nonlinear and had an unequal variance structure.
9
Predictive Model Types
Nonparametric Models
Nonparametric models – make no assumptions about the underlying distribution of
the data.
Note: Vegetation data typically are not normally distributed across the landscape,
therefore it is most often better to use a nonparametric model.
## Advantages:
# If sample data are not from a normally-distributed population, a parametric model may
# lead to incorrect conclusions; a nonparametric model does not rely on that assumption.
## Disadvantages:
# Need larger sample sizes to have the same power as parametric statistics
# Harder to interpret
# Can overfit data
## Examples:
# Generalized additive models
# Classification and regression trees (i.e. CART)
# Artificial neural networks (ANN)
# Multivariate adaptive regression splines (MARS)
# Ensemble modeling (i.e. Random Forests)
10
Random Forests
Random Forests (Breiman, 2001)
Generates a series of classification
and regression tree models..
.. sampling, with replacement, from
training data (bootstrap)
.. selecting predictor variables at
random for each node
.. outputting the class that most
frequently results
.. and calculating an out-of-bag error
estimate
.. and measuring variable importance
through permutation
randomForest – Liaw & Wiener
ModelMap – Freeman & Frescino
Modeling Example
12
Model Example
Objective:
Using Random Forests to find relationships between forest inventory data and six
spatial predictor layers. The models will be used to make predictions across a
continuous, pixel-based surface.
[Diagram: predictor values (Landsat TM, elevation, aspect, slope) are extracted from each
layer at each sample plot location, where the response (% tree crown cover) is measured;
the fitted model is then applied cell by cell to generate spatially explicit maps of
forest attributes.]
Model Example
The model form:
y ~ f (x1 + x2 + x3 + x4 + x5 + x6) + ε,
where y is forest inventory data, and the x's are the predictor variables listed below,
including satellite spectral data and topographic variables.
## For this example,
# We will look at 2 responses:
# Presence of aspen – binary response of 0 and 1 values (1 = presence)
# Total carbon – continuous response
# With 6 predictor variables:
# Landsat Thematic Mapper, band 5 – 30-m spectral data
# Landsat Thematic Mapper, NDVI – 30-m spectral data
# Classified forest/nonforest map – 250-m classified MODIS, resampled to 30 m
# Elevation – 30-m DEM
# Slope – 30-m DEM-derived
# Aspect – 30-m DEM-derived
14
Study Area
Utah, USA
15
Study Area
Model data set:
Uinta Mountains, Utah, USA
Highest east-west oriented mountain range in the contiguous U.S., up to 13,528 ft (4,123 m)
Apply model to:
High Uinta Wilderness
## Vegetation
5 different life zones:
1. shrub-montane
2. aspen
3. lodgepole pine
4. spruce-fir
5. alpine
16
Modeling Dataset
17
Data for Modeling
# Load libraries
library(rgdal)            # GDAL operations for spatial data
library(raster)           # Analyzing gridded spatial data
library(rpart)            # Recursive partitioning and regression trees
library(car)              # For book (An R Companion to Applied Regression)
library(randomForest)     # Generates Random Forest models
library(PresenceAbsence)  # Evaluates results of presence-absence models
library(ModelMap)         # Generates and applies Random Forest models
18
Response Data
# Forest Inventory data (Model response)
# We have compiled this data before, let's review.
options(scipen=6)
plt <- read.csv("PlotData/plt.csv", stringsAsFactors=FALSE)
tree <- read.csv("PlotData/tree.csv", stringsAsFactors=FALSE)
ref <- read.csv("PlotData/ref_SPCD.csv", stringsAsFactors=FALSE)
## The plt table contains plot-level data, where there is 1 record (row) for each plot.
## This table has the (fuzzed) coordinates of each plot, which we will need later.
dim(plt)     ## Total number of plots
head(plt)    ## Display first six plot records.
## The tree table contains tree-level data, where there is 1 record (row) for each tree on
## a plot. We need to summarize the tree data to plot level for modeling.
dim(tree)    ## Total number of trees
head(tree)   ## Display first six tree records.
# First, let's add the species names to the table.
# Merge species names to table using a reference table of species codes
tree <- merge(tree, ref, by="SPCD")
head(tree)
19
Response Data
# Forest Inventory data (Model response)
## We have 2 responses:
# Presence of aspen – binary response of 0 and 1 values (1 = presence)
# Total carbon – continuous response
## Let's compile presence of aspen and append to plot table.
# First, create a table of counts by species and plot.
spcnt <- table(tree[,c("PLT_CN", "SPNM")])
head(spcnt)
# For this exercise, we don't care about how many trees per species, we just want presence
# or absence, therefore, we need to change all values greater than 1 to 1.
spcnt[spcnt > 1] <- 1
# We are only interested in aspen presence, so let's join the aspen column to the plot table.
spcntdf <- data.frame(PLT_CN=row.names(spcnt), ASPEN=spcnt[,"aspen"])
plt2 <- merge(plt, spcntdf, by.x="CN", by.y="PLT_CN")
dim(plt)
dim(plt2)
# Notice there are fewer records (plots with no trees were dropped by the merge)
plt2 <- merge(plt, spcntdf, by.x="CN", by.y="PLT_CN", all.x=TRUE)
dim(plt2)
20
Response Data
## Forest Inventory data (Model response)
## Now, let's compile total carbon by plot and append to plot table.
# First, create a table of counts by plot.
pcarbon <- aggregate(tree$CARBON_AG, list(tree$PLT_CN), sum)
names(pcarbon) <- c("PLT_CN", "CARBON_AG")
# Carbon is stored in the FIA database in pounds. Let's add a new variable,
# CARBON_KG, with a conversion to kilograms.
pcarbon$CARBON_KG <- round(pcarbon$CARBON_AG * 0.453592)
# Now we can join this column to the plot table (plt2).
plt2 <- merge(plt2, pcarbon, by.x="CN", by.y="PLT_CN", all.x=TRUE)
dim(plt2)
head(plt2)
# Change NA values to 0 values.
plt2[is.na(plt2[,"ASPEN"]), "ASPEN"] <- 0
plt2[is.na(plt2[,"CARBON_KG"]), "CARBON_KG"] <- 0
plt2$CARBON_AG <- NULL
head(plt2)
21
Response Data
# We need to extract data from spatial layers, so let's convert the plot table to a
# SpatialPointsDataFrame object in R.
## We know the projection information, so we can add it to the SpatialPoints object.
prj4str <- "+proj=longlat +ellps=GRS80 +datum=NAD83 +no_defs"
ptshp <- SpatialPointsDataFrame(plt[,c("LON","LAT")], plt,
proj4string = CRS(prj4str))
## Display the points
plot(ptshp)
22
Predictor Data
## Predictor variables:
# Landsat Thematic Mapper, band 5 – 30-m spectral data, resampled to 90 m
# Landsat Thematic Mapper, NDVI – 30-m spectral data, resampled to 90 m
# Classified forest/nonforest map – 250-m classified MODIS, resampled to 90 m
# Elevation – 30-m DEM, resampled to 90 m
# Slope – 90-m DEM-derived (in a following slide)
# Aspect – 90-m DEM-derived (in a following slide)
## Set file names
b5fn <- "SpatialData/uintaN_TMB5.img"       # Landsat TM - Band 5
ndvifn <- "SpatialData/uintaN_TMndvi.img"   # Landsat TM - NDVI
fnffn <- "SpatialData/uintaN_fnfrcl.img"    # Forest type map (reclassed)
elevfn <- "SpatialData/uintaN_elevm.img"    # Elevation (meters)
Note: If you don't have uintaN_fnfrcl.img, follow steps on last slide of this presentation (Appendix 1)
## Check rasters
rastfnlst <- c(b5fn, ndvifn, fnffn, elevfn)
rastfnlst
sapply(rastfnlst, raster)
23
Predictor Data
# TM Band 5 (uintaN_TMB5.img)
# TM NDVI (uintaN_TMndvi.img)
24
Predictor Data
# Forest/Nonforest map (uintaN_fnf.img)
# DEM (uintaN_elevm.img)
25
Predictor Data
## Now, let's generate slope from DEM. Save it to your SpatialData folder.
help(terrain)
help(writeRaster)
slpfn <- "SpatialData/uintaN_slp.img"
slope <- terrain(raster(elevfn), opt=c('slope'), unit='degrees',
filename=slpfn, datatype='INT1U', overwrite=TRUE)
plot(slope, col=topo.colors(6))
# Add slope file name to rastfnlst
rastfnlst <- c(rastfnlst, slpfn)
rastfnlst
26
Predictor Data
## We can also generate aspect from DEM. Save it to your SpatialData folder.
help(terrain)
## This is an intermediate step, so we are not going to save it.
aspectr <- terrain(raster(elevfn), opt=c('aspect'), unit='radians')
aspectr
# Note: Make sure to use radians, not degrees
plot(aspectr, col=terrain.colors(6))
27
Predictor Data
## Aspect is a circular variable. There are a couple ways to deal with this:
## 1. Convert the values to a categorical variable (ex. North, East, South, West)
## We derived aspect in radians. First convert radians to degrees.
aspectd <- round(aspectr * 180/pi)
aspectd
## Now, create a look-up table of reclass values.
help(reclassify)
frommat <- matrix(c(0,45, 45,135, 135,225, 225,315, 315,361), 5, 2)
frommat
frommat <- matrix(c(0,45, 45,135, 135,225, 225,315, 315,361), 5, 2, byrow=TRUE)
frommat
tovect <- c(1, 2, 3, 4, 1)
rclmat <- cbind(frommat, tovect)
rclmat
## Reclassify raster to new values.
aspcl <- reclassify(x=aspectd, rclmat, include.lowest=TRUE)
aspcl
unique(aspcl)
bks <- c(0, sort(unique(aspcl)))                     # Break points
cols <- c("dark green", "wheat", "yellow", "blue")   # Colors
labs <- c("North", "East", "South", "West")          # Labels
lab.pts <- bks[-1] - diff(bks)/2                     # Label positions
plot(aspcl, col=cols, axis.args=list(at=lab.pts, labels=labs), breaks=bks)
28
Predictor Data
## 2. Convert to a linear variable (ex. solar radiation index; Roberts and Cooper 1989)
## Note: aspectr is in radians, so the 30-degree correction must also be converted to radians.
aspval <- (1 + cos(aspectr - 30*pi/180))/2 ## Roberts and Cooper 1989
aspval
plot(aspval)
## Let's multiply by 100 and round so it will be an integer (less memory)
aspval <- round(aspval * 100)
aspval
plot(aspval)
# Save this layer to file
aspvalfn <- "SpatialData/uintaN_aspval.img"
writeRaster(aspval, filename=aspvalfn, datatype='INT1U', overwrite=TRUE)
# Add aspval to rastfnlst
rastfnlst <- c(rastfnlst, aspvalfn)
## Converts aspect into solar radiation equivalents, with a correction of 30 degrees to reflect
## the relative heat of the atmosphere at the time the peak radiation is received.
## Max value is 1.0, occurring at 30 degrees aspect; min value is 0, at 210 degrees aspect.
## Roberts, D.W., and S. V. Cooper. 1989. Concepts and techniques in vegetation mapping. In Land
classifications based on vegetation: applications for resource management. D. Ferguson, P. Morgan, and F.
D. Johnson, editors. USDA Forest Service General Technical Report INT-257, Ogden, Utah, USA.
29
Discussion - Scale of Predictors
## Tools in R for handling scale issues. See help on each function for further details;
## a short sketch follows below.
# focal – applies a moving-window function across pixels without changing the resolution.
#   Note: it works, but takes a lot of time on large rasters.
# aggregate – aggregates pixels to a lower resolution.
# resample – resamples pixels to match the extent or pixel size of another raster.
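A minimal sketch of these three functions applied to the workshop elevation raster (the
3x3 window and aggregation factor here are arbitrary example choices):
elev <- raster(elevfn)
elev.smooth <- focal(elev, w=matrix(1, 3, 3), fun=mean)   # 3x3 moving-window mean
elev.90m <- aggregate(elev, fact=3, fun=mean)             # 30-m pixels -> 90-m pixels
elev.match <- resample(elev, elev.90m, method="bilinear") # align to another raster's grid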
Predictor Data Extraction
## The next step is to extract the values of each raster at each sample point location.
Predictor Data Extraction
## We need to check the projections of the rasters. If the projections are different,
## reproject the points to the projection of the rasters (much faster than reprojecting the rasters).
## We will use the plt2 table with LON/LAT coordinates and the response data attached.
head(plt2)
## We know the LON/LAT coordinates have the following projection:
prj4str <- "+proj=longlat +ellps=GRS80 +datum=NAD83 +no_defs"
# Check projections of each raster..
sapply(rastfnlst, function(x){ projection(raster(x)) })
## Reproject SpatialPoints object to match raster projections.
help(project)
rast.prj <- projection(raster(rastfnlst[1]))
xy <- cbind(plt$LON, plt$LAT)
xyprj <- project(xy, proj=rast.prj)
32
Predictor Data Extraction
## Extract values (raster package)
help(extract)
# Let's extract values from 1 layer.
tmp <- extract(raster(elevfn), xyprj)
head(tmp)
# Now, let's create a function to extract, so we can extract from all the rasters at the same time.
extract.fun <- function(rast, xy){ extract(raster(rast), xy) }
# Now, apply this function to the vector list of raster file names.
rastext <- sapply(rastfnlst, extract.fun, xyprj)
# Look at the output and check the class.
head(rastext)
class(rastext)
33
Predictor Data Extraction
## Extract values (raster package) cont. – change names
# Let's make the column names shorter.
colnames(rastext)
# Use the rastfnlst vector of file names to get new column names.
# First, get the base name of each raster, without the extension.
cnames <- unlist(strsplit(basename(rastfnlst), ".img"))
cnames
# We could stop here, but let's make the names even shorter and remove
# 'uintaN_' from each name.
cnames2 <- substr(cnames, 8, nchar(cnames))
cnames2
# Now, add names to matrix. Because the output is a matrix, we will use colnames.
colnames(rastext) <- cnames2
head(rastext)
34
Predictor Data Extraction
# Now, let's append this matrix to the plot table with the response data (plt2).
head(plt2)
# We just want the response variables, so let's extract these columns along with the
# unique identifier of the table (CN, ASPEN, CARBON_KG).
modeldat <- cbind(plt2[,c("CN", "ASPEN", "CARBON_KG")], rastext)
head(modeldat)
# Check if this is a data frame
is.data.frame(modeldat)
dim(modeldat)
# Let's also append the projected xy coordinates for overlaying with raster layers.
modeldat <- cbind(xyprj, modeldat)
head(modeldat)
colnames(modeldat)[1:2] <- c("X", "Y")
head(modeldat)
35
Data Exploration
36
Model Data Exploration
## What to look for:
# NA values
# Outliers
# Correlations
# Non-normal distributions
# Changes in variability
# Clustering
# Non-linear data structures
37
Model Data Exploration
Summary statistics
## Summary statistics
str(modeldat)
summary(modeldat)
head(modeldat)
dim(modeldat)
# We need to convert categorical variables to factors
modeldat$ASPEN <- factor(modeldat$ASPEN)
modeldat$fnfrcl <- factor(modeldat$fnfrcl)
## Now, display summary statistics again and notice changes for ASPEN and fnfrcl
str(modeldat)
summary(modeldat)
head(modeldat) # notice head does not show which variables are factors
38
Model Data Exploration
NA Values
## Check for NA values.
modeldat[!complete.cases(modeldat),]
modeldat.NA <- modeldat[!complete.cases(modeldat),]
dim(modeldat.NA)
modeldat.NA
# We can overlay plots with NA values on raster.
plot(raster(aspvalfn))
points(modeldat.NA, pch=20)
# Most R model functions will handle or remove NA values, but for this example, let's
# remove the plots with NA values from our dataset now.
modeldat <- modeldat[complete.cases(modeldat),]
dim(modeldat)
39
Model Data Exploration
attach data frame
## Attaching a data frame. This is useful if you are exclusively working with 1 data frame.
## Caution: data frame variable names must be unique to data frame.
## Let's save modeldat object and clean up before we attach the data frame.
save(modeldat, file="Outfolder/modeldat.Rda")
# Now, remove all objects except modeldat.
ls()[ls() != "modeldat"]
rm(list = ls()[ls() != "modeldat"])
ls()
# .. and attach modeldat
attach(modeldat)
head(modeldat)
ASPEN   # Display column vector without using $
# Notes:
# To load the saved model object:
# load(file="Outfolder/modeldat.Rda")
# Make sure to detach the data frame when done using it:
# detach(modeldat)
40
Model Data Exploration
Predictors
## Check for outliers and correlations among predictors.
## Let's look at an example using 2 predictors (elevm, TMB5)
preds <- cbind(elevm, TMB5)
## Correlation between predictor variables, to determine strength of relationship
cor(preds)
round(cor(preds),4)
## Scatterplots
plot(preds)
## Add a regression line
abline(lm(TMB5 ~ elevm))
## Now let's add a smoother line for more information about data trend
lines(loess.smooth(elevm, TMB5), col="red")
## Another way
library(car)
scatterplot(TMB5 ~ elevm)
help(scatterplot)
41
Model Data Exploration
Predictors
## Correlation cont..
# We can now see a non-linear trend in the data, where TMB5 spectral values decrease as
# elevation rises to around 3000 meters, then begin to increase as elevation continues to rise.
# This suggests a non-parametric relationship.
help(cor)
# The default is the Pearson correlation coefficient, but Spearman may be a better choice
# for nonlinear data structures.
round(cor(preds, method="spearman"),4)
## Let's look at all predictors at once
names(modeldat)
preds <- modeldat[-c(1:5)]
head(preds)
## Correlation between predictor variables (to minimize number of predictors)
cor(preds)
str(preds)
## Note Error : 'x' must be numeric (remove factor variable from analysis)
round(cor(preds[,-3], method="spearman"),4)
42
Model Data Exploration
Predictors
## Scatterplots
pairs(preds)
help(pairs) #select Scatterplot Matrices from graphics package
## In help doc, scroll down to Examples and copy and paste the 2 functions:
# panel.hist
# panel.cor
pairs(preds, lower.panel = panel.smooth, upper.panel = panel.cor)
## Change cor function to spearman within panel.cor
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{ usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y, method="spearman"))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
pairs(preds, lower.panel = panel.smooth, upper.panel = panel.cor)
## Another way
scatterplotMatrix(preds[,-3])
43
Model Data Exploration
Predictors
## Check for outliers
plot(TMB5, TMndvi)
identify(TMB5, TMndvi)
# Click on outliers and press esc key to escape
## Display the outliers
modeldat[c(58,60),]
plot(raster("SpatialData/uintaN_TMB5.img"))
points(modeldat[c(58,60),], pch=20)
## Let's remove these outliers from our dataset
modeldat <- modeldat[-c(58,60),]
dim(modeldat)
length(TMB5)
## Detach and reattach data frame
detach(modeldat)
attach(modeldat)
dim(modeldat)
length(TMB5)
save(modeldat, file="Outfolder/modeldat.Rda") ## Resave modeldat object
# load(file="Outfolder/modeldat.Rda")
## Check scatterplot again
plot(TMB5, TMndvi)
## ..and for all predictors
preds <- modeldat[-c(1:5)]
pairs(preds, lower.panel = panel.smooth, upper.panel = panel.cor)
44
Model Data Exploration
Predictors
## We can also look at the distribution of each predictor across its range of values, and
## its deviation from normality, using the cumulative density function we looked at earlier.
## This is only meaningful for continuous predictors.
plot(sort(elevm))
lines(c(1, length(elevm)), range(elevm), col=2)
## Now, let's make a little function to look at the rest of the continuous predictors
pdistn <- function(x){
plot(sort(x))
lines(c(1, length(x)), range(x), col=2)
}
par(ask=TRUE)
# Press enter to go to next display
pdistn(elevm)
pdistn(slp)
pdistn(TMB5)
pdistn(TMndvi)
pdistn(aspval)
par(ask=FALSE)
## For categorical predictors, let's use table to look at the distribution of samples by value.
table(fnfrcl)
45
Model Data Exploration
Response
## Let's look at the distribution and amount of variability of the sample response data.
## Check for normality (bell-shaped curve)
## Histograms and density functions
par(mfrow=c(2,1))
hist(CARBON_KG)
hist(CARBON_KG, breaks=5)
## Overlay density function (smoothed) on the histogram with continuous response.
hist(CARBON_KG, breaks=10, prob=TRUE)
lines(density(CARBON_KG))
## Data with lots of 0 values tend to have this shape. This is not a normal distribution;
## therefore, using a nonparametric model, such as Random Forests, is a good idea.
## Let's look at the distribution if there were no 0 values (plots without trees)
hist(CARBON_KG[CARBON_KG>0], breaks=10, prob=TRUE)
lines(density(CARBON_KG[CARBON_KG>0]))
## The shape of the distribution is similar, still skewed, with many more low values than high values.
par(mfrow=c(1,1))
46
Model Data Exploration
Data distributions – Binomial response
## Let's look at the distribution of the sample response data with a couple predictors.
## We will start with presence/absence of aspen. We know this is a binomial distribution
## with 0 and 1 values. Let's explore a little more.
## Plot elevation as a function of ASPEN, bar representing median value.
boxplot(elevm ~ ASPEN, main="Aspen Presence/Absence")
## Add names
boxplot(elevm ~ ASPEN, main="Aspen Presence/Absence", names=c("Absence",
"Presence"))
# Note: the bold line is median value
## Add points with mean values
means <- tapply(elevm, ASPEN, mean)
points(means, col="red", pch=18)
# Note: overall, presence of aspen tends to be at the lower elevations of the sample plots.
# The distributions are slightly skewed, with median elevm values higher than mean elevm values.
47
Model Data Exploration
Data distributions – Binomial response
## Other ways to explore relationships between the response and predictors.
## We can look at presence vs absence against a single predictor (elevation).
plot(elevm, ASPEN)
## We can look at the differences of presence vs absence with 2 predictors (elevation and slope).
par(mfrow=c(1,2))
plot(elevm[ASPEN==1], slp[ASPEN==1])
plot(elevm[ASPEN==0], slp[ASPEN==0])
## Now, let's make it more meaningful by adding labels and using the same scale for each.
xlab <- "Elevation"
ylab <- "Slope"
xlim <- c(2000, 4000)
ylim <- c(0,40)
plot(elevm[ASPEN==1], slp[ASPEN==1], xlab=xlab, ylab=ylab, xlim=xlim, ylim=ylim,
main="Aspen Present")
plot(elevm[ASPEN==0], slp[ASPEN==0], xlab=xlab, ylab=ylab, xlim=xlim, ylim=ylim,
main="No Aspen")
par(mfrow=c(1,1))
48
Model Data Exploration
Data distributions – Binomial response cont.
## Other ways to explore relationships between the response and predictors (1 plot)
## We can also color the points based on a factor (ASPEN)
plot(slp, elevm, col=ASPEN, pch=20, xlab="Slope", ylab="Elevation")
## Add a legend (using default colors)
legend(x=35, y=2200, legend=levels(ASPEN), col=1:nlevels(ASPEN), pch=20)
help(legend)
## Now, do it again using your own color choice
palette(c("blue", "green"))
# Change color palette first
plot(slp, elevm, col=ASPEN, pch=20, xlab="Slope", ylab="Elevation")
legend(x=35, y=2200, legend=levels(ASPEN), col=1:nlevels(ASPEN), pch=20)
palette("default")
# Change color palette back to default colors
## Or..
plot(slp, elevm, col=c("red", "blue"), pch=20, xlab="Slope", ylab="Elevation")
legend(x=35, y=2200, legend=levels(ASPEN), col=c("red", "blue"), pch=20)
## Another way:
scatterplot(slp ~ elevm|ASPEN, data=modeldat)
## For categorical predictors, use table function again
table(ASPEN, fnfrcl)
49
Model Data Exploration
Data distributions – Continuous response
## Other ways to explore relationships between the response and predictors.
## We can look at relationship between elevation and CARBON_KG
plot(elevm, CARBON_KG, xlab="Elevation", ylab="Carbon(kg)")
## Without 0 values
plot(elevm[CARBON_KG>0], CARBON_KG[CARBON_KG>0], xlab="Elevation",
ylab="Carbon (kg)")
## Add regression and smoother lines.
par(mfrow=c(2,1))
## With 0 values
plot(elevm, CARBON_KG, xlab="Elevation", ylab="Carbon(kg)")
line.lm <- lm(CARBON_KG ~ elevm)
abline(line.lm, col="red")
line.sm <- lowess(elevm, CARBON_KG)
lines(line.sm, col="blue")
## Without 0 values
plot(elevm[CARBON_KG>0], CARBON_KG[CARBON_KG>0],xlab="Elevation",ylab="Carbon(kg)")
line.lm.w0 <- lm(CARBON_KG[CARBON_KG>0] ~ elevm[CARBON_KG>0])
abline(line.lm.w0, col="red")
line.sm <- lowess(elevm[CARBON_KG>0], CARBON_KG[CARBON_KG>0])
lines(line.sm, col="blue")
par(mfrow=c(1,1))
50
Model Generation
51
Model Generation
y ~ f (x1 + x2 + x3 + x4 + x5 + x6) + ε
Tree-based Models
## Strengths of Tree-based models
# Easy to interpret
# Adaptable for handling missing values
# Handles correlated predictor variables
# Predictor variable interactions are automatically included
# Handles categorical or continuous predictor variables
## Weaknesses of Tree-based models, such as Random Forests
# Optimization is based on each split, not necessarily the overall tree model
# Continuous predictors are split at discrete thresholds, which can be inefficient
# Nonparametric, thus loses some of the power of parametric statistics
# Tendency to overfit data
53
Tree-based Models
## What does this mean, in simple form..
# Using our example dataset,
load("Outfolder/modeldat.Rda")
library(rpart)
## Classification tree
asp.tree <- rpart(ASPEN ~ TMB5 + TMndvi + fnfrcl + elevm + slp + aspval,
data=modeldat, method="class")
plot(asp.tree)
text(asp.tree, cex=0.75)
## Regression tree
carb.tree <- rpart(CARBON_KG ~ TMB5 + TMndvi + fnfrcl + elevm + slp + aspval,
data=modeldat)
plot(carb.tree)
text(carb.tree, cex=0.75)
54
Random Forests
Random Forests (Breiman, 2001)
Generates a series of classification and
regression tree models..
.. sampling, with replacement, from
training data (bootstrap)
.. selecting predictor variables at random
for each node
.. outputting the class that most
frequently results
.. and calculating an out-of-bag error
estimate
.. and measuring variable importance
through permutation
randomForest – Liaw & Wiener
ModelMap – Freeman & Frescino
Breiman, L. 2001. Random forests. Machine Learning 45:5-32.
Liaw, A., and M. Wiener. 2002. Classification and regression by randomForest. R News 2(3):18-22.
Freeman, E.A., et al. 2012. ModelMap: an R package for model creation and map production. CRAN R vignette.
Random Forests Model
## Random Forests concepts (Breiman 2001)
# Bootstrap sample – A random selection of plots, with replacement, used to construct one tree.
# Boosting – Successive trees are constructed using information from previous trees, and a
#   weighted vote is used for prediction.
# Bagging – Each tree is independently constructed, and the majority (or average) vote is
#   used for prediction. Breiman's Random Forests uses bagging, not boosting.
# Predictor selection – Each node is split using a randomly selected subset of predictors.
#   In standard trees, all variables are considered at every split. This difference makes
#   Random Forests more robust against overfitting.
# Two main parameters:
#   1. Number of trees (bootstrap samples) to generate (ntree)
#   2. Number of variables in the random subset of predictors at each node (mtry)
# Predictions – Aggregate the predictions of the n trees:
#   For categorical responses (classification trees) – majority vote
#   For continuous responses (regression trees) – average of the tree predictions
# Error rate – based on training data ('out-of-bag', or OOB): each tree is tested on the
#   plots left out of its bootstrap sample, and the errors are averaged over all trees.
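An illustration (not from the original slides) of what one bootstrap sample and its
out-of-bag set look like for our plot data:
n <- nrow(modeldat)
inbag <- sample(1:n, n, replace=TRUE)   # plots used to grow one tree
oob <- setdiff(1:n, inbag)              # plots left out -> used for the OOB error
length(unique(inbag))/n                 # roughly 63% of plots appear in each sample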
56
Random Forests Model
Classification tree
## Now, let's use the randomForest package – Classification tree
library(randomForest)
help(randomForest)
## Let's try with ASPEN binary, categorical response (presence/absence)
set.seed(66)
asp.mod <- randomForest(ASPEN ~ TMB5 + TMndvi + fnfrcl + elevm + slp + aspval,
data=modeldat, importance = TRUE)
## Default parameters:
# ntree = 500       # Number of trees
# mtry = sqrt(p)    # Number of predictors (p) randomly sampled at each node
# nodesize = 1      # Minimum size of terminal nodes
# replace = TRUE    # Bootstrap samples are selected with replacement
## Look at results
asp.mod
summary(asp.mod)
names(asp.mod)
57
Random Forests Model
Classification tree
## Classification tree - Output
names(asp.mod)
err <- asp.mod$err.rate    # Out-Of-Bag (OOB) error rate (of i-th element)
head(err)
tail(err)
mat <- asp.mod$confusion   # Confusion matrix
mat
58
Random Forests Model
Classification tree
## Classification tree - Output
# Plot the number of trees by the error rate
plot(1:500, err[,"OOB"], xlab="Number of trees", ylab="Error rate")
# Note: how many trees needed to stabilize prediction
## Calculate the percent correctly classified from confusion (error) matrix
mat
pcc <- sum(diag(mat[,1:2]))/sum(mat[,1:2]) * 100
pcc
pcc <- round(pcc, 2)   ## Round to 2 decimal places
pcc
library(PresenceAbsence)
pcc(mat[,1:2], st.dev=TRUE)
Kappa(mat[,1:2], st.dev=TRUE)
## The Kappa statistic summarizes all the available information in the confusion matrix.
## Kappa measures the proportion of correctly classified units after accounting for the
## probability of chance agreement.
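To see what these functions measure, here is a sketch computing Kappa by hand from the
counts in the 2x2 confusion matrix (po = observed agreement, pe = chance agreement):
cmx <- mat[,1:2]
po <- sum(diag(cmx))/sum(cmx)                       # observed proportion of agreement
pe <- sum(rowSums(cmx) * colSums(cmx))/sum(cmx)^2   # agreement expected by chance
(po - pe)/(1 - pe)                                  # Kappa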
59
Random Forests Model
## Now, let's use the randomForest package – Regression tree
## Now, let's try with the continuous, CARBON_KG response
set.seed(66)
carb.mod <- randomForest(CARBON_KG ~ TMB5 + TMndvi + elevm + slp + aspval,
data=modeldat, importance = TRUE)
## Default parameters:
# ntree = 500       # Number of trees
# mtry = p/3        # Number of predictors (p) randomly sampled at each node
# nodesize = 5      # Minimum size of terminal nodes
# replace = TRUE    # Bootstrap samples are selected with replacement
## Look at results
carb.mod
summary(carb.mod)
names(carb.mod)
60
Random Forests Model
Regression tree
## Regression tree - Output
names(carb.mod)
mse <- carb.mod$mse   # Mean square error (of i-th element)
rsq <- carb.mod$rsq   # Pseudo R-squared (1 - mse/Var(y)) (of i-th element)
head(mse)
tail(mse)
tail(rsq)
61
Random Forests Model
Regression tree
## Regression tree - Output
# Plot the number of trees by the mse (Mean Square Error)
plot(1:500, mse, xlab="Number of trees", ylab="Mean Square Error rate")
# Note: how many trees needed to stabilize prediction
# Similarly, plot the number of trees by the rsq (R-squared)
plot(1:500, rsq, xlab="Number of trees", ylab="R-squared")
# Again: how many trees needed to stabilize prediction
62
Random Forests Model
Variable Importance
## Other information from RandomForest model (importance=TRUE)
# Variable importance (Breiman 2002)
# Estimated by how much the prediction error increases when the OOB data for that variable
# are permuted while all others are left unchanged.
# randomForest computes different measures of variable importance:
# 1. Computed from OOB data, averaged over all trees and normalized by the standard
#    deviation of the differences.
#    Classification trees – error rate (MeanDecreaseAccuracy)
#    Regression trees – mean square error (%IncMSE)
# 2. The total decrease in node impurities from splitting on the variable, averaged over all trees.
#    Classification trees – measured by the Gini index (MeanDecreaseGini)
#    Regression trees – measured by the residual sum of squares (IncNodePurity)
63
Random Forests Model
Variable Importance - Classification
## Variable importance – Classification tree
## Get importance table
asp.imp <- abs(asp.mod$importance)
asp.imp
## Get the number of measures (columns) and number of predictors (rows)
ncols <- ncol(asp.imp)     ## Number of measures
numpred <- nrow(asp.imp)   ## Number of predictors
## Plot the measures of variable importance for ASPEN presence/absence
par(mfrow=c(2,2))
for(i in 1:ncols){                       ## Loop thru the different importance measures
ivect <- sort(asp.imp[,i], dec=TRUE)     ## Get i-th measure, descending order
iname <- colnames(asp.imp)[i]            ## Get name of measure
# Generate histogram-style plot (type='h') with no x axis (xaxt='n')
plot(ivect, type = "h", main = paste("Measure", iname), xaxt="n",
xlab = "Predictors", ylab = "", ylim=c(0,max(ivect)))
# Add x axis with associated labels
axis(1, at=1:numpred, lab=names(ivect))
}
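Side note: the randomForest package also has a built-in display of the same importance
measures, if you prefer it to the hand-rolled plot above:
varImpPlot(asp.mod, main="Variable importance, ASPEN model")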
64
Random Forests Model
Variable Importance - Regression
## Let’s make a function and plot importance values for CARBON_KG model.
plotimp <- function(itab){
ncols <- ncol(itab)     ## Number of measures
numpred <- nrow(itab)   ## Number of predictors
## Plot the measures of variable importance
par(mfrow = c(ncols/2, 2))
for(i in 1:ncols){                    ## Loop thru the different importance measures
ivect <- sort(itab[,i], dec=TRUE)     ## Get i-th measure, sorted decreasing
iname <- colnames(itab)[i]            ## Get name of measure
# Generate histogram-style plot (type='h') with no x axis (xaxt='n')
plot(ivect, type = "h", main = paste("Measure", iname), xaxt="n",
xlab = "Predictors", ylab = "", ylim=c(0,max(ivect)))
# Add x axis with associated labels
axis(1, at=1:numpred, lab=names(ivect))
}
}
## Check function with ASPEN model
plotimp(asp.imp)
## Now, run the function with the CARBON_KG model
plotimp(carb.mod$importance)
65
Random Forests Model
Proximity
## Other information from RandomForest model (proximity=TRUE)
# Measure of internal structure (Proximity measure)
# - The fraction of trees in which each plot falls in the same terminal node.
# - Similarity measure - in theory, similar data points will end up in the same terminal node.
## Let's try adding proximity to CARBON_KG model
set.seed(66)
carb.mod <- randomForest(CARBON_KG ~ TMB5 + TMndvi + elevm + slp + aspval,
data=modeldat, importance = TRUE, proximity = TRUE)
names(carb.mod)
carb.prox <- carb.mod$proximity
head(carb.prox)
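A sketch of one common use of the proximity matrix (an addition, not from the original
slides): ordinate the plots with classical multidimensional scaling of 1 - proximity, so
plots that often land in the same terminal nodes appear near each other.
mds <- cmdscale(1 - carb.prox, k=2)
plot(mds, xlab="MDS 1", ylab="MDS 2")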
66
Model Prediction
67
Study Area
Model data set:
Uinta Mountains, Utah, USA
Highest east-west oriented mountain range in the contiguous U.S., up to 13,528 ft (4,123 m)
Apply model to:
High Uinta Wilderness
## Vegetation
5 different life zones:
1. shrub-montane
2. aspen
3. lodgepole pine
4. spruce-fir
5. alpine
68
Polygon Data
## Let's import and display the 2 polygon layers as well.
## Set dsn and polygon layer names
dsn <- "SpatialData"
aoinm <- "uintaN_aoi"     # AOI boundary
wildnm <- "uintaN_wild"   # Wilderness boundary (mapping AOI)
## Import polygon shapefiles
bndpoly <- readOGR(dsn=dsn, layer=aoinm, stringsAsFactors=FALSE)
wildpoly <- readOGR(dsn=dsn, layer=wildnm, stringsAsFactors=FALSE)
## Check projections of the 2 polygon layers to see if we can display them together.
sapply(c(bndpoly, wildpoly), projection)
## Now we can display both layers
par(mfrow=c(1,1))
plot(bndpoly, border="black", lwd=3)
plot(wildpoly, add=TRUE, border="red", lwd=2)
69
Predictor Data
## Now, we need to clip the raster predictor layers to the extent of the wilderness polygon.
## Set file names
b5fn <- "SpatialData/uintaN_TMB5.img"       # Landsat TM - Band 5
ndvifn <- "SpatialData/uintaN_TMndvi.img"   # Landsat TM - NDVI
fnffn <- "SpatialData/uintaN_fnfrcl.img"    # Forest type map
elevfn <- "SpatialData/uintaN_elevm.img"    # Elevation (meters)
slpfn <- "SpatialData/uintaN_slp.img"       # Derived slope (degrees)
aspfn <- "SpatialData/uintaN_aspval.img"    # Derived aspect value
## Check rasters
rastfnlst <- c(b5fn, ndvifn, fnffn, elevfn, slpfn, aspfn)
sapply(rastfnlst, raster)
## Compare projections of rasters with projection of wilderness polygon
projection(wildpoly)
70
Predictor Data
Clip layers
## Clip raster layers
## Let's clip the elevm raster using crop function
help(crop)
elevclip <- crop(raster(elevfn), extent(wildpoly))
## Now display the new raster.
plot(elevclip)
## Note: it didn't clip to the boundary itself; it clipped to the rectangular extent of the boundary.
## Add polygon layer to display
plot(wildpoly, add=TRUE)
## We need to crop further by applying a mask with the polygon layer.
elevclip <- mask(elevclip, wildpoly)
plot(elevclip)
71
Predictor Data
Clip layers
## Create a function to clip all the layers, saving to the working directory. Use
## rastfnlst for raster names and build new names from the list. Return the new name
## of the raster.
cliprast <- function(rastfn, poly){
## Create new name from rastfn
rastname <- strsplit(basename(rastfn), ".img")[[1]]
newname <- substr(rastname, 8, nchar(rastname))
newname <- paste("Outfolder/",newname, "_clip.img", sep="")
## Crop raster
rastclip <- crop(raster(rastfn), extent(poly))
## Mask raster and save to working directory with newname
rastclip <- mask(rastclip, poly, filename=newname, overwrite=TRUE)
print(paste("finished clipping",rastname))
flush.console()
return(newname)
}
clipfnlst <- NULL
for(rastfn in rastfnlst){
clipfn <- cliprast(rastfn, wildpoly)
clipfnlst <- c(clipfnlst, clipfn)
}
72
Predictor Data
Create Raster Stack
## Check clipped rasters.
sapply(clipfnlst, raster)
## For prediction, the predictor layers must be consistent. Check the following
## (a one-call check is shown below this list):
# dimensions – all layers must have the same number of rows and columns
# resolution – all layers must have the same resolution
# extent – all layers must have the same extent
# projection – all layers must have the same projection
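A sketch of checking these conditions in one call with raster::compareRaster, which stops
with an error if any layer differs (res=TRUE also compares resolution):
rlist <- lapply(clipfnlst, raster)
do.call(compareRaster, c(rlist, list(res=TRUE)))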
## Create a stack of all the predictor layers
clipstack <- stack(clipfnlst)
clipstack
## Add names to stack layers
stacknames <- unlist(strsplit(basename(clipfnlst), "_clip.img"))
names(clipstack) <- stacknames
clipstack
73
Apply Model
y ~ f (x1 + x2 + x3 + x4 + x5 + x6) + ε
Apply Model & Display Map
ASPEN
## Predict across stack pixels.
asp.predict <- predict(clipstack, asp.mod)
asp.predict
plot(asp.predict)
# Plot with color breaks
cols <- c("dark grey", "green")
plot(asp.predict, col=cols, breaks=c(0,0.5,1))
colors()   # List available color names
# Or, a little fancier.. create function to color categorical raster classes
colclasses <- function(rast, cols, labs){
nc <- length(cols)                            # Number of classes
minval <- cellStats(rast, 'min')              # Minimum value of raster
maxval <- cellStats(rast, 'max')              # Maximum value of raster
bks <- seq(minval, maxval, length.out=nc+1)   # Break points
lab.pts <- bks[-1] - diff(bks)/2              # Label points
plot(rast, col=cols, axis.args=list(at=lab.pts, labels=labs),
breaks=bks)
}
colclasses(asp.predict, cols, labs=c("Absence", "Presence"))
par()       # View current graphical parameters
help(par)
par(omi=c(0,0,0,0.5))   # Changes outside margins
colclasses(asp.predict, cols, labs=c("Absence", "Presence"))
75
Apply Model & Display Map
CARBON_KG
## Predict across stack pixels.
carb.predict <- predict(clipstack, carb.mod)
## Plot with heat color ramp
plot(carb.predict, col=heat.colors(10))
## Plot with grey scale
plot(carb.predict, col=grey(256:0/256))
## Plot with green scale
my.colors <- colorRampPalette(c("white", "dark green"))
plot(carb.predict, col=my.colors(10))
76
ModelMap
Build Model
library(ModelMap)
help(package=ModelMap)
predList <- stacknames
#predList <- c("elevm", "TMndvi")
## Build random forest model
asp.mod2 <- model.build(
model.type = "RF",
qdata.trainfn = modeldat,
folder = "Outfolder",
predList = predList,
predFactor = "fnfrcl",
response.name = "ASPEN",
response.type = "categorical",
unique.rowname = "CN",
seed = 66)
asp.mod2
ModelMap
Model diagnostics
asp.mod2d <- model.diagnostics(
model.obj = asp.mod2,
qdata.trainfn = modeldat,
folder = "Outfolder",
response.name = "ASPEN",
prediction.type = "OOB",
unique.rowname = "CN")
ModelMap
Make Map
# Add full path to beginning of each name in predList
predPath <- sapply(predList,
function(x){paste("SpatialData/uintaN_", x, ".img", sep="")})
predPath
# Generate rastLUTfn. See help(model.mapmake) for details.
numpreds <- length(predList)
rastLUTfn <- data.frame(matrix(c(predPath, predList, rep(1,numpreds)),
numpreds, 3), stringsAsFactors=FALSE)
rastLUTfn
## Check rasters
sapply(rastfnlst, raster)
## Make Map
a <- model.mapmake(
model.obj = asp.mod2,
folder = "Outfolder",
rastLUTfn = rastLUTfn,
make.img = TRUE,
na.action = "na.omit")
Exercise
## 1. Create a map of presence of lodgepole pine within the Uinta Wilderness area, using a
## Random Forests model and the predictors from modeldat. Hint: load the plt and tree tables
## again and use code from slide #20 to create a binary response variable for lodgepole
## presence. Also, set the seed to 66 so you can recreate the model.
## 2. How many plots in the model data set have presence of lodgepole pine?
## 3. Display 1 scatterplot of the relationship of elevation and NDVI values, with lodgepole
## pine presence and absence as different colors. Hint: see slide #49. Make sure to label the
## plot and add a legend.
## 4. What is the variable with the highest importance value based on the Gini index?
## 5. What percentage of the total area is lodgepole pine? How does this compare with the
## percentage of aspen?
Appendix 1
## Reclass raster layer to 2 categories
fnf <- raster("SpatialData/uintaN_fnf.img")
## Create raster look-up table
fromvect <- c(0,1,2,3)
tovect <- c(2,1,2,2)
rclmat <- matrix(c(fromvect, tovect), 4, 2)
## Generate raster and save to SpatialData folder
fnfrcl <- reclassify(x=fnf, rclmat, datatype='INT2U',
filename="SpatialData/uintaN_fnfrcl.img", overwrite=TRUE)