Evaluating the Uncertainty of
Land-Use Regression Models
Halûk Özkaynak
US EPA, Office of Research and Development
National Exposure Research Laboratory, RTP, NC
Presented at the
CMAS Special Symposium on Air Quality
October 13, 2010
Land-Use Regression (LUR) Models
• Widely used methodology for estimating individual exposure to ambient air pollution in epidemiologic studies
[Figure: point sources, area sources, and line sources]
LUR Strengths
• Able to capture smaller-scale variability in
community health studies
• Less resource intensive
– Easier to develop and apply than other methods for measuring or estimating subject-specific values (e.g., household measurements, physical modelling)
• Land-use data widely available
LUR Limitations
• Inputs
– Require accurate monitoring data at a large number of sites, e.g., in highly industrialized urban areas with many types of emission sources
• Application in health studies
– Not transferable from one urban area to another
– Do not address multi-pollutant aspects of air pollution
– Lack the fine-scale temporal resolution needed for
estimating short-term exposure to air pollution
– Often estimate only ambient air pollution, not indoor or personal exposure
• Lack the ability to connect specific sources of emissions
to concentrations for developing pollution mitigation
strategies
Analysis Goals: New Haven Case Study*
• Use air pollution predicted by coupled regional
(CMAQ) and local (AERMOD) scale air-quality
models
• Develop and evaluate land-use regression
models for:
– Benzene
– Nitrogen oxides (NOx)
– Particulate matter (PM2.5)
• Examine (in future) the implications of alternate
LUR development strategies on model efficacy
for multiple pollutants
Source: Johnson, M., Isakov, V., Touma, J.S., Mukerjee, S., and Özkaynak, H. (2010). Evaluation of Land Use Regression Models Used to Predict Air Quality Concentrations in an Urban Area. Atmospheric Environment, Vol. 44, pp. 3660-3668.
Air Pollution Data
• Air pollution concentrations were predicted at 318
census block group sites in New Haven, Connecticut
using a coupled air quality model (Isakov et al., 2009)
• Predicted daily
concentrations for 2-month
periods in winter and
summer (2001) were used to
calculate seasonal average
concentrations for benzene,
NOx, and PM2.5 at each site
– July-August for summer
– January-February for winter
• Annual averages were based on 365 daily means for 2001
Isakov et al. 2009. Journal of the Air and Waste Management Association; 59(4):461-472.
LUR Model Structure and Inputs
Pollutant concentrations = traffic intensity and proximity to roadways + proximity to ports and harbors + population and housing density + proximity to industrial sources
Dependent Variables
• Pollutant concentrations: benzene, NOx, and PM2.5 predicted by coupled regional and local scale air quality models
Independent (Predictor) Variables
• Traffic intensity and proximity to roadways
– Traffic intensity near the home (vpd/km2)
– Proximity (1/distance) to major roadways
• Proximity to ports and harbors
– Proximity (1/distance) to seaports
– Proximity (1/distance) to harbors
• Population and housing density
– Population density in census block group
– Housing density in census block group
• Proximity to industrial sources
– Proximity to industrial emitters of benzene, NOx, and PM2.5
• Multivariate linear regression models
• Initial pool of 60 potential predictors
• Eliminated variables based on
– High correlation (R-squared ~1.0) with other selected predictors and/or
– Lack of interpretability
• 19 land-use variables included in model selection
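The correlation-based elimination step can be illustrated with a short sketch. This is not the authors' procedure: the greedy ordering, the 0.95 cutoff standing in for "R-squared ~1.0", the variable names (including the two hypothetical traffic buffers), and the synthetic data are all assumptions made for illustration.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def screen_predictors(candidates, r2_threshold=0.95):
    """Greedily keep each predictor whose squared correlation with every
    already-kept predictor stays below r2_threshold (an assumed cutoff)."""
    kept = []
    for name, values in candidates.items():
        if all(pearson_r(values, candidates[k]) ** 2 < r2_threshold for k in kept):
            kept.append(name)
    return kept

# Synthetic data for 318 sites (illustrative only, not the New Haven inputs).
rng = random.Random(1)
n_sites = 318
traffic_300m = [rng.uniform(0, 5000) for _ in range(n_sites)]
candidates = {
    "traffic_300m": traffic_300m,
    # Hypothetical second buffer, built to be nearly collinear with the first.
    "traffic_500m": [1.1 * t + rng.gauss(0, 50) for t in traffic_300m],
    "inv_dist_major_road": [1 / rng.uniform(0.05, 2.0) for _ in range(n_sites)],
    "pop_density": [rng.uniform(100, 6000) for _ in range(n_sites)],
}

kept = screen_predictors(candidates)
print(kept)
```

In this sketch the near-duplicate traffic variable is dropped and the other three are retained. Screening for interpretability, the second criterion on the slide, is a judgment call that code like this does not capture.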
Site Selection
• Sites
– Census block group centroids
• Training Sites
– Sites used to fit LUR models
– Selected from 318 census
block groups in the study
area
– Stratified random selection
among 4 census regions
• Test Sites
– Remaining sites withheld from training set (minimum of 10%)
– Used for independent model evaluation
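A minimal sketch of stratified random site selection follows. It assumes proportional allocation of training sites across the 4 census regions, which the slide does not specify; the function name `stratified_split`, the synthetic region assignment, and the way the 10% minimum is enforced are likewise assumptions.

```python
import random
from collections import defaultdict

def stratified_split(site_regions, n_train, min_test_fraction=0.10, seed=0):
    """Split site IDs into training and test lists, sampling training sites
    at random within each region in proportion to region size (assumed)."""
    rng = random.Random(seed)
    n_total = len(site_regions)
    if n_total - n_train < min_test_fraction * n_total:
        raise ValueError("too few sites left for the test set")

    by_region = defaultdict(list)
    for site_id, region in site_regions.items():
        by_region[region].append(site_id)

    # Proportional quotas, rounded with the largest-remainder rule.
    regions = sorted(by_region)
    quotas = {r: n_train * len(by_region[r]) / n_total for r in regions}
    alloc = {r: int(quotas[r]) for r in regions}
    leftover = n_train - sum(alloc.values())
    by_remainder = sorted(regions, key=lambda r: quotas[r] - alloc[r], reverse=True)
    for r in by_remainder[:leftover]:
        alloc[r] += 1

    train = []
    for r in regions:
        train.extend(rng.sample(by_region[r], alloc[r]))
    train_set = set(train)
    test = [s for s in site_regions if s not in train_set]
    return train, test

# 318 synthetic sites assigned to 4 hypothetical regions.
site_regions = {site_id: site_id % 4 for site_id in range(318)}
train_sites, test_sites = stratified_split(site_regions, n_train=285)
print(len(train_sites), len(test_sites))  # 285 33
```

With 285 training sites, 33 of the 318 sites (about 10.4%) remain for testing, which is consistent with the 10% minimum stated above.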
Model Development and Evaluation
• Variable selection
– Examined correlation structure for predictive
variables
• Model development
– All subsets with 3-7 independent predictors
– Model selection based on AIC, Mallows' Cp, adjusted R-squared, and variance inflation factor
• Model evaluation
– Cross-validation within training dataset
– Hold-out evaluation within test dataset
• Models for multiple pollutants and numbers of training sites
– Pollutants: benzene, NOx, PM2.5
– Training sites: 25, 50, 75, 100, 125, 150, 200, and 285
• Automated, iterative process
– Site selection -> model development
– Repeated 100x for each pollutant and number of
training sites
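The all-subsets step can be sketched as below. This is a simplified illustration, not the study's code: it ranks subsets by AIC alone (the slide also lists Mallows' Cp, adjusted R-squared, and variance inflation factor), uses the Gaussian AIC with constants dropped, and runs on synthetic data with hypothetical predictor names.

```python
import itertools
import math
import random

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols_rss(columns, y):
    """Fit y on the predictor columns plus an intercept; return the residual sum of squares."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(rows, y))

def aic(rss, n, n_params):
    """Gaussian AIC up to an additive constant: n * ln(RSS / n) + 2 * n_params."""
    return n * math.log(rss / n) + 2 * n_params

def best_subset(predictors, y, min_size=3, max_size=7):
    """Try every predictor subset of the allowed sizes; return (AIC, subset) with lowest AIC."""
    names = sorted(predictors)
    best = None
    for size in range(min_size, min(max_size, len(names)) + 1):
        for subset in itertools.combinations(names, size):
            rss = ols_rss([predictors[s] for s in subset], y)
            score = aic(rss, len(y), size + 2)  # slopes + intercept + error variance
            if best is None or score < best[0]:
                best = (score, subset)
    return best

# Synthetic training set: 100 sites, 6 candidate predictors, 3 of which matter.
rng = random.Random(42)
n = 100
names = ["traffic", "inv_dist_road", "pop_density", "noise_a", "noise_b", "noise_c"]
predictors = {name: [rng.uniform(0, 10) for _ in range(n)] for name in names}
y = [1.0 + 2.0 * predictors["traffic"][i] + 1.5 * predictors["inv_dist_road"][i]
     + 1.0 * predictors["pop_density"][i] + rng.gauss(0, 1) for i in range(n)]

best_score, best_names = best_subset(predictors, y)
print(best_names)
```

In the automated process described above, a search like this would be rerun for each random draw of training sites, 100 times per pollutant and training-set size. The selected subset should contain the three informative predictors, although AIC may occasionally admit an uninformative one as well.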
Model Performance in Test versus Training Sites: Benzene
[Figure: proportion of variance explained (R2, axis 0 to 1.1) versus number of sites in training dataset (axis 0 to 300). Series: RSQ predicted vs observed benzene in test dataset; RSQ LUR models for benzene in training dataset.]
Model Performance in Test versus Training Sites: NOx
[Figure: proportion of variance explained (R2, axis 0 to 1.1) versus number of sites in training dataset (axis 0 to 300). Series: RSQ predicted vs observed NOx in test dataset; RSQ LUR models for NOx in training dataset.]
Model Performance in Test versus Training Sites: PM2.5
[Figure: proportion of variance explained (R2, axis 0 to 1.1) versus number of sites in training dataset (axis 0 to 300). Series: RSQ predicted vs observed PM2.5 in test dataset; RSQ LUR models for PM2.5 in training dataset.]
LUR Prediction Errors: NOx
• Prediction error =
– Average (+/- SD) of mean predicted minus observed input values
– Over 100 iterations, i.e., 100 LUR models
• Analyzed by low, medium, and high NOx concentration based on the total NOx distribution
– Low = 0 - 25th percentile
– Medium = 25th - 75th percentile
– High = 75th percentile - max
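The stratified error summary can be sketched for a single set of predictions as follows. The function name, the handling of values that fall exactly on a cut point, and the synthetic data are assumptions; the averaging over 100 LUR models described above is not reproduced here.

```python
import statistics

def prediction_error_by_category(observed, predicted):
    """Return {category: (mean error, SD of error)}, where error is predicted
    minus observed and categories are cut at the 25th and 75th percentiles
    of the observed concentrations."""
    q25, _, q75 = statistics.quantiles(observed, n=4)
    errors = {"low": [], "medium": [], "high": []}
    for obs, pred in zip(observed, predicted):
        if obs <= q25:          # boundary handling is an assumption
            key = "low"
        elif obs <= q75:
            key = "medium"
        else:
            key = "high"
        errors[key].append(pred - obs)
    return {k: (statistics.mean(v), statistics.stdev(v)) for k, v in errors.items()}

# Synthetic example: predictions shrunk toward the overall mean, so low
# concentrations are over-predicted and high concentrations under-predicted.
observed = [float(v) for v in range(1, 101)]
overall_mean = statistics.mean(observed)
predicted = [overall_mean + 0.7 * (obs - overall_mean) for obs in observed]

summary = prediction_error_by_category(observed, predicted)
for category, (mean_err, sd_err) in summary.items():
    print(f"{category}: mean error {mean_err:+.2f}, SD {sd_err:.2f}")
```

The synthetic predictions are constructed so that the sign of the mean error flips between the low and high categories, which is the kind of pattern this breakdown is meant to expose.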
Rotterdam Area LUR versus Dispersion Model
(Hoek et al., 2010)
[Figure panels: Dispersion Model; LUR Model]
LUR Model Evaluation in Oslo from Hoek et al., 2010
Courtesy: Christian Madsen (Oslo)
Year | Pollutant | Full model adj R2 | LOOCV adj R2 | Validation, training sets* of 20 locations, adj R2 | Validation, training sets* of 40 locations, adj R2
2005 | NOx | 0.63 | 0.61 - 0.66 | 0.52 - 0.68 | 0.58 - 0.70
2005 | NO2 | 0.69 | 0.67 - 0.72 | 0.60 - 0.78 | 0.64 - 0.76
2005 | NO | 0.57 | 0.53 - 0.59 | 0.45 - 0.61 | 0.51 - 0.65
2008 | NOx | 0.62 | 0.59 - 0.65 | 0.55 - 0.70 | 0.65 - 0.67
2008 | NO2 | 0.70 | 0.68 - 0.72 | 0.56 - 0.76 | 0.70 - 0.74
2008 | NO | 0.56 | 0.51 - 0.59 | 0.53 - 0.65 | 0.57 - 0.61
Comparison of Two LUR Models for Amsterdam
(Hoek et al., 2010)
Comparison of Two LUR Models for Amsterdam
Denoting Sites Impacted by Traffic or Urban Sources
(Hoek et al., 2010)
Summary and Conclusions
• We used air pollution concentrations predicted by coupled regional
and local scale AQ models to develop and evaluate LUR models in
New Haven, CT for benzene, PM2.5, and NOx
• Model performance and robustness improved as number of sites
used to build the models increased
– R-squared values were inflated for models based on pollutant concentrations from 25 training sites compared with models based on 100-285 training sites
– R-squared for LUR model (training dataset) and R-squared predicted
versus observed (test dataset) converged as training sites increased
• It is critical to evaluate LUR performance using site-specific
independent measurement data sets
• Analysis suggests that coupled air quality models could provide a
useful tool for improving LUR estimates of exposure to ambient air
pollution in epidemiologic studies
• LUR model performance may be considerably poorer than emissions-based modeling results for urban environments with complex sources and landscape characteristics
Acknowledgements*
• Markey Johnson
• Vlad Isakov
• Joe Touma
• Shaibal Mukerjee
• Luther Smith (Alion Incorporated)
• Ellen Kinnee (Computer Science Corporation)
*Although this work was reviewed by EPA and approved for publication, it may
not necessarily reflect official Agency policy
Additional Slides
Mean Contribution of Land-Use Factors in Benzene Models
[Figure: pie charts of the proportion of variability explained in benzene models based on 25, 100, and 285 training sites. Legend: Intercept; Traffic Intensity (vpd/km2); Proximity to Roadways; Proximity to Ports and Harbors; Proximity to Industrial Sources; Population and Housing Density.]
Mean Contribution of Land-Use Factors in NOx Models
[Figure: pie charts of the proportion of variability explained in NOx models based on 25, 100, and 285 training sites. Legend: Intercept; Traffic Intensity (vpd/km2); Proximity to Roadways; Proximity to Ports and Harbors; Proximity to Industrial Sources; Population and Housing Density.]
Mean Contribution of Land-Use Factors in PM2.5 Models
[Figure: pie charts of the proportion of variability explained in PM2.5 models based on 25, 100, and 285 training sites. Legend: Intercept; Traffic Intensity (vpd/km2); Proximity to Roadways; Proximity to Ports and Harbors; Proximity to Industrial Sources; Population and Housing Density.]