Real-time Inference of Air Quality using urban features

advertisement
U-Air: When Urban Air Quality
Meets Big Data
Yu Zheng
Lead Researcher, Microsoft Research Asia
Background
Air quality monitor station
• Air quality
– NO2, SO2
– Aerosols: PM2.5, PM10
• Why it matters
– Healthcare
– Pollution control and dispersal
• Reality
– Building a measurement station
is not easy
– A limited number of stations
(poor coverage)
Beijing only has 15 air quality monitor
stations in its urban areas (50kmx40km)
2PM, June 17, 2013
Challenges
• Air quality varies by locations
non-linearly
• Affected by many factors
– Weathers, traffic, land use…
– Subtle to model with a clear formula
0.30
Proportion
Portition
0.25
0.20
0.15
>35%
0.10
0.05
0.00
0
A) Beijing (8/24/2012 - 3/8/2013)
40
80
120
160
200
240
280
320
360
400
Deviation of PM2.5 between S12 and S13
440
480
We do not really know the air quality of a
location without a monitoring station!
Challenges
• Existing methods do not work well
– Linear interpolation
– Classical dispersion models
• Gaussian Plume models and Operational Street Canyon models
• Many parameters difficult to obtain: Vehicle emission rates, street
geometry, the roughness coefficient of the urban surface…
– Satellite remote sensing
• Suffer from clouds
• Does not reflect ground air quality
• Vary in humidity, temperature, location, and seasons
– Outsourced crowd sensing using portable devices
• Limited to a few gasses: CO2 and CO
• Sensors for detecting aerosol are not portable: PM10, PM2.5
• A long period of sensing process, 1-2 hours
30,000 + USD, 10ug/m3
202×85×168(mmοΌ‰
Inferring Real-Time and Fine-Grained air quality
throughout a city using Big Data
Meteorology
Traffic
Historical air quality data
Human Mobility
Real-time air quality reports
POIs
Road networks
Applications
• Location-based air quality awareness
– Fine-grained pollution alert
– Routing based on air quality
• Identify candidate locations for setup new monitoring stations
• A step towards identifying the root cause of air pollution
S1
S9
S6 S2
S8
S7
S1
S4
S3
S10
S5
B) Shanghai
Difficulties
• How to identify features from each kind of data source
• Incorporate multiple heterogeneous data sources into a
learning model
– Spatially-related data: POIs, road networks
– Temporally-related data: traffic, meteorology, human mobility
• Data sparseness (little training data)
– Limited number of stations
– Many places to infer
Methodology Overview
• Partition a city into disjoint grids
• Extract features for each grid from its affecting region
–
–
–
–
–
Meteorological features
Traffic features
Human mobility features
POI features
Road network features
• Co-training-based semi-supervised learning model for each
pollutant
– Predict the AQI labels
– Data sparsity
– Two classifiers
Meteorological Features: Fm
•
•
•
•
•
Rainy, Sunny, Cloudy, Foggy
Wind speed
Temperature
Humidity
Barometer pressure
AQI of PM10
August to Dec. 2012 in Beijing
Good
Moderate
Unhealthy-S
Unhealthy
Very Unhealthy
Traffic Features: Ft
• Distribution of speed by time: F(v)
• Expectation of speed: E(V)
• Standard deviation of Speed: D
0≤v<20
20≤v<40
km
v≥ 40
E(v)
D(v)
GPS trajectories generated by over 30,000 taxis
From August to Dec. 2012 in Beijing
Good
Moderate
Unhealthy-S
Unhealthy
Very Unhealthy
Extracting Traffic Features
• Offline spatio-temporal indexing
–
–
–
–
π‘‘π‘Ž : arrival time
Traj: trajectory ID
𝑙𝑖 : the index of the first GPS point (in the trajectory) entering a grid
π‘™π‘œ : the index of the last GPS point (in the trajectory) leaving a grid
gi
Traj1
p1→p2→...→pn
Trajm
p1→p2→...→pn
Trajl
p1→p2→...→pn
earliest
td
picki
tp
dropm
td
The time span of the data
drop1
drop2
td
latest
ta, Traj, Ii, Io
Taxi2
ta, Traj, Ii, Io
Taxi7
ta, Traj, Ii, Io
Taxim
Human Mobility Features: Fh
• Human mobility implies
– Traffic flow
– Land use of a location
– Function of a region (like residential or business areas)
• Features:
– Number of arrivals π‘“π‘Ž and leavings 𝑓𝑙
fa
fa
Good
Moderate
Unhealthy-S
Unhealthy
Very Unhealthy
Good
Moderate
Unhealthy-S
Unhealthy
Very Unhealthy
fl
A) AQI of PM10
fl
B) AQI of NO2
POI Features: Fp
• Why POI
– Indicate the land use and the function of the region
– the traffic patterns in the region
• Features
– Distribution of POIs over categories
– Portion of vacant places
– The changes in the number of POIs
• Factories, shopping malls,
• hotel and real estates
• Parks, decoration and furniture markets
Road Network Features: Fr
• Why road networks
– Have a strong correlation with traffic flows
– A good complementary of traffic modeling
• Features:
– Total length of highways π‘“β„Ž
– Total length of other (low-level) road segments π‘“π‘Ÿ
– The number of intersections 𝑓𝑠 in the grid’s affecting region
• Temporal dependency in a location
• Geo-correlation between locations
– Two sets of features
Co-Training
• Spatially-related
• Temporally-related
Spatial Classifier
Temporal Classifier
s2
s3
ti
s1
t2
– Generation of air pollutants
• Emission from a location
• Propagation among locations
l
s4
l
s2
s3
s4
s1
t1
l
s2
s3
spac
e
– States of air quality
s4
s1
Geo
• Philosophy of the model
Time
Semi-Supervised Learning Model
A location with AQI labels
A location to be inferred
Temporal dependency
Spatial correlation
Road Networks: Fr
POIs: Fp
Traffic: Ft
Spatial
Meteorologic: Fm
Human mobility: Fh
Temporal
Co-Training-Based Learning Model
• Temporal classifier
– Model the temporal dependency of the air quality in a location
– Using temporally related features
– Based on a Linear-Chain Conditional Random Field (CRF)
Yt-1
Fm(t-1) Ft(t-1) Fh(t-1) t-1
Yt
Fm(t) Ft(t) Fh(t)
Yt-1
t
Fm(t+1) Ft(t+1) Fh(t+1) t+1
Co-Training-Based Learning Model
• Spatial classifier
– Model the spatial correlation between AQI of different locations
– Using spatially-related features
– Based on a BP neural network
• Input generation
– Select n stations to pair with
– Perform m rounds
FFp1p1
βˆ†βˆ†PP1x1x
DD1 1
FFr1r1
DD2 2
FFpxpx FFrxrx lxlx
FFpkpk
FFrkrk
lklk
bb1 1
w'w'1111
βˆ†βˆ†RR1x1x
DD1 1
l1l1
ww1111
b'b'1 1
ww1 1
dd1x1x
cc1 1
b''
b''
βˆ†βˆ†PPkxkx
DD1 1
wwr r
βˆ†βˆ†RRkxkx
DD1 1
b'b'r r
DD2 2
ddkxkx
kk
Input
Inputgeneration
generation cc
wwpqpq
bbq q
w'w'qrqr
ANN
ANN
ccx x
Learning Process
Temporally-related features
Labeled data
Yt-1
Yt
Fm(t-1) Ft(t-1) Fh(t-1) t-1
Training
Unlabeled data
Yt-1
Fm(t) Ft(t) Fh(t)
t
Fm(t+1) Ft(t+1) Fh(t+1) t+1
Inference
1
Fp
βˆ†P1x
D1
Fr1
D2
Fpx Frx
Fpk
Frk
lk
b1
w'11
βˆ†R1x
D1
l1
w11
lx
b'1
w1
d1x
c
1
b''
βˆ†Pkx
D1
wr
βˆ†Rkx
D1
D2
Input generation
dkx
ck
b'r
wpq
bq
w'qr
ANN
Spatially-related features
cx
Inference Process
Temporally-related features
Yt-1
Yt
Yt-1
< 𝑝𝑐1 , 𝑝𝑐2 , …, 𝑝𝑐𝑛 >
Fm(t-1) Ft(t-1) Fh(t-1) t-1
Fp1
βˆ†P1x
D1
Fr1
D2
Fpx Frx
Frk
lk
w11
b1
t
Fm(t+1) Ft(t+1) Fh(t+1) t+1
×
lx
b'1
w1
d1x
< 𝑝′𝑐1 , 𝑝′𝑐2 , …, 𝑝′𝑐𝑛 >
c1
b''
βˆ†Pkx
D1
wr
βˆ†Rkx
D1
D2
Input generation
dkx
ck
𝑐 = arg 𝑐𝑖∈π’ž π‘€π‘Žπ‘₯(𝑝𝑐𝑖 ×
w'11
βˆ†R1x
D1
l1
Fpk
Fm(t) Ft(t) Fh(t)
b'r
wpq
bq
w'qr
ANN
Spatially-related features
cx
𝑝′𝑐𝑖
)
Evaluation
• Datasets
Data sources
POI
Road
AQI
Beijing
Shanghai
Shenzhen
Wuhan
2012 Q1
2012 Q3
#.Segments
Highways
271,634
272,109
162,246
1,497km
321,529
317,829
171,191
1,963km
107,061
107,171
45,231
256km
102,467
104,634
38,477
1,193km
Roads
18,525km
25,530km KM
6,100km
9,691km
#. Intersec.
49,981
70,293
32,112
25,359
#. Station
Hours
22
23,300
10
8,588
9
6,489
10
6,741
8/24/20123/8/2013
1/19/20133/8/2013
2/4/20133/8/2013
2/4/2013-3/8/2013
50×50km (2500)
50×50km (2500)
57×45km(2565)
45×25km (1165)
Time spans
Urban Size (grids)
S1
S1
S5
S6
S9
S2
S6
S6
S8
S21
S13
S15
S16
S2
S1
S7
S2
S6
S12 S14
S4
S7
S5
S3
S8
S19
S10
S7
S4
S22
S10
S8
S4
S8
S3
S1
S16
S18
S5
S5
S3
S17
A) Beijing
S7
S9
S2
S1
S11
S4
S6
S3
S9
S20
S9
S10
B) Shanghai
C) Shenzhen
D) Wuhan
Evaluation
• Ground Truth
– Remove a station
– Cross cities
• Baselines
–
–
–
–
–
Linear and Gaussian Interpolations
Classical Dispersion Model
Decision Tree (DT):
CRF-ALL
ANN-ALL
Evaluation
• Does every kind of feature count?
NO2
PM10
Features
Precision
Recall
Precision
Recall
πΉπ‘š
0.572
0.514
0.477
0.454
𝐹𝑑
0.341
0.36
0.371
0.35
πΉβ„Ž
0.327
0.364
𝐹𝑝 +πΉπ‘Ÿ
πΉπ‘š +𝐹𝑑
πΉπ‘š +𝐹𝑑 +𝐹𝑝 +πΉπ‘Ÿ
πΉπ‘š +𝐹𝑑 +𝐹𝑝 +πΉπ‘Ÿ +πΉβ„Ž
0.441
0.664
0.731
0.773
0.443
0.675
0.734
0.754
0.411
0.307
0.634
0.701
0.723
0.483
0.354
0.635
0.691
0.704
Evaluation
• Overall performance of the co-training
0.8
0.80
0.7
0.5
0.4
0.3
U-Air
DT
Linear
CRF-ALL
Guassian
ANN-ALL
Classical
Precision
Accuracy
Accuracy
0.6
SC
TC
Co-Training
0.75
0.70
0.2
0.1
0.65
0.0
PM10
NO2
0
20
40
60
80
100
Num. of Iterations
120
140
160
Evaluation
• Confusion matrix of Co-Training on PM10
Predictions
G
M
S
402
102
0
0.883
M
3789
602
3614
204
0
0.818
S
41
200
532
50
0.646
U
0
22
70
219
0.704
0.855
0.853
0.586
0.814
G
Precision
U
Recall
Ground
Truth
0.828
Evaluation
• Performance of Spatial classifier
Cities
PM2.5
PM10
NO2
Prec.
Rec.
Prec.
Rec.
Prec.
Rec.
Beijing
0.764
0.763
0.762
0.745
0.730
0.749
Shanghai
0.705
0.725
0.702
0.718
0.715
0.706
Shenzhen
0.740
0.737
0.710
0.742
0.732
0.722
Wuhan
0.727
0.723
0.731
0.739
0.744
0.719
Evaluation
• Efficiency study
• Inferring the AQIs for entire Beijing in 5 minutes
Procedures
Feature
extraction
(per grid)
𝐹𝑑 & πΉβ„Ž
𝐹𝑝
πΉπ‘Ÿ
Time(ms)
53.2
28.8
14.4
Procedures
Inference
(per grid)
Total
SC
Time(ms)
21.5
TC
13.1
131
Conclusion
• Infer fine-grained air quality with
– Real-time and historical air quality readings from existing stations
– Other data sources: meteorology, POIs, road network, human mobility, and
traffic condition
• Co-Training-based semi-supervised learning approach
– Deal with data sparsity by learning from unlabeled data
– Model the spatial correlation among the air quality of different locations
– Model the temporal dependency of the air quality in a location
• Results
– 0.82 with traffic data (co-training)
– 0.76 if only using spatial classifier
Next Step
• Predict the air quality in 1~2 hours
• Identify the root cause of the air pollutions by
– Studying the correlations between AQIs and different features
– Data visualization across multiple-domain data sources
Thanks!
Yu Zheng
yuzheng@microsoft.com
Homepage
Download