Remote Sensing of Environment 280 (2022) 113192
Unsupervised domain adaptation for global urban extraction using
Sentinel-1 SAR and Sentinel-2 MSI data
Sebastian Hafner, Yifang Ban *, Andrea Nascetti
Division of Geoinformatics, KTH Royal Institute of Technology, 114 28 Stockholm, Sweden
A R T I C L E  I N F O

Edited by Marie Weiss

Received 20 January 2022; Received in revised form 7 July 2022; Accepted 19 July 2022; Available online 4 August 2022
* Corresponding author. E-mail address: yifang@kth.se (Y. Ban).
https://doi.org/10.1016/j.rse.2022.113192
0034-4257/© 2022 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

A B S T R A C T
Accurate and up-to-date maps of built-up areas are crucial to support sustainable urban development. Earth
Observation (EO) is a valuable data source to cover this demand. In particular, Sentinel-1 Synthetic Aperture
Radar (SAR) and Sentinel-2 MultiSpectral Instrument (MSI) missions offer new opportunities to map built-up
areas on a global scale. Using Sentinel-2 images, recent urban mapping efforts achieved promising results by
training Convolutional Neural Networks (CNNs) on available built-up data. However, these results strongly
depend on the availability of local reference data for fully supervised training or assume that the application of
CNNs to unseen areas (i.e. across-region generalization) produces satisfactory results. To alleviate these shortcomings, it is desirable to leverage Semi-Supervised Learning (SSL) algorithms that can take advantage of unlabeled data, especially because satellite data is plentiful. In this paper, we propose a novel Domain Adaptation
(DA) approach using SSL that jointly exploits Sentinel-1 SAR and Sentinel-2 MSI to improve across-region
generalization for built-up area mapping. Specifically, two identical sub-networks are incorporated into the
proposed model to perform built-up area segmentation from SAR and optical images separately. Assuming that
consistent built-up area segmentation should be obtained across data modalities, we design an unsupervised loss
for unlabeled data that penalizes inconsistent segmentation from the two sub-networks. Therefore, we propose to
use complementary data modalities as real-world perturbations for consistency regularization. For the final
prediction, the model takes both data modalities into account. Experiments conducted on a test set comprised of
sixty representative sites across the world showed that the proposed DA approach achieves strong improvements
(F1 score 0.694) over fully supervised learning from Sentinel-1 SAR data (F1 score 0.574), Sentinel-2 MSI data
(F1 score 0.580) and their input-level fusion (F1 score 0.651). To demonstrate the effectiveness of DA, we also
performed a comparison with two state-of-the-art products, namely GHS-BUILT-S2 and WSF 2019, on the test set.
The comparison showed that our model is capable of producing built-up area maps with comparable or even
better quality than the state-of-the-art global human settlement maps. Therefore, multi-modal DA offers great potential for producing easily updateable human settlement maps at a global scale.
Keywords:
Built-up area mapping
Deep learning
Data fusion
Semi-supervised learning
Domain adaptation
Semantic segmentation
1. Introduction
Today, more than half of the global population lives in cities, and this number is growing rapidly: by the middle of the 21st century, the fraction of the global population residing in urban areas is projected to exceed two thirds (United Nations, 2014). Coinciding with the population growth, global urban extent is expanding at an unprecedented rate (Liu et al., 2020). In light of this, the development of robust urban mapping methods is crucial to ensure accurate and up-to-date information on urban landscapes and for the subsequent monitoring of urbanization at a global scale.
Earth Observation (EO) has become an invaluable tool for large-scale mapping due to its unique capability of providing regular and consistent spatial information on the Earth's surface. With the opening of the Landsat archive in 2008 (Woodcock et al., 2001) and the recent launch of the Sentinel-1 Synthetic Aperture Radar (SAR) and Sentinel-2 MultiSpectral Instrument (MSI) missions, a vast amount of medium- and high-resolution (10–30 m) data with global coverage has become available free of charge. The Copernicus Programme's Sentinel-1/-2 missions are particularly interesting for large-scale mapping due to their capability to systematically acquire images with large swaths at high temporal resolution. Therefore, Sentinel data is playing a key role in the development of robust methods for accurate and up-to-date urban mapping at a global scale.
Over the past decade, several research groups have developed fully automatic urban extraction methods with large-scale mapping capability for SAR data (Gamba et al., 2010; Esch et al., 2013; Ban et al., 2015; Chini et al., 2018). The underlying hypothesis of SAR-based urban mapping is that Built-Up Area (BUA) shows distinctive scattering behaviour characterized by very high backscatter from double-bounce effects of buildings. Urban extractors typically consist of three processing steps: (1) hand-crafting of spatial and/or textural features, (2) extraction of urban areas, and (3) post-processing to reduce false positives from mountains facing the SAR sensor or other land cover classes characterized by high backscattering values. The Global Urban Footprint (GUF), for example, was derived from 3 m resolution TerraSAR-X and TanDEM-X SAR images collected in 2011–2012 (Esch et al., 2012, 2017; Esch et al., 2013). GUF accurately maps global human settlement at 12 m resolution; however, the use of Very High-Resolution (VHR) (<10 m) commercial imagery makes producing frequent updates of GUF infeasible. Ban et al. (2015) applied ENVISAT Advanced SAR images with a robust urban extractor and demonstrated that accurate urban mapping at a global scale is possible from high-resolution SAR images. More recently, Sentinel-1 SAR backscatter intensity and coherence information was employed to map BUA in the Mediterranean region and Northern Africa, where, on average, a 92% agreement with GUF was achieved (Chini et al., 2018).
While SAR captures information on the structure and dielectric properties of surfaces, optical sensors provide information on surface reflectance characteristics. The delineation of urban areas from optical imagery is, however, complex due to the spectral heterogeneity of urban landscapes. In the past, spectral indices and spatial features were proposed to enhance urban mapping (Gong et al., 1992; Zha et al., 2003; Xu, 2008). However, the spectral signatures of artificial impervious areas and bare land are easily confused, which is one of the major limitations of applying spectral indices for urban mapping (Gong et al., 2019; Assyakur et al., 2012). Several efforts explored supervised Machine Learning (ML) for pixel-based classification (Pesaresi et al., 2016a; Goldblatt et al., 2018; Liu et al., 2019; Gong et al., 2020). Supervised ML algorithms can effectively capture the diverse spectral characteristics of urban areas but require labeled pixels as examples to learn from. Since manually labeling satellite data requires expertise and is very time-intensive, various ways to automatically generate pixel-based labels have been explored. Some, for example, exploited nighttime-light data to derive reference data and trained ML classifiers such as Random Forest (RF) or Support Vector Machine (SVM) using 30 m resolution Landsat data as input (Goldblatt et al., 2018; Liu et al., 2019). Others derived reference data from OpenStreetMap (OSM) to produce the 30 m resolution Global Human Settlement Layer (GHSL) for 2014 from Landsat images using symbolic ML (Pesaresi et al., 2016b). Later, this methodology was applied to Sentinel-2 images, and improvements in terms of spatial detail and thematic content with respect to the Landsat-derived product were reported (Pesaresi et al., 2016a).
Current trends in optical-based urban mapping have, however, shifted from traditional ML classifiers such as RF or SVM to Deep Learning (DL) models, specifically Convolutional Neural Networks (CNNs). CNNs, in contrast to traditional ML classifiers, are capable of learning useful feature representations directly from images, making hand-crafted features unnecessary. Moreover, CNNs have been shown experimentally to outperform traditional classifiers for a variety of image classification tasks in remote sensing (Kussul et al., 2017; Liu et al., 2018a; DeLancey et al., 2020). Qiu et al. (2020) presented a DL-based framework to map human settlement extent at large scale from Sentinel-2 images using existing products in European cities as training labels. Their framework proved to be effective in mapping human settlement extent in 10 representative areas across the world and, moreover, produced promising results compared to several baseline products, including GUF and GHSL. A DL-based approach was also used by Corbane et al. (2020b) to produce a global map of BUA for the reference year 2018 (GHS-BUILT-S2) from cloud-free Sentinel-2 composites (Corbane et al., 2020a). Their multi-model approach involves training a separate model for each UTM grid zone on the best freely available settlement dataset (i.e. local training). The comparison against the local training sets provided evidence of refined BUA detection by GHS-BUILT-S2.
Exploiting the complementary information from SAR and optical data, the advantages of SAR-optical data fusion have been explored for improved urban mapping (e.g., Pacifici et al., 2008; Ban and Jacob, 2013, 2016). Most of the recent studies employed Sentinel-1 SAR and Landsat data in combination with traditional ML classifiers (Corbane et al., 2017; Marconcini et al., 2020; Zhang et al., 2020b; Lin et al., 2020). Corbane et al. (2017), for example, integrated Sentinel-1 SAR data into the framework of the GHSL to enhance urban mapping capabilities. Others used an ensemble of SVMs to outline the global human settlement extent at 10 m resolution for 2015, better known as the World Settlement Footprint 2015 (WSF2015) (Marconcini et al., 2020). Training data were selected by thresholding spectral indices and backscattering in optical and SAR images, respectively. Sentinel-1 and Landsat data were also combined for the development of a global 30 m resolution impervious surface map (Zhang et al., 2020b). Zhang et al. (2020b) used nighttime-light data, MODIS enhanced vegetation index data and the GlobeLand30 land-cover product for the automatic derivation of training samples. Finally, Sentinel-1 SAR and Landsat data were incorporated into a ML framework to map impervious surfaces at a large scale in China, and improved accuracy compared with using optical Landsat data alone was observed (Lin et al., 2020). Recently, multitemporal Sentinel-1 SAR and Sentinel-2 optical data were jointly exploited to produce the WSF 2019 (Marconcini et al., 2021). Specifically, Marconcini et al. (2021) adapted the label generation workflow from WSF2015 to train a RF classifier.
In light of the recent trends in urban mapping, particularly data fusion and DL, it is crucial to consider two major limitations of the existing work. First, data fusion in the existing literature is predominantly performed by combining extracted features from different data modalities before feeding them to traditional ML classifiers (i.e. input-level fusion). However, input-level fusion does not take the peculiarities of the different data into account and may therefore not result in considerable improvements compared to using single-sensor data, particularly for CNNs (Schmitt et al., 2020; Hafner et al., 2022). Therefore, it is necessary to develop novel DL-based fusion techniques. In fact, the integration of ML into data fusion beyond simple input-level fusion was identified as a key challenge in remote sensing (Schmitt and Zhu, 2016). Second, the existing DL literature for urban mapping relies heavily on either the availability of local reference data or on the generalization ability of supervised models beyond their training area. The availability of reference data has, however, been a major practical problem ever since supervised DL became the key to solving computer vision tasks. Furthermore, mapping performance may be poor when a model is transferred to new areas beyond its training area due to the model's assumption that satellite data in unseen areas follows the distribution of the satellite data it was trained on (Crawford et al., 2013). This assumption is rather strong for remote sensing-based mapping in general and for urban mapping in particular, considering the vast diversity in the morphological-spatial configuration of urban landscapes around the globe (Taubenböck et al., 2020). While previous studies demonstrated that it is possible to train a multitude of models on local reference data, this requires large computational processing resources, which complicates the production of frequent updates (e.g. on an annual basis). Moreover, mapping results may be spatially inconsistent because local models trained on different reference data are being deployed. Finally, at a global scale even the best local reference data is not always of sufficient quality for training (Corbane et al., 2020b), which inevitably leads to across-region generalization problems (Woodcock et al., 2001). Therefore, it is of great significance to address changes in the data distribution between the training area and potential model deployment regions for the development of robust urban extraction methods.

Fig. 1. Overview of the data preparation workflow used to generate data for the training, validation and test sites. Sentinel-1 SAR and Sentinel-2 MSI images are generated for all sites. Labels for the labeled training and validation sites are derived from Microsoft's building footprints, while labels for the test sites are derived from the SpaceNet7 building footprints. The test sites are spatially disjoint from all training (labeled and unlabeled) and validation sites.
Domain Adaptation (DA) has been developed by the ML community as a response to challenges related to transferring a model from its training area (i.e., source domain) to new areas (i.e., target domain) with different underlying data distributions. DA is a sub-category of transfer learning focusing on transferring knowledge gained in one domain of interest to a new domain of interest in order to avoid data-labeling efforts (Pan and Yang, 2009). The domains can be associated with geographical locations, where the source and target domains correspond to the geographical training locations of the model and the new areas the model is transferred to for deployment, respectively (Hu et al., 2020). Therefore, DA is a potential solution to overcome across-region generalization problems. A variety of DA methods can be found in the remote sensing literature (Tuia et al., 2016). Active learning aims to adapt the model by including labeled data from the target domain in the supervised model training. To keep the labeling cost low, only a limited number of well-chosen examples from the target domain is used (Tuia et al., 2011; Crawford et al., 2013). The idea of improving the generalization ability of classifiers by active learning for large-scale mapping is not new (Alajlan et al., 2013; Hamrouni et al., 2021). For example, Alajlan et al. (2013) applied an active learning-based approach to generate a land cover map at a continental scale from MODIS data using an SVM classifier, and Hamrouni et al. (2021) adapted a locally trained RF classifier for poplar plantation mapping at a national scale using Sentinel-2 data. However, active learning for global-scale mapping may require the addition of representative training samples from many different locations to achieve good generalization. In turn, considerable labeling costs may arise, particularly for urban mapping, where a vast heterogeneity of urban forms exists around the globe. On the other hand, a plethora of unlabeled satellite imagery is readily available at no cost.
In unsupervised DA, the model is adapted to the target domain by incorporating unlabeled data from the target domain into model training via Semi-Supervised Learning (SSL). In recent years, SSL has brought forward several powerful methods to incorporate unlabeled data into supervised training (Chapelle et al., 2009). While a comprehensive review of SSL methods is outside the scope of this paper (we refer interested readers to Chapelle et al. (2009)), Oliver et al. (2018) grouped the current state-of-the-art for SSL into two classes: consistency regularization (Laine and Aila, 2016; Sajjadi et al., 2016), which enforces that perturbations of a sample should not significantly change the model output; and entropy minimization, which encourages more confident predictions on unlabeled data. The consistency regularization method mean teacher is particularly promising, achieving state-of-the-art results in several SSL benchmarks. Mean teacher maintains an exponential moving average of the model weights during training and, for regularization purposes, penalizes predictions of the single latest model that are inconsistent with those of the temporal ensemble model (Tarvainen and Valpola, 2017). Consistency regularization based on the mean teacher framework was also recently investigated for remote sensing-based semantic segmentation, and it was shown that SSL can significantly improve model performance by adding unlabeled data (Zhang et al., 2020a; Wang et al., 2020). Mean teacher is, however, not only used for general SSL but was later also adapted for unsupervised DA in image classification and semantic segmentation (French et al., 2018; Cui et al., 2019). Recently, mean teacher was applied to alleviate across-region generalization problems between cities for Sentinel-2-based local climate zone classification, and accuracy improvements in comparison to fully supervised learning were found (Hu et al., 2020). Consistency regularization, therefore, holds great potential to overcome across-region generalization problems in remote sensing in general and in urban mapping in particular.
In this research, we propose a novel unsupervised DA approach to
improve across-region generalization for BUA mapping. To that end, we
introduce a multi-modal consistency regularization scheme using the
fusion of Sentinel-1 SAR and Sentinel-2 MSI data. Our DA approach
includes two CNN branches to obtain BUA predictions from Sentinel-1
SAR and Sentinel-2 MSI data separately, and a third BUA prediction is
obtained by fusing the features extracted by the two branches. In
addition to the supervised loss, we add an unsupervised loss that penalizes inconsistent predictions obtained from corresponding unlabeled SAR and optical data. Therefore, we exploit different but complementary data modalities as real-world perturbations for consistency regularization, in contrast to conventional consistency regularization approaches that simulate realistic perturbations with data augmentation techniques. To demonstrate the effectiveness of our approach, we test it across the globe using the new SpaceNet7 dataset (Van Etten et al., 2021), and qualitatively compare our BUA maps with two state-of-the-art products, namely GHS-BUILT-S2 and WSF 2019. To the best of our knowledge, this is the first unsupervised DA approach in remote sensing that proposes the fusion of SAR and optical data to overcome spatial generalization problems for the task of land cover segmentation. Beyond the improvement of across-region generalization, we also evaluate the data fusion of SAR and optical images for global BUA mapping.

Fig. 2. Training and validation sites. Labeled data is only used in the United States, Canada and Australia.
2. Data and study area

Fig. 1 illustrates an overview of the data preparation workflow for the Sentinel-1 SAR, Sentinel-2 MSI and building footprint data used to generate the training, validation and test sets. All data were preprocessed in, and subsequently exported from, the cloud-based platform Google Earth Engine (GEE) (Gorelick et al., 2017). GEE has become one of the most popular platforms for geospatial big data analysis, and several studies have highlighted its potential to analyse large amounts of geospatial data in a timely manner (e.g. Liu et al., 2018b; Gong et al., 2020; Zhang et al., 2020a; Zhang et al., 2020b; Goldblatt et al., 2018; Ravanelli et al., 2018). One of the advantages of GEE is that Sentinel-1 SAR data and Sentinel-2 optical data are directly available as analysis-ready data cubes. Preprocessing steps for both data modalities are described in detail in Section 2.1. Thereafter, the training and validation set and the test set are described in Sections 2.2 and 2.3, respectively.

2.1. Sentinel-1/-2 data and preprocessing

The Sentinel-1 mission collects C-band SAR images at 20 m resolution with dual polarization capability (HH + HV and VV + VH). Sentinel-1 images in GEE were preprocessed to Ground Range Detected (GRD) images using the Sentinel-1 Toolbox. Preprocessing includes thermal noise removal, radiometric calibration and terrain correction. In addition, backscatter coefficients (σ) were converted to decibels via log scaling (10·log10(x)). Our preprocessing starts by obtaining all dual-band VV + VH scenes acquired in Interferometric Wide Swath (IW) mode in a given time period. Ascending and descending pass scenes were separated due to the strong influence of the incidence angle on the backscatter coefficient of BUA. Only scenes from the pass with better data availability in terms of absolute image count were selected. We then masked backscatter coefficients lower than −25 dB in each scene to remove noisy data. The remaining observations were used to compute the per-pixel temporal mean for both polarizations separately. The temporal mean is an effective method to remove speckle noise from SAR data without reducing the resolution (Chini et al., 2018). Finally, pixel values were normalized from the range of input values [−25, 0] to [0, 1].

The Sentinel-2 mission collects optical images in 13 spectral bands at various spatial resolutions. Band 2 (blue), Band 3 (green), Band 4 (red) and Band 8 (near-infrared) are provided at 10 m resolution, while Band 5 (red edge 1), Band 6 (red edge 2), Band 7 (red edge 3), Band 8a (red edge 4), Band 11 (short wave infrared 1), and Band 12 (short wave infrared 2) are provided at 20 m resolution. The remaining 3 bands provided at 60 m resolution (Band 1, Band 9 and Band 10) were not used because they contain atmospheric information that is not considered relevant for BUA mapping. Sentinel-2 images are available in GEE as ortho-corrected images scaled by a factor of 10,000 (UTM projection) at two processing levels, Level-1C and Level-2A. Level-1C data representing top-of-atmosphere reflectance was chosen over Level-2A data representing surface reflectance because not all early acquisitions of the Sentinel-2 mission are available at Level-2A in GEE and, moreover, early coverage is not global. The preprocessing to compute a cloud-free image from a time series of Sentinel-2 scenes is based on temporal statistics. More specifically, the median of the pixel time series is computed after masking values that correspond with a high probability (≥80%) to clouds. Cloud probability is retrieved via the Sentinel Hub cloud detector for Sentinel-2 imagery (https://github.com/sentinel-hub/sentinel2-cloud-detector), which was recently added to GEE as a precomputed dataset. Finally, pixel values were normalized from the range [0, 10,000] to [0, 1].
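To make the compositing concrete, the sketch below reproduces the described preprocessing in the GEE Python API. It is a minimal illustration under stated assumptions rather than the authors' export script: the collection IDs are the standard public GEE datasets, while the area of interest, the date range and the choice of the descending pass are placeholders.

```python
# Sketch of the Sentinel-1/-2 compositing described above (GEE Python API).
import ee

ee.Initialize()

aoi = ee.Geometry.Rectangle([18.0, 59.2, 18.2, 59.4])  # hypothetical site

# Sentinel-1: IW-mode VV+VH scenes from one orbit pass, masked below -25 dB,
# reduced to the temporal mean and normalized from [-25, 0] dB to [0, 1].
s1 = (ee.ImageCollection('COPERNICUS/S1_GRD')
      .filterBounds(aoi)
      .filterDate('2016-01-01', '2017-01-01')
      .filter(ee.Filter.eq('instrumentMode', 'IW'))
      .filter(ee.Filter.listContains('transmitterReceiverPolarisation', 'VH'))
      .filter(ee.Filter.eq('orbitProperties_pass', 'DESCENDING'))  # pass with more scenes
      .select(['VV', 'VH']))
s1_mean = (s1.map(lambda img: img.updateMask(img.gte(-25)))  # drop noisy pixels
           .mean()
           .unitScale(-25, 0).clamp(0, 1))

# Sentinel-2 (Level-1C): mask pixels with cloud probability >= 80% using the
# precomputed s2cloudless dataset, take the temporal median and rescale
# reflectances from [0, 10000] to [0, 1].
s2 = (ee.ImageCollection('COPERNICUS/S2')
      .filterBounds(aoi)
      .filterDate('2016-01-01', '2017-01-01'))
clouds = ee.ImageCollection('COPERNICUS/S2_CLOUD_PROBABILITY')
joined = ee.ImageCollection(ee.Join.saveFirst('clouds').apply(
    primary=s2, secondary=clouds,
    condition=ee.Filter.equals(leftField='system:index', rightField='system:index')))

def mask_clouds(img):
    prob = ee.Image(img.get('clouds')).select('probability')
    return img.updateMask(prob.lt(80))

s2_median = (joined.map(mask_clouds).median()
             .select(['B2', 'B3', 'B4', 'B5', 'B6', 'B7', 'B8', 'B8A', 'B11', 'B12'])
             .divide(10000).clamp(0, 1))
```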
2.2. Training and validation set

To build the training and validation set, Sentinel data were acquired for the year 2016 using our preprocessing workflow (Section 2.1). In total, 3906 Sentinel-1 SAR scenes and 10,103 Sentinel-2 MSI scenes were used to generate a mean-Sentinel-1 image and a median-Sentinel-2 image for each of the 96 sites constituting the dataset (Fig. 2).

Fig. 3. Number of pixels in the training and validation set.

Fig. 4. Sixty test sites grouped into the source domain and target domain. The target domain sites are further grouped into the five regions Europe (EU), Latin America (LA), Sub-Saharan Africa (SSA), Islamic World (IW), and Asia, based on the cultural-geographic regions adopted from Huntington (1997). Numbers in brackets denote the number of sites comprising a regional group.
All 26 labeled sites are located in the United States, Canada and Australia. These countries represent the source domain, for which open-access Microsoft building footprints are available (Microsoft, 2018). However, in many other parts of the world, particularly in the Global South, this information is missing. Therefore, the dataset features 66 unlabeled training sites spread across six continents to adapt the model to the target domain using SSL. Microsoft's computer-generated building footprints are accessible on GitHub in separate repositories for the United States (https://github.com/microsoft/USBuildingFootprints), Canada (https://github.com/Microsoft/CanadianBuildingFootprints) and Australia (https://github.com/microsoft/AustraliaBuildingFootprints). The approximately 145 million building footprints were automatically generated using the Microsoft Cognitive Toolkit (CNTK). Specifically, Microsoft's Bing team first applied deep neural networks (ResNet34 with RefineNet upsampling) to segment buildings in VHR Bing satellite imagery. The satellite imagery for the segmentation in the United States was acquired between 2012 and 2015; unfortunately, this information is not available for Canada and Australia. Segmentation outputs were then curated, and a polygonization algorithm was applied to detect building edges and angles for the generation of building footprints. The quality of the produced building footprints was assessed using evaluation sets of approximately 15,000, 45,000 and 7000 buildings for the United States, Canada and Australia, respectively. For all three datasets, precision is above 0.980 and recall is above 0.650. A definition of these accuracy metrics is given in Section 3.4.
Our label preparation started with transferring the building footprints of the three datasets from the respective GitHub repositories to GEE. We then converted all building footprint polygons lying within labeled training and validation sites to raster at 10 m resolution. The resulting raster layer stores the per-pixel percentage of overlap with buildings. Finally, the layer was reprojected to UTM and resampled to establish a correspondence with the Sentinel data. It should be noted that, due to the temporal gap between the Bing and Sentinel imagery, labels may not perfectly correspond to the Sentinel imagery. In order to avoid ambiguity in our mapping approach, we will hereinafter refer to our target class as BUA. We adopt the BUA definition from the GHSL, which defines BUA as "the union of all the satellite data samples that corresponds to a roofed construction above ground which is intended or used for the shelter of humans, animals, things, the production of economic goods or the delivery of services" (Pesaresi et al., 2013). Accordingly, impervious surfaces such as roads and parking lots are not part of BUA. Sentinel images for all sites and BUA labels for the labeled training and validation sites were downloaded from GEE. Finally, the images were split into tiles of 256 × 256 pixels.
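The rasterization step can be sketched in GEE as follows, assuming the footprints have been ingested as an ee.FeatureCollection; the asset path and UTM zone below are hypothetical. A fine-scale binary building mask is averaged into 10 m pixels to approximate the per-pixel building coverage.

```python
# Sketch: building polygons -> per-pixel fraction of building coverage at 10 m.
import ee

ee.Initialize()

footprints = ee.FeatureCollection('users/example/building_footprints')  # hypothetical asset

# 1 inside buildings, 0 elsewhere, defined on a fine (1 m) UTM grid.
binary = (ee.Image(0).byte()
          .paint(footprints, 1)
          .setDefaultProjection(crs='EPSG:32633', scale=1))  # hypothetical UTM zone

# Average the fine-scale mask into 10 m pixels (fractional building coverage)
# on the same UTM grid as the exported Sentinel data.
label = (binary
         .reduceResolution(reducer=ee.Reducer.mean(), maxPixels=1024)
         .reproject(crs='EPSG:32633', scale=10))
```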
The size of the training and validation set is shown in Fig. 3. The labeled training set contains about 1.32 · 10⁹ pixels covering more than 132,000 km², of which approximately 11% corresponds to BUA. The unlabeled training set is larger than the labeled training set, with about 1.79 · 10⁹ pixels covering more than 179,000 km². The validation set is considerably smaller, with about 2.22 · 10⁸ pixels covering more than 22,000 km² (approximately 10% BUA).
2.3. Test set
In order to assess the generalization ability of networks, accurate and
reliable building footprints are required from a diverse set of cities.
Creating such a dataset would, however, be labour intensive and
expensive. Therefore, we leveraged building footprint labels from the
SpaceNet 7 Multi-Temporal Urban Development Challenge (Van Etten
et al., 2021). The openly available SpaceNet7 dataset features temporal
stacks of monthly Planet composites, including corresponding manually
annotated building footprints, ranging from 2018 to the beginning of
2020. The dataset comprises 101 sites spread across six continents,
whereas building footprints are available for the 60 training sites, which
are mostly located in rapidly urbanizing areas. Fig. 4 visualizes the
locations of the SpaceNet7 training sites, which were adopted as test sites. It should be noted that the resulting test set is spatially disjoint from the training and validation set, meaning that there is no spatial overlap between any of the test sites and the training (labeled and unlabeled) or validation sites. The timestamp used for our urban extraction test set (i.e., the month of the year 2019) was chosen based on the best agreement with WSF 2019, under the requirement that the Planet composite is cloud-free and, consequently, all buildings were labeled. This selection procedure was chosen in order to allow for a fair comparison with WSF 2019, which does not specify a particular month of the year 2019. The metadata for the sites constituting the test set is listed in the Appendix (Table 3). Finally, the 60 sites were grouped into source and target domain according to their geographical location. The target domain sites were further grouped into five cultural/geographical regions, namely Europe (EU), Latin America (LA), Sub-Saharan Africa (SSA), Islamic World (IW) and Asia. The grouping is based on the geographical/cultural regions defined by Huntington (1997) (cf. Appendix Fig. 17), with the difference that the four Asian geographical/cultural regions were merged due to the limited availability of sites located within Asia. It should be emphasized that these regions are an oversimplification of our complex globalized world (e.g. Berman, 2004). Nevertheless, there are clusters of cities with distinctive morphological-spatial configurations corresponding to the geographic/cultural regions (Taubenböck et al., 2020). The simplified geographical/cultural regions were adopted in order to compute regional statistics of network performance. Following our data preprocessing workflow (Section 2.1), a total of 1292 Sentinel-1 SAR scenes and 2531 Sentinel-2 MSI scenes were used to generate a mean-Sentinel-1 image and a median-Sentinel-2 image for each of the 60 test sites. The corresponding building footprints were converted to per-pixel BUA labels, following the workflow used for the Microsoft building footprints (Section 2.2).

The size of the test set is summarized in Fig. 5. Across the six regions, it contains about 1.11 · 10⁷ pixels, where the fewest pixels are available for Europe (5.38 · 10⁵) and the most pixels for Asia (3.34 · 10⁶). The BUA percentage ranges from 13% (Source) to 21% (LA). In total, the test set covers an area of 1112 km².

Fig. 5. Number of pixels in the test set by region.

3. Methodology

This section starts by introducing the network architecture, followed by a description of the proposed approach to improve across-region generalization by incorporating unlabeled Sentinel-1 and Sentinel-2 data into training. We then introduce the experimental setup used to demonstrate the effectiveness of our unsupervised DA approach. Finally, the accuracy metrics and the state-of-the-art products are described.

3.1. CNN architecture

CNNs are the most used models for semantic segmentation and are considered state-of-the-art for computer vision tasks (Long et al., 2015). U-Net is a particular CNN architecture consisting of a contracting path capturing context and an expansive path enabling precise localization (Ronneberger et al., 2015). The U-Net architecture was originally developed for biomedical image segmentation but has become a popular choice of architecture for a large variety of semantic segmentation tasks. In remote sensing, U-Net was successfully used for land cover classification (Rakhlin et al., 2018) and water body extraction (Feng et al., 2018), for example.

Fig. 6 shows the lightweight architecture based on U-Net (hereinafter LU-Net) used in this work. Like the original architecture, it consists of a contracting path (left side) where downsampling blocks are repeatedly applied and an expansive path (right side) where upsampling blocks are repeatedly applied. The number of downsampling and upsampling blocks is also referred to as the network depth. With a network depth of 4, the original U-Net has about 30 million trainable parameters. Recent studies mapping human settlement from Sentinel-2 images showed, however, that good results can be achieved with light architectures having about 1 million trainable parameters (Qiu et al., 2020; Corbane et al., 2020b; Hafner et al., 2022). In Qiu et al. (2020), the light architecture even compared favorably to larger architectures. Therefore, the depth of the original U-Net architecture was reduced from 4 to 2. As a result, the LU-Net has about 930,000 trainable parameters, which is similar to the aforementioned light architectures.

A downsampling step in the LU-Net consists of applying twice the operation triplet 3 × 3 convolution (Conv) with padding, batch normalization (BatchNorm) (Ioffe and Szegedy, 2015) and rectified linear unit activation (ReLU) (Nair and Hinton, 2010), followed by a 2 × 2 max pooling (MaxPool) operation. With each downsampling step, the number of feature channels is doubled, starting at 64 channels, and the x-y-size is halved. In addition to increasing the number of features and, consequently, the number of parameters, adding more downsampling steps also increases the receptive field, i.e., the size of a pixel's neighbourhood in the input image considered for its classification. LU-Net consists of two downsampling steps in the contracting path and two upsampling steps in the expansive path. Upsampling steps invert the operations in the downsampling path by doubling the x-y-size and halving the number of channels in the feature maps. This is done via a 2 × 2 transpose convolution operation, followed by applying twice the operation triplet also used in the downsampling steps. Furthermore, skip connections are used to directly convey feature maps from the contracting path to the corresponding steps in the expansive path. In the end, the feature map has the original x-y-size of the input image and 64 channels. The final operation converts this feature map into a single-channel output via a 1 × 1 convolution, followed by the sigmoid activation function. Consequently, the network outputs continuous values between 0 and 1, representing BUA probability.

Fig. 6. Our lightweight U-Net architecture adaptation. White boxes correspond to multi-channel feature maps. The number of channels and the x-y-size are denoted on top of each box and on its left side, respectively. Operations are visualized as colour-coded arrows connecting the feature maps (see legend).
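The following PyTorch sketch illustrates an LU-Net with the stated design: depth 2, two Conv-BatchNorm-ReLU triplets per block, 64 channels doubled at each downsampling step, transpose convolutions and skip connections. Module names and implementation details are our assumptions and may differ from the released code.

```python
# Minimal LU-Net sketch (depth 2) following the description in Section 3.1.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two (3x3 Conv -> BatchNorm -> ReLU) triplets.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class LUNet(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.inc = conv_block(in_ch, 64)    # 64 channels at full resolution
        self.down1 = conv_block(64, 128)    # after first 2x2 max pooling
        self.down2 = conv_block(128, 256)   # after second 2x2 max pooling
        self.pool = nn.MaxPool2d(2)
        self.up1 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec1 = conv_block(256, 128)    # 128 (skip) + 128 (upsampled)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = conv_block(128, 64)     # 64 (skip) + 64 (upsampled)

    def forward(self, x):
        x0 = self.inc(x)
        x1 = self.down1(self.pool(x0))
        x2 = self.down2(self.pool(x1))
        y1 = self.dec1(torch.cat([x1, self.up1(x2)], dim=1))
        return self.dec2(torch.cat([x0, self.up2(y1)], dim=1))  # 64-channel map

# Out-operation: 1x1 convolution followed by sigmoid -> per-pixel BUA probability.
out_op = nn.Sequential(nn.Conv2d(64, 1, 1), nn.Sigmoid())
```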
Fig. 7. Overview of the proposed unsupervised domain adaptation approach. Model parameters are optimized with a supervised loss and a consistency loss for labeled and unlabeled data, respectively. The supervised loss is comprised of three loss terms: two for the sub-networks and one for the fusion of the features extracted from the sub-networks. The consistency loss is used to optimize model parameters in an unsupervised manner by training the sub-networks to agree on their predictions. The fusion prediction is used for inference.

3.2. Unsupervised domain adaptation approach

Consistency regularization has been increasingly used in SSL with the goal that realistic perturbations of the same sample should not significantly change the output of the model (Oliver et al., 2018). Over the past years, a series of approaches exploring different variations of this intuitive goal has been developed and achieved state-of-the-art performance for SSL tasks (Laine and Aila, 2016; Sajjadi et al., 2016; Tarvainen and Valpola, 2017), as well as DA tasks (French et al., 2018). Generally, model consistency is incorporated into network training by minimizing the distance between the model outputs for different augmented versions of the same sample. Rather than enforcing consistency across different augmented versions of unlabeled samples, in this work we propose to use multi-modal data in the form of Sentinel-1 SAR and Sentinel-2 optical images. Consequently, we hypothesize that the different but complementary urban information contained in SAR and optical data can be leveraged for domain adaptation using the fundamental principle of consistency regularization.

An overview of the proposed unsupervised DA approach is given in Fig. 7. The proposed model incorporates two sub-networks with identical LU-Net architectures to separately extract multi-channel feature maps from the SAR image (x_sar) and the optical image (x_opt), which is an effective architectural design for the joint use of Sentinel-1 SAR and Sentinel-2 MSI data (Hafner et al., 2022; Zhu et al., 2022). A detailed description of the employed LU-Net architecture was given in the previous Section 3.1, although the proposed approach is largely decoupled from the choice of network architecture. The multi-channel feature maps, having the same x-y-size as the original images, are used to obtain a per-pixel BUA probability for SAR (p_sar) and optical (p_opt) via an out-operation. The out-operation consists of a trainable 1 × 1 convolution followed by the sigmoid activation function. A third BUA probability (p_fus) is obtained by applying the out-operation to the concatenated feature maps extracted by the SAR and optical sub-networks (hereinafter fusion).

The model is trained in a fully supervised way on labeled data drawn from the source domain, using a loss function composed of three sub-terms:

L_Supervised = L_pJacc(p_sar, y) + L_pJacc(p_opt, y) + L_pJacc(p_fus, y)    (1)

The sub-terms separately measure the similarity between the three segmentation probability outputs (p_sar, p_opt and p_fus) and the BUA label (y), where similarity for all three sub-terms was measured using the Power Jaccard loss (Duque-Arias et al., 2021), defined as follows:

L_pJacc(p, y) = 1 − ((p · y) + ε) / ((p² + y² − p · y) + ε)    (2)

where ε is a very small number (i.e., 1 · 10⁻⁶) preventing a division by zero. The Power Jaccard loss is based on the Jaccard index, which measures the similarity between two finite sets as the intersection over union. The Power Jaccard loss supports continuous values by replacing intersection and union with product and sum, respectively, and it increases the weight of wrong predictions by introducing exponents in the denominator.
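A direct PyTorch translation of Eq. (2) could look as follows; aggregating the products and sums over all pixels of a sample is our assumption of how the set-based formula is applied to probability maps.

```python
# Power Jaccard loss (Eq. 2) for probability maps p and targets y in [0, 1].
import torch

def power_jaccard_loss(p: torch.Tensor, y: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    p, y = p.flatten(), y.flatten()
    intersection = (p * y).sum()
    union = (p.pow(2) + y.pow(2) - p * y).sum()  # exponents up-weight wrong predictions
    return 1.0 - (intersection + eps) / (union + eps)
```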
Alongside labeled data from the source domain, the proposed DA approach leverages unlabeled data from the target domain using consistency regularization to alleviate domain shifts. Typically, consistency regularization introduces a loss term measuring the similarity between model outputs for different augmentations of the same unlabeled sample (Oliver et al., 2018). This loss term is then added to the supervised loss, an idea first proposed in Bachman et al. (2014). However, due to the multi-modal nature of our approach, we apply a consistency loss to the model outputs of the sub-networks. Consequently, inconsistencies between the BUA probabilities obtained from the SAR and optical sub-networks are penalized during training. Like segmentation accuracy, consistency was measured via the Power Jaccard loss, but a hyper-parameter φ was added as a weight factor to regulate its impact on the final loss:

L_Consistency = φ · L_pJacc(p_opt, p_sar)    (3)

Fig. 8. Loss curves for (A) the SAR, optical and fusion approaches and (B) the Fusion-DA approach.

Table 1
F1 score, precision, recall, IoU and Kappa for the validation and test set.

             F1 score       Precision      Recall         IoU            Kappa
             Val.   Test    Val.   Test    Val.   Test    Val.   Test    Val.   Test
SAR          0.664  0.574   0.664  0.570   0.664  0.579   0.497  0.403   0.613  0.496
Optical      0.739  0.580   0.727  0.699   0.752  0.496   0.586  0.409   0.699  0.518
Fusion       0.774  0.651   0.745  0.712   0.805  0.599   0.631  0.482   0.738  0.593
Fusion-DA    0.764  0.694   0.691  0.661   0.855  0.730   0.618  0.531   0.724  0.634

Fig. 9. Regional histograms comparing the BUA probability outputs of our Fusion-DA model to those of the GHS-S2Nets (Corbane et al., 2020b).

Therefore, the core idea of our approach is to train the two sub-networks to produce similar segmentation outputs from unlabeled data for regularization. We achieve this by training the sub-networks and out-operations constituting the proposed model with a twofold loss function, consisting of a supervised loss for labeled samples and a consistency loss for unlabeled samples:

L = L_Supervised if y exists, L_Consistency otherwise    (4)

During training, we use mini-batch gradient descent, where a mini-batch can consist of labeled and unlabeled data. Consequently, the cost for a mini-batch is computed by determining the loss for each sample in the mini-batch separately according to Eq. (4), before adding them together. Once the model is trained, the fusion prediction is used for inference. Thus, both the SAR and the optical image are used for inference.
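Combining Eqs. (1), (3) and (4), the per-sample loss can be sketched as below. The three-output model interface mirrors Fig. 7; the function names are ours, with power_jaccard_loss as sketched after Eq. (2).

```python
# Twofold loss of Eq. (4): supervised for labeled samples, consistency otherwise.
phi = 0.5  # weight of the consistency loss (see Section 3.3)

def sample_loss(model, x_sar, x_opt, y=None):
    p_sar, p_opt, p_fus = model(x_sar, x_opt)
    if y is not None:  # labeled sample from the source domain, Eq. (1)
        return (power_jaccard_loss(p_sar, y)
                + power_jaccard_loss(p_opt, y)
                + power_jaccard_loss(p_fus, y))
    return phi * power_jaccard_loss(p_opt, p_sar)  # unlabeled target sample, Eq. (3)
```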
3.3. Experimental setup
We first set up three baselines by training LU-Nets in a fully supervised way using only the labeled training data but different satellite data as input, namely SAR, optical and fusion. Fusion corresponds to the concatenation of the SAR and optical image along the channel axis, which is commonly referred to as input- or feature-level fusion. These baselines are also used to evaluate the impact of domain shifts between geographic regions on the mapping performance for different input data. A fourth model was trained on the labeled and unlabeled training data using the proposed unsupervised DA approach. A detailed description of the training, including hyperparameter settings, is given in the following paragraphs.

Data augmentation is an effective approach to enhance the training dataset and has been widely used for semantic segmentation tasks. The basic idea of data augmentation is to modify the input data to generate more variants of each sample. We employed three data augmentation operations, namely rotations, flips and gamma correction, during model training. Rotations were implemented by randomly rotating images and labels by an angle of k · 90°, where k ∈ {0, 1, 2, 3}. Flips were applied by either horizontally or vertically flipping images and labels with a probability of 50%. Enhancing the training dataset with these basic operations improves the performance of CNNs in remote sensing scene classification compared to training on the original dataset (Yu et al., 2017). The third operation, gamma correction, is a non-linear operation defined as I^γ. We applied it channel-wise to the input data, where γ is randomly selected from [0.25, 2]. Gamma correction is not established as a data augmentation operation in remote sensing but showed promising results for improving the performance and robustness of classic CNN architectures in semantic segmentation of medical images (Sun et al., 2021). Consequently, gamma correction may improve the robustness of BUA mapping across different landscapes.
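A minimal sketch of the three augmentation operations, assuming image tensors of shape (C, H, W) with values in [0, 1]; whether gamma is drawn per channel or per image is not specified in the text, so a per-channel draw is assumed here.

```python
# Rotation, flip and gamma-correction augmentations for an image/label pair.
import random
import torch

def augment(img: torch.Tensor, label: torch.Tensor):
    k = random.randint(0, 3)  # rotation by k * 90 degrees
    img, label = torch.rot90(img, k, (1, 2)), torch.rot90(label, k, (1, 2))
    if random.random() < 0.5:  # horizontal or vertical flip with 50% probability
        dim = random.choice([1, 2])
        img, label = torch.flip(img, [dim]), torch.flip(label, [dim])
    gamma = torch.empty(img.shape[0], 1, 1).uniform_(0.25, 2.0)  # channel-wise gamma
    img = img.clamp(min=0).pow(gamma)  # label is left unchanged
    return img, label
```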
Hyperparameters were tuned empirically based on the performance on the training and validation set. AdamW (Loshchilov and Hutter, 2018), an improved version of Adam (Kingma and Ba, 2014), was employed as optimizer with an initial learning rate of 10⁻⁴. A batch size of 8 was used to account for the graphics card's memory (NVIDIA GeForce RTX 3090). Training was capped at 15 epochs, before which all models reached stable performance on the training and validation set (Appendix Fig. 16). For the DA approach, the weight of the consistency loss (φ) was set to 0.5. We implemented everything in Python using Facebook's DL framework PyTorch (Paszke et al., 2019). Code is available at https://github.com/SebastianHafner/DDA_UrbanExtraction.
Table 2
F1 score, precision, recall, IoU and Kappa averaged over the regions of the test set for our approaches (SAR, Optical, Fusion and Fusion-DA) and for GHS-BUILT-S2 (Corbane et al., 2020b) and WSF 2019 (Marconcini et al., 2021).

              F1 score       Precision      Recall         IoU            Kappa
SAR           0.570 ± 0.041  0.576 ± 0.048  0.571 ± 0.064  0.400 ± 0.039  0.490 ± 0.047
Optical       0.587 ± 0.077  0.697 ± 0.044  0.520 ± 0.115  0.420 ± 0.077  0.525 ± 0.075
Fusion        0.654 ± 0.052  0.707 ± 0.036  0.613 ± 0.082  0.488 ± 0.059  0.595 ± 0.053
Fusion-DA     0.692 ± 0.039  0.661 ± 0.043  0.728 ± 0.043  0.531 ± 0.047  0.632 ± 0.039
GHS-BUILT-S2  0.591 ± 0.068  0.485 ± 0.083  0.767 ± 0.033  0.423 ± 0.071  0.493 ± 0.082
WSF 2019      0.680 ± 0.042  0.652 ± 0.063  0.718 ± 0.059  0.517 ± 0.050  0.617 ± 0.044
3.4. Accuracy metrics
We used five common accuracy metrics for the quantitative accuracy assessment: F1 score, precision, recall, Intersection over Union (IoU) and Cohen's kappa coefficient (kappa). Formulas for the metrics are given in Eqs. (5)–(9), where TP are true positives, TN true negatives, FP false positives, and FN false negatives.
F1 score = TP / (TP + ½ · (FP + FN))    (5)

precision = TP / (TP + FP)    (6)

recall = TP / (TP + FN)    (7)

IoU = TP / (TP + FP + FN)    (8)

kappa = 2 · (TP · TN − FN · FP) / ((TP + FP) · (FP + TN) + (TP + FN) · (FN + TN))    (9)

Precision is the fraction of correct positives (TP) among the total predicted positives (TP + FP), and recall is the fraction of correct positives (TP) among the total existing positives (TP + FN). Both metrics make it possible to assess model performance on the minority class (i.e., BUA) (He and Ma, 2013). As a consequence, the joint analysis of precision and recall is effective for evaluating model performance on imbalanced datasets (Branco et al., 2016). The F1 score combines precision and recall into a single metric by taking their harmonic mean. IoU measures the intersection (TP) over the union (TP + FP + FN). The highest and lowest possible values for F1 score, precision, recall and IoU are 1 and 0, respectively. The fifth metric, kappa, can range from −1 to 1 and measures the agreement between model predictions and ground truth (Cohen, 1960). Kappa values are interpreted as follows: values ≤ 0 as no agreement, 0.01–0.20 as none to slight agreement, 0.21–0.40 as fair agreement, 0.41–0.60 as moderate agreement, 0.61–0.80 as substantial agreement and 0.81–1.00 as almost perfect agreement (Landis and Koch, 1977).
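For reference, Eqs. (5)–(9) translate directly into the following NumPy sketch (our illustration, not the authors' evaluation code):

```python
# Accuracy metrics of Eqs. (5)-(9) from binary prediction and label arrays.
import numpy as np

def metrics(pred: np.ndarray, true: np.ndarray) -> dict:
    tp = float(np.sum((pred == 1) & (true == 1)))
    tn = float(np.sum((pred == 0) & (true == 0)))
    fp = float(np.sum((pred == 1) & (true == 0)))
    fn = float(np.sum((pred == 0) & (true == 1)))
    return {
        'f1': tp / (tp + 0.5 * (fp + fn)),    # Eq. (5)
        'precision': tp / (tp + fp),          # Eq. (6)
        'recall': tp / (tp + fn),             # Eq. (7)
        'iou': tp / (tp + fp + fn),           # Eq. (8)
        'kappa': 2 * (tp * tn - fn * fp)      # Eq. (9)
                 / ((tp + fp) * (fp + tn) + (tp + fn) * (fn + tn)),
    }
```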
Fig. 10. Quantitative results on the test set grouped by geographical region for our approaches (SAR, Optical, Fusion and Fusion-DA) and for GHS-BUILT-S2 (Corbane et al., 2020b) and WSF 2019 (Marconcini et al., 2021). Box plots depict the F1 score for the sites constituting a region.

Fig. 11. Qualitative comparison of (f) our Fusion-DA approach with (d) GHS-BUILT-S2 (Corbane et al., 2020b) and (e) WSF 2019 (Marconcini et al., 2021) for a SpaceNet7 site located in the United States. (a), (b) and (c) show the Sentinel-1 SAR image (VV band), the Sentinel-2 MSI image (red: B4, green: B3, blue: B2) and the SpaceNet7 ground truth, respectively.

3.5. State-of-the-art comparison

Two global BUA products are considered for the state-of-the-art comparison, namely GHS-BUILT-S2 and WSF 2019. GHS-BUILT-S2 outlines BUA at 10 m resolution for 2018 using Sentinel-2 MSI data and a multi-model approach with local training leveraging open settlement datasets (Corbane et al., 2020a, 2020b). The pretrained GHS-S2Nets for the different UTM grid zones are available on GitHub (https://github.com/ec-jrc/GHS-S2Net) and were used to predict BUA for the corresponding 60 SpaceNet7 sites using the selected months in 2019. We also adopted the Sentinel-2 MSI data preprocessing presented in Corbane et al. (2020a) to ensure it matches that of the training data of the GHS-S2Nets. Since the GHS-S2Nets output continuous values in the range [0, 1], a cut-off value was applied for the binary accuracy assessment. However, the results presented in Corbane et al. (2020b) indicate that the cut-off value requires regional fine-tuning. Therefore, we determined the optimal cut-off value for each test site with respect to the label, considering all values in the range [0, 1] with a step size of 0.05. Following that, all binarized GHS results were generated using site-specific optimal cut-off values.
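This site-specific cut-off search can be sketched as below; the paper does not state the selection criterion, so maximizing F1 against the label is assumed here (metrics refers to the sketch in Section 3.4).

```python
# Sweep cut-off values 0.00, 0.05, ..., 1.00 and keep the F1-maximizing one.
import numpy as np

def best_cutoff(prob: np.ndarray, true: np.ndarray):
    best_t, best_f1 = 0.5, -1.0
    for t in np.arange(0.0, 1.0001, 0.05):
        pred = (prob >= t).astype(np.uint8)
        if pred.sum() == 0 or pred.sum() == pred.size:
            continue  # skip degenerate thresholds where the metrics are undefined
        f1 = metrics(pred, true)['f1']
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```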
Fig. 12. Qualitative comparison of (f) our Fusion-DA approach with (d) GHS-BUILT-S2 (Corbane et al., 2020b) and (e) WSF 2019 (Marconcini et al., 2021) for a SpaceNet7 site located in Panama (LA). (a), (b) and (c) show the Sentinel-1 SAR image (VV band), the Sentinel-2 MSI image (red: B4, green: B3, blue: B2) and the SpaceNet7 ground truth, respectively.

The other state-of-the-art product, WSF 2019, outlines the global settlement extent for 2019 at 10 m resolution using multitemporal Sentinel-1 SAR and Sentinel-2 MSI data (Marconcini et al., 2021). A binary RF classifier was trained on points of the settlement and non-settlement classes, obtained by thresholding specific temporal SAR and optical features. Thresholds were manually fine-tuned for
different climate types. An important characteristic of the WSF 2019 is
also that post-processing was employed to mask out roads using the
corresponding OpenStreetMap (OSM) layer. As previously mentioned,
reference months for the SpaceNet7 sites were chosen based on the
agreement with WSF 2019 to account for intra-annual urban growth.
4. Results

4.1. Training

Training loss curves for the SAR, optical and fusion approaches are visualized in Fig. 8A. The supervised loss for all approaches decreases rapidly at the beginning of training. After approximately 5 epochs, the losses start to decrease very slowly. Among the baseline approaches, fusion achieves the lowest loss, closely followed by optical and SAR. Fig. 8B visualizes the supervised and consistency training loss curves for the fusion-DA approach. The supervised loss of our fusion-DA approach shows a similar trend to those of the baseline approaches. In contrast, the consistency loss increases strongly at the very beginning of training, indicating that there are considerable differences between the predictions of the SAR and optical sub-networks for the target domain. Later on, the consistency loss slowly decreases; consequently, the SAR and optical sub-networks were successfully trained to produce consistent BUA segmentation on unlabeled data. It should be noted that the supervised loss for the fusion-DA approach is approximately three times that of the baseline approaches because it represents the sum of the supervised losses for the SAR, optical and fusion outputs.

4.2. Performance assessment

Table 1 lists the quantitative results for the validation and test set. For this assessment, a cut-off value of 0.5 was applied to the continuous network outputs in order to categorize them into BUA and non-BUA. For the validation set, F1 scores range from 0.664 to 0.774 and IoU values from 0.497 to 0.631. Input-level fusion of SAR and optical data achieved the highest values for both metrics. The proposed DA approach achieved the second highest values, followed by the optical and SAR approaches. Test F1 scores and IoU values are considerably lower. While the optical approach suffered the largest performance drop in terms of F1 score (−0.159), the performance of the SAR approach dropped by less than 0.100 compared to the validation set. As a consequence, the single-sensor approaches achieved similar F1 scores on the test set (approximately 0.580). In contrast, the F1 scores of both approaches fusing SAR and optical data considerably exceeded 0.600, where the better result (+0.043) was obtained with the proposed approach (0.694). In terms of precision, the highest values were obtained with the input-level fusion approach (0.745 on the validation set and 0.712 on the test set). In comparison, the proposed fusion approach produced slightly less reliable predictions, with values of 0.691 and 0.661 on the validation and test set, respectively. The lowest precision values were obtained from the SAR approach (0.570 on the test set). In terms of recall, the proposed fusion approach achieved the best results. Especially on the test set, SSL improved BUA detection by more than 0.130 compared to fully supervised learning. While satisfactory detection was also achieved using single-sensor SAR data, the network trained on single-sensor optical data failed to achieve good BUA detection on the test set, with a recall of 0.496.
Fig. 13. Qualitative comparison of (f) our Fusion-DA approach with (d) GHS-BUILT-S2 (Corbane et al., 2020b) and (e) WSF 2019 (Marconcini et al., 2021) for a
SpaceNet7 site located in Egypt (IW). (a), (b) and (c) show the Sentinel-1 SAR image (VV band), the Sentinel-2 MSI image (red: B4, green: B3, blue: B2) and the
SpaceNet7 ground truth, respectively. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
4.3. Per-region quantitative assessment and state-of-the-art comparison

In order to gain a better understanding of the generalization ability, this section quantitatively assesses test results of the approaches with attention to regional differences. We also add a quantitative comparison with the two state-of-the-art global human settlement products GHS-BUILT-S2 and WSF 2019. It is important to note that the purpose of this state-of-the-art comparison is not to identify the most accurate product. Rather, we want to emphasize the ability of our DA approach to produce BUA maps of high quality across different regions, while using a much lighter methodology than the state-of-the-art. In comparison to GHS-BUILT-S2, for example, our methodology uses only a single globally applicable model, while theirs uses over 400 locally trained models, requiring manually fine-tuned thresholds for binarization. Furthermore, we applied no post-processing to model outputs, while WSF 2019 was generated using OSM data for the removal of roads. Our light methodology therefore offers great potential for the production of accurate and easily updateable maps of global human settlements.

We start by analysing the BUA probability output of our fusion-DA model and compare it to those of the local GHS-S2Nets across the different regions (Fig. 9). Since the WSF 2019 was produced using a binary RF classifier, it is already separated into hard classes and hence not included in this comparison. A clear separation between BUA and non-BUA in terms of output probability is desirable in order to obtain robust results that are not strongly dependent on the choice of cut-off value. Moreover, this allows for the use of a standardized cut-off value across different regions. The histograms in Fig. 9 illustrate that our model outputs BUA probabilities that are pushed to either side of the histogram across all six regions, while the peaks representing BUA for GHS-BUILT-S2 are not close to the maximum probability for any region but Europe; moreover, the locations of the peaks representing BUA vary considerably across regions. Therefore, the choice of cut-off value may affect the performance of GHS-BUILT-S2 considerably and, consequently, cut-off values may require regional fine-tuning to obtain good separation. In contrast, the model outputs of our approach are well separable over a wide range of cut-off values, a property that is consistent across geographical regions.

Fig. 14. Qualitative comparison of (f) our Fusion-DA approach with (d) GHS-BUILT-S2 (Corbane et al., 2020b) and (e) WSF 2019 (Marconcini et al., 2021) for a SpaceNet7 site located in Zambia (SSA). (a), (b) and (c) show the Sentinel-1 SAR image (VV band), the Sentinel-2 MSI image (red: B4, green: B3, blue: B2) and the SpaceNet7 ground truth, respectively. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Table 2 lists the mean F1 score, precision, recall, IoU and kappa across the six regions, including standard deviations. Mean F1 scores range between 0.570 and 0.692, where the lowest and highest scores were obtained from the SAR and the proposed DA approach, respectively. Noteworthy is that SAR-based mapping produced the most stable results across the regions, with a standard deviation of 0.041. In contrast, the performance of optical-based mapping varies greatly (0.077 standard deviation), especially BUA detection as indicated by recall (0.520 ± 0.115). Input-level fusion improved average performance across regions over the single-sensor approaches. Fusion with DA clearly outperformed input-level fusion in terms of mean recall (+0.115), while mean precision is slightly lower (−0.046). The proposed DA approach achieves the best overall performance in terms of F1 score (0.692), as well as in terms of IoU (0.531). Very good recall across regions was obtained with the GHS-S2Nets (0.767 ± 0.033), particularly compared to our optical approach. This indicates that the multi-model approach with local training is effective in achieving reliable BUA detection, while transferring a model trained on optical data to new regions causes generalization problems. However, the F1 score of the GHS-S2Nets is considerably impacted by the great variance in their precision, which we attribute to the use of training labels from different datasets with varying quality. On the other hand, the WSF 2019 achieved a better balance between precision and recall, resulting in an F1 score of 0.680, second only to our fusion-DA approach.
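As a reference for how these scores are obtained, the sketch below computes the reported metrics (precision, recall, F1, IoU and Cohen's kappa) from a BUA probability map and a binary ground-truth mask. The function name and the 0.5 cut-off are illustrative assumptions standing in for the standardized cut-off discussed above, not part of our released code.

```python
import numpy as np

def bua_metrics(prob, truth, cutoff=0.5):
    """Compute precision, recall, F1, IoU and Cohen's kappa for one site.
    `prob` is a 2-D BUA probability map, `truth` a binary mask; the 0.5
    cut-off stands in for the standardized cut-off discussed above."""
    pred = prob >= cutoff
    truth = truth.astype(bool)

    tp = float(np.logical_and(pred, truth).sum())
    fp = float(np.logical_and(pred, ~truth).sum())
    fn = float(np.logical_and(~pred, truth).sum())
    tn = float(np.logical_and(~pred, ~truth).sum())

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    n = tp + fp + fn + tn
    po = (tp + tn) / n
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (po - pe) / (1 - pe)
    return {"precision": precision, "recall": recall, "f1": f1,
            "iou": iou, "kappa": kappa}
```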
Fig. 10 gives further insight into the generalization ability of our
approaches and the state-of-the-art in the form of box plots depicting F1
score across the sites constituting a region. Regional box plots for the
other accuracy metrics can be found in the Appendix (Figs. 18–21). Similar relative patterns between our approaches exist for
the source domain and the three target domain regions EU, LA and SSA.
Specifically, the lowest performance in terms of F1 score (median) was
obtained from the SAR approach, followed by the optical, fusion and
fusion-DA approaches. Among the four regions, better results were
achieved in the source domain and the target domain regions EU and LA
compared to SSA. However, it should be noted that the largest range in
F1 scores exists for the source domain due to an outlier caused by a site
located in a mining area, where most of the BUA corresponds to silos. In
contrast to these regions, our optical approach was outperformed by
SAR for the target regions IW and Asia. The best results for these two
regions were also obtained using the fusion-DA approach. Furthermore,
the fusion-DA approach compares favorably to the state-of-the-art. In fact, it clearly outperformed the locally trained GHS-S2Nets in all regions, and it also performed at least as well as the WSF 2019 in all regions but SSA. Therefore, our multi-modal DA approach is effective in training a model that generalizes well across the six regions, despite relying on labels from a single, geographically limited region.
4.4. Qualitative state-of-the-art comparison
Qualitative results for a selection of test sites from different
geographical regions are visualized in Figs. 11 to 15. (a) and (b) show
the Sentinel-1 SAR VV band and the Sentinel-2 MSI true colour image
(red: B4, green: B3, blue: B2), respectively. (c) shows the SpaceNet7
ground truth and (d), (e) and (f) show predicted BUA for GHS-BUILT-S2,
WSF 2019 and our fusion-DA approach, respectively. Furthermore,
qualitative results for all test sites are easily accessible in GEE (https://hafnersailing.users.earthengine.app/view/urban-extraction-app).
Fig. 11 shows the outlined qualitative results for a site in the United
States (source domain). The residential neighborhood in the bottom
right of the site was mapped most accurately by fusion-DA; in
contrast, both GHS-BUILT-S2 and WSF 2019 suffer from false negatives.
Our model is also capable of distinguishing buildings from other
impervious surfaces, while GHS-S2Net frequently produces false posi­
tives for roads. This was solved for the WSF 2019 by masking out roads
using the corresponding OSM layer, but the post-processing occasionally
results in false positives artefacts alongside roads. These post-processing
artefacts are also apparent for the second site located in Panama
(Fig. 12) in the center of the site, where also the GHS-S2Net incorrectly
mapped a major road as BUA. In comparison, roads are consistently
excluded from BUA for fusion-DA. The most intra-urban detail for the
site was produced by WSF 2019 due to the post-processing road
removal. However, the removed roads are actually not always apparent
on the Sentinel-2 MSI image, particularly in very densely populated
areas. The next site (Fig. 13) is located in Egypt (IW) and shows a
Fig. 15. Qualitative comparison of (f) our Fusion-DA approach with (d) GHS-BUILT-S2 (Corbane et al., 2020b) and (e) WSF 2019 (Marconcini et al., 2021) for a
SpaceNet7 site located in China (Asia). (a), (b) and (c) show the Sentinel-1 SAR image (VV band), the Sentinel-2 MSI image (red: B4, green: B3, blue: B2) and the
SpaceNet7 ground truth, respectively. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
sparsely populated area in an arid environment. The local GHS-S2Net achieved good BUA detection but lacks detail, which is evident for the housing rows at the bottom of the image. GHS-S2Net also confused some roads with BUA, as it did for other sites. The housing rows were mapped most accurately by our model, but it missed most of the BUA at the top, which is also the case for WSF 2019. Fig. 14 represents a site
in SSA, more specifically in Zambia. Our model and WSF 2019 outline
BUA accurately. However, the post-processing for WSF 2019 was more
effective in removing roads in the densely populated areas than our
model. GHS-BUILT-S2 overestimates the size of the small building in­
stances at the bottom of the image. In comparison, our approach does
not contain this error. Fig. 15 shows an urban area in China (Asia).
Although GHS-BUILT-S2 detected most of the BUA for this site, it contains many false positives from impervious surfaces other than roofed structures.
In comparison, a more detailed representation of BUA was obtained
from WSF 2019 and our approach. Furthermore, it is apparent that our
CNN model is mapping BUA at the building level, while WSF 2019 maps
it at the city block level for this particular site. Consequently, our model
generated the most detailed representation of BUA.
5. Discussion

5.1. Fusion of SAR and optical data

Part of this work focused on comparing SAR and optical data, as well as their fusion, for BUA mapping using DL. On the validation set—i.e., disregarding across-region generalization problems—optical data achieved better results than SAR data (Table 1). We attribute this to the rich multi-spectral information contained in Sentinel-2 MSI images and the 10 m spatial resolution (blue, green, red and near-infrared) providing detailed textural information. CNNs are effective in leveraging this information for urban mapping (Corbane et al., 2020b; Qiu et al., 2020). In comparison to the 10 spectral Sentinel-2 bands, Sentinel-1 SAR data only contains two bands corresponding to different polarizations and, moreover, these are acquired at a lower 20 m spatial resolution. Therefore, the superiority of optical to SAR data for BUA mapping using DL is expected. However, mapping BUA from a single data modality may lead to challenges associated with the inherent characteristics of the sensor. For example, the spectral signatures of artificial impervious surfaces and bare land are easily confused and, consequently, mapping urban areas from optical data in arid and semi-arid regions may lead to relatively poor performance (Gong et al., 2019, 2020). Likewise, the standalone use of SAR data may produce false positives due to high backscattering from mountains facing the sensor (Ban et al., 2015). Looking at the quantitative and qualitative results of the baseline approaches (Table 1), it is evident that, overall, input-level data fusion tends to improve BUA predictions compared to the use of single-sensor SAR or optical data. In our visual comparison with GHS-BUILT-S2 (Figs. 11 to 15), we also observed that fusion is particularly powerful at distinguishing roads from BUA. Since different land covers within the super-class of impervious surfaces share similar spectral signatures, it is challenging to distinguish roads from roofed structures in optical images. On the contrary, they are easily distinguishable in SAR images: buildings are characterized by high backscatter signals, while roads scatter the signal away from the sensor due to the large incidence angle, resulting in low backscatter. The findings about the benefits of data fusion for CNN-based urban mapping are in line with the literature on SAR-optical data fusion using traditional ML classifiers (Ban and Jacob, 2016; Marconcini et al., 2020). Finally, it should be noted that input-level fusion of SAR and optical data improves across-region generalization (Table 2 & Fig. 10). However, input-level fusion is evidently insufficient as a method to overcome generalization problems on its own.
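To make the input-level fusion baseline concrete, the following minimal sketch illustrates fusion by channel concatenation. The band counts (two Sentinel-1 polarizations and the four 10 m Sentinel-2 bands) follow the descriptions above, while the class itself and the placeholder segmentation network are hypothetical.

```python
import torch
import torch.nn as nn

class InputLevelFusion(nn.Module):
    """Input-level fusion: concatenate co-registered SAR and optical patches
    along the channel axis and feed them to one segmentation network.
    The wrapped network (e.g., a U-Net) is a placeholder and must accept
    2 + 4 = 6 input channels under the band assumptions stated above."""

    def __init__(self, segmentation_net: nn.Module):
        super().__init__()
        self.net = segmentation_net

    def forward(self, sar: torch.Tensor, opt: torch.Tensor) -> torch.Tensor:
        x = torch.cat([sar, opt], dim=1)  # (B, 2+4, H, W)
        return self.net(x)
```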
5.2. Across-region generalization

First of all, this study stresses that a model trained on optical data in a geographically limited region fails to reliably detect BUA across regions (Table 2 & Fig. 10). Domain shifts between the source and target domains hamper model performance (Tuia et al., 2016), which may be particularly problematic for urban mapping due to the variability of landscapes and of the characteristics of human settlements in terms of size, shape, morphology and structure across regions (Taubenböck et al., 2020). Interestingly, Qiu et al. (2020) trained a CNN model for large-scale mapping of human settlement extent in Europe on optical Sentinel-2 data and reported no generalization problems when testing the trained model beyond Europe. While their findings are not in line with ours, the use of a different target class or network architecture could have potentially improved the generalization ability of their framework. Corbane et al. (2020b), however, emphasized model generalization at a global scale as a challenge in CNN-based BUA mapping and recommended limiting model deployment to geographical areas with characteristics similar to the training area in terms of landscape and types of BUA. Therefore, we consider across-region generalization an important challenge in urban mapping from optical satellite images at a global scale.

In contrast to optical data, the representation of BUA in SAR images (high backscattering) is spatially consistent across regions. The generalization ability of a model trained on SAR data is therefore superior to that of a model trained on optical data, which is supported by the experiments in this study (Table 2 & Fig. 10). Although improving BUA mapping performance, input-level fusion fails to leverage SAR data for the reliable detection of BUA (Fig. 10). Fully supervised learning optimizes BUA mapping for the source domain and, due to the richer spatial and contextual information contained in optical data compared to SAR data, the model has no incentive to learn spatially invariant features from SAR data during training. While input-level fusion improves performance in the source domain compared to SAR or optical data alone, the results in Fig. 10 show that it fails to alleviate generalization problems, particularly in arid regions (IW). In contrast, the proposed DA approach demonstrated more accurate BUA mapping across all regions with respect to our baseline approaches, as well as GHS-BUILT-S2 and WSF 2019. Furthermore, output probabilities are well separable into BUA and non-BUA across different regions since BUA pixels are generally assigned a probability close to 1 by our model (Fig. 9), whereas models tend to assign lower probabilities to out-of-distribution examples (Hendrycks and Gimpel, 2016). Consequently, the model was successfully adapted to the target regions without using any additional local labels. The proposed DA approach is, therefore, an effective method to improve across-region generalization for BUA mapping at a global scale in an unsupervised manner.
5.3. Limitations and perspectives

In this section, we discuss the limitations of the proposed DA approach. First of all, the improvement of BUA detection by the proposed DA approach rests on the assumption that at least one of the two sub-networks successfully detects BUA. In that case, the outputs of the sub-networks disagree, which is penalized by the consistency loss to improve BUA detection. Our regional experiments demonstrate that this assumption generally holds because BUA detection from SAR data is reliable across regions (Table 2), which we attribute to the geographically invariant indicator of BUA in SAR images: high backscattering. Backscattering in parts of slums can, however, be counterintuitively low because extremely dense accumulations of low-rise buildings may prevent double bounce effects, as also reported in Marconcini et al. (2020) for highly urbanized areas in Lagos. In this case, the SAR sub-network is likely to miss the slums as BUA. If the optical network is also subject to this error of omission, the two sub-networks are not penalized through the consistency loss because their predictions agree on there not being any BUA. Similarly, layover effects of high-rise buildings in SAR images from both ascending and descending orbits, as well as shadows in Sentinel-2 images, may cause confusion and thus missed detections. Therefore, the proposed DA approach fails, in these cases, to exploit unlabeled data to improve BUA detection.
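This failure mode can be illustrated with a toy example, assuming an MSE-style consistency term between the sub-network output probabilities (as in the earlier training-loss sketch):

```python
import torch

# Both sub-networks miss the same built-up pixels (e.g., a dense low-rise
# slum): their probabilities agree near 0, so the consistency term is ~0
# and provides no corrective signal, although the ground truth is 1.
p_sar = torch.tensor([0.05, 0.04])  # SAR sub-network output
p_opt = torch.tensor([0.06, 0.05])  # optical sub-network output
truth = torch.tensor([1.0, 1.0])    # pixels are actually built-up

consistency = (p_sar - p_opt).pow(2).mean()
print(float(consistency))  # ~1e-4: consistent, yet both predictions are wrong
```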
Another limitation of the proposed DA approach is that it decreases precision compared to the optical and input-level fusion approaches, as shown in Table 1. SAR-based predictions are generally less precise, which can be attributed to the lower 20 m spatial resolution of the Sentinel-1 bands compared to the 10 m Sentinel-2 bands (blue, green, red and near-infrared). Since our SSL setup enforces consistent predictions across the two data sources, the SAR data may limit the precision of the proposed approach. However, it should be considered that the small decrease in precision is generally outweighed by large gains in recall.

Future work should explore methods to minimize negative effects from fusing SAR with optical data. For example, temporal SAR features could be considered to add more contextual data to the SAR stream (Marconcini et al., 2020, 2021). Future work should also apply our unsupervised DA approach to other mapping tasks. Although only a few modifications to the model may be required for multi-class problems, one should take into consideration that the proposed method assumes that all classes are detectable from both data modalities. This may be a limiting factor for some applications.
6. Conclusion
This research concludes that the proposed unsupervised DA approach greatly improves BUA detection compared to fully supervised learning approaches at no additional labeling cost. Therefore, it is an effective method to improve model generalization ability for urban mapping at a global scale. Furthermore, jointly considering the quantitative and qualitative accuracy assessments with respect to GHS-BUILT-S2 and WSF 2019, we conclude that the proposed DA approach is
capable of training a model that produces BUA maps with comparable or
even better quality than global state-of-the-art products. Our work also
underlines the importance of considering across-region generalization
for global mapping from satellite imagery and, moreover, demonstrates
that geographically restricted label availability can be overcome using
novel DA techniques. Finally, this research has the potential to
contribute to the production of up-to-date and reliable BUA maps to
support sustainable planning and urban SDG indicator monitoring.
CRediT authorship contribution statement
Sebastian Hafner: Conceptualization, Data curation, Methodology,
Visualization, Validation, Writing – original draft, Writing – review &
editing. Yifang Ban: Conceptualization, Methodology, Writing – review & editing, Supervision, Funding acquisition, Resources. Andrea Nascetti: Conceptualization, Methodology, Writing – review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial
interests or personal relationships that could have appeared to influence
the work reported in this paper.
Acknowledgement

This work was supported in part by the Swedish National Space Agency under Grant dnr 155/15, the Digital Futures under the grant for the EO-AI4GlobalChange Project, and in part by the ESA-China Dragon 5 Program under the EO-AI4Urban Project.
Appendices
Fig. 16. F1 scores obtained during training on the (A) training set and (B) validation set for the SAR, optical, fusion and fusion-DA approach.
Fig. 17. The eight cultural-geographic regions adopted from Huntington (1997).
Fig. 18. Quantitative results on the test set grouped by geographical region for our approaches (SAR, Optical, Fusion and Fusion-DA) and GHS-BUILT-S2 (Corbane
et al., 2020b) and WSF 2019 (Marconcini et al., 2021). Box plots depict precision for the sites constituting a region.
Fig. 19. Quantitative results on the test set grouped by geographical region for our approaches (SAR, Optical, Fusion and Fusion-DA) and GHS-BUILT-S2 (Corbane
et al., 2020b) and WSF 2019 (Marconcini et al., 2021). Box plots depict recall for the sites constituting a region.
Fig. 20. Quantitative results on the test set grouped by geographical region for our approaches (SAR, Optical, Fusion and Fusion-DA) and GHS-BUILT-S2 (Corbane
et al., 2020b) and WSF 2019 (Marconcini et al., 2021). Box plots depict IoU for the sites constituting a region.
Fig. 21. Quantitative results on the test set grouped by geographical region for our approaches (SAR, Optical, Fusion and Fusion-DA) and GHS-BUILT-S2 (Corbane
et al., 2020b) and WSF 2019 (Marconcini et al., 2021). Box plots depict kappa for the sites constituting a region.
Table 3
Metadata for the 60 sites constituting our test dataset.

Area of interest ID           Country               Domain       Date     Sentinel-1 Orbit
L15-0331E-1257N_1327_3160_13  United States         Source       2019-03  115
L15-0357E-1223N_1429_3296_13  United States         Source       2019-01  173
L15-0358E-1220N_1433_3310_13  United States         Source       2019-01  173
L15-0361E-1300N_1446_2989_13  United States         Source       2019-01  144
L15-0368E-1245N_1474_3210_13  United States         Source       2019-01  173
L15-0387E-1276N_1549_3087_13  United States         Source       2019-05  122
L15-0434E-1218N_1736_3318_13  United States         Source       2019-07  151
L15-0457E-1135N_1831_3648_13  Mexico                Target LA    2019-04  41
L15-0487E-1246N_1950_3207_13  United States         Source       2019-04  136
L15-0506E-1204N_2027_3374_13  United States         Source       2019-01  165
L15-0544E-1228N_2176_3279_13  United States         Source       2019-12  121
L15-0566E-1185N_2265_3451_13  United States         Source       2019-02  48
L15-0571E-1075N_2287_3888_13  Panama                Target LA    2019-01  113
L15-0577E-1243N_2309_3217_13  United States         Source       2019-01  77
L15-0586E-1127N_2345_3680_13  Jamaica               Target LA    2019-01  113
L15-0595E-1278N_2383_3079_13  United States         Source       2019-07  106
L15-0614E-0946N_2459_4406_13  Peru                  Target LA    2019-08  25
L15-0632E-0892N_2528_4620_13  Chile                 Target LA    2019-08  156
L15-0683E-1006N_2732_4164_13  Brazil                Target LA    2019-07  10
L15-0760E-0887N_3041_4643_13  Brazil                Target LA    2019-02  53
L15-0924E-1108N_3699_3757_13  Senegal               Target IW    2019-12  133
L15-0977E-1187N_3911_3441_13  Algeria               Target IW    2019-06  154
L15-1014E-1375N_4056_2688_13  United Kingdom        Target EU    2019-06  132
L15-1015E-1062N_4061_3941_13  Ghana                 Target SSA   2019-03  147
L15-1025E-1366N_4102_2726_13  United Kingdom        Target EU    2019-11  81
L15-1049E-1370N_4196_2710_13  Netherlands           Target EU    2019-01  110
L15-1138E-1216N_4553_3325_13  Libya                 Target IW    2019-04  73
L15-1172E-1306N_4688_2967_13  Romania               Target EU    2019-09  109
L15-1185E-0935N_4742_4450_13  Zambia                Target SSA   2019-11  145
L15-1200E-0847N_4802_4803_13  South Africa          Target SSA   2019-09  79
L15-1203E-1203N_4815_3378_13  Egypt                 Target IW    2019-07  167
L15-1204E-1202N_4816_3380_13  Egypt                 Target IW    2019-09  167
L15-1204E-1204N_4819_3372_13  Egypt                 Target IW    2019-12  167
L15-1209E-1113N_4838_3737_13  Sudan                 Target IW    2019-12  21
L15-1210E-1025N_4840_4088_13  Uganda                Target SSA   2019-02  28
L15-1276E-1107N_5105_3761_13  Yemen                 Target IW    2019-05  108
L15-1289E-1169N_5156_3514_13  Saudi Arabia          Target IW    2019-01  72
L15-1296E-1198N_5184_3399_13  Kuwait                Target IW    2019-01  108
L15-1298E-1322N_5193_2903_13  Russia                Target EU    2019-03  79
L15-1335E-1166N_5342_3524_13  United Arab Emirates  Target IW    2019-01  130
L15-1389E-1284N_5557_3054_13  Uzbekistan            Target IW    2019-08  122
L15-1438E-1134N_5753_3655_13  India                 Target Asia  2019-01  34
L15-1439E-1134N_5759_3655_13  India                 Target Asia  2019-06  34
L15-1479E-1101N_5916_3785_13  India                 Target Asia  2019-03  92
L15-1481E-1119N_5927_3715_13  India                 Target Asia  2019-08  92
L15-1538E-1163N_6154_3539_13  Bangladesh            Target IW    2019-01  150
L15-1615E-1205N_6460_3370_13  China                 Target Asia  2019-03  128
L15-1615E-1206N_6460_3366_13  China                 Target Asia  2019-04  128
L15-1617E-1207N_6468_3360_13  China                 Target Asia  2019-02  128
L15-1669E-1153N_6678_3579_13  China                 Target Asia  2019-01  11
L15-1669E-1160N_6678_3548_13  China                 Target Asia  2019-03  11
L15-1669E-1160N_6679_3549_13  China                 Target Asia  2019-01  11
L15-1672E-1207N_6691_3363_13  China                 Target Asia  2019-06  113
L15-1690E-1211N_6763_3346_13  China                 Target Asia  2019-03  142
L15-1691E-1211N_6764_3347_13  China                 Target Asia  2019-07  142
L15-1703E-1219N_6813_3313_13  China                 Target Asia  2019-06  69
L15-1709E-1112N_6838_3742_13  Philippines           Target Asia  2019-10  32
L15-1716E-1211N_6864_3345_13  China                 Target Asia  2019-03  171
L15-1748E-1247N_6993_3202_13  China                 Target Asia  2019-02  127
L15-1848E-0793N_7394_5018_13  Australia             Source       2019-01  118
References
Alajlan, N., Pasolli, E., Melgani, F., Franzoso, A., 2013. Large-scale image classification using active learning. IEEE Geosci. Remote Sens. Lett. 11, 259–263.
As-syakur, A., Adnyana, I., Arthana, I.W., Nuarsa, I.W., et al., 2012. Enhanced built-up and bareness index (EBBI) for mapping built-up and bare land in an urban area. Remote Sens. 4, 2957–2970.
Bachman, P., Alsharif, O., Precup, D., 2014. Learning with pseudo-ensembles. In: Advances in neural information processing systems, 27.
Ban, Y., Jacob, A., 2013. Object-based fusion of multitemporal multiangle ENVISAT ASAR and HJ-1B multispectral data for urban land-cover mapping. IEEE Trans. Geosci. Remote Sens. 51, 1998–2006.
Ban, Y., Jacob, A., 2016. Fusion of multitemporal spaceborne SAR and optical data for urban mapping and urbanization monitoring. In: Multitemporal Remote Sensing. Springer, pp. 107–123.
Ban, Y., Jacob, A., Gamba, P., 2015. Spaceborne SAR data for global urban mapping at 30 m resolution using a robust urban extractor. ISPRS J. Photogramm. Remote Sens. 103, 28–37.
Berman, P., 2004. Terror and Liberalism. WW Norton & Company.
Branco, P., Torgo, L., Ribeiro, R.P., 2016. A survey of predictive modeling on imbalanced domains. ACM Comp. Surv. (CSUR) 49, 1–50.
Chapelle, O., Scholkopf, B., Zien, A., 2009. Semi-supervised learning (Chapelle, O. et al., Eds.; 2006) [Book reviews]. IEEE Trans. Neural Netw. 20, 542.
Chini, M., Pelich, R., Hostache, R., Matgen, P., Lopez-Martinez, C., 2018. Towards a 20 m global building map from Sentinel-1 SAR data. Remote Sens. 10, 1833.
Cohen, J., 1960. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20, 37–46.
Corbane, C., Pesaresi, M., Politis, P., Syrris, V., Florczyk, A.J., Soille, P., Maffenini, L., Burger, A., Vasilev, V., Rodriguez, D., et al., 2017. Big earth data analytics on Sentinel-1 and Landsat imagery in support to global human settlements mapping. Big Earth Data 1, 118–144.
Corbane, C., Politis, P., Kempeneers, P., Simonetti, D., Soille, P., Burger, A., Pesaresi, M., Sabo, F., Syrris, V., Kemper, T., 2020a. A global cloud free pixel-based image composite from Sentinel-2 data. Data Brief 31, 105737.
Corbane, C., Syrris, V., Sabo, F., Politis, P., Melchiorri, M., Pesaresi, M., Soille, P., Kemper, T., 2020b. Convolutional neural networks for global human settlements mapping from Sentinel-2 satellite imagery. Neural Comput. & Applic. 1–24.
Crawford, M.M., Tuia, D., Yang, H.L., 2013. Active learning: any value for classification of remotely sensed data? Proc. IEEE 101, 593–608.
Cui, W., Liu, Y., Li, Y., Guo, M., Li, Y., Li, X., Wang, T., Zeng, X., Ye, C., 2019. Semi-supervised brain lesion segmentation with an adapted mean teacher model. In: International Conference on Information Processing in Medical Imaging. Springer, pp. 554–565.
DeLancey, E.R., Simms, J.F., Mahdianpari, M., Brisco, B., Mahoney, C., Kariyeva, J., 2020. Comparing deep learning and shallow learning for large-scale wetland classification in Alberta, Canada. Remote Sens. 12, 2.
Duque-Arias, D., Velasco-Forero, S., Deschaud, J.E., Goulette, F., Serna, A., Decencière, E., Marcotegui, B., 2021. On power Jaccard losses for semantic segmentation. In: VISAPP 2021: 16th International Conference on Computer Vision Theory and Applications.
Esch, T., Taubenböck, H., Roth, A., Heldens, W., Felbier, A., Schmidt, M., Mueller, A.A., Thiel, M., Dech, S.W., 2012. TanDEM-X mission: new perspectives for the inventory and monitoring of global settlement patterns. J. Appl. Remote. Sens. 6, 061702.
Esch, T., Marconcini, M., Felbier, A., Roth, A., Heldens, W., Huber, M., Schwinger, M., Taubenböck, H., Müller, A., Dech, S., 2013. Urban footprint processor: fully automated processing chain generating settlement masks from global data of the TanDEM-X mission. IEEE Geosci. Remote Sens. Lett. 10, 1617–1621.
Esch, T., Heldens, W., Hirner, A., Keil, M., Marconcini, M., Roth, A., Zeidler, J., Dech, S., Strano, E., 2017. Breaking new ground in mapping human settlements from space: the global urban footprint. ISPRS J. Photogramm. Remote Sens. 134, 30–42.
Feng, W., Sui, H., Huang, W., Xu, C., An, K., 2018. Water body extraction from very high-resolution remote sensing imagery using deep u-net and a superpixel-based conditional random field model. IEEE Geosci. Remote Sens. Lett. 16, 618–622.
French, G., Mackiewicz, M., Fisher, M., 2018. Self-ensembling for visual domain adaptation. In: International Conference on Learning Representations.
Gamba, P., Aldrighi, M., Stasolla, M., 2010. Robust extraction of urban area extents in HR and VHR SAR images. IEEE J. Select. Top. Appl. Earth Observ. Rem. Sens. 4, 27–34.
Goldblatt, R., Stuhlmacher, M.F., Tellman, B., Clinton, N., Hanson, G., Georgescu, M., Wang, C., Serrano-Candela, F., Khandelwal, A.K., Cheng, W.H., et al., 2018. Using Landsat and nighttime lights for supervised pixel-based image classification of urban land cover. Remote Sens. Environ. 205, 253–275.
Gong, P., Marceau, D.J., Howarth, P.J., 1992. A comparison of spatial feature extraction algorithms for land-use classification with SPOT HRV data. Remote Sens. Environ. 40, 137–151.
Gong, P., Li, X., Zhang, W., 2019. 40-year (1978–2017) human settlement changes in China reflected by impervious surfaces from satellite remote sensing. Sci. Bull. 64, 756–763.
Gong, P., Li, X., Wang, J., Bai, Y., Chen, B., Hu, T., Liu, X., Xu, B., Yang, J., Zhang, W., et al., 2020. Annual maps of global artificial impervious area (GAIA) between 1985 and 2018. Remote Sens. Environ. 236, 111510.
Gorelick, N., Hancher, M., Dixon, M., Ilyushchenko, S., Thau, D., Moore, R., 2017. Google earth engine: planetary-scale geospatial analysis for everyone. Remote Sens. Environ. 202, 18–27.
Hafner, S., Nascetti, A., Azizpour, H., Ban, Y., 2022. Sentinel-1 and Sentinel-2 data fusion for urban change detection using a dual stream u-net. IEEE Geosci. Remote Sens. Lett. 19, 1–5. https://doi.org/10.1109/LGRS.2021.3119856.
Hamrouni, Y., Paillassa, E., Chéret, V., Monteil, C., Sheeren, D., 2021. From local to global: a transfer learning-based approach for mapping poplar plantations at national scale using Sentinel-2. ISPRS J. Photogramm. Remote Sens. 171, 76–100.
He, H., Ma, Y., 2013. Imbalanced Learning: Foundations, Algorithms, and Applications.
Hendrycks, D., Gimpel, K., 2016. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136.
Hu, J., Mou, L., Zhu, X.X., 2020. Unsupervised domain adaptation using a teacher-student network for cross-city classification of Sentinel-2 images. Int. Archiv. Photogram. Rem. Sens. Spat. Inform. Sci. 43, 1569–1574.
Huntington, S.P., 1997. The Clash of Civilizations and the Remaking of World Order. Polish Edition MUZA SA, 1996.
Ioffe, S., Szegedy, C., 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning. PMLR, pp. 448–456.
Kingma, D.P., Ba, J., 2014. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kussul, N., Lavreniuk, M., Skakun, S., Shelestov, A., 2017. Deep learning classification of land cover and crop types using remote sensing data. IEEE Geosci. Remote Sens. Lett. 14, 778–782.
Laine, S., Aila, T., 2016. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242.
Landis, J.R., Koch, G.G., 1977. The measurement of observer agreement for categorical data. Biometrics 159–174.
Lin, Y., Zhang, H., Lin, H., Gamba, P.E., Liu, X., 2020. Incorporating synthetic aperture radar and optical images to investigate the annual dynamics of anthropogenic impervious surface at large scale. Remote Sens. Environ. 242, 111757.
Liu, T., Abd-Elrahman, A., Morton, J., Wilhelm, V.L., 2018a. Comparing fully convolutional networks, random forest, support vector machine, and patch-based deep convolutional neural networks for object-based wetland mapping using images from small unmanned aircraft system. GISci. Rem. Sens. 55, 243–264.
Liu, X., Hu, G., Chen, Y., Li, X., Xu, X., Li, S., Pei, F., Wang, S., 2018b. High-resolution multi-temporal mapping of global urban land using Landsat images based on the Google earth engine platform. Remote Sens. Environ. 209, 227–239.
Liu, C., Yang, K., Bennett, M.M., Guo, Z., Cheng, L., Li, M., 2019. Automated extraction of built-up areas by fusing VIIRS nighttime lights and Landsat-8 data. Remote Sens. 11, 1571.
Liu, X., Huang, Y., Xu, X., Li, X., Li, X., Ciais, P., Lin, P., Gong, K., Ziegler, A.D., Chen, A., et al., 2020. High-spatiotemporal-resolution mapping of global urban change from 1985 to 2015. Nat. Sustain. 1–7.
Long, J., Shelhamer, E., Darrell, T., 2015. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
Loshchilov, I., Hutter, F., 2018. Fixing Weight Decay Regularization in Adam.
Marconcini, M., Metz-Marconcini, A., Üreyen, S., Palacios-Lopez, D., Hanke, W., Bachofer, F., Zeidler, J., Esch, T., Gorelick, N., Kakarla, A., et al., 2020. Outlining where humans live, the world settlement footprint 2015. Sci. Data 7, 1–14.
Marconcini, M., Metz-Marconcini, A., Esch, T., Gorelick, N., 2021. Understanding Current Trends in Global Urbanisation: the World Settlement Footprint Suite.
Microsoft, 2018. Microsoft Releases 125 Million Building Footprints in the US as Open Data. URL: https://blogs.bing.com/maps/2018-06/microsoft-releases-125-million-building-footprints-in-the-us-as-open-data.
Nair, V., Hinton, G.E., 2010. Rectified linear units improve restricted Boltzmann machines. In: ICML.
Oliver, A., Odena, A., Raffel, C.A., Cubuk, E.D., Goodfellow, I., 2018. Realistic evaluation of deep semi-supervised learning algorithms. In: Advances in neural information processing systems, 31.
Pacifici, F., Del Frate, F., Emery, W.J., Gamba, P., Chanussot, J., 2008. Urban mapping using coarse SAR and optical data: outcome of the 2007 GRSS data fusion contest. IEEE Geosci. Remote Sens. Lett. 5, 331–335.
Pan, S.J., Yang, Q., 2009. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22, 1345–1359.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al., 2019. Pytorch: an imperative style, high-performance deep learning library. In: Advances in neural information processing systems, 32.
Pesaresi, M., Huadong, G., Blaes, X., Ehrlich, D., Ferri, S., Gueguen, L., Halkia, M., Kauffmann, M., Kemper, T., Lu, L., et al., 2013. A global human settlement layer from optical HR/VHR RS data: concept and first results. IEEE J. Select. Top. Appl. Earth Observ. Rem. Sens. 6, 2102–2131.
Pesaresi, M., Corbane, C., Julea, A., Florczyk, A.J., Syrris, V., Soille, P., 2016a. Assessment of the added-value of Sentinel-2 for detecting built-up areas. Remote Sens. 8, 299.
Pesaresi, M., Ehrlich, D., Ferri, S., Florczyk, A., Freire, S., Halkia, M., Julea, A., Kemper, T., Soille, P., Syrris, V., et al., 2016b. Operating procedure for the production of the global human settlement layer from Landsat data of the epochs 1975, 1990, 2000, and 2014. Publ. Office Europ. Union 1–62.
Qiu, C., Schmitt, M., Geiß, C., Chen, T.H.K., Zhu, X.X., 2020. A framework for large-scale mapping of human settlement extent from Sentinel-2 images via fully convolutional neural networks. ISPRS J. Photogramm. Remote Sens. 163, 152–170.
Rakhlin, A., Davydow, A., Nikolenko, S., 2018. Land cover classification from satellite imagery with u-net and lovász-softmax loss. Proc. IEEE Conf. Comp. Vision Patt. Recognit. Workshops 262–266.
Ravanelli, R., Nascetti, A., Cirigliano, R.V., Di Rico, C., Leuzzi, G., Monti, P., Crespi, M., 2018. Monitoring the impact of land cover change on surface urban heat island through Google earth engine: proposal of a global methodology, first applications and problems. Remote Sens. 10. https://doi.org/10.3390/rs10091488.
Ronneberger, O., Fischer, P., Brox, T., 2015. U-net: convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 234–241.
Sajjadi, M., Javanmardi, M., Tasdizen, T., 2016. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In: Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 1171–1179.
Schmitt, M., Prexl, J., Ebel, P., Liebel, L., Zhu, X.X., 2020. Weakly supervised semantic segmentation of satellite images for land cover mapping: challenges and opportunities. In: ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 3, pp. 795–802.
Schmitt, M., Zhu, X.X., 2016. Data fusion and remote sensing: an ever-growing relationship. IEEE Geosci. Rem. Sens. Magaz. 4, 6–23.
Sun, X., Fang, H., Yang, Y., Zhu, D., Wang, L., Liu, J., Xu, Y., 2021. Robust retinal vessel segmentation from a data augmentation perspective. In: International Workshop on Ophthalmic Medical Image Analysis. Springer, Cham, pp. 189–198.
Tarvainen, A., Valpola, H., 2017. Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In: Advances in neural information processing systems, 30.
Taubenböck, H., Debray, H., Qiu, C., Schmitt, M., Wang, Y., Zhu, X.X., 2020. Seven city types representing morphologic configurations of cities across the globe. Cities 105, 102814.
Tuia, D., Pasolli, E., Emery, W.J., 2011. Using active learning to adapt remote sensing image classifiers. Remote Sens. Environ. 115, 2232–2242.
Tuia, D., Persello, C., Bruzzone, L., 2016. Domain adaptation for the classification of remote sensing data: an overview of recent advances. IEEE Geosci. Rem. Sens. Magaz. 4, 41–57.
United Nations, 2014. World Urbanization Prospects: The 2014 Revision, Highlights. Department of Economic and Social Affairs. Population Division, United Nations, p. 32.
Van Etten, A., Hogan, D., Manso, J.M., Shermeyer, J., Weir, N., Lewis, R., 2021. The multi-temporal urban development SpaceNet dataset. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6398–6407.
Wang, J., Ding, H.Q., Chen, S., He, C., Luo, B., 2020. Semi-supervised remote sensing image semantic segmentation via consistency regularization and average update of pseudo-label. Remote Sens. 12, 3603.
Woodcock, C.E., Macomber, S.A., Pax-Lenney, M., Cohen, W.B., 2001. Monitoring large areas for forest change using Landsat: generalization across space, time and Landsat sensors. Remote Sens. Environ. 78, 194–203.
Xu, H., 2008. A new index for delineating built-up land features in satellite imagery. Int. J. Remote Sens. 29, 4269–4276.
Yu, X., Wu, X., Luo, C., Ren, P., 2017. Deep learning in remote sensing scene classification: a data augmentation enhanced convolutional neural network framework. GISci. Rem. Sens. 54, 741–758.
Zha, Y., Gao, J., Ni, S., 2003. Use of normalized difference built-up index in automatically mapping urban areas from TM imagery. Int. J. Remote Sens. 24, 583–594.
Zhang, B., Zhang, Y., Li, Y., Wan, Y., Wen, F., 2020a. Semi-supervised semantic segmentation network via learning consistency for remote sensing land-cover classification. ISPRS Ann. Photogram. Rem. Sens. Spat. Inform. Sci. 2, 609–615.
Zhang, X., Liu, L., Wu, C., Chen, X., Gao, Y., Xie, S., Zhang, B., 2020b. Development of a global 30 m impervious surface map using multisource and multitemporal remote sensing datasets with the Google earth engine platform. Earth Syst. Sci. Data 12, 1625–1648.
Zhu, X.X., Qiu, C., Hu, J., Shi, Y., Wang, Y., Schmitt, M., Taubenböck, H., 2022. The urban morphology on our planet: global perspectives from space. Remote Sens. Environ. 269, 112794.