Mining Urban Land Use Patterns from Volunteered Geographic Information Using Genetic Algorithms and Artificial Neural Networks
Julian Hagenauer and Marco Helbich
GIScience, Institute of Geography, University of Heidelberg, Germany
Keywords: Volunteered Geography, OpenStreetMap, Spatial Data Quality, Spatial Data Mining
This is an Author’s Original Manuscript of an article whose final and definitive form,
the Version of Record, has been published in the International Journal of Geographical
Information Science [14 Nov 2011] [copyright Taylor & Francis], available online at:
http://www.tandfonline.com/doi/abs/10.1080/13658816.2011.619501.
Abstract
OpenStreetMap (OSM), as one of the most promising crowdsourced initiatives, provides volunteered
mapped spatial data. At the same time, it bears several spatial data quality problems, inter alia
completeness, which induces data omission errors on the one hand and commission errors on the
other. Using European-wide urban land use patterns, this study investigates the first issue and
aims at predicting currently unmapped or partially mapped urban areas based on OSM. For this
purpose, a machine learning approach consisting of genetic algorithms and artificial neural networks
is applied to estimate urban areas. Under the premise of existing OSM data, the model estimates
missing urban areas with an overall squared correlation coefficient (R²) of 0.6. Nevertheless,
interregional comparisons of European regions confirm spatial heterogeneity in model quality (R²
ranges from 0.2 up to 0.8) and thus the inherently varying completeness of OSM. Hence, this verifies
the hypothesis that more active volunteers within a region enhance the content of OSM.
1 Introduction
The emergence of internet technologies facilitates the generation and distribution of manifold digital
content (e.g., Flickr, Wikipedia) and thus makes collaborative efforts common and present in
everyday life. The enablement of participatory collaboration caused a paradigm shift, blurring the
distinction between consumers and producers that has been existent in the Web since its early days.
O’Reilly (2005) terms these developments Web 2.0.
Because of the high costs related to the process of gathering and maintaining as well as efforts
involved in sharing and distributing spatial data, these have been almost solely the domain of either
official land surveying offices or commercial companies. Nowadays, the availability of mobile devices
endowed with satellite navigation has enabled people to collect geographic data on their own, at low
costs, and high precision levels, formerly not conceivable for non-experts. Citizens have become
human sensors (Goodchild 2007). Web 2.0 technologies permit volunteers to aggregate, share, and
edit their collected geographic data in a collaborative manner. This phenomenon is usually referred
to as Volunteered Geographic Information (VGI; Goodchild 2007; Elwood 2008), whereas Sui (2008)
metaphorically calls it the "wikification of GIS".
Among a broad list of initiatives dealing with VGI, OpenStreetMap (OSM) is one of the most
promising activities. Since its initiation in 2004, its primary goal has been the generation of a free map of the
world through volunteered contributions. Although the generation of maps is still the primary
intention, the collected spatial data is also made publicly available and may thus be used for
individual purposes (e.g., OpenRouteService.org (Neis and Zipf 2008), 3D city models (Over et al.
2010)). User-generated GPS tracks, out-of-copyright maps, and more recently aerial images (e.g., Bing
Maps) serve as primary data sources. The data itself are distributed under a license that guarantees
freedom of use, but enforces that all derived data are distributed under the same license (Haklay and
Weber 2008, Ramm and Topf 2010).
In general, awareness of the limitations of spatial data is essential. This holds in particular for data
quality issues, which constitute a comprehensive and active research field (e.g., Chrisman 1984;
Buttenfield 1993; Goodchild and Hunter 1996; Goodchild 1998; Shi et al. 2002; Devillers et al. 2010).
Evaluating spatial data quality is an important part in the process of assessing the "fitness for use"
(Chrisman 1993) of a data set for a particular application (van Oort 2003). Furthermore, spatial data
quality has also been addressed by a set of standard definitions and quality criteria proposals from
various organizations such as the National Committee on Digital Cartographic Data Standards
(NCDCDS, Moellering 1987), the International Cartographic Association (ICA, Guptill and Morrison
1995) or the International Organization for Standardization (ISO, Kresse and Fadaie 2004). Based on the
definitions of the Technical Committee 211 of the ISO, the following elements of spatial data quality
can be identified (Kresse and Fadaie 2004):
- Positional accuracy: concerns the accordance of positioning and geometry of an object to its
  representation in the real world,
- Attribute accuracy: measures the correctness of attributes assigned to a geographic object,
- Completeness: measures the absence and excess of data,
- Logical consistency: describes the topological correctness and the relationships between
  objects in respect to their internal consistency,
- Semantic accuracy: evaluates the correspondence of the interpretation of spatial objects to
  their meanings in the real world,
- Temporal accuracy: describes the data actuality in relation to real world changes,
- Lineage: concerns the history of a data set, how it was collected and derived to its actual
  state,
- Usage: assesses the extent of a data set to serve its intended purpose.
This contribution focuses on the completeness criterion of spatial quality. Completeness describes
the presence and absence of objects in a data set. Brassel et al. (1995) distinguish between data
completeness and model completeness. The former refers to the measurable errors between the data
set and its specification. Errors may be caused by a lack of data that is formally expected to be present
in the database (omission errors) or, conversely, by data that is present in the database but not intended to be
included (commission errors). Data completeness is measurable and independent of the application. In
contrast to data completeness, model completeness considers the intended use, i.e. how well the
model of a dataset fulfills the requirements of an application (Brassel et al. 1995). To evaluate
completeness in terms of fitness for use, it is advisable to consider both data completeness and
model completeness. The assessment of spatial data quality usually depends on the availability
and quality of reference information (Servigne et al. 2010). Concerning VGI in general and OSM in
particular, spatial data quality, namely spatial accuracy, is an important issue (Haklay et al. 2010), and
following Goodchild (2008) the same is valid for completeness.
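The distinction between omission and commission errors can be made concrete with a small sketch. It assumes, purely for illustration, that objects in the data set and in its specification (or a reference data set) can be matched by identifiers; the identifiers and the function name `completeness_errors` are hypothetical and not part of the study.

```python
def completeness_errors(reference_ids, dataset_ids):
    """Compare a data set against a reference by object identifier.

    Assumption: objects can be matched one-to-one by identifier, which is
    a simplification of real feature matching.
    """
    reference = set(reference_ids)
    dataset = set(dataset_ids)
    omissions = reference - dataset      # expected but missing
    commissions = dataset - reference    # present but not expected
    omission_rate = len(omissions) / len(reference) if reference else 0.0
    commission_rate = len(commissions) / len(dataset) if dataset else 0.0
    return omissions, commissions, omission_rate, commission_rate


# Hypothetical example: four reference roads, three mapped roads.
reference = {"road_a", "road_b", "road_c", "road_d"}
mapped = {"road_a", "road_b", "road_x"}
omissions, commissions, om_rate, com_rate = completeness_errors(reference, mapped)
```

In this toy case two reference roads are missing from the mapped set (omission) and one mapped road has no counterpart in the reference (commission).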
OSM data quality has been examined in several recent studies (Girres and
Touya 2010; Haklay 2010; Haklay et al. 2010). Haklay (2010) evaluates the positional accuracy of OSM
in reference to Ordnance Survey data for Great Britain based on the methodology of Goodchild and
Hunter (1996), by analyzing the percentage of overlaps between both data sets within a buffer
distance. Girres and Touya (2010) similarly evaluated different aspects of OSM data quality in France.
Both studies showed that in terms of positional accuracy the quality of OSM data is comparable to
traditional geographic datasets from national mapping agencies or commercial providers.
Nevertheless, comprehensive studies concerning completeness of OSM are lacking so far and thus
represent a compelling area of research, which is the main objective of this paper.
The measurement of VGI completeness is a complex task and bears several difficulties. First, VGI
activities address a large user-base, whose motivations for participating, contributing, and using
spatial data differ substantially (Heipke 2010). Second, a strictly defined dataset specification may be
contradicting the plurality of a user community, but without a precise specification of the dataset it is
not possible to detect errors of completeness (van Oort 2006; Servigne et al. 2010). Third, in OSM
anybody is permitted to add, delete, or modify data. However, mapping guidelines exist that are
recommended to be followed by contributors. The guidelines are communicated, discussed and
modified through a wiki, and reflect the consensus of the community. Because this specification is
not authoritative, it is not possible to measure completeness of OSM in a strict sense. Nevertheless,
several studies comparing the completeness of OSM to other datasets exist (e.g., Girres and Touya
2010; Haklay 2010; Haklay et al. 2010; Zielstra and Zipf 2010). Those studies, however, only consider
objects of certain types (e.g., roads or rivers) for descriptive measurements.
For OSM it is not guaranteed that a certain object will ever be mapped. On a global scale the digital
divide is an important factor for incomplete mapping of less developed countries (Goodchild 2008;
Maue and Schade 2008). On a local scale the absence of voluntary contributors in disadvantaged
areas is a primary cause for omission errors (e.g., Haklay 2010). In particular, it appears that
completeness in the sense of present features not only depends on the density of information within
an area, but also on the number of contributors, which is generally high in densely populated areas
(Girres and Touya 2010). Third-party contributions to OSM, such as the import of the TIGER data
from the US Census Bureau and the availability of cadastral data from French authorities, improve
the situation of missing data. Beyond that, it is expected that intelligent tools and powerful
visualizations might further be helpful in detecting and fixing spatial data quality issues (e.g., see the
prototype of the web-based attribute visualization tool OSMatrix, http://osmatrix.uni-hd.de; Roick and Hagenauer 2011).
However, the absence of data may severely affect the fitness for use. For instance, incomplete or
wrongly mapped representations of road networks and the surrounding environment may induce
inaccurate route proposals of a navigation application (e.g., Neis and Zipf 2008). This in turn
negatively affects the users’ activity space and time schedule. In particular, this is true for larger
urban areas, affected by traffic jams, speed limits, and one-way streets, where it is often advisable to
take an alternative bypass route.
OSM exhibits implicit information that can be used to fill the gap between the contents of the data
set and information needed for the intended application. Urban areas are generally not delineated in
OSM, but there are various possibilities to derive them from existing data. A straightforward solution
is to aggregate land use information to form urban areas. However, especially in sparsely mapped
areas and for small rural communities such land use information is mostly absent in OSM. Based on
the method of Rozenfeld et al. (2008), Jiang and Jia (2011) propose a clustering algorithm to derive
city boundaries from OSM street nodes. Their methodology aggregates street nodes within a certain
distance to clusters. The choice of the distance has a crucial effect on the result. Their approach
validates Zipf’s law, which relates the size of cities to their rank. However, their approach
ignores other OSM data and inherent non-linear relationships. Furthermore, both approaches derive
only crisp urban areas. In urban geography, however, there is strong evidence that metropolitan areas are
nowadays shaped by sub- and postsuburbanization processes (e.g., Helbich and Leitner 2009, 2010),
which cause a continuous transition between urban and rural areas that cannot be delineated by a crisp
and dichotomous classification scheme (Leung 1987). A continuous, density-based
representation of urban areas is more suitable but is difficult to obtain due to the complex relationships
of urbanization processes.
Developments in artificial intelligence offer a remedy and bear the potential to solve geographical problems
that were previously difficult to solve (Smith 1984, Gahegan 2003). In particular, Artificial Neural
Networks (ANNs) are appealing for spatial analysis (Openshaw and Openshaw 1997), because of their
computational speed, representational flexibility, ability to model non-linear relationships, and
computational adaptivity (Fischer 1997). ANNs perform particularly well compared to conventional
statistical models if the data are incomplete or inconsistent (Fischer and Gopal 1994), which is often
the case with complex spatial data such as OSM. In GIScience, ANNs have already shown high
potential for modeling complex geographic processes (e.g., Fischer and Gopal 1994; Pijanowski et al.
2002; Mas et al. 2004).
This paper makes an initial empirical contribution and charts current urban patterns on the basis of
VGI. Therefore, the objective of this study is to develop a density-based methodological framework
to delimit continuous urban areas using the whole information diversity of OSM. To capture
possible non-linear relationships, interactions, and spatial effects within the GIS-based data, ANN
techniques are applied. The framework mitigates data completeness issues of OSM and thus helps to
improve the fitness for use of OSM for individual applications. The usefulness is demonstrated on a
set of selected European urban regions.
The paper is structured as follows: Section 2 provides an overview concerning the study area and the
data sets. Section 3 introduces the methodology used to detect urban patterns. Results of the
empirical analysis are discussed in section 4 and the paper concludes (section 5) with a discussion of
the results and identifies future work.
2 Materials
2.1 Study Site and Data
Training of ANNs requires reference information for learning. For the European Union (EU), two
publicly available urban land use datasets are predominant: CORINE (Coordination of Information on
the Environment) Land Cover (CLC) data and Global Monitoring for Environment and Security Urban
Atlas (GMESUA) data. The former has the advantage that it is fully available for the whole territory of
the EU but has a minimum mapping unit of 25 hectares and a minimum width of linear elements of
100 meters. Therefore, it is only suitable for small-scale mapping applications. The second
alternative, the GMESUA data, is a product of a joint initiative of the European Commission and the
European Space Agency. As of the first quarter of 2011, the GMESUA data set covers 242 urban
regions within Europe, which differ in socio-economic and demographic factors.
The acquisition of GMESUA is based on SPOT-5 satellite images with a 10 m multispectral and 2.5 m
panchromatic pixel resolution. The multispectral data includes a near-infrared band. Compared to
CLC, the data has a considerably finer resolution: linear elements with a width of 10 m are mapped
and the minimum mapping unit is 0.25 ha for urban areas and 0.55 ha for non-urban areas. Thereby, 44
different land use categories are distinguished. The advantage of high spatial and thematic resolution
is reduced by the fact that the dataset is only available for selected urban areas with more than
100,000 inhabitants in 27 different countries at a scale of 1:10,000. Figure 1 illustrates the selected
urban regions. To keep computation feasible, a random sample of about 20% of the regions,
corresponding to 42 regions, was drawn to generate the training and validation data set.
Figure 1 Countries covered by GMESUA (dark gray) and the 42 randomly selected GMESUA urban
regions (black).
3 Methodology
The proposed research design for the estimation of urban land use patterns comprises
three major consecutive steps:
1. Data preparation: As a first step, it is necessary to prepare the OSM and GMESUA data such
that both are usable for model building. In particular, a large set of potential attributes is
derived from OSM for inductive learning and the desired output is calculated from GMESUA
(Section 3.1).
2. Selection and model building: Second, a genetic algorithm (GA) is used to reduce the total set
of attributes to a reasonable subset and an artificial neural network (ANN) is trained with
this subset (Sections 3.2 and 3.3).
3. Sensitivity analysis and model performance: Finally, due to the unknown contribution of the
attributes to the model, their significance is analyzed (Section 3.4) and the model
performance for the different areas is investigated.
3.1 Data preparation
Training data for ANNs consist of a set of training samples. Each sample is a pair of an input vector
and a desired output. Therefore, it is necessary to derive input vectors from OSM data and the
desired output from the reference GMESUA data set to learn the relationship between them.
However, both datasets consist of manifold and diverse information that needs to be aggregated to a
normalized representation. The choice of the areal units for aggregation is crucial and is possibly,
as in regression analysis (Fotheringham and Wong 1991), affected by the modifiable areal unit
problem (Openshaw 1984). In this study, aggregation is carried out on a hexagonal raster
representation. This seems more reasonable than square cells because hexagonal shapes
can better imitate European urban patterns at every scale. The side length of every hexagonal cell is
250 m for this European-wide analysis, which is a trade-off between computational burden and
spatial resolution, but allows the derivation of fine-scaled urban patterns.
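To illustrate the hexagonal raster, the following minimal sketch generates cell centers and vertices for regular hexagons with a side length of 250 m. It assumes projected coordinates in meters and a flat-topped cell orientation; the orientation, the bounding box, and all function names are assumptions made for this example, since the text does not specify them.

```python
import math


def hexagon_vertices(cx, cy, side):
    """Vertices of a flat-topped regular hexagon centered at (cx, cy)."""
    return [
        (cx + side * math.cos(math.radians(60 * k)),
         cy + side * math.sin(math.radians(60 * k)))
        for k in range(6)
    ]


def hex_grid_centers(xmin, ymin, xmax, ymax, side):
    """Centers of a hexagonal tessellation covering a bounding box."""
    dx = 1.5 * side               # horizontal distance between columns
    dy = math.sqrt(3) * side      # vertical distance between rows
    centers = []
    col = 0
    x = xmin
    while x <= xmax:
        # Every second column is shifted by half a row height.
        y = ymin + (dy / 2 if col % 2 else 0.0)
        while y <= ymax:
            centers.append((x, y))
            y += dy
        x += dx
        col += 1
    return centers


def hexagon_area(side):
    """Area of a regular hexagon with the given side length."""
    return 1.5 * math.sqrt(3) * side ** 2


SIDE = 250.0  # side length in meters, as stated in the text
centers = hex_grid_centers(0.0, 0.0, 1000.0, 1000.0, SIDE)  # hypothetical extent
cell_area_ha = hexagon_area(SIDE) / 10_000
```

With a side length of 250 m each cell covers roughly 16.2 ha, which gives a sense of the spatial resolution of the aggregation.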
3.1.1 GMESUA Urban Regions as Desired Output
GMESUA subsumes continuous and discontinuous urban fabric of built-up areas and its respective
associated land according to its primary use. The latter is further distinguished between degrees of
soil sealing (EEA 2010). To derive urban areas from land use classification, reclassification of the
original data is required, which is an ambiguous and subjective process. While for few classes of
GMESUA a clear distinction between urban and non-urban areas is obvious (e.g., category 1.1.1:
continuous urban fabric with sealing level above 80%), most classes require additional information to
achieve a clear class membership. Here, primary aerial images and local knowledge were used to
reclassify the GMESUA datasets. For each cell the overlap between the cell and the resulting urban
areas is computed and assigned as an attribute, representing the desired output for the ANN.
3.1.2 Derivation of OSM Attributes (Input data)
Basically, OSM presents three different types of information: geometric information, attributive
information, and meta-data (Ramm and Topf 2010). These types potentially contain, implicitly or
explicitly, information that can be used for urban pattern detection. A spatial object in OSM is
characterized by its geometric primitive and a set of assigned tags. A tag is a pair of a key and
a value that represents the attributive information of a specific object, e.g. a linear
geometry with the assigned tags highway="primary" and oneway="true" describes a major highway
that is only accessible in one direction. Although the data model of OSM is strictly specified, the tagging
itself is open in the sense that every user is permitted to assign arbitrary tags to any object.
Including information about sparsely mapped objects cannot be expected to significantly improve the
total model performance; instead, it bears the risk of overfitting. Highways and places are
assumed to be generally well mapped for most regions. However, due to the freedom of users to
assign keys and tags at will, the potential number of different highway and place categories is
arbitrarily large and thus requires a certain generalization. Therefore, only highways and places with a
reasonably high occurrence in OSM² are considered. Most objects in OSM have meta-information
assigned (e.g., the time of the last edit or the name of the user who last modified an
object). Haklay (2010) indicated that mapping habits of people within urban areas differ from those of
people residing in rural areas. Thus, it can be expected that this difference is reflected in the
meta-information of spatial objects. Basic descriptive statistics (e.g., minimum, maximum, average) are
calculated from the meta-information of all considered objects within each raster cell. According to the concept of
spatial autocorrelation, geographically close observations depend on each other (Tobler 1970). Thus,
the distance to an object predominantly found in urban areas relates to the urbanization of an actual
raster cell and implicitly includes autocorrelated processes in the model. Hence, it is reasonable to
include the nearest distance to different objects, e.g. nearest highways, as an attribute for each
raster cell. For the derivation of geometric and topological attributes, it is necessary to distinguish
between the geometric primitives of OSM. Of special interest are the properties of lines, mostly
representing roads. It is hypothesized that urban areas show a higher total road length and more
junctions, curves, and right angles because of the necessity of dense traffic infrastructure in densely
populated areas; these properties are consequently included as raster cell attributes. Further, graph
centrality (Nieminen 1974) measures the number of nodes that link a given node. Previous studies by
Yang and Harry (2004) and Bak et al. (2010) have documented the capability of this index for street
network analysis. In conclusion, Table 1 gives an overview of the 102 derived statistics and attributes
for each cell.
² Frequently used tags: a) highway-tags: residential, unclassified, tertiary, secondary, primary, motorway,
motorway_link, steps, trunk, path, track, footway, service, living_street, cycleway; b) place-tags: town, hamlet,
village, suburb, locality.
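The derivation of cell attributes from tags and meta-information can be sketched as follows. The sketch assumes, for simplicity, that each OSM object is reduced to a single point in projected coordinates (meters) and stored as a dictionary; this data layout, the sample objects, and the function names are hypothetical and only serve to illustrate the tag filtering, the nearest-distance attribute, and the descriptive statistics of the version number.

```python
import math

# Frequently used highway tags as listed in the footnote of the text.
FREQUENT_HIGHWAY_TAGS = {
    "residential", "unclassified", "tertiary", "secondary", "primary",
    "motorway", "motorway_link", "steps", "trunk", "path", "track",
    "footway", "service", "living_street", "cycleway",
}


def nearest_distance(cell_center, objects, key, value):
    """Distance from a cell center to the nearest object carrying key=value."""
    cx, cy = cell_center
    distances = [
        math.hypot(obj["x"] - cx, obj["y"] - cy)
        for obj in objects
        if obj["tags"].get(key) == value
    ]
    return min(distances) if distances else None


def version_statistics(objects):
    """Min./max./avg. version number of the given objects."""
    versions = [obj["version"] for obj in objects]
    if not versions:
        return None
    return {"min": min(versions), "max": max(versions),
            "avg": sum(versions) / len(versions)}


# Hypothetical objects; the last one carries a rarely used, user-defined tag.
objects = [
    {"x": 300.0, "y": 400.0, "version": 3, "tags": {"highway": "residential"}},
    {"x": 0.0, "y": 100.0, "version": 1,
     "tags": {"highway": "primary", "oneway": "true"}},
    {"x": 50.0, "y": 50.0, "version": 2, "tags": {"highway": "my_custom_road"}},
]

# Generalization step: keep only objects with frequently used highway tags.
considered = [o for o in objects
              if o["tags"].get("highway") in FREQUENT_HIGHWAY_TAGS]

dist_residential = nearest_distance((0.0, 0.0), considered, "highway", "residential")
stats = version_statistics(considered)
```

Real OSM ways are line geometries, so a faithful implementation would measure distances to line segments rather than to single points.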
Table 1 Derived OSM attributes for each cell³

Attributes for selected highway types:
  Length; Distance; Curviness; Number of waypoints
Aggregated attributes for selected highways:
  Number of junctions; Number of junctions with at least one right angle curve;
  Number of roads with right angle curves; Number of road endings;
  Min./Max./Total/Avg. angle(s); Min./Max./Total/Avg. centrality
Attributes for selected place types:
  Number of points; Distance
Aggregated attributes for selected highways and places:
  Number of objects; Min./Max./Avg. version number(s);
  Earliest/Latest/Avg. time of modification(s);
  Total/Min./Max./Avg. number of user contributions;
  Min./Max./Total/Avg. number of object tags
3.2 Genetic Algorithm for Attribute Selection
As outlined in Section 3.1, several cell-based OSM attributes are calculated. The performance of ANN
models when learning a regression function depends on the choice of attributes (Pyle 1999). The
attributes implicitly define a pattern language (Yang and Honavar 1997). If the language is not
expressive enough, a model will fail to capture the information necessary to approximate the target
function. Conversely, if the language is too expressive, the computational time to learn the model
increases and the model becomes vulnerable to overfitting. Due to the large number of attributes and the non-linear
relationships between the attributes, heuristics are promising to obtain near-optimal attribute sets
for ANN model building (Siedlecki and Sklansky 1989).
Attribute selection approaches can be categorized as follows (Liu and Yu 2004): The filter approach
uses statistics to measure the relevance of attributes. It is totally independent of the learning
algorithm, thus it is computationally more efficient than the wrapper approach, which involves
computational overhead by executing the training algorithm for every presented attribute set and
evaluating the results. Attribute selection may not be independent of the learning algorithm, which is
ignored by the filter approach. In contrast, the wrapper approach takes the properties and biases of
the inductive learning algorithm into account. Because the wrapper approach is generally
computationally demanding, genetic algorithms (GA) are especially promising for attribute selection
(Siedlecki and Sklansky 1989).
³ For selected highways and places see footnote 2.
A GA is a heuristic optimization method, simulating natural evolution processes in analogy to biology
(Holland 1975; Goldberg 1989). GAs represent a potential solution as an individual. Each individual is
encoded by a chromosome, comprised of a set of genes. The chromosomes are often coded as a
binary string. A set of individuals constitutes a population. This population iteratively evolves, until a
stop criterion is reached. At each iterative step of the GA the fitness of the individuals of the current
population is measured. Afterwards, the population for the next iterative step is built by selecting,
recombining, and mutating the most promising individuals of the current population (Mitchell 1998).
Because only promising individuals take part in the evolutionary process, it is likely that near-optimal
solutions emerge. The final solution is chosen from the individuals of the last population. In contrast
to gradient-descent optimization, multiple solutions are maintained in parallel within a population,
allowing interactions among them to explore regions in the search space between them (Qi et al.
1994). The goodness of a solution, i.e. the fitness of an individual, is numerically evaluated by a
fitness function, which depends on the optimization objective. To utilize GAs for attribute selection
following the wrapper approach, it is necessary to represent different attribute sets as individuals.
For each individual of a population an ANN is trained and its performance is measured, representing
the goodness of the individual.
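The wrapper-style attribute selection described above can be sketched with a simple single-objective GA. Each individual is a binary chromosome in which gene i indicates whether attribute i is part of the attribute set. To keep the example short and runnable, the fitness function `toy_fitness` is a hypothetical stand-in that rewards three "informative" attributes and penalizes large sets; in the wrapper approach it would instead train an ANN on the selected attributes and return its performance. The selection scheme, crossover operator, and all parameter values are assumptions, not the configuration used in this study.

```python
import random


def genetic_attribute_selection(n_attributes, fitness, pop_size=30,
                                generations=40, mutation_rate=0.05, seed=0):
    """Evolve binary chromosomes; gene i == 1 means attribute i is selected."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_attributes)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]        # selection of promising individuals
        children = [ranked[0][:]]                # elitism: keep the best individual
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_attributes)  # one-point recombination
            child = a[:cut] + b[cut:]
            # Mutation: flip each gene with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
    return max(population, key=fitness)


# Hypothetical stand-in for "train an ANN and measure its performance".
INFORMATIVE = {0, 3, 7}


def toy_fitness(chromosome):
    reward = sum(1.0 for i in INFORMATIVE if chromosome[i])
    penalty = 0.1 * sum(chromosome)  # prefer small attribute sets
    return reward - penalty


best = genetic_attribute_selection(10, toy_fitness)
selected_attributes = [i for i, gene in enumerate(best) if gene]
```

Because the best individual is carried over unchanged in every generation, the fitness of the returned solution never falls below that of the best individual in the initial population.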
3.3 Artificial Neural Network
Artificial neural networks (ANNs) model an interconnected system of neurons, enabling computers to
imitate the brain’s ability to detect patterns and learn relationships within data (Fischer 1998). The
multi-layer perceptron (MLP), introduced by Rumelhart et al. (1986), is one of the most widely used
ANNs (e.g., Fischer and Gopal 1994; Pijanowski et al. 2002; Mas et al. 2004). The MLP usually consists
of three different layers of neurons: the input layer, the hidden layer, and the output layer. Every
connection between neurons of different layers has an assigned weight, scaling input signals passing
through. The input data is first presented to the input layer, and then subsequently passed to the
hidden layer and to the output layer in a feed-forward manner. A neuron receives the weighted
signals from connected neurons of the preceding layer, sums the signals, and calculates an output
signal according to its inner activation function (Bishop 1996).
The crucial part of ANNs is the adaptation of the weights, so that the model is capable of representing a
target function. The most popular way of training an ANN is by modifying its weights using the
backpropagation algorithm (Rumelhart et al. 1986). This algorithm randomly sets the weights and
calculates the resulting output. After all data samples are presented to the network, the
mean squared error is calculated and the weights are modified according to a generalized delta rule
(Rumelhart et al. 1986), so that the total error is distributed among the various nodes in the network.
This process of feeding forward input signals and back propagating the errors is repeated iteratively,
until a terminating condition is fulfilled (e.g. the error-rate falls below a certain threshold).
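The feed-forward and backpropagation steps can be illustrated with a minimal MLP with one hidden layer and a single output neuron. The sketch assumes sigmoid activation functions, batch updates after all samples have been presented, and a fixed number of training cycles as terminating condition; the class name `MLP`, the toy data, and the weight initialization range are hypothetical choices for this example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class MLP:
    """One hidden layer, one output neuron; the last weight of each neuron is its bias."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        hidden = [sigmoid(w[-1] + sum(wi * xi for wi, xi in zip(w, x)))
                  for w in self.w_hidden]
        output = sigmoid(self.w_out[-1]
                         + sum(wi * hi for wi, hi in zip(self.w_out, hidden)))
        return hidden, output

    def mse(self, samples):
        return sum((self.forward(x)[1] - t) ** 2 for x, t in samples) / len(samples)

    def train_epoch(self, samples, learning_rate=0.3):
        grad_hidden = [[0.0] * len(w) for w in self.w_hidden]
        grad_out = [0.0] * len(self.w_out)
        for x, target in samples:
            hidden, output = self.forward(x)
            # Error signal of the output neuron (delta rule with sigmoid derivative).
            delta_out = (output - target) * output * (1.0 - output)
            for j, h in enumerate(hidden):
                grad_out[j] += delta_out * h
                # Error propagated back to hidden neuron j.
                delta_h = delta_out * self.w_out[j] * h * (1.0 - h)
                for i, xi in enumerate(x):
                    grad_hidden[j][i] += delta_h * xi
                grad_hidden[j][-1] += delta_h
            grad_out[-1] += delta_out
        n = len(samples)
        for j, w in enumerate(self.w_hidden):
            for i in range(len(w)):
                w[i] -= learning_rate * grad_hidden[j][i] / n
        for j in range(len(self.w_out)):
            self.w_out[j] -= learning_rate * grad_out[j] / n


# Hypothetical toy data: two inputs in [0, 1], target is their average.
rng = random.Random(1)
samples = []
for _ in range(20):
    x1, x2 = rng.random(), rng.random()
    samples.append(((x1, x2), 0.5 * (x1 + x2)))

net = MLP(n_in=2, n_hidden=3)
error_before = net.mse(samples)
for _ in range(1000):
    net.train_epoch(samples)
error_after = net.mse(samples)
```

Comparing `error_before` and `error_after` shows how repeated cycles of feeding forward and back propagating reduce the training error.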
It has been shown that ANNs with one hidden layer can theoretically approximate any function
(Hornik et al. 1989). However, a certain degree of freedom must be supplied, i.e. the layer must
consist of a sufficient number of hidden neurons. Generally, the number of hidden neurons is chosen
to balance the trade-off between network bias and variance (Bishop 1995). A limitation in the use of
ANNs is that they provide a “black box” model. It is difficult to gain deep insight into the inner
workings of an ANN by interpreting the weights of the network. Nevertheless, numerical analysis of
different input settings may help to gain insights into the importance of attributes. One primary
advantage of ANNs, compared to the more easily interpretable decision trees, is the ability to model
unknown interactions between input variables and the relationship between such interactions and any
output pattern (Pyle 1999).
3.4 Significance Analysis
Although the total set of input variables for ANN training is reduced by a GA, the relative
contribution of the remaining attributes to the total model performance is not known. However,
being aware of the importance of the attributes can advance the understanding of the model and its
explanatory capabilities. To evaluate the relative contribution of each attribute to the output, several
methods have been proposed (Gevrey et al. 2003; Olden et al. 2004).
Because of the convergence of the GA to an optimal solution, the genetic diversity of the individuals
within a generation is generally decreasing. Consequently, it is hypothesized that genes important to
the survival of individuals are present in most chromosomes of individuals within a generation, while
unimportant genes are sparse and diverse. Thus, by counting the frequency of the attributes
represented within a generation, the importance of the attributes to the output of the ANN can be
estimated. Another method to measure the importance of an attribute is to measure the change of
root mean square error (RMSE) when sequentially and stepwise setting input neurons to their mean
value (SSMV). The resulting change indicates the relative importance of each attribute (Gevrey et al.
2003). Because the two techniques can lead to diverse conclusions, both are applied and compared
in the next section.
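Both importance measures can be sketched in a few lines. The first function counts how often each attribute is present in the chromosomes of a generation; the second sets one input at a time to its mean value and records the resulting change in RMSE. The function names, the toy population, and `toy_model`, a linear function standing in for a trained ANN, are hypothetical and serve only to make the example self-contained.

```python
import math


def attribute_frequencies(population):
    """Share of individuals in a generation that contain each attribute."""
    size = len(population)
    return [sum(chrom[i] for chrom in population) / size
            for i in range(len(population[0]))]


def rmse(predict, rows, targets):
    return math.sqrt(sum((predict(r) - t) ** 2
                         for r, t in zip(rows, targets)) / len(rows))


def ssmv_importance(predict, rows, targets):
    """Change in RMSE when each input is set, one at a time, to its mean value."""
    baseline = rmse(predict, rows, targets)
    importance = []
    for i in range(len(rows[0])):
        mean_i = sum(r[i] for r in rows) / len(rows)
        modified = [r[:i] + (mean_i,) + r[i + 1:] for r in rows]
        importance.append(rmse(predict, modified, targets) - baseline)
    return importance


# Hypothetical last generation of a GA with three attributes.
population = [[1, 0, 1], [1, 1, 0], [1, 0, 0]]
frequencies = attribute_frequencies(population)


# Hypothetical stand-in for a trained ANN: attribute 0 matters most,
# attribute 2 not at all.
def toy_model(row):
    return 0.8 * row[0] + 0.1 * row[1] + 0.0 * row[2]


rows = [(0.0, 1.0, 0.3), (0.5, 0.0, 0.9), (1.0, 0.5, 0.1), (0.25, 0.75, 0.6)]
targets = [toy_model(r) for r in rows]
importance = ssmv_importance(toy_model, rows, targets)
```

In this toy setting the RMSE change is largest for the first attribute and zero for the third, which mirrors the weights of the stand-in model.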
4 Results
4.1 Model Specification and Overall Quality
For variable selection, a non-dominated sorting genetic algorithm (NSGA-1; Srinivas and Deb
1994) was used. The algorithm was allowed to run for at most 1,000 iterations, but stops earlier if the
performance does not improve significantly for 25 iterations. Each generation consisted of 100
individuals, and each individual is represented by a trained ANN model. To reduce the computation
time and to limit the resulting model to a reasonable size, the maximum number of attributes was set
to 20 for each individual. The training data consist of a subset of 20,000 cells (4% of the whole dataset), randomly
selected from all regions of the dataset (see Sec. 2). The remaining cells are used for testing and
validation. After empirical tests, the final ANN consists of a single hidden layer with (n + 1)/2 hidden
neurons, where n is the number of attributes, even though the optimal number of hidden neurons is
not known. The ANN is trained for 1,000 cycles by backpropagation with a learning rate of 0.3. The
final model is selected from the last GA generation, based on the RMSE, the squared correlation
coefficient (R2), Spearman’s rho (RS), and the number of attributes. The resulting model of the GA
optimization consists of 11 remaining attributes (distance to nearest residential road, length of
residential roads, number of waypoints of primary roads, length of motorways, cycleway curviness,
distance to nearest pedestrian road, distance to nearest track, length of tracks, number of junctions,
distance to nearest village, and number of attributes). Overall, the model showed moderate
performance with an RMSE of 0.12, an R2 of 0.6, and an RS of 0.59. Applying the model to the remaining
data, which are independent of model building, allows the investigation of its generalization
capabilities. The result is a similar performance with an RMSE of 0.12, an R2 of 0.59, and an RS of
0.58. A further residual inspection shows a mean of -0.05 and a standard deviation of 1.03 and
confirms a nearly Gaussian distribution. This attests to the capability of the model.
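The wrapper idea behind this setup can be illustrated with a minimal sketch. It is not the NSGA used in the study: it is a plain single-objective GA with truncation selection, and the fitness function `toy_fitness` is a hypothetical stand-in for training an ANN on the selected attributes and returning its RMSE. The pool of 40 candidate attributes is likewise an assumption made for illustration. Only the population size, the iteration limit, the stopping rule, and the attribute cap are taken from the text.

```python
import random

MAX_ATTRS = 20   # cap on attributes per individual (from the text)
POP_SIZE = 100   # individuals per generation (from the text)
MAX_ITER = 1000  # upper bound on iterations (from the text)
PATIENCE = 25    # stop after this many iterations without improvement


def toy_fitness(mask):
    """Hypothetical stand-in for the RMSE of an ANN trained on the selected
    attributes: attributes 0-4 are 'useful', every selected attribute
    adds a small size penalty. Lower is better."""
    useful = sum(1 for i in range(5) if mask[i])
    return 1.0 - 0.15 * useful + 0.01 * sum(mask)


def repair(mask, rng):
    """Enforce the attribute cap by switching off random surplus bits."""
    on = [i for i, bit in enumerate(mask) if bit]
    for i in rng.sample(on, max(0, len(on) - MAX_ATTRS)):
        mask[i] = 0
    return mask


def select_attributes(n_attrs, fitness, seed=0):
    """Return the best attribute mask found and its fitness value."""
    rng = random.Random(seed)
    pop = [repair([rng.randint(0, 1) for _ in range(n_attrs)], rng)
           for _ in range(POP_SIZE)]
    best, best_score, stale = None, float("inf"), 0
    for _ in range(MAX_ITER):
        scored = sorted(pop, key=fitness)
        top_score = fitness(scored[0])
        if top_score < best_score - 1e-9:
            best, best_score, stale = scored[0][:], top_score, 0
        else:
            stale += 1
            if stale >= PATIENCE:
                break  # no improvement for PATIENCE iterations
        parents = scored[:POP_SIZE // 2]        # truncation selection
        pop = [p[:] for p in parents]           # parents survive unchanged
        while len(pop) < POP_SIZE:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_attrs)     # one-point crossover
            child = a[:cut] + b[cut:]
            j = rng.randrange(n_attrs)          # single bit-flip mutation
            child[j] = 1 - child[j]
            pop.append(repair(child, rng))
    return best, best_score


best_mask, best_value = select_attributes(40, toy_fitness)
selected = [i for i, bit in enumerate(best_mask) if bit]
```

In a real application the fitness call dominates the run time, because every individual requires a trained network; this is why limiting the number of attributes and stopping early matter.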
4.2 Regional Model Performance
Due to the spatial heterogeneity of geographic processes, it is assumed that the performance of local
models changes if applied to distinct areas. Reasons are, on the one hand, differences in
urbanization, economic power, as well as cultural issues, and, on the other hand, the varying quality
of OSM data underlying the model. To assess the influence of locality on model performance, the model is
applied to each GMESUA region and its generalization capabilities are examined separately. The
results are summarized in Table 2.
Table 2 Model performance for selected regions within the study area

Urban Region      GMESUA     % of cells     RMSE    R2      RS
                  Regions⁴   intersecting
                             urban areas
Linz              At003l     48.8           0.121   0.695   0.650
Varna             Bg003l     28.7           0.194   0.353   0.554
Ruse              Bg006l     18.4           0.125   0.440   0.483
Brno              Cz002l     31.6           0.114   0.646   0.637
Ústí nad Labem    Cz005l     35.9           0.127   0.583   0.649
Jihlava           Cz014l     21.3           0.070   0.700   0.628
Leipzig           De008l     38.6           0.135   0.627   0.700
Bremen            De012l     36.0           0.103   0.709   0.626
Darmstadt         De025l     35.8           0.126   0.757   0.710
Mönchengladbach   De036l     80.4           0.186   0.701   0.837
Koblenz           De042l     37.8           0.118   0.743   0.720
Odense            Dk003l     45.3           0.118   0.580   0.476
Valencia          Es003l     53.1           0.185   0.534   0.678
Toledo            Es016l     12.1           0.075   0.337   0.275
Cordoba           Es020l     16.5           0.108   0.538   0.492
Bordeaux          Fr007l     45.0           0.167   0.528   0.628
Rennes            Fr013l     55.0           0.119   0.590   0.449
Besançon          Fr025l     31.4           0.101   0.632   0.666
Patrai            Gr003l     26.1           0.110   0.667   0.620
Miskolc           Hu002l     24.1           0.167   0.423   0.500
Debrecen          Hu005l     25.1           0.185   0.313   0.423
Győr              Hu007l     24.8           0.154   0.433   0.553
Firenze           It007l     40.5           0.117   0.647   0.629
Bologna           It009l     42.1           0.137   0.463   0.489
Trieste           It015l     57.1           0.169   0.590   0.773
Panevėžys         Lt003l     15.8           0.121   0.210   0.284
Luxembourg        Lu001l     32.1           0.095   0.683   0.678
Liepāja           Lv002l     0.08           0.060   0.361   0.202
Rotterdam         Nl003l     62.1           0.193   0.598   0.755
Tilburg           Nl006l     60.1           0.206   0.509   0.678
Wrocław           Pl004l     29.2           0.129   0.450   0.559
Poznań            Pl005l     32.0           0.122   0.569   0.599
Setúbal           Pt006l     57.3           0.236   0.155   0.390
Aveiro            Pt008l     52.8           0.242   0.129   0.340
Brăila            Ro008l     17.6           0.123   0.676   0.484
Călăraşi          Ro012l     19.1           0.163   0.442   0.480
Umeå              Se005l     0.07           0.054   0.428   0.225
Banská Bystrica   Sk003l     17.4           0.072   0.710   0.570
Žilina            Sk006l     26.8           0.109   0.589   0.597
Sheffield         Uk010l     47.5           0.125   0.789   0.781
Leicester         Uk014l     47.9           0.129   0.742   0.639
Wolverhampton     Uk028l     53.6           0.158   0.728   0.715

⁴ This code allows a clear assignment to the data sets. The first two characters correspond to the particular
country according to the ISO 3166 standard.
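Because the first two characters of each region code identify the country, the regional figures can be summarized per country. The following sketch is only an illustration: it copies the R2 values of three countries from Table 2 and computes an unweighted mean per country, which ignores the differing number of cells per region. The function name `mean_by_country` is hypothetical.

```python
from collections import defaultdict

# (region code, R2) pairs copied from Table 2 for three countries
rows = [
    ("Uk010l", 0.789), ("Uk014l", 0.742), ("Uk028l", 0.728),
    ("Pt006l", 0.155), ("Pt008l", 0.129),
    ("Hu002l", 0.423), ("Hu005l", 0.313), ("Hu007l", 0.433),
]


def mean_by_country(rows):
    """Group values by the two-letter prefix of the region code and
    return the unweighted mean per group."""
    groups = defaultdict(list)
    for code, value in rows:
        groups[code[:2].upper()].append(value)
    return {country: sum(v) / len(v) for country, v in groups.items()}


country_r2 = mean_by_country(rows)
# UK: 0.753, HU: about 0.390, PT: 0.142
```

Even this crude summary reproduces the contrast between countries that the regional figures show individually.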
For most regions the model performs comparably well, with a few remarkable exceptions. For
instance, the region of Liepāja, Latvia, has a very low RMSE (0.060), which suggests that the model
mostly predicts well. However, the low R2 (0.361) and RS (0.202) indicate that the low RMSE is not the
result of a good prediction of urban areas, but is caused by the sparse urbanization in that
region. The largest city in this region is Liepāja, with a population of about 60,000, and only three further
towns with more than 1,000 residents are located in that region. In fact, only 8% of all cells are
covered at least partially by urban areas. Additionally, OSM data are mostly absent, except for the city
of Liepāja itself; thus the model necessarily failed to predict urban areas.
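Why a low RMSE can coincide with a low R2 in a sparsely urbanized region can be seen in a small synthetic example. The sketch below implements the three quality measures in plain Python and applies them to made-up data, not to data from this study: 95 of 100 cells are non-urban, and the hypothetical model detects only one of the five urban cells.

```python
import math


def rmse(obs, pred):
    """Root mean squared error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def pearson_r(x, y):
    """Pearson correlation coefficient; its square is R2."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Average ranks (1-based), so that tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))


# Synthetic region: 95 rural cells, 5 cells with an urban share of 0.6
observed = [0.0] * 95 + [0.6] * 5
# Hypothetical prediction that finds only one of the five urban cells
predicted = [0.0] * 95 + [0.0, 0.0, 0.0, 0.0, 0.6]

error = rmse(observed, predicted)
r2 = pearson_r(observed, predicted) ** 2
rs = spearman_rho(observed, predicted)
```

Here the RMSE is 0.12, although four of five urban cells are missed, while R2 is only about 0.19. The many correctly predicted empty cells keep the RMSE small, which is why the correlation measures are needed alongside it.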
As a second example, the region of Aveiro, Portugal, has an exceptionally high RMSE (0.242). The high
RMSE is caused by a discrepancy between a high density of urbanization and sparse OSM data in that
region. In fact, 53% of the cells in that region are at least partially covered by urban structures.
Figure 2 depicts a section of this region, including the city of Ílhavo. Ílhavo has approximately
16,800 residents. However, virtually no OSM data are mapped for this city (Fig. 2 lower panel). Only
the place and a single primary highway exist, although the comparison to GMESUA reveals that the
region is densely settled (Fig. 2 upper panel). Such a discrepancy is, of course, reflected in the
performance of the model.
Figure 2 Urban areas of Aveiro (Portugal) according to GMESUA (upper panel) and OSM data (lower
panel)
However, for several regions the model performs exceptionally well. In particular, for the region of
Jihlava, Czech Republic, the model results in a very low RMSE (0.07) and also a considerably high R2
(0.7) and RS (0.682), even though 21% of the cells are at least partially covered by urban structures.
The good performance for this region indicates that the quality of the OSM data meets the data
requirements of the ANN model.
The OSM data quality of the United Kingdom is well understood, and it has been empirically verified that at
least urban areas are well covered in the sense of completeness (Haklay 2010; Haklay et al. 2010).
These results are supported by this study as well. Generally, R2 and RS are high for all urban regions
of the United Kingdom. One stated key motivation of this work is to improve the
completeness of OSM by machine learning, and thus improve its fitness for use. The developed
model is able to predict urban patterns based on a set of attributes derived from OSM data. Figure 3
(lower panel) shows a successful prediction of urban patterns for Great Glen (United Kingdom),
where the model distinguishes different degrees of urbanization based on the underlying OSM data.
Figure 3 Comparison between the real urban pattern based on GMESUA (upper panel) and the
predicted results (lower panel) for Great Glen, south-east of Leicester, United Kingdom.
4.3 Significance Analysis
To gain further insights into the model, the importance of the model’s attributes is evaluated. The
respective ranks for the two proposed methods (Sec. 3) are presented in Table 3.
Table 3 Ranking of attribute importance by SSMV and GA

Attribute                             SSMV                     GA
                                      Change of RMSE   Rank    Frequency   Rank
Distance to nearest residential road   0.0195           1      100          1
Length of residential roads            0.0196           2      100          1
Number of Attributes                   0.0103           3       90          6
Number of junctions                    0.0095           4       32          9
Distance to nearest village            0.0080           5       85          7
Distance to nearest pedestrian road    0.0054           6       92          5
Length of tracks                       0.0001           7      100          1
Number of waypoints of primary roads  -0.0001           8        7         11
Distance to nearest track             -0.0003           9       99          4
Length of motorways                   -0.0006          10       23         10
Cycleway curviness                    -0.0014          11       58          8
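The GA ranks in Table 3 are consistent with ordering the attributes by descending frequency and letting tied attributes share the best rank. The following sketch reproduces them from the frequencies listed in the table. The function name `competition_ranks` is hypothetical, and the exact definition of the frequency is the one given in Sec. 3.

```python
# Selection frequencies copied from the GA column of Table 3
ga_frequency = {
    "Distance to nearest residential road": 100,
    "Length of residential roads": 100,
    "Number of Attributes": 90,
    "Number of junctions": 32,
    "Distance to nearest village": 85,
    "Distance to nearest pedestrian road": 92,
    "Length of tracks": 100,
    "Number of waypoints of primary roads": 7,
    "Distance to nearest track": 99,
    "Length of motorways": 23,
    "Cycleway curviness": 58,
}


def competition_ranks(scores):
    """Rank by descending score; tied items share the best rank,
    and the following rank is skipped accordingly (1, 1, 1, 4, ...)."""
    ordered = sorted(scores.values(), reverse=True)
    return {name: ordered.index(value) + 1 for name, value in scores.items()}


ga_rank = competition_ranks(ga_frequency)
```

Three attributes share rank 1 because each has a frequency of 100, so the next attribute receives rank 4.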
The results for both methods are largely comparable. Both identify the length of residential roads
and the distance to the nearest residential road as the most significant input variables. Both methods
are also comparable at identifying less important attributes, except for the number of junctions and
the distance to the nearest track, whose ranks are swapped between the two methods. For SSMV it is
notable that attributes ranked lower than 7 are rather unimportant, because setting them stepwise to
constant values even improved the results.
However, there are also significant differences in the measurement of importance. SSMV ranked
the number of attributes as very important, while for GA this attribute is only of average importance.
Another major difference between SSMV and GA is the ranking of track-related attributes. For GA the
length of tracks and the distance to the nearest track are very important, while for SSMV these
attributes are rather irrelevant. Furthermore, the importance of junctions is not emphasized by GA.
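The principle of measuring importance by setting an attribute to a constant value can be sketched as follows. This is a minimal illustration and not the exact SSMV procedure of Sec. 3: it assumes that the constant is the attribute's mean, and it uses a hypothetical linear function `toy_model` in place of the trained ANN. A positive change of RMSE means that the model relies on the attribute. A negative change means that the constant even improved the prediction, as reported for the lowest-ranked attributes in Table 3.

```python
import math


def rmse(obs, pred):
    """Root mean squared error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def constant_substitution_importance(model, rows, target):
    """For each attribute, replace its column by the column mean and
    return the change of RMSE relative to the unmodified data."""
    base = rmse(target, [model(r) for r in rows])
    changes = []
    for j in range(len(rows[0])):
        mean_j = sum(r[j] for r in rows) / len(rows)
        perturbed = [r[:j] + [mean_j] + r[j + 1:] for r in rows]
        changes.append(rmse(target, [model(r) for r in perturbed]) - base)
    return changes


def toy_model(row):
    """Hypothetical stand-in for the trained ANN: the third attribute
    has no influence on the output."""
    return 0.8 * row[0] + 0.2 * row[1] + 0.0 * row[2]


# Made-up data with three attributes per cell
rows = [[0.0, 1.0, 5.0], [1.0, 0.0, 3.0], [2.0, 1.0, 4.0], [3.0, 0.0, 6.0]]
target = [toy_model(r) for r in rows]

changes = constant_substitution_importance(toy_model, rows, target)
```

In this toy case the first attribute causes the largest increase of the RMSE, the second a small one, and the third none, which yields the importance ranking directly.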
5 Discussion
Data production in VGI is carried out by many independent contributors, increasing the chance of
repetitions. It can be expected that a large number of repetitions increases the chance of
errors being detected and fixed in the data, and thus improves the data quality in general (Heipke
2010). For OSM data, Haklay et al. (2010) showed that there is in fact a non-linear relationship
between the number of contributors and the positional accuracy. This assumption does not hold for
every aspect of spatial data quality. For completeness, it is anticipated that the number of omission
errors correlates negatively with the number of commission errors because of the diversity of user
requirements. However, the proposed methodology is based on neural computing and aims to detect
urban areas using OSM data. It enables users to derive information implicitly stored in OSM,
superseding explicit storage, and thus making commission errors less likely.
The performance of the developed model varied considerably when applied to the 42 different European
regions, emphasizing the spatial heterogeneity of the data. Local models may provide better results
for individual geographic regions, but do not capture general properties of the data well. Although
the developed model depends on data that are mostly mapped in OSM, it necessarily fails if such data
are missing altogether. Therefore, to apply the presented methodology it is necessary to make assumptions
about the presence of valuable information. Delineating urban patterns is subject to individual
considerations. Thus, assessing the validity of the presented methodology for predicting density-based
patterns of urban areas is a difficult task, although the results are coherent and visually appealing. In
general, the validity of the results must be evaluated with respect to the intended application.
Machine learning can only respond to geographic characteristics if those are encoded in the data
(Gahegan 2003). It is a priori unclear how to encode information in the dataset to give reasonable
results. Furthermore, the settings of the ANN and GA in this study are primarily chosen a priori.
However, a detailed inspection and sensitivity analysis of the different settings bears potential for
making the model more robust and yielding better results (see Patuelli et al. 2011). A fundamental
problem of spatial data analysis on lattice data is the dependence of the results on the underlying
scale and zoning (Openshaw 1984). The influence of the cell size on the prediction results was not
a subject of this study and needs more research.
6 Conclusions and Future Work
In this study a methodological framework is proposed for the delineation of continuous urban areas
from OSM data. Each of the framework’s components provides distinct capabilities. The ANN
enables the non-linear estimation of urban patterns from a set of attributes. The GA reduces the
total set of attributes to a reasonable size for inductive learning following the wrapper approach,
so that no precise understanding of the processes is required. The usefulness of the
methodology was demonstrated by applying it to the estimation of urban patterns in Europe.
It was shown that urban patterns can be predicted to a large extent. Explicit information about
urban patterns and urban density is useful for manifold applications, e.g. navigation and spatial
planning tasks, but is currently mostly absent in OSM. By estimating the urban density with the
proposed framework, the fitness for use of OSM for applications is improved, leading to additional
benefit for users.
Future work will elaborate on the application of machine learning to OSM data. OSM offers a rich and
manifold source of spatial data, offering a profound pool of implicit information. Extracting this
information from OSM and making it explicit improves the fitness for use and overcomes
completeness issues in the data for manifold applications and requirements. Additionally, machine
learning bears potential for the detection of other data quality issues, e.g. positional or semantic errors.
From a methodological point of view it seems fruitful to include further machine learning techniques
that in particular take into account the spatial and temporal properties of OSM data.
Acknowledgements
We acknowledge the valuable feedback received from our colleagues Georg Walenciak, Marcus Götz,
Bernhard Höfle, and Alexander Zipf on an early draft of this paper. In addition, we want to thank
Hannes Taubenböck (German Aerospace Center, DLR) for his useful hints for appropriate remote
sensing data.
References
Bak P, Omer I and Schreck T 2010 Geospatial thinking. Painho, M, Santos, M Y & Pundt, H, (eds)
Geospatial Thinking. Springer, Berlin: 25-42
Bishop C M 1996 Neural Networks for Pattern Recognition. New York, Oxford University Press
Brassel K, Bucher F and Stephan E-M 1995 Elements of Spatial Data Quality. Pergamon, Oxford: 81-108
Buttenfield B 1993 Representing data quality. Cartographica: The International Journal for
Geographic Information and Geovisualization 30: 1-7
Delavar M R and Devillers R 2010 Spatial data quality: From process to decisions. Transactions in GIS
14: 379-386
Devillers R, Stein A, Bédard Y, Chrisman N, Fisher P and Shi W 2010 Thirty years of research on spatial
data quality: Achievements, failures, and opportunities. Transactions in GIS 14: 387-400
EEA 2010 Mapping guide for a European urban atlas. WWW document,
http://www.eea.europa.eu/data-and-maps/data/urban-atlas/mapping-guide
Elwood S 2008 Volunteered geographic information: Key questions, concepts and methods to guide
emerging research and practice. GeoJournal 72: 133-135
Fischer M M 1998 Computational neural networks: A new paradigm for spatial analysis. Environment
and Planning A 30: 1873-1891
Fischer M M and Gopal S 1994 Artificial neural networks: A new approach to modelling interregional
telecommunication flows. Journal of Regional Science 34: 503-527
Fotheringham A S and Wong D W S 1991 The modifiable areal unit problem in multivariate statistical
analysis. Environment and Planning A 23: 1025-1044
Gahegan M 2003 Is inductive machine learning just another wild goose (or might it lay the golden
egg)? International Journal of Geographical Information Science 17: 69-92
Gevrey M, Dimopoulos I and Lek S 2003 Review and comparison of methods to study the
contribution of variables in artificial neural network models. Ecological Modelling 160: 249-264
Girres J-F and Touya G 2010 Quality assessment of the French OpenStreetMap dataset. Transactions
in GIS 14: 435-459
Goldberg D E 1989 Genetic Algorithms in Search, Optimization and Machine Learning. Boston, MA,
Addison-Wesley Longman Publishing Co., Inc.
Goodchild M F 2008 Spatial accuracy 2.0. In Proceedings of the Eighth International Symposium on
Spatial Accuracy Assessment in Natural Resources and Environmental Sciences. Liverpool, World
Academic Union: 1-7
Goodchild M F 2007 Citizens as sensors: The world of volunteered geography. GeoJournal 69: 211-221
Goodchild M F and Hunter G J 1997 A simple positional accuracy measure for linear features.
International Journal of Geographical Information Science 11: 299-306
Haklay M 2010 How good is volunteered geographical information? A comparative study of
OpenStreetMap and Ordnance Survey datasets. Environment and Planning B: Planning and Design
37: 682-703
Haklay M, Basiouka S, Antoniou V and Ather A 2010 How many volunteers does it take to map an
area well? The validity of Linus’ law to volunteered geographic information. The Cartographic Journal
47: 315-322
Haklay M and Weber P 2008 OpenStreetMap: User-generated street maps. IEEE Pervasive Computing
7: 12-18
Heipke C 2010 Crowdsourcing geospatial data. ISPRS Journal of Photogrammetry and Remote Sensing
65: 550-557
Helbich M and Leitner M 2010 Postsuburban spatial evolution of Vienna's urban fringe: Evidence from
point process modeling. Urban Geography 31: 1100-1117
Helbich M and Leitner M 2009 Spatial analysis of the urban-to-rural migration determinants in the
Viennese metropolitan area. A transition from suburbia to postsuburbia? Applied Spatial Analysis
and Policy 2: 237-260
Holland J 1992 Adaptation in Natural and Artificial Systems: An Introductory Analysis with
Applications to Biology, Control, and Artificial Intelligence. Cambridge, MA, MIT Press
Hornik K, Stinchcombe M and White H 1989 Multilayer feedforward networks are universal
approximators. Neural Networks 2: 359-366
Jiang B and Harrie L 2004 Selection of streets from a network using self-organizing maps.
Transactions in GIS 8: 335-350
Jiang B and Jia T 2011 Zipf’s law for all the natural cities in the United States: A geospatial
perspective. International Journal of Geographical Information Science (accepted)
Kresse W and Fadaie K 2004 ISO Standards for Geographic Information. Berlin, Springer
Leung Y 1987 On the imprecision of boundaries. Geographical Analysis 19: 125-151
Mas J F, Puig H, Palacio J L and Sosa-López A 2004 Modelling deforestation using GIS and artificial
neural networks. Environmental Modelling & Software 19: 461-471
Maué P and Schade S 2008 Quality of geographic information patchworks. In 11th AGILE
International Conference on Geographic Information Science.
Neis P and Zipf A 2008 OpenRouteService.org is three times "Open": Combining OpenSource, OpenLS
and OpenStreetMaps. In GIS Research UK. Manchester
Nieminen J 1974 On centrality in a graph. Scandinavian Journal of Psychology 15: 322-336
Olden J D, Joy M K and Death R G 2004 An accurate comparison of methods for quantifying variable
importance in artificial neural networks using simulated data. Ecological Modelling 178: 389-397
van Oort P 2006 Spatial Data Quality: From Description to Application. PhD thesis, Wageningen
University
Openshaw S and Openshaw C 1997 Artificial Intelligence in Geography. New York, NY, Wiley
Over M, Schilling A, Neubauer S and Zipf A 2010 Generating web-based 3D city models from
OpenStreetMap: The current situation in Germany. Computers, Environment and Urban Systems 34:
496-507
Patuelli R, Reggiani A, Nijkamp P and Schanne N 2011 Neural networks for regional employment
forecasts: Are the parameters relevant? Journal of Geographical Systems 13: 67-85
Pijanowski B C, Brown D G, Shellito B A and Manik G A 2002 Using neural networks and GIS to
forecast land use changes: A land transformation model. Computers, Environment and Urban
Systems 26: 553-575
Pyle D 1999 Data preparation for data mining. San Francisco, CA, Morgan Kaufmann Publishers Inc.
Qi X and Palmieri F 1994 Theoretical analysis of evolutionary algorithms with an infinite population
size in continuous space. Part I: Basic properties of selection and mutation. Neural Networks, IEEE
Transactions on 5: 102-119
Ramm F and Topf J 2010 OpenStreetMap: Using and Enhancing the Free Map of the World. UIT
Cambridge
Roick O and Hagenauer J 2011 OSMatrix - Grid-based analysis and visualization of
OpenStreetMap. State of the Map 2011, The 5th Annual International OpenStreetMap Conference,
Vienna, Austria (submitted)
Rozenfeld H D, Rybski D, Gabaix X and Makse H A 2009 The area and population of cities: New
insights from a different perspective on cities. National Bureau of Economic Research, Inc., NBER
Working Papers No 15409
Rumelhart D E, Hinton G E and Williams R J 1986 Learning representations by back-propagating
errors. Nature 323: 533-536
Servigne S, Lesage N and Libourel T 2010 Quality Components, Standards, and Metadata
Fundamentals. In Devillers R and Jeansoulin R (eds) Fundamentals of Spatial Data Quality. London,
ISTE: 179-210
Shi W, Goodchild M and Fisher P 2002 Spatial data quality. New York, Taylor & Francis
Siedlecki W and Sklansky J 1989 A note on genetic algorithms for large-scale feature selection.
Pattern Recognition Letters 10: 335-347
Smith T R 1984 Artificial intelligence and its applicability to geographical problem solving. The
Professional Geographer 36: 147-158
Srinivas N and Deb K 1994 Multiobjective optimization using nondominated sorting in genetic
algorithms. Evolutionary Computation 2: 221-248
Tobler W R 1970 A computer movie simulating urban growth in the Detroit region. Economic
Geography 46: 234-240
Yang J and Honavar V 1998 Feature subset selection using a genetic algorithm. Intelligent Systems
and their Applications, IEEE 13: 44-49
Zielstra D and Zipf A 2010 A comparative study of proprietary geodata and volunteered geographic
information for Germany. 13th AGILE International Conference on Geographic Information Science,
Guimaraes, Portugal
Guptill S C and Morrison J L (eds) 1995 Elements of Spatial Data Quality. Oxford, Pergamon
Moellering H (ed) 1987 A Draft Proposed Standard for Digital Cartographic Data. Columbus, OH,
National Committee for Digital Cartographic Standards