Mining Urban Land Use Patterns from Volunteered Geographic Information Using Genetic Algorithms and Artificial Neural Networks

Julian Hagenauer and Marco Helbich
GIScience, Institute of Geography, University of Heidelberg, Germany

Keywords: Volunteered Geography, OpenStreetMap, Spatial Data Quality, Spatial Data Mining

This is an Author’s Original Manuscript of an article whose final and definitive form, the Version of Record, has been published in the International Journal of Geographical Information Science [14 Nov 2011] [copyright Taylor & Francis], available online at: http://www.tandfonline.com/doi/abs/10.1080/13658816.2011.619501.

Abstract

OpenStreetMap (OSM), one of the most promising crowd-sourced initiatives, provides volunteered mapped spatial data. At the same time, such data bear several spatial data quality problems, inter alia regarding completeness, which involves omission errors on the one hand and commission errors on the other. Using European-wide urban land use patterns, this study investigates the former and aims at predicting urban areas that are currently unmapped or only partially mapped in OSM. For this purpose, a machine learning approach consisting of genetic algorithms and artificial neural networks is applied to estimate urban areas. Under the premise of existing OSM data, the model estimates missing urban areas with an overall squared correlation coefficient (R²) of 0.6. Nevertheless, interregional comparisons of European regions confirm spatial heterogeneity in model quality (R² ranges from 0.2 up to 0.8) and thus the inherently varying completeness of OSM. Hence, this verifies the hypothesis that more active volunteers within a region enhance the content of OSM.

1 Introduction

The emergence of internet technologies facilitates the generation and distribution of manifold digital content (e.g., Flickr, Wikipedia) and thus makes collaborative efforts common and present in everyday life.
The enablement of participatory collaboration caused a paradigm shift, blurring the distinction between consumers and producers that had existed on the Web since its early days. O’Reilly (2005) terms these developments Web 2.0.

Because of the high costs of gathering and maintaining spatial data, as well as the effort involved in sharing and distributing them, spatial data have been almost solely the domain of either official land surveying offices or commercial companies. Nowadays, the availability of mobile devices endowed with satellite navigation has enabled people to collect geographic data on their own, at low cost and at a level of precision formerly not conceivable for non-experts. Citizens have become human sensors (Goodchild 2007). Web 2.0 technologies permit volunteers to aggregate, share, and edit their collected geographic data in a collaborative manner. This phenomenon is usually referred to as Volunteered Geographic Information (VGI; Goodchild 2007; Elwood 2008), whereas Sui (2008) metaphorically calls it the "wikification of GIS".

Among a broad list of initiatives dealing with VGI, OpenStreetMap (OSM) is one of the most promising. Since its initiation in 2004, its primary goal has been the generation of a free map of the world through volunteered contributions. Although the generation of maps is still the primary intention, the collected spatial data are also made publicly available and may thus be used for individual purposes (e.g., OpenRouteService.org (Neis and Zipf 2008), 3D city models (Over et al. 2010)). User-generated GPS tracks, out-of-copyright maps, and more recently aerial images (e.g., Bing Maps) serve as primary data sources. The data themselves are distributed under a license that guarantees freedom of use, but enforces that all derived data are distributed under the same license (Haklay and Weber 2008; Ramm and Topf 2010).
In general, awareness of the limitations of spatial data is essential, in particular of data quality issues, which are a comprehensive and active research field (e.g., Chrisman 1984; Buttenfield 1993; Goodchild and Hunter 1996; Goodchild 1998; Shi et al. 2002; Devillers et al. 2010). Evaluating spatial data quality is an important part of assessing the "fitness for use" (Chrisman 1993) of a data set for a particular application (van Oort 2003). Furthermore, spatial data quality has also been addressed by a set of standard definitions and quality criteria proposals from various organizations such as the National Committee on Digital Cartographic Data Standards (NCDCDS, Moellering 1987), the International Cartographic Association (ICA, Guptill and Morrison 1995) or the International Organization for Standardization (ISO, Kresse and Fadaie 2004). Based on the definitions of the Technical Committee 211 of the ISO, the following elements of spatial data quality can be identified (Kresse and Fadaie 2004):

- Positional accuracy: concerns the accordance of positioning and geometry of an object to its representation in the real world,
- Attribute accuracy: measures the correctness of attributes assigned to a geographic object,
- Completeness: measures the absence and excess of data,
- Logical consistency: describes the topological correctness and the relationships between objects in respect to their internal consistency,
- Semantic accuracy: evaluates the correspondence of the interpretation of spatial objects to their meanings in the real world,
- Temporal accuracy: describes the data actuality in relation to real world changes,
- Lineage: concerns the history of a data set, how it was collected and derived to its actual state,
- Usage: assesses the extent of a data set to serve its intended purpose.

This contribution focuses on the completeness criterion of spatial quality.
Completeness describes the presence and absence of objects in a data set. Brassel et al. (1995) distinguish between data completeness and model completeness. The former refers to the measurable errors between the data set and its specification. Errors may be caused by a lack of data that are formally expected to be present in the database (omission errors) or by data that are present in the database but not intended to be included (commission errors). Data completeness is measurable and independent of the application. In contrast, model completeness considers the intended use, i.e. how well the model of a dataset fulfills the requirements of an application (Brassel et al. 1995). To evaluate completeness in terms of fitness for use, it is advisable to consider both data completeness and model completeness. The assessment of spatial data quality usually depends on the availability and quality of reference information (Servigne et al. 2010).

Concerning VGI in general and OSM in particular, spatial data quality, namely spatial accuracy, is an important issue (Haklay et al. 2010), and following Goodchild (2008) the same is valid for completeness. OSM data quality has been examined in several recent studies (Girres and Touya 2010; Haklay 2010; Haklay et al. 2010). Haklay (2010) evaluates the positional accuracy of OSM in reference to Ordnance Survey data for Great Britain based on the methodology of Goodchild and Hunter (1996), by analyzing the percentage of overlap between both data sets within a buffer distance. Girres and Touya (2010) similarly evaluated different aspects of OSM data quality in France. Both studies showed that in terms of positional accuracy the quality of OSM data is comparable to traditional geographic datasets from national mapping agencies or commercial providers.
Nevertheless, comprehensive studies concerning the completeness of OSM are lacking so far and thus represent a compelling area of research, which is the main objective of this paper.

The measurement of VGI completeness is a complex task and bears several difficulties. First, VGI activities address a large user base, whose motivations for participating, contributing, and using spatial data differ substantially (Heipke 2010). Second, a strictly defined dataset specification may contradict the plurality of a user community, but without a precise specification of the dataset it is not possible to detect errors of completeness (van Oort 2006; Servigne et al. 2010). Third, in OSM anybody is permitted to add, delete, or modify data. However, mapping guidelines exist that contributors are recommended to follow. The guidelines are communicated, discussed, and modified through a wiki, and reflect the consensus of the community. Because this specification is not authoritative, it is not possible to measure the completeness of OSM in a strict sense. Nevertheless, several studies comparing the completeness of OSM to other datasets exist (e.g., Girres and Touya 2010; Haklay 2010; Haklay et al. 2010; Zielstra and Zipf 2010). However, those studies only consider objects of certain types (e.g., roads or rivers) for descriptive measurements.

For OSM it is not guaranteed that a certain object will ever be mapped. On a global scale, the digital divide is an important factor for the incomplete mapping of less developed countries (Goodchild 2008; Maue and Schade 2008). On a local scale, the absence of voluntary contributors in disadvantaged areas is a primary cause of omission errors (e.g., Haklay 2010). In particular, it appears that completeness in the sense of present features not only depends on the density of information within an area, but also on the number of contributors, which is generally high in densely populated areas (Girres and Touya 2010).
Third-party contributions to OSM, such as the import of the TIGER data from the US Census Bureau and the availability of cadastral data from French authorities, improve the situation of missing data. Beyond that, it is expected that intelligent tools and powerful visualizations might further help in detecting and fixing spatial data quality issues (e.g., see the prototype of the web-based attribute visualization tool OSMatrix¹; Roick and Hagenauer 2011). However, the absence of data may severely impair the fitness for use. For instance, incomplete or wrongly mapped representations of road networks and the surrounding environment may induce inaccurate route proposals in a navigation application (e.g., Neis and Zipf 2008). This in turn negatively affects the users’ activity space and time schedule. In particular, this is true for larger urban areas affected by traffic jams, speed limits, and one-way streets, where it is often advisable to take an alternative bypass route.

¹ http://osmatrix.uni-hd.de

OSM contains implicit information that can be used to fill the gap between the contents of the data set and the information needed for the intended application. Urban areas are generally not delineated in OSM, but there are various possibilities to derive them from existing data. A straightforward solution is to aggregate land use information to form urban areas. However, especially in sparsely mapped areas and for small rural communities such land use information is mostly absent in OSM. Based on the method of Rozenfeld et al. (2008), Jiang and Jia (2011) propose a clustering algorithm to derive city boundaries from OSM street nodes. Their methodology aggregates street nodes within a certain distance into clusters. The choice of the distance has a crucial effect on the result. Their approach validates Zipf’s law, which relates the size of cities to their rankings. However, their approach ignores other OSM data and inherent non-linear relationships.
Furthermore, both approaches derive only crisp urban areas, but in urban geography there is strong evidence that metropolitan areas are nowadays shaped by sub- and postsuburbanization processes (e.g., Helbich and Leitner 2009, 2010), which cause a continuous transition between urban and rural areas that cannot be delineated by a crisp and dichotomous classification scheme (Leung 1987). A continuous, density-based representation of urban areas is more suitable but is difficult to obtain due to the complex relationships of urbanization processes.

Developments in artificial intelligence offer a remedy and have the potential to solve geographical problems that were previously difficult to solve (Smith 1984; Gahegan 2003). In particular, Artificial Neural Networks (ANNs) are appealing for spatial analysis (Openshaw and Openshaw 1997) because of their computational speed, representational flexibility, ability to model non-linear relationships, and computational adaptivity (Fischer 1997). ANNs perform particularly well compared to conventional statistical models if the data are incomplete or inconsistent (Fischer and Gopal 1994), which is often the case with complex spatial data such as OSM. In GIScience, ANNs have already shown high potential for modeling complex geographic processes (e.g., Fischer and Gopal 1994; Pijanowski et al. 2002; Mas et al. 2004).

This paper makes an initial empirical contribution and charts current urban patterns on the basis of VGI. The objective of this study is to develop a density-based methodological framework to delimit continuous urban areas using the whole information diversity of OSM. To capture possible non-linear relationships, interactions, and spatial effects within the GIS-based data, ANN techniques are applied. The framework mitigates data completeness issues of OSM and thus helps to improve the fitness for use of OSM for individual applications. Its usefulness is demonstrated on a set of selected European urban regions.
The paper is structured as follows: Section 2 provides an overview of the study area and the data sets. Section 3 introduces the methodology used to detect urban patterns. Results of the empirical analysis are discussed in Section 4, and Section 5 concludes with a discussion of the results and identifies future work.

2 Materials

2.1 Study Site and Data

Training of ANNs requires reference information for learning. For the European Union (EU), two publicly available urban land use datasets are predominant: CORINE (Coordination of information on the environment) Land-Cover (CLC) data and Global Monitoring for Environment and Security Urban Atlas (GMESUA) data. The former has the advantage that it is fully available for the whole territory of the EU, but it has a minimum mapping unit of 25 hectares and a minimum width of linear elements of 100 meters. Therefore, it is only suitable for small-scale mapping applications. The second alternative, the GMESUA data, is a product of a joint initiative of the European Commission and the European Space Agency. As of the first quarter of 2011, the GMESUA data set covers 242 urban regions within Europe, which differ in socio-economic and demographic factors. The acquisition of GMESUA is based on SPOT-5 satellite images with a 10 m multispectral and 2.5 m panchromatic pixel resolution. The multispectral data include a near-infrared band. Compared to CLC, the data have a considerably finer resolution: linear elements with a width of 10 m are mapped, and the minimum mapping unit is 0.25 ha for urban areas and 0.55 ha for non-urban areas. In total, 44 different land use categories are distinguished. The advantage of high spatial and thematic resolution is reduced by the fact that the dataset is only available for selected urban areas with more than 100,000 inhabitants in 27 different countries at a scale of 1:10,000. Figure 1 illustrates the selected urban regions.
To keep computation feasible, random sampling was indispensable: a subset of about 20% of the regions, corresponding to 42 regions, was drawn to generate the training and validation data set.

Figure 1 Countries covered by GMESUA (dark gray) and the 42 randomly selected GMESUA urban regions (black).

3 Methodology

The proposed research design for the estimation of urban land use patterns comprises three major consecutive steps:

1. Data preparation: As a first step it is necessary to prepare the OSM and GMESUA data such that both are valuable for model building. In particular, a large set of potential attributes is derived from OSM for inductive learning and the desired output is calculated from GMESUA (Section 3.1).
2. Selection and model building: Second, a genetic algorithm (GA) is used to reduce the total set of attributes to a reasonable subset, and an artificial neural network (ANN) is trained with this subset (Sections 3.2 and 3.3).
3. Sensitivity analysis and model performance: Finally, because the contribution of the attributes to the model is unknown, their significance is analyzed (Section 3.4) and the model performance for the different areas is investigated.

3.1 Data preparation

Training data for ANNs consist of a set of training samples. Each sample is a pair of an input vector and a desired output. Therefore, it is necessary to derive input vectors from OSM data and the desired output from the reference GMESUA data set in order to learn the relationship between them. However, both datasets consist of manifold and diverse information that needs to be aggregated to a normalized representation, where the choice of the areal units for aggregation is crucial and possibly, as in regression analysis (Fotheringham and Wong 1991), affected by the modifiable areal unit problem (Openshaw 1984). In this study, aggregation is carried out on a hexagonal raster representation.
This seems more reasonable than square cells because hexagonal shapes can better imitate European urban patterns at every scale. The side length of every hexagonal cell is 250 m for this European-wide analysis, which represents a trade-off between computational burden and spatial resolution while still allowing the derivation of fine-scaled urban patterns.

3.1.1 GMESUA Urban Regions as Desired Output

GMESUA subsumes continuous and discontinuous urban fabric of built-up areas and its respective associated land according to its primary use. The latter is further differentiated by degree of soil sealing (EEA 2010). To derive urban areas from the land use classification, a reclassification of the original data is required, which is an ambiguous and subjective process. While for a few classes of GMESUA the distinction between urban and non-urban areas is obvious (e.g., category 1.1.1: continuous urban fabric with sealing level above 80%), most classes require additional information to achieve a clear class membership. Here, primarily aerial images and local knowledge were used to reclassify the GMESUA datasets. For each cell, the overlap between the cell and the resulting urban areas is computed and assigned as an attribute, representing the desired output for the ANN.

3.1.2 Derivation of OSM Attributes (Input data)

Basically, OSM presents three different types of information: geometric information, attributive information, and meta-data (Ramm and Topf 2010). These types potentially contain, implicitly or explicitly, information that can be used for urban pattern detection. A spatial object in OSM is characterized by its geometric primitive and a set of assigned tags. A tag is a pair of a key and an associated value that represents the attributive information of a specific object, e.g. a linear geometry with the assigned tags highway="primary" and oneway="true" describes a major highway that is only accessible in one direction.
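To make the cell geometry and the desired output of Section 3.1.1 concrete, the following is a minimal sketch, assuming regular hexagons with a side length of 250 m and an urban area per cell that has already been obtained from a GIS overlay. The function names, the hexagon orientation, and the capping of the share at 1 are illustrative assumptions, not part of the original workflow.

```python
import math

SIDE_M = 250.0  # side length of a hexagonal cell, as stated in the text


def hexagon_area(side):
    """Area of a regular hexagon with the given side length."""
    return 3.0 * math.sqrt(3.0) / 2.0 * side ** 2


def hexagon_vertices(cx, cy, side):
    """Corners of a regular hexagon centred on (cx, cy).

    The orientation (one corner on the positive x axis) is an assumption.
    """
    return [
        (cx + side * math.cos(math.radians(60 * k)),
         cy + side * math.sin(math.radians(60 * k)))
        for k in range(6)
    ]


def polygon_area(vertices):
    """Shoelace formula for the area of a simple polygon."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def urban_share(urban_area_in_cell, side=SIDE_M):
    """Desired ANN output for one cell: share of the cell that is urban.

    `urban_area_in_cell` (square metres) is assumed to come from an
    overlay of the cell with the reclassified urban polygons.
    """
    return min(1.0, urban_area_in_cell / hexagon_area(side))


cell_area = hexagon_area(SIDE_M)
check_area = polygon_area(hexagon_vertices(0.0, 0.0, SIDE_M))
half_covered = urban_share(cell_area / 2.0)
```

With a side length of 250 m, each cell covers roughly 16.2 ha, so the desired output is a continuous value between 0 and 1 rather than a crisp urban or non-urban label.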
Although the data model of OSM is strictly specified, the tagging scheme is open in the sense that every user is permitted to assign arbitrary tags to any object. Including information about sparsely mapped objects cannot be expected to significantly improve the total model performance and instead bears the risk of overfitting. Highways and places are assumed to be generally well mapped for most regions. However, due to the freedom of users to assign keys and values at will, the potential number of different highway and place categories is arbitrarily large and thus requires a certain generalization. Therefore, only highways and places with a reasonably high occurrence in OSM² are considered.

Most objects in OSM have metainformation assigned (e.g., the time of the last edit or the name of the user who last modified an object). Haklay (2010) indicated that the mapping habits of people within urban areas differ from those of people residing in rural areas. Thus, it can be expected that this difference is reflected in the metainformation of spatial objects. Basic descriptive statistics (e.g., minimum, maximum, average) are calculated from the metainformation of all considered objects within each raster cell.

According to the concept of spatial autocorrelation, geographically close observations depend on each other (Tobler 1970). Thus, the distance to an object predominantly found in urban areas relates to the urbanization of a given raster cell and implicitly includes autocorrelated processes in the model. Hence, it is reasonable to include the distance to the nearest object of different types, e.g. the nearest highways, as an attribute for each raster cell.

For the derivation of geometric and topologic attributes, it is necessary to distinguish between the geometric primitives of OSM. Of special interest are the properties of lines, mostly representing roads.
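The per-cell metadata statistics and the nearest-distance attribute described above can be sketched as follows. This is a simplified illustration: the example values are hypothetical, coordinates are assumed to be projected (in metres), and distances are measured to object vertices rather than to full line geometries.

```python
from math import hypot
from statistics import mean


def metadata_stats(values):
    """Minimum, maximum and average of one metadata field within a cell."""
    if not values:
        return None  # no objects mapped in this cell
    return {"min": min(values), "max": max(values), "avg": mean(values)}


def nearest_distance(cell_centre, object_points):
    """Distance from a cell centre to the closest object vertex of one type."""
    cx, cy = cell_centre
    return min(hypot(cx - x, cy - y) for x, y in object_points)


# Hypothetical example: version numbers of the objects in one cell ...
versions = [1, 3, 2, 6]
# ... and vertex coordinates (metres) of residential roads near the cell.
residential_nodes = [(300.0, 400.0), (1200.0, 50.0), (-90.0, 120.0)]

stats = metadata_stats(versions)
dist = nearest_distance((0.0, 0.0), residential_nodes)
```

Here the cell would receive a version-number minimum of 1, a maximum of 6, an average of 3, and a distance of 150 m to the nearest residential road vertex.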
It is hypothesized that urban areas show a higher total road length and more junctions, curves, and right angles because of the necessity of dense traffic infrastructure in densely populated areas; these properties are consequently included as attributes of the raster cells. Furthermore, graph centrality (Nieminen 1974), which measures the number of nodes linked to a given node, is included. Previous studies by Yang and Harry (2004) and Bak et al. (2010) have documented the capability of this index for street network analysis. In conclusion, Table 1 gives an overview of the 102 derived statistics and attributes for each cell.

² Frequently used tags: a) highway-tags: residential, unclassified, tertiary, secondary, primary, motorway, motorway_link, steps, trunk, path, track, footway, service, living_street, cycleway; b) place-tags: town, hamlet, village, suburb, locality.

Table 1 Derived OSM attributes for each cell³

Attributes for selected highway types:
- Length
- Distance
- Curviness
- Number of waypoints

Aggregated attributes for selected highways:
- Number of junctions
- Number of junctions with at least one right angle curve
- Number of roads with right angle curves
- Number of road endings
- Min./Max./Total/Avg. angle(s)
- Min./Max./Total/Avg. centrality

Attributes for selected place types:
- Number of points
- Distance

Aggregated attributes for selected highways and places:
- Number of objects
- Min./Max./Avg. version number(s)
- Earliest/Latest/Avg. time of modification(s)
- Total/Min./Max./Avg. number of user contributions
- Min./Max./Total/Avg. number of object tags

3.2 Genetic Algorithm for Attribute Selection

As outlined in Section 3.1, several cell-based OSM attributes are calculated. The performance of ANN models when learning a regression function depends on the choice of attributes (Pyle 1999). The attributes implicitly define a pattern language (Yang and Honavar 1997). If the language is not expressive enough, a model will fail to capture the information necessary to approximate the target function.
Conversely, if the language is too expressive, the computational time to learn the model increases and the model becomes vulnerable to overfitting. Due to the large number of attributes and the non-linear relationships between them, heuristics are promising for obtaining near-optimal attribute sets for ANN model building (Siedlecki and Sklansky 1989).

Attribute selection approaches can be categorized as follows (Liu and Yu 2004): The filter approach uses statistics to measure the relevance of attributes. It is totally independent of the learning algorithm and is therefore computationally more efficient than the wrapper approach, which involves computational overhead by executing the training algorithm for every presented attribute set and evaluating the results. However, attribute selection may not be independent of the learning algorithm, which is ignored by the filter approach. In contrast, the wrapper approach takes the properties and biases of the inductive learning algorithm into account. Because the wrapper approach is generally computationally demanding, genetic algorithms (GA) are especially promising for attribute selection (Siedlecki and Sklansky 1989).

³ For selected highways and places see footnote 2.

A GA is a heuristic optimization method that simulates natural evolution processes in analogy to biology (Holland 1975; Goldberg 1989). GAs represent a potential solution as an individual. Each individual is encoded by a chromosome, which comprises a set of genes. The chromosomes are often coded as a binary string. A set of individuals constitutes a population. This population evolves iteratively until a stop criterion is reached. At each iterative step of the GA, the fitness of the individuals of the current population is measured. Afterwards, the population for the next iterative step is built by selecting, recombining, and mutating the most promising individuals of the current population (Mitchell 1998).
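The evolutionary loop described above can be sketched as follows. This is a minimal generational GA over binary chromosomes, not the algorithm used in this study: tournament selection, one-point recombination, bit-flip mutation, elitism, and all parameter values are assumptions chosen for brevity. The fitness function is a stand-in as well; in a wrapper setting it would train and score a model on the attributes whose gene is 1.

```python
import random


def evolve(fitness, n_genes, pop_size=30, generations=40,
           mutation_rate=0.02, seed=0):
    """Minimal generational GA over binary chromosomes (1 = attribute used)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    history = []  # best fitness found at the start of each generation
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        history.append(fitness(scored[0]))
        next_population = [scored[0][:]]  # elitism: keep the best individual
        while len(next_population) < pop_size:
            # Tournament selection: the fitter of two random individuals.
            mother = max(rng.sample(population, 2), key=fitness)
            father = max(rng.sample(population, 2), key=fitness)
            # One-point recombination.
            cut = rng.randrange(1, n_genes)
            child = mother[:cut] + father[cut:]
            # Bit-flip mutation.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            next_population.append(child)
        population = next_population
    best = max(population, key=fitness)
    return best, history


# Stand-in fitness (an assumption for illustration only): attributes 0, 3
# and 7 are pretended to be informative, and every selected attribute
# costs a small penalty to favour compact attribute sets.
RELEVANT = {0, 3, 7}


def toy_fitness(chromosome):
    hits = sum(chromosome[i] for i in RELEVANT)
    size_penalty = 0.1 * sum(chromosome)
    return hits - size_penalty


best, history = evolve(toy_fitness, n_genes=12)
```

Because the best individual is always carried over, the recorded best fitness never decreases from one generation to the next; whether the global optimum is reached depends on the random seed and the parameters.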
Because only promising individuals take part in the evolutionary process, it is likely that near-optimal solutions emerge. The final solution is chosen from the individuals of the last population. In contrast to gradient-descent optimization, multiple solutions are maintained in parallel within a population, allowing interactions among them to explore regions in the search space between them (Qi et al. 1994). The goodness of a solution, i.e. the fitness of an individual, is numerically evaluated by a fitness function, which depends on the optimization objective.

To utilize GAs for attribute selection following the wrapper approach, it is necessary to represent different attribute sets as individuals. For each individual of a population an ANN is trained and its performance is measured, representing the goodness of the individual.

3.3 Artificial Neural Network

Artificial neural networks (ANNs) model an interconnected system of neurons, enabling computers to imitate the brain’s ability to detect patterns and learn relationships within data (Fischer 1998). The multi-layer perceptron (MLP), introduced by Rumelhart et al. (1986), is one of the most widely used ANNs (e.g., Fischer and Gopal 1994; Pijanowski et al. 2002; Mas et al. 2004). The MLP usually consists of three different layers of neurons: the input layer, the hidden layer, and the output layer. Every connection between neurons of different layers has an assigned weight, which scales the input signals passing through. The input data are first presented to the input layer and then passed on to the hidden layer and to the output layer in a feed-forward manner. A neuron receives the weighted signals from connected neurons of the preceding layer, sums the signals, and calculates an output signal according to its inner activation function (Bishop 1996).

The crucial part of ANNs is the adaptation of the weights, so that the model is capable of representing a target function.
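As an illustration of the layered computation and of weight adaptation, the following is a minimal sketch of a single-hidden-layer perceptron trained by gradient descent on the squared error. It is not the implementation used in this study: the sigmoid activation, the initial weight range, the per-sample weight updates, the network size, and the toy samples are all assumptions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyMLP:
    """Single-hidden-layer perceptron with sigmoid units and one output."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each neuron holds one weight per input plus a trailing bias weight.
        self.hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                       for _ in range(n_hidden)]
        self.output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        """Feed one input vector forward; return hidden and output signals."""
        h = [sigmoid(sum(w * v for w, v in zip(ws, x)) + ws[-1])
             for ws in self.hidden]
        y = sigmoid(sum(w * v for w, v in zip(self.output, h))
                    + self.output[-1])
        return h, y

    def train_step(self, x, target, rate):
        """One gradient-descent update on the squared error of one sample."""
        h, y = self.forward(x)
        delta_out = (y - target) * y * (1.0 - y)
        # Hidden deltas use the output weights before they are updated.
        delta_hid = [delta_out * self.output[j] * h[j] * (1.0 - h[j])
                     for j in range(len(h))]
        for j, hj in enumerate(h):
            self.output[j] -= rate * delta_out * hj
        self.output[-1] -= rate * delta_out
        for j, ws in enumerate(self.hidden):
            for i, xi in enumerate(x):
                ws[i] -= rate * delta_hid[j] * xi
            ws[-1] -= rate * delta_hid[j]


def mse(net, samples):
    """Mean squared error of the network over all samples."""
    return sum((net.forward(x)[1] - t) ** 2 for x, t in samples) / len(samples)


# Toy samples (hypothetical): two inputs, one desired output in [0, 1].
samples = [((0.0, 0.0), 0.1), ((0.0, 1.0), 0.5),
           ((1.0, 0.0), 0.5), ((1.0, 1.0), 0.9)]

net = TinyMLP(n_inputs=2, n_hidden=2)
error_before = mse(net, samples)
for _ in range(1000):
    for x, target in samples:
        net.train_step(x, target, rate=0.3)
error_after = mse(net, samples)
```

The error on the toy samples after training is lower than before, which is all this sketch is meant to show; it says nothing about generalization to unseen data.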
The most popular way of training an ANN is to modify its weights using the back propagation algorithm (Rumelhart et al. 1986). This algorithm randomly sets the weights and calculates the resulting output. After all data samples have been presented to the network, the mean squared error is calculated and the weights are modified according to a generalized delta rule (Rumelhart et al. 1986), so that the total error is distributed among the various nodes in the network. This process of feeding forward input signals and back propagating the errors is repeated iteratively until a terminating condition is fulfilled (e.g. the error rate falls below a certain threshold).

It has been shown that ANNs with one hidden layer can theoretically approximate any function (Hornik et al. 1989). However, a certain degree of freedom must be supplied, i.e. the layer must consist of a sufficient number of hidden neurons. Generally, the number of hidden neurons is chosen to balance the trade-off between network bias and variance (Bishop 1995).

A limitation in the use of ANNs is that they provide a “black box” model. It is difficult to gain deep insight into the inner workings of an ANN by interpreting the weights of the network. Nevertheless, numerical analysis of different input settings may help to gain insights into the importance of attributes. One primary advantage of ANNs, compared to the more easily interpretable decision trees, is the ability to model unknown interactions between input variables and the relationship between such interactions and any output pattern (Pyle 1999).

3.4 Significance Analysis

Although the total set of input variables for ANN training is reduced by a GA, the relative contribution of the remaining attributes to the total model performance is not known. However, being aware of the importance of the attributes can advance the understanding of the model and its explanatory capabilities.
To evaluate the relative contribution of each attribute to the output, several methods have been proposed (Gevrey et al. 2003; Olden et al. 2004). As the GA converges towards a solution, the genetic diversity of the individuals within a generation generally decreases. Consequently, it is hypothesized that genes important to the survival of individuals are present in most chromosomes of individuals within a generation, while unimportant genes are sparse and diverse. Thus, by counting the frequency of the attributes represented within a generation, the importance of the attributes to the output of the ANN can be estimated. Another method to measure the importance of an attribute is to measure the change of the root mean square error (RMSE) when sequentially and stepwise setting input neurons to their mean value (SSMV). The resulting change indicates the relative importance of each attribute (Gevrey et al. 2003). Because the two techniques can lead to diverse conclusions, both are applied and compared in the next section.

4 Results

4.1 Model Specification and Overall Quality

For variable selection purposes a non-dominated sorting genetic algorithm (NSGA-1; Srinivas and Deb 1994) was used. The algorithm was allowed to run for at most 1000 iterations, but stops earlier if the performance does not significantly improve for 25 iterations. Each generation consisted of 100 individuals. To reduce the computation time and to limit the resulting model to a reasonable size, the maximum number of attributes was set to 20 for each individual, and each individual is represented by a trained ANN model. The training data consist of a subset of 20,000 cells (4% of the whole dataset), randomly selected from all regions of the dataset (see Section 2). The remaining cells are used for testing and validation.
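Both importance measures from Section 3.4 can be sketched in a few lines. The sketch below is illustrative only: the "model" is a hypothetical stand-in function rather than a trained ANN, and the small generation and data set are invented to make the behaviour easy to follow.

```python
from math import sqrt
from statistics import mean


def gene_frequency(generation):
    """Share of individuals in a generation that use each attribute."""
    n = len(generation)
    return [sum(ind[i] for ind in generation) / n
            for i in range(len(generation[0]))]


def rmse(model, X, y):
    """Root mean square error of a model over input rows X and targets y."""
    return sqrt(mean((model(row) - t) ** 2 for row, t in zip(X, y)))


def ssmv_importance(model, X, y):
    """RMSE change when one input at a time is fixed to its mean value."""
    base = rmse(model, X, y)
    changes = []
    for i in range(len(X[0])):
        col_mean = mean(row[i] for row in X)
        X_fixed = [row[:i] + [col_mean] + row[i + 1:] for row in X]
        changes.append(rmse(model, X_fixed, y) - base)
    return changes


# Hypothetical final generation: attribute 0 survives in every individual.
generation = [[1, 0, 1], [1, 1, 0], [1, 0, 0], [1, 0, 1]]
frequency = gene_frequency(generation)


# Stand-in for a trained model: relies strongly on input 0, weakly on
# input 1, and not at all on input 2.
def toy_model(row):
    return 0.8 * row[0] + 0.1 * row[1]


X = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0]]
y = [toy_model(row) for row in X]
changes = ssmv_importance(toy_model, X, y)
```

In this toy setting, fixing input 0 to its mean raises the RMSE most, fixing input 1 raises it slightly, and fixing the unused input 2 leaves it unchanged, which is the ranking the SSMV procedure is meant to reveal.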
After empirical tests, the final ANN consists of a single hidden layer with (n + 1)/2 hidden neurons, where n is the number of attributes, even though the optimal number of hidden neurons is not known. The ANN is trained for 1,000 cycles by backpropagation with a learning rate of 0.3. The final model is selected from the last GA generation, based on the RMSE, the squared correlation coefficient (R²), Spearman’s rho (RS), and the number of attributes.

The resulting model of the GA optimization consists of 11 remaining attributes (distance to nearest residential road, length of residential roads, number of waypoints of primary roads, length of motorways, cycleway curviness, distance to nearest pedestrian road, distance to nearest track, length of tracks, number of junctions, distance to nearest village, and number of attributes). Overall, the model showed moderate performance with an RMSE of 0.12, an R² of 0.6, and an RS of 0.59. Applying the model to the remaining data, which are independent of model building, allows the investigation of its generalization capabilities. The result yields a similar performance with an RMSE of 0.12, an R² of 0.59, and an RS of 0.58. A further residual inspection shows a mean of -0.05 and a standard deviation of 1.03 and confirms a nearly Gaussian distribution, which supports the validity of the model.

4.2 Regional Model Performance

Due to the spatial heterogeneity of geographic processes, it is assumed that the performance of the model changes when it is applied to distinct areas. Reasons are, on the one hand, differences in urbanization, economic power, as well as cultural issues, and, on the other hand, the varying quality of the OSM data underlying the model. To assess the influence of locality on the model performance, the model is applied to each GMESUA region and its generalization capabilities are examined separately. The results are summarized in Table 2.
Table 2 Model performance for selected regions within the study area

Urban Region      GMESUA     % of cells    RMSE   R2     RS
                  Region4    intersecting
                             urban areas
Linz              At003l     48.8          0.121  0.695  0.650
Varna             Bg003l     28.7          0.194  0.353  0.554
Ruse              Bg006l     18.4          0.125  0.440  0.483
Brno              Cz002l     31.6          0.114  0.646  0.637
Ústí nad Labem    Cz005l     35.9          0.127  0.583  0.649
Jihlava           Cz014l     21.3          0.070  0.700  0.628
Leipzig           De008l     38.6          0.135  0.627  0.700
Bremen            De012l     36.0          0.103  0.709  0.626
Darmstadt         De025l     35.8          0.126  0.757  0.710
Mönchengladbach   De036l     80.4          0.186  0.701  0.837
Koblenz           De042l     37.8          0.118  0.743  0.720
Odense            Dk003l     45.3          0.118  0.580  0.476
Valencia          Es003l     53.1          0.185  0.534  0.678
Toledo            Es016l     12.1          0.075  0.337  0.275
Cordoba           Es020l     16.5          0.108  0.538  0.492
Bordeaux          Fr007l     45.0          0.167  0.528  0.628
Rennes            Fr013l     55.0          0.119  0.590  0.449
Besançon          Fr025l     31.4          0.101  0.632  0.666
Patrai            Gr003l     26.1          0.110  0.667  0.620
Miskolc           Hu002l     24.1          0.167  0.423  0.500
Debrecen          Hu005l     25.1          0.185  0.313  0.423
Győr              Hu007l     24.8          0.154  0.433  0.553
Firenze           It007l     40.5          0.117  0.647  0.629
Bologna           It009l     42.1          0.137  0.463  0.489
Trieste           It015l     57.1          0.169  0.590  0.773
Panevėžys         Lt003l     15.8          0.121  0.210  0.284
Luxembourg        Lu001l     32.1          0.095  0.683  0.678
Liepāja           Lv002l     0.08          0.060  0.361  0.202
Rotterdam         Nl003l     62.1          0.193  0.598  0.755
Tilburg           Nl006l     60.1          0.206  0.509  0.678
Wrocław           Pl004l     29.2          0.129  0.450  0.559
Poznań            Pl005l     32.0          0.122  0.569  0.599
Setúbal           Pt006l     57.3          0.236  0.155  0.390
Aveiro            Pt008l     52.8          0.242  0.129  0.340
Brăila            Ro008l     17.6          0.123  0.676  0.484
Călărași          Ro012l     19.1          0.163  0.442  0.480
Umeå              Se005l     0.07          0.054  0.428  0.225
Banská Bystrica   Sk003l     17.4          0.072  0.710  0.570
Žilina            Sk006l     26.8          0.109  0.589  0.597
Sheffield         Uk010l     47.5          0.125  0.789  0.781
Leicester         Uk014l     47.9          0.129  0.742  0.639
Wolverhampton     Uk028l     53.6          0.158  0.728  0.715

4 This code allows a clear assignment to the data sets. The first two characters correspond to the particular country according to the ISO 3166 standard.

For most regions the model performs comparably well, with a few remarkable exceptions. For instance, the region of Liepāja, Latvia, has a very low RMSE (0.060), which means that the model mostly predicts well. However, the low R2 (0.361) and RS (0.202) indicate that the low RMSE is not the result of a good prediction of urban areas, but is rather caused by the sparse urbanization in that region. The largest city in this region is Liepāja with a population of about 60,000, and only 3 further towns with more than 1,000 residents are located in that region. In fact, only 8% of all cells are covered at least partially by urban areas. Additionally, OSM data are mostly absent, except for the city of Liepāja itself; thus the model necessarily failed to predict urban areas. As a second example, the region of Aveiro, Portugal, has an exceptionally high RMSE (0.242). The high RMSE is caused by a discrepancy between a high density of urbanization and sparse OSM data in that region. Actually, 53% of the cells for that region are at least partially covered by urban structures. Figure 2 depicts a section from this region, including the city of Ílhavo. Ílhavo has approximately 16,800 residents. However, virtually no OSM data are mapped for this city (Fig. 2 lower panel).
Only the place and a single primary highway exist, although the comparison to GMESUA reveals that the region is densely settled (Fig. 2 upper panel). Such a discrepancy is, of course, reflected in the performance of the model.

Figure 2 Urban areas of Aveiro (Portugal) according to GMESUA (upper panel) and OSM data (lower panel)

However, for several regions the model performs exceptionally well. In particular, for the region of Jihlava, Czech Republic, the model results in a very low RMSE (0.07) and also a considerably high R2 (0.7) and RS (0.682), even though 21% of the cells are at least partially covered by urban structures. The good performance for this region indicates that the quality of the OSM data suffices for the data requirements of the ANN model. The OSM data quality of the United Kingdom is well understood, and it is empirically verified that at least urban areas are well covered in the sense of completeness (Haklay 2010; Haklay et al. 2010). These results are supported by this study: R2 and RS are generally high for all urban regions of the United Kingdom.

One stated key motivation of this work is to improve the completeness of OSM by machine learning, and thus improve its fitness for use. The developed model is able to predict urban patterns, based on a set of attributes derived from OSM data. Figure 3 (lower panel) shows a successful prediction of urban patterns for Great Glen (United Kingdom), where the model distinguishes different degrees of urbanization based on the underlying OSM data.

Figure 3 Comparison between the real urban pattern based on GMESUA (upper panel) and the predicted results (lower panel) for Great Glen, south-east of Leicester, United Kingdom.

4.3 Significance Analysis

To gain further insights into the model, the importance of the model's attributes is evaluated. The respective ranks for the two proposed methods (Sec. 3) are presented in Table 3.
Table 3 Ranking of attribute importance by SSMV and GA

Attribute                               SSMV                    GA
                                        Change of RMSE   Rank   Frequency   Rank
Distance to nearest residential road    0.0195           1      100         1
Length of residential roads             0.0196           2      100         1
Number of attributes                    0.0103           3      90          6
Number of junctions                     0.0095           4      32          9
Distance to nearest village             0.0080           5      85          7
Distance to nearest pedestrian road     0.0054           6      92          5
Length of tracks                        0.0001           7      100         1
Number of waypoints of primary roads    -0.0001          8      7           11
Distance to nearest track               -0.0003          9      99          4
Length of motorways                     -0.0006          10     23          10
Cycleway curviness                      -0.0014          11     58          8

The results for both methods are widely comparable. Both identify the length of residential roads and the distance to the nearest residential road as the most significant input variables. Both methods are also comparable at identifying less important attributes, except for the number of junctions and the distance to the nearest track, whose ranks are complementary. For SSMV it is notable that attributes ranked below 7 (ranks 8 to 11) are rather unimportant, because setting them stepwise to constant values even improved the results.

However, there are also significant differences in the measurement of importance. SSMV ranked the number of attributes as very important, while for GA this attribute is only of average importance. Another major difference between SSMV and GA is the ranking of track-related attributes. For GA the length of tracks and the distance to the nearest track are very important, while for SSMV these attributes are rather irrelevant. Furthermore, the importance of junctions is not emphasized by GA.

5 Discussion

Data production in VGI is carried out by many independent contributors, increasing the chance of repetitions. It can be expected that a large number of repetitions increases the chance of errors being detected and fixed in the data, and thus improves the data quality in general (Heipke 2010). For OSM data Haklay et al.
(2010) showed that there is in fact a non-linear relationship between the number of contributors and the positional accuracy. This assumption does not hold for every aspect of spatial data quality. For completeness it is anticipated that the number of omission errors negatively correlates with the number of commission errors because of the diversity of user requirements. However, the proposed methodology is based on neural computing and aims to detect urban areas using OSM data. It enables users to derive information implicitly stored in OSM, superseding explicit storage, and thus making commission errors less likely.

The performance of the developed model varied considerably when applied to the 42 different European regions, emphasizing the spatial heterogeneity of the data. Local models may provide better results for individual geographic regions, but do not capture general properties of the data well. Although the developed model depends on data that are mostly mapped in OSM, it necessarily fails if such data are missing entirely. Therefore, to apply the presented methodology it is necessary to make assumptions on the presence of valuable information.

Delineating urban patterns is subject to individual considerations. Thus, assessing the validity of the presented methodology for predicting density-based patterns of urban areas is a difficult task, although the results are coherent and visually appealing. In general, the validity of the results must be evaluated with respect to the intended application.

Machine learning can only respond to geographic characteristics if those are encoded in the data (Gahegan 2003). It is a priori unclear how to encode information in the dataset to give reasonable results. Furthermore, the settings of the ANN and GA in this study are primarily chosen a priori. However, a detailed inspection and sensitivity analysis of the different settings bears potential for making the model more robust and yielding better results (see Patuelli et al. 2011).
A fundamental problem of spatial data analysis on lattice data is the dependence of the results on the underlying scale and zoning (Openshaw 1984). The influence of the cell size on the results of the prediction was not a subject of this study, but needs more research.

6 Conclusions and Future Work

In this study a methodological framework is proposed for the delineation of continuous urban areas from OSM data. Each of the framework's components provides a distinct capability. The ANN enables the non-linear estimation of urban patterns from a set of attributes. The GA reduces the total set of attributes to a reasonable size for inductive learning following the wrapper approach, so that no precise understanding of the processes is mandatory. The usefulness of the methodology was demonstrated by applying it to the estimation of urban patterns of Europe. It was shown that urban patterns can be predicted to a large extent. Explicit information about urban patterns and urban density is useful for manifold applications, e.g. navigation and spatial planning tasks, but is currently mostly absent in OSM. By estimating the urban density with the proposed framework, the fitness for use of OSM for applications is improved, leading to additional benefit for users.

Future work will elaborate on the application of machine learning to OSM data. OSM offers a rich and manifold source of spatial data, offering a profound pool of implicit information. Extracting this information from OSM and making it explicit improves the fitness for use and overcomes completeness issues in the data for manifold applications and requirements. Additionally, machine learning bears potential for the detection of other data quality issues, e.g. positional or semantic errors. From a methodological point of view it seems fruitful to include further machine learning techniques that in particular take into account the spatial and temporal properties of OSM data.
Acknowledgements

We acknowledge the valuable feedback received from our colleagues Georg Walenciak, Marcus Götz, Bernhard Höfle, and Alexander Zipf on an early draft of this paper. In addition, we want to thank Hannes Taubenböck (German Aerospace Center, DLR) for his useful hints on appropriate remote sensing data.

References

Bak P, Omer I and Schreck T 2010 Geospatial thinking. In Painho M, Santos M Y and Pundt H (eds) Geospatial Thinking. Berlin, Springer: 25-42
Bishop C M 1996 Neural Networks for Pattern Recognition. New York, Oxford University Press
Brassel K, Bucher F and Stephan E-M 1995 Elements of Spatial Data Quality. Oxford, Pergamon: 81-108
Buttenfield B 1993 Representing data quality. Cartographica: The International Journal for Geographic Information and Geovisualization 30: 1-7
Delavar M R and Devillers R 2010 Spatial data quality: From process to decisions. Transactions in GIS 14: 379-386
Devillers R, Stein A, Bédard Y, Chrisman N, Fisher P and Shi W 2010 Thirty years of research on spatial data quality: Achievements, failures, and opportunities. Transactions in GIS 14: 387-400
EEA 2010 Mapping guide for a European urban atlas. WWW document, http://www.eea.europa.eu/data-and-maps/data/urban-atlas/mapping-guide
Elwood S 2008 Volunteered geographic information: Key questions, concepts and methods to guide emerging research and practice. GeoJournal 72: 133-135
Fischer M M 1998 Computational neural networks: A new paradigm for spatial analysis. Environment and Planning A 30: 1873-1891
Fischer M M and Gopal S 1994 Artificial neural networks: A new approach to modelling interregional telecommunication flows. Journal of Regional Science 34: 503-527
Fotheringham A S and Wong D W S 1991 The modifiable areal unit problem in multivariate statistical analysis. Environment and Planning A 23: 1025-1044
Gahegan M 2003 Is inductive machine learning just another wild goose (or might it lay the golden egg)? International Journal of Geographical Information Science 17: 69-92
Gevrey M, Dimopoulos I and Lek S 2003 Review and comparison of methods to study the contribution of variables in artificial neural network models. Ecological Modelling 160: 249-264
Girres J-F and Touya G 2010 Quality assessment of the French OpenStreetMap dataset. Transactions in GIS 14: 435-459
Goldberg D E 1989 Genetic Algorithms in Search, Optimization and Machine Learning. Boston, MA, Addison-Wesley Longman Publishing Co., Inc.
Goodchild M F 2007 Citizens as sensors: The world of volunteered geography. GeoJournal 69: 211-221
Goodchild M F 2008 Spatial accuracy 2.0. In Proceedings of the Eighth International Symposium on Spatial Accuracy Assessment in Natural Resources and Environmental Sciences. Liverpool, World Academic Union: 1-7
Goodchild M F and Hunter G J 1997 A simple positional accuracy measure for linear features. International Journal of Geographical Information Science 11: 299-306
Guptill S C and Morrison J L (eds) 1995 Elements of Spatial Data Quality. Oxford, Pergamon
Haklay M 2010 How good is volunteered geographical information? A comparative study of OpenStreetMap and Ordnance Survey datasets. Environment and Planning B: Planning and Design 37: 682-703
Haklay M, Basiouka S, Antoniou V and Ather A 2010 How many volunteers does it take to map an area well? The validity of Linus' law to volunteered geographic information. The Cartographic Journal 47: 315-322
Haklay M and Weber P 2008 OpenStreetMap: User-generated street maps. IEEE Pervasive Computing 7: 12-18
Heipke C 2010 Crowdsourcing geospatial data. ISPRS Journal of Photogrammetry and Remote Sensing 65: 550-557
Helbich M and Leitner M 2009 Spatial analysis of the urban-to-rural migration determinants in the Viennese metropolitan area. A transition from suburbia to postsuburbia? Applied Spatial Analysis and Policy 2: 237-260
Helbich M and Leitner M 2010 Postsuburban spatial evolution of Vienna's urban fringe: Evidence from point process modeling. Urban Geography 31: 1100-1117
Holland J 1992 Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. Cambridge, MA, MIT Press
Hornik K, Stinchcombe M and White H 1989 Multilayer feedforward networks are universal approximators. Neural Networks 2: 359-366
Jiang B and Harrie L 2004 Selection of streets from a network using self-organizing maps. Transactions in GIS 8: 335-350
Jiang B and Jia T 2011 Zipf's law for all the natural cities in the United States: A geospatial perspective. International Journal of Geographical Information Science (accepted)
Kresse W and Fadaie K 2004 ISO Standards for Geographic Information. Berlin, Springer
Leung Y 1987 On the imprecision of boundaries. Geographical Analysis 19: 125-151
Mas J F, Puig H, Palacio J L and Sosa-López A 2004 Modelling deforestation using GIS and artificial neural networks. Environmental Modelling & Software 19: 461-471
Maué P and Schade S 2008 Quality of geographic information patchworks. In 11th AGILE International Conference on Geographic Information Science.
Moellering H (ed) 1987 A Draft Proposed Standard for Digital Cartographic Data. Columbus, OH, National Committee for Digital Cartographic Standards
Neis P and Zipf A 2008 OpenRouteService.org is three times "Open": Combining OpenSource, OpenLS and OpenStreetMaps. In GIS Research UK. Manchester
Nieminen J 1974 On centrality in a graph. Scandinavian Journal of Psychology 15: 322-336
Olden J D, Joy M K and Death R G 2004 An accurate comparison of methods for quantifying variable importance in artificial neural networks using simulated data. Ecological Modelling 178: 389-397
van Oort P 2006 Spatial Data Quality: From Description to Application. PhD thesis, Wageningen University
Openshaw S and Openshaw C 1997 Artificial Intelligence in Geography. New York, NY, Wiley
Over M, Schilling A, Neubauer S and Zipf A 2010 Generating web-based 3D city models from OpenStreetMap: The current situation in Germany. Computers, Environment and Urban Systems 34: 496-507
Patuelli R, Reggiani A, Nijkamp P and Schanne N 2011 Neural networks for regional employment forecasts: Are the parameters relevant? Journal of Geographical Systems 13: 67-85
Pijanowski B C, Brown D G, Shellito B A and Manik G A 2002 Using neural networks and GIS to forecast land use changes: A land transformation model. Computers, Environment and Urban Systems 26: 553-575
Pyle D 1999 Data Preparation for Data Mining. San Francisco, CA, Morgan Kaufmann Publishers Inc.
Qi X and Palmieri F 1994 Theoretical analysis of evolutionary algorithms with an infinite population size in continuous space. Part I: Basic properties of selection and mutation. IEEE Transactions on Neural Networks 5: 102-119
Ramm F and Topf J 2010 OpenStreetMap: Using and Enhancing the Free Map of the World. UIT Cambridge
Roick O and Hagenauer J 2011 OSMatrix - Grid-based analysis and visualization of OpenStreetMap. State of the Map 2011, The 5th Annual International OpenStreetMap Conference, Vienna, Austria (submitted)
Rozenfeld H D, Rybski D, Gabaix X and Makse H A 2009 The area and population of cities: New insights from a different perspective on cities. National Bureau of Economic Research, Inc., NBER Working Papers No 15409
Rumelhart D E, Hinton G E and Williams R J 1986 Learning representations by back-propagating errors. Nature 323: 533-536
Servigne S, Lesage N and Libourel T 2010 Quality components, standards, and metadata. In Devillers R and Jeansoulin R (eds) Fundamentals of Spatial Data Quality. London, ISTE: 179-210
Shi W, Goodchild M and Fisher P 2002 Spatial Data Quality. New York, Taylor & Francis
Siedlecki W and Sklansky J 1989 A note on genetic algorithms for large-scale feature selection. Pattern Recognition Letters 10: 335-347
Smith T R 1984 Artificial intelligence and its applicability to geographical problem solving. The Professional Geographer 36: 147-158
Srinivas N and Deb K 1994 Multiobjective optimization using nondominated sorting in genetic algorithms. Evolutionary Computation 2: 221-248
Tobler W R 1970 A computer movie simulating urban growth in the Detroit region. Economic Geography 46: 234-240
Yang J and Honavar V 1998 Feature subset selection using a genetic algorithm. IEEE Intelligent Systems and their Applications 13: 44-49
Zielstra D and Zipf A 2010 A comparative study of proprietary geodata and volunteered geographic information for Germany. 13th AGILE International Conference on Geographic Information Science, Guimaraes, Portugal