Contribution Profiles of Voluntary Mappers in OpenStreetMap

Renate Steinmann
Salzburg Research Forschungsgesellschaft
Jakob-Haringer Str. 5/III
5020 Salzburg
renate.steinmann@salzburgresearch.at

Simon Gröchenig
Salzburg Research Forschungsgesellschaft
Jakob-Haringer Str. 5/III
5020 Salzburg
Simon.groechenig@salzburgresearch.at

Karl Rehrl
Salzburg Research Forschungsgesellschaft
Jakob-Haringer Str. 5/III
5020 Salzburg
karl.rehrl@salzburgresearch.at

Richard Brunauer
Salzburg Research Forschungsgesellschaft
Jakob-Haringer Str. 5/III
5020 Salzburg
richard.brunauer@salzburgresearch.at

Abstract

Volunteered Geographic Information (VGI) projects such as OpenStreetMap (OSM) attract more and more people. Although the number of registered users is continuously increasing, it is largely unknown how these users actually contribute. The crucial question is “Who contributes what to a VGI project?” This work proposes a method for identifying contribution profiles of OSM mappers. Based on the OSM Full Planet History File, all user contributions back to the early days of the OSM project in 2005 have been analysed. To analyse different contribution patterns, a k-means clustering of action and feature types has been applied. The clustering reveals a set of well-defined and characteristic contribution profiles such as “Premium Creator”, “Highway Mapper” or “All-Rounder”.

Keywords: VGI, contribution profiles, k-means clustering

1 Introduction

The voluntary collection of geographical data has been a continuously growing trend over the last few years. Mappers from all over the world create a vast amount of free geographical data. Goodchild [2] defined this phenomenon as Volunteered Geographic Information (VGI). The continuing trend in VGI leads to a high number of registered mappers in VGI projects. OpenStreetMap (OSM), the most famous VGI project, counted 1,072,879 registered mappers by March 18th, 2013.
However, registered mappers are not necessarily active mappers, because registration does not mean that a project member actively contributes. Studies revealed that only 5% of mappers contribute most of the data (e.g. Wikipedia editors study [11], Mooney and Corcoran [6], Neis and Zipf [7]). While the share of active mappers is relatively well known, more detailed insights into the mapping activities of voluntary mappers are largely missing. This work aims at identifying contribution profiles of OSM mappers using the OSM Full Planet History File from February 5th, 2013 (which includes all mapping activities from the early days of the OSM project until the generation date). We characterise mapper contribution by mapping activities (create, update, delete) and feature types (OSM primary keys) and extract different contribution profiles. In the long run, classifying voluntary mapping activities could help to draw conclusions about data quality as well as the future development of a VGI project.

The remainder of this work is structured as follows: Section 2 describes related work that deals with contributor activities in VGI and Wikipedia. Section 3 explains the clustering approach that has been applied for identifying contribution profiles. Results are presented in Section 4. Conclusions and outlook in Section 5 complete the paper.

2 Related Work

In the related work section we focus on studies assessing contributor activities in VGI projects. Recently published work by Neis and Zipf [7] analyses contributor activity using an OSM Full Planet History File. They found that 38% of registered mappers performed only one edit whereas a minority of 5% are continuous mappers. Their classification of contribution profiles is based on the number of contributed nodes. The authors distinguish between senior mappers, junior mappers, nonrecurring mappers and mappers with no edits.
The approach in this paper differs in the method and parameters used for analysis. We use k-means clustering based on several parameters, namely (1) action types, (2) feature types, (3) total number of contributed actions and (4) total number of contribution days, in order to identify contribution profiles.

Mooney and Corcoran [6] focus on the characteristics of “heavily edited objects” (edited 15 or more times). They found that only 11% of mappers worldwide contributed 87% of the heavily edited objects. Budhathoki and Haythornthwaite [1] distinguish between serious and casual mappers in their work on motivational factors. Their mapper classification is based on the number of contributed nodes and/or the longevity of the contribution and/or the number of contribution days during active mapping periods.

Beyond VGI, contributor activities have also been studied in Wikipedia. Since OpenStreetMap is sometimes called the “Wikipedia of maps”, we also considered work in this area as related. Liu and Ram [5] analysed collaboration patterns and their impact on the quality of Wikipedia articles. Their study not only identified different contribution profiles, but also revealed relationships between contribution profiles and data quality. West et al. [10] aim to draw a data-driven portrait of Wikipedia editors. They investigate how the online behaviour of Wikipedia editors can be distinguished from other users’ behaviour.

Summarizing previous work on contributor activity and contribution profiles, we conclude that there is no satisfying answer to the question of how user profiles can be identified and described in VGI projects. This holds especially true for OpenStreetMap. Our work contributes to this question with an analysis of worldwide mapper activity since the early days of the OSM project.

AGILE 2013 – Leuven, May 14-17, 2013

3 Research questions and method

The goal of this work is to identify different contribution profiles based on mapping activities of OSM contributors. The work addresses the following research questions:

1) How can different contribution profiles be identified from the OSM Full Planet History File?
2) Which different mapping styles are reflected by the revealed profiles?

To find answers to the research questions we build our analysis on the action model developed by Rehrl et al. [9] and perform k-means clustering on action types (create, update, delete) and feature types (derived from primary keys in OSM).

3.1 Action model

Rehrl et al. [9] propose a conceptual model of operations, actions and activities to structure VGI mapping activities in analogy to Kuutti’s [3] activity theory. The basic concept is a VGI Operation, which describes an atomic edit to a data object, e.g. a position update to a node or an update of an attribute value. A VGI Action is a sequence of consecutive operations which have been executed by a single mapper within a continuous time span and which refer to one single data object, e.g. all consecutive operations by a unique user that refer to a unique OSM node. A VGI Activity is composed of all VGI actions fulfilling certain constraints, e.g. a collection of mapping actions within one day and within a certain geographic region, or all updates to the street network within one day. This conceptual model can be used as a foundation for analyzing editing processes in different VGI projects. Due to its explicit definitions it helps to reproduce and compare results.

To instantiate the model with data from the OSM Full Planet History File we prepared the data in three steps:

1) Extraction of VGI operations.
2) Aggregation of operations to VGI actions based on a list of aggregation rules. As proposed in Rehrl et al. [9], the resulting actions are a cross-classified tabulation of database operations (following the CRUD paradigm) and feature types used in OSM. Feature types are points (all nodes in OSM, including the geometries of polylines), lines (ways and closed ways) and relations (all OSM relations) [8]. The cross-tabulation results in a set of VGI actions (cf. Table 1).
3) Analysis of VGI action streams according to a customized rule set for answering the research questions. As filter rules we used the different action types and the keys of the first-order primary features (26 at the time of writing) defined in the map features section of the OSM Wiki (footnote 1).

Table 1: Action types which serve as input for analysis

AC_CreatePoint      AC_UpdatePoint      AC_DeletePoint
AC_CreateLine       AC_UpdateLine       AC_DeleteLine
AC_CreateRelation   AC_UpdateRelation   AC_DeleteRelation

3.2 Data

The dataset for instantiating the action model is the OSM Full Planet History File from February 5th, 2013. This file contains an ordered list of every version of every feature in the OSM database and includes all edits that have been contributed by mappers back to the year 2005. The first pre-processing step automatically transforms the OSM file into a stream of edit operations. In order to be able to draw representative conclusions on contribution profiles, “nonrecurring mappers” are filtered out of the operation stream. We used only the profiles of mappers who contributed 10 or more actions. This selection follows the definition of “nonrecurring mappers” given by Neis and Zipf [7]. We did not filter out any import users or robots, since we assume that these contribution profiles either move to a separate cluster or are dismissed as outliers.

3.3 Identification of contribution profiles

In this section the method for identifying different contribution profiles is outlined. The method is composed of four steps: (1) profile definition, (2) profile calculation, (3) k-means clustering and (4) interpretation of results.

1) Profile definition
In a first step a contribution profile has to be defined.
We call this definition “the context” of the profile analysis. For analyzing contributions to the OSM database, contexts can vary greatly, e.g. “the set of used tags”, “the time span of user activity” or “the geographical activity range”. In this work we focus on two contexts, namely “the type of performed actions (create, update, delete)” and “the feature type (primary tag) of the edited OSM features (e.g. highway, building, etc.)”.

2) Profile calculation
For the first profile (context “action types”) we count all create, update and delete actions which a user has performed since registration. The calculation for the second profile counts the number of actions of each user for each primary feature. In order to be able to compare contribution profiles, the relative share of create/update/delete actions and the relative share of edited map features by primary key are attributes of the contribution profile of a user. To find differences between users who contribute a lot and users who contribute less, an additional profile attribute representing the total number of contributed actions (total actions) is added. Its values are normalized into the interval [0,1]. Furthermore, the number of mapping days (active days) is also part of a contribution profile, as it reveals whether a mapper profile depends on the number of active mapping days. These values are also normalized into the interval [0,1].

1 http://wiki.openstreetmap.org/wiki/Map_Features

3) k-means clustering
The identification of different contribution profiles on the basis of the previously defined parameters is done with k-means clustering [4], a commonly used clustering approach. k-means clustering aims to divide n observations into k clusters, where each cluster is represented by its mean.
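Steps 2 and 3 can be made concrete with a small sketch. The Python code below is only an illustration under stated assumptions, not the authors' implementation: the function names, the toy users and the value of max_days are hypothetical, and min-max scaling of the decadic logarithm is just one reading of the normalization. That reading is consistent with the statement in Section 4.2 that a value of 0.1 corresponds to 53 actions, if the lower bound is the 10-action filter and the upper bound is the maximum in Table 2. The clustering is a plain Lloyd iteration with a fixed random seed.

```python
import math
import random


def scale_log10(value, lo, hi):
    """Min-max scale log10(value) into [0, 1] between log10(lo) and log10(hi)."""
    return (math.log10(value) - math.log10(lo)) / (math.log10(hi) - math.log10(lo))


def unscale_log10(norm, lo, hi):
    """Inverse of scale_log10: map a normalized value back to a raw count."""
    return 10 ** (math.log10(lo) + norm * (math.log10(hi) - math.log10(lo)))


def profile_vector(creates, updates, deletes, active_days,
                   min_actions=10, max_actions=185_645_045, max_days=2_000):
    """Relative action shares plus two log-scaled volume attributes.

    min_actions is the 10-action filter from the Data section and max_actions
    the maximum from Table 2; max_days is a hypothetical placeholder.
    """
    total = creates + updates + deletes
    return [
        creates / total,
        updates / total,
        deletes / total,
        scale_log10(total, min_actions, max_actions),
        scale_log10(active_days, 1, max_days),
    ]


def nearest(point, centroids):
    """Index of the centroid with the smallest Euclidean distance to point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: alternate assignment and mean update until stable."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        members = [[] for _ in range(k)]
        for p in points:
            members[nearest(p, centroids)].append(p)
        # An empty cluster keeps its previous centroid.
        updated = [[sum(col) / len(m) for col in zip(*m)] if m else centroids[i]
                   for i, m in enumerate(members)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Toy users as (creates, updates, deletes, active days); all values are invented.
USERS = [(90, 8, 2, 3), (950, 40, 10, 40), (5, 30, 1, 2), (4, 3, 25, 1),
         (80000, 15000, 5000, 600), (12, 1, 0, 1), (7, 40, 3, 4),
         (3000, 2500, 900, 300)]

profiles = [profile_vector(*u) for u in USERS]
for k in (3, 5):  # the paper compares k = 3, 5, 10 and 15 on the full data set
    for centroid in kmeans(profiles, k):
        print(k, [round(x, 2) for x in centroid])

# Under the assumed scaling, a normalized value of 0.1 maps back to about 53 actions.
print(round(unscale_log10(0.1, 10, 185_645_045)))
```

Working with relative shares means that a mapper with 100 actions and one with 100,000 actions can end up close together if their mix of create, update and delete actions is similar; the two log-scaled attributes then separate them by contribution volume.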
During clustering, the algorithm determines the cluster means, also called centroids, such that the within-cluster sum of squared Euclidean distances is minimized, where each observation belongs to the cluster with the closest mean. With k-means clustering the number of clusters k has to be specified in advance. To determine a proper value for k, clustering was applied repeatedly with the values 3, 5, 10 and 15 for k. The quality of a clustering is assessed by how well the cluster means can be interpreted and distinguished from one another. The optimal number of clusters identified for the dataset is presented in the results section.

4) Interpretation of results
The resulting clusters are named as typical contribution profiles and described according to their contribution characteristics.

4 Results

This section presents the results of the applied profiling approach. Before turning to the contribution profiles, we present basic figures showing general characteristics of mapping activity as well as its development over the years.

4.1 Contribution Metrics

An OSM user contributes 11,314 actions on average (including all registered users who contributed at least one action). The majority of these actions are “Creates” (8,384 on average). As shown in Table 2, contribution volumes vary greatly between users, from single actions to several thousand actions and more.

Table 2: Figures showing average user contribution in OSM (per user)

               Average   Median       Maximum   Standard Deviation
All Actions     11,314       23   185,645,045              550,200
Create           8,384       11   184,879,749              420,300
Update           1,907        5   169,404,872              330,468
Delete           1,023        1    18,276,275               46,638

Figure 1: Average number of actions per user per year

Figure 2: User shares classified per mapping days and contributed actions in %

Figure 1 shows the development of the average user contribution over the years. In the years 2006 and 2007 the increased create shares are consequences of imports (e.g. the U.S. TIGER import).
Apart from the import years, the share of “Creates” by an average user steadily increased until 2010 and decreased in 2011 and 2012. “Updates” reached their highest level in 2009. As expected, “Deletes” remain at a rather low level throughout.

As Figure 2 shows, 140,334 (53%) out of 264,531 mappers contributed data on just one day. With an increasing number of mapping days the number of contributors decreases. The users who submitted contributions on more than 100 and fewer than 1,000 days (3% of all users) contributed the largest share of actions (68% of all actions). Furthermore, about 80% of all actions are performed by no more than 3% of users.

4.2 Contribution Profiles

The following subsection describes the results of k-means clustering for identifying typical contribution profiles. The repeated k-means clustering on action types, total actions and active days yields 10 as the optimal number for k. For total actions and active days we used the decadic logarithm to obtain a more balanced distribution over the value range (cf. Figure 1 and Figure 2), i.e., the logarithm stretches the range of small values and compresses the range of large values.

Figure 3 shows all cluster centroids calculated with the k-means clustering algorithm. The figure shows that the attributes total actions (yellow) and active days (red) correlate. This means that users who contributed a high number of actions also spread these actions over numerous days. A value of 0.1 for log_actions corresponds to 53 actions and a value of 0.1 for log_days corresponds to 2 days. Furthermore, very active users mainly perform create and update actions. Those who mainly perform update or delete actions contributed just a few actions.

Figure 3: k-means clustering (k=10) on create, update and delete actions (relative values), action count (decadic logarithm) and mapping days (decadic logarithm), overview of cluster centroids

Table 3 summarizes and interprets the results. The classification as “Creators”, “Updaters” or “Deleters” is based on the predominant action type of these mappers. All-Rounders contribute all action types. The labels “Basic” and “Premium” are derived from the total number of contributed actions and active mapping days. Thus, “Premium Mappers” are mappers who contribute more than others.

Table 3: Typical contribution profiles based on action types

Cluster number   Profile acronym       Description of contribution profile
1                Basic Creator         Users with a high share of create actions, but a low actions and mapping days count.
7                Creator               Users with a high share of create actions and a medium actions and mapping days count.
4                Premium Creator       Users with a high share of create actions and a high actions and mapping days count.
2                Creator-Updater       Users with a high share of create and update actions and a low actions and mapping days count.
8                Updater               Users with a high share of update actions and a low actions and mapping days count.
9                Updater-Creator       Users with a high share of update and a lower share of create actions. These mappers are not very active.
5                Creator-Deleter       Users with a high share of create and delete actions. These mappers are not very active.
3                Deleter               Users with a high share of delete actions. These mappers are not very active.
0                Basic All-Rounder     Users with a high share of create actions, but also notable shares of update and delete actions. “Basic All-Rounders” are not very active mappers.
6                Premium All-Rounder   Users with a high share of create, update and delete actions. “Premium All-Rounders” are very active mappers.

Figure 4 illustrates the number of mappers within the clusters. Large user groups fall into clusters 1, 2 and 7. The “Creators” are the ones who contribute most actions (clusters 1 and 7), followed by the “Creator-Updaters” (cluster 2) and the “Updater-Creators” (cluster 9).

Figure 4: Number of mappers in clusters 0 – 9

For the second cluster analysis we took the edited primary keys as parameter. Figure 5 shows the shares of edited primary keys over the years 2005-2013. The 10 primary keys with the greatest number of actions are selected. All other primary keys are summarized in the category “Other”. In the early days of the OSM project mostly highways were mapped. From 2007 to 2010 the share of highway-related actions decreased whereas the share of building-related actions increased. Since 2010 the share of highway-related actions has increased slightly whereas the share of building-related actions has decreased. The distribution of the other primary keys does not change notably between 2005 and 2013. The primary key “natural” showed high activity in 2006 due to coastline imports, but then decreased again and has since remained at nearly the same level.

Figure 5: Action shares per primary keys per year (2005 – 2013)

Primary key clustering was also performed with k-means clustering using several values of k (3, 5, 10 and 15). The most reasonable result was again obtained with k=10. Figure 6 shows the results of the k-means clustering based on primary keys and illustrates the cluster centroids. Figure 7 shows the number of mappers in the different clusters.

Figure 6: k-means clustering (k=10) on primary keys, overview of cluster centroids

By interpreting the cluster centroids (cf. Figure 6) we distinguish nine different mapper types. Table 4 summarizes these types and gives them meaningful acronyms. In some cases the transition between the different types is fluid.

Figure 7: Number of mappers in clusters 0 – 9

By far the largest number of users falls into cluster 1 and cluster 9 (cf. Figure 7). These are the clusters where mappers mainly tag highways (the “Highway Mapper”).

Table 4: Typical mapper types based on primary keys

Cluster number   Profile acronym           Description of contribution profile
1, 9             Highway Mapper            This user mainly maps highways. Users in cluster 9 also map some amenities.
5                Highway-Building Mapper   This user mainly maps highways, but also contributes buildings.
0                Building Mapper           This user mainly maps buildings.
7                Building-Highway Mapper   This user mainly maps buildings, but also contributes highways.
8                Amenity Mapper            This user mainly maps amenities.
2                Amenity-Highway Mapper    This user maps amenities and highways in an equal share.
6                Place-Highway Mapper      This user mainly maps places, but also contributes to highways.
3                Power-Highway Mapper      This user mainly maps power-related features, but also contributes to highways.
4                All-Round Mapper          All-Rounders are not focused on one feature type but contribute many different types.

5 Conclusions and Outlook

In this work a data-centric approach for analysing mapper contribution in VGI projects such as OpenStreetMap was proposed. The approach is based on a k-means clustering of contribution profiles considering parameters for (1) action types, (2) feature types, (3) total number of contributed actions and (4) total number of contribution days.

With regard to the parameter “action type” we identified ten typical contribution profiles. “Creators” mainly contribute new features. A great number of users fall into the “Creators” clusters. “Premium Creators” are the contributors responsible for high-volume contributions. These contributors are also active on a higher number of days than normal contributors. Besides “Creators” there is also a notable group of “Premium All-Rounders”, who are very active and contribute with every action type.

From the feature type analysis we derived nine different contribution profiles. By far the largest number of users falls into the mapper type “Highway Mapper”. Additionally we discovered “Building Mappers” and “Amenity Mappers”.
Again we found “All-Rounders” (mappers not focused on specific feature types) and five clusters of “Dual Feature Mappers” (mappers mainly contributing to two categories).

From the results we conclude that k-means clustering with centroids is useful for identifying and describing different mapper types. It has been demonstrated that the relative values for action and feature types, combined with the overall contribution volume, are suitable parameters for profiling. It is worth mentioning that most of the clusters are rather distinct whereas some show a fluid transition.

As a next step the method can be repeated with different parameter sets to answer specific research questions. Additionally, it would be worthwhile to take a closer look at relationships between the two profiling contexts “action types” and “feature types”. Further research could examine relationships between contribution profiles and the quality of contributions.

References

[1] N. R. Budhathoki & C. Haythornthwaite (2012). Motivation for Open Collaboration: Crowd and Community Models and The Case of OpenStreetMap. American Behavioral Scientist, 28.
[2] M. F. Goodchild (2007). Citizens as Sensors: the world of volunteered geography. GeoJournal, 69:211-221.
[3] K. Kuutti (1996). Activity theory as a potential framework for human computer interaction research. In B. A. Nardi (Ed.), Context and consciousness: Activity theory and human-computer interaction (pp. 17-44). Cambridge, MA: The MIT Press.
[4] J. A. Hartigan (1975). Clustering Algorithms (Probability and Mathematical Statistics). John Wiley & Sons Inc.
[5] J. Liu & S. Ram. Who does what: Collaboration patterns in the Wikipedia and their impact on article quality. ACM Trans. Manage. Inf. Syst., 2(2):11:1-11:23.
[6] P. Mooney & P. Corcoran (2012). Characteristics of Heavily Edited Objects in OpenStreetMap. Future Internet, 4(1), 285-305.
[7] P. Neis & A. Zipf (2012). Analyzing the Contributor Activity of a Volunteered Geographic Information Project – The Case of OpenStreetMap. ISPRS International Journal of Geo-Information, 1(2), 146-165. Molecular Diversity Preservation International. Retrieved from http://www.mdpi.com/2220-9964/1/2/146/htm
[8] F. Ramm & J. Topf (2010). OpenStreetMap, 3rd ed. Berlin: lehmanns media.
[9] K. Rehrl, S. Gröchenig, H. Hochmair, S. Leitinger, R. Steinmann & A. Wagner (2012). A conceptual model for analyzing contribution patterns in the context of VGI. In: LBS 2012 – 9th Symposium on Location Based Services. Berlin: Springer.
[10] West et al. (2012). Drawing a Data-Driven Portrait of Wikipedia Editors. WikiSym conference 2012, Linz, Austria.
[11] Wikipedia editors study (2011): Results from the editor survey, cf.: http://commons.wikimedia.org/wiki/File:Editor_Survey_Report_-_April_2011.pdf, 2012/10/04