Contribution Profiles of Voluntary Mappers in OpenStreetMap
Renate Steinmann
Salzburg Research Forschungsgesellschaft
Jakob-Haringer Str. 5/III
5020 Salzburg
renate.steinmann@salzburgresearch.at
Simon Gröchenig
Salzburg Research Forschungsgesellschaft
Jakob-Haringer Str. 5/III
5020 Salzburg
Simon.groechenig@salzburgresearch.at
Karl Rehrl
Salzburg Research Forschungsgesellschaft
Jakob-Haringer Str. 5/III
5020 Salzburg
karl.rehrl@salzburgresearch.at
Richard Brunauer
Salzburg Research Forschungsgesellschaft
Jakob-Haringer Str. 5/III
5020 Salzburg
richard.brunauer@salzburgresearch.at
Abstract
Volunteered Geographic Information (VGI) projects such as OpenStreetMap (OSM) attract more and more people. Although the number of
registered users is continuously increasing, it is largely unknown how these users actually contribute. The crucial question is “Who
contributes what to a VGI project?” This work proposes a method for identifying contribution profiles of OSM mappers. Based on the
OSM Full Planet History File, all user contributions back to the early days of the OSM project in the year 2005 have been analysed. To
analyse different contribution patterns, a k-means clustering of action and feature types has been applied. The clustering reveals a set
of well-defined and characteristic contribution profiles such as “Premium Creator”, “Highway Mapper” or “All-Rounder”.
Keywords: VGI, contribution profiles, k-means clustering
1
Introduction
The voluntary collection of geographical data has been a
continuously growing trend over the last few years. Mappers
from all over the world create a vast amount of free
geographical data. Goodchild [2] defined this phenomenon as
Volunteered Geographic Information (VGI). The continuing
trend in VGI leads to a high number of registered mappers in
VGI projects. OpenStreetMap (OSM), the most famous VGI
project, counted 1,072,879 registered mappers by March 18th
2013. However, registered mappers are not equal to active
mappers, because registration does not necessarily mean that a
project member actively contributes. Studies revealed that
only 5% of mappers contribute most of the data (e.g.
Wikipedia editors study [11], Mooney and Corcoran [6], Neis
and Zipf [7]). While the share of active mappers is relatively
well known, more detailed insights into the mapping activities of
voluntary mappers are largely missing.
This work aims at identifying contribution profiles of OSM
mappers using the OSM Full Planet History File from
February 5th, 2013 (it includes all mapping activities from the
early days of the OSM project until the generation date). We
characterise mapper contribution by mapping activities
(create, update, delete) and feature types (OSM primary keys)
and extract different contribution profiles. In the long run,
classifying voluntary mapping activities could help to
draw conclusions about data quality as well as the future
development of a VGI project.
The remainder of this work is structured as follows: Section
2 describes related work that deals with contributor activities
in VGI and Wikipedia. Section 3 explains the clustering
approach which has been applied for
identifying contribution profiles. Results are presented in
Section 4. Conclusions and an outlook in Section 5 complete the
paper.
2
Related Work
In the related work section we focus on studies assessing
contributor activities in VGI projects. Recently published
work of Neis and Zipf [7] analyses the contributor activity
using an OSM Full Planet History File. In their study they
found that 38% of registered mappers performed only one edit
whereas a minority of 5% are continuous mappers. The
proposed classification of contribution profiles is based on the
number of contributed nodes. The authors distinguish between
senior mappers, junior mappers, nonrecurring mappers and
mappers with no edits. The approach in this paper is different
in terms of the method and parameters used for analysis. We
use k-means clustering based on several parameters like (1)
action types, (2) feature types, (3) total number of contributed
actions and (4) total number of contribution days in order to
identify contribution profiles. Mooney and Corcoran [6] focus
on the characteristics of “heavily edited objects” (edited 15 or
more times). They found that only 11% of mappers worldwide
contributed 87% of the heavily edited objects. Budhathoki and
Haythornthwaite [1] distinguish between serious and casual
mappers in their work on motivational factors. Their mapper
classification is based on the number of contributed nodes,
and/or the longevity of the contribution and/or the number of
contribution days during active mapping periods.
Beyond VGI, contributor activities have been studied in
Wikipedia, too. Since OpenStreetMap is sometimes called the
“Wikipedia of maps” we also considered work in this area as
related. Liu and Ram [5] analysed collaboration patterns and
their impact on the quality of Wikipedia articles. Their study
not only identified different contribution profiles, but also
revealed relationships between contribution profiles and data
quality. West et al. [10] aim to draw a data-driven portrait of
Wikipedia editors. They investigate how the
online behaviour of Wikipedia editors can be distinguished
from other users’ behaviour.
Summarizing previous work on contributor activity and
contribution profiles we conclude that there is no satisfying
answer to the question how user profiles can be identified and
described in VGI projects. This holds especially true for
OpenStreetMap. Our work contributes to this question with an
analysis of worldwide mapper activity since the early days of
the OSM project.
AGILE 2013 – Leuven, May 14-17, 2013
3
Research questions and method
The goal of this work is to identify different contribution
profiles based on mapping activities of OSM contributors.
The work addresses the following research questions:
1) How can different contribution profiles be identified
from the OSM Full Planet History File?
2) Which different mapping styles are reflected by the
revealed profiles?
To answer these research questions we build our
analysis on the action model developed by Rehrl et al. [9] and
perform k-means clustering on action types (create, update,
delete) and feature types (derived from primary keys in
OSM).
3.1
Action model
The action model by Rehrl et al. [9] provides a
conceptual model of operations, actions and activities to
structure VGI mapping activities in analogy to Kuutti’s [3]
activity theory. The basic concept is a VGI Operation, which
describes an atomic edit to a data object, e.g. a position update
of a node or an update of an attribute value. A VGI Action is a
sequence of consecutive operations which have been executed
by a single mapper within a continuous time span and which
refer to one single data object, e.g. all consecutive operations
by a unique user that refer to a unique OSM node. A VGI Activity
is composed of all VGI actions fulfilling certain constraints,
e.g. a collection of mapping actions within one day and within
a certain geographic region, or all updates to the street network
within one day. This conceptual model can be used as a
foundation for analyzing editing processes in different VGI
projects. Due to its explicit definitions it helps to reproduce
and compare results.
To instantiate the model with data from the OSM Full
Planet History File we prepared the data in three steps:
1) Extraction of VGI operations.
2) Aggregation of operations to VGI actions based on a list of
aggregation rules. As proposed in Rehrl et al. [9], the resulting
actions are a cross-classified tabulation of database operations
(following the CRUD paradigm) and feature types used in OSM.
Feature types are points (all nodes in OSM, also geometries of
polylines), lines (ways and closed ways) and relations (all OSM
relations) [8]. The cross-tabulation results in a set of VGI
actions (cf. Table 1).
3) Analysis of VGI action streams according to a customized
rule set for answering the research questions. As filter rules
we used the different action types and the keys of the first
order primary features (at date of writing 26) defined in the
map features section of the OSM Wiki (footnote 1).
Table 1: Action types which serve as input for analysis
AC_CreatePoint     AC_UpdatePoint     AC_DeletePoint
AC_CreateLine      AC_UpdateLine      AC_DeleteLine
AC_CreateRelation  AC_UpdateRelation  AC_DeleteRelation
3.2
Data
The dataset for instantiating the action model is the OSM Full
Planet History File from February 5th, 2013. This file contains
an ordered list of every version of every feature in the OSM
database and includes all edits that have been contributed by
mappers back to the year 2005. The first pre-processing step
automatically transforms the OSM file into a stream of edit
operations.
In order to be able to draw representative conclusions on
contribution profiles, “nonrecurring mappers” are filtered out
of the operation stream. We used only profiles of mappers
who contributed 10 or more actions. This
selection is in line with the definition of “nonrecurring
mappers” given by Neis and Zipf [7]. We did not filter out
any import users or robots, since we assume that these
contribution profiles either move to a separate cluster or are
dismissed as outliers.
3.3
Identification of contribution profiles
In this section the method for identifying different
contribution profiles is outlined. The method is composed of
four steps: (1) profile definition, (2) profile calculation, (3) k-means
clustering and (4) interpretation of results.
1) Profile definition
In a first step a contribution profile has to be defined. We call
this definition “the context” of the profile analysis. For
analyzing contributions to the OSM database, contexts can
vary greatly, e.g. “the set of used tags”, “the time span of user
activity” or “the geographical activity range”. In this work we
focus on two contexts, namely “the type of performed actions
(create, update, delete)” and “the feature type (primary tag) of
the edited OSM features (e.g. highway, building, etc.)”.
2) Profile calculation
For the first profile (context “action types”) we count all
create, update and delete actions which a user has performed
since registration. The calculation for the second profile
counts the number of actions of each user for each primary
feature. In order to be able to compare contribution profiles,
the relative share of create/update/delete actions and the
relative share of edited map features by primary key are
attributes of the contribution profile of a user. To find
differences between users who contribute a lot and users who
contribute less, an additional profile attribute representing the
total number of contributed actions (total actions) is added.
The values are normalized into the interval [0,1]. Furthermore
the number of mapping days (active days) is also part of a
contribution profile as it reveals whether a mapper profile is
dependent on the number of active mapping days. These
values are also normalized into the interval [0,1].
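The profile calculation for the context “action types” can be illustrated with a minimal Python sketch. The input format (a list of user, action type and day tuples), the function name build_profiles and the example data are assumptions made for illustration; they are not taken from the original processing chain. The sketch scales the raw totals with a plain min-max normalization, which is only one possible way of mapping values into [0,1].

```python
from collections import defaultdict

def build_profiles(actions):
    """Build per-user profiles from (user, action_type, day) tuples.

    action_type is "create", "update" or "delete" (hypothetical input format).
    """
    counts = defaultdict(lambda: {"create": 0, "update": 0, "delete": 0})
    days = defaultdict(set)
    for user, action_type, day in actions:
        counts[user][action_type] += 1
        days[user].add(day)

    totals = {u: sum(c.values()) for u, c in counts.items()}
    active = {u: len(d) for u, d in days.items()}

    def min_max(values):
        # Map the smallest value to 0 and the largest to 1.
        lo, hi = min(values.values()), max(values.values())
        span = hi - lo
        return {u: (v - lo) / span if span else 0.0 for u, v in values.items()}

    norm_total, norm_days = min_max(totals), min_max(active)

    profiles = {}
    for u, c in counts.items():
        profiles[u] = {
            # Relative shares make users with different volumes comparable.
            "create": c["create"] / totals[u],
            "update": c["update"] / totals[u],
            "delete": c["delete"] / totals[u],
            "total_actions": norm_total[u],
            "active_days": norm_days[u],
        }
    return profiles

# Made-up example data, not taken from the OSM history file.
example = [
    ("alice", "create", "2012-05-01"),
    ("alice", "create", "2012-05-01"),
    ("alice", "update", "2012-05-02"),
    ("alice", "delete", "2012-05-03"),
    ("bob", "update", "2012-06-10"),
    ("bob", "update", "2012-06-10"),
]
profiles = build_profiles(example)
```

Each resulting profile is a vector of five attributes per user; a profile for the context “feature types” could be built the same way by counting actions per primary key instead of per action type.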
1 http://wiki.openstreetmap.org/wiki/Map_Features
3) k-means clustering
The identification of different contribution profiles on the
basis of the previously defined parameters is done with
k-means clustering [4], a commonly used clustering
approach. k-means clustering aims to divide n observations
into k clusters, where each cluster is represented by its mean.
During clustering the algorithm determines these means, also
called centroids, such that the within-cluster sum of squared
Euclidean distances is minimized and each observation
belongs to the cluster with the closest mean. With k-means
clustering the value of k has to be specified in advance. To
determine a proper value for k, clustering was applied
repeatedly using the values 3, 5, 10 and 15 for k.
The quality of a clustering is assessed by how well the cluster
means, and the differences between them, can be interpreted.
The optimal number of clusters identified for the
dataset is presented in the results section.
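The clustering step can be sketched with a small, self-contained implementation of the standard k-means iteration (Lloyd's algorithm). This is an illustrative sketch only: the paper does not state which implementation was used, and the function names, the random initialization and the toy profile vectors below are assumptions.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, iterations=100, seed=0):
    """Cluster points into k groups; returns (centroids, assignment)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the closest mean.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged, no point changed its cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster runs empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment

# Toy profiles as (create share, update share, delete share); made-up values.
toy_profiles = [
    (0.90, 0.05, 0.05),
    (0.85, 0.10, 0.05),
    (0.10, 0.80, 0.10),
    (0.15, 0.80, 0.05),
]
centroids, assignment = k_means(toy_profiles, k=2)
```

On the toy data the two create-dominated profiles end up in one cluster and the two update-dominated profiles in the other. Repeating the call with k set to 3, 5, 10 and 15 on real profile vectors would mirror the procedure described above, with the resulting centroids then inspected for interpretability.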
Figure 1: Average number of actions per user per year
Figure 2: User shares classified per mapping days and
contributed actions in %
4) Interpretation of results
The resulting clusters are named as typical contribution
profiles and described according to the contribution
characteristics.
4
Results
This section gives insights into the results of the applied
profiling approach. Before turning to the contribution profiles, we
present basic figures showing general characteristics of
mapping activity as well as its development over the years.
4.1
Contribution Metrics
The average contribution of an OSM user amounts to 11,314
actions (including all registered users who contributed at least
one action). The majority of actions are “Creates” (8,384 on
average). As shown in Table 2, contributions vary greatly
between single and several thousand actions.
Table 2: Figures showing average user contribution in OSM
(per user)
              Average   Median   Maximum       Standard Deviation
All Actions   11,314    23       185,645,045   550,200
Create        8,384     11       184,879,749   420,300
Update        1,907     5        169,404,872   330,468
Delete        1,023     1        18,276,275    46,638
Figure 1 shows the development of the average user
contribution over the years. In the years 2006 and 2007 the
increased create shares are consequences of imports (e.g. the
U.S. TIGER import). Apart from the import years, the share of
“Creates” by an average user steadily increased until the year
2010 and decreased in the years 2011 and 2012. “Updates”
reached their highest level in the year 2009. As expected,
“Deletes” consistently remain at a rather low level.
As Figure 2 shows, 140,334 (53%) out of 264,531 mappers
contributed data on just one day. With an increasing number
of mapping days the number of contributors decreases. The
users who submitted contributions on more than 100 days and
less than 1,000 days (3% of all users) contributed the largest
share of actions (68% of all actions). Furthermore, about
80% of all actions are performed by no more than 3% of
users.
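A concentration figure of this kind can be computed from per-user action counts as in the following sketch. The function name share_of_top_users and the counts are hypothetical and serve only to show the calculation; they do not reproduce the OSM numbers.

```python
def share_of_top_users(action_counts, top_fraction):
    """Return the share of all actions contributed by the most active users."""
    ranked = sorted(action_counts, reverse=True)
    top_n = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:top_n]) / sum(ranked)

# Made-up per-user action counts for ten users.
counts = [1000, 500, 20, 15, 10, 10, 10, 10, 10, 10]
share = share_of_top_users(counts, 0.2)  # share held by the top 20% of users
```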
4.2
Contribution Profiles
The following subsection describes the results of k-means
clustering for identifying typical contribution profiles. The
repeated k-means clustering on action types, total actions and
active days yields 10 as the optimal number for k. For total
actions and active days we used the decadic logarithm to get a
more balanced distribution in the value range (cf. Figure 1 and
Figure 2), i.e., the logarithm stretches the small value range
and compresses the high value range.
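A short worked example may make the scaling concrete. It assumes, as an interpretation that is not stated explicitly in the text, that the base-10 logarithms are min-max normalized into [0,1], with the filter threshold of 10 actions as the lower bound and the maximum of 185,645,045 actions from Table 2 as the upper bound. Under these assumptions a normalized value of 0.1 maps back to roughly 53 actions, which agrees with the value given in the discussion of Figure 3.

```python
import math

MIN_ACTIONS = 10            # filter threshold from the Data section (assumed lower bound)
MAX_ACTIONS = 185_645_045   # maximum of "All Actions" in Table 2 (assumed upper bound)

def normalize_log(value, lower, upper):
    """Min-max normalize log10(value) into [0, 1]."""
    log_lower, log_upper = math.log10(lower), math.log10(upper)
    return (math.log10(value) - log_lower) / (log_upper - log_lower)

def denormalize_log(scaled, lower, upper):
    """Invert normalize_log: map a value in [0, 1] back to the original scale."""
    log_lower, log_upper = math.log10(lower), math.log10(upper)
    return 10 ** (log_lower + scaled * (log_upper - log_lower))

actions_at_0_1 = denormalize_log(0.1, MIN_ACTIONS, MAX_ACTIONS)
print(round(actions_at_0_1))  # prints 53
```

The example also shows the stretching effect of the logarithm: the first tenth of the normalized axis covers only the range from 10 to about 53 actions, while the remaining nine tenths cover everything up to the maximum.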
Figure 3: k-means clustering (k=10) on create, update and
delete actions (relative values), action count (decadic
logarithmic) and mapping days (decadic logarithmic),
overview of cluster centroids
Figure 3 shows all cluster centroids calculated with the k-means
clustering algorithm. The figure shows that the attributes total
actions (yellow) and active days (red) correlate. This means
that users who contributed a high number of actions also
spread these actions over numerous days. A value of 0.1
log_actions corresponds to 53 actions and a value of 0.1
log_days corresponds to 2 days. Furthermore, very active users
mainly perform create and update actions. Those who
mainly perform update or delete actions contributed just a few
actions. Table 3 summarizes and interprets the results. The
classification of “Creators”, “Updaters” or “Deleters” is based
on the predominant action type of these mappers. All-Rounders
contribute all action types. The labels “Basic” and
“Premium” are derived from the total amount of contributed
actions and active mapping days. Thus, “Premium Mappers”
are mappers who contribute more than others.
Table 3: Typical contribution profiles based on action types
Cluster number | Description of contribution profile | Profile acronym
1 | Users with a high share of create actions, but a low action and mapping days count. | Basic Creator
7 | Users with a high share of create actions and a medium action and mapping days count. | Creator
4 | Users with a high share of create actions and a high action and mapping days count. | Premium Creator
2 | Users with a high share of create and update actions and a low action and mapping days count. | Creator-Updater
8 | Users with a high share of update actions and a low action and mapping days count. | Updater
9 | Users with a high share of update and a lower share of create actions. These mappers are not very active. | Updater-Creator
5 | Users with a high share of create and delete actions. These mappers are not very active. | Creator-Deleter
3 | Users with a high share of delete actions. These mappers are not very active. | Deleter
0 | Users with a high share of create actions, but also notable shares of update and delete actions. “Basic All-Rounders” are not very active mappers. | Basic All-Rounder
6 | Users with a high share of create, update and delete actions. “Premium All-Rounders” are very active mappers. | Premium All-Rounder
Figure 4 illustrates the number of mappers within the clusters.
Large user groups fall into clusters 1, 2 and 7. The “Creators”
are the ones who contribute most actions (clusters 1 and 7),
followed by the “Creator-Updaters” (cluster 2) and the
“Updater-Creators” (cluster 9).
Figure 4: Number of mappers in clusters 0 – 9
For the second cluster analysis we took the edited primary
keys as parameter. Figure 5 shows the shares of edited
primary keys over the years 2005-2013. The 10 primary keys
with the greatest number of actions are selected. All other
primary keys are summarized in the category “Other”. In the
early days of the OSM project mostly highways were mapped.
From 2007 to 2010 the share of highway-related actions
decreased whereas building-related actions increased. Since
2010 the share of highway-related actions has increased slightly
whereas building actions have decreased. The distribution of other
primary keys does not change notably between the years
2005 and 2013. The primary key “natural” had a high activity
in 2006 due to coastline imports, but then decreased again and
has stagnated at nearly the same level.
Figure 5: Action shares per primary keys per year (2005 –
2013)
Primary key clustering was also performed with k-means
clustering using several values of k (3, 5, 10 and 15). The most
reasonable result was again the clustering with k=10. Figure 6
shows the cluster centroids resulting from k-means clustering
on primary keys. Figure 7 shows the number of
mappers in the different clusters.
Figure 6: k-means clustering (k=10) on primary keys,
overview of cluster centroids
By interpreting the cluster centroids (cf. Figure 6) we
distinguish 8 different mapper types. Table 4 summarizes
these types and gives them meaningful acronyms. In some
cases the transition between the different types is gradual.
Table 4: Typical mapper types based on primary keys
Cluster number | Description of contribution profile | Profile acronym
1, 9 | This user mainly maps highways. Users in cluster 9 also map some amenities. | Highway Mapper
5 | This user mainly maps highways, but also contributes buildings. | Highway - Building Mapper
0 | This user mainly maps buildings. | Building Mapper
7 | This user mainly maps buildings, but also contributes highways. | Building - Highway Mapper
8 | This user mainly maps amenities. | Amenity Mapper
2 | This user maps amenities and highways in an equal share. | Amenity - Highway Mapper
6 | This user mainly maps places, but also contributes to highways. | Place - Highway Mapper
3 | This user mainly maps power-related features, but also contributes to highways. | Power - Highway Mapper
4 | All-Rounders are not focused on one feature type but contribute many different types. | All-Round Mapper
By far the largest number of users falls into cluster 1 and
cluster 9 (cf. Figure 7). These are the clusters where mappers
mainly tag highways (the “Highway Mapper”).
Figure 7: Number of mappers in clusters 0 – 9
5
Conclusions and Outlook
In this work a data-centric approach for analysing mapper
contribution in VGI projects such as OpenStreetMap was
proposed. The approach is based on a k-means clustering of
contribution profiles considering parameters for: (1) action
types, (2) feature types, (3) total number of contributed
actions and (4) total number of contribution days.
With regard to the parameter “action type” we identified ten
typical contribution profiles. “Creators” mainly contribute
new features. A great number of users fall into the “Creators”
cluster. “Premium Creators” are those contributors responsible
for high volume contributions. These contributors are also
active on a higher number of days compared to normal
contributors. Besides “Creators” there is also a notable group
of “Premium All-Rounders”, who are very active and
contribute with every action type. From the feature type
analysis we derived eight different contribution profiles. By
far the largest number of users falls into the mapper type
“Highway Mapper”. Additionally we discovered “Building
Mappers” and “Amenity Mappers”. Again we found
“All-Rounders” (mappers not focused on specific feature types)
and five clusters of “Dual Feature Mappers” (mappers mainly
contributing two categories).
From the results we conclude that k-means clustering with
centroids is useful for identifying and describing
different mapper types. It has been demonstrated that the
relative values for action and feature types combined with the
overall contribution volume are suitable parameters for
profiling. It is worth mentioning that most of the clusters are
rather distinct whereas some show a gradual transition.
As a next step the method can be repeated with different
parameter sets to answer specific research questions.
Additionally it would be worth taking a closer look at
relationships between the two profiling contexts
“action types” and “feature types”. Further research could
examine relationships between contribution profiles and
the quality of contributions.
References
[1] Budhathoki, N. R., & Haythornthwaite, C. (2012).
Motivation for Open Collaboration: Crowd and
Community Models and The Case of OpenStreetMap.
American Behavioral Scientist, 28.
[2] M. F. Goodchild, (2007). Citizens as Sensors: the world
of volunteered geography. Geojournal 69:211-221.
[3] K. Kuutti (1996), Activity theory as a potential
framework for human computer interaction research. In
B. A. Nardi (Ed.), Context and consciousness: Activity
theory and human-computer interaction (pp. 17-44).
Cambridge, MA: The MIT Press.
[4] J. A. Hartigan (1975). Clustering Algorithms (Probability
and Mathematical Statistics). John Wiley & Sons Inc.
[5] J. Liu and S. Ram. Who does what: Collaboration
patterns in the Wikipedia and their impact on article
quality. ACM Trans. Manage. Inf. Syst., 2(2):11:1-11:23.
[6] P. Mooney, & P. Corcoran (2012). Characteristics of
Heavily Edited Objects in OpenStreetMap. Future
Internet, 4(1), 285-305.
[7] P. Neis & A. Zipf (2012). Analyzing the Contributor
Activity of a Volunteered Geographic Information
Project — The Case of OpenStreetMap. ISPRS
International Journal of Geo-Information, 1(2), 146-165.
Molecular Diversity Preservation International. Retrieved
from http://www.mdpi.com/2220-9964/1/2/146/htm
[8] F. Ramm and J. Topf (2010), OpenStreetMap, 3rd ed.,
Berlin: lehmanns media.
[9] K. Rehrl, S. Gröchenig, H. Hochmair, S. Leitinger, R.
Steinmann & A. Wagner (2012). A conceptual model for
analyzing contribution patterns in the context of VGI. In:
LBS 2012 – 9th Symposium on Location Based Services.
Berlin: Springer.
[10] West et al (2012). Drawing a Data-Driven Portrait of
Wikipedia Editors. WikiSym conference 2012, Linz,
Austria.
[11] Wikipedia editors study (2011): Results from the editor
survey, cf.:
http://commons.wikimedia.org/wiki/File:Editor_Survey_
Report_-_April_2011.pdf, 2012/10/04