You Are Where You Edit

You Are Where You Edit:
Locating Wikipedia Contributors
Through Edit Histories
Michael D. Lieberman
codepoet@cs.umd.edu
Jimmy Lin
jimmylin@umd.edu
Department of Computer Science
University of Maryland
College Park, MD 20742 USA
Geographic Data Mining
• Geography has an increasingly prevalent role in public,
communal, and collaborative Web projects
– Manual contributions (Wikipedia, Flickr, …)
– Automated annotation (geocoding, geotagging, …)
• Spatial and geographic mining methods increasingly
relevant as annotation standards and automated
metadata extraction mature
• Allows geographically-informed content retrieval,
filtering, ranking, community identification, …
– Low dimensional space!
• Potential for invasion of privacy
You Are Where You Edit — Michael D. Lieberman, codepoet@cs.umd.edu
ICWSM 2009, San Jose, CA
Mining Wikipedia
• Contributors tend to add what they know and self-organize into groups based on interest
• Want to see whether contributors can be further
categorized based on their edits to geographic pages
– Pages that correspond to a physical location in the
real world with lat/lon coordinates
– Termed geopages
• Identify Wikipedia contributors who:
– Edit geopages in a constrained geographic area
– Mostly edit one or two “pet” geopages
• Identify reasons for the above patterns
Wikipedia Data
• Complete Wikipedia content freely downloadable in
several XML formats
1. Current and previous versions of pages
2. Images, media
3. Edit histories
4. Contributor metadata and user pages
• Only included English Wikipedia dump in our analyses
– Extensible to other languages
Wikipedia Content
• Page content written in evolving Wiki markup language
• Content consists of freeform text and structured data
– HTML and XML
– Infoboxes, templates, categories, images, …
• Geopage content contains a parameterized geographic
coordinate template
Geopage Example
{{Infobox Settlement
…
|latd = 37 |latm = 18 |lats = 15 |latNS = N
|longd = 121 |longm = 52 |longs = 22 |longEW = W
…
}}
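The degree/minute/second fields in such a template can be converted to decimal degrees. The following is a minimal sketch, assuming the field names shown above (latd, latm, lats, latNS, longd, longm, longs, longEW); the function name parse_infobox_coords is hypothetical, and real templates come in many other forms that this does not handle.

```python
import re

def parse_infobox_coords(markup):
    """Return (lat, lon) in decimal degrees, or None if a degree field is missing."""
    # Collect every "|key = value" pair; several pairs may share one line.
    pairs = re.findall(r"\|\s*(\w+)\s*=\s*([^|\n}]*)", markup)
    fields = {key: value.strip() for key, value in pairs}

    def to_decimal(prefix, hemisphere_key, negative_letter):
        if not fields.get(prefix + "d"):
            return None
        value = (float(fields[prefix + "d"])
                 + float(fields.get(prefix + "m") or 0) / 60
                 + float(fields.get(prefix + "s") or 0) / 3600)
        # South latitudes and west longitudes are negative by convention.
        if fields.get(hemisphere_key, "").upper() == negative_letter:
            value = -value
        return value

    lat = to_decimal("lat", "latNS", "S")
    lon = to_decimal("long", "longEW", "W")
    if lat is None or lon is None:
        return None
    return lat, lon

example = """{{Infobox Settlement
|latd = 37 |latm = 18 |lats = 15 |latNS = N
|longd = 121 |longm = 52 |longs = 22 |longEW = W
}}"""

lat, lon = parse_infobox_coords(example)
print(round(lat, 3), round(lon, 3))  # 37.304 -121.873
```

The rounded result agrees with the San Jose, CA coordinates quoted later in the deck.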
Identifying Geopages
• Must process page Wiki markup to identify geographic
templates and extract coordinates
1. Wiki markup language continually evolves
2. Geographic templates continually evolve
3. Over 20 distinct template forms at this time for
different coordinate systems and feature types
• Shortcut: DBpedia
– Public ontology derived from Wikipedia, including
extracted geographic coordinates
– Amounts to a primitive gazetteer of geographic
entities in Wikipedia
Wikipedia Geo Coverage
Basic Observations
• Vast majority of geopages
tagged to the US and
Europe
• Possibly reflects the
geographic distribution of
contributors to the English
Wikipedia
Features With Extent
• All geopages are tagged with a single lat/lon point
– Tradeoff between simplicity and accuracy
– Examples: Country or state → Center or capital city,
Road → Midpoint, River → Source
• Want to distinguish these features, as tagged point may
be geographically distant from other contributor edits
• In Wikipedia, more precise coordinates generally
indicate smaller extent
– California: (37, -120)
– San Jose, CA: (37.304, -121.873)
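One way to act on this observation is to count the decimal places in the coordinate text. This is only a sketch: the function names and the min_places threshold are assumptions, not values from the talk, and the coordinates must be kept as strings because a float would lose trailing zeros.

```python
def decimal_places(coord_text):
    """Number of digits after the decimal point in a coordinate string."""
    _, _, fraction = coord_text.strip().partition(".")
    return len(fraction)

def likely_has_extent(lat_text, lon_text, min_places=1):
    """Guess that a feature has extent when both coordinates are coarse.

    min_places is a hypothetical cutoff chosen for illustration only.
    """
    precision = max(decimal_places(lat_text), decimal_places(lon_text))
    return precision < min_places

print(likely_has_extent("37", "-120"))          # California
print(likely_has_extent("37.304", "-121.873"))  # San Jose, CA
```

With the two examples above, California is flagged as a feature with extent and San Jose is not.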
Wikipedia Edit Histories
• Easily-parsed XML format
• Information saved for each edit:
1. Username (or IP address, if anonymous)
2. Timestamp
3. Whether edit is “minor” (spelling, formatting)
• Excluded anonymous edits
– Not allowed to be marked minor, to avoid abuse
– Most Wikipedia vandalism perpetrated anonymously
• Also excluded minor edits
– Geopages tend to have mostly non-minor edits
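The filtering described above can be sketched with Python's standard XML parser. The sample below is made-up data in the general shape of a MediaWiki page export (username or ip under contributor, an empty minor element for minor edits). Real dumps put these elements in an XML namespace and are far too large to load in one string, so treat this as an illustration rather than the authors' pipeline. The function name named_major_edits is hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up sample: one named edit, one anonymous edit, one minor edit.
SAMPLE = """<page>
  <title>San Jose, California</title>
  <revision>
    <timestamp>2008-01-01T00:00:00Z</timestamp>
    <contributor><username>ExampleUser</username><id>1</id></contributor>
  </revision>
  <revision>
    <timestamp>2008-01-02T00:00:00Z</timestamp>
    <contributor><ip>192.0.2.1</ip></contributor>
  </revision>
  <revision>
    <timestamp>2008-01-03T00:00:00Z</timestamp>
    <contributor><username>ExampleUser</username><id>1</id></contributor>
    <minor/>
  </revision>
</page>"""

def named_major_edits(page_xml):
    """Return (username, page title, timestamp) for named, non-minor edits."""
    page = ET.fromstring(page_xml)
    title = page.findtext("title")
    edits = []
    for revision in page.iter("revision"):
        username = revision.findtext("contributor/username")
        if username is None:
            continue  # anonymous edit, identified only by IP address
        if revision.find("minor") is not None:
            continue  # edit flagged as minor
        edits.append((username, title, revision.findtext("timestamp")))
    return edits

print(named_major_edits(SAMPLE))
```

Only the first revision survives both filters in this sample.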
Basic Observations
• A considerable number of
pages (~330k) are tagged
with geographic
coordinates
• Named contributors are outnumbered by anonymous
ones by about 5 to 1, but are responsible for 2–3 times
as many geopage edits
• A nontrivial number of named contributors have made at
least one non-minor edit to a geopage (14.6%)
• Most edits to geopages are non-minor edits (58.7%)
Sample Edit Patterns
Mining Contributor Locales
• Intuitively, want to find contributors with:
1. Large number of edits to geopages
2. Geopage edits constrained to a small area
• Select contributors with:
1. At least K edited geopages
2. Area α of convex hull of edited geopage coordinates
smaller than A (α is termed the edit area)
• We used K = 3 and A = 1 deg² ≈ 70 x 70 mi
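The selection rule above can be sketched in plain Python. The assumptions here are that coords holds one (lon, lat) pair per distinct edited geopage and that degrees are treated as flat planar units, which matches an area expressed in deg² but ignores the Earth's curvature. The function names are hypothetical; the hull uses the standard monotone chain method and the area uses the shoelace formula.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Monotone chain convex hull, vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def polygon_area(polygon):
    """Shoelace formula; returns 0 for fewer than three vertices."""
    total = 0.0
    for i in range(len(polygon)):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % len(polygon)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2

def edit_area(coords):
    """Area (in square degrees) of the convex hull of the coordinates."""
    return polygon_area(convex_hull(coords))

def is_local(coords, k=3, max_area=1.0):
    """True if there are at least k geopages and the edit area is below max_area."""
    return len(coords) >= k and edit_area(coords) < max_area

print(is_local([(0, 0), (1, 0), (0, 1)]))          # area 0.5 deg²
print(is_local([(0, 0), (2, 0), (0, 2), (2, 2)]))  # area 4 deg²
```

The defaults k=3 and max_area=1.0 mirror the K and A values quoted on the slide.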
Accounting For Outliers
• Local edit patterns may be muddled by “outlier” edits
• For each contributor, select a fraction F of edited geopages
with smallest convex hull area
• Simple approximation scheme:
1. For each geopage P:
a. Sort edited geopages by distance from P
b. Compute convex hull H_P of first F geopages
2. Select H_P with smallest area α
• Example: 71 deg² → 10 deg²
(5k x 5k mi → 700 x 700 mi)
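The approximation scheme above can be sketched as follows. Several details are assumptions rather than statements from the talk: coordinates are planar (lon, lat) pairs, distance is plain Euclidean distance in degrees, and the fraction F is rounded up to a whole number of geopages. The function names are hypothetical.

```python
import math

def hull_area(points):
    """Area of the convex hull (monotone chain plus shoelace formula)."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return 0.0

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half_hull(sequence):
        chain = []
        for p in sequence:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    hull = half_hull(pts) + half_hull(reversed(pts))
    total = 0.0
    for i in range(len(hull)):
        x1, y1 = hull[i]
        x2, y2 = hull[(i + 1) % len(hull)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2

def min_edit_area(coords, fraction=0.95):
    """Smallest hull area over the nearest `fraction` of geopages to each P."""
    # Round up, with a small tolerance against floating-point noise.
    keep = max(1, math.ceil(fraction * len(coords) - 1e-9))
    best = None
    for p in coords:
        by_distance = sorted(
            coords, key=lambda q: math.hypot(p[0] - q[0], p[1] - q[1]))
        area = hull_area(by_distance[:keep])
        if best is None or area < best:
            best = area
    return best

# Four nearby geopages plus one distant "outlier" edit (made-up coordinates).
coords = [(0.0, 0.0), (0.1, 0.0), (0.1, 0.1), (0.0, 0.1), (50.0, 50.0)]
print(hull_area(coords), min_edit_area(coords, fraction=0.8))
```

Each candidate hull requires a sort, so this sketch costs roughly O(n² log n) per contributor, which is acceptable for the small per-contributor sets involved.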
Contributor Locality
• Computed minimum edit area sizes for F = {95%, 80%}, both
(a) with and (b) without features with extent
• 30–35% of contributors have edit areas smaller than 1 deg²
• Over 50% of contributors with fewer than 5 geopage
edits are highly local
Pet Geopages
• Want to identify contributors with preference for editing
particular geopages — “pet” geopages
– Contributors with a large number of edits to a small
number of geopages
• For each contributor:
– Determine the frequencies of the first- and second-most-edited geopages, F1 and F2
– Select the contributor if F1 or F1+F2 clears a frequency
threshold Fmin
• We used Fmin = 0.80
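The pet geopage test can be sketched with a frequency count. It assumes each contributor's edits are available as a list of geopage titles, one entry per edit, and that "clears" means greater than or equal to the threshold. Both are assumptions, and the function name pet_geopage_stats is hypothetical.

```python
from collections import Counter

def pet_geopage_stats(edited_titles, f_min=0.80):
    """Return (F1, F2, selected) for one contributor's geopage edits."""
    counts = Counter(edited_titles)
    total = sum(counts.values())
    if total == 0:
        return 0.0, 0.0, False
    top = [count for _, count in counts.most_common(2)]
    f1 = top[0] / total
    f2 = top[1] / total if len(top) > 1 else 0.0
    # Select when one page alone, or the top two together, reach f_min.
    selected = f1 >= f_min or f1 + f2 >= f_min
    return f1, f2, selected

# Made-up edit lists: one focused contributor, one spread-out contributor.
focused = ["Page A"] * 8 + ["Page B", "Page C"]
spread = ["Page A", "Page B", "Page C", "Page D", "Page E"]
print(pet_geopage_stats(focused))
print(pet_geopage_stats(spread))
```

Since F1 + F2 is never smaller than F1, the second condition alone would suffice; both are kept to mirror the wording of the slide.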
Pet Geopages
• Statistics for users with:
a) 5–20 edits (~93k)
b) over 20 edits (~28k)
• Over 50% of contributors with 5–20 edits, and 25%
of contributors with over 20 edits, have over 80%
of geopage edits confined to two geopages
Reasons for Tight Edit Areas
• Randomly selected 100
contributors with at least
10 edits to geopages and
small edit areas
• Concurrently examined contributors’ user pages and the
set of edited geopages to determine an interest
• Contributors with small edit areas tend to have been born in, or
to be living in, the region defined by their edit areas
Future Work
• Using alternate measures to determine the significance
of geopage edits, such as:
– Page size before and after edit
– Whether edit was undone by another editor
• Characterizing contributors’ geographic interests from
supposedly minor edits
• Tracking evolving geographic interests over time
• Mining other geographical data sources, such as:
– Flickr
– Twitter
• Finding similar contributors based on geographic interest
Summary
• A sizable number of Wikipedia contributors exhibit
constrained geographic focus
• Mined geographic focus can be applied to enhance user
experience in collaborative, Web-based systems
• Users should be aware that information about them and
their interests can be gleaned easily and perhaps
unexpectedly
– Dangerous if combined with other online databases
Thanks!