You Are Where You Edit: Locating Wikipedia Contributors Through Edit Histories
Michael D. Lieberman, codepoet@cs.umd.edu
Jimmy Lin, jimmylin@umd.edu
Department of Computer Science
University of Maryland
College Park, MD 20742 USA

Geographic Data Mining
• Geography has an increasingly prevalent role in public, communal, and collaborative Web projects
  – Manual contributions (Wikipedia, Flickr, …)
  – Automated annotation (geocoding, geotagging, …)
• Spatial and geographic mining methods are increasingly relevant as annotation standards and automated metadata extraction mature
• Allows geographically informed content retrieval, filtering, ranking, community identification, …
  – Low dimensional space!
• Potential for invasion of privacy

You Are Where You Edit — Michael D. Lieberman, codepoet@cs.umd.edu ICWSM 2009, San Jose, CA

Mining Wikipedia
• Contributors tend to add what they know and self-organize into groups based on interest
• Want to see whether contributors can be further categorized based on their edits to geographic pages
  – Pages that correspond to a physical location in the real world with lat/lon coordinates
  – Termed geopages
• Identify Wikipedia contributors who:
  – Edit geopages in a constrained geographic area
  – Mostly edit one or two “pet” geopages
• Identify reasons for the above patterns

Wikipedia Data
• Complete Wikipedia content is freely downloadable in several XML formats
  1. Current and previous versions of pages
  2. Images, media
  3. Edit histories
  4. Contributor metadata and user pages
• Only the English Wikipedia dump was included in our analyses
  – Extensible to other languages

Wikipedia Content
• Page content is written in an evolving Wiki markup language
• Content consists of freeform text and structured data
  – HTML and XML
  – Infoboxes, templates, categories, images, …
• Geopage content contains a parameterized geographic coordinate template

Geopage Example
{{Infobox Settlement
…
|latd = 37 |latm = 18 |lats = 15 |latNS = N
|longd = 121 |longm = 52 |longs = 22 |longEW = W
…
}}

Identifying Geopages
• Must process page Wiki markup to identify geographic templates and extract coordinates
  1. Wiki markup language continually evolves
  2. Geographic templates continually evolve
  3. Over 20 distinct template forms at this time for different coordinate systems and feature types
• Shortcut: DBpedia
  – Public ontology derived from Wikipedia, including extracted geographic coordinates
  – Amounts to a primitive gazetteer of geographic entities in Wikipedia

Wikipedia Geo Coverage
[Figure: Wikipedia geo coverage]

Basic Observations
• Vast majority of geopages are tagged to the US and Europe
• Possibly reflects the geographic distribution of contributors to the English Wikipedia

Features With Extent
• All geopages are tagged with a single lat/lon point
  – Tradeoff between simplicity and accuracy
  – Examples: country or state → center or capital city; road → midpoint; river → source
• Want to distinguish these features, as the tagged point may be geographically distant from a contributor's other edits
• In Wikipedia, more precise coordinates generally indicate smaller extent
  – California: (37, -120)
  – San Jose, CA: (37.304, -121.873)

Wikipedia Edit Histories
• Easily parsed XML format
• Information saved for each edit:
  1. Username (or IP address, if anonymous)
  2. Timestamp
  3. Whether the edit is “minor” (spelling, formatting)
• Excluded anonymous edits
  – Anonymous edits are not allowed to be marked minor, to avoid abuse
  – Most Wikipedia vandalism is perpetrated anonymously
• Also excluded minor edits
  – Geopages tend to have mostly non-minor edits

Basic Observations
• A considerable number of pages (~330k) are tagged with geographic coordinates
• Named contributors are outnumbered by anonymous ones by about 5 to 1, but are responsible for 2–3 times as many geopage edits
• A nontrivial share of named contributors (14.6%) have made at least one non-minor edit to a geopage
• Most edits to geopages (58.7%) are non-minor edits

Sample Edit Patterns
[Figure: sample edit patterns]

Mining Contributor Locales
• Intuitively, want to find contributors with:
  1. Large number of edits to geopages
  2. Geopage edits constrained to a small area
• Select contributors with:
  1. At least K edited geopages
  2. Area α of the convex hull of edited geopage coordinates smaller than A (α is termed the edit area)
• We used K = 3 and A = 1 deg² ≈ 70 × 70 mi

Accounting For Outliers
• Local edit patterns may be muddled by “outlier” edits
• For each contributor, select the fraction F of edited geopages with the smallest convex hull area
• Simple approximation scheme:
  1. For each geopage P:
     a. Sort edited geopages by distance from P
     b. Compute convex hull H_P of the first F geopages
  2. Select the H_P with the smallest area α
• Example: 71 deg² → 10 deg² (5k × 5k mi → 700 × 700 mi)

Contributor Locality
• Computed minimum edit area sizes for F = {95%, 80%}, both (a) with and (b) without features with extent
• 30–35% of contributors have edit areas smaller than 1 deg²
• Over 50% of contributors with fewer than 5 geopage edits are highly local

Pet Geopages
• Want to identify contributors with a preference for editing particular geopages, termed “pet” geopages
  – Contributors with a large number of edits to a small number of geopages
• For each contributor:
  – Determine the edit frequencies of the most-edited and second-most-edited geopages, F1 and F2
  – Select the contributor if F1 or F1+F2 clears a frequency threshold Fmin
• We used Fmin = 0.80

Pet Geopages
• Statistics for users with:
  a) 5–20 edits (~93k)
  b) over 20 edits (~28k)
• Over 50% of contributors with 5–20 edits, and 25% of contributors with over 20 edits, have over 80% of their geopage edits confined to two geopages

Reasons for Tight Edit Areas
• Randomly selected 100 contributors with at least 10 edits to geopages and small edit areas
• Examined each contributor's user page together with the set of edited geopages to determine an interest
• Contributors with small edit areas tend to have been born in, or to be living in, the region defined by their edit areas

Future Work
• Using alternate measures to determine the significance of geopage edits, such as:
  – Page size before and after edit
  – Whether the edit was undone by another editor
• Characterizing contributors’ geographic interests from supposedly minor edits
• Tracking evolving geographic interests over time
• Mining other geographical data sources, such as:
  – Flickr
  – Twitter
• Finding similar contributors based on geographic interest

Summary
• A sizable number of Wikipedia contributors exhibit constrained geographic focus
• Mined geographic focus can be applied to enhance user experience in collaborative, Web-based systems
• Users should be aware that information about them and their interests can be gleaned easily and perhaps unexpectedly
  – Dangerous if combined with other online databases

Thanks!
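
Supplementary sketch: reading the Geopage Example
The Geopage Example slide gives coordinates as degrees, minutes, seconds, and a hemisphere letter. A minimal sketch of turning those parameters into signed decimal degrees follows. It assumes the template parameters have already been parsed into a dictionary; the function names and the dictionary layout are hypothetical and are not taken from the talk, which used DBpedia's extracted coordinates instead.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds plus a hemisphere letter to signed decimal degrees."""
    value = degrees + minutes / 60.0 + seconds / 3600.0
    # South and West are negative under the usual sign convention.
    return -value if hemisphere.upper() in ("S", "W") else value


def infobox_to_latlon(params):
    """params: hypothetical dict of already-parsed template parameters (strings)."""
    lat = dms_to_decimal(float(params["latd"]), float(params.get("latm", 0)),
                         float(params.get("lats", 0)), params["latNS"])
    lon = dms_to_decimal(float(params["longd"]), float(params.get("longm", 0)),
                         float(params.get("longs", 0)), params["longEW"])
    return lat, lon


# Parameter values copied from the Geopage Example slide.
san_jose = {"latd": "37", "latm": "18", "lats": "15", "latNS": "N",
            "longd": "121", "longm": "52", "longs": "22", "longEW": "W"}
lat, lon = infobox_to_latlon(san_jose)
print(round(lat, 3), round(lon, 3))  # 37.304 -121.873
```

The rounded result agrees with the San Jose, CA point quoted on the Features With Extent slide.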
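
Supplementary sketch: edit area with outlier trimming
The selection rule on the Mining Contributor Locales slide and the approximation scheme on the Accounting For Outliers slide can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: coordinates are treated as planar (lon, lat) pairs so that area comes out in deg², distance from P is plain Euclidean distance in degrees, the number of geopages kept is F times the total rounded up, and all function names are hypothetical.

```python
import math


def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def hull_area(points):
    """Area of the convex hull (shoelace formula), in squared coordinate units."""
    hull = convex_hull(points)
    if len(hull) < 3:
        return 0.0
    n = len(hull)
    twice = sum(hull[i][0] * hull[(i + 1) % n][1] - hull[(i + 1) % n][0] * hull[i][1]
                for i in range(n))
    return abs(twice) / 2.0


def min_edit_area(points, fraction):
    """For each geopage P, hull the nearest `fraction` of geopages; keep the smallest area."""
    # Assumption: round the kept count up; the small epsilon guards against float error.
    keep = max(1, math.ceil(fraction * len(points) - 1e-9))
    best = None
    for p in points:
        nearest = sorted(points, key=lambda q: math.dist(p, q))[:keep]
        area = hull_area(nearest)
        if best is None or area < best:
            best = area
    return best


def is_local(points, k=3, max_area=1.0, fraction=1.0):
    """Selection rule from the slides: at least K edited geopages, edit area below A."""
    return len(set(points)) >= k and min_edit_area(points, fraction) < max_area


# Four nearby geopages plus one distant outlier, as (lon, lat) in degrees.
edits = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (50.0, 50.0)]
full_area = hull_area(edits)                 # hull of every edited geopage
trimmed_area = min_edit_area(edits, 0.80)    # hull after dropping the farthest 20%
```

With all five points the contributor fails the A = 1 deg² test, while trimming to F = 80% discards the distant point and leaves a small edit area. Treating degrees as planar units ignores the shrinking of longitude degrees away from the equator, so the areas are only comparable in the rough sense the deg² threshold implies.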
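
Supplementary sketch: pet geopage selection
The rule on the first Pet Geopages slide (select a contributor when F1 or F1+F2 clears Fmin) can be sketched as follows. The input layout (one page title per edit) and the function name are hypothetical, and reading "clears" as "greater than or equal to" is an assumption.

```python
from collections import Counter


def has_pet_geopages(edited_pages, f_min=0.80):
    """edited_pages: list of geopage titles, one entry per edit by a single contributor."""
    if not edited_pages:
        return False
    total = len(edited_pages)
    top_counts = [count for _, count in Counter(edited_pages).most_common(2)]
    f1 = top_counts[0] / total
    f2 = top_counts[1] / total if len(top_counts) > 1 else 0.0
    # Mirrors the slide's wording; the first test is implied by the second.
    return f1 >= f_min or f1 + f2 >= f_min


# 7 + 2 of 10 edits fall on two pages: F1 = 0.7, F2 = 0.2, so F1 + F2 clears 0.80.
focused = ["San Jose"] * 7 + ["Santa Clara"] * 2 + ["Fresno"]
# Ten edits spread over ten different pages: F1 + F2 = 0.2.
scattered = [f"Page {i}" for i in range(10)]

focused_is_pet = has_pet_geopages(focused)
scattered_is_pet = has_pet_geopages(scattered)
```

The page titles in the example are made up for illustration and are not data from the study.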