Cultural differences and networks of topics across Wikipedia language editions
Shahar Ronen
Jay Baxter
César A. Hidalgo
MIT Media Lab
Language and Network Science
Copenhagen, Denmark
June 3, 2013
Our goal
•  Understand the appeal of notable people to different cultures
–  E.g., Obama in the US vs. Italy
•  Use this information to identify cultural characteristics
Wikipedia as a proxy of culture
•  Comprehensive
–  >270 language editions
–  Hundreds of thousands of topics in many editions
•  (Relatively) representative
–  Collaboratively authored, and therefore likely to reflect the interests, knowledge and views of a large community
Most frequently edited biography articles
English           Italian                   Portuguese
Jesus             Marco Mengoni             Avril Lavigne
Barack Obama      Lady Gaga                 Lady Gaga
Michael Jackson   Noemi (cantante)          Britney Spears
Britney Spears    John Cena                 Anahí
Adolf Hitler      Madonna (cantante)        Neymar
Roger Federer     Personaggi di One Piece   Ashley Tisdale
Lady Gaga         Laura Pausini             Lucas Rodrigues Moura da Silva
Sarah Palin       Lorenzo Insigne           Cher
Vijay (actor)     Silvio Berlusconi         Rihanna
Julian Assange    Michael Jackson           Beyoncé Knowles
Wikipedia as a proxy of culture, cont’d
•  Accessible
–  Standard API and format
Data and methods
•  List of people on Wikipedia from dbpedia.org
–  English: 500k articles
–  Italian: 220k
–  Portuguese: 60k
•  From Wikipedia:
–  Metadata: # editors, # revisions
–  Text: removed markup, >2kb only
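
The metadata and size filter listed above could be computed along these lines. This is a minimal sketch, not the authors' pipeline: the `Revision` record, the function names, and the reading of ">2kb" as more than 2048 bytes of UTF-8 text are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Revision:
    """One edit to an article (hypothetical record, not the Wikipedia API schema)."""
    editor: str


def summarize_revisions(revisions):
    """Return (# distinct editors, # revisions) for one article."""
    editors = {rev.editor for rev in revisions}
    return len(editors), len(revisions)


# Assumption: ">2kb" means more than 2048 bytes of UTF-8 text after markup removal.
MIN_TEXT_BYTES = 2 * 1024


def is_long_enough(plain_text, min_bytes=MIN_TEXT_BYTES):
    """Keep only articles whose markup-free text exceeds the size threshold."""
    return len(plain_text.encode("utf-8")) > min_bytes
```

Measuring size in bytes rather than characters is a design choice here; for editions with many non-ASCII characters the two measures differ, so the threshold would need to be applied consistently across languages.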
Topic identification
•  Preprocessing:
–  Lemmatization (NLTK, FreeLing)
–  Removed numbers, short words, stop words
•  Modeled topics using Latent Dirichlet Allocation, and labeled them manually
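
The filtering part of the preprocessing step can be sketched as follows. Lemmatization (done with NLTK and FreeLing in the slides) is left out, and the stop-word list and the minimum word length of 3 are placeholder assumptions, since the slides do not give the actual values.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real run would use a full
# list for each language edition.
STOP_WORDS = {"the", "and", "was", "for", "with", "that", "his", "her"}

# Assumption: the slides do not say what counts as a "short" word.
MIN_WORD_LENGTH = 3


def preprocess(text, stop_words=STOP_WORDS, min_length=MIN_WORD_LENGTH):
    """Lowercase, tokenize, and drop numbers, short words, and stop words."""
    # Runs of letters and digits; Unicode-aware, so "Anahí" stays one token.
    tokens = re.findall(r"[^\W_]+", text.lower())
    return [
        tok for tok in tokens
        if not any(ch.isdigit() for ch in tok)
        and len(tok) >= min_length
        and tok not in stop_words
    ]
```

The resulting token lists would then be the input to the topic model.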
force war army men soldier troop military battle attack led general sent killed…
company business million financial sold money year owner market industry firm sale interest fund…
band album guitar rock released played also recorded group recording member tour solo musician guitarist…
year two first one three four five time second six later seven month made third eight following day ten last nine next…
Topic identification, cont’d
Art (architecture, painting, photography, etc.)
Business
Education
Exploration
Humanities
Law and crime
Literature
Media (TV, film, theater, gossip)
Music (all genres)
Personal
Politics and government
Religion
Royalty
Science and technology
Sports
Warfare and military
LEARNING ABOUT PEOPLE
Barack Obama
Lionel Messi
David Beckham
O. J. Simpson
LEARNING ABOUT THE AUTHORING COMMUNITY
Carla Bruni
Different cultural interests
•  What if we checked ALL biographies in a language edition?
•  No large-scale research yet
–  Comparison of biographies of 30 Poles and 30 Americans in the English and Polish Wikipedias (Callahan and Herring, 2011)
Classify into categories
English           Category
Jesus             Religion
Barack Obama      Politics
Michael Jackson   Music
Britney Spears    Music
Adolf Hitler      Politics
Roger Federer     Sports
Lady Gaga         Music
Sarah Palin       Politics
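
One way to arrive at a single category per biography is to sum the topic weights of each article by their manually assigned category and keep the heaviest. The slides do not spell out the classification rule, so the sketch below is an assumption; the function name, the topic ids, and the weights in the usage note are made up for illustration.

```python
from collections import defaultdict


def dominant_category(topic_weights, topic_labels):
    """Sum topic weights per hand-labeled category and return the heaviest one.

    topic_weights: {topic_id: weight} for one biography (e.g. from a topic model)
    topic_labels:  {topic_id: category name}, assigned manually
    """
    totals = defaultdict(float)
    for topic_id, weight in topic_weights.items():
        totals[topic_labels[topic_id]] += weight
    # Iterating in sorted order makes ties resolve to the alphabetically
    # first category, so the result is deterministic.
    return max(sorted(totals), key=lambda cat: totals[cat])
```

For example, with hypothetical labels `{0: "Music", 1: "Music", 2: "Politics"}` and weights `{0: 0.3, 1: 0.3, 2: 0.4}`, the two music topics together outweigh the single politics topic, so the article is classified as Music.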
Some categories are more diverse
Digging deeper: networks of topics
Holloway et al., 2007
Showing >5% links only
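
The slides do not define how link weights in the topic networks are computed. As one plausible reading, the sketch below treats a link's weight as the share of biographies in which two categories appear together, and keeps only links above 5%. The function name, the input format, and this definition of weight are all assumptions.

```python
from collections import Counter
from itertools import combinations


def category_links(bio_categories, min_share=0.05):
    """Return {(cat_a, cat_b): share} for category pairs that co-occur in
    more than min_share of the biographies.

    bio_categories: list of sets, one per biography, holding the categories
    that biography touches on.
    """
    if not bio_categories:
        return {}
    pair_counts = Counter()
    for cats in bio_categories:
        # Sort so each unordered pair is always counted under the same key.
        for pair in combinations(sorted(cats), 2):
            pair_counts[pair] += 1
    total = len(bio_categories)
    return {
        pair: count / total
        for pair, count in pair_counts.items()
        if count / total > min_share
    }
```

Thresholding this way trades completeness for legibility: rare category combinations disappear from the picture, which is worth keeping in mind when comparing editions of very different sizes.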
Limitations
•  Biography lists are not comprehensive
•  Still refining the LDA topic modeling
–  Some categories overlap
–  Robustness: average of runs
•  Wikipedia authors are not the public
Highlights
•  Systematically studying cultural differences using Wikipedia text
•  Preliminary results show differences at the micro and macro levels
Future work
•  Analyze networks
•  More editions, improved extraction
•  Categories:
–  E.g., US presidents, soccer players
•  Sentiment
•  Evaluation
•  Your Ideas?
Cultural differences and networks of topics across Wikipedia language editions
sronen at media dot mit dot edu
Stay tuned for
Links that speak: the structure and implications of the global language network
today @ 2pm
Wednesday @ 5pm, Saxo room