Topic Selection of Wikipedia-Articles
Dag Sonntag
Department of Computer and Information Science (IDA), Linköping University, Sweden
June 7, 2013

Overview
• Topic models
• Implementation and the project
• Results
• Conclusions and experiences

Topic Modelling - Theory
• Topic modelling is about finding underlying (hidden) themes in large datasets
• It can be used to organize, summarize, search or model the different themes
• Each topic is represented by a sub-model that is fitted to the samples representing that particular topic
• Topic models can be either deterministic or probabilistic

Topic Modelling – Theory (2) - Inference
• Each sample (or document) is classified into the topic whose model best explains the sample

Example: two sentences, and models that each give the probability of a word being cat, dog, ball or tree.
• “Dogs bark at cats”
• “Cats climb in trees”

Model 1: Pr(dog) = 0.5, Pr(ball) = 0.3, Pr(cat) = 0.2, Pr(tree) = 0.0
Model 2: Pr(dog) = 0.1, Pr(ball) = 0.1, Pr(cat) = 0.5, Pr(tree) = 0.3

The first sentence is more likely to belong to model 1 (0.5 × 0.2 = 0.10, against 0.1 × 0.5 = 0.05 for model 2), while the second can only belong to model 2, since model 1 gives "tree" probability 0.

Topic Modelling – Latent Dirichlet Allocation - Training
LDA: a set of bag-of-words models
Expectation step:
• Each document has a probability of belonging to each class, depending on how well the model of that class explains the document
• The probabilities for each document sum to 1
• A class is randomly chosen for the document according to these probabilities
Maximization step:
• Each model is updated to fit the documents that belong to it

Implementation and the Project
Problem statement: What underlying topic structure exists on Wikipedia, and does this structure correspond to the known Wikipedia article structure?
Implementation:
1. Download random Wikipedia articles
2. Download Wikipedia articles within a known category
3. Preprocess these articles
4. Train topic models on the uncategorized articles
5. Classify the categorized articles with the trained topic models and see whether articles from the same category belong to the same topic

Implementation – Downloading and Preprocessing Articles
• Wikipedia's random-article page was used to fetch articles; the title and text were then extracted
The preprocessing:
• HTML tags and code were removed
• Only alphanumeric characters were kept, and then the digits were removed as well
• Stopwords, uncommon words (<1%) and very common words (>99%) were removed
• Letters were lowercased
• No stemming was used
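
The preprocessing steps listed above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the function names, the tiny stopword list, the regex-based HTML stripping, the restriction to ASCII letters, and the reading of the uncommon/very common cut-offs as document-frequency fractions (0.01 and 0.99) are all hypothetical choices made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"the", "a", "an", "and", "of", "in", "at", "is", "to"}


def strip_html(raw):
    """Crudely remove script/style blocks, then any remaining tags."""
    raw = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", raw)
    return re.sub(r"<[^>]+>", " ", raw)


def tokenize(raw):
    """Lowercase, keep runs of ASCII letters only, drop stopwords. No stemming."""
    text = strip_html(raw).lower()
    return [w for w in re.findall(r"[a-z]+", text) if w not in STOPWORDS]


def filter_by_document_frequency(docs, low=0.01, high=0.99):
    """Drop words found in fewer than `low` or more than `high` of the documents.

    Assumption: the thresholds are fractions of documents containing the word.
    """
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    keep = {w for w, count in df.items() if low <= count / n <= high}
    return [[w for w in doc if w in keep] for doc in docs]


def preprocess(raw_articles, low=0.01, high=0.99):
    """Run tokenization and frequency filtering over a list of raw HTML strings."""
    tokenized = [tokenize(raw) for raw in raw_articles]
    return filter_by_document_frequency(tokenized, low, high)
```

The resulting token lists are in the bag-of-words-friendly form that a topic-model library expects as input; a real pipeline would more likely use an HTML parser and a full stopword list than the regexes and toy set shown here.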

Implementation – Topic Models Used
• Two different implementations of LDA
  • One commercial (Gensim)
  • One academic (OnlineLDAVB)
• One implementation of LSI (Gensim)

Results
Clear and relatively well-separated topics were found:

Topic 0 => Schools and places
['county', 'school', 'city', 'state', 'new', 'area', 'river', 'district', 'population', 'university', 'north', 'south', 'coordinates', 'west', 'station', 'town', 'location', 'road', 'east', 'also']

Topic 1 => Articles and webpages etc.
['wikipedia', 'page', 'article', 'may', 'links', 'view', 'policy', 'pages', 'help', 'create', 'changes', 'terms', 'link', 'contact', 'privacy', 'edit', 'articles', 'categories', 'text', 'disambiguation']

Topic 2 => History, wars, etc.
['also', 'may', 'one', 'used', 'united', 'first', 'two', 'war', 'world', 'article', 'family', 'states', 'new', 'time', 'name', 'species', 'see', 'system', 'would', 'many']

Topic 3 => Music and culture
['first', 'also', 'one', 'new', 'album', 'music', 'film', 'two', 'released', 'time', 'party', 'john', 'song', 'band', 'later', 'life', 'years', 'series', 'may', 'house']

Topic 4 => Sports
['team', 'club', 'league', 'career', 'season', 'first', 'new', 'football', 'born', 'player', 'won', 'united', 'cup', 'games', 'national', 'year', 'played', 'game', 'may', 'one']

Results (2)
Topics corresponding to the existing Wikipedia classification:
General reference, Culture and the arts, Geography and places, Health and fitness, History and events, Mathematics and logic, Natural and physical sciences, People and self, Philosophy and thinking, Religion and belief systems, Society and social sciences, Technology and applied sciences
Found topics:
• Topic 0: Schools and places
• Topic 1: Articles and webpages etc.
• Topic 2: History, wars, etc.
• Topic 3: Music and culture
• Topic 4: Sports
Most of the categorized articles were classified to the same topic
• Probably due to large differences between the training data and the test data

Results (3)
Major problems:
• The academic LDA and the LSI topic models sometimes had problems differentiating the models
• Common words became common in all models
• Defining the priors

Conclusion and Experiences
• Wikipedia does provide solutions for data mining; don't go through the "normal" web crawling of the HTML interface
• Decide at an early stage what model to use, and from there also decide what features to use
• Topic modelling works pretty well

Questions?

References:
David M. Blei, Princeton:
http://www.cs.princeton.edu/~blei/papers/Blei2012.pdf
http://www.cs.princeton.edu/~blei/blei-mlss-2012.pdf
http://www.cs.princeton.edu/~blei/topicmodeling.html
Gensim:
http://radimrehurek.com/gensim/index.html
Wikipedia:
http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
http://en.wikipedia.org/wiki/Latent_semantic_indexing