UTOPIAN: User-Driven Topic Modeling Based on Interactive Nonnegative Matrix Factorization
Jaegul Choo1*, Changhyun Lee1, Chandan K. Reddy2, and Haesun Park1
1Georgia Institute of Technology, 2Wayne State University
*e-mail: jaegul.choo@cc.gatech.edu

Intro: Topic Modeling
[Figure: Documents 1 to 4; Topics 1 to 3; keywords brain, evolve, dna, genetic, gene, nerve, neuron, life, organism]
• Topic: a distribution over keywords
• Document: a distribution over topics

Latent Dirichlet Allocation (LDA) in Visual Analytics
• LDA has been widely used in visual analytics.
• TIARA [Wei et al. KDD10], iVisClustering [Lee et al. EuroVis12], ParallelTopics [Dou et al. VAST12], TopicViz [Eisenstein et al. CHI-WIP12], …
*Images courtesy of the original papers.

Overview of Our Work
• Proposes nonnegative matrix factorization (NMF) for topic modeling.
• Highlights advantages of NMF over LDA in visual analytics.
• Presents UTOPIAN, an NMF-based interactive topic modeling system.
[Figure panels: Topic merging, Topic splitting, Keyword-induced topic creation, Doc-induced topic creation]

What is Nonnegative Matrix Factorization?

Nonnegative Matrix Factorization (NMF)
• Lower-rank approximation with nonnegativity constraints: $A \approx WH$
$$\min_{W \ge 0,\; H \ge 0} \|A - WH\|_F$$
• Why nonnegativity? Easy interpretation and semantically meaningful output
• Algorithm: alternating nonnegativity-constrained least squares [Kim et al., 2008]

NMF as Topic Modeling
• $A \approx WH$
[Figure: Documents 1 to 4; Topics 1 to 3; keywords brain, evolve, dna, genetic, gene, nerve, neuron, life, organism]
• Topic: a distribution over keywords
• Document: a distribution over topics

Why NMF in Visual Analytics?
Advantages of NMF in Visual Analytics
• Reliable algorithmic behaviors
• Flexible support for user interactions

NMF vs. LDA: Consistency from Multiple Runs
• Documents’ topical membership changes among 10 runs
[Figure panels: InfoVis/VAST paper data set; 20 newsgroup data set]

NMF vs. LDA: Empirical Convergence
• Documents’ topical membership changes between iterations
[Figure: InfoVis/VAST paper data set; plot labels: 48 seconds, 10 minutes, NMF, LDA]

NMF vs. LDA: Topic Summary (Top Keywords)
InfoVis/VAST paper data set

NMF (Run #1 and Run #2 give the same top keywords)
• Topic 1: visualization, design
• Topic 2: information, user
• Topic 3: analysis, system
• Topic 4: graph, layout
• Topic 5: visual, analytics
• Topic 6: data, sets
• Topic 7: color, weaving

LDA (Run #1 / Run #2)
• Topic 1: documents, similarities / documents, query
• Topic 2: knowledge, edge / analysts, scatterplot
• Topic 3: query, collaborative / spatial, collaborative
• Topic 4: social, tree / text, documents
• Topic 5: measures, multivariate / multidimensional, high
• Topic 6: tree, animation / tree, aggregation
• Topic 7: dimensions, treemap / dimensions, treemap

• Topics are more consistent in NMF than in LDA.
• Topic quality is comparable between NMF and LDA.

Advantages of NMF in Visual Analytics
• Reliable algorithmic behaviors
• Flexible support for user interactions

Weakly Supervised NMF [Choo et al., DMKD, accepted with rev.]
$$\min_{W \ge 0,\; H \ge 0} \|A - WH\|_F^2 + \alpha \|(W - W_r) M_W\|_F^2 + \beta \|M_H (H - D_H H_r)\|_F^2$$
• $W_r$, $H_r$: reference matrices for $W$ and $H$
• $M_W$, $M_H$: diagonal matrices for weighting/masking columns/rows of $W$ and $H$
• Provides flexible yet intuitive means for user interaction.
• Maintains the same computational complexity as original NMF.
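The weakly supervised objective above can be tried out on toy data. What follows is a minimal NumPy sketch under stated assumptions, not the UTOPIAN implementation: $D_H$ is not defined on the slide and is taken to be the identity; multiplicative updates are swapped in for the alternating nonnegativity-constrained least squares solver cited earlier, only to keep the sketch free of extra dependencies; and the function name `weakly_supervised_nmf`, the keyword counts, the reference matrix, and the value of `alpha` are all hypothetical. With `alpha = beta = 0` the function reduces to plain NMF.

```python
import numpy as np

def weakly_supervised_nmf(A, k, W_r=None, M_W=None, H_r=None, M_H=None,
                          alpha=0.0, beta=0.0, n_iter=300, seed=0, eps=1e-9):
    """Sketch: minimize ||A - WH||_F^2 + alpha*||(W - W_r) M_W||_F^2
    + beta*||M_H (H - H_r)||_F^2 subject to W, H >= 0.
    Assumptions: D_H = I, and multiplicative updates instead of ANLS."""
    m, n = A.shape
    rng = np.random.default_rng(seed)
    W = rng.random((m, k)) + 0.1          # strictly positive start
    H = rng.random((k, n)) + 0.1
    W_r = np.zeros((m, k)) if W_r is None else W_r
    H_r = np.zeros((k, n)) if H_r is None else H_r
    # Squared diagonals: the masks enter the gradients as M^2.
    mw2 = np.zeros(k) if M_W is None else np.diag(M_W) ** 2
    mh2 = np.zeros(k) if M_H is None else np.diag(M_H) ** 2

    def objective(W, H):
        return (np.linalg.norm(A - W @ H) ** 2
                + alpha * np.sum((W - W_r) ** 2 * mw2)            # M_W scales columns of W
                + beta * np.sum((H - H_r) ** 2 * mh2[:, None]))   # M_H scales rows of H

    history = [objective(W, H)]
    for _ in range(n_iter):
        # Ratio of the negative to the positive part of each gradient.
        W *= (A @ H.T + alpha * W_r * mw2) / (W @ (H @ H.T) + alpha * W * mw2 + eps)
        H *= (W.T @ A + beta * mh2[:, None] * H_r) / (
            (W.T @ W) @ H + beta * mh2[:, None] * H + eps)
        history.append(objective(W, H))
    return W, H, history

keywords = ["brain", "evolve", "dna", "genetic", "gene",
            "nerve", "neuron", "life", "organism"]
# Hypothetical keyword-by-document counts (9 keywords x 4 documents).
A = np.array([
    [4, 0, 0, 1],   # brain
    [0, 0, 4, 1],   # evolve
    [0, 5, 0, 1],   # dna
    [0, 3, 1, 0],   # genetic
    [0, 4, 0, 2],   # gene
    [3, 0, 0, 0],   # nerve
    [5, 0, 0, 1],   # neuron
    [0, 0, 3, 2],   # life
    [0, 1, 5, 0],   # organism
], dtype=float)
k = 3

# Plain NMF: no reference matrices, alpha = beta = 0.
W0, H0, hist0 = weakly_supervised_nmf(A, k)

# Illustrative keyword-induced topic: tie topic 0 to three chosen keywords.
W_r = np.zeros((len(keywords), k))
for word in ("brain", "nerve", "neuron"):
    W_r[keywords.index(word), 0] = 1.0
M_W = np.diag([1.0, 0.0, 0.0])            # only topic 0 is constrained
W1, H1, hist1 = weakly_supervised_nmf(A, k, W_r=W_r, M_W=M_W, alpha=5.0)

def top_keywords(W, topic, n=3):
    """Return the n highest-weighted keywords of one topic (column of W)."""
    order = np.argsort(W[:, topic])[::-1][:n]
    return [keywords[i] for i in order]

for t in range(k):
    print(f"topic {t}:", top_keywords(W1, t))
```

In this sketch the columns of `W` play the role of topics (weights over keywords) and the columns of `H` describe each document's weights over topics. The extra terms only add elementwise work to each update, which is in line with the slide's remark that the computational complexity matches original NMF.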
UTOPIAN: User-Driven Topic Modeling Based on Interactive NMF
[Figure panels: Topic merging, Keyword-induced topic creation, Doc-induced topic creation, Topic splitting]

UTOPIAN Overview
• Supervised t-distributed stochastic neighbor embedding (t-SNE)
• User interactions supported
  • Keyword refinement
  • Topic merging/splitting
  • Keyword-/document-induced topic creation
• Real-time interaction via PIVE (Per-Iteration Visualization Environment)

Supervised t-SNE
• Original t-SNE: documents are often too noisy to work with.
• Supervised t-SNE: $d(x_i, x_j) \leftarrow \alpha \cdot d(x_i, x_j)$ if $x_i$ and $x_j$ belong to the same topic cluster.

PIVE (Per-Iteration Visualization Environment) for Real-time Interaction [Choo et al., under revision]
[Figure panels: Standard approach; PIVE approach]

Demo Video
http://tinyurl.com/UTOPIAN2013

Usage Scenario: Hyundai Genesis Review Data
[Figure panels: Initial result; After interaction]

Summary
• Presented UTOPIAN, a user-driven topic modeling system based on interactive NMF.
• Highlighted the advantages of NMF over LDA in visual analytics.
  • Reliable algorithmic behaviors
    • Consistency from multiple runs
    • Early empirical convergence
  • Flexible support for user interactions
    • Keyword refinement
    • Topic merging/splitting
    • Keyword-/document-induced topic creation

More in the paper & On-going Work
More in the paper
• A general taxonomy of user interactions with computational methods
  • Keyword-based vs. document-based
  • Template-based vs. from-scratch-based
• Algorithmic details about supported user interactions
• Implementation details
• More usage scenarios
On-going Work
• Scaling up the system with parallel distributed NMF

Thank you!
Jaegul Choo
jaegul.choo@cc.gatech.edu
http://www.cc.gatech.edu/~joyfull/
http://tinyurl.com/UTOPIAN2013
For more details, please find me at ‘Meet the Candidate’, A601 + A602, 6 PM today.