Discovering Latent Structure in Multiple Modalities
Andrew McCallum, Computer Science Department, University of Massachusetts Amherst
Joint work with Xuerui Wang, Natasha Mohanty, Andres Corrada, Chris Pal, Wei Li, Greg Druck.

Social Network in an Email Dataset

Social Network in Political Data
Vote similarity in the U.S. Senate [Jakulin & Buntine 2005]

Groups and Topics
• Input:
  – Observed relations between people
  – Attributes on those relations (text, or categorical)
• Output:
  – Attributes clustered into “topics”
  – Groups of people, varying depending on topic

Discovering Groups from an Observed Set of Relations
Student roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic admiration relations among six high school students:
Acad(A, B)  Acad(C, B)  Acad(A, D)  Acad(C, D)
Acad(B, E)  Acad(D, E)  Acad(B, F)  Acad(D, F)
Acad(E, A)  Acad(F, A)  Acad(E, C)  Acad(F, C)
Adjacency Matrix Representing Relations
The same academic-admiration relations, viewed as an adjacency matrix. With rows and columns in roster order (A, B, C, D, E, F) no structure is visible; reordered by group (A, C | B, D | E, F) the matrix falls into clean blocks:
G1 = {Adams, Carter}, G2 = {Bennett, Davis}, G3 = {Edwards, Frederking}.

Group Model: Partitioning Entities into Groups
Stochastic Blockstructures for Relations [Nowicki, Snijders 2001]
• S entities, G groups
• Each entity’s group assignment g is drawn from a Multinomial (with a Dirichlet prior)
• Each of the S² relations v is drawn from a Binomial, with a Beta prior for each of the G² group pairs
• Enhanced with an arbitrary number of groups in [Kemp, Griffiths, Tenenbaum 2004]

Two Relations with Different Attributes
Besides academic admiration, the same six students also have social admiration relations:
Soci(A, B)  Soci(A, D)  Soci(A, F)  Soci(B, A)  Soci(B, C)  Soci(B, E)
Soci(C, B)  Soci(C, D)  Soci(C, F)  Soci(D, A)  Soci(D, C)  Soci(D, E)
Soci(E, B)  Soci(E, D)  Soci(E, F)  Soci(F, A)  Soci(F, C)  Soci(F, E)
The two relations yield different groupings: academic admiration gives {A, C}, {B, D}, {E, F}; social admiration gives {A, C, E}, {B, D, F}.

The Group-Topic Model: Discovering Groups and Topics Simultaneously
[Wang, Mohanty, McCallum 2006]
Couples a blockmodel over relations with a topic model over relation attributes: each batch of relations gets a topic indicator t (uniform prior); words w in the B attribute batches are drawn from T topic multinomials (Dirichlet priors); group assignments g now depend on the topic; and the S² relations v are drawn from Binomials with Beta priors over the G² group pairs, per topic.

Inference and Estimation
Gibbs sampling:
• Many random variables can be integrated out
• Easy to implement
• Reasonably fast
We assume the relationship is symmetric.

Dataset #1: U.S.
Senate
• 16 years of voting records in the U.S. Senate (1989–2005)
• A Senator may respond Yea or Nay to a resolution
• 3423 resolutions with text attributes (index terms)
• 191 Senators in total across the 16 years

Example resolution, S.543:
Title: An Act to reform Federal deposit insurance, protect the deposit insurance funds, recapitalize the Bank Insurance Fund, improve supervision and regulation of insured depository institutions, and for other purposes.
Sponsor: Sen. Riegle, Donald W., Jr. [MI] (introduced 3/5/1991). Cosponsors (2).
Latest Major Action: 12/19/1991. Became Public Law No: 102-242.
Index terms: Banks and banking; Accounting; Administrative fees; Cost control; Credit; Deposit insurance; Depressed areas; and 110 other terms.
Votes: Adams (D-WA), Nay; Akaka (D-HI), Yea; Bentsen (D-TX), Yea; Biden (D-DE), Yea; Bond (R-MO), Yea; Bradley (D-NJ), Nay; Conrad (D-ND), Nay; …

Topics Discovered (U.S. Senate)
Mixture of Unigrams:
• Education: education school aid children drug students elementary prevention
• Energy: energy power water nuclear gas petrol research pollution
• Military + Misc.: government military foreign tax congress aid law policy
• Economic: federal labor insurance aid tax business employee care
Group-Topic Model:
• Education + Domestic: education school federal aid government tax energy research
• Foreign: foreign trade chemicals tariff congress drugs communicable diseases
• Economic: labor insurance tax congress income minimum wage business
• Social Security + Medicare: social security insurance medical care medicare disability assistance

Groups Discovered (U.S. Senate)
Groups from the topic Education + Domestic.

Senators Who Change Coalition the Most, Depending on Topic
For example,
Senator Shelby (D-AL) votes
• with the Republicans on Economic,
• with the Democrats on Education + Domestic,
• with a small group of maverick Republicans on Social Security + Medicare.

Dataset #2: The UN General Assembly
• Voting records of the UN General Assembly (1990–2003)
• A country may choose to vote Yes, No, or Abstain
• 931 resolutions with text attributes (titles)
• 192 countries in total
• Also, later experiments with resolutions from 1960–2003

Example: vote on Permanent Sovereignty of the Palestinian People, 87th plenary meeting.
The draft resolution on permanent sovereignty of the Palestinian people in the occupied Palestinian territory, including Jerusalem, and of the Arab population in the occupied Syrian Golan over their natural resources (document A/54/591) was adopted by a recorded vote of 145 in favour to 3 against, with 6 abstentions:
In favour: Afghanistan, Argentina, Belgium, Brazil, Canada, China, France, Germany, India, Japan, Mexico, Netherlands, New Zealand, Pakistan, Panama, Russian Federation, South Africa, Spain, Turkey, and 126 other countries.
Against: Israel, Marshall Islands, United States.
Abstain: Australia, Cameroon, Georgia, Kazakhstan, Uzbekistan, Zambia.

Topics Discovered (UN)
Mixture of Unigrams:
• Everything Nuclear: nuclear weapons use implementation countries
• Human Rights: rights human palestine situation israel
• Security in Middle East: occupied israel syria security calls
Group-Topic Model:
• Nuclear Non-proliferation: nuclear states united weapons nations
• Nuclear Arms Race: nuclear arms prevention race space
• Human Rights: rights human palestine occupied israel

Groups Discovered (UN)
The country lists for each group are ordered by 2005 GDP (PPP); only five countries are shown for groups with more than five members.
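The group models above all share one visual signature: permuting the rows and columns of a relation’s adjacency matrix by the discovered group assignments exposes block structure. A minimal pure-Python sketch using the six-student academic-admiration example from the earlier slides (group assignments are hard-coded here for illustration, not inferred):

```python
def reorder_by_group(adj, groups):
    """Permute rows and columns of an adjacency matrix so entities in
    the same group sit next to each other, exposing block structure."""
    order = sorted(range(len(groups)), key=lambda i: groups[i])
    return [[adj[i][j] for j in order] for i in order]

# Six students A..F; adj[i][j] = 1 iff "i academically admires j",
# transcribed from the Acad(...) relations on the slides.
adj = [
    [0, 1, 0, 1, 0, 0],  # A admires B, D
    [0, 0, 0, 0, 1, 1],  # B admires E, F
    [0, 1, 0, 1, 0, 0],  # C admires B, D
    [0, 0, 0, 0, 1, 1],  # D admires E, F
    [1, 0, 1, 0, 0, 0],  # E admires A, C
    [1, 0, 1, 0, 0, 0],  # F admires A, C
]
groups = [0, 1, 0, 1, 2, 2]  # G1 = {A, C}, G2 = {B, D}, G3 = {E, F}
blocked = reorder_by_group(adj, groups)
```

Reordering to (A, C | B, D | E, F) turns the scattered 1s into three clean off-diagonal blocks, matching the G1/G2/G3 grouping shown on the slides.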
Outline
Discovering Latent Structure in Multiple Modalities
✓ Groups & Text (Group-Topic Model, GT)
• Nested Correlations (Pachinko Allocation, PAM)
• Time & Text (Topics-over-Time Model, TOT)
• Time & Text with Nested Correlations (PAM-TOT)
• Multi-Conditional Mixtures

Latent Dirichlet Allocation [Blei, Ng, Jordan, 2003]
For each of N documents, a distribution θ over topics is drawn from a Dirichlet prior α; for each word, a topic index z is drawn from θ, and the word w is drawn from the corresponding one of T topic multinomials φ (Dirichlet prior β).
Example topics:
• “images, motion, eyes” (LDA, 20 topics): visual model motion field object image images objects fields receptive eye position spatial direction target vision multiple figure orientation location
• “motion, some junk” (LDA, 100 topics): motion detection field optical flow sensitive moving functional detect contrast light dimensional intensity computer mt measures occlusion temporal edge real

Correlated Topic Model [Blei, Lafferty, 2005]
Replaces LDA’s Dirichlet over topic proportions with a logistic normal: a square matrix of pairwise topic correlations.

Topic Correlation Representation
7 topics: {A, B, C, D, E, F, G}. Correlations: {A, B, C, D, E} and {C, D, E, F, G}.
• CTM: 21 pairwise-correlation parameters
• Mixture model over the two correlated groups: 14 parameters

Pachinko Machine

Pachinko Allocation Model [Li, McCallum, 2005, 2006]
Given: a directed acyclic graph (DAG), with a Dirichlet over its children at each interior node, and words at the leaves.
For each document: sample a multinomial from each Dirichlet.
For each word in this document: starting from the root, sample a child at each successive node, down to a leaf; generate the word at that leaf.
Like a Pólya tree, but DAG-shaped, with an arbitrary number of children.
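The generative process just described can be sketched in a few lines. This is a toy illustration, not the authors’ code: the DAG, the vocabulary, and the per-document multinomials (which PAM would draw from each interior node’s Dirichlet) are all invented here.

```python
import random

def generate_word(root, doc_multinomials, children, leaf_word):
    """Sample one word: walk from the DAG root to a leaf, choosing a
    child at each interior node from that node's document-specific
    multinomial, then emit the word stored at the leaf."""
    node = root
    while node in children:                        # still at an interior node
        node = random.choices(children[node],
                              weights=doc_multinomials[node])[0]
    return leaf_word[node]

# Toy two-level DAG: root -> {t1, t2} -> word leaves (all names invented).
children = {"root": ["t1", "t2"],
            "t1": ["w1", "w2"],
            "t2": ["w2", "w3"]}    # t1 and t2 share leaf w2: a DAG, not a tree
leaf_word = {"w1": "cortex", "w2": "model", "w3": "gradient"}

# In PAM these multinomials are sampled per document from each node's
# Dirichlet; here they are fixed by hand for clarity.
doc_multinomials = {"root": [1.0, 0.0], "t1": [0.5, 0.5], "t2": [0.2, 0.8]}

random.seed(0)
doc = [generate_word("root", doc_multinomials, children, leaf_word)
       for _ in range(6)]
```

Because the root multinomial here puts all its mass on t1, every word in this document comes from t1’s children; a different document would draw different multinomials and so express different topic correlations.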
Pachinko Allocation Model [Li, McCallum, 2005]
The DAG may have arbitrary structure:
• arbitrary depth
• any number of children per node
• sparse connectivity
• edges may skip layers

Pachinko Allocation Model [Li, McCallum, 2005]
Interpreting the layers:
• upper interior nodes: distributions over distributions over topics…
• middle nodes: distributions over topics, i.e. mixtures representing topic correlations
• lowest interior nodes: distributions over words (like “LDA topics”)
Some interior nodes could contain a single multinomial used for all documents (i.e., a very peaked Dirichlet).

Pachinko Allocation Model [Li, McCallum, 2005]
• Estimate all these Dirichlets from data.
• Estimate the model structure from data (number of nodes, and connectivity).

Pachinko Allocation Special Cases
Latent Dirichlet Allocation is PAM with two layers, no skipped layers, and full connectivity from one layer to the next.
A four-level PAM has a root, “super-topics”, “sub-topics”, and fixed multinomials over words. Another special case would select only one super-topic per document.

Graphical Models
Four-level PAM (with fixed multinomials for sub-topics) vs. LDA.

Inference – Gibbs Sampling
The super-topic and sub-topic assignments for each word are jointly sampled:

P(z_{w2} = t_k, z_{w3} = t_p \mid D, \mathbf{z}_{-w}, \alpha, \beta) \propto \frac{n^{(d)}_{1k} + \alpha_{1k}}{n^{(d)}_{1} + \sum_{k'} \alpha_{1k'}} \cdot \frac{n^{(d)}_{kp} + \alpha_{kp}}{n^{(d)}_{k} + \sum_{p'} \alpha_{kp'}} \cdot \frac{n_{pw} + \beta_{w}}{n_{p} + \sum_{m} \beta_{m}}

The Dirichlet parameters α are estimated with moment matching.

Experimental Results
• Topic clarity by human judgement
• Likelihood on held-out data
• Document classification

Datasets
• Rexa (http://rexa.info/) – 4000 documents, 278,438 word tokens, and 25,597 unique words.
• NIPS – 1647 documents, 114,142 word tokens, and 11,708 unique words.
• 20 Newsgroups comp5 subset – 4836 documents and 35,567 unique words.

Topic Correlations

Example Topics
• “images, motion, eyes” (LDA, 20 topics): visual model motion field object image images objects fields receptive eye position spatial direction target vision multiple figure orientation location
• “motion” (+ some generic) (LDA, 100 topics): motion detection field optical flow sensitive moving functional detect contrast light dimensional intensity computer mt measures occlusion temporal edge real
• “motion” (PAM, 100 topics): motion video surface surfaces figure scene camera noisy sequence activation generated analytical pixels measurements assigne advance lated shown closed perceptual
• “eyes” (PAM, 100 topics): eye head vor vestibulo oculomotor vestibular vary reflex vi pan rapid semicircular canals responds streams cholinergic rotation topographically detectors ning
• “images” (PAM, 100 topics): image digit faces pixel surface interpolation scene people viewing neighboring sensors patches manifold dataset magnitude transparency rich dynamical amounts tor

Blind Topic Evaluation
• Randomly select 25 similar pairs of topics generated from PAM and LDA
• 5 people
• Each asked to “select the topic in each pair that you find more semantically coherent.”
Topic counts:     LDA   PAM
  5 votes           0     5
  >= 4 votes        3     8
  >= 3 votes        9    16

Examples
• PAM (5 votes): control systems robot adaptive environment goal state controller
  LDA (0 votes): control systems based adaptive direct con controller change
• PAM (4 votes): motion image detection images scene vision texture segmentation
  LDA (1 vote): image motion images multiple local generated noisy optical

Examples
• PAM (4 votes): signals source separation eeg sources blind single event
  LDA (1 vote): signal signals single time low source temporal processing
• PAM (1 vote): algorithm learning algorithms gradient convergence function stochastic weight
  LDA (4 votes): algorithm algorithms gradient convergence stochastic line descent converge

Likelihood Comparison
• Dataset: NIPS
• Two sets of experiments:
  – Varying the number of topics
  – Different proportions of training data

Likelihood Comparison
• Varying the number of topics

Likelihood Comparison
• Different proportions of training data

Document Classification
• 20 Newsgroups comp5 subset
• 5-way classification (accuracy in %):
  class      # docs   LDA     PAM
  graphics     243    83.95   86.83
  os           239    81.59   84.10
  pc           245    83.67   88.16
  mac          239    86.61   89.54
  windows.x    243    88.07   92.20
  total       1209    84.70   87.34

Outline
Discovering Latent Structure in Multiple Modalities
✓ Groups & Text (Group-Topic Model, GT)
✓ Nested Correlations (Pachinko Allocation, PAM)
• Time & Text (Topics-over-Time Model, TOT)
• Time & Text with Nested Correlations (PAM-TOT)
• Multi-Conditional Mixtures

Want to Model Trends over Time
• Is the prevalence of a topic growing or waning?
• A pattern may appear only briefly
  – Capture its statistics in a focused way
  – Don’t confuse it with patterns elsewhere in time
• How do roles, groups, and influence shift over time?

Topics over Time (TOT) [Wang, McCallum 2006]
As in LDA, each document draws a multinomial over the T topics from a Dirichlet prior, each of the N_d words w gets a topic index z, and each topic has a multinomial over words (Dirichlet prior). In addition, each topic has a Beta distribution over time, from which each word’s time stamp t is drawn (the time stamps have a uniform prior distribution).

Attributes of this Approach to Modeling Time
• Not a Markov model
  – No state transitions, or Markov assumption
• Continuous time
  – Time is not discretized
• Easily incorporated into other, more complex models with additional modalities

State of the Union Addresses
208 addresses delivered between January 8, 1790 and January 29, 2002. To increase the number of documents, we split the addresses into paragraphs and treated them as ‘documents’. One-line paragraphs were excluded. Stopword removal was applied.
• 17,156 ‘documents’
• 21,534 unique words
• 669,425 tokens

Example ‘document’ (1910):
“Our scheme of taxation, by means of which this needless surplus is taken from the people and put into the public Treasury, consists of a tariff or duty levied upon importations from abroad and internal-revenue taxes levied upon the consumption of tobacco and spirituous and malt liquors. It must be conceded that none of the things subjected to internal-revenue taxation are, strictly speaking, necessaries. There appears to be no just complaint of this taxation by the consumers of these articles, and there seems to be nothing so well able to bear the burden without hardship to any portion of the people.”

Comparing TOT against LDA
TOT on 17 years of NIPS proceedings.

Topic Distributions Conditioned on Time
TOT on 17 years of NIPS proceedings: topic mass (vertical height) plotted against time.

TOT versus LDA on my email

TOT Improves the Ability to Predict Time
Predicting the year of a State-of-the-Union address; L1 = distance between predicted year and actual year.

Outline
Discovering Latent Structure in Multiple Modalities
✓ Groups & Text (Group-Topic Model, GT)
✓ Nested Correlations (Pachinko Allocation, PAM)
✓ Time & Text (Topics-over-Time Model, TOT)
• Time & Text with Nested Correlations (PAM-TOT)
• Multi-Conditional Mixtures

PAM Over Time (PAMTOT)
The four-level PAM, with Beta distributions over time attached at both the super-topic and sub-topic levels (time stamps t2 and t3).

Experimental Results
• Dataset: Rexa subset
  – 4454 documents between the years 1991 and 2005
  – 372,936 word tokens
  – 21,748 unique words
• Topic examples
• Predicting time

Topic Examples (1): PAMTOT vs. PAM
Topic Examples (2): PAMTOT vs. PAM

Predicting Time with PAMTOT
            L1 Error   E(L1)   Accuracy
  PAMTOT      1.56      1.57     0.29
  PAM         5.34      5.30     0.10
L1 Error: the difference between the predicted and true years.
E(L1): the expected difference over all years, using p(t|·) from the model.
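The TOT generative story is compact enough to sketch directly. A toy illustration (the topics, vocabulary, and Beta shape parameters are invented; time stamps are normalized to [0, 1], as in the model):

```python
import random

def tot_generate(theta, phi, beta_params, n_words):
    """Per-token generative sketch of Topics-over-Time: each token gets
    a topic z ~ theta, a word w ~ phi[z], and a continuous, normalized
    time stamp t ~ Beta(a_z, b_z) -- no discretization of time."""
    tokens = []
    topics = list(range(len(theta)))
    for _ in range(n_words):
        z = random.choices(topics, weights=theta)[0]
        w = random.choices(phi[z]["words"], weights=phi[z]["probs"])[0]
        a, b = beta_params[z]
        t = random.betavariate(a, b)          # time stamp in [0, 1]
        tokens.append((z, w, t))
    return tokens

# Hypothetical two-topic example: topic 0 peaks early, topic 1 late.
theta = [0.5, 0.5]
phi = [{"words": ["tariff", "tax"], "probs": [0.7, 0.3]},
       {"words": ["internet", "technology"], "probs": [0.6, 0.4]}]
beta_params = [(2.0, 8.0), (8.0, 2.0)]   # Beta shapes: early vs. late mass

random.seed(0)
tokens = tot_generate(theta, phi, beta_params, 200)
```

Because each topic carries its own Beta over continuous time, the “tariff” topic can place most of its mass early in the timeline while the other peaks late, which is exactly the focused, non-Markov treatment of time described above.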
Outline
Discovering Latent Structure in Multiple Modalities
✓ Groups & Text (Group-Topic Model, GT)
✓ Nested Correlations (Pachinko Allocation, PAM)
✓ Time & Text (Topics-over-Time Model, TOT)
✓ Time & Text with Nested Correlations (PAM-TOT)
• Multi-Conditional Mixtures

Want a “Topic Model” with the Advantages of CRFs
• Use arbitrary, overlapping features of the input.
• Undirected graphical model, so we don’t have to think about avoiding cycles.
• Integrates naturally with our other CRF components.
• Trained “discriminatively”.
• Natural semi-supervised & transfer learning.

“Multi-Conditional Mixtures”: Latent Variable Models Fit by Multi-way Conditional Probability
[McCallum, Wang, Pal, 2005], [McCallum, Pal, Druck, Wang, 2006]
• For clustering structured data, à la Latent Dirichlet Allocation & its successors
• But an undirected model, like the Harmonium [Welling, Rosen-Zvi, Hinton, 2005]
• But trained by a “multi-conditional” objective:
  O = P(A|B,C) · P(B|A,C) · P(C|A,B)
  e.g., A, B, C are different modalities
See also [Minka 2005 TR] and [Pereira & Gordon 2006 ICML].

Objective Functions for Parameter Estimation
• Traditional, joint training (e.g., naive Bayes, most topic models); traditional mixture models (e.g., LDA)
• Traditional, conditional training (e.g., MaxEnt classifiers, CRFs); conditional mixtures (e.g., Jebara’s CEM, McCallum’s CRF string edit distance, …)
• New, multi-conditional:
  – mostly conditional, with generative regularization
  – for semi-supervised learning
  – for transfer learning (2 tasks, shared hidden variables)

“Multi-Conditional Learning” (Regularization) [McCallum, Pal, Wang, 2006]

Predictive Random Fields [McCallum, Wang, Pal, 2005]
Mixtures of Gaussians on synthetic data; panels show the data (classify by color), the generatively trained fit, and the multi-conditional fit.
A further panel shows the conditionally-trained fit [Jebara 1998].

Topic Words
Strong positive and negative indicators.
20 Newsgroups data, two subtopics of talk.politics.guns.

Multi-Conditional Harmonium

Multi-Conditional Mixtures vs. Harmonium on a Document Retrieval Task
[McCallum, Wang, Pal, 2005]
• Multi-conditional: multi-way conditionally trained
• Conditionally trained, to predict class labels
• Harmonium: joint, with class labels and words
• Harmonium: joint, with words but no labels

Outline
Discovering Latent Structure in Multiple Modalities
✓ Groups & Text (Group-Topic Model, GT)
✓ Nested Correlations (Pachinko Allocation, PAM)
✓ Time & Text (Topics-over-Time Model, TOT)
✓ Time & Text with Nested Correlations (PAM-TOT)
✓ Multi-Conditional Mixtures

Summary
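For two modalities, the multi-conditional objective reduces to O = P(A|B) · P(B|A). A toy sketch computing the per-instance log objective from an explicit joint table (the table and modality names are invented for illustration; the real models optimize this objective over parameters with latent mixture components):

```python
import math

def multi_conditional_loglik(joint, a, b):
    """Two-modality multi-conditional objective for one instance:
    log P(a|b) + log P(b|a), computed from a joint table joint[(a, b)].
    Contrast with the traditional joint objective, log P(a, b)."""
    p_ab = joint[(a, b)]
    p_a = sum(p for (ai, _), p in joint.items() if ai == a)  # marginal P(a)
    p_b = sum(p for (_, bi), p in joint.items() if bi == b)  # marginal P(b)
    return math.log(p_ab / p_b) + math.log(p_ab / p_a)

# Hypothetical joint over a class label (modality A) and a word (modality B).
joint = {("pos", "good"): 0.4, ("pos", "bad"): 0.1,
         ("neg", "good"): 0.1, ("neg", "bad"): 0.4}
score = multi_conditional_loglik(joint, "pos", "good")
```

Joint training would maximize log P(a, b) instead; the multi-conditional objective scores how well each modality predicts the others, which is what yields the conditional-training behavior with generative regularization described above.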