UMass and Learning for CALO
Andrew McCallum
Information Extraction & Synthesis Laboratory
Department of Computer Science, University of Massachusetts

Outline
• CC-Prediction – Learning in the wild from user email usage
• DEX – Learning in the wild from user corrections... as well as from KB records filled by other CALO components
• Rexa – Learning in the wild from user corrections to coreference... propagating constraints in a Markov Logic-like system that scales to ~20 million objects
• Several new topic models – Discovering interesting, useful structure without the need for supervision... learning from newly arrived data on the fly

CC Prediction Using Various Exponential-Family Factor Graphs
Learning to keep an organization connected & avoid stove-piping. First steps toward ad-hoc team creation.
Learning in the wild from the user's CC behavior, and from other parts of the CALO ontology.

Graphical Models for Email
• Compute P(y|x) for CC prediction
[Figure: factor-graph notation – squares are local functions, circles are random variables, a plate denotes N replications.]
The graph describes the joint distribution of random variables in terms of the product of local functions.
[Figure: email model – the recipient variable y is connected to local functions over the Nb body words (xb), the Ns subject words (xs), and the Nr-1 other recipients (xr).]
• Local functions facilitate system engineering through modularity

Document Models
• Models may include relational attributes
[Figure: document model – the author variable y is connected to local functions over the Nt title words, the Nb abstract and body words, the Na-1 co-authors, and the references.]
• We can optimize P(y|x) for classification performance, and P(x|y) for model interpretability and parameter transfer (to other models)

CC Prediction and Relational Attributes
[Figure: model for a target recipient y, with local functions over the Ns subject words, the Nb thread-body words, the Nr-1 other recipients, thread-relation words, and recipient-relation words.]
• Thread relations – e.g., was a given recipient ever included on this email thread?
• Recipient relationships – e.g., does one of the other recipients report to the target recipient?
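A minimal sketch of such a modular conditional model for CC prediction, assuming a simple log-linear form with one feature family per local function (body words, subject words, other recipients); the feature names and weights here are illustrative assumptions, not CALO's actual model:

```python
import math

def cc_score(candidate, email, weights):
    """Unnormalized log-score for 'should candidate be CC'd?'.
    Each feature family mirrors one local function in the factor
    graph: body words, subject words, and the other recipients."""
    s = 0.0
    for w in email["body"]:
        s += weights.get(("body", w, candidate), 0.0)
    for w in email["subject"]:
        s += weights.get(("subject", w, candidate), 0.0)
    for r in email["recipients"]:
        if r != candidate:
            s += weights.get(("recip", r, candidate), 0.0)
    return s

def cc_probability(candidate, email, weights, all_people):
    """P(y = candidate | x), normalized over all known people."""
    scores = {p: cc_score(p, email, weights) for p in all_people}
    m = max(scores.values())                     # for numerical stability
    z = sum(math.exp(v - m) for v in scores.values())
    return math.exp(scores[candidate] - m) / z
```

Because each feature family is a separate loop, a new local function (e.g., thread relations) can be bolted on without touching the rest of the model, which is the modularity point above.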
CC-Prediction Learning in the Wild
• As documents are added to Rexa, models of expertise for authors grow
• As DEX obtains more contact information and keywords, organizational relations emerge
• Model parameters can be adapted on-line
• Priors on parameters can be used to transfer learned information between models
• New relations can be added on-line
• Modular model construction and intelligent model optimization enable these goals

CC Prediction: Upcoming Work on Multi-Conditional Learning
A discriminatively-trained topic model, discovering low-dimensional representations for transfer learning and improved regularization & generalization.

Objective Functions for Parameter Estimation
• Traditional, joint training (e.g., naive Bayes, most topic models)
• Traditional mixture model (e.g., LDA)
• Traditional, conditional training (e.g., MaxEnt classifiers, CRFs)
• Conditional mixtures (e.g., Jebara's CEM, McCallum's CRF string edit distance, ...)
• New, multi-conditional:
  – mostly conditional, with generative regularization
  – for semi-supervised learning
  – for transfer learning (2 tasks, shared hidden variables)

"Multi-Conditional Learning" (Regularization) [McCallum, Pal, Wang, 2006]

Predictive Random Fields: mixture of Gaussians on synthetic data [McCallum, Wang, Pal, 2005]
[Figure: the same synthetic data (classify by color) fit three ways – generatively trained, conditionally trained [Jebara 1998], and multi-conditional.]

Multi-Conditional Mixtures vs.
Harmonium on a document retrieval task [McCallum, Wang, Pal, 2005]
[Figure: retrieval results for four models – multi-conditional (multi-way conditionally trained); conditionally trained to predict class labels; Harmonium trained jointly with class labels and words; Harmonium trained jointly with words, no labels.]

DEX
Beginning with a review of previous work, then new work on record extraction, with the ability to leverage new KBs in the wild, and for transfer.

System Overview
[Figure: pipeline – email and the WWW feed a CRF for contact-info and person-name extraction; extracted names drive keyword extraction, name coreference, homepage retrieval, and social network analysis.]

An Example
To: "Andrew McCallum" mccallum@cs.umass.edu
Subject: ...
Search for new people:
  First Name: Andrew
  Middle Name: Kachites
  Last Name: McCallum
  JobTitle: Associate Professor
  Company: University of Massachusetts
  Street Address: 140 Governor’s Dr.
  City: Amherst
  State: MA
  Zip: 01003
  Company Phone: (413) 545-1323
  Links: Fernando Pereira, Sam Roweis, ...
  Key Words: Information extraction, social network, ...

Example keywords extracted
  William Cohen – logic programming, text categorization, data integration, rule learning
  Daphne Koller – Bayesian networks, relational models, probabilistic models, hidden variables
  Deborah McGuiness – semantic web, description logics, knowledge representation, ontologies
  Tom Mitchell – machine learning, cognitive states, learning apprentice, artificial intelligence

Summary of Results
Contact info and name extraction performance (25 fields):
        Token Acc   Field Prec   Field Recall   Field F1
  CRF   94.50       85.73        76.33          80.76

Expert Finding: When solving some task, find friends-of-friends with relevant expertise.
Avoid "stove-piping" in large organizations by automatically suggesting collaborators.
Given a task, automatically suggest the right team for the job. (Hiring aid!)
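The Field F1 in the results table is the harmonic mean of field precision and recall; a quick check against the numbers above:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Field Prec 85.73 and Field Recall 76.33 give Field F1 of about 80.76
```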
Social Network Analysis: Understand the social structure of your organization. Suggest structural changes for improved efficiency.

Importance of accurate DEX fields in IRIS
• Information about
  – people
  – contact information
  – email
  – affiliation
  – job title
  – expertise
  – ...
  is key to answering many CALO questions... both directly, and as supporting inputs to higher-level questions.

Learning Field Compatibilities in DEX
Source text: Professor Jane Smith, University of California, 209-555-5555. Professor Smith chairs the Computer Science Department. She hails from Boston, ...her administrative assistant ...
Extracted Record:
  Name: Jane Smith, John Doe
  JobTitle: Professor, Administrative Assistant
  Company: U of California
  Department: Computer Science
  Phone: 209-555-5555, 209-444-4444
  City: Boston
Compatibility Graph:
[Figure: extracted field values as nodes with learned compatibility weights on edges – e.g., Jane Smith / Professor: .8; John Doe / Administrative Assistant: .4; Jane Smith / Administrative Assistant: -.5; John Doe / Professor: -.6; University of California and the phone numbers link to both names. Positive edges pull fields into the same record; negative edges push them apart.]

Learning Field Compatibilities in DEX (continued)
Same extracted record as above. Source text, as before: Professor Jane Smith, University of California, 209-555-5555. Professor Smith chairs the Computer Science Department.
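The compatibility graph can be read as input to a clustering step that groups field values into records. A greedy agglomerative sketch, assuming a pairwise score function; the weights in the usage below echo the example edges, and the real DEX inference is more sophisticated than this:

```python
from itertools import combinations

def partition_fields(fields, score):
    """Greedily merge clusters of field values while some merge has
    positive total compatibility; score(a, b) is a learned weight
    (positive = likely same record, negative = likely different)."""
    clusters = [{f} for f in fields]
    while True:
        best, best_gain = None, 0.0
        for i, j in combinations(range(len(clusters)), 2):
            gain = sum(score(a, b) for a in clusters[i] for b in clusters[j])
            if gain > best_gain:
                best, best_gain = (i, j), gain
        if best is None:
            return clusters
        i, j = best
        clusters[i] |= clusters[j]
        del clusters[j]
```

With weights like Jane Smith / Professor = .8 and John Doe / Professor = -.6, the field values separate into the two person records shown in the example.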
She hails from Boston, ...her administrative assistant ...
  City: Boston
[Figure: the resolved records – {Jane Smith, Professor, Computer Science, University of California, 209-555-5555, Boston} and {John Doe, Administrative Assistant, University of California, 209-444-4444}.]

Learning Field Compatibilities in DEX
• ~35% error reduction over transitive closure
• Qualitatively better than the heuristic approach
• Mine knowledge bases from other parts of IRIS to learn compatibility rules among fields
  – "Professor" job title co-occurs with "University" company
  – Area code / city compatibility
  – "Senator" job title co-occurs with "Washington, D.C." location
• In the wild – as the user adds new fields & makes corrections, DEX learns from this KB data
• Transfer learning – between departments/industries

Rexa
A knowledge base of publications, grants, people, their expertise, topics, and inter-connections.
Learning for information extraction and coreference.
Incrementally leveraging multiple sources of information for improved coreference.
Gathering information about people's expertise and coauthor & citation relations.
First a tour of Rexa, then slides about learning.

Previous Systems
[Figure: screenshots of previous systems – Research Paper entities connected by Cites relations.]

More Entities and Relations
[Figure: richer schema – Research Paper, Grant, Venue, Person, University, and Groups entities; Cites and Expertise relations.]

Learning in Rexa
Extraction, coreference.
In the wild: re-adjusting the KB after corrections from a user.
Also, learning research topics/expertise and their interconnections.

(Linear-Chain) Conditional Random Fields [Lafferty, McCallum, Pereira 2001]
Undirected graphical model, trained to maximize the conditional probability of the output sequence given the input sequence:
  p(y|x) = (1/Z_x) ∏_t Φ(y_t, y_{t-1}, x, t)
  where Φ(y_t, y_{t-1}, x, t) = exp( Σ_k λ_k f_k(y_t, y_{t-1}, x, t) )
[Figure: finite-state model and graphical model – the output sequence (FSM states) y_{t-1} ... y_{t+3}, labeled OTHER, PERSON, OTHER, ORG, TITLE.]
[Figure: the corresponding input sequence (observations) x_{t-1} ... x_{t+3} = "said", "Jones", "a", "Microsoft", "VP".]

Wide-spread interest, positive experimental results in many applications:
• Noun phrase, named entity [HLT'03], [CoNLL'03]
• Protein structure prediction [ICML'04]
• IE from bioinformatics text [Bioinformatics '04], ...
• Asian word segmentation [COLING'04], [ACL'04]
• IE from research papers [HLT'04]
• Object classification in images [CVPR '04]

IE from Research Papers [McCallum et al '99]

IE from Research Papers
                                     Field-level F1
  Hidden Markov Models (HMMs)        75.6   [Seymore, McCallum, Rosenfeld, 1999]
  Support Vector Machines (SVMs)     89.7   [Han, Giles, et al, 2003]
  Conditional Random Fields (CRFs)   93.9   [Peng, McCallum, 2004]
(CRFs give a 40% error reduction over SVMs; word-level accuracy is >99%.)

Joint segmentation and co-reference
Extraction from and matching of research paper citations:
  Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.
  Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.
[Figure: factor graph coupling segmentation (s, o), citation attributes (c), co-reference decisions (y), database field values, and world knowledge.]
35% reduction in co-reference error by using segmentation uncertainty.
6-14% reduction in segmentation error by using co-reference.
Inference: a variant of Iterated Conditional Modes [Besag, 1986].
[Wellner, McCallum, Peng, Hay, UAI 2004]; see also [Marthi, Milch, Russell, 2003]

Rexa: Learning in the Wild from User Feedback
• Coreference will never be perfect.
• Rexa allows users to enter corrections to coreference decisions.
• Rexa then uses this feedback to
  – re-consider other inter-related parts of the KB
  – automatically make further error corrections by propagating constraints
• (Our coreference system uses underlying ideas very much like Markov Logic, and scales to ~20 million mention objects.)
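The linear-chain CRF probability above can be sketched directly in code. Here Z_x is computed by brute-force enumeration over state sequences for clarity (real implementations use the forward algorithm), and log_phi stands in for Σ_k λ_k f_k:

```python
import math
from itertools import product

def crf_log_prob(y, x, states, log_phi):
    """log p(y|x) = sum_t log Phi(y_t, y_{t-1}, x, t) - log Z_x,
    where Phi = exp(sum_k lambda_k f_k(...)) is supplied as log_phi."""
    def seq_score(ys):
        return sum(log_phi(ys[t], ys[t - 1] if t > 0 else None, x, t)
                   for t in range(len(x)))
    # Z_x sums over all |states|^len(x) label sequences (log-sum-exp).
    scores = [seq_score(ys) for ys in product(states, repeat=len(x))]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return seq_score(y) - log_z
```

A feature f_k that fires when x_t is "Jones" and y_t is PERSON, given a positive weight, pushes probability toward the PERSON labeling in the figure.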
Finding Topics in 1 Million CS Papers
200 topics & keywords automatically discovered.

Topical Transfer
Citation counts from one topic to another. Map "producers and consumers".

Topical Diversity
Find the topics that are cited by many other topics, measuring diversity of impact.
Entropy of the topic distribution among papers that cite this paper (this topic).
[Figure: example topics with low vs. high citation entropy ("Low Diversity" vs. "High Diversity").]

Some New Work on Topic Models
• Robustly capturing topic correlations – Pachinko Allocation Model
• Capturing phrases in topic-specific ways – Topical N-Gram Model

[Figure: a pachinko machine.]

Pachinko Allocation Model [Li, McCallum, 2005]
[Figure: a DAG over word1 ... word8 – upper interior nodes hold distributions over distributions over topics; middle nodes hold distributions over topics (mixtures, representing topic correlations); leaves hold distributions over words (like "LDA topics").]
Some interior nodes could contain one multinomial, used for all documents (i.e.
a very peaked Dirichlet).

Topic Coherence Comparison
  LDA 20 ("models, estimation, stopwords"): models, model, parameters, distribution, bayesian, probability, estimation, data, gaussian, methods, likelihood, em, mixture, show, approach, paper, density, framework, approximation, markov
  LDA 100 ("estimation, some junk"): estimation, likelihood, maximum, noisy, estimates, mixture, scene, surface, normalization, generated, measurements, surfaces, estimating, estimated, iterative, combined, figure, divisive, sequence, ideal
  PAM 100 ("estimation"): estimation, bayesian, parameters, data, methods, estimate, maximum, probabilistic, distributions, noise, variable, variables, noisy, inference, variance, entropy, models, framework, statistical, estimating

Example super-topic (weight, then top words; some words truncated in the original):
  33  input hidden units function number
  27  estimation bayesian parameters data m
  24  distribution gaussian markov likelihood
  11  exact kalman full conditional determini
   1  smoothing predictive regularizers inter

Topic Correlations in PAM
5000 research paper abstracts, from across all of CS.
Numbers on edges are super-topics' Dirichlet parameters.

Likelihood Comparison
[Figure: held-out likelihood, varying the number of topics.]

Want to Model Trends over Time
• Is the prevalence of a topic growing or waning?
• A pattern may appear only briefly
  – capture its statistics in a focused way
  – don't confuse it with patterns elsewhere in time
• How do roles, groups, and influence shift over time?

Topics over Time (TOT) [Wang, McCallum 2006]
[Figure: TOT plate diagram – a Dirichlet prior over each document's multinomial over topics; for each of the Nd words in each of the D documents, a topic index z, a word w drawn from that topic's multinomial over words (T of them), and a time stamp t drawn from that topic's Beta over time; a uniform prior on the time-stamp distributions.]

State of the Union Addresses
208 addresses delivered between January 8, 1790 and January 29, 2002.
To increase the number of documents, we split the addresses into paragraphs and treated them as ‘documents’. One-line paragraphs were excluded. Stopping was applied.
• 17,156 ‘documents’
• 21,534 words
• 669,425 tokens

Example paragraph (1910):
  "Our scheme of taxation, by means of which this needless surplus is taken from the people and put into the public Treasury, consists of a tariff or duty levied upon importations from abroad and internal-revenue taxes levied upon the consumption of tobacco and spirituous and malt liquors. It must be conceded that none of the things subjected to internal-revenue taxation are, strictly speaking, necessaries. There appears to be no just complaint of this taxation by the consumers of these articles, and there seems to be nothing so well able to bear the burden without hardship to any portion of the people."

Comparing TOT against LDA
[Figure: topic distributions conditioned on time – topic mass shown as vertical height over time, NIPS vols. 1-14.]
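The TOT generative process described above (each topic carries a Beta distribution over normalized time stamps, drawn per token) can be sketched as follows; all parameter values here are toy assumptions, not fitted values:

```python
import random

def tot_generate_doc(n_words, theta, phi, psi, rng=random):
    """Generate one document under Topics over Time (TOT).
    theta: this document's multinomial over topics;
    phi[z]: topic z's multinomial over words (dict word -> prob);
    psi[z]: (a, b) parameters of topic z's Beta over time in [0, 1]."""
    doc = []
    topics = list(range(len(theta)))
    for _ in range(n_words):
        z = rng.choices(topics, weights=theta)[0]       # pick a topic
        words, probs = zip(*phi[z].items())
        w = rng.choices(words, weights=probs)[0]        # pick a word
        t = rng.betavariate(*psi[z])                    # time stamp per token
        doc.append((w, z, t))
    return doc
```

Because the Beta can be sharply peaked, a topic that appears only briefly gets its time statistics captured in a focused way instead of being smeared across the whole timeline.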