Relational Graphical Models for Collaborative Filtering and Recommendation

William H. Hsu
Department of Computing and Information Sciences
Kansas State University
http://www.kddresearch.org

Sunday, 31 July 2005
IJCAI-2005 Workshop W20, Multi-Agent Information Retrieval

Multi-Agent Learning from Portal User Data
Relational Representation
Collaborative Recommendation, Information Retrieval & Extraction

This presentation is: http://www.kddresearch.org/KSU/CIS/IJCAI-20050731.ppt
Joint work with: Jeffrey M. Barber, Haipeng Guo, Andrew L. King, Julie A. Thornton

Kansas State University KDD Lab (www.kddresearch.org)
Kansas State University Department of Computing and Information Sciences

Outline
• Application: Workflow Modeling in Bioinformatics
  – Collaborative recommendation (CR)
  – Shallow CR: market basket analysis for cross-selling
  – Domain: gene expression modeling, proteomics, metabolomics
• Methodology: Relational Graphical Models (RGMs)
  – Workflow basics
  – DESCRIBER project: using RGMs for CR and info retrieval (IR)
  – Input, desired output, application, methodology, criteria
• Link Analysis Applications
  – Finding dynamic relational attributes
  – Identity uncertainty in spatial data cleaning
• Software for Building Graphical Models: BNJ
• Infrastructure and Preliminary Experiments

“Classical” Collaborative Recommendation: Clickstream Mining
• Explanation from Recommender (Decision Support) System
• Classification and Regression based upon Historical Customer Data (Market Basket Analysis)
© 2003 Amazon.com, Inc.
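A minimal sketch, in Python, of the market basket analysis named on the slide above. It counts how often items and item pairs co-occur in a set of baskets and derives pairwise association rules with support and confidence, skipping any pair that contains an infrequent item (the pruning idea used by frequent-itemset miners). The function name `pair_rules`, the thresholds, and the toy baskets are all hypothetical illustrations; none of them comes from the presentation or from any recommender system it shows.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.5, min_confidence=0.6):
    """Return sorted (antecedent, consequent, support, confidence) rules over item pairs."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))          # de-duplicate and fix an item order
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    # A pair can only be frequent if both of its items are frequent.
    frequent = {i for i, c in item_counts.items() if c / n >= min_support}

    rules = []
    for (a, b), count in pair_counts.items():
        if a not in frequent or b not in frequent:
            continue
        support = count / n                  # fraction of baskets containing both items
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]   # estimate of P(y in basket | x in basket)
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return sorted(rules)


# Hypothetical toy data: each inner list is one customer's basket.
BASKETS = [
    ["book", "dvd"],
    ["book", "dvd", "cd"],
    ["book", "cd"],
    ["dvd"],
]

if __name__ == "__main__":
    for x, y, support, confidence in pair_rules(BASKETS):
        print(f"{x} -> {y}: support={support:.2f}, confidence={confidence:.2f}")
```

On this toy data the pair (cd, dvd) appears in only one of four baskets and is dropped, while the rule cd -> book is kept with confidence 1.0. A cross-selling recommender in this style would suggest the consequent of a rule whenever the antecedent is already in the customer's basket.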
Shallow Collaborative Recommendation: Market Basket Analysis for Cross-Selling
• Cross-Selling based upon Market Basket Analysis: Apriori (Agrawal, 1993)
• Basis for Collaborative Recommendation
© 2002 Amazon.com, Inc.

Application to Computational Grid Portal: DESCRIBER Design
• Users of Information Grid & Scientific Workflow Repository
• Example Queries:
  – What experiments have found cell cycle-regulated metabolic pathways in Saccharomyces?
  – What codes and microarray data were used? How and why?
• DESCRIBER: Data Entity, Service, and Component Repository Index for Bioinformatics Experimental Research
  – Personalized Interface: User Queries & Evaluations
  – Domain-Specific Collaborative Recommendation
  – Learning over Workflow Instances and Use Cases (Historical User Requirements)
  – Decision Support Models
  – Use Case & Query/Evaluation Data
  – Interface(s) to Distributed Repository
• Domain-Specific Workflow Repositories
  – Workflows: Transactional, Objective Views
  – Workflow Components: Data Sources, Transformations; Other Services

Computational Genomics and Microarray Gene Expression Modeling
[Figure: learning environment for gene expression modeling]
• Treatment 1 (Control): Messenger RNA (mRNA) Extract 1, cDNA
• Treatment 2 (Pathogen): Messenger RNA (mRNA) Extract 2, cDNA
• DNA Hybridization Microarray (under LASER)
• D: Data (User, Microarray)
• [A] Structure Learning: G = (V, E), with nodes G1 to G5
• [B] Parameter Estimation: B = (V, E, Θ)
• Dval (Model Validation by Inference)
• Specification Fitness (Inferential Loss)
Nir’s Invited Talk at IJCAI: Wednesday, 0900 GMT 03 Aug 2005
Adapted from Friedman et al.
(2000) http://www.cs.huji.ac.il/labs/compbio/

Bioinformatics: Data Mining from DNA Hybridization Microarrays
• How do we get from microarray data (and other expression data) to a linked network?
© G. Simpson (1999). Used with permission.

Finding Dynamic Relational Attributes: From Workflows to Class Diagrams
• Transactional View (cf. UML Sequence Diagram)
  – TAVERNA Workbench, myGrid Project © 2003 Oinn et al.
• Objective View (cf. UML Class Diagram)
  – Classes: cDNA, MicroarrayExperiment, Gene, Protein, Pathway
  – Attributes: DNA-sequence, protein-ID, cDNA-sequence, protein-product, canonical-name, role, treatment, pathway, hybridization, accession-number, data, functional-description, normalization, pathway-ID, regulation, pathway-name
Legend: Relational Link (Reference Key); Probabilistic Dependency
Attribute: pathway-descriptor
DESCRIBER example schema © 2003 Hsu

DESCRIBER: Preliminary Overview of System
• Workflow Logs, Instances, Templates, Components (Services, Data Sources)
• Module 1: Collaborative Recommendation Front-End
  – Personalized Interface
  – User Queries
  – Recommendations/Evaluations (Before and After Use)
• Module 2: Learning & Validation of Relational Graphical Models (RGMs) for Experimental Workflows and Components
  – Training Data; Structure & Data; RGMs of Workflows
• Module 3: Estimation of RGM Parameters from Workflow and Component Database
  – Complete RGMs of Workflows (Data-Oriented)
• Module 4: Learning & Validation of RGMs for User Requirements
  – Training Data; Structure & Data; RGMs of Queries
• Module 5: RGM Parameters from User Query Data
  – Complete RGMs of User Queries

Workflow Management [1]: Input and Representation
• Input: Implemented Workflows
  – Workflow: operational aspect of work procedure
    • Data sources: relational databases, object stores (what)
    • Structure of tasks (what/how)
    • Operations: structured queries, data transformations (how)
    • Agents to perform tasks: web services/enactment history (who/where)
  – Examples
    • Desktop: TIGR TM4 (gene expression data analysis suite)
    • Intranet: groupware (e.g., business process management, ORACLE Workflow, IBM WebSphere MQ Workflow)
    • Online: Computational science (grid) portals
• Representation
  – SCUFL (Stevens, 2002): language (DAML+OIL, now OWL)
  – TAVERNA (Oinn, 2003): editor

Workflow Management [2]: Problem Specification: Output, Criteria
• Output
  – Relational abstraction over workflow classes
  – Underlying graphical models representing workflow instances
• Goals
  – Personalize UI
  – Assist in retrieval, development and repurposing
    • Workflows and components
    • Decrease time, maintain quality
• Criteria
  – The hard part!
  – Classical evaluation measures: accuracy, precision vs. recall, likelihood (“just a start”; Langley, 2000)
  – Utility measures: user ratings, performance
  – User modeling: usability, accessibility of grid portal

Methodology [1]: from Collaborative Recommendation to IR
• Applications to Information Retrieval
  – Development of new workflows
  – Repurposing of prefabricated workflows
  – Personalization of interfaces
• What is Collaborative?
  – Filtering of workflow components by usage
  – Recommendation via ratings: EachMovie (McJones et al., 1997), Jester (Goldberg et al., 2001), MovieLens (Miller et al., 2003)
• Multi-Agent Aspects
  – Brokered services (W3C’s Simple Object Access Protocol v1.2) http://www.w3.org/TR/soap/
  – Modeling context of data transformations, services, clients
  – Heterogeneous data at multiple levels of abstraction

Methodology [2]: Relational Models for Multi-Agent IR
• Probabilistic Inference and Representation
  – Probabilistic Relational Models (Friedman et al., 1999)
  – Single instances extracted from TAVERNA editor
  – Workflow abstractions: dropping enactment information
  – Schemata: relational skeletons, link/reference slot uncertainty
• Applied Machine Learning
  – General problem: knowledge acquisition and capture
  – Schemata: designed with grid portal builder
  – Distributions learned from data: link, reference slot
  – Clusters: workflows, components, users
  – Relations from clusters to one another

Emergent Relational Structure
• “Google Approach”
  – Link-based ranking: PageRank (Brin & Page 1998), hubs/authorities (Kleinberg 1998)
  – Using existing structure: Netscape Open Directory Project (ODP)
  – Minimal annotation: meta tags (keywords, description)
• “CiteSeer/ResearchIndex Approach”
  – Citation indexing (Lawrence et al., 1998, Giles et al., 2002)
  – Web of influence (Koller, 2001)
• Where is The Relational Structure?
  – “Does inherent relational structure exist?” (Russell, SRL-2003)
  – Sources of rich info: “link structure”
  – Richer sources? Procedural context and beyond!

Identity Uncertainty
• How to Tell When Two Descriptors Refer to Same Entity?
• Problem
  – Coalesced databases
  – Multiple sources
• Errors and Inconsistencies
  – Spatial, temporal error
  – Inconsistent descriptors
• Clues
  – Proximity in space, time
  – Similarities in values of key variables (attributes, features)
• Applications
  – Fraud detection and information security (intrusion detection)
  – Data cleaning

Spatial Data Cleaning: STARWARD
• [Map] Groundwater irrigation lifetime estimates in the Ogallala region of the Kansas High Plains aquifer [Wilson et al. 2002] http://snurl.com/39kz
  – Darkest: already depleted
  – Next darkest: 25-50 years
• Problems
  – Water well location (identity uncertainty in coalesced spatial databases)
  – Descriptive statistics (paraconsistency)
  – Spatial outlier detection

BNJ Graphical User Interface [1]: Editor
[Screenshot: ALARM Network]
© 2005 KSU Bayesian Network tools in Java (BNJ) Development Team

BNJ Graphical User Interface [2]: Graph Visualization and Algorithm Animation
[Screenshot: CPCS-54 Network]
© 2004 KSU Bayesian Network tools in Java (BNJ) Development Team

Genetic Algorithm for BN Structure Learning
[Figure: Results on ALARM-13, inferential RMSE for forward simulation. Y-axis: RMSE, 0 to 0.25. Series: Gold Standard Network; K2 Output on Optimal Ordering; K2 Output on GA Ordering. K2: 20K; FS: 1500 Samples.]
(Hsu, Guo, Perry & Stilson, 2002)

Software Packages for Building Graphical Models: BNJ, etc.
• Commercial Tools: Ergo, Netica, TETRAD, Hugin
• Open Source Tools: BNT (Murphy, 2001), gR (Lauritzen et al., 2002)
• Bayesian Network tools in Java (BNJ): Hsu et al. (2002-present)
  – Distribution page http://bnj.sourceforge.net
  – Development group http://groups.yahoo.com/group/bndev
  – Current (re)implementation projects for KSU KDD Lab
    • Structure learning and parameter estimation: Hsu, Barber
    • Fast Adaptive Importance Sampling, other sampling: King, Guo
    • Statistical Machine Translation / Information Extraction (IE) toolkit: Al-Jandal, Meyer, Pydimarri
    • Continuous time representations: Barber, Hsu
    • Formats: XML BNIF (MSBN), Netica: Guo, Barber, Hsu
    • Space-efficient DBN inference: Hsu, Barber

Acknowledgements
• Kansas State University Lab for Knowledge Discovery in Databases
  – Alumni: Guo (HKUST), Perry (Delaware), Thornton (Kansas State)
  – Graduate students: Ph.D.: Al-Jandal, Li; M.S.: Barber (Math), Meyer, Pydimarri
  – Undergraduate programmers: King (CIS); Bell, Figueroa (2005 summer interns)
• Joint Work with
  – KSU Bioinformatics Group (EECE: Das; Agronomy: Welch, Roe; Weather: Knapp)
  – NSF FIBR (Brown: Schmitt; NCSU: Purugganan; Wisconsin: Amasino) www.egad.ksu.edu
• Thanks to Collaborators and Other Research Groups
  – IJCAI-2001, AAAI/UAI/KDD-2002, IJCAI-2003 (UMBC: Kargupta, ASU: Liu; Iowa: Street; MSR: Horvitz; UConn: Santos; HKUST: Guo) www.kddresearch.org/Workshops
  – BNJ/CSR (CMU: Glymour, Scheines; IA State: Honavar, Margaritis, Tian)
  – myGrid/TAVERNA (Manchester: Goble, Stevens; EBI: Oinn; Southampton: Addis)
  – The Institute for Genomic Research (Quackenbush, Saeed)
  – Kansas Geological Survey (Bohling), Kansas Biological Survey, KU EECS
  – NSF ITR (KSU Physics: Rahman, Kara; KSU CIS: Wallentine) http://www.phys.ksu.edu/~a0kara01/ITR/