Current Research in Data Mining Research Group
Jiawei Han
Data Mining Research Group, Department of Computer Science, University of Illinois at Urbana-Champaign
Acknowledgements: NSF, ARL, NASA, AFOSR (MURI), DHS, Microsoft, IBM, Yahoo!, HP Lab & Boeing
April 8, 2015

Outline
An Introduction to Data Mining Research Group
Mining and OLAPing Information Networks
  Mining Heterogeneous Information Networks
  Mining Text-Rich Information Networks
  OLAPing (Multi-dimensional analysis) of information networks: TextCube, OLAP heterogeneous networks
Taming the Web: WINACS (Integrated mining of Web structures and contents)
Mining Cyber-Physical Systems and Networks
Conclusions

Data Mining and Data Warehousing
Jiawei Han’s Group at CS, UIUC
Mining patterns and knowledge discovery from massive data
Data mining in heterogeneous information networks
Exploring broad applications of data mining
Developed many effective data mining algorithms, e.g., FPgrowth, PrefixSpan, gSpan, StarCubing, CrossMine, RankingCube, CrossClus, RankClus, and NetClus
600+ research papers in conferences and journals
Fellow of ACM, Fellow of IEEE, ACM SIGKDD Innovation Award, W. McDowell Award, Daniel Drucker Eminent Faculty Award
Textbook, “Data Mining: Concepts and Techniques,” adopted worldwide
Project lead for NASA EventCube for Aviation Safety [2008-2012]
Director of Information Network Academic Research Center funded by Army Research Lab (ARL) [2009-2014]

Data Mining Research Group at CS, UIUC

New Books on Data Mining & Link Mining
Han, Kamber and Pei, Data Mining, 3rd ed., 2011
Yu, Han and Faloutsos (eds.), Link Mining, 2010
Sun and Han, Mining Heterogeneous Information Networks, 2012

Mining Heterogeneous Information Networks
RankClus/NetClus
RankCompete: A Competing Random Walk Model for Rank-Based Clustering
RankClass [KDD11]: Knowledge Propagation in Heterogeneous Network
[Figure: ranking example over venues (SIGMOD, VLDB, EDBT, KDD, ICDM, SDM, AAAI, ICML) and objects (Alice, Tom, Mary, Bob, Cindy, Tracy, Jack, Mike, Lucy, Jim)]
Top-5 ranked conferences
  Database: VLDB, SIGMOD, ICDE, PODS, EDBT
  Data Mining: KDD, SDM, ICDM, PKDD, PAKDD
  AI: IJCAI, AAAI, ICML, CVPR, ECML
  IR: SIGIR, ECIR, CIKM, WWW, WSDM
Top-5 ranked terms
  Database: data, database, query, system, xml
  Data Mining: mining, data, clustering, classification, frequent
  AI: learning, knowledge, reasoning, logic, cognition
  IR: retrieval, information, web, search, text

Similarity Search and Role Discovery in Information Networks
PathSim [VLDB11]: Meta Path-Guided Similarity Search in Networks
  Which images are most similar to me in Flickr?
  Path: ITI
  Path: ITIGITI
Role Discovery in Information Networks [KDD’10]
  [Figure: a “dirty” information network (imaginary) from which a cleaned/inferred adversarial network is automatically inferred, with roles Chief, Cell Lead, Insurgent]
  Advisee, top-ranked advisors, time, note:
  David M. Blei: 1. Michael I. Jordan, 01-03, PhD advisor, 2004; 2. John D. Lafferty, 05-06, Postdoc, 2006
  Hong Cheng: 1. Qiang Yang, 02-03, MS advisor, 2003; 2. Jiawei Han, 04-08, PhD advisor, 2008
  Sergey Brin: 1. Rajeev Motwani, 97-98, Unofficial advisor

Meta-Paths & Their Prediction Power
List all the meta-paths in bibliographic network up to length 4
Investigate their respective power for coauthor relationship prediction
Which meta-path has more prediction power?
How to combine them to achieve the best quality of prediction?

Relationship Prediction in Heterogeneous Info Networks
Why Prediction of Co-Author Relationship in DBLP?
Prediction of relationships between different types of nodes in heterogeneous networks
  E.g., what papers should Faloutsos write?
Traditional link prediction: homogeneous networks
  Co-author networks in DBLP, friendship networks in Facebook
Relationship prediction
  Study the roles of topological features in heterogeneous networks in predicting the co-author relationship building
  Meta-path guided prediction!
Y. Sun, et al., “Co-Author Relationship Prediction in Heterog. Bibliographic Networks”, ASONAM’11, July 2011

Guidance: Meta Path in Bibliographic Network
Relationship prediction: meta path-guided prediction
Meta path: relationships among similarly typed links share similar semantics and are comparable and inferable
[Figure: bibliographic network schema with node types venue, topic, paper, and author, and relations publish/publish-1, mention/mention-1, cite/cite-1, contain/contain-1, write/write-1]
Co-author prediction (A-P-A) using topological features also encoded by meta paths, e.g., citation relations between authors (A-P→P-A)

Case Study in CS Bibliographic Network
The learned significance for each meta path under measure “normalized path count” for HP-3hop dataset

Case Study: Predicting Concrete Co-Authors
High quality predictive power for such a difficult task
Using data in T0 = [1989; 1995] and T1 = [1996; 2002]
Predict new coauthor relationship in T2 = [2003; 2009]

iTopicModel: Model Set-Up & Objective Function
Graphical model
  θi = (θi1, θi2, …, θiT): topic distribution for document xi
  Structural layer: follows the same topology as the document network
  Text layer: follows PLSA, i.e., for each word, pick a topic z ~ multi(θi), then pick a word w ~ multi(βz)
Objective function: joint probability
  X: observed text information
  G: document network
  Parameters: θ (topic distribution), β (word distribution)
  θ is the most critical: it needs to be consistent with the text as well as the network structure
  Structure part and text part: can model them separately!

Case Study: Topic Hierarchy Building for DBLP

Probabilistic Topic Models with Network-Based Biased Propagation
Text-rich heterogeneous information network
  Ubiquitous textual documents (news, papers)
  Connect with users and other objects: topic propagation
Deng, Han et al., “Probabilistic Topic Models with Biased Propagation on Heterogeneous Information Networks”, KDD’11
How to discover latent topics and identify clusters of multi-typed objects simultaneously?
How can text data and heterogeneous information network mutually enhance each other in topic modeling and other text mining tasks?

Biased Topic Propagation
Intuition: InfoNet provides valuable information
  Different objects have their own inherent information (e.g., D with rich text and U without explicit text)
  Treat documents with rich text and other objects without explicit text differently
  Topic(D): inherent text + connected U
  Topic(U): connected D
Basic criterion (biased topic propagation)
  The topic of an object without explicit text depends on the topics of the documents it connects to
  The topic of a document is correlated with its objects to some extent, and should be principally determined by its inherent text content
  A simple and unbiased topic propagation does not make much sense

Incorporating Heterogeneous Info. Network
R(G): Biased propagation
L(C): Topic model

Experiments: DBLP & NSF Awards
Data collection: DBLP, NSF-Awards
Metrics: Accuracy (AC), Normalized mutual information (NMI)
Results

Event Cube: An Overview
Funded by NASA (2008-2010)
Event Cube: An Organized Approach for Mining and Understanding Anomalous Aviation Events
[Figure: Event Cube overview. A multidimensional text database feeds the event cube representation; cube labels include topics (Encounter, Deviation, turbulence, birds, undershoot, overshoot), locations (LAX, SJC, MIA, AUS, rolled up to CA, FL, TX), and periods (98.01, 98.02, 99.01, 99.02, rolled up to 1998, 1999), with drill-down and roll-up operations; analysis support for the analyst includes multidimensional OLAP, ranking, cause analysis, and topic summarization/comparison]

Text/Topic Cube: General Idea
Heterogeneous: categorical attributes + unstructured text
  Event report: categorical attributes (ACN, Time, Location, Place, Environment, ……) + text data
How to combine? Our solution:
  Cube: categorical attributes
  Text/topic model: unstructured text
  Measure: term/topic weights (T1: W1, T2: W2, T3: W3, …)

Effective Keyword Search
TopCells (ICDE’10): Ranking aggregated cells (objects) in TextCube.
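The text cube idea above (categorical attributes as dimensions, term weights as the measure, aggregated cells ranked for a keyword) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the sample reports, the helper names `build_cells` and `rank_cells`, and the average-keyword-count score are hypothetical and are not the TopCells scoring function or the Event Cube implementation.

```python
from collections import Counter, defaultdict

# Hypothetical event reports: categorical attributes plus free text.
REPORTS = [
    {"location": "LAX", "state": "CA", "year": 1998, "text": "turbulence on approach"},
    {"location": "SJC", "state": "CA", "year": 1998, "text": "birds near runway"},
    {"location": "MIA", "state": "FL", "year": 1999, "text": "undershoot during landing turbulence"},
    {"location": "AUS", "state": "TX", "year": 1999, "text": "overshoot on landing"},
]


def build_cells(reports, dims):
    """Group reports by the chosen dimensions.

    Each cell stores its document count and a term-count vector,
    which plays the role of the term-weight measure.
    """
    cells = defaultdict(lambda: {"docs": 0, "terms": Counter()})
    for report in reports:
        key = tuple(report[d] for d in dims)
        cells[key]["docs"] += 1
        cells[key]["terms"].update(report["text"].lower().split())
    return dict(cells)


def rank_cells(cells, keywords):
    """Rank cells by average keyword occurrences per document.

    The score is an assumption made for illustration only.
    """
    scored = []
    for key, cell in cells.items():
        hits = sum(cell["terms"][k] for k in keywords)
        scored.append((hits / cell["docs"], key))
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    by_state_year = build_cells(REPORTS, ("state", "year"))
    for score, cell in rank_cells(by_state_year, ["turbulence"]):
        print(cell, round(score, 2))
```

Calling `build_cells` with `("state",)` corresponds to a roll-up over time, while `("location", "year")` corresponds to a drill-down to individual airports; the same ranking function applies at either level.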
Example keywords: Healthcare Reform …

Effective OLAP Exploration
TEXplorer (submitted): Integrating keyword-based ranking and OLAP exploration
Example keywords: Healthcare Reform

Effective Event Tracking
PET (KDD’10): tracking popularity and textual representation of events in social communities (Twitter)
Example event: Healthcare Reform, with term groups (debate, cost, senate, …), (pass, success, law, …), (benefit, profit, effective, …)

Growing Parallel Paths (WWW 2011)
[Figure: parallel HTML tag paths (HTML, DIV, P, UL, LI, TABLE, TR, TD) across Pages A to F, and the resulting numbered list of extracted items]

Mapping Pages to Records (CIKM’10)
Structured data (columns Name, Zipcode, URL): Tarek Abdelzaher, Sarita Adve, Vikram Adve, Gul Agha, Eyal Amir, Dan Roth, Jiawei Han
Web pages: rsim.cs.illinois.edu/~sadve/, llvm.cs.uiuc.edu/~vadve/Home.html, l2r.cs.uiuc.edu/~danr/, www.cs.illinois.edu/homes/hanj/
Mappings connect the structured data to the Web pages
Database records can be found on link paths!
[Figure: link paths from the site root (cs.illinois.edu) through /people/faculty and /research/areas/data (Research, Data Mining) to individual records]
  Vikram Adve: /people/faculty/vikramadve; Personal Site: llvm.cs.uiuc.edu/~vadve/Home.html
  Dan Roth: /people/faculty/dan-roth; Personal Site: l2r.cs.uiuc.edu/~danr/
  Jiawei Han: /people/faculty/jiawei-han; Personal Site: www.cs.illinois.edu/homes/hanj/

WinaCS: Web Information Network Analysis for Computer Science
Integration of Web structure mining and information network analysis
Tim Weninger, Marina Danilevsky, et al., “WinaCS: Construction and Analysis of Web-Based Computer Science Information Networks”, ACM SIGMOD’11 (system demo), Athens, Greece, June 2011.

Discovery of Swarms and Periodic Patterns in Moving Object Data
A system that mines moving object patterns: Z. Li, et al., “MoveMine: Mining Moving Object Databases”, SIGMOD’10 (system demo)
Z. Li, B. Ding, J. Han, and R. Kays, “Mining Hidden Periodic Behaviors for Moving Objects”, KDD’10 (sub)
[Figure: bird flying paths shown on Google Earth (left); periodic patterns mined by our new method (right)]
Z. Li, B. Ding, J. Han, and R. Kays, “Swarm: Mining Relaxed Temporal Moving Object Clusters”, VLDB’10 (sub)
[Figure: Convoy discovers only restricted patterns (left); Swarm discovers more patterns (right)]

GeoTopic Discovery: Mining Spatial Text
Z. Yin, et al., “GeoTopic Discovery and Comparison”, WWW’11
Geo-tagged photos w. landscape (coast vs. desert vs. mountain)
[Figure panels: LDM, TDM, GeoFolk, LGTA]

Conclusions: Towards Mining Data Semantics in Integrated Heterog. Networks
Most data objects are linked, forming heterogeneous information networks
Most datasets can be “organized” or “transformed” into “structured” multi-typed heterogeneous info. networks
  Examples: DBLP, IMDB, Flickr, Google News, Wikipedia, …
Structures can be progressively mined from less organized data sets by info. network analysis
Surprisingly rich knowledge can be mined from such structured heterogeneous info. networks
  Clustering, ranking, classification, data cleaning, trust analysis, role discovery, similarity search, relationship prediction, ……
It is promising to mine data semantics from rich info. networks!

References for the Talk
J. Han, Y. Sun, X. Yan, and P. S. Yu, “Mining Heterogeneous Information Networks” (tutorial), KDD’10.
Ming Ji, Jiawei Han, and Marina Danilevsky, “Ranking-Based Classification of Heterogeneous Information Networks”, KDD’11.
Y. Sun, J. Han, et al., “RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis”, EDBT’09.
Y. Sun, Y. Yu, and J. Han, “Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema”, KDD’09.
Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, “PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks”, VLDB’11.
Y. Sun, R. Barber, M. Gupta, C. Aggarwal and J. Han, “Co-Author Relationship Prediction in Heterogeneous Bibliographic Networks”, ASONAM’11.
C. Wang, J. Han, et al., “Mining Advisor-Advisee Relationships from Research Publication Networks”, KDD’10.
Tim Weninger, Marina Danilevsky, et al., “WinaCS: Construction and Analysis of Web-Based Computer Science Information Networks”, ACM SIGMOD’11 (system demo).
X. Yin, J. Han, and P. S. Yu, “Truth Discovery with Multiple Conflicting Information Providers on the Web”, IEEE TKDE, 20(6), 2008.