Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis
Frizo Janssens, Wolfgang Glänzel, and Bart De Moor
Presented by Cindy Burklow
CS 685: Special Topics in Data Mining
Professor Dr. Jinze Liu
University of Kentucky
April 17th, 2008

Outline
Introduction
Motivation
Related Work
Proposed Models
Proposed Algorithms
Results: Hybrid & Dynamic Clustering
Discussion of Pros and Cons
Questions
References

Introduction
Bioinformatics …
◦ Computer Science
◦ Information Technology
◦ Solves problems in Biomedicine
Goal of Paper: Investigate
◦ Cognitive structure
◦ Dynamics of bioinformatics core
◦ Sub-disciplines
◦ ISI Web of Science & MEDLINE
◦ Retrieval of core literature in bioinformatics
MeSH = Medical Subject Headings
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 360, 368, KDD '07. ACM, San Jose, CA, August 2007.

Motivation
Bioinformatics field …
◦ Dynamic
◦ Evolving discipline
◦ Fast growth rate
Monitor current trends
Predict future direction
Decision Making
◦ Grants
◦ Business Ventures
◦ Research Opportunities
Image Reference: Janssens, Glänzel, and De Moor, KDD '07, pg. 361, 368.

Related Work
Web mining
Bibliometrics
Text mining & citation analysis
◦ Mapping of knowledge
◦ Charting science & technology fields
Textual & graph-based approaches
◦ Different perceptions of similarity between documents or groups of documents

Related Work
Establishing the Data Set
Patra & Mishra – Bibliometric Study
◦ MeSH term based
◦ Liberal delineation strategy with maximal recall
◦ Broader interpretation of bioinformatics
◦ Less restricted search strategy
◦ Broader coverage of underlying database
◦ 14,563 journal papers

Related Work
Hybrid Clustering
◦ He – Unsupervised spectral clustering of web pages
◦ Wang & Kitsuregawa – Contents-link coupled clustering algorithm for web pages
Dynamic hybrid clustering
◦ Mei & Zhai – Temporal Text Mining
◦ Kullback-Leibler divergence for coherent themes & Hidden Markov Models
◦ Griffiths & Steyvers – Latent Dirichlet Allocation with hot topics in PNAS abstracts

Models: Data Set
Bibliometric Retrieval Strategy
Novel subject delineation strategy
◦ Retrieve core literature
◦ Combines textual components & bibliometric, citation-based techniques
◦ Web of Science Edition of Thomson Scientific
7401 bioinformatics-related papers
1981 to 2004
Titles, abstracts, author keywords, and MeSH terms

Models – Text Analysis
◦ All text was indexed with the Jakarta Lucene platform
◦ Encoded in the Vector Space Model using the TF-IDF weighting scheme
◦ Text-based similarities: cosine of the angle between the vector representations of two papers
◦ Stop words excluded during indexing
◦ Porter stemmer: applied to all remaining terms from titles and abstracts
◦ Bigrams: candidate list of MeSH descriptors, author keywords, and noun phrases
◦ Latent Semantic Indexing (LSI) – 10 terms

Models – Citation Analysis
Citation graphs
Link-based algorithms (representative publications)
◦ HITS
◦ PageRank
Quantify similarities between documents
◦ Text-based
◦ Citation-based: bibliographic coupling (BC), co-citation
◦ Cosine of Boolean input vectors
Image Reference: Google Logo from http://www.google.com
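
The two similarity measures named on the Text Analysis and Citation Analysis slides can be illustrated with a minimal Python sketch. It is not the authors' Lucene-based pipeline: the function names, the plain `tf * log(N / df)` weighting, and the toy documents and reference sets are all assumptions made for illustration. It shows the cosine between TF-IDF vectors for text and the cosine between Boolean reference vectors for bibliographic coupling.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized doc.
    Weighting is tf * log(N / df), one common TF-IDF variant (an assumption)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def bc_cosine(refs_a, refs_b):
    """Bibliographic coupling as the cosine of two Boolean reference vectors:
    shared references / sqrt(|refs_a| * |refs_b|)."""
    if not refs_a or not refs_b:
        return 0.0
    return len(refs_a & refs_b) / math.sqrt(len(refs_a) * len(refs_b))

# Hypothetical toy papers, already stemmed and stop-word filtered.
docs = [
    ["protein", "structur", "predict"],
    ["protein", "structur", "align"],
    ["microarrai", "gene", "express"],
]
vecs = tfidf_vectors(docs)
text_sim_01 = cosine(vecs[0], vecs[1])  # papers sharing two terms
text_sim_02 = cosine(vecs[0], vecs[2])  # papers sharing no terms

# Hypothetical cited-reference sets for the same three papers.
refs = [{"r1", "r2", "r3"}, {"r2", "r3", "r4", "r5"}, {"r9"}]
cite_sim_01 = bc_cosine(refs[0], refs[1])
```

Both measures lie between 0 and 1, so `1 - similarity` gives a distance that can be passed to a hierarchical clustering routine.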

Models – Clustering
Agglomerative Hierarchical Clustering Algorithm with Ward's Method
Hard Clustering Algorithm:
◦ Every publication is assigned to exactly 1 cluster.
Image Reference: Clustering Analysis - http://en.wikipedia.org/wiki/Data_clustering

Models – Clustering
Optimal number of clusters
Strategy: combine distance-based & stability-based methods
◦ Silhouette curves: mean text-based and citation-based
◦ Dendrogram observation
◦ Stability diagram
Image Reference: Janssens, Glänzel, and De Moor, KDD '07, pg. 364, 365.

Proposed Algorithm – Hybrid Clustering
Cluster Input: Distances
Combining text mining and bibliometrics
◦ Integrate text & citation info early in the mapping process, before applying the clustering algorithm
Weighted linear combination
Fisher's inverse chi-square method
Image Reference: Janssens, Glänzel, and De Moor, KDD '07, pg. 362, 363.
Image Reference: Janssens, Glänzel, and De Moor, KDD '07, pg. 363.

Proposed Algorithm – Dynamic Hybrid Clustering
Goal: Match & track clusters through time
Process:
◦ Separate hybrid clustering for each period
◦ Determine optimal number of clusters: dendrogram, silhouette curve, Ben-Hur stability plot
◦ Construct complete graph: all cluster centroids from each period as nodes; edge weights as mutual cosine similarities in LSS
◦ Form cluster chains: keep edge weights > threshold T1; allow qualifying clusters to join > threshold T2
Image Reference: Janssens, Glänzel, and De Moor, KDD '07, pg. 367.
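
The two integration schemes named on the Hybrid Clustering slide (weighted linear combination and Fisher's inverse chi-square method) can be sketched as follows. This is an illustrative reading, not the paper's exact procedure: the function names, the `weight` parameter, the add-one smoothing, and the idea of turning each distance into an empirical p-value against a sample of "null" distances are assumptions. Only Fisher's statistic itself, X = -2 Σ ln p_i with a chi-square distribution on 2k degrees of freedom, is standard.

```python
import math
from bisect import bisect_right

def linear_combination(d_text, d_cite, weight=0.5):
    """Hybrid distance as a weighted sum of text and citation distances."""
    return weight * d_text + (1.0 - weight) * d_cite

def empirical_p_value(distance, null_distances_sorted):
    """Share of null distances <= the observed distance.
    Add-one smoothing keeps the p-value strictly positive."""
    rank = bisect_right(null_distances_sorted, distance)
    return (rank + 1) / (len(null_distances_sorted) + 1)

def fisher_combine(p_values):
    """Fisher's method: X = -2 * sum(ln p) is chi-square with 2k d.o.f.
    For even d.o.f. the upper tail has the closed form
    exp(-X/2) * sum_{i<k} (X/2)^i / i!."""
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # this is X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total

# Hypothetical null samples and one observed pair of papers.
null_text = sorted([0.55, 0.60, 0.70, 0.80, 0.85, 0.90, 0.92, 0.95, 0.97])
null_cite = sorted([0.75, 0.80, 0.88, 0.90, 0.93, 0.95, 0.97, 0.98, 0.99])
d_text, d_cite = 0.58, 0.78

hybrid_linear = linear_combination(d_text, d_cite, weight=0.5)
p_text = empirical_p_value(d_text, null_text)
p_cite = empirical_p_value(d_cite, null_cite)
hybrid_fisher = fisher_combine([p_text, p_cite])  # small value = unusually close pair
```

Under this reading, the combined p-value plays the role of the integrated distance handed to the clustering algorithm: a pair that is unusually close in both the text and the citation view receives a small value.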
Image Reference: Janssens, Glänzel, and De Moor, KDD '07, pg. 367.

Results – Hybrid Clustering
Silhouette Curve
Image Reference: Janssens, Glänzel, and De Moor, KDD '07, pg. 364.

Results – Hybrid Clustering
Silhouette Curve
Image Reference: Janssens, Glänzel, and De Moor, KDD '07, pg. 364.

Results – Hybrid Clustering
Stability
Image Reference: Janssens, Glänzel, and De Moor, KDD '07, pg. 365.

Results – Hybrid Clustering
Dendrogram
Image Reference: Janssens, Glänzel, and De Moor, KDD '07, pg. 365.

Results – Hybrid Clustering
Cluster Characterization (cluster label: number of papers)
◦ Systems biology & molecular networks: 694
◦ Genome sequencing & assembly: 640
◦ RNA structure prediction: 205
◦ Gene / promoter / motif prediction: 995
◦ Phylogeny & evolution: 749
◦ Multiple sequence alignment: 713
◦ Protein structure prediction: 1167
◦ Microarray analysis: 1147
◦ Molecular DBs & annotation platforms: 1091

Results – Dynamic Clustering
Histogram
Image Reference: Janssens, Glänzel, and De Moor, KDD '07, pg. 365.

Results – Dynamic Clustering
Cluster Chains
Image Reference: Janssens, Glänzel, and De Moor, KDD '07, pg. 367.
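
Cluster chains of this kind could be formed roughly as in the sketch below, which follows the "Form cluster chains" step of the Dynamic Hybrid Clustering slide only in part. Several things are assumptions: the centroids are taken to be already projected into the latent semantic space, edges are considered only between clusters from different periods, chains are read off as connected components of the graph thresholded at T1, and the second step (letting further clusters join above T2) is left out because the slides do not spell out its rule. The period labels and vectors are made up.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def cluster_chains(centroids, t1):
    """centroids: dict {(period, cluster_id): vector}.
    Links clusters from different periods whose cosine exceeds t1 and
    returns the connected components with more than one cluster."""
    parent = {node: node for node in centroids}

    def find(node):
        # Union-find root lookup with path halving.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in combinations(centroids, 2):
        if a[0] == b[0]:
            continue  # assumption: never link clusters of the same period
        if cosine(centroids[a], centroids[b]) > t1:
            parent[find(a)] = find(b)

    groups = {}
    for node in centroids:
        groups.setdefault(find(node), []).append(node)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)

# Hypothetical centroids for three periods P1..P3.
centroids = {
    ("P1", 0): [0.90, 0.10, 0.00],
    ("P1", 1): [0.00, 0.20, 0.90],
    ("P2", 0): [0.80, 0.20, 0.10],
    ("P2", 1): [0.10, 0.90, 0.10],
    ("P3", 0): [0.85, 0.15, 0.05],
}
chains = cluster_chains(centroids, t1=0.95)
```

In the toy data only cluster 0 of each period survives the threshold, so a single chain spanning P1, P2, and P3 is returned; the remaining clusters stay unchained.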

Yearly Publication Output among Cluster Chains
Image Reference: Janssens, Glänzel, and De Moor, KDD '07, pg. 368.

Dynamic Term Network
Image Reference: Janssens, Glänzel, and De Moor, KDD '07, pg. 368.

Pros & Cons
Pros
◦ Offers fresh perspective on clustering
◦ Integrates various techniques
◦ Provides insight into bioinformatics
Cons
◦ Challenge of selecting the optimal number of clusters still exists
◦ Many steps are required to implement their approach

Questions

References
Janssens, F., Glänzel, W., and De Moor, B. 2007. Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis. In Proceedings of the 13th ACM SIGKDD international Conference on Knowledge Discovery and Data Mining (San Jose, California, USA, August 12 - 15, 2007). KDD '07. ACM, New York, NY, 360-369. DOI= http://doi.acm.org/10.1145/1281192.1281233
ISI Web of Science Image: http://apps.isiknowledge.com/WOS_GeneralSearch_input.do?highlighted_tab=WOS&product=WOS&last_prod=WOS&SID=3DamC8GFDKmpBLhFOIM&search_mode=GeneralSearch
PubMed Image: http://www.ncbi.nlm.nih.gov/pubmed/
The Apache Jakarta Project: http://lucene.apache.org/java/1_4_3/
Fisher's Method: http://en.wikipedia.org/wiki/Fisher%27s_method
"Data Mining - Concepts and Techniques" by Han and Kamber, Morgan Kaufmann, 2006. (ISBN: 1-55860-901-6)