Burklow_TextMining - Network Protocols Lab

Dynamic Hybrid Clustering of
Bioinformatics by Incorporating
Text Mining and Citation Analysis
Frizo Janssens, Wolfgang Glänzel, and Bart De Moor
Presented by Cindy Burklow
CS 685: Special Topics in Data Mining
Professor Dr. Jinze Liu
University of Kentucky
April 17th, 2008
Outline
Introduction
 Motivation
 Related Work
 Proposed Models
 Proposed Algorithms
 Results: Hybrid & Dynamic Clustering
 Discussion of Pros and Cons
 Questions
 References

Introduction

Bioinformatics …
◦ Computer Science
◦ Information Technology
◦ Solves problems in Biomedicine

Goal of Paper: Investigate
◦ Cognitive structure
◦ Dynamics of bioinformatics core
◦ Sub-disciplines
◦ ISI Web of Science & MEDLINE
◦ Retrieval of core literature in bioinformatics
MeSH = Medical Subject Headings
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by
incorporating text mining and citation analysis, pg. 360, 368, KDD '07. ACM, San Jose, CA, August 2007.
Motivation

Bioinformatics field …
◦ Dynamic
◦ Evolving discipline
◦ Fast growth rate
Monitor current trends
 Predict future direction
 Decision Making

◦ Grants
◦ Business Ventures
◦ Research Opportunities
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by
incorporating text mining and citation analysis, pg. 361, 368, KDD '07. ACM, San Jose, CA, August 2007.
Related Work
Web mining
 Bibliometrics
 Text mining & citation analysis

◦ Mapping of knowledge
◦ Charting science & technology fields

Textual & graph-based approaches
◦ Different perceptions of similarity between
documents or groups of documents
Related Work
Establishing the Data Set
 Patra & Mishra – Bibliometric Study
◦ MeSH term based
◦ Liberal delineation strategy with maximal recall
◦ Broader interpretation of bioinformatics
◦ Less restricted search strategy
◦ Broader coverage of underlying database
◦ 14,563 journal papers
Related Work

Hybrid Clustering
◦ He – Unsupervised spectral clustering of web pages
◦ Wang & Kitsuregawa – Contents-linked coupled
clustering algorithm of web pages

Dynamic hybrid clustering
◦ Mei & Zhai – Temporal Text Mining: Kullback-Leibler
divergence for coherent themes & Hidden Markov Models
◦ Griffiths & Steyvers – Latent Dirichlet Allocation with
hot topics in PNAS abstracts
Models: Data Set
Bibliometric Retrieval Strategy

Novel subject delineation strategy
◦ Retrieve core literature
◦ Combines textual components &
bibliometrics, citation-based techniques
◦ Web of Science Edition of Thomson Scientific
 7401 bioinformatics-related papers
 1981 to 2004
 Titles, abstracts, author keywords, and MeSH
terms
Models – Text Analysis
◦ All text was indexed with Jakarta Lucene Platform
◦ Encoded in Vector Space Model using TF-IDF
weighting scheme
◦ Text-based similarities
 Cosine of angle between the vector representations of
two papers
◦ Stop words removed during indexing
◦ Porter Stemmer
 All remaining terms from titles and abstracts
◦ Bigrams
 Candidate list of MeSH descriptors, author keywords,
and noun phrases
◦ Latent Semantic Indexing (LSI) – 10 terms
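The TF-IDF weighting and cosine similarity listed above can be sketched in a few lines. This is a minimal illustration, not the Lucene pipeline used in the paper: the three toy documents are invented, the input is assumed to be already tokenized and stemmed, and the function names `tfidf_vectors` and `cosine` are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Plain tf * log(n / df) weighting (one common TF-IDF variant).
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy documents (assumed already tokenized and stemmed).
docs = [
    ["protein", "structure", "prediction"],
    ["protein", "structure", "alignment"],
    ["microarray", "gene", "expression"],
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # two papers sharing terms
sim_02 = cosine(vecs[0], vecs[2])  # two papers with no shared terms
```

Papers 0 and 1 share two of three terms and get a positive similarity, while papers 0 and 2 share none and get zero.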
Models – Citation Analysis
Citation Graphs
 Link-based algorithms

◦ HITS
◦ PageRank
Representative Publications
Quantify similarities between documents
◦ Text-based
◦ Citation-based: bibliographic coupling (BC), co-citation
◦ Cosine of Boolean input vectors
Image Reference: Google Logo from http://www.google.com
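As a rough illustration of the citation-based similarities named on this slide, the sketch below treats each paper's reference list as a Boolean vector (stored as a set) and takes the cosine between two such vectors. The reference lists and the function names are invented for the example; the paper's exact normalization may differ.

```python
import math

def boolean_cosine(a, b):
    """Cosine between two Boolean vectors given as sets of non-zero indices."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

# Hypothetical reference lists: citing paper -> set of cited works.
references = {
    "p1": {"r1", "r2", "r3", "r4"},
    "p2": {"r2", "r3", "r4", "r5"},
    "p3": {"r6"},
}

def bibliographic_coupling(p, q):
    """Two papers are coupled when their reference lists overlap."""
    return boolean_cosine(references[p], references[q])

def co_citation(x, y):
    """Two cited works are co-cited when the same papers cite both."""
    citing_x = {p for p, refs in references.items() if x in refs}
    citing_y = {p for p, refs in references.items() if y in refs}
    return boolean_cosine(citing_x, citing_y)

bc_12 = bibliographic_coupling("p1", "p2")  # 3 shared refs / sqrt(4 * 4) = 0.75
```

Bibliographic coupling compares what two papers cite; co-citation compares who cites them.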
Models – Clustering
Agglomerative Hierarchical Clustering
Algorithm with Ward’s Method
 Hard Clustering Algorithm:
◦ Every publication is assigned to exactly 1 cluster.

Image Reference: Clustering Analysis - http://en.wikipedia.org/wiki/Data_clustering
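A minimal sketch of agglomerative clustering with Ward's criterion follows. It assumes points given as coordinate tuples, whereas the paper clusters on pairwise distances; the helper name `ward_clusters` and the toy points are hypothetical. At each step the two clusters whose merge increases the within-cluster sum of squares the least are joined, and every point ends up in exactly one cluster.

```python
def ward_clusters(points, k):
    """Merge clusters greedily until k remain; returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]
    dim = len(points[0])

    def centroid(members):
        return [sum(points[i][d] for i in members) / len(members) for d in range(dim)]

    def merge_cost(a, b):
        # Ward's criterion: increase in within-cluster sum of squares.
        ca, cb = centroid(a), centroid(b)
        sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return len(a) * len(b) / (len(a) + len(b)) * sq_dist

    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: merge_cost(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
result = ward_clusters(points, 3)
```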
Models – Clustering
Optimal number of clusters
Strategy: combine distance-based & stability-based methods
◦ Silhouette curves: mean text- and citation-based
◦ Dendrogram observation
◦ Stability diagram
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining
and citation analysis, pg. 364, 365, KDD '07. ACM, San Jose, CA, August 2007.
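The silhouette part of this strategy can be sketched as follows. The function computes the mean silhouette value of a clustering from a full distance matrix; plotting this value for several cluster counts gives a silhouette curve. The one-dimensional toy data and the name `mean_silhouette` are assumptions for illustration only.

```python
def mean_silhouette(dist, labels):
    """Mean silhouette value given a full distance matrix and cluster labels."""
    n = len(labels)
    total = 0.0
    for i in range(n):
        # Sum and count of distances from i to the members of each cluster.
        sums, counts = {}, {}
        for j in range(n):
            if j == i:
                continue
            sums[labels[j]] = sums.get(labels[j], 0.0) + dist[i][j]
            counts[labels[j]] = counts.get(labels[j], 0) + 1
        own = labels[i]
        if own not in counts:
            continue  # singleton cluster: silhouette taken as 0
        a = sums[own] / counts[own]  # mean distance within own cluster
        others = [sums[c] / counts[c] for c in counts if c != own]
        if not others:
            continue  # only one cluster present
        b = min(others)  # mean distance to the nearest other cluster
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n

# Toy data: four points on a line.
values = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(x - y) for y in values] for x in values]
good = mean_silhouette(dist, [0, 0, 1, 1])  # natural grouping
bad = mean_silhouette(dist, [0, 1, 0, 1])   # mixed grouping
```

The natural grouping scores close to 1, while the mixed grouping scores below 0.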
Proposed Algorithm – Hybrid Clustering
Cluster Input: Distances
 Combining text mining and bibliometrics

◦ Integrate text & citation info early in the mapping
process, before applying the clustering algorithm

Weighted linear combination

Fisher’s inverse chi-square method
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by
incorporating text mining and citation analysis, pg. 362, 363, KDD '07. ACM, San Jose, CA, August 2007.
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by
incorporating text mining and citation analysis, pg. 363, KDD '07. ACM, San Jose, CA, August 2007.
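The two integration schemes named on this slide can be illustrated with a small sketch. The first function is a plain weighted linear combination of a text distance and a citation distance; the weight `alpha` is a hypothetical parameter. The second applies the textbook form of Fisher's method to p-values that are taken as given; how the paper turns distances into p-values is not shown on the slide, so that step is left out here.

```python
import math

def weighted_linear(d_text, d_cite, alpha=0.5):
    """Weighted linear combination of two distances (alpha is a hypothetical text weight)."""
    return alpha * d_text + (1.0 - alpha) * d_cite

def chi2_sf_even(x, dof):
    """Survival function of a chi-square variable with an even number of degrees of freedom."""
    k = dof // 2
    half = x / 2.0
    # Closed form for even dof: exp(-x/2) * sum_{i<k} (x/2)^i / i!
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))

def fisher_combine(p_values):
    """Fisher's method: X = -2 * sum(ln p_i) follows chi-square with 2k degrees of freedom."""
    stat = -2.0 * sum(math.log(p) for p in p_values)
    return chi2_sf_even(stat, 2 * len(p_values))

combined_distance = weighted_linear(0.2, 0.8, alpha=0.25)
combined_p = fisher_combine([0.05, 0.05])  # text-based and citation-based p-values
```

For two p-values the combined value reduces to p1 * p2 * (1 - ln(p1 * p2)), which is a convenient check on the implementation.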
Proposed Algorithm – Dynamic Hybrid
Clustering
Goal: Match & track clusters through time
 Process:

◦ Separate hybrid clustering for each period
◦ Determine optimal number of clusters
 Dendrogram
 Silhouette curve
 Ben-Hur stability plot
◦ Construct complete graph
 All cluster centroids from each period as nodes
 Edge weights as mutual cosine similarities in LSS
◦ Form Cluster Chains
 Keep edge weights > threshold, T1
 Allow qualifying clusters to join > threshold, T2
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining
and citation analysis, pg. 367, KDD '07. ACM, San Jose, CA, August 2007.
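The chain-forming step can be sketched under some loudly stated assumptions. Edges are drawn only between centroids from different periods, edges with cosine above T1 seed the chains, and a cluster not yet in any chain may join the chain of its most similar chained cluster if that cosine exceeds T2. This two-stage reading of T1 and T2, the centroid vectors, and the function names are all hypothetical; the paper's exact rule may differ.

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster_chains(centroids, t1, t2):
    """centroids: {(period, cluster_id): vector}. Returns a list of chains (sets of nodes)."""
    nodes = list(centroids)
    # Weighted edges between centroids of different periods (assumption).
    edges = {(a, b): cosine(centroids[a], centroids[b])
             for a, b in combinations(nodes, 2) if a[0] != b[0]}
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Step 1: strong edges (weight > t1) seed the chains.
    for (a, b), w in edges.items():
        if w > t1:
            parent[find(a)] = find(b)
    chained = {n for (a, b), w in edges.items() if w > t1 for n in (a, b)}

    # Step 2: an unchained cluster joins its best chained neighbour if weight > t2.
    for n in nodes:
        if n in chained:
            continue
        candidates = [(w, a if b == n else b) for (a, b), w in edges.items()
                      if n in (a, b) and (a if b == n else b) in chained]
        if candidates:
            w, best = max(candidates, key=lambda c: c[0])
            if w > t2:
                parent[find(n)] = find(best)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return [g for g in groups.values() if len(g) > 1]

# Invented centroids for three periods.
centroids = {
    ("P1", "a"): [1.0, 0.0, 0.0], ("P2", "a"): [0.9, 0.1, 0.0], ("P3", "a"): [0.7, 0.5, 0.0],
    ("P1", "b"): [0.0, 0.0, 1.0], ("P2", "b"): [0.0, 0.1, 1.0],
    ("P3", "c"): [0.0, 1.0, 0.0],
}
chains = cluster_chains(centroids, t1=0.95, t2=0.85)
```

In this toy run, cluster "a" is tracked across all three periods, cluster "b" across two, and the isolated cluster "c" joins no chain.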
Results – Hybrid Clustering
Silhouette Curve
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining
and citation analysis, pg. 364, KDD '07. ACM, San Jose, CA, August 2007.
Results – Hybrid Clustering
Silhouette Curve
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining
and citation analysis, pg. 364, KDD '07. ACM, San Jose, CA, August 2007.
Results – Hybrid Clustering
Stability
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining
and citation analysis, pg. 365, KDD '07. ACM, San Jose, CA, August 2007.
Results – Hybrid Clustering
Dendrogram
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining
and citation analysis, pg. 365, KDD '07. ACM, San Jose, CA, August 2007.
Results – Hybrid Clustering
Cluster Characterization (cluster: number of papers)
◦ Systems biology & molecular networks: 694
◦ Genome sequencing & assembly: 640
◦ RNA structure prediction: 205
◦ Gene / promoter / motif prediction: 995
◦ Phylogeny & Evolution: 749
◦ Multiple sequence alignment: 713
◦ Protein structure prediction: 1167
◦ Microarray analysis: 1147
◦ Molecular DBs & annotation platforms: 1091
Results – Dynamic Clustering
Histogram
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining
and citation analysis, pg. 365, KDD '07. ACM, San Jose, CA, August 2007.
Results – Dynamic Clustering
Cluster Chains
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining
and citation analysis, pg. 367, KDD '07. ACM, San Jose, CA, August 2007.
Yearly Publication Output
among Cluster chains
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining
and citation analysis, pg. 368, KDD '07. ACM, San Jose, CA, August 2007.
Dynamic Term
Network
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining
and citation analysis, pg. 368, KDD '07. ACM, San Jose, CA, August 2007.
Pros & Cons

Pros
◦ Offers fresh perspective on clustering
◦ Integrates various techniques
◦ Provides insight into bioinformatics

Cons
◦ Challenge of selecting the optimal number of
clusters still exists
◦ There are many steps required to implement
their approach
Questions
References






Janssens, F., Glänzel, W., and De Moor, B. 2007. Dynamic hybrid
clustering of bioinformatics by incorporating text mining and
citation analysis. In Proceedings of the 13th ACM SIGKDD international
Conference on Knowledge Discovery and Data Mining (San Jose,
California, USA, August 12 - 15, 2007). KDD '07. ACM, New York, NY,
360-369. DOI= http://doi.acm.org/10.1145/1281192.1281233
ISI Web of Science Image: http://apps.isiknowledge.com/WOS_GeneralSearch_input.do?highlighted_tab=WOS&product=WOS&last_prod=WOS&SID=3DamC8GFDKmpBLhFOIM&search_mode=GeneralSearch
PubMed Image: http://www.ncbi.nlm.nih.gov/pubmed/
The Apache Jakarta Project: http://lucene.apache.org/java/1_4_3/
Fisher’s Method: http://en.wikipedia.org/wiki/Fisher%27s_method
“Data Mining - Concepts and techniques” by Han and Kamber,
Morgan Kaufmann, 2006. (ISBN:1-55860-901-6)