A Distributed Agent Implementation of Multiple Species Flocking Model for Document Partitioning Clustering

Xiaohui Cui, Ph.D. and Thomas E. Potok, Ph.D.
Applied Software Engineering Research Group
Oak Ridge National Laboratory

OAK RIDGE NATIONAL LABORATORY
U. S. DEPARTMENT OF ENERGY

Outline
• Introduction of Dynamic Information Stream and the issues
• Bio-inspired Clustering
• MSF Clustering Model Based on Bird Flock Collective Behavior
• TFIDF not practical for dynamic data
• MSF Document Clustering Algorithm
• Multi-Agent Document Clustering Implementation
• Future works and Conclusion

Text Challenge Problem
How to effectively reduce the size of a large, streaming set of documents
“Give me the 10 documents that I need to read, out of the 1000 I received today.”
Characteristics
• A steady flow of simple documents
• Need to rapidly organize the documents into subsets
• Select representative documents from the subsets

Approach
• Use standard IR techniques to convert text to vectors
• Use unsupervised learning/text clustering to organize the documents
• Look for improvements in term weighting approaches

Standard Information Retrieval
Document 1: The Army needs sensor technology to help find improvised explosive devices
Terms: Army, Sensor, Technology, Help, Find, Improvise, Explosive, Device
Document 2: ORNL has developed sensor technology for homeland defense
Terms: ORNL, develop, sensor, technology, homeland, defense
Document 3: Mitre has won a contract to develop homeland defense sensors for explosive devices
Terms: Mitre, won, contract, develop, homeland, defense, sensor, explosive, devices
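As a minimal sketch of the "convert text to vectors" step, the term lists of the three example documents can be turned into binary term vectors. The term lists are typed in by hand here (lowercased, with "devices" normalized to "device"); stop-word removal and stemming are assumed to have happened already, and the function names `build_vocabulary` and `to_binary_vector` are hypothetical.

```python
def build_vocabulary(term_lists):
    """Collect unique terms in first-seen order across all documents."""
    vocab, seen = [], set()
    for terms in term_lists:
        for term in terms:
            if term not in seen:
                seen.add(term)
                vocab.append(term)
    return vocab


def to_binary_vector(terms, vocab):
    """Return 1 for each vocabulary term present in the document, else 0."""
    present = set(terms)
    return [1 if term in present else 0 for term in vocab]


# Term lists from the three example documents (normalized by hand).
docs = {
    "Doc 1": ["army", "sensor", "technology", "help", "find",
              "improvise", "explosive", "device"],
    "Doc 2": ["ornl", "develop", "sensor", "technology", "homeland", "defense"],
    "Doc 3": ["mitre", "won", "contract", "develop", "homeland", "defense",
              "sensor", "explosive", "device"],
}

vocab = build_vocabulary(docs.values())
vectors = {name: to_binary_vector(terms, vocab) for name, terms in docs.items()}

for name, vector in vectors.items():
    print(name, vector)
```

Building the vocabulary in first-seen order keeps the term columns stable as long as documents are processed in the same order, which makes the vectors easy to compare by eye.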

Vector Space Model

| Term       | Doc 1 | Doc 2 | Doc 3 |
|------------|-------|-------|-------|
| Army       | 1     | 0     | 0     |
| Sensor     | 1     | 1     | 1     |
| Technology | 1     | 1     | 0     |
| Help       | 1     | 0     | 0     |
| Find       | 1     | 0     | 0     |
| Improvise  | 1     | 0     | 0     |
| Explosive  | 1     | 0     | 1     |
| Device     | 1     | 0     | 1     |
| ORNL       | 0     | 1     | 0     |
| develop    | 0     | 1     | 1     |
| homeland   | 0     | 1     | 1     |
| Defense    | 0     | 1     | 1     |
| Mitre      | 0     | 0     | 1     |
| won        | 0     | 0     | 1     |
| contract   | 0     | 0     | 1     |

Standard Textual Clustering
Vector Space Model (the term table above), then Dissimilarity Matrix, then Cluster Analysis

TFIDF
$$W_{ij} = \log_2(f_{ij} + 1) \times \log_2\left(\frac{N}{n}\right)$$

Dissimilarity Matrix (documents to documents, Euclidean distance)

|       | Doc 1 | Doc 2 | Doc 3 |
|-------|-------|-------|-------|
| Doc 1 | 100%  |       |       |
| Doc 2 | 17%   | 100%  |       |
| Doc 3 | 21%   | 36%   | 100%  |

Cluster Analysis
• Most similar documents: D2 and D3
• Time complexity: O(n² log n)

Issues (1)
• Analysts are currently overwhelmed by the volume of information streams generated every day.
• Research in clustering analysis mainly focuses on how to quickly and accurately cluster static data collections.
• Research on clustering dynamic information streams is limited.

Solution: Bio-inspired Clustering
• New computational algorithms inspired by biological models, such as ant colonies, bird flocks, and swarms of bees, can solve problems in dynamic environments.
• These algorithms are characterized by the interaction of a large number of agents that follow the same rules.
• Bio-inspired clustering algorithms apply the self-organizing and collective behaviors of social insects to organize dynamically changing data.

Data Clustering by Ant Clustering Algorithm
• Deneubourg proposed the first clustering solutions inspired by ant colonies in 1991.
• Agent (ant) action rule: agents move randomly on the grid.
• Agents only recognize objects immediately in front of them.
• Agents pick up or drop an item based on a pickup probability and a drop probability.
• The movement of data objects has to be implemented through the movements of a small number of ant agents, which slows down clustering.

A New Clustering Algorithm Based on Bird Flock Collective Behavior
• Trivial behavior
• Emergent behavior = flocking

Flocking Model
The flocking model, one of the first bio-inspired computational collective behavior models, was proposed by Craig Reynolds in 1987.

Alignment: steer towards the average heading of the local flock mates
$$d(P_x, P_b) \le d_1 \cap d(P_x, P_b) \ge d_2 \Rightarrow \vec{v}_{ar} = \frac{1}{n} \sum_{x}^{n} \vec{v}_x$$

Separation: steer to avoid crowding flock mates
$$d(P_x, P_b) \le d_2 \Rightarrow \vec{v}_{sr} = \sum_{x}^{n} \frac{\vec{v}_x + \vec{v}_b}{d(P_x, P_b)}$$

Cohesion: steer towards the average position of local flock mates
$$d(P_x, P_b) \le d_1 \cap d(P_x, P_b) \ge d_2 \Rightarrow \vec{v}_{cr} = \sum_{x}^{n} (P_x - P_b)$$

[Figure: alignment, separation, and cohesion]

Flocking Demo

Multiple Species Flocking (MSF) Model
Feature similarity rule: steer away from other birds that have dissimilar features and stay close to birds that have similar features.

Issues (2)
• Every document added to or removed from the set requires recalculation of the entire VSM
• The document set must be known before the VSM can be calculated
• TFIDF is not practical for dynamic data
• Requires sequential processing
• Not good for a distributed agent approach
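The recalculation issue can be seen in a few lines of Python. This is a minimal sketch of the weighting W_ij = log2(f_ij + 1) × log2(N / n), assuming the usual reading of the symbols (the slides do not define them): f_ij is the count of term j in document i, N is the number of documents in the set, and n is the number of documents containing the term. The function name `tfidf_weights` and the toy term lists are hypothetical.

```python
import math


def tfidf_weights(docs):
    """Return one {term: weight} dict per document.

    docs is a list of term lists. Every weight depends on the size of the
    whole set (N) and on how many documents contain the term (n).
    """
    n_docs = len(docs)
    doc_freq = {}
    for terms in docs:
        for term in set(terms):
            doc_freq[term] = doc_freq.get(term, 0) + 1

    weights = []
    for terms in docs:
        row = {}
        for term in set(terms):
            f = terms.count(term)
            row[term] = math.log2(f + 1) * math.log2(n_docs / doc_freq[term])
        weights.append(row)
    return weights


docs = [
    ["army", "sensor", "technology"],
    ["ornl", "sensor", "technology"],
]
before = tfidf_weights(docs)

# One new document arrives: the weights of the OLD documents change too.
after = tfidf_weights(docs + [["mitre", "sensor", "contract"]])

print("army in Doc 1 before:", before[0]["army"])
print("army in Doc 1 after: ", after[0]["army"])
```

Because N and n are properties of the whole set, no single document's vector can be finalized until every document has been seen, which is what makes the scheme awkward for streams and for agents working on separate machines.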

Inverse Corpus Frequency
• Look at the forest, not the trees
$$W_{ij} = \log_2(f_{ij} + 1) \times \log_2\left(\frac{C + 1}{c + 1}\right)$$
• We analyzed nearly 1 million documents from 6 major research corpora
• We use this term frequency distribution as our “global” term frequency
• We found 229,023 unique terms (a large dictionary contains around 70,000 terms)

[Figure: unique term count (0 to 250,000) versus number of documents (K)]

Reed, Jiao, et al., “TF-ICF: A New Term Weighting Scheme for Clustering Dynamic Data Streams,” The Fifth International Conference on Machine Learning and Applications (2006), to appear
Reed et al., “Multi-Agent System for Distributed Cluster Analysis,” Third International Workshop on Software Engineering for Large-Scale Multi-Agent Systems (SELMAS'04), May 24-25, 2004, Edinburgh, Scotland

Why this matters
• We can now generate an accurate vector directly from a text document
• That vector can be generated wherever the document resides
• We can now use agents to create vectors from documents over a broad range of computers

Multiple Species Flocking (MSF) Document Clustering
• Each document is projected as a bird in a 2D virtual space.
• Birds that have similar document vector features (analogous to a bird's species and colony in nature) automatically group together and become a flock.
• Birds that have different document vector features stay away from this flock.

MSF Document Clustering Demo

The Document Collection Dataset

| #  | Category/Topic                   | Number of articles |
|----|----------------------------------|--------------------|
| 1  | Airline Safety                   | 10                 |
| 2  | China and Spy Plane and Captives | 4                  |
| 3  | Hoof and Mouth Disease           | 9                  |
| 4  | Amphetamine                      | 10                 |
| 5  | Iran Nuclear                     | 16                 |
| 6  | N. Korea and Nuclear Capability  | 5                  |
| 7  | Mortgage Rates                   | 8                  |
| 8  | Ocean and Pollution              | 10                 |
| 9  | Saddam Hussein and WMD           | 10                 |
| 10 | Storm Irene                      | 22                 |
| 11 | Volcano                          | 8                  |

Performance Results of MSF, K-means and Ant Clustering Algorithm
The clustering results of the K-means, Ant clustering and MSF clustering algorithms on synthetic* and document** datasets after 300 iterations

| Dataset                  | Algorithm | Average cluster number | Average F-measure value |
|--------------------------|-----------|------------------------|-------------------------|
| Synthetic Dataset        | MSF       | 4                      | 0.9997                  |
| Synthetic Dataset        | K-means   | (4)***                 | 0.9879                  |
| Synthetic Dataset        | Ant       | 4                      | 0.9823                  |
| Real Document Collection | MSF       | 9.105                  | 0.7913                  |
| Real Document Collection | K-means   | (11)***                | 0.5632                  |
| Real Document Collection | Ant       | 1                      | 0.1623                  |

* Four data types, each with 200 two-dimensional (x, y) data objects; x and y follow a Normal distribution.
** 112 news articles in 11 categories.
*** The K-means algorithm is given the number of clusters in advance.

Ref: X. Cui, J. Gao and T. E. Potok, “A Flocking Based Algorithm for Document Clustering Analysis,” Journal of Systems Architecture, Volume 52, Issues 8-9, pp. 505-515, August 2006, ISSN: 1318-7621

MSF Clustering Algorithm for Information Stream
• The MSF clustering algorithm can achieve better performance in document clustering than the K-means and Ant clustering algorithms.
• The algorithm continually refines the clustering result and reacts quickly to changes in individual data items.
• This characteristic makes the algorithm suitable for clustering dynamically changing document information, such as text information streams.

Multi-Agent Document Clustering Implementation
• JADE platform (http://jade.tilab.com/)
• Linux cluster: one main node and three client nodes, connected by a Gigabit Ethernet switch.
• Each node contains a single 2.4 GHz Intel Pentium IV processor and 512 MB of memory.
• Document datasets are derived from TREC collections. TREC: Text REtrieval Conference (http://trec.nist.gov/)
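To make the per-document agent behavior more concrete, here is a minimal single-process Python sketch of one synchronous flock update with a feature similarity rule. It is not the JADE implementation, and it does not reproduce the authors' exact force formulas: the constants `D1`, `D2`, `SIM_THRESHOLD`, `MAX_SPEED`, the `W_*` weights, the use of cosine similarity, and the names `Boid`, `step`, and `distance` are all illustrative assumptions.

```python
import math

# All constants are illustrative assumptions, not values from the slides.
D1 = 30.0            # neighbourhood radius
D2 = 5.0             # crowding radius
SIM_THRESHOLD = 0.5  # cosine similarity at or above which documents are "similar"
MAX_SPEED = 2.0
W_ALIGN, W_COHERE, W_SEPARATE, W_FEATURE = 0.05, 0.01, 1.0, 0.5


def cosine(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class Boid:
    """One document, represented as a bird in the 2D virtual space."""

    def __init__(self, features, pos, vel):
        self.features = features
        self.pos = list(pos)
        self.vel = list(vel)


def distance(p, q):
    return math.hypot(p.pos[0] - q.pos[0], p.pos[1] - q.pos[1])


def step(boids):
    """Advance every boid by one time step (all velocities computed first)."""
    new_vels = []
    for b in boids:
        ax = ay = cx = cy = sx = sy = fx = fy = 0.0
        n = 0
        for x in boids:
            if x is b:
                continue
            dx, dy = x.pos[0] - b.pos[0], x.pos[1] - b.pos[1]
            dist = math.hypot(dx, dy)
            if dist == 0.0 or dist > D1:
                continue  # coincident or outside the neighbourhood
            ux, uy = dx / dist, dy / dist
            if dist <= D2:
                sx, sy = sx - ux, sy - uy  # separation: move away
            else:
                n += 1
                ax, ay = ax + x.vel[0], ay + x.vel[1]  # alignment sum
                cx, cy = cx + dx, cy + dy              # cohesion sum
            # Feature similarity: attract similar documents, repel dissimilar.
            sign = 1.0 if cosine(b.features, x.features) >= SIM_THRESHOLD else -1.0
            fx, fy = fx + sign * ux, fy + sign * uy
        vx, vy = b.vel
        if n:
            vx += W_ALIGN * (ax / n - b.vel[0]) + W_COHERE * cx / n
            vy += W_ALIGN * (ay / n - b.vel[1]) + W_COHERE * cy / n
        vx += W_SEPARATE * sx + W_FEATURE * fx
        vy += W_SEPARATE * sy + W_FEATURE * fy
        speed = math.hypot(vx, vy)
        if speed > MAX_SPEED:
            vx, vy = vx * MAX_SPEED / speed, vy * MAX_SPEED / speed
        new_vels.append((vx, vy))
    for b, (vx, vy) in zip(boids, new_vels):
        b.vel = [vx, vy]
        b.pos = [b.pos[0] + vx, b.pos[1] + vy]


# Two documents with the same features, and two with orthogonal features,
# each pair starting 20 units apart and at rest.
similar = [Boid([1.0, 0.0], (0.0, 0.0), (0.0, 0.0)),
           Boid([1.0, 0.0], (20.0, 0.0), (0.0, 0.0))]
dissimilar = [Boid([1.0, 0.0], (0.0, 0.0), (0.0, 0.0)),
              Boid([0.0, 1.0], (20.0, 0.0), (0.0, 0.0))]
step(similar)
step(dissimilar)
print("similar pair distance:   ", distance(similar[0], similar[1]))
print("dissimilar pair distance:", distance(dissimilar[0], dissimilar[1]))
```

In this sketch the feature weight is chosen so that repulsion between dissimilar documents outweighs cohesion anywhere inside the neighbourhood radius; a different balance would let mixed flocks form. In a distributed setting, each boid agent would run the inner loop for itself using neighbour positions obtained from other nodes.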

Current and Future Works
• Switched the agent platform from JADE to our lightweight agent platform (ORMAC).
• Built a control agent that automatically generates and deploys flock agents on all available nodes of a 135-node cluster.
• Built agents that monitor updates on several popular Internet news websites, collect the news, and feed it into the system in real time.
• Building a better GUI

Conclusion
• The heuristic search mechanism of the flocking model helps document agents quickly form flocks and react to changes in any individual document.
• The TFICF vector space model, an enhancement of TFIDF, allows parallel or distributed algorithms for information stream clustering.
• The agent architecture provides an analysis approach that can run on cluster computers.

Thank you!

The architectures: the central model and the distributed model
[Figure: architecture diagrams of the single processor model and the distributed model. Labels: Head Node, Node1, Node2, Node3, JADE main container, JADE container, JADE system agents, location proxy agents, boid agents]