A Distributed Agent Implementation of
Multiple Species Flocking Model for
Document Partitioning Clustering
Xiaohui Cui, Ph.D. and Thomas E. Potok, Ph.D.
Applied Software Engineering Research Group
Oak Ridge National Laboratory
OAK RIDGE NATIONAL LABORATORY
U. S. DEPARTMENT OF ENERGY
Outline
 Introduction of Dynamic Information Stream and
the issues
 Bio-inspired Clustering
 MSF Clustering Model Based on Bird Flock
Collective Behavior
 TFIDF not practical for dynamic data
 MSF Document Clustering Algorithm
 Multi-Agent Document Clustering Implementation
 Future works and Conclusion
Text Challenge
 Problem
 How to effectively reduce the size of a large, streaming
set of documents
 “Give me the 10 documents that I need to read, out of
the 1000 I received today?”
 Characteristics
 A steady flow of simple documents
 Need to rapidly organize the documents into subsets
 Select representative documents from the subsets
Approach
 Use standard IR techniques to convert text
to vectors
 Use unsupervised learning/text clustering
to organize the documents
 Look for improvements in term weighting
approaches
Standard Information Retrieval
Document 1
The Army needs sensor technology to help find improvised explosive devices
Terms: Army, Sensor, Technology, Help, Find, Improvise, Explosive, Device
Document 2
ORNL has developed sensor technology for homeland defense
Terms: ORNL, develop, sensor, technology, homeland, defense
Document 3
Mitre has won a contract to develop homeland defense sensors for explosive devices
Terms: Mitre, won, contract, develop, homeland, defense, sensor, explosive, device
Vector Space Model
Term List     Doc 1   Doc 2   Doc 3
Army            1       0       0
Sensor          1       1       1
Technology      1       1       0
Help            1       0       0
Find            1       0       0
Improvise       1       0       0
Explosive       1       0       1
Device          1       0       1
ORNL            0       1       0
develop         0       1       1
homeland        0       1       1
Defense         0       1       1
Mitre           0       0       1
won             0       0       1
contract        0       0       1
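The term-by-document matrix can be rebuilt from the three term lists. The sketch below is a minimal illustration, assuming stop-word removal and stemming have already produced the term lists; the helper names `build_vocabulary` and `to_binary_vector` are hypothetical and not taken from the presentation.

```python
# Term lists as given on the "Standard Information Retrieval" slide, lower-cased.
# Assumption: stop-word removal and stemming were already applied.
doc_terms = {
    "Doc 1": ["army", "sensor", "technology", "help", "find",
              "improvise", "explosive", "device"],
    "Doc 2": ["ornl", "develop", "sensor", "technology", "homeland", "defense"],
    "Doc 3": ["mitre", "won", "contract", "develop", "homeland", "defense",
              "sensor", "explosive", "device"],
}


def build_vocabulary(docs):
    """Collect terms in first-seen order, which matches the slide's term list."""
    vocab = []
    for terms in docs.values():
        for term in terms:
            if term not in vocab:
                vocab.append(term)
    return vocab


def to_binary_vector(terms, vocab):
    """Return 1 for each vocabulary term present in the document, else 0."""
    present = set(terms)
    return [1 if term in present else 0 for term in vocab]


vocabulary = build_vocabulary(doc_terms)
vectors = {name: to_binary_vector(terms, vocabulary)
           for name, terms in doc_terms.items()}

for name, vector in vectors.items():
    print(name, vector)
```

Every vector has one entry per vocabulary term, so adding a document with a new term changes the length of all existing vectors.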
Standard Textual Clustering
Documents to Documents
TFIDF
$W_{ij} = \log_2(f_{ij} + 1) \cdot \log_2\left(\frac{N}{n}\right)$
Vector Space Model (term vectors for Doc 1, Doc 2, Doc 3)
Dissimilarity Matrix (Euclidean distance)
         Doc 1   Doc 2   Doc 3
Doc 1    100%
Doc 2    17%     100%
Doc 3    21%     36%     100%
Cluster Analysis
D1  D2  D3
Most similar documents
Time Complexity
O(n^2 log n)
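A minimal sketch of this pipeline: TFIDF weighting followed by pairwise Euclidean distances. It assumes N is the number of documents in the set and n is the number of documents containing the term, which the slide does not state explicitly. The function names are hypothetical, and the sketch does not attempt to reproduce the percentages shown in the matrix.

```python
import math
from itertools import combinations

# Term lists for the three example documents (assumed already stemmed).
docs = {
    "Doc 1": ["army", "sensor", "technology", "help", "find",
              "improvise", "explosive", "device"],
    "Doc 2": ["ornl", "develop", "sensor", "technology", "homeland", "defense"],
    "Doc 3": ["mitre", "won", "contract", "develop", "homeland", "defense",
              "sensor", "explosive", "device"],
}


def tfidf_vectors(docs):
    """W_ij = log2(f_ij + 1) * log2(N / n) for every term in every document."""
    n_docs = len(docs)  # N: assumed to be the size of the document set
    vocab = sorted({t for terms in docs.values() for t in terms})
    # n: assumed to be the number of documents that contain the term
    doc_freq = {t: sum(t in terms for terms in docs.values()) for t in vocab}
    vectors = {
        name: [math.log2(terms.count(t) + 1) * math.log2(n_docs / doc_freq[t])
               for t in vocab]
        for name, terms in docs.items()
    }
    return vocab, vectors


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


vocab, vectors = tfidf_vectors(docs)
distances = {(a, b): euclidean(vectors[a], vectors[b])
             for a, b in combinations(docs, 2)}
closest_pair = min(distances, key=distances.get)
print(closest_pair, round(distances[closest_pair], 3))
```

Both `doc_freq` and `n_docs` depend on the whole document set, so every weight has to be recomputed whenever a document arrives or leaves.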
Issues (1)
 Analysts are currently overwhelmed by the volume of information streams generated every day.
 Research in cluster analysis mainly focuses on how to quickly and accurately cluster static data collections.
 Research on clustering dynamic information streams is limited.
Solution: Bio-inspired Clustering
 New computational algorithms inspired by biological models, such as ant colonies, bird flocks, and bee swarms, can solve problems in dynamic environments.
 These algorithms are characterized by the interaction of a large number of agents that follow the same rules.
 Bio-inspired clustering algorithms apply the self-organizing and collective behaviors of social insects to organize dynamically changing data.
Data Clustering by
Ant Clustering Algorithm
Deneubourg proposed the first clustering solutions inspired by ant colonies in 1991.
Agent (ant) action rule: agents move randomly on the grid and recognize only the objects immediately in front of them. An agent picks up or drops an item according to a pickup probability and a drop probability.
Data objects can only be moved by a small number of ant agents, which slows down clustering.
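The pick-up and drop decisions can be sketched as follows. The slide does not give the probability functions; the forms below are the ones commonly cited for Deneubourg's basic model, and the constants `k1` and `k2`, the `local_density` argument (the fraction of similar items the ant perceives nearby), and the function names are illustrative assumptions.

```python
import random


def pickup_probability(local_density, k1=0.1):
    """High when few similar items are nearby (isolated items get picked up)."""
    return (k1 / (k1 + local_density)) ** 2


def drop_probability(local_density, k2=0.15):
    """High when many similar items are nearby (items are dropped into clusters)."""
    return (local_density / (k2 + local_density)) ** 2


def ant_step(carrying, cell_has_item, local_density, rng=random):
    """Return the ant's carrying state after it visits one grid cell."""
    if not carrying and cell_has_item:
        return rng.random() < pickup_probability(local_density)
    if carrying and not cell_has_item:
        dropped = rng.random() < drop_probability(local_density)
        return not dropped
    return carrying


if __name__ == "__main__":
    for density in (0.0, 0.1, 0.5, 0.9):
        print(density,
              round(pickup_probability(density), 3),
              round(drop_probability(density), 3))
```

Each item moves only when an ant happens to reach it and carry it, which is why the number of ants bounds the clustering speed.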
A New Clustering Algorithm Based
on Bird Flock Collective Behavior
Trivial Behavior
Emergent behavior = flocking
Flocking Model
The flocking model, one of the first bio-inspired computational collective behavior models, was first proposed by Craig Reynolds in 1987.
Alignment: steer towards the average heading of the local flock mates
$d(P_x, P_b) \le d_1 \cap d(P_x, P_b) \ge d_2 \Rightarrow \vec{v}_{ar} = \frac{1}{n} \sum_{x}^{n} \vec{v}_x$
Separation: steer to avoid crowding flock mates
$d(P_x, P_b) \le d_2 \Rightarrow \vec{v}_{sr} = \sum_{x}^{n} \frac{\vec{v}_x + \vec{v}_b}{d(P_x, P_b)}$
Cohesion: steer towards the average position of local flock mates
$d(P_x, P_b) \le d_1 \cap d(P_x, P_b) \ge d_2 \Rightarrow \vec{v}_{cr} = \sum_{x}^{n} (P_x - P_b)$
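A minimal sketch of the three steering rules for a single boid. Alignment and cohesion follow the rule descriptions directly; separation here uses a common position-based repulsion, which is an assumption and may differ from the expression on the slide. The ranges `d1` (sensing) and `d2` (crowding) and the function name are illustrative.

```python
import math


def flocking_components(pos_b, neighbors, d1=10.0, d2=2.0):
    """Compute (alignment, separation, cohesion) vectors for one boid.

    pos_b: (x, y) position of the boid.
    neighbors: list of ((x, y) position, (vx, vy) velocity) for other boids.
    d1: sensing range; d2: crowding range, with d2 < d1 (illustrative values).
    """
    align = [0.0, 0.0]
    separate = [0.0, 0.0]
    cohere = [0.0, 0.0]
    n_flock = 0
    for (px, py), (vx, vy) in neighbors:
        dx, dy = px - pos_b[0], py - pos_b[1]
        dist = math.hypot(dx, dy)
        if dist == 0.0 or dist > d1:
            continue  # skip coincident boids and boids outside sensing range
        if dist < d2:
            # Too close: push away, more strongly for nearer neighbors.
            separate[0] -= dx / dist ** 2
            separate[1] -= dy / dist ** 2
        else:
            n_flock += 1
            align[0] += vx   # summed here, averaged below
            align[1] += vy
            cohere[0] += dx  # P_x - P_b
            cohere[1] += dy
    if n_flock:
        align = [a / n_flock for a in align]
    return tuple(align), tuple(separate), tuple(cohere)


if __name__ == "__main__":
    others = [((1.0, 0.0), (0.0, 1.0)),    # crowding neighbor
              ((5.0, 0.0), (1.0, 0.0)),    # flock mate in sensing range
              ((20.0, 0.0), (9.0, 9.0))]   # out of range, ignored
    print(flocking_components((0.0, 0.0), others))
```

A boid's next velocity is typically a weighted sum of these three components; the weights are tuning parameters and are not given in the slides.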
[Figure: Alignment, Separation, Cohesion]
Flocking Demo
Multiple Species Flocking (MSF) Model
Feature similarity rule: steer away from other birds that have dissimilar features and stay close to those birds that have similar features.
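One way to picture the feature similarity rule is as a signed steering force: attraction towards similar neighbors and repulsion from dissimilar ones. The sketch below is an illustration only. The cosine similarity measure, the `threshold` value, the weighting scheme, and the function names are assumptions and are not taken from the presentation.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def feature_similarity_steer(pos_b, feat_b, neighbors, threshold=0.5):
    """Return an (x, y) steering vector for one bird.

    neighbors: list of ((x, y) position, feature vector) pairs.
    threshold: assumed cut-off between "similar" and "dissimilar" features.
    """
    steer = [0.0, 0.0]
    for (px, py), feat_x in neighbors:
        dx, dy = px - pos_b[0], py - pos_b[1]
        dist = math.hypot(dx, dy)
        if dist == 0.0:
            continue
        sim = cosine_similarity(feat_b, feat_x)
        # Positive weight attracts, negative weight repels.
        weight = sim if sim >= threshold else -(1.0 - sim)
        steer[0] += weight * dx / dist
        steer[1] += weight * dy / dist
    return tuple(steer)


if __name__ == "__main__":
    me = ((0.0, 0.0), [1.0, 0.0])
    similar = ((2.0, 0.0), [1.0, 0.0])
    different = ((2.0, 0.0), [0.0, 1.0])
    print(feature_similarity_steer(me[0], me[1], [similar]))    # towards +x
    print(feature_similarity_steer(me[0], me[1], [different]))  # towards -x
```

Because the rule only compares a bird with its current neighbors, it needs no global view of the data set.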
Issues (2)
 Every added or removed document from the set requires recalculation of the entire VSM
 The document set must be known before the VSM can be calculated
 TFIDF not practical for dynamic data
 Requires sequential processing
 Not good for a distributed agent approach
Inverse Corpus Frequency
• Look at the forest, not the trees
$W_{ij} = \log_2(f_{ij} + 1) \cdot \log_2\left(\frac{C + 1}{c + 1}\right)$
 We analyzed nearly 1 million documents from 6 major research corpora
 We found 229,023 unique terms (a large dictionary contains around 70,000 terms)
 We use this term frequency distribution as our “global” term frequency
[Figure: Unique Term Count (0 to 250,000) versus Number of Documents (K) (5 to 905)]
Reed, Jiao, et al., “TF-ICF: A New Term Weighting Scheme for Clustering Dynamic Data Streams,” The Fifth International Conference on Machine Learning and Applications (2006), to appear
Reed et al., “Multi-Agent System for Distributed Cluster Analysis,” Third International Workshop on Software Engineering for Large-Scale Multi-Agent Systems (SELMAS'04), May 24-25, 2004, Edinburgh, Scotland
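A minimal sketch of TF-ICF weighting for a single document. It assumes C is the number of documents in the reference corpus and c is the number of those documents that contain the term; the slide does not define either symbol. The corpus statistics below are made-up illustrative numbers, and `tficf_vector` is a hypothetical name.

```python
import math
from collections import Counter

# Hypothetical reference-corpus statistics (made-up numbers for illustration).
# CORPUS_SIZE plays the role of C; CORPUS_DOC_FREQ[term] plays the role of c.
CORPUS_SIZE = 1_000_000
CORPUS_DOC_FREQ = {"sensor": 12_000, "technology": 90_000, "ornl": 150}


def tficf_vector(terms, corpus_doc_freq=CORPUS_DOC_FREQ, corpus_size=CORPUS_SIZE):
    """W_ij = log2(f_ij + 1) * log2((C + 1) / (c + 1)) for one document.

    Only the document's own terms and the fixed corpus table are needed,
    so no other document in the stream is consulted.
    """
    counts = Counter(terms)
    return {
        term: math.log2(f + 1)
        * math.log2((corpus_size + 1) / (corpus_doc_freq.get(term, 0) + 1))
        for term, f in counts.items()
    }


if __name__ == "__main__":
    weights = tficf_vector(["ornl", "sensor", "sensor", "technology"])
    for term, weight in sorted(weights.items()):
        print(term, round(weight, 3))
```

The `+ 1` terms keep the weight finite for a term that never occurs in the reference corpus.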
Why this matters
 We can now generate an accurate vector
directly from a text document
 That vector can be generated wherever the document resides
 We can now use agents to create vectors
from documents over a broad range of
computers
Multiple Species Flocking (MSF)
Document Clustering
 Each document is projected as a bird in a 2D virtual space.
 Birds with similar document vector features (analogous to birds of the same species and colony in nature) automatically group together and become a bird flock.
 Birds with different document vector features stay away from this flock.
MSF Document Clustering Demo
The Document Collection Dataset
No.   Category/Topic                       Number of articles
1     Airline Safety                       10
2     China and Spy Plane and Captives     4
3     Hoof and Mouth Disease               9
4     Amphetamine                          10
5     Iran Nuclear                         16
6     N. Korea and Nuclear Capability      5
7     Mortgage Rates                       8
8     Ocean and Pollution                  10
9     Saddam Hussein and WMD               10
10    Storm Irene                          22
11    Volcano                              8
Performance Results of MSF, K-means and Ant Clustering Algorithm
The clustering results of the K-means, Ant clustering and MSF clustering algorithms on synthetic* and document** datasets after 300 iterations
Dataset                     Algorithm   Average cluster number   Average F-measure value
Synthetic Dataset           MSF         4                        0.9997
Synthetic Dataset           K-means     (4)***                   0.9879
Synthetic Dataset           Ant         4                        0.9823
Real Document Collection    MSF         9.105                    0.7913
Real Document Collection    K-means     (11)***                  0.5632
Real Document Collection    Ant         1                        0.1623
* Four data types, each including 200 two-dimensional (x, y) data objects. x and y are distributed according to a Normal distribution.
** 112 news article dataset, 11 categories
*** The K-means algorithm has prior knowledge of the cluster number.
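The slides report an average F-measure without defining it. The sketch below shows one commonly used class-weighted clustering F-measure; whether this is the exact measure behind the table is an assumption, and `clustering_f_measure` is a hypothetical name.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_ids):
    """Class-weighted F-measure of a clustering against known categories.

    For each category, take the best F value over all clusters, then
    average those values weighted by category size.
    """
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_ids)
    overlap = Counter(zip(true_labels, cluster_ids))
    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = overlap[(cls, clu)]
            if n_both == 0:
                continue
            precision = n_both / n_clu
            recall = n_both / n_cls
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_cls / n) * best
    return total


if __name__ == "__main__":
    labels = ["a", "a", "b", "b"]
    print(clustering_f_measure(labels, [0, 0, 1, 1]))  # perfect clustering
    print(clustering_f_measure(labels, [0, 0, 0, 0]))  # everything in one cluster
```

Under this definition a clustering that lumps every document into one cluster scores poorly when there are many categories, which is consistent with the direction of the Ant result on the document collection.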
Ref: X. Cui, J. Gao and T. E. Potok, A Flocking Based Algorithm for Document Clustering Analysis, Journal of Systems Architecture, Volume 52, Issues 8-9, pp. 505-515, August 2006, ISSN: 1318-7621
MSF Clustering Algorithm
for Information Stream
 The MSF clustering algorithm can achieve better performance in document clustering than the K-means and Ant clustering algorithms.
 The algorithm can continually refine the clustering result and quickly react to changes in individual data items. This property makes it suitable for clustering dynamically changing document information, such as a text information stream.
Multi-Agent Document
Clustering Implementation
 JADE platform (http://jade.tilab.com/)
 Linux cluster machine
 One main node and three client nodes, connected by a Gigabit Ethernet switch. Each node contains a single 2.4 GHz Intel Pentium IV processor and 512 MB of memory.
 Document datasets are derived from the TREC collections. TREC: Text REtrieval Conference (http://trec.nist.gov/)
Current and Future Work
 Switched the agent platform from JADE to our lightweight agent platform (ORMAC).
 Built a control agent that automatically generates and deploys flock agents on all available nodes of a 135-node cluster.
 Built agents that monitor news updates on several popular Internet news websites, collect the news, and feed it into the system in real time.
 Building a better GUI
Conclusion
 The heuristic searching mechanism of the flocking model helps document agents quickly form flocks and react to changes in any individual document.
 The TFICF vector space model, an enhancement of TFIDF, allows parallel or distributed algorithms for information stream clustering.
 The agent architecture provides an analysis approach that can run on cluster computers.
Thank you!
The architectures: the central model and the distributed model
[Figure: the single processor model and the distributed model. Labels: Head Node, JADE main Container, JADE system agents, Location proxy agents, JADE Container, Boid agents, Node1, Node2, Node3, …]