Finding Natural Communities

advertisement
Finding Natural Communities
Graph G
Perturb data by randomly
extracting 5%.
5% Do this n times
Subgraphs G1
G2
G3 … Gn
Cluster set of subgraphs to get a
set of cluster trees
Trees T1
T2
T3 … Tn
Choose one of the
trees as a base tree
Trees TB = T1
T2
T3 … Tn
For some cluster CB in TB, find its best match
(max(Overlap/Size)) in all remaining trees
Cluster CB
C2
C3 … Cn
If the match between CB and its best match in some tree CJ is greater than p,
then these two communities are considered the same.
If there are f such “same” communities, then CB is considered natural
For the CiteSeer Data, both p and f were set to 70%.
Finding All Natural Communities
By using only one base in our filtering process, we will miss in the region of
1-f of the natural communities in the data by definition.
We must repeat the process with different base trees to arrive at the full set
Using five different bases reduces the probability of missing a natural
community to an insignificant number when f = 70%: 0.35 = 0.2%.
Redundant Natural Communities
Due to the nature of the agglomerative merge, it is possible to find many
natural communities that differ only by a small amount
We merge all pairs of natural communities that match by at least 70% by
taking the larger of the two. This 70% threshold corresponds with our p
parameter
Finding Natural Communities on the CiteSeer Data
Given the CiteSeer data, we removed papers referenced by only 1 paper, and
papers that reference fewer than 5 papers.
We set the number of trees to compare over (n) to 100. We found this to be a
point of stabilization.
Performing the natural community filter twice on two separate sets of trees,
we found 158 natural communities in the 1st run and 150 in the 2nd run.
Match Between Natural Community and its Representative in the Base Tree
Recall that we merge natural communities from several trees into one
common base tree by finding its best match in the base tree
This best match consistently hovers around the 70% boundary at which we
consider two clusters similar. The percentage average of all such matches is
68%.
If C1 and C2 are the best matches of a base community, how well do C1 and
C2 match?
70%
70%
C1
??
CI
?? > 70% 85% of the time
When ?? < 70, the match is 66% on average
CJ
How well do communities from the two experiments match?
After finding the two sets of natural communities from our two experimental
runs, we proceeded to examine their similarity.
To compare these two sets of natural communities, we used the concept of a
two-way best match:
Given two clusters, C1 from the first set of natural communities and C2 from
the second, if the best match of C1 is C2 and the best match of C2 is C1, then
these two clusters form a two-way best match, and are considered
essentially the same community.
80% of the natural communities formed a two-way match
The remaining 20%, which did not appear in both experimental runs were
found to be threshold communities. Show Chart
We can conclude that the natural communities found in the two experimental
runs are essentially the same, with the only discrepancies occurring with
natural communities that barely made it in one experimental run and barely
missed in the other. Due to the fact that we have no true bound for
quantifying natural communities, such mismatches concerning natural
communities that hover around the boundary f are to be expected.
(*We were also concerned with clusters that were found to be natural in one tree but did
not map well into the base tree Tb . Thus, when mapped into Tb , this mapped version of the
natural communities no longer qualifies as natural, and may display strengths below 40.
However, even though these communities may not map well, in many cases, a two-way
match with another set of naturals can still be found due to the distinctiveness of each
natural community.*)
Variance in Strength
Table 1 gives the distribution of the natural communities of the two experiments as a function of
their strength.
Strength
70-75
76-80
81-85
86-90
91-100
Number of Communities
Experiment 1
55
35
17
22
29
Number of Communities
Experiment 2
47
35
21
16
31
(*The strength of its best match in Tb tends to be slightly different.*)
Although the distribution remains quite constant, the strengths of
corresponding natural communities were not found to match on a consistent
basis.
We find that the strength of a community depends on which base tree is
used.
We believe that for each natural community found there is one theoretical
natural community with a core set of vertices and different versions of this
natural community in each tree:
C  [C1, C2, …, Cn]
C != C1 != C2
We also expect that the strength of these different versions of natural
communities will never be as high as the theoretical natural community C
When we find all natural communities when p=0.8, we find that there is
almost no variance in strength.
Variance in Strength May not be a Problem
Given two clusters C1 and C2 of the same natural community C from trees T1
and T2, the best matches of these two clusters with respect to the remaining
set of trees have stayed the same in all our tests.
Due to the fact that we find natural communities in 5 different base trees and
combine them, the chance of missing a natural community due to its
variance in strength is slim
Download