CiteSeer Project: Existence of Natural Communities

2/18/2016
Natural communities
1
We are interested in examining structural changes over time in the CiteSeer database by
clustering the data at various points in time and tracking changes in these clusterings.
However, clustering methods are unstable, and temporal analysis requires stable
clustering in order to distinguish changes due to evolution from changes due to
clustering instability. Thus, we introduce the notion of natural communities: stable
clusters which persist despite small changes in the data.
An agglomerative clustering of the 300,000 papers in the CiteSeer collection will produce
a tree of 300,000 clusters. The vast majority of these clusters have no real significance;
they are artifacts of the clustering algorithm itself. However, if there are natural clusters
of papers in the data set, then these clusters should appear in any clustering. Assume that
there are collections of papers on certain topics such as AI or theoretical computer
science but that these collections have no natural sub collections in them. Then on
different runs of the agglomerative algorithm, the collection of AI papers and the
collection of theoretical computer science papers should always appear as clusters, but
sub clusters of these two clusters are likely to differ on different runs. This suggests
running the agglomerative algorithm many times on randomly chosen subsets of 95% of
the data and looking for clusters that tend to appear frequently. We restrict our search for
frequently occurring clusters to clusters of size at least 70 and call these clusters natural
communities. It is these communities that we will track over time.
To find these stable clusters, take the CiteSeer data and form a citation graph G. Then
perturb the data by randomly removing 5% of the vertices of the graph. By repeating this
process multiple times, we are left with a set of slightly different subgraphs {G1, G2, … ,
Gn} of G. Applying an agglomerative merge algorithm to each of these subgraphs yields
a cluster tree Ti for each subgraph Gi. By taking advantage of the agglomerative merge’s
inherent instability, we are able to discard what we believe to be unpredictable clusters by
observing commonalities amongst the set of cluster trees.
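The perturbation step can be sketched as follows. This is a minimal illustration, assuming the citation graph is held as a list of (citing, cited) pairs; the function names and the toy graph are hypothetical, and the agglomerative merge itself is not shown.

```python
import random

def perturbed_vertex_sets(vertices, n_runs, keep_fraction=0.95, seed=0):
    """Return n_runs random vertex subsets, each keeping keep_fraction of the vertices."""
    rng = random.Random(seed)
    vertices = sorted(vertices)
    n_keep = round(keep_fraction * len(vertices))
    return [set(rng.sample(vertices, n_keep)) for _ in range(n_runs)]

def induced_subgraph(edges, kept):
    """Keep only the citation edges whose two endpoints both survive the perturbation."""
    return [(u, v) for (u, v) in edges if u in kept and v in kept]

# Toy citation graph (hypothetical data): 100 papers, one outgoing citation each.
vertices = range(100)
edges = [(i, (i * 7 + 3) % 100) for i in range(100)]

subsets = perturbed_vertex_sets(vertices, n_runs=5)
subgraphs = [induced_subgraph(edges, kept) for kept in subsets]
```

Each element of `subgraphs` plays the role of one Gi and would be handed to the clustering algorithm to produce the tree Ti.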
Concept of a best match
In order to create a correspondence between clusters in two different trees, we define the
concept of a best match. The best match of C in the tree T is the community C’ in T for
which
$$\min\left( \frac{|C \cap C'|}{|C|}, \frac{|C \cap C'|}{|C'|} \right)$$
is maximized. One should observe that the best match is neither
symmetric nor transitive. Also, a best match has a parameter p associated with it,
namely
$$p = \max_{C' \in T} \min\left( \frac{|C \cap C'|}{|C|}, \frac{|C \cap C'|}{|C'|} \right).$$

Intuitively, we define C to be a natural community if, in a set of k trees, C has a best
match of value at least p in at least f percent of the trees. There are three parameters in the above
concept of a natural community. First, the number of trees must be large enough so that
the number of best matches of value at least p approximates the expected number of
matches. For fixed p and f we find that the number of best matches decreases with
increasing k until about k=100 and then levels off at approximately 120 natural
communities. We will explain this phenomenon later. The other two parameters are
somewhat arbitrary.
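The best match and its value p can be sketched as below. This is an illustration only: it assumes each community is a Python set of paper identifiers and each tree is simply the list of all its clusters; the names `match_value` and `best_match` are ours, not part of the original system.

```python
def match_value(c, c_prime):
    """min(|C n C'| / |C|, |C n C'| / |C'|) for two non-empty sets of paper ids."""
    overlap = len(c & c_prime)
    return min(overlap / len(c), overlap / len(c_prime))

def best_match(c, tree):
    """Return (C', p): the cluster of `tree` maximizing the match value, and that value."""
    best = max(tree, key=lambda c_prime: match_value(c, c_prime))
    return best, match_value(c, best)

# Toy example (hypothetical data).
tree = [{1, 2, 3, 4, 5, 6}, {1, 2, 3}, {7, 8, 9}]
c = {1, 2, 3, 4}
cluster, p = best_match(c, tree)
print(cluster, p)
```

In the toy example the larger cluster contains all of C but scores only 4/6, while the smaller cluster scores 3/4, so the smaller one is the best match.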
If the tree of natural communities is going to evolve with time, then there must be a time
when a community changes from almost natural to just natural. Thus, there will be some
arbitrariness in our definition and in the set of natural communities. We fix p to be 70%
(for large communities 60%). We define the strength of a community to be the fraction f
of trees it appears in and call a community natural if its strength is at least 70%. Just as
natural communities evolve with time, the tree of natural communities also evolves and
thus certain edges are stronger than others and the actual trees will be slightly different
for different clusterings.
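Under the same assumptions (communities as sets of paper identifiers, each tree as a list of clusters), the strength of a community and the test for being natural might look like the sketch below. The function names are hypothetical, and the lower value of p for large communities is left to the caller because the size at which a community counts as large is not specified here.

```python
def match_value(c, c_prime):
    """min(|C n C'| / |C|, |C n C'| / |C'|) for two non-empty sets."""
    overlap = len(c & c_prime)
    return min(overlap / len(c), overlap / len(c_prime))

def strength(c, trees, p=0.7):
    """Fraction of trees in which the best match of c has value at least p."""
    hits = sum(1 for tree in trees
               if max(match_value(c, c_prime) for c_prime in tree) >= p)
    return hits / len(trees)

def is_natural(c, trees, p=0.7, f=0.7, min_size=70):
    """A community is natural if it is large enough and its strength is at least f."""
    return len(c) >= min_size and strength(c, trees, p) >= f

# Toy example (hypothetical data): four trees with one cluster each.
c = set(range(10))
trees = [[set(range(10))], [set(range(8))], [set(range(5))], [set(range(12))]]
print(strength(c, trees))                 # matches of 1.0, 0.8, 0.5 and 10/12
print(is_natural(c, trees, min_size=5))
```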
Merging equivalent natural communities
Due to the nature of the agglomerative merge, which forms a new cluster by merging two
current clusters, it is possible to merge a community containing a single node or a small
number of nodes with a natural community to form a second natural community that is so
similar to the first that having both would be redundant. As a result, given the set of
natural communities N, if the value of a natural community’s best match among the set of natural
communities (not including itself) is greater than 70%, then the smaller of the two natural
communities is removed. The threshold of a 70% match corresponds to the value of p
that we introduced to compare natural communities over different trees.
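One possible way to carry out this pruning is sketched below. The order in which pairs are examined is not stated above, so the sketch assumes a greedy pass from the largest community to the smallest, dropping any community that matches an already kept (larger) one by more than the threshold. The function names are hypothetical.

```python
def match_value(c, c_prime):
    """min(|C n C'| / |C|, |C n C'| / |C'|) for two non-empty sets."""
    overlap = len(c & c_prime)
    return min(overlap / len(c), overlap / len(c_prime))

def merge_equivalent(communities, threshold=0.7):
    """Greedy pruning (an assumed ordering): keep the larger member of each near-duplicate pair."""
    result = []
    for c in sorted(communities, key=len, reverse=True):
        # Keep c only if it does not match any larger kept community above the threshold.
        if all(match_value(c, larger) <= threshold for larger in result):
            result.append(c)
    return result

# Toy example (hypothetical data): b is a with a single paper removed.
a = set(range(100))
b = set(range(99))
c = set(range(200, 280))
pruned = merge_equivalent([b, c, a])
```

Here `b` matches `a` at 99% and is dropped, while `c` is unrelated to both and is kept.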
Combining natural communities from different bases
To find the natural communities, select a tree Tb as a base of comparison with all other
trees. Let T1 be the base tree for constructing a set of natural communities. We will
identify a cluster $N_i \in T_1$ as natural if the parameter p of its best match is greater than 70% in at
least a percentage f of the set of cluster trees. We note that by using only one base, T1, in
our filtering process we will miss roughly a fraction 1 - f of the natural communities in the
data by definition, since our requirement for natural communities expects the cluster to
appear in only f percent of the trees. Thus, to arrive at the full set of natural communities,
we must repeat the process with different base trees and merge newfound natural
communities into our set until the probability of missing a natural community is minimal.
For the CiteSeer data where we set n = 100 (the point of stabilization), repeating this
process using five different bases reduces the probability of missing a natural community
to an insignificant number: with f = 70%, $(1 - f)^5 = 0.3^5 \approx 0.2\%$.
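The arithmetic behind the choice of five bases can be checked with a few lines. The sketch assumes, as the estimate above implicitly does, that a community is missed independently in each base tree with probability 1 - f; the function name is hypothetical.

```python
def miss_probability(f, n_bases):
    """Chance that a community of strength f is absent from all n_bases base trees,
    treating the base trees as independent (an assumption)."""
    return (1.0 - f) ** n_bases

# Probability of missing a community of strength 70% for one to five bases.
for n_bases in range(1, 6):
    print(n_bases, f"{100 * miss_probability(0.7, n_bases):.2f}%")
```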
Translating natural community to base one
When we find a natural community we actually have a slightly different community
appearing in each of the 100 trees. In fact, some of the papers appearing in one copy of
the natural community might not even appear in the set of 95% of the papers that were
used for another tree. Thus, we need to have some way to characterize the community by
some set of papers. We arbitrarily denote the community by the version that appears in
the base set. Note that if [C1, C2, …, C100] is the set of communities of some natural
community, then each Ci is the best match of C1, but Ci is not necessarily the best match
of some other Cj. It would be nice if best match were symmetric and transitive, but it is
not. This raises a difficulty when we add natural communities found using a different
base. In this case there is no C1, so we find the community C which is the best match of
Cb in tree one and let that be the representative of the natural community. Finding the
best matches of C in each tree may not give us the same communities as using
the base Cb. However, we will show that these are closely matched.
Data supporting the natural community concept.
How the graph was cleaned
The graph from the CiteSeer database is first filtered to take out what we believe are
papers that do not provide any additional information and merely add noise to our data.
We removed papers referenced by only 1 paper, and papers that reference fewer than 5
papers. This reduced our number of vertices from 252493 to 189303.
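A filter of this kind could be written as below. The sketch reads the rule literally (drop papers cited by exactly one paper, and papers citing fewer than five papers) and applies it in a single pass. Whether the original cleaning was iterated, and how papers with no citations at all were treated, is not stated, so both choices are assumptions, as are the function name and the toy data.

```python
from collections import Counter

def clean_citation_graph(edges):
    """edges: iterable of (citing, cited) pairs. Returns (kept vertices, kept edges)."""
    edges = list(edges)
    cited_by = Counter(cited for _, cited in edges)      # in-degree of each paper
    references = Counter(citing for citing, _ in edges)  # out-degree of each paper
    vertices = {v for edge in edges for v in edge}
    # Assumed literal reading: remove in-degree exactly 1, and out-degree below 5.
    kept = {v for v in vertices if cited_by[v] != 1 and references[v] >= 5}
    kept_edges = [(u, v) for (u, v) in edges if u in kept and v in kept]
    return kept, kept_edges

# Toy example (hypothetical data): three hub papers cite each other and five leaf papers.
hubs = ["h1", "h2", "h3"]
leaves = ["p1", "p2", "p3", "p4", "p5"]
edges = [(h, p) for h in hubs for p in leaves]
edges += [(h, g) for h in hubs for g in hubs if h != g]
kept, kept_edges = clean_citation_graph(edges)
```

In the toy graph the leaf papers cite nothing and are removed, leaving the three hubs and the six citations among them.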
2. If a natural community does not appear in tree T how good is its best match in
tree T?
The fact that a natural community may occur in only 70% of the trees raises the question
as to how good the best match of the natural cluster is in a tree where there is no match
above 70%. After observing such matches in the CiteSeer data, we find that the value of
such best matches is approximately 64% on average with only 0.3% of these matches
falling below 40%. These results show that natural communities do tend to appear in all
trees.
3. If C1 and C2 are the best matches of a base community, how well do C1 and C2
match?
One potential problem with the filtering process used for finding natural communities is
that all comparisons for finding similar communities are done using one base tree. This
means that while two best matching communities, C1 and C2, of a base community CB
may match CB by the threshold percentage p, C1 and C2 may not correspond with each
other at all. However, by our concept of natural communities, these stable clusters should
stay relatively constant throughout all trees. This suggests that if both C1 and C2 are best
matches of CB in their respective trees, then the match between C1 and C2 should remain
strong.
Given the list of natural communities found using our filtering process with p set at 70%
for small communities, and 60% for large communities, we examined the percentage
match between each pair of best matches C1 and C2 in 2 randomly selected trees. After
running this experiment several times, we find that approximately 85% of these pairs of
best match communities match each other by at least our threshold parameter p. The
remaining 15% fall short of this threshold value, but only by 4% on average.
4. Structure of best match
[Figure: communities C1 and C3 in Tree 1 and community C2 in Tree 2, with a match labeled 70%]
Consider C1 and C3 from one tree of natural communities, and C2 from another tree of
natural communities. When deciding on using a two-way best match on our data as a
measure of similarity, we were concerned with the possibility of C1 and C2 forming a
two-way match, where C3’s best match was also C2, and this match was greater than
70%. We believe that if this scenario arises, then C3 and C1 will be quite similar, and
should be considered the same community. When looking for two-way matches on two
trees of natural communities before compressing like communities, we examined several
cases where the above situation occurred. The match between C1 and C3 in our
experimental data ranged from approximately 75%-90%. Thus, by eliminating all natural
communities in the same set that match at least 70%, we eliminate all such problem
cases. However, we cannot rule out the possibility of a long chain of best matches of the
following form. For $1 \le i < n$, $C_i$’s best match is $C_{i+1}$. The chain terminates with $C_n$’s
best match being $C_{n-1}$ and the size of $C_n$ being significantly different from the size of
$C_1$. Although we cannot rule out the possibility, the situation never occurred.
5. Match between a natural community and its representative in the base tree
The process of finding natural communities uses several different base trees. The base of
each natural community is translated to a given base Tb by finding its best match in Tb .
We were concerned with the value of this best match. Upon closer inspection, we
discovered that a natural community not already present in Tb has a best match in Tb that
consistently hovers around the 70% boundary at which we consider two clusters similar.
Indeed, we found that the average value of all such matches is 68%. Moreover,
although these values do fluctuate, they never fall below 40%, and rarely fall below 50%.
Comparing natural communities from two experiments
In order to provide evidence supporting the concept of a tree of natural communities as
well as the stability of the tree, we did two experiments each involving 100 trees. Based
on these two tests, the CiteSeer data has approximately 150 natural communities of size
greater than or equal to 70 and strength greater than or equal to 70%. More specifically, we
found 158 natural communities in our first experiment, and 150 in our second. Table 1
gives the distribution of these communities as a function of their strength.
Table 1. Number of natural communities by strength.

Strength (%)    Experiment 1    Experiment 2
70-75           55              47
76-80           35              35
81-85           17              21
86-90           22              16
91-100          29              31
(*In finding the final set of natural communities, we combine the natural communities
found in several different trees into one base tree Tb, by finding the best match of each community in Tb and
using this best match to replace the original community. The strength of the natural
communities listed above is their strength prior to the merging process. The strength of
the best match in Tb tends to be slightly different.*)
Although the distribution of strength is quite even across the two experiments, a natural
community found in one set of trees, and its corresponding natural community from
another set of trees may not have the same strength. In fact, after examining pairs of
equivalent natural communities found in two experimental runs, we notice that while the
strength of some corresponding pairs match very well, others differ by quite a large
margin.
How the base tree used affects the strength of natural communities
We account for this discrepancy by pointing out that the base tree used affects the
strength of a community. Recall that our process of discovering natural communities
involves finding clusters that appear in a majority of a set of cluster trees. Consider a
theoretical natural community C over the set of trees [T1,T2, … , Tn] and its best
matching cluster in each of these trees [C1, C2, … , Cn]. Say that we choose T1 as our
base tree; as a result, C1 becomes the representation of C. Since all comparisons are done
with respect to community C1, we can associate a strength to natural community C by
counting the number of clusters over the remaining trees [T2 … Tn] that match C1 by at
least our threshold match percentage p. However, consider that instead, we use T2 as our
base tree and C2 as the representation of the natural community. If this is the case, then
there is no guarantee that the strength associated with C1 and the strength associated with
C2 will be the same. While we would expect the strength of these communities to be
somewhat similar, the fact that C1 and C2 can differ by $p^2$ and still be considered part of
the natural community C allows for the strength of the natural community to vary with
respect to which base tree is used.
By choosing natural communities at random and examining their strengths using several
different trees as the base tree, we found that the strength of each natural community
varies with respect to its base tree. For each natural community, postulate a theoretical set
of vertices C. If the natural community is represented by some base cluster CB that
deviates even slightly from the theoretical set of vertices, then the strength of the natural
community with respect to CB and the strength with respect to the theoretical set of
vertices can differ.
Consider, for example, that a base community that corresponds to some natural
community C contains all the vertices of the natural community’s theoretical core set of
vertices but happens to be large enough that the two communities overlap only by p.
Assume that in a separate tree Tj, C’s best match in Tj is exactly the same as the
theoretical natural community’s core set, except that it is missing one paper. While this
community’s tree would be included in the strength of the theoretical natural community,
it would not be present in the slightly different base community’s strength.
Following our reasoning above, it makes sense that the natural communities that
consistently have a higher best match across all trees should have less variance in their
strength. Finding all natural communities when p is set to 0.8, we discover that the
strength of such strongly connected natural communities rarely varies regardless of which
base tree is chosen. This is expected since a higher match among the set of clusters
corresponding to a natural community results in less deviation among the vertices that
make up these clusters.
The first thing that must be noted is that while the strength of a natural community may
differ when different base trees are used, its best matches over the remaining set of
trees do not change. Given some natural community C and its best matching clusters C1
and C2 from trees T1 and T2, the best matches of these two clusters with respect to the
remaining set of trees have stayed the same in all our tests. This suggests that even
though the communities vary enough to have different strengths, they are still extremely
similar. Another thing to note is that when finding natural communities, it is possible that
natural communities are missed due to the variance in the strength – a particularly low
match between a natural community and its best match in some base tree could lead to
overlooked natural communities. However, the fact that we look for natural communities
in several base trees significantly lowers the chance of missing such natural communities.
More on Theoretical Natural Communities
A natural community is a theoretical set of clusters that tend to consistently appear in a
set of cluster trees despite small perturbations in the clustered data. It is clear that our
filtering process does not find this true set of clusters, but merely a close representation of
it (its best matching cluster in the base tree). However, our belief is that natural
communities are exclusive sets of tightly bounded vertices, which form much more
readily than regular clusters. Thus, if we were to find the strength of some theoretical
natural community C, it should be higher than its representative cluster in the base tree.
In fact, we believe that the more a base community deviates from a natural community’s
core set of vertices, the lower its strength will be – we expect the natural community’s
core to persist over most of the n cluster trees with minor fluctuations, but the same
cannot be said for a community that matches only 70% with this theoretical natural
community.
Thus, if we wanted to find the makeup of such a theoretical community C with best
matches in trees [T1 … Tn] of B = [C1 … Cn], it seems intuitive to select the $C_i \in B$ with
the highest strength. As we examine clusters of increasing strength, these clusters
progressively contain sets of vertices that appear in more and more of the perturbed trees
– this is exactly what we expect of the theoretical natural communities.
1. How well do communities from the two experiments match?
After finding the two sets of natural communities from our two experimental runs, we
proceeded to examine their similarity. To compare these two sets, we used the concept of
a two-way best match defined as follows: Given two clusters, C1 from the first set of
natural communities and C2 from the second, if the best match of C1 is C2 and the best
match of C2 is C1, then these two clusters form a two-way best match, and are considered
essentially the same community. Running this test on the two sets, we discovered that
approximately 80% of the natural communities formed a two-way match. The remaining
20%, which did not appear in both experimental runs, were found to be threshold
communities.
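The two-way best match test can be sketched as follows, assuming each set of natural communities is a list of Python sets of paper identifiers. The function names are hypothetical, and ties between equally good matches are broken by list order, which is an arbitrary choice of the sketch.

```python
def match_value(c, c_prime):
    """min(|C n C'| / |C|, |C n C'| / |C'|) for two non-empty sets."""
    overlap = len(c & c_prime)
    return min(overlap / len(c), overlap / len(c_prime))

def best_index(c, communities):
    """Index of the best match of c within a list of communities."""
    return max(range(len(communities)), key=lambda j: match_value(c, communities[j]))

def two_way_matches(set_a, set_b):
    """Pairs (i, j) such that set_b[j] is the best match of set_a[i] and vice versa."""
    pairs = []
    for i, a in enumerate(set_a):
        j = best_index(a, set_b)
        if best_index(set_b[j], set_a) == i:
            pairs.append((i, j))
    return pairs

# Toy example (hypothetical data).
run_1 = [{1, 2, 3, 4}, {10, 11, 12}]
run_2 = [{1, 2, 3}, {10, 11, 12, 13}, {1, 2, 3, 4, 5, 6, 7, 8}]
pairs = two_way_matches(run_1, run_2)
print(pairs)
```

The third community of `run_2` has no partner: its own best match is the first community of `run_1`, but that community prefers the tighter cluster `{1, 2, 3}`.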
Threshold communities are those that barely pass the filter of being natural, with
strengths only slightly higher than f. By examining the best match of each threshold
community in the dataset in which it did not appear, we find that the best match is
usually a community slightly below the bound of being natural, with strength slightly
less than f. More specifically, with our boundary strength f set at 70%, we discovered that
the average strength of a community with no two-way match was 74%, while the average strength of its best match
in the dataset that it did not appear in was 63%. Due to the fact that we have no true bound
for quantifying natural communities, such mismatches concerning natural communities
that hover around the boundary f are to be expected. We can conclude that the natural
communities found in the two experimental runs are essentially the same, with the only
discrepancies occurring with natural communities that barely made it in one experimental
run and barely missed in the other.
Where does this material belong
(*We were also concerned with clusters that were found to be natural in one tree but did
not map well into the base tree Tb . Thus, when mapped into Tb , this mapped version of the
natural communities no longer qualifies as natural, and may display strengths below 40.
However, even though these communities may not map well, in many cases, a two-way
match with another set of naturals can still be found due to the distinctiveness of each
natural community.*)
Core
90% or 70%?
Given a natural community it is interesting to ask if there is a core of papers that appears
in the corresponding community in each of the hundred trees. Given that each tree was
generated by a random 95% of the total set of papers, the probability that any given paper
even exists in all hundred sets of papers is $0.95^{100} \approx 0.0059$, or less than one percent.
Thus, instead we identify a core by looking for papers that appear in 70% of the hundred
collections. (The probability that a paper will appear in 70% of the trees is about 96%.
Thus, if it appears in 70% of the communities, it appears almost every time it is present in
the set of papers used to construct a tree.)
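A core of this kind could be computed as in the sketch below, assuming we already hold the version of the community found in each tree as a set of paper identifiers. The function name and the integer-percent threshold are choices of the sketch; the last lines reproduce the 0.95^100 estimate.

```python
from collections import Counter

def core(versions, n_trees, threshold_percent=70):
    """Papers appearing in at least threshold_percent of the n_trees versions of a community.

    versions: one set of paper ids per tree in which a matching community was found.
    Integer arithmetic avoids floating-point trouble exactly at the threshold.
    """
    counts = Counter(paper for version in versions for paper in version)
    return {paper for paper, count in counts.items()
            if count * 100 >= threshold_percent * n_trees}

# Toy example (hypothetical data): 10 trees; paper 1 is always present,
# paper 2 appears in 7 versions and paper 3 in 6.
versions = [{1, 2, 3}] * 6 + [{1, 2}] + [{1}] * 3
print(core(versions, n_trees=10))
print(core(versions, n_trees=10, threshold_percent=50))

# Probability that a given paper is sampled into all 100 perturbed data sets.
p_in_all = 0.95 ** 100
print(round(p_in_all, 4))
```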
Moreover, we expect that using the core of papers will remove randomly included papers
from a cluster, and give us a more accurate representation of the theoretical natural
community. If our notion of a core more accurately represents a natural community
compared to using a natural community’s best-matching cluster in the base tree, then we
would expect that the strength of this core community will be consistently higher than the
base tree community, and we would also presume that the same core communities would
match very well from different experimental runs.
Validation of the Core
Max Strength Match
Show graph: much more organized than the regular match of strength (which tends to
fluctuate given the base tree used), which was almost random. This reiterates the idea of
theoretical natural communities, and the fact that a natural community’s
representative in some tree may differ enough from the theoretical natural community
that its strength may differ quite a bit across trees. Again, we believe that the community
with the highest strength is the best approximation of this theoretical natural community.
Taking a look at the max strength communities across 20 of the 100 trees, we definitely
see a higher number of matches.
Match of cores from 2 experimental runs
After finding the core communities of the sets of natural communities from the two
experimental runs, we analyzed the match of corresponding communities from the two
experiments. When matching these core communities, we found that match to be
consistently high at an average of 93.6%. This is significantly higher (more than 20%
higher) than the match of corresponding communities using the best match in the base
tree to represent natural communities.
Strength of the core
If our belief that the core of a natural community more accurately represents the stable
theoretical cluster is correct, then it should follow that the strength of the core is high.
When examining the cores of our tree of natural communities, where a core is defined to be all
papers appearing in at least 70% of the trees, we found that their strengths did not differ
much from the strength of the actual natural community found. However, when we
include in the core papers that appear in at least 50% of the trees, the strength of such
cores increases significantly, with matches in almost 90% of the trees on average. Because
the 70% core is typically smaller than the clusters found in each tree,
the matching of the core with a cluster is somewhat biased, since the denominator of the
best match calculation is the size of the larger cluster. Thus, even though the overlap
between a core and its corresponding natural community may be high, the match may be
low due to the larger size of the actual natural community. However, when we allow
papers that appear in only 50% of the trees into the core, the overlap between the core
and its corresponding natural community will increase, but the size of the larger
community, which is still the actual natural community, does not change. This explains
why a 50% core does better than a 70% core in terms of strength. Nevertheless, when
doing core comparisons across experimental runs we believe that a 70% core will be a
better representation of the stable clusters found, and the high match between such cores
verifies this belief.
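The size bias described above follows directly from the match formula. In the notation below, m(C, C’) for the match value and K for a core are our own labels, introduced only for this illustration.

```latex
\[
m(C, C') \;=\; \min\!\left( \frac{|C \cap C'|}{|C|},\; \frac{|C \cap C'|}{|C'|} \right)
         \;=\; \frac{|C \cap C'|}{\max(|C|,\, |C'|)}
\]
\[
|K| \le |C| \quad\Longrightarrow\quad
m(K, C) \;=\; \frac{|K \cap C|}{|C|} \;\le\; \frac{|K|}{|C|}
\]
```

So a core can never match a larger cluster by more than the ratio of their sizes. As a purely illustrative case, a core of 70 papers compared against a cluster of 100 papers cannot score above 0.70 even if every core paper lies in the cluster, whereas admitting more papers into the core raises this ceiling.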
Trees
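[Figure: two trees of natural communities, one with nodes A, B, C and the other with the corresponding nodes A’, B’, C’]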
To compare the structure of the natural communities trees from the two experimental
runs, we excluded the threshold communities that were in one experimental run and not
in the other. By looking for two-way matches between the set of natural communities
from the two experiments, we are able to analyze the structure of the two trees simply by
examining whether two matching communities have a pair of matching parents, as well
as sets of matching children. Of the set of nodes found from our two experimental runs,
approximately 10% of the nodes presented mismatches in the two trees. However, these
were very slight mismatches in which the parent of the mismatched node in one tree
turned out to be the mismatched node’s parent’s parent in the other tree. Moreover, the
size of each mismatched node is consistently much smaller than the size of its parent;
in such cases, these small communities can be as random as a single paper, so the fact
that an extra node appears between a parent-child pair is not unforeseen. We expect to see
more noticeable differences when doing time comparison, so such insignificant
differences may not be a problem.
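The parent comparison can be sketched as below. The representation is an assumption made for illustration: each tree is a dictionary from a node to its parent (the root maps to None), and the two-way matches are a dictionary from nodes of the first tree to nodes of the second. A node is reported when the match of its parent is not the parent of its match.

```python
def parent_mismatches(parent_a, parent_b, match):
    """Nodes of tree A whose matched node in tree B has a non-matching parent.

    parent_a, parent_b: dict node -> parent (None for the root).
    match: dict mapping nodes of tree A to their two-way matches in tree B.
    """
    mismatched = []
    for a, b in match.items():
        pa = parent_a.get(a)
        pb = parent_b.get(b)
        # The parent of a should map to the parent of b; roots map to None.
        expected = match.get(pa) if pa is not None else None
        if expected != pb:
            mismatched.append(a)
    return mismatched

# Toy example (hypothetical data): C hangs under B in one tree, but its match C'
# hangs directly under A' (the match of B's parent) in the other.
parent_a = {"A": None, "B": "A", "C": "B"}
parent_b = {"A'": None, "B'": "A'", "C'": "A'"}
match = {"A": "A'", "B": "B'", "C": "C'"}
print(parent_mismatches(parent_a, parent_b, match))
```

A node whose parent has no two-way match (for example an excluded threshold community) is also reported by this sketch, so such parents would need to be skipped over before comparing.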