2/18/2016 Natural communities

CiteSeer Project: Existence of Natural Communities

We are interested in examining structural changes over time in the CiteSeer database by clustering the data at various points in time and tracking changes in these clusterings. However, clustering methods are unstable, and temporal analysis requires stable clustering in order to distinguish changes due to evolution from changes due to clustering instability. Thus, we introduce the notion of natural communities: stable clusters which persist despite small changes in the data.

An agglomerative clustering of the 300,000 papers in the CiteSeer collection will produce a tree of 300,000 clusters. The vast majority of these clusters have no real significance; they are artifacts of the clustering algorithm itself. However, if there are natural clusters of papers in the data set, then these clusters should appear in any clustering. Assume that there are collections of papers on certain topics such as AI or theoretical computer science but that these collections have no natural subcollections in them. Then on different runs of the agglomerative algorithm, the collection of AI papers and the collection of theoretical computer science papers should always appear as clusters, but subclusters of these two clusters are likely to differ on different runs. This suggests running the agglomerative algorithm many times on randomly chosen subsets of 95% of the data and looking for clusters that tend to appear frequently. We restrict our search for frequently occurring clusters to clusters of size at least 70 and call these clusters natural communities. It is these communities that we will track over time.

To find these stable clusters, take the CiteSeer data and form a citation graph G. Then perturb the data by randomly removing 5% of the vertices of the graph. By repeating this process multiple times, we are left with a set of slightly different subgraphs {G1, G2, … , Gn} of G.
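The perturbation step can be sketched as follows. This is a minimal illustration, not the code used for the experiments: the function name `perturbed_subgraphs`, the edge-list representation of the citation graph, and the `seed` parameter are all assumptions made for the example.

```python
import random

def perturbed_subgraphs(edges, n_subgraphs, keep_fraction=0.95, seed=0):
    """Return n_subgraphs pairs (kept_vertices, kept_edges).

    Each pair is the subgraph induced by a random keep_fraction of the
    vertices, i.e. the graph with a random 5% of vertices removed when
    keep_fraction is 0.95. The edge-list input format is an assumption.
    """
    rng = random.Random(seed)
    vertices = sorted({v for edge in edges for v in edge})
    n_keep = round(keep_fraction * len(vertices))
    subgraphs = []
    for _ in range(n_subgraphs):
        kept = set(rng.sample(vertices, n_keep))
        # Keep only edges whose two endpoints both survived the removal.
        kept_edges = [(u, v) for (u, v) in edges if u in kept and v in kept]
        subgraphs.append((kept, kept_edges))
    return subgraphs
```

Each returned subgraph would then be clustered independently to produce one cluster tree.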
Applying an agglomerative merge algorithm to each of these subgraphs yields a cluster tree Ti for each subgraph Gi. By taking advantage of the agglomerative merge's inherent instability, we are able to discard what we believe to be unpredictable clusters by observing commonalities among the set of cluster trees.

Concept of a best match

In order to create a correspondence between clusters in two different trees, we define the concept of a best match. The best match of C in the tree T is the community C' in T for which

$$\min\left(\frac{|C \cap C'|}{|C|}, \frac{|C \cap C'|}{|C'|}\right)$$

is maximized. One should observe that the best match is neither symmetric nor transitive. Also, a best match has a parameter p associated with it, namely

$$p = \max_{C' \in T} \min\left(\frac{|C \cap C'|}{|C|}, \frac{|C \cap C'|}{|C'|}\right).$$

Intuitively, we define C to be a natural community if, in a set of k trees, C has a best match of value at least p in a fraction f of the trees. There are three parameters in the above concept of a natural community. First, the number of trees must be large enough so that the number of best matches of value at least p approximates the expected number of matches. For fixed p and f we find that the number of best matches decreases with increasing k until about k = 100 and then levels off at approximately 120 natural communities. We will explain this phenomenon later. The other two parameters are somewhat arbitrary. If the tree of natural communities is going to evolve with time, then there must be a time when a community changes from almost natural to just natural. Thus, there will be some arbitrariness in our definition and in the set of natural communities. We fix p to be 70% (for large communities 60%). We define the strength of a community to be the fraction f of trees it appears in and call a community natural if its strength is at least 70%.
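The definitions of best match, strength, and natural community translate directly into a few set operations. The sketch below assumes each cluster is a Python set of paper identifiers and each tree is a non-empty list of such sets; the function names and the `min_size` parameter (taken from the size bound of 70 stated earlier) are illustrative choices, not names from the project.

```python
def match_value(c, c_prime):
    """min(|C n C'| / |C|, |C n C'| / |C'|) for two sets of papers."""
    overlap = len(c & c_prime)
    if overlap == 0:
        return 0.0
    return min(overlap / len(c), overlap / len(c_prime))

def best_match(c, tree):
    """Return (best matching cluster in tree, its match value p)."""
    best = max(tree, key=lambda other: match_value(c, other))
    return best, match_value(c, best)

def strength(c, trees, p=0.7):
    """Fraction of trees in which c has a best match of value at least p."""
    hits = sum(1 for tree in trees if best_match(c, tree)[1] >= p)
    return hits / len(trees)

def is_natural(c, trees, p=0.7, f=0.7, min_size=70):
    """Natural community test: large enough and strong enough."""
    return len(c) >= min_size and strength(c, trees, p) >= f
```

Because the smaller of the two fractions is used, the match value equals the overlap divided by the size of the larger cluster, so a small cluster contained in a much larger one scores poorly.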
Just as natural communities evolve with time, the tree of natural communities also evolves, and thus certain edges are stronger than others and the actual trees will be slightly different for different clusterings.

Merging equivalent natural communities

Due to the nature of the agglomerative merge, which forms a new cluster by merging two current clusters, it is possible to merge a community containing a single node or a small number of nodes with a natural community to form a second natural community that is so similar to the first that having both would be redundant. As a result, given the set of natural communities N, if a natural community's best match among the set of natural communities (not including itself) is greater than 70%, then the smaller of the two natural communities is removed. The threshold of a 70% match corresponds to the value of p that we introduced to compare natural communities over different trees.

Combining natural communities from different bases

To find the natural communities, select a tree Tb as a base of comparison with all other trees. Let T1 be the base tree for constructing a set of natural communities. We will identify a cluster Ni ∈ T1 as natural if the parameter p of its best match is greater than 70% in at least a fraction f of the set of cluster trees. We note that by using only one base, T1, in our filtering process we will miss roughly a fraction 1 - f of the natural communities in the data by definition, since our requirement for natural communities expects the cluster to appear in only a fraction f of the trees. Thus, to arrive at the full set of natural communities, we must repeat the process with different base trees and merge newfound natural communities into our set until the probability of missing a natural community is minimal.
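The removal of redundant natural communities can be sketched as below. Note the assumption: this is a greedy variant that visits communities from largest to smallest and drops any community that matches an already kept, larger one above the threshold. The text does not specify a processing order, so the order and the function names here are illustrative.

```python
def match_value(c, c_prime):
    """min(|C n C'| / |C|, |C n C'| / |C'|) for two sets of papers."""
    overlap = len(c & c_prime)
    if overlap == 0:
        return 0.0
    return min(overlap / len(c), overlap / len(c_prime))

def remove_redundant(communities, threshold=0.7):
    """Drop the smaller community of every pair matching above threshold.

    Greedy sketch (an assumption): larger communities are considered
    first, so when two communities match, the smaller one is discarded.
    """
    kept = []
    for c in sorted(communities, key=len, reverse=True):
        if all(match_value(c, k) <= threshold for k in kept):
            kept.append(c)
    return kept
```

The same routine could be applied after natural communities found from several base trees have been pooled into one set.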
For the CiteSeer data where we set n = 100 (the point of stabilization), repeating this process using five different bases reduces the probability of missing a natural community to an insignificant number: $0.3^5 \approx 0.2\%$.

Translating natural community to base one

When we find a natural community we actually have a slightly different community appearing in each of the 100 trees. In fact, some of the papers appearing in one copy of the natural community might not even appear in the set of 95% of the papers that were used for another tree. Thus, we need to have some way to characterize the community by some set of papers. We arbitrarily denote the community by the version that appears in the base set. Note that if [C1, C2, …, C100] is the set of communities of some natural community, then each Ci is the best match of C1, but Ci is not necessarily the best match of some other Cj. It would be nice if best match was symmetric and transitive, but it is not. This raises a difficulty when we add natural communities found using a different base. In this case there is no C1, so we find the community C which is the best match of Cb in tree one and let that be the representative of the natural community. Finding the best matches of C in each tree may not give us the same communities as using the base Cb. However, we will show that these are closely matched.

Data supporting the natural community concept

How the graph was cleaned

The graph from the CiteSeer database is first filtered to take out what we believe are papers that do not provide any additional information and merely add noise to our data. We removed papers referenced by only 1 paper, and papers that reference fewer than 5 papers. This reduced our number of vertices from 252493 to 189303.

2. If a natural community does not appear in tree T how good is its best match in tree T?
The fact that a natural community may occur in only 70% of the trees raises the question as to how good the best match of the natural cluster is in the trees where there is no match above 70%. After observing such matches in the CiteSeer data, we find that the value of such best matches is approximately 64% on average, with only 0.3% of these matches falling below 40%. These results show that natural communities do tend to appear in all trees.

3. If C1 and C2 are the best matches of a base community, how well do C1 and C2 match?

One potential problem with the filtering process used for finding natural communities is that all comparisons for finding similar communities are done using one base tree. This means that while two best matching communities, C1 and C2, of a base community CB may match CB by the threshold percentage p, C1 and C2 may not correspond with each other at all. However, by our concept of natural communities, these stable clusters should stay relatively constant throughout all trees. This suggests that if both C1 and C2 are best matches of CB in their respective trees, then the match between C1 and C2 should remain strong. Given the list of natural communities found using our filtering process with p set at 70% for small communities, and 60% for large communities, we examined the percentage match between each pair of best matches C1 and C2 in 2 randomly selected trees. After running this experiment several times, we find that approximately 85% of these pairs of best match communities match each other by at least our threshold parameter p. The remaining 15% fall short of this threshold value, but only by 4% on average.

4. Structure of best match

[Figure: clusters C1 and C3 in Tree 1 and cluster C2 in Tree 2, with a 70% match indicated.]

Consider C1 and C3 from one tree of natural communities, and C2 from another tree of natural communities.
When deciding on using a two-way best match on our data as a measure of similarity, we were concerned with the possibility of C1 and C2 forming a two-way match, where C3's best match was also C2, and this match was greater than 70%. We believe that if this scenario arises, then C3 and C1 will be quite similar, and should be considered the same community. When looking for two-way matches on two trees of natural communities before compressing like communities, we examined several cases where the above situation occurred. The match between C1 and C3 in our experimental data ranged from approximately 75%-90%. Thus, by eliminating all natural communities in the same set that match at least 70%, we eliminate all such problem cases. However, we cannot rule out the possibility of a long chain of best matches of the following form. For $1 \le i < n$, $C_i$'s best match is $C_{i+1}$. The chain terminates with $C_n$'s best match being $C_{n-1}$ and the size of $C_n$ being significantly different from the size of $C_1$. Although we cannot rule out the possibility, the situation never occurred.

5. Match between a natural community and its representative in the base tree

The process of finding natural communities uses several different base trees. The base of each natural community is translated to a given base Tb by finding its best match in Tb. We were concerned with the value of this best match. Upon closer inspection, we discovered that a natural community not already present in Tb has a best match in Tb that consistently hovers around the 70% boundary at which we consider two clusters similar. Indeed, we found that the average of all such matches is 68%. Moreover, although these values do fluctuate, they never fall below 40%, and rarely fall below 50%.

Comparing natural communities from two experiments

In order to provide evidence supporting the concept of a tree of natural communities as well as the stability of the tree, we did two experiments each involving 100 trees.
Based on these two tests, the CiteSeer data has approximately 150 natural communities of size greater than or equal to 70 and strength greater than or equal to 70%. More specifically, we found 158 natural communities in our first experiment, and 150 in our second. Table 1 gives the distribution of these communities as a function of their strength.

Table 1. Number of natural communities by strength.

Strength (%)    Experiment 1    Experiment 2
70-75           55              47
76-80           35              35
81-85           17              21
86-90           22              16
91-100          29              31

(*In finding the final set of natural communities, we combine the natural communities found in several different trees into one base tree Tb, by finding each community's best match in Tb, and use this best match to replace the original community. The strength of the natural communities listed above is their strength prior to the merging process. The strength of a community's best match in Tb tends to be slightly different.*)

Although the distribution of strength is quite even across the two experiments, a natural community found in one set of trees and its corresponding natural community from another set of trees may not have the same strength. In fact, after examining pairs of equivalent natural communities found in two experimental runs, we notice that while the strengths of some corresponding pairs match very well, others differ by quite a large margin.

How the base tree used affects the strength of natural communities

We account for this discrepancy by pointing out that the base tree used affects the strength of a community. Recall that our process of discovering natural communities involves finding clusters that appear in a majority of a set of cluster trees. Consider a theoretical natural community C over the set of trees [T1, T2, … , Tn] and its best matching cluster in each of these trees [C1, C2, … , Cn]. Say that we choose T1 as our base tree; as a result, C1 becomes the representation of C.
Since all comparisons are done with respect to community C1, we can associate a strength to natural community C by counting the number of clusters over the remaining trees [T2 … Tn] that match C1 by at least our threshold match percentage p. However, consider that instead we use T2 as our base tree and C2 as the representation of the natural community. If this is the case, then there is no guarantee that the strength associated with C1 and the strength associated with C2 will be the same. While we would expect the strengths of these communities to be somewhat similar, the fact that C1 and C2 can differ by $p^2$ and still be considered part of the natural community C allows for the strength of the natural community to vary with respect to which base tree is used. By choosing natural communities at random and examining their strengths using several different trees as the base tree, we found that the strength of each natural community varies with respect to its base tree.

For each natural community, postulate a theoretical set of vertices C. If the natural community is represented by some base cluster CB that deviates even slightly from the theoretical set of vertices, then the strength of the natural community with respect to CB and the strength with respect to the theoretical set of vertices can differ. Consider, for example, that a base community that corresponds to some natural community C contains all the vertices of the natural community's theoretical core set of vertices but happens to be large enough that the two communities overlap only by p. Assume that in a separate tree Tj, C's best match in Tj is exactly the same as the theoretical natural community's core set, except that it is missing one paper. While this tree would count toward the strength of the theoretical natural community, it would not count toward the strength of the slightly different base community.
Following our reasoning above, it makes sense that the natural communities that consistently have a higher best match across all trees should have less variance in their strength. Finding all natural communities when p is set to 0.8, we discover that the strength of such strongly connected natural communities rarely varies regardless of which base tree is chosen. This is expected since a higher match among the set of clusters corresponding to a natural community results in less deviation among the vertices that make up these clusters.

The first thing that must be noted is that while the strength of a natural community may differ when different base trees are used, their best matches over the remaining set of trees do not change. Given some natural community C and its best matching clusters C1 and C2 from trees T1 and T2, the best matches of these two clusters with respect to the remaining set of trees have stayed the same in all our tests. This suggests that even though the communities vary enough to have different strengths, they are still extremely similar.

Another thing to note is that when finding natural communities, it is possible that natural communities are missed due to the variance in the strength: a particularly low match between a natural community and its best match in some base tree could lead to overlooked natural communities. However, the fact that we look for natural communities in several base trees significantly lowers the chance of missing such natural communities.

More on Theoretical Natural Communities

A natural community is a theoretical set of clusters that tend to consistently appear in a set of cluster trees despite small perturbations in the clustered data. It is clear that our filtering process does not find this true set of clusters, but merely a close representation of it (its best matching cluster in the base tree).
However, our belief is that natural communities are exclusive sets of tightly bounded vertices, which form much more readily than regular clusters. Thus, if we were to find the strength of some theoretical natural community C, it should be higher than that of its representative cluster in the base tree. In fact, we believe that the more a base community deviates from a natural community's core set of vertices, the lower its strength will be: we expect the natural community's core to persist over most of the n cluster trees with minor fluctuations, but the same cannot be said for a community that matches only 70% with this theoretical natural community. Thus, if we wanted to find the makeup of such a theoretical community C with best matches in trees [T1 … Tn] of B = [C1 … Cn], it seems intuitive to select the Ci ∈ B with the highest strength. As we examine clusters of increasing strength, these clusters progressively contain sets of vertices that appear in more and more of the perturbed trees; this is exactly what we expect of the theoretical natural communities.

1. How well do communities from the two experiments match?

After finding the two sets of natural communities from our two experimental runs, we proceeded to examine their similarity. To compare these two sets, we used the concept of a two-way best match defined as follows: given two clusters, C1 from the first set of natural communities and C2 from the second, if the best match of C1 is C2 and the best match of C2 is C1, then these two clusters form a two-way best match, and are considered essentially the same community. Running this test on the two sets, we discovered that approximately 80% of the natural communities formed a two-way match. The remaining 20%, which did not appear in both experimental runs, were found to be threshold communities. Threshold communities are those that barely pass the filter of being natural, with strengths only slightly higher than f.
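The two-way best match test can be sketched as follows, assuming each set of natural communities is a list of Python sets of paper identifiers. The function names are illustrative, and the extra requirement that a reported pair has a nonzero overlap is an assumption added to avoid pairing disjoint communities.

```python
def match_value(c, c_prime):
    """min(|C n C'| / |C|, |C n C'| / |C'|) for two sets of papers."""
    overlap = len(c & c_prime)
    if overlap == 0:
        return 0.0
    return min(overlap / len(c), overlap / len(c_prime))

def best_index(c, candidates):
    """Index of the best match of c among a non-empty list of candidates."""
    return max(range(len(candidates)),
               key=lambda j: match_value(c, candidates[j]))

def two_way_matches(set_a, set_b):
    """Pairs (i, j) where set_a[i] and set_b[j] are each other's best match."""
    a_to_b = [best_index(c, set_b) for c in set_a]
    b_to_a = [best_index(c, set_a) for c in set_b]
    return [(i, j) for i, j in enumerate(a_to_b)
            if b_to_a[j] == i and match_value(set_a[i], set_b[j]) > 0]
```

Communities of `set_a` whose index does not occur in the result are the candidates for threshold communities.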
By examining the best match of these threshold communities in the dataset in which they did not appear, we find that their best match is usually a community slightly below the bound of being natural, with strength slightly less than f. More specifically, with our boundary strength f set at 70%, we discovered that the average strength of a community with no two-way match was 74, while its best match in the dataset that it did not appear in was 63. Because we have no true bound for quantifying natural communities, such mismatches concerning natural communities that hover around the boundary f are to be expected. We can conclude that the natural communities found in the two experimental runs are essentially the same, with the only discrepancies occurring with natural communities that barely made it in one experimental run and barely missed in the other.

Where does this material belong?

(*We were also concerned with clusters that were found to be natural in one tree but did not map well into the base tree Tb. When mapped into Tb, this mapped version of the natural community no longer qualifies as natural, and may display strengths below 40. However, even though these communities may not map well, in many cases a two-way match with another set of naturals can still be found due to the distinctiveness of each natural community.*)

Core 90% or 70%?

Given a natural community, it is interesting to ask if there is a core of papers that appears in the corresponding community in each of the hundred trees. Given that each tree was generated by a random 95% of the total set of papers, the probability that any given paper even exists in all hundred sets of papers is $0.95^{100} \approx 0.0059$, or less than one percent. Thus, instead we identify a core by looking for papers that appear in 70% of the hundred collections. (The probability that a paper will appear in 70% of the trees is about 96%.
Thus, if it appears in 70% of the communities, it appears almost every time it is present in the set of papers used to construct a tree.)

Moreover, we expect that using the core of papers will remove randomly included papers from a cluster, and give us a more accurate representation of the theoretical natural community. If our notion of a core more accurately represents a natural community compared to using a natural community's best-matching cluster in the base tree, then we would expect that the strength of this core community will be consistently higher than that of the base tree community, and we would also presume that the same core communities would match very well from different experimental runs.

Validation of the Core

Max Strength Match

Show graph: much more organized than the regular match of strength (which tends to fluctuate given the base tree used), which was almost random. This reiterates the idea of theoretical natural communities, and the fact that a natural community's representative in some tree may differ enough from the theoretical natural community that its strength may differ quite a bit across trees. Again, we believe that the community with the highest strength is the best approximation of this theoretical natural community. Taking a look at the max strength communities across 20 of the 100 trees, we definitely see a higher number of matches.

Match of cores from 2 experimental runs

After finding the core communities of the sets of natural communities from the two experimental runs, we analyzed the match of corresponding communities from the two experiments. When matching these core communities, we found the match to be consistently high at an average of 93.6%. This is significantly higher (more than 20% higher) than the match of corresponding communities using the best match in the base tree to represent natural communities.
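The core construction described above can be sketched as a simple count over the per-tree versions of a natural community. The sketch assumes one matched community (a set of paper identifiers) per tree; the names `core` and `min_percent` are illustrative. The last line reproduces the arithmetic for a paper being present in all hundred 95% samples.

```python
from collections import Counter

def core(matched_communities, n_trees, min_percent=70):
    """Papers appearing in at least min_percent percent of the n_trees trees.

    matched_communities holds the version of one natural community found
    in each tree (assumed: at most one set of papers per tree).
    """
    counts = Counter(paper
                     for community in matched_communities
                     for paper in community)
    # Integer comparison avoids floating point issues at the boundary.
    return {paper for paper, n in counts.items()
            if 100 * n >= min_percent * n_trees}

# Probability that a given paper is present in all 100 random 95% samples.
p_in_all_trees = 0.95 ** 100
```

Calling `core(..., min_percent=50)` gives the looser 50% core, which admits more papers and is therefore at least as large as the 70% core.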
Strength of the core

If our belief that the core of a natural community more accurately represents the stable theoretical cluster is correct, then it should follow that the strength of the core should be high. When examining the cores of our tree of natural communities, where a core is defined to be all papers appearing in at least 70% of the trees, we found that their strengths did not differ much from the strength of the actual natural community found. However, when we include in the core papers that appear in at least 50% of the trees, the strength of such cores increases significantly, with matches in almost 90% of the trees on average.

Because the 70% core is typically of smaller size than the clusters found in each tree, the matching of the core with a cluster is somewhat biased, since the denominator of the best match calculation is the size of the larger cluster. Thus, even though the overlap between a core and its corresponding natural community may be high, the match may be low due to the larger size of the actual natural community. However, when we allow papers that appear in only 50% of the trees into the core, the overlap between the core and its corresponding natural community will increase, but the size of the larger community, which is still the actual natural community, does not change. This explains why a 50% core does better than a 70% core in terms of strength. Nevertheless, when doing core comparisons across experimental runs we believe that a 70% core will be a better representation of the stable clusters found, and the high match between such cores verifies this belief.

Trees

[Figure: two trees with corresponding nodes A, B, C and A', B', C'.]

To compare the structure of the natural communities trees from the two experimental runs, we excluded the threshold communities that were in one experimental run and not in the other.
By looking for two-way matches between the sets of natural communities from the two experiments, we are able to analyze the structure of the two trees simply by examining whether two matching communities have a pair of matching parents, as well as sets of matching children. Of the set of nodes found from our two experimental runs, approximately 10% of the nodes presented mismatches in the two trees. However, these were very slight mismatches in which the parent of the mismatched node in one tree turned out to be the mismatched node's parent's parent in the other tree. Moreover, these mismatched nodes are consistently much smaller than their parents; in such cases, these small communities can be as random as a single paper, so the fact that an extra node appears between a parent-child pair is not unforeseen. We expect to see more noticeable differences when doing time comparison, so such insignificant differences may not be a problem.
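The parent check used in this structural comparison can be sketched as follows. The representation is an assumption made for illustration: each tree is a dictionary mapping a node to its parent (None for the root), and `matched` maps each node of the first tree to its two-way match in the second. A node whose parent has no match is also reported as a mismatch in this sketch.

```python
def parent_mismatches(parent_a, parent_b, matched):
    """Nodes of tree A whose matched node in tree B has a non-matching parent.

    parent_a, parent_b: dict node -> parent (None for the root).
    matched: dict node in A -> matched node in B (two-way matches).
    """
    mismatched = []
    for a, b in matched.items():
        pa = parent_a.get(a)
        pb = parent_b.get(b)
        # The parent we expect in B is the match of a's parent in A.
        expected = matched.get(pa) if pa is not None else None
        if expected != pb:
            mismatched.append(a)
    return mismatched
```

For example, if C sits under B in one tree while its match C' sits directly under A' in the other, only C is reported, which corresponds to the parent versus parent's parent case described above.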