CMPUT 695 FALL 2004, 2004-11-28

ASSOCIATIVE CLASSIFIER WITH REOCCURRING ITEMS
PROJECT REPORT
RAFAL RAK, WOJCIECH STACH

1 INTRODUCTION

The classification problem is one of the most common tasks in the Data Mining and Machine Learning fields. In general, it boils down to using a training set, which contains a set of labelled objects, to build a classifier able to predict the labels of objects whose labels are unknown. There are numerous classification methods [4], among which we can distinguish those that produce a model in the form of rule sets, e.g. decision trees, rule learning, or naïve Bayes. These approaches have several benefits that come from having rules that describe the model. One of the most important advantages is that such a model is transparent, i.e. experts from the application domain are able to understand it. This allows them to manipulate and add rules in order to increase the confidence and accuracy of the classifier.

Associative classification is a relatively new method. Its main objective is to discover strong patterns that are associated with class labels. The data set consists of transactions, which include items. As a final model, one obtains a set of association rules. The literature describes a few classifiers based on this idea, namely CBA [6], CMAR [5], and ARC-AC/ARC-BC [11]. The substantial limitation of all these algorithms is that they do not handle data in which the same item appears multiple times in a single transaction. There is a group of mining areas, such as multimedia, medicine, or text mining, in which considering the number of occurrences of items (e.g. similar objects in the same medical image) might be more beneficial than the mere presence or absence of items. A few approaches to mining rules with reoccurring items have been proposed, e.g. MaxOccur [12], FP'-tree [8] and WAR [9].

The main goal of this project is to devise, implement, and test an associative classifier with reoccurring items. It exploits, combines, and extends the ideas mentioned above, especially the ARC-AC/ARC-BC and MaxOccur algorithms.

2 PROBLEM STATEMENT

The original approach to classification using association rules, named class association rules (CAR), was introduced in [6]. The main idea was to modify the form of transactions known from the traditional approach to the form $\langle condset, c \rangle$, where condset is a set of items and c is a class label. Each item is represented by an (attribute, value) pair. Hence, we can treat a transaction as a set of (attribute, value) pairs and a class label, e.g. $\langle \{(A_1, 3), (A_2, 2)\}, c \rangle$, where $A_1$ and $A_2$ are attributes. Different notations for this type of transaction can be used; in this project we will be using the following one: $\langle \{3A_1, 2A_2\}, c \rangle$. All the rules generated from frequent itemsets are of the form $condset \Rightarrow c$.

Once the classifier (in this case: a set of rules) is found, it can be used to predict the classes of new objects. However, two main problems might occur. One of them is that two or more contradictory rules might exist, i.e. rules that have the same condset, yet different class labels. The other one concerns the situation in which there is no exact rule, i.e. one having the same condset, for the object being classified. Different strategies can be applied to handle these problems; see Section 4.2. Our task is to combine associative classification with the problem of recurrent items.
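By way of illustration, the following C++ sketch shows one possible in-memory representation of such a transaction with recurrent items; the type and field names are our own and are not part of any of the cited systems.

    #include <map>
    #include <string>

    // A labelled transaction with recurrent items, e.g. <{3A1, 2A2}, c>.
    // Each item maps to its occurrence count o_i; items absent from the map
    // implicitly have a count of zero.
    struct Transaction {
        std::map<std::string, int> items; // item -> number of occurrences
        std::string label;                // class label c
    };

    // Builds the example transaction <{3A1, 2A2}, c>.
    Transaction makeExample() {
        Transaction t;
        t.items["A1"] = 3;
        t.items["A2"] = 2;
        t.label = "c";
        return t;
    }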
More formally, our goal is to modify the original approach so that transactions of the form $\langle \{i_1, i_2, \ldots, i_n\}, c \rangle$, where $i_k$ is an item in a transaction (e.g. a word in a document) and c is a class label, become transactions of the form $\langle \{o_1 i_1, o_2 i_2, \ldots, o_n i_n\}, c \rangle$, where $o_k$ is the number of occurrences of the item $i_k$ in the transaction. The rules generated from this set of transactions have the form $\langle condset, c \rangle$ and are used for the classification of new objects. The hypothesis we examine is that this representation of data is more beneficial, since it preserves more information about both objects and rules.

3 RELATED WORK

Association rules have been recognized as a useful tool for finding interesting hidden patterns in transactional databases. Several techniques have been introduced to tackle this problem effectively; the most important methods are based on either the Apriori or the FP-growth approach. However, less research has been done on transactions with reoccurring items. In [9] the authors assign weights to items in transactions and introduce the WAR algorithm to mine the rules. This method is twofold: in the first step frequent itemsets are generated without considering weights, and then weighted association rules (WARs) are derived from each of these itemsets. The MaxOccur algorithm [12] is an efficient Apriori-based method for discovering association rules with recurrent items. It reduces the search space by effective use of joining and pruning techniques. The FP'-tree approach presented in [8] extends the FP-tree idea and involves several steps. For every distinct number of occurrences of a given item, a separate node is created. When a new transaction is inserted into the tree, it might also increase the support count of other path(s) of the tree, based on the intersection between the two itemsets. Given the complete tree, the enumeration process for finding frequent patterns is similar to that of the FP-tree approach.

One of the most interesting and promising applications of association rules is the classification task. Several such classifiers have been introduced so far, i.e. CBA, CMAR, ARC-AC, and ARC-BC. However, they use rules without reoccurrence of items in a single transaction. CBA [7], an Apriori-like algorithm, labels new objects based on the confidence of matched rules. CMAR [5], based on the FP-growth algorithm, produces strong rules by pruning rules that are more specific and less confident than other, more general and more confident ones. If more than one rule matches a given object, additional tests are performed to select the strongest group of rules with the same label. ARC-AC and ARC-BC [11] are based on the Apriori algorithm. The ARC-BC approach treats each group of transactions labelled with the same category separately, while ARC-AC considers all categories combined. In order to manage the selection of rules, the notion of a dominance factor is introduced, defined as the proportion of rules of the most dominant category among the rules applicable to the document being classified. ARC-AC and ARC-BC were originally developed for classifying text [11] and were later used to classify objects in images [2] [3].

4 PROPOSED APPROACH

Based on the work presented in Section 3, we introduce a new classification method. Our approach, ACRI (Associative Classifier with Reoccurring Items), consists of two modules; see Figure 1. We decided to base our algorithm for mining reoccurring items on the Apriori property.
The rationale behind this selection was our good knowledge of the basic Apriori algorithm, its efficiency, and our own C++ implementation of this method. The second choice was to follow the ARC-BC approach, motivated by its efficiency in the case of unevenly distributed class labels. This is the core of the rule generator module; see subsection 4.1. It mines the set of rules with reoccurring items, which is the main input of the other module, the classifier. The classification process can be considered as matching the rules to the object; see subsection 4.2.

[Figure 1 High-level diagram of ACRI: a set of transactions feeds the rule generator, which produces a set of rules; the classifier uses these rules to turn objects without labels into labeled objects.]

4.1 RULE GENERATOR

This module is designed to find, in a given set of transactions, all frequent rules of the form $\langle \{o_1 i_1, o_2 i_2, \ldots, o_n i_n\}, c \rangle$, i.e. rules whose support is equal to or greater than a user-defined min_support. The general framework of this part is based on the ARC-BC approach [11]. This means that the initial set of transactions is divided by category and rules are generated for each category independently; see Figure 2.

[Figure 2 High-level diagram of the rule generator module: a transaction divider splits the set of transactions into subsets for classes C1..CN; a separate rule generator produces the rules for each class; a rule merger combines them into the final set of rules.]

The main components of this module are the following:

Transaction divider – this block scans the set of transactions once and creates N subsets of this set, one per category (N is equal to the number of categories).

Rule generator for class Cx – this is an Apriori-like algorithm for mining frequent itemsets that extends the original method by taking into account reoccurrences of items in a single transaction. In order to deal with this, the support count has to be redefined. The main difference is that a single transaction may increase the support of a given itemset by more than one; we say that a transaction supports an itemset by a given number. The formal definition of this approach, proposed in [12] as the MaxOccur algorithm, is given below.

A transaction $T = \langle \{o_1 i_1, o_2 i_2, \ldots, o_n i_n\}, c \rangle$ supports an itemset $I = \{l_1 i_1, l_2 i_2, \ldots, l_n i_n\}$ if and only if $l_i \le o_i$ for all $i \in \{1, \ldots, n\}$. The number t by which T supports I is calculated according to the formula:

$t = \min_{i \in \{1,\ldots,n\}:\, l_i > 0} \left\lfloor \frac{o_i}{l_i} \right\rfloor$

Example. Let us consider the following transaction: $T = \langle \{2i_1, 4i_2, 6i_7\}, 1 \rangle$. The support given by this transaction to several itemsets is shown in Table 1.

Table 1 Example of support counting
Itemset I           Support t
{2i1, 3i2, 5i7}     1
{i1, 2i2, 3i7}      2
{i2, 2i7}           3
{i1, i2, i3}        0

Thus, the rule generator module finds all the itemsets that are frequent according to the above definition of support.

Rule merger – this block collects the rule sets for the different classes and merges them into the complete set of mined rules. In this part we do not perform any pruning, so the merging is nothing but collecting the rules together.
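To make the definition concrete, here is a minimal C++ sketch of this support count, following the MaxOccur definition above (the code and names are our own, not the actual acri source):

    #include <algorithm>
    #include <climits>
    #include <map>
    #include <string>

    typedef std::map<std::string, int> ItemCounts; // item -> occurrence count

    // Number t by which a transaction with counts o_i supports an itemset
    // with counts l_i: t = min over items with l_i > 0 of floor(o_i / l_i).
    // Assumes the itemset map stores only positive counts; a result of 0
    // means the transaction does not support the itemset.
    int supportCount(const ItemCounts& transaction, const ItemCounts& itemset) {
        int t = INT_MAX;
        for (ItemCounts::const_iterator it = itemset.begin();
             it != itemset.end(); ++it) {
            ItemCounts::const_iterator found = transaction.find(it->first);
            int o = (found == transaction.end()) ? 0 : found->second;
            t = std::min(t, o / it->second); // floor(o_i / l_i); 0 if missing
        }
        return (t == INT_MAX) ? 0 : t; // empty itemset supports nothing
    }

For the transaction T = {2i1, 4i2, 6i7} and the itemset I = {i2, 2i7}, this returns min(⌊4/1⌋, ⌊6/2⌋) = 3, in agreement with Table 1.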
4.2 CLASSIFIER

This module labels objects based on the set of mined rules obtained from the rule generator. An associative classifier is a rule-based classification system, which means that an object is labelled on the basis of a matched rule (or a set of rules in the case of multi-classification). This task is simple if there is an exact match between a rule and the object, i.e. the antecedent of the rule and the object are identical. The model often does not include any rule that matches a given object exactly, though. In such a case, in order to make the classification, all rules are ranked according to a given scenario and the best one (or several) is matched to the object. Rule ranking may follow different strategies, each of which assigns to a rule a number that reflects its similarity to the object. These strategies may be used either separately or in various combinations. We have tested the following ones: cosine measure, distance measure, coverage, confidence, support, and dominant class, which are characterized below.

Let us consider a rule $\langle \{o_1 i_1, o_2 i_2, \ldots, o_n i_n\}, c \rangle$ and an object to be classified $\{l_1 i_1, l_2 i_2, \ldots, l_n i_n\}$. The corresponding n-dimensional vectors are denoted $o = [o_1, o_2, \ldots, o_n]$ and $l = [l_1, l_2, \ldots, l_n]$. The three following measures are based on this representation.

Cosine measure (CM) – assigns the angle between the two vectors, i.e. $CM = \arccos\left( \frac{o \cdot l}{\|o\|\,\|l\|} \right)$. The smaller the CM value, the smaller the angle and the closer the two vectors are in n-dimensional space. It is equal to zero if the vectors have the same direction, which, roughly speaking, means that they have the same "proportions" of items.

Distance measure (DM) – assigns the distance between the end points of the two vectors according to a given norm, i.e. $DM = distance(o, l)$. We have tested the L1 and L2 norms as distance functions. In general, this measure seems useless on its own, since the vectors might have various lengths; its only rational use might be as "fine tuning", i.e. after the rule set has already been pruned. The smaller the distance, the closer the end points of the two vectors.

Coverage (CV) – assigns the ratio of the number of items common to the object and the rule to the number of items in the rule (ignoring reoccurrences). The larger the CV ratio, the more items the rule and the object have in common. CV = 1 means that the rule is entirely contained in the object.

The two following ranking methods refer to properties of the rule only and do not depend on the classified object; thus, they have to be used together with other measures that prune the rule set:

Confidence

Support

In the last examined classification scenario, called dominant class, the object is assigned the class label that is the most frequent among the rules matching the object.
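As an illustration of the two object-dependent measures above, the following sketch (our own helper code, not the actual acric source) computes CM and CV for a rule antecedent and an object, both given as item-to-count maps:

    #include <algorithm>
    #include <cmath>
    #include <map>
    #include <string>

    typedef std::map<std::string, int> ItemCounts;

    // Cosine measure: the angle (in radians) between the rule vector o and
    // the object vector l; 0 means identical item "proportions".
    double cosineMeasure(const ItemCounts& rule, const ItemCounts& object) {
        double dot = 0.0, normRule = 0.0, normObj = 0.0;
        for (ItemCounts::const_iterator it = rule.begin(); it != rule.end(); ++it) {
            normRule += (double)it->second * it->second;
            ItemCounts::const_iterator jt = object.find(it->first);
            if (jt != object.end()) dot += (double)it->second * jt->second;
        }
        for (ItemCounts::const_iterator it = object.begin(); it != object.end(); ++it)
            normObj += (double)it->second * it->second;
        if (normRule == 0.0 || normObj == 0.0) return std::acos(0.0); // 90 deg
        double c = dot / (std::sqrt(normRule) * std::sqrt(normObj));
        return std::acos(std::min(1.0, c)); // clamp rounding noise above 1.0
    }

    // Coverage: fraction of the rule's distinct items present in the object;
    // CV == 1 means the rule is entirely contained in the object.
    double coverage(const ItemCounts& rule, const ItemCounts& object) {
        if (rule.empty()) return 0.0;
        int common = 0;
        for (ItemCounts::const_iterator it = rule.begin(); it != rule.end(); ++it)
            if (object.count(it->first)) ++common;
        return (double)common / (double)rule.size();
    }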
5 SOFTWARE IMPLEMENTATION

5.1 GENERAL OVERVIEW

The ACRI program suite consists of three modules; see Figure 3. acri (the rule generator) and acric (the classifier) are written in C++. The acri module, however, produces a set of rules that needs to be completed with support and confidence values. Owing to the project's time constraints, this operation is performed by a separate script written in Perl (count.pl). In the future this module will be integrated into the acri application.

[Figure 3 Software modules of ACRI: acri turns the training set into rules; count.pl completes them with support and confidence values; acric uses the completed rules to label objects without labels.]

Refer to Appendix A for the usage of the programs.

6 EXPERIMENTS

6.1 EXPERIMENTAL SETUP

We used the Reuters-21578 text collection to perform the experiments. We chose the "ready-to-use" top 10 topics [14] from this dataset. The total of 9980 documents is split into two sets: 7193 documents for training and 2787 for testing. First, we preprocessed the data by extracting text from a collection of XML documents. To normalize the words we used Porter's algorithm [13] for stemming. We also pruned stop words, i.e. words that appear very frequently and contribute nothing to the results. The list of stop words was a combination of the list used in [11] and words from our own observations while performing tests (e.g. an obvious stop word is the word reuters, as it appears in every document and should be treated as noise rather than as useful information).

6.2 COMPARISON TESTS

In order to compare the results of our implementation of the algorithm for classifying documents with recurrent items to the ARC-BC approach, we provided the application with "switches" that allow it to produce rules while ignoring the reoccurrence of words in documents, which emulates the ARC-BC algorithm. Although the ACRI program suite thus contains the ARC-BC approach, from now on the term ACRI will be used only to denote the method with recurrent items.

We produced several different sets of rules to be used in the classifier. For ARC-BC we chose support thresholds ranging from 10 to 30% in steps of 5%, and from 15 to 65% with the same step for our approach. The difference between the support thresholds lies in the definition of support for mining rules with recurrent items. As mentioned in Section 4.1, a single transaction (document) can support an itemset (a set of words) more than once. Therefore, if we define support as the ratio of the support count to the total number of transactions, as introduced in [8], some itemsets may have support greater than 100%. On the other hand, if we choose the definition presented in [12], i.e. the ratio of the support count to the number of distinct items (words), the support will never reach 100% as long as there is more than one distinct item in the dataset. In practice, the latter definition requires setting very small thresholds to obtain reasonable results. Hence, we decided to use the first one, as it is more similar to the "classical" definition of support. Note that no matter which definition we choose, it eventually leads to setting the same support count (i.e. the absolute number of transactions supporting an itemset).

For each support threshold we set three different confidence thresholds: 0, 35, and 70%. The last threshold was used in [11] as the minimum reasonable threshold for producing rules; the first one (no threshold) was introduced to observe how the classifier deals with large numbers of rules; and the threshold of 35% is simply the middle value between the two others. For each single experiment we tried to keep the share of classified objects above 98%, which required manipulating CV (see Section 4.2) between 0.3 and 1. We discarded cases for which no CV setting could satisfy the minimum number of classified objects. More than 90% of the remaining results had CV = 1. We also performed experiments without specifying CV (using different methods of choosing applicable rules); however, they turned out to produce much lower accuracy than those with CV > 0.3.

We used different classification techniques for choosing the most applicable rule matching the object. The best confidence and dominant class matching methods were used for both the ARC-BC and ACRI approaches. Additionally, ACRI was provided with the cosine measure technique, and with a combination of the cosine measure, best confidence, and dominant class techniques. The combination of matching techniques is realized as a scenario with a different tolerance factor for each technique. An example scenario looks like this: (1) choose the top 20% of rules with the best cosine measure, then (2) choose the 50% of the remaining rules with the highest confidence, and then (3) choose the rule based on the dominant class technique.
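A sketch of how such a scenario could be realized follows (our own illustration of the idea; the actual acric implementation may differ):

    #include <algorithm>
    #include <map>
    #include <string>
    #include <vector>

    // Per-rule ranking data for one object being classified (names are ours).
    struct RankedRule {
        std::string label; // class label c
        double confidence; // rule confidence
        double cosine;     // CM against the object, in radians (smaller is better)
    };

    static bool byCosine(const RankedRule& a, const RankedRule& b)
        { return a.cosine < b.cosine; }
    static bool byConfidence(const RankedRule& a, const RankedRule& b)
        { return a.confidence > b.confidence; }

    // Scenario: keep the top 20% of rules by cosine measure, then the top
    // 50% of those by confidence, then vote by dominant class.
    std::string classifyByScenario(std::vector<RankedRule> rules) {
        if (rules.empty()) return std::string();
        // (1) top 20% by cosine measure (smallest angle first)
        std::sort(rules.begin(), rules.end(), byCosine);
        rules.resize(std::max<std::size_t>(1, rules.size() / 5));
        // (2) top 50% of the remaining rules by confidence
        std::sort(rules.begin(), rules.end(), byConfidence);
        rules.resize(std::max<std::size_t>(1, rules.size() / 2));
        // (3) dominant class among the surviving rules
        std::map<std::string, int> votes;
        for (std::size_t i = 0; i < rules.size(); ++i) ++votes[rules[i].label];
        std::string best; int bestCount = -1;
        for (std::map<std::string, int>::const_iterator it = votes.begin();
             it != votes.end(); ++it)
            if (it->second > bestCount) { best = it->first; bestCount = it->second; }
        return best;
    }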
6.3 RESULTS

Table 2 and Table 3 show the number of rules and the accuracy for the ACRI and ARC-BC approaches, respectively. The rows correspond to different levels of support and confidence, given in the form SxCy, meaning that the support threshold was x% and the confidence threshold was y%.

Table 2 Number of rules and accuracy [%] of ACRI
Setting  Rules  Confidence  Dominant class  Cosine  Cosine & confidence
S15C0    13899  76.42       75.88           73.19   82.77
S15C35   11242  76.43       81.17           79.95   83.90
S20C0     5481  74.54       75.01           71.38   81.54
S20C35    4253  74.59       80.42           78.29   82.72
S25C0     2693  74.92       73.81           71.58   81.10
S25C35    2011  75.00       80.12           78.28   82.32
S30C0     1564  74.86       72.99           72.41   81.51
S30C35    1131  74.95       79.24           80.18   82.64
S35C0     1006  74.25       72.77           71.51   80.13
S35C35     716  74.53       78.74           79.28   81.42
S40C0      696  73.05       71.28           70.95   78.78
S40C35     488  73.33       77.96           78.99   80.52

Table 3 Number of rules and accuracy [%] of ARC-BC
Setting  Rules  Confidence  Dominant factor
S10C0     8231  75.74       78.00
S10C35    5764  75.78       84.09
S10C70    1402  49.10       74.82
S15C0     1817  75.29       77.77
S15C35    1164  75.33       83.67
S15C70     290  47.85       71.48
S20C0      643  74.96       76.55
S20C35     401  75.52       83.24
S25C0      316  75.14       76.30
S25C35     213  62.57       71.88
S30C0      182  74.87       71.80

In both cases the best confidence level was 35%. For the ARC-BC classifier the best strategy was to use the dominant factor, whereas for ACRI the combination of the cosine measure and confidence factors worked best. Based on these tables, Figure 4 shows the relationship between support and accuracy for the two approaches. Comparing the best results found, ARC-BC slightly outperforms ACRI; however, it seems to be more sensitive to changes of support. The accuracy of ACRI is almost independent of the support threshold, whereas for ARC-BC it decreases significantly when this threshold exceeds 20%.

[Figure 4 Accuracy of ACRI and ARC-BC: accuracy vs. support (confidence = 35%) for ARC-BC (dominant class), ACRI (cosine + confidence), ARC-BC (confidence), and ACRI (cosine + confidence + dominant class); accuracy axis 60-90%, support axis 10-40%.]

Figure 5 and Figure 6 show the number of generated rules with and without recurrent items. As can be observed, considering recurrences results in more rules, which has its origin in the different support definition. Another interesting relationship is that, when the confidence threshold is increased from 0% to 35%, the difference between the numbers of rules decreases more rapidly for ACRI.

[Figure 5 Number of rules vs. support for single and multiple occurrences, confidence = 0%. Figure 6 Number of rules vs. support for single and multiple occurrences, confidence = 35%.]

Figure 7 shows the running time of the rule generator with and without considering recurrent items.
The algorithm with recurrences is slower, since it has to search a larger space, yet the difference becomes smaller as the support threshold increases.

[Figure 7 Algorithm efficiency: average rule generation time per class [s] vs. support threshold, for single and multiple occurrence support counting.]

The authors of ARC-BC [11] claim to have the best results for confidence thresholds greater than 70%. In our experiments, however, lower confidence thresholds performed better for both the ARC-BC and ACRI approaches. This dissimilarity may be explained by a different method of counting support and confidence and/or by the fact that our classifier is slightly different from the original one.

7 CONCLUSIONS AND FUTURE WORK

In this document we explained the ideas of associative classification and of mining frequent itemsets with recurrent items. Then we combined the two and presented a new approach to associative classification with recurrent items. In our project we developed ACRI, a suite of modules for rule generation and classification. Our implementation allowed us to compare classification with recurrent items to the ARC-BC classification method. We performed several series of experiments and presented the results. For the given definitions of support and confidence we observed that ACRI performs slightly worse than the ARC-BC classifier. Moreover, ACRI needed more rules to classify objects with accuracy comparable to ARC-BC. However, ACRI seems to be less sensitive, with respect to accuracy, to the support threshold. We must be aware, though, that our definition of support in mining frequent itemsets with recurrent items differs from the original one, and to obtain the same absolute support count we have to set a higher relative support threshold.

As we were limited by time constraints, we did not perform enough experiments to properly assess the performance of our approach. We anticipate that using different definitions of support and confidence for rules assigned to different classes may improve the results. We would also like to try different techniques for pruning rules, such as removing contradictory rules or rule generalization.

APPENDIX A ACRI PROGRAM SUITE

All three modules of the ACRI program suite are command line applications.

A.1 acri

The usage of acri is described in Table 4. inputfilename is the name of a file containing transactions (documents), minsupp is a support threshold, and outputfilename is the name of the file that will contain the rules. The formats of inputfilename and outputfilename are presented in Table 5 and Table 6, respectively. Optionally, a user can choose the --verbose option to watch the steps of the mining process; --suppressrecurr to produce rules regardless of the reoccurrence of items (plain Apriori); and --tmpdir to specify the folder for temporary files.

Table 4 Command line usage annotation for acri
$ acri
Usage: acri [OPTIONS] inputfilename minsupp outputfilename
OPTIONS:
  --verbose
  --suppressrecurr
  --tmpdir string

Table 5 acri input file format
class_label [class_label [...]] : word [word [...]]
[...]

Table 6 acri output file format
class_label : word(recurrence) [word(recurrence) [...]]
[...]
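For illustration, a hypothetical input line describing a two-topic document, and a rule that could be mined from it in the output format, might look as follows (these lines are invented for this example and are not taken from the actual Reuters data):

    earn acq : profit profit share quarter profit
    earn : profit(3) share(1)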
A.2 count.pl

count.pl requires three file names (see Table 7): the file containing the rules generated by acri, the file containing the training set, and the output file for the rules with support and confidence (see Table 8). Optionally, a user can set a confidence threshold (--minconf) and/or suppress counting item recurrences (--suppressrecurr).

Table 7 Command line usage annotation for count.pl
$ count.pl
Usage: count.pl [--minconf float] [--suppressrecurr] rule_file transaction_file output_file

Table 8 count.pl output file format
class_label : word(recurrence) [word(recurrence) [...]] : support : confidence
[...]

A.3 acric

The usage of acric is presented in Table 9. The application requires the name of a file containing the rules from the previous step (rulesfilename) and the name of a file containing a test set (objectsfilename). --rulecoverage, --cosinefactor, --distancefactor, --supportfactor, and --confidencefactor allow specifying the tolerance factors for CV, CM, DM, support, and confidence, respectively. The dominant class technique can be turned on with the --dominantclass option. Executing acric produces results containing a confusion matrix and several parameters measuring the correctness of the method (see Table 10).

Table 9 Command line usage annotation for acric
$ acric
Usage: acric [OPTIONS] rulesfilename objectsfilename
OPTIONS:
  --verbose
  --rulecoverage double
  --cosinefactor double
  --distancefactor double
  --supportfactor double
  --confidencefactor double
  --dominantclass

Table 10 Example of acric usage
$ ./acric data/rules/rmS25C35 data/source/reuters_test.data --rulecoverage 1 \
  --supportfactor 0.1 --cosinefactor 0.4 --dominantclass

Confusion Matrix (rows: true classes; columns: predicted classes)
 623    0   10   38    0    5   25    3    8    2
   0   37    2    1   25    0  145    3    3    1
   0   74    0    4  999    0    2    8    8    8
   8    4   99    5    4    3    0    0    0    0
 102   20   10    0    1    1    0   50  102   12
   0   24    3   12    2    1    6    0    0    1
   2    2    5    4    0    4    1   57    2    0
   0    0    3    0    1    0    0   21    0    0
   6    3    9    0   12    5   13   14  101    3
   0    0    0    0    0    0    0    0    0    0

Accuracy[%]: 79.37
Classified[%]: 99.46

Class ID  Class label   Precision[%]  Recall[%]  F1[%]
0         00_acq        81.1          88.1       84.4
1         01_corn       50.0          14.2       22.2
2         02_crude      73.2          76.7       74.9
3         03_earn       95.1          91.9       93.4
4         04_grain      47.1          66.4       55.1
5         05_interest   58.9          78.4       67.3
6         06_money-fx   61.4          57.6       59.4
7         07_ship       84.0          23.5       36.8
8         08_trade      60.8          86.3       71.3
*MacroAvg               61.1          58.3       59.7
*MicroAvg               79.3          79.3       79.3

REFERENCES:
[1] Agrawal R., Srikant R., Fast Algorithms for Mining Association Rules, International Conference on Very Large Data Bases, pp. 487-499, 1994
[2] Antonie M.-L., Zaïane O.R., Coman A., Associative Classifiers for Medical Images, Lecture Notes in Artificial Intelligence 2797, Mining Multimedia and Complex Data, Springer-Verlag, pp. 68-83, 2003
[3] Antonie M.-L., Zaïane O.R., Coman A., Mammography Classification by an Association Rule-Based Classifier, Third International ACM SIGKDD Workshop on Multimedia Data Mining (MDM/KDD'2002) in conjunction with Eighth ACM SIGKDD, pp. 62-69, 2002
[4] Han J., Kamber M., Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001
[5] Li W., Han J., Pei J., CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules, IEEE International Conference on Data Mining, pp. 369-376, 2001
[6] Liu B., Hsu W., Ma Y., Integrating Classification and Association Rule Mining, Knowledge Discovery and Data Mining, pp. 80-86, 1998
[7] Liu B., Hsu W., Ma Y., Integrating Classification and Association Rule Mining, Knowledge Discovery and Data Mining, pp. 80-86, 1998
[8] Ong K.-L., Ng W.-K., Lim E.-P., Mining Multi-Level Rules with Recurrent Items Using FP'-Tree, ICICS, 2001
[9] Wang W., Yang J., Yu P., WAR: Weighted Association Rules for Item Intensities, Knowledge and Information Systems, vol. 6, pp. 203-229, 2004
[10] Zaïane O.R., Han J., Finding Spatial Associations in Images, Data Mining and Knowledge Discovery: Theory, Tools, and Technology II, SPIE 14th International Symposium on Aerospace/Defense Sensing, Simulation and Controls, pp. 138-147, 2000
[11] Zaïane O.R., Antonie M.-L., Classifying Text Documents by Associating Terms with Text Categories, Thirteenth Australasian Database Conference (ADC'02), pp. 215-222, 2002
[12] Zaïane O.R., Han J., Zhu H., Mining Recurrent Items in Multimedia with Progressive Resolution Refinement, International Conference on Data Engineering (ICDE'2000), pp. 461-470, 2000

WEB PAGE REFERENCES:
[13] Porter's Stemming Algorithm. http://www.tartarus.org/~martin/PorterStemmer/
[14] Reuters-21578 Top 10 Topics Collection. http://www.jihe.net/datasets.htm - rttop10