Appendix A

In this section we illustrate the process of building the OCCT. We will present the induction process using each of the proposed splitting criteria. Table A1 presents the training set which will be used throughout this illustration. The data was extracted from the database misuse detection dataset. Apart from the Request ID, the first three columns (Request Part of Day, Request Day of Week and Request Location) are columns from table TA, while the last two columns (Customer City and Customer Type) are columns from table TB. All records represent matching pairs of records.

TABLE A1
TRAINING SET FOR ILLUSTRATION

Request ID | Request Part of Day | Request Day of Week | Request Location | Customer City | Customer Type
1  | Afternoon | Friday    | Berlin  | Berlin  | private
2  | Afternoon | Wednesday | Hamburg | Hamburg | private
3  | Morning   | Wednesday | Berlin  | Berlin  | business
4  | Morning   | Wednesday | Berlin  | Berlin  | private
5  | Afternoon | Saturday  | Berlin  | Berlin  | private
6  | Morning   | Thursday  | Berlin  | Berlin  | private
7  | Afternoon | Friday    | Berlin  | Berlin  | private
8  | Afternoon | Saturday  | Berlin  | Berlin  | business
9  | Afternoon | Saturday  | Berlin  | Berlin  | private
10 | Afternoon | Friday    | Hamburg | Hamburg | business
11 | Afternoon | Monday    | Hamburg | Hamburg | business
12 | Afternoon | Saturday  | Hamburg | Hamburg | private
13 | Afternoon | Monday    | Berlin  | Berlin  | private
14 | Afternoon | Monday    | Berlin  | Bonn    | private
15 | Afternoon | Monday    | Berlin  | Berlin  | private
16 | Morning   | Saturday  | Bonn    | Bonn    | private
17 | Morning   | Saturday  | Hamburg | Hamburg | private
18 | Morning   | Saturday  | Hamburg | Hamburg | private
19 | Afternoon | Friday    | Hamburg | Hamburg | private
20 | Afternoon | Friday    | Hamburg | Bonn    | private
21 | Morning   | Friday    | Hamburg | Berlin  | private
22 | Morning   | Friday    | Berlin  | Berlin  | business
23 | Morning   | Friday    | Berlin  | Berlin  | private
24 | Afternoon | Wednesday | Berlin  | Berlin  | private
25 | Afternoon | Thursday  | Berlin  | Berlin  | private
26 | Afternoon | Thursday  | Berlin  | Berlin  | business
27 | Afternoon | Monday    | Bonn    | Bonn    | business
28 | Afternoon | Monday    | Bonn    | Hamburg | private
29 | Afternoon | Monday    | Bonn    | Berlin  | business
30 | Afternoon | Wednesday | Bonn    | Bonn    | business
31 | Afternoon | Friday    | Bonn    | Bonn    | private

Maximum Likelihood Estimation

We will start by illustrating the MLE splitting criterion. When using this criterion, we do not evaluate a series of binary splits for each candidate attribute; instead, we evaluate a multi-way split as a whole. For example, when examining the 'Request Location' candidate attribute, we split the given dataset into three subsets, according to the possible values of the attribute. Then, for each subset, we build a set of probabilistic models describing the probability of the values from table TB given the values of all other attributes of table TB (e.g., a model describing the probability of each value of 'Customer City' given the value of 'Customer Type'). Then, for each record in the subset, we calculate its maximum likelihood according to Equation (5), using the set of models we have just induced. This reflects how well the set of models describes the subset from which it was induced. Given the dataset described in Table A1 and the 'Request Location' attribute, we obtain three likelihood scores: -6.0183 (Berlin), -4.133 (Bonn) and -4.885 (Hamburg). The final score of the attribute is the sum of all individual scores (-15.0376). Similarly, we calculate the scores for 'Request Day of Week' (-20.709) and 'Request Part of Day' (-21.386). We are maximizing the MLE score, and therefore 'Request Location' is chosen as the first split in the tree. Pursuing this method with no pruning at all yields the OCCT presented in Fig. A1.
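For concreteness, the computation of an MLE split score can be sketched in Python roughly as follows. This is only an illustration under simplifying assumptions: records are assumed to be dicts keyed by attribute name, each conditional model is estimated by plain frequency counts (with no smoothing and no feature selection), and the function and variable names are ours, not the paper's.

import math
from collections import Counter, defaultdict

def subset_log_likelihood(records, tb_attrs):
    # Log-likelihood of a subset under simple conditional models of the form
    # P(attribute | remaining TB attributes), estimated from the subset itself
    # by frequency counts (a rough stand-in for Equation (5)).
    score = 0.0
    for target in tb_attrs:
        others = [a for a in tb_attrs if a != target]
        table = defaultdict(Counter)
        for r in records:
            table[tuple(r[a] for a in others)][r[target]] += 1
        for r in records:
            cond = table[tuple(r[a] for a in others)]
            score += math.log(cond[r[target]] / sum(cond.values()))
    return score

def mle_split_score(records, split_attr, tb_attrs):
    # Multi-way split on split_attr: the candidate's score is the sum of the
    # log-likelihoods of the resulting subsets.
    subsets = defaultdict(list)
    for r in records:
        subsets[r[split_attr]].append(r)
    return sum(subset_log_likelihood(s, tb_attrs) for s in subsets.values())

# The first split is the TA attribute with the maximal score, e.g.:
# best = max(['Request Location', 'Request Day of Week', 'Request Part of Day'],
#            key=lambda a: mle_split_score(data, a, ['Customer City', 'Customer Type']))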
The tree consists of two to three levels, representing the three attributes that originate from table TA. The leaves of the tree contain the models representing the attributes originating from table TB. In the path 'Req. Location=Berlin & Req. PartOfDay=Morning', for example, the feature selection process found that the CustomerCity attribute did not reveal much information, and it was therefore not included in the final set of models that represent the leaves of this path.

[Figure A1: the full tree, rooted at a split on Req. Location, with internal splits on Req. Day of Week and Req. Part of Day, and with leaves holding cust. city and cust. type models.]
Fig. A1. The decision model which is induced from the given training set, when using the MLE splitting criterion and applying no pruning.

If we apply MLE pruning while building the tree, the induced model is different, as demonstrated in Fig. A2. It is noticeable that some branches were pruned during the induction process. For example, the sub-branch of the path 'Req. Location=Berlin' was pruned. This is because the MLE score of the path without growing the tree further was -6.0183, while the total score of its children would be -6.2813. The score of performing the split was lower than the score when the split is not performed. Therefore, the split does not provide enough new information to make it worthwhile, and thus the branch is pruned.

If we apply LPI pruning with a pruning threshold of 1, the tree is pruned quite radically, such that only the set of models remains, as described in Fig. A3. This is due to the fact that the LPI score of the best split in this scenario ('Req. Location') is 0.426. The pruning threshold is larger than the LPI score, and thus the branch is pruned. In fact, any threshold larger than 0.426 would result in the model described in Fig. A3. The set of models which is formed describes the complete dataset. If we were to set a threshold lower than 0.426, the tree would not be pruned at all and would resemble the tree described in Fig. A1. This is due to the fact that all LPI scores achieved further down in the tree are larger than the threshold.

[Figure A2: the pruned tree, rooted at a split on Req. Location, with several sub-branches replaced directly by cust. city and cust. type models.]
Fig. A2. The decision model which is induced from the given training set, when using the MLE splitting criterion and applying MLE pruning.

[Figure A3: a single set of models (cust. city and cust. type) with no splits.]
Fig. A3. The decision model which is induced from the given training set, when using the MLE splitting criterion and applying LPI pruning (pruning threshold = 1).

Coarse-Grained Jaccard

In order to choose the first split of the tree using the CGJ criterion, a Jaccard score must be calculated for each of the attributes from table TA. For example, for the 'Request Location' attribute, there are three possible values: Berlin, Bonn and Hamburg. We examine each of the possible binary splits. First, we split the dataset into two subsets: 'Request Location=Berlin' and 'Request Location != Berlin'.
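A minimal Python sketch of the two Jaccard-based scores for a single binary split is given below (records are again assumed to be dicts; attrs is the list of attributes that remain after removing the split attribute). The coarse-grained score follows Equation (1); for the fine-grained score, the pairwise similarity follows Equation (2), while the aggregation over pairs (a plain mean) is an assumption made only for illustration.

def cgj_score(s1, s2, attrs):
    # Coarse-grained Jaccard (Equation (1)): the fraction of identical records
    # (ignoring the split attribute) that the two subsets share.
    set1 = {tuple(r[a] for a in attrs) for r in s1}
    set2 = {tuple(r[a] for a in attrs) for r in s2}
    return len(set1 & set2) / len(set1 | set2)

def fgj_score(s1, s2, attrs):
    # Fine-grained Jaccard: each cross pair contributes the fraction of
    # attributes on which the two records agree (Equation (2)); averaging the
    # pairs is an illustrative aggregation choice, not necessarily the paper's.
    sims = [sum(r1[a] == r2[a] for a in attrs) / len(attrs)
            for r1 in s1 for r2 in s2]
    return sum(sims) / len(sims)

# The score of a candidate attribute is then the weighted average of its
# binary splits, each weighted by the relative size of the 'value = v' subset.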
We find that when neglecting 'Request Location', there is one intersecting record between the subsets (requests 21 and 23), and that the size of their union is 23 (identical records are counted only once). Thus, according to Equation (1), the CGJ score of this binary split is 1/23 = 0.0434. There are 16 records with the value 'Berlin' in the attribute 'Request Location', out of a total of 31 records; therefore, the weight of the binary split is 16/31 = 0.516. The same process is repeated for the other two possible values of the attribute 'Request Location' ('Request Location=Bonn': score = 0.0434, weight = 0.193; 'Request Location=Hamburg': score = 0.0869, weight = 0.2903). The overall score of the candidate split 'Request Location' is the weighted average of these scores, 0.0561. A similar procedure is used in order to calculate the scores of the other two candidate attributes: 'Request Day of Week' (0.278) and 'Request Part of Day' (0.173). We are minimizing the similarity between the sub-nodes created by a split, and therefore 'Request Location' is chosen as the first split in the tree. The described process is repeated on each of the subsets created by the split until no more attributes are candidates for a split, or until the subset is smaller than the set threshold.

Fine-Grained Jaccard

We will use the attribute 'Request Location' in order to illustrate the process. For this attribute, we examine each of the possible binary splits. For example, request 1 belongs to the subset 'Request Location=Berlin'. We compare it with all of the records belonging to the subset 'Request Location != Berlin' and look for partial matches. For example, request 2 did not originate in Berlin, and therefore belongs to the second subset. The partial intersection between these two records is 2 ('Request Part of Day' and 'Customer Type') out of a possible 4; therefore, according to Equation (2), their similarity is 2/4 = 0.5. Overall, the total score of 'Request Location=Berlin' is 0.5928, while the scores of 'Request Location=Bonn' and 'Request Location=Hamburg' are 0.5815 and 0.51666, respectively. The total score of this attribute is 0.5636. A similar process is used in order to calculate the scores of the other two candidate attributes: 'Request Day of Week' (0.5493) and 'Request Part of Day' (0.512). We favor minimal values, and therefore 'Request Part of Day' is chosen as the first split in the tree. The process repeats iteratively until no more attributes are candidates for a split, or until the subset is smaller than the set threshold.

Least Probable Intersections

Again we examine the 'Request Location' candidate attribute. When examining the binary split 'Request Location=Berlin' versus 'Request Location != Berlin', there is one intersecting record; thus, j = 1. P_i is calculated for each distinct record in the dataset according to Equation (3). For example, the first record in the dataset presented in Table A1 appears twice in the complete dataset (o_i = 2). The size of the subset containing records in which 'Request Location=Berlin' is 16 (k = 16), while the size of the subset containing records in which 'Request Location != Berlin' is 15 (q - k = 15). Thus, the P_i value of the first record is 0.499. Summing the P_i scores of all distinct records in the dataset given the examined split yields λ = 1.4984. Overall, the score of this binary split is 0.2572.
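The quantities P_i and λ used here can be sketched as follows. The exact form of Equation (3) is not reproduced in this appendix; the expression below is an assumed reading that matches the worked value P_i ≈ 0.499 for o_i = 2, k = 16 and q = 31, and which attributes define a 'distinct' record is likewise left as a parameter (key_attrs).

from collections import Counter

def lpi_lambda(records, split_attr, value, key_attrs):
    # Lambda for the binary split 'split_attr = value' versus the rest: the sum
    # of P_i over distinct records, where P_i (under the assumed reading of
    # Equation (3)) is the probability that the o_i copies of record i would
    # land on both sides of a random split of the same sizes (k and q - k).
    q = len(records)
    k = sum(1 for r in records if r[split_attr] == value)
    occurrences = Counter(tuple(r[a] for a in key_attrs) for r in records)
    lam = 0.0
    for o_i in occurrences.values():
        lam += 1.0 - (k / q) ** o_i - ((q - k) / q) ** o_i
    return lam

# For the record that appears twice (o_i = 2), with k = 16 and q = 31, the term
# above is 1 - (16/31)**2 - (15/31)**2, roughly 0.499; records occurring only
# once contribute 0. The split score then measures how surprising the observed
# number of intersecting records j is relative to a Poisson(lambda) model.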
The scores of the binary splits 'Request Location=Bonn' and 'Request Location=Hamburg' are 0.2695 and 0.426, respectively. The weighted average of these three scores (the weights are calculated as described above) is 0.426, which is the final score of the 'Request Location' attribute. The other two attributes yield scores of 1.4378 ('Request Day of Week') and 1.381 ('Request Part of Day'). We are maximizing the score, and therefore the first attribute chosen for a split is 'Request Day of Week'. The procedure is then repeated for each of the sub-nodes created.

Appendix B

TABLE B4
COMPARING THE PRUNING METHODS (MOVIE RECOMMENDER)

TABLE B1
COMPARING THE SPLITTING CRITERIA (DATABASE MISUSE)

FGJ No CGJ pruning CGJ LPI
p-value= 0.461 (◊) p-value= 0.693 (◊) statistic= 0.0975 statistic= 0.2624 p-value= 0.33 (◊) statistic= 0.4400
LPI FGJ LPI CGJ
p-value= 0.001 (◄) p-value= 0.468 (◊) statistic= 3.0221 statistic= 0.0787 p-value= 0.001 (◄) statistic= 3.0355
LPI FGJ MLE CGJ
p-value= 0.000 (◄) p-value= 0.463 (◊) statistic= 9.3044 statistic= 0.0906 p-value= 0.041 (▲) statistic= 1.7308
LPI MLE
p-value= 0.440 (◊) statistic= 0.1494 p-value= 0.445 (◊) statistic= 0.1381 p-value= 0.312 (◊) statistic= 0.4899 p-value= 0.000 (◄) statistic= 3.1777 p-value= 0.027 (◄) statistic= 1.9142 p-value= 0.000 (◄) statistic= 3.1904 p-value= 0.00 (◄) statistic= 7.6250 p-value= 0.118 (◊) statistic= 1.1805 p-value= 0.204 (◊) statistic= 0.2473

The '◄' symbol indicates that the accuracy of the row's splitting criterion was significantly higher than that of the column's splitting criterion. The '▲' symbol indicates that the accuracy of the row's criterion was significantly lower, and the '◊' symbol indicates no significant difference.
LPI No pruning CGJ No pruning FGJ LPI No pruning ------- CGJ LPI FGJ
p-value= 0.0047 (▲) statistic= 2.6007
LPI No pruning LPI
p-value= 0.0047 (▲) statistic= 2.6007
LPI No pruning ------- MLE LPI MLE
p-value= 0.4137 (◊) statistic= 0.2179 p-value= 0.0132 (▲) statistic= 2.2197 p-value= 0.0752 (◊) statistic= 1.4383 p-value= 0.1096 (◊) statistic= 1.2286 p-value= 0.0752 (◊) statistic= 1.4383 p-value= 0.1096 (◊) statistic= 1.2286 p-value= 0.4137 (◊) statistic= 0.2179 p-value= 0.0132 (▲) statistic= 2.2197
CGJ FGJ No pruning CGJ LPI FGJ LPI CGJ LPI FGJ MLE CGJ LPI LPI MLE
p-value= 0.247 (◊) ------- statistic= 0.6824 p-value= 0.330 (◊) p-value= 0.445 (◊) statistic= 0.4400 statistic= 0.1381 p-value= 0.312 (◊) statistic= 0.4899 p-value= 0.00 (▲) p-value= 0.00 (▲) p-value= 0.00 (▲) statistic= 35.177 statistic= 38.135 statistic= 44.687 p-value= 0.059 (◊) p-value= 0.230 (◊) statistic= 1.5584 statistic= 0.7384 p-value= 0.085 (◊) statistic= 1.3688 p-value= 0.202 (◊) p-value= 0.157 (◊) p-value= 0.386 (◊) statistic= 0.8373 statistic= 1.0052 statistic= 0.2877 p-value= 0.384 (◊) p-value= 0.248 (◊) statistic= 0.2951 statistic= 0.6778 p-value= 0.406 (◊) statistic= 0.2366
------- LPI -------
p-value= 0.0605 (◊) statistic= 1.5506 p-value= 0.0019 (▲) statistic= 2.8945 p-value= 0.0218 (▲) statistic= 2.0177 p-value= 0.0000 (▲) statistic= 4.3209
LPI ------- No pruning MLE
p-value= 0.0001 (▲) statistic= 3.7368 ------- p-value= 0.0091 (▲) statistic= 2.3627
LPI

TABLE B5
COMPARING THE SPLITTING CRITERIA (FRAUD DETECTION)

FGJ No CGJ pruning CGJ LPI
p-value= 0.00 (▲) p-value= 0.031 (▲) statistic= 1.8655 statistic= 5.2872 p-value= 0.000 (◄) statistic= 3.9612
LPI FGJ LPI CGJ
p-value= 0.140 (◊) p-value= 0.00 (▲) statistic= 1.0768 statistic= 5.0734 p-value= 0.000 (▲) statistic= 4.6296
LPI FGJ MLE CGJ
p-value= 0.004 (◄) p-value= 0.00 (▲) statistic= 2.6130 statistic= 5.0649 p-value= 0.00 (▲) statistic= 7.7839
LPI MLE
p-value= 0.055 (◊) statistic= 1.5953 p-value= 0.000 (◄) statistic= 4.8237 p-value= 0.312 (◊) statistic= 0.4899 p-value= 0.000 (▲) statistic= 3.6861 p-value= 0.000 (▲) statistic= 3.2235 p-value= 0.044 (◄) statistic= 0.0274 p-value= 0.042 (◄) statistic= 1.7240 p-value= 0.119 (◊) statistic= 1.1795 p-value= 0.00 (◄) statistic= 7.2602

TABLE B6
COMPARING THE PRUNING METHODS (FRAUD DETECTION)

LPI No pruning

TABLE B3
COMPARING THE SPLITTING CRITERIA (MOVIE RECOMMENDER)

p-value= 0.0040 (▲) statistic= 2.6537
LPI No pruning MLE
p-value= 0.0001 (▲) statistic= 3.7296
LPI

TABLE B2
COMPARING THE PRUNING METHODS (DATABASE MISUSE)

No pruning CGJ
p-value= 0.0013 (◄) statistic= 3.0215
LPI No pruning FGJ
p-value= 0.0000 (▲) statistic= 8.4023
LPI No pruning LPI
p-value= 0.0000 (▲) statistic= 4.8552
LPI No pruning MLE LPI
p-value= 0.000 (▲) statistic= 3.7368
MLE
p-value= 0.0000 (◄) statistic= 6.9261 p-value= 0.0000 (◄) statistic= 5.6153 p-value= 0.0278 (◄) statistic= 1.9139 p-value= 0.0000 (◄) statistic= 13.6651 p-value= 0.0010 (▲) statistic= 3.0973 p-value= 0.0327 (◄) statistic= 3.0973 p-value= 0.0084 (◄) statistic= 2.3917 p-value= 0.000 (◄) statistic= 7.2590