Appendix A
In this section we illustrate the process of building the
OCCT. We will present the induction process using
each of the proposed splitting criteria. Table A1 presents the training set which will be used throughout
this illustration. The data was extracted from the database misuse detection dataset. The three columns following Request ID originate from table TA, while the rightmost two columns originate from table TB. All records represent matching pairs of records.
TABLE A1
TRAINING SET FOR ILLUSTRATION
Request ID | Request Part of Day | Request Day of Week | Request Location | Customer City | Customer Type
1 | Afternoon | Friday | Berlin | Berlin | private
2 | Afternoon | Wednesday | Hamburg | Hamburg | private
3 | Morning | Wednesday | Berlin | Berlin | business
4 | Morning | Wednesday | Berlin | Berlin | private
5 | Afternoon | Saturday | Berlin | Berlin | private
6 | Morning | Thursday | Berlin | Berlin | private
7 | Afternoon | Friday | Berlin | Berlin | private
8 | Afternoon | Saturday | Berlin | Berlin | business
9 | Afternoon | Saturday | Berlin | Berlin | private
10 | Afternoon | Friday | Hamburg | Hamburg | business
11 | Afternoon | Monday | Hamburg | Hamburg | business
12 | Afternoon | Saturday | Hamburg | Hamburg | private
13 | Afternoon | Monday | Berlin | Berlin | private
14 | Afternoon | Monday | Berlin | Bonn | private
15 | Afternoon | Monday | Berlin | Berlin | private
16 | Morning | Saturday | Bonn | Bonn | private
17 | Morning | Saturday | Hamburg | Hamburg | private
18 | Morning | Saturday | Hamburg | Hamburg | private
19 | Afternoon | Friday | Hamburg | Hamburg | private
20 | Afternoon | Friday | Hamburg | Bonn | private
21 | Morning | Friday | Hamburg | Berlin | private
22 | Morning | Friday | Berlin | Berlin | business
23 | Morning | Friday | Berlin | Berlin | private
24 | Afternoon | Wednesday | Berlin | Berlin | private
25 | Afternoon | Thursday | Berlin | Berlin | private
26 | Afternoon | Thursday | Berlin | Berlin | business
27 | Afternoon | Monday | Bonn | Bonn | business
28 | Afternoon | Monday | Bonn | Hamburg | private
29 | Afternoon | Monday | Bonn | Berlin | business
30 | Afternoon | Wednesday | Bonn | Bonn | business
31 | Afternoon | Friday | Bonn | Bonn | private
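For concreteness, the training set can also be written down programmatically, which makes the criteria illustrated below easy to experiment with. The following sketch (in Python, which is not part of the paper) shows one possible representation as a list of records; only the first four rows of Table A1 are written out, and the names records, ta_attrs and tb_attrs are ours, not the paper's.

```python
# Illustrative representation of the Table A1 training set as Python records.
# Only the first four rows are shown; the remaining rows follow Table A1.
records = [
    {"Request ID": 1, "Request Part of Day": "Afternoon", "Request Day of Week": "Friday",
     "Request Location": "Berlin", "Customer City": "Berlin", "Customer Type": "private"},
    {"Request ID": 2, "Request Part of Day": "Afternoon", "Request Day of Week": "Wednesday",
     "Request Location": "Hamburg", "Customer City": "Hamburg", "Customer Type": "private"},
    {"Request ID": 3, "Request Part of Day": "Morning", "Request Day of Week": "Wednesday",
     "Request Location": "Berlin", "Customer City": "Berlin", "Customer Type": "business"},
    {"Request ID": 4, "Request Part of Day": "Morning", "Request Day of Week": "Wednesday",
     "Request Location": "Berlin", "Customer City": "Berlin", "Customer Type": "private"},
]

# Table TA (source) attributes and table TB (target) attributes used by the criteria.
ta_attrs = ["Request Part of Day", "Request Day of Week", "Request Location"]
tb_attrs = ["Customer City", "Customer Type"]
```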
Maximum Likelihood Estimation
We will start by illustrating the MLE splitting criterion.
When using this criterion, we do not evaluate a series of binary splits for each possible split. Instead, we evaluate a multi-way split as a whole. For example, when examining the 'Request Location' candidate attribute, we split the given dataset into three subsets, according to the possible values of the attribute. Then, for each subset we build a set of probabilistic models describing the probability of the values from table TB given the values of all other attributes of table TB (e.g., a model describing the probability of each value of 'Customer City' given the value of 'Customer Type'). Then, for each record in the subset, we calculate its maximum likelihood according to Equation (5), using the set of models we have just induced. This reflects how well the set of models describes the subset from which it was induced.
Given the dataset described in Table A1 and the 'Request Location' attribute, we obtain three likelihood scores: -6.0183 (Berlin), -4.133 (Bonn) and -4.885 (Hamburg). The final score of the attribute is the sum of the individual scores (-15.0376). Similarly, we calculate the scores for 'Request Day of Week' (-20.709) and 'Request Part of Day' (-21.386). We are maximizing the MLE score, and therefore 'Request Location' would be chosen as the first split in the tree.
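As an illustration of how such a multi-way split could be scored, the following Python sketch fits, for each subset, empirical conditional probability tables over the table TB attributes and sums the resulting log-likelihoods. It assumes the records, ta_attrs and tb_attrs names from the earlier sketch; the exact model set, feature selection and smoothing of Equation (5) are not reproduced here, so the numbers above should not be expected to match exactly.

```python
from collections import Counter, defaultdict
from math import log

def subset_log_likelihood(subset, tb_attrs):
    """Sum of log P(b | other table-B attributes) over the records of the
    subset, with the conditional probabilities estimated from the subset
    itself (an illustrative stand-in for Equation (5))."""
    total = 0.0
    for target in tb_attrs:
        others = [a for a in tb_attrs if a != target]
        joint = Counter((tuple(r[a] for a in others), r[target]) for r in subset)
        given = Counter(tuple(r[a] for a in others) for r in subset)
        for r in subset:
            key = tuple(r[a] for a in others)
            total += log(joint[(key, r[target])] / given[key])
    return total

def mle_split_score(records, ta_attr, tb_attrs):
    """Score of a multi-way split on a table-A attribute: the sum of the
    log-likelihoods of the resulting subsets (higher is better)."""
    subsets = defaultdict(list)
    for r in records:
        subsets[r[ta_attr]].append(r)
    return sum(subset_log_likelihood(s, tb_attrs) for s in subsets.values())

# The attribute with the maximal score is chosen for the split, e.g.:
# best = max(ta_attrs, key=lambda a: mle_split_score(records, a, tb_attrs))
```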
Pursuing this method with no pruning at all yields the OCCT presented in Fig. A1. It can be seen that the tree consists of two to three levels, representing the three attributes that originally come from table TA. The leaves of the tree contain the models representing the attributes originating from table TB. In the path 'Req. Location=Berlin & Req. PartOfDay=Morning', for example, the feature selection process found that the CustomerCity attribute did not reveal much information, and it was therefore not included in the final set of models that represent the leaves of the path.
[Fig. A1: tree diagram. The root splits on Req. Location; internal nodes split on Req. Part of Day and Req. Day of Week; the leaves hold cust. city and cust. type models.]
Fig. A1. The decision model induced from the given training set when using the MLE splitting criterion and applying no pruning.
If we apply MLE pruning while building the tree, the induced model is different, as demonstrated in Fig. A2. It is noticeable that some branches were pruned during the induction process. For example, the sub-branch of the path 'Req. Location=Berlin' was pruned. This is because the MLE score of the path without growing the tree further was -6.01838, while the total score of the children would be -6.2813. The score obtained by performing the split was lower than the score obtained when the split is not performed. The split therefore does not provide enough new information to make it worthwhile, and thus the branch is pruned.
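The pruning decision itself reduces to a comparison of two scores. A minimal sketch, reusing subset_log_likelihood() from the previous sketch:

```python
def mle_prune_split(parent_subset, child_subsets, tb_attrs):
    """MLE pre-pruning check (sketch). The split is kept only if the summed
    log-likelihood of the children exceeds that of the unsplit node; in the
    example, -6.2813 < -6.01838, so the 'Req. Location=Berlin' branch is
    pruned."""
    parent_score = subset_log_likelihood(parent_subset, tb_attrs)
    children_score = sum(subset_log_likelihood(s, tb_attrs) for s in child_subsets)
    return children_score <= parent_score  # True means: prune (do not split)
```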
If we apply LPI pruning with a pruning threshold of 1, the tree is radically pruned such that only the set of models remains, as described in Fig. A3. This is due to the fact that the LPI score of the best split in this scenario ('Req. Location') is 0.426. The pruning threshold is larger than the LPI score, and thus the branch is pruned. In fact, any threshold larger than 0.426 would result in the model described in Fig. A3. The set of models which is formed describes the complete dataset.
If we were to set a threshold lower than 0.426, the tree would not be pruned at all, and would resemble the tree described in Fig. A1. This is due to the fact that all LPI scores achieved further down in the tree are larger than the threshold.
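The corresponding check for LPI pruning is a simple threshold test on the best split's LPI score (the score itself is computed as illustrated in the 'Least Probable Intersections' section below). A minimal sketch:

```python
def lpi_prune(best_split_lpi_score, threshold):
    """LPI pre-pruning check (sketch): the branch is pruned when even the
    best candidate split's LPI score falls below the user-set threshold.
    In the example, the best score is 0.426, so any threshold above 0.426
    (here, 1) collapses the tree to a single set of models."""
    return best_split_lpi_score < threshold
```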
[Fig. A2: pruned tree diagram with internal nodes Req. Location, Req. Part of Day and Req. Day of Week, and leaves holding cust. city and cust. type models.]
Fig. A2. The decision model induced from the given training set when using the MLE splitting criterion and applying MLE pruning.
[Fig. A3: a single node holding the cust. city and cust. type models.]
Fig. A3. The decision model induced from the given training set when using the MLE splitting criterion and applying LPI pruning (pruning threshold = 1).
Coarse-Grained Jaccard
In order to choose the first split of the tree using the
CGJ criterion, a Jaccard score must be calculated for
each of the attributes from table TA. For example, for
the 'Request Location' attribute, there are three possible
values: Berlin, Bonn and Hamburg. We examine each
of the possible binary splits. First, we split the dataset into two subsets: 'Request Location=Berlin' and 'Request Location != Berlin'. We find that, when neglecting 'Request Location', there is one intersecting record between the subsets (requests 21 and 23), and that the size of their union is 23 (identical records are counted only once). Thus, according to Equation (1), the CGJ score for this binary split is 1/23 = 0.0434. There are 16 records with the value 'Berlin' in the attribute 'Request Location', out of a total of 31 records. Therefore, the weight of the binary split is 16/31 = 0.516.
The same process is repeated for the other two possible values of the attribute 'Request Location' ('Request Location=Bonn': score = 0.0434, weight = 0.193; 'Request Location=Hamburg': score = 0.0869, weight = 0.2903). The overall score of the candidate split 'Request Location' is calculated as the weighted average of these scores, and is therefore 0.0561.
A similar procedure is used in order to calculate the scores of the other two candidate attributes: 'Request Day of Week' (0.278) and 'Request Part of Day' (0.173). We are minimizing the similarity between the sub-nodes created following a split, and therefore we choose 'Request Location' as the first split in the tree.
The described process is repeated on each of the subsets created following the split, until no more attributes are candidates for a split or until the subset is smaller than the set threshold.
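The computation above can be summarized in a short sketch. The following Python function is an illustration rather than the paper's implementation; it assumes the records representation introduced after Table A1 and follows the weighted-average form of Equation (1) described in this section.

```python
def cgj_attribute_score(records, split_attr, ignore=("Request ID",)):
    """Coarse-grained Jaccard score of a candidate table-A attribute (sketch).
    For every value v, the records are split into {split_attr = v} and
    {split_attr != v}; the two subsets are compared on all remaining
    attributes, counting identical records only once, and the Jaccard
    similarities are combined by a weighted average (lower is better)."""
    drop = set(ignore) | {split_attr}

    def project(r):
        return tuple(sorted((k, v) for k, v in r.items() if k not in drop))

    score, n = 0.0, len(records)
    for value in {r[split_attr] for r in records}:
        inside = {project(r) for r in records if r[split_attr] == value}
        outside = {project(r) for r in records if r[split_attr] != value}
        jaccard = len(inside & outside) / len(inside | outside)
        weight = sum(r[split_attr] == value for r in records) / n
        score += weight * jaccard
    return score

# The attribute with the minimal score is chosen, e.g.:
# best = min(ta_attrs, key=lambda a: cgj_attribute_score(records, a))
```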
Fine-Grained Jaccard
We will use the attribute 'Request Location' in order
to illustrate the process. For this attribute, we examine
each of the possible binary splits. For example, request
1 will belong to the subset 'Request Location=Berlin'.
We will compare it with all of the records belonging to
the dataset 'Request Location != Berlin', and look for partial matches. For example, request 2 did not originate in Berlin, and therefore belongs to the second subset. The partial intersection between these two records would be 2 ('Request Part of Day' and 'Customer Type') out of a total of 4 possible attributes; therefore, according to Equation (2), their similarity would be 2/4 = 0.5.
Overall, the total score of 'Request Location=Berlin'
would be 0.5928, and of 'Request Location=Bonn' and
'Request Location=Hamburg' would be 0.5815 and
0.51666 respectively. The total score of this attribute
would be 0.5636.
A similar process is used in order to calculate the
scores of the other two candidate attributes: 'Request
Day of Week' (0.5493) and 'Request Part of Day' (0.512).
We favor minimal values and therefore 'Request Part
of Day' is chosen as the first split in the tree.
The process repeats iteratively until no more attributes are candidates for a split, or until the subset is smaller than the set threshold.
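A corresponding sketch for the fine-grained criterion is given below. It assumes the same records representation; the per-pair similarity follows the 2/4 example above, while the exact aggregation of Equation (2) may differ from this sketch, so the scores should be read as illustrative only.

```python
def fgj_attribute_score(records, split_attr, ignore=("Request ID",)):
    """Fine-grained Jaccard score of a candidate attribute (sketch).
    For every value v, each record of {split_attr = v} is compared with each
    record of {split_attr != v}; the similarity of a pair is the fraction of
    the remaining attributes on which the two records agree (2/4 = 0.5 for
    requests 1 and 2 in the text). The per-value scores are then averaged."""
    attrs = [a for a in records[0] if a not in set(ignore) | {split_attr}]
    value_scores = []
    for value in {r[split_attr] for r in records}:
        inside = [r for r in records if r[split_attr] == value]
        outside = [r for r in records if r[split_attr] != value]
        sims = [sum(a[k] == b[k] for k in attrs) / len(attrs)
                for a in inside for b in outside]
        value_scores.append(sum(sims) / len(sims))
    return sum(value_scores) / len(value_scores)

# The attribute with the minimal score is chosen, e.g.:
# best = min(ta_attrs, key=lambda a: fgj_attribute_score(records, a))
```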
Least Probable Intersections
Again we examine the 'Request Location' candidate
attribute. When examining the binary split 'Request
Location=Berlin' and 'Request Location != Berlin' there
is one intersecting record. Thus, j=1.
Pi is calculated for each distinct record in the dataset according to Equation (3). For example, the first record in the dataset presented in Table A1 appears twice in the complete dataset (oi=2). The size of the subset containing records in which 'Request Location=Berlin' is 16 (k=16), while the size of the subset containing records in which 'Request Location != Berlin' is 15 (q-k=15). Thus, the Pi value of the first record would be 0.499. Summing up the Pi scores of all distinct records in the dataset, given the examined split, yields 1.4984 (λ = 1.4984). Overall, the score of this binary split is 0.2572.
The scores of the binary splits 'Request Location=Bonn' and 'Request Location=Hamburg' are 0.2695 and 0.426 respectively. The weighted average of these three (the weights are calculated as described above) yields 0.426, which is the final score for the 'Request Location' attribute.
The other two attributes yield scores of 1.4378 ('Request Day of Week') and 1.381 ('Request Part of Day'). We are maximizing the score, and therefore the first attribute chosen for a split is 'Request Day of Week'. The procedure is repeated for each of the sub-nodes created.
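The following sketch computes j and λ for a binary split under the same assumptions as the earlier sketches. The form of Pi is inferred from the worked example (it reproduces the 0.499 value for oi = 2, k = 16, q = 31); the exact counting convention of Equation (3) and the Poisson-based score of Equation (4) are not reproduced here, so the resulting λ and final score may differ slightly from the values above.

```python
from collections import Counter

def lpi_binary_split_stats(records, split_attr, value, ignore=("Request ID",)):
    """Least-probable-intersections statistics for the binary split
    split_attr = value vs. split_attr != value (sketch). Records are compared
    on the remaining attributes. Pi uses the form that reproduces the worked
    example, p_i = 1 - (k/q)**o_i - ((q-k)/q)**o_i; lambda is the sum of the
    Pi values and j is the observed number of intersecting records. Turning
    (j, lambda) into the final LPI score is left to the paper's Equation (4)."""
    drop = set(ignore) | {split_attr}

    def project(r):
        return tuple(sorted((a, v) for a, v in r.items() if a not in drop))

    q = len(records)
    k = sum(r[split_attr] == value for r in records)
    occurrences = Counter(project(r) for r in records)   # o_i per distinct record
    lam = sum(1 - (k / q) ** o - ((q - k) / q) ** o for o in occurrences.values())
    inside = {project(r) for r in records if r[split_attr] == value}
    outside = {project(r) for r in records if r[split_attr] != value}
    j = len(inside & outside)                            # intersecting records
    return j, lam
```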
Appendix B
TABLE B4
COMPARING THE PRUNING METHODS (MOVIE RECOMMENDER)
TABLE B1
COMPARING THE SPLITTING CRITERIA (DATABASE MISUSE)
FGJ
No
CGJ
pruning
CGJ
LPI
p-value= 0.461 (◊) p-value= 0.693 (◊)
statistic= 0.0975
statistic= 0.2624
p-value= 0.33 (◊)
statistic= 0.4400
LPI
FGJ
LPI
CGJ
p-value= 0.001(◄) p-value= 0.468 (◊)
statistic= 3.0221
statistic= 0.0787
p-value= 0.001(◄)
statistic= 3.0355
LPI
FGJ
MLE
CGJ
p-value= 0.000(◄) p-value= 0.463 (◊)
statistic= 9.3044
statistic= 0.0906
p-value= 0.041(▲)
statistic= 1.7308
LPI
MLE
p-value= 0.440 (◊)
statistic= 0.1494
p-value= 0.445 (◊)
statistic= 0.1381
p-value= 0.312 (◊)
statistic= 0.4899
p-value= 0.000(◄)
statistic= 3.1777
p-value= 0.027(◄)
statistic= 1.9142
p-value= 0.000(◄)
statistic= 3.1904
p-value= 0.00 (◄)
statistic= 7.6250
p-value= 0.118 (◊)
statistic= 1.1805
p-value= 0.204 (◊)
statistic= 0.2473
The '◄' symbol indicates that the accuracy of the row's splitting criterion was significantly higher than that of the column's splitting criterion. The '▲' symbol indicates that the accuracy of the row's criterion was significantly lower, and the '◊' symbol indicates no significant difference.
LPI
No pruning
CGJ
No pruning
FGJ
LPI
No pruning
-------
CGJ
LPI
FGJ
p-value= 0.0047 (▲)
statistic= 2.6007
LPI
No pruning
LPI
p-value= 0.0047 (▲)
statistic=2.6007
LPI
No pruning
-------
MLE
LPI
MLE
p-value= 0.4137 (◊)
statistic= 0.2179
p-value= 0.0132 (▲)
statistic= 2.2197
p-value= 0.0752 (◊)
statistic= 1.4383
p-value= 0.1096 (◊)
statistic= 1.2286
p-value= 0.0752 (◊)
statistic= 1.4383
p-value= 0.1096 (◊)
statistic= 1.2286
p-value= 0.4137 (◊)
statistic= 0.2179
p-value= 0.0132 (▲)
statistic= 2.2197
CGJ
FGJ
No
pruning CGJ
LPI
FGJ
LPI
CGJ
LPI
FGJ
MLE
CGJ
LPI
LPI
MLE
p-value= 0.247 (◊)
-------
statistic= 0.6824
p-value= 0.330 (◊) p-value= 0.445 (◊)
statistic= 0.4400
statistic= 0.1381
p-value= 0.312 (◊)
statistic= 0.4899
p-value= 0.00 (▲) p-value= 0.00 (▲) p-value= 0.00 (▲)
statistic= 35.177
statistic= 38.135
statistic= 44.687
p-value= 0.059 (◊) p-value= 0.230 (◊)
statistic= 1.5584
statistic= 0.7384
p-value= 0.085 (◊)
statistic= 1.3688
p-value= 0.202 (◊) p-value= 0.157 (◊) p-value= 0.386 (◊)
statistic= 0.8373
statistic= 1.0052
statistic= 0.2877
p-value= 0.384 (◊) p-value= 0.248 (◊)
statistic= 0.2951
statistic= 0.6778
p-value= 0.406 (◊)
statistic= 0.2366
-------
LPI
-------
p-value= 0.0605 (◊)
statistic= 1.5506
p-value= 0.0019 (▲)
statistic= 2.8945
p-value= 0.0218 (▲)
statistic= 2.0177
p-value= 0.0000 (▲)
statistic= 4.3209
LPI
-------
No pruning
MLE
p-value= 0.0001 (▲)
statistic= 3.7368
-------
p-value= 0.0091 (▲)
statistic= 2.3627
LPI
TABLE B5
COMPARING THE SPLITTING CRITERIA (FRAUD DETECTION)
FGJ
No
CGJ
pruning
CGJ
LPI
p-value= 0.00 (▲) p-value= 0.031(▲)
statistic= 1.8655
statistic= 5.2872
p-value= 0.000(◄)
statistic= 3.9612
LPI
FGJ
LPI
CGJ
p-value= 0.140 (◊) p-value= 0.00 (▲)
statistic= 1.0768
statistic= 5.0734
p-value= 0.000(▲)
statistic= 4.6296
LPI
FGJ
MLE
CGJ
p-value= 0.004(◄) p-value= 0.00 (▲)
statistic= 2.6130
statistic= 5.0649
p-value= 0.00 (▲)
statistic= 7.7839
LPI
MLE
p-value= 0.055 (◊)
statistic= 1.5953
p-value= 0.000(◄)
statistic= 4.8237
p-value= 0.312 (◊)
statistic= 0.4899
p-value= 0.000(▲)
statistic= 3.6861
p-value= 0.000(▲)
statistic= 3.2235
p-value= 0.044(◄)
statistic= 0.0274
p-value= 0.042(◄)
statistic= 1.7240
p-value= 0.119 (◊)
statistic= 1.1795
p-value= 0.00 (◄)
statistic= 7.2602
TABLE B6
COMPARING THE PRUNING METHODS (FRAUD DETECTION)
LPI
No pruning
TABLE B3
COMPARING THE SPLITTING CRITERIA (MOVIE RECOMMENDER)
p-value= 0.0040 (▲)
statistic= 2.6537
LPI
No pruning
MLE
p-value= 0.0001 (▲)
statistic= 3.7296
LPI
TABLE B2
COMPARING THE PRUNING METHODS (DATABASE MISUSE)
No pruning
CGJ
p-value= 0.0013 (◄)
statistic= 3.0215
LPI
No pruning
FGJ
p-value= 0.0000 (▲)
statistic= 8.4023
LPI
No pruning
LPI
p-value= 0.0000 (▲)
statistic= 4.8552
LPI
No pruning
MLE
LPI
p-value= 0.000 (▲)
statistic= 3.7368
MLE
p-value= 0.0000 (◄)
statistic= 6.9261
p-value= 0.0000 (◄)
statistic= 5.6153
p-value= 0.0278 (◄)
statistic= 1.9139
p-value= 0.0000 (◄)
statistic= 13.6651
p-value= 0.0010 (▲)
statistic= 3.0973
p-value= 0.0327 (◄)
statistic= 3.0973
p-value= 0.0084 (◄)
statistic= 2.3917
p-value= 0.000 (◄)
statistic= 7.2590