L. Dee Miller Weka Tertius Explanation October 19, 2010 Work In

Weka Tertius Explanation
October 19, 2010
1. Introduction
The Tertius association rule algorithm is presented in Flach & Lachiche (2001). This paper discusses a
statistically well-founded confirmation measure used to determine how strongly a rule is supported and
how novel the rule is. It further discusses how this confirmation measure can be paired with a search
algorithm to generate interesting association rules (both supervised and unsupervised) from datasets
with nominal attributes.
The Tertius rule miner has previously been implemented for the Weka machine learning library (Deltour,
2001). This implementation makes it easy to use Tertius on any dataset that is in compatible Weka
format (e.g., arff files). However, there is a disconnect between the Tertius algorithm discussed in Flach
& Lachiche (2001) and the Weka implementation. Deltour (2001) provides only a user manual which
discusses the command line options for Tertius. The author does not discuss the Tertius implementation
at all. Weka is open source, so the source code is freely available at
http://www.cs.waikato.ac.nz/ml/weka/. However, the source code itself has almost no
documentation, which makes it very difficult to explain the results satisfactorily. For example, we
recently used this Tertius implementation on the EHC questionnaire data (???). The results were
anomalous, with counter-example values much higher than the confirmation values. We were unable to
find any explanation for this in the Tertius paper. After extensive digging through the source code, we
found that these values were not counter-examples at all but something completely different: true-positive values.
In this paper, we begin to discuss the Weka implementation of Tertius. In Section 2, we focus on the
equations and parameters used to generate the three values associated with each association rule: (1)
confirmation, (2) true-positive, and (3) false-positive. We provide a running example on the simple,
synthetic Balls dataset given in Table 1. This dataset consists of three separate nominal attributes: Size,
Bounce, and Color. The Color is used as the label attribute which is the head of the association rules.
The other two attributes are used as the body of the rules. Note: this discussion applies only to Tertius
used to generate supervised association rules. This occurs when the setRocAnalysis flag for Weka
Tertius is set to true. In the future, we may provide an equivalent section for Tertius used to generate
unsupervised association rules. Armed with an understanding of the equations/parameters, in Section 3
we provide an explanation for the anomalous results on the EHC questionnaire dataset.
Table 1: Balls Dataset Used in Running Example.
Size    Bounce  Color
Small   Low     Red
Large   Low     Red
Small   High    Red
Large   High    Red
Small   Low     Blue
Large   Low     Blue
Small   High    Blue
Large   High    Blue
Small   Low     Red
Large   Low     Red
2. Tertius Equations and Parameters
As previously discussed, the Weka implementation of Tertius (referred to as wTertius from here on)
outputs three separate values with each association rule: (1) confirmation, (2) true-positive, and
(3) false-positive. Of these three values, the confirmation value of a rule is the most important
because it measures how unusual or interesting the rule is. Interested readers should consult Flach &
Lachiche (2001) for more details on confirmation. The other two values are included to measure the
significance of the association rules. They will be explained in more detail later in this section.
First, we look at the four parameters common to all the functions:
• body_counter = number of points where body attributes match rule
• head_counter = number of points where head attribute does NOT match rule
• m_counter = number of points where body attributes match, but head does NOT
• instances = number of points
The names for the first three are somewhat vague. The m_counter refers to the actual counterexample
value discussed in Flach & Lachiche (2001). The body_counter seems to refer to the first
row in the contingency table in that paper, whereas the head_counter refers to the second column.
For wTertius purposes, all four parameters are involved in generating the output values. An example of
computing the four parameters on the Balls dataset is given in Table 2 for the rule: bounce = low →
color = red. (That is, body → head.) The table gives the following parameter values for this rule:
• body_counter = 6
• head_counter = 4
• m_counter = 2
• instances = 10
Table 2: Example of computing the four parameters for the rule: bounce = low → color = red. The
marker B is for body_counter, the marker H is for head_counter, and the marker M is for m_counter.
Size    Bounce    Color
Small   Low (B)   Red
Large   Low (B)   Red
Small   High      Red
Large   High      Red
Small   Low (B)   Blue (H, M)
Large   Low (B)   Blue (H, M)
Small   High      Blue (H)
Large   High      Blue (H)
Small   Low (B)   Red
Large   Low (B)   Red
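The counting in Table 2 can be reproduced with a short Python sketch. This is only an illustration of the parameter definitions given above, not the wTertius source code; the function name count_parameters and the use of predicates for the body and head are assumptions made for this example.

```python
# Balls dataset from Table 1, one (size, bounce, color) tuple per data point.
BALLS = [
    ("Small", "Low", "Red"),
    ("Large", "Low", "Red"),
    ("Small", "High", "Red"),
    ("Large", "High", "Red"),
    ("Small", "Low", "Blue"),
    ("Large", "Low", "Blue"),
    ("Small", "High", "Blue"),
    ("Large", "High", "Blue"),
    ("Small", "Low", "Red"),
    ("Large", "Low", "Red"),
]


def count_parameters(rows, body, head):
    """Return (body_counter, head_counter, m_counter, instances).

    `body` and `head` are predicates over one row (hypothetical helper
    interface, chosen for this sketch only).
    """
    # Points whose body attributes match the rule.
    body_counter = sum(1 for r in rows if body(r))
    # Points whose head attribute does NOT match the rule.
    head_counter = sum(1 for r in rows if not head(r))
    # Counterexamples: body matches, head does NOT.
    m_counter = sum(1 for r in rows if body(r) and not head(r))
    return body_counter, head_counter, m_counter, len(rows)


# Rule: bounce = low -> color = red
params = count_parameters(
    BALLS,
    body=lambda r: r[1] == "Low",
    head=lambda r: r[2] == "Red",
)
print(params)  # (6, 4, 2, 10)
```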
Second, we consider the confirmation value used to measure the novelty of the association rules. The
confirmation value requires two more parameters: (1) expected frequency and (2) observed frequency.
The equations for all three are given below. Both the expected and observed parameters range from 0 to 1.
The constraints on expected prevent confirmation from dividing by zero. The expected parameter
measures the novelty of the rule. It becomes larger when many points match the body of the rule
and/or when many points do NOT match the head. The observed parameter balances expected by
penalizing the confirmation when the body is satisfied but the head is not.

\[
\mathit{expected} = \frac{\mathit{body\_counter} \cdot \mathit{head\_counter}}{\mathit{instances}^2}
\]
(roughly equal to the estimated number of counterexamples)

\[
\mathit{observed} = \frac{\mathit{m\_counter}}{\mathit{instances}}
\]
(roughly equal to the observed number of counterexamples)

\[
\mathit{confirmation} =
\begin{cases}
0 & \text{if } \mathit{expected} = 0 \text{ or } \mathit{expected} = 1 \\
\dfrac{\mathit{expected} - \mathit{observed}}{\sqrt{\mathit{expected}} - \mathit{expected}} & \text{otherwise}
\end{cases}
\]
An example of computing expected, observed, and confirmation on the Balls dataset is given in the
bullets below for the rule: bounce = low → color = red.
• expected = 6 * 4 / 100 = 0.24
• observed = 2 / 10 = 0.2
• confirmation = (0.24 − 0.2) / (√0.24 − 0.24) = 0.16
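As a check on the arithmetic, the confirmation equations can be written as a few lines of Python. This is a minimal sketch of the formulas as stated in this section, not a copy of the wTertius code; the function name confirmation is an assumption made for illustration.

```python
import math


def confirmation(body_counter, head_counter, m_counter, instances):
    """Confirmation value following the equations in this section."""
    expected = body_counter * head_counter / instances ** 2
    observed = m_counter / instances
    # sqrt(expected) - expected is zero at 0 and 1, so those cases
    # are defined as zero confirmation to avoid dividing by zero.
    if expected == 0 or expected == 1:
        return 0.0
    return (expected - observed) / (math.sqrt(expected) - expected)


# Rule: bounce = low -> color = red on the Balls dataset.
value = confirmation(body_counter=6, head_counter=4, m_counter=2, instances=10)
print(round(value, 2))  # 0.16
```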
Third, we consider the true-positive and false-positive rates given as output by wTertius. The equations
for both are given below. Both measure the quality of the association rule created by wTertius. The
true-positive rate measures how often the entire rule is satisfied compared to how often the head is
satisfied, whereas the false-positive rate measures how often the body is satisfied among the data points
where the head is not. Ideally, to find
interesting rules, the confirmation should be higher than true-positive and true-negative because we
prefer novel rules over common rules, but a rule that never works (i.e., is never satisfied) on the data
points is also of limited use.

\[
\mathit{true\text{-}positive} = \frac{\mathit{body\_counter} - \mathit{m\_counter}}{\mathit{instances} - \mathit{head\_counter}}
\]

\[
\mathit{false\text{-}positive} = \frac{\mathit{m\_counter}}{\mathit{head\_counter}}
\]
An example of computing true-positive and false-positive on the Balls dataset is given in the
bullets below for the rule: bounce = low → color = red.
• true-positive = (6 − 2) / (10 − 4) = 0.67
• false-positive = 2 / 4 = 0.5
To summarize, a good rule is one that has a high confirmation value, a high true-positive rate, and a low
false-positive rate.
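The true-positive and false-positive equations above can be sketched in Python in the same way. The function name roc_rates is hypothetical, and the sketch does not guard against a zero denominator (a dataset where every point, or no point, satisfies the head); how wTertius handles that case is not covered here.

```python
def roc_rates(body_counter, head_counter, m_counter, instances):
    """Return (true_positive, false_positive) per the equations above."""
    # Points satisfying body AND head, over points satisfying the head.
    true_positive = (body_counter - m_counter) / (instances - head_counter)
    # Points satisfying the body but NOT the head, over points NOT satisfying the head.
    false_positive = m_counter / head_counter
    return true_positive, false_positive


# Rule: bounce = low -> color = red on the Balls dataset.
tp, fp = roc_rates(body_counter=6, head_counter=4, m_counter=2, instances=10)
print(round(tp, 2), fp)  # 0.67 0.5
```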
3. EHC Questionnaire Results
We have previously used Tertius to obtain good results on education data (Riley, et al. 2009). However,
when we ran the same configuration on the EHC Questionnaire dataset (???) there were some
anomalous results. Specifically, we found that the second value was 1.0 for many rules generated from
the results on domain VII. Originally, we thought this was the counter-instances, but we have
subsequently found out this value was the true-positives. Now that we understand all the wTertius
parameters, we can provide an explanation for these results. Overall, wTertius is working as intended.
Recall that the dataset uses the Respond Parallel (RP) attribute as the label with attributes combined
from the previous three turns. Despite this merge, the number of data points with RP = 1 is
extremely small: only 1 point out of 676. This lopsided label distribution is causing the high true-positive
values. Take the following actual rule as an example:
• D* = 0 and EVER = 1 → RP = 1
  o true-positive = (4 − 3) / (676 − 675) = 1
  o confirmation = 0.04 (1st)
For this rule, wTertius is working as intended. Recall that true-positive measures how often the entire
rule is satisfied compared to the head. For this rule, only one data point has RP = 1, so only one data
point can satisfy the entire rule regardless of the number of points that satisfy the body. Therefore, if
the number of body matches (body_counter) increases, we get a corresponding increase in
m_counter without any change in the overall true-positive (and vice versa). This explains the large number of
rules with RP = 1 and true-positive = 1.
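The effect can be illustrated with a small Python sketch. The first call uses the values reported for the rule above (body_counter = 4, m_counter = 3, head_counter = 675, instances = 676). The larger body counts in the loop are hypothetical and only show that, as long as the single RP = 1 point is covered by the body, true-positive stays at 1.

```python
def true_positive(body_counter, head_counter, m_counter, instances):
    """True-positive rate as defined in Section 2."""
    return (body_counter - m_counter) / (instances - head_counter)


INSTANCES = 676
HEAD_COUNTER = 675  # points where the head RP = 1 is NOT satisfied

# Values reported for the rule above.
reported = true_positive(4, HEAD_COUNTER, 3, INSTANCES)
print(reported)  # 1.0

# Hypothetical rules with larger bodies that still cover the one RP = 1 point:
# every extra body match is a counterexample, so m_counter grows in step.
for body_counter in (4, 10, 100):
    m_counter = body_counter - 1
    print(body_counter, true_positive(body_counter, HEAD_COUNTER, m_counter, INSTANCES))
```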
References
Deltour, A. (2001). Tertius Extension to Weka (Technical Report No. CSTR-01-001). United Kingdom:
University of Bristol.
Flach, P. A., & Lachiche, N. (2001). Confirmation-Guided Discovery of First-Order Rules with Tertius.
Mach. Learn., 42(1-2), 61-95.
Riley, S., Miller, L. D., Soh, L., Samal, A., & Nugent, G. (2009). Intelligent Learning Object Guide (iLOG): A
Framework for Automatic Empirically-Based Metadata Generation. In Artificial Intelligence in
Education (pp. 515-522).