L. Dee Miller Weka Tertius Explanation October 19, 2010 Work In

Weka Tertius Explanation
October 19, 2010

1. Introduction

The Tertius association rule algorithm is presented in Flach & Lachiche (2001). That paper describes a statistically well-founded confirmation measure that captures both how strongly a rule is supported and how novel it is. It further discusses how this confirmation measure can be paired with a search algorithm to generate interesting association rules (both supervised and unsupervised) from datasets with nominal attributes.

The Tertius rule miner has previously been implemented for the Weka machine learning library (Deltour, 2001). This implementation makes it easy to use Tertius on any dataset in a Weka-compatible format (e.g., ARFF files). However, there is a disconnect between the Tertius algorithm discussed in Flach & Lachiche (2001) and the Weka implementation. Deltour (2001) provides only a user manual covering the command-line options for Tertius; it does not discuss the implementation at all. Weka is open source, so the source code is freely available at http://www.cs.waikato.ac.nz/ml/weka/. However, the source code has almost no documentation, which makes it very difficult to explain the results satisfactorily.

For example, we recently used this Tertius implementation on the EHC questionnaire data (???). The results were quite anomalous, with the counter-example values much higher than the confirmation values. We were unable to find any explanation for this in the Tertius paper. After extensive digging through the source code, we found that these values were not counter-examples but something completely different: true-positive values!

In this paper, we begin to discuss the Weka implementation of Tertius. In Section 2, we focus on the equations and parameters used to generate the three values associated with each association rule: (1) confirmation, (2) true-positive, and (3) false-positive.
We provide a running example on the simple, synthetic Balls dataset given in Table 1. This dataset consists of three nominal attributes: Size, Bounce, and Color. Color is used as the label attribute, which forms the head of the association rules. The other two attributes form the body of the rules. Note that this discussion applies only to Tertius used to generate supervised association rules, which occurs when the setRocAnalysis flag for Weka Tertius is set to true. In the future, we may provide an equivalent section for Tertius used to generate unsupervised association rules. Armed with an understanding of the equations and parameters, in Section 3 we explain the anomalous results on the EHC questionnaire dataset.

Table 1: Balls Dataset Used in Running Example.

Size   Bounce  Color
Small  Low     Red
Large  Low     Red
Small  High    Red
Large  High    Red
Small  Low     Blue
Large  Low     Blue
Small  High    Blue
Large  High    Blue
Small  Low     Red
Large  Low     Red

2. Tertius Equations and Parameters

As previously discussed, the Weka implementation of Tertius (referred to as wTertius from here on) outputs three values with each association rule: (1) the confirmation value, (2) the true-positive rate, and (3) the false-positive rate. Of these three, the confirmation value is the most important because it measures how unusual or interesting the rule is. Interested readers should consult Flach & Lachiche (2001) for more details on confirmation. The other two values are included to measure the significance of the association rules. They are explained in more detail later in this section.

First, we look at the four parameters common to all the functions:

• body_counter = number of points where the body attributes match the rule
• head_counter = number of points where the head attribute does NOT match the rule
• m_counter = number of points where the body attributes match, but the head does NOT
• instances = number of points

The names for the first three are somewhat vague.
The m_counter refers to the actual counterexample value discussed in Flach & Lachiche (2001). The body_counter seems to refer to the first row of the contingency table in that paper, whereas the head_counter refers to the second column. For wTertius purposes, all four parameters are involved in generating the output values.

An example of computing the four parameters on the Balls dataset is given in Table 2 for the rule bounce = low → color = red (that is, body → head). The table gives the following parameter values for this rule:

• body_counter = 6
• head_counter = 4
• m_counter = 2
• instances = 10

Table 2: Example of computing the four parameters for the rule bounce = low → color = red. The marker B is for body_counter, the marker H is for head_counter, and the marker M is for m_counter.

Size   Bounce     Color
Small  Low   B    Red
Large  Low   B    Red
Small  High       Red
Large  High       Red
Small  Low   B    Blue  H M
Large  Low   B    Blue  H M
Small  High       Blue  H
Large  High       Blue  H
Small  Low   B    Red
Large  Low   B    Red

Second, we consider the confirmation value used to measure the novelty of the association rules. The confirmation value requires two more parameters: (1) the expected frequency and (2) the observed frequency. The equations for all three are given below. Both the expected and observed parameters range from 0 to 1. The constraints on expected prevent confirmation from dividing by zero. The expected parameter measures the novelty of the rule. It becomes larger when many points match the body of the rule and/or when many points do NOT match the head. The observed parameter balances expected by penalizing the confirmation when the body is satisfied but the head is not.
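Before turning to the equations, the four counters for the running example can be reproduced with a few lines of Python. This is only a minimal sketch of the definitions given above, not the Weka source code: the function name count_parameters and the predicate arguments body and head are hypothetical names introduced for illustration.

```python
# Balls dataset from Table 1 as (size, bounce, color) tuples.
BALLS = [
    ("Small", "Low", "Red"),
    ("Large", "Low", "Red"),
    ("Small", "High", "Red"),
    ("Large", "High", "Red"),
    ("Small", "Low", "Blue"),
    ("Large", "Low", "Blue"),
    ("Small", "High", "Blue"),
    ("Large", "High", "Blue"),
    ("Small", "Low", "Red"),
    ("Large", "Low", "Red"),
]


def count_parameters(rows, body, head):
    """Count the four parameters for a rule body -> head.

    `body` and `head` are hypothetical predicates (row -> bool) that
    say whether a row satisfies the rule body and the rule head.
    """
    return {
        # Points where the body attributes match the rule.
        "body_counter": sum(1 for r in rows if body(r)),
        # Points where the head attribute does NOT match the rule.
        "head_counter": sum(1 for r in rows if not head(r)),
        # Points where the body matches but the head does NOT.
        "m_counter": sum(1 for r in rows if body(r) and not head(r)),
        # Total number of points.
        "instances": len(rows),
    }


# Rule: bounce = low -> color = red
params = count_parameters(
    BALLS,
    body=lambda r: r[1] == "Low",
    head=lambda r: r[2] == "Red",
)
print(params)
```

Under these assumptions the counts agree with the values listed for Table 2 (6, 4, 2, and 10).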
$$\mathit{expected} = \frac{\mathit{body\_counter} \cdot \mathit{head\_counter}}{\mathit{instances}^{2}}$$
(roughly equal to the estimated number of counterexamples)

$$\mathit{observed} = \frac{\mathit{m\_counter}}{\mathit{instances}}$$
(roughly equal to the observed number of counterexamples)

$$\mathit{confirmation} = \begin{cases} 0 & \text{if } \mathit{expected} = 1 \text{ or } 0 \\ \dfrac{\mathit{expected} - \mathit{observed}}{\sqrt{\mathit{expected}} - \mathit{expected}} & \text{otherwise} \end{cases}$$

An example of computing expected, observed, and confirmation on the Balls dataset is given in the bullets below for the rule bounce = low → color = red.

• expected = 6 * 4 / 100 = 0.24
• observed = 2 / 10 = 0.2
• confirmation = (0.24 − 0.2) / (√0.24 − 0.24) = 0.16

Third, we consider the true-positive and false-positive rates given as output in wTertius. The equations for both are given below. Both measure the quality of the association rule created by wTertius. The true-positive rate measures how often the entire rule (body and head) is satisfied relative to how often the head is satisfied, whereas the false-positive rate measures how often the body is satisfied among the data points where the head is not. Ideally, to find interesting rules, the confirmation should be higher than the true-positive and false-positive values because we prefer novel rules over common rules, but a rule that never works (i.e., is never satisfied) on the data points is also of limited use.

$$\mathit{true\text{-}positive} = \frac{\mathit{body\_counter} - \mathit{m\_counter}}{\mathit{instances} - \mathit{head\_counter}}$$

$$\mathit{false\text{-}positive} = \frac{\mathit{m\_counter}}{\mathit{head\_counter}}$$

An example of computing true-positive and false-positive on the Balls dataset is given in the bullets below for the rule bounce = low → color = red.

• true-positive = (6 − 2) / (10 − 4) = 0.67
• false-positive = 2 / 4 = 0.5

To summarize, a good rule is one that has a high confirmation value, a high true-positive rate, and a low false-positive rate.

3. EHC Questionnaire Results

We have previously used Tertius to obtain good results on education data (Riley et al., 2009). However, when we ran the same configuration on the EHC Questionnaire dataset (???), there were some anomalous results. Specifically, we found that the second value was 1.0 for many rules generated from the results on domain VII.
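As a cross-check on reported values like these, the three outputs defined in Section 2 can be recomputed by hand from the four counters. The following Python function is a minimal sketch of those equations as written in this document, not a copy of the Weka source; the name tertius_outputs is hypothetical, and the sketch assumes 0 < head_counter < instances so that neither rate divides by zero.

```python
import math


def tertius_outputs(body_counter, head_counter, m_counter, instances):
    """Return (confirmation, true_positive, false_positive).

    Hypothetical helper following the Section 2 equations.
    Assumes 0 < head_counter < instances (both head outcomes occur).
    """
    expected = body_counter * head_counter / instances ** 2
    observed = m_counter / instances

    # The constraint on expected avoids a zero denominator below.
    if expected == 0 or expected == 1:
        confirmation = 0.0
    else:
        confirmation = (expected - observed) / (math.sqrt(expected) - expected)

    true_positive = (body_counter - m_counter) / (instances - head_counter)
    false_positive = m_counter / head_counter
    return confirmation, true_positive, false_positive


# Balls example, rule bounce = low -> color = red:
# body_counter = 6, head_counter = 4, m_counter = 2, instances = 10
conf, tp, fp = tertius_outputs(6, 4, 2, 10)
print(round(conf, 2), round(tp, 2), round(fp, 2))  # prints: 0.16 0.67 0.5
```

Feeding in the counters of any rule reported by wTertius makes it possible to see which of the three outputs a given column actually holds.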
Originally, we thought this second value was the counter-instances, but we have subsequently found that it is the true-positive rate. Now that we understand all the wTertius parameters, we can provide an explanation for these results. Overall, wTertius is working as intended. Recall that the dataset uses the Respond Parallel (RP) attribute as the label, with attributes combined from the previous three turns. Despite this merge, the number of data points with RP = 1 is extremely small: only 1 point out of 676. This lopsided label distribution is causing the high true-positive values. Take the following actual rule as an example:

• D* = 0 and EVER = 1 → RP = 1
  o true-positive = (4 − 3) / (676 − 675) = 1
  o confirmation = 0.04 (1st)

Recall that the true-positive rate measures how often the entire rule is satisfied compared to the head. For this rule, only one data point has RP = 1, so only one data point can satisfy the entire rule regardless of the number of points that satisfy the body. Therefore, if the number of body matches (body_counter) increases, we get a corresponding increase in m_counter without a change in the overall true-positive value (and vice versa). This explains the large number of rules with RP = 1 and true-positive = 1.

References

Deltour, A. (2001). Tertius Extension to Weka (Technical Report No. CSTR-01-001). United Kingdom: University of Bristol.

Flach, P. A., & Lachiche, N. (2001). Confirmation-Guided Discovery of First-Order Rules with Tertius. Machine Learning, 42(1-2), 61-95.

Riley, S., Miller, L. D., Soh, L., Samal, A., & Nugent, G. (2009). Intelligent Learning Object Guide (iLOG): A Framework for Automatic Empirically-Based Metadata Generation. In Artificial Intelligence in Education (pp. 515-522).
