CAP 4770 Introduction to Data Mining
Project Report #1: Association Rules
Pedro Exposito
ID: 1826385

I. Introduction and Objectives

This report explains how to perform an association mining task in WEKA using a dataset in .arff format, and how to obtain useful knowledge from the results. The contact-lenses dataset, described in the next section, is the example used throughout the report to explain how to do association rule mining on a set of related data.

The contact-lenses domain covers types of contact lenses and characteristics of people who either use a particular type of contact lenses or none at all. The dataset has information on several individuals (mostly characteristics that can affect vision) and the type of contact lenses they use, if any. Performing association rule mining on this dataset can have potential benefits, such as finding correlations between certain eye-related characteristics and the need to use a particular type of contact lenses. Moreover, it can help reveal which traits are common among people who do not use contact lenses at all, especially if a high correlation between having those traits and not using lenses is found.

If we search for patterns, or associations, in this dataset, the results can lead us to uncover more specific correlations, such as finding that most people who have both trait ‘A’ and trait ‘B’ also use contact lenses of type ‘2’. These relationships are very hard to find by just looking at the data, but WEKA can be used to find them. Moreover, two or more discovered patterns can be used together to infer another general relationship among the data attributes, which might not be obvious at all. These results make WEKA’s association rule mining useful as a tool for knowledge discovery from data.

II. Data Set Description and Data Pre-processing

The dataset contact-lenses contains five attributes.
The attribute age is a nominal attribute with values young, pre-presbyopic, and presbyopic. The age attribute here is more related to the age of the eye than to the age of the person. Individuals in the pre-presbyopic category are adults around 30-40 years old, and those considered presbyopic are older than 40. The word presbyopic refers to the loss of focus of the natural eye lenses, which is a natural consequence of aging. The attribute spectacle-prescrip stands for spectacle prescription and has two possible values: myope and hypermetrope. These two refer to the eye condition of the patient, for which he or she should wear a specific type of glasses. Different types of lenses are used to correct myopia and hypermetropia. The attribute astigmatism has possible values yes and no, which indicate whether the person has this vision problem. The attribute tear-prod-rate stands for tear production rate and can have the value reduced or normal. The final attribute, contact-lenses, has possible values soft, hard, and none, and refers to the type of contact lenses that the individual is currently using.

In this example, the dataset contact-lenses already had the correct file format to use in WEKA (see Reference section, part B): it was in .arff format. To see what an .arff file should look like, check the .arff file for contact-lenses provided at the end of this section. The file gives the dataset’s name with @relation at the top. It then declares each attribute with @attribute followed by the attribute’s name and either its set of possible values (for nominal attributes) or the keyword numeric or real (for attributes that hold numeric values, such as 85 or 20.3). In this example, there were only nominal attributes. The file marks the beginning of the data with @data, and each data instance of the relation follows on its own row below, with its attribute values separated by commas and no comma after the last value in the row.
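The file layout described above can be illustrated with a small reader. This is only a minimal sketch in Python, not WEKA's own parser: the function name parse_nominal_arff is hypothetical, it handles only the nominal-attribute subset of .arff used in this report, and the sample text is the header plus the first four instances of contact-lenses.

```python
import re

# Header and first four instances of the contact-lenses file.
SAMPLE = """\
@relation contact-lenses
@attribute age {young, pre-presbyopic, presbyopic}
@attribute spectacle-prescrip {myope, hypermetrope}
@attribute astigmatism {no, yes}
@attribute tear-prod-rate {reduced, normal}
@attribute contact-lenses {soft, hard, none}
@data
young,myope,no,reduced,none
young,myope,no,normal,soft
young,myope,yes,reduced,none
young,myope,yes,normal,hard
"""


def parse_nominal_arff(text):
    """Read the relation name, nominal attributes, and data rows (hypothetical helper)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and % comments
            continue
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            if len(values) != len(attributes):
                raise ValueError(f"wrong number of values: {line!r}")
            for (name, allowed), value in zip(attributes, values):
                if value not in allowed:
                    raise ValueError(f"{value!r} is not a declared value of {name}")
            rows.append(values)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            match = re.match(r"@attribute\s+(\S+)\s+\{(.*)\}\s*$", line, re.IGNORECASE)
            if match is None:
                raise ValueError("this sketch supports nominal attributes only")
            allowed = [v.strip() for v in match.group(2).split(",")]
            attributes.append((match.group(1), allowed))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_nominal_arff(SAMPLE)
print(relation, len(attributes), "attributes,", len(rows), "instances")
```

Checking each value against the declared set is a design choice of this sketch; it catches typos in a hand-edited file before the data is loaded into a mining tool.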
To do association rule mining with WEKA, one should first put the data in .arff format. This can be done easily using Notepad, saving the file with the extension .arff after setting it up correctly. Other formats, such as .csv, work as well, but for simplicity only .arff is covered in this report.

WEKA’s Explorer interface provides several options to pre-process the data after you load the dataset from the .arff file. In this example, the dataset was suitable for association rule mining right away, since all its attributes were already nominal and eliminating attributes was not convenient in this case. However, other datasets need pre-processing before association mining. First, all numeric attributes must be converted to nominal attributes; otherwise WEKA’s Associate tab does not allow you to Start an association mining task on the dataset. To do this, click on Choose in the Preprocess tab and open filters->unsupervised->attribute->Discretize to select the Discretize filter, which converts numeric attributes to nominal ones. Then double-click on the Discretize text field to open a window with the parameter settings for discretization, such as the number of bins and the weight of instances per interval. The numeric attributes are divided into as many partitions as the number of bins you choose and converted to nominal attributes with those partitions as their possible values. At that point the data is ready for association mining, but additional pre-processing, such as removing unimportant attributes, could be done to obtain better results.

In the case of our dataset, contact-lenses, there was no need for the aforementioned data preparation. The data file was already in .arff format, all the attributes were already nominal, and all the attributes were important or useful for finding association rules, so none were removed.
In fact, removing an attribute from contact-lenses would have affected the discovery of more complex association rules, especially if the removed attribute was also involved in those associations. It is important to note that although this dataset was already set up and ready for association rule mining, it would have required the pre-processing methods described so far had it not been.

To further describe the contact-lenses dataset, the .arff file that represents it is provided next:

@relation contact-lenses

@attribute age {young, pre-presbyopic, presbyopic}
@attribute spectacle-prescrip {myope, hypermetrope}
@attribute astigmatism {no, yes}
@attribute tear-prod-rate {reduced, normal}
@attribute contact-lenses {soft, hard, none}

@data
young,myope,no,reduced,none
young,myope,no,normal,soft
young,myope,yes,reduced,none
young,myope,yes,normal,hard
young,hypermetrope,no,reduced,none
young,hypermetrope,no,normal,soft
young,hypermetrope,yes,reduced,none
young,hypermetrope,yes,normal,hard
pre-presbyopic,myope,no,reduced,none
pre-presbyopic,myope,no,normal,soft
pre-presbyopic,myope,yes,reduced,none
pre-presbyopic,myope,yes,normal,hard
pre-presbyopic,hypermetrope,no,reduced,none
pre-presbyopic,hypermetrope,no,normal,soft
pre-presbyopic,hypermetrope,yes,reduced,none
pre-presbyopic,hypermetrope,yes,normal,none
presbyopic,myope,no,reduced,none
presbyopic,myope,no,normal,none
presbyopic,myope,yes,reduced,none
presbyopic,myope,yes,normal,hard
presbyopic,hypermetrope,no,reduced,none
presbyopic,hypermetrope,no,normal,soft
presbyopic,hypermetrope,yes,reduced,none
presbyopic,hypermetrope,yes,normal,none

III. Rule Mining Process

After the data pre-processing stage is done, association rule mining is performed through WEKA’s Associate tab. There we can choose which algorithm will be used to find the association rules and, after picking the algorithm, customize its parameters by double-clicking on the text field that contains the algorithm’s name.
In this example, the standard Apriori algorithm was used and only one of its parameters was modified: the number of rules (numRules), which is 10 by default, was set to 20 to produce more rules. Running Apriori with these settings on contact-lenses produced the output shown in Part A of the Reference section. The algorithm took less than one second to execute and output the results. However, contact-lenses is a small dataset, so this is not surprising; the run time is expected to be longer for very large datasets with thousands of entries.

IV. Resulting Rules

The association mining process, done using Apriori, results in 20 association rules. For the full listing of the output and all the ordered rules, refer to the Reference section, Part A. Next, five of the 20 rules are shown to explain how useful knowledge and predictions can be obtained from the association rules found:

 1. tear-prod-rate=reduced 12 ==> contact-lenses=none 12    conf:(1)
 6. contact-lenses=soft 5 ==> astigmatism=no 5    conf:(1)
11. contact-lenses=hard 4 ==> astigmatism=yes 4    conf:(1)
 7. contact-lenses=soft 5 ==> tear-prod-rate=normal 5    conf:(1)
12. contact-lenses=hard 4 ==> tear-prod-rate=normal 4    conf:(1)

Rules number 1, 6, 11, 7, and 12 are shown above. Their order was altered from the original output to make it easier to show how different rules can be used together to make a single prediction. The rules are given in the form if X then Y, or X implies Y (X ==> Y). The number after each side of a rule is the count of instances that contain the attribute values on that side, and conf is the confidence of the rule, that is, the fraction of the instances matching X that also match Y. Because every rule listed has a confidence of 1, the position of a rule in the output follows its level of support, meaning the number of instances that contain all the attribute values in the rule. Notice that rule 1 has the number 12 on both sides of the conditional, while the other rules have 4 and 5. This means that 12 instances of the relation support the association given by rule 1, which makes it the best-supported rule found and places it at the top of the list.
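The counts and confidence values of the five rules above can be checked by hand against the 24 instances. The following is a minimal sketch in Python, independent of WEKA; the helper names count and rule_stats are hypothetical. It rebuilds the dataset from the .arff listing (the first four attributes enumerate every combination in the order of the file, and the contact-lenses column is copied from it) and recomputes the numbers for each rule.

```python
from itertools import product

ATTRS = ["age", "spectacle-prescrip", "astigmatism", "tear-prod-rate", "contact-lenses"]

# contact-lenses column, copied row by row from the .arff listing.
LABELS = ("none soft none hard none soft none hard "
          "none soft none hard none soft none none "
          "none none none hard none soft none none").split()

# The file lists all 3 x 2 x 2 x 2 combinations of the first four attributes in this order.
combos = product(["young", "pre-presbyopic", "presbyopic"],
                 ["myope", "hypermetrope"],
                 ["no", "yes"],
                 ["reduced", "normal"])
ROWS = [dict(zip(ATTRS, combo + (label,))) for combo, label in zip(combos, LABELS)]


def count(items):
    """Number of instances that match every attribute=value pair in items."""
    return sum(all(row[a] == v for a, v in items.items()) for row in ROWS)


def rule_stats(antecedent, consequent):
    """Return (count of X, count of X and Y, confidence) for the rule X ==> Y."""
    n_ante = count(antecedent)
    n_both = count({**antecedent, **consequent})
    return n_ante, n_both, n_both / n_ante


RULES = [
    (1, {"tear-prod-rate": "reduced"}, {"contact-lenses": "none"}),
    (6, {"contact-lenses": "soft"}, {"astigmatism": "no"}),
    (11, {"contact-lenses": "hard"}, {"astigmatism": "yes"}),
    (7, {"contact-lenses": "soft"}, {"tear-prod-rate": "normal"}),
    (12, {"contact-lenses": "hard"}, {"tear-prod-rate": "normal"}),
]

for number, ante, cons in RULES:
    n_ante, n_both, conf = rule_stats(ante, cons)
    print(f"{number:>2}. {ante} {n_ante} ==> {cons} {n_both} conf:({conf:g})")
```

A confidence of 1 means that every instance matching the left side also matches the right side, which is why both counts in each rule are equal.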
The rules shown above are the ones that would be suitable to show to a customer, or to use as the basis for a hypothesis, because they give useful correlations.

Rule 1 shows that half of the sample (12 of the 24 instances) has reduced tear production, and that none of those people use contact lenses. This could lead to hidden knowledge about the data, such as finding that for most people (based on the contact-lenses sample) contact lenses could lead to higher tear production. Another valuable fact from rule 1 is obtained if it is compared with other, less significant rules, such as rules 2-5 and rules 14-16 (refer to section VI, part A). The fact is that the values of the other attributes, such as age, whether the person has astigmatism, and spectacle prescription, are not important in determining whether the person wears contact lenses; the rate of tear production seems to be the determining factor. Notice that when the tear production rate is reduced the person is likely not to wear contact lenses, regardless of the values of the other attributes.

Rules 6 and 11 together show that it is probable that people with soft contact lenses do not have astigmatism and, inversely, that those with hard contact lenses do have astigmatism. The two rules together give more support for this correlation: rule 6 is supported by five instances and rule 11 by four other instances, and both rules support the same idea because they pair opposite values of the same two related attributes. Therefore, we can say that nine instances (those of rules 6 and 11) support the idea that people with soft contact lenses probably don’t have astigmatism, and those with hard contact lenses probably do.

Rules 7 and 12 together show another interesting result, which is not obvious at all from looking at each rule separately. They show that it is probable that people with either soft or hard contact lenses have a normal tear production rate.
Therefore, these two together can be used to infer that the type of contact lenses used does not determine the person’s tear production rate.

These are some of the association rules, obtained from performing association mining on the contact-lenses dataset, that lead to interesting inferences. It is worth noting that only five of the 20 rules found were discussed here, so not all association rules lead to useful knowledge.

V. Recommendations

Suppose that the rules discovered and the explanation in section IV were presented to a client, a seller, or someone who would make decisions based on the results. In such a situation we can see why association rule mining can be useful. Let’s suppose the same rules with the same level of support were obtained for the same attributes, but for a contact-lenses dataset with 1 million instances, or rows, instead of just 24. In that case, a contact-lenses seller should sell hard contact lenses to people with astigmatism and soft contact lenses to people without astigmatism (rules 6 and 11). Likewise, people who know they have astigmatism should buy hard contact lenses. In addition, if people who use contact lenses want to reduce their tear production rate, it would be wise to recommend that they stop using contact lenses and switch to glasses, since half of the people in the 1 million sample had a reduced tear production rate and none of them used contact lenses. This association rule from such a large sample could lead us to believe that contact lenses can be a factor in having a higher rate of tear production.

VI. Reference

(A) WEKA’s Associator Output (complete output)

The following is the complete output given by WEKA’s associator running the Apriori algorithm on the contact-lenses dataset. The full listing of the 20 rules found is at the end of the output.
=== Run information ===

Scheme:       weka.associations.Apriori -N 20 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation:     contact-lenses-weka.filters.unsupervised.attribute.Discretize-B4-M7.0-Rfirst-last
Instances:    24
Attributes:   5
              age
              spectacle-prescrip
              astigmatism
              tear-prod-rate
              contact-lenses

=== Associator model (full training set) ===

Apriori
=======

Minimum support: 0.1 (2 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 18

Generated sets of large itemsets:

Size of set of large itemsets L(1): 12
Size of set of large itemsets L(2): 49
Size of set of large itemsets L(3): 84
Size of set of large itemsets L(4): 26

Best rules found:

 1. tear-prod-rate=reduced 12 ==> contact-lenses=none 12    conf:(1)
 2. spectacle-prescrip=myope tear-prod-rate=reduced 6 ==> contact-lenses=none 6    conf:(1)
 3. spectacle-prescrip=hypermetrope tear-prod-rate=reduced 6 ==> contact-lenses=none 6    conf:(1)
 4. astigmatism=no tear-prod-rate=reduced 6 ==> contact-lenses=none 6    conf:(1)
 5. astigmatism=yes tear-prod-rate=reduced 6 ==> contact-lenses=none 6    conf:(1)
 6. contact-lenses=soft 5 ==> astigmatism=no 5    conf:(1)
 7. contact-lenses=soft 5 ==> tear-prod-rate=normal 5    conf:(1)
 8. tear-prod-rate=normal contact-lenses=soft 5 ==> astigmatism=no 5    conf:(1)
 9. astigmatism=no contact-lenses=soft 5 ==> tear-prod-rate=normal 5    conf:(1)
10. contact-lenses=soft 5 ==> astigmatism=no tear-prod-rate=normal 5    conf:(1)
11. contact-lenses=hard 4 ==> astigmatism=yes 4    conf:(1)
12. contact-lenses=hard 4 ==> tear-prod-rate=normal 4    conf:(1)
13. age=young contact-lenses=none 4 ==> tear-prod-rate=reduced 4    conf:(1)
14. age=young tear-prod-rate=reduced 4 ==> contact-lenses=none 4    conf:(1)
15. age=pre-presbyopic tear-prod-rate=reduced 4 ==> contact-lenses=none 4    conf:(1)
16. age=presbyopic tear-prod-rate=reduced 4 ==> contact-lenses=none 4    conf:(1)
17. tear-prod-rate=normal contact-lenses=hard 4 ==> astigmatism=yes 4    conf:(1)
18. astigmatism=yes contact-lenses=hard 4 ==> tear-prod-rate=normal 4    conf:(1)
19. contact-lenses=hard 4 ==> astigmatism=yes tear-prod-rate=normal 4    conf:(1)
20. spectacle-prescrip=myope contact-lenses=hard 3 ==> astigmatism=yes 3    conf:(1)

(B) Recreation of the Report’s Results

The following describes step-by-step how to get the same contact-lenses dataset used in this report and how to replicate in WEKA the same results analyzed here:

1. Open a web browser and go to www.google.com
2. Search for “.arff data sets” and click on the result named “My Weka page,” which should be among the first five results found. Alternatively, go to http://www.hakank.org/weka/ to arrive at the same site.
3. Scroll down the page until you find the ARFF data files section. From the files available for download, click on the one that ends in contact-lenses.arff, which is the 16th dataset available. Download it to your computer.
4. Now open WEKA and click on the Explorer button to open the Explorer user interface.
5. In the Preprocess tab, click on the Open file button and load the contact-lenses.arff file you obtained from step 3. The number of instances, attributes, and other statistics show up on screen.
6. Next, click on the Associate tab. Click on the Choose button and pick Apriori to perform the association rule mining.
7. Double-click in the Apriori text field to the right of the Choose button and a window with the parameters used for the Apriori algorithm shows up. Change the numRules to 20 in this window. Close the window.
8. Finally, click on the Start button and the Apriori algorithm runs on the contact-lenses dataset, producing the same result shown in Reference, Part A, with the 20 association rules at the end.
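For readers without WEKA at hand, the kind of search that the steps above trigger can be approximated in a few lines of Python. This is a minimal sketch of level-wise (Apriori-style) mining, not WEKA's implementation. The function names, the fixed minimum count of 2 instances, and the ranking by confidence and then by support are assumptions made for illustration, so the number and order of the rules it prints need not match Part A exactly.

```python
from itertools import combinations, product

ATTRS = ["age", "spectacle-prescrip", "astigmatism", "tear-prod-rate", "contact-lenses"]
# contact-lenses column, copied row by row from the .arff listing in section II.
LABELS = ("none soft none hard none soft none hard "
          "none soft none hard none soft none none "
          "none none none hard none soft none none").split()
combos = product(["young", "pre-presbyopic", "presbyopic"], ["myope", "hypermetrope"],
                 ["no", "yes"], ["reduced", "normal"])
# Each instance becomes a set of "attribute=value" items.
TRANSACTIONS = [frozenset(f"{a}={v}" for a, v in zip(ATTRS, combo + (label,)))
                for combo, label in zip(combos, LABELS)]


def frequent_itemsets(transactions, min_count):
    """Level-wise search: size-k candidates are built only from frequent size-(k-1) sets."""
    counts = {}
    level = [frozenset([item]) for item in sorted({i for t in transactions for i in t})]
    while level:
        level_counts = {c: sum(c <= t for t in transactions) for c in level}
        frequent = {c: n for c, n in level_counts.items() if n >= min_count}
        counts.update(frequent)
        k = len(next(iter(frequent))) + 1 if frequent else 0
        # Join step: unite pairs of frequent sets that differ in exactly one item.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        level = [c for c in candidates if all(c - {i} in frequent for i in c)]
    return counts


def mine_rules(counts, min_conf):
    """Split each frequent itemset into X ==> Y and keep rules with enough confidence."""
    rules = []
    for itemset, n in counts.items():
        for size in range(1, len(itemset)):
            for ante in map(frozenset, combinations(sorted(itemset), size)):
                conf = n / counts[ante]
                if conf >= min_conf:
                    rules.append((conf, n, sorted(ante), sorted(itemset - ante)))
    # Assumed ranking: highest confidence first, then highest support.
    rules.sort(key=lambda r: (-r[0], -r[1], r[2], r[3]))
    return rules


counts = frequent_itemsets(TRANSACTIONS, min_count=2)
mined = mine_rules(counts, min_conf=0.9)
for conf, n, ante, cons in mined[:5]:
    print(f"{' '.join(ante)} ==> {' '.join(cons)} {n} conf:({conf:g})")
```

The first rule printed is tear-prod-rate=reduced ==> contact-lenses=none with 12 supporting instances, in agreement with rule 1 of Part A.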