Project Report #1 - Pedro C. Exposito

CAP 4770
Introduction to Data Mining
Project Report #1:
Association Rules
Pedro Exposito
ID: 1826385
I. Introduction and Objectives
This report explains how to do an association mining task in WEKA, using a dataset in .arff
format, and how to obtain useful knowledge from the results. The dataset contact-lenses—
described in the next section—will be the example used throughout the report to explain how to
do association rule mining on a set of related data.
The domain of contact-lenses involves types of contact lenses and characteristics of people who
either use a particular type of contact lenses or none at all. The dataset has information on several
individuals (mostly characteristics that can affect vision) and the type of contact lenses they use,
if any. Performing association rule mining on this dataset can have potential benefits, such as
finding correlations among certain eye-related characteristics and the need to use a particular
type of contact-lenses for people who have those characteristics. Moreover, it can help to
visualize which traits are common among people who do not use contact lenses at all, especially
if a high correlation between having those traits and not using lenses is found.
If we search for patterns, or associations, on this dataset the results can lead us to uncover more
specific correlations, such as finding that most people that have both trait ‘A’ and trait ‘B’ also
use contact lenses of type ‘2’. These relationships are very hard to find by just looking at the
data, but WEKA can be used to find them. Moreover, two or more found patterns could be used
together to infer another general relationship among the data attributes, which might not be
obvious at all. These results make WEKA’s association rule mining useful as a tool of
knowledge discovery from data.
II. Data Set Description and Data Pre-processing
The dataset contact-lenses contains five attributes.
The attribute age is a nominal attribute with values young, pre-presbyopic, and presbyopic. The
age attribute here is more related to age of the eye than to the age of the person. Individuals in
the pre-presbyopic category are adults around 30-40 years old and those considered presbyopic
are older than 40. The word presbyopic refers to the loss of focus of the natural eye lenses, which
is a natural consequence of aging. The attribute spectacle-prescrip stands for spectacle
prescription and has two possible values: myope and hypermetrope. These two refer to the eye
condition of the patient, for which he or she should wear a specific type of glasses. Different
types of lenses are used to correct myopia and hypermetropia. The attribute astigmatism has
possible values yes and no, which refer to whether the person has this vision problem or not. The
attribute tear-prod-rate stands for tear production rate and can have the value reduced or normal.
The final attribute, contact-lenses, has possible values soft, hard, and none, and refers to the type
of contact lenses that the individual is currently using.
In this example, the dataset contact-lenses already had the correct file format to use in WEKA
(see Reference section, part B). It was in .arff format. To see what an .arff file should look like
check the .arff file for contact-lenses provided at the end of this section. The file gives the
dataset’s name with @relation at the top, then it declares each attribute with @attribute
followed by the attribute’s name and either its set of possible values in braces (for nominal
attributes) or the keyword numeric (for attributes that take numeric values, such as 85 or 20.3).
In this example, there were only nominal attributes. The file marks the beginning of the data
with @data, and each data instance of the relation follows on its own row below, with its
attribute values separated by commas and no comma after the last value in the row.
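As a rough illustration of the layout just described, the following Python sketch reads a small .arff text that contains nominal attributes only. The function name parse_arff and the three-row sample (the header plus the first three rows of contact-lenses) are choices made for this example. It is not how WEKA itself loads files, and it ignores numeric attributes, quoting, and sparse data.

```python
def parse_arff(text):
    """Parse a tiny ARFF text with nominal attributes only (a simplification)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and ARFF comments
            continue
        lower = line.lower()
        if in_data:
            # One instance per row, values separated by commas.
            values = [v.strip() for v in line.split(",")]
            rows.append(dict(zip([name for name, _ in attributes], values)))
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            # "@attribute <name> {v1, v2, ...}" for a nominal attribute.
            _, name, spec = line.split(None, 2)
            attributes.append((name, [v.strip() for v in spec.strip("{}").split(",")]))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


SAMPLE = """\
@relation contact-lenses
@attribute age {young, pre-presbyopic, presbyopic}
@attribute spectacle-prescrip {myope, hypermetrope}
@attribute astigmatism {no, yes}
@attribute tear-prod-rate {reduced, normal}
@attribute contact-lenses {soft, hard, none}
@data
young,myope,no,reduced,none
young,myope,no,normal,soft
young,myope,yes,reduced,none
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation)
for name, values in attributes:
    print(name, values)
print(rows[1])
```

Each parsed row is a dictionary keyed by attribute name, which makes it easy to check that every value belongs to the set declared in the header.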
To do association rule mining with WEKA one first should put the data in .arff format. This can
be done easily using Notepad and saving the file with extension .arff after setting it up correctly.
Other formats, such as .csv, work as well, but for simplicity only .arff is covered in this
report.
WEKA’s Explorer interface provides several options to pre-process the data after you load the
dataset using the .arff file. In this example, the dataset was already suitable to perform
association rule mining right away, since all its attributes were nominal already and eliminating
attributes was not convenient in this case. However, other datasets need further pre-processing
before association mining. First, all the numeric attributes must be converted to nominal
attributes; otherwise WEKA’s Associate tab does not allow you to start an association mining task
on the dataset. To do this, click on Choose in the Preprocess tab and open
filters->unsupervised->attribute->Discretize to select the Discretize filter, which converts
numeric attributes to nominal ones. Then, double-click on the Discretize text field to open a
window with the parameter settings for discretization, such as the number of bins and the weight
of instances per interval. Each numeric attribute is divided into as many intervals as the number
of bins chosen and converted to a nominal attribute with those intervals as its possible values.
At that point the data is ready for association mining, although additional pre-processing, such
as removing unimportant attributes, could be done to obtain better results.
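To make the idea of binning concrete, here is a minimal Python sketch of equal-width discretization. It only illustrates the principle; the function name equal_width_bins, the labels bin1, bin2, and so on, the handling of interval boundaries, and the sample ages are all assumptions made for this example and do not reproduce the exact labels or options of WEKA’s Discretize filter.

```python
def equal_width_bins(values, num_bins):
    """Map each numeric value to one of num_bins equal-width interval labels."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    if width == 0:
        # All values are identical, so a single interval is enough.
        return ["bin1"] * len(values)
    labels = []
    for value in values:
        # The maximum would fall just past the last interval, so clamp it.
        index = min(int((value - low) / width), num_bins - 1)
        labels.append(f"bin{index + 1}")
    return labels


ages = [18, 25, 33, 41, 47, 52, 60, 78]  # hypothetical numeric attribute
labels = equal_width_bins(ages, num_bins=3)
print(list(zip(ages, labels)))
```

With three bins over the range 18 to 78, each interval is 20 units wide, and the resulting labels can be treated as the values of a nominal attribute.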
In the case of our dataset, contact-lenses, there was no need for the aforementioned data
preparation. The data file was already in .arff format, all the attributes were nominal already, and
all the attributes were important or useful to find association rules; thus, none were removed. In
fact, removing an attribute from contact-lenses would have affected the discovery of more
complex association rules, especially those that involve the removed attribute. It is important
to note that although this dataset was already set up and ready for association rule mining, it
would have required the data pre-processing methods mentioned so far if it had not been.
To further describe the contact-lenses dataset, the .arff file that represents it is provided next:
@relation contact-lenses
@attribute age {young, pre-presbyopic, presbyopic}
@attribute spectacle-prescrip {myope, hypermetrope}
@attribute astigmatism {no, yes}
@attribute tear-prod-rate {reduced, normal}
@attribute contact-lenses {soft, hard, none}
@data
young,myope,no,reduced,none
young,myope,no,normal,soft
young,myope,yes,reduced,none
young,myope,yes,normal,hard
young,hypermetrope,no,reduced,none
young,hypermetrope,no,normal,soft
young,hypermetrope,yes,reduced,none
young,hypermetrope,yes,normal,hard
pre-presbyopic,myope,no,reduced,none
pre-presbyopic,myope,no,normal,soft
pre-presbyopic,myope,yes,reduced,none
pre-presbyopic,myope,yes,normal,hard
pre-presbyopic,hypermetrope,no,reduced,none
pre-presbyopic,hypermetrope,no,normal,soft
pre-presbyopic,hypermetrope,yes,reduced,none
pre-presbyopic,hypermetrope,yes,normal,none
presbyopic,myope,no,reduced,none
presbyopic,myope,no,normal,none
presbyopic,myope,yes,reduced,none
presbyopic,myope,yes,normal,hard
presbyopic,hypermetrope,no,reduced,none
presbyopic,hypermetrope,no,normal,soft
presbyopic,hypermetrope,yes,reduced,none
presbyopic,hypermetrope,yes,normal,none
III. Rule Mining Process
After the data pre-processing stage is done, association rule mining is performed through
WEKA’s Associate tab. There the user can choose which algorithm will be used to find the
association rules and, after picking the algorithm, customize its parameters by double-clicking
on the text field that contains the algorithm’s name. In this example, the standard Apriori
algorithm was used and only one of its parameters was modified: the number of rules (numRules),
which is 10 by default, was set to 20 to produce more rules. All other parameters kept their
default settings, for example the minimum confidence of 0.9 reported in the output shown in
Part A of the Reference section. The algorithm took less than one second to execute and output
the results. However, contact-lenses is a small dataset, so this is not surprising; the run time
is expected to be longer for very large datasets with thousands of entries.
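For readers who want to see what the algorithm does internally, the following is a minimal Python sketch of the frequent-itemset stage of Apriori, run on the 24 instances listed in section II. It is a simplified illustration and not WEKA’s implementation: the names frequent_itemsets and MIN_COUNT are made up for this example, the dataset is rebuilt compactly from the listing, and the threshold of 2 instances is taken from the "Minimum support: 0.1 (2 instances)" line of the output in Part A.

```python
from itertools import combinations, product

ATTRIBUTES = ["age", "spectacle-prescrip", "astigmatism", "tear-prod-rate", "contact-lenses"]
# Lens type of each of the 24 rows, in the order of the listing in section II.
LENSES = ("none soft none hard none soft none hard "
          "none soft none hard none soft none none "
          "none none none hard none soft none none").split()
# The first four attributes cover every combination, in the listing's order.
FEATURES = product(["young", "pre-presbyopic", "presbyopic"],
                   ["myope", "hypermetrope"], ["no", "yes"], ["reduced", "normal"])
# Each transaction is a set of (attribute, value) items.
TRANSACTIONS = [frozenset(zip(ATTRIBUTES, combo + (lens,)))
                for combo, lens in zip(FEATURES, LENSES)]
MIN_COUNT = 2


def frequent_itemsets(transactions, min_count):
    """Return {k: {itemset: count}} for itemsets found in at least min_count rows."""
    def count(candidates):
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_count}

    items = {frozenset([item]) for t in transactions for item in t}
    levels = {1: count(items)}
    k = 1
    while levels[k]:
        prev = levels[k]
        # Join step: merge pairs of frequent k-itemsets that differ in one item.
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k))}
        k += 1
        levels[k] = count(candidates)
    del levels[k]  # drop the final, empty level
    return levels


levels = frequent_itemsets(TRANSACTIONS, MIN_COUNT)
sizes = {k: len(itemsets) for k, itemsets in levels.items()}
print(sizes)
```

The sizes printed per level can be compared with the "Size of set of large itemsets" lines in Part A of the Reference section.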
IV. Resulting Rules
The association mining process, done using Apriori, results in 20 association rules. For the full
listing of the output and all the ordered rules refer to Reference section, Part A. Next, five rules,
out of the 20, are shown to explain how useful knowledge and predictions can be obtained from
the association rules found:
1. tear-prod-rate=reduced 12 ==> contact-lenses=none 12 conf:(1)
6. contact-lenses=soft 5 ==> astigmatism=no 5 conf:(1)
11. contact-lenses=hard 4 ==> astigmatism=yes 4 conf:(1)
7. contact-lenses=soft 5 ==> tear-prod-rate=normal 5 conf:(1)
12. contact-lenses=hard 4 ==> tear-prod-rate=normal 4 conf:(1)
Rules number 1, 6, 11, 7, and 12 are shown above. Here, their order was altered from the original
output to make it easier to show how different rules can be used together to make a single
prediction. The rules are given in the form if X then Y, or X implies Y (X → Y). The number
after X is the count of instances that match X, the number after Y is the count of those
instances that also match Y, and the confidence (conf) is the ratio of the second count to the
first. All 20 rules found have a confidence of 1, so their position in the list follows their
support, that is, the number of instances that satisfy the whole rule. Notice that rule 1 has the
number 12 on both sides of the conditional, while the other rules have 4 and 5. This means that
12 instances of the relation support the association given by rule 1, which makes it the most
valuable rule found and places it at the top of the list.
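The two counts and the confidence of a rule can be recomputed directly from the data. The sketch below does this in plain Python for rules 1, 6, and 11, and also for the reverse of rule 6 to show a rule that does not reach the 0.9 confidence threshold. The helper rule_counts and the compact reconstruction of the dataset are assumptions of this example, not part of WEKA.

```python
from itertools import product

ATTRIBUTES = ["age", "spectacle-prescrip", "astigmatism", "tear-prod-rate", "contact-lenses"]
# Lens type of each of the 24 rows, in the order of the listing in section II.
LENSES = ("none soft none hard none soft none hard "
          "none soft none hard none soft none none "
          "none none none hard none soft none none").split()
FEATURES = product(["young", "pre-presbyopic", "presbyopic"],
                   ["myope", "hypermetrope"], ["no", "yes"], ["reduced", "normal"])
ROWS = [dict(zip(ATTRIBUTES, combo + (lens,))) for combo, lens in zip(FEATURES, LENSES)]


def rule_counts(rows, antecedent, consequent):
    """Return (rows matching X, rows matching X and Y, confidence) for X ==> Y."""
    lhs = [r for r in rows if antecedent.items() <= r.items()]
    both = [r for r in lhs if consequent.items() <= r.items()]
    confidence = len(both) / len(lhs) if lhs else 0.0
    return len(lhs), len(both), confidence


rule_1 = rule_counts(ROWS, {"tear-prod-rate": "reduced"}, {"contact-lenses": "none"})
rule_6 = rule_counts(ROWS, {"contact-lenses": "soft"}, {"astigmatism": "no"})
rule_11 = rule_counts(ROWS, {"contact-lenses": "hard"}, {"astigmatism": "yes"})
# Reversing rule 6: not having astigmatism does not imply soft lenses.
reverse_of_6 = rule_counts(ROWS, {"astigmatism": "no"}, {"contact-lenses": "soft"})

for name, stats in [("rule 1", rule_1), ("rule 6", rule_6),
                    ("rule 11", rule_11), ("reverse of rule 6", reverse_of_6)]:
    print(name, stats)
```

The reverse of rule 6 matches 12 instances on the left side but only 5 of them on both sides, so its confidence is 5/12 and it is not reported by Apriori at the 0.9 threshold.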
The rules shown above are the ones that would be suitable to show to a customer or to make a
hypothesis based on the results from association mining because they give useful correlations.
Rule 1 covers half of the sample (12 of the 24 instances): if the person has reduced tear
production, then he or she does not use contact-lenses. This could lead to hidden knowledge about
the data, such as finding that for most people (based on the contact-lenses sample) contact-lenses
could lead to higher tear production. Another valuable fact from rule 1 is obtained if it is
compared with other less significant rules, such as rules 2-5 and rules 14-16 (refer to section
VI, part A). The fact is that the values of the other attributes, such as age, whether the person
has astigmatism or not, and spectacle prescription, are not important to determine if the person
wears contact-lenses or not; the rate of tear production seems to be the determining factor.
Notice that when the tear production rate is reduced the person is likely to not wear
contact-lenses, regardless of the values of the other attributes.
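The observation about rule 1 can be checked against the 24 instances with a simple cross-tabulation of tear production rate against lens type. The sketch below is plain Python with the dataset rebuilt from the listing in section II; the variable names are illustrative only.

```python
from collections import Counter
from itertools import product

# Lens type of each of the 24 rows, in the order of the listing in section II.
LENSES = ("none soft none hard none soft none hard "
          "none soft none hard none soft none none "
          "none none none hard none soft none none").split()
# (age, spectacle-prescrip, astigmatism, tear-prod-rate) in the listing's order.
FEATURES = product(["young", "pre-presbyopic", "presbyopic"],
                   ["myope", "hypermetrope"], ["no", "yes"], ["reduced", "normal"])

# Count how often each (tear-prod-rate, contact-lenses) pair occurs.
table = Counter((combo[3], lens) for combo, lens in zip(FEATURES, LENSES))

for tear_rate in ["reduced", "normal"]:
    for lens in ["soft", "hard", "none"]:
        print(tear_rate, lens, table[(tear_rate, lens)])
```

With a reduced tear production rate all 12 instances fall under none, whatever the other attribute values are; with a normal rate the instances are split among soft, hard, and none.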
Rules 6 and 11 together show that it is probable that people with soft contact-lenses do not have
astigmatism, and, conversely, it is probable that those with hard contact-lenses have astigmatism.
The two rules together give more support for this correlation because rule 6 is supported by five
instances and rule 11 by four other instances, but both rules support the same idea because they
have the opposite values of the same related attributes. Therefore, we can say that nine instances
(those of rules 6 and 11) support that people with soft contact-lenses probably don’t have
astigmatism, and those with hard contact-lenses probably do have it.
Rules 7 and 12 together show another interesting result, which is not obvious at all by looking at
each rule separately. They show that it is probable that people with either soft or hard
contact-lenses have a normal tear production rate. Therefore, these two together can be used to
infer that the type of contact-lenses used does not determine the person’s tear production rate.
These are some association rules, obtained from performing association mining on the
contact-lenses dataset, that lead to interesting inferences. It is worth noting that only five
rules were discussed here out of the 20 found, so not all association rules lead to useful
knowledge.
V. Recommendations
Suppose that the rules discovered and section IV’s explanation were presented to a client, a
seller, or someone who would make decisions based on the results. In such a situation we can see
why association rule mining can be useful. Let’s suppose the same rules with the same level of
support were obtained for the same attributes, but for a contact-lenses dataset with 1 million
instances, or rows, instead of just 24. In that case, a contact-lenses seller should sell hard
contact-lenses to people with astigmatism and soft contact-lenses to people without astigmatism (rules 6
and 11). Likewise, people who know they have astigmatism should buy hard contact-lenses.
In addition, if people who use contact-lenses want to reduce their tear production rate, it would
be wise to recommend that they stop using contact lenses and switch to glasses, since half the
people from a 1 million sample showed that if their tear production rate was reduced, then they
did not use contact-lenses either. This association rule from such a large sample could lead us to
believe that contact-lenses can be a factor in having a higher rate of tear production.
VI. Reference
(A) WEKA’s Associator Output (complete output)
The following is the complete output given by WEKA’s associator running with the Apriori
algorithm on the contact-lenses dataset. It contains the full listing of the 20 rules found at the
end.
=== Run information ===
Scheme: weka.associations.Apriori -N 20 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation: contact-lenses-weka.filters.unsupervised.attribute.Discretize-B4-M7.0-Rfirst-last
Instances: 24
Attributes: 5
age
spectacle-prescrip
astigmatism
tear-prod-rate
contact-lenses
=== Associator model (full training set) ===
Apriori
=======
Minimum support: 0.1 (2 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 18
Generated sets of large itemsets:
Size of set of large itemsets L(1): 12
Size of set of large itemsets L(2): 49
Size of set of large itemsets L(3): 84
Size of set of large itemsets L(4): 26
Best rules found:
1. tear-prod-rate=reduced 12 ==> contact-lenses=none 12 conf:(1)
2. spectacle-prescrip=myope tear-prod-rate=reduced 6 ==> contact-lenses=none 6 conf:(1)
3. spectacle-prescrip=hypermetrope tear-prod-rate=reduced 6 ==> contact-lenses=none 6 conf:(1)
4. astigmatism=no tear-prod-rate=reduced 6 ==> contact-lenses=none 6 conf:(1)
5. astigmatism=yes tear-prod-rate=reduced 6 ==> contact-lenses=none 6 conf:(1)
6. contact-lenses=soft 5 ==> astigmatism=no 5 conf:(1)
7. contact-lenses=soft 5 ==> tear-prod-rate=normal 5 conf:(1)
8. tear-prod-rate=normal contact-lenses=soft 5 ==> astigmatism=no 5 conf:(1)
9. astigmatism=no contact-lenses=soft 5 ==> tear-prod-rate=normal 5 conf:(1)
10. contact-lenses=soft 5 ==> astigmatism=no tear-prod-rate=normal 5 conf:(1)
11. contact-lenses=hard 4 ==> astigmatism=yes 4 conf:(1)
12. contact-lenses=hard 4 ==> tear-prod-rate=normal 4 conf:(1)
13. age=young contact-lenses=none 4 ==> tear-prod-rate=reduced 4 conf:(1)
14. age=young tear-prod-rate=reduced 4 ==> contact-lenses=none 4 conf:(1)
15. age=pre-presbyopic tear-prod-rate=reduced 4 ==> contact-lenses=none 4 conf:(1)
16. age=presbyopic tear-prod-rate=reduced 4 ==> contact-lenses=none 4 conf:(1)
17. tear-prod-rate=normal contact-lenses=hard 4 ==> astigmatism=yes 4 conf:(1)
18. astigmatism=yes contact-lenses=hard 4 ==> tear-prod-rate=normal 4 conf:(1)
19. contact-lenses=hard 4 ==> astigmatism=yes tear-prod-rate=normal 4 conf:(1)
20. spectacle-prescrip=myope contact-lenses=hard 3 ==> astigmatism=yes 3 conf:(1)
(B) Recreation of the Report’s Results
The following describes step-by-step how to get the same contact-lenses dataset used in this
report and how to replicate the same results analyzed here in WEKA:
1. Open a web browser and go to www.google.com
2. Search for “.arff data sets” and click on the result named “My Weka page,” which should
be among the first five results found. Alternatively, go to http://www.hakank.org/weka/
to arrive at the same site.
3. Scroll down the page until you find the ARFF data files section. From the files available
for download, click on the one that ends in contact-lenses.arff, which is the 16th dataset
available. Download it to your computer.
4. Now open WEKA and click on the Explorer button to open the Explorer user interface.
5. In the Preprocess tab, click on the Open file button and load the contact-lenses.arff file
you obtained from step 3. The number of instances, attributes, and other statistics show
up on screen.
6. Next, click on the Associate tab. Click on the Choose button and pick Apriori to perform
the association rule mining.
7. Double-click in the Apriori text field to the right of the Choose button and a window with
the parameters used for the Apriori algorithm shows up. Change the numRules to 20 in
this window. Close the window.
8. Finally, click on the Start button and the Apriori algorithm runs on the contact-lenses
dataset, producing the same result shown in Reference, Part A, with the 20 association
rules at the end.