WEKA methods and background

WEKA
The association rule mining will be performed using a
version of the WEKA software program. WEKA is an open source
program that was developed at the University of Waikato in
New Zealand. The program’s potential was recognized at WPI,
and various WPI students subsequently modified it. These
modifications cause WEKA to mine in a slightly different
manner in order to reduce the number of rules found.

WEKA takes in a set of data and returns a list of
association rules pertaining to the input. The input file
must be in the proper arff format for WEKA to process it.
The current WPI version of WEKA can build association rules
with three different algorithms, depending on the data set.
The first is the basic Apriori algorithm, which is performed
on input containing standard data. The Apriori Sets
algorithm mines rules from data containing sets of items.
The final algorithm, which will be used in this project, is
Apriori Sets and Sequences. This algorithm allows for mining
data that contains both set and time sequence attributes.

The WPI version of WEKA generates a desired number of rules
by diminishing support while mining for rules. The support
is initially set high (95%) and is diminished in increments
until at least the desired number of rules has been found.
Rules are then displayed along with their support and
confidence.
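
As a small worked example of the two measures reported with each rule, the sketch below computes support and confidence using their standard textbook definitions. The transactions and item names are hypothetical, and the text does not state whether the WPI version computes the measures in exactly this way.

```python
# Toy transactions; motif and expression names are hypothetical.
transactions = [
    {"motifA", "motifB", "expr=high"},
    {"motifA", "expr=high"},
    {"motifA", "motifB", "expr=low"},
    {"motifB", "expr=low"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of the whole rule divided by support of its antecedent."""
    whole_rule = set(antecedent) | set(consequent)
    return support(whole_rule, transactions) / support(antecedent, transactions)


# Rule under test: motifA => expr=high
rule_support = support({"motifA", "expr=high"}, transactions)
rule_confidence = confidence({"motifA"}, {"expr=high"}, transactions)
```

Here the rule holds in 2 of 4 transactions, and motifA appears in 3 of 4, so the confidence is 2/3.
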
This feature was added to avoid the guesswork in setting an
initial minimal support to get a desirable number of rules.
It requires certain parameters that the user must select
for each mining experiment.
The most important parameters are lowerBoundMinSupport,
upperBoundMinSupport, numRules, minConfidence and delta.
The number of rules sought is numRules, which has a default
value of 10. The starting minimal support is
upperBoundMinSupport, which has a default value of 95%.
Data mining begins with this minimal support, but if at
least 10 rules are not found then the support is
decremented by delta, which is initialized at 5%.
This continues until the desired number of rules is reached
or the decremented support reaches lowerBoundMinSupport.
When this is finished, rules exceeding minConfidence are
displayed.
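
The loop described above can be sketched as follows. This is a minimal illustration, not the WPI implementation: the brute-force miner stands in for Apriori, the function names are hypothetical, the defaults for the lower bound and the minimum confidence are illustrative values not given in the text, and counting only rules that already meet the minimum confidence is an assumption.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def mine_rules(transactions, min_support, min_confidence):
    """Brute-force stand-in for Apriori, suitable for tiny inputs only."""
    items = sorted(set().union(*transactions))
    rules = []
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            sup = support(itemset, transactions)
            if sup < min_support:
                continue
            # Split the frequent itemset into antecedent => single consequent.
            for consequent in combo:
                antecedent = itemset - {consequent}
                conf = sup / support(antecedent, transactions)
                if conf >= min_confidence:
                    rules.append((antecedent, consequent, sup, conf))
    return rules


def mine_with_decreasing_support(transactions, num_rules=10,
                                 upper_bound_min_support=0.95,
                                 lower_bound_min_support=0.10,  # assumed value
                                 delta=0.05,
                                 min_confidence=0.90):  # assumed value
    """Lower the minimal support by delta until enough rules are found."""
    min_support = upper_bound_min_support
    while True:
        rules = mine_rules(transactions, min_support, min_confidence)
        # Rounding keeps repeated subtraction from drifting (0.95, 0.9, ...).
        next_support = round(min_support - delta, 10)
        if len(rules) >= num_rules or next_support < lower_bound_min_support:
            return min_support, rules
        min_support = next_support


# Hypothetical toy data: each transaction is a set of items.
transactions = [
    {"motifA", "motifB", "expr=high"},
    {"motifA", "expr=high"},
    {"motifA", "motifB", "expr=low"},
    {"motifB", "expr=low"},
]
final_support, rules = mine_with_decreasing_support(
    transactions, num_rules=1, min_confidence=0.6)
```

On this toy data no itemset of two or more items reaches a support above 50%, so the loop steps down from 95% in 5% increments and stops at the first level that yields rules.
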
There are also some other parameters that let the user
specify which attributes to set as the antecedent and
consequent.
This project will focus solely on rules where
the motifs are the antecedent and the expression is the
consequent.
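
The effect of this restriction can be illustrated with a simple post-filter over candidate rules. The text does not name the WEKA parameters that enforce it, so the function name, the item naming scheme ("motif..." and "expr=...") and the filtering after mining are all assumptions made for illustration.

```python
def is_motif_to_expression(antecedent, consequent):
    """True when every antecedent item is a motif and the consequent
    is an expression value (hypothetical naming scheme)."""
    return (len(antecedent) > 0
            and all(item.startswith("motif") for item in antecedent)
            and consequent.startswith("expr="))


# Hypothetical candidate rules as (antecedent, consequent) pairs.
candidate_rules = [
    (frozenset({"motifA"}), "expr=high"),
    (frozenset({"expr=high"}), "motifA"),
    (frozenset({"motifA", "motifB"}), "expr=low"),
    (frozenset({"motifA"}), "motifB"),
]

kept = [rule for rule in candidate_rules if is_motif_to_expression(*rule)]
```

Only the first and third candidates survive, since the others place an expression in the antecedent or a motif in the consequent.
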
Data Mining with WEKA
Once the motifs had been compared for the last time, we
created an arff file for each experiment we conducted.
These files were constructed using software developed for
this project. The program took as input a motif file along
with gene sequences.
The gene sequences were chosen by their expression so that
rules could be mined from different gene sequences with
different expression.
The output file was an arff file containing the descriptive
header along with a list of instances. Each instance was
made up of a sequence, an expression and a set of motifs.
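
A file of this shape could be produced along the following lines. This is only a sketch: the project's own conversion program is not shown in the text, the function and relation names are hypothetical, and because the set-valued attribute syntax of the WPI version is not given, the motifs are written as a semicolon-delimited string purely as a placeholder.

```python
def build_arff(relation, instances, expression_levels):
    """Return arff text with a header followed by one line per instance."""
    lines = [
        f"@relation {relation}",
        "",
        "@attribute sequence string",
        "@attribute expression {" + ",".join(expression_levels) + "}",
        # Placeholder encoding: the real set-attribute syntax is not shown.
        "@attribute motifs string",
        "",
        "@data",
    ]
    for sequence, expression, motifs in instances:
        motif_field = ";".join(sorted(motifs))
        lines.append(f"'{sequence}',{expression},'{motif_field}'")
    return "\n".join(lines) + "\n"


# Hypothetical instances: (gene sequence, expression, set of motifs).
instances = [
    ("ACGTGCA", "high", {"m2", "m1"}),
    ("TTGACGT", "low", {"m3"}),
]
arff_text = build_arff("motif_expression", instances, ["high", "low"])
```
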
After each file was constructed we imported the arff
into the WPI version of WEKA.
Each experiment was done
individually by creating a different arff file for each
one.
Once the correct file was imported we applied a filter to
remove the gene sequence. This left the two attributes
motifs and expression.
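
The effect of the filtering step can be shown on a small in-memory table. The text does not name the WEKA filter that was applied, so the helper below is a hypothetical stand-in that simply drops one attribute from the header and from every instance.

```python
def remove_attribute(attribute_names, rows, name):
    """Return copies of the header and rows without the named attribute."""
    index = attribute_names.index(name)
    kept_names = attribute_names[:index] + attribute_names[index + 1:]
    kept_rows = [row[:index] + row[index + 1:] for row in rows]
    return kept_names, kept_rows


# Hypothetical instances with the three attributes described above.
attribute_names = ["sequence", "expression", "motifs"]
rows = [
    ("ACGTGCA", "high", frozenset({"m1", "m2"})),
    ("TTGACGT", "low", frozenset({"m3"})),
]
kept_names, kept_rows = remove_attribute(attribute_names, rows, "sequence")
```
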
The next step was to associate the data and establish
the parameters to produce the type of results we wanted.
This project needed to produce association rules where the
antecedent was a set of motifs and the consequent was a
cell expression.
This was accomplished by selecting values
in WEKA to require certain attributes in the antecedent and
consequent.