WEKA

The association rule mining will be performed using a version of the WEKA software program. WEKA is an open source program that was developed at the University of Waikato in New Zealand. The program's potential was recognized at WPI, and various WPI students subsequently made modifications to it. These improvements cause WEKA to mine in a slightly different manner that reduces the number of rules found. WEKA takes in a set of data and returns a list of association rules pertaining to the input. The input file must be in the proper arff format for the association rules to be mined.

The current WPI version of WEKA allows building association rules with three different algorithms, depending on the data set. The first is the basic Apriori algorithm, which is performed on input containing standard data. The Apriori Sets algorithm mines rules from data containing sets of items. The final algorithm, which will be used in this project, is Apriori Sets and Sequences. This algorithm allows for mining data that contains both set and time sequence attributes.

The WPI version of WEKA generates a desired number of rules by diminishing support while mining. The support is initially set high (95%) and is diminished in increments until at least the desired number of rules has been found. Rules are then displayed along with their support and confidence. This feature was added to avoid the guesswork in setting an initial minimal support that yields a desirable number of rules.

This feature requires certain parameters that the user must select for each mining experiment. The most important parameters are lowerBoundMinSupport, numRules, minConfidence, upperBoundMinSupport, and delta. The number of rules sought is numRules, which has a default value of 10. The starting minimal support is upperBoundMinSupport, which has a default value of 95%.
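
The interplay of these parameters can be sketched in a few lines of Python. This is only an illustration of the decreasing-support loop, not the WPI WEKA code: the function name, the `mine_rules` callback, and the toy rules are hypothetical, and the sketch assumes that only rules meeting minConfidence count toward numRules. No defaults are assumed for lowerBoundMinSupport or minConfidence, since none are stated here.

```python
from typing import Callable, List, Tuple

# A rule is (description, support, confidence); this layout is an assumption.
Rule = Tuple[str, float, float]


def mine_with_decreasing_support(
    mine_rules: Callable[[float], List[Rule]],
    lower_bound_min_support: float,
    min_confidence: float,
    num_rules: int = 10,
    upper_bound_min_support: float = 0.95,
    delta: float = 0.05,
) -> Tuple[List[Rule], float]:
    """Lower the minimum support step by step until enough rules are found.

    `mine_rules(min_support)` is a hypothetical stand-in for the real miner.
    Returns the rules kept and the support level at which mining stopped.
    """
    if delta <= 0:
        raise ValueError("delta must be positive")
    step = 0
    while True:
        # Recompute from the start value each time to limit float drift.
        support = round(upper_bound_min_support - step * delta, 10)
        support = max(support, lower_bound_min_support)
        rules = [r for r in mine_rules(support) if r[2] >= min_confidence]
        # Stop when enough rules are found or the lower bound is reached.
        if len(rules) >= num_rules or support <= lower_bound_min_support:
            return rules, support
        step += 1


# Toy data, made up for illustration only.
TOY_RULES: List[Rule] = [
    ("motifA -> high", 0.90, 0.95),
    ("motifB -> low", 0.80, 0.85),
    ("motifC -> high", 0.70, 0.60),
]


def toy_miner(min_support: float) -> List[Rule]:
    # Pretend miner: return the toy rules whose support is high enough.
    return [r for r in TOY_RULES if r[1] >= min_support]


found, final_support = mine_with_decreasing_support(
    toy_miner, lower_bound_min_support=0.5, min_confidence=0.8, num_rules=2
)
```

Passing the miner in as a callback keeps the loop independent of any particular Apriori variant, so the same control flow would apply to all three algorithms mentioned above.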
Data mining will begin using upperBoundMinSupport as the minimal support. If at least numRules rules (10 by default) are not found, the support will be decremented by delta, which is initialized at 5%. This will continue until the desired number of rules is reached or the decremented support reaches lowerBoundMinSupport. When mining is finished, the rules exceeding minConfidence will be displayed. There are also other parameters that the user can set to specify which attributes make up the antecedent and the consequent. This project will focus solely on rules where the motifs are the antecedent and the expression is the consequent.

Data Mining with WEKA

Once the motifs had been compared for the last time, we created an arff file for each experiment we conducted. These files were constructed using software developed for this project. The program took as input a motif file along with gene sequences. The gene sequences were chosen by their expression so that rules could be mined from different gene sequences with different expression. The output was an arff file containing the descriptive header along with a list of instances. Each instance was made up of a sequence, an expression, and a set of motifs.

After each file was constructed, we imported the arff file into the WPI version of WEKA. Each experiment was done individually, with a different arff file for each one. Once the correct file was imported, we applied a filter to remove the gene sequence. This left two attributes, motifs and expression. The next step was to associate the data and establish the parameters that would produce the type of results we wanted. This project needed to produce association rules where the antecedent was a set of motifs and the consequent was a cell expression. This was accomplished by selecting values in WEKA that require certain attributes in the antecedent and the consequent.
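
To make the file layout concrete, the following Python sketch writes a small arff file with a header and a list of instances. It is not the software developed for this project. The `Instance` class, `build_arff` function, and sample data are hypothetical. Because the set-attribute syntax of the WPI version is not shown here, the sketch assumes standard arff and encodes each motif as a yes/no attribute. The `include_sequence` flag only stands in for the WEKA filter that removes the gene sequence.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Instance:
    """One gene: its sequence, its expression label, and its motifs."""
    sequence: str
    expression: str
    motifs: FrozenSet[str]


def build_arff(relation: str, instances: List[Instance],
               include_sequence: bool = True) -> str:
    """Return standard arff text: a header followed by one row per instance."""
    motifs = sorted({m for inst in instances for m in inst.motifs})
    expressions = sorted({inst.expression for inst in instances})

    lines = [f"@relation {relation}", ""]
    if include_sequence:
        lines.append("@attribute sequence string")
    for motif in motifs:
        # Assumption: one yes/no attribute per motif instead of a set type.
        lines.append(f"@attribute {motif} {{yes,no}}")
    lines.append("@attribute expression {" + ",".join(expressions) + "}")
    lines += ["", "@data"]

    for inst in instances:
        row = []
        if include_sequence:
            row.append("'" + inst.sequence + "'")
        row += ["yes" if m in inst.motifs else "no" for m in motifs]
        row.append(inst.expression)
        lines.append(",".join(row))
    return "\n".join(lines) + "\n"


# Made-up sample data for illustration only.
sample = [
    Instance("ACGTAC", "high", frozenset({"M1", "M2"})),
    Instance("TTGACA", "low", frozenset({"M2"})),
]
arff_text = build_arff("genes", sample, include_sequence=False)
```

Placing the expression attribute last is only a layout choice in this sketch. Which attributes may appear in the antecedent and the consequent is still controlled by the parameters selected in WEKA.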