EVALUATION OF THE PRODUCTS OFFERED BY RULEQUEST RESEARCH
CS 595 Assignment 1
By Sivakumar Sundaramoorthy

INTRODUCTION

RuleQuest Research is an Australian company that develops data mining tools. The company provides some of the best data mining tools available for transforming data into knowledge. The tools that the company offers help in:

- Constructing decision trees and rule-based classifiers
- Building rule-based numerical models
- Finding association rules that reveal interrelationships
- Identifying data anomalies for data cleansing (a new venture)

The following companies use the RuleQuest tools:

E-Merchandising from Blue Martini
The Blue Martini Customer Interaction System is the leading enterprise-scale Internet application for interacting live with customers.

EPM from Broadbase
Broadbase applications power digital markets by analyzing customer data and using that information to execute personalized interactions that drive revenue.

Clementine from ISL/SPSS
Clementine Server enables us to sift through huge volumes of data to discover valuable experience and information, and to turn them into powerful decision-making knowledge.

Decision Series from Accrue
Accrue Software is the leading provider of Internet enterprise software solutions for optimizing the effectiveness of e-tail, retail and e-media initiatives.

Industrial systems from Parsytec
Parsytec AG is a software company, linked to and evolved from the Technical University of Aachen, which specializes in the analysis and evaluation of defects on high-speed production lines such as those found, for example, in the steel, aluminum, paper or plastics industries.

The tools developed by RuleQuest run on a variety of platforms:

Windows: 95, 98, NT 4.0 or later
Unix: Sun Solaris 2.5 or later, SGI Irix, Linux

PRODUCTS

RuleQuest offers a wide range of products:

- See5 / C5.0
- Cubist
- Magnum Opus
- GritBot

EVALUATION OF THE PRODUCTS

See5 / C5.0 "A CLASSIFIER"

Introduction

"See5 is a state-of-the-art system that constructs classifiers in the form of decision trees and rule sets." See5/C5.0 has been designed to operate on large databases and incorporates innovations such as boosting. The products See5 and C5.0 are analogous: the former runs on Windows 95/98/NT and C5.0 is its Unix counterpart. See5 and C5.0 are sophisticated data mining tools for discovering patterns that delineate categories, assembling them into classifiers, and using them to make predictions.

The major features of See5/C5.0 are:

- See5/C5.0 has been designed to analyze substantial databases containing thousands to hundreds of thousands of records and tens to hundreds of numeric or nominal fields.
- To maximize interpretability, See5/C5.0 classifiers are expressed as decision trees or sets of if-then rules, forms that are generally easier to understand than neural networks.
- See5/C5.0 is easy to use and does not presume advanced knowledge of statistics or machine learning.
- RuleQuest provides C source code so that classifiers constructed by See5/C5.0 can be embedded in an organization's own systems.

Operational Details of See5

In order to work with See5 we need to follow a number of conventions. The following points explain the whole process:
a) Preparing Data for See5
b) User Interface
c) Constructing Classifiers
d) Using Classifiers
e) Cross-Referencing Classifiers and Data
f) Generating Classifiers in Batch Mode
g) Linking to Other Programs

a) Preparing Data for See5

See5 is a tool that analyzes data to produce decision trees and/or rulesets that relate a case's class to the values of its attributes. An application is a collection of text files. These files define classes and attributes, describe the cases to be analyzed, provide new cases to test the classifiers produced by See5, and specify misclassification costs or penalties.

Every See5 application has a short name called a filestem; for example, a credit data set might have the filestem credit. All files read or written by See5 for an application have names of the form filestem.extension, where filestem identifies the application and extension describes the contents of the file. File names are case sensitive.

See5 requires a number of files to be available in order to classify a data set. The files must follow these conventions:

- Names file (essential)
- Data file (essential)
- Test and cases files (optional)
- Costs file (optional)

Names File

The first essential file is the names file (e.g. credit.names), which describes the attributes and classes. Attributes are distinguished in two ways:

Discrete/Continuous/Label/Date/Ignore
A discrete attribute has a value drawn from a set of nominal values, a continuous attribute has a numeric value, and a label attribute serves only to identify a particular case. The ignore type tells See5 to disregard the attribute's value during classification.

Example: in a credit information data set,
- Amount spent would be continuous
- Sex of the customer would be discrete
- Date of joining would be a date attribute
- ID number would be a label
- Bank name would be ignored

Explicit/Implicit
The value of an explicitly defined attribute is given directly in the data, while the value of an implicitly defined attribute is specified by a formula. An example of an implicit attribute would be the status of a customer: if dues = 0 and payment = ontime, then status = good. Here the attribute status depends on the attributes payment and dues.

Example names file:

status.                                                  | the target attribute
Age: ignore.                                             | the age of the customer
Sex: m, f.
Lastmonthsbalance: continuous.
Thismonthsbalance: continuous.
Totalbalance := lastmonthsbalance + thismonthsbalance.
Paymentdue: true, false.
Status: excellent, good, average, poor.
Creditcardno: label.

The conventions can be summarized as follows:

EXPLICIT attributes    Attribute name:  TYPE       | comment
IMPLICIT attributes    Attribute name:= FORMULA    | comment

There are six possible types of value:

- continuous: the attribute takes numeric values.
- date: the attribute's values are dates in the form YYYY/MM/DD, e.g. 1999/09/30.
- a comma-separated list of names: the attribute takes discrete values, and these are the allowable values. The values may be prefaced by [ordered] to indicate that they are given in a meaningful ordering; otherwise they will be taken as unordered.
- discrete N for some integer N: the attribute has discrete, unordered values, but the values are assembled from the data itself; N is the maximum number of such values.
- ignore: the values of the attribute should be ignored.
- label: this attribute contains an identifying label for each case.
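To make the conventions above concrete, the following is a minimal sketch (not part of See5 itself) of how a simplified names file of this kind could be read. It assumes one entry per line, '|' comments, a trailing period, and only the basic attribute types; the path credit.names is used purely for illustration.

# Minimal, illustrative parser for a simplified See5-style names file.
# Assumptions: the first non-comment entry names the target; every entry
# fits on one line and ends with a period; 'discrete N' is not handled.

def parse_names(path):
    entries = []
    with open(path) as f:
        for raw in f:
            line = raw.split("|", 1)[0].strip()          # drop comments after '|'
            if line:
                entries.append(line.rstrip("."))
    target, attr_lines = entries[0], entries[1:]

    attributes = {}
    for line in attr_lines:
        if ":=" in line:                                  # implicitly-defined attribute
            name, formula = (s.strip() for s in line.split(":=", 1))
            attributes[name] = ("formula", formula)
        else:                                             # explicitly-defined attribute
            name, spec = (s.strip() for s in line.split(":", 1))
            if spec in ("continuous", "date", "ignore", "label"):
                attributes[name] = (spec, None)
            else:                                         # comma-separated discrete values
                attributes[name] = ("discrete", [v.strip() for v in spec.split(",")])
    return target, attributes

# Hypothetical usage:
# target, attrs = parse_names("credit.names")
# print(target)        # 'status'
# print(attrs["Sex"])  # ('discrete', ['m', 'f'])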
Data File

The second essential file, the application's data file (e.g. credit.data), provides information on the training cases from which See5 will extract patterns. The entry for each case consists of one or more lines that give the values for all explicitly-defined attributes. If the classes are listed in the first line of the names file, the attribute values are followed by the case's class value. If an attribute value is not known, it is replaced by a question mark '?'. Values are separated by commas and the entry is optionally terminated by a period. Once again, anything on a line after a vertical bar '|' is ignored.

Example:

31,m,30.5,300,330.5,true,good,0001
23,f,333,22,355,false,average,0222

Test and Cases Files (optional)

The third kind of file used by See5 consists of new test cases (e.g. credit.test) on which the classifier can be evaluated. This file is optional and, if used, has exactly the same format as the data file. Another optional file, the cases file (e.g. credit.cases), differs from a test file only in allowing the cases' classes to be unknown. The cases file is used primarily with the cross-referencing procedure and the public source code.

Costs File (optional)

The last kind of file, the costs file (e.g. credit.costs), is also optional and sets out differential misclassification costs. In some applications there is a much higher penalty for certain types of mistakes.

b) User Interface in See5

The toolbar icons are used as follows:

- Locate Data: invokes a browser to find the files for your application, or to change the current application.
- Construct Classifier: selects the type of classifier to be constructed and sets other options.
- Stop: interrupts the classifier-generating process.
- Review Output: re-displays the output from the last classifier construction (if any).
- Use Classifier: interactively applies the current classifier to one or more cases.
- Cross-Reference: maps between the training data and the classifiers constructed from it.

c) Constructing a Classifier

STEP 1: Locate the data file using the Locate Data button on the toolbar.
STEP 2: Click on the Construct Classifier button on the toolbar. A dialog window is displayed; select the necessary options (they are explained below) and construct the classifier.
STEP 3: Use the Use Classifier and Cross-Reference facilities for more detailed classification.

Options available for constructing the classifier:

Rule sets
Rules can be listed by class or by their importance to classification accuracy. If the latter utility ordering is selected, the rules are grouped into a number of bands. Errors and costs are reported individually for the first band, the first two bands, and so on.

Boosting
By default See5 generates a single classifier. The Boosting option causes a number of classifiers to be constructed; when a case is classified, all these classifiers are consulted before a decision is made. Boosting often gives higher predictive accuracy at the expense of increased classifier construction time.

Subset
By default, See5 deals separately with each value of an unordered discrete attribute. If the Subset option is chosen, these values are grouped into subsets.

Sample and Lock Sample
If the data are very numerous, the Sampling option may be useful. This causes only the specified percentage of the cases in the filestem.data file to be used for constructing the classifier. Resampling can be prevented using the Lock sample option.

Cross-Validate
The Cross-validate option can be used to estimate the accuracy of the classifier constructed by See5 even when there are no separate test cases. The data are split into a number of blocks equal to the chosen number of folds. Each block contains approximately the same number of cases and the same distribution of classes. For each block in turn, See5 constructs a classifier using the cases in all the other blocks and then tests its accuracy on the cases in the holdout block. In this way, each case in the data is used just once as a test case. The error rate of the classifier produced from all the cases is estimated as the ratio of the total number of errors on the holdout cases to the total number of cases. Since the classifiers constructed during a cross-validation use only part of the training data, no classifier is saved when this option is selected.
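The cross-validation estimate described above can be illustrated with a short sketch. This is not See5 code: the fold assignment is a simple random split (See5 also keeps the class distribution similar across blocks, which this sketch omits), and train and predict stand for any classifier-building and classification functions.

# Illustrative f-fold cross-validation error estimate (not See5 itself).

import random

def cross_validate(cases, classes, folds, train, predict, seed=0):
    indices = list(range(len(cases)))
    random.Random(seed).shuffle(indices)
    blocks = [indices[i::folds] for i in range(folds)]        # roughly equal-sized blocks

    errors = 0
    for holdout in blocks:
        holdout_set = set(holdout)
        train_idx = [i for i in indices if i not in holdout_set]
        model = train([cases[i] for i in train_idx],
                      [classes[i] for i in train_idx])
        for i in holdout:                                     # each case is a test case once
            if predict(model, cases[i]) != classes[i]:
                errors += 1
    return errors / len(cases)                                # estimated error rate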
Ignore Costs File

In applications with differential misclassification costs, it is sometimes desirable to see what effect the costs file is having on the construction of the classifier. If the Ignore costs file box is checked, See5 will construct a classifier as if all misclassification costs were the same.

Advanced Options

As the box proclaims, the remaining options are intended for advanced users who are familiar with the way See5 works. When a continuous attribute is tested in a decision tree, there are branches corresponding to the conditions attribute value <= threshold and attribute value > threshold, for some threshold chosen by See5. As a result, small movements in the attribute value near the threshold can change the branch taken from the test. The Fuzzy thresholds option softens this knife-edge behavior for decision trees by constructing an interval close to the threshold. Within this interval, both branches of the tree are explored and the results combined to give a predicted class. Note: fuzzy thresholds do not affect the behavior of rulesets.

Example of See5 output when the defaults are used (the credit application):

See5 [Release 1.11]    Mon Feb 21 14:23:56 2000

** This demonstration version cannot process **
**  more than 200 training or test cases.    **

Read 200 cases (15 attributes) from credit.data

Decision tree:

A15 > 225: + (81/2)
A15 <= 225:
:...A10 = t: + (60/14)
    A10 = f:
    :...A5 = gg: - (0)
        A5 = p:
        :...A14 <= 311: - (12)
        :   A14 > 311: + (3)
        A5 = g:
        :...A7 = h: + (11)
            A7 = j: - (1)
            A7 in {n,z,dd,ff,o}: + (0)
            A7 = bb:
            :...A12 = t: - (5)
            :   A12 = f: + (2)
            A7 = v:
            :...A15 > 50: + (2)
                A15 <= 50:
                :...A14 <= 102: + (5)
                    A14 > 102: - (18/5)

Evaluation on training data (200 cases):

    Decision Tree
    ----------------
    Size      Errors

      13   21(10.5%)   <<

     (a)   (b)    <- classified as
    ----  ----
     148     5    (a): class +
      16    31    (b): class -

** This demonstration version cannot process **
**  more than 200 training or test cases.    **

Evaluation on test data (200 cases):

    Decision Tree
    ----------------
    Size      Errors

      13   75(37.5%)   <<

     (a)   (b)    <- classified as
    ----  ----
      82     8    (a): class +
      67    43    (b): class -

Time: 0.1 secs

The first line identifies the version of See5 and the run date. See5 constructs a decision tree from the 200 training cases in the file credit.data, and this appears next. The last section of the See5 output concerns the evaluation of the decision tree, first on the cases in credit.data from which it was constructed, and then on the new cases in credit.test. The size of the tree is its number of leaves, and the column headed Errors shows the number and percentage of cases misclassified. The tree, with 13 leaves, misclassifies 21 of the 200 given cases, an error rate of 10.5%. Performance on these cases is further analyzed in a confusion matrix that pinpoints the kinds of errors made.
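As an illustration of what the confusion matrix summarizes, the short sketch below tallies a matrix and error rate from lists of actual and predicted classes. It is a generic toy calculation, independent of See5 itself.

# Toy calculation of the error rate and confusion matrix that See5 reports.

from collections import Counter

def confusion_matrix(actual, predicted):
    classes = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))           # (actual, predicted) -> count
    matrix = [[counts[(a, p)] for p in classes] for a in classes]
    errors = sum(a != p for a, p in zip(actual, predicted))
    return classes, matrix, errors / len(actual)

# In the training evaluation above, 148 + 31 cases are classified correctly
# and 5 + 16 = 21 are misclassified, giving the reported 10.5% error rate.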
d) Using Classifiers

Once a classifier has been constructed, an interactive interpreter can be used to assign new cases to classes. The Use Classifier button invokes the interpreter, using the most recent classifier for the current application, and prompts for information about the case to be classified. Since the values of all attributes may not be needed, the attribute values requested will depend on the case itself. When all the relevant information has been entered, the most probable class (or classes) are shown, each with a certainty value.

e) Cross-Referencing Classifiers and Data

Complex classifiers, especially those generated with the boosting option, can be difficult to understand. See5 incorporates a unique facility that links data and the relevant sections of (possibly boosted) classifiers. The Cross-Reference button brings up a window showing the most recent classifier for the current application and how it relates to the cases in the data, test or cases file. (If more than one of these is present, a menu will prompt you to select the file.)

f) Generating Classifiers in Batch Mode

The See5 distribution includes a program See5X that can be used to produce classifiers non-interactively. This console application resides in the same folder as See5 (usually C:\Program Files\See5) and is invoked from an MS-DOS Prompt window. The command to run the program is

See5X -f filestem parameters

where the parameters enable one or more of the options discussed above to be selected:

-s          use the Subset option
-r          use the Ruleset option
-b          use the Boosting option with 10 trials
-t trials   ditto with a specified number of trials
-S x        use the Sampling option with x%
-I seed     set the sampling seed value
-c CF       set the Pruning CF value
-m cases    set the Minimum cases
-p          use the Fuzzy thresholds option
-e          ignore any costs file
-h          print a summary of the batch mode options

If desired, output from See5 can be diverted to a file in the usual way. As an example, typing the commands

cd "C:\Program Files\See5"
See5X -f Samples\anneal -r -b >save.txt

in an MS-DOS Prompt window will generate a boosted ruleset classifier for the anneal application in the Samples directory, leaving the output in file save.txt.

g) Linking to Other Programs

The classifiers generated by See5 are retained in binary files, filestem.tree for decision trees and filestem.rules for rulesets. Public C source code is available to read these classifier files and to use them to make predictions. Using this code, it is possible to call See5 classifiers from other programs. As an example, the source includes a program to read cases from a cases file and to show how each case is classified by boosted or single trees or rulesets.
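The officially supported route for embedding See5 classifiers in other software is the C source code mentioned above, but for simple automation the batch program can also be driven from a script. The sketch below merely wraps the See5X command shown earlier; the installation path and the anneal sample filestem are taken from that example and are otherwise assumptions.

# Driving the See5X batch program from a script (illustration only; the
# supported embedding route is RuleQuest's C source code).

import os
import subprocess

SEE5_DIR = r"C:\Program Files\See5"        # assumed default installation folder

def run_see5x(filestem, extra_args=(), output_file="save.txt"):
    cmd = [os.path.join(SEE5_DIR, "See5X"), "-f", filestem, *extra_args]
    with open(output_file, "w") as out:
        subprocess.run(cmd, cwd=SEE5_DIR, stdout=out, check=True)

# Equivalent to the batch example above: a boosted ruleset for the anneal sample.
# run_see5x(r"Samples\anneal", ["-r", "-b"])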
Cubist "A REGRESSOR"

Introduction

"Cubist produces rule-based models for numerical prediction." Each rule specifies the conditions under which an associated multivariate linear sub-model should be used. The result is a powerful piecewise-linear model. Data mining is all about extracting patterns from an organization's stored or warehoused data. These patterns can be used to gain insight into aspects of the organization's operations, and to predict outcomes for future situations as an aid to decision-making. Cubist builds rule-based predictive models that output values, complementing See5/C5.0, which predicts categories. For instance, See5/C5.0 might classify the yield from some process as "high", "medium", or "low", whereas Cubist would output a number such as 73%. (Statisticians call the first kind of activity "classification" and the second "regression".)

Cubist is a powerful tool for generating piecewise-linear models that balance the need for accurate prediction against the requirements of intelligibility. Cubist models generally give better results than those produced by simple techniques such as multivariate linear regression, while also being easier to understand than neural networks.

Important Features of Cubist

- Cubist has been designed to analyze substantial databases containing thousands of records and tens to hundreds of numeric or nominal fields.
- To maximize interpretability, Cubist models are expressed as collections of rules, where each rule has an associated multivariate linear model. Whenever a situation matches a rule's conditions, the associated model is used to calculate the predicted value.
- Cubist is available for Windows 95/98/NT and several flavors of Unix.
- Cubist is easy to use and does not presume advanced knowledge of statistics or machine learning.
- RuleQuest provides C source code so that models constructed by Cubist can be embedded in your organization's own systems.

Operational Details of Cubist

In order to work with Cubist we need to follow a number of conventions. The following points explain the whole process:

a) Preparing Data for Cubist
   - Application filestem
   - Names file
   - Data file
   - Test and cases files (optional)
b) User Interface
c) Constructing Models
   - Rule-based models
   - Composite models
   - Rule coverage
   - Extrapolation
   - Simplicity-accuracy tradeoff
   - Cross-validation trials
   - Sampling from large data sets
d) Cross-Referencing Models and Data
e) Generating Models in Batch Mode
f) Linking to Other Programs

a) Preparing Data for Cubist

The example used here is self-referential: it is a Cubist application that estimates the time Cubist itself requires to build a model. An application is a collection of text files that define attributes, describe the cases to be analyzed, and optionally provide new cases to test the models produced by Cubist.

Cubist's job is to find how to estimate a case's target value in terms of its attribute values. Cubist does this by building a model containing one or more rules, where each rule is a conjunction of conditions associated with a linear expression. The meaning of a rule is that, if a case satisfies all the conditions, then the linear expression is appropriate for predicting the target value. Cubist thus constructs a piecewise-linear model to explain the target value. As we will see, Cubist can also combine these models with instance-based (nearest-neighbor) models.

Application Filestem

Every Cubist application has a short name called a filestem; we will use the filestem logtime for this illustration. All files read or written by Cubist for an application have names of the form filestem.extension, where filestem identifies the application and extension describes the contents of the file. The case of letters in both the filestem and extension is important -- the file names APP.DATA, app.data, and App.Data are all different. It is important that the extensions are written exactly as shown below, otherwise Cubist will not recognize the files for your application.

A Cubist application consists of two mandatory files and two optional files:

Names file filestem.names (required)
This defines the application's attributes or features. One attribute, the target, contains the value to be predicted from the other attributes.

Data file filestem.data (required)
This provides information on the cases that will be analyzed by Cubist in order to produce a model.

Test file filestem.test (optional)
This file contains cases that are not used to produce a model, but are used instead to estimate the predictive accuracy of the model.

Additional test file filestem.cases (optional)
For use with the cross-referencing facility described below.

The principal file types written by Cubist are:

Results file filestem.out
This file contains details of the most recent model constructed for this application.

Plot file filestem.pred
This file contains case-by-case results for the most recent test cases, and is used to produce a scatter plot.

Binary model file filestem.model
For use by Cubist only.

The Names File

The names file filestem.names contains a series of entries defining attributes and their values. The file is free-format, with the exception that the vertical bar '|' causes the rest of that line to be skipped. Each entry is terminated with a period. Each name is a string of characters that does not contain commas, question marks, or colons. A period may be embedded in a name provided that it is not followed by a space. Embedded spaces are also permitted, but multiple whitespace characters are replaced by a single space.

Example of a names file:

| Comment: sample names file

goal.                               | the numeric attribute that contains the target
                                    | values to be predicted (in other words,
                                    | 'goal' is the dependent variable)

patient ID:     label.              | identifies this patient
age:            continuous.         | age is a number
height (cms):   continuous.         | so is height ..
weight (kg):    continuous.         | .. and weight
sex:            male, female.       | a discrete attribute
goal:           continuous.         | recommended weight

The first entry in the names file identifies the attribute that contains the target value to be modeled (the dependent variable). The rest of the file contains one entry for each attribute. Attributes are of two kinds: explicitly-defined attributes (specified by type, values, etc.) and implicitly-defined attributes (specified by formulas).

Explicitly-Defined Attributes

The entry for an explicitly-defined attribute begins with the attribute name followed by a colon, and then one of the following:

- 'ignore', indicating that this attribute should not be used in models;
- 'label', indicating that the value of this attribute is used only to identify particular cases;
- 'continuous', for attributes with numeric values;
- 'date', for attributes whose values take the form YYYY/MM/DD;
- 'discrete' followed by an integer N, instructing Cubist to assemble a list of up to N possible values that appear in the training cases; or
- a list of the allowable discrete values of the attribute, separated by commas. The list of values can be prefaced by '[ordered]' to indicate that the values are given in a meaningful order -- Cubist can exploit this information to produce more sensible groupings of values.

The entry for each attribute is terminated by a period. An attribute of type 'label' can be useful for allowing particular cases to be identified easily, especially in cross-referencing (see below). If more than one label attribute appears, only the last is used.

Implicitly-Defined Attributes

An implicitly-defined attribute is one whose value is calculated from the values of previously-defined attributes. An entry for an implicitly-defined attribute has the form

attribute name := formula .

The formula is written in the usual way, using parentheses where needed, and may refer to any attribute defined before this one. Constants in the formula can be numbers, discrete attribute values enclosed in string quotes (e.g. "small"), and dates. The operators and functions that can be used in formulas are:

- and, or
- +, -, *, /, % (modulus), ^ (power)
- >, >=, <, <=, =, !=, <> (the last two denoting not equal)
- log, exp, sin, cos, tan, int (integer part of)

Attributes defined in this way have values that are either numbers or the logical values true and false, depending on the formula. For example, x := a + b. defines a number, but x := a > b. defines a logical value. Dates YYYY/MM/DD are stored internally as the number of days from a base date, and so can appear in some formulas.
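To illustrate the behavior described above, here is a small sketch of how an implicitly-defined attribute could be computed from explicit values, including the rule that an unknown input makes the result unknown. The formula handling is deliberately simplified: the expressions are written as Python functions rather than parsed from the names-file syntax.

# Simplified illustration of implicitly-defined attributes: the value is
# computed from earlier attributes, and is unknown ('?') if any input is unknown.

UNKNOWN = "?"

def implicit(formula, *inputs):
    if any(v == UNKNOWN for v in inputs):
        return UNKNOWN                     # formula cannot be evaluated
    return formula(*inputs)

# x := a + b.   (numeric result)
total = implicit(lambda a, b: a + b, 30.5, 300)        # -> 330.5

# x := a > b.   (logical result, true/false)
flag = implicit(lambda a, b: a > b, 30.5, UNKNOWN)     # -> '?'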
The Data File

The data file filestem.data contains the cases that will be analyzed in order to produce the model. Each case is represented by a separate entry in the file; the order of cases in the data file is not important. A case entry consists of the values of the explicitly-defined attributes in the same order that they appear in the names file. All values are separated by commas and the entry is terminated with a period. If the case has unknown values for one or more of the attributes, those values are represented by a question mark.

Note that there are no entries for implicitly-defined attributes, since these are computed from the values of other attributes using the given formula. If one or more of the previously-defined attributes has an unknown value, so that the formula cannot be evaluated, the value of the implicitly-defined attribute is also unknown. As with the names file, the rest of a line after a vertical bar is ignored, and multiple whitespace characters are treated as a single space.

Example of a data file:

| Comment: sample data file

00121, 23, 153, 95, male,   65.    | first case (needs to lose weight)
02002, 47, 183, 70, male,   75.    | second case (about ok)
00937,  ?, 157, 54, female, 60.    | notice missing age
01363, 33, 165, 64, ?,      65.    | sex not recorded

Test and Cases Files (optional)

Of course, the value of predictive models lies in their ability to make accurate predictions! It is difficult to judge the accuracy of a model by measuring how well it does on the cases used in its construction; the performance of the model on new cases is much more informative. The third kind of file used by Cubist is a test file of new cases (e.g. logtime.test) on which the model can be evaluated. This file is optional and, if used, has exactly the same format as the data file. Another optional file, the cases file (e.g. logtime.cases), has the same format as the data and test files. The cases file is used primarily with the cross-referencing procedure and the public source code, both of which are described later on.
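A minimal sketch of reading a data file in the format just described follows. It assumes only the simple conventions above (comma-separated values, an optional terminating period, '|' comments, and '?' for unknown values) and is not part of Cubist.

# Minimal reader for a Cubist/See5-style data file (illustration only).

def read_data(path):
    cases = []
    with open(path) as f:
        for raw in f:
            line = raw.split("|", 1)[0].strip()         # drop comments
            if not line:
                continue
            line = line.rstrip(".")                      # optional terminating period
            values = [v.strip() for v in line.split(",")]
            cases.append([None if v == "?" else v for v in values])
    return cases

# Hypothetical usage:
# for case in read_data("logtime.data"):
#     print(case)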
b) User Interface

The main window of Cubist is displayed when the program is started.

c) Constructing Models

The basic steps to construct a model are as follows.

STEP 1: Locate the Application's Data

The first step in constructing a model for an application is to locate that application's data files on your computer. All these files must be together in one folder, although the same folder may contain files for several applications. The left-most button on the toolbar invokes a standard specify-or-browse dialog to find a file whose name is filestem.data for some filestem. Identifying this file provides the application filestem, and Cubist then looks in the folder for other files related to this application. After the application's data has been located, the Edit menu allows the .names file to be edited using Wordpad.

STEP 2: Select Options

Once the application has been identified, Cubist can then be used to construct a model. Several options can be used to influence this process; they are specified in a dialog box and described below.

Rule-based piecewise-linear models

When Cubist is invoked with the default values of all options, it constructs a rule-based model and produces output like this:

Cubist [Release 1.07]    Thu Aug 12 14:51:21 1999
---------------------

    Target attribute `log(cpu time)'

Read 162 cases (10 attributes) from logtime.data

Model:

  Rule 1: [4 cases, mean -2.000, range -2 to -2, est err 0.000]

    if
        discrete atts > 7
        instances option in {force inst, allow inst}
        total vals <= 10560
    then
        log(cpu time) = -2

Evaluation on training data (162 cases):

    Average  |error|             0.097
    Relative |error|             0.14
    Correlation coefficient      0.99

Evaluation on test data (162 cases):

    Average  |error|             0.126
    Relative |error|             0.17
    Correlation coefficient      0.99

Time: 0.0 secs

The 'standard' Cubist model consists of a set of rules, each with an associated linear model. The value of a case is predicted by finding the rules that apply to it, calculating the corresponding values from the rules' linear models, and averaging these values. The Rules alone option builds standard models in this form.

Another way of assessing the accuracy of the predictions is through a visual inspection of a scatter plot that graphs the real target values of new cases against the values predicted by the model. When a file filestem.test of test cases is present, Cubist also provides a scatter plot window.

Composite models

Alternatively, Cubist can generate a special form of composite instance-based and rule-based model. As in a conventional instance-based (or nearest-neighbor) model, the target value for a case is found by identifying the most similar cases in the training data. Instead of simply averaging the known target values for these neighbors, however, Cubist uses an adjusted value for each neighbor case. Suppose that x is the case whose unknown target value is to be predicted, and n is a neighbor with target value T(n). If the rule-based model predicts values M(x) and M(n) for x and n respectively, then the model predicts that the target value of x will be higher than the target value of n by an amount M(x) - M(n). Cubist therefore uses the value T(n) + M(x) - M(n) as the adjusted value associated with the neighbor n.

If the second model form option, Instances and rules, is chosen in the dialog box, Cubist uses composite models like this:

Evaluation on training data (162 cases):

    Average  |error|             0.100
    Relative |error|             0.14
    Correlation coefficient      0.99

Evaluation on test data (162 cases):

    Average  |error|             0.120
    Relative |error|             0.16
    Correlation coefficient      0.99

The third option, Let Cubist choose, allows Cubist to decide, from analysis of the training cases in filestem.data, which model form seems likely to give more accurate predictions.
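The neighbor adjustment described above can be written out in a few lines. The sketch below is a simplified illustration of the idea, not Cubist's implementation: rule_model stands for any function that returns the rule-based prediction M(.), and the neighbors are assumed to have been found already.

# Illustrative composite (instances-and-rules) prediction: each neighbor's
# known target T(n) is adjusted by the rule-based model difference M(x) - M(n).

def composite_predict(x, neighbors, rule_model):
    """neighbors: list of (case, target_value) pairs already judged similar to x."""
    adjusted = [t_n + rule_model(x) - rule_model(n) for n, t_n in neighbors]
    return sum(adjusted) / len(adjusted)

# Hypothetical usage with a toy rule-based model M:
# M = lambda case: 2.0 * case["size"]
# composite_predict({"size": 5}, [({"size": 4}, 7.9), ({"size": 6}, 12.1)], M)  # -> 10.0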
Rule coverage

The dialog box contains a field for specifying the (approximate) minimum case cover for any rule, as a percentage of the number of training cases. That is, the conditions associated with any rule should be satisfied by at least the specified percentage of all training cases.

Extrapolation

The extrapolation parameter controls the extent to which predictions made by Cubist's linear models can fall outside the range of values seen in the training data. Extrapolation is inherently more risky than interpolation, where predictions must lie between the lowest and highest observed values.

Simplicity-accuracy tradeoff

Cubist employs heuristics that try to simplify models without substantially reducing their predictive accuracy. These heuristics are influenced by a parameter called the brevity factor, whose values range from 0 to 100%. The default value is 50%; lower values tend to produce more detailed models that are sometimes (but not always!) slightly more accurate than those generated by default, while higher values tend to produce fewer rules, usually with a penalty in predictive accuracy.

Cross-validation trials

As we saw earlier, the predictive accuracy of a model constructed from the cases in a data file can be estimated from its performance on new cases in a test file. This estimate can be rather erratic unless there are large numbers of cases in both files. If the cases in logtime.data and logtime.test were shuffled and divided into new 162-case training and test sets, Cubist would probably construct a different model whose accuracy on the test cases might vary considerably.

One way to get a more reliable estimate of predictive accuracy is by f-fold cross-validation. The cases in the data file are divided into f blocks of roughly the same size and target value distribution. For each block in turn, a model is constructed from the cases in the remaining blocks and tested on the cases in the hold-out block. In this way, each case is used just once as a test case. The accuracy of a model produced from all the cases is estimated by averaging results on the hold-out cases. Running a 10-fold cross-validation using the option that allows Cubist to choose the model type gives results on the hold-out cases similar to these:

Summary:

    Average  |error|             0.135
    Relative |error|             0.20
    Correlation coefficient      0.98

Sampling from large datasets

If the data are very numerous, the Sampling option may be useful. This causes only the specified percentage of the cases in the filestem.data file to be used for constructing the model. This option automatically causes the model to be tested on a disjoint sample of cases from the filestem.data file. By default, the samples are redrawn every time a model is constructed, so the models built with the sampling option will usually change too. This resampling can be prevented using the Lock sample option; the sample is then unchanged until a different application is loaded, the sample percentage is altered, the option is unselected, or Cubist is restarted.

d) Cross-Referencing Models and Data

The Cross-validate option can be used to estimate the accuracy of the model constructed by Cubist without requiring separate test cases. The data are split into a number of blocks equal to the chosen number of folds (default 10). Each block contains approximately the same number of cases and the same distribution of target values. For each block in turn, Cubist constructs a model using the cases in all the other blocks and then evaluates the accuracy of this model on the cases in the hold-out block. In this way, each case in filestem.data is used just once as a test case. (The results for a cross-validation depend on the way that the cases are divided into blocks. This division changes each time a cross-validation is carried out, so don't be surprised if the results are not identical to the results from previous cross-validations with the same data.) Since models constructed during cross-validation use only part of the training data, no model is saved when this option is selected.
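The evaluation figures quoted in the Cubist output above (average |error|, relative |error|, and the correlation coefficient) can be computed as sketched below. Treating relative |error| as the average |error| divided by the error of always predicting the mean target value is an assumption made here for illustration.

# Illustrative computation of the evaluation measures reported by Cubist.
# Assumption: relative |error| compares the model against always predicting
# the mean of the actual target values.

from statistics import mean, pstdev

def evaluation(actual, predicted):
    avg_err = mean(abs(a - p) for a, p in zip(actual, predicted))
    baseline = mean(abs(a - mean(actual)) for a in actual)
    rel_err = avg_err / baseline
    # Pearson correlation between actual and predicted values
    ma, mp = mean(actual), mean(predicted)
    cov = mean((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    corr = cov / (pstdev(actual) * pstdev(predicted))
    return avg_err, rel_err, corr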
e) Generating Models in Batch Mode

The Cubist distribution includes a program CubistX that can be used to produce models non-interactively. This console application resides in the same folder as Cubist (usually C:\Program Files\Cubist) and is invoked from an MS-DOS Prompt window. The command to run the program is

CubistX -f filestem parameters

where the parameters enable one or more of the options discussed above to be selected:

-i          definitely use composite models
-a          allow the use of composite models
-S percent  use the sampling option
-m percent  set the minimum cases option
-e percent  set the extrapolation limit
-b percent  set the brevity factor

If desired, output from Cubist can be diverted to a file in the usual way. As an example, executing

cd "C:\Program Files\Cubist"
CubistX -f Samples\boston -S 80 -a >save.txt

from an MS-DOS Prompt window focuses on the boston application in the Samples directory, uses a sample of 80% of the cases for constructing the model (and tests it on the remaining 20%), allows Cubist to build composite models, and leaves the output in file save.txt.

f) Linking to Other Programs

The models generated by Cubist are retained in binary files filestem.model. C source code to read these model files and to use them to make predictions is freely available. Using this code, it is possible to call Cubist models from other programs. As an example, the source includes a program to read a cases file and to print the predicted value for each case using the most recent model.

Magnum Opus "AN ASSOCIATION TOOL"

INTRODUCTION

"Magnum Opus is an exciting new tool for finding associations" ("if-then rules" that highlight interrelationships among attributes). Magnum Opus uses the highly efficient OPUS search algorithm for fast association rule discovery. The system is designed to handle large data sets containing millions of cases, but its capacity to do so will be constrained by the amount of RAM available.

The user can choose between two measures of the importance of an association: leverage or lift. The user specifies the maximum number of association rules to be found, and can place restrictions on the association rules to be considered. Within those restrictions, Magnum Opus finds the non-trivial associations with the highest values on the specified measure of importance. Magnum Opus will only find fewer than the specified number of association rules if the search is terminated by the user or there are fewer than the specified number of non-trivial associations that satisfy the user-specified constraints.

General information about the data must be described in a names file. The data must be stored in a data file. The association rules found by the system are recorded in an output file.

The tutorial data used below represents the kind of data that might be collected by a supermarket about its customers. It is contained in two files: tutorial.nam and tutorial.data. The first is a names file and the second is a data file. The names file describes the attributes recorded in the data file. The data file contains the values of those attributes for each case.
Example of a names file (tutorial.nam):

Profitability99 <= 418 < 895
Profitability98 <= 326 < 712
Spend99 <= 2025 < 4277
Spend98 <= 1782 < 4027
NoVisits99 <= 35 < 68
NoVisits98 <= 31 < 64
Dairy <= 214 < 515
Deli <= 207 < 554
Bakery <= 209 < 558
Grocery <= 872 < 2112
SocioEconomicGroup: A, B, C, D1, D2, E
Promotion1: t, f
Promotion2: t, f

Most of these attributes are numeric. These numeric attributes have been divided into three sub-ranges, each of which contains approximately the same number of cases. The profitability attributes represent the profit made from a customer in 1999 and 1998. The spend attributes represent the total spend by a customer in each year. The NoVisits attributes represent the number of store visits in each year. The Dairy, Deli, Bakery, and Grocery attributes record the customer's total spend in each of four significant departments. The remaining three attributes are categorical. The SocioEconomicGroup attribute records an assessment of the customer's socio-economic group. The final two attributes record whether the customer participated in each of two store promotions.

Basic Terminology in Magnum Opus

Association rule
An association rule identifies a combination of attribute values that occur together with greater frequency than might be expected if the values were independent of one another. A Magnum Opus association rule has two parts, a left-hand side (LHS) and a right-hand side (RHS). The LHS is a set of one or more attribute values. The RHS is a single attribute value. Each association rule indicates that the frequency with which the RHS occurs in the data is higher among cases that have the LHS attribute values than among those that do not.

Case
Cases are the basic unit of analysis considered by Magnum Opus. A case represents a single entity. Examples of the kind of thing that might be used as a case include customers, transactions, and states of a dynamic process. Each case is described by a set of values for a number of attributes. The same attributes are used to describe all cases for a single analysis. The attributes for the analysis are defined in a names file. The cases are described in a data file.

Coverage
The coverage of an association rule is the proportion of cases in the data that have the attribute values specified on the left-hand side of the rule. The total number of cases that this represents is indicated in brackets. For example, suppose that there are 1000 cases and the LHS covers 200 cases. The coverage is 200/1000 = 0.2. The total number of cases, displayed in brackets, is 200.

Association rule strength
The strength of an association rule is the proportion of cases covered by the LHS of the rule that are also covered by the RHS. For example, suppose that the LHS covers 100 cases and the RHS covers 50 of the cases covered by the LHS. The strength is 50/100 = 0.5.

Association rule lift
The lift of an association rule is the strength divided by the proportion of all cases that are covered by the RHS. This is a measure of the importance of the association that is independent of coverage. For example, suppose that there are 1000 cases, the LHS covers 100 cases, the RHS covers 100 cases, and the RHS covers 50 of the cases covered by the LHS. The strength is 50/100 = 0.5. The proportion of all cases covered by the RHS is 100/1000 = 0.1. The lift is 0.5/0.1 = 5.

Association rule leverage
The leverage of an association rule is the proportion of additional cases covered by both the LHS and RHS above those that would be expected if the LHS and RHS were independent of each other. This is a measure of the importance of the association that takes into account both the strength and the coverage of the rule. The total number of cases that this represents is presented in brackets following the leverage. For example, suppose that there are 1000 cases, the LHS covers 200 cases, the RHS covers 100 cases, and the RHS covers 50 of the cases covered by the LHS. The proportion of cases covered by both the LHS and RHS is 50/1000 = 0.05. The proportion of cases that would be expected to be covered by both the LHS and RHS if they were independent of each other is (200/1000) * (100/1000) = 0.02. The leverage is 0.05 - 0.02 = 0.03. The total number of cases that this represents is 30.
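The four measures just defined can be captured in a few lines of code. The sketch below is purely illustrative (it works from raw counts, not from Magnum Opus itself) and reproduces the worked numbers above.

# Illustrative computation of the Magnum Opus rule measures from raw counts.

def rule_measures(n_cases, n_lhs, n_rhs, n_both):
    coverage = n_lhs / n_cases                 # proportion of cases matching the LHS
    strength = n_both / n_lhs                  # of those, proportion also matching the RHS
    lift     = strength / (n_rhs / n_cases)    # strength relative to the RHS base rate
    leverage = n_both / n_cases - (n_lhs / n_cases) * (n_rhs / n_cases)
    return coverage, strength, lift, leverage

# The leverage example above: 1000 cases, LHS 200, RHS 100, both 50.
print(rule_measures(1000, 200, 100, 50))
# -> coverage 0.2, strength 0.25, lift 2.5, leverage 0.03 (approximately)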
Analysis Process

a) Starting Magnum Opus
b) Open the names file
c) Select options for search by leverage, and execute the search
d) View the output of the search by leverage
e) Dissection of an association rule
f) Run the search by lift, and output from the search by lift
g) Run the search by lift with minimum coverage
h) Conclusion

a) Starting Magnum Opus

Magnum Opus is started from the Start menu. Initially, the main window will be empty, indicating that no names file has yet been opened.

b) Open the names file

The names file contains the attributes associated with a particular application. When the names file described above is opened in Magnum Opus, note the following elements of the initial screen:

- At the top, the currently selected search mode is displayed, either search by leverage or search by lift.
- Next, the currently selected names and paths for the names, data, and output files are displayed.
- Below these appear the currently selected values for the number of attributes allowed on the LHS of an association, the maximum number of associations to be returned by the search, the minimum leverage for an association, the minimum coverage for an association, and the minimum lift for an association.
- Finally, there are two list boxes containing lists of all of the available attribute values. Only those attribute values selected in these list boxes will be able to appear on the LHS and RHS of an association, respectively.

c) Select options for search by leverage, and execute the search

The following search by leverage uses two conditions: the search is limited to ten associations, and we want to find associations between the other attributes and profitability. We therefore limit the attribute values that may appear on the RHS of an association to the three values for Profitability99.

Once you have specified all of the settings for the search, click on the GO button to start the search. As the Search by Leverage mode is in effect, Magnum Opus will search for the associations with the highest values of leverage within the other constraints that have been specified.

d) Viewing the output

When the search is completed, the associations are written to the specified output file. The contents of this file are then displayed in an output window. The output file lists the names of the files that described the data and the settings that were used for the search. These are followed by a list of the association rules that were found. The final line of the output summarizes the number of seconds taken to complete the search, the number of association rules found, and the number of cases employed.

The output file for our example:

Magnum Opus - Turning Data to Knowledge.
Copyright (c) 1999 G. I. Webb & Associates Pty Ltd.
Sun Oct 03 18:07:40 1999
Namesfile: C:\Tutorial\tutorial.nam
Datafile: C:\Tutorial\tutorial.data

Search by leverage
Maximum number of attributes on LHS = 4
Maximum number of associations = 10
Minimum leverage = 0.001
Minimum coverage = 0.0
Minimum lift = 1.0
All values allowed on LHS
Values allowed on RHS:
  Profitability99<=418
  418<Profitability99<895
  Profitability99>=895

Sorted best association rules

Spend99<=2025 -> Profitability99<=418
  [Coverage=0.333 (333); Strength=0.907; Lift=2.72; Leverage=0.1911 (191)]
Spend99>=4277 -> Profitability99>=895
  [Coverage=0.335 (335); Strength=0.860; Lift=2.57; Leverage=0.1758 (175)]
Spend99<=2025 & Grocery<=872 -> Profitability99<=418
  [Coverage=0.278 (278); Strength=0.953; Lift=2.86; Leverage=0.1724 (172)]
Spend99<=2025 & NoVisits99<=35 -> Profitability99<=418
  [Coverage=0.276 (276); Strength=0.938; Lift=2.82; Leverage=0.1671 (167)]
Grocery<=872 -> Profitability99<=418
  [Coverage=0.333 (333); Strength=0.832; Lift=2.50; Leverage=0.1661 (166)]
Profitability98<=326 & Spend99<=2025 -> Profitability99<=418
  [Coverage=0.256 (256); Strength=0.961; Lift=2.89; Leverage=0.1608 (160)]

0 seconds for 10 association rules from 1000 examples

e) Dissection of an association rule

The first association rule in the example output file is the following:

Spend99<=2025 -> Profitability99<=418
  [Coverage=0.333 (333); Strength=0.907; Lift=2.72; Leverage=0.1911 (191)]

The left-hand side of this rule is presented before the -> arrow. The right-hand side follows the arrow and precedes the opening bracket [. This association rule indicates that cases for which Spend99 has a value less than or equal to 2025 are associated with cases for which Profitability99 has a value less than or equal to 418 more frequently than is the average for all cases. That is, the frequency of association between cases that satisfy the left- and right-hand sides is greater than normal. The following information about the association is displayed in brackets:

Coverage=0.333 (333)
The first value indicates the proportion of all cases that satisfy the LHS of the rule (Spend99<=2025). The value in brackets indicates the absolute number of cases that this represents: 333 cases satisfy the LHS, and recall that there are 1000 cases in the data set.

Strength=0.907
This indicates the proportion of those cases that satisfy the LHS that also satisfy the RHS. 333 cases satisfy the LHS and 302 of those also satisfy the RHS, so the strength is calculated as 302/333.

Lift=2.72
Lift is a measure of how much stronger than normal the association between the LHS and RHS is. Out of all the data, 333 cases satisfy the RHS (this is the same as the number satisfying the LHS, because each attribute was split on values that created three equal-sized partitions). Therefore the strength of association that would be expected if the LHS and RHS were independent of each other is 333/1000. Dividing the actual strength by this value we obtain (302/333)/(333/1000) = 2.72.

Leverage=0.1911 (191)
Leverage is a measure of the magnitude of the effect created by the association. This is the proportion of cases that exhibit the association in excess of those that would be expected if the LHS and RHS were independent of each other. If the two attribute values were independent of each other, the expected proportion of cases exhibiting both values would be the proportion exhibiting the LHS (0.333) times the proportion exhibiting the RHS (0.333), which is 0.1109. The proportion exhibiting the association is 0.302. The difference between these values is 0.1911.
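For what it is worth, the rule_measures sketch given earlier in the Cubist-style illustration reproduces the figures in this dissection (approximately, since the reported values are rounded):

# Reproducing the dissected rule's figures with the rule_measures sketch above.
print(rule_measures(1000, 333, 333, 302))
# -> coverage 0.333, strength 0.907, lift 2.72, leverage 0.191 (approximately)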
f) Run the search by lift, and output from the search by lift

First select Search by Lift mode by clicking the LIFT toolbar button. Next select a new output file name, so that the output from the last search is not overwritten. For the first search by lift we will use the same settings as were used for the search by leverage (find the best ten association rules only; allow only values for Profitability99 on the RHS). As these are already selected, we can proceed to starting the search by clicking on the GO button. The output of the search by lift is as follows:

Magnum Opus - Turning Data to Knowledge.
Copyright (c) 1999 G. I. Webb & Associates Pty Ltd.

Sun Oct 03 18:11:45 1999
Namesfile: C:\Tutorial\tutorial.nam
Datafile: C:\Tutorial\tutorial.data

Search by lift
Maximum number of attributes on LHS = 4
Maximum number of associations = 10
Minimum leverage = 0.001
Minimum coverage = 0.0
Minimum lift = 1.0
All values allowed on LHS
Values allowed on RHS:
  Profitability99<=418
  418<Profitability99<895
  Profitability99>=895

Sorted best associations

Profitability98<=326 & Dairy>=515 & Deli>=554 & SocioEconomicGroup=C -> 418<Profitability99<895
  [Coverage=0.006 (6); Strength=1.000; Lift=3.01; Leverage=0.0040 (4)]
Profitability98<=326 & Spend99>=4277 & Dairy>=515 & SocioEconomicGroup=C -> 418<Profitability99<895
  [Coverage=0.004 (4); Strength=1.000; Lift=3.01; Leverage=0.0027 (2)]
NoVisits99<=35 & Dairy>=515 & Deli>=554 -> 418<Profitability99<895
  [Coverage=0.004 (4); Strength=1.000; Lift=3.01; Leverage=0.0027 (2)]
Profitability98<=326 & Spend99>=4277 & Deli>=554 & SocioEconomicGroup=C -> 418<Profitability99<895
  [Coverage=0.003 (3); Strength=1.000; Lift=3.01; Leverage=0.0020 (2)]
Profitability98<=326 & NoVisits98>=64 & Dairy>=515 -> 418<Profitability99<895
  [Coverage=0.003 (3); Strength=1.000; Lift=3.01; Leverage=0.0020 (2)]
Profitability98<=326 & NoVisits98>=64 & Deli>=554 -> 418<Profitability99<895
  [Coverage=0.002 (2); Strength=1.000; Lift=3.01; Leverage=0.0013 (1)]
Profitability98<=326 & NoVisits99>=68 & Dairy>=515 & SocioEconomicGroup=C -> 418<Profitability99<895
  [Coverage=0.002 (2); Strength=1.000; Lift=3.01; Leverage=0.0013 (1)]
Spend98<=1782 & NoVisits99<=35 & Grocery>=2112 & Promotion1=f -> 418<Profitability99<895
  [Coverage=0.002 (2); Strength=1.000; Lift=3.01; Leverage=0.0013 (1)]
Spend98<=1782 & NoVisits98>=64 & Dairy>=515 -> 418<Profitability99<895
  [Coverage=0.002 (2); Strength=1.000; Lift=3.01; Leverage=0.0013 (1)]
Spend98<=1782 & NoVisits99>=68 & Dairy>=515 & SocioEconomicGroup=C -> 418<Profitability99<895
  [Coverage=0.002 (2); Strength=1.000; Lift=3.01; Leverage=0.0013 (1)]

3 seconds for 10 associations from 1000 examples

In this example the search has found only association rules that cover very small numbers of cases. When a rule has small coverage there is an increased probability of the observed lift being unrealistically high. Also, if coverage is small then the total impact (as indicated by leverage) will be low. To counter this, in the next step we will set a minimum coverage during the search; association rules that do not meet the coverage requirement will not be considered. Raising the minimum coverage has the secondary effect of reducing computation time, which may be desirable for large data sets.

Exit from viewing the output of the current search by clicking on the close button. You will be returned to the main window, where you can adjust the settings before starting another search.
g) Run the search by lift with minimum coverage

For the next example, set the value for Minimum coverage of interest to 0.05. This will ensure that association rules that cover less than 0.05 of the available cases (50 cases for the tutorial data) are discarded. Change the output file name so that the previous search results will not be overwritten.

h) Conclusion

Thus Magnum Opus is used to find association rules. Magnum Opus can also be used to perform market basket analysis. Basket analysis is one of the most common applications for association rules. In basket analysis the items jointly involved in a single transaction form the basic unit of analysis. A typical example is the contents of each shopping basket that passes through a retailer's checkout line. The purpose of basket analysis is to identify which combinations of items have the greatest affinity with each other.

One way to perform basket analysis with Magnum Opus is to declare an attribute for each item of analysis. In the retail shopping basket example, this would entail one attribute for each product of interest. For each attribute only one value need be declared. If the basket contains an item then the corresponding attribute has its single value set; otherwise the attribute is set to the missing value. The following are examples of hypothetical names and data files that support this approach to basket analysis.

Example.nam:

zippo_large: p
zippo_small: p
zappy_large: p
neato_compact: p
neato_standard: p

Example.data:

p,?,?,?,p
?,p,?,?,p
?,?,p,?,p
p,?,p,?,p
p,?,p,p,?

The names of the five attributes declared in example.nam are the names of the five products. The single value p for each attribute indicates that the corresponding item is present in the basket represented by a line of the data file. The first line indicates that the first basket contained only a zippo_large and a neato_standard.
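The encoding just described can be generated mechanically from transaction lists. The sketch below illustrates that preparation step; the product names and the transactions are the hypothetical ones from example.nam, and the helper itself is not part of Magnum Opus.

# Illustrative conversion of raw transactions into the one-attribute-per-item
# encoding described above (product names taken from the hypothetical example.nam).

PRODUCTS = ["zippo_large", "zippo_small", "zappy_large",
            "neato_compact", "neato_standard"]

def write_basket_files(transactions, names_path="example.nam", data_path="example.data"):
    with open(names_path, "w") as names:
        for product in PRODUCTS:
            names.write(f"{product}: p\n")             # single declared value
    with open(data_path, "w") as data:
        for basket in transactions:
            row = ["p" if product in basket else "?" for product in PRODUCTS]
            data.write(",".join(row) + "\n")

# The first basket below produces the line "p,?,?,?,p" shown above.
# write_basket_files([{"zippo_large", "neato_standard"},
#                     {"zippo_small", "neato_standard"}])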
The main concern with Magnum Opus is computing time. In general, the following measures will decrease compute time:

- Increase the minimum leverage of interest.
- Increase the minimum coverage of interest.
- Decrease the maximum number of attributes allowed on the LHS of an association.
- Decrease the maximum number of associations to be found.
- Decrease the number of attribute values allowed on the LHS and the RHS of an association.

GritBot "A DATA CLEANSING TOOL"

GritBot is a sophisticated data-cleansing tool that helps you to audit and maintain data quality. Working from the raw data alone, GritBot automatically explores partitions of the data that share common properties and reports surprising values in each partition. GritBot uncovers anomalies that might compromise the effectiveness of your data mining tools.

"Data mining" is a family of techniques for extracting valuable information from an organization's stored or warehoused data. Data-mining methods search for patterns and can be compromised if the data contain corrupted values that obscure these patterns. As the saying goes, "Garbage in, garbage out." GritBot is an automatic tool that tries to find anomalies in data as a precursor to data mining. It can be thought of as an autonomous data quality auditor that identifies records having "surprising" values of nominal (discrete) and/or numeric (continuous) attributes. Values need not stand out in the complete dataset -- GritBot searches for subsets of records in which the anomaly is apparent.

In one of the sample applications, GritBot identifies the ages of two women in their seventies as anomalous. Such ages are not surprising in the whole population, but they certainly are in this case because the women are noted as being pregnant.

Some important features:

- GritBot has been designed to analyze substantial databases containing tens or hundreds of thousands of records and large numbers of numeric or nominal fields.
- Every possibly anomalous value that GritBot identifies is reported, together with an explanation of why the value seems surprising.
- GritBot is virtually automatic -- the user does not require knowledge of statistics or data analysis.
- GritBot is available for Windows 95/98/NT and several flavors of Unix.