EVALUATION OF THE PRODUCTS OFFERED BY
RULEQUEST RESEARCH COMPANY
CS 595 Assignment 1
By Sivakumar Sundaramoorthy
INTRODUCTION
RuleQuest Research is an Australian company that develops data mining tools. The company provides some of the best data mining tools that help you transform data into knowledge.
The data mining tools that the company offers help in:
- Constructing decision trees and rule-based classifiers
- Building rule-based numerical models
- Finding association rules that reveal interrelationships
- Identifying data anomalies for data cleansing (a new venture)
The following companies use RuleQuest tools:
E-Merchandising from Blue Martini
The Blue Martini Customer Interaction System is the leading enterprise-scale Internet application for interacting live
with customers.
EPM from Broadbase
Broadbase applications power digital markets by analyzing customer data and using that information to execute
personalized interactions that drive revenue.
Clementine from ISL/SPSS
Clementine Server enables users to sift through huge volumes of data to discover valuable information - and turn it into powerful decision-making knowledge.
Decision Series from Accrue
Accrue software is the leading provider of Internet Enterprise software solutions for optimizing the effectiveness of
e-tail, retail and e-media initiatives.
Industrial systems from Parsytec
Parsytec AG is a software company, which evolved from the Technical University of Aachen and specializes in the analysis and evaluation of defects on high-speed production lines, for example in the steel, aluminum, paper, and plastics industries.
The tools developed by RuleQuest run on a variety of platforms:
Windows: 95, 98, NT 4.0 or later
Unix:    Sun Solaris 2.5 or later, SGI Irix, Linux
RuleQuest offers a range of products, which are as follows:
PRODUCTS
See5 / C5.0
Cubist
Magnum Opus
GritBot
EVALUATION OF THE PRODUCTS
See5 / C5.0
“A Classifier”
Introduction
“See5 is a state-of-the-art system that constructs classifiers in the form of decision trees and rule sets.”
See5/C5.0 has been designed to operate on large databases and incorporates innovations such as boosting. The products See5 and C5.0 are analogous: the former runs on Windows 95/98/NT, and C5.0 is its UNIX counterpart. See5 and C5.0 are sophisticated data mining tools for discovering patterns that delineate categories, assembling them into classifiers, and using them to make predictions.
The major features of See5/C5.0 are:
- See5/C5.0 has been designed to analyze substantial databases containing thousands to hundreds of thousands of records and tens to hundreds of numeric or nominal fields.
- To maximize interpretability, See5/C5.0 classifiers are expressed as decision trees or sets of if-then rules, forms that are generally easier to understand than neural networks.
- See5/C5.0 is easy to use and does not presume advanced knowledge of Statistics or Machine Learning.
- RuleQuest provides C source code so that classifiers constructed by See5/C5.0 can be embedded in an organization's own systems.
Operations Details of See 5
In order to work with See5, a number of conventions need to be followed. The following points explain the whole process:
a) Preparing Data for See5
b) User Interface
c) Constructing Classifiers
d) Using Classifiers
e) Cross-referencing Classifiers and Data
f) Generating Classifiers in Batch Mode
g) Linking to Other Programs
a) Preparing Data for See5
See5 is a tool that analyzes data to produce decision trees and/or rulesets that relate a case’s class to the values of its
attributes.
An application is a collection of text files. These files define classes and attributes, describe the cases to be
analyzed, provide new cases to test the classifiers produced by See5, and specify misclassification costs or penalties.
Every See5 application has a short name called a filestem; for example, a credit data set may have the filestem credit.
All files read or written by See5 for an application have names of the form filestem.extension, where filestem
identifies the application and extension describes the contents of the file. The file name is case sensitive.
See5 requires a number of files to be available in order for it to classify the data set. The files must follow these conventions:
- Names file (essential)
- Data file (essential)
- Test and cases files (optional)
- Costs file (optional)
Names File
The first essential file is the names file (e.g. credit.names) that describes the attributes and classes.
Attributes are characterized in two important ways:
Discrete / Continuous / Label / Date / Ignore
A discrete attribute has a value drawn from a set of nominal values, a continuous attribute has a numeric value, and a label attribute serves only to identify a particular case. The ignore type tells See5 to ignore the attribute's values during classification.
Example: In a credit information data set,
- Amount spent would be continuous.
- Sex of the customer would be discrete.
- Date of joining would be a date attribute.
- Id No would be a label.
- Bank name is ignored.
Explicit/Implicit
The value of an explicitly defined attribute is given directly in the data, while the value of an implicitly
defined attribute is specified by a formula.
Example of an implicit attribute would be the status of a customer
If the dues=0 and Payment = ontime then Status = Good
Here the attribute status depends on the attributes payment and dues.
Example Names File
status.                               | the target attribute
Age:                 ignore.          | the age of the customer
Sex:                 m, f.
Lastmonthsbalance:   continuous.
Thismonthsbalance:   continuous.
Totalbalance :=      lastmonthsbalance + thismonthsbalance.
Paymentdue:          true, false.
Status:              excellent, good, average, poor.
Creditcardno:        label.
The conventions can be noted as follows:
EXPLICIT attributes:   Attribute Name:    TYPE      | Comment
IMPLICIT attributes:   Attribute Name :=  FORMULA   | Comment
There are six possible types of value:
- continuous: the attribute takes numeric values.
- date: the attribute's values are dates in the form YYYY/MM/DD, e.g. 1999/09/30.
- a comma-separated list of names: the attribute takes discrete values, and these are the allowable values. The values may be prefaced by [ordered] to indicate that they are given in a meaningful ordering, otherwise they will be taken as unordered.
- discrete N for some integer N: the attribute has discrete, unordered values, but the values are assembled from the data itself; N is the maximum number of such values.
- ignore: the values of the attribute should be ignored.
- label: this attribute contains an identifying label for each case.
Data file
The second essential file, the application's data file (e.g. credit.data), provides information on the training
cases from which See5 will extract patterns. The entry for each case consists of one or more lines that give the
values for all explicitly-defined attributes. If the classes are listed in the first line of the names file, the attribute
values are followed by the case's class value. If an attribute value is not known, it is replaced by a question mark `?'.
Values are separated by commas and the entry is optionally terminated by a period. Once again, anything on a line
after a vertical bar `|' is ignored.
Example
31,m,30.5,300,true,good,0001
23,f,333,22,false,average,0222
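As an aside, here is a minimal Python sketch (not part of See5; the function name read_data_file is invented for this illustration) of how a data file following these conventions could be read:

def read_data_file(path):
    # Read a See5-style .data file: comma-separated values, '?' for unknown
    # values, '|' starting a comment, and an optional terminating period.
    cases = []
    with open(path) as f:
        for line in f:
            line = line.split('|', 1)[0].strip()       # drop comments
            if not line:
                continue
            line = line.rstrip('.')                    # optional terminating period (naive)
            values = [v.strip() for v in line.split(',')]
            cases.append([None if v == '?' else v for v in values])
    return cases

# e.g. cases = read_data_file('credit.data')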
Test and cases files (optional)
The third kind of file used by See5 consists of new test cases (e.g. credit.test) on which the classifier can be
evaluated. This file is optional and, if used, has exactly the same format as the data file.
Another optional file, the cases file (e.g. credit.cases), differs from a test file only in allowing the cases' classes to be
unknown. The cases file is used primarily with the cross-referencing procedure and public source code.
Costs file (optional)
The last kind of file, the costs file (e.g. credit.costs), is also optional and sets out differential misclassification costs.
In some applications there is a much higher penalty for certain types of mistakes.
b) User Interface in See 5
Usage of each icon:
- Locate Data: invokes a browser to find the files for your application, or to change the current application;
- Construct Classifier: selects the type of classifier to be constructed and sets other options;
- Stop: interrupts the classifier-generating process;
- Review Output: re-displays the output from the last classifier construction (if any);
- Use Classifier: interactively applies the current classifier to one or more cases; and
- Cross-Reference: maps between the training data and classifiers constructed from it.
c) Constructing A Classifier:
STEP 1: Locate the data file using the Locate Data button on the toolbar.
STEP 2: Click on the Construct Classifier button on the toolbar. The classifier-construction dialog is displayed; select the necessary options (they are explained below) and construct the classifier.
STEP 3: Use the Use Classifier and Cross-Reference facilities for more detailed classification.
Options Available for constructing the classifier
Rule sets:
Rules can be listed by class or by their importance to classification accuracy. If the latter utility ordering is
selected, the rules are grouped into a number of bands. Errors and costs are reported individually for the
first band, the first two bands, and so on.
Boosting:
By default See5 generates a single classifier. The Boosting option causes a number of classifiers to be
constructed; when a case is classified, all these classifiers are consulted before a decision is made.
Boosting often gives higher predictive accuracy at the expense of increased classifier construction time.
Subset
By default, See5 deals separately with each value of an unordered discrete attribute. If the Subset option is
chosen, these values are grouped into subsets.
Use of Sample & Lock sample
If the data are very numerous, the Sampling option may be useful. This causes only the specified
percentage of the cases in the filestem.data file to be used for constructing the classifier. Resampling can be prevented using the Lock sample option.
Cross Validate
The Cross-validate option can be used to estimate the accuracy of the classifier constructed by See5 even
when there are no separate test cases. The data are split into a number of blocks equal to the chosen
number of folds. Each block contains approximately the same number of cases and the same distribution of
classes. For each block in turn, See5 constructs a classifier using the cases in all the other blocks and then
tests its accuracy on the cases in the holdout block. In this way, each case in the data is used just once as a
test case. The error rate of the classifier produced from all the cases is estimated as the ratio of the total
number of errors on the holdout cases to the total number of cases. Since the classifiers constructed during
a cross-validation use only part of the training data, no classifier is saved when this option is selected.
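To make the cross-validation estimate just described concrete, here is a minimal Python sketch (this is not See5 code; the majority-class learner is only a stand-in for See5's classifier, and the stratification of blocks by class is omitted):

import random
from collections import Counter

def train(cases):
    # stand-in learner: always predict the most common class in the training cases
    return Counter(c['class'] for c in cases).most_common(1)[0][0]

def errors_on(predicted_class, cases):
    return sum(1 for c in cases if c['class'] != predicted_class)

def cross_validate(cases, folds=10, seed=1):
    random.seed(seed)
    shuffled = cases[:]
    random.shuffle(shuffled)
    blocks = [shuffled[i::folds] for i in range(folds)]       # roughly equal-sized blocks
    total_errors = 0
    for i, holdout in enumerate(blocks):
        training = [c for j, b in enumerate(blocks) if j != i for c in b]
        total_errors += errors_on(train(training), holdout)
    return total_errors / len(cases)                          # estimated error rate

# e.g. cross_validate([{'class': '+'}] * 120 + [{'class': '-'}] * 80) is about 0.4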
Ignore Cost File
In applications with differential misclassification costs, it is sometimes desirable to see what effect the
costs file is having on the construction of the classifier. If the Ignore costs file box is checked, See5 will
construct a classifier as if all misclassification costs are the same.
Advanced Options
As the box proclaims, the remaining options are intended for advanced users who are familiar with the way
See5 works.
When a continuous attribute is tested in a decision tree, there are branches corresponding to the conditions
attribute value <= threshold and attribute value > threshold
for some threshold chosen by See5. As a result, small movements in the attribute value near the threshold
can change the branch taken from the test. The Fuzzy thresholds option softens this knife-edge behavior
for decision trees by constructing an interval close to the threshold. Within this interval, both branches of
the tree are explored and the results combined to give a predicted class. Note: fuzzy thresholds do not
affect the behavior of rulesets.
Example of See5 output when defaults are used (credit application)
See5 [Release 1.11]    Mon Feb 21 14:23:56 2000

** This demonstration version cannot process **
** more than 200 training or test cases.     **

Read 200 cases (15 attributes) from credit.data

Decision tree:

A15 > 225: + (81/2)
A15 <= 225:
:...A10 = t: + (60/14)
    A10 = f:
    :...A5 = gg: - (0)
        A5 = p:
        :...A14 <= 311: - (12)
        :   A14 > 311: + (3)
        A5 = g:
        :...A7 = h: + (11)
            A7 = j: - (1)
            A7 in {n,z,dd,ff,o}: + (0)
            A7 = bb:
            :...A12 = t: - (5)
            :   A12 = f: + (2)
            A7 = v:
            :...A15 > 50: + (2)
                A15 <= 50:
                :...A14 <= 102: + (5)
                    A14 > 102: - (18/5)

Evaluation on training data (200 cases):

        Decision Tree
      ----------------
      Size      Errors

        13   21(10.5%)   <<

       (a)   (b)    <-classified as
      ----  ----
       148    16    (a): class +
         5    31    (b): class -

** This demonstration version cannot process **
** more than 200 training or test cases.     **

Evaluation on test data (200 cases):

        Decision Tree
      ----------------
      Size      Errors

        13   75(37.5%)   <<

       (a)   (b)    <-classified as
      ----  ----
        82    67    (a): class +
         8    43    (b): class -

Time: 0.1 secs
The first line identifies the version of See5 and the run date. See5 constructs a decision tree from the 200 training cases in the file credit.data, and this tree appears next.
The last section of the See5 output concerns the evaluation of the decision tree, first on the cases in credit.data from
which it was constructed, and then on the new cases in credit.test. The size of the tree is its number of leaves and the
column headed Errors shows the number and percentage of cases misclassified. The tree, with 13 leaves,
misclassifies 21 of the 200 given cases, an error rate of 10.5%. Performance on these cases is further analyzed in a
confusion matrix that pinpoints the kinds of errors made.
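The arithmetic behind these figures is simple; the following Python sketch (not See5 code) rebuilds the training-data confusion matrix above from (actual, predicted) pairs and computes the error rate:

from collections import Counter

def evaluate(pairs):
    confusion = Counter(pairs)                     # (actual, predicted) -> count
    errors = sum(n for (a, p), n in confusion.items() if a != p)
    return confusion, errors, 100.0 * errors / len(pairs)

# the training matrix above: 148 and 31 correct, 16 + 5 = 21 errors
pairs = [('+', '+')] * 148 + [('+', '-')] * 16 + [('-', '+')] * 5 + [('-', '-')] * 31
confusion, errors, pct = evaluate(pairs)           # errors == 21, pct == 10.5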
d) Using Classifiers
Once a classifier has been constructed, an interactive interpreter can be used to assign new cases to classes. The Use
Classifier button invokes the interpreter, using the most recent classifier for the current application, and prompts for
information about the case to be classified. Since the values of all attributes may not be needed, the attribute values
requested will depend on the case itself. When all the relevant information has been entered, the most probable class
(or classes) are shown, each with a certainty value.
e) Cross-Referencing Classifiers and Data
Complex classifiers, especially those generated with the boosting option, can be difficult to understand. See5
incorporates a unique facility that links data and the relevant sections of (possibly boosted) classifiers. The Cross-
Reference button brings up a window showing the most recent classifier for the current application and how it
relates to the cases in the data, test or cases file. (If more than one of these is present, a menu will prompt you to
select the file.)
Example of Cross referencing
f) Generating Classifiers in Batch Mode
The See5 distribution includes a program See5X that can be used to produce classifiers non-interactively. This
console application resides in the same folder as See5 (usually C:\Program Files\See5) and is invoked from an MS-DOS Prompt window. The command to run the program is
See5X -f filestem parameters
where the parameters enable one or more options discussed above to be selected:
-s          use the Subset option
-r          use the Ruleset option
-b          use the Boosting option with 10 trials
-t trials   ditto with specified number of trials
-S x        use the Sampling option with x%
-I seed     set the sampling seed value
-c CF       set the Pruning CF value
-m cases    set the Minimum cases
-p          use the Fuzzy thresholds option
-e          ignore any costs file
-h          print a summary of the batch mode options
If desired, output from See5 can be diverted to a file in the usual way.
As an example, typing the commands
cd "C:\Program Files\See5"
See5X -f Samples\anneal -r -b >save.txt
in an MS-DOS Prompt window will generate a boosted ruleset classifier for the anneal application in the Samples
directory, leaving the output in file save.txt.
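Because See5X is an ordinary console program, batch runs can also be scripted from other languages. The following Python sketch is only an illustration (the install path and filestem are assumptions taken from the example above, not requirements):

import subprocess

SEE5X = r"C:\Program Files\See5\See5X.exe"        # assumed install location

def run_see5x(filestem, extra_args=(), output_file="save.txt"):
    cmd = [SEE5X, "-f", filestem, *extra_args]
    with open(output_file, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)   # divert output to a file

# a boosted ruleset classifier for the anneal sample, as in the command above
run_see5x(r"Samples\anneal", ["-r", "-b"])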
g) Linking to Other Programs
The classifiers generated by See5 are retained in binary files, filestem.tree for decision trees and filestem.rules for
rulesets. Public C source code is available to read these classifier files and to use them to make predictions. Using
this code, it is possible to call See5 classifiers from other programs. As an example, the source includes a program to
read cases from a cases file, and to show how each is classified by boosted or single trees or rulesets.
Cubist
“A Regressor”
Introduction
“Cubist produces rule-based models for numerical prediction.”
Each rule specifies the conditions under which an associated multivariate linear sub-model should be used. The result is a set of powerful piecewise linear models.
Data mining is all about extracting patterns from an organization's stored or warehoused data. These patterns can be
used to gain insight into aspects of the organization's operations, and to predict outcomes for future situations as an
aid to decision-making.
Cubist builds rule-based predictive models that output values, complementing See5/C5.0 that predicts categories.
For instance, See5/C5.0 might classify the yield from some process as "high", "medium", or "low", whereas Cubist
would output a number such as 73%. (Statisticians call the first kind of activity "classification" and the second
"regression".)
Cubist is a powerful tool for generating piecewise-linear models that balance the need for accurate prediction against
the requirements of intelligibility. Cubist models generally give better results than those produced by simple
techniques such as multivariate linear regression, while also being easier to understand than neural networks.
Important Features Of Cubist
- Cubist has been designed to analyze substantial databases containing thousands of records and tens to hundreds of numeric or nominal fields.
- To maximize interpretability, Cubist models are expressed as collections of rules, where each rule has an associated multivariate linear model. Whenever a situation matches a rule's conditions, the associated model is used to calculate the predicted value.
- Cubist is available for Windows 95/98/NT and several flavors of Unix.
- Cubist is easy to use and does not presume advanced knowledge of Statistics or Machine Learning.
- RuleQuest provides C source code so that models constructed by Cubist can be embedded in your organization's own systems.
Operations Details of CUBIST
In order to work with Cubist, a number of conventions need to be followed. The following points explain the whole process:
a) Preparing Data for Cubist
   - Application filestem
   - Names file
   - Data file
   - Test and cases files (optional)
b) User Interface
c) Constructing Models
   - Rule-based models
   - Composite models
   - Rule coverage
   - Extrapolation
   - Simplicity-accuracy tradeoff
   - Cross-validation trials
   - Sampling from large data sets
d) Cross-Referencing Models and Data
e) Generating Models in Batch Mode
f) Linking to Other Programs
a) Preparing Data for Cubist
The application used here as an illustration is self-referential, that is, it estimates the time required to build a model.
An application is a collection of text files that define attributes, describe the cases to be analyzed, and optionally
provide new cases to test the models produced by Cubist.
Cubist's job is to find how to estimate a case's target value in terms of its attribute values. Cubist does this by
building a model containing one or more rules, where each rule is a conjunction of conditions associated with a
linear expression. The meaning of a rule is that, if a case satisfies all the conditions, then the linear expression is
appropriate for predicting the target value. Cubist thus constructs a piecewise linear model to explain the target
value. As we will see, Cubist can also combine these models with instance-based (nearest neighbor) models.
Application filestem
Every Cubist application has a short name called a filestem; we will use the filestem logtime for this illustration. All
files read or written by Cubist for an application have names of the form filestem.extension, where filestem
identifies the application and extension describes the contents of the file. The case of letters in both the filestem and
extension is important -- file names APP.DATA, app.data, and App.Data are all different. It is important that the
extensions are written exactly as shown below, otherwise Cubist will not recognize the files for your application.
A Cubist application consists of two mandatory files and two optional files:
- Names file filestem.names (required). This defines the application's attributes or features. One attribute, the target, contains the value to be predicted from the other attributes.
- Data file filestem.data (required). This provides information on the cases that will be analyzed by Cubist in order to produce a model.
- Test file filestem.test (optional). This file contains cases that are not used to produce a model, but are used instead to estimate the predictive accuracy of the model.
- Additional test file filestem.cases (optional) for use with the cross-referencing facility described below.
The principal file types written by Cubist are:
- Results file filestem.out. This file contains details of the most recent model constructed for this application.
- Plot file filestem.pred. This file contains case-by-case results for the most recent test cases, and is used to produce a scatter plot.
- Binary model file filestem.model (for use by Cubist only).
The Names File
The names file filestem.names contains a series of entries defining attributes and their values. The file is free-format
with the exception that the vertical bar ‘|’ causes the rest of that line to be skipped. Each entry is terminated with a
period. Each name is a string of characters that does not contain commas, question marks, or colons. A period may
be embedded in a name provided that it is not followed by a space. Embedded spaces are also permitted but
multiple whitespace characters are replaced by a single space.
Example of A Names File
| Comment: sample names file
goal.                             | the numeric attribute that contains the target
                                  | values to be predicted. (In other words,
                                  | 'goal' is the dependent variable.)
patient ID:     label.            | identifies this patient
age:            continuous.       | age is a number
height (cms):   continuous.       | so is height ..
weight (kg):    continuous.       | .. and weight
sex:            male, female.     | a discrete attribute
goal:           continuous.       | recommended weight
The first entry in the names file identifies the attribute that contains the target value to be modeled (the dependent
variable).The rest of the file contains one entry for each attribute. Attributes are of two kinds: explicitly-defined
attributes (specified by type, values etc.) and implicitly-defined attributes (specified by formulas).
- Explicitly-defined attributes
- Implicitly-defined attributes
Explicitly-Defined Attributes
The entry for an explicitly-defined attribute begins with the attribute name followed by a colon, and then one of the
following:
- ‘ignore’, indicating that this attribute should not be used in models;
- ‘label’, indicating that the value of this attribute is used only to identify particular cases;
- ‘continuous’, for attributes with numeric values;
- ‘date’, for attributes whose values take the form YYYY/MM/DD;
- ‘discrete’ followed by an integer N, instructing Cubist to assemble a list of up to N possible values that appear in the training cases; or
- a list of the allowable discrete values of the attribute, separated by commas.
The list of values can be prefaced by ‘[ordered]’ to indicate that the values are given in a meaningful order – Cubist
can exploit this information to produce more sensible groupings of values.
The entry for each attribute is terminated by a period.
An attribute of type ‘label’ can be useful for allowing particular cases to be identified easily, especially in cross-referencing (see below). If more than one label attribute appears, only the last is used.
An Implicitly-defined attribute
An implicitly-defined attribute is one whose value is calculated from the values of previously-defined attributes. An
entry for an implicitly-defined attribute has the form
attribute name := formula .
The formula is written in the usual way, using parentheses where needed, and may refer to any attribute defined
before this one. Constants in the formula can be numbers, discrete attribute values enclosed in string quotes (e.g.,
“small”), and dates. The operators and functions that can be used in formulas are:
and, or
+, -, *, /, % (modulus), ^ (power)
>, >=, <, <=, =, !=, <> (the last two denoting not equal)
log, exp, sin, cos, tan, int (integer part of)
Attributes defined in this way have values that are either numbers or logical values true and false, depending on the
formula. For example, x := a + b. defines a number, but x := a > b. defines a logical value.
Dates YYYY/MM/DD are stored internally as the number of days from a base date, and so can appear in some
formulas.
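As an illustration of how an implicitly-defined attribute behaves, the Python sketch below (not Cubist code) evaluates a hypothetical formula bmi := weight (kg) / (height (cms) / 100) ^ 2 over the sample attributes above; as described for the data file below, an unknown operand makes the result unknown:

def implicit_bmi(case):
    # hypothetical implicit attribute: bmi := weight (kg) / (height (cms) / 100) ^ 2
    w, h = case.get('weight (kg)'), case.get('height (cms)')
    if w is None or h is None:            # unknown operand -> unknown result
        return None
    return w / (h / 100.0) ** 2

print(implicit_bmi({'weight (kg)': 95, 'height (cms)': 153}))    # about 40.6
print(implicit_bmi({'weight (kg)': 54, 'height (cms)': None}))   # None (unknown)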
The Data File
The data file filestem.data contains the cases that will be analyzed in order to produce the model. Each case is
represented by a separate entry in the file; the order of cases in the data file is not important.
A case entry consists of the values of the explicitly-defined attributes in the same order that they appear in the names
file. All values are separated by commas and the entry is terminated with a period. If the case has unknown values
for one or more of the attributes, those values are represented by a question mark.
Note that there are no entries for implicitly-defined attributes since these are computed from the values of other
attributes using the given formula. If one or more of the previously-defined attributes has an unknown value so that
the formula cannot be evaluated, the value of the implicitly-defined attribute is also unknown.
As with the names file, the rest of the line after a vertical bar is ignored, and multiple white space characters are
treated as a single space.
Example of A Data File
| Comment: sample data file
00121, 23, 153, 95, male, 65.      | first case (needs to lose weight)
02002, 47, 183, 70, male, 75.      | second case (about ok)
00937, ?, 157, 54, female, 60.     | notice missing age
01363, 33, 165, 64, ?, 65.         | sex not recorded
Test and cases files (optional)
Of course, the value of predictive models lies in their ability to make accurate predictions! It is difficult to judge the
accuracy of a model by measuring how well it does on the cases used in its construction; the performance of the
model on new cases is much more informative.
The third kind of file used by Cubist is a test file of new cases (e.g. logtime.test) on which the model can be
evaluated. This file is optional and, if used, has exactly the same format as the data file.
Another optional file, the cases file (e.g. logtime.cases), has the same format as the data and test files. The cases file
is used primarily with the cross-referencing procedure and public source code, both of which are described later on.
b) User Interface
The main window of Cubist is displayed in the following diagram
c) Constructing Models
The basic steps to construct a model are as follows:
STEP 1: Locate the Application’s Data
The first step in constructing a model for an application is to locate that application’s data files on your
computer. All these files must be together in one folder, although the same folder may contain files for
several applications. The left-most button on the toolbar invokes a standard specify-or-browse dialog to
find a file whose name is filestem.data for some filestem. Identifying this file provides the application
filestem, and Cubist then looks in the folder for other files related to this application. After the application’s
data has been located, the edit menu allows the .names file to be edited using Wordpad.
STEP 2: Select Options
Once the application has been identified, Cubist can then be used to construct a model. Several options that
can be used to influence this process are specified in the following dialog box:
Rule-based piecewise linear models
When Cubist is invoked with the default values of all options, it constructs a rule-based model and produces output
like this:
Cubist [Release 1.07]    Thu Aug 12 14:51:21 1999
---------------------

Target attribute `log(cpu time)'

Read 162 cases (10 attributes) from logtime.data

Model:

  Rule 1: [4 cases, mean -2.000, range -2 to -2, est err 0.000]

    If
        discrete atts > 7
        instances option in [force inst-allow inst]
        total vals <= 10560
    then
        log(cpu time) = -2

Evaluation on training data (162 cases):

    Average  |error|              0.097
    Relative |error|              0.14
    Correlation coefficient       0.99

Evaluation on test data (162 cases):

    Average  |error|              0.126
    Relative |error|              0.17
    Correlation coefficient       0.99

Time: 0.0 secs
The ‘standard’ Cubist model consists of a set of rules, each with an associated linear model. The value of a case is
predicted by finding the rules that apply to it, calculating the corresponding values from the rules’ linear models, and
averaging these values. The Rules alone option builds standard models in this form.
Another way of assessing the accuracy of the predictions is through a visual inspection of a scatter plot that graphs
the real target values of new cases against the values predicted by the model. When a file filestem.test of test cases is
present, Cubist also provides a scatter plot window. In this example, it looks like this:
Composite models
Alternatively, Cubist can generate a special form of composite instance-based and rule-based model. As in a
conventional instance-based (or nearest neighbor) model, the target value for a case is found by identifying the most
similar cases in the training data. Instead of simply averaging the known target values for these neighbors, however,
Cubist uses an adjusted value for each neighbor case. Suppose that x is the case whose unknown target value is to
be predicted, and n is a neighbor with target value T(n). If the rule-based model predicts values M(x) and M(n) for x and n respectively, then the model predicts that the
target value of x will be higher than the target value of n by an amount M(x) - M(n). Cubist therefore uses the value
T(n) + M(x) - M(n) as the adjusted value associated with the neighbor n.
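The adjustment is easy to express in code. Here is a minimal Python sketch of the idea (not Cubist code; rule_model stands in for the rule-based model M, and the toy linear model and neighbours are invented for this example):

def composite_predict(x, neighbours, rule_model):
    # each neighbour n with known target t_n contributes T(n) + M(x) - M(n)
    adjusted = [t_n + rule_model(x) - rule_model(n) for n, t_n in neighbours]
    return sum(adjusted) / len(adjusted)

rule_model = lambda case: 2.0 * case['a']            # toy stand-in for M
neighbours = [({'a': 2.5}, 5.2), ({'a': 3.5}, 6.9)]
print(composite_predict({'a': 3.0}, neighbours, rule_model))   # (6.2 + 5.9) / 2 = 6.05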
If the second model form option Instances and rules are chosen in the dialog box, Cubist uses composite models like
this.
Evaluation on training data (162 cases):

    Average  |error|              0.100
    Relative |error|              0.14
    Correlation coefficient       0.99

Evaluation on test data (162 cases):

    Average  |error|              0.120
    Relative |error|              0.16
    Correlation coefficient       0.99
The third option Let Cubist choose allows Cubist to decide, from analysis of the training cases in filestem.data,
which model form seems likely to give more accurate predictions.
Rule coverage
The dialog box contains a field for specifying the (approximate) minimum case cover for any rule as a percentage of
the number of training cases. That is, the conditions associated with any rule should be satisfied by at least the
specified percentage of all training cases.
Extrapolation
The extrapolation parameter controls the extent to which predictions made by Cubist's linear models can fall outside
the range of values seen in the training data. Extrapolation is inherently more risky than interpolation, where
predictions must lie between the lowest and highest observed value.
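One plausible way such a limit could be enforced is to cap predictions at a fixed percentage of the training range beyond the observed minimum and maximum; the Python sketch below only illustrates this idea and is not necessarily Cubist's exact rule:

def cap_prediction(pred, train_min, train_max, extrapolation=10.0):
    # allow predictions at most 'extrapolation' percent of the training range
    # beyond the lowest and highest target values seen in training (assumed rule)
    margin = (train_max - train_min) * extrapolation / 100.0
    return max(train_min - margin, min(train_max + margin, pred))

print(cap_prediction(125.0, train_min=0.0, train_max=100.0))    # capped to 110.0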
Simplicity-accuracy tradeoff
Cubist employs heuristics that try to simplify models without substantially reducing their predictive accuracy. These
heuristics are influenced by a parameter called the brevity factor whose values range from 0 to 100%. The default
value is 50%; lower values tend to produce more detailed models that are sometimes (but not always!) slightly more
accurate than those generated by default, while higher values tend to produce fewer rules, usually with a penalty in
predictive accuracy.
Cross-validation trials
As we saw earlier, the predictive accuracy of a model constructed from the cases in a data file can be estimated from
its performance on new cases in a test file. This estimate can be rather erratic unless there are large numbers of cases
in both files. If the cases in logtime.data and logtime.test were to be shuffled and divided into new 162-case training
and test sets, Cubist would probably construct a different model whose accuracy on the test cases might vary
considerably.
One way to get a more reliable estimate of predictive accuracy is by f-fold cross-validation. The cases in the data file
are divided into f blocks of roughly the same size and target value distribution. For each block in turn, a model is
constructed from the cases in the remaining blocks and tested on the cases in the hold-out block. In this way, each
case is used just once as a test case. The accuracy of a model produced from all the cases is estimated by averaging
results on the hold-out cases.
Running a 10-fold cross-validation using the option that allows Cubist to choose the model type gives results on the
hold-out cases similar to these:
Summary:

    Average  |error|              0.135
    Relative |error|              0.20
    Correlation coefficient       0.98
Sampling from large datasets
If the data are very numerous, the Sampling option may be useful. This causes only the specified percentage of the
cases in the filestem.data file to be used for constructing the model. This option automatically causes the model to
be tested on a disjoint sample of cases in the filestem.data file. By default, the samples are redrawn every time a
model is constructed, so the models built with the sampling option will usually change too. This resampling can be
prevented using the Lock sample option; the sample is unchanged until a different application is loaded, the sample
percentage is altered, the option is unselected, or Cubist is restarted.
d) Cross-Referencing Models and Data
The Cross-validate option can be used to estimate the accuracy of the model constructed by Cubist without requiring
separate test cases. The data are split into a number of blocks equal to the chosen number of folds (default 10).
Each block contains approximately the same number of cases and the same distribution of target values. For each
block in turn, Cubist constructs a model using the cases in all the other blocks and then evaluates the accuracy of
this model on the cases in the hold-out block. In this way, each case in filestem.data is used just once as a test case.
(The results for a cross-validation depend on the way that the cases are divided into blocks. This division changes
each time a cross-validation is carried out, so don’t be surprised if the results are not identical to the results from
previous cross-validations with the same data.) Since models constructed during cross-validation use only part of the
training data, no model is saved when this option is selected.
e) Generating Models in Batch Mode
The Cubist distribution includes a program CubistX that can be used to produce models non-interactively. This
console application resides in the same folder as Cubist (usually C:\Program Files\Cubist) and is invoked from an
MS-DOS Prompt window. The command to run the program is
CubistX -f filestem parameters
where the parameters enable one or more options discussed above to be selected:
-i           definitely use composite models
-a           allow the use of composite models
-S percent   use the sampling option
-m percent   set the minimum cases option
-e percent   set the extrapolation limit
-b percent   set the brevity factor
If desired, output from Cubist can be diverted to a file in the usual way.
As an example, executing
cd "C:\Program Files\Cubist"
CubistX -f Samples\boston -S 80 -a >save.txt
from an MS-DOS Prompt window focuses on the boston application in the Samples directory, uses a sample of 80%
of the cases for constructing the model (and tests it on the remaining 20%), allows Cubist to build composite
models, and leaves the output in file save.txt.
f) Linking to Other Programs
The models generated by Cubist are retained in binary files filestem.model. C source code to read these model files
and to use them to make predictions is freely available. Using this code, it is possible to call Cubist models from
other programs. As an example, the source includes a program to read a cases file and to print the predicted value for
each case using the most recent model.
Magnum Opus
“An Association Tool”
INTRODUCTION
“Magnum Opus is an exciting new tool for finding associations”
(“if-then rules” that highlight interrelationships among attributes).
Magnum Opus uses the highly efficient OPUS search algorithm for fast association rule discovery. The system is
designed to handle large data sets containing millions of cases, but its capacity to do so will be constrained by the
amount of RAM available.
The user can choose between two measures of the importance of an association: leverage or lift.
The user specifies the maximum number of association rules to be found, and can place restrictions on association
rules to be considered. Within those restrictions, Magnum Opus finds the non-trivial associations with the highest
values on the specified measure of importance. Magnum Opus will only find fewer than the specified number of
association rules if the search is terminated by the user or there are fewer than the specified number of non-trivial
associations that satisfy the user specified constraints.
General information about the data must be described in a names file. The data must be stored in a data file. The
association rules found by the system are recorded in an output file.
The simple data set used in this tutorial represents the type of data that might be collected by a supermarket about its customers. It is
contained in two files: tutorial.nam and tutorial.data. The first is a names file. The second is a data file. The names
file describes the attributes recorded in the data file. The data file contains the values of those attributes for each
case.
Example of a names file
Profitability99 <= 418 < 895
Profitability98 <= 326 < 712
Spend99 <= 2025 < 4277
Spend98 <= 1782 < 4027
NoVisits99 <= 35 < 68
NoVisits98 <= 31 < 64
Dairy <= 214 < 515
Deli <= 207 < 554
Bakery <= 209 < 558
Grocery <= 872 < 2112
SocioEconomicGroup: A, B, C, D1, D2, E
Promotion1: t, f
Promotion2: t, f
Most of these attributes are numeric. These numeric attributes have been divided into three sub-ranges, each of
which contains approximately the same number of cases. The profitability attributes represent the profit made from
a customer in 1999 and 1998. The spend attributes represent the total spend by a customer in each year. The
NoVisits attributes represent the numbers of store visits in each year. The Dairy, Deli, Bakery, and Grocery
attributes record the customer’s total spend in each of four significant departments. The remaining three attributes
are categorical. The SocioEconomicGroup attribute records an assessment of the customer’s socio-economic group.
The final two attributes record whether the customer participated in each of two store promotions.
Basic Terminology in Magnum Opus
Association rule
An association rule identifies a combination of attribute values that occur together with greater frequency
than might be expected if the values were independent of one-another. A Magnum Opus association rule
has two parts, a Left Hand Side (LHS) and a Right Hand Side (RHS). The LHS is a set of one or more
attribute values. The RHS is a single attribute value. Each association rule indicates that the frequency
with which the RHS occurs in the data is higher among cases that have the LHS attribute values than
among those that do not.
Case
Cases are the basic unit of analysis considered by Magnum Opus. A case represents a single entity.
Examples of the type of thing that might be used as a case include customers, transactions, and states of a
dynamic process. Each case is described by a set of values for a number of attributes. The same attributes
are used to describe all cases for a single analysis. The attributes for the analysis are defined in a names
file. The cases are described in a data file.
Coverage
The coverage of an association rule is the proportion of cases in the data that have the attribute values
specified on the Left Hand Side of the rule. The total number of cases that this represents is indicated in
brackets. For example, suppose that there are 1000 cases and the LHS covers 200 cases. The coverage is
200/1000 = 0.2. The total number of cases, displayed in brackets, is 200.
Association rule strength
The strength of an association rule is the proportion of cases covered by the LHS of the rule that are also
covered by the RHS. For example, suppose that the LHS covers 100 cases and the RHS covers 50 of the
cases covered by the LHS. The strength is 50/100 = 0.5.
Association rule lift
The lift of an association rule is the strength divided by the proportion of all cases that are covered by the RHS. This is a measure of the importance of the association that is independent of coverage. For example, suppose that there are 1000 cases, the LHS covers 100 cases, the RHS covers 100 cases, and the RHS covers 50 of the cases covered by the LHS. The strength is 50/100 = 0.5. The proportion of all cases covered by the RHS is 100/1000 = 0.1. The lift is 0.5/0.1 = 5.
Association rule leverage
The leverage of an association rule is the proportion of additional cases covered by both the LHS and RHS
above those expected if the LHS and RHS were independent of each other. This is a measure of the
importance of the association that includes both the strength and the coverage of the rule. The total number
of cases that this represents is presented in brackets following the leverage. For example, suppose that there
are 1000 cases, the LHS covers 200 cases, the RHS covers 100 cases, and the RHS covers 50 of the cases
covered by the LHS. The proportion of cases covered by both the LHS and RHS is 50/1000 = 0.05. The
proportion of cases that would be expected to be covered by both the LHS and RHS if they were
independent of each other is (200/1000) * (100/1000) = 0.02. The leverage is 0.05 minus 0.02 = 0.03. The
total number of cases that this represents is 30.
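The four measures defined above can be computed directly from raw counts. The Python sketch below (not Magnum Opus code) does so, using the numbers from the leverage example:

def association_measures(n_cases, n_lhs, n_rhs, n_both):
    coverage = n_lhs / n_cases                 # proportion of cases covered by the LHS
    strength = n_both / n_lhs                  # of LHS cases, the proportion also covered by the RHS
    lift = strength / (n_rhs / n_cases)        # strength relative to the overall RHS frequency
    leverage = n_both / n_cases - (n_lhs / n_cases) * (n_rhs / n_cases)
    return coverage, strength, lift, leverage

# 1000 cases, LHS covers 200, RHS covers 100, both cover 50
print(association_measures(1000, 200, 100, 50))
# coverage 0.2, strength 0.25, lift 2.5, leverage 0.03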
Analysis Process
a) Starting Magnum Opus
b) Open the names file
c) Select options for search by leverage and execution
d) View the output of the search by leverage
e) Dissection of an association rule
f) Run the search by lift and view its output
g) Run search by lift with minimum coverage
h) Conclusion
a) Starting Magnum Opus
Magnum Opus is started from the start menu
Initially, the main window will be empty, indicating that no names file has yet been opened.
b) Open the names file
The names file contains the attributes associated with a particular application.
The following is an example of the names file described above when opened in Magnum Opus.
Note the following elements of the initial screen. At the top the currently selected search mode is displayed, either
search by leverage or search by lift. Next the currently selected names and paths for the names, data, and output
files are displayed. Below these appear the currently selected values for
- The number of attributes allowed on the LHS of an association
- The maximum number of associations to be returned by the search
- The minimum leverage for an association
- The minimum coverage for an association
- The minimum lift for an association
Finally appear two list boxes containing lists of all of the available attribute values. Only those attribute values
selected in these list boxes will be able to appear on the LHS and RHS of an association, respectively.
c) Select options for search by leverage and execution
The following is a search by leverage with these conditions:
- Limit the search to the ten best associations.
- Find associations between the other attributes and profitability. [Therefore, we want to limit the attribute values that may appear on the RHS of an association to the three values for profitability.]
The Example screen is as follows
Once you have specified all of the settings for the search, click on the GO button to start the search. As the Search
by Leverage mode is in effect, Magnum Opus will perform a search, finding the associations with the highest leverage within the other constraints that have been specified.
d) Viewing the output
When the search is completed, the associations are written to the specified output file. The contents of this file are
then displayed in an output window.
The output file lists the names of the files that described the data and the settings that were used for the search.
These are followed by a list of association rules that were found. The final line of the output summarizes the
number of seconds taken to complete the search, the number of association rules found, and the number of cases
employed.
The output file for our example
Magnum Opus - Turning Data to Knowledge.
Copyright (c) 1999 G. I. Webb & Associates Pty Ltd.
Sun Oct 03 18:07:40 1999
Namesfile: C:\Tutorial\tutorial.nam
Datafile: C:\Tutorial\tutorial.data
Search by leverage
Maximum number of attributes on LHS = 4
Maximum number of associations = 10
Minimum leverage = 0.001
Minimum coverage = 0.0
Minimum lift = 1.0
All values allowed on LHS
Values allowed on RHS:
Profitability99<=418 418<Profitability99<895 Profitability99>=895
Sorted best association rules
Spend99<=2025 -> Profitability99<=418 [Coverage=0.333 (333); Strength=0.907; Lift=2.72; Leverage=0.1911 (191)]
Spend99>=4277 -> Profitability99>=895 [Coverage=0.335 (335); Strength=0.860; Lift=2.57; Leverage=0.1758 (175)]
Spend99<=2025 & Grocery<=872 -> Profitability99<=418 [Coverage=0.278 (278); Strength=0.953; Lift=2.86; Leverage=0.1724 (172)]
Spend99<=2025 & NoVisits99<=35 -> Profitability99<=418 [Coverage=0.276 (276); Strength=0.938; Lift=2.82; Leverage=0.1671 (167)]
Grocery<=872 -> Profitability99<=418 [Coverage=0.333 (333); Strength=0.832; Lift=2.50; Leverage=0.1661 (166)]
Profitability98<=326 & Spend99<=2025 -> Profitability99<=418 [Coverage=0.256 (256); Strength=0.961; Lift=2.89; Leverage=0.1608 (160)]
0 seconds for 10 association rules from 1000 examples
e) Dissection of an association rule
The first association rule in the example output file is the following.
Spend99<=2025 -> Profitability99<=418 [Coverage=0.333 (333); Strength=0.907; Lift=2.72; Leverage=0.1911
(191)]
The left-hand side of this rule is presented before the -> arrow. The right-hand side follows the arrow and precedes
the opening bracket [. This association rule indicates that cases for which Spend99 has a value less than or equal to
2025 are associated with cases for which Profitability99 has a value less than or equal to 418 more frequently than is
the average for all cases. That is, the frequency of association between cases that satisfy the left and right hand sides
is greater than normal.
The following information about the association is displayed in brackets.
Coverage=0.333 (333): The first value indicates the proportion of all cases that satisfy the LHS of the rule (Spend99<=2025). The value in brackets indicates the absolute number of cases that this represents: 333 cases satisfy the LHS, and recall there are 1000 cases in the data set.
Strength=0.907: This indicates the proportion of those cases that satisfy the LHS that also satisfy the RHS. 333 cases satisfy the LHS and 302 of those also satisfy the RHS, so Strength is calculated as 302/333.
Lift=2.72: Lift is a measure of how much stronger than normal the association between the LHS and RHS is. Out of all the data, 333 cases satisfy the RHS (this is the same as the number satisfying the LHS, because each attribute was split on values that created three equal-sized partitions). Therefore the strength of association that would be expected if the LHS and RHS were independent of each other is 333/1000. Dividing the actual strength by this value we obtain (302/333)/(333/1000) = 2.72.
Leverage=0.1911 (191): Leverage is a measure of the magnitude of the effect created by the association. This is the proportion of cases that exhibit the association in excess of those that would be expected if the LHS and RHS were independent of each other. If the two attribute values were independent of each other, then the expected proportion of cases exhibiting both values would be the proportion exhibiting the LHS (0.333) times the proportion exhibiting the RHS (0.333) = 0.1109. The proportion exhibiting the association is 0.302. The difference between these values is 0.1911.
f) Run the search by lift and view its output
First select Search by Lift mode by clicking the LIFT Toolbar button.
Next select a new output file name, so that the output from the last search is not overwritten.
For the first search by lift we will use the same settings as were used for the search by leverage (find best ten
association rules only; allow only values for Profitability99 on RHS). As these are already selected we can proceed
to starting the search by clicking on the GO button.
The output of the search by lift is as follows.
Magnum Opus - Turning Data to Knowledge.
Copyright (c) 1999 G. I. Webb & Associates Pty Ltd.
Sun Oct 03 18:11:45 1999
Namesfile: C:\Tutorial\tutorial.nam
Datafile: C:\Tutorial\tutorial.data
Search by lift
Maximum number of attributes on LHS = 4
Maximum number of associations = 10
Minimum leverage = 0.001
Minimum coverage = 0.0
Minimum lift = 1.0
All values allowed on LHS
Values allowed on RHS:
Profitability99<=418 418<Profitability99<895 Profitability99>=895
Sorted best associations
Profitability98<=326 & Dairy>=515 & Deli>=554 & SocioEconomicGroup=C -> 418<Profitability99<895 [Coverage=0.006 (6); Strength=1.000;
Lift=3.01; Leverage=0.0040 (4)]
Profitability98<=326 & Spend99>=4277 & Dairy>=515 & SocioEconomicGroup=C -> 418<Profitability99<895 [Coverage=0.004 (4); Strength=1.000;
Lift=3.01; Leverage=0.0027 (2)]
NoVisits99<=35 & Dairy>=515 & Deli>=554 -> 418<Profitability99<895 [Coverage=0.004 (4); Strength=1.000; Lift=3.01; Leverage=0.0027 (2)]
Profitability98<=326 & Spend99>=4277 & Deli>=554 & SocioEconomicGroup=C -> 418<Profitability99<895 [Coverage=0.003 (3); Strength=1.000;
Lift=3.01; Leverage=0.0020 (2)]
Profitability98<=326 & NoVisits98>=64 & Dairy>=515 -> 418<Profitability99<895 [Coverage=0.003 (3); Strength=1.000; Lift=3.01; Leverage=0.0020
(2)]
Profitability98<=326 & NoVisits98>=64 & Deli>=554 -> 418<Profitability99<895 [Coverage=0.002 (2); Strength=1.000; Lift=3.01; Leverage=0.0013
(1)]
Profitability98<=326 & NoVisits99>=68 & Dairy>=515 & SocioEconomicGroup=C -> 418<Profitability99<895 [Coverage=0.002 (2); Strength=1.000;
Lift=3.01; Leverage=0.0013 (1)]
Spend98<=1782 & NoVisits99<=35 & Grocery>=2112 & Promotion1=f -> 418<Profitability99<895 [Coverage=0.002 (2); Strength=1.000; Lift=3.01;
Leverage=0.0013 (1)]
Spend98<=1782 & NoVisits98>=64 & Dairy>=515 -> 418<Profitability99<895 [Coverage=0.002 (2); Strength=1.000; Lift=3.01; Leverage=0.0013 (1)]
Spend98<=1782 & NoVisits99>=68 & Dairy>=515 & SocioEconomicGroup=C -> 418<Profitability99<895 [Coverage=0.002 (2); Strength=1.000;
Lift=3.01; Leverage=0.0013 (1)]
3 seconds for 10 associations from 1000 examples
In this example the search has found only association rules that cover very small numbers of cases. When a rule has
small coverage there is increased probability of the observed lift being unrealistically high. Also, if there is small
coverage then the total impact (as indicated by leverage) will be low. To counter this, in the next screen we will set
a minimum coverage during search. Association rules that do not meet the coverage requirement will not be
considered. Raising the minimum coverage has a secondary effect of reducing computation time, which may be
desirable for large data sets.
Exit from viewing the output of the current search by clicking on the close button. You will be returned to the main
window where you can adjust the settings before starting another search.
g) Run search by lift with minimum coverage
For the next example, set the value for Minimum coverage of interest to 0.05. This will ensure that association rules
that cover less than 0.05 of the available cases (50 cases for the tutorial data) will be discarded. Change the output
file name so that the previous search results will not be overwritten.
h) Conclusion
Thus Magnum Opus is used to find association rules. Magnum Opus can also be used to perform Market Basket Analysis.
Basket analysis is one of the most common applications for association rules. In basket analysis the items jointly
involved in a single transaction form the basic unit of analysis. A typical example is the contents of each shopping
basket that passes through a retailer’s check out line. The purpose of the basket analysis is to identify which
combinations of items have the greatest affinity with each other.
One way to perform basket analysis with Magnum Opus is to declare an attribute for each item of analysis. In the
retail shopping basket example, this would entail one attribute for each product of interest. For each attribute one
value only need be declared. If the basket contains an item then the corresponding attribute has its single value set.
Otherwise the attribute is set to the missing value. The following are examples of hypothetical names and data files
that support this approach to basket analysis.
Example.nam:
zippo_large: p
zippo_small: p
zappy_large: p
neato_compact: p
neato_standard: p

Example.data:
p,?,?,?,p
?,p,?,p,?
?,?,p,?,p
p,?,p,?,p
p,?,p,p,?
The names of the five attributes declared in example.nam are the names of the five products. The single value p for
each attribute represents that the corresponding item is present in the basket represented by a line of the data file.
The first line indicates that the first basket contained only a zippo_large and a neato_standard.
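A small Python sketch (not part of Magnum Opus) shows how baskets could be written out in this one-value-per-attribute form; the product names are taken from example.nam above:

PRODUCTS = ["zippo_large", "zippo_small", "zappy_large",
            "neato_compact", "neato_standard"]

def basket_to_line(basket):
    # 'p' marks a product present in the basket, '?' marks the missing value
    return ",".join("p" if item in basket else "?" for item in PRODUCTS)

for basket in [{"zippo_large", "neato_standard"}, {"zippo_small", "neato_standard"}]:
    print(basket_to_line(basket))
# p,?,?,?,p
# ?,p,?,?,p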
A main concern when using Magnum Opus is computing time. In general, the following measures will decrease compute time:
- Increase the minimum leverage of interest.
- Increase the minimum coverage of interest.
- Decrease the maximum number of attributes allowed on the LHS of an association.
- Decrease the maximum number of associations to be found.
- Decrease the number of attribute values allowed on the LHS and the RHS of an association.
GritBot
“A Data Cleaning Tool“
GritBot is a sophisticated data-cleansing tool that helps you to audit and maintain data quality.
Working from the raw data alone, GritBot automatically explores partitions of the data that share common properties
and reports surprising values in each partition. GritBot uncovers anomalies that might compromise the effectiveness
of your data mining tools.
"Data mining" is a family of techniques for extracting valuable information from an organization's stored or
warehoused data. Data-mining methods search for patterns and can be compromised if the data contain corrupted
values that obscure these patterns. As the saying goes, "Garbage in, garbage out."
GritBot is an automatic tool that tries to find anomalies in data as a precursor to data mining. It can be thought of as
an autonomous data quality auditor that identifies records having "surprising" values of nominal (discrete) and/or
numeric (continuous) attributes.
Values need not stand out in the complete dataset -- GritBot searches for subsets of records in which the anomaly is
apparent. In one of the sample applications, GritBot identifies the age of two women in their
seventies as being anomalous. Such ages are not surprising in the whole population, but they certainly are in this
case because the women are noted as being pregnant.
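The idea can be illustrated with a small Python sketch (this is only an illustration of partition-relative anomalies, not GritBot's actual algorithm; the ages and the three-standard-deviation rule are invented for this example):

from statistics import mean, stdev

def surprising(values, candidate, threshold=3.0):
    # flag a value more than 'threshold' standard deviations from the mean
    if len(values) < 2 or stdev(values) == 0:
        return False
    return abs(candidate - mean(values)) / stdev(values) > threshold

ages_all = [22, 25, 31, 28, 74, 35, 29, 41, 77, 38]      # whole data set
ages_pregnant = [22, 25, 31, 28, 35, 29]                 # partition: pregnant cases only
print(surprising(ages_all, 74))         # False: not unusual in the full data
print(surprising(ages_pregnant, 74))    # True: very unusual within the partition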
Some important features:
- GritBot has been designed to analyze substantial databases containing tens or hundreds of thousands of records and large numbers of numeric or nominal fields. Every possibly anomalous value that GritBot identifies is reported, together with an explanation of why the value seems surprising.
- GritBot is virtually automatic -- the user does not require knowledge of Statistics or Data Analysis.
- GritBot is available for Windows 95/98/NT and several flavors of Unix.