Short Paper - Computer Science

An Investigation of Commercial Data Mining
Emily Davis
Supervisor: John Ebden
Abstract: This paper describes an investigation of a commercial data mining suite,
specifically that of Oracle9i. The investigation was conducted in order to determine the
type of results achieved when data mining models created using Oracle’s data mining
components are applied to data. Two types of model in the same category, Naïve Bayes
and Adaptive Bayes Network, were built and their results compared in order to determine
whether the results supported each other or differed in any way. It was concluded that
only one of six comparisons showed very similar results for the two algorithms providing
classification, and that the choice of modelling algorithm can therefore have a significant
impact on results from the same data even when the category of data mining technique
is the same.
1. Introduction
This paper describes the method and results of investigating Oracle9i data mining,
specifically the algorithms that fall into the classification category, in order to determine
the type of results produced and whether the results from the different models support
each other. Four models were initially built using classification algorithms in the
Oracle9i data mining suite. The two algorithms used were Naïve Bayes and Adaptive
Bayes Network. The algorithms were used to build, test and apply the models under
different combinations of parameter settings, and the results were documented.
2. Methodology
2.1 Preparation
An Oracle database was configured, and the tools and software for data mining were
installed and configured for use with the database. The Oracle Data Mining (ODM) suite
is made up of two components, the data mining Java API and the Data Mining Server
(DMS). [Oracle9i Data Mining Concepts Release 2 (9.2), 2002] The DMS provides a
repository of metadata of the input and result objects of data mining. For the purposes
of this investigation JDeveloper 10g provided access to the Java API and the DMS. The
data mining itself was performed using DM4J 9.0.4, an extension of JDeveloper that
provides the user with a number of wizards that automatically generate the Java
programs that perform the data mining when run. [Oracle Data Mining Tutorial, Release
9.0.4, 2004]
2.2 Algorithms
According to Berry and Linoff [2000], directed data mining or supervised learning
involves using data to build a model that describes one particular variable of interest
in terms of the rest of the data. This category includes techniques such as
Classification, Estimation and Prediction.
Roiger and Geatz [2003] define input variables as independent variables and output
variables as dependent variables. In supervised learning a predictive, dependent
variable is produced as output.
Roiger and Geatz [2003] describe classification as a technique where the dependent or
output variable is categorical. The emphasis of the model is to assign new instances of
data to categorical classes.
ODM supports the following classification algorithms selected for this experiment as
stated by Oracle9i Data Mining Concepts Release 2 (9.2) [2002]:

Adaptive Bayes Network supporting decision trees (classification)

Naive Bayes (classification)
2.3 Investigation
In order to be able to perform comparisons during the evaluation of ODM it was
necessary to select two data mining algorithms that fall into the same categories, that
is, supervised learning and classification. Naïve Bayes and Adaptive Bayes Network
classification were selected because the format of the results they produce is
comparable. Both algorithms allow for building the model, testing the model, computing
model lift (a measure of how quickly the model finds actual positive target values) and
applying the model to new data.
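As a rough illustration of the lift measure mentioned above, the following sketch computes cumulative lift for the top fraction of records ranked by predicted probability. The function name `cumulative_lift` and the toy data are hypothetical, and the calculation is the common textbook definition; it is not necessarily the exact calculation that DM4J performs.

```python
def cumulative_lift(actual, scores, fraction):
    """Lift of the top `fraction` of records ranked by descending score.

    actual: list of 0/1 target values.
    scores: predicted probability that the target is 1, one per record.
    """
    n = len(actual)
    top_n = max(1, int(round(n * fraction)))
    # Rank records from most to least likely positive.
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    top_positives = sum(a for _, a in ranked[:top_n])
    overall_rate = sum(actual) / n
    if overall_rate == 0:
        raise ValueError("no positive targets in the data")
    # Ratio of the positive rate in the top slice to the overall positive rate.
    return (top_positives / top_n) / overall_rate


# Hypothetical toy data: 10 records, 4 of which are actual positives.
actual = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.8, 0.3, 0.1, 0.7, 0.4, 0.2, 0.35, 0.05]
lift_top_30 = cumulative_lift(actual, scores, 0.3)
```

A lift above 1 for a given fraction means the model concentrates actual positives among its highest-ranked records more than random selection would.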
2.3.1 Data
The data used in this experiment consists of three tables stored in the Oracle
database. The tables are MINING_DATA_BUILD, MINING_DATA_TEST and
MINING_DATA_APPLY and are distributed as part of a DM4J tutorial [Oracle Data
Mining Tutorial, Release 9.0.4, 2004]. The data represents the demographics of
customers of an electronics shop chain that would like to offer loyalty cards
(AFFINITY_CARD) to customers who are expected to increase their buying. The
tables have identical structure, as required by the data mining tasks, and each consists
of 1500 records, none of which are identical. The table structure is shown in Table 1:
Name                      Data Type   Size   Nulls?
CUST_ID                   NUMBER
CUST_GENDER               CHAR
AGE                       NUMBER
CUST_MARITAL_STATUS       VARCHAR2    60
COUNTRY_NAME              VARCHAR2    120
CUST_INCOME_LEVEL         VARCHAR2    90     YES
EDUCATION                 VARCHAR2    63     YES
OCCUPATION                VARCHAR2    63     YES
HOUSEHOLD_SIZE            VARCHAR2    63     YES
YRS_RESIDENCE             NUMBER      3      YES
AFFINITY_CARD             NUMBER      10     YES
BULK_PACK_DISKETTES       NUMBER      10     YES
FLAT_PANEL_MONITOR        NUMBER      10     YES
HOME_THEATER_PACKAGE      NUMBER      10     YES
BOOKKEEPING_APPLICATION   NUMBER      10     YES
PRINTER_SUPPLIES          NUMBER      10     YES
Y_BOX_GAMES               NUMBER      10     YES
OS_DOC_SET_KANJI          NUMBER      10     YES
Table 1 Mining Data Table Structure
MINING_DATA_BUILD is used to build the data mining models for both algorithms.
MINING_DATA_TEST is used as the test data to evaluate the effectiveness of the
models created from the build data. Roiger and Geatz [2003] state that evaluation of
supervised learning models involves determining the level of predictive accuracy.
Such models can be evaluated by comparing their test set error rates with expected
rates obtained from historical data of a similar form, in order to determine the
accuracy of the models and which model to apply if need be. [Roiger and Geatz, 2003]
MINING_DATA_APPLY is the data to which the built and tested model is applied
in order to make classifications. The results of applying the models to the
data are stored by DM4J for inspection and use. It is also possible to export the results
to a spreadsheet format, which was done in this case to allow for comparison.
2.3.2 Testing Models
The test model results produced by DM4J are depicted in confusion matrices and lift
charts. Confusion matrices can be used to determine the accuracy of classification
models and to show the number of false negative or false positive predictions made by
the model on the test data. Confusion matrices are best suited to evaluating the
accuracy of models that use categorical data, as is the case here. [Roiger and
Geatz, 2003]
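The following minimal sketch shows how a confusion matrix for a binary target can be tallied and turned into an accuracy figure. The function names and the toy predictions are hypothetical and serve only to illustrate the idea; DM4J produces its confusion matrices through its own wizards.

```python
def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for a binary 0/1 target."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1
        elif a == 0 and p == 0:
            counts["tn"] += 1
        elif a == 0 and p == 1:
            counts["fp"] += 1  # predicted 1, actual 0
        else:
            counts["fn"] += 1  # predicted 0, actual 1
    return counts


def accuracy(counts):
    """Fraction of test records that were predicted correctly."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total


# Hypothetical toy test set of eight records.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 0, 0]
counts = confusion_matrix(actual, predicted)
test_accuracy = accuracy(counts)
```

Keeping the four counts separate, rather than reporting accuracy alone, is what makes a one-sided error pattern visible.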
2.3.3 Building the Models
Four models were built using the build data, two of the Naïve Bayes form and two of
the Adaptive Bayes Network form. The models were named nb, nbw, abn and abnw.
All the models were built using AFFINITY_CARD as the target attribute, that is, the
attribute that would be predicted. The algorithms aim to use the other attribute values
in a record to predict whether a customer is likely to increase spending if offered an
affinity card.
Naïve Bayes works by examining the build data and calculating conditional
probabilities for the target value, AFFINITY_CARD. This is done by observing the
frequency of certain attribute values and combinations thereof. [Oracle Data Mining
Tutorial, Release 9.0.4, 2004] The two parameters that must be supplied to the Naïve
Bayes build wizard indicate how outliers in the data should be treated; occurrences
below the threshold values are ignored when creating the model. [Oracle Data Mining
Tutorial, Release 9.0.4, 2004]
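The frequency-counting idea behind Naïve Bayes can be sketched as follows. This is a generic textbook version, not Oracle's implementation: the function names, the attribute values ("Masters", "HS-grad") and the use of add-one smoothing are all assumptions made for illustration, and the threshold parameters described here are not modelled.

```python
from collections import Counter, defaultdict


def train_naive_bayes(records, target):
    """Count target values and attribute values per target class."""
    class_counts = Counter()
    value_counts = defaultdict(Counter)  # (attribute, class) -> value counts
    values_seen = defaultdict(set)       # attribute -> distinct values
    for rec in records:
        cls = rec[target]
        class_counts[cls] += 1
        for attr, val in rec.items():
            if attr != target:
                value_counts[(attr, cls)][val] += 1
                values_seen[attr].add(val)
    return class_counts, value_counts, values_seen


def predict(model, record):
    """Return {class: posterior probability} for a record without the target."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / total  # prior P(class)
        for attr, val in record.items():
            # Add-one smoothing (an assumption) avoids zero probabilities.
            num = value_counts[(attr, cls)][val] + 1
            den = n_cls + len(values_seen[attr])
            score *= num / den  # conditional P(value | class)
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: s / norm for cls, s in scores.items()}


# Hypothetical build data using two column names from Table 1.
records = [
    {"CUST_GENDER": "M", "EDUCATION": "Masters", "AFFINITY_CARD": 1},
    {"CUST_GENDER": "F", "EDUCATION": "Masters", "AFFINITY_CARD": 1},
    {"CUST_GENDER": "M", "EDUCATION": "HS-grad", "AFFINITY_CARD": 0},
    {"CUST_GENDER": "F", "EDUCATION": "HS-grad", "AFFINITY_CARD": 0},
    {"CUST_GENDER": "M", "EDUCATION": "HS-grad", "AFFINITY_CARD": 0},
]
model = train_naive_bayes(records, "AFFINITY_CARD")
posterior = predict(model, {"CUST_GENDER": "F", "EDUCATION": "Masters"})
```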
The singleton threshold value provides a threshold for the count of items that occur
frequently in the data. Given k as the number of times the item occurs in the data, P as
the number of data profiles or records and t as the singleton threshold expressed as a
percentage of P, the item is considered to occur frequently if k >= t*P. [Oracle
Help for Java, 1997-2004]
The pairwise threshold provides a threshold for the count of pairs of items that occur
frequently in the data. Given k as the number of times two items appear together in
the profiles and P and t as above, a pair is frequent if k > t*P. [Oracle Help for
Java, 1997-2004]
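The two threshold rules quoted above can be written out directly. In this sketch the function names are hypothetical, and the threshold value 0.01 from Table 2 is treated as a fraction of P, which is an assumption about how the parameter is scaled.

```python
def is_frequent_item(k, num_profiles, singleton_threshold):
    """Singleton rule as quoted: an item is frequent if k >= t * P."""
    return k >= singleton_threshold * num_profiles


def is_frequent_pair(k, num_profiles, pairwise_threshold):
    """Pairwise rule as quoted: a pair is frequent if k > t * P."""
    return k > pairwise_threshold * num_profiles


P = 1500  # records in MINING_DATA_BUILD
t = 0.01  # threshold used for nb and nbw, assumed to be a fraction of P

item_kept = is_frequent_item(16, P, t)     # 16 occurrences against a cut-off of 15
item_dropped = is_frequent_item(14, P, t)  # 14 occurrences fall below the cut-off
```

Under this reading the cut-off is 15 occurrences. With the rules exactly as quoted, an item occurring 15 times counts as frequent while a pair occurring 15 times does not.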
Adaptive Bayes Network works by ranking the attributes and then building a Naïve
Bayes model in order of the ranked attributes. The algorithm then builds a set of
features or ‘trees’ which are in turn tested against the model in order to determine
whether they improve the accuracy of the model or not. If no improvement is found
the feature is discarded. When the number of discarded features reaches a certain level
the building stops and the model consists of those features that remain. [Oracle Data
Mining Tutorial, Release 9.0.4, 2004]
The detail of the various classification models is shown below in Table 2:
Model Name   Algorithm        Weighting                 Parameters                                            Features
nb           Naïve Bayes      none                      Singleton threshold: 0.01; Pairwise threshold: 0.01   NA
nbw          Naïve Bayes      3.0 for false negatives   Singleton threshold: 0.01; Pairwise threshold: 0.01   NA
abn          Adaptive Bayes   none                      default parameters                                    Multi-feature
abnw         Adaptive Bayes   3.0 for false negatives   default parameters                                    Multi-feature
Table 2 Model Detail
2.3.3.1 Training and Tuning the Model
Using ODM it is possible to assign weights to the target values when using Naïve
Bayes or Adaptive Bayes Network, so that the model predicts more of one kind of value
if testing shows a large number of false predictions of a certain kind. [Oracle Data
Mining Tutorial, Release 9.0.4, 2004] Bias can thus be built into the model to increase
predictions of the desired target value. In this investigation weighting was used to
introduce this bias because, when testing the abn model, it was apparent from the
confusion matrix that the model predicted 0 (no) in every case and that these
predictions were false in 346 of the cases. This number of false negative predictions
was very high, so weighting was used to reduce it. nbw was then weighted in the same
way for purposes of comparison.
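The abn test result can be checked by hand, assuming the test set holds the 1500 records described in Section 2.3.1 and that all 346 errors are false negatives (the model never predicted 1):

```latex
\text{accuracy}_{\text{abn}} = \frac{1500 - 346}{1500} = \frac{1154}{1500} \approx 76.93\%
```

This agrees with the test accuracy of 76.93333% reported for abn in Table 5, and shows that a model can reach a seemingly reasonable accuracy while never predicting the positive class.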
Although the weighting is chosen by trial and error, a weighting of 3.0 was used, as
suggested by [Oracle Data Mining Tutorial, Release 9.0.4, 2004]. The weighting
is associated with a certain type of false prediction, false negative or false positive,
and the model then treats a false prediction of that kind as three times as costly as an
error of the other kind. This forces the model to make more predictions in the other
direction. [Oracle Data Mining Tutorial, Release 9.0.4, 2004]
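One common way to realise such a weighting is to choose the prediction with the lower expected cost. The sketch below illustrates that rule; the function name is hypothetical, and whether ODM computes its predictions and cost figures in exactly this way is an assumption, not something stated in the tutorial.

```python
def weighted_prediction(p_positive, fn_weight=3.0, fp_weight=1.0):
    """Pick the prediction with the lower expected cost.

    p_positive: the model's probability that the target is 1.
    Returns (prediction, expected_cost_of_that_prediction).
    """
    # Predicting 0 is wrong when the target is 1 (a false negative).
    cost_if_predict_0 = fn_weight * p_positive
    # Predicting 1 is wrong when the target is 0 (a false positive).
    cost_if_predict_1 = fp_weight * (1.0 - p_positive)
    if cost_if_predict_1 <= cost_if_predict_0:
        return 1, cost_if_predict_1
    return 0, cost_if_predict_0


# With a false-negative weight of 3.0, a record with P(1) = 0.3 is predicted 1,
# whereas with equal weights the same record is predicted 0.
weighted = weighted_prediction(0.3)
unweighted = weighted_prediction(0.3, fn_weight=1.0, fp_weight=1.0)
```

Under this assumed rule, a false-negative weight of 3.0 lowers the probability needed for a positive prediction from 0.5 to 0.25. Applied to the nb probabilities in Table 3, the rule gives costs of about 0.0402 and 0.6438, close to the nbw figures in Table 4. That agreement is an observation on two rows only and is not a documented property of ODM.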
2.3.4 Results
Testing the models on the test data set, MINING_DATA_TEST, produced confusion
matrices which were used to determine the accuracy of each model on the test data.
The accuracy of the respective models is shown in Table 5.
The models were then applied to the new data in the MINING_DATA_APPLY set.
The results were listed by customer ID and showed a prediction, 1 meaning yes or 0
meaning no, of whether a customer was likely to increase spending if offered an
affinity card. The probability of each prediction was also given, as shown in the
sample in Table 3. The results in this extract can be interpreted as follows: the
customer with ID 100408 is predicted to increase spending, and this prediction is given
with a probability of 0.9598. Customer 100413 is predicted not to increase spending,
with a probability of 0.7854.
PREDICTION   PROBABILITY   CUST_ID
1            0.9598        100408
0            0.7854        100413
Table 3. Extract of results from model nb.
The weighted models provided predictions and cost figures. The cost figure is
provided instead of a probability because in these cases the model makes predictions
based on cost in terms of the weighting. An extract from these results is
shown in Table 4. It can be interpreted as follows: customer 100408 is predicted to
use an affinity card, and the cost of this prediction being incorrect is 0.0401.
Customer 100413 is predicted not to use an affinity card, and if this prediction is
incorrect the cost is higher, at 0.6437. A low cost can be interpreted as a higher
probability, as can be seen from the extract, but it is not possible to calculate
probability directly from cost. See Tables 6 and 7 for comparative cost and probability
figures for the models. [Oracle Data Mining Tutorial, Release 9.0.4, 2004]
PREDICTION   COST     CUST_ID
1            0.0401   100408
0            0.6437   100413
Table 4. Extract of results from model nbw.
2.3.5 Interpretation of Results
It was possible to compare the results between the four models as shown in Table 5
and in some of the cases compare the difference in average probability or difference
in average cost as shown in Tables 6 and 7.
   Models Compared  Accuracy of Model Test    Percentage Positive   Number of predictions in   Percentage Agreeing
                                              Predictions           agreement (total 1500)     Predictions
1  nb vs nbw        79.93333% vs 78.86667%    30.33% vs 33.80%      824                        54.9333%
2  nb vs abn        79.93333% vs 76.93333%    30.33% vs 0.00%       1045                       69.6667%
3  nb vs abnw       79.93333% vs 73.06667%    30.33% vs 43.33%      767                        51.1333%
4  abn vs nbw       76.93333% vs 78.86667%    0.00% vs 33.80%       993                        66.2000%
5  nbw vs abnw      78.86667% vs 73.06667%    33.80% vs 43.33%      1279                       85.2667%
6  abn vs abnw      76.93333% vs 73.06667%    0.00% vs 43.33%       850                        56.6667%
Table 5 Comparison of Model Results on sample data set, MINING_DATA_APPLY, of 1500 records.
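The agreement figures in Table 5 are simple proportions of matching predictions. The sketch below shows the calculation; the function name and the toy prediction lists are hypothetical, and the last line recomputes the percentage for Comparison 1 from the count of 824 agreeing predictions given in the table.

```python
def agreement(preds_a, preds_b):
    """Return (count, percentage) of records on which two models agree."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must be the same length")
    matches = sum(1 for a, b in zip(preds_a, preds_b) if a == b)
    return matches, 100.0 * matches / len(preds_a)


# Hypothetical toy predictions from two models on six records.
model_a = [1, 0, 0, 1, 0, 1]
model_b = [1, 0, 1, 1, 0, 0]
count, pct = agreement(model_a, model_b)

# Recomputing a reported figure: 824 agreeing predictions out of 1500 (nb vs nbw).
pct_nb_vs_nbw = 100.0 * 824 / 1500
```

Because the predictions are matched record by record, two models can have similar percentages of positive predictions and still agree on relatively few individual customers.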
                                               nb         abn
Average probability for positive predictions   0.893733   0
Average probability for negative predictions   0.968798   0.761749
Table 6. Comparison of average probabilities for unweighted models
                                        nbw        abnw
Average cost for positive predictions   0.159508   0.533948
Average cost for negative predictions   0.039525   0.191902
Table 7. Comparison of average costs for weighted models
Comparisons 2, 4 and 6 can be excluded from any noteworthy results, as it is
apparent that Adaptive Bayes Network with no weighting gives no positive predictions
for the target attribute when the model is applied to the new data. This is unrealistic:
although the model abn showed an accuracy of 76.93333% during testing, this is not
corroborated at all during application of the model, which therefore seems unreliable.
Comparison 1 shows the effect that weighting has when the same algorithm is used to
build two different models. As expected, nbw has a higher percentage of positive
predictions due to the weighting, but the percentage of agreeing predictions,
54.9333%, shows little corroboration. However, the two models built using Naïve
Bayes show the highest accuracy during testing.
Comparison 3 also shows a low level of corroboration at only a little over 51%. This
is possibly due to the fact that two different algorithms are used and weighting is used
in abnw and not nb.
The most interesting comparison is between nbw and abnw (Comparison 5). This
has a much higher percentage of agreement, 85.2667%, than the other comparisons.
This is interesting because two different algorithms are used but the weighting
used is the same for both. The difference in test accuracy between the two models is
about 5.8 percentage points. For Naïve Bayes, weighting had only a small effect on
accuracy (nbw compared with nb). The effect was larger between abn and abnw,
although abn made no positive predictions and its accuracy can be deemed
unreliable in that case.
3. Conclusion
After building, testing and applying the models to the data it was possible to conduct a
comparison of the results.
It is possible to conclude that the only case in which the results of a Naïve Bayes
model and an Adaptive Bayes model seem to corroborate each other is when a
weighting of 3.0 for false negatives is set for both models. This is possibly due to the
fact that the Adaptive Bayes model only provides realistic results in this case and the
results of Naïve Bayes are affected by the weighting to show similar results to the
Adaptive Bayes model.
The results were not as expected: it was anticipated that the results of the two
types of model would show more similarities in most of the cases. For this
reason it appears that the choice of modelling algorithm and parameters can have a
significant impact on results from the same data even when the category of data
mining technique is the same.
4. Future Work
As an extension to this investigation it is hoped that a similar comparison may be
performed on data that is of interest to the university. The data being considered
documents students’ school performance and their subsequent performance at
university. It will be of interest to determine whether a pattern is present in such data,
as well as to compare the results of the data mining given by the different
models in DM4J.
References
[Michael J.A. Berry and Gordon S. Linoff, 2000], Mastering Data Mining: The Art and
Science of Customer Relationship Management, USA, Wiley Computer Publishing.
[Richard J. Roiger and Michael W. Geatz, 2003], Data Mining: A Tutorial-Based Primer,
Boston, Massachusetts, Addison Wesley.
[Oracle9i Data Mining Concepts Release 2 (9.2), 2002], Oracle Technology Network, March 2002,
<http://www.lc.leidenuniv.nl/awcourse/oracle/datamine.920/a95961/preface.htm>
[Oracle Data Mining Tutorial, Release 9.0.4, 2004], Oracle Technology Network, February 2004,
<http://www.oracle.com/technology/products/bi/odm/9idm4jv2.html>
[Oracle Help for Java, 1997-2004], Version 4.2.5.1.0, Copyright 1997-2004.