An Investigation of Commercial Data Mining

Emily Davis
Supervisor: John Ebden

Abstract: This paper describes an investigation of a commercial data mining suite, specifically that of Oracle9i. The investigation was conducted to determine the type of results achieved when data mining models created with Oracle’s data mining components are applied to data. Two types of model in the same category, Naïve Bayes and Adaptive Bayes Network, were built and their results compared in order to determine whether the results supported each other and where they differed. It was concluded that only one of six comparisons showed very similar results for the two classification algorithms, and that the choice of modelling algorithm can therefore have a significant impact on results from the same data even when the category of data mining technique is the same.

1. Introduction

This paper describes the method and results of investigating Oracle9i data mining, specifically the algorithms that fall into the classification category, in order to determine the type of results produced and whether the results from the different models support each other. Four models were initially built using classification algorithms in the Oracle9i data mining suite. The two algorithms used were Naïve Bayes and Adaptive Bayes Network. The algorithms were applied to the data to build, test and apply the models, and the results were documented for different combinations of parameter settings.

2. Methodology

2.1 Preparation

An Oracle database was configured, and the tools and software for data mining were installed and configured for use with the database. The Oracle Data Mining (ODM) suite is made up of two components, the data mining Java API and the Data Mining Server (DMS) [Oracle9i Data Mining Concepts Release 2 (9.2), 2002]. The DMS provides a repository of metadata of the input and result objects of data mining.
For the purposes of this investigation JDeveloper 10g provides access to the Java API and the DMS. The data mining itself is performed using DM4J 9.0.4, an extension of JDeveloper that provides a number of wizards which automatically generate the Java programs that perform the data mining when they are run [Oracle Data Mining Tutorial, Release 9.0.4, 2004].

2.2 Algorithms

According to Berry and Linoff [2000], directed data mining, or supervised learning, involves using data to build a model that describes one particular variable of interest in terms of the rest of the data. This category includes techniques such as classification, estimation and prediction.

Roiger and Geatz [2003] define input variables as independent variables and output variables as dependent variables. In supervised learning a predicted value of the dependent variable is produced as output. Roiger and Geatz [2003] describe classification as a technique where the dependent or output variable is categorical. The emphasis of the model is on assigning new instances of data to categorical classes.

ODM supports the following classification algorithms, which were selected for this experiment [Oracle9i Data Mining Concepts Release 2 (9.2), 2002]:

    Adaptive Bayes Network supporting decision trees (classification)
    Naïve Bayes (classification)

2.3 Investigation

In order to perform comparisons during the evaluation of ODM it was necessary to select two data mining algorithms that fall into the same categories, that is, supervised learning and classification. Naïve Bayes for classification and Adaptive Bayes Network for classification were therefore selected, as the format of the results they produce is comparable. Both algorithms allow for building the model, testing the model, computing model lift (a measure of how quickly the model finds actual positive target values) and applying the model to new data.
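The lift measure mentioned above can be illustrated with a short sketch. It assumes a common definition of cumulative lift: records are ranked by the predicted probability of the positive class, and the positive rate among the top-ranked records is divided by the overall positive rate. The function name cumulative_lift, the number of quantiles and the input lists are hypothetical and are not taken from ODM; the source does not state how DM4J computes its lift charts.

```python
def cumulative_lift(scores, actuals, n_quantiles=10):
    """Return the cumulative lift at each quantile cutoff.

    scores  -- predicted probability of the positive class, one per record
    actuals -- 1 for an actual positive target value, 0 otherwise
    (Illustrative helper; not part of the ODM or DM4J API.)
    """
    n = len(scores)
    if n != len(actuals):
        raise ValueError("scores and actuals must have the same length")
    if n < n_quantiles:
        raise ValueError("need at least one record per quantile")
    total_positives = sum(actuals)
    if total_positives == 0:
        raise ValueError("lift is undefined without actual positives")

    # Rank records from most to least likely positive.
    ranked = [a for _, a in sorted(zip(scores, actuals),
                                   key=lambda pair: pair[0],
                                   reverse=True)]
    base_rate = total_positives / n

    lifts = []
    for q in range(1, n_quantiles + 1):
        cutoff = (q * n) // n_quantiles       # records in the top q quantiles
        found = sum(ranked[:cutoff])          # actual positives among them
        lifts.append((found / cutoff) / base_rate)
    return lifts
```

A lift well above 1 in the first quantiles indicates that the model concentrates actual positives at the top of its ranking; the value at the final quantile is always 1, because all records are then included.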
2.3.1 Data

The data used in this experiment consists of three tables that are stored in the Oracle database. The tables are MINING_DATA_BUILD, MINING_DATA_TEST and MINING_DATA_APPLY and are distributed as part of a DM4J tutorial [Oracle Data Mining Tutorial, Release 9.0.4, 2004]. The data represents the demographics of customers of an electronics shop chain that would like to offer loyalty cards (AFFINITY_CARD) to customers who are expected to increase their buying. The tables have identical structure, as required by the data mining tasks, and each consists of 1500 records, none of which are identical. The table structure is shown in Table 1:

Name                      Data Type   Size   Nulls?
CUST_ID                   NUMBER
CUST_GENDER               CHAR
AGE                       NUMBER
CUST_MARITAL_STATUS       VARCHAR2    60
COUNTRY_NAME              VARCHAR2    120
CUST_INCOME_LEVEL         VARCHAR2    90     YES
EDUCATION                 VARCHAR2    63     YES
OCCUPATION                VARCHAR2    63     YES
HOUSEHOLD_SIZE            VARCHAR2    63     YES
YRS_RESIDENCE             NUMBER      3      YES
AFFINITY_CARD             NUMBER      10     YES
BULK_PACK_DISKETTES       NUMBER      10     YES
FLAT_PANEL_MONITOR        NUMBER      10     YES
HOME_THEATER_PACKAGE      NUMBER      10     YES
BOOKKEEPING_APPLICATION   NUMBER      10     YES
PRINTER_SUPPLIES          NUMBER      10     YES
Y_BOX_GAMES               NUMBER      10     YES
OS_DOC_SET_KANJI          NUMBER      10     YES

Table 1 Mining Data Table Structure

MINING_DATA_BUILD is used for building the data mining models for both algorithms. MINING_DATA_TEST is used as the test data to evaluate the effectiveness of the models created from the build data. Roiger and Geatz [2003] state that evaluation of supervised learning models involves determining the level of predictive accuracy. Such models can be evaluated by comparing their test set error rates with expected rates obtained from historical data of a similar form, in order to determine the accuracy of the models and, if need be, which model to apply [Roiger and Geatz, 2003]. MINING_DATA_APPLY is the data to which the built and tested model is applied in order to make classifications.
The results of applying the models to the data are stored by DM4J for inspection and use. It is also possible to export the results to a spreadsheet format, which was done in this case to allow for comparison.

2.3.2 Testing Models

The test results produced by DM4J are depicted in confusion matrices and lift charts. Confusion matrices can be used to determine the accuracy of classification models and to show the number of false negative or false positive predictions made by the model on the test data. Confusion matrices are best used for evaluating the accuracy of models using categorical data, as is the case here [Roiger and Geatz, 2003].

2.3.3 Building the Models

Four models were built using the build data, two of the Naïve Bayes form and two of the Adaptive Bayes Network form. The models were named nb, nbw, abn and abnw. All the models were built using AFFINITY_CARD as the target attribute, that is, the attribute to be predicted. The algorithms use the other attribute values in a record to predict whether a customer is likely to increase spending if offered an affinity card.

Naïve Bayes works by looking at the build data and calculating conditional probabilities for the target value, AFFINITY_CARD. This is done by observing the frequency of certain attribute values and combinations thereof [Oracle Data Mining Tutorial, Release 9.0.4, 2004]. The two parameters that must be supplied to the Naïve Bayes build wizard indicate how outliers in the data should be treated; occurrences below the threshold values are ignored when creating the model [Oracle Data Mining Tutorial, Release 9.0.4, 2004]. The singleton threshold value provides a threshold for the count of items that occur frequently in the data.
Given k as the number of times the item occurs in the data, P as the number of data profiles or records and t as the singleton threshold expressed as a percentage of P, the item is considered to occur frequently if k >= t*P [Oracle Help for Java, 1997-2004]. The pairwise threshold provides a threshold for the count of pairs of items that occur frequently in the data. Given k as the number of times two items appear together in the profiles, and P and t as above, a pair is frequent if k > t*P [Oracle Help for Java, 1997-2004].

Adaptive Bayes Network works by ranking the attributes and then building a Naïve Bayes model in order of the ranked attributes. The algorithm then builds a set of features or ‘trees’, which are in turn tested against the model in order to determine whether or not they improve its accuracy. If no improvement is found the feature is discarded. When the number of discarded features reaches a certain level the building stops, and the model consists of those features that remain [Oracle Data Mining Tutorial, Release 9.0.4, 2004]. The detail of the various classification models is shown in Table 2:

Model Name   Algorithm        Weighting                 Parameters                                            Features
nb           Naïve Bayes      none                      Singleton threshold: 0.01; Pairwise threshold: 0.01   NA
nbw          Naïve Bayes      3.0 for false negatives   Singleton threshold: 0.01; Pairwise threshold: 0.01   NA
abn          Adaptive Bayes   none                      default parameters                                    Multi-feature
abnw         Adaptive Bayes   3.0 for false negatives   default parameters                                    Multi-feature

Table 2 Model Detail

2.3.3.1 Training and Tuning the Model

Using ODM it is possible to assign weights to the target value when using Naïve Bayes or Adaptive Bayes, so that the model predicts more of one kind of value if testing shows a large number of false predictions of a certain kind [Oracle Data Mining Tutorial, Release 9.0.4, 2004]. Bias can thus be built into the model to increase predictions of the desired target value.
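The effect of such a weight on a single prediction can be sketched with a generic expected-cost decision rule. This is an assumption made for illustration only: the source does not describe how ODM applies the weights internally, and the function name cost_sensitive_prediction and its parameters are hypothetical.

```python
def cost_sensitive_prediction(p_positive, fn_weight=3.0, fp_weight=1.0):
    """Choose the class with the lower expected misclassification cost.

    p_positive -- model's estimated probability that the target is 1
    fn_weight  -- cost assigned to a false negative (assumed value 3.0)
    fp_weight  -- cost assigned to a false positive
    Returns (prediction, expected_cost). Illustrative only; not ODM's API.
    """
    if not 0.0 <= p_positive <= 1.0:
        raise ValueError("p_positive must lie between 0 and 1")

    # Predicting 0 is wrong when the record is actually positive.
    cost_if_predict_0 = fn_weight * p_positive
    # Predicting 1 is wrong when the record is actually negative.
    cost_if_predict_1 = fp_weight * (1.0 - p_positive)

    # Ties are resolved in favour of the positive class.
    if cost_if_predict_1 <= cost_if_predict_0:
        return 1, cost_if_predict_1
    return 0, cost_if_predict_0
```

Under this rule, with equal weights a record is predicted positive only when its positive probability is at least 0.5, whereas with a false negative weight of 3.0 the same rule predicts positive from a probability of 0.25 upwards. This is one way in which weighting could shift a model towards positive predictions.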
In this investigation weighting was used to introduce this bias because, when testing the abn model, it was apparent from the confusion matrix that the model predicted 0 (no) in every case and that these predictions were false in 346 of the cases. This level of false predictions was very high, so weighting was a reasonable way to decrease the number of false negative predictions. nbw was then weighted for purposes of comparison.

Although the weighting is chosen by trial and error, a weighting of 3.0 was used, as suggested in [Oracle Data Mining Tutorial, Release 9.0.4, 2004]. The weighting is associated with a certain type of false prediction, false negative or false positive, and the model then treats a false prediction of that kind as three times as costly as an error of the other kind. This forces the model to make more predictions in the other direction [Oracle Data Mining Tutorial, Release 9.0.4, 2004].

2.3.4 Results

Testing the models on the test data set, MINING_DATA_TEST, produced confusion matrices which were used to determine the accuracy of each model. The accuracy of the respective models is shown in Table 5.

The models were then applied to the new data in the MINING_DATA_APPLY set. The results were listed by customer ID and showed a prediction, 1 meaning yes or 0 meaning no, of whether a customer was likely to increase spending if offered an affinity card. The probability of this prediction was also given, as shown in the sample in Table 3. The results in this extract can be interpreted as follows: the customer with ID 100408 is predicted to increase spending, and this prediction is given with a probability of 0.9598. Customer 100413 is predicted not to increase spending, with a probability of 0.7854.

PREDICTION   PROBABILITY   CUST_ID
1            0.9598        100408
0            0.7854        100413

Table 3. Extract of results from model nb.

Those models that were weighted provided predictions and cost figures.
This cost figure is provided instead of a probability because in these cases the model makes predictions based on cost in terms of the weighting. An extract from these results is shown in Table 4. This extract can be interpreted as follows: customer 100408 is predicted to use an affinity card, and the cost of this prediction being incorrect is 0.0401. Customer 100413 is predicted not to use an affinity card, and if this prediction is incorrect the cost is higher, at 0.6437. Low cost can be interpreted as higher probability, as can be seen from the extract, but it is not possible to calculate probability directly from cost. See Tables 6 and 7 for comparative probability and cost figures for the models [Oracle Data Mining Tutorial, Release 9.0.4, 2004].

PREDICTION   COST     CUST_ID
1            0.0401   100408
0            0.6437   100413

Table 4. Extract of results from model nbw.

2.3.5 Interpretation of Results

It was possible to compare the results of the four models as shown in Table 5 and, in some cases, to compare the difference in average probability or in average cost as shown in Tables 6 and 7.

No.  Models Compared  Accuracy of Model Test    Percentage Positive  Number of Predictions in   Percentage Agreeing
                                                Predictions          Agreement (total 1500)     Predictions
1    nb vs nbw        79.93333% vs 78.86667%    30.33% vs 33.80%     824                        54.9333%
2    nb vs abn        79.93333% vs 76.93333%    30.33% vs 0.00%      1045                       69.6667%
3    nb vs abnw       79.93333% vs 73.06667%    30.33% vs 43.33%     767                        51.1333%
4    abn vs nbw       76.93333% vs 78.86667%    0.00% vs 33.80%      993                        66.2000%
5    nbw vs abnw      78.86667% vs 73.06667%    33.80% vs 43.33%     1279                       85.2667%
6    abn vs abnw      76.93333% vs 73.06667%    0.00% vs 43.33%      850                        56.6667%

Table 5 Comparison of Model Results on sample data set, MINING_DATA_APPLY, of 1500 records.

                                               nb         abn
Average probability for positive predictions   0.893733   0
Average probability for negative predictions   0.968798   0.761749

Table 6. Comparison of average probabilities for unweighted models

                                        nbw        abnw
Average cost for positive predictions   0.159508   0.533948
Average cost for negative predictions   0.039525   0.191902

Table 7. Comparison of average costs for weighted models

Comparisons 2, 4 and 6 can be excluded from any noteworthy results, because Adaptive Bayes with no weighting gives no positive predictions for the target attribute when the model is applied to the new data. This is unrealistic: although the model abn showed an accuracy of 76.93333% during testing, this is not corroborated at all during application of the model, so its results appear unreliable.

Comparison 1 shows the effect that weighting has when the same algorithm is used to build two different models. As expected, nbw has a higher percentage of positive predictions due to the weighting, but the percentage of agreeing predictions, 54.9333%, shows little corroboration. However, the two models built using Naïve Bayes show the highest accuracy during testing.

Comparison 3 also shows a low level of corroboration, at only a little over 51%. This is possibly because two different algorithms are used and weighting is used in abnw but not in nb.

The most interesting comparison is between nbw and abnw (Comparison 5). This has a far higher percentage of agreement, 85.2667%, than the other comparisons. This is interesting because two different algorithms are used but the weighting is the same for both. The difference in accuracy between the two models is a little under 6 percentage points.

In the case of Naïve Bayes, weighting had only a small effect on accuracy (nb compared with nbw, a drop of about 1.1 percentage points). The effect was larger for Adaptive Bayes (abn compared with abnw, a drop of about 3.9 percentage points), although abn made no positive predictions and its accuracy can be deemed unreliable.

3. Conclusion

After building, testing and applying the models to the data it was possible to conduct a comparison of the results.
It is possible to conclude that the only case in which the results of a Naïve Bayes model and an Adaptive Bayes model seem to corroborate each other is when a weighting of 3.0 for false negatives is set for both models. This is possibly because the Adaptive Bayes model only provides realistic results in this case, and the weighting moves the results of Naïve Bayes towards those of the Adaptive Bayes model. These results were not expected: the two types of model were expected to show more similarities in most of the cases. For this reason it appears that the choice of modelling algorithm and parameters can have a significant impact on results from the same data even when the category of data mining technique is the same.

4. Future Work

As an extension to this investigation it is hoped that a similar comparison may be performed on data that is of interest to the university. The data being considered documents students’ school performance and their subsequent performance at university. It will be of interest to determine whether a pattern is present in such data, as well as to compare the data mining results given by the different models in DM4J.

References

[Michael J.A. Berry and Gordon S. Linoff, 2000], Mastering Data Mining: The Art and Science of Customer Relationship Management, USA, Wiley Computer Publishing.

[Richard J. Roiger and Michael W. Geatz, 2003], Data Mining: A Tutorial-Based Primer, Boston, Massachusetts, Addison Wesley.

Oracle9i Data Mining Concepts Release 2 (9.2), Oracle Technology Network, March 2002, <http://www.lc.leidenuniv.nl/awcourse/oracle/datamine.920/a95961/preface.htm>

Oracle Data Mining Tutorial, Release 9.0.4, Oracle Technology Network, February 2004, <http://www.oracle.com/technology/products/bi/odm/9idm4jv2.html>

Oracle Help for Java, Version 4.2.5.1.0, Copyright 1997-2004.