DEPARTMENT OF COMPUTING
COLLEGE OF BUSINESS & TECHNOLOGY
CSCI-4957 - DATA MINING
FALL SEMESTER 2017

PARALYZED VETERANS OF AMERICA DATA MINING PROJECT WRITE-UP

Leandro Nascimento
nascimento@etsu.edu
E00529380

JOHNSON CITY, TENNESSEE
DECEMBER 6TH, 2017

LEANDRO DA PIA NASCIMENTO
PARALYZED VETERANS OF AMERICA DATA MINING – FINAL PROJECT

This document covers the requirements of the CSCI-4957 - Data Mining class, Fall 2017 Semester - Final Project. This is the final version of the document, and it is in accordance with the requirements listed by Prof. Dr. Jay Jarman, Ph.D., in this course's syllabus and D2L terms (East Tennessee State University learning portal).

DOCUMENT EVALUATED BY

______________________________________
Prof. Dr. Jay Jarman, Ph.D.
jarman@etsu.edu
East Tennessee State University

Johnson City, TN – December _____ , 2017.

TABLE OF CONTENTS

1 – EXECUTIVE SUMMARY
4 – DATA SET DESCRIPTION
4.1 PRESENTATION
4.2 STATISTICS EVALUATED
4.3 SAMPLE DATA
4.4 ATTRIBUTES
4.5 TRAINING & TEST RESULTS
5 – ATTRIBUTE SELECTION
5.1 PRESENTATION
5.2 SORTING THE ATTRIBUTES
5.3 ATTRIBUTES SELECTED
5.4 ADDITIONAL ATTRIBUTES
6 – MISSING DATA
7 – CLASSIFICATION PROXY
8 – MODELS
8.1 INTRODUCTION
8.2 MODEL I DIMENSION
8.3 MODEL I ALGORITHMS
8.3.1 MLP – MultiLayerPerceptron
8.4 MODEL I RESULTS
8.4.1 MLP – Multilayer Perceptron Results
8.5 MODEL I CONCLUSION
8.6 MODEL II DIMENSION
8.7 MODEL II ALGORITHMS
8.7.1 OneR – As the base algorithm
8.7.2 J48 – Analyze the data set
8.7.3 J48 – Repeated training and testing
8.7.4 ZeroR / J48 / NaiveBayes / IBk – Baseline accuracy
8.8 MODEL II RESULTS
8.8.1 OneR – As the base algorithm
8.8.2 J48 – Analyze the data set
8.8.3 J48 – Repeated training and testing
8.8.4 ZeroR / J48 / NaiveBayes / IBk – Baseline accuracy
8.9 MODEL II CONCLUSION
8.10 MODEL III DIMENSION
8.11 MODEL III ALGORITHMS
8.11.1 Linear Regression
8.11.2 Logistic Regression
8.11.3 SMO
8.11.4 LibSVM with defaults
8.12 MODEL III RESULTS
8.12.1 Linear Regression
8.12.2 Logistic Regression
8.12.3 SMO Results – All Variations
8.12.4 LibSVM with defaults
9 – EVALUATION
10 – LESSONS LEARNED
11 – REFERENCES

1 – EXECUTIVE SUMMARY

The goal of this project is to predict likely donors while countering the inverse correlation between the probability of donating and the dollar amount of the gift, maximizing the accuracy of the predictions based on each candidate's personal, economic, financial, social and demographic data. This analysis also intends to reduce the cost of large-scale mailing to the candidates and increase the amount of money collected by targeting only the candidates who are potential or valuable donors, building a mailing list that predicts future response behavior from historical data recording whether a candidate responded to a campaign and, if so, the amount collected.

I present a solution to this project after exploring a myriad of different data mining techniques, association rules, decision trees, machine learning algorithms and prediction results, using the data set cup98TrgWTarget.csv as the starting point, which contains 47,705 instances and 481 attributes, supported by the following tools: Weka (1), SAP Cloud Computing Analytics and Tableau (2).

Missing or bad data plays a significant role in the prediction; therefore, handling anomalies is one of the many tasks included in the data set preparation. Working with a large data set makes processing very slow, so this project included four data sets of different sizes. Experiment procedures were performed using multiple algorithms and functions, and a comprehensive list of test variations is included in this document. Some significant results were observed using decision trees, functions, rules, etc., and each result was evaluated considering its accuracy (error rate, proportion of explained variation) and significance (statistical, reasonableness, sensitivity, value of decisions). With a large set of historical data available to train the algorithms, it was possible to use support tools to produce statistics and compare them with the output generated in Weka.

The ingredients of success are not only the available tools and database. I had to adopt different strategies and synthesize what I learned from them for the benefit of the prediction results. One of those strategies was researching the literature and academic papers available on donor retention. In addition, I discussed the problem with professionals who work in non-profit organizations, charities and fundraising. In both cases, the consensus was unanimous: there is an inverse correlation between the probability of donating and the amount requested by the donation campaign.

(1) Weka is a data mining tool that supports applying machine learning algorithms to extract meaningful information from raw data. Weka contains 100+ algorithms for classification, 75 for data preprocessing, 25 to assist with feature selection and 20 for clustering, finding association rules, etc.
(2) SAP Cloud Computing Analytics and Tableau are predictive analytics, business intelligence and data visualization tools.

When analyzing the results, I had to take into consideration that the higher the amount donated, the more cautious the donor is in deciding. That is, there are few donors making donations of a large amount and many donors making donations of a small amount. Because of this inverse correlation, a decision tree that ranks the results based on probability tends to rank down the donors who made larger donations instead of ranking them up.

Key words: Data Mining; Machine Learning; Clustering; Association Rule; Decision Tree; KDD.

4 – DATA SET DESCRIPTION

4.1 PRESENTATION

In this section, I will describe the data sets and answer the following questions: What statistics did I evaluate on the attributes? Are any of the attributes skewed? For instance, is the geographical information skewed toward one state more than the others, or one part of the country more than the others? Are income values skewed to lower or higher incomes, or are they equally distributed? The data is in a single file. Did I parse it into multiple files? If so, how? Did I put the data into a database from which I built data sets?

4.2 STATISTICS EVALUATED

The cup98TrgWTarget data set contains 47,705 instances about people contacted in the 1997 mailing campaign. Each instance is represented by 479 non-target attributes and two target variables indicating the "donor / non-donor" status and the actual amount donated in dollars:

• TARGET_B: Binary indicator of donor (1) or non-donor (0). There are 2,422 donors and 45,283 non-donors in the data set.
• TARGET_D: Dollar amount of the donation. TARGET_D is not an attribute to be used when running the classifier; otherwise, every instance with a dollar amount in this attribute would be identified as a donor.

The large dimensionality (481 variables) and the small target population (5%) presented an additional challenge to the extraction of characteristics from the donor class.

4.3 SAMPLE DATA

A data mining project includes a training phase and a test/validation phase. Although one data set was selected for the final prediction, a total of four data sets were created (each with its own training and test sets).

For data set 1, a simple sample was drawn from the original data set: 3,000 instances were selected from the 45,283 non-donors and 1,000 from the 2,422 donors. The table below shows the dimensions of each group; the instances were selected non-ordered and randomly sorted using Weka filters.
Table 1: Training and Test sets 1 – Non-ordered random partitioning
Filters:
• Unsupervised > Attribute > NumericToNominal – default settings
• Unsupervised > Instance > Randomize – default settings

For data set 2, to better test the robustness of the models, the partitions were created using the attribute ODATEW (date of the donor's first gift to the Vet Association). This date was chosen so that the donor and non-donor training and test ratios were approximately the same as in the non-ordered random partitioning. The cut date was 8901 (YYMM), resulting in:

Table 2: Training and Test sets 2 – Non-ordered – ODATEW 8901 cut
Filters:
• Unsupervised > Attribute > NumericToNominal – default settings
• Unsupervised > Instance > Randomize – default settings

For data set 3, I used all 47,705 existing instances, dividing them into a training set (70%) and a testing set (30%).

Table 3: Training and Test sets 3 – Non-ordered random partitioning – All instances
Filters:
• Unsupervised > Attribute > NumericToNominal – default settings
• Unsupervised > Instance > Randomize – default settings
• Unsupervised > Instance > RemovePercentage – invertSelection: FALSE, percentage: 30
• Unsupervised > Instance > RemovePercentage – invertSelection: TRUE, percentage: 30

Below is an example of how a data set was resized and prepared using Weka, taking data set 3 as the example (a scripted version of these steps is sketched at the end of this section).

Step 1: Randomize the data set to create a random permutation using the filter Randomize, which is in Weka > Filters > Unsupervised > Instance.

Step 2: Apply the filter RemovePercentage (Weka > Filters > Unsupervised > Instance) with percentage 30. Save the result to a file as the "training" data set. This step reserves a data set with 70% of the instances.

Step 3: Click the "Undo" button and apply the filter RemovePercentage (Weka > Filters > Unsupervised > Instance) with percentage 30, choosing invertSelection this time. This step selects the rest of the data (30%). Save the file as the "testing" data set.

For data set 4, I used the same instances as data set 2 but reduced the number of attributes from 30 to 19.

Table 4: Training and Test Set 4 – Non-ordered – ODATEW cut
Filters:
• Unsupervised > Attribute > NumericToNominal – default settings
• Unsupervised > Instance > Randomize – default settings

4.4 ATTRIBUTES

See section 5, ATTRIBUTE SELECTION, for more details. From the 481 attributes in the original data set, I selected 49 to be used in the training and test phases; after further evaluation, and needing to shrink the data set to gain performance, I removed attributes and ended up with 30 attributes in data sets 1, 2 and 3, and 19 attributes in data set 4. No new attributes were created, nor were attributes combined.

4.5 TRAINING & TEST RESULTS

See section 8, MODELS, for the results of the training and test phases for the four data sets.
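To make Steps 1–3 above concrete, here is a minimal sketch of the same preparation done through the Weka Java API instead of the Explorer GUI. It is only an illustration under assumptions: the input file name (cup98_subset.arff), the seed value and the position of TARGET_B as the last attribute are hypothetical, not taken from the project files.

    import java.io.File;
    import weka.core.Instances;
    import weka.core.converters.ArffSaver;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.NumericToNominal;
    import weka.filters.unsupervised.instance.Randomize;
    import weka.filters.unsupervised.instance.RemovePercentage;

    public class SplitDataset {
        public static void main(String[] args) throws Exception {
            // Load the reduced data set (file name is illustrative).
            Instances data = DataSource.read("cup98_subset.arff");

            // NumericToNominal: convert the class attribute (assumed to be the last, TARGET_B).
            NumericToNominal toNominal = new NumericToNominal();
            toNominal.setAttributeIndices("last");
            toNominal.setInputFormat(data);
            data = Filter.useFilter(data, toNominal);

            // Step 1: randomize the instance order (seed value is arbitrary here).
            Randomize randomize = new Randomize();
            randomize.setRandomSeed(42);
            randomize.setInputFormat(data);
            data = Filter.useFilter(data, randomize);

            // Step 2: RemovePercentage 30% -> the remaining 70% becomes the training set.
            RemovePercentage dropTest = new RemovePercentage();
            dropTest.setPercentage(30);
            dropTest.setInputFormat(data);
            Instances training = Filter.useFilter(data, dropTest);

            // Step 3: same filter with invertSelection -> the other 30% becomes the test set.
            RemovePercentage keepTest = new RemovePercentage();
            keepTest.setPercentage(30);
            keepTest.setInvertSelection(true);
            keepTest.setInputFormat(data);
            Instances testing = Filter.useFilter(data, keepTest);

            save(training, "training3.arff");
            save(testing, "testing3.arff");
        }

        private static void save(Instances data, String fileName) throws Exception {
            ArffSaver saver = new ArffSaver();
            saver.setInstances(data);
            saver.setFile(new File(fileName));
            saver.writeBatch();
        }
    }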
5 – ATTRIBUTE SELECTION

5.1 PRESENTATION

In this section, I will introduce the strategy I used to select the attributes used in each model and the rationale for their selection. This section will cover answers to questions like: Did I base it on a statistical approach? Did I look at information gain or one of the other statistics discussed in class? Did I use a different approach based on my intuition?

5.2 SORTING THE ATTRIBUTES

As far as the attribute selection goes, I ended up using three different strategies and then combined the knowledge I got from each of them toward a final (and common) result. The three strategies I used in the attribute selection were: intuition, data visualization, and reaching out to non-profit industry professionals.

The high dimensionality of the data set, i.e., 481 attributes, added a significant challenge to this project. Using only my intuition, I questioned whether an attribute I had removed should have been part of the model I was testing, or whether an attribute I had selected should have been tossed out. One approach I took to minimize those questions was using Tableau to view a graphical representation of the data and better understand how a given attribute correlated with the donor indicator (TARGET_B) and the dollar amount (TARGET_D), as demonstrated in the images below:

Figure: The trends of TARGET_B and TARGET_D for AGE using all instances of the cup98TrgWTarget data set

The graphical visualization of the data indicates that the AGE attribute plays a significant role in the likelihood of donating a gift. Here is another example of data visualization using Tableau where it is possible to see clearly the role that gender and homeowner status play for the donor candidates:

Figure: The trends of TARGET_B for GENDER and HOMEOWNR using all instances of the cup98TrgWTarget data set

The graphic above confirmed a finding observed in the academic papers and mentioned by professionals in the non-profit industry: females and homeowners are more inclined to donate than males and non-homeowners.

Another attribute that caught my attention was the number of card promotions received in the last 12 months and how it correlates with the donor indicator.

Figure: The plot of the sum of TARGET_B for CARDPM12 using all instances of the cup98TrgWTarget data set

In addition to attributes measured directly from the candidate, attributes related to the donor's environment were measured as well. This is possible using demographic and economic surveys conducted by government and other institutions. For example, the variable ETH refers to the ethnic percentage of the population in the region where the potential donor lives (along with other data collected from the donor's neighborhood), IC5 indicates the per capita income in the donor's neighborhood, and so on.
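As a complement to the visual exploration above, the same kind of check can be scripted. The sketch below, assuming the Weka Java API and illustrative file and attribute names (cup98_subset.arff, GENDER, TARGET_B coded 0/1), simply computes the donation rate for each value of a nominal attribute; it is not the Tableau workflow actually used in the project.

    import weka.core.Attribute;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class DonorRateByValue {
        public static void main(String[] args) throws Exception {
            // File and attribute names are illustrative.
            Instances data = DataSource.read("cup98_subset.arff");
            Attribute gender = data.attribute("GENDER");     // nominal attribute to inspect
            Attribute target = data.attribute("TARGET_B");   // assumed coded 0 = non-donor, 1 = donor

            int[] total = new int[gender.numValues()];
            int[] donors = new int[gender.numValues()];

            // Count instances and donors for each value of the inspected attribute.
            for (Instance inst : data) {
                if (inst.isMissing(gender) || inst.isMissing(target)) {
                    continue;
                }
                int v = (int) inst.value(gender);
                total[v]++;
                if ((int) inst.value(target) == 1) {
                    donors[v]++;
                }
            }

            for (int v = 0; v < gender.numValues(); v++) {
                double rate = total[v] == 0 ? 0.0 : 100.0 * donors[v] / total[v];
                System.out.printf("%s: %d of %d donated (%.2f%%)%n",
                        gender.value(v), donors[v], total[v], rate);
            }
        }
    }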
As far as the academic papers go, they point to different predictors of whether a person will donate and how much they will donate. Some of those indicators were also mentioned by professionals in the non-profit industry, and they turned out to be conclusive attributes to select for the donor prediction. The attributes I selected can be sorted into three categories:

• Demographics Data
• Census Tract Data
• Behavioral Data on previous campaigns

Weka provides an attribute evaluator tool that uses mathematical metrics to rank the attributes of a given data set by how much predictive power each attribute has for a given class. I used it to help trim the attributes I had selected, reducing them from 48 variables to 30 variables. In section 5.3, I list those attributes in more detail, including the ones that were initially considered part of the model but removed later.

5.3 ATTRIBUTES SELECTED

I listed below the attributes that are part of data sets 1, 2 and 3.

Table 5: Attributes selected for data sets 1, 2 and 3

I listed below the attributes that are part of data set 4.

Table 6: Attributes selected for data set 4

5.4 ADDITIONAL ATTRIBUTES

I listed below the attributes that were initially part of data sets 1, 2 and 3 but were removed, as their role in donor retention turned out to be inconclusive. Some of them were removed after testing them in the Weka attribute evaluator tool.

Table 7: Additional attributes

I listed below the attributes that were used in data sets 1, 2 and 3 but removed from data set 4.

Table 8: Attributes removed from data set 4

6 – MISSING DATA

Each attribute was analyzed individually. Attributes with too many instances missing information were tossed out after confirming, from academic papers and industry professionals, that they were not particularly relevant to the prediction. I also trimmed attributes where the instances had more zero than non-zero values. The attribute NUMCHLD (Number of Children) got the value '0' when there was no information, and the field INCOME (Household Income) got the average income calculated from the income in the neighborhood.
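A minimal sketch of the imputation described above, using the Weka Java API, is shown below. The file name is illustrative, and the INCOME fill value here is approximated by the data set mean via meanOrMode; the project derived it from the neighborhood income instead, so treat this only as an outline of the mechanism.

    import weka.core.Attribute;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class FillMissingValues {
        public static void main(String[] args) throws Exception {
            // File name is illustrative.
            Instances data = DataSource.read("cup98_subset.arff");
            Attribute numchld = data.attribute("NUMCHLD");
            Attribute income = data.attribute("INCOME");

            // Approximation: use the data set mean for INCOME; the project used the
            // average income of the donor's neighborhood instead.
            double meanIncome = data.meanOrMode(income);

            for (Instance inst : data) {
                if (inst.isMissing(numchld)) {
                    inst.setValue(numchld, 0);          // no information -> zero children
                }
                if (inst.isMissing(income)) {
                    inst.setValue(income, meanIncome);  // missing income -> average income
                }
            }
        }
    }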
7 – CLASSIFICATION PROXY

I used the basic principles of the knowledge discovery process, borrowed from CRISP-DM (Cross Industry Standard Process for Data Mining):

• Define the business problem
• Build the data mining database
• Explore the data
• Prepare the data for modeling
• Build the model
• Evaluate the model
• Deploy the model and results

Because of the inverse correlation between the likelihood of donating and the dollar amount of the gift, a probability-based algorithm tends to rank down, rather than rank up, donors who made bigger donations. One strategy to work around that was to evaluate four different data sets and algorithms with different characteristics:

• Functions
  o MLP – Multilayer Perceptron
  o Linear Regression
  o Logistic Regression
  o SMO
  o LibSVM
• Rules
  o OneR
  o ZeroR
• Trees
  o J48
• Bayes
  o Naive Bayes
• Lazy
  o IBk

The validation of the models was based on a simple validation method: splitting the data into a training set and a testing set, as this is easy to use and understand and gives a good estimate of the prediction error for reasonably large data sets. The four data sets were tested with many of the algorithms listed above, most of the time using more than one test option, which resulted in a large number of results. One of the test options that turned out to be very valuable on small data sets was n-fold cross-validation, which divides the data into n equal-sized groups and builds a model on the data with one group left out. The left-out group is predicted and the prediction error rate is calculated. This is repeated for each of the groups, and the average over all n folds is used as the model error rate.

8 – MODELS

8.1 INTRODUCTION

This section contains subsections for each model built and evaluated during this project. For each model, when relevant, I describe the following:

• The algorithms I used
• Test options – X-fold cross-validation, 60/40 split, etc.
• Confusion matrix
• The performance metrics I used to evaluate the model and their interpretation
• Error analysis

8.2 MODEL I DIMENSION

In the first model, I used two data sets (refer to section 4 for more details on the strategy used to create the data sets).

Table 9: Model I dimension

8.3 MODEL I ALGORITHMS

For the first model, I used the MLP (Multilayer Perceptron) algorithm. For each of the data sets I built a few testing schemas; they are listed in detail below.

8.3.1 MLP – MultiLayerPerceptron

MLP is an artificial neural network (ANN) classifier that uses backpropagation to classify instances. The network can be built by hand, created by an algorithm, or both, and it can also be monitored and modified during training time. The nodes in this network are all sigmoid (except when the class is numeric, in which case the output nodes become unthresholded linear units).
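To make the setup concrete, the sketch below shows how one of the MLP configurations described in this section (the 75 x 30 hidden-layer network with Algorithm Settings I from Table 10 below) could be built and evaluated with the Weka Java API on a supplied test set. The file names are illustrative, and this is an outline under those assumptions, not the exact runs performed in the project.

    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.MultilayerPerceptron;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class MlpModel {
        public static void main(String[] args) throws Exception {
            // File names are illustrative; TARGET_B is assumed to be the last attribute.
            Instances train = DataSource.read("training1.arff");
            Instances test = DataSource.read("testing1.arff");
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            // 3-layer network (75 x 30 hidden nodes) with Algorithm Settings I.
            MultilayerPerceptron mlp = new MultilayerPerceptron();
            mlp.setHiddenLayers("75,30");
            mlp.setLearningRate(0.4);
            mlp.setMomentum(0.5);
            mlp.setValidationThreshold(10);

            mlp.buildClassifier(train);

            // Evaluate on the supplied test set and print accuracy and confusion matrix.
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(mlp, test);
            System.out.println(eval.toSummaryString());
            System.out.println(eval.toMatrixString());
        }
    }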
As there is no exact method that states the best architecture for each problem, nor the best training algorithm to use, I opted to try a group of architectures and algorithms with some tuning strategies and, in the end, selected the one with the best performance.

Table 10: Multilayer Perceptron Architecture

MLP – Multilayer Perceptron Network Architectures
                     Hidden Layer    Output Layer
2-Layer Network      75              2
2-Layer Network      25              2
3-Layer Network      75 x 30         2
3-Layer Network      25 x 10         2

Algorithm Settings I:
• Validation Threshold: 10
• Learning Rate: 0.4
• Momentum: 0.5

Algorithm Settings II:
• Validation Threshold: 3
• Learning Rate: 0.4
• Momentum: 0.5

This way, each of the four architectures was trained with the two combinations of settings above, generating eight network results. In total, the prototyping involved sixty-four network tests across the four data sets. All networks have two neurons in the output layer, N1 and N2. Each network is trained to the output: N1 = 1 and N2 = 0 if the instance is a potential donor, and N1 = 0 and N2 = 1 if it is not a potential donor.

The evaluation of the outputs was performed using two methods: Winner Take All (WTA) and Threshold (T). WTA, when applied to an array of output values, sets the highest value among them to one and the remaining values to zero. The threshold method establishes a minimum value that a given output must reach to be confirmed; in this study I used 0.4. That is, to be considered a donor, the donor neuron must have a value above the threshold; otherwise, the instance is considered a non-donor.

8.4 MODEL I RESULTS

8.4.1 MLP – Multilayer Perceptron Results

Table 11: Multilayer Perceptron results

Training Set                   WTA     T
Training 1 – no cut date       95%     83%
Training 2 – cut date 8901     76%     76%

Test Set                       WTA     T
Test 1 – no cut date           75%     75%
Test 2 – cut date 8901         71%     71%

8.5 MODEL I CONCLUSION

While MLP can implement arbitrary decision boundaries using hidden layers, as the number of hidden layers increases, the processing time grows exponentially, from a few dozen minutes to a few hours. Using neural networks can be more complex than using decision trees; however, the MLP algorithm in Weka does a good job of identifying which nodes are used more than others, which helps with removing unneeded attributes.

8.6 MODEL II DIMENSION

For the second model, I used three data sets (refer to section 4 for more details on the strategy used to create the data sets); two of them were used in the first model and served for results comparison.

Table 12: Model II dimension

8.7 MODEL II ALGORITHMS

For each of the three data sets, I built a few testing schemas. The schemas, test options and other details are listed below.

8.7.1 OneR – As the base algorithm

I wanted to evaluate how well OneR, a rule-based algorithm, would perform as a model, comparing the results with a decision tree algorithm such as J48.

1. Loaded training set 3
2. Selected the OneR rule classifier
3. Selected Test Options: Cross-validation – 10 folds
4. Evaluated the results, checked accuracy, confusion matrix, etc.
   a. See section 8.8.1 for results

To verify how well the model performs on unseen data, I had to run the classifier on the supplied test data; therefore, I performed the following steps:

5. Selected Test Options: Supplied test set
6. Loaded testing set 3
7. Evaluated the results, checked accuracy, confusion matrix, etc.
   a. See section 8.8.1 for results
8. Repeated steps 1 to 7 for training and testing sets 1 and 2
   a. See section 8.8.1 for results

8.7.2 J48 – Analyze the data set

1. Loaded training set 3
2. Selected the J48 decision tree classifier
3. Selected Test Option: Supplied test set
4. Loaded testing set 3
5. Evaluated the results, checked accuracy, confusion matrix, etc.
   a. See section 8.8.2 for results
6. Evaluation on the training set – selected Test Option: Use training set
7. Evaluated the results, checked accuracy, confusion matrix, etc.
   a. See section 8.8.2 for results
8. Evaluation on a percentage split – selected Test Option: Percentage split 66%
9. Evaluated the results, checked accuracy, confusion matrix, etc.
   a. See section 8.8.2 for results
10. Repeated steps 1 to 9 for training and testing sets 1 and 2
   a. See section 8.8.2 for results

8.7.3 J48 – Repeated training and testing

1. Loaded training set 3
2. Selected the J48 decision tree classifier
3. Selected Test Option: Percentage split 90%
4. Evaluated the results, checked accuracy, confusion matrix, etc.
   a. See section 8.8.3 for results
5. Repeated the tests changing the seed in "More Options": Seed 2, 3, 4, 5, 6, 7, 8, 9, 10
   a. See section 8.8.3 for results
6. Evaluated the results, checked accuracy, confusion matrix, etc.
7. Calculated the mean and standard deviation (a scripted version of this procedure is sketched after the results in section 8.8.3)
   a. See section 8.8.3 for results

8.7.4 ZeroR / J48 / NaiveBayes / IBk – Baseline accuracy

1. Loaded training set 3
2. Selected Test Option: Percentage split 66%
3. Selected and executed the classifiers:
   a. Trees > J48
   b. Bayes > NaiveBayes
   c. Lazy > IBk
4. Evaluated the results, checked accuracy, confusion matrix, etc. of each classifier
   a. See section 8.8.4 for results
5. Ran ZeroR using Test Option: Use training set
6. Evaluated the results, checked accuracy, confusion matrix, etc.
   a. See section 8.8.4 for results

8.8 MODEL II RESULTS

In this section, I present the results of each strategy listed in sections 8.7.1 to 8.7.4.

8.8.1 OneR – As the base algorithm

Table 13: Model II OneR (results of 8.7.1)

8.8.2 J48 – Analyze the data set

Table 14: Model II J48 (results of 8.7.2)

8.8.3 J48 – Repeated training and testing

Table 15: Model II J48 (results of 8.7.3)

Table 16: Model II J48 (results of 8.7.3)
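As a complement to the repeated training and testing procedure of section 8.7.3, the seed variation and the mean/standard deviation calculation can also be scripted rather than done by hand in the Explorer. The sketch below assumes the Weka Java API, an illustrative file name and TARGET_B as the last attribute; it mirrors the 90% percentage split with seeds 1 to 10 but is not the exact procedure used to produce Tables 15 and 16.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RepeatedSplitJ48 {
        public static void main(String[] args) throws Exception {
            // File name is illustrative; TARGET_B is assumed to be the last attribute.
            Instances data = DataSource.read("training3.arff");
            data.setClassIndex(data.numAttributes() - 1);

            double[] accuracy = new double[10];
            for (int seed = 1; seed <= 10; seed++) {
                // Shuffle with the current seed, then take a 90% / 10% split.
                Instances shuffled = new Instances(data);
                shuffled.randomize(new Random(seed));
                int trainSize = (int) Math.round(shuffled.numInstances() * 0.9);
                Instances train = new Instances(shuffled, 0, trainSize);
                Instances test = new Instances(shuffled, trainSize,
                        shuffled.numInstances() - trainSize);

                J48 tree = new J48();
                tree.buildClassifier(train);

                Evaluation eval = new Evaluation(train);
                eval.evaluateModel(tree, test);
                accuracy[seed - 1] = eval.pctCorrect();
            }

            // Mean and (sample) standard deviation over the ten runs.
            double mean = 0;
            for (double a : accuracy) {
                mean += a;
            }
            mean /= accuracy.length;
            double sumSq = 0;
            for (double a : accuracy) {
                sumSq += (a - mean) * (a - mean);
            }
            double stdDev = Math.sqrt(sumSq / (accuracy.length - 1));
            System.out.printf("Mean accuracy: %.2f%%, standard deviation: %.2f%n", mean, stdDev);
        }
    }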
8.8.4 ZeroR / J48 / NaiveBayes / IBk – Baseline accuracy

Table 17: Model II ZeroR / J48 / NaiveBayes / IBk (results of 8.7.4)

8.9 MODEL II CONCLUSION

ZeroR is the simplest classification method; it relies only on the target and ignores all predictors. ZeroR was executed on data set 3 and predicted the majority category of TARGET_B. Using ZeroR as a baseline test, its results are the same as the predictions made with the decision tree classifier J48; that is, both correctly classified 94.95% of the entries.

8.10 MODEL III DIMENSION

The third model was built with a different approach: a new data set (data set 4) that is narrower than the other data sets tested in the previous models (it has fewer attributes). To improve run-time performance and prediction quality, I removed attributes that were skewing the results and were not very significant. Cutting the data set was supported by the results of the Weka attribute evaluator.

Table 18: Model III dimension

8.11 MODEL III ALGORITHMS

For the single data set selected for this model, I built a few testing schemas. The strategies are listed in detail below. The algorithms selected for this model were the following: Linear Regression*, Logistic Regression, SMO and LibSVM.

* Linear Regression was selected as an algorithm to be tested, but the results were not successful. See below for more information on the performance issue I found using this algorithm.

8.11.1 Linear Regression

The data set I selected for this model is the smallest data set I have (it has fewer than 3,000 instances and only 19 attributes). I tried to change one of the attributes from nominal to numeric, which I did successfully; however, when I got to the step of adding a classification, Weka presented very poor performance, running for more than 14 hours with the CPU working at over 200% of its processing capacity (screenshot below). I tried different data sets (although they were bigger) and removed some more attributes to make the data smaller, but the result was hours of processing without completing the operation. I increased the heap size to 8 GB using the command java -Xmx8192m -jar weka.jar, but that was also unsuccessful.

8.11.2 Logistic Regression

1. Loaded training set 4
2. Selected the Logistic function classifier
   a. Left the algorithm on the default settings
3. Selected Test Options: "Use training set", then "Supplied test set", and finally "Cross-validation – 10 folds"
4. Evaluated the results, checked accuracy, confusion matrix, etc.
   a. See section 8.12.2 for results

8.11.3 SMO

Also known as SVM (Support Vector Machine), this algorithm learns a linear decision boundary, which can become more complex using the kernel settings. The strategy for the four SMO tests was the same:

1. Loaded training set 4
2. Selected the SMO function classifier
   a. Selected a different kernel or left the default (the results show the kernel used)
   b. Selected a low C, a high C, or left it at the default (the results show the C setting)
3. Selected Test Options: "Use training set", then "Supplied test set", and finally "Cross-validation – 10 folds"
4. Evaluated the results, checked accuracy, confusion matrix, etc.
   a. See section 8.12.3 for results

8.11.4 LibSVM with defaults

1. Loaded training set 4
2. Selected the LibSVM function classifier
   a. Left the algorithm on the default settings
3. Selected Test Options: "Use training set", then "Supplied test set", and finally "Cross-validation – 10 folds"
4. Evaluated the results, checked accuracy, confusion matrix, etc.
   a. See section 8.12.4 for results

8.12 MODEL III RESULTS

8.12.1 Linear Regression

See section 8.11.1 for the details on why it was not possible to extract results from Linear Regression.

8.12.2 Logistic Regression

Table 19: Model III Logistic Regression

8.12.3 SMO Results – All Variations

Table 20: Model III SMO Results – All Variations

The algorithm and configuration highlighted in yellow is the one I selected to run the final prediction.

8.12.4 LibSVM with defaults

Table 21: Model III LibSVM Results

9 – EVALUATION

First, I want to summarize the results of each model:

Model I:
• Algorithm: MLP
• Data Sets: 1 and 2
• Lowest accuracy: 71%
• Highest accuracy: 95%

Model II:
• Algorithms: OneR, J48
• Data Sets: 1 and 2
• Lowest accuracy: 70.4%
• Highest accuracy: 76.25%
• Average: 73.95%

• Algorithms: OneR, J48, NaiveBayes, IBk, ZeroR
• Data Set: 3
• Lowest accuracy: 92.27%
• Highest accuracy: 95.13%
• Average: 94.12%

Model III:
• Algorithm: Logistic
• Data Set: 4
• Lowest accuracy: 64.21%
• Highest accuracy: 100%
• Average: 78.78%

• Algorithm: SMO
• Data Set: 4
• Lowest accuracy: 66.58%
• Highest accuracy: 100%
• Average: 78.52%

• Algorithm: LibSVM
• Data Set: 4
• Lowest accuracy: 71.68%
• Highest accuracy: 76.25%
• Average: 74.72%

Three of the four data sets were split randomly into 75% for the training set and 25% for the testing set. The fourth data set was split randomly into 70% for the training set and 30% for the testing set. Overall, the data sets that contained approximately 25% donors and 75% non-donors (data sets 1, 2 and 4) presented results ranging from 64% to 100% accuracy (correctly classified), while the data set that contained approximately 5% donors and 95% non-donors (data set 3) presented results ranging from 92.27% to 95.13% accuracy (correctly classified).
This project’s challenge is to identify with higher precision as possible, the donors and nondonors in a data set without TARGET_B (and extra challenge to predict the donation amount was offered, but I decided not to pursue the prediction results of TARGET_D). The testing set was used for tuning the minimum and maximum support in the three models, not for evaluation purpose. The evaluation was performed in a data set containing 10,000 entries out of +47,000 instances without TARGET_B. I selected the third model with used the data set 4, algorithm SMO, with “Normalized Poly Kernel” running on training set as “the chosen one” among all the 57 different tests’ combinations. One of the reasons for that, was the performance and simplicity of this model. Among all results on the 25/75 donors/non-donors split, I discarded all the results that were below 60% and above 90%. “The chosen one” model, the one I selected to the final prediction, had a result of 85.92%. Here are some extra details of the model to the final prediction: The final prediction results were: NASCIMENTO, Leandro. Veteran Donors Data Mining, Final Project. Department of Computing, College of Business & Technology. CSCI-4957 – Data Mining. ETSU – East Tennessee State University, Fall Semester 2017. Running the final prediction file using the SMO, the one with 10,000 instances and the same 19 attributes I used in the original model, resulted on the following output: NASCIMENTO, Leandro. Veteran Donors Data Mining, Final Project. Department of Computing, College of Business & Technology. CSCI-4957 – Data Mining. ETSU – East Tennessee State University, Fall Semester 2017. 10 – LESSONS LEARNED The data mining process continues after the solution has been developed. Lessons learned during the process can prompt new questions, often more pertinent to the business. Subsequent processes will benefit from the experiences of previous processes. I would like to have the opportunity to repeat all this work, starting over again, but from another perspective, and in addition, if possible, developing better results by working in small groups of students. The KDD process is interactive, iterative, cognitive and exploratory, involving several steps, many decisions being made by professionals in the field (data domain specialist, non-profit organization professionals, etc.). Some areas where I would like to explore a different approach in this project would be: The training data set to be optimized through the insertion of complementary variables, formation of the training data set by varying the sampling technique used, searching for a greater diversity of candidates’ data base behavior. Application of other classification techniques such as association rules or cluster analysis. Evaluation of external variables to be included in the model as statistical and sociodemographic data or information about the national economic scenario in each period. 11 – REFERENCES [1] AGRAWAL, R; IMIELINSKI, T; SWAMI, A. Mining association rules between sets of items in large databases. [2] BRAMER, M. Undergraduate Topics in Computer Science - Principles of Data Mining. Springer, 2007. [3] WITTEN, I. H; FRANK, E. Data Mining - Practical Machine Learning Tools and Techniques. Elsevier, 2005. [4] WAIKATO, U. O. WEKA. http://www.cs.waikato.ac.nz/ml/weka/ [5] LAROSE, D. T. Discovering Knowledge in Data: An Introduction to Data Mining. John Wiley and Sons, Inc., 2005. NASCIMENTO, Leandro. Veteran Donors Data Mining, Final Project. 