Data Mining - Paralyzed Veterans of America

Leandro Nascimento
This document covers the requisites of the
CSCI-4957 - Data Mining class, Fall 2017
Semester - Final Project.
This document is the final version and is in
accordance with the requirements listed by
Prof. Dr. Jay Jarman, Ph.D., in this
course’s syllabus and D2L terms (East
Tennessee State University learning
Prof. Dr. Jay Jarman, Ph.D.
East Tennessee State University
Johnson City, TN – December _____ , 2017.
TABLE OF CONTENTS
1 – EXECUTIVE SUMMARY
4 – DATA SET DESCRIPTION
4.1 PRESENTATION
4.2 STATISTICS EVALUATED
4.3 SAMPLE DATA
4.4 ATTRIBUTES
4.5 TRAINING & TEST RESULTS
5 – ATTRIBUTE SELECTION
5.1 PRESENTATION
5.2 SORTING THE ATTRIBUTES
5.3 ATTRIBUTES SELECTED
5.4 ADDITIONAL ATTRIBUTES
6 – MISSING DATA
7 – CLASSIFICATION PROXY
8 – MODELS
8.1 INTRODUCTION
8.2 MODEL I DIMENSION
8.3 MODEL I ALGORITHMS
8.3.1 MLP – MultiLayerPerceptron
8.4 MODEL I RESULTS
8.4.1 MLP – Multilayer Perceptron Results
8.5 MODEL I CONCLUSION
8.6 MODEL II DIMENSION
8.7 MODEL II ALGORITHMS
8.7.1 OneR – As the base algorithm
8.7.2 J48 – Analyze the data set
8.7.3 J48 – Repeated training and testing
8.7.4 ZeroR / J48 / NaiveBayes / IBk – Baseline accuracy
8.8 MODEL II RESULTS
8.8.1 OneR – As Base algorithm
8.8.2 J48 – Analyze the data set
8.8.3 J48 – Repeated training and testing
8.8.4 ZeroR / J48 / NaiveBayes / IBk – Baseline accuracy
8.9 MODEL II CONCLUSION
8.10 MODEL III DIMENSION
8.11 MODEL III ALGORITHMS
8.11.1 Linear Regression
8.11.2 Logistic Regression
8.11.3 SMO
8.11.4 LibSVM with defaults
8.12 MODEL III RESULTS
8.12.1 Linear Regression
8.12.2 Logistic Regression
8.12.3 SMO Results – All Variations
8.12.4 LibSVM with defaults
NASCIMENTO, Leandro. Veteran Donors Data Mining, Final Project. Department of Computing, College of Business &
Technology. CSCI-4957 – Data Mining. ETSU – East Tennessee State University, Fall Semester 2017.
9 – EVALUATION
10 – LESSONS LEARNED
11 – REFERENCES
The goal of this project is to counter the inverse correlation between the likelihood of
donating and the dollar amount of the gift, maximizing prediction accuracy based on each
candidate’s personal, economic, financial, social and demographic data. The analysis is also
intended to reduce the cost of large-scale mailing and to increase the amount of money
collected by targeting only potential or valuable donors. The mailing list is built by
predicting future response behavior from historical data that records whether a candidate
responded to a campaign and, if so, the amount collected.
I present a solution to this project after exploring a variety of data mining techniques,
including association rules, decision trees, machine learning algorithms and prediction
results. The starting point was the data set cup98TrgWTarget.csv, which contains 47,705
instances and 481 attributes. The work was supported by the following tools: Weka¹, SAP
Cloud Computing Analytics and Tableau².
Missing or bad data plays a significant role in the prediction; therefore, handling
anomalies is one of the tasks added to the data set preparation.
Because processing a large data set is very slow, this project used four data sets of
different sizes.
Experiments were performed using multiple algorithms and functions. A
comprehensive list of test variations is included in this document. Some significant results
were observed using decision trees, functions, rules, etc., and each result was evaluated
considering its accuracy (error rate, proportion of explained variation) and significance
(statistical, reasonableness, sensitivity, value of decisions).
With a large set of historical data available to train the algorithms, it was possible to use
support tools to produce statistics and compare them with the output generated by Weka.
However, the tools and the database alone are not enough for success.
I had to adopt different strategies and synthesize what I learned from them to improve the
prediction results. One of those strategies was researching the literature and academic papers
available on donor retention. In addition, I consulted professionals who work in non-profit
organizations, charities and fundraising. In both cases, the consensus was unanimous that
there is an inverse correlation between the probability of donating and the amount given to
the donation campaign.
¹ Weka is a data mining tool that supports applying machine learning algorithms to extract meaningful information from raw data.
Weka contains 100+ algorithms for classification, 75 for data preprocessing, 25 to assist with feature selection and 20 for
clustering, finding association rules, etc.
² SAP Cloud Computing Analytics and Tableau are predictive analytics, business intelligence and data visualization tools.
When analyzing the results, I had to take into consideration that the higher the amount
donated, the more cautious the donor is in deciding. That is, few donors make large
donations and many donors make small ones. Because of this inverse correlation, a decision
tree that ranks the results by probability tends to rank donors who made larger donations
down instead of up.
Key words: Data Mining; Machine Learning; Clustering; Association Rule; Decision Tree;
In this section, I will describe the data sets and answer the questions: What statistics did I
evaluate on the attributes? Are any of the attributes skewed? For instance, is the geographical
information skewed to one state more than the others, or one part of the country more than
the others? Are income values skewed to lower or higher incomes or are they equally
distributed? The data is in a single file. Did I parse it into multiple files? If so, how? Did I put
the data into a database from which I built datasets?
The cup98TrgWTarget data set contains 47,705 instances about people contacted in the
1997 mailing campaign. Each instance is represented by 479 non-target attributes and two
target variables indicating the “donor / non-donor” and the actual amount donated in dollars:
• TARGET_B: Binary indicator of donor (1) or non-donor (0). There are 2,422 donors and
45,283 non-donors in the data set.
• TARGET_D: Dollar amount of the donation.
TARGET_D is not an attribute to be used when running the classifier; otherwise, every
instance with a dollar amount in this attribute would be considered a possible donor.
The large dimensionality (481 variables) and the small target population (5%) presented an
additional challenge to the extraction of characteristics from the donor class.
A data mining project includes a training phase and a test/validation phase. Although one data
set was selected for the final prediction, a total of 4 data sets were created (each with
its own training and test sets).
In data set 1, a simple sample was drawn from the original data set: 3,000 of the 45,283
non-donor instances and 1,000 of the 2,422 donor instances were selected. The table below
shows the dimensions of each group; the instances were selected unordered and randomly
sorted using Weka filters.
Table 1: Training and Test sets 1 – Non-ordered random partitioning
Unsupervised > Attribute > Numeric to Nominal: Default Settings
Unsupervised > Instance > Randomize: Default Settings
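The class-wise sampling described for data set 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the project itself used Weka filters, and the function name `sample_by_class`, the `id` field and the seed are hypothetical.

```python
import random

def sample_by_class(rows, label_key, quota, seed=1):
    """Draw a fixed number of rows per class label, without replacement."""
    rng = random.Random(seed)
    picked = []
    for label, n in quota.items():
        pool = [r for r in rows if r[label_key] == label]
        picked.extend(rng.sample(pool, n))
    rng.shuffle(picked)  # remove any ordering by class
    return picked

# Toy stand-in for the full file, using the class counts reported above.
rows = [{"id": i, "TARGET_B": 1 if i < 2422 else 0} for i in range(47705)]
subset = sample_by_class(rows, "TARGET_B", {0: 3000, 1: 1000})
donor_share = sum(r["TARGET_B"] for r in subset) / len(subset)
```

With these quotas the donor share of the sample rises from about 5% in the original file to 25%.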
In data set 2, to better test the robustness of the models, the partitions were created using
the attribute ODATEW (date of donor's first gift to Vet Association). The cut date was
chosen so that the donor and non-donor ratios in the training and test sets were
approximately the same as in the non-ordered random partitioning. The cut date was 8901
(YYMM), resulting in:
Table 2: Training and Test sets 2 – Non-ordered – ODATEW 8901 cut
Unsupervised > Attribute > Numeric to Nominal: Default Settings
Unsupervised > Instance > Randomize: Default Settings
In data set 3, I used all 47,705 existing instances, dividing them into training (70%) and
testing (30%) sets so that both are independent samples of the underlying population.
Table 3: Training and Test sets 3 – Non-ordered random partitioning – All instances
Unsupervised > Attribute > Numeric to Nominal: Default Settings
Unsupervised > Instance > Randomize: Default Settings
Unsupervised > Instance > RemovePercentage: invertSelection = FALSE, percentage = 30
Unsupervised > Instance > RemovePercentage: invertSelection = TRUE, percentage = 30
The example below shows how a data set was resized and prepared using Weka, taking
data set 3 as an example.
Step 1: Randomize the data set to create a random permutation, using the filter Randomize,
which is in Weka > Filters > Unsupervised > Instance.
Step 2: Apply the filter RemovePercentage (Weka > Filters > Unsupervised > Instance) with
percentage 30. Save the result to a file as the “training” data set. This step keeps a data
set with 70% of the instances.
Step 3: Click the “Undo” button and apply the filter RemovePercentage (Weka > Filters >
Unsupervised > Instance) with percentage 30 again, this time setting invertSelection. This
step selects the rest of the data (30%). Save the file as the “testing” data set.
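The three steps above can be mirrored in a short Python sketch. It is not Weka's implementation: the functions `randomize` and `remove_percentage` are hypothetical stand-ins, the seed is arbitrary, and the sketch assumes the filter drops the leading share of the shuffled rows.

```python
import random

def randomize(rows, seed=42):
    """Return a shuffled copy of the rows (Step 1)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled

def remove_percentage(rows, percentage, invert_selection=False):
    """Drop the leading share of rows; with invert_selection, keep only that share."""
    cut = round(len(rows) * percentage / 100)
    return rows[:cut] if invert_selection else rows[cut:]

instances = list(range(47705))          # stand-in for the 47,705 instances
shuffled = randomize(instances)
training = remove_percentage(shuffled, 30)                         # Step 2: 70%
testing = remove_percentage(shuffled, 30, invert_selection=True)   # Step 3: 30%
```

Because both calls start from the same shuffled list, the two sets are disjoint and together cover every instance, which is the purpose of the “Undo” in Step 3.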
In data set 4, I used the same instances as in data set 2 but reduced the number of
attributes from 30 to 19 to test a narrower data set.
Table 4: Training and Test Set 4 – Non-Ordered – ODATEW cut
Unsupervised > Attribute > Numeric to Nominal: Default Settings
Unsupervised > Instance > Randomize: Default Settings
See Section 5 ATTRIBUTE SELECTION for more details.
From the 481 attributes in the original data set, I selected 49 to be used in the training and
test phases. After further evaluation, and because I needed to shrink the data set to gain
performance, I removed 11 attributes and ended with 30 attributes in data sets 1, 2 and 3,
and 19 attributes in data set 4.
No new attributes were created, and no attributes were combined.
See section 8 MODELS for the results of the training and test phases for the 4 data sets.
In this section, I will introduce the strategy I used to select the attributes for each
model and the rationale for their selection. This section answers questions like: Did I
base it on a statistical approach? Did I look at information gain or one of the other
statistics discussed in class? Did I use a different approach based on my intuition?
For attribute selection, I ended up using three different strategies and then combined the
knowledge gained from each of them into a final (and common) result.
The three strategies were: intuition, data visualization, and reaching out to non-profit
industry professionals.
The high dimensionality of the data set, i.e., 481 attributes, added a significant challenge to
this project. Using only my intuition, I kept asking myself whether an attribute I had removed
should have been part of the model I was testing, or whether an attribute I had selected
should have been tossed out.
One approach I took to minimize those questions was using Tableau to view a graphical
representation of the data and understand better how a given attribute correlated to the donor
indicator (TARGET_B) and the dollar amount (TARGET_D), as demonstrated in the images
The trends of TARGET_B and TARGET_D for AGE using all instances of the cup98TrgWTarget data set
The graphic visualization of the data indicates that the AGE attribute plays a significant role
in the likelihood of donating a gift.
Here is another example of data visualization using Tableau, where it is possible to see
clearly the role that gender and homeowner status play among the donor candidates:
The trends of TARGET_B for GENDER and HOMEOWNR using all instances of the cup98TrgWTarget data set
The graphic above confirmed a finding reported in the academic papers and
mentioned by professionals in the non-profit industry: females and homeowners are more
inclined to donate than males and non-homeowners.
Another attribute that caught my attention was the number of card promotions received in
the last 12 months and how it correlates with the donor indicator.
The plot of sum of TARGET_B for CARDPM12 using all instances of the cup98TrgWTarget data set
In addition to attributes measured directly from the candidate, attributes related to the
donor’s environment were measured as well. This is possible using demographic and economic
surveys offered by government or other institutions. For example, the variable ETH refers to
the ethnic percentage of the population in the region where the potential donor lives, along
with other data collected from the donor’s neighborhood; IC5 indicates the per capita income
in the donor’s neighborhood, etc.
In the academic papers, there were different predictors of where a person would
donate and how much they would donate. Some of those indicators were also mentioned by
professionals in the non-profit industry, and they turned out to be decisive attributes to
select for donor prediction.
The attributes I selected can be sorted in three different categories:
Demographics Data
Census Tract Data
Behavioral Data on previous campaigns
Weka provides an attribute evaluator tool that uses mathematical metrics to rank the attributes
of a given data set by how much predictive power each attribute has for a given class. I used
it to help trim the attributes I had selected, reducing them from 48 variables to 30 variables.
Section 5.3 lists those attributes in more detail, including the ones that were initially
considered part of the model but removed later.
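As an illustration of what such a ranking metric can look like, the sketch below computes information gain for nominal attributes against a binary class. It is an assumption that information gain is a fair example here: this report does not state which evaluator was run, and the toy values for HOMEOWNR and GENDER are hypothetical, not taken from the data set.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy of the class minus the weighted entropy after splitting on the attribute."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data: 8 candidates, class = donor indicator.
target = [1, 1, 0, 0, 0, 0, 1, 0]
attributes = {
    "HOMEOWNR": ["H", "H", "U", "U", "U", "U", "H", "U"],
    "GENDER": ["F", "M", "F", "M", "F", "M", "F", "M"],
}
ranking = sorted(attributes,
                 key=lambda a: information_gain(attributes[a], target),
                 reverse=True)
```

In this toy example HOMEOWNR separates the classes perfectly, so it ranks first; attributes near the bottom of such a ranking are candidates for removal.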
I listed below the attributes that are part of the data sets 1, 2 and 3.
Table 5: Attributes selected for data sets 1, 2 and 3
I listed below the attributes that are part of the data set 4.
Table 6: Attributes selected for data set 4
I listed below the attributes that were initially part of the data sets 1, 2 and 3 but were
removed because their role in donor retention turned out to be inconclusive. Some of them
were removed after testing them in the Weka attribute evaluator tool.
Table 7: Additional attributes
I listed below the attributes that were used in data sets 1, 2 and 3 but removed from data set 4.
Table 8: Attributes removed from data set 4
Each attribute was analyzed individually. Attributes with too many instances missing
information were tossed out after confirming, from academic papers and industry
professionals, that they were not particularly relevant to the prediction. I also trimmed
attributes whose instances had more zero than non-zero values.
The attribute NUMCHLD (Number of Children) was set to ‘0’ where there was no information,
and the field INCOME (Household Income) was set to the average income calculated from the
incomes in the neighborhood.
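The two imputation rules above could be sketched as follows. This is a minimal illustration, not the procedure actually run: the `AREA` key standing for the neighborhood is hypothetical, since the report does not say which attribute identified it, and missing values are represented as `None`.

```python
from statistics import mean

def impute(records):
    """Fill NUMCHLD with 0 and INCOME with the mean known income of the same area."""
    by_area = {}
    for r in records:
        if r["INCOME"] is not None:
            by_area.setdefault(r["AREA"], []).append(r["INCOME"])
    filled = []
    for r in records:
        r = dict(r)  # work on a copy so the input stays untouched
        if r["NUMCHLD"] is None:
            r["NUMCHLD"] = 0
        if r["INCOME"] is None and r["AREA"] in by_area:
            r["INCOME"] = mean(by_area[r["AREA"]])
        filled.append(r)
    return filled

# Hypothetical toy records.
records = [
    {"AREA": "A", "NUMCHLD": None, "INCOME": 4},
    {"AREA": "A", "NUMCHLD": 2, "INCOME": None},
    {"AREA": "A", "NUMCHLD": 1, "INCOME": 6},
]
cleaned = impute(records)
```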
I used the basic principles of knowledge discovery process, borrowed from CRISP-DM
(Cross Industry Standard Process for Data Mining):
Define business problem
Build data mining data base
Explore data
Prepare data for modeling
Build model
Evaluate model
Deploy model and results
Because of the inverse correlation between the likelihood of donating and the dollar amount
of the gift, a probability-based algorithm tends to rank down, rather than rank up, donors who
made bigger donations. One strategy to work around this was to evaluate four different data
sets and algorithms with different characteristics:
o MLP – Multilayer Perceptron
o Linear Regression
o Logistic Regression
o LibSVM
o OneR
o ZeroR
o J48
o Naive Bayes
o IBk
The validation of the models was based on a simple validation method: splitting the data into
a training data set and a testing data set. This method is easy to use and understand, and it
gives a good estimate of prediction error for reasonably large data sets.
The four data sets were tested with many different algorithms (listed above), most of the
time using more than one test option, which produced a large number of results.
One of the test options that turned out to be very valuable on small data sets was n-fold
cross-validation, which divides the data into n equal-sized groups and builds a model on the
data with one group left out. The left-out group is then predicted and the prediction error
rate is calculated. This is repeated for each of the groups, and the average over all n folds
is used as the model error rate.
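The n-fold procedure just described can be sketched in a few lines of Python. The sketch makes two assumptions for brevity: the "model" is a simple majority-class predictor standing in for a real classifier, and the labels are a hypothetical toy set with a 25% donor share.

```python
import random
from collections import Counter

def majority_class(train_labels):
    """Stand-in model: always predict the most frequent training label."""
    return Counter(train_labels).most_common(1)[0][0]

def cross_validated_error(labels, n_folds=10, seed=1):
    """Average error rate over n folds, each fold held out once."""
    idx = list(range(len(labels)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [labels[i] for i in idx if i not in held]
        prediction = majority_class(train)
        wrong = sum(1 for i in held_out if labels[i] != prediction)
        fold_errors.append(wrong / len(held_out))
    return sum(fold_errors) / n_folds

labels = [1] * 250 + [0] * 750   # hypothetical: 25% donors
error = cross_validated_error(labels)
```

Every instance is used for testing exactly once and for training in the other folds, which is why the method suits small data sets.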
This section contains subsections for each model built and evaluated during this project.
For each model, when relevant, I described the following:
• The algorithms I used
• Test Options – X-fold cross validation, 60/40 split, etc.
• Confusion matrix
• The performance metrics I used to evaluate the model and their interpretation
• Error analysis
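For reference, the sketch below shows how accuracy, precision and recall follow from a confusion matrix for the donor class. The helper names and the toy labels are hypothetical; Weka reports these figures directly in its classifier output.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary class."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts

def summarize(counts):
    """Derive accuracy, precision and recall from the four counts."""
    total = sum(counts.values())
    tp, fp, fn, tn = (counts[k] for k in ("tp", "fp", "fn", "tn"))
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical example: 5 donors in 100 instances, classifier says "non-donor" every time.
actual = [1] * 5 + [0] * 95
always_non_donor = [0] * 100
report = summarize(confusion_counts(actual, always_non_donor))
```

In this example the accuracy is 95% although no donor is found, which is why accuracy alone can be misleading when the donor class is small.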
In the first model, I used two data sets (refer to section 4 for more details on the strategy
used to create the data sets).
Table 9: Model I dimension
For the first model, I used the MLP – MultiLayer Perceptron algorithm. For each of the data
sets I built a few testing schemas, which are listed in detail below.
8.3.1 MLP – MultiLayerPerceptron
MLP is an Artificial Neural Network (ANN) classifier that uses backpropagation to classify
instances.
This network can be built by hand, created by an algorithm or both. The network can also be
monitored and modified during training time. The nodes in this network are all sigmoid
(except when the class is numeric, in which case the output nodes become unthresholded
linear units).
As there is no exact method that determines the best architecture for each problem, nor the
best training algorithm to use, I opted to use a group of architectures and algorithms with
some tuning strategies. In the end, I selected the one with the best performance.
Table 10: Multilayer Perceptron Architecture
MLP – Multilayer Perceptron
Network Architectures
Hidden Layer Output Layer
2 Layer Network
2 Layer Network
3 Layer Network
75 x 30
3 Layer Network
25 x 10
Algorithm Settings I
Validation Threshold: 10
Learning Rate: 0.4
Momentum: 0.5
Algorithm Settings II
Validation Threshold: 3
Learning Rate: 0.4
Momentum: 0.5
This way, each of the four architectures was trained with the two combinations of settings
above, generating eight network results. In total, the prototyping involved sixty-four network
tests in four data sets. All networks have two neurons in the output layer, N1 and N2.
Each network is trained to output N1 = 1 and N2 = 0 if the instance is a potential
donor, and N1 = 0 and N2 = 1 if it is not a potential donor.
The evaluation of the outputs was performed using two methods: Winner Take All (WTA) and
Threshold (T).
WTA, when applied to an array of output values, sets the highest value to one and the
remaining values to zero. T establishes a minimum value that a given output must reach to be
confirmed; in this study I used 0.4. That is, to be considered a donor, the
neuron needs to have its value above the threshold; otherwise, the instance is considered
a non-donor.
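The two decoding rules can be sketched as below. This is a minimal illustration that assumes the threshold is applied to the donor neuron N1 and that ties in WTA go to the first output; neither detail is stated in this report, and the function names are hypothetical.

```python
def winner_take_all(outputs):
    """Set the largest output to 1 and all others to 0 (first index wins ties)."""
    winner = max(range(len(outputs)), key=lambda i: outputs[i])
    return [1 if i == winner else 0 for i in range(len(outputs))]

def is_donor_wta(n1, n2):
    """Donor if N1 wins, i.e. the decoded output is N1 = 1, N2 = 0."""
    return winner_take_all([n1, n2]) == [1, 0]

def is_donor_threshold(n1, threshold=0.4):
    """Donor if the donor neuron N1 is above the threshold (0.4 in this study)."""
    return n1 > threshold

# Hypothetical network outputs for one instance.
n1, n2 = 0.45, 0.55
wta_says_donor = is_donor_wta(n1, n2)
threshold_says_donor = is_donor_threshold(n1)
```

For these example outputs the two rules disagree: WTA picks N2 (non-donor), while the threshold rule accepts the instance as a donor because 0.45 is above 0.4.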
8.4.1 MLP – Multilayer Perceptron Results
Table 11: Multilayer Perceptron results
Training Set                 Result
Training 1 - no cut date     95%  83%
Training 2 - cut date 8901   76%  76%
Test Set                     Result
Test 1 - no cut date         75%  75%
Test 2 - cut date 8901       71%  71%
While MLP can implement arbitrary decision boundaries using “hidden layers”, the processing
time grows steeply as the number of hidden layers increases, from a few dozen minutes to a
few hours.
Using neural networks can be more complex than using decision trees; however, the MLP
algorithm in Weka does a good job of identifying which nodes are used more than others,
which helps in removing unneeded attributes.
For the second model, I used three data sets (refer to section 4 for more details on the
strategy used to create the data sets); two of them were also used in the first model and
serve for comparing results.
Table 12: Model II dimension
For each of the three data sets, I built a few testing schemas. Those schemas, test options
and other details are listed below.
8.7.1 OneR – As the base algorithm
I wanted to evaluate how well OneR, which is a rule algorithm, would perform as a model,
comparing its results with those of a decision tree algorithm such as J48.
1. Loaded training set 3
2. Selected OneR rule classifier
3. Selected Test Options: Cross-validation – 10 folds
4. Evaluated the results, checked accuracy, confusion matrix, etc.
a. See 8.8.1 section for results
To verify how well the model performs on unseen data, I had to run the classifier on the
supplied test data, therefore, I performed the following steps:
5. Selected Test Options: Supplied test set
6. Loaded testing set 3
7. Evaluated the results, checked accuracy, confusion matrix, etc.
a. See 8.8.1 section for results
8. Repeated steps 1 to 7 for training and testing sets 1 and 2
a. See 8.8.1 section for results
8.7.2 J48 – Analyze the data set
1. Loaded training set 3
2. Selected J48 decision tree classifier
3. Selected Test Option: Supplied test set
4. Loaded testing set 3
5. Evaluated the results, checked accuracy, confusion matrix, etc.
a. See 8.8.2 section for results
6. Evaluation on training set – Selected Test Option: Use training set
7. Evaluated the results, checked accuracy, confusion matrix, etc.
a. See 8.8.2 section for results
8. Evaluation on percentage split – Selected Test Option: Percentage split 66%
9. Evaluated the results, checked accuracy, confusion matrix, etc.
a. See 8.8.2 section for results
10. Repeated steps 1 to 9 for training and testing sets 1 and 2
a. See 8.8.2 section for results
8.7.3 J48 – Repeated training and testing
1. Loaded the training set 3
2. Selected J48 decision tree classifier
3. Selected Test Option: Percentage split 90%
4. Evaluated the results, checked accuracy, confusion matrix, etc.
a. See 8.8.3 section for results
5. Repeated the tests with the changes in “More Options”: Seed 2, 3, 4, 5, 6, 7, 8, 9, 10
a. See 8.8.3 section for results
6. Evaluated the results, checked accuracy, confusion matrix, etc.
7. Calculated the mean and standard deviation
a. See 8.8.3 section for results
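Step 7 can be reproduced with the Python standard library, as sketched below. The accuracy values are placeholders chosen for the illustration, not results of this project; the real inputs are the accuracies obtained for each seed.

```python
from statistics import mean, stdev

def mean_and_stdev(accuracies):
    """Mean and sample standard deviation of the per-seed accuracies."""
    return mean(accuracies), stdev(accuracies)

# Placeholder accuracies (in %), one per seed; NOT measured values.
placeholder_accuracies = [90.0, 92.0, 94.0]
avg, spread = mean_and_stdev(placeholder_accuracies)
```

A small standard deviation across seeds suggests that the accuracy does not depend much on which instances happen to fall in the 10% held out for testing.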
8.7.4 ZeroR / J48 / NaiveBayes / IBk – Baseline accuracy
1. Loaded the training set 3
2. Selected Test Option: Percentage split 66%
3. Selected and executed the classifiers:
a. Trees > J48
b. Bayes > NaiveBayes
c. Lazy > IBk
4. Evaluated the results, checked accuracy, confusion matrix, etc. of each classifier
a. See 8.8.4 section for results
5. Ran ZeroR using Test Option: Use training set
6. Evaluated the results, checked accuracy, confusion matrix, etc.
a. See 8.8.4 section for results
In this section, I will present the results of each strategy listed in the sections 8.7.1 to 8.7.4.
8.8.1 OneR – As Base algorithm
Table 13: Model II OneR (results of 8.7.1)
8.8.2 J48 – Analyze the data set
Table 14: Model II J48 (results of 8.7.2)
8.8.3 J48 – Repeated training and testing
Table 15: Model II J48 (results of 8.7.3)
Table 16: Model II J48 (results of 8.7.3)
8.8.4 ZeroR / J48 / NaiveBayes / IBk – Baseline accuracy
Table 17: Model II ZeroR / J48 / NaiveBayes / IBk (results of 8.7.4)
ZeroR is the simplest classification method: it relies only on the target and ignores all
predictors. ZeroR was executed on data set 3 and predicted the majority category of
TARGET_B. Using ZeroR as a “baseline test”, the results are the same as the
predictions of the decision tree classifier J48; that is, both correctly classified 94.95%
of the entries.
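The baseline can be checked with a short sketch of the ZeroR idea. It uses the class counts of the full file reported in section 4 (2,422 donors and 45,283 non-donors), so the figure it yields differs slightly from the 94.95% measured on the training partition of data set 3; the function name `zero_r` is hypothetical.

```python
from collections import Counter

def zero_r(train_labels):
    """Return the majority class; every instance is then predicted as this class."""
    return Counter(train_labels).most_common(1)[0][0]

labels = [1] * 2422 + [0] * 45283        # TARGET_B counts of the full data set
prediction = zero_r(labels)              # majority class: non-donor (0)
accuracy = sum(1 for y in labels if y == prediction) / len(labels)
```

The accuracy is simply the share of the majority class, roughly 0.949, so a classifier that matches this figure has not necessarily learned to recognize any donor.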
The third model was built with a different approach: a new data set (data set 4) that
is narrower than the data sets tested in the previous models (it has fewer attributes).
To improve run-time performance and prediction quality, I removed attributes that were
skewing the results and were not very significant.
The reduction of the data set was supported by the results of the Weka attribute evaluator.
Table 18: Model III dimension
For the single data set selected for this model, I built a few testing schemas. The
strategies are listed in detail below.
The algorithms selected for this model were the following: Linear Regression*, Logistic
Regression, SMO and LibSVM.
* Linear Regression was selected as an algorithm to be tested, but the results were not
successful. See below for more information on the performance issue I found using this
algorithm.
8.11.1 Linear Regression
The data set I selected for this model is the smallest data set I have (it has fewer than 3000
instances and only 19 attributes). I changed one of the attributes from Nominal to
Numeric successfully; however, when I got to the step of adding a classification,
Weka performed very poorly, running for more than 14 hours with CPU usage above 200%
(screenshot below).
I tried different data sets (although they were bigger) and removed some more attributes to
make the data smaller, but the results were hours of processing without completing the
I increased the heap size to 8GB using the command java -Xmx8192m -jar weka.jar
but that was also unsuccessful.
8.11.2 Logistic Regression
1. Loaded the training set 4
2. Selected Logistic function classifier
a. Left the algorithm on the default settings
3. Selected Test Options: “Use training set”, then “Supplied Test Set”, and finally “Cross-validation – 10 folds”
4. Evaluated the results, checked accuracy, confusion matrix, etc.
a. See 8.12.2 section for results
8.11.3 SMO
SMO (Sequential Minimal Optimization) is Weka's algorithm for training a Support Vector
Machine (SVM). It learns a linear decision boundary, which can become more complex
through the Kernel settings.
The strategies to the four SMO tests were the same:
1. Loaded the training set 4
2. Selected SMO function classifier
a. Selected a different Kernel or left the default (results will show the Kernel)
b. Selected c low, c high or left the default (results will show the c setting)
3. Selected Test Options: “Use training set”, then “Supplied Test Set”, and finally “Cross-validation – 10 folds”
4. Evaluated the results, checked accuracy, confusion matrix, etc.
a. See 8.12.3 section for results
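To give an idea of how a kernel setting changes the similarity measure behind the decision boundary, the sketch below implements a common form of the normalized polynomial kernel. It is an assumption that this form and the exponent of 2 match the settings used in the tests; the exact variant is not recorded in this report, and the function names are hypothetical.

```python
import math

def poly_kernel(x, y, exponent=2.0):
    """Polynomial kernel: dot product of x and y raised to the exponent."""
    return sum(a * b for a, b in zip(x, y)) ** exponent

def normalized_poly_kernel(x, y, exponent=2.0):
    """Polynomial kernel scaled so that every vector has similarity 1 with itself."""
    scale = math.sqrt(poly_kernel(x, x, exponent) * poly_kernel(y, y, exponent))
    return poly_kernel(x, y, exponent) / scale

# Hypothetical two-attribute instances.
x, y = [1.0, 0.0], [1.0, 1.0]
similarity = normalized_poly_kernel(x, y)
```

The normalization keeps the kernel value independent of the length of the attribute vectors, so instances with large numeric values do not dominate the comparison.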
8.11.4 LibSVM with defaults
1. Loaded the training set 4
2. Selected LibSVM function classifier
a. Left the algorithm on default settings
3. Selected Test Options: “Use training set”, then “Supplied Test Set”, and finally “Cross-validation – 10 folds”
4. Evaluated the results, checked accuracy, confusion matrix, etc.
a. See 8.12.4 section for results
8.12.1 Linear Regression
See section 8.11.1 for the details on why it was not possible to extract results from Linear
Regression.
8.12.2 Logistic Regression
Table 19: Model III Logistic Regression
8.12.3 SMO Results – All Variations
Table 20: Model III SMO Results – All Variations
The algorithm and configuration highlighted in yellow is the one I selected to run the final
prediction.
8.12.4 LibSVM with defaults
Table 21: Model III LibSVM Results
First, I want to summarize the results of each model:
Model I:
Algorithm: MLP
Data Set: 1 and 2
Lowest Accuracy: 71%
Highest Accuracy: 95%
Model II:
Algorithms: OneR, J48
Data Set: 1 and 2
Lowest Accuracy: 70.4%
Highest Accuracy: 76.25%
Average: 73.95%
Algorithms: OneR, J48, NaiveBayes, IBk, ZeroR
Data Set: 3
Lowest Accuracy: 92.27%
Highest Accuracy: 95.13%
Average: 94.12%
Model III:
Algorithm: Logistic
Data Set: 4
Lowest Accuracy: 64.21%
Highest Accuracy: 100%
Average: 78.78%
Algorithm: SMO
Data Set: 4
Lowest Accuracy: 66.58%
Highest Accuracy: 100%
Average: 78.52%
Algorithm: LibSVM
Data Set: 4
Lowest Accuracy: 71.68%
Highest Accuracy: 76.25%
Average: 74.72%
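The lowest, highest and average figures in this summary can be derived from the list of per-test accuracies of each algorithm group, as in the minimal sketch below. The input values are hypothetical placeholders and not results from this project, and the function name summarize is an assumption.

```python
from statistics import mean


def summarize(accuracies):
    """Return the lowest, highest and average accuracy (in percent)."""
    return {
        "lowest": min(accuracies),
        "highest": max(accuracies),
        "average": round(mean(accuracies), 2),
    }


# Hypothetical accuracies (percent) for one group of tests.
hypothetical = [70.0, 80.0, 90.0]
summary = summarize(hypothetical)
print(summary)
```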
Three of the four data sets were split randomly into 75% for the training set and 25% for the
testing set. The fourth data set was split randomly into 70% for the training set and 30% for
the testing set.
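A random split of this kind can be sketched as follows. This is a minimal illustration and not the exact procedure I ran in Weka; the function name, the seed and the use of integer row identifiers are assumptions.

```python
import random


def train_test_split(instances, train_fraction=0.75, seed=42):
    """Shuffle the instances and cut them into training and testing sets."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# 75/25 split, as used for three of the data sets.
train_75, test_25 = train_test_split(range(1000), train_fraction=0.75)
# 70/30 split, as used for the fourth data set.
train_70, test_30 = train_test_split(range(1000), train_fraction=0.70)
print(len(train_75), len(test_25), len(train_70), len(test_30))
```

Fixing the seed makes the split repeatable, so the same training and testing sets can be reused across algorithms.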
Overall, the data sets that contained approximately 25% donors and 75% non-donors (data
sets 1, 2 and 4) presented accuracy (correctly classified instances) ranging from 64% to
100%, while the data set that contained approximately 5% donors and 95% non-donors
(data set 3) presented accuracy ranging from 92.27% to 95.13%.
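One way to read these ranges is against a majority-class baseline: a classifier that always predicts the most frequent class, which is what ZeroR does, already reaches an accuracy equal to that class's share of the data. The sketch below uses the approximate class proportions stated above; the label lists are synthetic and the function name is an assumption.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Synthetic labels matching the approximate proportions of the data sets.
labels_25_75 = ["donor"] * 25 + ["non-donor"] * 75
labels_5_95 = ["donor"] * 5 + ["non-donor"] * 95

baseline_25_75 = majority_baseline_accuracy(labels_25_75)
baseline_5_95 = majority_baseline_accuracy(labels_5_95)
print(baseline_25_75, baseline_5_95)
```

Under these assumptions the baseline is about 75% for data sets 1, 2 and 4 and about 95% for data set 3, so accuracies from the two groups are not directly comparable.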
This project’s challenge is to identify, with the highest precision possible, the donors and non-donors in a data set without TARGET_B (an extra challenge, predicting the donation amount
TARGET_D, was offered, but I decided not to pursue it).
The testing set was used for tuning the minimum and maximum support in the three models,
not for evaluation purposes. The evaluation was performed on a data set containing 10,000
entries drawn from more than 47,000 instances without TARGET_B.
I selected the third model, which used data set 4 and the SMO algorithm with the “Normalized
Poly Kernel” running on the training set, as “the chosen one” among all 57 different tests’
results.
One of the reasons for this choice was the performance and simplicity of the model. Among all
results on the 25/75 donors/non-donors split, I discarded all the results that were below 60%
or above 90%. “The chosen one”, the model I selected for the final prediction, had a
result of 85.92%. Here are some extra details of the model used for the final prediction:
The final prediction results were:
Running SMO on the final prediction file, which has 10,000 instances and the same
19 attributes I used in the original model, produced the following output:
The data mining process continues after the solution has been developed.
Lessons learned during the process can prompt new questions, often more pertinent to the
business, and subsequent processes will benefit from the experience of previous ones. I
would like to have the opportunity to repeat all this work from another perspective and,
if possible, to achieve better results by working in small groups of students.
The KDD process is interactive, iterative, cognitive and exploratory. It involves several steps,
with many decisions made by professionals in the field (data domain specialists, non-profit
organization professionals, etc.).
Some areas where I would like to explore a different approach in this project are:
Optimizing the training data set by adding complementary variables and by varying the
sampling technique used to build it, in search of a greater diversity of behavior in the
candidates’ database.
Applying other data mining techniques, such as association rules or cluster analysis.
Evaluating external variables for inclusion in the model, such as statistical and sociodemographic data or information about the national economic scenario in each period.
