Carter
1
SECTION 1: INTRODUCTION
1.1 OVERVIEW
The process of data collection, no matter the objective, can provide a person or
entity with a large amount of generic information on a particular subject. However, to
learn from and to interpret the data requires much more than simply gathering it. For
example, a business may build meaningful relationships with its customers by learning
from previous interactions with them, observing their needs, and remembering what their
preferences are, in order to determine how to serve them better in the future. In order for
this type of learning to take place, data must first be collected and organized in a useful
and consistent way. This procedure is known as data warehousing. Data warehousing
allows a user to remember what has been noticed in the data. Afterwards, the data must
then be analyzed, interpreted, and transformed into useful information. This is the stage
where data mining comes into play. Data mining is the exploration and analysis, by
automatic or semiautomatic means, of large quantities of data in order to discover
meaningful patterns and rules (Berry, Linoff, pg. 5). Data mining can be applied in a
wide variety of areas, from sports to law enforcement to education.
In this project, I use data mining techniques to predict the current contraceptive
method choice (no use, long-term methods, or short-term methods) of Indonesian women
based on their demographic and socio-economic characteristics. The algorithms that are
implemented here are Naive Bayesian Classification, One-Rule Classification and
Decision Tree. This project presents a Web-based client/server application. The project
makes use of the three-tier client/server architecture, with the Web browser as the client
front-end, the Common Gateway Interface (CGI), Perl, Visual Basic, and Active Server
Pages (ASP) as the middle-tier software, and Microsoft Access 2000 and a
comma-separated value (CSV) text file for the database back-end. The database administrator
has the capability to add, delete, edit, and search for records. The administrator can also
change the administrator password and add users who have permission to gain access to
the website. Users have privileges to add records and to search for records. A logging
system is also implemented, which keeps track of the time, date, host server, browser,
and operating system of users that access the database. The log is accessible by both the
administrator and the users.
1.2 BACKGROUND INFORMATION
According to the Central Bureau of Statistics, the nation of Indonesia is the
fourth-most populous country in the world, with an estimated total population of 207
million in 2000 (United Nations Population Fund). Indonesia has a growth rate of 1.5
percent a year, and although the population growth rate is at a moderate level, the country
has a significant momentum of growth. The government of Indonesia is concerned about
the uneven distribution of the population and the scale of the population growth. This is
especially true when considering overcrowding in urban, densely populated areas, such
as Java and Bali. Other areas of concern are the relatively high infant and under-five
mortality rates (52 and 71 per 1,000, respectively) and the persistently high maternal
mortality ratio (estimated at 370 per 100,000 births).
Indonesia has been recognized for the success of its family planning efforts.
However, according to the United Nations Population Fund, the progress in the
contraceptive prevalence rate (CPR) seems to have stalled at about 57 percent. Also, the
burden of use of contraceptives appears to be unevenly shouldered by women, as the
male-based CPR is less than 2 percent. And even though the “unmet need” for
contraceptives of currently married women has been estimated at the relatively small 9.2
percent, this number is probably considerably higher when unmarried men and women
are taken into account. In order to meet this need, it is paramount that the quality and
scope of contraceptive services and information be expanded. A critical challenge for
Indonesia remains the access to affordable contraceptives by all its citizens, especially the
poor.
1.3 ABOUT THE DATASET
This dataset is a subset of the 1987 National Indonesia Contraceptive
Prevalence Survey. It was created and donated by Tjen-Sien Lim on June 7, 1997. The
contents were downloaded from the UCI Machine Learning Repository. The samples
contained in the survey are of married women who were either not pregnant or did not
know if they were pregnant at the time of the interview. The problem is to predict
the contraceptive method choice of a woman based on her demographic and
socio-economic characteristics. Predicting the contraceptive method choice of Indonesian
women can help the government decide how and where to target
information on contraceptive choices for its female population. The three choices are no
use, long-term methods, or short-term methods. The number of instances is 1473, and the
number of attributes is 11, including the primary key (ID) and the classifying attribute
(cmchoice).
1.4 ATTRIBUTE INFORMATION
No.  Attribute  Description                   Type / Values
1.   ID         ID number                     (primary key attribute)
2.   wife_age   Wife’s age                    (numerical)
3.   wife_ed    Wife’s education              (categorical) 1=low level, 2, 3, 4=high level
4.   hus_ed     Husband’s education           (categorical) 1=low level, 2, 3, 4=high level
5.   no_child   Number of children ever born  (numerical)
6.   wife_rel   Wife’s religion               (binary) 0=Non-Islam, 1=Islam
7.   wife_work  Wife’s now working?           (binary) 0=Yes, 1=No
8.   hus_oc     Husband’s occupation          (categorical) 1=low level, 2, 3, 4=high level
9.   st_live    Standard-of-living index      (categorical) 1=low, 2, 3, 4=high
10.  media      Media exposure                (binary) 0=Good, 1=Not good
11.  cmchoice   Contraceptive method used     (class) 1=No-use, 2=Long-term, 3=Short-term
Figure 1: Attribute Information
There are no missing values in the dataset.
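As a minimal sketch of how records with these attributes might be read from a CSV file, the following Python fragment parses rows into typed records. The column order, the header row, and the two sample rows are assumptions made for illustration; they are not taken from the project's actual data file.

```python
import csv
import io
from dataclasses import dataclass

# Column names follow Figure 1; their order in the real CSV file is an assumption.
FIELDS = ["ID", "wife_age", "wife_ed", "hus_ed", "no_child", "wife_rel",
          "wife_work", "hus_oc", "st_live", "media", "cmchoice"]

CLASS_LABELS = {1: "No-use", 2: "Long-term", 3: "Short-term"}


@dataclass
class Record:
    ID: int
    wife_age: int
    wife_ed: int
    hus_ed: int
    no_child: int
    wife_rel: int
    wife_work: int
    hus_oc: int
    st_live: int
    media: int
    cmchoice: int


def load_records(text):
    """Parse CSV text (with a header row) into a list of Record objects."""
    reader = csv.DictReader(io.StringIO(text))
    return [Record(**{name: int(row[name]) for name in FIELDS}) for row in reader]


# Two made-up rows, used only to exercise the parser.
SAMPLE = """ID,wife_age,wife_ed,hus_ed,no_child,wife_rel,wife_work,hus_oc,st_live,media,cmchoice
1,24,2,3,3,1,1,2,3,0,1
2,45,1,3,10,1,1,3,4,0,2
"""

records = load_records(SAMPLE)
print(CLASS_LABELS[records[0].cmchoice])  # No-use
```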
SECTION 2: TECHNICAL DESCRIPTION
2.1 THE THREE-TIER ARCHITECTURE
The most commonly used application development architecture, and the one
supported by most application servers, is a component-based, three-tier model (Directions
on Microsoft). Components increase code reuse and simplify
development. By using components, a developer can package the compiled (binary) code
in such a way that another developer is able to easily and efficiently discover the
functions provided by the component (usually by using a programming language
application such as Visual Basic) and invoke those functions. This is accomplished while
keeping the internal workings of the component hidden.
The three-tier architecture increases scalability and reliability by separating the
three major logical functions of an application (user interaction, business logic, data
storage) from one another. Many Web services must provide functionality that displays
the graphical user interface (GUI), performs the main logic of the program, and then
stores and retrieves data. And although a developer may write a single module that will
interconnect the three functions of user interaction, logic, and data storage, such an
approach would require a great deal of work in maintenance and in deployment.
Therefore, developers attempt to divide the application’s functionality into tiers, or
layers. Years ago, as business applications moved from minicomputer or mainframe
systems to the PC, developers adopted a two-tier strategy, which is also known as the
client-server model. In this model, the data storage (typically provided by a server
running a database management system such as SQL Server or DB2) is separated from
the rest of the application (typically running on desktop PCs). This resulted in many
developer tools being created around the client-server model. However, the client-server
model had its drawbacks, which included the following, as described in (Directions on
Microsoft):
• Difficult to evolve. Because the client piece of a client-server system included
  both the GUI and the business logic, developers updating the GUI could
  inadvertently change the business logic as well.
• Difficult to deploy. A client application had to be deployed on the desktop PC of
  each user who wanted to access the application, potentially requiring thousands of
  deployments.
• Difficult to scale. Each running client connected directly to the database, thereby
  consuming server resources and often limiting the number of simultaneous users
  that could access an application.
On the other hand, the three-tier model introduces an intermediate business-logic tier
between the GUI and the data storage, which provides these advantages over the
client-server model:
• Increased scalability. Logic components can be pooled and shared across multiple
  running clients.
• Easier to maintain. Since the GUI code is separate from the business logic, the GUI
  can be changed and enhanced without accidentally altering core business rules. In
  addition, when the business logic must be changed, only a relatively small number of
  middle-tier servers need to be updated instead of a larger number of desktop PCs.
• Shared business logic and support for multiple interfaces. The same business logic
  can be used from a Web-based interface and a thick-client interface.
Figure 2 illustrates the setup of a typical three-tier architectural model:
Figure 2 (Delphi 2)
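The separation of the three tiers can be sketched in a few lines of Python. This is only an illustration of the idea: the class and function names are hypothetical, and an in-memory list stands in for the database, so it does not reflect the project's actual Perl, ASP, or Access code.

```python
class DataTier:
    """Data storage: an in-memory list stands in for the database."""

    def __init__(self):
        self._rows = []

    def insert(self, row):
        self._rows.append(dict(row))

    def select(self, predicate):
        return [dict(r) for r in self._rows if predicate(r)]


class LogicTier:
    """Business logic: validation and queries; knows nothing about display."""

    def __init__(self, data):
        self._data = data

    def add_record(self, row):
        if row.get("cmchoice") not in (1, 2, 3):
            raise ValueError("cmchoice must be 1, 2, or 3")
        self._data.insert(row)

    def search_by_class(self, cmchoice):
        return self._data.select(lambda r: r["cmchoice"] == cmchoice)


def render_html(rows):
    """Presentation: turns plain data into markup; holds no business rules."""
    items = "".join(f"<li>ID {r['ID']}</li>" for r in rows)
    return f"<ul>{items}</ul>"


logic = LogicTier(DataTier())
logic.add_record({"ID": 1, "cmchoice": 1})
logic.add_record({"ID": 2, "cmchoice": 3})
print(render_html(logic.search_by_class(3)))  # <ul><li>ID 2</li></ul>
```

Because render_html only receives plain rows, the presentation can be changed without touching the validation rule in LogicTier, which is the maintenance advantage described above.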
2.2 WEB BROWSER / HTML
HTML, the HyperText Markup Language, is the standard authoring language for
publishing on the World Wide Web. Having gone through several stages of evolution,
today’s HTML has a wide range of features reflecting the needs of a very diverse and
international community wishing to make information available on the Web (HTML
Activity Statement). HTML defines the layout and structure of a Web document by
using a series of tags and attributes.
In this project, I use HTML for the structure of the Web pages within my project
site. A Web browser is a software application used to locate and display HTML pages.
The Microsoft Internet Explorer Web browser serves as the client in this application.
2.3 CGI
The Common Gateway Interface (CGI) is a standard for interfacing external
applications with information servers, such as HTTP or Web servers (CGI: Common
Gateway Interface). A CGI program is executed in real-time, which means that it can
output dynamic data to a Web page. On the other hand, a generic HTML document that
is retrieved contains static information, which means it exists in a constant state and the
information outputted to the screen does not change. Because a CGI program is
executable, it allows visitors to a Web page to run a program on the server where the CGI
document is hosted. For this and other reasons, authors of CGI scripts must take some
security measures when it comes to the execution of the scripts. CGI programs must
reside in a special directory, so that the Web server knows to execute the program instead
of merely displaying it to the browser. Typically, this directory is under direct control of
the webmaster, which prevents the average user from creating CGI programs. The most
common practice is to place CGI programs in a directory entitled ‘/cgi-bin’.
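To illustrate the difference between static HTML and CGI output, here is a minimal sketch of a CGI-style program written in Python (the project's own scripts are written in Perl). The function name is hypothetical. The script writes a header, a blank line, and then HTML that is generated anew on each request.

```python
#!/usr/bin/env python3
import datetime
import html
import os


def build_response(environ):
    """Return a complete CGI response: header lines, a blank line, then HTML."""
    # REMOTE_ADDR and HTTP_USER_AGENT are standard CGI environment variables.
    client = html.escape(environ.get("REMOTE_ADDR", "unknown"))
    agent = html.escape(environ.get("HTTP_USER_AGENT", "unknown"))
    now = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    body = (
        "<html><body>"
        f"<p>Generated at {now}</p>"
        f"<p>Client: {client}</p>"
        f"<p>Browser: {agent}</p>"
        "</body></html>"
    )
    return "Content-Type: text/html\r\n\r\n" + body


if __name__ == "__main__":
    # The Web server runs this file and relays whatever it prints to the browser.
    print(build_response(os.environ), end="")
```

Escaping the values taken from the environment is one of the security measures a script author should take, since they are supplied by the client.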
2.4 PERL
A CGI program can be written in any language that allows it to be executed on
the server, and Perl is the language of choice for many developers. Perl is an
acronym for the Practical Extraction and Report Language. Perl is available for most
operating systems, including virtually all Unix-like platforms (Perl). The language is
optimized for scanning arbitrary text files, extracting information from those text files,
and printing reports based on that information. Perl can handle many system
management tasks, and the language’s designers intended it to be practical, easy to use,
and efficient. Perl combines many features of C, sed, awk, and sh, as well as csh,
Pascal, and BASIC-PLUS. Expression syntax in Perl corresponds closely to C
expression syntax. Perl, unlike most Unix utilities, does not arbitrarily limit the size of
the user’s data, as long as the required memory is available. As an example, Perl can
parse a whole file as a single string. Recursion in Perl is of unlimited depth. The
tables used by hashes, commonly referred to as associative arrays, grow as necessary to
prevent diminished performance. One of Perl’s most useful capabilities is that it can
use sophisticated pattern matching techniques to scan large amounts of data quickly.
And although optimized for scanning text, Perl can also deal with binary data.
In this project, I use Perl to implement CGI scripts for performing the database
manipulation operations, such as insert, delete, edit, and search. Perl and CGI serve as
a part of the middle tier of this application.
2.5 ASP / VBScript
Active Server Pages (ASP) are components that allow Web developers to create
server-side scripted templates. In turn, these templates generate dynamic, interactive web
server applications. By embedding special programmatic codes in standard HTML pages,
a user can interact with page objects such as ActiveX or Java components, access data in
a database, or create other types of dynamic output. The HTML output by an Active
Server Page is browser independent, which means that it can be read equally well
by Microsoft Internet Explorer, Netscape Navigator, or most other browsers (ASP-help.com).
In this project, I use ASP technology to allow the implementation of the user login
feature, as well as the add user function, which is done using Visual Basic script, or
VBScript. ASP / VBScript serve as a part of the middle tier of this application.
2.6 B-Course
B-Course is a Web-based data analysis tool for Naive Bayesian modeling.
Specifically, B-Course is used for dependence and classification modeling. B-Course can
be freely used for educational and research purposes as an analysis tool where
dependence or classification modeling based on data is needed. The software provides
two courses of modeling: dependency modeling and classification.
2.7 VISUAL BASIC DATA MINING.NET
Visual Basic Data Mining.Net is a Web portal that provides data mining
algorithm and application documentation, as well as various source codes in .Net and
Visual Basic. These features of the site demonstrate how the .NET Framework and/or
Visual Basic can be used to either learn how data mining algorithms and applications
function or to build data mining applications. Visual Basic Data Mining.Net also offers
a data mining community and provides functionality of data mining algorithms and
applications. The site provides a wizard-based interface for implementing the
algorithms. Visual Basic Data Mining.Net can be found online at: http://www.visualbasic-data-mining.net.
2.8 SEE5
See5 analyzes data to produce decision trees and/or rulesets that relate a case’s
class to the values of its attributes (See5). In See5, an application consists of a
collection of text files. These files define classes and attributes, describe the cases to
be analyzed, provide new cases to test the classifiers produced by See5, and specify
misclassification costs or penalties. A See5 application consists of two mandatory
files, which are a .names file and a .data file. The .names file defines the classes and
attributes associated with the data. The .data file contains the actual cases to be
analyzed by See5 in the process of producing a classifier.
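As a rough illustration of the two mandatory files, the Python sketch below generates the text of a .names file and one line of a .data file for the attributes in Figure 1. The declarations are assumptions based on Figure 1 and are not the project's actual files, the helper names are hypothetical, and the exact file syntax should be checked against the See5 documentation.

```python
# Attribute declarations based on Figure 1 (assumed, not the author's file).
# "continuous" marks numeric attributes, a comma-separated list marks discrete
# ones, and "label" marks an identifier that is not used for classification.
ATTRIBUTES = [
    ("ID", "label"),
    ("wife_age", "continuous"),
    ("wife_ed", "1, 2, 3, 4"),
    ("hus_ed", "1, 2, 3, 4"),
    ("no_child", "continuous"),
    ("wife_rel", "0, 1"),
    ("wife_work", "0, 1"),
    ("hus_oc", "1, 2, 3, 4"),
    ("st_live", "1, 2, 3, 4"),
    ("media", "0, 1"),
    ("cmchoice", "1, 2, 3"),
]


def names_file_text(target, attributes):
    """Build .names text: the target attribute first, then one line per attribute."""
    lines = [f"{target}.", ""]
    lines += [f"{name}: {spec}." for name, spec in attributes]
    return "\n".join(lines) + "\n"


def data_line(values):
    """Build one .data line: the attribute values of a case, comma-separated."""
    return ",".join(str(v) for v in values)


print(names_file_text("cmchoice", ATTRIBUTES))
print(data_line([1, 24, 2, 3, 3, 1, 1, 2, 3, 0, 1]))  # made-up case
```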
SECTION 3: DATA MINING ALGORITHMS
3.1 NAIVE BAYESIAN CLASSIFICATION
Bayes' theorem shows how to calculate the probability of one event given that
it is known some other event has occurred. Expressed algebraically, this is a simple
class-conditional approach, based upon the following relationship:
P(A|B) = P(A) * P(B|A) / P(B)
or, the probability that A takes place given that B has occurred (P(A|B)) equals the
probability that A occurs (P(A)) times the probability that B occurs if A has happened
(P(B|A)), divided by the probability of B occurring (P(B)). Naive Bayesian classifiers
make the assumption that an attribute’s effect on a given class is independent of values of
any other attribute, and this assumption is known as class conditional independence. It is
made to simplify the computation and in this sense considered to be “naive” (Naive
Bayes – Introduction).
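The formula and the class conditional independence assumption can be sketched in Python as follows. This is a minimal illustration on made-up data, not the method used by B-Course; the function names are hypothetical, and the add-one smoothing is an assumption added to avoid zero probabilities.

```python
from collections import Counter, defaultdict


def bayes(p_a, p_b_given_a, p_b):
    """P(A|B) = P(A) * P(B|A) / P(B)."""
    return p_a * p_b_given_a / p_b


def train(rows, labels):
    """Count class frequencies and, per class, the frequency of each attribute value."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    domains = defaultdict(set)           # attribute index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            domains[i].add(value)
    return class_counts, value_counts, domains


def classify(row, model):
    """Pick the class maximizing P(class) times the product of P(value | class)."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total
        for i, value in enumerate(row):
            # Each attribute contributes independently (the "naive" assumption).
            # Add-one smoothing is an assumption of this sketch.
            score *= (value_counts[(i, label)][value] + 1) / (count + len(domains[i]))
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up training data: (wife_ed, media) pairs with class labels 1 and 3.
rows = [(1, 0), (1, 1), (4, 0), (4, 1), (4, 0)]
labels = [1, 1, 3, 3, 3]
model = train(rows, labels)
print(classify((4, 0), model))  # 3
```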
The independence assumption that underlies the Naive Bayesian classification
technique is a strong one and therefore may not be realistic. However, a
Naive Bayesian classifier can still yield an excellent prediction. One such case
may occur when a feature selection process on the data is completed prior to
classification. This ensures that only one of any pair of highly correlated features is kept
and used in the classification process. When dealing with gene expression data, feature
selection must be performed prior to classification due to the extremely high
dimensionality of the feature space (Wallach, 2003).
A Bayesian network consists of nodes and arcs that can connect pairs of nodes
(P.Myllymäki, et. al). For each variable, exactly one node exists. A major
restriction for the Bayesian network is that arcs are not allowed to form loops. If
the arcs can be followed such that some node is visited twice, the model is not a
Bayesian network. Figure 3 is an example of a network that is NOT a Bayesian
network:
Figure 3 (P.Myllymäki, et.al.)
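The no-loops restriction can be checked mechanically. The sketch below uses a depth-first search to test whether a set of directed arcs contains a loop. The function name is hypothetical and the two arc lists are made-up examples, not the networks drawn in the figures.

```python
def has_directed_cycle(arcs):
    """Return True if the directed arcs (parent, child) contain a loop."""
    children = {}
    for parent, child in arcs:
        children.setdefault(parent, []).append(child)
        children.setdefault(child, [])
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {node: WHITE for node in children}

    def visit(node):
        color[node] = GREY
        for nxt in children[node]:
            if color[nxt] == GREY:  # reached a node already on the current path
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in children)


print(has_directed_cycle([("A", "B"), ("B", "C"), ("C", "A")]))  # True
print(has_directed_cycle([("A", "C"), ("B", "C"), ("C", "D")]))  # False
```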
Presented next is a dependency model for a Bayesian network. This example
model is given in (P.Myllymäki, et. al):
• A and B are dependent on each other if we know something about C or D (or
  both).
• A and C are dependent on each other no matter what we know and what we
  don't know about B or D (or both).
• B and C are dependent on each other no matter what we know and what we
  don't know about A or D (or both).
• C and D are dependent on each other no matter what we know and what we
  don't know about A or B (or both).
• There are no other dependencies that do not follow from those listed above.
Figure 4 shows the Bayesian network for these dependencies:
Figure 4 (P.Myllymäki, et.al.)
Given a (possibly empty) set S that contains some other variables of the network, A and B
are considered dependent if one can freely travel the arcs from A to B. If the
arcs cannot be freely traveled from A to B, A and B are not dependent given S. The
ability to travel an arc is generally independent of the direction of the arc. If S is an
empty set, one may travel the arcs forward and backward, provided that the same node is
never visited twice and that an arc is never traveled forward and then immediately
followed by traveling backward on some other arc.
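For the special case where S is empty, the travel rule can be sketched as a search over the arcs. The sketch assumes that a forward step (parent to child) may not be followed immediately by a backward step, which is consistent with the dependency model listed earlier, where A and B become dependent only once something is known about C or D. The arc list A -> C, B -> C, C -> D is an assumed reading of that model, and the function name is hypothetical.

```python
from collections import deque


def dependent_given_nothing(arcs, start, goal):
    """True if goal can be reached from start when the conditioning set is empty."""
    parents, children = {}, {}
    for p, c in arcs:
        children.setdefault(p, set()).add(c)
        parents.setdefault(c, set()).add(p)
    # A state is (node, arrived_forward). After arriving forward (parent -> node),
    # the next step may not go backward, so only children can be visited.
    queue = deque([(start, False)])
    seen = {(start, False)}
    while queue:
        node, arrived_forward = queue.popleft()
        if node == goal:
            return True
        steps = [(c, True) for c in children.get(node, ())]
        if not arrived_forward:
            steps += [(p, False) for p in parents.get(node, ())]
        for state in steps:
            if state not in seen:
                seen.add(state)
                queue.append(state)
    return False


ARCS = [("A", "C"), ("B", "C"), ("C", "D")]
print(dependent_given_nothing(ARCS, "A", "B"))  # False
print(dependent_given_nothing(ARCS, "A", "C"))  # True
```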
In this project, I use B-Course to perform Naive Bayesian dependency modeling
and Naive Bayesian Classification on the contraceptive method choice database.
3.2 ONE-RULE CLASSIFICATION
The one-rule algorithm creates one data mining rule for the dataset based on one
attribute (one column in a database table). After comparing the error rates from all the
attributes, it then chooses the rule that gives the lowest classification error. The rule will
assign to one category or class each distinct value of one chosen attribute. This rule can
be defined in pseudocode as (Tagbo):
For each attribute in the data set
    For each distinct value of the attribute
        Find the most frequent classification
        Assign the classification to the value
        Calculate the error rate for the value
    Calculate the total error rate for the attribute
Choose the attribute with the lowest error rate
Create one rule for the chosen attribute
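The pseudocode above can be sketched in Python as follows. This is an illustrative version on made-up rows, not the Visual Basic Data Mining.Net implementation; the function name is hypothetical, and numeric attributes are treated like categorical ones for simplicity.

```python
from collections import Counter, defaultdict


def one_rule(rows, attributes, class_attr):
    """Return (best attribute, {value: class}, error rate) for a list of dict rows."""
    best = None
    for attr in attributes:
        counts = defaultdict(Counter)            # value -> class frequencies
        for row in rows:
            counts[row[attr]][row[class_attr]] += 1
        rule, errors = {}, 0
        for value, freq in counts.items():
            majority, hits = freq.most_common(1)[0]
            rule[value] = majority               # assign the most frequent class
            errors += sum(freq.values()) - hits  # rows this value misclassifies
        error_rate = errors / len(rows)          # total error rate for the attribute
        if best is None or error_rate < best[2]:
            best = (attr, rule, error_rate)
    return best


# Made-up rows, used only to exercise the function.
rows = [
    {"wife_work": 0, "media": 0, "cmchoice": 1},
    {"wife_work": 0, "media": 0, "cmchoice": 1},
    {"wife_work": 1, "media": 0, "cmchoice": 3},
    {"wife_work": 1, "media": 1, "cmchoice": 3},
    {"wife_work": 1, "media": 1, "cmchoice": 1},
]
attr, rule, error_rate = one_rule(rows, ["wife_work", "media"], "cmchoice")
print(attr, rule, error_rate)  # wife_work {0: 1, 1: 3} 0.2
```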
The goal of the one rule data mining algorithm in this implementation is to
build, for each of the attributes wife_age, wife_ed, hus_ed, no_child, wife_rel, wife_work,
hus_oc, st_live, and media of the contraceptive method choice database, a rule that assigns
each attribute value to one of the classes no use, long-term
methods, or short-term methods. Afterwards, the attribute with the lowest error rate is
chosen as the best rule. In this project, I use Visual Basic Data Mining.Net to process
the results of the One-Rule Classification algorithm on the contraceptive method choice
database.
3.3 DECISION TREE
A visual aid for data mining is the decision tree. A decision tree is in essence a
flow chart of questions or data points. These questions or data points eventually lead to a
decision. Decision tree algorithms begin by finding the test that does the best job of
splitting the data among the preferred categories. At each successive level of the tree,
subsets created by the previous split are themselves split, making a path down the tree.
Each of the paths through the tree represents a rule. However, some rules are more useful
than others. In some cases, the predictive power of the entire tree can be
improved by pruning back the weaker branches. At each node of the tree, three things can
be measured: the number of records entering the node, the percentage of records
classified correctly at the node, and the way the records would be classified if it were a
leaf node. The tree continues to grow until it is no longer possible to locate more useful
ways to split the incoming records. Decision trees create a set of bins or boxes where the
data miner may place records.
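One common way to decide which test "does the best job" of splitting is to compare the information gain of each candidate attribute. The sketch below assumes that measure for illustration only; the text does not state which criterion is used, and See5's own criterion is not reproduced here. All names and rows are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy before the split minus the weighted entropy of the subsets after it."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_split(rows, labels, attributes):
    """Choose the attribute whose split yields the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Made-up rows: attribute "a" separates the classes perfectly, "b" not at all.
rows = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 0}, {"a": 1, "b": 1}]
labels = [1, 1, 3, 3]
print(best_split(rows, labels, ["a", "b"]))  # a
```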
Figure 5 shows a partial binary tree for the classification of musical instruments. The
gap in the center of the row of bins corresponds to the root node of the tree. All stringed
instruments then fall to the left of the gap, and all other instruments fall to the right.
Figure 5 (Berry, Linoff, pg. 245)
In this project, I use See5 to construct decision trees and process those results for the
contraceptive method choice database.
SECTION 4: SYSTEM DESIGN
4.1 SYSTEM LAYOUT
[Diagram labels: User Login; Administrator Login; Query / Manipulation; Search; Delete; Edit; Add Records; Add Users; Change Admin Password; Logs; Data Mining (Naive Bayes, One Rule, Decision Tree)]
Figure 6: Project System Flow
4.2 WEBSITE PRESENTATION
Figure 7: The contraceptive method choice database homepage
Figure 8: Administrator Login Page
To guarantee security, only the privileged database administrator can log in to the
database to perform the restricted database manipulation functions, which are to
delete records and edit records. The administrator can also add users and change the
admin password.
Figure 9: Administrator Options
After the administrator successfully logs in, administrator options are presented. These
options include: search records, change password, add records, add users, delete records,
and edit records. NOTE: Clicking the “Delete Record” button next to an entry will
delete that entry from the database.
Figure 10: Password Change Success Page
Figure 11: Add User Page
Figure 12: Add User Success Page
Figure 13: Edit Record Page
Figure 14: Edit Record Success Page
Figure 15: User Login Page
Users have privileges to add records and to search for records.
Figure 16: Bad User Login Page
Figure 17: User Request Page
Figure 18: User Request Success Page
Figure 19: Email Message
This is the email that the system automatically sends to the database administrator when a
user requests a login name and password.
Figure 20: Search Page
Figure 21: Search Page Results
Both the database administrator and users have access to the search function.
Figure 22: Add Record Page
Figure 23: Add Record Success Page
Both the database administrator and users have access to the add records function.
Figure 24: Access Log Detail
Both the database administrator and users have access to the access log feature. A count
is kept for the different types of browsers and operating systems used. The log detail
contains the time, date, host server, browser, and operating system of the computer that
accesses the system.
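The counting described above can be sketched in Python as follows. The record layout mirrors the fields listed in the text, but the class name, function name, and sample entries are hypothetical and do not come from the project's log.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class LogEntry:
    date: str
    time: str
    host: str
    browser: str
    operating_system: str


def summarize(entries):
    """Count how often each browser and each operating system appears in the log."""
    browsers = Counter(e.browser for e in entries)
    systems = Counter(e.operating_system for e in entries)
    return browsers, systems


# Made-up entries, used only to exercise the function.
entries = [
    LogEntry("2004-04-01", "09:15:02", "host1.example.org", "Internet Explorer", "Windows"),
    LogEntry("2004-04-01", "10:40:37", "host2.example.org", "Netscape Navigator", "Linux"),
    LogEntry("2004-04-02", "08:05:11", "host1.example.org", "Internet Explorer", "Windows"),
]
browsers, systems = summarize(entries)
print(browsers["Internet Explorer"], systems["Linux"])  # 2 1
```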
SECTION 5: DISCUSSION
5.1 NAIVE BAYESIAN RESULTS
B-Course was used to construct Bayesian dependency models for the
contraceptive method choice database. All variables, excluding the primary key
ID, were used in constructing the model. When the software is invoked, B-Course
searches for the most probable model for the data and returns these intermediate
results. B-Course can then continue using a search strategy of selecting models
that resemble the current best model, instead of picking models randomly from a
set. As B-Course continues, it collects a set of relatively good models and then
attempts to combine the best parts of these models so that the resulting combined
model is better than any of the original models.
After evaluating 8539 candidate models, B-Course returned the following
Bayesian network as the best model:
Figure 25: Bayesian Network (P.Myllymäki, et.al.)
B-Course was started again, evaluating 444681 more candidate models, for a grand
total of 453220 models evaluated. After searching these candidate models,
B-Course located a new Bayesian network that represents the same model as the
previous network:
Figure 26: New Bayesian Network (P.Myllymäki, et.al.)
B-Course also provides for Naive Bayesian classification. In classification
modeling, one attribute of the data is chosen as the class variable, and the other attributes
become predictor variables. The ultimate goal is to find the model that, given the values
of predictor variables, deduces the value of the class variable. Classification modeling
can also help to test whether some classes are similar or not. For example, if a model can
correctly tell the classes apart, then there must be some difference in those particular
classes. More analysis can measure how significant the differences in classes are.
B-Course merges many quantitative models to build one single classification model.
After running B-Course, 301 candidate models were evaluated. The estimated
classification accuracy of the best model found was 48.74%. On average, the correct
class received a probability of 36.56%. Figure 27 displays the variables B-Course found as
the best subset for predicting the class variable:
Figure 27: Classification model (P.Myllymäki, et.al.)
Figure 28: Class arc weights (P.Myllymäki, et.al.)
It was estimated that if the selected model were used, then 48.74% of future
classifications would be done correctly. To obtain this estimate, B-Course built 1473
models, each of which was constructed using all but one of the data items in the dataset.
Each model was then used to classify the one data item not used in its construction.
Out of 1473 models, 718 succeeded in classifying the one unseen data item correctly.
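The estimate described here, in which each model is tested on a single held-out item, can be sketched as a leave-one-out loop. The code below is a generic illustration with a stand-in majority-class classifier and made-up labels. It is not B-Course's procedure, and all function names are hypothetical.

```python
from collections import Counter


def leave_one_out_accuracy(rows, labels, train, predict):
    """Train on all items but one, test on the held-out item, and repeat for each item."""
    correct = 0
    for i in range(len(rows)):
        train_rows = rows[:i] + rows[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        model = train(train_rows, train_labels)
        if predict(model, rows[i]) == labels[i]:
            correct += 1
    return correct / len(rows)


# Stand-in classifier: always predicts the most frequent training class.
def train_majority(rows, labels):
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, row):
    return model


rows = [0, 1, 2, 3, 4, 5]      # made-up items; the stand-in ignores them
labels = [1, 1, 1, 1, 3, 3]
print(leave_one_out_accuracy(rows, labels, train_majority, predict_majority))

# The figure reported in the text: 718 correct out of 1473 held-out items.
print(round(718 / 1473 * 100, 2))  # 48.74
```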
A confusion matrix displays how many members of a certain class were predicted
to be members of a different class. Figure 29 shows a confusion matrix for the Naïve
Bayesian classifier, where the entries denoting numbers of correct classifications are
on the main diagonal.
                        Predicted
Actual        Long-term   No-use   Short-term
Long-term        102         60       171
No-use            79        319       231
Short-term        66        147       297
Figure 29: Confusion Matrix (P.Myllymäki, et.al.)
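The relationship between the confusion matrix and the reported accuracy can be checked with a short Python sketch. The matrix values are copied from Figure 29; the variable names are hypothetical. The accuracy is computed against the 1473 instances stated in the text.

```python
LABELS = ["Long-term", "No-use", "Short-term"]

# Rows are actual classes, columns are predicted classes (values from Figure 29).
MATRIX = [
    [102, 60, 171],
    [79, 319, 231],
    [66, 147, 297],
]

# Correct classifications lie on the main diagonal.
correct = sum(MATRIX[i][i] for i in range(len(LABELS)))

# Accuracy against the 1473 instances stated in the text. Note that the matrix
# entries as printed sum to 1472, one fewer than the stated instance count.
accuracy = correct / 1473

# Share of each actual class that was classified correctly.
per_class = {LABELS[i]: MATRIX[i][i] / sum(MATRIX[i]) for i in range(len(LABELS))}

print(correct)                   # 718
print(round(accuracy * 100, 2))  # 48.74
```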
5.2 ONE-RULE RESULTS
Using Visual Basic Data Mining.Net software, I applied the one-rule
classification algorithm to the contraceptive method choice database. The steps used in
producing the one-rule results are as follows:

Step 1: Decide which of the attributes will be used to create the best one-rule for
the dataset. Attribute ID is not chosen because it is the primary key for the
database. Attribute cmchoice is not selected because it is the class attribute,
containing the categories needed for classification. The remaining 9 attributes are
chosen.

Step 2: List the distinct values of each attribute. These values can be seen in
Figure 1.

Step 3: Find the most frequent classification for every distinct value of an
attribute using the contraceptive method choice class values (no use, long-term
methods, short-term methods). For example, according to the output, when
no_child = 8, there were 9 cases of category no use, 7 cases of category long-term
methods, and 8 cases of category short-term methods. Therefore, the most
frequent classification is category no use, and a rule is made classifying 8 children
as category no use, or 8 children -> No Use. The error rate for 8 children is the
total number of times it appears in the dataset (24) minus the number of instances
of its most frequent class (9), divided by the total (24). So the error rate in this
case is 15 / 24.

Step 4: Repeat Step 3 for each case of each attribute.

Step 5: Choose the attribute with the lowest error rate

Step 6: Create a one-rule classification based on this attribute
Figure 30 displays a portion of the one-rule classification output. As shown, the attribute
with the lowest error rate, which was selected as the best rule, is no_child.
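The arithmetic in Step 3 can be reproduced with a short Python sketch. The class counts for no_child = 8 are taken from the text; the function name is hypothetical.

```python
from fractions import Fraction


def value_error_rate(class_counts):
    """Return the most frequent class and the error rate for one attribute value."""
    total = sum(class_counts.values())
    majority = max(class_counts, key=class_counts.get)
    return majority, Fraction(total - class_counts[majority], total)


# Class counts reported for no_child = 8.
counts = {"no use": 9, "long-term": 7, "short-term": 8}
label, rate = value_error_rate(counts)
print(label, rate)  # no use 5/8
```

The fraction 15 / 24 reduces to 5 / 8, or an error rate of 62.5% for this value.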
Attribute   IsNumeric  BestRule  Value  L.P.B.  U.P.B.  Class  Frequency  Total
wife_work   False      False
wife_rel    False      False
wife_ed     False      False
st_liv      False      False
media       False      False
hus_oc      False      False
hus_ed      False      False
wife_age    True       False
no_child    True       True      0      0       0       1      62         62
no_child    True       True      0      0       0       1      33         34
no_child    True       True      0      0       1       1      94         95
no_child    True       True      0      1       1       2      31         31
no_child    True       True      0      1       1       3      61         61
no_child    True       True      0      1       1       1      49         49
no_child    True       True      0      1       1       2      15         15
no_child    True       True      0      1       1.5     3      26         26
no_child    True       True      0      1.5     2       1      83         83
no_child    True       True      0      2       2       2      39         39
no_child    True       True      0      2       2       3      77         77
no_child    True       True      0      2       2       1      31         31
no_child    True       True      0      2       2       2      17         17
no_child    True       True      0      2       2.5     3      29         29
no_child    True       True      0      2.5     3       1      46         46
no_child    True       True      0      3       3       2      44         44
no_child    True       True      0      3       3       3      90         90
no_child    True       True      0      3       3       1      24         24
no_child    True       True      0      3       3       2      26         26
no_child    True       True      0      3       3.5     3      29         29
no_child    True       True      0      3.5     4       1      37         37
Figure 30: One Rule Output (Tagbo)
5.3 DECISION TREE RESULTS
I performed decision tree analysis on the contraceptive method choice dataset
using See5. There are 1473 instances in the dataset, with 10 attributes plus the
unique identifier ID. However, this version of See5 allowed a maximum of 400 cases
to be used at a time. The class attribute, cmchoice, is represented by three
categories (1 = no use; 2 = long-term; 3 = short-term).
The numbers shown between 0
and 1 represent the probability of the attribute, at the given criteria, belonging to the
specific class (no use, long-term, short term). The 400 cases were selected such that
relatively equal numbers of cases for each contraceptive method choice classification are
present. Thus, for the cmc.data file, the breakdown by ID is as follows:
• No use (1): ID# 1-133
• Long-term (2): ID# 416-549
• Short-term (3): ID# 643-776
Below is a partial screen shot of a mine for the ruleset of the attributes. A 95%
confidence interval was used for all mines.
Figure 31: Ruleset (Quinlan)
Figure 32: Decision Tree Output (Quinlan)
See5 creates a decision tree of the results. To paraphrase, the tree can be translated in
this manner:
if no_child is less than or equal to 0, then no use
else if no_child > 0
    if wife_ed = 1
        if wife_age > 36, then no use
        else if wife_age <= 36
            if st_live = 1, then no use
            if st_live = 2, then long-term
            if st_live = 3, then short-term
            if st_live = 4, then long-term
    if wife_ed = 2
        …………….
        (etc…)
From the decision tree, conclusions can be drawn about which contraceptive
method choice is most likely for Indonesian women with given characteristics. For example,
a woman with no children is predicted to choose no use. A wife with at least one child, a
low educational level, and above the age of 36 is predicted for no use. A wife with at least
one child, a low educational level, less than or equal to 36 years old, and with a standard
of living index of 2 is predicted to use long-term methods. A wife with those same
characteristics, but with a standard of living index of 3, is predicted to use short-term
methods. Numerous predictions can be seen throughout the decision tree.
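The portion of the tree paraphrased above can be written as a small Python function. Only the branches shown in the excerpt are encoded; the function name is hypothetical, and it returns None for the branches that the excerpt omits.

```python
def predict_partial(no_child, wife_ed, wife_age, st_live):
    """Return the predicted class, or None for branches the excerpt does not show."""
    if no_child <= 0:
        return "no use"
    if wife_ed == 1:
        if wife_age > 36:
            return "no use"
        # wife_age <= 36: the prediction depends on the standard-of-living index.
        by_st_live = {1: "no use", 2: "long-term", 3: "short-term", 4: "long-term"}
        return by_st_live.get(st_live)
    return None  # wife_ed = 2, 3, 4: not shown in the excerpt


print(predict_partial(no_child=0, wife_ed=3, wife_age=25, st_live=2))  # no use
print(predict_partial(no_child=2, wife_ed=1, wife_age=30, st_live=3))  # short-term
```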
In many domains, the appropriate classification changes only gradually as attribute
values change. For example, a threshold may assign values less than or equal to 0.5 to
one classification, say long-term methods, and values greater than 0.5 to another
classification, say short-term methods. If the value is at most 0.5, we go no further and
predict long-term methods. By default, a threshold such as this is sharp. Therefore, a
case with a hypothetical value of 0.49 is treated quite differently from one with a value
of 0.51.
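A minimal Python sketch of a sharp threshold, using the hypothetical 0.5 cut-off and the two example classes from the paragraph above. The function name classify_sharp is an assumption made for illustration; neither it nor the 0.5 value comes from the dataset or from See5.

```python
def classify_sharp(value, threshold=0.5):
    """Sharp threshold: a single cut-off decides the branch outright."""
    return "long-term" if value <= threshold else "short-term"


# Two nearly identical values fall on opposite sides of the cut-off
print(classify_sharp(0.49))
print(classify_sharp(0.51))
```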
See5 contains an option to invoke fuzzy thresholds instead of sharp thresholds
like the one described above. A fuzzy set is a set whose
elements are usually neither totally in the set nor totally out of the set (Meadow, et al., pg.
217). When this option is invoked, each threshold is broken into three ranges, denoted
by a lower bound lb, an upper bound ub, and a central value t. If the attribute value in
question is below lb or above ub, classification is made by using the single branch
corresponding to the '<=' or '>' result, respectively. If the value falls between lb and ub,
then both branches of the tree are investigated, with the results combined
probabilistically. The values of lb and ub are determined by See5 based on a study of the
perceived sensitivity of classification to small changes in the threshold. Figure 33 shows
a screenshot of the classifier construction options, and Figure 34 displays part of the
decision tree with fuzzy thresholds:
Figure 33: Classifier Construction Options (Quinlan)
Figure 34: Decision Tree Output with Fuzzy Thresholds (Quinlan)
Of note is how the upper and lower bounds of the thresholds are specified. For instance,
in the non-fuzzy example, when no_child > 0 and wife_ed = 1, wife_age has one
threshold, 36: if wife_age is greater than 36, no use is returned; if wife_age is less than
or equal to 36, then the tree branches to the st_live attribute to determine the appropriate
class. However, in the fuzzy example, there is no single threshold, or cut-off. If
no_child >= 1 and wife_ed = 1, then no use is returned when wife_age >= 38; when
wife_age <= 35, the tree branches to the st_live attribute to determine which
class is predicted. The fuzzy thresholds option constructs an interval around the
threshold. Within this interval, both branches of the tree are explored, and the results
are then combined to give a predicted class. When wife_age is greater than 35 and less
than 38 (35 < wife_age < 38), the prediction becomes imprecise. A wife_age value of 36.5 is
chosen as the fuzzy threshold.
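The behaviour of the fuzzy threshold on wife_age can be sketched in Python. The text does not state exactly how See5 weights the two branches inside the interval, so the piecewise-linear weighting below is an assumption. The function names are also assumptions, as is the simplification that each branch predicts a single class with certainty (the '<=' branch is taken at st_live = 2, i.e. long-term). The bounds lb = 35 and ub = 38 come from the example above, and 36.5 is taken as the central value t.

```python
LB, T, UB = 35.0, 36.5, 38.0  # lower bound, central value, upper bound


def weight_le_branch(value, lb=LB, t=T, ub=UB):
    """Weight given to the '<=' branch (assumed piecewise-linear in value)."""
    if value <= lb:
        return 1.0  # clearly below the interval: only the '<=' branch is used
    if value >= ub:
        return 0.0  # clearly above the interval: only the '>' branch is used
    if value <= t:
        return 1.0 - 0.5 * (value - lb) / (t - lb)
    return 0.5 - 0.5 * (value - t) / (ub - t)


def predict_fuzzy(wife_age, le_class="long-term", gt_class="no use"):
    """Combine both branches into class probabilities for one case."""
    w = weight_le_branch(wife_age)
    probs = {}
    probs[le_class] = probs.get(le_class, 0.0) + w
    probs[gt_class] = probs.get(gt_class, 0.0) + (1.0 - w)
    return probs


for age in (30, 36.5, 40):
    print(age, predict_fuzzy(age))
```

Under this weighting, a case exactly at the central value draws equally on both branches, while cases outside the interval behave exactly as they would with a sharp threshold.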
5.4 CONCLUSION
All three data mining algorithms were successful at predicting the contraceptive
method choice of an Indonesian woman based on her demographic and socio-economic
characteristics. B-Course created Bayesian dependency networks for the attributes of the
dataset. The estimated classification accuracy of the best model found was 48.74%.
With the resulting accuracy of the classification being less than 50% in this case, the
Naive Bayesian algorithm may not be the best model for this dataset. It is possible that
the creation of more candidate models may increase the accuracy percentage. One-Rule
classification determined that the no_child attribute, which is the number of children born
to an Indonesian woman, was the best rule for predicting the contraceptive method
choice. The decision tree algorithm determined that the best predictor of the
contraceptive method choice was the rule no_child <= 0, which predicts the
no use category (95.7%). In comparing the regular decision tree to the decision tree
containing fuzzy thresholds, the regular decision tree had an error rate of 25.0%, while
the decision tree with fuzzy thresholds had an error rate of 25.5%. There was not a
significant difference between these two methods.
WORKS CITED
ASPhelp.com. “What are Active Server Pages?”. Retrieved March 8, 2003, from
the World Wide Web. http://www.asp-help.com/getstarted/gs_aboutasp.asp
Berry, Michael, and Gordon Linoff. Data Mining Techniques for Marketing,
Sales, and Customer Support. New York: John Wiley and Sons. 1997.
CGI: Common Gateway Interface. Retrieved March 8, 2003, from the World
Wide Web. http://hoohoo.ncsa.uiuc.edu/cgi/intro.html.
Delphi 2 – Developing for Multi-Tier Distributed Computing Architectures.
Retrieved March 9, 2003, from the World Wide Web.
http://community.borland.com/article/0,1410,10343,00.html#three.
Directions on Microsoft. “What is an Application Server?”. Retrieved March 9,
2003, from the World Wide Web.
http://www.directionsonmicrosoft.com/sample/DOMIS/research/2002/12dec/1202wiaas.htm
HTML Activity Statement. Retrieved March 8, 2003, from the World Wide Web.
http://www.w3.org/MarkUp/Activity.
Lewis, David. “Naive (Bayes) at Forty: The Independence Assumption in
Information Retrieval”. Proceedings of ECML-98, 10th European Conference on
Machine Learning. Florham Park, NJ: AT&T Labs Research, 1998.
Meadow, Charles, B.R. Boyce, D.H. Kraft. Text Information Retrieval Systems,
2nd Edition. San Diego: Academic Press. 2000.
“Naïve Bayes – Introduction”. Retrieved February 5, 2003, from the World Wide
Web. http://www.resample.com/xlminer/help/NaiveBC/classiNB_intro.htm.
O’Reilly and Associates. “Perl”. Retrieved March 8, 2003, from the World Wide
Web. http://www.perldoc.com/perl5.6/pod/perl.html.
P. Myllymäki, T. Silander, H. Tirri, and P. Uronen. B-Course: A Web-Based Tool
for Bayesian and Causal Data Analysis. International Journal on Artificial Intelligence
Tools, Vol 11, No. 3 (2002) 369-387.
Quinlan, Ross. “RuleQuest Research Data Mining Tools”. Retrieved March 18,
2003, from the World Wide Web. http://www.rulequest.com/.
Tagbo, Kingsley. “Visual Basic Data Mining.Net”. http://www.visual-basicdata-mining.net. 2002.
United Nations Population Fund - Indonesia. Retrieved March 16, 2003, from the
World Wide Web. http://www.un.or.id/unfpa/idpop.html.
Wallach, Hannah. “Supervised Learning Methods”. Retrieved March 14, 2003,
from the World Wide Web.
http://www.srcf.ucam.org/~hmw26/coursework/dme/node14.html