Data Retrieval and Preparation

INTRODUCTION TO WEKA:
WHAT IS WEKA?
Weka is a collection of machine learning algorithms for data mining tasks. The
algorithms can either be applied directly to a dataset or called from your own Java code. Weka
contains tools for data pre-processing, classification, regression, clustering, association rules, and
visualization. It is also well-suited for developing new machine learning schemes.
Main features:
• Comprehensive set of data pre-processing tools, learning algorithms and evaluation methods
• Graphical user interfaces (incl. data visualization)
• Environment for comparing learning algorithms
THE GUI CHOOSER:
The GUI Chooser is used to start the different interfaces of the Weka environment. These interfaces can be considered different programs, and they vary in form, function and purpose.
Depending on the specific need, whether it be simple data exploration, detailed experimentation or tackling very large problems, one of these interfaces will be more appropriate than the others.
This project will mainly focus on the Explorer, Experimenter and Knowledge Flow interfaces of the Weka environment.
The CLI is a text-based interface to the Weka environment. It is the most memory-efficient interface available in Weka.
WEKA DATA MINER
Weka is a comprehensive set of advanced data mining and analysis tools. The strength of
Weka lies in the area of classification where it covers many of the most current machine learning
(ML) approaches. The version of Weka used in this project is version 3-4-4.
At its simplest, it provides a quick and easy way to explore and analyze data. Weka is also suitable for dealing with large data, where the resources of many computers and/or multi-processor computers can be used in parallel. We will be examining different aspects of the software with a focus on its decision tree classification features.
DATA HANDLING:
Weka currently supports three external file formats, namely CSV, binary and C4.5. Weka also allows data to be pulled directly from database servers as well as web servers. Its native data format is known as the ARFF format.
ATTRIBUTE RELATION FILE FORMAT (ARFF):
An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of
instances sharing a set of attributes. ARFF files were developed by the Machine Learning Project at
the Department of Computer Science of The University of Waikato for use with the Weka machine
learning software.
OVERVIEW
ARFF files have two distinct sections. The first section is the Header information, which is followed by the Data information.
The ARFF Header Section
The ARFF Header section of the file contains the relation declaration and attribute
declarations.
The @relation Declaration
The relation name is defined as the first line in the ARFF file. The format is:
@relation <relation-name>
Where <relation-name> is a string. The string must be quoted if the name includes spaces.
The @attribute Declarations
Attribute declarations take the form of an ordered sequence of @attribute statements. Each attribute in the data set has its own @attribute statement, which uniquely defines the name of that attribute and its data type. The order in which the attributes are declared indicates the column position in the data section of the file. For example, if an attribute is the third one declared, then Weka expects all of that attribute's values to be found in the third comma-delimited column.
The format for the @attribute statement is:
@attribute <attribute-name> <datatype>
Where the <attribute-name> must start with an alphabetic character. If spaces are to be
included in the name then the entire name must be quoted.
The <datatype> can be any of the four types currently (version 3.2.1) supported by Weka:
• numeric
• <nominal-specification>
• string
• date [<date-format>]
Where <nominal-specification> and <date-format> are defined below. The keywords numeric, string and date are case insensitive.
Numeric attributes
Numeric attributes can be real or integer numbers.
Nominal attributes
Nominal values are defined by providing a <nominal-specification> listing the possible values:
{<nominal-name1>, <nominal-name2>, <nominal-name3>, ...}
For example, the class value of the Iris dataset can be defined as follows:
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
Values that contain spaces must be quoted.
String attributes
String attributes allow us to create attributes containing arbitrary textual values. This is very useful in text-mining applications, as we can create datasets with string attributes and then apply Weka filters that manipulate strings (like StringToWordVector). String attributes are declared as follows:
@ATTRIBUTE LCC string
Date attributes
Date attribute declarations take the form:
@attribute <name> date [<date-format>]
where <name> is the name for the attribute and <date-format> is an optional string specifying how date values should be parsed and printed (this is the same format used by SimpleDateFormat). The default format string accepts the ISO-8601 combined date and time format: "yyyy-MM-dd'T'HH:mm:ss". Dates must be specified in the data section as the corresponding string representations of the date/time (see example below).
ARFF Data Section
The ARFF Data section of the file contains the data declaration line and the actual instance lines.
The @data Declaration
The @data declaration is a single line denoting the start of the data segment in the file. The format is:
@data
The instance data
Each instance is represented on a single line, with carriage returns denoting the end of the instance. Attribute values for each instance are delimited by commas. They must appear in the order in which they were declared in the header section (i.e. the data corresponding to the nth @attribute declaration is always the nth field of the instance).
Missing values are represented by a single question mark, as in:
@data
4.4,?,1.5,?,Iris-setosa
Values of string and nominal attributes are case sensitive, and any that contain spaces must be quoted, as follows:
@relation LCCvsLCSH
@attribute LCC string
@attribute LCSH string
@data
AG5, 'Encyclopedias and dictionaries.;Twentieth century.'
AS262, 'Science -- Soviet Union -- History.'
AE5, 'Encyclopedias and dictionaries.'
AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Phases.'
AS281, 'Astronomy, Assyro-Babylonian.; Moon -- Tables.'
Dates must be specified in the data section using the string representation specified in the attribute
declaration. For example:
@RELATION Timestamps
@ATTRIBUTE timestamp DATE "yyyy-MM-dd HH:mm:ss"
@DATA
"2001-04-03 12:12:12"
"2001-05-03 12:59:55"
NOTE: All header commands start with '@', all comment lines start with '%', and blank lines are ignored.
COMMA SEPARATED VALUE (CSV):
In a CSV file the first line holds the comma-separated attribute names, and each subsequent line holds the values of one data record. For example:
Sno, Sname, Branch, Year
1, abc, MCA, First
2, def, MCA, First
Sparse ARFF Structure:
@relation <relation name>
@attribute <attribute name> <datatype>
@data
{<index> <value>, <index> <value>, <index> <value>, ...}
In sparse format each entry gives the zero-based attribute index followed by the value; attributes whose value is 0 may be omitted entirely.
Sparse Weather Arff Dataset:
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
{0 sunny,1 85,2 85,3 FALSE,4 no}
{0 sunny,1 80,2 90,3 TRUE,4 no}
{0 overcast,1 83,2 86,3 FALSE,4 yes}
{0 rainy,1 70,2 96,3 FALSE,4 yes}
{0 rainy,1 68,2 80,3 FALSE,4 yes}
{0 rainy,1 65,2 70,3 TRUE,4 no}
{0 overcast,1 64,2 65,3 TRUE,4 yes}
{0 sunny,1 72,2 95,3 FALSE,4 no}
{0 sunny,1 69,2 70,3 FALSE,4 yes}
{0 rainy,1 75,2 80,3 FALSE,4 yes}
{0 sunny,1 75,2 70,3 TRUE,4 yes}
{0 overcast,1 72,2 90,3 TRUE,4 yes}
Data Retrieval and Preparation
Getting the data:
There are three ways of loading data into the Explorer: from a local file, from a database connection, and from a file on a web server. We will be loading the data from a locally stored file.
Weka supports four different file formats, namely CSV, C4.5, flat binary files and the native ARFF format. To demonstrate the functionality of the Explorer environment we will load a CSV file, and in the following section we will preprocess the data to prepare it for analysis. To open a local data file, click on the "Open file" button, and in the window that follows select the desired data file.
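The same loading step can also be scripted. Below is a minimal Java sketch, assuming a recent Weka release on the classpath and hypothetical file names, that reads a CSV file with Weka's converter classes and saves it in native ARFF format:

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class CsvToArff {
    public static void main(String[] args) throws Exception {
        // Load the CSV file (the header row becomes the attribute names).
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("bank.csv"));   // assumed input path
        Instances data = loader.getDataSet();

        // Save the same instances in Weka's native ARFF format.
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("bank.arff"));     // assumed output path
        saver.writeBatch();
    }
}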
Preprocess the Data:
First Method:
Initially (in the Preprocess tab) click "Open file" and navigate to the directory containing the data file (.csv or .arff). In this case we will open the above data file.
Since the data is not in ARFF format, a dialog box will prompt you to use the converter, as in the figure. Click the "Use Converter" button, then click OK in the next dialog box that appears. If you click the "Choose" button instead, a list of converters is shown; choose the converter you want and click the "OK" button.
Once the data is loaded, WEKA will recognize the attributes and during the scan of the data will
compute some basic statistics on each attribute. The left panel in above figure shows the list of
recognized attributes, while the top panels indicate the names of the base relation (or table) and the
current working relation (which are the same initially).
[Figure: Preprocess tab showing the attribute list, statistical measures of the selected attribute, and the visualization panel.]
Clicking on any attribute in the left panel will show the basic statistics on that attribute. For
categorical attributes, the frequency for each attribute value is shown, while for continuous attributes
we can obtain min, max, mean, standard deviation, etc.
Note that the visualization in the right bottom panel is a form of cross-tabulation across two
attributes. For example, in above Figure, the default visualization panel cross-tabulates "married" with
the "pep" attribute (by default the second attribute is the last column of the data file). You can select
another attribute using the drop down list.
Second Method:
In this method you can load data from a web server. In the Preprocess tab, click the "Open URL" button; a pop-up window appears as shown below. Enter the name of the web server followed by the file name, then click the "OK" button.
Third Method:
In this method you can load data from a database. In the Preprocess tab, click the "Open DB" button; a connection window then appears.
Filtering Algorithms:
Filters transform the input dataset in some way. When a filter is selected using the Choose
button, its name appears in the line beside that button. Click that line to get a generic object editor to
specify its properties. What appears in the line is the command-line version of the filter, and the
parameters are specified with minus signs. This is a good way of learning how to use the Weka commands directly. There are two kinds of filters: unsupervised and supervised.
Filters are often applied to a training dataset and then also applied to the test file. If the filter
is supervised—for example, if it uses class values to derive good intervals for discretization—
applying it to the test data will bias the results. It is the discretization intervals derived from the
training data that must be applied to the test data. When using supervised filters you must be careful
to ensure that the results are evaluated fairly, an issue that does not arise with unsupervised filters.
We treat Weka’s unsupervised and supervised filtering methods separately. Within each type
there is a further distinction between attribute filters, which work on the attributes in the datasets,
and instance filters, which work on the instances.
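As a concrete illustration, here is a minimal Java sketch, with an assumed file name, that applies the unsupervised Discretize filter to a dataset using the same command-line option string that appears beside the Choose button:

import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class FilterDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff");   // assumed path

        // Configure the filter from its command-line option string
        // (these are the Discretize defaults).
        Discretize discretize = new Discretize();
        discretize.setOptions(Utils.splitOptions("-B 10 -M -1.0 -R first-last"));
        discretize.setInputFormat(data);

        // Apply the filter; a new, transformed dataset is returned.
        Instances discretized = Filter.useFilter(data, discretize);
        System.out.println(discretized);
    }
}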
Unsupervised Attribute Filters:
Sno  Name of Function      Description
1    Add                   Add a new attribute, whose values are all marked as missing.
2    AddCluster            Add a new nominal attribute representing the cluster assigned to each instance by a given clustering algorithm.
3    AddExpression         Create a new attribute by applying a specified mathematical function to existing attributes.
4    AddNoise              Change a percentage of a given nominal attribute's values.
5    ClusterMembership     Use a clusterer to generate cluster membership values, which then form the new attributes.
6    Copy                  Copy a range of attributes in the dataset.
7    Discretize            Convert numeric attributes to nominal: specify which attributes, the number of bins, whether to optimize the number of bins, and whether to output binary attributes; use equal-width (default) or equal-frequency binning.
8    FirstOrder            Apply a first-order differencing operator to a range of numeric attributes.
9    MakeIndicator         Replace a nominal attribute with a Boolean attribute. Assign value 1 to instances with a particular range of attribute values; otherwise, assign 0. By default, the Boolean attribute is coded as numeric.
10   MergeTwoValues        Merge two values of a given attribute: specify the indices of the two values to be merged.
11   NominalToBinary       Change a nominal attribute to several binary ones, one for each value.
12   Normalize             Scale all numeric values in the dataset to lie within the interval [0,1].
13   NumericToBinary       Convert all numeric attributes into binary ones: nonzero values become 1.
14   NumericTransform      Transform a numeric attribute using any Java function.
15   RemoveType            Remove attributes of a given type (nominal, numeric, string, or date).
16   RemoveUseless         Remove constant attributes, along with nominal attributes that vary too much.
17   ReplaceMissingValues  Replace all missing values for nominal and numeric attributes with the modes and means of the training data.
18   Standardize           Standardize all numeric attributes to have zero mean and unit variance.
19   StringToNominal       Convert a string attribute to nominal.
20   SwapValues            Swap two values of an attribute.
Unsupervised Instance Filters:
Sno  Name of Function     Description
1    NonSparseToSparse    Convert all incoming instances to sparse format.
2    Normalize            Treat numeric attributes as a vector and normalize it to a given length.
3    Randomize            Randomize the order of instances in a dataset.
4    RemoveFolds          Output a specified cross-validation fold for the dataset.
5    RemoveMisclassified  Remove instances incorrectly classified according to a specified classifier; useful for removing outliers.
6    RemovePercentage     Remove a given percentage of a dataset.
7    RemoveRange          Remove a given range of instances from a dataset.
8    RemoveWithValues     Filter out instances with certain attribute values.
9    Resample             Produce a random subsample of a dataset, sampling with replacement.
10   SparseToNonSparse    Convert all incoming sparse instances into nonsparse format.
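To honor the earlier caution about supervised filters biasing test results, one option is Weka's FilteredClassifier, which re-learns the filter on each training fold only, so that (for example) discretization intervals are never derived from the test data. A minimal sketch, assuming weather.arff as the input:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.supervised.attribute.Discretize;

public class SupervisedFilterDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff"); // assumed path
        data.setClassIndex(data.numAttributes() - 1);

        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(new Discretize());   // supervised discretization
        fc.setClassifier(new J48());

        // The filter is rebuilt on each training fold only,
        // so the test folds stay unbiased.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(fc, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}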
SAMPLE DATASETS:
a) Weather Dataset:
Description for weather dataset (arff)
Title: weather dataset
Source of information
No. of attributes: 5 (1 string, 2 numeric, 2 nominal)
No. of instances: 50
Attribute Description for Weather Dataset
attribute 1: outlook (string)
attribute 2: temp (numeric)
attribute 3: humd (numeric)
attribute 4: windy (nominal); labels: yes, no
attribute 5: play (nominal); labels: play, noplay
Example:
Sample Weather.arff Dataset
@relation weather
@attribute outlook {sunny,overcast,rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE,FALSE}
@attribute play {yes,no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
……………………
Sample Weather.csv Dataset
Outlook,temp,humd,windy,play
Rainy,30,40,yes,play
Rainy,50,20,no,play
Sunny,60,50,yes,noplay
Sunny,65,70,no,noplay
Overcast,40,40,yes,play
………………….
Bank Dataset:
Description of bank dataset
Title: bank dataset (arff)
Source of information
No. of attributes: 12 (4 numeric, 8 nominal)
No. of instances: 100
Attribute Description For Bank Dataset
attribute 1:id(numeric)
attribute 2:age(numeric)
attribute 3:sex(nominal)
labels:male,female
attribute 4:region(nominal)
labels:inner_city,rural,suburban,town
attribute 5:income(numeric)
attribute 6:married(nominal)
labels:yes,no
attribute 7:children(numeric)
attribute 8:car(nominal)
labels:yes,no
attribute 9:save_acct(nominal)
labels:yes,no
attribute 10:current_acct(nominal)
labels:yes,no
attribute 11:mortgage(nominal)
labels:yes,no
attribute 12:pep(nominal)
labels:yes,no
Example:
Sample Bank.arff Dataset
@relation 'personal equity plan'
@attribute id numeric
@attribute age numeric
@attribute sex {male,female}
@attribute region {inner_city,rural,suburban,town}
@attribute income numeric
@attribute married {yes,no}
@attribute children numeric
@attribute car {yes,no}
@attribute save_acct {yes,no}
@attribute current_acct {yes,no}
@attribute mortgage {yes,no}
@attribute pep {yes,no}
@data
1,20,male,inner_city,10000,no,0,yes,yes,no,yes,no
2,45,male,rural,50000,yes,3,yes,no,no,yes,no
3,35,female,suburban,25000,yes,2,yes,no,no,yes,no
4,27,male,town,30000,no,0,yes,yes,no,yes,no
5,25,female,inner_city,20000,yes,2,yes,no,no,yes,no
6,30,male,town,15000,no,0,yes,yes,no,yes,no
…………………………………
Sample Bank.csv dataset
id,age,sex,region,income,married,children,car,save_acct,current_acct,mortgage,pep
1,20,male,inner_city,10000,no,0,yes,yes,no,yes,no
2,45,male,rural,50000,yes,3,yes,no,no,yes,no
3,35,female,suburban,25000,yes,2,yes,no,no,yes,no
4,27,male,town,30000,no,0,yes,yes,no,yes,no
5,25,female,inner_city,20000,yes,2,yes,no,no,yes,no
6,30,male,town,15000,no,0,yes,yes,no,yes,no
……………………………………
German Credit Dataset:
Description of the German credit dataset.
Title: German Credit data
Source Information
Number of Instances: 1000
Number of Attributes german: 20 (7 numerical, 13 categorical)
Number of Attributes german.numer: 24 (24 numerical)
Attribute Description For German
Attribute 1: (qualitative)
Status of existing checking account
A11 : ... < 0 DM
A12 : 0 <= ... < 200 DM
A13 : ... >= 200 DM /
salary assignments for at least 1 year
A14 : no checking account
Attribute 2: (numerical)
Duration in month
Attribute 3: (qualitative)
Credit history
A30 : no credits taken/
all credits paid back duly
A31 : all credits at this bank paid back duly
A32 : existing credits paid back duly till now
A33 : delay in paying off in the past
A34 : critical account/
other credits existing (not at this bank)
Attribute 4: (qualitative)
Purpose
A40 : car (new)
A41 : car (used)
A42 : furniture/equipment
A43 : radio/television
A44 : domestic appliances
A45 : repairs
A46 : education
A47 : (vacation - does not exist?)
A48 : retraining
A49 : business
A410 : others
Attribute 5: (numerical)
Credit amount
Attribute 6: (qualitative)
Savings account/bonds
A61 : ... < 100 DM
A62 : 100 <= ... < 500 DM
A63 : 500 <= ... < 1000 DM
A64 : ... >= 1000 DM
A65 : unknown/ no savings account
Attribute 7: (qualitative)
Present employment since
A71 : unemployed
A72 : ... < 1 year
A73 : 1 <= ... < 4 years
A74 : 4 <= ... < 7 years
A75 : ... >= 7 years
Attribute 8: (numerical)
Installment rate in percentage of disposable income
Attribute 9: (qualitative)
Personal status and sex
A91 : male : divorced/separated
A92 : female : divorced/separated/married
A93 : male : single
A94 : male : married/widowed
A95 : female : single
Attribute 10: (qualitative)
Other debtors / guarantors
A101 : none
A102 : co-applicant
A103 : guarantor
Attribute 11: (numerical)
Present residence since
Attribute 12: (qualitative)
Property
A121 : real estate
A122 : if not A121 : building society savings agreement/
life insurance
A123 : if not A121/A122 : car or other, not in attribute 6
A124 : unknown / no property
Attribute 13: (numerical)
Age in years
Attribute 14: (qualitative)
Other installment plans
A141 : bank
A142 : stores
A143 : none
Attribute 15: (qualitative)
Housing
A151 : rent
A152 : own
A153 : for free
Attribute 16: (numerical)
Number of existing credits at this bank
Attribute 17: (qualitative)
Job
A171 : unemployed/ unskilled - non-resident
A172 : unskilled - resident
A173 : skilled employee / official
A174 : management/ self-employed/
highly qualified employee/ officer
Attribute 18: (numerical)
Number of people being liable to provide maintenance for
Attribute 19: (qualitative)
Telephone
A191 : none
A192 : yes, registered under the customer's name
Attribute 20: (qualitative)
foreign worker
A201 : yes
A202 : no
Relabeled values in attribute checking_status
From: A11
To: '<0'
From: A12
To: '0<=X<200'
From: A13
To: '>=200'
From: A14
To: 'no checking'
Relabeled values in attribute credit_history
From: A30
To: 'no credits/all paid'
From: A31
To: 'all paid'
From: A32
To: 'existing paid'
From: A33
To: 'delayed previously'
From: A34
To: 'critical/other existing credit'
Relabeled values in attribute purpose
From: A40
To: 'new car'
From: A41
To: 'used car'
From: A42
To: furniture/equipment
From: A43
To: radio/tv
From: A44
To: 'domestic appliance'
From: A45
To: repairs
From: A46
To: education
From: A47
To: vacation
From: A48
To: retraining
From: A49
To: business
From: A410
To: other
Relabeled values in attribute savings_status
From: A61
To: '<100'
From: A62
To: '100<=X<500'
From: A63
To: '500<=X<1000'
From: A64
To: '>=1000'
From: A65
To: 'no known savings'
Relabeled values in attribute employment
From: A71
To: unemployed
From: A72
To: '<1'
From: A73
To: '1<=X<4'
From: A74
To: '4<=X<7'
From: A75
To: '>=7'
Relabeled values in attribute personal_status
From: A91
To: 'male div/sep'
From: A92
To: 'female div/dep/mar'
From: A93
To: 'male single'
From: A94
To: 'male mar/wid'
From: A95
To: 'female single'
Relabeled values in attribute other_parties
From: A101
To: none
From: A102
To: 'co applicant'
From: A103
To: guarantor
Relabeled values in attribute property_magnitude
From: A121
To: 'real estate'
From: A122
To: 'life insurance'
From: A123
To: car
From: A124
To: 'no known property'
Relabeled values in attribute other_payment_plans
From: A141
To: bank
From: A142
To: stores
From: A143
To: none
Relabeled values in attribute housing
From: A151
To: rent
From: A152
To: own
From: A153
To: 'for free'
Relabeled values in attribute job
From: A171
To: 'unemp/unskilled non res'
From: A172
To: 'unskilled resident'
From: A173
To: skilled
From: A174
To: 'high qualif/self emp/mgmt'
Relabeled values in attribute own_telephone
From: A191
To: none
From: A192
To: yes
Relabeled values in attribute foreign_worker
From: A201
To: yes
From: A202
To: no
Relabeled values in attribute class
From: 1
To: good
From: 2
To: bad
@relation german_credit
@attribute checking_status { '<0', '0<=X<200', '>=200', 'no checking'}
@attribute duration real
@attribute credit_history { 'no credits/all paid', 'all paid', 'existing paid', 'delayed previously', 'critical/other existing credit'}
@attribute purpose { 'new car', 'used car', furniture/equipment, radio/tv, 'domestic appliance', repairs, education, vacation, retraining, business, other}
@attribute credit_amount real
@attribute savings_status { '<100', '100<=X<500', '500<=X<1000', '>=1000', 'no known savings'}
@attribute employment { unemployed, '<1', '1<=X<4', '4<=X<7', '>=7'}
@attribute installment_commitment real
@attribute personal_status { 'male div/sep', 'female div/dep/mar', 'male single', 'male mar/wid', 'female single'}
@attribute other_parties { none, 'co applicant', guarantor}
@attribute residence_since real
@attribute property_magnitude { 'real estate', 'life insurance', car, 'no known property'}
@attribute age real
@attribute other_payment_plans { bank, stores, none}
@attribute housing { rent, own, 'for free'}
@attribute existing_credits real
@attribute job { 'unemp/unskilled non res', 'unskilled resident', skilled, 'high qualif/self emp/mgmt'}
@attribute num_dependents real
@attribute own_telephone { none, yes}
@attribute foreign_worker { yes, no}
@attribute class { good, bad}
@data
'<0',6,'critical/other existing credit',radio/tv,1169,'no known savings','>=7',4,'male single',none,4,'real estate',67,none,own,2,skilled,1,yes,yes,good
'0<=X<200',48,'existing paid',radio/tv,5951,'<100','1<=X<4',2,'female div/dep/mar',none,2,'real estate',22,none,own,1,skilled,1,none,yes,bad
'no checking',12,'critical/other existing credit',education,2096,'<100','4<=X<7',2,'male single',none,3,'real estate',49,none,own,1,'unskilled resident',2,none,yes,good
'<0',42,'existing paid',furniture/equipment,7882,'<100','4<=X<7',2,'male single',guarantor,4,'life insurance',45,none,'for free',1,skilled,2,none,yes,good
'<0',24,'delayed previously','new car',4870,'<100','1<=X<4',3,'male single',none,4,'no known property',53,none,'for free',2,skilled,2,none,yes,bad
'no checking',36,'existing paid',education,9055,'no known savings','1<=X<4',2,'male single',none,4,'no known property',35,none,'for free',1,'unskilled resident',2,yes,yes,good
'no checking',24,'existing paid',furniture/equipment,2835,'500<=X<1000','>=7',3,'male single',none,4,'life insurance',53,none,own,1,skilled,1,none,yes,good
'0<=X<200',36,'existing paid','used car',6948,'<100','1<=X<4',2,'male single',none,2,car,35,none,rent,1,'high qualif/self emp/mgmt',1,yes,yes,good
'no checking',12,'existing paid',radio/tv,3059,'>=1000','4<=X<7',2,'male div/sep',none,4,'real estate',61,none,own,1,'unskilled resident',1,none,yes,good
'0<=X<200',30,'critical/other existing credit','new car',5234,'<100',unemployed,4,'male mar/wid',none,2,car,28,none,own,2,'high qualif/self emp/mgmt',1,none,yes,bad
'0<=X<200',12,'existing paid','new car',1295,'<100','<1',3,'female div/dep/mar',none,1,car,25,none,rent,1,skilled,1,none,yes,bad
'<0',48,'existing paid',business,4308,'<100','<1',3,'female div/dep/mar',none,4,'life insurance',24,none,rent,1,skilled,1,none,yes,bad
'0<=X<200',12,'existing paid',radio/tv,1567,'<100','1<=X<4',1,'female div/dep/mar',none,1,car,22,none,own,1,skilled,1,yes,yes,good
'<0',24,'critical/other existing credit','new car',1199,'<100','>=7',4,'male single',none,4,car,60,none,own,2,'unskilled resident',1,none,yes,bad
'<0',15,'existing paid','new car',1403,'<100','1<=X<4',2,'female div/dep/mar',none,4,car,28,none,rent,1,skilled,1,none,yes,good
'<0',24,'existing paid',radio/tv,1282,'100<=X<500','1<=X<4',4,'female div/dep/mar',none,2,car,32,none,own,1,'unskilled resident',1,none,yes,bad
'no checking',24,'critical/other existing credit',radio/tv,2424,'no known savings','>=7',4,'male single',none,4,'life insurance',53,none,own,2,skilled,1,none,yes,good
'<0',30,'no credits/all paid',business,8072,'no known savings','<1',2,'male single',none,3,car,25,bank,own,3,skilled,1,none,yes,good
Market basket dataset:
@relation marketbasket-weka.filters.unsupervised.attribute.NumericToBinary-weka.filters.unsupervised.attribute.NumericToBinary
@attribute ' Hair Conditioner_binarized' {0,1}
@attribute ' Lemons_binarized' {0,1}
@attribute ' Standard coffee_binarized' {0,1}
@attribute ' Frozen Chicken Wings_binarized' {0,1}
@attribute ' 98pct. Fat Free Hamburger_binarized' {0,1}
@attribute ' Sugar Cookies_binarized' {0,1}
@attribute ' Onions_binarized' {0,1}
@attribute ' Deli Ham_binarized' {0,1}
@attribute ' Dishwasher Detergent_binarized' {0,1}
@attribute ' Beets_binarized' {0,1}
@attribute ' 40 Watt Lightbulb_binarized' {0,1}
@attribute ' Ice Cream_binarized' {0,1}
@attribute ' Cottage Cheese_binarized' {0,1}
@attribute ' Plain English Muffins_binarized' {0,1}
@attribute ' Strawberry Soda_binarized' {0,1}
@attribute ' Vanilla Ice Cream_binarized' {0,1}
@attribute ' Potato Chips_binarized' {0,1}
@attribute ' Strawberry Yogurt_binarized' {0,1}
@attribute ' Diet Soda_binarized' {0,1}
@attribute ' D Cell Batteries_binarized' {0,1}
@attribute ' Paper Towels_binarized' {0,1}
@attribute ' Mint Chocolate Bar_binarized' {0,1}
@attribute ' Salsa Dip_binarized' {0,1}
@attribute ' Buttered Popcorn_binarized' {0,1}
@attribute ' Cheese Crackers_binarized' {0,1}
@attribute ' Chocolate Bar_binarized' {0,1}
@attribute ' Rice Soup_binarized' {0,1}
@attribute ' Mouthwash_binarized' {0,1}
@attribute ' Sugar_binarized' {0,1}
@attribute ' Cheese Flavored Chips_binarized' {0,1}
@attribute ' Sweat Potatoes_binarized' {0,1}
@attribute ' Deodorant_binarized' {0,1}
@attribute ' Waffles_binarized' {0,1}
@attribute ' Decaf Coffee_binarized' {0,1}
@attribute ' Smoked Turkey Sliced_binarized' {0,1}
@attribute ' Screw Driver_binarized' {0,1}
@attribute ' Sesame Oil_binarized' {0,1}
@attribute ' Red Wine_binarized' {0,1}
@attribute ' 60 Watt Lightbulb_binarized' {0,1}
@attribute ' Cream Soda_binarized' {0,1}
@attribute ' Apple Fruit Roll_binarized' {0,1}
@attribute ' Noodle Soup_binarized' {0,1}
@attribute ' Ice Cream Sandwich_binarized' {0,1}
@attribute ' Soda Crackers_binarized' {0,1}
@attribute ' Lettuce_binarized' {0,1}
@attribute ' AA Cell Batteries_binarized' {0,1}
@attribute ' Honey Roasted Peanuts_binarized' {0,1}
@attribute ' Frozen Cheese Pizza_binarized' {0,1}
@attribute ' Tomato Soup_binarized' {0,1}
@attribute ' Manicotti_binarized' {0,1}
@attribute ' Toilet Bowl Cleaner_binarized' {0,1}
@attribute ' Liquid Laundry Detergent_binarized' {0,1}
@attribute ' Instant Rice_binarized' {0,1}
@attribute ' Green Pepper_binarized' {0,1}
@attribute ' Frozen Broccoli_binarized' {0,1}
@attribute ' Chardonnay Wine_binarized' {0,1}
@attribute ' Brown Sugar Grits_binarized' {0,1}
@attribute ' Canned Peas_binarized' {0,1}
@attribute ' Skin Moisturizer_binarized' {0,1}
@attribute ' Avocado Dip_binarized' {0,1}
@attribute ' Blueberry Muffins_binarized' {0,1}
@attribute ' Apple Cinnamon Waffles_binarized' {0,1}
@attribute ' Chablis Wine_binarized' {0,1}
@attribute ' Cantaloupe_binarized' {0,1}
@attribute ' Shrimp Cocktail Sauce_binarized' {0,1}
@attribute ' 100 Watt Lightbulb_binarized' {0,1}
@attribute ' Whole Green Beans_binarized' {0,1}
@attribute ' Turkey TV Dinner_binarized' {0,1}
@attribute ' Wash Towels_binarized' {0,1}
@attribute ' Dog Food_binarized' {0,1}
@attribute ' Cat Food_binarized' {0,1}
@attribute ' Frozen Sausage Pizza_binarized' {0,1}
@attribute ' Frosted Donuts_binarized' {0,1}
@attribute ' Shrimp_binarized' {0,1}
@attribute ' Summer Sausage_binarized' {0,1}
@attribute ' Plums_binarized' {0,1}
@attribute ' Mild Cheddar Cheese_binarized' {0,1}
@attribute ' Cream of Wheat_binarized' {0,1}
@attribute ' Fresh Lima Beans_binarized' {0,1}
@attribute ' Flavored Fruit Bars_binarized' {0,1}
@attribute ' Mushrooms_binarized' {0,1}
@attribute ' Flour_binarized' {0,1}
@attribute ' Plain Rye Bread_binarized' {0,1}
@attribute ' Jelly Filled Donuts_binarized' {0,1}
@attribute ' Apple Sauce_binarized' {0,1}
@attribute ' Hot Chicken Wings_binarized' {0,1}
@attribute ' Orange Juice_binarized' {0,1}
@attribute ' Strawberry Jam_binarized' {0,1}
@attribute ' Chocolate Chip Cookies_binarized' {0,1}
@attribute ' Vegetable Soup_binarized' {0,1}
@attribute ' Oats and Nuts Cereal_binarized' {0,1}
@attribute ' Fruit Roll_binarized' {0,1}
@attribute ' Corn Oil_binarized' {0,1}
@attribute ' Corn Flake Cereal_binarized' {0,1}
@attribute ' 75 Watt Lightbulb_binarized' {0,1}
@attribute ' Mushroom Pizza - Frozen_binarized' {0,1}
@attribute ' Sour Cream_binarized' {0,1}
@attribute ' Deli Salad_binarized' {0,1}
@attribute ' Deli Turkey_binarized' {0,1}
@attribute ' Glass Cleaner_binarized' {0,1}
@attribute ' Brown Sugar_binarized' {0,1}
@attribute ' English Muffins_binarized' {0,1}
@attribute ' Apple Soda_binarized' {0,1}
@attribute ' Strawberry Preserves_binarized' {0,1}
@attribute ' Pepperoni Pizza - Frozen_binarized' {0,1}
@attribute ' Plain Oatmeal_binarized' {0,1}
@attribute ' Beef Soup_binarized' {0,1}
@attribute ' Trash Bags_binarized' {0,1}
@attribute ' Corn Chips_binarized' {0,1}
@attribute ' Tangerines_binarized' {0,1}
@attribute ' Hot Dogs_binarized' {0,1}
@attribute ' Can Opener_binarized' {0,1}
@attribute ' Dried Apples_binarized' {0,1}
@attribute ' Grape Juice_binarized' {0,1}
@attribute ' Carrots_binarized' {0,1}
@attribute ' Frozen Shrimp_binarized' {0,1}
@attribute ' Grape Fruit Roll_binarized' {0,1}
@attribute ' Merlot Wine_binarized' {0,1}
@attribute ' Raisins_binarized' {0,1}
@attribute ' Cranberry Juice_binarized' {0,1}
@attribute ' Shampoo_binarized' {0,1}
@attribute ' Pancake Mix_binarized' {0,1}
@attribute ' Paper Plates_binarized' {0,1}
@attribute ' Bologna_binarized' {0,1}
@attribute ' 2pct. Milk_binarized' {0,1}
@attribute ' Daily Newspaper_binarized' {0,1}
@attribute ' Popcorn Salt_binarized' {0,1}
@data
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
EXPERIMENT NO: 1
Aim: Implement the filtering algorithm learning schemes using the weather dataset (arff).
Unsupervised Filters:
Unsupervised Attribute Filters:
Add:
a) SCHEMA: weka.filters.unsupervised.attribute.Add -N unnamed -C 3
Discretize:
b) SCHEMA: weka.filters.unsupervised.attribute.Discretize -B 11 -M -1.0 -R first-last
NominalToBinary:
c) SCHEMA: weka.filters.unsupervised.attribute.NominalToBinary -R first-last
Normalize:
d) SCHEMA: weka.filters.unsupervised.attribute.Normalize -S 1.0 -T 0.0
NumericToBinary:
e) SCHEMA: weka.filters.unsupervised.attribute.NumericToBinary
SwapValues:
f) SCHEMA: weka.filters.unsupervised.attribute.SwapValues -C last -F first -S last
StringToNominal:
g) SCHEMA: weka.filters.unsupervised.attribute.StringToNominal
Implement weka.classifiers.trees.J48
Weather Dataset.arff
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
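The runs below can also be reproduced programmatically. Here is a minimal Java sketch, assuming the dataset above is saved as weather.arff, that builds J48 with its default options (-C 0.25 -M 2) and evaluates it on the training set:

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Demo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff"); // assumed path
        data.setClassIndex(data.numAttributes() - 1);     // 'play' is the class

        J48 tree = new J48();        // defaults: -C 0.25 -M 2
        tree.buildClassifier(data);

        // Evaluate on the training data itself ("Use training set").
        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(tree, data);
        System.out.println(tree);
        System.out.println(eval.toSummaryString());
    }
}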
Use Training Set Testing Options:
=== Run information ===
Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: weather
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play
Test mode: evaluate on training data
=== Classifier model (full training set) ===
J48 pruned tree
------------------
outlook = sunny
| humidity <= 75: yes (2.0)
| humidity > 75: no (3.0)
outlook = overcast: yes (4.0)
outlook = rainy
| windy = TRUE: no (2.0)
| windy = FALSE: yes (3.0)

Number of Leaves : 5
Size of the tree : 8
Time taken to build model: 0.02 seconds
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances        14       100      %
Incorrectly Classified Instances       0         0      %
Kappa statistic                        1
Mean absolute error                    0
Root mean squared error                0
Relative absolute error                0      %
Root relative squared error            0      %
Coverage of cases (0.95 level)       100      %
Mean rel. region size (0.95 level)    50      %
Total Number of Instances             14
=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
1        0        1          1       1          1         yes
1        0        1          1       1          1         no
=== Confusion Matrix ===
a b <-- classified as
9 0 | a = yes
0 5 | b = no
Visualize Tree:
Use Cross Validation Testing Option:
Classifier Output:
=== Run information ===
Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: weather
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
J48 pruned tree
------------------
outlook = sunny
| humidity <= 75: yes (2.0)
| humidity > 75: no (3.0)
outlook = overcast: yes (4.0)
outlook = rainy
| windy = TRUE: no (2.0)
| windy = FALSE: yes (3.0)

Number of Leaves : 5
Size of the tree : 8
Time taken to build model: 0 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances         9       64.2857 %
Incorrectly Classified Instances       5       35.7143 %
Kappa statistic                        0.186
Mean absolute error                    0.2857
Root mean squared error                0.4818
Relative absolute error               60      %
Root relative squared error           97.6586 %
Coverage of cases (0.95 level)        92.8571 %
Mean rel. region size (0.95 level)    64.2857 %
Total Number of Instances             14
=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.778    0.6      0.7        0.778   0.737      0.789     yes
0.4      0.222    0.5        0.4     0.444      0.789     no
=== Confusion Matrix ===
a b <-- classified as
7 2 | a = yes
3 2 | b = no
Visualize Tree:
Use Supplied Test Set Testing Options:
We will now use our model to classify the new instances. A portion of the new-instances ARFF file is depicted in the figure. Note that the attribute section is identical to the training data (the bank data we used for building our model). However, in the data section, the value of the "pep" attribute is "?" (i.e. unknown).
In the main panel, under "Test options", click the "Supplied test set" radio button, and then click the "Set..." button. This will pop up a window which allows you to open the file containing the test instances, as in the figures.
In this case, we open the file "bank-new.arff" and, upon returning to the main window, we click the "Start" button. This once again generates the model from our training data, but this time it applies the model to the new unclassified instances in the "bank-new.arff" file in order to predict the value of the "pep" attribute. The result is depicted in Figure 28. Note that the summary of the results in the right panel does not show any statistics. This is because in our test instances the value of the class attribute ("pep") was left as "?", so WEKA has no actual values to which it can compare the predicted values of the new instances.
Of course, in this example we are interested in knowing how our model managed to classify the new instances. To do so we need to create a file containing all the new instances along with their predicted class values resulting from the application of the model. Doing this is much simpler using the command-line version of the WEKA classifier application. However, it is possible to do so in the GUI version using an "indirect" approach, as follows.
First, right-click the most recent result set in the left "Result list" panel. In the resulting
pop-up window select the menu item "Visualize classifier errors". This brings up a separate window
containing a two-dimensional graph.
We would like to "save" the classification results from which the graph is generated. In
the new window, we click on the "Save" button and save the result as the file: "bank-predicted.arff".
This file contains a copy of the new instances along with an additional column for the predicted value
of "pep".
Note that two attributes have been added to the original new instances data: "Instance_number" and "predictedpep". These correspond to new columns in the data portion. The "predictedpep" value for each new instance is the last value before the "?", which stands for the actual "pep" class value. For example, the predicted value of the "pep" attribute for instance 0 is "YES" according to our model, while the predicted class value for instance 4 is "NO".
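A minimal Java sketch of the same idea, with assumed file names, that trains on the labeled bank data and prints the predicted "pep" value for every unlabeled instance:

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PredictUnlabeled {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("bank.arff");         // assumed path
        Instances unlabeled = DataSource.read("bank-new.arff"); // assumed path
        train.setClassIndex(train.numAttributes() - 1);         // 'pep'
        unlabeled.setClassIndex(unlabeled.numAttributes() - 1);

        J48 tree = new J48();
        tree.buildClassifier(train);

        // Print one predicted class label per unlabeled instance.
        for (int i = 0; i < unlabeled.numInstances(); i++) {
            double pred = tree.classifyInstance(unlabeled.instance(i));
            System.out.println(i + " -> " + unlabeled.classAttribute().value((int) pred));
        }
    }
}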
Use Percentage Split Testing Options:
Classifiers Output:
=== Run information ===
Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: weather
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play
Test mode: split 66.0% train, remainder test
=== Classifier model (full training set) ===
J48 pruned tree
------------------
outlook = sunny
| humidity <= 75: yes (2.0)
| humidity > 75: no (3.0)
outlook = overcast: yes (4.0)
outlook = rainy
| windy = TRUE: no (2.0)
| windy = FALSE: yes (3.0)

Number of Leaves : 5
Size of the tree : 8
Time taken to build model: 0 seconds
=== Evaluation on test split ===
=== Summary ===
Correctly Classified Instances         2        40      %
Incorrectly Classified Instances       3        60      %
Kappa statistic                       -0.3636
Mean absolute error                    0.6
Root mean squared error                0.7746
Relative absolute error              126.9231 %
Root relative squared error          157.6801 %
Coverage of cases (0.95 level)        40      %
Mean rel. region size (0.95 level)    50      %
Total Number of Instances              5
=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.667    1        0.5        0.667   0.571      0.333     yes
0        0.333    0          0       0          0.333     no
=== Confusion Matrix ===
a b <-- classified as
2 1 | a = yes
2 0 | b = no
Analyzing the Output:
Run Information:
The first line of the run information section contains information about the learning scheme
chosen and its parameters. The parameters chosen (both default and modified) are shown in short
form. In our example, the learning scheme was ‘weka.classifiers.trees.J48’ or the J48 algorithm.
The second line shows information about the relation. Relations in Weka are like data files. The name of the relation contains the name of the data file used to build it and the names of any filters that have been applied to it.
The next part shows the number of instances in the relation, followed by the number of attributes and the list of attributes.
The last part shows the type of testing that was employed; in our example it was 10-fold cross-validation.
Classifier Model (full Training set):
J48 pruned tree
------------------
Outlook = sunny
| Humidity <= 75: yes (2.0)
| Humidity > 75: no (3.0)
Outlook = overcast: yes (4.0)
Outlook = rainy
| Windy = TRUE: no (2.0)
| Windy = FALSE: yes (3.0)
Number of Leaves: 5
Size of the tree: 8
It displays information about the model generated using the full training set. It says "full training set" because we used cross-validation, and what is displayed here is the final model, which was built using all of the data. When using tree models, a text display of the generated tree is shown, followed by information about the number of leaves and the overall tree size (above).
Confusion Matrix:
A confusion matrix is an easy way of describing the results of an experiment. The best way to describe it is by example:
=== Confusion Matrix ===
a b <-- classified as
7 2 | a = yes
3 2 | b = no
The confusion matrix is more commonly named a contingency table. In our case we have two classes, and therefore a 2x2 confusion matrix; in general the matrix can be arbitrarily large. The number of correctly classified instances is the sum of the diagonal elements of the matrix; all others are incorrectly classified (class "a" gets misclassified as "b" exactly twice, and class "b" gets misclassified as "a" three times).
Detailed Accuracy By Class:
The True Positive (TP) rate is the proportion of examples which were classified as class x,
among all examples which truly have class x, i.e. how much part of the class was captured. It is
equivalent to Recall. In the confusion matrix, this is the diagonal element divided by the sum over
the relevant row, i.e. 7/(7+2)=0.778 for class yes and 2/(3+2)=0.4 for class no in our example
The False Positive (FP) rate is the proportion of examples which were classified as class x among all examples which are not of class x. In the matrix, this is the column sum of class x minus the diagonal element, divided by the row sums of all other classes; i.e. 3/5=0.6 for class yes and 2/9=0.222 for class no.
The Precision is the proportion of the examples which truly have class x among all those
which were classified as class x. In the matrix, this is the diagonal element divided by the sum
over the relevant column, i.e. 7/(7+3)=0.7 for class yes and 2/(2+2)=0.5 for class no.
precision = TP/(TP+FP) for "yes"
precision = TN/(TN+FN) for "no"
The F-Measure is simply 2*Precision*Recall/ (Precision+Recall), a combined measure for
precision and recall.
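These definitions can be checked directly against the cross-validation confusion matrix above; a small self-contained sketch:

public class ConfusionStats {
    public static void main(String[] args) {
        // From the matrix above, treating "yes" as the positive class:
        // 7 yes classified yes (TP), 2 yes classified no (FN),
        // 3 no classified yes (FP), 2 no classified no (TN).
        int tp = 7, fn = 2, fp = 3, tn = 2;

        double tpRate = tp / (double) (tp + fn);    // recall = 7/9 = 0.778
        double fpRate = fp / (double) (fp + tn);    // 3/5 = 0.6
        double precision = tp / (double) (tp + fp); // 7/10 = 0.7
        double fMeasure = 2 * precision * tpRate / (precision + tpRate); // 0.737

        System.out.printf("yes: TP rate=%.3f FP rate=%.3f precision=%.3f F=%.3f%n",
                tpRate, fpRate, precision, fMeasure);
    }
}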
Evaluating numeric prediction:
• Same strategies: independent test set, cross-validation, significance tests, etc.
• Difference: error measures
• Actual target values: a1, a2, ..., an
• Predicted target values: p1, p2, ..., pn
• Most popular measure, the mean-squared error:
((p1 - a1)^2 + ... + (pn - an)^2) / n
• The root mean-squared error:
sqrt( ((p1 - a1)^2 + ... + (pn - an)^2) / n )
• The mean absolute error is less sensitive to outliers than the mean-squared error:
(|p1 - a1| + ... + |pn - an|) / n
• Sometimes relative error values are more appropriate (e.g. 10% for an error of 50 when predicting 500)
• How much does the scheme improve on simply predicting the average?
• The relative squared error is:
((p1 - a1)^2 + ... + (pn - an)^2) / ((ā - a1)^2 + ... + (ā - an)^2), where ā is the mean of the actual values
• The relative absolute error is:
(|p1 - a1| + ... + |pn - an|) / (|ā - a1| + ... + |ā - an|)
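The error measures above are simple to compute by hand; a short sketch with made-up values for illustration:

public class NumericErrors {
    public static void main(String[] args) {
        double[] p = {12, 15, 20};  // hypothetical predictions
        double[] a = {10, 15, 25};  // hypothetical actual values
        int n = a.length;

        double mean = 0;                       // ā, mean of actual values
        for (double v : a) mean += v / n;

        double sse = 0, sae = 0, sseBase = 0, saeBase = 0;
        for (int i = 0; i < n; i++) {
            sse += Math.pow(p[i] - a[i], 2);   // squared errors
            sae += Math.abs(p[i] - a[i]);      // absolute errors
            sseBase += Math.pow(mean - a[i], 2); // baseline: predict the average
            saeBase += Math.abs(mean - a[i]);
        }
        System.out.println("MSE  = " + sse / n);
        System.out.println("RMSE = " + Math.sqrt(sse / n));
        System.out.println("MAE  = " + sae / n);
        System.out.println("Relative squared error  = " + sse / sseBase);
        System.out.println("Relative absolute error = " + sae / saeBase);
    }
}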
                          A       B       C       D
Root mean-squared error   67.8    91.7    63.3    57.4
Mean absolute error       41.3    38.5    33.4    29.2
Root rel squared error    42.2%   57.2%   39.4%   35.8%
Relative absolute error   43.1%   40.1%   34.8%   30.4%
Correlation coefficient   0.88    0.88    0.89    0.91
K- Means Clustering in Weka:
This example illustrates the use of k-means clustering with WEKA. The sample data set used for this example is based on the "bank data" available in ARFF format (bank-data.arff). As an illustration of performing clustering in WEKA, we will use its implementation of the K-means algorithm to cluster the customers in this bank data set, and to characterize the resulting customer segments. We assume the data file has already been loaded into the Weka Explorer.
Bank Data Set:
@relation bank
@attribute Instance_number numeric
@attribute age numeric
@attribute sex {"FEMALE","MALE"}
@attribute region {"INNER_CITY","TOWN","RURAL","SUBURBAN"}
@attribute income numeric
@attribute married {"NO","YES"}
@attribute children {0,1,2,3}
@attribute car {"NO","YES"}
@attribute save_act {"NO","YES"}
@attribute current_act {"NO","YES"}
@attribute mortgage {"NO","YES"}
@attribute pep {"YES","NO"}
@data
0,48,FEMALE,INNER_CITY,17546,NO,1,NO,NO,NO,NO,YES
1,40,MALE,TOWN,30085.1,YES,3,YES,NO,YES,YES,NO
2,51,FEMALE,INNER_CITY,16575.4,YES,0,YES,YES,YES,NO,NO
3,23,FEMALE,TOWN,20375.4,YES,3,NO,NO,YES,NO,NO
4,57,FEMALE,RURAL,50576.3,YES,0,NO,YES,NO,NO,NO
5,57,FEMALE,TOWN,37869.6,YES,2,NO,YES,YES,NO,YES
6,22,MALE,RURAL,8877.07,NO,0,NO,NO,YES,NO,YES
7,58,MALE,TOWN,24946.6,YES,0,YES,YES,YES,NO,NO
8,37,FEMALE,SUBURBAN,25304.3,YES,2,YES,NO,NO,NO,NO
9,54,MALE,TOWN,24212.1,YES,2,YES,YES,YES,NO,NO
10,66,FEMALE,TOWN,59803.9,YES,0,NO,YES,YES,NO,NO
11,52,FEMALE,INNER_CITY,26658.8,NO,0,YES,YES,YES,YES,NO
12,44,FEMALE,TOWN,15735.8,YES,1,NO,YES,YES,YES,YES
13,66,FEMALE,TOWN,55204.7,YES,1,YES,YES,YES,YES,YES
14,36,MALE,RURAL,19474.6,YES,0,NO,YES,YES,YES,NO
15,38,FEMALE,INNER_CITY,22342.1,YES,0,YES,YES,YES,YES,NO
16,37,FEMALE,TOWN,17729.8,YES,2,NO,NO,NO,YES,NO
17,46,FEMALE,SUBURBAN,41016,YES,0,NO,YES,NO,YES,NO
18,62,FEMALE,INNER_CITY,26909.2,YES,0,NO,YES,NO,NO,YES
19,31,MALE,TOWN,22522.8,YES,0,YES,YES,YES,NO,NO
20,61,MALE,INNER_CITY,57880.7,YES,2,NO,YES,NO,NO,YES
21,50,MALE,TOWN,16497.3,YES,2,NO,YES,YES,NO,NO
22,54,MALE,INNER_CITY,38446.6,YES,0,NO,YES,YES,NO,NO
23,27,FEMALE,TOWN,15538.8,NO,0,YES,YES,YES,YES,NO
24,22,MALE,INNER_CITY,12640.3,NO,2,YES,YES,YES,NO,NO
25,56,MALE,INNER_CITY,41034,YES,0,YES,YES,YES,YES,NO
26,45,MALE,INNER_CITY,20809.7,YES,0,NO,YES,YES,YES,NO
27,39,FEMALE,TOWN,20114,YES,1,NO,NO,YES,NO,YES
28,39,FEMALE,INNER_CITY,29359.1,NO,3,YES,NO,YES,YES,NO
29,61,MALE,RURAL,24270.1,YES,1,NO,NO,YES,NO,YES
30,61,FEMALE,RURAL,22942.9,YES,2,NO,YES,YES,NO,NO
…………………………………………………………………………
In the pop-up window we enter 5 as the number of clusters (instead of the default value of 2) and we leave the value of "seed" as is. The seed value is used in generating a random number which is, in turn, used for making the initial assignment of instances to clusters. Note that, in general, K-means is quite sensitive to how clusters are initially assigned. Thus, it is often necessary to try different seed values and evaluate the results.
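The same run can be scripted; a minimal Java sketch, with an assumed file name, that configures SimpleKMeans with 5 clusters and the default seed of 10:

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KMeansDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bank-data.arff"); // assumed path

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(5); // -N 5
        kmeans.setSeed(10);       // -S 10 (controls the initial assignment)
        kmeans.buildClusterer(data);

        // Prints the centroids and the number of instances per cluster.
        System.out.println(kmeans);
    }
}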
Once the options have been specified, we can run the clustering algorithm. Here we make sure
that in the "Cluster Mode" panel, the "Use Percentage Split" option is selected, and we click "Start".
We can right click the result set in the "Result list" panel and view the results of clustering in a
separate window.
Cluster Output:
=== Run information ===
Scheme: weka.clusterers.SimpleKMeans -N 5 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation: bank
Instances: 600
Attributes: 12
Instance_number
age
sex
region
income
married
children
car
save_act
current_act
mortgage
pep
Test mode: split 66% train, remainder test
=== Clustering model (full training set) ===
kMeans
======
Number of iterations: 14
Within cluster sum of squared errors: 1719.2889887418955
Missing values globally replaced with mean/mode
Cluster centroids:
                                       Cluster#
Attribute         Full Data   0           1           2           3           4
                  (600)       (66)        (112)       (120)       (137)       (165)
================================================================================
Instance_number   299.5       306.6364    265.9732    302.2333    320.292     300.1515
age               42.395      40.0606     32.7589     51.475      44.3504     41.6424
sex               FEMALE      FEMALE      FEMALE      FEMALE      FEMALE      MALE
region            INNER_CITY  RURAL       INNER_CITY  INNER_CITY  TOWN        INNER_CITY
income            27524.0312  26206.1992  18260.9218  34922.1563  27626.442   28873.3638
married           YES         NO          YES         YES         YES         YES
children          0           3           2           1           0           0
car               NO          NO          NO          NO          NO          YES
save_act          YES         YES         YES         YES         YES         YES
current_act       YES         YES         YES         YES         YES         YES
mortgage          NO          NO          NO          NO          NO          YES
pep               NO          NO          NO          YES         NO          YES
Time taken to build model (full training data) : 0.14 seconds
=== Model and evaluation on test split ===
kMeans
======
Number of iterations: 10
Within cluster sum of squared errors: 1115.231316606429
Missing values globally replaced with mean/mode
Cluster centroids:
                                       Cluster#
Attribute         Full Data   0           1           2           3           4
                  (396)       (131)       (63)        (80)        (41)        (81)
================================================================================
Instance_number   299.6364    277.1374    347.1905    362.0625    231.5122    271.8642
age               43.1061     40.4733     49.1111     45.1        51.2439     36.6049
sex               MALE        FEMALE      MALE        MALE        FEMALE      MALE
region            INNER_CITY  INNER_CITY  INNER_CITY  TOWN        RURAL       INNER_CITY
income            27825.983   25733.2533  32891.0238  30817.2439  32090.4995  22158.1307
married           YES         YES         YES         NO          NO          YES
children          0           0           1           0           0           0
car               NO          YES         YES         NO          YES         NO
save_act          YES         YES         YES         YES         YES         NO
current_act       YES         YES         NO          YES         YES         YES
mortgage          NO          NO          NO          NO          NO          YES
pep               NO          NO          YES         YES         NO          YES
Time taken to build model (percentage split) : 0.04 seconds
Clustered Instances
0      73 ( 36%)
1      37 ( 18%)
2      30 ( 15%)
3      28 ( 14%)
4      36 ( 18%)
The result window shows the centroid of each cluster as well as statistics on the number and
percentage of instances assigned to different clusters. Cluster centroids are the mean vectors for each
cluster (so, each dimension value in the centroid represents the mean value for that dimension in the
cluster). Thus, centroids can be used to characterize the clusters. For example, the centroid for cluster 1 shows that this is a segment of cases representing young to middle-aged (approx. 38) females living in the inner city with an average income of approx. $28,500, who are married with one child, etc. Furthermore, this group has on average said YES to the PEP product.
Another way of understanding the characteristics of each cluster is through visualization. We can do this by right-clicking the result set in the left "Result list" panel and selecting "Visualize cluster assignments".
You can choose the cluster number and any of the other attributes for each of the three
different dimensions available (x-axis, y-axis, and color). Different combinations of choices will
result in a visual rendering of different relationships within each cluster. In the above example, we
have chosen the cluster number as the x-axis, the instance number (assigned by WEKA) as the y-axis,
and the "sex" attribute as the color dimension. This will result in a visualization of the distribution of
males and females in each cluster. For instance, you can note that clusters 2 and 3 are dominated by
males, while clusters 4 and 5 are dominated by females. In this case, by changing the color dimension
to other attributes, we can see their distribution within each of the clusters.
Finally, we may be interested in saving the resulting data set, which includes each instance along with its assigned cluster. To do so, we click the "Save" button in the visualization window and save the result as the file "bank-kmeans.arff".
@relation bank_clustered
@attribute Instance_number numeric
@attribute age numeric
@attribute sex {FEMALE,MALE}
@attribute region {INNER_CITY,TOWN,RURAL,SUBURBAN}
@attribute income numeric
@attribute married {NO,YES}
@attribute children {0,1,2,3}
@attribute car {NO,YES}
@attribute save_act {NO,YES}
@attribute current_act {NO,YES}
@attribute mortgage {NO,YES}
@attribute pep {YES,NO}
@attribute Cluster {cluster0,cluster1,cluster2,cluster3,cluster4,cluster5}
@data
0,48,FEMALE,INNER_CITY,17546,NO,1,NO,NO,NO,NO,YES,cluster1
1,40,MALE,TOWN,30085.1,YES,3,YES,NO,YES,YES,NO,cluster3
2,51,FEMALE,INNER_CITY,16575.4,YES,0,YES,YES,YES,NO,NO,cluster2
3,23,FEMALE,TOWN,20375.4,YES,3,NO,NO,YES,NO,NO,cluster5
4,57,FEMALE,RURAL,50576.3,YES,0,NO,YES,NO,NO,NO,cluster5
5,57,FEMALE,TOWN,37869.6,YES,2,NO,YES,YES,NO,YES,cluster5
6,22,MALE,RURAL,8877.07,NO,0,NO,NO,YES,NO,YES,cluster0
7,58,MALE,TOWN,24946.6,YES,0,YES,YES,YES,NO,NO,cluster2
8,37,FEMALE,SUBURBAN,25304.3,YES,2,YES,NO,NO,NO,NO,cluster5
9,54,MALE,TOWN,24212.1,YES,2,YES,YES,YES,NO,NO,cluster2
10,66,FEMALE,TOWN,59803.9,YES,0,NO,YES,YES,NO,NO,cluster5
11,52,FEMALE,INNER_CITY,26658.8,NO,0,YES,YES,YES,YES,NO,cluster4
12,44,FEMALE,TOWN,15735.8,YES,1,NO,YES,YES,YES,YES,cluster1
13,66,FEMALE,TOWN,55204.7,YES,1,YES,YES,YES,YES,YES,cluster1
14,36,MALE,RURAL,19474.6,YES,0,NO,YES,YES,YES,NO,cluster5
15,38,FEMALE,INNER_CITY,22342.1,YES,0,YES,YES,YES,YES,NO,cluster2
16,37,FEMALE,TOWN,17729.8,YES,2,NO,NO,NO,YES,NO,cluster5
17,46,FEMALE,SUBURBAN,41016,YES,0,NO,YES,NO,YES,NO,cluster5
18,62,FEMALE,INNER_CITY,26909.2,YES,0,NO,YES,NO,NO,YES,cluster4
19,31,MALE,TOWN,22522.8,YES,0,YES,YES,YES,NO,NO,cluster2
20,61,MALE,INNER_CITY,57880.7,YES,2,NO,YES,NO,NO,YES,cluster2
21,50,MALE,TOWN,16497.3,YES,2,NO,YES,YES,NO,NO,cluster5
Note that in addition to the "Instance_number" attribute, WEKA has also added a "Cluster" attribute to the original data set. In the data portion, each instance now has its assigned cluster as the last attribute value. With some simple manipulation of this data set, we can easily convert it to a more usable form for additional analysis or processing.
Apriori Algorithm using Weka
Market Basket Dataset:
(The market basket dataset is the binarized ARFF file already listed in full in the SAMPLE DATASETS section above: relation marketbasket, with one {0,1} attribute per product.)
@attribute ' Grape Juice_binarized' {0,1}
@attribute ' Carrots_binarized' {0,1}
@attribute ' Frozen Shrimp_binarized' {0,1}
@attribute ' Grape Fruit Roll_binarized' {0,1}
@attribute ' Merlot Wine_binarized' {0,1}
@attribute ' Raisins_binarized' {0,1}
@attribute ' Cranberry Juice_binarized' {0,1}
@attribute ' Shampoo_binarized' {0,1}
@attribute ' Pancake Mix_binarized' {0,1}
@attribute ' Paper Plates_binarized' {0,1}
@attribute ' Bologna_binarized' {0,1}
@attribute ' 2pct. Milk_binarized' {0,1}
@attribute ' Daily Newspaper_binarized' {0,1}
@attribute ' Popcorn Salt_binarized' {0,1}
@attribute ' Frozen Cauliflower_binarized' {0,1}
@attribute ' Vanilla Wafers_binarized' {0,1}
@attribute ' Tomatoes_binarized' {0,1}
@attribute ' Vegetable Oil_binarized' {0,1}
@attribute ' Chicken Soup_binarized' {0,1}
@attribute ' Eggs_binarized' {0,1}
@attribute ' Canned Mixed Fruit_binarized' {0,1}
@attribute ' Raisin Pudding_binarized' {0,1}
@attribute ' Celery_binarized' {0,1}
@attribute ' Oatmeal_binarized' {0,1}
@attribute ' Hot Chocolate_binarized' {0,1}
@attribute ' Spaghetti_binarized' {0,1}
@attribute ' Napkins_binarized' {0,1}
@attribute ' Blueberry Waffles_binarized' {0,1}
@attribute ' White Bread_binarized' {0,1}
@data
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
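The relation name above records that the NumericToBinary filter was applied to the original numeric market-basket data; this filter is also what appends the "_binarized" suffix seen in the attribute names. The same preparation step can be done programmatically. The following is a minimal sketch, assuming the raw data is in a file named marketbasket.arff (a hypothetical file name).
import java.io.BufferedReader;
import java.io.FileReader;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NumericToBinary;

public class BinarizeBasket {
    public static void main(String[] args) throws Exception {
        // Load the raw market-basket data (file name is hypothetical)
        Instances raw = new Instances(new BufferedReader(new FileReader("marketbasket.arff")));

        // Turn each numeric item count into a {0,1} attribute, as required by
        // Apriori; the filter appends "_binarized" to every attribute name
        NumericToBinary toBinary = new NumericToBinary();
        toBinary.setInputFormat(raw);
        Instances binarized = Filter.useFilter(raw, toBinary);

        System.out.println(binarized.relationName());
    }
}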
Clicking on the "Associate" tab will bring up the interface for the association rule
algorithms. The Apriori algorithm, which we will use, is the default algorithm selected. However, in
order to change the parameters for this run (e.g., support, confidence, etc.) we click on the text box
immediately to the right of the "Choose" button. Note that this box, at any given time, shows the
specific command line arguments that are to be used for the algorithm. The dialog box for changing
the parameters is depicted in Figure a2. Here, you can specify various parameters associated with
Apriori. Click on the "More" button to see the synopsis for the different parameters.
WEKA allows the resulting rules to be sorted according to different metrics such as
confidence, leverage, and lift. In this example, we have selected lift as the criterion and entered 1.5 as
its minimum value. Lift (or improvement) is computed as the confidence of the rule divided by the
support of the right-hand side (RHS). In a simplified form, given a rule L => R, lift is the ratio of the
probability that L and R occur together to the product of the two individual probabilities for L and R,
i.e.,
lift = Pr(L,R) / (Pr(L).Pr(R)).
If this value is 1, then L and R are independent. The higher this value, the more likely it is that the
co-occurrence of L and R in a transaction is not just a random occurrence, but due to some
relationship between them.
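As a concrete illustration of the formula, the following sketch computes lift from raw transaction counts; all counts are made up for illustration.
public class LiftDemo {
    public static void main(String[] args) {
        // Made-up counts for a hypothetical rule L => R
        int n = 1000;      // total number of transactions
        int countL = 120;  // transactions containing L
        int countR = 200;  // transactions containing R
        int countLR = 60;  // transactions containing both L and R

        double prL = (double) countL / n;    // Pr(L)   = 0.12
        double prR = (double) countR / n;    // Pr(R)   = 0.20
        double prLR = (double) countLR / n;  // Pr(L,R) = 0.06

        // lift = Pr(L,R) / (Pr(L).Pr(R)); a value of 1 means independence
        double lift = prLR / (prL * prR);
        System.out.println("lift = " + lift); // approximately 2.5
    }
}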
Here we also change the default number of rules (10) to 100; this indicates that the program
will report no more than the top 100 rules (in this case sorted according to their lift values). The upper
bound for minimum support is set to 1.0 (100%) and the lower bound to 0.1 (10%). Apriori in WEKA
starts with the upper bound support and iteratively decreases it (by the delta increment, which by
default is set to 0.05 or 5%). The algorithm halts when either the specified number of rules has been
generated or the lower bound for minimum support is reached. The significance testing option is only
applicable to confidence and is by default not used (-1.0).
Once the parameters have been set, the command line text box will show the new
command line. We now click on Start to run the program.
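For instance, with the settings described above, the command line in the text box might look like the following. This is a sketch, assuming the flag meanings in this version of Weka's Apriori (-T 1 ranks by lift, -C 1.5 sets the minimum lift, -N 100 caps the number of rules; -D, -U, -M and -S are the delta, support bounds and significance settings seen in the run information below):
weka.associations.Apriori -N 100 -T 1 -C 1.5 -D 0.05 -U 1.0 -M 0.1 -S -1.0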
Associator Output:
=== Run information ===
Scheme: weka.associations.Apriori -N 5 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation: weather.symbolic
Instances: 1000
Attributes: 100
Milk
Tea powder
Egg
Butter
Jam
Paste
bread
=== Associator model (full training set) ===
Apriori
=======
Minimum support: 0.25 (3 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 15
Generated sets of large itemsets:
Size of set of large itemsets L(1): 12
Size of set of large itemsets L(2): 26
Size of set of large itemsets L(3): 4
Best rules found:
1. milk=egg 4 ==> tea powder=yes 4 <conf:(1)> lift:(1.56) lev:(0.1) [1] conv:(1.43)
2. milk=butter 4 ==> jam=normal 4 <conf:(1)> lift:(2) lev:(0.14) [2] conv:(2)
3. bread=jam=milk 4 ==> egg=yes 4 <conf:(1)> lift:(1.56) lev:(0.1) [1] conv:(1.43)
4. milk=tea powder=bread 3 ==> egg=high 3 <conf:(1)> lift:(2) lev:(0.11) [1] conv:(1.5)
5. egg=+paste=milk 3 ==> bread=no 3 <conf:(1)> lift:(2.8) lev:(0.14) [1] conv:(1.93)
The panel on the left ("Result list") now shows an item indicating the algorithm that was
run and the time of the run. You can perform multiple runs in the same session, each with different
parameters, and each run will appear as an item in the Result list panel. Clicking on one of the results
in this list will bring up the details of the run, including the discovered rules, in the right panel. In
addition, right-clicking on the result set allows us to save the result buffer into a separate file. In this
case, we save the output in the file bank-data-ar1.txt.
Note that the rules were discovered based on the specified threshold values for support and
lift. For each rule, the frequency counts for the LHS and RHS are given, as well as the values for
confidence, lift, leverage, and conviction. Note that leverage and lift measure similar things, except
that leverage measures the difference between the probability of co-occurrence of L and R (see the
example above) and the product of the independent probabilities of L and R, i.e.,
Leverage = Pr(L,R) - Pr(L).Pr(R).
In other words, leverage measures the proportion of additional cases covered by both L
and R above those expected if L and R were independent of each other. Thus, for leverage, values
above 0 are desirable, whereas for lift we want to see values greater than 1. Finally, conviction is
similar to lift, but it measures the effect of the right-hand side not being true. It also inverts the ratio,
so conviction is measured as:
Conviction = Pr(L).Pr(not R) / Pr(L, not R).
Thus conviction, in contrast to lift, is not symmetric (and also has no upper bound).
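Continuing the made-up counts from the lift sketch above, leverage and conviction can be computed the same way:
public class LeverageConvictionDemo {
    public static void main(String[] args) {
        // Same made-up probabilities as in the lift sketch above
        double prL = 0.12;   // Pr(L)
        double prR = 0.20;   // Pr(R)
        double prLR = 0.06;  // Pr(L,R)

        // leverage = Pr(L,R) - Pr(L).Pr(R); values above 0 are desirable
        double leverage = prLR - prL * prR;          // 0.036

        // conviction = Pr(L).Pr(not R) / Pr(L, not R); values above 1 are desirable
        double prNotR = 1.0 - prR;                   // Pr(not R)   = 0.80
        double prLnotR = prL - prLR;                 // Pr(L, not R) = 0.06
        double conviction = prL * prNotR / prLnotR;  // 1.6

        System.out.println("leverage = " + leverage + ", conviction = " + conviction);
    }
}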
In most cases, it is sufficient to focus on a combination of support, confidence, and
either lift or leverage to quantitatively measure the "quality" of a rule. However, the real value of a
rule, in terms of usefulness and actionability, is subjective and depends heavily on the particular
domain and business objectives.
Using The Experimenter
The Experimenter interface to Weka is specialized for conducting experiments in which the
user wants to compare several learning schemes on one or more datasets. As its output, the
Experimenter produces data that can be used to compare these learning schemes visually and
numerically, as well as to conduct significance testing. To demonstrate how to use the Experimenter,
we will conduct an experiment comparing two tree-learning algorithms on the birth-weight dataset.
There are 3 main areas in the Experimenter interface, accessed via tabs at the top left of the
window: the Setup area, where the experiment parameters are set; the Run area, where the experiment
is started and its progress monitored; and lastly the Analyze area, where the results of the experiment
are studied.
[Figure: the Experimenter Setup screen, with callouts: choose any testing option; click the New button to create a new experiment; click the Browse button to browse to the result ARFF (or CSV) file; click the Add new button to add a dataset used for comparing algorithms; and click the Add new button to add the classification algorithms to compare.]
Setting up the Experiment
The Setup window has seven main areas that must each be configured for the experiment
to be properly defined. Starting from the top, these areas are Experiment Configuration Mode,
Results Destination, Experiment Type, Iteration Control, Datasets, Algorithms and lastly the Notes
area.
Experiment Configuration Mode:
We will be using the simple experiment configuration mode, as we do not require the extra
features the advanced mode offers. We will start by creating a new experiment and then defining its
parameters. A new experiment is created by clicking the ‘New’ button at the top of the window,
which creates a blank experiment. After we have finished setting up the experiment, we save it using
the ‘Save’ button. Experiment settings are saved in either the EXP format or the more familiar XML
format. These files can be opened later to recall all the experiment configuration settings.
Choose Destination:
The results of the experiment will be stored in a data file. This area allows one to specify the
name and format of that file. This is not necessary if the data will not be used outside of the
Experimenter or examined at a later date. Results can be stored in the ARFF or CSV format, and they
can also be sent to an external database.
Set Experiment Type:
There are 3 types of experiments available in the simple interface. They vary in how the data
is split for training and testing. The options are cross-validation, random split, and random split with
order preserved (i.e. the data is split but the order of the instances is not randomized, so it will be
instance #1 followed by instance #2 and so on). We will use cross-validation in our example.
Iteration Control:
For the randomized experiment types, the user has the option of randomizing the data again
and repeating the experiment. The ‘Number of Repetitions’ value controls how many times this will
take place.
Add data set(s):
In this section, the user adds the datasets that will be used in the experiment. Only ARFF files
can be used here and, as mentioned before, the Experimenter expects a fully prepared and cleaned
dataset. There is no option for choosing the classification variable here; the last attribute is always
used as the class attribute. In our example, the birth-weight dataset is the only one we will use.
Add Algorithms:
In this section, the user adds the classification algorithms to be employed in the experiment.
The procedure for selecting an algorithm and choosing its options is exactly the same as in the
Explorer; the difference here is that more than one algorithm can be specified.
Algorithms are added by clicking on the Add button in the Algorithms section of the window;
this pops up a window that the user uses to select the algorithm. This window also displays the
available options for the selected algorithm.
The first time the window is displayed, the ZeroR rule algorithm is selected, as shown in the
picture above. The user can select a different algorithm by clicking on the Choose button. Clicking on
the ‘More’ button displays help about the selected algorithm and a description of its available options.
For our example, we will add the J48 algorithm with the option for binary splits turned on, and the
REPTree algorithm. Individual algorithms can be edited or deleted by selecting the algorithm in the
list of algorithms and then clicking on the Edit or Delete button. Finally, any extra notes or comments
about the experiment setup can be added by clicking on the Notes button at the bottom of the window
and entering the information in the window provided.
Saving the Experiment Setup:
At this point, we have entered all the options necessary to start our experiment. We will now
save the experiment setup so that we do not have to re-enter all this information later. This is done by
clicking on the ‘Save Options’ button at the bottom of the window. These settings can be loaded at
another time if one wishes to redo or modify the experiment.
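Although the Experimenter automates this for us, the same comparison can also be scripted directly against the Weka Java API. The following is a minimal sketch, assuming the dataset is stored in birth-weight.arff (a hypothetical file name) with the class as the last attribute, mirroring our J48-with-binary-splits versus REPTree setup:
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.REPTree;
import weka.core.Instances;

public class CompareTrees {
    public static void main(String[] args) throws Exception {
        // Load the data; like the Experimenter, use the last attribute as the class
        Instances data = new Instances(new BufferedReader(new FileReader("birth-weight.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        J48 j48 = new J48();
        j48.setBinarySplits(true); // same option we enabled in the Experimenter
        REPTree rep = new REPTree();

        // 10-fold cross-validate each classifier and report percent correct
        for (weka.classifiers.Classifier c : new weka.classifiers.Classifier[] { j48, rep }) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.println(c.getClass().getName() + ": " + eval.pctCorrect() + "% correct");
        }
    }
}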
Running the Experiment
The next step is to run the experiment, which is done by clicking on the Run tab at the top
of the window. There is not much involved in this step; all that is needed is to click on the Start
button. The progress of the experiment is displayed in the Status area at the bottom of the window,
and any errors reported will be displayed in the Log area. Once the experiment has been run, the next
step is to analyze the results.
Analyzing the output:
[Figure: the Analyze tab of the Experimenter, with callouts: click the Experiment button to load the results of the experiment for analysis; choose the Comparison field used to compare the algorithms; sort the datasets in ascending order; and choose any of the Test base options in the given list.]
Test Output:
Tester: weka.experiment.PairedCorrectedTTester
Analysing: Percent_correct
Datasets: 1
Resultsets: 3
Confidence: 0.05 (two tailed)
Sorted by:
Date: 4/6/12 9:55 PM
Dataset                 (1) trees.J4 | (2) trees  (3) trees
------------------------------------------------------------
weather.symbolic (525)     50.03     |   60.03      69.02
------------------------------------------------------------
                          (v/ /*)    |  (0/1/0)    (0/1/0)
The (v/ /*) counts under each column record how often that scheme was significantly better than (v), tied with ( ), or significantly worse than (*) the base scheme (1) under the corrected paired t-test.
Key:
(1) trees.J48 '-C 0.25 -M 2' -217733168393644444
(2) trees.REPTree '-M 2 -V 0.0010 -N 3 -S 1 -L -1 -I 0.0' -9216785998198681299
(3) trees.RandomTree '-K 0 -M 1.0 -S 1' 8934314652175299374