
DWDM LAB Final Manual

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
UNIT-1 BUILD DATA WAREHOUSE AND EXPLORE WEKA
A. Build a Data Warehouse/Data Mart (using open-source tools like the Pentaho Data Integration tool and Pentaho Business Analytics, or other data warehouse tools like Microsoft SSIS, Informatica, Business Objects, etc.)
(i). Identify source tables and populate sample data.
Create the Data Warehouse
We create three dimension tables (DimDate, DimCustomer and DimVan) and one fact table (FactHire) in the data warehouse. We populate the three dimension tables but leave the fact table empty for now. The following scripts create and populate the dimension and fact tables.
Create data warehouse
create database TopHireDW
go
use TopHireDW
go
Create Date Dimension
if exists (select * from sys.tables where name = 'DimDate')
drop table DimDate
go
create table DimDate
( DateKey int not null primary key,
[Year] varchar(7), [Month] varchar(7), [Date] date, DateString varchar(10))
go
Populate Date Dimension
truncate table DimDate
go
declare @i int, @Date date, @StartDate date, @EndDate date, @DateKey int,@DateString
varchar(10), @Year varchar(4), @Month varchar(7), @Date1 varchar(20)
set @StartDate = '2006-01-01'
set @EndDate = '2016-12-31'
set @Date = @StartDate
insert into DimDate (DateKey, [Year], [Month], [Date], DateString)
values (0, 'Unknown', 'Unknown', '0001-01-01', 'Unknown') --The unknown row
while @Date <= @EndDate
begin
set @DateString = convert(varchar(10), @Date, 20)
set @DateKey = convert(int, replace(@DateString,'-',''))
set @Year = left(@DateString,4)
set @Month = left(@DateString, 7)
insert into DimDate (DateKey, [Year], [Month], [Date], DateString)
values (@DateKey, @Year, @Month, @Date, @DateString)
set @Date = dateadd(d, 1, @Date)
end
go
select * from DimDate
Create Customer dimension
if exists (select * from sys.tables where name = 'DimCustomer')
drop table DimCustomer
go
create table DimCustomer
( CustomerKey int not null identity(1,1) primary key,CustomerId varchar(20) not null,
CustomerName varchar(30), DateOfBirth date, Town varchar(50), TelephoneNo varchar(30),
DrivingLicenceNo varchar(30), Occupation varchar(30) )
go
insert into DimCustomer (CustomerId, CustomerName, DateOfBirth, Town, TelephoneNo,
DrivingLicenceNo, Occupation)
select * from HireBase.dbo.Customer
select * from DimCustomer
Create Van dimension
if exists (select * from sys.tables where name = 'DimVan')
drop table DimVan
go
create table DimVan
( VanKey int not null identity(1,1) primary key, RegNo varchar(10) not null, Make varchar(30),
Model varchar(30), [Year] varchar(4), Colour varchar(20), CC int, Class varchar(10) )
go
insert into DimVan (RegNo, Make, Model, [Year], Colour, CC, Class)
select * from HireBase.dbo.Van
go
select * from DimVan
Create Hirefact table
if exists (select * from sys.tables where name = 'FactHire')
drop table FactHire
go
create table FactHire
( SnapshotDateKey int not null, --Daily periodic snapshot fact table
HireDateKey int not null, CustomerKey int not null, VanKey int not null, --Dimension Keys
HireId varchar(10) not null, --Degenerate Dimension
NoOfDays int, VanHire money, SatNavHire money,
Insurance money, DamageWaiver money, TotalBill money
)
go
select * from FactHire
(ii). Design multi-dimensional data models, namely Star, Snowflake and Fact Constellation schemas, for any one enterprise (e.g. Banking, Insurance, Finance, Healthcare, Manufacturing, Automobile, etc.)
Schema Definition
A multidimensional schema is defined using Data Mining Query Language (DMQL). Its two primitives, cube definition and dimension definition, can be used for defining data warehouses and data marts.
Star Schema
● Each dimension in a star schema is represented with only one dimension table.
● This dimension table contains the set of attributes.
● The following diagram shows the sales data of a company with respect to the four dimensions, namely time, item, branch, and location.
● There is a fact table at the center. It contains the keys to each of the four dimensions.
● The fact table also contains the measures, namely dollars sold and units sold.
Fig: Star schema
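To make the star schema concrete, here is a minimal T-SQL sketch of this sales example. All table and column names (DimTime, DimItem, DimBranch, DimLocation, FactSales, etc.) are illustrative assumptions for this exercise, not part of the HireBase warehouse built above.
create table DimTime
( TimeKey int not null primary key, [Day] int, [Month] varchar(7), [Quarter] varchar(2), [Year] int )
create table DimItem
( ItemKey int not null primary key, ItemName varchar(30), Brand varchar(30), Type varchar(30) )
create table DimBranch
( BranchKey int not null primary key, BranchName varchar(30), BranchType varchar(20) )
create table DimLocation
( LocationKey int not null primary key, Street varchar(50), City varchar(30), Province varchar(30), Country varchar(30) )
create table FactSales
( TimeKey int not null references DimTime(TimeKey), --Dimension keys
ItemKey int not null references DimItem(ItemKey),
BranchKey int not null references DimBranch(BranchKey),
LocationKey int not null references DimLocation(LocationKey),
DollarsSold money, UnitsSold int ) --Measures
go
Each foreign key in FactSales points to exactly one denormalized dimension table, which is what makes the schema a star.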
Snowflake Schema
● Some dimension tables in the Snowflake schema are normalized.
● The normalization splits up the data into additional tables.
● Unlike the Star schema, the dimension tables in a Snowflake schema are normalized. For example, the item dimension table in the star schema is normalized and split into two dimension tables, namely item and supplier.
● Now the item dimension table contains the attributes item_key, item_name, type, brand, and supplier_key.
● The supplier_key is linked to the supplier dimension table. The supplier dimension table contains the attributes supplier_key and supplier_type.
Fig: Snowflake Schema
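Continuing the hypothetical T-SQL sketch above, snowflaking the item dimension means moving the supplier attributes into their own table and keeping only a supplier key in the item table:
create table DimSupplier
( SupplierKey int not null primary key, SupplierType varchar(30) )
create table DimItemSnowflake --normalized replacement for DimItem
( ItemKey int not null primary key, ItemName varchar(30), Brand varchar(30), Type varchar(30),
SupplierKey int not null references DimSupplier(SupplierKey) )
go
Note that a query touching supplier_type now needs one extra join compared with the star version.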
Fact Constellation Schema
● A fact constellation has multiple fact tables. It is also known as a galaxy schema.
● The following diagram shows two fact tables, namely sales and shipping.
● The sales fact table is the same as that in the star schema.
● The shipping fact table has five dimension keys, namely item_key, time_key, shipper_key, from_location, and to_location.
● The shipping fact table also contains two measures, namely dollars cost and units shipped.
● It is also possible to share dimension tables between fact tables. For example, the time, item, and location dimension tables are shared between the sales and shipping fact tables.
Fig: Fact constellation schema
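As a hedged sketch of the constellation, a second fact table can reuse the time, item and location dimension tables assumed in the star schema sketch above, adding only a shipper dimension (again, illustrative names):
create table DimShipper
( ShipperKey int not null primary key, ShipperName varchar(30), ShipperType varchar(20) )
create table FactShipping
( ItemKey int not null references DimItem(ItemKey), --shared with FactSales
TimeKey int not null references DimTime(TimeKey), --shared with FactSales
ShipperKey int not null references DimShipper(ShipperKey),
FromLocationKey int not null references DimLocation(LocationKey), --shared with FactSales
ToLocationKey int not null references DimLocation(LocationKey),
DollarsCost money, UnitsShipped int )
go
FactSales and FactShipping together form the galaxy: they share DimTime, DimItem and DimLocation.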
(iii). Write ETL scripts and implement using data warehouse tools
ETL comes from Data Warehousing and stands for Extract-Transform-Load. ETL covers the process of how data are loaded from the source system into the data warehouse. Extraction-transformation-loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, and its cleansing, customization, reformatting, integration, and insertion into a data warehouse.
Building the ETL process is potentially one of the biggest tasks of building a warehouse; it is complex and time consuming, and it consumes most of a data warehouse project's implementation effort, cost, and resources.
Building a data warehouse requires focusing closely on understanding three main areas:
1. Source Area - The source area has standard models, such as the entity-relationship diagram.
2. Destination Area - The destination area has standard models, such as the star schema.
3. Mapping Area - The mapping area, however, has no standard model to date.
Abbreviations
● ETL-extraction–transformation–loading
● DW-data warehouse
● DM- data mart
● OLAP- on-line analytical processing
● DS-data sources
● ODS- operational data store
● DSA- data staging area
● DBMS- database management system
● OLTP-on-line transaction processing
● CDC-change data capture
● SCD-slowly changing dimension
● FCME- first-class modeling elements
● EMD - entity mapping diagram
ETL Process:
Extract
The Extract step covers the extraction of data from the source system and makes it accessible for further processing. The main objective of the extract step is to retrieve all the required data from the source system using as few resources as possible. The extract step should be designed in a way that does not negatively affect the source system in terms of performance, response time, or any kind of locking.
Transform
The transform step applies a set of rules to transform the data from the source to the target. This includes converting any measured data to the same dimension (i.e. conformed dimension) using the same units so that they can later be joined. The transformation step also typically involves joining data from several sources, generating aggregates, generating surrogate keys, sorting, deriving new calculated values, and applying advanced validation rules.
Load
During the load step, it is necessary to ensure that the load is performed correctly and with as few resources as possible. The target of the load process is often a database. In order to make the load process efficient, it is helpful to disable any constraints and indexes before the load and re-enable them only after the load completes. Referential integrity needs to be maintained by the ETL tool to ensure consistency.
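As a concrete illustration, the FactHire table created in Unit 1 was left empty; a minimal extract-transform-load sketch in T-SQL could fill it. This assumes a HireBase.dbo.Hire source table with the columns used below, which the manual does not show, so treat it as a sketch rather than the definitive script.
insert into FactHire (SnapshotDateKey, HireDateKey, CustomerKey, VanKey, HireId,
NoOfDays, VanHire, SatNavHire, Insurance, DamageWaiver, TotalBill)
select
convert(int, replace(convert(varchar(10), getdate(), 20), '-', '')), --snapshot date key for today
convert(int, replace(convert(varchar(10), h.HireDate, 20), '-', '')), --Transform: natural date to DimDate key
c.CustomerKey, v.VanKey, --Transform: look up dimension surrogate keys
h.HireId, h.NoOfDays, h.VanHire, h.SatNavHire, h.Insurance, h.DamageWaiver, h.TotalBill --Load: measures
from HireBase.dbo.Hire h --Extract from the assumed source table
join DimCustomer c on c.CustomerId = h.CustomerId
join DimVan v on v.RegNo = h.RegNo
go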
(iv). Perform various OLAP operations such as slice, dice, roll-up, drill-down and pivot.
Ans: OLAP OPERATIONS
An Online Analytical Processing (OLAP) server is based on the multidimensional data model. It allows managers and analysts to get insight into the information through fast, consistent, and interactive access to it. OLAP operations work on multidimensional data.
Here is the list of OLAP operations:
● Roll-up
● Drill-down
● Slice and dice
● Pivot (rotate)
Roll-up
The following diagram illustrates how roll-up works.
Fig: Roll-up operation
● Roll-up is performed by climbing up a concept hierarchy for the dimension location.
● Initially the concept hierarchy was "street < city < province < country".
● On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.
● The data is grouped into countries rather than cities.
● When roll-up is performed, one or more dimensions from the data cube are removed, as the SQL sketch below shows.
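In SQL terms, roll-up is simply re-aggregation at a coarser level of the hierarchy. A hedged T-SQL sketch against the hypothetical FactSales/DimLocation tables from section (ii):
--Before roll-up: sales aggregated by city
select l.Country, l.City, sum(f.DollarsSold) as DollarsSold
from FactSales f join DimLocation l on l.LocationKey = f.LocationKey
group by l.Country, l.City
--After roll-up: the City column is dropped and the data is re-aggregated by country
select l.Country, sum(f.DollarsSold) as DollarsSold
from FactSales f join DimLocation l on l.LocationKey = f.LocationKey
group by l.Country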
Drill-down
Drill-down is the reverse operation of roll-up. It is performed in either of the following ways:
● By stepping down a concept hierarchy for a dimension
● By introducing a new dimension.
The following diagram illustrates how drill-down works:
Fig: Drill-down operation
● Drill-down is performed by stepping down a concept hierarchy for the dimension time.
● Initially the concept hierarchy was "day < month < quarter < year".
● On drilling down, the time dimension is descended from the level of quarter to the level of month.
● When drill-down is performed, one or more dimensions from the data cube are added.
● It navigates the data from less detailed data to highly detailed data.
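Using the same hypothetical tables, a drill-down sketch descends from quarter to month by adding the finer column to the grouping:
--Before drill-down: sales by quarter
select t.[Quarter], sum(f.DollarsSold) as DollarsSold
from FactSales f join DimTime t on t.TimeKey = f.TimeKey
group by t.[Quarter]
--After drill-down: sales by month within each quarter
select t.[Quarter], t.[Month], sum(f.DollarsSold) as DollarsSold
from FactSales f join DimTime t on t.TimeKey = f.TimeKey
group by t.[Quarter], t.[Month]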
Slice
The slice operation selects one particular dimension from a given cube and provides a new sub-cube.
Consider the following diagram that shows how slice works.
Fig: Slice operation
● Here slice is performed for the dimension "time" using the criterion time = "Q1".
● It forms a new sub-cube by selecting one or more dimensions.
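A slice fixes a single value on one dimension and keeps the rest of the cube. Against the hypothetical tables from section (ii), this is simply a WHERE clause on that dimension:
select l.City, i.ItemName, sum(f.DollarsSold) as DollarsSold
from FactSales f
join DimTime t on t.TimeKey = f.TimeKey
join DimLocation l on l.LocationKey = f.LocationKey
join DimItem i on i.ItemKey = f.ItemKey
where t.[Quarter] = 'Q1' --the slice: time fixed to Q1
group by l.City, i.ItemName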
Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube. Consider the
following diagram that shows the dice operation.
Fig: Dice operation
The dice operation on the cube, based on the following selection criteria, involves three dimensions:
● (location = "Toronto" or "Vancouver")
● (time = "Q1" or "Q2")
● (item = "Mobile" or "Modem")
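A dice is the same idea applied to two or more dimensions at once; the three criteria above translate into IN predicates over the hypothetical tables:
select l.City, t.[Quarter], i.ItemName, sum(f.DollarsSold) as DollarsSold
from FactSales f
join DimTime t on t.TimeKey = f.TimeKey
join DimLocation l on l.LocationKey = f.LocationKey
join DimItem i on i.ItemKey = f.ItemKey
where l.City in ('Toronto', 'Vancouver') --location criterion
and t.[Quarter] in ('Q1', 'Q2') --time criterion
and i.ItemName in ('Mobile', 'Modem') --item criterion
group by l.City, t.[Quarter], i.ItemName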
Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an
alternative presentation of data. Consider the following diagram that shows the pivot operation.
Fig: Pivot operation
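Pivot only changes the presentation, not the data. One hedged way to rotate cities from rows to columns in T-SQL is conditional aggregation over the hypothetical tables:
select i.ItemName,
sum(case when l.City = 'Toronto' then f.DollarsSold else 0 end) as Toronto, --city value becomes a column
sum(case when l.City = 'Vancouver' then f.DollarsSold else 0 end) as Vancouver
from FactSales f
join DimLocation l on l.LocationKey = f.LocationKey
join DimItem i on i.ItemKey = f.ItemKey
group by i.ItemName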
A.(v). Explore visualization features of the tool for analysis like identifying trends etc.
Ans: Visualization Features:
WEKA's visualization allows you to visualize a 2-D plot of the current working relation. Visualization is very useful in practice; it helps to determine the difficulty of the learning problem. WEKA can visualize single attributes (1-d) and pairs of attributes (2-d), and rotate 3-d visualizations (Xgobi-style). WEKA has a "Jitter" option to deal with nominal attributes and to detect "hidden" data points.
● Access to visualization from the classifier, cluster and attribute selection panels is available from a popup menu. Click the right mouse button over an entry in the result list to bring up the menu. You will be presented with options for viewing or saving the text output and, depending on the scheme, further options for visualizing errors, clusters, trees etc.
To open Visualization screen, click ‘Visualize’ tab.
Select a square that corresponds to the attributes you would like to visualize. For example, let's choose 'outlook' for the X-axis and 'play' for the Y-axis. Click anywhere inside the square that corresponds to 'play' on the left and 'outlook' at the top.
Changing the View:
In the visualization window, beneath the X-axis selector there is a drop-down list, 'Colour', for choosing the color scheme. This allows you to choose the color of points based on the attribute selected. Below the plot area, there is a legend that describes what values the colors correspond to. In our example, red represents 'no', while blue represents 'yes'. For better visibility you should change the color of the label 'yes'. Left-click on 'yes' in the 'Class colour' box and select a lighter color from the color palette.
To the right of the plot area there is a series of horizontal strips. Each strip represents an attribute, and the dots within it show the distribution of values of the attribute. You can choose what axes are used in the main graph by clicking on these strips (left-click changes the X-axis, right-click changes the Y-axis).
The software sets the X-axis to the 'outlook' attribute and the Y-axis to 'play'. The instances are spread out in the plot area and concentration points are not visible. Keep sliding 'Jitter', a random displacement given to all points in the plot, to the right, until you can spot concentration points.
The results are shown below, but on this screen we changed 'Colour' to temperature. Besides 'outlook' and 'play', this allows you to see the 'temperature' corresponding to the 'outlook'. It will affect your result, because if you see 'outlook' = 'sunny' and 'play' = 'no', then to explain the result you need to see the 'temperature': if it is too hot, you do not want to play. Change 'Colour' to 'windy', and you can see that if it is windy, you do not want to play as well.
Selecting Instances
Sometimes it is helpful to select a subset of the data using the visualization tool. A special case is the 'UserClassifier', which lets you build your own classifier by interactively selecting instances. Below the Y-axis there is a drop-down list that allows you to choose a selection method. A group of points on the graph can be selected in four ways [2]:
1. Select Instance. Click on an individual data point. It brings up a window listing the attributes of the point. If more than one point appears at the same location, more than one set of attributes is shown.
2. Rectangle. You can create a rectangle by dragging it around the points.
3. Polygon. You can select several points by building a free-form polygon. Left-click on the graph to
add vertices to the polygon and right-click to complete it.
4. Polyline. To distinguish the points on one side from the ones on another, you can build a polyline. Left-click on the graph to add vertices to the polyline and right-click to finish.
B. Explore WEKA Data Mining/Machine Learning Toolkit
(i). Downloading and/or installation of WEKA data mining toolkit
The operating system used here is Ubuntu.
Step 1: Go to the Applications Menu -> Ubuntu Software Center -> search for Weka and click the Install button.
Fig: Ubuntu Software Center
Fig: Weka software availability search in Ubuntu Software Center
The Weka installation process asks for administrator authentication.
Fig: Authentication provision window in installation process
The Weka software will be installed in a few minutes.
Fig: Weka installation status confirmation window
Step 2: Launching the WEKA GUI tool.
The following navigation launches the Weka tool:
Applications Menu -> Science -> weka
Fig: Weka GUI Launching window
(ii). Understand the features of the WEKA toolkit, such as the Explorer, Knowledge Flow interface, Experimenter, and command-line interface.
Introduction:
Weka is a workbench that contains a collection of visualization tools and algorithms for data analysis
and predictive modeling, together with graphical user interfaces for easy access to these functions.
Advantages of Weka include:
● Free availability under the GNU General Public License.
● Portability, since it is fully implemented in the Java programming language and thus runs on almost any modern computing platform.
● A comprehensive collection of data preprocessing and modeling techniques.
● Ease of use due to its graphical user interfaces.
Description:
Open the program. Once the program has been loaded on the user's machine, it is opened by navigating to the program's start option; this will depend on the user's operating system. The figure below is an example of the initial opening screen on a computer. There are four options available on this initial screen:
Fig: Weka GUI
1. Explorer - the graphical interface used to conduct experimentation on raw data.
After clicking the Explorer button, the Weka Explorer interface appears. Inside the Weka Explorer window there are six tabs:
1. Preprocess - used to choose the data file to be used by the application.
● Open File - allows the user to select files residing on the local machine or recorded medium.
● Open URL - provides a mechanism to locate a file or data source from a different location specified by the user.
● Open Database - allows the user to retrieve files or data from a database source provided by the user.
Fig: Preprocessor Environment
2. Classify - used to test and train different learning schemes on the preprocessed data file under experimentation. Again there are several options to be selected inside the Classify tab. The test options give the user the choice of four different test mode scenarios on the data set:
1. Use training set
2. Supplied test set
3. Cross-validation
4. Percentage split
Fig: Classification of data and its options in Weka
3. Cluster - used to apply different tools that identify clusters within the data file. The Cluster tab opens the process that is used to identify commonalities or clusters of occurrences within the data set and produce information for the user to analyze.
Fig: Cluster analysis options using the training data set in the Weka tool
4. Association - used to apply different rules to the data file that identify associations within the data. The Associate tab opens a window to select the options for associations within the data set.
Fig: Choosing Apriori data set
5. Select attributes - used to apply different rules to reveal changes based on a selected attribute's inclusion or exclusion from the experiment.
Fig: Selection of Attributes in weka tool
6. Visualize - used to see what the various manipulations produced on the data set, in a 2D format, as scatter plot and bar graph output.
Fig: Visualization in Weka tool
2. Experimenter - this option allows users to conduct different experimental variations on data sets and perform statistical manipulation. The Weka Experiment Environment enables the user to create, run, modify, and analyze experiments in a more convenient manner than is possible when processing the schemes individually. For example, the user can create an experiment that runs several schemes against a series of datasets and then analyze the results to determine if one of the schemes is (statistically) better than the other schemes.
Results destination: ARFF file, CSV file, JDBC database.
Experiment type: Cross-validation (default), Train/Test Percentage Split (data randomized).
Iteration control: Number of repetitions, Data sets first/Algorithms first.
Algorithms: filters
3. Knowledge Flow - basically the same functionality as the Explorer, with drag-and-drop functionality. The advantage of this option is that it supports incremental learning from previous results.
4. Simple CLI - provides users without a graphical interface the ability to execute commands from a terminal window.
(iv). Study the ARFF file format
An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of instances sharing a set of attributes. ARFF files were developed by the Machine Learning Project at the Department of Computer Science of The University of Waikato for use with the Weka machine learning software. In WEKA, each data entry is an instance of the Java class weka.core.Instance, and each instance consists of a number of attribute values. For loading datasets, WEKA can read ARFF files. The Attribute-Relation File Format has two sections:
1) The Header section defines the relation (dataset) name, attribute names, and types.
2) The Data section lists the data instances.
Fig: ARFF file
The figure above, from the German credit data, shows an ARFF file. Lines beginning with a % sign are comments, and there are three basic keywords:
"@relation" in the Header section, followed by the relation name.
"@attribute" in the Header section, followed by attribute names and their types (or sizes).
"@data" in the Data section, followed by the list of data instances.
The external representation of an Instances class consists of:
− A header: describes the attribute types
− A data section: a comma-separated list of data
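For illustration, here is a minimal ARFF sketch modeled on WEKA's well-known weather dataset (only the first few data rows are shown; the rest of the file follows the same pattern):
% Toy weather relation (illustrative)
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes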
(v). Explore the available datasets in the Weka tool.
Click the "Open file..." button to open a data set and double-click on the "data" directory. Weka provides a number of small, common machine learning datasets that you can use to practice on.
Fig: Sample data sets provided by Weka tool
(vi). Load a data set (e.g. weather data set, credit data set)
(vii). Load a data set and observe the following:
a. List the Attribute names and their types
Title: German Credit data

Attribute Name                          Type of Attribute
Status of existing checking account     qualitative
Duration in month                       numerical
Credit history                          qualitative
Purpose                                 qualitative
Credit amount                           numerical
Savings account/bonds                   qualitative
Present employment since                qualitative
b. The number of records in the data set is 1000.
c. The class attribute is the credit class (good/bad).
d. Summary
Correctly Classified Instances          855        85.5 %
Incorrectly Classified Instances        145        14.5 %
Total Number of Instances              1000
UNIT-2
Perform data pre-processing tasks and demonstrate performing association rule mining on data sets
A. Explore various options available in Weka for preprocessing data and apply them (like Discretization filters, Resample filter, etc.) on each dataset.
The navigation flow for preprocessing a data set without any filter in the Weka tool is as follows:
The GUI WEKA launcher --> Explorer --> Preprocess --> Open file (browse the system for a data set) --> select ALL attributes.
A histogram which represents all attributes appears in the visualize area.
Fig: Preprocessing the data without any filter.
The navigation flow for preprocessing a data set with the Discretize filter in the Weka tool is as follows:
The GUI WEKA launcher --> Explorer --> Preprocess --> Open file (browse the system for a data set) --> Choose --> weka --> filters --> supervised --> attribute --> Discretize --> ALL attributes.
A histogram which represents credit history appears in the visualize area.
Fig: preprocess with discretization
The navigation flow for preprocessing a data set with the Resample filter in the Weka tool is as follows:
The GUI WEKA launcher --> Explorer --> Preprocess --> Open file (browse the system for a data set) --> Choose --> weka --> filters --> supervised --> instance --> Resample --> select Credit History.
A histogram which represents credit history appears in the visualize area.
Fig: Preprocess of data with Resample
B. Load a data set into Weka and run the Apriori algorithm with different support and confidence values. Study the rules generated.
The navigation flow for applying the Apriori algorithm on a data set in the Weka tool is as follows:
The GUI WEKA launcher --> Explorer --> Open file (browse the system for a data set) --> Associate --> Choose --> weka --> associations --> Apriori --> Start.
The best rules found are displayed in the Associator output area.
Fig: Apriori association
Fig: Apriori Association continuation
UNIT-3
Demonstrate performing classification on data sets
A. Load a dataset into Weka and run the J48 classification algorithm. Study the classifier output. Compute entropy values and the Kappa statistic.
The navigation flow for classification of a data set with the J48 classifier in the Weka tool is as follows:
The GUI WEKA launcher --> Explorer --> Classify --> Open file (browse the system for a data set) --> Choose --> weka --> trees --> J48.
Fig: Classification of a dataset using classifier J48
B. Extract if-then rules from the decision tree generated by the classifier, observe the confusion matrix, and derive accuracy, F-measure, TP rate, FP rate, precision and recall values. Apply the cross-validation strategy with different values and compare accuracy results.
Fig: Classification of a dataset using classifier J48 continuation.
C. Load a dataset into Weka and run the Naive Bayes classification algorithm. Study the classifier output.
The navigation flow for classification of a data set with the Naive Bayes classifier in the Weka tool is as follows:
The GUI WEKA launcher --> Explorer --> Classify --> Open file (browse the system for a data set) --> Choose --> weka --> bayes --> NaiveBayes.
Fig: classification using Naive-bayes classifier
Fig: classification using Naive-bayes classifier continuation..
Fig: classification using Naive-bayes classifier continuation..
D. Plot ROC Curves
Fig: Classifier run on german credit dataset
Fig : ROC curve for german credit dataset
Visualize threshold curves
Fig: No credits or All paid
Fig: All paid
Fig: Existing paid
Fig: Delayed previously
Fig: Critical or Other existing credit
E. Compare the classification results of ID3, J48, Naïve Bayes and k-NN classifiers for each dataset, deduce which classifier performs best and which performs worst for each dataset, and justify.
Ans: Steps to run the ID3 and J48 classification algorithms in WEKA:
1. Open the WEKA tool.
2. Click on WEKA Explorer.
3. Click on the Preprocess tab button.
4. Click on the open file button.
5. Choose the WEKA folder in the C drive.
6. Select and click on the data option button.
7. Choose the iris data set and open the file.
8. Click on the classify tab, choose the J48 algorithm, and select the use training set test option.
9. Click on the start button.
10. Click on the classify tab, choose the ID3 algorithm, and select the use training set test option.
11. Click on the start button.
12. Click on the classify tab, choose the Naïve Bayes algorithm, and select the use training set test option.
13. Click on the start button.
14. Click on the classify tab, choose the k-nearest neighbour algorithm, and select the use training set test option.
15. Click on the start button.
J48:
=== Run information ===
Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     iris
Instances:    150
Attributes:   5
              sepallength
              sepalwidth
              petallength
              petalwidth
              class
Test mode:    evaluate on training data
=== Classifier model (full training set) ===

J48 pruned tree
------------------
petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
|   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)

Number of Leaves  :     5
Size of the tree  :     9

Time taken to build model: 0 seconds
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances          147        98      %
Incorrectly Classified Instances          3         2      %
Kappa statistic                           0.97
Mean absolute error                       0.0233
Root mean squared error                   0.108
Relative absolute error                   5.2482 %
Root relative squared error              22.9089 %
Total Number of Instances               150
=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
               1         0         1           1        1           1          Iris-setosa
               0.98      0.02      0.961       0.98     0.97        0.99       Iris-versicolor
               0.96      0.01      0.98        0.96     0.97        0.99       Iris-virginica
Weighted Avg.  0.98      0.01      0.98        0.98     0.98        0.993
=== Confusion Matrix ===

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 49  1 |  b = Iris-versicolor
  0  2 48 |  c = Iris-virginica
Naïve Bayes:
=== Run information ===
Scheme:       weka.classifiers.bayes.NaiveBayes
Relation:     iris
Instances:    150
Attributes:   5
              sepallength
              sepalwidth
              petallength
              petalwidth
              class
Test mode:    evaluate on training data
=== Classifier model (full training set) ===

Naive Bayes Classifier

                            Class
Attribute        Iris-setosa   Iris-versicolor   Iris-virginica
                 (0.33)        (0.33)            (0.33)
===============================================================
sepallength
  mean           4.9913        5.9379            6.5795
  std. dev.      0.355         0.5042            0.6353
  weight sum     50            50                50
  precision      0.1059        0.1059            0.1059

sepalwidth
  mean           3.4015        2.7687            2.9629
  std. dev.      0.3925        0.3038            0.3088
  weight sum     50            50                50
  precision      0.1091        0.1091            0.1091

petallength
  mean           1.4694        4.2452            5.5516
  std. dev.      0.1782        0.4712            0.5529
  weight sum     50            50                50
  precision      0.1405        0.1405            0.1405

petalwidth
  mean           0.2743        1.3097            2.0343
  std. dev.      0.1096        0.1915            0.2646
  weight sum     50            50                50
  precision      0.1143        0.1143            0.1143
Time taken to build model: 0 seconds
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances          144        96      %
Incorrectly Classified Instances          6         4      %
Kappa statistic                           0.94
Mean absolute error                       0.0324
Root mean squared error                   0.1495
Relative absolute error                   7.2883 %
Root relative squared error              31.7089 %
Total Number of Instances               150
=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
               1         0         1           1        1           1          Iris-setosa
               0.96      0.04      0.923       0.96     0.941       0.993      Iris-versicolor
               0.92      0.02      0.958       0.92     0.939       0.993      Iris-virginica
Weighted Avg.  0.96      0.02      0.96        0.96     0.96        0.995
=== Confusion Matrix ===

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 48  2 |  b = Iris-versicolor
  0  4 46 |  c = Iris-virginica
K-Nearest Neighbour (IBk):
=== Run information ===
Scheme:       weka.classifiers.lazy.IBk -K 1 -W 0 -A "weka.core.neighboursearch.LinearNNSearch -A \"weka.core.EuclideanDistance -R first-last\""
Relation:     iris
Instances:    150
Attributes:   5
              sepallength
              sepalwidth
              petallength
              petalwidth
              class
Test mode:    evaluate on training data
=== Classifier model (full training set) ===

IB1 instance-based classifier
using 1 nearest neighbour(s) for classification

Time taken to build model: 0 seconds

=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances          150       100      %
Incorrectly Classified Instances          0         0      %
Kappa statistic                           1
Mean absolute error                       0.0085
Root mean squared error                   0.0091
Relative absolute error                   1.9219 %
Root relative squared error               1.9335 %
Total Number of Instances               150
=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
               1         0         1           1        1           1          Iris-setosa
               1         0         1           1        1           1          Iris-versicolor
               1         0         1           1        1           1          Iris-virginica
Weighted Avg.  1         0         1           1        1           1
=== Confusion Matrix ===

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 50  0 |  b = Iris-versicolor
  0  0 50 |  c = Iris-virginica
UNIT-4
DEMONSTRATE PERFORMING CLUSTERING ON DATA SETS
CLUSTERING TAB
A. Load each dataset into Weka and run the simple k-means clustering algorithm with different values of k (number of desired clusters). Study the clusters formed. Observe the sum of squared errors and centroids, and derive insights.
Ans: Steps to run the k-means clustering algorithm in WEKA:
1. Open the WEKA tool.
2. Click on WEKA Explorer.
3. Click on the Preprocess tab button.
4. Click on the open file button.
5. Choose the WEKA folder in the C drive.
6. Select and click on the data option button.
7. Choose the iris data set and open the file.
8. Click on the cluster tab, choose k-means, and select the use training set test option.
9. Click on the start button.
Output:
=== Run information ===
Scheme:       weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation:     iris
Instances:    150
Attributes:   5
              sepallength
              sepalwidth
              petallength
              petalwidth
              class
Test mode:    evaluate on training data
=== Model and evaluation on training set ===
kMeans
======
Number of iterations: 7
Within cluster sum of squared errors: 62.1436882815797
Missing values globally replaced with mean/mode
Cluster centroids:
                               Cluster#
Attribute      Full Data       0                 1
               (150)           (100)             (50)
==================================================================
sepallength    5.8433          6.262             5.006
sepalwidth     3.054           2.872             3.418
petallength    3.7587          4.906             1.464
petalwidth     1.1987          1.676             0.244
class          Iris-setosa     Iris-versicolor   Iris-setosa
Time taken to build model (full training data) : 0 seconds
=== Model and evaluation on training set ===
Clustered Instances

0      100 ( 67%)
1       50 ( 33%)
B. Explore other clustering techniques available in Weka.
Ans: The clustering algorithms and techniques available in WEKA are listed in the Choose dialog of the Cluster tab; they include, for example, Cobweb, EM, FarthestFirst, HierarchicalClusterer, and SimpleKMeans.
Fig: Clustering techniques available in Weka
C. Explore visualization features of Weka to visualize the clusters. Derive interesting insights and explain.
Ans: Visualize Features
WEKA's visualization allows you to visualize a 2-D plot of the current working relation. Visualization is very useful in practice; it helps to determine the difficulty of the learning problem. WEKA can visualize single attributes (1-d) and pairs of attributes (2-d), and rotate 3-d visualizations (Xgobi-style). WEKA has a "Jitter" option to deal with nominal attributes and to detect "hidden" data points.
Access to visualization from the classifier, cluster and attribute selection panels is available from a popup menu. Click the right mouse button over an entry in the result list to bring up the menu. You will be presented with options for viewing or saving the text output and, depending on the scheme, further options for visualizing errors, clusters, trees etc.
To open the Visualization screen, click the 'Visualize' tab.
Select a square that corresponds to the attributes you would like to visualize. For example, let's choose 'outlook' for the X-axis and 'play' for the Y-axis. Click anywhere inside the square that corresponds to 'play' on the left and 'outlook' at the top.
Changing the View:
In the visualization window, beneath the X-axis selector there is a drop-down list, 'Colour', for choosing the color scheme. This allows you to choose the color of points based on the attribute selected. Below the plot area, there is a legend that describes what values the colors correspond to. In our example, red represents 'no', while blue represents 'yes'. For better visibility you should change the color of the label 'yes'. Left-click on 'yes' in the 'Class colour' box and select a lighter color from the color palette.
Selecting Instances
Sometimes it is helpful to select a subset of the data using the visualization tool. A special case is the 'UserClassifier', which lets you build your own classifier by interactively selecting instances. Below the Y-axis there is a drop-down list that allows you to choose a selection method. A group of points on the graph can be selected in four ways [2]:
1. Select Instance. Click on an individual data point. It brings up a window listing the attributes of the point. If more than one point appears at the same location, more than one set of attributes is shown.
2. Rectangle. You can create a rectangle by dragging it around the points.
3. Polygon. You can select several points by building a free-form polygon. Left-click on the graph to add vertices to the polygon and right-click to complete it.
4. Polyline. To distinguish the points on one side from the ones on another, you can build a polyline. Left-click on the graph to add vertices to the polyline and right-click to finish.
Unit-V
Demonstrate performing regression on data sets.
A. Load each dataset into Weka and build a Linear Regression model. Study the model formed. Use the training set option. Interpret the regression model and derive patterns and conclusions from the regression results.
Ans: Steps to run the Linear Regression algorithm in WEKA:
1. Open the WEKA tool.
2. Click on WEKA Explorer.
3. Click on the Preprocess tab button.
4. Click on the open file button.
5. Choose the WEKA folder in the C drive.
6. Select and click on the data option button.
7. Choose the labor data set and open the file.
8. Click on the Classify tab, click the Choose button, then expand the functions branch.
9. Select the LinearRegression leaf and select the use training set test option.
10. Click on the start button.
Output:
=== Run information ===
Scheme:       weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation:     labor-neg-data
Instances:    57
Attributes:   17
duration
wage-increase-first-year
wage-increase-second-year
wage-increase-third-year
cost-of-living-adjustment
working-hours
pension
standby-pay
shift-differential
education-allowance
statutory-holidays
vacation
longterm-disability-assistance
contribution-to-dental-plan
bereavement-assistance
contribution-to-health-plan
class
Test mode:    10-fold cross-validation
=== Classifier model (full training set) ===
Linear Regression Model
B. Use the cross-validation and percentage split options and repeat running the Linear Regression model. Observe the results and derive meaningful conclusions.
Ans: Steps to run the Linear Regression algorithm in WEKA:
1. Open the WEKA tool.
2. Click on WEKA Explorer.
3. Click on the Preprocess tab button.
4. Click on the open file button.
5. Choose the WEKA folder in the C drive.
6. Select and click on the data option button.
7. Choose the labor data set and open the file.
8. Click on the Classify tab, click the Choose button, then expand the functions branch.
9. Select the LinearRegression leaf and select the cross-validation test option.
10. Click on the start button.
11. Select the LinearRegression leaf and select the percentage split test option.
12. Click on the start button.
Output: cross-validation
=== Run information ===
Scheme:       weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation:     labor-neg-data
Instances:    57
Attributes:   17
duration
wage-increase-first-year
wage-increase-second-year
wage-increase-third-year
cost-of-living-adjustment
working-hours
pension
standby-pay
shift-differential
education-allowance
statutory-holidays
vacation
longterm-disability-assistance
contribution-to-dental-plan
bereavement-assistance
contribution-to-health-plan
class
Test mode:    10-fold cross-validation
=== Classifier model (full training set) ===
Linear Regression Model
duration =
0.4689 * cost-of-living-adjustment=tc,tcf +
0.6523 * pension=none,empl_contr +
1.0321 * bereavement-assistance=yes +
0.3904 * contribution-to-health-plan=full +
0.2765
Time taken to build model: 0.02 seconds
=== Cross-validation ===
=== Summary ===

Correlation coefficient                   0.1967
Mean absolute error                       0.6499
Root mean squared error                   0.777
Relative absolute error                 111.6598 %
Root relative squared error             108.8152 %
Total Number of Instances                56
Ignored Class Unknown Instances           1
Output: percentage split
=== Run information ===
Scheme:       weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation:     labor-neg-data
Instances:    57
Attributes:   17
duration
wage-increase-first-year
wage-increase-second-year
wage-increase-third-year
cost-of-living-adjustment
working-hours
pension
standby-pay
shift-differential
education-allowance
statutory-holidays
vacation
longterm-disability-assistance
contribution-to-dental-plan
bereavement-assistance
contribution-to-health-plan
class
Test mode:    split 66.0% train, remainder test
=== Classifier model (full training set) ===
Linear Regression Model
duration =
0.4689 * cost-of-living-adjustment=tc,tcf +
0.6523 * pension=none,empl_contr +
1.0321 * bereavement-assistance=yes +
0.3904 * contribution-to-health-plan=full +
0.2765
Time taken to build model: 0.02 seconds
=== Evaluation on test split ===
=== Summary ===

Correlation coefficient                   0.243
Mean absolute error                       0.783
Root mean squared error                   0.9496
Relative absolute error                 106.8823 %
Root relative squared error             114.13   %
Total Number of Instances                19
C. Explore simple linear regression techniques that only look at one variable.
Ans: Steps to run the Simple Linear Regression algorithm in WEKA:
1. Open the WEKA tool.
2. Click on WEKA Explorer.
3. Click on the Preprocess tab button.
4. Click on the open file button.
5. Choose the WEKA folder in the C drive.
6. Select and click on the data option button.
7. Choose the labor data set and open the file.
8. Click on the Classify tab, click the Choose button, then expand the functions branch.
9. Select the SimpleLinearRegression leaf and select the cross-validation test option.
10. Click on the start button.
Data Mining Lab
CREDIT RISK ASSESSMENT
● The business of banks is making loans. Assessing the creditworthiness of an applicant is of crucial importance. We have to develop a system to help a loan officer decide whether the credit of a customer is good or bad. A bank's business rules regarding loans must consider two opposing factors. On the one hand, a bank wants to make as many loans as possible. Interest on these loans is the bank's profit source. On the other hand, a bank cannot afford to make too many bad loans. Too many bad loans could lead to the collapse of the bank. The bank's loan policy must involve a compromise: not too strict, and not too lenient.
● Credit risk is an investor's risk of loss arising from a borrower who does not make payments as promised. Such an event is called a default. Other terms for credit risk are default risk and counterparty risk.
● Credit risk is most simply defined as the potential that a bank borrower or counterparty will fail to meet its obligations in accordance with agreed terms.
● The goal of credit risk management is to maximize a bank's risk-adjusted rate of return by maintaining credit risk exposure within acceptable parameters.
● Banks need to manage the credit risk inherent in the entire portfolio as well as the risk in individual credits or transactions.
● Banks should also consider the relationships between credit risk and other risks.
● The effective management of credit risk is a critical component of a comprehensive approach to risk management and essential to the long-term success of any banking organization.
● A good credit assessment means you should be able to qualify, within the limits of your income, for most loans.
Week 1
1. List all the categorical (or nominal) attributes and the real-valued attributes separately.
From the German Credit Assessment case study given to us, the following attributes are found to be applicable for credit-risk assessment; they comprise categorical or nominal attributes (which take true/false, etc. values) and real-valued attributes:
1. checking_status
2. duration
3. credit history
4. purpose
5. credit amount
6. savings_status
7. employment duration
8. installment rate
9. personal status
10. debtors
11. residence_since
12. property
13. installment plans
14. housing
15. existing credits
16. job
17. num_dependents
18. telephone
19. foreign worker
Week 2
2. What attributes do you think might be crucial in making the credit assessment? Come up with some simple rules in plain English using your selected attributes.
In my view, the following attributes may be crucial in making the credit risk assessment:
1. Credit_history
2. Employment
3. Property_magnitude
4. Job
5. Duration
6. Credit_amount
7. Installment
8. Existing credit
Based on the above attributes, we can make a decision whether to give credit or not.
checking_status = no checking AND other_payment_plans = none AND
credit_history = critical/other existing credit: good
checking_status = no checking AND existing_credits <= 1 AND
other_payment_plans = none AND purpose = radio/tv: good
checking_status = no checking AND foreign_worker = yes AND
employment = 4<=X<7: good
foreign_worker = no AND personal_status = male single: good
checking_status = no checking AND purpose = used car AND
other_payment_plans = none: good
duration <= 15 AND other_parties = guarantor: good
duration <= 11 AND credit_history = critical/other existing credit: good
checking_status = >=200 AND num_dependents <= 1 AND
property_magnitude = car: good
checking_status = no checking AND property_magnitude = real estate AND
other_payment_plans = none AND age > 23: good
savings_status = >=1000 AND property_magnitude = real estate: good
savings_status = 500<=X<1000 AND employment = >=7: good
credit_history = no credits/all paid AND housing = rent: bad
savings_status = no known savings AND checking_status = 0<=X<200 AND existing_credits >
1: good
checking_status = >=200 AND num_dependents <= 1 AND property_magnitude = life insurance:
good
installment_commitment <= 2 AND other_parties = co applicant AND existing_credits >
1: bad
installment_commitment <= 2 AND credit_history = delayed previously AND existing_credits > 1
AND residence_since > 1: good
installment_commitment <= 2 AND credit_history = delayed previously AND existing_credits <=
1: good
duration > 30 AND savings_status = 100<=X<500: bad
credit_history = all paid AND other_parties = none AND other_payment_plans = bank: bad
duration > 30 AND savings_status = no known savings AND num_dependents > 1: good
duration > 30 AND credit_history = delayed previously: bad
duration > 42 AND savings_status = <100 AND residence_since > 1: bad
Week 3
3. One type of model that you can create is a decision tree: train a decision tree using the complete dataset as the training data. Report the model obtained after training.
A decision tree is a flowchart-like tree structure where each internal node (non-leaf) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.
Decision trees can be easily converted into classification rules. Examples of decision tree algorithms are ID3, C4.5 and CART.
J48 pruned tree
● Using the WEKA tool, we can generate a decision tree by selecting the "classify tab".
● In the classify tab, select the choose option, where a list of different decision trees is available. From that list, select J48.
● Now under the test options, select the training set test option.
● The resulting window in WEKA is as follows:
● To generate the decision tree, right-click on the result list and select the visualize tree option, by which the decision tree will be generated.
● The obtained decision tree for credit risk assessment is too large to fit on the screen.
The decision tree below is unclear due to a large number of attributes.
Week 4
4. Suppose you use your above model trained on the complete dataset, and classify credit good/bad for each of the examples in the dataset. What % of examples can you classify correctly? (This is also called testing on the training set.) Why do you think you cannot get 100% training accuracy?
In the above model we trained on the complete dataset and classified credit good/bad for each of the examples in the dataset.
For example:
IF purpose=vacation THEN credit=bad;
ELSE IF purpose=business THEN credit=good;
In this way we classified each of the examples in the dataset. We classified 85.5% of the examples correctly, and the remaining 14.5% of the examples were incorrectly classified. We can't get 100% training accuracy because, out of the 20 attributes, there are some unnecessary attributes which are also analyzed and trained on. Due to this, the accuracy is affected, and hence we can't get 100% training accuracy.
Week 5
5. Is testing on the training set as you did above a good idea? Why or why not?
It is a bad idea to take all the data into the training set; then how can we test whether the above classification is correct or not?
According to the rules, for the maximum accuracy, we have to take 2/3 of the dataset as the training set and the remaining 1/3 as the test set. But here, in the above model, we have taken the complete dataset as the training set, which results in only 85.5% accuracy. This comes from the analyzing and training of unnecessary attributes which do not play a crucial role in credit risk assessment. Because of this, complexity increases, and finally it leads to lower accuracy. If some part of the dataset is used as a training set and the remainder as a test set, it leads to more accurate results, and the time for computation will be less. This is why we prefer not to take the complete dataset as the training set. Use Training Set result for the German Credit Data table:
Correctly Classified Instances 855 85.5 %
Incorrectly Classified Instances 145 14.5 %
Kappa statistic 0.6251
Mean absolute error 0.2312
Root mean squared error 0.34
Relative absolute error 55.0377 %
Root relative squared error 74.2015 %
Total Number of Instances 1000
Week 6
6. One approach for solving the problem encountered in the previous question is using crossvalidation? Describe what cross-validation is briefly. Train a Decision Tree again using crossvalidation and report your results. Does your accuracy increase/decrease? Why?
Cross-validation: In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets, or folds, D1, D2, D3, ..., Dk, each of approximately equal size. Training and testing are performed k times. In iteration i, partition Di is reserved as the test set and the remaining partitions are collectively used to train the model. That is, in the first iteration, subsets D2, D3, ..., Dk collectively serve as the training set to obtain the first model, which is tested on D1; the second iteration is trained on subsets D1, D3, ..., Dk and tested on D2; and so on.
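A minimal sketch of 10-fold cross-validation with the WEKA API (file name and random seed are assumptions):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidate {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff");
        data.setClassIndex(data.numAttributes() - 1);
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1)); // k = 10 folds
        System.out.println(eval.toSummaryString("=== 10-fold cross-validation ===", false));
    }
}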
1. Select the Classify tab and the J48 decision tree, and under Test options select the Cross-validation radio button with the number of folds set to 10.
2. The number of folds indicates the number of partitions of the dataset.
3. A Kappa statistic nearing 1 indicates 100% accuracy, with all errors zeroed out; in reality there is no training set that gives 100% accuracy.
4. Cross-validation result at folds: 10 for the German Credit Data table:
Correctly Classified Instances 705 70.5 %
Incorrectly Classified Instances 295 29.5 %
Kappa statistic 0.2467
Mean absolute error 0.3467
Root mean squared error 0.4796
Relative absolute error 82.5233 %
Root relative squared error 104.6565 %
Total Number of Instances 1000
Here there are 1000 instances with 100 instances per partition.
Cross-validation result at folds: 20 for the German Credit Data table:
Correctly Classified Instances 698 69.8 %
Incorrectly Classified Instances 302 30.2 %
Kappa statistic 0.2264
Mean absolute error 0.3571
Root mean squared error 0.4883
Relative absolute error 85.0006 %
Root relative squared error 106.5538 %
Total Number of Instances 1000
Cross-validation result at folds: 50 for the German Credit Data table:
Correctly Classified Instances 710 71 %
Incorrectly Classified Instances 290 29 %
Kappa statistic 0.2587
Mean absolute error 0.3444
Root mean squared error 0.4771
Relative absolute error 81.959 %
Root relative squared error 104.1164 %
Total Number of Instances 1000
Cross-validation result at folds: 100 for the German Credit Data table:
Correctly Classified Instances 710 71 %
Incorrectly Classified Instances 290 29 %
Kappa statistic 0.2587
Mean absolute error 0.3444
Root mean squared error 0.477
Relative absolute error 81.959 %
Root relative squared error 104.1164 %
Total Number of Instances 1000
Percentage split does not allow 100%; it allows only up to 99.9%.
Percentage Split result at 50%:
Correctly Classified Instances 362 72.4 %
Incorrectly Classified Instances 138 27.6 %
Kappa statistic 0.2725
Mean absolute error 0.3225
Root mean squared error 0.4764
Relative absolute error 76.3523 %
Root relative squared error 106.4373 %
Total Number of Instances 500
Percentage Split result at 99.9%:
Correctly Classified Instances 0 0 %
Incorrectly Classified Instances 1 100 %
Kappa statistic 0
Mean absolute error 0.6667
Root mean squared error 0.6667
Relative absolute error 221.7054 %
Root relative squared error 221.7054 %
Total Number of Instances 1
Week 7
7. Check to see if the data shows a bias against "foreign workers" (attribute 20) or "personal status" (attribute 9). One way to do this (perhaps rather simple-minded) is to remove these attributes from the dataset and see if the decision tree created in those cases is significantly different from the full-dataset case, which you have already done. To remove an attribute you can use the Preprocess tab in WEKA's GUI Explorer. Did removing these attributes have any significant effect? Discuss.
The accuracy increases because the two attributes "foreign workers" and "personal status" are not very important for training and analysis. By removing them, the training time is reduced to some extent and the accuracy increases. The decision tree created from the full dataset is very large compared to the decision tree we have trained now; this is the main difference between the two decision trees.
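Removing attributes 9 and 20 can also be done programmatically with WEKA's Remove filter; a sketch under the same file-name assumption:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class DropAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff");
        Remove remove = new Remove();
        remove.setAttributeIndices("9,20");     // personal_status and foreign_worker (1-based)
        remove.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, remove);
        reduced.setClassIndex(reduced.numAttributes() - 1);
        System.out.println(reduced.numAttributes() + " attributes remain");
    }
}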
After the foreign worker attribute is removed, the accuracy increases to 85.9%.
If we also remove the 9th attribute, the accuracy further increases to 86.6%, which shows that these two attributes are not significant for training.
Cross-validation after removing the 9th attribute.
Percentage split after removing the 9th attribute.
After removing the 20th attribute, the cross-validation result is as above.
After removing the 20th attribute, the percentage split result is as above.
Week 8
8. Another question might be: do you really need to input so many attributes to get good results? Maybe only a few would do. For example, you could try just having attributes 2, 3, 5, 7, 10, 17 (and 21, the class attribute, naturally). Try out some combinations. (You had removed two attributes in problem 7. Remember to reload the ARFF data file to get all the attributes initially before you start selecting the ones you want.)
Select attributes 2, 3, 5, 7, 10, 17 and 21, and click Invert to remove the remaining attributes, as in the sketch below.
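A sketch of the same inverted selection via the Remove filter; setInvertSelection keeps the listed attributes instead of dropping them (file name assumed):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class KeepAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff");
        Remove remove = new Remove();
        remove.setAttributeIndices("2,3,5,7,10,17,21"); // attributes to KEEP ...
        remove.setInvertSelection(true);                // ... because the selection is inverted
        remove.setInputFormat(data);
        Instances subset = Filter.useFilter(data, remove);
        subset.setClassIndex(subset.numAttributes() - 1);
        System.out.println(subset.numAttributes() + " attributes kept");
    }
}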
Here the accuracy decreases.
Select random attributes and then check the accuracy.
After removing attributes 1, 4, 6, 8, 9, 11, 12, 13, 14, 15, 16, 18, 19 and 20, we select the left-over attributes and visualize them.
After we remove 14 attributes, the accuracy decreases to 76.4%; hence we can try further random combinations of attributes to increase the accuracy.
Week 9
9. Sometimes, the cost of rejecting an applicant who actually has good credit (case 1) might be higher than accepting an applicant who has bad credit (case 2). Instead of counting the misclassifications equally in both cases, give a higher cost to the first case (say, cost 5) and a lower cost to the second case. You can do this by using a cost matrix in WEKA.
Train your decision tree again and report the decision tree and cross-validation results. Are they significantly different from the results obtained in problem 6 (using equal cost)?
In problem 6 we used equal cost to train the decision tree. Here we consider two cases with different costs: cost 5 in case 1 and cost 2 in case 2. When we give such costs in both cases and train the decision tree again, we observe that it is almost equal to the decision tree obtained in problem 6.

             Case 1 (cost 5)   Case 2 (cost 2)
Total Cost   3820              1705
Average Cost 3.82              1.705

We do not find this cost factor in problem 6, as equal cost is used there. This is the major difference between the results of problem 6 and problem 9.
The cost matrices we used here:
Case 1:
5 1
1 5
Case 2:
2 1
1 2
1. Select the Classify tab. Select More options from Test options, tick Cost-sensitive evaluation and go to Set. Set classes as 2. Click Resize to get the cost matrix, then change the 2nd entry in the 1st row and the 2nd entry in the 1st column to 5.0.
2. The confusion matrix will then be generated, and you can see the difference between the good and bad classes.
3. Check whether the accuracy changes or not (see the sketch after these steps).
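The cost matrix can also be supplied through the API. A hedged sketch only: the cell positions follow WEKA's convention of rows as the actual class and columns as the predicted class, and the file name is an assumption.

import java.util.Random;
import weka.classifiers.CostMatrix;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CostSensitiveEval {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff");
        data.setClassIndex(data.numAttributes() - 1);
        CostMatrix costs = new CostMatrix(2);        // 2 classes: good / bad
        costs.setCell(0, 1, 5.0);                    // rejecting a good applicant costs 5
        costs.setCell(1, 0, 1.0);                    // accepting a bad applicant costs 1
        Evaluation eval = new Evaluation(data, costs);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println("Total cost:   " + eval.totalCost());
        System.out.println("Average cost: " + eval.avgCost());
    }
}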
Week 10
10. Do you think it is a good idea to prefer simple decision trees instead of having long complex
decision trees? How does the complexity of a Decision Tree relate to the bias of the model?
When we consider long, complex decision trees, we will have many unnecessary attributes in the tree, which increases the bias of the model. Because of this, the accuracy of the model can also be affected.
This problem can be reduced by considering a simple decision tree: the attributes will be fewer and the bias of the model decreases, so the result will be more accurate.
So it is a good idea to prefer simple decision trees over long, complex trees.
1. Open any existing ARFF file, e.g. labour.arff.
2. In the Preprocess tab, select ALL to select all the attributes.
3. Go to the Classify tab and then use the training set with the J48 algorithm.
4. To generate the decision tree, right click on the result list and select the Visualize tree option, by which the decision tree will be generated.
5. Right click on the J48 algorithm to get the Generic Object Editor window.
6. In this, set the unpruned option to True.
7. Then press OK and Start. We find the tree becomes more complex if not pruned (see the sketch after step 8).
Visualize tree
8. The tree has become more complex.
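Steps 5-7 correspond to J48's unpruned option, which can be set through the API as in this sketch (the labour.arff file from step 1 is assumed to be on the local path; WEKA's bundled copy may be named labor.arff):

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class UnprunedTree {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("labour.arff");
        data.setClassIndex(data.numAttributes() - 1);
        J48 tree = new J48();
        tree.setUnpruned(true);          // same as setting unpruned = True in the editor
        tree.buildClassifier(data);
        System.out.println(tree);        // noticeably larger than the pruned tree
    }
}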
Week 11
11. You can make your decision trees simpler by pruning the nodes. One approach is to use reduced error pruning: explain this idea briefly. Try reduced error pruning for training your decision trees using cross-validation (you can do this in WEKA) and report the decision tree you obtain. Also report your accuracy using the pruned model. Does your accuracy increase?
Reduced-error pruning:
The idea of using a separate pruning set for pruning, which is applicable to decision trees as well as rule sets, is called reduced-error pruning. The variant described previously prunes a rule immediately after it has been grown and is called incremental reduced-error pruning. Another possibility is to build a full, unpruned rule set first and prune it afterwards by discarding individual tests.
However, this method is much slower. Of course, there are many different ways to assess the worth of a rule based on the pruning set. A simple measure is to consider how well the rule would do at discriminating the predicted class from other classes if it were the only rule in the theory, operating under the closed world assumption. If it gets p instances right out of the t instances that it covers, and there are P instances of this class out of a total of T instances altogether, then it gets p positive instances right. The instances that it does not cover include
N - n
negative ones, where
n = t - p
is the number of negative instances that the rule covers and
N = T - P
is the total number of negative instances. Thus the rule has an overall success ratio of
[p + (N - n)] / T,
and this quantity, evaluated on the test set, has been used to evaluate the success of a rule when using reduced-error pruning.
1. Right click on the J48 algorithm to get the Generic Object Editor window.
2. In this, set the reducedErrorPruning option to True (note that WEKA does not allow the unpruned and reduced error pruning options to be selected at the same time).
3. Then press OK and Start.
4. We find that the accuracy increases by selecting the reduced error pruning option, as in the sketch below.
Reduced error pruning is set to True
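A sketch of the same setting via the API; reducedErrorPruning corresponds to J48's -R option (file name and seed assumed):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ReducedErrorPruningRun {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff");
        data.setClassIndex(data.numAttributes() - 1);
        J48 tree = new J48();
        tree.setReducedErrorPruning(true);   // hold out part of the data for pruning
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.printf("Accuracy with reduced error pruning: %.1f%%%n", eval.pctCorrect());
    }
}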
Week 12
12. (Extra Credit): How can you convert a decision tree into "if-then-else" rules? Make up your own small decision tree consisting of 2-3 levels and convert it into a set of rules. There also exist different classifiers that output the model in the form of rules; one such classifier in WEKA is rules.PART. Train this model and report the set of rules obtained. Sometimes just one attribute can be good enough in making the decision, yes, just one! Can you predict what attribute that might be in this dataset? The OneR classifier uses a single attribute to make decisions (it chooses the attribute based on minimum error). Report the rule obtained by training a OneR classifier. Rank the performance of J48, PART and OneR.
In WEKA, rules.PART is one of the classifiers that converts decision trees into "IF-THEN-ELSE" rules. Converting decision trees into "IF-THEN-ELSE" rules using the rules.PART classifier:
PART decision list
● outlook = overcast: yes (4.0)
● windy = TRUE: no (4.0/1.0)
● outlook = sunny: no (3.0/1.0)
: yes (3.0)
● Number of Rules : 4
Yes, sometimes just one attribute can be good enough to make the decision.
In this dataset (weather), the single attribute for making the decision is "outlook":
outlook:
● sunny -> no
● overcast -> yes
● rainy -> yes
(10/14 instances correct)
With respect to time, the OneR classifier has the highest ranking, J48 takes 2nd place and PART takes 3rd place.

           J48    PART   OneR
TIME (sec) 0.12   0.14   0.04
RANK       II     III    I

But if you consider accuracy, the J48 classifier has the highest ranking, PART takes second place and OneR takes last place.

             J48    PART   OneR
ACCURACY (%) 70.5   70.2   66.8
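The three classifiers can be compared in one run, as in this sketch (same file-name assumption; the accuracies will match the table above only for the same folds and seed):

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.classifiers.rules.PART;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareRuleLearners {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff");
        data.setClassIndex(data.numAttributes() - 1);
        Classifier[] models = { new J48(), new PART(), new OneR() };
        for (Classifier model : models) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(model, data, 10, new Random(1)); // 10-fold CV for each
            System.out.printf("%-5s accuracy: %.1f%%%n",
                    model.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}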
1. Open an existing file such as weather.nominal.arff.
2. Select All.
3. Go to Classify.
4. Start.
Here the accuracy is 100%.
The tree corresponds to the following "if-then-else" rules:
If outlook=overcast then
play=yes
If outlook=sunny and humidity=high then
play = no
else
play = yes
If outlook=rainy and windy=true then
play = no
else
play = yes
To display the rules:
1. Go to Choose, then click on Rules and select PART.
2. Click on Save and then Start.
3. Similarly for the OneR algorithm.
If outlook = overcast then
play = yes
If outlook = sunny and humidity = high then
play = no
If outlook = sunny and humidity = normal then
play = yes
Additional Programs
1. Perform cluster analysis on German credit data set using partition clustering algorithm
2. Perform cluster analysis on German credit data set using EM clustering algorithm
Additional program-1
Aim: Perform cluster analysis on German credit data set using partition clustering algorithm
Recommended Hardware / Software Requirements:
• Hardware Requirements: Intel-based desktop PC with a minimum 166 MHz or faster processor, at least 64 MB RAM and 100 MB free disk space.
• Weka
Pseudo code
In pseudocode, the general k-means clustering algorithm is:
1. Place K points into the space represented by the objects that are being clustered. These
points represent initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of
the objects into groups from which the metric to be minimized can be calculated.
Procedure: In the WEKA GUI Explorer, select the Cluster tab; click Choose and select SimpleKMeans; under Cluster mode select Use training set; click Start. The same run can also be scripted, as in the sketch below.
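A sketch of the same run via the API; the -N 3 and -S 10 settings mirror the scheme line in the output below (the file name is an assumption):

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class KMeansCredit {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff");
        Remove remove = new Remove();
        remove.setAttributeIndices("21");    // drop the class attribute, as in the run below
        remove.setInputFormat(data);
        Instances clusterData = Filter.useFilter(data, remove);
        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(3);                // -N 3, matching the scheme below
        km.setSeed(10);                      // -S 10
        km.buildClusterer(clusterData);
        System.out.println(km);              // centroids and cluster sizes
    }
}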
Output: cluster analysis on k-means clustering algorithm
=== Run information ===
Scheme: weka.clusterers.SimpleKMeans -N 3 -A "weka.core.EuclideanDistance -R
first-last" -I 500 -O -S 10
Relation: german_credit-weka.filters.unsupervised.attribute.Remove-R21
Instances: 1000
Attributes: 20
checking_status
duration
credit_history
purpose
credit_amount
savings_status
employment
installment_commitment
personal_status
other_parties
residence_since
property_magnitude
age
other_payment_plans
housing
existing_credits
job
num_dependents
own_telephone
foreign_worker
Test mode: evaluate on training data
=== Clustering model (full training set) ===
kMeans
======
Number of iterations: 8
Within cluster sum of squared errors: 5145.269062855846
Missing values globally replaced with mean/mode
Cluster centroids:
Cluster#
Attribute Full Data 0 1 2
(1000) (484) (190) (326)
===============================================================
==========================================
checking_status no checking no checking <0 0<=X<200
duration 20.903 20.7314 26.0526 18.1564
credit_history existing paid existing paid existing paid existing paid
purpose radio/tv new car used car radio/tv
credit_amount 3271.258 3293.1281 4844.6474 2321.7822
savings_status <100 <100 <100 <100
employment 1<=X<4 1<=X<4 >=7 >=7
installment_commitment 2.973 2.8822 3.0579 3.0583
personal_status male single male single male single male single
other_parties none none none none
residence_since 2.845 2.4483 3.5211 3.0399
property_magnitude car car no known property real estate
age 35.546 33.155 41.0526 35.8865
other_payment_plans none none none none
housing own own for free own
existing_credits 1.407 1.3967 1.4474 1.3988
job skilled skilled skilled skilled
num_dependents 1.155 1.155 1.2474 1.1012
own_telephone none none yes none
foreign_worker yes yes yes yes
Time taken to build model (full training data) : 0.07 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 484 ( 48%)
1 190 ( 19%)
2 326 ( 33%)
Additional program-2
Aim: Perform cluster analysis on the German credit data set using the EM clustering algorithm
Recommended Hardware / Software Requirements:
• Hardware Requirements: Intel-based desktop PC with a minimum 166 MHz or faster processor, at least 64 MB RAM and 100 MB free disk space.
• Weka
Pseudo code:
In pseudocode, the general EM clustering algorithm is:
1. Choose initial estimates for the cluster parameters (means, variances and cluster priors).
2. E-step: for each instance, compute the probability of its membership in each cluster using the current parameter estimates.
3. M-step: re-estimate the cluster parameters from the membership-weighted instances.
4. Repeat steps 2 and 3 until the log-likelihood converges.
Procedure: In the WEKA GUI Explorer, select the Cluster tab; click Choose and select EM; under Cluster mode select Use training set; click Start. The same run can also be scripted, as in the sketch below.
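A sketch of the same EM run via the API; -N -1 lets EM select the number of clusters by cross-validation, as in the output below (the file name is an assumption):

import weka.clusterers.EM;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class EMCredit {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff");
        Remove remove = new Remove();
        remove.setAttributeIndices("21");   // remove the class attribute before clustering
        remove.setInputFormat(data);
        Instances clusterData = Filter.useFilter(data, remove);
        EM em = new EM();
        em.setNumClusters(-1);              // -N -1: choose k by cross-validation, as below
        em.buildClusterer(clusterData);
        System.out.println(em);             // per-cluster parameter estimates
    }
}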
Output:
=== Run information ===
Scheme: weka.clusterers.EM -I 100 -N -1 -M 1.0E-6 -S 100
Relation: german_credit-weka.filters.unsupervised.attribute.Remove-R21
Instances: 1000
Attributes: 20
checking_status
duration
credit_history
purpose
credit_amount
savings_status
employment
installment_commitment
personal_status
other_parties
residence_since
property_magnitude
age
other_payment_plans
housing
existing_credits
job
num_dependents
own_telephone
foreign_worker
Test mode: evaluate on training data
=== Clustering model (full training set) ===
EM
==
Number of clusters selected by cross validation: 4
Cluster
Attribute 0 1 2 3
(0.26) (0.26) (0.2) (0.29)
===============================================================
============
checking_status
<0 100.8097 58.5666 51.7958 66.8279
0<=X<200 69.3481 63.6477 34.9535 105.0507
>=200 17.6736 20.0978 11.9012 17.3274
no checking 73.012 119.2995 101.8966 103.7918
[total] 260.8434 261.6116 200.5471 292.9978
duration
mean 17.7484 14.3572 23.4112 27.8358
std. dev. 8.0841 7.1757 12.1018 14.1317
credit_history
no credits/all paid 10.1705 6.0326 8.4795 19.3174
all paid 17.9296 11.0899 9.6553 14.3252
existing paid 175.3951 142.1934 53.3962 163.0153
delayed previously 10.1938 18.0432 24.9273 38.8357
critical/other existing credit 48.1544 85.2526 105.0888 58.5041
[total] 261.8434 262.6116 201.5471 293.9978
purpose
new car 57.7025 76.7946 47.734 55.7689
used car 14.504 7.9487 40.7163 43.831
furniture/equipment 95.3943 25.2704 24.1583 40.1769
radio/tv 53.3828 106.3023 48.3866 75.9283
domestic appliance 7.9495 3.4917 1.161 3.3979
repairs 5.5771 9.5832 6.9408 3.8988
education 9.921 10.7236 11.9789 21.3766
vacation 1 1 1 1
retraining 4.7356 4.1209 2.311 1.8324
business 16.6708 22.302 19.5059 42.5213
other 1.0059 1.0743 3.6542 10.2656
[total] 267.8434 268.6116 207.5471 299.9978
credit_amount
mean 2288.8498 1812.2911 3638.3737 5195.2049
std. dev. 1342.8531 995.7303 2694.223 3683.9507
savings_status
<100 170.6648 165.5967 96.2641 174.4744
100<=X<500 26.3033 25.4915 18.3092 36.8959
500<=X<1000 15.6275 21.5273 15.5765 14.2688
>=1000 12.2318 18.448 12.513 8.8072
no known savings 37.0161 31.5481 58.8844 59.5515
[total] 261.8434 262.6116 201.5471 293.9978
employment
unemployed 14.0219 3.1801 16.0683 32.7298
<1 90.51 34.2062 8.4379 42.846
1<=X<4 84.9242 128.879 27.7645 101.4323
4<=X<7 50.6437 42.1897 31.3087 53.858
>=7 21.7437 54.1567 117.9679 63.1317
[total] 261.8434 262.6116 201.5471 293.9978
installment_commitment
mean 2.8557 3.0212 3.312 2.8038
std. dev. 1.1596 1.1124 0.9515 1.1363
personal_status
male div/sep 15.737 9.9518 4.6205 23.6907
female div/dep/mar 151.4625 48.4321 18.2787 95.8267
male single 67.3068 159.5075 172.5861 152.5996
male mar/wid 26.3371 43.7203 5.0618 20.8808
female single 1 1 1 1
[total] 261.8434 262.6116 201.5471 293.9978
other_parties
none 235.863 218.7895 186.4245 269.923
co applicant 12.5526 10.6977 6.9588 14.7909
guarantor 11.4278 31.1244 6.1638 7.2839
[total] 259.8434 260.6116 199.5471 291.9978
residence_since
mean 2.6862 2.5399 3.5434 2.7831
std. dev. 1.1732 1.0186 0.7654 1.1061
property_magnitude
real estate 69.0217 148.9943 30.8391 37.1449
life insurance 81.2718 54.4192 41.9034 58.4056
car 95.7773 51.1875 60.6462 128.389
no known property 14.7725 7.0107 67.1584 69.0583
[total] 260.8434 261.6116 200.5471 292.9978
age
mean 27.7345 36.1057 43.8079 36.3705
std. dev. 5.7953 10.3158 11.3129 11.5738
other_payment_plans
bank 34.4988 32.0758 33.984 42.4414
stores 10.9742 12.5287 10.4947 17.0024
none 214.3704 216.0071 155.0685 232.554
[total] 259.8434 260.6116 199.5471 291.9978
housing
rent 85.8549 31.7206 15.9015 49.523
own 168.499 226.2291 124.0089 198.2629
for free 5.4895 2.6619 59.6367 44.2118
[total] 259.8434 260.6116 199.5471 291.9978
existing_credits
mean 1.213 1.4137 1.7961 1.3088
std. dev. 0.4142 0.5377 0.7406 0.4734
job
unemp/unskilled non res 11.7711 2.5192 6.8364 4.8733
unskilled resident 52.9713 105.4029 24.5489 21.0769
skilled 188.0096 147.8359 128.9987 169.1558
high qualif/self emp/mgmt 8.0914 5.8537 40.1631 97.8918
[total] 260.8434 261.6116 200.5471 292.9978
num_dependents
mean 1 1.2978 1.3983 1
std. dev. 0.3621 0.4573 0.4895 0.3621
own_telephone
none 219.2961 215.7304 81.1575 83.816
yes 39.5473 43.8813 117.3896 207.1818
[total] 258.8434 259.6116 198.5471 290.9978
foreign_worker
yes 248.5954 234.0215 197.4796 286.9034
no 10.248 25.5901 1.0675 4.0944
[total] 258.8434 259.6116 198.5471 290.9978
Time taken to build model (full training data) : 22.43 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 279 ( 28%)
1 279 ( 28%)
2 194 ( 19%)
3 248 ( 25%)
Log likelihood: -33.06046