DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
DATA WAREHOUSING AND DATA MINING LAB MANUAL

UNIT-1: BUILD A DATA WAREHOUSE AND EXPLORE WEKA

A. Build a Data Warehouse/Data Mart (using open source tools like the Pentaho Data Integration tool and Pentaho Business Analytics, or other data warehouse tools like Microsoft SSIS, Informatica, Business Objects, etc.)

(i). Identify source tables and populate sample data.

Create the Data Warehouse
We create three dimension tables and one fact table in the data warehouse: DimDate, DimCustomer, DimVan and FactHire. We populate the three dimension tables but leave the fact table empty. The following script is used to create and populate the dimension and fact tables.

-- Create data warehouse
create database TopHireDW
go
use TopHireDW
go

-- Create Date dimension
if exists (select * from sys.tables where name = 'DimDate')
  drop table DimDate
go
create table DimDate
( DateKey int not null primary key,
  [Year] varchar(7),
  [Month] varchar(7),
  [Date] date,
  DateString varchar(10))
go

-- Populate Date dimension
truncate table DimDate
go
declare @i int, @Date date, @StartDate date, @EndDate date,
  @DateKey int, @DateString varchar(10), @Year varchar(4),
  @Month varchar(7), @Date1 varchar(20)
set @StartDate = '2006-01-01'
set @EndDate = '2016-12-31'
set @Date = @StartDate

insert into DimDate (DateKey, [Year], [Month], [Date], DateString)
values (0, 'Unknown', 'Unknown', '0001-01-01', 'Unknown') --The unknown row

while @Date <= @EndDate
begin
  set @DateString = convert(varchar(10), @Date, 20)
  set @DateKey = convert(int, replace(@DateString,'-',''))
  set @Year = left(@DateString,4)
  set @Month = left(@DateString, 7)
  insert into DimDate (DateKey, [Year], [Month], [Date], DateString)
  values (@DateKey, @Year, @Month, @Date, @DateString)
  set @Date = dateadd(d, 1, @Date)
end
go
select * from DimDate

-- Create Customer dimension
if exists (select * from sys.tables where name = 'DimCustomer')
  drop table DimCustomer
go
create table DimCustomer
( CustomerKey int not null identity(1,1) primary key,
  CustomerId varchar(20) not null,
  CustomerName varchar(30), DateOfBirth date, Town varchar(50),
  TelephoneNo varchar(30), DrivingLicenceNo varchar(30), Occupation varchar(30)
)
go
insert into DimCustomer (CustomerId, CustomerName, DateOfBirth, Town,
  TelephoneNo, DrivingLicenceNo, Occupation)
select * from HireBase.dbo.Customer

select * from DimCustomer

-- Create Van dimension
if exists (select * from sys.tables where name = 'DimVan')
  drop table DimVan
go
create table DimVan
( VanKey int not null identity(1,1) primary key,
  RegNo varchar(10) not null,
  Make varchar(30), Model varchar(30), [Year] varchar(4),
  Colour varchar(20), CC int, Class varchar(10)
)
go
insert into DimVan (RegNo, Make, Model, [Year], Colour, CC, Class)
select * from HireBase.dbo.Van
go
select * from DimVan

-- Create FactHire fact table
if exists (select * from sys.tables where name = 'FactHire')
  drop table FactHire
go
create table FactHire
( SnapshotDateKey int not null, --Daily periodic snapshot fact table
  HireDateKey int not null, CustomerKey int not null,
  VanKey int not null, --Dimension Keys
  HireId varchar(10) not null, --Degenerate Dimension
  NoOfDays int, VanHire money, SatNavHire money,
  Insurance money, DamageWaiver money, TotalBill money
)
go
select * from FactHire
(ii). Design multi-dimensional data models, namely Star, Snowflake and Fact Constellation schemas, for any one enterprise (e.g. Banking, Insurance, Finance, Healthcare, Manufacturing, Automobile, etc.)

Schema Definition
A multidimensional schema is defined using the Data Mining Query Language (DMQL). Its two primitives, cube definition and dimension definition, can be used for defining data warehouses and data marts.

Star Schema
● Each dimension in a star schema is represented with only one dimension table.
● This dimension table contains the set of attributes.
● The following diagram shows the sales data of a company with respect to four dimensions, namely time, item, branch, and location.
● There is a fact table at the center. It contains the keys to each of the four dimensions.
● The fact table also contains the measures, namely dollars sold and units sold.

Fig: Star schema

Snowflake Schema
● Some dimension tables in the snowflake schema are normalized.
● The normalization splits up the data into additional tables.
● Unlike the star schema, a dimension table in a snowflake schema may be normalized. For example, the item dimension table of the star schema is normalized and split into two dimension tables, namely the item and supplier tables.
● Now the item dimension table contains the attributes item_key, item_name, type, brand, and supplier_key.
● The supplier_key is linked to the supplier dimension table, which contains the attributes supplier_key and supplier_type.

Fig: Snowflake Schema

Fact Constellation Schema
● A fact constellation has multiple fact tables. It is also known as a galaxy schema.
● The following diagram shows two fact tables, namely sales and shipping.
● The sales fact table is the same as that in the star schema.
● The shipping fact table has five dimensions, namely item_key, time_key, shipper_key, from_location and to_location.
● The shipping fact table also contains two measures, namely dollars cost and units shipped.
● It is also possible to share dimension tables between fact tables. For example, the time, item, and location dimension tables are shared between the sales and shipping fact tables.

Fig: Fact constellation schema
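As a concrete instance of the DMQL cube and dimension primitives mentioned above, the star schema in the figure can be written out as follows. This is a sketch following the standard sales example; the exact attribute lists are illustrative rather than prescriptive.

define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)

A snowflake schema differs only in that a dimension definition can itself reference a further dimension table (e.g. item referencing supplier), and a fact constellation simply contains more than one define cube statement sharing dimension definitions.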
(iii). Write ETL scripts and implement them using a data warehouse tool

ETL comes from data warehousing and stands for Extract-Transform-Load. ETL covers the process by which data are loaded from the source system into the data warehouse. Extraction-transformation-loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, and for its cleansing, customization, reformatting, integration, and insertion into a data warehouse.

Building the ETL process is potentially one of the biggest tasks of building a warehouse; it is complex and time consuming, and it consumes most of a data warehouse project's implementation effort, cost, and resources. Building a data warehouse requires focusing closely on understanding three main areas:
1. Source Area - The source area has standard models, such as the entity-relationship diagram.
2. Destination Area - The destination area has standard models, such as the star schema.
3. Mapping Area - The mapping area does not yet have a standard model.

Abbreviations
● ETL - extraction-transformation-loading
● DW - data warehouse
● DM - data mart
● OLAP - on-line analytical processing
● DS - data sources
● ODS - operational data store
● DSA - data staging area
● DBMS - database management system
● OLTP - on-line transaction processing
● CDC - change data capture
● SCD - slowly changing dimension
● FCME - first-class modeling elements
● EMD - entity mapping diagram

ETL Process:

Extract
The Extract step covers the data extraction from the source system and makes it accessible for further processing. The main objective of the extract step is to retrieve all the required data from the source system using as few resources as possible. The extract step should be designed so that it does not negatively affect the source system in terms of performance, response time or any kind of locking.

Transform
The transform step applies a set of rules to transform the data from the source to the target. This includes converting any measured data to the same dimension (i.e. conformed dimension) using the same units so that they can later be joined. The transformation step also requires joining data from several sources, generating aggregates, generating surrogate keys, sorting, deriving new calculated values, and applying advanced validation rules.

Load
During the load step, it is necessary to ensure that the load is performed correctly and with as few resources as possible. The target of the load process is often a database. To make the load process efficient, it is helpful to disable any constraints and indexes before the load and to re-enable them only after the load completes. Referential integrity needs to be maintained by the ETL tool to ensure consistency.
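As a concrete illustration of the three steps, the sketch below extracts hire transactions from the HireBase source of part (i), transforms the hire date into the DateKey surrogate used by DimDate (the same yyyy-MM-dd-with-dashes-removed rule as the DimDate script), and loads the result into FactHire. The column names of the source Hire table (HireId, HireDate, CustomerId, RegNo and the charge columns) are assumptions for illustration, since the manual does not list the source schema; the Microsoft SQL Server JDBC driver is assumed to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class FactHireLoad {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:sqlserver://localhost;databaseName=TopHireDW;integratedSecurity=true";
        try (Connection con = DriverManager.getConnection(url);
             Statement st = con.createStatement()) {
            // Extract, transform and load in a single INSERT ... SELECT.
            // The date keys are derived exactly as in the DimDate population script.
            String sql =
                "insert into FactHire (SnapshotDateKey, HireDateKey, CustomerKey, VanKey, " +
                "  HireId, NoOfDays, VanHire, SatNavHire, Insurance, DamageWaiver, TotalBill) " +
                "select convert(int, replace(convert(varchar(10), getdate(), 20), '-', '')), " +
                "       convert(int, replace(convert(varchar(10), h.HireDate, 20), '-', '')), " +
                "       c.CustomerKey, v.VanKey, h.HireId, h.NoOfDays, " +
                "       h.VanHire, h.SatNavHire, h.Insurance, h.DamageWaiver, h.TotalBill " +
                "from HireBase.dbo.Hire h " +
                "join DimCustomer c on c.CustomerId = h.CustomerId " +
                "join DimVan v on v.RegNo = h.RegNo";
            int rows = st.executeUpdate(sql);
            System.out.println(rows + " fact rows loaded");
        }
    }
}

In a tool such as Pentaho Data Integration or SSIS the same logic would be expressed as a source step, a lookup step per dimension, and a table-output step.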
(iv). Perform various OLAP operations such as slice, dice, roll-up, drill-down and pivot.

Ans: OLAP OPERATIONS
An Online Analytical Processing (OLAP) server is based on the multidimensional data model. It allows managers and analysts to get an insight into the information through fast, consistent, and interactive access to it. OLAP operations work on multidimensional data. Here is the list of OLAP operations:
● Roll-up
● Drill-down
● Slice and dice
● Pivot (rotate)

Roll-up
The following diagram illustrates how roll-up works.

Fig: Roll-up

● Roll-up is performed by climbing up a concept hierarchy for the dimension location.
● Initially the concept hierarchy was "street < city < province < country".
● On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.
● The data is grouped into countries rather than cities.
● When roll-up is performed, one or more dimensions from the data cube are removed.

Drill-down
Drill-down is the reverse operation of roll-up. It is performed in either of the following ways:
● By stepping down a concept hierarchy for a dimension
● By introducing a new dimension

The following diagram illustrates how drill-down works.

Fig: Drill-down

● Drill-down is performed by stepping down a concept hierarchy for the dimension time.
● Initially the concept hierarchy was "day < month < quarter < year".
● On drilling down, the time dimension is descended from the level of quarter to the level of month.
● When drill-down is performed, one or more dimensions are added to the data cube.
● It navigates from less detailed data to more detailed data.

Slice
The slice operation selects one particular dimension from a given cube and provides a new sub-cube. Consider the following diagram, which shows how slice works.

Fig: Slice

● Here slice is performed for the dimension "time" using the criterion time = "Q1".
● It forms a new sub-cube by selecting one or more dimensions.

Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube. Consider the following diagram, which shows the dice operation.

Fig: Dice

The dice operation on the cube, based on the following selection criteria, involves three dimensions:
● (location = "Toronto" or "Vancouver")
● (time = "Q1" or "Q2")
● (item = "Mobile" or "Modem")

Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an alternative presentation of the data. Consider the following diagram, which shows the pivot operation.

Fig: Pivot
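On a relational star schema these operations correspond to ordinary SQL over the fact and dimension tables: roll-up is a GROUP BY at a coarser hierarchy level, and slice is a WHERE clause fixing one dimension member. The sketch below runs both against the TopHireDW schema built in part (i); the JDBC connection string is an assumption, as before.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class OlapQueries {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:sqlserver://localhost;databaseName=TopHireDW;integratedSecurity=true";
        // Roll-up: climb the date hierarchy from individual days to months.
        String rollUp =
            "select d.[Month], sum(f.TotalBill) as TotalBill " +
            "from FactHire f join DimDate d on d.DateKey = f.HireDateKey " +
            "group by d.[Month] order by d.[Month]";
        // Slice: fix one member of the date dimension (here the year 2015).
        String slice =
            "select d.[Month], sum(f.TotalBill) as TotalBill " +
            "from FactHire f join DimDate d on d.DateKey = f.HireDateKey " +
            "where d.[Year] = '2015' group by d.[Month]";
        try (Connection con = DriverManager.getConnection(url);
             Statement st = con.createStatement()) {
            for (String sql : new String[] { rollUp, slice }) {
                try (ResultSet rs = st.executeQuery(sql)) {
                    while (rs.next())
                        System.out.println(rs.getString(1) + "\t" + rs.getBigDecimal(2));
                }
            }
        }
    }
}

A dice adds further predicates on other dimensions (e.g. restricting DimVan.Class as well), and a pivot is purely a change of presentation of the same result set.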
A.(v). Explore visualization features of the tool for analysis, like identifying trends, etc.

Ans: Visualization Features:
WEKA's visualization allows you to visualize a 2-D plot of the current working relation. Visualization is very useful in practice; it helps to determine the difficulty of the learning problem. WEKA can visualize single attributes (1-d) and pairs of attributes (2-d), and rotate 3-d visualizations (Xgobi-style). WEKA has a "Jitter" option to deal with nominal attributes and to detect "hidden" data points.

Access to visualization from the classifier, cluster and attribute selection panels is available from a popup menu. Click the right mouse button over an entry in the result list to bring up the menu. You will be presented with options for viewing or saving the text output and --- depending on the scheme --- further options for visualizing errors, clusters, trees etc.

To open the Visualization screen, click the 'Visualize' tab. Select a square that corresponds to the attributes you would like to visualize. For example, let's choose 'outlook' for the X-axis and 'play' for the Y-axis. Click anywhere inside the square that corresponds to 'play' on the left and 'outlook' at the top.

Changing the View:
In the visualization window, beneath the X-axis selector there is a drop-down list, 'Colour', for choosing the color scheme. This allows you to choose the color of points based on the attribute selected. Below the plot area there is a legend that describes what values the colors correspond to. In this example, red represents 'no', while blue represents 'yes'. For better visibility you can change the color of the label 'yes': left-click on 'yes' in the 'Class colour' box and select a lighter color from the color palette.

To the right of the plot area there is a series of horizontal strips. Each strip represents an attribute, and the dots within it show the distribution of values of the attribute. You can choose which axes are used in the main graph by clicking on these strips (left-click changes the X-axis, right-click changes the Y-axis).

The software sets the X-axis to the 'outlook' attribute and the Y-axis to 'play'. The instances are spread out in the plot area and concentration points are not visible. Keep sliding 'Jitter', a random displacement given to all points in the plot, to the right until you can spot concentration points. The results are shown below, but on this screen we changed 'Colour' to 'temperature'. Besides 'outlook' and 'play', this allows you to see the 'temperature' corresponding to the 'outlook'. It affects your interpretation: if you see 'outlook' = 'sunny' and 'play' = 'no', then to explain the result you need to look at the 'temperature', since if it is too hot you do not want to play. Change 'Colour' to 'windy' and you can see that if it is windy, you do not want to play either.

Selecting Instances
Sometimes it is helpful to select a subset of the data using the visualization tool. A special case is the 'UserClassifier', which lets you build your own classifier by interactively selecting instances. Below the Y-axis there is a drop-down list that allows you to choose a selection method. A group of points on the graph can be selected in four ways [2]:

1. Select Instance. Click on an individual data point. It brings up a window listing the attributes of the point. If more than one point appears at the same location, more than one set of attributes is shown.

2. Rectangle. You can select points by dragging a rectangle around them.

3. Polygon. You can select several points by building a free-form polygon. Left-click on the graph to add vertices to the polygon and right-click to complete it.

4. Polyline. To distinguish the points on one side from those on the other, you can build a polyline. Left-click on the graph to add vertices to the polyline and right-click to finish.

B. Explore the WEKA Data Mining/Machine Learning Toolkit

(i). Downloading and/or installation of the WEKA data mining toolkit.

The operating system used here is Ubuntu.
Step 1: Go to the Applications menu -> Ubuntu Software Center -> search for Weka and click the Install button.

Fig: Ubuntu Software Center
Fig: Weka software availability search in Ubuntu Software Center

The Weka installation process asks for authentication by the administrator.

Fig: Authentication provision window in installation process

The Weka software is installed within a few minutes.

Fig: Weka installation status confirmation window

Step 2: Launching the WEKA GUI tool. The following navigation launches the Weka tool: Applications menu -> Science -> weka.

Fig: Weka GUI launching window

(ii). Understand the features of the WEKA toolkit, such as the Explorer, Knowledge Flow interface, Experimenter, and command-line interface.

Introduction: Weka is a workbench that contains a collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to these functions. Advantages of Weka include:
● Free availability under the GNU General Public License.
● Portability, since it is fully implemented in the Java programming language and thus runs on almost any modern computing platform.
● A comprehensive collection of data preprocessing and modeling techniques.
● Ease of use due to its graphical user interfaces.

Description:
Open the program. Once the program has been loaded on the user's machine, it is opened by navigating to the program's start option; this will depend on the user's operating system. Figure 1.1 is an example of the initial opening screen on a computer. There are four options available on this initial screen:

Fig: Weka GUI

1. Explorer - the graphical interface used to conduct experimentation on raw data. After clicking the Explorer button, the Weka Explorer interface appears. Inside the Weka Explorer window there are six tabs:

1. Preprocess - used to choose the data file to be used by the application.
● Open File - allows the user to select files residing on the local machine or recorded medium.
● Open URL - provides a mechanism to locate a file or data source from a different location specified by the user.
● Open Database - allows the user to retrieve files or data from a database source provided by the user.

Fig: Preprocessor Environment
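Because Weka is itself a Java library, every Preprocess action has a programmatic equivalent. The minimal sketch below loads a dataset, marks the last attribute as the class, and prints a header summary; the file path is an assumption.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadDataset {
    public static void main(String[] args) throws Exception {
        // DataSource picks a converter from the file extension (ARFF, CSV, ...).
        DataSource source = new DataSource("/path/to/weather.nominal.arff");
        Instances data = source.getDataSet();
        // WEKA does not assume a class attribute; set it explicitly (last one here).
        if (data.classIndex() == -1)
            data.setClassIndex(data.numAttributes() - 1);
        System.out.println("Relation:   " + data.relationName());
        System.out.println("Instances:  " + data.numInstances());
        System.out.println("Attributes: " + data.numAttributes());
    }
}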
2. Classify - used to test and train different learning schemes on the preprocessed data file under experimentation. Again there are several options to be selected inside the Classify tab. The test options give the user the choice of four different test mode scenarios on the data set:
1. Use training set
2. Supplied test set
3. Cross-validation
4. Percentage split

Fig: Classification of data and its options in Weka

3. Cluster - used to apply different tools that identify clusters within the data file. The Cluster tab opens the process that is used to identify commonalities or clusters of occurrences within the data set and produce information for the user to analyze.

Fig: Cluster analysis options and using the training data set in the Weka tool

4. Associate - used to apply different rules to the data file that identify associations within the data. The Associate tab opens a window to select the options for associations within the data set.

Fig: Choosing the Apriori associator

5. Select attributes - used to apply different rules to reveal changes based on selected attributes' inclusion in or exclusion from the experiment.

Fig: Selection of attributes in the Weka tool

6. Visualize - used to see what the various manipulations produced on the data set, in a 2-D format, as scatter plot and bar graph output.

Fig: Visualization in the Weka tool

2. Experimenter - this option allows users to conduct different experimental variations on data sets and perform statistical manipulation. The Weka Experiment Environment enables the user to create, run, modify, and analyze experiments in a more convenient manner than is possible when processing the schemes individually. For example, the user can create an experiment that runs several schemes against a series of datasets and then analyze the results to determine if one of the schemes is (statistically) better than the other schemes.
● Results destination: ARFF file, CSV file, JDBC database.
● Experiment type: cross-validation (default), train/test percentage split (data randomized).
● Iteration control: number of repetitions, data sets first/algorithms first.
● Algorithms: filters.

3. Knowledge Flow - basically the same functionality as the Explorer, with drag-and-drop functionality. The advantage of this option is that it supports incremental learning from previous results.

4. Simple CLI - provides users without a graphical interface option the ability to execute commands from a terminal window.

(iv). Study the ARFF file format

An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of instances sharing a set of attributes. ARFF files were developed by the Machine Learning Project at the Department of Computer Science of The University of Waikato for use with the Weka machine learning software. In WEKA, each data entry is an instance of the Java class weka.core.Instance, and each instance consists of a number of attribute values. For loading datasets, WEKA can read ARFF files. The Attribute-Relation File Format has two sections:
1) The Header section defines the relation (dataset) name and the attribute names and types.
2) The Data section lists the data instances.

Fig: ARFF file

The figure above, taken from the German credit data, shows an ARFF file. Lines beginning with a % sign are comments, and there are three basic keywords:
● "@relation" in the Header section, followed by the relation name.
● "@attribute" in the Header section, followed by attribute names and their types (or sizes).
● "@data" in the Data section, followed by the list of data instances.

The external representation of an Instances class consists of:
- A header: describes the attribute types.
- A Data section: a comma-separated list of data.
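For a concrete picture, here is the classic weather dataset that ships with WEKA, written out in ARFF and abridged to five of its fourteen data rows:

% ARFF header: relation name, then one @attribute line per column
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

% Data section: one comma-separated instance per line
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes

Nominal attributes enumerate their legal values in braces; numeric attributes are declared with the keyword numeric; a missing value is written as a question mark.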
(v). Explore the available datasets in the Weka tool.

Click the "Open file..." button to open a data set and double-click on the "data" directory. Weka provides a number of small common machine learning datasets that you can use to practice on.

Fig: Sample data sets provided by the Weka tool

(vi). Load a data set (e.g. weather data set, credit data set).

(vii). Load a data set and observe the following:

a. List the attribute names and their types.

Title: German Credit data

Attribute Name                          Type of Attribute
Status of existing checking account     qualitative
Duration in month                       numerical
Credit history                          qualitative
Purpose                                 qualitative
Credit amount                           numerical
Savings account/bonds                   qualitative
Present employment since                qualitative

b. The number of records in the data set is 1000.

c. The class attribute is the credit classification (good/bad).

d. Summary:
Correctly Classified Instances      855      85.5 %
Incorrectly Classified Instances    145      14.5 %
Total Number of Instances          1000

UNIT-2: Perform data pre-processing tasks and demonstrate performing association rule mining on data sets

A. Explore the various options available in Weka for preprocessing data and apply them (like the Discretization filter, Resample filter, etc.) on each dataset.

The navigation flow for preprocessing a data set without any filter in the Weka tool is as follows:
GUI WEKA launcher -> Explorer -> Preprocess -> Open file (browse system for data set) -> select ALL attributes.
A histogram which represents all attributes appears in the visualize area.

Fig: Preprocessing the data without any filter

The navigation flow for preprocessing a data set with the Discretize filter in the Weka tool is as follows:
GUI WEKA launcher -> Explorer -> Preprocess -> Open file (browse system for data set) -> Choose -> weka -> filters -> supervised -> attribute -> Discretize -> ALL attributes.
A histogram which represents credit history appears in the visualize area.

Fig: Preprocess with discretization

The navigation flow for preprocessing a data set with the Resample filter in the Weka tool is as follows:
GUI WEKA launcher -> Explorer -> Preprocess -> Open file (browse system for data set) -> Choose -> weka -> filters -> supervised -> instance -> Resample -> select Credit history.
A histogram which represents credit history appears in the visualize area.

Fig: Preprocess of data with Resample
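The same two filters can be applied programmatically. The sketch below runs the supervised Discretize and Resample filters over a loaded dataset; the ARFF path is an assumption and the filter defaults are otherwise left unchanged.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.Discretize;
import weka.filters.supervised.instance.Resample;

public class PreprocessDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("/path/to/credit-g.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // supervised filters need a class

        // Supervised discretization of all numeric attributes.
        Discretize disc = new Discretize();
        disc.setInputFormat(data);
        Instances discretized = Filter.useFilter(data, disc);

        // Resample: random subsample with replacement, here 50% of the original size.
        Resample res = new Resample();
        res.setSampleSizePercent(50);
        res.setInputFormat(discretized);
        Instances sampled = Filter.useFilter(discretized, res);

        System.out.println("Original:    " + data.numInstances() + " instances");
        System.out.println("Discretized: " + discretized.numAttributes() + " attributes");
        System.out.println("Sampled:     " + sampled.numInstances() + " instances");
    }
}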
B. Load a data set into Weka and run the Apriori algorithm with different support and confidence values. Study the rules generated.

The navigation flow for applying the Apriori algorithm on a data set in the Weka tool is as follows:
GUI WEKA launcher -> Explorer -> Open file (browse system for data set) -> Associate -> Choose -> weka -> associations -> Apriori -> Start.
The best rules found for the chosen minimum support and confidence values are displayed in the Associator output area.

Fig: Apriori association
Fig: Apriori association, continuation
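Varying the support and confidence values, as the exercise asks, is easy to script with the API. A sketch, assuming a fully nominal dataset such as weather.nominal.arff (Apriori rejects numeric attributes, so a dataset like the German credit data would need discretizing first):

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("/path/to/weather.nominal.arff").getDataSet();
        Apriori apriori = new Apriori();
        apriori.setLowerBoundMinSupport(0.3); // minimum support 30%
        apriori.setMinMetric(0.9);            // minimum confidence 90%
        apriori.setNumRules(10);              // report the 10 best rules
        apriori.buildAssociations(data);
        System.out.println(apriori);          // prints the generated rules
    }
}

Lowering the minimum support admits rarer itemsets and usually produces more rules; raising the minimum confidence keeps only the more reliable ones.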
UNIT-3: Demonstrate performing classification on data sets

A. Load a dataset into Weka and run the J48 classification algorithm. Study the classifier output. Compute entropy values and the Kappa statistic.

The navigation flow for classification of a data set with the J48 classifier in the Weka tool is as follows:
GUI WEKA launcher -> Explorer -> Open file (browse system for data set) -> Classify -> Choose -> weka -> trees -> J48.

Fig: Classification of a dataset using classifier J48

B. Extract if-then rules from the decision tree generated by the classifier; observe the confusion matrix and derive the accuracy, F-measure, TP rate, FP rate, precision and recall values. Apply a cross-validation strategy with various fold values and compare the accuracy results.

Fig: Classification of a dataset using classifier J48, continuation

C. Load a dataset into Weka and run the Naive Bayes classification algorithm. Study the classifier output.

The navigation flow for classification of a data set with the Naive Bayes classifier in the Weka tool is as follows:
GUI WEKA launcher -> Explorer -> Open file (browse system for data set) -> Classify -> Choose -> weka -> bayes -> NaiveBayes.

Fig: Classification using the Naive Bayes classifier
Fig: Classification using the Naive Bayes classifier, continuation
Fig: Classification using the Naive Bayes classifier, continuation

D. Plot ROC curves.

Fig: Classifier run on the German credit dataset
Fig: ROC curve for the German credit dataset (visualize threshold curves)
Fig: No credits or all paid
Fig: All paid
Fig: Existing paid
Fig: Delayed previously
Fig: Critical or other existing credit

E. Compare the classification results of the ID3, J48, Naive Bayes and k-NN classifiers for each dataset, deduce which classifier performs best and worst for each dataset, and justify.

Ans: Steps to run the ID3, J48, Naive Bayes and k-NN classification algorithms in WEKA:
1. Open the WEKA tool.
2. Click on WEKA Explorer.
3. Click on the Preprocess tab.
4. Click on the Open file button.
5. Choose the WEKA folder in the C drive.
6. Select and click on the data option button.
7. Choose the iris data set and open the file.
8. Click on the Classify tab, choose the J48 algorithm and select the "use training set" test option.
9. Click on the Start button.
10. Click on the Classify tab, choose the ID3 algorithm and select the "use training set" test option.
11. Click on the Start button.
12. Click on the Classify tab, choose the Naive Bayes algorithm and select the "use training set" test option.
13. Click on the Start button.
14. Click on the Classify tab, choose the k-nearest neighbour algorithm and select the "use training set" test option.
15. Click on the Start button.

J48:
=== Run information ===

Scheme:     weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:   iris
Instances:  150
Attributes: 5
            sepallength
            sepalwidth
            petallength
            petalwidth
            class
Test mode:  evaluate on training data

=== Classifier model (full training set) ===

J48 pruned tree
------------------

petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
|   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)

Number of Leaves  : 5
Size of the tree  : 9

Time taken to build model: 0 seconds

=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances      147      98   %
Incorrectly Classified Instances      3       2   %
Kappa statistic                       0.97
Mean absolute error                   0.0233
Root mean squared error               0.108
Relative absolute error               5.2482 %
Root relative squared error          22.9089 %
Total Number of Instances           150

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                 1        0        1          1       1          1         Iris-setosa
                 0.98     0.02     0.961      0.98    0.97       0.99      Iris-versicolor
                 0.96     0.01     0.98       0.96    0.97       0.99      Iris-virginica
Weighted Avg.    0.98     0.01     0.98       0.98    0.98       0.993

=== Confusion Matrix ===

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 49  1 |  b = Iris-versicolor
  1  2 48 |  c = Iris-virginica
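Every figure in the J48 output above, and the cross-validation comparison that part B asks for, can be reproduced with the Evaluation class. A minimal sketch, assuming iris.arff:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Metrics {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("/path/to/iris.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        J48 tree = new J48();
        Evaluation eval = new Evaluation(data);
        // 10-fold cross-validation; the classifier is rebuilt for each fold.
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());      // accuracy, kappa, errors
        System.out.println(eval.toClassDetailsString()); // TP/FP rate, precision, recall, F, ROC
        System.out.println(eval.toMatrixString());       // confusion matrix
        // Per-class figures can also be read individually, e.g. for class index 0:
        System.out.printf("precision=%.3f recall=%.3f F=%.3f AUC=%.3f%n",
            eval.precision(0), eval.recall(0), eval.fMeasure(0), eval.areaUnderROC(0));
    }
}

For part D, weka.classifiers.evaluation.ThresholdCurve can turn eval.predictions() into the same threshold/ROC data that the GUI's "visualize threshold curve" popup plots.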
Naive Bayes:
=== Run information ===

Scheme:     weka.classifiers.bayes.NaiveBayes
Relation:   iris
Instances:  150
Attributes: 5
            sepallength
            sepalwidth
            petallength
            petalwidth
            class
Test mode:  evaluate on training data

=== Classifier model (full training set) ===

Naive Bayes Classifier

                    Class
Attribute     Iris-setosa  Iris-versicolor  Iris-virginica
                   (0.33)           (0.33)          (0.33)
===========================================================
sepallength
  mean             4.9913           5.9379          6.5795
  std. dev.        0.355            0.5042          0.6353
  weight sum      50               50              50
  precision        0.1059           0.1059          0.1059

sepalwidth
  mean             3.4015           2.7687          2.9629
  std. dev.        0.3925           0.3038          0.3088
  weight sum      50               50              50
  precision        0.1091           0.1091          0.1091

petallength
  mean             1.4694           4.2452          5.5516
  std. dev.        0.1782           0.4712          0.5529
  weight sum      50               50              50
  precision        0.1405           0.1405          0.1405

petalwidth
  mean             0.2743           1.3097          2.0343
  std. dev.        0.1096           0.1915          0.2646
  weight sum      50               50              50
  precision        0.1143           0.1143          0.1143

Time taken to build model: 0 seconds

=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances      144      96   %
Incorrectly Classified Instances      6       4   %
Kappa statistic                       0.94
Mean absolute error                   0.0324
Root mean squared error               0.1495
Relative absolute error               7.2883 %
Root relative squared error          31.7089 %
Total Number of Instances           150

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                 1        0        1          1       1          1         Iris-setosa
                 0.96     0.04     0.923      0.96    0.941      0.993     Iris-versicolor
                 0.92     0.02     0.958      0.92    0.939      0.993     Iris-virginica
Weighted Avg.    0.96     0.02     0.96       0.96    0.96       0.995

=== Confusion Matrix ===

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 48  2 |  b = Iris-versicolor
  1  4 46 |  c = Iris-virginica

K-Nearest Neighbour (IBk):
=== Run information ===

Scheme:     weka.classifiers.lazy.IBk -K 1 -W 0 -A "weka.core.neighboursearch.LinearNNSearch -A \"weka.core.EuclideanDistance -R first-last\""
Relation:   iris
Instances:  150
Attributes: 5
            sepallength
            sepalwidth
            petallength
            petalwidth
            class
Test mode:  evaluate on training data

=== Classifier model (full training set) ===

IB1 instance-based classifier
using 1 nearest neighbour(s) for classification

Time taken to build model: 0 seconds

=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances      150     100   %
Incorrectly Classified Instances      0       0   %
Kappa statistic                       1
Mean absolute error                   0.0085
Root mean squared error               0.0091
Relative absolute error               1.9219 %
Root relative squared error           1.9335 %
Total Number of Instances           150

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                 1        0        1          1       1          1         Iris-setosa
                 1        0        1          1       1          1         Iris-versicolor
                 1        0        1          1       1          1         Iris-virginica
Weighted Avg.    1        0        1          1       1          1

=== Confusion Matrix ===

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 50  0 |  b = Iris-versicolor
  0  0 50 |  c = Iris-virginica

UNIT-4: DEMONSTRATE PERFORMING CLUSTERING ON DATA SETS (CLUSTERING TAB)

A. Load each dataset into Weka and run the simple k-means clustering algorithm with different values of k (the number of desired clusters). Study the clusters formed. Observe the sum of squared errors and the centroids, and derive insights.

Ans: Steps to run the k-means clustering algorithm in WEKA (a sketch for sweeping over k appears after the output below):
1. Open the WEKA tool.
2. Click on WEKA Explorer.
3. Click on the Preprocess tab.
4. Click on the Open file button.
5. Choose the WEKA folder in the C drive.
6. Select and click on the data option button.
7. Choose the iris data set and open the file.
8. Click on the Cluster tab, choose k-means and select the "use training set" test option.
9. Click on the Start button.
Output:
=== Run information ===

Scheme:     weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation:   iris
Instances:  150
Attributes: 5
            sepallength
            sepalwidth
            petallength
            petalwidth
            class
Test mode:  evaluate on training data

=== Model and evaluation on training set ===

kMeans
======

Number of iterations: 7
Within cluster sum of squared errors: 62.1436882815797
Missing values globally replaced with mean/mode

Cluster centroids:
                              Cluster#
Attribute      Full Data              0                1
                   (150)          (100)             (50)
=========================================================
sepallength       5.8433          6.262            5.006
sepalwidth        3.054           2.872            3.418
petallength       3.7587          4.906            1.464
petalwidth        1.1987          1.676            0.244
class        Iris-setosa  Iris-versicolor    Iris-setosa

Time taken to build model (full training data): 0 seconds

=== Model and evaluation on training set ===

Clustered Instances

1    100 ( 67%)
2     50 ( 33%)
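To study how the within-cluster sum of squared errors behaves for different k, the run above can be repeated programmatically. A sketch, assuming iris.arff; the class attribute is removed first so that it does not influence the distance computation:

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class KMeansSweep {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("/path/to/iris.arff").getDataSet();
        // Drop the class attribute (the last one).
        Remove rm = new Remove();
        rm.setAttributeIndices("" + data.numAttributes());
        rm.setInputFormat(data);
        Instances noClass = Filter.useFilter(data, rm);

        for (int k = 2; k <= 5; k++) {
            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(k);
            km.setSeed(10); // same seed as the -S 10 run above
            km.buildClusterer(noClass);
            System.out.printf("k=%d  within-cluster SSE=%.4f%n", k, km.getSquaredError());
            System.out.println(km.getClusterCentroids());
        }
    }
}

The SSE always decreases as k grows, so the insight comes from looking for the k after which the decrease flattens out, rather than from minimizing SSE outright.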
B. Explore other clustering techniques available in Weka.

Ans: The Choose button of the Cluster tab lists the other clustering algorithms and techniques available in WEKA, as shown below.

Fig: Clustering algorithms and techniques available in Weka

C. Explore visualization features of Weka to visualize the clusters. Derive interesting insights and explain.

Ans: Visualization Features
WEKA's visualization allows you to visualize a 2-D plot of the current working relation. Visualization is very useful in practice; it helps to determine the difficulty of the learning problem. WEKA can visualize single attributes (1-d) and pairs of attributes (2-d), and rotate 3-d visualizations (Xgobi-style). WEKA has a "Jitter" option to deal with nominal attributes and to detect "hidden" data points.

Access to visualization from the classifier, cluster and attribute selection panels is available from a popup menu. Click the right mouse button over an entry in the result list to bring up the menu. You will be presented with options for viewing or saving the text output and --- depending on the scheme --- further options for visualizing errors, clusters, trees etc. To open the visualization screen, click the 'Visualize' tab.

Select a square that corresponds to the attributes you would like to visualize. For example, let's choose 'outlook' for the X-axis and 'play' for the Y-axis. Click anywhere inside the square that corresponds to 'play' on the left and 'outlook' at the top.

Changing the View:
In the visualization window, beneath the X-axis selector there is a drop-down list, 'Colour', for choosing the color scheme. This allows you to choose the color of points based on the attribute selected. Below the plot area, there is a legend that describes what values the colors correspond to. In this example, red represents 'no', while blue represents 'yes'. For better visibility you can change the color of the label 'yes': left-click on 'yes' in the 'Class colour' box and select a lighter color from the color palette.

Selecting Instances
Sometimes it is helpful to select a subset of the data using the visualization tool. A special case is the 'UserClassifier', which lets you build your own classifier by interactively selecting instances. Below the Y-axis there is a drop-down list that allows you to choose a selection method. A group of points on the graph can be selected in four ways [2]:

1. Select Instance. Click on an individual data point. It brings up a window listing the attributes of the point. If more than one point appears at the same location, more than one set of attributes is shown.

2. Rectangle. You can select points by dragging a rectangle around them.

3. Polygon. You can select several points by building a free-form polygon. Left-click on the graph to add vertices to the polygon and right-click to complete it.

4. Polyline. To distinguish the points on one side from those on the other, you can build a polyline. Left-click on the graph to add vertices to the polyline and right-click to finish.

UNIT-5: Demonstrate performing regression on data sets

A. Load each dataset into Weka and build a Linear Regression model. Study the model formed. Use the training set option. Interpret the regression model and derive patterns and conclusions from the regression results.

Ans: Steps to run the Linear Regression algorithm in WEKA (a Java-API equivalent appears after the output below):
1. Open the WEKA tool.
2. Click on WEKA Explorer.
3. Click on the Preprocess tab.
4. Click on the Open file button.
5. Choose the WEKA folder in the C drive.
6. Select and click on the data option button.
7. Choose the labor data set and open the file.
8. Click on the Classify tab, click the Choose button, then expand the functions branch.
9. Select the LinearRegression leaf and select the "use training set" test option.
10. Click on the Start button.

Output:
=== Run information ===

Scheme:     weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation:   labor-neg-data
Instances:  57
Attributes: 17
            duration
            wage-increase-first-year
            wage-increase-second-year
            wage-increase-third-year
            cost-of-living-adjustment
            working-hours
            pension
            standby-pay
            shift-differential
            education-allowance
            statutory-holidays
            vacation
            longterm-disability-assistance
            contribution-to-dental-plan
            bereavement-assistance
            contribution-to-health-plan
            class
Test mode:  10-fold cross-validation

=== Classifier model (full training set) ===

Linear Regression Model
(The fitted equation appears in the screenshot; the same full-training-set model is reproduced in part B below.)
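A minimal API version of the same run, assuming labor.arff. Since the output above regresses duration, the class index is set to that attribute; LinearRegression binarizes nominal attributes and replaces missing values internally, which is why terms like pension=none,empl_contr appear in the fitted equation.

import weka.classifiers.functions.LinearRegression;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LinearRegressionDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("/path/to/labor.arff").getDataSet();
        data.setClassIndex(0); // 'duration' is the numeric target in this run
        LinearRegression lr = new LinearRegression();
        lr.buildClassifier(data);
        System.out.println(lr); // prints the fitted regression equation
        // Predict the target value of the first instance with the fitted model.
        double yHat = lr.classifyInstance(data.instance(0));
        System.out.println("prediction for first instance: " + yHat);
    }
}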
B. Use the cross-validation and percentage split options and repeat running the Linear Regression model. Observe the results and derive meaningful conclusions.

Ans: Steps to run the Linear Regression algorithm in WEKA:
1. Open the WEKA tool.
2. Click on WEKA Explorer.
3. Click on the Preprocess tab.
4. Click on the Open file button.
5. Choose the WEKA folder in the C drive.
6. Select and click on the data option button.
7. Choose the labor data set and open the file.
8. Click on the Classify tab, click the Choose button, then expand the functions branch.
9. Select the LinearRegression leaf and select the cross-validation test option.
10. Click on the Start button.
11. Select the LinearRegression leaf and select the percentage split test option.
12. Click on the Start button.

Output: cross-validation

=== Run information ===

Scheme:     weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation:   labor-neg-data
Instances:  57
Attributes: 17 (the same attribute list as in part A above)
Test mode:  10-fold cross-validation

=== Classifier model (full training set) ===

Linear Regression Model

duration =
    0.4689 * cost-of-living-adjustment=tc,tcf +
    0.6523 * pension=none,empl_contr +
    1.0321 * bereavement-assistance=yes +
    0.3904 * contribution-to-health-plan=full +
    0.2765

Time taken to build model: 0.02 seconds

=== Cross-validation ===
=== Summary ===

Correlation coefficient               0.1967
Mean absolute error                   0.6499
Root mean squared error               0.777
Relative absolute error             111.6598 %
Root relative squared error         108.8152 %
Total Number of Instances            56
Ignored Class Unknown Instances       1

Output: percentage split

=== Run information ===

Scheme:     weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation:   labor-neg-data
Instances:  57
Attributes: 17 (the same attribute list as in part A above)
Test mode:  split 66.0% train, remainder test

=== Classifier model (full training set) ===

Linear Regression Model

duration =
    0.4689 * cost-of-living-adjustment=tc,tcf +
    0.6523 * pension=none,empl_contr +
    1.0321 * bereavement-assistance=yes +
    0.3904 * contribution-to-health-plan=full +
    0.2765

Time taken to build model: 0.02 seconds

=== Evaluation on test split ===
=== Summary ===

Correlation coefficient               0.243
Mean absolute error                   0.783
Root mean squared error               0.9496
Relative absolute error             106.8823 %
Root relative squared error         114.13   %
Total Number of Instances            19

C. Explore the simple linear regression technique, which only looks at one variable.

Ans: Steps to run the Simple Linear Regression algorithm in WEKA (a sketch follows this list):
1. Open the WEKA tool.
2. Click on WEKA Explorer.
3. Click on the Preprocess tab.
4. Click on the Open file button.
5. Choose the WEKA folder in the C drive.
6. Select and click on the data option button.
7. Choose the labor data set and open the file.
8. Click on the Classify tab, click the Choose button, then expand the functions branch.
9. Select the SimpleLinearRegression leaf and select the cross-validation test option.
10. Click on the Start button.
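A minimal sketch of the same experiment through the API. SimpleLinearRegression fits a model on a single attribute, the one giving the lowest squared error; note that it may refuse datasets with nominal attributes such as labor, so this sketch assumes the all-numeric cpu.arff that ships with WEKA. The cross-validated error can be compared directly with the multi-attribute model above.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SimpleLinearRegression;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SimpleLRDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("/path/to/cpu.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // numeric 'class' target
        SimpleLinearRegression slr = new SimpleLinearRegression();
        slr.buildClassifier(data);
        System.out.println(slr); // prints "class = a * <chosen attribute> + b"
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new SimpleLinearRegression(), data, 10, new Random(1));
        System.out.printf("correlation=%.4f MAE=%.4f RMSE=%.4f%n",
            eval.correlationCoefficient(), eval.meanAbsoluteError(),
            eval.rootMeanSquaredError());
    }
}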
Data Mining Lab: CREDIT RISK ASSESSMENT

The business of banks is making loans. Assessing the creditworthiness of an applicant is of crucial importance. We have to develop a system to help a loan officer decide whether the credit of a customer is good or bad. A bank's business rules regarding loans must consider two opposing factors. On the one hand, a bank wants to make as many loans as possible; interest on these loans is the bank's profit source. On the other hand, a bank cannot afford to make too many bad loans; too many bad loans could lead to the collapse of the bank. The bank's loan policy must involve a compromise: not too strict, and not too lenient.

● Credit risk is an investor's risk of loss arising from a borrower who does not make payments as promised. Such an event is called a default. Other terms for credit risk are default risk and counterparty risk.
● Credit risk is most simply defined as the potential that a bank borrower or counterparty will fail to meet its obligations in accordance with agreed terms.
● The goal of credit risk management is to maximize a bank's risk-adjusted rate of return by maintaining credit risk exposure within acceptable parameters.
● Banks need to manage the credit risk inherent in the entire portfolio as well as the risk in individual credits or transactions.
● Banks should also consider the relationships between credit risk and other risks.
● The effective management of credit risk is a critical component of a comprehensive approach to risk management and essential to the long-term success of any banking organization.
● A good credit assessment means you should be able to qualify, within the limits of your income, for most loans.

The exercises below use the German credit data.

Week 1
1. List all the categorical (or nominal) attributes and the real-valued attributes separately.

From the German Credit Assessment case study given to us, the following attributes are found to be applicable for credit-risk assessment; each is either categorical/nominal (taking values such as true/false) or real-valued:
1. checking_status
2. duration
3. credit history
4. purpose
5. credit amount
6. savings_status
7. employment duration
8. installment rate
9. personal status
10. debtors
11. residence_since
12. property
13. installment plans
14. housing
15. existing credits
16. job
17. num_dependents
18. telephone
19. foreign worker

Week 2
2. What attributes do you think might be crucial in making the credit assessment? Come up with some simple rules in plain English using your selected attributes.

In my view, the following attributes may be crucial in making the credit risk assessment:
1. Credit_history
2. Employment
3. Property_magnitude
4. Job
5. Duration
6. Credit_amount
7. Installment
8. Existing credit

Based on the above attributes, we can make a decision whether to give credit or not.
checking_status = no checking AND other_payment_plans = none AND credit_history = critical/other existing credit: good
checking_status = no checking AND existing_credits <= 1 AND other_payment_plans = none AND purpose = radio/tv: good
checking_status = no checking AND foreign_worker = yes AND employment = 4<=X<7: good
foreign_worker = no AND personal_status = male single: good
checking_status = no checking AND purpose = used car AND other_payment_plans = none: good
duration <= 15 AND other_parties = guarantor: good
duration <= 11 AND credit_history = critical/other existing credit: good
checking_status = >=200 AND num_dependents <= 1 AND property_magnitude = car: good
checking_status = no checking AND property_magnitude = real estate AND other_payment_plans = none AND age > 23: good
savings_status = >=1000 AND property_magnitude = real estate: good
savings_status = 500<=X<1000 AND employment = >=7: good
credit_history = no credits/all paid AND housing = rent: bad
savings_status = no known savings AND checking_status = 0<=X<200 AND existing_credits > 1: good
checking_status = >=200 AND num_dependents <= 1 AND property_magnitude = life insurance: good
installment_commitment <= 2 AND other_parties = co applicant AND existing_credits > 1: bad
installment_commitment <= 2 AND credit_history = delayed previously AND existing_credits > 1 AND residence_since > 1: good
installment_commitment <= 2 AND credit_history = delayed previously AND existing_credits <= 1: good
duration > 30 AND savings_status = 100<=X<500: bad
credit_history = all paid AND other_parties = none AND other_payment_plans = bank: bad
duration > 30 AND savings_status = no known savings AND num_dependents > 1: good
duration > 30 AND credit_history = delayed previously: bad
duration > 42 AND savings_status = <100 AND residence_since > 1: bad

Week 3
3. One type of model that you can create is a decision tree. Train a decision tree using the complete dataset as the training data. Report the model obtained after training.

A decision tree is a flow-chart-like tree structure where each internal (non-leaf) node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf (terminal) node holds a class label. Decision trees can easily be converted into classification rules. Example algorithms: ID3, C4.5 and CART.

J48 pruned tree
● Using the WEKA tool, we can generate a decision tree by selecting the Classify tab.
● In the Classify tab, select the Choose option, where a list of different decision tree learners is available. From that list select J48.
● Now under test options, select the "use training set" test option.
● The resulting window in WEKA is as follows:
● To generate the decision tree, right-click on the result list and select the "visualize tree" option, by which the decision tree is generated.
● The obtained decision tree for credit risk assessment is too large to fit on the screen.
● The decision tree below is unclear due to the large number of attributes.
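When the tree is too large for the "visualize tree" popup, the same view can be produced in code, where the window can be resized and scrolled at will. J48 exposes the trained tree in GraphViz dot format via graph(), and TreeVisualizer renders that string. A sketch, assuming the German credit ARFF:

import javax.swing.JFrame;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.gui.treevisualizer.PlaceNode2;
import weka.gui.treevisualizer.TreeVisualizer;

public class ShowTree {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("/path/to/credit-g.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        J48 tree = new J48();
        tree.buildClassifier(data);
        // graph() returns the tree in dot format; TreeVisualizer renders it.
        TreeVisualizer tv = new TreeVisualizer(null, tree.graph(), new PlaceNode2());
        JFrame frame = new JFrame("J48 decision tree");
        frame.setSize(800, 600);
        frame.getContentPane().add(tv);
        frame.setVisible(true);
        tv.fitToScreen();
    }
}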
Week 4
4. Suppose you use your above model, trained on the complete dataset, to classify credit as good/bad for each of the examples in the dataset. What percentage of examples can you classify correctly? (This is also called testing on the training set.) Why do you think you cannot get 100% training accuracy?

In the above model we trained on the complete dataset, and we classified credit as good/bad for each of the examples in the dataset. For example:
IF purpose = vacation THEN credit = bad;
ELSE IF purpose = business THEN credit = good;
In this way we classified each of the examples in the dataset. We classified 85.5% of the examples correctly, and the remaining 14.5% of the examples were classified incorrectly. We cannot get 100% training accuracy because, out of the 20 attributes, some are unnecessary attributes that are nonetheless analyzed and trained on. This affects the accuracy, and hence we cannot reach 100% training accuracy.

Week 5
5. Is testing on the training set, as you did above, a good idea? Why or why not?

It is a bad idea to take all of the data as the training set, because then there is no independent data left with which to test whether the classification is correct. According to the usual rule, for the best estimate of accuracy we should take about 2/3 of the dataset as the training set and the remaining 1/3 as the test set. But here, in the above model, we took the complete dataset as the training set, which results in only 85.5% accuracy. This comes from analyzing and training on unnecessary attributes that play no crucial role in credit risk assessment; the complexity increases, and the accuracy suffers. If part of the dataset is used as a training set and the remainder as a test set, the results are more trustworthy and the computation time is lower. This is why we prefer not to take the complete dataset as the training set.

Use-training-set result for the German credit data:

Correctly Classified Instances      855      85.5 %
Incorrectly Classified Instances    145      14.5 %
Kappa statistic                       0.6251
Mean absolute error                   0.2312
Root mean squared error               0.34
Relative absolute error              55.0377 %
Root relative squared error          74.2015 %
Total Number of Instances          1000

Week 6
6. One approach for solving the problem encountered in the previous question is cross-validation. Describe briefly what cross-validation is. Train a decision tree again using cross-validation and report your results. Does your accuracy increase or decrease? Why?

Cross-validation: In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets, or "folds", D1, D2, D3, ..., Dk, each of approximately equal size. Training and testing are performed k times. In iteration i, partition Di is reserved as the test set and the remaining partitions are collectively used to train the model. That is, in the first iteration, subsets D2, D3, ..., Dk collectively serve as the training set to obtain a first model, which is tested on D1; the second iteration is trained on subsets D1, D3, ..., Dk and tested on D2; and so on.

1. Select the Classify tab and the J48 decision tree, and in the test options select the cross-validation radio button with the number of folds set to 10.
2. The number of folds indicates the number of partitions of the set of instances.
3. A Kappa statistic nearing 1 would indicate 100% accuracy, with all errors zeroed out, but in reality no training set gives 100% accuracy.

4. Cross-validation result at folds = 10 for the German credit data:

Correctly Classified Instances      705      70.5 %
Incorrectly Classified Instances    295      29.5 %
Cross-validation result at folds = 10 for the German Credit Data:

Correctly Classified Instances      705       70.5 %
Incorrectly Classified Instances    295       29.5 %
Kappa statistic                       0.2467
Mean absolute error                   0.3467
Root mean squared error               0.4796
Relative absolute error              82.5233 %
Root relative squared error         104.6565 %
Total Number of Instances          1000

Here there are 1000 instances, with 100 instances per partition.

Cross-validation result at folds = 20 for the German Credit Data:

Correctly Classified Instances      698       69.8 %
Incorrectly Classified Instances    302       30.2 %
Kappa statistic                       0.2264
Mean absolute error                   0.3571
Root mean squared error               0.4883
Relative absolute error              85.0006 %
Root relative squared error         106.5538 %
Total Number of Instances          1000

Cross-validation result at folds = 50 for the German Credit Data:

Correctly Classified Instances      710       71 %
Incorrectly Classified Instances    290       29 %
Kappa statistic                       0.2587
Mean absolute error                   0.3444
Root mean squared error               0.4771
Relative absolute error              81.959 %
Root relative squared error         104.1164 %
Total Number of Instances          1000

Cross-validation result at folds = 100 for the German Credit Data:

Correctly Classified Instances      710       71 %
Incorrectly Classified Instances    290       29 %
Kappa statistic                       0.2587
Mean absolute error                   0.3444
Root mean squared error               0.477
Relative absolute error              81.959 %
Root relative squared error         104.1164 %
Total Number of Instances          1000

Percentage split does not allow 100%; it allows only up to 99.9%.

Percentage split result at 50%:

Correctly Classified Instances      362       72.4 %
Incorrectly Classified Instances    138       27.6 %
Kappa statistic                       0.2725
Mean absolute error                   0.3225
Root mean squared error               0.4764
Relative absolute error              76.3523 %
Root relative squared error         106.4373 %
Total Number of Instances           500

Percentage split result at 99.9% (only a single instance is left for testing, and it is misclassified):

Correctly Classified Instances        0        0 %
Incorrectly Classified Instances      1      100 %
Kappa statistic                       0
Mean absolute error                   0.6667
Root mean squared error               0.6667
Relative absolute error             221.7054 %
Root relative squared error         221.7054 %
Total Number of Instances             1

Week 7
7. Check to see if the data shows a bias against "foreign workers" (attribute 20) or "personal status" (attribute 9). One way to do this (perhaps rather simple-minded) is to remove these attributes from the dataset and see if the decision tree created in those cases is significantly different from the full-dataset case which you have already done. To remove an attribute you can use the Preprocess tab in WEKA's GUI Explorer. Did removing these attributes have any significant effect? Discuss.

Removing them increases the accuracy, because the two attributes "foreign workers" and "personal status" are not very important for training and analysis. By removing them, the training time is reduced to some extent, and the accuracy increases. A sketch of the attribute removal follows.
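A minimal sketch of the same removal using WEKA's Remove filter, re-evaluating with 10-fold cross-validation as in problem 6 (the file name and the class name are illustrative):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class RemoveBiasAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff"); // path is an assumption

        // Drop personal_status (9) and foreign_worker (20); indices are 1-based.
        Remove remove = new Remove();
        remove.setAttributeIndices("9,20");
        remove.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, remove);
        reduced.setClassIndex(reduced.numAttributes() - 1); // class is last again

        Evaluation eval = new Evaluation(reduced);
        eval.crossValidateModel(new J48(), reduced, 10, new Random(1));
        System.out.println("Accuracy without attributes 9 and 20: " + eval.pctCorrect() + " %");
    }
}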
The decision tree created from the full dataset is very large compared to the decision tree trained now; this is the main difference between the two decision trees.

After the foreign_worker attribute is removed, the accuracy increases to 85.9%. If we also remove the 9th attribute (personal_status), the accuracy increases further, to 86.6%, which shows that these two attributes are not significant for training. The cross-validation and percentage-split runs repeated after removing the 9th attribute, and again after removing the 20th attribute, confirm this behaviour.

Week 8
8. Another question might be: do you really need to input so many attributes to get good results? Maybe only a few would do. For example, you could try just having attributes 2, 3, 5, 7, 10, 17 (and 21, the class attribute, naturally). Try out some combinations. (You removed two attributes in problem 7; remember to reload the ARFF data file to get all the attributes back before you start selecting the ones you want.)

Select attributes 2, 3, 5, 7, 10, 17 and 21, and click on Invert to remove the remaining attributes. Here the accuracy decreases. Select random attributes and then check the accuracy. After removing attributes 1, 4, 6, 8, 9, 11, 12, 13, 14, 15, 16, 18, 19 and 20, we select the left-over attributes and visualize them. After removing these 14 attributes the accuracy drops to 76.4%, so we can try further random combinations of attributes to increase the accuracy. A sketch of the inverted selection follows.
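The Invert step can be expressed programmatically with the Remove filter's invert flag. A minimal sketch, with the file name and class name again illustrative:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class KeepFewAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff"); // path is an assumption

        Remove keep = new Remove();
        keep.setAttributeIndices("2,3,5,7,10,17,21"); // attributes to keep (1-based)
        keep.setInvertSelection(true);                // invert: remove all the others
        keep.setInputFormat(data);
        Instances subset = Filter.useFilter(data, keep);
        subset.setClassIndex(subset.numAttributes() - 1); // class (21) is last

        Evaluation eval = new Evaluation(subset);
        eval.crossValidateModel(new J48(), subset, 10, new Random(1));
        System.out.println("Accuracy with attributes 2,3,5,7,10,17 only: " + eval.pctCorrect() + " %");
    }
}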
Week 9
9. Sometimes the cost of rejecting an applicant who actually has good credit (Case 1) might be higher than the cost of accepting an applicant who has bad credit (Case 2). Instead of counting the misclassifications equally in both cases, give a higher cost to the first case (say cost 5) and a lower cost to the second case. You can do this by using a cost matrix in WEKA. Train your Decision Tree again and report the Decision Tree and cross-validation results. Are they significantly different from the results obtained in problem 6 (using equal cost)?

In problem 6 we used equal costs and trained the decision tree. Here, instead, we consider the two cases with different costs: cost 5 in Case 1 and cost 2 in Case 2. When we assign these costs and train the decision tree again, we observe that the tree obtained is almost equal to the one obtained in problem 6.

                Case 1 (cost 5)   Case 2 (cost 2)
Total Cost      3820              1705
Average Cost    3.82              1.705

This cost factor does not appear in problem 6, as there we used equal costs; this is the major difference between the results of problem 6 and problem 9. The cost matrices used here are:

Case 1:        Case 2:
5 1            2 1
1 5            1 2

1. Select the Classify tab, then More options under Test options. Tick "Cost-sensitive evaluation" and click Set. Set the number of classes to 2, click Resize, and the cost matrix appears. Then change the 2nd entry in the 1st row and the 2nd entry in the 1st column to 5.0.
2. The confusion matrix is then generated, and you can see the difference between the good and bad classes.
3. Check whether the accuracy changes.

Week 10
10. Do you think it is a good idea to prefer simple decision trees instead of long, complex decision trees? How does the complexity of a Decision Tree relate to the bias of the model?

When we consider long, complex decision trees, the tree contains many unnecessary attributes, which increases the bias of the model; because of this, the accuracy of the model can also be affected. The problem can be reduced by considering a simple decision tree: there are fewer attributes, the bias of the model decreases, and the result is therefore more accurate. So it is a good idea to prefer simple decision trees to long, complex trees.

1. Open any existing ARFF file, e.g. labour.arff.
2. In the Preprocess tab, click All to select all the attributes.
3. Go to the Classify tab and use the training set with the J48 algorithm.
4. To generate the decision tree, right-click on the result list and select the "Visualize tree" option; the decision tree will be generated.
5. Right-click on the J48 algorithm to open the Generic Object Editor window.
6. In this window, set the unpruned option to True.
7. Press OK and then Start. We find that the tree becomes more complex when it is not pruned.
8. Visualize the tree: the tree has indeed become more complex. A sketch comparing the pruned and unpruned trees follows.
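A minimal sketch of the pruned/unpruned comparison through the Java API, assuming labour.arff is available locally (WEKA ships it among its sample datasets); the class name is illustrative:

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TreeComplexity {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("labour.arff"); // path is an assumption
        data.setClassIndex(data.numAttributes() - 1);

        J48 pruned = new J48();           // pruning is on by default
        pruned.buildClassifier(data);

        J48 unpruned = new J48();
        unpruned.setUnpruned(true);       // same as unpruned = True in the Generic Object Editor
        unpruned.buildClassifier(data);

        // measureTreeSize() reports the number of nodes in the built tree.
        System.out.println("Pruned tree size:   " + pruned.measureTreeSize());
        System.out.println("Unpruned tree size: " + unpruned.measureTreeSize());
    }
}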
Week 11
11. You can make your Decision Trees simpler by pruning the nodes. One approach is to use Reduced Error Pruning; explain this idea briefly. Try reduced error pruning for training your Decision Trees using cross-validation (you can do this in WEKA) and report the Decision Tree you obtain. Also report your accuracy using the pruned model. Does your accuracy increase?

Reduced-error pruning:
The idea of using a separate pruning set for pruning, which is applicable to decision trees as well as rule sets, is called reduced-error pruning. The variant described previously prunes a rule immediately after it has been grown and is called incremental reduced-error pruning. Another possibility is to build a full, unpruned rule set first and prune it afterwards by discarding individual tests. However, this method is much slower. Of course, there are many different ways to assess the worth of a rule based on the pruning set. A simple measure is to consider how well the rule would do at discriminating the predicted class from the other classes if it were the only rule in the theory, operating under the closed-world assumption. If it gets p instances right out of the t instances that it covers, and there are P instances of this class out of a total of T instances altogether, then it gets p positive instances right. The instances that it does not cover include N - n negative ones, where n = t - p is the number of negative instances that the rule covers and N = T - P is the total number of negative instances. Thus the rule has an overall success ratio of [p + (N - n)] / T, and this quantity, evaluated on the pruning set, has been used to assess the success of a rule when using reduced-error pruning.

1. Right-click on the J48 algorithm to open the Generic Object Editor window.
2. In this window, set the reducedErrorPruning option to True (the unpruned option must remain False, since reduced-error pruning is a pruning method).
3. Press OK and then Start.
4. We find that the accuracy increases when the reduced-error pruning option is selected.

Reduced error pruning is set to True.

Week 12
12. (Extra Credit): How can you convert a Decision Tree into "if-then-else" rules? Make up your own small Decision Tree consisting of 2-3 levels and convert it into a set of rules. There also exist classifiers that output the model directly in the form of rules; one such classifier in WEKA is rules.PART. Train this model and report the set of rules obtained. Sometimes just one attribute can be good enough in making the decision, yes, just one! Can you predict what attribute that might be in this dataset? The OneR classifier uses a single attribute to make decisions (it chooses the attribute based on minimum error). Report the rule obtained by training a OneR classifier. Rank the performance of J48, PART and OneR.

In WEKA, rules.PART is a classifier that converts decision trees into "IF-THEN-ELSE" rules. Converting a decision tree into "IF-THEN-ELSE" rules using the rules.PART classifier gives the following decision list:

PART decision list
outlook = overcast: yes (4.0)
windy = TRUE: no (4.0/1.0)
outlook = sunny: no (3.0/1.0)
: yes (3.0)
Number of Rules: 4

Yes, sometimes just one attribute can be good enough in making the decision. In this (weather) dataset, the single attribute used for making the decision is "outlook":

outlook:
sunny -> no
overcast -> yes
rainy -> yes
(10/14 instances correct)

A programmatic sketch of training PART and OneR follows; the timing and accuracy rankings are reported after it.
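A minimal sketch, assuming WEKA's bundled weather.nominal.arff dataset is available locally; the class name is illustrative:

import weka.classifiers.rules.OneR;
import weka.classifiers.rules.PART;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RuleClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff"); // path is an assumption
        data.setClassIndex(data.numAttributes() - 1);

        PART part = new PART();    // rule list derived from partial C4.5 trees
        part.buildClassifier(data);
        System.out.println(part);  // prints the PART decision list

        OneR oner = new OneR();    // single-attribute rule, chosen by minimum error
        oner.buildClassifier(data);
        System.out.println(oner);  // prints the rule on the chosen attribute
    }
}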
With respect to time, the OneR classifier ranks first, J48 second and PART third:

              J48     PART    OneR
Time (sec)    0.12    0.14    0.04
Rank          II      III     I

But if we consider accuracy, J48 ranks first, PART second and OneR last:

              J48     PART    OneR
Accuracy (%)  70.5    70.2    66.8

1. Open the existing file weather.nominal.arff.
2. Select All.
3. Go to the Classify tab.
4. Click Start.

Here the accuracy is 100%. The tree corresponds directly to "if-then-else" rules:

If outlook = overcast then play = yes
If outlook = sunny and humidity = high then play = no, else play = yes
If outlook = rainy and windy = true then play = no, else play = yes

To extract the rules:
1. Go to Choose, click on Rules and select PART.
2. Click Save and then Start.
3. Proceed similarly for the OneR algorithm.

If outlook = overcast then play = yes
If outlook = sunny and humidity = high then play = no
If outlook = sunny and humidity = normal then play = yes

Additional Programs
1. Perform cluster analysis on the German credit dataset using a partitional clustering algorithm.
2. Perform cluster analysis on the German credit dataset using the EM clustering algorithm.

Additional program 1
Aim: Perform cluster analysis on the German credit dataset using a partitional clustering algorithm (k-means).

Recommended Hardware / Software Requirements:
• Hardware: Intel-based desktop PC with a 166 MHz or faster processor, at least 64 MB RAM and 100 MB free disk space.
• Software: Weka

Pseudo code: In pseudo code, the general k-means clustering algorithm is:
1. Place K points into the space represented by the objects that are being clustered. These points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.

Procedure: In the WEKA GUI Explorer, select the Cluster tab and choose SimpleKMeans. Then select "Use training set" and click Start. A programmatic sketch of the same run follows, before the output.
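A minimal sketch of the same clustering run against the WEKA Java API. It assumes a local credit-g.arff file and mirrors the options (-N 3 -S 10) and the class-attribute removal (Remove-R21) visible in the run information below; the class name is illustrative.

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class KMeansGermanCredit {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff"); // path is an assumption

        // Remove the class attribute (21) before clustering, as in Remove-R21 below.
        Remove remove = new Remove();
        remove.setAttributeIndices("21");
        remove.setInputFormat(data);
        Instances noClass = Filter.useFilter(data, remove);

        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(3);   // -N 3, as in the run information below
        km.setSeed(10);         // -S 10
        km.buildClusterer(noClass);
        System.out.println(km); // centroids and within-cluster sum of squared errors
    }
}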
Output: cluster analysis with the k-means clustering algorithm

=== Run information ===

Scheme:     weka.clusterers.SimpleKMeans -N 3 -A "weka.core.EuclideanDistance -R first-last" -I 500 -O -S 10
Relation:   german_credit-weka.filters.unsupervised.attribute.Remove-R21
Instances:  1000
Attributes: 20
            checking_status, duration, credit_history, purpose, credit_amount, savings_status, employment, installment_commitment, personal_status, other_parties, residence_since, property_magnitude, age, other_payment_plans, housing, existing_credits, job, num_dependents, own_telephone, foreign_worker
Test mode:  evaluate on training data

=== Clustering model (full training set) ===

kMeans
======
Number of iterations: 8
Within cluster sum of squared errors: 5145.269062855846
Missing values globally replaced with mean/mode

Cluster centroids:

Attribute                 Full Data      Cluster 0      Cluster 1          Cluster 2
                          (1000)         (484)          (190)              (326)
checking_status           no checking    no checking    <0                 0<=X<200
duration                  20.903         20.7314        26.0526            18.1564
credit_history            existing paid  existing paid  existing paid      existing paid
purpose                   radio/tv       new car        used car           radio/tv
credit_amount             3271.258       3293.1281      4844.6474          2321.7822
savings_status            <100           <100           <100               <100
employment                1<=X<4         1<=X<4         >=7                >=7
installment_commitment    2.973          2.8822         3.0579             3.0583
personal_status           male single    male single    male single        male single
other_parties             none           none           none               none
residence_since           2.845          2.4483         3.5211             3.0399
property_magnitude        car            car            no known property  real estate
age                       35.546         33.155         41.0526            35.8865
other_payment_plans       none           none           none               none
housing                   own            own            for free           own
existing_credits          1.407          1.3967         1.4474             1.3988
job                       skilled        skilled        skilled            skilled
num_dependents            1.155          1.155          1.2474             1.1012
own_telephone             none           none           yes                none
foreign_worker            yes            yes            yes                yes

Time taken to build model (full training data): 0.07 seconds

=== Model and evaluation on training set ===

Clustered Instances
0    484 ( 48%)
1    190 ( 19%)
2    326 ( 33%)

Additional program 2
Aim: Perform cluster analysis on the German credit dataset using the EM clustering algorithm.

Recommended Hardware / Software Requirements:
• Hardware: Intel-based desktop PC with a 166 MHz or faster processor, at least 64 MB RAM and 100 MB free disk space.
• Software: Weka

Pseudo code: In pseudo code, the general EM clustering algorithm is:
1. Initialize the parameters of the cluster distributions (for example, randomly).
2. E-step: compute, for each instance, the probability of membership in each cluster under the current parameters.
3. M-step: re-estimate the cluster parameters from the instances weighted by those membership probabilities.
4. Repeat steps 2 and 3 until the log likelihood converges.

Procedure: In the WEKA GUI Explorer, select the Cluster tab and choose EM. Then select "Use training set" and click Start. A programmatic sketch follows, before the output.
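A minimal sketch of the same EM run against the WEKA Java API, mirroring the option -N -1 (number of clusters chosen by cross-validation) and the class-attribute removal visible in the run information below; file and class names are illustrative.

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.EM;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class EMGermanCredit {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff"); // path is an assumption

        Remove remove = new Remove();
        remove.setAttributeIndices("21"); // drop the class attribute before clustering
        remove.setInputFormat(data);
        Instances noClass = Filter.useFilter(data, remove);

        EM em = new EM();
        em.setNumClusters(-1); // -1 lets EM select the number of clusters by cross-validation
        em.buildClusterer(noClass);

        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(em);
        eval.evaluateClusterer(noClass);
        System.out.println(eval.clusterResultsToString()); // cluster sizes and log likelihood
    }
}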
Output:

=== Run information ===

Scheme:     weka.clusterers.EM -I 100 -N -1 -M 1.0E-6 -S 100
Relation:   german_credit-weka.filters.unsupervised.attribute.Remove-R21
Instances:  1000
Attributes: 20
            checking_status, duration, credit_history, purpose, credit_amount, savings_status, employment, installment_commitment, personal_status, other_parties, residence_since, property_magnitude, age, other_payment_plans, housing, existing_credits, job, num_dependents, own_telephone, foreign_worker
Test mode:  evaluate on training data

=== Clustering model (full training set) ===

EM
==
Number of clusters selected by cross validation: 4

Attribute                         Cluster 0   Cluster 1   Cluster 2   Cluster 3
                                  (0.26)      (0.26)      (0.2)       (0.29)
checking_status
  <0                              100.8097    58.5666     51.7958     66.8279
  0<=X<200                        69.3481     63.6477     34.9535     105.0507
  >=200                           17.6736     20.0978     11.9012     17.3274
  no checking                     73.012      119.2995    101.8966    103.7918
  [total]                         260.8434    261.6116    200.5471    292.9978
duration
  mean                            17.7484     14.3572     23.4112     27.8358
  std. dev.                       8.0841      7.1757      12.1018     14.1317
credit_history
  no credits/all paid             10.1705     6.0326      8.4795      19.3174
  all paid                        17.9296     11.0899     9.6553      14.3252
  existing paid                   175.3951    142.1934    53.3962     163.0153
  delayed previously              10.1938     18.0432     24.9273     38.8357
  critical/other existing credit  48.1544     85.2526     105.0888    58.5041
  [total]                         261.8434    262.6116    201.5471    293.9978
purpose
  new car                         57.7025     76.7946     47.734      55.7689
  used car                        14.504      7.9487      40.7163     43.831
  furniture/equipment             95.3943     25.2704     24.1583     40.1769
  radio/tv                        53.3828     106.3023    48.3866     75.9283
  domestic appliance              7.9495      3.4917      1.161       3.3979
  repairs                         5.5771      9.5832      6.9408      3.8988
  education                       9.921       10.7236     11.9789     21.3766
  vacation                        1           1           1           1
  retraining                      4.7356      4.1209      2.311       1.8324
  business                        16.6708     22.302      19.5059     42.5213
  other                           1.0059      1.0743      3.6542      10.2656
  [total]                         267.8434    268.6116    207.5471    299.9978
credit_amount
  mean                            2288.8498   1812.2911   3638.3737   5195.2049
  std. dev.                       1342.8531   995.7303    2694.223    3683.9507
savings_status
  <100                            170.6648    165.5967    96.2641     174.4744
  100<=X<500                      26.3033     25.4915     18.3092     36.8959
  500<=X<1000                     15.6275     21.5273     15.5765     14.2688
  >=1000                          12.2318     18.448      12.513      8.8072
  no known savings                37.0161     31.5481     58.8844     59.5515
  [total]                         261.8434    262.6116    201.5471    293.9978
employment
  unemployed                      14.0219     3.1801      16.0683     32.7298
  <1                              90.51       34.2062     8.4379      42.846
  1<=X<4                          84.9242     128.879     27.7645     101.4323
  4<=X<7                          50.6437     42.1897     31.3087     53.858
  >=7                             21.7437     54.1567     117.9679    63.1317
  [total]                         261.8434    262.6116    201.5471    293.9978
installment_commitment
  mean                            2.8557      3.0212      3.312       2.8038
  std. dev.                       1.1596      1.1124      0.9515      1.1363
personal_status
  male div/sep                    15.737      9.9518      4.6205      23.6907
  female div/dep/mar              151.4625    48.4321     18.2787     95.8267
  male single                     67.3068     159.5075    172.5861    152.5996
  male mar/wid                    26.3371     43.7203     5.0618      20.8808
  female single                   1           1           1           1
  [total]                         261.8434    262.6116    201.5471    293.9978
other_parties
  none                            235.863     218.7895    186.4245    269.923
  co applicant                    12.5526     10.6977     6.9588      14.7909
  guarantor                       11.4278     31.1244     6.1638      7.2839
  [total]                         259.8434    260.6116    199.5471    291.9978
residence_since
  mean                            2.6862      2.5399      3.5434      2.7831
  std. dev.                       1.1732      1.0186      0.7654      1.1061
property_magnitude
  real estate                     69.0217     148.9943    30.8391     37.1449
  life insurance                  81.2718     54.4192     41.9034     58.4056
  car                             95.7773     51.1875     60.6462     128.389
  no known property               14.7725     7.0107      67.1584     69.0583
  [total]                         260.8434    261.6116    200.5471    292.9978
age
  mean                            27.7345     36.1057     43.8079     36.3705
  std. dev.                       5.7953      10.3158     11.3129     11.5738
other_payment_plans
  bank                            34.4988     32.0758     33.984      42.4414
  stores                          10.9742     12.5287     10.4947     17.0024
  none                            214.3704    216.0071    155.0685    232.554
  [total]                         259.8434    260.6116    199.5471    291.9978
housing
  rent                            85.8549     31.7206     15.9015     49.523
  own                             168.499     226.2291    124.0089    198.2629
  for free                        5.4895      2.6619      59.6367     44.2118
  [total]                         259.8434    260.6116    199.5471    291.9978
existing_credits
  mean                            1.213       1.4137      1.7961      1.3088
  std. dev.                       0.4142      0.5377      0.7406      0.4734
job
  unemp/unskilled non res         11.7711     2.5192      6.8364      4.8733
  unskilled resident              52.9713     105.4029    24.5489     21.0769
  skilled                         188.0096    147.8359    128.9987    169.1558
  high qualif/self emp/mgmt       8.0914      5.8537      40.1631     97.8918
  [total]                         260.8434    261.6116    200.5471    292.9978
num_dependents
  mean                            1           1.2978      1.3983      1
  std. dev.                       0.3621      0.4573      0.4895      0.3621
own_telephone
  none                            219.2961    215.7304    81.1575     83.816
  yes                             39.5473     43.8813     117.3896    207.1818
  [total]                         258.8434    259.6116    198.5471    290.9978
foreign_worker
  yes                             248.5954    234.0215    197.4796    286.9034
  no                              10.248      25.5901     1.0675      4.0944
  [total]                         258.8434    259.6116    198.5471    290.9978

Time taken to build model (full training data): 22.43 seconds

=== Model and evaluation on training set ===

Clustered Instances
0    279 ( 28%)
1    279 ( 28%)
2    194 ( 19%)
3    248 ( 25%)

Log likelihood: -33.06046