CHAMELI DEVI GROUP OF INSTITUTIONS, INDORE
LABORATORY MANUAL
DATA MINING AND WAREHOUSING
CS 705
VII SEM (CSE)
DEPARTMENT OF
COMPUTER SCIENCE AND ENGINEERING
CHAMELI DEVI GROUP OF INSTITUTIONS, INDORE
DEPARTMENT OF
COMPUTER SCIENCE AND ENGINEERING
CERTIFICATE
This is to certify that Mr./Ms. ………………………………………………………………………..
with RGTU
Enrollment No. ..…………………………..has satisfactorily completed the course of experiments in Data Mining
laboratory, as prescribed by Rajiv Gandhi Proudyogiki Vishwavidyalaya, Bhopal for VII Semester of the
Computer Science and Engineering Department during year 2023-24.
Signature of
Faculty In-Charge
DEPARTMENT OF
COMPUTER SCIENCE AND ENGINEERING
2023-24
List of Experiments

Student Name:                                        Enrollment No.:

Expt. No.   List of Experiments                                                      Conduction   Staff Signature   Date

1.  To acquire the knowledge about Weka tool on sample data sets.
2.  Design the cube by identifying measures and dimensions for Star schema and Snowflake schema.
3.  Design and create cube by identifying measures and dimensions using storage mode MOLAP, ROLAP and HOLAP.
4.  Study about the browsing and processing of cube’s data.
5.  Design data mining models using analysis services of SQL Server.
6.  List all the categorical (or nominal) attributes and the real-valued attributes separately from the dataset.
7.  Demonstrate the various preprocessing operations on the dataset.
8.  Demonstrate the concept of the association rule on the dataset using the Apriori algorithm.
9.  Demonstrate the clustering rule process on the dataset using a simple K-means algorithm.
10. Demonstrate the classification rule process on the dataset using the Naïve Bayes algorithm.
EXPT. No 1. To acquire the knowledge about Weka tool on sample data sets.
Aim: To acquire the knowledge about Weka tool on sample datasets
Software Required: Weka tool
Theory:
The Weka Knowledge Explorer is an easy-to-use graphical user interface that harnesses the power of the
Weka software. Each of the major Weka packages (filters, classifiers, clusterers, associators, and attribute
selection) is represented in the Explorer, along with a visualization tool that allows datasets and the
predictions of classifiers and clusterers to be visualized in two dimensions.
Following are the panel details available in Weka Tool:
Preprocess panel
The preprocess panel is the start point for knowledge exploration. From this panel you can load datasets,
browse the characteristics of attributes and apply any combination of Weka's unsupervised filters to the
data.
Classifier panel
The classifier panel allows you to configure and execute any of the Weka classifiers on the current dataset.
You can choose to perform a cross-validation or a test on a separate dataset. Classification errors can be
visualized in a pop-up data visualization tool. If the classifier produces a decision tree it can be displayed
graphically in a pop-up tree visualizer.
Cluster panel
From the cluster panel you can configure and execute any of the Weka clusterers on the current dataset.
Clusters can be visualized in a pop-up data visualization tool.
Associate panel
From the associate panel you can mine the current dataset for association rules using the Weka associators.
Select attributes panel
This panel allows you to configure and apply any combination of Weka attribute evaluator and search
method to select the most pertinent attributes in the dataset. If an attribute selection scheme transforms
the data, then the transformed data can be visualized in a pop-up data visualization tool.
Visualize Panel
This panel displays a scatter plot matrix for the current dataset. The size of the individual cells and the size of
the points they display can be adjusted using the slider controls at the bottom of the panel. This panel allows
you to visualize the current dataset in one and two dimensions. When the coloring attribute is discrete, each
value is displayed as a different color; when the coloring attribute is continuous, a spectrum is used to
indicate the value.
Procedure:
1. Start the Weka GUI Chooser.
2. Launch the Weka Explorer by clicking the “Explorer” button.
Figure 1.1: Screenshot of the Weka Explorer
3. Click the “Open file…” button.
4. Navigate to your current working directory. Change the “Files of Type” to “CSV data files (*.csv)”. Select
your file and click the “Open” button.
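The same load step can also be performed programmatically through Weka's Java API (weka.jar on the classpath). The sketch below is illustrative only and assumes a hypothetical file name, sample.csv, in the working directory; it loads the CSV file and prints the basic information that the Preprocess panel displays.

import weka.core.Instances;
import weka.core.converters.CSVLoader;

import java.io.File;

public class LoadSampleData {
    public static void main(String[] args) throws Exception {
        // Load a CSV file, just as "Open file..." with "CSV data files (*.csv)" does.
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("sample.csv"));   // hypothetical file name
        Instances data = loader.getDataSet();

        // Basic information that the Preprocess panel displays.
        System.out.println("Relation:   " + data.relationName());
        System.out.println("Instances:  " + data.numInstances());
        System.out.println("Attributes: " + data.numAttributes());
        System.out.println(data.toSummaryString());
    }
}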
Viva Questions:
1. What is data mining?
2. Why do we pre-process the data?
3. What are the steps of data pre-processing?
4. What is data warehouse?
5. In which language is the Weka tool programmed?
EXPT. No 2. To design the cube by identifying measures and dimensions for Star schema and Snowflake
schema
Aim: To understand the Star and Snowflake schemas and their operations
Software Required: Analysis services- SQL Server
Theory:
Star schema: In the Star schema, the center of the star can have one fact table and a number of associated
dimension tables. It is known as star schema as its structure resembles a star. The star schema is the
simplest type of Data Warehouse schema. It is also known as Star Join Schema and is optimized for querying
large data sets.
Snowflake schema: A Snowflake schema is a logical arrangement of tables in a multidimensional database
such that the ER diagram resembles a snowflake shape. A Snowflake schema is an extension of a Star
schema in which the dimension tables are normalized, splitting their data into additional tables.
Data cube:
Data that is grouped or combined into multidimensional matrices is called a data cube. The data cube method
has a few alternative names or variants, such as "multidimensional databases," "materialized views,"
and "OLAP (On-Line Analytical Processing)."
The general idea of this approach is to materialize certain expensive computations that are frequently
required.
A data cube can also be described as the multidimensional extension of two-dimensional tables. It can be
viewed as a collection of identically structured 2-D tables stacked upon one another. Data cubes are used to represent
data that is too complex to be described by a table of columns and rows. As such, data cubes can go far
beyond 3-D to include many more dimensions.
Four types of analytical operations in OLAP are:
1. Roll-up
2. Drill-down
3. Slice and dice
4. Pivot (rotate)
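To make the star-schema idea concrete before the SSAS steps, the following is a minimal, self-contained Java sketch (an illustration only, not Analysis Services code). The table and column names are hypothetical, loosely modeled on the FactResellerSales example used in the procedure: the fact table holds the measures and references each dimension table by its key, and in a snowflake schema a dimension column such as the product category would be normalized out into its own table.

import java.util.List;
import java.util.Map;

public class StarSchemaSketch {

    // Dimension tables (hypothetical names)
    record DimProduct(int productKey, String name, String category) {}
    record DimDate(int dateKey, int year, int quarter, int month) {}
    record DimReseller(int resellerKey, String resellerName, String region) {}

    // Fact table: foreign keys to each dimension plus the measures
    record FactResellerSales(int productKey, int dateKey, int resellerKey,
                             double salesAmount, int orderQuantity, double freight) {}

    public static void main(String[] args) {
        Map<Integer, DimProduct> products = Map.of(
                1, new DimProduct(1, "Mountain Bike", "Bikes"),
                2, new DimProduct(2, "Helmet", "Accessories"));

        List<FactResellerSales> facts = List.of(
                new FactResellerSales(1, 20230101, 10, 1200.0, 1, 25.0),
                new FactResellerSales(2, 20230102, 10, 45.0, 3, 5.0));

        // A measure aggregated across the fact table (total sales amount)
        double totalSales = facts.stream()
                .mapToDouble(FactResellerSales::salesAmount).sum();
        System.out.println("Total sales amount = " + totalSales);

        // In a snowflake schema, DimProduct's "category" column would be
        // normalized into its own DimProductCategory table, referenced by key.
        System.out.println("Category of product 1: " + products.get(1).category());
    }
}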
Procedure:
1. In the Solution Explorer, right click Cubes and select New Cube:
Figure 2.1: Solution Explorer
2. In Select Measure Group Tables, select the FactResellerSales table. The measure group table is the table whose data will be
measured. A measure can be the number of sales, amount sold, freight, etc.
Figure 2.2: Data Source View
3. Select the data to measure. We will uncheck the keys and check the other attributes to measure.
Figure 2.3: Measure Selection
4. Select the dimensions that you want to add to the cube.
Figure 2.4: Dimension Selection
5. You can also add the fact table as a dimension (degenerate dimension). In this example, we will not add
it.
Figure 2.5: New Dimension Selection
6. Once the cube is created, press Finish.
Figure 2.6: Cube Creation
The cube is created with the fact table (in yellow) and the dimensions (in blue).
Viva Questions:
1. What is a cube?
2. What is a dimension table?
3. What is a fact table?
4. How is the star schema different from the snowflake schema?
5. Explain the concept of the star schema.
EXPT. No 3. To design the cube by identifying measures and dimensions using storage mode MOLAP,
ROLAP and HOLAP
Aim: To understand the various storage modes, i.e., MOLAP, ROLAP and HOLAP, and how they work
Software Required: Analysis services- SQL Server
Theory:
Partition storage
Physical storage options affect the performance, storage requirements, and storage locations of partitions
and their parent measure groups and cubes. A partition can have one of three basic storage modes:
• Multidimensional OLAP (MOLAP)
• Relational OLAP (ROLAP)
• Hybrid OLAP (HOLAP)
Microsoft SQL Server Analysis Services (SSAS) supports all three basic storage modes. It also supports
proactive caching, which enables you to combine the characteristics of ROLAP and MOLAP storage for both
immediacy of data and query performance. You can configure the storage mode and proactive caching
options in one of three ways.
MOLAP
MOLAP uses array-based multidimensional storage engines to provide multidimensional views of data.
Essentially, it uses an OLAP cube.
ROLAP
ROLAP works with data that exists in a relational database. Fact and dimension tables are stored as relational
tables. It also allows multidimensional analysis of data and is the fastest-growing type of OLAP.
Advantages of ROLAP model:
• High data efficiency: it offers high data efficiency because query performance and the access language are
optimized particularly for multidimensional data analysis.
• Scalability: this type of OLAP system offers scalability for managing large volumes of data, even when
the data is steadily increasing.
HOLAP
Hybrid OLAP is a mixture of both ROLAP and MOLAP. It offers the fast computation of MOLAP and the higher
scalability of ROLAP. HOLAP uses two databases:
1. Aggregated or computed data is stored in a multidimensional OLAP cube
2. Detailed information is stored in a relational database.
Advantages of Hybrid OLAP:
• This kind of OLAP helps to economize disk space, and it also remains compact, which helps to avoid
issues related to access speed and convenience.
• HOLAP uses cube technology, which allows faster performance for all types of data.
Figure 3.1 shows the initial Design Storage choice from the Tools menu, along with the three data storage
options from which you can select.
Figure 3.1: Storage Design Wizard: MOLAP, ROLAP, or HOLAP
Viva Questions:
1. What is a model in Data mining?
2. Explain the procedure to mine an OLAP cube.
3. What are the advantages of the HOLAP model?
4. What are different data warehouse models?
5. Explain the basic OLAP operations.
EXPT. No 4. Study about the browsing and processing of cube’s data
Aim: To understand the operations on a data cube
Software Required: Analysis services- SQL Server
Theory:
Browsing cube data is a way to check work incrementally: it verifies that small changes to properties,
relationships, and other objects have the desired effect once the object is processed. While the Browser
tab is used to view both cube and dimension data, the tab provides different capabilities based on the object
being browsed.
For cubes, the Browser tab provides two approaches for exploring data. The built-in MDX Query Designer is
used to build queries that return a flattened row set from a multidimensional database. Alternatively, an
Excel shortcut can be used: when Excel is started from SQL Server Data Tools, it opens with a PivotTable
already in the worksheet and a predefined connection to the model workspace database.
Processing cube data:
Drill down: In the drill-down operation, less detailed data is converted into more detailed data. It can be
done by:
• Moving down in the concept hierarchy
• Adding a new dimension
Roll up: This is the opposite of the drill-down operation. It performs aggregation on the OLAP cube. It can be
done by:
• Climbing up in the concept hierarchy
• Reducing the dimensions
Dice: Selects a sub-cube from the OLAP cube by selecting two or more dimensions.
Slice: Selects a single dimension from the OLAP cube, resulting in the creation of a new sub-cube.
Pivot: Also known as the rotation operation, it rotates the current view to get a new view of the
representation. For example, performing a pivot operation on the sub-cube obtained after a slice operation gives a new
view of it.
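These operations can be pictured with a small, self-contained Java sketch (an illustration only, not SSAS code; the dimensions and figures are invented). Roll-up aggregates one dimension away, while slice and dice restrict the cube to subsets of dimension values.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CubeOperationsSketch {

    // One cell of a tiny in-memory "cube": (year, region, product) -> sales
    record Cell(int year, String region, String product, double sales) {}

    public static void main(String[] args) {
        List<Cell> cube = List.of(
                new Cell(2023, "North", "Bikes", 100),
                new Cell(2023, "North", "Helmets", 40),
                new Cell(2023, "South", "Bikes", 80),
                new Cell(2024, "North", "Bikes", 120));

        // Roll-up: aggregate away the product dimension (climb the hierarchy)
        Map<String, Double> byYearRegion = new LinkedHashMap<>();
        for (Cell c : cube) {
            byYearRegion.merge(c.year() + "/" + c.region(), c.sales(), Double::sum);
        }
        System.out.println("Roll-up (year, region): " + byYearRegion);

        // Slice: fix a single value on one dimension (year = 2023)
        List<Cell> slice2023 = cube.stream().filter(c -> c.year() == 2023).toList();
        System.out.println("Slice (year = 2023): " + slice2023);

        // Dice: restrict two or more dimensions to subsets of values
        List<Cell> dice = cube.stream()
                .filter(c -> c.year() == 2023 && c.region().equals("North"))
                .toList();
        System.out.println("Dice (year = 2023, region = North): " + dice);
    }
}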
To browse the deployed cube
1. Switch to Dimension Designer for the Product dimension in SQL Server Data Tools. To do this,
double-click the Product dimension in the Dimensions node of Solution Explorer.
2. Click the Browser tab to display the All member of the Product Key attribute hierarchy.
3. Switch to Cube Designer in SQL Server Data Tools. To do this, double-click the Analysis Services
Tutorial cube in the Cubes node of Solution Explorer.
4. Select the Browser tab, and then click the Reconnect icon on the toolbar of the designer.
The left pane of the designer shows the objects in the Analysis Services Tutorial cube. On the right side of
the Browser tab, there are two panes: the upper pane is the Filter pane, and the lower pane is
the Data pane.
To process an Analysis Services cube
1. Connect to the SQL Server Analysis Services Instance using SQL Server Management Studio.
2. In the Object Explorer Pane, expand the Analysis Services Instance, expand Databases and then expand
the Analysis Services database that contains the cube which needs to be processed.
3. Right-click the cube to be processed and then click the Process option from the drop-down list, as shown in
Figure 4.2.
Figure 4.2: Object Explorer
4. In the Process Cube dialog box, in the Process Options column under Object list, verify that the Process
Full option is selected, as highlighted in Figure 4.3. If it is not selected, then under Process
Options click the option and select Process Full from the drop-down list, as shown in Figure 4.3.
5. To verify the settings for the processing batch, click the Change Settings button in the Process
Cube dialog box, as shown in Figure 4.3. The settings you specify in the Change Settings dialog box will
override the default settings inherited from the Analysis Services database for all the objects listed in
the Process dialog box.
Figure 4.3: Process Cube Window
In the Processing Options tab, shown below, there are different processing options available:
• Parallel: processes the objects in parallel, all within a single transaction.
• Sequential: processes the objects one after another.
• Writeback Table Option: manages the writeback table; the three available options are Create,
Create always and Use existing.
• Affected Objects: processing affected objects will process all the objects that have a dependency on
the selected objects.
Figure 4.4: Settings for Processing
6. In the Dimension Key Errors tab under Change Settings, you can use either the default error
configuration or a custom error configuration. Click OK to save the changes and return to the Process
Cube dialog box.
Figure 4.5: Error Configurations
7. Finally, to process the cube, click OK in the Process Cube dialog box. Once the cube processing has
completed successfully, click the Close button in the Process Progress dialog box to complete the cube
processing.
Figure 4.6: Processing Cube
Viva Questions:
1. Which operation improves cube processing speed?
2. What is pivot operation?
3. What is filtering of data?
4. What is sub cube?
5. What is drill down operation?
EXPT. No 5. Design data mining models using analysis services of SQL Server
Aim: To understand data mining models
Software Required: Analysis services- SQL Server.
Theory:
Data mining refers to the discovery and extraction of patterns and knowledge from large sets of structured
and unstructured data.
Data mining models
A data mining model is created by applying an algorithm to the raw data. The mining model is more
than the algorithm or a metadata handler: it is a set of data, patterns, and statistics that can be applied to new
data as it arrives to generate predictions and draw inferences about relationships. The
following are some of the techniques that are used in data mining:
1. Descriptive data mining technique
This technique is generally preferred for generating cross-tabulations, correlations, frequencies, etc. Descriptive
data mining techniques are used to obtain information on the regularity of the data, using raw
data as input, and to discover important patterns. Another application of this analysis is to identify
interesting groups within the wider body of raw data.
2. Predictive data mining technique
The main objective of the predictive mining technique is to predict future outcomes rather than describe
current behaviour. Many functions are used for the prediction of the target value. The techniques that
fall under this category are classification, regression and time-series analysis. Data modeling is a
prerequisite for this predictive analysis, which uses some variables to predict the unknown future values of
other variables.
Procedure:
• Open Microsoft Visual Studio and create a new project under Analysis Services, selecting the
Analysis Services Multidimensional and Data Mining Project template. Following is the Solution
Explorer for the created project:
Figure 5.1: Solution Explorer
For data mining, we will be using three nodes, Data Sources, Data Source Views, and Data Mining.
A. Data Sources
• We need to configure the data source for the project as shown in Figure 5.2. The data source makes a
connection to the sample database, AdventureWorksDW2017.
Figure 5.2: Select Database
After providing the credentials for the source database, the next step is to provide the credentials that the
Analysis Services instance uses to connect to the database.
Figure 5.3: Analysis Service Credentials
Analysis Services will be used to store the data mining models, and Analysis Services only uses Windows
authentication. Any of the four options can be used to provide the necessary connection.
B. Data Source View
• The next step is to select a data source view. The data source view is a subset of the tables or views; since
you might not require all the tables and views for the project, you can choose
the needed objects in the data source view.
If you have not created a data source before, you can create one from the Data Source View wizard.
Figure 5.4: Select Table & Views
In the data source view, you can select the objects you need from the available objects, and you can filter the
objects. If you have selected tables that have foreign key constraints, you can automatically select the
related tables by selecting Add Related Tables.
C. Data Mining
• Next is to create the data mining structure. As with the other configurations, the data mining structure
is created with the help of a wizard.
The following will be the wizard for the data mining model creation.
Figure 5.5: Select Source to Create Model
In the above dialog box, there are two possible types of source: a relational database or an OLAP
cube.
• Next, the technique or algorithm is selected.
Figure 5.6: Algorithm Selection
Nine data mining algorithms are supported in SQL Server, covering the most popular techniques.
The correct data source view, created earlier, should be selected.
• Next is to choose the Case and Nested options. The case table is the table that contains the entities to
analyze, and the nested table is the table that contains additional information about the case.
Figure 5.7: Case & Nested Option

• Next, split the dataset into training and testing sets. Whenever a data mining model is created, it is important to test the
model with a valid data set. The training data set is used to train the model, while the test data set is used
to test the built model. Figure 5.8 shows the process of configuring the test and training data sets.
Figure 5.8: Training Dataset
Typically, a 70/30 train/test split is used. The input data will be randomly split into two sets, a training
set and a testing set, based on the percentage of data for testing and the maximum number of cases in the testing
data set that you provide. The training set is used to create the mining model. The testing set is used to check
model accuracy.
The percentage of data for testing specifies the percentage of cases reserved for the testing set. The maximum
number of cases in the testing data set limits the total number of cases in the testing set. If both values are
specified, both limits are enforced.
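The interaction of the two limits can be sketched in plain Java (an illustration only; the numbers are invented): the testing set is a random sample whose size is the smaller of the percentage-based size and the maximum-cases cap.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class HoldoutSplitSketch {
    public static void main(String[] args) {
        int totalCases = 10000;
        double percentageForTesting = 30.0;    // "percentage of data for testing"
        int maxCasesInTestSet = 2000;          // "maximum number of cases in testing data set"

        List<Integer> caseIds = new ArrayList<>();
        for (int i = 0; i < totalCases; i++) caseIds.add(i);

        Collections.shuffle(caseIds, new Random(1));   // random split

        // Both limits are enforced: take the smaller of the two sizes.
        int byPercentage = (int) Math.round(totalCases * percentageForTesting / 100.0);
        int testSize = Math.min(byPercentage, maxCasesInTestSet);

        List<Integer> testSet = caseIds.subList(0, testSize);
        List<Integer> trainingSet = caseIds.subList(testSize, totalCases);

        System.out.println("Testing cases:  " + testSet.size());     // 2000 (capped)
        System.out.println("Training cases: " + trainingSet.size()); // 8000
    }
}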
Viva Questions:
1. What is descriptive data mining technique?
2. What is data mining model?
3. Which data file formats are supported by Weka?
4. What is predictive data mining technique?
5. What are the advantages of data mining models?
EXPT. No 6. List all the categorical (or nominal) attributes and the real-valued attributes separately from
the dataset
Aim: To understand the different types of attributes in a data file
Software Required: Weka mining tool
Theory:
Nominal attributes (related to names): The values of a nominal attribute are names of things or some kind of
symbols. The values represent some category or state, which is why nominal attributes are also
referred to as categorical attributes; there is no order (rank or position) among the values of a nominal
attribute.
Real-valued attributes have real numbers as attribute values; examples are temperature, height, and weight.
Practically, real values can only be measured and represented using a finite number of digits. Continuous
attributes are typically represented as floating-point variables.
Procedure:
1. Open the Weka GUI Chooser.
2. Select EXPLORER present in Applications.
3. Select Preprocess Tab.
4. Go to Open file and browse to the file “bank.csv” that is already stored on the system.
5. Clicking on any attribute in the left panel will show the basic statistics on that selected attribute.
Figure 6.1: Preprocess Tab
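The same separation of attribute types can be produced with a short sketch using Weka's Java API (weka.jar on the classpath); the file name bank.csv is assumed to match the file used in the procedure above.

import weka.core.Attribute;
import weka.core.Instances;
import weka.core.converters.CSVLoader;

import java.io.File;

public class ListAttributeTypes {
    public static void main(String[] args) throws Exception {
        // Load the same dataset that the procedure opens in the Preprocess tab.
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("bank.csv"));
        Instances data = loader.getDataSet();

        System.out.println("Nominal (categorical) attributes:");
        for (int i = 0; i < data.numAttributes(); i++) {
            Attribute att = data.attribute(i);
            if (att.isNominal()) System.out.println("  " + att.name());
        }

        System.out.println("Numeric (real-valued) attributes:");
        for (int i = 0; i < data.numAttributes(); i++) {
            Attribute att = data.attribute(i);
            if (att.isNumeric()) System.out.println("  " + att.name());
        }
    }
}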
Viva Questions:
1. What is a Nominal attribute?
2. Give examples of real-valued attributes.
3. How many applications are there in Weka GUI chooser?
4. What is a binary attribute?
5. How do you load files in Weka?
EXPT. No 7. Demonstrate the various preprocessing operations on the dataset
Aim: To understand preprocessing of data in Weka tool
Software Required: Weka mining tool
Theory: The data that is collected from the field contains many unwanted things that lead to wrong
analysis. For example, the data may contain null fields, it may contain columns that are irrelevant to the
current analysis, and so on. Thus, the data must be preprocessed to meet the requirements of the analysis. This
is done in the preprocessing module.
To demonstrate the available features in preprocessing, use the Weather database that is provided in the
installation.
Procedure:
Using the Open file option under the Preprocess tab, select the weather-nominal.arff file.
Figure 7.1: Loading File
After opening the file, the screen appearance will look like the figure 7.2-
Figure 7.2: Current Relation Sub Window
Understanding Data
First look at the highlighted Current relation sub window. It shows the name of the database that is
currently loaded. Two points can be inferred from this sub window −
• There are 14 instances - the number of rows in the table.
• The table contains 5 attributes.
On the left side, notice the Attributes sub window that displays the various fields in the database.
Figure 7.3: Attributes Sub Window
The weather database contains five fields - outlook, temperature, humidity, windy and play. When an
attribute is selected from this list by clicking on it, further details on the attribute itself are displayed on the
right-hand side.
Select the temperature attribute first. Below screen will appear −
Figure 7.4: Selected Attribute Information
In the Selected Attribute sub-window, the following points can be observed:
• The name and the type of the attribute are displayed.
• The type for the temperature attribute is Nominal.
• The number of Missing values is zero.
• There are three distinct values with no unique value.
• The table underneath this information shows the nominal values for this field as hot, mild and cool.
• It also shows the count and weight in terms of a percentage for each nominal value.
At the bottom of the window, the visual representation of the class values can be seen.
Click on the Visualize All button to show all the features in a single window, as in Figure 7.5.
Figure 7.5: Visualization
Removing Attributes
Most of the time, the data used for model building comes with many irrelevant fields. For example, a
customer database may contain the customer's mobile number, which is not relevant for analyzing their credit rating.
Figure 7.6: Attributes Removal
To remove attributes, select them and click on the Remove button at the bottom.
The selected attributes will be removed from the dataset.
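For reference, the same removal can be scripted with Weka's Remove filter. The sketch below is illustrative, assuming the weather-nominal.arff file loaded earlier; the attribute index to drop (3, i.e. humidity in the weather data) is chosen only as an example.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class RemoveAttributeExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather-nominal.arff");

        // Drop the third attribute (1-based index), chosen for illustration.
        Remove remove = new Remove();
        remove.setAttributeIndices("3");
        remove.setInputFormat(data);

        Instances reduced = Filter.useFilter(data, remove);
        System.out.println("Attributes before: " + data.numAttributes());
        System.out.println("Attributes after:  " + reduced.numAttributes());
    }
}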
Applying Filters
Some machine learning techniques, such as association rule mining, require categorical data. To
illustrate the use of filters, use the weather-numeric.arff database, which contains two numeric attributes: temperature and humidity.
Convert these to nominal by applying a filter on raw data. Click on the Choose button in the Filter sub
window and select the following filter −
weka→filters→supervised→attribute→Discretize
Figure 7.7: Applying Filter
Click on the Apply button and examine the temperature and/or humidity attribute. It can be noticed that
these have changed from numeric to nominal types.
Figure 7.8: Attribute Type Changed
Now, save the data by clicking the Save button.
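The same filtering can be scripted through the Weka Java API. The sketch below is illustrative, assuming the weather-numeric.arff file used above; the output file name is chosen only for illustration.

import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.Discretize;

import java.io.File;

public class DiscretizeExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather-numeric.arff");
        // The supervised Discretize filter needs a class attribute (play).
        data.setClassIndex(data.numAttributes() - 1);

        Discretize discretize = new Discretize();
        discretize.setInputFormat(data);
        Instances nominalData = Filter.useFilter(data, discretize);

        // Save the filtered data, as the Save button does in the Explorer.
        ArffSaver saver = new ArffSaver();
        saver.setInstances(nominalData);
        saver.setFile(new File("weather-discretized.arff"));   // output name chosen for illustration
        saver.writeBatch();

        System.out.println("temperature is now nominal: "
                + nominalData.attribute("temperature").isNominal());
    }
}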
Viva Questions:
1. What is data filtering?
2. What is the need to remove some attributes from the database?
3. Which filters are supported by Weka?
4. Which technique is best for filtering data?
5. What are the advantages of data processing?
EXPT. No 8. Demonstrate the concept of the association rule on the dataset using the Apriori algorithm
Aim: To understand association in data mining using Apriori algorithm
Software Required: Weka mining tool
Theory: It was observed that people who buy bread also buy peanut butter at the same time; that is, there is
an association between buying bread and buying peanut butter. Though this may not seem convincing at first, this
association rule was mined from huge supermarket databases.
Finding such associations becomes vital for supermarkets as they would stock peanut butter next to bread so
that customers can locate both items easily resulting in an increased sale for the supermarket.
The Apriori algorithm is one machine learning algorithm that finds such probable associations and creates
association rules. WEKA provides an implementation of the Apriori algorithm. Here, the Apriori algorithm
will be applied to the supermarket data provided in the WEKA installation.
Loading Data
In the WEKA Explorer, open the Preprocess tab, click on the Open file button and
select the supermarket.arff database from the installation folder. After the data is loaded, it looks like Figure
8.1.
Figure 8.1: Supermarket Relation
The database contains 4627 instances and 217 attributes. It is easy to see how difficult it would
be to detect associations manually among such a large number of attributes. However, this task is automated
with the help of the Apriori algorithm.
Associator
Click on the Associate tab and click on the Choose button. Select the Apriori association as shown in the
Figure 8.2.
Figure 8.2: Apriori Association
To set the parameters for the Apriori algorithm, click on its name; a window will pop up, as shown below in
Figure 8.3, that allows you to set the parameters:
Figure 8.3: Parameters for Apriori Association
After setting the parameters, click the Start button. After a while, the results will appear as shown in
Figure 8.4.
Figure 8.4: Rules of Association Found
At the bottom, the best association rules detected are listed. This will help the supermarket in
stocking its products on appropriate shelves.
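For reference, the same run can be reproduced with the Weka Java API. The sketch below is illustrative, assuming supermarket.arff is in the working directory; the parameter values (10 rules, minimum support 0.1, minimum confidence 0.9) are example settings of the options shown in Figure 8.3, not prescribed values.

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("supermarket.arff");

        Apriori apriori = new Apriori();
        apriori.setNumRules(10);               // number of best rules to report
        apriori.setLowerBoundMinSupport(0.1);  // minimum support
        apriori.setMinMetric(0.9);             // minimum confidence

        apriori.buildAssociations(data);
        System.out.println(apriori);           // prints the best rules found
    }
}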
Viva Questions:
1. What do you mean by the term Association?
2. What is the Apriori algorithm?
3. Explain the concept of association with real-life examples.
4. Name two association algorithms.
5. How do association rules help in improving business?
EXPT. No 9. Demonstrate the clustering rule process on the dataset using a simple K-means algorithm
Aim: To understand clustering in data mining using simple K-means algorithm
Software Required: Weka mining tool
Theory:
Cluster Analysis
Clustering algorithms are unsupervised learning algorithms used to create groups of data with similar
characteristics. They aggregate objects with similarities into groups and subgroups, thus leading to the
partitioning of datasets. Cluster analysis is the process of partitioning datasets into subsets; these subsets
are called clusters, and the set of clusters is called a clustering.
Cluster analysis is used in many applications such as image recognition, pattern recognition, web search, and
security, and in business intelligence, for example the grouping of customers with similar likings.
K-means Clustering
K-means clustering is the simplest clustering algorithm. In the K-means algorithm, the dataset is
partitioned into K clusters. An objective function is used to measure the quality of the partitions, so that similar
objects fall in one cluster and dissimilar objects in other clusters.
In this method, each cluster is represented by its centroid. The centroid is taken as the center of
the cluster and is calculated as the mean value of the points within the cluster. The quality of the clustering is
then found by measuring the Euclidean distance between each point and its cluster center; the sum of these
squared distances should be as small as possible.
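In other words, K-means tries to minimize the within-cluster sum of squared Euclidean distances,

E = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2

where C_k is the k-th cluster and \mu_k is its centroid, the mean of the points in C_k.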
Procedure:
1. Open WEKA Explorer and click on Open File in the Preprocess tab. Choose dataset “vote.arff”.
Figure 9.1: Loading Data file
2. Go to the “Cluster” tab and click on the “Choose” button. Select the clustering method as
“SimpleKMeans”.
Figure 9.2: Cluster Tab
3. Choose Settings and then set the following fields:
• Distance function as Euclidean
• The number of clusters as 6. With a greater number of clusters, the sum of squared errors will reduce.
Click on OK and start the algorithm.
Figure 9.3: Kmeans Settings
4. Click on Start in the left panel. The algorithm displays results on the white screen.
5. Choose “Classes to clusters evaluation” and click on Start.
Figure 9.4: SimpleKMeans Information
The algorithm will assign a class label to each cluster. Cluster 0 represents republican and Cluster 3
represents democrat. The proportion of incorrectly clustered instances is 39.77%, which can be reduced by ignoring the
unimportant attributes.
Figure 9.5: Cluster Tab
6. To ignore the unimportant attributes, click on the “Ignore attributes” button and select the
attributes to be removed.
7. Use the “Visualize” tab to visualize the Clustering algorithm result. Go to the tab and click on any
box. Move the Jitter to the max.
• The X-axis and Y-axis represent the attributes.
• The blue color represents the class label democrat and the red color represents the class label republican.
Jitter is used to view clusters.
Click the box on the right-hand side of the window to change the x-coordinate attribute and view the
clustering with respect to other attributes.
Figure 9.6: Clustering Visualization
K-means clustering is a simple cluster analysis method. The number of clusters can be set using the settings
tab. The centroid of each cluster is calculated as the mean of all points within the cluster. With an increase
in the number of clusters, the sum of squared errors is reduced. The objects within a cluster exhibit similar
characteristics and properties. The clusters represent the class labels.
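The same clustering run can be scripted through the Weka Java API. The sketch below is illustrative, assuming vote.arff is in the working directory; it builds SimpleKMeans with 6 clusters (using the default Euclidean distance) and performs the classes-to-clusters evaluation described in step 5.

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class KMeansExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("vote.arff");
        data.setClassIndex(data.numAttributes() - 1);   // the class is the last attribute

        // The clusterer itself must not see the class attribute.
        Remove remove = new Remove();
        remove.setAttributeIndices("" + (data.classIndex() + 1));
        remove.setInputFormat(data);
        Instances dataNoClass = Filter.useFilter(data, remove);

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(6);
        kmeans.buildClusterer(dataNoClass);
        System.out.println("Within-cluster sum of squared errors: " + kmeans.getSquaredError());

        // Classes-to-clusters evaluation, as in step 5 of the procedure.
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(kmeans);
        eval.evaluateClusterer(data);
        System.out.println(eval.clusterResultsToString());
    }
}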
Viva Questions:
1. What is clustering?
2. Which algorithms are available for clustering?
3. Why do we perform clustering on the data set?
4. What is a simple K-means algorithm?
5. What is Euclidean distance?
EXPT. No 10. Demonstrate the classification rule process on the dataset using the Naïve Bayes algorithm
Aim: To understand classification in data mining using Naïve Bayes algorithm
Software Required: Weka
Theory:
Naïve Bayes Algorithm: It is a classification technique based on Bayes’ Theorem with an assumption of
independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a
particular feature in a class is unrelated to the presence of any other feature.
For example, a fruit may be an apple if it is red, round, and about 3 inches in diameter. Even if these features
depend on each other or upon the existence of the other features, all these properties independently
contribute to the probability that this fruit is an apple and that is why it is known as ‘Naive’.
The Naive Bayes model is easy to build and particularly useful for very large data sets. Along with its simplicity,
Naive Bayes can perform surprisingly well even against highly sophisticated classification methods.
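Formally, for a class C and feature values x_1, ..., x_n, the classifier applies Bayes' theorem together with the naive independence assumption,

P(C \mid x_1, \ldots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C),

and predicts the class C for which the right-hand side is largest.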
Procedure:
1. First, load the required dataset into the Weka tool using the Open file option. Here, the weather-nominal
dataset is selected.
2. Now, go to the Classify tab at the top left, click on the Choose button and select the NaiveBayes
algorithm.
Figure 10.1: Naïve Bayes
3. To change the parameters, click on the field to the right of the Choose button; the default values are
used in this example.
Figure 10.2: Naïve Bayes Settings
4. Choose Percentage split as the test method from the “Test options” in the main panel.
Since there is no separate test data set, use a percentage split of 66 percent to get a good
idea of the model’s accuracy.
Figure 10.3: Test Options
5. To generate the model, now click Start. When the model is done, the evaluation statistics will
appear in the right panel.
Figure 10.4: Evaluation Statistics
The model’s classification accuracy is about 60%. This suggests that the accuracy could be improved by making
some modifications.
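For reference, the same experiment can be reproduced with the Weka Java API. The sketch below is illustrative, assuming the weather-nominal.arff file used above; the 66 percent split mirrors step 4, and the random seed is arbitrary.

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

import java.util.Random;

public class NaiveBayesExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather-nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);   // "play" is the class attribute

        // Percentage split: 66% for training, the rest for testing.
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.66);
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);

        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(nb, test);
        System.out.println(eval.toSummaryString("\n=== Evaluation on test split ===\n", false));
        System.out.println("Accuracy: " + eval.pctCorrect() + " %");
    }
}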
Viva Questions:
1. What is Naïve Bayes algorithm?
2. What is classification?
3. What are the ways to split the datasets?
4. Which algorithm is best for clustering?
5. Naïve Bayes is suitable for what kind of datasets?