CHAMELI DEVI GROUP OF INSTITUTIONS, INDORE
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

LABORATORY MANUAL
DATA MINING AND WAREHOUSING (CS 705)
VII SEM (CSE)
2023-24

CERTIFICATE

This is to certify that Mr./Ms. ……………………………………………………………………….. with RGTU Enrollment No. ..………………………….. has satisfactorily completed the course of experiments in the Data Mining laboratory, as prescribed by Rajiv Gandhi Proudyogiki Vishwavidyalaya, Bhopal, for the VII Semester of the Computer Science and Engineering Department during the year 2023-24.

Signature of Faculty In-Charge

LIST OF EXPERIMENTS

Student Name: ……………………………………    Enrollment No.: ……………………………………
(The record table provides blank Conduction, Staff Signature and Date columns against each experiment number.)

1. To acquire the knowledge about Weka tool on sample data sets.
2. Design the cube by identifying measures and dimensions for Star schema and Snowflake schema.
3. Design and create cube by identifying measures and dimensions using storage mode MOLAP, ROLAP and HOLAP.
4. Study about the browsing and processing of cube’s data.
5. Design data mining models using analysis services of SQL Server.
6. List all the categorical (or nominal) attributes and the real-valued attributes separately from the dataset.
7. Demonstrate the various preprocessing operations on the dataset.
8. Demonstrate the concept of the association rule on the dataset using the Apriori algorithm.
9. Demonstrate the clustering rule process on the dataset using a simple K-means algorithm.
10. Demonstrate the classification rule process on the dataset using the Naïve Bayes algorithm.

EXPT. No 1. To acquire the knowledge about Weka tool on sample data sets

Aim: To acquire knowledge about the Weka tool on sample datasets.

Software Required: Weka tool

Theory:
The Weka Knowledge Explorer is an easy-to-use graphical user interface that harnesses the power of the Weka software. Each of the major Weka packages (filters, classifiers, clusterers, associators and attribute selection) is represented in the Explorer, along with a visualization tool that allows datasets and the predictions of classifiers and clusterers to be visualized in two dimensions. The following panels are available in the Weka tool:

Preprocess panel
The Preprocess panel is the starting point for knowledge exploration. From this panel you can load datasets, browse the characteristics of attributes and apply any combination of Weka's unsupervised filters to the data.

Classifier panel
The Classifier panel allows you to configure and execute any of the Weka classifiers on the current dataset. You can choose to perform cross-validation or to test on a separate dataset. Classification errors can be visualized in a pop-up data visualization tool. If the classifier produces a decision tree, it can be displayed graphically in a pop-up tree visualizer.

Cluster panel
From the Cluster panel you can configure and execute any of the Weka clusterers on the current dataset. Clusters can be visualized in a pop-up data visualization tool.

Associate panel
From the Associate panel you can mine the current dataset for association rules using the Weka associators.

Select attributes panel
This panel allows you to configure and apply any combination of Weka attribute evaluator and search method to select the most pertinent attributes in the dataset. If an attribute selection scheme transforms the data, the transformed data can be visualized in a pop-up data visualization tool.
Visualize panel
This panel displays a scatter plot matrix for the current dataset. The size of the individual cells and the size of the points they display can be adjusted using the slider controls at the bottom of the panel. This panel allows you to visualize the current dataset in one and two dimensions. When the coloring attribute is discrete, each value is displayed as a different color; when the coloring attribute is continuous, a spectrum is used to indicate the value.

Procedure:
1. Start the Weka GUI Chooser.
2. Launch the Weka Explorer by clicking the “Explorer” button.

Figure 1.1: Screenshot of the Weka Explorer

3. Click the “Open file…” button.
4. Navigate to your current working directory, change “Files of Type” to “CSV data files (*.csv)”, select your file and click the “Open” button.
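The same loading step can also be scripted with the Weka Java API. The sketch below is a supplementary illustration only (it is not part of the prescribed GUI procedure); it assumes weka.jar is on the classpath and uses a placeholder file name.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadDataset {
    public static void main(String[] args) throws Exception {
        // DataSource picks the right loader (ARFF, CSV, ...) from the file extension.
        DataSource source = new DataSource("bank.csv");   // placeholder path to a sample dataset
        Instances data = source.getDataSet();

        // Print the same summary information shown in the Preprocess panel.
        System.out.println("Relation  : " + data.relationName());
        System.out.println("Instances : " + data.numInstances());
        System.out.println("Attributes: " + data.numAttributes());
    }
}
```

Printing the relation name, instance count and attribute count mirrors the Current relation sub window of the Preprocess panel.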
Viva Questions:
1. What is data mining?
2. Why do we pre-process the data?
3. What are the steps of data pre-processing?
4. What is a data warehouse?
5. In which language is the Weka tool programmed?

EXPT. No 2. To design the cube by identifying measures and dimensions for Star schema and Snowflake schema

Aim: To understand the Star and Snowflake schemas and their operations.

Software Required: Analysis Services, SQL Server

Theory:
Star schema: In the Star schema, the center of the star can have one fact table and a number of associated dimension tables. It is known as a star schema because its structure resembles a star. The star schema is the simplest type of data warehouse schema. It is also known as the Star Join Schema and is optimized for querying large data sets.

Snowflake schema: A Snowflake schema is a logical arrangement of tables in a multidimensional database such that the ER diagram resembles a snowflake shape. It is an extension of the Star schema in which the dimension tables are normalized, which splits the data into additional tables.

Data cube: Data that is grouped or combined into multidimensional matrices is called a data cube. The data cube method has a few alternative names or variants, such as "multidimensional databases", "materialized views" and "OLAP (On-Line Analytical Processing)". The general idea of this approach is to materialize certain expensive computations that are frequently queried. A data cube can also be described as the multidimensional extension of two-dimensional tables. It can be viewed as a collection of identical 2-D tables stacked upon one another. Data cubes are used to represent data that is too complex to be described by a table of columns and rows; as such, data cubes can go far beyond 3-D to include many more dimensions.

The four types of analytical operations in OLAP are:
1. Roll-up
2. Drill-down
3. Slice and dice
4. Pivot (rotate)

Procedure:
1. In the Solution Explorer, right-click Cubes and select New Cube.

Figure 2.1: Solution Explorer

2. In Select Measure Group Tables, select the FactResellerSales table. Measure group tables are used to include the table with the data to measure. A measure can be the number of sales, the amount sold, freight, etc.

Figure 2.2: Data Source View

3. Select the data to measure. We will uncheck the keys and check the other attributes to measure.

Figure 2.3: Measure Selection

4. Select the dimensions that you want to add to the cube.

Figure 2.4: Dimension Selection

5. You can also add the fact table as a dimension (degenerate dimension). In this example, we will not add it.

Figure 2.5: New Dimension Selection

6. Once the cube is configured, press Finish.

Figure 2.6: Cube Creation

The cube is created with the fact table (in yellow) and the dimensions (in blue).

Viva Questions:
1. What is a cube?
2. What is a dimension table?
3. What is a fact table?
4. How is the star schema different from the snowflake schema?
5. Explain the concept of the star schema.

EXPT. No 3. To design the cube by identifying measures and dimensions using storage mode MOLAP, ROLAP and HOLAP

Aim: To understand the various storage modes, i.e., MOLAP, ROLAP and HOLAP, and their working.

Software Required: Analysis Services, SQL Server

Theory:
Partition storage: Physical storage options affect the performance, storage requirements and storage locations of partitions and their parent measure groups and cubes. A partition can have one of three basic storage modes:
1. Multidimensional OLAP (MOLAP)
2. Relational OLAP (ROLAP)
3. Hybrid OLAP (HOLAP)

Microsoft SQL Server Analysis Services (SSAS) supports all three basic storage modes. It also supports proactive caching, which enables you to combine the characteristics of ROLAP and MOLAP storage for both immediacy of data and query performance. You can configure the storage mode and proactive caching options in one of three ways.

MOLAP: MOLAP uses array-based multidimensional storage engines to provide multidimensional views of data. Basically, it uses an OLAP cube.

ROLAP: ROLAP works with data that exists in a relational database. Facts and dimension tables are stored as relational tables. It also allows multidimensional analysis of data and is the fastest-growing type of OLAP. Advantages of the ROLAP model:
High data efficiency: it offers high data efficiency because query performance and access language are optimized particularly for multidimensional data analysis.
Scalability: this type of OLAP system offers scalability for managing large volumes of data, even when the data is steadily increasing.

HOLAP: Hybrid OLAP is a mixture of both ROLAP and MOLAP. It offers the fast computation of MOLAP and the higher scalability of ROLAP. HOLAP uses two databases:
1. Aggregated or computed data is stored in a multidimensional OLAP cube.
2. Detailed information is stored in a relational database.
Advantages of Hybrid OLAP: this kind of OLAP helps to economize disk space, and it also remains compact, which helps to avoid issues related to access speed and convenience. HOLAP uses cube technology, which allows faster performance for all types of data.

Figure 3.1 shows the initial Design Storage choice from the Tools menu, along with the three data storage options from which you can select.

Figure 3.1: Storage Design Wizard: MOLAP, ROLAP, or HOLAP

Viva Questions:
1. What is a model in data mining?
2. Explain the procedure to mine an OLAP cube.
3. What are the advantages of the HOLAP model?
4. What are the different data warehouse models?
5. Explain the basic OLAP operations.
EXPT. No 4. Study about the browsing and processing of cube’s data

Aim: To understand the operations on a data cube.

Software Required: Analysis Services, SQL Server

Theory:
Browsing cube data: Browsing cube data is a way to check work incrementally; it lets you verify that small changes to properties, relationships and other objects have the desired effect once the object is processed. While the Browser tab is used to view both cube and dimension data, it provides different capabilities depending on the object being browsed. For cubes, the Browser tab provides two approaches for exploring data. The built-in MDX Query Designer can be used to build queries that return a flattened row set from a multidimensional database. Alternatively, an Excel shortcut can be used: when Excel is started from SQL Server Data Tools, it opens with a PivotTable already in the worksheet and a predefined connection to the model workspace database.

Processing cube data: The main OLAP operations on a cube are the following.
Drill-down: In the drill-down operation, less detailed data is converted into more detailed data. It can be done by moving down the concept hierarchy or by adding a new dimension.
Roll-up: It is just the opposite of the drill-down operation. It performs aggregation on the OLAP cube. It can be done by climbing up the concept hierarchy or by reducing the dimensions.
Dice: It selects a sub-cube from the OLAP cube by selecting two or more dimensions.
Slice: It selects a single dimension from the OLAP cube, which results in a new sub-cube.
Pivot: It is also known as the rotation operation, as it rotates the current view to get a new view of the representation. Performing a pivot operation on the sub-cube obtained after a slice operation gives a new view of it.
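To make roll-up and slice concrete before working in Analysis Services, the following toy sketch aggregates a tiny in-memory fact table in Java. It is a conceptual illustration only, not Analysis Services code; the dimension and measure names are invented for the example, and the record syntax requires Java 16 or later.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Conceptual illustration of roll-up and slice on a tiny in-memory fact table.
// This is not SQL Server Analysis Services code; the data is invented for the example.
public class OlapOperationsDemo {

    // One fact row: two dimensions (year, region) and one measure (sales).
    record Fact(int year, String region, double sales) {}

    public static void main(String[] args) {
        Fact[] facts = {
            new Fact(2022, "East", 100), new Fact(2022, "West", 150),
            new Fact(2023, "East", 120), new Fact(2023, "West", 180)
        };

        // Roll-up: climb the hierarchy by dropping the region dimension
        // and aggregating the sales measure per year.
        Map<Integer, Double> salesByYear = new LinkedHashMap<>();
        for (Fact f : facts) {
            salesByYear.merge(f.year(), f.sales(), Double::sum);
        }
        System.out.println("Roll-up (total sales by year): " + salesByYear);

        // Slice: fix a single dimension value (year = 2023) to obtain a sub-cube.
        System.out.println("Slice (year = 2023):");
        for (Fact f : facts) {
            if (f.year() == 2023) {
                System.out.println("  " + f.region() + " -> " + f.sales());
            }
        }
    }
}
```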
To browse the deployed cube:
1. Switch to the Dimension Designer for the Product dimension in SQL Server Data Tools. To do this, double-click the Product dimension in the Dimensions node of Solution Explorer.
2. Click the Browser tab to display the All member of the Product Key attribute hierarchy.
3. Switch to the Cube Designer in SQL Server Data Tools. To do this, double-click the Analysis Services Tutorial cube in the Cubes node of Solution Explorer.
4. Select the Browser tab, and then click the Reconnect icon on the toolbar of the designer. The left pane of the designer shows the objects in the Analysis Services Tutorial cube. On the right side of the Browser tab there are two panes: the upper pane is the Filter pane and the lower pane is the Data pane.

To process an Analysis Services cube:
1. Connect to the SQL Server Analysis Services instance using SQL Server Management Studio.
2. In the Object Explorer pane, expand the Analysis Services instance, expand Databases and then expand the Analysis Services database that contains the cube which needs to be processed.
3. Right-click the cube to be processed and then click the Process option from the drop-down list, as shown in Figure 4.1.

Figure 4.1: Object Explorer

4. In the Process Cube dialog box, in the Process Options column under the Object list, verify that the Process Full option is selected, as highlighted in Figure 4.2. If it is not selected, then under Process Options click the option and select Process Full from the drop-down list.
5. To verify the settings for the processing batch, click the Change Settings button in the Process Cube dialog box shown in Figure 4.2. The settings you specify in the Change Settings dialog box will override the default settings inherited from the Analysis Services database for all the objects listed in the Process dialog box.

Figure 4.2: Process Cube Window

In the Processing Options tab there are different processing options available:
Parallel: use this option to process all the objects in a single transaction.
Sequential: use this option to process the objects sequentially.
Writeback Table Option: choose the option which is used to manage a writeback table. There are three options available: Create, Create always and Use existing.
Affected Objects: processing affected objects will process all the objects that have a dependency on the selected objects.

Figure 4.3: Settings for Processing

6. In the Dimension Key Errors tab under Change Settings you can use either the default error configuration or a custom error configuration. Click OK to save the changes and return to the Process Cube dialog box.

Figure 4.4: Error Configurations

7. Finally, to process the cube, click OK in the Process Cube dialog box. Once the cube processing has completed successfully, click the Close button in the Process Progress dialog box to complete the cube processing.

Figure 4.5: Processing Cube

Viva Questions:
1. Which operation improves cube processing speed?
2. What is the pivot operation?
3. What is filtering of data?
4. What is a sub-cube?
5. What is the drill-down operation?

EXPT. No 5. Design data mining models using analysis services of SQL Server

Aim: To understand data mining models.

Software Required: Analysis Services, SQL Server

Theory:
Data mining refers to the discovery and extraction of patterns and knowledge from large sets of structured and unstructured data.

Data mining models: A data mining model is created by applying an algorithm on top of the raw data. The mining model is more than the algorithm or a metadata handler; it is the set of data, patterns and statistics that can be applied to new data to generate predictions and to draw inferences about relationships. The following are some of the techniques that are used in data mining:

1. Descriptive data mining technique: This technique is generally preferred to generate cross-tabulations, correlations, frequencies, etc. Descriptive data mining techniques are used to obtain information on the regularity of the data, using the raw data as input to discover important patterns. They are also used to identify interesting groups within the raw data.

2. Predictive data mining technique: The main objective of the predictive mining technique is to predict future outcomes rather than describe the current tendency. Many functions are used for the prediction of the target value. The techniques that fall under this category are classification, regression and time-series analysis.
Predictive analysis requires a data model, which uses some variables to predict the unknown future values of other variables.

Procedure:
Open Microsoft Visual Studio and create a Multidimensional project under Analysis Services by selecting the Analysis Services Multidimensional and Data Mining project type. The Solution Explorer for the created project is shown below.

Figure 5.1: Solution Explorer

For data mining, we will be using three nodes: Data Sources, Data Source Views and Data Mining.

A. Data Sources
We need to configure the data source for the project, as shown in Figure 5.2. The data source makes a connection to the sample database, AdventureWorksDW2017.

Figure 5.2: Select Database

After providing the credentials for the source database, the next step is to provide the credentials that the Analysis Services instance uses to connect to the database.

Figure 5.3: Analysis Service Credentials

Analysis Services will be used to store the data mining models, and Analysis Services uses only Windows authentication. Any of the four options can be used to provide the necessary connection.

B. Data Source View
The next step is to select a data source view. The data source view is a subset of the tables or views. Since you might not require all the tables and views for the project, you can choose the needed objects in the data source view. If you have not created a data source before, you can create one from the Data Source View wizard.

Figure 5.4: Select Table & Views

In the data source view, you can select the objects you need from the available objects, and you can filter the objects. If you have selected tables that have foreign key constraints, you can automatically select the related tables by selecting Add Related Tables.

C. Data Mining
Next is to create the data mining structure. Similar to the other configurations, the data mining structure is created with the help of a wizard. The wizard for data mining model creation is shown below.

Figure 5.5: Select Source to Create Model

In the above dialog box there are two types of sources: a relational database or an OLAP cube. Next, the technique or algorithm is selected.

Figure 5.6: Algorithm Selection

Nine data mining algorithms are supported in SQL Server; choose the most suitable one for the analysis. The correct data source view, created earlier, should then be selected. Next is to choose the Case and Nested options. The case table is the table that contains the entities to analyze, and the nested table is the table that contains additional information about the case.

Figure 5.7: Case & Nested Option

Training the dataset: Whenever a data mining model is created, it is important to test the model with a valid data set. The training data set is used to train the model, while the test data set is used to test the built model. Figure 5.8 shows how to configure the test and training data sets.

Figure 5.8: Training Dataset

Typically, a 70/30 split is used for the training/testing data sets. The input data is randomly split into two sets, a training set and a testing set, based on the percentage of data for testing and the maximum number of cases in the testing data set that you provide. The training set is used to create the mining model. The testing set is used to check the model's accuracy. The percentage of data for testing specifies the percentage of cases reserved for the testing set. The maximum number of cases in the testing data set limits the total number of cases in the testing set. If both values are specified, both limits are enforced.
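The hold-out split itself is tool-independent. As a supplementary illustration (outside the SSAS wizard), the sketch below shows how the same 70/30 split could be produced with the Weka Java API used in the later experiments; the file name is a placeholder and weka.jar is assumed to be on the classpath.

```java
import java.util.Random;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class HoldoutSplit {
    public static void main(String[] args) throws Exception {
        // Load any dataset; the file name here is only a placeholder.
        Instances data = new DataSource("cases.arff").getDataSet();

        // Shuffle before splitting so both sets are representative.
        data.randomize(new Random(1));

        // 70% of the cases for training, the remaining 30% for testing.
        int trainSize = (int) Math.round(data.numInstances() * 0.7);
        int testSize  = data.numInstances() - trainSize;
        Instances train = new Instances(data, 0, trainSize);
        Instances test  = new Instances(data, trainSize, testSize);

        System.out.println("Training cases: " + train.numInstances());
        System.out.println("Testing cases : " + test.numInstances());
    }
}
```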
Viva Questions:
1. What is the descriptive data mining technique?
2. What is a data mining model?
3. Which data file formats are supported by Weka?
4. What is the predictive data mining technique?
5. What are the advantages of data mining models?

EXPT. No 6. List all the categorical (or nominal) attributes and the real-valued attributes separately from the dataset

Aim: To understand the different types of attributes in a data file.

Software Required: Weka mining tool

Theory:
Nominal attributes (related to names): The values of a nominal attribute are names of things, some kind of symbols. Values of nominal attributes represent some category or state, which is why nominal attributes are also referred to as categorical attributes; there is no order (rank or position) among the values of a nominal attribute.
Real-valued attributes have real numbers as attribute values. Examples: temperature, height or weight. Practically, real values can only be measured and represented using a finite number of digits. Continuous attributes are typically represented as floating-point variables.

Procedure:
1. Open the Weka GUI Chooser.
2. Select EXPLORER present in Applications.
3. Select the Preprocess tab.
4. Go to Open file and browse to the file that is already stored in the system, “bank.csv”.
5. Clicking on any attribute in the left panel will show the basic statistics of the selected attribute.

Figure 6.1: Preprocess Tab
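The same separation of attribute types can be produced programmatically. The following supplementary sketch, assuming weka.jar is on the classpath and the same bank.csv file, iterates over the attributes and prints the nominal and numeric ones separately.

```java
import weka.core.Attribute;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ListAttributeTypes {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("bank.csv").getDataSet();

        System.out.println("Nominal (categorical) attributes:");
        for (int i = 0; i < data.numAttributes(); i++) {
            Attribute a = data.attribute(i);
            if (a.isNominal()) System.out.println("  " + a.name());
        }

        System.out.println("Numeric (real-valued) attributes:");
        for (int i = 0; i < data.numAttributes(); i++) {
            Attribute a = data.attribute(i);
            if (a.isNumeric()) System.out.println("  " + a.name());
        }
    }
}
```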
Viva Questions:
1. What is a nominal attribute?
2. Give examples of real-valued attributes.
3. How many applications are there in the Weka GUI Chooser?
4. What is a binary attribute?
5. How do you load files in Weka?

EXPT. No 7. Demonstrate the various preprocessing operations on the dataset

Aim: To understand preprocessing of data in the Weka tool.

Software Required: Weka mining tool

Theory:
The data that is collected from the field contains many unwanted things that lead to wrong analysis. For example, the data may contain null fields, it may contain columns that are irrelevant to the current analysis, and so on. Thus, the data must be preprocessed to meet the requirements of analysis. This is done in the preprocessing module. To demonstrate the available preprocessing features, use the weather database that is provided with the installation.

Procedure:
Using the Open file option under the Preprocess tab, select the weather-nominal.arff file.

Figure 7.1: Loading File

After opening the file, the screen will look like Figure 7.2.

Figure 7.2: Current Relation Sub Window

Understanding Data
First look at the highlighted Current relation sub window. It shows the name of the database that is currently loaded. Two points can be inferred from this sub window:
1. There are 14 instances, i.e. the number of rows in the table.
2. The table contains 5 attributes.
On the left side, notice the Attributes sub window that displays the various fields in the database.

Figure 7.3: Attributes Sub Window

The weather database contains five fields: outlook, temperature, humidity, windy and play. When an attribute is selected from this list by clicking on it, further details on the attribute itself are displayed on the right-hand side. Select the temperature attribute first; the screen shown in Figure 7.4 will appear.

Figure 7.4: Selected Attribute Information

In the Selected Attribute sub window, the following points can be observed:
1. The name and the type of the attribute are displayed. The type of the temperature attribute is Nominal.
2. The number of missing values is zero.
3. There are three distinct values with no unique value.
4. The table underneath this information shows the nominal values for this field as hot, mild and cool. It also shows the count and weight, in terms of a percentage, for each nominal value.
At the bottom of the window, the visual representation of the class values can be seen. Clicking the Visualize All button shows all the features in one single window, as in Figure 7.5.

Figure 7.5: Visualization

Removing Attributes
Most of the time, the data used for model building comes with many irrelevant fields. For example, a customer database may contain the customer's mobile number, which is not relevant for analyzing the customer's credit rating.

Figure 7.6: Attributes Removal

To remove attributes, select them and click the Remove button at the bottom. The selected attributes are removed from the database.

Applying Filters
Some of the machine learning techniques, such as association rule mining, require categorical data. To illustrate the use of filters, use the weather-numeric.arff database, which contains two numeric attributes, temperature and humidity. Convert these to nominal by applying a filter on the raw data. Click on the Choose button in the Filter sub window and select the following filter:
weka→filters→supervised→attribute→Discretize

Figure 7.7: Applying Filter

Click on the Apply button and examine the temperature and/or humidity attribute. It can be noticed that these have changed from numeric to nominal types.

Figure 7.8: Attribute Type Changed

Now, save the data by clicking the Save button.
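The same discretization can be scripted with the Weka Java API. This is a supplementary sketch, assuming weka.jar is on the classpath; the numeric weather file is named weather.numeric.arff in recent Weka distributions, so adjust the path to match your installation.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.Discretize;

public class DiscretizeWeather {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("weather.numeric.arff").getDataSet();

        // The supervised Discretize filter needs a class attribute; use the last one (play).
        data.setClassIndex(data.numAttributes() - 1);

        Discretize discretize = new Discretize();
        discretize.setInputFormat(data);
        Instances nominalData = Filter.useFilter(data, discretize);

        // temperature and humidity are now nominal.
        System.out.println(nominalData);
    }
}
```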
Viva Questions:
1. What is data filtering?
2. What is the need to remove some attributes from the database?
3. Which filters are supported by Weka?
4. Which technique is best for filtering data?
5. What are the advantages of data preprocessing?

EXPT. No 8. Demonstrate the concept of the association rule on the dataset using the Apriori algorithm

Aim: To understand association in data mining using the Apriori algorithm.

Software Required: Weka mining tool

Theory:
It was observed that people who buy bread also buy peanut butter at the same time; that is, there is an association between buying bread and buying peanut butter. Though this may not seem convincing at first, this association rule was mined from huge supermarket databases. Finding such associations becomes vital for supermarkets, as they would stock peanut butter next to bread so that customers can locate both items easily, resulting in increased sales for the supermarket. The Apriori algorithm is one such algorithm in machine learning that finds the probable associations and creates association rules. WEKA provides an implementation of the Apriori algorithm. Here, the Apriori algorithm will be applied to the supermarket data provided with the WEKA installation.

Loading Data
In the WEKA Explorer, open the Preprocess tab, click on the Open file button and select the supermarket.arff database from the installation folder. After the data is loaded it looks like Figure 8.1.

Figure 8.1: Supermarket Relation

The database contains 4627 instances and 217 attributes. It is easy to see how difficult it would be to detect associations between such a large number of attributes manually. However, this task is automated with the help of the Apriori algorithm.

Associator
Click on the Associate tab and click on the Choose button. Select the Apriori association as shown in Figure 8.2.

Figure 8.2: Apriori Association

To set the parameters for the Apriori algorithm, click on its name; a window will pop up, as shown in Figure 8.3, that allows you to set the parameters.

Figure 8.3: Parameters for Apriori Association

After setting the parameters, click the Start button. After a while the results will appear as shown in Figure 8.4.

Figure 8.4: Rules of Association Found

At the bottom, the best detected association rules are listed. This will help the supermarket in stocking their products on appropriate shelves.
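The same run can be reproduced with the Weka Java API. A supplementary sketch follows, assuming weka.jar is on the classpath and the supermarket.arff file from the Weka data folder; the number of rules is set explicitly just to mirror a typical Explorer run.

```java
import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriRules {
    public static void main(String[] args) throws Exception {
        // supermarket.arff ships in Weka's data directory.
        Instances data = new DataSource("supermarket.arff").getDataSet();

        Apriori apriori = new Apriori();
        apriori.setNumRules(10);          // report the 10 best rules
        apriori.buildAssociations(data);

        // toString() prints the discovered association rules,
        // as seen in the Associate tab output.
        System.out.println(apriori);
    }
}
```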
Viva Questions:
1. What do you mean by the term association?
2. What is the Apriori algorithm?
3. Explain the concept of association with real-life examples.
4. Name two association algorithms.
5. How do association rules help in improving business?

EXPT. No 9. Demonstrate the clustering rule process on the dataset using a simple K-means algorithm

Aim: To understand clustering in data mining using the simple K-means algorithm.

Software Required: Weka mining tool

Theory:
Cluster Analysis
Clustering algorithms are unsupervised learning algorithms used to create groups of data with similar characteristics. They aggregate objects with similarities into groups and subgroups, thus leading to the partitioning of datasets. Cluster analysis is the process of partitioning datasets into subsets; these subsets are called clusters, and the set of clusters is called a clustering. Cluster analysis is used in many applications such as image recognition, pattern recognition, web search and security, and in business intelligence, for example the grouping of customers with similar likings.

K-means Clustering
K-means clustering is the simplest clustering algorithm. In the K-means algorithm, the dataset is partitioned into K clusters. An objective function is used to measure the quality of the partitions, so that similar objects end up in one cluster and dissimilar objects in different clusters. In this method, a centroid is found to represent each cluster. The centroid is taken as the center of the cluster and is calculated as the mean value of the points within the cluster. The quality of the clustering is then measured by the Euclidean distance between each point and its cluster center; a good clustering keeps this within-cluster distance as small as possible.

Procedure:
1. Open the WEKA Explorer and click on Open File in the Preprocess tab. Choose the dataset “vote.arff”.

Figure 9.1: Loading Data file

2. Go to the “Cluster” tab and click on the “Choose” button. Select the clustering method “SimpleKMeans”.

Figure 9.2: Cluster Tab

3. Choose Settings and then set the following fields: the distance function as Euclidean and the number of clusters as 6. With a greater number of clusters, the sum of squared errors will reduce. Click OK and start the algorithm.

Figure 9.3: Kmeans Settings

4. Click on Start in the left panel. The algorithm displays the results on the white screen.
5. Choose “Classes to clusters evaluation” and click on Start.

Figure 9.4: SimpleKMeans Information

The algorithm will assign a class label to each cluster. Cluster 0 represents republican and Cluster 3 represents democrat. The proportion of incorrectly clustered instances is 39.77%, which can be reduced by ignoring the unimportant attributes.

Figure 9.5: Cluster Tab

6. To ignore the unimportant attributes, click on the “Ignore attributes” button and select the attributes to be removed.
7. Use the “Visualize” tab to visualize the clustering result. Go to the tab and click on any box. Move the Jitter slider towards the maximum. The X-axis and Y-axis represent attributes. The blue color represents the class label democrat and the red color represents the class label republican. Jitter is used to view the clusters. Click the box on the right-hand side of the window to change the x-coordinate attribute and view the clustering with respect to other attributes.

Figure 9.6: Clustering Visualization

K-means clustering is a simple cluster analysis method. The number of clusters can be set in the settings tab. The centroid of each cluster is calculated as the mean of all points within the cluster. With an increase in the number of clusters, the sum of squared errors is reduced. The objects within a cluster exhibit similar characteristics and properties, and the clusters represent the class labels.
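For completeness, the same clustering can be run from the Weka Java API. This supplementary sketch assumes weka.jar is on the classpath and the vote.arff file from the Weka data folder; the class attribute is removed before clustering, which corresponds to ignoring it in the Explorer.

```java
import weka.clusterers.SimpleKMeans;
import weka.core.EuclideanDistance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class KMeansOnVote {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("vote.arff").getDataSet();

        // Clustering is unsupervised, so drop the class attribute (the last one).
        Remove remove = new Remove();
        remove.setAttributeIndices("last");
        remove.setInputFormat(data);
        Instances clusterData = Filter.useFilter(data, remove);

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(6);
        kmeans.setDistanceFunction(new EuclideanDistance());
        kmeans.buildClusterer(clusterData);

        // Prints the cluster centroids and the within-cluster sum of squared errors.
        System.out.println(kmeans);
    }
}
```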
Viva Questions:
1. What is clustering?
2. Which algorithms are available for clustering?
3. Why do we perform clustering on the data set?
4. What is the simple K-means algorithm?
5. What is Euclidean distance?

EXPT. No 10. Demonstrate the classification rule process on the dataset using the Naïve Bayes algorithm

Aim: To understand classification in data mining using the Naïve Bayes algorithm.

Software Required: Weka

Theory:
Naïve Bayes Algorithm: It is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, all these properties independently contribute to the probability that this fruit is an apple, and that is why it is known as ‘naive’. The Naive Bayes model is easy to build and particularly useful for very large data sets. Along with its simplicity, Naive Bayes can outperform even highly sophisticated classification methods.

Procedure:
1. First load the required dataset in the Weka tool using the Open file option. Here, the weather-nominal dataset is used.
2. Go to the Classify tab at the top left, click on the Choose button and select the Naive Bayes algorithm.

Figure 10.1: Naïve Bayes

3. To change the parameters, click on the algorithm name shown to the right of the Choose button; in this example the default values are kept.

Figure 10.2: Naïve Bayes Settings

4. Choose Percentage split as the evaluation method from the Test options in the main panel. Since there is no separate test data set, use a percentage split of 66 percent to get a good idea of the model’s accuracy.

Figure 10.3: Test Options

5. To generate the model, click Start. When the run finishes, the evaluation statistics appear in the right panel.

Figure 10.4: Evaluation Statistics

The model’s classification accuracy is about 60%, which suggests that the accuracy can be improved with further tuning.

Viva Questions:
1. What is the Naïve Bayes algorithm?
2. What is classification?
3. What are the ways to split the datasets?
4. Which algorithm is best for clustering?
5. Naïve Bayes is suitable for what kind of datasets?
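Supplementary sketch: the same percentage-split evaluation can be reproduced with the Weka Java API, assuming weka.jar is on the classpath and the nominal weather file from the Weka data folder (named weather.nominal.arff in recent distributions).

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesWeather {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("weather.nominal.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);   // 'play' is the class attribute

        // 66% training / 34% testing, mirroring the Explorer's percentage split.
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.66);
        Instances train = new Instances(data, 0, trainSize);
        Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(nb, test);
        System.out.println(eval.toSummaryString());   // accuracy and error statistics
    }
}
```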