LECTURE NOTES ON DATA WAREHOUSING AND MINING
2018 – 2019 II MCA II Semester
Mrs. T. LAKSHMI PRASANNA, Assistant Professor
CHADALAWADA RAMANAMMA ENGINEERING COLLEGE (AUTONOMOUS)
Chadalawada Nagar, Renigunta Road, Tirupati – 517 506
Department of Master of Computer Applications

Unit - I Data Warehousing Classes: 10
Introduction to data mining: Motivation, importance, definition of data mining, kinds of data mining, kinds of patterns, data mining technologies, kinds of applications targeted, major issues in data mining; Preprocessing: data objects and attribute types, basic statistical descriptions of data, data visualization, data quality, data cleaning, data integration, data reduction, data transformation and data discretization.

MOTIVATION AND IMPORTANCE:
Data Mining is defined as the procedure of extracting information from huge sets of data. In other words, we can say that data mining is mining knowledge from data. These notes start with a basic overview and the terminologies involved in data mining and then gradually move on to cover topics such as knowledge discovery, query languages, classification and prediction, decision tree induction, cluster analysis, and how to mine the Web.
There is a huge amount of data available in the Information Industry. This data is of no use until it is converted into useful information. It is necessary to analyze this huge amount of data and extract useful information from it. Extraction of information is not the only process we need to perform; data mining also involves other processes such as Data Cleaning, Data Integration, Data Transformation, Data Mining, Pattern Evaluation and Data Presentation. Once all these processes are over, we are able to use this information in many applications such as Fraud Detection, Market Analysis, Production Control, Science Exploration, etc.

DEFINITION OF DATA MINING:
Data Mining is defined as extracting information from huge sets of data. In other words, we can say that data mining is the procedure of mining knowledge from data. The information or knowledge extracted in this way can be used for any of the following applications −
• Market Analysis
• Fraud Detection
• Customer Retention
• Production Control
• Science Exploration
* Major Sources of Abundant Data:
- Business – Web, E-commerce, Transactions, Stocks
- Science – Remote Sensing, Bioinformatics, Scientific Simulation
- Society and Everyone – News, Digital Cameras, YouTube
* Need for turning data into knowledge – Drowning in data, but starving for knowledge
* Applications that use data mining:
- Market Analysis
- Fraud Detection
- Customer Retention
- Production Control
- Scientific Exploration
Definition of Data Mining? Extracting and ‘Mining’ knowledge from large amounts of data.
“Gold mining from rock or sand” is the same as “Knowledge mining from data”.
Other terms for Data Mining:
o Knowledge Mining
o Knowledge Extraction
o Pattern Analysis
o Data Archeology
o Data Dredging
Data Mining is not the same as KDD (Knowledge Discovery from Data); Data Mining is one step in KDD.
Data Cleaning – Remove noisy and inconsistent data
Data Integration – Multiple data sources are combined
Data Selection – Data relevant to the analysis is retrieved
Data Transformation – Transform into a form suitable for Data Mining (Summarized / Aggregated)
Data Mining – Extract data patterns using intelligent methods
Pattern Evaluation – Identify interesting patterns
Knowledge Presentation – Visualization / Knowledge Representation – Presenting mined knowledge to the user

DIFFERENT KINDS OF DATA MINING:
Several major data mining techniques have been developed and used in data mining projects, including association, classification, clustering, prediction, sequential patterns and decision trees. We will briefly examine these data mining techniques in the following sections.
Association: Association is one of the best-known data mining techniques. In association, a pattern is discovered based on a relationship between items in the same transaction. That is the reason why the association technique is also known as the relation technique. The association technique is used in market basket analysis to identify a set of products that customers frequently purchase together. Retailers use the association technique to research customers’ buying habits. Based on historical sales data, retailers might find out that customers always buy crisps when they buy beer, and therefore they can put beer and crisps next to each other to save time for the customer and increase sales.
Classification: Classification is a classic data mining technique based on machine learning. Basically, classification is used to classify each item in a set of data into one of a predefined set of classes or groups. The classification method makes use of mathematical techniques such as decision trees, linear programming, neural networks, and statistics. In classification, we develop software that can learn how to classify data items into groups. For example, we can apply classification in the application that “given all records of employees who left the company, predict who will probably leave the company in a future period.” In this case, we divide the records of employees into two groups named “leave” and “stay”. We can then ask our data mining software to classify the employees into the separate groups.
Clustering: Clustering is a data mining technique that forms meaningful or useful clusters of objects which have similar characteristics, using an automatic technique. The clustering technique defines the classes and puts objects in each class, while in the classification technique, objects are assigned to predefined classes. To make the concept clearer, we can take book management in a library as an example. In a library, there is a wide range of books on various topics available. The challenge is how to keep those books in a way that readers can take several books on a particular topic without hassle. By using the clustering technique, we can keep books that have some kind of similarity in one cluster or one shelf and label it with a meaningful name. If readers want to grab books on that topic, they only have to go to that shelf instead of searching the entire library.
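To make the clustering idea concrete, here is a minimal sketch in Python using scikit-learn (assumed to be installed); the "book" feature values are invented purely for illustration. It groups a handful of objects into clusters, much like gathering similar books onto shelves.

    # A minimal clustering sketch (illustrative data only).
    from sklearn.cluster import KMeans
    import numpy as np

    # Hypothetical feature vectors, e.g. (pages, price) for a few books.
    books = np.array([
        [100, 150], [120, 160], [110, 155],   # short, cheap books
        [450, 700], [480, 650], [500, 720],   # long, expensive books
    ])

    # Ask for two clusters; KMeans assigns each object to the nearest cluster centre.
    model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(books)
    print(model.labels_)           # e.g. [0 0 0 1 1 1] -- a cluster label per book
    print(model.cluster_centers_)  # the "shelf" each group is gathered around

This is only one possible clustering method; the same grouping idea applies whatever algorithm is used.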
Prediction: Prediction, as its name implies, is a data mining technique that discovers the relationship between independent variables and the relationship between dependent and independent variables. For instance, the prediction technique can be used in sales to predict future profit: if we consider sales as an independent variable, profit could be a dependent variable. Then, based on the historical sales and profit data, we can draw a fitted regression curve that is used for profit prediction.
Sequential Patterns: Sequential pattern analysis is a data mining technique that seeks to discover or identify similar patterns, regular events or trends in transaction data over a business period. In sales, with historical transaction data, businesses can identify a set of items that customers buy together at different times in a year. Businesses can then use this information to recommend these items to customers with better deals, based on their purchasing frequency in the past.
Decision trees: A decision tree is one of the most commonly used data mining techniques because its model is easy for users to understand. In the decision tree technique, the root of the decision tree is a simple question or condition that has multiple answers. Each answer then leads to a set of questions or conditions that help us narrow down the data so that we can make the final decision based on it. For example, we can use the following decision tree to determine whether or not to play tennis: starting at the root node, if the outlook is overcast then we should definitely play tennis. If it is rainy, we should only play tennis if the wind is weak. And if it is sunny, then we should play tennis in case the humidity is normal. We often combine two or more of these data mining techniques to form an appropriate process that meets the business needs.
1. Classification analysis: This analysis is used to retrieve important and relevant information about data and metadata. It is used to classify data into different classes. Classification is similar to clustering in that it also segments data records into different segments called classes. But unlike clustering, here the data analysts know the different classes or clusters in advance. So, in classification analysis you apply algorithms to decide how new data should be classified. A classic example of classification analysis is Outlook email: Outlook uses certain algorithms to characterize an email as legitimate or spam.
2. Association rule learning: This refers to the method that can help you identify interesting relations (dependency modeling) between different variables in large databases. This technique can help you unpack hidden patterns in the data, identifying variables within the data and co-occurrences of different variables that appear very frequently in the dataset. Association rules are useful for examining and forecasting customer behavior. It is highly recommended in retail industry analysis. This technique is used for shopping basket data analysis, product clustering, catalog design and store layout. Programmers also use association rules to build programs capable of machine learning.
3. Anomaly or outlier detection: This refers to the observation of data items in a dataset that do not match an expected pattern or an expected behavior. Anomalies are also known as outliers, novelties, noise, deviations and exceptions.
Often they provide critical and actionable information. An anomaly is an item that deviates considerably from the common average within a dataset or a combination of data. These items are statistically distant from the rest of the data and hence indicate that something out of the ordinary has happened and requires additional attention. This technique can be used in a variety of domains, such as intrusion detection, system health monitoring, fraud detection, fault detection, event detection in sensor networks, and detecting eco-system disturbances. Analysts often remove the anomalous data from the dataset to discover results with increased accuracy.
4. Clustering analysis: A cluster is a collection of data objects that are similar within the same cluster. That means the objects are similar to one another within the same group and are rather different from, dissimilar to or unrelated to the objects in other groups or other clusters. Clustering analysis is the process of discovering groups and clusters in the data in such a way that the degree of association between two objects is highest if they belong to the same group and lowest otherwise. A result of this analysis can be used for customer profiling.
5. Regression analysis: In statistical terms, regression analysis is the process of identifying and analyzing the relationship among variables. It can help you understand how the value of the dependent variable changes if any one of the independent variables is varied. This means one variable is dependent on another, but not vice versa. It is generally used for prediction and forecasting.
All of these techniques can help analyze data from different perspectives. Now you have the knowledge to decide the best technique to summarize data into useful information – information that can be used to solve a variety of business problems to increase revenue and customer satisfaction, or decrease unwanted cost.

DATA MINING TECHNIQUES:
1. Classification: This analysis is used to retrieve important and relevant information about data and metadata. This data mining method helps to classify data into different classes.
2. Clustering: Clustering analysis is a data mining technique to identify data that are like each other. This process helps to understand the differences and similarities between the data.
3. Regression: Regression analysis is the data mining method of identifying and analyzing the relationship between variables. It is used to identify the likelihood of a specific variable, given the presence of other variables.
4. Association Rules: This data mining technique helps to find the association between two or more items. It discovers hidden patterns in the data set.
5. Outlier detection: This type of data mining technique refers to the observation of data items in the dataset which do not match an expected pattern or expected behavior. This technique can be used in a variety of domains, such as intrusion detection, fraud or fault detection, etc. It is also called Outlier Analysis or Outlier Mining.
6. Sequential Patterns: This data mining technique helps to discover or identify similar patterns or trends in transaction data over a certain period.
7. Prediction: Prediction uses a combination of the other data mining techniques such as trends, sequential patterns, clustering, classification, etc. It analyzes past events or instances in the right sequence to predict a future event.
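As a concrete illustration of regression-based prediction, the following minimal Python sketch (using scikit-learn; the sales and profit figures are invented) fits a simple linear model of profit against sales and uses it to predict profit for a new sales value. It is a sketch of the idea, not a recommended modelling workflow.

    # A minimal regression/prediction sketch (illustrative data only).
    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical historical data: sales (independent) and profit (dependent), in thousands.
    sales = np.array([[10], [20], [30], [40], [50]])
    profit = np.array([2.1, 4.3, 5.9, 8.2, 10.1])

    # Fit a straight line -- the simplest form of the "fitted regression curve" above.
    model = LinearRegression().fit(sales, profit)

    # Predict profit for a future sales figure of 60 (same units as the training data).
    print(model.predict(np.array([[60]])))   # roughly 12 -- an extrapolated estimate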
DATA MINING TECHNIQUES(IN DETAIL): 1.Tracking patterns. One of the most basic techniques in data mining is learning to recognize patterns in your data sets. This is usually a recognition of some aberration in your data happening at regular intervals, or an ebb and flow of a certain variable over time. For example, you might see that your sales of a certain product seem to spike just before the holidays, or notice that warmer weather drives more people to your website. 2. Classification. Classification is a more complex data mining technique that forces you to collect various attributes together into discernable categories, which you can then use to draw further conclusions, or serve some function. For example, if you’re evaluating data on individual customers’ financial backgrounds and purchase histories, you might be able to classify them as “low,” “medium,” or “high” credit risks. You could then use these classifications to learn even more about those customers. 3. Association. Association is related to tracking patterns, but is more specific to dependently linked variables. In this case, you’ll look for specific events or attributes that are highly correlated with another event or attribute; for example, you might notice that when your customers buy a specific item, they also often buy a second, related item. This is usually what’s used to populate “people also bought” sections of online stores. 4. Outlier detection. In many cases, simply recognizing the overarching pattern can’t give you a clear understanding of your data set. You also need to be able to identify anomalies, or outliers in your data. For example, if your purchasers are almost exclusively male, but during one strange week in July, there’s a huge spike in female purchasers, you’ll want to investigate the spike and see what drove it, so you can either replicate it or better understand your audience in the process. 5. Clustering. Clustering is very similar to classification, but involves grouping chunks of data together based on their similarities. For example, you might choose to cluster different demographics of your audience into different packets based on how much disposable income they have, or how often they tend to shop at your store. 6. Regression. Regression, used primarily as a form of planning and modeling, is used to identify the likelihood of a certain variable, given the presence of other variables. For example, you could use it to project a certain price, based on other factors like availability, consumer demand, and competition. More specifically, regression’s main focus is to help you uncover the exact relationship between two (or more) variables in a given data set. 7. Prediction. Prediction is one of the most valuable data mining techniques, since it’s used to project the types of data you’ll see in the future. In many cases, just recognizing and understanding historical trends is enough to chart a somewhat accurate prediction of what will happen in the future. For example, you might review consumers’ credit histories and past purchases to predict whether they’ll be a credit risk in the future. DATA MINING TOOLS: So do you need the latest and greatest machine learning technology to be able to apply these techniques? Not necessarily. In fact, you can probably accomplish some cutting-edge data mining with relatively modest database systems, and simple tools that almost any company will have. And if you don’t have the right tools for the job, you can always create your own. 
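In the same spirit as the point about modest tools, a first pass at association-style analysis can be done with nothing more than the Python standard library. The rough sketch below (the transactions are made up) counts item co-occurrences to surface "people also bought" pairs; it is a simplification of full association-rule mining.

    # A rough association sketch: how often do pairs of items appear in the same transaction?
    from collections import Counter
    from itertools import combinations

    # Hypothetical market-basket transactions.
    transactions = [
        {"beer", "crisps", "bread"},
        {"beer", "crisps"},
        {"bread", "milk"},
        {"beer", "crisps", "milk"},
    ]

    pair_counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1

    # Support of a pair = fraction of transactions containing both items.
    for pair, count in pair_counts.most_common(3):
        print(pair, "support =", count / len(transactions))
    # ('beer', 'crisps') appears in 3 of 4 baskets -> support 0.75, a candidate "also bought" pair.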
However you approach it, data mining is the best collection of techniques you have for making the most out of the data you have already gathered. As long as you apply the correct logic and ask the right questions, you can walk away with conclusions that have the potential to revolutionize your enterprise.
Challenges of Implementing Data Mining:
• Skilled experts are needed to formulate the data mining queries.
• Overfitting: Due to a small training database, a model may not fit future states.
• Data mining needs large databases which are sometimes difficult to manage.
• Business practices may need to be modified to make use of the information uncovered.
• If the data set is not diverse, data mining results may not be accurate.
• Integrating information from heterogeneous databases and global information systems can be complex.
MAJOR ISSUES IN DATA MINING:
• Mining Methodology Issues:
o Mining different kinds of knowledge in databases
o Incorporation of background knowledge
o Handling noisy or incomplete data
o Pattern evaluation – the interestingness problem
• User Interaction Issues:
o Interactive mining of knowledge at multiple levels of abstraction
o Data mining query languages and ad-hoc data mining
• Performance Issues:
o Efficiency and scalability of data mining algorithms
o Parallel, distributed and incremental mining algorithms
• Issues related to diversity of data types:
o Handling of relational and complex types of data
o Mining information from heterogeneous databases and global information systems
DATA MINING TECHNOLOGIES:
As a highly application-driven domain, data mining has incorporated many techniques from other domains such as statistics, machine learning, pattern recognition, database and data warehouse systems, information retrieval, visualization, algorithms, high-performance computing, and many application domains. The interdisciplinary nature of data mining research and development contributes significantly to the success of data mining and its extensive applications. In this section, we give examples of several disciplines that strongly influence the development of data mining methods.
DATA OBJECTS AND ATTRIBUTE TYPES:
When we talk about data mining, we usually discuss knowledge discovery from data. To get to know the data it is necessary to discuss data objects, data attributes and the types of data attributes. Mining data involves knowing about the data and finding relations between data, and for this we need to discuss data objects and attributes.
Data objects are the essential part of a database. A data object represents an entity; it is like a group of attributes of an entity. For example, a sales data object may represent a customer, a sale or a purchase. When data objects are listed in a database they are called data tuples.
Attribute: An attribute can be seen as a data field that represents characteristics or features of a data object. For a customer object, attributes can be customer ID, address, etc. A set of attributes used to describe a given object is known as an attribute vector or feature vector.
Types of attributes: This is the first step of data preprocessing. We differentiate between the different types of attributes and then preprocess the data. The attribute types are:
1. Qualitative (Nominal (N), Ordinal (O), Binary (B))
2. Quantitative (Discrete, Continuous)
Qualitative Attributes:
1. Nominal Attributes – related to names: The values of a nominal attribute are names of things, some kind of symbols. Values of nominal attributes represent some category or state, and that is why nominal attributes are also referred to as categorical attributes. There is no order (rank, position) among the values of a nominal attribute.
2. Binary Attributes: Binary data has only two values/states, for example yes or no, affected or unaffected, true or false.
i) Symmetric: Both values are equally important (e.g., Gender).
ii) Asymmetric: Both values are not equally important (e.g., Result).
3. Ordinal Attributes: Ordinal attributes contain values that have a meaningful sequence or ranking (order) between them, but the magnitude between values is not actually known. The order of the values shows what is important but does not indicate how important it is.
Quantitative Attributes:
1. Numeric: A numeric attribute is quantitative because it is a measurable quantity, represented by integer or real values. Numeric attributes are of two types, interval and ratio.
i) An interval-scaled attribute has values whose differences are interpretable, but the attribute does not have a true reference point, or zero point. Data can be added and subtracted on an interval scale but cannot be multiplied or divided. Consider the example of temperature in degrees Centigrade: if the temperature of one day is twice that of another day, we cannot say that one day is twice as hot as the other.
ii) A ratio-scaled attribute is a numeric attribute with a fixed zero point. If a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value. The values are ordered, we can compute the difference between values, and the mean, median, mode, quantile range and five-number summary can be given.
2. Discrete: Discrete data have finite values; they can be numerical or categorical. These attributes have a finite or countably infinite set of values.
3. Continuous: Continuous data have an infinite number of states. Continuous data are of float type; there can be many values between, say, 2 and 3.
DATA PRE-PROCESSING
Definition - What does Data Preprocessing mean?
Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues. Data preprocessing prepares raw data for further processing. Data preprocessing is used in database-driven applications such as customer relationship management and rule-based applications (like neural networks). Data goes through a series of steps during preprocessing:
• Data Cleaning: Data is cleansed through processes such as filling in missing values, smoothing the noisy data, or resolving the inconsistencies in the data.
• Data Integration: Data with different representations are put together and conflicts within the data are resolved.
• Data Transformation: Data is normalized, aggregated and generalized.
• Data Reduction: This step aims to present a reduced representation of the data in a data warehouse.
• Data Discretization: Involves reducing the number of values of a continuous attribute by dividing the range of the attribute into intervals.
Why Data Pre-processing? Data preprocessing prepares raw data for further processing.
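As a rough end-to-end illustration of the preprocessing steps listed above, the sketch below (Python with pandas; the tiny table and its column names are invented) cleans a missing value, normalizes a numeric column and discretizes it into intervals. It only sketches the ideas, not a production pipeline.

    # A minimal preprocessing sketch: cleaning, transformation and discretization on toy data.
    import pandas as pd

    # Hypothetical raw data with a missing income value.
    df = pd.DataFrame({"age": [23, 45, 31, 52], "income": [25000, None, 40000, 52000]})

    # Data cleaning: fill the missing income with the attribute mean.
    df["income"] = df["income"].fillna(df["income"].mean())

    # Data transformation: min-max normalization of income into the range [0, 1].
    inc = df["income"]
    df["income_norm"] = (inc - inc.min()) / (inc.max() - inc.min())

    # Data discretization: replace the continuous ages with three interval labels.
    df["age_group"] = pd.cut(df["age"], bins=3, labels=["young", "middle", "senior"])
    print(df)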
The traditional data preprocessing approach is reactive: it starts with data that is assumed ready for analysis, and there is no feedback into the way the data is collected. Inconsistency between data sets is the main difficulty for data preprocessing.
Following are the major tasks of preprocessing.
Data Cleaning
Data in the real world is dirty, that is, it is incomplete, noisy or inconsistent.
Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data, e.g., occupation=“ ”
Noisy: containing errors or outliers, e.g., Salary=“-10”
Inconsistent: containing discrepancies in codes or names, e.g., Age=“42”, Birthday=“03/07/1997”; e.g., was rating “1,2,3”, now rating “A, B, C”; e.g., discrepancy between duplicate records
Why Is Data Dirty?
Data is dirty for the reasons below.
Incomplete data may come from: “Not applicable” data values when collected; different considerations between the time when the data was collected and when it is analyzed; human / hardware / software problems.
Noisy data (incorrect values) may come from: faulty data collection instruments; human or computer error at data entry; errors in data transmission.
Inconsistent data may come from: different data sources; functional dependency violations (e.g., modifying some linked data).
Duplicate records also need data cleaning.
Why Is Data Pre-processing Important?
Data pre-processing is important because: if there is no quality data, there are no quality mining results! Quality decisions must be based on quality data, e.g., duplicate or missing data may cause incorrect or even misleading statistics. A data warehouse needs consistent integration of quality data. Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse.
DATA CLEANING
Importance of Data Cleaning
“Data cleaning is one of the three biggest problems in data warehousing”
“Data cleaning is the number one problem in data warehousing” — DCI survey
Data cleaning tasks are: filling in missing values; identifying outliers and smoothing out noisy data; correcting inconsistent data; resolving redundancy caused by data integration.
Explanation of Data Cleaning
Missing Data
Eg. Missing customer income attribute in the sales data.
Methods of handling missing values:
a) Ignore the tuple
1) When the attribute with missing values does not contribute to any of the classes or the class label is missing.
2) Effective only when many values are missing across many attributes of the tuple.
3) Not effective when only a few of the attribute values are missing in a tuple.
b) Fill in the missing value manually
1) This method is time consuming
2) It is not efficient
3) The method is not feasible for large data sets
c) Use a global constant to fill in the missing value
1) This means filling with “Unknown” or “Infinity”
2) This method is simple
3) It is not generally recommended
d) Use the attribute mean to fill in the missing value
That is, take the average of all existing income values and fill in the missing income value.
e) Use the attribute mean of all samples belonging to the same class as the given tuple
Say there is a class “Average income” and the tuple with the missing value belongs to this class; then the missing value is the mean of all the values in this class.
f) Use the most probable value to fill in the missing value
This method uses inference-based tools like the Bayesian formula, decision trees, etc.
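A minimal sketch of methods (c), (d) and (e) above, using Python with pandas on an invented table (the column values and the class labels are assumptions made for illustration):

    # Filling missing income values: global constant, overall mean, and per-class mean.
    import pandas as pd

    df = pd.DataFrame({
        "class":  ["average", "average", "high", "high"],
        "income": [25000, None, 90000, None],
    })

    # (c) Fill with a global constant such as "Unknown".
    filled_constant = df["income"].fillna("Unknown")

    # (d) Fill with the attribute mean computed over all existing values.
    filled_mean = df["income"].fillna(df["income"].mean())

    # (e) Fill with the mean of the samples in the same class as the tuple.
    filled_class_mean = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))

    print(filled_constant.tolist(), filled_mean.tolist(), filled_class_mean.tolist(), sep="\n")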
Data Cleaning in Data Mining
The quality of your data is critical in getting to the final analysis. Any data which tends to be incomplete, noisy and inconsistent can affect your results. Data cleaning in data mining is the process of detecting and removing corrupt or inaccurate records from a record set, table or database.
Some data cleaning methods:
1. You can ignore the tuple. This is done when the class label is missing. This method is not very effective unless the tuple contains several attributes with missing values.
2. You can fill in the missing value manually. This approach is effective on small data sets with some missing values.
3. You can replace all missing attribute values with a global constant, such as a label like “Unknown” or minus infinity.
4. You can use the attribute mean to fill in the missing value. For example, if the customers' average income is 25000 then you can use this value to replace a missing value for income.
5. You can use the most probable value to fill in the missing value.
Noisy Data
Noise is a random error or variance in a measured variable. Noisy data may be due to faulty data collection instruments, data entry problems and technology limitations.
How to Handle Noisy Data?
Binning: Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the values around it. The sorted values are distributed into a number of “buckets,” or bins.
For example: Price = 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin a: 4, 8, 15
Bin b: 21, 21, 24
Bin c: 25, 28, 34
In this example, the data for price are first sorted and then partitioned into equal-frequency bins of size 3.
Smoothing by bin means:
Bin a: 9, 9, 9
Bin b: 22, 22, 22
Bin c: 29, 29, 29
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin.
Smoothing by bin boundaries:
Bin a: 4, 4, 15
Bin b: 21, 21, 24
Bin c: 25, 25, 34
In smoothing by bin boundaries, each bin value is replaced by the closest boundary value.
Regression: Data can be smoothed by fitting the data to a regression function.
Clustering: Outliers may be detected by clustering, where similar values are organized into groups, or “clusters.” Values that fall outside of the set of clusters may be considered outliers.
DATA INTEGRATION
i) Data Integration
- Combines data from multiple sources into a single store
- Includes multiple databases, data cubes or flat files
ii) Schema integration
- Integrates metadata from different sources
- Eg. A.cust_id = B.cust_no
iii) Entity Identification Problem
- Identify real-world entities from different data sources
- Eg. a Pay type field in one data source can take the values ‘H’ or ‘S’, while in another data source it can take the values 1 or 2
iv) Detecting and resolving data value conflicts
- For the same real-world entity, the attribute value can be different in different data sources
- Possible reasons: different interpretations, different representations and different scaling
- Eg. sales amount represented in Dollars (USD) in one data source and in Pounds (£) in another data source
v) Handling Redundancy in data integration
- When we integrate multiple databases, data redundancy occurs
- Object Identification – the same attributes / objects in different data sources may have different names
- Derivable Data – an attribute in one data source may be derived from attribute(s) in another data source, e.g. Monthly_revenue in one data source and Annual revenue in another data source
- Such redundant attributes can be detected using Correlation Analysis
- Careful integration of data from multiple sources can help in reducing or avoiding data redundancy and inconsistency, which will in turn improve mining speed and quality
DATA INTEGRATION IN DATA MINING
Data integration is a data preprocessing technique that combines data from multiple sources and provides users a unified view of these data. These sources may include multiple databases, data cubes, or flat files. One of the most well-known implementations of data integration is building an enterprise's data warehouse. The benefit of a data warehouse is that it enables a business to perform analyses based on the data in the data warehouse.
There are mainly two major approaches for data integration:
1. Tight Coupling: In tight coupling, data is combined from different sources into a single physical location through the process of ETL – Extraction, Transformation and Loading.
2. Loose Coupling: In loose coupling, data remains only in the actual source databases. In this approach, an interface is provided that takes a query from the user, transforms it in a way the source database can understand, and then sends the query directly to the source databases to obtain the result.
DATA TRANSFORMATION
Smoothing: Removes noise from the data
Aggregation: Summarization, data cube construction
Generalization: Concept hierarchy climbing
Attribute / Feature Construction: New attributes constructed from the given ones
Normalization: Data scaled to fall within a specified range
- min-max normalization
- z-score normalization
- normalization by decimal scaling
In the data transformation process, data are transformed from one format to another format that is more appropriate for data mining. Some data transformation strategies:
1. Smoothing: Smoothing is a process of removing noise from the data.
2. Aggregation: Aggregation is a process where summary or aggregation operations are applied to the data.
3. Generalization: In generalization, low-level data are replaced with high-level data by climbing concept hierarchies.
4. Normalization: Normalization scales attribute data so as to fall within a small specified range, such as 0.0 to 1.0.
5. Attribute Construction: In attribute construction, new attributes are constructed from the given set of attributes.
DATA DISCRETIZATION AND CONCEPT HIERARCHY GENERATION
Data discretization techniques can be used to divide the range of a continuous attribute into intervals. Numerous continuous attribute values are replaced by a small number of interval labels. This leads to a concise, easy-to-use, knowledge-level representation of mining results.
Top-down discretization: If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range, and then repeats this recursively on the resulting intervals, it is called top-down discretization or splitting.
Bottom-up discretization: If the process starts by considering all of the continuous values as potential split points and removes some by merging neighborhood values to form intervals, it is called bottom-up discretization or merging.
Discretization can be performed recursively on an attribute to provide a hierarchical partitioning of the attribute values, known as a concept hierarchy.
Concept hierarchies: Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts with higher-level concepts.
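As a tiny illustration of concept hierarchy climbing, the sketch below (plain Python; the location values and the hierarchy mapping are made up) replaces low-level city values with the higher-level country concept:

    # A minimal concept hierarchy sketch: generalize city -> country (one level of climbing).
    city_to_country = {            # a hypothetical hierarchy: city < province < country
        "Vancouver": "Canada",
        "Victoria": "Canada",
        "Chicago": "USA",
        "New York": "USA",
    }

    sales_locations = ["Vancouver", "Chicago", "Victoria", "New York", "Vancouver"]

    # Replace each low-level value with its higher-level concept.
    generalized = [city_to_country[city] for city in sales_locations]
    print(generalized)   # ['Canada', 'USA', 'Canada', 'USA', 'Canada'] -- fewer distinct values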
In the multidimensional model, data are organized into multiple dimensions, and each dimension contains multiple levels of abstraction defined by concept hierarchies. This organization provides users with the flexibility to view data from different perspectives. Data mining on a reduced data set means fewer input/output operations and is more efficient than mining on a larger data set. Because of these benefits, discretization techniques and concept hierarchies are typically applied before data mining, rather than during mining.
Discretization and Concept Hierarchy Generation for Numerical Data
Typical methods:
1. Binning: Binning is a top-down splitting technique based on a specified number of bins. Binning is an unsupervised discretization technique.
2. Histogram Analysis: Because histogram analysis does not use class information, it is also an unsupervised discretization technique. Histograms partition the values of an attribute into disjoint ranges called buckets.
3. Cluster Analysis: Cluster analysis is a popular data discretization method. A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups. Each initial cluster or partition may be further decomposed into several subclusters, forming a lower level of the hierarchy.

UNIT - II Business Analysis Classes: 08
Data warehouse and OLAP technology for data mining, what is a data warehouse, multidimensional data model, data warehouse architecture, data warehouse implementation, development of data cube technology, data warehousing to data mining; Data preprocessing: data summarization, data cleaning, data integration and transformation, data reduction, discretization and concept hierarchy generation.
DATA WAREHOUSE AND OLAP TECHNOLOGY FOR DATA MINING:
Online Analytical Processing (OLAP) is based on the multidimensional data model that allows users to extract and view data from different points of view. OLAP data is stored in a multidimensional database. OLAP is the technology behind many Business Intelligence (BI) applications. Using OLAP, a user can create a spreadsheet showing all of a company's products sold in India in the month of May, compare revenue figures, etc.
WHAT IS A DATA WAREHOUSE:
The term "Data Warehouse" was first coined by Bill Inmon in 1990. According to Inmon, a data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data. This data helps analysts to take informed decisions in an organization. An operational database undergoes frequent changes on a daily basis on account of the transactions that take place.
Suppose a business executive wants to analyze previous feedback on any data such as a product, a supplier, or any consumer data; the executive will have no data available to analyze because the previous data has been updated due to transactions.
A data warehouse provides us with generalized and consolidated data in a multidimensional view. Along with a generalized and consolidated view of data, a data warehouse also provides us with Online Analytical Processing (OLAP) tools. These tools help us in the interactive and effective analysis of data in a multidimensional space. This analysis results in data generalization and data mining. Data mining functions such as association, clustering, classification and prediction can be integrated with OLAP operations to enhance the interactive mining of knowledge at multiple levels of abstraction. That is why the data warehouse has now become an important platform for data analysis and online analytical processing.
Understanding a Data Warehouse
• A data warehouse is a database, which is kept separate from the organization's operational database.
• There is no frequent updating done in a data warehouse.
• It possesses consolidated historical data, which helps the organization to analyze its business.
• A data warehouse helps executives to organize, understand, and use their data to take strategic decisions.
• Data warehouse systems help in the integration of a diversity of application systems.
• A data warehouse system helps in consolidated historical data analysis.
Why a Data Warehouse is Separated from Operational Databases
A data warehouse is kept separate from operational databases due to the following reasons −
• An operational database is constructed for well-known tasks and workloads such as searching particular records, indexing, etc. In contrast, data warehouse queries are often complex and they present a general form of data.
• Operational databases support concurrent processing of multiple transactions. Concurrency control and recovery mechanisms are required for operational databases to ensure robustness and consistency of the database.
• An operational database query allows read and modify operations, while an OLAP query needs only read-only access to stored data.
• An operational database maintains current data. On the other hand, a data warehouse maintains historical data.
Data Warehouse Features
The key features of a data warehouse are discussed below −
• Subject Oriented − A data warehouse is subject oriented because it provides information around a subject rather than the organization's ongoing operations. These subjects can be products, customers, suppliers, sales, revenue, etc. A data warehouse does not focus on the ongoing operations; rather it focuses on modelling and analysis of data for decision making.
• Integrated − A data warehouse is constructed by integrating data from heterogeneous sources such as relational databases, flat files, etc. This integration enhances the effective analysis of data.
• Time Variant − The data collected in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from the historical point of view.
• Non-volatile − Non-volatile means the previous data is not erased when new data is added to it. A data warehouse is kept separate from the operational database and therefore frequent changes in the operational database are not reflected in the data warehouse.
Note − A data warehouse does not require transaction processing, recovery, and concurrency controls, because it is physically stored separately from the operational database.
Data Warehouse Applications
As discussed before, a data warehouse helps business executives to organize, analyze, and use their data for decision making. A data warehouse serves as a sole part of a plan-execute-assess "closed-loop" feedback system for enterprise management. Data warehouses are widely used in the following fields −
• Financial services
• Banking services
• Consumer goods
• Retail sectors
• Controlled manufacturing
DATA WAREHOUSE ARCHITECTURE:
Business analysts get information from the data warehouse to measure performance and make critical adjustments in order to win over other business holders in the market. Having a data warehouse offers the following advantages −
• Since a data warehouse can gather information quickly and efficiently, it can enhance business productivity.
• A data warehouse provides us a consistent view of customers and items; hence, it helps us manage customer relationships.
• A data warehouse also helps in bringing down costs by tracking trends and patterns over a long period in a consistent and reliable manner.
To design an effective and efficient data warehouse, we need to understand and analyze the business needs and construct a business analysis framework. Each person has different views regarding the design of a data warehouse. These views are as follows −
• The top-down view − This view allows the selection of relevant information needed for a data warehouse.
• The data source view − This view presents the information being captured, stored, and managed by the operational system.
• The data warehouse view − This view includes the fact tables and dimension tables. It represents the information stored inside the data warehouse.
• The business query view − It is the view of the data from the viewpoint of the end-user.
Three-Tier Data Warehouse Architecture
Generally a data warehouse adopts a three-tier architecture. Following are the three tiers of the data warehouse architecture.
• Bottom Tier − The bottom tier of the architecture is the data warehouse database server. It is the relational database system. We use back-end tools and utilities to feed data into the bottom tier. These back-end tools and utilities perform the Extract, Clean, Load, and Refresh functions.
• Middle Tier − In the middle tier, we have the OLAP server that can be implemented in either of the following ways.
o By Relational OLAP (ROLAP), which is an extended relational database management system. ROLAP maps the operations on multidimensional data to standard relational operations.
o By the Multidimensional OLAP (MOLAP) model, which directly implements the multidimensional data and operations.
• Top Tier − This tier is the front-end client layer. This layer holds the query tools and reporting tools, analysis tools and data mining tools.
The following diagram depicts the three-tier architecture of a data warehouse −
Data Warehouse Models
From the perspective of data warehouse architecture, we have the following data warehouse models −
• Virtual Warehouse
• Data mart
• Enterprise Warehouse
Virtual Warehouse
The view over an operational data warehouse is known as a virtual warehouse. It is easy to build a virtual warehouse. Building a virtual warehouse requires excess capacity on operational database servers.
Data Mart
A data mart contains a subset of organization-wide data.
This subset of data is valuable to specific groups of an organization. In other words, we can say that data marts contain data specific to a particular group. For example, the marketing data mart may contain data related to items, customers, and sales. Data marts are confined to subjects.
Points to remember about data marts −
• Windows-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low-cost servers.
• The implementation cycle of a data mart is measured in short periods of time, i.e., in weeks rather than months or years.
• The life cycle of a data mart may be complex in the long run, if its planning and design are not organization-wide.
• Data marts are small in size.
• Data marts are customized by department.
• The source of a data mart is a departmentally structured data warehouse.
• Data marts are flexible.
Enterprise Warehouse
• An enterprise warehouse collects all the information and the subjects spanning an entire organization.
• It provides us enterprise-wide data integration.
• The data is integrated from operational systems and external information providers.
• This information can vary from a few gigabytes to hundreds of gigabytes, terabytes or beyond.
Load Manager
This component performs the operations required to extract and load data. The size and complexity of the load manager varies between specific solutions from one data warehouse to another.
Load Manager Architecture
The load manager performs the following functions −
• Extract the data from the source system.
• Fast load the extracted data into a temporary data store.
• Perform simple transformations into a structure similar to the one in the data warehouse.
Extract Data from Source
The data is extracted from the operational databases or the external information providers. Gateways are the application programs that are used to extract data. A gateway is supported by the underlying DBMS and allows a client program to generate SQL to be executed at a server. Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) are examples of gateways.
Fast Load
• In order to minimize the total load window, the data needs to be loaded into the warehouse in the fastest possible time.
• The transformations affect the speed of data processing.
• It is more effective to load the data into a relational database prior to applying transformations and checks.
• Gateway technology proves not to be suitable, since gateways tend not to be performant when large data volumes are involved.
Simple Transformations
While loading, it may be required to perform simple transformations. After this has been completed we are in a position to do the complex checks. Suppose we are loading the EPOS sales transactions; we need to perform the following checks:
• Strip out all the columns that are not required within the warehouse.
• Convert all the values to the required data types.
Warehouse Manager
A warehouse manager is responsible for the warehouse management process. It consists of third-party system software, C programs, and shell scripts. The size and complexity of warehouse managers varies between specific solutions.
Warehouse Manager Architecture
A warehouse manager includes the following −
• The controlling process
• Stored procedures or C with SQL
• Backup/Recovery tool
• SQL scripts
Operations Performed by Warehouse Manager
• A warehouse manager analyzes the data to perform consistency and referential integrity checks.
• Creates indexes, business views, partition views against the base data.
• Generates new aggregations and updates existing aggregations. Generates normalizations.
• Transforms and merges the source data into the published data warehouse.
• Backs up the data in the data warehouse.
• Archives the data that has reached the end of its captured life.
Note − A warehouse manager also analyzes query profiles to determine whether indexes and aggregations are appropriate.
Query Manager
• The query manager is responsible for directing queries to the suitable tables.
• By directing the queries to the appropriate tables, the speed of querying and response generation can be increased.
• The query manager is responsible for scheduling the execution of the queries posed by the user.
Query Manager Architecture
The following diagram shows the architecture of a query manager. It includes the following:
• Query redirection via C tool or RDBMS
• Stored procedures
• Query management tool
• Query scheduling via C tool or RDBMS
• Query scheduling via third-party software
Detailed Information
Detailed information is not kept online; rather it is aggregated to the next level of detail and then archived to tape. The detailed information part of the data warehouse keeps the detailed information in the starflake schema. Detailed information is loaded into the data warehouse to supplement the aggregated data. The following diagram shows a pictorial impression of where detailed information is stored and how it is used.
MULTIDIMENSIONAL DATA MODEL (SCHEMAS EXPLANATION):
A schema is a logical description of the entire database. It includes the name and description of records of all record types including all associated data items and aggregates. Much like a database, a data warehouse also requires a schema to be maintained. A database uses the relational model, while a data warehouse uses the Star, Snowflake, and Fact Constellation schemas. In this chapter, we will discuss the schemas used in a data warehouse.
Star Schema
The star schema is the simplest and most fundamental schema among the data mart schemas. This schema is widely used to develop or build a data warehouse and dimensional data marts. It includes one or more fact tables indexing any number of dimension tables. The star schema is a special case of the snowflake schema. It is also efficient for handling basic queries. It is said to be a star as its physical model resembles a star shape, having a fact table at its center and the dimension tables at its periphery representing the star’s points.
As an example of a star schema, consider a SALES fact table having the attributes (Product ID, Order ID, Customer ID, Employee ID, Total, Quantity, Discount), which reference the dimension tables. The Employee dimension table contains the attributes: Emp ID, Emp Name, Title, Department and Region. The Product dimension table contains the attributes: Product ID, Product Name, Product Category, Unit Price. The Customer dimension table contains the attributes: Customer ID, Customer Name, Address, City, Zip. The Time dimension table contains the attributes: Order ID, Order Date, Year, Quarter, Month.
Model of Star Schema – In the star schema, business process data that holds the quantitative data about a business is held in fact tables, and the dimensions are descriptive characteristics related to the fact data. Sales price, sale quantity, distance, speed and weight measurements are a few examples of fact data in a star schema. Often, a star schema having multiple dimensions is termed a Centipede Schema.
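A minimal sketch of how a star-schema query works, using Python with pandas and invented miniature versions of the SALES fact table and Product dimension table described above (all values are made up):

    # A star-schema style query: join the fact table to a dimension table, then aggregate.
    import pandas as pd

    # Miniature fact table (measures plus foreign keys to the dimensions).
    sales = pd.DataFrame({
        "product_id": [1, 2, 1, 3],
        "customer_id": [10, 11, 10, 12],
        "quantity": [2, 1, 5, 3],
        "total": [40.0, 15.0, 100.0, 90.0],
    })

    # Miniature Product dimension table (descriptive attributes).
    product = pd.DataFrame({
        "product_id": [1, 2, 3],
        "product_name": ["Pen", "Notebook", "Bag"],
        "product_category": ["Stationery", "Stationery", "Luggage"],
    })

    # One join from fact to dimension (the "star" join), then a summary per category.
    joined = sales.merge(product, on="product_id")
    print(joined.groupby("product_category")["total"].sum())

In a real warehouse the same pattern is expressed in SQL, but the shape of the query (fact joined to each needed dimension, then aggregated) is the same.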
It is easy to handle a star schema whose dimensions have few attributes.
Advantages of Star Schema –
1. Simpler Queries: The join logic of a star schema is quite simple compared with the join logic needed to fetch data from a transactional schema that is highly normalized.
2. Simplified Business Reporting Logic: Compared with a highly normalized transactional schema, the star schema simplifies common business reporting logic, such as as-of reporting and period-over-period reporting.
3. Feeding Cubes: The star schema is widely used by OLAP systems to design OLAP cubes efficiently. In fact, major OLAP systems deliver a ROLAP mode of operation which can use a star schema as a source without designing a cube structure.
Disadvantages of Star Schema –
1. Data integrity is not enforced well, since the schema is in a highly de-normalized state.
2. It is not as flexible in terms of analytical needs as a normalized data model.
3. Star schemas do not support many-to-many relationships within business entities – at least not easily.
• Each dimension in a star schema is represented with only one dimension table.
• This dimension table contains the set of attributes.
• The following diagram shows the sales data of a company with respect to the four dimensions, namely time, item, branch, and location.
• There is a fact table at the center. It contains the keys to each of the four dimensions.
• The fact table also contains the attributes, namely dollars sold and units sold.
Note − Each dimension has only one dimension table and each table holds a set of attributes. For example, the location dimension table contains the attribute set {location_key, street, city, province_or_state, country}. This constraint may cause data redundancy. For example, "Vancouver" and "Victoria" are both cities in the Canadian province of British Columbia. The entries for such cities may cause data redundancy along the attributes province_or_state and country.
Snowflake Schema
Introduction: The snowflake schema is a variant of the star schema. Here, the centralized fact table is connected to multiple dimensions. In the snowflake schema, dimensions are present in normalized form in multiple related tables. The snowflake structure materializes when the dimensions of a star schema are detailed and highly structured, having several levels of relationship, and the child tables have multiple parent tables. The snowflake effect affects only the dimension tables and does not affect the fact tables.
Example: The Employee dimension table now contains the attributes: EmployeeID, EmployeeName, DepartmentID, Region, Territory. The DepartmentID attribute links the Employee dimension table with the Department dimension table. The Department dimension is used to provide detail about each department, such as the name and location of the department. The Customer dimension table now contains the attributes: CustomerID, CustomerName, Address, CityID. The CityID attribute links the Customer dimension table with the City dimension table. The City dimension table has details about each city such as CityName, Zipcode, State and Country.
The main difference between the star schema and the snowflake schema is that the dimension tables of the snowflake schema are maintained in normalized form to reduce redundancy. The advantage is that such normalized tables are easy to maintain and save storage space. However, it also means that more joins will be needed to execute a query, which will adversely impact system performance.
What is snowflaking?
The snowflake design is the result of further expansion and normalization of the dimension tables. In other words, a dimension table is said to be snowflaked if the low-cardinality attributes of the dimension have been divided into separate normalized tables. These tables are then joined to the original dimension table with referential constraints (foreign key constraints). Generally, snowflaking is not recommended in the dimension tables, as it hampers the understandability and performance of the dimensional model, since more tables must be joined to satisfy the queries.
• Some dimension tables in the snowflake schema are normalized.
• The normalization splits up the data into additional tables.
• Unlike the star schema, the dimension tables in a snowflake schema are normalized. For example, the item dimension table in the star schema is normalized and split into two dimension tables, namely the item and supplier tables.
• Now the item dimension table contains the attributes item_key, item_name, type, brand, and supplier_key.
• The supplier key is linked to the supplier dimension table. The supplier dimension table contains the attributes supplier_key and supplier_type.
Note − Due to normalization in the snowflake schema, the redundancy is reduced; therefore, it becomes easy to maintain and saves storage space.
Fact Constellation Schema
• A fact constellation has multiple fact tables. It is also known as a galaxy schema.
• The following diagram shows two fact tables, namely sales and shipping.
• The sales fact table is the same as that in the star schema.
• The shipping fact table has five dimensions, namely item_key, time_key, shipper_key, from_location, to_location.
• The shipping fact table also contains two measures, namely dollars sold and units sold.
• It is also possible to share dimension tables between fact tables. For example, the time, item, and location dimension tables are shared between the sales and shipping fact tables.
Schema Definition
A multidimensional schema is defined using the Data Mining Query Language (DMQL). The two primitives, cube definition and dimension definition, can be used for defining data warehouses and data marts.
DEVELOPMENT OF DATA CUBE TECHNOLOGY
Data Cube
A data cube helps us represent data in multiple dimensions. It is defined by dimensions and facts. The dimensions are the entities with respect to which an enterprise preserves its records.
Illustration of a Data Cube
Suppose a company wants to keep track of sales records with the help of a sales data warehouse with respect to time, item, branch, and location. These dimensions allow us to keep track of monthly sales and the branches at which the items were sold. There is a table associated with each dimension. This table is known as a dimension table. For example, the "item" dimension table may have attributes such as item_name, item_type, and item_brand.
The following table represents the 2-D view of sales data for a company with respect to the time, item, and location dimensions. But here in this 2-D table, we have records with respect to time and item only. The sales for New Delhi are shown with respect to the time and item dimensions, according to the type of items sold. If we want to view the sales data with one more dimension, say, the location dimension, then the 3-D view would be useful.
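As a rough sketch of how such multidimensional views can be produced, the following Python/pandas example builds a 2-D time x item view and then adds location as a third dimension. The miniature sales table, its column names and its figures are invented for illustration only.

    # Building 2-D and 3-D style views of sales data with a pivot / group-by (toy data).
    import pandas as pd

    sales = pd.DataFrame({
        "time":       ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
        "item":       ["Keyboard", "Mouse", "Keyboard", "Mouse", "Keyboard", "Mouse"],
        "location":   ["New Delhi", "New Delhi", "New Delhi", "New Delhi", "Chennai", "Chennai"],
        "units_sold": [120, 200, 150, 180, 90, 75],
    })

    # 2-D view: time x item for a single location (New Delhi), as in the notes.
    new_delhi = sales[sales["location"] == "New Delhi"]
    print(new_delhi.pivot_table(index="time", columns="item", values="units_sold", aggfunc="sum"))

    # 3-D view: add location as a third dimension (a small time x item x location cube).
    print(sales.groupby(["location", "time", "item"])["units_sold"].sum())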
DEVELOPMENT OF DATA CUBE TECHNOLOGY
Data Cube
A data cube helps us represent data in multiple dimensions. It is defined by dimensions and facts. The dimensions are the entities with respect to which an enterprise preserves its records.

Illustration of a Data Cube
Suppose a company wants to keep track of sales records in a sales data warehouse with respect to time, item, branch, and location. These dimensions allow it to keep track of, for example, monthly sales and the branch at which the items were sold. There is a table associated with each dimension, known as a dimension table. For example, the "item" dimension table may have attributes such as item_name, item_type, and item_brand.
The following table represents the 2-D view of the sales data of a company with respect to the time, item, and location dimensions. In this 2-D table, we have records with respect to time and item only; the sales for New Delhi are shown with respect to time and item according to the type of items sold. If we want to view the sales data with one more dimension, say the location dimension, then a 3-D view is useful.
The 3-D view of the sales data with respect to time, item, and location is shown in the table below. The 3-D table can also be represented as a 3-D data cube, as shown in the following figure.

Data Mart
Data marts contain a subset of organization-wide data that is valuable to specific groups of people in an organization. In other words, a data mart contains only the data that is specific to a particular group. For example, a marketing data mart may contain only data related to items, customers, and sales. Data marts are confined to subjects.

Points to Remember About Data Marts
• Windows-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low-cost servers.
• The implementation cycle of a data mart is measured in short periods of time, i.e., in weeks rather than months or years.
• The life cycle of data marts may become complex in the long run if their planning and design are not organization-wide.
• Data marts are small in size.
• Data marts are customized by department.
• The source of a data mart is a departmentally structured data warehouse.
• Data marts are flexible.
The following figure shows a graphical representation of data marts.

Virtual Warehouse
The view over an operational data warehouse is known as a virtual warehouse. It is easy to build a virtual warehouse, but doing so requires excess capacity on the operational database servers.

DATA SUMMARIZATION
Data summarization is the production of a short conclusion from a large body of data: the program works through the data and, at the end, declares the final result in the form of summarized data. Data summarization is of great importance in data mining, since many programmers and developers now work with very large data sets. Declaring such a result used to be difficult, but there are now many tools on the market that can be used in a program, or wherever they are needed, to summarize data.

Why Data Summarization?
Why do we need summarization of data in the mining process? We live in a digital world where data is transferred in seconds, far faster than any human can follow. In the corporate field, employees work on huge volumes of data derived from different sources such as social networks, media, newspapers, books, cloud storage, etc. This can make it difficult to summarize the data. The data volume is often unexpected: when retrieving data from relational sources, one cannot predict how much data will be stored in the database. As a result, the data becomes more complex and takes longer to summarize. One solution is to retrieve data by category, i.e., to apply filtering when the data is retrieved. The data summarization technique then gives a good level of quality to the summarized data, and a customer or user can benefit from it in their research. Excel is a widely used tool for data summarization and is discussed in brief.

DATA CLEANING:
"Data cleaning is the number one problem in data warehousing." Data quality is an essential characteristic that determines the reliability of data for making decisions. High-quality data is:
1. Complete: All relevant data, such as accounts, addresses and relationships for a given customer, is linked.
2. Accurate: Common data problems like misspellings, typos, and random abbreviations have been cleaned up.
3. Available: Required data is accessible on demand; users do not need to search manually for the information.
4. Timely: Up-to-date information is readily available to support decisions.
In general, data quality is defined as an aggregated value over a set of quality criteria. Starting with previously defined quality criteria, the author describes the set of criteria that are affected by comprehensive data cleansing and defines how to assess scores for each of them for an existing data collection. To measure the quality of a data collection, scores have to be assessed for each of the quality criteria. These scores can be used to quantify the necessity of data cleansing for a data collection as well as the success of a performed data cleansing process. Quality criteria can also be used in the optimization of data cleansing by specifying priorities for each criterion, which in turn influences the execution of the data cleansing methods affecting that criterion.

Data cleaning routines work to "clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. The actual process of data cleansing may involve removing typographical errors or validating and correcting values against a known list of entities. The validation may be strict. Data cleansing differs from data validation in that validation almost invariably means data is rejected from the system at entry time, rather than being corrected on batches of data. Data cleansing may also involve activities such as harmonization and standardization of data. For example, harmonization of short codes (St, Rd) to actual words (Street, Road); standardization of data is a means of changing a reference data set to a new standard, e.g., the use of standard codes.

The major data cleaning tasks include:
1. Identify outliers and smooth out noisy data
2. Fill in missing values
3. Correct inconsistent data
4. Resolve redundancy caused by data integration

Among these tasks, missing values cause inconsistencies for data mining, and handling them well is therefore important. In the medical domain, data might be missing because the value is not relevant to a particular case, could not be recorded when the data was collected, is withheld by users because of privacy concerns, or because it was unfeasible for the patient to undergo the clinical tests, the equipment malfunctioned, etc. Methods for resolving missing values are therefore needed in health care systems to enhance the quality of diagnosis. The following sections describe the proposed data cleaning methods.

Handling Missing Values
The missing value treatment method plays an important role in data preprocessing. Missing data is a common problem in statistical analysis. The tolerance level of missing data can be classified by the percentage of values that are missing:
- Up to 1% – Trivial
- 1–5% – Manageable
- 5–15% – Sophisticated methods are needed to handle the missing values
- More than 15% – Severe impact on interpretation

Several methods have been proposed in the literature to treat missing data; they are commonly divided into three categories, as proposed by Dempster et al. [1977]. The different patterns of missing values are discussed in the next section.
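Before turning to those patterns, here is a minimal sketch of two simple imputation strategies in Python with pandas; the DataFrame and its columns (age, blood_group) are hypothetical. Numeric attributes are commonly filled with a central value such as the mean, and categorical attributes with the most frequent value.

import pandas as pd

# Hypothetical patient records with missing values (None/NaN).
df = pd.DataFrame({
    "age":         [34, None, 51, 29, None],
    "blood_group": ["A", "B", None, "A", "A"],
})

# Numeric attribute: fill missing values with the attribute mean.
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical attribute: fill missing values with the most frequent value (mode).
df["blood_group"] = df["blood_group"].fillna(df["blood_group"].mode()[0])

print(df)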
4.4.2.1 Patterns of Missing Values
Missing values in a database fall into three categories: Missing Completely at Random (MCAR), Missing at Random (MAR) and Non-Ignorable (NI).

Missing Completely at Random (MCAR)
This is the highest level of randomness. It occurs when the probability of an instance (case) having a missing value for an attribute does not depend on either the known values or the missing values; the missing values are randomly distributed across all observations. This is not a realistic assumption for much real-world data.

Missing at Random (MAR)
Missingness does not depend on the true value of the missing variable, but it might depend on the values of other variables that are observed. This occurs when missing values are not randomly distributed across all observations, but are randomly distributed within one or more sub-samples.

Non-Ignorable (NI)
NI exists when missing values are not randomly distributed across observations. If the probability that a cell is missing depends on the unobserved value of the missing response, then the process is non-ignorable. The theoretical framework for handling missing values is discussed in the next section.

DATA INTEGRATION
Data Integration
- Combines data from multiple sources into a single store.
- Sources include multiple databases, data cubes or flat files.
Schema Integration
- Integrates metadata from different sources.
- Eg. A.cust_id = B.cust_no
Entity Identification Problem
- Identify real-world entities from different data sources.
- Eg. A pay_type field in one data source may take the values 'H' or 'S', while in another data source it takes the values 1 or 2.
Detecting and resolving data value conflicts:
- For the same real-world entity, the attribute value can differ between data sources.
- Possible reasons are different interpretations, different representations and different scaling.
- Eg. A sales amount represented in Dollars (USD) in one data source and in Pounds (GBP) in another data source.
Handling Redundancy in data integration:
- When we integrate multiple databases, data redundancy occurs.
- Object Identification – The same attributes / objects in different data sources may have different names.
- Derivable Data – An attribute in one data source may be derived from attribute(s) in another data source. Eg. Monthly_revenue in one data source and Annual_revenue in another data source.
- Such redundant attributes can be detected using correlation analysis.
- Careful integration of data from multiple sources can help in reducing or avoiding data redundancy and inconsistency, which in turn improves mining speed and quality.

Correlation Analysis – Numerical Data:
- Pearson's product-moment correlation coefficient between two attributes A and B is
  r(A,B) = Σ(ai − Ā)(bi − B̄) / ((n − 1)·σA·σB) = (Σ(ai·bi) − n·Ā·B̄) / ((n − 1)·σA·σB)
  where n is the number of tuples, Ā and B̄ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ(ai·bi) is the sum of the cross products of A and B.
- If the correlation coefficient between the attributes A and B is positive, they are positively correlated: as A's value increases, B's value also increases. The larger the coefficient, the stronger the correlation.
- If the correlation coefficient between A and B is zero, they are independent attributes.
- If the correlation coefficient is negative, they are negatively correlated.
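A minimal sketch of this computation in Python with numpy (the arrays A and B hold made-up sample values); the manual formula above and numpy's built-in np.corrcoef give the same coefficient.

import numpy as np

# Hypothetical paired observations of two numeric attributes A and B.
A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.0, 3.0, 5.0, 9.0, 11.0])

n = len(A)
# r(A,B) = sum((a_i - mean_A)(b_i - mean_B)) / ((n - 1) * std_A * std_B)
r = ((A - A.mean()) * (B - B.mean())).sum() / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))

print(round(r, 4))                        # manual computation
print(round(np.corrcoef(A, B)[0, 1], 4))  # same value via numpy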
(Figures: examples of positive correlation, negative correlation and no correlation.)

Correlation Analysis – Categorical Data:
- Applicable for data where the values of each attribute are divided into different categories.
- Use the Chi-Square test:
  χ² = Σ (Observed − Expected)² / Expected
- The higher the value of χ², the more strongly the attributes are related.
- The cells that contribute most to the χ² value are those whose observed frequency differs most from their expected frequency.
- The expected frequency is calculated from the data distribution in the categories of the two attributes.
- Consider two attributes A and B, with the values of A categorized into categories Ai and the values of B categorized into categories Bj.
- The expected frequency of (Ai, Bj) is Eij = (Count(Ai) × Count(Bj)) / N.
- Eg. Consider a sample population of 1500 people who are surveyed to see whether they play chess and whether they like science fiction books. The observed counts, with the expected frequencies in parentheses, are:

                              Play chess    Not play chess    Sum (row)
  Like science fiction          250 (90)        200 (360)         450
  Not like science fiction       50 (210)      1000 (840)        1050
  Sum (col.)                      300            1200            1500

- For example, the expected frequency for the cell (Play chess, Like science fiction) is
  (Count(Play chess) × Count(Like science fiction)) / Total sample population = (300 × 450) / 1500 = 90.
- χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93
- This shows that the categories Play chess and Like science fiction are strongly correlated.

DATA REDUCTION
Why Data Reduction?
- A database or data warehouse may store terabytes of data.
- Complex data analysis or mining would take a long time to run on the complete data set.
What is Data Reduction?
- Obtaining a reduced representation of the complete dataset.
- Produces the same, or almost the same, mining / analytical results as the original.
Data Reduction Strategies:
1. Data cube aggregation
2. Dimensionality reduction – remove unwanted attributes
3. Data compression
4. Numerosity reduction – fit the data to mathematical models
5. Discretization and concept hierarchy generation

1. Data Cube Aggregation:
- The lowest level of the data cube is called the base cuboid.
- Single-level aggregation – select a particular entity or attribute and aggregate on that attribute. Eg. aggregate along 'Year' in sales data.
- Multiple levels of aggregation – aggregate along multiple attributes – further reduces the size of the data to analyze.
- When a query is posed by the user, use the appropriate level of aggregation or data cube to solve the task.
- Queries regarding aggregated information should be answered using the data cube whenever possible.

2. Attribute Subset Selection (Feature Selection):
- The goal of attribute subset selection is to find the minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes.
- This helps to reduce the number of patterns produced, and those patterns are easier to understand.
Heuristic Methods (needed because of the exponential number of attribute subsets):
- Step-wise forward selection
- Step-wise backward elimination
- Combining forward selection and backward elimination
- Decision tree induction – e.g. Class 1: A1, A5, A6; Class 2: A2, A3, A4

3. Data Compression
- A compressed representation of the original data.
- This data reduction is called lossless if the original data can be reconstructed from the compressed data without any loss of information.
- The data reduction is called lossy if only an approximation of the original data can be reconstructed.
- Two lossy data compression methods are:
  o Wavelet transforms
  o Principal components analysis

3.1 Discrete Wavelet Transform (DWT):
- A linear signal-processing technique.
- It transforms a data vector X into a numerically different vector X'. The two vectors are of the same length.
- Here each tuple is an n-dimensional data vector X = {x1, x2, …, xn} of n attributes.
- The wavelet-transformed data can be truncated.
- Compressed approximation: store only a small fraction of the strongest wavelet coefficients.
- Apply the inverse of the DWT to obtain an approximation of the original data.
- Similar to the discrete Fourier transform (a signal-processing technique involving sines and cosines).
- The DWT uses a hierarchical pyramid algorithm, which has fast computational speed and halves the data in each iteration:
  1. The length of the data vector should be an integer power of two (padding with zeros can be done if required).
  2. Each transform applies two functions:
     a. Smoothing – sum / weighted average
     b. Difference – weighted difference
  3. These functions are applied to pairs of the data so that two data sets of length L/2 are obtained.
  4. The two transforms are applied iteratively until a user-desired data length is obtained.

3.2 Principal Components Analysis (PCA):
- Say the data to be compressed consists of N tuples with k attributes.
- Tuples can be called data vectors and attributes can be called dimensions, so the data consists of N data vectors each having k dimensions.
- Consider a number c which is much smaller than N (c << N).
- PCA searches for c orthogonal vectors of k dimensions that can best be used to represent the data.
- The data is thus projected onto a smaller space and hence compressed.
- In this process PCA also combines the essence of the existing attributes and produces a smaller set of attributes; the initial data is then projected onto this smaller attribute set.
Basic Procedure:
1. Normalize the input data, so that all attribute values are mapped to the same range.
2. Compute k orthonormal vectors called the principal components. These are unit vectors perpendicular to each other; the input data is a linear combination of the principal components.
3. Order the principal components in decreasing order of "significance" or strength.
4. Reduce the size of the data by eliminating the weaker components, i.e., those with low significance. The strongest principal components can then be used to reconstruct a good approximation of the original data.
- PCA can be applied to ordered and unordered attributes, and to sparse and skewed data.
- It can also be applied to multidimensional data by reducing it to two dimensions.
- It works only for numeric data.
(Figure: Principal Components Analysis – original axes X1, X2 and principal components Y1, Y2.)
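A minimal sketch of this basic procedure in Python with numpy (the small matrix X is made-up data): center the attributes, compute orthonormal components from the covariance matrix, sort them by significance, and keep only the strongest one.

import numpy as np

# Hypothetical data: N tuples with k numeric attributes.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

# 1. Normalize: center each attribute (full z-scoring would also work).
Xc = X - X.mean(axis=0)

# 2. Compute orthonormal principal components from the covariance matrix.
cov = np.cov(Xc, rowvar=False)
eig_vals, eig_vecs = np.linalg.eigh(cov)

# 3. Order the components by decreasing significance (eigenvalue).
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# 4. Keep only the strongest component to obtain the reduced data.
X_reduced = Xc @ eig_vecs[:, :1]
print(X_reduced.shape)   # (6, 1): six tuples projected onto one dimension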
4. Numerosity Reduction
- Reduces the data volume by choosing smaller forms of data representation.
- Two types: parametric and non-parametric.
- Parametric – the data is fitted to a model and only the model parameters are stored, not the actual data (outliers may also be stored). Eg. log-linear models.
- Non-parametric – does not fit the data to a model. Eg. histograms, clustering and sampling.

Regression and Log-Linear Models:
- Linear regression – the data is modeled to fit a straight line, i.e., the data can be modeled by the equation
  Y = α + βX
  where Y is called the "response variable" and X is called the "predictor variable".
- α (alpha) and β (beta) are the regression coefficients: α is the Y-intercept and β is the slope of the line.
- These regression coefficients can be solved for using the method of least squares.
- Multiple regression – an extension of linear regression in which the response variable Y is modeled as a function of a multidimensional predictor vector.
- Log-linear models estimate the probability of each cell in a base cuboid for a set of discretized attributes; higher-order data cubes are constructed from lower-order data cubes.

Histograms:
- Use binning to distribute the data.
- A histogram for an attribute A partitions the values of A into disjoint subsets / buckets.
- The buckets are laid out along the horizontal axis of the histogram; the vertical axis represents the frequency of values in each bucket.
- Singleton bucket – has only one attribute value / frequency pair.
- Eg. Consider the list of prices (in $) of sold items: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30, with buckets of uniform width, say $10.
- Methods of determining the buckets / partitioning the attribute values:
  o Equi-width: the width of each bucket is constant.
  o Equi-depth: the frequency of each bucket is constant.
  o V-Optimal: the histogram with the least variance, where the histogram variance is a weighted sum of the values in each bucket and the bucket weight is the number of values in the bucket.
  o MaxDiff: compute the difference between each pair of adjacent values; bucket boundaries are placed between the pairs having the β − 1 largest differences, where β (the number of buckets) is user specified.
  o V-Optimal and MaxDiff are the most accurate and practical.
- Histograms can be extended to multiple attributes – multidimensional histograms – which can capture dependencies between attributes. Histograms of up to five attributes have been found to be effective so far. Singleton buckets are useful for storing outliers with high frequency.

4.3 Clustering:
- Considers data tuples as objects.
- Partitions the objects into clusters, so that objects within a cluster are similar to one another and objects in different clusters are dissimilar.
- The quality of a cluster can be represented by its 'diameter' – the maximum distance between any two objects in the cluster.
- Another measure of cluster quality is the centroid distance – the average distance of each cluster object from the cluster centroid.
- The cluster representation of the data can be used to replace the actual data.
- Effectiveness depends on the nature of the data: clustering is effective for data that can be organized into distinct clusters, and not effective if the data is 'smeared'.
- Hierarchical clustering of the data is also possible; for faster data access in such cases we use multidimensional index trees.
- There are many choices of clustering definitions and algorithms available.
(Figure: clustering of data objects.)

4.4 Sampling:
- Can be used as a data reduction technique.
- Selects a random sample or subset of the data.
- Say a large dataset D contains N tuples.
1. Simple Random Sample WithOut Replacement (SRSWOR) of size n:
- Draw n tuples from the original N tuples in D, where n < N.
- The probability of drawing any tuple in D is 1/N, i.e., all tuples have an equal chance of being drawn.
2. Simple Random Sample With Replacement (SRSWR) of size n:
- Similar to SRSWOR, except that each time a tuple is drawn from D it is recorded and then replaced: after a tuple is drawn it is placed back in D so that it can be drawn again.
3. Cluster Sample:
- The tuples in D are grouped into M mutually disjoint clusters.
- Apply SRS (SRSWOR / SRSWR) to the set of clusters.
- Each page from which tuples are fetched can be considered as a cluster.
4. Stratified Sample:
- D is divided into mutually disjoint parts called strata.
- Apply SRS (SRSWOR / SRSWR) to each stratum of tuples.
- In this way even the group having the smallest number of tuples is represented.

DATA DISCRETIZATION AND CONCEPT HIERARCHY GENERATION:
Data Discretization Techniques:
- Reduce the number of attribute values.
- Divide the attribute values into intervals.
- Interval labels are used to replace the attribute values.
- Result – data that is easier to use, concise, and represented at the knowledge level.
Types of Data Discretization Techniques:
1. Supervised discretization
   a. Uses the class information of the data.
2. Unsupervised discretization
   a. Does not use the class information of the data.
3. Top-down discretization (splitting)
   a. Identifies 'split points' or 'cut points' in the data values.
   b. Splits the attribute values into intervals at the split points.
   c. Repeats recursively on the resulting intervals.
   d. Stops when the specified number of intervals is reached or some stopping criterion is met.
4. Bottom-up discretization (merging)
   a. Divides the attribute values into intervals where each interval has a distinct attribute value.
   b. Merges two intervals based on some merging criterion.
   c. Repeats recursively on the resulting intervals.
   d. Stops when the specified number of intervals is reached or some stopping criterion is met.

Discretization results in a hierarchical partitioning of the attribute, called a concept hierarchy. Concept hierarchies are used for data mining at multiple levels of abstraction. Eg. numeric values for the attribute Age can be replaced with the class labels 'Youth', 'Middle Aged' and 'Senior'. Discretization and concept hierarchy generation are preprocessing steps for data mining. For a single attribute, multiple concept hierarchies can be produced to meet various user needs. Manual definition of concept hierarchies by domain experts is a tedious and time-consuming task, and automated discretization methods are available. Some concept hierarchies are implicit at the schema definition level and are defined when the schema is defined by the domain experts.
Eg. of a concept hierarchy using the attribute 'Age': an interval denoted (Y, X] excludes the value Y and includes the value X.

Discretization and Concept Hierarchy Generation for Numeric Data:
- Concept hierarchy generation for numeric data is a difficult and tedious task, as numeric attributes have wide ranges of values and undergo frequent updates in any database.
- Automated discretization methods:
  o Binning
  o Histogram analysis
  o Entropy-based discretization
  o χ²-merging (ChiMerge)
  o Cluster analysis
  o Discretization by intuitive partitioning
- These methods assume the data is in sorted order.

Binning:
- A top-down, unsupervised discretization technique – no class information is used.
- A user-specified number of bins is used.
- The same technique is used for smoothing and numerosity reduction.
- The data is discretized using the equi-width or equi-depth method, and each bin value is replaced by the bin mean or bin median.
- The same technique is applied recursively on the resulting bins or partitions to generate a concept hierarchy.
- Outliers are fitted into separate bins / partitions / intervals.
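A minimal sketch of equi-width and equi-depth binning in Python with pandas (the price list is a shortened version of the example above; pd.cut and pd.qcut are standard pandas functions):

import pandas as pd

# Hypothetical attribute values (e.g. prices of sold items).
prices = pd.Series([1, 1, 5, 5, 5, 8, 8, 10, 12, 14, 15, 18, 20, 21, 25, 28, 30])

# Equi-width binning: 3 bins of equal value range.
width_bins = pd.cut(prices, bins=3)

# Equi-depth binning: 3 bins with (roughly) equal numbers of values.
depth_bins = pd.qcut(prices, q=3)

# Smoothing by bin means: replace every value with the mean of its bin.
smoothed = prices.groupby(width_bins).transform("mean")
print(pd.DataFrame({"price": prices, "bin": width_bins, "smoothed": smoothed}))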
Histogram Analysis:
- An unsupervised, top-down discretization technique.
- The data values are split into buckets – equi-width or equi-frequency.
- The procedure is repeated recursively on the resulting buckets to generate a multi-level concept hierarchy.
- It stops when the user-specified number of concept hierarchy levels has been generated.

Entropy-Based Discretization:
- A supervised, top-down discretization method.
- It calculates and determines split points: the value of the attribute A with the minimum entropy (expected information requirement) is chosen as the split point, and the data is divided into partitions at the split points.
- The procedure repeats recursively on the resulting partitions to produce a concept hierarchy for A.
- Basic method:
  o Consider a database D of tuples with a class label attribute, and let A be the attribute being discretized. Each value of A is considered as a candidate split point – a binary discretization: tuples with values of A <= split point form D1, and tuples with values of A > split point form D2.
  o Class information is used. Suppose there are two classes of tuples, C1 and C2. The ideal partitioning would place all class C1 tuples in the first partition and all class C2 tuples in the second, but this is unlikely: the first partition may have many tuples of class C1 and a few of class C2, and the second partition may have many tuples of class C2 and a few of class C1.
  o The expected information requirement for classifying a tuple in D after partitioning on the split point is
    Info_A(D) = (|D1| / |D|)·Entropy(D1) + (|D2| / |D|)·Entropy(D2)
    where Entropy(D1) = − Σ (i = 1..m) pi·log2(pi),
    m is the number of classes and pi is the probability of class i in D1 (and similarly for Entropy(D2)).
  o Select the split point with the minimum expected information requirement.
  o Repeat recursively on the resulting partitions to obtain the concept hierarchy, and stop when the number of intervals exceeds max-intervals (user specified).

χ² Merging (ChiMerge):
- A bottom-up, supervised discretization method.
- The best neighboring intervals are identified and merged recursively.
- The basic idea is that adjacent intervals with the same class distribution should be merged; otherwise they remain separate.
  o Initially, each distinct value of the numeric attribute is one interval.
  o Perform the χ² test on each pair of adjacent intervals.
  o The pair of adjacent intervals with the lowest χ² value is merged into a larger interval, since a low χ² value indicates similar class distributions.
  o Merging is done recursively until a pre-specified stopping criterion is met.
- The stopping criterion is determined by three conditions:
  o Stop when the χ² value of every pair of adjacent intervals exceeds a pre-specified significance level – typically set between 0.10 and 0.01.
  o The number of intervals cannot exceed a pre-specified max-interval (say 10 to 15).
  o Relative class frequencies should be consistent within an interval; the allowed level of inconsistency within an interval should be within a pre-specified threshold, say 3%.
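Returning to the entropy-based method above, a minimal sketch in plain Python of choosing the split point with the lowest expected information requirement; the age values and class labels are made up.

import math

def entropy(labels):
    """Entropy(D) = -sum(p_i * log2(p_i)) over the classes present in D."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs if p > 0)

def expected_info(values, labels, split):
    """Info_split(D) = |D1|/|D| * Entropy(D1) + |D2|/|D| * Entropy(D2)."""
    left = [c for v, c in zip(values, labels) if v <= split]
    right = [c for v, c in zip(values, labels) if v > split]
    n = len(labels)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

# Hypothetical attribute A (age) with class labels; pick the split point
# that minimises the expected information requirement.
ages = [23, 25, 30, 35, 40, 45, 50, 55]
cls  = ["no", "no", "no", "yes", "yes", "yes", "yes", "no"]
best = min(sorted(set(ages))[:-1], key=lambda s: expected_info(ages, cls, s))
print("best split point:", best)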
Cluster Analysis:
- Uses top-down or bottom-up discretization.
- The data values of an attribute are partitioned into clusters based on the closeness of the data values; this produces high-quality discretization results.
- Each cluster is a node in the concept hierarchy.
- In the top-down approach, each cluster is further subdivided into sub-clusters to create lower-level clusters or concepts.
- In the bottom-up approach, clusters are merged to create higher-level clusters or concepts.

Discretization by Intuitive Partitioning:
- Users like numeric value intervals to be uniform, easy to use, 'intuitive' and natural.
- Cluster analysis may produce intervals such as ($53,245.78, $62,311.78], but an interval such as ($50,000, $60,000] is better.
- 3-4-5 Rule:
  o Partitions a given data range into 3, 4 or 5 equi-width intervals.
  o Partitions recursively, level by level, based on the value range at the most significant digit.
  o Real-world data can contain extremely high or low values which need to be treated as outliers; for example, the assets of some people may be several orders of magnitude higher than those of others. Such outliers are handled separately in a different interval.
  o So the partitioning is based on the majority of the data, i.e., the values lying between the 5th and 95th percentiles of the given data range.
  o Eg. Profit of ABC Ltd in the year 2004.
  o The majority of the data (5th to 95th percentile) lies in (−$159,876, $1,838,761].
  o MIN = −$351,976; MAX = $4,700,896; LOW = −$159,876; HIGH = $1,838,761.
  o Most significant digit: msd = $1,000,000; hence LOW' = −$1,000,000 and HIGH' = $2,000,000.
  o Number of intervals = ($2,000,000 − (−$1,000,000)) / $1,000,000 = 3.
  o Hence the intervals are (−$1,000,000, $0], ($0, $1,000,000], ($1,000,000, $2,000,000].
  o LOW' < MIN, so adjust the left boundary to make the first interval smaller: the most significant digit of MIN is $100,000, so MIN' = −$400,000 and the first interval is reduced to (−$400,000, $0].
  o HIGH' < MAX, so add a new interval ($2,000,000, $5,000,000].
  o Hence the top-tier hierarchy intervals are (−$400,000, $0], ($0, $1,000,000], ($1,000,000, $2,000,000], ($2,000,000, $5,000,000].
  o These are further subdivided as per the 3-4-5 rule to obtain the lower-level hierarchies: (−$400,000, $0] is divided into 4 equi-width intervals, ($0, $1,000,000] into 5, ($1,000,000, $2,000,000] into 5, and ($2,000,000, $5,000,000] into 3 equi-width intervals.

Concept Hierarchy Generation for Categorical Data:
Categorical data is discrete data; eg. geographic location, job type, product item type.
Methods used:
1. Specification of a partial ordering of attributes explicitly at the schema level by users or experts.
2. Specification of a portion of a hierarchy by explicit data grouping.
3. Specification of the set of attributes that form the concept hierarchy, but not their partial ordering.
4. Specification of only a partial set of attributes.

1. Specification of a partial ordering of attributes at the schema level by users or domain experts:
- Eg. the dimension 'Location' in a data warehouse has the attributes 'Street', 'City', 'State' and 'Country'.
- A hierarchical definition of these attributes is obtained by ordering them as Street < City < State < Country at the schema level itself, by the user or expert.

2. Specification of a portion of the hierarchy by explicit data grouping:
- Manual definition of part of the concept hierarchy.
- In real, large databases it is unrealistic to define the concept hierarchy for the entire database manually by value enumeration, but we can easily specify intermediate-level groupings of data – a small portion of the hierarchy.
- For example, consider the attribute State, where we can specify groupings such as:
  {Chennai, Madurai, Trichy} ⊂ Tamilnadu
  {Bangalore, Mysore, Mangalore} ⊂ Karnataka

3. Specification of a set of attributes but not their partial ordering:
- The user specifies the set of attributes of the concept hierarchy but omits their ordering.
- Automatic concept hierarchy generation (attribute ordering) can be done in such cases, using a heuristic that counts the distinct values of each attribute.
- The attribute with the most distinct values is placed at the bottom of the hierarchy, and the attribute with the fewest distinct values is placed at the top.
- This heuristic rule works in most cases but fails in some; users or experts can examine the generated concept hierarchy and make manual adjustments.
- Eg. Concept hierarchy for the 'Location' dimension: Country (10 distinct values); State (508); City (10,804); Street (1,234,567). The generated ordering is Street < City < State < Country, and the user need not modify it.
- But the heuristic may fail for the 'Time' dimension: distinct years (100); distinct months (12); distinct days-of-week (7). The generated ordering would be Year < Month < Days-of-week, which is not correct.

4. Specification of only a partial set of attributes:
- The user may have only a vague idea of the concept hierarchy and may specify only a few of the attributes that form it. Eg. the user specifies just the attributes Street and City.
- To get the complete concept hierarchy in this case, the user-specified attributes are linked with the data semantics specified by the domain experts; the user retains the authority to modify the generated hierarchy.
- Suppose the domain expert has defined that the attributes Number, Street, City, State and Country are semantically linked. The concept hierarchy generated by linking the expert specification with the user specification is Number < Street < City < State < Country.
- The user can inspect this concept hierarchy and remove the unwanted attribute 'Number' to generate the new concept hierarchy Street < City < State < Country.

DATA WAREHOUSE IMPLEMENTATION
- It is important for a data warehouse system to be implemented with:
  o Highly efficient computation of data cubes
  o Access methods and query processing techniques
- Efficient computation of data cubes:
  o This means efficient computation of aggregations across many sets of dimensions.
  o The compute cube operator and its implementations: SQL is extended to include a compute cube operator.
  o Create a data cube for the dimensions item, city, year and the measure sales_in_dollars. Example queries to analyze the data: compute the sum of sales grouped by item and city; compute the sum of sales grouped by item; compute the sum of sales grouped by city.
  o Here the dimensions are item, city and year, and the measure / fact is sales_in_dollars. Hence the total number of cuboids (group-bys) for this data cube is 2³ = 8. The possible group-bys are {(city, item, year), (city, item), (city, year), (item, year), (city), (item), (year), ()}; these group-bys form the lattice of cuboids.
  o The 0-D (apex) cuboid is (); the 3-D (base) cuboid is (city, item, year).
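The cuboids of this lattice are simply the power set of the dimension list, which a few lines of Python can enumerate (the dimension names are those of the example above):

from itertools import combinations

# The 2^n cuboids (group-bys) of a cube with dimensions city, item, year.
dimensions = ["city", "item", "year"]

cuboids = [combo
           for r in range(len(dimensions) + 1)
           for combo in combinations(dimensions, r)]

print(len(cuboids))   # 8 = 2^3, from the apex cuboid () to the base cuboid
for c in cuboids:
    print(c)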
  o Hence, for a cube with n dimensions there are 2^n cuboids in total.
  o The statement 'compute cube sales' computes the sales aggregate cuboids for all eight subsets.
  o Pre-computation of cuboids leads to faster response times and avoids redundant computation.
  o The challenge in pre-computation is that the required storage space may explode.
  o The number of cuboids in an n-dimensional data cube with no concept hierarchy attached to any dimension is 2^n.
  o However, many dimensions do have concept hierarchies (consider, for example, the Time dimension). Then the total number of cuboids is
    (L1 + 1) × (L2 + 1) × … × (Ln + 1)
    where Li is the number of levels associated with dimension i.
  o Eg. If a cube has 10 dimensions and each dimension has 4 levels, the total number of cuboids generated is 5^10 ≈ 9.8 × 10^6.
  o This shows that it is unrealistic to pre-compute and materialize all the cuboids of a data cube.
- Hence we go for partial materialization. There are three choices of materialization:
  o No materialization: pre-compute only the base cuboid and no other cuboids; leads to slow computation.
  o Full materialization: pre-compute all cuboids; requires huge space.
  o Partial materialization: pre-compute a proper subset of the whole set of cuboids. This considers three factors:
    Identify the cuboids to materialize – based on workload, frequency, accessing cost, storage need, cost of update and index usage (or simply use a greedy algorithm, which has good performance).
    Exploit the materialized cuboids during query processing.
    Update the materialized cuboids during load and refresh (using parallelism and incremental update).
- Multiway array aggregation in the computation of data cubes:
  o To ensure fast online analytical processing we would like full materialization, but we must consider the amount of main memory available and the time taken for computation.
  o ROLAP and MOLAP use different cube computation techniques.
  o Optimization techniques for ROLAP cube computation:
    Sorting, hashing and grouping operations are applied to the dimension attributes in order to reorder and cluster tuples.
    Grouping is performed on some sub-aggregates – a 'partial grouping step' – to speed up computation.
    Aggregates may be computed from sub-aggregates rather than from the base tables.
    In ROLAP, dimension values are accessed using value-based / key-based addressing search strategies.
  o Optimization techniques for MOLAP cube computation:
    MOLAP uses direct array addressing to access dimension values.
    The array is partitioned into chunks (sub-cubes small enough to fit into main memory).
    Aggregates are computed by visiting cube cells; the number of times each cell is revisited is minimized to reduce memory access and storage costs. This is called multiway array aggregation in data cube computation.
  o MOLAP cube computation is faster than ROLAP cube computation.
- Indexing OLAP Data: bitmap indexing and join indexing.
- Bitmap Indexing:
  o An index on a particular column; each distinct value in the column has a bit vector.
  o The length of each bit vector equals the number of records in the base table.
  o The i-th bit is set if the i-th row of the base table has that value for the indexed column.
  o This approach is not suitable for high-cardinality domains.
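A toy sketch of a bitmap index in Python (the city values are made up): each distinct value gets one bit vector, and queries over the indexed column reduce to bitwise operations.

# A toy bitmap index on the 'city' column of a small base table.
rows = ["Chennai", "Madurai", "Chennai", "Trichy", "Madurai"]

bitmap_index = {}
for i, city in enumerate(rows):
    # One bit vector per distinct value; bit i is set when row i holds that value.
    bitmap_index.setdefault(city, [0] * len(rows))[i] = 1

print(bitmap_index["Chennai"])   # [1, 0, 1, 0, 0]

# Answering "city = Chennai OR city = Trichy" is a bitwise OR of two vectors.
answer = [a | b for a, b in zip(bitmap_index["Chennai"], bitmap_index["Trichy"])]
print(answer)                    # [1, 0, 1, 1, 0]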
- Join Indexing:
  o Join indexing registers the joinable rows of two relations.
  o Consider relations R and S, with R(RID, A) and S(SID, B), where RID and SID are the record identifiers of R and S respectively. For the join on attributes A and B, the join index record contains the pair (RID, SID).
  o In traditional databases, the join index maps attribute values to a list of record IDs; in data warehouses, the join index relates the values of the dimensions of a star schema to the rows of the fact table.
  o Eg. For a fact table Sales and two dimensions city and product, a join index on city maintains, for each distinct city, a list of the RIDs of the tuples of the fact table Sales for that city.
  o Join indices can span multiple dimensions – composite join indices.
  o To speed up query processing, join indexing and bitmap indexing can be integrated to form bitmapped join indices.
- Efficient processing of OLAP queries:
  o Steps for efficient OLAP query processing:
  o 1. Determine which OLAP operations should be performed on the available cuboids:
    Transform OLAP operations such as drill-down and roll-up into the corresponding SQL (relational algebra) operations. Eg. dice = selection + projection.
  o 2. Determine to which materialized cuboids the relevant OLAP operations should be applied. This involves (i) pruning cuboids using knowledge of "dominance", (ii) estimating the cost of the remaining materialized cuboids and (iii) selecting the cuboid with the least cost.
    Eg. Cube: "Sales [time, item, location]: sum(sales_in_dollars)".
    The dimension hierarchies used are "day < month < quarter < year" for the time dimension, "item_name < brand < type" for the item dimension, and "street < city < state < country" for the location dimension.
    Say the query to be processed is on {brand, state} with the condition year = "1997", and there are four materialized cuboids available:
    Cuboid 1: {item_name, city, year}; Cuboid 2: {brand, country, year};
    Cuboid 3: {brand, state, year}; Cuboid 4: {item_name, state} where year = 1997.
    Which cuboid should be selected for query processing?
    Step 1: Prune cuboids – prune cuboid 2, as the higher-level concept "country" cannot answer a query at the lower granularity "state".
    Step 2: Estimate the cuboid costs. Cuboid 1 costs the most of the three remaining cuboids, since item_name and city are at a finer granularity than the brand and state mentioned in the query.
    Step 3: If there are few distinct years and many item_names under each brand, then cuboid 3 has the least cost; but if the opposite holds and there are efficient indexes on item_name, then cuboid 4 has the least cost. Select cuboid 3 or cuboid 4 accordingly.
- Metadata repository:
  o Metadata is the data defining warehouse objects. The repository stores:
    A description of the structure of the data warehouse: schema, views, dimensions, hierarchies, derived data definitions, data mart locations and contents.
    Operational metadata: data lineage (the history of migrated data and its transformation path); currency of data (active, archived or purged); monitoring information (warehouse usage statistics, error reports, audit trails).
    The algorithms used for summarization: measure and dimension definition algorithms; granularity, subject and partition definitions; aggregation, summarization and pre-defined queries and reports.
    The mapping from the operational environment to the data warehouse: source database information; data refresh and purging rules; data extraction, cleaning and transformation rules; security rules (authorization and access control).
    Data related to system performance: data access and retrieval performance; rules for the timing and scheduling of refresh.
    Business metadata: business terms and definitions; data ownership information; data charging policies.
- Data warehouse back-end tools and utilities:
  o Data extraction: gets data from multiple, heterogeneous and external sources.
  o Data cleaning: detects errors in the data and rectifies them when possible.
  o Data transformation: converts data from legacy or host formats to the warehouse format.
  o Load: sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and partitions.

FROM DATA WAREHOUSING TO DATA MINING:
- Data warehouse usage:
  o Data warehouses and data marts are used in a wide range of applications.
  o They are used in feedback systems for enterprise management – the "plan-execute-assess" loop.
  o They are applied in banking, finance, retail, manufacturing, and so on.
  o Data warehouses are used for knowledge discovery and strategic decision making with data mining tools.
  o There are three kinds of data warehouse applications:
    Information processing: supports querying and basic statistical analysis, and reporting using cross-tabs, tables, charts and graphs.
    Analytical processing: multidimensional analysis of data warehouse data; supports basic OLAP operations such as slice-and-dice, drilling and pivoting.
    Data mining: knowledge discovery from hidden patterns; supports association, classification and prediction, and clustering; constructs analytical models; presents mining results using visualization tools.
- From Online Analytical Processing (OLAP) to Online Analytical Mining (OLAM):
  o OLAM, also called OLAP mining, integrates OLAP with data mining techniques.
  o Why OLAM?
    High quality of data in data warehouses: a data warehouse contains cleaned, transformed and integrated (preprocessed) data. Data mining tools need such costly preprocessing, so the warehouse serves as a valuable, high-quality data source for OLAP as well as for data mining.
    Available information processing infrastructure surrounding data warehouses: this includes accessing, integration, consolidation and transformation of multiple heterogeneous databases; ODBC/OLEDB connections; web accessing and servicing facilities; and reporting and OLAP analysis tools.
    OLAP-based exploratory data analysis: OLAM provides facilities for data mining on different subsets of data and at different levels of abstraction, e.g. drill-down, pivoting, roll-up, slicing and dicing on the OLAP data and on intermediate data mining results. Visualization tools further enhance the power of exploratory data mining.
    On-line selection of data mining functions: OLAM provides the flexibility to select the desired data mining functions and to swap data mining tasks dynamically.
Architecture of Online Analytical Mining:
- The OLAP and OLAM engines accept on-line queries via a user GUI API and work with the data cube in data analysis via a data cube API.
- A metadata directory is used to guide the access of the data cube.
- The MDDB is constructed by integrating multiple databases or by filtering a data warehouse via a database API, which may support ODBC/OLEDB connections.
- The OLAM engine consists of multiple data mining modules and is hence more sophisticated than the OLAP engine.
- Data mining should be a human-centered process – users should interact with the system frequently to perform exploratory data analysis.

4.6 OLAP Need
OLAP systems vary quite a lot, and they have generally been distinguished by a letter tagged onto the front of the word OLAP. ROLAP and MOLAP are the big players, and the other distinctions represent little more than marketing programs on the part of vendors to distinguish themselves, for example SOLAP and DOLAP. Here we aim to give a hint as to what these distinctions mean.

4.7 Categorization of OLAP Tools
Major Types:
Relational OLAP (ROLAP) – star schema based
Considered the fastest-growing style of OLAP technology, ROLAP or "Relational" OLAP systems work primarily from the data that resides in a relational database, where the base data and dimension tables are stored as relational tables. This model permits multidimensional analysis of data, enabling users to perform the equivalent of the traditional OLAP slicing and dicing feature. This is achieved through the use of any SQL reporting tool to extract or 'query' data directly from the data warehouse, wherein specifying a WHERE clause equals performing a certain slice-and-dice action.
One advantage of ROLAP over the other styles of OLAP analytic tools is that it is deemed to be more scalable in handling huge amounts of data. ROLAP sits on top of relational databases, enabling it to leverage several functionalities that a relational database is capable of. Another gain of a ROLAP tool is that it is efficient in managing both numeric and textual data. It also permits users to "drill down" to the leaf details, i.e., the lowest level of a hierarchy structure. However, ROLAP applications display slower performance compared to other styles of OLAP tools since, oftentimes, calculations are performed inside the server. Another demerit of ROLAP is that, because it depends on SQL for data manipulation, it may not be ideal for calculations that do not translate easily into an SQL query.

Multidimensional OLAP (MOLAP) – cube based
Multidimensional OLAP, with the popular acronym MOLAP, is widely regarded as the classic form of OLAP. One of the major distinctions of MOLAP against a ROLAP tool is that data is pre-summarized and stored in an optimized format in a multidimensional cube, instead of in a relational database. In this model, data is structured into proprietary formats in accordance with a client's reporting requirements, with the calculations pre-generated on the cubes.
This is probably by far the best OLAP tool for producing analysis reports, since it enables users to easily reorganize or rotate the cube structure to view different aspects of the data, by way of slicing and dicing. MOLAP analytic tools are also capable of performing complex calculations: since calculations are predefined upon cube creation, computed data is returned faster. MOLAP systems also provide users the ability to quickly write data back into the data set. Moreover, in comparison to ROLAP, MOLAP is considerably less demanding on hardware due to compression techniques. In a nutshell, MOLAP is more optimized for fast query performance and retrieval of summarized information.
There are certain limitations to the implementation of a MOLAP system. One primary weakness is that a MOLAP tool is less scalable than a ROLAP tool, as it is capable of handling only a limited amount of data. The MOLAP approach also introduces data redundancy. There are also certain MOLAP products that encounter difficulty in updating models with dimensions of very high cardinality.

Hybrid OLAP (HOLAP)
HOLAP is the product of the attempt to incorporate the best features of MOLAP and ROLAP into a single architecture. This kind of tool tries to bridge the technology gap between both products by enabling access to both multidimensional database (MDDB) and relational database management system (RDBMS) data stores. HOLAP systems store the larger quantities of detailed data in relational tables, while the aggregations are stored in pre-calculated cubes. HOLAP also has the capacity to "drill through" from the cube down to the relational tables for delineated data. Some of the advantages of this system are better scalability, quick data processing and flexibility in accessing data sources.

Other Types:
There are also less popular types of OLAP styles upon which one may stumble every so often. Some of the less famous types existing in the OLAP industry are listed below.

Web OLAP (WOLAP)
Simply put, Web OLAP, also referred to as Web-enabled OLAP, pertains to OLAP applications that are accessible via a web browser. Unlike traditional client/server OLAP applications, WOLAP is considered to have a three-tiered architecture consisting of three components: a client, a middleware and a database server. Probably the most appealing features of this style of OLAP are the considerably lower investment involved, the enhanced accessibility (a user only needs an internet connection and a web browser to connect to the data), and the ease of installation, configuration and deployment. But despite all of its unique features, it still cannot compare to a conventional client/server machine: currently it is inferior to OLAP applications deployed on client machines in terms of functionality, visual appeal and performance.

Desktop OLAP (DOLAP)
Desktop OLAP, or "DOLAP", is based on the idea that a user can download a section of the data from the database or source and work with that dataset locally, on their desktop. DOLAP is easier to deploy and has a cheaper cost, but comes with very limited functionality in comparison with other OLAP applications.

Mobile OLAP (MOLAP)
Mobile OLAP merely refers to OLAP functionality on a wireless or mobile device. It enables users to access and work on OLAP data and applications remotely through the use of their mobile devices.

Spatial OLAP (SOLAP)
With the aim of integrating the capabilities of both Geographic Information Systems (GIS) and OLAP into a single user interface, "SOLAP" or Spatial OLAP emerged. SOLAP is created to facilitate the management of both spatial and non-spatial data, as data may come not only in alphanumeric form but also as images and vectors. This technology provides easy and quick exploration of data that resides in a spatial database.
Other blends of OLAP products, such as the less popular 'DOLAP' and 'ROLAP' which here stand for Database OLAP and Remote OLAP, 'LOLAP' for Local OLAP and 'RTOLAP' for Real-Time OLAP, also exist but have barely made a noise in the OLAP industry.

Unit - IV Association Rule Mining And Classification Classes:09
Mining frequent patterns, associations and correlations, mining methods, mining various kinds of association rules, correlation analysis, constraint based association mining, classification and prediction, basic concepts, decision tree induction, Bayesian classification, rule based classification, classification by back propagation.

CLASSIFICATION AND PREDICTION:
There are two forms of data analysis that can be used to extract models describing important classes or to predict future data trends:
• Classification
• Prediction
Classification models predict categorical class labels, while prediction models predict continuous-valued functions. For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditure in dollars of potential customers on computer equipment given their income and occupation.

What is classification?
Following are examples of cases where the data analysis task is classification −
• A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
• A marketing manager at a company needs to analyze a customer with a given profile to determine whether the customer will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict categorical labels. These labels are "risky" or "safe" for the loan application data and "yes" or "no" for the marketing data.

What is prediction?
Following is an example of a case where the data analysis task is prediction −
Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his company. In this example we need to predict a numeric value, so the data analysis task is an example of numeric prediction. In this case, a model or predictor is constructed that predicts a continuous-valued function, or ordered value.
Note − Regression analysis is the statistical methodology most often used for numeric prediction.

How Does Classification Work?
With the help of the bank loan application discussed above, let us understand the working of classification. The data classification process includes two steps −
• Building the classifier or model
• Using the classifier for classification

Building the Classifier or Model
• This step is the learning step or the learning phase.
• In this step the classification algorithm builds the classifier.
• The classifier is built from the training set, made up of database tuples and their associated class labels.
• Each tuple that constitutes the training set belongs to a category or class. These tuples can also be referred to as samples, objects or data points.

Using the Classifier for Classification
In this step, the classifier is used for classification. Here test data is used to estimate the accuracy of the classification rules. The classification rules can be applied to new data tuples if the accuracy is considered acceptable.

Classification and Prediction Issues
The major issue is preparing the data for classification and prediction.
Preparing the data involves the following activities −
• Data Cleaning − Data cleaning involves removing the noise and treating missing values. The noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute.
• Relevance Analysis − The database may also have irrelevant attributes. Correlation analysis is used to know whether any two given attributes are related.
• Data Transformation and Reduction − The data can be transformed by any of the following methods.
  o Normalization − The data is transformed using normalization. Normalization involves scaling all values of a given attribute so that they fall within a small specified range. Normalization is used when, in the learning step, neural networks or methods involving distance measurements are used.
  o Generalization − The data can also be transformed by generalizing it to a higher-level concept. For this purpose we can use concept hierarchies.
Note − Data can also be reduced by other methods such as wavelet transformation, binning, histogram analysis, and clustering.

Comparison of Classification and Prediction Methods
Here are the criteria for comparing methods of classification and prediction −
• Accuracy − The accuracy of a classifier refers to its ability to predict the class label correctly; the accuracy of a predictor refers to how well it can guess the value of the predicted attribute for new data.
• Speed − This refers to the computational cost of generating and using the classifier or predictor.
• Robustness − This refers to the ability of the classifier or predictor to make correct predictions from noisy data.
• Scalability − Scalability refers to the ability to construct the classifier or predictor efficiently given large amounts of data.
• Interpretability − This refers to the level of understanding and insight provided by the classifier or predictor.

DECISION TREE INDUCTION:
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The topmost node in the tree is the root node.
The following decision tree is for the concept buy_computer, which indicates whether a customer at a company is likely to buy a computer or not. Each internal node represents a test on an attribute and each leaf node represents a class.
The benefits of having a decision tree are as follows −
• It does not require any domain knowledge.
• It is easy to comprehend.
• The learning and classification steps of a decision tree are simple and fast.

Decision Tree Induction Algorithm
A machine learning researcher named J. Ross Quinlan developed a decision tree algorithm in 1980 known as ID3 (Iterative Dichotomiser). Later, he presented C4.5, the successor of ID3. ID3 and C4.5 adopt a greedy approach: there is no backtracking, and the trees are constructed in a top-down, recursive, divide-and-conquer manner.
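Before the generic pseudocode below, here is a minimal sketch of decision tree induction with scikit-learn, assuming a tiny, already-encoded buy_computer-style dataset (the feature encoding and values are made up for illustration). Note that scikit-learn grows binary CART-style trees, so this approximates ID3/C4.5 rather than reproducing them exactly.

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical, already-encoded training data:
# columns = [age (0=youth, 1=middle_aged, 2=senior), student (0/1), credit_rating (0=fair, 1=excellent)]
X = [[0, 0, 0], [0, 0, 1], [1, 0, 0], [2, 0, 0], [2, 1, 0],
     [2, 1, 1], [1, 1, 1], [0, 0, 0], [0, 1, 0], [2, 1, 0]]
y = ["no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes"]

# Use information gain (entropy) as the attribute selection measure, as in ID3/C4.5.
clf = DecisionTreeClassifier(criterion="entropy").fit(X, y)

print(export_text(clf, feature_names=["age", "student", "credit_rating"]))
print(clf.predict([[0, 1, 0]]))   # classify a new tuple: a student in the 'youth' group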
Generating a decision tree from the training tuples of data partition D
Algorithm: Generate_decision_tree
Input:
- Data partition D, which is a set of training tuples and their associated class labels.
- attribute_list, the set of candidate attributes.
- Attribute_selection_method, a procedure to determine the splitting criterion that best partitions the data tuples into individual classes. This criterion includes a splitting_attribute and either a split point or a splitting subset.
Output: A decision tree
Method:
create a node N;
if the tuples in D are all of the same class C then
return N as a leaf node labeled with class C;
if attribute_list is empty then
return N as a leaf node labeled with the majority class in D; // majority voting
apply Attribute_selection_method(D, attribute_list) to find the best splitting_criterion;
label node N with splitting_criterion;
if splitting_attribute is discrete-valued and multiway splits are allowed then // not restricted to binary trees
attribute_list = attribute_list − splitting_attribute; // remove the splitting attribute
for each outcome j of splitting_criterion // partition the tuples and grow subtrees for each partition
let Dj be the set of data tuples in D satisfying outcome j; // a partition
if Dj is empty then
attach a leaf labeled with the majority class in D to node N;
else
attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
Tree Pruning
Tree pruning is performed in order to remove anomalies in the training data due to noise or outliers. Pruned trees are smaller and less complex.
Tree Pruning Approaches
There are two approaches to pruning a tree −
• Pre-pruning − The tree is pruned by halting its construction early.
• Post-pruning − This approach removes a sub-tree from a fully grown tree.
Cost Complexity
The cost complexity is measured by the following two parameters −
• the number of leaves in the tree, and
• the error rate of the tree.
BAYESIAN CLASSIFICATION
Bayesian classification is based on Bayes' theorem. Bayesian classifiers are statistical classifiers. Bayesian classifiers can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
Bayes' Theorem
Bayes' theorem is named after Thomas Bayes. There are two types of probabilities −
• Posterior probability, P(H|X)
• Prior probability, P(H)
where X is a data tuple and H is some hypothesis.
According to Bayes' theorem,
P(H|X) = P(X|H) P(H) / P(X)
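As a small numeric illustration of the theorem, the sketch below plugs made-up probabilities into P(H|X) = P(X|H) P(H) / P(X); the interpretation of H and X in the comments (buys_computer, age) is only a hypothetical example.

# A minimal numeric sketch of Bayes' theorem, P(H|X) = P(X|H) * P(H) / P(X).
# All probabilities below are made-up values, purely for illustration.
p_h = 0.6          # prior P(H): e.g. P(buys_computer = yes)
p_x_given_h = 0.3  # likelihood P(X|H): e.g. P(age = youth | buys_computer = yes)
p_x = 0.25         # evidence P(X): e.g. P(age = youth)

p_h_given_x = p_x_given_h * p_h / p_x   # posterior P(H|X)
print("posterior P(H|X) =", p_h_given_x)  # 0.72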
Bayesian Belief Network
Bayesian belief networks specify joint conditional probability distributions. They are also known as belief networks, Bayesian networks, or probabilistic networks.
• A belief network allows class conditional independencies to be defined between subsets of variables.
• It provides a graphical model of causal relationships, on which learning can be performed.
• We can use a trained Bayesian network for classification.
There are two components that define a Bayesian belief network −
• a directed acyclic graph, and
• a set of conditional probability tables.
Directed Acyclic Graph
• Each node in the directed acyclic graph represents a random variable.
• These variables may be discrete or continuous valued.
• These variables may correspond to actual attributes given in the data.
Directed Acyclic Graph Representation
The following diagram shows a directed acyclic graph for six Boolean variables. The arcs in the diagram allow the representation of causal knowledge. For example, lung cancer is influenced by a person's family history of lung cancer, as well as by whether or not the person is a smoker. It is worth noting that the variable PositiveXray is independent of whether the patient has a family history of lung cancer or is a smoker, given that we know the patient has lung cancer.
Conditional Probability Table
The conditional probability table for the values of the variable LungCancer (LC) shows each possible combination of the values of its parent nodes, FamilyHistory (FH) and Smoker (S).
RULE BASED CLASSIFICATION
IF-THEN Rules
A rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a rule in the following form −
IF condition THEN conclusion
Let us consider a rule R1:
R1: IF age = youth AND student = yes THEN buys_computer = yes
Points to remember −
• The IF part of the rule is called the rule antecedent or precondition.
• The THEN part of the rule is called the rule consequent.
• The antecedent part (the condition) consists of one or more attribute tests, and these tests are logically ANDed.
• The consequent part consists of the class prediction.
Note − We can also write rule R1 as follows −
R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes)
If the condition holds true for a given tuple, then the antecedent is satisfied.
Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from a decision tree. To extract a rule from a decision tree −
• One rule is created for each path from the root to a leaf node.
• To form the rule antecedent, each splitting criterion along the path is logically ANDed.
• The leaf node holds the class prediction, forming the rule consequent.
Rule Induction Using the Sequential Covering Algorithm
The sequential covering algorithm can be used to extract IF-THEN rules from the training data. We do not need to generate a decision tree first. In this algorithm, each rule for a given class covers many of the tuples of that class. Some of the sequential covering algorithms are AQ, CN2, and RIPPER.
As per the general strategy, the rules are learned one at a time. Each time a rule is learned, the tuples covered by the rule are removed and the process continues for the remaining tuples. In contrast, decision tree induction can be considered as learning a set of rules simultaneously, because the path to each leaf in a decision tree corresponds to a rule.
The following is the sequential learning algorithm, where rules are learned for one class at a time. When learning a rule for a class Ci, we want the rule to cover all the tuples of class Ci and no tuples of any other class.
Algorithm: Sequential Covering
Input:
- D, a data set of class-labeled tuples,
- Att_vals, the set of all attributes and their possible values.
Output: A set of IF-THEN rules.
Method:
Rule_set = { }; // the initial set of learned rules is empty
for each class c do
repeat
Rule = Learn_One_Rule(D, Att_vals, c);
remove the tuples covered by Rule from D;
until termination condition;
Rule_set = Rule_set + Rule; // add the new rule to the rule set
end for
return Rule_set;
Rule Pruning
A rule is pruned for the following reasons −
• The assessment of quality is made on the original set of training data. The rule may perform well on the training data but less well on subsequent data. That is why rule pruning is required.
• The rule is pruned by removing a conjunct. Rule R is pruned if the pruned version of R has greater quality, as assessed on an independent set of tuples.
FOIL is one simple and effective method for rule pruning. For a given rule R,
FOIL_Prune(R) = (pos − neg) / (pos + neg)
where pos and neg are the numbers of positive and negative tuples covered by R, respectively.
Note − This value increases with the accuracy of R on the pruning set. Hence, if the FOIL_Prune value is higher for the pruned version of R, then we prune R.
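The following small sketch computes the FOIL_Prune measure defined above for a rule before and after removing a conjunct; the coverage counts are made up for illustration.

# A minimal sketch of the FOIL_Prune measure, (pos - neg) / (pos + neg),
# where pos and neg count the positive and negative tuples covered by a rule
# on an independent pruning set. The counts below are invented for illustration.
def foil_prune(pos, neg):
    return (pos - neg) / (pos + neg)

original_rule = foil_prune(pos=45, neg=15)   # rule R before pruning
pruned_rule = foil_prune(pos=40, neg=5)      # R with one conjunct removed

# Prune R if the pruned version scores higher on the pruning set.
print(original_rule, pruned_rule, "prune" if pruned_rule > original_rule else "keep")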
Unit - V Clustering And Trends In Data Mining Classes:08
Cluster analysis: Types of data, categorization of major clustering methods, K-means partitioning methods, hierarchical methods, density based methods, grid based methods, model based clustering methods, clustering high dimensional data, constraint based cluster analysis, outlier analysis; Trends in data mining: Data mining applications, data mining system products and research prototypes, social impacts of data mining.
What is Cluster Analysis?
The process of grouping a set of physical objects into classes of similar objects is called clustering.
Cluster – a collection of data objects. Objects within a cluster are similar to one another, and objects in different clusters are dissimilar.
Cluster applications – pattern recognition, image processing and market research:
- helps marketers to discover the characteristics of customer groups based on purchasing patterns
- categorize genes in plant and animal taxonomies
- identify groups of houses in a city according to house type, value and geographical location
- classify documents on the WWW for information discovery
Clustering is a preprocessing step for other data mining steps such as classification and characterization.
Clustering – unsupervised learning – does not rely on predefined classes with class labels.
Typical requirements of clustering in data mining:
1. Scalability – Clustering algorithms should work for huge databases.
2. Ability to deal with different types of attributes – Clustering algorithms should work not only for numeric data, but also for other data types.
3. Discovery of clusters with arbitrary shape – Clustering algorithms (based on distance measures) should work for clusters of any shape.
4. Minimal requirements for domain knowledge to determine input parameters – Clustering results are sensitive to the input parameters of a clustering algorithm (for example, the number of desired clusters). Determining the values of these parameters is difficult and requires some domain knowledge.
5. Ability to deal with noisy data – Outlier, missing, unknown and erroneous data handled poorly by a clustering algorithm may lead to clusters of poor quality.
6. Insensitivity to the order of input records – Clustering algorithms should produce the same results even if the order of the input records is changed.
7. High dimensionality – Data in a high dimensional space can be sparse and highly skewed, hence it is challenging for a clustering algorithm to cluster data objects in a high dimensional space.
8. Constraint-based clustering – In real world scenarios, clustering is performed under various constraints. It is a challenging task to find groups of data with good clustering behaviour that also satisfy the various constraints.
9. Interpretability and usability – Clustering results should be interpretable, comprehensible and usable, so we should study how an application goal may influence the selection of clustering methods.
Types of data in Cluster Analysis
1. Data Matrix (object-by-variable structure): Represents n objects (such as persons) with p variables or attributes (such as age, height, weight, gender, race and so on). The structure is in the form of a relational table, or an n x p matrix, called a "two-mode" matrix.
2. Dissimilarity Matrix (object-by-object structure): This stores a collection of proximities (closeness or distance) that are available for all pairs of the n objects. It is represented by an n-by-n table, called a "one-mode" matrix, where d(i, j) is the dissimilarity between objects i and j; d(i, j) = d(j, i) and d(i, i) = 0.
Many clustering algorithms use the dissimilarity matrix, so data represented as a data matrix is converted into a dissimilarity matrix before applying such clustering algorithms. Clustering of objects is done based on their similarities or dissimilarities; similarity coefficients or dissimilarity coefficients are derived from correlation coefficients.
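The conversion from a data matrix to a dissimilarity matrix can be sketched as follows; the three objects (here with attributes such as age, height and weight) and the choice of Euclidean distance are illustrative only.

# A hedged sketch of turning a data matrix (n objects x p numeric attributes)
# into an n-by-n dissimilarity matrix using Euclidean distance. Toy values.
import math

data_matrix = [
    [25, 170, 65],   # object 1: age, height (cm), weight (kg) - made up
    [40, 165, 80],   # object 2
    [31, 180, 75],   # object 3
]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

n = len(data_matrix)
dissimilarity = [[euclidean(data_matrix[i], data_matrix[j]) for j in range(n)]
                 for i in range(n)]

# d(i, i) = 0 and d(i, j) = d(j, i), as required of a dissimilarity matrix.
for row in dissimilarity:
    print([round(d, 1) for d in row])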
Categorization of Major Clustering Methods
The choice among the many available clustering algorithms depends on the type of data available and on the application. The major categories are:
1. Partitioning Methods:
- Construct k partitions of the n data objects, where each partition is a cluster and k <= n.
- Each partition must contain at least one object, and each object must belong to exactly one partition.
- Iterative relocation technique – attempts to improve the partitioning by moving objects from one group to another.
- Good partitioning – objects in the same cluster are "close" / related, and objects in different clusters are "far apart" / very different.
- Uses the algorithms:
o k-means algorithm: each cluster is represented by the mean value of the objects in the cluster.
o k-medoids algorithm: each cluster is represented by one of the objects located near the centre of the cluster.
o These work well for small to medium sized databases.
2. Hierarchical Methods:
- Create a hierarchical decomposition of the given set of data objects.
- Two types – agglomerative and divisive.
- Agglomerative approach (bottom-up approach):
o Each object initially forms a separate group.
o Successively merges groups close to one another (based on the distance between clusters).
o Continues until all the groups are merged into one, or until a termination condition holds (the termination condition can be a desired number of clusters).
- Divisive approach (top-down approach):
o Starts with all the objects in the same cluster.
o Successively splits clusters into smaller clusters.
o Continues until each object is in its own cluster, or until a termination condition holds (the termination condition can be a desired number of clusters).
- Disadvantage – once a merge or split is done, it cannot be undone.
- Advantage – low computational cost.
- Combining these two approaches gives further advantages; clustering algorithms with this integrated approach are BIRCH and CURE.
3. Density-Based Methods:
- The above methods produce spherical-shaped clusters.
- To discover clusters of arbitrary shape, clustering is done based on the notion of density (number of objects or data points).
- Used to filter out noise or outliers.
- Continue growing a cluster so long as the density in the neighbourhood exceeds some threshold; that is, for each data point within a given cluster, the neighbourhood of a given radius has to contain at least a minimum number of points.
- Uses the algorithms DBSCAN and OPTICS.
4. Grid-Based Methods:
- Divide the object space into a finite number of cells to form a grid structure.
- Perform clustering operations on the grid structure.
- Advantage – fast processing time – independent of the number of data objects and dependent only on the number of cells in the data grid.
- STING is a typical grid-based method; CLIQUE and Wave-Cluster are both grid-based and density-based clustering algorithms.
5. Model-Based Methods:
- Hypothesize a model for each of the clusters and find the best fit of the data to the model.
- Form clusters by constructing a density function that reflects the spatial distribution of the data points.
- Robust clustering methods; detect noise / outliers.
Many algorithms combine several clustering methods.
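As a concrete illustration of one of these categories, the agglomerative (bottom-up) hierarchical approach can be sketched as follows, assuming the SciPy library is available; the 2-D points, the choice of 'average' linkage and the cut at two clusters are all illustrative assumptions.

# A hedged sketch of agglomerative hierarchical clustering using SciPy.
# Each point starts as its own cluster; the closest groups are merged step by step.
from scipy.cluster.hierarchy import linkage, fcluster

points = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],   # one tight group (made up)
          [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]]   # another tight group (made up)

merge_tree = linkage(points, method="average")   # the bottom-up merge hierarchy

# Cut the hierarchy when two clusters remain (the termination condition).
labels = fcluster(merge_tree, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 1 2 2 2]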
Cluster is a group of objects that belong to the same class. In other words, similar objects are grouped in one cluster and dissimilar objects are grouped in another cluster.
What is Clustering?
Clustering is the process of grouping abstract objects into classes of similar objects.
Points to Remember
• A cluster of data objects can be treated as one group.
• While doing cluster analysis, we first partition the set of data into groups based on data similarity and then assign labels to the groups.
• The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.
Applications of Cluster Analysis
• Clustering analysis is broadly used in many applications such as market research, pattern recognition, data analysis, and image processing.
• Clustering can also help marketers discover distinct groups in their customer base, and they can characterize their customer groups based on purchasing patterns.
• In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionalities and gain insight into structures inherent in populations.
• Clustering also helps in the identification of areas of similar land use in an earth observation database. It also helps in the identification of groups of houses in a city according to house type, value, and geographic location.
• Clustering also helps in classifying documents on the web for information discovery.
• Clustering is also used in outlier detection applications such as the detection of credit card fraud.
• As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data and to observe the characteristics of each cluster.
Requirements of Clustering in Data Mining
The following points throw light on why clustering is required in data mining −
• Scalability − We need highly scalable clustering algorithms to deal with large databases.
• Ability to deal with different kinds of attributes − Algorithms should be applicable to any kind of data, such as interval-based (numerical) data, categorical data, and binary data.
• Discovery of clusters with arbitrary shape − The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be bounded to distance measures that tend to find only spherical clusters of small size.
• High dimensionality − The clustering algorithm should be able to handle not only low-dimensional data but also data in a high-dimensional space.
• Ability to deal with noisy data − Databases contain noisy, missing or erroneous data. Some algorithms are sensitive to such data and may produce poor quality clusters.
• Interpretability − The clustering results should be interpretable, comprehensible, and usable.
Clustering Methods
Clustering methods can be classified into the following categories −
• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-based Method
• Model-based Method
• Constraint-based Method
Partitioning Method
Suppose we are given a database of 'n' objects and the partitioning method constructs 'k' partitions of the data. Each partition will represent a cluster, with k ≤ n.
It means that the method will classify the data into k groups, which satisfy the following requirements −
• Each group contains at least one object.
• Each object must belong to exactly one group.
Points to remember −
• For a given number of partitions (say k), the partitioning method will create an initial partitioning.
• It then uses the iterative relocation technique to improve the partitioning by moving objects from one group to another.
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. We can classify hierarchical methods on the basis of how the hierarchical decomposition is formed. There are two approaches here −
• Agglomerative Approach
• Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this approach, we start with each object forming a separate group. It keeps merging the objects or groups that are close to one another, and it keeps doing so until all of the groups are merged into one or until the termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. In this approach, we start with all of the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters. This is done until each object is in its own cluster or the termination condition holds. This method is rigid, i.e., once a merging or splitting is done, it can never be undone.
Approaches to Improve Quality of Hierarchical Clustering
Here are the two approaches that are used to improve the quality of hierarchical clustering −
• Perform careful analysis of object linkages at each hierarchical partitioning.
• Integrate hierarchical agglomeration by first using a hierarchical agglomerative algorithm to group objects into micro-clusters, and then performing macro-clustering on the micro-clusters.
Density-based Method
This method is based on the notion of density. The basic idea is to continue growing a given cluster as long as the density in the neighborhood exceeds some threshold, i.e., for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points. A small sketch of this idea appears after this overview of methods.
Grid-based Method
In this method, the objects together form a grid. The object space is quantized into a finite number of cells that form a grid structure.
Advantages
• The major advantage of this method is fast processing time.
• It is dependent only on the number of cells in each dimension of the quantized space.
Model-based Methods
In this method, a model is hypothesized for each cluster to find the best fit of the data to the given model. This method locates the clusters by clustering the density function, which reflects the spatial distribution of the data points. It also provides a way to automatically determine the number of clusters based on standard statistics, taking outliers or noise into account, and therefore yields robust clustering methods.
Constraint-based Method
In this method, the clustering is performed by incorporating user- or application-oriented constraints. A constraint refers to the user expectation or the properties of the desired clustering results. Constraints provide us with an interactive way of communicating with the clustering process. Constraints can be specified by the user or by the application requirement.
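The density-based idea mentioned above (grow a cluster while the neighbourhood of a given radius contains at least a minimum number of points) can be sketched with DBSCAN, assuming scikit-learn is available; the points and the parameter values eps and min_samples are chosen only for illustration.

# A hedged sketch of the density-based method using scikit-learn's DBSCAN.
from sklearn.cluster import DBSCAN

points = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 0.8],   # dense region A (made up)
          [6.0, 6.0], [6.1, 5.9], [5.9, 6.1],               # dense region B (made up)
          [3.0, 9.0]]                                        # isolated point (noise)

# eps = neighbourhood radius, min_samples = minimum number of points required in it
model = DBSCAN(eps=0.5, min_samples=3).fit(points)
print(model.labels_)   # e.g. [0 0 0 0 1 1 1 -1]; -1 marks noise/outliers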
PARTITIONING METHODS
The database has n objects and k partitions, where k <= n; each partition is a cluster. Partitioning criterion = similarity function: objects within a cluster are similar, and objects of different clusters are dissimilar.
Classical Partitioning Methods: k-means and k-medoids
(A) Centroid-based technique: the k-means method:
- Cluster similarity is measured using the mean value of the objects in the cluster (the cluster's centre of gravity).
- Randomly select k objects; each selected object is an initial cluster mean or centre.
- Each of the remaining objects is assigned to the most similar cluster, based on the distance between the object and the cluster mean.
- Compute a new mean for each cluster.
- This process iterates until all the objects are assigned to a cluster and the partitioning criterion is met.
- This algorithm determines k partitions that minimize the squared-error function, defined as
E = Σ (i = 1..k) Σ (x ∈ Ci) |x − mi|²
where x is the point representing an object and mi is the mean of cluster Ci.
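A minimal, pure-Python sketch of this procedure follows; the 1-D data, k = 2 and the naive initialisation (first k objects) are chosen purely for illustration.

# A minimal sketch of k-means: pick k initial means, assign each object to the
# nearest mean, recompute the means, repeat until the assignment stops changing.
def kmeans(points, k, iterations=100):
    means = points[:k]                          # naive initialisation: first k objects
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in points:                        # assign each object to the closest mean
            nearest = min(range(k), key=lambda i: abs(x - means[i]))
            clusters[nearest].append(x)
        new_means = [sum(c) / len(c) if c else means[i] for i, c in enumerate(clusters)]
        if new_means == means:                  # partitioning criterion met: no change
            break
        means = new_means
    return means, clusters

means, clusters = kmeans([1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 5.1], k=2)
print("means:", means)
print("clusters:", clusters)
# The squared error E = sum over clusters of sum over x of |x - mean|^2
error = sum((x - m) ** 2 for m, c in zip(means, clusters) for x in c)
print("squared error E:", round(error, 3))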
HIERARCHICAL METHODS
These methods work by grouping data objects into a tree of clusters. There are two types – agglomerative and divisive. Clustering algorithms that take an integrated approach combining these two types are BIRCH, CURE, ROCK and CHAMELEON.
BIRCH – Balanced Iterative Reducing and Clustering using Hierarchies:
- An integrated hierarchical clustering algorithm.
- Introduces two concepts – the Clustering Feature (CF) and the CF tree (Clustering Feature tree).
- CF trees are a summarized cluster representation, which helps achieve good speed and clustering scalability; good for incremental and dynamic clustering of incoming data points.
- The clustering feature CF is a summary statistic for a sub-cluster, defined as CF = (N, LS, SS), where N is the number of points in the sub-cluster, LS is the linear sum of the N points (LS = Σ (i = 1..N) xi) and SS is the square sum of the data points (SS = Σ (i = 1..N) xi²).
- CF tree – a height-balanced tree that stores the clustering features. It has two parameters – the branching factor B and the threshold T. The branching factor specifies the maximum number of children per node. The threshold parameter T is the maximum diameter of the sub-clusters stored at the leaf nodes; changing the threshold value changes the size of the tree. The non-leaf nodes store the sums of their children's CFs and so summarize information about their children.
- The BIRCH algorithm has the following two phases:
o Phase 1: Scan the database to build an initial in-memory CF tree – a multi-level compression of the data that preserves the inherent clustering structure of the data. The CF tree is built dynamically as data points are inserted into the closest leaf entry. If the diameter of the sub-cluster in a leaf node after insertion becomes larger than the threshold, then the leaf node (and possibly other nodes) is split. After a new point is inserted, the information about it is passed towards the root of the tree. If the memory needed to store the CF tree is larger than the available main memory, then a larger threshold value is specified and the CF tree is rebuilt; this rebuild process works from the leaf nodes of the old tree, so the data has to be read from the database only once.
o Phase 2: Apply a clustering algorithm to cluster the leaf entries of the CF tree.
- Advantages:
o Produces the best clusters possible with the available resources.
o Minimizes the I/O time.
o The computational complexity of this algorithm is O(N), where N is the number of objects to be clustered.
- Disadvantages:
o Not a natural way of clustering.
o Does not work well for non-spherical clusters.
CURE – Clustering Using Representatives:
- Integrates hierarchical and partitioning algorithms.
- Handles clusters of different shapes and sizes; handles outliers separately.
- Here, a set of representative points is used to represent a cluster.
- These points are generated by first selecting well-scattered points in a cluster and then shrinking them towards the centre of the cluster by a specified fraction (the shrinking factor).
- The closest pair of clusters is merged at each step of the algorithm.
- Having more than one representative point per cluster allows CURE to handle clusters of non-spherical shape; the shrinking helps to identify outliers.
- To handle large databases, CURE employs a combination of random sampling and partitioning. The resulting clusters from these samples are then merged to obtain the final clusters.
- CURE algorithm:
o Draw a random sample s.
o Partition sample s into p partitions, each of size s/p.
o Partially cluster the partitions into s/(pq) clusters, where q > 1.
o Eliminate outliers by random sampling – if a cluster grows too slowly, eliminate it.
o Cluster the partial clusters.
o Label the data with the corresponding cluster labels.
- Advantages:
o High quality clusters.
o Removes outliers.
o Produces clusters of different shapes and sizes.
o Scales to large databases.
- Disadvantages:
o Needs parameters – the size of the random sample, the number of clusters and the shrinking factor.
o These parameter settings have a significant effect on the results.
ROCK:
- An agglomerative hierarchical clustering algorithm, suitable for clustering categorical attributes.
- It measures the similarity of two clusters by comparing the aggregate inter-connectivity of the two clusters against a user-specified static inter-connectivity model. The inter-connectivity of two clusters C1 and C2 is defined by the number of cross links between the two clusters, where link(pi, pj) is the number of common neighbours between two points pi and pj.
- Two steps:
o First construct a sparse graph from the given data similarity matrix, using a similarity threshold and the concept of shared neighbours.
o Then perform an agglomerative hierarchical clustering algorithm on the sparse graph.
CHAMELEON – a hierarchical clustering algorithm using dynamic modelling:
- In this clustering process, two clusters are merged if the inter-connectivity and closeness (proximity) between the two clusters are highly related to the internal inter-connectivity and closeness of the objects within the clusters.
- This merge process produces natural and homogeneous clusters.
- It applies to all types of data as long as a similarity function is specified.
- CHAMELEON first uses a graph partitioning algorithm to cluster the data items into a large number of small sub-clusters. Then it uses an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining the sub-clusters created by the graph partitioning algorithm. To determine the pairs of most similar sub-clusters, it considers the inter-connectivity as well as the closeness of the clusters.
- Objects are represented using a k-nearest-neighbour graph: each vertex of this graph represents an object, and an edge exists between two vertices (objects) if one object is among the k most similar objects of the other.
- The graph is partitioned by removing the edges in sparse regions and keeping the edges in dense regions; each partitioned sub-graph forms a cluster. The final clusters are then formed by iteratively merging the clusters from the previous cycle based on their inter-connectivity and closeness.
- CHAMELEON determines the similarity between each pair of clusters Ci and Cj according to their relative inter-connectivity RI(Ci, Cj) and their relative closeness RC(Ci, Cj).
- The relative inter-connectivity is defined as
RI(Ci, Cj) = |EC(Ci, Cj)| / ( (|EC(Ci)| + |EC(Cj)|) / 2 )
where EC(Ci, Cj) is the edge-cut of the cluster containing both Ci and Cj (the edges connecting Ci to Cj) and |EC(Ci)| is the size of the min-cut bisector of Ci.
- The relative closeness is defined as
RC(Ci, Cj) = avg(EC(Ci, Cj)) / ( (|Ci| / (|Ci| + |Cj|)) · avg(EC(Ci)) + (|Cj| / (|Ci| + |Cj|)) · avg(EC(Cj)) )
where avg(EC(Ci, Cj)) is the average weight of the edges that connect vertices in Ci to vertices in Cj, and avg(EC(Ci)) is the average weight of the edges that belong to the min-cut bisector of cluster Ci.
- Advantages:
o More powerful than BIRCH and CURE.
o Produces arbitrarily shaped clusters.
- Processing cost depends on n, the number of objects.
APPLICATIONS OF DATA MINING:
Data mining is primarily used today by companies with a strong consumer focus — retail, financial, communication, and marketing organizations — to "drill down" into their transactional data and determine pricing, customer preferences and product positioning, impact on sales, customer satisfaction and corporate profits. With data mining, a retailer can use point-of-sale records of customer purchases to develop products and promotions that appeal to specific customer segments.
14 areas where data mining is widely used
Here is a list of 14 other important areas where data mining is widely used:
Future Healthcare
Data mining holds great potential to improve health systems. It uses data and analytics to identify best practices that improve care and reduce costs. Researchers use data mining approaches such as multi-dimensional databases, machine learning, soft computing, data visualization and statistics. Mining can be used to predict the volume of patients in every category, and processes are developed to make sure that patients receive appropriate care at the right place and at the right time. Data mining can also help healthcare insurers detect fraud and abuse.
Market Basket Analysis
Market basket analysis is a modelling technique based on the theory that if you buy a certain group of items you are more likely to buy another group of items. This technique may allow the retailer to understand the purchase behaviour of a buyer. This information may help the retailer to know the buyer's needs and change the store's layout accordingly. Using differential analysis, a comparison of results between different stores, or between customers in different demographic groups, can be made.
Education
There is a newly emerging field called Educational Data Mining (EDM), concerned with developing methods that discover knowledge from data originating in educational environments. The goals of EDM include predicting students' future learning behaviour, studying the effects of educational support, and advancing scientific knowledge about learning. Data mining can be used by an institution to take accurate decisions and also to predict student results. With these results, the institution can focus on what to teach and how to teach it. The learning patterns of the students can be captured and used to develop techniques for teaching them.
Manufacturing Engineering
Knowledge is the best asset a manufacturing enterprise can possess. Data mining tools can be very useful for discovering patterns in complex manufacturing processes. Data mining can be used in system-level design to extract the relationships between product architecture, product portfolio, and customer-needs data. It can also be used to predict product development span time, cost, and dependencies among tasks.
CRM
Customer Relationship Management is all about acquiring and retaining customers, improving customer loyalty and implementing customer-focused strategies. To maintain a proper relationship with a customer, a business needs to collect data and analyse the information. This is where data mining plays its part.
With data mining technologies, the collected data can be used for analysis. Instead of being confused about where to focus to retain customers, the seekers of a solution get filtered results.
Fraud Detection
Billions of dollars have been lost to fraud. Traditional methods of fraud detection are time consuming and complex. Data mining aids in providing meaningful patterns and turning data into information. Any information that is valid and useful is knowledge. A perfect fraud detection system should protect the information of all users. A supervised method includes the collection of sample records, which are classified as fraudulent or non-fraudulent. A model is built using this data, and the algorithm is then used to identify whether a record is fraudulent or not.
Intrusion Detection
Any action that compromises the integrity and confidentiality of a resource is an intrusion. Defensive measures to avoid an intrusion include user authentication, avoiding programming errors, and information protection. Data mining can help improve intrusion detection by adding a level of focus to anomaly detection. It helps an analyst distinguish an attack from common everyday network activity. Data mining also helps extract data that is more relevant to the problem.
Lie Detection
Apprehending a criminal is easy, whereas bringing out the truth from him is difficult. Law enforcement can use mining techniques to investigate crimes and monitor the communication of suspected terrorists. This field also includes text mining, which seeks to find meaningful patterns in data that is usually unstructured text. The data samples collected from previous investigations are compared and a model for lie detection is created. With this model, processes can be created according to the necessity.
Customer Segmentation
Traditional market research may help us to segment customers, but data mining goes deeper and increases market effectiveness. Data mining aids in aligning customers into distinct segments and tailoring offerings to the customers' needs. Marketing is always about retaining customers. Data mining allows a business to find a segment of customers based on vulnerability, offer them special offers and enhance satisfaction.
Financial Banking
With computerised banking everywhere, a huge amount of data is generated with new transactions. Data mining can contribute to solving business problems in banking and finance by finding patterns, causalities, and correlations in business information and market prices that are not immediately apparent to managers, because the volume of data is too large or is generated too quickly to be screened by experts. Managers may use this information for better segmenting, targeting, acquiring, retaining and maintaining profitable customers.
Corporate Surveillance
Corporate surveillance is the monitoring of a person's or group's behaviour by a corporation. The data collected is most often used for marketing purposes or sold to other corporations, but is also regularly shared with government agencies. It can be used by a business to tailor its products to what its customers want. The data can be used for direct marketing purposes, such as the targeted advertisements on Google and Yahoo, where ads are targeted to the user of the search engine by analysing their search history and emails.
Research Analysis
History shows that we have witnessed revolutionary changes in research.
Data mining is helpful in data cleaning, data pre-processing and the integration of databases. Researchers can find any similar data in the database that might bring about a change in the research. Co-occurring sequences and correlations between activities can be identified. Data visualisation and visual data mining provide us with a clear view of the data.
Criminal Investigation
Criminology is a process that aims to identify crime characteristics. Crime analysis includes exploring and detecting crimes and their relationships with criminals. The high volume of crime datasets and the complexity of the relationships between these kinds of data have made criminology an appropriate field for applying data mining techniques. Text-based crime reports can be converted into word processing files, and this information can be used to perform the crime matching process.
Bio Informatics
Data mining approaches seem ideally suited to bioinformatics, since it is data-rich. Mining biological data helps to extract useful knowledge from the massive datasets gathered in biology and in other related life science areas such as medicine and neuroscience. Applications of data mining to bioinformatics include gene finding, protein function inference, disease diagnosis, disease prognosis, disease treatment optimization, protein and gene interaction network reconstruction, data cleansing, and protein sub-cellular location prediction.
Another potential application of data mining is the automatic recognition of patterns that were not previously known. Imagine a tool that could automatically search your database for hidden patterns. With access to this technology, you would be able to find relationships that could allow you to make strategic decisions.
Data mining is becoming a pervasive technology in activities as diverse as using historical data to predict the success of a marketing campaign, looking for patterns in financial transactions to discover illegal activities, and analysing genome sequences. From this perspective, it was just a matter of time for the discipline to reach the important area of computer security. Applications of Data Mining in Computer Security presents a collection of research efforts on the use of data mining in computer security.
Data mining has been loosely defined as the process of extracting information from large amounts of data. In the context of security, the information we are seeking is the knowledge of whether a security breach has been experienced and, if the answer is yes, who the perpetrator is. This information could be collected in the context of discovering intrusions that aim to breach the privacy of services or data in a computer system, or alternatively, in the context of discovering evidence left in a computer system as part of criminal activity.
Applications of Data Mining in Computer Security concentrates heavily on the use of data mining in the area of intrusion detection. The reason for this is twofold. First, the volume of data dealing with both network and host activity is so large that it makes an ideal candidate for data mining techniques. Second, intrusion detection is an extremely critical activity. The book also addresses the application of data mining to computer forensics, a crucial area that seeks to address the needs of law enforcement in analysing digital evidence.
Applications of Data Mining in Computer Security is designed to meet the needs of a professional audience composed of researchers and practitioners in industry and graduate-level students in computer science.
SOCIAL IMPACTS OF DATA MINING
Data mining can offer the individual many benefits by improving customer service and satisfaction, and lifestyle in general. However, it also has serious implications regarding one's right to privacy and data security.
Is data mining a hype, or a persistent, steadily growing business?
Data mining has recently become a very popular area for research, development and business, as it becomes an essential tool for deriving knowledge from data and helping business people in the decision-making process. The adoption of data mining technology passes through several phases:
- Innovators
- Early adopters
- Chasm
- Early majority
- Late majority
- Laggards
Is data mining merely a manager's business, or everyone's business?
Data mining will surely help company executives a great deal in understanding the market and their business. However, one can expect that everyone will have needs for, and the means of, data mining, as more and more powerful, user-friendly, diversified and affordable data mining systems or components are made available. Data mining can also have multiple personal uses, such as:
- identifying patterns in medical applications,
- choosing the best companies based on customer service,
- classifying email messages, etc.
Is data mining a threat to privacy and data security?
With more and more information accessible in electronic form and available on the web, and with increasingly powerful data mining tools being developed and put into use, there are increasing concerns that data mining may pose a threat to our privacy and data security.
Data Privacy: In 1980, the Organisation for Economic Co-operation and Development (OECD) established a set of international guidelines, referred to as fair information practices. These guidelines aim to protect privacy and data accuracy. They include the following principles:
- Purpose specification and use limitation
- Openness
- Security safeguards
- Individual participation
Data Security: Many data-security-enhancing techniques have been developed to help protect data. Databases can employ a multilevel security model to classify and restrict data according to various security levels, with users permitted access only to their authorized level. Some of the data security techniques are:
- Encryption
- Intrusion detection
- Secure multiparty computation
- Data obscuration