UNIT: 1

1. What Motivated Data Mining? Why Is It Important?
Data mining has attracted a great deal of attention in the information industry and in society as a whole in recent years, due to the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge. The information and knowledge gained can be used for applications ranging from market analysis, fraud detection, and customer retention to production control and science exploration. Data mining can be viewed as a result of the natural evolution of information technology.
Evolution of Database Technology
1960s and earlier: Data collection and database creation; primitive file processing
1970s - early 1980s: Database management systems; hierarchical and network database systems; relational database systems; query languages such as SQL; transactions, concurrency control, and recovery; on-line transaction processing (OLTP)
Mid-1980s - present: Advanced data models (extended relational, object-relational); advanced application-oriented DBMSs (spatial, scientific, engineering, temporal, multimedia, active, stream and sensor, knowledge-based)
Late 1980s - present: Advanced data analysis; data warehouses and OLAP; data mining and knowledge discovery; advanced data mining applications; data mining and society
1990s - present: XML-based database systems; integration with information retrieval; data and information integration
Present - future: New generation of integrated data and information systems
2. Data Mining:
Data mining refers to extracting or "mining" knowledge from large amounts of data. The term is actually a misnomer. Remember that the mining of gold from rocks or sand is referred to as gold mining rather than rock or sand mining. Thus, data mining should have been more appropriately named "knowledge mining from data," which is unfortunately somewhat long. "Knowledge mining," a shorter term, may not reflect the emphasis on mining from large amounts of data. Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data, or KDD. Alternatively, others view data mining as simply an essential step in the process of knowledge discovery. Knowledge discovery as a process consists of an iterative sequence of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance)
5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)
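The sequence above can be walked through end to end on a toy data set. The following Python sketch is a minimal, hypothetical illustration using pandas and scikit-learn; the data sources, column names, and the choice of clustering as the mining step are assumptions made for the example, not part of the original notes.

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.cluster import KMeans

    # Two hypothetical sources to be integrated (step 2).
    sales = pd.DataFrame({"cust_id": [1, 2, 3, 4],
                          "monthly_spend": [120.0, None, 95.0, 400.0]})
    profile = pd.DataFrame({"cust_id": [1, 2, 3, 4],
                            "age": [23, 35, 31, 58]})

    # Step 1: data cleaning - fill the missing spend with the attribute mean.
    sales["monthly_spend"] = sales["monthly_spend"].fillna(sales["monthly_spend"].mean())

    # Step 2: data integration - merge the two sources on a shared key.
    data = sales.merge(profile, on="cust_id")

    # Step 3: data selection - keep only the attributes relevant to the task.
    task_data = data[["age", "monthly_spend"]]

    # Step 4: data transformation - normalize so both attributes are comparable.
    scaled = MinMaxScaler().fit_transform(task_data)

    # Step 5: data mining - here, a simple cluster analysis.
    model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

    # Steps 6-7: pattern evaluation and presentation (kept trivial here).
    print(data.assign(cluster=model.labels_))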
3. Architecture of a Typical Data Mining System
The architecture of a typical data mining system may have the following major components.
Database, data warehouse, World Wide Web, or other information repository: This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.
Database or data warehouse server: The database or data warehouse server is responsible for fetching the relevant data, based on the user's data mining request.
Knowledge base: This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction. Knowledge such as user beliefs, which can be used to assess a pattern's interestingness based on its unexpectedness, may also be included. Other examples of domain knowledge are additional interestingness constraints or thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).
Data mining engine: This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
Pattern evaluation module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining process so as to confine the search to only the interesting patterns.
User interface: This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results. In addition, this component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.
From a data warehouse perspective, data mining can be viewed as an advanced stage of on-line analytical processing (OLAP). However, data mining goes far beyond the narrow scope of summarization-style analytical processing of data warehouse systems by incorporating more advanced techniques for data analysis. Although there are many "data mining systems" on the market, not all of them can perform true data mining. A data analysis system that does not handle large amounts of data should be more appropriately categorized as a machine learning system, a statistical data analysis tool, or an experimental system prototype. A system that can only perform data or information retrieval, including finding aggregate values, or that performs deductive query answering in large databases should be more appropriately categorized as a database system, an information retrieval system, or a deductive database system.
Data mining involves an integration of techniques from multiple disciplines such as database and data warehouse technology, statistics, machine learning, high-performance computing, pattern recognition, neural networks, data visualization, information retrieval, image and signal processing, and spatial or temporal data analysis. The discovered knowledge can be applied to decision making, process control, information management, and query processing.
Therefore, data mining is considered one of the most important frontiers in database and information systems and one of the most promising interdisciplinary developments in information technology.
4. Data Mining: On What Kind of Data?
Advanced databases and information repositories:
Object-oriented and object-relational databases
Spatial databases
Time-series data and temporal data
Text databases and multimedia databases
Heterogeneous and legacy databases
The World Wide Web
5. Data Mining Functionalities:
Concept description: characterization and discrimination. Data can be associated with classes or concepts. For example, in the AllElectronics store, classes of items for sale include computers and printers. A description of a class or concept is called a class/concept description; it is produced by data characterization and data discrimination.
Mining frequent patterns, associations, and correlations. Frequent patterns are patterns that occur frequently in data; they include itemsets, subsequences, and substructures (frequent itemsets, sequential patterns, and structured patterns).
Association analysis. Multidimensional vs. single-dimensional associations:
age(X, "20..29") ^ income(X, "20K..29K") => buys(X, "PC") [support = 2%, confidence = 60%]
contains(T, "computer") => contains(T, "software") [support = 1%, confidence = 75%]
Classification and prediction: finding models (functions) that describe and distinguish data classes or concepts, in order to predict the class of objects whose class label is unknown. E.g., classify countries based on climate, or classify cars based on gas mileage. Models: decision trees, classification (if-then) rules, neural networks. Prediction: predict some unknown or missing numerical values.
Cluster analysis. Whereas classification analyzes class-labeled data objects, clustering analyzes data objects without consulting a known class label. Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity.
Outlier analysis. An outlier is a data object that does not comply with the general behavior or model of the data. It can be considered noise or an exception, but it is quite useful in fraud detection and rare-event analysis.
Trend and evolution analysis. Trend and deviation: regression analysis. Sequential pattern mining, periodicity analysis, and similarity-based analysis.
6. Classification of Data Mining Systems
Classification according to the kinds of databases mined: A data mining system can be classified according to the kinds of databases mined. Database systems can be classified according to different criteria (such as data models, or the types of data or applications involved), each of which may require its own data mining technique. Data mining systems can therefore be classified accordingly. For instance, if classifying according to data models, we may have a relational, transactional, object-relational, or data warehouse mining system. If classifying according to the special types of data handled, we may have a spatial, time-series, text, stream data, or multimedia data mining system, or a World Wide Web mining system.
Classification according to the kinds of knowledge mined: Data mining systems can be categorized according to the kinds of knowledge they mine, that is, based on data mining functionalities, such as characterization, discrimination, association and correlation analysis, classification, prediction, clustering, outlier analysis, and evolution analysis. A comprehensive data mining system usually provides multiple and/or integrated data mining functionalities.
Moreover, data mining systems can be distinguished based on the granularity or levels of abstraction of the knowledge mined, including generalized knowledge (at a high level of abstraction), primitive-level knowledge (at a raw data level), or knowledge at multiple levels (considering several levels of abstraction). An advanced data mining system should facilitate the discovery of knowledge at multiple levels of abstraction. Data mining systems can also be categorized as those that mine data regularities (commonly occurring patterns) versus those that mine data irregularities (such as exceptions, or outliers). In general, concept description, association and correlation analysis, classification, prediction, and clustering mine data regularities, rejecting outliers as noise. These methods may also help detect outliers.
Classification according to the kinds of techniques utilized: Data mining systems can be categorized according to the underlying data mining techniques employed. These techniques can be described according to the degree of user interaction involved (e.g., autonomous systems, interactive exploratory systems, query-driven systems) or the methods of data analysis employed (e.g., database-oriented or data warehouse-oriented techniques, machine learning, statistics, visualization, pattern recognition, neural networks, and so on). A sophisticated data mining system will often adopt multiple data mining techniques or work out an effective, integrated technique that combines the merits of a few individual approaches.
Classification according to the applications adapted: Data mining systems can also be categorized according to the applications they adapt. For example, data mining systems may be tailored specifically for finance, telecommunications, DNA, stock markets, e-mail, and so on. Different applications often require the integration of application-specific methods. Therefore, a generic, all-purpose data mining system may not fit domain-specific mining tasks.
7. Integration of a Data Mining System with a Database or Data Warehouse System
A good system architecture will enable the data mining system to make the best use of the software environment, accomplish data mining tasks in an efficient and timely manner, interoperate and exchange information with other information systems, be adaptable to users' diverse requirements, and evolve with time. A critical question in the design of a data mining (DM) system is how to integrate or couple the DM system with a database (DB) system and/or a data warehouse (DW) system. If a DM system works as a stand-alone system or is embedded in an application program, there are no DB or DW systems with which it has to communicate. This simple scheme is called no coupling, where the main focus of the DM design rests on developing effective and efficient algorithms for mining the available data sets. However, when a DM system works in an environment that requires it to communicate with other information system components, such as DB and DW systems, possible integration schemes include no coupling, loose coupling, semitight coupling, and tight coupling. We examine each of these schemes, as follows:
No coupling: No coupling means that a DM system will not utilize any function of a DB or DW system. It may fetch data from a particular source (such as a file system), process data using some data mining algorithms, and then store the mining results in another file. Such a system, though simple, suffers from several drawbacks.
First, a DB system provides a great deal of flexibility and efficiency at storing, organizing, accessing, and processing data. Without using a DB/DW system, a DM system may spend a substantial amount of time finding, collecting, cleaning, and transforming data. In DB and/or DW systems, data tend to be well organized, indexed, cleaned, integrated, or consolidated, so that finding the task-relevant, high-quality data becomes an easy task. Second, there are many tested, scalable algorithms and data structures implemented in DB and DW systems. It is feasible to realize efficient, scalable implementations using such systems. Moreover, most data have been or will be stored in DB/DW systems. Without any coupling with such systems, a DM system will need to use other tools to extract data, making it difficult to integrate such a system into an information processing environment. Thus, no coupling represents a poor design.
Loose coupling: Loose coupling means that a DM system will use some facilities of a DB or DW system, fetching data from a data repository managed by these systems, performing data mining, and then storing the mining results either in a file or in a designated place in a database or data warehouse. Loose coupling is better than no coupling because it can fetch any portion of data stored in databases or data warehouses by using query processing, indexing, and other system facilities. It gains the flexibility, efficiency, and other features provided by such systems. However, many loosely coupled mining systems are main-memory-based. Because mining does not explore the data structures and query optimization methods provided by DB or DW systems, it is difficult for loose coupling to achieve high scalability and good performance with large data sets.
Semitight coupling: Semitight coupling means that besides linking a DM system to a DB/DW system, efficient implementations of a few essential data mining primitives (identified by the analysis of frequently encountered data mining functions) can be provided in the DB/DW system. These primitives can include sorting, indexing, aggregation, histogram analysis, multiway join, and precomputation of some essential statistical measures, such as sum, count, max, min, standard deviation, and so on. Moreover, some frequently used intermediate mining results can be precomputed and stored in the DB/DW system. Because these intermediate mining results are either precomputed or can be computed efficiently, this design will enhance the performance of a DM system.
Tight coupling: Tight coupling means that a DM system is smoothly integrated into the DB/DW system. The data mining subsystem is treated as one functional component of an information system. Data mining queries and functions are optimized based on mining query analysis, data structures, indexing schemes, and query processing methods of a DB or DW system. With further technology advances, DM, DB, and DW systems will evolve and integrate together as one information system with multiple functionalities. This will provide a uniform information processing environment. This approach is highly desirable because it facilitates efficient implementations of data mining functions, high system performance, and an integrated information processing environment.
With this analysis, it is easy to see that a data mining system should be coupled with a DB/DW system. Loose coupling, though not efficient, is better than no coupling because it uses both the data and the system facilities of a DB/DW system.
Tight coupling is highly desirable, but its implementation is nontrivial and more research is needed in this area. Semitight coupling is a compromise between loose and tight coupling. It is important to identify commonly used data mining primitives and provide efficient implementations of such primitives in DB or DW systems.
8. Major Issues in Data Mining
Major issues in data mining concern mining methodology, user interaction, performance, and diverse data types.
Mining methodology and user interaction issues: These reflect the kinds of knowledge mined, the ability to mine knowledge at multiple granularities, the use of domain knowledge, ad hoc mining, and knowledge visualization.
Mining different kinds of knowledge in databases: Because different users can be interested in different kinds of knowledge, data mining should cover a wide spectrum of data analysis and knowledge discovery tasks, including data characterization, discrimination, association and correlation analysis, classification, prediction, clustering, outlier analysis, and evolution analysis (which includes trend and similarity analysis). These tasks may use the same database in different ways and require the development of numerous data mining techniques.
Interactive mining of knowledge at multiple levels of abstraction: Because it is difficult to know exactly what can be discovered within a database, the data mining process should be interactive. For databases containing a huge amount of data, appropriate sampling techniques can first be applied to facilitate interactive data exploration. Interactive mining allows users to focus the search for patterns, providing and refining data mining requests based on returned results. Specifically, knowledge should be mined by drilling down, rolling up, and pivoting through the data space and knowledge space interactively, similar to what OLAP can do on data cubes. In this way, the user can interact with the data mining system to view data and discovered patterns at multiple granularities and from different angles.
Incorporation of background knowledge: Background knowledge, or information regarding the domain under study, may be used to guide the discovery process and allow discovered patterns to be expressed in concise terms and at different levels of abstraction. Domain knowledge related to databases, such as integrity constraints and deduction rules, can help focus and speed up a data mining process, or judge the interestingness of discovered patterns.
Data mining query languages and ad hoc data mining: Relational query languages (such as SQL) allow users to pose ad hoc queries for data retrieval. In a similar vein, high-level data mining query languages need to be developed to allow users to describe ad hoc data mining tasks by facilitating the specification of the relevant sets of data for analysis, the domain knowledge, the kinds of knowledge to be mined, and the conditions and constraints to be enforced on the discovered patterns. Such a language should be integrated with a database or data warehouse query language and optimized for efficient and flexible data mining.
Presentation and visualization of data mining results: Discovered knowledge should be expressed in high-level languages, visual representations, or other expressive forms so that the knowledge can be easily understood and directly usable by humans. This is especially crucial if the data mining system is to be interactive.
This requires the system to adopt expressive knowledge representation techniques, such as trees, tables, rules, graphs, charts, crosstabs, matrices, or curves.
Handling noisy or incomplete data: The data stored in a database may reflect noise, exceptional cases, or incomplete data objects. When mining data regularities, these objects may confuse the process, causing the knowledge model constructed to overfit the data. As a result, the accuracy of the discovered patterns can be poor. Data cleaning methods and data analysis methods that can handle noise are required, as well as outlier mining methods for the discovery and analysis of exceptional cases.
Pattern evaluation—the interestingness problem: A data mining system can uncover thousands of patterns. Many of the patterns discovered may be uninteresting to the given user, either because they represent common knowledge or lack novelty. Several challenges remain regarding the development of techniques to assess the interestingness of discovered patterns, particularly with regard to subjective measures that estimate the value of patterns with respect to a given user class, based on user beliefs or expectations. The use of interestingness measures or user-specified constraints to guide the discovery process and reduce the search space is another active area of research.
Performance issues: These include efficiency, scalability, and parallelization of data mining algorithms.
Efficiency and scalability of data mining algorithms: To effectively extract information from a huge amount of data in databases, data mining algorithms must be efficient and scalable. In other words, the running time of a data mining algorithm must be predictable and acceptable in large databases. From a database perspective on knowledge discovery, efficiency and scalability are key issues in the implementation of data mining systems. Many of the issues discussed above under mining methodology and user interaction must also consider efficiency and scalability.
Parallel, distributed, and incremental mining algorithms: The huge size of many databases, the wide distribution of data, and the computational complexity of some data mining methods are factors motivating the development of parallel and distributed data mining algorithms. Such algorithms divide the data into partitions, which are processed in parallel. The results from the partitions are then merged. Moreover, the high cost of some data mining processes promotes the need for incremental data mining algorithms that incorporate database updates without having to mine the entire data again "from scratch." Such algorithms perform knowledge modification incrementally to amend and strengthen what was previously discovered.
Issues relating to the diversity of database types:
Handling of relational and complex types of data: Because relational databases and data warehouses are widely used, the development of efficient and effective data mining systems for such data is important. However, other databases may contain complex data objects, hypertext and multimedia data, spatial data, temporal data, or transaction data. It is unrealistic to expect one system to mine all kinds of data, given the diversity of data types and different goals of data mining. Specific data mining systems should be constructed for mining specific kinds of data. Therefore, one may expect to have different data mining systems for different kinds of data.
Mining information from heterogeneous databases and global information systems: Local- and wide-area computer networks (such as the Internet) connect many sources of data, forming huge, distributed, and heterogeneous databases. The discovery of knowledge from different sources of structured, semistructured, or unstructured data with diverse data semantics poses great challenges to data mining. Data mining may help disclose high-level data regularities in multiple heterogeneous databases that are unlikely to be discovered by simple query systems and may improve information exchange and interoperability in heterogeneous databases. Web mining, which uncovers interesting knowledge about Web contents, Web structures, Web usage, and Web dynamics, has become a very challenging and fast-evolving field in data mining.
9. Data Preprocessing:
Today's real-world databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size (often several gigabytes or more) and their likely origin from multiple, heterogeneous sources. Low-quality data will lead to low-quality mining results. "How can the data be preprocessed in order to help improve the quality of the data and, consequently, of the mining results? How can the data be preprocessed so as to improve the efficiency and ease of the mining process?" There are a number of data preprocessing techniques. Data cleaning can be applied to remove noise and correct inconsistencies in the data. Data integration merges data from multiple sources into a coherent data store, such as a data warehouse. Data transformations, such as normalization, may be applied. For example, normalization may improve the accuracy and efficiency of mining algorithms involving distance measurements. Data reduction can reduce the data size by aggregating, eliminating redundant features, or clustering, for instance. These techniques are not mutually exclusive; they may work together. For example, data cleaning can involve transformations to correct wrong data, such as by transforming all entries for a date field to a common format. Data preprocessing techniques, when applied before mining, can substantially improve the overall quality of the patterns mined and/or the time required for the actual mining.
Data discretization: part of data reduction, but of particular importance, especially for numerical data.
Data Cleaning
Data cleaning tasks: fill in missing values; identify outliers and smooth out noisy data; correct inconsistent data.
Missing Data
Data is not always available; e.g., many tuples have no recorded value for several attributes, such as customer income in sales data. Missing data may be due to: equipment malfunction; data inconsistent with other recorded data and thus deleted; data not entered due to misunderstanding; certain data not being considered important at the time of entry; history or changes of the data not being registered. Missing data may need to be inferred.
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing.
Fill in the missing value manually.
Use a global constant such as "unknown" to fill in the missing value.
Use the attribute mean to fill in the missing value.
Use the attribute mean for all samples belonging to the same class to fill in the missing value.
Use the most probable value to fill in the missing value: inference-based methods such as a Bayesian formula or a decision tree.
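As a small illustration of the last four strategies, the following Python sketch fills a hypothetical income attribute by global constant, by attribute mean, and by per-class mean using pandas; the column names and values are assumptions made for the example.

    import pandas as pd

    df = pd.DataFrame({
        "class":  ["A", "A", "B", "B", "B"],
        "income": [30.0, None, 55.0, None, 65.0],
    })

    # Global constant: replace missing values with a fixed marker.
    by_constant = df["income"].astype(object).fillna("unknown")

    # Attribute mean: replace missing values with the overall mean.
    by_mean = df["income"].fillna(df["income"].mean())

    # Per-class mean: replace missing values with the mean of the tuple's class.
    by_class_mean = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))

    print(by_constant.tolist(), by_mean.tolist(), by_class_mean.tolist())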
Noisy Data
Noise: random error or variance in a measured variable. Incorrect attribute values may be due to: faulty data collection instruments; data entry problems; data transmission problems; technology limitations; inconsistency in naming conventions. Other data problems that require data cleaning: duplicate records, incomplete data, inconsistent data.
How to Handle Noisy Data?
Binning method: first sort the data and partition it into (equal-frequency) bins; then smooth by bin means, by bin medians, or by bin boundaries.
Clustering: detect and remove outliers.
Regression: smooth by fitting the data to a regression function, e.g., linear regression.
Simple Discretization Methods: Binning
Equal-width (distance) partitioning: divides the range into N intervals of equal size (a uniform grid). If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N. This is the most straightforward method, but outliers may dominate the presentation, and skewed data is not handled well.
Equal-depth (frequency) partitioning: divides the range into N intervals, each containing approximately the same number of samples. Good data scaling, but managing categorical attributes can be tricky.
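The two partitioning schemes, plus smoothing by bin means, can be shown in a few lines of Python. This is a minimal sketch on an assumed toy price list; N = 3 bins.

    data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
    N = 3

    # Equal-width partitioning: interval width W = (B - A) / N.
    A, B = data[0], data[-1]
    W = (B - A) / N
    width_bins = [[] for _ in range(N)]
    for x in data:
        i = min(int((x - A) / W), N - 1)   # clamp the maximum into the last bin
        width_bins[i].append(x)

    # Equal-depth partitioning: each bin holds roughly the same number of values.
    depth = len(data) // N
    depth_bins = [data[i:i + depth] for i in range(0, len(data), depth)]

    # Smoothing by bin means: every value in a bin is replaced by the bin mean.
    smoothed = [round(sum(b) / len(b), 2) for b in depth_bins for _ in b]
    print(width_bins, depth_bins, smoothed)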
Data Integration and Transformation
Data integration: combines data from multiple sources into a coherent store.
Schema integration: integrate metadata from different sources. Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id and B.cust-#.
Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources may differ; possible reasons include different representations and different scales, e.g., metric vs. British units.
Handling Redundant Data in Data Integration
Redundant data occur often when integrating multiple databases: the same attribute may have different names in different databases, and one attribute may be a "derived" attribute in another table, e.g., annual revenue. Redundant data may be detected by correlation analysis. Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality.
Data Transformation
Smoothing: remove noise from the data.
Aggregation: summarization, data cube construction.
Generalization: concept hierarchy climbing.
Normalization: scale values to fall within a small, specified range; min-max normalization, z-score normalization, normalization by decimal scaling.
Attribute/feature construction: new attributes constructed from the given ones.
Data Reduction
A warehouse may store terabytes of data, so complex data analysis/mining may take a very long time to run on the complete data set. Data reduction obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.
Data reduction strategies: data cube aggregation; attribute subset selection; dimensionality reduction; numerosity reduction; discretization and concept hierarchy generation.
Dimensionality Reduction
Feature selection (attribute subset selection): select a minimum set of features such that the probability distribution of the different classes given the values for those features is as close as possible to the original distribution given the values of all features. This reduces the number of patterns, making them easier to understand. Heuristic methods: step-wise forward selection; step-wise backward elimination; combining forward selection and backward elimination; decision-tree induction.
Wavelet Transforms
Discrete wavelet transform (DWT): linear signal processing. Compressed approximation: store only a small fraction of the strongest wavelet coefficients. Similar to the discrete Fourier transform (DFT), but better lossy compression, localized in space. Method: the length, L, must be an integer power of 2 (padding with 0s when necessary); each transform has two functions, smoothing and difference; they apply to pairs of data, resulting in two sets of data of length L/2; the two functions are applied recursively until the desired length is reached.
Attribute Subset Selection
Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes. The goal is to find a minimum set of attributes, using basic heuristic methods of attribute selection.
Numerosity Reduction
Parametric methods: assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers). Log-linear models obtain the value at a point in m-D space as a product over appropriate marginal subspaces.
Non-parametric methods: do not assume models. Major families: histograms, clustering, sampling.
Regression Analysis and Log-Linear Models
Linear regression: Y = α + βX. The two parameters, α and β, specify the line and are to be estimated from the data at hand, by applying the least squares criterion to the known values Y1, Y2, ... and X1, X2, ....
Multiple regression: Y = b0 + b1 X1 + b2 X2. Many nonlinear functions can be transformed into the above.
Log-linear models: the multi-way table of joint probabilities is approximated by a product of lower-order tables, e.g., p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd, where each factor is a lower-order table over the indicated subset of attributes.
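To make the least squares criterion concrete, here is a minimal numpy sketch that estimates the two line parameters from assumed toy observations; the closed-form estimates used are the standard ones, and the data values are illustrative only.

    import numpy as np

    # Toy paired observations (assumed for illustration).
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    # Least squares estimates of the line parameters:
    #   beta = cov(x, y) / var(x),  alpha = mean(y) - beta * mean(x)
    beta = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    alpha = y.mean() - beta * x.mean()

    print(f"Y = {alpha:.3f} + {beta:.3f} X")
    # Numerosity reduction: store only (alpha, beta) instead of the raw pairs.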
Clustering
Partition the data set into clusters, and store only the cluster representation. This can be very effective if the data is clustered, but not if the data is "smeared." Clustering can be hierarchical and can be stored in multi-dimensional index tree structures. There are many choices of clustering definitions and clustering algorithms.
Sampling
Sampling allows a large data set to be represented by a much smaller random sample (subset) of the data. Let a large data set D contain N tuples. Methods to reduce data set D:
Simple random sample without replacement (SRSWOR)
Simple random sample with replacement (SRSWR)
Cluster sample
Stratified sample
Discretization and Concept Hierarchy Generation
Discretization. Three types of attributes: nominal (values from an unordered set), ordinal (values from an ordered set), and continuous (real numbers). Discretization divides the range of a continuous attribute into intervals. Some classification algorithms only accept categorical attributes; discretization also reduces data size and prepares the data for further analysis.
Discretization and concept hierarchy: Discretization reduces the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values. Concept hierarchies reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).
Discretization and concept hierarchy generation for numeric data: binning, histogram analysis, clustering analysis, entropy-based discretization, and discretization by intuitive partitioning.
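The sampling methods above are easy to sketch in Python. The following is a minimal illustration on an assumed toy data set; the two strata and the proportional allocation are assumptions made for the example, and cluster sampling is omitted for brevity.

    import random

    D = list(range(1, 101))    # a toy data set of N = 100 tuples
    n = 10                     # desired sample size

    # SRSWOR: simple random sample without replacement.
    srswor = random.sample(D, n)

    # SRSWR: simple random sample with replacement (tuples may repeat).
    srswr = [random.choice(D) for _ in range(n)]

    # Stratified sample: sample proportionally from each stratum.
    strata = {"low": [t for t in D if t <= 50], "high": [t for t in D if t > 50]}
    stratified = [t for stratum in strata.values()
                  for t in random.sample(stratum, n // len(strata))]

    print(srswor, srswr, stratified)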
UNIT: 2

What is a Data Warehouse?
Defined in many different ways:
A decision support database that is maintained separately from the organization's operational database.
Supports information processing by providing a solid platform of consolidated, historical data for analysis.
"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process."—W. H. Inmon
Data warehousing: the process of constructing and using data warehouses.
Data Warehouse—Subject-Oriented
Organized around major subjects, such as customer, product, and sales. Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing. Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
Data Warehouse—Integrated
Constructed by integrating multiple, heterogeneous data sources: relational databases, flat files, on-line transaction records. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources (e.g., hotel price: currency, tax, whether breakfast is covered, etc.). When data is moved to the warehouse, it is converted.
Data Warehouse—Time-Variant
The time horizon for the data warehouse is significantly longer than that of operational systems. An operational database holds current value data; data warehouse data provide information from a historical perspective (e.g., the past 5-10 years). Every key structure in the data warehouse contains an element of time, explicitly or implicitly, whereas the key of operational data may or may not contain a "time element."
Data Warehouse—Non-Volatile
A physically separate store of data transformed from the operational environment. Operational update of data does not occur in the data warehouse environment. It does not require transaction processing, recovery, or concurrency control mechanisms; it requires only two operations in data accessing: initial loading of data and access of data.
Data Warehouse vs. Operational DBMS
Distinct features (OLTP vs. OLAP):
User and system orientation: customer vs. market
Data contents: current, detailed vs. historical, consolidated
Database design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated
Access patterns: update vs. read-only but complex queries
OLTP (on-line transaction processing): the major task of traditional relational DBMSs; day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing): the major task of data warehouse systems; data analysis and decision making.
A Multi-Dimensional Data Model
Conceptual modeling of data warehouses: dimensions and measures.
Star schema: a fact table in the middle connected to a set of dimension tables.
Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
Fact constellation: multiple fact tables share dimension tables; viewed as a collection of stars, it is therefore called a galaxy schema or fact constellation.
Data Warehouse Architecture
Steps for the design and construction of data warehouses: the design of a data warehouse as a business analysis framework; the process of data warehouse design; a three-tier data warehouse architecture.
Four views regarding the design of a data warehouse:
Top-down view: allows selection of the relevant information necessary for the data warehouse.
Data warehouse view: consists of fact tables and dimension tables.
Data source view: exposes the information being captured, stored, and managed by operational systems.
Business query view.
Types of OLAP Servers
Relational OLAP (ROLAP): uses relational or extended-relational DBMSs to store and manage warehouse data, and OLAP middleware to support missing pieces; includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services; greater scalability.
Multidimensional OLAP (MOLAP): array-based multidimensional storage engine (sparse matrix techniques); fast indexing to pre-computed summarized data.
Hybrid OLAP (HOLAP): user flexibility, e.g., low level: relational, high level: array.
Specialized SQL servers: specialized support for SQL queries over star/snowflake schemas.
From Data Warehousing to Data Mining
Three kinds of data warehouse applications:
Information processing: supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts, and graphs.
Analytical processing: multidimensional analysis of data warehouse data; supports basic OLAP operations: slice and dice, drilling, pivoting.
Data mining: knowledge discovery from hidden patterns; supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.
Note the differences among the three tasks.
From On-Line Analytical Processing to On-Line Analytical Mining (OLAM)
Why online analytical mining?
High quality of data in data warehouses: a DW contains integrated, consistent, cleaned data.
Available information processing structure surrounding data warehouses: ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools.
OLAP-based exploratory data analysis: mining with drilling, dicing, pivoting, etc.
On-line selection of data mining functions: integration and swapping of multiple mining functions, algorithms, and tasks.
Architecture of OLAM

UNIT: 3

Data Mining Task Primitives
A data mining task can be specified in the form of a data mining query, which is input to the data mining system. A data mining query is defined in terms of data mining task primitives. These primitives allow the user to interactively communicate with the data mining system during discovery in order to direct the mining process, or examine the findings from different angles or depths. The data mining primitives specify the following:
The set of task-relevant data to be mined: This specifies the portions of the database or the set of data in which the user is interested. This includes the database attributes or data warehouse dimensions of interest (referred to as the relevant attributes or dimensions).
The kind of knowledge to be mined: This specifies the data mining functions to be performed, such as characterization, discrimination, association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution analysis.
The background knowledge to be used in the discovery process: This knowledge about the domain to be mined is useful for guiding the knowledge discovery process and for evaluating the patterns found. Concept hierarchies are a popular form of background knowledge, which allow data to be mined at multiple levels of abstraction.
The interestingness measures and thresholds for pattern evaluation: These may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. Different kinds of knowledge may have different interestingness measures. For example, interestingness measures for association rules include support and confidence. Rules whose support and confidence values are below user-specified thresholds are considered uninteresting.
The expected representation for visualizing the discovered patterns: This refers to the form in which discovered patterns are to be displayed, which may include rules, tables, charts, graphs, decision trees, and cubes.
A data mining query language can be designed to incorporate these primitives, allowing users to flexibly interact with data mining systems. Having a data mining query language provides a foundation on which user-friendly graphical interfaces can be built. This facilitates a data mining system's communication with other information systems and its integration with the overall information processing environment. Designing a comprehensive data mining language is challenging because data mining covers a wide spectrum of tasks, from data characterization to evolution analysis. Each task has different requirements. The design of an effective data mining query language requires a deep understanding of the power, limitation, and underlying mechanisms of the various kinds of data mining tasks.
A Data Mining Query Language (DMQL)
Motivation: A DMQL can provide the ability to support ad hoc and interactive data mining. By providing a standardized language, it is hoped to achieve an effect similar to that of SQL on relational databases: a foundation for system development and evolution, and a facility for information exchange, technology transfer, commercialization, and wide acceptance.
Design: DMQL is designed around the data mining primitives.
Syntax for DMQL
Syntax is provided for the specification of: task-relevant data; the kind of knowledge to be mined; concept hierarchies; interestingness measures; and pattern presentation and visualization — together forming a DMQL query.
Syntax for task-relevant data specification:
use database database_name, or use data warehouse data_warehouse_name
from relation(s)/cube(s) [where condition]
in relevance to att_or_dim_list
order by order_list
group by grouping_list
having condition
Syntax for specifying the kind of knowledge to be mined:
Characterization: Mine_Knowledge_Specification ::= mine characteristics [as pattern_name] analyze measure(s)
Discrimination: Mine_Knowledge_Specification ::= mine comparison [as pattern_name] for target_class where target_condition {versus contrast_class_i where contrast_condition_i} analyze measure(s)
Association: Mine_Knowledge_Specification ::= mine associations [as pattern_name]
Classification: Mine_Knowledge_Specification ::= mine classification [as pattern_name] analyze classifying_attribute_or_dimension
Prediction: Mine_Knowledge_Specification ::= mine prediction [as pattern_name] analyze prediction_attribute_or_dimension {set {attribute_or_dimension_i = value_i}}
Syntax for concept hierarchy specification:
To specify what concept hierarchies to use: use hierarchy <hierarchy> for <attribute_or_dimension>
Different syntax is used to define different types of hierarchies:
schema hierarchies: define hierarchy time_hierarchy on date as [date, month, quarter, year]
set-grouping hierarchies: define hierarchy age_hierarchy for age on customer as level1: {young, middle_aged, senior} < level0: all; level2: {20, ..., 39} < level1: young; level2: {40, ..., 59} < level1: middle_aged; level2: {60, ..., 89} < level1: senior
operation-derived hierarchies: define hierarchy age_hierarchy for age on customer as {age_category(1), ..., age_category(5)} := cluster(default, age, 5) < all(age)
rule-based hierarchies: define hierarchy profit_margin_hierarchy on item as level_1: low_profit_margin < level_0: all if (price - cost) < $50; level_1: medium_profit_margin < level_0: all if ((price - cost) > $50) and ((price - cost) <= $250); level_1: high_profit_margin < level_0: all if (price - cost) > $250
Syntax for interestingness measure specification:
Interestingness measures and thresholds can be specified by the user with the statement: with <interest_measure_name> threshold = threshold_value
Example: with support threshold = 0.05 with confidence threshold = 0.7
Syntax for pattern presentation and visualization specification:
Syntax that allows users to specify the display of discovered patterns in one or more forms: display as <result_form>
To facilitate interactive viewing at different concept levels, the following syntax is defined:
Multilevel_Manipulation ::= roll up on attribute_or_dimension | drill down on attribute_or_dimension | add attribute_or_dimension | drop attribute_or_dimension
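Putting these clauses together, a complete DMQL query might look as follows. This is an illustrative sketch only: the database, relations, attributes, and thresholds (AllElectronics_db, customer, item, purchases, the join keys, etc.) are assumed for the example and are not defined elsewhere in these notes.

    use database AllElectronics_db
    use hierarchy age_hierarchy for C.age
    mine characteristics as customerPurchasing
    analyze count%
    in relevance to C.age, I.type, I.place_made
    from customer C, item I, purchases P
    where I.item_ID = P.item_ID and C.cust_ID = P.cust_ID
      and C.country = "Canada"
    with support threshold = 0.05
    display as table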
Design graphical user interfaces based on a data mining query language. What tasks should be considered in the design of GUIs based on a data mining query language?
Data collection and data mining query composition
Presentation of discovered patterns
Hierarchy specification and manipulation
Manipulation of data mining primitives
Interactive multilevel mining
Other miscellaneous information

UNIT: 4

What is Concept Description?
Descriptive vs. predictive data mining:
Descriptive mining: describes concepts or task-relevant data sets in concise, summarative, informative, discriminative forms.
Predictive mining: based on data and analysis, constructs models for the database, and predicts the trend and properties of unknown data.
Concept description:
Characterization: provides a concise and succinct summarization of the given collection of data.
Comparison: provides descriptions comparing two or more collections of data.
Concept Description vs. OLAP
Concept description can handle complex data types of the attributes and their aggregations, and is a more automated process. OLAP is restricted to a small number of dimension and measure types, and is a user-controlled process.
Data Generalization and Summarization-Based Characterization
Data generalization: a process that abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones.
Approaches: the data cube approach (OLAP approach) and the attribute-oriented induction approach.
Characterization: Data Cube Approach
Perform computations and store results in data cubes.
Strengths: an efficient implementation of data generalization; computation of various kinds of measures, e.g., count(), sum(), average(), max(); generalization and specialization can be performed on a data cube by roll-up and drill-down.
Limitations: handles only dimensions of simple nonnumeric data and measures of simple aggregated numeric values; lacks intelligent analysis and cannot tell which dimensions should be used or what level the generalization should reach.
Attribute-Oriented Induction
Proposed in 1989 (KDD '89 workshop). Not confined to categorical data nor to particular measures.
How is it done? Collect the task-relevant data (initial relation) using a relational database query. Perform generalization by attribute removal or attribute generalization. Apply aggregation by merging identical, generalized tuples and accumulating their respective counts. Present the results interactively to users.
Basic Principles of Attribute-Oriented Induction
Data focusing: task-relevant data, including dimensions; the result is the initial relation.
Attribute removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A's higher-level concepts are expressed in terms of other attributes.
Attribute generalization: if there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A.
Attribute-threshold control and generalized relation threshold control: control the final relation/rule size.
Basic Algorithm for Attribute-Oriented Induction
Initial relation: query processing of the task-relevant data, deriving the initial relation.
Pre-generalization: based on the analysis of the number of distinct values in each attribute, determine the generalization plan for each attribute: removal, or how high to generalize.
Prime generalization: based on the pre-generalization plan, perform generalization to the right level to derive a "prime generalized relation," accumulating the counts.
Presentation: user interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, visualization presentations.
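The core of attribute-oriented induction can be sketched in a few lines of Python: generalize each attribute through a concept hierarchy, then merge identical generalized tuples while accumulating counts. The toy relation and the two hierarchies (major grouped into science/arts, age grouped into young/middle_aged) are assumptions made for the illustration.

    from collections import Counter

    # Toy initial relation: (major, age) tuples.
    relation = [("physics", 21), ("chemistry", 23), ("math", 22),
                ("history", 21), ("literature", 24), ("math", 35)]

    science = {"physics", "chemistry", "math"}

    def generalize(tup):
        major, age = tup
        gen_major = "science" if major in science else "arts"  # attribute generalization
        gen_age = "young" if age < 30 else "middle_aged"       # climb the age hierarchy
        return (gen_major, gen_age)

    # Merge identical generalized tuples, accumulating their counts
    # to form the prime generalized relation.
    prime_relation = Counter(generalize(t) for t in relation)
    for tup, count in prime_relation.items():
        print(tup, "count =", count)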
Implementation by Cube Technology
Construct a data cube on the fly for the given data mining query: facilitates efficient drill-down analysis, but may increase the response time. A balanced solution: precomputation of a "subprime" relation.
Use a predefined and precomputed data cube: construct the data cube beforehand. This facilitates not only attribute-oriented induction but also attribute relevance analysis, dicing, slicing, roll-up, and drill-down, at the cost of cube computation and nontrivial storage overhead.
Analytical Characterization: Analysis of Attribute Relevance
Characterization vs. OLAP
Similarities: presentation of data summarization at multiple levels of abstraction; interactive drilling, pivoting, slicing, and dicing.
Differences: automated allocation of the desired level; dimension relevance analysis and ranking when there are many relevant dimensions; sophisticated typing on dimensions and measures; analytical characterization: data dispersion analysis.
Attribute Relevance Analysis
Why? Which dimensions should be included? How high a level of generalization? Automatic vs. interactive. Reduce the number of attributes and make patterns easier to understand.
What? A statistical method for preprocessing data: filter out irrelevant or weakly relevant attributes, and retain or rank the relevant attributes. Relevance is related to dimensions and levels: analytical characterization, analytical comparison.
Steps of attribute relevance analysis:
Data collection.
Analytical generalization: use information gain analysis to identify highly relevant dimensions and levels.
Relevance analysis: sort and select the most relevant dimensions and levels.
Attribute-oriented induction for class description on the selected dimensions/levels.
OLAP operations (drilling, slicing) on the relevance rules.
Relevance Measures
A quantitative relevance measure determines the classifying power of an attribute within a set of data. Methods: information gain (ID3), gain ratio (C4.5), Gini index, χ2 contingency table statistics, uncertainty coefficient.
Information-Theoretic Approach
Decision tree: each internal node tests an attribute, each branch corresponds to an attribute value, and each leaf node assigns a classification.
ID3 algorithm: builds a decision tree based on training objects with known class labels in order to classify testing objects; ranks attributes with an information gain measure; minimal height means the least number of tests to classify an object.
Top-Down Induction of Decision Tree
Mining Class Comparisons
Comparison: comparing two or more classes.
Method: partition the set of relevant data into the target class and the contrasting class(es); generalize both classes to the same high-level concepts; compare tuples with the same high-level descriptions; present for every tuple its description and two measures (support: distribution within a single class; comparison: distribution between classes); highlight the tuples with strong discriminant features.
Relevance analysis: find the attributes (features) that best distinguish the different classes.
Mining Data Dispersion Characteristics
Motivation: to better understand the data, study central tendency, variation, and spread. Data dispersion characteristics: median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals. Data dispersion is analyzed with multiple granularities of precision, using boxplot or quantile analysis on sorted intervals. Dispersion analysis on computed measures: fold measures into numerical dimensions, then apply boxplot or quantile analysis on the transformed cube.
Boxplot Analysis
Five-number summary of a distribution: Minimum, Q1, Median, Q3, Maximum.
Boxplot: the data is represented with a box. The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR. The median is marked by a line within the box. Whiskers: two lines outside the box extend to the Minimum and Maximum.
Mining Descriptive Statistical Measures in Large Databases
Variance: s² = (1/(n-1)) Σi (xi - x̄)² = (1/(n-1)) [Σi xi² - (1/n)(Σi xi)²]
Standard deviation: the square root of the variance. It measures spread about the mean and is zero if and only if all the values are equal. Both the deviation and the variance are algebraic measures.
Graphic Displays of Basic Statistical Descriptions
Histogram. Boxplot. Quantile plot: each value xi is paired with fi, indicating that approximately 100 fi % of the data are ≤ xi. Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another. Scatter plot: each pair of values is a pair of coordinates, plotted as points in the plane. Loess (local regression) curve: adds a smooth curve to a scatter plot to provide better perception of the pattern of dependence.
Incremental and Parallel Mining of Concept Description
Incremental mining: revision based on the newly added data ΔDB. Generalize ΔDB to the same level of abstraction as in the generalized relation R to derive ΔR; then take the union R ∪ ΔR, i.e., merge counts and other statistical information to produce a new relation R'. A similar philosophy can be applied to data sampling, parallel and/or distributed mining, etc.

UNIT: 5

Association Rule Mining
Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
Applications: basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
Rule form: antecedent => consequent [support, confidence]. Examples:
Computer => antivirus_software [support = 2%, confidence = 60%]
buys(X, "computer") => buys(X, "antivirus_software") [0.5%, 60%]
Basic Concepts
Given a database of transactions, where each transaction is a list of items (purchased by a customer in a visit), find all rules that correlate the presence of one set of items with that of another set of items; that is, find the frequent patterns. The classic example of frequent itemset mining is market basket analysis.
Association rule performance measures: confidence; support; minimum support threshold; minimum confidence threshold.
Market Basket Analysis
Shopping baskets: each item has a Boolean variable representing the presence or absence of that item, so each basket can be represented by a Boolean vector of values assigned to these variables. Patterns are identified from the Boolean vectors and can be represented by association rules.
A Road Map
Boolean vs. quantitative associations (based on the types of values handled):
buys(x, "SQLServer") ^ buys(x, "DMBook") => buys(x, "DBMiner") [0.2%, 60%]
age(x, "30..39") ^ income(x, "42..48K") => buys(x, "PC") [1%, 75%]
Single-dimensional vs. multidimensional associations.
Single-level vs. multiple-level analysis.
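To make support and confidence concrete before turning to the mining algorithms, the following Python sketch computes both measures for one candidate rule over a toy transaction database; the items and transactions are assumptions made for the example.

    transactions = [
        {"computer", "antivirus_software"},
        {"computer", "printer"},
        {"computer", "antivirus_software", "printer"},
        {"printer"},
    ]

    lhs, rhs = {"computer"}, {"antivirus_software"}
    n = len(transactions)

    both = sum(1 for t in transactions if lhs <= t and rhs <= t)
    lhs_only = sum(1 for t in transactions if lhs <= t)

    support = both / n              # P(computer and antivirus_software)
    confidence = both / lhs_only    # P(antivirus_software | computer)
    print(f"support = {support:.0%}, confidence = {confidence:.0%}")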
Mining Single-Dimensional Boolean Association Rules from Transactional Databases
Apriori Algorithm
Single-dimensional, single-level, Boolean frequent itemsets. The method: find frequent itemsets using candidate generation, then generate association rules from the frequent itemsets.
Mining Frequent Itemsets: the Key Step
Find the frequent itemsets: the sets of items that have minimum support.
A subset of a frequent itemset must also be a frequent itemset; i.e., if {A, B} is a frequent itemset, both {A} and {B} must also be frequent itemsets.
Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).
Use the frequent itemsets to generate association rules.
Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
  Ck+1 = candidates generated from Lk;
  for each transaction t in the database do
    increment the count of all candidates in Ck+1 that are contained in t;
  Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
How to Generate Candidates
Suppose the items in Lk-1 are listed in an order.
Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
Step 2: pruning
forall itemsets c in Ck do
  forall (k-1)-subsets s of c do
    if (s is not in Lk-1) then delete c from Ck
How to Count Supports of Candidates
Why is counting the supports of candidates a problem? The total number of candidates can be very huge, and one transaction may contain many candidates.
Method: candidate itemsets are stored in a hash-tree. A leaf node of the hash-tree contains a list of itemsets and counts; an interior node contains a hash table. The subset function finds all the candidates contained in a transaction.
Methods to Improve Apriori's Efficiency
Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.
Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans.
Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
Sampling: mining on a subset of the given data with a lower support threshold, plus a method to determine the completeness.
Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent.
Mining Frequent Patterns Without Candidate Generation
Compress a large database into a compact, Frequent-Pattern tree (FP-tree) structure: highly condensed, but complete for frequent pattern mining; avoids costly database scans.
Develop an efficient, FP-tree-based frequent pattern mining method: a divide-and-conquer methodology that decomposes mining tasks into smaller ones and avoids candidate generation (sub-database tests only).
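Before moving on, here is a compact Python sketch of the level-wise Apriori procedure described above: self-join to generate candidates, prune candidates with an infrequent subset, and keep those meeting minimum support. It is a minimal illustration (no hash-tree counting), and the toy transactions are assumed.

    from itertools import combinations

    def apriori(transactions, min_support):
        """Minimal Apriori sketch: level-wise candidate generation and pruning."""
        n = len(transactions)
        # L1: frequent 1-itemsets.
        items = {i for t in transactions for i in t}
        current = {frozenset([i]) for i in items
                   if sum(i in t for t in transactions) / n >= min_support}
        frequent = {}
        k = 1
        while current:
            for itemset in current:
                frequent[itemset] = sum(itemset <= t for t in transactions) / n
            # Self-join: combine frequent k-itemsets into (k+1)-candidates.
            candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
            # Prune: every k-subset of a candidate must itself be frequent.
            candidates = {c for c in candidates
                          if all(frozenset(s) in current for s in combinations(c, k))}
            # Keep only candidates meeting minimum support.
            current = {c for c in candidates
                       if sum(c <= t for t in transactions) / n >= min_support}
            k += 1
        return frequent

    tx = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
    result = apriori([frozenset(t) for t in tx], min_support=0.6)
    for itemset, s in sorted(result.items(), key=lambda kv: -kv[1]):
        print(set(itemset), f"support = {s:.0%}")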
Mining multilevel association rules from transactional databases

Mining various kinds of association rules
• Mining multilevel association rules – concepts at different levels
• Mining multidimensional association rules – more than one dimension
• Mining quantitative association rules – numeric attributes

Multi-level Association
• Uniform Support – the same minimum support for all levels
– + One minimum support threshold; no need to examine itemsets containing any item whose ancestors do not have minimum support.
– – Lower-level items do not occur as frequently. If the support threshold is
• too high → miss low-level associations
• too low → generate too many high-level associations
• Reduced Support – reduced minimum support at lower levels
– There are 4 search strategies:
• Level-by-level independent
• Level-cross filtering by k-itemset
• Level-cross filtering by single item
• Controlled level-cross filtering by single item

Multi-level Association: Redundancy Filtering
• Some rules may be redundant due to "ancestor" relationships between items.
• Example
– milk => wheat bread [support = 8%, confidence = 70%]
– 2% milk => wheat bread [support = 2%, confidence = 72%]
• We say the first rule is an ancestor of the second rule.
• A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor. For instance, if about one-quarter of all milk sold is 2% milk, the expected support of the second rule is roughly 8% × 1/4 = 2%; since the observed support and confidence are close to those expected values, the second rule adds little information and can be filtered out.

Mining multidimensional association rules from transactional databases and data warehouses

Multi-Dimensional Association
• Single-dimensional rules
– buys(X, "milk") => buys(X, "bread")
• Multi-dimensional rules
– Inter-dimension association rules – no repeated predicates
• age(X, "19-25") ^ occupation(X, "student") => buys(X, "coke")
– Hybrid-dimension association rules – repeated predicates
• age(X, "19-25") ^ buys(X, "popcorn") => buys(X, "coke")
• Categorical attributes – finite number of possible values, no ordering among values
• Quantitative attributes – numeric, implicit ordering among values

Techniques for Mining MD Associations
• Search for frequent k-predicate sets:
– Example: {age, occupation, buys} is a 3-predicate set.
– Techniques can be categorized by how quantitative attributes, such as age, are treated.
1. Using static discretization of quantitative attributes
– Quantitative attributes are statically discretized using predefined concept hierarchies.
2. Quantitative association rules
– Quantitative attributes are dynamically discretized into "bins" based on the distribution of the data.
3. Distance-based association rules
– This is a dynamic discretization process that considers the distance between data points.

From association mining to correlation analysis

Interestingness Measurements
• Objective measures
– Two popular measurements: support and confidence
• Subjective measures
– A rule (pattern) is interesting if it is unexpected (surprising to the user) and/or actionable (the user can do something with it)

Constraint-based association mining
• Interactive, exploratory mining
• Kinds of constraints
– Knowledge type constraint – classification, association, etc.
– Data constraint – SQL-like queries
– Dimension/level constraints
– Rule constraints
– Interestingness constraints

Rule Constraints in Association Mining
• Two kinds of rule constraints:
– Rule form constraints: meta-rule guided mining.
• P(x, y) ^ Q(x, w) => takes(x, "database systems").
– Rule (content) constraints: constraint-based query optimization (Ng, et al., SIGMOD'98).
• sum(LHS) < 100 ^ min(LHS) > 20 ^ count(LHS) > 3 ^ sum(RHS) > 1000
• 1-variable vs. 2-variable constraints
– 1-var: a constraint confining only one side (L/R) of the rule, e.g., as shown above.
– 2-var: a constraint confining both sides (L and R).
• sum(LHS) < min(RHS) ^ max(RHS) < 5 * sum(LHS)
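As an illustration of checking a 1-variable content constraint of the kind shown above, here is a minimal sketch; the price table and the particular constraint values are invented for this example.

# Hypothetical item price table for illustration.
price = {"bread": 2, "milk": 3, "cheese": 25, "wine": 40, "caviar": 120}

def satisfies_1var_constraint(lhs_items):
    """Check a 1-var content constraint on the rule's left-hand side:
    sum(LHS.price) < 100 and min(LHS.price) > 2 and count(LHS) >= 2."""
    prices = [price[i] for i in lhs_items]
    return sum(prices) < 100 and min(prices) > 2 and len(prices) >= 2

print(satisfies_1var_constraint({"milk", "cheese"}))   # True:  sum = 28, min = 3
print(satisfies_1var_constraint({"wine", "caviar"}))   # False: sum = 160 >= 100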
Constraint-Based Association Query
• Database: (1) trans(TID, Itemset), (2) itemInfo(Item, Type, Price)
• A constrained association query (CAQ) is of the form {(S1, S2) | C},
– where C is a set of constraints on S1, S2, including a frequency constraint
• A classification of (single-variable) constraints:
– Class constraint: S ⊆ A, e.g., S ⊆ Item
– Domain constraint:
• S θ v, where θ ∈ {=, ≠, <, ≤, >, ≥}, e.g., S.Price < 100
• v θ S, where θ is ∈ or ∉, e.g., snacks ∉ S.Type
• V θ S, or S θ V, where θ ∈ {⊆, ⊂, ⊄, =, ≠}, e.g., {snacks, sodas} ⊆ S.Type
– Aggregation constraint: agg(S) θ v, where agg ∈ {min, max, sum, count, avg} and θ ∈ {=, ≠, <, ≤, >, ≥}
• e.g., count(S1.Type) = 1, avg(S2.Price) < 100

Anti-monotone and Monotone Constraints
• A constraint C_a is anti-monotone iff, for any pattern S not satisfying C_a, none of the super-patterns of S can satisfy C_a
• A constraint C_m is monotone iff, for any pattern S satisfying C_m, every super-pattern of S also satisfies it

Succinct Constraint
• A subset of items I_s is a succinct set if it can be expressed as σ_p(I) for some selection predicate p, where σ is the selection operator
• SP ⊆ 2^I is a succinct power set if there is a fixed number of succinct sets I_1, …, I_k ⊆ I such that SP can be expressed in terms of the strict power sets of I_1, …, I_k using union and minus
• A constraint C_s is succinct provided SAT_{C_s}(I) is a succinct power set

Convertible Constraint
• Suppose all items in patterns are listed in a total order R
• A constraint C is convertible anti-monotone iff a pattern S satisfying the constraint implies that each suffix of S w.r.t. R also satisfies C
• A constraint C is convertible monotone iff a pattern S satisfying the constraint implies that each pattern of which S is a suffix w.r.t. R also satisfies C
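To illustrate anti-monotonicity, consider a constraint such as sum(S.Price) ≤ v with non-negative prices: once an itemset violates it, every superset does too, so the miner can prune the itemset from candidate generation immediately. A minimal sketch, with invented prices and budget:

# sum(S.Price) <= budget is anti-monotone when all prices are non-negative:
# if an itemset already exceeds the budget, so does every superset,
# so it can be pruned from candidate generation right away.
price = {"a": 10, "b": 20, "c": 50}  # hypothetical prices
budget = 60

def violates(itemset):
    return sum(price[i] for i in itemset) > budget

print(violates({"a", "b"}))       # False: 30 <= 60, keep extending
print(violates({"b", "c"}))       # True:  70 > 60, prune
print(violates({"a", "b", "c"}))  # True:  any superset of a violator also violates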
UNIT: 6
Classification and Prediction
Databases are rich with hidden information that can be used for intelligent decision making. Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. Such analysis can help provide us with a better understanding of the data at large. Whereas classification predicts categorical (discrete, unordered) labels, prediction models continuous-valued functions. For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditures in dollars of potential customers on computer equipment given their income and occupation. Many classification and prediction methods have been proposed by researchers in machine learning, pattern recognition, and statistics. Most algorithms are memory resident, typically assuming a small data size. Recent data mining research has built on such work, developing scalable classification and prediction techniques capable of handling large, disk-resident data.

Preparing the Data for Classification and Prediction
The following preprocessing steps may be applied to the data to help improve the accuracy, efficiency, and scalability of the classification or prediction process.

Data cleaning: This refers to the preprocessing of data in order to remove or reduce noise (by applying smoothing techniques, for example) and the treatment of missing values (e.g., by replacing a missing value with the most commonly occurring value for that attribute, or with the most probable value based on statistics). Although most classification algorithms have some mechanisms for handling noisy or missing data, this step can help reduce confusion during learning.

Relevance analysis: Many of the attributes in the data may be redundant. Correlation analysis can be used to identify whether any two given attributes are statistically related. For example, a strong correlation between attributes A1 and A2 would suggest that one of the two could be removed from further analysis. A database may also contain irrelevant attributes. Attribute subset selection can be used in these cases to find a reduced set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes. Hence, relevance analysis, in the form of correlation analysis and attribute subset selection, can be used to detect attributes that do not contribute to the classification or prediction task. Including such attributes may otherwise slow down, and possibly mislead, the learning step.

Data transformation and reduction: The data may be transformed by normalization, particularly when neural networks or methods involving distance measurements are used in the learning step. Normalization involves scaling all values for a given attribute so that they fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0 (see the sketch below). In methods that use distance measurements, for example, this would prevent attributes with initially large ranges (like, say, income) from outweighing attributes with initially smaller ranges (such as binary attributes). The data can also be transformed by generalizing it to higher-level concepts. Concept hierarchies may be used for this purpose. This is particularly useful for continuous-valued attributes. For example, numeric values for the attribute income can be generalized to discrete ranges, such as low, medium, and high. Similarly, categorical attributes, like street, can be generalized to higher-level concepts, like city. Because generalization compresses the original training data, fewer input/output operations may be involved during learning. Data can also be reduced by applying many other methods, ranging from wavelet transformation and principal components analysis to discretization techniques, such as binning, histogram analysis, and clustering.
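A minimal sketch of min-max normalization to the range [0.0, 1.0], as described above; the income values are invented for illustration.

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

incomes = [12000, 35000, 58000, 98000]  # hypothetical income attribute
print(min_max_normalize(incomes))       # [0.0, 0.267..., 0.534..., 1.0]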
Comparing Classification and Prediction Methods
Classification and prediction methods can be compared and evaluated according to the following criteria:

Accuracy: The accuracy of a classifier refers to the ability of a given classifier to correctly predict the class label of new or previously unseen data (i.e., tuples without class label information). Similarly, the accuracy of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new or previously unseen data. Accuracy can be estimated using one or more test sets that are independent of the training set, with estimation techniques such as cross-validation and bootstrapping.

Speed: This refers to the computational costs involved in generating and using the given classifier or predictor.

Robustness: This is the ability of the classifier or predictor to make correct predictions given noisy data or data with missing values.

Scalability: This refers to the ability to construct the classifier or predictor efficiently given large amounts of data.

Interpretability: This refers to the level of understanding and insight that is provided by the classifier or predictor. Interpretability is subjective and therefore more difficult to assess.

Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
– New data is classified based on the training set
• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Classification by Decision Tree Induction
• Decision tree
– A flow-chart-like tree structure
– An internal node denotes a test on an attribute
– A branch represents an outcome of the test
– Leaf nodes represent class labels or class distributions
• Decision tree generation consists of two phases
– Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
– Tree pruning
• Identify and remove branches that reflect noise or outliers
• Use of a decision tree: classifying an unknown sample
– Test the attribute values of the sample against the decision tree

Attribute Selection Measure
• Information gain (ID3/C4.5)
– All attributes are assumed to be categorical
– Can be modified for continuous-valued attributes
• Gini index (IBM IntelligentMiner)
– All attributes are assumed continuous-valued
– Assume there exist several possible split values for each attribute
– May need other tools, such as clustering, to get the possible split values
– Can be modified for categorical attributes

Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• Assume there are two classes, P and N
– Let the set of examples S contain p elements of class P and n elements of class N
– The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as
$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$

Information Gain in Decision Tree Induction
• Assume that using attribute A, a set S will be partitioned into sets {S_1, S_2, …, S_v}
– If S_i contains p_i examples of P and n_i examples of N, the entropy, or the expected information needed to classify objects in all subtrees S_i, is
$E(A) = \sum_{i=1}^{v}\frac{p_i + n_i}{p + n}\, I(p_i, n_i)$
– The encoding information that would be gained by branching on A is
$Gain(A) = I(p, n) - E(A)$

Enhancements to basic decision tree induction
• Allow for continuous-valued attributes
– Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
• Handle missing attribute values
– Assign the most common value of the attribute
– Assign a probability to each of the possible values
• Attribute construction
– Create new attributes based on existing ones that are sparsely represented
– This reduces fragmentation, repetition, and replication

Bayesian Classification
• Statistical classifiers
• Based on Bayes' theorem
• Naïve Bayesian classification
• Class-conditional independence
• Bayesian belief networks

Bayes' Theorem
• Let X be a data tuple and H be a hypothesis, such that X belongs to a specific class C.
• The posterior probability of a hypothesis H given X, P(H|X), follows Bayes' theorem:
$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$

Naïve Bayesian Classification
• Naïve assumption: attribute independence
$P(x_1, \dots, x_k \mid C) = P(x_1 \mid C)\cdots P(x_k \mid C)$
• If an attribute is categorical, P(x_i|C) is estimated as the relative frequency of samples having value x_i for the i-th attribute in class C
• If an attribute is continuous, P(x_i|C) is estimated through a Gaussian density function
• Computationally easy in both cases
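A minimal sketch of naïve Bayesian classification with categorical attributes, estimating each P(x_i|C) as a relative frequency as described above; the toy training tuples are invented for illustration.

from collections import Counter

# Hypothetical training tuples: (attribute values, class label)
data = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("overcast", "hot"), "yes"),
    (("rainy", "cool"), "yes"),
]

def naive_bayes(x):
    """Return the class maximizing P(C) * prod_i P(x_i | C)."""
    labels = Counter(c for _, c in data)
    best, best_score = None, -1.0
    for c, count in labels.items():
        score = count / len(data)  # prior P(C)
        for i, value in enumerate(x):
            matches = sum(1 for attrs, lbl in data if lbl == c and attrs[i] == value)
            score *= matches / count  # likelihood P(x_i | C), relative frequency
        if score > best_score:
            best, best_score = c, score
    return best

print(naive_bayes(("sunny", "hot")))  # "no" on this toy data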
Classification by Backpropagation
• A neural network learning algorithm
• Originated with psychologists and neurobiologists
• Neural network: a set of connected input/output units in which each connection has a weight associated with it

Neural Networks
• Advantages
– prediction accuracy is generally high
– robust: works when training examples contain errors
– output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
– fast evaluation of the learned target function
• Limitations
– long training time
– difficult to understand the learned function (weights)
– not easy to incorporate domain knowledge

Network Pruning and Rule Extraction
• Network pruning
– A fully connected network will be hard to articulate
– N input nodes, h hidden nodes, and m output nodes lead to h(m+N) weights
– Pruning: remove some of the links without affecting the classification accuracy of the network
• Extracting rules from a trained network
– Discretize activation values; replace each individual activation value by the cluster average, maintaining the network accuracy
– Enumerate the output from the discretized activation values to find rules between activation values and output
– Find the relationship between the input and activation values
– Combine the above two to obtain rules relating the output to the input

Other Classification Methods
• k-nearest neighbor classifier
• case-based reasoning
• genetic algorithms
• rough set approach
• fuzzy set approaches

The k-Nearest Neighbor Algorithm
• All instances correspond to points in the n-dimensional space.
• The nearest neighbors are defined in terms of Euclidean distance.
• The target function could be discrete- or real-valued.
• For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to x_q (see the sketch at the end of this subsection).
• Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.

Case-Based Reasoning
• Also uses lazy evaluation and the analysis of similar instances
• Difference: instances are not "points in a Euclidean space"
• Example: the water faucet problem in CADET (Sycara et al. '92)
• Methodology
– Instances are represented by rich symbolic descriptions (e.g., function graphs)
– Multiple retrieved cases may be combined
– Tight coupling between case retrieval, knowledge-based reasoning, and problem solving
• Research issues
– Indexing based on syntactic similarity measures and, on failure, backtracking and adapting to additional cases

Genetic Algorithms
• GA: based on an analogy to biological evolution
• Each rule is represented by a string of bits
• An initial population is created consisting of randomly generated rules
– e.g., IF A1 and Not A2 then C2 can be encoded as 100
• Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring
• The fitness of a rule is represented by its classification accuracy on a set of training examples
• Offspring are generated by crossover and mutation

Rough Set Approach
• Rough sets are used to approximately or "roughly" define equivalence classes
• A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C)
• Finding the minimal subsets (reducts) of attributes (for feature reduction) is NP-hard, but a discernibility matrix can be used to reduce the computational intensity

Fuzzy Set Approaches
• Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (such as in a fuzzy membership graph)
• Attribute values are converted to fuzzy values
– e.g., income is mapped into the discrete categories {low, medium, high} with fuzzy values calculated
• For a given new sample, more than one fuzzy value may apply
• Each applicable rule contributes a vote for membership in the categories
• Typically, the truth values for each predicted category are summed
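Here is the minimal k-NN sketch referenced in the k-nearest neighbor subsection above, for a discrete-valued target; the training points and query are invented for illustration.

import math
from collections import Counter

# Hypothetical labeled training points in 2-D space.
training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((4.0, 4.2), "B"), ((3.8, 4.0), "B")]

def knn_classify(xq, k=3):
    """Return the most common label among the k training points nearest to xq."""
    by_distance = sorted(training, key=lambda p: math.dist(p[0], xq))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

print(knn_classify((1.1, 0.9)))  # "A": two of the three nearest points are class A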
Prediction
• Prediction is similar to classification
– First, construct a model
– Second, use the model to predict unknown values
• The major method for prediction is regression
– Linear and multiple regression
– Non-linear regression
• Prediction is different from classification
– Classification refers to predicting a categorical class label
– Prediction models continuous-valued functions

Predictive Modeling in Databases
• Predictive modeling: predict data values or construct generalized linear models based on the database data.
• One can only predict value ranges or category distributions
• Method outline:
– Minimal generalization
– Attribute relevance analysis
– Generalized linear model construction
– Prediction
• Determine the major factors which influence the prediction
– Data relevance analysis: uncertainty measurement, entropy analysis, expert judgment, etc.
• Multi-level prediction: drill-down and roll-up analysis

Classification Accuracy: Estimating Error Rates
• Partition: training-and-testing
– use two independent data sets: a training set and a test set
– used for data sets with a large number of samples
• Cross-validation
– divide the data set into k subsamples
– use k-1 subsamples as training data and one subsample as test data (k-fold cross-validation; see the sketch after this subsection)
– for data sets of moderate size
• Bootstrapping (leave-one-out)
– for small data sets

Boosting and Bagging
• Boosting increases classification accuracy
– Applicable to decision trees or Bayesian classifiers
• Learn a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor
• Boosting requires only linear time and constant space
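The k-fold cross-validation sketch referenced above. It only shows how the data set is partitioned into train/test folds, with a placeholder comment where a classifier would be fitted and evaluated; the sample count and k are invented for illustration.

def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    fold_size = n_samples // k
    indices = list(range(n_samples))
    for i in range(k):
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, test

# Each sample is used for testing exactly once and for training k-1 times.
for train, test in k_fold_splits(n_samples=10, k=5):
    print("train:", train, "test:", test)
    # here one would fit the classifier on `train` and measure its error on `test`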
UNIT: 7
Cluster analysis
The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one group and so may be considered as a form of data compression. Although classification is an effective means for distinguishing groups or classes of objects, it requires the often costly collection and labeling of a large set of training tuples or patterns, which the classifier uses to model each group. It is often more desirable to proceed in the reverse direction: first partition the set of data into groups based on data similarity (e.g., using clustering), and then assign labels to the relatively small number of groups. Additional advantages of such a clustering-based process are that it is adaptable to changes and helps single out useful features that distinguish different groups.

Cluster analysis has been widely used in numerous applications, including market research, pattern recognition, data analysis, and image processing. In business, clustering can help marketers discover distinct groups in their customer bases and characterize customer groups based on purchasing patterns. In biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent in populations. Clustering may also help in the identification of areas of similar land use in an earth observation database and in the identification of groups of houses in a city according to house type, value, and geographic location, as well as the identification of groups of automobile insurance policy holders with a high average claim cost. It can also be used to help classify documents on the Web for information discovery.

Clustering is also called data segmentation in some applications because clustering partitions large data sets into groups according to their similarity. Clustering can also be used for outlier detection, where outliers (values that are "far away" from any cluster) may be more interesting than common cases. Applications of outlier detection include the detection of credit card fraud and the monitoring of criminal activities in electronic commerce.

As a data mining function, cluster analysis can be used as a stand-alone tool to gain insight into the distribution of data, to observe the characteristics of each cluster, and to focus on a particular set of clusters for further analysis. Alternatively, it may serve as a preprocessing step for other algorithms, such as characterization, attribute subset selection, and classification, which would then operate on the detected clusters and the selected attributes or features.

Data clustering is under vigorous development. Contributing areas of research include data mining, statistics, machine learning, spatial database technology, biology, and marketing. Owing to the huge amounts of data collected in databases, cluster analysis has recently become a highly active topic in data mining research. As a branch of statistics, cluster analysis has been extensively studied for many years, focusing mainly on distance-based cluster analysis.

The following are typical requirements of clustering in data mining:

Scalability: Many clustering algorithms work well on small data sets containing fewer than several hundred data objects; however, a large database may contain millions of objects. Clustering on a sample of a given large data set may lead to biased results. Highly scalable clustering algorithms are needed.

Ability to deal with different types of attributes: Many algorithms are designed to cluster interval-based (numerical) data. However, applications may require clustering other types of data, such as binary, categorical (nominal), and ordinal data, or mixtures of these data types.

Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters with similar size and density. However, a cluster could be of any shape. It is important to develop algorithms that can detect clusters of arbitrary shape.

Minimal requirements for domain knowledge to determine input parameters: Many clustering algorithms require users to input certain parameters in cluster analysis (such as the number of desired clusters). The clustering results can be quite sensitive to input parameters. Parameters are often difficult to determine, especially for data sets containing high-dimensional objects. This not only burdens users, but it also makes the quality of clustering difficult to control.

Ability to deal with noisy data: Most real-world databases contain outliers or missing, unknown, or erroneous data. Some clustering algorithms are sensitive to such data and may lead to clusters of poor quality.
Incremental clustering and insensitivity to the order of input records: Some clustering algorithms cannot incorporate newly inserted data (i.e., database updates) into existing clustering structures and, instead, must determine a new clustering from scratch. Some clustering algorithms are sensitive to the order of input data. That is, given a set of data objects, such an algorithm may return dramatically different clusterings depending on the order of presentation of the input objects. It is important to develop incremental clustering algorithms and algorithms that are insensitive to the order of input.

High dimensionality: A database or a data warehouse can contain several dimensions or attributes. Many clustering algorithms are good at handling low-dimensional data, involving only two to three dimensions. Human eyes are good at judging the quality of clustering for up to three dimensions. Finding clusters of data objects in high-dimensional space is challenging, especially considering that such data can be sparse and highly skewed.

Constraint-based clustering: Real-world applications may need to perform clustering under various kinds of constraints. Suppose that your job is to choose the locations for a given number of new automatic banking machines (ATMs) in a city. To decide upon this, you may cluster households while considering constraints such as the city's rivers and highway networks, and the type and number of customers per cluster. A challenging task is to find groups of data with good clustering behavior that satisfy specified constraints.

Interpretability and usability: Users expect clustering results to be interpretable, comprehensible, and usable. That is, clustering may need to be tied to specific semantic interpretations and applications. It is important to study how an application goal may influence the selection of clustering features and methods.

Types of Data in Cluster Analysis
• Interval-scaled variables
• Binary variables
• Categorical, ordinal, and ratio-scaled variables
• Variables of mixed types

Interval-valued variables
• Standardize data
– Calculate the mean absolute deviation:
$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right)$
where
$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \dots + x_{nf}\right)$
– Calculate the standardized measurement (z-score):
$z_{if} = \frac{x_{if} - m_f}{s_f}$
• Using the mean absolute deviation is more robust than using the standard deviation

Similarity and Dissimilarity Between Objects
• Distances are normally used to measure the similarity or dissimilarity between two data objects
• Some popular ones include the Minkowski distance:
$d(i, j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q\right)^{1/q}$
where i = (x_{i1}, x_{i2}, …, x_{ip}) and j = (x_{j1}, x_{j2}, …, x_{jp}) are two p-dimensional data objects, and q is a positive integer
• If q = 1, d is the Manhattan distance:
$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$
• If q = 2, d is the Euclidean distance:
$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$
– Properties
• d(i, j) ≥ 0
• d(i, i) = 0
• d(i, j) = d(j, i)
• d(i, j) ≤ d(i, k) + d(k, j)
• One can also use a weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures.
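A minimal sketch of the Minkowski distance defined above, specializing to the Manhattan (q = 1) and Euclidean (q = 2) cases; the two points are invented for illustration.

def minkowski(x, y, q):
    """Minkowski distance between two p-dimensional points."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

i, j = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(minkowski(i, j, 1))  # Manhattan: |1-4| + |2-6| + |3-3| = 7.0
print(minkowski(i, j, 2))  # Euclidean: sqrt(9 + 16 + 0) = 5.0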
Categorical Variables
• A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
• Method 1: simple matching
– d(i, j) = (p − m) / p, where m is the number of matches and p is the total number of variables
• Method 2: use a large number of binary variables
– create a new binary variable for each of the M nominal states

Ordinal Variables
• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled variables:
– replace x_if by its rank r_if ∈ {1, …, M_f}
– map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
$z_{if} = \frac{r_{if} - 1}{M_f - 1}$
– compute the dissimilarity using methods for interval-scaled variables

Ratio-Scaled Variables
• Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae^{Bt} or Ae^{-Bt}
• Methods:
– treat them like interval-scaled variables
– apply a logarithmic transformation, y_if = log(x_if)
– treat them as continuous ordinal data and treat their ranks as interval-scaled

Major Clustering Approaches
• Partitioning algorithms: construct various partitions and then evaluate them by some criterion
• Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion
• Density-based: based on connectivity and density functions
• Grid-based: based on a multiple-level granularity structure
• Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given model

Partitioning Algorithms: Basic Concept
• Partitioning method: construct a partition of a database D of n objects into a set of k clusters
• Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
– Global optimum: exhaustively enumerate all partitions
– Heuristic methods: k-means and k-medoids algorithms
– k-means: each cluster is represented by the center of the cluster
– k-medoids or PAM (Partitioning Around Medoids): each cluster is represented by one of the objects in the cluster

The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in 4 steps (a minimal sketch follows this subsection):
– Partition objects into k nonempty subsets
– Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
– Assign each object to the cluster with the nearest seed point
– Go back to Step 2; stop when there are no more new assignments
• Strength
– Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations; normally k, t << n
– Often terminates at a local optimum; the global optimum may be found using techniques such as deterministic annealing and genetic algorithms
• Weakness
– Applicable only when the mean is defined; what about categorical data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable for discovering clusters with non-convex shapes
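The minimal k-means sketch referenced above, following the 4 steps of the algorithm; the points and k are invented for illustration.

import math
import random

def kmeans(points, k, max_iter=100):
    """Lloyd's iteration: assign points to the nearest centroid, then recompute centroids."""
    centroids = random.sample(points, k)  # step 1: arbitrary initial seed points
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:  # step 3: assign each object to the nearest seed point
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]  # step 2: recompute centroids as the mean point of each cluster
        if new_centroids == centroids:  # step 4: stop when nothing changes
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)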
The K-Medoids Clustering Method
• Find representative objects, called medoids, in clusters
• PAM (Partitioning Around Medoids, 1987)
– starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if doing so improves the total distance of the resulting clustering
– PAM works effectively for small data sets, but does not scale well for large data sets
• CLARA (Kaufmann & Rousseeuw, 1990)
• CLARANS (Ng & Han, 1994): randomized sampling
• Focusing + spatial data structure (Ester et al., 1995)

PAM (Partitioning Around Medoids) (1987)
• PAM (Kaufman and Rousseeuw, 1987), built into S-Plus
• Uses real objects to represent the clusters
– Select k representative objects arbitrarily
– For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_ih
– For each pair of i and h,
• if TC_ih < 0, i is replaced by h
• then assign each non-selected object to the most similar representative object
– Repeat steps 2-3 until there is no change

CLARA (Clustering Large Applications) (1990)
• CLARA (Kaufmann and Rousseeuw, 1990)
– Built into statistical analysis packages, such as S+
• It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output
• Strength: deals with larger data sets than PAM
• Weakness:
– Efficiency depends on the sample size
– A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased

CLARANS ("Randomized" CLARA) (1994)
• CLARANS (A Clustering Algorithm based on Randomized Search)
• CLARANS draws a sample of neighbors dynamically
• The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids
• If a local optimum is found, CLARANS starts with a new randomly selected node in search of a new local optimum
• It is more efficient and scalable than both PAM and CLARA
• Focusing techniques and spatial access structures may further improve its performance

Hierarchical Methods

More on Hierarchical Clustering Methods
• Major weaknesses of agglomerative clustering methods
– do not scale well: time complexity of at least O(n^2), where n is the total number of objects
– can never undo what was done previously
• Integration of hierarchical with distance-based clustering
– BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
– CURE (1998): selects well-scattered points from the cluster and then shrinks them toward the center of the cluster by a specified fraction
– CHAMELEON (1999): hierarchical clustering using dynamic modeling
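To illustrate the agglomerative bottom-up process and its two weaknesses noted above (at least O(n^2) cost, and merges that can never be undone), here is a minimal brute-force single-linkage sketch; the points are invented, and scalable systems such as BIRCH are designed precisely to avoid this exhaustive pairwise search.

import math

def single_link_agglomerative(points, target_clusters):
    """Repeatedly merge the two clusters whose closest members are nearest."""
    clusters = [[p] for p in points]  # start with each object in its own cluster
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):          # exhaustive pairwise search:
            for j in range(i + 1, len(clusters)):  # this is where the O(n^2) cost comes from
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge; this step can never be undone
    return clusters

points = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (5.5, 5.0)]
print(single_link_agglomerative(points, 2))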