What Is Data Mining

Data mining is the process of using a variety of data analysis tools to discover patterns and relationships hidden in vast amounts of data. From these patterns and relationships, a company can make valid predictions about future trends. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They find hidden patterns and predictive information that experts may miss because they lie outside the experts' expectations.

Data mining is also known as knowledge discovery in databases (KDD). Those seeking to make a distinction between the two terms generally use KDD to refer to the overall process of discovering useful knowledge from data, and data mining to refer to the application of algorithms for extracting patterns from data.

Here are some examples of data mining tasks:

Classification: To which of a set of predefined categories does this case belong? In marketing, when planning a mail shot, the categories may simply be the people who will buy and the people who will not buy.

Association: Which things occur together? For example, an analysis of shopping baskets may show that people who buy beer also tend to buy nuts at the same time.

Sequence: This is essentially a time-ordered association, although the associated events may be spread far apart in time. For example, most people buy insurance after getting married.

Clustering: This is like classification except that the categories are not normally known beforehand. For example, a clustering tool might examine a collection of shopping baskets and discover clusters corresponding to health food buyers, convenience food buyers, luxury food buyers, and so on.

Data mining process

1. Develop an understanding of the application, relevant prior knowledge, and the end user's goals.
2. Create a target data set to be used for discovery.
3. Clean and preprocess data (including handling missing data fields, noise in the data, accounting for time series, and known changes).
4. Reduce the number of variables and find invariant representations of data if possible.
5. Choose the data mining task (classification, regression, clustering, etc.).
6. Choose the data mining algorithm.
7. Search for patterns of interest (this is the actual data mining).
8. Interpret the patterns mined. If necessary, iterate through any of steps 1 through 7.
9. Consolidate the knowledge discovered and prepare a report.

Figure 1: Steps in the Data Mining Process (the nine steps listed above, from understanding the application to preparing the report)

Data mining and data warehousing

The data to be mined is usually extracted from an enterprise data warehouse into a data mining database. However, a data warehouse is not a requirement for data mining. Setting up a large data warehouse can be time consuming and cost millions of dollars, because it has to consolidate data from multiple sources, resolve data integrity problems, and load the data into a query database. Therefore, some companies prefer to mine data from one or more operational or transactional databases, simply extracting the data into a read-only database.

Data mining and OLAP

One of the most common questions from data processing professionals is about the difference between data mining and On-Line Analytical Processing (OLAP). They are different tools, but they can complement each other. Traditional query and report tools describe what is in a database. OLAP goes further and is used to analyze why certain things are true. With OLAP, the analyst generates a series of hypothetical patterns and relationships and then uses queries against the database to verify or disprove them. For example, an analyst might want to determine the factors that lead to loan defaults.
He or she might initially hypothesize that people with low incomes are bad credit risks and analyze the database with OLAP to verify or disprove this assumption. If this hypothesis is not supported by the data, the analyst might look at high debt as the determinant of risk. If that hypothesis is not supported either, the analyst can try debt and income together as the best predictor of bad credit risks.

Data mining is different from OLAP because, rather than verifying hypothetical patterns, it uses the data itself to find hidden patterns. For example, suppose an analyst who wants to identify the risk factors for loan default uses a data mining tool. The tool might discover that people with high debt and low incomes are bad credit risks, a pattern that could also be found using OLAP. But it can also discover a pattern that the analyst did not think to try, such as age also being a determinant of risk.

OLAP is also a useful complement in the early stages of the knowledge discovery process because it helps the user explore the data, for instance by focusing attention on important variables, identifying exceptions, or finding interactions. This is important because the better the data is understood, the more effective the knowledge discovery process will be.

Data mining and hardware trends

A key enabler of data mining is the major progress in hardware price and performance. The drop in the price of disk storage in recent years has radically changed the economics of collecting and storing massive amounts of data. Prices have dropped not only for disk storage but also for CPUs. Each generation of chips greatly increases CPU power while lowering the cost. The same trend is reflected in the price of RAM. Today, most PCs have 64 megabytes or more of RAM, and workstations have 256 megabytes or more, while servers with gigabytes are normal.
The power of the individual CPU has greatly increased, and most servers now support multiple CPUs using symmetric multiprocessing, so that large installations can put hundreds of CPUs to work on finding patterns in the data.

Data mining applications

Data mining is increasingly popular because of the substantial contribution it can make. It can be used to control costs as well as to contribute to revenue increases. Many organizations are using data mining to help manage all phases of the customer life cycle, including acquiring new customers, increasing revenue from existing customers, and retaining good customers.

Data mining offers value across a broad spectrum of industries. Telecommunications and credit card companies are two of the leaders in applying data mining to detect fraudulent use of their services. Insurance companies and stock exchanges are also interested in applying this technology to reduce fraud. Medical services use data mining to predict the effectiveness of surgical procedures, medical tests, or medications. Companies active in the financial markets use data mining to determine market and industry characteristics, as well as to predict individual company and stock performance. Retailers are making more use of data mining to decide which products to stock in particular stores, as well as to assess the effectiveness of promotions and coupons. Pharmaceutical firms are mining large databases of chemical compounds and of genetic material to discover substances that might be candidates for the treatment of disease.

Helpful Hint

The relationship between the people who have domain knowledge and those who are data mining experts is a critical success factor. A data mining expert without domain knowledge is of little use, and so is a business expert who knows nothing about models and data. The data mining expert and the people who understand the business must work closely together.

Data mining cannot find patterns of interest unless the tool is told what to look for.
For example, to find ways of improving the impact of a mailing list, the search would be directed to customers who have previously bought large amounts or who have responded frequently to previous offers.

Data mining can be useful in many areas, not just those in which it has previously been successful. However, a cost/benefit analysis needs to be done. If the cost of mining is larger than the anticipated return, it should not be undertaken.

The techniques applied in data mining are generally not new. They have been used for many years, but not in a business context. With the large amounts of data now available in data warehouses, and with new user-friendly software and interfaces, they have become accessible. Data mining no longer needs to be an extremely complex process.

Data mining can be undertaken with relatively small databases, and large databases can be sampled. It is possible to have too much data and lose sight of the information it contains. For example, using both age and date of birth, which carry the same information, results in both factors being judged equally relevant, and each will be assigned a lower weight.

Current Limits to Data Mining

- Dealing with very large (e.g., terabyte) databases
- Dealing with high dimensionality, which increases the size of the search space and may also create spurious patterns
- Overfitting the available data
- Rapid changes in data (non-stationarity) that make previously discovered patterns invalid
- Missing and noisy data
- Reducing the emphasis on fully automated, rapid-response environments and increasing human/computer interaction
- Lack of full understanding of the patterns observed
- Managing changes in data and knowledge as available information is updated
- Dealing with non-numerical data, such as objects, text, and multimedia, more and more of which is being stored in databases
- Lack of integration with other systems

All of these problems are areas of current research, but they are not yet fully solved.
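The point about high dimensionality creating spurious patterns can be illustrated with a small sketch. It assumes purely random data, so no variable has any real relationship with the target. The function names, the sample size of 20 rows, and the 200 variables are hypothetical choices made only for this illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_chance_correlation(n_rows=20, n_vars=200, seed=0):
    """Search random variables for the one most correlated with a random target.

    Returns a list whose entry k-1 is the largest |r| found among the
    first k variables. Every variable is pure noise (an assumption of
    this sketch), so any strong correlation is spurious.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    best = 0.0
    running_best = []
    for _ in range(n_vars):
        variable = [rng.gauss(0, 1) for _ in range(n_rows)]
        best = max(best, abs(pearson(variable, target)))
        running_best.append(best)
    return running_best


if __name__ == "__main__":
    running_best = strongest_chance_correlation()
    for k in (1, 10, 50, 200):
        print(f"{k:>3} variables searched: strongest |r| = {running_best[k - 1]:.2f}")
```

Because the strongest correlation found can only grow as more variables are searched, a tool that examines many variables on a small data set will tend to report some apparently strong pattern even when none exists. This is one reason mined patterns are usually checked against data that was not used in the search.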
Despite these difficulties, data mining offers an important approach to achieving value from the data warehouse for use in decision support.

Hand, D., Mannila, H., & Smyth, P. (2001). Principles of Data Mining. Cambridge, MA: The MIT Press.