

Estimation as a Data Mining Task

Vinh Ngo University of Houston – Clear Lake

Mike Ellis University of Houston – Clear Lake

ABSTRACT

The constant growth of information presents problems for many Information Technology (IT) professionals, who are overwhelmed by extremely large datasets when extracting data for analysis. Traditional approaches, for example regression, means, and standard deviations, still pose problems for estimation because of the size of these datasets. This study looks at the stages of data mining for data estimation and how they may be applied in the business world. The ultimate goal is to construct predictive models that can be used to analyze and discover risk characteristics for a given dataset.

Keywords: Data Mining, Data Warehouse, Mining Phases, Data Estimation

INTRODUCTION

Estimation in data mining is a close relative of another data mining concept, classification. Classification, as the name suggests, involves separating data into categories that can be used to predict future behavior in similar data. The difference between the two lies in the type of value being predicted: classification concerns discrete data, while estimation concerns continuous data. Estimation is also referred to as regression or prediction.
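The distinction between a continuous estimate and a discrete class can be illustrated with a minimal Python sketch. The income and spending figures, the function names, and the 800.0 threshold are all hypothetical and chosen only for illustration; ordinary least squares is used here as one common estimation technique, not as the method of any particular tool.

```python
from statistics import mean

# Hypothetical historical data: (monthly_income, monthly_spend) pairs.
history = [(2000.0, 500.0), (3000.0, 700.0), (4000.0, 900.0), (5000.0, 1100.0)]

def fit_line(points):
    """Ordinary least squares fit of y = intercept + slope * x."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def estimate_spend(income, model):
    """Estimation: the output is a continuous value."""
    intercept, slope = model
    return intercept + slope * income

def classify_spender(income, model, threshold=800.0):
    """Classification: the output is one of a fixed set of labels."""
    return "high" if estimate_spend(income, model) >= threshold else "low"

model = fit_line(history)
print(estimate_spend(6000.0, model))    # a number on a continuous scale
print(classify_spender(6000.0, model))  # a discrete label
```

Both functions draw on the same historical data; only the type of answer differs.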

This predictive nature makes estimation valuable to almost every industry, from janitorial services to aerospace. Estimation gives businesses the ability to predict and forecast future events based on historical data. In the past, estimation was a daunting task for any business analyst, engineering professional, or company executive. Data mining and data warehousing have made it easier to extract valuable data from large multi-dimensional datasets within minutes, and sometimes seconds. Data mining is usually implemented in three phases: exploration, build/validation, and deployment.

DATA MINING PHASES

What is data mining? It is an analytic process designed to explore data for patterns and/or systematic relationships between multiple data points and then to apply the newly found patterns to new datasets. The ultimate goal is to develop a predictive data model that can be used to make strategic business decisions quickly and accurately. The process of data mining is usually divided into three phases: initial exploration, model development/pattern discovery, and deployment/implementation.

Initial Exploration
o Cleaning data, data transformation, preliminary feature selection

Model Development/Pattern Discovery
o Choosing the best predictive performance
o Bagging
o Boosting
o Meta-Learning

Deployment/Implementation
o Applying the pattern to new datasets

Initial exploration is key to a successful data mining design. This phase involves cleaning the data, such as removing unnecessary data fields and values. Another step is transforming the data, for example converting every dataset to the same unit of measure (e.g., cm, lbs, ft). However, the most important part of this phase is preliminary feature selection, which brings the number of variables under consideration down to a smaller, manageable range; where the population is large, a sample of the data may be used for this step.
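The three exploration steps can be sketched in a few lines of Python. Everything here is an illustrative assumption: the record layout, the field names, and in particular the choice to rank variables by absolute Pearson correlation with the target, which is only one simple way to do preliminary feature selection.

```python
from math import sqrt
from statistics import mean

# Hypothetical raw records with mixed units and an unneeded "note" field.
raw = [
    {"id": 1, "length": 10.0, "unit": "cm", "note": "n/a"},
    {"id": 2, "length": 2.0, "unit": "in", "note": ""},
    {"id": 3, "length": 1.0, "unit": "ft", "note": "x"},
]

# Exact conversion factors to centimeters.
TO_CM = {"cm": 1.0, "in": 2.54, "ft": 30.48}

def clean(record, drop=("note",)):
    """Cleaning: remove fields that are not needed for analysis."""
    return {k: v for k, v in record.items() if k not in drop}

def to_cm(record):
    """Transformation: express every length in the same unit."""
    out = dict(record)
    out["length"] = record["length"] * TO_CM[record["unit"]]
    out["unit"] = "cm"
    return out

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def select_features(columns, target, top_k):
    """Feature selection: keep the top_k variables most correlated with target."""
    ranked = sorted(
        columns,
        key=lambda name: abs(pearson(columns[name], target)),
        reverse=True,
    )
    return ranked[:top_k]

prepared = [to_cm(clean(r)) for r in raw]
```

Here `select_features` takes a dictionary mapping each variable name to its list of values, and returns the names of the variables worth carrying into the modeling phase.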

The model development and pattern discovery phase is more involved. It requires comparing candidate models and selecting the one with the best predictive performance and the most stable results. Bagging, boosting, and meta-learning are some of the techniques used during this phase to combine or improve predictors for classification and estimation. It is in this phase that estimation techniques may be used to generate a predictive model.
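Of the techniques named above, bagging (bootstrap aggregating) is the simplest to sketch: fit the same kind of model on many resampled copies of the data and average their estimates. The Python below is a minimal illustration that assumes a simple least-squares line as the base model; the function names, the default of 25 models, and the fixed seed are hypothetical choices.

```python
import random
from statistics import mean

def fit_line(points):
    """Ordinary least squares fit of y = intercept + slope * x."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def bootstrap_sample(points, rng):
    """Draw len(points) items with replacement.

    Redraws if every x is identical, since a line cannot be fit to that
    sample. Assumes points contains at least two distinct x values.
    """
    while True:
        sample = [rng.choice(points) for _ in points]
        if len({x for x, _ in sample}) >= 2:
            return sample

def bagged_models(points, n_models=25, seed=0):
    """Fit one line per bootstrap sample."""
    rng = random.Random(seed)
    return [fit_line(bootstrap_sample(points, rng)) for _ in range(n_models)]

def bagged_estimate(models, x):
    """Average the estimates of all models for input x."""
    return mean([intercept + slope * x for intercept, slope in models])
```

Averaging over resampled fits tends to give more stable results than a single fit when the data are noisy, which is the property this phase is looking for.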

The deployment and implementation phase is straightforward, as its name suggests: the newly found pattern is applied to a new dataset to generate predictions or estimates of the expected outcome.

The overall data mining process is illustrated in [1].

BUSINESS USES

Most analysts separate data mining and warehousing software into two groups: data mining tools and data mining applications. Data mining tools provide numerous techniques that can be applied to almost any business problem in any industry. Data mining applications, on the other hand, embed those techniques inside an application customized to address a specific business problem.

Data mining applications touch many everyday activities. For example, financial transactions are commonly screened by a data mining application to detect fraud. Through estimation tools, the application establishes a pattern for each account holder, which the financial institution can compare against recent purchases to determine whether they were normal or extraordinary. Account holders can be notified when suspicious events occur and can remain confident that their accounts are secure against fraudulent use.
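A minimal Python sketch of the per-account idea follows. It assumes the "pattern" is simply the mean and standard deviation of an account holder's past purchase amounts, and that a purchase is extraordinary when it lies more than three standard deviations from the mean. Both the profile and the threshold are illustrative assumptions; production fraud systems use far richer models.

```python
from statistics import mean, stdev

def build_profile(amounts):
    """Summarize an account's purchase history as (mean, standard deviation).

    Requires at least two past purchases.
    """
    return mean(amounts), stdev(amounts)

def is_extraordinary(amount, profile, z_threshold=3.0):
    """Flag a purchase that deviates strongly from the account's pattern."""
    avg, sd = profile
    if sd == 0:
        # Every past purchase was identical, so any other amount is unusual.
        return amount != avg
    return abs(amount - avg) / sd > z_threshold

# Hypothetical purchase history for one account holder.
past_purchases = [20.0, 25.0, 22.0, 30.0, 18.0, 27.0, 24.0, 21.0]
profile = build_profile(past_purchases)

for purchase in (26.0, 500.0):
    status = "suspicious" if is_extraordinary(purchase, profile) else "normal"
    print(purchase, status)
```

A flagged purchase would then trigger the notification to the account holder.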

Both data mining tools and data mining applications are valuable, and most companies use them together in an integrated environment for predictive analysis. Listed below are some of the data mining applications used within different industries:

Financial Institutions
o Fraud detection
o Stock trend analysis

Retail and Manufacturing
o Sales projection
o New product development analysis
o Regional market statistics

In April 2005, IBM publicized a new service that incorporates estimation techniques to help the automotive industry. Its “Parametric Analysis Center” can be used to diagnose problems that are similar to previous failures and to predict which maintenance and repair issues will become widespread, causing warranty problems for manufacturers. [4]

As previously mentioned, financial services companies are big users of data mining. Property & Casualty (P&C) insurance companies set their rates based on several factors, the most basic of which is the pure premium. The pure premium is the minimum amount the company needs to charge a risk group to cover the group’s claims. It is based on the frequency and severity of claims within the group. By applying estimation techniques, a P&C insurance company can identify claim patterns in the continuous data within its claims and keep each group’s rates accurate. [5]
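The relationship between frequency, severity, and pure premium can be made concrete with a short Python sketch. It uses the standard actuarial definitions (frequency as claims per unit of exposure, severity as average cost per claim, pure premium as their product); the function name and the claim figures are hypothetical.

```python
def pure_premium(claim_amounts, exposures):
    """Expected claim cost per unit of exposure for one risk group.

    frequency = number of claims / exposures
    severity  = total claim cost / number of claims
    pure premium = frequency * severity
    """
    if exposures <= 0:
        raise ValueError("exposures must be positive")
    claim_count = len(claim_amounts)
    if claim_count == 0:
        return 0.0
    frequency = claim_count / exposures
    severity = sum(claim_amounts) / claim_count
    return frequency * severity

# Hypothetical risk group: 100 insured units and three claims in the period.
group_claims = [1000.0, 3000.0, 2000.0]
print(pure_premium(group_claims, exposures=100))
```

Because the product of frequency and severity reduces to total claim cost divided by exposures, the estimation problem lies in predicting those two continuous quantities for each risk group before the claims occur.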

CONCLUSION

Data mining has proven to be a powerful tool for many types of companies. One particularly powerful part of the data mining toolbox is its estimation tools. Since not every dataset contains discrete data, it is important to be able to make predictions on continuous data as well. These tools are in use all around us in our electronic economy, and will likely become more valuable as more companies discover their strategic benefits.

REFERENCES

1. Warren Thornthwaite. “Get Started With Data Mining Now”. October 1, 2005. Online at http://www.intelligententerprise.com/info_centers/data_warehousing/showArticle.jhtml?articleID=171000647.

2. “Data Mining Techniques”. Copyright 1986-2006. Online at http://www.statsoft.com/textbook/stdatmin.html. Accessed June 15, 2006.

3. Paulraj Ponniah. Data Warehousing Fundamentals. John Wiley & Sons, Inc. Copyright 2001.

4. “IBM Offers Advanced Diagnostics Analysis Services to Automotive Industry”, IBM press release, April 12, 2005. Online at http://www-03.ibm.com/press/us/en/pressrelease/7609.wss. Accessed June 21, 2006.

5. C. Apte, et al. “Probabilistic Estimation Based Data Mining for Discovering Insurance Risks”, IBM Research Report RC-21483. September 13, 1999. Online at http://www.research.ibm.com/dar/papers/pdf/upa-ieee-is-3.pdf. Accessed June 21, 2006.
