What Is Data Mining

Data mining is the process of using a variety of data analysis tools to discover
patterns and relationships hidden in vast amounts of data. From these patterns and
relationships, a company can make valid predictions about future trends. Data mining
tools can answer business questions that traditionally were too time-consuming to
resolve. They find hidden patterns and predictive information that experts may miss
because it lies outside their expectations.
Data mining is also known as knowledge discovery in databases (KDD). Those seeking to
make a distinction between the terms data mining and KDD generally use KDD to
refer to the overall process of discovering useful knowledge from data, and data
mining to refer to the application of algorithms for extracting patterns from data.
Here are some examples of data mining tasks:
Classification: To which of a set of predefined categories does this case belong? In
marketing, when planning a mail shot, the categories may simply be the people who
will buy and the people who will not buy.
Association: Which things occur together? For example, looking at shopping
baskets, people who buy beer also tend to buy nuts at the same time.
Sequence: This is essentially a time-ordered association, although the associated
events may be spread far apart in time. For example, most people buy insurance after
they marry.
Clustering: This is like classification, except that the categories are not normally
known beforehand. For example, a clustering tool might examine a collection of
shopping baskets and discover clusters corresponding to health food buyers,
convenience food buyers, luxury food buyers and so on.
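The association example can be made concrete with two standard measures, support and confidence. The sketch below is only an illustration: the baskets are invented, and the function names (`count_containing`, `support`, `confidence`, `pair_counts`) are hypothetical helpers rather than part of any particular data mining tool.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented for illustration only.
baskets = [
    {"beer", "nuts", "crisps"},
    {"beer", "nuts"},
    {"beer", "bread"},
    {"milk", "bread"},
    {"beer", "nuts", "milk"},
]

def count_containing(baskets, items):
    """Number of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for basket in baskets if items <= basket)

def support(baskets, items):
    """Fraction of all baskets that contain every item in `items`."""
    return count_containing(baskets, items) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Among baskets containing `antecedent`, the fraction that also contain `consequent`."""
    both = set(antecedent) | set(consequent)
    return count_containing(baskets, both) / count_containing(baskets, antecedent)

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

print(pair_counts(baskets).most_common(1))      # most frequent pair of items
print(support(baskets, {"beer", "nuts"}))       # how often the pair occurs at all
print(confidence(baskets, {"beer"}, {"nuts"}))  # how often beer buyers also buy nuts
```

A rule such as "beer implies nuts" is usually considered interesting only when both numbers are high: support shows the pattern is common, and confidence shows it is reliable.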
Data mining process
1. Develop an understanding of the application, relevant prior knowledge, and the
end user’s goals.
2. Create a target data set to be used for discovery.
3. Clean and preprocess data (including handling missing data fields, noise in the
data, accounting for time series, and known changes).
4. Reduce the number of variables and find invariant representations of data if
possible.
5. Choose the data mining task (classification, regression, clustering, etc.).
6. Choose the data mining algorithm.
7. Search for patterns of interest (this is the actual data mining).
8. Interpret the patterns mined. If necessary, iterate through any of steps 1 through 7.
9. Consolidate knowledge discovered and prepare a report.
Figure 1: Steps in the Data Mining Process
[Figure: flowchart of the nine steps listed above, from Understand Application to
Prepare Report.]
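Steps 3 and 4 of the process can be sketched in a few lines. This is a minimal illustration under stated assumptions: the records are invented, mean imputation is just one of several ways to handle missing fields, and dropping constant fields is only the simplest form of variable reduction. The helper names `impute_missing` and `drop_constant_fields` are hypothetical.

```python
from statistics import mean

# Hypothetical target data set (step 2); None marks a missing field.
records = [
    {"age": 34,   "income": 52000, "country": "X"},
    {"age": None, "income": 48000, "country": "X"},
    {"age": 45,   "income": None,  "country": "X"},
    {"age": 29,   "income": 61000, "country": "X"},
]

def impute_missing(records, fields):
    """Step 3 (assumed strategy): fill missing numeric fields with the mean of observed values."""
    cleaned = [dict(r) for r in records]  # copy so the input is left unchanged
    for field in fields:
        observed = [r[field] for r in records if r[field] is not None]
        fill = mean(observed)
        for r in cleaned:
            if r[field] is None:
                r[field] = fill
    return cleaned

def drop_constant_fields(records):
    """Step 4 (simplest case): drop variables that take a single value in every record."""
    keep = [f for f in records[0] if len({r[f] for r in records}) > 1]
    return [{f: r[f] for f in keep} for r in records]

prepared = drop_constant_fields(impute_missing(records, ["age", "income"]))
for row in prepared:
    print(row)
```

The prepared records are what steps 5 through 7 would then operate on. In practice the choice of imputation and reduction method depends on the application understood in step 1.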
Data mining and data warehousing
The data to be mined is usually extracted from an enterprise data warehouse into a
data mining database. However, a data warehouse is not a requirement for data mining.
Setting up a large data warehouse can be time-consuming and cost millions of dollars,
because it has to consolidate data from multiple sources, resolve data integrity
problems, and load the data into a query database. Companies may therefore prefer to
mine data from one or more operational or transactional databases, from which the
data can easily be extracted into a read-only database.
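As a rough sketch of extracting operational data into a read-only mining database, the example below uses Python's built-in `sqlite3` module as a stand-in for a real operational system. The file names, the `orders` table, and its contents are all assumptions made for illustration.

```python
import sqlite3
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())
operational_path = workdir / "operational.db"  # hypothetical file names
mining_path = workdir / "mining.db"

# Stand-in for an operational database with one transactional table.
operational = sqlite3.connect(operational_path)
operational.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
operational.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 20.0), ("bob", 35.5), ("alice", 12.5)],
)
operational.commit()

# Extract: copy a snapshot of the operational data into a separate database.
snapshot = sqlite3.connect(mining_path)
operational.backup(snapshot)
snapshot.close()
operational.close()

# Open the snapshot read-only so that mining queries cannot modify it.
mining = sqlite3.connect(f"{mining_path.as_uri()}?mode=ro", uri=True)
totals = dict(mining.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
))
print(totals)

try:
    mining.execute("DELETE FROM orders")
    write_blocked = False
except sqlite3.OperationalError:
    write_blocked = True  # the read-only connection rejects the write
mining.close()
print(write_blocked)
```

Working on a copy keeps analytical queries from slowing down or altering the operational system, which is the main reason for the separate read-only database.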
Data mining and OLAP
One of the most common questions from data processing professionals is about the
difference between data mining and On-Line Analytical Processing (OLAP). They
are different tools, but they can complement each other.
Traditional query and report tools describe what is in a database. OLAP goes further:
it is used to analyze why certain things are true. With OLAP, an analyst generates
a series of hypothetical patterns and relationships and then uses queries against the
database to verify or disprove them. For example, an analyst might want to
determine the factors that lead to loan defaults. He or she might initially hypothesize
that people with low incomes are bad credit risks and analyze the database with
OLAP to verify or disprove this assumption. If the data does not support this
hypothesis, the analyst might look at high debt as the determinant of risk. If that
hypothesis is not supported by the data either, the analyst can try debt and income
together as the best predictor of bad credit risks.
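The hypothesis-driven workflow described above amounts to running one query per hypothesis and comparing default rates. The following sketch imitates that with plain Python over invented loan records; `default_rate` is a hypothetical helper, not an OLAP API.

```python
# Hypothetical loan records, invented for illustration only.
loans = (
    [{"income": "low", "debt": "high", "defaulted": True}] * 3
    + [{"income": "low", "debt": "low", "defaulted": False}] * 3
    + [{"income": "high", "debt": "high", "defaulted": False}] * 2
    + [{"income": "high", "debt": "low", "defaulted": False}] * 2
)

def default_rate(records, **conditions):
    """Share of defaults among records matching every attribute=value condition."""
    matching = [r for r in records
                if all(r[key] == value for key, value in conditions.items())]
    if not matching:
        return None  # no records satisfy the conditions
    return sum(r["defaulted"] for r in matching) / len(matching)

# Each line tests one hypothesis chosen by the analyst.
overall = default_rate(loans)
low_income = default_rate(loans, income="low")
high_debt = default_rate(loans, debt="high")
both = default_rate(loans, income="low", debt="high")

print(overall, low_income, high_debt, both)
```

The point of the sketch is that the analyst decides which conditions to test. A factor that nobody thinks to query is never examined.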
Data mining differs from OLAP because, rather than verifying hypothetical patterns,
it uses the data itself to find hidden patterns. For example, suppose an analyst who
wants to identify the risk factors for loan default uses a data mining tool. The tool
might discover that people with high debt and low incomes are bad credit risks, which
could also be found using OLAP. But it can also discover a pattern the analyst did
not think to try, such as age also being a determinant of risk.
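One simple way to let the data suggest risk factors is to score every attribute automatically instead of testing hand-picked hypotheses. The sketch below ranks attributes by information gain, a measure commonly used in decision-tree learning. It is an assumed choice of technique for illustration, not the method of any specific tool, and the loan records are invented.

```python
from collections import Counter
from math import log2

# Hypothetical loan records, invented for illustration only.
loans = [
    {"income": "low",  "debt": "high", "age": "young", "defaulted": True},
    {"income": "low",  "debt": "high", "age": "young", "defaulted": True},
    {"income": "low",  "debt": "low",  "age": "young", "defaulted": True},
    {"income": "low",  "debt": "low",  "age": "old",   "defaulted": False},
    {"income": "high", "debt": "high", "age": "old",   "defaulted": False},
    {"income": "high", "debt": "high", "age": "young", "defaulted": True},
    {"income": "high", "debt": "low",  "age": "old",   "defaulted": False},
    {"income": "high", "debt": "low",  "age": "young", "defaulted": False},
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(records, attribute, target="defaulted"):
    """Reduction in entropy of `target` after splitting the records on `attribute`."""
    base = entropy([r[target] for r in records])
    remainder = 0.0
    for value in {r[attribute] for r in records}:
        subset = [r[target] for r in records if r[attribute] == value]
        remainder += len(subset) / len(records) * entropy(subset)
    return base - remainder

def rank_attributes(records, target="defaulted"):
    """Score every non-target attribute and sort from most to least informative."""
    attributes = [a for a in records[0] if a != target]
    scored = [(a, information_gain(records, a, target)) for a in attributes]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = rank_attributes(loans)
for name, gain in ranking:
    print(f"{name}: {gain:.3f}")
```

In this made-up data, age ranks first even though no one asked about it, which mirrors the situation described above.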
Furthermore, OLAP is complementary to data mining in the early stages of the knowledge
discovery process, because it helps the user explore the data, for instance by
focusing attention on important variables, identifying exceptions, or finding
interactions. This matters because the better the data is understood, the more
effective the knowledge discovery process will be.
Data mining and hardware trends
A key enabler of data mining is the major progress in hardware price and performance.
The drop in the price of computer disk storage in recent years has radically
changed the collection and storage of massive amounts of data. Prices have dropped
not only for disk storage but also for CPUs: each generation of chips greatly
increases CPU power while lowering the cost. The same trend is reflected in the price
of RAM. Today, most PCs have 64 megabytes or more of RAM, and workstations have
256 megabytes or more, while servers with gigabytes are normal. Besides the great
increase in the power of the individual CPU, servers can support multiple CPUs
using symmetric multiprocessing, which allows hundreds of CPUs to work on
finding patterns in the data.
Data mining applications
Data mining is increasingly popular because of the substantial contribution it can
make. It can be used to control costs as well as contribute to revenue increases.
Many organizations are using data mining to help manage all phases of the customer
life cycle, including acquiring new customers, increasing revenue from existing
customers, and retaining good customers.
Data mining offers value across a broad spectrum of industries. Telecommunications
and credit card companies are two of the leaders in applying data mining to detect
fraudulent use of their services. Insurance companies and stock exchanges are also
interested in applying this technology to reduce fraud. Medical services use data
mining to predict the effectiveness of surgical procedures, medical tests or
medications. Companies active in the financial markets use data mining to determine
market and industry characteristics, as well as to predict individual company and
stock performance. Retailers are making more use of data mining to decide which
products to stock in particular stores, as well as to assess the effectiveness of
promotions and coupons. Pharmaceutical firms are mining large databases of
chemical compounds and of genetic material to discover substances that might be
candidates for the treatment of disease.
Helpful Hints
• The relationship between the people who have domain knowledge and those who are
data mining experts is a critical success factor. A data mining expert without domain
knowledge and a business expert who knows nothing about models and data are both
of little use on their own. The data mining expert and the people who understand the
business must work closely together.
• Data mining cannot find patterns of interest unless the tool is told what to look for.
For example, to find ways of improving mailing list impact, the search would be
directed to customers who have bought large amounts previously or who have
responded frequently to previous offers.
• Data mining can be useful in many areas, not just those in which it has previously
been successful. However, a cost/benefit analysis needs to be done. If the cost of
mining is larger than the anticipated return, it should not be undertaken.
• The techniques being applied in data mining are generally not new. They have been
used for many years, but not in a business context. With the large amounts of data now
available in data warehouses and with new user-friendly software and interfaces, they
have become accessible. Data mining no longer needs to be an extremely complex
process.
• Data mining can be undertaken with relatively small databases, and large databases
can be sampled. It is possible to have too much data and lose sight of the information
it contains. For example, if both age and date of birth are used, the two factors are
judged equally relevant and each is assigned a lower weight, even though they carry
the same information.
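The age and date-of-birth hint can be checked mechanically before mining by looking for pairs of variables that are almost perfectly correlated. The sketch below is one possible way to do that with a Pearson correlation. The data, the reference year, the 0.95 threshold, and the helper names are all assumptions chosen for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def redundant_pairs(columns, threshold=0.95):
    """Return pairs of column names whose absolute correlation reaches the threshold."""
    names = list(columns)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if abs(pearson(columns[first], columns[second])) >= threshold:
                pairs.append((first, second))
    return pairs

REFERENCE_YEAR = 2001  # assumed year in which the ages were recorded
birth_years = [1950, 1962, 1975, 1980, 1968]  # hypothetical customers
columns = {
    "age": [REFERENCE_YEAR - year for year in birth_years],
    "birth_year": birth_years,
    "spend": [120, 340, 90, 410, 275],
}

print(redundant_pairs(columns))
```

Once a redundant pair is found, one of the two variables can be dropped so that the remaining one keeps its full weight.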
Current Limits to Data Mining
• Dealing with very large (e.g., terabyte) databases
• Dealing with high dimensionality, which increases the size of the search space and
may also create spurious patterns
• Overfitting the available data
• Rapid changes in data (non-stationarity) that make previously discovered patterns
invalid
• Missing and noisy data
• Reducing the emphasis on fully automated, rapid-response environments and
increasing human/computer interaction
• Lack of full understanding of the patterns observed
• Managing changes in data and knowledge as available information is updated
• Dealing with non-numerical data, such as objects, text, and multimedia, more and
more of which is being stored in databases
• Lack of integration with other systems
All of these problems are areas of current research and are not yet fully solved.
Despite these difficulties, data mining offers an important approach to
achieving value from the data warehouse for use in decision support.
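Overfitting, one of the limits listed above, is easy to demonstrate with a held-out test set. In the sketch below the labels are pure noise, so there is nothing real to learn. The `Memorizer` class is a deliberately bad, hypothetical model that simply remembers its training records, and the data, seed, and split sizes are arbitrary assumptions.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical data: a unique customer id and a label that is pure noise.
records = [(customer_id, rng.random() < 0.5) for customer_id in range(400)]
rng.shuffle(records)
train, test = records[:200], records[200:]

class Memorizer:
    """Overfitted 'model': recalls each training id exactly, else predicts the majority label."""

    def fit(self, data):
        self.table = dict(data)
        labels = [label for _, label in data]
        self.default = labels.count(True) >= labels.count(False)
        return self

    def predict(self, key):
        return self.table.get(key, self.default)

def accuracy(model, data):
    """Fraction of records whose label the model predicts correctly."""
    return sum(model.predict(key) == label for key, label in data) / len(data)

model = Memorizer().fit(train)
train_accuracy = accuracy(model, train)  # perfect, because every training id is memorized
test_accuracy = accuracy(model, test)    # unseen ids, so only the majority guess is left

print(train_accuracy, test_accuracy)
```

The gap between the two figures is the warning sign: a pattern that fits the available data perfectly but fails on new data has not captured anything useful.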
Hand, D., Mannila, H., & Smyth, P. (2001). Principles of Data Mining. The MIT Press:
Cambridge, Massachusetts.