Data Mining Handout

Copied from Adriaans, P. & Zantinge, D. (1996). Data Mining. Addison-Wesley, and
Dunham, M. H. (2003). Data Mining: Introductory and Advanced Topics. Prentice
Hall.
Data mining (DM) deals with the discovery of hidden knowledge, unexpected
patterns and new rules from large databases.
Knowledge discovery in databases (KDD) is the non-trivial extraction of implicit,
previously unknown, and potentially useful knowledge from a database.
DM and KDD are often used interchangeably. However, lately KDD has been used to
refer to a process consisting of many steps, of which DM is only one. In
principle, KDD consists of six stages:
1. Data selection
2. Cleaning
3. Enrichment
4. Coding
5. Data mining
6. Reporting
As much as 80% of KDD is about preparing the data; the remaining 20% is about mining.
We will see these steps with an example below.
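As a rough sketch, the six stages can be thought of as a chain of functions, each taking the previous stage's output. The toy records and the stage bodies below are illustrative assumptions, not a real system:

```python
# A minimal sketch of the six KDD stages chained as plain functions.
# Records, field names, and stage bodies are made up for illustration.

def select(db):                      # 1. data selection: keep the columns of interest
    return [{"client": r["client"], "magazine": r["magazine"]} for r in db]

def clean(rows):                     # 2. cleaning: drop exact duplicate records
    seen, out = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

def enrich(rows):                    # 3. enrichment: add purchased info (dummy income)
    return [dict(r, income=20.0) for r in rows]

def code(rows):                      # 4. coding: recode magazine type as a binary flag
    return [dict(r, car=1 if r["magazine"] == "car" else 0) for r in rows]

def mine(rows):                      # 5. data mining: a trivial "pattern"
    return sum(r["car"] for r in rows) / len(rows)

def report(pattern):                 # 6. reporting: present the result
    return f"{pattern:.0%} of clients read the car magazine"

db = [
    {"client": 23003, "magazine": "car",   "address": "1 Downing Street"},
    {"client": 23003, "magazine": "car",   "address": "1 Downing Street"},  # duplicate
    {"client": 23009, "magazine": "comic", "address": "2 Boulevard"},
]
print(report(mine(code(enrich(clean(select(db)))))))  # 50% of clients read the car magazine
```

Each stub can later be replaced by a real implementation without changing the shape of the pipeline.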
SOME APPLICATION DOMAINS of DM
 Banking Sector (customer profiling, customer scoring, loan payment)
 Biomedical data analysis (detection of differentially expressed genes)
 Financial data analysis (early warning systems, detection of money laundering and
other financial crimes)
 Insurance services
 Security services (Crime and criminal detection)
 …
Example: Suppose that we would like to work on a database of a magazine publisher.
The publisher sells five types of magazine – on cars, houses, sports, music and
comics. Say, we are interested in questions such as “What is the typical profile of a
reader of a car magazine?” and “Is there any correlation between an interest in cars
and in comics?”.
Data selection: we start with the choice of a database. The records consist of: client
number, name, address, date of subscription, and type of magazine. A part of the
database is given in the following table:
Client number   Name      Address            Purchase date   Type of magazine purchased
23003           Johnson   1 Downing Street   04-15-94        Car
23003           Johnson   1 Downing Street   06-21-93        Music
23003           Johnson   1 Downing Street   05-30-92        Comic
23009           Clinton   2 Boulevard        01-01-10        Comic
23019           Jonson    1 Downing Street   03-30-95        House
Cleaning: there might be problems in the data, such as duplicated records,
out-of-range values, etc. At this stage, we simply clean the data. For example,
there is both a Johnson and a Jonson in the database with the same address.
These are probably the same person, and the typo can be corrected.
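One cleaning check can be sketched as follows: flag pairs of records whose names are nearly identical and whose addresses match as probable duplicates. The similarity threshold of 0.8 is an arbitrary illustrative choice:

```python
# A small sketch of duplicate detection for the cleaning stage.
# Records follow the handout's example; the 0.8 threshold is arbitrary.
from difflib import SequenceMatcher

records = [
    {"client": 23003, "name": "Johnson", "address": "1 Downing Street"},
    {"client": 23019, "name": "Jonson",  "address": "1 Downing Street"},
    {"client": 23009, "name": "Clinton", "address": "2 Boulevard"},
]

def probable_duplicates(rows, threshold=0.8):
    pairs = []
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            a, b = rows[i], rows[j]
            # similarity of the two names, between 0 and 1
            sim = SequenceMatcher(None, a["name"], b["name"]).ratio()
            if sim >= threshold and a["address"] == b["address"]:
                pairs.append((a["client"], b["client"]))
    return pairs

print(probable_duplicates(records))  # [(23003, 23019)]
```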
Enrichment: if additional information on the customers can be "purchased", it
can be added. In this example, we can add information on date of birth, income,
and ownership of a car or a house:
Client name   Date of birth   Income    Car owner   House owner
Johnson       04-13-76        $18,500   No          No
Clinton       10-20-71        $36,000   Yes         No
Coding: this is similar to data reconstruction.
We can add the enriched data as columns to the original data and remove any
unnecessary columns. For example, the names of clients might be unnecessary and
can be removed. We can deal with missing cases. Some recoding, or the creation
of new variables, might be necessary. For example, we need to convert the "car
owner" information from yes/no to 1/0; similarly, "house owner" should be
converted. We can divide income by 1000 to simplify. Date of birth and address
might be too detailed; one can convert them to simpler numbers that might reveal
a pattern. For example, birth year can be converted to age, and address can be
converted to region by use of the postal code, if available. Instead of having
one attribute "magazine" with five different possible values, we can create five
binary attributes, one for every magazine. If the value of the variable is 1,
the reader is a subscriber; otherwise it is 0. This is called "flattening". Note
that this is different from creating dummy variables. Why is flattening
preferred to dummy variables here? Think! The final version of the data looks
like the one in the following table:

Client no.   Age   Income   Car owner   House owner   Region   Car m.   House m.   Sports m.   Music m.   Comic m.
23003        32    18.5     0           0             1        1        0          0           1          1
23009        37    36       1           0             1        0        0          0           0          1
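The coding step can be sketched as one function: yes/no fields become 1/0, income is scaled to thousands, birth year becomes age, and the magazine subscriptions are flattened into five binary columns. The reference year is chosen so that the resulting ages match the coded table above; the field names are illustrative:

```python
# A sketch of the coding step from the handout's example.
# current_year=2008 is an assumption chosen so ages match the coded table.

MAGAZINES = ["car", "house", "sports", "music", "comic"]

def code_record(rec, current_year=2008):
    coded = {
        "client": rec["client"],
        "age": current_year - rec["birth_year"],      # birth year -> age
        "income": rec["income"] / 1000,               # simplify: thousands
        "car_owner": 1 if rec["car_owner"] == "yes" else 0,
        "house_owner": 1 if rec["house_owner"] == "yes" else 0,
    }
    for m in MAGAZINES:                               # flattening
        coded[m + "_magazine"] = 1 if m in rec["magazines"] else 0
    return coded

johnson = {"client": 23003, "birth_year": 1976, "income": 18500,
           "car_owner": "no", "house_owner": "no",
           "magazines": {"car", "music", "comic"}}
print(code_record(johnson))
```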
Data mining: any technique that helps extract more out of your data is useful,
so DM methods form quite a heterogeneous group, such as:
 Visualization
 Classification
 Linear and logistic regression, time series analysis
 Prediction
 Clustering
 Summarization
 Decision trees
 Association rules
 Neural networks
 Genetic algorithms
Before introducing these algorithms a little bit, we need to define supervised versus
unsupervised algorithms.
Supervised versus unsupervised learning / algorithms: supervised algorithms
learn from training examples whose outcomes (class labels or target values) are
already known, and use them to predict the outcomes of new cases. Unsupervised
algorithms are given no labeled outcomes; they search for structure in the data
on their own.

Supervised
 k-nearest neighbor
 Regression models
 Decision trees
 Neural networks

Unsupervised
 k-means clustering
 Hierarchical clustering
 Self-organizing maps (SOM)
Before we can apply more advanced pattern analysis algorithms, we need to know
some basic aspects and structures of the data set. A good way to start is to extract
some simple statistical information, for example summary statistics.
                    Average
Age                 46.9
Income              20.8
Credit              34.9
Car owner           0.59
House owner         0.59
Car magazine        0.329
House magazine      0.702
Sports magazine     0.447
Music magazine      0.146
Comic magazine      0.081

(1000 subjects in the data set.)
Averages per magazine type:

Magazine   Age    Credit   Income   Car owner   House owner
Car        29.3   27.3     17.1     0.48        0.53
House      48.1   35.5     21.1     0.58        0.76
Sports     42.2   31.4     24.3     0.7         0.6
Music      24.6   24.6     12.8     0.3         0.45
Comic      21.4   26.3     25.5     0.62        0.6
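Conditional averages like those in the table above can be computed by grouping the coded records on each magazine flag. The three toy records below are made up:

```python
# Group-wise averages over records that carry a given magazine flag.
# The three records and field names are illustrative.

def averages_by_magazine(rows, magazine, fields=("age", "income")):
    group = [r for r in rows if r[magazine] == 1]
    return {f: sum(r[f] for r in group) / len(group) for f in fields}

rows = [
    {"age": 32, "income": 18.5, "car": 1},
    {"age": 28, "income": 21.5, "car": 1},
    {"age": 50, "income": 40.0, "car": 0},
]
print(averages_by_magazine(rows, "car"))  # {'age': 30.0, 'income': 20.0}
```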
Graphs, such as histograms, scatter plots, interactive ones, would be very useful.
k-nearest neighbor: records of the same type will be close to each other in the data
space; they will be living in each other’s neighborhood. If we want to predict the
behavior of a certain individual, we start to look at the behavior of, for example, ten
individuals that are close to him in the data space. We calculate an average of the
behavior of these ten individuals, and this average will be the prediction for our
individual. The letter k in k-nearest stands for the number of neighbors we investigate.
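A minimal k-nearest-neighbor sketch: the predicted interest in the car magazine is the average over the k closest clients in (age, income) space. The training records and the choice of Euclidean distance are illustrative assumptions:

```python
# k-nearest-neighbor prediction: average the labels of the k closest points.
# Training pairs ((age, income), car_magazine) are made up for illustration.
import math

def knn_predict(train, query, k=3):
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])  # Euclidean distance
    nearest = sorted(train, key=lambda t: dist(t[0], query))[:k]
    return sum(label for _, label in nearest) / k

train = [((25, 17), 1), ((29, 18), 1), ((31, 20), 1),
         ((48, 35), 0), ((52, 30), 0), ((55, 38), 0)]
print(knn_predict(train, (27, 19)))  # 1.0 -> a likely car-magazine reader
```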
Decision trees: our database contains attributes such as age, income, and credit.
If we want to predict a certain kind of customer, say one who will buy the car
magazine, what would help us more, the age or the income of a person? It could be
that age is more important. If so, the next step is to split on this attribute;
that is, we must investigate whether there is a certain age threshold that
separates buyers of the car magazine from non-buyers. In this way, we start with
the first attribute, find a threshold, go on to the next attribute, find a
threshold there, and repeat this process until we have correctly classified our
customers.
A simple decision tree for the car magazine (the percentage at each leaf is the
share of clients in that branch who do not subscribe):

    AGE > 44.5   ->  99%
    AGE <= 44.5  ->  38%

Above the age of 44.5 only 1% of the people subscribe to the car magazine, while
below it 62% of the people subscribe.
A more detailed tree (the percentage at each leaf is the share of clients in
that branch who do not subscribe):

    AGE > 44.5
        AGE > 48.5          -> 100%
        AGE <= 48.5         ->  92%
    AGE <= 44.5
        INCOME > 34.5       -> 100%
        INCOME <= 34.5
            AGE > 31.5      ->  46%
            AGE <= 31.5     ->   0%

People with an income under 34.5 and an age under 31.5 are very likely to be
interested in the car magazine.
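The threshold search described above can be sketched as follows: try every candidate age and keep the one that best separates subscribers from non-subscribers, scored here by simple accuracy (real tree algorithms use measures such as information gain). The toy data are made up:

```python
# Exhaustive search for the best split threshold on a single attribute.
# Ages and labels are invented; 1 = subscribes to the car magazine.

def best_threshold(ages, labels):
    best = (0.0, None)
    for t in sorted(set(ages)):
        # rule under test: predict "subscriber" when age <= t
        correct = sum((a <= t) == bool(y) for a, y in zip(ages, labels))
        acc = correct / len(ages)
        if acc > best[0]:
            best = (acc, t)
    return best  # (accuracy, threshold)

ages   = [21, 24, 29, 31, 42, 48, 55, 60]
labels = [1,  1,  1,  1,  0,  0,  0,  0]
print(best_threshold(ages, labels))  # (1.0, 31)
```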
Classification: maps data into predefined groups or classes. The aim is to find
a classification rule, based on the measurements of a training sample, that can
be used for future samples. It is often referred to as a supervised algorithm,
since the classes are determined before examining the data. Examples:
1. Classifying applicants for bank credit as risky or not risky.
2. An airport security screening station is used to determine whether passengers
are potential terrorists or criminals. To do this, the face of each passenger is
scanned and its basic pattern (distance between the eyes, size and shape of the
mouth, shape of the head, etc.) is identified. This pattern is compared to
entries in a database to see if it matches any patterns associated with known
offenders.
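A toy rule-based classifier in this spirit, with invented thresholds, maps each credit applicant into one of the two predefined classes:

```python
# A toy classifier for the bank-credit example; income is in thousands
# and both thresholds are invented for illustration.

def classify(applicant):
    if applicant["income"] < 20 or applicant["late_payments"] > 2:
        return "risky"
    return "safe"

print(classify({"income": 18.5, "late_payments": 0}))  # risky
print(classify({"income": 36.0, "late_payments": 1}))  # safe
```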
Clustering: is similar to classification, except that the groups are not
predefined. Therefore, clustering is referred to as unsupervised learning.
Clustering is usually accomplished by measuring the similarity among the data on
predefined attributes; the most similar data are grouped into clusters.
Example: A certain national department store chain creates special catalogs targeted
to various demographic groups based on attributes such as income, location, and
physical characteristics of potential customers (age, height, weight, etc.). The results
of clustering are used by management to create new special catalogs and distribute
them to the correct target population.
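A minimal sketch of one common clustering algorithm, k-means: points are assigned to the nearest of k centres, the centres are recomputed as cluster means, and the two steps repeat. The one-dimensional income values and the starting centres are illustrative:

```python
# A minimal 1-D k-means sketch. With well-separated starting centres
# (an assumption here), no cluster ever becomes empty.

def kmeans(points, centres, steps=10):
    for _ in range(steps):
        clusters = [[] for _ in centres]
        for p in points:
            # assign each point to its nearest centre
            i = min(range(len(centres)), key=lambda i: abs(p - centres[i]))
            clusters[i].append(p)
        # recompute each centre as the mean of its cluster
        centres = [sum(c) / len(c) for c in clusters if c]
    return centres, clusters

incomes = [12, 14, 13, 40, 42, 38]
centres, clusters = kmeans(incomes, centres=[12, 40])
print(centres)  # [13.0, 40.0]
```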
Association Rules: Link analysis, alternatively referred to as affinity analysis or
association, refers to the data mining task of uncovering relationships among data.
The best example of this type of application is to determine association rules. An
association rule is a model that identifies specific types of data associations.
Example: A grocery store retailer is trying to decide whether to put bread on sale. To
help determine the impact of this decision, the retailer generates association rules that
show what other products are frequently sold with bread. He finds that pretzels
are also sold 60% of the times that bread is sold, and that jelly is also sold
70% of the time.
Based on these facts, he decides to place some pretzels and jelly at the end of the aisle
where the bread is placed. In addition, he decides not to place either of these items on
sale at the same time.
Be aware that association rules are not causal relationships. There is probably
no causal relation that makes bread and pretzels be purchased together, and
there is no guarantee that this association will hold in the future. However,
these rules can help the manager with effective advertising, marketing, and
inventory control.
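The two numbers in the retailer's example are rule confidences, and they can be counted directly from market baskets. The five toy baskets below are made up to reproduce similar figures:

```python
# Confidence of an association rule lhs -> rhs: among baskets containing
# lhs, the fraction that also contain rhs. The baskets are invented.

def confidence(baskets, lhs, rhs):
    with_lhs = [b for b in baskets if lhs in b]
    both = [b for b in with_lhs if rhs in b]
    return len(both) / len(with_lhs)

baskets = [
    {"bread", "pretzels", "jelly"},
    {"bread", "jelly"},
    {"bread", "pretzels", "milk"},
    {"bread", "jelly", "milk"},
    {"milk", "eggs"},
]
print(confidence(baskets, "bread", "pretzels"))  # 0.5
print(confidence(baskets, "bread", "jelly"))     # 0.75
```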
Neural networks: There are several different forms of neural network.
Networks do not provide a rule to identify the association. They just show there is a
connection. Also, although they do learn, they do not provide us with a theory about
what they have learned. They are simply black boxes that give answers but provide no
clear idea as to how they arrived at these answers.
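The simplest form of neural network is a single neuron (perceptron). The sketch below learns the OR function; the learned weights are just numbers, which illustrates the "black box" point: they encode the answer but not a human-readable rule. The learning rate and epoch count are arbitrary choices:

```python
# A single perceptron trained with the classic error-correction rule.
# Training data is the OR function; lr and epochs are arbitrary.

def train_perceptron(samples, epochs=10, lr=0.1):
    w0, w1, b = 0.0, 0.0, 0.0
    for _ in range(epochs):
        for (x0, x1), y in samples:
            pred = 1 if w0 * x0 + w1 * x1 + b > 0 else 0
            err = y - pred                 # update only when wrong
            w0 += lr * err * x0
            w1 += lr * err * x1
            b += lr * err
    return w0, w1, b

samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w0, w1, b = train_perceptron(samples)
outputs = [1 if w0 * x0 + w1 * x1 + b > 0 else 0 for (x0, x1), _ in samples]
print(outputs)  # [0, 1, 1, 1]
```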
Genetic algorithms: A genetic algorithm is a search technique used in computing to
find exact or approximate solutions to optimization and search problems.
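A minimal genetic-algorithm sketch on the classic OneMax toy problem: bit strings evolve by selection, one-point crossover, and mutation toward the all-ones string. The population size, rates, and random seed are arbitrary choices:

```python
# A minimal genetic algorithm for OneMax (maximize the number of 1 bits).
# Population size, mutation rate, and seed are arbitrary choices.
import random

random.seed(0)

def onemax(bits):
    return sum(bits)

def evolve(n_bits=10, pop_size=20, generations=40, mut_rate=0.05):
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=onemax, reverse=True)
        parents = pop[: pop_size // 2]            # selection: keep the fitter half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n_bits)     # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (random.random() < mut_rate) for bit in child]  # mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=onemax)

best = evolve()
print(onemax(best))  # typically reaches the all-ones optimum of 10
```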
Fuzzy logic: Fuzzy logic is a form of multi-valued logic derived from fuzzy set
theory to deal with reasoning that is approximate rather than precise. Just as in fuzzy
set theory the set membership values can range (inclusively) between 0 and 1, in
fuzzy logic the degree of truth of a statement can range between 0 and 1 and is not
constrained to the two truth values {true, false} of classical logic.
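As a small illustration, a fuzzy membership function for "young" assigns a degree between 0 and 1 rather than a yes/no class; the breakpoints 25 and 45 are invented:

```python
# A fuzzy membership function: degree of "young" as a function of age.
# The breakpoints 25 and 45 are arbitrary illustrative choices.

def young(age):
    if age <= 25:
        return 1.0
    if age >= 45:
        return 0.0
    return (45 - age) / 20   # linear ramp from full to zero membership

print(young(20))  # 1.0
print(young(35))  # 0.5
print(young(50))  # 0.0
```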
Additional references:
 www.kdnuggets.com
 Hastie, T., Tibshirani, R. & Friedman, J. (2001). The Elements of Statistical
Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics.
Springer.