Data Mining and Decision Trees
Prof. Sin-Min Lee
Department of Computer Science
Evolution of Database Technology
• 1960s:
– Data collection, database creation, IMS and network DBMS
• 1970s:
– Relational data model, relational DBMS implementation
• 1980s:
– RDBMS, advanced data models (extended-relational, OO,
deductive, etc.) and application-oriented DBMS (spatial,
scientific, engineering, etc.)
• 1990s—2000s:
– Data mining and data warehousing, multimedia databases,
and Web databases
What Is Data Mining?
• Data mining (knowledge discovery in
databases):
– Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful)
information or patterns from data in large databases
• Alternative names and their “inside stories”:
– Data mining: a misnomer?
– Knowledge discovery(mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting,
business intelligence, etc.
• What is not data mining?
– (Deductive) query processing.
– Expert systems or small ML/statistical programs
Why Data Mining? — Potential
Applications
• Database analysis and decision support
– Market analysis and management
• target marketing, customer relation management, market basket
analysis, cross selling, market segmentation
– Risk analysis and management
• Forecasting, customer retention, improved underwriting, quality
control, competitive analysis
– Fraud detection and management
• Other Applications
– Text mining (news group, email, documents) and Web analysis.
– Intelligent query answering
Market Analysis and Management (1)
• Where are the data sources for analysis?
– Credit card transactions, loyalty cards, discount coupons,
customer complaint calls, plus (public) lifestyle studies
• Target marketing
– Find clusters of “model” customers who share the same
characteristics: interest, income level, spending habits, etc.
• Determine customer purchasing patterns over time
– Conversion of single to a joint bank account: marriage, etc.
• Cross-market analysis
– Associations/co-relations between product sales
– Prediction based on the association information
Market Analysis and Management (2)
• Customer profiling
– data mining can tell you what types of customers buy what products
(clustering or classification)
• Identifying customer requirements
– identifying the best products for different customers
– use prediction to find what factors will attract new customers
• Provides summary information
– various multidimensional summary reports
– statistical summary information (data central tendency and variation)
Corporate Analysis and Risk
Management
• Finance planning and asset evaluation
– cash flow analysis and prediction
– contingent claim analysis to evaluate assets
– cross-sectional and time series analysis (financial-ratio,
trend analysis, etc.)
• Resource planning:
– summarize and compare the resources and spending
• Competition:
– monitor competitors and market directions
– group customers into classes and set up a class-based pricing
procedure
– set pricing strategy in a highly competitive market
Fraud Detection and Management (1)
• Applications
– widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
• Approach
– use historical data to build models of fraudulent behavior
and use data mining to help identify similar instances
• Examples
– auto insurance: detect a group of people who stage
accidents to collect on insurance
– money laundering: detect suspicious money transactions
(US Treasury's Financial Crimes Enforcement Network)
– medical insurance: detect professional patients, rings of
doctors, and rings of references
Fraud Detection and Management (2)
• Detecting inappropriate medical treatment
– Australian Health Insurance Commission identified that in
many cases blanket screening tests were requested (saving
Australian $1m/yr).
• Detecting telephone fraud
– Telephone call model: destination of the call, duration, time
of day or week. Analyze patterns that deviate from an
expected norm.
– British Telecom identified discrete groups of callers with
frequent intra-group calls, especially mobile phones, and
broke a multimillion dollar fraud.
• Retail
– Analysts estimate that 38% of retail shrink is due to
dishonest employees.
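The telephone-fraud idea of analyzing patterns that deviate from an expected norm can be illustrated with a very small sketch. It assumes the "norm" is the mean call duration of one customer and that a simple z-score threshold is good enough; the durations, the threshold of 3, and the name `flag_unusual` are hypothetical, and real systems would also model destination and time of day.

```python
from statistics import mean, stdev

def flag_unusual(history, new_calls, threshold=3.0):
    """Return the new call durations that lie more than `threshold`
    sample standard deviations away from the historical mean."""
    mu = mean(history)
    sigma = stdev(history)
    return [d for d in new_calls if abs(d - mu) / sigma > threshold]

# Hypothetical call durations in minutes for one customer.
HISTORY = [3, 5, 4, 6, 5, 4, 3, 5]

suspicious = flag_unusual(HISTORY, [4, 95, 6])
print(suspicious)
```

Only the 95-minute call is flagged here; the 4- and 6-minute calls stay within the customer's usual range.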
Other Applications
• Sports
– IBM Advanced Scout analyzed NBA game statistics
(shots blocked, assists, and fouls) to gain competitive
advantage for New York Knicks and Miami Heat
• Astronomy
– JPL and the Palomar Observatory discovered 22 quasars
with the help of data mining
• Internet Web Surf-Aid
– IBM Surf-Aid applies data mining algorithms to Web
access logs for market-related pages to discover customer
preference and behavior pages, analyzing effectiveness of
Web marketing, improving Web site organization, etc.
Data Mining: A KDD Process
– Data mining: the core of the knowledge discovery process.
[Figure: KDD pipeline, from bottom to top: Databases, Data Cleaning and Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation]
Steps of a KDD Process
• Learning the application domain:
– relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing: (may take 60% of effort!)
• Data reduction and transformation:
– Find useful features, dimensionality/variable reduction, invariant
representation.
• Choosing functions of data mining
– summarization, classification, regression, association, clustering.
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge
Area 1: Risk Analysis
• Insurance companies and banks use
data mining for risk analysis.
• An insurance company searches
its own insurants and claims
databases for relationships between
personal characteristics and claim
behavior.
Continued
• The company is especially
interested in the characteristics of
insurants with a highly deviating
claim behavior.
• With data mining, these so-called
risk-profiles can be discovered
and the company can use this
information to adapt its premium
policy.
Area 2: Direct Marketing
• Data mining can also be used to discover the
relationship between one’s personal
characteristics, e.g. age, gender, hometown,
and the probability that one will respond to a
mailing.
• Such relationships can be used to select those
customers from the mailing database that
have the highest probability of responding to
a mailing.
• This allows the company to mail its
prospects selectively, thus maximizing the
response.
• For example:
1. Company X sends a mailing to a number
of prospects.
2. The response is 2%.
What Data Mining can do
• Enables companies to determine
relationships among “internal” and
“external” factors.
• Predicts cross-sell opportunities and makes
recommendations.
• Segments markets and personalizes
communications.
• Predicts outcomes of future situations.
The Process of Data Mining
• There are 3 main steps in the Data Mining
process:
– Preparation: data is selected from the
warehouse and “cleansed”.
– Processing: algorithms are used to process the
data. This step uses modeling to make
predictions.
– Analysis: output is evaluated.
Reasons for growing popularity
• Growing data volume- enormous amounts of
existing and newly arriving data that require
processing.
• Limitations of Human Analysis- humans lack
objectivity when analyzing dependencies in
data.
• Low cost of Machine Learning- the data mining
process has a lower cost than hiring highly trained
professionals to analyze data.
Data Mining Techniques
• Association Rule- discovers interesting
associations between attributes that are
contained in a database.
• Clustering- finds appropriate groupings of
elements for a set of data.
• Sequential patterns- looks for patterns
where one event leads to another later
event.
• Classification- assigns items to known
classes based on patterns found in
already-labelled examples.
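As an illustration of the association-rule technique, the sketch below scores a rule with the two measures commonly used for this purpose, support and confidence. These measure names, the function names, and the shopping baskets are assumptions added for illustration; the slides do not define them.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets, one set of items per transaction.
BASKETS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

rule_support = support(BASKETS, {"bread", "butter"})
rule_confidence = confidence(BASKETS, {"bread"}, {"butter"})
print(rule_support, rule_confidence)
```

For the rule "bread implies butter", two of the four baskets contain both items, and two of the three baskets with bread also contain butter.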
Applications of Data Mining
• Data Mining is applied in the following areas:
– Prediction of the Stock Market: predicting the future
trends.
– Bankruptcy prediction: prediction based on computer
generated rules, using models
– Foreign Exchange Market: Data Mining is used to
identify trading rules.
– Fraud Detection: construction of algorithms and models
that will help recognize a variety of fraud patterns.
Results of Data Mining Include:
• Forecasting what may happen in the future
• Classifying people or things into groups by
recognizing patterns
• Clustering people or things into groups based on
their attributes
• Associating what events are likely to occur
together
• Sequencing what events are likely to lead to later
events
Data mining is not
•Brute-force crunching of bulk
data
•“Blind” application of
algorithms
•Going to find relationships where
none exist
•Presenting data in different ways
•A database intensive task
•A difficult to understand
technology requiring an advanced
degree in computer science
What data mining has done for...
The US Internal Revenue Service
needed to improve customer
service and...
Scheduled its workforce
to provide faster, more accurate
answers to questions.
What data mining has done for...
The US Drug Enforcement
Agency needed to be more
effective in their drug “busts”
and
analyzed suspects’ cell phone
usage to focus investigations.
What data mining has done for...
HSBC needed to cross-sell more
effectively by identifying profiles
that would be interested in higher
yielding investments and...
Reduced direct mail costs by 30%
while garnering 95% of the
campaign’s revenue.
Data Mining process model -DM
Search in State Spaces
Decision Trees
•A decision tree is a special case of a state-space
graph.
•It is a rooted tree in which each internal node
corresponds to a decision, with a subtree at these
nodes for each possible outcome of the decision.
•Decision trees can be used to model problems in
which a series of decisions leads to a solution.
•The possible solutions of the problem correspond
to the paths from the root to the leaves of the
decision tree.
Decision Trees
•Example: The n-queens problem
•How can we place n queens on an n×n chessboard so that no two
queens can capture each other?
A queen can move any
number of squares
horizontally, vertically, and
diagonally.
Here, the possible target
squares of the queen Q are
marked with an x.
[Figure: chessboard with a queen Q; every square in Q's row, column, and diagonals is marked with an x]
•Let us consider the 4-queens problem.
•Question: How many possible configurations of
4×4 chessboards containing 4 queens are there?
•Answer: There are 16!/(12!·4!) =
(13·14·15·16)/(2·3·4) = 13·7·5·4 = 1820 possible
configurations.
•Shall we simply try them out one by one until we
encounter a solution?
•No, it is generally useful to think about a search
problem more carefully and discover constraints
on the problem’s solutions.
•Such constraints can dramatically reduce the size
of the relevant state space.
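The count of 1820 configurations can be checked directly, and for a board this small the naive "try them all" approach is still feasible. The following is a minimal sketch of that brute-force enumeration; the helper names `attacks` and `is_solution` are illustrative and not from the slides.

```python
from itertools import combinations
from math import comb

N = 4
squares = [(row, col) for row in range(N) for col in range(N)]

def attacks(a, b):
    """True if queens on squares a and b share a row, column, or diagonal."""
    return (a[0] == b[0] or a[1] == b[1]
            or abs(a[0] - b[0]) == abs(a[1] - b[1]))

def is_solution(placement):
    """True if no two queens in the placement can capture each other."""
    return not any(attacks(a, b) for a, b in combinations(placement, 2))

total = comb(N * N, N)  # 16! / (12! * 4!)
solutions = [p for p in combinations(squares, N) if is_solution(p)]
print(total, len(solutions))
```

Only 2 of the 1820 configurations are solutions, which is why exploiting constraints pays off so quickly as n grows.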
Obviously, in any solution of the n-queens problem,
there must be exactly one queen in each column of
the board.
Otherwise, the two queens in the same column could
capture each other.
Therefore, we can describe the solution of this problem
as a sequence of n decisions:
Decision 1: Place a queen in the first column.
Decision 2: Place a queen in the second column.
.
.
.
Decision n: Place a queen in the n-th column.
Backtracking in Decision Trees
[Figure: decision tree for the 4-queens problem, starting from the empty board and branching through the levels "place 1st queen", "place 2nd queen", "place 3rd queen", and "place 4th queen"]
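The column-by-column sequence of decisions, combined with backtracking, can be sketched as a short recursive search. This is one possible formulation under the one-queen-per-column constraint, not necessarily the one used in the lecture; `solve_queens` and its helpers are hypothetical names.

```python
def solve_queens(n):
    """Return all solutions as tuples, where entry c is the row of the
    queen in column c (one decision per column)."""
    solutions = []
    rows = []  # rows[c] is the row chosen for column c so far

    def safe(row):
        col = len(rows)  # column of the queen we are about to place
        return all(r != row and abs(r - row) != col - c
                   for c, r in enumerate(rows))

    def place(col):
        if col == n:  # every column has a queen: a leaf of the tree
            solutions.append(tuple(rows))
            return
        for row in range(n):
            if safe(row):
                rows.append(row)  # take this decision
                place(col + 1)    # explore the subtree below it
                rows.pop()        # backtrack and try the next row

    place(0)
    return solutions

print(solve_queens(4))
```

Branches that already violate a constraint are never extended, so the search visits far fewer nodes than the 1820 unconstrained configurations.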
Neural Network
Many inputs and a single output
Trained on signal and background sample
Well understood and mostly accepted in HEP
Decision Tree
Many inputs and a single output
Trained on signal and background sample
Used mostly in life sciences & business
Decision tree
Basic Algorithm
• Initialize top node to all examples
• While impure leaves available
– select next impure leaf L
– find splitting attribute A with maximal information gain
– for each value of A add child to L
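The loop above can be sketched with an explicit worklist of leaves. This is a minimal illustration, assuming categorical attributes, entropy-based information gain, and a majority-vote label when no attributes remain (a stopping rule the slide does not state). The six-row dataset and all function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return sum(-c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Entropy of the labels minus the weighted entropy after the split."""
    n = len(rows)
    remainder = 0.0
    for value in {row[attr] for row in rows}:
        part = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(part) / n * entropy(part)
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs):
    # Initialize the top node to all examples.
    root = {"rows": rows, "labels": labels, "attrs": list(attrs)}
    pending = [root]  # leaves that may still be impure
    while pending:
        leaf = pending.pop()
        if len(set(leaf["labels"])) == 1 or not leaf["attrs"]:
            # Pure leaf, or nothing left to split on: use the majority label.
            leaf["label"] = Counter(leaf["labels"]).most_common(1)[0][0]
            continue
        # Find the splitting attribute with maximal information gain.
        best = max(leaf["attrs"],
                   key=lambda a: info_gain(leaf["rows"], leaf["labels"], a))
        leaf["split"], leaf["children"] = best, {}
        rest = [a for a in leaf["attrs"] if a != best]
        # Add one child per observed value of the chosen attribute.
        for value in {row[best] for row in leaf["rows"]}:
            pairs = [(r, l) for r, l in zip(leaf["rows"], leaf["labels"])
                     if r[best] == value]
            child = {"rows": [r for r, _ in pairs],
                     "labels": [l for _, l in pairs], "attrs": rest}
            leaf["children"][value] = child
            pending.append(child)
    return root

def classify(node, row):
    """Follow the splits from the root down to a labelled leaf."""
    while "label" not in node:
        node = node["children"][row[node["split"]]]
    return node["label"]

# Hypothetical toy data: two attributes, one class label per row.
ROWS = [
    {"outlook": "sunny", "windy": False}, {"outlook": "sunny", "windy": True},
    {"outlook": "overcast", "windy": False}, {"outlook": "overcast", "windy": True},
    {"outlook": "rainy", "windy": False}, {"outlook": "rainy", "windy": True},
]
LABELS = ["no", "no", "yes", "yes", "yes", "no"]

tree = build_tree(ROWS, LABELS, ["outlook", "windy"])
print(tree["split"], classify(tree, {"outlook": "rainy", "windy": True}))
```

On this toy data the root splits on outlook, and only the rainy branch needs a second split on windy.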
Decision tree
Find good split
• Sufficient statistics to compute info gain: count matrix

outlook   temperature  humidity  windy  play
sunny     hot          high      FALSE  no
sunny     hot          high      TRUE   no
overcast  hot          high      FALSE  yes
rainy     mild         high      FALSE  yes
rainy     cool         normal    FALSE  yes
rainy     cool         normal    TRUE   no
overcast  cool         normal    TRUE   yes
sunny     mild         high      FALSE  no
sunny     cool         normal    FALSE  yes
rainy     mild         normal    FALSE  yes
sunny     mild         normal    TRUE   yes
overcast  mild         high      TRUE   yes
overcast  hot          normal    FALSE  yes
rainy     mild         high      TRUE   no

outlook      play  don't play
sunny        2     3
overcast     4     0
rainy        3     2
gain: 0.25 bits

temperature  play  don't play
hot          2     2
mild         4     2
cool         3     1
gain: 0.03 bits

humidity     play  don't play
high         3     4
normal       6     1
gain: 0.15 bits

windy        play  don't play
FALSE        6     2
TRUE         3     3
gain: 0.05 bits
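Because the count matrix is a sufficient statistic, the information gain of each attribute can be computed from the counts alone, without revisiting the 14 rows. The sketch below does this for the count matrices on this slide; the function names are illustrative, and entropy in bits is assumed as the impurity measure.

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a class distribution given as raw counts."""
    total = sum(counts)
    return sum(-c / total * log2(c / total) for c in counts if c > 0)

def gain_from_counts(matrix):
    """Information gain of one attribute.

    `matrix` maps each attribute value to its (play, don't play) counts."""
    class_totals = [sum(column) for column in zip(*matrix.values())]
    n = sum(class_totals)
    remainder = sum(sum(row) / n * entropy(row) for row in matrix.values())
    return entropy(class_totals) - remainder

# (play, don't play) counts per attribute value, as on the slide.
COUNTS = {
    "outlook": {"sunny": (2, 3), "overcast": (4, 0), "rainy": (3, 2)},
    "temperature": {"hot": (2, 2), "mild": (4, 2), "cool": (3, 1)},
    "humidity": {"high": (3, 4), "normal": (6, 1)},
    "windy": {"FALSE": (6, 2), "TRUE": (3, 3)},
}

gains = {attr: gain_from_counts(m) for attr, m in COUNTS.items()}
best = max(gains, key=gains.get)
for attr, gain in gains.items():
    print(f"{attr}: {gain:.2f} bits")
print("best split:", best)
```

With these counts, outlook has the largest gain (about 0.25 bits) and would be chosen as the first split.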
Decision trees
• Simple depth-first construction
• Needs entire data to fit in memory
• Unsuitable for large data sets
• Need to “scale up”
Decision Trees
Planning Tool
Decision Trees
• Enable a business to quantify decision
making
• Useful when the outcomes are uncertain
• Places a numerical value on likely or
potential outcomes
• Allows comparison of different possible
decisions to be made
Decision Trees
• Limitations:
– How accurate is the data used in the construction of the
tree?
– How reliable are the estimates of the probabilities?
– Data may be historical – does this data relate to real
time?
– Necessity of factoring in the qualitative factors –
human resources, motivation, reaction, relations with
suppliers and other stakeholders
The Process
Expand by opening new outlet
– Economic growth rises (probability 0.7): expected outcome £300,000
– Economic growth declines (probability 0.3): expected outcome -£500,000
Maintain current status
– Outcome: £0
A square denotes the point where a decision is made. In this example, a business is contemplating
opening a new outlet. The uncertainty is the state of the economy: if the economy continues to grow
healthily, the option is estimated to yield profits of £300,000. However, if the economy fails to grow as
expected, the potential loss is estimated at £500,000.
There is also the option to do nothing and maintain the current status quo! This would have an outcome of
£0.
The circle denotes the point where different outcomes could occur. The estimates of the probability and the
knowledge of the expected outcome allow the firm to make a calculation of the likely return. In this example
it is:
Economic growth rises: 0.7 x £300,000 = £210,000
Economic growth declines: 0.3 x -£500,000 = -£150,000
The calculation would suggest it is wise to go ahead with the decision (a net ‘benefit’ figure of +£60,000)
The Process
Expand by opening new outlet
– Economic growth rises (probability 0.5): expected outcome £300,000
– Economic growth declines (probability 0.5): expected outcome -£500,000
Maintain current status
– Outcome: £0
Look what happens however if the probabilities change. If the firm is unsure of the potential for growth, it might
estimate it at 50:50. In this case the outcomes will be:
Economic growth rises: 0.5 x £300,000 = £150,000
Economic growth declines: 0.5 x -£500,000 = -£250,000
In this instance, the net benefit is -£100,000 – the decision looks less favourable!
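The two calculations follow the same pattern: weight each outcome by its probability, sum over the branches of the chance node, and compare the options at the decision node. A minimal sketch, assuming exact fractions for the probabilities to avoid rounding noise; the function names and the dictionary layout are illustrative only.

```python
from fractions import Fraction

def expected_value(outcomes):
    """Sum of probability * payoff over the branches of a chance node."""
    assert sum(p for p, _ in outcomes) == 1, "probabilities must sum to 1"
    return sum(p * payoff for p, payoff in outcomes)

def outlet_decision(p_growth):
    """Options at the decision node, for a given probability of growth."""
    return {
        "expand": [(p_growth, 300_000), (1 - p_growth, -500_000)],
        "maintain": [(Fraction(1), 0)],
    }

def best_option(options):
    """Name of the option with the highest expected value."""
    return max(options, key=lambda name: expected_value(options[name]))

optimistic = outlet_decision(Fraction(7, 10))  # 0.7 / 0.3 scenario
uncertain = outlet_decision(Fraction(1, 2))    # 50:50 scenario

for label, options in [("0.7/0.3", optimistic), ("0.5/0.5", uncertain)]:
    net = expected_value(options["expand"])
    print(label, "expand:", int(net), "best:", best_option(options))
```

This reproduces the net figures of +£60,000 and -£100,000, and shows the preferred option switching from expanding to maintaining the current status once the probability of growth drops to 0.5.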
Advantages
Disadvantages
[Figure: trained decision tree, with panels labelled (Limit) and (Binned Likelihood Fit)]
Decision Trees from Data Base
Ex Num  Att Size  Att Colour  Att Shape  Concept Satisfied
1       med       blue        brick      yes
2       small     red         wedge      no
3       small     red         sphere     yes
4       large     red         wedge      no
5       large     green       pillar     yes
6       large     red         pillar     no
7       large     green       sphere     yes
Choose target : Concept satisfied
Use all attributes except Ex Num
Rules from Tree
IF (SIZE = large
AND
((SHAPE = wedge) OR (SHAPE = pillar AND
COLOUR = red)))
OR (SIZE = small AND SHAPE = wedge)
THEN NO
IF (SIZE = large
AND
((SHAPE = pillar AND COLOUR = green)
OR SHAPE = sphere))
OR (SIZE = small AND SHAPE = sphere)
OR (SIZE = medium)
THEN YES
Disjunctive Normal Form - DNF
IF
(SIZE = medium)
OR
(SIZE = small AND SHAPE = sphere)
OR
(SIZE = large AND SHAPE = sphere)
OR
(SIZE = large AND SHAPE = pillar
AND COLOUR = green)
THEN CONCEPT = satisfied
ELSE CONCEPT = not satisfied
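The DNF rule translates directly into a boolean expression, which makes it easy to check against the seven examples in the database. In this sketch the value "med" is assumed to be the table's abbreviation of "medium", and the function name `satisfied` is illustrative.

```python
# (Ex Num, Size, Colour, Shape, Concept Satisfied) from the example database.
EXAMPLES = [
    (1, "med", "blue", "brick", "yes"),
    (2, "small", "red", "wedge", "no"),
    (3, "small", "red", "sphere", "yes"),
    (4, "large", "red", "wedge", "no"),
    (5, "large", "green", "pillar", "yes"),
    (6, "large", "red", "pillar", "no"),
    (7, "large", "green", "sphere", "yes"),
]

def satisfied(size, colour, shape):
    """The DNF rule: an OR of AND-terms, one term per 'yes' path in the tree."""
    return (
        size == "med"  # assumed to mean SIZE = medium
        or (size == "small" and shape == "sphere")
        or (size == "large" and shape == "sphere")
        or (size == "large" and shape == "pillar" and colour == "green")
    )

mismatches = [
    num for num, size, colour, shape, concept in EXAMPLES
    if satisfied(size, colour, shape) != (concept == "yes")
]
print("examples the rule gets wrong:", mismatches)
```

The rule classifies all seven examples correctly, so the list of mismatches is empty.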