Integrating Bayesian Networks and
Simpson’s Paradox in Data Mining
Alex Freitas
University of Kent
Ken McGarry
University of Sunderland
Outline of the Talk

• Introduction to Knowledge Discovery & Data Mining
• Constructing Bayesian networks from data
• Simpson’s paradox
• Proposed method for integrating Bayesian networks and Simpson’s paradox
• Conclusions
Introduction

• Data Mining consists of extracting patterns from data, and it is the core step of a knowledge discovery process

[Diagram: Data (e.g. “22, M, 30K”; “26, F, 55K”; …) → pre-proc → data mining → post-proc → interesting patterns, e.g. IF (salary = high) THEN (credit = good)]
The Knowledge Discovery Process – a popular definition

“Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad et al. 1996)

• Focus on the quality of discovered patterns
– independent of the data mining algorithm
• This definition is often quoted, but rarely taken very seriously
– A lot of research on discovering valid, accurate patterns
– Little research on discovering potentially useful patterns
Criteria to Evaluate the “Interestingness” of Discovered Patterns

[Chart: criteria ordered by increasing difficulty of measurement and decreasing amount of research: valid (accurate) → comprehensible → novel, surprising → useful]
On the difficulty of discovering surprising patterns in data mining

• Focus on maximizing accuracy leads to very accurate but useless rules, e.g. (Brin et al. 1997) – census data:
– IF (person is pregnant) THEN (gender is female)
– IF (age ≤ 5) THEN (employed = no)
• (Tsumoto 2000) extracted 29,050 rules from a medical dataset. Out of these, just 220 (less than 1%) were considered interesting or surprising to the user
Motivation for Integrating Bayesian Networks and Simpson’s Paradox

[Diagram: Bayesian network example – a small directed graph over variables A, B, C, D]

• A Bayesian network represents potentially causal patterns, which tend to be more useful for intelligent decision making
• However, algorithms for constructing Bayesian networks from data were not designed to discover surprising patterns
• Simpson’s paradox is surprising by nature

Causality + Surprisingness tends to improve Usefulness
Constructing Bayesian Networks from Data

• Methods based on conditional independence tests
– Not scalable to datasets with many variables (attributes)
• Methods based on search guided by a scoring function
– Iteratively create candidate solutions (Bayesian networks) and evaluate the quality of each created network using a scoring function, until a stopping criterion is satisfied
– Sequential methods consider a single candidate solution at a time
– Population-based methods consider many candidate solutions at a time

• Examples of sequential methods
– The B algorithm starts with an empty network and, at each iteration, adds to the current candidate solution the edge that maximizes the value of the scoring function
– The K2 algorithm requires that the variables be ordered and that the user specify a parameter: the maximum number of parents of each variable in the network to be constructed
• Both are greedy methods (local search), which offer no guarantee of finding the optimal network
• Population-based methods are global search methods, but they are stochastic, so again there are no guarantees
Limitations of methods for constructing Bayesian networks from data (1)

Theoretical limitation (even with the best possible algorithm & data):

• Bayesian networks are independence maps (I-maps) of the true probability distribution
– Every independence between variables represented in the network is an actual independence in the true probability distribution
– Dependences between variables represented in the network are not guaranteed to be actual dependences in the true probability distribution
Limitations of methods for constructing Bayesian networks from data (2)

Practical limitations:

• The problem of constructing the optimal network is too complex in large datasets, so we have to use methods which do not guarantee the discovery of the optimal network
• Sampling variation and/or noisy data may mislead the Bayesian network construction method, further contributing to the discovery of a sub-optimal network
Simpson’s Paradox (Pearl 2000)

Overall        | E (recovered) | E (not recov.) | Total | Recov. Rate
Drug (C)       |      20       |       20       |  40   |    50%
No Drug (¬C)   |      16       |       24       |  40   |    40%
Total          |      36       |       44       |  80   |

Males          | E (recovered) | E (not recov.) | Total | Recov. Rate
Drug (C)       |      18       |       12       |  30   |    60%
No Drug (¬C)   |       7       |        3       |  10   |    70%
Total          |      25       |       15       |  40   |

Females        | E (recovered) | E (not recov.) | Total | Recov. Rate
Drug (C)       |       2       |        8       |  10   |    20%
No Drug (¬C)   |       9       |       21       |  30   |    30%
Total          |      11       |       29       |  40   |
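The reversal in the tables above can be verified directly from the counts; a minimal Python check (the variable names are ours, the counts are from the example):

```python
# Counts (recovered, not recovered) per (subpopulation, treatment),
# taken from the (Pearl 2000) example above.
counts = {
    ("males",   "drug"):    (18, 12),
    ("males",   "no_drug"): (7, 3),
    ("females", "drug"):    (2, 8),
    ("females", "no_drug"): (9, 21),
}

def rate(recovered, not_recovered):
    """Recovery rate = recovered / total patients in the cell."""
    return recovered / (recovered + not_recovered)

# Overall rates: sum each treatment's cells across both subpopulations.
overall = {
    t: rate(sum(counts[(g, t)][0] for g in ("males", "females")),
            sum(counts[(g, t)][1] for g in ("males", "females")))
    for t in ("drug", "no_drug")
}
print(overall)  # {'drug': 0.5, 'no_drug': 0.4} -> drug looks better overall

for g in ("males", "females"):
    # ...yet the drug is worse in BOTH subpopulations (0.6 < 0.7, 0.2 < 0.3)
    print(g, rate(*counts[(g, "drug")]), rate(*counts[(g, "no_drug")]))
```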
Simpson’s Paradox as a Surprising Pattern

• Event C (“cause”) increases the probability of event E (“effect”) in a given population but, at the same time, decreases the probability of E in every subpopulation
• There is no paradox in terms of probability theory; it only looks like a “paradox” under a causal interpretation
– Gender is a confounding variable in the previous example
• Although Simpson’s paradox is well known to statisticians, occurrences of the paradox are surprising to users
• There are algorithms that systematically find instances of the paradox in data and rank them in decreasing order of surprisingness (Fabris & Freitas 2006)
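The pairwise reversal test underlying such detection algorithms can be sketched in a few lines. This is a simplified illustration only: the actual Fabris & Freitas (2006) algorithm also ranks occurrences by surprisingness, and the record layout below is an assumption of ours.

```python
def simpsons_reversal(records, cause, effect, confounder):
    """True if 'cause' raises P(effect=1) in the whole population while
    lowering it in every subpopulation defined by 'confounder' (or the
    symmetric case). records: list of dicts with 0/1-valued fields."""
    def rate(rows, cause_val):
        rows = [r for r in rows if r[cause] == cause_val]
        return sum(r[effect] for r in rows) / len(rows) if rows else None

    r1, r0 = rate(records, 1), rate(records, 0)
    if r1 is None or r0 is None or r1 == r0:
        return False
    overall_sign = r1 > r0

    # The direction must flip in every subpopulation.
    for v in {r[confounder] for r in records}:
        sub = [r for r in records if r[confounder] == v]
        s1, s0 = rate(sub, 1), rate(sub, 0)
        if s1 is None or s0 is None or (s1 > s0) == overall_sign:
            return False
    return True


# The (Pearl 2000) drug example as individual records:
records = (
    [{"drug": 1, "recovered": 1, "male": 1}] * 18
    + [{"drug": 1, "recovered": 0, "male": 1}] * 12
    + [{"drug": 0, "recovered": 1, "male": 1}] * 7
    + [{"drug": 0, "recovered": 0, "male": 1}] * 3
    + [{"drug": 1, "recovered": 1, "male": 0}] * 2
    + [{"drug": 1, "recovered": 0, "male": 0}] * 8
    + [{"drug": 0, "recovered": 1, "male": 0}] * 9
    + [{"drug": 0, "recovered": 0, "male": 0}] * 21
)
print(simpsons_reversal(records, "drug", "recovered", "male"))  # True
```

A full detector would run this test over all (cause, effect, confounder) triples of variables and collect the occurrences into a ranked list.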
The proposed method for integrating Bayesian networks and Simpson’s paradox

• Basic idea:
– In a Bayesian network, the dependence denoted by an edge C → E can be spurious, i.e., due to a confounding variable F (for the previously discussed reasons)
• Two approaches explore this basic idea
First Approach: paradox detection before network construction

• First, run an algorithm that detects occurrences of Simpson’s paradox in the data (Fabris & Freitas 2006)
– Produces a paradox list PL
• Modify Bayesian network construction algorithms to take this list into account, biasing the algorithms against including network edges involving the paradox
• Consider a potential dependence represented by the edge C → E, where C is the apparent cause of effect E
– If variables C, E are associated in an occurrence of Simpson’s paradox in PL, the algorithm is biased against including edge C → E in the network

• Consider a greedy algorithm that starts with an empty network and adds one edge to the network at a time, guided by a scoring function:

FOR EACH candidate edge A → B
    compute the score of the network if A → B is added to the network
    [proposed extension] penalize the score if there is an occurrence of the
        paradox in list PL involving the pair of variables A, B
SELECT the edge with the highest score and add it to the network
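The extended greedy step can be sketched as follows. Here a network is just a set of directed edges and `network_score` is a caller-supplied scoring function (e.g. BDeu or BIC in a real implementation), so everything except the paradox penalty is an illustrative assumption:

```python
def best_penalized_edge(candidate_edges, edges, data, paradox_list,
                        network_score, penalty=0.1):
    """One greedy step: return the candidate edge (A, B) maximizing the
    score of the network edges | {(A, B)}, penalized when the pair
    {A, B} appears in the paradox list PL (the proposed extension).

    paradox_list: set of frozensets {A, B} output by the paradox detector.
    network_score: caller-supplied function scoring a network on the data.
    """
    best_edge, best_score = None, float("-inf")
    for a, b in candidate_edges:
        score = network_score(edges | {(a, b)}, data)
        if frozenset((a, b)) in paradox_list:
            score -= penalty  # bias against possibly spurious dependences
        if score > best_score:
            best_edge, best_score = (a, b), score
    return best_edge
```

With `penalty` large enough this amounts to vetoing paradox-involved edges outright; a graded penalty keeps such an edge available when the scoring evidence for it is strong.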
• The same basic kind of extension can be applied to an Estimation of Distribution Algorithm (EDA)
– An EDA is a population-based evolutionary algorithm
– It evaluates a complete candidate solution (network) at once

FOR EACH candidate solution in the population
    compute the score of the network represented by the candidate solution
    [proposed extension] penalize the score in proportion to the number of
        paradox occurrences in list PL that are associated with direct
        dependences A → B in the network
Second Approach: paradox detection after network construction

• First, construct a Bayesian network from data
• Use the network to “prune” the search space for the Simpson’s paradox detection algorithm
• The algorithm will focus its search on the pairs of variables for which there is a direct dependence (i.e., an edge A → B) in the Bayesian network
• For each pair of such variables, the algorithm will try to find a third variable that acts as a confounder between those two variables

[Diagram: a Bayesian net over variables A, B, C, D; only the pairs of cause and effect variables joined by an edge in the net are considered by the Simpson’s paradox detection algorithm, which asks, for each such pair: is there a confounder?]

• A paradox occurrence involving such a pair of cause and effect variables would be even more surprising to the user, due to the structure of the network
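The pruning in this second approach can be sketched as follows; the learned network is taken to be a collection of directed edges, and `reversal_test` stands in for a pairwise Simpson’s paradox check supplied by the caller (both are assumptions for illustration):

```python
def paradoxes_on_network_edges(edges, variables, reversal_test):
    """Second approach: restrict the paradox search to pairs of variables
    joined by an edge (cause, effect) in the constructed Bayesian net,
    and look for a third variable acting as a confounder between them.

    reversal_test(cause, effect, third) -> bool: caller-supplied check
    that 'third' produces a Simpson's paradox reversal for the pair.
    """
    found = []
    for cause, effect in edges:      # pruned search space: net edges only
        for third in variables:
            if third in (cause, effect):
                continue
            if reversal_test(cause, effect, third):
                found.append((cause, effect, third))
    return found
```

Compared with testing all O(n²) variable pairs against every possible confounder, only the pairs actually present as network edges are examined.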
Limitation of the proposed integration method

• It is possible that the data does not contain any occurrence of Simpson’s paradox
– In this case the usefulness of the method is limited
• Even if the algorithm does not find any paradox occurrence, this result is to some extent useful:
– It gives us increased confidence that the dependences represented in the network are true dependences, rather than spurious ones
– This additional test complements (rather than replaces) conventional methods for evaluating Bayesian networks
Conclusions

• We proposed a method for integrating two very different kinds of algorithms in data mining
– Algorithms for constructing Bayesian networks: discover potentially causal, and thus more useful, patterns
– Algorithms for detecting Simpson’s paradox: discover surprising, and thus potentially more useful, patterns
• The combination will hopefully offer the “best of both worlds”, increasing the chance of discovering patterns useful for intelligent decision making by the user
• Future research: a computational implementation of the proposed method and analysis of its results
Any questions?
Thanks for listening!