Lecture 9: Bump Hunting

Bump Hunting
The objective
PRIM algorithm
Beam search
Brief Intro to Undirected Graphical Models
Overview
Regression-based models
References:
Feelders, A.J. (2002). Rule induction by bump hunting. In J. Meij (Ed.), Dealing with the data
flood (STT, 65) (pp. 697-700). Den Haag, the Netherlands: STT/Beweton.
J.H. Friedman and N.I. Fisher (1999) Bump-hunting in high-dimensional data. Statistics and
Computing, 9:123–143.
Bump Hunting - The objective
Find regions in the feature space, where the outcome
variable has high average value.
In classification, it means a region of the feature space
where the majority of the samples are in one class.
The decision rule looks like an intersection of several
conditions (each on one predictor variable)
If condition 1 & condition 2 &…… & condition N,
then predict value …
Ex: if 0<x1<1 & 2<x2<5 &…& -1<xn<0, then class 1
Bump Hunting - The objective
When the dimension is high and there are many such
boxes, the problem is not easy.
Bump Hunting - The objective
Let’s formalize the problem:
Predictors $x = (x_1, x_2, \ldots, x_p)$
Target variable y, either continuous or binary
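Feature space: $S = S_1 \times S_2 \times \cdots \times S_p$, where $S_j$ is the set of possible values of $x_j$
Find a subspace $R \subset S$ such that
$\bar{y}_R = \mathrm{ave}(y_i \mid x_i \in R) \gg \bar{y}$, where $\bar{y}$ is the mean of y over all the data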
Note: when y is binary, the quantity of interest is not a mean response in the usual sense. Rather, it is
$\Pr(y = 1 \mid x \in R)$
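Define any box: $B = s_1 \times s_2 \times \cdots \times s_p$, where each $s_j$ is a subset of the possible values of $x_j$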
Bump Hunting - The objective
Box in continuous feature space:
Bump Hunting - The objective
Box in categorical feature space.
Bump Hunting - PRIM
Sequentially find boxes in subsets of the data.
Support of a box:
$\beta_B = \frac{1}{N} \sum_{i=1}^{N} I(x_i \in B)$
Continue searching for boxes until there is not enough support for the new box.
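As a minimal sketch of the support definition above, the following Python computes which points fall in a box, the box's support, and the mean of y inside it. The helper names `in_box` and `support`, the dictionary representation of a box, and the four data points are all illustrative assumptions, not part of the lecture.

```python
import numpy as np

def in_box(X, box):
    """Boolean mask of the rows of X that fall inside the box.

    `box` maps a column index to an open interval (low, high); columns
    that are not listed are unrestricted. (Hypothetical representation.)
    """
    mask = np.ones(X.shape[0], dtype=bool)
    for j, (low, high) in box.items():
        mask &= (X[:, j] > low) & (X[:, j] < high)
    return mask

def support(X, box):
    """Fraction of observations inside the box: (1/N) * sum_i I(x_i in B)."""
    return in_box(X, box).mean()

# Made-up data: four observations, two predictors, binary outcome.
X = np.array([[0.5, 3.0], [0.2, 6.0], [1.5, 4.0], [0.9, 2.5]])
y = np.array([1, 0, 0, 1])

box = {0: (0.0, 1.0), 1: (2.0, 5.0)}   # 0 < x1 < 1 and 2 < x2 < 5
inside = in_box(X, box)
beta = support(X, box)                  # 2 of the 4 points are in the box
box_mean = y[inside].mean()             # mean of y among the points in the box
```

Here the first and last observations satisfy both conditions, so the support is 0.5 and the box mean is 1.0.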
Bump Hunting - PRIM
“Patient Rule Induction Method”
Two steps:
(1) Patient successive top-down refinement
(2) Bottom-up recursive expansion
These are greedy algorithms.
Bump Hunting - PRIM
Peeling:
Begin with box B containing all data (or all remaining data
in later steps)
Remove the sub-box b* that maximizes the mean of y
in B-b*
The candidate box b is defined on a single variable
(peeling only in one of the dimensions), and only a small
percentile is peeled each time.
Bump Hunting - PRIM
This is a greedy hill-climbing algorithm.
Stop the iteration when the support drops to a predetermined threshold.
Why is it called “patient”? Because only a small fraction of the data is removed at
each step.
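The peeling loop can be sketched as follows, assuming continuous predictors only. The function names, the quantile-based choice of the peeled slice, and the parameters `alpha` (peeling fraction) and `min_support` are illustrative assumptions; this is not the reference PRIM implementation.

```python
import numpy as np

def peel_once(X, y, alpha=0.1):
    """Try peeling a fraction alpha off each face of the current box.

    Returns the boolean mask (over the rows of X) of the points kept by
    the best peel, i.e. the one whose remaining box has the highest mean
    of y, or None if no peel removes some but not all points.
    """
    best_mean, best_keep = -np.inf, None
    for j in range(X.shape[1]):
        lo, hi = np.quantile(X[:, j], [alpha, 1 - alpha])
        for keep in (X[:, j] > lo, X[:, j] < hi):   # peel bottom / peel top
            if 0 < keep.sum() < len(y) and y[keep].mean() > best_mean:
                best_mean, best_keep = y[keep].mean(), keep
    return best_keep

def peel(X, y, alpha=0.1, min_support=0.1):
    """Peel repeatedly; stop before the support falls below min_support."""
    n = len(y)
    idx = np.arange(n)                    # indices of points still in the box
    trajectory = [(1.0, y.mean())]        # (support, box mean) after each peel
    while True:
        keep = peel_once(X[idx], y[idx], alpha)
        if keep is None or keep.sum() / n < min_support:
            break
        idx = idx[keep]
        trajectory.append((len(idx) / n, y[idx].mean()))
    return idx, trajectory

# Made-up data with a bump in the corner x1 > 0.6, x2 > 0.6.
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))
y = ((X[:, 0] > 0.6) & (X[:, 1] > 0.6)).astype(float)
idx, trajectory = peel(X, y, alpha=0.1, min_support=0.1)
```

The `trajectory` list records the support and box mean after each peel, which makes the trade-off between the two visible when choosing a final box.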
Bump Hunting - PRIM
In peeling, box boundaries are determined without
knowledge of later peels, so some non-optimal steps can be
taken.
Final box could be improved by boundary adjustments.
Pasting:
Bump Hunting - PRIM example
[Figure sequence: a step-by-step PRIM peeling example. Surviving annotations: “2/7”, “3/7”, “The winner is:”, “The next peel:”, “β = 0.4”.]
Bump Hunting - Beam search algorithm
At each step, the w best sub-boxes (each defined on a single variable)
are selected, subject to a minimum support requirement.
More greedy: at each step, much more can be peeled
than in PRIM, since each sub-box is optimized over one of the variables.
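A rough sketch of the beam search idea is given below. The candidate sub-boxes are intervals between quantile cut points on a single variable; this candidate set, the parameters `n_cuts` and `depth`, and the function names are assumptions made for illustration, and the algorithm presented in the lecture may generate and rank candidates differently.

```python
import numpy as np
from itertools import combinations

def refinements(X, mask, n_cuts=5):
    """Yield sub-boxes of `mask` that restrict one variable to an interval."""
    for j in range(X.shape[1]):
        cuts = np.quantile(X[mask, j], np.linspace(0, 1, n_cuts + 1))
        for a, b in combinations(np.unique(cuts), 2):
            sub = mask & (X[:, j] >= a) & (X[:, j] <= b)
            if sub.sum() < mask.sum():          # keep proper sub-boxes only
                yield sub

def beam_search(X, y, w=2, depth=2, min_support=0.1):
    """Keep the w sub-boxes with the highest mean of y at each level."""
    n = len(y)
    beam = [np.ones(n, dtype=bool)]             # start from the full data
    for _ in range(depth):
        candidates = [sub for mask in beam for sub in refinements(X, mask)
                      if sub.sum() / n >= min_support]
        if not candidates:
            break
        # Drop duplicate boxes, then rank by mean response.
        unique = {sub.tobytes(): sub for sub in candidates}.values()
        beam = sorted(unique, key=lambda s: y[s].mean(), reverse=True)[:w]
    return beam

# Made-up data with a bump in the corner x1 > 0.6, x2 > 0.6.
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))
y = ((X[:, 0] > 0.6) & (X[:, 1] > 0.6)).astype(float)
beam = beam_search(X, y, w=2, depth=2, min_support=0.1)
```

Each level can cut a variable down to a fraction of its range in one step, which is the sense in which this search uses up data much faster than patient peeling.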
Bump Hunting - Beam search algorithm
[Figure: beam search example with beam width w = 2.]
Bump Hunting - About PRIM
It is a greedy search. However, it is “patient”. This is
important.
Methods that partition the data much faster, e.g. Beam
search and CART, could be less successful.
The “patient” method makes it easier to recover from
earlier “unfortunate” steps, since we don’t run out of
data too quickly.
It doesn’t discard predictors because of high correlation
among them.
Undirected Graph Models - Introduction
A network/graph is a set of vertices connected by edges.
undirected edges → “undirected network”
directed edges → “directed network”
Vertex-level characteristic:
The number of connections to a vertex: “degree”
Incoming edges → “in-degree” k_i
Outgoing edges → “out-degree” k_o
k = k_i + k_o
[Figure: a vertex with its incoming (k_i) and outgoing (k_o) edges. Source: Evolution of Networks, S.N. Dorogovtsev and J.F.F. Mendes]
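To make the degree definitions concrete, here is a small Python sketch that counts in-degree, out-degree, and total degree from a directed edge list. The function name `degrees` and the example edges are made up for illustration.

```python
from collections import Counter

def degrees(edges):
    """Return {vertex: (in-degree, out-degree, total degree)} for a directed edge list."""
    k_in = Counter(v for _, v in edges)     # edges arriving at v
    k_out = Counter(u for u, _ in edges)    # edges leaving u
    nodes = set(k_in) | set(k_out)
    return {n: (k_in[n], k_out[n], k_in[n] + k_out[n]) for n in sorted(nodes)}

# Hypothetical directed network: each pair (u, v) is an edge u -> v.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
deg = degrees(edges)
```

For this edge list, vertex `a` has one incoming and two outgoing edges, so its total degree is 3.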
Undirected Graph Models - Introduction
Graphical models – a visual expression of the joint
distribution of the entire set of random variables.
Undirected graphical model – also known as “Markov
random fields” or “Markov networks”.
Lack of connection in such a network – conditional
independence given all other variables.
Sparse graphs – small number of edges – easy to interpret.
Edges – encode the strength of conditional dependency.
Undirected Graph Models - Introduction
Pairwise Markov independency
Ex:
Global Markov independency:
Subgraphs A, B and C. If every path between A and B
passes through a node in C, then C separates A and B,
and A and B are conditionally independent given C.
Ex:
Y “separates” X and Z
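Separation can be checked mechanically: remove the nodes in C and test whether any node of B is still reachable from A. The sketch below does this with a breadth-first search; the function name `separates` and the three-node chain X - Y - Z are assumptions chosen to match the example of Y separating X and Z.

```python
from collections import deque

def separates(adj, A, B, C):
    """True if every path from a node in A to a node in B passes through C."""
    blocked = set(C)
    seen = set(A) - blocked
    queue = deque(seen)
    while queue:
        u = queue.popleft()
        if u in B:
            return False                 # reached B without touching C
        for v in adj[u]:
            if v not in blocked and v not in seen:
                seen.add(v)
                queue.append(v)
    return True

# Hypothetical chain graph X - Y - Z, stored as an adjacency list.
adj = {"X": ["Y"], "Y": ["X", "Z"], "Z": ["Y"]}
y_separates = separates(adj, {"X"}, {"Z"}, {"Y"})
nothing_separates = separates(adj, {"X"}, {"Z"}, set())
```

With Y blocked there is no path from X to Z, so `y_separates` is true; with nothing blocked the path X - Y - Z exists and the check fails.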
Undirected Graph Models - Introduction
Pairwise Markov independency
Based on
Global Markov independency
Clique – a complete (all pairs connected) subgraph
Maximal clique – a clique to which no other vertex can be
added while still yielding a clique.
Ex:
{X, Y}, {Y, Z}, {Z, W} of graph above
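Maximal cliques can be enumerated with the basic Bron-Kerbosch recursion, sketched below in plain Python. The graph used here, the chain X - Y - Z - W, is an assumption chosen because its maximal cliques are the three pairs listed above; the graph on the original slide is not available in the text.

```python
def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion.

    `adj` maps each vertex to the set of its neighbours.
    """
    cliques = []

    def expand(R, P, X):
        # R: current clique, P: candidates to add, X: already-explored vertices.
        if not P and not X:
            cliques.append(R)            # R cannot be extended: it is maximal
            return
        for v in list(P):
            expand(R | {v}, P & adj[v], X & adj[v])
            P = P - {v}
            X = X | {v}

    expand(frozenset(), frozenset(adj), frozenset())
    return cliques

# Hypothetical chain graph X - Y - Z - W.
adj = {"X": {"Y"}, "Y": {"X", "Z"}, "Z": {"Y", "W"}, "W": {"Z"}}
cliques = maximal_cliques(adj)
```

On this chain no three vertices are mutually connected, so every edge is itself a maximal clique.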
Undirected Graph Models - Introduction
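A probability density function f over a Markov graph G can
be represented as
$f(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C)$
where $\mathcal{C}$ is the set of maximal cliques, the $\psi_C$ are positive clique potentials, and Z is the normalizing constant.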
Either distribution can represent the
dependence structure:
Pairwise Markov graphs concern f(2) above.
Undirected Graph Models – Gaussian Graphical Model
- Observations have a multivariate Gaussian
distribution with mean μ and covariance matrix Σ.
- The Gaussian distribution represents at most second-order relationships, so it automatically encodes a
pairwise Markov graph.
- All conditional distributions are also Gaussian.
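- If the ij-th component of $\Theta = \Sigma^{-1}$ is zero, then variables i
and j are conditionally independent given the other variables
- If Y is one variable and $Z = (X_1, \ldots, X_{p-1})$ is the rest of the
variables, then the conditional distribution is
$Y \mid Z = z \sim N\big(\mu_Y + (z - \mu_Z)^T \Sigma_{ZZ}^{-1} \sigma_{ZY},\; \sigma_{YY} - \sigma_{ZY}^T \Sigma_{ZZ}^{-1} \sigma_{ZY}\big)$
This has the same form as the population multiple linear regression of Y on Z, with coefficient $\beta = \Sigma_{ZZ}^{-1} \sigma_{ZY}$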
Undirected Graph Models – Gaussian Graphical Model
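Partition $\Theta = \Sigma^{-1}$ the same way; then, because $\Sigma\Theta = I$,
$\theta_{ZY} = -\theta_{YY}\, \Sigma_{ZZ}^{-1} \sigma_{ZY}$, where $1/\theta_{YY} = \sigma_{YY} - \sigma_{ZY}^T \Sigma_{ZZ}^{-1} \sigma_{ZY} > 0$
Thus the regression coefficient of Y on Z is
$\beta = \Sigma_{ZZ}^{-1} \sigma_{ZY} = -\theta_{ZY} / \theta_{YY}$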
- Zero elements in β and hence θZY mean that the corresponding
elements of Z are conditionally independent of Y
- We can learn the dependence structure through multiple linear
regression.
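The link between the precision matrix and regression, namely that the population coefficient of Y on Z equals $-\theta_{ZY}/\theta_{YY}$, can be checked numerically. The sketch below uses an arbitrary randomly generated covariance matrix (an assumption for illustration) and treats the last variable as Y.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
Sigma = A @ A.T + 4 * np.eye(4)      # an arbitrary positive-definite covariance
Theta = np.linalg.inv(Sigma)         # precision matrix

# Treat the last variable as Y and the first three as Z.
Sigma_ZZ, sigma_ZY = Sigma[:3, :3], Sigma[:3, 3]
theta_ZY, theta_YY = Theta[:3, 3], Theta[3, 3]

beta_regression = np.linalg.solve(Sigma_ZZ, sigma_ZY)   # Sigma_ZZ^{-1} sigma_ZY
beta_precision = -theta_ZY / theta_YY                   # from the precision matrix
```

The two vectors agree up to floating-point error, so a zero in the Y column of Θ corresponds exactly to a zero regression coefficient.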
Undirected Graph Models – Gaussian Graphical Model
Finding parameters when network structure is known.
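- Take the empirical mean $\bar{x}$ and covariance matrix $S = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^T$
- The log-likelihood of the data, up to constants and partially maximized over μ, is $\ell(\Theta) = \log\det\Theta - \mathrm{trace}(S\Theta)$
- The quantity $-\ell(\Theta)$ is a convex function of Θ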
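As a small illustration, the Gaussian log-likelihood in Θ can be evaluated, up to an additive constant and a positive scale factor, as log det Θ − trace(SΘ). The sketch below assumes that form, uses made-up data, and covers only the case where no edges are constrained to be absent, in which the maximizer is the inverse of S.

```python
import numpy as np

def gaussian_loglik(Theta, S):
    """log det(Theta) - trace(S Theta), the log-likelihood up to constants."""
    sign, logdet = np.linalg.slogdet(Theta)
    if sign <= 0:
        return -np.inf               # log det is undefined for this Theta
    return logdet - np.trace(S @ Theta)

# Made-up data: 200 observations of 3 variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / len(X)     # empirical covariance (divisor N)

Theta_mle = np.linalg.inv(S)                 # maximizer when no edges are missing
Theta_other = Theta_mle + 0.1 * np.eye(3)    # any other positive-definite matrix

ll_mle = gaussian_loglik(Theta_mle, S)
ll_other = gaussian_loglik(Theta_other, S)
```

Because the negative log-likelihood is convex, any perturbation away from the inverse of S lowers the value; with a known graph structure the same objective is maximized subject to zeros at the missing edges.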
Undirected Graph Models – Gaussian Graphical Model
Estimating graph structure:
Meinshausen and Bühlmann’s regression approach
(2006)
- Fit a lasso regression using each variable as the
response and the others as predictors.
- θij is estimated to be nonzero if
estimated coefficient of variable i on j is nonzero,
OR (alternatively AND)
estimated coefficient of variable j on i is nonzero
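The regression approach can be sketched in a few lines of NumPy. The lasso solver below is a plain cyclic coordinate descent written for this example, and the function names, the penalty value `lam=0.1`, and the simulated chain x1 - x2 - x3 are all illustrative assumptions; a tuned library solver would normally be used in practice.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso by cyclic coordinate descent for (1/2n)||y - Xb||^2 + lam*||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.copy()                                  # residual y - X b
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]                   # remove variable j's contribution
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r -= X[:, j] * b[j]                   # add the updated contribution back
    return b

def neighborhood_selection(X, lam, rule="or"):
    """Estimate the edge set by regressing each variable on all the others."""
    X = (X - X.mean(axis=0)) / X.std(axis=0)      # centre and scale the columns
    p = X.shape[1]
    nonzero = np.zeros((p, p), dtype=bool)
    for j in range(p):
        others = [k for k in range(p) if k != j]
        nonzero[j, others] = lasso_cd(X[:, others], X[:, j], lam) != 0
    # OR rule: either regression selects the edge; AND rule: both must.
    return nonzero | nonzero.T if rule == "or" else nonzero & nonzero.T

# Simulated chain x1 - x2 - x3: x1 and x3 are independent given x2.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(size=500)
x3 = x2 + rng.normal(size=500)
X = np.column_stack([x1, x2, x3])
edges_or = neighborhood_selection(X, lam=0.1, rule="or")
edges_and = neighborhood_selection(X, lam=0.1, rule="and")
```

The AND rule always returns a subset of the edges found by the OR rule, so it gives the sparser of the two graph estimates.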
Undirected Graph Models – Gaussian Graphical Model
More formally – graphical lasso
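- Penalized likelihood: maximize $\log\det\Theta - \mathrm{trace}(S\Theta) - \lambda \|\Theta\|_1$ over Θ, where $\|\Theta\|_1$ is the sum of the absolute values of the elements of Θ and λ ≥ 0 controls the sparsity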