Optimization Methods in Data Mining

Fall 2004

Overview

(Overview diagram: how optimization methods map onto data mining tasks)
- Mathematical Programming → Support Vector Machines → classification, clustering, etc.
- Steepest Descent Search → Neural Nets, Bayesian Networks (optimize parameters)
- Combinatorial Optimization → Genetic Algorithms → feature selection, classification, clustering

What is Optimization?

- Formulation
  - Decision variables
  - Objective function
  - Constraints
- Solution
  - Iterative algorithm
  - Improving search
- Process: Problem → Formulation → Model → Algorithm → Solution

Combinatorial Optimization

- Finitely many solutions to choose from
  - Select the best rule from a finite set of rules
  - Select the best subset of attributes
  - Too many solutions to consider them all
- Solution approaches
  - Branch-and-bound (better than Weka's exhaustive search)
  - Random search

Random Search

- Select an initial solution x(0) and let k = 0
- Loop:
  - Consider the neighbors N(x(k)) of x(k)
  - Select a candidate x' from N(x(k))
  - Check the acceptance criterion
  - If accepted, let x(k+1) = x'; otherwise let x(k+1) = x(k)
- Until the stopping criterion is satisfied

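Below is a minimal Python sketch of this loop. The bit-string solution representation, the one-bit-flip neighborhood, and the greedy acceptance criterion are illustrative assumptions; the slide leaves all three open.

import random

def random_search(fitness, n_bits, max_iter=1000, seed=0):
    """Generic random search over bit strings; higher fitness is better."""
    rng = random.Random(seed)
    x = tuple(rng.randint(0, 1) for _ in range(n_bits))      # initial solution x(0)
    best = fitness(x)
    for _ in range(max_iter):                                 # stopping criterion: iteration budget
        i = rng.randrange(n_bits)                             # neighbor of x: flip one bit (assumption)
        candidate = x[:i] + (1 - x[i],) + x[i + 1:]
        f = fitness(candidate)
        if f >= best:                                         # acceptance criterion: never worsen (assumption)
            x, best = candidate, f
    return x, best

# Toy run: maximize the number of 1-bits in an 8-bit string
print(random_search(sum, n_bits=8))
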
Common Algorithms

- Simulated Annealing (SA)
  - Idea: accept inferior solutions with a given probability that decreases as time goes on
- Tabu Search (TS)
  - Idea: restrict the neighborhood with a list of solutions that are tabu (that is, cannot be visited) because they were visited recently
- Genetic Algorithms (GA)
  - Idea: neighborhoods based on 'genetic similarity'
  - Most used in data mining applications

Genetic Algorithms

- Maintain a population of solutions rather than a single solution
- Members of the population have a certain fitness (usually just the objective value)
- Survival of the fittest through
  - selection
  - crossover
  - mutation

GA Formulation

- Use binary strings (bits) to encode solutions: 011010010
- Terminology
  - Chromosome = solution
  - Parent chromosome
  - Children or offspring

Problems Solved

- Data mining problems that have been addressed using genetic algorithms:
  - Classification
  - Attribute selection
  - Clustering

Classification Example

Encode each attribute value as a bit pattern:

  Outlook:  Sunny = 100, Overcast = 010, Rainy = 001
  Windy:    Yes = 10, No = 01

Representing a Rule

If windy = yes then play = yes:

  outlook 111 | windy 10 | play 10

If outlook = overcast and windy = yes then play = no:

  outlook 010 | windy 10 | play 01

(111 in the outlook field means any outlook value is allowed.)

Single-Point Crossover

Parents (crossover point after the outlook bits):

  Parent 1:  outlook 111 | windy 10 | play 10
  Parent 2:  outlook 010 | windy 01 | play 01

Offspring:

  Offspring 1:  outlook 111 | windy 01 | play 01
  Offspring 2:  outlook 010 | windy 10 | play 10

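A minimal Python sketch of single-point crossover on bit strings; choosing the crossover point uniformly at random is an assumption, since the slide only marks where the cut falls in this example.

import random

def single_point_crossover(parent1, parent2, rng=random):
    """Swap the tails of two equal-length bit strings at a random cut point."""
    assert len(parent1) == len(parent2)
    point = rng.randrange(1, len(parent1))        # cut somewhere strictly inside the string
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

# Rule chromosomes from the slide: outlook(3 bits) | windy(2 bits) | play(2 bits)
print(single_point_crossover("1111010", "0100101"))
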
Two-Point Crossover

Parents (crossover points before and after the windy bits):

  Parent 1:  outlook 111 | windy 10 | play 10
  Parent 2:  outlook 010 | windy 01 | play 01

Offspring:

  Offspring 1:  outlook 111 | windy 01 | play 10
  Offspring 2:  outlook 010 | windy 10 | play 01

Uniform Crossover

Parents:

  Parent 1:  outlook 111 | windy 10 | play 10
  Parent 2:  outlook 010 | windy 01 | play 01

Offspring (each bit taken from a randomly chosen parent):

  Offspring 1:  outlook 110 | windy 01 | play 00
  Offspring 2:  outlook 011 | windy 10 | play 11

Problem? The offspring need not encode valid rules (for example, play = 00 or 11 has no meaning in this encoding).

Mutation

  Parent:     outlook 010 | windy 01 | play 01
  Offspring:  outlook 010 | windy 11 | play 01   (first windy bit mutated)

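A matching Python sketch of bit-flip mutation; flipping exactly one randomly chosen bit follows the GA algorithm slide later in the deck, and the uniform choice of position is an assumption.

import random

def mutate(chromosome, rng=random):
    """Flip one randomly selected bit of a bit-string chromosome."""
    i = rng.randrange(len(chromosome))
    flipped = "1" if chromosome[i] == "0" else "0"
    return chromosome[:i] + flipped + chromosome[i + 1:]

# Parent 010|01|01 from the slide; one possible result is 010|11|01
print(mutate("0100101"))
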
Selection

- Which strings in the population should be operated on?
  - Rank and select the n fittest ones
  - Assign probabilities according to fitness and select probabilistically, say

      P[select x_i] = Fitness(x_i) / Σ_j Fitness(x_j)

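A short Python sketch of this fitness-proportional (roulette-wheel) rule, assuming the fitness values are non-negative and not all zero.

import random

def select(population, fitness, rng=random):
    """Pick one individual with probability Fitness(x_i) / sum_j Fitness(x_j)."""
    weights = [fitness(x) for x in population]    # assumed non-negative, not all zero
    return rng.choices(population, weights=weights, k=1)[0]

# Example: strings with more 1-bits are proportionally more likely to be picked
print(select(["1111", "1100", "0000"], fitness=lambda x: x.count("1")))
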
Creating a New Population

- Create a population Pnew with p individuals
- Survival
  - Allow individuals from the old population to survive intact
  - Rate: (1-r) % of the population
  - How to select the individuals that survive: deterministic/random
- Crossover
  - Select fit individuals and create new ones
  - Rate: r % of the population. How to select?
- Mutation
  - Slightly modify any of the above individuals
  - Mutation rate: m
  - Fixed number of mutations versus probabilistic mutations

GA Algorithm

- Randomly generate an initial population P
- Evaluate the fitness f(x_i) of each individual in P
- Repeat:
  - Survival: probabilistically select (1-r)p individuals from P and add them to Pnew, according to

      P[select x_i] = f(x_i) / Σ_j f(x_j)

  - Crossover: probabilistically select rp/2 pairs from P, apply the crossover operator, and add the offspring to Pnew
  - Mutation: uniformly choose m percent of the members and invert one randomly selected bit in each
  - Update: P ← Pnew
  - Evaluate: compute the fitness f(x_i) of each individual in P
- Return the fittest individual from P

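A compact Python sketch of this loop, tying together fitness-proportional selection, single-point crossover, and one-bit mutation. The population size, rates, and generation count below are illustrative defaults, not values from the slides, and fitness is assumed non-negative.

import random

def genetic_algorithm(fitness, n_bits, p=20, r=0.6, m=0.1, generations=50, seed=0):
    """GA loop from the slide: survival, crossover, mutation, update, evaluate."""
    rng = random.Random(seed)
    P = ["".join(rng.choice("01") for _ in range(n_bits)) for _ in range(p)]

    for _ in range(generations):
        weights = [fitness(x) for x in P]                 # assumed non-negative, not all zero
        pick = lambda: rng.choices(P, weights=weights, k=1)[0]

        # Survival: (1-r)p individuals copied intact, chosen fitness-proportionally
        P_new = [pick() for _ in range(int((1 - r) * p))]

        # Crossover: roughly r*p/2 pairs, each contributing two offspring
        while len(P_new) < p:
            a, b = pick(), pick()
            point = rng.randrange(1, n_bits)
            P_new += [a[:point] + b[point:], b[:point] + a[point:]]

        # Mutation: invert one randomly selected bit in about m of the members
        for i in range(len(P_new)):
            if rng.random() < m:
                j = rng.randrange(n_bits)
                x = P_new[i]
                P_new[i] = x[:j] + ("1" if x[j] == "0" else "0") + x[j + 1:]

        P = P_new[:p]                                     # update: P <- Pnew

    return max(P, key=fitness)

# Toy run: maximize the number of 1-bits in a 10-bit string
print(genetic_algorithm(lambda x: x.count("1"), n_bits=10))
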
Analysis of GA: Schemas

- Does the GA converge?
- Does the GA move towards a good solution?
- Local optima?
- Holland (1975): analysis based on schemas
- Schema: a string combination of 0s, 1s, and *s
- Example: 0*10 represents {0010, 0110}

The Schema Theorem
(all the theory on one slide)

  E[m(s, t+1)] ≥ (û(s,t) / f̄(t)) · m(s,t) · [1 - p_c · d(s)/(l-1)] · (1 - p_m)^o(s)

where
  m(s,t)  = number of instances of schema s at time t
  û(s,t)  = average fitness of the individuals in schema s at time t
  f̄(t)    = average fitness of the population at time t
  o(s)    = number of defined bits in schema s
  d(s)    = distance between the defined bits in s
  l       = length of the strings
  p_c     = probability of crossover
  p_m     = probability of mutation

Interpretation

- Fit schemas grow in influence
- What is missing?
  - Crossover?
  - Mutation?
  - How about time t+1?
- Other approaches:
  - Markov chains
  - Statistical mechanics

GA for Feature Selection

- Feature selection:
  - Select a subset of attributes (features)
  - Reason: too many, redundant, or irrelevant attributes
- The set of all subsets of attributes is very large
- Little structure to search
- Random search methods

Encoding

- Need a bit-code representation
- Have some n attributes
- Each attribute is either in (1) or out (0) of the selected set, for example:

    outlook  temperature  humidity  windy
       1          0           1       0

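A small Python sketch of this encoding: a bit string over a fixed attribute ordering, decoded back into the selected subset (the attribute names follow the weather example).

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]

def decode(bits):
    """Map a 0/1 string onto the subset of selected attributes."""
    return [name for name, b in zip(ATTRIBUTES, bits) if b == "1"]

print(decode("1010"))   # ['outlook', 'humidity']
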
Fitness

- Wrapper approach
  - Apply a learning algorithm, say a decision tree, to the individual x = {outlook, humidity}
  - Let fitness equal the error rate (minimize)
- Filter approach
  - Let fitness equal the entropy (minimize)
  - Other diversity measures can also be used
  - Simplicity measure?

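A sketch of the wrapper idea under stated assumptions: the course uses Weka, but here scikit-learn's decision tree and cross-validation stand in, and the cross-validated error rate is returned as the fitness to be minimized.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def wrapper_fitness(bits, X, y, cv=3):
    """Cross-validated error rate (to be minimized) for the attribute subset in `bits`.
    X is assumed to be a NumPy array of attribute columns, y the class labels."""
    cols = [i for i, b in enumerate(bits) if b == "1"]
    if not cols:                              # an empty subset gets the worst fitness
        return 1.0
    scores = cross_val_score(DecisionTreeClassifier(), X[:, cols], y, cv=cv)
    return 1.0 - scores.mean()
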
Crossover

(Diagram: the same single-point crossover applied to attribute-selection strings over outlook, temperature, humidity, and windy; the parents exchange the bits after the crossover point, producing two offspring attribute subsets.)

In Weka
Clustering Example

Create two clusters for:

  ID   Outlook   Temperature  Humidity  Windy  Play
  10   Sunny     Hot          High      True   No
  20   Overcast  Hot          High      False  Yes
  30   Rainy     Mild         High      False  Yes
  40   Rainy     Cool         Normal    False  Yes

Encode a clustering with one bit per instance (1 = first cluster, 0 = second cluster) and apply crossover:

  Parents                           Offspring
  1 1 0 0  = {10,20} {30,40}        1 1 0 1  = {10,20,40} {30}
  0 1 0 1  = {20,40} {10,30}        0 1 0 0  = {20} {10,30,40}

Discussion

- GA is a flexible and powerful random search methodology
- Efficiency depends on how well you can encode the solutions in a way that works with the crossover operator
- In data mining, attribute selection is the most natural application

Attribute Selection in Unsupervised Learning

- Attribute selection typically uses a measure, such as accuracy, that is directly related to the class attribute
- How do we apply attribute selection to unsupervised learning, such as clustering?
- Need a measure
  - compactness of clusters
  - separation among clusters
  - multiple measures

Quality Measures

- Compactness, with cluster membership indicator

    γ_ik = 1 if instance x_i belongs to cluster k, and 0 otherwise

    F_within = 1 - (1/Z_within) · (1/d) · Σ_{k=1..K} Σ_{i=1..n} Σ_{j∈J} γ_ik (x_ij - μ_kj)²

  with centroids

    μ_kj = Σ_{i=1..n} γ_ik x_ij / Σ_{i=1..n} γ_ik

  where n is the number of instances, K the number of clusters, J the set of selected attributes, d the number of (selected) attributes, and Z_within a normalization constant that keeps F_within in [0,1].

More Quality Measures

- Cluster separation

    F_between = 1 - (1/Z_bet) · (1/(d(K-1))) · Σ_{k=1..K} Σ_{i=1..n} Σ_{j∈J} (1 - γ_ik)(x_ij - μ_kj)²

Final Quality Measures

- Adjustment for bias in the number of clusters

    F_clusters = 1 - (K - K_min) / (K_max - K_min)

- Complexity

    F_complexity = 1 - (d - 1) / (D - 1)

  where d is the number of selected attributes and D the total number of attributes.

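A Python sketch of these measures under stated assumptions: the normalization constants Z_within and Z_bet are taken as 1 here (the slides only say that they normalize the scores), so absolute values need not match the worked numbers on the later slides.

import numpy as np

def centroids(X, labels, K):
    """mu[k, j] = mean of attribute j over the instances assigned to cluster k."""
    return np.array([X[labels == k].mean(axis=0) for k in range(K)])

def f_within(X, labels, K, Z_within=1.0):
    """Compactness: 1 minus the normalized within-cluster squared distances."""
    mu, d = centroids(X, labels, K), X.shape[1]
    total = sum(((X[labels == k] - mu[k]) ** 2).sum() for k in range(K))
    return 1.0 - total / (Z_within * d)

def f_between(X, labels, K, Z_bet=1.0):
    """Separation: based on distances of instances to the centroids of other clusters."""
    mu, d = centroids(X, labels, K), X.shape[1]
    total = sum(((X[labels != k] - mu[k]) ** 2).sum() for k in range(K))
    return 1.0 - total / (Z_bet * d * (K - 1))

def f_clusters(K, K_min, K_max):
    return 1.0 - (K - K_min) / (K_max - K_min)

def f_complexity(d, D):
    return 1.0 - (d - 1) / (D - 1)
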
Wrapper Framework

- Loop:
  - Obtain an attribute subset
  - Apply the k-means algorithm
  - Evaluate cluster quality
- Until the stopping criterion is satisfied

Problem

- What is the optimal attribute subset?
- What is the optimal number of clusters?
- Try to find both simultaneously

Example

Find an attribute subset and the optimal number of clusters (K_min = 2, K_max = 3) for:

  ID   Sepal Length  Sepal Width  Petal Length  Petal Width
  10   5.0           3.5          1.6           0.6
  20   5.1           3.8          1.9           0.4
  30   4.8           3.0          1.4           0.3
  40   5.1           3.8          1.6           0.2
  50   4.6           3.2          1.4           0.2
  60   6.5           2.8          4.6           1.5
  70   5.7           2.8          4.5           1.3
  80   6.3           3.3          4.7           1.6
  90   4.9           2.4          3.3           1.0
  100  6.6           2.9          4.6           1.3

Formulation

- Define an individual: four bits for the selected attributes plus one bit for the number of clusters

    [ * * * * | * ]
      selected    number of
      attributes  clusters

- Initial population:

    0 1 0 1 1
    1 0 0 1 0

Evaluate Fitness

- Start with 0 1 0 1 1
- Three clusters and {Sepal Width, Petal Width}:

  ID   Sepal Width  Petal Width
  10   3.5          0.6
  20   3.8          0.4
  30   3.0          0.3
  40   3.8          0.2
  50   3.2          0.2
  60   2.8          1.5
  70   2.8          1.3
  80   3.3          1.6
  90   2.4          1.0
  100  2.9          1.3

- Apply k-means with k = 3

K-Means

- Start with random centroids: instances 10, 70, and 80

(Figure: scatter plot of Petal Width versus Sepal Width showing the ten instances and the initial centroids.)

New Centroids

(Figure: the recomputed centroids on the Petal Width versus Sepal Width plot.)

- No change in assignment, so terminate the k-means algorithm

Quality of Clusters

- Centers
  - Center 1 at (3.46, 0.34): {10, 20, 30, 40, 50}
  - Center 2 at (3.30, 1.60): {80}
  - Center 3 at (2.73, 1.28): {60, 70, 90, 100}
- Evaluation

    F_within = 0.55
    F_between = 6.60
    F_clusters = 0.00
    F_complexity = 0.67

Next Individual

- Now look at 1 0 0 1 0
- Two clusters and {Sepal Length, Petal Width}:

  ID   Sepal Length  Petal Width
  10   5.0           0.6
  20   5.1           0.4
  30   4.8           0.3
  40   5.1           0.2
  50   4.6           0.2
  60   6.5           1.5
  70   5.7           1.3
  80   6.3           1.6
  90   4.9           1.0
  100  6.6           1.3

- Apply k-means with k = 2

K-Means

- Say we select instances 20 and 90 as the initial centroids

(Figure: scatter plot of Petal Width versus Sepal Length with the initial centroids.)

Recalculate Centroids

(Figure: the updated centroids C1 and C2 on the Petal Width versus Sepal Length plot.)

Recalculate Again

(Figure: the centroids after the next update.)

- No change in assignment, so terminate the k-means algorithm

Quality of Clusters

- Centers
  - Center 1 at (4.92, 0.45): {10, 20, 30, 40, 50, 90}
  - Center 2 at (6.28, 1.43): {60, 70, 80, 100}
- Evaluation

    F_within = 0.39
    F_between = 14.59
    F_clusters = 1.00
    F_complexity = 0.67

Compare Individuals

  0 1 0 1 1                 1 0 0 1 0
  F_within = 0.55           F_within = 0.39
  F_between = 6.60          F_between = 14.59
  F_clusters = 0.00         F_clusters = 1.00
  F_complexity = 0.67       F_complexity = 0.67

- Which is fitter?

Evaluating Fitness

- Can scale the measures (if necessary)
- Then weight them together, e.g., as an equally weighted average:

    fitness(0 1 0 1 1) = (0.55/0.55 + 6.60/14.59 + 0.00 + 0.67) / 4 ≈ 0.53
    fitness(1 0 0 1 0) = (0.39/0.55 + 14.59/14.59 + 1.00 + 0.67) / 4 ≈ 0.84

- Alternatively, we can use Pareto optimization

Mathematical Programming

- Continuous decision variables
- Constrained versus unconstrained
- Form of the objective function
  - Linear Programming (LP)
  - Quadratic Programming (QP)
  - General Mathematical Programming (MP)

Linear Program

  max  f(x) = 2 + 0.5x
  s.t. 0 ≤ x ≤ 15

(Figure: the objective increases in x, so the optimal solution lies at the upper bound x = 15.)

Two-Dimensional Problem

  max  12x1 + 9x2
  s.t. x1 ≤ 1500
       x2 ≤ 1500
       x1 + x2 ≤ 1750
       4x1 + 2x2 ≤ 4800
       x1, x2 ≥ 0

(Figure: the feasible region bounded by these constraints, the objective contours 12x1 + 9x2 = 6000 and 12x1 + 9x2 = 12000, and the optimal solution marked at a corner of the region.)

- The optimum is always at an extreme point

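A sketch of solving this LP numerically in Python. SciPy is not part of the course material, so treat it as an illustrative stand-in for whatever LP solver is at hand; linprog minimizes, hence the negated objective.

from scipy.optimize import linprog

# max 12*x1 + 9*x2  is  min -12*x1 - 9*x2
c = [-12, -9]
A_ub = [[1, 0],     # x1           <= 1500
        [0, 1],     # x2           <= 1500
        [1, 1],     # x1 + x2      <= 1750
        [4, 2]]     # 4*x1 + 2*x2  <= 4800
b_ub = [1500, 1500, 1750, 4800]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x, -res.fun)   # extreme-point optimum; (650, 1100) with value 17700 for these constraints
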
Simplex Method

(Figure: the same feasible region; the simplex method moves from extreme point to extreme point along the constraint boundaries 4x1 + 2x2 ≤ 4800, x1 ≤ 1500, x2 ≤ 1500, x1 + x2 ≤ 1750, x1 ≥ 0, x2 ≥ 0.)

Quadratic Programming

  f(x) = 0.2 + (x - 1)²

(Figure: plot of f(x) on [0, 2] with its minimum at x = 1.)

  f'(x) = 2(x - 1)
  2(x - 1) = 0  ⇒  x = 1

General MP

(Figure: a function with several stationary points, each marked f'(x) = 0.)

- The derivative being zero is a necessary but not a sufficient condition

Constrained Problem?

(Figure: with a constraint such as x ≤ 10, the optimum can lie on the boundary of the feasible region, where f'(x) ≠ 0.)

General MP

We write a general mathematical program in matrix notation as:

  min  f(x)
  s.t. h(x) = 0
       g(x) ≤ 0

where
  x = vector of decision variables
  h(x) = (h_1(x), h_2(x), ..., h_m(x))
  g(x) = (g_1(x), g_2(x), ..., g_m(x))

Karush-Kuhn-Tucker (KKT) Conditions

If x* is a relative minimum for

  min  f(x)
  s.t. h(x) = 0
       g(x) ≤ 0

then there exist λ and μ ≥ 0 such that

  ∇f(x*) + λᵀ ∇h(x*) + μᵀ ∇g(x*) = 0
  μᵀ g(x*) = 0

Convex Sets

A set C is convex if any line connecting two points in the set lies completely within the set, that is,

  ∀ x1, x2 ∈ C, ∀ α ∈ (0,1):  α x1 + (1 - α) x2 ∈ C

(Figure: a convex set and a non-convex set.)

Convex Hull

- The convex hull co(S) of a set S is the intersection of all convex sets containing S
- A set V ⊆ R^n is a linear variety if

    ∀ x1, x2 ∈ V, ∀ α ∈ R:  α x1 + (1 - α) x2 ∈ V

Hyperplane

A hyperplane in R^n is an (n-1)-dimensional linear variety.

(Figures: a hyperplane in R², a line, and in R³, a plane.)

Convex Hull Example

(Figure: Humidity versus Temperature plot of the Play and No Play instances with the convex hull of each class, their closest points c and d, and the separating hyperplane bisecting the segment between the closest points.)

Finding the Closest Points

Formulate as a QP:

  min  (1/2) ‖c - d‖²
  s.t. c = Σ_{i: Play=Yes} α_i x_i
       d = Σ_{i: Play=No} α_i x_i
       Σ_{i: Play=Yes} α_i = 1
       Σ_{i: Play=No} α_i = 1
       α_i ≥ 0

Support Vector Machines

(Figure: Humidity versus Temperature plot of the Play and No Play instances with the separating hyperplane and the support vectors highlighted.)

Example

  ID   Sepal Width  Petal Width
  10   3.5          0.6
  20   3.8          0.4
  30   3.0          0.3
  40   3.8          0.2
  50   3.2          0.2
  60   2.8          1.5
  70   2.8          1.3
  80   3.3          1.6
  90   2.4          1.0
  100  2.9          1.3

Separating Hyperplane

(Figure: scatter plot of the ten instances with a separating hyperplane between the two groups.)

Assume Separating Planes

Constraints:

  x_i · w + b ≥ +1  for all i with y_i = +1
  x_i · w + b ≤ -1  for all i with y_i = -1

Distance to each plane: 1/‖w‖

Optimization Problem

  max_{w,b}  2/‖w‖
  subject to
    x_i · w + b ≥ +1,  ∀ i: y_i = +1
    x_i · w + b ≤ -1,  ∀ i: y_i = -1

How Do We Solve MPs?

(Figure: a two-variable example plotted over the range [-5, 5] in each variable.)

Improving Search

- Direction-step approach:

    x^(k+1) = x^(k) + λ Δx^(k)

  where x^(k) is the current solution, Δx^(k) the search direction, λ the step size, and x^(k+1) the new solution.

Steepest Descent

- Search direction equal to the negative gradient:

    Δx^(k) = -∇f(x^(k))

- Finding the step size λ is a one-dimensional optimization problem of minimizing

    φ(λ) = f(x^(k) + λ Δx^(k))

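A minimal Python sketch of steepest descent; the one-dimensional minimization over λ is done with a crude grid search here rather than an exact line search, and the gradient is supplied by the caller. Both are illustrative choices, not prescriptions from the slide.

import numpy as np

def steepest_descent(f, grad, x0, iters=100):
    """Move along -grad f(x), choosing the step size by minimizing
    phi(lambda) = f(x + lambda * direction) over a small grid of candidate steps."""
    x = np.asarray(x0, dtype=float)
    steps = np.logspace(-4, 0, 20)                   # candidate step sizes (assumption)
    for _ in range(iters):
        direction = -grad(x)
        lam = min(steps, key=lambda t: f(x + t * direction))
        x = x + lam * direction
    return x

# Toy problem: f(x) = (x1 - 1)^2 + 2*(x2 + 3)^2, minimum at (1, -3)
f = lambda x: (x[0] - 1) ** 2 + 2 * (x[1] + 3) ** 2
grad = lambda x: np.array([2 * (x[0] - 1), 4 * (x[1] + 3)])
print(steepest_descent(f, grad, x0=[0.0, 0.0]))
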
Newton's Method

- Taylor series expansion:

    f(x) ≈ f(x^(k)) + ∇f(x^(k)) (x - x^(k)) + (1/2) (x - x^(k))ᵀ F(x^(k)) (x - x^(k))

- The right-hand side is minimized at

    x^(k+1) = x^(k) - [F(x^(k))]⁻¹ ∇f(x^(k))

  where F is the Hessian.

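A matching sketch of the Newton update; solving the linear system F·d = ∇f instead of forming the inverse Hessian explicitly is a standard implementation choice, not something the slide prescribes.

import numpy as np

def newton(grad, hess, x0, iters=20):
    """Newton's method: x <- x - F(x)^(-1) grad f(x), implemented via a linear solve."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - np.linalg.solve(hess(x), grad(x))
    return x

# Same toy quadratic as above: converges to (1, -3) in a single step
grad = lambda x: np.array([2 * (x[0] - 1), 4 * (x[1] + 3)])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 4.0]])
print(newton(grad, hess, x0=[0.0, 0.0]))
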
Discussion

- Computing the inverse Hessian is difficult
- Quasi-Newton methods:

    x^(k+1) = x^(k) - λ_k S_k ∇f(x^(k))

- Conjugate gradient methods
- Does not account for constraints
  - Penalty methods
  - Lagrangian methods, etc.

Non-separable

Add an error term to the constraints:

  x_i · w + b ≥ +1 - ξ_i,  ∀ i: y_i = +1
  x_i · w + b ≤ -1 + ξ_i,  ∀ i: y_i = -1
  ξ_i ≥ 0,  ∀ i

Wolfe Dual

  max_α  Σ_i α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i · x_j)
  subject to
    0 ≤ α_i ≤ C
    Σ_i α_i y_i = 0

- The dot product x_i · x_j is the only place the data appears
- The constraints are simple

Extension to Non-Linear

- Kernel functions

    K(x, y) = Φ(x) · Φ(y)

  where the mapping Φ: R^n → H goes into a high-dimensional Hilbert space H
- The kernel takes the place of the dot product in the Wolfe dual

Some Possible Kernels

  K(x, y) = (x · y + 1)^p
  K(x, y) = exp(-‖x - y‖² / (2σ²))
  K(x, y) = tanh(κ x · y - δ)

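The same three kernels written out in Python; the parameter values in the demo call are arbitrary.

import numpy as np

def poly_kernel(x, y, p=2):
    """Polynomial kernel: (x . y + 1)^p"""
    return (np.dot(x, y) + 1) ** p

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian kernel: exp(-||x - y||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, delta=1.0):
    """Sigmoid kernel: tanh(kappa * x . y - delta)"""
    return np.tanh(kappa * np.dot(x, y) - delta)

x, y = np.array([3.5, 0.6]), np.array([2.8, 1.5])
print(poly_kernel(x, y), rbf_kernel(x, y), sigmoid_kernel(x, y))
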
In Weka

- Weka.classifiers.smo
  - Support vector machine for nominal data only
  - Does both linear and non-linear models

Optimization in DM

(Recap of the overview diagram: Mathematical Programming → Support Vector Machines → classification, clustering, etc.; Steepest Descent Search → Neural Nets, Bayesian Networks (optimize parameters); Combinatorial Optimization → Genetic Algorithms → feature selection, classification, clustering.)

Bayesian Classification

- Naïve Bayes assumes independence between attributes
  - Simple computations
  - Best classifier if the assumption is true
- Bayesian Belief Networks
  - Joint probability distributions
  - Directed acyclic graphs
    - Nodes are random variables (attributes)
    - Arcs represent the dependencies

Example: Bayesian Network

Lung Cancer depends on Family History and Smoker.

Conditional probability table for Lung Cancer (LC):

         FH,S   FH,~S   ~FH,S   ~FH,~S
  LC     0.8    0.5     0.7     0.1
  ~LC    0.2    0.5     0.3     0.9

(Network diagram: Family History and Smoker are parents of Lung Cancer; the network also contains Emphysema, Positive X-Ray, and Dyspnea.)

- Lung Cancer is conditionally independent of Emphysema, given Family History and Smoker

Conditional Probabilities

  Pr(LungCancer = "yes" | FamilyHistory = "yes", Smoker = "yes") = 0.8
  Pr(LungCancer = "no" | FamilyHistory = "no", Smoker = "no") = 0.9

The joint probability of an outcome (z_1, ..., z_n) of the random variables Z_1, ..., Z_n factors as

  Pr(z_1, ..., z_n) = Π_{i=1..n} Pr(z_i | Parents(Z_i))

- The node representing the class attribute is called the output node

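A small Python sketch of this factorization for a fragment of the lung-cancer network. Only the CPT row shown on the slide is real; the priors for Family History and Smoker are invented purely for the demonstration.

# Joint probability as a product of Pr(z_i | Parents(Z_i)).
P_FH = {"yes": 0.3, "no": 0.7}                    # assumed prior, not on the slide
P_S = {"yes": 0.4, "no": 0.6}                     # assumed prior, not on the slide
P_LC = {("yes", "yes"): 0.8, ("yes", "no"): 0.5,  # Pr(LC = "yes" | FH, S) from the CPT
        ("no", "yes"): 0.7, ("no", "no"): 0.1}

def joint(fh, s, lc):
    """Pr(FH = fh, S = s, LC = lc) = Pr(fh) * Pr(s) * Pr(lc | fh, s)."""
    p_lc_yes = P_LC[(fh, s)]
    return P_FH[fh] * P_S[s] * (p_lc_yes if lc == "yes" else 1 - p_lc_yes)

print(joint("yes", "yes", "yes"))   # 0.3 * 0.4 * 0.8 = 0.096
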
How Do We Learn?

- Network structure
  - Given/known
  - Inferred or learned from the data
- Variables
  - Observable
  - Hidden (missing values / incomplete data)

Case 1: Known Structure and Observable Variables

- Straightforward
- Similar to Naïve Bayes
- Compute the entries of the conditional probability table (CPT) of each variable

Case 2: Known Structure and Some Hidden Variables

- Still need to learn the CPT entries
- Let S be a set of s training instances X_1, X_2, ..., X_s
- Let w_ijk be the CPT entry for the variable Y_i = y_ij having parents U_i = u_ik

CPT Example

  Y_i = LungCancer,  y_ij = "yes"
  U_i = {FamilyHistory, Smoker},  u_ik = {"yes", "yes"}

         FH,S   FH,~S   ~FH,S   ~FH,~S
  LC     0.8    0.5     0.7     0.1
  ~LC    0.2    0.5     0.3     0.9

so here w_ijk is the entry 0.8, and w = (w_ijk)_{i,j,k} collects all the CPT entries.

Objective

- Must find the values of w = (w_ijk)_{i,j,k}
- The objective is to maximize the likelihood of the data, that is,

    Pr_w(S) = Π_{d=1..s} Pr_w(X_d)

- How do we do this?

Non-Linear MP

- Compute gradients from the training data:

    ∂Pr_w(S)/∂w_ijk = Σ_{d=1..s} Pr(Y_i = y_ij, U_i = u_ik | X_d) / w_ijk

- Move in the direction of the gradient:

    w_ijk ← w_ijk + l · ∂Pr_w(S)/∂w_ijk

  where l is the learning rate.

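A rough Python sketch of this gradient update. Computing Pr(Y_i = y_ij, U_i = u_ik | X_d) requires inference in the network, so it is abstracted here as precomputed posteriors, and the renormalization of each CPT column after the step is an assumption the slide does not spell out.

from collections import defaultdict

def gradient_step(w, posteriors, learning_rate=0.01):
    """One gradient step on the CPT entries w[(i, j, k)].

    posteriors[(i, j, k)] is a list, over training instances X_d, of
    Pr(Y_i = y_ij, U_i = u_ik | X_d), assumed to come from an inference routine."""
    for key, probs in posteriors.items():
        grad = sum(p / w[key] for p in probs)          # d Pr_w(S) / d w_ijk
        w[key] = w[key] + learning_rate * grad         # move along the gradient

    # Renormalize so the entries for each (i, k) sum to one over j
    # (assumption: not stated explicitly on the slide).
    totals = defaultdict(float)
    for (i, j, k), value in w.items():
        totals[(i, k)] += value
    for (i, j, k) in w:
        w[(i, j, k)] /= totals[(i, k)]
    return w
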
Case 3: Unknown Network Structure

- Need to find/learn the optimal network structure for the data
- What type of optimization problem is this?
  - Combinatorial optimization (GA, etc.)