Overview

Apriori Algorithm
Socks Tie
Support is
 Confidence is

50%
(2/4)
66.67% (2/3)
TX1
Shoes,Socks,Tie
TX2
Shoes,Socks,Tie,Belt,Shirt
TX3
Shoes,Tie
TX4
Shoes,Socks,Belt
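A minimal Python check of these two numbers over the four transactions above (the helper name support is mine):

```python
# Support and confidence of the rule Socks -> Tie over the four transactions above.
transactions = [
    {"Shoes", "Socks", "Tie"},
    {"Shoes", "Socks", "Tie", "Belt", "Shirt"},
    {"Shoes", "Tie"},
    {"Shoes", "Socks", "Belt"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

sup_rule = support({"Socks", "Tie"}, transactions)        # 2/4 = 0.50
conf_rule = sup_rule / support({"Socks"}, transactions)   # (2/4)/(3/4) = 0.6667
print(f"support = {sup_rule:.0%}, confidence = {conf_rule:.2%}")
```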
Example

Five transactions from a supermarket:

TID  List of Items
1    Beer, Diaper, Baby Powder, Bread, Umbrella
2    Diaper, Baby Powder
3    Beer, Diaper, Milk
4    Diaper, Beer, Detergent
5    Beer, Milk, Coca-Cola
Step 1

Min_sup = 40% (2/5)

C1:
Item         Support
Beer         4/5
Diaper       4/5
Baby Powder  2/5
Bread        1/5
Umbrella     1/5
Milk         2/5
Detergent    1/5
Coca-Cola    1/5

L1:
Item         Support
Beer         4/5
Diaper       4/5
Baby Powder  2/5
Milk         2/5
Step 2 and Step 3


C2:
Item                 Support
Beer, Baby Powder    1/5
Beer, Diaper         3/5
Beer, Milk           2/5
Diaper, Baby Powder  2/5
Diaper, Milk         1/5
Baby Powder, Milk    0

L2:
Item                 Support
Beer, Diaper         3/5
Beer, Milk           2/5
Diaper, Baby Powder  2/5
Step 4

C3:
Item                       Support
Beer, Diaper, Baby Powder  1/5
Beer, Diaper, Milk         1/5
Beer, Milk, Baby Powder    0
Diaper, Baby Powder, Milk  0

With Min_sup = 40% (2/5), no candidate survives: L3 is empty.
Step 5

min_sup = 40%, min_conf = 70%

Rule (A → B)          Support(A,B)  Support(A)  Confidence
Beer → Diaper         60%           80%         75%
Beer → Milk           40%           80%         50%
Diaper → Baby Powder  40%           80%         50%
Diaper → Beer         60%           80%         75%
Milk → Beer           40%           40%         100%
Baby Powder → Diaper  40%           40%         100%
Results
Beer → Diaper: support 60%, confidence 75%
Diaper → Beer: support 60%, confidence 75%
Milk → Beer: support 40%, confidence 100%
Baby Powder → Diaper: support 40%, confidence 100%
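A compact Apriori sketch over the five-transaction table above, with min_sup = 40% and min_conf = 70%; it reproduces the frequent itemsets and the four rules listed in the Results (all helper names are mine):

```python
from itertools import combinations

transactions = [
    {"Beer", "Diaper", "Baby Powder", "Bread", "Umbrella"},
    {"Diaper", "Baby Powder"},
    {"Beer", "Diaper", "Milk"},
    {"Diaper", "Beer", "Detergent"},
    {"Beer", "Milk", "Coca-Cola"},
]
min_sup, min_conf, n = 0.4, 0.7, len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / n

# Level-wise search: L1, L2, L3, ... until no candidate survives min_sup.
items = sorted({i for t in transactions for i in t})
frequent, k, level = {}, 1, [frozenset([i]) for i in items]
while level:
    level = [c for c in level if support(c) >= min_sup]
    frequent.update({c: support(c) for c in level})
    # Candidate generation: join frequent k-itemsets into (k+1)-itemsets.
    level = list({a | b for a in level for b in level if len(a | b) == k + 1})
    k += 1

# Rule generation: A -> B for every non-empty proper subset A of a frequent itemset.
for itemset, sup in frequent.items():
    if len(itemset) < 2:
        continue
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            conf = sup / support(antecedent)
            if conf >= min_conf:
                print(f"{set(antecedent)} -> {set(itemset - antecedent)}: "
                      f"support {sup:.0%}, confidence {conf:.0%}")
```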
Construct FP-tree from a Transaction Database

min_support = 3

TID  Items bought              (ordered) frequent items
100  {f, a, c, d, g, i, m, p}  {f, c, a, m, p}
200  {a, b, c, f, l, m, o}     {f, c, a, b, m}
300  {b, f, h, j, o, w}        {f, b}
400  {b, c, k, s, p}           {c, b, p}
500  {a, f, c, e, l, p, m, n}  {f, c, a, m, p}

1. Scan DB once, find frequent 1-itemsets (single item patterns)
2. Sort frequent items in frequency descending order: the f-list
3. Scan DB again, construct the FP-tree

Header Table (item : frequency, with head links into the tree):
f : 4
c : 4
a : 3
b : 3
m : 3
p : 3

F-list = f-c-a-b-m-p

FP-tree:
{}
├── f:4
│   ├── c:3
│   │   └── a:3
│   │       ├── m:2
│   │       │   └── p:2
│   │       └── b:1
│   │           └── m:1
│   └── b:1
└── c:1
    └── b:1
        └── p:1
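A minimal sketch of the two database scans described above: count 1-item frequencies, order each transaction by the f-list, then insert the ordered frequent items as prefix paths into a trie (class and variable names are my own):

```python
from collections import Counter, defaultdict

transactions = [
    ["f", "a", "c", "d", "g", "i", "m", "p"],
    ["a", "b", "c", "f", "l", "m", "o"],
    ["b", "f", "h", "j", "o", "w"],
    ["b", "c", "k", "s", "p"],
    ["a", "f", "c", "e", "l", "p", "m", "n"],
]
min_support = 3

# Scan 1: item frequencies; f:4, c:4, a:3, b:3, m:3, p:3 pass min_support.
counts = Counter(i for t in transactions for i in t)
# Descending-frequency order; ties can be broken arbitrarily -- here we fix the slide's f-list.
flist = [i for i in ["f", "c", "a", "b", "m", "p"] if counts[i] >= min_support]
rank = {item: r for r, item in enumerate(flist)}

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}

root = Node(None, None)
header = defaultdict(list)        # item -> nodes in the tree (the "head" links)

# Scan 2: insert each transaction's ordered frequent items as a prefix path.
for t in transactions:
    node = root
    for item in sorted((i for i in t if i in rank), key=rank.get):
        if item not in node.children:
            node.children[item] = Node(item, node)
            header[item].append(node.children[item])
        node = node.children[item]
        node.count += 1

def show(node, depth=0):          # prints the same node counts as the slide's tree
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

show(root)
```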
Find Patterns Having p From p-conditional Database

• Starting at the frequent item header table in the FP-tree
• Traverse the FP-tree by following the link of each frequent item p
• Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base

Conditional pattern bases (read off the FP-tree and header table above):

item  cond. pattern base
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1
p     fcam:2, cb:1
From Conditional Pattern-bases to Conditional FP-trees

• For each pattern-base
  • Accumulate the count for each item in the base
  • Construct the FP-tree for the frequent items of the pattern base

(FP-tree and header table as above.)

m-conditional pattern base: fca:2, fcab:1

m-conditional FP-tree (b is dropped: its count 1 is below min_support = 3):
{}
└── f:3
    └── c:3
        └── a:3

All frequent patterns relating to m:
m, fm, cm, am, fcm, fam, cam, fcam
→ associations
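Because the m-conditional FP-tree is a single path (f:3, c:3, a:3), every frequent pattern containing m is simply a subset of {f, c, a} combined with m; a quick sketch of that enumeration (names are mine):

```python
from itertools import combinations

# Single prefix path of the m-conditional FP-tree, with counts (all >= min_support = 3).
path = [("f", 3), ("c", 3), ("a", 3)]
suffix = "m"

# Every combination of path items, appended with the suffix, is a frequent pattern;
# its support is the minimum count along the chosen items (here always 3).
patterns = {}
for r in range(len(path) + 1):
    for combo in combinations(path, r):
        items = "".join(i for i, _ in combo) + suffix
        patterns[items] = min((c for _, c in combo), default=3)

print(sorted(patterns))   # ['am', 'cam', 'cm', 'fam', 'fcam', 'fcm', 'fm', 'm']
```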
• The Data Warehouse Toolkit, Ralph Kimball and Margy Ross, 2nd ed., 2002
k-means Clustering

Squared Euclidean distance:

    d^2(x, z) = \sum_{i=1}^{d} (x_i - z_i)^2

Cluster centers c_1, c_2, ..., c_k with clusters C_1, C_2, ..., C_k

Error:

    E = \sum_{j=1}^{k} \sum_{x \in C_j} d^2(x, c_j)

The error function has a local minimum when each cluster center c_j is the mean of the points assigned to C_j.
k-means Example (K=2)

Pick seeds → Reassign clusters → Compute centroids → Reassign clusters →
Compute centroids → Reassign clusters → Converged!
Algorithm

Random initialization of k cluster centers
do {
    assign each x_i in the dataset to the nearest cluster center (centroid) c_j according to d^2
    compute all new cluster centers
} until ( |E_new - E_old| < ε  or  number of iterations ≥ max_iterations )
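A small NumPy sketch of this loop, assuming random seeds drawn from the data, squared-Euclidean assignment, and a stop when the error change falls below a tolerance (array names, the toy data, and the tolerance are my own choices):

```python
import numpy as np

def kmeans(X, k, max_iterations=100, eps=1e-6, rng=np.random.default_rng(0)):
    # Random initialization of k cluster centers, drawn from the data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    E_old = np.inf
    for _ in range(max_iterations):
        # Assign each x_i to the nearest center according to squared Euclidean distance.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Compute all new cluster centers (mean of the assigned points).
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        E_new = d2[np.arange(len(X)), labels].sum()
        if abs(E_new - E_old) < eps:
            break
        E_old = E_new
    return centers, labels

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5], [1.0, 0.5], [8.5, 9.0]])
centers, labels = kmeans(X, k=2)
print(centers, labels)
```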
k-Means vs. Mixture of Gaussians

• Both are iterative algorithms to assign points to clusters

• k-Means: minimize

    E = \sum_{j=1}^{k} \sum_{x \in C_j} d^2(x, c_j)

• MixGaussian: maximize P(x|C=i), with

    P(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right)

• Mixture of Gaussians is the more general formulation
• Equivalent to k-Means when \Sigma_i = I, P(C=i) = 1/k, and membership is hard (1 for the nearest cluster C_i, 0 otherwise)

Tree Clustering

• Tree clustering algorithms allow us to reveal the internal similarities of a given pattern set
  and to structure these similarities hierarchically
• Applied to a small set of typical patterns
• For n patterns these algorithms generate a sequence of 1 to n clusters
Example

• Similarity between two clusters is assessed by measuring the similarity of the furthest pair
  of patterns (each one from a distinct cluster)
• This is the so-called complete linkage rule
Impact of cluster distance measures

• "Single-Link" (inter-cluster distance = distance between the closest pair of points)
• "Complete-Link" (inter-cluster distance = distance between the farthest pair of points)
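A short illustration of how the linkage choice is specified in practice, using SciPy's agglomerative clustering on a toy point set of my own:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two groups of points; single-link merges via closest pairs (and tends to "chain"),
# complete-link merges via farthest pairs (and prefers compact clusters).
X = np.array([[0, 0], [1, 0], [2, 0], [3, 0],
              [0, 3], [1, 3], [2, 3], [3, 3]], dtype=float)

for method in ("single", "complete"):
    Z = linkage(X, method=method)                     # agglomerative merge tree
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut it into 2 clusters
    print(method, labels)
```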
There are two criteria proposed for clustering evaluation and selection of an optimal
clustering scheme (Berry and Linoff, 1996):

• Compactness: the members of each cluster should be as close to each other as possible.
  A common measure of compactness is the variance, which should be minimized
• Separation: the clusters themselves should be widely spaced
Dunn index

Inter-cluster distance (closest pair, one point from each cluster):

    d(C_i, C_j) = \min_{x \in C_i, \, y \in C_j} d(x, y)

Cluster diameter (farthest pair within a cluster):

    diam(C_i) = \max_{x, y \in C_i} d(x, y)

Dunn index for a k-cluster scheme:

    D_k = \min_{1 \le i \le k} \left\{ \min_{\substack{1 \le j \le k \\ j \ne i}} \left[ \frac{d(C_i, C_j)}{\max_{1 \le l \le k} diam(C_l)} \right] \right\}
The Davies-Bouldin (DB) index (1979)

With the same d(C_i, C_j) and diam(C_i) as above:

    DB_k = \frac{1}{k} \sum_{i=1}^{k} \max_{i \ne j} \left[ \frac{diam(C_i) + diam(C_j)}{d(C_i, C_j)} \right]
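A direct, unoptimized translation of these two indices into Python (pairwise Euclidean distances; the toy clusters and all names are mine):

```python
import numpy as np
from itertools import combinations

def d(x, y):
    return np.linalg.norm(x - y)

def set_distance(Ci, Cj):            # min over pairs, one point from each cluster
    return min(d(x, y) for x in Ci for y in Cj)

def diam(Ci):                        # max over pairs within the cluster
    return max((d(x, y) for x, y in combinations(Ci, 2)), default=0.0)

def dunn(clusters):
    max_diam = max(diam(C) for C in clusters)
    return min(set_distance(Ci, Cj) / max_diam
               for Ci, Cj in combinations(clusters, 2))

def davies_bouldin(clusters):
    k = len(clusters)
    return sum(max((diam(Ci) + diam(Cj)) / set_distance(Ci, Cj)
                   for j, Cj in enumerate(clusters) if j != i)
               for i, Ci in enumerate(clusters)) / k

clusters = [np.array([[0, 0], [0, 1], [1, 0]], float),
            np.array([[5, 5], [5, 6], [6, 5]], float)]
print(dunn(clusters), davies_bouldin(clusters))   # large Dunn / small DB = good separation
```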
• Pattern Classification (2nd ed.), Richard O. Duda, Peter E. Hart, and David G. Stork, Wiley Interscience, 2001
• Pattern Recognition: Concepts, Methods and Applications, Joaquim P. Marques de Sá, Springer-Verlag, 2001
3-Nearest Neighbors

(Figure: a query point q and its 3 nearest neighbors: two of class "x", one of class "o".)
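A minimal 3-NN classifier, predicting by majority vote among the closest training points (toy data and all names are my own):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, q, k=3):
    # Distances from the query point q to every training point.
    dists = np.linalg.norm(X_train - q, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k nearest neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]        # majority class among the k neighbors

X_train = np.array([[1, 1], [2, 1], [1, 2], [6, 6], [7, 6], [6, 7]], float)
y_train = np.array(["x", "x", "x", "o", "o", "o"])
print(knn_predict(X_train, y_train, q=np.array([2.0, 2.0])))   # -> "x" (2 x's, 1 o at most)
```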
• Machine Learning, Tom M. Mitchell, McGraw Hill, 1997
Bayes

Naive Bayes
Example
• Does the patient have cancer or not?
  A patient takes a lab test and the result comes back positive. The test returns a correct
  positive result (+) in only 98% of the cases in which the disease is actually present, and a
  correct negative result (-) in only 97% of the cases in which the disease is not present.
  Furthermore, 0.008 of the entire population have this cancer.

Suppose a positive result (+) is returned...
Normalization

    P(cancer \mid +) = \frac{0.0078}{0.0078 + 0.0298} = 0.20745

    P(\neg cancer \mid +) = \frac{0.0298}{0.0078 + 0.0298} = 0.79255
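The two numerators are the Bayes' rule terms P(+|cancer)P(cancer) and P(+|¬cancer)P(¬cancer); a quick check in Python (the slide rounds them to 0.0078 and 0.0298 first, hence its 0.20745):

```python
# Prior and test characteristics from the example above.
p_cancer = 0.008
p_pos_given_cancer = 0.98        # correct positive rate (sensitivity)
p_neg_given_no_cancer = 0.97     # correct negative rate (specificity)

# Unnormalized posteriors P(+|h) * P(h) for both hypotheses.
num_cancer = p_pos_given_cancer * p_cancer                    # 0.98 * 0.008 = 0.00784
num_no_cancer = (1 - p_neg_given_no_cancer) * (1 - p_cancer)  # 0.03 * 0.992 = 0.02976

z = num_cancer + num_no_cancer
print(num_cancer / z, num_no_cancer / z)   # ~0.2085 and ~0.7915
```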
The result of Bayesian inference depends strongly on the prior probabilities, which must be
available in order to apply the method.
Belief Networks

Burglary:   P(B) = 0.001
Earthquake: P(E) = 0.002

Alarm (parents: Burglary, Earthquake):
Burg.  Earth.  P(A)
t      t       .95
t      f       .94
f      t       .29
f      f       .001

JohnCalls (parent: Alarm):
A  P(J)
t  .90
f  .05

MaryCalls (parent: Alarm):
A  P(M)
t  .70
f  .01
Full Joint Distribution

    P(x_1, ..., x_n) = \prod_{i=1}^{n} P(x_i \mid parents(X_i))

    P(j \wedge m \wedge a \wedge \neg b \wedge \neg e)
      = P(j \mid a) P(m \mid a) P(a \mid \neg b \wedge \neg e) P(\neg b) P(\neg e)
      = 0.9 \times 0.7 \times 0.001 \times 0.999 \times 0.998 \approx 0.00062
P(Burglary | JohnCalls = true, MaryCalls = true)

• The hidden variables of the query are Earthquake and Alarm

    P(B \mid j, m) = \alpha P(B, j, m) = \alpha \sum_e \sum_a P(B, e, a, j, m)

• For Burglary = true in the Bayesian network:

    P(b \mid j, m) = \alpha \sum_e \sum_a P(b) P(e) P(a \mid b, e) P(j \mid a) P(m \mid a)

• P(b) is constant and can be moved outside the summations, and the P(e) term can be moved
  outside the summation over a:

    P(b \mid j, m) = \alpha P(b) \sum_e P(e) \sum_a P(a \mid b, e) P(j \mid a) P(m \mid a)

• Given JohnCalls = true and MaryCalls = true, the probability that a burglary has occurred
  is about 28%:

    P(B \mid j, m) = \alpha \langle 0.00059224, 0.0014919 \rangle \approx \langle 0.284, 0.716 \rangle

Computation for Burglary = true
(Figure: the corresponding computation tree.)
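A short enumeration sketch for exactly this query, using the CPTs of the network above (function and variable names are my own):

```python
# P(Burglary | JohnCalls=true, MaryCalls=true) by enumerating Earthquake and Alarm.
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(a=true | B, E)
P_J = {True: 0.90, False: 0.05}                       # P(j=true | A)
P_M = {True: 0.70, False: 0.01}                       # P(m=true | A)

def unnormalized(b):
    total = 0.0
    for e in (True, False):
        for a in (True, False):
            p_a = P_A[(b, e)] if a else 1 - P_A[(b, e)]
            p_j = P_J[a]       # evidence: JohnCalls = true
            p_m = P_M[a]       # evidence: MaryCalls = true
            total += P_B[b] * P_E[e] * p_a * p_j * p_m
    return total

scores = {b: unnormalized(b) for b in (True, False)}   # 0.00059224 and 0.0014919
z = sum(scores.values())
print({b: s / z for b, s in scores.items()})           # ~{True: 0.284, False: 0.716}
```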
• Artificial Intelligence - A Modern Approach, Second Edition, S. Russell and P. Norvig, Prentice Hall, 2003
ID3 - Tree learning
• The credit history loan table has the following information:
  p(risk is high) = 6/14
  p(risk is moderate) = 3/14
  p(risk is low) = 5/14

    I(credit\_table) = -\frac{6}{14} \log_2 \frac{6}{14} - \frac{3}{14} \log_2 \frac{3}{14} - \frac{5}{14} \log_2 \frac{5}{14}

    I(credit\_table) = 1.531 \text{ bits}
• In the credit history loan table we make income the property tested at the root
• This makes the division into
  C1 = {1, 4, 7, 11}, C2 = {2, 3, 12, 14}, C3 = {5, 6, 8, 9, 10, 13}

    E(income) = \frac{4}{14} I(C_1) + \frac{4}{14} I(C_2) + \frac{6}{14} I(C_3)
              = \frac{4}{14} \cdot 0 + \frac{4}{14} \cdot 1.0 + \frac{6}{14} \cdot 0.65
              = 0.564 \text{ bits}

    gain(income) = I(credit\_table) - E(income) = 1.531 - 0.564 = 0.967 \text{ bits}

    gain(credit history) = 0.266
    gain(debt) = 0.581
    gain(collateral) = 0.756
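A quick numeric check of these figures. The class counts inside each partition are not listed in the text above; the counts used below are chosen to reproduce the quoted values I(C1) = 0, I(C2) = 1.0, and I(C3) = 0.65 (matching the credit table the slide refers to):

```python
from math import log2

def I(counts):
    """Entropy (in bits) of a class-count distribution."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

# Whole table: 6 high-risk, 3 moderate-risk, 5 low-risk examples.
I_table = I([6, 3, 5])                                    # 1.531 bits

# Partition on income: C1 = {1,4,7,11}, C2 = {2,3,12,14}, C3 = {5,6,8,9,10,13}.
# Class counts per partition, consistent with I(C1)=0, I(C2)=1.0, I(C3)=0.65.
partitions = [[4], [2, 2], [1, 5]]
E_income = sum(sum(p) / 14 * I(p) for p in partitions)    # 0.564 bits

print(round(I_table, 3), round(E_income, 3), round(I_table - E_income, 3))  # gain ~ 0.967
```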
Overfitting

Consider the error of hypothesis h over
• Training data: error_train(h)
• Entire distribution D of data: error_D(h)

Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h' ∈ H such that
    error_train(h) < error_train(h')
and
    error_D(h) > error_D(h')
An ID3 tree consistent with the data

Hair Color
├── Blond → Lotion Used
│             ├── No  → Sunburned (Sarah, Annie)
│             └── Yes → Not Sunburned (Dana, Katie)
├── Red   → Sunburned (Emily)
└── Brown → Not Sunburned (Alex, Pete, John)
Corresponding rules by C4.5

If the person's hair color is blond
  and the person uses lotion
then nothing happens

If the person's hair color is blond
  and the person uses no lotion
then the person turns red

If the person's hair color is red
then the person turns red

If the person's hair color is brown
then nothing happens

With a default rule:

If the person uses lotion
then nothing happens

If the person's hair color is brown
then nothing happens

If no other rule applies
then the person turns red
• Artificial Intelligence, Patrick Henry Winston, Addison-Wesley, 1992
• Artificial Intelligence - Structures and Strategies for Complex Problem Solving, Second Edition, G. L. Luger and W. A. Stubblefield, Benjamin/Cummings Publishing, 1993
• Machine Learning, Tom M. Mitchell, McGraw Hill, 1997
Perceptron

• Limitations
• Gradient descent

XOR problem and Perceptron

• Shown by Minsky and Papert in the mid-1960s
Gradient Descent

• To understand it, consider the simpler linear unit, where

    o = \sum_{i=0}^{n} w_i x_i

• Let's learn the w_i that minimize the squared error over the training data
  D = {(x_1, t_1), (x_2, t_2), ..., (x_d, t_d), ..., (x_m, t_m)}  (t for target)
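A sketch of batch gradient descent for this linear unit, minimizing E(w) = ½ Σ_d (t_d − o_d)² with the update Δw_i = η Σ_d (t_d − o_d) x_{i,d}; the learning rate and the toy data are my own choices:

```python
import numpy as np

# Toy training data D = {(x_d, t_d)}; x includes a constant 1 for the bias weight w_0.
X = np.array([[1, 0.0], [1, 1.0], [1, 2.0], [1, 3.0]])   # columns: x_0 = 1, x_1
t = np.array([1.0, 3.0, 5.0, 7.0])                        # targets of t = 1 + 2*x_1

w = np.zeros(X.shape[1])
eta = 0.05                                                # learning rate

for epoch in range(2000):
    o = X @ w                         # linear unit output for every example
    grad = -(t - o) @ X               # dE/dw = -sum_d (t_d - o_d) x_d
    w -= eta * grad                   # gradient descent step

print(w)                              # converges near [1, 2]
```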
Feed-forward networks

• Back-Propagation
• Activation Functions

(Figure: a feed-forward network with inputs x_1, ..., x_5.)
• In our example E becomes

    E[\vec{w}] = \frac{1}{2} \sum_{d=1}^{m} \sum_{i} (t_i^d - o_i^d)^2

    E[\vec{w}] = \frac{1}{2} \sum_{d=1}^{m} \sum_{i} \left( t_i^d - f\Big( \sum_{j} W_{ij} \, f\big( \sum_{k=1}^{5} w_{jk} x_k^d \big) \Big) \right)^2

• E[w] is differentiable given f is differentiable
• Gradient descent can be applied
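Since E[w] is differentiable, the whole network can be trained by gradient descent (back-propagation). A compact NumPy sketch on the XOR problem that a single perceptron cannot solve; the architecture (3 sigmoid hidden units plus biases), learning rate, and seed are my own choices:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda z: 1.0 / (1.0 + np.exp(-z))              # sigmoid activation

# XOR: not linearly separable, so one hidden layer is needed.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
t = np.array([[0.0], [1.0], [1.0], [0.0]])

def add_bias(A):                                     # append a constant-1 input (bias)
    return np.hstack([A, np.ones((len(A), 1))])

W1 = rng.normal(0, 1, (3, 3))                        # (2 inputs + bias) -> 3 hidden
W2 = rng.normal(0, 1, (4, 1))                        # (3 hidden + bias) -> 1 output
eta = 0.5

for epoch in range(20000):
    h = f(add_bias(X) @ W1)                          # hidden activations
    o = f(add_bias(h) @ W2)                          # network output
    # Back-propagate the error E = 1/2 sum_d sum_i (t - o)^2 through both layers.
    delta_o = (o - t) * o * (1 - o)
    delta_h = (delta_o @ W2[:3].T) * h * (1 - h)     # drop the bias row of W2
    W2 -= eta * add_bias(h).T @ delta_o
    W1 -= eta * add_bias(X).T @ delta_h

print(np.round(o, 2))                                # should be close to [[0], [1], [1], [0]]
```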
RBF-network
RBF-networks
 Support Vector Machines

Extension to Non-linear Decision Boundary

• Possible problems of the transformation
  • High computation burden and hard to get a good estimate
• SVM solves these two issues simultaneously
  • Kernel tricks for efficient computation
  • Minimizing ||w||^2 can lead to a "good" classifier
(Figure: the mapping φ(·) sends points from the input space to the feature space.)
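A tiny numeric illustration of the kernel trick: for the degree-2 polynomial kernel K(x, z) = (x·z)², the feature-space dot product φ(x)·φ(z) with φ(x) = (x₁², √2 x₁x₂, x₂²) can be evaluated without ever forming φ explicitly (the kernel and mapping here are my example, not taken from the slide):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel (2-D input)."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

def K(x, z):
    """Kernel trick: same value as phi(x) . phi(z), computed in the input space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
print(phi(x) @ phi(z))   # 16.0
print(K(x, z))           # 16.0 -- identical, with no explicit mapping
```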
• Machine Learning, Tom M. Mitchell, McGraw Hill, 1997
• Neural Networks, Simon Haykin, Second Edition, Prentice Hall, 1999