Overview

Apriori Algorithm

Example rule: Socks -> Tie. Support is 50% (2/4), confidence is 66.67% (2/3).

  TX1  Shoes, Socks, Tie
  TX2  Shoes, Socks, Tie, Belt, Shirt
  TX3  Shoes, Tie
  TX4  Shoes, Socks, Belt

Example

Five transactions from a supermarket:

  TID  List of Items
  1    Beer, Diaper, Baby Powder, Bread, Umbrella
  2    Diaper, Baby Powder
  3    Beer, Diaper, Milk
  4    Diaper, Beer, Detergent
  5    Beer, Milk, Coca-Cola

Step 1 (min_sup = 40%, i.e. 2/5)

  C1:  Item          Support
       Beer          4/5
       Diaper        4/5
       Baby Powder   2/5
       Bread         1/5
       Umbrella      1/5
       Milk          2/5
       Detergent     1/5
       Coca-Cola     1/5

  L1:  Item          Support
       Beer          4/5
       Diaper        4/5
       Baby Powder   2/5
       Milk          2/5

Step 2 and Step 3

  C2:  Itemset               Support
       Beer, Diaper          3/5
       Beer, Baby Powder     1/5
       Beer, Milk            2/5
       Diaper, Baby Powder   2/5
       Diaper, Milk          1/5
       Baby Powder, Milk     0

  L2:  Itemset               Support
       Beer, Diaper          3/5
       Beer, Milk            2/5
       Diaper, Baby Powder   2/5

Step 4

  C3:  Itemset                     Support
       Beer, Diaper, Baby Powder   1/5
       Beer, Diaper, Milk          1/5
       Beer, Milk, Baby Powder     0
       Diaper, Baby Powder, Milk   0

  With min_sup = 40% (2/5), L3 is empty.

Step 5 (min_sup = 40%, min_conf = 70%)

  Rule A -> B             Support(A,B)   Support(A)   Confidence
  Beer -> Diaper          60%            80%          75%
  Beer -> Milk            40%            80%          50%
  Diaper -> Baby Powder   40%            80%          50%
  Diaper -> Beer          60%            80%          75%
  Milk -> Beer            40%            40%          100%
  Baby Powder -> Diaper   40%            40%          100%

Results (rules meeting min_conf = 70%)

  Beer -> Diaper          support 60%, confidence 75%
  Diaper -> Beer          support 60%, confidence 75%
  Milk -> Beer            support 40%, confidence 100%
  Baby Powder -> Diaper   support 40%, confidence 100%

Construct FP-tree from a Transaction Database (min_support = 3)

  TID   Items bought                (ordered) frequent items
  100   {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
  200   {a, b, c, f, l, m, o}       {f, c, a, b, m}
  300   {b, f, h, j, o, w}          {f, b}
  400   {b, c, k, s, p}             {c, b, p}
  500   {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

  1. Scan the DB once and find the frequent 1-itemsets (single-item patterns)
  2. Sort the frequent items in descending order of frequency: the f-list
  3. Scan the DB again and construct the FP-tree

  Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3
  F-list = f-c-a-b-m-p

  Resulting FP-tree (each node is item:count):

    {}
    +-- f:4
    |   +-- c:3
    |   |   +-- a:3
    |   |       +-- m:2
    |   |       |   +-- p:2
    |   |       +-- b:1
    |   |           +-- m:1
    |   +-- b:1
    +-- c:1
        +-- b:1
            +-- p:1

Find Patterns Having p From p's Conditional Database

  - Start at the frequent-item header table of the FP-tree
  - Traverse the FP-tree by following the node links of each frequent item p
  - Accumulate all transformed prefix paths of item p to form p's conditional pattern base

  Conditional pattern bases:

    item   conditional pattern base
    c      f:3
    a      fc:3
    b      fca:1, f:1, c:1
    m      fca:2, fcab:1
    p      fcam:2, cb:1

From Conditional Pattern Bases to Conditional FP-trees

  For each pattern base:
  - Accumulate the count for each item in the base
  - Construct the FP-tree for the frequent items of the pattern base

  m-conditional pattern base: fca:2, fcab:1
  m-conditional FP-tree: {} -> f:3 -> c:3 -> a:3
  All frequent patterns containing m: m, fm, cm, am, fcm, fam, cam, fcam

The Data Warehouse Toolkit, Ralph Kimball, Margy Ross, 2nd ed., 2002
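The support/confidence bookkeeping above is straightforward to reproduce in code. The following is a minimal Python sketch (added for illustration, not part of the original notes; the candidate generation is simple brute force and the names and thresholds are just the example values) that finds the frequent itemsets and the high-confidence rules for the five supermarket transactions:

```python
from itertools import combinations

transactions = [
    {"Beer", "Diaper", "Baby Powder", "Bread", "Umbrella"},
    {"Diaper", "Baby Powder"},
    {"Beer", "Diaper", "Milk"},
    {"Diaper", "Beer", "Detergent"},
    {"Beer", "Milk", "Coca-Cola"},
]
MIN_SUP, MIN_CONF = 0.4, 0.7
n = len(transactions)

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / n

# Level-wise search: L1, L2, L3, ... stop when a level comes back empty.
frequent = {}
current = [frozenset([i]) for i in sorted({i for t in transactions for i in t})]
k = 1
while current:
    level = {c: support(c) for c in current if support(c) >= MIN_SUP}
    if not level:
        break
    frequent.update(level)
    # candidate (k+1)-itemsets built from the items of the frequent k-itemsets
    singles = sorted({i for c in level for i in c})
    current = [frozenset(c) for c in combinations(singles, k + 1)]
    k += 1

# Rule generation: A -> B with confidence = support(A u B) / support(A)
for itemset, sup in frequent.items():
    if len(itemset) < 2:
        continue
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            conf = sup / support(antecedent)
            if conf >= MIN_CONF:
                consequent = itemset - antecedent
                print(f"{set(antecedent)} -> {set(consequent)}: "
                      f"support {sup:.0%}, confidence {conf:.0%}")
```

Running this recovers the four rules in the Results list above: Beer -> Diaper and Diaper -> Beer at confidence 75%, Milk -> Beer and Baby Powder -> Diaper at confidence 100%.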
k-means Clustering

Squared Euclidean distance between two patterns x and z:

  $d^2(x, z) = \sum_{i=1}^{d} (x_i - z_i)^2$

Cluster centers $c_1, c_2, \dots, c_k$ with clusters $C_1, C_2, \dots, C_k$.

Error:

  $E = \sum_{j=1}^{k} \sum_{x \in C_j} d^2(x, c_j)$

The error function has local minima; the solution k-means converges to depends on the initial choice of centers.

k-means Example (K = 2)

  Pick seeds -> Reassign clusters -> Compute centroids -> Reassign clusters -> Compute centroids -> Reassign clusters -> Converged!

Algorithm

  Random initialization of k cluster centers
  do {
      - assign each x_i in the dataset to the nearest cluster center (centroid) c_j according to d^2
      - compute all new cluster centers
  } until ( |E_new - E_old| < epsilon  or  number of iterations >= max_iterations )

k-Means vs Mixture of Gaussians

Both are iterative algorithms that assign points to clusters.

K-Means minimizes

  $E = \sum_{j=1}^{k} \sum_{x \in C_j} d^2(x, c_j)$

Mixture of Gaussians maximizes the likelihood $P(x \mid C = i)$, where each component is a Gaussian

  $P(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$

The mixture of Gaussians is the more general formulation. It becomes equivalent to k-means when $\Sigma_i = I$, the priors are uniform ($P(C = i) = 1/k$), and the assignment is hard: a point belongs with probability 1 to its closest cluster $C_i$ and 0 to the others.

Tree Clustering

Tree clustering algorithms allow us to reveal the internal similarities of a given pattern set and to structure these similarities hierarchically. They are usually applied to a small set of typical patterns. For n patterns these algorithms generate a sequence of 1 to n clusters.

Example

The similarity between two clusters is assessed by measuring the similarity of the furthest pair of patterns (one from each cluster). This is the so-called complete-linkage rule.

Impact of cluster distance measures

  - "Single-link": inter-cluster distance = distance between the closest pair of points
  - "Complete-link": inter-cluster distance = distance between the farthest pair of points

There are two criteria proposed for clustering evaluation and the selection of an optimal clustering scheme (Berry and Linoff, 1996):

  - Compactness: the members of each cluster should be as close to each other as possible. A common measure of compactness is the variance, which should be minimized.
  - Separation: the clusters themselves should be widely spaced.

Dunn index

With the inter-cluster distance and cluster diameter defined as

  $d(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y)$
  $\mathrm{diam}(C_i) = \max_{x, y \in C_i} d(x, y)$

the Dunn index for k clusters is

  $D_k = \min_{1 \le i \le k} \left\{ \min_{1 \le j \le k,\ j \ne i} \frac{d(C_i, C_j)}{\max_{1 \le l \le k} \mathrm{diam}(C_l)} \right\}$

The Davies-Bouldin (DB) index (1979)

Using the same $d(C_i, C_j)$ and $\mathrm{diam}(C_i)$:

  $DB_k = \frac{1}{k} \sum_{i=1}^{k} \max_{j \ne i} \frac{\mathrm{diam}(C_i) + \mathrm{diam}(C_j)}{d(C_i, C_j)}$

Pattern Classification (2nd ed.), Richard O. Duda, Peter E. Hart, and David G. Stork, Wiley Interscience, 2001
Pattern Recognition: Concepts, Methods and Applications, Joaquim P. Marques de Sá, Springer-Verlag, 2001

3-Nearest Neighbors

Query point $x_q$: its 3 nearest neighbors are 2 of class x and 1 of class o, so it is classified as x.

Machine Learning, Tom M. Mitchell, McGraw Hill, 1997

Bayes / Naive Bayes

Example: does the patient have cancer or not?

A patient takes a lab test and the result comes back positive. The test returns a correct positive result (+) in only 98% of the cases in which the disease is actually present, and a correct negative result (-) in only 97% of the cases in which the disease is not present. Furthermore, 0.008 of the entire population have this cancer.

Suppose a positive result (+) is returned:

  P(+ | cancer) P(cancer)   = 0.98 x 0.008 = 0.0078
  P(+ | ~cancer) P(~cancer) = 0.03 x 0.992 = 0.0298

Normalization:

  P(cancer | +)  = 0.0078 / (0.0078 + 0.0298) = 0.20745
  P(~cancer | +) = 0.0298 / (0.0078 + 0.0298) = 0.79255

The result of Bayesian inference depends strongly on the prior probabilities, which must be available in order to apply the method.
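The arithmetic of the cancer example is easy to check with a few lines of Python (a sketch added here, not part of the original notes; the variable names are illustrative):

```python
# Bayes rule for the lab-test example: hypotheses are "cancer" and "not cancer".
p_cancer = 0.008                    # prior: 0.8% of the population
p_pos_given_cancer = 0.98           # sensitivity: correct positive rate
p_pos_given_no_cancer = 1 - 0.97    # false-positive rate = 3%

# Unnormalized posteriors P(+ | h) * P(h)
joint_cancer = p_pos_given_cancer * p_cancer                 # ~0.0078
joint_no_cancer = p_pos_given_no_cancer * (1 - p_cancer)     # ~0.0298

# Normalize so that the two posteriors sum to 1
evidence = joint_cancer + joint_no_cancer
print("P(cancer | +)     =", joint_cancer / evidence)        # ~0.21
print("P(not cancer | +) =", joint_no_cancer / evidence)     # ~0.79
```

Despite the positive test, the posterior probability of cancer is only about 21%: the very small prior dominates, which is exactly the point made above about the importance of the priors.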
Belief Networks

The burglary network: Burglary and Earthquake are the parents of Alarm; Alarm is the parent of JohnCalls and MaryCalls.

  P(B) = 0.001      P(E) = 0.002

  Alarm:
    Burg.  Earth.  P(A)
    t      t       .95
    t      f       .94
    f      t       .29
    f      f       .001

  JohnCalls:        MaryCalls:
    A   P(J)          A   P(M)
    t   .90           t   .70
    f   .05           f   .01

Full Joint Distribution

  $P(x_1, \dots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{parents}(X_i))$

  $P(j \wedge m \wedge a \wedge \neg b \wedge \neg e) = P(j \mid a)\,P(m \mid a)\,P(a \mid \neg b \wedge \neg e)\,P(\neg b)\,P(\neg e)$
  $= 0.9 \times 0.7 \times 0.001 \times 0.999 \times 0.998 = 0.00062$

P(Burglary | JohnCalls = true, MaryCalls = true)

  - The hidden variables of the query are Earthquake and Alarm:

  $P(B \mid j, m) = \alpha\,P(B, j, m) = \alpha \sum_{e} \sum_{a} P(B, e, a, j, m)$

  - For Burglary = true in the Bayesian network:

  $P(b \mid j, m) = \alpha \sum_{e} \sum_{a} P(b)\,P(e)\,P(a \mid b, e)\,P(j \mid a)\,P(m \mid a)$

  - P(b) is constant and can be moved outside the summations, and the P(e) term can be moved outside the summation over a:

  $P(b \mid j, m) = \alpha\,P(b) \sum_{e} P(e) \sum_{a} P(a \mid b, e)\,P(j \mid a)\,P(m \mid a)$

  Given JohnCalls = true and MaryCalls = true, the probability that a burglary has occurred is about 28%:

  $P(B \mid j, m) = \alpha\,\langle 0.00059224,\ 0.0014919 \rangle \approx \langle 0.284,\ 0.716 \rangle$

  (The evaluation of the Burglary = true computation is shown in a figure that is not reproduced here.)

Artificial Intelligence - A Modern Approach, Second Edition, S. Russell and P. Norvig, Prentice Hall, 2003

ID3 - Tree Learning

(The credit history loan table itself is shown in a figure that is not reproduced here.)

The credit history loan table has the following class distribution:

  p(risk is high)     = 6/14
  p(risk is moderate) = 3/14
  p(risk is low)      = 5/14

  $I(\text{credit\_table}) = -\frac{6}{14}\log_2\frac{6}{14} - \frac{3}{14}\log_2\frac{3}{14} - \frac{5}{14}\log_2\frac{5}{14} = 1.531$ bits

In the credit history loan table we make income the property tested at the root. This partitions the examples into

  C1 = {1, 4, 7, 11},  C2 = {2, 3, 12, 14},  C3 = {5, 6, 8, 9, 10, 13}

  $E(\text{income}) = \frac{4}{14} I(C_1) + \frac{4}{14} I(C_2) + \frac{6}{14} I(C_3) = \frac{4}{14}\cdot 0 + \frac{4}{14}\cdot 1.0 + \frac{6}{14}\cdot 0.65 = 0.564$ bits

  gain(income) = I(credit_table) - E(income) = 1.531 - 0.564 = 0.967 bits

  gain(credit history) = 0.266
  gain(debt)           = 0.581
  gain(collateral)     = 0.756

Overfitting

Consider the error of hypothesis h over
  - the training data:                     error_train(h)
  - the entire distribution D of the data: error_D(h)

A hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h' ∈ H such that

  error_train(h) < error_train(h')   and   error_D(h) > error_D(h')

An ID3 tree consistent with the data (sunburn example):

  Hair Color?
    Blond -> Lotion Used?
               No  -> Sunburned (Sarah, Annie)
               Yes -> Not Sunburned (Dana, Katie)
    Red   -> Sunburned (Emily)
    Brown -> Not Sunburned (Alex, Pete, John)

Corresponding rules extracted by C4.5:

  - If the person's hair is blond and the person uses lotion, then nothing happens.
  - If the person's hair is blond and the person uses no lotion, then the person turns red.
  - If the person's hair is red, then the person turns red.
  - If the person's hair is brown, then nothing happens.

Simplified rule set with a default rule:

  - If the person uses lotion, then nothing happens.
  - If the person's hair is brown, then nothing happens.
  - If no other rule applies, then the person turns red.

Artificial Intelligence, Patrick Henry Winston, Addison-Wesley, 1992
Artificial Intelligence - Structures and Strategies for Complex Problem Solving, Second Edition, G. L. Luger and W. A. Stubblefield, Benjamin/Cummings Publishing, 1993
Machine Learning, Tom M. Mitchell, McGraw Hill, 1997

Perceptron Limitations / Gradient Descent

The XOR problem and the perceptron: as shown by Minsky and Papert in the mid-1960s, XOR is not linearly separable, so a single perceptron cannot represent it.

Gradient Descent

To understand gradient descent, consider the simpler linear unit, where

  $o = \sum_{i=0}^{n} w_i x_i$

Let's learn the weights $w_i$ that minimize the squared error over the training set $D = \{(x_1, t_1), (x_2, t_2), \dots, (x_d, t_d), \dots, (x_m, t_m)\}$ (t stands for target):

  $E[\vec{w}] = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$
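A minimal NumPy sketch of batch gradient descent for this linear unit (added for illustration, not from the original notes; the data, learning rate and iteration count are arbitrary):

```python
import numpy as np

# Batch gradient descent for a linear unit o = w . x, minimizing
# E[w] = 1/2 * sum_d (t_d - o_d)^2 over the training set D.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])  # x_0 = 1 (bias input)
true_w = np.array([0.5, 2.0, -1.0])
t = X @ true_w                         # targets from a known linear function

w = np.zeros(3)                        # initial weights
eta = 0.01                             # learning rate
for _ in range(500):
    o = X @ w                          # outputs of the linear unit
    grad = -(t - o) @ X                # dE/dw_i = -sum_d (t_d - o_d) x_id
    w -= eta * grad                    # step in the direction of steepest descent

print(w)                               # approaches [0.5, 2.0, -1.0]
```

The weight update is the delta rule, $\Delta w_i = \eta \sum_d (t_d - o_d)\, x_{id}$, which the back-propagation material below generalizes to networks with hidden layers.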
Feed-forward Networks / Back-Propagation / Activation Functions

The example network has 5 inputs $x_1, \dots, x_5$, 3 hidden units and 2 outputs (the network figure and the activation-function plots are not reproduced here).

In our example E becomes

  $E[\vec{w}] = \frac{1}{2} \sum_{d=1}^{m} \sum_{i=1}^{2} (t_i^d - o_i^d)^2$

  $E[\vec{w}] = \frac{1}{2} \sum_{d=1}^{m} \sum_{i=1}^{2} \left( t_i^d - f\!\left( \sum_{j=1}^{3} W_{ij}\, f\!\left( \sum_{k=1}^{5} w_{jk}\, x_k^d \right) \right) \right)^2$

$E[\vec{w}]$ is differentiable given that f is differentiable, so gradient descent can be applied.

RBF Networks

(Figure-only slides, not reproduced here.)

Support Vector Machines

Extension to a Non-linear Decision Boundary

Possible problems of transforming the data into a feature space:
  - high computational burden
  - hard to get a good estimate

SVM solves these two issues simultaneously:
  - kernel tricks for efficient computation
  - minimizing ||w||^2 can lead to a "good" classifier

The mapping f(·) takes points from the input space to a higher-dimensional feature space, where a linear boundary is sought (figure not reproduced).

Machine Learning, Tom M. Mitchell, McGraw Hill, 1997
Neural Networks, Simon Haykin, Second Edition, Prentice Hall, 1999
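To make the kernel trick concrete, here is a small Python sketch (added for illustration, not from the original notes) comparing an explicit degree-2 polynomial feature map with the corresponding kernel; the kernel computes the same feature-space inner product without ever building the feature vectors:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial feature map for 2-D input x = (x1, x2)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, 1.0])

def poly_kernel(x, z):
    """Polynomial kernel k(x, z) = (x . z + 1)^2, equal to phi(x) . phi(z)."""
    return (np.dot(x, z) + 1.0) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(np.dot(phi(x), phi(z)))   # inner product computed in the feature space
print(poly_kernel(x, z))        # same value, computed in the input space
```

Since SVM training and prediction only use inner products between data points, replacing $x \cdot z$ with $k(x, z)$ yields a non-linear decision boundary in the input space while the optimization problem remains the one for a linear boundary in the feature space.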