Relaxed Transfer of Different Classes via Spectral Partition

Xiaoxiao Shi (University of Illinois at Chicago), Wei Fan (IBM T. J. Watson Research Center), Jiangtao Ren (Sun Yat-sen University), Qiang Yang (Hong Kong University of Science and Technology)

Two key points: 1. Unsupervised. 2. Can use data with different classes to help. How so?

What is Transfer Learning? Standard Supervised Learning
• Training data (labeled) and test data (unlabeled) come from the same place.
• Example: a classifier trained on labeled New York Times articles reaches 85.5% accuracy on New York Times test articles.

What is Transfer Learning? In Reality…
• Labeled data are insufficient: with too little labeled training data, accuracy on the New York Times test set drops to 47.3%.
• How to improve the performance?

What is Transfer Learning?
• Train on a source domain (labeled, e.g., Reuters); test on a target domain (unlabeled, e.g., New York Times).
• A transfer classifier reaches 82.6% accuracy on the target domain.
• The source data is not necessarily from the same domain and need not follow the same distribution as the target.

Transfer across Different Class Labels
• Source domain: Reuters (labeled); target domain: New York Times (unlabeled).
• Since the two domains are different, they may have different class labels (Markets, World, U.S., Politics, Entertainment, Fashion & Style, Travel, Blogs, …), differing in both number and meaning.
• How to transfer when the class labels are different?

Two Main Categories of Transfer Learning
• Unsupervised transfer learning
  – No labeled data from the target domain; use the source domain to help learning.
  – Question: is it better than clustering?
• Supervised transfer learning
  – A limited number of labeled examples from the target domain.
  – Question: is it better than not using any source data at all?

Transfer across Different Class Labels
• Two sub-problems:
  – (1) What and how to transfer, since we cannot explicitly use P(x|y) or P(y|x) to build the similarity among tasks (the class labels y have different meanings)?
  – (2) How to avoid negative transfer, since the tasks may come from very different domains?
• Negative transfer: when the tasks are too different, transfer learning may hurt learning accuracy.

The Proposed Solution
• (1) What and how to transfer? Transfer the eigenspace.
  – [Figure: a dataset with complex cluster shapes. K-means performs very poorly in the original space due to its bias toward dense spherical clusters; in the eigenspace (the space given by the eigenvectors), the clusters are trivial to separate.]
  – Eigenspace: the space spanned by a set of eigenvectors.

The Proposed Solution
• (2) How to avoid negative transfer? A new clustering-based KL divergence that reflects distribution differences.
  – The traditional KL divergence needs P(x) and Q(x) for every x, which is normally difficult to obtain.
  – To get the clustering-based KL divergence: (1) perform clustering on the combined dataset; (2) calculate the KL divergence from basic statistical properties of the clusters (see the worked example below, and the sketch right after this slide).
  – If the distributions are too different (KL is large), automatically decrease the effect of the source domain.
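The deck gives the two-step recipe but no formula on this slide, so here is a minimal sketch in Python of one way to realize it, assuming k-means for the clustering step and using the smoothed, normalized per-cluster membership proportions as the two distributions; the function name, the smoothing constant eps, and the choice of k-means are illustrative assumptions, and the paper's exact estimator may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def clustering_based_kl(source_x, target_x, n_clusters=2, eps=1e-12):
    """Sketch of a clustering-based KL divergence between source (P)
    and target (Q): instead of estimating P(x) and Q(x) pointwise,
    cluster the combined data and compare how the source and target
    examples distribute over the clusters."""
    combined = np.vstack([source_x, target_x])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(combined)
    src_labels = labels[:len(source_x)]
    tgt_labels = labels[len(source_x):]

    # Per-cluster membership counts, smoothed (an assumption here) and
    # normalized into proper distributions over the clusters.
    p = np.array([(src_labels == c).sum() for c in range(n_clusters)], float)
    q = np.array([(tgt_labels == c).sum() for c in range(n_clusters)], float)
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

# Toy usage: overlapping blobs give a small divergence, shifted blobs a
# large one; a large value is the signal to down-weight the source domain.
rng = np.random.default_rng(0)
print(clustering_based_kl(rng.normal(0, 1, (100, 2)), rng.normal(0, 1, (100, 2))))
print(clustering_based_kl(rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))))
```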
An Example
• The combined dataset has 15 examples: 8 from P and 7 from Q, so E(P) = 8/15 and E(Q) = 7/15.
• Clustering the combined dataset produces two clusters, C1 and C2.
• P'(C) is the portion of the combined examples that come from P and fall in cluster C (likewise Q'):
  – P'(C1) = 3/15, Q'(C1) = 3/15, P'(C2) = 5/15, Q'(C2) = 4/15.
• S(P', C) is the portion of cluster C's examples that come from P (likewise S(Q', C)):
  – S(P', C1) = 0.5, S(Q', C1) = 0.5, S(P', C2) = 5/9, S(Q', C2) = 4/9.
• The resulting clustering-based KL divergence is KL = 0.0309.

Objective Function
• Objective: find an eigenspace that separates the target data well.
  – Intuition: if the source data is similar to the target data, make good use of the source eigenspace; otherwise, keep the original structure of the target data.
• The objective is a normalized cut plus a traditional penalty term: one part prefers the source eigenspace T_L, the other prefers the original structure, balanced by R(L; U).
• The more similar the two distributions, the smaller R(L; U) is, and the more the objective relies on the source eigenspace T_L.

How to Construct the Constraints T_L and T_U?
• Principle:
  – T_L is derived directly from the "must-link" constraints: examples with the same label should be together. Example: examples 1, 2, 4 should be together (blue); examples 3, 5, 6 should be together (red).
  – T_U: (1) perform standard spectral clustering (e.g., NCut) on U; (2) examples in the same cluster should be together. Example: 1, 2, 3 should be together; 4, 5, 6 should be together.

How to Construct the Constraints T_L and T_U? (cont.)
• Construct the constraint matrix M = [m1, m2, …, mr]^T. For example,

  M_L = [ 1, -1,  0,  0,  0,  0     (1 and 2 must link)
          1,  0,  0, -1,  0,  0     (1 and 4 must link)
          0,  0,  1,  0, -1,  0     (3 and 5 must link)
          …                                            ]

Experiment
• [Tables describing the data sets; the details were lost in extraction.]

Text Classification
• Comp1 vs. Rec1: bar chart comparing Full Transfer, No Transfer, and RSP under three source settings: 1. comp2 vs. Rec2; 2. 4 classes (Graphics, etc.); 3. 3 classes (crypt, etc.).
• Org1 vs. People1: the same comparison under: 1. org2 vs. People2; 2. 3 classes (Places, etc.); 3. 3 classes (crypt, etc.).

Image Classification
• Homer vs. Real Bear: Full Transfer, No Transfer, and RSP under: 1. Superman vs. Teddy; 2. 3 classes (cartman, etc.); 3. 4 classes (laptop, etc.).
• Cartman vs. Fern: the same under: 1. Superman vs. Bonsai; 2. 3 classes (homer, etc.); 3. 4 classes (laptop, etc.).

Parameter Sensitivity
• [Figure: parameter-sensitivity curves; the details were lost in extraction.]

Conclusions
• Problem: transfer across tasks with different class labels.
• Two sub-problems:
  – (1) What and how to transfer? Transfer the eigenspace.
  – (2) How to avoid negative transfer? Propose an effective clustering-based KL divergence; if KL is large, i.e., the distributions are too different, decrease the effect of the source domain.

Thanks!
Datasets and codes: http://www.cs.columbia.edu/~wfan/software.htm

How Many Clusters?
• Condition for Lemma 1 to be valid: in each cluster, the expected values of the target and source data are about the same (the formal inequality was lost in extraction).
• Adaptively control the number of clusters to guarantee that Lemma 1 stays valid: stop the bisecting clustering when a cluster contains only target or only source data, or when the condition above already holds; a sketch of this scheme follows below.

Optimization
• [The derivation ("Let …, then …") and the algorithm flow were lost in extraction.]
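The stopping rule on the "How Many Clusters?" slide is stated only in words, and its inequality was lost in extraction, so the following is a minimal sketch of how such adaptive control of the number of clusters could look, assuming bisecting 2-means and a Euclidean gap between the per-cluster source and target means as the "about the same" test; the threshold tol, the minimum-size guard, and the function name are illustrative assumptions rather than details from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def adaptive_bisect(x, is_source, idx=None, tol=0.1):
    """Bisecting 2-means that stops splitting a cluster once Lemma 1's
    condition plausibly holds: the cluster contains only source or only
    target examples, or their means inside it are nearly equal.
    Returns a list of index arrays, one per final cluster."""
    if idx is None:
        idx = np.arange(len(x))
    src, tgt = idx[is_source[idx]], idx[~is_source[idx]]
    if len(src) == 0 or len(tgt) == 0:
        return [idx]  # only target/source data in the cluster: stop
    gap = np.linalg.norm(x[src].mean(axis=0) - x[tgt].mean(axis=0))
    if gap < tol or len(idx) < 4:
        return [idx]  # expected values about the same (or too few points): stop
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(x[idx])
    left, right = idx[labels == 0], idx[labels == 1]
    if len(left) == 0 or len(right) == 0:
        return [idx]  # degenerate split: stop
    return (adaptive_bisect(x, is_source, left, tol)
            + adaptive_bisect(x, is_source, right, tol))

# Usage: x is an (n, d) array of combined data, is_source a boolean
# array marking which rows come from the source domain.
# clusters = adaptive_bisect(x, is_source)
```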