Semi-supervised Structured Prediction Models
Ulf Brefeld

Joint work with: Christoph Büscher, Thomas Gärtner, Peter Haider, Tobias Scheffer, Stefan Wrobel, Alexander Zien

Binary Classification
(Figure: a linear decision boundary w separating positive from negative examples.)
- Plain binary classification is inappropriate for complex real-world problems.

Label Sequence Learning
- Protein secondary structure prediction:
  x = "XSITKTELDG ILPLVARGKV…", y = "SS TT SS EEEE SS…"
- Named entity recognition (NER):
  x = "Tom comes from London.", y = "Person, –, –, Location"
  x = "The secretion of PTH and CT...", y = "–, –, –, Gene, –, Gene, …"
- Part-of-speech (POS) tagging:
  x = "Curiosity kills the cat.", y = "noun, verb, det, noun"

Natural Language Parsing
- x = "Curiosity kills the cat", y = the corresponding parse tree (figure omitted).

Classification with Taxonomies
- x = a document, y = its node in a class taxonomy (figure omitted).

Structural Learning
- Given: n labeled pairs (x_1,y_1),…,(x_n,y_n) ∈ X × Y, drawn i.i.d. according to an unknown distribution.
- Learn a ranking function f: X × Y → R; the decision value f(x,y) measures how well y fits x.
- Compute the prediction with the inference/decoding model: ŷ = argmax_y f(x,y).
- Find the hypothesis that realizes the smallest regularized empirical risk:
  - log-loss → kernel CRFs,
  - hinge loss → M³ networks, structural SVMs.

Semi-supervised Discriminative Learning
- Labeled training data is scarce and expensive.
  - E.g., experiments in computational biology: expert knowledge is needed, and labeling is tedious and time-consuming.
- Unlabeled instances are abundant and cheap.
  - Extract texts/sentences from the web (POS tagging, NER, parsing).
  - Assess the primary structure of proteins from DNA/RNA.
- There is a need for semi-supervised techniques in structural learning!

Overview
1. Semi-supervised learning techniques.
   1. Co-regularized least squares regression.
2. Semi-supervised structured prediction models.
   1. Co-support vector machines.
   2. Transductive SVMs and efficient optimization.
3. Case study: email batch detection.
   1. Supervised clustering.
4. Conclusion.

Cluster Assumption
- Now: m unlabeled inputs are given in addition to the n labeled pairs, with m >> n.
- The decision boundary should not cross high-density regions.
- Examples: transductive learning, graph kernels, …
- But the cluster assumption is frequently inappropriate, e.g., for regression! What else can we do?

Learning from Multiple Views / Co-learning
- Split the attributes into two disjoint sets (views) V1, V2.
- E.g., web page classification: view 1 (intrinsic) = the content of the web page; view 2 (contextual) = the anchor texts of inbound links.
- In each view, learn a hypothesis f_v, v = 1,2.
- Each f_v provides its peer with predictions on unlabeled examples.
- Strategy: maximize the consensus between f_1 and f_2, as in the sketch below.
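A minimal sketch of the co-learning setup (Python/numpy), not the talk's implementation: two regularized least squares hypotheses are trained on disjoint feature views of the same labeled data, and their disagreement is measured on unlabeled inputs; consensus maximization adds exactly this disagreement to the training objective. Data, feature split, and parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 20 labeled and 200 unlabeled points with 10 attributes.
X_lab, y = rng.normal(size=(20, 10)), rng.normal(size=20)
X_unl = rng.normal(size=(200, 10))

# Two disjoint attribute sets (views), e.g. "content" vs. "context" features.
views = [slice(0, 5), slice(5, 10)]

def ridge_fit(X, y, eta=1.0):
    """Regularized least squares: w = (X'X + eta*I)^(-1) X'y."""
    return np.linalg.solve(X.T @ X + eta * np.eye(X.shape[1]), X.T @ y)

# One hypothesis per view, trained on the labeled examples only.
w1 = ridge_fit(X_lab[:, views[0]], y)
w2 = ridge_fit(X_lab[:, views[1]], y)

# Each view predicts on the unlabeled inputs; co-learning penalizes this
# disagreement in addition to the labeled error.
f1 = X_unl[:, views[0]] @ w1
f2 = X_unl[:, views[1]] @ w2
print("mean squared disagreement:", np.mean((f1 - f2) ** 2))
```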
Hypothesis Space Intersection
- The views induce hypothesis spaces H1 and H2; the true labeling function lies in the intersection H1 ∩ H2.
- Consensus maximization principle:
  - labeled examples → minimize the error,
  - unlabeled examples → minimize the disagreement.
- This minimizes an upper bound on the error: minimizing error rate and disagreement over all hypotheses in H1 ∩ H2 turns the unlabeled examples into a data-driven regularization.

Co-optimization Problem
- Given: n labeled pairs (x_1,y_1),…,(x_n,y_n) ∈ X × Y; m unlabeled inputs x_{n+1},…,x_{n+m} ∈ X; a loss function Δ: Y × Y → R_+; V hypotheses (f_1,…,f_V) ∈ H_1 × … × H_V.
- Goal:
  \min Q(f_1,\dots,f_V) = \sum_{v=1}^{V} \sum_{i=1}^{n} \Delta\big(y_i, \arg\max_{\bar y} f_v(x_i,\bar y)\big)   [empirical risk of f_v]
                        + \eta \sum_{v=1}^{V} \|f_v\|^2   [regularization]
                        + \lambda \sum_{u,v=1}^{V} \sum_{j=n+1}^{n+m} \Delta\big(\arg\max_{y'} f_u(x_j,y'), \arg\max_{y''} f_v(x_j,y'')\big)   [pairwise disagreements]
- The representer theorem applies: each optimal f_v is a kernel expansion over the labeled and unlabeled inputs.

Semi-supervised Regularized Least Squares Regression
- Special case: output space Y = R; by the representer theorem, consider functions f_v(x) = \sum_{i=1}^{n+m} c_{v,i}\, k_v(x_i, x).
- Squared loss: Δ(y, y') = (y − y')².
- Given: n labeled examples, m unlabeled inputs, V views (V kernel functions k_1,…,k_V).
- Consensus maximization principle: minimize the squared error on the labeled examples and the squared differences on the unlabeled examples.

Co-regularized Least Squares Regression (coRLSR)
- Kernel matrices K_v over all n + m inputs.
- Optimization problem: empirical risk + regularization + disagreement.
- Closed-form solution, unique if each K_v is strictly positive definite (see the sketch below).
- Execution time: cubic in the number of unlabeled examples; predictive performance as good (or bad) as the state of the art.

Semi-parametric Approximation
- Restrict the hypothesis space: expand f_v over the labeled examples only.
- The objective function remains convex; the solution is again available in closed form.
- Execution time: only linear in the amount of unlabeled data.
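The closed form can be made concrete with a small numpy sketch. Assume, for V = 2 views, the objective Q(c_1,c_2) = Σ_v ||y − L_v c_v||² + η Σ_v c_vᵀ K_v c_v + λ ||U_1 c_1 − U_2 c_2||², where K_v is the full (n+m)×(n+m) kernel matrix of view v and L_v, U_v are its labeled and unlabeled rows (the paper's weighting of the three terms may differ). Setting the gradient to zero yields one block linear system of size 2(n+m), which is where the cubic cost in the unlabeled examples comes from. Kernel, data, and constants are illustrative.

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    """RBF kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(1)
n, m = 15, 60                                 # labeled / unlabeled counts
X = rng.normal(size=(n + m, 8))
y = np.sin(X[:n, 0]) + 0.1 * rng.normal(size=n)
views = [slice(0, 4), slice(4, 8)]            # two disjoint attribute views
eta, lam = 0.1, 1.0                           # regularization / disagreement weights

# Per view: full kernel K_v, labeled rows L_v, unlabeled rows U_v.
K = [rbf(X[:, v], X[:, v]) for v in views]
L = [Kv[:n] for Kv in K]
U = [Kv[n:] for Kv in K]

# Stationarity of the convex objective gives one symmetric block system.
A = [L[v].T @ L[v] + eta * K[v] + lam * U[v].T @ U[v] for v in range(2)]
C = lam * U[0].T @ U[1]                       # coupling via the disagreement term
M = np.block([[A[0], -C], [-C.T, A[1]]])
rhs = np.concatenate([L[0].T @ y, L[1].T @ y])
c = np.linalg.solve(M, rhs)                   # O((n+m)^3): the exact solution
c1, c2 = c[: n + m], c[n + m:]

# Final predictor: average of the two views' kernel expansions.
f = 0.5 * (K[0] @ c1 + K[1] @ c2)
print("train RMSE:", np.sqrt(np.mean((f[:n] - y) ** 2)))
```

Restricting the expansions to the labeled examples (the semi-parametric approximation) shrinks the unknowns to 2n, leaving only matrix products that are linear in m.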
Semi-supervised Methods for Distributed Data
- Participants keep their labeled data private.
- They agree on a fixed set of unlabeled data.
- The procedure converges to the global optimum.

Empirical Results
- 32 UCI data sets, 10-fold "inverse" cross-validation.
- (Scatter plots comparing RLSR, coRLSR (approx.), and coRLSR (exact); dashed lines indicate equal performance.)
- RMSE: exact coRLSR < semi-parametric coRLSR < RLSR.
Results taken from: Brefeld, Gärtner, Scheffer, Wrobel, "Efficient Co-Regularised Least Squares Regression", ICML 2006.

Execution Time
- The exact solution is cubic in the number of unlabeled examples; the approximation is only linear!
Results taken from: Brefeld, Gärtner, Scheffer, Wrobel, "Efficient Co-Regularised Least Squares Regression", ICML 2006.

Semi-supervised Learning for Structured Output Variables
- Given: n labeled examples and m unlabeled inputs.
- Joint decision function f_v(x,y) = ⟨w_v, φ_v(x,y)⟩ with distinct joint feature mappings φ_1, φ_2 in the views V1 and V2.
- Apply the consensus maximization principle: minimize the error on labeled examples and the disagreement on unlabeled examples.
- Compute the argmax with the Viterbi algorithm (sequential output; see the sketch after the working-set examples below) or the CKY algorithm (recursive grammar).

CoSVM Optimization Problem
- For each view v = 1,2, a structural SVM problem in whose dual the prediction and the confidence of the peer view appear (equations omitted).
- Dual parameters are bound to input examples; working sets are associated with subspaces → sparse models!

Labeled Examples, View v = 1,2
- x_i = "John ate the cat", y_i = <N,V,D,N>.
- Viterbi decoding returns ŷ = <N,D,D,N> and, after an update, <N,V,V,N>: error / margin violation!
- 1. Update the working set: Ω_i^v = { φ_v(x_i,y_i) − φ_v(x_i,<N,D,D,N>), φ_v(x_i,y_i) − φ_v(x_i,<N,V,V,N>) }, with dual parameters α_i^v = ( α_i^v(<N,D,D,N>), α_i^v(<N,V,V,N>) ).
- 2. Optimize α_i^v; all α_{j≠i}^v and working sets Ω_{j≠i}^v remain fixed.
- Once Viterbi decoding returns y_i = <N,V,D,N> itself, return α_i^v and Ω_i^v.

Unlabeled Examples
- x_i = "John went home".
- Viterbi decoding yields ŷ¹ = <D,V,N> in view 1 and ŷ² = <N,V,V> in view 2: disagreement / margin violation!
- 1. Update the working sets: Ω_i^1 = { φ_1(x_i,<N,V,V>) − φ_1(x_i,<D,V,N>) } with α_i^1 = ( α_i^1(<D,V,N>) ), and Ω_i^2 = { φ_2(x_i,<D,V,N>) − φ_2(x_i,<N,V,V>) } with α_i^2 = ( α_i^2(<N,V,V>) ).
- 2. Optimize α_i^1 and α_i^2; all other dual parameters and working sets remain fixed.
- After the update, both views decode <N,V,N>; consensus: return α_i^1, α_i^2, Ω_i^1, Ω_i^2.
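The argmax over label sequences above is computed by Viterbi decoding. Below is a minimal, self-contained sketch for a linear-chain model with additive per-position and transition scores; the talk's actual feature maps are not reproduced, and the toy scores are assumptions.

```python
import numpy as np

def viterbi(node, trans):
    """Argmax decoding for a linear-chain model.

    node[t, s]  : score of label s at position t
    trans[s, t] : score of the transition s -> t
    Returns the highest-scoring label sequence.
    """
    T, S = node.shape
    delta = np.zeros((T, S))        # best score of any path ending in (t, s)
    back = np.zeros((T, S), int)    # backpointers
    delta[0] = node[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + trans + node[t][None, :]
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0)
    y = [int(delta[-1].argmax())]   # best final label ...
    for t in range(T - 1, 0, -1):   # ... then follow the backpointers
        y.append(int(back[t, y[-1]]))
    return y[::-1]

# Toy example: 3 tokens ("John went home"), labels 0=N, 1=V, 2=D.
node = np.array([[2.0, 0.1, 0.5], [0.2, 2.0, 0.1], [1.5, 0.3, 0.2]])
trans = np.array([[0.0, 1.0, 0.2], [1.0, 0.0, 0.5], [1.0, 0.1, 0.0]])
print(viterbi(node, trans))         # [0, 1, 0], i.e. <N,V,N>
```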
BioCreative Named Entity Recognition
- BioCreative data (Task 1A, BioCreative Challenge, 2003): 7,500 sentences from biomedical papers.
- Task: recognize gene/protein names.
- 500 holdout sentences.
- Approximately 350,000 features (letter n-grams, surface clues, …).
- Random feature split into two views; the baseline is trained on all features.
Results taken from: Brefeld, Büscher, Scheffer, "Semi-supervised Discriminative Sequential Learning", ECML 2005.

BioCreative Gene/Protein Name Recognition
- The CoSVM is more accurate than the SVM.
- Accuracy is positively correlated with the number of unlabeled examples.
Results taken from: Brefeld, Büscher, Scheffer, "Semi-supervised Discriminative Sequential Learning", ECML 2005.

Natural Language Parsing
- Wall Street Journal corpus (Penn Treebank), sections 2-21: 8,666 sentences of length ≤ 15 tokens; the context-free grammar contains > 4,800 production rules.
- Negra corpus (German newspaper archive): 14,137 sentences of between 5 and 25 tokens; the CFG contains > 26,700 production rules.
- Experimental setup: local features (rule identity, rule at border, span width, …); loss Δ(y_a, y_b) = 1 − F1(y_a, y_b); 100 holdout examples; CKY parser by Mark Johnson.
Results taken from: Brefeld, Scheffer, "Semi-supervised Learning for Structured Output Variables", ICML 2006.

Wall Street Journal / Negra Corpus: Natural Language Parsing
- The CoSVM significantly outperforms the SVM.
- Adding unlabeled instances further improves the F1 score.
Results taken from: Brefeld, Scheffer, "Semi-supervised Learning for Structured Output Variables", ICML 2006.

Execution Time
- The CoSVM scales quadratically in the number of unlabeled examples.
Results taken from: Brefeld, Scheffer, "Semi-supervised Learning for Structured Output Variables", ICML 2006.

Transductive Support Vector Machines for Structured Variables
- Binary transductive SVMs: cluster assumption; discrete variables for the unlabeled instances; optimization is expensive even for binary tasks!
- Structural transductive SVMs: decoding = combinatorial optimization over discrete variables → intractable!
- Efficient optimization: transform the problem to remove the discrete variables, obtain a differentiable, continuous objective, and apply gradient-based, unconstrained optimization techniques.

Unconstrained Support Vector Machines
- SVM optimization problem: solving the constraints for the slack variables removes all constraints from the problem.
- But the resulting hinge loss is not differentiable; the Huber loss is! → substitute the Huber loss.
- There is still a max over outputs in the objective: substitute the differentiable softmax for the max (see the sketch below).
- The result is a differentiable objective without constraints.
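The max→softmax substitution can be illustrated directly. Assume a structured hinge of the form max_{ȳ} [Δ(y,ȳ) + ⟨w, φ(x,ȳ)⟩] − ⟨w, φ(x,y)⟩ and replace the max by the log-sum-exp softmax with inverse temperature β; the paper's exact smoothing (and the Huber part) is not reproduced, and all numbers are toy assumptions. For sequence outputs, the log-sum-exp over all ȳ would itself be computed by dynamic programming rather than by enumeration.

```python
import numpy as np
from scipy.special import logsumexp

def structured_hinge(w, phi_true, phi_all, delta):
    """Non-differentiable: max_y' [delta(y,y') + <w, phi(x,y')>] - <w, phi(x,y)>."""
    return np.max(delta + phi_all @ w) - phi_true @ w

def smoothed_hinge(w, phi_true, phi_all, delta, beta=10.0):
    """Differentiable surrogate: the max is replaced by
    (1/beta) * log sum_y' exp(beta * [delta(y,y') + <w, phi(x,y')>]);
    beta -> infinity recovers the hard max."""
    return logsumexp(beta * (delta + phi_all @ w)) / beta - phi_true @ w

# Toy example: 4 candidate outputs, 3 joint features; candidate 0 is the
# true output y (loss 0), so the hinge is nonnegative.
rng = np.random.default_rng(2)
w = rng.normal(size=3)
phi_all = rng.normal(size=(4, 3))        # phi(x, y') for each candidate y'
phi_true = phi_all[0]
delta = np.array([0.0, 1.0, 1.0, 2.0])   # Delta(y, y') per candidate

print(structured_hinge(w, phi_true, phi_all, delta))
print(smoothed_hinge(w, phi_true, phi_all, delta))   # close to the hard max
```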
Unconstrained Transductive Support Vector Machines
- Start from the unconstrained SVM objective function.
- Include the unlabeled instances via an appropriate loss function; a weight controls the overall influence of the unlabeled instances.
- Mitigate margin violations by moving w in two symmetric ways (2-best decoder).
- The resulting optimization problem is not convex!

Execution Time
- Gradient-based optimization is faster than solving QPs (measured with +250 and +500 unlabeled examples).
- Efficient transductive integration of unlabeled instances.
Results taken from: Zien, Brefeld, Scheffer, "Transductive Support Vector Machines for Structured Variables", ICML 2007.

Spanish News Wire Named Entity Recognition
- Spanish news wire corpus (special session of CoNLL, 2002): 3,100 sentences of between 10 and 40 tokens.
- Entities: person, location, organization, and miscellaneous names (9 labels).
- Window of size 3 around each token; approximately 120,000 features (the token itself, surface clues, …).
- 300 holdout sentences.
Results taken from: Zien, Brefeld, Scheffer, "Transductive Support Vector Machines for Structured Variables", ICML 2007.

Spanish News Wire Named Entity Recognition: Results
(Figure: token error [%] over the number of unlabeled examples.)
- The TSVM has significantly lower error rates than the SVM.
- The error decreases with the number of unlabeled instances.
Results taken from: Zien, Brefeld, Scheffer, "Transductive Support Vector Machines for Structured Variables", ICML 2007.

Artificial Sequential Data
- Laplacian kernel (10-nearest-neighbor graph) vs. RBF kernel.
- The Laplacian kernel is well suited to this data; the TSVM adds only little improvement, if any.
- The two rest on different cluster assumptions: Laplacian — local (token level); TSVM — global (sequence level).
Results taken from: Zien, Brefeld, Scheffer, "Transductive Support Vector Machines for Structured Variables", ICML 2007.

Supervised Clustering of Data Streams for Email Batch Detection
- Spam characteristics:
  - About 80% of all messages in electronic messaging are spam.
  - Approximately 80-90% of these spam messages are generated by only a few spammers.
  - Spammers maintain templates and exchange them rapidly; many emails are generated from the same template (= a batch) within a short time frame.
- Goal: detect batches in the data stream. Ground truth in the form of exact clusterings exists!
- Batch information is useful for black/white listing and for improving spam/non-spam classification.

Template-Generated Spam Messages
Two messages from the same template (quoted verbatim, including the spammers' obfuscations):
"Hello, This is Terry Hagan. We are accepting your mo rtgage application. Our company confirms you are legible for a $250.000 loan for a $380.00/month. Approval process will take 1 minute, so please fill out the form on our website. Best Regards, Terry Hagan; Senior Account Director Trades/Fin ance Department North Office"
"Dear Mr/Mrs, This is Brenda Dunn. We are accepting your mortga ge application. Our office confirms you can get a $228.000 lo an for a $371.00 per month payment. Follow the link to our website and submit your contact information. Best Regards, Brenda Dunn; Accounts Manager Trades/Fina nce Department East Office"

Correlation Clustering
- Parameterized similarity measure over message pairs (equation omitted).
- The solution is equivalent to a poly-cut in a fully connected graph whose edge weights are the similarities of the connected nodes.
- Maximize the intra-cluster similarity.

Problem Setting
- Parameterized similarity measure over pairwise features: edit distance of the subjects, tf-idf similarity of the bodies, …
- A collection x contains T_i messages x_1^{(i)},…,x_{T_i}^{(i)}.
- The output is a matrix y with y_{st} = 1 if x_s and x_t are in the same cluster and y_{st} = 0 otherwise.
- Correlation clustering is NP-complete! Solve a relaxed variant instead: substitute continuous values for the binary y_{st}.

Large Margin Approach
- Structural SVM with margin rescaling: minimize the regularized risk subject to one margin constraint per candidate clustering.
- Combine the minimizations and replace the inner minimization by its Lagrangian dual.
- The result is a QP with O(T³) constraints!

Exploit the Data Stream!
- Only the latest email x_t has to be integrated into the existing clustering; the clustering of x_1,…,x_{t−1} remains fixed (sliding window over the stream).
- Execution time is linear in the number of emails.

Sequential Approximation
- Exploit the streaming nature of the data: the clustering objective decomposes into a constant part (the objective of the fixed clustering) plus the objective of the sequential update, which can be computed in O(T).
- Decoding strategy: find the best cluster for the latest message, or create a singleton (see the sketch below).
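A minimal sketch of this sequential decoding step, assuming a similarity that is linear in pairwise features, sim_w(a,b) = ⟨w, ψ(a,b)⟩, and the simple rule "join the cluster with the highest summed similarity, or open a singleton if no score is positive". The actual update maximizes the clustering objective; the feature map, rule, and data here are illustrative assumptions.

```python
import numpy as np

def sequential_update(w, psi, x_new, clusters):
    """Integrate the latest message into the otherwise fixed clustering.

    Scores each existing cluster by the summed parameterized similarity of
    x_new to its members and opens a singleton if no score is positive.
    Runs in O(T) for T messages seen so far.
    """
    scores = [sum(w @ psi(x_new, x) for x in members) for members in clusters]
    if scores and max(scores) > 0.0:
        clusters[int(np.argmax(scores))].append(x_new)
    else:
        clusters.append([x_new])      # start a new batch (singleton)
    return clusters

# Toy usage with a hypothetical 2-d pairwise feature map (think: subject
# similarity and body tf-idf similarity, shifted to be positive for
# near-duplicates and negative otherwise).
def psi(a, b):
    return np.array([0.5 - abs(a[0] - b[0]), 0.5 - abs(a[1] - b[1])])

w = np.array([1.0, 1.0])
clusters = []
for msg in [(0.10, 0.20), (0.12, 0.19), (0.90, 0.80)]:
    clusters = sequential_update(w, psi, msg, clusters)
print(clusters)   # two batches: the first two messages, then a singleton
```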
Results for Batch Detection
- No significant difference (figure omitted).

Execution Time
- The sequential approximation is efficient.
Results taken from: Haider, Brefeld, Scheffer, "Supervised Clustering of Streaming Data for Email Batch Detection", ICML 2007.

Supervised Clustering of Data Streams for Email Batch Detection (P. Haider, U. Brefeld, and T. Scheffer, ICML 2007)
- Simple batch features increase the AUC performance of spam/non-spam classification.
- The misclassification risk is reduced by 40%!
Results taken from: Haider, Brefeld, Scheffer, "Supervised Clustering of Streaming Data for Email Batch Detection", ICML 2007.

Conclusion
- Semi-supervised learning: consensus maximization principle vs. cluster assumption; co-regularized least squares regression.
- Semi-supervised structured prediction models: CoSVMs and TSVMs; efficient optimization.
- Empirical results: the semi-supervised variants have lower error than the baselines, and adding unlabeled data further improves accuracy.
- Supervised clustering: efficient optimization; batch features reduce the misclassification risk.