Semi-supervised Structured Prediction Models
Ulf Brefeld
Joint work with Christoph Büscher, Thomas Gärtner, Peter Haider, Tobias Scheffer, Stefan Wrobel, and Alexander Zien.
Binary Classification
[Figure: positive (+) and negative (−) examples separated by a hyperplane with normal vector w]
Inappropriate for complex real-world problems.
Label Sequence Learning

Protein secondary structure prediction:
x = "XSITKTELDG ILPLVARGKV…"
y = "SS TT SS EEEE SS…"
Named entity recognition (NER):
x = “Tom comes from London.”
y = “Person,–,–,Location”
x = “The secretion of PTH and CT...”
y = “–,–,–,Gene,–,Gene,…”
Part-of-speech (POS) tagging:
x = “Curiosity kills the cat.”
y = “noun, verb, det, noun”
Natural Language Parsing
x = "Curiosity kills the cat"
y = [parse tree of the sentence]
Classification with Taxonomies
[Figure: example input x with its label y, a node in a class taxonomy]
Structural Learning

Given:
- n labeled pairs (x1,y1),…,(xn,yn) ∈ X×Y,
- drawn i.i.d. according to an unknown distribution p(x,y).

Learn a ranking function f: X×Y → R with f(x,y) = ⟨w, φ(x,y)⟩.
- The decision value measures how well y fits x.

Compute the prediction (inference/decoding):
- ŷ(x) = argmax_y f(x,y)

Find the hypothesis that realizes the smallest regularized empirical risk:
- min_f Σ_{i=1}^n ℓ(f(xi,·), yi) + η ||f||²
- Log-loss ℓ: kernel CRFs. Hinge loss ℓ: M³ networks, structural SVMs.
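As an illustration only (not the talk's implementation), a linear model with exhaustive decoding over a small candidate set might look as follows; the feature map phi and the candidate outputs are assumed stand-ins:

    import numpy as np

    # Minimal sketch of structured prediction: f(x, y) = <w, phi(x, y)>,
    # prediction by argmax over a candidate set. phi and candidates are
    # illustrative stand-ins; real decoders exploit the output structure.

    def score(w, phi, x, y):
        return float(w @ phi(x, y))

    def predict(w, phi, x, candidates):
        # exhaustive inference/decoding; Viterbi or CKY replace this
        # enumeration when the output space is exponentially large
        return max(candidates, key=lambda y: score(w, phi, x, y))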
Semi-supervised Discriminative Learning

Labeled training data is scarce and expensive.
- E.g., experiments in computational biology.
- Need for expert knowledge.
- Tedious and time-consuming.

Unclassified instances are abundant and cheap.
- Extract texts/sentences from the web (POS tagging, NER, NLP).
- Assess the primary structure of proteins from DNA/RNA.
- …
There is a need for semi-supervised techniques in structural learning!
Overview
1. Semi-supervised learning techniques.
   1. Co-regularized least squares regression.
2. Semi-supervised structured prediction models.
   1. Co-support vector machines.
   2. Transductive SVMs and efficient optimization.
3. Case study: email batch detection.
   1. Supervised clustering.
4. Conclusion.
Cluster Assumption
Now: m unlabeled inputs are given in addition to the n labeled pairs, with m ≫ n.
The decision boundary should not cross high-density regions.

[Figure: labeled examples (+/−) and unlabeled points; the boundary passes through a low-density region]

Examples: transductive learning, graph kernels, …
But: the cluster assumption is frequently inappropriate, e.g., for regression!
What else can we do?
Learning from Multiple Views / Co-learning


Split attributes into two disjoint sets (views) V1, V2.
E.g., web page classification:
- View 1 (intrinsic): content of the web page.
- View 2 (contextual): anchor texts of inbound links.

[Figure: intrinsic vs. contextual index entries, e.g., Aachen, Aalsmeer, Aaron, …, ZZ-Top]
In each view learn a hypothesis fv, v=1,2.
Each fv provides its peer with predictions on unlabeled examples.
Strategy: maximize consensus between f1 and f2.
Hypothesis Space Intersection
[Figure: hypothesis spaces H1 (view V1) and H2 (view V2), the true labeling function, and the version space in the intersection H1 ∩ H2]

Consensus maximization principle:
- Labeled examples → minimize the error.
- Unlabeled examples → minimize disagreement.
- Minimize an upper bound on the error!

- Hypothesis spaces H1 and H2.
- Minimize error rate and disagreement for all hypotheses in H1 ∩ H2.
- Unlabeled examples = data-driven regularization!
Co-optimization Problem


Given:
- n labeled pairs: (x1,y1),…,(xn,yn) ∈ X×Y
- m unlabeled inputs: x_{n+1},…,x_{n+m} ∈ X
- Loss function: Δ: Y×Y → R+
- V hypotheses: (f1,…,fV) ∈ H1×…×HV

Goal:

  min Q(f1,…,fV) = Σ_{v=1}^V [ Σ_{i=1}^n Δ(yi, argmax_{y'} fv(xi,y')) + η ||fv||² ]
                 + λ Σ_{u,v=1}^V Σ_{j=n+1}^{n+m} Δ(argmax_{y'} fu(xj,y'), argmax_{y''} fv(xj,y''))

(first sum: empirical risk of each fv; η-term: regularization; λ-term: pairwise disagreements on unlabeled inputs)

The representer theorem applies, so each fv can be expanded over the labeled and unlabeled examples.
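Read literally, Q can be transcribed as the following sketch (useful as a sanity check); the view functions, loss, and data are assumed stand-ins:

    def co_objective(views, loss, labeled, unlabeled, eta, lam):
        # views: list of (predict, norm_sq) pairs, where predict(x)
        # returns argmax_y f_v(x, y) and norm_sq is ||f_v||^2;
        # loss is the structured loss Delta.
        risk = sum(loss(y, predict(x))
                   for predict, _ in views for x, y in labeled)
        reg = eta * sum(norm_sq for _, norm_sq in views)
        disagree = sum(loss(pu(x), pv(x))
                       for pu, _ in views for pv, _ in views
                       for x in unlabeled)
        return risk + reg + lam * disagree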
Overview
1. Semi-supervised learning techniques.
   1. Co-regularized least squares regression.
2. Semi-supervised structured prediction models.
   1. Co-support vector machines.
   2. Transductive SVMs and efficient optimization.
3. Email batch detection.
   1. Supervised clustering.
4. Conclusion.
Semi-supervised Regularized Least Squares Regression

Special case:
- Output space Y = R.
- Consider functions f_v from the RKHS induced by kernel k_v.
- Squared loss: ℓ(y, f(x)) = (y − f(x))².

Given:
- n labeled examples,
- m unlabeled inputs,
- V views (V kernel functions k_1,…,k_V).

Consensus maximization principle:
- Minimize the squared error on labeled examples.
- Minimize the squared differences on unlabeled examples.
Co-regularized Least Squares Regression

Kernel matrix: (K_v)_{ij} = k_v(x_i, x_j), over labeled and unlabeled points.

Optimization problem:

  min Q(f1,…,fV) = Σ_{v=1}^V [ Σ_{i=1}^n (yi − fv(xi))² + η ||fv||² ]
                 + λ Σ_{u,v=1}^V Σ_{j=n+1}^{n+m} (fu(xj) − fv(xj))²

  (empirical risk + regularization + disagreement)

Closed-form solution (a sketch follows below):
- The involved matrix is strictly positive definite if each K_v is strictly positive definite.

Execution time:
- Cubic in n+m: as good (or bad) as the state-of-the-art.
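For two views, setting the gradient of the squared-loss objective to zero yields a block-linear system in the expansion coefficients; the following sketch solves it in one cubic-time step. It reflects my reading of the closed form under assumed conventions for η and λ, not necessarily the paper's exact parameterization:

    import numpy as np

    def corlsr_exact(K1, K2, y, n, eta, lam):
        # K1, K2: (n+m)x(n+m) kernel matrices over labeled + unlabeled
        # points; y: labels of the first n points. Returns coefficients
        # c1, c2 with f_v(.) = sum_j c_v[j] * k_v(x_j, .). Assumes
        # strictly positive definite kernels so the system is solvable.
        N = K1.shape[0]
        L = np.zeros((N, N)); L[:n, :n] = np.eye(n)  # picks labeled points
        U = np.eye(N) - L                            # picks unlabeled points
        y_full = np.zeros(N); y_full[:n] = y
        A = np.block([
            [K1 @ L @ K1 + eta * K1 + lam * K1 @ U @ K1, -lam * K1 @ U @ K2],
            [-lam * K2 @ U @ K1, K2 @ L @ K2 + eta * K2 + lam * K2 @ U @ K2],
        ])
        b = np.concatenate([K1 @ y_full, K2 @ y_full])
        c = np.linalg.solve(A, b)                    # one O((2(n+m))^3) solve
        return c[:N], c[N:]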
Semi-parametric Approximation

Restrict the hypothesis space: expand each fv over the n labeled examples only.

Convex objective function (same form as before, with the smaller expansion).

Solution: in closed form, as above.

Execution time: only linear in the amount of unlabeled data.
Semi-supervised Methods for Distributed Data


- Participants keep their labeled data private.
- They agree on a fixed set of unlabeled data.
- Alternating per-view updates (sketched below) converge to the global optimum.
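A plausible realization, sketched here as block-coordinate descent on the convex objective (an assumption on my part, not a transcription of the talk's protocol): each participant refits its own view against the peer's current predictions on the shared unlabeled points, so labeled data never leaves its owner:

    import numpy as np

    def corlsr_distributed(K1, K2, y, n, eta, lam, iters=50):
        # Alternating updates: each step solves one view's subproblem
        # with the peer's unlabeled predictions held fixed; on a convex
        # objective this converges to the global optimum.
        N = K1.shape[0]
        L = np.zeros((N, N)); L[:n, :n] = np.eye(n)
        U = np.eye(N) - L
        y_full = np.zeros(N); y_full[:n] = y
        c1, c2 = np.zeros(N), np.zeros(N)
        for _ in range(iters):
            f2 = K2 @ c2     # peer predictions; only unlabeled rows used
            c1 = np.linalg.solve(K1 @ L @ K1 + eta * K1 + lam * K1 @ U @ K1,
                                 K1 @ (L @ y_full + lam * U @ f2))
            f1 = K1 @ c1
            c2 = np.linalg.solve(K2 @ L @ K2 + eta * K2 + lam * K2 @ U @ K2,
                                 K2 @ (L @ y_full + lam * U @ f1))
        return c1, c2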
Empirical Results

- 32 UCI data sets, 10-fold "inverse" cross-validation.
- Dashed lines indicate equal performance.

[Scatter plots: RMSE of RLSR vs. coRLSR (exact) and vs. coRLSR (approx.)]

- RMSE: exact coRLSR < semi-parametric coRLSR (approx.) < RLSR.

Results taken from: Brefeld, Gärtner, Scheffer, Wrobel, "Efficient coRLSR", ICML 2006.
Execution Time

- The exact solution is cubic in the number of unlabeled examples.
- The approximation is only linear!

Results taken from: Brefeld, Gärtner, Scheffer, Wrobel, "Efficient coRLSR", ICML 2006.
Overview
1. Semi-supervised learning techniques.
   1. Co-regularized least squares regression.
2. Semi-supervised structured prediction models.
   1. Co-support vector machines.
   2. Transductive SVMs and efficient optimization.
3. Email batch detection.
   1. Supervised clustering.
4. Conclusion.
Semi-supervised Learning for Structured Output Variables

Given:
- n labeled examples (x1,y1),…,(xn,yn)
- m unlabeled inputs x_{n+1},…,x_{n+m}

Joint decision functions: fv(x,y) = ⟨wv, φv(x,y)⟩,
where φ1 and φ2 are distinct joint feature mappings in views V1 and V2.

Apply the consensus maximization principle:
- Minimize the error on labeled examples.
- Minimize the disagreement on unlabeled examples.

Compute the argmax (a decoder sketch follows below):
- Viterbi algorithm (sequential output).
- CKY algorithm (recursive grammar).
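For linear-chain outputs, a standard Viterbi decoder can be sketched as follows; the per-position and transition score arrays stand in for ⟨w, φ(x,y)⟩ decomposed over the chain:

    import numpy as np

    def viterbi(node_scores, trans_scores):
        # node_scores: (T, S) per-position label scores;
        # trans_scores: (S, S) transition scores.
        # Returns the highest-scoring label sequence (standard Viterbi).
        T, S = node_scores.shape
        delta = node_scores[0].copy()
        back = np.zeros((T, S), dtype=int)
        for t in range(1, T):
            cand = delta[:, None] + trans_scores   # (previous, current)
            back[t] = cand.argmax(axis=0)
            delta = cand.max(axis=0) + node_scores[t]
        path = [int(delta.argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t][path[-1]]))
        return path[::-1]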
CoSVM Optimization Problem

View v = 1,2: per-view structural SVM problem; for unlabeled examples, the prediction of the peer view serves as the target, weighted by the peer view's confidence.

Dual representation:
- Dual parameters are bound to input examples.
- Working sets associated with subspaces.
- Sparse models!
Labeled Examples, View v=1,2
x_i = "John ate the cat",  y_i = ⟨N,V,D,N⟩

Viterbi decoding proposes ŷ = ⟨N,D,D,N⟩, then ⟨N,V,V,N⟩, and finally the correct ⟨N,V,D,N⟩.

On an error / margin violation (a code sketch follows below):
1. Update the working set:
   Ω_i^v = { φv(x_i,y_i) − φv(x_i,⟨N,D,D,N⟩),  φv(x_i,y_i) − φv(x_i,⟨N,V,V,N⟩) },
   α_i^v = ( α_i^v(⟨N,D,D,N⟩),  α_i^v(⟨N,V,V,N⟩) ).
2. Optimize α_i^v with the working sets Ω_{j≠i}^v and duals α_{j≠i}^v fixed.

Return α_i^v, Ω_i^v.
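Schematically, one working-set step could be written as below; every helper passed in (decoder, margin test, subproblem solver) is an assumed stand-in for the corresponding component, not the talk's exact routine:

    def working_set_step(x_i, y_i, decode, phi, violated, optimize,
                         Omega_i, alpha_i):
        # decode(x): current Viterbi prediction; violated(x, y, y_hat):
        # margin test; optimize(Omega_i, alpha_i): re-solve the small QP
        # over alpha_i only, all other working sets held fixed.
        y_hat = decode(x_i)
        if y_hat != y_i and violated(x_i, y_i, y_hat):
            Omega_i.append(phi(x_i, y_i) - phi(x_i, y_hat))  # difference vector
            alpha_i.append(0.0)                              # fresh dual variable
            optimize(Omega_i, alpha_i)
        return Omega_i, alpha_i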
Unlabeled Examples
x_i = "John went home"

View 1: Viterbi decoding yields ŷ¹ = ⟨D,V,N⟩.
View 2: Viterbi decoding yields ŷ² = ⟨N,V,V⟩.
→ Disagreement / margin violation!

1. Update the working sets, with Ω_{j≠i} and α_{j≠i} fixed:
   Ω_i¹ = { φ¹(x_i,⟨N,V,V⟩) − φ¹(x_i,⟨D,V,N⟩) },  α_i¹ = ( α_i¹(⟨D,V,N⟩) ),
   Ω_i² = { φ²(x_i,⟨D,V,N⟩) − φ²(x_i,⟨N,V,V⟩) },  α_i² = ( α_i²(⟨N,V,V⟩) ).
2. Optimize α_i¹, α_i².

Consensus: both views now decode ⟨N,V,N⟩; return α_i¹, α_i², Ω_i¹, Ω_i².
Biocreative Named Entity Recognition

BioCreative (Task 1A, BioCreative Challenge, 2003).
- 7,500 sentences from biomedical papers.
- Task: recognize gene/protein names.
- 500 holdout sentences.
- Approximately 350,000 features (letter n-grams, surface clues, …).
- Random feature split.
- The baseline is trained on all features.

Results taken from: Brefeld, Büscher, Scheffer, "Semi-supervised Discriminative Sequential Learning", ECML 2005.
Biocreative Gene/Protein Name Recognition


- CoSVM is more accurate than the SVM.
- Accuracy is positively correlated with the number of unlabeled examples.

Results taken from: Brefeld, Büscher, Scheffer, "Semi-supervised Discriminative Sequential Learning", ECML 2005.
Natural Language Parsing

Wall Street Journal corpus (Penn Treebank).
- Sections 2-21.
- 8,666 sentences of length ≤ 15 tokens.
- Context-free grammar contains > 4,800 production rules.

Negra corpus.
- German newspaper archive.
- 14,137 sentences of between 5 and 25 tokens.
- CFG contains > 26,700 production rules.

Experimental setup:
- Local features (rule identity, rule at border, span width, …).
- Loss: Δ(ya,yb) = 1 − F1(ya,yb).
- 100 holdout examples.
- CKY parser by Mark Johnson.

Results taken from: Brefeld, Scheffer, "Semi-supervised Learning for Structured Output Variables", ICML 2006.
Wall Street Journal / Negra Corpus Natural Language Parsing


- CoSVM significantly outperforms the SVM.
- Adding unlabeled instances further improves the F1 score.

Results taken from: Brefeld, Scheffer, "Semi-supervised Learning for Structured Output Variables", ICML 2006.
Execution Time

- CoSVM scales quadratically in the number of unlabeled examples.

Results taken from: Brefeld, Scheffer, "Semi-supervised Learning for Structured Output Variables", ICML 2006.
Overview
1. Semi-supervised learning techniques.
   1. Co-regularized least squares regression.
2. Semi-supervised structured prediction models.
   1. Co-support vector machines.
   2. Transductive SVMs and efficient optimization.
3. Email batch detection.
   1. Supervised clustering.
4. Conclusion.
Transductive Support Vector Machines for Structured Variables

Binary transductive SVMs:
- Cluster assumption.
- Discrete variables for unlabeled instances.
- Optimization is expensive even for binary tasks!

Structural transductive SVMs:
- Decoding = combinatorial optimization over discrete variables.
- Intractable!

Efficient optimization:
- Transform the problem and remove the discrete variables.
- Differentiable, continuous optimization.
- Apply gradient-based, unconstrained optimization techniques.
Unconstrained Support Vector Machines

SVM optimization problem: solving the constraints for the slack variables moves the hinge loss into the objective.
- But the hinge loss is not differentiable!
- The Huber loss is: substitute a smoothed (Huber) hinge, sketched below.
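A common Huber-style smoothing of the hinge (given with an assumed smoothing parameter eps, not necessarily the talk's exact variant) replaces the kink at the margin by a quadratic segment:

    import numpy as np

    def huber_hinge(margin, eps=0.5):
        # Quadratic inside the eps-band around the hinge kink, linear
        # beyond it, zero for comfortably satisfied margins; continuously
        # differentiable everywhere.
        t = 1.0 - margin
        return np.where(t <= -eps, 0.0,
               np.where(t >= eps, t, (t + eps) ** 2 / (4 * eps)))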
Unconstrained Support Vector Machines

SVM optimization problem: after solving the constraints for the slack variables, there is still a max over outputs in the objective!
- Substitute the differentiable softmax for the max (sketch below).
- The result is a differentiable objective without constraints!
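The usual differentiable stand-in for the max is the scaled log-sum-exp; gamma is an assumed smoothing parameter, and the hard max is recovered as gamma → ∞. The sketch enumerates the scores explicitly, whereas structured models would compute the sum by dynamic programming:

    import numpy as np

    def softmax_max(scores, gamma=10.0):
        # (1/gamma) * log sum_y exp(gamma * scores[y]), a smooth upper
        # bound on max(scores); shifted by the max for numerical stability.
        g = gamma * np.asarray(scores, dtype=float)
        return float((g.max() + np.log(np.exp(g - g.max()).sum())) / gamma)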
Unconstrained Transductive Support Vector Machines

Unconstrained SVM objective function, extended to unlabeled data:
- Include unlabeled instances via an appropriate loss function (sketched below).
- A 2-best decoder supplies the two highest-scoring outputs; margin violations are mitigated by moving w in two symmetric ways.
- A trade-off parameter controls the overall influence of the unlabeled instances.

The optimization problem is not convex!
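Combining the pieces, the effective loss on an unlabeled instance can be sketched as below: the 2-best decoder supplies the two top-scoring outputs, and penalizing a small gap between them lets the gradient resolve the violation in either of two symmetric ways. The smoothing parameter eps is assumed, as above:

    import numpy as np

    def unlabeled_loss(score_best, score_second, eps=0.5):
        # gap >= 0 by construction of the 2-best decoder; a Huber-smoothed
        # hinge on the gap keeps the objective differentiable.
        t = 1.0 - (score_best - score_second)
        return float(np.where(t <= -eps, 0.0,
                     np.where(t >= eps, t, (t + eps) ** 2 / (4 * eps))))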
Execution Time
[Plot: optimization time; curves for +250 and +500 unlabeled examples]

- Gradient-based optimization is faster than solving QPs.
- Efficient transductive integration of unlabeled instances.

Results taken from: Zien, Brefeld, Scheffer, "TSVMs for Structured Variables", ICML 2007.
Spanish News Wire Named Entity Recognition

Spanish News Wire (Special Session of CoNLL, 2002).
- 3,100 sentences of between 10 and 40 tokens.
- Entities: person, location, organization, and misc. names (9 labels).
- Window of size 3 around each token.
- Approximately 120,000 features (token itself, surface clues, …).
- 300 holdout sentences.

Results taken from: Zien, Brefeld, Scheffer, "TSVMs for Structured Variables", ICML 2007.
Spanish News Named Entity Recognition

[Plot: token error [%] vs. number of unlabeled examples]

- The TSVM has significantly lower error rates than the SVM.
- The error decreases with the number of unlabeled instances.

Results taken from: Zien, Brefeld, Scheffer, "TSVMs for Structured Variables", ICML 2007.
Artificial Sequential Data
[Plots: token error with RBF vs. Laplacian kernels]

- 10-nearest-neighbor Laplacian kernel vs. RBF kernel.
- The Laplacian kernel is well suited.
- Only little improvement by the TSVM, if any.
- Different cluster assumptions:
  - Laplacian: local (token level).
  - TSVM: global (sequence level).

Results taken from: Zien, Brefeld, Scheffer, "TSVMs for Structured Variables", ICML 2007.
Overview
1. Semi-supervised learning techniques.
   1. Co-regularized least squares regression.
2. Semi-supervised structured prediction models.
   1. Co-support vector machines.
   2. Transductive SVMs and efficient optimization.
3. Email batch detection.
   1. Supervised clustering.
4. Conclusion.
Supervised Clustering of Data Streams for Email Batch Detection

Spam characteristics:
- Roughly 80% of all electronic messages are spam.
- Approximately 80-90% of these spam messages are generated by only a few spammers.
- Spammers maintain templates and exchange them rapidly.
- Many emails are generated by the same template (= batch) within short time frames.

Goal:
- Detect batches in the data stream.
- Ground truth of exact clusterings exists!

Batch information:
- Black/white listing.
- Improve spam/non-spam classification.
Template-Generated Spam Messages
Hello,
This is Terry Hagan.We are accepting your mo rtgage application.
Our company confirms you are legible for a $250.000 loan for a
$380.00/month. Approval process will take 1 minute, so please fill out the
form on our website.
Best Regards, Terry Hagan; Senior Account Director
Trades/Fin ance Department North Office
Dear Mr/Mrs,
This is Brenda Dunn.We are accepting your mortga ge application.
Our office confirms you can get a $228.000 lo an for a
$371.00 per month payment. Follow the link to our website and submit
your contact information.
Best Regards, Brenda Dunn; Accounts Manager
Trades/Fina nce Department East Office
Correlation Clustering



- Parameterized similarity measure between messages (objective sketched below).
- The solution is equivalent to a poly-cut in a fully connected graph.
- Edge weights are the similarities of the connected nodes.
- Maximize the intra-cluster similarity.
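In matrix form, the maximized quantity can be sketched as follows, with S the learned pairwise similarities and y the binary same-cluster indicator matrix (diagonal assumed zero):

    import numpy as np

    def intra_cluster_similarity(S, y):
        # Sum of similarities over pairs assigned to the same cluster;
        # each unordered pair is counted once.
        return float((S * y).sum() / 2.0)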
Problem Setting

Parameterized similarity measure (a sketch follows below): sim_w(x_s, x_t) = ⟨w, φ(x_s, x_t)⟩.
- Pairwise features φ:
  - Edit distance of the subjects,
  - tf.idf similarity of the bodies,
  - …

Collection x contains T_i messages x_1^(i),…,x_{T_i}^(i).
Matrix y with y_{st} = 1 if x_s and x_t are in the same cluster and 0 otherwise.

Correlation clustering is NP-complete!
Solve a relaxed variant instead:
- Substitute continuous y_{st} ∈ [0,1] for discrete y_{st} ∈ {0,1}.
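A minimal sketch of the parameterized measure; the pairwise feature functions are illustrative stand-ins for the listed features:

    import numpy as np

    def pairwise_similarity(w, msg_a, msg_b, feature_fns):
        # sim_w(a, b) = <w, phi(a, b)>; feature_fns could be, e.g., the
        # edit distance of subjects or the tf.idf cosine of the bodies.
        phi = np.array([f(msg_a, msg_b) for f in feature_fns])
        return float(w @ phi)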
Large Margin Approach

Structural SVM with margin rescaling:
minimize
combine the
minimizations
replace with
Lagrangian dual
subject to:
QP with O(T3)
constraints!
Exploit Data Stream!



- Only the latest email x_t has to be integrated into the existing clustering.
- The clustering of x_1,…,x_{t-1} remains fixed.
- Execution time is linear in the number of emails.

[Figure: sliding window over the message stream; "?" marks the newest message]
Sequential Approximation

Exploit the streaming nature of the data:
- Objective of the full clustering = constant part + objective of the sequential update.
- Computation in O(T).

Decoding strategy (sketched below): find the best cluster for the latest message, or create a singleton.
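The decoding step can be sketched in a few lines; sim is the learned pairwise similarity and clusters holds the fixed clustering of the earlier messages. The positive-gain test is an assumed criterion for opening a singleton:

    def sequential_update(sim, clusters, t):
        # Attach the newest message t to the cluster with the highest
        # total similarity, or open a singleton batch; O(T) per message.
        gains = [sum(sim(s, t) for s in c) for c in clusters]
        best = max(range(len(gains)), key=gains.__getitem__, default=-1)
        if best >= 0 and gains[best] > 0:
            clusters[best].append(t)
        else:
            clusters.append([t])
        return clusters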
Results for Batch Detection

No significant difference in batch-detection quality between the exact solution and the sequential approximation.
Execution Time

- The sequential approximation is efficient.

Results taken from: Haider, Brefeld, Scheffer, "Supervised Clustering of Streaming Data", ICML 2007.
Supervised Clustering of Data Streams for Email Batch Detection
(P. Haider, U. Brefeld, and T. Scheffer, ICML 2007)


- Simple batch features increase the AUC performance of the spam/non-spam classifier.
- The misclassification risk is reduced by 40%!

Results taken from: Haider, Brefeld, Scheffer, "Supervised Clustering of Streaming Data", ICML 2007.
Overview
1. Semi-supervised learning techniques.
   1. Co-regularized least squares regression.
2. Semi-supervised structured prediction models.
   1. Co-support vector machines.
   2. Transductive SVMs and efficient optimization.
3. Email batch detection.
   1. Supervised clustering.
4. Conclusion.
Conclusion

Semi-supervised learning:
- Consensus maximization principle vs. cluster assumption.
- Co-regularized least squares regression.

Semi-supervised structured prediction models:
- CoSVMs and TSVMs.
- Efficient optimization.

Empirical results:
- Semi-supervised variants have lower error than their baselines.
- Adding unlabeled data further improves accuracy.

Supervised clustering:
- Efficient optimization.
- Batch features reduce the misclassification risk.
Download