Transfer Learning by Mapping and Revising Relational Knowledge
Raymond J. Mooney
University of Texas at Austin
with acknowledgements to Lily Mihalkova and Tuyen Huynh
Transfer Learning
• Most machine learning methods learn each new task from
scratch, failing to utilize previously learned knowledge.
• Transfer learning concerns using knowledge acquired in
a previous source task to facilitate learning in a related
target task.
• We usually assume that significant training data was available in
the source domain but only limited training data is available in
the target domain.
• By exploiting knowledge from the source, learning in the
target can be:
– More accurate: Learned knowledge makes better predictions.
– Faster: Training time is reduced.
Transfer Learning Curves
• Transfer learning increases accuracy in the target domain.
[Figure: learning curves of predictive accuracy vs. amount of training data in the target domain. The "transfer from source" curve starts above the "no transfer" curve (jump start) and keeps an advantage while target data is limited.]
Recent Work on Transfer Learning
• A recent DARPA program on Transfer Learning has led
to significant research in the area.
• Some work focuses on feature-vector classification.
– Hierarchical Bayes (Yu et al., 2005; Lawrence & Platt, 2004)
– Informative Bayesian Priors (Raina et al., 2005)
– Boosting for transfer learning (Dai et al., 2007)
– Structural Correspondence Learning (Blitzer et al., 2007)
• Some work focuses on Reinforcement Learning
– Value-function transfer (Taylor & Stone, 2005; 2007)
– Advice-based policy transfer (Torrey et al., 2005; 2007)
Similar Research Problems
• Multi-Task Learning (Caruana, 1997)
– Learn multiple tasks simultaneously; each one
helped by the others.
• Life-Long Learning (Thrun, 1996)
– Transfers knowledge from a number of prior source
problems, choosing which source problems to use.
Logical Paradigm
• Represents knowledge and data in binary
symbolic logic such as First Order Predicate
Calculus.
+ Rich representation that handles arbitrary
sets of objects, with properties, relations,
quantifiers, etc.
− Unable to handle uncertain knowledge and
probabilistic reasoning.
Probabilistic Paradigm
• Represents knowledge and data as a fixed
set of random variables with a joint
probability distribution.
+ Handles uncertain knowledge and
probabilistic reasoning.
− Unable to handle arbitrary sets of objects,
with properties, relations, quantifiers, etc.
Statistical Relational Learning (SRL)
• Most machine learning methods assume
i.i.d. examples represented as fixed-length
feature vectors.
• Many domains require learning and making
inferences about unbounded sets of entities
that are richly relationally connected.
• SRL methods attempt to integrate methods
from predicate logic and probabilistic
graphical models to handle such structured,
multi-relational data.
Statistical Relational Learning
[Diagram: multi-relational data is fed to a learning algorithm, which produces a probabilistic graphical model.]
Example relations in the multi-relational data:
• Actor: pacino, brando, …
• Director: coppola, …
• WorkedFor: (pacino, coppola), (brando, coppola), …
• Movie: (godFather, pacino), (godFather, brando), (godFather, coppola), (streetCar, brando), …
Multi-Relational Data Challenges
• Examples cannot be effectively represented as feature
vectors.
• Predictions for connected facts are not independent.
(e.g. WorkedFor(brando, coppola), Movie(godFather, brando))
– Data is not i.i.d.
– Requires collective inference (classification) (Taskar et al., 2001)
• A single independent example (mega-example) often
contains information about a large number of interconnected
entities and can vary in length.
– Leave-one-university-out testing (Craven et al., 1998)
TL and SRL and I.I.D.
Standard Machine Learning assumes examples are:
Independent and Identically Distributed
SRL breaks the assumption
that examples are independent
TL breaks the assumption
that test examples are drawn
from the same distribution as
the training instances
Multi-Relational Domains
• Domains about people
– Academic departments (UW-CSE)
– Movies (IMDB)
• Biochemical domains
– Mutagenesis
– Alzheimer drug design
• Linked text domains
– WebKB
– Cora
Relational Learning Methods
• Inductive Logic Programming (ILP)
– Produces sets of first-order rules
– Not appropriate for probabilistic reasoning
• Example rule: if a student wrote a paper with a professor, then the professor
is the student's advisor.
• SRL models & learning algorithms
– SLPs (Muggleton, 1996)
– PRMs (Koller, 1999)
– BLPs (Kersting & De Raedt, 2001)
– RMNs (Taskar et al., 2002)
– MLNs (Markov logic networks; Richardson & Domingos, 2006)
MLN Transfer
(Mihalkova, Huynh, & Mooney, 2007)
• Given two multi-relational domains, such as:
– Source (UW-CSE): Professor(A), Student(A), Publication(T, A), AdvisedBy(A, B), …
– Target (IMDB): Director(A), Actor(A), Movie(T, A), WorkedFor(A, B), …
• Transfer a Markov logic network learned in the Source to the Target by:
– Mapping the Source predicates to the Target.
– Revising the mapped knowledge.
First-Order Logic Basics
• Literal: A predicate (or its negation) applied to
constants and/or variables.
– Gliteral: Ground literal; WorkedFor(brando, coppola)
– Vliteral: Variablized literal; WorkedFor(A, B)
• We assume predicates have typed arguments.
– For example, in Movie(godFather, coppola), the first argument
has type movieTitle and the second has type person.
First-Order Clauses
• Clause: A disjunction of literals.
• Can be rewritten as a set of rules.
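As an illustration (using the Movie and WorkedFor predicates from the IMDB examples later in the deck), a clause and two of its equivalent rule forms:

```latex
% Clause: a disjunction of literals
\neg \mathit{Movie}(T,A) \lor \neg \mathit{WorkedFor}(A,B) \lor \mathit{Movie}(T,B)

% The same clause rewritten as rules (contrapositive forms):
\mathit{Movie}(T,A) \land \mathit{WorkedFor}(A,B) \rightarrow \mathit{Movie}(T,B)
\mathit{Movie}(T,A) \land \neg \mathit{Movie}(T,B) \rightarrow \neg \mathit{WorkedFor}(A,B)
```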
Representing the Data
Actor(pacino)
WorkedFor(pacino, coppola)
Movie(godFather, pacino)
Actor(brando)
WorkedFor(brando, coppola)
Movie(godFather, brando)
Director(coppola)
Movie(godFather, coppola)
• Makes a closed-world assumption:
– The gliterals listed are true; all others are false.
Markov Logic Networks
(Richardson & Domingos, 2006)
• A set of first-order clauses, each assigned a weight.
• A larger weight indicates a stronger belief that the clause should hold.
• The clauses are called the structure of the MLN.
[Figure: an example MLN of three clauses with weights 5.3, 3.2, and 0.5.]
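A minimal sketch of this representation in Python. The clause strings are hypothetical stand-ins; only the weights (5.3, 3.2, 0.5) survive from the slide.

```python
# An MLN's structure as (weight, clause) pairs. The clause strings are
# hypothetical stand-ins; only the weights come from the slide's example.
mln = [
    (5.3, "Director(A) => ~Actor(A)"),
    (3.2, "Movie(T, A) ^ WorkedFor(A, B) => Movie(T, B)"),
    (0.5, "Actor(A) v Director(A)"),
]

# A larger weight indicates a stronger belief that the clause should hold.
strongest = max(mln, key=lambda wc: wc[0])
print(strongest[0])  # 5.3
```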
Markov Networks
(Pearl, 1988)
• A concise representation of the joint probability distribution of a set of
random variables using an undirected graph.
• Example variables: Reputation of Author (R), Algorithm (A),
Experiments (E), Quality of Paper (Q).

Joint distribution (excerpt):
R A E Q | P
e e e e | 0.7
e e e p | 0.1
e e p e | 0.03
…

• The same probability distribution can be represented as the product of
a set of functions defined over the cliques of the graph.
Markov Network Equations
• General form:
  P(X = x) = (1/Z) ∏_k φ_k(x_{(k)})
• Log-linear models:
  P(X = x) = (1/Z) exp( ∑_j w_j f_j(x) )
  where the w_j are weights and the f_j are features.
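The two forms above can be checked numerically. A small Python sketch with two binary variables in one clique; the potential values are made up for illustration:

```python
import itertools, math

# Two binary variables X1, X2 forming one clique, with potential phi.
phi = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 4.0}

# General form: P(x) = (1/Z) * product over cliques of phi_k(x_k)
Z = sum(phi[x] for x in itertools.product([0, 1], repeat=2))
P = {x: phi[x] / Z for x in phi}

# Log-linear form: same distribution with one feature f(x) = x1 + x2
# and weight w = log 2, since here phi(x) = exp(w * f(x)).
w = math.log(2.0)
Zl = sum(math.exp(w * (x1 + x2)) for x1, x2 in itertools.product([0, 1], repeat=2))
Pl = {x: math.exp(w * (x[0] + x[1])) / Zl for x in P}

# Both parameterizations give the same distribution.
assert all(abs(P[x] - Pl[x]) < 1e-12 for x in P)
print(P[(1, 1)])  # 4/9
```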
Ground Markov Network for an MLN
• MLNs are templates for constructing Markov
networks for a given set of constants:
– Include a node for each type-consistent grounding
(a gliteral) of each predicate in the MLN.
– Two nodes are connected by an edge if their
corresponding gliterals appear together in any
grounding of any clause in the MLN.
– Include a feature for each grounding of each
clause in the MLN with weight equal to the
weight of the clause.
[Figure: ground Markov network for the constants coppola, brando, and godFather, with clause weights 1.3, 1.2, and 0.5. Nodes: Actor(brando), Director(brando), Actor(coppola), Director(coppola), WorkedFor(brando, brando), WorkedFor(brando, coppola), WorkedFor(coppola, brando), WorkedFor(coppola, coppola), Movie(godFather, brando), Movie(godFather, coppola).]
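The node-construction step above can be sketched in Python. With the constants of this example (coppola, brando, godFather) and typed predicates, it yields exactly the 10 nodes of the ground network:

```python
from itertools import product

# One node per type-consistent grounding of each predicate.
constants = {"person": ["coppola", "brando"], "movieTitle": ["godFather"]}
predicates = {                      # predicate -> argument types
    "Actor": ["person"],
    "Director": ["person"],
    "WorkedFor": ["person", "person"],
    "Movie": ["movieTitle", "person"],
}

nodes = [
    f"{pred}({', '.join(args)})"
    for pred, types in predicates.items()
    for args in product(*(constants[t] for t in types))
]
print(len(nodes))  # 10, matching the ground network on the slide
```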
MLN Equations
• Probability of a possible world x:
  P(X = x) = (1/Z) exp( ∑_i w_i n_i(x) )
  where w_i is the weight of clause i and n_i(x) is the number of true
  groundings of clause i in x.
• Compare to log-linear Markov networks: each clause acts as a feature
template, with n_i(x) playing the role of a feature count.
MLN Equation Intuition
A possible world (a truth assignment to all
gliterals) becomes exponentially less likely
as the total weight of all the grounded
clauses it violates increases.
MLN Inference
• Given truth assignments for a given set of evidence
gliterals, infer the probability that each member of a
set of unknown query gliterals is true.
[Figure: the ground network from the previous slide, with evidence gliterals fixed to true/false values (T/F) and inferred marginal probabilities (0.9, 0.7, 0.3, 0.2, 0.2, 0.1) shown for the query gliterals.]
MLN Inference Algorithms
• Gibbs Sampling (Richardson & Domingos, 2006)
• MC-SAT (Poon & Domingos, 2006)
MLN Learning
• Weight-learning (Richardson & Domingos, 2006;
Lowd & Domingos, 2007)
– Performed using optimization methods.
• Structure-learning (Kok & Domingos, 2005)
– Proceeds in iterations of beam search, adding the
best-performing clause found in each iteration to the
MLN.
– Clauses are evaluated using the WPLL score.
WPLL
(Kok & Domingos, 2005)
• Weighted pseudo log-likelihood:
  WPLL = ∑_r c_r ∑_k log P( X_{r,k} = x_{r,k} | MB(X_{r,k}) )
– Compute the likelihood of the data according to the model and take the log.
– For each gliteral, condition on its Markov blanket for tractability.
– Weight each predicate r by c_r so that predicates with greater arity
(and hence more groundings) do not dominate.
Alchemy
• Open-source package of MLN software provided by UW that includes:
– Inference algorithms
– Weight learning algorithms
– Structure learning algorithm
– Sample data sets
• All our software uses and extends Alchemy.
TAMAR
(Transfer via Automatic Mapping And Revision)
Clause from UW-CSE:
  Publication(T,A) ∧ AdvisedBy(A,B) → Publication(T,B)
M-TAMAR (predicate mapping: Publication → Movie, AdvisedBy → WorkedFor):
  Movie(T,A) ∧ WorkedFor(A,B) → Movie(T,B)
R-TAMAR (revision against target (IMDB) data):
  Movie(T,A) ∧ WorkedFor(A,B) ∧ Relative(A,B) → Movie(T,B)
Predicate Mapping
• Each clause is mapped independently of the
others.
• The algorithm considers all possible ways to map
a clause such that:
– Each predicate in the source clause is mapped to
some target predicate.
– Each argument type in the source is mapped to
exactly one argument type in the target.
• Each mapped clause is evaluated by measuring
its WPLL for the target data, and the most
accurate mapping is kept.
Predicate Mapping Example
Source clause predicates: Publication(title, person), AdvisedBy(person, person)
Mapped target predicates: Movie(name, person), WorkedFor(person, person)
Consistent type mapping: title → name, person → person
Predicate Mapping Example 2
Source clause predicates: Publication(title, person), AdvisedBy(person, person)
Mapped target predicates: Gender(person, gend), SameGender(gend, gend)
Consistent type mapping: title → person, person → gend
Transfer Learning as Revision
• Regard the mapped source MLN as an approximate model for the target
task that needs to be accurately and efficiently revised.
• Thus our general approach is similar to that taken by theory revision
systems (Richards & Mooney, 1995).
• Revisions are proposed in a bottom-up fashion.
[Diagram: the source MLN and the target training data feed a data-driven revision proposer, combining top-down and bottom-up revision.]
R-TAMAR
[Diagram: the R-TAMAR pipeline. An MLN of weighted clauses (Clause 1 … Clause 5) is applied to the relational data. Self-diagnosis bins each clause as "Too Long", "Good", or "Too Short". Clauses in the bad bins are revised by directed beam search, and new clause discovery proposes additional clauses. The resulting candidate clauses receive weights (e.g. 0.1, 0.5, 1.3, -0.2, 1.7) and are evaluated by the change in WPLL.]
R-TAMAR: Self-Diagnosis
• Use mapped source MLN to make inferences in
the target and observe the behavior of each clause:
– Consider each predicate P in the domain in turn.
– Use Gibbs sampling to infer truth values for the
gliterals of P, using the remaining gliterals as evidence.
– Bin the clauses containing gliterals of P based on
whether they behave as desired.
• Revisions are focused only on clauses in the
“Bad” bins.
Self-Diagnosis: Clause Bins
Data: Actor(brando), Director(coppola), Movie(godFather, brando),
Movie(godFather, coppola), Movie(rainMaker, coppola), WorkedFor(brando, coppola)
Current gliteral: Actor(brando)
Each clause containing a gliteral of the current predicate is placed in one of four bins:
– Relevant; Good
– Relevant; Bad
– Irrelevant; Good
– Irrelevant; Bad
Structure Revisions
• Using directed beam search:
– Literal deletions are attempted only for clauses marked for shortening.
– Literal additions are attempted only for clauses marked for lengthening.
• Training is much faster since the search space is constrained by:
1. Limiting the clauses considered for updates.
2. Restricting the type of updates allowed.
New Clause Discovery
• Uses Relational Pathfinding (Richards & Mooney, 1992)
Data: Actor(brando), Director(coppola), Movie(godFather, brando),
Movie(godFather, coppola), Movie(rainMaker, coppola), WorkedFor(brando, coppola)
[Diagram: a graph over the constants brando, coppola, godFather, and rainMaker, linked by the gliterals they co-occur in (e.g. brando –WorkedFor– coppola, brando –Movie– godFather); paths in this graph suggest candidate clauses.]
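A minimal Python sketch of the idea behind relational pathfinding: treat constants as graph nodes, connect constants that co-occur in a true gliteral, and search for short paths, which suggest candidate clauses.

```python
from collections import defaultdict, deque

# True gliterals from the example data.
gliterals = [("WorkedFor", "brando", "coppola"),
             ("Movie", "godFather", "brando"),
             ("Movie", "godFather", "coppola"),
             ("Movie", "rainMaker", "coppola")]

# Connect constants that co-occur in a gliteral.
graph = defaultdict(set)
for _, *args in gliterals:
    for a in args:
        for b in args:
            if a != b:
                graph[a].add(b)

def shortest_path(start, goal):
    """Breadth-first search for a shortest path between two constants."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(shortest_path("brando", "rainMaker"))  # ['brando', 'coppola', 'rainMaker']
```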
Weight Revision
Publication(T,A) ∧ AdvisedBy(A,B) → Publication(T,B)
M-TAMAR:
  Movie(T,A) ∧ WorkedFor(A,B) → Movie(T,B)
R-TAMAR (with target (IMDB) data):
  Movie(T,A) ∧ WorkedFor(A,B) ∧ Relative(A,B) → Movie(T,B)
MLN weight training:
  0.8 Movie(T,A) ∧ WorkedFor(A,B) ∧ Relative(A,B) → Movie(T,B)
Experiments: Domains
• UW-CSE
– Data about members of the UW CSE department
– Predicates include Professor, Student, AdvisedBy, TaughtBy,
Publication, etc.
• IMDB
– Data about 20 movies
– Predicates include Actor, Director, Movie, WorkedFor, Genre,
etc.
• WebKB
– Entity relations from the original WebKB domain
(Craven et al. 1998)
– Predicates include Faculty, Student, Project, CourseTA, etc.
Dataset Statistics
Data is organized as mega-examples.
• Each mega-example contains information about a group of related entities.
• Mega-examples are independent and disconnected from each other.

Data Set | # Mega-Examples | # Constants | # Types | # Predicates | # True Gliterals | Total # Gliterals
IMDB     | 5 | 316   | 4 | 10 | 1,540 | 32,615
UW-CSE   | 5 | 1,323 | 9 | 15 | 2,673 | 678,899
WebKB    | 4 | 1,700 | 3 | 6  | 2,065 | 688,193
Manually Developed Source KB
• UW-KB is a hand-built knowledge base (set
of clauses) for the UW-CSE domain.
• When used as a source domain, transfer
learning is a form of theory refinement that
also includes mapping to a new domain
with a different representation.
Systems Compared
• TAMAR: Complete transfer system.
• ScrKD: Algorithm of Kok & Domingos (2005)
learning from scratch.
• TrKD: Algorithm of Kok & Domingos (2005)
performing transfer, using M-TAMAR to
produce a mapping.
Methodology: Training & Testing
• Generated learning curves using leave-one-out cross-validation.
– Each run keeps one mega-example for testing and trains on the
remaining ones, provided one by one.
– Curves are averaged over all runs.
• Evaluated the learned MLN by performing inference for all gliterals
of each predicate in turn, providing the rest as evidence, and
averaging the results.
Methodology: Metrics
(Kok & Domingos, 2005)
• CLL: Conditional Log-Likelihood
– The log of the probability predicted by the model that a gliteral
has the correct truth value given in the data.
– Averaged over all test gliterals.
• AUC: Area under the precision-recall (PR) curve
– Produce a PR curve by varying the probability threshold.
– Compute the area under this curve.
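The CLL computation can be sketched in Python; the predictions below are hypothetical (P(true), actual-value) pairs for test gliterals.

```python
import math

def avg_cll(predictions):
    """Average log probability assigned to each gliteral's correct value.
    predictions: list of (P(gliteral = True), actual truth value) pairs."""
    logs = [math.log(p if truth else 1.0 - p) for p, truth in predictions]
    return sum(logs) / len(logs)

preds = [(0.9, True), (0.2, False), (0.7, True)]  # hypothetical test gliterals
print(avg_cll(preds))  # closer to 0 is better
```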
Metrics to Summarize Curves
• Transfer Ratio (Cohen et al., 2007)
– The ratio of the area under the transfer learning curve to the area
under the learning-from-scratch curve; gives an overall idea of the
improvement achieved over learning from scratch.
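A sketch of one natural formulation of the transfer ratio, as the ratio of areas under the two learning curves (the exact formulation in Cohen et al., 2007 may differ; the curve values here are made up):

```python
def area(curve):
    """Trapezoidal area under a learning curve of (x, y) points sorted by x."""
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(curve, curve[1:]))

# Made-up curves: accuracy vs. number of target mega-examples.
scratch  = [(1, 0.50), (2, 0.60), (3, 0.65)]
transfer = [(1, 0.62), (2, 0.68), (3, 0.70)]

ratio = area(transfer) / area(scratch)
print(ratio > 1.0)  # True: positive transfer in this made-up example
```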
Transfer Scenarios
• Source → target pairs tested:
– WebKB → IMDB
– UW-CSE → IMDB
– UW-KB → IMDB
– WebKB → UW-CSE
– IMDB → UW-CSE
• WebKB was not used as a target since one mega-example is sufficient
to learn an accurate theory for its limited predicate set.
AUC Transfer Ratio
[Bar chart: AUC transfer ratio (0 to 2) for TrKD and TAMAR on each experiment (WebKB to IMDB, UW-CSE to IMDB, UW-KB to IMDB, WebKB to UW-CSE, IMDB to UW-CSE); values above 1 indicate positive transfer, below 1 negative transfer.]
CLL Transfer Ratio
[Bar chart: CLL transfer ratio (0 to 1.8) for TrKD and TAMAR on each experiment (WebKB to IMDB, UW-CSE to IMDB, UW-KB to IMDB, WebKB to UW-CSE, IMDB to UW-CSE); values above 1 indicate positive transfer, below 1 negative transfer.]
Sample Learning Curve
[Figure: learning curves comparing ScrKD, TrKD, TAMAR, and TrKD/TAMAR with hand-built mappings.]
Total Training Time in Minutes
[Bar chart: total training time in minutes (0 to 1200) for TrKD, TAMAR, and ScrKD on each experiment (WebKB to IMDB, UW-CSE to IMDB, UW-KB to IMDB, WebKB to UW-CSE, IMDB to UW-CSE).]
Number of seconds to find the best mapping per experiment: 0.89, 9.10, 9.99, 6.75, 42.3.
Future Research Issues
• More realistic application domains.
• Application to other SRL models (e.g.
SLPs, BLPs).
• More flexible predicate mapping
– Allow argument ordering or arity to change.
– Map one predicate to a conjunction of more than one predicate:
• AdvisedBy(X,Y) → Movie(M,X) ∧ Director(M,Y)
Multiple Source Transfer
• Transfer from multiple source problems to a
given target problem.
• Determine which clauses to map and revise
from different source MLNs.
Source Selection
• Select useful source domains from a large number of previously
learned tasks.
• Ideally, picking source domain(s) takes time sub-linear in the
number of previously learned tasks.
Conclusions
• Presented TAMAR, a complete transfer
system for SRL that:
– Maps relational knowledge in the source to the
target domain.
– Revises the mapped knowledge to further
improve accuracy.
• Showed experimentally that TAMAR
improves speed and accuracy over existing
methods.
Questions?
Related papers at:
http://www.cs.utexas.edu/users/ml/publication/transfer.html
Why MLNs?
• Inherit the expressivity of first-order logic
– Can apply insights from ILP
• Inherit the flexibility of probabilistic graphical models
– Can deal with noisy & uncertain environments
• Undirected models
– Do not need to learn causal directions
• Subsume all other SRL models that are special cases of first-order
logic or probabilistic graphical models (Richardson, 2004)
• Publicly available software package: Alchemy
Predicate Mapping Comments
• A particular source predicate can be
mapped to different target predicates in
different clauses.
– This makes our approach context sensitive.
• Mapping clause by clause is more scalable:
– In the worst case, the number of mappings is exponential in the
number of predicates considered.
– The number of predicates in a clause is generally much smaller
than the total number of predicates in the domain.
Relationship to Structure Mapping Engine
(Falkenhainer et al., 1989)
• A system for mapping relations using analogy, based on a
psychological theory.
• Mappings are evaluated based only on the structural relational
similarity between the two domains.
• Does not consider the accuracy of the mapped knowledge in the
target when determining the preferred mapping.
• Determines a single global mapping for a given source & target.
Summary of Methodology
1. Learn MLNs for each point on the learning curve.
2. Perform inference over the learned models.
3. Summarize inference results using two metrics, CLL and AUC,
thus producing two learning curves.
4. Summarize each learning curve using the transfer ratio and the
percentage improvement from one mega-example.
CLL Percent Improvement
[Bar chart: percent improvement (0 to 80) in CLL from one mega-example for TrKD and TAMAR on each experiment (WebKB to IMDB, UW-CSE to IMDB, UW-KB to IMDB, WebKB to UW-CSE, IMDB to UW-CSE).]
AUC Percent Improvement
[Bar chart: percent improvement (-10 to 60) in AUC from one mega-example for TrKD and TAMAR on each experiment (WebKB to IMDB, UW-CSE to IMDB, UW-KB to IMDB, WebKB to UW-CSE, IMDB to UW-CSE).]