Learning an Approximation to Inductive Logic Programming Clause Evaluation
Frank DiMaio and Jude Shavlik
Computer Sciences Department, University of Wisconsin - Madison, USA
Inductive Logic Programming, 8 September 2004

Motivation
• Given: bottom clause $\bot$, $|E|$ examples, maximum clause length $c$
• ILP's runtime, assuming constant-time clause evaluation:
    $O(|\bot|^{c} \, |E|)$ for exhaustive search
    $O(|\bot| \, |E|)$ for greedy search

Motivation
• Evaluation time of a clause on one example is exponential in the number of variables (Dantsin et al., 2001)
• Many clause evaluations are needed in datasets with long bottom clauses, a long maximum clause length, or many examples
• Result: long running time

ILP Time Complexity
• Search algorithm improvements
    Better heuristic functions, search strategy
    Random uniform sampling (Srinivasan, 2000)
    Stochastic search (Rückert & Kramer, 2003)

ILP Time Complexity
• Faster clause evaluations
    Clause reordering & optimizing (Blockeel et al., 2002; Santos Costa et al., 2003)
    Stochastic matching (Sebag et al., 2000)
    Sampling the training examples
• Evaluation of a candidate is still $O(|E|)$

Outline
• Bottom clause and ILP search space
• Learning a fast approximation to the clause evaluation function
• Using the clause evaluation function approximation to speed up ILP

Bottom Clause
Given background knowledge as facts and relations in first-order logic
[Figure: example ex2, block C on top of block B on top of block A]
    onTopOf(blockB,blockA,ex2).
    onTopOf(blockC,blockB,ex2).
    above(A,B,C) :- onTopOf(A,B,C).
    above(A,B,C) :- onTopOf(A,Z,C), above(Z,B,C).
Generate the example's bottom clause ($\bot$) by saturating that example (Muggleton, 1995)
$\bot$ is the complete set of all fully ground literals connected to the example

Bottom Clause
Saturating ex2 with the background knowledge above gives:
    positive(ex2) :-
        onTopOf(blockB,blockA,ex2),
        onTopOf(blockC,blockB,ex2),
        above(blockB,blockA,ex2),
        above(blockC,blockB,ex2),
        above(blockC,blockA,ex2).

Building Candidate Hypotheses
    positive(E).
    positive(E) :- onTopOf(A,B,E), above(B,C,E).
    positive(ex2) :- onTopOf(blockB,blockA,ex2), onTopOf(blockC,blockB,ex2), above(blockB,blockA,ex2), above(blockC,blockB,ex2), ...

A Faster Clause Evaluation
• Our idea: predict a clause's evaluation in O(1) time (i.e., independent of the number of examples)
• Use a multilayer feed-forward neural network to approximately score candidate clauses
• NN inputs specify which bottom clause literals are selected
• There is a unique input for every candidate clause in the search space

Neural Network Topology
Selected literals from $\bot$:
    containsBlock(ex2,blockB)
    onTopOf(blockB,blockA)
    isRound(blockA)
    isRound(blockB)
Candidate clause:
    positive(A) :- containsBlock(A,B), onTopOf(B,C), isRound(B), isRound(C).

Neural Network Topology
Inputs, group 1: one input per bottom clause literal
    containsBlock(ex2,blockB)    1
    onTopOf(blockB,blockA)       1
    isRed(blockA)                0
    isRound(blockA)              1
    isBlue(blockB)               0

Neural Network Topology
Inputs, group 2: one count per predicate
    count(containsBlock)         1
    count(onTopOf)               1
    count(isRed)                 0
    count(isRound)               2

Neural Network Topology
Inputs, group 3: clause summary
    length                       5
    number of variables          3
    number of shared variables   3

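
The three input groups on the topology slides can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `encode_clause`, the tuple representation of literals, and the definition of a shared variable (one that occurs in more than one literal, head included) are assumptions chosen so that the numbers match the slide example.

```python
from collections import Counter

def encode_clause(bottom, selected, head_args):
    """Encode a candidate clause as neural-network inputs (hypothetical helper).

    bottom    -- list of ground literals, each a (predicate, args) tuple
    selected  -- set of indices into `bottom` chosen for the clause body
    head_args -- constants in the head literal, e.g. ("ex2",)
    """
    chosen = [bottom[i] for i in sorted(selected)]

    # Group 1: one binary input per bottom-clause literal.
    literal_bits = [1 if i in selected else 0 for i in range(len(bottom))]

    # Group 2: one count per predicate symbol, in order of first appearance.
    predicates = list(dict.fromkeys(pred for pred, _ in bottom))
    counts = Counter(pred for pred, _ in chosen)
    predicate_counts = [counts[p] for p in predicates]

    # Group 3: clause summary. Each distinct constant becomes one variable.
    # Assumption: a variable is "shared" if it occurs in more than one
    # literal, counting the head.
    occurrences = Counter(head_args)
    for _, args in chosen:
        occurrences.update(set(args))
    length = 1 + len(chosen)  # head plus body literals
    n_vars = len(occurrences)
    n_shared = sum(1 for n in occurrences.values() if n > 1)

    return {
        "literal_bits": literal_bits,
        "predicate_counts": predicate_counts,
        "summary": [length, n_vars, n_shared],
    }

# The slide example: four literals selected from the bottom clause of ex2.
bottom = [
    ("containsBlock", ("ex2", "blockB")),
    ("onTopOf", ("blockB", "blockA")),
    ("isRed", ("blockA",)),
    ("isRound", ("blockA",)),
    ("isBlue", ("blockB",)),
    ("isRound", ("blockB",)),
]
features = encode_clause(bottom, {0, 1, 3, 5}, ("ex2",))
```

The encoding cost depends only on the size of the bottom clause, not on the number of training examples, which is what makes the network a constant-time stand-in for clause evaluation.
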
Neural Network Topology
[Figure: the three input groups (per-literal selection inputs, per-predicate counts, and length / number of variables / number of shared variables) feed summation (Σ) units, which feed two outputs: Predicted Positive Cover and Predicted Negative Cover]

Experiments
• Trained (clause → score) on benchmark datasets
    Carcinogenesis
    Mutagenesis
    Protein Metabolism
    Nuclear Smuggling
• Clauses generated by uniform random sampling
• Clause evaluation metric
    $\text{compression} = \dfrac{\text{posCovered} - \text{negCovered} - \text{length} + 1}{\text{totalPositives}}$
• 10-fold cross-validation learning curves

Results
[Figure: learning curves. x-axis: Training Set Size (number of clauses), 0 to 1000. y-axis: 10-fold c.v. Testset RMS Error, 0.00 to 0.16. One curve each for Protein Metabolism, Nuclear Smuggling, Mutagenesis, and Carcinogenesis]

Why not just use a fraction of examples?
We compare squared error of
1. estimating scores with trained network
2. estimating scores using subset of examples

Learning vs. Sampling
[Figure: bar chart of RMS Error (0.00 to 0.15) on Protein Metabolism, Carcinogenesis, Mutagenesis, and Nuclear Smuggling for 10%, 25%, 50%, and 90% sampling and for the Neural Net]

Using the Trained Network
1. Rapidly explore search space
2. Explore network-defined surface
3. Extract concepts from trained network

Online Training Algorithm
Begin with initial burn-in training
When new clauses are evaluated on actual data, yielding I/O pair <C,[P,N]>
    insert <C,[P,N]> into recent_cache
    if one of top 100 clauses seen so far
        insert <C,[P,N]> sorted into best_cache
At regular interval
    train net on recent_cache for fixed number of epochs
    train net on best_cache for fixed number of epochs

1. Rapidly explore search space
• O(1) clause evaluation tool
• Whenever a clause evaluation is needed, approximate it on the network
• Before expanding a network-approximated clause, evaluate it against real data
• Behavior depends on underlying search
    Branch and bound: optimize order of evaluation
    A* (Aleph's default): ignore non-promising clauses

Rapidly explore search space: worked example
Scores marked NN are network estimates; unmarked scores come from evaluation on real data.
Search space:
    pos(A).
        pos(A) :- f(A,B).
            pos(A) :- f(A,B), g(A).
            pos(A) :- f(A,B), g(B).
        pos(A) :- g(A).
Animation frames:
    1. Current node: pos(A). (0). Open list: pos(A) :- f(A,B). (2.3 NN).
    2. Current node: pos(A). (0). Open list: pos(A) :- g(A). (3.7 NN), pos(A) :- f(A,B). (2.3 NN).
    3. pos(A) :- g(A). is evaluated on real data: 3.7 NN becomes 2. Open list: pos(A) :- f(A,B). (2.3 NN).
    4. pos(A) :- f(A,B). is evaluated on real data: 2.3 NN becomes 4. Open list: pos(A) :- g(A). (2).
    5. Open list: pos(A) :- f(A,B). (4), pos(A) :- g(A). (2).
    6. Current node: pos(A) :- f(A,B). (4). Open list: pos(A) :- g(A). (2).
    7. Current node: pos(A) :- f(A,B). (4). Open list: pos(A) :- f(A,B), g(A). (5.7 NN), pos(A) :- g(A). (2), pos(A) :- f(A,B), g(B). (1.6 NN).

2. Explore network-defined surface
• Trained network defines function over space of candidate clauses

2. Explore network-defined surface
• Explore this surface using stochastic gradient ascent
• Rapid random restarts (Zelezny et al, 2002)
    random clause generation
    short local search
• Use network-defined surface to make "intelligent" rapid random restarts (Boyan & Moore, 2000)

Algorithm Illustration
Alternate
1. searching network-defined surface
2. exploring clause evaluation function surface
[Figure: two curves over the space of Candidate Clauses, labeled "Network approx. clause eval. fn." and "Clause evaluation fn."]

Extract concepts from trained net
• Extract decision tree from trained neural network (Craven & Shavlik, 1995)
• Predicate invention
    High-weight edges into single hidden unit
    Add invented predicates to background

Biased-RRR Results
[Figure: three panels (Protein Metabolism, Carcinogenesis, Nuclear Smuggling) plotting Average Coverage against Clause Evaluations for biased-RRR and RRR]

Future Work
• Implement and test other uses (#1 and #3) for utilizing trained neural network
• Look at relative ranking of network predictions rather than squared error
    Rankprop is concerned with correctly predicting ranking (Caruana et al, 1997)
• Approximation quality in phase transition? (Botta et al, 2003)

Conclusion
• Can learn to accurately estimate score of candidate clauses
• Several potential uses for speeding up ILP
• Helps scale ILP to ever larger datasets (number of examples, search space size)

Acknowledgements
• NLM Grant 1T15 LM007359-01
• US Air Force Grant F30602-01-2-0571
• NLM Grant 1R01 LM07050-01
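
Supplementary sketch: the cache bookkeeping of the Online Training Algorithm slide, written out in Python. It is a minimal illustration under stated assumptions rather than the authors' code. The class name `OnlineTrainer`, the `train_fn` callback (which is expected to run its own fixed number of epochs), the size of `recent_cache`, the training interval, and the use of P - N to rank clauses for `best_cache` are all hypothetical; the slide only fixes the top-100 rule and the order of the two training passes.

```python
import heapq
from collections import deque
from itertools import count

class OnlineTrainer:
    """Keeps recent_cache and best_cache and triggers periodic training."""

    def __init__(self, train_fn, best_size=100, recent_size=500, interval=50):
        self.train_fn = train_fn        # hypothetical: trains the net on a list of pairs
        self.best_size = best_size      # "top 100 clauses seen so far" on the slide
        self.interval = interval        # assumed: train every `interval` evaluations
        self.recent_cache = deque(maxlen=recent_size)  # assumed bounded size
        self._best_heap = []            # min-heap of (score, tiebreak, pair)
        self._tiebreak = count()        # avoids comparing clauses on score ties
        self._seen = 0

    def observe(self, clause, pos, neg):
        """Record an I/O pair <C,[P,N]> obtained by evaluating on real data."""
        pair = (clause, (pos, neg))
        self.recent_cache.append(pair)

        score = pos - neg               # assumed ranking criterion
        entry = (score, next(self._tiebreak), pair)
        if len(self._best_heap) < self.best_size:
            heapq.heappush(self._best_heap, entry)
        elif score > self._best_heap[0][0]:
            heapq.heapreplace(self._best_heap, entry)  # evict current worst

        self._seen += 1
        if self._seen % self.interval == 0:
            self.train_fn(list(self.recent_cache))
            self.train_fn(self.best_cache())

    def best_cache(self):
        """Best pairs seen so far, highest score first."""
        return [pair for _, _, pair in sorted(self._best_heap, reverse=True)]
```

Training on `best_cache` after `recent_cache` keeps the network accurate on the highest-scoring clauses seen so far, in addition to the clauses evaluated most recently.
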