Speeding Up Relational Data Mining by Learning to Estimate Candidate Hypothesis Scores
Frank DiMaio and Jude Shavlik, UW-Madison Computer Sciences
ICDM Foundations and New Directions of Data Mining Workshop, 19 November 2003

Rule-Based Learning
• Goal: Induce a rule (or rules) that explains ALL positive examples and NO negative examples

Inductive Logic Programming (ILP)
• Encode background knowledge in first-order logic as facts …
    containsBlock(ex1,block1A).    containsBlock(ex1,block1B).
    is_red(block1A).               is_blue(block1B).
    is_square(block1A).            is_round(block1B).
    on_top_of(block1B,block1A).
  … and logical relations
    above(A,B) :- onTopOf(A,B).
    above(A,B) :- onTopOf(A,Z), above(Z,B).

Inductive Logic Programming (ILP)
• Covering algorithm applied to explain all the data:
    Repeat until every positive example is covered:
      Choose some positive example
      Generate rules that cover this example
      Choose the best rule
      Remove every positive example covered by this rule

Inductive Logic Programming (ILP)
• Saturate an example by writing down everything true about it
• The saturation of an example is the bottom clause (⊥), e.g. for ex2 (a stack of blocks A, B, C):
    positive(ex2) :-
      contains_block(ex2,block2A), contains_block(ex2,block2B), contains_block(ex2,block2C),
      isRed(block2A), isBlue(block2B), isBlue(block2C),
      isRound(block2A), isRound(block2B), isSquare(block2C),
      onTopOf(block2B,block2A), onTopOf(block2C,block2B),
      above(block2B,block2A), above(block2C,block2B), above(block2C,block2A).

Inductive Logic Programming (ILP)
• Candidate clauses are generated by
  – choosing literals from the bottom clause, e.g.
      containsBlock(ex2,block2B), onTopOf(block2B,block2A), isRed(block2A)
  – converting ground terms to variables, giving the candidate clause
      positive(A) :- containsBlock(A,B), onTopOf(B,C), isRed(C).
• Search through the space of candidate clauses using a standard AI search algorithm
• The bottom clause ensures the search space is finite
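The candidate-generation step above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it represents the bottom clause as a list of ground literals, enumerates subsets of them, and replaces each distinct ground term with a fresh variable. The predicate names follow the slides' blocks-world example; `max_len` plays the role of the maximum clause length c.

```python
# Sketch of candidate-clause generation from a bottom clause (illustrative,
# not the authors' code): pick a subset of bottom-clause literals, then
# variablize by mapping ground terms to variables A, B, C, ...
from itertools import combinations

BOTTOM_CLAUSE = [
    ("containsBlock", ("ex2", "block2B")),
    ("onTopOf", ("block2B", "block2A")),
    ("isRound", ("block2A",)),
    ("isRed", ("block2A",)),
]

def variablize(literals):
    """Replace each distinct ground term with a fresh variable (A, B, C, ...)."""
    var_of = {}
    def var(term):
        if term not in var_of:
            var_of[term] = chr(ord("A") + len(var_of))
        return var_of[term]
    return [(pred, tuple(var(t) for t in args)) for pred, args in literals]

def candidate_clauses(bottom, max_len):
    """Enumerate candidates: every subset of at most max_len bottom-clause literals.
    The finite bottom clause is what keeps this search space finite."""
    for n in range(1, max_len + 1):
        for subset in combinations(bottom, n):
            yield variablize(list(subset))

# The slides' example: selecting three literals and variablizing yields
# positive(A) :- containsBlock(A,B), onTopOf(B,C), isRed(C).
cand = variablize([BOTTOM_CLAUSE[0], BOTTOM_CLAUSE[1], BOTTOM_CLAUSE[3]])
```

An ILP system would additionally prune candidates that are not well-formed clauses (e.g. with unbound head variables); the sketch keeps only the subset-and-variablize idea.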
ILP Time Complexity
• Time complexity of ILP systems depends on:
  – size of the bottom clause |⊥|
  – maximum clause length c
  – number of examples |E|
  – search algorithm
• O(|⊥|^c × |E|) for exhaustive search
• O(|⊥| × c × |E|) for greedy search
• Both assume constant-time clause evaluation!

Ideas in Speeding Up ILP
• Search algorithm improvements
  – better heuristic functions, search strategy
  – Srinivasan's (2000) random uniform sampling (considers O(1) candidate clauses)
• Faster clause evaluations
  – evaluation time of a clause (on one example) is exponential in its number of variables
  – clause reordering & optimization (Blockeel et al. 2002, Santos Costa et al. 2003)
• Evaluation of a candidate is still O(|E|)

A Faster Clause Evaluation
• Our idea: predict a clause's evaluation in O(1) time (i.e., independent of the number of examples)
• Use a multilayer feed-forward neural network to approximately score candidate clauses
• NN inputs specify which bottom-clause literals are selected
• There is a unique input vector for every candidate clause in the search space

Neural Network Topology
• Selected literals from the bottom clause, e.g.
    containsBlock(ex2,block2B), onTopOf(block2B,block2A), isRed(block2A)
  correspond to the candidate clause
    positive(A) :- containsBlock(A,B), onTopOf(B,C), isRed(C).
• Input units (one per bottom-clause literal), e.g.:
    containsBlock(ex2,block2B) = 1
    onTopOf(block2B,block2A)   = 1
    isRound(block2A)           = 0
    isRed(block2A)             = 1
• These feed a hidden layer (Σ) that produces the predicted score as output

Speeding Up ILP
• The trained neural network provides a tool for approximate evaluation in O(1) time
• Given enough examples (large |E|), approximate evaluation is essentially free compared with evaluation on the data
• During ILP's search over the hypothesis space …
  – approximately evaluate every candidate explored
  – only evaluate a clause on the data if it looks “promising”
  – adaptive sampling: use real evaluations to improve the approximation during the search

When to Evaluate Approximated Clauses?
• Treat the neural network's predicted score as a Gaussian distribution over the true score
• Only evaluate a clause on the data when there is sufficient likelihood that it is the best seen so far, e.g.
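The O(1) scorer described above can be sketched as follows. This is an illustrative mock-up, not the authors' network: a candidate clause becomes a 0/1 vector with one input per bottom-clause literal, and a small feed-forward network with one tanh hidden layer maps that vector to a predicted score. The weights below are arbitrary placeholders; in the actual approach they would be learned from (clause, true score) pairs collected during search.

```python
# Sketch of the O(1) approximate clause scorer (placeholder weights; in the
# described approach these would be trained on real clause evaluations).
import math

BOTTOM_LITERALS = [
    "containsBlock(ex2,block2B)",
    "onTopOf(block2B,block2A)",
    "isRound(block2A)",
    "isRed(block2A)",
]

def encode(selected):
    """Unique 0/1 input vector for a candidate clause: 1 iff the
    bottom-clause literal was selected for the clause."""
    return [1.0 if lit in selected else 0.0 for lit in BOTTOM_LITERALS]

# One hidden layer of two tanh units, linear output (illustrative values).
W_HID = [[0.8, 0.5, -0.3, 0.9],
         [-0.4, 0.7, 0.6, -0.2]]
B_HID = [0.1, -0.1]
W_OUT = [1.5, -0.8]
B_OUT = 0.2

def predict_score(selected):
    """Approximate clause score; cost is independent of |E|."""
    x = encode(selected)
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W_HID, B_HID)]
    return sum(w * h for w, h in zip(W_OUT, hidden)) + B_OUT

score = predict_score({"containsBlock(ex2,block2B)",
                       "onTopOf(block2B,block2A)",
                       "isRed(block2A)"})
```

The key property is that `predict_score` touches only the fixed-size input vector and weights, so its cost does not grow with the number of training examples, unlike a real evaluation of the clause on the data.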
[Figure: from the current hypothesis, clause scores of two potential moves. Current best score = 22. One candidate's predicted score is 11.1, giving P(best) = 0.03 → don't evaluate. Another's predicted score is 18.9, giving P(best) = 0.24 → evaluate.]

Results
• Trained the score learner on benchmark datasets:
  – Carcinogenesis
  – Mutagenesis
  – Protein Metabolism
  – Nuclear Smuggling
• Clauses generated by random sampling
• Clause evaluation metric:
    compression = (posCovered − negCovered − length + 1) / totalPositives
• 10-fold cross-validation learning curves

Results
[Figure: 10-fold c.v. RMS error (0.00–0.16) vs. training set size (0–1000) for Protein Metabolism, Nuclear Smuggling, Mutagenesis, and Carcinogenesis]
• Next: test in an ILP system
  – potential for speedup on datasets with many examples
  – will the approximation's inaccuracy hurt the search?

Future Work
[Figure: predicted score plotted over the space of clauses]
• The trained network defines a function over the space of candidate clauses
• We can use this function to …
  – extract concepts
  – escape local maxima in heuristic search

Acknowledgements
Funding provided by
  NLM grant 1T15 LM007359-01
  NLM grant 1R01 LM07050-01
  DARPA EELD grant F30602-01-2-0571
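The evaluate-or-not decision illustrated by the figure can be sketched directly. This is a hedged reconstruction: the Gaussian's standard deviation `SIGMA` and the decision `threshold` below are illustrative assumptions, not values stated in the talk; the slides only specify that the predicted score is treated as the mean of a Gaussian over the true score.

```python
# Sketch of the "when to evaluate" rule: treat the predicted score as the
# mean of a Gaussian over the true score and run a real O(|E|) evaluation
# only when P(true score > best so far) clears a threshold.
# SIGMA and threshold are illustrative assumptions, not numbers from the talk.
import math

def normal_cdf(x, mean, sigma):
    """CDF of a Gaussian with the given mean and standard deviation."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))

def prob_best(predicted, best_so_far, sigma):
    """P(true score exceeds the best score seen so far)."""
    return 1.0 - normal_cdf(best_so_far, predicted, sigma)

def should_evaluate(predicted, best_so_far, sigma=4.0, threshold=0.1):
    """Spend a real evaluation only on sufficiently promising clauses."""
    return prob_best(predicted, best_so_far, sigma) >= threshold

# With best-so-far = 22 (as in the figure), a clause predicted at 18.9 is
# far more likely to beat the best than one predicted at 11.1, so only the
# former earns a real evaluation on the data.
```

Raising the threshold (or shrinking sigma) trades away real evaluations for a greater risk of discarding a clause that would actually have been the new best.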