Automated Theory Formation First Steps in Bioinformatics

Simon Colton
Computational Bioinformatics Laboratory
Machine Learning (ML)
Questions
Given some background information
Concepts, hypotheses (axioms)
Given some positive examples
And some negative examples
Find me an explanation
Why the positives are positive
And the negatives are negative
Example: Predictive Toxicology
Given some theory from chemistry
Structure of molecules, well known substructures
Given some examples of toxic drugs
And some examples of non-toxic drugs
Question: Why are the toxic drugs toxic?
Automated Theory Formation (ATF)
Questions
Given some background information
Concepts, hypotheses (axioms)
And some objects of interest
Numbers, Molecules, etc.
Find something interesting
Interesting things could be:
Concepts, examples, hypotheses, explanations
ATF Overview
Scientific theories contain (at least):
Concepts: salt, acid, base
Hypotheses: acid + base => salt + water
Explanations: transfer of electrons, dissolving
So, ATF should do (at least):
Concept formation, Conjecture making
Hypothesis proving and disproving.
Also needs to:
Measure interestingness, present results, etc.
HR Theory Formation System
Developed in maths
Designed to be a general-purpose system
Concept-based theory formation
Tries to make a concept
Makes conjecture when it can’t make a concept
Tries to explain conjectures
Conjecture-based theory formation
Fix faulty conjectures with concept formation
PhD work of Alison Pease, based on Lakatos
Concept Formation in HR
10 General Production Rules
Take in old concepts, produce new concepts
Start: [a,b] : b|a
Size gives: [a,n] : n = |{b : b|a}|
Split gives: [a] : 2 = |{b : b|a}|
Split (of the start concept) gives: [a] : 2|a
Negate gives: [a] : not 2|a
Compose gives: [a] : 2 = |{b : b|a}| & not 2|a
(Odd Prime Numbers)
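The derivation of odd primes above can be replayed on a small range of integers. This is only an illustrative sketch in Python, not HR's code: the range 1..30 and all the variable names are assumptions made for the example.

```python
# Illustrative replay of the slide's derivation; not HR's actual code.
N = range(1, 31)  # objects of interest: integers 1..30 (an assumed range)

# Background concept [a,b] : b|a  (pairs where b divides a)
divisor = {(a, b) for a in N for b in N if a % b == 0}

# Size: [a,n] : n = |{b : b|a}|  (count the divisors of each a)
tau = {(a, sum(1 for (x, _) in divisor if x == a)) for a in N}

# Split: fix n = 2, giving [a] : 2 = |{b : b|a}|  (the primes)
prime = {a for (a, n) in tau if n == 2}

# Split: fix b = 2 in the divisor concept, giving [a] : 2|a  (even numbers)
even = {a for (a, b) in divisor if b == 2}

# Negate: [a] : not 2|a  (odd numbers)
odd = set(N) - even

# Compose: conjunction of the two concepts (odd primes)
odd_prime = prime & odd

print(sorted(odd_prime))  # [3, 5, 7, 11, 13, 17, 19, 23, 29]
```

Each production rule here is just a set operation on the examples of an old concept, which is why the same four rules can be applied to any domain with tabulated examples.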
Conjecture Making
Empirical checks are performed
After each attempt to invent a new concept
If the concept has no examples
Makes non-existence conjecture
If concept has same examples as previous
Makes an equivalence conjecture
If another concept subsumes the concept
Makes an implication conjecture
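The three empirical checks can be sketched as comparisons between example sets. The function below is a hypothetical illustration (the name make_conjectures, the dictionary layout and the toy theory are all assumptions), not HR's implementation.

```python
def make_conjectures(new_name, new_examples, theory):
    """Compare a new concept's examples with those of existing concepts.

    theory maps concept names to sets of examples (assumed layout).
    """
    new_examples = frozenset(new_examples)
    if not new_examples:
        # No examples at all: non-existence conjecture
        return [f"non-existence: no object satisfies {new_name}"]
    for name, examples in theory.items():
        if frozenset(examples) == new_examples:
            # Same examples as a previous concept: equivalence conjecture
            return [f"equivalence: {new_name} <=> {name}"]
    # Otherwise, every concept that strictly contains the new one
    # gives an implication conjecture
    return [f"implication: {new_name} => {name}"
            for name, examples in theory.items()
            if new_examples < frozenset(examples)]


numbers = range(1, 21)
theory = {
    "odd": {n for n in numbers if n % 2 == 1},
    "prime": {2, 3, 5, 7, 11, 13, 17, 19},
}
two_divisors = {n for n in numbers
                if sum(n % d == 0 for d in range(1, n + 1)) == 2}

print(make_conjectures("odd_prime", theory["odd"] & theory["prime"], theory))
print(make_conjectures("two_divisors", two_divisors, theory))
print(make_conjectures("odd_and_even", set(), theory))
```

All three conjecture types are empirical: they hold for the examples seen so far and still need proving or disproving.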
Conjecture Extraction
Suppose HR makes equivalence conjecture:
P(a) & Q(a) <=> R(a) & S(a)
Extracts:
P(a) & Q(a) => R(a), P(a) & Q(a) => S(a)
R(a) & S(a) => P(a), R(a) & S(a) => Q(a)
Tries to Extract: P(a) => R(a), Q(a) => R(a), etc.
Prime implicates (require proving, though)
Important: gets Horn Clauses
Can be expressed in Prolog…
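The first extraction step (one implication per literal on the other side of the equivalence) is mechanical, and can be sketched as follows. The function names are hypothetical, and the further step of dropping body literals to reach prime implicates is not shown, since that needs a prover.

```python
def extract_implicates(lhs, rhs):
    """Split (lhs conjunction) <=> (rhs conjunction) into Horn clauses.

    Returns (head, body) pairs: each literal on one side becomes the head
    of a clause whose body is the whole conjunction on the other side.
    """
    clauses = []
    for body, heads in ((lhs, rhs), (rhs, lhs)):
        for head in heads:
            clauses.append((head, tuple(body)))
    return clauses


def to_prolog(head, body):
    """Render a (head, body) pair in Prolog clause syntax."""
    return f"{head} :- {', '.join(body)}."


for head, body in extract_implicates(["p(A)", "q(A)"], ["r(A)", "s(A)"]):
    print(to_prolog(head, body))
# r(A) :- p(A), q(A).
# s(A) :- p(A), q(A).
# p(A) :- r(A), s(A).
# q(A) :- r(A), s(A).
```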
Explanation Generation
In mathematical domains
HR relies on automated theorem provers
And model generators, to find counterexamples
E.g., group theory: a*a=a <=> a=id (easily proved)
In biological/chemistry domains
Possibly: visualisation tools, reaction pathways
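For the group theory example, a brute-force check over small finite structures gives a feel for what a model generator contributes. This is a minimal sketch, not a prover and not HR's interface: the function name and the three test structures are assumptions chosen for illustration.

```python
from itertools import permutations


def idempotents_are_identity(elements, op, identity):
    """Check a*a=a <=> a=id on one finite structure by enumeration."""
    return all((op(a, a) == a) == (a == identity) for a in elements)


# Z_6 under addition mod 6 (a group)
z6_ok = idempotents_are_identity(range(6), lambda a, b: (a + b) % 6, 0)

# S_3 as permutation tuples under composition (a group)
s3 = list(permutations(range(3)))
compose = lambda p, q: tuple(p[q[i]] for i in range(3))
s3_ok = idempotents_are_identity(s3, compose, (0, 1, 2))

# Multiplication mod 6 is NOT a group; here 0*0=0 and 3*3=3 are
# counterexamples, so the conjecture fails outside group theory
mult_ok = idempotents_are_identity(range(6), lambda a, b: (a * b) % 6, 1)

print(z6_ok, s3_ok, mult_ok)  # True True False
```

Passing on a few finite groups is evidence only; the proof itself is left to a theorem prover.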
Greatest Hits
Please ask me over coffee about:
Pre-processing constraint problems
Learning properties of quadratic residues
Inventing integer sequences
Puzzle generation
Adding to the TPTP library
Setting mathematical tutorial questions
…
Long term aim in Bioinformatics
Develop an ATF system similar to HOMER
But working in biological domains
Biologist provides a little background info
In a format they are happy with
Program provides results
Intelligent, interesting, not too much,
And very little rubbish
Automated assistant for biology
Short term aim in Bioinformatics
HR can work with biological data
Takes input similar to Muggleton’s Progol
Use HR to solve ML problems
See how bad an idea that is
Use theory formation to improve ML
Integrate HR and Progol somehow
Naïve Approach to ML Tasks
Give HR the same input as Progol
Get it to form a theory
Look at the theory
Extract concepts which do well on the task
i.e., they look similar to target concept
Not a goal-based approach
Bad idea (slow)
Less Naïve Approach
Improve search using “forward look-ahead”
ICML Paper
This has evolved to “reactive search”
Uses HR’s own Java interpreter
HR reacts to certain events in theory formation,
using scripts supplied by the user
HR also makes “near-conjectures”
Faster approach, but still fairly slow
Example – Mutagenesis42 Data
Mutagenesis is similar to carcinogenesis
42 drugs supplied with atom-bond details
Atom type, number & charge, bond type (1-8)
13 are mutagenic (active), 29 are not active
Progol learned this concept (88% accurate)
active(A) :- bond(A,B,C,2), bond(A,D,B,1), atm(A,D,c,21,E).
[Diagram of the learned substructure: c,21 -1- ? -2- ?]
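To make the Progol clause concrete, here is a hedged sketch of testing it against a molecule. The data layout (a dictionary of atoms and a list of directed bonds) and both toy molecules are invented for illustration; they are not taken from the Mutagenesis42 data.

```python
def progol_rule_covers(atoms, bonds):
    """True if the molecule satisfies
    active(A) :- bond(A,B,C,2), bond(A,D,B,1), atm(A,D,c,21,E).

    atoms: id -> (element, type, charge); bonds: list of (from, to, type).
    Both layouts are assumptions made for this sketch.
    """
    for (b, _c, bond_bc) in bonds:              # bond(A,B,C,2)
        if bond_bc != 2:
            continue
        for (d, b_again, bond_db) in bonds:     # bond(A,D,B,1)
            if bond_db == 1 and b_again == b:
                element, atom_type, _charge = atoms[d]  # atm(A,D,c,21,E)
                if (element, atom_type) == ("c", 21):
                    return True
    return False


# Hypothetical toy molecules (charges set to 0.0 as placeholders)
toy_match = (
    {"a1": ("c", 21, 0.0), "a2": ("n", 38, 0.0), "a3": ("o", 40, 0.0)},
    [("a1", "a2", 1), ("a2", "a3", 2)],
)
toy_no_match = (
    {"a1": ("c", 22, 0.0), "a2": ("n", 38, 0.0), "a3": ("o", 40, 0.0)},
    [("a1", "a2", 1), ("a2", "a3", 2)],
)

print(progol_rule_covers(*toy_match))     # True
print(progol_rule_covers(*toy_no_match))  # False
```

Accuracy on the dataset would then be the fraction of the 42 drugs for which this test agrees with the active/not active label.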
HR’s Results
Using reactive search, four production rules (PRs), 30K steps
HR learned this concept:
active(A) :- bond(A,B,C,1), atm(B,F,21), bond(A,C,D,E)
Also 88% accurate
But, Progol’s answer “better”
Because higher information content (fewer ?s)
Biologists sometimes want more information

Is this really a simpler answer?
[Diagram of HR's substructure: ?,21 -1- ? -?- ?]
But…
HR also made these equivalence conjectures
And extracted them (+100 more) for us
atm(B,X,21) <=> atm(B,c,21)
atm(B,X,38) <=> atm(B,n,38)
bond(A,B,C,X1) & atm(C,X2,38) <=> bond(A,B,C,1) & atm(C,X3,38)
bond(A,X1,B,X2) & atm(B,X3,38) <=> bond(A,B,X4,2) & atm(B,X5,38)
We used these to re-write HR’s answer
By hand, but hope to automate
Giving us this answer:
[Diagram of the re-written substructure: c,21 -1- ? -2- n,38]
Remember that Progol’s Answer was:
[Diagram: c,21 -1- ? -2- ?]
So, we filled in one of the blanks!
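The simplest part of the re-writing, replacing an unknown element by the one its atom type determines, could be automated along these lines. This is only a sketch under assumptions: the lookup table encodes just the two atom-type conjectures shown, the function name is hypothetical, and the steps that introduce the n,38 atom and the type 2 bond rely on other conjectures and are not reproduced here.

```python
# Type -> element facts read off two of the extracted conjectures:
#   atm(B,X,21) <=> atm(B,c,21)   and   atm(B,X,38) <=> atm(B,n,38)
ELEMENT_OF_TYPE = {21: "c", 38: "n"}


def fill_elements(atoms):
    """atoms: list of (element, type) pairs, with "?" for unknown fields.

    An unknown element is filled in whenever the atom type determines it.
    """
    return [(ELEMENT_OF_TYPE.get(t, e) if e == "?" else e, t)
            for e, t in atoms]


# The three atoms of HR's learned substructure: ?,21 then two unknown atoms
hr_answer = [("?", 21), ("?", "?"), ("?", "?")]
print(fill_elements(hr_answer))
# [('c', 21), ('?', '?'), ('?', '?')]
```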
Are we making a meal of this?
Yes, possibly for the mutagenesis data
I was worried about the difficulty of this problem
In the last week I’ve written a 200-line Prolog program
Which runs quite fast
And can be distributed over multiple processors
And can be easily understood by biologists
And gets these results…
Template search – Results
Nice result one (88% accurate, lots of info)
[Diagram: substructure with atoms c,21, n,38 and two o,40, joined by one bond of type 1 and two bonds of type 2]
Nice result two (95% accurate)
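[Diagram: larger substructures. Recoverable labels: atoms c,21, n,38, o,40, c,?, c,22, c,195, c,22, h,3 and one unspecified atom (?); bond types 1, 2 and 7; charges -0.132 and 0.145]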
Template Search - Assumptions
Connected substructures
Are interesting answers
Progol’s answers are all substructures
More specific substructures are not so bad
Biologists may even want lots of information
Don’t forget that they want to do science
Each learned concept will be true of
At least one active (positive) molecule
Template Search - Overview
User chooses template for substructures
[Template diagram: ?,? -?- ?,? -?- ?,? (three atoms, each element,type, joined by two bonds)]
User specifies how many ?s are allowed
E.g., 3 out of 8 in the above template
Algorithm starts with the first positive
Extracts all substructures which fit the template
Then takes the next positive and,
for each substructure in the set,
adds the least general generalisation (LGG) which fits both positives
Doesn’t go under the information content (IC) limit
i.e., no more ?s than the user allowed
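The generalisation loop can be sketched in a few lines. The original is a 200-line Prolog program which is not shown, so everything below is an assumption made for illustration: substructures are flat tuples of the 8 template fields (element, type, bond, element, type, bond, element, type), the LGG is taken field by field, and the toy substructures are invented.

```python
WILD = "?"


def lgg(s, t):
    """Field-wise least general generalisation of two substructures."""
    return tuple(x if x == y else WILD for x, y in zip(s, t))


def blanks(s):
    """Number of ?s in a substructure (fewer ?s = more information)."""
    return sum(1 for x in s if x == WILD)


def template_search(positives, max_blanks):
    """positives: one set of fully specified substructures per positive
    molecule, already extracted to fit the template (assumed input)."""
    found = set(positives[0])
    for subs in positives[1:]:
        generalised = {lgg(s, t) for s in found for t in subs}
        # Keep only generalisations which stay within the IC limit
        found |= {g for g in generalised if blanks(g) <= max_blanks}
    return found


# Hypothetical toy input: two positives, template of 8 fields, 3 ?s allowed
p1 = {("c", 21, 1, "n", 38, 2, "o", 40)}
p2 = {("c", 21, 1, "n", 38, 2, "o", 41), ("c", 22, 7, "c", 22, 7, "c", 22)}
result = template_search([p1, p2], max_blanks=3)
print(sorted(result, key=blanks))
```

In this toy run the first pair generalises with a single ?, while the second pair would need seven ?s and is discarded by the limit.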
Template Search – Final Part
From all the substructures found,
take the disjunction
which achieves the best accuracy
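One way to pick the disjunction is exhaustive search over small sets of substructures. The slides do not say how the original program does this, so the sketch below is an assumption throughout: the size bound, the coverage table and the toy molecule ids are all invented for illustration.

```python
from itertools import combinations


def best_disjunction(coverage, positives, negatives, max_size=2):
    """coverage: substructure name -> set of molecule ids it matches.

    Tries every disjunction of up to max_size substructures and returns
    the most accurate one with its accuracy (first found wins ties).
    """
    total = len(positives) + len(negatives)
    best, best_acc = (), 0.0
    for size in range(1, max_size + 1):
        for combo in combinations(sorted(coverage), size):
            covered = set().union(*(coverage[name] for name in combo))
            true_pos = len(covered & positives)
            true_neg = len(negatives - covered)
            acc = (true_pos + true_neg) / total
            if acc > best_acc:
                best, best_acc = combo, acc
    return best, best_acc


# Hypothetical toy data: molecules 1-3 are positive, 4-7 negative
positives, negatives = {1, 2, 3}, {4, 5, 6, 7}
coverage = {"s1": {1, 2, 4}, "s2": {3}, "s3": {1, 5, 6}}
best, acc = best_disjunction(coverage, positives, negatives)
print(best, round(acc, 3))  # ('s1', 's2') 0.857
```

Exhaustive search grows quickly with max_size, which is one reason a greedy choice might be preferred on larger sets of substructures.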
Distribution of this algorithm possible
We’re getting a big Linux farm
PPP – Processor Per Positive
Each processor finds the substructures true of one positive
Combine the answers at the end
Conclusions & Future Work
Automated Theory Formation
May be useful to bioinformatics
Use HR’s theory to improve Progol’s results
Possibly by pre-processing Progol’s input
Or by post-processing the learned concept
Template search
Maybe a good idea? Possibly not new….
Not bad results for the Mutagenesis42 dataset