G54DMT – Data Mining Techniques and Applications
http://www.cs.nott.ac.uk/~jqb/G54DMT
Dr. Jaume Bacardit, jqb@cs.nott.ac.uk
Topic 3: Data Mining
Lecture 2: Evolutionary Learning

Outline of the lecture
• Introduction and taxonomy
• Genetic algorithms
• Knowledge representations
• Paradigms
• Two complete examples
  – GAssist
  – BioHEL
• Resources

Evolutionary Learning
• The application of any kind of evolutionary computation method to machine learning tasks:
  – Genetic Algorithms
  – Genetic Programming
  – Evolution Strategies
  – Ant Colony Optimization
  – Particle Swarm Optimization
• Also known as:
  – Genetics-Based Machine Learning (GBML)
  – Learning Classifier Systems (LCS) (a subset of GBML)

Paradigms and representation
• EL involves a huge mix of:
  – Search methods (previous slide)
  – Representations
  – Learning paradigms
• Learning paradigms: how the solution to the machine learning problem is generated
• Representations: rules, decision trees, synthetic prototypes, hyperspheres, etc.

Genetic Algorithm working cycle
[Diagram: the GA cycle, in which Evaluation, Selection, Crossover and Mutation successively transform Population A into Populations B, C and D, and the cycle repeats]

Genetic Algorithms: terms
• Population
  – The candidate solutions of the problem
  – Traditionally represented as bit-strings (e.g. each bit associated with a feature, indicating whether it is selected or not)
  – Each bit of an individual is called a gene
  – The initial population is created at random
• Evaluation
  – Assigning a goodness (fitness) value to each individual in the population
• Selection
  – A process that rewards good individuals
  – Good individuals will survive, and may get more than one copy in the next population; bad individuals will disappear

Genetic Algorithms
• Crossover: exchanging subparts of the solutions (e.g. 1-point crossover or uniform crossover)
  – The crossover stage takes two individuals from the population (the parents) and, with a certain probability Pc, generates two offspring

Knowledge representations
• For nominal attributes:
  – Ternary representation
  – GABIL representation
• For real-valued attributes:
  – Hyperrectangles
  – Decision trees
  – Synthetic prototypes
  – Others

Ternary representation
• Used by XCS (Michigan LCS)
  – Three-letter alphabet {0,1,#} for binary problems
    • # means "don't care", that is, the attribute is irrelevant
  – Example: "If A1=0 and A2=1 and A3 is irrelevant, then class 0" is encoded as 01#|0
  – For non-binary nominal attributes: {0, 1, 2, …, n, #}
  – Crossover and mutation act as in a classic GA

GABIL representation
• A rule is a pair: Predicate → Class
• Predicate: a formula in Conjunctive Normal Form (CNF)
  (A1=V11 ∨ … ∨ A1=V1n) ∧ … ∧ (An=Vn1 ∨ … ∨ An=Vnm)
  – Ai: i-th attribute
  – Vij: j-th value of the i-th attribute
• The rules can be mapped into a binary string: 1100|0010|1001|1
• Example with 2 variables:
  – Sky = {clear, partially cloudy, dark clouds}
  – Pressure = {low, medium, high}
• 2 classes: {no rain, rain}
• Rule: if [sky is (partially cloudy or has dark clouds)] and [pressure is low] then predict rain
• Genotype: "011|100|1"

Hyper-rectangle representation
• The rule's predicate encodes an interval for each dimension of the domain, effectively generating a hyperrectangle, e.g. "If (X<0.25 and Y<0.25) then …"
• Different ways of encoding the interval:
  – X < value, X > value, X ∈ [l,u]
  – Encoding the actual bounds (UBR, NAX)
  – Encoding the interval as center ± spread (XCSR)
• What if u < l?
  – Flipping the bounds (UBR)
  – Declaring the attribute as irrelevant (NAX)
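To make the interval encoding concrete, here is a minimal Python sketch of matching an instance against a hyperrectangular rule, with UBR-style flipping of swapped bounds. The names and data layout are illustrative assumptions, not taken from any specific LCS implementation:

  # A rule is a list of (lower, upper) interval genes, one per attribute.
  def hyperrect_match(intervals, instance):
      """Return True if every attribute value falls inside its interval.
      Swapped bounds (u < l) are flipped, as in the UBR encoding."""
      for (low, up), value in zip(intervals, instance):
          if up < low:            # UBR: interpret swapped bounds by flipping them
              low, up = up, low
          if not (low <= value <= up):
              return False
      return True

  # Example: the rule "If (X<0.25 and Y<0.25) then ..." over [0,1]^2
  rule = [(0.0, 0.25), (0.0, 0.25)]
  print(hyperrect_match(rule, (0.10, 0.20)))  # True
  print(hyperrect_match(rule, (0.50, 0.20)))  # False

Under the NAX convention, the u < l case would instead mark the attribute as irrelevant, i.e. its interval test would simply be skipped.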
Decision tree representation
• Each individual literally encodes a complete decision tree [Llora, 02]
• Only suitable for the Pittsburgh approach
• Decision trees can be axis-parallel or oblique
• Crossover
  – Exchange of sub-branches of a tree between the parents
• Mutation
  – Change of the definition of a node/leaf
  – Total replacement of a sub-branch of the tree

Synthetic prototypes representation [Llora, 02]
• Each individual is a set of synthetic instances
• These instances are used as the core of a nearest-neighbor classifier
• Example individual with four prototypes: (-0.125, 0, yellow), (0.125, 0, red), (0, -0.125, blue), (0, 0.125, green)

Other representations for continuous problems
• Hyperellipsoid representation (XCS)
  – Each rule encodes a (hyper)ellipsoid over the search space
    • Smooth, non-linear frontiers
    • Arbitrary rotation
  – Encoded as:
    • Center
    • Stretches across the dimensions
    • Rotation angles
• Neural representation (XCS)
  – Each individual is a complete MLP, and evolution can change both the weights and the network topology

Learning paradigms
• Different ways of generating a solution:
  – Is each individual a rule or a rule set?
  – Is the solution the best individual, or the whole population?
  – Is the solution generated in a single GA run?
• The Pittsburgh approach
• The Michigan approach
• The Iterative Rule Learning approach

The Pittsburgh approach
• Each individual is a complete solution to the classification problem
• Traditionally this means that each individual is a variable-length set of rules
• The final solution is the best individual of the population after the GA run
• The fitness function is based on the rule set's accuracy on the training set (usually also on its complexity)
• GABIL [De Jong & Spears, 91] is a classic example

Pittsburgh approach: recombination
• Crossover operator
  [Figure: crossover exchanges subsets of rules between the parents to produce the offspring]
• Mutation operator: classic GA mutation by bit inversion

The Michigan approach
• Each individual (classifier) is a single rule
• The whole population cooperates to solve the classification problem
• A reinforcement learning system is used to identify the good rules
• A GA is used to explore the search space for more rules
• XCS [Wilson, 95] is the best-known Michigan LCS

The Michigan approach
• What is reinforcement learning?
  – "a way of programming agents by reward and punishment without needing to specify how the task is to be achieved" [Kaelbling, Littman & Moore, 96]
  – Rules are evaluated example by example, receiving a positive/negative reward
  – Rule fitness is updated incrementally with this reward
  – After enough trials, good rules should have a high fitness

Michigan system's working cycle
[Figure: the working cycle of a Michigan system]

Iterative Rule Learning approach
• This approach implements the separate-and-conquer method of rule learning (a sketch follows below):
  – Each individual is a rule
  – A GA run ends up generating a single good rule
  – The examples covered by the rule are removed from the training set, and the process starts again
• First used in evolutionary learning in the SIA system [Venturini, 93]
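A minimal Python sketch of this separate-and-conquer loop. Here run_ga and covers are hypothetical stand-ins (a full GA run returning the best single rule, and a rule/example match test); they are not taken from SIA or any other actual system:

  def iterative_rule_learning(training_set, run_ga, covers, max_rules=50):
      """Separate-and-conquer: evolve one rule, remove the examples it
      covers, and repeat until no examples remain."""
      remaining = list(training_set)
      rule_set = []
      while remaining and len(rule_set) < max_rules:
          rule = run_ga(remaining)          # one full GA run -> one good rule
          covered = [ex for ex in remaining if covers(rule, ex)]
          if not covered:                   # guard against a useless rule
              break
          rule_set.append(rule)
          remaining = [ex for ex in remaining if not covers(rule, ex)]
      return rule_set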
The GAssist Pittsburgh LCS [Bacardit, 04]
• Genetic clASSIfier SysTem
• Designed with three aims:
  1. Generate compact and accurate solutions
  2. Reduce run-time
  3. Be able to cope with both continuous and discrete data
• These objectives are achieved by several components (the number indicates the aim each one addresses):
  – ADI rule representation (3)
  – Explicit default rule mechanism (1)
  – ILAS windowing scheme (2)
  – MDL-based fitness function (1)
  – Initialization policies (1)
  – Rule deletion operator (1)

GAssist components in the GA cycle
• Representation
  – ADI representation
  – Explicit default rule mechanism
• GA cycle
  [Diagram: Initialization (initialization policies) → Evaluation (MDL fitness function, ILAS windowing) → Selection → Crossover → Mutation (standard operators), and back to Evaluation]

GAssist: default rule mechanism
• When we encode a rule set as a decision list we can observe an interesting behavior: the emergent generation of a default rule
• Using a default rule can help generate a more compact rule set:
  – Easier to learn (smaller search space)
  – Potentially less sensitive to overlearning
• To maximize these benefits, the knowledge representation is extended with an explicit default rule

GAssist: default rule mechanism
• What class is assigned to the default rule?
  – Simple policies such as using the majority/minority class are not robust enough
  – Automatic determination of the default class:
    • The initial population contains individuals with every possible default class
    • Evolution will choose the correct default class
    • In the first few iterations the different default classes are kept isolated: each one is a separate subpopulation
      – Different default classes learn at different rates
    • Afterwards, the restrictions are lifted and the system is free to pick the best policy

GAssist: initialization policy
• Probability of a rule matching a random instance:
  – In GABIL, each gene associated with a value of an attribute is independent of the other values
  – Therefore, the probability of matching an attribute equals P1, the probability of setting a gene to 1 when initializing the chromosome; over a attributes:
    P(match) = (P1)^a
  – Probability of a rule set of r rules matching a random instance:
    P(match rule set) = 1 - (1 - P(match))^r = 1 - (1 - (P1)^a)^r

GAssist: initialization policy
• How can we derive a formula to adjust P1?
  – We use an explicit default rule mechanism
  – If we assume an equal distribution over the nc classes, we have to make sure that the rule set matches all but one of the classes (the one handled by the default rule):
    P(match rule set) = 1 - 1/nc
    1 - (1 - (P1)^a)^r = 1 - 1/nc
  – Solving for P1 gives:
    P1 = (1 - (1/nc)^(1/r))^(1/a)
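The formula is straightforward to evaluate numerically. A quick Python sketch (the function name and example values are illustrative):

  def initial_p1(a, r, nc):
      """P1 such that a freshly initialized rule set of r rules over a
      attributes matches, on average, all but one of the nc classes:
      solves 1 - (1 - P1**a)**r = 1 - 1/nc for P1."""
      return (1.0 - (1.0 / nc) ** (1.0 / r)) ** (1.0 / a)

  # Example: 10 attributes, 20 rules per individual, 3 classes
  print(round(initial_p1(a=10, r=20, nc=3), 3))  # ~0.746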
GAssist: initialization policy
• Covering operator:
  – Each time a new rule has to be created, an instance is sampled from the training set
  – The rule is created as a generalized version of the example:
    • This makes sure the rule matches the example
    • The rule covers not just the example, but a larger area of the search space
  – Two methods of sampling instances from the training set:
    • Uniform probability for each instance
    • Class-wise sampling probability

GAssist: rule deletion operator
• An operator applied after the fitness computation
• Rules that do not match any training example are eliminated
• The operator leaves a small number of 'dead' rules in each individual, acting as protective neutral code:
  – If crossover is applied over a dead rule it does not matter: it will not break a good rule
  – However, if too many dead rules are present, exploration becomes inefficient and the population loses diversity

GAssist: ILAS windowing scheme
• Windowing: using a subset of the examples to perform the fitness computations
• Incremental Learning with Alternating Strata (ILAS)
• The mechanism uses a different subset of training examples in each GA iteration
  [Figure: the training set (Ex examples) is divided into n strata at 0, Ex/n, 2·Ex/n, 3·Ex/n, …, Ex; successive GA iterations alternate through the strata]

BioHEL [Bacardit et al, 09]
• BIO-inspired HiErarchical Learning
• Successor of GAssist, but with a change of paradigm: it uses the Iterative Rule Learning approach
• Created to overcome the scalability limitations of GAssist
• It still employs:
  – The default rule (without the automatic class policy)
  – The ILAS windowing scheme

BioHEL: fitness function
• Defining the fitness function is trickier than in GAssist, as it is impossible to have global control over the solution
• As in any separate-and-conquer method, the system should favor rules that are:
  – Accurate (they do not make mistakes)
  – General (they cover many examples)
• These two objectives are contradictory, especially in real-world problems: the easiest way to increase accuracy is to create very specific rules
• BioHEL redefines coverage as a piece-wise function that rewards rules covering at least a certain fraction of the training set

BioHEL: fitness function
• The coverage term penalizes rules that do not cover a minimum percentage of the examples
• The choice of the coverage break is crucial for the proper performance of the system

BioHEL: ALKR
• The Attribute List Knowledge Representation (ALKR)
• This representation exploits a very frequent situation: in high-dimensionality domains, each rule usually uses only a very small subset of the attributes
• Example of a rule for a bioinformatics dataset [Bacardit and Krasnogor, 2009]:
  If Leu-2 ∈ [-0.51,7] and Glu ∈ [0.19,8] and Asp+1 ∈ [-5.01,2.67] and Met+1 ∈ [-3.98,10] and Pro+2 ∈ [-7,-4.02] and Pro+3 ∈ [-7,-1.89] and Trp+3 ∈ [-8,13] and Glu+4 ∈ [0.70,5.52] and Lys+4 ∈ [-0.43,4.94] then alpha
• Only 9 attributes out of 300 actually appear in the rule

BioHEL: ALKR
• A naive match function iterates over the whole domain:

  Function match(instance x, rule r)
    Foreach attribute att in the domain
      If att is relevant in rule r and (x.att < r.att.lower or x.att > r.att.upper)
        Return false
      EndIf
    EndFor
    Return true

• Given the previous example of a rule, 293 iterations of this loop are wasted!
• Can we get rid of them?
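Yes: the attribute-list idea is for each rule to store only the attributes it actually expresses, so the match loop visits 9 entries instead of 300. A minimal Python sketch of this idea (the dict-based layout is an illustrative assumption, not BioHEL's actual data structure):

  def alkr_match(attribute_list, instance):
      """attribute_list: dict mapping attribute index -> (lower, upper),
      containing only the attributes the rule is expressed over."""
      for att, (low, up) in attribute_list.items():
          if not (low <= instance[att] <= up):
              return False
      return True

  # Example: a 300-attribute instance; the rule only expresses attributes 4 and 17
  rule = {4: (-0.51, 7.0), 17: (0.19, 8.0)}
  instance = [0.0] * 300
  instance[4], instance[17] = 1.5, 3.2
  print(alkr_match(rule, instance))  # True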
BioHEL: ALKR
• ALKR automatically identifies the relevant attributes of the domain for each rule, and tracks just those

BioHEL's ALKR
• Simulated 1-point crossover
  [Figure: 1-point crossover simulated over the attribute lists of two rules]

BioHEL: ALKR
• In ALKR, two operators (specialize and generalize) add or remove attributes from the list with a given probability, hence exploring the rule-wise space of relevant attributes
• The ALKR match process is more efficient; on the other hand, exploration is costlier and requires two extra operators
• Since the ALKR chromosome only contains relevant information, the exploration process is more efficient

BioHEL: CUDA-based fitness computation
• NVIDIA's Compute Unified Device Architecture (CUDA) is a parallel computing architecture that exploits the capacity of NVIDIA's graphics processing units
• CUDA runs thousands of threads at the same time, following the Single Program, Multiple Data paradigm
• In the last few years, GPUs have been used extensively in the evolutionary computation field
  – Many papers and applications are available at http://www.gpgpgpu.com
• Using GPGPUs in machine learning is a greater challenge because it deals with more data, but this also means it is potentially more parallelizable

CUDA architecture
[Figure: the CUDA architecture]

CUDA memory management
• Different types of memory with different access speeds:
  – Global memory (slow but large)
  – Shared memory (block-wise; fast but quite small)
  – Constant memory (very fast but very small)
• Memory is limited
• Memory copy operations take up a considerable amount of the execution time
• Since we aim to work with large-scale datasets, a good strategy for minimizing the execution time is based on careful memory usage

CUDA for matching a set of rules
• The match process is the computationally most expensive stage
• However, performing only the match inside the GPU means downloading from the card a structure of size O(N×M) (N = population size, M = training set size)
• In most cases we do not need to know the specific matches of a classifier, just how many there are (a reduction of the data)
• Performing this second (reduction) stage inside the GPU as well reduces the memory traffic to O(N)

CUDA in BioHEL
[Figure: integration of the CUDA-based fitness computation in BioHEL]

Performance of CUDA alone
• We used CUDA on a Tesla C1060 card with 4GB of global memory, and compared the run-time to that of Intel Xeon E5472 3.0GHz processors
• The biggest speedups were obtained on large problems (in |T| or #Att), especially in domains with continuous attributes
• The run time for the largest dataset was reduced from 2 weeks to 8 hours

CUDA fitness in combination with ILAS
• The speedups of CUDA and ILAS are cumulative

Resources
• A very thorough survey on GBML is available here
• The thesis of Martin Butz on XCS, including theoretical models and advanced exploration methods (later a book)
• My thesis, about GAssist (code)
• Complete description of BioHEL (code)

Questions?