CS 2206 Pattern Recognition – 2014/2015
Handout: Lab 10

Section contents:
- Clustering
- Genetic algorithm
- Example

Clustering: An Introduction

What is Clustering?

Clustering can be considered the most important unsupervised learning problem; like every other problem of this kind, it deals with finding structure in a collection of unlabeled data. A loose definition of clustering could be “the process of organizing objects into groups whose members are similar in some way”. A cluster is therefore a collection of objects which are “similar” to one another and “dissimilar” to the objects belonging to other clusters. We can show this with a simple graphical example:

[Figure: graphical example of clustering]

The Goals of Clustering

So, the goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how do we decide what constitutes a good clustering? It can be shown that there is no absolute “best” criterion that is independent of the final aim of the clustering. Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering suits their needs. For instance, we could be interested in finding representatives for homogeneous groups (data reduction), in finding “natural clusters” and describing their unknown properties (“natural” data types), in finding useful and suitable groupings (“useful” data classes), or in finding unusual data objects (outlier detection).

Problems

There are a number of problems with clustering.
Among them:
- current clustering techniques do not address all the requirements adequately (and concurrently);
- dealing with a large number of dimensions and a large number of data items can be problematic because of time complexity;
- the effectiveness of the method depends on the definition of “distance” (for distance-based clustering);
- if an obvious distance measure doesn’t exist we must “define” it, which is not always easy, especially in multi-dimensional spaces;
- the result of the clustering algorithm (which in many cases can itself be arbitrary) can be interpreted in different ways.

Clustering Algorithms

Classification

Clustering algorithms may be classified as listed below, each with a commonly used example algorithm:
- Exclusive Clustering: K-means
- Overlapping Clustering: Fuzzy C-means
- Hierarchical Clustering: Hierarchical clustering
- Probabilistic Clustering: Mixture of Gaussians

Eng. Nareeman | Eng. Maram | Eng. Ahmed Page 1

Distance Measure

The Genetic Algorithm

- The genetic algorithm was developed to provide efficient techniques for optimization and machine learning applications by applying the principles of evolutionary biology to computer science.
- It uses directed search algorithms based on the mechanics of biological evolution, such as inheritance, mutation, natural selection, and recombination (or crossover).
- It is a heuristic method that uses the idea of survival of the fittest.
- In the genetic algorithm, the problem to be solved is represented by lists of parameters, called chromosomes or genomes, which can be used to drive an evaluation procedure.
- Chromosomes are typically represented as simple strings of data and instructions. In the first step of the algorithm, such chromosomes are generated randomly or heuristically to form an initial pool of possible solutions, called the first generation pool.
- The overall process of the algorithm is summarized in Figure 5.
- The flow chart of the genetic algorithm is also given in Figure 6.
- The components of the genetic algorithm explained above can also be summarized as follows:
  - Encoding technique (gene, chromosome)
  - Initialization procedure (creation)
  - Evaluation function (environment)
  - Selection of parents (reproduction)
  - Genetic operators (mutation, recombination)
  - Parameter settings (practice and art)

[Figure 5: Pseudo-code for the genetic algorithm]

[Figure 6: Flow chart of the genetic algorithm]

GA components

- Individual: any possible solution
- Population: group of all individuals
- Search Space: all possible solutions to the problem
- Chromosome: blueprint for an individual
- Trait: possible aspect of an individual
- Allele: possible settings for a trait
- Locus: the position of a gene on the chromosome
- Genome: collection of all chromosomes for an individual

Foundations in Science

Selection

    sum = 0
    for all members of population
        sum += fitness of this individual
    end for

    sum of probabilities = 0
    for all members of population
        sum of probabilities += fitness / sum
        probability = sum of probabilities
    end for

    loop until new population is full
        do this twice
            number = Random between 0 and 1
            for all members of population
                if number > previous probability and number <= probability
                    then you have been selected
            end for
        end
        create offspring
    end loop

Here "probability" is the cumulative probability of each individual, so every individual owns a slice of the interval from 0 to 1 whose width is proportional to its fitness.

Crossover

Mutation

Traveling Salesman Problem Using Genetic Algorithms

    Parent 1    Parent 2
    FAB|ECGD    DEA|CGBF

    Child 1     Child 2
    FAB|CGBF    DEA|ECGD

    City               A B C D E F G
    First Connection   F A E G B D C
    Second Connection  B E G F C A D

The starting parameter values are:

    Parameter           Initial Value
    Population Size     10,000
    Group Size          5
    Mutation            3%
    # Nearby Cities     5
    Nearby City Odds    90%
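The Selection pseudo-code and the parent/child crossover example above can be tried out with a short Python sketch. It is only an illustration under stated assumptions: the function names `roulette_select`, `one_point_crossover` and `is_valid_tour` are made up for this handout, selection is read as fitness-proportionate ("roulette wheel") selection over cumulative probabilities, and crossover is read as a plain one-point swap at the `|` mark.

```python
import random
from itertools import accumulate


def roulette_select(population, fitnesses, rng=random):
    """Pick one individual with probability proportional to its fitness.

    Hypothetical helper: assumes non-negative fitness values whose sum is > 0.
    """
    total = sum(fitnesses)
    # Cumulative probabilities: the upper edge of each individual's slice of [0, 1).
    cumulative = list(accumulate(f / total for f in fitnesses))
    number = rng.random()  # random number in [0, 1)
    for individual, upper in zip(population, cumulative):
        if number < upper:
            return individual
    return population[-1]  # guard against floating-point round-off


def one_point_crossover(parent1, parent2, cut):
    """Swap the tails of two equal-length strings after position `cut`."""
    child1 = parent1[:cut] + parent2[cut:]
    child2 = parent2[:cut] + parent1[cut:]
    return child1, child2


def is_valid_tour(tour, cities="ABCDEFG"):
    """A tour is valid if it visits every city exactly once."""
    return sorted(tour) == sorted(cities)


# Parents from the handout, cut after the third city (the '|' mark).
child1, child2 = one_point_crossover("FABECGD", "DEACGBF", cut=3)
print(child1, child2)                                # FABCGBF DEAECGD
print(is_valid_tour(child1), is_valid_tour(child2))  # False False

# Select two parents from a toy population; fitness values are invented.
tours = ["FABECGD", "DEACGBF", "ABCDEFG"]
fitness = [3.0, 1.0, 2.0]
print([roulette_select(tours, fitness) for _ in range(2)])
```

Note that both children repeat some cities and omit others, so a plain one-point swap does not produce valid tours. Representing a tour by the two connections of each city, as in the City / First Connection / Second Connection table, is one way to work with links rather than positions.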