PR-2CS-Section(10)

CS 2206 Pattern Recognition – 2014/2015
Handout: Lab 10
Section contents:
- Clustering
- Genetic algorithm
- Example
Clustering: An Introduction
What is Clustering?
Clustering can be considered the most important unsupervised learning problem. Like every other
problem of this kind, it deals with finding structure in a collection of unlabeled data.
A loose definition of clustering could be “the process of organizing objects into groups whose members
are similar in some way”.
A cluster is therefore a collection of objects which are “similar” to one another and “dissimilar” to the
objects belonging to other clusters.
We can show this with a simple graphical example:
The Goals of Clustering
The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how do we
decide what constitutes a good clustering? It can be shown that there is no absolute “best” criterion which
would be independent of the final aim of the clustering. Consequently, it is the user who must supply
this criterion, in such a way that the result of the clustering will suit their needs.
For instance, we could be interested in finding representatives for homogeneous groups (data reduction),
in finding “natural clusters” and describing their unknown properties (“natural” data types), in finding
useful and suitable groupings (“useful” data classes) or in finding unusual data objects (outlier
detection).
Problems
There are a number of problems with clustering. Among them:
- current clustering techniques do not address all the requirements adequately (and concurrently);
- dealing with a large number of dimensions and a large number of data items can be problematic
  because of time complexity;
- the effectiveness of the method depends on the definition of “distance” (for distance-based
  clustering);
- if an obvious distance measure doesn’t exist we must “define” it, which is not always easy,
  especially in multi-dimensional spaces;
- the result of the clustering algorithm (which in many cases can be arbitrary itself) can be
  interpreted in different ways.
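The dependence on the definition of “distance” can be seen in a small sketch. The points and the helper names (`euclidean`, `manhattan`, `nearest`) are made up for illustration; the only claim is that the same query point can have a different nearest neighbour under two common distance measures.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def nearest(query, points, distance):
    """Return the point closest to `query` under the given distance function."""
    return min(points, key=lambda p: distance(query, p))


# Illustrative data, not taken from the handout.
query = (0.0, 0.0)
points = [(3.0, 3.0), (0.0, 5.0)]

print(nearest(query, points, euclidean))  # (3.0, 3.0): about 4.24 versus 5.0
print(nearest(query, points, manhattan))  # (0.0, 5.0): 5.0 versus 6.0
```

Since a distance-based clustering algorithm groups points by exactly this kind of comparison, changing the measure can change the resulting clusters.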
Clustering Algorithms
Classification
Clustering algorithms may be classified as listed below:
Eng. Nareeman | Eng. Maram | Eng. Ahmed
Page 1
- Exclusive Clustering: K-means
- Overlapping Clustering: Fuzzy C-means
- Hierarchical Clustering: Hierarchical clustering
- Probabilistic Clustering: Mixture of Gaussians
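As an illustration of exclusive clustering, where each point belongs to exactly one cluster, here is a minimal K-means sketch (Lloyd's iteration). The data, the fixed starting centroids and the function names are assumptions chosen for the example; a practical implementation would also need a strategy for choosing the initial centroids.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Illustrative data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
final_centroids, clusters = kmeans(points, [(1.0, 1.0), (8.0, 8.0)])
print(final_centroids)
print(clusters)
```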
Distance Measure
The Genetic Algorithm
- The genetic algorithm was developed to provide efficient techniques for optimization and machine
learning applications by applying the principles of evolutionary biology to computer
science.
- It uses directed search algorithms based on the mechanics of biological evolution, such as
inheritance, mutation, natural selection, and recombination (or crossover).
- It is a heuristic method that uses the idea of survival of the fittest.
- In the genetic algorithm, the problem to be solved is represented by lists of parameters, called
chromosomes or genomes, which can be used to drive an evaluation procedure.
- Chromosomes are typically represented as simple strings of data and instructions. In the first step
of the algorithm, such chromosomes are generated randomly or heuristically to form an initial
pool of possible solutions called the first generation pool.
- The overall process of the algorithm is summarized in Figure 5.
- The flow chart of the genetic algorithm is given in Figure 6.
- The components of the genetic algorithm explained above can also be summarized as below:
- Encoding technique (gene, chromosome)
- Initialization procedure (creation)
- Evaluation function (environment)
- Selection of parents (reproduction)
- Genetic operators (mutation, recombination)
- Parameter settings (practice and art)
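The six components listed above can be sketched in a few lines. This is a minimal illustration, not the handout's own algorithm: it assumes a toy problem (maximise the number of 1 bits in a bit string, often called OneMax), and the function name `run_ga` and every parameter value are hypothetical choices for the example.

```python
import random


def run_ga(n_bits=20, pop_size=30, generations=60,
           crossover_rate=0.9, mutation_rate=0.02, seed=0):
    """Toy genetic algorithm; all parameter values are illustrative."""
    rng = random.Random(seed)

    # Evaluation function: the number of 1 bits in the chromosome.
    fitness = sum

    # Encoding and initialization: each chromosome is a random list of bits.
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(population, key=fitness)

    for _ in range(generations):
        # Selection of parents: fitness-proportionate.
        # The +1 keeps every weight positive even for an all-zero chromosome.
        weights = [fitness(ind) + 1 for ind in population]
        next_population = []
        while len(next_population) < pop_size:
            p1, p2 = rng.choices(population, weights=weights, k=2)
            # Genetic operator 1: one-point recombination.
            if rng.random() < crossover_rate:
                cut = rng.randint(1, n_bits - 1)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Genetic operator 2: independent bit-flip mutation.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_population.append(child)
        population = next_population
        # Remember the best individual seen so far.
        best = max(population + [best], key=fitness)
    return best


best = run_ga()
print(sum(best), "ones out of", len(best))
```

The parameter settings (population size, rates, number of generations) usually have to be tuned per problem, which is why the list above calls them “practice and art”.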
Figure 5: Pseudo-code for the genetic algorithm.
Figure 6: Flow chart of the genetic algorithm.
GA components
- Individual - Any possible solution
- Population - Group of all individuals
- Search Space - All possible solutions to the problem
- Chromosome - Blueprint for an individual
- Trait - Possible aspect of an individual
- Allele - Possible settings for a trait
- Locus - The position of a gene on the chromosome
- Genome - Collection of all chromosomes for an individual
Foundations in Science
Selection
sum = 0
for all members of population
    sum += fitness of this individual
end for
sum of probabilities = 0
for all members of population
    probability = sum of probabilities + (fitness / sum)
    sum of probabilities = probability
end for
loop until new population is full
    do this twice
        number = Random between 0 and 1
        for all members of population
            if number >= previous member's probability and number < this member's probability
                then this member has been selected
        end for
    end
    create offspring
end loop
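The selection pseudocode above is roulette-wheel (fitness-proportionate) selection: each individual owns a slice of the interval from 0 to 1 whose width is its share of the total fitness. A minimal Python sketch follows; the function names and the three-member population are assumptions for illustration, and all fitness values are assumed to be positive.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def cumulative_probabilities(fitnesses):
    """Running sum of fitness / total; the last entry is (approximately) 1."""
    total = sum(fitnesses)
    return list(accumulate(f / total for f in fitnesses))


def roulette_select(population, cumulative, number):
    """Return the individual whose slice of [0, 1) contains `number`."""
    index = bisect_right(cumulative, number)
    # Guard against rounding that leaves the last cumulative value just below 1.
    return population[min(index, len(population) - 1)]


# Illustrative population: slices are [0, 0.1), [0.1, 0.4) and [0.4, 1).
population = ["a", "b", "c"]
cumulative = cumulative_probabilities([1.0, 3.0, 6.0])

print(roulette_select(population, cumulative, 0.05))  # a
print(roulette_select(population, cumulative, 0.25))  # b
print(roulette_select(population, cumulative, 0.95))  # c

# "Do this twice": draw two parents with fresh random numbers.
rng = random.Random(42)
parents = [roulette_select(population, cumulative, rng.random()) for _ in range(2)]
```

Using a binary search over the cumulative values replaces the inner loop over all members, but the selected individual is the same.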
Crossover
Mutation
Traveling Salesman Problem
Using Genetic Algorithms
Parent 1: FAB|ECGD
Parent 2: DEA|CGBF
Child 1:  FAB|CGBF
Child 2:  DEA|ECGD

City   First Connection   Second Connection
A      F                  B
B      A                  E
C      E                  G
D      G                  F
E      B                  C
F      D                  A
G      C                  D
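The two examples above can be checked with a short sketch. It assumes the `|` marks a one-point crossover cut after the third city, and that each row of the connection table lists the two neighbours of a city on one tour; the function names are hypothetical. Under those assumptions, plain one-point crossover produces children that visit some cities twice, while the connection table can be walked back into a complete tour.

```python
from collections import Counter


def one_point_crossover(parent1, parent2, cut):
    """Swap the tails of two parents after position `cut`."""
    return parent1[:cut] + parent2[cut:], parent2[:cut] + parent1[cut:]


def repeated_cities(tour):
    """Cities that appear more than once, which makes a tour invalid."""
    return sorted(city for city, count in Counter(tour).items() if count > 1)


def tour_from_connections(connections, start):
    """Walk the two-neighbour table from `start`, never stepping straight back."""
    tour = [start]
    previous, current = None, start
    for _ in range(len(connections) - 1):
        first, second = connections[current]
        nxt = second if first == previous else first
        tour.append(nxt)
        previous, current = current, nxt
    return tour


child1, child2 = one_point_crossover("FABECGD", "DEACGBF", 3)
print(child1, repeated_cities(child1))  # FABCGBF ['B', 'F']
print(child2, repeated_cities(child2))  # DEAECGD ['D', 'E']

connections = {
    "A": ("F", "B"), "B": ("A", "E"), "C": ("E", "G"), "D": ("G", "F"),
    "E": ("B", "C"), "F": ("D", "A"), "G": ("C", "D"),
}
tour = tour_from_connections(connections, "A")
print("".join(tour))  # AFDGCEB
```

The recovered tour A-F-D-G-C-E-B is the same cycle as Parent 1 (F-A-B-E-C-G-D) read in the opposite direction, so the table appears to be a connection-based encoding of that parent.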
The starting parameter values are:
Parameter            Initial Value
Population Size      10,000
Group Size           5
Mutation             3%
# Nearby Cities      5
Nearby City Odds     90%
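For experimenting with these settings, the starting values can be collected in one place. The class and field names below are hypothetical, and the arithmetic at the end assumes that the mutation percentage is applied once per child, which the handout does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TspGaParameters:
    """Starting values from the table above; field names are illustrative."""
    population_size: int = 10_000
    group_size: int = 5
    mutation_rate: float = 0.03      # 3 %
    nearby_cities: int = 5
    nearby_city_odds: float = 0.90   # 90 %


params = TspGaParameters()

# Assumption: if every child is mutated independently with probability
# mutation_rate, this is the expected number of mutated tours when a whole
# population's worth of children is produced.
expected_mutated = round(params.population_size * params.mutation_rate)
print(expected_mutated)  # 300
```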