Lecture Slides 1-4

Pattern Recognition
By
Prof. Dr Talha Ali Khan
Subject Information
• Subject coordinator: Prof. Dr Talha Ali Khan
• Email ID: talhaali.khan@ue-germany.de
• Teaching time and Location: Potsdam Campus
• Consultation: By Appointment via email only
Intended Audience:
Graduate Students
Contents of the Course
1. Introduction to Pattern Recognition
2. Machine Learning and Pattern Recognition
3. Feature Selection, Extraction and Dimensionality Reduction
4. Statistics for Pattern Recognition
5. Programming for Pattern Recognition
6. Deep Learning
7. Optimization Algorithms and Pattern Recognition
General Assessment Overview

Item                            | Topics                                                  | Due         | Marks
Assignment Presentation/Report  | Any journal paper on a pattern recognition application  | Week 14     | 20%
Quizzes                         | Average of four quizzes                                 | -           | 20%
Project Report and Presentation | Any PR-based application                                | Exam period | 60%
IEEE
https://www.ieee.org/
IEEE UE Student Branch
https://www.ieee.org/membership/join/more-visibility.html
How to access Research Papers and Technical Reports
• There are several databases where research papers can be searched; a few of them are listed below:
1. https://ieeexplore.ieee.org/Xplore/home.jsp
2. https://www.elsevier.com/de-de
3. https://scholar.google.com/
4. https://doaj.org/
How to download the papers?
• The link below will download most of the papers from the databases:
https://sci-hub.mksa.top/
Research Paper
Introduction / problem background / motivation: Describe the general scope of your project (e.g., automated speech recognition), and "zoom in" on the specific problem that you are addressing (e.g., pitch tracking). What is your motivation to study this problem, and what is the specific need or gap that your work is addressing?
Goal / objectives: The goal is a brief statement that establishes a general, long-term direction of your work (e.g., to analyze the affective content of physiological signals). The objectives (there will likely be more than one) are quantifiable expectations of performance (e.g., to implement or compare certain models).
Literature review: Describe prior research efforts in your area. This is not meant to
be a comprehensive survey of a scientific discipline, but a concise overview of the
most significant results that are tightly related to your work.
Proposed solution / methods used: Describe the algorithm that the authors have developed, or the techniques that you used (i.e., the actual work). Keep this description at a high level: the objective of the presentation is to make the audience want to read your paper, not to scare them away with details!
Analysis of results: What are the specific results of your work? What do these results tell
you? Are they in agreement with your expectations? Do these results suggest the existence
of some phenomena that you were not aware of?
Conclusions: What are the main ideas (not more than three) that you want your audience
to remember? This is the time to “zoom out” and discuss how your results support the
long-term goal of the research.
Preparation and poise: Did you speak to the audience rather than look at the slides? Did
you expand on what was on the slides rather than read them word-by-word? Did you
speak at a reasonable pace rather than too fast or too slow? Did you appear to be
spontaneous and fluid, avoiding the use of distracting mannerisms and colloquialisms?
Use of allotted time: Was there a good balance between inspirational material and
technical content? Did you complete your presentation in time? Did you have to skip some
important material (e.g., conclusions) in order to complete your presentation in time?
Use of visual aids: Did you use pictures/diagrams to explain your ideas? Did you have
graphs of experimental results? Did the slides contain short, clear bullets rather than long
sentences and/or cryptic equations?
Response to questions: Did you address technical questions and comments well?
Books
• Textbook:
1. Christopher M. Bishop, Pattern Recognition and Machine Learning,
Springer, 2007
• Reference Books
1. Frank Y. Shih, Image Processing and Pattern Recognition:
Fundamentals and Techniques, Wiley
2. Richard O. Duda, Peter E. Hart, David G. Stork, Pattern
Classification, 2nd Edition, Wiley
Learning Objectives
This course will equip you with an understanding of:
• The fundamentals of pattern recognition
• Different classification and clustering algorithms
• The basics of Artificial Neural Networks
• Applications of pattern recognition
• Deep Learning
• Optimization Algorithms
Part 1 – Introduction to
Pattern Recognition
Human Perception
• Humans have developed highly sophisticated skills for sensing their environment and taking actions according to what they observe
For example: recognizing a face, understanding spoken words, reading handwriting, distinguishing fresh food by its smell
• The central idea is to give similar capabilities to machines
Human and Machine Perception
• When developing pattern recognition algorithms, we are often influenced by knowledge of how patterns are modeled and recognized in nature
• Research on machine perception also helps us gain a deeper understanding of, and appreciation for, pattern recognition systems in nature
• Many techniques are applied that are purely numerical and have no correspondence in natural systems
What is a Pattern?
• A pattern is a regularity in the world, in human-made design, or
in abstract ideas
• The elements of a pattern repeat in a predictable manner
• A geometric pattern is a kind of pattern formed of geometric
shapes and typically repeated like a wallpaper design
• Any of the senses may directly observe patterns
• The opposite of “pattern” is complete “randomness”.
What is Pattern Recognition?
• Pattern recognition involves the design of systems which, automatically or otherwise, recognize patterns in acquired data.
• Study of how machines can observe the environment
• Learn to distinguish patterns of interest
• Make sound and reasonable decisions about the
categories of the patterns
Pattern Recognition Tasks
• Like Digital Signal Processing, Pattern Recognition is a discipline with applications (tasks) in virtually every field of science:
• Computer science
• Electrical engineering
• Industrial engineering
• Remote sensing (geology, archaeology, surveillance)
• Civil engineering
• Vehicle manufacturing, advanced driver assistance systems (ADAS),
autonomous vehicles
• Finance and investment
• Business economics
• Medicine (screening, diagnostics)
• Biometrics (age, gender, iris, fingerprint, mood, handwriting, voice identity
detection)
• Image segmentation (pages, forms, or images with, e.g., street, houses, sky,
cars, texts, …)
• Automatic recognition of speech, (hand)writing (OCR), gesture, …
• Intelligent sensing (smell, chemicals, …)
Pattern Recognition in Daily Life
We encounter automatic pattern recognition in daily life very frequently.
What is Pattern Recognition?
A pattern is an entity, vaguely defined, that could be given a name.
For example:
1. Fingerprint image
2. Handwritten word
3. Human face
4. Speech signal
5. DNA sequence
Components of Pattern Recognition System
• Data Collection
• Sensors collect information
• e.g., Images, Temperature, Pressure, Hardness etc.
• Pre-Processing
• Noise Reduction, Filtering, Limiting etc.
• Feature Extraction
• Compute Numeric information from raw data
• Classification
• Clustering is an example of classification
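The components above map naturally onto a small software pipeline. The following is a minimal sketch, assuming scikit-learn and its bundled digits dataset; the PCA step and SVM classifier are illustrative choices, not prescribed by the slides.

```python
# A minimal sketch of the stages above (illustrative choices, not part of the slides).
from sklearn.datasets import load_digits          # data collection (pre-recorded sensor data)
from sklearn.preprocessing import StandardScaler  # pre-processing
from sklearn.decomposition import PCA             # feature extraction
from sklearn.svm import SVC                       # classification
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Chain pre-processing, feature extraction and classification into one pipeline
model = make_pipeline(StandardScaler(), PCA(n_components=20), SVC())
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```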
Data Acquisition
• Data is acquired through sensors (and is subject to noise)

Examples of Data Acquisition
Depending upon the application we can use different types of sensors, e.g.:
• Temperature is sensed through thermocouples
• Sound data is sensed through microphones
• Image data is sensed through cameras
• Location data is sensed through GPS satellites
• Financial data is acquired from stock exchanges
• Wind speed data is sensed through anemometers
• Blood oxygen saturation is sensed through a red-light sensor

How do we acquire data for:
1. The shortest route from point A to B in Google Maps?
2. The audit data of a firm?
3. The detection of credit card fraud?
Characteristics of Data and Data Acquisition
• Frequency spectrum of data
• Speed of data acquisition (sampling frequency)
• Acceptable noise level (signal-to-noise ratio, SNR)
• Amount of collected data (storage size)
• Built-in preprocessing
Related fields
and
application of
PR
Pattern Recognition Applications
Biometric Recognition
Pattern Recognition Applications
Fingerprint Recognition
Pattern Recognition Applications
Autonomous Navigation Recognition
Pattern Recognition Applications
Medical Imaging
Cancer detection and grading using microscopic tissue data. (left)
A whole slide image with 75568 × 74896 pixels. (right)
A region of interest with 7440 × 8260 pixels
Pattern Recognition Applications
Land cover classification using satellite
Pattern Recognition Applications
Building and building group recognition using satellite
Classification Algorithms
• Logistic Regression.
• Naïve Bayes.
• Stochastic Gradient Descent.
• K-Nearest Neighbours.
• Decision Tree.
• Random Forest.
• Support Vector Machine
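As a hedged illustration, the snippet below compares several of the listed classifiers with scikit-learn on a synthetic dataset; the dataset and the default hyper-parameters are assumptions for demonstration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "K-Nearest Neighbours": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Support Vector Machine": SVC(),
}

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)   # 5-fold cross-validation accuracy
    print(f"{name}: {scores.mean():.3f}")
```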
Clustering Types
• Connectivity-based clustering (hierarchical clustering)
• Centroid-based clustering (partitioning methods)
• Distribution-based clustering (model-based methods)
• Density-based clustering
• Fuzzy clustering
• Constraint-based (supervised) clustering
Typical PR System
• Training data is the data you use to train an algorithm or machine learning model to predict the outcome of interest.
• Test data is used to measure the performance, such as accuracy or efficiency, of the trained model.
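A minimal sketch of the train/test distinction, assuming scikit-learn and an 80/20 split; the split ratio and the classifier are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = KNeighborsClassifier().fit(X_train, y_train)   # learn from training data only
print("Accuracy on unseen test data:", clf.score(X_test, y_test))
```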
Part 2 – Machine
Learning
Introduction to Artificial
Intelligence
Definition
“Anything that makes machines act more intelligently”
Artificial intelligence (AI) refers to the simulation of human
intelligence in machines that are programmed to think like
humans and mimic their actions
• Think of AI as augmented intelligence
• AI should not attempt to replace human experts, but rather extend human capabilities and accomplish tasks that neither humans nor machines could do on their own
Augmented Intelligence
• Augmented intelligence is a design pattern for a human-centered partnership model of people and artificial intelligence (AI) working together to enhance cognitive performance, including learning, decision making and new experiences.

Augmented Intelligence
• The internet has given us access to more information, faster
• Distributed computing and IoT have led to massive amounts of data, and social networking has encouraged most of that data to be unstructured
• Augmented intelligence puts the information that subject matter experts need at their fingertips, and backs it with evidence, so they can make informed decisions
• It helps experts scale their capabilities and lets the machines do the time-consuming work
How does AI learn?
• Based on strength, breadth, and application, AI can be described in different ways.
1) Weak or Narrow AI:
• AI that is applied to a specific domain
For example: language translators, virtual assistants, self-driving cars, AI-powered
web searches, recommendation engines, and intelligent spam filters
2) Applied AI:
• It can perform specific tasks, but not learn new ones, making decisions based on
programmed algorithms, and training data
3) Strong AI or Generalized AI:
• AI that can interact with and operate on a wide variety of independent and unrelated tasks.
• It can learn new tasks to solve new problems, and it does this by teaching itself
new strategies.
• Generalized Intelligence is the combination of many AI strategies that learn from
experience and can perform at a human level of intelligence.
4) Super AI or Conscious AI:
• AI with human-level consciousness, which would require it to be self-aware.
• Since we are not yet able to adequately define what consciousness is, it is unlikely that we will be able to create a conscious AI in the near future
Impact and Examples of AI
AI means different things to different people:
• Video game designer: AI means writing the code that affects how bots play or how the environment reacts to the player
• Screenwriter: AI means a character that acts like a human, with some tropes having computer features mixed in
• Data scientist: AI is a way of exploring and classifying data to meet specific goals
Natural Language Processing
The natural language processing and natural language generation capabilities of AI are not only enabling machines and humans to understand and interact with each other, but are creating new opportunities and new ways of doing business.
Chatbots
• Chatbots powered by natural language processing capabilities are being used in:
1. Healthcare: to question patients and run basic diagnoses like real doctors
2. Education: providing students with easy-to-learn conversational interfaces and on-demand online tutors
3. Customer service: chatbots are improving customer experience by resolving queries on the spot and freeing up agents' time for conversations that add value
AI speech-to-text technology
• AI-powered advances in speech-to-text technology have made real-time transcription a reality
• Advances in speech synthesis are the reason companies are using AI-powered voice to enhance customer experience and give their brand its unique voice
• In the field of medicine, it is helping patients with Lou Gehrig's disease, for example, to regain their real voice in place of using a computerized voice

It is due to advances in AI that the field of computer vision has been able to surpass humans in tasks related to detecting and labeling objects
• Computer vision algorithms detect facial features in images and compare them with databases of face profiles
• This is what allows consumer devices to authenticate the identities of their owners through facial recognition, social media apps to detect and tag users, and law enforcement agencies to identify criminals in video feeds
Computer vision also helps automate tasks such as:
• Detecting cancerous moles in skin images
• Finding symptoms in X-ray and MRI scans

AI behind the scenes
• AI is working behind the scenes in different sectors, such as:
1) Finance:
• Monitoring our investments and detecting fraudulent transactions
• Identifying credit card fraud and preventing financial crimes
2) Healthcare:
• Helping doctors arrive at more accurate preliminary diagnoses
• Reading medical imaging and finding appropriate clinical trials for patients
• It is not just influencing patient outcomes, but also making operational processes less expensive
• AI has the potential to access enormous amounts of information,
imitate humans, even specific humans, make life-changing
recommendations about health and finances, correlate data that may
invade privacy, and much more.
Terminologies used Interchangeably
Machine
Learning
Machine Learning, a subset of AI, uses computer algorithms to
analyze data and make intelligent decisions
Instead of following rules-based algorithms, machine learning
builds models to classify and make predictions from data
Examples
• Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack
• Predict the price of a stock 6 months from now
• Face detection: find faces in images (or indicate if a face is present)
• Spam filtering: identify email messages as spam or not spam
• Fraud detection: applications that seek patterns in huge data sets
Why do we mine data?
• From a commercial point of view: data is everywhere, and there is value hidden in it.

OK, there is too much data. Then what?
• Provide a machine learning program with a large volume of pictures of birds and train the model to return the label "bird" whenever it is shown a picture of a bird
• Similarly, create a label for "cat" and provide pictures of cats to train on
• When the trained model is shown a picture of a cat or a bird, it will label the picture with some level of confidence
Types of Machine Learning
Machine
Learning
Supervised
Learning
Unsupervised
Learning
Reinforcement
Learning
Supervised Learning
• It is defined by its use of labeled datasets to train algorithms that to
classify data or predict outcomes accurately
Supervised Learning Workflow
Supervised Learning
• Supervised learning can be split into two categories:
1. Regression
2. Classification
Regression
• Regression analysis is a set of
statistical processes for
estimating the relationships
between a dependent variable
and one or more independent
variables
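A small sketch of regression in Python, assuming NumPy and scikit-learn; the synthetic data and the true slope and intercept are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))               # one independent variable
y = 3.0 * X.ravel() + 2.0 + rng.normal(0, 1, 100)   # dependent variable with noise

reg = LinearRegression().fit(X, y)                  # estimate the relationship
print("Estimated slope:", reg.coef_[0], "intercept:", reg.intercept_)
```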
Classification
• Classification is the technique of identifying which of a set of categories an observation belongs to
Unsupervised Learning
• Unsupervised learning, also known as unsupervised machine learning,
uses machine learning algorithms to analyze and cluster unlabeled datasets
• These algorithms discover hidden patterns or data groupings without the need
for human intervention
Unsupervised Learning
• Unsupervised learning can be divided into two types:
1. Clustering
2. Association
Clustering
• Cluster analysis or clustering is
the task of grouping a set of
objects in such a way that
objects in the same group are
more similar to each other than
to those in other groups
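A minimal clustering sketch using k-means (a centroid-based method) from scikit-learn; the toy data and the choice of three clusters are assumptions for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)    # unlabeled data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster sizes:", [int((kmeans.labels_ == k).sum()) for k in range(3)])
```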
Association
• Association rule learning is a
rule-based machine
learning method for
discovering interesting
relations between variables in
large databases
Reinforcement Learning
• Reinforcement learning is the training of machine learning models to make a sequence of decisions.
• The agent learns to achieve a goal in an uncertain, potentially complex environment
• In reinforcement learning, an artificial intelligence faces a game-like situation
• Its goal is to maximize the total reward
Deep Learning
• Deep Learning enables AI systems to continuously learn on the job and improve the quality and accuracy of the results
• That enables these systems to learn from unstructured data
• Deep Learning layers algorithms to create a neural network, an artificial replication of the functionality and structure of the brain
• Deep Learning is a subset of Machine Learning, which in turn is a subset of AI
• Deep learning algorithms do not directly map input to output
• They rely on several layers of processing units
• Each layer passes its output to the next layer, which processes it and passes it on
• The many layers are why it is called deep learning
• Give a deep learning algorithm thousands of images and labels that correspond to the content of each image
• When creating deep learning algorithms, developers and engineers configure the number of layers and the type of functions that connect the outputs of each layer to the inputs of the next
• Train the model by providing it with lots of annotated examples
• The algorithm runs those examples through its layered neural network
• It adjusts the weights of the variables in each layer of the neural network so that it can detect the common patterns that define images with similar labels
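As a hedged sketch of this workflow (not the course's reference implementation), the PyTorch snippet below configures a small layered network and adjusts its weights from labelled examples; the layer sizes, optimizer, and synthetic data are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Configure the number of layers and the functions connecting them
model = nn.Sequential(
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 10)               # 10 output classes
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Stand-in "annotated examples": 256 inputs flattened to 64 features each
X = torch.randn(256, 64)
y = torch.randint(0, 10, (256,))

for epoch in range(20):             # run the examples through the network repeatedly
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)     # how far predictions are from the labels
    loss.backward()                 # compute gradients
    optimizer.step()                # adjust the weights in each layer
```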
• Deep Learning has proven to be very efficient at various tasks
1. Image captioning
2. Voice recognition and transcription
3. Facial recognition
4. Medical imaging
5. Language translation
• Deep Learning is also one of the main components of
driverless cars
Recap
• Classification is a process that automatically orders or categorizes data into one or
more of a set of “classes.”
• Clustering is the task of dividing the population or data points into a number of
groups such that data points in the same groups are more similar to other data
points in the same group than those in other groups.
• In simple words, the aim is to segregate groups with similar traits and assign them
into clusters
• Regression analysis is a set of statistical processes for estimating the relationships
between a dependent variable (often called the 'outcome' or 'response' variable)
and one or more independent variables
Part 3 – Feature
Selection & Extraction
Introduction
• Data mining is an automatic or semi-automatic process of extracting and discovering implicit, unknown, and potentially useful patterns and information from massive data stored in and captured from data repositories (web, databases, data warehouses).
• Data mining tasks are usually divided into two categories:
1. Predictive
2. Descriptive
Introduction
• Predictive tasks:
• To predict or classify for determining which class an instance or
example belongs, based on the values of some features (i.e,
independent or conditional features) in a dataset.
• For example:
o Prediction task is to predict if a new patient (instance) is in danger of a heart
attack disease or not based on some clinical tests.
• Descriptive tasks:
• To find clusters, correlations, and trends based on the implicit
relationships hidden in the underlying data
Introduction
• Data collected for analysis might contain redundant or irrelevant features
• Applying data mining methods on such data might lead to misleading results.
• Data pre-processing is an essential step to refine the data to be used in any
learning model.
• The main purpose of the pre-processing step is to clean and transform raw data
into a suitable format, which improves the performance of data mining tasks.
• This is where dimensionality reduction algorithms come into play.
Motivation
• When dealing with real problems and real data, we often face high-dimensional data, with dimensionality that can go up to millions.
• In its original high-dimensional form the data represents itself directly; however, we sometimes need to reduce its dimensionality.
• Reducing the dimensionality is also needed for visualization.
Dimensionality reduction
• Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space, so that the low-dimensional representation retains some meaningful properties of the original data
Rainfall Example
• Machine learning models map features to outcomes.
• For instance, say you want to create a model that predicts the amount
of rainfall in one month.
• A dataset of different information collected from different cities in
separate months.
• The data points include temperature, humidity, city population, traffic,
number of concerts held in the city, wind speed, wind direction, air
pressure, number of bus tickets purchased, and the amount of rainfall.
• Obviously, not all this information is relevant to rainfall prediction.
Rainfall Example
• Some of the features might have nothing to do with the target variable.
• Evidently, population and number of bus tickets purchased do not affect
rainfall.
• Other features might be correlated to the target variable, but not have a
causal relation to it.
• For instance, the number of outdoor concerts might be correlated to the
volume of rainfall, but it is not a good predictor for rain.
• In other cases, such as carbon emission, there might be a link between
the feature and the target variable, but the effect will be negligible.
Drawbacks of having many features
• In some problems, such as the rainfall example, it is evident which features are valuable and which are useless.
• In other problems, the excessive features might not be obvious and need further data analysis.
• But why bother to remove the extra dimensions?
• When you have too many features, you’ll also need a more complex
model.
• A more complex model means you’ll need a lot more training data and
more compute power to train your model to an acceptable level.
Drawbacks of having many features
• Machine learning has no understanding of causality; models try to map any feature included in their dataset to the target variable, even if there is no causal relation.
• This can lead to models that are imprecise and erroneous.
• Reducing the number of features can make your machine learning
model simpler, more efficient, and less data-hungry.
• The problems caused by too many features are often referred to as the
“Curse of dimensionality”
The Curse of Dimensionality
• In machine learning, “dimensionality” simply refers to the number of
features (i.e., input variables) in your dataset.
• When the number of features is very large relative to the number of
observations in your dataset, certain algorithms struggle to train
effective models.
• This is called the “Curse of Dimensionality,” and it’s especially relevant
for clustering algorithms that rely on distance calculations.
• Let's say you have a straight line 100 yards long
and you dropped a penny somewhere on it. It
wouldn't be too hard to find. You walk along the
line and it takes two minutes.
• Now let's say you have a square 100 yards on
each side and you dropped a penny somewhere
on it. It would be pretty hard, like searching
across two football fields stuck together. It could
take days.
• Now a cube 100 yards across. That's like
searching a 30-story building the size of a
football stadium.
• The difficulty of searching through the space
gets a lot harder as you have more dimensions.
Dimensionality Reduction
• Dimensionality reduction identifies and removes the features that are hurting the machine learning model's performance or aren't contributing to its accuracy.
• The most prominent aspects of the curse of dimensionality are:
1. Data sparsity
2. Distance concentration
1. Data Sparsity
• A common problem in machine learning is sparse data, which degrades the performance of machine learning algorithms and their ability to calculate accurate predictions.
• Data sparsity refers to how much data we have for a particular dimension/entity of the model.
• It is a common phenomenon in large-scale data analysis.
• For instance, suppose we predict a target that depends on two attributes: gender and age group
• Ideally, we capture the targets for all possible combinations of values of the two attributes
• If this data is used to train a model that is capable of learning the mapping between the attribute values and the target, its performance will generalize well.
• As long as the future unseen data comes from this distribution (a combination of values), the model will predict the target accurately.
• The target value depends on gender and age group only.
• If the target depends on a third attribute, say body type, the number of training samples required to cover all the combinations increases.
• For two attributes, we needed eight training samples; for three attributes, we need 24 samples.
• As the number of attributes or dimensions increases, the number of training samples required to generalize a model also increases.
• In reality, the available training samples may not have observed targets for all combinations of the attributes.
• This is because some combinations occur more often than others.
• Due to this, the training samples available for building the model may not capture all possible combinations.
• This aspect, where the training samples do not capture all combinations, is referred to as 'data sparsity' or simply 'sparsity' in high-dimensional data.
• Data sparsity is one of the facets of the curse of dimensionality.
• Training a model with sparse data could lead to a high-variance or overfitting condition.
• As the number of features increases, the number of samples needed also increases.
• The more features we have, the more samples we will need to have all combinations of feature values well represented in our sample.
2. Distance Concentration
• Distance concentration refers to the problem of all the pairwise
distances between different samples/points in the space converging
to the same value as the dimensionality of the data increases.
• Several machine learning models such as clustering or nearest
neighbours’ methods use distance-based metrics to identify similar or
proximity of the samples.
• Due to distance concentration, the concept of proximity or similarity
of the samples may not be qualitatively relevant in higher
dimensions.
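The effect can be illustrated numerically. Assuming NumPy and SciPy, the sketch below draws random points in spaces of increasing dimension and shows the ratio of the largest to the smallest pairwise distance shrinking towards 1 (exact numbers vary with the random seed).

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((200, d))            # 200 random points in d dimensions
    dists = pdist(X)                    # all pairwise distances
    print(f"d={d:4d}  max/min distance ratio = {dists.max() / dists.min():.2f}")
```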
• As the number of features increases, the model becomes more complex.
• The more the number of features, the more the chances of overfitting.
• A machine learning model that is trained on a large number of features,
gets increasingly dependent on the data it was trained on and in turn
overfitted, resulting in poor performance
• Avoiding overfitting is a major motivation for performing dimensionality
reduction.
• The fewer features our training data has, the fewer assumptions our model makes and the simpler it will be.
Advantages of Performing Dimensionality Reduction
1. Less misleading data means model accuracy improves.
2. Fewer dimensions mean less computing; less data means that algorithms train faster.
3. Less data means less storage space required.
4. Fewer dimensions allow the use of algorithms unfit for a large number of dimensions.
5. It removes redundant features and noise.
Components for Dimensionality Reduction
• There are two components of dimensionality reduction:
1. Feature Selection
2. Feature Extraction
Feature Extraction
• Feature extraction creates a new, smaller set of features that captures
most of the useful information in the data
Feature Selection
• Feature selection is for filtering irrelevant or redundant features from
your dataset.
• Feature selection keeps a subset of the original features
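A short sketch contrasting the two components, assuming scikit-learn: SelectKBest with an ANOVA score keeps 10 of the original features (selection), while PCA builds 10 new ones (extraction). The dataset and the value of k are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_breast_cancer(return_X_y=True)        # 30 original features

X_selected = SelectKBest(f_classif, k=10).fit_transform(X, y)   # feature selection
X_extracted = PCA(n_components=10).fit_transform(X)             # feature extraction

print(X.shape, "->", X_selected.shape, "and", X_extracted.shape)
```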
Feature Selection
• Feature selection techniques are used for several reasons:
1. Simplification of models to make them easier to interpret
2. Shorter training times
3. To avoid the curse of dimensionality
4. Improve data's compatibility with a learning model class
Feature Selection
Feature selection techniques are often used in domains where
there are many features and comparatively few samples (or data
points).
Application of feature selection include the analysis of written
texts and DNA microarray data, where there are many thousands
of features, and a few tens to hundreds of samples.
Feature Selection
A feature selection algorithm can be seen as the combination of a search
technique for proposing new feature subsets, along with an evaluation measure
which scores the different feature subsets.
The simplest algorithm is to test each possible subset of features finding the one
which minimizes the error rate.
This is an exhaustive search of the space and is computationally intractable for all
but the smallest of feature sets.
Generic Steps in Feature Selection
Feature selection techniques consist of 4 major steps:
• Subset generation
• Subset evaluation
• Stopping criteria
• Validation
a. Subsets Generation
• Subset generation is considered as a search problem that aims to
select the best subsets of all possible feature subsets.
• The search strategies include:
o Evolutionary algorithms, e.g., GAs or MOEAs, greedy, best first
search with forward search selection (FSS) and backward search
selection (BSS)
b. Subset Evaluation
• The selected subset can be evaluated using statistical measures such as:
o The correlation between the feature and the target class, as in filter approaches
o A classifier's performance measure (e.g., accuracy) used to evaluate the subset, as in wrapper approaches
c. Stopping Criteria
• Determine a stopping criterion to stop the iterative process of selecting subsets.
• Stopping criteria might be based on a pre-defined maximum number of generations to run the algorithm, on the convergence of the algorithm, or on a solution being found.
d. Validation
• The validation step is usually performed after the feature selection process to assess the quality of the resulting feature subset.
Types of feature selection
• The three main categories of
feature selection algorithms:
1. Filters
2. Wrappers
3. Embedded methods
1. Filter Methods
• Filter methods use variable ranking techniques as the principal
criteria for variable selection by ordering.
• Ranking methods are used due to their simplicity and good success is
reported for practical applications.
• A suitable ranking criterion is used to score the variables and a
threshold is used to remove variables below the threshold.
• Ranking methods are filter methods since they are applied before
classification to filter out the less relevant variables
Property of a Unique feature
• A basic property of a unique feature is to contain useful information
about the different classes in the data.
• This property can be defined as feature relevance which provides a
measurement of the feature’s usefulness in discriminating the
different classes
The issue of relevancy of a feature
• Deciding the relevancy of a feature to the data or the output is the big question.
• Definition: "A feature can be regarded as relevant if it is conditionally dependent on the class labels."
• For a feature to be relevant, it can be independent of the input data but cannot be independent of the class labels; i.e., a feature that has no influence on the class labels can be discarded.
• Inter-feature correlation plays an important role in determining unique features.
Ranking methods in filter techniques
The two main ranking criteria are:
a) Correlation criteria
b) Mutual Information (MI)
a) Correlation Criteria:
• Correlation Criteria (CC), also known as Dependence Measure (DM), is based on the relevance (predictive power) of each feature.
• The predictive power is computed by finding the correlation between the independent feature x and the target (label) vector t.
• The feature with the highest correlation value will have the highest predictive power and hence will be most useful.
• The correlation criterion for the ith feature can be written as
  R(i) = cov(x_i, Y) / sqrt( var(x_i) * var(Y) )
  where x_i is the ith variable, Y is the output (class labels), cov() is the covariance and var() the variance.
• Correlation ranking can only detect linear dependencies between a variable and the target.
• The features are then ranked according to some correlation-based heuristic evaluation function.
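A minimal sketch of this ranking, assuming NumPy and scikit-learn's breast-cancer dataset (both illustrative choices): each feature is scored by its absolute correlation with the labels and the features are ordered by that score.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

def correlation_score(x, y):
    # |cov(x, Y)| / sqrt(var(x) * var(Y)), i.e. the absolute Pearson correlation
    return abs(np.corrcoef(x, y)[0, 1])

scores = np.array([correlation_score(X[:, i], y) for i in range(X.shape[1])])
ranking = np.argsort(scores)[::-1]          # features ordered by predictive power
print("Top 5 features by correlation criterion:", ranking[:5])
```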
b. Mutual Information:
• Mutual Information (MI), also known as Information Gain (IG), is the
measure of dependence or shared information between two random
variables.
• It is also described as Information Theoretic Ranking Criteria (ITRC).
• The MI can be described using the concept given by Shannon’s
definition of entropy:
Mutual Information:
• In feature selection, we need to maximize the mutual information (i.e., relevance) between the feature and the target variable.
• The mutual information (MI) is the relative entropy between the joint distribution and the product of the marginal distributions:
  I(x_j; t) = sum over x_j and t of p(x_j, t) * log( p(x_j, t) / ( p(x_j) * p(t) ) )
• where p(x_j, t) is the joint probability density function of feature x_j and target t, and p(x_j) and p(t) are the marginal density functions.
• The MI is zero if X and Y are independent, and greater than zero if they are dependent.
Chi-square Test
• The Chi-square test is used for categorical features in a dataset.
• Chi-square values are calculated between each feature and the target, and the desired number of features with the best Chi-square scores is selected.
• Conditions for Chi-square testing: the variables must be
o Categorical
o Sampled independently
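A hedged example of chi-square feature selection with scikit-learn; the digits dataset is used only because its pixel features are non-negative, as chi2 requires, and the choice of k = 20 is illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_digits(return_X_y=True)              # pixel features are non-negative
selector = SelectKBest(chi2, k=20).fit(X, y)     # keep 20 features with the best chi2 scores
X_new = selector.transform(X)
print(X.shape, "->", X_new.shape)
```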
Minimal Redundancy Maximal Relevance Method
• The minimum Redundancy Maximum Relevance (mRMR) method was developed for feature selection on microarray data.
• It tends to select a subset of features having the most correlation with the class (relevance) and the least correlation between themselves (redundancy).
2. Wrappers
• Instead of finding a relevant feature subset by a separate independent process, the wrapper method has its own machine learning algorithm (classifier) employed as part of the FS process.
• Wrappers use a search algorithm to search through the space of possible features and evaluate each subset by running a model on the subset.
• The process iterates a number of times until the best feature subset is found.
• Wrapper methods can be classified into:
1. Sequential Selection Algorithms
2. Heuristic Search Algorithms
1. Sequential selection algorithms
• These algorithms are called sequential due to the iterative nature of the algorithms.
• The Sequential Forward Selection (SFS) algorithm:
o SFS starts with an empty set and, in the first step, adds the one feature which gives the highest value for the objective function.
o From the second step onwards, the remaining features are added individually to the current subset and the new subset is evaluated.
o A feature is permanently included in the subset if it gives the maximum classification accuracy.
o The process is repeated until the required number of features is added.
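A minimal sketch of SFS written directly as a greedy loop; the dataset, classifier, and the stopping criterion of five features are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_wine(return_X_y=True)
clf = LogisticRegression(max_iter=5000)

selected, remaining = [], list(range(X.shape[1]))
for _ in range(5):                                   # stopping criterion: 5 features
    # evaluate each remaining feature added to the current subset
    scores = [cross_val_score(clf, X[:, selected + [j]], y, cv=5).mean()
              for j in remaining]
    best = remaining[int(np.argmax(scores))]         # feature giving the best objective value
    selected.append(best)                            # permanently include it
    remaining.remove(best)

print("Selected feature indices:", selected)
```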
Bidirectional Search (BDS)
• BDS applies SFS and SBS simultaneously:
o SFS is performed from the empty set.
o SBS is performed from the full set.
• To guarantee that SFS and SBS converge to the same solution:
o Features already selected by SFS are not removed by SBS.
o Features already removed by SBS are not added by SFS.
Limitations of SFS and SBS
• The main limitation of SFS is that it is unable to remove features that become non-useful after the addition of other features.
• The main limitation of SBS is its inability to re-evaluate the usefulness of a feature after it has been discarded.
2. Heuristic and Metaheuristic Search Algorithms
• A heuristic is a solving method for a specific problem (it can benefit from the properties of the problem being solved).
• A metaheuristic is a generalized solving method.

EVOLUTIONARY COMPUTATION (EC)
• EC covers a number of methods designed to simulate natural evolution.
• These methods are population based and rely on a combination of random variation and selection to solve problems.
• There are many population-based techniques inspired by natural phenomena.
• Some are bio-inspired and work in the form of swarms, such as PSO, ant colony optimization, and the whale optimization algorithm.
• Others are based on the laws of physics or chemistry.
DESIRABLE CHARACTERISTICS OF A GOOD
OPTIMIZATION ALGORITHM
• Accurate
• Fast Convergence.
• Computationally effective.
• Balance exploration and exploitation.
o Exploration aims to explore the whole search space in order to find promising
solutions in undiscovered areas.
o Exploitation phase aims to improve the already discovered solutions by searching
their neighborhood
• Stable solution quality.
WHAT IS A SWARM?
➢ A loosely structured collection
of interacting agents.
Agents:
▪ Individuals that belong to a group
▪ They contribute to and benefit from the entire group
▪ They can recognize, communicate, and/or interact
with each other
SWARM INTELLIGENCE (SI)
• Swarm Intelligence (SI) algorithms are mostly metaheuristics, which search for the (near-)optimal solution to a specific problem by employing a set of search agents to explore the search space
• The key characteristic of SI algorithms is that they try to mimic the natural behavior of creatures that live in groups, like flocks of birds, ants, bees and others
• PSO, ACO, GWO, HHO, SSA, and FA are all examples of SI algorithms that mimic the social behavior of some creatures
PARTICLE SWARM OPTIMIZATION ALGORITHM
• The algorithm is inspired by the problem-solving capabilities and behaviour of social animals.
• PSO works on the idea of a collection of particles called a "population".
• Each particle interacts with its neighbours and updates its velocity according to its own previous best position and the best position of the entire population.
• As the particles iterate through the search space, their collectively intelligent movements bring them close to an optimal solution.
Basic Terminologies
• The search space is the set of all the possible solutions of the optimization
problem and the goal of the PSO is to find the best solution.
• Each particle
o Is characterised by its position(x) and velocity(v)
o Represents a candidate solution of the optimization problem to be
solved.
Basic Terminologies
• Consider a particle X having a velocity.
• Each particle has a memory of its own best experience or position, denoted by Pbest. This is the personal best value of the particle found during the search.
• There is also a common best experience among the members of the swarm, denoted by Gbest.
• Imagine a team member, Bob, searching an area. After each move, his new location might be better or worse than before.
• If the new position is better than Bob's best position so far, he needs to record it.
• Similarly, if this position is better than the team's best location, Bob needs to inform the other members and update the record.
• After two days of walking, a search path through the area emerges.

So what makes PSO a stochastic method?
• The team best and the personal best of all members are updated every day.
• The area the team searches next depends on the personal bests, the team best, and a random component.

So how does this stochastic method guarantee a good solution?
• Generally, as the team maintains the best location found in the region and searches around it, the possibility of finding a better solution is high.
PROBLEMS WITH STANDARD PSO (SPSO)
• Problems associated with the SPSO algorithm are:
• Premature Convergence:
Premature convergence in evolutionary algorithm means convergence of
algorithm before global optimum solution is reached.
• Slow Convergence:
Slow exploitation resulting in slower convergence and in poor-quality
solutions.
Inertia Weight
• Inertia weight is an important parameter in PSO, which significantly affects the
convergence and exploration- exploitation trade-off in PSO process.
• A large inertia weight facilitates a global search while a small inertia weight
facilitates a local search.
• By linearly decreasing the inertia weight from a relatively large value to a small
value as the iteration increases, the PSO tends to have more global search ability
at the beginning of the run while having more local search ability near the end of
the iterations.
Selection of inertia weight:
• The inertia weight "w" is a very significant parameter for the convergence behaviour of the PSO.
• It controls the impact of the previous velocity on the current update.
• It is a trade-off between the global and local search abilities of the swarm.
• Larger values facilitate global exploration.
• Smaller values facilitate local exploitation.
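A compact PSO sketch with a linearly decreasing inertia weight, minimising the sphere function f(x) = sum(x^2); the swarm size, c1, c2, and the 0.9 to 0.4 inertia range are illustrative assumptions, not values prescribed by the slides.

```python
import numpy as np

def sphere(x):
    return np.sum(x ** 2, axis=1)

rng = np.random.default_rng(0)
n_particles, dim, iters = 30, 5, 100
c1 = c2 = 2.0                                    # cognitive / social coefficients
w_max, w_min = 0.9, 0.4                          # inertia weight range

x = rng.uniform(-5, 5, (n_particles, dim))       # positions
v = np.zeros_like(x)                             # velocities
pbest, pbest_val = x.copy(), sphere(x)           # personal bests
gbest = pbest[np.argmin(pbest_val)]              # global (team) best

for t in range(iters):
    w = w_max - (w_max - w_min) * t / iters      # linearly decreasing inertia
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    x = x + v
    val = sphere(x)
    improved = val < pbest_val                   # update personal bests
    pbest[improved], pbest_val[improved] = x[improved], val[improved]
    gbest = pbest[np.argmin(pbest_val)]          # update the swarm's best

print("Best value found:", pbest_val.min())
```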
GRAVITATIONAL SEARCH ALGORITHM
• GSA is a metaheuristic inspired by Newton’s gravitational and motion
laws.
• In GSA, search agents have the role of interacting physical objects,
and the performance of the solutions is seen as the mass of the
objects.
• The particles can attract other agents based on the gravitational
force.
• This force pushes lighter objects towards the heavier objects.
• The heavier objects, which are considered as better solutions, will
move slower than the lighter objects.
• This behavior ensures the exploitation phase in GSA
• GSA is a population-based technique in which each agent has a different mass.
• Heavy masses, corresponding to good solutions, usually move more slowly than lighter masses.
> Heavy masses do exploitation
> Lighter masses do exploration
• The performance of agents in GSA is measured by their masses.
• A force of gravity causes the agents to attract each other
> This force is responsible for the global movement of agents towards the agents with heavier masses.
PROBLEMS OF GSA
• Problems associated with the standard GSA are:
> Slow convergence:
• Due to the unbalanced exploration and exploitation, the particles get trapped in local minima.
> Exploitation issue:
• Agents' masses get heavier and heavier during the optimization process because of the cumulative effect of the fitness function on the masses. This may prevent masses from rapidly exploiting the optimum, which may result in weak exploitation.
GRAVITATIONAL SEARCH ALGORITHM
• Suppose a system with N objects; the position of the ith object is defined as:
  X_i = (x_i^1, x_i^2, ..., x_i^d, ..., x_i^n)   for i = 1, 2, ..., N
  where x_i^d is the position of the ith object in the dth dimension and n is the dimension of the search space.
• At time t, the force acting on mass i from mass j in dimension d is:
  F_ij^d(t) = G(t) * ( M_pi(t) * M_aj(t) ) / ( R_ij(t) + epsilon ) * ( x_j^d(t) - x_i^d(t) )
• In order to emphasise exploration in the first iterations and exploitation in the final iterations, G is given an adaptive value that decreases over the iterations:
  G(t) = G_0 * exp( -alpha * t / T )
Mathematical Model of the GRAVITATIONAL SEARCH ALGORITHM
• The acceleration of object i at time t, in dimension d, follows from the law of motion:
  a_i^d(t) = F_i^d(t) / M_ii(t)
• The velocity of a mass depends on its present velocity as well as its acceleration. The next velocity and position of the agent are measured as:
  v_i^d(t+1) = rand_i * v_i^d(t) + a_i^d(t)
  x_i^d(t+1) = x_i^d(t) + v_i^d(t+1)
• Agents' masses are defined by their fitness evaluation; the agent with the heaviest mass is the fittest agent. According to the equations above, the heaviest agent has the highest attractive force and the slowest movement:
  M_ai = M_pi = M_ii = M_i,   i = 1, 2, ..., N
  m_i(t) = ( fit_i(t) - worst(t) ) / ( best(t) - worst(t) )
  M_i(t) = m_i(t) / sum_{j=1..N} m_j(t)
  where fit_i(t) is the fitness value of object i at time t, and (for a minimization problem)
  best(t) = min_{j in {1,...,N}} fit_j(t),   worst(t) = max_{j in {1,...,N}} fit_j(t)
ANT COLONY OPTIMIZATION
• ACO was first developed by Marco Dorigo in 1992
• The first version of ACO was known as Ant System
• Ant Colony Optimization (ACO) is one of the well-known swarm intelligence techniques
• Despite the similarity of this algorithm to other swarm intelligence techniques in terms of employing a set of solutions and its stochastic nature, the inspiration of this algorithm is unique
INSPIRATION
• The main inspiration of the ACO algorithm is the concept of stigmergy in nature.
• Stigmergy refers to the manipulation of the environment by biological organisms in order to communicate with each other.
• What makes this type of communication unique is the fact that individuals communicate indirectly.
• The communication is local as well, meaning that individuals should be in the vicinity of the manipulated area to access it.
EXAMPLE OF STIGMERGY
• Stigmergy also exists in humans.
• Examples are websites like Wikipedia and Reddit.
• Millions of articles are contributed by people across the globe.
• You can search for any article without knowing the contributor of the article.
• On these websites there is no centralized system for processing, editing, or approving the articles; everything is done by people.
Common Example of Stigmergy
• A common example of stigmergy can be found in the behaviour of ants in an ant colony when finding food.
• Finding food is an optimization task where organisms try to obtain the maximum amount of food while consuming the minimum amount of energy.
• In an ant colony this is done by finding the shortest path from the nest to any food source.
• In nature, ants solve this problem using a very simple algorithm, which is the main inspiration of the Ant Colony Optimization algorithm.
• Imagine a small village in the middle of the desert, consisting of several families.
• Every day the families travel several kilometres to the river to collect water.
• There are two routes to bring water, but no one knows which one is better.
• People choose randomly between the two routes.
• One day a young man and a young woman decide to solve this problem, and come up with the idea of identifying the shorter path:
• Mark the path with water, and follow the path which is more wet.
• The longer path takes longer, so the water evaporates before anybody wets it again.
• They decide to bring water several times a day to solve the problem, each carrying two buckets.
• The girl reaches the river while the boy is still halfway.
• The girl does not know whether the boy has already reached the river or gone back to the village; she just checks the path and marks it with water.
• The girl goes back to the village after marking her path with water.
• By the time she is back home, the boy reaches the river and fills his buckets.
• The boy now has an answer as to which route to follow.
• During the first iteration the boy and the girl manage to choose the shortest path, and that path gets more water and becomes more wet.
• They decide to bring more water on the same day.
• So they start their journey again and choose the shortest path, as no evaporation has occurred at this point in time.
• They reach the river and fill their buckets again.
• They wet the path again while coming back to the village.
• The path is now even more wet, indicating that it is the shortest path.
• Now assume they choose the wrong path, or evaporation occurs.
• Since the path followed by the girl is shorter and she makes more rounds from the village than the boy, her path is more wet than the boy's path.
• If they repeat their trips again and again, the shortest path is always more wet than the longer path.
• In an ant colony, ants constantly look for food sources around the nest, in random directions.
• It has been shown that once an ant finds a food source, it marks the path with pheromone.
• The amount of pheromone highly depends on the quality and quantity of the food source.
• The more and the better the source of food, the stronger and more concentrated the pheromone deposited.
• When other ants perceive the presence of the pheromone, they also follow the pheromone trail to reach the food source.
• After getting a portion of the food, ants carry it to the nest and mark their own path back to the nest.
• Consider three routes from a nest to a food source:
• The amount of pheromone deposited on a route is highest on the closest path.
• The amount of pheromone decreases in proportion to the length of the path.
• While ants add pheromone to the paths towards the food source, evaporation occurs.
• The period of time during which an ant tops up the pheromone before it evaporates is inversely proportional to the length of the path.
• This means that the pheromone on the closest route becomes more concentrated as more ants get attracted to the strongest pheromone.
• The ants thus make their decisions based on probability.
Mathematical Model
• Pheromone (deposit and evaporation): how do we model it mathematically, and how do we simulate evaporation?
• Decision making process: how does an ant choose its next move?
• A cost matrix defines the length of each path.
• Another matrix represents the amount of pheromone on each edge of the graph; this matrix allows us to store the amount of pheromone on each edge.
• So the question is: how do we add pheromone to these paths?
• Some methods add the same amount of pheromone regardless of the path found by the ant.
• Other methods add pheromone depending on the quality of the path found by the ant; there are some species of ants that add more pheromone depending on the quality of the food.
Mathematical Pheromone Level on the Graph
• The amount of pheromone deposited by the kth ant on edge (i, j) is defined as:
  delta_tau_ij^k = 1 / L_k   if ant k travels on edge (i, j), and 0 otherwise
  where L_k is the length (cost value) of the path that the kth ant travels.
• For multiple ants, with m the number of ants:
  delta_tau_ij = sum_{k=1..m} delta_tau_ij^k
• With evaporation, the pheromone on edge (i, j) is updated as:
  tau_ij <- (1 - rho) * tau_ij + delta_tau_ij
  where rho is the rate of pheromone evaporation; the greater the value of rho, the higher the evaporation rate.
• Without evaporation, the deposit is simply 1/L_k, where L_k is the sum of the lengths of the edges the ant travelled, e.g. path length = 4 + 8 + 15 + 4 = 31.
• With evaporation and an initial pheromone level of 1, a single update of the pheromone looks like (1 - 0.5) * 1 + 1/14.
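A small sketch of this pheromone update, assuming NumPy; the tours, lengths, evaporation rate, and initial pheromone level of 1 are made-up illustrative values.

```python
import numpy as np

n = 4                                   # number of nodes
tau = np.ones((n, n))                   # initial pheromone level of 1 on every edge
rho = 0.5                               # evaporation rate

# (tour, tour length) for each ant -- illustrative values only
ants = [([0, 1, 2, 3], 14.0),
        ([0, 2, 1, 3], 20.0)]

delta = np.zeros_like(tau)
for tour, length in ants:
    for i, j in zip(tour, tour[1:]):
        delta[i, j] += 1.0 / length     # shorter (better) paths receive more pheromone

tau = (1 - rho) * tau + delta           # evaporation + deposit, e.g. (1 - 0.5) * 1 + 1/14
print(tau)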
Next Step: Calculating the Probability
• The probability that an ant moves from node i to node j is proportional to tau_ij^alpha * eta_ij^beta, normalised over all feasible destinations (for simplicity, alpha = beta = 1 in the example).
• In the example, the probabilities for the two destinations (tree and car) are first computed with an identical initial pheromone level of 1 on all paths, and then with the initial pheromone level changed to 5.
How to Choose Destinations Using These Probabilities
• ROULETTE WHEEL: each candidate destination gets a slice of the wheel proportional to its probability, and a random spin selects the next node.
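A minimal roulette-wheel sketch, assuming NumPy; the pheromone and heuristic values are illustrative, and alpha = beta = 1 matches the simplified example above.

```python
import numpy as np

rng = np.random.default_rng(0)
tau = np.array([1.0, 1.0, 1.0])          # pheromone on the candidate edges
eta = np.array([0.5, 0.25, 1.0])         # heuristic desirability (e.g. 1/length)
alpha = beta = 1.0

weights = tau ** alpha * eta ** beta
p = weights / weights.sum()              # selection probabilities
destination = rng.choice(len(p), p=p)    # spin the roulette wheel
print("Probabilities:", p, "-> chosen destination:", destination)
```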
Genetic Algorithm
Video On Genetic Algorithm By Whentotrade.Com
Introduction
• The Genetic Algorithm (GA) is one of the most well-regarded evolutionary algorithms
• The GA is one of the first population-based stochastic algorithms proposed in history
• This algorithm mimics the Darwinian theory of survival of the fittest in nature
• Similar to other EAs, the main operators of GA are selection, crossover, and mutation
Inspiration
• The GA algorithm is inspired by the theory of biological evolution proposed by Darwin .
• In fact, the main mechanism simulated in GA is the survival of the fittest.
•
In nature, fitter organisms have a higher probability of survival. This assists them to transfer their
genes to the next generation.
• Over time, good genes that allow species to better adapt with environment (avoid enemy and
find food) become dominant in subsequent generations.
Natural Selection - Survival of the Fittest
• Darwin's theory was proposed by Charles Darwin; one of the key mechanisms in this theory is natural selection, or survival of the fittest
• Natural selection refers to variations in the genotype of organisms that increase the chance of survival
• In natural selection, those variations in the genotype that increase the organism's chance of survival are preserved and multiplied from generation to generation
• For example, sharp teeth are good for a shark, so nature will preserve those sharp teeth generation by generation, and future generations will also benefit from them, for example to catch prey.
Survival of the Fittest
• Consider an ecosystem with a large number of organisms, of which some are on the land and others in water.
• The main objective of these organisms is survival.
• They interact in the form of collaboration or competition to increase their chance of survival.
• For example, two species like tigers and wolves compete on land to catch prey.
• Similarly, swarms of fish react to avoid predators or catch prey.
• NOTE: what matters is how good an organism is at avoiding predators or catching prey, so that it can survive.
Basic Terminologies
1. Population: This is a subset of all the probable solutions that can solve the given
problem.
2. Chromosomes: A chromosome is one of the solutions in the population.
3. Gene: This is an element in a chromosome.
4. Allele: This is the value given to a gene in a specific chromosome.
5. Fitness function: This is a function that uses a specific input to produce an
improved output. The solution is used as the input while the output is in the
form of solution suitability.
6. Genetic operators: In genetic algorithms, the best individuals mate to reproduce offspring that are better than the parents. Genetic operators are used for changing the genetic composition of this next generation.
Basic Terminologies
• For instance, for a problem with 10 variables, GA uses chromosomes with
10 genes.
• The GA algorithm uses three main operators to improve the chromosomes
in each generation: selection, crossover (recombination), and mutation.
Initial Population
• The GA algorithm starts with a random population.
• This population can be generated from a Gaussian random distribution to
increase the diversity.
• This population includes multiple solutions, which represent chromosomes of
individuals.
• Each chromosome has a set of variables, which simulates the genes.
• The main objective in the initialisation step is to spread the solutions around the
search space as uniformly as possible to increase the diversity of population and
have a better chance of finding promising regions
Selection
• Natural selection is the main inspiration of this component for the GA algorithm.
• In nature, the fittest individuals have a higher chance of getting food and mating.
• This causes their genes to contribute more to the production of the next
generation of the same species.
• Inspiring from this simple idea, the GA algorithm employs a roulette wheel to
assign probabilities to individuals and select them for creating the next
generation proportional to their fitness (objective) values.
Selection
• For example, consider one shark, two catfish and a bunch of small fish; regardless of whether they are carnivorous or herbivorous, they compete for food.
• Depending on how fit they are, they compete for resources and increase their chance of survival.
• So if one catfish is eaten by the shark, maybe it could not swim fast enough, i.e., that catfish is an example of an organism which is not fit.
Selection
• The catfish that survives is different from the one which has been eaten by the shark.
• This may be because this catfish has several fins that help it to swim faster and escape from its predator.
• So this feature is a good feature that helps the catfish to survive and swim faster.
• Another feature is the moustache that helps it to sense the environment.
• If an organism is not fit enough, it will die, either by starvation or by its predator.
• The key point is that the genes, or good characteristics, of the fit organism are preserved and transferred to the next generation.
Selection Operators
Some of other selection operators are:
1. Boltzmann selection
2. Tournament selection
3. Rank selection
4. Steady state selection
5. Truncation selection
6. Local selection
7. Fuzzy selection
8. Fitness uniform selection
9. Proportional selection
10. Linear rank selection
11. Steady-state reproduction
Crossover (Recombination)
• After selecting the individuals with a selection operator, they are used
to create the new generation.
• In nature, the genes in the chromosomes of a male and a female
are combined to produce a new chromosome.
• This is simulated in the GA algorithm by combining two solutions (parent
solutions) selected by the roulette wheel to produce two new solutions
(child solutions).
• There are different crossover methods in the literature. The
simplest ones divide the chromosomes into two pieces (single-point) or three
pieces (double-point) and then exchange genes between the two
chromosomes.
• In the single-point crossover, the genes of the two parent solutions are
swapped on either side of a single cut point.
• In the double-point crossover, however, there are two crossover points and only
the genes between the points are swapped.
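A minimal sketch of the single-point variant, assuming chromosomes stored as NumPy
arrays (the function name is illustrative, not from the slides):

```python
import numpy as np

def single_point_crossover(parent1, parent2, rng=np.random.default_rng()):
    """Swap the tails of two chromosomes after a randomly chosen cut point."""
    point = rng.integers(1, len(parent1))              # cut somewhere inside the chromosome
    child1 = np.concatenate([parent1[:point], parent2[point:]])
    child2 = np.concatenate([parent2[:point], parent1[point:]])
    return child1, child2
```

A double-point crossover would draw two cut points and swap only the middle segment.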
Crossover
• The main objective of crossover is to make sure that the genes are exchanged,
and the children inherit the genes from the parents.
• Crossover is the main mechanism of exploitation in GA
• There is a parameter in GA called Probability of Crossover (Pc) which indicates
the probability of accepting a new child.
• This parameter is a number in the interval of [0,1].
• A random number in the same interval is generated for each child.
• If this random number is less than Pc, the child is propagated to the subsequent
generation. Otherwise, the parent will be propagated.
• This happens in nature as well: not all of the offspring survive.
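The acceptance rule for Pc can be sketched as follows (an illustrative helper, not from
the slides):

```python
import numpy as np

def accept_child(child, parent, pc, rng=np.random.default_rng()):
    """Propagate the child with probability Pc; otherwise the parent survives instead."""
    return child if rng.random() < pc else parent
```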
Crossover Operators
1. Uniform crossover
2. Half uniform crossover
3. Three parents crossover
4. Partially matched crossover
5. Cycle crossover
6. Order crossover
7. Position-based crossover
8. Heuristic crossover
9. Masked crossover
10. Multi-point crossover
Recombination
• This is a kind of food chain: the
shark eats the catfish and the catfish
eats the small fish.
• Catfish require fins to swim
fast and avoid sharks, and they
require moustaches to sense the
environment and catch prey.
• Over time, less fit catfish
with fewer fins and moustaches
die, either from starvation or
by being eaten by sharks.
• So the fittest catfish survive
and take part in producing
the next generation.
The figure illustrates this process: by preserving good
features generation after generation, natural selection
ensures that the fittest catfish survive.
• The orange catfish has two fins and two
moustaches and the green catfish
has one fin and two moustaches;
they are of different sizes and
located at different positions. This
corresponds to the genes of an
organism.
• The characteristics of each catfish are
stored in a table with one row and
multiple columns.
• The set of genes that defines how the
catfish looks is called a chromosome.
• When mating, the chromosomes
of both parents are combined.
• That means they exchange genes to
produce a child.
The yellow and purple colours show the recombination
process, in which the children inherit better features
than their parents.
Problem with Crossover
• The crossover operator exchanges genes between chromosomes.
• The issue with this mechanism is that it does not introduce new genes.
• If all the solutions become poor (i.e., become trapped in locally optimal solutions),
crossover cannot produce different solutions with new genes differing
from those in the parents.
Problem with Crossover
• Through natural selection and recombination, organisms keep exchanging
the genes in their chromosomes.
• But what does a catfish do if there is a change in the ecosystem?
• For example, if fitter predators or prey appear in the ecosystem, the
catfish must develop new genes or features; merely exchanging or combining
the good features already present in the current population of catfish is
not enough.
• The problem with crossover is that it does not add new features, and neither
does natural selection.
• According to the theory of evolution in biology, chromosomes may undergo random
changes during the process of recombination. This is called mutation.
• Mutation refers to these random changes of the genes during the process of
recombination.
Mutation
• Mutation causes random changes in the genes.
• There is a parameter called Probability of Mutation (Pm) that is used for every
gene in a child chromosome produced in the crossover stage.
• This parameter is a number in the interval [0,1].
• A random number in the same interval is generated for each gene in the new
child.
• If this random number is less than Pm, the gene is replaced with a random
value between its lower and upper bounds.
• The mutation operator maintains the
diversity of the population by introducing
another level of randomness.
• This operator prevents solutions from
becoming similar and increases the
probability of avoiding local optima
in the GA algorithm.
The mutation operator alters one or multiple genes in the
child solutions after the crossover phase.
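A minimal sketch of this gene-wise mutation rule, assuming real-valued chromosomes and
illustrative bound arrays `lb` and `ub`:

```python
import numpy as np

def mutate(child, pm, lb, ub, rng=np.random.default_rng()):
    """With probability Pm per gene, replace the gene by a random value within its bounds."""
    child = np.asarray(child, dtype=float).copy()
    lb, ub = np.asarray(lb, dtype=float), np.asarray(ub, dtype=float)
    mask = rng.random(child.shape) < pm                 # genes selected for mutation
    child[mask] = lb[mask] + (ub[mask] - lb[mask]) * rng.random(mask.sum())
    return child
```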
• In this analogy, the mutation is adding a third eye to
one of the child catfish in the second generation, or a
very big moustache in the third generation.
• This picture shows how mutation changes individuals
in future generations; if the change is good, the
mutation results in a better gene.
• Natural selection and recombination will then protect
these genes and transfer them to the next generations.
• What if the mutation creates a negative feature that
results in a less fit organism? Then natural selection
will get rid of it.
• For example, if after many mutations a catfish has
small fins, it will die in the ecosystem.
• So this negative feature will not be transferred to the
next generations.
• What are the advantages of
mutation?
• With crossover and the constant
competition between organisms,
certain features disappear as
new generations develop.
• The mutation operator allows some
of the lost features to be recovered,
or new ones to be added, and so
maintains the diversity of the
generation.
Impact of mutation on the
chromosomes: the light blue boxes are
the areas affected by the mutation.
The children are similar but have developed
new features.
Elitism
Crossover and mutation change the genes in the chromosomes.
Depending on the Probability of Mutation, there is a chance that all the
parents are replaced by the children.
This might lead to the problem of losing good solutions from the current
generation.
To fix this, another operator is used, called elitism.
How does it work?
• A portion of the best chromosomes in the current population is
maintained and propagated to the subsequent generation without any
changes.
• This prevents those solutions from being damaged by the crossover and
mutation during the process of creating the new populations.
• The list of elites is updated by simply ranking the individuals based on
their fitness values.
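One simple way to implement this, as a hedged sketch (names are illustrative and a
maximisation problem is assumed):

```python
import numpy as np

def apply_elitism(population, fitness, offspring, n_elites):
    """Copy the n_elites fittest chromosomes unchanged into the next generation."""
    elite_idx = np.argsort(fitness)[-n_elites:]     # indices of the best individuals
    next_gen = offspring.copy()
    next_gen[:n_elites] = population[elite_idx]     # elites replace some of the offspring
    return next_gen
```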
Applications of Genetic Algorithm
• Genetic algorithms are used in the traveling salesman problem to establish an
efficient plan that reduces the time and cost of travel.
• They are also applied in other fields such as economics, general optimization
problems, aircraft design, DNA analysis, and many others.
Hybrid Gravitational Search Particle Swarm
Optimization Algorithm
SPSO
• Direction is calculated using only the two best positions, pbest_i and gbest.
• Updating is performed without considering the quality of the solutions.
• PSO uses a kind of memory for updating the velocity.
• Updating is performed without considering the distance between solutions.
• PSO simulates the social behaviour of birds.
SGSA
• Agent direction is calculated based on the overall force obtained from all the agents.
• The force is proportional to the fitness value, so the agents see the search space
around themselves under the influence of force.
• GSA is memory-less; only the current position of the agents plays a role in the
updating procedure.
• The force is inversely proportional to the distance between solutions.
• GSA is inspired by the laws of physics.
Hybrid Gravitational Search Particle Swarm Optimization
Algorithm
• The Particle Swarm Optimization (PSO) algorithm is a member of the swarm
intelligence family and is widely used for solving nonlinear optimization
problems.
• It tends to suffer from premature stagnation: it gets trapped in local minima
and loses exploration capability as the iterations progress.
• In contrast, the Gravitational Search Algorithm (GSA) is proficient at searching
for the global optimum; however, its drawback is its slow search speed in
the final phase.
• The key concept behind the proposed method is to merge the exploration
ability of GSA with the social-thinking capability (gbest) of PSO.
PROBLEMS OF GSA
• Problems associated with the standard GSA are:
> Slow convergence:
• Due to the unbalanced exploration and exploitation, the particles get trapped in local minima.
> Exploitation issue:
• The agents' masses get heavier and heavier during the optimization process because of the cumulative effect of the
fitness function on the masses. This may prevent the masses from rapidly exploiting the optimum, which results in weak
exploitation.
• The basic idea is to save and use the location of the best mass to speed up the
exploitation phase. The figure below shows the effect of using the best solution to
accelerate the movement of agents towards the global optimum.
• As shown in this figure, the gbest element applies an additional velocity component
towards the last known location of the best mass.
• In this way, the external gbest "force" helps to prevent masses from stagnating in a
suboptimal situation.
Mathematical Model
There are two benefits of this method:
a) Accelerating the movement of particles towards the location of the best
mass, which may help them to surpass it and be the best mass in the next
iteration
b) Saving the best solution attained so far
• Adaptively decrease 𝒄𝟏 and increase 𝒄𝟐 so that the masses tend to
accelerate towards the best solution as the algorithm reaches the
exploitation phase.
• Since there is no clear border between the exploration and exploitation
phases in evolutionary algorithms, the adaptive method is the best option
for allowing a gradual transition between these two phases
V_i^d(t+1) = w · v_i^d(t) · rand + c1 · a_i^d(t) + w · (c1/c2) · (p_g − x_i)
X_i^d(t+1) = w · X_i^d(t) + v_i^d(t+1)
w_i = 1 − P_g / P_i
C1 = C3 − C4 · (1 − t/T) + C4
C2 = C5 − C6 · (1 − t/T) + C6
Flow chart of HGSPSO
Generate the initial population → evaluate the fitness of all agents → evaluate G and
gbest for the whole population → calculate M, forces, and acceleration for all agents →
update velocity and position → if the end criterion is not met, repeat from the fitness
evaluation; otherwise return the best solution.
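The following is only a rough Python sketch of one HGSPSO update step, based on the
equations as reconstructed above; the names (`accel`, `gbest`, `p_g`, `p_i`, `C3`-`C6`,
`T`) are assumptions drawn from the slide's notation, and the GSA side (masses, forces,
acceleration) is not shown.

```python
import numpy as np

rng = np.random.default_rng()

def hgspso_update(x, v, accel, gbest, p_g, p_i, t, T, C3, C4, C5, C6):
    """One velocity/position update mixing the GSA acceleration with PSO's gbest term."""
    # Adaptive coefficients: a gradual transition between exploration and
    # exploitation as the iteration counter t approaches the maximum T.
    c1 = C3 - C4 * (1 - t / T) + C4
    c2 = C5 - C6 * (1 - t / T) + C6
    # Inertia weight built from the fitness of the global best (p_g) and of this
    # agent (p_i); this reading of w_i = 1 - P_g / P_i is an assumption.
    w = 1 - p_g / p_i
    # Velocity: GSA acceleration term plus a gbest-attraction term weighted by c1/c2.
    v = w * v * rng.random(v.shape) + c1 * accel + w * (c1 / c2) * (gbest - x)
    x = w * x + v
    return x, v
```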
CASE STUDY
Introduction
• The dataset is obtained from the UCI biomedical database
(https://archive.ics.uci.edu/ml/datasets.php).
• In data mining, feature subset selection is a data pre-processing phase that
is of enormous importance.
• In this paper, a K-Nearest Neighbour (KNN) classifier is combined with a
modified particle swarm optimization (MPSO) to select a minimum number of
features while obtaining good classification accuracy.
• The proposed method (MPSO) is applied to three UCI medical data sets
and is compared with other available feature selection approaches.
FOUR KEY STEPS IN FEATURE SELECTION
Modifications in PSO
ADVANCED PARTICLE SWARM OPTIMIZATION
ALGORITHM
➢ The SPSO algorithm is primarily based on two equations: the position and the
velocity of the particle.
a) Modification in the velocity update equation:
• The extra term added to the velocity equation of the PSO is used to adjust the
particles' positions through the iterations so that the velocity is increased and
the algorithm reaches the optimal solution faster.
• v_id = w · v_id + c1 · R1 · (p_id − x_id) + c2 · R2 · (p_gd − x_id) + w · (c1/c2) · (p_id − p_gd)
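As a hedged sketch, this modified velocity update could be written as below; the variable
names are illustrative, and `rng` draws the random factors R1 and R2:

```python
import numpy as np

rng = np.random.default_rng()

def mpso_velocity(v, x, pbest, gbest, w, c1, c2):
    """Standard PSO velocity plus the extra w*(c1/c2)*(pbest - gbest) term."""
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    return (w * v
            + c1 * r1 * (pbest - x)                # cognitive pull towards the particle's own best
            + c2 * r2 * (gbest - x)                # social pull towards the swarm's best
            + w * (c1 / c2) * (pbest - gbest))     # additional term of the modified PSO
```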
VELOCITY CLAMPING & PARTICLE PENALIZATION METHOD
➢ Velocity clamping keeps the particle velocity within the range [v_min, v_max].
The maximum and minimum velocities can be defined as
• v_max = lambda · (Max_ss − Min_ss)
• v_min = lambda · (Min_ss − Max_ss)
Conditions for the velocity:
• if v_i > v_max then v_i = v_max
• if v_i < v_min then v_i = v_min
➢ Penalization keeps the particle within the search domain when the sum of an agent's
position and velocity (i.e., the new position) lies outside the domain.
Condition for penalization:
• if v_i + x_i > Max_ss or v_i + x_i < Min_ss then v_i = 0
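A minimal sketch of both rules, assuming `max_ss` and `min_ss` are the search-space
bounds and `lam` is the clamping factor lambda (all names illustrative):

```python
import numpy as np

def clamp_and_penalize(v, x, max_ss, min_ss, lam):
    """Clamp velocities to [v_min, v_max] and zero them when the next position leaves the domain."""
    v_max = lam * (max_ss - min_ss)
    v_min = lam * (min_ss - max_ss)
    v = np.clip(v, v_min, v_max)                        # velocity clamping
    out_of_bounds = (v + x > max_ss) | (v + x < min_ss)
    return np.where(out_of_bounds, 0.0, v)              # penalization: reset offending velocities
```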
UCI DATA SET
UCI medical datasets are used for the feature selection process.
Results
• KNN is used for classification. The efficiency of the classifier
is evaluated in terms of its accuracy:
Accuracy = (true positives + true negatives) / total number of samples
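For illustration only, the evaluation of a selected feature subset could look like the
sketch below, using scikit-learn; the dataset loading, the MPSO search itself, and the
train/test split details are assumptions, not taken from the paper.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def knn_accuracy(X, y, selected_features, k=5):
    """Accuracy of a KNN classifier trained only on the selected feature columns."""
    X_sel = X[:, selected_features]                     # keep only the chosen features
    X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, test_size=0.3, random_state=0)
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # accuracy = (true positives + true negatives) / total predictions
    return accuracy_score(y_te, knn.predict(X_te))
```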