Pattern Recognition
By Prof. Dr Talha Ali Khan

Subject Information
• Subject coordinator: Prof. Dr Talha Ali Khan
• Email ID: talhaali.khan@ue-germany.de
• Teaching time and location: Potsdam Campus
• Consultation: by appointment via email only
• Intended audience: graduate students

Contents of the Course
1 Introduction to Pattern Recognition
2 Machine Learning and Pattern Recognition
3 Feature Selection, Extraction and Dimensionality Reduction
4 Statistics for Pattern Recognition
5 Programming for Pattern Recognition
6 Deep Learning
7 Optimization Algorithms and Pattern Recognition

General Assessment Overview
Item | Topics | Due | Marks
Assignment (Presentation/Report) | Any journal paper on a pattern recognition application | Week 14 | 20%
Quizzes | Average of four quizzes | - | 20%
Project (Report and Presentation) | Any PR-based application | Exam period | 60%

IEEE
https://www.ieee.org/
IEEE UE Student Branch
https://www.ieee.org/membership/join/more-visibility.html

How to Access Research Papers and Technical Reports
• There are several databases in which research papers can be searched; a few of them are listed below:
1. https://ieeexplore.ieee.org/Xplore/home.jsp
2. https://www.elsevier.com/de-de
3. https://scholar.google.com/
4. https://doaj.org/

How to download the papers?
• The link below will download most of the papers from the databases.
https://sci-hub.mksa.top/

Research Paper
Introduction / problem background / motivation: Describe the general scope of your project (e.g., automated speech recognition), and "zoom in" on the specific problem that you are addressing (e.g., pitch tracking). What is your motivation to study this problem, and what is the specific need or gap that your work addresses?
Goal / objectives: The goal is a brief statement that establishes a general, long-term direction of your work (e.g., to analyze the affective content of physiological signals). The objectives (there will likely be more than one) are quantifiable expectations of performance (e.g., to implement or compare certain models).
Literature review: Describe prior research efforts in your area. This is not meant to be a comprehensive survey of a scientific discipline, but a concise overview of the most significant results that are tightly related to your work.
Proposed solution / methods used: Describe the algorithm that the authors have developed, or the techniques that were used (i.e., the actual work). Keep this description at a high level: the objective of the presentation is to make the audience want to read the paper, not to scare them away with details!
Analysis of results: What are the specific results of your work? What do these results tell you? Are they in agreement with your expectations? Do these results suggest the existence of some phenomena that you were not aware of?
Conclusions: What are the main ideas (not more than three) that you want your audience to remember? This is the time to "zoom out" and discuss how your results support the long-term goal of the research.
Preparation and poise: Did you speak to the audience rather than look at the slides? Did you expand on what was on the slides rather than read them word-by-word? Did you speak at a reasonable pace rather than too fast or too slow? Did you appear to be spontaneous and fluid, avoiding the use of distracting mannerisms and colloquialisms?
Use of allotted time: Was there a good balance between inspirational material and technical content? Did you complete your presentation in time?
Did you have to skip some important material (e.g., conclusions) in order to complete your presentation in time? Use of visual aids: Did you use pictures/diagrams to explain your ideas? Did you have graphs of experimental results? Did the slides contain short, clear bullets rather than long sentences and/or cryptic equations? Response to questions: Did you address technical questions and comments well? Books • Textbook: 1. Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2007 • Reference Books 1. Frank Y. Shih, Image Processing and Pattern Recognition: Fundamentals and Techniques, Wiley 2. Richard O. Duda, Peter E. Hart, David G. Stork, Pattern Classification, 2nd Edition, Wiley Learning Objectives This course will equip you with the understanding of • Understand the fundamentals of pattern recognition • Different classification and clustering algorithms • Basics of Artificial Neural Networks • Applications of pattern recognition • Deep Learning • Optimization Algorithms Part 1 – Introduction to Pattern Recognition Human Perception • Humans have developed highly sophisticated skills for sensing their environment and taking actions according to what they observe For example: recognizing a face, understanding spoken words, reading handwriting, distinguishing fresh food from its smell • Concept is that “give similar capabilities to machines” Human and Machine Perception Often influenced by the knowledge of how patterns are modeled and recognized in nature when develop pattern recognition algorithms Research on machine perception also helps to gain deeper understanding and appreciation for pattern recognition systems in nature Many techniques are applied, that are purely numerical and do not have any correspondence in natural systems What is a Pattern? • A pattern is a regularity in the world, in human-made design, or in abstract ideas • The elements of a pattern repeat in a predictable manner • A geometric pattern is a kind of pattern formed of geometric shapes and typically repeated like a wallpaper design • Any of the senses may directly observe patterns • The opposite of “pattern” is complete “randomness”. What is a Pattern Recognition? • Pattern Recognition involves the design of systems which automatically or otherwise, recognizes patterns in acquired data. • Study of how machines can observe the environment • Learn to distinguish patterns of interest • Make sound and reasonable decisions about the categories of the patterns Pattern Recognition Tasks • Similar to the field of Digital Signal Processing, also Pattern Recognition is a discipline with applications (tasks) in virtually every field of science: • Computer science • Electrical engineering • Industrial engineering • Remote sensing (geology, archaeology, surveillance) • Civil engineering • Vehicle manufacturing, advanced driver assistance systems (ADAS), autonomous vehicles • Finance and investment • Business economics • Medicine (screening, diagnostics) • Biometrics (age, gender, iris, fingerprint, mood, handwriting, voice identity detection) • Image segmentation (pages, forms, or images with, e.g., street, houses, sky, cars, texts, …) • Automatic recognition of speech, (hand)writing (OCR), gesture, … • Intelligent sensing (smell, chemicals, …) Pattern Recognition in Daily Life We encounter automatic pattern recognition in daily life very frequently. What is Pattern Recognition? ?? A pattern is an entity, vaguely defined, that could be given a name For example: 1. Fingerprint image 2. Handwritten word 3. 
Human face 4. Speech signal 5. DNA sequence

Components of a Pattern Recognition System
• Data Collection
o Sensors collect information, e.g., images, temperature, pressure, hardness, etc.
• Pre-Processing
o Noise reduction, filtering, limiting, etc.
• Feature Extraction
o Compute numeric information from the raw data
• Classification
o Clustering is an example of classification

Data Acquisition
• Data is acquired through sensors (noise is picked up along with the signal).

Examples of Data Acquisition
Depending upon the application, different types of sensors are used, e.g.:
• Temperature is sensed through thermocouples
• Sound data is sensed through microphones
• Image data is sensed through cameras
• Location data is sensed through GPS satellites
• Financial data is acquired from stock exchanges
• Wind speed data is sensed through anemometers
• Blood oxygen saturation is sensed through a red-light sensor
How do we acquire data for:
1. The shortest route from point A to B in Google Maps?
2. The audit data of a firm?
3. The detection of credit card fraud?

Characteristics of Data and Data Acquisition
• Frequency spectrum of the data
• Speed of data acquisition (sampling frequency)
• Acceptable noise level (signal-to-noise ratio, SNR)
• Amount of collected data (storage size)
• Built-in preprocessing

Related Fields and Applications of PR
Pattern Recognition Applications
• Biometric recognition
• Fingerprint recognition
• Autonomous navigation
• Medical imaging: cancer detection and grading using microscopic tissue data. (left) A whole slide image with 75568 × 74896 pixels. (right) A region of interest with 7440 × 8260 pixels.
• Land cover classification using satellite imagery
• Building and building-group recognition using satellite imagery

Classification Algorithms
• Logistic Regression
• Naïve Bayes
• Stochastic Gradient Descent
• K-Nearest Neighbours
• Decision Tree
• Random Forest
• Support Vector Machine

Clustering Types
• Connectivity-based clustering (hierarchical clustering)
• Centroid-based clustering (partitioning methods)
• Distribution-based clustering
• Density-based clustering (model-based methods)
• Fuzzy clustering
• Constraint-based (supervised) clustering

Typical PR System
Training data is the data you use to train an algorithm or machine learning model to predict the outcome you designed your model to predict. Test data is used to measure the performance, such as accuracy or efficiency, of the trained algorithm.

Part 2 – Machine Learning
Introduction to Artificial Intelligence
Definition: "Anything that makes machines act more intelligently."
Artificial intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions.
• Think of AI as augmented intelligence.
• AI should not attempt to replace human experts, but rather extend human capabilities and accomplish tasks that neither humans nor machines could do on their own.

Augmented Intelligence
• Augmented intelligence is a design pattern for a human-centered partnership model of people and artificial intelligence (AI) working together to enhance cognitive performance, including learning, decision making and new experiences.
Augmented Intelligence The internet has given us access to more information, faster Distributed computing and IoT have led to massive amounts of data, and social networking has encouraged most of that data to be unstructured With Augmented Intelligence, giving information that subject matter experts need at their fingertips, and backing it with evidence so they can make informed decisions Help experts to scale their capabilities and let the machines do the time-consuming work How does AI learn??? • Based on strength, breadth, and application, AI can be described in different ways. 1) Weak or Narrow AI: • AI that is applied to a specific domain For example: language translators, virtual assistants, self-driving cars, AI-powered web searches, recommendation engines, and intelligent spam filters 2) Applied AI: • It can perform specific tasks, but not learn new ones, making decisions based on programmed algorithms, and training data 3) Strong AI or Generalized AI: • AI that can interact and operate a wide variety of independent and unrelated tasks. • It can learn new tasks to solve new problems, and it does this by teaching itself new strategies. • Generalized Intelligence is the combination of many AI strategies that learn from experience and can perform at a human level of intelligence. 4) Super AI or Conscious AI: • AI with human-level consciousness, which would require it to be self-aware. • Since it is not able to adequately define what consciousness is, it is unlikely that it will be able to create a conscious AI in the near future Impact and Examples of AI AI means different things to different people Video Game Designer Screen Writer Data Scientist AI means writing the code that affects how bots play or how the environment reacts to the player AI means a character that acts like a human with some trope having computer features mixed in AI is a way of exploring and classifying data to meet specific goals Natural Language Processing The natural language processing and natural language generation capabilities of AI are not only enabling machines and humans to understand and interact with each other, but are creating new opportunities and new ways of doing business • Chatbots powered by natural language processing capabilities are being used : 1. Healthcare : To question patients and run basic diagnoses like real doctors Chatbots 2. Education: Providing students with easy to learn conversational interfaces and on-demand online tutors 3. 
Customer service: Chatbots are improving the customer experience by resolving queries on the spot and freeing up agents' time for conversations that add value.

AI Speech-to-Text Technology
• AI-powered advances in speech-to-text technology have made real-time transcription a reality.
• Advances in speech synthesis are the reason companies are using AI-powered voice to enhance the customer experience and give their brand its unique voice.
• In the field of medicine, it is helping patients with Lou Gehrig's disease, for example, to regain their real voice in place of using a computerized voice.
• It is due to advances in AI that the field of computer vision has been able to surpass humans in tasks related to detecting and labeling objects.
• Computer vision algorithms detect facial features in images and compare them with databases of face profiles.
• This is what allows consumer devices to authenticate the identities of their owners through facial recognition, social media apps to detect and tag users, and law enforcement agencies to identify criminals in video feeds.
• It also helps automate tasks such as detecting cancerous moles in skin images and finding symptoms in X-ray and MRI scans.

AI Behind the Scenes
• AI is working behind the scenes in different sectors, such as:
1) Finance:
• Monitoring our investments and detecting fraudulent transactions
• Identifying credit card fraud and preventing financial crimes
2) Healthcare:
• Helping doctors arrive at more accurate preliminary diagnoses
• Reading medical imaging and finding appropriate clinical trials for patients
• It is not just influencing patient outcomes, but also making operational processes less expensive.
• AI has the potential to access enormous amounts of information, imitate humans (even specific humans), make life-changing recommendations about health and finances, correlate data in ways that may invade privacy, and much more.

Terminologies Used Interchangeably
Machine Learning
• Machine Learning, a subset of AI, uses computer algorithms to analyze data and make intelligent decisions.
• Instead of following rule-based algorithms, machine learning builds models to classify and make predictions from data.
Examples:
• Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack.
• Predict the price of a stock 6 months from now.
• Face detection: find faces in images (or indicate if a face is present).
• Spam filtering: identify email messages as spam or not spam.
• Fraud detection: applications that seek patterns in very large data sets.

Why do we mine data? From a commercial point of view, data is everywhere.
OK, there is too much data; then what?
• Give a machine learning program a large volume of pictures of birds and train the model to return the label "bird" whenever it is provided a picture of a bird.
• Similarly, create a label for "cat" and provide pictures of cats to train on.
• When the machine model is shown a picture of a cat or a bird, it will label the picture with some level of confidence.

Types of Machine Learning
Machine learning is commonly divided into supervised learning, unsupervised learning, and reinforcement learning.

Supervised Learning
• It is defined by its use of labeled datasets to train algorithms to classify data or predict outcomes accurately.
Supervised Learning Workflow
• Supervised learning can be split into two main categories:
1. Regression
2. Classification
Regression
• Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable and one or more independent variables.

Classification
• Classification is the technique of identifying to which of a set of categories an observation belongs.

Unsupervised Learning
• Unsupervised learning, also known as unsupervised machine learning, uses machine learning algorithms to analyze and cluster unlabeled datasets.
• These algorithms discover hidden patterns or data groupings without the need for human intervention.
• Unsupervised learning can be divided into two types:
1. Clustering
2. Association

Clustering
• Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups.

Association
• Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases.

Reinforcement Learning
• Reinforcement learning is the training of machine learning models to make a sequence of decisions.
• The agent learns to achieve a goal in an uncertain, potentially complex environment.
• In reinforcement learning, an artificial intelligence faces a game-like situation.
• Its goal is to maximize the total reward.

Deep Learning
• Deep learning enables AI systems to continuously learn on the job and improve the quality and accuracy of their results.
• This is what enables these systems to learn from unstructured data.
• Deep learning uses layered algorithms to create a neural network, an artificial replication of the functionality and structure of the brain, to learn from data.

Deep Learning: Layers
• Deep learning algorithms do not directly map input to output.
• They rely on several layers of processing units.
• Each layer passes its output to the next layer, which processes it and passes it on.
• The many layers are why it is called deep learning.
• Give a deep learning algorithm thousands of images and labels that correspond to the content of each image.
• When creating deep learning algorithms, developers and engineers configure the number of layers and the type of functions that connect the outputs of each layer to the inputs of the next.
• Train the model by providing it with lots of annotated examples.
• The algorithm will run those examples through its layered neural network.
• It adjusts the weights of the variables in each layer of the neural network to be able to detect the common patterns that define images with similar labels.
• Deep learning has proven to be very efficient at various tasks:
1. Image captioning
2. Voice recognition and transcription
3. Facial recognition
4. Medical imaging
5. Language translation
• Deep learning is also one of the main components of driverless cars.

Recap
• Classification is a process that automatically orders or categorizes data into one or more of a set of "classes."
• Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to one another than to those in other groups.
• In simple words, the aim is to segregate groups with similar traits and assign them into clusters.
• Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome' or 'response' variable) and one or more independent variables.
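To make this recap concrete, here is a minimal scikit-learn sketch of the three tasks on toy data. The dataset (iris) and the particular models (k-NN, k-means, linear regression) are illustrative choices, not part of the slides.

```python
# Minimal illustration of classification, clustering, and regression with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

X, y = load_iris(return_X_y=True)

# Classification: predict the class label of each observation (supervised, labeled data).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print("classification accuracy on test data:", clf.score(X_te, y_te))

# Clustering: group observations without using the labels (unsupervised).
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("first ten cluster assignments:", clusters[:10])

# Regression: predict a continuous value (here, petal width from the other three features).
reg = LinearRegression().fit(X[:, :3], X[:, 3])
print("regression R^2:", round(reg.score(X[:, :3], X[:, 3]), 3))
```

Note that the classifier is fitted on training data and scored on held-out test data, matching the training/test-data distinction described in the "Typical PR System" slide.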
Part 3 – Feature Selection & Extraction

Introduction
• Data mining is an automatic or semi-automatic process of extracting and discovering implicit, unknown, and potentially useful patterns and information from massive data stored in and captured from data repositories (web, databases, data warehouses).
• Data mining tasks are usually divided into two categories:
1. Predictive
2. Descriptive
• Predictive tasks:
o Predict or classify, i.e., determine to which class an instance or example belongs, based on the values of some features (i.e., independent or conditional features) in a dataset.
o For example, a prediction task is to predict whether a new patient (instance) is in danger of a heart attack or not, based on some clinical tests.
• Descriptive tasks:
o Find clusters, correlations, and trends based on the implicit relationships hidden in the underlying data.
• Data collected for analysis might contain redundant or irrelevant features.
• Applying data mining methods to such data might lead to misleading results.
• Data pre-processing is an essential step to refine the data to be used in any learning model.
• The main purpose of the pre-processing step is to clean and transform raw data into a suitable format, which improves the performance of data mining tasks.
• This is where dimensionality reduction algorithms come into play.

Motivation
• When dealing with real problems and real data, we often face high-dimensional data whose dimensionality can go up to millions.
• In its original high-dimensional form the data represents itself, but sometimes we need to reduce its dimensionality.
• Reducing the dimensionality is also needed for visualization.

Dimensionality Reduction
• Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data.

Rainfall Example
• Machine learning models map features to outcomes.
• For instance, say you want to create a model that predicts the amount of rainfall in one month.
• You have a dataset of different information collected from different cities in separate months.
• The data points include temperature, humidity, city population, traffic, number of concerts held in the city, wind speed, wind direction, air pressure, number of bus tickets purchased, and the amount of rainfall.
• Obviously, not all this information is relevant to rainfall prediction.
• Some of the features might have nothing to do with the target variable. Evidently, population and the number of bus tickets purchased do not affect rainfall.
• Other features might be correlated with the target variable but have no causal relation to it. For instance, the number of outdoor concerts might be correlated with the volume of rainfall, but it is not a good predictor for rain.
• In other cases, such as carbon emissions, there might be a link between the feature and the target variable, but the effect will be negligible.

Drawbacks of Having Many Features
• In some problems it is evident which features are valuable and which are useless.
• In other problems, the excessive features might not be obvious and need further data analysis. • But why bother to remove the extra dimensions? • When you have too many features, you’ll also need a more complex model. • A more complex model means you’ll need a lot more training data and more compute power to train your model to an acceptable level. Drawbacks of having many features • Machine learning has no understanding of causality, models try to map any feature included in their dataset to the target variable, even if there’s no causal relation. • This can lead to models that are imprecise and erroneous. • Reducing the number of features can make your machine learning model simpler, more efficient, and less data-hungry. • The problems caused by too many features are often referred to as the “Curse of dimensionality” The Curse of Dimensionality • In machine learning, “dimensionality” simply refers to the number of features (i.e., input variables) in your dataset. • When the number of features is very large relative to the number of observations in your dataset, certain algorithms struggle to train effective models. • This is called the “Curse of Dimensionality,” and it’s especially relevant for clustering algorithms that rely on distance calculations. • Let's say you have a straight line 100 yards long and you dropped a penny somewhere on it. It wouldn't be too hard to find. You walk along the line and it takes two minutes. • Now let's say you have a square 100 yards on each side and you dropped a penny somewhere on it. It would be pretty hard, like searching across two football fields stuck together. It could take days. • Now a cube 100 yards across. That's like searching a 30-story building the size of a football stadium. • The difficulty of searching through the space gets a lot harder as you have more dimensions. Dimensionality Reduction • Dimensionality reduction identifies and removes the features that are hurting the machine learning model’s performance or aren’t contributing to its accuracy. • The popular aspects of the curse of dimensionality; are 1. Data sparsity 2. Distance concentration 1. Data Sparsity • A common problem in machine learning is sparse data, which alters the performance of machine learning algorithms and their ability to calculate accurate predictions. • Data Sparsity: Data sparsity is term used for how much data we have for a particular dimension/entity of the model. • A common phenomenon in general large scaled data analysis • For instance, if predicting a target, that is dependent on two attributes: gender and age group • Ideally capture the targets for all possible combinations of values for the two attributes • If this data is used to train a model that is capable of learning the mapping between the attribute values and the target, its performance could be generalized. • As long as the future unseen data comes from this distribution (a combination of values), the model would predict the target accurately. • The target value depends on gender and age group only. • If the target depends on a third attribute, let’s say body type, the number of training samples required to cover all the combinations increases • For two variables, needed eight training samples. For three variables, it need 24 samples. • As the number of attributes or the dimensions increases, the number of training samples required to generalize a model also increase. • In reality, the available training samples may not have observed targets for all combinations of the attributes. 
• This is because some combination occurs more often than others. • Due to this, the training samples available for building the model may not capture all possible combinations. • This aspect, where the training samples do not capture all combinations, is referred to as ‘Data sparsity’ or simply ‘sparsity’ in high dimensional data. • Data sparsity is one of the facets of the curse of dimensionality. • Training a model with sparse data could lead to highvariance or overfitting condition • As the number of features increase, the number of samples also increases. • The more features the more number of samples we will need to have all combinations of feature values well represented in our sample. 2. Distance Concentration • Distance concentration refers to the problem of all the pairwise distances between different samples/points in the space converging to the same value as the dimensionality of the data increases. • Several machine learning models such as clustering or nearest neighbours’ methods use distance-based metrics to identify similar or proximity of the samples. • Due to distance concentration, the concept of proximity or similarity of the samples may not be qualitatively relevant in higher dimensions. • As the number of features increases, the model becomes more complex. • The more the number of features, the more the chances of overfitting. • A machine learning model that is trained on a large number of features, gets increasingly dependent on the data it was trained on and in turn overfitted, resulting in poor performance • Avoiding overfitting is a major motivation for performing dimensionality reduction. • The fewer features our training data has, the lesser assumptions our model makes and the simpler it will be. Advantages of Performing Dimensionality Reduction 1. Less misleading data means model accuracy improves. 2. Less dimensions mean less computing. Less data means that algorithms train faster. 3. Less data means less storage space required. 4. Less dimensions allow usage of algorithms unfit for a large number of dimensions 5. Removes redundant features and noise. Components for Dimensionality Reduction • There are two components of dimensionality reduction: 1. Feature Selection 2. Feature Extraction Feature Extraction • Feature extraction creates a new, smaller set of features that captures most of the useful information in the data Feature Selection • Feature selection is for filtering irrelevant or redundant features from your dataset. • Feature selection keeps a subset of the original features Feature Selection • Feature selection techniques are used for several reasons: 1. Simplification of models to make them easier to interpret 2. Shorter training times 3. To avoid the curse of dimensionality 4. Improve data's compatibility with a learning model class Feature Selection Feature selection techniques are often used in domains where there are many features and comparatively few samples (or data points). Application of feature selection include the analysis of written texts and DNA microarray data, where there are many thousands of features, and a few tens to hundreds of samples. Feature Selection A feature selection algorithm can be seen as the combination of a search technique for proposing new feature subsets, along with an evaluation measure which scores the different feature subsets. The simplest algorithm is to test each possible subset of features finding the one which minimizes the error rate. 
This is an exhaustive search of the space and is computationally intractable for all but the smallest feature sets.

Generic Steps in Feature Selection
Feature selection techniques consist of four major steps:
• Subset generation
• Subset evaluation
• Stopping criteria
• Validation

a. Subset Generation
• Subset generation is treated as a search problem that aims to select the best subsets out of all possible feature subsets.
• The search strategies include:
o Evolutionary algorithms, e.g., GAs or MOEAs, greedy search, and best-first search with forward search selection (FSS) or backward search selection (BSS).

b. Subset Evaluation
• The selected subset can be evaluated using statistical measures such as:
o The correlation between the feature and the target class, as in filter approaches.
o A classifier's measure (e.g., accuracy) used to evaluate the subset's performance, as in wrapper approaches.

c. Stopping Criteria
• Determine a stopping criterion to stop the iterative process of selecting subsets.
• Stopping criteria might be based on a pre-defined maximum number of generations to run the algorithm, the convergence of the algorithm, or a solution being found.

d. Validation
• The validation step is usually performed after the feature selection process to assess the quality of the resulting feature subset.

Types of Feature Selection
• The three main categories of feature selection algorithms are:
1. Filters
2. Wrappers
3. Embedded methods

1. Filter Methods
• Filter methods use variable ranking techniques as the principal criterion for variable selection by ordering.
• Ranking methods are used due to their simplicity, and good success has been reported for practical applications.
• A suitable ranking criterion is used to score the variables, and a threshold is used to remove variables below the threshold.
• Ranking methods are filter methods since they are applied before classification to filter out the less relevant variables.

Property of a Unique Feature
• A basic property of a unique feature is that it contains useful information about the different classes in the data.
• This property can be defined as feature relevance, which provides a measurement of the feature's usefulness in discriminating between the different classes.

The Issue of Relevancy of a Feature
• Deciding the relevancy of a feature to the data or the output is the big question.
• Definition: "A feature can be regarded as relevant if it is conditionally dependent on the class labels."
• For a feature to be relevant, it can be independent of the input data but cannot be independent of the class labels; i.e., a feature that has no influence on the class labels can be discarded.
• Inter-feature correlation plays an important role in determining unique features.

Ranking Methods in Filter Techniques
• Correlation criteria
• Mutual Information (MI)

a) Correlation Criteria
• Correlation Criteria (CC), also known as Dependence Measure (DM), is based on the relevance (predictive power) of each feature.
• The predictive power is computed by finding the correlation between the independent feature x and the target (label) vector t.
• The feature with the highest correlation value will have the highest predictive power and hence will be the most useful.
• The commonly used Pearson correlation criterion scores feature $x_i$ as
$R(i) = \dfrac{\mathrm{cov}(x_i, Y)}{\sqrt{\mathrm{var}(x_i)\,\mathrm{var}(Y)}}$
where $x_i$ is the i-th variable, Y is the output (class labels), cov() is the covariance and var() the variance.
• Correlation ranking can only detect linear dependencies between a variable and the target.
• The features are then ranked according to some correlation-based heuristic evaluation function.
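As a concrete illustration of filter-style correlation ranking, the sketch below scores each feature by the absolute value of the Pearson criterion above and keeps the top-k features. The synthetic data and the value of k are arbitrary placeholders, not examples from the slides.

```python
import numpy as np

def correlation_ranking(X, y, k=3):
    """Filter method: rank features by |Pearson correlation| with the target."""
    scores = []
    for i in range(X.shape[1]):
        xi = X[:, i]
        # R(i) = cov(x_i, Y) / sqrt(var(x_i) * var(Y)), biased estimators for consistency
        r = np.cov(xi, y, bias=True)[0, 1] / np.sqrt(np.var(xi) * np.var(y))
        scores.append(abs(r))
    ranking = np.argsort(scores)[::-1]          # indices of features, best first
    return ranking[:k], np.array(scores)

# Toy data: 100 samples, 5 features; only features 0 and 2 actually drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.1, size=100)

selected, scores = correlation_ranking(X, y, k=2)
print("selected feature indices:", selected)
print("all scores:", np.round(scores, 3))
```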
b) Mutual Information
• Mutual Information (MI), also known as Information Gain (IG), is a measure of the dependence or shared information between two random variables.
• It is also described as an Information Theoretic Ranking Criterion (ITRC).
• MI can be described using the concept given by Shannon's definition of entropy: the information gained about the target T from a feature X is the reduction in entropy, $IG(T; X) = H(T) - H(T \mid X)$.
• In feature selection, we need to maximize the mutual information (i.e., the relevance) between the feature and the target variable.
• The mutual information is the relative entropy between the joint distribution and the product of the marginal distributions:
$I(X_j; T) = \sum_{x_j}\sum_{t} p(x_j, t)\,\log\dfrac{p(x_j, t)}{p(x_j)\,p(t)}$
where $p(x_j, t)$ is the joint probability density function of feature $x_j$ and target t, and $p(x_j)$ and $p(t)$ are the marginal density functions.
• The MI is zero if X and Y are independent and greater than zero if they are dependent.

Chi-square Test
• The Chi-square test is used for categorical features in a dataset.
• Chi-square values are calculated between each feature and the target, and the desired number of features with the best Chi-square scores is selected.
• Conditions for Chi-square testing: the variables must be
o Categorical
o Sampled independently

Minimal Redundancy Maximal Relevance Method
• The Maximum Relevance Minimum Redundancy (mRMR) method was developed for feature selection on microarray data.
• It tends to select a subset of features having the most correlation with the class (relevance) and the least correlation between themselves (redundancy).

2. Wrappers
• Instead of finding a relevant feature subset by a separate, independent process, the wrapper method has its own machine learning algorithm (classifier) employed as part of the feature selection process.
• Wrappers use a search algorithm to search through the space of possible features and evaluate each subset by running a model on the subset.
• The process iterates a number of times until the best feature subset is found.
• Wrapper methods can be classified into:
1. Sequential selection algorithms
2. Heuristic search algorithms

1. Sequential Selection Algorithms
• These algorithms are called sequential due to their iterative nature.
• Sequential Forward Selection (SFS):
o The SFS algorithm starts with an empty set and, in the first step, adds the one feature which gives the highest value of the objective function.
o From the second step onwards, the remaining features are added individually to the current subset and the new subset is evaluated.
o An individual feature is permanently included in the subset if it gives the maximum classification accuracy.
o The process is repeated until the required number of features has been added.

Bidirectional Search (BDS)
• BDS applies SFS and SBS simultaneously:
o SFS is performed from the empty set.
o SBS is performed from the full set.
• To guarantee that SFS and SBS converge to the same solution:
o Features already selected by SFS are not removed by SBS.
o Features already removed by SBS are not added by SFS.

Limitations of SFS and SBS
• The main limitation of SFS is that it is unable to remove features that become non-useful after the addition of other features.
• The main limitation of SBS is its inability to re-evaluate the usefulness of a feature after it has been discarded.
• (A small wrapper-style sketch of SFS is shown after the search-method definitions below.)

2. Heuristic and Metaheuristic Search Algorithms
• A heuristic is a solving method for a specific problem (it can benefit from the properties of the problem being solved).
• A metaheuristic is a generalized solving method.
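The following is a minimal sketch of the wrapper-style Sequential Forward Selection described above, using the cross-validated accuracy of a k-NN classifier as the objective function. The choice of classifier, dataset, and number of features to select is illustrative only.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def sequential_forward_selection(X, y, n_features, estimator):
    """Greedy SFS: start with an empty set and repeatedly add the single feature
    whose addition gives the highest cross-validated accuracy."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_features:
        best_score, best_feature = -np.inf, None
        for f in remaining:
            subset = selected + [f]
            score = cross_val_score(estimator, X[:, subset], y, cv=5).mean()
            if score > best_score:
                best_score, best_feature = score, f
        selected.append(best_feature)
        remaining.remove(best_feature)
        print(f"added feature {best_feature}, CV accuracy = {best_score:.3f}")
    return selected

X, y = load_wine(return_X_y=True)
chosen = sequential_forward_selection(X, y, n_features=4,
                                      estimator=KNeighborsClassifier(n_neighbors=5))
print("selected feature indices:", chosen)
```

Because the classifier is re-trained for every candidate subset, this illustrates why wrappers are more expensive than filters, and why SFS cannot later remove a feature that has stopped being useful.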
EVOLUTIONARY COMPUTATION (EC)
• EC covers a number of methods designed to simulate natural evolution.
• These methods are population-based and rely on a combination of random variation and selection to solve problems.
• There are many population-based techniques inspired by natural phenomena.
• Some are bio-inspired and work in the form of swarms; examples are PSO, ant colony optimization, the whale optimization algorithm, etc.
• Others are based on laws of physics or chemistry.

DESIRABLE CHARACTERISTICS OF A GOOD OPTIMIZATION ALGORITHM
• Accuracy
• Fast convergence
• Computational efficiency
• A balance between exploration and exploitation:
o Exploration aims to explore the whole search space in order to find promising solutions in undiscovered areas.
o Exploitation aims to improve the already discovered solutions by searching their neighbourhood.
• Stable solution quality

WHAT IS A SWARM?
➢ A loosely structured collection of interacting agents.
Agents:
▪ Individuals that belong to a group
▪ They contribute to and benefit from the entire group
▪ They can recognize, communicate, and/or interact with each other

SWARM INTELLIGENCE (SI)
• Swarm Intelligence (SI) algorithms are mostly metaheuristics which search for the (near-)optimal solution of a specific problem by employing a set of search agents to explore the search space.
• The key characteristic of SI algorithms is that they try to mimic the natural behavior of creatures that live in groups, such as flocks of birds, ants, and bees.
• PSO, ACO, GWO, HHO, SSA, and FA are all examples of SI algorithms that mimic the social behavior of such creatures.

PARTICLE SWARM OPTIMIZATION ALGORITHM
• The algorithm is inspired by the problem-solving capabilities and behaviour of social animals.
• PSO works on the idea of a collection of particles called a "population".
• Each particle interacts with its neighbours and updates its velocity according to its own previous best position and the best position of the entire population.
• As the particles iterate through the search space, their collectively intelligent movements bring them close to an optimal solution.

Basic Terminologies
• The search space is the set of all possible solutions of the optimization problem, and the goal of PSO is to find the best solution.
• Each particle
o is characterised by its position (x) and velocity (v), and
o represents a candidate solution of the optimization problem to be solved.
• Consider a particle X having a velocity. Each particle has a memory of its own best experience or position, denoted Pbest; this is the personal best value of the particle found during the search.
• There is also a common best experience among the members of the swarm, denoted Gbest.
• As an analogy, imagine Bob and his team walking around an area in search of the best spot:
o If Bob's new position is better than his personal best position, he records it (the new location might be better or worse than before).
o Similarly, if this position is better than the team's best location, Bob informs the other members and the team record is updated.
o This is what the path looks like after two days of walking.
• So what makes PSO a stochastic method? The team best and the personal bests of all members are updated every day, so the team's search area depends on the personal bests and on a random component.
• How does this stochastic method still find good solutions? Since the team maintains the best location found so far and searches around it, the possibility of finding a better solution is high.
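The velocity and position updates described above can be sketched as follows. This is a generic PSO with an inertia weight; the coefficients c1 and c2, the inertia schedule, and the test function are common illustrative choices, not values given in the slides.

```python
import numpy as np

def pso(objective, dim=2, n_particles=30, iters=200, bounds=(-5.0, 5.0)):
    """Minimal PSO: each velocity is pulled toward the particle's pbest and the swarm's gbest."""
    rng = np.random.default_rng(1)
    lo, hi = bounds
    x = rng.uniform(lo, hi, size=(n_particles, dim))       # positions (candidate solutions)
    v = np.zeros_like(x)                                   # velocities
    pbest = x.copy()
    pbest_val = np.array([objective(p) for p in x])
    gbest = pbest[np.argmin(pbest_val)].copy()
    c1 = c2 = 2.0                                          # common default acceleration coefficients
    for t in range(iters):
        w = 0.9 - 0.5 * t / iters                          # inertia weight decreasing from 0.9 to 0.4
        r1, r2 = rng.random(x.shape), rng.random(x.shape)  # random components (stochastic search)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, lo, hi)
        vals = np.array([objective(p) for p in x])
        improved = vals < pbest_val                        # update personal bests
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        gbest = pbest[np.argmin(pbest_val)].copy()         # update the swarm's global best
    return gbest, pbest_val.min()

sphere = lambda p: float(np.sum(p ** 2))                   # toy test function (minimum at the origin)
best_x, best_f = pso(sphere)
print("best solution:", np.round(best_x, 4), "objective:", round(best_f, 6))
```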
PROBLEMS WITH STANDARD PSO (SPSO)
• Problems associated with the SPSO algorithm are:
• Premature convergence: premature convergence in an evolutionary algorithm means convergence of the algorithm before the global optimum solution is reached.
• Slow convergence: slow exploitation results in slower convergence and poor-quality solutions.

Inertia Weight
• The inertia weight is an important parameter in PSO which significantly affects the convergence and the exploration-exploitation trade-off in the PSO process.
• A large inertia weight facilitates a global search, while a small inertia weight facilitates a local search.
• By linearly decreasing the inertia weight from a relatively large value to a small value as the iterations increase, PSO has more global search ability at the beginning of the run and more local search ability near the end of the iterations.

Selection of the inertia weight:
• To observe the convergence behaviour of the APSO, the inertia weight "w" is a very significant parameter.
• It controls the impact of the previous velocity on the current update.
• It is a trade-off between the global and local abilities of the swarm.
• Larger values facilitate global exploration.
• Smaller values help facilitate local exploitation.

GRAVITATIONAL SEARCH ALGORITHM
• GSA is a metaheuristic inspired by Newton's laws of gravitation and motion.
• In GSA, the search agents play the role of interacting physical objects, and the performance of a solution is seen as the mass of the object.
• The particles can attract other agents through the gravitational force.
• This force pushes lighter objects towards heavier objects.
• The heavier objects, which are considered better solutions, move more slowly than the lighter objects.
• This behaviour ensures the exploitation phase in GSA.
• GSA is a population-based technique in which each agent has a different mass.
• Heavy masses, i.e., good solutions, usually move more slowly than lighter masses:
> Heavy masses perform exploitation.
> Lighter masses perform exploration.
• The performance of the agents in GSA is measured by their masses.
• A force of gravity causes the agents to attract each other:
> It is responsible for the global movement of agents towards the agents with heavier masses.

PROBLEMS OF GSA
• Problems associated with the standard GSA are:
> Slow convergence:
• Due to unbalanced exploration and exploitation, the particles get trapped in local minima.
> Exploitation issue:
• Agents' masses get heavier and heavier during the optimization process because of the cumulative effect of the fitness function on the masses. This may prevent the masses from rapidly exploiting the optimum, which may result in weak exploitation.

GRAVITATIONAL SEARCH ALGORITHM: Mathematical Model
• Suppose a system of N objects (agents); the position of the i-th object is defined as
$X_i = (x_i^1, x_i^2, \ldots, x_i^d, \ldots, x_i^n), \quad i = 1, 2, \ldots, N$
where $x_i^d$ is the position of the i-th object in the d-th dimension and n is the dimension of the search space.
• At time t, the force acting on mass i from mass j in dimension d is
$F_{ij}^d(t) = G(t)\,\dfrac{M_{pi}(t)\,M_{aj}(t)}{R_{ij}(t) + \epsilon}\,\big(x_j^d(t) - x_i^d(t)\big)$
where $R_{ij}(t)$ is the distance between agents i and j and $\epsilon$ is a small constant.
• In order to emphasise exploration in the first iterations and exploitation in the final iterations, G is given an adaptive value that decreases over the iterations:
$G(t) = G_0 \exp\!\left(-\alpha\,\dfrac{t}{T}\right)$
➢ By the law of motion, the acceleration of object i at time t in dimension d is
$a_i^d(t) = \dfrac{F_i^d(t)}{M_{ii}(t)}$
where $F_i^d(t)$ is the total force acting on agent i in dimension d (a randomly weighted sum of the forces $F_{ij}^d$ exerted by the other agents).
➢ The velocity of a mass depends on its present velocity as well as its acceleration; the next velocity and position of the agent are
$v_i^d(t+1) = rand_i \cdot v_i^d(t) + a_i^d(t)$
$x_i^d(t+1) = x_i^d(t) + v_i^d(t+1)$
➢ Agents' masses are defined by their fitness evaluation; the agent with the heaviest mass is the fittest agent. According to the above equations, the heaviest agent has the highest attractive force and the slowest movement:
$M_{ai} = M_{pi} = M_{ii} = M_i, \quad i = 1, 2, \ldots, N$
$m_i(t) = \dfrac{fit_i(t) - worst(t)}{best(t) - worst(t)}, \qquad M_i(t) = \dfrac{m_i(t)}{\sum_{j=1}^{N} m_j(t)}$
where $fit_i(t)$ is the fitness value of object i at time t, and
$best(t) = \min_{j \in \{1,\ldots,N\}} fit_j(t), \qquad worst(t) = \max_{j \in \{1,\ldots,N\}} fit_j(t)$
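A compact sketch of these GSA update rules is given below. The gravitational-constant schedule and the parameter values (G0, alpha, the test function) are common choices and are assumptions rather than values specified in the slides.

```python
import numpy as np

def gsa(objective, dim=2, n_agents=30, iters=200, bounds=(-5.0, 5.0),
        g0=100.0, alpha=20.0, eps=1e-12):
    """Minimal Gravitational Search Algorithm following the update equations above (minimization)."""
    rng = np.random.default_rng(0)
    lo, hi = bounds
    x = rng.uniform(lo, hi, size=(n_agents, dim))
    v = np.zeros_like(x)
    best_x, best_f = None, np.inf
    for t in range(iters):
        fit = np.array([objective(p) for p in x])
        if fit.min() < best_f:
            best_f, best_x = fit.min(), x[np.argmin(fit)].copy()
        best, worst = fit.min(), fit.max()
        m = (fit - worst) / (best - worst + eps)            # relative masses: best agent -> 1, worst -> 0
        M = m / (m.sum() + eps)                             # normalized masses
        G = g0 * np.exp(-alpha * t / iters)                 # gravitational constant decreasing over time
        acc = np.zeros_like(x)
        for i in range(n_agents):
            force = np.zeros(dim)
            for j in range(n_agents):
                if i == j:
                    continue
                r = np.linalg.norm(x[i] - x[j])
                # randomly weighted attraction toward agent j; the agent's own mass cancels
                # because a_i = F_i / M_ii and M_pi = M_ii
                force += rng.random() * G * M[j] * (x[j] - x[i]) / (r + eps)
            acc[i] = force
        v = rng.random(x.shape) * v + acc                    # v(t+1) = rand * v(t) + a(t)
        x = np.clip(x + v, lo, hi)                           # x(t+1) = x(t) + v(t+1)
    return best_x, best_f

sphere = lambda p: float(np.sum(p ** 2))
print(gsa(sphere))
```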
ANT COLONY OPTIMIZATION
• It was first developed by Marco Dorigo in 1992.
• The first version of ACO was known as the Ant System.
• Ant Colony Optimization (ACO) is one of the well-known swarm intelligence techniques.
• Despite the similarity of this algorithm to other swarm intelligence techniques in terms of employing a set of solutions and its stochastic nature, the inspiration of this algorithm is unique.

INSPIRATION
• The main inspiration of the ACO algorithm is the concept of stigmergy in nature.
• Stigmergy refers to the manipulation of the environment by biological organisms to communicate with each other.
• What makes this type of communication unique is the fact that individuals communicate indirectly.
• The communication is local as well, meaning that individuals should be in the vicinity of the manipulated area to access it.

EXAMPLE OF STIGMERGY
• Stigmergy also exists in humans.
• Examples are websites like Wikipedia and Reddit.
• Millions of articles are contributed by people across the globe.
• You can search for any article without knowing the contributor of the article.
• In these websites there is no centralized system for processing, editing and approving the articles; everything is done by people.

Common Example of Stigmergy
• A common example of stigmergy can be found in the behaviour of ants in an ant colony when finding food.
• Finding food is an optimization task in which organisms try to obtain the maximum amount of food while consuming the minimum amount of energy.
• In an ant colony this is done by finding the shortest path from the nest to a food source.
• In nature, ants solve this problem using a very simple algorithm, which is the main inspiration of the Ant Colony Optimization algorithm.
• Imagine a small village in the middle of the desert consisting of several families.
• Every day the families travel several kilometres to the river to collect water.
• There are two routes to bring water, but no one knows which one is better.
• People randomly choose between the two routes.
• One day a young man and a young woman decide to solve this problem and come up with the idea of identifying the shorter path:
• Mark the path with water, and follow the path which is wetter.
• Water on the longer path has more time to evaporate before anybody else wets it again.
• They decide to bring water several times that day and take two buckets each.
• The girl reaches the river while the boy is still halfway.
• The girl does not know whether the boy has already arrived or has gone back to the village.
• They simply check the path and mark it with water.
• The girl goes back to the village after marking her path with water.
• By the time she is back home, the boy reaches the river and fills his buckets.
• The boy now has an answer as to which route to follow.
• During the first iteration the boy and the girl manage to identify the shorter path, and that path receives more water and becomes wetter.
• They decide to bring more water on the same day.
• So they start their journey again and choose the shorter path, since there is no evaporation yet at this point in time.
• They reach the river and fill their buckets again.
• They wet the path again while coming back to the village.
• The path is wetter, which indicates that it is the shorter path.
• Now assume they choose the wrong path, or evaporation occurs.
• Since the path followed by the girl is shorter, she makes more round trips than the boy, so her path is wetter than the boy's path.
• Imagine they repeat their trips again and again: the shorter path is always wetter than the longer path.
• In an ant colony, ants constantly look for food sources around the nest in random directions.
• It has been proven that once an ant finds a food source, it marks the path with pheromone.
• The amount of pheromone highly depends on the quality and quantity of the food source.
• The more and the better the source of food, the stronger and more concentrated the deposited pheromone.
• When other ants perceive the presence of the pheromone, they also follow the pheromone trail to reach the food source.
• After getting a portion of the food, the ants carry it to the nest and mark their own path to the nest.
• Consider three routes from a nest to a food source.
• The amount of pheromone deposited on a route is highest on the closest path.
• The amount of pheromone decreases in proportion to the length of the path.
• While ants add pheromone to the paths towards the food source, evaporation occurs.
• The period of time during which an ant tops up the pheromone before it evaporates is inversely proportional to the length of the path.
• This means that the pheromone on the closest route becomes more concentrated as more ants get attracted to the strongest pheromone.
• This also means that the ants make their decisions based on probability.

Mathematical Model
• Pheromone (modelling and evaporation):
o How do we mathematically model it, and how do we simulate evaporation?
• Decision-making process:
o A cost matrix defines the length of each path.
o Another matrix represents the amount of pheromone on each edge of the graph; this matrix allows us to store the pheromone level on each edge.
• So the question is: how do we add pheromone on these paths?
• Some methods add the same amount of pheromone regardless of the path found by the ant.
• Other methods add pheromone depending on the quality of the path found by the ant; there are some species of ants that deposit more pheromone depending on the quality of the food.

Pheromone Level on the Graph
• The amount of pheromone deposited by ant k on edge (i, j) is
$\Delta\tau_{ij}^{k} = \begin{cases} 1/L_k & \text{if ant } k \text{ travels on edge } (i,j) \\ 0 & \text{otherwise} \end{cases}$
where $L_k$ is the length (cost value) of the path that the k-th ant travels.
• For multiple ants, the total deposit is $\Delta\tau_{ij} = \sum_{k=1}^{m} \Delta\tau_{ij}^{k}$, where m is the number of ants.
• Without evaporation, the pheromone simply accumulates: $\tau_{ij} \leftarrow \tau_{ij} + \Delta\tau_{ij}$.
• With evaporation, $\tau_{ij} \leftarrow (1-\rho)\,\tau_{ij} + \Delta\tau_{ij}$, where ρ is the rate of pheromone evaporation; the greater the value of ρ, the higher the evaporation rate.
• Numerical example from the slides: the length of a travelled path is the sum of its edge lengths, e.g., 4 + 8 + 15 + 4 = 31; with an initial pheromone level of 1, ρ = 0.5 and a path of length 14, the updated pheromone is (1 - 0.5) × 1 + 1/14.

Calculating the Probability
• The probability that an ant at node i moves to node j is
$p_{ij} = \dfrac{\tau_{ij}^{\alpha}\,\eta_{ij}^{\beta}}{\sum_{l}\tau_{il}^{\alpha}\,\eta_{il}^{\beta}}$
where $\eta_{ij}$ is a heuristic desirability (e.g., the inverse of the edge length); for simplicity the slides take α = β = 1.
• The slides' example computes these probabilities for two candidate destinations, first with an identical initial pheromone level of 1 on all paths and then with the initial pheromone level changed to 5.

How to Choose Destinations Using These Probabilities
• The next destination is selected randomly in proportion to these probabilities, e.g., using roulette wheel selection.
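The sketch below ties these pieces together for a tiny routing problem: pheromone-biased route choice, 1/L deposits, and evaporation. The graph (three fixed routes), the parameter values, and the number of iterations are illustrative assumptions, not details taken from the slides.

```python
import numpy as np

# Three candidate routes from the nest to the food source, with their lengths.
lengths = np.array([14.0, 20.0, 31.0])
tau = np.ones_like(lengths)        # initial pheromone level on each route
eta = 1.0 / lengths                # heuristic desirability (shorter routes are more desirable)
alpha, beta, rho, n_ants = 1.0, 1.0, 0.5, 10
rng = np.random.default_rng(0)

for iteration in range(20):
    weights = (tau ** alpha) * (eta ** beta)
    probs = weights / weights.sum()                             # probability of choosing each route
    choices = rng.choice(len(lengths), size=n_ants, p=probs)    # roulette-wheel style route choice
    delta = np.zeros_like(tau)
    for k in choices:
        delta[k] += 1.0 / lengths[k]                            # each ant deposits 1/L_k on its route
    tau = (1.0 - rho) * tau + delta                             # evaporation plus new deposits

print("final pheromone levels:", np.round(tau, 3))
print("final route choice probabilities:", np.round(tau / tau.sum(), 3))
```

Running the loop shows the shortest route accumulating the most pheromone, which is exactly the positive-feedback effect in the ant (and water-carrying) story above.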
Genetic Algorithm
(Video on the Genetic Algorithm by whentotrade.com)

Introduction
• The Genetic Algorithm (GA) is one of the most well-regarded evolutionary algorithms.
• GA is one of the first population-based stochastic algorithms proposed in history.
• This algorithm mimics the Darwinian theory of survival of the fittest in nature.
• Similar to other EAs, the main operators of GA are selection, crossover, and mutation.

Inspiration
• The GA algorithm is inspired by the theory of biological evolution proposed by Darwin.
• In fact, the main mechanism simulated in GA is the survival of the fittest.
• In nature, fitter organisms have a higher probability of survival. This helps them transfer their genes to the next generation.
• Over time, good genes that allow species to better adapt to the environment (avoid enemies and find food) become dominant in subsequent generations.

Natural Selection - Survival of the Fittest
• Darwin's theory was proposed by Charles Darwin; one of the key mechanisms in this theory is natural selection, or survival of the fittest.
• Natural selection refers to variations in the genotype of organisms that increase the chance of survival.
• In natural selection, those variations in the genotype that increase the organism's chance of survival are preserved and multiplied from generation to generation.
• For example, sharp teeth are good for a shark; nature will preserve those sharp teeth generation by generation, so future generations will also benefit from them, for example to catch prey.

Survival of the Fittest
• Consider an ecosystem with a large number of organisms, of which some are on land and others in water.
• The main objective of these organisms is survival.
• They interact in the form of collaboration or competition to increase their chance of survival.
• For example, two species like tigers and wolves compete on land to catch prey.
• Similarly, swarms of fish react to avoid predators or to catch prey.
• NOTE: what matters is how good an organism is at avoiding predators or catching prey, so that it can survive.

Basic Terminologies
1. Population: This is a subset of all the probable solutions that can solve the given problem.
2. Chromosome: A chromosome is one of the solutions in the population.
3. Gene: This is an element in a chromosome.
4. Allele: This is the value given to a gene in a specific chromosome.
5. Fitness function: This is a function that uses a specific input to produce an improved output. The solution is used as the input, while the output is in the form of solution suitability.
6. Genetic operators: In genetic algorithms, the best individuals mate to reproduce offspring that are better than the parents. Genetic operators are used for changing the genetic composition of this next generation.
• For instance, for a problem with 10 variables, GA uses chromosomes with 10 genes.
• The GA algorithm uses three main operators to improve the chromosomes in each generation: selection, crossover (recombination), and mutation.
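To connect this terminology to code, here is a minimal sketch of how a population, chromosomes, genes, and a fitness function might be represented. The population size, variable bounds, and the toy fitness function are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

n_genes = 10          # e.g., a problem with 10 variables -> chromosomes with 10 genes
pop_size = 20         # number of chromosomes (candidate solutions) in the population
lower, upper = -5.0, 5.0

# Population: a matrix whose rows are chromosomes and whose entries are genes (alleles).
population = rng.uniform(lower, upper, size=(pop_size, n_genes))

def fitness(chromosome):
    """Toy fitness function: higher is better (negative sphere function)."""
    return -float(np.sum(chromosome ** 2))

fitness_values = np.array([fitness(c) for c in population])
print("best chromosome:", np.round(population[np.argmax(fitness_values)], 3))
print("best fitness:   ", round(fitness_values.max(), 3))
```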
Initial Population
• The GA algorithm starts with a random population.
• This population can be generated from a Gaussian random distribution to increase diversity.
• The population includes multiple solutions, which represent the chromosomes of individuals.
• Each chromosome has a set of variables, which simulate the genes.
• The main objective in the initialisation step is to spread the solutions around the search space as uniformly as possible, to increase the diversity of the population and to have a better chance of finding promising regions.

Selection
• Natural selection is the main inspiration for this component of the GA algorithm.
• In nature, the fittest individuals have a higher chance of getting food and mating.
• This causes their genes to contribute more to the production of the next generation of the same species.
• Inspired by this simple idea, the GA algorithm employs a roulette wheel to assign probabilities to individuals and selects them for creating the next generation in proportion to their fitness (objective) values.
• For example, consider one shark, two catfish and a group of small fish; regardless of whether they are carnivorous or herbivorous, they compete for food.
• Depending on how fit they are, they compete for resources and increase their chance of survival.
• So if one catfish is eaten by the shark, maybe it could not swim fast enough; that catfish is an example of an organism which is not fit.
• The catfish that survives is different from the one that was eaten by the shark.
• This may be because the surviving catfish has several fins that help it swim faster and escape from its predator.
• So this is a good feature that helps the catfish survive and swim faster.
• Another feature is the moustache, which helps it sense the environment.
• If an organism is not fit enough, it will die either of starvation or through its predator.
• The key point is that the genes, or good characteristics, of the organism are protected and transferred to the next generation.

Selection Operators
Some other selection operators are:
1. Boltzmann selection
2. Tournament selection
3. Rank selection
4. Steady-state selection
5. Truncation selection
6. Local selection
7. Fuzzy selection
8. Fitness uniform selection
9. Proportional selection
10. Linear rank selection
11. Steady-state reproduction

Crossover (Recombination)
• After selecting the individuals using a selection operator, they have to be employed to create the new generation.
• In nature, the genes in the chromosomes of a male and a female are combined to produce a new chromosome.
• This is simulated in the GA algorithm by combining two solutions (parent solutions) selected by the roulette wheel to produce two new solutions (children solutions).
• There are different methods of crossover in the literature. The simplest ones divide the chromosomes into two pieces (single-point) or three pieces (double-point) and then exchange genes between the two chromosomes.
• In single-point crossover, the chromosomes of the two parent solutions are swapped before and after a single point.
• In double-point crossover, however, there are two crossover points and only the chromosomes between the points are swapped.
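Here is a small sketch of the single-point and double-point crossover just described; the chromosomes are simple integer lists chosen purely for illustration.

```python
import random

def single_point_crossover(parent1, parent2, point):
    """Swap the gene segments of the two parents after a single crossover point."""
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

def double_point_crossover(parent1, parent2, p1, p2):
    """Swap only the gene segment lying between the two crossover points."""
    child1 = parent1[:p1] + parent2[p1:p2] + parent1[p2:]
    child2 = parent2[:p1] + parent1[p1:p2] + parent2[p2:]
    return child1, child2

random.seed(0)
mum = [0, 1, 2, 3, 4, 5, 6, 7]
dad = [70, 71, 72, 73, 74, 75, 76, 77]

point = random.randint(1, len(mum) - 1)                 # random single crossover point
p1, p2 = sorted(random.sample(range(1, len(mum)), 2))   # two random crossover points
print("single-point:", single_point_crossover(mum, dad, point))
print("double-point:", double_point_crossover(mum, dad, p1, p2))
```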
Crossover
• The main objective of crossover is to make sure that genes are exchanged and the children inherit genes from the parents.
• Crossover is the main mechanism of exploitation in GA.
• There is a parameter in GA called the Probability of Crossover (Pc), which indicates the probability of accepting a new child.
• This parameter is a number in the interval [0, 1].
• A random number in the same interval is generated for each child.
• If this random number is less than Pc, the child is propagated to the subsequent generation; otherwise, the parent is propagated.
• This happens in nature as well: not all offspring survive.
Crossover Operators
1. Uniform crossover
2. Half uniform crossover
3. Three-parent crossover
4. Partially matched crossover
5. Cycle crossover
6. Order crossover
7. Position-based crossover
8. Heuristic crossover
9. Masked crossover
10. Multi-point crossover
Recombination
• This is a kind of food chain: the shark eats the catfish, and the catfish eats the small fish.
• Catfish require fins to swim faster and avoid sharks, and they require a moustache to sense the environment and catch prey.
• Over time, less fit catfish with fewer fins and smaller moustaches die from starvation or are eaten by sharks.
• So the fittest catfish survive and take part in producing the next generation.
• This figure is an example of the process: by preserving good features generation after generation through natural selection, the fittest catfish survive.
• The orange catfish has two fins and two moustaches, and the green catfish has one fin and two moustaches; they are of different sizes and located at different positions. This corresponds to the genes of an organism.
• The characteristics of a catfish are stored in a table with one row and multiple columns.
• The set of genes that defines how the catfish looks is called a chromosome.
• When mating, the chromosomes of both parents are combined; that means they exchange genes to produce a child.
• The yellow and purple colours show the recombination process, in which the children have better features than their parents.
Problem with Crossover
• The crossover operator exchanges genes between chromosomes.
• The issue with this mechanism is that it does not introduce new genes.
• If all the solutions become poor (trapped in locally optimal solutions), crossover does not lead to different solutions with genes that differ from those of the parents.
Problem with Crossover
• Through natural selection and recombination, organisms and individuals keep changing the genes in their chromosomes.
• What does a catfish do if there is a change in the ecosystem?
• For example, if fitter predators or prey appear in the ecosystem, the catfish should develop new genes or features, because merely exchanging or combining the good features of the current catfish population is not enough.
• The problem is that neither crossover nor natural selection adds new features.
• According to the theory of evolution in biology, chromosomes might undergo random changes during the process of recombination. This is called mutation.
• Mutation refers to random changes of the genes during the process of recombination.
Mutation
• Mutation causes random changes in the genes.
• There is a parameter called the Probability of Mutation (Pm), which is applied to every gene of a child chromosome produced in the crossover stage.
• This parameter is a number in the interval [0, 1].
• A random number in the same interval is generated for each gene of the new child.
• If this random number is less than Pm, the gene is replaced by a random value between the lower and upper bounds of that variable.
• The mutation operator maintains the diversity of the population by introducing another level of randomness.
• This operator prevents solutions from becoming too similar and increases the probability of escaping local optima in the GA algorithm.
• The mutation operator alters one or multiple genes of the child solutions after the crossover phase.
• In this analogy, mutation is adding a third eye to one of the child catfish in the second generation, or a very big moustache in the third generation.
• The picture shows how mutation changes individuals in future generations: if the new gene is good, mutation results in a better gene, and natural selection and recombination will protect it and transfer it to the next generations.
• What if mutation creates a negative feature that results in a less fit organism? Then natural selection will get rid of it. For example, if a catfish ends up with small fins after many mutations, it will die in the ecosystem, so this negative feature will not be transferred to the next generations.
• What are the advantages of mutation? With crossover and the constant competition between organisms, certain features disappear as new generations develop. The mutation operator allows some of the lost features to be recovered, or new ones to be added, maintaining the diversity of the generation.
Figure: impact of mutation on the chromosomes; the light blue boxes are the areas affected by the mutation. The children are similar, but they have developed new features.
Elitism
• Crossover and mutation change the genes in the chromosomes. Depending on the probability of mutation, there is a chance that all the parents are replaced by the children. This might lead to the problem of losing good solutions in the current generation. To fix this, an additional operator called elitism is used.
• How it works: a portion of the best chromosomes in the current population is kept and propagated to the subsequent generation without any changes.
• This prevents those solutions from being damaged by crossover and mutation during the creation of the new population.
• The list of elites is updated by simply ranking the individuals based on their fitness values.
Applications of Genetic Algorithm
• Genetic algorithms are used in the travelling salesman problem to establish an efficient plan that reduces the time and cost of travel.
• They are also applied in many other fields, such as economics, general optimization problems, aircraft design, and DNA analysis. A sketch that puts the whole GA generation loop together is given below.
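To tie selection, crossover (with Pc), mutation (with Pm), and elitism together, here is a minimal sketch of one GA generation. It reuses the roulette_wheel_select and single_point_crossover helpers sketched earlier; the parameter values and the real-valued gene bounds are placeholder assumptions, not values from the lecture.

```python
import random

def next_generation(population, fitness, pc=0.9, pm=0.05,
                    n_elites=2, low=0.0, high=1.0):
    """One GA generation: elitism, roulette-wheel selection,
    single-point crossover accepted with probability Pc,
    and per-gene mutation with probability Pm."""
    fitnesses = [fitness(c) for c in population]

    # Elitism: copy the best chromosomes unchanged into the new population.
    ranked = sorted(population, key=fitness, reverse=True)
    new_pop = [list(c) for c in ranked[:n_elites]]

    while len(new_pop) < len(population):
        p1 = roulette_wheel_select(population, fitnesses)
        p2 = roulette_wheel_select(population, fitnesses)

        # Crossover accepted with probability Pc, otherwise the parents pass through.
        if random.random() < pc:
            c1, c2 = single_point_crossover(p1, p2)
        else:
            c1, c2 = list(p1), list(p2)

        # Mutation: each gene is replaced with probability Pm by a random value
        # drawn between the variable's lower and upper bounds.
        for child in (c1, c2):
            for i in range(len(child)):
                if random.random() < pm:
                    child[i] = random.uniform(low, high)
        new_pop.extend([c1, c2])

    return new_pop[:len(population)]
```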
Hybrid Gravitational Search Particle Swarm Optimization Algorithm
SPSO vs. SGSA
• In PSO the direction is calculated using only the two best positions, pbest_i and gbest; in GSA the agent's direction is calculated from the overall force exerted by all the agents.
• In PSO the update is performed without considering the quality of the solutions; in GSA the force is proportional to the fitness value, so the agents see the search space around themselves under the influence of that force.
• PSO uses a kind of memory for updating the velocity; GSA is memory-less, and only the current position of the agents plays a role in the updating procedure.
• In PSO the update is performed without considering the distance between solutions; in GSA the force is inversely proportional to the distance between solutions.
• PSO simulates the social behaviour of birds; GSA is inspired by the laws of physics.
Hybrid Gravitational Search Particle Swarm Optimization Algorithm
• The Particle Swarm Optimization (PSO) algorithm is a member of the swarm-computation family and is widely used for solving nonlinear optimization problems.
• It tends to suffer from premature stagnation, gets trapped in local minima, and loses exploration capability as the iterations progress.
• On the contrary, the Gravitational Search Algorithm (GSA) is proficient at searching for the global optimum; its drawback, however, is its slow search speed in the final phase.
• The key concept behind the proposed method is to merge the exploration ability of GSA with the social-thinking capability (gbest) of PSO.
PROBLEMS OF GSA
Problems associated with the standard GSA are:
> Slow convergence: due to the unbalanced exploration and exploitation, the particles get trapped in local minima.
> Exploitation issue: the agents' masses get heavier and heavier during the optimization process because of the cumulative effect of the fitness function on the masses. This may prevent the masses from rapidly exploiting the optimum, which results in weak exploitation.
• The basic idea is to save and use the location of the best mass to speed up the exploitation phase. The figure below shows the effect of using the best solution to accelerate the movement of agents towards the global optimum.
• As shown in this figure, the gbest element applies an additional velocity component towards the last known location of the best mass.
• In this way, the external gbest "force" helps to prevent masses from stagnating in a suboptimal region.
Mathematical Model
There are two benefits of this method:
a) Accelerating the movement of particles towards the location of the best mass, which may help them to surpass it and become the best mass in the next iteration.
b) Saving the best solution attained so far.
• Adaptively decrease c1 and increase c2 so that the masses tend to accelerate towards the best solution as the algorithm reaches the exploitation phase.
• Since there is no clear border between the exploration and exploitation phases in evolutionary algorithms, an adaptive method is the best option for allowing a gradual transition between these two phases.
v_i^d(t+1) = w · v_i^d(t) · rand + c1 · a_i^d(t) + c2 · (p_g − x_i^d(t))
x_i^d(t+1) = w · x_i^d(t) + v_i^d(t+1)
w_i = 1 − P_g / P_i
c1 = (c3 − c4) · (1 − t/T) + c4
c2 = (c5 − c6) · (1 − t/T) + c6
Flow chart of HGSPSO: generate the initial population → evaluate the fitness of all agents → evaluate G and gbest for the whole population → calculate M, the forces, and the acceleration for all agents → update velocity and position → if the end criterion is not met, repeat; otherwise return the best solution.
CASE STUDY
Introduction
• The dataset is obtained from the UCI biomedical database (https://archive.ics.uci.edu/ml/datasets.php).
• In data mining, feature subset selection is a data pre-processing phase of enormous importance.
• In this paper, a K-Nearest Neighbour (KNN) classifier is combined with a modified particle swarm optimization (MPSO) to select a minimum number of features while obtaining good classification precision.
• The proposed method (MPSO) is applied to three UCI medical data sets and is compared with other available feature selection approaches.
FOUR KEY STEPS IN FEATURE SELECTION
Modifications in PSO
ADVANCED PARTICLE SWARM OPTIMIZATION ALGORITHM
➢ The SPSO algorithm is primarily based on two equations: the position and the velocity of the particle (a sketch of the standard update is given below).
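Before looking at the modification, here is a minimal sketch of the standard PSO update for a single particle, assuming real-valued particles stored as plain lists; the values of w, c1, and c2 are placeholder assumptions.

```python
import random

def pso_step(x, v, pbest, gbest, w=0.7, c1=2.0, c2=2.0):
    """One standard PSO update for a single particle.
    x, v, pbest are this particle's position, velocity, and personal best;
    gbest is the best position found by the whole swarm so far."""
    new_v, new_x = [], []
    for d in range(len(x)):
        r1, r2 = random.random(), random.random()
        # Velocity: inertia term + cognitive (pbest) term + social (gbest) term.
        vd = w * v[d] + c1 * r1 * (pbest[d] - x[d]) + c2 * r2 * (gbest[d] - x[d])
        new_v.append(vd)
        # Position: move the particle along its new velocity.
        new_x.append(x[d] + vd)
    return new_x, new_v
```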
a) Modification in the Velocity Update Equation:
• A third term is added to the velocity equation of PSO; it acts on the particles' positions through the iterations so that the velocity is increased and the algorithm reaches the optimal solution faster.
• v_id = w · v_id + c1 · R1 · (p_id − x_id) + c2 · R2 · (p_gd − x_id) + w · c1 · c2 · (p_id − p_gd)
VELOCITY CLAMPING & PARTICLE PENALIZATION METHOD
➢ Velocity clamping keeps the particle velocity within the range [v_min, v_max]. The maximum and minimum velocities are defined as
• v_max = lambda · (Max_ss − Min_ss)
• v_min = lambda · (Min_ss − Max_ss)
Conditions for the velocity:
• if v_i > v_max then v_i = v_max
• if v_i < v_min then v_i = v_min
➢ Penalization keeps the particle within the search domain if the sum of the agent's position and velocity would produce a new position outside the domain. Condition for penalization:
• if v_i + x_i > Max_ss or v_i + x_i < Min_ss then v_i = 0
UCI DATA SET
UCI medical datasets are used for the feature selection process.
Results
• KNN is used for classification. The efficiency of the classifier is evaluated in terms of its accuracy:
Accuracy = (true positives + true negatives) / total number of samples
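A minimal sketch of the velocity clamping, penalization, and accuracy formulas above. The bounds Max_ss and Min_ss are assumed to apply per component, lambda = 0.5 is a placeholder value, and the helper names are illustrative rather than taken from the paper.

```python
def clamp_velocity(v, max_ss, min_ss, lam=0.5):
    """Velocity clamping: keep each velocity component within [v_min, v_max]."""
    v_max = lam * (max_ss - min_ss)
    v_min = lam * (min_ss - max_ss)
    return [min(max(vi, v_min), v_max) for vi in v]

def penalize(v, x, max_ss, min_ss):
    """Penalization: zero the velocity components whose next position
    (x_i + v_i) would fall outside the search domain [Min_ss, Max_ss]."""
    return [0.0 if (vi + xi > max_ss or vi + xi < min_ss) else vi
            for vi, xi in zip(v, x)]

def accuracy(true_positives, true_negatives, total):
    """Classification accuracy = (TP + TN) / total number of samples."""
    return (true_positives + true_negatives) / total
```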