Basics of Machine Learning

Machine Learning
What is Machine Learning?
Machine learning is a branch of artificial intelligence that involves developing algorithms
and statistical models that allow computers to learn from data and make predictions or decisions
without being explicitly programmed. Machine learning algorithms can be used to identify patterns
in large datasets and use those patterns to make predictions or decisions about new, unseen data.
There are three main types of machine learning:
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
Supervised Learning
In supervised learning, the algorithm is trained on labeled data, meaning that the correct
answer or output is provided for each input. The algorithm then uses this labeled data to make
predictions about new, unseen data.
Supervised learning algorithms are the most commonly used ML algorithms. During training, such
an algorithm takes the data samples (the training data) together with the output associated with
each sample (the labels or responses). Its main objective is to learn an association between the
input data samples and their corresponding outputs from many training instances.
For example, we have:
x: input variables
Y: output variable
Now, apply an algorithm to learn the mapping function from the input to the output as follows:
Y = f(x)
The main objective is to approximate the mapping function so well that, given new input
data (x), we can predict the output variable (Y) for that new input data.
It is called supervised because the learning process can be thought of as being supervised
by a teacher or supervisor. Examples of supervised machine learning algorithms include
Decision Tree, Random Forest, KNN, and Logistic Regression.
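As a minimal sketch of learning a mapping from labeled data, the block below implements a small k-nearest-neighbours (KNN) classifier using only the standard library. The function name `knn_predict`, the toy height/weight samples, and the labels are all hypothetical, chosen for illustration only.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, new_x, k=3):
    """Predict a label for new_x by majority vote among its k nearest labeled samples."""
    # Distance from the new input to every labeled training sample.
    distances = [(math.dist(x, new_x), y) for x, y in zip(train_x, train_y)]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical labeled data: (height_cm, weight_kg) -> label
train_x = [(150, 50), (155, 55), (160, 58), (180, 85), (185, 90), (190, 95)]
train_y = ["small", "small", "small", "large", "large", "large"]

# Predict the output for a new, unseen input.
prediction = knn_predict(train_x, train_y, (158, 57))
print(prediction)
```

KNN does not build an explicit formula for f; it approximates the mapping by looking up the most similar labeled examples at prediction time.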
Based on the ML task, supervised learning algorithms can be divided into two broad classes:
classification and regression.
Classification
The key objective of classification-based tasks is to predict categorical output labels or
responses for the given input data. The output is based on what the model has learned in its
training phase.
Categorical output responses are discrete, unordered values, so each output response belongs
to a specific class or category. Classification and its associated algorithms are discussed
in detail in later chapters.
Classification tasks can be further divided into binary and categorical (multiclass) classification:
1. Binary Classification:
• Binary classification involves categorizing data into two classes or categories.
• Consider a binary classification task of predicting whether an email is spam or not
spam. Here, the two classes are "spam" and "not spam."
• Logistic Regression is commonly used for binary classification tasks. It models the
probability that a given input belongs to one of the two classes using a logistic
function.
• Common evaluation metrics for binary classification include accuracy, precision,
recall, F1-score, and the area under the Receiver Operating Characteristic (ROC)
curve (AUC-ROC).
2. Categorical (Multiclass) Classification:
• Categorical classification involves categorizing data into three or more classes or
categories.
• Imagine classifying images of fruits into categories such as "apple," "banana,"
"orange," and "mango." Here, there are multiple classes.
• Decision Trees, Random Forests, Support Vector Machines (SVM), and Neural
Networks are frequently used for categorical classification tasks. Decision Trees,
Random Forests, and Neural Networks handle multiple classes directly, while SVMs
are typically extended to multiple classes through one-vs-rest or one-vs-one schemes.
• Evaluation metrics for categorical classification include accuracy, precision, recall,
F1-score, and the confusion matrix. When dealing with imbalanced datasets,
class-specific metrics or techniques like class weighting might be necessary.
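To illustrate how a logistic function turns a weighted sum of features into a class probability for the spam example, here is a minimal sketch. The weights, bias, and the two features (link count and exclamation-mark count) are hypothetical, hand-picked values, not parameters learned from data.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to a value in (0, 1)."""
    # Two branches keep math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(features, weights, bias):
    """Probability of the positive class ("spam") under a logistic model."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


def predict_label(features, weights, bias, threshold=0.5):
    """Turn the probability into one of the two classes."""
    return "spam" if predict_proba(features, weights, bias) >= threshold else "not spam"


# Hypothetical parameters; features = (number of links, number of exclamation marks)
weights = [0.8, 0.5]
bias = -2.0

print(predict_label((0, 0), weights, bias))
print(predict_label((3, 2), weights, bias))
```

In a real model the weights and bias would be fitted to labeled training data; the 0.5 threshold is a common default that can be moved to trade precision against recall.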
In summary, binary classification distinguishes between two classes, while categorical
classification assigns data to one of three or more classes. The two share many algorithms and
metrics, but each can call for its own choice of algorithm, evaluation metrics, and strategies
for handling the challenges that arise during model training and evaluation.
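The evaluation metrics named above can be computed directly from true and predicted labels. The following is a small sketch for the binary case; the function name `binary_metrics` and the eight example labels are made up for illustration.

```python
def binary_metrics(y_true, y_pred, positive="spam"):
    """Compute accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 3 spam and 5 non-spam emails.
y_true = ["spam", "spam", "spam", "not spam", "not spam", "not spam", "not spam", "not spam"]
y_pred = ["spam", "spam", "not spam", "spam", "not spam", "not spam", "not spam", "not spam"]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here two spam emails are caught, one is missed, and one legitimate email is flagged, so accuracy is 0.75 while precision and recall are both 2/3.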
Regression
The key objective of regression-based tasks is to predict output labels or responses that
are continuous numeric values for the given input data. The output is based on what the model
has learned in its training phase.
Regression models use the input data features (independent variables) and their
corresponding continuous numeric output values (dependent or outcome variables) to learn
a specific association between inputs and corresponding outputs. Regression and its
associated algorithms are discussed in detail in later chapters.
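As one simple instance of regression, the sketch below fits a straight line to data by ordinary least squares with a single input feature. The function names and the noise-free toy data are hypothetical; real data would not lie exactly on a line.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predict a continuous output for a new input."""
    return slope * x + intercept


# Hypothetical data generated from y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
print(slope, intercept, predict(6, slope, intercept))
```

Unlike a classifier, the model returns a numeric value on a continuous scale rather than a class label.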
Data Setup for supervised learning
Setting up data for supervised learning involves several steps to prepare the data for
training and evaluation. Here's a general overview:
1. Data Collection: Gather the data relevant to the problem you're trying to solve. This can
involve acquiring datasets from public repositories, collecting data through surveys or
experiments, or obtaining data from sensors or other sources.
2. Data Cleaning and Preprocessing:
• Handling Missing Values: Identify and deal with missing values in the dataset.
This can involve imputation (replacing missing values with estimated values) or
removing instances with missing data.
• Handling Outliers: Detect and handle outliers in the data. Outliers can be treated
by removing them, transforming them, or using robust statistical methods.
• Data Normalization/Standardization: Scale numerical features to a similar range
to prevent certain features from dominating the model's learning process. Common
techniques include Min-Max scaling or Z-score normalization.
• Encoding Categorical Variables: Convert categorical variables into a numerical
format that machine learning algorithms can understand. This can involve
techniques such as one-hot encoding or label encoding.
3. Feature Selection and Engineering:
• Feature Selection: Identify the most relevant features that contribute to predicting
the target variable. This can involve techniques like correlation analysis, feature
importance ranking, or domain knowledge.
• Feature Engineering: Create new features from existing ones or transform
existing features to improve the predictive performance of the model. Feature
engineering can involve techniques such as polynomial features, interaction terms,
or binning.
4. Splitting the Data:
• Training Set: The portion of the dataset used to train the machine learning model.
• Validation Set: A separate portion of the dataset used to tune hyperparameters and
evaluate model performance during training.
• Test Set: A final portion of the dataset used to evaluate the model's performance
on unseen data after training is complete.
5. Model Training and Evaluation:
• Selecting a Model: Choose an appropriate machine learning algorithm based on
the nature of the problem, the size and complexity of the dataset, and other factors.
• Training the Model: Fit the selected model to the training data.
• Evaluating the Model: Assess the model's performance using appropriate
evaluation metrics on the validation set. Iterate on model selection, tuning, and
feature engineering based on validation results.
• Final Evaluation: Evaluate the final model's performance on the test set to estimate
its generalization ability to unseen data.
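The preprocessing steps in item 2 can be sketched with plain Python. The helper names (`impute_mean`, `min_max_scale`, `one_hot`) and the tiny age and colour columns are hypothetical. In practice the imputation mean and the scaling range would usually be computed on the training set only and then reused on the validation and test sets.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Encode each category as a 0/1 vector, one position per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories


# Hypothetical raw columns with one missing age.
ages = [20, None, 40, 60]
colors = ["red", "green", "red", "blue"]

ages_filled = impute_mean(ages)
ages_scaled = min_max_scale(ages_filled)
colors_encoded, categories = one_hot(colors)

print(ages_filled, ages_scaled)
print(categories, colors_encoded)
```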
By following these steps, you can effectively set up your data for supervised learning,
ensuring that your machine learning model learns meaningful patterns from the data and makes
accurate predictions on new, unseen instances.
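Splitting the data into training, validation, and test sets can be sketched as follows. The function name, the 60/20/20 proportions, and the fixed seed are assumptions made for illustration; suitable proportions depend on the size of the dataset.

```python
import random


def train_val_test_split(samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the samples and cut them into train, validation and test parts."""
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Hypothetical dataset of 10 sample ids.
data = list(range(10))
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))
```

Every sample ends up in exactly one of the three sets, so the test set stays unseen until the final evaluation.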
Unsupervised Learning
In unsupervised learning, the algorithm is trained on unlabeled data, meaning that the
correct output or answer is not provided for each input. Instead, the algorithm must identify
patterns and structures in the data on its own.
Clustering is a technique in unsupervised learning where data points are grouped together
based on their similarities. The goal of clustering is to partition a set of data points into distinct
groups or clusters, so that data points within the same cluster are more similar to each other than
to those in other clusters. Common clustering algorithms include k-means, hierarchical clustering,
and DBSCAN. Clustering is widely used in various fields such as customer segmentation, image
segmentation, anomaly detection, and data compression. It helps to identify meaningful patterns
and structures within data without the need for predefined labels.
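A minimal sketch of k-means, one of the clustering algorithms mentioned above, is shown below. The function signature, the explicitly supplied starting centroids, and the six toy points are assumptions for illustration; library implementations usually choose the starting centroids themselves and run several restarts.

```python
import math


def kmeans(points, initial_centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    centroids = [tuple(c) for c in initial_centroids]

    def nearest(p):
        return min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))

    for _ in range(iterations):
        # Assignment step: group each point with its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p)].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # assignments no longer change
        centroids = new_centroids

    labels = [nearest(p) for p in points]
    return centroids, labels


# Hypothetical unlabeled 2-D points forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, labels = kmeans(points, initial_centroids=[(0, 0), (10, 10)])
print(labels)
print(centroids)
```

No labels are supplied; the cluster indices come entirely from the distances between points.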
Data Setup for unsupervised learning
Setting up data for unsupervised learning involves preparing a dataset without labeled
target variables. Here are the key steps:
1. Data Collection: Gather relevant data from various sources such as databases, files, or
APIs. Ensure the data is comprehensive and representative of the problem you're trying to
solve.
2. Data Cleaning: Clean the data to remove any inconsistencies, missing values, or errors
that could affect the analysis. Impute missing values or remove them depending on the
extent of missingness and their impact on the analysis.
3. Feature Selection/Extraction: Identify the features (attributes) that are relevant for the
analysis. Sometimes, feature extraction techniques like Principal Component Analysis
(PCA) or feature selection methods like Recursive Feature Elimination (RFE) are used to
reduce the dimensionality of the data.
4. Normalization/Standardization: Scale the features to ensure they have similar ranges.
This step is crucial, especially when using distance-based algorithms like k-means
clustering, to prevent features with larger scales from dominating the analysis.
5. Data Exploration: Explore the data to understand its distributions, correlations, and any
underlying patterns. Visualization techniques like scatter plots, histograms, or pair plots
can be helpful in this phase.
6. Model Selection: Choose appropriate unsupervised learning algorithms based on the
nature of the problem and the characteristics of the data. Common algorithms include
k-means clustering, hierarchical clustering, DBSCAN, and Gaussian mixture models.
7. Model Training: Train the chosen unsupervised learning model on the prepared dataset.
Unlike supervised learning, there is usually no need for separate training and validation
sets, since there are no labeled target variables to validate against.
8. Evaluation: Evaluate the performance of the unsupervised learning model using relevant
metrics. However, evaluation in unsupervised learning is often more subjective and relies
on domain knowledge and interpretation of the results.
9. Iterate: Iterate through the steps if necessary, refining feature selection, model parameters,
or data preprocessing techniques based on the insights gained from the analysis.
By following these steps, you can effectively set up data for unsupervised learning and derive
meaningful insights from the dataset.
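To show why the scaling step matters for distance-based algorithms, the sketch below compares nearest neighbours before and after Z-score standardization. The income and age values are invented for illustration, and `standardize` and `nearest_neighbor` are hypothetical helper names.

```python
import math
import statistics


def standardize(column):
    """Z-score: subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)
    if std == 0:
        return [0.0 for _ in column]
    return [(v - mean) / std for v in column]


def nearest_neighbor(points, index):
    """Index of the point closest (Euclidean distance) to points[index]."""
    others = [i for i in range(len(points)) if i != index]
    return min(others, key=lambda i: math.dist(points[index], points[i]))


# Hypothetical customers: income has a far larger numeric range than age.
incomes = [30000, 31000, 36000, 90000]
ages = [25, 65, 26, 45]

raw = list(zip(incomes, ages))
scaled = list(zip(standardize(incomes), standardize(ages)))

# On raw data income dominates, so customer 0 looks closest to customer 1
# despite a 40-year age gap; after scaling, customer 2 is the closest.
print(nearest_neighbor(raw, 0), nearest_neighbor(scaled, 0))
```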
Reinforcement Learning
In reinforcement learning, the algorithm learns by receiving feedback in the form of rewards or
punishments based on its actions. The algorithm then uses this feedback to adjust its behavior and
improve its performance.
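The feedback loop described above can be sketched with an epsilon-greedy agent on a simple multi-armed bandit, one of the simplest reinforcement learning settings. Everything in the block is an assumption for illustration: the three actions, their mean rewards, the bounded noise, and the choice to try each action once before exploring at random.

```python
import random


def run_bandit(reward_fn, n_actions, steps=200, epsilon=0.1, seed=0):
    """Learn action values from rewards with an epsilon-greedy policy."""
    rng = random.Random(seed)
    estimates = [0.0] * n_actions  # current estimate of each action's reward
    counts = [0] * n_actions       # how often each action has been taken
    for step in range(steps):
        if step < n_actions:
            action = step  # try every action once first
        elif rng.random() < epsilon:
            action = rng.randrange(n_actions)  # explore: random action
        else:
            action = max(range(n_actions), key=lambda a: estimates[a])  # exploit
        reward = reward_fn(action, rng)
        counts[action] += 1
        # Incremental average: nudge the estimate toward the observed reward.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


# Hypothetical environment: action 1 pays the most on average.
MEAN_REWARDS = [0.2, 1.0, 0.5]


def noisy_reward(action, rng):
    return MEAN_REWARDS[action] + rng.uniform(-0.1, 0.1)


estimates, counts = run_bandit(noisy_reward, n_actions=3)
best_action = max(range(3), key=lambda a: estimates[a])
print(best_action, counts)
```

The agent is never told which action is best; it adjusts its behavior using only the rewards it receives.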