Machine Learning is a subset of Artificial
Intelligence that enables computers to
learn patterns from data and make
predictions or decisions without being
explicitly programmed. In traditional
programming, rules are hardcoded by
humans; in machine learning, the rules
are learned from data.
Formula: Data + Algorithm → Model →
Predictions
A Brief History of Machine Learning
• 1950s: Alan Turing proposes the Turing
Test to measure machine intelligence.
• 1957: Frank Rosenblatt develops the
Perceptron, one of the first neural network
models.
• 1980s–1990s: Neural networks and
Support Vector Machines gain popularity.
• 2000s–Present: Advances in data
storage, cloud computing, and GPUs lead
to the deep learning revolution.
The Machine Learning Cycle
ML development is a continuous process
with the following stages:
1. Define the problem clearly.
2. Collect relevant data.
3. Preprocess and clean the data.
4. Select the right model.
5. Train the model with data.
6. Evaluate performance using metrics.
7. Deploy the model into production.
8. Monitor and maintain the model to
ensure accuracy over time.
Step 1: Problem Definition
Clearly identify what you want to solve.
The problem definition should specify:
• The objective (e.g., predict house
prices)
• The input features (e.g., location, size,
number of rooms)
• The expected output (e.g., price)
A well-defined problem prevents wasted
effort and improves model accuracy.
Step 2: Data Collection
Data is the foundation of any ML model.
Sources of Data:
• Public datasets (Kaggle, UCI ML
Repository)
• APIs (Twitter API, Weather API)
• Web scraping
• IoT sensors and devices
Data Types: Structured (tables),
unstructured (images, text, audio)
Step 3: Data Preprocessing
Raw data often contains missing values,
duplicates, or irrelevant features.
Preprocessing improves data quality.
Common Steps:
• Remove or fill missing values.
• Normalize numerical values.
• Encode categorical data.
• Split into training, validation, and test
sets.
Good preprocessing ensures the model
learns accurate patterns.
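The common steps above can be sketched in plain Python. This is only a minimal illustration: the toy records, the helper names (fill_missing, min_max_normalize, one_hot, split), and the 60/20/20 split are assumptions made for the example, not part of these notes.

```python
import random
from statistics import mean

# Hypothetical toy records: (size_m2, city, price). None marks a missing value.
rows = [
    (50.0, "Manila", 100.0),
    (None, "Cebu", 150.0),
    (80.0, "Manila", 210.0),
    (120.0, "Davao", 300.0),
    (65.0, "Cebu", 140.0),
]

def fill_missing(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(labels):
    """Encode each label as a 0/1 vector over the sorted unique labels."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]

def split(items, train=0.6, val=0.2, seed=0):
    """Shuffle, then cut into training, validation, and test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

sizes = min_max_normalize(fill_missing([r[0] for r in rows]))
cities = one_hot([r[1] for r in rows])
features = [[s] + c for s, c in zip(sizes, cities)]
prices = [r[2] for r in rows]
train_set, val_set, test_set = split(list(zip(features, prices)))
```

In a real project, the fill value and the min/max used for normalization should be computed from the training set only and then reused on the validation and test sets, so that no information from the test data leaks into training.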
Step 4: Model Selection
Choosing the right algorithm depends on
the type of problem:
• Supervised Learning: Predict outcomes
using labeled data (e.g., Linear
Regression, Decision Trees, k-NN).
• Unsupervised Learning: Find patterns in
unlabeled data (e.g., k-Means, PCA).
Weka, Google Colab, and even Excel can
be used to test different models.
Step 5: Model Training
Training is the process where the
algorithm learns from the training data.
• The model adjusts its parameters to
minimize errors.
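• Hyperparameters (e.g., learning rate,
number of neighbors in k-NN) are tuned to
improve performance.
A good training process avoids both
underfitting (the model is too simple to
capture the pattern) and overfitting (the
model memorizes the training data and
fails on new data).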
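To make "the model adjusts its parameters to minimize errors" concrete, here is a small sketch of gradient descent fitting a straight line. The function name train_linear, the toy data, and the chosen learning rate and number of epochs are assumptions for illustration only.

```python
def train_linear(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y ≈ w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0          # parameters start at arbitrary values
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Move each parameter a small step against its gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Toy data generated from y = 2x + 1, so training should recover w ≈ 2, b ≈ 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = train_linear(xs, ys)
print(round(w, 3), round(b, 3))
```

Here w and b are the parameters the model learns, while learning_rate and epochs are hyperparameters chosen before training. A learning rate that is too large makes the updates diverge; one that is too small makes training slow.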
Step 6: Model Evaluation
We test the model using unseen data to
check its performance.
Common Metrics:
• Accuracy: % of correct predictions.
• Precision & Recall: Preferred for
imbalanced datasets, where accuracy
alone can be misleading.
• RMSE (Root Mean Squared Error): For
regression problems.
Visualization tools like a confusion matrix
help interpret results.
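As a quick illustration, accuracy and RMSE can be computed with a few lines of Python. The function names and the sample values below are made up for the example.

```python
import math

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error between true and predicted numbers."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Classification: 3 of the 4 predictions are correct.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))

# Regression: errors of 2 and 0, so RMSE = sqrt((4 + 0) / 2).
print(rmse([3.0, 5.0], [1.0, 5.0]))
```

Because RMSE squares each error before averaging, a few large mistakes raise it more than many small ones.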
Step 7: Deployment
Once the model performs well, it is
deployed so it can make predictions in real
time.
Deployment Platforms:
• Web apps (Flask, FastAPI)
• Mobile apps
• Cloud services (AWS, Azure, GCP)
Step 8: Monitoring & Maintenance
Models degrade over time as data
patterns change (model drift).
Best Practices:
• Monitor predictions for accuracy.
• Retrain with new data.
• Keep improving features and algorithms.
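One simple way to monitor predictions is to track accuracy over the most recent cases and flag the model when it drops. The sketch below is only an assumption of how this could look: the class name AccuracyMonitor, the window size, and the threshold are all hypothetical choices.

```python
from collections import deque

class AccuracyMonitor:
    """Tracks accuracy over the most recent predictions."""

    def __init__(self, window=100, threshold=0.8):
        self.outcomes = deque(maxlen=window)  # old outcomes drop off automatically
        self.threshold = threshold

    def record(self, predicted, actual):
        """Store whether one prediction turned out to be correct."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Accuracy over the current window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        """True when recent accuracy has fallen below the threshold."""
        acc = self.accuracy()
        return acc is not None and acc < self.threshold

monitor = AccuracyMonitor(window=4, threshold=0.75)
for predicted, actual in [("spam", "spam"), ("ham", "ham"), ("spam", "ham"), ("ham", "spam")]:
    monitor.record(predicted, actual)
print(monitor.accuracy(), monitor.needs_retraining())
```

This only works when the true outcome eventually becomes known. When labels arrive late or never, teams usually watch the input data distribution instead.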
Understanding Data in ML
Types of data used in ML:
• Numerical: Age, temperature, price
• Categorical: Gender, city, brand
• Text: Reviews, articles
• Image/Video: Photos, CCTV footage
• Audio: Voice commands, music
The data type determines preprocessing
steps and model choice.
Python Libraries for Machine Learning
• Data Handling: pandas, numpy – Load
and manipulate datasets
• Visualization: matplotlib, seaborn, plotly
– Create charts and graphs
• ML Algorithms: scikit-learn – Train and
evaluate models
• Deep Learning: tensorflow, keras,
pytorch – Build neural networks
• Data Collection: requests, BeautifulSoup
– Get data from the web
Summary
• Machine Learning allows computers to
learn from data without explicit
programming.
• The process includes defining the
problem, collecting and preparing data,
choosing and training a model, evaluating
performance, deploying, and maintaining
it.
• Good quality data and the right tools are
key to success.
k-Nearest Neighbors (kNN) is one of the
simplest supervised machine learning
methods.
● A new object is classified by looking
at its k nearest neighbors in the
training data: if most of them
belong to a certain class, the new
object is assigned to that class.
● It is used for both classification
and regression tasks.
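The idea can be sketched in a few lines of Python. This is a minimal version that assumes numeric features and Euclidean distance; the function names knn_predict and knn_regress and the toy points are made up for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classification: majority vote among the k nearest training points."""
    distances = sorted(
        (math.dist(p, query), label) for p, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

def knn_regress(train_points, train_targets, query, k=3):
    """Regression: average the target values of the k nearest training points."""
    distances = sorted(
        (math.dist(p, query), target) for p, target in zip(train_points, train_targets)
    )
    nearest = distances[:k]
    return sum(target for _, target in nearest) / len(nearest)

points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, (2, 2)))   # query sits inside the "A" group
print(knn_predict(points, labels, (6, 5)))   # query sits next to the "B" group
```

Note that kNN has no real training step: it stores the training data and does all the work at prediction time, which is why it becomes slow on very large datasets.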
Common distance metrics:
• Euclidean distance
• Manhattan distance
• Minkowski distance
• Hamming distance
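For reference, the four metrics can be written as short Python functions. Minkowski distance has a parameter p; with p = 1 it becomes Manhattan distance and with p = 2 it becomes Euclidean distance. The function names here are chosen for the example.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two numeric vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def euclidean(a, b):
    """Straight-line distance (Minkowski with p = 2)."""
    return minkowski(a, b, 2)

def manhattan(a, b):
    """Sum of absolute differences per dimension (Minkowski with p = 1)."""
    return minkowski(a, b, 1)

def hamming(a, b):
    """Number of positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

print(euclidean((0, 0), (3, 4)))        # 3-4-5 triangle
print(manhattan((0, 0), (3, 4)))        # 3 blocks + 4 blocks
print(hamming("GATTACA", "GACTATA"))    # differs at two positions
```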
Choosing the value of k:
If k is too small, the model is sensitive to
noise points.
If k is too large, the neighborhood may
include points from other classes.
Rule of thumb:
k ≈ sqrt(N)
N: number of training points
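The rule of thumb is easy to turn into a starting value for k. In the sketch below, rounding to the nearest whole number and bumping even values up to the next odd number (to reduce ties in two-class problems) are assumptions added for the example; the function name suggest_k is hypothetical.

```python
import math

def suggest_k(n_training_points):
    """Starting value for k from the sqrt(N) rule of thumb."""
    k = max(1, round(math.sqrt(n_training_points)))
    if k % 2 == 0:
        k += 1   # assumption: prefer odd k to reduce tied votes
    return k

print(suggest_k(25))    # sqrt(25) = 5, already odd
print(suggest_k(100))   # sqrt(100) = 10, bumped to the next odd number
```

This is only a starting point. The usual practice is to try several values of k around it and keep the one that performs best on the validation set.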
EUCLIDEAN DISTANCE
Use when:
● Data is continuous (real
numbers).
● Features are on the same scale
(e.g., all measured in cm, seconds,
etc.).
● You expect data to follow a
circular or spherical geometry
(points close in a straight-line
sense are more similar).
Examples:
● Image recognition (pixel intensity
values).
● Sensor readings (temperature,
humidity, etc.).
● Geometric problems where
“straight-line distance” makes
sense.
MANHATTAN DISTANCE
Use when:
● Data is continuous, but you
expect differences along individual
dimensions to matter more than
the overall straight-line distance.
● Geometry is more like a grid (like
moving in a city with streets).
● Dataset has high-dimensional
data
Examples:
● Predicting taxi fares in a city grid
(distance = blocks traveled).
● Text mining with word counts
(Bag-of-Words models).
● High-dimensional features where
sparse differences matter.
HAMMING DISTANCE
Use when:
● Data is categorical (not
numerical).
● Features are strings, binary
vectors, or discrete symbols.
● You only care about exact
matches vs mismatches, not
magnitude of difference.
Examples:
● Comparing DNA sequences (A,
T, C, G).
● Error detection in communication
(bit strings).
● Categorical survey data (Yes/No,
Male/Female).
WHEN TO USE KNN:
● Datasets are relatively small to
medium in size
● Decision boundaries are
irregular
● Interpretability is important
● No strong assumptions about data
● Applications where similarity
matters
○ Medical diagnosis: A new
patient is compared with
past patients.
○ Recommendation:
Suggest items similar to a
user’s past choices.
○ Anomaly detection: Flag
unusual cases by checking
if they’re far from normal
clusters.
WHEN NOT TO USE KNN
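● With very large datasets (every
prediction compares the new point
with all training points, so it
becomes slow)
● With high-dimensional data
(distances become less meaningful
as dimensions grow)
● Without feature scaling (features
with large ranges dominate the
distance)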
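A small example shows why feature scaling matters for kNN. The people below are hypothetical (age in years, income in pesos), and the helper names min_max_scale and nearest_index are made up for the sketch.

```python
import math

def min_max_scale(points):
    """Scale every feature to [0, 1]; also return a function to scale new points.

    Assumes each feature has at least two distinct values (no zero range).
    """
    columns = list(zip(*points))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]

    def scale(point):
        return tuple((v - lo) / (hi - lo) for v, lo, hi in zip(point, lows, highs))

    return [scale(p) for p in points], scale

def nearest_index(points, query):
    """Index of the training point closest to the query (Euclidean distance)."""
    return min(range(len(points)), key=lambda i: math.dist(points[i], query))

# Hypothetical records: (age, income).
people = [(25, 50_000), (60, 50_500), (20, 20_000), (70, 120_000)]
query = (26, 50_400)

raw_nearest = nearest_index(people, query)
scaled_people, scale = min_max_scale(people)
scaled_nearest = nearest_index(scaled_people, scale(query))

print(people[raw_nearest])      # unscaled: the 60-year-old, income gap dominates
print(people[scaled_nearest])   # scaled: the 25-year-old, both features count
```

Without scaling, a difference of a few hundred pesos outweighs a 34-year age gap simply because income is measured in larger numbers.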
WHAT HAPPENS WHEN THERE’S NO
MAJORITY?
Common tie-breaking approaches in kNN:
● Choose the class of the closest
neighbor
● Distance-weighted voting
● Reduce k (prefer odd numbers)
● Random selection among tied
classes
● Domain-specific rules
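Distance-weighted voting, one of the approaches listed above, can be sketched as follows. Each neighbor's vote is weighted by 1 / distance, which is one common choice and an assumption here; the function name weighted_knn_predict and the small constant that avoids division by zero are also made up for the example.

```python
import math
from collections import defaultdict

def weighted_knn_predict(train_points, train_labels, query, k=4):
    """kNN where closer neighbors get a larger vote (weight = 1 / distance)."""
    neighbors = sorted(
        (math.dist(p, query), label) for p, label in zip(train_points, train_labels)
    )[:k]
    votes = defaultdict(float)
    for distance, label in neighbors:
        votes[label] += 1.0 / (distance + 1e-9)   # small constant avoids dividing by 0
    return max(votes, key=votes.get)

# With k = 4 a plain vote is tied 2 vs 2, but the two "A" points are closer.
points = [(0, 0), (0, 1), (3, 0), (3, 1)]
labels = ["A", "A", "B", "B"]
print(weighted_knn_predict(points, labels, (1, 0.5), k=4))
```

Weighted voting rarely produces exact ties, because two classes would need identical total weights.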
IMPROVING KNN
● Experiment with different values
of k.
● Use weighted kNN (closer
neighbors have more influence).
● Apply feature engineering
(categorizing BMI, grouping ages).
● Reduce dimensionality (PCA) for
large feature spaces.
APPLICATIONS
● Healthcare: Predicting diseases
(diabetes, heart disease)
● Finance: Detecting fraudulent
transactions
● E-commerce: Recommending
similar products
● Security: Detecting abnormal
network behavior
● Image Recognition: Identifying
handwritten digits.
FORMULAS FOR F1-SCORE:
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
\[ \text{Precision} = \frac{TP}{TP + FP} \]
\[ \text{Recall} = \frac{TP}{TP + FN} \]
\[ \text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
where TP = True Positive, TN = True Negative,
FP = False Positive, FN = False Negative.
Confusion matrix layout (rows = actual
class, columns = predicted class):
TN | FP
FN | TP
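As a worked example, the metrics can be computed directly from the four confusion matrix counts. The counts used below are hypothetical, and the sketch assumes no denominator is zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion matrix counts.

    Assumes tp + fp > 0 and tp + fn > 0 so that no division by zero occurs.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts: TN = 45, FP = 5, FN = 10, TP = 40.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, accuracy is 85/100, precision is 40/45, and recall is 40/50. F1 falls between precision and recall, closer to the lower of the two.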
APRIORI ALGORITHM
ASSOCIATION RULE MINING
● Association rule mining is a
technique used to identify
patterns in large data sets.
● It involves finding relationships
between variables in the data and
using those relationships to make
predictions or decisions.
● The goal of association rule mining
is to uncover rules that describe
the relationships between
different items in the data set.
APRIORI ALGORITHM
● Apriori is an algorithm designed to
extract frequent itemsets from
transactional databases and
generate association rules.
● It is based on the principle that if
an itemset is frequent, all its
subsets must also be frequent.
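SUPPORT: The proportion of transactions
in which an item (or itemset) appears.
Support(A) = (transactions containing A) / (all transactions)
CONFIDENCE: The likelihood that item B
is purchased when item A is purchased.
Confidence(A → B) = Support(A and B) / Support(A)
LIFT: The strength of a rule, measuring
how much more likely item B is bought
when item A is bought, compared to how
often B is bought overall.
Lift(A → B) = Confidence(A → B) / Support(B)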
Lift > 1 → Positive association
Lift = 1 → No association
Lift < 1 → Negative association
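A small worked example makes the three measures concrete. The five transactions below are made up, and the function names are chosen for the sketch.

```python
# Hypothetical shopping baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

def support(itemset, transactions):
    """Proportion of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(a, b, transactions):
    """Of the transactions containing a, the proportion that also contain b."""
    return support(set(a) | set(b), transactions) / support(a, transactions)

def lift(a, b, transactions):
    """Confidence of a -> b relative to how common b is overall."""
    return confidence(a, b, transactions) / support(b, transactions)

print(support({"bread"}, transactions))               # 4 of 5 baskets
print(support({"bread", "milk"}, transactions))       # 3 of 5 baskets
print(confidence({"bread"}, {"milk"}, transactions))  # 3 of the 4 bread baskets
print(lift({"bread"}, {"milk"}, transactions))        # below 1 for this data
```

Here the lift of bread → milk is about 0.94, slightly below 1: milk is already in 80% of all baskets, so buying bread does not make milk more likely.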
MINIMUM SUPPORT: the minimum
proportion of transactions in which an
itemset must appear to be considered
frequent.
Purpose: to eliminate infrequent
itemsets early in the process
MINIMUM CONFIDENCE: the minimum
proportion of transactions containing
itemset A that must also contain itemset B
for the rule A → B to be kept.
Purpose: to filter out association
rules that are not strong or reliable
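Putting the pieces together, a compact version of Apriori might look like the sketch below. It is a teaching version under stated assumptions, not an optimized implementation: the function names apriori and generate_rules, the toy baskets, and the thresholds (0.4 and 0.7) are all chosen for the example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frequent itemset: support} for all itemsets meeting min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}
    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: drop any candidate with an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

def generate_rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence) for rules meeting min_confidence."""
    rules = []
    for itemset, itemset_support in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(itemset, size):
                a = frozenset(antecedent)
                conf = itemset_support / frequent[a]
                if conf >= min_confidence:
                    rules.append((a, itemset - a, conf))
    return rules

# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]
frequent = apriori(baskets, min_support=0.4)
rules = generate_rules(frequent, min_confidence=0.7)
for a, b, conf in rules:
    print(sorted(a), "->", sorted(b), round(conf, 2))
```

The prune step is where the Apriori principle saves work: the candidate {bread, milk, butter} is discarded without counting it, because its subset {milk, butter} is not frequent.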
APPLICATIONS
• Market basket analysis: Retailers use
Apriori to analyze purchase patterns,
helping them arrange products to
encourage combined purchases.
• Recommendation systems: Online
platforms use Apriori to suggest products
based on previous purchases.
• Anomaly detection: Apriori identifies
unusual transactions by comparing them
to expected patterns.