
Introduction to Statistical Learning: Concepts, Models & Goals

Lecture 01 – Introduction to Statistical Learning
INTRODUCTION: Statistical learning refers to a vast set of tools for understanding data. These
tools and techniques model the relationship between inputs X and an output Y.
The core objective is to estimate a function f(X) that captures the systematic
information in the data, thereby enabling prediction or inference.
Statistical learning has two fundamental goals:
a. Prediction: Accurately estimate future or unknown outcomes Y from the inputs X
b. Inference: Understand how each input variable Xj affects the Output Y.
Statistical learning finds application in every sector where data is generated and exchanged. Common
industries include health diagnostics, financial forecasting, marketing analytics, and
bioinformatics.
SOME CORE FUNDAMENTALS:
Aspect | Data | Dataset
Definition | Raw facts, figures, or observations (e.g., numbers, text, images). | A structured collection of related data, often organized in tabular form.
Granularity | Atomic or individual units of information. | Aggregated and organized data for a specific purpose or analysis.
Structure | May be unstructured or semi-structured. | Typically structured with rows and columns.
Purpose | Basic building blocks for information. | Used for analysis, modelling, or reporting.
Examples | "25", "Male", "Blue", "2025-09-13" | Table of customer demographics, sales records, or survey responses.
Format | Can be text, numbers, audio, video, etc. | Usually stored in formats like CSV, Excel, JSON, or SQL tables.
Usage | Found in sensors, logs, documents, etc. | Used in data science, machine learning, and statistical analysis.
Context | Highly flexible and context-dependent. | More rigid and designed for specific tasks or studies.
STATISTICAL VERSUS MACHINE LEARNING MODEL:
Aspect | Statistical Model | Machine Learning Model
Primary Goal | Explain relationships between variables, test hypotheses, and make inferences about a population. | Make accurate predictions or classifications from data, often without explicit assumptions about relationships.
Approach | Based on predefined mathematical equations and probability distributions; focuses on parameter estimation and inference. | Learns patterns directly from data using algorithms; focuses on predictive performance.
Assumptions | Strong assumptions about data (e.g., normality, independence, linearity, no multicollinearity). | Few or no strict assumptions about data distribution; can handle complex, non-linear relationships.
Data Requirements | Works well with smaller datasets if assumptions hold. | Performs best with large datasets; can handle high-dimensional data.
Interpretability | High — coefficients and parameters have clear meanings; easier to explain results. | Often lower — some models (e.g., deep learning) are "black boxes" with less interpretability.
Examples | Linear regression, logistic regression, ANOVA, time series models. | Decision trees, random forests, support vector machines, neural networks.
Output Focus | Estimates parameters and quantifies uncertainty (confidence intervals, p-values). | Produces predictions or classifications; may not provide parameter estimates.
Evaluation Metrics | Goodness-of-fit measures (R², adjusted R²), hypothesis tests, AIC/BIC. | Prediction accuracy, precision, recall, F1-score, ROC-AUC, cross-validation error.
Strengths | Strong theoretical foundation; results are explainable and statistically valid if assumptions hold. | Flexible, can model complex patterns; often higher predictive accuracy in complex tasks.
Limitations | Sensitive to assumption violations; may underperform on highly non-linear or unstructured data. | May overfit without proper regularization; interpretability can be challenging.
STATISTICAL LEARNING VERSUS MACHINE LEARNING:
Aspect | Statistical Learning | Machine Learning
Primary Goal | Understand and interpret relationships between variables; make inferences about the data-generating process. | Maximise predictive accuracy on unseen data; focus on pattern recognition and automation.
Approach | Starts with a statistical model (often parametric) and estimates parameters; theory-driven. | Uses algorithms to learn patterns directly from data; data-driven.
Assumptions | Often requires strong assumptions (e.g., linearity, normality, independence, no multicollinearity). | Fewer or no strict assumptions about data distribution; can model complex, non-linear relationships.
Interpretability | High — parameters have clear meaning; easier to explain results. | Often lower — some models (e.g., deep learning) are "black boxes" with less interpretability.
Data Requirements | Works well with smaller datasets if assumptions hold. | Performs best with large datasets; can handle high-dimensional and unstructured data.
Typical Methods | Linear regression, logistic regression, ANOVA, generalized linear models. | Decision trees, random forests, support vector machines, neural networks, gradient boosting.
Output Focus | Parameter estimates, confidence intervals, hypothesis tests, p-values. | Predictions, classifications, probability scores; may not provide parameter estimates.
Evaluation Metrics | Goodness-of-fit measures (R², adjusted R²), AIC, BIC, hypothesis testing. | Prediction accuracy, precision, recall, F1-score, ROC-AUC, cross-validation error.
Strengths | Strong theoretical foundation; results are explainable and statistically valid if assumptions hold. | Flexible; can model complex patterns; often higher predictive accuracy in complex tasks.
Limitations | Sensitive to assumption violations; may underperform on highly non-linear or unstructured data. | May overfit without proper regularization; interpretability can be challenging.
SUPERVISED VERSUS UNSUPERVISED STATISTICAL MODEL:
Aspect | Supervised Statistical Model | Unsupervised Statistical Model
Definition | Learns a mapping from inputs X to outputs Y using labelled data. | Finds patterns or structure in unlabelled data without predefined outputs.
Data Requirement | Requires both input variables and corresponding output labels. | Requires only input variables; no output labels are provided.
Goal | Predict outcomes for new data and/or understand the relationship between X and Y. | Discover hidden patterns, groupings, or data structure.
Examples of Tasks | Classification (spam detection), regression (predicting house prices). | Clustering (customer segmentation), dimensionality reduction (PCA).
Algorithms | Linear regression, logistic regression, decision trees, random forests, SVM, neural networks. | K-means clustering, hierarchical clustering, PCA, t-SNE, association rule mining.
Evaluation | Accuracy, RMSE, precision/recall, ROC-AUC — compared against known labels. | Silhouette score, within-cluster sum of squares, variance explained — no labels to compare.
Supervision | "Teacher" present — model learns from correct answers during training. | No "teacher" — model learns patterns on its own.
Interpretability | Often easier to interpret in simple models; complex models may be opaque. | Interpretation focuses on discovered structure, which may be subjective.
Use Cases | Medical diagnosis, credit scoring, demand forecasting. | Market basket analysis, anomaly detection, exploratory data analysis.
SUPERVISED, UNSUPERVISED AND REINFORCEMENT LEARNING:
Aspect | Supervised Learning | Unsupervised Learning | Reinforcement Learning
Definition | Learns from labelled data — each input has a known output. | Learns from unlabelled data — finds patterns or structure without predefined outputs. | Learns by interacting with an environment, receiving rewards or penalties for actions.
Type of Data | Labelled dataset (X, Y pairs). | Unlabelled dataset (X only). | No fixed dataset — learns from state, action, reward sequences.
Goal | Predict outcomes for new inputs accurately. | Discover hidden patterns, clusters, or relationships in data. | Learn an optimal policy to maximize cumulative rewards over time.
Learning Process | Model maps inputs to outputs using training examples. | Model groups, reduces, or organizes data based on similarity or structure. | Agent explores and exploits actions to improve decision-making via feedback.
Supervision | Requires external supervision (labels). | No supervision — self-organizing. | No explicit supervision — learns from feedback signals.
Common Algorithms | Linear/Logistic Regression, Decision Trees, Random Forest, SVM, Neural Networks. | K-Means, Hierarchical Clustering, PCA, Autoencoders, t-SNE. | Q-Learning, SARSA, Deep Q-Networks (DQN), Policy Gradient Methods.
Example Problems | Classification (spam detection), regression (price prediction). | Clustering (customer segmentation), dimensionality reduction (PCA). | Game playing (Chess, Go), robotics control, self-driving cars.
Output | Predicted labels or continuous values. | Group assignments, reduced-dimension features, association rules. | Sequence of actions (policy) that maximizes long-term reward.
Evaluation Metrics | Accuracy, Precision, Recall, F1, RMSE, MAE. | Silhouette score, Davies–Bouldin index, reconstruction error. | Cumulative reward, average reward per episode, convergence rate.
Key takeaway:
 Supervised = Learn from answers given
 Unsupervised = Find structure without answers
 Reinforcement = Learn by trial, error, and reward
REGRESSION VERSUS CLASSIFICATION PROBLEMS:
Aspect | Regression | Classification
Output Type | Continuous numeric values (e.g., price, temperature, weight). | Discrete categories or class labels (e.g., spam/not spam, disease/no disease).
Goal | Predict a real-valued outcome based on input features. | Assign inputs to one of several predefined classes.
Nature of Target Variable | Ordered, measurable quantities. | Unordered, qualitative categories.
Examples | Predicting house prices, forecasting sales, estimating rainfall amount. | Email spam detection, image recognition, medical diagnosis.
Model Output | A number on a continuous scale. | A class label (or probability of belonging to each class).
Evaluation Metrics | Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), R². | Accuracy, Precision, Recall, F1-score, ROC-AUC.
Decision Mechanism | Fits a function (e.g., best-fit line) to approximate numeric outcomes. | Finds decision boundaries that separate classes.
Common Algorithms | Linear regression, polynomial regression, regression trees, support vector regression. | Logistic regression, decision trees, random forests, k-NN, support vector machines.
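To make the distinction concrete, here is a minimal sketch in Python (assuming NumPy and scikit-learn are installed; the synthetic data and model choices are illustrative only): a linear regression fitted to a continuous target, and a logistic regression fitted to a binary class label.

```python
# Regression vs. classification on small synthetic datasets (illustrative sketch).
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, accuracy_score

rng = np.random.default_rng(0)

# Regression: continuous target (a noisy linear trend).
X_reg = rng.uniform(0, 10, size=(200, 1))
y_reg = 3.0 * X_reg[:, 0] + 5.0 + rng.normal(0, 2.0, size=200)
reg = LinearRegression().fit(X_reg, y_reg)
print("Regression MSE:", mean_squared_error(y_reg, reg.predict(X_reg)))

# Classification: discrete target (two classes separated by a threshold on X).
X_clf = rng.uniform(0, 10, size=(200, 1))
y_clf = (X_clf[:, 0] + rng.normal(0, 1.0, size=200) > 5).astype(int)
clf = LogisticRegression().fit(X_clf, y_clf)
print("Classification accuracy:", accuracy_score(y_clf, clf.predict(X_clf)))
```

Note that both models are evaluated on their own training data here purely to keep the sketch short; later sections cover why test data is required.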
CURSE OF DIMENSIONALITY
The curse of dimensionality refers to the strange and often problematic things that happen
when working with high-dimensional data (data with many features/variables).
As the number of dimensions (p) increases:
 The volume of the space grows exponentially.
 Data points become sparse — they are far apart from each other.
 Many algorithms that rely on “closeness” or “locality” (like nearest neighbours, kernel
smoothing) start to break down.
Problems in Machine Learning:
Distance loses meaning
 In high dimensions, the difference between the nearest and farthest neighbour
distances becomes very small.
 “Nearest” points are still far away → local averaging becomes ineffective.
Data sparsity
 To maintain the same density of points as dimensions grow, you need an
exponentially larger dataset.
 Example: A 10% “neighbourhood” in 1D is small and local; in 100D, it covers almost
the entire space.
Overfitting risk
 With many features, models can fit noise instead of signal.
 More parameters → higher variance unless regularised.
Computational cost
 More features → more storage, longer training times, higher memory use.
Key Example:
 Nearest neighbour averaging works well when the number of predictors (p) is small
(about 4 or fewer) and the dataset is reasonably large (N).
 When p becomes large, the method often performs poorly — this is called the curse
of dimensionality.
 In high dimensions, even the “nearest” data points are still far away from the target
point.
 To reduce variance in the estimate, we need to average over a reasonable fraction
of the data (e.g., 10%).
 But in high dimensions, a 10% neighbourhood is no longer local — it covers a huge,
spread-out region.
 This means we lose the local averaging spirit that makes nearest neighbour methods
effective in low dimensions.
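The "distance loses meaning" point can be seen numerically. The sketch below (a rough illustration, assuming NumPy; the sample size and the list of dimensions are arbitrary) draws random points in the unit hypercube and shows how the gap between the nearest and farthest neighbour shrinks, relative to the nearest distance, as p grows.

```python
# Distance concentration in high dimensions (illustrative sketch).
# For random points, the relative gap between the nearest and farthest
# neighbour distances shrinks as the number of dimensions p grows.
import numpy as np

rng = np.random.default_rng(0)
n = 500  # number of data points

for p in [1, 2, 10, 100, 1000]:
    X = rng.uniform(size=(n, p))          # points in the unit hypercube
    q = rng.uniform(size=(1, p))          # a query point
    d = np.linalg.norm(X - q, axis=1)     # Euclidean distances to the query
    spread = (d.max() - d.min()) / d.min()
    print(f"p={p:5d}  nearest={d.min():.3f}  farthest={d.max():.3f}  relative spread={spread:.3f}")
```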
CONCEPT OF TEST DATA AND TRAINING DATA:
Aspect | Training Dataset | Test Dataset
Purpose | Used to teach the model — the algorithm learns patterns, relationships, and parameters from this data. | Used to evaluate the trained model's performance on unseen data.
Model Exposure | The model sees this data during training and adjusts its parameters accordingly. | The model never sees this data during training — it's kept separate to ensure an unbiased evaluation.
Size | Usually the larger portion of the dataset (e.g., 70–80%) to give the model enough examples to learn from. | Usually the smaller portion (e.g., 20–30%) reserved for final performance testing.
Role in Model Development | Determines how well the model can fit the training data; used for parameter estimation. | Determines how well the model can generalise to new, unseen data.
Metrics Computed | Training error, loss function values during learning. | Test error, accuracy, precision, recall, F1-score, ROC-AUC, etc.
Risk if Misused | If too small, the model may underfit due to insufficient learning. | If used during training, it can lead to overfitting and overly optimistic performance estimates.
Analogy | Like studying from a textbook before an exam. | Like taking the final exam to check how well you learned.
 Training set → “Learn”
 Test set → “Prove”
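A minimal hold-out split sketch (assuming NumPy and scikit-learn; the 80/20 split, the synthetic data, and the linear model are illustrative choices, not requirements):

```python
# Hold-out split: fit on the training portion, evaluate on the unseen test portion.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = X[:, 0] ** 2 + rng.normal(0, 1.0, size=300)   # non-linear truth plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)                # "Learn" on the training set
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))    # "Prove" on the test set
print(f"Training MSE: {train_mse:.2f}   Test MSE: {test_mse:.2f}")
```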
THE TRADE-OFFS IN MACHINE LEARNING MODELS: There are three fundamental trade-offs in ML models:
a. Prediction accuracy versus Interpretability
b. Good Fit versus Over-Fit or Under-Fit
c. Parsimony versus Black Box
THE TRADE-OFF BETWEEN PREDICTION ACCURACY AND MODEL INTERPRETABILITY:
 Simple models (e.g., linear regression) are easy to interpret but may miss complex patterns, leading to lower accuracy on unseen data.
 Complex models (e.g., random forests, neural networks) can capture intricate, nonlinear relationships, often improving accuracy, but their inner workings are harder to explain.
 Inference-focused tasks (understanding relationships, testing hypotheses) benefit from interpretability, even if prediction accuracy is slightly lower.
 Prediction-focused tasks (spam detection, image recognition) often prioritize accuracy over interpretability, as long as the model is well-validated.
 Increasing flexibility to boost accuracy can also increase variance and the risk of overfitting, which may paradoxically reduce predictive performance on new data.
 Interpretability supports trust, transparency, and regulatory compliance, which can outweigh small gains in accuracy in sensitive domains like healthcare or finance.
 The optimal balance depends on context: in high-stakes, regulated environments, interpretability may be non-negotiable; in low-risk, high-volume prediction tasks, accuracy may dominate.
 Hybrid approaches (e.g., interpretable surrogates, post-hoc explanation tools like SHAP or LIME) aim to retain much of the accuracy of complex models while restoring some interpretability.
Trade-off between More Restrictive (Low Flexibility) and More Flexible (High Flexibility) Models:
When to Prefer… | More Restrictive Model (Low Flexibility) | More Flexible Model (High Flexibility)
Data Size | Small dataset – avoids overfitting. | Large dataset – enough information to fit complex patterns.
True Relationship | Likely simple or linear. | Likely complex or highly non-linear.
Noise Level | High noise – simpler model avoids chasing randomness. | Low noise – can capture subtle patterns without overfitting.
Goal | Interpretability, inference, transparency. | Maximum predictive accuracy.
Predictors vs. Observations | Many predictors relative to sample size. | Few predictors but complex interactions.
Risk Tolerance | High-stakes or regulated domains – need explainability. | Low-stakes or competitive domains – performance is the priority.
Bias–Variance Profile | Higher bias, lower variance. | Lower bias, higher variance.
Examples | Linear regression, logistic regression. | Random forests, splines, neural networks.
To sum-up:
 Prediction Accuracy: How well a model predicts outcomes on unseen data.
o High accuracy means the model generalises well.
o Often achieved by using flexible, complex models that can capture intricate
patterns.
 Interpretability: How easily humans can understand the relationship between inputs
and outputs in the model.
o High interpretability means you can explain why the model made a certain
prediction.
o Often associated with simpler models.
The tension:
As model flexibility increases, accuracy on training data often improves — but interpretability
usually decreases, and the risk of overfitting rises.
Why this matters:
When interpretability is critical:
 Regulatory compliance (finance, healthcare).
 Scientific research (understanding causal relationships).
 Stakeholder trust (executives, customers).
When accuracy is the priority:
 Recommendation systems.
 Spam detection.
 Image recognition.
In these cases, a "black-box" model may be acceptable if it performs better.
Model Type | Flexibility | Prediction Accuracy | Interpretability
Linear Regression | Low | Moderate (if relationship is truly linear) | High — coefficients have clear meaning
Decision Tree (shallow) | Low–Moderate | Moderate | High — easy to visualise
Random Forest | High | High | Low — many trees, hard to interpret
Gradient Boosting | High | Very High | Low — complex ensemble
Neural Network (deep) | Very High | Very High | Very Low — "black box"
Some key insights:
 More flexible methods can fit a wider range of shapes for f(X), but at the cost of
interpretability.
 Sometimes, less flexible models are chosen deliberately for their explanatory power,
even if accuracy is slightly lower.
 Machine Learning Mastery:
o "Unfortunately, the predictive models that are most powerful are usually the least interpretable."
o As complexity increases, so do the number of parameters and the difficulty of explaining them.
 Computational Social Science Bookdown:
o Low flexibility → better interpretation, worse prediction.
o High flexibility → better prediction potential, but can overfit and lose interpretability.
THE BIAS-VARIANCE TRADE-OFF:
The bias–variance trade-off is one of the most important ideas in statistical modelling and
machine learning because it explains why models perform the way they do on new, unseen
data.
The Two Sources of Error: In machine learning, prediction error can be split into three parts — bias, variance, and irreducible error:
 Bias and variance are the reducible parts — we can influence them by changing the model.
 Irreducible error is noise in the data that we can't remove.
When we build a predictive model, the total expected prediction error can be decomposed into
three parts:
 Bias – Error from wrong assumptions in the model.
o High bias means the model is too simple to capture the true relationship (e.g., fitting a straight line to a curved pattern).
o Leads to underfitting — both training and test errors are high.
 Variance – Error from too much sensitivity to the training data.
o High variance means the model fits the noise as well as the signal (e.g., a very wiggly curve that passes through every training point).
o Leads to overfitting — training error is low, but test error is high.
 Irreducible Error – Noise in the data that no model can remove.
The Trade-Off:

Increasing model flexibility (e.g., moving from linear regression to high-degree
polynomials) → Bias decreases (model can fit more complex patterns) → Variance
increases (model becomes more sensitive to fluctuations in the training set).
 Decreasing model flexibility (e.g., using a very simple model) → Bias increases
(misses real patterns) → Variance decreases (predictions are more stable across
datasets)
 The sweet spot is where Bias² + Variance is minimized — this is the point of optimal
model complexity.
Visual Intuition:
 Rigid models: High bias, low variance → underfit.
 Moderately flexible models: Balanced bias and variance → best generalization.
 Highly flexible models: Low bias, high variance → overfit.
If we plot Test MSE vs. Model Flexibility, we see the classic U-shaped curve for test
error, with training error steadily decreasing.
Why It Matters
Understanding this trade-off helps you:
 Choose the right model complexity.
 Apply regularization (like Lasso or Ridge) to control variance.
 Avoid chasing low training error at the cost of poor generalization.
The Bias–Variance Equation: The expected test error at a point x0 can be decomposed as
E[(y0 − f^(x0))²] = Var(f^(x0)) + [Bias(f^(x0))]² + Var(ε).
This equation is foundational to understanding model performance in statistical learning.
1. Trade-off between Bias and Variance
 Simple models (like linear regression) tend to have low variance but high bias.
 Complex models (like deep neural nets) tend to have low bias but high variance.
 The art of modelling lies in balancing these two to minimize total error.
2. Irreducible Error Is a Reality Check
 No matter how perfect your model is, you can’t eliminate noise.
3. Model Evaluation Must Be Holistic
 Focusing only on training accuracy can be misleading.
 You need to assess generalization—how well the model performs on unseen data.
 Techniques like cross-validation help estimate this decomposition empirically.
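The decomposition can also be estimated empirically on simulated data, where the true f and the noise level are known. The rough sketch below (assuming NumPy; the sine function, the noise level, and the polynomial degrees are arbitrary illustrative choices) refits polynomials on many resampled training sets and reports bias², variance, and the irreducible noise variance at a single point.

```python
# Empirical bias-variance decomposition at a single point x0 (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(2 * x)       # the "true" (normally unknown) function

sigma = 0.3                    # std. dev. of the irreducible noise epsilon
x0, n, reps = 0.5, 50, 500     # evaluation point, training size, number of resamples

for degree in [1, 3, 9]:
    preds = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(-1, 1, n)
        y = f(x) + rng.normal(0, sigma, n)
        coeffs = np.polyfit(x, y, degree)      # least-squares polynomial fit
        preds[r] = np.polyval(coeffs, x0)      # prediction f_hat(x0) for this training set
    bias2 = (preds.mean() - f(x0)) ** 2
    variance = preds.var()
    print(f"degree={degree}  bias^2={bias2:.4f}  variance={variance:.4f}  "
          f"irreducible={sigma**2:.4f}  total~{bias2 + variance + sigma**2:.4f}")
```

As the degree rises, the bias² term typically falls while the variance term grows, mirroring the trade-off described above.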
Concept of Variance and Bias in Statistical Modelling
 Variance refers to the amount by which f^ would change if we estimated it using a different training dataset. Because the training data are used to fit the statistical learning method, different datasets will result in a different f^.
 Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model.
Aspect | Bias | Variance
Definition | Error due to wrong assumptions in the model. | Error due to the model's sensitivity to the training data.
Cause | Model is too simple to capture real patterns. | Model is too complex and overfits noise.
Effect | Misses important trends → underfitting. | Captures noise as if it were signal → overfitting.
Training Error | High (model doesn't fit training data well). | Low (model fits training data very closely).
Test Error | High (poor generalization due to oversimplification). | High (poor generalization due to overfitting).
Model Behaviour | Predictions are consistently wrong. | Predictions vary wildly with different datasets.
Example Model | Linear regression on non-linear data. | High-degree polynomial regression on noisy data.
Model Complexity | Low complexity → high bias. | High complexity → high variance.
Solution | Use a more flexible model. | Use regularization or a simpler model.
Note:
 Good test set performance of a statistical learning method requires low variance
as well as low squared bias. This is referred to as a trade-off because it is easy to
obtain a method with extremely low bias but high variance (for instance, by drawing a
curve that passes through every single training observation) or a method with very low
variance but high bias (by fitting a horizontal line to the data). The challenge lies in
finding a method for which both the variance and the squared bias are low.
 As a general rule, as we use more flexible methods, the variance will increase and the
bias will decrease. The relative rate of change of these two quantities determines
whether the test MSE increases or decreases. As we increase the flexibility of a class
of methods, the bias tends to initially decrease faster than the variance increases.
Consequently, the expected test MSE declines. However, at some point increasing
flexibility has little impact on the bias but starts to significantly increase the variance.
 In a real-life situation in which f is unobserved, it is generally not possible to explicitly
compute the test MSE, bias, or variance for a statistical learning method. Nevertheless,
one should always keep the bias-variance trade-off in mind.
Note:
 Increasing model complexity reduces bias but increases variance.
 Decreasing complexity reduces variance but increases bias.
 The goal is to find the “sweet spot” where total error (Bias² + Variance) is minimal.
Bias is about being consistently wrong due to oversimplification; variance is about being
inconsistently wrong due to over-sensitivity to the training data.
Interpretation:
 Bias curve (orange): Starts high for simple models, drops as complexity increases.
 Variance curve (green): Starts low for simple models, rises as complexity increases.
 Training error (blue): Falls steadily as complexity increases.
 Test error (blue U-shape): Falls at first (bias drops faster than variance rises), then
rises again (variance dominates).
 Sweet spot: The point of lowest test error — optimal balance between bias and
variance.
GOOD FIT VERSUS UNDER-FIT AND OVER-FIT
Scenario | Description | Bias–Variance Profile | Interpretability | Typical Cause
Good Fit | Model captures the underlying pattern in the data without memorising noise. Performs well on both training and unseen data. | Balanced bias and variance. | Often moderate — depends on model choice. A simple model with a good fit is highly interpretable; a complex one may not be. | Appropriate complexity, tuned hyperparameters, and enough data.
Underfitting | Model is too simple to capture the true relationship between inputs and outputs. Performs poorly on both training and test data. | High bias, low variance. | Usually high — simple models are easy to explain, but explanations are misleading because the model misses important patterns. | Over-simplified model, wrong assumptions, too few features.
Overfitting | Model learns noise and idiosyncrasies of the training data in addition to the true pattern. Performs well on training data but poorly on test data. | Low bias, high variance. | Often low — complex models are harder to interpret, and even if explained, the reasoning may be based on noise rather than true patterns. | Excessive complexity, too many features, insufficient regularisation.
Aspect | Overfitting | Underfitting
Definition | Model learns the training data too well, including noise and outliers. | Model fails to learn the underlying pattern in the data.
Training Performance | Very high accuracy or very low error. | Poor accuracy or high error.
Test Performance | Poor generalization; performs badly on unseen data. | Poor generalization; performs badly on both training and test data.
Model Complexity | Too complex (e.g., high-degree polynomial, deep tree). | Too simple (e.g., linear model for non-linear data).
Bias | Low bias (fits training data closely). | High bias (misses key patterns).
Variance | High variance (sensitive to small changes in training data). | Low variance (predictions are stable but inaccurate).
Visual Analogy | Curve passes through every training point, even noise. | Flat line that misses the trend entirely.
Cause | Too many features, insufficient regularization, small training set. | Too few features, overly simplistic model, insufficient training.
Solution | Simplify the model, use regularization, cross-validation, and increase training data. | Increase model complexity, add features, and reduce bias.
Why we need to analyse this:
Under fit models:
 Easy to interpret (e.g., a straight line through clearly curved data).
 But the interpretation is wrong because the model ignores important structure.
Good fit models:
 Strike a balance — they explain the data well and generalise.
 If the model is simple (e.g., linear regression with relevant predictors), interpretability
is high.
 If the model is complex (e.g., tuned gradient boosting), interpretability may require
extra tools (e.g., SHAP, LIME).
Over fit models:
 Often complex and opaque (e.g., deep trees, high-degree polynomials).
 Even if you can “explain” them, the explanation may be misleading because the
model’s logic is tied to noise in the training set.
Bias–Variance–Interpretability Triangle
 High bias (under fit) → Simple, interpretable, but inaccurate.
 Balanced bias & variance (good fit) → Accurate and potentially interpretable.
 High variance (over fit) → Accurate on training data, poor generalisation, low
interpretability.
Note:
 Good fit = model complexity matches the true complexity of the data → best
generalisation.
 Under fit = too simple → high bias, misleadingly “clear” interpretation.
 Over fit = too complex → high variance, low trust in interpretation.
 Interpretability is not just about simplicity — it’s about whether the model’s
reasoning reflects the true data-generating process.
CONCEPT OF PARSIMONY VERSUS BLACK BOX:
Parsimony:
Meaning
 In plain English, parsimony means simplicity — using the smallest, simplest model
that still explains the data well.
 In ML, it’s the principle of avoiding unnecessary complexity in the model’s structure,
number of features, or parameters.
Why It Matters
 Better generalisation: Simpler models are less likely to over fit.
 Interpretability: Fewer parameters → easier to explain.
 Efficiency: Less computation, faster training, easier deployment.
How It’s Achieved
 Feature selection — keep only the most relevant predictors.
 Regularisation — e.g., Lasso shrinks some coefficients to zero.
 Occam’s razor — prefer the simplest model that performs adequately.
Example
 Choosing a linear regression with 5 meaningful predictors instead of a polynomial
model with 50 terms that barely improves accuracy.
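As a rough illustration of the regularisation route to parsimony (assuming scikit-learn and NumPy; the simulated data, the choice of 20 predictors, and the alpha value are hypothetical), the Lasso below keeps only the handful of predictors that genuinely drive the response:

```python
# Parsimony via the Lasso: L1 regularisation shrinks some coefficients exactly to zero,
# effectively selecting a smaller set of predictors.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:3] = [3.0, -2.0, 1.5]               # only the first 3 predictors matter
y = X @ true_beta + rng.normal(0, 1.0, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)             # alpha controls the strength of shrinkage
kept = np.flatnonzero(lasso.coef_ != 0)
print("Non-zero coefficients kept by the Lasso:", kept)
print("Their estimated values:", np.round(lasso.coef_[kept], 2))
```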
Black Box:
Meaning
 A black box model is one whose internal decision-making process is not easily
interpretable by humans.
 You can see the inputs and outputs, but not clearly understand how the model
arrived at its prediction.
Why Some Models Are Black Boxes
 High complexity: Many layers, parameters, and non-linear transformations (e.g.,
deep neural networks).
 Distributed representations: Information is encoded across many nodes, making it
hard to trace influence of individual features.
 Layer-wise abstraction: Each layer learns increasingly abstract features, obscuring
the link between raw input and final output.
Implications
 Pros: Often achieve very high predictive accuracy.
 Cons: Hard to explain decisions → issues in regulated domains (finance, healthcare, law).
 Mitigation: Use explainability tools like SHAP, LIME, or partial dependence plots.
Example
 A deep learning image classifier that correctly labels an image as a “cat” but cannot
easily explain which features (whiskers, ears, fur pattern) drove the decision.
Aspect | Parsimonious Model | Black Box Model
Complexity | Low | High
Interpretability | High | Low
Risk of Overfitting | Lower | Higher (if not regularised)
Accuracy Potential | Moderate–High (if problem is simple) | High (especially for complex patterns)
Examples | Linear regression, logistic regression, pruned decision tree | Deep neural networks, large ensembles (random forest, gradient boosting)
Parsimony is about keeping models as simple as possible without losing accuracy; black box
models are powerful but opaque, making their internal reasoning hard to interpret.
WHY THE FUNCTION f IN Y = f(X) + ε IS GENERALLY UNKNOWN:
Reason | Explanation
Real-world complexity | The relationship between inputs X and output Y is often governed by complex, nonlinear, and interacting factors. We rarely know the exact mathematical form of this relationship.
Epistemic uncertainty | Even if a true function exists, we don't have perfect knowledge. We only observe samples, not the full population or mechanism.
Data limitations | We work with finite, noisy, and sometimes biased data. This makes it impossible to recover f exactly — only approximate it.
Modelling assumptions | Statistical models (like linear regression) assume a form for f, but this is a simplification. The true f may be far more intricate.
Randomness and noise | The error term ε captures unobserved influences, measurement errors, and randomness. These obscure the true signal in f(X).
Non-identifiability | Multiple functions can fit the same data equally well, especially in high dimensions. Without strong assumptions, f is not uniquely identifiable.
WHY f HAS TO BE ESTIMATED IN SUPERVISED LEARNING ML MODEL:
In supervised machine learning, the central goal is to learn an unknown relationship between
inputs X and an output Y.
This relationship is often expressed as:
Y=f(X) + ε
Where:
 f (⋅) is the true underlying function mapping inputs to outputs.
 ε is the irreducible error (random noise we can't explain).
f is estimated for the following reasons:
Purpose | Explanation | Critical Insight
Prediction | We want to predict the output Y for new or unseen inputs X. | The true function f is unknown, so we estimate f^ to make informed predictions. Accuracy depends on how well f^ approximates f.
Inference | We aim to understand how changes in X affect Y. | Estimating f helps identify which predictors are influential and how they interact. This is key in domains like health, economics, and policy.
Decision-making | Models guide actions based on predicted outcomes. | —
Handling complexity | Real-world relationships are rarely simple or linear. | Estimating f allows us to capture nonlinear, multivariate, or hidden patterns in data.
Reducing error | We minimize the reducible error by improving our estimate of f. | While irreducible error (from noise) remains, better models reduce bias and variance.
Generalization | We want models that perform well on unseen data. | Estimating f with proper validation ensures robustness and avoids overfitting.
Model comparison | Different algorithms estimate f differently. | Comparing models (e.g., linear regression vs. random forest) helps choose the best estimator for your goals.
Shared characteristics of estimating f:
Characteristic | Description | Implication
Assumption of a relationship | We assume that the output Y is related to inputs X via some function f(X). | This is the foundation of modelling — without this assumption, prediction is impossible.
Use of data | Estimation relies on observed data: pairs of inputs X and outputs Y. | The quality and quantity of data directly affect the accuracy of f^.
Trade-off between bias and variance | Simpler models have high bias, complex models have high variance. | We aim to balance this trade-off to minimize total prediction error.
Reducible vs. irreducible error | We can reduce error by improving f^, but some error (from noise) is unavoidable. | Focus is on minimizing reducible error through better modelling.
Model flexibility | Models vary in flexibility — from rigid (linear regression) to highly adaptive (neural networks). | More flexible models can fit complex f, but risk overfitting.
Interpretability vs. accuracy | Simpler models are easier to interpret; complex ones may predict better but are opaque. | Choice depends on whether explanation or prediction is the priority.
Validation and generalization | Estimates of f must be tested on unseen data to ensure they generalize well. | Techniques like cross-validation are essential to avoid overfitting.
Parametric vs. non-parametric methods | Parametric methods assume a form for f; non-parametric methods do not. | Parametric models are simpler but risk misspecification; non-parametric models need more data.
In addition to the above:
1. Data-Driven Fitting: Both types of methods rely on the observed training data (Xi, Yi)
to construct an estimate f^
2. Flexibility Control: Parametric methods control flexibility via the number and form of
parameters; non-parametric methods control it via smoothing parameters or
neighbourhood size.
3. Over fitting Risk: Highly flexible models—many-parameter linear models or
unsmoothed non-parametric fits—can follow noise if not properly regularized.
4. Assumption vs. Smoothness: Parametric forms assume a specific function shape for
f; non-parametric methods assume only that f is sufficiently smooth or locally
approximable.
5. Optimization Objective: All methods define a loss (e.g., residual sum of squares or
negative log-likelihood) and seek the f^ that minimizes it.
6. Dependence on Sample Size: Parametric estimation can work reasonably with smaller
samples; non-parametric estimation typically requires large samples to achieve low
variance.
7. Interpretability vs. Predictive Accuracy Trade-Off: Simpler, lower-parameter models
are more interpretable but may incur higher bias; more flexible estimators often yield
better fit at the cost of transparency
Key Insight
Estimating f is the core of supervised learning because:
 It enables generalization to unseen data.
 It turns raw data into a predictive or explanatory tool.
 It formalizes the problem as function approximation under uncertainty
THE CONCEPT OF ERROR IN ML MODEL
The concept of Reducible and Irreducible Error:
In statistical learning, when we try to predict an outcome Y using inputs X, the total
prediction error comes from two sources:
Type of Error | What It Means | Can We Reduce It? | Example
Reducible Error | Comes from imperfections in our model f^(X). | ✅ Yes, by choosing better models or tuning them. | Using a linear model when the true relationship is nonlinear.
Irreducible Error | Comes from randomness or unknown factors ε. | ❌ No, it's inherent in the system. | Mood swings, genetic variability, or measurement noise.
Why is irreducible error > 0?
Reason | Explanation
Unobserved variables | There are always factors influencing Y that aren't captured in X — for example, mood, genetics, or environmental noise.
Measurement error | Instruments and human reporting introduce inaccuracies. Even biological age tests have variability.
Natural randomness | Biological systems, human behaviour, and even market dynamics have inherent unpredictability.
Model limitations | No model can perfectly capture reality. Even if you knew the true f, randomness in ε remains.
Temporal fluctuations | Conditions change over time — what affects vitality today may not tomorrow. This adds noise.
Errors in Practice:
Error Metric | Use Case
Mean Squared Error (MSE) | Penalizes large errors heavily
Root Mean Squared Error (RMSE) | Same as MSE but in original units
Mean Absolute Error (MAE) | Robust to outliers
Classification Error Rate | Misclassification proportion
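A quick sketch of how these metrics are computed on a toy set of predictions (assuming NumPy; the numbers are made up purely for illustration):

```python
# Computing the error metrics above for a toy set of predictions (illustrative sketch).
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.0])
y_pred = np.array([2.5, 5.5, 2.0, 9.0, 4.2])

mse = np.mean((y_true - y_pred) ** 2)        # penalises large errors heavily
rmse = np.sqrt(mse)                          # same as MSE but in the original units
mae = np.mean(np.abs(y_true - y_pred))       # more robust to outliers

labels_true = np.array([1, 0, 1, 1, 0])
labels_pred = np.array([1, 0, 0, 1, 0])
error_rate = np.mean(labels_true != labels_pred)   # misclassification proportion

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  Classification error rate={error_rate:.2f}")
```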
Why understanding of Error matters:
 Model Selection: Choose complexity that minimizes test error, not just training error.
 Generalization: Avoid overfitting by balancing bias and variance.
 Performance Boundaries: Recognize that irreducible error means perfect accuracy
is impossible in noisy real-world data.
 Interpretability: Understanding error sources helps explain model behaviour to
stakeholders
Key Insight: In ML, error is not just a number — it’s a decomposition of what we can fix
(bias, variance) and what we can’t (irreducible error). The art of modelling is finding the sweet
spot where reducible error is minimized without chasing noise.
Training and Test Error:
Aspect | Training Error | Test Error
Definition | The error rate (or loss) the model achieves on the same data it was trained on. | The error rate (or loss) the model achieves on unseen data not used during training.
Purpose | Measures how well the model has learned patterns in the training set. | Measures how well the model generalizes to new, unseen data.
Data Exposure | Computed on data the model has already "seen" and optimized for. | Computed on data the model has never seen before.
Typical Behaviour | Usually lower than test error because the model is optimized to fit this data. | Usually higher than training error due to the generalization gap.
Risk Indicator | Very low training error with high test error → overfitting. | High test error with low training error → poor generalization.
Bias–Variance Link | Low training error can mean low bias, but may hide high variance. | Reflects the combined effect of bias, variance, and irreducible error.
When It's Useful | Diagnosing underfitting (high training error) or overfitting (very low training error but high test error). | Evaluating real-world performance and model selection.
Computation | Directly from the training phase outputs. | Requires a held-out dataset or cross-validation.
Synergy Between Training and Test Error
Training and test error complement each other in diagnosing and improving models:
1. Generalization Gap
o The difference between test error and training error is called the
generalization gap.
o A small gap → model generalizes well.
o A large gap → possible overfitting.
2. Bias–Variance Diagnosis
o High training error + high test error → under fitting (high bias).
o Low training error + high test error → overfitting (high variance).
o Low training error + low test error → good fit.
3. Model Selection & Tuning
o Monitoring both errors during hyper parameter tuning helps find the “sweet
spot” where the model is complex enough to capture patterns but simple
enough to generalize.
4. Cross-Validation as a Bridge
o Cross-validation uses training-like folds to estimate test-like error, giving a
more stable view of the synergy between the two.
Key Insight:
 Training error tells you how well your model fits known data.
 Test error tells you how well it will perform in the real world.
 Both are essential: focusing only on training error risks overfitting; focusing only on
test error without understanding training performance can hide under fitting.
 The goal is minimizing test error while keeping the generalization gap small.
Note:
 Training error steadily decreases as model complexity increases, since the model fits the training data more closely.
 Test error follows a U-shaped curve — it decreases initially as the model captures real patterns, then increases due to overfitting.
The optimal point is where test error is lowest — balancing bias and variance.
Training MSE and Test MSE:
Aspect | Training MSE | Test MSE
Definition | Mean Squared Error calculated on the same data used to fit the model. | Mean Squared Error calculated on new, unseen data not used in training.
Purpose | Measures how well the model fits the training data. | Measures how well the model generalizes to unseen data.
Typical Value | Almost always lower than Test MSE, because the model is optimized for this data. | Usually higher than Training MSE, due to generalization error.
Overfitting Indicator | Very low Training MSE with much higher Test MSE suggests overfitting. | Large gap between Test and Training MSE signals poor generalization.
Bias–Variance Link | Low bias on training data, but variance is not visible here. | Reflects both bias and variance; high variance shows up as large Test MSE.
Correlation with Model Flexibility
Model Type | Training MSE Trend | Test MSE Trend | Reason
Rigid / Less Flexible Model (e.g., linear regression with few predictors) | Starts relatively high and decreases slowly with more flexibility. | Starts high, decreases initially, then may rise slightly if underfitting persists. | High bias dominates; variance is low, so training and test errors are closer.
Flexible Model (e.g., high-degree polynomial, deep tree) | Decreases rapidly as flexibility increases — can approach zero. | Decreases at first, then increases sharply after an optimal point. | Low bias but high variance; overfitting causes Test MSE to rise while Training MSE stays low.
Key Insight:
 Rigid models → Both Training and Test MSE are relatively high but close to each
other (under fitting).
 Moderately flexible models → Training MSE drops, Test MSE also drops to a
minimum (optimal complexity).
 Highly flexible models → Training MSE stays very low, but Test MSE rises due to
over fitting (variance dominates).

Training MSE (blue) falls steadily as model flexibility increases, because the model
can fit the training data more and more closely.
 Test MSE (red) follows a U-shaped curve — it drops at first as the model captures
real structure, then rises again when over fitting starts to dominate.
 The sweet spot is at the bottom of the Test MSE curve, where bias and variance are
optimally balanced.
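The U-shape can be reproduced on simulated data. In the rough sketch below (assuming NumPy; the sine-based true function, noise level, sample sizes, and the list of polynomial degrees are arbitrary illustrative choices), training MSE keeps falling as the degree rises while test MSE typically turns back up at high degrees:

```python
# Training vs. test MSE as flexibility (polynomial degree) increases (illustrative sketch).
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    return np.sin(4 * x)

x_train = rng.uniform(-1, 1, 30)
y_train = f(x_train) + rng.normal(0, 0.3, 30)
x_test = rng.uniform(-1, 1, 200)
y_test = f(x_test) + rng.normal(0, 0.3, 200)

for degree in [1, 3, 5, 8, 12]:
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```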
Trade-off between prediction and accuracy
Prediction (Generalization Performance)
 Focuses on how well the model performs on new, unseen data.
 Often achieved by using flexible models (e.g., random forests, neural networks).
 These models may be less interpretable, but they capture complex patterns and
interactions
 The goal is low test error, not necessarily understanding the model’s inner workings.
Accuracy (Interpretability or Inference)
 Focuses on understanding relationships between variables.
 Often achieved by using simpler models (e.g., linear regression, logistic regression).
 These models may have higher bias, but they’re easier to explain and trust.
 The goal is insight and explanation, not just prediction.
The Trade-Off in Practice
 Highly flexible models may give better predictions but are harder to interpret and
may over fit.
 Simpler models may be easier to explain but might miss subtle patterns, reducing
predictive power.
 You often have to choose between interpretability and predictive accuracy,
depending on your use case.
Example
 In medical diagnostics, you might prefer a simpler model that doctors can understand
— even if it’s slightly less accurate.
 In stock price forecasting, you might prioritize predictive power over interpretability
— because the stakes are in the outcome, not the explanation.
BIAS-VARIANCE DECOMPOSITION:
The Trade-Off
 Simpler models → high bias, low variance.
 More complex models → low bias, high variance.
 The goal is to find the sweet spot where total error (bias² + variance + irreducible
error) is minimized.
Why It Matters in Practice
 Model selection: Helps decide whether to increase complexity (reduce bias) or
regularize/simplify (reduce variance).
 Hyper parameter tuning: Guides choices like tree depth, polynomial degree, or
regularization strength.
 Interpretability vs. accuracy: Sometimes a slightly higher bias is acceptable for a
more interpretable, stable model
Key Insight
Bias–variance decomposition tells us that low training error is not enough — we must
balance bias and variance to achieve low test error. It’s the theoretical backbone behind why
cross-validation, regularization, and early stopping work.
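As a small illustration of why cross-validation helps here (assuming scikit-learn and NumPy; the simulated data and the ridge penalties compared are hypothetical), 5-fold cross-validation gives an estimate of test MSE for each level of regularisation, which can then guide the bias-variance balance:

```python
# Estimating test error with k-fold cross-validation (illustrative sketch).
# Each fold is held out in turn, so the averaged score approximates performance on unseen data.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
beta = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
y = X @ beta + rng.normal(0, 1.0, size=200)

for alpha in [0.01, 1.0, 100.0]:            # regularisation strengths to compare
    scores = cross_val_score(Ridge(alpha=alpha), X, y,
                             scoring="neg_mean_squared_error", cv=5)
    print(f"alpha={alpha:7.2f}  CV estimate of test MSE: {-scores.mean():.3f}")
```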
The Concept of Var (ε)
The variance refers to how much the model’s predictions f^(X) would fluctuate if you trained
it on different datasets drawn from the same population.
Aspect | Meaning | Implication
Definition | Variance measures how sensitive your model is to changes in the training data. | High variance means the model is unstable — it changes a lot with small data shifts.
Cause | Often caused by overly complex models that try to fit every detail (including noise). | Leads to overfitting — great on training data, poor on new data.
Role in Error | It's part of the reducible error — you can lower it by simplifying the model or using regularization. | But reducing variance too much can increase bias — hence the bias-variance trade-off.
Visual Intuition | Imagine fitting a curve to data: a high-variance model wiggles through every point; a low-variance model draws a smoother, more general line. | You want a balance — not too wiggly, not too rigid.
VARIOUS SUPERVISED ML MODELS:
 Prediction rows: Focus on minimizing test error and generalizing well to unseen data.
 Inference rows: Focus on understanding relationships and quantifying effects, even if predictive accuracy is secondary.
 Regression vs. Classification columns: Determined by whether Y is continuous or categorical.
CLASSIFICATION OF METHODS FOR ESTIMATING f:
Fundamentally f is estimated either for
a. Prediction
b. Inference
Example:
 We want to know the value of a function f(x) at a certain point, for example x = 4.
 In real data, we often have no exact observations where X = 4.
 That means we cannot directly calculate the true average E(Y | X = 4).
 So, we relax the rule: instead of using only X = 4, we look at points where X is close to 4.
 This "close to" region is called a neighbourhood N(x).
 We take the average of the Y values for all data points in that neighbourhood.
 This gives us an estimate of f(x) even without exact matches.
 The red dashed lines in the plot show the neighbourhood around x = 4.
 The green curve is the smoothed estimate of f(x) built from such averages.
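A minimal sketch of this neighbourhood-averaging idea on simulated data (assuming NumPy; the true function, the window width of 0.25, and the sample size are illustrative assumptions):

```python
# Local (neighbourhood) averaging estimate of f(x) at x = 4 (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return 0.5 * x + np.sin(x)               # stand-in for the unknown true f

X = rng.uniform(0, 8, 400)
Y = f(X) + rng.normal(0, 0.4, 400)

x0, width = 4.0, 0.25                        # neighbourhood N(x0): |X - x0| <= width
in_neighbourhood = np.abs(X - x0) <= width
f_hat = Y[in_neighbourhood].mean()           # average of Y over the neighbourhood

print(f"Points in N(4): {in_neighbourhood.sum()}")
print(f"Estimate f_hat(4) = {f_hat:.3f}   (true f(4) = {f(4.0):.3f})")
```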
PREDICTIVE VERSUS INFERENCE ML MODEL:
Aspect | Predictive ML Model | Inference ML Model
Primary Goal | Accurately predict outcomes for new, unseen data. | Understand the relationship between predictors and the response.
Key Question | "Given X, what will Y be?" | "How does Y change when Xj changes?"
Focus | Minimizing prediction error on test data. | Estimating model parameters and their statistical significance.
Output Emphasis | Predicted values (Y^) for future observations. | Coefficient estimates, confidence intervals, p-values, effect sizes.
Model Choice | May use complex, non-interpretable models (e.g., random forests, gradient boosting, deep neural nets) if they improve accuracy. | Often uses simpler, interpretable models (e.g., linear/logistic regression) to make parameter interpretation meaningful.
Evaluation Metrics | RMSE, MAE, accuracy, F1-score, AUC, etc. | Standard errors, t-statistics, p-values, R², adjusted R².
Complexity Tolerance | High — complexity is acceptable if it improves predictive performance. | Lower — complexity is limited to maintain interpretability and valid inference.
Data Requirements | Needs representative training data; large datasets improve generalization. | Requires data that meet statistical assumptions (e.g., independence, correct functional form) for valid inference.
Example Use Cases | Credit risk scoring, demand forecasting, image classification, recommendation systems. | Determining which factors significantly affect disease risk, estimating price elasticity, policy impact analysis.
Interpretability Priority | Often secondary to accuracy. | Primary — the model must explain why and how predictors influence the outcome.
Synergy Between the Two
 Inference can inform prediction: Understanding which variables matter can simplify
models and improve generalization.
 Prediction can inform inference: Predictive performance can validate whether the
inferred relationships are useful in practice.
 In many real-world projects, both goals are blended — e.g., building a model that
predicts well and offers interpretable insights.
Goal ↓ / Task → | Regression (Continuous Y) | Classification (Categorical Y)
Prediction | Predictive Regression. Goal: Accurately predict a numeric outcome for new data. Examples: Predicting house prices, forecasting sales. Metrics: RMSE, MAE, R². | Predictive Classification. Goal: Accurately assign new observations to categories. Examples: Spam detection, image recognition. Metrics: Accuracy, Precision, Recall, F1, ROC-AUC.
Inference | Inferential Regression. Goal: Understand how predictors influence a continuous outcome. Examples: Estimating how education level affects income. Outputs: Coefficients, p-values, confidence intervals. | Inferential Classification. Goal: Understand how predictors influence the probability of class membership. Examples: Identifying risk factors for disease. Outputs: Odds ratios, significance tests, marginal effects.
Note:
1. Prediction rows: Focus on minimizing test error and generalizing well to unseen data.
2. Inference rows: Focus on understanding relationships and quantifying effects, even if
predictive accuracy is secondary.
3. Regression vs. Classification columns: Determined by whether Y is continuous or
categorical.
THE CONCEPT OF PREDICTOR VARIABLES AND RESPONSE VARIABLE IN
STATISTICAL LEARNING
Predictor Variables
a. Also known as Input, Features or Independent variables
b. Denoted typically as X1, X2, ……, Xp
c. Can be Numeric or Categorical
d. Used to train Models and uncover patterns or relationships
e. In Experimental design, these are the variables we manipulate or Observe
Response Variables:
a. Also known as the Outcome we aim to predict or understand
b. Denoted by Y
c. Depends on the values of the predictor variables
d. Can be continuous or categorical
e. In modelling, we try to estimate Y using f^(X)
Simple Linear Equation: Assume a quantitative response Y and p different predictors (X1, X2, …, Xp). This can be written in the general form
Y = f(X) + ε
Analysis of each of these terms:
Component | Meaning in Statistical Learning | Critical Insight
Y | The response variable or outcome we want to predict or understand. | May be continuous (e.g., blood pressure) or categorical (e.g., disease status).
X | The predictor variables or features used to estimate Y. | Includes measurable inputs like training intensity. Quality and relevance of X directly affect model accuracy.
f(X) | The true but unknown function that maps inputs to output. | Central to statistical learning. We estimate f^(X) using data. The challenge is that f is rarely known and may be nonlinear or complex.
ε | The error term capturing randomness, noise, or unobserved factors. | Represents irreducible error. Even with perfect modelling, this component remains due to biological variability, measurement error, etc.
Goal of Learning | Estimate f^(X) as closely as possible to f(X). | We use training data and algorithms (e.g., regression, trees, and neural networks) to approximate f. Accuracy depends on the bias-variance trade-off.
Assumptions | ε is independent of X and has mean zero. | These assumptions simplify modelling but may not hold in real-world data. Violations can lead to biased or inefficient estimates.
Implications | Prediction error = Bias² + Variance + Irreducible Error. | Guides model selection and tuning. Overly complex models increase variance; overly simple ones increase bias.
Interpretability vs. Accuracy | Simple f (e.g., linear) is interpretable; complex f (e.g., neural nets) may be more accurate. | The trade-off is crucial. In health modelling, interpretability often matters for trust and actionable insights.
Some Critical aspects:
 The function f is fixed but unknown. This implies that the function doesn't change, but we don't know its exact form. This is the essence of statistical modelling, i.e. estimating f from the data.
 ε is the random error term that captures noise or unmodelled effects. It assumes that all deviations from f(x) are random and not systematic.
 The errors are not influenced by the input variables. This is crucial for unbiased estimation; if violated, the model's reliability will suffer.
 The mean of the error is zero, which implies that on average, the error doesn't skew the output.
f represents the systematic information that X provides about Y. This implies that f(x) captures the predictable part of Y. It separates signal from noise; however, the boundary between "systematic" and "random" can be blurry.
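These assumptions can be checked on simulated data. The rough sketch below (assuming NumPy; the linear true f and the noise level are hypothetical choices) fits a simple model and confirms that the residuals have roughly zero mean and negligible correlation with X:

```python
# Checking the error-term assumptions on simulated data (illustrative sketch).
# Generate Y = f(X) + epsilon with a known f, fit a model, and inspect the residuals.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 500)

def f(x):
    return 2.0 + 0.8 * x                     # the systematic part of Y

Y = f(X) + rng.normal(0, 1.0, 500)           # epsilon: mean-zero noise, independent of X

slope, intercept = np.polyfit(X, Y, 1)       # estimate f with a straight line
residuals = Y - (intercept + slope * X)      # what is left after removing f_hat(X)

print(f"Mean of residuals:            {residuals.mean(): .4f}   (should be near 0)")
print(f"Correlation of residuals, X:  {np.corrcoef(residuals, X)[0, 1]: .4f}   (should be near 0)")
```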
PREDICTION:
In statistical modelling, prediction is one of the core motivations for estimating the function f
in the equation:
Y = f(X) + ε
Aspect | Explanation | Critical Insight
Objective | Use known inputs X to predict unknown or future outputs Y. | Prediction focuses on accuracy, not necessarily understanding the underlying mechanism.
Estimated Function f^(X) | A model built from training data to approximate the true function f(X). | f^ may be a black box (e.g., neural net) or interpretable (e.g., linear regression).
Error Term ε | Represents randomness or noise that cannot be predicted. | Even the best model cannot eliminate this irreducible error.
Training vs. Test Data | The model is trained on known data and evaluated on unseen data. | Good prediction requires generalization, not just fitting the training set.
Loss Function | Measures how far predictions Y^ = f^(X) are from the actual Y. | Common choices: Mean Squared Error (MSE), classification error, etc.
Model Selection | Choose f^ to minimize expected prediction error. | This involves balancing bias (simplicity) and variance (flexibility).
Cross-validation | Technique to estimate prediction error reliably. | Helps avoid overfitting and ensures robustness across different data splits.
Interpretability Trade-off | Accurate models may be complex and hard to interpret. | In wellness modelling, interpretability is crucial for actionable insights.
INFERENCE:
In statistical modelling, inference is about understanding the underlying relationship
between inputs X and output Y by estimating the function f. Unlike prediction, which focuses
on accuracy, inference focuses on insight—what drives the outcome and how.
Three fundamental questions are being asked:
a. Which predictors are associated with the response? Here the underlying aim is to identify the few important predictors among a large set of possible variables, depending on the application.
b. What is the relationship between the response and each predictor? Here the underlying aim is to identify positive or negative relationships.
c. Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?
Concept | Explanation | Critical Insight
Goal | To understand how each predictor Xj affects the response Y. | Helps in identifying causal relationships, not just correlations.
Function f | Represents the true relationship between inputs and output. | Estimating f allows us to interpret the system, not just predict it.
Model Interpretability | Models used for inference are often simple and transparent (e.g., linear regression). | Complexity is avoided to preserve clarity and explainability.
Parameter Estimation | We estimate coefficients (e.g., βj) that quantify the effect of each predictor. | These estimates help answer questions like: “How much does intake affect vitality?”
Confidence Intervals | Provide a range within which the true parameter likely falls. | Adds statistical rigor and accounts for uncertainty.
Hypothesis Testing | Tests whether a predictor has a statistically significant effect on Y. | Useful for validating wellness interventions.
Assumptions Matter | Inference relies on assumptions like linearity, independence, and normality. | Violations can lead to misleading conclusions.
Sampling Distribution | Inference depends on understanding how estimates vary across samples. | This is where concepts like standard error and p-values come in.
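For a sense of how these inference quantities are obtained in practice, here is a minimal sketch assuming the statsmodels package; the predictors and their effect sizes are invented for illustration. It reports coefficient estimates, confidence intervals, and p-values in line with the table above.

```python
# Inference sketch: estimate coefficients, confidence intervals, and p-values.
# Assumes the statsmodels package; the two predictors here are invented for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 200
intake = rng.normal(50, 10, n)          # hypothetical predictor X1
sleep = rng.normal(7, 1, n)             # hypothetical predictor X2
vitality = 5 + 0.4 * intake + 2.0 * sleep + rng.normal(0, 3, n)   # response Y

X = sm.add_constant(np.column_stack([intake, sleep]))
results = sm.OLS(vitality, X).fit()

print(results.params)       # estimated beta_j: effect of each predictor on Y
print(results.conf_int())   # 95% confidence intervals for each beta_j
print(results.pvalues)      # hypothesis tests: is each effect statistically significant?
```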
PARAMETRIC AND NON PARAMETRIC METHOD FOR ESTIMATING f:
Parametric Method:
 Definition:
o A statistical learning approach that assumes f belongs to a specific functional form (e.g., linear, polynomial, logistic).
o The problem reduces to estimating a finite set of parameters that define this
form.
 Two-Step Process:
1. Model Assumption: Choose a functional form for f (e.g., a linear form f(X) = β0 + β1X1 + β2X2 + … + βpXp).
2. Parameter Estimation: Use data to estimate the parameters (e.g., βj) via methods like least squares or maximum likelihood.
Common Examples:
o Linear regression
o Logistic regression
o Poisson regression
o ANOVA models
Estimation Techniques:
o Least Squares Estimation (minimizes sum of squared residuals)
o Maximum Likelihood Estimation (MLE) (maximizes probability of observed
data)
o Method of Moments (matches sample moments to theoretical moments)
Key Assumptions:
o Correct functional form is specified.
o Observations are independent.
o Error terms have constant variance (homoscedasticity).
o Often assumes a specific distribution for errors (e.g., normality).
Advantages:
o Simple to implement and interpret.
o Requires less data compared to non-parametric methods.
o Inference (hypothesis testing, confidence intervals) is straightforward.
Limitations:
o High bias if the chosen functional form is incorrect.
o Less flexible—may miss complex, nonlinear patterns.
o Model performance heavily depends on meeting assumptions.
 Bias–Variance Profile:
o Typically low variance (stable estimates across samples) but potentially high
bias if the model is misspecified.
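A compact sketch of the two-step parametric recipe, assuming a linear functional form and plain NumPy least squares (the data-generating coefficients are illustrative): the entire estimation problem reduces to two parameters.

```python
# Parametric sketch: (1) assume f(X) = b0 + b1*x, (2) estimate b0, b1 by least squares.
# Plain NumPy; the data-generating values are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 5, 100)
y = 1.0 + 2.5 * x + rng.normal(0, 0.5, 100)

# Design matrix with an intercept column, then solve the least-squares problem.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)   # approximately [1.0, 2.5]: the whole problem reduced to two parameters
```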
Non-Parametric Method
Definition
 Does not assume a fixed functional form for f.
 The shape of f is determined entirely by the data, allowing it to adapt to complex
patterns.
Flexibility
 Can model highly nonlinear relationships.
 Complexity can grow with the size of the dataset.
Examples
 k-Nearest Neighbours (k-NN) regression/classification
 Decision Trees and ensembles (Random Forests, Bagging, Boosting)
 Kernel Smoothing and Kernel Density Estimation
 Local Polynomial Regression
 Splines and Smoothing Splines
Estimation Process
 Uses the training data directly to make predictions (often called “memory-based” or
“instance-based” learning).
 Relies on measures like distance, similarity, or local averaging rather than global
parameter estimates.
Assumptions
 Minimal distributional assumptions compared to parametric methods.
 Often assumes only that f is smooth or continuous in some sense.
Advantages
 High flexibility — can fit a wide variety of shapes.
 Fewer assumptions reduce the risk of model misspecification.
 Can achieve high accuracy with sufficient data.
Limitations
 Requires more data to achieve low variance.
 Can be computationally intensive for large datasets.
 Risk of overfitting if complexity is not controlled (e.g., via bandwidth in kernel methods, pruning in trees).
Bias–Variance Profile
 Typically low bias (can closely follow the data) but high variance if not regularized.
 Model tuning (e.g., choosing k in k-NN, bandwidth in kernels) is crucial to balance bias and variance.
When to Use
 When the true form of f is unknown or suspected to be highly nonlinear.
 When you have a large dataset and want flexibility over interpretability.
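The following sketch, assuming scikit-learn, illustrates a non-parametric fit with k-NN regression on an invented nonlinear example; varying k shows the bias–variance tuning described above (small k = flexible but high variance, large k = smoother but higher bias).

```python
# Non-parametric sketch: k-NN regression lets the data determine the shape of f.
# Assumes scikit-learn; the sine-shaped target is just one example of a nonlinear f.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(4)
X = rng.uniform(0, 10, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Small k -> low bias / high variance; large k -> smoother fit / higher bias.
for k in (1, 10, 100):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    print(k, mean_squared_error(y_test, knn.predict(X_test)))
```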
Difference between Parametric and Non-Parametric Methods:
Aspect | Parametric Method | Non-Parametric Method
Model Assumption | Assumes a fixed functional form for f (e.g., linear, polynomial). | Makes no fixed assumption about the shape of f.
What We Estimate | A small set of parameters that define the chosen model. | The function f itself, often using the data directly.
Flexibility | Less flexible: limited to the chosen form. | Highly flexible: can adapt to many shapes.
Data Requirement | Works well with smaller datasets. | Needs large datasets for accuracy.
Bias–Variance Profile | Usually higher bias, lower variance. | Usually lower bias, higher variance.
Interpretability | Easy to interpret and explain. | Often harder to interpret.
Risk if Wrong Assumption | High: if the assumed form is wrong, predictions suffer. | Lower: fewer assumptions reduce the risk of misspecification.
Examples | Linear regression, logistic regression, Poisson regression. | k-Nearest Neighbours, decision trees, splines, kernel smoothing.
FUNDAMENTAL OF MEASURING THE QUALITY OF FIT:
Measuring the “quality of fit” in a machine learning (ML) model means evaluating how well
the model’s predictions match the actual observed data — both on the data it was trained
on and, more importantly, on unseen data.
It’s essentially asking:
“Does my model capture the underlying patterns in the data without overfitting or underfitting?”
The exact way we measure it depends on the type of model and the nature of the outcome variable, but the core idea is always the same: smaller, unbiased differences between predicted and observed values mean a better fit.
What “Quality of Fit” Means
 In statistical terms: It’s the degree to which the model’s predicted values align with
the observed values.
 In ML terms: It’s a measure of model performance and generalization ability.
 A good fit means low error and high explanatory or predictive power.
Why It Matters
 Trust: A poorly fitting model can give misleading predictions or wrong conclusions.
 Model Selection: Comparing fit across candidate models helps choose the best one.
 Bias–Variance Balance: Fit quality helps detect overfitting (too complex) or underfitting (too simple).
Quality of Fit for Regression Models: see the regression metrics summarized in the Note below.
Quality of Fit for Classification Models: see the classification metrics summarized in the Note below.
Visual & Diagnostic Checks
 Residual plots (regression): Residuals should be randomly scattered, no patterns.
 Predicted vs. Actual plots: Points should cluster around the 45° line.
 Learning curves: Show training vs. validation error to detect over- or underfitting.
Key Insight
 Quality of fit ≠ just high accuracy or high R² — it’s about balanced performance
that generalizes well.
 Always measure fit on validation/test data, not just training data.
 Use multiple metrics and visual diagnostics to get a complete picture.
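As a sketch of these visual diagnostics (assuming matplotlib and scikit-learn; the data are invented), the code below draws a residual plot and a predicted-vs-actual plot with its 45° reference line.

```python
# Visual diagnostics sketch (assumes matplotlib and scikit-learn): residuals should
# scatter randomly around zero, and predictions should hug the 45-degree line.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
X = rng.uniform(0, 10, size=(300, 1))
y = 3 * X[:, 0] + rng.normal(0, 2, size=300)

model = LinearRegression().fit(X, y)
pred = model.predict(X)
residuals = y - pred

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.scatter(pred, residuals, s=10)
ax1.axhline(0, color="red")
ax1.set(xlabel="Predicted", ylabel="Residual", title="Residual plot")

ax2.scatter(y, pred, s=10)
ax2.plot([y.min(), y.max()], [y.min(), y.max()], color="red")  # 45-degree reference line
ax2.set(xlabel="Actual", ylabel="Predicted", title="Predicted vs. actual")

plt.tight_layout()
plt.show()
```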
Note:
For Regression Models (Continuous Outcomes)
 R-squared (R²) – Proportion of variance in the dependent variable explained by the model. Higher values (closer to 1) indicate better fit.
 Adjusted R² – Like R², but penalizes adding predictors that don’t improve the model.
 Mean Squared Error (MSE) / Root Mean Squared Error (RMSE) – Average squared
(or square-rooted) prediction error; lower is better.
 Mean Absolute Error (MAE) – Average absolute prediction error; less sensitive to outliers than MSE.
 Residual Analysis – Plotting residuals to check for randomness; patterns suggest
poor fit.
For Classification Models (Categorical Outcomes)
We assess how well predicted classes match actual classes:
 Accuracy – Proportion of correct predictions.
 Precision, Recall, F1-score – Useful when classes are imbalanced.
 ROC Curve & AUC – Measures the model’s ability to rank positives above negatives
across thresholds.
 Log-Loss / Cross-Entropy – Penalizes confident but wrong predictions.
For Probability Distributions or Hypothesis Testing
When checking if data follow a theoretical distribution:
 Chi-Square Goodness-of-Fit Test – Compares observed and expected frequencies.
 Kolmogorov–Smirnov Test, Anderson–Darling Test, Cramér–von Mises
Criterion – Compare the empirical distribution to the theoretical one.
 Akaike Information Criterion (AIC) / Bayesian Information Criterion (BIC) –
Compare models, penalizing complexity.
Visual Checks
 Residual Plots – Random scatter around zero suggests a good fit.
 Predicted vs. Observed Plots – Points close to the 45° line indicate strong
agreement.
In short:
 For numeric predictions, use error metrics and R² plus residual checks.
 For categorical predictions, use accuracy, precision/recall, and ROC-AUC.
 For distributional fit, use statistical goodness-of-fit tests.
 Always combine numerical metrics with visual diagnostics for a complete picture.
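For the distributional checks, here is a minimal sketch assuming SciPy; the sample and the observed/expected counts are invented. It runs a Kolmogorov–Smirnov test against a normal distribution and a chi-square goodness-of-fit test on category frequencies.

```python
# Goodness-of-fit sketch (assumes SciPy): compare a sample against a theoretical
# normal distribution with the Kolmogorov-Smirnov test, and observed vs. expected
# counts with the chi-square test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
sample = rng.normal(loc=0, scale=1, size=500)

# KS test against a standard normal; a large p-value means no evidence of poor fit.
ks_stat, ks_p = stats.kstest(sample, "norm")
print("KS:", ks_stat, ks_p)

# Chi-square goodness of fit on categorical counts (observed vs. expected frequencies).
observed = np.array([48, 52, 60, 40])
expected = np.array([50, 50, 50, 50])
chi2, chi_p = stats.chisquare(f_obs=observed, f_exp=expected)
print("Chi-square:", chi2, chi_p)
```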
FUNDAMENTAL OF MODEL ACCURACY
In ML, accuracy is a measure of how well a model’s predictions match the actual outcomes.
It’s a performance indicator that answers:
“Out of all predictions the model made, how many were correct?”
 For classification: Accuracy is the proportion of correct predictions (both positives
and negatives) over all predictions.
 For regression: We don’t usually talk about “accuracy” as a percentage — instead,
we use error metrics (MSE, RMSE, MAE) or fit measures (R²) to judge accuracy.
Factors affecting model Accuracy:
 Data Quality – Garbage in, garbage out. Missing values, noise, and bias in data
reduce accuracy.
 Feature Engineering – Relevant, well-constructed features improve predictive power.
 Model Complexity – Too simple → underfitting (high bias); too complex → overfitting (high variance).
 Training Data Size – More representative data generally improves generalization.
 Evaluation Method – Using cross-validation gives a more reliable accuracy estimate
than a single train/test split.
Fundamental Principles:
Model accuracy is about generalization — how well the model performs on unseen data,
not just the training set. That’s why we always:
 Measure accuracy on a test set or via cross-validation.
 Compare multiple metrics to get a complete performance picture.
 Balance bias and variance to minimize total error.
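A short sketch of estimating accuracy via cross-validation, assuming scikit-learn and using its built-in breast-cancer dataset purely as an example: the fold-to-fold spread gives a more reliable picture than a single split.

```python
# Cross-validation sketch (assumes scikit-learn): estimate accuracy across several
# train/validation splits instead of trusting a single split.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores)                         # accuracy on each of the 5 folds
print(scores.mean(), scores.std())    # average and spread = a more reliable estimate
```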
ASSESSING MODEL ACCURACY OF PREDICTIVE MODEL
In predictive modelling, the goal is not just to fit the training data well, but to generalize to
unseen data. Assessing accuracy tells us:
 How well the model predicts on new data.
 Whether we are overfitting (too complex) or underfitting (too simple).
 Which model or hyperparameter setting is optimal.
Core steps in Accuracy Assessment:
Step | Description | Key Notes
1. Split Data | Divide into training and test sets (e.g., 80:20) or use cross-validation. | Ensures test error is measured on unseen data.
2. Choose Metrics | Select metrics aligned with the prediction task (regression vs. classification). | Avoid relying on a single metric; consider the cost of different errors.
3. Train Model | Fit the model on the training set (or training folds in CV). | Keep test data untouched until final evaluation.
4. Evaluate on Test Data | Compute chosen metrics on the test set. | This is your generalization performance.
5. Compare Models | Use validation/CV scores to choose the best model. | Avoid peeking at test data repeatedly; it biases results.
Common predictive Accuracy Metrics:
Note:
 For predictive models that deal with continuous outcomes (regression problems),
accuracy is measured by how close the predicted values are to the actual values.
 Common ways to do this include calculating the Mean Squared Error (MSE), which
averages the squared differences between predictions and actuals, or the Root Mean
Squared Error (RMSE), which is just the square root of MSE and is in the same units
as the target variable. Mean Absolute Error (MAE) is another option, taking the
average of absolute differences, and is less sensitive to large outliers. The R² score tells you what proportion of the variation in the outcome is explained by your model, with 1 meaning a perfect fit and 0 meaning no better than predicting the mean.
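A minimal sketch of these regression accuracy metrics, assuming scikit-learn; the observed and predicted values are invented for illustration.

```python
# Regression accuracy sketch (assumes scikit-learn): MSE, RMSE, MAE and R^2 on a test set.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_actual = np.array([3.0, 5.0, 7.5, 9.0, 11.0])     # illustrative observed values
y_pred = np.array([2.8, 5.4, 7.0, 9.3, 10.5])       # illustrative model predictions

mse = mean_squared_error(y_actual, y_pred)
rmse = np.sqrt(mse)                                  # same units as the target
mae = mean_absolute_error(y_actual, y_pred)          # less sensitive to outliers
r2 = r2_score(y_actual, y_pred)                      # 1 = perfect, 0 = no better than the mean

print(f"MSE={mse:.3f} RMSE={rmse:.3f} MAE={mae:.3f} R^2={r2:.3f}")
```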
ASSESSING MODEL ACCURACY OF CLASSIFICATION MODELS:
In supervised classification, we train a model on labelled data (X, Y) and then evaluate it on
data it hasn’t seen before. The goal is to measure generalization performance — how well
the model will perform in the wild.
The workflow:
Step | What Happens | Why It Matters
1. Data Splitting | Divide data into training and test sets (or use k-fold cross-validation). | Ensures evaluation is on unseen data, preventing over-optimistic results.
2. Model Training | Fit the model on the training set. | Learns patterns from labelled examples.
3. Prediction | Apply the model to the test set to get predicted labels or probabilities. | Produces the outputs we’ll evaluate.
4. Confusion Matrix | Summarize predictions into TP, TN, FP, FN counts. | Foundation for most classification metrics.
5. Metric Calculation | Compute accuracy, precision, recall, F1, ROC-AUC, etc. | Quantifies performance from different angles.
6. Interpretation | Relate metrics to the problem’s priorities (cost of errors, class balance). | Ensures the chosen metric aligns with business or research goals.
The Confusion Matrix (for binary classification):
                | Predicted Positive  | Predicted Negative
Actual Positive | True Positive (TP)  | False Negative (FN)
Actual Negative | False Positive (FP) | True Negative (TN)
Key Metrics:
 Accuracy = (TP + TN) / (TP + TN + FP + FN)
 Precision = TP / (TP + FP)
 Recall (Sensitivity) = TP / (TP + FN)
 F1-score = 2 × (Precision × Recall) / (Precision + Recall)
Practical Considerations:
1. Class Imbalance: Accuracy can be misleading; a model that always predicts the majority class can have high accuracy but zero usefulness.
2. Threshold Tuning: Many models output probabilities; adjusting the decision threshold
changes precision/recall trade-offs.
3. Cross-Validation: Gives a more stable estimate of performance, especially with small
datasets.
4. Domain Costs: Always align metric choice with the real-world cost of errors.
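The sketch below, assuming scikit-learn and using its breast-cancer dataset as a stand-in, builds the confusion matrix, computes precision, recall, F1 and ROC-AUC, and shows how lowering the decision threshold trades precision for recall.

```python
# Classification assessment sketch (assumes scikit-learn): confusion matrix, key metrics,
# and the effect of moving the decision threshold away from the default 0.5.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, roc_auc_score)

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]        # predicted probability of the positive class

for threshold in (0.5, 0.3):                   # lowering the threshold trades precision for recall
    pred = (proba >= threshold).astype(int)
    print(f"threshold={threshold}")
    print(confusion_matrix(y_test, pred))      # rows: actual, columns: predicted
    print("precision:", precision_score(y_test, pred),
          "recall:", recall_score(y_test, pred),
          "F1:", f1_score(y_test, pred))

print("ROC-AUC:", roc_auc_score(y_test, proba))  # threshold-free ranking quality
```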
Note:
 For classification models that deal with categorical outcomes, accuracy is measured
by how often the model predicts the correct class.
 The simplest measure is accuracy itself — the proportion of correct predictions out of
all predictions. But when classes are imbalanced, accuracy can be misleading, so we
also look at precision (how many predicted positives are actually positive), recall or
sensitivity (how many actual positives were correctly identified), and the F1-score,
which balances precision and recall. For models that output probabilities, the
ROC-AUC score measures how well the model ranks positive cases above negative
ones across different thresholds.
 In short, for regression you measure closeness of predictions to actual values, and for
classification you measure correctness of predicted labels — with extra metrics to
handle imbalance or probability-based outputs.
Broad difference between Model Accuracy and Model Fit:
Aspect | Model Fit | Model Accuracy
Definition | How well a model captures the underlying patterns in the training data. | How well a model’s predictions match the actual outcomes on unseen (test/validation) data.
Focus | Evaluates the quality of the learned function f^ relative to the training data. | Evaluates the generalization performance of f^ on new data.
Measurement Context | Primarily assessed on training data (or via in-sample diagnostics). | Assessed on test data or via cross-validation (out-of-sample).
Typical Metrics | R², Adjusted R², residual analysis, AIC/BIC, training loss. | Accuracy, precision, recall, F1 score, ROC-AUC (classification); RMSE, MAE, MAPE (regression).
Purpose | To check if the model is underfitting or overfitting the training data. | To check if the model can make correct predictions on unseen data.
Risk if Maximized Alone | Overfitting: the model may memorize training data and fail to generalize. | Misleading results if the dataset is imbalanced or the metric choice is inappropriate.
Example | A polynomial regression with degree 15 may have a perfect fit (R² ≈ 1) on training data but poor test performance. | A spam filter with 99% accuracy on test data may still miss most spam if classes are imbalanced.
Relation to Generalization | Good fit is necessary but not sufficient for good generalization. | High accuracy indicates good generalization only if evaluation is unbiased and representative.
Basics of Exploratory Data Analysis:
Foundations of EDA from Descriptive Statistics
Concept | Definition | Purpose in EDA
Central Tendency | Mean, Median, Mode | Understand where data is centred
Dispersion | Range, Variance, Standard Deviation | Measure spread and variability
Skewness | Degree of asymmetry in distribution | Detect outliers and non-normality
Kurtosis | Tailedness of distribution | Assess presence of extreme values
Relationship Analysis in EDA
Concept | Definition | Purpose in EDA
Covariance | Measure of how two variables change together | Identify directional relationships
Correlation | Standardized measure of linear association (−1 to +1) | Quantify strength and direction of relationships
Regression (Slope) | Quantifies change in one variable due to another | Explore predictive relationships
Clustering | Grouping data points based on similarity | Reveal hidden structure or segments
Graphical Analysis:
Type of Analysis | Variable Count | Purpose | Common Graphs | Examples
Univariate Analysis | One variable | Understand distribution, central tendency, and spread of a single variable. | Histogram (continuous data), Bar Chart (categorical data), Pie Chart, Box Plot | Age distribution of employees; Gender breakdown; Income spread
Bivariate Analysis | Two variables | Explore relationships or associations between two variables. | Scatter Plot (numerical vs. numerical), Line Chart (time series), Grouped Bar Chart (categorical vs. numerical), Heat map (correlation matrix), Box Plot with grouping | Age vs. Salary; Department vs. Service Length; Status Year vs. Status Category
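Finally, a plotting sketch for the univariate and bivariate graphs listed above, assuming pandas and matplotlib; the columns and values are invented.

```python
# Graphical EDA sketch (assumes pandas and matplotlib; columns are illustrative).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "age": rng.integers(22, 60, 200),
    "salary": rng.normal(50, 12, 200),
    "gender": rng.choice(["M", "F"], 200),
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Univariate: histogram of a continuous variable, bar chart of a categorical variable.
axes[0].hist(df["age"], bins=15)
axes[0].set_title("Age distribution")
df["gender"].value_counts().plot.bar(ax=axes[1], title="Gender breakdown")

# Bivariate: scatter plot of two numerical variables.
axes[2].scatter(df["age"], df["salary"], s=10)
axes[2].set(xlabel="Age", ylabel="Salary", title="Age vs. Salary")

plt.tight_layout()
plt.show()
```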