
Loss Functions in Linear Regression

Loss function (aka cost function or error function)
• Quantifies the difference between the values predicted by
the model and the actual values in the training data
• Key to understanding how machines learn
• The goal of training a machine learning model is to minimize
this loss
• Examples: Mean squared error (MSE), Mean absolute error (MAE)
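Both example losses can be computed in a few lines. A minimal sketch, assuming Python with NumPy (the slides name no language, and the arrays below are made up):

import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # actual values from the training data
y_pred = np.array([2.5, 5.5, 6.0, 9.5])   # values predicted by the model

mse = np.mean((y_true - y_pred) ** 2)     # mean squared error
mae = np.mean(np.abs(y_true - y_pred))    # mean absolute error
print(mse, mae)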
Linear Regression
• Supervised machine learning technique
• Used to:
• Estimate the value of the dependent variable for given independent
variable(s)
• Explain the change in the dependent variable for a unit change in an
independent variable
• Identify which independent variables most strongly influence the
dependent variable
๐‘ฆ = ๐‘š๐‘ฅ + ๐‘
x = Independent variable
y = Dependent variable
m = Slope term or Regression coefficient
c = Intercept term
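A quick sketch of estimating m and c from data, assuming Python with NumPy (the data below is invented for illustration):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # independent variable
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # dependent variable

m, c = np.polyfit(x, y, deg=1)   # slope and intercept of the best-fit line
print(m, c)                      # predictions follow y_hat = m * x + c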
Loss function for linear regression
๐‘
๐‘
๐‘†๐‘†๐ธ = เท ๐‘ฆ๐‘– − ๐‘ฆเทœ๐‘– 2 = เท ๐‘ฆ๐‘– − (๐‘š. ๐‘ฅ๐‘– + ๐‘) 2
๐‘–=1
๐‘–=1
Where
๐‘ฅ๐‘– =observed value of the
independent variable for the ith
observation
๐‘ฆ๐‘– =observed value of the dependent
variable for the ith observation
๐‘ฆเทœ๐‘– =predicted value of the dependent
variable for the ith observation
๐‘š. ๐‘ฅ + ๐‘
Gradient Descent Algorithm
SSE = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{N} (y_i - (m x_i + c))^2
๐‘
๐‘†๐‘†๐ธ = เท ๐‘ฆ๐‘– − (๐‘š. ๐‘ฅ๐‘– + ๐‘) 2
Height (Y)
5
๐‘–=1
4
3
๐‘ฆ1 − ๐‘ฆ
เทž1
1
SSE
2
Weight (X)
1
2
3
4
5
Y-intercept
Assume: Slope of the best fit line is known
๐‘
๐‘†๐‘†๐ธ = เท ๐‘ฆ๐‘– − (๐‘š. ๐‘ฅ๐‘– + ๐‘) 2
๐‘–=1
Height (Y)
5
๐‘†๐‘™๐‘œ๐‘๐‘’ ๐‘œ๐‘“ ๐‘ก๐‘Ž๐‘›๐‘”๐‘’๐‘›๐‘ก =
๐‘
4
๐‘‘(๐‘†๐‘†๐ธ)
๐‘‘๐‘
= เท −2 ๐‘ฆ๐‘– − ๐‘š. ๐‘ฅ๐‘– − ๐‘
3
๐‘–=1
Step size = Derivative value x Learning Rate
New value = Old value – Step size
๐‘ฆ๐‘– − ๐‘ฆเท๐‘–
1
SSE
2
c=0
Weight (X)
1
2
3
4
5
Y-intercept
Assume: Slope of the best fit line is known
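A minimal gradient-descent sketch, assuming Python with NumPy (the slides name no language) and mirroring the slide's setup of a known slope m, a starting intercept of 0, and an illustrative learning rate:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
m = 2.0      # assumed known slope, per the slide
c = 0.0      # start the intercept at 0, as in the figure
lr = 0.01    # learning rate (illustrative choice)

for _ in range(1000):
    grad = np.sum(-2 * (y - m * x - c))   # d(SSE)/dc from the slide
    c = c - lr * grad                     # new value = old value - step size
print(c)

Each iteration moves c a little way down the SSE curve; as the tangent's slope approaches zero, the step size shrinks and c settles near the minimum.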
Cases on Linear regression
• Marketwise solutions
• Student performance
• Food delivery times
What to look for in linear regression results?
• Regression Equation:
• Represents the relationship between the dependent and independent
variables
๐‘›
๐‘ฆเทœ = ๐›ฝ0 + ๐›ฝ1 ๐‘ฅ1 + ๐›ฝ2 ๐‘ฅ2 + โ‹ฏ + ๐›ฝ๐‘› ๐‘ฅ๐‘› = ๐›ฝ0 + เท ๐›ฝ๐‘– ๐‘ฅ๐‘–
๐‘–=1
• Coefficients:
• Intercept (β0): The expected value of the dependent variable when all
independent variables are 0.
• Slope coefficients (β1, β2, …, βn): Represent the change in the dependent
variable for a one-unit increase in the independent variable, keeping others
constant.
• Significance:
• P-value: Tests if the coefficient is significantly different from zero.
• p < α: Significant; variable contributes meaningfully.
• p ≥ α: Not significant; may not be impactful.
• Common values for α: 1%, 5%
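A hedged sketch of reading these results off a fitted model, assuming Python with statsmodels installed; the data is synthetic, and the second predictor is deliberately given no real effect so its p-value should come out large:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                    # two independent variables
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=100)   # only x1 truly matters

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.params)    # estimates of beta_0, beta_1, beta_2
print(model.pvalues)   # compare each p-value against alpha (e.g. 0.05)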
Evaluation & comparison of Regression techniques
• Mean Absolute Error (MAE): MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
• Mean Squared Error (MSE): MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
• Root Mean Squared Error (RMSE): RMSE = \sqrt{MSE}
• R-Squared (R²): R^2 = 1 - \frac{SS_{Residual}}{SS_{Total}}
• Adjusted R-Squared: R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}
• Mean Absolute Percentage Error (MAPE): MAPE = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100
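A sketch computing all six metrics, assuming Python with scikit-learn installed (adjusted R² and MAPE are done by hand; the arrays and p = 1 predictor are illustrative):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.8, 5.4, 6.5, 9.3, 10.6])
n, p = len(y_true), 1                      # n observations, p predictors

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)          # penalizes extra predictors
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
print(mae, mse, rmse, r2, adj_r2, mape)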
• MAE
• Purpose: Provides a straightforward measure of prediction accuracy.
• Strengths: Easy to interpret; less sensitive to outliers than MSE.
• Weaknesses: Ignores the direction of errors and doesn't penalize large errors heavily.
• MSE
• Purpose: Evaluates the overall accuracy of the model by emphasizing large errors.
• Strengths: Penalizes large errors more heavily, highlighting significant deviations.
• Weaknesses: Sensitive to outliers; harder to interpret due to squaring.
• RMSE
• Purpose: Provides an interpretable measure of error magnitude.
• Strengths: Accounts for large errors more effectively than MAE; easier to interpret than MSE.
• Weaknesses: Still sensitive to outliers; computationally intensive for large datasets.
• R-Squared
• Purpose: Measures how well the independent variables explain the target variable's variance.
• Strengths: Provides a single measure of model fit; widely recognized and used.
• Weaknesses: Can increase with additional predictors, even if they don't improve the model.
• Adjusted R-Squared
• Purpose: Evaluates model fit while penalizing for adding irrelevant predictors.
• Strengths: More realistic measure of model performance for multiple regression.
• Weaknesses: Can be computationally intensive; may still not identify overfitting completely.
• MAPE
• Purpose: Useful for scale-independent performance measurement, especially in business contexts.
• Strengths: Easy to understand; works well for comparing performance across datasets with different scales.
• Weaknesses: Can produce misleading results when actual values are close to zero.
Ways to get a balanced fit model
• Cross-validation
• E.g., k-fold cross-validation (see the sketch after this list)
• Regularization
• E.g., L1 and L2 regularization
• Dimensionality reduction
• E.g., principal component analysis
• Ensemble techniques
• E.g., max voting, averaging, bagging
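A short k-fold cross-validation sketch, assuming Python with scikit-learn installed; 5 folds and synthetic data are illustrative choices:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=50)

# Each fold is held out once; the model is scored on data it never saw
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(scores.mean())   # average out-of-fold R^2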
Regularization
• Technique used to prevent overfitting and improve the
generalization performance of models
• Overfitting occurs when a model learns to fit the training
data too closely, capturing noise and making it perform
poorly on unseen data
• With a large training set, we can be reasonably confident that
the linear regression model will be accurate
• What if we have only a small set of training data?
[Figure: Height (Y) vs Weight (X) scatter with only a few training points]
[Figure: Height (Y) vs Weight (X); the fitted line tracks the few training points too closely]
Low bias but high variance - overfitting
We would be better off introducing some bias to reduce variance
Regularization
• Regularization methods add a penalty to the model's
objective function to encourage it to have smaller and
simpler coefficients
• That is, make the predicted variable less sensitive to the input
variables by flattening the slope
• Two common types of regularization used in linear models:
L1 (Lasso) Regularization & L2 (Ridge) Regularization
Ridge (L2) regularization
• Adds the squared values of the coefficients to the loss
function
• Prevents overfitting by shrinking the coefficients towards zero
without forcing them to be exactly zero
• L2 loss function: SSE + λ × Σ(coefficients)²
• Here λ is the multiplier that determines the severity of the
penalty
• λ can take any value from 0 to +∞
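A Ridge fit sketch, assuming Python with scikit-learn installed; note scikit-learn calls the multiplier alpha rather than λ (and scales the objective slightly differently), and the data below is synthetic:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))   # deliberately small training set
y = X @ np.array([3.0, 0.0, -1.0, 0.0, 2.0]) + rng.normal(size=20)

ridge = Ridge(alpha=1.0).fit(X, y)   # alpha plays the role of lambda
print(ridge.coef_)                   # shrunk towards zero, but not exactly zero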
Lasso (L1) regularization
• Lasso: Least Absolute Shrinkage and Selection Operator
• Adds the absolute values of the coefficients to the loss
function
• L1 loss function: SSE + λ × Σ|coefficients|
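The matching Lasso sketch under the same assumptions (Python, scikit-learn, synthetic data); unlike Ridge, some coefficients land at exactly zero:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = X @ np.array([3.0, 0.0, -1.0, 0.0, 2.0]) + rng.normal(size=20)

lasso = Lasso(alpha=0.5).fit(X, y)   # alpha plays the role of lambda
print(lasso.coef_)                   # exact zeros act as automatic feature selection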
Values of λ
• At λ = 0, the Lasso / Ridge regression line and the least squares line
are the same
• As we increase λ, the slope becomes flatter, closer to 0
• With a flatter slope, the predicted variable becomes less sensitive to
the independent variables
• To find the optimal λ, try a range of values for λ and use
cross-validation (typically 10-fold CV) to determine which one
results in the lowest validation error
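One way to run that search, as a sketch assuming Python with scikit-learn (whose LassoCV names λ alpha and picks the value with the lowest mean cross-validated error; the candidate grid is illustrative):

import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -1.0, 0.0, 2.0]) + rng.normal(size=100)

# Try 30 lambda values between 1e-3 and 10, scored by 10-fold CV
model = LassoCV(alphas=np.logspace(-3, 1, 30), cv=10).fit(X, y)
print(model.alpha_)   # the lambda selected by cross-validation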
Difference between Ridge and Lasso regularization
• L1 regularization (Lasso) encourages feature selection and sparsity,
while L2 regularization (Ridge) prevents overfitting by shrinking
coefficients
• When the regularization strength (λ) is sufficiently high, Lasso forces some
coefficients to become exactly zero, effectively removing the associated
features from the model.
• Ridge's penalty, on the other hand, is smooth and continuous, leading to
coefficients that are very close to zero but typically not exactly zero, thus
preserving all features in the model
• The choice between them depends on the problem and the trade-off
between complexity and simplicity in the model