Loss function (aka cost function or error function)
• Quantifies the difference between the values predicted by the model and the actual values in the training data
• Key to understanding how machines learn
• The goal of training a machine learning model is to minimize this loss
• Examples: Mean squared error (MSE), Absolute error

Linear Regression
• Supervised machine learning technique
• Used to:
  • Estimate the value of the dependent variable for given independent variable(s)
  • Explain the change in the dependent variable for a unit change in an independent variable
  • Identify which variable most strongly influences the dependent variable

The linear model: $y = mx + c$
• x = Independent variable
• y = Dependent variable
• m = Slope term or Regression coefficient
• c = Intercept term

Loss function for linear regression
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \big(y_i - (m \cdot x_i + c)\big)^2$$
Where
• $x_i$ = observed value of the independent variable for the i-th observation
• $y_i$ = observed value of the dependent variable for the i-th observation
• $\hat{y}_i$ = predicted value of the dependent variable for the i-th observation

Gradient Descent Algorithm
[Figure: scatter plot of Height (Y) against Weight (X) with a candidate fit line and the residual $y_1 - \hat{y}_1$ marked; alongside, a plot of SSE against the y-intercept showing a bowl-shaped curve]
Assume: the slope of the best-fit line is known, so only the intercept $c$ must be found, starting from $c = 0$
• Slope of the tangent at the current intercept:
$$\frac{d(SSE)}{dc} = \sum_{i=1}^{n} -2\,(y_i - m \cdot x_i - c)$$
• Step size = Derivative value × Learning rate
• New value = Old value − Step size
• Repeat until the step size becomes negligible (a minimal code sketch of this loop appears after the results slide below)

Cases on Linear regression
• Marketwise solutions
• Student performance
• Food delivery times

What to look for in results for Linear regression?
• Regression Equation:
  • Represents the relationship between the dependent and independent variables
$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n = \beta_0 + \sum_{j=1}^{n} \beta_j x_j$$
• Coefficients:
  • Intercept ($\beta_0$): The expected value of the dependent variable when all independent variables are 0.
  • Slope coefficients ($\beta_1, \beta_2, \dots, \beta_n$): Represent the change in the dependent variable for a one-unit increase in the corresponding independent variable, keeping the others constant.
• Significance:
  • P-Value: Tests whether a coefficient is significantly different from zero.
  • $p < \alpha$: Significant; the variable contributes meaningfully.
  • $p \ge \alpha$: Not significant; the variable may not be impactful.
  • Common values for $\alpha$: 1%, 5%
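To make the update rule above concrete, here is a minimal Python sketch of the intercept-only gradient descent described on the earlier slides, assuming the slope $m$ is known. The toy data, learning rate, and stopping threshold are illustrative assumptions, not values from the lecture.

```python
import numpy as np

# Toy data (assumed for illustration): Weight (X) and Height (Y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

m = 1.0              # slope of the best-fit line, assumed known (as on the slide)
c = 0.0              # start the intercept at 0
learning_rate = 0.01

for step in range(1000):
    # Slope of the tangent: d(SSE)/dc = sum of -2 * (y_i - m*x_i - c)
    derivative = np.sum(-2 * (y - m * x - c))
    step_size = derivative * learning_rate   # Step size = derivative value x learning rate
    c = c - step_size                        # New value = old value - step size
    if abs(step_size) < 1e-6:                # stop once the steps become negligible
        break

print(f"Intercept found by gradient descent: c = {c:.4f}")
```

For this loss the optimal intercept also has a closed form, the mean of $y_i - m x_i$, which gives a quick check that the loop converged to the right place.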
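One way to obtain the equation, coefficients, and p-values listed above is an OLS summary from the statsmodels library; this is a hedged example, not a tool the slides prescribe. The toy data are fabricated, and the second variable is generated with no true effect, so its p-value should come out above the usual $\alpha$ levels.

```python
import numpy as np
import statsmodels.api as sm

# Toy data (assumed): 100 observations, two independent variables
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = 1.5 + 2.0 * X[:, 0] + rng.normal(scale=0.5, size=100)  # x2 has no true effect

X = sm.add_constant(X)        # adds the intercept column (beta_0)
results = sm.OLS(y, X).fit()

print(results.params)         # beta_0 and the slope coefficients
print(results.pvalues)        # compare each against alpha (e.g. 0.05)
print(results.summary())      # full regression table
```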
Evaluation & comparison of Regression techniques
• Mean Absolute Error (MAE): $MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$
• Mean Squared Error (MSE): $MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
• Root Mean Squared Error (RMSE): $RMSE = \sqrt{MSE}$
• R-Squared ($R^2$): $R^2 = 1 - \frac{SS_{residual}}{SS_{total}}$
• Adjusted R-Squared: $R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$
• Mean Absolute Percentage Error (MAPE): $MAPE = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100$
• (These formulas are implemented in the NumPy sketch below)

| Metric | Purpose | Strengths | Weaknesses |
|---|---|---|---|
| MAE | Provides a straightforward measure of prediction accuracy. | Easy to interpret; less sensitive to outliers than MSE. | Ignores the direction of errors and doesn't penalize large errors heavily. |
| MSE | Evaluates the overall accuracy of the model by emphasizing large errors. | Penalizes large errors more heavily, highlighting significant deviations. | Sensitive to outliers; harder to interpret due to squaring. |
| RMSE | Provides an interpretable measure of error magnitude. | Accounts for large errors more effectively than MAE; easier to interpret than MSE. | Still sensitive to outliers; computationally intensive for large datasets. |
| R-Squared | Measures how well the independent variables explain the target variable's variance. | Provides a single measure of model fit; widely recognized and used. | Can increase with additional predictors, even if they don't improve the model. |
| Adjusted R-Squared | Evaluates model fit while penalizing for adding irrelevant predictors. | More realistic measure of model performance for multiple regression. | Can be computationally intensive; may still not identify overfitting completely. |
| MAPE | Useful for scale-independent performance measurement, especially in business contexts. | Easy to understand; works well for comparing performance across datasets with different scales. | Can produce misleading results when actual values are close to zero. |

Ways to get a balanced fit model
• Cross-validation (e.g. K-fold cross-validation)
• Regularization (e.g. L1 and L2 regularization)
• Dimensionality reduction (e.g. Principal component analysis)
• Ensemble techniques (e.g. Max voting, Averaging, Bagging)
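The metric formulas above are simple enough to implement directly; below is the minimal NumPy sketch referenced earlier (the function name, toy values, and the choice to return a dict are illustrative assumptions).

```python
import numpy as np

def regression_metrics(y_true, y_pred, k):
    """Evaluation metrics from the slides; k = number of independent variables
    (needed only for adjusted R-squared)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    residuals = y_true - y_pred

    mae = np.mean(np.abs(residuals))                  # Mean Absolute Error
    mse = np.mean(residuals ** 2)                     # Mean Squared Error
    rmse = np.sqrt(mse)                               # Root Mean Squared Error
    ss_res = np.sum(residuals ** 2)                   # SS_residual
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # SS_total
    r2 = 1 - ss_res / ss_tot                          # R-squared
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)     # Adjusted R-squared
    mape = np.mean(np.abs(residuals / y_true)) * 100  # MAPE; fails if any y_true is 0

    return {"MAE": mae, "MSE": mse, "RMSE": rmse,
            "R2": r2, "Adj_R2": adj_r2, "MAPE": mape}

print(regression_metrics([3.0, 5.0, 7.5, 9.0], [2.8, 5.3, 7.0, 9.4], k=1))
```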
Regularization
• Technique used to prevent overfitting and improve the generalization performance of models
• Overfitting occurs when a model fits the training data too closely, capturing noise, so it performs poorly on unseen data
• With a large training set, we can be confident that the linear regression model will be accurate
• What if we have only a small set of training data?

[Figure: Height (Y) plotted against Weight (X); a regression line passing through every training point]
• Low bias but high variance: overfitting
• We would be better off introducing some bias to reduce variance

Regularization
• Regularization methods add a penalty to the model's objective function to encourage smaller and simpler coefficients
• That is, they make the predicted variable less sensitive to the input variables by flattening the slope
• Two common types of regularization used in linear models: L1 (Lasso) regularization and L2 (Ridge) regularization

Ridge (L2) regularization
• Adds the squared values of the coefficients to the loss function
• Prevents overfitting by shrinking the coefficients towards zero without forcing them to be exactly zero
• L2 loss function: $MSE + \lambda \times \sum (\text{coefficients})^2$
• Here $\lambda$ is the multiplier that determines the severity of the penalty
• It can take any value from 0 to $+\infty$

Lasso (L1) regularization
• Lasso: Least Absolute Shrinkage and Selection Operator
• Adds the absolute values of the coefficients to the loss function
• L1 loss function: $MSE + \lambda \times \sum |\text{coefficients}|$

Values of $\lambda$
• At $\lambda = 0$, the Lasso / Ridge regression line and the least-squares line are the same
• As we increase $\lambda$, the slope becomes flatter, closer to 0
• With a flatter slope, the predicted variable becomes less sensitive to the independent variables
• To find the optimal $\lambda$, try a range of values and use cross-validation (typically 10-fold CV) to determine which one results in the lowest variance (see the scikit-learn sketch at the end of this section)

Difference between Ridge and Lasso regularization
• L1 regularization (Lasso) encourages feature selection and sparsity, while L2 regularization (Ridge) prevents overfitting by shrinking coefficients
• When the regularization strength ($\lambda$) is sufficiently high, Lasso forces some coefficients to become exactly zero, effectively removing the associated features from the model
• Ridge's penalty, on the other hand, is smooth and continuous, leading to coefficients that are very close to zero but typically not exactly zero, thus preserving all features in the model
• The choice between them depends on the problem and the trade-off between complexity and simplicity in the model
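As a sketch of the $\lambda$ search by 10-fold cross-validation described above, the scikit-learn snippet below fits Ridge and Lasso over a grid of penalty strengths (scikit-learn calls $\lambda$ "alpha"). The synthetic data and the grid are assumptions for illustration; the point to observe is that Lasso drives the coefficients of the irrelevant features to exactly zero, while Ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

# Synthetic data (assumed): 5 features, but only the first two matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Candidate penalty strengths (lambda on the slides, `alpha` in scikit-learn)
alphas = np.logspace(-3, 2, 50)

ridge = RidgeCV(alphas=alphas, cv=10).fit(X, y)   # 10-fold CV, as the slide suggests
lasso = LassoCV(alphas=alphas, cv=10).fit(X, y)

print("Ridge coefficients:", ridge.coef_.round(3))   # all nonzero, just shrunk
print("Lasso coefficients:", lasso.coef_.round(3))   # irrelevant ones exactly 0
print("Chosen lambdas:", ridge.alpha_, lasso.alpha_)
```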