Lecture 01 – Introduction to Statistical Learning

INTRODUCTION:
Statistical learning refers to a vast set of tools for understanding data. These tools and techniques model the relationship between inputs X and an output Y. The core objective is to estimate a function f(X) that captures the systematic information in the data, thereby enabling prediction or inference.

Statistical learning has two fundamental goals:
a. Prediction: accurately estimate future or unknown outcomes Y from the inputs X.
b. Inference: understand how each input variable Xj affects the output Y.

Statistical learning is applied in every sector where data is generated and processed. Common industries include health diagnostics, financial forecasting, marketing analytics, and bioinformatics.

SOME CORE FUNDAMENTALS: DATA VERSUS DATASET:
- Definition. Data: raw facts, figures, or observations (e.g., numbers, text, images). Dataset: a structured collection of related data, often organized in tabular form.
- Granularity. Data: atomic or individual units of information. Dataset: aggregated and organized data for a specific purpose or analysis.
- Structure. Data: may be unstructured or semi-structured. Dataset: typically structured with rows and columns.
- Purpose. Data: basic building blocks of information. Dataset: used for reporting, analysis, or modelling.
- Examples. Data: "25", "Male", "Blue", "2025-09-13". Dataset: a table of customer demographics, sales records, or survey responses.
- Format. Data: can be text, numbers, audio, video, etc. Dataset: usually stored in formats such as CSV, Excel, JSON, or SQL tables.
- Usage context. Data: found in sensors, logs, documents, etc. Dataset: used in data science, machine learning, and statistical analysis.
- Flexibility. Data: highly flexible and context-dependent. Dataset: more rigid and designed for specific tasks or studies.

STATISTICAL VERSUS MACHINE LEARNING MODEL:
- Primary goal. Statistical model: explain relationships between variables, test hypotheses, and make inferences about a population. Machine learning model: make accurate predictions or classifications from data, often without explicit assumptions about relationships.
- Approach. Statistical model: based on predefined mathematical equations and probability distributions; focuses on parameter estimation and inference. Machine learning model: learns patterns directly from data using algorithms; focuses on predictive performance.
- Assumptions. Statistical model: strong assumptions about the data (e.g., normality, independence, linearity, no multicollinearity). Machine learning model: few or no strict assumptions about the data distribution; can handle complex, non-linear relationships.
- Data requirements. Statistical model: works well with smaller datasets if the assumptions hold. Machine learning model: performs best with large datasets; can handle high-dimensional data.
- Interpretability. Statistical model: high; coefficients and parameters have clear meanings, so results are easier to explain. Machine learning model: often lower; some models (e.g., deep learning) are "black boxes" with little interpretability.
- Examples. Statistical model: linear regression, logistic regression, ANOVA, time series models. Machine learning model: decision trees, random forests, support vector machines, neural networks.
- Output focus. Statistical model: estimates parameters and quantifies uncertainty (confidence intervals, p-values). Machine learning model: produces predictions or classifications; may not provide parameter estimates.
- Evaluation metrics. Statistical model: goodness-of-fit measures (R², adjusted R²), hypothesis tests, AIC/BIC. Machine learning model: prediction accuracy, precision, recall, F1-score, ROC-AUC, cross-validation error.
- Strengths. Statistical model: strong theoretical foundation; results are explainable and statistically valid if the assumptions hold. Machine learning model: flexible; can model complex patterns; often higher predictive accuracy in complex tasks.
- Limitations. Statistical model: sensitive to assumption violations; may underperform on highly non-linear or unstructured data. Machine learning model: may overfit without proper regularization; interpretability can be challenging.
STATISTICAL LEARNING VERSUS MACHINE LEARNING:
- Primary goal. Statistical learning: understand and interpret relationships between variables; make inferences about the data-generating process. Machine learning: maximise predictive accuracy on unseen data; focus on pattern recognition and automation.
- Approach. Statistical learning: starts with a statistical model (often parametric) and estimates its parameters; theory-driven. Machine learning: uses algorithms to learn patterns directly from data; data-driven.
- Assumptions. Statistical learning: often requires strong assumptions (e.g., linearity, normality, independence, no multicollinearity). Machine learning: fewer or no strict assumptions about the data distribution; can model complex, non-linear relationships.
- Interpretability. Statistical learning: high; parameters have clear meaning and results are easier to explain. Machine learning: often lower; some models (e.g., deep learning) are "black boxes" with little interpretability.
- Data requirements. Statistical learning: works well with smaller datasets if the assumptions hold. Machine learning: performs best with large datasets; can handle high-dimensional and unstructured data.
- Typical methods. Statistical learning: linear regression, logistic regression, ANOVA, generalized linear models. Machine learning: decision trees, random forests, support vector machines, neural networks, gradient boosting.
- Output focus. Statistical learning: parameter estimates, confidence intervals, hypothesis tests, p-values. Machine learning: predictions, classifications, probability scores; may not provide parameter estimates.
- Evaluation metrics. Statistical learning: goodness-of-fit measures (R², adjusted R²), AIC, BIC, hypothesis testing. Machine learning: prediction accuracy, precision, recall, F1-score, ROC-AUC, cross-validation error.
- Strengths. Statistical learning: strong theoretical foundation; results are explainable and statistically valid if the assumptions hold. Machine learning: flexible; can model complex patterns; often higher predictive accuracy in complex tasks.
- Limitations. Statistical learning: sensitive to assumption violations; may underperform on highly non-linear or unstructured data. Machine learning: may overfit without proper regularization; interpretability can be challenging.

SUPERVISED VERSUS UNSUPERVISED STATISTICAL MODEL:
- Definition. Supervised: learns a mapping from inputs X to outputs Y using labelled data. Unsupervised: finds patterns or structure in unlabelled data without predefined outputs.
- Data requirement. Supervised: requires both input variables and corresponding output labels. Unsupervised: requires only input variables; no output labels are provided.
- Goal. Supervised: predict outcomes for new data and/or understand the relationship between X and Y. Unsupervised: discover hidden patterns, groupings, or data structure.
- Examples of tasks. Supervised: classification (spam detection), regression (predicting house prices). Unsupervised: clustering (customer segmentation), dimensionality reduction (PCA).
- Algorithms. Supervised: linear regression, logistic regression, decision trees, random forests, SVM, neural networks. Unsupervised: k-means clustering, hierarchical clustering, PCA, t-SNE, association rule mining.
- Evaluation. Supervised: accuracy, RMSE, precision/recall, ROC-AUC, all compared against known labels. Unsupervised: silhouette score, within-cluster sum of squares, variance explained; there are no labels to compare against.
- Supervision. Supervised: a "teacher" is present; the model learns from correct answers during training. Unsupervised: no "teacher"; the model learns patterns on its own.
- Interpretability. Supervised: often easy to interpret for simple models; complex models may be opaque. Unsupervised: interpretation focuses on the discovered structure, which may be subjective.
- Use cases. Supervised: medical diagnosis, credit scoring, demand forecasting. Unsupervised: market basket analysis, anomaly detection, exploratory data analysis.
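To make the contrast concrete, here is a minimal sketch, assuming scikit-learn and NumPy are available (the synthetic data and parameter choices are illustrative, not part of the lecture): a supervised classifier is fit on labelled data, while an unsupervised clustering model sees only the inputs.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, silhouette_score

# Two well-separated groups of points; y holds the "true" labels.
X, y = make_blobs(n_samples=300, centers=2, cluster_std=1.5, random_state=0)

# Supervised: the model sees X *and* y, and is judged against the known labels.
clf = LogisticRegression().fit(X, y)
print("Supervised accuracy:", accuracy_score(y, clf.predict(X)))

# Unsupervised: the model sees only X and must discover structure itself,
# so it is judged with label-free criteria such as the silhouette score.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("Unsupervised silhouette score:", silhouette_score(X, clusters))
```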
SUPERVISED, UNSUPERVISED AND REINFORCEMENT LEARNING:
- Definition. Supervised: learns from labelled data; each input has a known output. Unsupervised: learns from unlabelled data; finds patterns or structure without predefined outputs. Reinforcement: learns by interacting with an environment, receiving rewards or penalties for its actions.
- Type of data. Supervised: a labelled dataset (X, Y pairs). Unsupervised: an unlabelled dataset (X only). Reinforcement: no fixed dataset; learns from state, action, reward sequences.
- Goal. Supervised: predict outcomes for new inputs accurately. Unsupervised: discover hidden patterns, clusters, or relationships in the data. Reinforcement: learn an optimal policy to maximize cumulative rewards over time.
- Learning process. Supervised: the model maps inputs to outputs using training examples. Unsupervised: the model groups, reduces, or organizes the data based on similarity or structure. Reinforcement: an agent explores and exploits actions to improve decision-making via feedback.
- Supervision. Supervised: requires external supervision (labels). Unsupervised: no supervision; self-organizing. Reinforcement: no explicit supervision; learns from feedback signals.
- Common algorithms. Supervised: linear/logistic regression, decision trees, random forest, SVM, neural networks. Unsupervised: k-means, hierarchical clustering, PCA, autoencoders, t-SNE. Reinforcement: Q-learning, SARSA, Deep Q-Networks (DQN), policy gradient methods.
- Example problems. Supervised: classification (spam detection), regression (price prediction). Unsupervised: clustering (customer segmentation), dimensionality reduction (PCA). Reinforcement: game playing (Chess, Go), robotics control, self-driving cars.
- Output. Supervised: predicted labels or continuous values. Unsupervised: group assignments, reduced-dimension features, association rules. Reinforcement: a sequence of actions (a policy) that maximizes long-term reward.
- Evaluation metrics. Supervised: accuracy, precision, recall, F1, RMSE, MAE. Unsupervised: silhouette score, Davies–Bouldin index, reconstruction error. Reinforcement: cumulative reward, average reward per episode, convergence rate.

Key takeaway: Supervised = learn from given answers. Unsupervised = find structure without answers. Reinforcement = learn by trial, error, and reward.

REGRESSION VERSUS CLASSIFICATION PROBLEMS:
- Output type. Regression: continuous numeric values (e.g., price, temperature, weight). Classification: discrete categories or class labels (e.g., spam/not spam, disease/no disease).
- Goal. Regression: predict a real-valued outcome based on input features. Classification: assign inputs to one of several predefined classes.
- Nature of target variable. Regression: ordered, measurable quantities. Classification: unordered, qualitative categories.
- Examples. Regression: predicting house prices, forecasting sales, estimating rainfall amount. Classification: email spam detection, image recognition, medical diagnosis.
- Model output. Regression: a number on a continuous scale. Classification: a class label (or the probability of belonging to each class).
- Evaluation metrics. Regression: mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), R². Classification: accuracy, precision, recall, F1-score, ROC-AUC.
- Decision mechanism. Regression: fits a function (e.g., a best-fit line) to approximate numeric outcomes. Classification: finds decision boundaries that separate classes.
- Typical algorithms. Regression: linear regression, polynomial regression, regression trees, support vector regression. Classification: logistic regression, decision trees, random forests, k-NN, support vector machines.

CURSE OF DIMENSIONALITY:
The curse of dimensionality refers to the strange and often problematic things that happen when working with high-dimensional data (data with many features/variables). As the number of dimensions p increases:
- The volume of the space grows exponentially.
- Data points become sparse; they are far apart from each other.
- Many algorithms that rely on "closeness" or "locality" (such as nearest neighbours or kernel smoothing) start to break down.

Common problems this causes in machine learning:
- Distance loses meaning: in high dimensions, the difference between the nearest and farthest neighbour distances becomes very small. The "nearest" points are still far away, so local averaging becomes ineffective.
- Data sparsity: to maintain the same density of points as the number of dimensions grows, you need an exponentially larger dataset. Example: a 10% "neighbourhood" in 1D is small and local; in 100D it covers almost the entire space.
- Overfitting risk: with many features, models can fit noise instead of signal. More parameters mean higher variance unless the model is regularised.
- Computational cost: more features mean more storage, longer training times, and higher memory use.

Key example: nearest-neighbour averaging works well when the number of predictors p is small (about 4 or fewer) and the dataset is reasonably large (N). When p becomes large, the method often performs poorly; this is the curse of dimensionality. In high dimensions, even the "nearest" data points are still far away from the target point. To reduce variance in the estimate, we need to average over a reasonable fraction of the data (e.g., 10%), but in high dimensions a 10% neighbourhood is no longer local: it covers a huge, spread-out region. This means we lose the local-averaging spirit that makes nearest-neighbour methods effective in low dimensions.
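The distance-concentration effect can be checked with a short simulation (a sketch assuming NumPy; the sample size and dimensions are arbitrary illustrations): as p grows, the ratio of the nearest to the farthest distance from a query point approaches 1, so "nearest" neighbours stop being meaningfully close.

```python
import numpy as np

# Distance concentration: as the dimension p grows, the gap between the
# nearest and farthest neighbour of a query point shrinks relative to the distances.
rng = np.random.default_rng(0)
n = 1000  # number of data points

for p in [1, 10, 100, 1000]:
    X = rng.uniform(0, 1, size=(n, p))      # points in the unit hypercube
    query = rng.uniform(0, 1, size=p)       # a random query point
    dists = np.linalg.norm(X - query, axis=1)
    ratio = dists.min() / dists.max()       # close to 1 means "nearest" is barely closer than "farthest"
    print(f"p={p:>4}  nearest/farthest distance ratio = {ratio:.3f}")
```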
CONCEPT OF TEST DATA AND TRAINING DATA:
- Purpose. Training dataset: used to teach the model; the algorithm learns patterns, relationships, and parameters from this data. Test dataset: used to evaluate the trained model's performance on unseen data.
- Exposure to model. Training: the model sees this data during training and adjusts its parameters accordingly. Test: the model never sees this data during training; it is kept separate to ensure an unbiased evaluation.
- Size. Training: usually the larger portion of the data (e.g., 70–80%), so the model has enough examples to learn from. Test: usually the smaller portion (e.g., 20–30%), reserved for final performance testing.
- Role in model development. Training: determines how well the model can fit the training data; used for parameter estimation. Test: determines how well the model can generalise to new, unseen data.
- Metrics computed. Training: training error and loss-function values during learning. Test: test error, accuracy, precision, recall, F1-score, ROC-AUC, etc.
- Risk if misused. Training: if it is too small, the model may underfit due to insufficient learning. Test: if it is used during training, it can lead to overfitting and overly optimistic performance estimates.
- Analogy. Training: like studying from a textbook before an exam. Test: like taking the final exam to check how well you learned.
Training set → "learn". Test set → "prove".
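The split itself is simple in practice. The following sketch, assuming scikit-learn (the 70/30 split and the synthetic data are illustrative), learns only from the training portion and "proves" itself on the held-out portion.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data: y depends linearly on x plus noise.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 2, size=200)

# Hold out 30% of the observations for an unbiased evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model = LinearRegression().fit(X_train, y_train)                 # learn from the training set only
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))     # evaluate on unseen data
print(f"Training MSE: {train_mse:.2f}   Test MSE: {test_mse:.2f}")
```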
THE TRADE-OFFS IN MACHINE LEARNING MODELS:
There are three fundamental trade-offs in ML models:
a. Prediction accuracy versus interpretability
b. Good fit versus overfit or underfit
c. Parsimony versus black box

THE TRADE-OFF BETWEEN PREDICTION ACCURACY AND MODEL INTERPRETABILITY:
- Simple models (e.g., linear regression) are easy to interpret but may miss complex patterns, leading to lower accuracy on unseen data.
- Complex models (e.g., random forests, neural networks) can capture intricate, non-linear relationships, often improving accuracy, but their inner workings are harder to explain.
- Inference-focused tasks (understanding relationships, testing hypotheses) benefit from interpretability, even if prediction accuracy is slightly lower.
- Prediction-focused tasks (spam detection, image recognition) often prioritise accuracy over interpretability, as long as the model is well validated.
- Increasing flexibility to boost accuracy can also increase variance and the risk of overfitting, which may paradoxically reduce predictive performance on new data.
- Interpretability supports trust, transparency, and regulatory compliance, which can outweigh small gains in accuracy in sensitive domains such as healthcare or finance.
- The optimal balance depends on context: in high-stakes, regulated environments, interpretability may be non-negotiable; in low-risk, high-volume prediction tasks, accuracy may dominate.
- Hybrid approaches (e.g., interpretable surrogates, post-hoc explanation tools such as SHAP or LIME) aim to retain much of the accuracy of complex models while restoring some interpretability.

Trade-off between more restrictive (low-flexibility) and more flexible (high-flexibility) models, and when to prefer which:
- Data size. Restrictive: small dataset; avoids overfitting. Flexible: large dataset; enough information to fit complex patterns.
- True relationship. Restrictive: likely simple or linear. Flexible: likely complex or highly non-linear.
- Noise level. Restrictive: high noise; a simpler model avoids chasing randomness. Flexible: low noise; can capture subtle patterns without overfitting.
- Goal. Restrictive: interpretability, transparency, inference. Flexible: maximum predictive accuracy.
- Predictors vs. observations. Restrictive: many predictors relative to the sample size. Flexible: few predictors but complex interactions.
- Risk tolerance. Restrictive: high-stakes or regulated domains that need explainability. Flexible: low-stakes or competitive domains where performance is the priority.
- Bias–variance profile. Restrictive: higher bias, lower variance. Flexible: lower bias, higher variance.
- Examples. Restrictive: linear regression, logistic regression. Flexible: random forests, splines, neural networks.

To sum up:
- Prediction accuracy: how well a model predicts outcomes on unseen data.
  - High accuracy means the model generalises well.
  - Often achieved with flexible, complex models that can capture intricate patterns.
- Interpretability: how easily humans can understand the relationship between inputs and outputs in the model.
  - High interpretability means you can explain why the model made a certain prediction.
  - Often associated with simpler models.
The tension: as model flexibility increases, accuracy on the training data often improves, but interpretability usually decreases and the risk of overfitting rises.

Why this matters:
- When interpretability is critical: regulatory compliance (finance, healthcare), scientific research (understanding causal relationships), stakeholder trust (executives, customers).
- When accuracy is the priority: recommendation systems, spam detection, image recognition. In these cases, a "black-box" model may be acceptable if it performs better.
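As a rough illustration of this trade-off, the sketch below (assuming scikit-learn; the data-generating function and model choices are illustrative) compares a restrictive linear model, whose single coefficient is easy to read off, with a flexible random forest on the same non-linear data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Non-linear truth: y = sin(x) + noise.
rng = np.random.default_rng(2)
X = rng.uniform(0, 6, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)

# Restrictive model: interpretable (one coefficient) but biased for curved data.
lin = LinearRegression().fit(X_tr, y_tr)
# Flexible model: captures the curvature but is far harder to explain.
rf = RandomForestRegressor(n_estimators=200, random_state=2).fit(X_tr, y_tr)

for name, m in [("linear regression", lin), ("random forest", rf)]:
    print(name, "test MSE:", round(mean_squared_error(y_te, m.predict(X_te)), 3))
print("linear slope coefficient:", round(lin.coef_[0], 3))  # easy to read off and explain
```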
Flexibility, accuracy, and interpretability by model type:
- Linear regression. Flexibility: low. Prediction accuracy: moderate (if the relationship is truly linear). Interpretability: high; coefficients have clear meaning.
- Decision tree (shallow). Flexibility: low–moderate. Prediction accuracy: moderate. Interpretability: high; easy to visualise.
- Random forest. Flexibility: high. Prediction accuracy: high. Interpretability: low; many trees, hard to interpret.
- Gradient boosting. Flexibility: high. Prediction accuracy: very high. Interpretability: low; complex ensemble.
- Neural network (deep). Flexibility: very high. Prediction accuracy: very high. Interpretability: very low; a "black box".

Some key insights:
- More flexible methods can fit a wider range of shapes for f(X), but at the cost of interpretability.
- Sometimes less flexible models are chosen deliberately for their explanatory power, even if accuracy is slightly lower.
- Machine Learning Mastery: "Unfortunately, the predictive models that are most powerful are usually the least interpretable. As complexity increases, so do the number of parameters and the difficulty of explaining them."
- Computational Social Science (bookdown): low flexibility means better interpretation but worse prediction; high flexibility means better prediction potential, but the model can overfit and lose interpretability.

THE BIAS-VARIANCE TRADE-OFF:
The bias–variance trade-off is one of the most important ideas in statistical modelling and machine learning because it explains why models perform the way they do on new, unseen data.

The two sources of error: in machine learning, prediction error can be split into three parts. Bias and variance are the reducible parts; we can influence them by changing the model. Irreducible error is noise in the data that we cannot remove.

When we build a predictive model, the total expected prediction error can be decomposed into three parts:
- Bias: error from wrong assumptions in the model.
  - High bias means the model is too simple to capture the true relationship (e.g., fitting a straight line to a curved pattern).
  - Leads to underfitting: both training and test errors are high.
- Variance: error from too much sensitivity to the training data.
  - High variance means the model fits the noise as well as the signal (e.g., a very wiggly curve that passes through every training point).
  - Leads to overfitting: training error is low, but test error is high.
- Irreducible error: noise in the data that no model can remove.

The trade-off:
- Increasing model flexibility (e.g., moving from linear regression to high-degree polynomials) → bias decreases (the model can fit more complex patterns) → variance increases (the model becomes more sensitive to fluctuations in the training set).
- Decreasing model flexibility (e.g., using a very simple model) → bias increases (the model misses real patterns) → variance decreases (predictions are more stable across datasets).
The sweet spot is where bias² + variance is minimised; this is the point of optimal model complexity.

Visual intuition:
- Rigid models: high bias, low variance → underfit.
- Moderately flexible models: balanced bias and variance → best generalisation.
- Highly flexible models: low bias, high variance → overfit.
If we plot test MSE against model flexibility, we see the classic U-shaped curve for test error, while training error steadily decreases.

Why it matters: understanding this trade-off helps you
- choose the right model complexity,
- apply regularisation (such as Lasso or Ridge) to control variance, and
- avoid chasing low training error at the cost of poor generalisation.
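Regularisation is one practical lever on this trade-off. The sketch below (assuming scikit-learn; the penalty strengths, polynomial degree, and synthetic data are illustrative) shrinks the coefficients of a deliberately flexible polynomial model; Lasso typically sets several of them exactly to zero, trading a little bias for lower variance.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=120)
y = 1.5 * x - 2.0 * x**2 + rng.normal(0, 0.3, size=120)   # quadratic truth plus noise

# Deliberately over-flexible basis: degree-12 polynomial features.
X = PolynomialFeatures(degree=12, include_bias=False).fit_transform(x.reshape(-1, 1))
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=3)

for name, model in [("unregularised OLS", LinearRegression()),
                    ("ridge (alpha=1.0)", Ridge(alpha=1.0)),
                    ("lasso (alpha=0.01)", Lasso(alpha=0.01, max_iter=100_000))]:
    model.fit(X_tr, y_tr)
    mse = mean_squared_error(y_te, model.predict(X_te))
    n_nonzero = int(np.sum(np.abs(model.coef_) > 1e-6))
    print(f"{name:>19}: test MSE = {mse:.3f}, non-zero coefficients = {n_nonzero}")
```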
The bias–variance equation: for a test point x0, the expected test MSE decomposes as
E[(y0 − f^(x0))²] = [Bias(f^(x0))]² + Var(f^(x0)) + Var(ε).
This decomposition is foundational to understanding model performance in statistical learning.
1. Trade-off between bias and variance: simple models (like linear regression) tend to have low variance but high bias; complex models (like deep neural networks) tend to have low bias but high variance. The art of modelling lies in balancing these two to minimise total error.
2. Irreducible error is a reality check: no matter how perfect your model is, you cannot eliminate the noise Var(ε).
3. Model evaluation must be holistic: focusing only on training accuracy can be misleading. You need to assess generalisation, i.e., how well the model performs on unseen data. Techniques like cross-validation help estimate this decomposition empirically.

Concept of variance and bias in statistical modelling:
- Variance refers to the amount by which f^ would change if we estimated it using a different training dataset. Because the training data are used to fit the statistical learning method, different training sets will produce different f^.
- Bias refers to the error introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model.

Bias versus variance:
- Definition. Bias: error due to wrong assumptions in the model. Variance: error due to the model's sensitivity to the training data.
- Cause. Bias: the model is too simple to capture real patterns. Variance: the model is too complex and overfits noise.
- Effect. Bias: misses important trends → underfitting. Variance: captures noise as if it were signal → overfitting.
- Training error. Bias: high (the model does not fit the training data well). Variance: low (the model fits the training data very closely).
- Test error. Bias: high (poor generalisation due to oversimplification). Variance: high (poor generalisation due to overfitting).
- Model behaviour. Bias: predictions are consistently wrong. Variance: predictions vary wildly with different datasets.
- Example model. Bias: linear regression on non-linear data. Variance: high-degree polynomial regression on noisy data.
- Model complexity. Bias: low complexity → high bias. Variance: high complexity → high variance.
- Solution. Bias: use a more flexible model. Variance: use regularisation or a simpler model.

Note: good test-set performance of a statistical learning method requires low variance as well as low squared bias. This is referred to as a trade-off because it is easy to obtain a method with extremely low bias but high variance (for instance, by drawing a curve that passes through every single training observation) or a method with very low variance but high bias (by fitting a horizontal line to the data). The challenge lies in finding a method for which both the variance and the squared bias are low. As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease; the relative rate of change of these two quantities determines whether the test MSE increases or decreases. As we increase the flexibility of a class of methods, the bias tends to decrease faster than the variance increases at first, so the expected test MSE declines. However, at some point increasing flexibility has little impact on the bias but starts to significantly increase the variance. In a real-life situation in which f is unobserved, it is generally not possible to explicitly compute the test MSE, bias, or variance for a statistical learning method. Nevertheless, one should always keep the bias–variance trade-off in mind.

Note: increasing model complexity reduces bias but increases variance; decreasing complexity reduces variance but increases bias. The goal is to find the "sweet spot" where total error (bias² + variance) is minimal. Bias is about being consistently wrong due to oversimplification; variance is about being inconsistently wrong due to over-sensitivity to the training data.
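Although bias and variance cannot be computed on real data where f is unknown, they can be estimated in a simulation where we choose the true f ourselves. The sketch below (assuming NumPy; the true function, noise level, and polynomial degrees are illustrative) refits a model on many resampled training sets and measures the squared bias and variance of its prediction at a single test point.

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sin(2 * x)          # the "true" f, known only because we simulate the data
sigma = 0.4                          # standard deviation of the irreducible noise
x0, n, reps = 1.0, 60, 500           # test point, training-set size, number of resamples

for degree in [1, 3, 9]:             # increasing flexibility
    preds = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(0, 3, size=n)            # a fresh training set each time
        y = f(x) + rng.normal(0, sigma, size=n)
        coefs = np.polyfit(x, y, deg=degree)     # least-squares polynomial fit
        preds[r] = np.polyval(coefs, x0)         # prediction at the test point
    bias_sq = (preds.mean() - f(x0)) ** 2
    variance = preds.var()
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}, "
          f"expected test MSE ~ {bias_sq + variance + sigma**2:.4f}")
```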
Interpretation (of the standard plot of error components against model flexibility):
- Bias curve: starts high for simple models and drops as complexity increases.
- Variance curve: starts low for simple models and rises as complexity increases.
- Training error: falls steadily as complexity increases.
- Test error (U-shaped): falls at first (bias drops faster than variance rises), then rises again (variance dominates).
- Sweet spot: the point of lowest test error, i.e., the optimal balance between bias and variance.

GOOD FIT VERSUS UNDER-FIT AND OVER-FIT:
- Good fit. Description: the model captures the underlying pattern in the data without memorising noise; it performs well on both training and unseen data. Bias–variance profile: balanced bias and variance. Interpretability: often moderate and dependent on the model choice; a simple model with a good fit is highly interpretable, a complex one may not be. Typical cause: appropriate model complexity, tuned hyperparameters, and enough data.
- Underfitting. Description: the model is too simple to capture the true relationship between inputs and outputs; it performs poorly on both training and test data. Bias–variance profile: high bias, low variance. Interpretability: usually high, since simple models are easy to explain, but the explanations are misleading because the model misses important patterns. Typical cause: over-simplified model, wrong assumptions, too few features.
- Overfitting. Description: the model learns noise and idiosyncrasies of the training data in addition to the true pattern; it performs well on training data but poorly on test data. Bias–variance profile: low bias, high variance. Interpretability: often low, since complex models are harder to interpret, and even if explained, the reasoning may be based on noise rather than true patterns. Typical cause: excessive complexity, too many features, insufficient regularisation.

Overfitting versus underfitting:
- Definition. Overfitting: the model learns the training data too well, including noise and outliers. Underfitting: the model fails to learn the underlying pattern in the data.
- Training performance. Overfitting: very high accuracy or very low error. Underfitting: poor accuracy or high error.
- Test performance. Overfitting: poor generalisation; performs badly on unseen data. Underfitting: poor generalisation; performs badly on both training and test data.
- Model complexity. Overfitting: too complex (e.g., a high-degree polynomial, a deep tree). Underfitting: too simple (e.g., a linear model for non-linear data).
- Bias. Overfitting: low bias (fits the training data closely). Underfitting: high bias (misses key patterns).
- Variance. Overfitting: high variance (sensitive to small changes in the training data). Underfitting: low variance (predictions are stable but inaccurate).
- Visual analogy. Overfitting: a curve that passes through every training point, even the noise. Underfitting: a flat line that misses the trend entirely.
- Cause. Overfitting: too many features, insufficient regularisation, a small training set. Underfitting: too few features, an overly simplistic model, insufficient training.
- Solution. Overfitting: simplify the model, use regularisation and cross-validation, and increase the training data. Underfitting: increase model complexity, add features, and reduce bias.

Why we need to analyse this:
- Underfit models: easy to interpret (e.g., a straight line through clearly curved data), but the interpretation is wrong because the model ignores important structure.
- Good-fit models: strike a balance; they explain the data well and generalise. If the model is simple (e.g., linear regression with relevant predictors), interpretability is high; if the model is complex (e.g., tuned gradient boosting), interpretability may require extra tools (e.g., SHAP, LIME).
- Overfit models: often complex and opaque (e.g., deep trees, high-degree polynomials). Even if you can "explain" them, the explanation may be misleading because the model's logic is tied to noise in the training set.
Bias–variance–interpretability triangle:
- High bias (underfit) → simple, interpretable, but inaccurate.
- Balanced bias and variance (good fit) → accurate and potentially interpretable.
- High variance (overfit) → accurate on training data, poor generalisation, low interpretability.

Note:
- Good fit = model complexity matches the true complexity of the data → best generalisation.
- Underfit = too simple → high bias, misleadingly "clear" interpretation.
- Overfit = too complex → high variance, low trust in the interpretation.
Interpretability is not just about simplicity; it is about whether the model's reasoning reflects the true data-generating process.

CONCEPT OF PARSIMONY VERSUS BLACK BOX:
Parsimony:
- Meaning: in plain English, parsimony means simplicity, i.e., using the smallest, simplest model that still explains the data well. In ML, it is the principle of avoiding unnecessary complexity in the model's structure, number of features, or parameters.
- Why it matters: better generalisation (simpler models are less likely to overfit), interpretability (fewer parameters are easier to explain), and efficiency (less computation, faster training, easier deployment).
- How it is achieved: feature selection (keep only the most relevant predictors), regularisation (e.g., Lasso shrinks some coefficients to zero), and Occam's razor (prefer the simplest model that performs adequately).
- Example: choosing a linear regression with 5 meaningful predictors instead of a polynomial model with 50 terms that barely improves accuracy.

Black box:
- Meaning: a black-box model is one whose internal decision-making process is not easily interpretable by humans. You can see the inputs and outputs, but you cannot clearly understand how the model arrived at its prediction.
- Why some models are black boxes: high complexity (many layers, parameters, and non-linear transformations, e.g., deep neural networks); distributed representations (information is encoded across many nodes, making it hard to trace the influence of individual features); layer-wise abstraction (each layer learns increasingly abstract features, obscuring the link between raw input and final output).
- Implications: Pros: often achieve very high predictive accuracy. Cons: hard to explain decisions, which causes issues in regulated domains (finance, healthcare, law). Mitigation: use explainability tools such as SHAP, LIME, or partial dependence plots.
- Example: a deep learning image classifier that correctly labels an image as a "cat" but cannot easily explain which features (whiskers, ears, fur pattern) drove the decision.

Parsimonious model versus black-box model:
- Complexity. Parsimonious: low. Black box: high.
- Interpretability. Parsimonious: high. Black box: low.
- Risk of overfitting. Parsimonious: lower. Black box: higher (if not regularised).
- Accuracy potential. Parsimonious: moderate to high (if the problem is simple). Black box: high (especially for complex patterns).
- Examples. Parsimonious: linear regression, logistic regression, pruned decision tree. Black box: deep neural networks, large ensembles (random forest, gradient boosting).

Parsimony is about keeping models as simple as possible without losing accuracy; black-box models are powerful but opaque, making their internal reasoning hard to interpret.
WHY THE FUNCTION f IN Y = f(X) + ε IS GENERALLY UNKNOWN:
- Real-world complexity: the relationship between inputs X and output Y is often governed by complex, non-linear, and interacting factors. We rarely know the exact mathematical form of this relationship.
- Epistemic uncertainty: even if a true function exists, we do not have perfect knowledge; we only observe samples, not the full population or mechanism.
- Data limitations: we work with finite, noisy, and sometimes biased data. This makes it impossible to recover f exactly; we can only approximate it.
- Modelling assumptions: statistical models (like linear regression) assume a form for f, but this is a simplification. The true f may be far more intricate.
- Randomness and noise: the error term ε captures unobserved influences, measurement errors, and randomness. These obscure the true signal in f(X).
- Non-identifiability: multiple functions can fit the same data equally well, especially in high dimensions. Without strong assumptions, f is not uniquely identifiable.

WHY f HAS TO BE ESTIMATED IN A SUPERVISED LEARNING MODEL:
In supervised machine learning, the central goal is to learn an unknown relationship between inputs X and an output Y. This relationship is often expressed as
Y = f(X) + ε
where f(·) is the true underlying function mapping inputs to outputs and ε is the irreducible error (random noise we cannot explain). f is estimated for the following reasons:
- Prediction: we want to predict the output Y for new or unseen inputs X. The true function f is unknown, so we estimate f^ to make informed predictions; accuracy depends on how well f^ approximates f.
- Inference: we aim to understand how changes in X affect Y. Estimating f helps identify which predictors are influential and how they interact; this is key in domains such as health, economics, and policy.
- Decision-making: models guide actions based on predicted outcomes.
- Handling complexity: real-world relationships are rarely simple or linear. Estimating f allows us to capture non-linear, multivariate, or hidden patterns in the data.
- Reducing error: we minimise the reducible error by improving our estimate of f. While irreducible error (from noise) remains, better models reduce bias and variance.
- Generalisation: we want models that perform well on unseen data. Estimating f with proper validation ensures robustness and avoids overfitting.
- Model comparison: different algorithms estimate f differently. Comparing models (e.g., linear regression vs. random forest) helps choose the best estimator for the goal at hand.

Shared characteristics of estimating f:
- Assumption of a relationship: we assume that the output Y is related to the inputs X via some function f(X). This is the foundation of modelling; without it, prediction is impossible.
- Use of data: estimation relies on observed data, i.e., pairs of inputs X and outputs Y. The quality and quantity of data directly affect the accuracy of f^.
- Trade-off between bias and variance: simpler models have high bias, complex models have high variance; we balance the two to minimise total prediction error.
- Reducible vs. irreducible error: we can reduce error by improving f^, but some error (from noise) is unavoidable; the focus is on minimising the reducible error through better modelling.
- Model flexibility: models vary in flexibility, from rigid (linear regression) to highly adaptive (neural networks). More flexible models can fit a complex f but risk overfitting.
- Interpretability vs. accuracy: simpler models are easier to interpret; complex ones may predict better but are opaque. The choice depends on whether explanation or prediction is the priority.
- Validation and generalisation: the estimate of f must be tested on unseen data to ensure it generalises well; techniques such as cross-validation are essential to avoid overfitting.
- Parametric vs. non-parametric methods: parametric methods assume a form for f; non-parametric methods do not. Parametric models are simpler but risk misspecification; non-parametric models need more data.
In addition to the above:
1. Data-driven fitting: both parametric and non-parametric methods rely on the observed training data (Xi, Yi) to construct an estimate f^.
2. Flexibility control: parametric methods control flexibility via the number and form of parameters; non-parametric methods control it via smoothing parameters or neighbourhood size.
3. Overfitting risk: highly flexible models, whether many-parameter linear models or unsmoothed non-parametric fits, can follow noise if not properly regularised.
4. Assumptions vs. smoothness: parametric forms assume a specific functional shape for f; non-parametric methods assume only that f is sufficiently smooth or locally approximable.
5. Optimisation objective: all methods define a loss (e.g., residual sum of squares or negative log-likelihood) and seek the f^ that minimises it.
6. Dependence on sample size: parametric estimation can work reasonably well with smaller samples; non-parametric estimation typically requires large samples to achieve low variance.
7. Interpretability vs. predictive accuracy trade-off: simpler, lower-parameter models are more interpretable but may incur higher bias; more flexible estimators often yield a better fit at the cost of transparency.

Key insight: estimating f is the core of supervised learning because it enables generalisation to unseen data, it turns raw data into a predictive or explanatory tool, and it formalises the problem as function approximation under uncertainty.

THE CONCEPT OF ERROR IN ML MODELS:
Reducible and irreducible error: in statistical learning, when we try to predict an outcome Y using inputs X, the total prediction error comes from two sources.
- Reducible error: comes from imperfections in our model f^(X). It can be reduced by choosing better models or tuning them. Example: using a linear model when the true relationship is non-linear.
- Irreducible error: comes from randomness or unknown factors ε. It cannot be removed; it is inherent in the system. Example: mood swings, genetic variability, or measurement noise.

Why is the irreducible error greater than zero?
- Unobserved variables: there are always factors influencing Y that are not captured in X, for example mood, genetics, or environmental noise.
- Measurement error: instruments and human reporting introduce inaccuracies; even biological age tests have variability.
- Natural randomness: biological systems, human behaviour, and even market dynamics have inherent unpredictability.
- Model limitations: no model can perfectly capture reality. Even if you knew the true f, the randomness in ε would remain.
- Temporal fluctuations: conditions change over time; what affects vitality today may not tomorrow. This adds noise.

Errors in practice:
- Mean squared error (MSE): penalises large errors heavily.
- Root mean squared error (RMSE): like MSE, but in the original units.
- Mean absolute error (MAE): robust to outliers.
- Classification error rate: the proportion of misclassified observations.

Why understanding error matters:
- Model selection: choose the complexity that minimises test error, not just training error.
- Generalisation: avoid overfitting by balancing bias and variance.
- Performance boundaries: recognise that irreducible error means perfect accuracy is impossible on noisy real-world data.
- Interpretability: understanding error sources helps explain model behaviour to stakeholders.
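The floor set by the irreducible error can be seen in a simulation where the true f is known (a sketch assuming NumPy; the function and noise level are illustrative): even predicting with the true f itself leaves a test MSE of roughly Var(ε), while a misspecified model adds reducible error on top of that floor.

```python
import numpy as np

rng = np.random.default_rng(5)
f = lambda x: 2.0 + 3.0 * x            # the true (normally unknown) function
sigma = 1.5                            # standard deviation of the irreducible noise epsilon

# Generate test data from Y = f(X) + epsilon.
x_test = rng.uniform(0, 10, size=100_000)
y_test = f(x_test) + rng.normal(0, sigma, size=x_test.size)

# Even the *true* f cannot do better than the noise floor Var(epsilon).
mse_true_f = np.mean((y_test - f(x_test)) ** 2)
print("Test MSE of the true f :", round(mse_true_f, 3))
print("Var(epsilon) = sigma^2 :", round(sigma**2, 3))

# A deliberately wrong (biased) model adds reducible error on top of that floor.
mse_bad_model = np.mean((y_test - (2.0 + 2.0 * x_test)) ** 2)
print("Test MSE of a misspecified model:", round(mse_bad_model, 3))
```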
Key insight: in ML, error is not just a number; it is a decomposition of what we can fix (bias, variance) and what we cannot (irreducible error). The art of modelling is finding the sweet spot where the reducible error is minimised without chasing noise.

Training and test error:
- Definition. Training error: the error rate (or loss) the model achieves on the same data it was trained on. Test error: the error rate (or loss) the model achieves on unseen data not used during training.
- Purpose. Training error: measures how well the model has learned patterns in the training set. Test error: measures how well the model generalises to new, unseen data.
- Data exposure. Training error: computed on data the model has already "seen" and optimised for. Test error: computed on data the model has never seen before.
- Typical behaviour. Training error: usually lower than the test error, because the model is optimised to fit this data. Test error: usually higher than the training error, due to the generalisation gap.
- Risk indicator. Training error: very low training error with high test error → overfitting. Test error: high test error with low training error → poor generalisation.
- Bias–variance link. Training error: low training error can mean low bias but may hide high variance. Test error: reflects the combined effect of bias, variance, and irreducible error.
- When it is useful. Training error: diagnosing underfitting (high training error) or overfitting (very low training error but high test error). Test error: evaluating real-world performance and model selection.
- How it is computed. Training error: directly from the training-phase outputs. Test error: requires a held-out dataset or cross-validation.

Synergy between training and test error: the two complement each other in diagnosing and improving models.
1. Generalisation gap: the difference between test error and training error is called the generalisation gap. A small gap means the model generalises well; a large gap suggests possible overfitting.
2. Bias–variance diagnosis: high training error plus high test error → underfitting (high bias); low training error plus high test error → overfitting (high variance); low training error plus low test error → good fit.
3. Model selection and tuning: monitoring both errors during hyperparameter tuning helps find the "sweet spot" where the model is complex enough to capture patterns but simple enough to generalise.
4. Cross-validation as a bridge: cross-validation uses training-like folds to estimate test-like error, giving a more stable view of the relationship between the two.

Key insight: training error tells you how well your model fits known data; test error tells you how well it will perform in the real world. Both are essential: focusing only on training error risks overfitting, while focusing only on test error without understanding training performance can hide underfitting. The goal is to minimise test error while keeping the generalisation gap small.

Computation note:
- Training error steadily decreases as model complexity increases, since the model fits the training data more closely.
- Test error follows a U-shaped curve: it decreases initially as the model captures real patterns, then increases due to overfitting.
- The optimal point is where the test error is lowest, balancing bias and variance.
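Because a single train/test split can be noisy, cross-validation (described above as a "bridge") averages the held-out error over several folds. A minimal sketch, assuming scikit-learn (the models and synthetic data are illustrative):

```python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(6)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=300)

cv = KFold(n_splits=5, shuffle=True, random_state=6)
for name, model in [("linear regression", LinearRegression()),
                    ("k-NN (k=10)", KNeighborsRegressor(n_neighbors=10))]:
    # cross_val_score returns one held-out score per fold; use negative MSE as the score.
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    print(f"{name:>17}: estimated test MSE = {-scores.mean():.3f} (+/- {scores.std():.3f})")
```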
Training MSE and test MSE:
- Definition. Training MSE: the mean squared error calculated on the same data used to fit the model. Test MSE: the mean squared error calculated on new, unseen data not used in training.
- Purpose. Training MSE: measures how well the model fits the training data. Test MSE: measures how well the model generalises to unseen data.
- Typical value. Training MSE: almost always lower than the test MSE, because the model is optimised for this data. Test MSE: usually higher than the training MSE, due to generalisation error.
- Overfitting indicator. Training MSE: a very low training MSE with a much higher test MSE suggests overfitting. Test MSE: a large gap between test and training MSE signals poor generalisation.
- Bias–variance link. Training MSE: shows low bias on the training data, but variance is not visible here. Test MSE: reflects both bias and variance; high variance shows up as a large test MSE.

Correlation with model flexibility:
- Rigid / less flexible models (e.g., linear regression with few predictors). Training MSE trend: starts relatively high and decreases slowly with more flexibility. Test MSE trend: starts high and decreases initially, then may rise slightly if underfitting persists. Reason: high bias dominates; variance is low, so training and test errors stay close together.
- Flexible models (e.g., a high-degree polynomial or a deep tree). Training MSE trend: decreases rapidly as flexibility increases and can approach zero. Test MSE trend: decreases at first, then increases sharply after an optimal point. Reason: low bias but high variance; overfitting causes the test MSE to rise while the training MSE stays low.

Key insight:
- Rigid models → both training and test MSE are relatively high but close to each other (underfitting).
- Moderately flexible models → training MSE drops, and test MSE drops to a minimum (optimal complexity).
- Highly flexible models → training MSE stays very low, but test MSE rises due to overfitting (variance dominates).
Training MSE falls steadily as model flexibility increases, because the model can fit the training data more and more closely. Test MSE follows a U-shaped curve: it drops at first as the model captures real structure, then rises again when overfitting starts to dominate. The sweet spot is at the bottom of the test MSE curve, where bias and variance are optimally balanced.
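The U-shape can be reproduced with any family of models whose flexibility can be dialled up or down. The sketch below (assuming scikit-learn; k-nearest-neighbour regression is used purely as an example, with smaller k meaning more flexibility) sweeps the flexibility and prints training and test MSE.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(7)
X = rng.uniform(0, 6, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=7)

# Small k = very flexible (low bias, high variance); large k = rigid (high bias, low variance).
for k in [1, 2, 5, 10, 25, 50, 100]:
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
    tr = mean_squared_error(y_tr, knn.predict(X_tr))
    te = mean_squared_error(y_te, knn.predict(X_te))
    print(f"k={k:>3}  training MSE={tr:.3f}  test MSE={te:.3f}")
```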
Trade-off between prediction and interpretability (inference):
- Prediction (generalisation performance): focuses on how well the model performs on new, unseen data. Often achieved with flexible models (e.g., random forests, neural networks). These models may be less interpretable, but they capture complex patterns and interactions. The goal is low test error, not necessarily understanding the model's inner workings.
- Interpretability (inference): focuses on understanding the relationships between variables. Often achieved with simpler models (e.g., linear regression, logistic regression). These models may have higher bias, but they are easier to explain and trust. The goal is insight and explanation, not just prediction.
- The trade-off in practice: highly flexible models may give better predictions but are harder to interpret and may overfit; simpler models may be easier to explain but can miss subtle patterns, reducing predictive power. You often have to choose between interpretability and predictive accuracy, depending on the use case.
- Example: in medical diagnostics, you might prefer a simpler model that doctors can understand, even if it is slightly less accurate; in stock price forecasting, you might prioritise predictive power over interpretability, because the stakes lie in the outcome, not the explanation.

BIAS-VARIANCE DECOMPOSITION:
The trade-off:
- Simpler models → high bias, low variance.
- More complex models → low bias, high variance.
- The goal is to find the sweet spot where the total error (bias² + variance + irreducible error) is minimised.

Why it matters in practice:
- Model selection: helps decide whether to increase complexity (reduce bias) or regularise/simplify (reduce variance).
- Hyperparameter tuning: guides choices such as tree depth, polynomial degree, or regularisation strength.
- Interpretability vs. accuracy: sometimes a slightly higher bias is acceptable in exchange for a more interpretable, stable model.

Key insight: the bias–variance decomposition tells us that low training error is not enough; we must balance bias and variance to achieve low test error. It is the theoretical backbone behind why cross-validation, regularisation, and early stopping work.

The concept of Var(ε) and the variance of f^:
The variance of a learning method refers to how much the model's predictions f^(X) would fluctuate if it were trained on different datasets drawn from the same population.
- Definition: variance measures how sensitive the model is to changes in the training data. High variance means the model is unstable; it changes a lot with small shifts in the data.
- Cause: often caused by overly complex models that try to fit every detail (including noise). This leads to overfitting: great on training data, poor on new data.
- Role in error: it is part of the reducible error; you can lower it by simplifying the model or using regularisation. But reducing variance too much can increase bias, hence the bias–variance trade-off.
- Visual intuition: imagine fitting a curve to data. A high-variance model wiggles through every point; a low-variance model draws a smoother, more general line. You want a balance: not too wiggly, not too rigid.

VARIOUS SUPERVISED ML MODELS:
- Prediction-oriented models focus on minimising test error and generalising well to unseen data.
- Inference-oriented models focus on understanding relationships and quantifying effects, even if predictive accuracy is secondary.
- The regression vs. classification distinction is determined by whether Y is continuous or categorical.

CLASSIFICATION OF METHODS FOR ESTIMATING f:
Fundamentally, f is estimated either for
a. prediction, or
b. inference.

Example (neighbourhood averaging): we want to know the value of f(x) at a certain point, for example x = 4. In real data we often have no exact observations where X = 4, so we cannot directly calculate the true conditional mean E(Y | X = 4). We therefore relax the rule: instead of using only X = 4, we look at points where X is close to 4. This "close to" region is called a neighbourhood N(x). We take the average of the Y values for all data points in that neighbourhood, which gives us an estimate of f(x) even without exact matches. (In the accompanying plot, the dashed lines mark the neighbourhood around x = 4, and the smooth curve is the estimate of f(x) built from such averages.)
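A direct implementation of this idea (a sketch assuming NumPy; the true function, noise level, and neighbourhood width are illustrative) estimates f(4) by averaging the Y values of observations whose X falls within a small window around 4.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 500
X = rng.uniform(0, 10, size=n)
Y = 2 * np.log1p(X) + rng.normal(0, 0.5, size=n)   # Y = f(X) + eps, with f unknown in practice

def local_average(x0, X, Y, width=0.5):
    """Estimate f(x0) by averaging Y over the neighbourhood |X - x0| <= width."""
    in_neighbourhood = np.abs(X - x0) <= width
    return Y[in_neighbourhood].mean(), int(in_neighbourhood.sum())

estimate, n_used = local_average(4.0, X, Y)
print(f"f_hat(4) = {estimate:.3f} based on {n_used} neighbouring points")
print(f"true f(4) = {2 * np.log1p(4.0):.3f}")   # known here only because we simulated the data
```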
PREDICTIVE VERSUS INFERENTIAL ML MODELS:
- Primary goal. Predictive model: accurately predict outcomes for new, unseen data. Inferential model: understand the relationship between the predictors and the response.
- Key question. Predictive: "Given X, what will Y be?" Inferential: "How does Y change when Xj changes?"
- Focus. Predictive: minimising prediction error on test data. Inferential: estimating model parameters and their statistical significance.
- Output emphasis. Predictive: predicted values (Y^) for future observations. Inferential: coefficient estimates, confidence intervals, p-values, effect sizes.
- Model choice. Predictive: may use complex, non-interpretable models (e.g., random forests, gradient boosting, deep neural networks) if they improve accuracy. Inferential: often uses simpler, interpretable models (e.g., linear/logistic regression) to make parameter interpretation meaningful.
- Evaluation metrics. Predictive: RMSE, MAE, accuracy, F1-score, AUC, etc. Inferential: standard errors, t-statistics, p-values, R², adjusted R².
- Complexity tolerance. Predictive: high; complexity is acceptable if it improves predictive performance. Inferential: lower; complexity is limited to maintain interpretability and valid inference.
- Data requirements. Predictive: needs representative training data; large datasets improve generalisation. Inferential: requires data that meet statistical assumptions (e.g., independence, correct functional form) for valid inference.
- Example use cases. Predictive: credit risk scoring, demand forecasting, image classification, recommendation systems. Inferential: determining which factors significantly affect disease risk, estimating price elasticity, policy impact analysis.
- Interpretability priority. Predictive: often secondary to accuracy. Inferential: primary; the model must explain why and how the predictors influence the outcome.

Synergy between the two:
- Inference can inform prediction: understanding which variables matter can simplify models and improve generalisation.
- Prediction can inform inference: predictive performance can validate whether the inferred relationships are useful in practice.
- In many real-world projects both goals are blended, e.g., building a model that predicts well and offers interpretable insights.

Goal by task:
- Prediction, regression (continuous Y): predictive regression. Goal: accurately predict a numeric outcome for new data. Examples: predicting house prices, forecasting sales. Metrics: RMSE, MAE, R².
- Prediction, classification (categorical Y): predictive classification. Goal: accurately assign new observations to categories. Examples: spam detection, image recognition. Metrics: accuracy, precision, recall, F1, ROC-AUC.
- Inference, regression: inferential regression. Goal: understand how predictors influence a continuous outcome. Example: estimating how education level affects income. Outputs: coefficients, p-values, confidence intervals.
- Inference, classification: inferential classification. Goal: understand how predictors influence the probability of class membership. Example: identifying risk factors for disease. Outputs: odds ratios, significance tests, marginal effects.

Note:
1. The prediction rows focus on minimising test error and generalising well to unseen data.
2. The inference rows focus on understanding relationships and quantifying effects, even if predictive accuracy is secondary.
3. The regression vs. classification columns are determined by whether Y is continuous or categorical.

THE CONCEPT OF PREDICTOR VARIABLES AND RESPONSE VARIABLE IN STATISTICAL LEARNING:
Predictor variables:
a. Also known as inputs, features, or independent variables.
b. Typically denoted X1, X2, …, Xp.
c. Can be numeric or categorical.
d. Used to train models and uncover patterns or relationships.
e. In experimental design, these are the variables we manipulate or observe.
Response variable:
a. The outcome we aim to predict or understand.
b. Denoted Y.
c. Depends on the values of the predictor variables.
d. Can be continuous or categorical.
e. In modelling, Y is estimated using f^(X).

The general model: assume a quantitative response Y and p different predictors X1, X2, …, Xp. The relationship can be written in the general form
Y = f(X) + ε.
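To see each ingredient of this equation concretely before analysing the terms, the following sketch (assuming NumPy; the chosen f, coefficients, and noise level are illustrative) generates data from a known f plus noise and then fits an estimate f^.

```python
import numpy as np

rng = np.random.default_rng(9)

# The systematic part f(X) -- in real problems this function is unknown.
def f(x1, x2):
    return 5.0 + 2.0 * x1 - 1.5 * x2

n = 1000
X1 = rng.uniform(0, 10, size=n)              # predictor 1
X2 = rng.uniform(0, 5, size=n)               # predictor 2
eps = rng.normal(0, 2.0, size=n)             # irreducible error: mean zero, independent of X
Y = f(X1, X2) + eps                          # the response

# Estimate f with a linear least-squares fit (here the assumed form matches the truth).
A = np.column_stack([np.ones(n), X1, X2])
beta_hat, *_ = np.linalg.lstsq(A, Y, rcond=None)
print("true coefficients     : [5.0, 2.0, -1.5]")
print("estimated coefficients:", np.round(beta_hat, 2))
```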
Analysis of each of these terms:
- Y: the response variable or outcome we want to predict or understand. It may be continuous (e.g., blood pressure) or categorical (e.g., disease status).
- X: the predictor variables or features used to estimate Y, including measurable inputs such as training intensity. The quality and relevance of X directly affect model accuracy.
- f(X): the true but unknown function that maps the inputs to the output. This is central to statistical learning: we estimate f^(X) from data. The challenge is that f is rarely known and may be non-linear or complex.
- ε: the error term capturing randomness, noise, or unobserved factors. It represents the irreducible error: even with perfect modelling, this component remains due to biological variability, measurement error, and so on.
- Goal of learning: estimate f^(X) as closely as possible to f(X). We use training data and algorithms (e.g., regression, trees, neural networks) to approximate f; accuracy depends on the bias–variance trade-off.
- Assumptions: ε is independent of X and has mean zero. These assumptions simplify modelling but may not hold in real-world data; violations can lead to biased or inefficient estimates.
- Implications: prediction error = bias² + variance + irreducible error. This guides model selection and tuning: overly complex models increase variance, overly simple ones increase bias.
- Interpretability vs. accuracy: a simple f (e.g., linear) is interpretable; a complex f (e.g., a neural network) may be more accurate. The trade-off is crucial: in health modelling, interpretability often matters for trust and actionable insights.

Some critical aspects:
- The function f is fixed but unknown: the function does not change, but we do not know its exact form. Estimating f from data is the essence of statistical modelling.
- ε is a random error term that captures noise or unmodelled effects. The model assumes that all deviations from f(X) are random and not systematic.
- The errors are not influenced by the input variables. This is crucial for unbiased estimation; if it is violated, the model's reliability suffers.
- The mean of the error is zero, which implies that, on average, the error does not skew the output.
- f represents the systematic information that X provides about Y: f(X) captures the predictable part of Y. It separates signal from noise; however, the boundary between "systematic" and "random" can be blurry.

PREDICTION:
In statistical modelling, prediction is one of the core motivations for estimating the function f in the equation Y = f(X) + ε.
- Objective: use known inputs X to predict unknown or future outputs Y. Prediction focuses on accuracy, not necessarily on understanding the underlying mechanism.
- Estimated function f^(X): a model built from training data to approximate the true function f(X). f^ may be a black box (e.g., a neural network) or interpretable (e.g., linear regression).
- Error term ε: represents randomness or noise that cannot be predicted. Even the best model cannot eliminate this irreducible error.
- Training vs. test data: the model is trained on known data and evaluated on unseen data. Good prediction requires generalisation, not just fitting the training set.
- Loss function: measures how far the predictions Y^ = f^(X) are from the actual Y. Common choices: mean squared error (MSE), classification error, etc.
- Model selection: choose f^ to minimise the expected prediction error. This involves balancing bias (simplicity) and variance (flexibility).
- Cross-validation: a technique to estimate prediction error reliably. It helps avoid overfitting and ensures robustness across different data splits.
- Interpretability trade-off: accurate models may be complex and hard to interpret. In wellness modelling, interpretability is crucial for actionable insights.
INFERENCE:
In statistical modelling, inference is about understanding the underlying relationship between the inputs X and the output Y by estimating the function f. Unlike prediction, which focuses on accuracy, inference focuses on insight: what drives the outcome and how.
Three fundamental questions are asked:
a. Which predictors are associated with the response? The aim is to identify the few important predictors among a large set of possible variables, depending on the application.
b. What is the relationship between the response and each predictor? The aim is to identify positive or negative relationships.
c. Can the relationship between Y and each predictor be adequately summarised by a linear equation, or is the relationship more complicated?

Key concepts in inference:
- Goal: to understand how each predictor Xj affects the response Y. This helps in identifying causal relationships, not just correlations.
- The function f: represents the true relationship between inputs and output. Estimating f allows us to interpret the system, not just predict it.
- Model interpretability: models used for inference are often simple and transparent (e.g., linear regression); complexity is avoided to preserve clarity and explainability.
- Parameter estimation: we estimate coefficients (e.g., βj) that quantify the effect of each predictor. These estimates answer questions such as "How much does intake affect vitality?"
- Confidence intervals: provide a range within which the true parameter likely falls, adding statistical rigour and accounting for uncertainty.
- Hypothesis testing: tests whether a predictor has a statistically significant effect on Y; useful for validating interventions.
- Assumptions matter: inference relies on assumptions such as linearity, independence, and normality; violations can lead to misleading conclusions.
- Sampling variability: inference depends on understanding how estimates vary across samples. This is where concepts like the standard error and p-values come in.

PARAMETRIC AND NON-PARAMETRIC METHODS FOR ESTIMATING f:
Parametric methods:
- Definition: a statistical learning approach that assumes f belongs to a specific functional form (e.g., linear, polynomial, logistic). The problem then reduces to estimating a finite set of parameters that define this form.
- Two-step process:
  1. Model assumption: choose a functional form for f (e.g., a linear form f(X) = β0 + β1X1 + … + βpXp).
  2. Parameter estimation: use the data to estimate the parameters (e.g., the βj) via methods such as least squares or maximum likelihood.
- Common examples: linear regression, logistic regression, Poisson regression, ANOVA models.
- Estimation techniques: least squares estimation (minimises the sum of squared residuals), maximum likelihood estimation (maximises the probability of the observed data), and the method of moments (matches sample moments to theoretical moments).
- Key assumptions: the functional form is correctly specified; observations are independent; error terms have constant variance (homoscedasticity); often a specific distribution is assumed for the errors (e.g., normality).
- Advantages: simple to implement and interpret; requires less data than non-parametric methods; inference (hypothesis testing, confidence intervals) is straightforward.
- Limitations: high bias if the chosen functional form is incorrect; less flexible, so it may miss complex, non-linear patterns; performance depends heavily on the assumptions being met.
- Bias–variance profile: typically low variance (stable estimates across samples) but potentially high bias if the model is misspecified.
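A least-squares fit of an assumed linear form also yields the inferential outputs described above (coefficient estimates, confidence intervals, p-values). A minimal sketch, assuming the statsmodels library is available (the simulated data are illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
n = 200
x1 = rng.uniform(0, 10, size=n)
x2 = rng.uniform(0, 5, size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(0, 1.5, size=n)   # known truth for the demo

X = sm.add_constant(np.column_stack([x1, x2]))   # parametric assumption: linear in x1 and x2
model = sm.OLS(y, X).fit()                       # least-squares parameter estimation

print(model.params)          # estimated beta_0, beta_1, beta_2
print(model.conf_int())      # 95% confidence intervals for each coefficient
print(model.pvalues)         # hypothesis tests: is each true coefficient zero?
```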
Non-Parametric Method:
Definition: Does not assume a fixed functional form for f. The shape of f is determined entirely by the data, allowing it to adapt to complex patterns.
Flexibility: Can model highly nonlinear relationships. Complexity can grow with the size of the dataset.
Examples:
o k-Nearest Neighbours (k-NN) regression/classification
o Decision Trees and ensembles (Random Forests, Bagging, Boosting)
o Kernel Smoothing and Kernel Density Estimation
o Local Polynomial Regression
o Splines and Smoothing Splines
Estimation Process: Uses the training data directly to make predictions (often called "memory-based" or "instance-based" learning). Relies on measures like distance, similarity, or local averaging rather than global parameter estimates.
Assumptions: Minimal distributional assumptions compared to parametric methods. Often assumes only that f is smooth or continuous in some sense.
Advantages: High flexibility — can fit a wide variety of shapes. Fewer assumptions reduce the risk of model misspecification. Can achieve high accuracy with sufficient data.
Limitations: Requires more data to achieve low variance. Can be computationally intensive for large datasets. Risk of overfitting if complexity is not controlled (e.g., via bandwidth in kernel methods, pruning in trees).
Bias–Variance Profile: Typically low bias (can closely follow the data) but high variance if not regularized. Model tuning (e.g., choosing k in k-NN, bandwidth in kernels) is crucial to balance bias and variance.
When to Use: When the true form of f is unknown or suspected to be highly nonlinear, and when you have a large dataset and want flexibility over interpretability.

Difference between Parametric and Non-Parametric Methods:

Aspect | Parametric Method | Non-Parametric Method
Model Assumption | Assumes a fixed functional form for f (e.g., linear, polynomial). | Makes no fixed assumption about the shape of f.
What We Estimate | A small set of parameters that define the chosen model. | The function f itself, often using the data directly.
Flexibility | Less flexible — limited to the chosen form. | Highly flexible — can adapt to many shapes.
Data Requirement | Works well with smaller datasets. | Needs large datasets for accuracy.
Bias–Variance Profile | Usually higher bias, lower variance. | Usually lower bias, higher variance.
Interpretability | Easy to interpret and explain. | Often harder to interpret.
Risk if Wrong Assumption | High — if the assumed form is wrong, predictions suffer. | Lower — fewer assumptions reduce the risk of misspecification.
Examples | Linear regression, logistic regression, Poisson regression. | k-Nearest Neighbours, decision trees, splines, kernel smoothing.
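A small sketch contrasting the two approaches, assuming Python with scikit-learn and a synthetic nonlinear dataset: a linear regression (parametric, misspecified here) versus k-nearest neighbours (non-parametric). The data and the choice of k = 10 are illustrative only.

```python
# Minimal sketch: parametric (linear) vs. non-parametric (k-NN) fit
# when the true relationship is nonlinear.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(0, 0.3, 400)   # truly nonlinear f plus noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

linear = LinearRegression().fit(X_tr, y_tr)                # assumes f is linear (high bias here)
knn = KNeighborsRegressor(n_neighbors=10).fit(X_tr, y_tr)  # lets the data determine the shape of f_hat

print("Linear test MSE:", mean_squared_error(y_te, linear.predict(X_te)))
print("k-NN test MSE:  ", mean_squared_error(y_te, knn.predict(X_te)))
```

On data like this, the non-parametric fit typically wins; with a small sample or a truly linear f, the parametric model would be the safer choice.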
FUNDAMENTALS OF MEASURING THE QUALITY OF FIT:
Measuring the "quality of fit" of a machine learning (ML) model means evaluating how well the model's predictions match the actual observed data — both on the data it was trained on and, more importantly, on unseen data. It is essentially asking: "Does my model capture the underlying patterns in the data without overfitting or underfitting?" The exact way we measure it depends on the type of model and the nature of the outcome variable, but the core idea is always the same: smaller, unbiased differences between predicted and observed values mean a better fit.

What "Quality of Fit" Means
In statistical terms: it is the degree to which the model's predicted values align with the observed values.
In ML terms: it is a measure of model performance and generalization ability. A good fit means low error and high explanatory or predictive power.

Why It Matters
Trust: a poorly fitting model can give misleading predictions or wrong conclusions.
Model Selection: comparing fit across candidate models helps choose the best one.
Bias–Variance Balance: fit quality helps detect overfitting (too complex) or underfitting (too simple).

Quality of Fit for Regression Models and Quality of Fit for Classification Models: the specific metrics for each are listed in the Note below.

Visual & Diagnostic Checks
Residual plots (regression): residuals should be randomly scattered, with no patterns.
Predicted vs. actual plots: points should cluster around the 45° line.
Learning curves: show training vs. validation error to detect over- or underfitting.

Key Insight
Quality of fit ≠ just high accuracy or high R² — it is about balanced performance that generalizes well. Always measure fit on validation/test data, not just training data, and use multiple metrics and visual diagnostics to get a complete picture.

Note:
For Regression Models (Continuous Outcomes)
R-squared (R²) – proportion of variance in the dependent variable explained by the model. Higher values (closer to 1) indicate better fit.
Adjusted R² – like R², but penalizes adding predictors that do not improve the model.
Mean Squared Error (MSE) / Root Mean Squared Error (RMSE) – average squared (or square-rooted) prediction error; lower is better.
Mean Absolute Error (MAE) – average absolute prediction error; less sensitive to outliers than MSE.
Residual Analysis – plotting residuals to check for randomness; patterns suggest a poor fit.

For Classification Models (Categorical Outcomes)
We assess how well predicted classes match actual classes:
Accuracy – proportion of correct predictions.
Precision, Recall, F1-score – useful when classes are imbalanced.
ROC Curve & AUC – measure the model's ability to rank positives above negatives across thresholds.
Log-Loss / Cross-Entropy – penalizes confident but wrong predictions.

For Probability Distributions or Hypothesis Testing
When checking whether data follow a theoretical distribution:
Chi-Square Goodness-of-Fit Test – compares observed and expected frequencies.
Kolmogorov–Smirnov Test, Anderson–Darling Test, Cramér–von Mises Criterion – compare the empirical distribution to the theoretical one.
Akaike Information Criterion (AIC) / Bayesian Information Criterion (BIC) – compare models, penalizing complexity.

Visual Checks
Residual Plots – random scatter around zero suggests a good fit.
Predicted vs. Observed Plots – points close to the 45° line indicate strong agreement.

In short:
For numeric predictions, use error metrics and R² plus residual checks (see the sketch below).
For categorical predictions, use accuracy, precision/recall, and ROC-AUC.
For distributional fit, use statistical goodness-of-fit tests.
Always combine numerical metrics with visual diagnostics for a complete picture.
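The sketch below computes the regression quality-of-fit metrics listed above (R², adjusted R², MSE/RMSE, MAE, residuals). It assumes Python with scikit-learn; the y_test and y_pred arrays are placeholder values standing in for any fitted model's held-out predictions.

```python
# Minimal sketch of regression quality-of-fit metrics on held-out data.
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

y_test = np.array([3.1, 4.5, 5.0, 6.2, 7.8, 9.1])   # observed values (illustrative)
y_pred = np.array([3.0, 4.8, 5.2, 6.0, 7.5, 9.4])   # model predictions (illustrative)

n, p = len(y_test), 2                    # p = number of predictors (assumed for the example)
r2 = r2_score(y_test, y_pred)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
mse = mean_squared_error(y_test, y_pred)

print("R^2:", r2, "Adjusted R^2:", adj_r2)
print("MSE:", mse, "RMSE:", np.sqrt(mse))
print("MAE:", mean_absolute_error(y_test, y_pred))

residuals = y_test - y_pred              # should look like random scatter around zero
print("Residuals:", residuals)
```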
FUNDAMENTALS OF MODEL ACCURACY:
In ML, accuracy is a measure of how well a model's predictions match the actual outcomes. It is a performance indicator that answers: "Out of all the predictions the model made, how many were correct?"
For classification: accuracy is the proportion of correct predictions (both positives and negatives) over all predictions.
For regression: we do not usually talk about "accuracy" as a percentage — instead, we use error metrics (MSE, RMSE, MAE) or fit measures (R²) to judge accuracy.

Factors affecting model accuracy:
Data Quality – garbage in, garbage out. Missing values, noise, and bias in the data reduce accuracy.
Feature Engineering – relevant, well-constructed features improve predictive power.
Model Complexity – too simple → underfitting (high bias); too complex → overfitting (high variance).
Training Data Size – more representative data generally improves generalization.
Evaluation Method – using cross-validation gives a more reliable accuracy estimate than a single train/test split.

Fundamental principles:
Model accuracy is about generalization — how well the model performs on unseen data, not just the training set. That is why we always:
Measure accuracy on a test set or via cross-validation.
Compare multiple metrics to get a complete performance picture.
Balance bias and variance to minimize total error.

ASSESSING MODEL ACCURACY OF A PREDICTIVE MODEL:
In predictive modelling, the goal is not just to fit the training data well, but to generalize to unseen data. Assessing accuracy tells us:
How well the model predicts on new data.
Whether we are overfitting (too complex) or underfitting (too simple).
Which model or hyperparameter setting is optimal.

Core steps in accuracy assessment (a cross-validation sketch follows the Note below):

Step | Description | Key Notes
1. Split Data | Divide into training and test sets (e.g., 80:20) or use cross-validation. | Ensures test error is measured on unseen data.
2. Choose Metrics | Select metrics aligned with the prediction task (regression vs. classification). | Avoid relying on a single metric; consider the cost of different errors.
3. Train the Model | Fit the model on the training set (or on the training folds in CV). | Keep test data untouched until the final evaluation.
4. Evaluate on Test Data | Compute the chosen metrics on the test set. | This is your generalization performance.
5. Compare Models | Use validation/CV scores to choose the best model. | Avoid peeking at test data repeatedly — it biases results.

Common Predictive Accuracy Metrics:
Note: For predictive models that deal with continuous outcomes (regression problems), accuracy is measured by how close the predicted values are to the actual values. Common ways to do this include calculating the Mean Squared Error (MSE), which averages the squared differences between predictions and actuals, or the Root Mean Squared Error (RMSE), which is the square root of the MSE and is in the same units as the target variable. Mean Absolute Error (MAE) is another option, taking the average of absolute differences, and is less sensitive to large outliers. The R² score tells you what proportion of the variation in the outcome is explained by your model — 1 means a perfect fit and 0 means no better than predicting the mean.
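Here is a minimal sketch of the assessment steps above, assuming Python with scikit-learn and synthetic data: 5-fold cross-validation of a linear regression reporting out-of-sample MSE. The model, data, and fold count are illustrative.

```python
# Minimal sketch: estimate generalization error via k-fold cross-validation.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 1, 150)

# 5-fold cross-validation; scikit-learn reports negative MSE by convention.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring="neg_mean_squared_error")
print("CV MSE per fold:", -scores)
print("Mean CV MSE:    ", -scores.mean())
```

Averaging over folds gives a more stable accuracy estimate than a single train/test split, which is exactly the point made in the Evaluation Method factor above.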
ASSESSING MODEL ACCURACY OF A CLASSIFICATION MODEL:
In supervised classification, we train a model on labelled data (X, Y) and then evaluate it on data it has not seen before. The goal is to measure generalization performance — how well the model will perform in the wild.

The workflow:

Step | What Happens | Why It Matters
1. Data Splitting | Divide the data into training and test sets (or use k-fold cross-validation). | Ensures evaluation is on unseen data, preventing over-optimistic results.
2. Model Training | Fit the model on the training set. | Learns patterns from labelled examples.
3. Prediction | Apply the model to the test set to get predicted labels or probabilities. | Produces the outputs we will evaluate.
4. Confusion Matrix | Summarize predictions into TP, TN, FP, FN counts. | Foundation for most classification metrics.
5. Metric Calculation | Compute accuracy, precision, recall, F1, ROC-AUC, etc. | Quantifies performance from different angles.
6. Interpretation | Relate metrics to the problem's priorities (cost of errors, class balance). | Ensures the chosen metric aligns with business or research goals.

The Confusion Matrix (for binary classification):

 | Predicted Positive | Predicted Negative
Actual Positive | True Positive (TP) | False Negative (FN)
Actual Negative | False Positive (FP) | True Negative (TN)

Key Metrics: the main metrics (accuracy, precision, recall, F1-score, ROC-AUC) are described in the Note below.

Practical Considerations:
1. Class Imbalance: accuracy can be misleading — a model that always predicts the majority class can have high accuracy but zero usefulness.
2. Threshold Tuning: many models output probabilities; adjusting the decision threshold changes the precision/recall trade-off.
3. Cross-Validation: gives a more stable estimate of performance, especially with small datasets.
4. Domain Costs: always align the metric choice with the real-world cost of errors.

Note: For classification models that deal with categorical outcomes, accuracy is measured by how often the model predicts the correct class. The simplest measure is accuracy itself — the proportion of correct predictions out of all predictions. But when classes are imbalanced, accuracy can be misleading, so we also look at precision (how many predicted positives are actually positive), recall or sensitivity (how many actual positives were correctly identified), and the F1-score, which balances precision and recall. For models that output probabilities, the ROC-AUC score measures how well the model ranks positive cases above negative ones across different thresholds. In short, for regression you measure the closeness of predictions to actual values, and for classification you measure the correctness of predicted labels — with extra metrics to handle imbalance or probability-based outputs. These metrics are computed in the sketch below.
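The sketch below, assuming Python with scikit-learn and made-up labels and scores, builds the confusion matrix and computes accuracy, precision, recall, F1, and ROC-AUC; the 0.5 cut-off is where threshold tuning would apply.

```python
# Minimal sketch: confusion matrix and standard binary classification metrics.
import numpy as np
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])                       # actual classes (illustrative)
y_score = np.array([0.9, 0.2, 0.65, 0.8, 0.4, 0.1, 0.3, 0.55, 0.7, 0.05]) # model probabilities (illustrative)
y_pred  = (y_score >= 0.5).astype(int)    # threshold tuning changes this step

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_score))   # uses scores, not hard labels
```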
Broad difference between Model Fit and Model Accuracy:

Aspect | Model Fit | Model Accuracy
Definition | How well a model captures the underlying patterns in the training data. | How well a model's predictions match the actual outcomes on unseen (test/validation) data.
Focus | Evaluates the quality of the learned function f^ relative to the training data. | Evaluates the generalization performance of f^ on new data.
Measurement Context | Primarily assessed on training data (or via in-sample diagnostics). | Assessed on test data or via cross-validation (out-of-sample).
Typical Metrics | R², Adjusted R², residual analysis, AIC/BIC, training loss. | Accuracy, precision, recall, F1-score, ROC-AUC (classification); RMSE, MAE, MAPE (regression).
Purpose | To check whether the model is underfitting or overfitting the training data. | To check whether the model can make correct predictions on unseen data.
Risk if Maximized Alone | Overfitting — the model may memorize the training data and fail to generalize. | Misleading results if the dataset is imbalanced or the metric choice is inappropriate.
Example | A polynomial regression of degree 15 may have a perfect fit (R² ≈ 1) on training data but poor test performance. | A spam filter with 99% accuracy on test data may still miss most spam if the classes are imbalanced.
Relation to Generalization | Good fit is necessary but not sufficient for good generalization. | High accuracy indicates good generalization only if the evaluation is unbiased and representative.

Basics of Exploratory Data Analysis:

Foundations of EDA from Descriptive Statistics:

Concept | Definition | Purpose in EDA
Central Tendency | Mean, Median, Mode | Understand where the data is centred
Dispersion | Range, Variance, Standard Deviation | Measure spread and variability
Skewness | Degree of asymmetry in the distribution | Detect outliers and non-normality
Kurtosis | Tailedness of the distribution | Assess the presence of extreme values

Relationship Analysis in EDA:

Concept | Definition | Purpose in EDA
Covariance | Measure of how two variables change together | Identify directional relationships
Correlation | Standardized measure of linear association (−1 to +1) | Quantify the strength and direction of relationships
Regression (Slope) | Quantifies the change in one variable due to another | Explore predictive relationships
Clustering | Grouping data points based on similarity | Reveal hidden structure or segments

Graphical Analysis:

Type of Analysis | Variable Count | Common Graphs | Examples | Purpose
Univariate Analysis | One variable | Histogram (for continuous data), Bar Chart (for categorical data), Pie Chart, Box Plot | Age distribution of employees; gender breakdown; income spread | Understand the distribution, central tendency, and spread of a single variable.
Bivariate Analysis | Two variables | Scatter Plot (numerical vs. numerical), Line Chart (time series), Grouped Bar Chart (categorical vs. numerical), Heat map (correlation matrix), Box Plot with grouping | Age vs. Salary; Department vs. Service Length; Year vs. Status Category | Explore relationships or associations between two variables.
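A minimal EDA sketch, assuming Python with pandas and a small synthetic table with made-up columns (age, salary): it computes the descriptive and relationship measures above, and the final comment lists the typical univariate/bivariate plots.

```python
# Minimal EDA sketch: central tendency, dispersion, skewness/kurtosis,
# and pairwise relationships on a small synthetic table.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "age": rng.integers(22, 60, 200),
    "salary": rng.normal(50_000, 12_000, 200).round(0),
})

print(df.describe())              # mean, std, quartiles: centre and spread
print("Skewness:\n", df.skew())
print("Kurtosis:\n", df.kurtosis())
print("Covariance:\n", df.cov())
print("Correlation:\n", df.corr())

# Typical univariate/bivariate graphics (requires matplotlib to display):
# df["age"].hist(); df.plot.scatter(x="age", y="salary"); df.boxplot(column="salary")
```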