β̂₁ has a sampling distribution: β̂₁ ∼ Distribution(mean, variance). The s in the standard-error formula is the standard deviation of that distribution.
Is β̂₁ centered at the true value? Yes, if the assumptions hold. Hence β̂₁ is unbiased.
Denominator, ∑(Xᵢ − X̄)²: the higher the spread of X, the better → slope estimate more precise → smaller standard error → narrower confidence interval.
Numerator, error variance σ²: the smaller the better. A big σ² means Y fluctuates a lot for the same X → slope estimates jump around more between samples → larger standard error. Hence we want a small σ².
If R² = 0.8 (80%): 80% of the variation in Y can be explained by differences in X; 20% is due to other factors.
R² = 1 − SSR/SST (equivalently, 1 − var(û)/var(Y)). Residuals: ûᵢ = Yᵢ − Ŷᵢ. SSR = sum of squared residuals.
R² = 0 → no linear relationship; R² = 1 → perfect linear relationship. It measures how well the regression line fits the data.
Conditional mean E(Y|X = x) → the average value of Y given x → its center. Conditional variance V(Y|X = x) → the spread of Y around its conditional mean → its uncertainty. E[Y|X = x] = β₀ + β₁x.
In a simple regression, we only observe how two variables move together, not whether one causes the other.
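A minimal pure-Python sketch of the quantities above (slope, its standard error with the n − 2 adjustment, and R²); the data points are made up for illustration:

```python
# Sketch: simple OLS by hand. Toy data; all numbers are illustrative.

def ols_simple(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)              # denominator: spread of X
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    ssr = sum(e ** 2 for e in residuals)                 # sum of squared residuals
    sst = sum((yi - ybar) ** 2 for yi in y)
    sigma2_hat = ssr / (n - 2)                           # df adjustment: two betas estimated
    se_b1 = (sigma2_hat / sxx) ** 0.5                    # more spread in X -> smaller SE
    r2 = 1 - ssr / sst
    return b0, b1, se_b1, r2

b0, b1, se_b1, r2 = ols_simple([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
# b1 = 0.6, r2 = 0.6 for this toy sample
```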
β₁ tells us how Y tends to vary when X changes, holding nothing else constant, but it does not prove that changes in X cause Y to change.
The most important assumptions are the first four → unbiasedness. Homoskedasticity and normality → smaller variance and reliable inference.
1. Variation in X: regression only makes sense if X varies; without variation there is no way to find a meaningful relationship. (Simple regression: X must vary. Multiple regression: the X's must not be perfectly correlated.)
2. Random sampling: each observation (Xᵢ, Yᵢ) must come from the same population and be independently chosen (i.i.d.); this ensures sample patterns represent true population patterns, not sampling bias. OLS tries to learn the population relationship between X and Y from your sample, which only works if the sample is randomly chosen and representative. If we repeated sampling many times, the average of all estimates would equal the true β. (Violations: sampling bias; time-series data, which are not independent.)
3. Linearity in parameters: the model must be linear in the coefficients β, not necessarily in X, i.e. it takes the form Y = β₀ + β₁X + u. If the true relationship is curved or interactive but you force a straight line, the model is misspecified → bias. Assume E(u) = 0: on average, the model does not systematically over- or under-predict Y.
4. Zero conditional mean: E(u|X) = 0 → X explains what it is supposed to explain; all systematic patterns are captured by the regression line, and what is left (u) is pure noise, not something related to X. This is the core condition for unbiasedness, E[β̂₁] = β₁. If violated, i.e. X and u are correlated (Cov(u, X) ≠ 0), β̂₁ is biased: endogeneity / omitted-variable bias.
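The "repeated sampling" idea behind unbiasedness can be checked with a small Monte Carlo sketch: under random sampling with noise unrelated to X, the average of β̂₁ across many samples lands on the true β₁. The true coefficients and sample sizes below are made-up values:

```python
# Sketch: unbiasedness of OLS under random sampling and E(u|X)=0.
# TRUE_B0, TRUE_B1 and all simulation settings are hypothetical.
import random

random.seed(0)
TRUE_B0, TRUE_B1 = 1.0, 2.0

def draw_beta1(n=50):
    x = [random.uniform(0, 10) for _ in range(n)]
    u = [random.gauss(0, 1) for _ in range(n)]       # noise unrelated to X
    y = [TRUE_B0 + TRUE_B1 * xi + ui for xi, ui in zip(x, u)]
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx

estimates = [draw_beta1() for _ in range(2000)]
mean_b1 = sum(estimates) / len(estimates)            # hovers near the true 2.0
```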
Assumptions 1–4 make OLS unbiased; we need assumption 5 to measure the precision of those estimates.
Assumption 5 (homoskedasticity): the spread of the residuals is the same for all values of X, no matter whether X is small or large. If assumptions 1–5 hold, OLS is the Best Linear Unbiased Estimator (Gauss–Markov theorem).
What is β₁? It measures how much Y is expected to change when X increases by one unit (the exact reading depends on the model; this is true for level-level).
Population notation: E(Y|X), called the parameter of interest. Sample notation: Ê(Y|X), called the estimator of the parameter of interest. For a specific dataset, we get an estimate of E(Y|X).
Use a log transformation when the data have highly skewed distributions, hugely different scales between variables, or a non-linear (multiplicative) relationship.
When X is binary, the slope coefficient is the difference in the mean of Y between the two groups: a "one-unit increase" in X literally means going from group 0 to group 1. Y = β₀ + β₁X: X = 0 → Y = β₀; X = 1 → Y = β₀ + β₁.
In large samples, by the CLT, the OLS estimators β̂₀ and β̂₁ become approximately normal even if the error terms uᵢ are not normally distributed.
In small samples, the normality assumption is needed: for every fixed value of X, the distribution of Y is normal, centered on the regression line β₀ + β₁X, with constant spread (variance σᵤ²). This matters for small samples because the exact distribution of β̂ is normal only if u is normal; for large samples the CLT makes β̂ approximately normal without it. If violated in small samples, inference becomes unreliable. Solution: inspect residuals, transform variables (log, etc.).
95% confidence interval → z-value = 1.96.
Level-level: Y changes by β₁ units of Y for every 1-unit change in X. E.g.: each additional year of education is associated with an increase of $2.5 in hourly wage, on average.
Log-level: Y increases by roughly (100 × β₁)% for every +1 in X.
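The binary-X claim above can be verified directly: the OLS slope on a 0/1 regressor equals the difference in group means. The toy data are hypothetical:

```python
# Sketch: with binary X, beta1_hat = mean(Y | X=1) - mean(Y | X=0).
# Data values are made up for illustration.

def ols_slope(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx

x = [0, 0, 0, 1, 1, 1]
y = [10.0, 12.0, 11.0, 15.0, 14.0, 16.0]

slope = ols_slope(x, y)
mean0 = sum(yi for xi, yi in zip(x, y) if xi == 0) / 3   # this is beta0_hat
mean1 = sum(yi for xi, yi in zip(x, y) if xi == 1) / 3   # this is beta0_hat + beta1_hat
# slope == mean1 - mean0
```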
E.g.: each additional year of education is associated with an 8% higher wage, on average.
Log-log: % change in Y for a 1% change in X. E.g.: a 1% increase in advertising spending is associated with a 0.6% increase in sales, on average. Exact formula: (1 + % increase/decrease)^β₁ − 1.
Logit: how the odds of Y = 1 change when X increases by 1 unit. E.g.: increasing X by 1 unit raises the odds of Y = 1 by 49%; or, a 1-unit increase in X increases the probability of Y = 1 by 2.3 percentage points, on average.
What factors reduce (lower) the sampling variability of the OLS estimator? Higher variance of X and larger n.
Which line is likely to be closer to the observed sample values of X and Y? The fitted line.
MSD(û) = SSR/n underestimates the true error variance because OLS forces the residuals to be small by fitting the line that minimizes them. Hence we use the degrees-of-freedom-adjusted estimator σ̂² = SSR/(n − 2). The adjustment depends on how many β coefficients the model uses: β₀ and β₁, hence n − 2.
Worked example: Y increases by (1.1)^β₁ − 1 = xxx; xxx × 100% = 3.97%.
What would happen if we used OLS to predict a binary outcome? Predicted values can fall outside [0, 1]; use logit instead.
Topics: Simpson's paradox; omitted-variable bias; sampling variance with 2 covariates.
When we add another independent variable (X₂), we need to separate X₁'s effect on Y from X₂'s effect.
Multicollinearity = two (or more) independent variables are highly correlated with each other. High enough correlation makes the model struggle to separate their individual effects (R₁² ≈ 1). Multicollinearity does not violate the OLS assumptions (estimates are still unbiased and consistent), but it inflates the sampling variance.
Partial effect: the derivative from the model (e.g. ∂Y/∂X₁).
Lower RMSE = better fit (smaller prediction errors); higher RMSE = worse fit (predictions far off); RMSE = 0 = perfect fit (all predictions equal the actuals).
Dummy variables (Xᵢ) are used in regression to include categorical information (like gender, region, or department).
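The exact-vs-approximate percentage readings for log models can be sketched in a few lines. The β values below are hypothetical, chosen only to show the formulas:

```python
# Sketch: exact percentage effects in log models (coefficients are made up).
import math

def exact_pct_loglog(b1, pct_change_x):
    """Exact % change in Y for a given % change in X in a log-log model."""
    return (1 + pct_change_x) ** b1 - 1

def exact_pct_loglevel(b1, delta_x=1):
    """Exact % change in Y for a delta_x unit change in X in a log-level model."""
    return math.exp(b1 * delta_x) - 1

# Log-level with b1 = 0.08: the "roughly 8%" approximation vs the exact effect.
approx = 0.08                        # 100 * b1, the quick reading
exact = exact_pct_loglevel(0.08)     # e^0.08 - 1, slightly above 8%

# Log-log with b1 = 0.4 and a 10% increase in X: (1.1)^0.4 - 1.
effect = exact_pct_loglog(0.4, 0.10)
```

The gap between `approx` and `exact` grows with |β₁|, which is why the exact formula matters for large coefficients.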
If you have 3 categories, you create 2 dummy variables, because k − 1 = 3 − 1 = 2.
Interaction terms extend the dummy-variable idea by allowing slopes (not just intercepts) to differ between groups.
Polynomial terms in regression let you model curved (nonlinear) relationships instead of just straight lines (nonlinear in the X variables, still linear in the betas). If the model is Y = β₀ + β₁X₁ + β₂X₁² + u, the turning point is where the marginal effect equals 0.
DiD. Assumption: parallel trends; without treatment, treated and control units would have evolved similarly. A policy implemented in some units but not others at a specific point in time is a natural fit for DiD. A pre-trend in the plot breaks the parallel-trends assumption.
IV: used when the data suffer from endogeneity (X correlated with unobservables in the error term). Assumptions: relevance (the instrument is correlated with the endogenous X) and exclusion (the instrument affects Y only through X). Rule of thumb: a first-stage F-statistic above 10 suggests the instrument is strong enough. IV estimates the Local Average Treatment Effect (LATE): the causal effect specifically for "compliers". Compliers are customers who increase their limit if and only if they get the offer.
"A model explains in-sample R² well but predicts default poorly": overfitting; the model learns training-data noise and memorizes the past instead of the underlying signal, failing to generalize to new data.
Overfitting: low bias, high variance. Underfitting: high bias, low variance. Aim for low bias, low variance.
High bias, low variance → underfitting, misses the true relationship | Low bias, high variance → unreliable but still unbiased → overfitting | High bias, high variance → worst case: inaccurate, unreliable, unstable | Low bias, low variance → the goal.
Regularisation combats overfitting (especially with many or correlated features) by adding a penalty term to the loss function that discourages large coefficient values. This trades a small increase in bias for a larger decrease in variance, improving out-of-sample performance.
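The turning-point rule for the quadratic model can be written out directly: set the marginal effect dY/dX = β₁ + 2β₂X to zero and solve. The coefficients below are hypothetical:

```python
# Sketch: turning point of Y = b0 + b1*X + b2*X^2 + u.
# Marginal effect dY/dX = b1 + 2*b2*X, zero at X* = -b1 / (2*b2).
# Coefficient values are made up for illustration.

def marginal_effect(b1, b2, x):
    return b1 + 2 * b2 * x

def turning_point(b1, b2):
    return -b1 / (2 * b2)

b1, b2 = 0.6, -0.02          # positive effect that diminishes and then reverses
x_star = turning_point(b1, b2)   # 15.0: effect is zero here, negative beyond it
```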
It's crucial to select λ appropriately, typically via cross-validation. Ridge helps when predictors are highly correlated (multicollinearity), improving stability. Lasso improves interpretability by performing automatic feature selection.
Prediction: the goal is accurate out-of-sample forecasts (Ŷ). The OLS assumptions, particularly zero conditional mean, are less critical. A model with biased coefficients (β) might still predict well if the patterns causing the bias are stable between training and test data. The focus shifts from fixing endogeneity to minimizing prediction error using validation techniques. However, assumptions like linearity, and addressing multicollinearity (often via regularization like Ridge), remain important for building robust predictive models.
K-fold cross-validation (CV) provides a more reliable estimate of out-of-sample performance than a single train/test split, especially with limited data: split the data into k folds, iteratively train on k − 1 folds and test on the remaining one, rotating the test fold, and average the performance metrics across all k iterations.
A critical mistake is performing preprocessing steps that learn parameters from the data on the entire dataset before splitting into folds (data leakage).
When dealing with rare events (like a ≈1.3% default rate), use stratified k-fold: it guarantees representative folds and enables stable, reliable metric calculation, which is essential for imbalanced datasets.
Features should be scaled (e.g., standardised to mean 0, std 1) before applying regularised linear models like Ridge or Lasso, because regularisation penalises coefficient magnitude. Without scaling, two issues arise: disproportionate shrinkage (low-variance features need larger coefficients, so they are penalised harder) and poor numerical conditioning.
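The fold rotation described above can be sketched without any library: each observation lands in exactly one test fold, and the k train/test pairs are what the metrics get averaged over. A minimal pure-Python version (no shuffling or stratification shown):

```python
# Sketch: k-fold index rotation. Every index appears in exactly one test fold;
# the rest form that iteration's training set.

def kfold_indices(n, k):
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        folds.append((train_idx, test_idx))
        start += size
    return folds

splits = kfold_indices(10, 5)    # 5 train/test pairs over 10 observations
```

Note that any preprocessing that learns parameters (scaling, imputation) must be fit inside each training fold, not on all `n` rows up front, or the leakage described above occurs.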
Recall (sensitivity) = TP/(TP + FN): the proportion of actual positive cases correctly identified; crucial when missing positives is costly. | Precision (PPV) = TP/(TP + FP): the proportion of cases predicted positive that were actually positive; important when investigating false alarms is costly or capacity is limited. | Improving recall (by lowering the threshold) typically decreases precision (more FP), and vice versa.
Stationarity: the statistical behavior of a time series remains consistent over time. A process is strongly stationary if the joint distribution of values does not depend on the specific time, only on the gap (lag) between them. If not stationary: take the difference (model not the stock price but the stock return, since the return hovers around a mean rather than depending on yesterday's level).
Dickey-Fuller test: a hypothesis test for stationarity that looks for a unit root (which indicates non-stationarity). | If the series is not stationary, use ARIMA on the differenced series (e.g., stock returns). ARIMA(p, d, q) adds "I" = integrated: d is the number of differences needed to make the series stationary.
Autocorrelation means a variable is correlated with its own past values. p = how many past lags (previous values) are included; each βᵢ shows how strongly past values influence the current value; εₜ = new random shock (white noise). Autocorrelation violates the OLS assumption of i.i.d. errors → estimates become inefficient and the usual standard errors unreliable (and biased if a lagged Y appears as a regressor).
Risk Ops can review ≤120 applications/week; missing a default hurts more. Which threshold from the table should we use? The capacity limit constrains the maximum acceptable workload (TP + FP). Find the threshold that maximises the prioritised metric (recall, in this case) subject to the capacity constraint. | Current recall = 0.75, precision = 0.12, fraud prevalence ≈ 0.5%. Manual-review capacity will increase by 15% next month. What should we do?
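The capacity-constrained threshold choice can be made concrete with a small sketch. The threshold table and its TP/FP/FN counts below are entirely hypothetical, invented to show the selection logic (workload = TP + FP must fit the limit; among feasible thresholds, pick the one with the highest recall):

```python
# Sketch: pick the threshold that maximises recall subject to a review-capacity
# constraint. All counts in `table` are made-up illustrative numbers.

def recall(tp, fn):
    return tp / (tp + fn)

def precision(tp, fp):
    return tp / (tp + fp)

def feasible(tp, fp, capacity):
    return tp + fp <= capacity       # workload the review team must handle

# Hypothetical threshold -> (TP, FP, FN)
table = {
    0.3: (90, 60, 10),   # highest recall, but workload 150 > capacity
    0.5: (80, 35, 20),   # workload 115, recall 0.80
    0.7: (60, 15, 40),   # workload 75,  recall 0.60
}

CAPACITY = 120
best = max((t for t, (tp, fp, fn) in table.items() if feasible(tp, fp, CAPACITY)),
           key=lambda t: recall(table[t][0], table[t][2]))
# best == 0.5: the most aggressive threshold that still fits the 120/week limit
```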
Lower the threshold: the extra capacity lets us accept more false positives in exchange for higher recall.
ARMA: a balanced model for forecasting time series that show both persistence and noise; it combines past values (p lags) and past errors (q lags).
Cointegration: a stable long-run equilibrium relationship between two or more non-stationary variables, which lets us model them together without differencing away the long-term relationship.
Prophet: an open-source forecasting tool for complex real-world time series with seasonality, trends, holidays, and events. Yₜ = trend + seasonality + holidays + error; it can handle non-stationary data.
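The differencing step that turns a non-stationary level series into a (roughly) stationary one, as in the price-vs-return example earlier, is just a one-liner. The price series below is a made-up toy example:

```python
# Sketch: first-differencing a level series (d = 1 in ARIMA terms).
# Differences fluctuate around a mean instead of drifting with yesterday's level.
# Price values are made up for illustration.

def first_difference(series):
    return [b - a for a, b in zip(series, series[1:])]

prices = [100, 102, 101, 105, 104, 108]
changes = first_difference(prices)      # one observation shorter than the input
```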