ICS422 Applied Predictive Analytics [3-0-0-3]
Linear Regression: Residual Analysis
Class 15
Presented by Dr. Selvi C, Assistant Professor, IIIT Kottayam

Simple Linear Regression
• Simple linear regression is really a comparison of two models.
• In one, the independent variable does not exist at all; the best prediction for any observation is then simply the mean of the dependent variable.
• The other uses the best-fit regression line.
• The difference between the value on the best-fit line and the observed value is called the residual (or error).
• The residuals are squared and then added together to generate the (literally named) sum of squared errors, SSE.
• Simple linear regression is designed to find the best-fitting line through the data, the one that minimizes the SSE.

Regression Equation with Estimates
• If we actually knew the population parameters, β0 and β1, we could use the simple linear regression equation: E(y) = β0 + β1x
• In reality we almost never have the population parameters, so we estimate them using sample data, which changes the equation a little.
• ŷ, pronounced "y-hat", is the point estimator of E(y): ŷ = b0 + b1x
• ŷ is the estimated mean value of y for a given value of x.

Least Squares Criterion
• yᵢ = observed value of the dependent variable (tip amount)
• ŷᵢ = estimated (predicted) value of the dependent variable (predicted tip amount)
• The goal is to minimize the sum of the squared differences between the observed values yᵢ and the predicted values ŷᵢ provided by the regression line: the sum of the squared residuals, Σ(yᵢ − ŷᵢ)².

Parameters
• Slope: b1 = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
• Numerator: for each data point, take the x-value and subtract the mean of x, take the y-value and subtract the mean of y, multiply the two, and add up all of the products.
• Denominator: for each data point, take the x-value and subtract the mean of x, square it, and add up all of the squares.
• Intercept: b0 = ȳ − b1x̄, so the fitted line passes through the centroid (x̄, ȳ).

Example (bill amount x versus tip amount y; the data table is not reproduced here; a code sketch of the calculation follows below)
• ŷ = 0.1462x − 0.8188
• For every $1 the bill amount (x) increases, we would expect the tip amount to increase by $0.1462, or about 15 cents.
• If the bill amount (x) is zero, then the predicted tip amount is −$0.8188, or negative 82 cents! Does this make sense? No. The intercept may or may not make sense in the "real world."
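As a quick sketch of the slope and intercept calculation just described, the snippet below fits the line by hand in Python. The lecture's data table is not shown on these slides, so the six bill/tip pairs are an assumption, chosen so that they reproduce the quantities quoted later in the lecture (Σ(xᵢ − x̄)² = 4206 and b1 ≈ 0.1462).

```python
import numpy as np

# Assumed bill (x) and tip (y) data: the slides do not reproduce the
# lecture's data table, so these six pairs are hypothetical, chosen to
# match the quantities quoted later (sum of squared x-deviations = 4206,
# slope ~ 0.1462).
x = np.array([34.0, 108.0, 64.0, 88.0, 99.0, 51.0])  # bill amounts ($)
y = np.array([5.0, 17.0, 11.0, 8.0, 14.0, 5.0])      # tip amounts ($)

x_bar, y_bar = x.mean(), y.mean()

# Slope: b1 = sum((xi - x_bar) * (yi - y_bar)) / sum((xi - x_bar)^2)
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# Intercept: b0 = y_bar - b1 * x_bar (the line passes through the centroid)
b0 = y_bar - b1 * x_bar

print(f"y-hat = {b1:.4f}x {b0:+.4f}")  # y-hat = 0.1462x -0.8203
```

With these assumed data the unrounded intercept is −0.8203; the slide's −0.8188 is what you get if b1 is first rounded to 0.1462 before computing b0.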
Residual Analysis
• Residual (n.): a quantity remaining after other things have been subtracted or allowed for.
• The residual is the difference between the observed value of the dependent variable (tip amount) and what is predicted by the regression model.
• So if the model predicts a tip of $10 for a given meal, but the observed tip is $12, then the residual is 12 − 10 = 2.
• Residual = yᵢ − ŷᵢ = observed tip − predicted tip

Goodness of Fit
• Only a part of the variance in the dependent variable is explained by the values of the independent variable.
• R² = SSR / SST
• The variance left unexplained is due to model error (SSE / SST).
• Think "how far off" the model is, or "how well" the model accounts for the variance in the dependent variable.

Model Assumptions
y = β0 + β1x + ε
• Residuals offer the best information about the error term, ε.
• The expected value of the error term is zero: E(ε) = 0.
• For all values of the independent variable x, the variance of the error term ε is the same.
• The values of the error term ε are independent of each other.
• The error term ε follows a normal distribution.

Assumptions
For the results of a linear regression model to be valid and reliable, we need to check that the following four assumptions are met:
1. Linear relationship: there exists a linear relationship between the independent variable, x, and the dependent variable, y.
2. Independence: the residuals are independent. In particular, there is no correlation between consecutive residuals in time-series data.
3. Homoscedasticity: the residuals have constant variance at every level of x.
4. Normality: the residuals of the model are normally distributed.
If one or more of these assumptions are violated, then the results of our linear regression may be unreliable or even misleading.

Best-Case Residual Distribution
• Residuals are evenly distributed left to right and up to down, all over the residual plot.
• In the contrasting (problem) plot, the residuals are not evenly distributed.

Points Observed
• What happens if the residual analysis reveals heteroscedasticity?
• Rebuild the model with different independent variable(s).
• Perform transformations on non-linear data.
• Fit a non-linear regression model, but don't overfit.
• Are there statistical tests for residuals? Yes: for example, the Breusch-Pagan test for heteroscedasticity, the Durbin-Watson test for autocorrelation, and the Shapiro-Wilk test for normality.

R² Interpretation
• Coefficient of determination: r² = 0.7493, or 74.93%.
• We can conclude that 74.93% of the total sum of squares can be explained by using the estimated regression equation to predict the tip amount.
• The remainder is error.

Comparison of R² to the Standard Error of the Regression (S)
• The standard error of the regression is an absolute measure of the typical distance that the data points fall from the regression line; S is in the units of the dependent variable.
• R² is a relative measure of the percentage of the dependent-variable variance that the model explains; it can range from 0 to 100%.

Sum of Squared Error
SSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
• A measure of the variability of the observations about the regression line.

Mean Squared Error
• MSE, written s², is an estimate of σ², the variance of the error term ε; in other words, of how spread out the data points are from the regression line.
• MSE is SSE divided by its degrees of freedom, n − 2, because we are estimating two parameters, the slope and the intercept: MSE = s² = SSE / (n − 2)
• Why divide by n − 2 and not just n? Remember, we are using sample data. That is also why we write s² and not σ².
• This is why MSE is not simply the average of the squared residuals.
• If we were using population data, we would just divide by N, and MSE would simply be the average of the squared residuals.
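Continuing the sketch above (same assumed data, with b0 and b1 already fitted), the sums-of-squares decomposition and the R² and MSE figures quoted on these slides can be checked in a few lines:

```python
# Continues the earlier sketch: x, y, b0, b1 as computed above.
y_hat = b0 + b1 * x                   # predicted tip for each bill
residuals = y - y_hat                 # observed - predicted

sse = np.sum(residuals ** 2)          # sum of squared errors
sst = np.sum((y - y.mean()) ** 2)     # total sum of squares
ssr = sst - sse                       # sum of squares due to regression

r_squared = ssr / sst                 # ~0.7493: share of variance explained
mse = sse / (len(x) - 2)              # ~7.5187: SSE over n - 2 degrees of freedom

print(f"SSE = {sse:.4f}, R^2 = {r_squared:.4f}, MSE = {mse:.4f}")

# A residual plot (residuals vs. x) is the visual check described above:
# a healthy plot shows an even band around zero with no fan or curve, e.g.
# import matplotlib.pyplot as plt; plt.scatter(x, residuals); plt.axhline(0)
```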
Standard Error of the Estimate
• The standard error of the estimate, s (or just "standard error"), estimates σ, the standard deviation of the error term ε. Now we are un-squared!
• It is the average distance an observation falls from the regression line, in units of the dependent variable.
• Since MSE is s², the standard error is just the square root of MSE: s = √MSE = √(SSE / (n − 2))
• s = √7.5187 = 2.742
• So the average distance of the data points from the fitted line is about $2.74.
• You can think of s as a measure of how well the regression model makes predictions; it can be used to build prediction intervals.

Statistically Significant?
• How much variance in the dependent variable is explained by the model / independent variable? For this we look at the value of R² or adjusted R².
• Does a statistically significant linear relationship exist between the independent and dependent variables?
• Is the overall F-test or t-test significant? (In simple regression these are actually the same thing.)
• Can we reject the null hypothesis that the slope β1 of the regression line is zero?
• Does the confidence interval for the slope b1 contain zero?

Estimators Everywhere
Linear regression contains many estimators:
• b1, the slope of the regression line
• b0, the intercept of the regression line on the y-axis
• The centroid: the point (x̄, ȳ) at the intersection of the mean of each variable
• The mean value of ŷ* for any value of x* (confidence interval)
• The individual value of ŷ* for any value of x* (prediction interval)
• And many others, about variance, etc.

Degrees of Freedom
• What are degrees of freedom in statistics? Degrees of freedom are the number of independent values that a statistical analysis can estimate.
• Calculating the degrees of freedom is often the sample size minus the number of parameters you're estimating.

Confidence Interval
• We want 95% confidence that the true population slope falls within the interval.
• b1 ± t(α/2) · s_b1, where b1 is the point estimator of the slope, s_b1 is the standard deviation (standard error) of the slope, and t(α/2) · s_b1 is the margin of error.

Standard Deviation of the Slope
• s_b1 = s / √Σ(xᵢ − x̄)²
• = 2.742 / √4206
• = 0.04228

Confidence Interval for the Slope
• 0.1462197 ± t(0.05/2) × 0.04228, with n − 2 = 4 degrees of freedom
• 0.1462197 ± 2.776 × 0.04228
• 0.1462197 ± 0.11737
• (0.02885, 0.2636)
• We are 95% confident that the interval (0.02885, 0.2636) contains the true slope of the regression line.

Does the Interval Contain Zero?
• (0.02885, 0.2636)
• Hypotheses: H0: β1 = 0 versus Ha: β1 ≠ 0
• Can we reject the null hypothesis that the slope is zero?
• The null hypothesis is that the slope of the regression line is zero, and therefore that no significant relationship exists between the two variables.
• Since the interval does not contain zero, we can reject the null hypothesis.
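Still using the assumed data from the first sketch, the interval above and the test statistic on the next slide can be verified with scipy; stats.t.ppf supplies the critical value 2.776 for 4 degrees of freedom.

```python
from scipy import stats

n = len(x)
s = np.sqrt(mse)                                  # standard error of the estimate, ~2.742
s_b1 = s / np.sqrt(np.sum((x - x.mean()) ** 2))   # standard error of the slope, ~0.04228

t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)      # two-sided 95% critical value, ~2.776
margin = t_crit * s_b1                            # margin of error, ~0.11737
print(f"95% CI for slope: ({b1 - margin:.5f}, {b1 + margin:.5f})")

t_stat = b1 / s_b1                                # ~3.4584
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided p-value, < 0.05
print(f"t = {t_stat:.4f}, p = {p_value:.4f}")     # reject H0: slope = 0
```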
Test Statistic
• t = b1 / s_b1 = 0.1462197 / 0.04228 = 3.4584
• Compare t against the critical value t(α/2) at n − 2 degrees of freedom.
• 3.4584 > 2.776 is significant, so we reject the null hypothesis.

Summary
• Does the confidence interval for the slope, b1, contain the value of zero?
• Is the test statistic t greater than the critical value of t at the chosen significance level and the correct degrees of freedom?

Any Queries?
Thank you