ICS422 Applied Predictive Analytics [3-0-0-3]
Linear Regression Residual Analysis
Class 15
Presented by
Dr. Selvi C
Assistant Professor
IIIT Kottayam
Simple Linear Regression
• Simple linear regression is really a comparison of two models
• One is a model in which the independent variable does not even exist
• The other uses the best-fit regression line
• If the independent variable is ignored, the best prediction for any observation is the mean of the dependent variable
• The difference between an observed value and the best-fit line is called the residual (or error)
• The residuals are squared and then added together to give the sum of squared residuals/errors, SSE
• Simple linear regression is designed to find the line through the data that minimizes the SSE, as sketched below
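A minimal Python sketch of this two-model comparison. The deck never shows the raw data, so the six (bill, tip) pairs below are an assumption, chosen to be consistent with the summary numbers used later in these slides (Σ(xᵢ − x̄)² = 4206, b1 ≈ 0.1462, R² ≈ 0.7493):

```python
# Two-model comparison on assumed bill/tip data (not shown in the slides).
import numpy as np

x = np.array([34, 108, 64, 88, 99, 51], dtype=float)  # bill amounts (assumed)
y = np.array([5, 17, 11, 8, 14, 5], dtype=float)      # tip amounts (assumed)

# Model 1: no independent variable -- predict the mean of y for everyone.
sst = np.sum((y - y.mean()) ** 2)        # total sum of squares, 120.0

# Model 2: the best-fit regression line.
b1, b0 = np.polyfit(x, y, deg=1)         # least-squares slope and intercept
sse = np.sum((y - (b0 + b1 * x)) ** 2)   # sum of squared residuals, ~30.08

print(f"SST (mean-only model): {sst:.2f}")
print(f"SSE (regression line): {sse:.2f}")  # the line always does at least as well
```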
REGRESSION EQUATION WITH ESTIMATES
• If we actually knew the population parameters, β0 and β1, we could use the simple linear regression equation:
E(y) = β0 + β1x
• In reality we almost never have the population parameters, so we estimate them using sample data. When using sample data, we have to change the equation a little bit:
ŷ = b0 + b1x
• ŷ, pronounced "y-hat", is the point estimator of E(y)
• ŷ is the estimated mean value of y for a given value of x
Least squares criterion
• yᵢ = observed value of the dependent variable (tip amount)
• ŷᵢ = estimated (predicted) value of the dependent variable (predicted tip amount)
• The goal is to minimize the sum of the squared differences between the observed values (yᵢ) and the predicted values (ŷᵢ) provided by the regression line, i.e. the sum of the squared residuals:
min Σ (yᵢ − ŷᵢ)²
Parameters
The slope estimate b1 is the ratio of two sums, each built up step by step (see the sketch below).

Denominator, Σ(xᵢ − x̄)²:
1. For each data point,
2. take the x-value and subtract the mean of x,
3. square Step 2,
4. add up all of the squares.

Numerator, Σ(xᵢ − x̄)(yᵢ − ȳ):
1. For each data point,
2. take the x-value and subtract the mean of x,
3. take the y-value and subtract the mean of y,
4. multiply Step 2 and Step 3,
5. add up all of the products.

Then b1 = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and b0 = ȳ − b1x̄.
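The two step-lists translate directly into code; a sketch reusing the assumed bill/tip data from earlier:

```python
# Slope and intercept from the step-lists above (assumed bill/tip data).
import numpy as np

x = np.array([34, 108, 64, 88, 99, 51], dtype=float)
y = np.array([5, 17, 11, 8, 14, 5], dtype=float)

dx = x - x.mean()              # steps 1-2: x-value minus the mean of x
dy = y - y.mean()              # step 3 (numerator list): y-value minus the mean of y

sxx = np.sum(dx ** 2)          # denominator: sum of squared deviations, 4206
sxy = np.sum(dx * dy)          # numerator: sum of cross-products, 615

b1 = sxy / sxx                 # slope: 615 / 4206 = 0.1462
b0 = y.mean() - b1 * x.mean()  # intercept: -0.8188
print(b1, b0)
```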
Example
• ŷ = 0.1462x − 0.8188
• For every $1 the bill amount (x) increases, we would expect the tip amount to increase by $0.1462, or about 15 cents.
• If the bill amount (x) is zero, then the predicted tip amount is −$0.8188, or about negative 82 cents! Does this make sense? NO. The intercept may or may not make sense in the "real world."
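As a usage sketch, the fitted line turns any bill amount into a point-estimate tip (predict_tip is a hypothetical helper name, not from the slides):

```python
def predict_tip(bill):
    """Point estimate of the tip from the slide's fitted line."""
    return 0.1462 * bill - 0.8188

print(predict_tip(100))  # 0.1462 * 100 - 0.8188 = 13.80 -> about a $13.80 tip
```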
RESIDUAL ANALYSIS
• Residual (n.): a quantity remaining after other things have been subtracted or allowed for
• The difference between the observed value of the dependent variable (tip amount) and what is predicted by the regression model
• So if the model predicts a tip of $10 for a given meal, but the observed tip is $12, then the residual is 12 − 10 = 2
• yᵢ − ŷᵢ
• Observed tip − Predicted tip
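In code, the residuals are just element-wise subtraction (same assumed data as before):

```python
# Residuals: observed tip minus predicted tip, one per observation.
import numpy as np

x = np.array([34, 108, 64, 88, 99, 51], dtype=float)  # assumed bills
y = np.array([5, 17, 11, 8, 14, 5], dtype=float)      # assumed tips

y_hat = 0.1462 * x - 0.8188   # predicted tips from the fitted line
residuals = y - y_hat         # observed - predicted
print(residuals)
```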
Goodness of fit
• Only part of the variance in the dependent variable is explained by the values of the independent variable:
• R² = SSR / SST
• The variance left unexplained is due to model error: SSE / SST
• Think "how far off" the model is, or "how well" the model accounts for the variance in the dependent variable
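A sketch of the decomposition SST = SSR + SSE and the resulting R², again on the assumed data:

```python
# R^2 = SSR / SST: the share of total variation the line explains.
import numpy as np

x = np.array([34, 108, 64, 88, 99, 51], dtype=float)
y = np.array([5, 17, 11, 8, 14, 5], dtype=float)

b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)   # total variation
sse = np.sum((y - y_hat) ** 2)      # unexplained (error) variation
ssr = sst - sse                     # explained variation

print(ssr / sst)                    # ~0.7493, matching the slides
```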
Model Assumptions
y = β0 + β1x + ε
• Residuals offer the best information about the error term, ε
• The expected value of the error term is zero: E(ε) = 0
• For all values of the independent variable x, the variance of the error term ε is the same
• The values of the error term ε are independent of each other
• The error term ε follows a normal distribution
Assumptions
For the results of a linear regression model to be valid and reliable, we need to check
that the following four assumptions are met:
1. Linear relationship: There exists a linear relationship between the independent
variable, x, and the dependent variable, y.
2. Independence: The residuals are independent. In particular, there is no
correlation between consecutive residuals in time series data.
3. Homoscedasticity: The residuals have constant variance at every level of x.
4. Normality: The residuals of the model are normally distributed.
If one or more of these assumptions are violated, then the results of our linear
regression may be unreliable or even misleading.
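A quick visual check of assumptions 1 and 3 is a residuals-vs-fitted plot; a minimal matplotlib sketch on the assumed data:

```python
# Residuals vs. fitted values: look for an even band around zero.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([34, 108, 64, 88, 99, 51], dtype=float)
y = np.array([5, 17, 11, 8, 14, 5], dtype=float)

b1, b0 = np.polyfit(x, y, deg=1)
fitted = b0 + b1 * x
residuals = y - fitted

plt.scatter(fitted, residuals)
plt.axhline(0, linestyle="--")   # residuals should scatter evenly around this line
plt.xlabel("Fitted value")
plt.ylabel("Residual")
plt.title("Residuals vs. fitted values")
plt.show()
```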
Best case residual distribution
• Evenly distributed left to right and top to bottom, all over the graph
• A problematic residual plot, by contrast, shows residuals that are not evenly distributed
Points observed
• What happens if the residual analysis reveals heteroscedasticity?
• Rebuild the model with different independent variable(s)
• Perform transformations on non-linear data
• Fit a non-linear regression model... but don't OVERFIT
• Are there statistical tests for residuals? Yes; see the sketch below
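One answer to that last question, as a sketch: the Shapiro-Wilk test (normality) and the Durbin-Watson statistic (independence) are common residual checks, assuming scipy and statsmodels are available:

```python
# Two common residual tests on the assumed bill/tip data.
# Note: n = 6 is far too small for these tests to be reliable; illustration only.
import numpy as np
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

x = np.array([34, 108, 64, 88, 99, 51], dtype=float)
y = np.array([5, 17, 11, 8, 14, 5], dtype=float)

b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)

w, p = stats.shapiro(residuals)  # H0: residuals are normal; small p rejects
dw = durbin_watson(residuals)    # values near 2 suggest no autocorrelation
print(p, dw)
```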
R² INTERPRETATION
• Coefficient of determination = r² = 0.7493, or 74.93%
• We can conclude that 74.93% of the total sum of squares can be explained by using the estimated regression equation to predict the tip amount.
• The remainder is error.
Comparison of R-squared to the Standard Error of the Regression (S)
• The standard error of the regression provides an absolute measure of the typical distance that the data points fall from the regression line. S is in the units of the dependent variable.
• R-squared provides a relative measure of the percentage of the dependent-variable variance that the model explains. R-squared can range from 0 to 100%.
Sum of Squared Error
๐‘›
(๐‘ฆ๐‘– − ๐‘ฆ๐‘– )2
๐‘–=1
• A measure of the variability of the observation about the regression
line
Mean Squared Error
• MSE (s²) is an estimate of σ², the variance of the error term ε.
• In other words, it measures how spread out the data points are from the regression line. MSE is SSE divided by its degrees of freedom, n − 2, because we are estimating two parameters: the slope and the intercept.
MSE = s² = SSE / (n − 2)
• Why divide by n − 2 and not just n? REMEMBER, we are using sample data. It's also why we use s² and not σ².
• This is why MSE is not simply the average of the squared residuals.
• If we were using population data, we would divide by N, and MSE would simply be the average of the squared residuals.
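A numeric sketch tying this to the example: with the assumed data, SSE ≈ 30.075 and n = 6, which reproduces the slide's value of 7.5187:

```python
# MSE = SSE / (n - 2): two parameters (slope, intercept) are estimated.
import numpy as np

x = np.array([34, 108, 64, 88, 99, 51], dtype=float)
y = np.array([5, 17, 11, 8, 14, 5], dtype=float)

b1, b0 = np.polyfit(x, y, deg=1)
sse = np.sum((y - (b0 + b1 * x)) ** 2)   # ~30.075
n = len(x)

mse = sse / (n - 2)        # 30.075 / 4 = 7.5187
s = np.sqrt(mse)           # 2.742 -- the standard error of the estimate
print(mse, s)
```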
Standard Error of the Estimate
• The standard error of the estimate, s (or just the "standard error"), estimates the standard deviation of the error term, ε. Now we are UN-SQUARED!
• It is roughly the average distance an observation falls from the regression line, in units of the dependent variable.
• Since MSE is s², the standard error is just the square root of MSE:
s = √MSE = √(SSE / (n − 2))
s = √7.5187 = 2.742
• So the average distance of the data points from the fitted line is about $2.74.
• You can think of s as a measure of how well the regression model makes predictions. It can be used to build prediction intervals.
Statistically significant
• How much variance in the dependent variable is explained by the model / independent variable?
• For this we look at the value of R² or adjusted R²
• Does a statistically significant linear relationship exist between the independent and dependent variables?
• Is the overall F-test or t-test significant? (In simple regression these are actually the same test.)
• Can we reject the null hypothesis that the slope b1 of the regression line is ZERO?
• Does the confidence interval for the slope b1 contain zero?
Estimators Everywhere
Linear regression contains many estimators
• ๐‘1 the slope of the regression line
• ๐‘0 the intercept of the regression line on the y-axis
• Centroid: the point that is the intersection of the mean of each
variable (x, y)
• The mean value of ลท* for any value of x* (confidence interval)
• The individual value of ลท* for any value of x* (prediction interval)
• And many others about variance, etc.
Degrees of Freedom
• What are degrees of freedom in statistics? Degrees of freedom are the number of independent values that a statistical analysis can estimate.
• Degrees of freedom are often calculated as the sample size minus the number of parameters you're estimating.
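For example, the tip example in these slides appears to use n = 6 observations (an assumption consistent with the numbers shown), and simple linear regression estimates two parameters (b0 and b1), so df = 6 − 2 = 4. That is why the critical value t = 2.776 (the upper 0.025 tail value for 4 degrees of freedom) appears in the confidence-interval slides that follow.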
Confidence Interval
• A 95% confidence interval means we can be 95% confident that the actual population value falls within the interval.
t-value calculation:
b1 ± t_α/2 · s_b1
where s_b1 is the standard deviation (standard error) of the slope, t_α/2 · s_b1 is the margin of error, and b1 is the point estimator of the slope.
Standard Deviation of the slope
• ๐‘ ๐‘1 =
๐‘ 
(๐‘ฅ๐‘– −๐‘ฅ)2
• =2.742/sqrt(4206)
• =0.04228
Confidence Interval for the Slope
• 0.1462197 ± t_0.05/2 × 0.04228
• 0.1462197 ± 2.776 × 0.04228
• 0.1462197 ± 0.11737
• (0.02885, 0.2636)
We are 95% confident that the interval (0.02885, 0.2636) contains the true slope of the regression line.
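A sketch that reproduces this interval with scipy.stats (the sample size n = 6, hence df = 4, is an assumption consistent with the 2.776 critical value):

```python
# 95% confidence interval for the slope, reproducing the slide's numbers.
import numpy as np
from scipy import stats

b1, s, sxx, n = 0.1462197, 2.742, 4206, 6     # values from the slides (n assumed)

s_b1 = s / np.sqrt(sxx)                       # 0.04228
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)  # 2.776 for df = 4
moe = t_crit * s_b1                           # margin of error, 0.11737

print(b1 - moe, b1 + moe)                     # (0.02885, 0.2636)
```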
Does the interval contain zero?
• (0.02885, 0.2636)
• Hypotheses: H0: β1 = 0 versus Ha: β1 ≠ 0
• Can we reject the null hypothesis that the slope is zero?
• The null hypothesis is that the slope of the regression line is zero, and therefore that no significant relationship exists between the two variables. Since the interval does not contain zero, we can reject it.
Test statistic
• t = b1 / s_b1 = 0.1462197 / 0.04228 = 3.4584
• Compare t with t_critical:
• 3.4584 > 2.776, so the result is significant and we reject the null hypothesis
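The same test can be run with a p-value instead of a table lookup; a sketch with scipy (df = n − 2 = 4, as above):

```python
# Two-tailed t-test for H0: slope = 0, using a p-value.
from scipy import stats

t = 0.1462197 / 0.04228       # 3.4584
p = 2 * stats.t.sf(t, df=4)   # two-tailed p-value, roughly 0.026
print(t, p)                   # p < 0.05 -> reject H0
```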
Summary
• Does the confidence interval for the slope, b1, contain the value of ZERO?
• Is the test statistic t greater than the critical value of t at the chosen significance level and the correct degrees of freedom?
Any Queries?
Thank you