• What is Regression Analysis?
• Simple Regression Theory
• Example 1: House Price Model
• Run Simple Regression Using SAS
• Steps & Assumptions of Regression
• Multiple Regression Analysis
• Significance Testing
• Coefficient of Determination
• Example 2: Credit Card Model
• Model Selection
• Verify Regression Assumptions
• Regression Diagnostics
• Run Multiple Regression Using SAS
• When two or more things are related and we want to quantify the relationship between them, regression analysis is the right technique
• It goes beyond correlation by creating a mathematical equation to estimate or predict values within the range framed by the data
• The regression procedure requires one dependent variable and one or more independent variables
• The dependent variable (also known as the outcome or response variable) is modeled in terms of the independent variables (also called explanatory or predictor variables)
• Associative relationships between these variables are analyzed by regression analysis
• It is commonly used in forecasting, time-series modelling, financial analysis, and market research to find cause-and-effect relationships between variables
[Figure: Scatter Diagram]
• Let’s begin with simple linear regression, which is easier to understand
• Remember the linear equation ‘y = mx + c’ from high school, which fits a straight line to data
• In simple regression, this equation becomes ‘y = β₀ + β₁x + ε’, where y is the dependent variable and x is the independent variable
• β₀, like the y-intercept c, is the estimated value of y when x is zero; β₁, like the slope m, is the estimated change in the average value of y for a unit change in x; and ε is the error term
• The error term is needed because the regression model is based on a sample rather than the population (sample estimates are usually not exactly equal to the population values)
• That is why the Ordinary Least-Squares (OLS) procedure is used to select the model parameters (β₀ and β₁) that minimize the sum of the squared differences between y and ŷ, determining the best-fitting line
• The objective is always to minimize the error, which is the difference between the observed values and the predicted values generated by the model ‘ŷ = b₀ + b₁x’
• A real estate company wants to examine the relationship between the selling price of a home (in
$1000s) and its size (in square feet) for a specific region.
• It selects a random sample of 10 houses
• The scatterplot with the data points shows the positive linear relationship
• The larger the house, the higher its price
Dependent Variable (Y): House Price (in $1000s)
Independent Variable (X): Size (in square feet)

Analysis of Variance (ANOVA)
Source          | DF | Sum of Squares | Mean Square | F Value | Pr > F
Model           | 1  | 18935          | 18935       | 11.08   | 0.0104
Error           | 8  | 13666          | 1708.19565  |         |
Corrected Total | 9  | 32601          |             |         |

Root MSE: 41.33032  | R-Square: 0.5808
Dependent Mean: 286.5 | Adj R-Sq: 0.5284
Coeff Var: 14.42594 | Observations: 10

Parameter Estimates
Variable  | Label     | DF | Parameter Estimate | Standard Error | t Value | Pr > |t|
Intercept | Intercept | 1  | 98.24833           | 58.03348       | 1.69    | 0.1289
X         | Size      | 1  | 0.10977            | 0.03297        | 3.33    | 0.0104
• First, look at the ANOVA results, in which the Pr > F value is less than 0.05, meaning the null hypothesis is rejected
• Second, the R-Square value is 0.5808, which means that 58.08% of the variation in house prices is explained by house size
• The regression model makes sense only when it fits the data better than the baseline model, meaning the slope of the regression line is not equal to zero
• From the parameter estimates, the House Price Model is ŷ = 98.24833 + 0.10977x
• Since the prices are in thousands of dollars, for each additional square foot the average value of a house increases by 0.10977 ($1000) = $109.77
• For example, the expected price of a 2000-square-foot house would be 98.24833 + 0.10977 × 2000 = 317.78833 ($1000s) ≈ $317,788
• The estimation and prediction should happen only within the range of the data used for the regression analysis; otherwise, the results are unreliable
• The remaining statistics will be discussed in Multiple Regression Analysis
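As a cross-check outside SAS, the fitted line can be reproduced from the sample data with the closed-form simple-regression formulas. This is an illustrative Python sketch only, not part of the original SAS workflow:

```python
# Illustrative check of the House Price Model: compute the OLS slope and
# intercept from the 10 sample houses using b1 = Sxy/Sxx, b0 = ybar - b1*xbar.
sizes = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
prices = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]  # in $1000s

n = len(sizes)
x_bar = sum(sizes) / n
y_bar = sum(prices) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sizes, prices))
s_xx = sum((x - x_bar) ** 2 for x in sizes)

b1 = s_xy / s_xx            # slope: ~0.10977 ($1000s per square foot)
b0 = y_bar - b1 * x_bar     # intercept: ~98.248 ($1000s)
pred_2000 = b0 + b1 * 2000  # predicted price of a 2000 sq ft house, in $1000s

print(round(b1, 5), round(b0, 5), round(pred_2000, 2))
```

The computed slope and intercept match the SAS parameter estimates above.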
DATA House;
INPUT Y X;
LABEL Y = 'House Price in $1000s';
LABEL X = 'Size in Square Feet';
DATALINES;
245 1400
312 1600
279 1700
308 1875
199 1100
219 1550
405 2350
324 2450
319 1425
255 1700
;
ODS GRAPHICS ON;
TITLE1 'Simple Regression Analysis';
TITLE2 'House Price Model';
PROC REG PLOTS(ONLY)=FITPLOT;
MODEL Y = X;
RUN;
ODS GRAPHICS OFF;
TITLE;
• Copy and paste the above code into the SAS program editor
Step 1: Formulate the problem
Step 2: Define dependent & independent variables
Step 3: Build the general model
Step 4: Plot the scatter diagram
Step 5: Estimate the parameters
Step 6: Estimate the regression coefficient
Step 7: Test for significance
Step 8: Find the strength of the association
Step 9: Check the prediction accuracy
Step 10: Examine the residuals
Step 11: Cross-validate the model
• Linearity of the phenomenon measured, meaning the mean of the dependent variable is linearly related to the independent variable
• Errors are normally distributed with a mean of zero
• Errors have equal variances, or in other words the error variance is constant (homoscedasticity)
• Errors are independent, meaning uncorrelated
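The zero-mean assumption can be illustrated numerically with the house-price fit from the earlier example: with an intercept in the model, OLS residuals average to zero by construction, while the plots (histogram, residual-vs-predicted) are used to judge normality, equal variance, and independence. A small illustrative Python sketch, not part of the SAS workflow:

```python
# Residuals of the house-price model, using the parameter estimates printed
# by SAS (rounded to five decimals, so the mean is only approximately zero).
sizes = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
prices = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
b0, b1 = 98.24833, 0.10977  # estimates from the SAS output

residuals = [y - (b0 + b1 * x) for x, y in zip(sizes, prices)]
mean_resid = sum(residuals) / len(residuals)
print(round(mean_resid, 3))  # ~0, as the zero-mean assumption requires
```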
• More powerful than simple regression, as it involves a single dependent variable and two or more independent variables
• The dependent variable should be interval-scaled and the other variables metric or appropriately transformed
• Analyzes the impact of a set of independent variables on the dependent variable
• The equation for multiple regression is ‘y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ + ε’, where y is the dependent variable and x₁, x₂, …, xₙ are the independent variables
• The predicted values are generated by the model ‘ŷ = b₀ + b₁x₁ + b₂x₂ + … + bₙxₙ’, where b₀, b₁, b₂, …, bₙ are the estimators of β₀, β₁, β₂, …, βₙ
• The model parameters are estimated using the Ordinary Least-Squares (OLS) procedure, which minimizes the sum of the squared differences between y and ŷ and determines the best-fitting line
• Before performing multiple regression, it is always recommended to check the correlation among variables to avoid multicollinearity issues
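The OLS estimation above can be sketched in pure Python by solving the normal equations (XᵀX)b = Xᵀy with Gaussian elimination; here it is applied to the credit card data from Example 2 later in this deck. An illustrative sketch only, not the SAS implementation:

```python
# Fit the credit card multiple regression (Y on X1, X2) by solving the
# OLS normal equations (X'X) b = X'y with Gaussian elimination.
ys  = [4, 6, 6, 7, 8, 7, 8, 10]         # number of credit cards (Y)
x1s = [2, 2, 4, 4, 5, 5, 6, 6]          # family size (X1)
x2s = [14, 16, 14, 17, 18, 21, 17, 25]  # family income in $000 (X2)

X = [[1.0, a, b] for a, b in zip(x1s, x2s)]  # design matrix rows [1, x1, x2]
k = 3

# Build X'X and X'y
XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
Xty = [sum(r[i] * y for r, y in zip(X, ys)) for i in range(k)]

# Gaussian elimination with partial pivoting, then back-substitution
A = [row[:] + [v] for row, v in zip(XtX, Xty)]  # augmented matrix
for c in range(k):
    p = max(range(c, k), key=lambda r: abs(A[r][c]))
    A[c], A[p] = A[p], A[c]
    for r in range(c + 1, k):
        f = A[r][c] / A[c][c]
        A[r] = [ar - f * ac for ar, ac in zip(A[r], A[c])]
b = [0.0] * k
for c in reversed(range(k)):
    b[c] = (A[c][k] - sum(A[c][j] * b[j] for j in range(c + 1, k))) / A[c][c]

b0, b1, b2 = b  # ~0.48169, ~0.63224, ~0.21585, matching the SAS output
print([round(v, 5) for v in b])
```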
• To provide justification for accepting or rejecting a given hypothesis
• In ANOVA, the null hypothesis is that all population means are equal, and the alternative hypothesis is that not all of the population means are equal. It is assumed that the populations are normal and have equal variances.
• In regression, there are three types of sums of squares: variation explained by the model (SSM), unexplained variation or error (SSE), and total variation (SST)
• To test the hypothesis, the F ratio is calculated; it has to be higher than the critical value of the F (Fisher) distribution (based on the degrees of freedom), proving the model fits the data better than the baseline model
• The results include a p-value, which should be lower than 0.05 to confirm that a relationship exists between the dependent and independent variables
• Testing the significance of the individual model parameters is done in a similar manner, but using the t-test statistic
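The F ratio is simply the model mean square divided by the error mean square. A quick illustrative Python check against the house-price ANOVA table shown earlier:

```python
# Sanity check of the F ratio from the house-price ANOVA table:
# F = MS_Model / MS_Error, where each MS is its SS divided by its DF.
ss_model, df_model = 18935, 1  # variation explained by the model (SSM)
ss_error, df_error = 13666, 8  # unexplained variation (SSE)

ms_model = ss_model / df_model
ms_error = ss_error / df_error
f_ratio = ms_model / ms_error  # ~11.08, matching the SAS output
print(round(f_ratio, 2))
```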
• The coefficient of determination (R²) explains the strength of the association
• R² = SSM / SST
• It measures the percentage of the variation in the dependent variable that is explained by the independent variable(s)
• An R² value closer to 1 means the regression line fits the data almost perfectly, whereas a value closer to 0 means it does not fit the data well
• The R² value keeps increasing as more independent variables are added to the model, so results can be misleading
• After adding the first few variables, additional independent variables do not contribute much
• Adjusted R² gives the percentage of variation explained by only those independent variables that actually affect the dependent variable
• For example, in the stepwise R² values shown later, adding a third variable does not add any value to the model
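These two measures can be computed directly from the sums of squares; this illustrative Python sketch uses the figures from the credit card model that appears later in the deck (n = 8 observations, p = 2 predictors):

```python
# R-square and adjusted R-square from sums of squares:
# R2 = SSM / SST, Adj R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
ss_model, ss_total = 18.95027, 22.0
n, p = 8, 2  # sample size and number of independent variables

r2 = ss_model / ss_total                       # ~0.8614
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)  # ~0.8059
print(round(r2, 4), round(adj_r2, 4))
```

Unlike R², the adjusted value penalizes each added predictor, which is why it can fall when a weak variable enters the model.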
• A bank wants to predict the number of credit cards that a family uses (Y) based on the following data – Family Number (ID), Family Size (X1), Family income in thousand dollars (X2), and Number of automobiles owned (X3)
• A sample of 8 families is used in the analysis
• The objective is to build a model that predicts better than the baseline, with minimum squared prediction error
Family ID | Actual No. of Credit Cards (Y) | Baseline Prediction (ȳ = ŷ = ΣY/N = 56/8 = 7) | (y − ȳ) | (y − ȳ)²
1     | 4  | 7 | -3 | 9
2     | 6  | 7 | -1 | 1
3     | 6  | 7 | -1 | 1
4     | 7  | 7 | 0  | 0
5     | 8  | 7 | 1  | 1
6     | 7  | 7 | 0  | 0
7     | 8  | 7 | 1  | 1
8     | 10 | 7 | 3  | 9
Total | 56 |   | 0  | 22
Dependent Variable (Y): No. of Credit Cards
Independent Variables (X1 & X2): Family Size & Family Income

Analysis of Variance (ANOVA)
Source          | DF | Sum of Squares | Mean Square | F Value | Pr > F
Model           | 2  | 18.95027       | 9.47514     | 15.53   | 0.0072
Error           | 5  | 3.04973        | 0.60995     |         |
Corrected Total | 7  | 22             |             |         |

Root MSE: 0.78099  | R-Square: 0.8614
Dependent Mean: 7.0 | Adj R-Sq: 0.8059
Coeff Var: 11.157  | Observations: 8

Parameter Estimates
Variable  | Label         | DF | Parameter Estimate | Standard Error | t Value | Pr > |t|
Intercept | Intercept     | 1  | 0.48169            | 1.46141        | 0.33    | 0.7551
X1        | Family Size   | 1  | 0.63224            | 0.25231        | 2.51    | 0.0541
X2        | Family Income | 1  | 0.21585            | 0.10801        | 2.00    | 0.1021
• The ANOVA results show that the Pr > F value is less than 0.05, meaning the null hypothesis is rejected and a relationship exists between Y and X1 & X2
• In this model, the unexplained variation (error sum of squares) is 3.04973, far less than that of the baseline model (where the squared prediction error is 22)
• The R-Square value is 0.8614, which means that 86.14% of the variation in credit card usage is explained by this model
• When X3 was included, the adjusted R-square decreased; hence, X3 was not included in this model, as it was statistically insignificant
• From the parameter estimates, ŷ = 0.482 + 0.632·X1 + 0.216·X2
• Assuming the family size (X1) is 4 and its annual income (X2) is 17.5 ($000), the predicted number of credit cards is about 6.79 (using the above equation); about 0.21 is the error if the value is rounded off to 7 credit cards
• The estimation and prediction should happen only within the range of the data used for the regression analysis; otherwise, the results are unreliable
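The prediction above can be reproduced with a couple of lines of arithmetic; an illustrative Python sketch using the exact parameter estimates from the SAS output:

```python
# Predict credit cards for a family of 4 with income of $17,500 (X2 = 17.5,
# since income is in $000), using the exact SAS parameter estimates.
b0, b1, b2 = 0.48169, 0.63224, 0.21585
x1, x2 = 4, 17.5

y_hat = b0 + b1 * x1 + b2 * x2  # ~6.79 credit cards
rounding_error = 7 - y_hat      # ~0.21 if the value is rounded up to 7 cards
print(round(y_hat, 3), round(rounding_error, 3))
```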
• For effective modeling, one should always choose the best model, validate the regression assumptions, detect influential observations, and check for collinearity
• Let’s start with model selection. Whether you run regressions manually or use stepwise selection, the objective is always a better model that explains more variation (an R-square value closer to 1 is expected)
• Stepwise regression is often used when there are many variables, because this method automatically chooses the best possible combination of variables based on their p-values
• Below is a summary of statistics showing how each variable entered into the model influenced R-square and adjusted R-square
• When X3 entered the model, the adjusted R-square decreased, suggesting the variable be dropped from the model
Variable entered in model | R-Square | Adjusted R-Square | F Value | Pr > F
X1 | 0.7506 | 0.7091 | 18.06 | 0.0054
X2 | 0.8614 | 0.8059 | 15.53 | 0.0072
X3 | 0.8720 | 0.7761 | 9.09  | 0.0294
• To confirm the normality of the error term, check the histogram and the distribution curves of the residuals
• Looking at the residual plot, one can verify the other two assumptions, equal variance and independence: the errors should be randomly scattered
• In the previous slide, the intercept was 0.482 (when the intercept is not zero, the linearity assumption has already been examined)
Influential observations
The R-square value can be affected by outliers or influential observations, so it is necessary to look at the RStudent plot. Usually, absolute values greater than 2 are considered outliers (3 for large sample sizes). Cook’s D, DFFITS, and DFBETAS are other useful statistics.

Multicollinearity
It occurs when two or more independent variables are highly correlated with each other, which leads to instability in the regression model. To measure the magnitude of collinearity in a model, the VIF (Variance Inflation Factor) is used, and values up to 10 are generally accepted.

Variable | VIF
X1 | 1.82692
X2 | 1.93492
X3 | 1.09976

In the credit card example, there is no collinearity issue, as the VIF values are well below 10
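The VIF is defined as VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing X_j on the other predictors. Inverting the VIF values reported above therefore recovers how much of each predictor is explained by the others; an illustrative Python sketch:

```python
# Invert each reported VIF to get the auxiliary-regression R-square:
# the share of a predictor's variation explained by the other predictors.
vifs = {"X1": 1.82692, "X2": 1.93492, "X3": 1.09976}

for name, vif in vifs.items():
    aux_r2 = 1 - 1 / vif  # shared variation with the other predictors
    status = "collinearity OK" if vif < 10 else "high collinearity"
    print(name, round(aux_r2, 3), status)
```

For X1, for example, only about 45% of its variation is shared with X2 and X3, well below the level that would push a VIF toward the threshold of 10.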
DATA CreditCard;
INPUT ID Y X1 X2 X3;
LABEL ID = 'Family Number';
LABEL Y = 'Number of Credit Cards';
LABEL X1 = 'Family Size';
LABEL X2 = 'Family income in $000';
LABEL X3 = 'Number of cars owned';
DATALINES;
1 4 2 14 1
2 6 2 16 2
3 6 4 14 2
4 7 4 17 1
5 8 5 18 3
6 7 5 21 2
7 8 6 17 1
8 10 6 25 2
;
ODS GRAPHICS ON;
TITLE1 'Multiple Regression Analysis';
TITLE2 'Credit Card Model';
PROC REG
PLOTS(ONLY)=RESIDUALHISTOGRAM
PLOTS(ONLY)=RESIDUALBYPREDICTED
PLOTS(ONLY)=RSTUDENTBYPREDICTED
PLOTS(ONLY)=COOKSD
PLOTS(ONLY)=DFFITS
PLOTS(ONLY)=DFBETAS
PLOTS(ONLY)=DIAGNOSTICSPANEL;
MODEL Y = X1 X2;
RUN;
ODS GRAPHICS OFF;
TITLE;
• Copy and paste the above code into the SAS program editor