
REGRESSION ANALYSIS

Shailendra Tomar

Index

• What is Regression Analysis?

• Simple Regression Theory

• Example 1: House Price Model

• Run Simple Regression Using SAS

• Steps & Assumptions of Regression

• Multiple Regression Analysis

• Significance Testing

• Coefficient of Determination

• Example 2: Credit Card Model

• Model Selection

• Verify Regression Assumptions

• Regression Diagnostics

• Run Multiple Regression Using SAS


WHAT IS REGRESSION ANALYSIS?

• When two or more things are related and we want to quantify the relationship between them, regression analysis is the right technique

• It goes beyond correlation by producing a mathematical equation that can estimate or predict values within the range framed by the data

• The procedure requires one dependent variable and one or more independent variables

• The dependent variable (also known as the outcome or response variable) is modeled in terms of the independent variables (also called explanatory or predictor variables)

• Regression analysis examines the associative relationships between these variables

• It is commonly used in forecasting, time series modelling, financial analysis, and market research to study cause-and-effect relationships between variables

[Figure: Scatter Diagram]


SIMPLE REGRESSION THEORY

• Let’s begin with simple linear regression, which is easier to understand

• Remember the linear equation ‘y = mx + c’ from high school, which fits a straight line to data

• In simple regression, this equation is modified to ‘y = β₀ + β₁x + ε’, where y is the dependent variable and x is the independent variable

• β₀, like the y-intercept c, is the estimated value of y when x is zero, while β₁, like the slope m, is the estimated change in the average value of y resulting from a unit change in x, and ε is the error term

• The error term is needed because the regression model is based on a sample rather than the population (sample estimates are usually not exactly equal to the population values)

• That is why the Ordinary Least-Squares (OLS) procedure is used for selecting the model parameters (β₀ and β₁) that minimize the sum of the squared differences between y and ŷ and determine the best-fitting line

• The objective is always to minimize the error, which is the difference between the observed values and the predicted values generated by the model ‘ŷ = b₀ + b₁x’ (the closed-form OLS solution is sketched below)
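For reference (a standard result, not shown in the original deck), the OLS estimates in simple regression have a closed-form solution, written here in LaTeX notation:

  b_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}

Minimizing \sum_i (y_i - \hat{y}_i)^2 with respect to b_0 and b_1 yields exactly these values, and the fitted line always passes through the point (\bar{x}, \bar{y}).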


EXAMPLE 1: HOUSE PRICE MODEL

• A real estate company wants to examine the relationship between the selling price of a home (in $1000s) and its size (in square feet) for a specific region

• It selects a random sample of 10 houses

• The scatterplot of the data points shows a positive linear relationship

• The larger the house, the higher its price


STATISTICS: HOUSE PRICE MODEL

Dependent variable (Y): House Price (in $1000s)
Independent variable (X): Size (in square feet)

Analysis of Variance (ANOVA)

Source          | DF | Sum of Squares | Mean Square | F Value | Pr > F
Model           | 1  | 18935          | 18935       | 11.08   | 0.0104
Error           | 8  | 13666          | 1708.19565  |         |
Corrected Total | 9  | 32601          |             |         |

Fit Statistics

Root MSE       | 41.33032 | R-Square     | 0.5808
Dependent Mean | 286.5    | Adj R-Sq     | 0.5284
Coeff Var      | 14.42594 | Observations | 10

Parameter Estimates

Variable  | Label     | DF | Parameter Estimate | Standard Error | t Value | Pr > |t|
Intercept | Intercept | 1  | 98.24833           | 58.03348       | 1.69    | 0.1289
X         | Size      | 1  | 0.10977            | 0.03297        | 3.33    | 0.0104


INTERPRETATION: HOUSE PRICE MODEL

• First, look at the ANOVA results: the Pr > F value is less than 0.05, meaning the null hypothesis is rejected

• Second, the R-Square value is 0.5808, which means that 58.08% of the variation in house prices is explained by house size

• The regression model makes sense only when it fits the data better than the baseline model, meaning the slope of the regression line is not equal to zero

• From the parameter estimates, the House Price Model is ŷ = 98.24833 + 0.10977x

• Since prices are in thousands of dollars, for each additional square foot the average value of a house increases by 0.10977 ($1000) = $109.77

• For example, the expected price of a 2,000 square foot house would be 98.24833 + 0.10977 × 2000 = 317.78833 ($1000s) ≈ $317,788 (a quick check of this arithmetic is sketched below)

• Estimation and prediction should happen only within the range of data that was used for the regression analysis; outside it, results are doubtful

• The remaining statistics will be discussed in Multiple Regression Analysis
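As a quick sanity check (a minimal sketch, not part of the original deck; the data set name house_pred is illustrative), the prediction can be computed in a SAS DATA step:

data house_pred;
   x    = 2000;                      /* size in square feet       */
   yhat = 98.24833 + 0.10977 * x;    /* predicted price in $1000s */
run;

proc print data=house_pred; run;

This prints yhat = 317.78833, i.e. approximately $317,788.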


RUN SIMPLE REGRESSION USING SAS

data House;
   input Y X;
   label Y = 'House Price in $1000s';
   label X = 'Size in Square Feet';
   datalines;
245 1400
312 1600
279 1700
308 1875
199 1100
219 1550
405 2350
324 2450
319 1425
255 1700
;

ods graphics on;
title1 'Simple Regression Analysis';
title2 'House Price Model';
proc reg plots(only)=fitplot;
   model Y = X;
run;
ods graphics off;
title;

• Copy and paste the above code into the SAS Program Editor and run it


STEPS & ASSUMPTIONS OF REGRESSION

Step 1: Formulate the problem
Step 2: Define the dependent & independent variables
Step 3: Build the general model
Step 4: Plot the scatter diagram
Step 5: Estimate the parameters
Step 6: Estimate the regression coefficients
Step 7: Test for significance
Step 8: Find the strength of the association
Step 9: Check the prediction accuracy
Step 10: Examine the residuals
Step 11: Cross-validate the model

The key assumptions of regression (summarized in symbols after this list) are:

• Linearity of the phenomenon measured, meaning the mean of the dependent variable is linearly related to the independent variable

• Errors are normally distributed with a mean of zero

• Errors have equal variances; in other words, the error variance is constant (homoscedasticity)

• Errors are independent, meaning uncorrelated
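In symbols (a standard formulation, not spelled out on the slide), for the model y_i = β₀ + β₁x_i + ε_i these assumptions amount to:

  \varepsilon_i \sim N(0, \sigma^2), \quad E[\varepsilon_i] = 0, \quad \mathrm{Var}(\varepsilon_i) = \sigma^2 \ \text{for all } i, \quad \mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0 \ (i \neq j)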


MULTIPLE REGRESSION ANALYSIS

• More powerful, as it involves a single dependent variable and two or more independent variables

• The dependent variable should be interval-scaled and the other variables metric or appropriately transformed

• It analyzes the impact of a set of independent variables on the dependent variable

• The equation for multiple regression is ‘y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ + ε’, where y is the dependent variable and x₁, x₂, …, xₙ are the independent variables

• The predicted values are generated by the model ‘ŷ = b₀ + b₁x₁ + b₂x₂ + … + bₙxₙ’, where b₀, b₁, b₂, …, bₙ are the estimators of β₀, β₁, β₂, …, βₙ

• The model parameters are estimated using the Ordinary Least-Squares (OLS) procedure, which minimizes the sum of the squared differences between y and ŷ and determines the best-fitting line

• Before performing multiple regression, it is always recommended to check the correlations among the variables to avoid multicollinearity issues (a quick SAS sketch follows)
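As a minimal sketch (assuming the CreditCard data set defined later in this deck), the pairwise correlations can be inspected with PROC CORR before fitting the model:

proc corr data=CreditCard;   /* pairwise Pearson correlations   */
   var X1 X2 X3;             /* candidate independent variables */
run;

Predictor pairs with very high correlations are a warning sign of multicollinearity; the VIF statistic, discussed under Regression Diagnostics, gives a more direct measure.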


SIGNIFICANCE TESTING

• Significance testing provides justification for accepting or rejecting a given hypothesis

• In ANOVA, the null hypothesis is that all population means are equal, and the alternative hypothesis is that not all of them are equal. It is assumed that the populations are normal and have equal variances.

• In regression, there are three types of sums of squares: the variation explained by the model (SSM), the unexplained variation or error (SSE), and the total variation (SST)

• To test the hypothesis, the F ratio is calculated; it has to exceed the critical value of the Fisher (F) distribution (which depends on the sample size), proving the model fits the data better than the baseline model (see the formula below)

• The output includes a p-value, which should be lower than 0.05 to confirm that a relationship exists between the dependent and independent variables

• Testing the significance of the individual model parameters can be done in a similar manner, but using the t test statistic
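In LaTeX notation (standard definitions, consistent with the slide), with df_M degrees of freedom for the model and df_E for the error:

  F = \frac{SS_M / df_M}{SS_E / df_E} = \frac{MS_M}{MS_E}

Plugging in the House Price ANOVA values: F = 18935 / 1708.19565 ≈ 11.08, matching the F Value SAS reports.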


COEFFICIENT OF DETERMINATION

• The coefficient of determination (R²) expresses the strength of the association

• R² = SSM / SST

• It measures the percentage of the variation in the dependent variable that is explained by the independent variable(s)

• An R² value close to 1 means the regression line fits the data very well, whereas a value close to 0 means it does not

• R² keeps increasing as more independent variables are added to the model, so it can be misleading

• After the first few variables, additional independent variables do not contribute much

• Adjusted R² reports the percentage of variation explained by only the independent variables that actually affect the dependent variable (its formula is sketched below)

• For example, in the stepwise summary on the Model Selection slide, adding the third variable (X3) does not add any value to the model
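A standard formula (not printed on the slide) links the two measures, where n is the number of observations and k the number of independent variables:

  R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}

For the Credit Card model that follows (R² = 0.8614, n = 8, k = 2): 1 − 0.1386 × 7/5 ≈ 0.806, matching the reported Adj R-Sq of 0.8059 up to rounding.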


EXAMPLE 2: CREDIT CARD MODEL

• A bank wants to predict the number of credit cards a family uses (Y) from the following data: Family Number (ID), Family Size (X1), Family income in thousand dollars (X2), and Number of automobiles owned (X3)

• A sample of 8 families is used in the analysis

• The objective is to find a model that predicts better than the baseline, i.e., with a smaller sum of squared prediction errors

Family ID | Actual No. of Credit Cards (Y) | Baseline Prediction (ŷ = ȳ = ΣY/N = 56/8 = 7) | (y − ȳ) | (y − ȳ)²
1     | 4  | 7 | −3 | 9
2     | 6  | 7 | −1 | 1
3     | 6  | 7 | −1 | 1
4     | 7  | 7 |  0 | 0
5     | 8  | 7 |  1 | 1
6     | 7  | 7 |  0 | 0
7     | 8  | 7 |  1 | 1
8     | 10 | 7 |  3 | 9
Total | 56 |   |  0 | 22


STATISTICS: CREDIT CARD MODEL

Dependent variable (Y): No. of Credit Cards
Independent variables (X1 & X2): Family Size & Family Income

Analysis of Variance (ANOVA)

Source          | DF | Sum of Squares | Mean Square | F Value | Pr > F
Model           | 2  | 18.95027       | 9.47514     | 15.53   | 0.0072
Error           | 5  | 3.04973        | 0.60995     |         |
Corrected Total | 7  | 22             |             |         |

Fit Statistics

Root MSE       | 0.78099 | R-Square     | 0.8614
Dependent Mean | 7.0     | Adj R-Sq     | 0.8059
Coeff Var      | 11.157  | Observations | 8

Parameter Estimates

Variable  | Label         | DF | Parameter Estimate | Standard Error | t Value | Pr > |t|
Intercept | Intercept     | 1  | 0.48169            | 1.46141        | 0.33    | 0.7551
X1        | Family Size   | 1  | 0.63224            | 0.25231        | 2.51    | 0.0541
X2        | Family Income | 1  | 0.21585            | 0.10801        | 2.00    | 0.1021
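As a check, consistent with the definitions on the Coefficient of Determination slide:

  R^2 = \frac{SS_M}{SS_T} = \frac{18.95027}{22} \approx 0.8614

Note that the baseline model's squared prediction error of 22 is exactly SS_T, since the baseline prediction is ȳ.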


INTERPRETATION: CREDIT CARD MODEL

• ANOVA results shows that Pr value is lesser than 0.05, meaning the null hypothesis is rejected and the relationship exists between Y1 and X1 & X2

• In this model, variation explained by model is 3.04953 which is lesser than baseline model (where predicted error squared is 22)

• R-Sqaure value is 0.8614 which means that 86.14% of the variation in credit cards is explained by this model

• When we included X3, the adjusted R-square decreased. Hence, we did not include X3 in this model as it was statistically insignificant.

• From the parameter estimates, ŷ= 0.482 + 0.63*X1 + 0.216*X2

• Assuming the family size (X1) is 4 and its annual income (X2) is 17.5. the predicted number of credit cars would be 6.782 (using above equation). Here

0.218 is the error if the value is round off and made it to 7 credit cards

• The estimation and prediction should happen only within the range of data that was used for the regression analysis, else results are doubtful
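A minimal scoring sketch (not part of the original deck; the names NewFamily, Scored, Pred, and yhat are illustrative): append the new family with a missing Y to the data; PROC REG excludes such observations from fitting but still computes their predicted values via the OUTPUT statement.

data NewFamily;              /* family to be scored; Y unknown   */
   input ID Y X1 X2 X3;
   datalines;
9 . 4 17.5 .
;

data Scored;                 /* original sample plus the new row */
   set CreditCard NewFamily;
run;

proc reg data=Scored;
   model Y = X1 X2;
   output out=Pred p=yhat;   /* yhat holds the predicted values  */
run;
quit;

With the full-precision coefficients, the prediction is 0.48169 + 0.63224×4 + 0.21585×17.5 ≈ 6.788, close to the 6.782 obtained from the rounded equation.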


MODEL SELECTION

• For effective modeling, one should always choose the best model, validate the regression assumptions, detect influential observations, and check collinearity

• Let's understand model selection. Whether you run the regressions manually or use stepwise selection, the objective is always a better model that explains more variation (an R-square value closer to 1 is expected)

• Stepwise regression is often used when there are many variables, because this method automatically chooses the best possible combination of variables based on their p-values (a SAS sketch follows the table)

• Below is the summary of statistics showing how each variable entered into the model influenced R-square and Adjusted R-square

• When X3 entered the model, the Adjusted R-square decreased, suggesting that the variable be dropped from the model

Variables entered in model | R-Square | Adjusted R-Square | F Value | Pr > F
X1 | 0.7506 | 0.7091 | 18.06 | 0.0054
X2 | 0.8614 | 0.8059 | 15.53 | 0.0072
X3 | 0.8720 | 0.7761 | 9.09  | 0.0294
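A minimal sketch of automatic selection with PROC REG (the 0.15 entry/stay thresholds are illustrative):

proc reg data=CreditCard;
   model Y = X1 X2 X3
         / selection=stepwise slentry=0.15 slstay=0.15;  /* p-value thresholds to enter/stay */
run;
quit;

SAS then prints a step-by-step selection summary similar to the table above.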


VERIFY REGRESSION ASSUMPTIONS

• To confirm the normality of the error term, check the histogram and distribution curves of the residuals

• Looking at the residual plot, one can verify the other two assumptions, equal variance and independence: the errors should be scattered randomly

• In the previous slide, the intercept was 0.482 (when the intercept is not zero, the linearity assumption is taken to be verified)


REGRESSION DIAGNOSTICS

Influential observations

The R-square value can be distorted by outliers or influential observations, so it is necessary to look at the RStudent plot. Usually, absolute values greater than two are considered outliers (three for large sample sizes). Cook's D, DFFITS, and DFBETAS are other useful statistics.

Multicollinearity

This occurs when two or more independent variables are highly correlated with each other, which leads to instability in the regression model. To measure the magnitude of collinearity in a model, the VIF (Variance Inflation Factor) is used; its accepted values are up to 10.

Variables | VIF
X1 | 1.82692
X2 | 1.93492
X3 | 1.09976

In the credit card example, there is no collinearity issue, as all VIF values are well below 10. (A SAS sketch for these diagnostics follows.)
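A minimal sketch of how these diagnostics can be requested in PROC REG (VIF, INFLUENCE, and R are documented MODEL statement options):

proc reg data=CreditCard;
   model Y = X1 X2 X3 / vif         /* variance inflation factors             */
                        influence   /* leverage, RStudent, DFFITS, DFBETAS    */
                        r;          /* residual analysis, including Cook's D  */
run;
quit;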


RUN MULTIPLE REGRESSION USING SAS

DATA CreditCard;
   INPUT ID Y X1 X2 X3;
   LABEL ID = 'Family Number';
   LABEL Y  = 'Number of Credit Cards';
   LABEL X1 = 'Family Size';
   LABEL X2 = 'Family income in $000';
   LABEL X3 = 'Number of cars owned';
   DATALINES;
1 4 2 14 1
2 6 2 16 2
3 6 4 14 2
4 7 4 17 1
5 8 5 18 3
6 7 5 21 2
7 8 6 17 1
8 10 6 25 2
;

ODS GRAPHICS ON;
TITLE1 'Multiple Regression Analysis';
TITLE2 'Credit Card Model';
PROC REG PLOTS(ONLY)=(RESIDUALHISTOGRAM RESIDUALBYPREDICTED RSTUDENTBYPREDICTED
                      COOKSD DFFITS DFBETAS DIAGNOSTICS);
   MODEL Y = X1 X2;
RUN;
ODS GRAPHICS OFF;
TITLE;

• Copy and paste the above code into the SAS Program Editor and run it


Thank You
