Lecture 12

Multiple Regression
Simple Regression in detail
Yi = β0 + β1 Xi + εi
Where
• Y =>Dependent variable
• X =>Independent variable
• β0 => Model parameter
– Mean value of dependent variable (Y) when the
independent variable (X) is zero
Simple Regression in detail
• β1 => Model parameter
– Slope that measures the change in the mean value of the
dependent variable associated with a one-unit increase in
the independent variable
• εi => Error term
– Describes the effects on Yi of all factors other than the
value of Xi
Assumptions of the Regression
Model
• Error term is normally distributed (normality
assumption)
• Mean of error term is zero (E{εi} = 0)
• Variance of error term is a constant and is independent
of the values of X (constant variance assumption)
• Error terms are independent of each other
(independent assumption)
• Values of the independent variable X are fixed
– No error in the X values
Estimating the Model Parameters
• Calculate point estimates b0 and b1 of the unknown
parameters β0 and β1
• Obtain a random sample and use the sample information
to estimate β0 and β1
• Obtain a line of best "fit" for the sample data points:
the least-squares line
Ŷi = b0 + b1 Xi
Where Ŷi is the predicted value of Y
Values of Least Squares
Estimates bo and b1
b1 = n xiyi - (xi)(yi)
n xi2 - (xi)2
bo = y - bi x
Where
y = yi ;
n
x = xi
n
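As a sketch, the least-squares formulas above translate directly into Python; the sample data here are made up purely for illustration:

```python
# Point estimates b0 and b1 from the least-squares formulas:
# b1 = [n*Sum(xy) - Sum(x)*Sum(y)] / [n*Sum(x^2) - (Sum(x))^2]
# b0 = ybar - b1 * xbar
def least_squares(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b0 = sy / n - b1 * sx / n
    return b0, b1

# Illustrative (invented) data points
x = [1, 2, 3, 4, 5]
y = [2.1, 4.2, 5.9, 8.1, 9.8]
b0, b1 = least_squares(x, y)
```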
• b0 and b1 vary from sample to sample. Variation is
given by their standard errors Sb0 and Sb1
Example 1
• To see relationship between Advertising and Store
Traffic
• Store Traffic is the dependent variable and
Advertising is the independent variable
• We find using the formulae that b0 = 148.64 and
b1 = 1.54
• Are bo and b1 significant?
• What is Store Traffic when Advertising is 600?
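To answer the prediction question, plug Advertising = 600 into the fitted line from Example 1; a one-line sketch:

```python
# Predicted Store Traffic from Example 1's estimates
b0, b1 = 148.64, 1.54
advertising = 600
store_traffic = b0 + b1 * advertising  # 148.64 + 1.54 * 600
```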
Example 2
• Consider the following data
Sales (Y)   Advertising (X)
3           7
8           13
17          13
4           11
15          16
7           6
• Using the formulae we find that b0 = -2.55 and
b1 = 1.05
Example 2
Therefore the regression model would be
Ŷ = -2.55 + 1.05 Xi
r2 = (0.74)2 = 0.54 (Variance in sales (Y) explained by ad (X))
Assume that Sb0 (standard error of b0) = 0.51 and
Sb1 = 0.26. At α = 0.05, df = 4:
Is b0 significant? Is b1 significant?
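A sketch of Example 2's computation, assuming the table pairs as (Advertising X, Sales Y) = (7, 3), (13, 8), (13, 17), (11, 4), (16, 15), (6, 7) — this pairing reproduces the slide's estimates:

```python
x = [7, 13, 13, 11, 16, 6]   # advertising (assumed pairing of the table)
y = [3, 8, 17, 4, 15, 7]     # sales
n = len(x)
sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sxx = sum(a * a for a in x)
b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # 468/444, about 1.05
b0 = sy / n - round(b1, 2) * sx / n              # 9 - 1.05*11 = -2.55
# Significance at alpha = 0.05, df = 4: two-tailed t critical value is 2.776
t_b0 = b0 / 0.51              # about -5.0 -> |t| > 2.776, b0 is significant
t_b1 = round(b1, 2) / 0.26    # about  4.04 -> |t| > 2.776, b1 is significant
```

Note the slide rounds b1 to 1.05 before computing b0, which is how it arrives at -2.55.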
Idea behind Estimation: Residuals
• Difference between the actual and predicted
values are called Residuals
• Estimate of the error in the population
ei = yi − ŷi
   = yi − (b0 + b1 xi)
Quantities with hats are predicted quantities
• b0 and b1 minimize the residual or error sum of
squares (SSE)
SSE = Σei² = Σ(yi − ŷi)²
    = Σ[yi − (b0 + b1 xi)]²
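A minimal sketch of residuals and SSE for a fitted line; the coefficients and data are invented for illustration:

```python
# Residuals e_i = y_i - yhat_i and their sum of squares (SSE)
b0, b1 = 1.0, 2.0            # illustrative fitted coefficients
x = [1, 2, 3]
y = [3.1, 4.8, 7.2]
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]  # [0.1, -0.2, 0.2]
sse = sum(e * e for e in residuals)
```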
Testing the Significance of the
Independent Variables
• Null Hypothesis
• There is no linear relationship between the
independent & dependent variables
• Alternative Hypothesis
• There is a linear relationship between the
independent & dependent variables
Testing the Significance of the
Independent Variables
• Test Statistic
t = (b1 − β1) / sb1
• Degrees of Freedom
v = n − 2
• Hypotheses (two-tailed test)
H0: β1 = 0
H1: β1 ≠ 0
• Decision Rule
Reject H0: β1 = 0 if α > p-value
Significance Test for Store
Traffic Example
• Null hypothesis, H0: β1 = 0
• Alternative hypothesis, HA: β1 ≠ 0
• The test statistic is t = (b1 − β1)/sb1 = (1.54 − 0)/0.21 = 7.33
• With α = 0.05 and degrees of freedom v = n − 2 = 18,
the value of t from the table is 2.10
• Since tcalc > ttable, we reject the null hypothesis of no
linear relationship. Therefore Advertising affects Store
Traffic
Predicting the Dependent Variable
• How well does the model ŷi = b0 + b1xi predict?
• Error of prediction without the indep var is yi − ȳ
• Error of prediction with the indep var is yi − ŷi
• Thus, by using the indep var the error in prediction
reduces by (yi − ȳ) − (yi − ŷi) = (ŷi − ȳ)
• It can be shown that
Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²
Predicting the Dependent Variable
• Total variation (SST)= Explained variation (SSM) +
Unexplained variation (SSE)
• A measure of the model’s ability to predict is the
Coefficient of Determination (r2)
r² = (SST − SSE) / SST = SSM / SST
• For our example, r² = 0.74, i.e., 74% of the variation in Y
is accounted for by X
• r2 is the square of the correlation between X and Y
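The decomposition SST = SSM + SSE (and hence r² = SSM/SST) can be checked numerically on a small least-squares fit; the data below are invented for illustration:

```python
# Fit a least-squares line, then verify SST = SSM + SSE and r^2 = SSM/SST
x = [1, 2, 3, 4]
y = [2.0, 4.0, 5.0, 7.0]
n = len(x)
b1 = (n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)) / \
     (n * sum(a * a for a in x) - sum(x) ** 2)
b0 = sum(y) / n - b1 * sum(x) / n
yhat = [b0 + b1 * xi for xi in x]
ybar = sum(y) / n
sst = sum((yi - ybar) ** 2 for yi in y)                  # total variation
ssm = sum((yh - ybar) ** 2 for yh in yhat)               # explained variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))     # unexplained variation
r2 = ssm / sst
```

The decomposition holds exactly only for least-squares fits, which is why the fitted values are computed rather than invented.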
Multiple Regression
• Used when more than one indep variable affects
dependent variable
• General model: Y = β0 + β1X1 + ... + βnXn + ε
Where
Y: Dependent variable
X1, X2, ..., Xn: Independent variables
β1, β2, ..., βn: Coefficients of the n indep variables
β0: A constant (intercept)
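As a sketch of how the coefficients are estimated, the normal equations (XᵀX)b = Xᵀy can be solved for two predictors in plain Python; the data are generated without noise from known coefficients (b0=1, b1=2, b2=3) so the fit recovers them exactly:

```python
# Multiple regression y = b0 + b1*x1 + b2*x2 via the normal equations
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]    # exact, noise-free data
X = [[1.0, a, b] for a, b in zip(x1, x2)]          # design matrix with intercept

# Build X'X (3x3) and X'y (3x1)
XtX = [[sum(X[i][r] * X[i][c] for i in range(len(X))) for c in range(3)]
       for r in range(3)]
Xty = [sum(X[i][r] * y[i] for i in range(len(X))) for r in range(3)]

def solve3(A, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    A = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(3):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * p for a, p in zip(A[r], A[col])]
    return [A[i][3] / A[i][i] for i in range(3)]

b0, b1, b2 = solve3(XtX, Xty)
```

In practice a library routine (e.g. a least-squares solver) would replace the hand-rolled elimination; the sketch only makes the normal-equations idea concrete.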
Issues in Multiple Regression
• Which variables to include
• Is relationship between dep variables and each of
the indep variables linear?
• Is dep variable normally distributed for all values
of the indep variables?
• Is each of the indep variables normally
distributed (without regard to the dep var)?
• Are there interaction variables?
• Are indep variables themselves highly correlated?
Example 3
• Cataloger believes that age (AGE) and income
(INCOME) can predict amount spent in last 6
months (DOLLSPENT)
• The regression equation is
DOLLSPENT = 351.29 - 0.65 INCOME
+0.86 AGE
• What happens when income(age) increases?
• Are the coefficients significant?
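The income/age question can be read off the coefficients directly; a sketch with hypothetical input values:

```python
# Example 3's fitted equation; the income/age inputs below are hypothetical
def dollspent(income, age):
    return 351.29 - 0.65 * income + 0.86 * age

# Holding the other variable fixed, each unit increase in INCOME lowers
# the prediction by 0.65, and each unit increase in AGE raises it by 0.86
delta_income = dollspent(51, 40) - dollspent(50, 40)   # about -0.65
delta_age = dollspent(50, 41) - dollspent(50, 40)      # about +0.86
```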
Example 4
• Which customers are most likely to buy?
• Cataloger believes that ratio of total orders to total
pieces mailed is good measure of purchase
likelihood
• Call this ratio RESP
• Indep variables are
- TOTDOLL: total purchase dollars
- AVGORDR: average dollar order
- LASTBUY: # of months since last purchase
Example 4
• Analysis of Variance table
- How is total sum of squares split up?
- How do you get the various Deg of Freedom?
- How do you get/interpret R-square?
- How do you interpret the F statistic?
- What is the Adjusted R-square?
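On the last question: adjusted R-square penalizes R-square for the number of predictors k, so adding a useless variable can lower it. A sketch with hypothetical values (the R², n, and k below are not from the examples):

```python
# Adjusted R-square: R2_adj = 1 - (1 - R2) * (n - 1) / (n - k - 1)
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical: R2 = 0.54 from n = 30 observations and k = 3 predictors
r2_adj = adjusted_r2(0.54, 30, 3)   # a bit below 0.54
```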
Example 4
• Parameter estimates table
- What are the t-values corresp to the estimates?
- What are the p-values corresp to the estimates?
- Which variables are the most important?
- What are standardized estimates?
- What to do with non-significant variables?