Multiple Regression

Simple Regression in detail

$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

Where
• $Y$: Dependent variable
• $X$: Independent variable
• $\beta_0$: Model parameter; the mean value of the dependent variable ($Y$) when the independent variable ($X$) is zero
• $\beta_1$: Model parameter; the slope that measures the change in the mean value of the dependent variable associated with a one-unit increase in the independent variable
• $\varepsilon_i$: Error term that describes the effects on $Y_i$ of all factors other than the value of $x_i$

Assumptions of the Regression Model
• The error term is normally distributed (normality assumption)
• The mean of the error term is zero ($E\{\varepsilon_i\} = 0$)
• The variance of the error term is constant and independent of the values of $X$ (constant variance assumption)
• Error terms are independent of each other (independence assumption)
• Values of the independent variable $X$ are fixed: there is no error in the $X$ values

Estimating the Model Parameters
• Calculate point estimates $b_0$ and $b_1$ of the unknown parameters $\beta_0$ and $\beta_1$
• Obtain a random sample and use the information from the sample to estimate $\beta_0$ and $\beta_1$
• Obtain a line of best "fit" for the sample data points, the least squares line

$$\hat{Y}_i = b_0 + b_1 x_i$$

where $\hat{Y}_i$ is the predicted value of $Y$

Values of Least Squares Estimates $b_0$ and $b_1$

$$b_1 = \frac{n \sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{n \sum x_i^2 - \left(\sum x_i\right)^2}$$

$$b_0 = \bar{y} - b_1 \bar{x}$$

where $\bar{y} = \frac{\sum y_i}{n}$ and $\bar{x} = \frac{\sum x_i}{n}$

• $b_0$ and $b_1$ vary from sample to sample. This variation is given by their standard errors $s_{b_0}$ and $s_{b_1}$

Example 1
• To see the relationship between Advertising and Store Traffic
• Store Traffic is the dependent variable and Advertising is the independent variable
• Using the formulae, we find that $b_0 = 148.64$ and $b_1 = 1.54$
• Are $b_0$ and $b_1$ significant?
• What is Store Traffic when Advertising is 600?
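The least squares formulas for $b_0$ and $b_1$ can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names `least_squares` and `predict` are hypothetical, and the four sample points are made up so that they lie exactly on the line y = 2 + 3x, which lets the result be checked by eye. The last line plugs the Example 1 estimates into the fitted line to answer the question about Advertising = 600.

```python
def least_squares(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are identical, so the slope is undefined")
    # Slope: b1 = [n*sum(xy) - sum(x)*sum(y)] / [n*sum(x^2) - (sum(x))^2]
    b1 = (n * sum_xy - sum_x * sum_y) / denom
    # Intercept: b0 = y-bar - b1 * x-bar
    b0 = sum_y / n - b1 * (sum_x / n)
    return b0, b1


def predict(b0, b1, x):
    """Predicted value of Y at x from the fitted line."""
    return b0 + b1 * x


# Made-up points lying exactly on y = 2 + 3x (illustrative only).
xs = [1, 2, 3, 4]
ys = [5, 8, 11, 14]
b0, b1 = least_squares(xs, ys)  # recovers intercept 2 and slope 3

# Example 1 estimates from the slides: b0 = 148.64, b1 = 1.54.
traffic_at_600 = predict(148.64, 1.54, 600)  # about 1072.64
```

With the Example 1 estimates, the predicted Store Traffic at Advertising = 600 is 148.64 + 1.54 × 600, or roughly 1072.64. Whether that prediction is trustworthy depends on the significance of the coefficients.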
Example 2
• Consider the following data

Sales (Y)    Advertising (X)
3            7
8            13
17           13
4            11
15           16
7            6

• Using the formulae, we find that $b_0 = -2.55$ and $b_1 = 1.05$

Therefore the regression model would be

$$\hat{Y}_i = -2.55 + 1.05 x_i$$

$$r^2 = (0.74)^2 \approx 0.55$$

(variance in sales ($Y$) explained by advertising ($X$))

Assume that $s_{b_0}$ (the standard error of $b_0$) is 0.51 and $s_{b_1} = 0.26$. At $\alpha = 0.05$ and df = 4, is $b_0$ significant? Is $b_1$ significant?

Idea behind Estimation: Residuals
• The differences between the actual and predicted values are called residuals
• Residuals are estimates of the error in the population

$$e_i = y_i - \hat{y}_i = y_i - (b_0 + b_1 x_i)$$

Quantities in hats are predicted quantities
• $b_0$ and $b_1$ minimize the residual or error sum of squares (SSE)

$$SSE = \sum e_i^2 = \sum (y_i - \hat{y}_i)^2 = \sum \left[y_i - (b_0 + b_1 x_i)\right]^2$$

Testing the Significance of the Independent Variables
• Null Hypothesis
- There is no linear relationship between the independent and dependent variables
• Alternative Hypothesis
- There is a linear relationship between the independent and dependent variables
• Test Statistic

$$t = \frac{b_1 - \beta_1}{s_{b_1}}$$

• Degrees of Freedom: $v = n - 2$
• Testing for a Type II Error: $H_0: \beta_1 = 0$, $H_1: \beta_1 \neq 0$
• Decision Rule: reject $H_0: \beta_1 = 0$ if $\alpha >$ p-value

Significance Test for Store Traffic Example
• Null hypothesis, $H_0: \beta_1 = 0$
• Alternative hypothesis, $H_A: \beta_1 \neq 0$
• The test statistic is

$$t = \frac{b_1 - \beta_1}{s_{b_1}} = \frac{1.54 - 0}{0.21} = 7.33$$

• With $\alpha$ as 0.05 and with degrees of freedom $v = n - 2 = 18$, the value of $t$ from the table is 2.10
• Since $t_{calc} > t_{table}$, we reject the null hypothesis of no linear relationship. Therefore Advertising affects Store Traffic

Predicting the Dependent Variable
• How well does the model $\hat{y}_i = b_0 + b_1 x_i$ predict?
• Error of prediction without the independent variable is $y_i - \bar{y}$
• Error of prediction with the independent variable is $y_i - \hat{y}_i$
• Thus, by using the independent variable, the error in prediction reduces by $(y_i - \bar{y}) - (y_i - \hat{y}_i) = (\hat{y}_i - \bar{y})$
• It can be shown that

$$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$$

• Total variation (SST) = Explained variation (SSM) + Unexplained variation (SSE)
• A measure of the model's ability to predict is the Coefficient of Determination ($r^2$)

$$r^2 = \frac{SST - SSE}{SST} = \frac{SSM}{SST}$$

• For our example, $r^2 = 0.74$, i.e., 74% of the variation in $Y$ is accounted for by $X$
• $r^2$ is the square of the correlation between $X$ and $Y$

Multiple Regression
• Used when more than one independent variable affects the dependent variable
• General model

$$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n$$

Where
$Y$: Dependent variable
$X_1, X_2, \dots, X_n$: Independent variables
$\beta_1, \beta_2, \dots, \beta_n$: Coefficients of the $n$ independent variables
$\beta_0$: A constant (intercept)

Issues in Multiple Regression
• Which variables to include?
• Is the relationship between the dependent variable and each of the independent variables linear?
• Is the dependent variable normally distributed for all values of the independent variables?
• Is each of the independent variables normally distributed (without regard to the dependent variable)?
• Are there interaction variables?
• Are the independent variables themselves highly correlated?

Example 3
• A cataloger believes that age (AGE) and income (INCOME) can predict the amount spent in the last 6 months (DOLLSPENT)
• The regression equation is
DOLLSPENT = 351.29 - 0.65 INCOME + 0.86 AGE
• What happens when income (age) increases?
• Are the coefficients significant?

Example 4
• Which customers are most likely to buy?
• The cataloger believes that the ratio of total orders to total pieces mailed is a good measure of purchase likelihood
• Call this ratio RESP
• Independent variables are
- TOTDOLL: total purchase dollars
- AVGORDR: average dollar order
- LASTBUY: # of months since last purchase
• Analysis of Variance table
- How is the total sum of squares split up?
- How do you get the various degrees of freedom?
- How do you get/interpret R-square?
- How do you interpret the F statistic?
- What is the Adjusted R-square?
• Parameter estimates table
- What are the t-values corresponding to the estimates?
- What are the p-values corresponding to the estimates?
- Which variables are the most important?
- What are standardized estimates?
- What to do with non-significant variables?
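The Analysis of Variance and parameter estimate questions in Example 4 can be made concrete with a small sketch. The helper names `anova_summary` and `t_value` are hypothetical, and the sums of squares passed in below (SST = 100, SSE = 40, n = 24 observations, k = 3 independent variables) are made-up round numbers, not the Example 4 output, which the slides do not reproduce. The sketch assumes a model with an intercept, so the degrees of freedom are k for the model, n - k - 1 for error, and n - 1 in total.

```python
def anova_summary(sst, sse, n, k):
    """Summarize a regression ANOVA table.

    sst: total sum of squares, sse: error sum of squares,
    n: number of observations, k: number of independent variables.
    Assumes the model includes an intercept.
    """
    if sst <= 0 or not (0 <= sse <= sst):
        raise ValueError("need sst > 0 and 0 <= sse <= sst")
    if k < 1 or n - k - 1 <= 0:
        raise ValueError("need k >= 1 and n > k + 1")
    ssm = sst - sse                # SST = SSM + SSE
    df_model = k
    df_error = n - k - 1
    df_total = n - 1
    msm = ssm / df_model           # mean square for the model
    mse = sse / df_error           # mean square error
    r2 = ssm / sst                 # share of variation explained
    # Adjusted R-square penalizes extra independent variables.
    adj_r2 = 1 - (sse / df_error) / (sst / df_total)
    f = msm / mse if mse > 0 else float("inf")
    return {
        "ssm": ssm, "df_model": df_model, "df_error": df_error,
        "df_total": df_total, "msm": msm, "mse": mse,
        "r2": r2, "adj_r2": adj_r2, "f": f,
    }


def t_value(estimate, std_error, hypothesized=0.0):
    """t statistic for one coefficient: (b - beta) / s_b."""
    return (estimate - hypothesized) / std_error


# Made-up sums of squares for illustration (not the Example 4 output).
table = anova_summary(sst=100.0, sse=40.0, n=24, k=3)

# Store Traffic example from the slides: b1 = 1.54, s_b1 = 0.21.
t_b1 = t_value(1.54, 0.21)
```

For these illustrative inputs, the model explains 60 of the 100 units of total variation, so R-square is 0.6, while the adjusted R-square is lower (0.54) because it charges for the three independent variables. The F statistic compares the model mean square with the error mean square, and each t-value is a coefficient divided by its standard error, as in the Store Traffic test.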