Introduction to Regression Analysis
• We use sample data to
  • estimate a population mean ($\mu$) or a difference of means ($\mu_1 - \mu_2$)
  • estimate a population proportion ($p$) or a difference of proportions ($p_1 - p_2$)
  • test hypotheses about $\mu$ or ($\mu_1 - \mu_2$)
  • test hypotheses about $p$ or ($p_1 - p_2$).
• Now we want to use sample data to investigate the relationships among a group of variables and to create a mathematical model that can be used to predict the value of one of them in the future.
• The process of finding a mathematical model (an equation) that best fits the data is known as regression analysis.
• The variable to be predicted (or modeled), $y$, is called the dependent variable.
• The variables used to predict (or model) $y$ are called independent variables and are denoted by the symbols $x_1, x_2, x_3$, etc.
• General form of the probabilistic model in regression:
  $y = \mu_{y|x_1, x_2, \ldots, x_k} + \varepsilon = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$
  where
  $y$ = dependent variable
  $\mu_{y|x_1, x_2, \ldots, x_k}$ = mean or expected value of $y$, the deterministic component
  $\varepsilon$ = unexplainable, or random error, component
• Estimation/prediction equation:
  $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$

Form of The Simple Linear Regression Model
$y = \mu_{y|x} + \varepsilon = \beta_0 + \beta_1 x + \varepsilon$
• $\mu_{y|x} = \beta_0 + \beta_1 x$ is the mean value of the dependent variable $y$ when the value of the independent variable is $x$
• $\beta_0$ is the y-intercept, the mean of $y$ when $x$ is 0 (an interpretation that applies only when the sample includes values of $x$ near 0)
• $\beta_1$ is the slope, the change in the mean of $y$ per unit change in $x$ (over the range of sample x-values)
• $\varepsilon$ is an error term that describes the effect on $y$ of all factors other than $x$

The Simple Linear Regression Model Illustrated
[Figure: illustration of the simple linear regression model]

Regression Terms
• $\beta_0$ and $\beta_1$ are called regression parameters
• $\beta_0$ is the y-intercept and $\beta_1$ is the slope
• We do not know the true values of these parameters
• So, we must use sample data to estimate them
• $b_0$ is the estimate of $\beta_0$ and $b_1$ is the estimate of $\beta_1$

The Least Squares Point Estimates
Estimation/prediction equation: $\hat{y} = b_0 + b_1 x$
Slope: $b_1 = \dfrac{SS_{xy}}{SS_{xx}}$
y-intercept: $b_0 = \bar{y} - b_1 \bar{x}$
where
$\bar{x} = \dfrac{\sum x_i}{n}$, $\bar{y} = \dfrac{\sum y_i}{n}$, $n$ = sample size
$SS_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n\bar{x}\bar{y}$
$SS_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2$
MS EXCEL:
=SLOPE(y range, x range)
=INTERCEPT(y range, x range)

An Estimator of $\sigma^2$
$s^2 = \dfrac{SSE}{n - 2}$
where
$SSE = \sum (y_i - \hat{y}_i)^2 = SS_{yy} - b_1 SS_{xy} = \sum y_i^2 - n\bar{y}^2 - b_1 SS_{xy}$
$n$ = sample size
$s$ = estimated standard deviation of the error = standard error of the estimate

A $100(1-\alpha)\%$ confidence interval for the simple linear regression slope $\beta_1$
$b_1 \pm t_{\alpha/2}\, s_{b_1}$, where $s_{b_1} = \dfrac{s}{\sqrt{SS_{xx}}}$
$t_{\alpha/2}$ is based on $(n-2)$ degrees of freedom

Testing the Significance of the Slope
Test statistic: $t = \dfrac{b_1}{s_{b_1}}$
One-Tailed Test
• $H_0: \beta_1 = 0$
• $H_a: \beta_1 < 0$ or $\beta_1 > 0$
• Rejection region: $t < -t_\alpha$ (for $H_a: \beta_1 < 0$) or $t > t_\alpha$ (for $H_a: \beta_1 > 0$), where $t_\alpha$ is based on $(n-2)$ degrees of freedom
Two-Tailed Test
• $H_0: \beta_1 = 0$
• $H_a: \beta_1 \neq 0$
• Rejection region: $|t| > t_{\alpha/2}$, where $t_{\alpha/2}$ is based on $(n-2)$ degrees of freedom

The $100(1-\alpha)\%$ confidence interval for the mean value of $y$ for $x = x_p$
$\hat{y} \pm t_{\alpha/2}\, s \sqrt{\dfrac{1}{n} + \dfrac{(x_p - \bar{x})^2}{SS_{xx}}}$
where $t_{\alpha/2}$ is based on $(n-2)$ degrees of freedom

The $100(1-\alpha)\%$ prediction interval for an individual $y$ for $x = x_p$
$\hat{y} \pm t_{\alpha/2}\, s \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_p - \bar{x})^2}{SS_{xx}}}$
where $t_{\alpha/2}$ is based on $(n-2)$ degrees of freedom

Simple Coefficient of Determination
$r^2 = \dfrac{\text{Explained Variation}}{\text{Total Variation}} = \dfrac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2}$
About $100(r^2)\%$ of the sample variation in $y$ can be explained by using $x$ to predict $y$ in the simple linear regression model.
[Figure: at a point $x_i$, the total variation ($y_i - \bar{y}$) splits into explained variation ($\hat{y}_i - \bar{y}$) and unexplained variation ($y_i - \hat{y}_i$)]

The coefficient of correlation
$r = \dfrac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}}$, where $SS_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - n\bar{y}^2$
• $r$ is used for the sample and $\rho$ (rho) for the population
• $-1 \le r \le 1$
• $r > 0$ means that $y$ tends to increase as $x$ increases
• $r < 0$ means that $y$ tends to decrease as $x$ increases
• $r \approx 0$ means little or no linear relationship between $y$ and $x$
• The closer $r$ is to 1 or -1, the stronger the linear relationship.
• High correlation does not imply causality. Only a linear trend may exist between $x$ and $y$.
• $r = +\sqrt{r^2}$ when $b_1 > 0$, or $r = -\sqrt{r^2}$ when $b_1 < 0$

Exercise
• What is the range of values that the coefficient of determination can assume? ___
• If the value of r is -0.96, what does this indicate about the dependent variable as the independent variable increases? ___
• If the correlation between sales and advertising is +0.6, what percent of the variation in sales can be attributed to advertising? ___
• What does the coefficient of determination equal if r = 0.89? ___

Exercise
• In the regression equation, what does the letter "b" represent?
• What is the null hypothesis to test the significance of the slope in a regression equation?
• The regression equation is Ŷ = 29.29 - 0.96X, the sample size is 8, and the standard error of the slope is 0.22. What is the test statistic to test the significance of the slope?

Exercise
• Page 488 no. 26
• Page 494 no. 31
• Page 500 no. 38
• Page 502 no. 46
• Page 506 no. 56
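
Worked sketch (not part of the original slides)
The formulas in this deck (the least squares point estimates, s, the slope test statistic, r, and the two interval half-widths) can be traced through a small computation. The following is a minimal sketch in Python, assuming a made-up five-point data set; the function names `fit_simple_regression` and `interval_half_widths` are hypothetical, and the critical value 3.182 is the tabled $t_{0.025}$ for 3 degrees of freedom, supplied by hand rather than computed.

```python
import math

def fit_simple_regression(x, y):
    """Least squares estimates for y-hat = b0 + b1*x, using the shortcut sums."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    ss_xx = sum(xi ** 2 for xi in x) - n * x_bar ** 2
    ss_yy = sum(yi ** 2 for yi in y) - n * y_bar ** 2
    b1 = ss_xy / ss_xx                      # slope
    b0 = y_bar - b1 * x_bar                 # y-intercept
    sse = ss_yy - b1 * ss_xy                # unexplained variation
    s = math.sqrt(sse / (n - 2))            # standard error of the estimate
    s_b1 = s / math.sqrt(ss_xx)             # standard error of the slope
    r = ss_xy / math.sqrt(ss_xx * ss_yy)    # coefficient of correlation
    return {
        "n": n, "x_bar": x_bar, "ss_xx": ss_xx,
        "b0": b0, "b1": b1, "sse": sse, "s": s,
        "s_b1": s_b1, "t": b1 / s_b1, "r": r, "r2": r ** 2,
    }

def interval_half_widths(fit, x_p, t_crit):
    """Half-widths of the confidence interval for the mean of y and of the
    prediction interval for an individual y at x = x_p.
    t_crit is t_{alpha/2} with n-2 degrees of freedom, read from a t table."""
    core = 1 / fit["n"] + (x_p - fit["x_bar"]) ** 2 / fit["ss_xx"]
    mean_hw = t_crit * fit["s"] * math.sqrt(core)
    pred_hw = t_crit * fit["s"] * math.sqrt(1 + core)
    return mean_hw, pred_hw

# Hypothetical data, chosen only so the arithmetic is easy to check by hand.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

fit = fit_simple_regression(x, y)
y_hat = fit["b0"] + fit["b1"] * 3           # point prediction at x_p = 3
mean_hw, pred_hw = interval_half_widths(fit, x_p=3, t_crit=3.182)

print(f"y-hat = {fit['b0']:.2f} + {fit['b1']:.2f} x")
print(f"s = {fit['s']:.4f}, s_b1 = {fit['s_b1']:.4f}, t = {fit['t']:.3f}")
print(f"r = {fit['r']:.4f}, r^2 = {fit['r2']:.2f}")
print(f"95% CI for mean y at x=3: {y_hat:.2f} +/- {mean_hw:.3f}")
print(f"95% PI for individual y at x=3: {y_hat:.2f} +/- {pred_hw:.3f}")
```

For these data the sums are $SS_{xy} = 6$, $SS_{xx} = 10$ and $SS_{yy} = 6$, so $b_1 = 0.6$, $b_0 = 2.2$, $SSE = 2.4$ and $r^2 = 0.6$. The prediction interval is wider than the confidence interval at the same $x_p$ because of the extra 1 under the square root.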