UNIT V

Regression: Relationship Between Two Variables

The variable being predicted is called the dependent variable. The variable or variables used to predict the value of the dependent variable are called the independent variables.

The simplest type of regression analysis involves one independent variable and one dependent variable, with the relationship between them approximated by a straight line. It is called simple linear regression. Regression analysis involving two or more independent variables is called multiple regression analysis.

Line of Regression of Y on X

$Y = a + bX$

Here X = independent variable, Y = dependent variable, a = intercept, b = slope of the line.

The regression equation of Y on X can also be expressed as

$(Y - \bar{Y}) = b_{yx}(X - \bar{X})$, with $b_{yx} = r\,\dfrac{\sigma_y}{\sigma_x}$

where
$\bar{X}$ = mean of the X series
$\bar{Y}$ = mean of the Y series
$\sigma_x$ = standard deviation of the X series
$\sigma_y$ = standard deviation of the Y series
$r$ = coefficient of correlation between the two variables X and Y

Calculation of $b_{yx}$:

$b_{yx} = \dfrac{n\sum X_i Y_i - (\sum X_i)(\sum Y_i)}{n\sum X_i^2 - (\sum X_i)^2}$

Line of Regression of X on Y

$X = c + dY$

Here Y = independent variable, X = dependent variable, c = intercept, d = slope of the line.

The regression equation of X on Y can also be expressed as

$(X - \bar{X}) = b_{xy}(Y - \bar{Y})$, with $b_{xy} = r\,\dfrac{\sigma_x}{\sigma_y}$

where $\bar{X}$, $\bar{Y}$, $\sigma_x$, $\sigma_y$ and $r$ are as defined for the line of Y on X.

Calculation of $b_{xy}$:

$b_{xy} = \dfrac{n\sum X_i Y_i - (\sum X_i)(\sum Y_i)}{n\sum Y_i^2 - (\sum Y_i)^2}$

Coefficient of Correlation

$r = \pm\sqrt{b_{xy} \cdot b_{yx}}$

r takes the same sign as the two regression coefficients.

Question: A panel of two judges, P and Q, graded seven dramatic performances by independently awarding marks as follows:

| Performance | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Marks by P | 46 | 42 | 44 | 40 | 43 | 41 | 45 |
| Marks by Q | 40 | 38 | 36 | 35 | 39 | 37 | 41 |

The eighth performance, which judge Q could not attend, was awarded 37 marks by judge P. If judge Q had also been present, how many marks would he be expected to have awarded to the eighth performance?
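The raw-sum formulas for $b_{yx}$, $b_{xy}$ and $r$ given earlier in this unit can be sketched in Python. This is a minimal illustration rather than part of the original notes: the function name `regression_coefficients` is hypothetical, and the marks from the question are used as input so that the result can be compared with the hand calculation in the solution.

```python
from math import sqrt


def regression_coefficients(xs, ys):
    """Return (b_yx, b_xy, r) using the raw-sum formulas from the notes."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    # Numerator shared by both regression coefficients.
    numerator = n * sum_xy - sum_x * sum_y
    b_yx = numerator / (n * sum_xx - sum_x ** 2)  # slope of Y on X
    b_xy = numerator / (n * sum_yy - sum_y ** 2)  # slope of X on Y

    # r is the square root of the product and carries the slopes' sign.
    r = sqrt(b_yx * b_xy)
    if numerator < 0:
        r = -r
    return b_yx, b_xy, r


# Marks from the question: X = judge P, Y = judge Q.
marks_p = [46, 42, 44, 40, 43, 41, 45]
marks_q = [40, 38, 36, 35, 39, 37, 41]

b_yx, b_xy, r = regression_coefficients(marks_p, marks_q)
print(f"b_yx = {b_yx:.2f}, b_xy = {b_xy:.2f}, r = {r:.2f}")
```

For this particular data set the two regression coefficients happen to be equal, so $r$ equals both of them.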
Solution: Let X denote the marks awarded by judge P and Y the marks awarded by judge Q. Since we have to estimate the marks that judge Q would have awarded, we fit a line of regression of Y on X to the given data.

| X | Y | $X^2$ | $Y^2$ | XY |
|---|---|---|---|---|
| 46 | 40 | 2116 | 1600 | 1840 |
| 42 | 38 | 1764 | 1444 | 1596 |
| 44 | 36 | 1936 | 1296 | 1584 |
| 40 | 35 | 1600 | 1225 | 1400 |
| 43 | 39 | 1849 | 1521 | 1677 |
| 41 | 37 | 1681 | 1369 | 1517 |
| 45 | 41 | 2025 | 1681 | 1845 |
| $\sum X = 301$ | $\sum Y = 266$ | $\sum X^2 = 12971$ | $\sum Y^2 = 10136$ | $\sum XY = 11459$ |

Mean of X = 301/7 = 43
Mean of Y = 266/7 = 38

Calculation of $b_{yx}$:

$b_{yx} = \dfrac{n\sum X_i Y_i - (\sum X_i)(\sum Y_i)}{n\sum X_i^2 - (\sum X_i)^2} = \dfrac{7 \times 11459 - 301 \times 266}{7 \times 12971 - 301^2} = \dfrac{147}{196} = 0.75$

Substituting into $(Y - \bar{Y}) = b_{yx}(X - \bar{X})$:

$(Y - 38) = 0.75\,(X - 43)$
$Y = 5.75 + 0.75X$

Estimate of Y when X = 37:

$Y = 5.75 + 0.75 \times 37 = 33.5$ marks

It is expected that judge Q would have awarded 33.5 marks to the eighth performance.

Question: Find the means of the X and Y variables and the coefficient of correlation between them from the following two regression equations:

$3Y - 2X - 10 = 0$
$2Y - X - 50 = 0$

Solution:

Means of X and Y: both regression lines pass through the point $(\bar{X}, \bar{Y})$, so the means are found by solving the two equations simultaneously. Solving gives mean of X = 130 and mean of Y = 90.

Correlation coefficient: Assume that the first equation is the regression of X on Y:

$X = -\dfrac{10}{2} + \dfrac{3}{2}Y$, so $b_{xy} = 3/2$

Assume that the second equation is the regression of Y on X:

$Y = \dfrac{50}{2} + \dfrac{1}{2}X$, so $b_{yx} = 1/2$

Then

$r^2 = b_{xy} \cdot b_{yx} = \dfrac{3}{2} \times \dfrac{1}{2} = \dfrac{3}{4}$
$r = \sqrt{3/4} = 0.87$

The assumption is consistent because the product does not exceed 1. The opposite assignment would give $\dfrac{2}{3} \times 2 = \dfrac{4}{3} > 1$, which is impossible for $r^2$.

Coefficient of determination: $r^2 = SSR/SST$; here $r^2 = 3/4$.

Question: Calculate SSE, SST and SSR for the following data on 10 restaurants:

| x | y |
|---|---|
| 2 | 58 |
| 6 | 105 |
| 8 | 88 |
| 8 | 118 |
| 12 | 117 |
| 16 | 137 |
| 20 | 157 |
| 20 | 169 |
| 22 | 149 |
| 26 | 202 |

Solution: First calculate the estimated regression equation of Y on X, then compute SSE and SST from it.

Using the Estimated Regression Equation for Estimation and Prediction

If a significant relationship exists between x and y, and the coefficient of determination shows that the fit is good, the estimated regression equation should be useful for estimation and prediction.
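As an illustration of checking the fit before using the equation, the following Python sketch fits the line of Y on X for the ten-restaurant data from the question above and computes SSE, SST, SSR and $r^2$. It is a sketch under stated assumptions, not the worked solution from the notes: the function names `fit_line` and `sums_of_squares` are hypothetical, the last three data pairs are assumed to belong to the same table, and the standard definitions $SSE = \sum (y_i - \hat{y}_i)^2$, $SST = \sum (y_i - \bar{y})^2$ and $SSR = SST - SSE$ are used because the notes name these quantities without spelling them out.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for the regression of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sums_of_squares(xs, ys):
    """Return (SSE, SST, SSR) for the fitted line of y on x."""
    intercept, slope = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    predicted = [intercept + slope * x for x in xs]
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, predicted))  # unexplained
    sst = sum((y - mean_y) ** 2 for y in ys)                        # total
    ssr = sst - sse                                                 # explained
    return sse, sst, ssr


# Ten-restaurant data from the question (assumed row order).
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

intercept, slope = fit_line(x, y)
sse, sst, ssr = sums_of_squares(x, y)
r_squared = ssr / sst

print(f"y_hat = {intercept:.0f} + {slope:.0f} x")
print(f"SSE = {sse:.0f}, SST = {sst:.0f}, SSR = {ssr:.0f}, r^2 = {r_squared:.4f}")
```

With this data the sketch gives the line $\hat{y} = 60 + 5x$, SSE = 1530, SST = 15730 and SSR = 14200, so $r^2$ is about 0.90 and the fit would be considered good.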
Point Estimation

We can use the estimated regression equation to develop a point estimate of the mean value of y for a particular value of x, or to predict an individual value of y corresponding to a given value of x.

Interval Estimation

Point estimates do not provide any information about the precision associated with an estimate. For that we must develop interval estimates.

The first type of interval estimate, a confidence interval, is an interval estimate of the mean value of y for a given value of x. The second type, a prediction interval, is used whenever we want an interval estimate of an individual value of y for a given value of x.

The point estimate of the mean value of y is the same as the point estimate of an individual value of y, but the interval estimates for the two cases are different. The margin of error is larger for a prediction interval.

Prediction Interval for an Individual Value of y

Residual Plot Against x

A residual plot against the independent variable x is a graph in which the values of the independent variable are represented on the horizontal axis and the corresponding residual values on the vertical axis. A point is plotted for each residual. The first coordinate of each point is the value of $x_i$ and the second coordinate is the corresponding value of the residual $y_i - \hat{y}_i$.

Residual Plot Against $\hat{y}$

Another residual plot represents the predicted value of the dependent variable on the horizontal axis and the residual values on the vertical axis. A point is plotted for each residual. The first coordinate of each point is $\hat{y}_i$ and the second coordinate is the corresponding value of the i-th residual $y_i - \hat{y}_i$.
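The interval estimates and residual plots described above can be sketched numerically in Python, reusing the ten-restaurant data from the earlier question. Several things here are assumptions not stated in the notes: the margins use the standard textbook formulas $t_{\alpha/2}\, s \sqrt{1/n + (x^* - \bar{x})^2 / \sum (x_i - \bar{x})^2}$ for the confidence interval and the same expression with an extra 1 under the square root for the prediction interval, the 95% level and the table value $t_{0.025} = 2.306$ for 8 degrees of freedom are chosen for illustration, and the point `x_star = 10` and all variable names are hypothetical.

```python
from math import sqrt

# Ten-restaurant data from the earlier question (assumed row order).
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]
n = len(x)

# Fit the line of y on x by least squares.
mean_x, mean_y = sum(x) / n, sum(y) / n
s_xx = sum((xi - mean_x) ** 2 for xi in x)
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / s_xx
intercept = mean_y - slope * mean_x

# Residuals y_i - y_hat_i and the points of the two residual plots.
predicted = [intercept + slope * xi for xi in x]
residuals = [yi - y_hat for yi, y_hat in zip(y, predicted)]
points_vs_x = list(zip(x, residuals))              # residual plot against x
points_vs_y_hat = list(zip(predicted, residuals))  # residual plot against y_hat

# Standard error of the estimate, with n - 2 degrees of freedom.
s = sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Assumed: 95% level, t-table value for n - 2 = 8 degrees of freedom.
T_CRIT = 2.306

x_star = 10  # hypothetical value of x at which to estimate y
y_hat_star = intercept + slope * x_star
spread = 1 / n + (x_star - mean_x) ** 2 / s_xx

ci_margin = T_CRIT * s * sqrt(spread)      # mean value of y at x_star
pi_margin = T_CRIT * s * sqrt(1 + spread)  # individual value of y at x_star

print(f"point estimate: {y_hat_star:.1f}")
print(f"confidence interval margin: {ci_margin:.2f}")
print(f"prediction interval margin: {pi_margin:.2f}")
```

Both intervals are centred on the same point estimate, roughly 110 here, but the prediction margin (about 33.9) is much wider than the confidence margin (about 11.4), matching the statement that the margin of error is larger for a prediction interval. Plotting `points_vs_x` or `points_vs_y_hat` with any charting tool gives the two residual plots.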