STAT 211
Handout 10 (Chapter 12): Simple Linear Regression and Correlation



Is there a relationship between two numeric random variables?
Can we describe this relationship with a model?
Can we use this model to predict future values?
Example 1: Infestation of crops by insects has long been of great concern to farmers and agricultural scientists. A certain article reports normally distributed data on x = age of cotton plant (days) and y = % damaged squares. The data follow.
X:  9  12  12  15  18  18  21  21  27  30  30  33
Y: 11  12  23  30  29  52  41  65  60  72  84  93
How do we demonstrate the relationship between age of the cotton plant and % damaged
squares?
Scatterplots: Recall that in the graph, it will not matter which you choose as X or Y.
Scatter plot of x versus y using MINITAB:
[Figure: scatterplot of Y (% damaged squares) against X (age in days); the points show a strong increasing linear pattern.]
Correlations
A) PEARSON’S CORRELATION COEFFICIENT
-measures the strength & direction of the linear relationship between X and Y
-X and Y must be numerical variables
The formula looks like:

$$r = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{(n-1)\, s_x s_y} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2 \,\sum_{i=1}^{n}(Y_i - \bar{Y})^2}}$$
where sx and sy are the standard deviations for x and y. Notice we are looking at how far
each point deviates from the average X and Y value.
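MINITAB does the computation in this course, but the formula is easy to check by hand. Here is a minimal Python/NumPy sketch (an illustration, not part of the original handout) applying it to the Example 1 data:

import numpy as np

# Example 1 data: x = age of cotton plant (days), y = % damaged squares
x = np.array([9, 12, 12, 15, 18, 18, 21, 21, 27, 30, 30, 33], dtype=float)
y = np.array([11, 12, 23, 30, 29, 52, 41, 65, 60, 72, 84, 93], dtype=float)

n = len(x)
# Numerator: sum of cross-deviations from the means
num = np.sum((x - x.mean()) * (y - y.mean()))
# Denominator: (n-1) * sx * sy, with sample standard deviations (ddof=1)
den = (n - 1) * x.std(ddof=1) * y.std(ddof=1)
r = num / den
print(round(r, 3))  # ~ 0.949, matching the MINITAB output later in the handout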
Properties of Pearson’s Correlation Coefficient
- r < 0 implies a negative relationship, r > 0 implies a positive relationship, r = 0 no apparent relationship.
- −1 ≤ r ≤ 1
- r = 1 or −1 happens only when the points lie on an exact line.
- |r| ≥ 0.8 implies a strong relationship, 0.5 < |r| < 0.8 a moderate relationship, and |r| ≤ 0.5 a weak one.
Note: It is pretty hard to tell the difference in graphs with correlations under 0.5! However, the larger the sample size, the easier it is to see the correlation.
CAUTION: A weak correlation does not mean the points are just scattered noise. You could have a curved shape (e.g., a parabola) that fits the data perfectly, yet it will have a very weak Pearson's correlation coefficient.
- The correlation is the same no matter which variable is designated X or Y. For example, the correlation between your grade and the hours you study for the class is the same as the correlation between the hours you study and your grade.
- The value of r does not depend upon the units of measurement.
Using the MINITAB program the following is computed for example 1.
Pearson correlation of X and Y = 0.949
P-Value = 0.000
Note the Pearson's correlation between x and y is 0.949. The P-value = 0.000 is for testing
$H_0: \rho_{XY} = 0$ (the true correlation is zero; there is no linear relationship between x and y) versus
$H_a: \rho_{XY} \neq 0$ (the true correlation is not zero; there is a linear relationship between x and y).
r n2
The formal test has the test statistics t 
=9.5186 with n-2=10 degrees of freedom. Since
1 r2
this is a two tailed test, P-value=2P(t>9.5186)
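As a sketch (assuming SciPy is available; not part of the original handout), the same test statistic and two-tailed P-value:

import numpy as np
from scipy import stats

r, n = 0.949, 12
t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)   # ~ 9.52
p_value = 2 * stats.t.sf(t, df=n - 2)          # two-tailed: 2*P(T > t)
print(round(t, 2), round(p_value, 6))          # P-value ~ 0.000

# stats.pearsonr(x, y) on the raw data returns (r, two-tailed P-value) directly.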
The correlation for x vs. x and y vs. y is 1. Why?
IMPORTANT !!!! Just because two variables are highly correlated does NOT mean that one
causes the other!!!!
B) SPEARMAN'S RANK CORRELATION (rs)
Recall, Pearson's r is calculated with means and thus would be affected by outliers. Spearman's method provides us with an alternative that is robust to outliers. Spearman's can identify both linear and nonlinear monotonic relationships. The actual observed data are not used – the ranks are. That is, replace the smallest X with a 1, the next with a 2, and so on. The same is done for Y. For example:

for (12.3, 2.7) (10.4, 3.2) (13.2, 3.0) you would use (2, 1) (1, 3) (3, 2)

Then the Pearson formula is applied to these ranks. Thus $r_s$ is interpreted just like r: it takes values between −1 and 1, with values near ±1 indicating a strong relationship and values near 0 a weak one.
Using the MINITAB program the following is computed for example 1.
Spearman’s correlation of x and y = 0.958
P-Value = 0.000
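A minimal sketch (again assuming SciPy; not part of the handout) showing that ranking first and then applying Pearson's formula is exactly Spearman's rs:

import numpy as np
from scipy import stats

x = np.array([9, 12, 12, 15, 18, 18, 21, 21, 27, 30, 30, 33], dtype=float)
y = np.array([11, 12, 23, 30, 29, 52, 41, 65, 60, 72, 84, 93], dtype=float)

# Rank each variable separately (ties receive averaged ranks) ...
rank_x, rank_y = stats.rankdata(x), stats.rankdata(y)
# ... then apply Pearson's formula to the ranks
r_s = stats.pearsonr(rank_x, rank_y)[0]

rho, p = stats.spearmanr(x, y)       # the direct route
print(round(r_s, 3), round(rho, 3))  # identical; ~ 0.958 as in the output above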
Purposes for fitting equations to data sets:
- To summarize and condense a data set in order to obtain predictive formulas
- To reject or confirm a proposed mathematical relation
- To assist in the search for a mathematical relation
- To perform a quantitative comparison of two or more data sets
Regression analysis is a general approach for obtaining a prediction function from sample data. We work with a dependent variable (Y; also called the response or endogenous variable), independent variables (X's; also called predictor or exogenous variables), and the predicted value for a given level of X, $\hat{y}$. The method of least squares finds the particular line for which the aggregate deviation of the data points above or below it is minimized.
Least Squares Regression Line: slope & intercept
Now if we have an idea that there is some sort of relationship, we are interested in some way to
summarize that relationship. We want to find a line that “best fits” the points – least squares
regression line - so that we can predict values for Y if we know X. The “best fit” line is the line
that in some sense is closest to all of the data points simultaneously.
The vertical distances from each point to the line are drawn. These are the residuals, denoted by e. If the point is above the line, this distance is positive; below, it is negative. If we added up all of the e's, we would get zero (we will check this in lecture). If we add up all of the squares of the residuals, we get a measure of how far away from the data our line is. The "best" line will be the one with the minimum sum of the squares – thus it is called the "Least Squares Regression Line".
Simple Linear Regression: $Y = \beta_0 + \beta_1 x + e$, where n observations are included and the parameters $\beta_0$ and $\beta_1$ are constants whose "true" values are unknown and must be estimated from the data. The model assumes:
- The variables Y and x are theoretically related to one another by the equation of the straight line.
- The data set is typical of the behavior of the process under study.
- The $Y_i$, i = 1, 2, …, n, are pairwise statistically independent of one another.
- The $Y_i$ are random variables possessing the same variance, $\sigma^2$.
- For each data point, there are no outliers that have arisen under unusual, accidental, or careless circumstances.
- The uncontrolled random error e associated with Y is normally and independently distributed with mean 0 and constant variance $\sigma^2$.
Obtaining the best estimates for $\beta_0$ (intercept) and $\beta_1$ (slope):
- The estimate of $\varepsilon_i$ (the residual): $e_i = y_i - \hat{y}_i = y_i - (b_0 + b_1 x_i)$
- Minimize $\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$ (i.e., take first derivatives with respect to $b_0$ and $b_1$ and set them to zero)
- The resulting equations are called the normal equations:

$$\sum_{i=1}^{n} y_i = n\,b_0 + b_1 \sum_{i=1}^{n} x_i \qquad \sum_{i=1}^{n} x_i y_i = b_0 \sum_{i=1}^{n} x_i + b_1 \sum_{i=1}^{n} x_i^2$$

Solving them gives

$$b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2}, \qquad \text{then} \qquad b_0 = \bar{y} - b_1\,\bar{x}$$

- The estimated linear regression equation: $\hat{y} = b_0 + b_1 x$
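A short Python sketch (an illustration, not part of the handout) computing b1 and b0 from these formulas for Example 1, and checking that the residuals sum to zero:

import numpy as np

x = np.array([9, 12, 12, 15, 18, 18, 21, 21, 27, 30, 30, 33], dtype=float)
y = np.array([11, 12, 23, 30, 29, 52, 41, 65, 60, 72, 84, 93], dtype=float)

xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)  # slope
b0 = ybar - b1 * xbar                                           # intercept
print(round(b1, 4), round(b0, 3))  # 3.2847  -19.67

# Residuals from the least squares line always add up to zero
e = y - (b0 + b1 * x)
print(round(e.sum(), 10))  # 0.0 up to floating-point error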
If you write y as a function of x (regress y on x), the following output is obtained for example 1.
The regression equation is
Y = -19.7 + 3.2847 X

Predictor     Coef  SE Coef      T      P
Constant   -19.670    7.524  -2.61  0.026
X           3.2847   0.3440   9.55  0.000

S = 9.094    R-Sq = 90.1%    R-Sq(adj) = 89.1%

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       1  7541.7  7541.7  91.19  0.000
Residual Error  10   827.0    82.7
Total           11  8368.7
The slope and intercept for your equation are found under the Coef column. Which is the slope and which is the intercept? What is the least squares regression line in this example?
Minitab gives you the following output for simple linear regression
Predictor    Coef   SE Coef       T                P
Constant     b0     $s_{b_0}$     $b_0/s_{b_0}$    p-value
x            b1     $s_{b_1}$     $b_1/s_{b_1}$    p-value

where

$$s_{b_0} = s\,\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}} \qquad \text{and} \qquad s_{b_1} = \frac{s}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$

Analysis of Variance
Source          DF    SS    MS    F        P
Regression      1     SSR   MSR   MSR/MSE  p-value
Residual Error  n-2   SSE   MSE
Total           n-1   SST

The fitted value $\hat{y} = -19.67 + 3.2847x$ has two interpretations:
(i) It is a point estimate of the mean value of y when x is given.
(ii) It is a point prediction of an individual y to be observed when x is given.
Estimate of the average y when x = 11: $-19.67 + 3.2847(11) = 16.46$
Predicted y when x = 11: $-19.67 + 3.2847(11) = 16.46$
An estimate of the average y when x = 2 should not be computed using the equation, and neither should a predicted y when x = 2: x = 2 lies outside the range of the observed data, so using the line would be extrapolation.
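As a sketch, a tiny helper (hypothetical, not MINITAB functionality) that applies the fitted equation and refuses to extrapolate outside the observed x-range of 9 to 33 days:

def predict_damage(age, b0=-19.67, b1=3.2847, x_min=9.0, x_max=33.0):
    """Fitted value b0 + b1*x for Example 1; guards against extrapolation."""
    if not (x_min <= age <= x_max):
        raise ValueError("x outside the observed data range: extrapolation")
    return b0 + b1 * age

print(round(predict_damage(11), 2))  # 16.46, as computed above
# predict_damage(2) raises ValueError: x = 2 is outside [9, 33]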
SSE = 827 is the unexplained variation: the part of y's variation that cannot be attributed to the approximate linear relationship with x.
SST = 8368.7 is the total variation: the sum of squared deviations of y from the sample mean of y.
Coefficient of Determination, R²: measures what percent of Y's variation is explained by the X variables via the regression model. It tells us the proportion of SST that is explained by the fitted equation; SSE/SST is the proportion of SST that is not explained by the model.

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$
Only in simple linear regression, $R^2 = r^2$ where r is Pearson's correlation coefficient.
$$\text{Adjusted } R^2 = 1 - \frac{n-1}{n-2}\cdot\frac{SSE}{SST} = \frac{(n-1)R^2 - 1}{n-2}$$

Both will always be between 0 and 1, indicating:
(i) a strong linear relationship between X and Y if it is close to 1, and
(ii) a very weak relationship between X and Y if it is close to 0.
(iii) It is 0 (no linear relationship) when SSE = SST.
90.1% of Y's variation is explained by the X variable via the regression model in the example above. Notice that $R^2 = r^2 = (0.949)^2$ only when you have one X and one Y variable in your regression model (simple linear regression).
Even though the estimated variance and the coefficient of determination are given in your ANOVA table, here is how they are calculated:

$$\hat{\sigma}^2 = s^2 = SSE/(n-2) = 827/10 = 82.7 \qquad \text{(MSE)}$$

$$\hat{\sigma} = s = \sqrt{82.7} = 9.094 \qquad \text{(Root MSE)}$$

$$r^2 = 1 - \frac{SSE}{SST} = 1 - \frac{827}{8368.7} = 0.901 \qquad \text{(coefficient of determination)}$$
Approximately 90.1% of the observed variation in % damaged squares (y) can be attributed to
the probabilistic linear relationship with age of cotton plant (x).
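These quantities are just arithmetic on the ANOVA table; a quick Python sketch (not part of the handout) confirms them:

SSE, SST, n = 827.0, 8368.7, 12

mse = SSE / (n - 2)                         # estimated variance, 82.7
root_mse = mse ** 0.5                       # estimated sigma, ~ 9.094
r_sq = 1 - SSE / SST                        # ~ 0.901
adj_r_sq = ((n - 1) * r_sq - 1) / (n - 2)   # ~ 0.891, matching R-Sq(adj)

print(mse, round(root_mse, 3), round(r_sq, 3), round(adj_r_sq, 3))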
The following residuals ($e = y - \hat{y}$) and predicted or fitted values ($\hat{y}$) were computed by Minitab. If you compute them by hand, you will get slightly different numbers because of rounding. My suggestion is to use these numbers as given on the output unless otherwise asked.
Obs     X      Y    Fit  SE Fit  Residual  St Resid
  1   9.0  11.00   9.89    4.75      1.11      0.14
  2  12.0  12.00  19.75    3.93     -7.75     -0.94
  3  12.0  23.00  19.75    3.93      3.25      0.40
  4  15.0  30.00  29.60    3.24      0.40      0.05
  5  18.0  29.00  39.45    2.76    -10.45     -1.21
  6  18.0  52.00  39.45    2.76     12.55      1.45
  7  21.0  41.00  49.31    2.63     -8.31     -0.95
  8  21.0  65.00  49.31    2.63     15.69      1.80
  9  27.0  60.00  69.02    3.45     -9.02     -1.07
 10  30.0  72.00  78.87    4.19     -6.87     -0.85
 11  30.0  84.00  78.87    4.19      5.13      0.64
 12  33.0  93.00  88.73    5.04      4.27      0.56
Notice that some of the residuals are positive and others are negative; they all add up to zero. The intercept and the slope have been calculated for this data. Since the estimated slope is a number other than zero, we may think x and y are linearly related. Is it true in the population?
$H_0: \beta_1 = 0$ (no linear relation between x and y)
$H_a: \beta_1 \neq 0$ (x and y are linearly related)

Test statistic: $t = \dfrac{b_1 - 0}{s_{b_1}} = \dfrac{3.2847 - 0}{0.344} = 9.55 > t_{\alpha/2;\,df} = t_{0.025;\,10} = 2.228$. $H_0$ is rejected and the data support a linear relationship between x and y.
Note that $b_1$ is the estimated slope, $s_{b_1}$ is the standard error for $b_1$, and 5% significance is used with the error degrees of freedom. The formal test for comparing the slope with any value $\beta_{10}$ would be the following:

$H_0: \beta_1 = \beta_{10}$
$H_a: \beta_1 \neq \beta_{10}$

Test statistic: $t = \dfrac{b_1 - \beta_{10}}{s_{b_1}}$, where $\beta_{10}$ is the value the slope is compared with.

Decision making can be done using the error degrees of freedom, as for any other t test we have discussed before (either the P-value or the rejection region method).
The 100(1−α)% confidence interval for $\beta_1$ is $b_1 \pm t_{\alpha/2;\,df}\, s_{b_1}$.
For our example, the 95% confidence interval for $\beta_1$ is $3.2847 \pm 2.228(0.344) = (2.518,\ 4.051)$.
You can also test $H_a: \beta_1 > \beta_{10}$ or $H_a: \beta_1 < \beta_{10}$.
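A sketch (assuming SciPy; not part of the handout) reproducing the slope's standard error, the t statistic, and the 95% confidence interval:

import numpy as np
from scipy import stats

x = np.array([9, 12, 12, 15, 18, 18, 21, 21, 27, 30, 30, 33], dtype=float)
y = np.array([11, 12, 23, 30, 29, 52, 41, 65, 60, 72, 84, 93], dtype=float)
n = len(x)

fit = stats.linregress(x, y)
b1, s_b1 = fit.slope, fit.stderr       # ~ 3.2847 and ~ 0.344
# s_b1 is s / sqrt(sum((x - xbar)^2)), as in the table of formulas above

t = (b1 - 0) / s_b1                    # ~ 9.55, testing H0: beta1 = 0
t_crit = stats.t.ppf(0.975, df=n - 2)  # ~ 2.228 for df = 10

ci = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)
print(round(t, 2), tuple(round(v, 3) for v in ci))  # 9.55 (2.518, 4.051)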
[Figure: plot of the residuals (RESI1) against X for Example 1.]
Example 2: Find the best predicted value for the number of viewers (in millions) given that the salary (in millions of dollars) of a television star is $16 million. How does the predicted value compare to the actual number of viewers, which was 24 million?
The regression equation is
Viewers = 6.76 - 0.0111 Salary
Predictor      Coef  SE Coef      T      P
Constant      6.760    1.459   4.63  0.004
Salary     -0.01106  0.03791  -0.29  0.780

S = 3.266    R-Sq = 1.4%    R-Sq(adj) = 0.0%

Analysis of Variance
Source          DF     SS     MS     F      P
Regression       1   0.91   0.91  0.09  0.780
Residual Error   6  63.99  10.67
Total            7  64.90

Obs  Salary  Viewers   Fit  SE Fit  Residual  St Resid
  1     100     7.00  5.65    3.12      1.35    1.40 X
  2      14     4.40  6.61    1.21     -2.21   -0.73
  3      14     5.90  6.61    1.21     -0.71   -0.23
  4      35     1.60  6.37    1.24     -4.77   -1.58
  5      12    10.40  6.63    1.23      3.77    1.25
  6       7     9.60  6.68    1.31      2.92    0.98
  7       5     8.90  6.70    1.35      2.20    0.74
  8       1     4.20  6.75    1.44     -2.55   -0.87

X denotes an observation whose X value gives it large influence.

[Figure: scatterplot of Viewers (millions) against Salary (millions of dollars); no clear linear trend is visible.]

Linear Model:       y = a + bx
Logarithmic Model:  y = a + b ln(x)
Power Model:        y = ax^b
Quadratic Model:    y = ax² + bx + c
Exponential Model:  y = ab^x
Logistic Model:     y = c / (1 + ae^(-bx))
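As a check on the Example 2 output above, here is a minimal Python sketch (assuming SciPy, with the Example 2 data as listed in the table; the array names are ours) reproducing the key numbers:

import numpy as np
from scipy import stats

# Example 2: salary (millions of dollars) and viewers (millions)
salary  = np.array([100, 14, 14, 35, 12, 7, 5, 1], dtype=float)
viewers = np.array([7.0, 4.4, 5.9, 1.6, 10.4, 9.6, 8.9, 4.2])

fit = stats.linregress(salary, viewers)
print(round(fit.intercept, 2), round(fit.slope, 5))   # ~ 6.76  ~ -0.01106
print(round(fit.pvalue, 3), round(fit.rvalue**2, 3))  # ~ 0.780  ~ 0.014

# The slope's P-value (0.780) and R-Sq (1.4%) show no significant linear
# relationship, so the fitted line is a poor tool for predicting viewers.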