Page 1 of 2 Correlation and Linear Regression 1001-CorrLinReg.doc

Correlation and Linear Regression

Correlation: correlation is studied on a collection of pairs of sample data, called bivariate data (data having two variables). A correlation (relationship) exists when one variable is related to the other in some way.

Assumptions
1. The sample of paired (x, y) data is a random sample.
2. The pairs (x, y) have a bivariate normal distribution: the values of both x and y come from normal distributions.

The linear correlation coefficient, r, shows the strength of the linear relationship between the paired (x, y) values in a sample.

$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}$$

r = sample correlation coefficient
ρ (rho) = population correlation coefficient

Some characteristics of the correlation coefficient:
1. $-1 \le r \le 1$
2. Conversion of all values of either variable to a different scale does not change the value of r.
3. r is not affected by the choice of which variable is called x and which is called y.
4. r measures the strength of a linear relationship.

Coefficient of Determination: $r^2$ is the proportion of the variation in y that is explained by the linear relationship between x and y.

Common errors:
1. Concluding that correlation implies causality. A lurking variable is one that affects the variables being studied but is not included in the study.
2. Using data based on averages: averaging suppresses individual variation and may inflate the correlation coefficient.
3. Property of linearity: the linear correlation may be zero even when a non-linear relationship is very strong.

Testing the significance of r, the correlation coefficient:
H0: ρ = 0
H1: ρ ≠ 0
Use
$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$
with degrees of freedom df = n - 2, or use Table A-6 for r-values.

Regression Analysis
• Purpose: to determine the regression equation, which is used to predict the value of the dependent variable (Y) based on the independent variable (X).
• Procedure: select a sample from the population and list the paired data for each observation; draw a scatter diagram to give a visual portrayal of the relationship; determine the regression equation.
• The regression equation: $\hat{y} = b_0 + b_1 x$, where:
  • ŷ (y hat) is the predicted value of Y for any X.
  • b0 is the Y-intercept, or the estimated Y value when X = 0.
  • b1 is the slope of the line, or the average change in ŷ for each change of one unit in X.
• The least squares principle is used to obtain b1 and b0:

$$b_1 = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2} \quad\text{and}\quad b_0 = \bar{y} - b_1\bar{x} \quad\text{or}\quad b_0 = \frac{\sum y}{n} - b_1\frac{\sum x}{n}$$

Centroid: From a collection of paired (x, y) data, the centroid is $(\bar{x}, \bar{y})$. This represents the point designated by the mean of the x-values and the mean of the y-values.
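
As a rough illustration, the summation formulas in this handout can be computed directly in Python. This is only a sketch: the function names (`correlation`, `t_statistic`, `least_squares`, `centroid`) and the five data pairs are made up for the example and do not come from the handout.

```python
import math


def _sums(xs, ys):
    """Return n and the sums used by the handout's formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    return n, sum_x, sum_y, sum_xy, sum_x2, sum_y2


def correlation(xs, ys):
    """Sample linear correlation coefficient r."""
    n, sx, sy, sxy, sx2, sy2 = _sums(xs, ys)
    numerator = n * sxy - sx * sy
    denominator = math.sqrt(n * sx2 - sx ** 2) * math.sqrt(n * sy2 - sy ** 2)
    return numerator / denominator


def t_statistic(r, n):
    """Test statistic for H0: rho = 0, with df = n - 2."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))


def least_squares(xs, ys):
    """Return (b0, b1) for the regression equation y-hat = b0 + b1 * x."""
    n, sx, sy, sxy, sx2, _ = _sums(xs, ys)
    b1 = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)
    b0 = sy / n - b1 * sx / n
    return b0, b1


def centroid(xs, ys):
    """Return the point (mean of x-values, mean of y-values)."""
    return sum(xs) / len(xs), sum(ys) / len(ys)


# Hypothetical paired data, chosen only to keep the arithmetic small.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
b0, b1 = least_squares(xs, ys)
x_bar, y_bar = centroid(xs, ys)

print(f"r = {r:.4f}, r^2 = {r ** 2:.4f}")
print(f"t = {t_statistic(r, len(xs)):.4f} with df = {len(xs) - 2}")
print(f"y-hat = {b0:.4f} + {b1:.4f} x")
print(f"centroid = ({x_bar}, {y_bar})")
```

For this made-up data the sums are Σx = 15, Σy = 20, Σxy = 66, Σx² = 55 and Σy² = 86, which give r ≈ 0.7746, b1 = 0.6 and b0 = 2.2. Substituting x = 3 into the fitted equation returns 4, so the regression line passes through the centroid (3, 4).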