Introduction to Regression

The central purpose of regression is to create a linear equation relating the independent variable X to the dependent variable Y. It permits us to answer questions of the following kind: How much additional income does each additional year of education provide?

Regression assumes that both variables are measured on at least an interval level, and it should be used only if we think that this assumption is close to being met.

The prediction equation in the population

$\hat{Y}_i = \alpha + \beta X_i$

$\hat{Y}$ ("Y hat") is known as the predicted value of Y for a given value of X. It is considered the best estimate of Y for a given X. If the model (equation) is correct for the population, $\hat{Y}$ equals $\mu_{Y|X}$. This is known as the conditional mean of Y given X, and is the population mean of Y for the particular value of X. Thus the regression line can be considered the path of the mean values of Y as X changes. The line is produced by plugging values of X into the linear regression formula and solving for $\hat{Y}$.

$\beta$ is the regression slope: the amount of change in Y for each unit change in X. (Note that one must specify the units.)

$\alpha$ is the Y intercept. It is the value of $\hat{Y}$ when X = 0.

These symbols ($\alpha$ and $\beta$) have no relationship to $\alpha$ as in $\alpha$ level.

The population equation with error

$Y_i = \alpha + \beta X_i + \varepsilon_i$

where $\varepsilon_i$ is the error, $\varepsilon_i = Y_i - \hat{Y}_i$, and $Y_i$ is the actual observed value of Y for a case.

Computing the sample statistics

$\hat{Y}_i = a + bX_i$ is the sample prediction equation.

$Y_i = a + bX_i + e_i$ is the sample equation with error,

where a, b, and e are sample estimates of $\alpha$, $\beta$, and $\varepsilon$ respectively.

Equations for b and a

These equations are designed to produce the best-fitting line for the scatterplot. A scatterplot is a two-dimensional plot of the observations in X and Y coordinates, in which the location of each point indicates the values of both X and Y for that case.
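The sample slope b and intercept a can be computed directly from the data. The following is a minimal sketch, assuming the usual least-squares formulas $b = \sum (X_i - \bar{X})(Y_i - \bar{Y}) / \sum (X_i - \bar{X})^2$ and $a = \bar{Y} - b\bar{X}$. The function name `fit_line` and the education and income figures are hypothetical and serve only as an illustration.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line Y-hat = a + b*X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sum of squared deviations of X from its mean
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # slope: change in Y per one-unit change in X
    a = mean_y - b * mean_x    # intercept: value of Y-hat when X = 0
    return a, b


# Hypothetical data: years of education (X), income in thousands (Y)
education = [10, 12, 14, 16, 18]
income = [25, 30, 38, 45, 52]

a, b = fit_line(education, income)

# Predicted values and errors (residuals) e_i = Y_i - Y-hat_i
predicted = [a + b * x for x in education]
residuals = [y - y_hat for y, y_hat in zip(income, predicted)]

print(f"a = {a:.2f}, b = {b:.2f}")
```

With these made-up numbers, b is read as the additional income (in thousands) associated with each additional year of education. The residuals from a least-squares fit sum to zero, apart from rounding error.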
The best-fitting line is defined as the one that minimizes the sum of the squared vertical distances of all points from the regression line. The procedure for improving best estimates of a dependent variable (Y) by accounting for its relationship with an independent variable (X) is called simple linear correlation and regression analysis.

Simple Linear Correlation and Regression Analysis

Simple linear correlation and regression analysis is the use of the formula for a straight line to improve best estimates of an interval/ratio dependent variable (Y) for all values of an interval/ratio independent variable (X). Linear means "straight line".

Scatterplots

A linear regression formula is the formula for a straight line. Simple linear correlation and regression statistics apply only to scatterplots whose coordinates fall in a linear, cigar-shaped pattern. The formula for a straight line to estimate Y is:

$\hat{Y} = a + bX$

The Regression Line on the Scatterplot

The regression line is the best-fitting straight line plotted through the X,Y-coordinates of a scatterplot.

Positive and Negative Correlations

A positive correlation is an upward-sloping pattern in a scatterplot, where an increase in X is related to an increase in Y. A negative correlation is a downward-sloping pattern, where an increase in X is related to a decrease in Y. When the pattern lacks an elongated, sloped cigar shape, there is no correlation: an increase in X is unrelated to the scores of Y.

Computing Correlation and Regression Statistics

To calculate correlation and regression statistics, set up a spreadsheet to obtain the following sums: ΣX, ΣY, ΣX², ΣY², and ΣXY.

Pearson's r Correlation Coefficient

Pearson's r is a widely used correlation coefficient that measures the tightness of fit of the X,Y-coordinates around the regression line of a scatterplot. Computed values of Pearson's r can range from -1 to +1. The larger the absolute value of r, the tighter the fit of the X,Y-coordinates around the regression line.

The Sign of Pearson's r

When the regression line slopes upward, we have a positive correlation, and Pearson's r will be positive, up to a value of +1. When the regression line slopes downward, we have a negative correlation, and Pearson's r will be negative, down to a value of -1. When the regression line is flat, we have no correlation and Pearson's r = 0.

Regression Statistics

The coefficients and symbols of the regression line formula, $\hat{Y} = a + bX$:

$\hat{Y}$ = the predicted Y (an estimate of the dependent variable Y computed for a given value of the independent variable X).

Recall that the objective of correlation and regression is to use the regression line to make better estimates of Y.

Regression Statistics: The Slope, b

b = slope of the regression line (called the regression coefficient). b conveys slope in the sense of going up or down a hill. It answers the question: How far does the line rise for every one-unit run of X?

Regression Statistics: The Y-intercept, a

a = Y-intercept, the point at which the regression line intersects the Y-axis, that is, the value of $\hat{Y}$ when X = 0. To compute a, calculate the means of X and Y, substitute them into the regression equation, and solve for a. Because the least-squares line passes through the point of means, this gives $a = \bar{Y} - b\bar{X}$.

Plotting the Regression Line

To plot the regression line on the scatterplot, use the regression equation to calculate a few values of $\hat{Y}$. Do this by inserting chosen values of X into the regression equation $\hat{Y} = a + bX$ and solving for $\hat{Y}$.

The Importance of Observing the Scatterplot

The linear regression equation applies only when the pattern of coordinates is linear. The presence of outlier coordinates can cause attenuation (weakening) or inflation of the Pearson's r correlation coefficient.
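Pearson's r can be computed from the five spreadsheet sums listed earlier. The following is a minimal sketch, assuming the common computational formula $r = (n\Sigma XY - \Sigma X \Sigma Y) / \sqrt{(n\Sigma X^2 - (\Sigma X)^2)(n\Sigma Y^2 - (\Sigma Y)^2)}$. The function name `pearson_r` and the data are hypothetical. The second call adds one made-up outlier to show how a single point can attenuate r.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r from the sums of X, Y, X squared, Y squared, and XY."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Hypothetical, perfectly linear data: every point lies on Y = 2X
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
r_clean = pearson_r(xs, ys)

# Same data plus one hypothetical outlier far below the line
r_outlier = pearson_r(xs + [6], ys + [0])

print(f"r without outlier: {r_clean:.3f}")
print(f"r with outlier:    {r_outlier:.3f}")
```

The drop in r after adding a single point is the kind of distortion that inspecting the scatterplot is meant to catch before the statistics are interpreted.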