Regression intro

Introduction to Regression
The central purpose of regression is to create a linear equation relating the independent variable,
X, to the dependent variable, Y. It permits us to answer questions of the following kind:
How much additional income does each additional year of education provide?
Regression assumes that both variables are measured on at least an interval level and should only
be used if we think that this assumption is close to being met.
The prediction equation in the population
Yi^ = α + βXi
where Y^ is being used as Yhat.
Y^ is known as the predicted value of Y for a given value of X.
It is considered the best estimate of Y for a given X.
If the model (equation) is correct for the population, Y^ equals μY|X. This is known as the
conditional mean of Y given X, and is the population mean of Y for the particular value of X.
Thus the regression line can be considered the path of the mean values of Y as X changes. The
line is produced by plugging values of X into the linear regression formula and solving for Y^.
β is the regression slope -- the amount of change in Y for each unit change in X. (Note that one
must specify the units.)
α is the Y intercept. It is the value of Y^ when X = 0.
These symbols (α and β) have no relationship to α as in α level.
The population equation with error
Yi = α + βXi + εi
where εi is the error and
εi = Yi - Yi^
and where Yi is the actual observed value of Y for a case
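As an illustration of the two population equations, the sketch below computes a predicted value and an error for a single case. The parameter values and the income/education figures are hypothetical, chosen only to echo the opening question; they are not taken from any data set.

```python
# Hypothetical population parameters (assumptions for illustration only).
ALPHA = 10_000.0  # Y intercept: predicted income when years of education = 0
BETA = 2_500.0    # slope: change in income per additional year of education


def predict(x: float) -> float:
    """Predicted value: Y^ = alpha + beta * X."""
    return ALPHA + BETA * x


def error(x: float, y: float) -> float:
    """Error term: observed Y minus predicted Y^."""
    return y - predict(x)


y_hat = predict(12)            # 10000 + 2500 * 12 = 40000
eps = error(12, 43_000.0)      # 43000 - 40000 = 3000
print(y_hat, eps)
```

A positive error means the case lies above the regression line; a negative error means it lies below.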
Computing the sample statistics
Yi^ = a + bXi
is the sample prediction equation
Yi = a + bXi + ei
is the sample equation with error
where a, b, and e are sample estimates of α, β and ε respectively.
Equations for b and a
These equations are designed to produce the best fitting line for the scatterplot.
A scatterplot is a two-dimensional plot of the observations on X and Y co-ordinates, in which the
location of each point indicates the values of both X and Y for that case.
The best fitting line is defined as the one which minimizes the sum of the squared vertical
distances of all points from the regression line.
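For reference, the standard least-squares solutions for b and a can be written in deviation-score form. The bar notation for the sample means of X and Y is an assumption introduced here, not notation used elsewhere in these notes.

```latex
b = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2},
\qquad
a = \bar{Y} - b\,\bar{X}
```

These are the values of a and b that minimize the sum of squared errors, \sum_{i=1}^{n} (Y_i - a - bX_i)^2.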
The procedure for improving best estimates for a dependent variable (Y) by accounting for its
relationship with an independent variable (X) is called simple linear correlation and regression
analysis.
Simple Linear Correlation and Regression Analysis
Simple linear correlation and regression analysis is the use of the formula for a straight line to
improve best estimates of an interval/ratio dependent variable (Y) for all values of an
interval/ratio independent variable (X)
Linear means “straight line”
Scatterplots
A linear regression formula is the formula for a straight line
Simple linear correlation and regression statistics apply only to scatterplots with coordinates in a
linear, cigar-shaped pattern
The formula for a straight line to estimate Y is: Ý = a + bX
The Regression Line on the Scatterplot
The regression line is the best-fitting straight line plotted through the X,Y-coordinates of a
scatterplot
Positive and Negative Correlations
A positive correlation is an upward sloping pattern in a scatterplot where an increase in X is
related to an increase in Y
A negative correlation is a downward sloping pattern where an increase in X is related to a
decrease in Y
When the pattern lacks an elongated, sloped cigar shape, there is no correlation; an increase in X
is unrelated to the scores of Y
Computing Correlation and Regression Statistics
To calculate correlation and regression statistics, set up a spreadsheet to obtain the following
sums: ΣX, ΣY, ΣX², ΣY², and ΣXY
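The same five sums can be obtained in a few lines of code instead of a spreadsheet. The sketch below uses a small set of hypothetical paired observations; the data values and variable names are illustrative assumptions only.

```python
# Hypothetical paired observations (illustrative only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)                                   # number of cases
sum_x = sum(xs)                               # ΣX
sum_y = sum(ys)                               # ΣY
sum_x2 = sum(x * x for x in xs)               # ΣX²
sum_y2 = sum(y * y for y in ys)               # ΣY²
sum_xy = sum(x * y for x, y in zip(xs, ys))   # ΣXY

print(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy)  # 5 15 20 55 86 66
```

Each sum corresponds to one column total in the spreadsheet layout: X, Y, X squared, Y squared, and the product XY.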
Pearson’s r Correlation Coefficient
Pearson’s r is a widely used correlation coefficient that measures the tightness of fit of X,Y-coordinates around the regression line of a scatterplot
Computed values of Pearson’s r can range from -1 to +1
The larger the absolute value of r, the tighter the fit of X,Y-coordinates around the regression
line
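One common way to compute Pearson's r uses the five sums directly (the so-called computational formula). The sketch below is a minimal version under that assumption; the function name and the data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from the sums ΣX, ΣY, ΣX², ΣY², and ΣXY."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    # Note: the denominator is zero if X or Y has no variation,
    # in which case r is undefined and this raises ZeroDivisionError.
    return numerator / denominator


r = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(r, 4))  # approximately 0.7746
```

For these hypothetical data the result is a fairly tight positive fit, well inside the permitted range of -1 to +1.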
The Sign of Pearson’s r
When the regression line slopes upward, we have a positive correlation. Pearson’s r will be
positive up to a value of +1
When the regression line slopes downward, we have a negative correlation. Pearson’s r will be
negative down to a value of -1
When the regression line is flat, we have no correlation and Pearson’s r = 0
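The three cases can be checked on tiny made-up data sets. The sketch below computes r in deviation-score form (an equivalent alternative to working from raw sums) for an upward pattern, a downward pattern, and a pattern whose best-fitting line is flat. All data values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r computed from deviations around the means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4]
r_up = pearson_r(xs, [2, 4, 6, 8])    # points rise exactly on a line
r_down = pearson_r(xs, [8, 6, 4, 2])  # points fall exactly on a line
r_flat = pearson_r(xs, [1, 2, 2, 1])  # no upward or downward trend

print(r_up, r_down, r_flat)
```

The first two data sets fall perfectly on a straight line, so r reaches its limits of +1 and -1; the third has no linear trend, so r is 0.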
Regression Statistics
The coefficients and symbols of the regression line formula, Ý = a + bX
Ý = the predicted Y (an estimate of the dependent variable Y computed for a given value of the
independent variable X)
Recall that the objective of correlation and regression is to use the regression line to make better
estimates of Y
Regression Statistics: The Slope, b
b = slope of the regression line (called the regression coefficient)
b conveys slope in the sense of going up or down a hill. It answers the question: How far does
the line rise for every one-unit run of X?
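A standard computational formula gives b from the same sums used for Pearson's r. The following is a minimal sketch under that assumption, with a hypothetical function name and hypothetical data.

```python
def slope(xs, ys):
    """Regression slope b from the sums ΣX, ΣY, ΣX², and ΣXY."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)


# Hypothetical data: (5*66 - 15*20) / (5*55 - 15**2) = 30 / 50
b = slope([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(b)
```

Here the line rises about 0.6 units of Y for every one-unit run of X.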
Regression Statistics: The Y-intercept, a
a = Y-intercept, the point at which the regression line intersects the Y-axis when X = 0
To compute a, calculate the means of X and Y, substitute them into the regression equation, and
solve for a
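This works because the least-squares line always passes through the point of means, so rearranging mean(Y) = a + b * mean(X) gives a = mean(Y) - b * mean(X). A short sketch, assuming a slope of 0.6 already computed for the hypothetical data shown:

```python
def intercept(xs, ys, b):
    """Y-intercept a, found by solving mean(Y) = a + b * mean(X) for a."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return mean_y - b * mean_x


# Hypothetical data with means X = 3 and Y = 4, and an assumed slope b = 0.6.
a = intercept([1, 2, 3, 4, 5], [2, 4, 5, 4, 5], b=0.6)
print(round(a, 2))  # 4 - 0.6 * 3 = 2.2
```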
Plotting the Regression Line
To plot the regression line on the scatterplot, use the regression equation to calculate a few
values of Ý
Do this by inserting chosen values of X into the regression equation and solving for Ý
Ý = a + bX
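Since two points determine a straight line, it is enough to compute the predicted value at two well-separated values of X and join them. The sketch below assumes hypothetical coefficients a = 2.2 and b = 0.6 and produces the coordinates to plot; the plotting step itself is left to whatever tool is being used.

```python
# Hypothetical coefficients (assumptions for illustration only).
a = 2.2  # Y-intercept
b = 0.6  # slope


def predict(x):
    """Predicted Y for a given X, from the regression equation a + bX."""
    return a + b * x


# Choose X values near the low and high ends of the observed range.
chosen_xs = [0, 5]
line_points = [(x, predict(x)) for x in chosen_xs]

for x, y_hat in line_points:
    print(x, round(y_hat, 2))
```

Computing a third point is a cheap check: if it does not fall on the line through the first two, there is an arithmetic error.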
The Importance of Observing the Scatterplot
The linear regression equation applies only when the pattern of coordinates is linear
The presence of outlier coordinates can attenuate (weaken) or inflate the
Pearson’s r correlation coefficient
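Both effects can be demonstrated with made-up numbers. In the sketch below, a single outlier added to a perfectly linear pattern sharply weakens r, while a single distant outlier added to an uncorrelated pattern inflates r. All data values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from the sums ΣX, ΣY, ΣX², ΣY², and ΣXY."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    num = n * sum_xy - sum_x * sum_y
    den = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return num / den


# Attenuation: a perfect line, then the same line plus one outlier at (6, 0).
r_line = pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
r_attenuated = pearson_r([1, 2, 3, 4, 5, 6], [2, 4, 6, 8, 10, 0])

# Inflation: an uncorrelated cluster, then the cluster plus one outlier at (10, 10).
r_cluster = pearson_r([1, 2, 3, 4], [1, 2, 2, 1])
r_inflated = pearson_r([1, 2, 3, 4, 10], [1, 2, 2, 1, 10])

print(round(r_line, 3), round(r_attenuated, 3))
print(round(r_cluster, 3), round(r_inflated, 3))
```

In neither case would the computed r alone reveal the problem, which is why the scatterplot should be inspected before the statistics are interpreted.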