
Lecture 5:
Simple Linear Regression
Laura McAvinue
School of Psychology
Trinity College Dublin
Previous Lecture
• Regression Line
– Offers a model of the relationship between two variables
• A straight line that represents the ‘best fit’
– Enables us to predict Variable Y on the basis of Variable X

Ŷ = bX + a
Today
• Calculation of the regression line
• Measuring the accuracy of prediction
• Some practice!
How is the regression line calculated?
• The Method of Least Squares
– Computes a line that minimises the difference between the predicted values of Y (Y’) and the actual values of Y (Y)
– Minimises…
• The (Y – Y’)s
• Errors of prediction
• Residuals
[Figure: scatter plot of Y against X with the fitted regression line; the vertical lines from each data point (Y) to the line (Y’) are the errors of prediction, the (Y – Y’)s, also known as residuals]
[Figure: the same plot highlighting one data point, where the actual value is Y = 6 and the predicted value is Y’ = 5, so the residual is 1]
Method of Least Squares
• When fitting a line to the data, the regression procedure attempts to fit a line that minimises these errors of prediction, the (Y – Y’)s
– But! You can’t simply minimise Σ(Y – Y’), as the (Y – Y’)s will have positive and negative values, which will cancel each other out
• So, you square the residuals, add them up and minimise Σ(Y – Y’)²
– Hence the name, ‘Method of Least Squares’
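As an illustration (not from the lecture), here is a minimal Python sketch of the least-squares calculation on made-up data; the slope b and intercept a below are the values that minimise the sum of squared residuals.

```python
import numpy as np

# Made-up illustrative data (not the lecture's dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares estimates: b = cov(x, y) / var(x), a = mean(y) - b * mean(x)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

y_pred = b * x + a                    # predicted values, Y'
residuals = y - y_pred                # errors of prediction, (Y - Y')
ss_residual = np.sum(residuals ** 2)  # the quantity the method minimises

print(f"Y' = {b:.2f}X + {a:.2f}, sum of squared residuals = {ss_residual:.3f}")
```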
How do we measure the accuracy of prediction?
• The regression line is fitted in such a way that the errors of prediction are kept as small as possible
– You can fit a regression line to any dataset; that doesn’t mean it’s a good fit!
– How do we measure how good this fit is?
– How do we measure the accuracy of the prediction that our regression equation makes?
• Three methods
– Standard Error of the Estimate
– r²
– Statistical Significance
Standard Error of the Estimate
• A measure of the size of the errors of prediction
• We’ve seen that…
– The regression line is computed in such a way as to minimise the difference between the predicted values (Y’) and the actual values (Y)
– The differences between these values are known as errors of prediction or residuals, the (Y – Y’)s
• For any set of data, the errors of prediction will vary
– Some data points will be close to the line, so (Y – Y’) will be small
– Some data points will be far from the line, so (Y – Y’) will be big
Standard Error of the Estimate
• One way to assess the fit of the regression line is to take the standard deviation of all of these errors
– On average, how much do the data points vary from the regression line?
• Standard error of the estimate
Standard Error of the Estimate
• One point to note…
• Standard error is a measure of the standard deviation of the data points around the regression line
• (Standard error)² expresses the variance of the data points around the regression line
– Residual or error variance
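As a rough sketch (assuming arrays y and y_pred like those in the earlier example), the standard error of the estimate is the standard deviation of the residuals; n − 2 degrees of freedom are used here, matching what SPSS reports for simple regression.

```python
import numpy as np

def standard_error_of_estimate(y, y_pred):
    """Standard deviation of the data points around the regression line."""
    residuals = np.asarray(y) - np.asarray(y_pred)
    ss_error = np.sum(residuals ** 2)                # sum of squared residuals
    return np.sqrt(ss_error / (len(residuals) - 2))  # n - 2 df: slope and intercept estimated
```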
r²
• Interested in the relationship between two variables
– Variable X
• A set of scores that vary around a mean
– Variable Y
• A set of scores that vary around a mean
• If these two variables are correlated, they will share some variance
[Figure: Venn diagram of X and Y — variance in X that is not related to Y, variance in Y that is not related to X, and the shared variance between X and Y]
• In regression, we are trying to explain Variable Y as a function of Variable X
• It would be useful if we could find out what percentage of the variance in Variable Y can be explained by variance in Variable X

[Figure: the total variance in Variable Y (SStotal) partitioned into the variance due to Variable X, the regression/model variance (SSm = SStotal − SSerror), and the variance due to other factors, the error variance (SSerror)]
r²
• To calculate the percentage of variance in Variable Y that can be explained by variance in Variable X:

r² = SSm / SStotal = (variance due to X / the regression) ÷ (total variance in Y)
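Continuing the same illustrative sketch (assuming arrays y and y_pred as before), r² follows directly from the sums of squares defined above:

```python
import numpy as np

def r_squared(y, y_pred):
    """Proportion of the variance in Y explained by the regression."""
    y = np.asarray(y)
    ss_total = np.sum((y - y.mean()) ** 2)            # SStotal: total variance in Y
    ss_error = np.sum((y - np.asarray(y_pred)) ** 2)  # SSerror: error variance
    ss_model = ss_total - ss_error                    # SSm: variance due to the regression
    return ss_model / ss_total
```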
r²
• (Pearson Correlation)²
– The shared variance between two variables
– Used in simple linear regression to show what percentage of the variance in Variable Y can be explained by Variable X
• For example
– If rxy = .8, r²xy = .64, then 64% of the variability in Y is directly predictable from Variable X
– If rxy = .2, r²xy = .04, then only 4% of the variability in Y is due to / can be explained by X
Statistical Significance
• Does the regression model predict Variable Y better than chance?
• Simple linear regression
– Does X significantly predict Y?
– If the correlation between X & Y is statistically significant, the regression model will be statistically significant
– Not so for multiple regression (next lecture)
• F Ratio
Statistical Significance
• F-Ratio
– Average variance due to the regression ÷ average variance due to error

F = MSm / MSerror = (SSm / dfm) / (SSerror / dferror)

– It uses the mean squares rather than the sums of squares in order to compare the average variance
• You want the F-Ratio to be large and statistically significant
• If it is large, then more variance is explained by the regression than by the error in the model
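A sketch of the F-ratio calculation (assuming the sums of squares and degrees of freedom are already available; the scipy call supplies the p-value):

```python
from scipy import stats

def f_ratio(ss_model, ss_error, df_model, df_error):
    """F-ratio: mean square for the model over mean square for the error."""
    ms_model = ss_model / df_model  # average variance due to the regression
    ms_error = ss_error / df_error  # average variance due to error
    f = ms_model / ms_error
    p = stats.f.sf(f, df_model, df_error)  # probability of this F if H0 is true
    return f, p

# For simple linear regression, df_model = 1 and df_error = n - 2
```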
An example
‘Linear regression’ data-set
• I want to predict a person’s verbal coherency based on the number of units of alcohol they consume
• Record how much alcohol is consumed and administer a test of verbal coherency
– SPSS
– Analyse, Regression, Linear
– Dependent variable: Verbal Coherency
– Independent variable: Alcohol
– Method: Enter
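The same analysis can be run outside SPSS; a minimal equivalent in Python using scipy.stats.linregress is sketched below, where the alcohol and coherency arrays are hypothetical stand-ins rather than the lecture's dataset.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-ins for the dataset's alcohol and verbal coherency scores
alcohol = np.array([1, 2, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
coherency = np.array([25, 31, 29, 36, 41, 44, 50, 55, 58, 64], dtype=float)

result = stats.linregress(alcohol, coherency)  # simple linear regression
print(f"Y' = {result.slope:.2f}X + {result.intercept:.2f}")
print(f"r = {result.rvalue:.3f}, r² = {result.rvalue ** 2:.3f}, p = {result.pvalue:.4f}")
```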
Three parts to the output
• Model Summary: r², standard error
• ANOVA: F Ratio
• Coefficients: Regression Equation
• Table: how well our regression model explains the variation in verbal coherency

Model Summary
Model | R    | R Square | Adjusted R Square | Std. Error of the Estimate
1     | .628 | .394     | .393              | 10.29
a. Predictors: (Constant), alcohol units

– R: the Pearson r between alcohol and verbal coherency
– R Square: the proportion of variation in verbal coherency that is related to alcohol
– Adjusted R Square: a statistical estimate of the population proportion of variation in verbal coherency that is related to alcohol
– Std. Error of the Estimate: a statistical estimate of the error in the regression model
ANOVA
Model 1    | Sum of Squares | df  | Mean Square | F       | Sig.
Regression | 30105.539      | 1   | 30105.539   | 284.515 | .000
Residual   | 46240.488      | 437 | 105.813     |         |
Total      | 76346.027      | 438 |             |         |
a. Predictors: (Constant), alcohol units
b. Dependent Variable: VRBLCOHR

– Regression Sum of Squares: total variation in the data due to the regression model
– Residual Sum of Squares: total variation in the data NOT due to the regression model
– Regression Mean Square: average variation in the data due to the regression model
– Residual Mean Square: average variation in the data NOT due to the regression model
– F: ratio of variation in the data due to the regression model to variation not due to the model
– Sig.: probability of observing this F-ratio if H0 is true
• t-statistic: tells us whether using the predictor variable gives us a better than chance prediction of the DV
• Alcohol is a significant predictor of verbal coherency

Coefficients
Model 1       | B (Unstandardized) | Std. Error | Beta (Standardized) | t      | Sig.
(Constant)    | 21.543             | 1.063      |                     | 20.276 | .000
alcohol units | 4.715              | .280       | .628                | 16.868 | .000
a. Dependent Variable: VRBLCOHR

• Values that we use in the regression equation (Y = BX + a)
– Verbal Coherency = B (alcohol) + constant
– Verbal Coherency = 4.7 (alcohol) + 21.5
– As alcohol increases by 1 unit, verbal coherency increases by 4.7 units
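For example (an arbitrary illustrative value, not taken from the output), a person who has consumed 5 units of alcohol has a predicted verbal coherency of Y’ = 4.7(5) + 21.5 = 45.0.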
Second Example
• Can we predict how many months a person survives after being diagnosed with cancer, based on their level of optimism?
• ‘Linear Regression’ dataset
• Analyse, Regression, Linear
• Dependent variable: Survival
• Independent variable: Optimism
Aspects of Regression analysis
• Write the regression equation
• Explain what this equation tells us about the relationship between Variables X and Y
• Make a prediction of Y when given a value of X
• State the standard error of your prediction
• Ascertain if the regression model significantly predicts the dependent variable Y
• State what percentage of the variance in Variable Y is explained by Variable X
(Each of these steps is illustrated in the sketch below.)
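A sketch of how each item on this list could be read from a fitted model outside SPSS, using statsmodels with made-up optimism and survival scores (not the lecture's dataset):

```python
import numpy as np
import statsmodels.api as sm

# Made-up optimism and survival scores, purely for illustration
optimism = np.array([2, 5, 7, 10, 12, 15, 18, 20, 25, 30], dtype=float)
survival = np.array([19, 22, 24, 25, 27, 29, 30, 32, 36, 39], dtype=float)

model = sm.OLS(survival, sm.add_constant(optimism)).fit()
intercept, slope = model.params

print(f"Regression equation: Y' = {slope:.2f}X + {intercept:.2f}")
print(f"Prediction for optimism = 10: {slope * 10 + intercept:.1f} months")
print(f"Standard error of the estimate: {np.sqrt(model.mse_resid):.2f}")
print(f"F({int(model.df_model)}, {int(model.df_resid)}) = {model.fvalue:.1f}, p = {model.f_pvalue:.4f}")
print(f"Percentage of variance explained (r²): {100 * model.rsquared:.0f}%")
```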
State the following…
• Describe the relationship between survival (Y) and optimism (X) in terms of a regression equation.
• In your own words, explain what this equation tells us about the relationship between survival and optimism.
• Using this equation, predict how many months a person will survive for if their optimism score is 10.
State the following…
• What is the standard error of your prediction?
• Does the regression model significantly predict the dependent variable?
• What percentage of variance in survival is explained by optimism level?
Answers
• Describe the relationship between survival (Y) and optimism (X) in terms of a regression equation.
– Y’ = .69X + 18.4
• In your own words, explain what this equation tells us about the relationship between survival and optimism.
– As optimism level increases by one unit, survival increases by .69 months
– When a person’s optimism score is 0, his/her predicted length of survival is 18.4 months
• Using this equation, predict how many months a person will survive for if their optimism score is 10.
– Y’ = .69(10) + 18.4 = 25.3 months
State the following…
• What is the standard error of your prediction?
– 4.5 months
• Does the regression model significantly predict the dependent variable?
– Yes, F(1, 432) = 202, p < .001
• What percentage of variance in survival is explained by optimism level?
– 32%
Summary
• Simple linear regression
– Provides a model of the relationship between two variables
– Creates a straight line that best represents the relationship between two variables
– Enables us to estimate the percentage of variance in one variable that can be explained by another
– Enables us to predict one variable on the basis of another
• Remember that a regression line can be fitted to any dataset. It’s necessary to assess the accuracy of the fit.