Uploaded by Rashard Springfield

04-21 - Regression

Linear Regression
Tuesday, April 21, 2020
Today
 Today’s topic: Linear regression
 Quiz 4 (on correlation) is due tomorrow (Wed) by
8pm
 We will have one more quiz, posted on Tues, 4/28
and due the following Mon, May 4 by 8pm
 We will not have a Final Exam for this course
 Complete SETE course evaluations
 You will receive an email with instructions to
complete evaluations for this class and your other
classes by May 7
Linear Regression
 Used to describe the relationship between two
interval/ratio variables
 How is regression distinct from correlation?
 Correlation is unitless
 Regression has units and tells us how a 1-unit
change in independent variable affects value of
dependent variable
 Regression relies on scatterplots and a prediction
of the “best-fitting” line to describe a trend in the
relationship
 Line described by slope and y-intercept
Correlation is Unitless
 Formula for correlation coefficient (r):
$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}$
 The units for Y and X are in the numerator and
denominator – they cancel each other out
 Regression keeps the units of the independent
and dependent variables and tells us how a 1-unit
increase in independent variable affects predicted
value of dependent variable
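A minimal sketch of the difference, assuming a small made-up data set (the ages and hours below are hypothetical, not the survey data used in class). Converting the dependent variable from hours to minutes leaves r unchanged but multiplies the slope by 60, because the slope carries units.

```python
import numpy as np

# Hypothetical data for illustration only (not from the lecture)
age = np.array([20.0, 30.0, 40.0, 50.0, 60.0])      # years
hours = np.array([22.0, 18.0, 17.0, 12.0, 11.0])    # hours online per week

def pearson_r(x, y):
    """Pearson's r: units of x and y cancel between numerator and denominator."""
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

def slope(x, y):
    """Least-squares slope: change in y (in y's units) per 1-unit change in x."""
    dx = x - x.mean()
    return (dx * (y - y.mean())).sum() / (dx ** 2).sum()

minutes = hours * 60  # same information, different units

r_hours = pearson_r(age, hours)
r_minutes = pearson_r(age, minutes)
b_hours = slope(age, hours)        # hours per year of age
b_minutes = slope(age, minutes)    # minutes per year of age

print(r_hours, r_minutes)   # identical: r is unitless
print(b_hours, b_minutes)   # differ by a factor of 60
```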
Linear Regression
 Do you remember the formula for a line from
algebra?
Y = mX + b
…where m = slope and
b = y-intercept
 The formula for a regression line is nearly
identical but with statistical notation:
$\hat{Y} = a + bX$
…where b = slope and
a = y-intercept
Linear Regression
 Y-intercept (or Constant) (a)
 The point at which the regression line crosses the
y-axis.
 The expected value of Y when X equals 0.
 Not always meaningful; for some independent
variables, 0 is not a valid value (age, education,
body mass index)
 Slope (b)
 The steepness of the regression line
 “Rise divided by run”
 Indicates the predicted change in the dependent
variable for each unit increase in the independent
variable.
Regression line
Predicting $\hat{Y}$ with X
 If I asked you to guess a person’s age and you
knew nothing about them except that he or she
lived in the US, what would be your best guess?
 The mean age for US residents, $\bar{Y}$ (“Y bar”)
$\bar{Y}$ = 38.2 years in 2018
 Now if I tell you that person is widowed, would
you adjust your guess?
Predicting $\hat{Y}$ with X
 In statistics, we are often trying to predict (make a
better guess about) the mean for a variable.
 With bivariate statistics, we use an independent
variable (widowhood) to improve our estimate of a
dependent variable (age)
 We use X to predict $\hat{Y}$ (“Y hat”)
 $\hat{Y}$ (point on regression line) should give us a
better estimate than $\bar{Y}$ (mean) because it draws
on information from X
How would we find the line that best fits these
data points?
Which line best captures how these two
variables co-vary?
The difficulty is that we have a lot of choices
that look about right….right?
We need to find the single best intercept and
single best slope for our line…
We need a line that minimizes the distance
from the line to the data points
These distances are called “residuals,”
or residual errors
Blue lines represent the error
between the predicted value
based on the red regression
line and the actual value
observed in the data (the
circles)
Linear Regression, also called “Ordinary
Least Squares (OLS) Regression”
 The logic is similar to variance. Some residual
errors are negative and some are positive. If we
just add the residual errors together, they cancel
each other out and sum to zero.
 The estimates of a and b will have the property
that the sum of the squared differences between
the observed and predicted values, $\sum (Y - \hat{Y})^2$, is
minimized using ordinary least squares.
 This relies on Sum of Squared Errors
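A minimal sketch of this logic, again assuming hypothetical data rather than the class data set. It uses the standard closed-form least-squares formulas for a and b, then checks two properties described above: the residuals sum to zero, and moving the line away from the fitted intercept or slope makes the sum of squared errors larger.

```python
import numpy as np

# Hypothetical data for illustration only (not from the lecture)
age = np.array([20.0, 30.0, 40.0, 50.0, 60.0])
hours = np.array([22.0, 18.0, 17.0, 12.0, 11.0])

def ols_fit(x, y):
    """Return intercept a and slope b that minimize the sum of squared errors."""
    dx = x - x.mean()
    b = (dx * (y - y.mean())).sum() / (dx ** 2).sum()
    a = y.mean() - b * x.mean()
    return a, b

def sse(a, b, x, y):
    """Sum of squared errors between observed y and the line a + b*x."""
    residuals = y - (a + b * x)
    return (residuals ** 2).sum()

a, b = ols_fit(age, hours)

# Positive and negative residuals cancel, so their plain sum is (numerically) zero
residuals = hours - (a + b * age)
residual_sum = residuals.sum()

# Any other line has a larger sum of squared errors
best_sse = sse(a, b, age, hours)
nudged_sse = sse(a + 0.5, b, age, hours)    # shift the intercept
tilted_sse = sse(a, b + 0.05, age, hours)   # change the slope

print(a, b, residual_sum)
print(best_sse, nudged_sse, tilted_sse)
```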
Interpreting SPSS Output
 Let’s return to our example from last week about
age and time spent using the internet
 Research question: Is there a relationship
between respondent’s age and the number of
hours they spend using the internet?
 Both IV and DV are interval/ratio
Fit regression line
 SPSS will calculate the y-intercept and slope for
the “best-fitting” line to tell us how well
respondents’ age predicts their hours online per
week, on average
Regression with SPSS
Output provided by SPSS
Interpreting SPSS Output: Summary
 R = Pearson’s correlation coefficient
(measure of correlation we discussed 4/14)
 R = .259, so there is a weak association
between age and hours of internet use (SPSS reports R
without a sign; the negative slope shows the
association is negative)
Interpreting SPSS Output: Summary
 $R^2$ (“R squared”): coefficient of determination
 $R^2$ tells us what proportion of error is reduced by using X
to predict $\hat{Y}$ versus just using $\bar{Y}$
Interpreting SPSS Output: Summary
 $R^2$ Interpretation
 $R^2$ = .067 means that 6.7% of the variation in the DV
(hours of internet use) can be explained by the IV (age)
versus using the mean of the DV alone
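A minimal sketch of the "proportion of error reduced" idea, assuming hypothetical data (these made-up points fit a line far more closely than the class example, so the resulting value is much larger than .067). It compares the squared error from guessing the mean for everyone with the squared error left over after using the regression line, and confirms that the result equals r squared.

```python
import numpy as np

# Hypothetical data for illustration only (not from the lecture)
age = np.array([20.0, 30.0, 40.0, 50.0, 60.0])
hours = np.array([22.0, 18.0, 17.0, 12.0, 11.0])

# Fit the least-squares line
dx = age - age.mean()
dy = hours - hours.mean()
b = (dx * dy).sum() / (dx ** 2).sum()
a = hours.mean() - b * age.mean()

# Total error when we guess the mean of Y for everyone
sst = (dy ** 2).sum()

# Error left over when we use the regression line instead
sse = ((hours - (a + b * age)) ** 2).sum()

# Proportion of error reduced by using X
r_squared = 1 - sse / sst

# Pearson's r, for comparison
r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * sst)

print(r_squared, r ** 2)  # the two values agree
```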
Interpreting SPSS Output
 SPSS uses different notation than we’ve been using
 (Constant) row & B column = y-intercept a
 “Age of respondent” row and B column = slope b
 Our regression equation: $\hat{Y} = 26.395 + (-.262) \times X$
Interpreting SPSS Output
 Y-intercept a = 26.395
 We can’t interpret this. Y-intercept tells us the value of
the DV (hours of internet use) when our IV (age)
equals 0. No one in the survey is zero years old.
Interpreting SPSS Output
 Slope b = –.262
 When age increases by 1 year, hours of internet use
will decrease by .262 hours, on average
Interpreting SPSS Output
 “Sig.” is the p-value
 If p-value is smaller than our alpha value (.05), then the
association between the two variables is statistically
significant
 In other words, we can be 95% confident that age
and hours of internet use are associated at the population level
Interpreting SPSS Output
 Our regression equation: $\hat{Y} = 26.395 - .262 \times X$
 If someone’s age X equals 65 years, what is our best
estimate of their hours online per week ($\hat{Y}$)?
$\hat{Y} = 26.395 - .262 \times 65$
$\hat{Y} = 9.365$
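The same arithmetic can be sketched in a few lines of Python. The intercept and slope are the values reported in the SPSS output above; the function name `predict_hours` is only an illustrative choice.

```python
# Coefficients taken from the SPSS output discussed in the lecture
INTERCEPT = 26.395   # a: (Constant) row, B column
SLOPE = -0.262       # b: "Age of respondent" row, B column

def predict_hours(age):
    """Predicted hours online per week (Y hat) for a given age in years."""
    return INTERCEPT + SLOPE * age

estimate_65 = predict_hours(65)
print(round(estimate_65, 3))  # 9.365

# Each additional year of age changes the prediction by the slope
difference = predict_hours(66) - predict_hours(65)
print(round(difference, 3))   # -0.262
```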