9/16/09
Regression
FPP 10 kind of
Regression line
 The correlation coefficient is a nice numerical summary of two quantitative variables
 It indicates direction and strength of association
 But does it quantify the association?
 It would be of interest to do this for
 Predictions
 Understanding phenomena
Regression line
 Correlation measures the direction and strength of the straight-line (linear) relationship between two quantitative variables
 If a scatter plot shows a linear relationship, we would like to summarize this overall pattern by drawing a line on the scatter plot
 This line represents a mathematical model. Later we will make the mathematical model a statistical one.
Regression line
Price of Homes Based on Square Feet
Price = -90.2458 + 0.1598SQFT
r = 0.8718945
Slope intercept form review
 Slope intercept form notation
$y = mx + b$
 Regression form notation
$\hat{y} = a + bx$
Which line is best
 Different people might draw different lines by eye on a scatterplot
Price = -90.2458 + 0.1598SQFT (red)
Price = -300 + 0.3SQFT (blue)
Price = 0 + 0.1SQFT (green)
Which model to use
 What are some ways we can determine which model (line) out of all the possible models (lines) is the “best” one?
 What are some ways that we can numerically rank the different models (i.e., the different lines)?
 This will come later in the course
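
One possible way to rank candidate lines numerically is to total the squared prediction errors of each line over the data and prefer the smallest total. The sketch below applies that idea to the red, blue, and green lines. The slides do not name a criterion, so the squared-error criterion is an assumption here, and the four homes are hypothetical values invented only for illustration (prices in thousands of dollars).

```python
# Hypothetical (sqft, price in thousands of dollars) pairs, invented for illustration only.
homes = [(1500, 150.0), (2000, 230.0), (2500, 310.0), (3000, 390.0)]

# The three candidate lines from the slide, stored as (intercept, slope).
candidates = {
    "red": (-90.2458, 0.1598),
    "blue": (-300.0, 0.3),
    "green": (0.0, 0.1),
}


def sum_squared_errors(intercept, slope, data):
    """Total squared vertical distance between the observed prices and the line."""
    return sum((price - (intercept + slope * sqft)) ** 2 for sqft, price in data)


# Assumed criterion: a smaller sum of squared errors means a better-fitting line.
scores = {name: sum_squared_errors(a, b, homes) for name, (a, b) in candidates.items()}
ranking = sorted(scores, key=scores.get)

for name in ranking:
    print(f"{name}: {scores[name]:.2f}")
```

On these made-up points the red line ranks first, but the ordering depends entirely on the data and on the criterion chosen.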
Slope interpretation
$\hat{y} = a + bx$
 The slope, b, of a regression line is almost always important for interpreting the data. The slope is the rate of change: the mean amount of change in y-hat when x increases by 1.
Intercept interpretation
$\hat{y} = a + bx$
 The intercept, a, of the regression line is the value of y-hat when x = 0. Although we need the value of the intercept to draw the line, it is statistically meaningful only when x can actually take values close to zero.
Slope interpretation
Price of Homes Based on Square Feet
Price = -90.2458 + 0.1598SQFT
r = 0.8718945
For every 1 sqft increase in the size of a home, the house price increases by $159.80 on average. (Price is in thousands of dollars, so a slope of 0.1598 corresponds to $159.80.)
Intercept interpretation
Price of Homes Based on Square Feet
Price = -90.2458 + 0.1598SQFT
r = 0.8718945
If the size of a home were 0 sqft, the predicted average house price would be -$90,245.80. This doesn’t make much sense here because x (sqft) doesn’t take on values close to zero.
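
The slope and intercept readings above can be checked by evaluating the fitted house-price line directly. This is a minimal sketch: it assumes Price is recorded in thousands of dollars (which is what the dollar figures on the slides imply), and the 2000 sqft starting size is an arbitrary choice.

```python
# Coefficients of the fitted line from the slides.
# Assumption: Price is measured in thousands of dollars.
INTERCEPT = -90.2458
SLOPE = 0.1598


def predicted_price_thousands(sqft):
    """Predicted price (in thousands of dollars) for a home of the given size."""
    return INTERCEPT + SLOPE * sqft


# Slope: change in the prediction when x increases by 1 (2000 sqft is arbitrary).
change_thousands = predicted_price_thousands(2001) - predicted_price_thousands(2000)
slope_dollars = round(change_thousands * 1000, 2)

# Intercept: the prediction at x = 0, which lies far outside realistic home sizes.
intercept_dollars = round(predicted_price_thousands(0) * 1000, 2)

print(f"Change per extra sqft: ${slope_dollars:,.2f}")
print(f"Prediction at 0 sqft: {intercept_dollars:,.2f} dollars")
```

Because the line is straight, the change per extra square foot is the same whichever starting size is used.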
OECD data: Income and unemployment in the U.S.
 What is the relationship between households’ disposable income and the nation’s unemployment rate?
 Data from the U.S. 1980 to 1998
 (data provided by the economics department at Duke)
Prediction
Price of Homes Based on Square Feet
Price = -90.2458 + 0.1598SQFT
r = 0.8718945
For a 3500 sqft home we would predict the selling price to be
price = -90.2458 + 0.1598*3500 = 469.0542 (thousands of dollars)
price = $469,054.20
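
The house-price prediction for a 3500 sqft home can be reproduced in a few lines. This is a sketch only: the function name `predict_price_dollars` is made up for this example, and it assumes Price in the fitted line is in thousands of dollars.

```python
def predict_price_dollars(sqft, intercept=-90.2458, slope=0.1598):
    """Plug sqft into the fitted line and convert thousands of dollars to dollars.

    Assumption: the line Price = intercept + slope * SQFT gives thousands of dollars.
    """
    price_thousands = intercept + slope * sqft
    return round(price_thousands * 1000, 2)


price = predict_price_dollars(3500)
print(f"Predicted selling price: ${price:,.2f}")
```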
Disposable income vs unemployment rates
 Disposable income and unemployment rates regression output
Does regression fit data well?
 A regression line is reasonable if
 The association between the two variables is indeed linear
 The points are randomly scattered around the line
 Income/unemployment rate data are well described by the regression line.
 Regression of AIDS rates per 1000 people on GNP per capita
 Line is too low for GNP values near zero and too high for big GNP values.
 We shouldn’t use the line for predictions
Birth and death rates in 74 countries
Changing the response variable
 When the regression line fits the data badly, sometimes you can transform variables to obtain a better fitting line.
 With monetary variables, typically this can be accomplished by taking logarithms.
 Regression of log(AIDS) on log(GNP)
 Much better fit
 Predict log(AIDS) from log(GNP). Exponentiate to estimate AIDS
Facts about regression
 The distinction between explanatory and response variable is essential in regression
 If you have a slope computed using x as the explanatory and y as the response variable, you can’t “back solve” to get predictions of x given y
 If you want to predict x given y, then you must find the intercept and slope with y being the explanatory variable and x being the response
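
To see why back solving does not work, the sketch below fits both regressions to the same data and compares the two answers for x at a given y. It assumes the regression line is the usual least-squares line. The five data points and the target value 4.5 are hypothetical, chosen only to keep the arithmetic simple.

```python
from statistics import mean

# Hypothetical (x, y) data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]


def least_squares(explanatory, response):
    """Return (intercept, slope) of the least-squares line for response on explanatory."""
    m_e, m_r = mean(explanatory), mean(response)
    s_er = sum((e - m_e) * (r - m_r) for e, r in zip(explanatory, response))
    s_ee = sum((e - m_e) ** 2 for e in explanatory)
    slope = s_er / s_ee
    return m_r - slope * m_e, slope


a_yx, b_yx = least_squares(xs, ys)  # x explanatory, y response
a_xy, b_xy = least_squares(ys, xs)  # y explanatory, x response

y_target = 4.5
x_back_solved = (y_target - a_yx) / b_yx  # inverting the y-on-x line (not recommended)
x_regressed = a_xy + b_xy * y_target      # using the x-on-y regression line

print(f"Back-solved x: {x_back_solved:.3f}")
print(f"Regression of x on y: {x_regressed:.3f}")
```

The two answers differ (4.875 versus 4.2 for these points), and they would agree only if the points fell exactly on a line.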
Facts about regression
 There is a close relationship between the correlation coefficient and the slope of a regression line
$b = r \frac{SD_y}{SD_x}$
 They have the same sign
 They are proportional to each other
 The intercept has no relationship with the correlation coefficient but here is the formula
$a = \bar{y} - b\bar{x}$
Warnings about regression
 Predicting y at values of x beyond the range of x in the data is called extrapolation
 This is risky, because we have no evidence to believe that the association between x and y remains linear for unseen x values
 Extrapolated predictions can be absolutely wrong
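
The slope and intercept formulas from the facts about regression (slope b equals r times SDy over SDx, intercept a equals the mean of y minus b times the mean of x) can be checked numerically. The sketch below uses hypothetical data invented for illustration. It computes the SDs with Python's `statistics.stdev`, which divides by n - 1; the ratio SDy/SDx is the same if both SDs divide by n instead.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical (x, y) data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 1.0, 7.0, 5.0, 9.0]


def correlation(xs, ys):
    """Correlation coefficient r of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = correlation(xs, ys)

# Slope from the correlation and the two SDs.
b = r * stdev(ys) / stdev(xs)

# Intercept from the means and the slope.
a = mean(ys) - b * mean(xs)

print(f"r = {r:.3f}, slope b = {b:.3f}, intercept a = {a:.3f}")
```

Since both SDs are positive, b always has the same sign as r, which matches the first fact listed.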
Extrapolation
 Diamond price and carat
 Explanatory variable is measured in carats and response variable is in dollars
 Predict price of the Hope Diamond
$\hat{y} = 48.88 + 2430.77(45.52) = \$110{,}697.53$
Extrapolation
 The relationship between diamond carat and price doesn’t remain linear after a carat size of about 0.4
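
The Hope Diamond calculation can be reproduced as follows, with a simple flag showing that the prediction is an extrapolation. This is a sketch: the constant names are made up for this example, and the 0.4 carat cutoff is taken from the slide's remark about where the linear pattern stops holding, not from the actual range of the data.

```python
# Coefficients of the fitted diamond line from the slides (price in dollars).
INTERCEPT = 48.88
SLOPE = 2430.77

HOPE_DIAMOND_CARATS = 45.52

# Assumption: treat about 0.4 carats as the upper end of the range where the
# linear pattern holds, following the slide's remark.
LINEAR_RANGE_MAX_CARATS = 0.4


def predict_price(carats):
    """Predicted price in dollars from the fitted line."""
    return INTERCEPT + SLOPE * carats


price = round(predict_price(HOPE_DIAMOND_CARATS), 2)
is_extrapolation = HOPE_DIAMOND_CARATS > LINEAR_RANGE_MAX_CARATS

print(f"Predicted price: ${price:,.2f}")
print(f"Beyond the range where the line is trusted: {is_extrapolation}")
```

The arithmetic is correct, but the prediction sits more than a hundred times beyond the carat size where the line is trusted, so the number should not be believed.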
Extrapolation
 Green line is linear fit with only diamonds less than 0.4 carats
 Blue line is linear fit with all carat sizes
 Red curve is a quadratic fit
Lurking variable
 A variable not being considered could be driving the relationship
 In practice this is a difficult issue to tackle, especially when everything seems OK
Influential point
 An outlier in either the X or Y direction which, if removed, would markedly change the value of the slope and y-intercept.
 applet
Causality
 On its own, regression only quantifies an association between x and y
 It does not prove causality
 Under a carefully designed experiment (or in some cases observational studies) regression can be used to show causality.