
Review for Linear Regression

Association: two variables are associated if knowing the values of one of the variables
tells you something about the values of the other variable
o A lot of the time, the goal is causality, not association
Association does not mean causality
Response variable (y): outcome of the study
o In statistics, it is not called the dependent variable because dependence implies
causation, which association alone does not establish
Explanatory variable (x): explains or causes changes in the response variable
On a scatterplot, explanatory variable is on the x axis and response variable is on the y
axis.
A deterministic relationship between X and Y means that Y is a function g of X
o Y = g(X)
Scatterplot is described using form, direction, strength, and outliers
o Form: linear, curved, clusters, no pattern
o Direction: positive, negative, horizontal
o Strength: strong or weak, how close the points are to the line
o Outliers: outliers in the y direction usually do not affect the regression line
much. However, outliers in the x direction (high-leverage points) can change the
line greatly, so always remember to look for x-axis outliers (a small demonstration
follows this list).
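A minimal R sketch, with made-up numbers, of how a single x-axis outlier can drag the slope:
# simulate data that follow a line with slope 0.5
set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 1)
coef(lm(y ~ x))                  # slope near 0.5
x2 <- c(x, 60); y2 <- c(y, 10)   # add one extreme-x (high-leverage) point
coef(lm(y2 ~ x2))                # slope pulled well away from 0.5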
Regression line: a straight line that describes how a response variable y changes as an
explanatory variable x changes
o We can use a regression line to predict the value of y for a given value of x
Notation:
o n independent observations
o xi are the observed values of the explanatory variable
o yi are the observed values of the response variable
o we have n ordered pairs (xi , yi)
simple linear regression model (a simulation sketch follows this list)
o Let (xi, yi) be pairs of observations. We assume that there exist constants β0 and
β1 such that Yi = β0 + β1Xi + εi, where εi ~ N(0, σ2)
▪ xi is fixed
▪ the variance of Yi is the same as the variance of εi, since β0, β1, and Xi are
constants
▪ all the randomness in Y comes from ε
o E(Yi) = β0 + β1Xi
o εi are random deviations from the line, or random error terms, relative to the true
regression line
▪ The error terms are just the difference of each point from the line in the y
direction
▪ iid: independent and identically distributed
▪ E(Yi) = β0 + β1Xi is the deterministic part of the model because the error
term has a mean of 0
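A minimal simulation sketch of this model, using made-up values β0 = 2, β1 = 0.5, and σ = 1:
set.seed(42)
n   <- 50
x   <- runif(n, 0, 10)             # the xi are treated as fixed
eps <- rnorm(n, mean = 0, sd = 1)  # εi ~ N(0, σ²), iid, here σ = 1
y   <- 2 + 0.5 * x + eps           # all randomness in Y comes from ε
coef(lm(y ~ x))                    # estimates land near β0 = 2, β1 = 0.5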
Assumptions for linear regression
o SRS with the observations independent of each other
o The relationship between X and Y is linear in the population
o The residuals have a normal distribution
o The standard deviation of the residuals is constant (a diagnostic sketch checking
these assumptions follows below)
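A quick diagnostic sketch for checking these assumptions, assuming a hypothetical data frame Table with columns XVar and YVar (the same names used in the test templates later in these notes):
Table.lm <- lm(YVar ~ XVar, data = Table)
plot(fitted(Table.lm), resid(Table.lm))  # want constant spread, no pattern
abline(h = 0)
qqnorm(resid(Table.lm))                  # roughly straight line => normal residuals
qqline(resid(Table.lm))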
SSR: variance due to the line, b1*Sxy
SSE: variance due to error, SST – SSR
SST: total variance, Syy or SSR + SSE
dfr: the equation for the line has 2 constants and there is 1 overall mean, so dfr = 2 – 1 = 1
dfe: n – 2
dft: n – 1
MSR: SSR/dfr = SSR (since dfr = 1)
MSE: SSE/dfe
MST: SST/dft (these identities are checked numerically in the sketch below)
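A numeric sanity check of these identities, continuing the hypothetical Table / Table.lm example from above:
y   <- Table$YVar
SST <- sum((y - mean(y))^2)   # Syy, total variance
SSE <- sum(resid(Table.lm)^2) # variance due to error
SSR <- SST - SSE              # variance due to the line, also b1*Sxy
n   <- nrow(Table)
MSR <- SSR / 1                # dfr = 1
MSE <- SSE / (n - 2)          # dfe = n - 2
MST <- SST / (n - 1)          # dft = n - 1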
Facts about least square regression
o Slope: the change in y for a one-unit change in x
o Intercept: the value of y when x = 0; often there is no practical significance to
the value when x = 0, so most of the time this value is not relevant
o The line always passes through the point (x̄, ȳ)
o There is an inherent difference between x and y
▪ If we switch x and y, the slope will change and the assumptions will
change
▪ We assume that x is fixed, so if we switch x and y, we are now
assuming that y is fixed
o b1 = Sxy/Sxx
▪ Sxy = Σ(xi – x̄)(yi – ȳ); for the slope to be negative, Sxy must be less
than 0
▪ Sxx = Σ(xi – x̄)², which can never be negative
o ŷ is an unbiased estimator for µy given x
o b0 is an unbiased estimator for β0
o b1 is an unbiased estimator for β1
o the residual, ei, is yi – ŷi
o s2 = SSE/dfe = MSE
o sqrt(MSE) = s, a good but biased estimator of the standard deviation σ (these
quantities are computed by hand in the sketch below)
R2: coefficient of determination, fraction of the variation of the values of y that is
explained by the least-squares regression of y on x, SSR/SST
o Proportion of the variance of the response variable that is explained by the
linear relationship with the explanatory variable
o If this is high, this is a good fit because most of the variance is due to the line, not
due to the residuals or error
o If this is low, then we don’t have a good fit because the line doesn’t explain the
variance of Y
o R2 by itself does not tell you whether the data points are linear. If R2 is
high, the relationship is probably linear; if it is low, we know nothing about
linearity unless the data points are plotted. A low R2 means that either the
points are not close to the line or the relationship is not linear.
o R2 is not resistant to outliers
o Just because R2 is large, it doesn't mean that you can make a good
prediction, because R2 is a ratio. For linear regression, you also need to know
MSE (the measure of the absolute error) to know whether a prediction is valid.
o r2 = R2 holds only for simple linear regression (verified numerically in the
sketch below)
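A numeric check that R2 = SSR/SST and that r2 = R2 for simple linear regression, again using the hypothetical Table example:
R2  <- summary(Table.lm)$r.squared
SST <- sum((Table$YVar - mean(Table$YVar))^2)
SSE <- sum(resid(Table.lm)^2)
(SST - SSE) / SST               # SSR/SST, matches R2
cor(Table$XVar, Table$YVar)^2   # r² also matches R2 (simple linear regression only)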
Sample correlation: r, is a measure of the strength of a linear relationship between two
continuous variables, Sxy/sqrt(Sxx*Syy)
o Correlation makes no distinction between explanatory and response variables
▪ If X and Y were switched, there would be no change to the calculation
o r has no units and does not change when the units of x and y change
o r > 0, positive association
o r < 0, negative association
o r is always a number between -1 and 1
o -1 < r < -0.8, 0.8 < r < 1, strong correlation
o -0.8 < r < -0.5, 0.5 < r < 0.8, moderate correlation
o -0.5 < r < 0.5, weak correlation
o r = 0, x and y are linearly uncorrelated, does not mean there is no association
between x and y. This is saying that there is no linear association between x and y
o correlation requires that both variables be quantitative
o correlation measures the strength of linear relationships only
o correlation is not resistant to outliers
o Correlation is not a complete summary of bivariate data, it does not provide
information on form
o ALWAYS PLOT YOUR DATA (see the sketch below)
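A short sketch of these properties, using the hypothetical Table data as before:
x <- Table$XVar; y <- Table$YVar
cor(x, y); cor(y, x)   # identical: no explanatory/response distinction
cor(2.54 * x, y)       # same value: r has no units
plot(x, y)             # always plot your data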
Analyzing the y-intercept only makes sense if the y-intercept holds physical
meaning and x = 0 is realistic
Why is a residual plot useful?
o It is easier to judge points relative to a horizontal line than to a slanted line
o The vertical scale is magnified, so patterns in the residuals are easier to see
(compare the two panels in the sketch below)
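A sketch putting the raw scatterplot and the residual plot side by side for the hypothetical Table.lm fit:
op <- par(mfrow = c(1, 2))
plot(Table$XVar, Table$YVar); abline(Table.lm)     # judge points vs. slanted line
plot(Table$XVar, resid(Table.lm)); abline(h = 0)   # judge points vs. horizontal line
par(op)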
Fts: compares the variance due to the model to the variance due to the residuals or error
o If the test statistic is large, most of the variance in the response variable is due to
the model so there is an association between X and Y. This means that there is a
“large” slope.
o If the test statistic is small, the variance due to the model is around the same or
less than the variance due to the residuals so the slope is approximately equal to 0
or there is no association.
The model utility test is for testing whether or not there is an association between two
variables.
If you want to know whether the slope is positive or negative, then you have to use the
test of significance on the slope.
The value of Fts is tts squared (Fts = tts², demonstrated in the sketch below)
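A check that Fts = tts² on the hypothetical Table.lm fit; "XVar" is the assumed name of the explanatory variable:
Fts <- anova(Table.lm)["XVar", "F value"]
tts <- summary(Table.lm)$coefficients["XVar", "t value"]
all.equal(Fts, tts^2)   # TRUE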
Caution about correlation and regression
o Requires good experimental design
o Both describe linear relationship
o Both are affected by outliers
o Beware of extrapolations
o Beware of lurking variables
Extrapolation occurs when you predict outside of the range of the explanatory
variable, where there are no observed points. Valid prediction stays within the range of the x-axis.
o You cannot accurately predict anything about the response variable if it is outside
the range of the x-axis
If there are lurking variables, then you might not be measuring what you think you are
If you want to determine causation from association, you need to look at additional
information
SEµ-hat*: the standard error associated with the mean response
o Consists of two parts: the 1/n term comes from estimating the y-intercept, and the
(x* – x̄)²/Sxx term comes from estimating the slope
o None of this error comes from the scatter of an individual point
o The part of the variance that comes from the slope term depends on the value of
x-star; the variance increases as x-star moves further from the mean value of x
SEy-hat*: the standard error associated with predicting a single new observation; it adds
an extra term for the point's own scatter
The confidence band tells us the confidence of the equation of the line. Therefore, the
actual data points are not necessarily included in the shaded area. However, the
prediction band is telling you what the next value will be. All or most of the data points
will be included in the area.
The confidence interval will be narrower than the prediction interval because the
prediction interval has the added uncertainty of the individual point, which is σ2.
Therefore, the SE is bigger for prediction intervals (compare the two standard errors in
the sketch below).
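A sketch comparing the two standard errors at a chosen x-star, reusing MSE, n, and Sxx from the earlier sketches; xstar = 5 is a made-up value:
xbar  <- mean(Table$XVar)
xstar <- 5                                                 # hypothetical x value of interest
SE.mean <- sqrt(MSE * (1/n + (xstar - xbar)^2 / Sxx))      # SE for the mean response
SE.pred <- sqrt(MSE * (1 + 1/n + (xstar - xbar)^2 / Sxx))  # SE for a new observation
SE.pred > SE.mean                                          # always TRUE: prediction intervals are wider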
To make a good prediction
o Precise prediction interval
o Low MSE
o High n
o Large Sxx
Model Utility F test
Step 1: define the terms
Not needed for model utility F test
Step 2: state the hypotheses
H0: there is no association between X and Y
Ha: there is an association between X and Y
Step 3: state the test statistic, df, p-value
Fts = MSR/MSE, df1 = dfr = 1, df2 = dfe = n – 2
If data is unknown
p-value = pf(Fts,df1,df2,lower.tail=FALSE)
If data is known
Table.lm <- lm(YVar~XVar, data = Table)
summary(Table.lm)
Step 4: conclusion in context
Reject or fail to reject H0 because…
The data does [not] provide [strong] support (p = value) to the claim that there is a linear
relationship between…
Test of Significance on the Slope
Step 1: define the terms
β1 is the population slope of [] vs []
Step 2: state the hypotheses
H0: β1 = β1,0 = 0
Ha: β1 ≠ β1,0
Note: β1,0 has a 0 subscript after the 1; it is not β10
Step 3: state the test statistic, df, p-value
tts = b1/SE or b1/sqrt(MSE/Sxx), df = dfe = n – 2
if data is unknown
2*pt(abs(tts), df, lower.tail = FALSE)
Step 4: conclusion in context
Reject or fail to reject H0 because…
The data does [not] provide [strong] support (p = value) to the claim that there is a linear
relationship between…
Confidence interval for the slope
If data is unknown
t <- qt(alpha/2,df,lower.tail=FALSE)
SE <- sqrt(MSE/Sxx)
c(b1-t*SE, b1+t*SE)
if data is known
Table.lm <- lm(YVar~XVar, data = Table)
summary(Table.lm)
confint(Table.lm, level = C)
Confidence Interval for Mean at a Point (SEµ-hat*)
If data is unknown
t <- qt(alpha/2, df, lower.tail = FALSE)
SE <- sqrt(MSE*(1/n + (xstar - xbar)^2/Sxx))
c(b0 + b1*xstar - t*SE, b0 + b1*xstar + t*SE)
If data is known
newdata <- data.frame(XVar = NewValue)
predict(Table.lm, newdata, interval = "confidence", level = 0.99)
We are C% confident that the population mean y is covered by the interval () when x is x*
Prediction Interval for a Single Observation at a Point (SEy-hat*)
If data is unknown
t <- qt(alpha/2, df, lower.tail = FALSE)
SE <- sqrt(MSE*(1 + 1/n + (xstar - xbar)^2/Sxx))
c(b0 + b1*xstar - t*SE, b0 + b1*xstar + t*SE)
If data is known
newdata <- data.frame(XVar = NewValue)
predict(Table.lm, newdata, interval = "prediction", level = 0.99)
We are C% confident that the next y is covered by the interval () when x is x*