EXST3999 SLR Lecture01

EXST 3999
Simple Linear Regression
Spring 2015
Simple Linear Regression (review?)
The objective: Given points plotted on two coordinates, Y and X, find the best line to fit the data.
(Figure: scatter plot of Y, the dependent variable, against X, the independent variable.)
The concept: Data consists of paired observations with a presumed potential for the existence of some
underlying relationship. We wish to determine the nature of the relationship and quantify it if it exists.
Note that we cannot prove that the relationship is causal by using regression (i.e. we cannot prove
cause and effect). Regression can only show whether a “correlation” exists, and provide an equation
for the relationship.
Given a data set consisting of paired, quantitative variables, and recognizing that there is variation in
the data set, we will define,
POPULATION MODEL (SLR): Yi = β0 + β1Xi + εi
This is the model we will fit. It is the equation describing a straight line for a population, and
we want to estimate the parameters in the equation. The population parameters to be
estimated for the underlying model, μY·X = β0 + β1Xi, are:
μY·X = the true population mean of Y at each value of X
β0 = the true value of the Y intercept
β1 = the true value of the slope, the change in Y per unit of X

Terminology
Dependent variable: variable to be predicted
Y = dependent variable (all variation occurs in Y)
Independent variable: predictor or regressor variable
X = independent variable (X is measured without error)
Intercept: value of Y when X = 0, point where the regression line passes through the Y axis. The
units on the intercept are the same as the “Y” units
Slope: the value of the change in Y for each unit increase in X. The units on the slope are “Y”
units per “X” unit
Deviation: distance from an observed point to the regression line, also called a residual.
Least squares regression line: the line that minimizes the sum of the squared vertical distances from
the individual observations to the line.
(Figure: regression line fitted through the scatter of Y, the dependent variable, against X, the
independent variable, showing the intercept and the deviations of the observations from the line.)
The regression line itself represents the mean of Y at each value of X (μY·X).
Regression calculations
All calculations for simple linear regression start with the same values. These are,
ΣXi, ΣXi², ΣYi, ΣYi², ΣXiYi, and n   (all sums taken over i = 1 to n)
Calculations for simple linear regression are first adjusted for the mean. These are called “corrected
values”. They are corrected for the MEAN by subtracting a “correction factor”.
As a result, all simple linear regressions are adjusted for the mean of X and Y and pass through the
point (X̄, Ȳ).
(Figure: the fitted line passes through the point (X̄, Ȳ).)
The original sums and sums of squares of Y are distances and squared distances from zero. These are
referred to as “uncorrected”, meaning unadjusted for the mean.
(Figure: deviations of Y measured from zero, i.e. uncorrected.)
The “corrected” deviations sum to zero (half negative and half positive) and the sums of the squares
are squared distances from the mean of Y.
(Figure: deviations of Y measured from the mean of Y, i.e. corrected.)
Once the means (X̄, Ȳ) and the corrected sums of squares and cross products (SXX, SYY, SXY) are
obtained, the calculations for the parameter estimates are:
Slope: b1 = SXY / SXX
Intercept: b0 = Ȳ – b1X̄
We have fitted the sample equation Yi = b0 + b1Xi + ei, which estimates the population parameters of
the model Yi = β0 + β1Xi + εi.
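As a minimal SAS sketch (the data set name demo and the variables x and y are hypothetical names),
PROC REG produces these same estimates directly, along with their standard errors:

   proc reg data=demo;     /* demo, x, and y are placeholder names */
     model y = x;          /* fits Yi = b0 + b1Xi + ei; the intercept is included by default */
   run;
   quit;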
Variance estimates for regression
After the regression line is fitted, variance calculations are based on the deviations from the
regression. From the regression model Yi = b0 + b1Xi + ei we derive the formula for the deviations,
ei = Yi – (b0 + b1Xi), or ei = Yi – Ŷi.
(Figure: residuals shown as vertical deviations of the observations from the fitted line.)
As with other calculations of variance, we calculate a sum of squares (corrected for the mean).
This is simplified by the fact that the deviations, or residuals, already have a mean of zero,
SSResiduals = Σei² = SSError   (summed over i = 1 to n).
The degrees of freedom (d.f.) for the variance calculation is n–2, since two parameters are
estimated prior to the variance (β0 and β1).
The variance estimate is called the MSE (Mean Square Error). It is the SSError divided by the d.f.,
MSE = SSError / (n–2).
The variances for the two parameter estimates and the predicted values are all different, but all are
based on the MSE, and all have n–2 d.f. (t-tests) or n–2 d.f. for the denominator (F tests).
Variance of the slope = MSE / SXX
Variance of the intercept = MSE [ 1/n + X̄²/SXX ]
Variance of a predicted value at Xi = MSE [ 1/n + (Xi – X̄)²/SXX ]
Any of these variances can be used for a t-test of an estimate against an hypothesized value for the
appropriate parameter (i.e. slope, intercept or predicted value respectively).
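As a sketch of how these variances are used in SAS (again with the hypothetical names demo, x, and y),
PROC REG can report confidence limits for the parameter estimates and the standard errors of
predicted values:

   proc reg data=demo;
     model y = x / clb;                /* CLB adds confidence limits for b0 and b1 */
     output out=pred p=yhat            /* predicted values */
            stdp=se_mean               /* standard error of the mean prediction at each Xi */
            lclm=lo_mean uclm=hi_mean; /* confidence limits for the mean of Y at each Xi */
   run;
   quit;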
ANOVA table for regression
A common representation of regression results is an ANOVA table. Given the SSError (sum of
squared deviations from the regression), and the initial total sum of squares (SYY), the sum of
squares of Y adjusted for the mean, we can construct an ANOVA table
Simple Linear Regression ANOVA table
Source          d.f.     Sum of Squares     Mean Square    F
Regression      1        SSRegression       MSReg          MSReg / MSError
Error           n–2      SSError            MSError
Total           n–1      SYY = SSTotal
In the ANOVA table
The SSRegression and SSError sum to the SSTotal, so given the total (SYY) and one of the
two terms, we can get the other.
The easiest to calculate first is usually the SSRegression since we usually already have the
necessary intermediate values.
SSRegression = (SXY)² / SXX
The SSRegression is a measure of the “improvement” in the fit due to the regression line.
The deviations start at SYY and are reduced to SSError. The difference is the improvement,
and is equal to the SSRegression.
This gives another statistic called the R2. What portion of the SSTotal (SYY) is accounted for
by the regression?
R2 = SSRegression / SSTotal
The degrees of freedom in the ANOVA table are,
n–1 for the total, one lost for the correction for the mean (which also fits the intercept)
n–2 for the error, since two parameters are estimated to get the regression line.
1 d.f. for the regression, which is the d.f. for the slope.
Statistics quote: He uses statistics as a drunken man uses lampposts -- for support rather than for illumination.
Andrew Lang (1844-1912), Scottish poet, folklorist, biographer, translator, novelist, and scholar
The F test is constructed by calculating the MSRegression / MSError. This F test has 1 in the
numerator and (n–2) d.f. in the denominator. This is exactly the same test as the t-test of the slope
against zero.
To test the slope against an hypothesized value (say zero) using the t-test with n–2 d.f., calculate
t = (b1 – b1,Hypothesized) / Sb1 = (b1 – 0) / √(MSE/SXX)
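In SAS, a test against a nonzero hypothesized value can be requested with a TEST statement. The
sketch below uses hypothetical names, and 1.5 is an arbitrary hypothesized slope; SAS reports the
result as an F test with 1 and n–2 d.f., which is the square of the t above:

   proc reg data=demo;
     model y = x;
     test x = 1.5;       /* F test of H0: slope = 1.5 (hypothetical value) */
   run;
   quit;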
Assumptions for the Regression
We will recognize 4 assumptions
1) Normality – We take the deviations from regression and pool them all together into one estimate
of variance. Some of the tests we use require the assumption of normality, so these deviations
should be normally distributed.
For each value of X there is a population of values for the variable Y (normally distributed).
2) Homogeneity of variance – When we pool these deviations (variances) we also assume that the
variances are the same at each value of Xi. In some cases this is not true, particularly when the
variance increases as X increases.
3) X is measured without error! Since variances are measured only vertically, all variance is in Y,
no provisions are made for variance in X.
4) Independence. This enters in several places. First, the observations should be independent of
each other (i.e. the value of ei should be independent of ej, for i ≠ j). Also, in the equation for the
line Yi  b0  b1 X i  ei we assume that the term ei is independent of the rest of the model. We will
talk more of this when we get to multiple regression.
So the four assumptions are:
• Normality
• Homogeneity of variance
• Independence
• X measured without error
These are explicit assumptions, and we will examine or test these assumptions when possible. There
are also some other assumptions that I consider implicit. We will not state these, but in some cases
they can be tested. For example,
• There is order in the Universe. Otherwise, what are you investigating? Chaos?
• The underlying fundamental relationship that I just fitted a straight line to really is a straight
  line. Sometimes this one can be examined statistically.
Curvilinear Regression or Intrinsically Linear Regression
As the name implies, these are regressions that fit curves. However, the regressions we will
discuss are also linear models, so most of the techniques and SAS procedures we have discussed
will still be relevant.
We will discuss two basic types of curvilinear model.
The first are models that are not linear, but that can be “linearized” by transformation. These
models are referred to as “intrinsically linear”, because after transformation they are linear.
Later we will cover polynomial regressions. These are an extraordinarily flexible family of
curves that will fit almost anything. Unfortunately, they rarely have a good interpretation of
the parameter estimates.
Intrinsically linear models
These are models that contain some transformed variable, logarithms, inverses, square roots, sines,
etc. We will concentrate on logarithms, since these models are some of the most useful.
What is the effect of taking a logarithm of a dependent or independent variable? For example,
instead of Yi = b0 + b1Xi + ei, fit log(Yi) = b0 + b1Xi + ei.
If we fit log(Yi) = b0 + b1Xi + ei, then the original model, before we took logarithms, must
have been Yi = b0·e^(b1Xi)·ei, where “e” is the base of the natural logarithm (2.718281828). This
model is called the “Exponential Growth model” if b1 is positive, or the exponential decay
model if it is not. It is used in the biological sciences to fit exponential growth (+b1) or
mortality (–b1). This function produces a curve that increases or decreases proportionally.
(Figure: exponential growth and decay curves for the exponential model Yi = b0·e^(b1Xi)·ei.)
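A minimal sketch of fitting the exponential model by transformation in SAS (the data set grow and
the variables x and y are hypothetical):

   data grow2;
     set grow;            /* hypothetical data set with variables x and y */
     logy = log(y);       /* LOG is the natural log in SAS */
   run;

   proc reg data=grow2;
     model logy = x;      /* the slope estimates b1; exp(intercept) estimates b0 */
   run;
   quit;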
Other examples of curvilinear models.
Power model: Yi = b0·Xi^b1·ei. Taking log(Yi) = log(b0·Xi^b1·ei) gives
log(Yi) = log(b0) + b1·log(Xi) + log(ei).
This is a simple linear regression with a dependent variable of log(Yi) and an independent variable
of log(Xi). This model is used to fit many things, including morphometric data. The following graphic
was fitted for (b0, b1) values of (29, –1), (19, 0), (4, 0.5), (1, 1), and (0.03, 2).
(Figure: power-model curves illustrating b1 negative, b1 = 0, 0 < b1 < 1, b1 = 1, and b1 > 1.)
Models following a hyperbolic shape, with an asymptote, can be fitted with inverses. A model with
an inverse (e.g. 1/Xi) will fit a “hyperbola”, with its asymptote:
Yi = b0 + b1(1/Xi) + ei, where b0 fits the asymptote.
These are a few of many possible curvilinear regressions. Models including power terms, exponents,
logarithms, inverses, roots, and trigonometric functions may fit curvilinear relationships.
(Figure: hyperbolic curves Yhat = 10 + 10(1/Xi) and Yhat = 10 – 10(1/Xi), each approaching the
asymptote at 10.)
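A sketch of the inverse transformation in SAS (the data set hyp and the variables x and y are
hypothetical):

   data hyp2;
     set hyp;
     invx = 1 / x;        /* inverse of X; assumes X is never zero */
   run;

   proc reg data=hyp2;
     model y = invx;      /* the intercept b0 estimates the asymptote */
   run;
   quit;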
A note on logarithms. The model described above requires natural logs. In SAS the function
“LOG” produces natural logs (LOG10 gives log base 10). In EXCEL and on many calculators the
natural log function is “LN”.
Curvilinear Residual Patterns
Transformations of Yi, like log transformations, will affect homogeneity of variance. For such a
transformation to be appropriate, the raw data should actually appear nonhomogeneous (typically
with variance increasing as the mean increases).
Transformations of Xi will fit curves but will not affect the homogeneity of variance.
Polynomials assume homogeneous variance and will not adjust variance.
Intrinsically Linear (Curvilinear) regression example
Remember our SLR example about the amount of wood harvested from trees, predicted on the
basis of DBH (diameter at breast height)? Recall that it looked a little curved, and maybe even
had nonhomogeneous variance? Let's take another look at that model.
Typically, morphometric relationships (between parts of an organism) are best fitted with
models with both Log(Y) and Log(X).
• Fish length - scale length
• Fish total length - fish fork length
• Crab width - crab length
• Fish length - fish weight, etc.
Statistics quote: There are three kinds of lies: lies, damned lies, and statistics. Benjamin Disraeli (1804 - 1881)
Here we have tree diameter and tree weight. Let's try a log-log model (using natural logs).
Since we are fitting a linear measurement to a volumetric or weight measurement, I
expect the following relationship to apply.
1 g = 1 cm³, in the metric system, if the material has the same density as water (specific
gravity = 1).
In other words, I expect
Wood weight = b0·(wood length)³
log(Wood weight) = log(b0) + 3·log(wood length)
Note that this line will go through the origin, no problem there.
This is the power model Yi = b0·Xi^b1·ei.
The power term should be approximately 3, so we will test the coefficient of DBH
against a value of 3.
See computer output.
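That output might come from a run like the following sketch (the data set trees and the variables dbh
and weight are hypothetical stand-ins for the class example):

   data trees2;
     set trees;
     logwt  = log(weight);     /* natural logs */
     logdbh = log(dbh);
   run;

   proc reg data=trees2;
     model logwt = logdbh;
     test logdbh = 3;          /* F test of H0: slope = 3 */
   run;
   quit;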
The residual plot for the log-log model appears to show no curvature, no nonhomogeneous
variance, no obvious outliers, and no significant departure from the normal distribution.
In short, it is much improved, and probably fits better than the linear model and it is
interpretable. Geometrically, the model for a tree trunk should approximate a cylinder or cone,
depending on the taper in the bole.
Weight = C**(specific gravity)*(D/2)2*H
where C=1 for a cylinder and 1/3 for a cone.
Curvilinear Regression Notes and Summary
For transformed models,
The usual regression assumptions must be met for the transformed model, not the raw data
(homogeneity, normality, etc.).
Estimates, hypothesis tests and confidence intervals would be calculated for the transformed
model. The estimates and limits can then be detransformed.
A wide range of biometric situations calls for established curvilinear models. These would
include exponential growth, mortality, morphometric models, instrument standardization,
some other growth models (power models and quadratics have been used), and recruitment models.
Check the literature in your field to see what models are used.
Statistics quote: "In earlier times, they had no statistics, and so they had to fall back on lies". -- Stephen Leacock
Analysis of Covariance
Linear regression is usually done as a least squares technique applied to QUANTITATIVE
VARIABLES. ANOVA is the analysis of categorical (class, indicator, group) variables; there
are no quantitative “X” variables as in regression, but this is still a least squares technique.
It stands to reason that if regression uses the least squares technique to fit quantitative variables,
and ANOVA uses the same approach to fit qualitative variables, then we should be able to put both
together into a single analysis.
We will call this Analysis of Covariance. There are actually two conceptual approaches,
• Multisource regression – adding class variables to a regression
• Analysis of Covariance – adding quantitative variables to an ANOVA
For now, we will be primarily concerned with Multisource regression.
With multisource regression we start with a regression (one slope and one intercept) and ask,
would the addition of an indicator or class variable improve the model?
Adding a class variable to a regression gives each group its own intercept, fitting a separate
intercept to each group.
(Figure: two parallel regression lines, one intercept for each group.)
Adding an interaction fits a separate slope to each group.
(Figure: two regression lines, one slope for each group.)
How do they do that?
For a simple linear regression we start with Yi = b0 + b1X1i + ei.
Now add an indicator variable. An indicator variable, or dummy variable, is a variable that
uses values of “0” and “1” to distinguish between the members of a group. For example, if the
category, or groups, were MALE and FEMALE, we could give the females a “1” and the males
a “0” and distinguish between the groups. If the groups were FRESHMAN, SOPHOMORE,
JUNIOR and SENIOR we would need 3 variables to distinguish between them, the first would
have a “1” for freshmen and a “0” otherwise, the second and third would have “1” for
SOPHOMORE and JUNIOR, respectively, and a zero otherwise. We don’t need a fourth
variable for SENIOR because if we have an observation with values of 0, 0, 0 for the three
variables we know it has to be a SENIOR. So, we always need one less dummy variable than
there are categories in the group.
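A sketch of building these dummy variables in a SAS data step (the data set students and the variable
names below are hypothetical; a logical comparison in SAS returns 1 or 0):

   data students2;
     set students;                      /* level holds the class name, assumed uppercase */
     d_fresh = (level = 'FRESHMAN');    /* 1 for freshmen, 0 otherwise */
     d_soph  = (level = 'SOPHOMORE');
     d_jr    = (level = 'JUNIOR');      /* SENIOR is the 0, 0, 0 combination */
   run;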
Statistics quote: If at first it doesn't fit, fit, fit again. -- John McPhee
Fitting separate intercepts
In our example we will add just one indicator variable, but it could be several. We will call our
indicator variable X2i, but it is actually just a variable with values of 0 or 1 that distinguishes
between the two categories of the indicator variable.
Yi = b0 + b1X1i + b2X2i + ei
When X2i = 0 we get Yi = b0 + b1X1i + b2(0) + ei, which reduces to Yi = b0 + b1X1i + ei, a simple
linear model for the “0” group.
And when X2i = 1 we have Yi = b0 + b1X1i + b2(1) + ei, which simplifies to
Yi = (b0 + b2) + b1X1i + ei,
a simple linear model with an intercept equal to (b0 + b2) for the “1” group.
(Figure: two parallel lines with intercepts b0 and (b0 + b2).)
So there are two lines with intercepts of b0 and (b0+b2). Note that b2 is the difference between
the two lines, or an adjustment to the first line that gives the second line.
This can be positive or negative, so the second line may be above or below the first.
This term can be tested against zero to determine if there is a difference in the intercepts.
Fitting separate slopes
Adding an interaction (crossproduct term) between the quantitative variable (X1i) and the
indicator variable (X2i) will fit separate slopes. Still using just one indicator variable for two
classes, the model is Yi = b0 + b1X1i + b2X2i + b3X1iX2i + ei.
When X2i = 0 we get Yi = b0 + b1X1i + b2(0) + b3X1i(0) + ei, which reduces to Yi = b0 + b1X1i + ei,
a simple linear model for the “0” group.
When X2i = 1 then Yi = b0 + b1X1i + b2(1) + b3X1i(1) + ei, which simplifies to
Yi = (b0 + b2) + (b1 + b3)X1i + ei,
which is a simple linear model with an intercept equal to (b0 + b2) and a slope equal to
(b1 + b3) for the “1” group.
(Figure: two lines, one with intercept b0 and slope b1, the other with intercept (b0 + b2) and
slope (b1 + b3).)
Note that b0 and b1 are the intercept and slope for one of the lines (whichever was assigned the
0). The values of b2 and b3 are the intercept and slope adjustments (+ or – differences) for the
second line.
If these adjustments are not different from zero then the intercept and or slope are not different
from each other.
A third or fourth line could be fitted by adding additional indicator variables and interaction
with coefficients b4 and b5, etc.
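As a sketch, the two-intercept, two-slope model can be fit in PROC REG by adding the dummy variable
and its interaction (the data set and variable names are hypothetical; x2 is a 0/1 dummy built in the
data step):

   data both;
     set mydata;
     x1x2 = x1 * x2;          /* interaction term: gives the "1" group its own slope */
   run;

   proc reg data=both;
     model y = x1 x2 x1x2;    /* b2 and b3 are the intercept and slope adjustments */
   run;
   quit;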
Statistics quote: There are two kinds of statistics, the kind you look up and the kind you make up. -- Rex Stout
Conceptual application of Analysis of Covariance
Suppose we have a regression, and we would like to know if perhaps the regression is different for
some groups, perhaps male and female or treated and untreated.
Of course, regression starts with a corrected sum of squares (a flat line through the mean) and we
fit a slope. Viewing this as extra sums of squares we are comparing a model with no slope to a
model with a slope.
(a) A test of the difference between these two models tests the
hypothesis of adding a slope (H0: β1 = 0). The extra SS are SSX1 | X0, or just SSX1.
From the simple linear regression we want to test the hypotheses that the model is improved by
adding separate intercepts or slopes. The concepts are pretty simple. The additional variables can
be tested with extra SS or the General Linear Hypothesis Test (GLHT with full model and reduced
model).
For a single indicator variable the extra SS are easy enough. For several indicator variables it may
be easier to do the GLHT where the single line is the reduced model and the two lines are the full
model.
(b) A test of the difference between these two models tests the
hypothesis of adding an intercept adjustment (H0: β2 = 0), or tests for the difference between the
intercepts. The extra SS are SSX2 | X1.
The next test compares a model with two intercepts and one slope with a model containing two
slopes and two intercepts.
(c) Here we are testing the addition of a slope adjustment, or
second slope (H0: β3 = 0). This test determines if a model with 2 intercepts and 2 slopes is better
than a model with 2 intercepts and one slope. The extra SS are SSX1X2 | X1, X2.
This series a–b–c is the usual “Multisource regression” series. Note that the extra SS needed are:
SSX1; SSX2 | X1; SSX1X2 | X1, X2. These are
the Type I SS for the SAS model Y=X1 X2
X1*X2;. So this is another of those rare cases
where we may wish to use TYPE I tests.
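A sketch of requesting those sequential tests in PROC GLM (mydata is a hypothetical data set; X2 is
the 0/1 dummy built earlier, so no CLASS statement is needed here):

   proc glm data=mydata;
     model y = x1 x2 x1*x2 / ss1 ss3;   /* SS1 prints the sequential (Type I) tests */
   run;
   quit;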
The hypotheses to be tested can be simplified
to some extent if we know that for our base
model we want a regression. This is
multisource regression.
Start with a simple linear regression.
Test to see if separate intercepts improve the model (H0: β2 = 0).
Correction factor: 1 level = b0 (fit mean)
add b1: 1 level (b0), 1 slope (b1)
add b2: 2 levels (b0, b2), 1 slope (b1)
add b3: 2 levels (b0, b2) and 2 slopes (b1 and b3)
Test to see if separate slopes improve the model (H0: β3 = 0).
As extra SS we could test extra SS for b2 and then the extra SS for b3.
In practice we usually start with the fullest model (2 slopes and 2 intercepts)
and then reduce to 1 slope and 2 intercepts if the extra SS for b3 is not significant
and then to 1 slope and 1 intercept if extra SS for b2 is not significant.
So, interpretation usually proceeds from the bottom up; do we need 2 slopes and 2 intercepts, if
not do we need 2 intercepts, if not is the SLR significant?
If TYPE I sums of squares are used there would be little if any change from deleting the higher
order components of the model, so in this case we would not have to refit the model after each
decision.
The solution (i.e. regression coefficients) provided with ANCOVA is the correct model for
prediction, even though it is fully adjusted (Type III) and our tests of hypothesis are not (Type
I).
The progression discussed above corresponds to the likely series in multisource regression (a-b-c
below). In Analysis of Variance (our next topic), the researcher's interest is in differences among
the means of categorical variables. For this analysis the categorical variable can be put first and
the progression is the usual series for Analysis of Variance (d-e-c below). Which series of tests you
need depends on whether you are starting with a regression or an analysis of variance.
(Figure: diagram of the possible full and reduced models, with the comparisons between them
labeled a through g.)
Examine the possible full and reduced models in the diagram above. What is tested in each case,
what extra SS are needed for each test, and what are the initial and final models in each case?
One other test of occasional interest is the test denoted by the comparison “g”. If you fit a model
with two slopes and two intercepts, and cannot reduce to a model with two intercepts and one
slope, you may wonder if a model with two slopes and one intercept is appropriate.
Full model: Yi = b0 + b1X1i + b2X2i + b3X1iX2i + ei
Reduced model: Yi = b0 + b1X1i + b3X1iX2i + ei
The extra SS for this test is SSX2 | X1, X1*X2. This is actually a test of “intercepts fitted last”
and would be available as the Type III SS for X2 with the model Y=X1 X2 X1*X2;.
A few notes on ANCOVA.
The values of the slopes and intercepts fitted in the full model (2 slopes and 2 intercepts) are
exactly the same slopes and intercepts that would be fitted if the models were done separately.
However, the combined model has the advantage of having a more powerful pooled error term.
In SAS we can fit our own indicator variable in proc reg, building dummy variables in the data
step. However, if there are more than just two levels (one dummy variable) the dummy variables
should be treated as an aggregate.
The CLASS statement in PROC MIXED or GLM automatically creates the indicator
variables, and both will test TYPE I and TYPE III sums of squares.
We will use PROC MIXED in our numerical SAS example.
When we include the CLASS statement in SAS, the program assumes that we are doing an
ANOVA, and by default does not print the regression coefficients. If we want these we must
add the option “/solution” to the model statement. This is true for both PROC MIXED and
PROC GLM.
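A sketch of the ANCOVA fit in PROC MIXED (mydata, x, and group are hypothetical names):

   proc mixed data=mydata;
     class group;                                    /* creates the indicator variables */
     model y = x group x*group / solution htype=1;   /* /solution prints the coefficients;
                                                        htype=1 requests Type I (sequential) tests */
   run;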