EXST 3999 Simple Linear Regression Spring 2015

Simple Linear Regression (review?)

The objective: Given points plotted on two coordinates, Y and X, find the best line to fit the data.

[Figure: scatter plot of Y (the dependent variable) against X (the independent variable).]

The concept: Data consist of paired observations with a presumed potential for the existence of some underlying relationship. We wish to determine the nature of the relationship and quantify it if it exists. Note that we cannot prove that the relationship exists by using regression (i.e., we cannot prove cause and effect). Regression can only show whether a "correlation" exists, and provide an equation for the relationship.

Given a data set consisting of paired, quantitative variables, and recognizing that there is variation in the data set, we will define

POPULATION MODEL (SLR): Yi = β0 + β1·Xi + εi

This is the model we will fit. It is the equation describing a straight line for a population, and we want to estimate the parameters in the equation. The population parameters to be estimated for the underlying model, μ(y·x) = β0 + β1·Xi, are:

μ(y·x) = the true population mean of Y at each value of X
β0 = the true value of the Y intercept
β1 = the true value of the slope, the change in Y per unit of X

Terminology
Dependent variable: the variable to be predicted. Y = dependent variable (all variation occurs in Y).
Independent variable: the predictor or regressor variable. X = independent variable (X is measured without error).
Intercept: the value of Y when X = 0, the point where the regression line passes through the Y axis. The units on the intercept are the same as the "Y" units.
Slope: the change in Y for each unit increase in X. The units on the slope are "Y" units per "X" unit.
Deviation: the distance from an observed point to the regression line, also called a residual.

EXST 3999 Spring 2014 Page 2

Least squares regression line: the line that minimizes the squared distances from the line to the individual observations.
[Figure: the fitted regression line with the intercept and the deviations from individual observations marked; Y is the dependent variable, X the independent variable.]

The regression line itself represents the mean of Y at each value of X, μ(y·x).

Regression calculations
All calculations for simple linear regression start with the same values. These are:

ΣXi, ΣXi², ΣYi, ΣYi², ΣXiYi, and n (sums taken over i = 1 to n)

Calculations for simple linear regression are first adjusted for the mean. These are called "corrected values". They are corrected for the MEAN by subtracting a "correction factor". As a result, all simple linear regressions are adjusted for the means of X and Y and pass through the point (X̄, Ȳ).

The original sums and sums of squares of Y are distances and squared distances from zero. These are referred to as "uncorrected", meaning unadjusted for the mean. The "corrected" deviations sum to zero (half negative and half positive), and the sums of the squares are squared distances from the mean of Y.

Once the means X̄, Ȳ and the corrected sums of squares and cross products SXX, SYY, SXY are obtained, the calculations for the parameter estimates are:

Slope: b1 = SXY / SXX
Intercept: b0 = Ȳ − b1·X̄

We have fitted the sample equation Yi = b0 + b1·Xi + ei, which estimates the population parameters of the model Yi = β0 + β1·Xi + εi.

Variance estimates for regression
After the regression line is fitted, variance calculations are based on the deviations from the regression. From the regression model Yi = b0 + b1·Xi + ei we derive the formula for the deviations:

ei = Yi − b0 − b1·Xi, or ei = Yi − Ŷi

As with other calculations of variance, we calculate a sum of squares (corrected for the mean). This is simplified by the fact that the deviations, or residuals, already have a mean of zero:

SSResiduals = Σ ei² = SSError

The degrees of freedom (d.f.) for the variance calculation is n−2, since two parameters are estimated prior to the variance (β0 and β1).
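As a check on the arithmetic, the slope and intercept formulas above can be sketched in Python (this is an illustration, not part of the course's SAS materials; the data values are made up):

```python
def slr_fit(x, y):
    """Fit a simple linear regression of y on x; return (b0, b1)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # Corrected (mean-adjusted) sums of squares and cross products
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx           # slope = SXY / SXX
    b0 = ybar - b1 * xbar    # intercept = Ybar - b1 * Xbar
    return b0, b1

# Made-up data lying exactly on y = 2x, so b0 = 0 and b1 = 2
b0, b1 = slr_fit([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

Because b0 is computed as Ȳ − b1·X̄, the fitted line necessarily passes through the point (X̄, Ȳ), as stated above.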
The variance estimate is called the MSE (mean square error). It is the SSError divided by the d.f.:

MSE = SSError / (n − 2)

The variances for the two parameter estimates and the predicted values are all different, but all are based on the MSE, and all have n−2 d.f. (t-tests) or n−2 d.f. for the denominator (F tests).

Variance of the slope: Var(b1) = MSE / SXX
Variance of the intercept: Var(b0) = MSE·(1/n + X̄² / SXX)
Variance of a predicted value at Xi: Var(Ŷi) = MSE·(1/n + (Xi − X̄)² / SXX)

Any of these variances can be used for a t-test of an estimate against a hypothesized value for the appropriate parameter (i.e., slope, intercept, or predicted value, respectively).

ANOVA table for regression
A common representation of regression results is an ANOVA table. Given the SSError (the sum of squared deviations from the regression) and the initial total sum of squares (SYY, the sum of squares of Y adjusted for the mean), we can construct an ANOVA table.

Simple Linear Regression ANOVA table
Source       d.f.   Sum of Squares   Mean Square   F
Regression   1      SSRegression     MSReg         MSReg / MSError
Error        n−2    SSError          MSError
Total        n−1    SYY = SSTotal

In the ANOVA table the SSRegression and SSError sum to the SSTotal, so given the total (SYY) and one of the two terms, we can get the other. The easiest to calculate first is usually the SSRegression, since we usually already have the necessary intermediate values:

SSRegression = SXY² / SXX

The SSRegression is a measure of the "improvement" in the fit due to the regression line. The deviations start at SYY and are reduced to SSError. The difference is the improvement, and is equal to the SSRegression.

This gives another statistic, called R²: what portion of the SSTotal (SYY) is accounted for by the regression?

R² = SSRegression / SSTotal

The degrees of freedom in the ANOVA table are:
n−1 for the total, one d.f. lost for the correction for the mean (which also fits the intercept);
n−2 for the error, since two parameters are estimated to get the regression line;
1 d.f.
for the regression, which is the d.f. for the slope.

Statistics quote: He uses statistics as a drunken man uses lampposts -- for support rather than for illumination. Andrew Lang (1844-1912), Scottish poet, folklorist, biographer, translator, novelist, and scholar

The F test is constructed by calculating MSRegression / MSError. This F test has 1 d.f. in the numerator and (n−2) d.f. in the denominator. This is exactly the same test as the t-test of the slope against zero. To test the slope against a hypothesized value (say zero) using the t-test with n−2 d.f., calculate

t = (b1 − b1,hypothesized) / Sb1 = (b1 − 0) / sqrt(MSE / SXX)

Assumptions for the Regression
We will recognize 4 assumptions.

1) Normality – We take the deviations from regression and pool them all together into one estimate of variance. Some of the tests we use require the assumption of normality, so these deviations should be normally distributed. For each value of X there is a population of values for the variable Y (normally distributed).

2) Homogeneity of variance – When we pool these deviations (variances) we also assume that the variances are the same at each value of Xi. In some cases this is not true, particularly when the variance increases as X increases.

3) X is measured without error! Since variances are measured only vertically, all variance is in Y; no provision is made for variance in X.

4) Independence – This enters in several places. First, the observations should be independent of each other (i.e., the value of ei should be independent of ej, for i ≠ j). Also, in the equation for the line, Yi = b0 + b1·Xi + ei, we assume that the term ei is independent of the rest of the model. We will talk more of this when we get to multiple regression.

So the four assumptions are: normality, homogeneity of variance, independence, and X measured without error. These are explicit assumptions, and we will examine or test these assumptions when possible.
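The ANOVA quantities and the slope t-test described above can be sketched together in Python (an illustrative check on the formulas, not the course's SAS output; the data values are made up). Note that the squared t statistic equals the regression F statistic, which is why the two tests are equivalent:

```python
import math

def slr_tests(x, y, b1_hyp=0.0):
    """Return (SSReg, SSError, MSE, R2, F, t) for a simple linear regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    ss_reg = sxy ** 2 / sxx            # SSRegression = SXY^2 / SXX
    ss_err = syy - ss_reg              # SSError = SSTotal - SSRegression
    mse = ss_err / (n - 2)             # MSE on n-2 d.f.
    r2 = ss_reg / syy                  # R^2 = SSRegression / SSTotal
    f = ss_reg / mse                   # F with 1 and n-2 d.f.
    b1 = sxy / sxx
    t = (b1 - b1_hyp) / math.sqrt(mse / sxx)   # t with n-2 d.f.
    return ss_reg, ss_err, mse, r2, f, t

ss_reg, ss_err, mse, r2, f, t = slr_tests([1, 2, 3, 4], [1, 2, 2, 3])
```

For these made-up values SSReg = 1.8, SSError = 0.2, MSE = 0.1, R² = 0.9, and F = 18, with t² = F.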
There are also some other assumptions that I consider implicit. We will not state these, but in some cases they can be tested. For example:
There is order in the Universe. Otherwise, what are you investigating? Chaos?
The underlying fundamental relationship that I just fitted a straight line to really is a straight line. Sometimes this one can be examined statistically.

Curvilinear Regression or Intrinsically Linear Regression
As the name implies, these are regressions that fit curves. However, the regressions we will discuss are also linear models, so most of the techniques and SAS procedures we have discussed will still be relevant.

We will discuss two basic types of curvilinear model. The first are models that are not linear, but that can be "linearized" by transformation. These models are referred to as "intrinsically linear", because after transformation they are linear. Later we will cover polynomial regressions. These are an extraordinarily flexible family of curves that will fit almost anything. Unfortunately, they rarely have a good interpretation of the parameter estimates.

Intrinsically linear models
These are models that contain some transformed variable: logarithms, inverses, square roots, sines, etc. We will concentrate on logarithms, since these models are some of the most useful.

What is the effect of taking a logarithm of a dependent or independent variable? For example, instead of Yi = b0 + b1·Xi + ei, fit log(Yi) = b0 + b1·Xi + ei.

If we fit log(Yi) = b0 + b1·Xi + ei, then the original model, before we took logarithms, must have been Yi = b0·e^(b1·Xi)·ei, where "e" is the base of the natural logarithm (2.718281828...). This model is called the "exponential growth model" if b1 is positive, or the exponential decay model if it is not. It is used in the biological sciences to fit exponential growth (+b1) or mortality (−b1). This function produces a curve that increases or decreases proportionally.
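The linearization can be sketched as follows, with made-up data generated from an exact exponential curve y = 2·e^(0.5x), so that the detransformed estimates recover b0 = 2 and b1 = 0.5 (an illustration only, not the course example):

```python
import math

x = [0, 1, 2, 3, 4]
y = [2 * math.exp(0.5 * xi) for xi in x]   # exact exponential data

ly = [math.log(yi) for yi in y]            # natural log of the dependent variable
n = len(x)
xbar, lbar = sum(x) / n, sum(ly) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (li - lbar) for xi, li in zip(x, ly))
b1 = sxy / sxx                    # slope estimate on the log scale
b0 = math.exp(lbar - b1 * xbar)   # intercept detransformed back to the raw scale
```

With real (noisy) data, the estimates and confidence limits would be computed on the log scale and then detransformed, as the summary below notes.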
[Figure: exponential growth and decay curves for the exponential model Yi = b0·e^(b1·Xi)·ei.]

Other examples of curvilinear models
Power model: Yi = b0·Xi^b1·ei. Taking logarithms of Yi = b0·Xi^b1·ei gives log(Yi) = log(b0) + b1·log(Xi) + log(ei). This is a simple linear regression with a dependent variable of log(Yi) and an independent variable of log(Xi). This model is used to fit many things, including morphometric data.

[Figure: power-model curves fitted for (b0, b1) values of (29, −1), (19, 0), (4, 0.5), (1, 1), and (0.03, 2), illustrating the cases b1 negative, b1 = 0, 0 < b1 < 1, b1 = 1, and b1 > 1.]

Models following a hyperbolic shape, with an asymptote, can be fitted with inverses. A model with an inverse (e.g., 1/Xi) will fit a "hyperbola", with its asymptote:

Yi = b0 + b1·(1/Xi) + ei, where b0 fits the asymptote.

[Figure: hyperbolic curves, Yhat = 10 + 10(1/Xi) and Yhat = 10 − 10(1/Xi).]

These are a few of many possible curvilinear regressions. Models including power terms, exponents, logarithms, inverses, roots, and trigonometric functions may be curvilinear.

A note on logarithms: the model described above requires natural logs. In SAS the function LOG produces natural logs (LOG10 gives log base 10). In EXCEL and on many calculators the natural log function is LN.

Curvilinear Residual Patterns
Transformations of Yi, like log transformations, will affect homogeneity of variance. The raw data should actually appear nonhomogeneous.
Transformations of Xi will fit curves but will not affect the homogeneity of variance.
Polynomials assume homogeneous variance and will not adjust variance.

[Figure: residual patterns for transformations of Yi, transformations of Xi, and polynomials.]

Intrinsically Linear (Curvilinear) regression example
Remember our SLR example about the amount of wood harvested from trees, predicted on the basis of DBH (diameter at breast height)? Recall that it looked a little curved, and maybe even had nonhomogeneous variance? Let's take another look at that model.
Typically, morphometric relationships (between parts of an organism) are best fitted with models using both log(Y) and log(X):
Fish length – scale length
Fish total length – fish fork length
Crab width – crab length
Fish length – fish weight, etc.

Statistics quote: There are three kinds of lies: lies, damned lies, and statistics. Benjamin Disraeli (1804 - 1881)

Here we have tree diameter and tree weight. Let's try a log-log model (using natural logs). Since we are fitting a linear measurement to a volumetric or weight measurement, I expect the following relationship to apply: 1 gm = 1 cm³ in the metric system if the material has the same density as water (specific gravity = 1). In other words, I expect

Wood weight = b0·(wood length)³
log(wood weight) = log(b0) + 3·log(wood length)

Note that this line will go through the origin; no problem there. The model is the power model Yi = b0·Xi^b1·ei. The power term should be approximately 3, so we will test the coefficient of DBH against a value of 3. See computer output.

The residual plot for the log-log model appears to show no curvature, no nonhomogeneous variance, no obvious outliers, and no significant departure from the normal distribution. In short, it is much improved, probably fits better than the linear model, and it is interpretable. Geometrically, the model for a tree trunk should approximate a cylinder or cone, depending on the taper in the bole:

Weight = C·π·(specific gravity)·(D/2)²·H, where C = 1 for a cylinder and 1/3 for a cone.

Curvilinear Regression Notes and Summary
For transformed models, the usual regression assumptions must be met for the transformed model, not the raw data (homogeneity, normality, etc.). Estimates, hypothesis tests, and confidence intervals would be calculated for the transformed model. The estimates and limits can then be detransformed. A wide range of biometric situations call for established curvilinear models.
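The log-log fit and the test of the power term against 3 can be sketched in Python (hypothetical tree data built from a cube law with slight scatter — not the course's actual data set or SAS output):

```python
import math

# Hypothetical tree data: weight roughly proportional to diameter cubed
dbh    = [10, 15, 20, 25, 30]
noise  = [1.02, 0.98, 1.01, 0.99, 1.00]            # small multiplicative scatter
weight = [0.5 * d ** 3 * f for d, f in zip(dbh, noise)]

lx = [math.log(d) for d in dbh]       # log(X): log diameter
ly = [math.log(w) for w in weight]    # log(Y): log weight
n = len(lx)
xbar, ybar = sum(lx) / n, sum(ly) / n
sxx = sum((xi - xbar) ** 2 for xi in lx)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(lx, ly))
syy = sum((yi - ybar) ** 2 for yi in ly)
b1 = sxy / sxx                             # estimated power term (near 3)
mse = (syy - sxy ** 2 / sxx) / (n - 2)
t = (b1 - 3.0) / math.sqrt(mse / sxx)      # t-test of the slope against 3, n-2 d.f.
```

Because the data were generated from a cube law, the estimated power term lands near 3 and the t statistic against 3 is small.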
Established curvilinear models in biometrics include: exponential growth, mortality, morphometric models, instrument standardization, some other growth models (power models and quadratics have been used), and recruitment models. Check the literature in your field to see what models are used.

Statistics quote: "In earlier times, they had no statistics, and so they had to fall back on lies". -- Stephen Leacock

Analysis of Covariance
Linear regression is usually done as a least squares technique applied to QUANTITATIVE variables. ANOVA is the analysis of categorical (class, indicator, group) variables; there are no quantitative "X" variables as in regression, but it is still a least squares technique. It stands to reason that if regression uses the least squares technique to fit quantitative variables, and ANOVA uses the same approach to fit qualitative variables, we should be able to put both together into a single analysis. We will call this Analysis of Covariance.

There are actually two conceptual approaches:
Multisource regression – adding class variables to a regression
Analysis of Covariance – adding quantitative variables to an ANOVA

For now, we will be primarily concerned with multisource regression. With multisource regression we start with a regression (one slope and one intercept) and ask: would the addition of an indicator or class variable improve the model? Adding a class variable to a regression gives each group its own intercept, fitting a separate intercept to each group. Adding an interaction fits a separate slope to each group.

How do they do that? For a simple linear regression we start with

Yi = b0 + b1·X1i + ei

Now add an indicator variable. An indicator variable, or dummy variable, is a variable that uses values of "0" and "1" to distinguish between the members of a group. For example, if the categories, or groups, were MALE and FEMALE, we could give the females a "1" and the males a "0" and distinguish between the groups.
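The 0/1 coding just described can be sketched with a small helper (a generic illustration, not a SAS feature; the MALE/FEMALE labels come from the text). In general, a class variable with k levels needs k−1 dummy columns:

```python
def dummies(values, levels):
    """Return one 0/1 column per level except the last, which is the reference level."""
    return [[1 if v == lev else 0 for v in values] for lev in levels[:-1]]

# Two groups need a single dummy variable: FEMALE = 1, MALE = 0
sex = ["FEMALE", "MALE", "MALE", "FEMALE"]
cols = dummies(sex, ["FEMALE", "MALE"])
# cols[0] is the indicator variable: [1, 0, 0, 1]
```

The reference level (here MALE) is the one coded with all zeros, so its intercept is b0 itself.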
If the groups were FRESHMAN, SOPHOMORE, JUNIOR, and SENIOR we would need 3 variables to distinguish between them: the first would have a "1" for freshmen and a "0" otherwise; the second and third would have a "1" for SOPHOMORE and JUNIOR, respectively, and a zero otherwise. We don't need a fourth variable for SENIOR, because if we have an observation with values of 0, 0, 0 for the three variables we know it has to be a SENIOR. So, we always need one less dummy variable than there are categories in the group.

Statistics quote: If at first it doesn't fit, fit, fit again. -- John McPhee

Fitting separate intercepts
In our example we will add just one indicator variable, but there could be several. We will call our indicator variable X2i, but it is simply a variable with values of 0 or 1 that distinguishes between the two categories of the indicator variable.

Yi = b0 + b1·X1i + b2·X2i + ei

When X2i = 0 we get Yi = b0 + b1·X1i + b2·(0) + ei, which reduces to Yi = b0 + b1·X1i + ei, a simple linear model for the "0" group.

When X2i = 1 we have Yi = b0 + b1·X1i + b2·(1) + ei, which simplifies to Yi = (b0 + b2) + b1·X1i + ei, a simple linear model with an intercept equal to (b0 + b2) for the "1" group.

So there are two lines, with intercepts of b0 and (b0 + b2). Note that b2 is the difference between the two lines, or an adjustment to the first line that gives the second line. This can be positive or negative, so the second line may be above or below the first. This term can be tested against zero to determine if there is a difference in the intercepts.

Fitting separate slopes
Adding an interaction (cross-product term) between the quantitative variable (X1i) and the indicator variable (X2i) will fit separate slopes. Still using just one indicator variable for two classes, the model is

Yi = b0 + b1·X1i + b2·X2i + b3·X1i·X2i + ei

When X2i = 0 we get Yi = b0 + b1·X1i + b2·(0) + b3·X1i·(0) + ei, which reduces to Yi = b0 + b1·X1i + ei, a simple linear model for the "0" group.
When X2i = 1, then Yi = b0 + b1·X1i + b2·(1) + b3·X1i·(1) + ei, which simplifies to Yi = (b0 + b2) + (b1 + b3)·X1i + ei, a simple linear model with an intercept equal to (b0 + b2) and a slope equal to (b1 + b3) for the "1" group.

Note that b0 and b1 are the intercept and slope for one of the lines (whichever was assigned the 0). The values of b2 and b3 are the intercept and slope adjustments (+ or − differences) for the second line. If these adjustments are not different from zero, then the intercepts and/or slopes are not different from each other. A third or fourth line could be fitted by adding additional indicator variables and interactions, with coefficients b4 and b5, etc.

Statistics quote: There are two kinds of statistics, the kind you look up and the kind you make up. -- Rex Stout

Conceptual application of Analysis of Covariance
Suppose we have a regression, and we would like to know if perhaps the regression is different for some groups, perhaps male and female, or treated and untreated. Of course, regression starts with a corrected sum of squares (a flat line through the mean) and we fit a slope. Viewing this as extra sums of squares, we are comparing a model with no slope to a model with a slope.

(a) A test of the difference between these two models tests the hypothesis of adding a slope (H0: β1 = 0). The extra SS is SSX1|X0, or just SSX1.

From the simple linear regression we want to test the hypotheses that the model is improved by adding separate intercepts or slopes. The concepts are pretty simple. The additional variables can be tested with extra SS or the General Linear Hypothesis Test (GLHT, with full model and reduced model). For a single indicator variable the extra SS are easy enough. For several indicator variables it may be easier to do the GLHT, where the single line is the reduced model and the two lines are the full model.
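The two-line model derived above can be sketched with NumPy (an illustration, not SAS; the data are made up so that group "0" lies on y = 1 + 2x and group "1" on y = 3 + 5x, giving b2 = 2 and b3 = 3):

```python
import numpy as np

# Made-up data: group "0" lies on y = 1 + 2x, group "1" on y = 3 + 5x
x1 = np.array([1, 2, 3, 4, 1, 2, 3, 4], dtype=float)   # quantitative variable
x2 = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)   # indicator variable
y = np.where(x2 == 0, 1 + 2 * x1, 3 + 5 * x1)

# Design matrix for Y = b0 + b1*X1 + b2*X2 + b3*X1*X2 + e
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
b0, b1, b2, b3 = np.linalg.lstsq(X, y, rcond=None)[0]

# Group "0" line: intercept b0, slope b1
# Group "1" line: intercept b0 + b2, slope b1 + b3
```

As the notes state, these are exactly the slopes and intercepts that would be obtained by fitting the two groups separately; the combined fit differs only in pooling the error term.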
(b) A test of the difference between these two models tests the hypothesis of adding an intercept adjustment (H0: β2 = 0), or tests for the difference between the intercepts. The extra SS is SSX2 | X1.

The next test compares a model with two intercepts and one slope with a model containing two slopes and two intercepts.

(c) Here we are testing the addition of a slope adjustment, or second slope (H0: β3 = 0). This test determines if a model with 2 intercepts and 2 slopes is better than a model with 2 intercepts and one slope. The extra SS is SSX1X2 | X1, X2.

This series a–b–c is the usual "multisource regression" series. Note that the extra SS needed are: SSX1; SSX2 | X1; SSX1X2 | X1, X2. These are the Type I SS for the SAS model Y=X1 X2 X1*X2;. So this is another of those rare cases where we may wish to use Type I tests.

The hypotheses to be tested can be simplified to some extent if we know that for our base model we want a regression. This is multisource regression. Start with a simple linear regression. Test to see if separate intercepts improve the model (H0: β2 = 0). The sequence of models is:

Correction factor: 1 level = b0 (fit mean)
Add b1: 1 level (b0), 1 slope (b1)
Add b2: 2 levels (b0, b2), 1 slope (b1)
Add b3: 2 levels (b0, b2) and 2 slopes (b1 and b3)

Test to see if separate slopes improve the model (H0: β3 = 0). As extra SS we could test the extra SS for b2 and then the extra SS for b3. In practice we usually start with the fullest model (2 slopes and 2 intercepts), then reduce to 1 slope and 2 intercepts if the extra SS for b3 is not significant, and then to 1 slope and 1 intercept if the extra SS for b2 is not significant. So interpretation usually proceeds from the bottom up: do we need 2 slopes and 2 intercepts; if not, do we need 2 intercepts; if not, is the SLR significant?
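The idea behind the sequential (Type I style) extra SS can be sketched by fitting the nested models and differencing their residual sums of squares (a NumPy illustration of the concept, not SAS itself; the data are made up so the two groups share a slope but differ in intercept):

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares from a least squares fit of y on the columns of X."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ b
    return float(r @ r)

# Hypothetical data: two groups sharing slope 2 but with intercepts 1 and 4
x1 = np.array([1, 2, 3, 4, 1, 2, 3, 4], dtype=float)
x2 = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
y = 1 + 2 * x1 + 3 * x2 + np.array([0.1, -0.1, 0.1, -0.1, -0.1, 0.1, -0.1, 0.1])

one = np.ones_like(x1)
sse_mean = sse(np.column_stack([one]), y)                   # mean only (correction factor)
sse_slr  = sse(np.column_stack([one, x1]), y)               # 1 intercept, 1 slope
sse_int  = sse(np.column_stack([one, x1, x2]), y)           # 2 intercepts, 1 slope
sse_full = sse(np.column_stack([one, x1, x2, x1 * x2]), y)  # 2 intercepts, 2 slopes

ss_x1     = sse_mean - sse_slr   # (a) SSX1
ss_x2_g1  = sse_slr - sse_int    # (b) SSX2 | X1
ss_int_g2 = sse_int - sse_full   # (c) SSX1X2 | X1, X2
```

Because the groups truly differ in intercept but not in slope, the extra SS for the intercept adjustment (b) is large while the extra SS for the slope adjustment (c) is near zero, matching the bottom-up interpretation described above.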
If Type I sums of squares are used there would be little if any change from deleting the higher order components of the model, so in this case we would not have to refit the model after each decision. The solution (i.e., regression coefficients) provided with ANCOVA is the correct model for prediction, even though it is fully adjusted (Type III) while our tests of hypothesis are not (Type I).

The progression discussed above corresponds to the likely series in multisource regression (a–b–c below). In Analysis of Variance (our next topic), the researcher's interest is in differences among the means of categorical variables. For that analysis the categorical variable can be put first, and the progression is the usual series for Analysis of Variance (d–e–c below). Which series of tests you need depends on whether you are starting with a regression or an analysis of variance.

[Diagram: the possible full and reduced model comparisons, labeled a through g.]

Examine the possible full and reduced models on the previous page. What is tested in each case, what extra SS is needed for each test, and what are the initial and final models for each case?

One other test of occasional interest is the test denoted by the comparison "g". If you fit a model with two slopes and two intercepts, and cannot reduce to a model with two intercepts and one slope, you may wonder if a model with two slopes and one intercept is appropriate.

Full model: Yi = b0 + b1·X1i + b2·X2i + b3·X1i·X2i + ei
Reduced model: Yi = b0 + b1·X1i + b3·X1i·X2i + ei

The extra SS for this test is SSX2 | X1, X1*X2. This is actually a test of "intercepts fitted last" and would be available as the Type III SS for X2 with the model Y=X1 X2 X1*X2;.

A few notes on ANCOVA
The values of the slopes and intercepts fitted in the full model (2 slopes and 2 intercepts) are exactly the same slopes and intercepts that would be fitted if the models were done separately. However, the combined model has the advantage of a more powerful pooled error term.
In SAS we can fit our own indicator variables in PROC REG, building the dummy variables in the data step. However, if there are more than two levels (more than one dummy variable), the dummy variables should be treated as an aggregate. The CLASS statement in PROC MIXED or PROC GLM automatically creates the indicator variables, and both will test Type I and Type III sums of squares. We will use PROC MIXED in our numerical SAS example.

When we include the CLASS statement in SAS, the program assumes that we are doing an ANOVA, and by default does not print the regression coefficients. If we want them we must add the option "/solution" to the model statement. This is true for both PROC MIXED and PROC GLM.