CHAPTER FOUR
MULTIPLE LINEAR REGRESSION

TABLE OF CONTENTS

OVERVIEW
    Introduction
    Notation for multiple regression
    Extrapolation
ORDINARY METHOD OF LEAST SQUARES
    Definition
    Evaluation of partial derivatives
    System of equations
    Solution for a known intercept
    Solution for known slope(s)
ANALYSIS OF SUM OF SQUARES
    Sum of squares
    Partitioning of total sum of squares
    Mean square
    Coefficient of determination
    Extra sums of squares
EXAMPLE PROBLEM #1
    Data set
    Key matrices
    Regression coefficients
    ANOVA and R²
    EXCEL Solution
    Extrapolation
REPRESENTATION FOR STATISTICAL ANALYSIS
    Overview
    Properties of least squares estimator
    Confidence intervals
    Hypothesis testing of parameters
    F test
EXAMPLE PROBLEM #2
    Problem statement
    Test significance of regression
    Confidence interval for β1
    Confidence interval for regression surface
    Test slope parameters
    Final model
MULTICOLLINEARITY
    Introduction
    Nearly perfect multicollinearity
    Theoretical Considerations
    Example problem
SELECTION OF REGRESSION MODEL
    Overview of concepts
    Guidelines for removing variables
    Example problem
    Stepwise regression analysis
OTHER REGRESSION TOPICS
    Polynomial regression
    Indicator variables
    Piecewise linear regression
    Common slope or intercept parameter
    Common variance
APPENDIX 4-A: SELECTED DERIVATIONS
    Total and regression sum of squares
    Expected value of least square estimator
    Variance-covariance matrix of b
    Gauss-Markov theorem
REFERENCES
PROBLEM ASSIGNMENT
SOLUTION KEY


CHAPTER FOUR
MULTIPLE LINEAR REGRESSION

I am too familiar with the manner in which actual data are met with the suggestion that other data, if they were collected, might show something else to believe it to have any value as an argument. "Statistics on the table, please," can be my sole reply.
Karl Pearson, 1910

OVERVIEW

Introduction

Background
Multiple linear regression is a method of fitting and evaluating a linear algebraic equation to observed data when there is more than one independent variable. As with simple regression, multiple regression analysis can be viewed as having
* An optimization criterion for fitting the equation
* A framework for statistical inference
Noteworthy differences include:
* Matrix algebra is required for an efficient solution
* Correlation among the independent variables is very important

Uses
In general, multiple regression analysis is used for:
* Description: The primary objective is to identify significant variables. Statistical inferences are of critical importance.
* Prediction: The primary objective is to develop a predictive model. Physical insight into the process should be used in the selection of parameters.

Notation for multiple regression

Model for population
Let's expand on the grain bin example used in Chapters 2 and 3 by considering the drying of grain using different systems over several years. We are now interested in the moisture content of the kernels (dependent variable) as influenced by independent variables such as the height in the bin (x1), air flow rate (x2), air temperature (x3), relative humidity (x4), and other possible factors. If all kernels were measured in the population, then the regression model would be

    y_i = η_i + ε_i

where y_i is the dependent variable (moisture content of the kernel) and η_i is the linear model defined for the population as

    η_i = β_0 + β_1 x_i1 + β_2 x_i2 + ... + β_p x_ip

where β_0 through β_p are the population parameters and x_i1 through x_ip are the independent variable values (height, air flow rate, etc.) corresponding to y_i. The residual, ε_i, is the deviation between the linear model and the observation. This deviation is assumed to be random and is the result of measurement error and/or other factors.
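To make the notation concrete, the short sketch below simulates a small sample from a population model of this form. The sample size, parameter values, and variable ranges are hypothetical and chosen only to illustrate how y, the x's, β, and ε fit together; they are not taken from the grain-drying data.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 30                                   # hypothetical sample size

    # Hypothetical independent variables for the grain-drying illustration
    x1 = rng.uniform(0.5, 4.0, n)            # height in the bin
    x2 = rng.uniform(5.0, 20.0, n)           # air flow rate
    x3 = rng.uniform(20.0, 45.0, n)          # air temperature
    x4 = rng.uniform(30.0, 90.0, n)          # relative humidity

    # Hypothetical population parameters beta_0 ... beta_4
    beta = np.array([25.0, 1.2, -0.4, -0.15, 0.05])

    # Design matrix with a leading column of ones, so eta_i = (X @ beta)_i
    X = np.column_stack([np.ones(n), x1, x2, x3, x4])
    eps = rng.normal(0.0, 1.0, n)            # random residuals epsilon_i
    y = X @ beta + eps                       # y_i = eta_i + epsilon_i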
Linear model from sample The corresponding linear model using a sample of y and x’s values is where For n observations, we have the following set of equations, | | | | which can be written in matrix form as where , , and 4-3 where y^ = nx1 vector, b = mx1 vector and x = nxm matrix where m=p+1 is the number of parameters. The errors or residuals for n observations are simply defined as | | | | which can be written in matrix form as where and Extrapolation Extrapolation outside the range of x should be done cautiously. As discussed in the Chapter 3, the confidence intervals become wider as the distance from the mean increases. However more importantly, the relationship may be approximately linear within the range of x, but nonlinear outside x. Substantial errors are then possible. Both of these concepts are shown below. As demonstrated in an example problem later, the acceptable interpolation region of the observed data is more difficult to identify for multiple regression problems. y y Nonlinear Confidence intervals x ORDINARY METHOD OF LEAST SQUARES Linear x 4-4 Definition As discussed in Chapter 3, the ordinary least square method minimizes the sum of squared residuals. In matrix notation, the residual sum of squares is defined as and therefore the objective function for determining the least square parameters is defined as A necessary condition for M to be minimized has previously been shown is or Evaluation of partial derivatives Partial with respect to bo Let’s first expand M as from which is easily shown that For linear regression, we have and therefore the derivatives are easily obtained as By using these results, by setting the derivative equal to zero, and by dividing through by -2, we obtain the first equation for minimum of 4-5 Important results for minimization with respect to bo Once again, there are two important results of minimizes M with respect to bo. Since ei = (yiy^ i), we obtain that is, the sum of residuals is zero. Let’s rearrange the above equation as where the definition of y^ has been used and each term has been summed separately. By dividing through by n, we obtain y The mean values of y and x1 through xp lie on the regression surface. The representation for two independent variables is shown below. x2 E(y) E(x2) E(x1) x1 Evaluate with respect to other parameters After expanding M, the partial derivative with respect to b1 is By using the previously given equations for y^ i for linear regression, we quickly conclude that 4-6 By using these results, by once again setting the derivative equal to zero, and by dividing through by -2, we obtain the second equation for minimum of The partial derivative with other parameters would result in the same solution except having a different independent variable. The general solution1 for j…0 is therefore System of equations Standard Equation Set Let’s review our minimization results. The minimum with respect to bo gave us and with respect to b1 and with respect to b2 and subsequent parameters | | | | and with respect to bp This system of equations can be written as 1 A simple evaluation of matrix solutions for multiple regression is obtained by testing Gxijei = for each j. 4-7 For linear regression, the partial derivatives are easily evaluated to obtain the following simple matrix solution: where xT as previously defined has been used. 
By using the definition for the vector y, and since we have previously defined that y^ = x b, we can write the system of equations can be written as or where xT = m x n matrix, y= n x 1 vector, xTy = m x 1 vector, x b = n x 1 vector, and xT x b = m x 1 vector (where m=p+1). If the inverse is known By using the definition of the identity matrix, we obtain Solution approaches for nonlinear equation are given in Chapter 6. The xTy matrix is easily evaluated directly as and the xTx matrix as 4-8 No Intercept/Mean Difference Formulation For some applications it is more convenient to formulate the estimation of the least square parameters using the difference between the dependent and independent variables from their respective means. Let’s start by using the solution for MM/Mbo=0, that is By using this result, the regression equation can alternatively be written as where the prime terms are defined as the variable relative to it’s mean. To solve for b1 through bp, we again find the minimum with respect to each variable resulting in the following set of “p” equations | | | which can be written in the following matrix form We therefore obtain the following matrix solution for b1 through bp where b is a vector of slope parameters and yN is a vector of (yi-yG ) values. As discussed in more detail later, variance-covariance for b is defined as 4-9 where for independent and constant variance we have used . Let’s now examine the matrix xNTyN and xNTxN more closely. We obtain from matrix multiplication and or Solution for a known intercept Computation of bj Let’s consider the special case where the intercept has a known value of $o, usually $o = 0. where the linear regression model is By removing the first condition in the previous section, the appropriate system of equations is 4-10 where xT is defined here as The system of equations can then be written as which can be rearranged as which can be solved for the unknown vector b. Important restrictions Since we are no longer minimizing M with respect to bo, we lose the useful conditions that the sum of residuals is zero and that the mean values lie on the regression surface, or and Solution for known slope(s) Let’s consider the solution where the slope of one or more term is know. For illustration, we will assume that values are known for $1 and $p-1. The solution can be obtained with essentially the same set of equation with two fewer equation, that is where xT is redefined here with two fewer rows. The system of equations can then be written as which can be rearranged as 4-11 which can be solved for the unknown vector b. As before, the sum of the residuals is again zero and the mean values lie on the regression line. ANALYSIS OF SUM OF SQUARES Sum of squares Similar to simple regression, the total sum of squares (SSTO) is defined as the sum of the squared deviation between each observed value and the mean of y, or or in matrix notation The residual (or error) sum of squares (SSE) is defined as the sum of the squared deviation between each observed and predicted (local mean) values of y, or or in matrix notation The regression sum of squares (SSR) is defined as the sum of the squared deviation of the predicted (local mean) value and the mean of y, or For the minimization solution for bo, we have shown that Eei = 0 and we conclude that and therefore E(y^ ) = yG . The SSR can then be simplified as An alternative computational form that is frequently used is shown below. Details of this derivation are given in Appendix 4-A. 
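The matrix solution and the sums of squares defined in this section can be evaluated with a few array operations. The sketch below is a minimal illustration, assuming X is the n x m design matrix (first column of ones) and y is the n x 1 observation vector defined earlier; the function name is arbitrary.

    import numpy as np

    def least_squares_summary(X, y):
        """Solve the normal equations xTx b = xTy and return the sums of squares."""
        b = np.linalg.solve(X.T @ X, X.T @ y)   # more stable than forming the inverse
        yhat = X @ b
        e = y - yhat                            # residuals; sum(e) = 0 when b0 is fitted
        ybar = y.mean()
        ssto = np.sum((y - ybar) ** 2)          # total sum of squares
        sse = np.sum(e ** 2)                    # residual (error) sum of squares
        ssr = np.sum((yhat - ybar) ** 2)        # regression sum of squares
        # One common computational form for SSR when an intercept is included:
        # SSR = bT xT y - n*ybar^2, which matches the definition above
        assert np.isclose(ssr, b @ (X.T @ y) - len(y) * ybar ** 2)
        return b, ssto, ssr, sse

With least squares estimates of all parameters, ssto should equal ssr + sse to within roundoff, which provides a quick check of the partitioning discussed next.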
Partitioning of total sum of squares
Similar to the results in Chapter 2, we are interested in partitioning the total sum of squares into residual and regression sums of squares. As shown in Appendix 4-A, for multiple regression analysis we obtain

    SSTO = SSR + SSE

for the conditions of
* A linear relationship
* Least squares estimates for all bj parameters
Once again, the first condition is violated in the nonlinear regression chapter, and the second condition is violated when the intercept term is set to zero. When either of the above conditions is violated, we have

    SSTO ≠ SSR + SSE

Mean square

Regression and residual mean squares
Similar to the result for simple regression, the "average" squared deviation for the residual sum of squares is called the residual (or error) mean square and is defined as

    MSE = SSE / (n - m)

where m is the number of estimated parameters (m = p+1) and n - m is the degrees of freedom. Likewise, the regression mean square is defined as

    MSR = SSR / (m - 1)

Equivalent ANOVA table
For a linear relationship with all of the coefficients estimated by the method of least squares, the results of the sum of squares analysis are conveniently summarized in the following equivalent ANOVA table (where SSR is valid only if all b's are optimized).

Source        Degrees of Freedom    Sum of Squares    Mean Square
Regression    m - 1                 SSR               MSR = SSR/(m - 1)
Residual      n - m                 SSE               MSE = SSE/(n - m)
Total         n - 1                 SSTO

In addition to the coefficient of determination, two parameters that are particularly useful in assessing the regression analysis are the residual mean square,

    s² = MSE

which is an unbiased estimate of the variance of the pdf describing ε, and the overall F, which is used to evaluate the slope terms (as discussed in greater detail later) and is defined as

    F = MSR / MSE

A large F indicates a significant regression model.

Coefficient of determination

Definition
A useful measure of the goodness of fit of the regression model is the coefficient of determination, defined as

    Cd = 1 - SSE/SSTO

where Cd is the general definition of the coefficient of determination and other terms are as previously defined.

Least squares estimate of all bj parameters
As previously shown for least squares estimates of all parameters, SSTO = SSR + SSE. The coefficient of determination can then be written as

    R² = SSR/SSTO = 1 - SSE/SSTO

where R² is used for the coefficient of determination of multiple regression. Similar to r², it has the following useful characteristics:
* Lies between zero and one
* Equals the fraction of the total sum of squares corresponding to the regression model

Discussion
The value of R² always increases with additional independent variables, even if there is no relationship between these variables and the dependent variable. This result will be apparent later in the chapter with our example problems. Theoretical insight can be obtained for the two-dimensional problem shown below (the raw data are given later in the chapter).

[Figure: SSE plotted as a surface over the slope parameters. Left panel: one slope parameter, y = ybar + b1(x1 - x1bar) + (0)(x2 - x2bar), with minimum SSE = 171.0986 at b1 = 13.150. Right panel: two slope parameters, y = ybar + b1(x1 - x1bar) + b2(x2 - x2bar), with minimum SSE = 171.0983 at b1 = 13.151 and b2 = 0.011.]

The left figure shows the SSE using a single slope parameter in the regression equation (the intercept parameter has been optimized in both figures). For this illustration, it is useful to view this graph as a special case of a two-parameter equation where b2 = 0. The right graph shows the solution using two optimization parameters. The optimized b2 value is 0.0112. The SSE values are very similar, differing by only 0.0003. Let's consider the implication of adding another independent variable. If b2 = 0 (exactly), then the SSE is the same as for the one-parameter model.
If b2 …0 (exactly), there exists a value of b2 that yields a smaller SSE. Since a value of bi is never exactly equal to zero in practice, the SSE will always be reduced (even if by the very small amount as in the above figure) with the addition of another parameter, and therefore the value of R2 will always increase. If your goal is to simply have a multiple regression model with the largest coefficient of determination, then your solution is to use as many independent parameters as possible, even if they are frivolous and have no relationship to the dependent variable. An “adjusted” R2 is frequently used in statistical software to allow the coefficient of determination to decrease with the addition of unrelated independent variables. It is defined as 4-15 where R^ 2 is the adjusted R2 and is defined as one minus the ratio of the variance of residuals divided by the variance of y relative to the mean. If SSE increases by a small value with an additional parameter, R^ 2 can actually decrease because of the potentially larger increase in the (n-m) term. Since SSTO and n-1 are constant for a given set of observations, the selection of appropriate independent variables using R^ 2 is then really determined by the parameter set that yields the smallest MSE. As discussed at length later in the chapter, Wilson recommends that you consider R2, MSE, and overall F in determining the best parameter set. If you use this approach, little is gained by using the R^ 2 criterion. Extra sums of squares Example Additional insight into the impact of adding independent variable on SSE and SSR is often considered using extra sums of squares. This concept is best introduced using an example problem. Consider the data reported by Neter and Wasserman's (1974) of skin cream sales (y) in different district as a function of population numbers (x1)and per capita income (x2). Obs 1 2 3 4 5 6 7 8 y x1 162 120 223 131 67 169 81 192 274 180 375 205 86 265 98 330 x2 2450 3254 3802 2838 2347 3782 3008 2450 Obs 9 10 11 12 13 14 15 y x1 116 55 252 232 144 103 212 195 53 430 372 236 157 370 x2 2137 2560 4020 4427 2660 2088 2605 Sum of squares Let’s consider the sum of squares obtained by including different independent variable in the regression analysis. The total sum of squares is constant for all option and SSTO = 53,902. The following sum of squares were obtained with x1 and x2 in the regression model: The following sum of squares were obtained with just x1 in the regression model: The following sum of squares were obtained with just x2 in the regression model: 4-16 A schematic illustrating these different sums of squares is shown below. SSE(x1,x2) SSE(x1) SSTO SSE(x2) n ∑ ( yi − y) so2 = i =1 n −1 2 SSR(x1,x2) SSR(x1) SSR(x1/x2) = SSE(x2) - SSE(x1,x2) SSR(x2/x1) = SSE(x1) - SSE(x1,x2) SSR(x2) Variability Regression Regression around mean x1 and x2 x1 Regression x2 Extra sum of squares Let’s now examine the difference in the residual terms for the regression models with just x2 and with both x1 and x2, or This corresponds to a shift of sum of squares from residual to regression when x1 is added to the model (given that x2 is already in the model), or Likewise the shift when x2 is added (given original model of x1) can be represented as which again represents the marginal increase in sum of squares due to regression when x2 is added to the model. ANOVA table The extra sums of squares are sometimes also summarized in ANOVA tables. The results obtained by adding x2 is shown below. 
Source           df    Sum of Squares
Regression       2     SSR(x1,x2) = 53,845
  x1             1     SSR(x1) = 53,417
  x2 given x1    1     SSR(x2/x1) = 428
Residual         12    SSE(x1,x2) = 56.9
Total            14    53,902

The results obtained by adding x1 are shown below.

Source           df    Sum of Squares
Regression       2     SSR(x1,x2) = 53,845
  x2             1     SSR(x2) = 22,030
  x1 given x2    1     SSR(x1/x2) = 31,815
Residual         12    SSE(x1,x2) = 56.9
Total            14    53,902

EXAMPLE PROBLEM #1

Data set

Problem statement
Let's predict the mean annual flood (Q in thousands of cfs) as a function of watershed area (A in thousands of square miles) and average annual maximum rainfall depth (I in inches) using the following regression model

    Q = β_0 + β_1 A + β_2 I + ε

which, in our notation for sample statistics, is

    ŷ = b_0 + b_1 x_1 + b_2 x_2

Raw data
The data reported by Haan (1979) for fourteen different watersheds are shown below.

Obs #   y (cfs)   x1 (sq mi)   x2 (in)   ŷ (cfs)   e_i (cfs)   (ŷ - ȳ)² (cfs²)
1       15.50     1.250        1.7       18.1      -2.62       13.0
2       8.50      0.871        2.1       13.1      -4.64       73.8
3       85.00     5.690        1.9       76.5      8.49        3001.4
4       105.00    8.270        1.9       110.4     -5.44       7870.2
5       24.80     1.620        2.1       23.0      1.82        1.6
6       3.80      0.175        2.4       4.0       -0.19       314.6
7       1.76      0.148        3.2       3.6       -1.88       327.0
8       18.00     1.400        2.7       20.1      -2.10       2.6
9       8.75      0.297        2.9       5.6       3.16        260.1
10      8.25      0.322        2.9       5.9       2.33        249.6
11      3.56      0.178        2.8       4.0       -0.47       313.1
12      1.90      0.148        2.7       3.6       -1.73       327.2
13      16.50     0.872        2.1       13.1      3.35        73.5
14      2.80      0.091        2.9       2.9       -0.09       354.8
Sum     304.1     21.33        34.3      304.1     0.00        13182.6
Mean    21.7      1.52         2.45      21.7

Useful summation terms for this data set are

    n = 14, Σx1 = 21.332, Σx2 = 34.3, Σy = 304.12
    Σx1² = 108.74, Σx1x2 = 43.34, Σx2² = 86.99
    Σx1y = 1465.89, Σx2y = 627.8, Σ(y - ȳ)² = 13353.7

Key matrices

Data set matrices
For the above data set, y is the 14 x 1 column vector of observed flows and x is the 14 x 3 matrix whose first column is all ones and whose second and third columns are the x1 and x2 values listed above.

Matrix products
By using the above matrices, we obtain the following matrix products:

    xTx = | 14       21.332    34.30  |
          | 21.332   108.741   43.342 |
          | 34.30    43.342    86.99  |

    xTy = | 304.12  |
          | 1465.89 |
          | 627.8   |

Inverse matrix
The procedures given in Chapter 1 for computing the inverse of a 3x3 matrix (the matrix of cofactors divided by the determinant) can be applied to xTx. The result is

    (xTx)^-1 = |  3.7168    -0.1809    -1.3754  |
               | -0.1809     0.02028    0.06124 |
               | -1.3754     0.06124    0.52329 |

Regression coefficients
As previously shown, the regression coefficients are defined as

    b = (xTx)^-1 xTy

By using the above matrices, we obtain

    b0 = 1.657, b1 = 13.151, b2 = 0.0112

Therefore, the regression model is

    ŷ = 1.657 + 13.151 x1 + 0.0112 x2

ANOVA and R²

ANOVA table
Let's now compute SSR as

    SSR = Σ(ŷ_i - ȳ)² = 13182.6

The SSTO is computed from previously given values as

    SSTO = Σ(y_i - ȳ)² = 13353.7

Since SSE = SSTO - SSR = 171.1, the ANOVA table for this example problem can be written as

Source        df    Sum of Squares    Mean Square
Regression    2     13182.6           6591.3
Residual      11    171.1             15.6
Total         13    13353.7

Multiple coefficient of determination
The multiple coefficient of determination is

    R² = SSR/SSTO = 13182.6/13353.7 = 0.987

Roughly 99% of the variance of flow rates around the mean is "explained" by the regression equation.

EXCEL Solution
The solution approach using Microsoft EXCEL is summarized below.
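The same numbers can also be cross-checked outside the spreadsheet. The NumPy sketch below is an illustration only (it is not part of the original worksheet); it re-enters the Haan (1979) data from the table above and reproduces the coefficients and summary statistics.

    import numpy as np

    # Haan (1979) watershed data from the table above
    y  = np.array([15.50, 8.50, 85.00, 105.00, 24.80, 3.80, 1.76,
                   18.00, 8.75, 8.25, 3.56, 1.90, 16.50, 2.80])
    x1 = np.array([1.250, 0.871, 5.690, 8.270, 1.620, 0.175, 0.148,
                   1.400, 0.297, 0.322, 0.178, 0.148, 0.872, 0.091])
    x2 = np.array([1.7, 2.1, 1.9, 1.9, 2.1, 2.4, 3.2,
                   2.7, 2.9, 2.9, 2.8, 2.7, 2.1, 2.9])

    X = np.column_stack([np.ones(len(y)), x1, x2])   # 14 x 3 design matrix
    b = np.linalg.solve(X.T @ X, X.T @ y)            # approx [1.657, 13.151, 0.0112]

    yhat = X @ b
    n, m = X.shape
    ssto = np.sum((y - y.mean()) ** 2)               # approx 13353.7
    sse  = np.sum((y - yhat) ** 2)                   # approx 171.1
    ssr  = ssto - sse                                # approx 13182.6
    mse  = sse / (n - m)                             # approx 15.6
    r2   = ssr / ssto                                # approx 0.987
    f    = (ssr / (m - 1)) / mse                     # approx 424
    var_b = mse * np.linalg.inv(X.T @ X)             # variance-covariance matrix of b

The diagonal of var_b gives the squared standard errors of the coefficients used in the confidence intervals and t tests later in the chapter.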
The worksheet tabulates the raw data and the (y - ȳ)² terms, builds the x matrix and its transpose (using Paste Special - Transpose), and then forms the products xTx and xTy, the inverse (xTx)^-1, the coefficient vector b, the predicted values and residuals, and the summary statistics SSTO = 13353.7, SSE = 171.1, SSR = 13182.6, MSE = 15.55, R² = 0.987, and overall F = 423.8, matching the hand calculations above. It also computes the estimated variance-covariance matrix of the coefficients, Var(b) = MSE (xTx)^-1:

    Var(b) = |  57.81    -2.814    -21.39   |
             |  -2.814    0.3155     0.9525 |
             | -21.39     0.9525     8.139  |

Extrapolation
Extrapolation is more difficult to identify for multiple regression problems. This can be well illustrated with this example problem. The range of area is 0.09 to 8.3 and the range of depth is 1.7 to 3.2. It would therefore seem reasonable to use the above regression model for area = 6 and depth = 3. A plot of this point with the raw data is shown below. Although the point is within the range of each observed variable, it is outside the range of the paired values of area and depth.

[Figure: scatter plot of depth versus area for the observed data, with the point (area = 6, depth = 3) falling outside the cloud of observed (area, depth) pairs.]

REPRESENTATION FOR STATISTICAL ANALYSIS

Overview

Statistical model
Statistical analysis is a very powerful component of multiple regression analysis. The statistical representation of the population is assumed to be the linear model

    y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ... + β_p x_ip + ε_i

where the observed x's may also contain a random measurement error ε_m, and ε is the random residual (or error) between predicted and observed values. As previously discussed, x_i1 through x_ip are the independent variables. Similar to simple regression, a population of y values is assumed to exist for the same independent variables because of measurement error and other factors not included in the regression model. This variability is random and is represented by the pdf for ε. The population parameters correspond to a surface that passes through the (local) mean of y for each combination of independent variables. In general, the population parameters are unobserved and unobservable. Our goal is to make inferences on the population parameters from a sample data set used to determine the sample parameters b_0 through b_p. Since a different sample will result in a different set of sample parameters, the regression parameters are random variables. Pdfs for inferences can be defined using the statistical assumptions for ε given below.
4-23 Instead of discussing the pdf for b resulting from sample estimates, this chapter will only focus on the mean and variance of b. Additional theoretical development to represent the pdf is given in Chapter 5. Statistical assumptions for the pdfs of residuals Statistical inferences for the regression parameters and the regression surface can be made for the following conditions: * Mean of g is zero; E(g) = 0 * Normally distributed g * Homoscedasticity; VAR(g) = constant * Uncorrelated residuals; COV(gi,gj) = 0 for i … j * Small random measurement error of x; gm .0 These concepts have been previously discussed in Chapter 3. Residual plots Similar to simple regression analysis, the validity of these assumptions are frequently evaluated using residual plots. Residuals are frequently plotted with respect to time (or order) and with respect to predicted values. The residual plot for the example problem is shown below. Normalized residuals 2 1 0 0 -1 -2 20 40 60 80 100 Predicted 4-24 Properties of least squares estimator Mean Once again, different samples result in different values of b and therefore the mean of b is of interest. As shown in Appendix 4-A, the mean of b is that is, on the average, the least square estimate gives the true population value. From Chapter 2, the least square estimate of $ is unbiased. Variance-covariance matrix The variance-covariance matrix is used to define the variability in the different b resulting from sampling. As shown in Appendix 4-A, the variance-covariance matrix is defined as where E[gTg] is the variance-covariance matrix of residual. For independent and constant variance, we know that and, as shown in Appendix 4-A, the variance-covariance matrix for b is The estimated variance covariance matrix is computed as The standard error for bj is obtained as the square root of the appropriate diagonal element. Minimum variance unbiased estimator Let’s consider the least square estimator of the form with a variance-covariance matrix defined as Out of the class of all linear unbiased estimators, the least squares estimator has the minimum variance, that is, it results in the smallest uncertainty in the regression parameters obtained by sampling. This important conclusion is obtained from the Gauss-Markov theorem derived in Appendix 4-A. It is valid for independent and constant variance of residuals. 4-25 Confidence intervals General format We will again use the following general format for confidence interval, where the upper and lower limits will be defined using the t distribution with the appropriate variance and (n-m) degrees of freedom. The above notation implies that (1-") fraction of the intervals defined by L and U contains the population value. Confidence interval for regression parameters The lower and upper confidence intervals for the parameter bj (j = 0, 1, ..., p) are and where sbj is the standard error of bj defined as The VAR(bj) is obtained from the diagonal elements of the MSE (xTx)-1 matrix. Variance for the mean value (regression surface) and predicted value Similar to simple regression, the mean value for a particular set of independent variable is that obtained from the regression surface. The variance of this mean value varies with its location. Let’s assume that we are interested in the regression value for independent variable of xo1, xo2,..., xop. 
The regression value is then obtained as where As derived in Appendix 4-A, the variance for the regression value at this point is defined using a sample estimate of F as By using the theoretical development in Chapter 2 and 3, the variance for the predicted value is defined as Confidence/prediction intervals for mean and predicted values The appropriate confidence intervals for the mean value are 4-26 and where s^yi is the standard error of mean value defined as where the variance is as previously defined. Likewise, the appropriate prediction intervals for ypi are and where the standard error of predicted value is defined as Hypothesis testing of parameters The null and alternative hypotheses for parameters bj’s (j = 0, 1, .... , p) are usually defined as and which is evaluated with the following t-distribution test statistic where sbj is the standard error of bj as previously defined. The null hypothesis is rejected if Tests are usually conducted for kj = 0 to remove insignificant terms. Because of the possible dependency among bj's, only one term should be removed before repeating the regression using the t test. This is explained in greater detail later. F test Introduction For regression analysis, we will define F as 4-27 Similar to the results derived in Appendix 3-A, the expected value of MSR is defined as (Neter and Wasserman,1974, p. 227) If $1 = $2 = ... = $p = 0, then E(MSR) = MSE and we conclude that If on the other hand, $1…0, then E(MSR) > MSE and we obtain Testing the full regression surface To test whether the regression model, with all terms, is explaining a significant amount of the variation in y, the following null and alternative hypotheses are used which is evaluating using F defined as The null hypothesis is rejected if that is, with (m-1) degrees of freedom in the numerator and (n-m) degrees of freedom in the denominator. Testing one or more slope terms Let’s assume that you wish to test one ore more parameters in the model. If the full model is defined as The reduced model is then defined (remove one or more variables) as where there are q parameters in the reduced model. The null and alternative hypotheses are then defined as 4-28 We will use the following notation SSE(F) = Residual sum of squares of full model SSE(R) = Residual sum of squares of reduced model SSR(F/R) = Extra sum of squares = SSE(R) - SSE(F) The ANOVA table summary is therefore Source df Sum of Squares Mean Squares Regression m-1 SSR(F) SSR(F)/(m-1) R q-1 SSR(R) F, given R m-q SSR(F/R) Residual dfF = n-m SSE(F) R dfR = n-q SSE(R) n-1 SSTO Total SSR(F/R)/(m-q) MSE = SSE/(n-m) If SSR(F/R) is small then the null hypothesis parameters add little to the regression analysis, that is, it supports the hypothesis of insignificant variables. If SSR(F/R) is large then the null hypothesis parameters add much to the regression analysis, that is, it refutes the hypothesis of insignificant variables. The appropriate test statistic is The null hypothesis is rejected if that is, with (m-q = dfR - dfF) degrees of freedom in the numerator and (n-m = dfF) degrees of freedom in the denominator. EXAMPLE PROBLEM #2 Problem statement 4-29 The goal is to make statistical inferences for the multiple regression problem of analyzing the mean annual flood as a function of watershed area and average annual maximum rainfall depth. The previously computed regression model and ANOVA table are repeated below. 
Source df Sum of Squares Regression Mean Square 2 13182.6 6591.3 Residual 11 171 15.6 Total 13 13353.7 and the previously computed (xTx)-1 was Test significance of regression The significance of the full regression model will be tested at 5% level. The appropriate null and alternative hypotheses are and which is evaluated using the test statistic of Since F(0.95,2,11) = 3.98, Ho is rejected. We conclude that the regression model is explaining a significant amount of the variation in y. Confidence interval for $1 Standard error of b1 Let’s first determine the standard error of b1. The variance-covariance matrix is defined as Since s2 =MSE=15.554 and 4-30 we obtain and the standard error of Confidence interval The 95% confidence interval for $1 can now be easily computed as and Confidence interval for regression surface Point of interest The confidence interval for the regression surface (mean value) is a function of its location. We will compute the confidence interval for x1 = 4 and x2 = 2, or The regression model value for that point is Variance for regression surface The variance for the regression value has been previously defined as from which we obtain the following value for this problem 4-31 and the standard error of Confidence interval The 95% confidence interval can now be computed as and Test slope parameters Test $1 = 0 Although we have previously shown by the F test that at least one of the slope coefficient is significantly different than zero, let’s evaluate whether each slope term is significantly different than zero at the 5% level. The null and alternative hypotheses for $1 are that is evaluated using the test statistic of where the value for sb1 was previously determined. Since t11,0.975 = 2.201, we reject the null hypothesis and conclude that area (x1) is explaining a significant amount of variation in y. Test $2 = 0 Let’s now evaluate whether $2 is significantly different than zero at the 5% level. The appropriate null and alternative hypotheses are that is evaluated using the test statistic of 4-32 The variance-covariance matrix defined as can be used to obtain and therefore The test statistic can now be computed as Since t11,0.975 = 2.201, we would not reject the null hypothesis. The slope parameter is not significantly different than zero at the 5% level, that is, it is not “explaining” a significant amount of the variation in y (with area in the analysis). Final model Since depth was not significant in explaining the variation in y, it likely should be removed from the model. A regression analysis would then be conducted using just area. The new equation for this regression is shown below. Note that the intercept changed slightly. The slope term remains very close to the original value because of the very low contribution of depth to the sum of squares in the first model. The R2 for the above regression model had negligible decrease, the overall F value increased, the residual mean square decreased, and the standard errors for both bo and b1 decreased. The interpretation of these trends is discussed in greater detail later. MULTICOLLINEARITY Introduction General 4-33 Multicollinearity exists when there are correlations (or linear interrelationships) among independent variables. This is sometimes referred to as intercorrelation. The degree of multicollinearity is usually assessed using correlation coefficients (for more discussion see Judge et al., 1987, pp. 868-871). 
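Since the degree of multicollinearity is assessed through correlation coefficients, a quick screening step is to compute the correlation matrix of the independent variables (and, if desired, variance inflation factors) before fitting. The sketch below is one way to do this; the function name and array layout are assumptions, not something taken from the text.

    import numpy as np

    def collinearity_screen(X):
        """Correlation matrix and variance inflation factors for the columns of X.

        X is an n x p array of independent variables (no intercept column).
        VIF_j = 1/(1 - Rj^2), where Rj^2 comes from regressing column j on the others.
        """
        n, p = X.shape
        corr = np.corrcoef(X, rowvar=False)          # p x p correlation matrix
        vif = np.empty(p)
        for j in range(p):
            xj = X[:, j]
            others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
            bj, *_ = np.linalg.lstsq(others, xj, rcond=None)
            resid = xj - others @ bj
            rj2 = 1.0 - resid @ resid / np.sum((xj - xj.mean()) ** 2)
            vif[j] = 1.0 / (1.0 - rj2)               # large values flag near-collinearity
        return corr, vif

For two independent variables the VIF reduces to 1/(1 - r_{x1,x2}²), which is the variance inflation factor discussed later in this section.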
As discussed in Chapter 2, correlation is a measure of the strength of a linear relationship between two variables (defined for random variables). Correlation coefficient is computed as discussed in the chapter, problems that result from multicollinearity include: * * * Difficulty in identifying the separate effects of variables, Large variances for pdfs of b’s, making it more difficult to conclude that bj …0. Odd trends between independent and dependent variables. Nearly perfect multicollinearity Perfect correlation For perfectly correlated variables r = ± 1. This corresponds to linear relationship between two or more variables. For example, or that is, a independent variable can be written as a linear combination of the other independent variables. As discussed in Chapter 1, a matrix that is composed of linear combinations of elements is singular and an unique solution does not exists. For regression analysis, the matrix (xT x)-1 is singular for exact multicollinearity. For nearly perfect correlation, special algorithms must be used to account for numerical instabilities in the solution technique. Example illustration for highly correlated variables To illustrate the impact of correlated variables, twenty values of y were obtained using the following equation where ei is a normally distributed residual. A second variable to be used in the regression analysis was obtained 4-34 where here ei is uniformly distributed residual set so that x1 and x2 are highly correlated (r=0.99). The value of y was determined independently of x2. The data obtained using these two equations are shown below. x2 x1 1.03 1.92 3.03 4.06 5.04 5.94 7.08 8.04 9.08 9.94 y 7.48 11.88 17.02 21.97 27.41 32.16 37.72 42.71 47.13 51.35 40.83 0.99 47.15 55.88 37.81 80.64 63.64 108.15 109.91 77.76 x1 x2 10.92 12.06 12.93 13.99 14.91 16.04 16.93 18.06 19.03 20.05 56.09 62.48 66.54 71.59 76.66 82.03 86.35 92.14 97.48 102.26 y 110.07 91.66 110.26 130.86 171.55 190.78 196.46 215.75 184.15 210.77 The regression equation for x2 is and therefore a = 2.148 and c = 4.986. The regression equation for the first model using x1 is and therefore bos = 3.748 and b1s = 10.282 (where the superscript s is used to indicate a single independent variable). The standard error of b1s is 0.813 and b1s is significantly different than zero at the 1% level (t = 12.6). The R2 = 0.9. The regression equation for the second model using both x1 and x2 is where bo = 2.826, b1 = 8.142, and b2 = 0.429. Since we obtain y from an algorithm, we know that y is, in fact, not a function of x2. The standard error of b1 is now 82.5 and b1 is not significantly different than zero at the 10% level (t = 0.1). The R2 = 0.9. Insight into Results Insight into the role of correlated independent variable can be obtained by considering nearly perfectly correlated variables. Let’s consider two nearly perfectly correlated independent variables where 4-35 and regression models of and Since x1 and x2 are highly correlated, the second regression model can be well approximated by We conclude that the intercept and slope of the one variable model are defined for the two variable model as Since we are typically interest in the change in y for a change in x1, the above relationship for the slope term is particularly important. When a collinear variable is added to the regression analysis (i.e, the second model), the part of the change in y with respect to x1 is shifted to the second variable. 
For example, if c and b2 are both positive, b1 must be smaller than b1s, indicating a smaller change in y for the same change in x1. The change in the b1s value implies that a range of parameters can be used to represent the observed dependent values. This greater range is reflected in a large variance for b1s. For our example data set, the above relationships are shown to hold, that is, and Theoretical Considerations Multicollinearity effects on slope parameter Let’s consider the following regression model: where terms are as previously defined. From the “no-intercept” formulation previously given, we know that the matrix xNTyN is defined as and the xNTxN matrix as 4-36 The determinant is then defined using rules given in Chapter 1 as By using the rules for determining an inverse matrix, we obtain or We can now evaluate b1 and b2 using the results for (xNTx)-1 and xNTyNas Although relationships for b1 and b2 can easily be determined, we will focus on b1. By using matrix multiplication, we obtain which can be rearranged as As shown in Appendix 3-A, the first term in the numerator is alternative solution form for b1 4-37 for simple regression. We then can further evaluate b1 as For rx1,x2 = 0, the value of the b1 with two parameter is identical to that obtain from simple regression. However, if there is correlation among x1 and x2, the value is clearly different. Evaluation of variance We can also easily evaluate the VAR(b) using (xNTx)-1 as We will focus on the variance of b1. From the above matrix, it is simply defined as where the definition for the variance of b1 in simple regression is the numerator of the first term. For uncorrelated x1 and x2 values, the variance of b1 is the same when a second variable is added to the regression model if the mean square of residual (F2) is constant. Frequently the mean square of residual is smaller for more than one independent variable. The above relationship shows that the addition of another correlated variable the variance increases and approaches infinity as the correlation among these variables approaches one. The above result can be generalized for more than two independent variables. The 1/(1rx1,x22) term is called the variance inflation factor. Example problem Data The data below were reported by Neter et al. (1983) to examine the relationship of body fat (y) to triceps skinfold thickness (x1), thigh circumference (x2) and midarm circumference (x3). 
x1 19.5 24.7 30.7 29.8 x2 43.1 49.8 51.9 54.3 x3 29.1 28.2 37.0 31.1 y 11.9 22.8 18.7 20.1 x1 31.1 30.4 18.7 19.7 x2 56.6 56.7 46.5 44.2 x3 30.0 28.3 23.0 28.6 y 25.4 27.2 11.7 17.8 4-38 19.1 25.6 31.4 27.9 22.1 25.5 42.2 53.9 58.5 52.1 49.9 53.5 30.9 23.7 27.6 30.6 23.2 24.8 12.9 21.7 27.1 25.4 21.3 19.3 14.6 29.5 27.7 30.2 22.7 25.2 42.7 54.4 55.3 58.6 48.2 51.0 21.3 30.1 25.7 24.6 27.1 27.5 12.8 23.9 22.6 25.4 14.8 21.1 The correlation coefficients among independent variables are: x1 and x2: r = 0.92 (highly correlated) x1 and x3: r = 0.46 (correlated) x2 and x3: r = 0.08 (low correlation) ryx1 = 0.84, ryx2 = 0.88 and ryx3 = 0.14 We can also compute the following sum terms: Regression of y on x1 (1st model) The regression parameters, corresponding standard errors, ANOVA table, coefficient of determination and overall F values are shown below Variable Parameter Intercept x1 Source Regression Standard Error t -1.496 3.319 -0.45 0.857 0.129 6.64 df Sum of Squares Mean Square 1 352.3 352.3 Residual 18 143.1 8.0 Total 19 495.4 R2 = 0.71 Overall F = Comments: * 71% of the total sum of squares explained by regression * b1 is significantly different than zero (1% level) * Overall F is significantly different than zero (1% level) Regression of y on x2 (2nd model) 44.3 4-39 The regression parameters, corresponding standard errors, ANOVA table, coefficient of determination and overall F values are shown below Variable Parameter Intercept x2 Source Standard Error -23.6 5.66 -4.17 0.86 0.11 7.82 df Sum of Squares Regression t Mean Square 1 382.0 382.0 Residual 18 113.4 6.3 Total 19 495.4 R2 = 0.77 Overall F = 60.6 Comments: * 77% of the total sum of squares “explained” by regression * bo and b1 are significantly different than zero (1% level) * Overall F is significantly different than zero (1% level) * Slightly improved model compared to the previous model Regression of y on x1 and x2 (3rd model) Let’s now consider the impact of adding a highly correlated variable (x2) to the 1st model (equivalent to adding highly correlated variable (x1) to the 2nd model). Theoretically, we know that the “new” slope term for x1 is defined as and the “new” standard error as The numerical results are shown below 4-40 Variable Parameter Intercept Standard Error t -19.2 8.36 -2.30 x1 0.22 0.30 0.73 x2 0.66 0.29 2.28 Source df Regression Sum of Squares Mean Square 2 385.4 192.7 Residual 17 110.0 6.5 Total 19 495.4 R2 = 0.78 Overall F = 29.8 In comparison to the 1st model * R2 increased 71% to 78% * MSE decreased from 8 to 6.5 * Standard error of b1 increased from 0.129 to 0.30 * b1 decreased from 0.86 to 0.22 * b1 is not significantly different than zero at the 1% level * Overall F decreased from 44.3 to 29.8 (still significant at 1% level) We therefore conclude that in terms of the goodness of fit parameters of R2 and MSE the addition of the second parameter improved the fit. However, in terms of confidence in the relationship between y and the b1, we have less confidence. In comparison to the 2nd model * R2 increased slightly 77% to 78% * MSE increased slightly from 6.4 to 6.5 * Standard error of b2 increased from 0.11 to 0.29 * b2 decreased from 0.86 to 0.66 * b2 is not significantly different than zero at the 1% level * Overall F decreased from 60.6 to 29.8 (still significant at 1% level) Given these results, it is reasonable to conclude that the third model is inferior to the second model. 
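The model comparisons in this example are easy to reproduce with a small helper that reports the coefficients, standard errors, t statistics, R², MSE, and overall F for any choice of independent variables. The sketch below is an illustration only; the function name is arbitrary, and the numerical comments refer to the body-fat results tabulated above.

    import numpy as np

    def fit_summary(y, *x_vars):
        """OLS fit with an intercept; returns b, standard errors, t, R2, MSE, overall F."""
        n = len(y)
        X = np.column_stack([np.ones(n)] + list(x_vars))
        m = X.shape[1]
        XtX_inv = np.linalg.inv(X.T @ X)
        b = XtX_inv @ X.T @ y
        e = y - X @ b
        sse = e @ e
        ssto = np.sum((y - y.mean()) ** 2)
        mse = sse / (n - m)
        se = np.sqrt(np.diag(mse * XtX_inv))         # standard errors of the b's
        t = b / se
        r2 = 1.0 - sse / ssto
        f = ((ssto - sse) / (m - 1)) / mse           # overall F
        return b, se, t, r2, mse, f

    # 1st model versus 3rd model for the body-fat data:
    # fit_summary(y, x1)      -> b1 near 0.86 with a standard error near 0.13
    # fit_summary(y, x1, x2)  -> b1 shrinks toward 0.2 and its standard error grows to about 0.3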
Regression of y on x2 and x3 (4th model) Let’s consider the impact of adding a variable (x3) to the 2nd model that has a low correlation. The numerical results are shown below. 4-41 Variable Parameter Intercept Standard Error t -26.0 7.0 -3.71 x2 0.85 0.11 7.73 x3 0.096 0.16 0.60 Source df Regression Sum of Squares Mean Square 2 384.3 192.2 Residual 17 111.1 6.5 Total 19 495.4 R2 = 0.78 Overall F = 29.4 In comparison to the 2nd model * R2 increased slightly 77% to 78% * MSE increased slightly from 6.4 to 6.5 * Standard error of b2 was essentially the same (0.11) * b2 was essentially the same (0.86) * b2 is significantly different than zero at the 1% level * Overall F decreased from 60.6 to 29.4 (still significant at 1% level) For independent variables that are uncorrelated, the regression parameters are the same with or without the other variables in the model. The change in the standard errors for the regression parameters are in direct proportion to the possible decrease in MSE with more than one independent variable. The t test can then be conducted for each parameters without redoing the regression analysis. The use of x3 added very little to the representation of the data. Regression of y on x1, x2, and x3 (final model) Let’s consider the results obtained using all of the independent variables. Variable Intercept Parameter Standard Error t 117.1 99.8 1.17 x1 4.33 3.02 1.43 x2 -2.86 2.58 -1.11 x3 -2.19 1.59 -1.38 4-42 Source df Regression Sum of Squares Mean Square 3 397.0 132.3 Residual 16 98.4 6.2 Total 19 495.4 R2 = 0.80 Overall F = 21.5 In comparison to the 2nd model * R2 increased slightly 77% to 80% * MSE decreased slightly from 6.4 to 6.2 * Standard error of b2 increased 0.11 to 2.58 (about twenty times) * Sign of b2 is negative * No slope parameters are significantly different than zero at the 1% level * Overall F decreased from 60.6 to 21.5 (but still significant at the 1% level) The 4th model predicts an decrease in body fat with x2, which is the opposite trend obtained with the 2nd model and is contrary to physical insight. Using this results could result in serious misinterpretation of the study and major error in possible extrapolation. To summarize, multicollinearity increases the variances of slope parameters and makes physical interpretation is more difficult. SELECTION OF REGRESSION MODEL Overview of concepts Review uses of regression analysis Description: The primary goal is identify significant variables influencing the response of a system, including the relative importance of these variables as indicated by the slope term. Sometimes the system is so poorly understood that there is little physical insight. Prediction: The primary goal is to develop an equation to predict the response of the system for a different set of values for the independent variables. Researchers should rely on their physical insight to select variables. It is possible that steps variable removed/retained for description studies should be retained/removed for prediction studies. Why reduce number of variables? 4-43 For descriptive studies, the primary goal is to determine the most significant variables. Since the standard error increases with correlated variables, insignificant variables should be removed from the analysis. 
For predictive studies, insignificant variables do not improve the fit of the model to the observed data while * * * * * Making the model more difficult and more expensive to use, Making the model more difficult to understand and possibly losing physical insight, which is especially a concern if you are unsure of extrapolation, Increasing the uncertainty (variance) of parameters and therefore confidence in predicted trends, Increasing the risk of extrapolation by reducing the range of possible values and Increasing the numerical roundoff errors in computations (at least for some numerical methods) Additional comments for predictive model Statistics should be used to supplement and not replace the physical insight a scientist or engineer brings to the problem. Do not use a model that violates your physical knowledge or understanding of the system. This is especially important if there is a possibility of extrapolation in a predictive model. Avoid using "canned" procedures, such as stepwise regression, to obtain parameters. This type of approach removes variables without allowing the scientist to provide physical insight into the problem. Sometimes the dominant physical process can not easily be obtained. Variables are then selected for the regression that the modeler thinks may be highly correlated to that process. Test data set is a good when is used with a different or poorly defined population. A poor fit with the regression models suggests that (1) there is not a strong linear relationship and/or (2) important variables have not been included in the study. Guidelines for removing variables Examine correlation matrix As previously discussed, independent variables that are highly correlated should be avoided because they (1) increase the standard errors and (2) can result in false trends between 4-44 dependent and independent variables. The latter issue is particularly dangerous if there is possible extrapolation. Significant variables For descriptive studies, the primary goal is to determine these variables. For predictive studies, try to retain only those variables that make a significant contribution to the regression. Based on your physical insight, you may decide that a 10% level of significance is acceptable for some variables, whereas, 1% level might be used for other variables. Multiple coefficient of determination The multiple coefficient of determination is A large R2 indicates that a large fraction of the variation about the mean is “explained” by the regression equation. The disadvantage is that R2 always increases when more variables are included in the model, even if they have no physical significance. The best model using only R2 is the one with the most parameters. Therefore, R2 should not be the sole criterion in selecting your model. Residual mean square The residual mean square is an estimate of the variance around the predicted surface and observed values. It has been previously defined as A small value implies small variance around the predicted surface and hence greater confidence in the predicted value. Let’s consider the impact of adding additional parameters (m 8) SSE 9 and n-m 9 and therefore additional parameters can increase s2 if (n-m) is reduced more than SSE. Overall F statistic The overall F statistic has been previously defined as 4-45 A large F means that at least one of the slope terms is significantly different zero. It is directly related to the significance of the regression. 
As shown in the previous section, the overall F value can increase with the removal of parameters.

Example problem

Problem definition

The following example data set given by Haan (1979) will be used to illustrate procedures for selecting independent variables for a multiple regression model. The goal is to predict runoff as a function of the following measurable watershed and rainfall characteristics:

y = Runoff = Mean annual runoff (inches)
x1 = Prec = Mean annual precipitation (inches)
x2 = Area = Watershed area (sq miles)
x3 = Slope = Average watershed slope (percent)
x4 = Len = Axial length (miles)
x5 = Perim = Watershed perimeter (miles)
x6 = Diam = Diameter of the largest circle possible within basin (miles)
x7 = Shape = Diam/(Diameter of largest circle enclosing basin)
x8 = Strm = Stream frequency = # streams / Area (1/sq miles)
x9 = Relief = (Total relief)/(Largest dimension)

Observed results for 13 different watersheds are shown below.

Runoff  Prec   Area  Slope  Len   Perim  Diam  Shape  Strm  Relief
y       x1     x2    x3     x4    x5     x6    x7     x8    x9
17.38   44.37  2.21  50     2.38  7.93   0.91  0.38   1.36  332
14.62   44.09  2.53  7      2.55  7.65   1.23  0.48   2.37  55
15.48   41.25  5.63  19     3.11  11.61  2.11  0.57   2.31  77
14.72   45.50  1.55  6      1.84  5.31   0.94  0.49   3.87  68
18.37   46.09  5.15  16     4.14  11.35  1.63  0.39   3.30  68
17.01   49.12  2.14  26     1.92  5.89   1.41  0.71   1.87  230
18.20   44.03  5.34  7      4.73  12.59  1.30  0.27   0.94  44
18.95   48.71  7.47  11     4.24  12.33  2.35  0.52   1.20  72
13.94   44.43  2.10  5      2.00  6.81   1.19  0.53   4.76  40
18.64   47.72  3.89  18     2.10  9.87   1.65  0.60   3.08  115
17.25   48.38  0.67  21     1.15  3.93   0.62  0.48   2.99  352
17.48   49.00  0.85  23     1.27  3.79   0.83  0.61   3.53  300
13.16   47.03  1.72  5      1.93  5.19   0.99  0.52   2.33  39

Correlation matrix

The first step is to examine the correlations among the independent variables and between the dependent and independent variables, as shown below. Length, perimeter, and diameter are highly correlated with area. Slope-and-relief and length-and-perimeter are also highly correlated. You would need to consider carefully whether you want to include these highly correlated variables in your regression model. The data also suggest that runoff increases with all of the independent variables except for Shape and Strm.

        Runoff  Prec   Area   Slope  Len    Perim  Diam   Shape  Strm   Relief
Runoff  1.00
Prec    0.39    1.00
Area    0.47    -0.25  1.00
Slope   0.41    0.08   -0.17  1.00
Len     0.42    -0.34  0.90   -0.21  1.00
Perim   0.46    -0.41  0.96   -0.10  0.92   1.00
Diam    0.33    -0.15  0.91   -0.16  0.67   0.81   1.00
Shape   -0.15   0.45   -0.25  0.05   -0.58  -0.41  0.15   1.00
Strm    -0.40   0.04   -0.48  -0.30  -0.53  -0.48  -0.32  0.29   1.00
Relief  0.35    0.42   -0.52  0.80   -0.54  -0.51  -0.50  0.17   -0.08  1.00

Results for full model

Let's consider the results obtained using all of the independent variables.

Variable    Parameter  Standard Error  t      Probability
Intercept   -14.74     6.779           -2.17  0.118
Prec        0.45       0.148           3.05   0.055
Area        0.19       1.424           0.13   0.903
Slope       -0.02      0.048           -0.38  0.731
Len         0.29       0.777           0.37   0.735
Perim       0.99       0.366           2.69   0.074
Diam        -3.05      4.796           -0.64  0.569
Shape       5.67       9.029           0.63   0.574
Strm        0.37       0.267           1.40   0.255
Relief      0.01       0.006           2.14   0.121

Source      df   Sum of Squares   Mean Square
Regression  9    43.5             4.83
Residual    3    1.4              0.47
Total       12   44.9

R2 = 0.97     Overall F = 10.4

A plot of the standardized residuals is shown below. Although there appears to be a trend of decreasing variance with the predicted values, there is not enough data to be conclusive. We should nonetheless be wary about the statistical conclusions drawn with this regression model.
[Standardized residual plot for the full model: standardized residuals versus predicted runoff.]

Let's first determine if one or more of the slope parameters are significantly different than zero at the 5% level. From the F table, for nine and three degrees of freedom in the numerator and denominator, respectively, we obtain F9,3,0.95 = 8.81. Since F = 10.4 > 8.81, we conclude that at least one slope parameter is significantly different than zero; the model is better than using the mean of the data.

Let's now evaluate which (if any) of the individual slope parameters are significantly different than zero at the 5% level. From the t table, we obtain t3,0.975 = 3.18. We therefore conclude that no individual slope parameter is significant at the 5% level (assuming that the other parameters are in the model).

Three most significant variables

Let's consider regression using the three most significant variables of the full model, that is, Prec, Perim, and Relief. The results are shown below.

Variable    Parameter  Standard Error  t      Probability
Intercept   -9.64      6.441           -2.17  0.058
Prec        0.43       0.093           4.62   0.001
Perim       0.62       0.075           8.24   0.000
Relief      0.01       0.002           5.19   0.001

Source      df   Sum of Squares   Mean Square
Regression  3    40.6             13.56
Residual    9    4.3              0.47
Total       12   44.9

R2 = 0.90     Overall F = 28.9

A plot of the standardized residuals is shown below. Although there is not enough data to be conclusive, there is insufficient evidence to reject the standard statistical assumptions.

[Standardized residual plot for the three-variable model: standardized residuals versus predicted runoff.]

Let's evaluate whether the removed terms have significance; that is, we are interested in the null hypothesis that the slope parameters for Area, Slope, Len, Diam, Shape, and Strm are all zero, against the alternative hypothesis that at least one of them is not zero.

The test statistic for removing one or more slope parameters is based on the extra sum of squares. From the above ANOVA table and the one given for the full model, we obtain

F = [(4.3 - 1.4)/6] / 0.47 = 1.0

Since F6,3,0.95 = 8.94, we do not reject Ho and conclude that the removed slope parameters are not significantly different than zero at the 5% level.

This appears to be a good model because:
* Trends for the independent variables correspond to hydrologic principles,
* The reduction in R2, compared to the full model, is relatively small,
* The MSE is essentially unchanged and the F statistic is much improved relative to the full model,
* All variables in the regression are significant,
* All variables excluded are statistically insignificant,
* It is easier to use than the full model, and
* The residual plot supports the statistical assumptions.

Consider area in model

Since area is more readily available for watersheds than perimeter, and since it is highly correlated to perimeter, it is worthwhile to consider replacing perimeter with area in the predictive model. The results are shown below.

Variable    Parameter  Standard Error  t      Probability
Intercept   0.45       6.200           0.07   0.943
Prec        0.26       0.136           1.91   0.089
Area        0.82       0.166           4.97   0.001
Relief      0.01       0.003           3.49   0.007

Source      df   Sum of Squares   Mean Square
Regression  3    35.2             11.70
Residual    9    9.7              1.08
Total       12   44.9

R2 = 0.78     Overall F = 10.8

A plot of the standardized residuals is shown below.

[Standardized residual plot for the model with area: standardized residuals versus predicted runoff.]

Although there may be a trend in the variance, the data lack enough points to be conclusive, and therefore the standard statistical assumptions will not be rejected. The regression model using area instead of perimeter has poorer goodness-of-fit statistics. For example:
* R2 decreases from 90% to 78%, and
* MSE increases from 0.47 to 1.08.
The statistical representation is also not as desirable.
For example, * Overall F decreased from 28 to 11 * Standard error for precipitation increased from 0.093 to 0.136 * Standard error for relief increased from 0.002 to 0.003 Based on these results, I would not select this model unless (1) obtaining the perimeter is a serious limitation in using the model or (2) a compelling physical argument can be made that runoff depth is more closely related to area than perimeter. Summary of results A summary of the different combination of independent variables is given in the following table. The MSE, R2, and F statistic for each model is reported. Variables are evaluated for significance at the 10%. Model 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 1 Prec 1 # # # # # # # # # # # Area 2 x x x Slope Len Perim Diam Shape Strm Relief MSE R2 F x x x # # # # # x x x x x x x x x x x # # # # # # # # 0.48 0.37 0.35 0.36 0.47 1.10 1.29 1.12 0.97 1.41 1.70 1.44 3.47 3.20 3.60 0.97 0.97 0.95 0.94 0.90 0.78 0.74 0.78 0.81 0.72 0.62 0.68 0.15 0.22 0.12 10.1 14.5 20.4 23.8 28.6 10.9 8.6 10.4 12.5 7.6 8.2 10.6 1.9 3.0 1.5 # # # # # # # # # x x x # - Indicates variable was included in the model and was significant 4-51 2 x - Indicates variable was included in the model and was insignificant The above table supports the use of Model 5. All variables are statistically significant. Other models greater than 5 show a noticeable increase in MSE and reductions in R2 and F. Stepwise regression analysis Introduction Most statistical software has routines to automatically select variables based on set criteria. You should use this option carefully because it removes your insight into the system when eliminating variables. An example of the results of stepwise regression is shown below using the F statistics. These values were obtained from the MICROSTAT software. 1st regression The first step is to compute the all simple (one parameter) regression for each of the p parameters. For each equation, compute F as The parameter with the largest F value is selected as the first parameter. If this value is not greater than the significant level of F, the regression is stopped. For the multiple regression analysis using the Haan data set, we obtain for each parameter Variable Prec Area Slope bj’s F 0.31 0.43 0.06 1.93 3.13 2.21 Variable Len Perim Diam bj’s F 0.71 0.28 1.28 2.37 3.03 1.39 Variable Shape Strm Relief bj’s F -2.63 -0.69 0.01 0.26 2.07 1.49 Since area had the largest F, the model after the first regression is 2nd regression The second step is to compute all regression equation with two independent variables, where Area is one of the pair. For each equation, evaluate F defined as 4-52 The parameter with the largest F value is selected as the second parameter. If this value is not greater than the significant level of F, the regression is stopped. For the multiple regression analysis using the Haan data set, we obtain for each parameter Variable Prec Area Slope bj’s F 2.30 # 2.16 5.29 # 4.66 Variable Len Perim Diam bj’s F 0.18 0.83 0.13 0.03 0.68 0.02 Variable Shape Strm Relief bj’s F 0.13 0.72 3.50 0.02 0.52 15.60 Since Relief had the largest F, the second regression model is then obtained using Area and Relief to obtain Any existing parameters (Area in this example) is evaluated using the following F-test: If variables are correlated, it is possible that the addition of another variables may change the need for having the original variable in the analysis. This step guarantees that only significant variables remain in the regression. 
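The forward-entry portion of this procedure can be sketched as follows. This is only an illustration of the F-to-enter logic described above, not the MICROSTAT implementation; the function name, the use of scipy.stats for the critical F value, and the default 5% significance level are assumptions made for the example.

```python
import numpy as np
from scipy.stats import f as f_dist

def forward_select(X_all, y, names, alpha=0.05):
    """Greedy forward selection using a partial F (F-to-enter) criterion.

    X_all is an n x p matrix of candidate predictors (no intercept column);
    n is assumed to exceed the number of fitted parameters at every step.
    """
    n, p = X_all.shape
    selected = []

    def sse(cols):
        # residual sum of squares for the model with an intercept plus cols
        X = np.column_stack([np.ones(n)] + [X_all[:, j] for j in cols])
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        return float(np.sum((y - X @ b) ** 2))

    while len(selected) < p:
        base = sse(selected)                 # SSE of the current model
        df_res = n - (len(selected) + 2)     # residual df if one more variable enters
        partial_f = {}
        for j in range(p):
            if j in selected:
                continue
            sse_j = sse(selected + [j])
            partial_f[j] = (base - sse_j) / (sse_j / df_res)
        best = max(partial_f, key=partial_f.get)
        if partial_f[best] < f_dist.ppf(1 - alpha, 1, df_res):
            break                            # no remaining variable is significant
        selected.append(best)
    return [names[j] for j in selected]
```

In the full stepwise procedure described above, each variable already in the model is also re-tested with a partial F after every entry and is removed if it is no longer significant; that re-test is what confirms whether Area stays in the model once Relief is added.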
For the Haan example, Area is still a significant variable.

Stepwise model

The above approach is continued until no significant F terms remain. For example, all regression models are obtained with three independent variables, where Area and Relief are two of the variables. The third variable then corresponds to the one that gives the largest F. The significance of Area and Relief is re-evaluated with this new parameter.

The final model obtained for the Haan example included watershed diameter with a negative slope parameter. Based on physical arguments, one would expect runoff to increase with increasing watershed diameter. This model violates that expectation and therefore should probably not be used.

The above technique assumes that there exists a single best set of predictor variables. There often is no unique "best" set. If variables are highly correlated, stepwise regression may arrive at an unreasonable "best" set of parameters; that is, the variables that are more descriptive of the process may not be included in the model.

OTHER REGRESSION TOPICS

Polynomial regression

Formulation

A polynomial of degree p (or order m = p+1) is defined as

y = βo + β1 x + β2 x^2 + ... + βp x^p + ε

which can be represented as a linear regression,

y = βo + β1 x1 + β2 x2 + ... + βp xp + ε

where x1 = x, x2 = x^2, x3 = x^3, ..., xp = x^p, and where βo, ..., βp are coefficients that can be estimated and statistically evaluated using the techniques previously discussed.

Discussion

Because higher-degree polynomials can curve in unexpected directions between and beyond the data points, polynomial regressions of degree greater than three are usually avoided. It is recommended that the polynomial equation be plotted to examine its shape before using it. To avoid highly correlated variables, the polynomial regression is usually formulated using deviations from the mean; that is, the above independent variables are redefined as

x1 = (x - xbar), x2 = (x - xbar)^2, x3 = (x - xbar)^3, ..., xp = (x - xbar)^p

Indicator variables

Definition

Indicator variables are used to represent qualitative variables such as gender (male or female) and the status of experimental design variables (treatment or no treatment). Indicator variables usually have a value of zero or one. They are sometimes called dummy variables or binary variables.

Analysis of variance can be performed using a regression analysis in which all of the independent variables are qualitative variables defined as indicator variables. This feature is sometimes used in the analysis of data gathered in some experimental designs.

Example applications

In this chapter, we will use indicator variables for regression analyses of:
* Piecewise linear regression,
* Two different data sets with common slope or intercept terms, and
* Two different data sets with common variance.

Illustration example

Widths of river channels are sometimes represented as a power function of flow rate, or

W = a Q^b1

where W is the channel width at a flow depth of h. The parameters "a" and b1 can be estimated using a natural log transformation as

y^ = bo + b1 x

where y^ = ln(W), bo = ln(a), and x = ln(Q).

Piecewise linear regression

Potential application

To introduce the concept of piecewise linear regression, let's consider a typical channel with a flood plain region as shown below. It is likely that separate equations could be used for the main channel and the flood plain.

[Figure: channel cross section with a main channel and flood plain, and the corresponding plot of y = ln(W) versus x = ln(Q) showing a change in slope at the breakpoint xb = ln(Qb).]

The separate linear equations for the channel and flood plain are

y^ = bo + b1 x (main channel)

and

y^ = bo' + b1' x (flood plain)

where the prime terms correspond to the flood plain region. Since the width is continuous between the two regions, it is desirable for the two equations to predict the same value at the breakpoint flow rate, xb = ln(Qb).
Regression model Let’s consider the following regression model: 4-55 where I is an indicator variable defined as I=0 I=1 for main channel for flood plain region The above regression model can then solved for the main channel (I=0) as and for the flood plain region (I=1) and therefore bo' and b1' can be defined as function of bo, b1, and b2. Furthermore, it is clear that for x = xb that we obtain the same value for y^ , that is The solution for vector the vector b can be obtained for the matrices y and x defined as and where yci and xci refer to the ln(W) and ln(Q) of values in the main channel and yfi and xfi refer to these values in the flood plain. In the above formulation, there are n observations in the main channel and q observations in the flood plain. The least square parameters can then be obtained using the multiple regression solution previously given as After the vector b has been determined, the linear parameters for the flood plain region are defined by the relationships given of bo' = bo - b1xb and b1' = b1 + b2. Extension to three piecewise segments 4-56 The approach can be extended to more than two piecewise segments. For example, the formulation for two breakpoints (xb1 and xb2) is where I’s are indicator variables defined as I1 = 1 for x > xb1 and I1 = 0 I2 = 1 for x > xb2 and I2 = 0 for x < xb1 for x < xb2 Common slope or intercept parameter Common slope parameter Channel 1 y = ln(W) For two approximately trapezoidal channels with the same sideslope ratio and different bottom widths, the slope parameter would likely be constant and the intercept would be different. You wish to combine the information from the two channels to estimate the slope term. This concept is shown below. Channel 2 b1 b’o Channel 1 b1 Channel 2 bo x = ln(Q) Let’s consider the following regression model: where I is an indicator variable defined as I=0 I=1 for the first channel for the second channel The above regression model can then solved for the first channel (I=0) as and for the second channel (I=1) where the intercept parameter for the second channel is therefore defined as bo' = bo+b2. The 4-57 vector b is as previously defined. The matrices y and x are defined as and where y1i and x1i refer to the ln(W) and ln(Q) of values in the first channel and y2i and x2i refer to these values in the second channel. In the above formulation, there are n observations in the first channel and q observations in the second channel. The vector b can now be computed using (xTx)-1xTy. Common intercept parameter Channel 1 y = ln(W) For two approximately trapezoidal channels with the same bottom widths and different sideslopes, the intercept parameter would likely be constant and the slope parameter would be different. You wish to combine the information from the two channels to estimate the intercept as shown below. Channel 2 b’1 b1 bo Channel 2 x = ln(Q) Let’s use the following regression model: where I is an indicator variable defined as I=0 I=1 for the first channel for the second channel The above regression model can then solved for the first channel (I=0) as Channel 1 4-58 and for the second channel (I=1) where the slope parameter for the second channel is therefore defined as b1' = b1+b2. The vector b and matrix y are as previously defined. The matrix x is now defined as The vector b can again be computed using (xTx)-1xTy. 
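To make the indicator-variable formulation concrete, the following sketch (not from the original notes) builds the design matrix for the piecewise case with one breakpoint and solves b = (xTx)-1 xTy with numpy. The function name and the exact parameterization, y^ = bo + b1 x + b2 I (x - xb), are assumptions made for the illustration.

```python
import numpy as np

def piecewise_fit(x, y, xb):
    """Continuity-constrained piecewise linear fit with one breakpoint xb.

    Model: y^ = b0 + b1*x + b2*I*(x - xb), where I = 1 for x > xb (flood plain)
    and I = 0 otherwise (main channel). With this parameterization the
    flood-plain segment has slope b1 + b2 and intercept b0 - b2*xb, so the
    two segments meet at xb.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    I = (x > xb).astype(float)                       # indicator variable
    X = np.column_stack([np.ones_like(x), x, I * (x - xb)])
    b = np.linalg.solve(X.T @ X, X.T @ y)            # b = (xTx)^-1 xTy
    return b                                         # [b0, b1, b2]
```

The common-slope and common-intercept formulations above follow the same pattern: only the indicator-based column of x changes (a column of I alone shifts the intercept; a column of I*x shifts the slope).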
Common variance Location 1 y = ln(W) Let’s assume that you wish to fit separate equations to two different cross sections where neither the slope nor the intercept terms are shared, but they have the same variance. You wish to combine the data set to obtain a better estimate of this variance. The representation of this problem is shown below. b’o Location 2 bo Location 2 b’1 b1 x = ln(Q) The appropriate regression model is where I is an indicator variable defined as I=0 I=1 for the first location for the second location Location 1 4-59 The above regression model can then solved for the first location (I=0) as and for the second location (I=1) where the intercept and slope parameters are defined similar to the approach previously given. As shown by Judge et al. (1988, pp. 428-430), the parameters obtained by solving the indicator variable formulation is identical to that obtained by solving each equation separately. If the data have the same variance, the above formulation allows the pooled variance to be determined directly. 4-60 APPENDIX 4-A: SELECTED DERIVATIONS Total and regression sum of squares The matrix form of the linear model is written in the chapter as y = y^ + e. Let’s consider Ey2i defined as where the property given in Chapter 1 of (y^ + e)T = y^T + eT has been used. Since y^Te=eTy^, we can simplify as By using y^ =x b and the property given in Chapter 1 of (xb)T = bT xT, we obtain From the minimization solution for bj using the least square methods given in the chapter, we have shown that By using this result and by subtracting nyG 2 from both sides of the equation, we obtain As shown in the chapter, the first term of the left-hand side of the equation equals SSR using the minimization solution for bo, or and therefore we able to partition the total sum of squares as An alternative, and computational simpler, form of the SSR can be obtained by using the result given in the chapter of y^ = x b, or where the property given in Chapter 1 of (xb)T = bT xT has again been used. Since we have previously shown that xT e = 0, we simplify as Expected value of least square estimator As derived in the chapter, the least square estimator for the vector b is defined as 4-61 The expected value can then be evaluated as where y is defined using the linear model of y = x $ + g. By expanding terms, we obtain where (xTx)-1xTx = I has been used. Since $ and (xTx)-1xT are constants (nonrandom) for a given formulation, the expected value can be written as where E[g] = 0 from the minimization solution for bo. Variance-covariance matrix of b General definition Let’s propose the following definition for the variance-covariance matrix where $ = E[b] has been used. We will now show that this definition corresponds to variance on the diagonal elements and covariance for the other elements. Let’s first consider the following matrix product or The variance-covariance is then obtained by taking the expected value of both sides. Since the expected value of the matrix can be evaluated by the expected value of individual terms, we conclude 4-62 and therefore the original definition of the variance-covariance has been shown to result in the variance-covariance matrix of Least squares variance-covariance matrix Let’s now consider more carefully our starting definition of the variance-covariance matrix. 
From the previous section, it is clear that the b can be written as and therefore the variance-covariance matrix can be written and simplified as From the matrix rules in Chapter 1 of (xy)T = yT xT and (y-1)T = (yT)-1, we can evaluate the last multiplicative term as and therefore the variance-covariance matrix can be written as where (xTx)-1xT and x (xTx)-1 are constants (nonrandom). For independent and constant variance, we know that where I is the identity matrix. We can therefore simplify as The variance-covariance matrix for b is obtained as 4-63 Gauss-Markov theorem General approach The Gauss-Markov theorem is used to show that of all possible unbiased linear estimators, the least squares estimator results in the minimum variance. This powerful result is limited by the assumptions of independence and constant variance of residuals used in the previous section. The derivation follows that given by Judge et al. (1988). Their approach utilizes the definition of a positive semidefinite matrix. As discussed in Chapter 1, a symmetrical nxn matrix W is positive semidefinite if where u is a 1xn vector. Let’s first introduce a general formulation by considering the linear combination of the regression parameters defined as where aT = [ao, a1, ... , ap] is any set of constants. A possible alternative estimator for the regression parameters will be represented by the vector b^ . The least squares estimator b has a smaller variance than b^ if For example, if aT = [1, 0, ... , 0] then we require that VAR(bo) # VAR(b^ o); if aT = [0, 1, ... , 0] then VAR(b1) # VAR(b^ 1); and if aT = [1, 1, ... , 0] then VAR(bo+b1) # VAR(b^ o+b^ 1). The above conditions can be represented using the covariance matrices as or for all a Let’s consider matrix W = COV(b^ )-COV(b). We only need to show that W is positive semidefinite to meet the above criterion. Unbiased condition Let’s consider the class of all possible linear estimators. Previously in this appendix, we used the following definition of the least squares estimator The class of all possible linear estimators can be represented as where C is a general mxn matrix. The mathematical details to obtain the last equality have 4-64 been previously given in this appendix. Let’s consider possible constraints on C to limit the analysis to unbiased linear estimators. For an unbiased estimator, we know that E(b^ ) = $. By taking the expected value of the above general linear estimator, we obtain where $, C and x are non-random variables or constants. For unbiased estimator, then matrix is constraint by the condition of Cx = 0. By using this result, the general definition of the linear estimator is now defined for an unbiased conditions as Minimum variance Let’s now compute the covariance of the general unbiased linear estimator determined in the previous subsection. Many of the covariance manipulations are similar to those previously given. The covariance matrix is defined as By using the matrix rules in Chapter 1 of (x+y)T = xT+yT, (xy)T = yT xT and (y-1)T = (yT)-1, we can evaluate the transpose term as where the details of the manipulations have been previously given in this appendix. By using this result, the covariance matrix simplifies as Since x and C are non-random, we can use the expected values rules to obtain By using for independent and constant variance residuals, we further simplify as We previously shown that Cx is a null matrix for unbiased estimator. Similar CTxT is also a null matrix. 
By using COV(b) = F2 (xTx)-1, we obtain As discussed in Chapter 1, the matrix-transpose matrix product is always positive semidefinite. Since F2 is also always positive, we conclude that W is positive semidefinite. We have therefore shown that, out of all of the class of linear unbiased estimators, the least squares method results in the smallest possible variance. 4-65 REFERENCES Beck and Arnold. 1977. Parameter Estimation in Engineering and Science. John Wiley and Sons. Devore, J.L. 2001. Probability and Statistics for Engineering and the Sciences. Duxbury Press, New York. Draper and Smith 1998. Applied Regression Analysis. John Wiley and Sons, New York.. Haan, C.T. 1979. Statistical Methods in Hydrology. Iowa State University Press. Judge, G.G., R.C. Hill, W.E. Griffiths, H. Lutkepohl, and T-C. Lee. 1988. Introduction to the Theory and Practice of Econometrics. John Wiley and Sons, New York. Neter, J. and W. Wasserman. 1974. Applied Linear Statistical Models. Irwin Inc., Homewood, IL. Neter, Kutner, Nachtsheim, and Wasserman. 1996. Applied Linear Statistical Models. McGrawHill, Boston. Pratt, J.W. 1965. Bayesian interpretation of standard inference statements. J. Roy. Statist. Soc. B 27: 169-203. 4-66 PROBLEM ASSIGNMENT Set #1: Due Date: Set #2: Due Date: Please do not use commercial statistical software to solve Problems #1 and #2. These problems should be solved using the matrix solution techniques of Chapter 1. Statistical software can be used (and is recommended) to check your values. Problem #1 (20 points) You are interested in predicting the gas mileage of cars as a function of horsepower, length, and weight. The following data have been measured for ten different cars (4_gas.xls). Car Id 1 2 3 4 5 6 7 8 9 10 Gas Mileage y 13.9 16.5 16.5 17.8 18.75 19.73 20.07 20.3 30.4 36.5 Length x1 215.5 195.4 185.2 199.9 199.9 215.3 194.1 168.8 165.2 160.6 Width x2 78.5 74.4 69 74 74 76.3 71.8 69.4 65 62.2 Weight x3 4540 3885 3660 3890 3890 4370 3365 2700 2320 2009 You are required to (1) determine the xTy, xTx, and (xTx)-1 matrices, (2) compute the parameters bo, b1, b2 and b3 using matrix multiplication, (3) sum of squares and mean squares corresponding to an ANOVA table, (4) the coefficient of determination, and the (5) variancecovariance matrix for b. What are the standard errors of bo, b1, b2 and b3? Problem #2 (20 points) You are interested in predicting the heat evolved of cement (calorie/gram) as a function of the percentages (by weight) of dicalcium silicate, calcium aluminum ferrate, and tricalcium silicate. The following data have been recorded (4_heat.xls). 4-67 Heat Evolved (y) Aluminum Ferrate (x2) Dicalcium Silicate (x1) 78.5 74.3 104.3 87.6 95.9 109.2 102.7 72.5 93.1 115.9 83.8 113.3 109.4 60 52 20 47 33 22 6 44 22 26 34 12 12 Tricalcium Silicate (x3) 6 15 8 8 6 9 17 22 18 4 23 9 8 26 29 56 31 52 55 71 31 54 47 40 66 68 You are required to (1) determine the xTy, xTx, and (xTx)-1 matrices, (2) compute the parameters bo, b1, b2 and b3 using matrix multiplication, (3) sum of squares and mean squares corresponding to an ANOVA table, (4) the coefficient of determination, and the (5) variancecovariance matrix for b. What are the standard errors of bo, b1, b2 and b3? Problem #3 (20 points) The data given below are the results of a small scale experiment on the effects of work crew size and level of bonus pay on crew productivity scores (4_crew.xls). 
Crew Size x1 Bonus Pay ($) x2 Productivity Scores y 4 4 4 4 6 6 6 6 2 2 3 3 2 2 3 3 42 39 48 51 49 53 61 60 You are to: (1) (2) Determine the correlation between crew size and bonus pay. Are these variables correlated (accounting for possible roundoff error)? Calculate the regression of (a) y on x1 and x2 , (b) y on x1 only, and (c) y on x2 only. Report both the ANOVA tables and slope coefficients for all three options. Does the slope parameter for x1 change when x2 is included in the model? Does the slope parameter for x2 change when x1 is included in the model? 4-68 (3) Calculate SSR(x2/x1) and SSR(x1/x2). Is SSR(x2/x1) equal (within roundoff error) to SSR(x2)? Is SSR(x1/x2) equal (within roundoff error) to SSR(x1)? Problem #4 (20 points) Consider the following data obtained from ten shipments of chemicals in drums (4_ship.xls). You are interested in the number of human-minutes required to handle the shipment as a function of the number of drums and the total weight of the shipment. Shipment 1 2 3 4 5 6 7 8 9 10 Number of Barrels - x1 7 18 5 14 11 5 23 9 16 5 Total Weigh (100 pounds) x2 5.11 16.70 3.20 7.00 11.00 4.00 22.10 7.00 10.60 4.80 Human-minutes y 58 152 41 93 101 38 203 78 117 44 You are to: (1) (2) (3) Determine the correlation between number of barrels and total weight. Are these values highly correlated? Calculate the regression of (a) y on x1 and x2 , (b) y on x1 only, and (c) y on x2 only. Report both the ANOVA tables and slope coefficients for all three options. Does the slope parameter for x1 change when x2 is included in the model? Does the slope parameter for x2 change when x1 is included in the model? Calculate SSR(x2/x1) and SSR(x1/x2). Is SSR(x2/x1) equal to SSR(x2)? Is SSR(x1/x2) equal to SSR(x1)? Problem #5 (30 points) Physical fitness measurements on men were made at North Carolina State University. The results are shown below (4_fitness.xls), where age is in years, weight in kg, o_rate is the oxygen uptake rate in mL/kg/min, t_run is the time to run 1.5 minutes in minutes, h_rest is the heart rate while resting in beats/min, h_run is the heart rate while running in beats/mine, and h_max is the maximum recorded heart rate in beats/min. Obtain a predictive equation for oxygen uptake rate as a function of one or more of the other variables. Assume that all variables are equally easy to obtain. Report the correlation coefficients for the independent variables. Summarize your results in a table showing the 4-69 significant variables and corresponding R2, MSE, and F values for each set of variables considered in your analysis. Show the residual plots for your predictive model. 
O_rate y 44.609 45.313 54.297 59.571 49.874 44.811 45.681 49.091 39.442 60.055 50.541 37.388 44.754 47.273 51.855 49.156 40.836 46.672 46.774 50.388 39.407 46.08 45.441 54.625 45.118 39.203 45.79 50.545 48.673 47.92 47.467 Age x1 44 40 44 42 38 47 40 43 44 38 44 45 45 47 54 49 51 51 48 49 57 54 52 50 51 54 51 57 49 48 52 Weight x2 89.47 75.07 85.84 68.15 89.02 77.45 75.98 81.19 81.42 81.87 73.03 87.66 66.45 79.15 83.12 81.42 69.63 77.91 91.63 73.37 73.37 79.38 76.32 70.87 67.25 91.63 73.71 59.08 76.32 61.24 82.78 T_run x3 11.37 10.07 8.65 8.17 9.22 11.63 11.95 10.85 13.08 8.63 10.13 14.03 11.12 10.6 10.33 8.95 10.95 10 10.25 10.08 12.63 11.17 9.63 8.92 11.08 12.88 10.47 9.93 9.4 11.5 10.5 H_rest x4 62 62 45 40 55 58 70 64 63 48 45 56 51 47 50 44 57 48 48 67 58 62 48 48 48 44 59 49 56 52 53 H_run x5 178 185 156 166 178 176 176 162 174 170 168 186 176 162 166 180 168 162 162 168 174 156 164 146 172 168 186 148 186 170 170 H_max x6 182 185 168 172 180 176 180 170 176 186 168 192 176 164 170 185 172 168 164 168 176 165 166 155 172 172 188 155 188 176 172 Create a new independent variable in the spreadsheet or in your software package defined as weight in pounds (1 kg = 2.2 lbm). Try a regression model with both weight in kg and lbm as independent variables. What happens? Problem #6 (30 points) The following variables have been measured for 108 different diesel tractors (4_tract.xls): * Cylinder - x1 = Number of cylinders * Fuel_wt - x2 = Weight density of fuel (lbs of fuel/gallon) 4-70 * Bore - x3 = Bore diameter (inches) * Stroke - x4 = Length of engine stroke (inches) * Pto_m = Maximum recorded PTO power (hp) * Pto-gas - x5 = PTO power-hour (energy) per gallon of fuel consumption (hp-h/gal) * Displace - x6 = Cylinder displacement (cubic inches) * D_bar -y = Maximum recorded drawbar power (hp) You are required to determine a predictive equation for D_bar as a function of Cylinder, Fuel_wt, Bore, Stroke, Pto-gas and/or Displace (note: Pto-m is not included). Assume that all variables are equally easy to obtain (note that Displace = cylinder*stroke*bore2 *B/4). Summarize your results in a table showing the significant variables and corresponding R2, MSE, and F values for each set of variables considered in your analysis. Cylinder x1 Fuel_wt x2 3 3 2 2 4 3 6 3 3 3 6.94 6.94 6.94 6.94 6.976 7.008 7.062 6.976 6.976 7.008 Bore x3 3 3 3 3 3 3 3.23 3.23 3.23 3.23 Stroke x4 3.23 3.23 3.23 3.23 3.23 3.23 3.23 3.23 3.23 3.23 Pto_m 22.06 22.35 15.33 15.45 29.35 19.59 49.72 26.46 26.21 23.42 Pto_gas x5 13.26 13.6 12.81 13.13 12.78 14.17 14.37 13.99 13.24 14.21 Displace x6 68.495 68.495 45.663 45.663 91.326 68.495 158.8 79.4 79.4 79.4 D_bar y 17.8173 18.1154 12.48 12.1457 25.2997 15.3504 40.8637 21.576 21.7507 18.8695 Problem #7 (30 points) Consider the following set of x and y values (4_piece.xls). Perform a simple regression analysis of this data. Summarize your results in ANOVA table. Also analyzed the data using a piecewise regression model with a breakpoint at 10. Here you will need to use indicator variables. Summarize the results in ANOVA table. 
Sample 1 2 3 4 5 6 7 8 9 10 x 1.03 1.92 3.03 4.06 5.04 5.94 7.08 8.04 9.08 9.94 y 9.17 1.20 10.43 12.18 8.56 17.13 13.73 22.63 22.98 16.55 Sample 11 12 13 14 15 16 17 18 19 20 x 10.92 12.06 12.93 13.99 14.91 16.04 16.93 18.06 19.03 20.05 y 30.35 35.81 46.45 59.10 74.56 87.51 95.77 108.60 110.08 123.52 4-71 SOLUTION KEY Problem #1 (20 points) Obs y 1 2 3 4 5 6 7 8 9 10 y= n= ybar = ANOVA Calculations (y-ybar)^2 13.9 51.051025 16.5 20.657025 16.5 20.657025 17.8 10.530025 18.75 5.267025 19.73 1.729225 20.07 0.950625 20.3 0.555025 30.4 87.516025 36.5 238.857025 x= 1 1 1 1 1 1 1 1 1 1 x1 215.5 195.4 185.2 199.9 199.9 215.3 194.1 168.8 165.2 160.6 10 21.045 SSTO= df= Use Paste Special - Transpose 1 1 xT = 215.5 195.4 78.5 74.4 4540 3885 (xTx)^-1 = 10 1899.9 1899.9 364446.2 714.6 136624.3 34629 6726230 x2 4540 3885 3660 3890 3890 4370 3365 2700 2320 2009 m= b 39.11828 0.63069 -1.24523 -0.01413 b= 1 185.2 69 3660 yhat= yhat 13.1532435 14.83381678 18.30317321 18.09938526 18.09938526 18.16788824 24.59661617 21.02197034 29.5980608 34.57646043 SSE = df= MSE = 1 1 199.9 200 74 74 3890 3890 714.6 34629 136624.32 6726229.9 51297.74 2511871.8 2511871.8 126493231 146.892 -0.32108 -2.0998342 0.0185581 -0.3211 0.005927 -0.006675 -9.47E-05 -2.0998 -0.00668 0.05501102 -0.000163 0.01856 -9.5E-05 -0.0001626 3.191E-06 1 215.3 76.3 4370 xTy = Var(b) = ei 0.746756 1.666183 -1.80317 -0.29939 0.650615 1.562112 -4.52662 -0.72197 0.801939 1.92354 34.89298 e^2 0.557645265 2.776166511 3.25143364 0.089631534 0.423299541 2.440193139 20.49025397 0.52124117 0.643106488 3.700004478 402.8770743 (yhat-ybar)^2 62.27982 38.5788 7.517614 8.676646 8.676646 8.277772 12.61398 0.00053 73.15485 183.1004 6.13E-11 4 437.77005 9 R^2 = F= xTx = x2 78.5 74.4 69 74 74 76.3 71.8 69.4 65 62.2 34.89298 402.8770743 "= SSR" 6 3 "= df" 5.815496 134.2923581 "=MSR" 0.920294 23.09216 1 194.1 71.8 3365 1 168.8 69.4 2700 1 165.2 65 2320 1 160.6 62.2 2009 210.45 39035.77 14763.5 682200.2 854.2503 -1.86727 -12.2116 0.107925 29.22756048 -1.86727 0.034466 -0.03882 -0.00055 StError(b)= -12.2116 -0.03882 0.319916 -0.00095 0.107925 -0.00055 -0.00095 1.86E-05 0.18565 0.565611471 0.004308 4-72 Problem #2 (20 points) Obs 1 2 3 4 5 6 7 8 9 10 11 12 13 y y= n= ybar = ANOVA Calculations 78.5 74.3 104.3 87.6 95.9 109.2 102.7 72.5 93.1 115.9 83.8 113.3 109.4 (y-ybar)^2 286.390533 446.184379 78.7997633 61.2005325 0.22745562 189.803609 52.9536095 525.467456 5.39668639 419.304379 135.095917 319.584379 195.354379 x1 1 1 1 1 1 1 1 1 1 1 1 1 1 x= x2 60 52 20 47 33 22 6 44 22 26 34 12 12 13 95.42308 SSTO= df= Use Paste Special - Transpose 1 1 xT = 60 52 6 15 26 29 (xTx)^-1 = 13 390 153 626 390 15062 4628 15739 b 26 29 56 31 52 55 71 31 54 47 40 66 68 m= 203.642 -1.55704 -1.44797 -0.92342 b= yhat 77.52262406 74.17699521 109.2059999 90.25118597 95.55402201 105.5673548 104.121649 74.65072463 93.45903045 113.9663586 80.46245907 110.9802285 110.5813677 yhat= 1 47 8 31 153 4628 2293 7201 626 15739 7201 33050 51.9812 -0.59965 -0.1993881 -0.65557 -0.5996 0.007097 0.0020035 0.0075419 -0.1994 0.002004 0.00263704 0.0022479 -0.6556 0.007542 0.00224794 0.0083661 1 33 6 52 e^2 0.955263731 0.015130178 24.06883484 7.028787024 0.119700768 13.19611083 2.021085997 4.62561643 0.128902864 3.73896909 11.13917948 5.38133977 1.395629729 2641.948526 (yhat-ybar)^2 320.4262 451.396 189.969 26.74846 0.017147 102.9064 75.66516 431.4906 3.857479 343.8533 223.8201 242.025 229.7738 -5.3E-12 73.81455 2641.948526 "= SSR" 9 3 "= df" 8.201617 880.6495087 "=MSR" SSE = df= MSE = 1 20 8 56 ei 0.977376 0.123005 
-4.906 -2.65119 0.345978 3.632645 -1.42165 -2.15072 -0.35903 1.933641 3.337541 2.319771 -1.18137 73.81455 4 2715.76308 12 0.97282 107.3751 R^2 = F= xTx = x2 6 15 8 8 6 9 17 22 18 4 23 9 8 1 22 9 55 xTy = Var(b) = 1 6 17 71 1 44 22 31 1 22 18 54 1 26 4 47 1 34 23 40 1 12 9 66 1 12 8 68 1240.5 34733.3 13981.5 62027.8 426.3298 -4.91807 -1.6353 -5.37673 20.64775475 -4.91807 0.058203 0.016432 0.061855 StError(b)= 0.241254 -1.6353 0.016432 0.021628 0.018437 0.147064608 -5.37673 0.061855 0.018437 0.068615 0.261945 Problem #3 (20 points) crew bonus 4 4 4 4 6 6 6 6 r= 2 2 3 3 2 2 3 3 prod 42 39 48 51 49 53 61 60 0 Variables are uncorrelated. The slope coefficients are the same for both models. SSR(x2/x1) equals SSR(x2) and SSR(x1/x2) equals SSR(x1). Differences in the standard errors are caused by smaller MSE for the model with both parameters. 4-73 df Regression Residual Total Intercept Crew Pay SS 402.25 17.63 419.88 2 5 7 Coeff Std Error 0.375 4.74 5.375 0.66 9.25 1.33 SSR(x2/x1) = SSE(x1) - SSE(x1,x2) 188.75 17.63 171.13 df Regression Residual Total SS 231.13 188.75 419.88 1 6 7 df Regression Residual Total Coeff Std Error 23.5 10.11 5.375 1.98 Intercept Crew Intercept Pay 1 6 7 SS 171.13 248.75 419.88 Coeff Std Error 27.25 11.61 9.25 4.55 SSR(x2/x1) =SSE(x1) - SSE(x1,x2) 248.75 17.63 231.13 Problem #4 (20 points) obs 1 2 3 4 5 6 7 8 9 10 correl = df Regression Residual Total Intercept Barrels Weight 2 7 9 SS 25720.11 78.39 25798.50 Coeff 3.37 4.07 4.72 Std Error 2.34 0.49 0.51 SSR(x2/x1) =SSE(x1) - SSE(x1,x2) 1046.85 78.39 968.46 Number 7 18 5 14 11 5 23 9 16 5 0.93 Weight 5.11 16.70 3.20 7.00 11.00 4.00 22.10 7.00 10.60 4.80 df Regression Residual Total Intercept Barrels 1 8 9 Human 58 152 41 93 101 38 203 78 117 44 SS 24751.65 1046.85 25798.50 Coeff Std Error -1.98 7.76 8.36 0.61 df Regression Residual Total Intercept Weight SS 1 24964.73 8 833.77 9 25798.50 Coeff Std Error 13.70 6.03 8.61 0.56 SSR(x2/x1) =SSE(x1) - SSE(x1,x2) 833.77 78.39 755.38 Variables are highly correlated. The slope coefficients are different for both models. SSR(x2/x1) 4-74 does not equal SSR(x2) nor does SSR(x1/x2) equals SSR(x1). Problem #5 (30 points) Correlation among variables: Orate Orate Age Weight T_run H-rest H-run H-max Age Weight T_run H-rest H-run H-max 1 -0.304592 1 -0.162753 -0.233539 1 -0.862195 0.188745 0.143508 1 -0.399356 -0.1641 0.043974 0.450383 1 -0.397974 -0.33787 0.181516 0.313648 0.352461 1 -0.23674 -0.432916 0.249381 0.226103 0.305124 0.929754 1 All variables: df Regression Residual Total Intercept age weight t_run h_rest h_run h_max SS MS F 6 722.5436 120.4239 22.43263 24 128.8379 5.368247 R Square 30 851.3815 Adjusted R2 Std Error Value Std Error t Stat P-value 102.9345 12.40326 8.298987 1.64E-08 -0.226974 0.099837 -2.273433 0.032235 -0.074177 0.054593 -1.358731 0.186866 -2.628653 0.384562 -6.835443 4.54E-07 -0.021534 0.066054 -0.325999 0.74725 -0.369628 0.119853 -3.084011 0.005079 0.303217 0.136495 2.221449 0.036007 0.848672 0.81084 2.316948 4-75 3 Std Residuals 2 1 0 -1 30 35 40 45 50 55 60 -2 -3 Predicted Model 1 2 3 1 2 Age x1 #1 # # 4 5 6 7 # # 8 # R2 F 0.84 0.82 0.81 22.4 22.4 28.0 5.95 7.17 7.27 7.53 0.81 0.76 0.76 0.74 38.6 45.4 44.7 84.0 26.6 0.09 3.0 Weight t_run h_rest h_run h_max MSE x2 x3 x4 x5 x6 2 x # x # # 5.37 x # x # 6.21 # x # 6.17 # # # # # # # - Indicates variable was included in the model and was significant at 10% x - Indicates variable was included in the model and was insignificant at 10% All the bold models are reasonable. The best three parameter model is Model 4. 
The regression model and residual plots are shown below. There is insufficient evidence from the residual plots to conclude that the standard assumptions for the residuals are unreasonable. 4-76 3 Std Residuals 2 1 0 -1 35 40 45 50 55 60 -2 -3 Predicted The best two parameter model is Model 5. The regression model and residual plots are shown below. There is insufficient evidence from the residual plots to conclude that the standard assumptions for the residuals are unreasonable. 3 Std Residuals 2 1 0 -1 35 40 45 50 55 60 -2 -3 Predicted The best one parameter model is Model 7. The regression model and residual plots are shown below. There is insufficient evidence from the residual plots to conclude that the standard assumptions for the residuals are unreasonable. 4-77 3 Std Residuals 2 1 0 -1 35 40 45 50 55 60 -2 -3 Predicted The selection of the “best” of the three relative good models is now dependent on the expertise of the system by the engineer or scientist. Correlation matrix with weight in kg. Orate Age Weight T_run H-rest H-run H-max wt_kg Orate 1.00 -0.30 -0.16 -0.86 -0.40 -0.40 -0.24 -0.16 Age Weight 1.00 -0.23 0.19 -0.16 -0.34 -0.43 -0.23 T_run 1.00 0.14 0.04 0.18 0.25 1.00 H-rest 1.00 0.45 0.31 0.23 0.14 H-run 1.00 0.35 0.31 0.04 H-max 1.00 0.93 0.18 Regression results in EXCEL. Large standard error : ANOVA df Regression 7 Residual 23 Total 30 Intercept age weight t_run h_rest Value 102.9 -0.2 1.3 -2.6 0.0 SS 722.54 128.84 851.38 MS 103.22 5.60 Std Error t Stat 12.7 0.1 257094.2 0.4 0.1 8.1 -2.2 0.0 -6.7 -0.3 F 18.43 P-value 0.0 0.0 1.0 0.0 0.8 R Square0.85 Adjusted R20.80 Std Error2.37 wt_kg 1.00 0.25 1.00 4-78 h_run h_max wt_kg -0.4 0.3 -0.6 0.1 0.1 116861.0 -3.0 2.2 0.0 0.0 0.0 1.0 Problem #6 (30 points) Correlation matrix: d_bar Cylinder Fuel_wt bore stroke pto_m pto_gas displace d_bar 1 0.90 0.09 0.65 0.64 0.99 0.37 0.96 Cylinder Fuel_wt bore stroke pto_m pto_gas displace 1.00 0.18 0.42 0.45 0.91 0.23 0.93 1.00 -0.03 -0.03 0.10 0.07 0.11 1.00 0.68 0.64 0.47 0.65 1.00 0.64 0.46 0.69 1.00 0.36 0.97 1.00 0.34 1.00 Regression analysis using all variables: df Regression Residual Total 6 101 107 Intercept Cylinder Fuel_wt bore stroke pto_gas displace Value 59.734 0.151 -10.263 3.849 -3.192 0.957 0.206 SS MS F 40219.2 6703.199 226.1354 2993.884 29.64242 43213.08 R Square Adjusted R2 Std Error Std Error t Stat P-value 85.704 0.697 0.487 2.281 0.066 0.947 12.065 -0.851 0.397 3.915 0.983 0.328 2.158 -1.479 0.142 0.487 1.967 0.052 0.043 4.784 0.000 Residuals appear to violate the condition of constant variance. 0.930718 0.926602 5.444485 4-79 4 Std Residuals 3 2 1 0 -1 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 -2 -3 Predicted Model Cylinder Fuel_wt x1 x2 1 x1 x2 2 3 4 5 6 7 8 9 1 2 # # # # bore x3 x x R2 F 0.93 0.93 226.1 342.3 29.53 30.19 30.82 0.93 0.93 0.92 453.2 663.1 1296.3 46.71 65.76 48.10 77.36 0.88 0.84 0.88 0.81 273.7 276.1 396.67 452.61 stroke pto_gas displace MSE x4 x5 x6 x # # 29.64 # # # 29.28 # # # # # # # # # # # - Indicates variable was included in the model and was significant at 10% x - Indicates variable was included in the model and was insignificant at 10% Three best models are shown in bold. Model 5 would likely be the best model with small increase in MSE, small decrease in R2, and the largest overall F compared to Models 3 and 4. For comparison, the best three parameter model is Model 3. The regression model and residual plots are shown below. Residuals suggest that the variance is increasing with predicted values. 
4-80 4 Std Residuals 3 2 1 0 -1 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 -2 -3 Predicted The best two parameter model is Model 4. The regression model and residual plots are shown below. 4 Std Residuals 3 2 1 0 -1 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 -2 -3 Predicted The best one parameter and overall model is Model 5. The regression model and residual plots are shown below. 4-81 3 Std Residuals 2 1 0 -1 10 15 20 25 30 35 40 45 50 55 60 65 -2 -3 Predicted Residuals suggest that the variance is increasing with predicted values. Problem #7 (30 points) Simple regression analysis using on x in the model (no breakpoint): ANOVA Regression 1 Residual 18 df SS 26931.4 3622.9 Total 30554.3 19 Value Intercept x1 0nly -21.562 6.367 MS 26931.4 F 133.8 201.3 Std Error t Stat P-value 6.595 -3.270 0.004257 0.550 11.567 9.09E-10 R Square Adj R2 Std Error 0.881 0.875 14.187 70 75 80 4-82 140 120 100 y 80 60 40 20 0 -20 0 5 10 15 20 25 20 25 x Residual plots show possible correlation of y with x. 3 Std Residuals 2 1 0 -1 0 5 10 15 -2 -3 x 4-83 Multiple regression analysis using on x in the model (with breakpoint): ANOVA Regression Residual df 2 17 SS MS F 30273.52 15136.76 916.3535 R Square 280.814 16.518 0.991 Total 19 30554.33 0.990 Value Intercept 4.428 1.561 8.919 x1 x2 Adj R2 Std Error 4.064 Std Error t Stat P-value 2.628 1.685 0.110279 0.373 0.627 4.187 0.000618 14.224 7.17E-11 All of the statistical for the piecewise regression analysis are better. Residuals don’t show possible correlation among values. 140 120 100 y 80 60 40 20 0 -20 0 5 10 15 x 20 25 4-84 3 Std Residuals 2 1 0 -1 0 5 10 15 -2 -3 x 20 25
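As a computational footnote to the Var(b) and StError(b) entries that appear in the solution keys above, the variance-covariance matrix of the estimated parameters is MSE (xTx)^-1, and the standard errors are the square roots of its diagonal elements. The sketch below is not part of the assignment; it assumes a numpy-style setup with an explicit intercept column, and the function name is illustrative.

```python
import numpy as np

def coefficient_standard_errors(X, y):
    """Parameter estimates, Var(b) = MSE*(xTx)^-1, and standard errors of b.

    X is the n x m design matrix including the column of ones; y has length n.
    """
    n, m = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    b = xtx_inv @ X.T @ y                        # least squares estimates
    mse = np.sum((y - X @ b) ** 2) / (n - m)     # residual mean square
    var_b = mse * xtx_inv                        # variance-covariance matrix of b
    return b, var_b, np.sqrt(np.diag(var_b))     # standard errors from the diagonal
```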