FITTING STRAIGHT LINES TO EXPERIMENTAL DATA

ROBERT A. BRACE
Department of Physiology and Biophysics, University of Mississippi Medical Center, Jackson, Mississippi 39216

BRACE, ROBERT A. Fitting straight lines to experimental data. Am. J. Physiol. 233(3): R94-R99, 1977 or Am. J. Physiol.: Regulatory Integrative Comp. Physiol. 2(2): R94-R99, 1977. - The problems associated with the use of statistical methods for determining a best linear relationship of the form Y = AX + B are examined for a condition quite prevalent in experimental research, i.e., when the values of both variables are subject to essentially unknown errors. Under this condition, standard least-squares regression analysis underestimates the value of the slope A. A very simple method for determining the best value of the slope and intercept is introduced which can be used when errors are present in both variables. With this proposed method, the calculated slope is equal to the standard error of Y divided by the standard error of X (with the appropriate sign), and the intercept is found from the mean values of X and Y, i.e., B = Ȳ − AX̄. The best estimate of the slope is also equal to the slope found with the conventional regression method divided by the absolute value of the correlation coefficient. The line determined with the suggested method can be considered to be a line of symmetry through the data.

least-squares regression analysis; errors in X and Y; measurement errors; random variation

THE PHYSIOLOGIST, like almost any other experimental researcher, often desires to fit his data to the straight line Y = AX + B, where A and B are the slope and intercept, respectively, of the line relating X to Y. For example, consider the data of Zweifach (7). He used the servonull method to measure pressures in large arteries (Pa) and in precapillary vessels (Ppc). From a physical point of view, it can be expected that the slope of the line relating these two pressures would have a value slightly less than one. However, Zweifach found

    Pa = 0.281 Ppc + a constant    (1)

when the data were analyzed with conventional least-squares regression analysis. This slope of 0.281 appears to be significantly different from the expected value. However, if the data had been analyzed with Pa treated as the independent (i.e., X) variable instead of Ppc, the equation relating the arterial and arteriolar pressures would be

    Pa = 2.74 Ppc + a constant    (2)

This example illustrates the fact that the value of the slope (and intercept) found with classical methods depends heavily on the often arbitrary choice as to which variable is treated as independent. This paper discusses the causes of this problem and introduces a very simple method that can be used when the investigator wants to determine a slope and intercept that do not depend on which variable is chosen as the X variable.

METHODS USED TO FIT DATA TO STRAIGHT LINES

If data exactly fit the line Y = AX + B, it is not at all difficult to determine the values of the slope A and intercept B. More frequently, there is scatter in the data, so that a perfect correlation does not exist. This scatter is due to the presence of errors in the values of X and/or Y. Errors are either measurement errors or due to random variation. Measurement errors can often be reduced with proper experimental design (see (1)); however, even in the absence of measurement errors, there may still be considerable scatter. This is because biological variables generally depend on several factors, each of which can contribute to what appears to be random variation in the measured value of a variable. The problem then becomes how to determine values for A and B when errors are present in the values of X and Y. The following is a brief summary of several methods that have been used to estimate the slope and intercept from paired data. The mathematics of these methods are included in the APPENDIX.

All least-squares methods determine estimates of A and B by minimizing the sum of the squares of the distances between each datum point and the line of best fit. With the conventional least-squares method, the distance that is minimized is the vertical distance between the data points and the line of best fit (line 1 of Fig. 1). This is called the vertical distance method. In effect, the method treats the data as if all errors are in the Y variable and no errors are present in the X variable. A second method is to minimize the horizontal distances between the data points and the regression line (line 2 of Fig. 1). This horizontal distance method treats Y as the independent variable and, in effect, analyzes the data as if no errors exist in Y. The horizontal distance method and the vertical distance method can produce greatly different results, as seen in the preceding example, since the slope found with the vertical distance method is equal to the correlation coefficient squared (r²) times the slope found with the horizontal distance method. The sum of vertical and horizontal distances method and the perpendicular distance method attempt to treat both the X and Y variables as dependent variables in the sense that each method acknowledges that there are errors in the measured values of both variables.

[FIG. 1. Various techniques obtain the best fit of data to Y = AX + B by minimizing the squares of the distances represented by lines 1, 2, 3, or 1 plus 2.]

PROBLEMS WITH ABOVE METHODS

One of the more effective ways to exemplify the problems that arise with the use of the above methods is to analyze a sample data set with each of them. This is particularly beneficial since recognition of the problem is a prerequisite to its solution. First, consider the data of Table 1. The Xi and Yi values represent two different experimental methods (both of which are subject to error) for estimating the same variable (capillary hydrostatic pressure). The objective of the analysis is to find the relationship between the two variables (i.e., the structural relationship or functional relationship (3), depending on the nature of the errors). In this case, there is no basis for choosing either X or Y as the independent variable. However, since both values represent the same variable, the slope of the line relating X to Y should have a value of one.

Using the standard least-squares method with X assumed to be the independent variable (the vertical distance method) gives

    Y = 0.655X + 4.94    (3)

If Y is treated as the independent variable (the horizontal distance method), the equation of the resulting line is

    Y = 1.67X − 8.04    (4)

Equations 3 and 4 are quite different even though both represent a least-squares "best" fit of the data. (Note that, with all of the above methods, the value of the correlation coefficient (0.627) is invariant.) The data of Table 1 analyzed with the sum of horizontal and vertical distances method yield

    Y = 1.03X + 0.11    (5)

while the perpendicular distance method gives

    Y = 1.07X − 0.42    (6)

Analyzing the data with the method of averages yields the following line when the data are grouped according to the X values

    Y = 0.43X + 7.87    (7)

If grouped according to the Y values, the method yields

    Y = 1.69X − 8.39    (8)

Thus we have six different equations (Eqs. 3-8) that represent the data, but we do not know which, if any, is the "best" fit, even though we do know that Eqs. 5 and 6 represent a "better" fit, since the slope in Eq. 3 must be biased to a smaller value while the slope in Eq. 4 is biased to a larger value (see APPENDIX). In this particular case, the method of averages does not improve the results of the analysis.
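The disagreement between Eqs. 3 and 4, and the r² relationship between the two slopes, can be verified directly from the Table 1 values quoted in the text; a minimal sketch in plain Python, with no assumptions beyond those values:

```python
# Fit the Table 1 data twice: X independent (vertical distances) and
# Y independent (horizontal distances).  The two "best" lines disagree badly.
x = [9.6, 20.7, 14.5, 16.0, 12.7, 17.9, 10.3, 7.4, 7.8, 11.0]
y = [14.2, 17.1, 20.9, 13.9, 17.6, 15.3, 7.0, 9.4, 10.0, 7.8]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((v - mx) ** 2 for v in x)
syy = sum((v - my) ** 2 for v in y)
sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))

a_vert = sxy / sxx           # Eq. 3: Y = 0.655X + 4.94
a_horz = syy / sxy           # Eq. 4: Y = 1.67X - 8.04
r2 = sxy ** 2 / (sxx * syy)  # r = 0.627, the same for every method

# The vertical-distance slope is exactly r^2 times the horizontal-distance slope.
assert abs(a_vert - r2 * a_horz) < 1e-12
print(round(a_vert, 3), round(a_horz, 2), round(r2 ** 0.5, 3))  # 0.655 1.67 0.627
```

Note that neither fit is "wrong" as a least-squares solution; the choice of independent variable alone moves the slope from 0.655 to 1.67.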
When there are errors in both variables, the conventional least-squares (vertical distance) method underestimates the value of the slope and the horizontal distance method overestimates it (see APPENDIX). It has long been recognized that there are often errors in the values of both the X and Y variables. To produce an exact estimate of the slope and intercept, the errors in both X and Y must be known. If the errors are known, then the fairly complex methods discussed by Madansky (4) can be used. With physiological data, it is often known that errors exist in both variables, but the exact values of these errors are rarely, if ever, known. Thus the problem is analyzing data that have unknown errors in both variables. Three logical yet problematic approaches have been used to analyze data under these conditions. One of these approaches is to minimize the sum of the vertical and horizontal distances (lines 1 + 2 of Fig. 1). The second is to minimize the perpendicular distances between the data points and the line of best fit (line 3 of Fig. 1). The third is the method of averages, which is discussed in the APPENDIX.

TABLE 1.

         Xi      Yi
        9.6    14.2
       20.7    17.1
       14.5    20.9
       16.0    13.9
       12.7    17.6
       17.9    15.3
       10.3     7.0
        7.4     9.4
        7.8    10.0
       11.0     7.8
Mean  12.79   13.32
±SE   ±1.40   ±1.46

Xi is the measured value of isogravimetric capillary pressure, and Yi is the measured value of stop-flow capillary pressure. Each X,Y pair is observed in a different experimental animal.

The problem can be further clarified by another example. This time the data of Table 1 will be analyzed with each of the Yi values multiplied by 1,000 (i.e., a simple change in units). For this case, both the standard least-squares method with X chosen as the independent variable and the sum of the vertical and horizontal distances method yield the equation

    Y = 655X + 4,940    (9)

while the standard least-squares method with Y as the independent variable and the perpendicular distance method give

    Y = 1,670X − 8,040    (10)

In this example, the two methods that recognize that errors are present in both variables cannot improve the analysis. But we do know that the desired slope lies between 655 and 1,670, since errors are present in both X and Y (see APPENDIX). As this example shows, the results of the perpendicular distance method (see (1)) and of the sum of vertical and horizontal distances method depend on the relative magnitudes of the two variables (i.e., the values of A and B are not unique). It could be argued that all data should be normalized before using either of the latter two methods. However, this would require that each variable have a standardized normal range and obviously would create more problems than it solves. Ideally, there should be a simple method, insensitive to changes in the units of the variables, that can be used when errors exist in both variables.

PROPOSED METHOD FOR DETERMINING BEST LINEAR FIT

The perpendicular distance method and the sum of vertical and horizontal distances method attempt to minimize errors in the X and Y directions. The failure of these methods is that they do not adequately represent the error. This fact can be seen with the aid of Fig. 2, which shows a datum point removed from the line Y = AX + B by an error in the X direction (ΔX) and an error in the Y direction (ΔY). The point (X,Y) is actually ΔY + AΔX above the line and ΔX + ΔY/A to the left of the line. From this it can be seen that the sum of the vertical and horizontal distances method minimizes the sum of the squares of ΔY + AΔX and ΔX + ΔY/A instead of the desired sum of the squares of ΔY and ΔX. The only condition under which this method minimizes the desired distances is when the slope A has a value of one. For the perpendicular distance method, it can be seen from Fig. 2 that the ideal distance to minimize is the diagonal of the rectangle of size ΔX by ΔY. The perpendicular distance is equal to this diagonal only under one condition, that is, when AΔY = ΔX. With the perpendicular method, the resulting slope times the mean error in Y is equal to the mean error in X only when A = 1 and the mean errors are equal.

[FIG. 2. A datum point removed from the line Y = AX + B by an error ΔX in the X direction and an error ΔY in the Y direction. The vertical, horizontal, and perpendicular distances do not represent the errors (ΔX and ΔY) unless the slope is one.]

Thus, in order for either the perpendicular distance method or the sum of the vertical and horizontal distances method to produce the desired slope and intercept, the slope must have a value of one during the regression analysis. This can easily be accomplished by weighting the data prior to the analysis. To do this, multiply each Yi by the standard error (or deviation) of X (SEx) and divide by the standard error (or deviation) of Y (SEy), i.e., let Y'i = Yi·SEx/SEy. Then use either the perpendicular distance or the sum of the vertical and horizontal distances method. (Both methods produce the same result when the data are weighted in this fashion, i.e., when the slope during the analysis is one.) The resulting slope and intercept are then unweighted by dividing each by SEx/SEy.

It is important to note that the slope found with this approach is equivalent to setting A equal to the slope found by the conventional least-squares regression analysis (with X the independent variable) divided by the absolute value of the correlation coefficient. The intercept is then found from B = Ȳ − AX̄, where the bar represents the mean value. It is even simpler to determine the slope A with the suggested method than just described since, as can be seen above, the slope is also exactly expressed as A = SEy/SEx, with the sign of the slope being positive or negative, whichever is appropriate. It should also be pointed out that, with the proposed method, the slope is equal to the geometric mean of the slopes found with 1) the vertical distance method and 2) the horizontal distance method.

Analysis of the data of Table 1 using this suggested method yields

    Y = 1.05X − 0.06    (11)

and analysis of the same data set with each Yi multiplied by 1,000 gives the same equation

    Y = 1,050X − 60    (12)

Thus, contrary to the implications of Eqs. 3 or 4, Eq. 11 suggests that the two variables of Table 1 represent the same variable. Furthermore, the data of Zweifach (7), which were used as an example, when analyzed with the proposed method yield

    Pa = 0.88 Ppc + a constant    (13)

Note that this slope is more consistent with the value expected from a physical viewpoint than the slopes found with the conventional least-squares methods (Eqs. 1 and 2).

As a final example of the applicability of the proposed method, consider the data of Table 2. These data are measured values of each of the four Starling pressures, which determine fluid movement across the capillary membrane and are related by the equation

    Pci − Pif = σ(πp − πif)    (14)

Investigators have measured Pci and πp and used classical least-squares regression analysis to find the slope and intercept of the equation relating these two variables. Equation 14 can be rearranged so that Pci is equal to a slope times πp plus a constant: the slope has the value σ(1 − πif/πp), and the intercept is equal to Pif, since πif goes to zero as πp approaches zero.
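The entire proposed recipe reduces to two lines of arithmetic; a minimal sketch in plain Python, using the Table 1 values quoted in the text (the ratio of standard errors equals the ratio of standard deviations, since each SE is the SD divided by the same √n):

```python
import math

def symmetry_line(x, y):
    """Brace's proposed fit: slope = SD of Y / SD of X, carrying the sign of
    the correlation; intercept from the means (B = mean Y - A * mean X)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((v - mx) ** 2 for v in x)
    syy = sum((v - my) ** 2 for v in y)
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    a = math.copysign(math.sqrt(syy / sxx), sxy)
    return a, my - a * mx

x = [9.6, 20.7, 14.5, 16.0, 12.7, 17.9, 10.3, 7.4, 7.8, 11.0]
y = [14.2, 17.1, 20.9, 13.9, 17.6, 15.3, 7.0, 9.4, 10.0, 7.8]

a, b = symmetry_line(x, y)  # Eq. 11: Y = 1.05X - 0.06
a_scaled, b_scaled = symmetry_line(x, [v * 1000 for v in y])
# A change of units in Y only rescales the answer (Eq. 12: Y = 1,050X - 60).
assert abs(a_scaled - 1000 * a) < 1e-6
print(round(a, 2), round(b, 2))  # 1.05 -0.06
```

The unit-invariance assertion is the property that the perpendicular and sum-of-distances methods lack (compare Eqs. 9 and 10).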
TABLE 2. Measured values of four Starling pressures in the isolated dog forelimb

        Pci      πp     πif     Pif      σ
       15.3    18.5    5.2    +1.2    1.06
       17.0    20.0    4.4    +0.9    1.03
       13.1    16.4    2.5    +0.6    0.90
       13.0    17.6    3.6    −1.7    1.05
       16.3    24.7    8.8     0.0    1.03
       19.3    19.6    2.5    +2.3    0.99
       15.4    22.2    7.0    −0.5    1.05
Mean   15.6    19.9    4.9    +0.4    1.02
±SE    ±0.8    ±1.1    ±0.9   ±0.5   ±0.02

Pci represents isogravimetric capillary pressure; πp, plasma colloid osmotic pressure; πif, interstitial fluid colloid osmotic pressure; Pif, interstitial fluid pressure. The reflection coefficient (σ) was calculated from the definition in Eq. 14. Units are torr.

From Table 2, the calculated values of the slope and intercept are 0.77 and 0.4, respectively. If the data of Table 2 are analyzed with the conventional vertical distance method, the result is

    Pci = 0.36πp + 8.6    (15)

However, analysis with the suggested method yields

    Pci = 0.78πp + 0.1    (16)

Note that the slope and intercept of Eq. 16 are in excellent agreement with the calculated values. This was not true for the slope and intercept found with conventional methods.

Obviously, this suggested method cannot always best represent data, since the best representation depends on the exact nature of the errors. However, the method certainly appears to produce a more meaningful estimate of the slope and intercept for physiological data when both variables are known to be subject to error. The proposed method is ideally suited to situations in which the errors in Y¹ are equal to A times the errors in X, since this is the basic underlying assumption of the proposed method.

¹ The terms "errors in X" and "errors in Y" represent the magnitudes of the errors in X and Y and may be defined as the sums of the absolute values of the errors, i.e., errors in X = Σ|ΔXi| and errors in Y = Σ|ΔYi|. The individual errors are generally assumed to be normally distributed and unrelated.

To understand the relationship between the proposed method and classical least-squares estimates of the slope, consider the exact relationship Y = AX + B, with an error then added to each value of Y and to each value of X. As is usually assumed, the errors are normally distributed and unrelated. Figure 3 shows the accuracy of the conventional and proposed estimates of the true slope. In the figure, A1 is the slope found with the conventional vertical distance method, A1/r that found with the proposed method, and A1/r² the slope found with the horizontal distance method. A1/r is equal to the true slope when the errors in Y equal A times the errors in X. In addition, the proposed method produces a more accurate estimate of the slope than conventional methods whenever A times the errors in X is in the range of one-half to two times the errors in Y. Outside of this range, the classical least-squares methods result in a more accurate estimate of the true slope. The relationship in Fig. 3 is accurate over a very large range of errors and breaks down only when there is essentially no correlation in the data, i.e., when r < 0.1.

[FIG. 3. A comparison of the ability of the proposed method and conventional methods to estimate the true slope (A) when errors are present in both variables. The abscissa represents the true slope times the magnitude of the errors in X divided by the magnitude of the errors in Y (plotted from 0.05 to 20). Slope A1 is from conventional methods with X the independent variable, A1/r² is from the same method with Y the independent variable, and A1/r is the slope found with the proposed method. The dashed lines represent 50% of the distance between the conventional and proposed estimates of the slope, and the shaded area is the range over which the proposed method results in a more accurate estimate of A than classical methods.]

WHAT METHOD TO USE

At this point we are left with the standard least-squares method, with one variable selected as the independent variable, and with the proposed method, which minimizes errors in both the X and Y directions. It is necessary to consider the conditions under which each method should be used. The line determined by the proposed method may be taken as a starting point for any linear regression analysis, since it can be considered to be the line of symmetry through the data. Then it is necessary to consider 1) the use to which the desired line is to be put and 2) what is known about the errors in each variable. If the errors in one variable are known to be negligibly small, then the conventional least-squares method should be used. (This corresponds to conditions in which the value of X is accurately controlled by the investigator.)

There are several other methods comparable to the one being proposed, e.g., taking the arithmetic mean of the two slopes found with conventional least-squares techniques, bisecting the angle between the two lines, etc. However, with these other methods the mathematics is somewhat more difficult, and the errors being minimized are not as easily defined.

Implicit in the use of either error function E1 or E2 (see APPENDIX) is the assumption that the independent variable in each case is exactly known (6). If this is not true, so that either xi in Eq. 18 or yi in Eq. 20 is subject to error, then the slope (either A1 or 1/A2) is biased to a small value (2). This becomes more apparent when it is realized that the slope A1 found when X is chosen as the independent variable differs by the square of the correlation coefficient (r) from the slope found when Y is chosen as the independent variable, i.e., A1 = r²A2. Obviously A1 = A2 only if r² = 1, so that perfect correlation exists.
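The Table 2 comparison (Eqs. 15 and 16) and the calculated slope σ(1 − π̄if/π̄p) can be reproduced directly from the tabulated pressures; a minimal sketch in plain Python, pairing the Table 2 columns by animal:

```python
# Table 2 columns (torr): pci = isogravimetric capillary pressure,
# pi_p = plasma colloid osmotic pressure, pi_if = interstitial fluid
# colloid osmotic pressure, sigma = reflection coefficient.
pci   = [15.3, 17.0, 13.1, 13.0, 16.3, 19.3, 15.4]
pi_p  = [18.5, 20.0, 16.4, 17.6, 24.7, 19.6, 22.2]
pi_if = [5.2, 4.4, 2.5, 3.6, 8.8, 2.5, 7.0]
sigma = [1.06, 1.03, 0.90, 1.05, 1.03, 0.99, 1.05]

n = len(pci)
mx, my = sum(pi_p) / n, sum(pci) / n
sxx = sum((v - mx) ** 2 for v in pi_p)
syy = sum((v - my) ** 2 for v in pci)
sxy = sum((u - mx) * (v - my) for u, v in zip(pi_p, pci))

a_conv = sxy / sxx           # Eq. 15: Pci = 0.36*pi_p + 8.6 (conventional)
a_prop = (syy / sxx) ** 0.5  # Eq. 16: Pci = 0.78*pi_p + 0.1 (proposed)
b_prop = my - a_prop * mx

# Slope expected from Eq. 14 rearranged: sigma*(1 - pi_if/pi_p) = 0.77
a_calc = (sum(sigma) / n) * (1 - (sum(pi_if) / n) / mx)
print(round(a_conv, 2), round(a_prop, 2), round(b_prop, 1), round(a_calc, 2))
# 0.36 0.78 0.1 0.77
```

The proposed slope (0.78) lands on the physically expected value (0.77), while the conventional slope (0.36) is attenuated by the errors in πp.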
APPENDIX

The following is a summary of procedures presently used to determine a linear fit of data to the line Y = AX + B. Further details and the logic behind the mathematics of the least-squares methods can be obtained from most texts on statistical methods.

1) Standard least-squares method. Whenever a "best" linear relationship between two variables is desired, the standard method of least-squares regression analysis is almost invariably used. From an input of N measured pairs of data (Xi, Yi), the method assumes that X is the independent variable and supplies an estimate of the slope (A) and the intercept (B) for the relationship Y = AX + B by minimizing an error function (E) equal to the sum of the squares of the vertical distances between each Yi and the line of best fit (line 1 of Fig. 1):

    E1 = Σ (Yi − A1Xi − B1)²    (17)

By setting ∂E/∂B = 0 and ∂E/∂A = 0, an exact analytical expression for A and B is obtained:

    A1 = Σ xiyi / Σ xi²,    B1 = Ȳ − A1X̄    (18)

where X̄ and Ȳ represent the respective arithmetic means of the measured values of the variables, and xi and yi are the residuals of Xi and Yi, i.e., xi = Xi − X̄ and yi = Yi − Ȳ.

An alternate and often equally justified approach is to treat Y as the independent variable. In this case, the sum of the squares of the horizontal distances between the data points and the regression line (line 2 of Fig. 1)

    E2 = Σ (Xi − Yi/A2 + B2/A2)²    (19)

is minimized and yields

    A2 = Σ yi² / Σ xiyi,    B2 = Ȳ − A2X̄    (20)

Since A1 = r²A2, where r is the correlation coefficient, the two estimates agree only when perfect correlation exists. For any other situation, the two estimates of A (A1 and A2) and of B (B1 and B2) are not identical, and the ordinary least-squares approach provides no indication of the "better" choice. However, since A1 is biased to a smaller value and A2 to a larger value, the desired slope lies somewhere between A1 and A2.

2) Methods used when errors are known. The problem with both of the above least-squares methods is that they fail to take into account the fact that both Xi and Yi are subject to error. This problem has been investigated, and if information is available as to the nature of the errors in the measured values of Xi and Yi, the fairly complex methods discussed by Kendall and Stuart (3) or Madansky (4) can be used. Most frequently with experimental data, however, the investigator has little idea of what errors may exist in either Xi or Yi and knows only that the best relationship between X and Y is desired. In the past, the following logical yet problematic approaches have been used in an attempt to obtain the best linear relationship when the values of both variables are subject to error.

3) Sum of vertical and horizontal distances method. This approach simply minimizes the sum of the squares of the vertical and horizontal distances between the data points and the line of best fit:

    E3 = E1 + E2 = (1 + 1/A3²) Σ (Yi − A3Xi − B3)²    (21)

Setting the partial derivatives of E3 equal to zero gives

    (1/A1)A3⁴ − A3³ + A3 − A2 = 0,    B3 = Ȳ − A3X̄    (22)

where A1 and A2 are defined above. This solution is more complex, but the value of A3 can be determined by trial and error.

4) Perpendicular distance method. The second and more commonly used approach when both variables are subject to error is to minimize the sum of the squares of the perpendicular distances between the data points and the line of best fit (line 3 of Fig. 1). The error function can be expressed as

    E4 = [1/(1 + A4²)] Σ (Yi − A4Xi − B4)²    (23)

Setting the partial derivatives of E4 to zero yields

    A4² + (1/A1 − A2)A4 − 1 = 0,    B4 = Ȳ − A4X̄    (24)

The two solutions of Eq. 24 are obtained through use of the quadratic formula, the desired root for A4 being the one between A1 and A2.
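Equations 22 and 24 can be checked numerically against Eqs. 5 and 6 of the text; a sketch in plain Python using the Table 1 values. The "trial and error" solution of the quartic is carried out here by bisection between A1 and A2, a choice of this sketch rather than a procedure specified in the paper:

```python
import math

# Sums of squares and products for the Table 1 data.
x = [9.6, 20.7, 14.5, 16.0, 12.7, 17.9, 10.3, 7.4, 7.8, 11.0]
y = [14.2, 17.1, 20.9, 13.9, 17.6, 15.3, 7.0, 9.4, 10.0, 7.8]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((v - mx) ** 2 for v in x)
syy = sum((v - my) ** 2 for v in y)
sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
a1, a2 = sxy / sxx, syy / sxy  # slopes of Eq. 18 and Eq. 20

# Eq. 24: the perpendicular-distance slope is a root of a quadratic.
c = 1 / a1 - a2
a4 = (-c + math.sqrt(c * c + 4)) / 2  # positive root (data correlate positively)

# Eq. 22: the sum-of-distances slope, found by bisection between A1 and A2
# (the quartic changes sign over that interval for these data).
f = lambda a: (1 / a1) * a ** 4 - a ** 3 + a - a2
lo, hi = a1, a2
for _ in range(60):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
a3 = (lo + hi) / 2

print(round(a3, 2), round(a4, 2))  # 1.03 1.07, matching Eqs. 5 and 6
```

Both roots fall between A1 and A2, as the text argues they must.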
5) Method of averages. This method has not been used much recently, but it has certain merits and is the only method suggested by the National Bureau of Standards handbook on experimental statistics (5). First, divide the data, ordered in the X direction, into three nonoverlapping groups of sizes as close to n/3 as possible; the two extreme groups should contain equal numbers of points. The desired line is drawn through the mean of the data (X̄, Ȳ) with a slope of (Ȳ3 − Ȳ1)/(X̄3 − X̄1), where the subscripted values are the means of the two extreme groups. Note that the data can also be grouped according to the Y values. For large groups of data, the method of averages produces values of the slope A and intercept B that are almost identical to those found with the proposed line of symmetry.

If the purpose of the analysis is to allow prediction of a value of Y from a given value of X, then the standard least-squares method should be used regardless of whether there are errors in the values of the Y variable or errors in both variables (3, 5). However, if the purpose of the analysis is to give a meaningful interpretation to the values of the slope or intercept (e.g., for comparison with theoretical values or for use in a simulation), then the method presented above can be used instead of the standard least-squares method whenever significant measurement errors and/or random variation errors are present in the values of both variables. This includes conditions where X and Y are uncontrolled as well as conditions where one variable is inaccurately controlled.

This study was supported by National Institutes of Health Grant HL 11678.

Received for publication 20 September 1976.
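The grouping recipe of the method of averages reproduces Eq. 7 of the text; a minimal sketch in plain Python (the 3-4-3 split of the ten Table 1 points is an assumption of this sketch, consistent with "sizes as close to n/3 as possible" and equal extreme groups):

```python
# Method of averages on the Table 1 data, grouped by X (3-4-3 split).
pairs = sorted(zip([9.6, 20.7, 14.5, 16.0, 12.7, 17.9, 10.3, 7.4, 7.8, 11.0],
                   [14.2, 17.1, 20.9, 13.9, 17.6, 15.3, 7.0, 9.4, 10.0, 7.8]))
low, high = pairs[:3], pairs[-3:]

def mean(vals):
    return sum(vals) / len(vals)

# Slope from the means of the two extreme groups; line through (mean X, mean Y).
a = (mean([v for _, v in high]) - mean([v for _, v in low])) / \
    (mean([u for u, _ in high]) - mean([u for u, _ in low]))
b = mean([v for _, v in pairs]) - a * mean([u for u, _ in pairs])
print(round(a, 2), round(b, 2))  # 0.43 7.87, matching Eq. 7
```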
Use of the resulting line of symmetry is obviously limited to conditions in which both variables contain either measurement or random variation errors that are too large to be ignored. However, the line of symmetry exactly represents the data only when the errors in Y are equal to A times the errors in X. The major advantage of using this suggested method for determining a best linear fit is that it is exceedingly simple. Furthermore, data that have been analyzed in the classical fashion can be reanalyzed with minimal effort.

REFERENCES

1. ACTON, F. S. Analysis of Straight-Line Data. New York: Wiley, 1959.
2. BERKSON, J. Are there two regressions? J. Am. Statist. Assoc. 45: 164-180, 1950.
3. KENDALL, M. G., AND A. STUART. The Advanced Theory of Statistics. Vol. 2, Inference and Relationship. New York: Hafner, 1961.
4. MADANSKY, A. The fitting of straight lines when both variables are subject to error. J. Am. Statist. Assoc. 54: 173-205, 1959.
5. NATRELLA, M. G. Experimental Statistics. Washington, DC: US Govt. Printing Office, National Bureau of Standards Handbook 91, 1963.
6. SOKAL, R. R., AND F. J. ROHLF. Biometry. San Francisco: Freeman, 1969.
7. ZWEIFACH, B. W. Quantitative studies of microcirculatory structure and function. I. Analysis of pressure distribution in the terminal vascular bed in cat mesentery. Circulation Res. 34: 843-866, 1974.