Fitting straight lines to experimental data

ROBERT A. BRACE
Department of Physiology and Biophysics, University of Mississippi Medical Center, Jackson, Mississippi 39216
least-squares regression analysis; errors in X and Y; measurement errors; random variation
THE PHYSIOLOGIST, like almost any other experimental researcher, often desires to fit his data to the straight line Y = AX + B, where A and B are the slope and intercept, respectively, of the line relating X to Y. For example, consider the data of Zweifach (7). He used the servonull method to measure pressures in large arteries (Pa) and in precapillary vessels (Ppc). From a physical point of view, it can be expected that the slope of the line relating these two pressures would have a value slightly less than one. However, Zweifach found

Pa = 0.281 Ppc + a constant    (1)

when the data were analyzed with conventional least-squares regression analysis. This slope of 0.281 appears to be significantly different from the expected value.
However, if the data had been analyzed with Pa treated as the independent (i.e., X) variable instead of Ppc, the equation relating the arterial and arteriolar pressures would be

Pa = 2.74 Ppc + a constant    (2)

This example illustrates the fact that the value of the slope (and intercept) found with classical methods depends heavily on the often arbitrary choice as to which variable is treated as independent.
This paper discusses the causes of the above problem and introduces a very simple method that can be used when the investigator wants to determine a slope and intercept that are not dependent on which variable is chosen as the X variable.
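As a numerical illustration of this dependence, the following sketch (hypothetical pressure values, not Zweifach's data; Python with NumPy is assumed) fits the same scattered points both ways and shows that the two conventional slopes differ by exactly the factor r²:

import numpy as np

# Hypothetical paired pressure measurements (torr); not the data of ref. 7.
ppc = np.array([22.0, 25.5, 30.1, 27.4, 33.8, 29.0, 35.2, 31.6])
pa  = np.array([88.0, 95.0, 99.0, 90.5, 104.0, 93.0, 101.5, 97.0])

# Slope of Pa on Ppc (Ppc treated as the independent variable).
a_vertical = np.polyfit(ppc, pa, 1)[0]

# Slope of Ppc on Pa, re-expressed as a slope of Pa versus Ppc.
a_horizontal = 1.0 / np.polyfit(pa, ppc, 1)[0]

r = np.corrcoef(ppc, pa)[0, 1]
print(a_vertical, a_horizontal, a_vertical / a_horizontal, r**2)
# a_vertical / a_horizontal equals r**2, so the two choices of independent
# variable disagree whenever the correlation is imperfect.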
METHODS USED TO FIT DATA TO STRAIGHT LINES
If data exactly fit the line Y = AX + B, then it is not at all difficult to determine the value of the slope A and intercept B. More frequently, there is scatter in the data so that a perfect correlation does not exist. This scatter is due to the presence of errors in the values of X and/or Y. Errors are either measurement errors or due to random variation. Measurement errors can often be reduced with proper experimental design (see (1)); however, even in the absence of measurement errors, there may still be considerable scatter. This is due to the fact that biological variables are generally dependent upon several factors, each of which can contribute to what appears to be random variation in the measured value of a variable. The problem then becomes how to determine values for A and B when errors are present in the values of X and Y.
The following is a brief summary of several methods that have been used to estimate the slope and intercept from paired data. The mathematics of these methods are included in the APPENDIX.
All least-squares methods determine estimates of A and B by minimizing the sum of the squares of the distances between each datum point and the line of best fit. With the conventional least-squares method, the distance that is minimized is the vertical distance between the data points and the line of best fit (line 1 of Fig. 1). This is called the vertical distance method. In effect, the method treats the data as if all errors are in the Y variable and no errors are present in the X variable.
A second method is to minimize the horizontal distances between the data points and the regression line (line 2 of Fig. 1). This horizontal distance method treats Y as the independent variable and, in effect, analyzes the data as if no errors exist in Y. The horizontal distance method and vertical distance method can produce greatly different results, as seen in the preceding example, since the slope found with the vertical distance method is equal to the correlation coefficient squared (r²) times the slope found with the horizontal distance method. This results from the fact that, when there are errors in both variables, the conventional least-squares method underestimates the value of the slope and the horizontal distance method overestimates the value of the slope (see APPENDIX).
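This relationship is easy to check with simulated data containing errors in both variables (a sketch, assuming Python with NumPy; the true slope of 0.9 is arbitrary):

import numpy as np

rng = np.random.default_rng(0)
x_true = np.linspace(0, 10, 200)
y_true = 0.9 * x_true + 2.0                     # underlying line, slope slightly below one
x = x_true + rng.normal(0, 1.0, x_true.size)    # errors in X
y = y_true + rng.normal(0, 1.0, y_true.size)    # errors in Y

a_vert = np.polyfit(x, y, 1)[0]                 # vertical distance method (X independent)
a_horz = 1.0 / np.polyfit(y, x, 1)[0]           # horizontal distance method (Y independent)
r = np.corrcoef(x, y)[0, 1]

print(a_vert, r**2 * a_horz)    # identical: the vertical slope equals r**2 times the horizontal slope
print(a_vert < 0.9 < a_horz)    # typically True: the two conventional fits bracket the true slope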
BRACE, ROBERT A. Fitting straight lines to experimental data. Am. J. Physiol. 233(3): R94-R99, 1977 or Am. J. Physiol.: Regulatory Integrative Comp. Physiol. 2(2): R94-R99, 1977. - The problems associated with the use of statistical methods for determining a best linear relationship of the form Y = AX + B have been examined for a condition quite prevalent in experimental research, i.e., when the values of both variables are subject to essentially unknown errors. Under this condition, standard least-squares regression analysis underestimates the value of the slope A. A very simple method for determining the best value of the slope and intercept has been introduced which can be used when errors are present in both variables. With this proposed method, the calculated slope is equal to the standard error of Y divided by the standard error of X (with the appropriate sign) and the intercept is found from the mean values of X and Y, i.e., B = Ȳ - AX̄. The best estimate of the slope is also equal to the slope found with the conventional regression method divided by the absolute value of the correlation coefficient. The line determined with the suggested method can be considered to be a line of symmetry through the data.
FIG. 1. Various techniques minimize the squares of the distances, represented by lines 1, 2, 3, or 1 plus 2, to obtain the best fit of data to Y = AX + B.
PROBLEMS WITH ABOVE METHODS

One of the more effective ways to exemplify the problems which arise with the use of the above methods is to analyze a sample data set with these methods. This is particularly beneficial since recognition of the problem is a prerequisite to the solution. First, consider the data of Table 1: the Xi and Yi values represent two different experimental methods (both of which are subject to error) for estimating the same variable (capillary hydrostatic pressure). The objective of the analysis is to find the relationship between the two variables (i.e., the structural relationship or functional relationship (3), depending on the nature of the errors). In this case, there is no basis for choosing either X or Y as the independent variable. However, since both values represent the same variable, the slope of the line relating X to Y should have a value of one. Using the standard least-squares method with X assumed to be the independent variable (the vertical distance method) gives

Y = 0.655X + 4.94    (3)

If Y is treated as the independent variable (the horizontal distance method), the equation of the resulting line is

Y = 1.67X - 8.04    (4)

Equations 3 and 4 are quite different even though both represent least-squares "best" fits of the data. (Note that, with all of the above methods, the value of the correlation coefficient (0.627) is invariant.)

The sum of vertical and horizontal distances and the perpendicular distance methods attempt to treat both the X and Y variables as dependent variables in the sense that each method acknowledges that there are errors in the measured values of both variables. The data of Table 1 analyzed with the sum of horizontal and vertical distances method yields

Y = 1.03X + 0.11    (5)

while the perpendicular distance method gives

Y = 1.07X - 0.42    (6)

Analyzing the data with the method of averages yields the following line when the data are grouped according to the X values

Y = 0.43X + 7.87    (7)

If grouped according to the Y values, the method yields

Y = 1.69X - 8.39    (8)

Thus we have six different equations (Eqs. 3-8) that represent the data, but we do not know which, if any, is the "best" fit, even though we do know that Eqs. 5 and 6 represent a "better" fit since the slope in Eq. 3 must be biased to a smaller value while the slope in Eq. 4 is biased to a larger value (see APPENDIX). In this particular case, the method of averages does not improve the results of the analysis.

The problem can be further clarified by another example. This time the data of Table 1 will be analyzed with each of the Y variables multiplied by 1,000 (e.g., a simple change in units). For this case, both the standard least-squares method with X chosen as the independent variable and the sum of the vertical and horizontal distances method yield the equation

Y = 655X + 4,940    (9)

while the standard least-squares method with Y as the independent variable and the perpendicular distance method give

Y = 1,670X - 8,040    (10)
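The unit dependence of the perpendicular distance fit can be demonstrated directly (a sketch with hypothetical data, assuming Python with NumPy; the perpendicular slope is computed here as the major principal axis of the centered scatter):

import numpy as np

def perpendicular_slope(x, y):
    # Slope of the line minimizing squared perpendicular distances:
    # the major principal axis of the centered (x, y) scatter.
    cov = np.cov(x, y)
    vals, vecs = np.linalg.eigh(cov)
    v = vecs[:, np.argmax(vals)]
    return v[1] / v[0]

rng = np.random.default_rng(1)
x = rng.normal(10.0, 3.0, 50)
y = x + rng.normal(0.0, 3.0, 50)            # hypothetical data, true slope near one

for scale in (1.0, 1000.0):                 # e.g., a change of units in Y
    a_vert = np.polyfit(x, scale * y, 1)[0]
    a_perp = perpendicular_slope(x, scale * y)
    print(scale, a_vert, a_perp)
# The conventional slope scales exactly by 1,000, but the perpendicular slope
# does not: it drifts toward the horizontal-distance result, so the values of
# A and B from the perpendicular fit are not unique under a change of units.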
TABLE 1.

Xi      Yi
9.6     14.2
20.7    17.1
14.5    20.9
16.0    13.9
12.7    17.6
17.9    15.3
10.3    7.0
7.4     9.4
7.8     10.0
11.0    7.8

X̄ ± SE: 12.79 ± 1.40    13.32 ± 1.46

Xi is the measured value of isogravimetric capillary pressure, and Yi is the measured value of stop-flow capillary pressure. * Each X, Y pair is observed in a different experimental animal.
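For reference, the conventional fits quoted in Eqs. 3 and 4 and the correlation coefficient can be reproduced from the Table 1 values, pairing the rows as listed (assuming Python with NumPy):

import numpy as np

x = np.array([9.6, 20.7, 14.5, 16.0, 12.7, 17.9, 10.3, 7.4, 7.8, 11.0])
y = np.array([14.2, 17.1, 20.9, 13.9, 17.6, 15.3, 7.0, 9.4, 10.0, 7.8])

a1, b1 = np.polyfit(x, y, 1)          # vertical distance method
a2 = 1.0 / np.polyfit(y, x, 1)[0]     # horizontal distance method, slope only
b2 = y.mean() - a2 * x.mean()         # both lines pass through the mean point
r = np.corrcoef(x, y)[0, 1]

print(round(a1, 3), round(b1, 2))     # ~0.655 and ~4.94 (Eq. 3)
print(round(a2, 2), round(b2, 2))     # ~1.67 and ~-8.0 (Eq. 4)
print(round(r, 3))                    # ~0.627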
It has long been recognized that there are often errors in the values of both the X and Y variables. In order to produce an exact estimate of the slope and intercept, the errors in both X and Y must be known. If the errors are known, then the fairly complex methods discussed by Madansky (4) can be used. With physiological data, it is often known that errors exist in both variables, but the exact values of these errors are rarely, if ever, known. Thus the problem is one of analyzing data that have unknown errors in both variables.

Three logical yet problematic approaches have been used to analyze data under these conditions. One of these approaches is to minimize the sum of the vertical and horizontal distances (lines 1 + 2 of Fig. 1). The second is to minimize the perpendicular distances between the data points and the line of best fit (line 3 of Fig. 1). The third is the method of averages, which is discussed in the APPENDIX.
In this example, the two methods that recognize errors are present in both variables cannot improve the analysis. But we do know that the desired slope lies between 655 and 1,670 since errors are present in both X and Y (see APPENDIX). As seen with this example, the results of the perpendicular distance method (see (1)) and the sum of vertical and horizontal distances method are dependent upon the relative magnitude of the two variables (i.e., the values of A and B are not unique). It could be argued that all data should be normalized before using either of the latter two methods. However, this requires that each variable must have a standardized normal range and obviously would create more problems rather than a solution to the existing problems. Ideally, there should be a simple method to use when errors exist in both variables which is insensitive to changes in units of the variables.

PROPOSED METHOD FOR DETERMINING BEST LINEAR FIT

The perpendicular distance method and the sum of vertical and horizontal distances method attempt to minimize errors in the X and Y directions. The failure of these methods is that they do not adequately represent the error. This fact can be seen with the aid of Fig. 2, which shows a datum point removed from the line Y = AX + B by an error in the X direction (ΔX) and an error in the Y direction (ΔY). The point (X, Y) is actually ΔY + AΔX above the line and ΔX + ΔY/A to the left of the line. From this it can be seen that the sum of the vertical and horizontal distances method minimizes the sum of the squares of ΔY + AΔX and ΔX + ΔY/A instead of the desired sum of the squares of ΔY and ΔX. The only condition under which this method minimizes distances corresponding to the desired distances is when the slope A has a value of one. For the perpendicular distance method, it can be seen from Fig. 2 that the ideal distance to minimize is the diagonal of the rectangle of size ΔX by ΔY. The perpendicular distance is equal to the diagonal only under one condition, that is, when AΔY = ΔX. With the perpendicular method, the resulting slope times the mean error in Y is equal to the mean error in X only when A = 1 and the mean errors are equal.

FIG. 2. A datum point removed from the line Y = AX + B by error ΔX in the X direction and ΔY in the Y direction. The vertical and horizontal or perpendicular distances do not represent the errors (ΔX and ΔY) unless the slope is ± one.

Thus, in order for either the perpendicular distance method or the sum of the vertical and horizontal distances method to produce the desired slope and intercept, the resulting slope must have a value of one during the regression analysis. This can easily be accomplished by weighting the data prior to the analysis. To do this, multiply each Yi by the standard error (or deviation) of X (SE_X) and divide by the standard error (or deviation) of Y (SE_Y), i.e., let Y'i = Yi SE_X/SE_Y. Then use either the perpendicular distance or the sum of the vertical and horizontal distances methods. (Both methods produce the same result when the data are weighted in this fashion, i.e., when the slope during the analysis is one.) The resulting slope and intercept are then unweighted by dividing each by SE_X/SE_Y.

It is important to note that the slope found with this approach is equivalent to setting A equal to the slope found by the conventional least-squares regression analysis (with X the independent variable) divided by the absolute value of the correlation coefficient. The intercept is then found from B = Ȳ - AX̄, where the bar represents the mean value. It is even simpler to determine the slope A with the suggested method than that just described since, as can be seen above, the slope is also exactly expressed as A = SE_Y/SE_X, with the sign of the slope being positive or negative, whichever is appropriate. It should also be pointed out that, with the proposed method, the slope is equal to the geometric mean of the slopes found with 1) the vertical distance method and 2) the horizontal distance method.

Analysis of the data of Table 1 using this suggested method yields

Y = 1.05X - 0.06    (11)

and analysis of the same data set where each Yi was multiplied by 1,000 gives the same equation

Y = 1,050X - 60    (12)

Thus, contrary to the implications of Eqs. 3 or 4, Eq. 11 suggests that the two variables of Table 1 represent the same variable. Furthermore, the data of Zweifach (7), which was used as an example, when analyzed with the proposed method yields

Pa = 0.88 Ppc + a constant    (13)

Note that this slope is more consistent with the value expected from a physical viewpoint than the slopes found with the conventional least-squares methods (Eqs. 1 and 2).

As a final example of the applicability of the proposed method, consider the data of Table 2. These data are measured values of each of the four Starling pressures which determine fluid movement across the capillary membrane and are related by the equation

Pci - Pif = σ(πp - πif)    (14)

Investigators have measured Pci and πp and used classical least-squares regression analysis to find the slope and intercept of the equation relating these two variables. Equation 14 can be rearranged so that Pci is equal to a slope times πp plus a constant. The slope has the value σ(1 - πif/πp), and the intercept is equal to Pif since πif goes to zero as πp approaches zero.
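A sketch of the proposed calculation applied to the Table 1 values (assuming Python with NumPy) reproduces Eq. 11, shows the insensitivity to a change of units, and confirms the equivalence A = A1/|r|:

import numpy as np

x = np.array([9.6, 20.7, 14.5, 16.0, 12.7, 17.9, 10.3, 7.4, 7.8, 11.0])
y = np.array([14.2, 17.1, 20.9, 13.9, 17.6, 15.3, 7.0, 9.4, 10.0, 7.8])

def line_of_symmetry(x, y):
    # Proposed method: slope = SD(Y)/SD(X) with the sign of the correlation,
    # intercept through the mean point.
    r = np.corrcoef(x, y)[0, 1]
    a = np.sign(r) * y.std(ddof=1) / x.std(ddof=1)
    return a, y.mean() - a * x.mean()

a, b = line_of_symmetry(x, y)
print(round(a, 2), round(b, 2))                  # ~1.05 and ~-0.06 (Eq. 11)

a1000, b1000 = line_of_symmetry(x, 1000 * y)
print(round(a1000, 1), round(b1000, 1))          # ~1045.8 and ~-56: the same line in the new units (cf. Eq. 12)

a1 = np.polyfit(x, y, 1)[0]
r = np.corrcoef(x, y)[0, 1]
print(round(a1 / abs(r), 2))                     # ~1.05 again: A equals the conventional slope divided by |r|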
TABLE 2. Measured values of four Starling pressures in the isolated dog forelimb

Pci     πp      πif     Pif     σ
15.3    18.5    5.2     +1.2    1.06
17.0    20.0    4.4     +0.9    1.03
13.1    16.4    2.5     +0.6    0.90
13.0    17.6    3.6     -1.7    1.05
16.3    24.7    8.8     0.0     1.03
19.3    19.6    2.5     +2.3    0.99
15.4    22.2    7.0     -0.5    1.05

X̄ ± SE: 15.6 ± 0.8   19.9 ± 1.1   4.9 ± 0.9   +0.4 ± 0.5   1.02 ± 0.02

Pci represents isogravimetric capillary pressure; πp, plasma colloid osmotic pressure; πif, interstitial fluid colloid osmotic pressure; Pif, interstitial fluid pressure. The reflection coefficient (σ) was calculated from the definition in Eq. 14. Units are torrs.
From Table 2, the calculated values of the slope and intercept are 0.77 and 0.4, respectively. If the data of Table 2 are analyzed with the conventional vertical distance method, the result is

Pci = 0.36πp + 8.6    (15)

However, analysis with the suggested method yields

Pci = 0.78πp + 0.1    (16)

Note that the slope and intercept of Eq. 16 are in excellent agreement with the calculated values. This was not true for the slope and intercept found with conventional methods.
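The Table 2 comparison can be checked in a few lines (assuming Python with NumPy; the fits are recomputed from the tabulated values, so the last decimals differ slightly from Eqs. 15 and 16):

import numpy as np

pci   = np.array([15.3, 17.0, 13.1, 13.0, 16.3, 19.3, 15.4])
pi_p  = np.array([18.5, 20.0, 16.4, 17.6, 24.7, 19.6, 22.2])   # plasma colloid osmotic pressure
pi_if = np.array([5.2, 4.4, 2.5, 3.6, 8.8, 2.5, 7.0])          # interstitial colloid osmotic pressure
p_if  = np.array([1.2, 0.9, 0.6, -1.7, 0.0, 2.3, -0.5])        # interstitial fluid pressure
sigma = np.array([1.06, 1.03, 0.90, 1.05, 1.03, 0.99, 1.05])

# Values expected from Eq. 14: slope = sigma*(1 - pi_if/pi_p), intercept = P_if.
print(round(sigma.mean() * (1 - pi_if.mean() / pi_p.mean()), 2), round(p_if.mean(), 1))   # ~0.77 and ~0.4

a_conv, b_conv = np.polyfit(pi_p, pci, 1)                       # vertical distance method
r = np.corrcoef(pi_p, pci)[0, 1]
a_sym = np.sign(r) * pci.std(ddof=1) / pi_p.std(ddof=1)         # proposed method
b_sym = pci.mean() - a_sym * pi_p.mean()
print(round(a_conv, 2), round(b_conv, 1))                       # ~0.36 and ~8.5 (Eq. 15 rounds to 8.6)
print(round(a_sym, 2), round(b_sym, 1))                         # ~0.78 and ~0.1 (Eq. 16)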
Obviously, this suggested method cannot always best represent data since the best representation depends on the exact nature of the errors. However, the method certainly appears to produce a more meaningful estimate of the slope and intercept for physiological data when both variables are known to be subject to error.
The proposed method is ideally suited to situations where the errors in Y (i.e., the magnitude of the errors in Y) are equal to A times the errors in X, since this is the basic underlying assumption in the proposed method.¹ To understand the relationship between the proposed method and classical least-squares estimates of the slope, consider the exact relationship Y = AX + B. An error is then added to each value of Y and to each value of X. As is usually assumed, the errors are normally distributed and unrelated. Figure 3 shows the accuracy of the conventional and proposed estimates of the true slope. In this figure, A1 is the slope found with the conventional vertical distance method, A1/r that found with the proposed method, and A1/r² the slope found with the horizontal distance method. A1/r is equal to the true slope when the errors in Y equal A times the errors in X. In addition, the proposed method produces a more accurate estimate of the slope than conventional methods whenever A times the errors in X is in the range of one half to two times the errors in Y. Outside of this range, the classical least-squares methods result in a more accurate estimate of the true slope. The relationship in Fig. 3 is accurate over a very large range of errors and breaks down only when there is no correlation in the data, i.e., when r < 0.1.
¹ The terms "errors in X" and "errors in Y" are meant to represent the magnitude of the errors in X and Y and may be defined as the sum of the absolute values of the errors, i.e., the errors in X = Σ|ΔXi| and the errors in Y = Σ|ΔYi|, where the vertical bars represent absolute value. The individual values of the errors are generally assumed to be normally distributed and unrelated.
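The qualitative behavior summarized in Fig. 3 can be reproduced with a small simulation (a sketch, not the author's original computation; Python with NumPy assumed, and the true slope of 2 is arbitrary):

import numpy as np

rng = np.random.default_rng(2)
A, B, n = 2.0, 1.0, 500                      # true line and sample size
x_true = rng.uniform(0, 10, n)
sd_y = 1.0

for ratio in (0.1, 0.5, 1.0, 2.0, 10.0):     # A * (errors in X) / (errors in Y)
    sd_x = ratio * sd_y / A
    x = x_true + rng.normal(0, sd_x, n)
    y = A * x_true + B + rng.normal(0, sd_y, n)
    a1 = np.polyfit(x, y, 1)[0]              # vertical distance estimate
    r = np.corrcoef(x, y)[0, 1]
    print(ratio, round(a1, 2), round(a1 / abs(r), 2), round(a1 / r**2, 2))
# a1/|r| (the proposed estimate) is closest to the true slope A near ratio = 1,
# while a1 is better for small ratios and a1/r**2 for large ones.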
WHAT METHOD TO USE
At this point we are left with the standard least-squares method with one variable selected as the independent variable and with the proposed method which minimizes errors in both the X and Y directions. It is necessary to consider the conditions under which each method should be used. The line determined by the proposed method may be taken as a starting point for any linear regression analysis since it can be considered to be the line of symmetry through the data. Then it is necessary to consider 1) the use to which the desired line is to be put and 2) what is known about the errors in each variable. If the errors in one variable are known to be negligibly small, then the conventional least-squares method should be used. (This corresponds to conditions in which the value of X is accurately controlled by the investigator.)
FIG. 3. A comparison of the ability of the proposed method and conventional methods to estimate the true slope (A) when errors are present in both variables. The ordinate represents the true slope times the magnitude of the errors in X divided by the magnitude of the errors in Y. Slope A1 is from conventional methods with X the independent variable, A1/r² from the same method with Y the independent variable, and A1/r is the slope found with the proposed method. The dashed lines represent 50% of the distance between the conventional and proposed estimates of slopes (see text), and the shaded area is the range over which the proposed method results in a more accurate estimate of A than classical methods.
There are several other methods that are comparable to that being proposed, e.g., taking the arithmetic mean of the two slopes found with conventional least-squares techniques, bisecting the angle between the two lines, etc. However, with the other methods, the mathematics is somewhat more difficult and the errors being minimized are not as easily defined.
APPENDIX

The following is a summary of procedures presently used to determine a linear fit of data to the line Y = AX + B. Further details and logic for the mathematics of the least-squares methods can be obtained from most texts on statistical methods.

1) Standard least-squares method. Whenever a "best" linear relationship between two variables is desired, the standard method of least-squares regression analysis is almost invariably used. From an input of N measured pairs of data (Xi, Yi), the method assumes X is the independent variable and supplies an estimate of the slope (A) and the intercept (B) for the relationship Y = AX + B by minimizing an error function (E) which is equal to the sum of the squares of the vertical distances between each Yi and the line of best fit (line 1 of Fig. 1)

E1 = Σ (Yi - A1·Xi - B1)²    (17)

By setting ∂E/∂B = 0 and ∂E/∂A = 0, an exact analytical expression for A and B is obtained

A1 = Σ xi·yi / Σ xi²,    B1 = Ȳ - A1·X̄    (18)

where Ȳ and X̄ represent the respective arithmetic means of the measured values of the variables and where xi and yi are residuals of Xi and Yi, i.e., xi = Xi - X̄ and yi = Yi - Ȳ.

An alternate and often equally justified approach is to treat Y as the independent variable. In this case, the sum of the squares of the horizontal distances between the data points and the regression line (line 2 of Fig. 1)

E2 = Σ (Xi - Yi/A2 + B2/A2)²    (19)

is minimized and yields

A2 = Σ yi² / Σ xi·yi,    B2 = Ȳ - A2·X̄    (20)

Implicit in the use of either E1 or E2 is the assumption that the independent variable in each case is exactly known (6). If this is not true, so that either xi in Eq. 18 or yi in Eq. 20 is subject to error, then the slope (either A1 or 1/A2) is biased to a smaller value (2). This fact becomes more apparent when it is realized that the slope found when X is chosen as the independent variable differs by the square of the correlation coefficient (r) from the slope found when Y is chosen as the independent variable, i.e., A1 is equal to r²·A2. Obviously A1 = A2 only if r² = 1 so that perfect correlation exists. For any other situation, the two estimates of A (A1 and A2) and of B (B1 and B2) are not identical, and the ordinary least-squares approach provides no indication of the "better" choice. However, since A1 is biased to a smaller value and A2 to a larger value, it is obvious that the desired slope lies somewhere between A1 and A2.

2) Methods used when errors are known. The problem with both of the above least-squares methods is that they fail to take into account the fact that both Xi and Yi are subject to error. This problem has been investigated, and if information is available as to the nature of the errors that exist in the measured values of Xi and Yi, then the fairly complex methods discussed by Kendall and Stuart (3) or Madansky (4) can be used. However, most frequently with experimental data the investigator has little idea as to what errors may exist in either Xi or Yi and only knows that the best relationship between X and Y is desired. In the past, the following logical yet problematic approaches have been used in an attempt to obtain the best linear relationship when the values of the variables are subject to error.

3) Sum of vertical and horizontal distances method. This approach simply minimizes the sum of the squares of the vertical and horizontal distances between the points and the line of best fit

E3 = E1 + E2 = (1 + 1/A3²) Σ (Yi - A3·Xi - B3)²    (21)

After setting the partial derivatives of E3 equal to zero, the resulting equation can be expressed as

(1/A1)·A3⁴ - A3³ + A3 - A2 = 0,    B3 = Ȳ - A3·X̄    (22)

where A1 and A2 are defined above. This solution is more complex, but the value of A3 can be determined by trial and error.

4) Perpendicular distance method. The second and more commonly used approach when both variables are subject to error is to minimize the sum of the squares of the perpendicular distances between the data points and the line of best fit (line 3 of Fig. 1). The error function can be expressed as

E4 = [1/(1 + A4²)] Σ (Yi - A4·Xi - B4)²    (23)

Setting the partial derivatives of E4 to zero yields

A4² + (1/A1 - A2)·A4 - 1 = 0,    B4 = Ȳ - A4·X̄    (24)

The two solutions of Eq. 24 are obtained through use of the quadratic formula, with the desired root for A4 being between A1 and A2.

5) Method of averages. This method has not been used much recently but has certain merits and is the only method suggested by the US Bureau of Standards Handbook on experimental statistics (5). First, divide the data into three nonoverlapping groups when considered in the X direction, with a size as close to n/3 as possible. The two extreme groups should have an equal number of points. The desired line is drawn through the mean of the data (X̄, Ȳ) with a slope of (Ȳ3 - Ȳ1)/(X̄3 - X̄1), where the latter mean values are the means of each of the two extreme groups. Note that data can also be grouped according to the Y values. For large groups of data, the method of averages produces values of the slope A and intercept B which are almost identical to those found with the proposed line of symmetry.

This study was supported by National Institutes of Health Grant HL 11678.

Received for publication 20 September 1976.
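As a worked sketch of the appendix formulas (assuming Python with NumPy; the quartic of Eq. 22 is solved numerically rather than by trial and error), the four least-squares estimators and the proposed line of symmetry can be computed from the residual sums:

import numpy as np

def appendix_fits(X, Y):
    x, y = X - X.mean(), Y - Y.mean()              # residuals about the means
    Sxx, Syy, Sxy = (x * x).sum(), (y * y).sum(), (x * y).sum()

    A1 = Sxy / Sxx                                 # Eq. 18, vertical distances
    A2 = Syy / Sxy                                 # Eq. 20, horizontal distances

    # Eq. 22: (1/A1)*A3**4 - A3**3 + A3 - A2 = 0; take the root between A1 and A2
    r3 = np.roots([1.0 / A1, -1.0, 0.0, 1.0, -A2]).astype(complex)
    r3 = r3.real[np.abs(r3.imag) < 1e-8]
    A3 = r3[(r3 > min(A1, A2)) & (r3 < max(A1, A2))][0]

    # Eq. 24: A4**2 + (1/A1 - A2)*A4 - 1 = 0; again the root between A1 and A2
    r4 = np.roots([1.0, 1.0 / A1 - A2, -1.0]).astype(complex)
    r4 = r4.real[np.abs(r4.imag) < 1e-8]
    A4 = r4[(r4 > min(A1, A2)) & (r4 < max(A1, A2))][0]

    Asym = np.sign(Sxy) * np.sqrt(Syy / Sxx)       # proposed line of symmetry, for comparison
    intercept = lambda A: Y.mean() - A * X.mean()  # every fit passes through the mean point
    return {k: (A, intercept(A)) for k, A in
            [("A1", A1), ("A2", A2), ("A3", A3), ("A4", A4), ("symmetry", Asym)]}

X = np.array([9.6, 20.7, 14.5, 16.0, 12.7, 17.9, 10.3, 7.4, 7.8, 11.0])
Y = np.array([14.2, 17.1, 20.9, 13.9, 17.6, 15.3, 7.0, 9.4, 10.0, 7.8])
for name, (a, b) in appendix_fits(X, Y).items():
    print(name, round(a, 3), round(b, 2))          # approximately reproduces Eqs. 3-6 and 11 for Table 1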
If the purpose of the analysis is to allow prediction of a value of Y from a given value of X, then the standard least-squares method should be used regardless of whether there are errors in the values of the Y variable or errors in both variables (3, 5). However, if the purpose of the analysis is to give a meaningful interpretation to the values of the slope or intercept (e.g., for comparison with theoretical values or use in a simulation, etc.), then the method presented above can be used instead of the standard least-squares method whenever significant measurement errors and/or random variation errors are present in the values of both variables. This includes conditions where X and Y are uncontrolled as well as conditions where one variable is inaccurately controlled. Use of the resulting line of symmetry is obviously limited to conditions in which both variables contain either measurement or random variation errors that are too large to be ignored. However, the line of symmetry exactly represents the data only when the errors in Y are equal to A times the errors in X.

The major advantage of using this suggested method for determining a best linear fit is that it is exceedingly simple. Furthermore, data that had been analyzed in the classical fashion can be reanalyzed with minimal effort.
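Such a reanalysis needs only the conventional slope, the correlation coefficient, and the two means (a sketch; the function name is illustrative):

def line_of_symmetry_from_fit(a1, r, x_mean, y_mean):
    # Convert a conventional least-squares fit (slope a1, correlation r)
    # into the proposed line of symmetry: A = a1/|r|, with the intercept
    # recomputed through the mean point.
    a = a1 / abs(r)
    return a, y_mean - a * x_mean

# Example: the Table 1 fit of Eq. 3 (Y = 0.655X + 4.94, r = 0.627, mean X = 12.79,
# mean Y = 13.32) becomes approximately Y = 1.04X - 0.04; Eq. 11 quotes 1.05X - 0.06
# because it was computed from the unrounded values.
print(line_of_symmetry_from_fit(0.655, 0.627, 12.79, 13.32))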
REFERENCES

1. ACTON, F. S. Analysis of Straight Line Data. New York: Wiley, 1959.
2. BERKSON, J. Are there two regressions? J. Am. Statist. Assoc. 45: 164-180, 1950.
3. KENDALL, M. G., AND A. STUART. The Advanced Theory of Statistics. Vol. 2, Inference and Relationship. New York: Hafner, 1961.
4. MADANSKY, A. The fitting of straight lines when both variables are subject to error. J. Am. Statist. Assoc. 54: 173-205, 1959.
5. NATRELLA, M. G. Experimental Statistics. Washington, DC: US Govt. Printing Office, National Bureau of Standards Handbook 91, 1963.
6. SOKAL, R. R., AND F. J. ROHLF. Biometry. San Francisco: Freeman, 1969.
7. ZWEIFACH, B. W. Quantitative studies of microcirculatory structures and function. I. Analysis of pressure distribution in the terminal vascular bed in cat mesentery. Circulation Res. 34: 843-866, 1974.