How many observations? The accuracy of Data Envelopment Analysis [1]
A comparison of the relative accuracy of regression and programming-based approaches to measuring efficiency
Preliminary: please do not quote without permission
Abstract
A mathematical programming-based approach (DEA) to the estimation of productive
efficiency is widely used, and yet there are few analyses of its accuracy. This paper
reports on a Monte Carlo simulation of the process of efficiency estimation. It shows
that the number of observations required for a reasonably accurate picture depends on
the underlying data generating process, but that in many circumstances classical
regression may give a more accurate estimate than DEA.
JEL categories: 220
John Cubbin
Address for correspondence:
Economics Department
City University
Northampton Square
London EC1V 0HB
e-mail j.s.cubbin@city.ac.uk
Louis Potonias
(University of Warwick)
1. Introduction
Data Envelopment Analysis (DEA) has been widely applied to the estimation of
productive efficiency. However, there are few examples of the examination of its
accuracy. The technique first emerged in the management science and operations
research literature but has recently been spreading to economics journals.
One “advantage” of DEA for authors is that it is not necessary to report t-statistics, since the software typically used does not produce statistical tests. For reviewers of papers, however, this creates a difficulty, because typically no tests of specification are presented. [2]
In this paper we report on an examination of this issue in the context of a simple
application - the estimation of cost efficiency.
Section 2 briefly describes DEA and section 3 looks at some previous work. Section
4 describes our data generation process and section 5 describes the performance
statistics used. Section 6 describes the experimental design and section 7 reports on
the performance of DEA under different conditions, using regression analysis (RA) as a benchmark.
Section 8 summarises our conclusions and discusses the implications for further work.
2. DEA and RA compared
Figure 1 shows a comparison between the econometric and DEA approaches. Inefficiency is measured as the proportional distance from the frontier. The benchmark regression frontier is based on a corrected ordinary least squares (COLS) approach. First estimate the average relationship between costs and output (in the case illustrated, using OLS with a linear functional form). Then shift the line down so that it passes through the observation with the largest negative residual.
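As a concrete illustration, the following is a minimal sketch of the COLS procedure in Python, assuming a single explanatory factor; the data-generating numbers here are invented for illustration, not taken from the paper’s test sets.

    import numpy as np

    rng = np.random.default_rng(0)
    output = rng.uniform(0.5, 1.5, size=30)      # one explanatory factor
    true_eff = rng.uniform(0.7, 1.0, size=30)    # true efficiencies in (0, 1]
    cost = (0.2 + 0.9 * output) / true_eff       # inefficiency inflates observed cost

    # Step 1: estimate the average cost relationship by OLS.
    X = np.column_stack([np.ones_like(output), output])
    beta, *_ = np.linalg.lstsq(X, cost, rcond=None)

    # Step 2: shift the fitted line down through the largest negative residual.
    frontier = X @ beta + (cost - X @ beta).min()

    # Efficiency is the proportional distance from the frontier.
    cols_eff = frontier / cost                   # equals 1 for the best observation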
By contrast the Data Envelopment approach seeks to draw a convex hull around the
observations. This is seen most clearly in the variable returns to scale (VRS) frontier
indicated. With just one input (cost) and one output the constant returns to scale
(CRS) version of DEA is deceptively simple: a ray from the origin to the lowest
point. As it is drawn, both DEA and COLS come to the same conclusion about the
most efficient observation, although they differ in their estimates and rankings.
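For comparison, here is a sketch of the two DEA variants in the same one-input (cost), one-output setting. The VRS score is computed from the standard input-oriented envelopment linear programme; this is a generic textbook formulation, not the authors’ own code.

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    output = rng.uniform(0.5, 1.5, size=30)
    true_eff = rng.uniform(0.7, 1.0, size=30)
    cost = (0.2 + 0.9 * output) / true_eff

    # CRS: the frontier is a ray from the origin through the lowest
    # cost-to-output point, so the scores follow directly from the ratios.
    ratio = cost / output
    crs_eff = ratio.min() / ratio

    # VRS: for each unit o, minimise theta subject to
    #   sum_j lam_j * cost_j <= theta * cost_o,
    #   sum_j lam_j * output_j >= output_o,  sum_j lam_j = 1,  lam_j >= 0.
    n = len(cost)
    vrs_eff = np.empty(n)
    for o in range(n):
        c_obj = np.r_[1.0, np.zeros(n)]              # variables: theta, lam_1..lam_n
        A_ub = np.vstack([np.r_[-cost[o], cost],     # input constraint
                          np.r_[0.0, -output]])      # output constraint
        b_ub = np.array([0.0, -output[o]])
        A_eq = np.r_[0.0, np.ones(n)].reshape(1, -1) # convexity gives the VRS hull
        res = linprog(c_obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                      bounds=[(None, None)] + [(0, None)] * n)
        vrs_eff[o] = res.x[0]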
[1] The authors are grateful for comments from Kaddour Hadri and from seminar participants at City University and the Centre for Business Research, Cambridge University. Remaining errors are those of the authors.
[2] For a discussion of statistical tests in connection with DEA see Grosskopf (1996).
Figure 1: Econometric efficiency scores and DEA frontiers. [The figure plots cost against an explanatory factor, showing the regression line, the COLS frontier, and the DEA (VRS) frontier.]
The geometry of the two functions is quite different. In the case of the linear form, N extreme points are all that are necessary to generate the frontier exactly, where N is the dimensionality of the problem. [3] However, once we allow for curvature, we may need many more points. Suppose five efficient points are needed to define an isoquant or output frontier reasonably accurately in the variable-returns case. Adding another variable means that we are trying to define a surface, which will require 5^2 = 25 efficient points for the same degree of accuracy. Extending the surface into another dimension will multiply the number of required points again (5^3 = 125). In general we might suppose that 5^N efficient points are needed, where N is the dimensionality of the problem. [4] If only a minority of observations are on the frontier, we need a multiple of this number of observations.
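A back-of-envelope reading of this rule of thumb, with the frontier shares purely illustrative:

    # If roughly 5**N efficient points are needed and a share p of observations
    # lies on the frontier, about 5**N / p observations are required in total.
    for N in (1, 2, 3):
        for p in (0.5, 0.2, 0.1):
            print(f"dimension {N}, frontier share {p:.0%}: ~{5**N / p:.0f} observations")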
[3] N is the number of variables for the variable returns to scale measure, one less for the constant returns measure.
[4] For variable returns to scale, dimensionality is M-1, where M is the total number of inputs, outputs, and non-controllable factors. Constant returns to scale preserves a degree of freedom, so the dimension is M-2.
The number of points required is smaller if the variation in the factor or output ratios is limited, since only a fraction of the overall hull then needs defining.
This suggests that DEA might perform relatively well in cases where RA has a
problem: in the presence of multicollinearity, which by its nature tends to mean that
explanatory factors show little relative variation.
The statistical properties of regression analysis under ideal conditions are well known.
The ideal conditions will not be met in practice. For instance, inefficiencies cannot
be normally distributed because they are bounded at zero. This affects not only the
efficiency and unbiasedness properties of the estimates themselves, but also the tests
on which statistical inference is based for regression analysis.
As a practical matter it is important to know how inaccurate the results are likely to
be. Simulation of the data generation and estimation process under different
assumptions is one way of deducing this information.
3. Previous investigations
There have been surprisingly few attempts to compare the properties of DEA and
regression approaches in practice. Cubbin and Zamani (1996) compared the use of
DEA and RA in the context of measuring the performance of Training and Enterprise
Councils. Cubbin and Tzanidakis (1998) compared the methods for estimating the
efficiencies of water companies. Both studies concluded that the results could be
very different. When real data are used there is typically no knowledge of true
performance, and this makes firm conclusions difficult.
A number of other studies using constructed data have reported low correlations
between DEA scores and true efficiency measures. For example, Ferrier and Lovell
(1996) compared the efficiency rankings (and total residuals) of a stochastic cost
frontier and a nonstochastic nonparametric production frontier. In each case the
Spearman rank correlation coefficients were less than 0.02, and clearly nonsignificant.
This suggests that one or both of the methods was very poor at estimating efficiency
for these data.
A set of three papers by Sherman (1984), Bowlin et al. (1985), and Thanassoulis (1993) all used the same hypothetical data set of 15 observations. This was generated as follows:
- Data were generated by a linear cost function.
- 7 of the 15 observations were 100% efficient.
- The rest had the following true efficiencies: 1 at 0.97; 3 at 0.91; 1 each at 0.89, 0.87, 0.86, and 0.85.
Thus there seemed to be two distinct sets: a group of 100% efficient observations and another with a skewed distribution around a distinct mode. Whilst mentioning some
advantages of RA (such as more stable estimates of efficiency) the balance in
Thanassoulis’ paper appears to favour DEA. In particular “DEA offers more accurate
estimates of relative efficiency because it is a boundary method.” (p1142)
Having a set of efficient observations provides a clear advantage for a boundary method such as DEA. The number needed can be smaller if the form of the boundary can be kept simple, for example by the use of a linear functional form. Furthermore, the lack of a symmetric, let alone normal, distribution of efficiencies may be thought to have hampered the performance of the regression approach.
There is therefore a need to test the relative accuracy of DEA under a range of more
general assumptions about the distribution of efficiencies. This will allow not only a
clearer evaluation, but also give a guide to the sensitivity of the results to different
distributional assumptions.
Kittelsen (1995) addressed the problem of bias in DEA using Monte Carlo analysis
and concluded that “bias is important, increasing with dimensionality [i.e. k] and
decreasing with sample size, average efficiency, and a high density of observations
near the frontier.”
Pedraja-Chaparro, Salinas-Jimenez and Smith (1997) have found, in a Monte-Carlo
simulation, that the mean bias could be reduced, and correlation with true efficiency
scores improved, by imposing restrictions on the weightings of the inputs and outputs
so that they were more similar for different observations.
Several authors have gone beyond DEA and regression analysis. For example, Pollitt (1995) examined a range of approaches to the estimation of the efficiency of electric utilities, including parametric programming analysis. The latter uses the whole data set to generate a parametric frontier.
4. Data generation
To generate test data, we need the following components:
- an underlying cost function
- distributions for the exogenous variables in the cost function
- the distribution of the underlying efficiencies
To this it seems natural to add a distribution of errors, such as measurement errors in the dependent variable or errors arising from mis-specification such as the omission of variables. However, this would tend to obscure the underlying issue which we are attempting to address. In any case, such an error component is scarcely used in the DEA literature, and is not universally employed in the regression literature either. See Coelli (1995) for a comparison of the stochastic frontier approach with the COLS approach used as a benchmark here.
Cost function
One of the strengths claimed for DEA is that, since no particular functional form for
the frontier is assumed, it will be adaptable to a range of possibilities. We allow
DEA to show its capabilities over a range of true functional forms.
We relied on the three most commonly estimated forms: linear, Cobb-Douglas, and translog. However, we do not assume that we either know or can reliably test for the
correct functional form. In estimation for the regression benchmark we deliberately
restrict ourselves to the linear and Cobb-Douglas forms.
Exogenous variables
Three independently distributed cost drivers were chosen. We know from both Kittelsen’s results and a priori reasoning that the accuracy of DEA depends on the dimensionality of the problem relative to the number of observations. Fixing the number of drivers at three, although computationally convenient, clearly represents a limitation of the present analysis, which should be investigated at a later stage.
The variables were generated as independent pseudo-random variables. Negative values for the independent variables (i.e. outputs) need to be ruled out, as do values close to zero. For this reason the distributions were truncated from below. Since DEA is known to be sensitive to the presence of outliers, we also truncated the distributions at the top end, so the exogenous variables were distributed in the range 0.5-1.5 with either a uniform or normal distribution.
Distribution of inefficiencies
We considered three types of inefficiency distribution: truncated normal, uniform, and exponential. The latter is one of the class of distributions for which Banker (1993) shows that DEA is a maximum likelihood estimator.
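For concreteness, here is a sketch of drawing efficiencies under the three assumptions. The parameter values, and the exp(-u) mapping from an exponential inefficiency u to an efficiency score, are illustrative choices rather than the exact settings used in the experiments.

    import numpy as np

    rng = np.random.default_rng(42)
    n = 1000

    # Uniform efficiencies over an assumed range.
    eff_uniform = rng.uniform(0.6, 1.0, size=n)

    # Normal efficiencies truncated two standard deviations below the mean
    # (and capped at 1, since an efficiency score cannot exceed 100%).
    mu, sd = 0.85, 0.07
    eff_normal = np.clip(rng.normal(mu, sd, size=n), mu - 2 * sd, 1.0)

    # Exponential inefficiency u with modal value zero, mapped to an
    # efficiency in (0, 1]; mass therefore clusters near the frontier.
    eff_expo = np.exp(-rng.exponential(0.12, size=n))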
The data were generated and estimated using a pair of integrated programs [5] which generate the necessary random variables, estimate efficiency using either DEA or COLS, and then generate the performance data for the estimation method. The programs allow the user to specify the functional form, the distributions of the exogenous variables and inefficiencies, the number of observations, and the number of runs. The programs use the same code for data generation and produce identical data when provided with the same initial random number seed. This avoids the need to store large numbers of data sets.
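A Python sketch of how such a harness can be structured (the authors’ programs are in FORTRAN 77; the uniform distributions and the pluggable estimator below are illustrative, while the linear cost parameters are those given in the appendix):

    import numpy as np

    def generate_data(rng, n_obs):
        """Generate one data set: three cost drivers, true efficiencies, observed costs."""
        drivers = rng.uniform(0.5, 1.5, size=(n_obs, 3))
        eff = rng.uniform(0.6, 1.0, size=n_obs)                      # illustrative distribution
        frontier_cost = 0.05 + drivers @ np.array([0.4, 0.3, 0.25])  # linear cost function
        return drivers, frontier_cost / eff, eff                     # inefficiency inflates cost

    def run_experiment(seed, n_obs, n_runs, estimator):
        """The same seed reproduces the same data, so data sets need not be stored."""
        rng = np.random.default_rng(seed)
        biases = []
        for _ in range(n_runs):
            drivers, cost, true_eff = generate_data(rng, n_obs)
            est_eff = estimator(drivers, cost)                       # DEA or COLS, plugged in
            biases.append(np.mean(est_eff - true_eff))
        return float(np.mean(biases))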
5. Performance statistics
No single performance measure can capture the accuracy of an estimate. To give a variety of perspectives on the performance of DEA, the following were initially calculated:
1. Mean bias = mean of (estimated efficiency E - true efficiency T).
2. Mean square error = Σ(E - T)^2 / N, where N is the number of observations.
[5] Written in FORTRAN 77 by the author (Cubbin), using routines adapted from Faires and Burden (1998) and Press et al. (1986). The components were tested separately against standard econometric software; any errors are my own.
3. Proportion of false efficiencies, F/N. One of the concerns about DEA is that
outliers will be placed on the estimated frontier. A false declaration of efficiency
is defined for present purposes as occurring if the true efficiency is less than 95%
and the observation is classed as efficient.
4. Correlation between true and estimated efficiency.
The square root of the mean square error is a useful indicator of the overall accuracy
of the measurement.
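A sketch of the four statistics as a single function, assuming NumPy arrays E (estimated) and T (true) and treating scores within a small tolerance of 1 as classed as efficient:

    import numpy as np

    def performance_stats(E, T, tol=1e-6):
        bias = float(np.mean(E - T))                           # 1. mean bias
        mse = float(np.mean((E - T) ** 2))                     # 2. mean square error
        efficient = E >= 1.0 - tol                             # classed as efficient
        false_eff = float(np.mean(efficient & (T < 0.95)))     # 3. false efficiencies, F/N
        corr = float(np.corrcoef(E, T)[0, 1])                  # 4. correlation
        return bias, mse, false_eff, corr                      # sqrt(mse): overall accuracy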
6. Experimental design
Monte Carlo simulation can, even with modern computers, be time-intensive. On the other hand, unless a sufficiently broad range of assumptions is tested, the conclusions are in danger of being unrepresentative.
How many replications?
The more simulations are undertaken, the more accurate the results will be. However, unless statistical tables are being compiled, many significant places of accuracy are not required. Furthermore, the variance of the performance measures used will itself be a decreasing function of the number of observations.
To get a benchmark we did a number of runs of the DEA base case using 16, 100, and 1000 observations and 1000, 250, and 100 runs respectively. We found that 10,000 replications were sufficient to generate results accurate to three significant places or better.
As a first step we carried out a series of investigations into which of the factors apart from sample size were important in determining DEA’s accuracy.
The factors considered were as follows:
- The underlying cost function: linear, Cobb-Douglas, or transcendental-log (translog).
- The distribution of outputs: uniform, (truncated) normal, or exponential.
- The distribution of efficiencies: uniform, normal, or exponential.
The cost function
DEA is supposed to be good at defining frontiers which cannot be described by a simple function. However, it was easiest to construct data in a repeatable way using a parametric function. In order to reflect DEA’s supposed greater flexibility, the regression benchmark was given differing degrees of handicap. Only linear and linear-in-logs (Cobb-Douglas) approaches were used for estimation, in order to capture the fact that the functional forms used in practice will not usually replicate the data generation process.
All the cost functions had three outputs or independent variables. The number was
chosen as a compromise, ensuring sufficient richness in the problem whilst not
over-burdening either technique with excessive complexity.
The distribution of outputs
An acknowledged weakness of DEA is its susceptibility to outlying observations.
We could have approached this in a symmetric way by generating log-normally
distributed cost drivers. However, we chose instead to focus on distributions which
were bounded form below but not from above, as probably best reflects firm
characteristics. In addition to a “default” of a uniform distribution (which has an
upper bound), we also chose an exponential distribution and a normal distribution
bounded at -2 standard deviations below the mean.
The exponential distribution is of particular interest because Banker (1993) has shown
that it is sufficient to guarantee that DEA has maximum likelihood properties. The
modal value for inefficiency is zero, and this guarantees that observations will tend to
cluster nearer the frontier.
One potentially important parameter is the degree of variation in the cost drivers or
outputs. This was measured in terms of the ratio of the mean value to its minimum.
This is equivalent to choosing the standard deviation or variance, but allows for the
possibility of introducing distributions whose variance is undefined.
The distribution of efficiencies
Again, a uniform distribution was chosen as the default value with a bounded normal
and an exponential distribution as the alternatives.
It is common (for example, in estimating stochastic frontiers) to bound the normal distribution at its modal value (i.e. what would otherwise be the mean). For the purpose of this exercise, it was felt that this would produce a distribution too similar to the exponential. It was also expected that, whilst DEA ought to perform relatively well with the exponential distribution, regression analysis ought to do well with the normal distribution (even a truncated normal). Furthermore, there was interest in testing the view that a normal distribution truncated at -2 standard deviations would lead to only small biases for OLS.
7. Results
To identify the potentially critical dimensions of the problem, we chose to carry out initial calculations with 30 observations. This is the number of observations at which econometricians have traditionally started to feel that worthwhile models could be estimated.
The results of this initial phase are set out in Table 1.
Table 1. Performance of DEA and RA in different specifications

                              ----------- DEA -----------   --------- RA ---------
Row  Funct  Driver  Effic   FALSE%   CORR    BIAS    MSE     CORR    BIAS    MSE
  1  LINR   UNIF    UNIF     14.2    0.879   0.107   0.017   0.925   0.001   0.005
  2  COBD   UNIF    UNIF     12.4    0.727   0.046   0.019   0.776  -0.055   0.017
  3  TLOG   UNIF    UNIF     12.4    0.831   0.083   0.016   0.886  -0.019   0.009
  4  LINR   NORM    UNIF     15.4    0.872   0.108   0.018   0.921   0.007   0.005
  5  COBD   NORM    UNIF     12.9    0.704   0.032   0.019*  0.756  -0.083   0.022
  6  TLOG   NORM    UNIF     14.5    0.838   0.092   0.017   0.875  -0.017   0.009
  7  LINR   EXPO    UNIF     14.0    0.847   0.102   0.018   0.898  -0.010   0.006
  8  COBD   EXPO    UNIF     11.7    0.621  -0.004   0.026*  0.659  -0.103   0.033
  9  TLOG   EXPO    UNIF     13.4    0.790   0.075   0.018   0.800  -0.057   0.018
 10  LINR   UNIF    NORM     17.2    0.888   0.092   0.014   0.910  -0.015   0.006
 11  COBD   UNIF    NORM     13.4    0.707   0.016   0.017*  0.747  -0.086   0.022
 12  TLOG   UNIF    NORM     14.4    0.832   0.063   0.013   0.859  -0.038   0.011
 13  LINR   NORM    NORM     16.8    0.902   0.089   0.013   0.918  -0.010   0.004
 14  COBD   NORM    NORM     13.3    0.719   0.010   0.017*  0.761  -0.102   0.026
 15  TLOG   NORM    NORM     15.0    0.859   0.070   0.012   0.881  -0.040   0.010
 16  LINR   EXPO    NORM     16.5    0.870   0.085   0.013   0.882  -0.029   0.008
 17  COBD   EXPO    NORM     12.3    0.591  -0.040   0.028*  0.631  -0.138   0.041
 18  TLOG   EXPO    NORM     13.7    0.789   0.043   0.013*  0.775  -0.093   0.025
 19  LINR   UNIF    EXPO     14.6    0.943   0.077   0.010*  0.868  -0.055   0.016
 20  COBD   UNIF    EXPO     12.5    0.825   0.010   0.015*  0.772  -0.116   0.033
 21  TLOG   UNIF    EXPO     13.5    0.908   0.055   0.010*  0.832  -0.080   0.024
 22  LINR   NORM    EXPO     15.1    0.938   0.081   0.011*  0.863  -0.054   0.015
 23  COBD   NORM    EXPO     11.9    0.804   0.006   0.017*  0.750  -0.122   0.036
 24  TLOG   NORM    EXPO     13.8    0.914   0.063   0.011*  0.844  -0.075   0.021
 25  LINR   EXPO    EXPO     13.6    0.930   0.070   0.010*  0.817  -0.081   0.023
 26  COBD   EXPO    EXPO     10.7    0.725  -0.044   0.027*  0.695  -0.173   0.057
 27  TLOG   EXPO    EXPO     11.5    0.873   0.029   0.012*  0.748  -0.147   0.046

30 observations, mean/min ratio = 10, 100 runs, variable returns to scale. MSE = mean square error. An asterisk (*) marks rows where DEA has a lower MSE than regression analysis.
Not surprisingly, the best performance of DEA occurs when the distribution of inefficiencies is exponential. A normal distribution of errors also generally leads to better performance than an exponential distribution.
The form of the cost function seems to make a significant difference to DEA. The Cobb-Douglas form is the worst performer, especially in combination with exponentially distributed errors.
There was little difference in results between the normal and uniform distributions for
the cost drivers. Both DEA and RA performed significantly worse with
exponentially distributed cost drivers.
Comparison with RA
In order to evaluate DEA it is useful to have a benchmark, and we have chosen
regression analysis, since this is the approach most commonly adopted as an
alternative. However, given that the data were generated using a parametric approach
it would appear relatively easy to demonstrate the superiority of the regression
approach by replicating the parametric model.
In practice the form of the parametric model (if there were one!) would be unknown.
We have simulated this by using an inappropriate functional form, e.g. linear in the
case of the Cobb-Douglas data and linear or log (but omitting the interaction terms)
for the translog formulation.
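A sketch of this deliberate misspecification: Cobb-Douglas data are generated with the appendix parameters, then fitted with both a linear and a log-linear regression (for translog data the log form would still omit the squared and interaction terms). Variable names are illustrative.

    import numpy as np

    rng = np.random.default_rng(1)
    drivers = rng.uniform(0.5, 1.5, size=(100, 3))
    eff = rng.uniform(0.6, 1.0, size=100)
    cost = np.exp(np.log(drivers) @ np.array([0.4, 0.3, 0.25])) / eff  # Cobb-Douglas data

    # Deliberately "wrong" linear specification.
    A_lin = np.column_stack([np.ones(100), drivers])
    beta_lin, *_ = np.linalg.lstsq(A_lin, cost, rcond=None)

    # Log (Cobb-Douglas) specification, which happens to match the generator here.
    A_log = np.column_stack([np.ones(100), np.log(drivers)])
    beta_log, *_ = np.linalg.lstsq(A_log, np.log(cost), rcond=None)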
Furthermore the inefficiency distribution deviates from the regression ideal. At the
very least, the errors are truncated. In the case of the normal distribution they are
slightly skewed as a result of cutting off the bottom 2% of the distribution. In the
case of the exponential distribution, they are considerably skewed. Although this
should not affect the expected value for the parameters, it would obviously affect the
test statistics used.
The RA performance scores shown in Table 1 are those for linear regression.
The ranking for functional form is the reverse of that for DEA, with a linear form
giving the best overall performance, whether measured in terms of the correlation with
the true values or the mean squared error. As with DEA, the combination of a
Cobb-Douglas function and exponential errors leads to particularly poor scores.
As may be expected, the regression approach performed best when the data generation process most resembled the assumptions of the model: a linear cost function and a symmetric disturbance term (e.g. rows 1 and 4 of Table 1).
Creating outliers in the data, whether in the exogenous variables or the disturbance terms, through the exponential distribution results in a reduction in performance. Just behind the uniform distribution comes the truncated normal (rows 2 and 5). Not surprisingly, the exponentially-distributed inefficiency term produced the lowest performance for the regression approach.
Effect of size range for cost drivers
One of the reasons for the poorer performance of DEA when the cost drivers are
distributed with a normal or exponential distribution is that the scope for outlying
observations is greater. This increases the probability of an observation being falsely
classified as on the frontier.
The basic reason for this is that outliers cause that region of the frontier to be sparsely
populated. This effect can also be seen if the range of values for the cost drivers is
increased. Table 2 presents some results for 50 observations under two different
assumptions about the underlying cost function.
Table 2: Effect of cost driver dispersion

Row  Funct  Driver  Effic  Mean/Min  False%   CORR     BIAS      MSE
  1  LINR   NORM    UNIF      1.2      4.1   0.9659   0.0451   0.00395
  2  LINR   NORM    UNIF      2        8.3   0.9226   0.0647   0.00824
  3  LINR   NORM    UNIF      4        9.4   0.9152   0.0742   0.00985
  4  LINR   NORM    UNIF      8       10.2   0.9171   0.0791   0.01049
  5  LINR   NORM    UNIF     16       10.2   0.9182   0.0811   0.01077
  6  LINR   NORM    UNIF     32       10.3   0.9179   0.0817   0.01089
  7  LINR   NORM    UNIF     64       10.3   0.9175   0.0818   0.01090
  8  LINR   NORM    EXPO      1.2      3.7   0.9834   0.0332   0.00246
  9  LINR   NORM    EXPO      2        7.0   0.9616   0.0440   0.00479
 10  LINR   NORM    EXPO      4        8.2   0.9553   0.0529   0.00613
 11  LINR   NORM    EXPO      8        8.9   0.9560   0.0567   0.00652
 12  LINR   NORM    EXPO     16        8.5   0.9558   0.0580   0.00670
 13  LINR   NORM    EXPO     32        8.3   0.9551   0.0583   0.00680
 14  LINR   NORM    EXPO     64        8.5   0.9543   0.0583   0.00687
 15  COBD   NORM    UNIF      1.2      4.4   0.9713   0.0496   0.00405
 16  COBD   NORM    UNIF      2        7.5   0.9143   0.0573   0.00783
 17  COBD   NORM    UNIF      4        7.2   0.8243   0.0238   0.01053
 18  COBD   NORM    UNIF      8        7.3   0.7245  -0.0182   0.01726
 19  COBD   NORM    UNIF     16        8.0   0.6480  -0.0541   0.02658
 20  COBD   NORM    UNIF     32        7.5   0.5949  -0.0840   0.03625
 21  COBD   NORM    UNIF     64        6.9   0.5593  -0.1067   0.04473
There is some deterioration for the linear functional form, but it is more drastic for the Cobb-Douglas form. In row 21, a mean squared error of 0.045 implies a root mean squared error of about 0.21, i.e. a typical error of 21 percentage points in the efficiency rating.
Results: sample size
Table 3 and Figure 2 show the effect of sample size for four different specifications.
Table 3: Effect of sample size

Row    Function  Driver  Efficiency   Obs   FALSE%   CORR     BIAS
 1(a)  LINR      NORM    NORM           8    69.71   0.5124   0.2171
 2(b)  TLOG      NORM    NORM           8    61.16   0.5641   0.1965
 3(c)  LINR      NORM    EXPO           8    67.91   0.5656   0.2231
 4(d)  TLOG      NORM    EXPO           8    58.18   0.6088   0.2019
 5(a)  LINR      NORM    NORM           8    69.77   0.5125   0.2173
 6(b)  TLOG      NORM    NORM           8    56.72   0.5605   0.1820
 7(c)  LINR      NORM    EXPO           8    65.39   0.5838   0.2243
 8(d)  TLOG      NORM    EXPO           8    53.93   0.6037   0.1880
 9(a)  LINR      NORM    NORM          15    54.34   0.6206   0.1831
10(b)  TLOG      NORM    NORM          15    40.49   0.6473   0.1416
11(c)  LINR      NORM    EXPO          15    50.46   0.6772   0.1848
12(d)  TLOG      NORM    EXPO          15    38.38   0.6847   0.1437
13(a)  LINR      NORM    NORM          30    38.48   0.7130   0.1436
14(b)  TLOG      NORM    NORM          30    26.75   0.7231   0.1013
15(c)  LINR      NORM    EXPO          30    35.06   0.7637   0.1387
16(d)  TLOG      NORM    EXPO          30    24.35   0.7639   0.0965
17(a)  LINR      NORM    NORM         100    17.59   0.8298   0.0825
18(b)  TLOG      NORM    NORM         100    10.46   0.8208   0.0423
19(c)  LINR      NORM    EXPO         100    15.67   0.8721   0.0760
20(d)  TLOG      NORM    EXPO         100     9.80   0.8618   0.0373
21(a)  LINR      NORM    NORM         500     4.97   0.9319   0.0335
22(b)  TLOG      NORM    NORM         500     2.64   0.8893  -0.0162
23(c)  LINR      NORM    EXPO         500     4.43   0.9508   0.0289
24(d)  TLOG      NORM    EXPO         500     2.65   0.9174  -0.0239
25(a)  LINR      NORM    NORM        1000     2.80   0.9590   0.0217
26(b)  TLOG      NORM    NORM        1000     1.48   0.8974  -0.0415
27(c)  LINR      NORM    EXPO        1000     2.56   0.9717   0.0179
28(d)  TLOG      NORM    EXPO        1000     0.23   0.9560  -0.1158

Note: a, b, c, and d are the keys to the specifications shown in Figure 2.
Figure 2: Increasing sample size. [The figure plots FALSE, CORR, and BIAS for specifications (a) and (b) against the number of observations, from 8 to 1,000.]
As expected the performance of DEA improves with sample size. Once the sample
size exceeds 500 a correlation of 0.9 or more with the true value is typically found.
Figures 3 and 4 show the results of regression analysis using two extreme models: a
linear model with uniform efficiencies and a Cobb-Douglas model with exponential
efficiencies.
Figure 3: Linear regression (linear, norm, normal). [The figure plots CORR, BIAS, and mean square error (MNSQERR) against the number of observations, from 8 to 1,000.]
Figure 4: Linear regression (Cobb-D, norm, expo). [The figure plots CORR, BIAS, and mean square error (MNSQERR) against the number of observations, from 8 to 1,000.]
8. Conclusions
DEA appears to cope best under the following conditions:
- a data generating function which is reasonably close to linear;
- a distribution of efficiencies with sufficient observations close to the frontier for the complexity of the functional form;
- if the underlying cost function is not linear, a small variation in the cost drivers.
If these conditions are not fulfilled, reliable results may require several hundred observations.
Appendix
The functional forms
All have three outputs or “cost drivers” (X, Y, and Z) using common parameters. The translog form has three extra parameters.
1. Linear:
C = a + bX + cY + dZ
2. Cobb-Douglas:
C = exp(b ln X + c ln Y + d ln Z)
3. Translog (slightly simplified to economise on the number of parameters):
C = exp(b ln X + c ln Y + d ln Z + e((ln X)^2 + (ln Y)^2 + (ln Z)^2) + f ln X·ln Y + f ln Y·ln Z + g ln X·ln Z)
The parameters are as follows:
a= 0.05
b= 0.4
c= 0.3
d= 0.25
e= 0.1
f= -0.1
g= -0.1
With these parameters the Cobb-Douglas form is homogeneous of degree (b+c+d) = 0.95 (slight economies of scale). The translog form is homogeneous of degree (b+c+d+6e+4f+2g) = 0.75 (significant economies of scale). The negative cross-effects mean that outputs are more complementary than in the Cobb-Douglas case. Conversely there is some possibility of congestion where the outputs are unbalanced. This implies economies of scope.
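For reference, the three cost functions with the stated parameters, written out as a small Python sketch (X, Y, and Z may be scalars or NumPy arrays):

    import numpy as np

    a, b, c, d, e, f, g = 0.05, 0.4, 0.3, 0.25, 0.1, -0.1, -0.1

    def linear(X, Y, Z):
        return a + b * X + c * Y + d * Z

    def cobb_douglas(X, Y, Z):
        return np.exp(b * np.log(X) + c * np.log(Y) + d * np.log(Z))

    def translog(X, Y, Z):
        lx, ly, lz = np.log(X), np.log(Y), np.log(Z)
        return np.exp(b * lx + c * ly + d * lz
                      + e * (lx ** 2 + ly ** 2 + lz ** 2)
                      + f * lx * ly + f * ly * lz + g * lx * lz)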
References
Aigner, D, Lovell, C., and Schmidt, P (1977) “Formulation and Estimation of
stochastic frontier production functions”, Journal of Econometrics, 5, 21-38.
Banker, R.D. (1993) “Maximum Likelihood, consistency, and Data Envelopment
Analysis: a Statistical Foundation.” Management Science 39 (10) , 1265-1273.
Banker, R.D., Charnes, A. and Cooper, W.W. (1984), “Some Models for Estimating Technical and Scale Inefficiencies in Data Envelopment Analysis.” Management Science, 30(9): 1078-1092.
Bogetoft, Peter (1994), “Incentive Efficient Production Frontiers: an Agency
Perspective on DEA.” Management Science, 40 (8), (August), 959 - 968.
Bowlin, W.F., Charnes, A., Cooper, W.W. and H.D. Sherman (1985), “Data
envelopment and regression approaches to efficiency estimation and evaluation.”
Annals of Operational Research, 2: 113-138.
Charnes, A., Cooper, W.W. and Rhodes, E., (1978), “Measuring the Efficiency of
Decision Making Units”, European Journal of Operational Research, 2: 429-444.
Coelli, Tim (1995) “Estimators and hypothesis tests for a stochastic frontier: a
Monte Carlo Analysis” Journal of Productivity Analysis, 6, 247-68.
Cubbin, J.S. and Tzanidakis, G. (1998), “Regression versus data envelopment analysis for efficiency measurement: an application to the England and Wales regulated water industry.” Utilities Policy, 7: 75-85.
Cubbin, J.S. and Zamani, H. (1996) “A comparison of performance indicators for
training and enterprise councils”, Annals of Public and Co-operative Economics,
September.
Farrell, M.J. (1957), “The Measurement of Productive Efficiency”, Journal of the
Royal Statistical Society, Series A, 120: 253-281.
Ferrier, Gary D. and Lovell, C.A. Knox (1996), “Measuring cost efficiency in
Banking” Journal of Econometrics, 46, 229-245.
Ganley, J.A. and Cubbin J.S., (1992), Public Sector Efficiency Measurement:
Applications of Data Envelopment Analysis, Elsevier.
Grosskopf, S. (1996) “Statistical Inference and Nonparametric Efficiency: A Selective
Survey” Journal of Productivity Analysis, 7, 161-76
Kittelsen, Sverre A.C. (1993), “Stepwise DEA: choosing variables for measuring technical efficiency in Norwegian electricity distribution.” University of Oslo, Department of Economics, mimeo.
Kittelsen, Sverre A.C. (1995), “Monte Carlo simulations of DEA Efficiency
Measures and Hypothesis Tests” University of Oslo doctoral thesis, presented at
Georgia productivity workshop, 1994.
Lovell C.A.K. and Schmidt S. (1993), The Measurement of Productive Efficiency:
Techniques and Applications, Oxford.
Pedraja-Chaparro, Francisco, Salinas-Jimenez, Javier and Smith, Peter (1997), “On the role of weight restrictions in Data Envelopment Analysis”, Journal of Productivity Analysis, 8: 215-230.
Pollitt, Michael G. (1995), Ownership and Performance in Electric Utilities, Oxford
University Press.
Schmidt, Peter (1986), “Frontier Production Functions”, Econometric Reviews,
4(2): 289-328.
Stevenson, R. (1980), “Likelihood functions for generalised stochastic frontier estimation”, Journal of Econometrics, 13: 58-66.
Thanassoulis, E. (1993), “A Comparison of Regression Analysis and Data Envelopment Analysis as Alternative Methods for Performance Assessments”, Journal of the Operational Research Society, 44: 1129-1144.