How many observations? The accuracy of Data Envelopment Analysis

A comparison of the relative accuracy of regression and programming-based approaches to measuring efficiency

Preliminary: please do not quote without permission

Abstract

A mathematical programming-based approach (DEA) to the estimation of productive efficiency is widely used, and yet there are few analyses of its accuracy. This paper reports on a Monte Carlo simulation of the process of efficiency estimation. It shows that the number of observations required for a reasonably accurate picture depends on the underlying data generating process, but that in many circumstances classical regression may give a more accurate estimate than DEA.

JEL categories: 220

John Cubbin
Address for correspondence: Economics Department, City University, Northampton Square, London EC1V 0HB
e-mail j.s.cubbin@city.ac.uk

Louis Potonias (University of Warwick)

The authors are grateful for comments from Kaddour Hadri and from seminar participants at City University and the Centre for Business Research, Cambridge University. Remaining errors are those of the authors.

Introduction

Data Envelopment Analysis (DEA) has been widely applied to the estimation of productive efficiency, yet there are few examinations of its accuracy. The technique first emerged in the management science and operations research literature but has recently been spreading to economics journals. One "advantage" of DEA for authors is that it is not necessary to report t-statistics, since the software typically used does not produce statistical tests. For reviewers of papers, however, this creates a difficulty because typically no tests of specification are presented (for a discussion of statistical tests in connection with DEA see Grosskopf, 1996).

In this paper we report on an examination of this issue in the context of a simple application: the estimation of cost efficiency. Section 2 briefly describes DEA and section 3 looks at some previous work. Section 4 describes our data generation process and section 5 describes the performance statistics used. Section 6 describes the experimental design and section 7 reports on the performance of DEA under different conditions, using regression analysis (RA) as a benchmark. Section 8 summarises our conclusions and discusses the implications for further work.

2. DEA and RA compared

Figure 1 shows a comparison between the econometric and DEA approaches. Inefficiency is measured as the proportional distance from the frontier. The benchmark regression frontier is based on a corrected ordinary least squares (COLS) approach: first estimate the average relationship between costs and output (in the case illustrated, using OLS with a linear functional form), then shift the line down so that it goes through the observation with the largest negative residual. By contrast, the Data Envelopment approach seeks to draw a convex hull around the observations. This is seen most clearly in the variable returns to scale (VRS) frontier indicated. With just one input (cost) and one output, the constant returns to scale (CRS) version of DEA is deceptively simple: a ray from the origin to the lowest point. As it is drawn, both DEA and COLS come to the same conclusion about the most efficient observation, although they differ in their estimates and rankings.
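The two scoring rules just described can be sketched compactly for the one-input, one-output case of Figure 1. The fragment below is purely illustrative (it is not the authors' FORTRAN implementation): COLS fits a linear cost function by OLS, shifts it through the largest negative residual, and scores each observation as frontier cost divided by observed cost, while the CRS DEA score compares each observation's cost per unit of output with the best in the sample. The data at the end are hypothetical.

    import numpy as np

    def cols_efficiency(cost, output):
        """COLS with a linear cost function: fit by OLS, shift the fitted line down
        through the observation with the largest negative residual, then score each
        observation as frontier cost / observed cost (1.0 = on the frontier)."""
        X = np.column_stack([np.ones_like(output), output])
        beta, *_ = np.linalg.lstsq(X, cost, rcond=None)
        residuals = cost - X @ beta
        frontier = X @ beta + residuals.min()   # min residual is the largest negative one
        return frontier / cost

    def dea_crs_efficiency(cost, output):
        """Single-input, single-output CRS DEA: the frontier is the ray from the
        origin through the observation with the lowest cost per unit of output."""
        unit_cost = cost / output
        return unit_cost.min() / unit_cost

    # Example with five hypothetical observations
    cost = np.array([10.0, 14.0, 9.0, 20.0, 16.0])
    output = np.array([1.0, 1.5, 1.1, 2.0, 1.8])
    print(cols_efficiency(cost, output))
    print(dea_crs_efficiency(cost, output))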
Figure 1: Econometric and DEA efficiency scores. Cost is plotted against an explanatory factor, showing the regression line, the COLS frontier, and the DEA frontiers (CRS and VRS).

The geometry of the two frontiers is quite different. In the case of the linear form, N extreme points are all that are necessary to generate the frontier exactly, where N is the dimensionality of the problem (footnote 3). However, once we allow for curvature, we may need many more points. Suppose five efficient points are needed to define an isoquant or output frontier reasonably accurately in the variable-returns case. Adding another variable means that we are trying to define a surface, which will require 5^2 = 25 efficient points for the same degree of accuracy. Extending the surface into another dimension will multiply the number of required points again (5^3 = 125). In general we might suppose that 5^N efficient points are needed, where N is the dimensionality of the problem (footnote 4). If only a minority of observations are on the frontier, we need a multiple of this number of observations.

Footnote 3: N is the number of variables for the variable returns to scale measure, one less for the constant returns measure.
Footnote 4: For variable returns to scale, dimensionality is M-1, where M is the total number of inputs, outputs, and non-controllable factors. Constant returns to scale preserves a degree of freedom, so the dimension is M-2.

The number of points required can be reduced if the variation in the factor or output ratios is limited, since only a fraction of the overall hull then needs defining. This suggests that DEA might perform relatively well in a case where RA has a problem: the presence of multicollinearity, which by its nature tends to mean that explanatory factors show little relative variation.

The statistical properties of regression analysis under ideal conditions are well known. The ideal conditions will not be met in practice. For instance, inefficiencies cannot be normally distributed because they are bounded at zero. This affects not only the efficiency and unbiasedness properties of the estimates themselves, but also the tests on which statistical inference is based. As a practical matter it is important to know how inaccurate the results are likely to be. Simulation of the data generation and estimation process under different assumptions is one way of obtaining this information.
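To make the DEA scores evaluated below concrete: with several outputs, the VRS efficiency of each observation is obtained from a small linear programme, namely the largest proportional reduction in its cost such that some convex combination of the observed units still produces at least its outputs at no more than the scaled cost. The sketch below is an illustrative formulation using SciPy's linprog; it is not the software used for the experiments reported later, and the small data set at the end is hypothetical.

    import numpy as np
    from scipy.optimize import linprog

    def dea_vrs_input_oriented(costs, outputs, j0):
        """Input-oriented VRS DEA efficiency of observation j0.

        costs   : (n,) array, one input (total cost) per observation
        outputs : (n, m) array, m outputs per observation
        Returns theta in (0, 1]; 1.0 means observation j0 is on the DEA frontier.
        """
        n, m = outputs.shape
        # Decision variables: [theta, lambda_1, ..., lambda_n]
        c = np.zeros(n + 1)
        c[0] = 1.0                                  # minimise theta
        A_ub = np.zeros((1 + m, n + 1))
        b_ub = np.zeros(1 + m)
        A_ub[0, 0] = -costs[j0]                     # sum(lambda*cost) <= theta*cost_j0
        A_ub[0, 1:] = costs
        A_ub[1:, 1:] = -outputs.T                   # sum(lambda*output_k) >= output_j0,k
        b_ub[1:] = -outputs[j0]
        A_eq = np.zeros((1, n + 1))
        A_eq[0, 1:] = 1.0                           # convexity constraint (VRS): sum(lambda) = 1
        b_eq = np.array([1.0])
        bounds = [(None, None)] + [(0, None)] * n   # theta free, lambdas non-negative
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
        return float(res.fun)

    # Efficiency scores for a small hypothetical data set (one input, two outputs)
    costs = np.array([10.0, 14.0, 9.0, 20.0])
    outputs = np.array([[1.0, 2.0], [1.5, 2.5], [1.1, 1.0], [2.0, 3.0]])
    scores = [dea_vrs_input_oriented(costs, outputs, j) for j in range(len(costs))]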
3. Previous investigations

There have been surprisingly few attempts to compare the properties of DEA and regression approaches in practice. Cubbin and Zamani (1996) compared the use of DEA and RA in the context of measuring the performance of Training and Enterprise Councils. Cubbin and Tzanidakis (1998) compared the methods for estimating the efficiencies of water companies. Both studies concluded that the results could be very different. When real data are used, however, there is typically no knowledge of true performance, and this makes firm conclusions difficult.

A number of other studies using constructed data have reported low correlations between DEA scores and true efficiency measures. For example, Ferrier and Lovell (1996) compared the efficiency rankings (and total residuals) of a stochastic cost frontier and a nonstochastic nonparametric production frontier. In each case the Spearman rank correlation coefficients were less than 0.02, and clearly nonsignificant. This suggests that one or both of the methods was very poor at estimating efficiency for these data.

A set of three papers by Sherman (1984), Bowlin et al. (1985), and Thanassoulis (1993) all used the same hypothetical data set of 15 observations, generated by a linear cost function. Seven of the 15 observations were 100% efficient; the rest had the following true efficiencies: one at 0.97, three at 0.91, and one each at 0.89, 0.87, 0.86, and 0.85. Thus there seemed to be two distinct sets: a group of 100% efficient observations and another with a skewed distribution around a distinct mode. Whilst mentioning some advantages of RA (such as more stable estimates of efficiency), the balance in Thanassoulis' paper appears to favour DEA. In particular, "DEA offers more accurate estimates of relative efficiency because it is a boundary method" (p. 1142).

Having a set of efficient observations provides a clear advantage for a boundary method such as DEA. The number of observations needed can be fewer if the form of the boundary can be kept simple, for example by the use of a linear functional form. Furthermore, the lack of a symmetric, let alone normal, distribution of efficiencies may be thought to have hampered the performance of the regression approach. There is therefore a need to test the relative accuracy of DEA under a range of more general assumptions about the distribution of efficiencies. This allows not only a clearer evaluation, but also gives a guide to the sensitivity of the results to different distributional assumptions.

Kittelsen (1995) addressed the problem of bias in DEA using Monte Carlo analysis and concluded that "bias is important, increasing with dimensionality [i.e. k] and decreasing with sample size, average efficiency, and a high density of observations near the frontier." Pedraja-Chaparro, Salinas-Jimenez and Smith (1997) found, in a Monte Carlo simulation, that the mean bias could be reduced, and correlation with true efficiency scores improved, by imposing restrictions on the weightings of the inputs and outputs so that they were more similar across observations.

Several authors have gone beyond DEA and regression analysis. For example, Pollitt (1995) examined a range of approaches to the estimation of efficiencies of electric utilities, including parametric programming analysis, which uses the whole data set to generate a parametric frontier.

4. Data Generation

To generate test data, we need the following components:
- an underlying cost function;
- distributions for the exogenous variables in the cost function;
- a distribution for the underlying efficiencies.

To this it might seem natural to add a distribution of errors, such as measurement errors in the dependent variable or errors arising from mis-specification such as the omission of variables. However, this would tend to obscure the underlying issue which we are attempting to address. In any case, such an error component is scarcely used in the DEA literature, and is not universally employed in the regression literature either. See Coelli (1995) for a comparison of the stochastic frontier approach with the COLS approach used as a benchmark here.
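A skeleton of such a data generator is sketched below, purely for illustration and not as a reconstruction of the authors' FORTRAN code. It assumes, for concreteness, a Cobb-Douglas frontier with the Appendix parameter values, three cost drivers uniform on [0.5, 1.5], efficiencies uniform between 0.5 and 1.0, and observed cost equal to frontier cost divided by efficiency; the forms and distributions actually used in the experiments are set out in the subsections that follow and in the Appendix.

    import numpy as np

    rng = np.random.default_rng(seed=12345)       # a fixed seed makes runs repeatable

    def frontier_cost(x, y, z, b=0.4, c=0.3, d=0.25):
        """Cobb-Douglas frontier cost (parameter values from the Appendix)."""
        return np.exp(b * np.log(x) + c * np.log(y) + d * np.log(z))

    def generate_sample(n):
        """One simulated data set: cost drivers, observed costs, true efficiencies."""
        # Three independent cost drivers, bounded away from zero (uniform on [0.5, 1.5])
        x = rng.uniform(0.5, 1.5, size=n)
        y = rng.uniform(0.5, 1.5, size=n)
        z = rng.uniform(0.5, 1.5, size=n)
        # True efficiencies, here uniform between 0.5 and 1.0 purely for illustration
        efficiency = rng.uniform(0.5, 1.0, size=n)
        # Observed cost: frontier cost inflated by inefficiency (an assumption made here)
        cost = frontier_cost(x, y, z) / efficiency
        return np.column_stack([x, y, z]), cost, efficiency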
Cost function

One of the strengths claimed for DEA is that, since no particular functional form for the frontier is assumed, it will be adaptable to a range of possibilities. We allow DEA to show its capabilities over a range of true functional forms, relying on the three most commonly estimated: linear, Cobb-Douglas, and translog. However, we do not assume that we either know or can reliably test for the correct functional form: in estimation for the regression benchmark we deliberately restrict ourselves to the linear and Cobb-Douglas forms.

Exogenous variables

Three independently distributed cost drivers were chosen. We know from both Kittelsen and a priori grounds that the accuracy of DEA depends on the dimensionality of the problem in relation to the number of observations, and fixing the number at three, although computationally convenient, clearly represents a limitation of the present analysis, which should be investigated at a later stage. The variables were generated as independent pseudo-random variables. Negative values for the independent variables (i.e. outputs) need to be ruled out, as do values close to zero, so the distributions were truncated below. Since DEA is known to be sensitive to the presence of outliers, we also truncated the distributions at the top end, so that the exogenous variables were distributed in the range 0.5-1.5 with either a uniform or normal distribution.

Distribution of inefficiencies

We considered three types of inefficiency distribution: truncated normal, uniform, and exponential. The last is one of the class of distributions for which Banker (1993) shows that DEA is a maximum likelihood estimator.

The data were generated and estimated using a pair of integrated programs (footnote 5) which generate the necessary random variables, estimate efficiency using either DEA or COLS, and then generate the performance data for the estimation method. The programs allow the user to specify the functional form, the distributions of the exogenous variables and inefficiencies, the number of observations, and the number of runs. The programs use the same code for data generation and produce identical data when provided with the same initial random number seed. This avoids the need to store large numbers of data sets.

Footnote 5: Written in FORTRAN 77 by the author (Cubbin), using routines adapted from Faires and Burden (1998) and Press et al. (1986). The components were tested separately against standard econometric software; any errors are my own.

5. Performance statistics

No single performance measure can capture the accuracy of an estimate. To give a variety of perspectives on the performance of DEA, the following were initially calculated:

1. Mean bias: the average of E - T over the observations, where E is estimated efficiency and T is true efficiency.
2. Mean square error: the average of (E - T)^2 over the N observations.
3. Proportion of false efficiencies, F/N. One of the concerns about DEA is that outliers will be placed on the estimated frontier. A false declaration of efficiency is defined for present purposes as occurring if the true efficiency is less than 95% and the observation is classed as efficient.
4. Correlation between true and estimated efficiency.

The square root of the mean square error is a useful indicator of the overall accuracy of the measurement.
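Given vectors of true and estimated efficiencies, these statistics are straightforward to compute. The sketch below is illustrative only: the 95% threshold for a false efficiency follows the definition above, while treating an estimated score of (effectively) 1.0 as "classed as efficient" and the use of the Pearson correlation are assumptions made here for the sketch.

    import numpy as np

    def performance_stats(estimated, true, efficient_tol=1e-6):
        """Summary statistics comparing estimated with true efficiencies."""
        e, t = np.asarray(estimated), np.asarray(true)
        bias = np.mean(e - t)                            # mean bias
        mse = np.mean((e - t) ** 2)                      # mean square error
        rmse = np.sqrt(mse)                              # typical size of error
        # False efficiency: classed as efficient (score ~ 1) while truly below 95%
        classed_efficient = e >= 1.0 - efficient_tol
        false_rate = np.mean(classed_efficient & (t < 0.95))
        corr = np.corrcoef(e, t)[0, 1]                   # correlation, true vs estimated
        return {"BIAS": bias, "MSE": mse, "RMSE": rmse, "FALSE": false_rate, "CORR": corr}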
6. Experimental design

Monte Carlo simulation can, even with modern computers, be time intensive. On the other hand, unless a sufficiently broad range of assumptions is tested, the conclusions are in danger of being unrepresentative.

How many replications? The more simulations undertaken, the more accurate the results will be. However, unless statistical tables are being compiled, many significant places of accuracy are not required. Furthermore, the variance of the performance measures used will itself be a decreasing function of the number of observations. To get a benchmark we did a number of runs of the DEA base case using 16, 100, and 1000 observations and 1000, 250, and 100 runs respectively. We found that 10,000 replications were sufficient to generate results accurate to three significant places or better.

As a first step we carried out a series of investigations into which of the factors apart from sample size were important in determining DEA's accuracy. The factors considered were as follows:
- the underlying cost function: linear, Cobb-Douglas or translog;
- the distribution of outputs: uniform, (truncated) normal, or exponential;
- the distribution of efficiencies: uniform, normal, or exponential.

The cost function. DEA is supposed to be good at defining frontiers which cannot be described by a simple function; however, it was easiest to construct data in a repeatable way using a parametric function. In order to reflect DEA's supposed greater flexibility, the regression benchmark was given differing degrees of handicap: only linear and linear-in-logs (Cobb-Douglas) approaches were used for estimation, in order to capture the fact that the functional forms used in practice will not usually replicate the data generation process. All the cost functions had three outputs or independent variables. The number was chosen as a compromise, ensuring sufficient richness in the problem whilst not over-burdening either technique with excessive complexity.

The distribution of outputs. An acknowledged weakness of DEA is its susceptibility to outlying observations. We could have approached this in a symmetric way by generating log-normally distributed cost drivers. However, we chose instead to focus on distributions which were bounded from below but not from above, as this probably best reflects firm characteristics. In addition to a "default" uniform distribution (which has an upper bound), we also chose an exponential distribution and a normal distribution bounded at 2 standard deviations below the mean. The exponential distribution is of particular interest because Banker (1993) has shown that it is sufficient to guarantee that DEA has maximum likelihood properties: the modal value for inefficiency is zero, which guarantees that observations will tend to cluster near the frontier.

One potentially important parameter is the degree of variation in the cost drivers or outputs. This was measured in terms of the ratio of the mean value to the minimum. This is equivalent to choosing the standard deviation or variance, but allows for the possibility of introducing distributions whose variance is undefined.

The distribution of efficiencies. Again, a uniform distribution was chosen as the default, with a bounded normal and an exponential distribution as the alternatives. It is common (for example, in estimating stochastic frontiers) to bound the normal distribution at its modal value (i.e. what would otherwise be the mean).
For the purpose of this exercise, it was felt that this would produce a distribution too similar to the exponential. It was also expected that, whilst DEA ought to perform relatively well with the exponential distribution, regression analysis ought to do well with the normal distribution (even a truncated normal). Furthermore, there was interest in testing the view that a normal distribution truncated at 2 standard deviations below the mean would lead to only small biases for OLS.

7. Results

To identify the potentially critical dimensions of the problem we chose to carry out initial calculations for 30 observations. This is the number of observations at which econometricians have traditionally started to feel that worthwhile models could be estimated. The results of this initial phase are set out in Table 1.

Table 1. Performance of DEA and RA in different specifications

                              -------------- DEA --------------   ---------- RA ----------
Row  Funct  Driver  Effic     FALSE%   CORR     BIAS     MSE       CORR     BIAS     MSE
  1  LINR   UNIF    UNIF       14.2    0.879    0.107    0.017     0.925    0.001    0.005
  2  COBD   UNIF    UNIF       12.4    0.727    0.046    0.019     0.776   -0.055    0.017
  3  TLOG   UNIF    UNIF       12.4    0.831    0.083    0.016     0.886   -0.019    0.009
  4  LINR   NORM    UNIF       15.4    0.872    0.108    0.018     0.921    0.007    0.005
  5* COBD   NORM    UNIF       12.9    0.704    0.032    0.019     0.756   -0.083    0.022
  6  TLOG   NORM    UNIF       14.5    0.838    0.092    0.017     0.875   -0.017    0.009
  7  LINR   EXPO    UNIF       14.0    0.847    0.102    0.018     0.898   -0.010    0.006
  8* COBD   EXPO    UNIF       11.7    0.621   -0.004    0.026     0.659   -0.103    0.033
  9  TLOG   EXPO    UNIF       13.4    0.790    0.075    0.018     0.800   -0.057    0.018
 10  LINR   UNIF    NORM       17.2    0.888    0.092    0.014     0.910   -0.015    0.006
 11* COBD   UNIF    NORM       13.4    0.707    0.016    0.017     0.747   -0.086    0.022
 12  TLOG   UNIF    NORM       14.4    0.832    0.063    0.013     0.859   -0.038    0.011
 13  LINR   NORM    NORM       16.8    0.902    0.089    0.013     0.918   -0.010    0.004
 14* COBD   NORM    NORM       13.3    0.719    0.010    0.017     0.761   -0.102    0.026
 15  TLOG   NORM    NORM       15.0    0.859    0.070    0.012     0.881   -0.040    0.010
 16  LINR   EXPO    NORM       16.5    0.870    0.085    0.013     0.882   -0.029    0.008
 17* COBD   EXPO    NORM       12.3    0.591   -0.040    0.028     0.631   -0.138    0.041
 18* TLOG   EXPO    NORM       13.7    0.789    0.043    0.013     0.775   -0.093    0.025
 19* LINR   UNIF    EXPO       14.6    0.943    0.077    0.010     0.868   -0.055    0.016
 20* COBD   UNIF    EXPO       12.5    0.825    0.010    0.015     0.772   -0.116    0.033
 21* TLOG   UNIF    EXPO       13.5    0.908    0.055    0.010     0.832   -0.080    0.024
 22* LINR   NORM    EXPO       15.1    0.938    0.081    0.011     0.863   -0.054    0.015
 23* COBD   NORM    EXPO       11.9    0.804    0.006    0.017     0.750   -0.122    0.036
 24* TLOG   NORM    EXPO       13.8    0.914    0.063    0.011     0.844   -0.075    0.021
 25* LINR   EXPO    EXPO       13.6    0.930    0.070    0.010     0.817   -0.081    0.023
 26* COBD   EXPO    EXPO       10.7    0.725   -0.044    0.027     0.695   -0.173    0.057
 27* TLOG   EXPO    EXPO       11.5    0.873    0.029    0.012     0.748   -0.147    0.046

30 observations, mean/min ratio = 10, 100 runs, variable returns to scale. MSE = mean square error. An asterisk against the row number marks cases where DEA has a lower MSE than regression analysis.

Not surprisingly, the best performance of DEA is when the distribution of inefficiencies is exponential. A normal distribution of errors generally also leads to a better performance than an exponential distribution. The form of the cost function seems to make a significant difference to DEA: the Cobb-Douglas form is the worst performer, especially in combination with exponentially distributed errors. There was little difference in results between the normal and uniform distributions for the cost drivers, but both DEA and RA performed significantly worse with exponentially distributed cost drivers.
Comparison with RA

In order to evaluate DEA it is useful to have a benchmark, and we have chosen regression analysis, since this is the approach most commonly adopted as an alternative. However, given that the data were generated using a parametric approach, it would appear relatively easy to demonstrate the superiority of the regression approach simply by replicating the parametric model. In practice the form of the parametric model (if there were one!) would be unknown. We have simulated this by using an inappropriate functional form, e.g. linear in the case of the Cobb-Douglas data, and linear or log-linear (but omitting the interaction terms) for the translog formulation.

Furthermore, the inefficiency distribution deviates from the regression ideal. At the very least, the errors are truncated. In the case of the normal distribution they are slightly skewed as a result of cutting off the bottom 2% of the distribution; in the case of the exponential distribution they are considerably skewed. Although this should not affect the expected value of the parameters, it would obviously affect the test statistics used.

The performance scores for linear regression are shown in Table 1. The ranking for functional form is the reverse of that for DEA, with a linear form giving the best overall performance, whether measured in terms of the correlation with the true values or the mean squared error. As with DEA, the combination of a Cobb-Douglas function and exponential errors leads to particularly poor scores.

As may be expected, the regression approach performed best when the data generation process most resembled the assumptions of the model: a linear cost function and a symmetric disturbance term (e.g. rows 1 and 4). Creating outliers in the data, whether in the exogenous variables or the disturbance terms, through the exponential distribution results in a reduction in performance. Just behind the uniform distribution comes the truncated normal (rows 2 and 5). Not surprisingly, the normally-distributed inefficiency term produced the lowest performance for the regression approach.

Effect of size range for cost drivers

One of the reasons for the poorer performance of DEA when the cost drivers follow a normal or exponential distribution is that the scope for outlying observations is greater. This increases the probability of an observation being falsely classified as on the frontier, essentially because outliers cause their region of the frontier to be sparsely populated. This effect can also be seen if the range of values for the cost drivers is increased. Table 2 presents some results for 50 observations under two different assumptions about the underlying cost function.
Table 2: Effect of cost driver dispersion

Row  Funct  Driver  Effic  Mean/min  FALSE%   CORR      BIAS      MSE
  1  LINR   NORM    UNIF      1.2      4.1    0.9659    0.0451    0.00395
  2  LINR   NORM    UNIF      2        8.3    0.9226    0.0647    0.00824
  3  LINR   NORM    UNIF      4        9.4    0.9152    0.0742    0.00985
  4  LINR   NORM    UNIF      8       10.2    0.9171    0.0791    0.01049
  5  LINR   NORM    UNIF     16       10.2    0.9182    0.0811    0.01077
  6  LINR   NORM    UNIF     32       10.3    0.9179    0.0817    0.01089
  7  LINR   NORM    UNIF     64       10.3    0.9175    0.0818    0.0109
  8  LINR   NORM    EXPO      1.2      3.7    0.9834    0.0332    0.00246
  9  LINR   NORM    EXPO      2        7.0    0.9616    0.0440    0.00479
 10  LINR   NORM    EXPO      4        8.2    0.9553    0.0529    0.00613
 11  LINR   NORM    EXPO      8        8.9    0.9560    0.0567    0.00652
 12  LINR   NORM    EXPO     16        8.5    0.9558    0.0580    0.0067
 13  LINR   NORM    EXPO     32        8.3    0.9551    0.0583    0.0068
 14  LINR   NORM    EXPO     64        8.5    0.9543    0.0583    0.00687
 15  COBD   NORM    UNIF      1.2      4.4    0.9713    0.0496    0.00405
 16  COBD   NORM    UNIF      2        7.5    0.9143    0.0573    0.00783
 17  COBD   NORM    UNIF      4        7.2    0.8243    0.0238    0.01053
 18  COBD   NORM    UNIF      8        7.3    0.7245   -0.0182    0.01726
 19  COBD   NORM    UNIF     16        8.0    0.6480   -0.0541    0.02658
 20  COBD   NORM    UNIF     32        7.5    0.5949   -0.0840    0.03625
 21  COBD   NORM    UNIF     64        6.9    0.5593   -0.1067    0.04473

There is some deterioration for the linear functional form, but this is more drastic for the Cobb-Douglas form. In row 21, a mean squared error of 0.045 means that a typical error in the efficiency rating is about 21%.

Results: sample size

Table 3 and Figure 2 show the effect of sample size for four different specifications.

Table 3: Effect of sample size

Row    Function  Driver  Efficiency  No. obs.  FALSE%   CORR      BIAS
 1(a)  LINR      NORM    NORM             8     69.71   0.5124    0.2171
 2(b)  TLOG      NORM    NORM             8     61.16   0.5641    0.1965
 3(c)  LINR      NORM    EXPO             8     67.91   0.5656    0.2231
 4(d)  TLOG      NORM    EXPO             8     58.18   0.6088    0.2019
 5(a)  LINR      NORM    NORM             8     69.77   0.5125    0.2173
 6(b)  TLOG      NORM    NORM             8     56.72   0.5605    0.1820
 7(c)  LINR      NORM    EXPO             8     65.39   0.5838    0.2243
 8(d)  TLOG      NORM    EXPO             8     53.93   0.6037    0.1880
 9(a)  LINR      NORM    NORM            15     54.34   0.6206    0.1831
10(b)  TLOG      NORM    NORM            15     40.49   0.6473    0.1416
11(c)  LINR      NORM    EXPO            15     50.46   0.6772    0.1848
12(d)  TLOG      NORM    EXPO            15     38.38   0.6847    0.1437
13(a)  LINR      NORM    NORM            30     38.48   0.7130    0.1436
14(b)  TLOG      NORM    NORM            30     26.75   0.7231    0.1013
15(c)  LINR      NORM    EXPO            30     35.06   0.7637    0.1387
16(d)  TLOG      NORM    EXPO            30     24.35   0.7639    0.0965
17(a)  LINR      NORM    NORM           100     17.59   0.8298    0.0825
18(b)  TLOG      NORM    NORM           100     10.46   0.8208    0.0423
19(c)  LINR      NORM    EXPO           100     15.67   0.8721    0.0760
20(d)  TLOG      NORM    EXPO           100      9.8    0.8618    0.0373
21(a)  LINR      NORM    NORM           500      4.97   0.9319    0.0335
22(b)  TLOG      NORM    NORM           500      2.64   0.8893   -0.0162
23(c)  LINR      NORM    EXPO           500      4.43   0.9508    0.0289
24(d)  TLOG      NORM    EXPO           500      2.65   0.9174   -0.0239
25(a)  LINR      NORM    NORM          1000      2.8    0.9590    0.0217
26(b)  TLOG      NORM    NORM          1000      1.48   0.8974   -0.0415
27(c)  LINR      NORM    EXPO          1000      2.56   0.9717    0.0179
28(d)  TLOG      NORM    EXPO          1000      0.23   0.9560   -0.1158

Note: a, b, c, and d are the keys to the specifications shown in Figure 2.

Figure 2: Increasing sample size. FALSE, CORR and BIAS for specifications (a) and (b), plotted against the number of observations (8, 8, 15, 30, 100, 500 and 1000).

As expected, the performance of DEA improves with sample size. Once the sample size exceeds 500, a correlation of 0.9 or more with the true value is typically found.
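Results such as those in Table 3 come from repeating a generate, estimate and score cycle many times at each sample size and averaging the performance statistics over the replications. The loop below is a schematic reconstruction of that cycle, reusing the illustrative helpers sketched earlier (generate_sample, dea_vrs_input_oriented and performance_stats); the sample sizes mirror Table 3 but the replication count is a placeholder rather than the one actually used.

    import numpy as np

    def run_experiment(sample_sizes=(8, 15, 30, 100, 500, 1000), replications=100):
        """Average DEA performance statistics over replications at each sample size."""
        results = {}
        for n in sample_sizes:
            stats = []
            for _ in range(replications):
                outputs, cost, true_eff = generate_sample(n)
                dea_scores = np.array([dea_vrs_input_oriented(cost, outputs, j)
                                       for j in range(n)])
                stats.append(performance_stats(dea_scores, true_eff))
            # Average each statistic over the replications
            results[n] = {k: float(np.mean([s[k] for s in stats])) for k in stats[0]}
        return results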
Figures 3 and 4 show the results of regression analysis for two extreme models: a linear model with normal efficiencies and a Cobb-Douglas model with exponential efficiencies.

Figure 3: Linear regression, linear cost function with normal cost drivers and normal efficiencies. CORR, BIAS and mean square error plotted against the number of observations (8 to 1000).

Figure 4: Linear regression, Cobb-Douglas cost function with normal cost drivers and exponential efficiencies. CORR, BIAS and mean square error plotted against the number of observations (8 to 1000).

8. Conclusions

DEA appears to cope best under the following conditions:
- a data generating function which is reasonably close to linear;
- a distribution of efficiencies with sufficient observations close to the frontier for the complexity of the functional form;
- if the underlying cost function is not linear, a small variation in the cost drivers.

If these conditions are not fulfilled, reliable results may require several hundred observations.

Appendix

The functional forms. All have three outputs or "cost drivers" (X, Y, and Z) using common parameters. The translog form has three extra parameters.

1. Linear: C = a + bX + cY + dZ
2. Cobb-Douglas: C = exp(b ln X + c ln Y + d ln Z)
3. Translog (slightly simplified to economise on the number of parameters):
   C = exp(b ln X + c ln Y + d ln Z + e[(ln X)^2 + (ln Y)^2 + (ln Z)^2] + f ln X ln Y + f ln Y ln Z + g ln X ln Z)

The parameters are as follows: a = 0.05, b = 0.4, c = 0.3, d = 0.25, e = 0.1, f = -0.1, g = -0.1.

With these parameters the Cobb-Douglas form is homogeneous of degree (b+c+d) = 0.95 (slight economies of scale). The translog form is homogeneous of degree (b+c+d+6e+4f+2g) = 0.75 (significant economies of scale). The negative cross-effects mean that outputs are more complementary than in the Cobb-Douglas case; conversely, there is some possibility of congestion where the outputs are unbalanced. This implies economies of scope.

References

Aigner, D., Lovell, C.A.K. and Schmidt, P. (1977), "Formulation and estimation of stochastic frontier production functions", Journal of Econometrics, 5, 21-38.

Banker, R.D. (1993), "Maximum likelihood, consistency, and Data Envelopment Analysis: a statistical foundation", Management Science, 39(10), 1265-1273.

Banker, R.D., Charnes, A. and Cooper, W.W. (1984), "Some models for estimating technical and scale inefficiencies in Data Envelopment Analysis", Management Science, 30(9), 1078-1092.

Bogetoft, Peter (1994), "Incentive efficient production frontiers: an agency perspective on DEA", Management Science, 40(8), August, 959-968.

Bowlin, W.F., Charnes, A., Cooper, W.W. and Sherman, H.D. (1985), "Data envelopment analysis and regression approaches to efficiency estimation and evaluation", Annals of Operations Research, 2, 113-138.

Charnes, A., Cooper, W.W. and Rhodes, E. (1978), "Measuring the efficiency of decision making units", European Journal of Operational Research, 2, 429-444.

Coelli, Tim (1995), "Estimators and hypothesis tests for a stochastic frontier: a Monte Carlo analysis", Journal of Productivity Analysis, 6, 247-268.

Cubbin, J.S. and Tzanidakis, G. (1998), "Regression versus data envelopment analysis for efficiency measurement: an application to the England and Wales regulated water industry", Utilities Policy, 7, 75-85.

Cubbin, J.S. and Zamani, H. (1996), "A comparison of performance indicators for training and enterprise councils", Annals of Public and Co-operative Economics, September.

Farrell, M.J. (1957), "The measurement of productive efficiency", Journal of the Royal Statistical Society, Series A, 120, 253-281.

Ferrier, Gary D. and Lovell, C.A. Knox (1996), "Measuring cost efficiency in banking", Journal of Econometrics, 46, 229-245.
Ganley, J.A. and Cubbin, J.S. (1992), Public Sector Efficiency Measurement: Applications of Data Envelopment Analysis, Elsevier.

Grosskopf, S. (1996), "Statistical inference and nonparametric efficiency: a selective survey", Journal of Productivity Analysis, 7, 161-176.

Kittelsen, Sverre A.C. (1993), "Stepwise DEA: choosing variables for measuring technical efficiency in Norwegian electricity distribution", University of Oslo, Department of Economics, mimeo.

Kittelsen, Sverre A.C. (1995), "Monte Carlo simulations of DEA efficiency measures and hypothesis tests", University of Oslo doctoral thesis, presented at the Georgia Productivity Workshop, 1994.

Lovell, C.A.K. and Schmidt, S. (1993), The Measurement of Productive Efficiency: Techniques and Applications, Oxford.

Pedraja-Chaparro, Francisco, Salinas-Jimenez, Javier and Smith, Peter (1997), "On the role of weight restrictions in Data Envelopment Analysis", Journal of Productivity Analysis, 8, 215-230.

Pollitt, Michael G. (1995), Ownership and Performance in Electric Utilities, Oxford University Press.

Schmidt, Peter (1986), "Frontier production functions", Econometric Reviews, 4(2), 289-328.

Stevenson, R. (1980), "Likelihood functions for generalised stochastic frontier estimation", Journal of Econometrics, 13, 58-66.

Thanassoulis, E. (1993), "A comparison of regression analysis and data envelopment analysis as alternative methods for performance assessments", Journal of the Operational Research Society, 44, 1129-1144.