Estimating the Empirical Lorenz Curve in the Presence of Error

Chaya S. Moskowitz* and Colin B. Begg
Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, 307 East 63rd Street, 3rd Floor, New York, NY 10021, U.S.A.
*e-mail: moskowc1@mskcc.org

Abstract

It is frequently the case that the measure of interest used to rank experimental units when estimating the empirical Lorenz curve is subject to random error. This error can result in an incorrect ranking of experimental units, which inevitably leads to a curve that exaggerates the degree of variation in the population. We explore this bias and discuss several widely available statistical methods that have the potential to reduce or remove it. The properties of these methods are examined and compared in a simulation study in order to determine which analytic strategies have sufficiently good performance characteristics to be recommended for widespread use.

1 Introduction

Both the Lorenz curve and the Gini coefficient were originally proposed as methods for studying the concentration of income in a population. While both tools have been used primarily in the economic and social sciences over the last century, in recent years they have also seen application in other areas such as medical and health services research. As use of the Lorenz curve and Gini coefficient has expanded into these other fields, the tools have been applied to data that are increasingly complex. With this added level of complexity comes the possibility of error or bias in the estimation of the Lorenz curve and Gini coefficient. This potential bias is frequently ignored in the applied literature. As described in further detail below, estimation of both the Lorenz curve and the Gini coefficient involves ranking the units of observation on the basis of some quantity of interest and then estimating cumulative proportions.
When there is error or variation in the measurement of the quantity of interest, an analysis that does not account for it may incorrectly rank the units and produce biased estimates of the Lorenz curve and Gini coefficient. This situation can arise in many different circumstances. For instance, the Lorenz curve has traditionally been used to study the distribution of income in a population. In this case, depending upon the study design, there may be variation in the measurement of income from a number of possible sources, including error in the reported income (i.e., measurement error) and variation attached to estimating income if the income must be estimated for each member of the population. To our knowledge, this bias has been explicitly recognized by only a limited number of investigators. Lee (1999) compared the potential for bias in this setting with the inherent bias in estimating predictive accuracy for a model using the same data that were used to construct the model, an issue that has been studied widely by statistical theorists. Suggesting bootstrapping to correct the bias, Lee proposed using the bootstrap samples to reorder the experimental units while using the original data to estimate the Gini coefficient. Pham-Gia and Turkkan (1997) proposed a parametric approach for using a Lorenz curve to study inequality in income distributions when the incomes are measured with error, and in principle this has the potential to resolve the bias issue. In their formulation, they reason that observed income = true income + error, and they model the observed income with a parametric distribution and the error with a separate parametric distribution. Their paper contains theoretical results for obtaining the distribution of the true income and the corresponding Lorenz curve. In this manuscript we are particularly interested in the bias that may occur in a specific type of study design that arises frequently in practice.
This design is in some sense a two-stage or nested design in which the members of the population, or experimental units, are the primary units of analysis, but within each experimental unit there are multiple observations that are aggregated to form the outcome whose distribution is of interest. For example, using Lorenz curves and Gini coefficients, Praksam and Murthy (1992) look at couples within states in India to explore the acceptance of family planning methods at different levels of literacy. Tickle (2001) divides the North West Region of England into market areas and studies children in these areas who were included in the 1995/6 NHS epidemiology survey to examine the distribution of dental caries using Lorenz curves and Gini coefficients. Elliot et al. (2002) study geographical variations in sexually transmitted diseases across regions of Manitoba, Canada by looking at individuals within each region who have a reported infection. We are interested in "nonparametric" methods for estimating a Lorenz curve and Gini coefficient in this type of data configuration, where the methods are nonparametric in the sense that the ranking of the experimental units is unconstrained. A parametric approach such as that suggested by Pham-Gia and Turkkan may be too restrictive in practice as well as difficult to apply to complex data. While the bootstrap approach is a nonparametric method, we are skeptical that it will adequately correct for the bias in the types of data often encountered in practice. We will, however, discuss and evaluate this approach in more detail below. In addition to the bootstrap, we consider two other approaches, based on random effects models, for estimating the Lorenz curve and Gini coefficient. In the next section, we introduce a health services application that was our motivation for undertaking this work. In addition to motivating the problem, we will use this dataset to illustrate the different methods discussed below.
In Section 3, we define the Lorenz curve and Gini coefficient and discuss the potential bias in more detail. In Section 4, we present several different analytic strategies that can be used to estimate the Lorenz curve and Gini coefficient. Section 5 contains the results of a simulation study evaluating these methods. In Section 6 we return to the health services application. We make our concluding remarks in Section 7.

2 Motivating Application

In a recently published paper, Bach et al. (2004) explore whether the quality of health care received by patients in the United States is associated with race. They study a sample of Medicare patients together with the patients' primary care physicians in order to assess whether black patients receive a lower quality of care than do white patients. Their analysis involved several different components, including a Lorenz curve type analysis examining whether visits made by black patients were concentrated among a select group of physicians. Specifically, they examine a sample of 5% of black and white Medicare beneficiaries who were treated during 2001 by 4,355 primary care physicians who participated in the 2000-2001 Community Tracking Study survey. The distribution of the black patients across these physicians was characterized by a Lorenz curve in which the cumulative proportion of black patient visits was plotted against the cumulative proportion of physicians in the population. They constructed the curve by ranking the physicians on the basis of the proportion of blacks in each physician's profile.
To understand the potential problem, notice that an empirical ranking of these observed proportions tends to exaggerate the level of concentration or mal-distribution in the curve: random variation in the "observed" profile of each physician means that the relative rankings of some physicians will be incorrect, yet the empirical ranking chosen for the curve necessarily maximizes the empirical mal-distribution. In this context, an "incorrect" ranking would occur under the following circumstances. Say physician A typically treats patients of whom 10% are black, while physician B typically has 7% black patients in his or her practice. However, the actual proportions of their patients who made office visits during 2001, submitted a suitable claim to Medicare, and were included in the 5% Medicare sample were 9% and 11%, respectively. That is, physician A had slightly fewer black patients than usual and physician B had slightly more, and as a result physician A is incorrectly ranked lower. In these circumstances random error in the outcome (proportion of black patients) for each experimental unit (physician) necessarily leads to an incorrect ordering of the units, which alters the curve in the direction of an exaggerated estimate of the degree of concentration. Our goal in this manuscript is to carefully study the properties of several analytic methods that could be used to correct this bias.

3 Lorenz Curve and Gini Coefficient

To study the issue further we first establish some notation and define the empirical Lorenz curve and Gini coefficient. We suppose that N observations are dispersed among n experimental units and denote the number of observations for the kth experimental unit by m_k, k = 1, ..., n. Interest lies in studying the concentration or distribution of a feature of the N observations, say Y, across the n members of the population.
Let X_k denote the value of some summary of the Y_jk, j = 1, ..., m_k, for the kth experimental unit. In terms of our example, N is the number of patients in the sample seen by the n primary care physicians, Y is a binary indicator of whether or not a patient is black, and $X_k = \sum_{j=1}^{m_k} Y_{jk}$ is the number of black patients in the profile of the kth physician. We obtain the empirical Lorenz curve by first ranking the n units using the X_k. We write these ordered values as X_(1), ..., X_(i), ..., X_(n), using i to index the experimental units after they have been ranked, and let

$L_i = \sum_{l=1}^{i} X_{(l)} \Big/ \sum_{l=1}^{n} X_{(l)}.$

The empirical Lorenz curve is then a plot of the set of points {(i/n, L_i); i = 1, ..., n}. The horizontal axis represents the cumulative proportion of the population while the vertical axis represents the cumulative proportion of X. In the context of our motivating example where the m_k vary, a more appropriate choice might be to plot the cumulative proportion of the {m_k}, that is, the cumulative proportion of patients seen by the physicians, along the x-axis. This approach is somewhat different from the traditional empirically estimated Lorenz curve. For simplicity in this paper we plot {i/n} along the x-axis, but we note that in some cases obtaining a meaningful measure of inequality may require careful consideration of the quantity plotted. The Gini coefficient is estimated from the same quantities as

$GC = 1 - \sum_{i=0}^{n-1} \left( \frac{i+1}{n} - \frac{i}{n} \right) (L_{i+1} + L_i),$

where L_0 = 0. It is a measure of the distance of the Lorenz curve from the 45° line and often accompanies the Lorenz curve when analysis results are presented. In the situation where there is perfect equality in the distribution, so that Y is evenly dispersed among the experimental units and X_1 = X_k ∀k, the Lorenz curve will fall on the 45° line connecting the origin (0,0) of the unit square to the top right corner (1,1), and GC = 0.
At the other end of the spectrum, when all of the outcome is contained in a single member of the population (maximum inequality in the distribution), the Lorenz curve will lie along the bottom and right sides of the unit square, forming a right triangle with the 45° line, and GC = 1. In practice, the Lorenz curve lies somewhere between these two extremes. The bias about which we are concerned arises because the quantities used to calculate the curve must be estimated from the data. Specifically, in many applications we observe only an estimate, X̂_k, of the true unobserved X_k, where each of these estimates is based on a limited sample and/or limited time period of study, characterized by m_k. Thus, while the empirical Lorenz curve is estimated as described above using the {X̂_k}, the true curve involves a ranking of the data according to the X_k. The problem is that when X̂_k ≠ X_k, not only will the estimated relative frequencies {X_k} be subject to random error, but their ordering in the construction of {L_i} will also be redistributed erroneously in some cases. That is, the ranking of the data, now based on the X̂_k, may be incorrect. Members of the population with values of X̂_k that are larger than their corresponding true X_k will be ranked higher than they should be, while members of the population with values of X̂_k that are smaller than their corresponding true X_k will be ranked lower than they should be. Consequently the empirically estimated Lorenz curve will fall below the true Lorenz curve and the Gini coefficient will be overestimated, causing the distribution of X to appear more concentrated than it truly is. In more precise terms, if the {L̂_i} are calculated using an ordering based on the {X̂_k}, then L̂_i ≤ L_i ∀i. Because of this systematic bias, we are interested in methods that will produce bias-corrected Lorenz curves.
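The empirical Lorenz curve and Gini coefficient defined in this section can be computed directly from the {X_k}. The following sketch is our illustration of the construction (it is not code from the original analysis); it implements the ranking, the cumulative shares L_i with L_0 = 0, and the trapezoidal Gini formula above.

```python
import numpy as np

def lorenz_points(x):
    """Return the points {(i/n, L_i)} of the empirical Lorenz curve.

    x : array of the X_k, one value per experimental unit.
    Units are ranked in increasing order; L_i is the cumulative share
    of the total held by the i lowest-ranked units, with L_0 = 0.
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    L = np.concatenate([[0.0], np.cumsum(x) / x.sum()])  # L_0, L_1, ..., L_n
    frac = np.arange(n + 1) / n                          # 0, 1/n, ..., 1
    return frac, L

def gini(x):
    """GC = 1 - sum_{i=0}^{n-1} ((i+1)/n - i/n) * (L_{i+1} + L_i)."""
    frac, L = lorenz_points(x)
    return 1.0 - np.sum(np.diff(frac) * (L[1:] + L[:-1]))

# Perfect equality: every unit holds the same amount, so GC = 0.
print(gini([5, 5, 5, 5]))   # 0.0
# All of the outcome in one unit: GC = 1 - 1/n, approaching 1 as n grows.
print(gini([0, 0, 0, 20]))  # 0.75
```

Note that with a finite number of units, maximal inequality gives GC = 1 − 1/n rather than exactly 1; the value 1 is reached only in the limit of large n.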
We next propose several methods that could be used for this purpose and describe them in detail. In these approaches we focus on adjusting the estimates of the proportions p_k = X_k/m_k, the proportion of observations with Y = 1 that belong to the kth experimental unit, in recognition of the fact that the distribution of the empirical proportions will always have greater variance than the distribution of the true proportions from which they are generated. The estimates X̂_k = p̂_k m_k, k = 1, ..., n, are then used to construct the Lorenz curve and estimate the Gini coefficient.

4 Analytic strategies for unbiased estimation

4.1 Random effects with logistic regression

The first method we consider is a random effects model. Using programs that are available in standard statistical software packages, we can estimate a random effect for each experimental unit and use this random effect to estimate the p_k. In this approach we consider the model

logit(p_k) = θ + γ_k, with p_k = E[Y_jk | γ_k], j = 1, ..., m_k; k = 1, ..., n.

Here the Y_jk are assumed to be independent Bernoulli variables conditional on the experimental unit, denoted by the subscript k; θ is a fixed effect common to all experimental units, and the {γ_k} are the random effects. In order for the random effects to be identifiable, we need to postulate a model for their distribution. We follow convention and assume that the γ_k, k = 1, ..., n, are independent observations from a Normal(0, σ²) distribution. A random effects model of this nature induces "shrinkage" of the empirical estimates toward the overall mean. The effect is to reduce the spread of the proportions between physicians and consequently reduce the Gini coefficient and raise the Lorenz curve toward the 45° line. The degree of shrinkage should be inversely related to the sample size of the individual experimental unit. The parameters in this model can be estimated using previously developed methods for nonlinear mixed models.
A number of such methods have been proposed and are available in statistical software packages. (Pinheiro and Bates (2000) and Davidian and Giltinan (1995) are two references that contain overviews of these methods.) In this manuscript we use a numerical quadrature approach to integrate the likelihood of the data over the random effects. This approach is easily implemented for reasonably sized datasets in the software SAS through the PROC NLMIXED program.

4.2 Normal random effects analysis

Although the preceding approach is conceptually appealing, it becomes computationally burdensome in large datasets. Indeed, it is not feasible to complete this analysis using the entire 80,000 physicians in the Community Tracking Study. Historically, it has often been shown that analyses of binary data in large samples can be accomplished by simply using the more available and computationally less burdensome normal mixed models, in effect treating each binary outcome as if it were a normal variate. We include this model for its accessibility and computational speed, and compare its results with those of the preceding logistic regression model in our simulations. In this model

y_jk = θ + γ_k + ε_jk,

where γ_k represents the independent normal random effect for the kth physician and the ε_jk are the error terms for the jth patient of the kth physician. Since the variance of this term for the patients of the kth physician really reflects the variance of the binomial variate X_k, we must perform a weighted regression, with weights 1/m_k, to more appropriately characterize the contributions of each patient. Multiple standard statistical software packages provide programs that can be used to estimate the parameters in this model. These packages use restricted maximum likelihood to estimate the variance parameters and solve the mixed model equations to obtain estimates of the fixed and random effects.
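To make the quadrature idea behind the logistic random effects model of Section 4.1 concrete: the marginal likelihood of each cluster total, X_k ~ Binomial(m_k, expit(θ + γ_k)) with γ_k ~ Normal(0, σ²), is integrated over γ_k by Gauss-Hermite quadrature, and the shrunken proportion for unit k is the posterior mean of p_k under the fitted model. The sketch below is our own Python illustration of this scheme (the analysis in the paper used SAS PROC NLMIXED; all function names here are ours, and the binomial coefficient, a constant in the likelihood, is omitted).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def _cluster_liks(theta, sigma, x, m, nodes, weights):
    # Marginal likelihood of each cluster, integrating the random intercept
    # out by Gauss-Hermite quadrature:
    #   int Binom(x_k; m_k, expit(theta + g)) N(g; 0, sigma^2) dg
    #   ~ (1/sqrt(pi)) sum_q w_q Binom(x_k; m_k, expit(theta + sqrt(2)*sigma*t_q))
    g = np.sqrt(2.0) * sigma * nodes                 # abscissae on the gamma scale
    p = expit(theta + g)                             # (Q,) probabilities per node
    ll = x[:, None] * np.log(p) + (m - x)[:, None] * np.log1p(-p)  # (n, Q)
    return np.exp(ll), weights / np.sqrt(np.pi), p

def fit_logistic_re(x, m, n_quad=20):
    """Maximize the marginal likelihood over (theta, log sigma)."""
    x, m = np.asarray(x, float), np.asarray(m, float)
    nodes, weights = np.polynomial.hermite.hermgauss(n_quad)

    def nll(params):
        lik, w, _ = _cluster_liks(params[0], np.exp(params[1]), x, m, nodes, weights)
        return -np.sum(np.log(lik @ w))

    res = minimize(nll, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
    return res.x[0], np.exp(res.x[1])                # (theta_hat, sigma_hat)

def shrunken_p(theta, sigma, x, m, n_quad=20):
    """Posterior mean E[p_k | X_k] -- the 'shrunken' proportion for each unit."""
    x, m = np.asarray(x, float), np.asarray(m, float)
    nodes, weights = np.polynomial.hermite.hermgauss(n_quad)
    lik, w, p = _cluster_liks(theta, sigma, x, m, nodes, weights)
    return (lik * p) @ w / (lik @ w)
```

The shrunken totals X̂_k = m_k × shrunken_p(...) can then be fed into the Lorenz curve construction; because the posterior means are less dispersed than the raw proportions, the resulting curve moves toward the 45° line, as described above.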
4.3 Bootstrap

The bootstrap (Efron and Tibshirani, 1993) is a nonparametric method that is frequently employed by statisticians. Very generally, data elements are randomly sampled with replacement from the entire dataset to create a "bootstrap" sample, and the statistic of interest is calculated using this bootstrap sample. This process is repeated numerous times in order to determine the distribution of the statistic and associated information (such as standard errors, confidence intervals, and bias, for example). In the context of the Lorenz curve, Lee (1999) suggests using the bootstrap. Considering a data configuration that is simpler than the one in which we are interested, where there is only one observation per experimental unit (not multiple observations that need to be aggregated within unit) and the observation is continuous, Lee proposes drawing bootstrap samples of experimental units, reordering the data based on the bootstrap samples, and then using the original data to calculate summary measures of the Lorenz curve. We evaluate a slight modification of this method, adapting Lee's approach to our more complex data configuration by drawing a separate bootstrap sample of observations for each experimental unit, reordering the experimental units based on the estimated {X_k} from the bootstrapped samples, and then estimating the components of the Lorenz curve and Gini coefficient using the original data. Specifically, for this approach we draw bootstrap samples of observations within experimental units. With this bootstrap sample we compute p_k^(b), the proportion of observations in the bootstrap sample for the kth unit that have the feature Y. We rank the experimental units according to the p_k^(b). We then use the original data to calculate {L_i^(b)}, arranged in the order specified by the bootstrap sample. We repeat this process B times to obtain B values of L_i^(b).
Separately for each i, we take the mean of these values to obtain $L_i^B = \frac{1}{B} \sum_{b=1}^{B} L_i^{(b)}$, the bootstrap estimate of the ith ranked contribution to the Lorenz curve. Plotting {(i/n, L_i^B); i = 1, ..., n} gives the bootstrap estimate of the Lorenz curve, and the Gini coefficient is similarly estimated using the L_i^B. A potential problem that we foresee with this approach is that it will not work well when there are a large number of zeros in the data for experimental units with possibly small but nonzero probabilities. In this case the bootstrap samples would continually estimate these probabilities as zero and not correct for the bias.

5 Simulation Study

We conducted a simulation study to (1) assess and characterize the magnitude of bias present in the empirically estimated Lorenz curve and Gini coefficient and (2) evaluate the ability of the previously discussed methods to produce bias-corrected estimates of the Lorenz curve and Gini coefficient. We first describe how we generated data and then follow this description with the simulation results.

5.1 Simulation of Data

We specified N and n and studied both equal cluster sizes, m_k = m_1 ∀k, and unequal cluster sizes generated from a multinomial distribution with probabilities α_k ranging from 1/n to 1 but scaled by their sum, Σ_k α_k, so they added to one, in order to study potential bias when there is wide variation in cluster sizes. We generated a variable Z_k from a Normal(µ, 1) distribution, where µ was prespecified, and then obtained probabilities p*_k = Φ(Z_k), where Φ is the cumulative distribution function of the standard normal distribution. We further specified a desired π = P(Y = 1) and obtained the p_k = P(Y_k = 1) by scaling the p*_k so that the overall proportion of observations with Y = 1 across all simulated experimental units was equal to the chosen π. The {X_k} were then generated from a Binomial(m_k, p_k) distribution.
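Under our reading of this scheme, the data generation can be sketched as follows. This is our own rendering for illustration: the multiplicative rescaling that fixes the overall proportion at π, and the clipping guard, are our assumptions where the description leaves the details implicit.

```python
import numpy as np
from scipy.stats import norm

def simulate_clusters(N, n, mu, pi, equal_sizes=True, seed=0):
    """Simulate cluster sizes m_k, true proportions p_k, and counts X_k."""
    rng = np.random.default_rng(seed)
    if equal_sizes:
        m = np.full(n, N // n)
    else:
        # unequal sizes: multinomial with probabilities proportional to 1/n, 2/n, ..., 1
        alpha = np.arange(1, n + 1) / n
        m = rng.multinomial(N, alpha / alpha.sum())
    z = rng.normal(mu, 1.0, n)        # latent variable per unit, Z_k ~ Normal(mu, 1)
    p_star = norm.cdf(z)              # raw probabilities, p*_k = Phi(Z_k)
    # rescale so the expected overall proportion of Y = 1 equals pi
    p = p_star * (pi * m.sum()) / (m * p_star).sum()
    p = np.clip(p, 0.0, 1.0)          # guard: rescaling can exceed 1 in extreme cases
    x = rng.binomial(m, p)            # X_k ~ Binomial(m_k, p_k)
    return m, p, x
```

Varying µ changes the spread of the p_k after rescaling and hence the degree of concentration, which is how the scenarios with GC = 0.05, 0.50, and 0.90 are produced.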
By generating data in this fashion, we were able to control the degree of mal-distribution/concentration by changing the value of µ. In the simulations shown here, we fixed N = 5000 and π = 0.1. We varied µ so as to obtain distributions with very little inequality, GC = 0.05, with moderate inequality, GC = 0.50, and with high inequality, GC = 0.90. We also varied the parameter n using values n = 50, 100, 500, which resulted in average cluster sizes of 100, 50, and 10. For each scenario we simulated 1000 datasets and report the mean values of the estimated Gini coefficients across the 1000 simulations. In running these simulations, we found that in order to obtain convergence of the PROC NLMIXED program for the logistic regression, we needed to constrain the estimate of the variance of the random effects to be greater than zero.

5.2 Results

Figure 1 gives examples of the Lorenz curves generated from data simulated with equal cluster sizes. In each graph a single simulated dataset is plotted together with curves produced from the different analytic strategies described in Section 4. Figure 1(a) depicts data with little inequality. The curve drawn using the "true" probabilities (which would be unobserved in a real application) lies very close to the 45° line. The empirically estimated Lorenz curve falls substantially below this line, suggesting even greater inequality. The curve drawn using estimates from the logistic regression approach lies almost directly on top of the true curve, producing the best estimate in this dataset. The normal regression approach is also close to the true curve, but appears to slightly underestimate it. The curve resulting from using the bootstrap estimates lies in between the true curve and the empirically estimated curve, not fully correcting the bias. Figure 1(b) shows data with a moderate degree of inequality. Here the empirically estimated curve again lies well below the true curve.
In this dataset, though, both random effects approaches clearly overestimate the curve, suggesting less inequality than is truly present. The Lorenz curve drawn using the bootstrap approach is the closest to the true curve and appears to provide a relatively reasonable approximation. Figure 1(c) illustrates the situation when there is a high degree of inequality in the data. Here all the estimated curves lie along the true curve, indicating little bias in this case. Overall, these plots show that there is the potential for bias when estimating the Lorenz curve and suggest that the bias decreases as the inequality in the data increases. It is difficult to draw conclusions, however, based on a single simulation, so to further characterize the bias and the different methods of estimation we turn to results summarizing the Gini coefficient across multiple simulations.

[Figure 1: Examples of Lorenz curves for data simulated with equal cluster sizes and n = 500. Panels (a) GC = 0.05, (b) GC = 0.5, and (c) GC = 0.9 each plot the cumulative proportion of X against the cumulative proportion of the population for the truth, the empirical estimate, logistic regression, normal regression, and the bootstrap.]

Table 1 and Table 2 contain these results. Looking first at the empirically estimated Gini coefficients in both tables, we see that the magnitude of bias depends at least partially on the cluster size. The bias can be determined by comparing the estimated GC with the known value in the first column. Thus, in the first row, the logistic and normal GC estimates are 0.03 and 0.02, respectively, both lower than the true value of 0.05. By contrast, the empirical (0.18) and bootstrap (0.13) values clearly overestimate the true GC. The larger the cluster size, the less bias there is.
The bias is always smallest when the average cluster size is 100 and largest when the average cluster size is 10. For the case when GC = 0.05, the difference in bias between the situations where the average cluster size is 100 and where the average cluster size is 10 is quite considerable (i.e., for estimating GC = 0.05 we find an empirically estimated GC = 0.18 versus 0.50 from Table 1 and GC = 0.37 versus 0.58 from Table 2). We also see that there is a very large degree of bias corresponding to a small degree of inequality in the population, and that the bias gets smaller as the inequality increases. If we compare the degree of bias for a fixed average cluster size across the different scenarios, this trend becomes evident. For example, from Table 1 where the cluster size is 100, the bias is approximately 0.13 (0.18 − 0.05) for data with little inequality and GC = 0.05, approximately 0.02 (0.52 − 0.50) for data with moderate inequality and GC = 0.50, and approximately 0.00 (0.90 − 0.90) for data with high inequality and GC = 0.90. Similarly, in Table 2 where the average cluster size is 100, the bias is approximately 0.32 (0.37 − 0.05) for data with little inequality and GC = 0.05, approximately 0.10 (0.60 − 0.50) for data with moderate inequality and GC = 0.50, and approximately 0.02 (0.92 − 0.90) for data with high inequality and GC = 0.90. The pattern of bias at the other cluster sizes is the same. To understand why the bias decreases as the inequality increases, consider the following small example illustrated in Table 3. Suppose there are n = 25 experimental units and N = 200 observations equally distributed across the experimental units (i.e., equal cluster sizes with 8 observations each) with π = 0.1. We consider first the situation where there is very little inequality in the distribution of the p_k's across the population. This scenario is represented in the first four columns of Table 3, where GC is approximately 0.05.
Because the cluster sizes are equal and there is little inequality in the data, we expect all the p_k's to have similar values, reflecting that about 10% of the observations for each experimental unit have Y = 1. The ranking for what would be the true, unobserved Lorenz curve ranks the units based on the X_k = m_k p_k, which in this case is equivalent to ranking the units based on the p_k since the m_k are identical across all units. In this example we have four distinct values of p_k, and the order in which the units would be ranked based on those values is shown in the third column of the table. The empirically estimated Lorenz curve requires ranking the experimental units based on the X̂_k, which in this case is equivalent to ranking the units based on the p̂_k. A relatively large number of experimental units have p̂_k equal to zero. Notice that because there are a small number of observations for each unit and on average 10% of these observations have Y = 1, simply by chance many of the experimental units will have no observations with Y = 1, yielding p̂_k = 0. The ranking of the units based on the p̂_k is shown in the fourth column. It is clearly very different from the ranking based on the p_k; the rank correlation coefficient comparing {p_k} with {p̂_k} is 0.06. This difference is due primarily to the variation in estimating values of {p_k} that are very similar to one another when the estimates are based on a small number of observations. Consider now a similar data structure except with a high degree of inequality in the data. This scenario is represented in the last four columns of Table 3. Here most experimental units have no observations with Y = 1. The 10% of the observations with Y = 1 are concentrated in a few experimental units, which is reflected by the vast majority of the values of p_k being equal to zero, with only three exceptions, two of which have only observations with Y = 1 (p_k = 1).
When the observations for an experimental unit are consistently Y = 0 (p_k = 0) or Y = 1 (p_k = 1), there is essentially no room for error in the estimate of the probability. In these cases we will always have p̂_k = p_k, which is exactly what occurs in this dataset. In fact, we see only one unit with p̂_k ≠ p_k, and the difference in the probabilities does not affect the ranking. Looking at the last two columns in the table, we see that the rankings are identical. Although this example is a somewhat simplified version of what occurs in larger datasets, the general idea is the same. When there is little inequality in the population, there is the potential for variation in the estimates of the p_k and X_k, and even the slightest bit of variation can change the ranking of the experimental units. When there is a large degree of inequality in the population, there is less potential for variation in the estimates of the p_k and X_k. Most of the X_k will take values at the two extremes. Most of the variation in estimating these quantities will occur in the few experimental units that do not have extreme values, but these units will be few in number and so the correct ranking will still be fairly evident. Returning to Table 1 and Table 2, notice also that the magnitude of the bias is larger for unequal cluster sizes, i.e., when there is variation in the m_k. Variation in the number of observations for each experimental unit is associated with greater bias because some experimental units may have very small m_k, leading to potentially large error in estimating the p_k. Turning now to the different methods proposed to correct the bias, we look first at Table 1. In this situation, not surprisingly, the bootstrap does not work well when there is very little inequality in the distribution. Recall from the example in Table 3 that in this situation the observed values {X̂_k} may contain many zeros when the {X_k} are small but not equal to zero.
Drawing a bootstrap sample for an experimental unit with no observations that have Y = 1 will consistently yield bootstrap samples with X̂_k = 0. The bootstrap approach is limited by this fact. For moderate and high degrees of inequality in the data, it appears to work better. For the situation with equal cluster sizes, however, the logistic regression with random effects, while slightly underestimating the true Gini coefficient, appears to be the most reliable estimator across all levels of inequality in the data. The normal regression with random effects yields estimates that are very close to those of the logistic regression approach and so also appears to be a viable option. In Table 2, however, we see that when there is variation in the m_k, none of the estimators produces an unbiased estimate. The random effects approaches, which work reasonably well when the cluster sizes are equal, do very little to correct the bias. In the scenarios we considered here, there can be many experimental units with extremely small cluster sizes. Recall that we incorporate the estimates of the random effects in computing the Lorenz curve. It is possible that when the sample sizes are small the estimates of the random effects are poor, resulting in the poor properties seen in Table 2. The bootstrap approach seems to have the least amount of bias in this situation, and works fairly well when GC = 0.50 and GC = 0.90. However, the Lorenz curve estimated using the bootstrap when GC = 0.05 is still highly biased, for the same reasons explained in the preceding paragraph. We point out that in this set of simulations we have looked at a rather extreme situation with a very high degree of variation in the cluster sizes, which may exaggerate what is typically seen in practice, particularly for the larger average cluster sizes. For example, when the average cluster size is 100 we have cluster sizes ranging from 1 to over 250.
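The zero-count limitation of the bootstrap noted above is easy to quantify: a unit with true proportion p_k > 0 nevertheless shows X̂_k = 0 with probability (1 − p_k)^{m_k}, and every within-unit resample of an all-zero unit is again all zeros, so no amount of bootstrapping can recover a nonzero estimate. A quick check of how often this occurs for small clusters (our illustration):

```python
def prob_observed_zero(p, m):
    # P(X_hat_k = 0) for a unit with true proportion p and m observations;
    # such a unit can never be repaired by resampling its own observations.
    return (1.0 - p) ** m

# With p_k = 0.1, small clusters frequently show no events at all:
for m in (8, 10, 50, 100):
    print(m, prob_observed_zero(0.1, m))
```

For example, with p_k = 0.1 and m_k = 8 (the small example of Table 3), roughly 43% of such units show zero events, which is why the bootstrap cannot correct the bias when inequality is low and clusters are small.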
In future work we plan to explore the properties of these estimates when the variation in cluster sizes is not so large.

6 Health Care Data

For this example we analyzed data on a random sample of 500 primary care physicians. There were 4892 total patient visits made to these 500 physicians. Three hundred and fifteen of the visits (6%) were made by black patients. The number of patient visits in the profile for a physician ranged from 1 to 58, with an average of approximately 10 visits per physician and a variance of around 77. Figure 2 shows the estimated Lorenz curves for these data. Because of the large number of zeros observed for the X̂k, as evidenced by the flat portion of the empirically estimated curve along the x-axis, we suspect that the empirical estimate may be exaggerating the degree of inequality. The empirically estimated Gini coefficient is 0.83. The other three analytic methods yield Gini coefficients near 0.75, which reduces the estimated degree of inequality somewhat.

Figure 2: Lorenz curves (Empirical, Logistic, Normal, and Bootstrap) showing the concentration of black patient visits within a random sample of 500 physicians. The x-axis gives the proportion of physicians; the y-axis gives the cumulative proportion of black patient visits.

7 Discussion

In this manuscript we have demonstrated that ignoring variation in the outcome may lead to substantial bias in the empirically estimated Lorenz curve and Gini coefficient, causing overestimation of the degree of concentration. This bias is most profound when there is relative equality in the distribution. When cluster sizes are equal, a random effects approach appears to produce estimates of the Lorenz curve and Gini coefficient that are relatively close to the true values.
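The mechanism behind such corrections can be illustrated with a simple method-of-moments Beta-binomial sketch that shrinks each estimated probability toward the overall mean before ranking. This is our illustration, not the logistic or normal mixed models fit in the simulations; the function name and the variance guard are hypothetical choices, and the moment matching ignores the binomial sampling contribution to the variance:

```python
def beta_binomial_shrink(counts, sizes):
    """Fit a Beta(a, b) prior to the unit-level proportions by the method
    of moments, then return the posterior-mean estimate for each unit.
    Units with zero counts receive small positive estimates, breaking the
    pile-up of zeros that distorts the empirical ranking."""
    p_hat = [c / m for c, m in zip(counts, sizes)]
    n = len(p_hat)
    mean = sum(p_hat) / n
    var = sum((p - mean) ** 2 for p in p_hat) / (n - 1)
    # moment-match a + b, guarding against near-zero variance
    ab = max(mean * (1.0 - mean) / max(var, 1e-8) - 1.0, 1e-8)
    a, b = mean * ab, (1.0 - mean) * ab
    # posterior mean under a Beta-binomial hierarchy
    return [(a + c) / (a + b + m) for c, m in zip(counts, sizes)]
```

Shrinking pulls extreme estimates from small clusters toward the center, so units are ranked more by evidence than by sampling noise; the resulting Lorenz curve is correspondingly less exaggerated.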
While in many applications cluster sizes vary, equal cluster sizes could arise in a designed study where it is specified a priori to take the same number of measurements for each experimental unit, or when there are only a limited number of sources from which to make observations (such as annual survey results for a five-year period). Unfortunately, when cluster sizes are not equal, none of the approaches results in satisfactory estimates of these quantities. There are certainly other methods that could be used for this purpose which might produce better estimates. For example, estimating the probabilities from a Beta-binomial hierarchy might work well here. A Bayesian random effects model might also perform well. Further work in this area is warranted.

References

[1] P. B. Bach, H. H. Pham, D. Schrag, R. C. Tate, and J. L. Hargraves. Primary care physicians who treat whites and blacks. New England Journal of Medicine, 351:575–584, 2004.

[2] R.-K. R. Chang and N. Halfon. Geographic distribution of pediatricians in the United States: an analysis of the fifty states and Washington, DC. Pediatrics, 100(2):172–179, 1997.

[3] M. Davidian and D. M. Giltinan. Nonlinear Models for Repeated Measurement Data. Chapman and Hall, 1995.

[4] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, New York, 1993.

[5] L. J. Elliot, J. F. Blanchard, C. M. Beaudoin, C. G. Green, D. L. Nowicki, P. Matusko, and S. Moses. Geographical variations in the epidemiology of bacterial sexually transmitted infections in Manitoba, Canada. Sexually Transmitted Infections, 78(Suppl 1):i139–i144, 2002.

[6] J. L. Gastwirth. A general definition of the Lorenz curve. Econometrica, 39:1037–1039, 1971.

[7] C. Gini. Sulla misura della concentrazione e della variabilità dei caratteri. Atti del Reale Istituto Veneto di Scienze, Lettere ed Arti, LXXII:1203–1248, 1914.

[8] Y. Kobayashi and H. Takaki. Geographic distribution of physicians in Japan. Lancet, 340:1391–1393, 1992.

[9] W. C.
Lee. Probabilistic analysis of global performances of diagnostic tests: interpreting the Lorenz curve-based summary measures. Statistics in Medicine, 18:455–471, 1999.

[10] M. O. Lorenz. Methods of measuring the concentration of wealth. Publications of the American Statistical Association, 9(70):209–219, 1905.

[11] T. Pham-Gia and N. Turkkan. Change in income distribution in the presence of reporting errors. Mathematical and Computer Modelling, 25(7):33–42, 1997.

[12] J. C. Pinheiro and D. M. Bates. Mixed-Effects Models in S and S-Plus. Springer, 2000.

[13] C. P. Prakasam and P. K. Murthy. Couple's literacy levels and acceptance of family planning methods: Lorenz curve analysis. Journal of Institute of Economic Research, 27(1):1–11, 1992.

[14] M. Tickle. The 80:20 phenomenon: help or hindrance to planning caries prevention programmes? Community Dental Health, 19:39–42, 2001.

Table 1: Equal cluster sizes: Average Gini coefficients across 1000 simulations when the mk are equal. Data were generated with N = 5000 observations, π = P(Y = 1) = 0.1, and GC = 0.05 for data with little inequality, GC = 0.50 for data with moderate inequality, and GC = 0.90 for data with high inequality.

True GC   n     Cluster size   Empirical   Logistic   Normal   Bootstrap
0.05      50    100            0.18        0.03       0.02     0.13
0.05      100   50             0.24        0.03       0.02     0.18
0.05      500   10             0.50        0.02       0.02     0.38
0.50      50    100            0.52        0.48       0.47     0.50
0.50      100   50             0.62        0.47       0.45     0.52
0.50      500   10
0.90      50    100            0.90        0.90       0.90     0.90
0.90      100   50             0.90        0.90       0.90     0.90
0.90      500   10             0.90        0.90       0.90     0.90

Table 2: Unequal cluster sizes: Average Gini coefficients across 1000 simulations when the mk vary. Data were generated with N = 5000 observations, π = P(Y = 1) = 0.1, and GC = 0.05 for data with little inequality, GC = 0.50 for data with moderate inequality, and GC = 0.90 for data with high inequality. Also shown are the average cluster size, range in cluster sizes, and average variation in cluster sizes across the simulations.

True GC   n     Avg. size   Range   Variation   Empirical   Logistic   Normal   Bootstrap
0.05      50    100         1-251   3298.4      0.37        0.33       0.34     0.15
0.05      100   50          1-133   839.6       0.41        0.33       0.36     0.21
0.05      500   10          1-38    35.9        0.58        0.32       0.43     0.44
0.50      50    100         1-246   3304.2      0.60        0.58       0.57     0.50
0.50      100   50          1-136   840.3       0.62        0.57       0.57     0.52
0.50      500   10
0.90      50    100         1-237   3298.5      0.92        0.92       0.92     0.91
0.90      100   50          1-133   841.5       0.93        0.93       0.93     0.92
0.90      500   10          1-37    35.9        0.93        0.93       0.93     0.92

Table 3: Example of potential error in ranking experimental units. Data generated with N = 200 and n = 25.

       GC = 0.05                            GC = 0.90
pk     p̂k     rank by Xk  rank by X̂k      pk    p̂k    rank by Xk  rank by X̂k
0.11   0.125  4           2                0     0     1           1
0.11   0.125  4           2                0     0     1           1
0.09   0      2           1                0     0     1           1
0.10   0      3           1                0     0     1           1
0.09   0      2           1                1     1     3           3
0.10   0.125  3           2                0.5   0.75  2           2
0.10   0.125  3           2                0     0     1           1
0.10   0      3           1                0     0     1           1
0.11   0      4           1                0     0     1           1
0.09   0.125  2           2                0     0     1           1
0.08   0      1           1                0     0     1           1
0.11   0      4           1                1     1     3           3
0.10   0.375  3           4                0     0     1           1
0.10   0.125  3           2                0     0     1           1
0.10   0      3           1                0     0     1           1
0.10   0      3           1                0     0     1           1
0.10   0.125  3           2                0     0     1           1
0.10   0.250  3           2                0     0     1           1
0.09   0      2           1                0     0     1           1
0.10   0      3           1                0     0     1           1
0.09   0      2           1                0     0     1           1
0.10   0      3           1                0     0     1           1
0.11   0      4           1                0     0     1           1
0.11   0      4           1                0     0     1           1
0.11   0      4           1                0     0     1           1