A New Dataset on Global Income Distribution

Charles Ackah, Maurizio Bussolo, Rafael De Hoyos, and Denis Medvedev
Development Prospects Group
The World Bank

Do not quote
Feb 29, 2008

1. Introduction

For many years, particularly following the influential ideas of Simon Kuznets (1955), discussions about the relationship between income inequality and a country’s aggregate income level have taken centre stage in development economics research. Since Kuznets proposed an inverted-U shaped relationship between income inequality and economic development, many researchers have tried to validate this hypothesis, often by looking at past trends for evidence that development truly hurts the poor. Empirically, some studies have found that a country’s rate of economic growth is negatively correlated with its initial level of inequality (see Ahluwalia, 1976; Deininger and Squire, 1998), while others have failed to demonstrate any systematic association between economic growth and the distribution of income (see Bourguignon and Morrisson, 1990; Benabou, 1997; Anand and Kanbur, 1993; Li, Squire and Zou, 1998). A major difficulty in the empirical literature lies in the choice of an appropriate definition of global inequality. Another problem is the reliability of the underlying data used for distributional analysis. While the literature has made important strides in addressing what constitutes an appropriate measure of inequality, analyses of global income distribution are still plagued by serious data problems, including the limitations of traditional databases and the poor comparability of data, despite some obvious improvements in the availability of income inequality data spawned mainly by the pioneering work of Deininger and Squire (1996).
It is fair to say that the dataset compiled and made freely available by Deininger and Squire has led to a remarkable improvement in the availability of secondary data, facilitating analysis of world income inequality that was hitherto difficult to contemplate. However, using the existing data for distributional analysis is not without costs or problems, and applied researchers using such data face non-trivial limitations in their ability to study global income inequality. The importance of reliable and comparable individual (or household) level survey data, long acknowledged as appropriate for comparative analysis of levels and trends in global income distribution, cannot be overemphasized (see Milanovic 2002). Moreover, given the considerable recent interest in and concern about the distributional effects of increasing globalization, there is an even more pressing need for reliable datasets that permit meaningful comparison of inequality not only within countries but also across regions and nations. Indeed, being able to measure accurately the relative position of every country (or individual) within the global income distribution is a necessary condition for evaluating whether any policy initiative (such as the removal of agricultural distortions or, more generally, globalization) would increase or reduce world inequality. This is particularly important if one is interested in knowing the global distributional impacts of any such phenomenon. We set the stage by presenting the first ever household survey-based global distributional data. We should stress that we do not attempt here to critique the strengths and weaknesses of the existing traditional databases, since none is actually comparable to this new data on global income distribution.
Atkinson and Brandolini (2001) provide a detailed critique of the use of secondary datasets, including Deininger and Squire’s celebrated dataset, and we do not repeat it in this paper. In fact, our present data have a number of important limitations as well. For example, the data have no time dimension (i.e. there is only one observation per country), so they are not suitable for examining changes over time in the global income distribution. We focus largely on snapshots (or levels) of the distribution, such as the global distribution of income in 2000.1 The data are, nonetheless, an interesting departure from the existing databases. In particular, this database is not a mere compilation of secondary cross-country inequality indices. Instead, it is an actual presentation of a truly global income distribution based entirely on household survey data. Our ultimate goal is to assemble existing representative household data for all countries of the globe, standardize them so that they are internationally comparable, and make the minimal distribution data available to researchers in a ready-to-use format for the analysis of global income distribution. This goal is motivated by the lack of publicly available data on global income distribution, which limits the ability of researchers to identify the position of each country in the global income distribution. It is important to note that our global data also easily permit interested researchers to construct Deininger and Squire-like within-country Gini coefficients. We hope this work will contribute to a more informed policy debate about the dynamics of the global income distribution and the emergence of the new global middle class. The main objective of this manuscript is to introduce our new global dataset on income distribution, to discuss the procedures followed in assembling the data, and to acknowledge any remaining limitations.
The next section briefly surveys previous attempts at improving the availability of cross-country income distribution data. Section 3 introduces the dataset and discusses the data sources. It also discusses the limitations of the data and how we dealt with the issues of comparability. We then present some descriptive evidence on global income inequality in Section 4. In Section 5, we combine our data with the Global Income Distribution Dynamics (GIDD) tool to provide an illustrative example of some of the benefits of such novel data. Section 6 concludes the paper with some suggestions for future work.

1 Although concentrating on a snapshot of the distribution, the underlying data allow for a forward-looking analysis of changes in the distribution of income arising from anticipated policy changes, and we provide an illustration in Section 5 of this paper.

2. Previous Work

This section charts some advances in the literature, including the recent datasets that some authors have made available, and highlights what we believe to be the value added of our work. Ours is not the first attempt to compile distributional data based only on representative household surveys. The Deininger and Squire dataset and, in particular, the World Income Inequality Database (WIID1 and WIID2) compiled by the World Institute for Development Economics Research (WIDER) (2000) are recent examples of commendable attempts to gather information from household surveys to facilitate income distribution analysis.
While there had been prior efforts to create secondary datasets for the study of income inequality, including those by Kuznets (1963), Paukert (1973), Jain (1975), Ahluwalia (1976), the United Nations (1981 and 1985) and Fields (1989), these data often fell short of the needs of researchers and policymakers interested in world inequality issues (see Atkinson and Brandolini, 2001).2 In the words of Deininger and Squire (1996:567): “although a large number of earlier studies on inequality have amassed substantial data on inequality, the information included is often of dubious quality”. Responding to the call by many researchers for better quality data on income distribution, Deininger and Squire compiled the first comprehensive cross-country dataset of inequality measures based mainly on representative household surveys rather than on estimates drawn from national account statistics. Most studies have used the subset of this data labelled “accept”, which the authors declared as satisfying their criteria for being “high-quality”.3 This dataset, which covers the period 1960-1996, has been widely cited in the empirical literature investigating the link between inequality and economic growth (see Deininger and Squire, 1998; Benabou, 1996; Forbes, 2000; Barro, 2000; Li and Zou, 1998; Li et al., 1998).

2 The data available to earlier researchers were almost entirely based on national statistics; even when derived from household surveys, the secondary data were grouped in nature, with typically just a few observations per country-year. However, the increased recognition of population heterogeneity continues to cast doubt on the reliability and relevance of such average data.

3 To be classified as high-quality, the data must come from representative (national) household surveys, and the survey must capture all sources of income or expenditure, including own-consumption.
Despite the progress achieved, the Deininger and Squire dataset should be approached with circumspection, as Atkinson and Brandolini (2001) have demonstrated (see also Bourguignon and Morrisson, 1998). Among its problems are two persistent issues: (1) the inequality measures are not derived on a consistent basis from comparable micro data; most of the figures are sourced from “third parties” and may differ from the original source because different studies do not necessarily employ similar conceptual definitions and methodologies; and (2) income inequality observations based on the household unit are mixed with statistics based on income recipients; indices derived from grouped data are mixed with those based on micro data; estimates include some that relate to gross income and others to net (disposable) income; and the inequality indices include a mixture of ones obtained from parameterized and actual distributions.4 The Deininger and Squire dataset has since been updated to generate the World Income Inequality Databases (WIID1 and WIID2) by WIDER (2000, 2005). WIID1, which covers the period 1950-1998, has recently been updated, resulting in a new version referred to as WIID2 (WIDER 2005). It is reasonable to view this database as a mere extension of the Deininger and Squire dataset; it contains all of the Deininger and Squire “high-quality” data and adds some ‘new’ data rejected by Deininger and Squire for not satisfying their inclusion criteria. Hence most of the limitations raised by Atkinson and Brandolini about the Deininger and Squire dataset may apply even more forcefully to the WIDER datasets. Two very rich and important databases that bear some resemblance to ours are the World Income Distribution (WYD) dataset and the All the Ginis database compiled by the World Bank (Milanovic 2005). Unlike the earlier datasets, these two are original (not secondary) databases created directly by the authors from household survey data.
The WYD dataset, covering the period 1985-2000, contains grouped (decile or quintile) income distribution indices for a large number of countries for three benchmark years (1988, 1993 and 1998). If a country does not have a household survey for a given benchmark year, then the year closest to the benchmark was selected, so that observations are clustered around the years 1988, 1993 and 1998 (Milanovic, 2006). This database has been used for most of Milanovic’s work on world inequality (see, for example, Milanovic 2002 and 2005). In the summer of 2004, Branko Milanovic created the All the Ginis database, which simply combines the Deininger and Squire (DS), WIID2 and WYD datasets.5 The idea was to create a more comprehensive and superior database than those available at the time. In the end, Milanovic showed a revealed preference for WIID2, so that WYD was used only to ‘fill’ the gaps in WIID2. The rationale for doing this is not clear to us, but such an arbitrary presumption introduces severe inconsistencies in the All the Ginis database. Apart from the limitation of having only grouped data, the database is a mixture of secondary and primary sources of data, which makes the indices not comparable. Take any country, for example, and you will be comparing a Gini coefficient à la WIID2 in one benchmark year against a Gini coefficient from WYD in another benchmark year. Moreover, since All the Ginis is a synthesis of all the above-mentioned datasets, it inherently suffers from the limitations that plague all three datasets.

4 For a number of cases where income shares rather than Gini coefficients were available, Deininger and Squire resorted to Chen, Datt and Ravallion’s (1995) statistical program, POVCAL, to compute Gini coefficients.

3. The GIDD Database

Compared to the existing secondary databases on cross-country inequality indices, the current dataset is a ‘true’ global income distribution. In fact, there is no precursor to this work.
Milanovic (2002) is the first study we are aware of that derives a ‘true’ global income distribution based solely on household survey data. Milanovic estimates world income or expenditure inequality for individuals for the years 1988 and 1993. The world distribution is derived in essentially the same manner as one would derive a country’s income distribution from regional distributions. Household surveys from 91 different countries, adjusted for differences in Purchasing Power Parity (PPP), are used. For about three-quarters of the reported observations Milanovic had access to individual record data, while for the rest mean income or expenditure per decile (or any other population group share) is taken from grouped data. To maintain internal consistency, Milanovic converted the unit record data into decile data so that in the end he had only ten data-points per country.6

5 Note however that the WIID2 dataset already incorporates the DS datasets, so that in effect Milanovic is combining WIID2 and WYD.

Other than the fact that the data used by Milanovic are not publicly accessible, the main difference in the approach used here consists of working directly with the household data rather than with grouped data. In particular, our data relate to the individual/household, so that one is able to explore much more heterogeneity within the distribution. Although working with grouped data has become popular and has apparent advantages in terms of simplicity, it is not particularly attractive if one wants to perform scenario simulations. This is because grouped data, or even parameterized Lorenz curves, may not permit the full heterogeneity in the distribution to be explored, in part because rural and urban or agricultural and non-agricultural households are mixed together. Working directly with household data, and thus being able to exploit the full heterogeneity, is fairly powerful in such cases.
In addition, our data cover more countries and more recent surveys.7 Moreover, by aligning all surveys in time to a common base year (2000 in this case) and by applying a common processing procedure, our data are more suitable for international comparisons of income distribution [this paragraph may not be completely true if we are also working with grouped data].

6 More detailed information on the data used by Milanovic (2002) can be found at http://www.worldbank.org/research/inequality/data.htm.

7 Milanovic was able to obtain household survey data for a common sample of 91 countries both in 1988 and 1993, covering about 84 per cent of the world population. By contrast, our sample covers 116 countries representing 91 percent of the world population. We had access to individual records for 1.2 million households in 84 developing countries. These micro data are complemented with more aggregate information for countries where we do not have direct access to surveys. Household information from developed countries comes from the Luxembourg Income Study dataset.

3.1. Conceptual and Methodological Issues

Any attempt to put together a secondary cross-country dataset drawing on different household surveys raises several conceptual and methodological issues. A few reflections on these problems are appropriate before we describe our data and present some descriptive results. Many of these issues are well discussed in the literature, so we give only a cursory treatment here, touching on the points that bear directly upon our work. There is at present little conceptual agreement regarding what constitutes a good quality database for the purposes of studying global income distribution. That is not to suggest that there is no ideal standard; Atkinson and Brandolini (2001) and, most recently, Milanovic (2002) have already forcefully made the case for one.
National accounts data are in principle comparable across countries and have been widely used for the analysis of world income distribution. However, most researchers, including Milanovic, are uncomfortable with their widespread use for this purpose. We agree with Milanovic that household survey data are indispensable for the analysis of the ‘true’ global income distribution. Granted that survey data are preferred, the issue is how to maintain consistency and ensure that what we measure is truly the global income distribution. We begin with one of the issues most discussed in the literature: the concept of economic welfare.

The Welfare Concept

The first challenge in this exercise is the choice between consumption and income as the preferred overall measure of living standards. There is no clear guidance as to the preferred welfare concept for studying distributional issues. While the advantages of consumption over household income in welfare analysis are well known (consumption is less variable and more accurately gathered), some still hold the view that the lack of consensus on the treatment of durables makes the use of consumption problematic (see Deaton and Zaidi, 2002; and Atkinson and Bourguignon, 2000). Opposing views aside, many statistical offices collect information on only one of consumption or income, so in practice there is little choice. For the purposes of this exercise, we define global income as the sum (over all household members) of the reported as well as estimated and imputed personal monthly consumption expenditures (or income) of all countries for which we have access to unit record data. Whenever both consumption and income are available, the former is always preferred. This was the case in most African and Asian surveys, where detailed consumption data are collected.
By contrast, most industrialized countries and much of Latin America collect much more detailed income data. For these countries we follow common practice by using the information available, which in most cases is income.

Adjustment for Household Size and Composition

The reference unit may be the household or the individual income earner. For income, we have information for each income earner, while consumption is usually given at the household level. For every country where we have access to individual record data, our statistical unit of analysis is total household consumption expenditure (or income) divided by household size. Indeed, not adjusting income implies that the welfare achievable in a household with a certain income is independent of the number of its occupants. We note that per capita consumption or income, though not the most preferable measure, is used in practice because it is the most commonly available. We recognize that the use of per capita income as the unit of observation amounts to assuming that no economies of scale arise from the sharing of economic resources and that children and adults do not differ in their needs. However, the choice is mainly driven by practical matters. Very few countries report equivalence scales, and these scales are difficult to compare across countries. We could, in fact, use the square root of household size as a crude equivalence scale (see Gottschalk and Smeeding, 1997).

The Application of Weights

The welfare unit may be person-weighted or household-weighted. As we are concerned with the welfare of the individual, all observations are person-weighted, so that per capita income is counted as many times as there are persons in the household.
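The per capita adjustment and person weighting described above can be illustrated with a minimal Python sketch. The record layout (`Household`, `total_welfare`, `size`, `survey_weight`) and the toy figures are hypothetical and are not taken from the GIDD files; the square-root scale is shown only as the crude alternative mentioned in the text, not as the method used for the dataset.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Household:
    total_welfare: float  # monthly household consumption or income (hypothetical field)
    size: int             # number of household members
    survey_weight: float  # household sampling weight (hypothetical field)


def per_capita(h: Household) -> float:
    """Welfare per person: household total divided by household size."""
    return h.total_welfare / h.size


def sqrt_equivalised(h: Household) -> float:
    """Crude alternative: divide by the square root of household size."""
    return h.total_welfare / sqrt(h.size)


def person_weight(h: Household) -> float:
    """Count each household as many times as it has members."""
    return h.survey_weight * h.size


def person_weighted_mean(households, welfare=per_capita) -> float:
    """Mean welfare across individuals rather than across households."""
    total_weight = sum(person_weight(h) for h in households)
    return sum(welfare(h) * person_weight(h) for h in households) / total_weight


# Toy data: a four-person household and a single-person household.
households = [Household(400.0, 4, 100.0), Household(300.0, 1, 100.0)]
mean_pc = person_weighted_mean(households)
mean_eq = person_weighted_mean(households, welfare=sqrt_equivalised)
```

In this toy case the four-person household carries four times the weight of the single-person household, so the person-weighted per capita mean lies closer to the larger household's per capita value than a household-weighted mean would.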
Converting Local Currencies to International Dollars

On the issue of making international comparisons of monetary variables, there is wide consensus among economists that PPP-adjusted income data are the most appropriate, even though discussions continue about some remaining imperfections. The direct use of official exchange rates has been discounted as inappropriate for international comparison of living standards for several reasons, not least because in many countries the official exchange rates are distorted and volatile. In order to preserve comparability of the monetary variables across countries, we follow common practice by converting all local currencies into international dollars using PPP conversion factors.

3.2. Coverage and Data Processing

This subsection is concerned with the processes involved in arriving at the final dataset and with the sample coverage. We also discuss how the included surveys were selected and the sources of the data used.

Data Processing

When one talks about the global income distribution, what one really wants is to be able to calculate income differences between all the citizens of the world regardless of nationality (see Milanovic 2002). For example, we want to be able to compare where an individual in Ghana stands in the world income distribution vis-à-vis his or her counterpart in Mexico. This means treating each person in the world as having his or her own real income, adjusted for differences in purchasing power. Note that the size of the country matters as well as the within-country income distribution. One would, ideally, require data on both within- and between-country income distributions for all the countries in the world. The data requirement for such an exercise is massive, and perfect comparability of the data across countries is not achievable.
Nonetheless, it is important for data compilers to strive to minimize data and methodological differences across nations (Gottschalk and Smeeding, 2000). In this regard, where we had access to individual data, we made every effort to make the data as comparable as possible across countries by using similar definitions of variables for each country and by applying consistent methods of processing the data. We pool all available surveys per country, drawing on survey data from different World Bank sources. Our main source of data for developing countries is the set of data files used for the World Development Reports 2006 and 2007. For countries where we had access to multiple surveys, we retain only the one that is most nationally representative, most recent and as close to the year 2000 as possible. For the developed countries we had to rely on predefined grouped data from the already standardized LIS dataset. Similarly, for China we only had access to predefined aggregate data already divided into urban and rural parts. For internal consistency, we also convert all the developing-country unit record data into vintiles, ranking all individuals by their household per capita consumption or income. Each vintile contains 5% of individuals in a given country. To this dataset we added two new variables: PPP conversion factors obtained from the Penn World Tables and local CPI indexes obtained from the World Development Indicators (WDI). We should mention that all household surveys are “placed” in the year 2000. For example, if we use a survey for 2002, the CPI is used to express all income/consumption figures in domestic values of 2000; a correction factor is also applied to the population weights so that they sum to the (census) population of 2000. The real figures were finally converted into year-2000 international dollars using the PPP conversion rates.
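The processing steps just described (deflating to year-2000 prices, rescaling population weights, converting with PPP factors and grouping into vintiles) can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used to build the dataset: all function and parameter names are hypothetical, the PPP factor is assumed to be expressed in local currency units per international dollar, and records that straddle a vintile boundary are split proportionally by weight.

```python
def to_2000_ppp(value_local, cpi_survey_year, cpi_2000, ppp_2000):
    """Deflate a survey-year value to 2000 local prices, then convert to
    international dollars (ppp_2000 assumed to be local currency per int. $)."""
    real_2000_local = value_local * cpi_2000 / cpi_survey_year
    return real_2000_local / ppp_2000


def rescale_weights(weights, population_2000):
    """Scale person weights so that they sum to the year-2000 population."""
    factor = population_2000 / sum(weights)
    return [w * factor for w in weights]


def vintile_means(values, weights, groups=20):
    """Mean welfare in each of `groups` equal-population groups of individuals
    ranked by welfare. Assumes non-empty input and positive person weights."""
    records = sorted(zip(values, weights))
    total = sum(weights)
    share = total / groups      # population in each group (5% when groups=20)
    tol = 1e-12 * total         # tolerance for floating-point residue
    means = []
    idx = 0
    remaining = records[0][1]   # weight not yet allocated in the current record
    for _ in range(groups):
        need, acc = share, 0.0
        while need > tol and idx < len(records):
            take = min(need, remaining)
            acc += take * records[idx][0]
            need -= take
            remaining -= take
            if remaining <= tol:        # record exhausted, move to the next one
                idx += 1
                if idx < len(records):
                    remaining = records[idx][1]
        means.append(acc / share)
    return means
```

For instance, a figure of 1,100 local currency units from a 2002 survey, with a CPI of 110 in 2002 and 100 in 2000 and a PPP factor of 5, would become 200 international dollars of 2000 under these assumptions.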
Coverage

Our sample is primarily determined by the availability of representative household survey data with information on consumption or income that is as comprehensive as possible. The only inclusion restrictions were the quality of the survey, whether it is nationally representative, and the availability of consumption or income aggregates. Essentially, we began with as many surveys as possible for all countries for which unit record data on consumption or income were available in the WDR. However, many of these countries were eliminated from the sample due to a lack of data deemed important for our purposes. As with Chen and Ravallion (2007), surveys are excluded if essential data are missing (PPP exchange rates and local CPIs, for example) or if there are serious comparability problems with the rest of the data. The main source of data for the developing countries is the World Bank WDR database. We mainly use the underlying data used for the WDR 2006 and 2007, which are drawn largely from the LSMS and the Africa ISP-Poverty monitoring group. The data for Eastern Europe are drawn from the ECA databank and different World Bank sources. The Luxembourg Income Study (LIS) database is our source of data for most of the developed countries. Table A.1 in the Appendix presents the main characteristics of each household survey [This Table will come later]. The table shows the names of the surveys, the sample size (in number of individuals) and the welfare measure used (consumption or income).

The final sample consists of 116 countries representing 91 percent of the world population. We had access to individual records for 1.2 million households in 84 developing countries. These micro data are complemented with more aggregate information for countries where we do not have direct access to surveys. The countries covered and their respective shares of the total sample and world population are presented in Table A.2 in the Appendix.
The final sample covers all regions in the world: Eastern Europe and Central Asia (100%), Latin America (98%), South Asia (98%), East Asia and Pacific (96%), High Income Countries (79%), Sub-Saharan Africa (74%) and Middle East and North Africa (70%).

3.3. Limitations

As already mentioned, the data requirements and the quality restrictions needed to maintain international comparability are enormous. While we endeavored to make the data consistent and cross-nationally comparable, the usual ‘caution’ applies. Since there exists no global ‘household survey’, and different countries instead use different questionnaires and have different ways of minimizing potential measurement errors, perfect comparability is not assured. Perhaps one of the most obvious difficulties is that, because we do not have household income for all countries, income inequality statistics are mixed with consumption inequality measures. This confounds international comparisons, as income tends to be more unequally distributed than expenditure. Moreover, differences in survey design, in the comprehensiveness of income sources and in quality all have the potential to affect cross-country comparisons of income inequality. These quality issues and others, which plague most cross-country distributional analyses, have been sufficiently discussed in Gottschalk and Smeeding (1997, 1998), Székely and Hilgert (1999), Atkinson and Brandolini (2001) and Milanovic (2002); we do not belabor them here other than to warn that users should bear them in mind while using our data.
Appendix 1

Table A1: Household Surveys Included in the GIDD

Region                             Covered population   Actual population   Covered (%)
World                                       5,498,162           6,076,509         90.48
East Asia and Pacific                       1,733,358           1,817,232         95.38
Eastern Europe and Central Asia               460,385             471,549         97.63
High Income Countries                         764,285             974,612         78.42
Latin America                                 500,199             515,069         97.11
Middle East and North Africa                  190,397             276,447         68.87
South Asia                                  1,332,800           1,358,294         98.12
Sub-Saharan Africa                            516,737             663,305         77.90

Economy Covered population Covered population Actual population East Asia and Pacific 1,733,358 1,805,691 China Indonesia Vietnam Philippines Thailand Malaysia 1,260,000 212,000 80,400 71,600 61,700 23,300 1,260,000 212,000 80,400 71,600 61,700 23,300 13 Data used grouped individual individual individual individual grouped Cambodia Lao PDR Papua New Guinea Mongolia Myanmar Korea, Dem. Rep. Fiji Timor-Leste Solomon Islands Vanuatu Samoa Micronesia, Fed. Sts. Tonga Kiribati Marshall Islands Eastern Europe and Central Asia 11,900 4,927 5,133 2,398 460,385 11,900 4,927 5,133 2,398 47,700 21,900 811 784 419 191 177 107 100 91 53 471,549 individual individual grouped grouped individual individual individual individual individual individual individual grouped grouped individual individual individual individual individual grouped individual individual grouped grouped individual individual individual individual grouped individual individual grouped grouped grouped grouped grouped grouped grouped grouped grouped grouped Russian Federation Turkey Ukraine Poland Uzbekistan Romania Kazakhstan Serbia and Montenegro Czech Republic Hungary Belarus Azerbaijan Bulgaria Tajikistan Slovak Republic Georgia Kyrgyz Republic Turkmenistan Croatia Moldova Lithuania Armenia Albania Latvia Estonia Macedonia, FYR Bosnia and Herzegovina High Income Countries 136,000 69,600 47,600 38,300 25,100 21,800 15,000 10,600 10,300 9,876 9,994 8,199 7,906 6,376 5,393 4,514 5,008 4,644 4,446 4,259 3,477 3,065 3,139 2,383 1,363 2,044 764,285 146,000 67,400 
49,200 38,500 24,700 22,400 14,900 8,137 10,300 10,200 10,000 8,049 8,060 6,159 5,389 4,720 4,915 4,502 4,503 4,275 3,500 3,082 3,062 2,372 1,370 2,010 3,847 974,612 United States Germany France United Kingdom Italy Korea, Rep. Spain Canada Netherlands Greece 282,000 82,200 58,900 58,800 57,700 47,000 40,500 30,800 15,900 10,900 282,000 82,200 58,900 59,700 56,900 47,000 40,300 30,800 15,900 10,900 14 Belgium Portugal Sweden Austria Hong Kong, China Israel Denmark Finland Norway Singapore New Zealand Ireland Slovenia Luxembourg Netherlands Antilles Japan Taiwan, China Saudi Arabia Australia Switzerland Puerto Rico United Arab Emirates Kuwait Cyprus Bahrain Qatar Macao, China Malta Brunei Darussalam Bahamas, The Iceland French Polynesia New Caledonia Guam Channel Islands Virgin Islands (U.S.) Antigua and Barbuda Isle of Man Bermuda Greenland Latin America Brazil Mexico Colombia Argentina Peru Venezuela, RB Chile Ecuador Guatemala Bolivia Dominican Republic Haiti Honduras 10,300 10,100 8,875 8,011 6,669 6,282 5,338 5,177 4,492 4,020 3,864 3,815 1,986 441 215 grouped grouped grouped grouped grouped grouped grouped grouped grouped grouped grouped grouped grouped grouped grouped 500,199 10,300 10,200 8,869 8,012 6,665 6,289 5,337 5,176 4,491 4,018 3,858 3,805 1,989 438 176 127,000 22,200 20,700 19,200 7,184 3,816 3,247 2,190 694 672 606 444 390 333 301 281 236 213 155 147 109 76 76 62 56 515,069 172,000 98,000 41,600 37,300 26,800 24,300 15,200 12,000 11,800 8,514 7,950 8,146 6,281 174,000 98,000 42,100 36,900 26,000 24,300 15,400 12,300 11,200 8,317 8,265 7,939 6,424 individual individual individual individual individual individual individual individual individual individual individual individual individual 15 El Salvador Paraguay Nicaragua Costa Rica Uruguay Panama Jamaica Guyana Cuba Trinidad and Tobago Suriname Barbados Belize St. Lucia St. Vincent and the Grenadines Grenada Dominica St. Kitts and Nevis Middle East and North Africa Egypt, Arab Rep. 
Iran, Islamic Rep. Morocco Yemen, Rep. Tunisia Jordan Algeria Iraq Syrian Arab Republic Libya Lebanon West Bank and Gaza Oman Djibouti South Asia India Pakistan Bangladesh Nepal Sri Lanka Afghanistan Bhutan Maldives Sub-Saharan Africa Nigeria Ethiopia South Africa Tanzania Kenya Uganda Ghana Côte d'Ivoire Madagascar Cameroon Zimbabwe 6,409 5,386 5,186 3,805 3,332 2,849 2,607 733 6,280 5,346 4,920 3,929 3,342 2,950 2,589 744 11,100 1,285 434 266 250 156 116 101 71 44 276,447 individual individual individual individual individual individual individual individual 67,300 63,700 27,800 17,900 9,564 4,857 30,500 23,200 16,800 5,306 3,398 2,966 2,442 715 1,358,294 grouped grouped individual individual grouped individual individual individual individual individual individual 516,737 1,020,000 138,000 129,000 24,400 19,400 26,600 604 290 663,305 137,000 64,300 43,900 34,500 28,100 24,600 19,300 16,500 16,000 15,500 12,600 118,000 64,300 44,000 34,800 30,700 24,300 19,900 16,700 16,200 14,900 12,600 individual individual individual individual individual individual individual individual individual individual grouped 190,397 67,300 63,700 27,800 16,500 9,565 5,532 1,332,800 1,020,000 142,000 131,000 20,800 19,000 16 Zambia Niger Mali Burkina Faso Malawi Rwanda Guinea Senegal Benin Burundi Sierra Leone Mauritania Lesotho Gambia, The Comoros Congo, Dem. Rep. Sudan Mozambique Angola Chad Somalia Togo Central African Republic Eritrea Congo, Rep. 
Liberia Namibia Botswana Guinea-Bissau Gabon Mauritius Swaziland Cape Verde Equatorial Guinea São Tomé and Principe Seychelles 12,600 11,800 11,100 10,800 10,300 8,024 7,929 7,914 6,718 6,563 4,509 2,668 1,743 1,217 554 10,700 11,800 11,600 11,300 11,500 8,025 8,434 10,300 7,197 6,486 4,509 2,645 1,788 1,316 540 50,100 32,900 17,900 13,800 8,216 7,012 5,364 3,777 3,557 3,438 3,065 1,894 1,754 1,366 1,272 1,187 1,045 451 449 140 81 grouped grouped individual individual grouped grouped individual individual individual individual grouped individual grouped individual grouped

References [needs updating]:

Atkinson, A. B., and A. Brandolini (2001), “Promises and Pitfalls in the Use of Secondary Data-Sets: Income Inequality in OECD Countries as a Case Study,” Journal of Economic Literature, 39, 771–800.

Bourguignon, F. and C. Morrisson (1998), “Inequality and Development: The Role of Dualism,” Journal of Development Economics, 57(2), December, 233–58.

Deininger, K. and L. Squire (1996), “A New Data Set Measuring Income Inequality,” World Bank Economic Review, 10(3), 565–91.

Deininger, K. and L. Squire (1998), “New Ways of Looking at Old Issues: Inequality and Growth,” Journal of Development Economics, 57(2), December.

Fields, G. (1989), “A Compendium of Data on Inequality and Poverty for the Developing World,” Cornell University (mimeograph).

Gottschalk, P., and T. M. Smeeding (1997), “Cross-National Comparisons of Earnings and Income Inequality,” Journal of Economic Literature, 35, 633–687.

Székely, M. and M. Hilgert (1999), “What’s Behind the Inequality We Measure: An Investigation Using Latin American Data,” Working Paper no. 409, Inter-American Development Bank.

Put the following somewhere in the main text:

A Note on Imputing Sector of Employment Data using Household Surveys

The Linkage Global Computable General Equilibrium model and the micro data of the Global Income Distribution Dynamics (GIDD) model are linked through several aggregate variables.
More specifically, the wage rates and employment levels of the agricultural and non-agricultural segments of the economy are among the crucial link variables. A first step in assembling the dataset for the Linkage-GIDD modeling framework thus consists of identifying the variable “sector of employment/occupation” in the original survey data. This variable is recorded in only a subset of the surveys used in the GIDD framework (30 of the 73 household surveys in the GIDD dataset report it), so an imputation methodology had to be devised. This note explains how the missing variable was estimated for the surveys where it was not available.

The basic logic behind the imputation is that observable characteristics, at both the household and the individual level, are correlated with the probability of being employed in a given sector. The methods described in this note use this correlation to assign to each household head a probability of being employed in a particular sector.

According to the GIDD surveys that report sector of employment, less than 3 percent of household heads in urban areas work in agriculture as their main activity, whereas 65 percent of heads in rural areas derive their income from farming (see Table 1). Given this strong association, and the fact that the household’s stratum (rural/urban) is available in all the surveys, we assume that no household head located in an urban area is engaged in agricultural activities.

Table 1: Distribution of Agricultural Employment in Rural vs. Urban Areas (%)

                              Stratum
Sector of Employment      Urban     Rural
Agriculture                 2.7        65
Non-Agriculture            97.3        35
Total                       100       100

Note: With data from the GIDD

Define $\Pr(A_i = 1)$ as the probability of observing individual $i$ employed in the agricultural sector, let $X_i$ denote a vector of observable personal and household characteristics of individual $i$ that affect this probability, and let $\varepsilon_i$ be a zero-mean, normally distributed random component. For all households located in rural areas in surveys with sector-of-employment information, the following model was estimated:

$$\Pr(A_i = 1) = X_i \beta + \varepsilon_i \qquad (1)$$

where $\beta$ is a vector of parameters relating the characteristics $X_i$ to the probability of being part of the agricultural sector. In countries without information on sector of employment, we still observe the vector of personal and household characteristics $X_i$. Under certain assumptions, we can therefore use $\hat{\beta}$ to impute to each household head a probability of being part of the agricultural sector. The crucial assumption is that the parameters estimated for countries with employment information ($\hat{\beta}$) are valid for countries without it. In other words, we have to assume that the countries with employment information are a random sample of our universe of 73 countries. If this assumption holds, the expected probability of being part of the agricultural sector for out-of-sample households is:

$$E[\Pr(A_i = 1) \mid X_i] = X_i \hat{\beta} \qquad (2)$$

The variables included in the vector $X_i$ are the age, gender, and education level of the head, plus household size and per-capita household income. The results of estimating equation (1) are presented in Table 2. Although the pseudo-R² is quite low, all independent variables have a statistically significant effect on the probability of being employed in the agricultural sector.
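As an illustration of the out-of-sample imputation in equation (2), the following minimal sketch computes the index X_i β̂ for one rural household head from the coefficients reported in Table 2. Two things here are assumptions rather than statements from the text. First, the sketch maps the index through the standard normal CDF, on the reading that the normally distributed error term, z-statistics, and pseudo-R² point to a probit specification; equation (2) itself is written in linear form. Second, the example household, the coding of the education variable, and the income units are hypothetical, since the text does not specify them.

```python
import math

# Coefficients as reported in Table 2 (rounded there to three decimals).
BETA = {
    "age": 0.002,
    "female": -0.421,      # Gender (1 = Female)
    "education": -0.307,   # Education level; its coding is not given in the text
    "hh_size": 0.005,
    "pc_income": -0.003,   # Per-capita household income; units not given
}
CONSTANT = 1.448


def linear_index(head):
    """Return X_i * beta_hat for one household head (dict of characteristics)."""
    return CONSTANT + sum(BETA[name] * head[name] for name in BETA)


def normal_cdf(z):
    """Standard normal CDF, written with math.erf from the standard library."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def agri_probability(head):
    # ASSUMPTION: probit link. The source writes equation (2) as X_i * beta_hat;
    # the normal CDF is applied here so that the result lies between 0 and 1.
    return normal_cdf(linear_index(head))


# Hypothetical rural household head, for illustration only.
example_head = {"age": 40, "female": 0, "education": 1, "hh_size": 5, "pc_income": 100}

index = linear_index(example_head)
prob = agri_probability(example_head)
print(f"index = {index:.3f}, imputed probability = {prob:.3f}")
```

Only the ordering of these scores within a country matters for the assignment step that follows, so the choice between the raw index and its normal-CDF transform does not change which households are ranked highest.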
Table 2: Probability of Being Employed in the Agricultural Sector

                          Coefficient   Robust Standard Error   z-statistic   p-value
Age                             0.002                   0.000           5.6     0.000
Gender (1=Female)              -0.421                   0.014         -30.2     0.000
Education Level                -0.307                   0.008         -40.8     0.000
Household Size                  0.005                   0.002           2.4     0.017
Per-capita HH Income           -0.003                   0.000         -26.6     0.000
Constant                        1.448                   0.029          50.5     0.000

Number of Observations        312,045
Pseudo R²                       0.068
Wald χ²(5)                       5514
Prob. > χ²                      0.000

Note: Authors’ own estimations using data from the GIDD

The coefficients shown in Table 2 are used to assign a propensity score ($X_i \hat{\beta}$), that is, the probability of being a farmer, to each rural household in countries without agricultural employment information. The last piece of information we need is the proportion of households, at the national level, whose head is part of the agricultural sector. This specific information is not available; however, the World Development Indicators (WDI) report the proportion of total employment that is in the agricultural sector. We assume that this proportion is close enough to the proportion of heads in farming activities and therefore use it as the parameter determining the number of households assigned to the agricultural sector. For each country, rural households are ranked according to their probability of being part of the farming sector. Households are then assigned to the agricultural sector in descending order of propensity score until the proportion of households in the agricultural sector matches the proportion of employment in agriculture reported in the WDI.

RAFA: A sentence (and perhaps a quantitative measure) on validation: didn’t we try to test this method on a few countries and were quite satisfied…
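The ranking-and-assignment step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation. The function name `assign_agriculture`, the record fields (`id`, `rural`, `score`), and the example data are hypothetical. The sketch also assumes unweighted households and measures the agricultural share against all households in the country, whereas a real application would presumably use survey weights.

```python
def assign_agriculture(households, wdi_share):
    """Flag households as agricultural until the national share matches wdi_share.

    households: list of dicts with keys "id", "rural" (bool), "score" (propensity).
    wdi_share:  share of employment in agriculture, between 0 and 1.
    Returns a dict mapping household id -> True if assigned to agriculture.
    """
    # ASSUMPTION: each household counts equally (no survey weights).
    n_target = round(wdi_share * len(households))

    # Urban heads are assumed to be non-agricultural, so only rural
    # households are candidates for assignment.
    rural = [h for h in households if h["rural"]]

    # Rank rural households from highest to lowest propensity score.
    ranked = sorted(rural, key=lambda h: h["score"], reverse=True)

    # Take the top-ranked households until the target count is reached
    # (or until rural households run out).
    chosen = {h["id"] for h in ranked[:n_target]}
    return {h["id"]: h["id"] in chosen for h in households}


# Hypothetical country: six rural and four urban households.
example = [
    {"id": 1, "rural": True, "score": 0.9},
    {"id": 2, "rural": True, "score": 0.2},
    {"id": 3, "rural": True, "score": 0.7},
    {"id": 4, "rural": True, "score": 0.5},
    {"id": 5, "rural": True, "score": 0.8},
    {"id": 6, "rural": True, "score": 0.1},
    {"id": 7, "rural": False, "score": 0.95},
    {"id": 8, "rural": False, "score": 0.6},
    {"id": 9, "rural": False, "score": 0.3},
    {"id": 10, "rural": False, "score": 0.4},
]

assignment = assign_agriculture(example, wdi_share=0.4)
agri_ids = sorted(i for i, flag in assignment.items() if flag)
print(agri_ids)
```

Because the target is defined at the national level while only rural households are eligible, the sketch stops early if the WDI share implies more agricultural households than there are rural households in the survey.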