Peter Lanjouw, DECPI
March 2011

Three Topics
1. Small area estimation of poverty: “Poverty Maps”
2. Comparing poverty with non-comparable data
3. Using repeated cross-sections to explore movements in and out of poverty

Introduction
◦ What is a poverty map?
◦ Why is there demand?
Poverty mapping methodology
Examining the underlying assumptions

Project within World Bank’s Research Department
◦ In collaboration with academics
Methodological papers
◦ Elbers, Lanjouw and Lanjouw (2003, Econometrica)
◦ Hentschel et al. (2000) and ELL (2000, 2002)
PovMap Software (Qinghua Zhao)

Goal is to produce accurate estimates of welfare at small area level: “poverty maps”
Not necessarily “maps”; rather, highly disaggregated databases of welfare
◦ Poverty
◦ Inequality
◦ Average income/consumption
◦ Calorie intake
◦ Under-nutrition
◦ Other indicators (health outcomes? life expectancy?)
Disaggregation may, but need not, be spatial
Poverty of “statistically invisible” groups

Growing world-wide interest in having access to local-level information
1. Process of decentralization:
◦ Sub-national governments (state, municipality, ...) are expected to devise and implement policies
◦ It is important to know which localities should be prioritized
  ◦ Need to compare poverty across localities
2. Geographic targeting of resources
◦ Fine geographic targeting typically results in less leakage than coarse targeting.
◦ Simulations from Ecuador, Cambodia and Madagascar: the poverty reduction attainable by a uniform lump-sum transfer can be achieved with less than one third of the total funds available if the funds are targeted to the poorest communities (on average fewer than 2,000 households).
◦ Of course, implementation of fine targeting can also be more costly

The main source of information on consumption welfare, household expenditure surveys, permits only limited disaggregation
Very large data sources (e.g. census) typically collect very limited information on welfare outcomes

1. Collect larger samples
◦ Expensive
◦ Some kind of data compromise
2. Combine limited information available in data, such as the census, into some welfare index (e.g. “basic needs index” or “asset index”)
◦ Ad hoc, easily leads to multiple maps
◦ How to interpret?
◦ Measures of welfare do not line up with official numbers at the national/regional level

Impute a measure of welfare from household survey into census, using statistical prediction methods
◦ Produces readily interpretable estimates: works with exactly the same concept of welfare as traditional survey-based analysis.
◦ Statistical precision can be gauged
◦ Encouraging results to date
◦ But, non-negligible data requirements
  ◦ Survey and census have variables in common (questionnaires must correspond)
  ◦ Common variables are sufficiently correlated with consumption
  ◦ Survey and census can be linked at the cluster level
  ◦ Census includes variables that capture location-specific effects (or 3rd data set)
  ◦ Census enjoys large coverage

ELL (2002, 2003)
◦ Estimate a model of, for example, per-capita consumption y_h using sample survey data
◦ Restrict explanatory variables to those that can be linked to households in survey and census
◦ Estimate expected level of poverty or inequality for a target population using its census-based characteristics and the estimates from the model of y

◦ Zero stage: establish comparability of data sources; identify/merge common variables; understand sampling design; GIS info(?)
◦ First stage: estimate model of consumption/income
◦ Second stage: take parameter estimates to census, predict consumption, and estimate poverty and inequality.

Let W(m, y) be a welfare measure based on a vector of household per-capita expenditures, y, and household sizes, m. We want to estimate W for a target population (say a municipality, v) where y is unknown.
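One concrete choice of W(m, y) is the FGT family, whose headcount member FGT(0) is the poverty measure used in the validation results of this deck. The following is a minimal Python sketch, assuming each household is weighted by its size m; the function name `fgt`, the poverty line, and the three sample households are illustrative assumptions, not values from the presentation.

```python
import numpy as np

def fgt(m, y, poverty_line, alpha=0):
    """Household-size-weighted FGT(alpha) index: one possible W(m, y)."""
    m = np.asarray(m, dtype=float)   # household sizes
    y = np.asarray(y, dtype=float)   # per-capita expenditures
    poor = y < poverty_line
    # Normalised shortfall below the line, zero for the non-poor.
    gap = np.where(poor, (poverty_line - y) / poverty_line, 0.0)
    # For alpha = 0 use the poor indicator directly (0**0 would count everyone).
    contrib = poor.astype(float) if alpha == 0 else gap ** alpha
    return float(np.sum(m * contrib) / np.sum(m))

# Made-up example: three households.
m = [4, 2, 6]
y = [50.0, 120.0, 80.0]
p0 = fgt(m, y, poverty_line=100.0, alpha=0)  # headcount, share of people who are poor
p1 = fgt(m, y, poverty_line=100.0, alpha=1)  # poverty gap
print(p0, p1)
```

Weighting by m makes the index a statement about people rather than households, which is why W takes household sizes as an argument alongside y.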
We estimate a model of consumption/income per capita:

$$\ln y_{ch} = E[\ln y_{ch} \mid z_{ch}] + u_{ch} = z_{ch}^{T}\beta + \eta_c + \varepsilon_{ch},$$

where η_c is a cluster random effect allowing for a locational influence on consumption and ε_ch is the household-specific component of the disturbance.

First Stage:
◦ Estimate separate regressions per stratum
◦ Use cluster weights where significant
◦ Allow for non-normality of disturbances (parametric/non-parametric), and
◦ Heteroskedasticity in individual-specific component of disturbances.
◦ Model is estimated by GLS
◦ Modelling criteria: explanatory power, significance of parameters, parsimony (overfitting), size of location effect.

Second Stage: Simulation into Census
◦ For each household, simulated slope coefficients, $\tilde\beta^{r}$, and simulated disturbance terms, $\tilde\eta_c^{r}$ and $\tilde\varepsilon_{ch}^{r}$, are drawn from their corresponding distributions (parametric or semi-parametric)
◦ Simulate per capita expenditure/income per household:

$$\hat y_{ch}^{r} = \exp\left(x_{ch}^{T}\tilde\beta^{r} + \tilde\eta_c^{r} + \tilde\varepsilon_{ch}^{r}\right)$$

  where x_ch holds the census values of the model regressors (written z_ch above)
◦ Repeat for r = 1, ..., R simulations
◦ Calculate poverty in target population for each simulation.
◦ Welfare estimate is mean estimate across the R simulations
◦ Standard error is standard deviation across simulations.

The error in the estimator can be decomposed as:

$$W - \hat\mu = (W - \mu) + (\mu - \hat\mu),$$

where μ is the expected value of W given the census characteristics and $\hat\mu$ is its estimate.
◦ Idiosyncratic error, $W - \mu$: increases with smaller populations.
◦ Model error, $\mu - \hat\mu$: not related to size of target population.
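The second-stage simulation can be sketched in Python as follows. This is a simplified illustration, not the PovMap implementation: it assumes normally distributed, homoskedastic disturbances and a normal sampling distribution for the coefficients, whereas the method described above also allows non-normal and heteroskedastic errors. The function name, argument names, and all numbers in the usage lines are hypothetical.

```python
import numpy as np

def simulate_headcount(X, cluster, beta_hat, beta_cov, sigma_eta, sigma_eps,
                       m, poverty_line, R=200, seed=0):
    """Mean and standard deviation of the simulated headcount over R draws."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    m = np.asarray(m, dtype=float)
    # Map cluster labels to 0..C-1 so each cluster shares one eta draw.
    _, cluster_idx = np.unique(np.asarray(cluster).ravel(), return_inverse=True)
    n_clusters = cluster_idx.max() + 1
    estimates = np.empty(R)
    for r in range(R):
        beta_r = rng.multivariate_normal(beta_hat, beta_cov)   # simulated coefficients
        eta_r = rng.normal(0.0, sigma_eta, size=n_clusters)    # cluster effects
        eps_r = rng.normal(0.0, sigma_eps, size=X.shape[0])    # household effects
        y_r = np.exp(X @ beta_r + eta_r[cluster_idx] + eps_r)  # simulated consumption
        # Size-weighted headcount in the target population for this draw.
        estimates[r] = np.sum(m * (y_r < poverty_line)) / np.sum(m)
    return float(estimates.mean()), float(estimates.std(ddof=1))

# Made-up "census" for one target area: 500 households in 25 clusters.
data_rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), data_rng.normal(size=n)])
cluster = data_rng.integers(0, 25, size=n)
m = data_rng.integers(1, 8, size=n)
est, se = simulate_headcount(X, cluster, beta_hat=np.array([4.5, 0.3]),
                             beta_cov=np.diag([1e-3, 1e-3]),
                             sigma_eta=0.1, sigma_eps=0.6,
                             m=m, poverty_line=80.0)
print(est, se)
```

The point estimate is the mean across draws and the reported standard error is the standard deviation across draws, matching the last two bullets of the second stage.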
Other elements can include:
◦ Computation error: part of model error, can be negligible with sufficient number of simulations

Model accurately describes consumption for each level to which it is applied
◦ Conditional distribution of y given x in small area A is the same as in larger region R
◦ Tarozzi and Deaton (2007) refer to this as the “Area homogeneity assumption”
A shared cluster error is able to provide an accurate account of the spatial correlation between households
◦ Presence of spatial correlation will diminish precision of estimates
Validation studies are needed to check on these assumptions
◦ Elbers, Lanjouw and Leite (2008) provide validation study for Brazil

Elbers, Lanjouw and Leite (2008) consider Minas Gerais, Brazil
Brazil Census collects income data
◦ Thin round (87.5%) collects single-question measure of household income
◦ Thick round (12.5%) collects more detailed info.
◦ Neither is judged reliable for an ‘official’ poverty map.
We focus on Minas Gerais (for computational ease)
◦ 606,000 households in 12.5% sample (out of 4.8m)
◦ 12.5% sample covers all 853 municipalities in Minas Gerais

[Maps: Per Capita Income; Infant Mortality Rate; Life Expectancy]

We draw 41 synthetic surveys from Census sample
◦ 21 mimic sample design of POF: 2,800 households
  ◦ 13 households per cluster/EA
  ◦ 241 EAs in about 151 municipalities
◦ 20 mimic sample design of PNAD: 12,000 households
  ◦ 16 households per cluster/EA
  ◦ 779 EAs in 123 municipalities
We produce 41 poverty maps for Minas Gerais
◦ We estimate location effect at EA level
◦ We apply location effect at municipality level (Tarozzi and Deaton’s conservative approach)
Estimate 41 models

Table 6: Local Effect and R2 estimated on the basis of our surveys

Sample type            #    R2     σ_u    σ_η    σ_η² / σ_u²
'PNAD' (obs: 11,721)   1    0.603  0.686  0.101  0.0214
'PNAD'                 2    0.603  0.686  0.109  0.0252
'PNAD'                 3    0.596  0.696  0.099  0.0201
'PNAD'                 4    0.600  0.690  0.110  0.0253
'PNAD'                 5    0.606  0.678  0.102  0.0225
'PNAD'                 6    0.572  0.713  0.115  0.0258
'PNAD'                 7    0.576  0.707  0.104  0.0215
'PNAD'                 8    0.567  0.714  0.109  0.0232
'PNAD'                 9    0.585  0.696  0.108  0.0239
'PNAD'                 10   0.571  0.711  0.115  0.0260
'PNAD'                 11   0.516  0.758  0.110  0.0211
'PNAD'                 12   0.566  0.723  0.108  0.0222
'PNAD'                 13   0.600  0.688  0.112  0.0264
'PNAD'                 14   0.565  0.715  0.106  0.0218
'PNAD'                 15   0.555  0.721  0.109  0.0228
'PNAD'                 16   0.584  0.699  0.110  0.0247
'PNAD'                 17   0.584  0.699  0.123  0.0311
'PNAD'                 18   0.569  0.712  0.123  0.0298
'PNAD'                 19   0.582  0.703  0.123  0.0303
'PNAD'                 20   0.565  0.718  0.112  0.0242
'POF' (obs: 2,800)     1    0.566  0.719  0.134  0.0348
'POF'                  2    0.578  0.694  0.124  0.0318
'POF'                  3    0.569  0.702  0.112  0.0252
'POF'                  4    0.576  0.690  0.122  0.0314
'POF'                  5    0.569  0.709  0.122  0.0298
'POF'                  6    0.590  0.693  0.127  0.0338
'POF'                  7    0.592  0.686  0.113  0.0273
'POF'                  8    0.576  0.698  0.127  0.0333
'POF'                  9    0.581  0.698  0.124  0.0318
'POF'                  10   0.597  0.688  0.096  0.0196
'POF'                  11   0.580  0.706  0.127  0.0325
'POF'                  12   0.586  0.686  0.107  0.0244
'POF'                  13   0.585  0.691  0.136  0.0387
'POF'                  14   0.593  0.689  0.102  0.0218
'POF'                  15   0.578  0.693  0.140  0.0409
'POF'                  16   0.587  0.690  0.132  0.0368
'POF'                  17   0.594  0.688  0.113  0.0270
'POF'                  18   0.620  0.663  0.113  0.0291
'POF'                  19   0.624  0.679  0.098  0.0209
'POF'                  20   0.622  0.664  0.118  0.0317
'POF'                  21   0.588  0.729  0.137  0.0353

Source: IBGE 12.5% Census and Author's Calculation.
Exercise 1: Differences in returns
◦ Apply one model in full census sample (specified in one PNAD sample)
◦ Re-estimate model separately in each municipality (again in PNAD sample)
◦ Compare predicted municipality-level income

[Figure 13: Predicted Value Estimations: Small Area model vs Regional Model, 'PNAD' sample. E[y | X, H(R)] plotted against E[y | X, H(A)]; R2 = 0.8394. 74.8% of E[., H(A)] fell within the 95% CI of E[., H(R)]; 95.9% of the 95% CIs of E[., H(A)] overlapped the 95% CI of E[., H(R)]. Source: IBGE 12.5% Census and Author's Calculation]

Municipal level Poverty Estimates versus “Truth”
[Figure 20a: FGT(0) measures at Municipality Level, Observed Values x Simulated Values, Poverty Map Simulations using 'PNAD' Type Sample. 'PNAD' estimate plotted against Census 12.5% value; R2 = 0.8807. Source: IBGE 12.5% Census and Author's Calculation]

Municipal level Poverty Estimates versus “Truth”
[Figure 20b: FGT(0) measures at Municipality Level, Observed Values x Simulated Values, Poverty Map Simulations using 'POF' Type Sample. 'POF' estimate plotted against Census 12.5% value; R2 = 0.8837. Source: IBGE 12.5% Census and Author's Calculation]

Overly precise estimates?
[Figure 25a: Share of municipalities where the 95% confidence interval encompasses the 'true' estimate, FGT(0). Share plotted for each of the 41 surveys. Source: IBGE 12.5% Census and Author's Calculation]

Overly precise estimates?
[Figure 26a: Histogram of the number of good predictions at municipality level using PovMap: FGT(0). Density plotted against Share. Source: 12.5% Census, PovMap Estimations and Author's Calculation]

Are poverty estimates usable?
[Figure, two panels: “FGT(0) measures from Minas Gerais - Estimations on the basis of ELL”, plotted across municipalities; “Proportion of municipalities with significant HCR change versus Confidence Interval”, % of municipalities with significant changes plotted against Confidence Interval (%), series: ELL, Conservative]

Our evidence is quite supportive of ELL methodology and underlying assumptions.
BUT, evidence for one place need not imply assumptions hold everywhere.
Validation efforts like these must be undertaken wherever possible.
◦ Sometimes survey data do allow one to probe ELL assumptions explicitly. Clearly that should be done, whenever possible.
Proper validation can be built into planning and design of future poverty mapping activities.
◦ Involvement of census bureau is likely to be central.

Household surveys fielded at different times are rarely identical in every respect:
◦ Timing of fieldwork
◦ Refining/modification of questionnaire
◦ Aggregation/disaggregation of consumption components
◦ Shift from recall to diary, or changes in recall periods
Can poverty still be readily compared?
◦ “Great Indian Poverty Debate” focused on the 1999/2000 round of the NSS survey, which introduced a slightly altered recall period.
◦ Comparability of estimates was seriously compromised
  ◦ Deaton & Tarozzi, Himanshu and Sen, Kijima and Lanjouw

Can we use ELL method to impute consumption from one survey into another?
◦ i.e., impute consumption into survey of time t+1 based on a model estimated in survey of time t
◦ Are predicted poverty rates for t+1 similar to actual rates?
◦ If so, this implies parameter estimates from consumption model are stable over time
  ◦ All action is in changing X’s
Christiaensen, Lanjouw, Luoto and Stifel (2010) experiment with a variety of different consumption models in Kenya, Vietnam, and Russia
◦ “Test” this idea with surveys that are actually comparable, but pretend that the surveys are not.

Example of Vietnam
◦ Consider two household surveys: 1992/93 and 1997/98
◦ These surveys are generally regarded as high quality and fully comparable
◦ Poverty in Vietnam declined significantly in this period

           National  Rural   Urban
1992/93    60.6%     68.5%   28.6%
1997/98    37.4%     44.9%   9.0%

Indicative of major structural changes in Vietnam
◦ A priori expectation that “returns” are changing, i.e. stable parameter assumption would not hold.

“Poverty map” style models
Included in models (2)-(6), in varying combinations:
◦ Geographic indicators
◦ Household demographics
◦ Education/profession variables
◦ Housing quality variables
◦ Asset ownership variables

Region    Measure  (1) 1992/3 VLSS  (2)         (3)         (4)         (5)         (6)         (7) 1997/8 VLSS
National  P0       60.6             54.6 (1.4)  55.6 (1.4)  46.6 (1.4)  38.2 (1.3)  36.7 (1.4)  37.4
National  P1       19.0             16.5 (0.9)  17.2 (0.6)  13.4 (0.6)  10.5 (0.5)  9.4 (0.5)   9.5
Rural     P0       68.5             63.5 (2.0)  64.8 (1.7)  56.3 (1.5)  48.5 (1.7)  44.2 (2.1)  44.9
Rural     P1       22.0             19.1 (1.0)  19.7 (0.9)  16.8 (0.7)  13.7 (0.7)  11.5 (0.8)  11.6
Urban     P0       28.6             21.7 (2.6)  22.8 (2.3)  18.3 (1.8)  11.8 (1.3)  9.7 (1.5)   9.0
Urban     P1       7.3              5.4 (0.9)   5.6 (0.8)   4.5 (0.6)   2.6 (0.4)   1.9 (0.4)   1.7

Replicating this exercise in Russia, between 1994 and 1998, doesn’t work so well.
◦ Major financial crisis: poverty rose from 15.5% to 43.8%
◦ However, model for 1994 predicts poverty in 2003 reasonably well
  ◦ But little change in poverty over this time period

Broad Conclusion: assumption of stable parameters in roughly adjacent years seems reasonable.
◦ If there is a major crisis (earthquake, macroeconomic, etc.) then caution is warranted.
Goal:
◦ Explore whether repeated cross-sections, which are widely available, can be used to provide some reasonable, basic descriptives of transitions in and out of poverty.
Set out methods which we claim will give upper and lower bounds on mobility.
Validate these methods by using genuine panel data from Vietnam and Indonesia, generating repeated cross-sections from these panels, and comparing the results of our method to what one would estimate based on the genuine panel.

Combines ideas of poverty-mapping with pseudo-panel ideas.
Will set out for case of 2 rounds; can be extended easily to multiple rounds.
Let x_i1 be characteristics of household i in time period 1, which are observed in both the round 1 and round 2 surveys:
◦ All time-invariant characteristics (language, religion, ethnicity)
◦ Characteristics of household head if the head doesn’t change across rounds (sex, place of birth, parental education, etc.)
◦ Can include time-varying characteristics that can easily be recalled for round 1 in round 2
  ◦ E.g. whether household head was employed in round 1, place of residence in round 1, whether household has a TV in round 1, etc.

Project round 1 consumption or income onto x_i1:

$$y_{i1} = \beta_1' x_{i1} + \varepsilon_{i1}$$

Project round 2 consumption or income onto same set of characteristics as they appear at time of second round:

$$y_{i2} = \beta_2' x_{i2} + \varepsilon_{i2}$$

Then we are interested in knowing quantities such as:

$$P(y_{i1} < p \ \text{and} \ y_{i2} > p),$$

where p is the poverty line.
Don’t observe y_i1 and y_i2 for the same household

Step One: Use the sample of households observed in round 1, and regress y_i1 on x_i1
◦ Obtain the OLS estimator $\hat\beta_1$ and the residuals $\hat\varepsilon_{i1}$
Step Two: For each household observed in round 2, take a random draw with replacement, $\tilde\varepsilon_{i1}$, from the empirical distribution of residuals, then combine with parameter estimate and known x to estimate round 1 income or consumption:

$$\hat y_{i1} = \hat\beta_1' x_{i1} + \tilde\varepsilon_{i1}$$

Step Three: Calculate movements into and out of poverty using $\hat y_{i1}$ in place of the unobserved round 1 variable:

$$P(\hat y_{i1} < p \ \text{and} \ y_{i2} > p)$$

Step Four: Repeat steps 1-3 R times, and take average of the quantity of interest over the R replications.
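The four steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: it assumes a linear projection with an intercept fitted by OLS, uses a single set of characteristics x for both samples, and runs the deterministic OLS step once rather than inside every replication. The function name, argument names, and the synthetic data are hypothetical.

```python
import numpy as np

def upper_bound_exit_rate(y1_a, x_a, y2_b, x_b, poverty_line, R=200, seed=0):
    """Estimated share poor in round 1 and non-poor in round 2 (upper-bound method).

    y1_a, x_a: round 1 outcome and characteristics for the round 1 sample.
    y2_b, x_b: round 2 outcome and the same characteristics for the round 2 sample.
    """
    rng = np.random.default_rng(seed)
    y1_a = np.asarray(y1_a, dtype=float)
    y2_b = np.asarray(y2_b, dtype=float)
    Xa = np.column_stack([np.ones(len(y1_a)), x_a])
    Xb = np.column_stack([np.ones(len(y2_b)), x_b])
    # Step one: OLS of the round 1 outcome on x, keep the residuals.
    beta1, *_ = np.linalg.lstsq(Xa, y1_a, rcond=None)
    resid1 = y1_a - Xa @ beta1
    shares = np.empty(R)
    for r in range(R):
        # Step two: draw residuals with replacement, impute round 1 outcome.
        eps = rng.choice(resid1, size=len(y2_b), replace=True)
        y1_hat = Xb @ beta1 + eps
        # Step three: movement out of poverty using the imputed round 1 value.
        shares[r] = np.mean((y1_hat < poverty_line) & (y2_b > poverty_line))
    # Step four: average over the R replications.
    return float(shares.mean())

# Synthetic two-round data with independent errors across rounds (made up).
data_rng = np.random.default_rng(42)
n = 2000
x = data_rng.normal(size=n)
y1 = 1.0 + 0.5 * x + data_rng.normal(scale=0.5, size=n)
y2 = 1.2 + 0.5 * x + data_rng.normal(scale=0.5, size=n)
half = n // 2
exit_rate = upper_bound_exit_rate(y1[:half], x[:half], y2[half:], x[half:],
                                  poverty_line=1.0)
true_rate = float(np.mean((y1[half:] < 1.0) & (y2[half:] > 1.0)))
print(exit_rate, true_rate)
```

In this synthetic example the errors are independent across rounds, so the imputation should track the true exit rate closely; with positively correlated errors it would instead overstate mobility.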
Condition 1: the underlying population sampled is the same in round 1 and round 2
◦ Requires the measure of consumption to be the same from round to round, and no (non-random) changes in the underlying population from births, deaths, or migration out of sample. As with pseudo-panels in general, household analysis works best when restricted to households headed by prime-age adults
Condition 2: ε_i1 is independent of y_i2. This requires ε_i1 to be independent of ε_i2 (otherwise the distribution of ε_i1 | y_i2 > p is not the same as the unconditional distribution of ε_i1)
◦ Won’t hold if:
  ◦ Error term contains individual fixed effect
  ◦ Shocks to consumption or income are non-transitory.
Expect in many cases this condition to be violated.
So long as errors are positively correlated (which seems likely in most cases), this will overstate mobility, providing an upper bound on movements into and out of poverty.

Instead assume the prediction error for household i in round 1 is the same as it is for round 2 (perfect positive autocorrelation). This understates mobility, providing the lower bound.
Step One: For sample of households surveyed in round 2, obtain OLS residuals:

$$\hat\varepsilon_{i2} = y_{i2} - \hat\beta_2' x_{i2}$$

Step Two: Then estimate round 1 income or consumption as

$$\hat y_{i1} = \hat\beta_1' x_{i1} + \hat\varepsilon_{i2}$$

Step Three: Use the estimated y from step 2 to calculate poverty dynamic of interest.

Methods here aim to estimate same level of movements into and out of poverty as one would observe in genuine panel data.
◦ Some of this mobility will be due to measurement error. A variety of fixes in literature (e.g. Glewwe, 2005; Antman and McKenzie, 2007; Fields et al. 2007)
◦ Basic idea of these is to study mobility which is related to mobility in some underlying variable (e.g. health, cohort characteristics, assets)
◦ Not the goal here: we want to just see if we can match panel.
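The alternative procedure that carries each household's round 2 prediction error back to round 1 can be sketched in Python as follows. This is a hedged illustration, not the authors' code: it assumes linear OLS projections with an intercept and, for simplicity, that the characteristics x are time-invariant so the same x serves both rounds. Function and variable names and the synthetic data are hypothetical.

```python
import numpy as np

def lower_bound_exit_rate(y1_a, x_a, y2_b, x_b, poverty_line):
    """Share poor in round 1 and non-poor in round 2, reusing round 2 residuals."""
    y1_a = np.asarray(y1_a, dtype=float)
    y2_b = np.asarray(y2_b, dtype=float)
    Xa = np.column_stack([np.ones(len(y1_a)), x_a])
    Xb = np.column_stack([np.ones(len(y2_b)), x_b])
    # Round 1 coefficients come from the round 1 sample.
    beta1, *_ = np.linalg.lstsq(Xa, y1_a, rcond=None)
    # Step one: OLS residuals for the households surveyed in round 2.
    beta2, *_ = np.linalg.lstsq(Xb, y2_b, rcond=None)
    resid2 = y2_b - Xb @ beta2
    # Step two: impute round 1 outcome with the household's own round 2 residual.
    y1_hat = Xb @ beta1 + resid2
    # Step three: poverty dynamic of interest (here, exit from poverty).
    return float(np.mean((y1_hat < poverty_line) & (y2_b > poverty_line)))

# Synthetic data where the error really is identical in both rounds (made up).
data_rng = np.random.default_rng(7)
n = 2000
x = data_rng.normal(size=n)
e = data_rng.normal(scale=0.5, size=n)
y1 = 1.0 + 0.5 * x + e
y2 = 1.2 + 0.5 * x + e
half = n // 2
exit_rate = lower_bound_exit_rate(y1[:half], x[:half], y2[half:], x[half:],
                                  poverty_line=1.1)
true_rate = float(np.mean((y1[half:] < 1.1) & (y2[half:] > 1.1)))
print(exit_rate, true_rate)
```

Because the synthetic errors are perfectly autocorrelated, the assumption behind this procedure holds exactly here and the estimate should sit close to the true exit rate; with weaker autocorrelation it would understate mobility.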
Choose two genuine panels from Vietnam and Indonesia:
VLSS 1992/93 and 1997/98
◦ Period over which poverty fell from 58% to 37%, more households exiting poverty than entering
◦ Panel of approximately 4800 households
Indonesian Family Life Survey 1997 and 2000 (IFLS2 and 3)
◦ Static in terms of overall poverty levels, households moving into and out of poverty at similar rates
◦ Panel of 7500

Randomly split each genuine panel into two sub-samples, A and B.
◦ Use sub-sample A from round 1 and sub-sample B from round 2 as two repeated cross-sections.
◦ Then carry out our method by using sub-sample A to impute round 1 values for sub-sample B, and compare to results we would get using genuine panel for sub-sample B.

Consider a hierarchy of models which progressively employ more and more data that is sometimes, but not always, collected retrospectively.
Since we have the actual panel data to work with, we can force variables to be time-invariant by using round 1 variables.
Start with a basic “traditional model”, and add more regressors.
1. (Basic Model): gender of head, age of head as of round 1, birthplace of head (rural/urban), whether the head ever attended primary school, education of head’s parents, head’s religion and ethnicity.
2. Add locational dummies for where household was living in round 1.
3. Add community variables from round 1 (e.g. village has electricity, village has a stone road, community has a primary school)
4. Head’s sector of work and education in round 1
5. Demographic variables from round 1 (household size, number of children)
6. Household’s assets and housing quality as of round 1, e.g. did household own TV, radio, what sort of roof and floor did it have?
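The sample-splitting design described above can be sketched in Python as follows. This is only an illustration of the bookkeeping, assuming households are identified by an id; the function name `split_panel` and the ten-household panel are made up.

```python
import numpy as np

def split_panel(household_ids, seed=0):
    """Randomly split panel household ids into two halves, A and B."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.asarray(household_ids))
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical panel: household id -> (round 1 consumption, round 2 consumption).
panel = {hid: (100.0 + hid, 110.0 + hid) for hid in range(10)}
ids_a, ids_b = split_panel(list(panel))

# Sub-sample A contributes only round 1 values, sub-sample B only round 2 values,
# which mimics two independent cross-sections.
round1_cross_section = {int(h): panel[int(h)][0] for h in ids_a}
round2_cross_section = {int(h): panel[int(h)][1] for h in ids_b}

# B's observed round 1 values are held back as the panel benchmark.
benchmark_b = {int(h): panel[int(h)] for h in ids_b}
print(sorted(round1_cross_section), sorted(round2_cross_section))
```

Holding back sub-sample B's true round 1 values is what allows the imputed transitions to be compared with what the genuine panel shows for the same households.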
Table 1: Poverty Headcount, Round 1

                                Lower Bound      Truth           Upper Bound
Data source                     Basic   Full     95% CI          Basic   Full
IFLS: 1997 Poverty Rate (P0)    0.147   0.159    0.145  0.188    0.120   0.159
VLSS: 1992 Poverty Rate (P0)    0.611   0.592    0.597  0.682    0.562   0.622

Method seems to be getting levels close

Recall our claim was that the residuals would likely be positively autocorrelated, making our first method an upper bound, and that this correlation would shrink as we add more variables to the model. This is what we see:

Table 2: Correlation Between Round 1 and Round 2 Residuals

Model       1      2      3      4      5      6
Indonesia   0.474  0.466  0.464  0.452  0.408  0.348
Vietnam     0.653  0.575  0.563  0.539  0.523  0.420

Columns 1-6 build increasingly rich models of consumption.

Table 3: Poverty Dynamics from “Pseudo” Panel and Actual Panel Data

Indonesia                Lower Bounds     Truth           Upper Bounds
1997, 2000 Statuses      Basic   Full     95% CI          Basic   Full
Poor, Poor               0.115   0.105    0.047  0.070    0.024   0.037
Poor, Nonpoor            0.015   0.031    0.065  0.088    0.097   0.090
Nonpoor, Poor            0.021   0.030    0.065  0.088    0.111   0.099
Nonpoor, Nonpoor         0.848   0.832    0.759  0.801    0.766   0.774

Vietnam                  Lower Bounds     Truth           Upper Bounds
1992, 1998 Statuses      Basic   Full     95% CI          Basic   Full
Poor, Poor               0.360   0.322    0.275  0.360    0.227   0.288
Poor, Nonpoor            0.241   0.274    0.261  0.324    0.331   0.308
Nonpoor, Poor            0.000   0.039    0.034  0.060    0.138   0.077
Nonpoor, Nonpoor         0.398   0.366    0.300  0.386    0.305   0.327

For both countries, round 1 year is predicted, round 2 is "truth"

Bounds not that wide:
◦ Full model would lead us to estimate 3-9% of households in Indonesia and 27-31% of households in Vietnam exited poverty over 2 rounds.
◦ Genuine panel would say 7-9% in Indonesia and 26-32% in Vietnam
More detailed model for consumption with higher R2 leads to narrower bounds
◦ E.g. bounds of 0.021-0.111 using basic model vs (0.033-0.099) using full model for entry into poverty rate in Indonesia.

Genuine panel data are rare, and even the best panels are often smaller in scale & frequency than cross-sectional surveys.
E.g. Indonesia IFLS is one of, if not the, best developing country panel out there
◦ But not nationally representative
◦ Sample size of around 7000 households
◦ Low frequency
◦ Vs SUSENAS: annual, nationally representative (and representative at district level), around 200,000 households!
Policymakers and academics do care about movements into and out of poverty. It would be nice to be able to say something regularly and in most countries, even if what we can say is relatively basic.

We’ve provided a method of using repeated cross-sections to obtain bounds on movements into and out of poverty
◦ Validated this against genuine panel data
◦ Found the bounds can be narrow enough in practice to be useful
However, method works best when full range of variables used, some of which are not typically asked retrospectively in surveys
◦ But no reason why they can’t be, and much cheaper to add a few of these questions than field a panel
=> Seems worth experimenting with inclusion of some such questions in upcoming surveys.

Christiaensen, L., Lanjouw, P., Luoto, J. and Stifel, D. (2010) ‘The Reliability of Small Area Estimation Prediction Methods to Track Poverty’, WIDER Working Paper No. 2010/99.
Elbers, C., Lanjouw, J. and Lanjouw, P. (2003) ‘Micro-level Estimation of Poverty and Inequality’, Econometrica, Vol. 71(1), January, 355-364.
Elbers, C., Lanjouw, J.O. and Lanjouw, P. (2002) ‘Micro-Level Estimation of Welfare’, Policy Research Working Paper 2911, DECRG, the World Bank.
Elbers, C., Lanjouw, P. and Leite, P. (2008) ‘Brazil within Brazil: Testing the Poverty Map Methodology in Minas Gerais’, Policy Research Working Paper WPS 4513, DECRG, the World Bank.
Lanjouw, P., Luoto, J. and McKenzie, D. (2011) ‘Using Repeated Cross Sections to Explore Movements in and out of Poverty’, Policy Research Working Paper WPS 5550, DECRG, the World Bank.