A Stata program for calibration weighting
John D’Souza
National Centre for Social Research

Outline
Description of calibration
    Adjust selection weights so that a weighted sample exactly matches the population
    Generalizes post-stratification
    Several methods: linear, logistic, …
SAS, GenStat
A new Stata program
Limitations and extensions

Sampling
Selection weights: $d_k = 1/P(\text{person } k \text{ is chosen})$
Sample frame variables $X_{k1}, \ldots, X_{kJ}$ with known population totals $P_1, \ldots, P_J$.
Horvitz-Thompson estimator of $P_i$:
    $\sum_{k \in S} d_k X_{ki} \approx P_i$ for $i = 1, 2, \ldots, J$.
Calibration: adjust $d_k$ to get calibration weights $w_k$, giving exact equality:
    $\sum_{k \in S} w_k X_{ki} = P_i$ for $i = 1, 2, \ldots, J$.

Example: School Census
Variables include
    Age, Gender, Ethnic Group, Exam results
    Type of School, Region
    Pupil’s Free School Meal (FSM) eligibility
We calibrate to J variables, e.g.
    Boy (binary)
    Girl (binary)
    Region (e.g. four categories)
    FSM eligibility (binary)
    J = 1 + 1 + (4 - 1) + 1 = 6

Special case: post-stratification
Simplest case: one categorical variable
    Easy to deal with (post-stratification)
    svyset , poststrata() postweight()
More general case: several variables (categorical and numerical)

Deville and Särndal (1992)
Minimize the “distance” between w and d subject to the J calibration constraints.
Linear calibration: minimize $\sum_{k \in S} (w_k - d_k)^2 / d_k$
    Involves solving J simultaneous linear equations
Logistic calibration: minimize $\sum_{k \in S} \left( w_k \log(w_k / d_k) - w_k + d_k \right)$
    Involves solving J simultaneous non-linear equations

GenStat, SAS, Stata
GenStat and SAS
    Methods: linear, logistic and bounded.
    Estimation: GenStat gives SEs.
    SAS handles categorical variables directly. Enter as indicator variables in GenStat.
Stata
    Post-stratification (calibration to one categorical variable). Gives SEs.
    No routine for general calibration.

A new Stata program
Typical syntax.
    matrix M = (10000, 10000, 3000, 4000, 3000, 8000)
    calibrate , entrywt(w1) exitwt(w2) poptot(M) ///
        marginals(boy girl FSM ireg1-ireg3) ///
        method(linear) print(final)

10,000 boys, 10,000 girls, 3,000 FSM
Variables boy, girl, FSM are binary
Categorical variable region (4 categories) is turned into 4 binary indicator variables. Only 3 are entered in the syntax (collinearity).

Output

    Variable   Pop total   Weighted (entrywt)   Weighted (exitwt)            R
    boy            10000            9619.7188               10000    .21373408
    girl           10000            10380.281               10000    .13733883
    FSM             3000            2915.4929                3000    .04710333
    ireg1           4000            4056.3379                4000   -.19511394
    ireg2           3000            3197.1831                3000   -.24808005
    ireg3           8000             8507.042                8000    -.2391432

Options
Options available to:
    Control amount of output/graphs
    Set max number of iterations/tolerance
Methods: linear, logistic, bounded linear (blinear) and nonresp
    (blinear sets bounds for $w_k / d_k$. GenStat and SAS have something very similar.)
    (nonresp adjusts for non-response; see below)

Limitations (1)
Solves the equations by finding a matrix inverse
    1. Won’t work if J is large
    2. Can have problems with singular or nearly singular matrices
    3. Iterative methods (logistic, blinear) won’t always converge
No obvious solution to 1.
Problems 2 and 3 are usually due to problems with the data.

Limitations (2)
We need to recode categorical variables (SAS doesn’t)
    Stata: tab region, gen(ireg)
More complicated (e.g. two-phase) problems aren’t handled directly
    Need a bit of syntax to handle this
    Other packages can handle this directly

Extensions: Standard errors
Calibration weights are often incorrectly treated as selection weights.

    calibrate , entrywt(w1) exitwt(w2) poptot(M) ///
        marginals(boy girl FSM ireg1-ireg3)
    calibmean , selwt(w1) calibwt(w2) yvar(y) ///
        marginals(boy girl FSM ireg1-ireg3) ///
        psu(school) designops (strata(region))

This generalizes the poststrata() option of Stata’s svyset.

Extension: Method nonresp (1)
Example
    Select schools, then classes, then pupils
    Assume all schools respond, pupils might not
Variables available on responders.
(Pop totals available)
    Gender, Exam results, FSM, Region
Variables on non-responders. (Pop totals not available)
    PTratio: Pupil-teacher ratio
    topset: Is pupil in the top set?

Extension: Method nonresp (2)

         serial   region   topset   outc   sex   FSM
        ---------------------------------------------
     1.    1001        1        1      0     .     .
     2.    1002        1        0      1     1     0
     3.    1003        2        0      0     .     .
     4.    1004        1        0      1     1     1
     5.    1005        3        1      0     .     .
        ---------------------------------------------
     6.    1006        1        0      1     0     1
     7.    1007        3        1      1     1     0
     8.    1008        2        1      0     .     .
     9.    1009        1        0      1     1     0

Extension: Method nonresp (3)
Population totals unknown, but variables are available on all the sample (including non-responders)

    calibrate , entrywt(w1) exitwt(w2) poptot(M) ///
        marginals(boy girl FSM ireg1-ireg3) ///
        method(nonresp) outc(outc) ///
        svars(PTratio topset)

Responders weighted to pop totals on “marginals” and to selected sample totals on “svars” (Lundström & Särndal, 2005)

Conclusions
We’ve found the program can handle many practical problems
Easy to calculate SEs (but theory assumes no non-response)
Method nonresp isn’t available in many packages
We don’t have to calibrate to population totals
    E.g. calibrate Wave n+1 of a survey to totals from Wave n
    Calibrate one sample to look like another

Questions

References
Deville, J.-C. and Särndal, C.-E. 1992. Calibration estimators in survey sampling. Journal of the American Statistical Association 87: 376-382.
    Background and theory behind calibration
Lundström, S. and Särndal, C.-E. 2005. Estimation in Surveys with Nonresponse. Wiley.
    Deals with non-response
Singh, A.C. and Mohl, C.A. 1996. Understanding calibration estimators in survey sampling. Survey Methodology 22: 107-115.
    Discusses several methods of doing bounded calibration
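
Supplementary sketch: linear calibration
As an illustration of the linear method described in the slides, the block below is a minimal Python sketch, not the code of the Stata program. It assumes the standard closed form for the linear distance: the calibrated weights are w_k = d_k (1 + x_k'λ), where λ solves the J linear equations (Σ d_k x_k x_k') λ = P - Σ d_k x_k. The function names (solve_linear_system, linear_calibrate) and the six-pupil toy data are hypothetical and chosen only for this example.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            # Mirrors the "singular or nearly singular" limitation.
            raise ValueError("singular or nearly singular calibration matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def linear_calibrate(d, x, totals):
    """Return weights w_k = d_k * (1 + x_k . lam) with sum_k w_k x_k = totals."""
    j = len(totals)
    # J x J matrix: sum_k d_k x_k x_k'
    t = [[sum(dk * xk[a] * xk[b] for dk, xk in zip(d, x)) for b in range(j)]
         for a in range(j)]
    # Gap between population totals and Horvitz-Thompson estimates.
    gap = [totals[a] - sum(dk * xk[a] for dk, xk in zip(d, x))
           for a in range(j)]
    lam = solve_linear_system(t, gap)
    return [dk * (1.0 + sum(l * v for l, v in zip(lam, xk)))
            for dk, xk in zip(d, x)]


# Hypothetical toy sample: columns are boy, girl, FSM (all binary).
x = [(1, 0, 1), (1, 0, 0), (0, 1, 0), (0, 1, 1), (1, 0, 0), (0, 1, 0)]
d = [10.0] * 6                 # selection weights (assumed equal here)
totals = [35.0, 25.0, 18.0]    # assumed population totals for boy, girl, FSM

w = linear_calibrate(d, x, totals)
calibrated = [sum(wk * xk[i] for wk, xk in zip(w, x)) for i in range(3)]
print(w)
print(calibrated)
```

The weighted totals under w reproduce the assumed population totals up to rounding error, which is the exact-equality property that calibration asks for. The sketch solves the equations directly; logistic or bounded methods would need an iterative solver instead.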