A Unified Modeling Approach to Estimating HIV Prevalence in Sub-Saharan African Countries∗

Giampiero Marra† Rosalba Radice‡ Simon N. Wood¶ Till Bärnighausen§ Mark E. McGovern‖

Tuesday 24th March, 2015

Abstract

Estimates of HIV prevalence are important for policy in order to establish the health status and needs of a country’s population, to evaluate population-based interventions and campaigns, to identify the most at risk members of the population, and to target those most in need of treatment. However, data in low and middle income countries are often derived from HIV testing conducted as part of household surveys, where participation rates in testing can be low. Low participation rates may be attributed to HIV positive individuals being less likely to participate because they fear disclosure, in which case, estimates obtained using conventional approaches to deal with non-participation, such as imputation-based methods, will be biased. In addition, establishing which population sub-groups are most in need of intervention requires modeling of both spatial dependence and the predictors of HIV status, which is complicated by data censoring. We develop a Heckman-type selection model framework which accounts for non-ignorable selection, but allows for heterogeneous selection behavior by incorporating a flexible linear predictor structure for modeling copula dependence. The utilization of penalized regression splines and Gaussian Markov random fields allows us to account for non-linear covariate effects and for geographic clustering of HIV. A ridge penalty avoids convergence failures, even when the parameters of the selection variable are not fully identified. We provide the software for straightforward implementation of this approach, and apply our methodology to estimating national and sub-national HIV prevalence in three sub-Saharan African countries.
Key Words: Heckman-Type Selection Models, Penalized Regression Spline, Selection Bias, HIV, Spatial Dependence, Simultaneous Equation Models.

∗ Research Report number 324, Department of Statistical Science, University College London. Date: March 2015.
† Department of Statistical Science, University College London, Gower Street, London WC1E 6BT, UK.
‡ Department of Economics, Mathematics and Statistics, Birkbeck, University of London, Malet Street, London WC1E 7HX, UK.
§ Department of Global Health and Population, Harvard T.H. Chan School of Public Health, Boston, MA, USA. Wellcome Trust Africa Centre for Health and Population Studies, University of KwaZulu-Natal, Mtubatuba, South Africa.
¶ Department of Mathematical Sciences, University of Bath, Claverton Down, Bath, BA2 7AY, UK.
‖ Harvard Center for Population and Development Studies, Cambridge, MA, USA; Department of Global Health and Population, Harvard T.H. Chan School of Public Health, Boston, MA, USA.

1 Introduction

1.1 Measuring HIV Prevalence in Developing Countries

Policy interventions targeted to control the HIV epidemic, improve population health, and reduce HIV-related health disparities, are often motivated by prevalence data obtained from HIV testing as part of national or regional surveillance (Beyrer et al., 1999; De Cock et al., 2006). Particularly in low and middle income countries without developed health systems infrastructure, data obtained from nationally representative samples of the population of interest are a powerful source of information for establishing the current numbers of HIV positive individuals, as well as the change in HIV prevalence over time (Boerma et al., 2003; Mishra et al., 2006). This information is important for governments to be able to cost policy interventions, to implement these interventions, and to plan and forecast future demands on the health care system and public finances.
The development of new antiretroviral treatment (ART) for reducing viral load and stabilizing the health status of HIV positive individuals, and subsequent initiatives using treatment-as-prevention (TasP), which aims to reduce the transmission of HIV by placing infected individuals on treatment as soon as possible, is a very promising development for combating the HIV epidemic (Granich et al., 2009). However, to be most effective, these programs will require accurate prevalence data on hard to reach and at risk populations (Kranzer et al., 2012). The recent success of ART means that improving treatment access in sub-populations with high HIV prevalence or which have seen increases in HIV prevalence will have potentially large payoffs (Tanser et al., 2013; Bor et al., 2013). HIV prevalence estimates can potentially be used in conjunction with HIV TasP or Pre-Exposure Prophylaxis (PrEP) to reduce transmission among at risk sub-populations who are exposed to risky behavior (Karim et al., 2010). In addition to identifying the most suitable groups for these policy interventions, HIV prevalence data are important for evaluating the effectiveness of large-scale programs (Pettifor et al., 2007). Establishing whether a population-based policy or intervention acted to reduce HIV prevalence will require population-based prevalence data. In low and middle income countries, estimates of HIV prevalence obtained from nationally representative household surveys are now considered the gold standard (Boerma et al., 2003). These data are generally obtained from home-based testing which takes place after survey respondents complete a standard interview (Mishra et al., 2006). 
After the interview, the surveyor will ask the respondent to participate in a blood test for HIV, with the sample generally collected by finger prick, following the recommended guidelines specified by the World Health Organization (WHO) and the Joint United Nations Programme on HIV and AIDS (UNAIDS). Similar data collection procedures take place as part of demographic surveillance sites, which track the residents of specific geographic areas, and which are another important source of data on HIV prevalence (Sankoh & Byass, 2012; Tanser et al., 2008). For HIV surveys which are designed to be nationally representative, a random sample of the population is approached with an offer for HIV testing. However, these HIV survey data can be affected by non-participation, because some of those who are eligible for testing opt out. This non-participation can occur through a variety of mechanisms, including directly declining to test for HIV when a respondent is approached to test after interview, or being an eligible respondent for HIV testing but not being present when the interviewers seek to contact the person for interview (Marston et al., 2008). Even if, ex ante, the eligible population for the survey is either the complete population of interest (as at surveillance sites), or a random sample (in household surveys), ex post the surveyed group who consent to HIV testing may not be representative of the population of interest due to this non-participation. Selection bias can occur if HIV prevalence among those who participate in testing differs from that among those who do not participate in testing. In many contexts, the extent of non-participation is substantial. For example, at some demographic surveillance sites, fewer than half of eligible respondents participate in testing (Tanser et al., 2008).
In the nationally representative Demographic and Health Surveys, non-participation can also be common; for example, 37% of eligible male respondents failed to participate in testing in Malawi in 2004 (Hogan et al., 2012). In general, the treatment of missing information in survey data has the potential to have a substantial impact on both the parameter estimates and the policy recommendations derived from these surveys (Nicoletti, 2006). In the worst case scenario, where missing information caused by non-participation is a symptom of selection bias, conventional estimates can be substantially biased. Therefore, modeling this non-participation in HIV testing is important from a policy perspective.

1.2 Approaches for Dealing With Non-Participation in HIV Research

There are a variety of options for dealing with missingness caused by non-participation (Donders et al., 2006). Standard approaches include multiple imputation, inverse probability re-weighting, and propensity score methods. Imputation is the approach recommended by UNAIDS and the WHO for dealing with missing values in HIV research, and, in general, imputation-based methods are a popular approach to dealing with missingness in a wide variety of contexts. Under ideal conditions, this approach will provide unbiased and efficient estimates of the parameters of interest, and if these ideal conditions are met, it should be preferred to the approach of analyzing the data on the basis of only non-missing observations. However, these conventional imputation-based methods all rely on a key assumption to generate unbiased estimates of the parameters of interest (Conniffe & O’Neill, 2011). They require that missing data are missing at random, or missing at random conditional on observed covariates.
In HIV surveys, there is generally a substantial amount of other information on respondents who do not participate in testing, because data on their socio-demographic characteristics are collected from other household informants (Mishra et al., 2006). The missing at random assumption essentially requires that, once these observed characteristics of respondents have been controlled for, whether or not the outcome of interest (in this case, the respondent's HIV status) appears in the dataset for an observation (that is, whether the respondent participates in HIV testing) is random. Studies which have used imputation-based approaches to predict the HIV status of missing observations in HIV datasets have found population prevalence estimates very similar to those obtained when the data are analyzed after removing the observations for respondents who do not participate in testing (Marston et al., 2008; Mishra et al., 2008). It is important to note that this assumption of missing at random is generally not possible to test because we do not observe the HIV status of those who are absent from the data (Korenromp et al., 2013; Nicoletti, 2006). Moreover, the extent of non-participation in many HIV surveys is such that non-parametric bounding approaches, for example that proposed by Manski (1990), will not be informative. However, the assumption of missing at random for HIV status may not be realistic. The decision to participate in HIV testing is likely to occur in the context of highly complex individual behavioral and context-specific cultural factors (Angotti et al., 2009; Kalichman & Simbayi, 2003). For example, due to the stigma associated with HIV, HIV positive individuals may be less likely to participate in testing because they fear disclosure of their status to other household members, neighbors or even their interviewers. Participation in testing is lower in communities with higher knowledge of HIV status (Reniers & Eaton, 2009).
The limited longitudinal evidence available from demographic surveillance sites also supports the hypothesis that HIV positive individuals are less likely to participate in testing (Arpino et al., 2014; Floyd et al., 2013; Reniers & Eaton, 2009). For example, in a rural hyperendemic community in South Africa, HIV positive residents were twice as likely to decline to participate in testing (Bärnighausen et al., 2012). Similar results were found in longitudinal data in Malawi (Obare, 2010). If data are missing because HIV positive individuals are more likely to decline to test, then the assumption of missing at random is violated because we do not observe the HIV status of respondents absent from the data, and therefore cannot condition on it. Because we cannot control for HIV status as a predictor of testing, there is an omitted variable which predicts both participation in testing and HIV status, violating the assumption of missing at random. In this case, conventional methods, including imputation or analysis of data on the basis of only non-missing observations, will generate biased results (Heckman, 1990; Puhani, 2000; Vella, 1998). In addition, because imputation-based models do not acknowledge that there is uncertainty surrounding the relationship between participation in testing and HIV status, confidence intervals based on this approach are likely to be too narrow when non-participation in testing is common (Hogan et al., 2012).

1.3 Accounting for Systematic Non-Participation

There is a structural approach to dealing with missing information which will provide consistent results, even when data are not missing at random and selection bias would invalidate imputation-based methods. Heckman-type selection models can be used to correct for selection bias, even when selection is based on unobserved characteristics of respondents, as would be the case if HIV positive individuals were systematically opting out of HIV testing.
This approach acknowledges the sequential decision making process involved in survey participation; respondents first decide whether to participate in testing, and it is only conditional on their agreeing to test that we observe their HIV status. Heckman (1979) originally proposed explicitly modeling the selection (in this case, whether the respondents test or not) and outcome (in this case, the HIV status of respondents) equations simultaneously, both as a function of the observed characteristics of respondents and given a parametric assumption about the joint distribution of the error terms in the two equations. In this approach, parameters are typically estimated under a maximum likelihood framework. When the outcome is binary, the conventional Heckman-type selection model is a bivariate probit (Dubin & Rivers, 1989; Van de Ven & Van Praag, 1981). These models do require a valid exclusion restriction (or selection variable) for identification, a variable which predicts selection but not the outcome (Madden, 2008). Here we require a variable which predicts consent to test but not HIV status. Prior implementations of this approach in HIV research to correct for systematic non-participation have used interviewer identity for identification. The identity of the interviewer who contacts the respondent to seek consent for an HIV test is often recorded in survey data as an anonymized code, and the interviewer to whom an eligible survey respondent is allocated is typically highly correlated with whether the survey respondent consents to test (or is contacted for participation in the first instance). Interviewers are also likely to be allocated to survey respondents on the basis of survey design, as opposed to the characteristics of the respondents themselves. Therefore, interviewer identity is plausibly exogenous, should be unrelated to the HIV status of survey respondents, and is a potentially suitable exclusion restriction (Bärnighausen et al., 2011).
Previous papers which have used this Heckman-type selection model approach to produce new estimates of HIV prevalence which adjust for non-participation in testing have found substantial differences in HIV prevalence in some contexts when compared to complete case or imputation-based methods (Bärnighausen et al., 2011; Clark & Houle, 2014; Hogan et al., 2012; Janssens et al., 2014; Reniers et al., 2009).

1.4 Towards a Unified Framework for Estimating HIV Prevalence

Although the simultaneous equation modeling approach, such as that proposed by Heckman (1979), has the advantage of not requiring the assumption of missing at random for the HIV status of those who do not participate in testing, current techniques used for implementing this approach have been limited by a number of methodological drawbacks. First, selection models impose a homogeneous selection mechanism on all respondents, and do not account for between group heterogeneity in selection propensity (the association between the outcome of interest and whether the observation is present in the data, in our case the association between consenting to test and HIV status, conditional on observed covariates). However, this may not be a realistic assumption, particularly for the type of behavior which is likely to be correlated within groups. For example, there is substantial heterogeneity in the reasons for declining to test for HIV (Kranzer et al., 2008). Selection into testing may partly reflect spatial dependence amongst neighboring observations. Networks and proximity propagate the transmission and spread of infectious disease, and therefore HIV status and other outcomes which are determined by social or proximal interaction will be affected by geographic clustering (Tanser et al., 2009), with likely spill-over effects and spatial dependence among communities. For example, HIV prevalence may vary substantially by region (Larmarange & Bendaud, 2014; Aral et al., 2005).
In addition, the association between testing and HIV status may vary between these communities, where different cultural or location-specific factors mean that the selection process and the factors which influence the decision to participate in HIV testing are heterogeneous. For example, the stigma associated with HIV, and the corresponding fear of disclosure for HIV positive individuals may vary according to location, and HIV prevalence within that location. It is particularly important for policy makers to be able to identify the most at risk populations within their countries in order to begin to implement linkages to treatment (Govindasamy et al., 2012). Therefore, as well as being inefficient, the imposition of a common selection parameter across all communities could bias region-specific HIV prevalence estimates. Previous selection models cannot take account of these effects. Our first methodological contribution is that we allow the dependence structure of the model to be heterogeneous by specifying a linear predictor equation for the dependence parameter. Our approach of allowing the association between testing and HIV status to vary by region follows the same rationale provided by Rigby & Stasinopoulos (2005), who extended generalized additive models to the context of more complex response distributions, where not only the mean, but multiple parameters are related to linear predictors via suitable link functions. Recently, Klein et al. (2014a) and Klein et al. (2014b) proposed a similar approach in a Bayesian univariate and multivariate context, respectively. In order to reflect the manner in which HIV is spread through social networks and proximity (Klovdahl, 1985), we account for the spatial effects of contiguous geographic units (in our case, regions) using a Markov random field approach (Rue & Held, 2005). A second problem with the conventional implementation of Heckman-type selection models for binary data is that model specification is problematic.
As a result, identifying the individual-level predictors of HIV status is not straightforward. Because the HIV equation must be specified in the absence of the complete dataset (i.e. only on the basis of those who consent to test for HIV and for whom we observe the outcome), avoiding misspecification is difficult. Typically, continuous variables are entered into the equations as linear effects, polynomials of various degrees, or else categorized according to a series of cut-points. However, this approach runs the risk of over-fitting, may be inefficient, and can be arbitrary. Because some portion of the data are missing, often a substantial percentage, it can be difficult to reliably specify these choices ex ante. Moreover, the degrees of the relevant polynomial or the effective cut-points can be difficult to set in general because they may vary according to the context. For example, years of education in one country could have a different meaning to years of education in another, and specifying education groups according to some common threshold could be inappropriate. This is an important issue because identifying the relevant associations requires correct specification of the covariate structure. In addition, in the absence of a strong selection variable which is sufficiently predictive of the selection outcome, model identification can in theory be achieved through non-linearities (Madden, 2008), and misspecification of the model component effects could introduce bias into the results. To this end, we employ a penalized regression spline approach which allows us to flexibly estimate non-linear effects and does not depend on arbitrary modeling decisions by the researcher (e.g., Marra & Radice, 2013; Ruppert et al., 2003; Wood, 2006). Along with sex, these continuous variables are often the most relevant for population surveillance and identifying the fundamental demographic attributes of the most at-risk sub-groups in the population.
For example, modeling the association of age with HIV status is crucial for understanding when peak incidence occurs, and these data can be used for appropriate targeting of efforts to reduce risky behavior (Gouws et al., 2008). The role of education in the evolution of the HIV epidemic is another question of fundamental importance to policy makers due to its potential for affecting population health, behavior and knowledge; however, the literature has found that its impact as a protective or risk factor appears to be changing over time (Hargreaves et al., 2008). Finally, the literature has debated the association of poverty with HIV risk (Gillespie et al., 2007). If any of these factors (age, education and poverty, which we measure with household wealth defined by an asset index) are systematically associated with non-participation, estimates based on analysis of surveys which are affected by missing data could be misleading. Our unified modeling framework allows us to reassess these relationships using flexible splines while also correcting for selective non-participation. Third, conventional Heckman-type selection models for binary data have relied on the assumption of bivariate normality for identification, an assumption which cannot be validated because the joint distribution of the error terms in the selection and outcome equations is unobserved. This parametric specification of bivariate normality has been criticized for being arbitrary (Geneletti et al., 2011; Puhani, 2000; Vytlacil, 2002). Misspecification of this joint distribution can lead to inconsistent and inefficient estimation (De Luca, 2008), and if model identification and results are sensitive to one particular distributional assumption, then this is a serious limitation. We use a copula approach to allow for non-Gaussian dependencies, and consider several alternative functional forms for specifying the (conditional) association between testing and HIV status. Following McGovern et al.
(2015b), we consider Gaussian and Archimedean copulas (with the Frank, Gumbel, Joe, and Clayton copulas as special cases) and the rotated versions of Gumbel, Joe, and Clayton to allow for negative non-Gaussian dependencies. Finally, while interviewer identity is a plausible and convenient choice for a selection variable, in practice the bivariate probit models which are used to implement this approach are not very stable and can fail to converge relatively frequently (Butler, 1996; Clark & Houle, 2014). This is especially the case when HIV prevalence is low or high, or non-participation is low or high (Chiburis et al., 2012; Clark & Houle, 2012). Interviewers are often matched to participants on the basis of some group-level characteristics such as language and gender, which can induce collinearity between the interviewer variable and the other independent variables in the linear predictor equation for consent. This can result in convergence failures, for example due to a non-positive definite Hessian matrix induced by the collinearity. Similarly, the number of interviewees per interviewer typically varies in HIV surveys, with some interviewers conducting many interviews, and some interviewers only conducting a handful of interviews. Some interviewers obtain participation in testing from all their interviewees, while for some other interviewers all their interviewees may decline to participate, with the result that these interviewer effects are not identified due to lack of within-interviewer variation in testing participation. Previous approaches to dealing with this non-convergence and non-identification involved the pooling of interviewer parameters which cause model failure. Which interviewers are problematic can be established by examining which parameters in the variance-covariance matrix have variances which grow with each iteration of the algorithm (Bärnighausen et al., 2011).
However, this pooling approach is arbitrary, and can involve combining very successful interviewers (in the sense of being successful at obtaining high testing participation rates among their interviewees) with very unsuccessful interviewers, and is also clearly inefficient because it ignores the information in the participation data for these pooled interviewers. Alternatively, estimating interviewer persuasiveness in a two-stage process required bootstrapping of standard errors, was time consuming, proved inefficient, and may lead to attenuation bias (McGovern et al., 2015a). Here, we implement a ridge penalty approach (e.g., Wood, 2006) applied to the interviewer identity variable which allows for the estimation of all interviewer effects, solves the collinearity problem in a straightforward and efficient manner, and helps to prevent convergence failures. In our empirical application, we find that all models fail without the imposition of the ridge penalty on the interviewer identity selection variable. Our methodology incorporates each of these developments in a unified and flexible simultaneous equation framework for adjusting for systematic non-participation in HIV surveys. We outline further details of this methodology in the rest of the paper as follows. Section 2 introduces the approach in more detail by describing its main statistical components. Estimation and inference are developed in Section 3. Sections 4 and 5 describe the data and apply the proposed approach to three Sub-Saharan African countries. The final section discusses directions for future research.

2 Model representation

Let us assume that there are two random variables $(y_{1i}, y_{2i})$, for $i = 1, \ldots, n$, where $y_{1i}, y_{2i} \in \{0, 1\}$ and $n$ represents the sample size. Variable $y_{1i}$ indicates whether an individual takes part in the study whereas $y_{2i}$ denotes the observed outcome.
The probability of event $(y_{1i} = 1, y_{2i} = 1)$ can be defined as (Sklar, 1959, 1973)

$$p_{11i} = P(y_{1i} = 1, y_{2i} = 1) = C(P(y_{1i} = 1), P(y_{2i} = 1); \theta_i),$$

where $P(y_{vi} = 1) = \Phi(\eta_{vi})$ for $v = 1, 2$. $\Phi(\cdot)$ is the cumulative distribution function (cdf) of the standard univariate Gaussian distribution, $\eta_{vi} \in \mathbb{R}$ is a linear predictor (defined in generic terms in the next section), $C$ is a two-place copula function, and $\theta_i$ is an association parameter measuring the dependence between the two marginals $P(y_{1i} = 1)$ and $P(y_{2i} = 1)$. Note that the marginal cdfs are conditioned on covariates (through $\eta_{1i}$ and $\eta_{2i}$), but for notational convenience we have suppressed this when expressing the marginal distributions. Since the strength of the association between the selection and outcome equations may vary across groups of observations (specifically, across regions in our case), the copula dependence parameter is specified as a function of a linear predictor: $\theta_i = m(\eta_{3i})$, where $m$ is a one-to-one transformation which ensures that the dependence parameter lies in its range (see Radice et al. (2015) for a list of transformations). In this context, $y_{2i}$ is observed only if $y_{1i} = 1$, hence the data only identify the additional events $(y_{1i} = 1, y_{2i} = 0)$ and $(y_{1i} = 0)$, with probabilities $p_{10i} = \Phi(\eta_{1i}) - p_{11i}$ and $p_{0i} = \Phi(-\eta_{1i})$. As in Radice et al. (2015), the copulae considered include the Clayton, Frank, Gaussian, Gumbel and Joe, as well as their counter-clockwise rotated versions (the 90, 180 and 270 degree rotated Clayton, Gumbel and Joe). The rotated versions are obtained using the definitions in Brechmann & Schepsmeier (2013), and allow us to model negative non-Gaussian dependencies. In the context of our application to HIV data, this is crucial as we expect negative dependence. For full details on the properties of copula models, including the mathematical definitions and pictorial representations of the copulae mentioned above, see Nelsen (2006) and Brechmann & Schepsmeier (2013).
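As a rough illustration of how the three identified event probabilities fit together, the sketch below evaluates $p_{11i}$, $p_{10i}$ and $p_{0i}$ for a single observation using a Clayton copula rotated by 90 degrees. It is not the authors' implementation: the function names are hypothetical, the rotation is taken as $C_{90}(u, v) = v - C(1 - u, v)$ (one common convention), and `theta` is assumed to be the positive parameter of the underlying unrotated Clayton copula rather than the output of a particular transformation $m(\eta_{3i})$. The last function returns the log of the probability of the observed event, which is the contribution of one observation to the log-likelihood in equation (1).

```python
import math


def std_normal_cdf(x):
    """Standard normal cdf, Phi(x), computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def clayton(u, v, theta):
    """Clayton copula C(u, v; theta) for theta > 0 (positive dependence)."""
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)


def clayton_rot90(u, v, theta):
    """90-degree rotated Clayton, assumed here to be v - C(1 - u, v; theta).

    theta is the (positive) parameter of the unrotated copula; the rotation
    turns the positive dependence into negative dependence.
    """
    return v - clayton(1.0 - u, v, theta)


def cell_probabilities(eta1, eta2, theta):
    """Return (p11, p10, p0) for one observation."""
    u = std_normal_cdf(eta1)           # P(y1 = 1): participates in testing
    v = std_normal_cdf(eta2)           # P(y2 = 1): HIV positive
    p11 = clayton_rot90(u, v, theta)   # tested and HIV positive
    p10 = u - p11                      # tested and HIV negative
    p0 = std_normal_cdf(-eta1)         # not tested, outcome unobserved
    return p11, p10, p0


def loglik_contribution(y1, y2, eta1, eta2, theta):
    """Log-probability of the observed event for one observation."""
    p11, p10, p0 = cell_probabilities(eta1, eta2, theta)
    if y1 == 0:
        return math.log(p0)            # y2 is ignored when y1 = 0
    return math.log(p11) if y2 == 1 else math.log(p10)
```

With, say, `cell_probabilities(0.5, -1.0, 2.0)`, the three probabilities sum to one and `p11` falls below the product of the two marginals, which is the negative dependence expected if HIV positive individuals are less likely to test.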
Note that for practical modeling, the Gaussian and one of the Clayton, Joe or Gumbel (including all rotated versions) copulae may suffice. Consider, for instance, the Clayton, Joe and Gumbel copulae; the Clayton rotated by 90 degrees is very close to the Joe and Gumbel rotated by 270 degrees. Nevertheless, for completeness we consider all definitions in this paper. The classic sample selection model is obtained by using the Gaussian copula $\Phi_2(\Phi^{-1}(\Phi(\eta_{1i})), \Phi^{-1}(\Phi(\eta_{2i})); \theta)$, where $\Phi^{-1}(\cdot)$ is the quantile function of the standard univariate normal distribution, and $\Phi_2(\cdot, \cdot; \theta)$ is the cdf of the standard bivariate normal distribution with correlation $\theta \in [-1, 1]$, where $\theta = \tanh(\theta^*)$ and $\theta^* \in (-\infty, \infty)$. The use of the hyperbolic tangent transformation is more convenient from an estimation perspective because, unlike the correlation parameter, it is not bounded. The log-likelihood function of the sample can be expressed as

$$\ell = \sum_{i=1}^{n} \left\{ y_{1i} y_{2i} \log(p_{11i}) + y_{1i}(1 - y_{2i}) \log(p_{10i}) + (1 - y_{1i}) \log(p_{0i}) \right\}. \tag{1}$$

2.1 Linear predictor specification

For simplicity, and without loss of generality, we suppress the $v$ subscript and define the generic linear predictor as

$$\eta_i = \beta_0 + \sum_{k=1}^{K} s_k(z_{ki}), \tag{2}$$

where $\beta_0 \in \mathbb{R}$ is an overall intercept, $z_{ki}$ denotes the $k$th sub-vector of the complete covariate vector $z_i$ containing, for instance, binary, categorical, continuous, and spatial variables, and the $K$ functions $s_k(z_{ki})$ represent generic effects which are chosen according to the type of covariate(s) considered. Each $s_k(z_{ki})$ can be approximated as a linear combination of $J_k$ basis functions $b_{kj_k}(z_{ki})$ and regression coefficients $\beta_{kj_k} \in \mathbb{R}$, i.e.

$$s_k(z_{ki}) = \sum_{j_k=1}^{J_k} \beta_{kj_k} b_{kj_k}(z_{ki}). \tag{3}$$

Equation (3) implies that the vector of evaluations $\{s_k(z_{k1}), \ldots, s_k(z_{kn})\}^T$ can be written as $Z_k \beta_k$ with coefficient vector $\beta_k = (\beta_{k1}, \ldots, \beta_{kJ_k})^T$ and design matrix $Z_k[i, j_k] = b_{kj_k}(z_{ki})$. This allows the linear predictor in equation (2) to be written as

$$\eta = \beta_0 1_n + Z_1 \beta_1 + \ldots + Z_K \beta_K, \tag{4}$$

where $1_n$ is an $n$-dimensional vector made up of ones. Equation (4) can also be rewritten as

$$\eta = Z\beta, \tag{5}$$

where $Z = (1_n, Z_1, \ldots, Z_K)$ and $\beta = (\beta_0, \beta_1^T, \ldots, \beta_K^T)^T$. The smooth functions may represent linear, non-linear, random and spatial effects, to name but a few. Moreover, each $\beta_k$ has an associated quadratic penalty $\lambda_k \beta_k^T S_k \beta_k$ whose role is to enforce specific properties on the $k$th function, such as smoothness. Parameter $\lambda_k \in [0, \infty)$ controls the trade-off between fit and smoothness, and plays a crucial role in determining the shape of $\hat{s}_k(z_{ki})$. For instance, let us assume that the $k$th function models the effect of a continuous variable such as age. A value of $\lambda_k = 0$ (i.e., no penalization is employed during fitting) will result in an un-penalized regression spline estimate with the likely consequence of over-fitting, while $\lambda_k \to \infty$ (i.e., the penalty has a large influence on the function during fitting) will lead to a straight line estimate. The overall penalty can be defined as $\beta^T S_\lambda \beta$, where $S_\lambda = \mathrm{diag}(0, \lambda_1 S_1, \ldots, \lambda_K S_K)$. Note also that smooth functions are typically subject to centering (identifiability) constraints and we adopt the parsimonious approach detailed in Wood (2006) to deal with this issue. In the following paragraphs, we outline the rationale for adopting the specific model components relevant to our case study.

Linear and random effects. For parametric, linear effects, equation (3) becomes $s_k(z_{ki}) = z_{ki}^T \beta_k$, and the design matrix is obtained by stacking all covariate vectors $z_{ki}$ into $Z_k$. In general, no penalty is assigned to linear effects ($S_k = 0$). This would be the case for variables such as ever tested for HIV and condom use at last sexual activity.
As pointed out in Section 1.4, however, for the parameters of variables like interviewer identity, it is convenient, and often necessary to achieve model convergence, to employ a ridge penalty (i.e., $S_k = I$, where $I$ is an identity matrix), which is equivalent to the assumption that the coefficients are random effects which are distributed as i.i.d. normal with unknown variance (e.g., Ruppert et al., 2003; Wood, 2006). This allows us to achieve convergence even when some interviewer parameters are not identified in the conventional selection model.

Non-linear effects. For continuous variables such as age, wealth and years of education, the smooth functions are represented using the regression spline approach popularized by Eilers & Marx (1996). Specifically, for each continuous variable $z_{ki}$, $s_k(z_{ki}) = \sum_{j_k=1}^{J_k} \beta_{kj_k} b_{kj_k}(z_{ki})$, where the $b_{kj_k}(z_{ki})$ are known spline basis functions. The design matrix $Z_k$ comprises the basis function evaluations for each $i$, and hence describes the $J_k$ curves which have potentially varying degrees of complexity. Basis functions should be chosen to have convenient mathematical and numerical properties. We employ low rank thin plate regression splines (Wood, 2003), although other spline definitions (including B-splines and cubic regression splines) and corresponding penalties are supported in our implementation. Note that for one-dimensional smooth functions, the choice of spline definition does not play an important role in determining the shape of $\hat{s}_k(z_k)$ (e.g., Ruppert et al., 2003). To enforce smoothness, a conventional integrated square second derivative spline penalty is typically employed. That is, $S_k = \int d_k(z_k) d_k(z_k)^T \, dz_k$, where the $j_k$th element of $d_k(z_k)$ is given by $\partial^2 b_{kj_k}(z_k)/\partial z_k^2$, and integration is over the range of $z_k$. The formulas used to compute the basis functions and penalties for many spline definitions are provided in Ruppert et al. (2003) and Wood (2006).
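To make the integrated squared second derivative penalty concrete, the sketch below approximates $S_k = \int d_k(z_k) d_k(z_k)^T dz_k$ numerically. For transparency it assumes a simple cubic polynomial basis $(1, z, z^2, z^3)$ on $[0, 1]$ rather than the thin plate regression splines used in our application; the basis, the midpoint rule and the function names are illustrative assumptions only.

```python
def second_derivs(z):
    """Second derivatives of the assumed basis b(z) = (1, z, z^2, z^3)."""
    return [0.0, 0.0, 2.0, 6.0 * z]


def penalty_matrix(lower, upper, n_grid=2000):
    """Midpoint-rule approximation of S = integral of d(z) d(z)^T dz."""
    J = 4
    S = [[0.0] * J for _ in range(J)]
    h = (upper - lower) / n_grid
    for g in range(n_grid):
        d = second_derivs(lower + (g + 0.5) * h)
        for r in range(J):
            for c in range(J):
                S[r][c] += d[r] * d[c] * h
    return S


def roughness(beta, S):
    """Quadratic penalty beta^T S beta."""
    n = len(beta)
    return sum(beta[r] * S[r][c] * beta[c] for r in range(n) for c in range(n))


S = penalty_matrix(0.0, 1.0)
straight_line = [1.0, 2.0, 0.0, 0.0]   # s(z) = 1 + 2z
curved = [1.0, 2.0, 0.5, -1.0]         # s(z) = 1 + 2z + 0.5z^2 - z^3

line_penalty = roughness(straight_line, S)
curve_penalty = roughness(curved, S)
```

Under these assumptions the straight line incurs no penalty while the curved function does, which is why letting $\lambda_k \to \infty$ drives the estimate towards a straight line.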
This flexible spline approach allows us to avoid arbitrary modeling decisions based on censored data, such as choosing the appropriate degree of a polynomial or specifying cut-points, which could induce misspecification.

Spatial effects. To model the spatial information based on the geographic location of survey respondents, we employ a Markov random field smoother. This approach is popular when the geographic area of interest is split up into discrete contiguous geographic units, and allows us to take advantage of the information contained in neighboring observations which are located in the same region, due to our expectation of spatial dependence in HIV prevalence. In this case, $s_k(z_{ki}) = z_{ki}^T \beta_k$, where $\beta_k = (\beta_{k1}, \ldots, \beta_{kR})^T$ represents the vector of spatial effects, $R$ denotes the total number of regions and $z_{ki}$ is made up of a set of area labels. The design matrix linking an observation $i$ with the corresponding spatial effect is therefore defined as

$$Z_k[i, r] = \begin{cases} 1 & \text{if the observation belongs to region } r, \\ 0 & \text{otherwise,} \end{cases}$$

where $r = 1, \ldots, R$. The smoothing penalty associated with the Markov random field is constructed based on the neighborhood structure of the geographic units, so that spatially adjacent regions share similar effects. That is

$$S_k[r, q] = \begin{cases} -1 & \text{if } r \neq q,\ r \sim q, \\ 0 & \text{if } r \neq q,\ r \nsim q, \\ N_r & \text{if } r = q, \end{cases}$$

where $r \sim q$ indicates whether two regions $r$ and $q$ are adjacent neighbors, and $N_r$ is the total number of neighbors for region $r$. In a stochastic interpretation, this penalty is equivalent to the assumption that $\beta_k$ follows a Gaussian Markov random field (e.g., Rue & Held, 2005) and it has been employed in several contexts, including, for example, HIV and child under-nutrition (see, e.g., Klein et al., 2014b; Larmarange & Bendaud, 2014; Tanser et al., 2009, and references therein).
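The construction of the Markov random field penalty from a neighborhood structure can be sketched as follows. The four-region chain used here is a hypothetical adjacency list, not the regional structure of any country in our application, and the function names are our own.

```python
def mrf_penalty(neighbours):
    """Build S[r][q]: N_r on the diagonal, -1 for adjacent regions, 0 otherwise."""
    regions = sorted(neighbours)
    index = {name: i for i, name in enumerate(regions)}
    R = len(regions)
    S = [[0.0] * R for _ in range(R)]
    for name, adjacent in neighbours.items():
        r = index[name]
        S[r][r] = float(len(adjacent))      # N_r, the number of neighbors
        for other in adjacent:
            S[r][index[other]] = -1.0       # adjacent regions r ~ q
    return regions, S


def mrf_roughness(beta, S):
    """Quadratic penalty beta^T S beta for a vector of regional effects."""
    R = len(beta)
    return sum(beta[r] * S[r][q] * beta[q] for r in range(R) for q in range(R))


# Hypothetical adjacency: regions A-B-C-D arranged in a chain.
neighbours = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
regions, S = mrf_penalty(neighbours)

flat = mrf_roughness([0.3, 0.3, 0.3, 0.3], S)      # identical regional effects
jumpy = mrf_roughness([0.3, -0.3, 0.3, -0.3], S)   # neighbors with opposite effects
```

With this definition the quadratic form equals the sum of squared differences between the effects of adjacent regions, so identical effects are unpenalized and sharp contrasts between neighbors are penalized most heavily.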
This approach is also used to allow for heterogeneous selection mechanisms where the copula parameter (which measures the association between HIV status and participation in testing) varies according to location. In the context of our study, the linear predictors for the selection ($\eta_1$) and outcome equations ($\eta_2$) and the copula parameter ($\eta_3$) are specified as

$$
\begin{aligned}
\eta_{1i} &= \beta_{10} + x_i^T \beta_{11} + s_{11}(\text{age}_i) + s_{12}(\text{education}_i) + s_{13}(\text{wealth}_i) + s_{1\text{spatial}}(\text{region}_i) + \beta_{\text{interviewerID}_i}, \\
\eta_{2i} &= \beta_{20} + x_i^T \beta_{21} + s_{21}(\text{age}_i) + s_{22}(\text{education}_i) + s_{23}(\text{wealth}_i) + s_{2\text{spatial}}(\text{region}_i), \\
\eta_{3i} &= \beta_{30} + s_{3\text{spatial}}(\text{region}_i).
\end{aligned}
$$

Parameters $\beta_{10}$, $\beta_{20}$, $\beta_{30}$ are constants comprising the overall levels of the predictors. Vector $x_i$ contains discrete and binary variables with impacts $\beta_{11}$ and $\beta_{21}$, the $s_{vk}$, for $v = 1, 2$ and $k = 1, 2, 3$, are smooth functions of the continuous covariates represented using penalized thin plate regression splines and the $s_{v\text{spatial}}$, for $v = 1, 2, 3$, model spatial regional effects using a Markov random field approach. Finally, $\beta_{\text{interviewerID}_i}$ denotes the random effects for the set of binary variables defined by interviewer identity.

The choice of specification for the third linear predictor equation for the copula parameter ($\eta_3$) must reflect a balance between parsimony and a reasonable reflection of the selection behavior of those eligible for participation in HIV testing. $\eta_3$ models an unobserved selection process and therefore specifying the linear predictor equation as a function of observed characteristics only makes sense from an estimation perspective if there are groups for which there is a clear rationale for expecting heterogeneous selection mechanisms. While in theory we could include additional group-level identifier variables in $\eta_3$, our model is already highly flexible, and therefore we opt to specify the copula parameter as depending on a grouping variable for which we expect heterogeneous selection: the location of the survey participant.
This parametrization is motivated by the evidence on the spatial clustering of HIV prevalence (Larmarange & Bendaud, 2014; Tanser et al., 2009).

The above model specification provides an example of the flexibility of our structured modeling approach to dealing with systematic non-participation. However, there are a number of other extensions which could easily be incorporated in our framework. These include varying coefficient models obtained, for instance, by multiplying one or more smooth components by some predictor(s), and smooth functions of two or more continuous covariates; see Hastie & Tibshirani (1993), Ruppert et al. (2003) and Wood (2006) for more details.

In summary, we introduce a unified model for expanding the implementation of Heckman-type selection models. This is achieved by considering non-Gaussian dependencies between the selection and outcome equations, by applying a ridge penalty to the selection variable to deal with collinearity and non-identification caused by the non-uniform distribution of interviewees per interviewer, and by allowing for non-linear covariate effects, spatial effects and for heterogeneous regional selection dependence between testing and HIV status. Our proposal therefore extends the scope of the approaches presented in Marra & Radice (2013) and McGovern et al. (2015b).

3 Parameter estimation

Let us define the overall quantities $\delta^T = (\beta_1^T, \beta_2^T, \beta_3^T)$ and $S_\lambda = \mathrm{diag}(\lambda_1 S_1, \lambda_2 S_2, \lambda_3 S_3)$, where $\lambda_v^T = (\lambda_{vk_v}, \ldots, \lambda_{vK_v})$ for $v = 1, 2, 3$. Parameter vectors $\beta_1$, $\beta_2$ and $\beta_3$ and their corresponding penalty matrices and smoothing parameter vectors are associated with $\eta_{1i}$, $\eta_{2i}$ and $\eta_{3i}$, respectively. Because of the flexible linear predictor structures employed here, the use of a classic (unpenalized) optimization algorithm is likely to result in component estimates that are too rough to produce practically useful results (e.g., Klein et al., 2014b; Ruppert et al., 2003; Wood, 2006).
Therefore, we maximize

$$\ell_p(\delta) = \ell(\delta) - \frac{1}{2} \delta^T S_\lambda \delta. \tag{6}$$

3.1 Estimating δ

Given $\hat{\lambda}^T = (\hat{\lambda}_1^T, \hat{\lambda}_2^T, \hat{\lambda}_3^T)$, we seek to maximize (6). To this end, we use a trust region approach which is generally more stable and faster than its line-search counterparts (such as Newton-Raphson), particularly for functions that are, for example, non-concave and/or exhibit regions that are close to flat; see Nocedal & Wright (2006, Chapter 4) for full details. Such functions can occur relatively frequently in, for example, bivariate probit models, often leading to convergence failures (Andrews, 1999; Butler, 1996; Chiburis et al., 2012).

Let us define the penalized gradient and Hessian at iteration $a$ as $g_p^{[a]} = g^{[a]} - S_{\hat{\lambda}} \delta^{[a]}$ and $H_p^{[a]} = H^{[a]} - S_{\hat{\lambda}}$, where $g^{[a]}$ consists of $g_1^{[a]} = \partial \ell(\delta)/\partial \delta_1 |_{\delta_1 = \delta_1^{[a]}}$, $g_2^{[a]} = \partial \ell(\delta)/\partial \delta_2 |_{\delta_2 = \delta_2^{[a]}}$ and $g_3^{[a]} = \partial \ell(\delta)/\partial \delta_3 |_{\delta_3 = \delta_3^{[a]}}$, and the Hessian matrix has elements $H_{r,h}^{[a]} = \partial^2 \ell(\delta)/\partial \delta_r \partial \delta_h^T |_{\delta_r = \delta_r^{[a]}, \delta_h = \delta_h^{[a]}}$, where $r, h = 1, \ldots, 3$; the gradient and Hessian have been derived analytically and verified using numerical derivatives (Marra & Radice, 2015). Each iteration of the trust region algorithm solves the problem

$$\min_{p} \ \breve{\ell}_p(\delta^{[a]}) \overset{\text{def}}{=} -\left\{ \ell_p(\delta^{[a]}) + p^T g_p^{[a]} + \frac{1}{2} p^T H_p^{[a]} p \right\} \quad \text{such that } \|p\| \leq r^{[a]},$$

$$\delta^{[a+1]} = \arg\min_{p} \ \breve{\ell}_p(\delta^{[a]}) + \delta^{[a]},$$

where $\|\cdot\|$ denotes the Euclidean norm, and $r^{[a]}$ is the radius of the trust region. At each iteration of the algorithm, $\breve{\ell}_p(\delta^{[a]})$ is minimized subject to the constraint that the solution falls within a trust region with radius $r^{[a]}$. The proposed solution is then accepted or rejected and the trust region expanded or shrunken based on the ratio between the improvement in the objective function when going from $\delta^{[a]}$ to $\delta^{[a+1]}$ and that predicted by the quadratic approximation. See Geyer (2013) for the exact details (e.g., numerical stability and termination criteria) of the implementation used here.
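The accept/reject and expand/shrink logic of a trust region iteration can be illustrated with a deliberately simplified one-parameter sketch. It is not the implementation of Geyer (2013): the thresholds 0.25 and 0.75, the radius updates, the toy data and the use of a ridge-penalized logistic (rather than bivariate copula) log-likelihood are all assumptions made for illustration.

```python
import math


def trust_region_1d(f, grad, hess, x0, radius=1.0, tol=1e-8, max_iter=100):
    """Minimize f (here minus a penalized log-likelihood) in one dimension."""
    x = x0
    for _ in range(max_iter):
        g, h = grad(x), hess(x)
        if abs(g) < tol:
            break
        # Exact minimizer of the quadratic model g*p + 0.5*h*p^2 on |p| <= radius.
        if h > 0 and abs(g / h) <= radius:
            p = -g / h                          # Newton step lies inside the region
        else:
            p = -math.copysign(radius, g)       # otherwise step to the boundary
        predicted = -(g * p + 0.5 * h * p * p)  # decrease promised by the model
        actual = f(x) - f(x + p)                # decrease actually achieved
        rho = actual / predicted
        if rho < 0.25:
            radius *= 0.25                      # poor agreement: shrink the region
        elif rho > 0.75 and abs(p) >= radius:
            radius *= 2.0                       # good agreement at the boundary: expand
        if rho > 0:
            x = x + p                           # accept the proposed step
    return x


# Hypothetical toy data and ridge smoothing parameter.
x_data = [-2.0, -1.0, 0.0, 0.5, 1.0, 2.0]
y_data = [0, 0, 1, 0, 1, 1]
lam = 1.0


def fitted(d):
    return [1.0 / (1.0 + math.exp(-d * x)) for x in x_data]


def neg_pen_loglik(d):
    ll = sum(y * d * x - math.log1p(math.exp(d * x)) for x, y in zip(x_data, y_data))
    return -(ll - 0.5 * lam * d * d)


def neg_pen_grad(d):
    score = sum((y - m) * x for x, y, m in zip(x_data, y_data, fitted(d)))
    return -(score - lam * d)


def neg_pen_hess(d):
    return sum(m * (1.0 - m) * x * x for x, m in zip(x_data, fitted(d))) + lam


delta_hat = trust_region_1d(neg_pen_loglik, neg_pen_grad, neg_pen_hess, x0=0.0)
```

In this toy problem the objective is convex, so the Newton step is usually accepted; the boundary branch matters for non-concave or nearly flat penalized likelihoods of the kind discussed above.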
It is important to stress that near the solution the trust region method typically behaves as a classic unconstrained algorithm (Geyer, 2013; Nocedal & Wright, 2006). Furthermore, our implementation provides the option of using the expected Fisher information matrix, $E(H^{[a]})$, instead of the observed $H^{[a]}$, which may result in a slightly slower but more stable algorithm. Starting values for the coefficients in $\eta_1$ and $\eta_2$ are obtained by fitting the selection and outcome equations separately. The initial parameters in $\eta_3$ are set to zero as there is not typically good a priori information about the direction and strength of the association between the selection and outcome equations.

3.2 Estimating λ

Data-driven and automatic smoothing parameter estimation is pivotal for practical modeling, especially when the data are partly censored, and each model equation contains more than one smooth component, as in our case study. Such an approach allows us to determine the shape of the smooth functions from the data, hence avoiding arbitrary decisions by researchers as to the relevant functional form for continuous variables, for instance. Also, note that it would not be sensible to jointly estimate $\delta$ and $\lambda$ via maximization of (6), as the highest value of $\ell_p(\delta)$ would be obtained when $\hat{\lambda} = 0$, hence leading to severe over-fitting and convergence failures. For single equation spline models, there are a number of methods for automatically estimating smoothing parameters within a penalized likelihood framework; see Ruppert et al. (2003) and Wood (2006) for excellent detailed overviews. In our context, we propose to use the smoothing approach detailed below.

Let us use the fact that near the solution the trust region algorithm usually behaves as a classic Newton or Fisher Scoring method, and assume that $\delta^{[a+1]}$ is a new updated guess for the parameter vector which maximizes (6).
If $\delta^{[a+1]}$ is to be 'correct', then the penalized gradient evaluated at those parameter values would be 0, i.e. $g_p^{[a+1]} = 0$. Applying a first order Taylor expansion to $g_p^{[a+1]}$ about $\delta^{[a]}$ yields $0 = g_p^{[a+1]} \approx g_p^{[a]} + H_p^{[a]} (\delta^{[a+1]} - \delta^{[a]})$, from which we find the solution at iteration $a + 1$. After some manipulation, this can be expressed as

$$\delta^{[a+1]} = \left( \mathcal{I}^{[a]} + S_{\hat{\lambda}} \right)^{-1} \sqrt{\mathcal{I}^{[a]}} \, z^{[a]},$$

where $\mathcal{I}^{[a]}$ is $-H^{[a]}$ (or, alternatively, $-E(H^{[a]})$), and $z^{[a]} = \sqrt{\mathcal{I}^{[a]}} \, \delta^{[a]} + \epsilon^{[a]}$, with $\epsilon^{[a]} = \sqrt{\mathcal{I}^{[a]}}^{-1} g^{[a]}$. From standard likelihood theory, $\epsilon \sim N(0, I)$ and $z \sim N(\mu_z, I)$, where $I$ is an identity matrix, $\mu_z = \sqrt{\mathcal{I}} \, \delta^0$, and $\delta^0$ is the true parameter vector. The predicted value vector for $z$ is $\hat{\mu}_z = \sqrt{\mathcal{I}} \, \hat{\delta} = A_{\hat{\lambda}} z$, where $A_{\hat{\lambda}} = \sqrt{\mathcal{I}} (\mathcal{I} + S_{\hat{\lambda}})^{-1} \sqrt{\mathcal{I}}$. Since our goal is to select the smoothing parameters in as parsimonious a manner as possible so that the smooth terms' complexity which is not supported by the data is suppressed, $\lambda$ is estimated so that $\hat{\mu}_z$ is as close as possible to $\mu_z$. This can be achieved using

$$
\begin{aligned}
E\left( \|\mu_z - \hat{\mu}_z\|^2 \right) &= E\left( \|z - A_\lambda z - \epsilon\|^2 \right) \\
&= E\left( \|z - A_\lambda z\|^2 \right) + E\left( -\epsilon^T \epsilon - 2\epsilon^T \mu_z + 2\epsilon^T A_\lambda \mu_z + 2\epsilon^T A_\lambda \epsilon \right) \\
&= E\left( \|z - A_\lambda z\|^2 \right) - \check{n} + 2\,\mathrm{tr}(A_\lambda),
\end{aligned}
$$

where $\check{n} = 3n$ and $\mathrm{tr}(A_\lambda)$ is the number of effective degrees of freedom of the penalized model. Hence, the smoothing parameter vector is estimated by minimising an estimate of the expectation above, that is

$$\mathcal{V}(\lambda) = \|z - A_\lambda z\|^2 - \check{n} + 2\,\mathrm{tr}(A_\lambda), \tag{7}$$

which is equivalent to the expression of the Un-Biased Risk Estimator given in Wood (2006, Chapter 4). This is also equivalent to the Akaike information criterion after dropping the irrelevant constant; the first term on the right hand side of (7) is a quadratic approximation to $-2\ell(\hat{\delta})$ to within an additive constant. In practice, given $\delta^{[a+1]}$, the problem becomes

$$\lambda^{[a+1]} = \arg\min_{\lambda} \ \mathcal{V}(\lambda) \overset{\text{def}}{=} \|z^{[a+1]} - A_\lambda^{[a+1]} z^{[a+1]}\|^2 - \check{n} + 2\,\mathrm{tr}(A_\lambda^{[a+1]}), \tag{8}$$

which is solved using the automatic stable and efficient computational routine by Wood (2004).
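The trade-off encoded in (7) can be illustrated in a special case. The sketch below assumes that both the information matrix and the penalty are diagonal, so that $A_\lambda$ is diagonal with entries $\mathcal{I}_j / (\mathcal{I}_j + \lambda S_j)$, uses a single smoothing parameter, and replaces the routine of Wood (2004) with a crude grid search. The working values `z`, `info` and `s` are hypothetical, and in this toy setting the constant `n_check` is simply the length of `z`.

```python
def ubre_score(lam, z, info, s, n_check):
    """Return V(lambda) = ||z - A z||^2 - n_check + 2 tr(A) and tr(A), diagonal case."""
    a = [i / (i + lam * sj) for i, sj in zip(info, s)]      # diagonal of A_lambda
    rss = sum((zj - aj * zj) ** 2 for zj, aj in zip(z, a))  # ||z - A_lambda z||^2
    edf = sum(a)                                            # tr(A_lambda)
    return rss - n_check + 2.0 * edf, edf


# Hypothetical working quantities for four coefficients.
z = [3.0, 2.5, -0.2, 0.1]
info = [4.0, 4.0, 4.0, 4.0]     # diagonal information matrix
s = [0.0, 1.0, 1.0, 1.0]        # first coefficient unpenalized

grid = [10.0 ** e for e in range(-3, 5)]                    # candidate lambdas
scores = [(ubre_score(lam, z, info, s, n_check=len(z))[0], lam) for lam in grid]
best_score, best_lam = min(scores)

edf_unpenalized = ubre_score(0.0, z, info, s, n_check=len(z))[1]
edf_selected = ubre_score(best_lam, z, info, s, n_check=len(z))[1]
```

Small values of λ leave the effective degrees of freedom close to the number of coefficients, very large values shrink the penalized coefficients almost completely, and the criterion selects an intermediate value on this grid.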
3.3 Consistency and further considerations

The two steps, the first for $\delta$ (the coefficient vector) and the other for $\lambda$ (the smoothing parameter vector), are implemented in a "performance iteration" fashion (Gu, 2002) until the algorithm satisfies the stopping criterion $\max |\delta^{[a]} - \delta^{[a+1]}| < 10^{-6}$. If, after estimation, the estimated smoothing parameter vector yields curve estimates that are deemed to be too rough by the researcher and smoother functions are desired, then the model can be re-estimated by increasing the quantity $\mathrm{tr}(A_\lambda)$ in (8) by a factor $> 1$. Also, note the smoothing parameter estimation step is implemented using two key inputs (the gradient and information matrix), which are obtained as a byproduct of the estimation step for $\delta$. The additional benefit of using $z$ and $A_\lambda$ as defined in Section 3.2 is that the proposed smoothing approach is in principle suitable for any model fitted by penalized maximum likelihood.

As in Kauermann (2005) and Radice et al. (2015), it is possible to prove the consistency of the proposed estimator. This can be done by considering the situation in which the spline bases approximating the smooth components are of a fixed high dimension. Since the unknown smooth functions may not have an exact representation as linear combinations of given basis functions, the unknown functions and parameters may not be asymptotically identified by their estimators as the sample size grows. However, in practice basis dimensions have to be fixed, and assuming that these are of a high dimension, it is possible to assume heuristically that the approximation bias is negligible compared to estimation variability (e.g., Kauermann, 2005). Other key assumptions required for consistency are that $g(\delta^0) = O_P(n^{1/2})$, $E H(\delta^0) = O(n)$, $H(\delta^0) - E H(\delta^0) = O_P(n^{1/2})$, and $S_\lambda = o(n^{1/2})$. The first three conditions are the classic assumptions of $n^{1/2}$ asymptotics. The last condition can be formulated equivalently as $\lambda_{vk_v} = o(n^{1/2})$ for $k_v = 1, \ldots, K_v$, $v = 1, 2, 3$, assuming that the matrices $S_{vk_v}$ are asymptotically bounded; this assumption is weak and in fact smoothing parameter estimates based on a mean squared error criterion are of order $O(1)$ (Kauermann, 2005). It would then follow that $\hat{\delta} - \delta^0 = O_P(n^{-1/2})$ as $n \to \infty$, as shown in Radice et al. (2015).

From a practical point of view, an additional requirement for consistency is the inclusion of interviewer identity into the linear predictor for the selection equation, which is an important feature of Heckman-type selection models. Without this exclusion restriction, a variable which predicts selection but not the outcome, identification is derived through parametric assumptions only, and may not be considered robust (Madden, 2008). The use of interviewer identity in this manner as a selection variable is what allows us to adjust for the missing data caused by respondents declining to test, even if respondents are more likely to decline to test because they know they are HIV positive, and the assumption of missing at random is violated.

At convergence, reliable point-wise confidence intervals for linear and non-linear functions of the model coefficients (e.g., smooth components, prevalence estimates, copula parameter) can be easily obtained using $N(\hat{\delta}, V_\delta)$ where $V_\delta = -H_p^{-1}$ (e.g., Marra & Wood, 2012; Radice et al., 2015; Silverman, 1985; Wahba, 1983; Wood, 2006). This result can in principle also be used to construct simultaneous credible bands (e.g., Krivobokova et al., 2010). To test smooth components for equality to zero we can use the results discussed in Radice et al. (2015) which are based on Wood (2013). However, there are many previous studies which examine the predictors of testing and HIV status, therefore we are able to follow the previous literature in terms of variable selection (Bärnighausen et al., 2011; Hogan et al., 2012).
HIV prevalence estimates can be obtained using $\hat{p}_{HIV} = \sum_{i=1}^{n} w_i \Phi(\hat{\eta}_{2i}) / \sum_{i=1}^{n} w_i$, where the $w_i$ are survey weights, whereas confidence intervals are derived using the delta method (e.g., McGovern et al., 2015b; Pya & Wood, 2014).

The software for implementing all the model features and estimation and inferential procedures outlined above is freely available online through the R package SemiParBIVProbit (Marra & Radice, 2015), as are the HIV datasets (from http://www.measuredhs.com, after registration), and the code for preparing the data for analysis (from http://hdl.handle.net/1902.1/17657) (Bärnighausen et al., 2011; Hogan et al., 2012). The framework this paper provides allows researchers and policy-makers to apply a transparent approach to account for systematic non-participation in their data. The features of this software have been designed specifically with transparent and straightforward dissemination of results in mind. First, the choice of optimization algorithm and confidence interval procedure allow for results to be obtained relatively quickly without the need for bootstrapping or complex simulation procedures. Second, model specification is largely data-driven and implementation is designed to avoid arbitrary decisions by the researcher (such as pooling of interviewers and polynomial or cut-point specification for continuous variables); the parametric dependence structure can be determined by information criteria. Finally, national HIV prevalence estimates and adjusted confidence intervals (which account for the uncertainty inherent in estimating the relationship between testing participation and HIV status) can be obtained directly as the primary output of the model, along with sub-national spatial maps (see Section 5).

4 Data

We implement our simultaneous equation model framework described above to estimate HIV prevalence in three sub-Saharan African countries: Zambia, Zimbabwe, and Swaziland.
All three of these countries rely on publicly available nationally representative household surveys for their HIV prevalence estimates, and their data are affected by non-participation. In addition, they have heterogeneous regions and relatively high HIV prevalence. The relevant data are the Demographic and Health Surveys (DHS) conducted in Zambia in 2007, Zimbabwe in 2008, and Swaziland in 2008. The DHS are a series of cross-sectional household surveys in developing countries which have been conducted since 1980, and now comprise nearly 100 in total (Fabic et al., 2012; Corsi et al., 2012). These surveys focus on topics such as health and fertility, and interview nationally representative samples of men and women in each country. The sampling procedure for the DHS is designed in two stages. First, a random sample of primary sampling units (PSUs) is drawn; these comprise geographic locations usually defined by a preceding census. This first stage sampling is often stratified by urban/rural location, and/or region. Then, a random sample of households is chosen within each PSU, and all eligible residents of these households are sought for interview.

Over the past 15 years, a number of DHS surveys have collected blood samples from men and women to be tested for HIV in addition to the standard interview (Mishra et al., 2006). In the relevant surveys, respondents are asked, at the end of their individual interview, if they would consent to test for HIV. If they consent, a blood sample is drawn by finger prick by the interviewer, and subsequently sent to be laboratory tested for HIV. The results are not generally returned to the participants, but rather anonymized and made available for linkage to the main interview data using an anonymized code. All DHS HIV surveys comply with best practice as recommended by the official WHO/UNAIDS guidelines.
Because these data are designed to provide estimates of HIV prevalence from a nationally representative sample, they are considered to be the gold standard in low and middle income countries (Boerma et al., 2003).

Regional identifiers for respondents are available as part of the publicly available Demographic and Health Survey (DHS) data used in this analysis, and information on spatial boundaries at the sub-national level is publicly available from http://gadm.org/. In some countries, it is possible to request special access to more detailed geographic information; however, the DHS are not designed to be representative below the regional level. In addition, previous assessment of data quality on HIV prevalence at the sub-regional level in the DHS has highlighted a number of limitations (Larmarange & Bendaud, 2014), including the fact that sampling within regions can be sparse and often involves relatively few primary sampling unit clusters. Therefore, in this analysis we focus on regional level heterogeneity in estimating HIV prevalence.

The model components are described in Section 2.1. Specifically, we follow the previous literature and include the following binary and categorical variables in $x_i$: type of location (urban or rural), marital status, had a sexually transmitted disease, age at first intercourse, had high risk sex, number of partners, condom use, would care for an HIV-infected relative, knows someone who died of AIDS, previously tested for HIV, smokes, drinks alcohol, language, region, ethnicity and religion. Unlike the previous literature, we specify smooth functions of age, years of education, and wealth index (based on household assets) and employ Markov random field smoothers to model spatial variation. All these components enter into the linear predictors for selection ($\eta_1$) and HIV status ($\eta_2$). The selection variable (exclusion restriction) is interviewer identity and enters into $\eta_1$ only.
We apply a ridge penalty to the coefficients of this variable in order to account for the difficulties associated with its use which we outlined in Section 1.4. Linear predictor $\eta_3$ depends only on a Markov random field term and allows for the copula association parameter to vary by region. All of our models are stratified by sex to reflect potentially sex-specific consent and HIV related factors. Some surveys, including the DHS, provide survey weights ($w_i$) to better match the characteristics of the ex-post sampled population to the target population of interest, and these weights can easily be incorporated into the analysis. All our prevalence estimates are weighted to be nationally representative.

Table 1 illustrates the sample size, number of regions, number of respondents who participate in testing, and the number of respondents who are HIV positive (among those who participate in testing) in each survey. There are between 4 and 8 thousand observations in each country, with the percentage of eligible respondents consenting to test for HIV ranging from 78% for men in Zambia and Zimbabwe, to 92% for women in Swaziland. The percentage of HIV positive individuals (among those who consent to test) is high in all countries, and ranges from 12% for men in Zambia to 31% among women in Swaziland. Confidence intervals for the HIV prevalence estimates which do not account for non-participation are between 3 and 4 percentage points wide in each country.

Country    Sex    HIV negative  HIV positive  % HIV positive (95% CI)  Declined to test  Consented to test  % consented (95% CI)  Regions
Zambia     Men    4,457         641           12% (11% - 13%)          1,318             5,098              78% (76% - 80%)       9
Zambia     Women  4,689         936           16% (14% - 18%)          1,400             5,625              79% (78% - 81%)       9
Zimbabwe   Men    4,773         782           14% (13% - 16%)          1,620             5,555              78% (76% - 80%)       10
Zimbabwe   Women  5,941         1,553         21% (20% - 23%)          1,413             7,494              84% (83% - 85%)       10
Swaziland  Men    2,898         704           19% (18% - 21%)          554               3,602              87% (86% - 89%)       4
Swaziland  Women  3,146         1,438         31% (29% - 33%)          403               4,584              92% (91% - 93%)       4

Table 1: Descriptive Statistics for Demographic and Health Survey HIV Data. HIV prevalence (%) and consent to test (%) estimates are weighted, and confidence intervals are clustered to account for survey design. HIV status is only available for those who consent to test. Observations not contacted to test for HIV are not included in the analysis.

In this paper, we focus on non-participation due to eligible respondents declining to test for HIV after interview. The amount of missing data due to this type of non-participation is typically more substantial than non-participation due to eligible respondents not being available for interview (Hogan et al., 2012). In addition, previous analysis of the Zambia data found little evidence of selection bias among this second group (Bärnighausen et al., 2011). We exclude a small number of observations from the analysis sample in each country due to missing information on covariates. However, these constitute less than 1% of total observations.

In the following section, we present new sex-specific national HIV prevalence point estimates and confidence intervals for Zambia, Zimbabwe, and Swaziland. We compare results from our selection model approach, which accounts for systematic non-participation, to the recommended imputation approach, which does not. Under the imputation approach, only the outcome equation with linear predictor $\eta_2$ is fitted. In addition, we illustrate the regional heterogeneity in HIV prevalence and dependence parameter in each country, and present the association between HIV status and our main predictors of interest (age, wealth and years of education) derived from our smoothing approach. As we outline in Section 1.4, these factors are fundamental for population surveillance and targeted intervention (Gouws et al., 2008; Hargreaves et al., 2008; Gillespie et al., 2007).
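As a small illustration of how survey weights enter the national estimates, the weighted prevalence estimator $\hat{p}_{HIV} = \sum_i w_i \Phi(\hat{\eta}_{2i}) / \sum_i w_i$ from Section 3.3 can be sketched as follows. The fitted linear predictor values and weights are made-up numbers, and the function names are our own; the delta-method confidence intervals are not reproduced here.

```python
import math


def std_normal_cdf(x):
    """Standard normal cdf, Phi(x), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def weighted_prevalence(eta2_hat, weights):
    """Survey-weighted average of Phi(eta2_hat_i)."""
    numerator = sum(w * std_normal_cdf(e) for e, w in zip(eta2_hat, weights))
    return numerator / sum(weights)


# Hypothetical fitted outcome-equation predictors and survey weights.
eta2_hat = [-1.2, -0.8, 0.0, -1.5]
weights = [1.0, 2.0, 0.5, 1.5]

p_hat = weighted_prevalence(eta2_hat, weights)
```

Because every eligible respondent has a fitted $\hat{\eta}_{2i}$, the average runs over the full eligible sample, including those who declined to test.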
5 Results

Table 2 presents national estimates of HIV prevalence (and associated confidence intervals) obtained from the simultaneous equation framework we outline above. These are compared to imputation-based estimates shown in column 1, which only use the single linear predictor equation for HIV status ($\eta_2$). As was found in previous research, we find that these estimates are almost identical to those in Table 1 which were based only on observations without missing data (Mishra et al., 2008; Marston et al., 2008; Hogan et al., 2012; Bärnighausen et al., 2011). Moreover, these imputation-based confidence intervals are similarly between 3 and 4 percentage points wide.

Sex    Country    Imputation model: HIV prevalence (95% CI)  Selection model: HIV prevalence (95% CI)  θ̂ (95% CI)
Men    Swaziland  19.0 (17.9, 20.2)                          25.8 (23.3, 28.4)                         −4.09 (−10.4, −1.82)
Men    Zambia     12.0 (11.1, 12.9)                          22.9 (19.8, 25.9)                         −8.45 (−16.4, −4.25)
Men    Zimbabwe   14.8 (13.8, 15.8)                          14.9 (12.8, 17.0)                         −1.03 (−22.7, −1.00)
Women  Swaziland  30.7 (29.5, 31.9)                          34.9 (33.3, 36.5)                         −9.83 (−30.9, −3.91)
Women  Zambia     16.2 (15.3, 17.1)                          19.3 (13.8, 24.7)                         −1.40 (−2.39, −1.07)
Women  Zimbabwe   21.8 (19.1, 24.4)                          23.0 (19.4, 26.7)                         −1.45 (−3.79, −1.05)

Table 2: National estimates of HIV prevalence (and associated confidence intervals) obtained from the imputation and proposed simultaneous equation approaches. The estimates shown in column 1 do not account for potentially systematic non-participation whereas those in column 2 do. The dependence structure used for estimating the sample selection models is based on the Joe 90 copula. Because we specify the dependence parameter in terms of a linear predictor, the values shown in column 3 are the average values in each country. Intervals are calculated using the inferential result mentioned in Section 3.3. The range of θ is (−∞, −1), with higher values (in absolute value) indicating greater association; Figure 1 shows three dependence scenarios.

Column 2, which shows our selection model estimates accounting for potentially systematic non-participation, indicates evidence of selection bias for men (Swaziland and Zambia) and women (Swaziland). In each of these cases, we can reject the hypothesis that the selection model point estimates are the same as those from the imputation-based approach, or from analysis of observations without missing data. In the final column of Table 2, we present estimates of the copula association parameter, which measures the degree of association between participation in testing and HIV status (conditional on observed covariates). Because we specify the copula dependence parameter in terms of a linear predictor equation ($\eta_3$) which is a function of region, we do not impose homogeneity on all observations. The values shown in column 3 are the average values in each country. The range of this parameter is (−∞, −1), with higher values (in absolute value) indicating greater association. Three dependence scenarios for the 90-degree rotated Joe copula are illustrated in Figure 1. Although the precise definition of this parameter will vary according to the copula of interest, in this case when this parameter is close to −1 there is no association between participation in testing and HIV status once observed characteristics have been adjusted for, and hence no selection bias on unobservables. This is the case for men in Zimbabwe, and women in Zambia and Zimbabwe, where the selection model HIV prevalence estimate is close to that of the imputation-based approach. However, in all cases, we find that the imputation method substantially understates the amount of uncertainty associated with estimating HIV prevalence when survey testing data are affected by non-participation. Confidence intervals obtained from the selection model are generally twice as wide as those from the single-equation approach.
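The statement that θ close to −1 corresponds to no association can be checked numerically. The sketch below assumes the standard Joe copula $C(u, v; \theta) = 1 - \{(1-u)^\theta + (1-v)^\theta - (1-u)^\theta (1-v)^\theta\}^{1/\theta}$ for $\theta \geq 1$ and a commonly used convention for the 90-degree rotation, $C_{90}(u, v; \theta) = v - C(1-u, v; -\theta)$ with $\theta \leq -1$; the exact parametrization used in the software may differ, and the marginal probabilities are hypothetical.

```python
def joe_copula(u, v, theta):
    """Standard Joe copula, defined for theta >= 1."""
    a = (1.0 - u) ** theta
    b = (1.0 - v) ** theta
    return 1.0 - (a + b - a * b) ** (1.0 / theta)


def joe90(u, v, theta):
    """Joe copula rotated by 90 degrees, theta <= -1 (assumed rotation convention)."""
    return v - joe_copula(1.0 - u, v, -theta)


# Hypothetical marginal probabilities of participating in testing and of being HIV positive.
u, v = 0.8, 0.2

p11_independent = joe90(u, v, -1.0)   # theta at the boundary of its range
p11_dependent = joe90(u, v, -7.0)     # moderate dependence, as in Figure 1
```

At θ = −1 the joint probability equals the product of the margins, whereas for θ = −7 the probability of both participating and being HIV positive falls below that product, in line with the interpretation that those most likely to be HIV positive are least likely to test.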
We have considered a number of different dependence structures for estimating these models, and these estimates do not depend on the assumption of bivariate normality. Using information criteria, we find that the Joe 90 copula is the preferred choice for most cases, and therefore all estimates in Table 2 use this dependence structure, although we have verified that the results are not sensitive to this choice (see McGovern et al., 2015b, for an explanation of this result). The empirical support we find for the Joe 90 copula in the data is consistent with our a priori expectations about the behavioral selection mechanisms underlying non-participation in HIV testing (Arpino et al., 2014; Floyd et al., 2013; Reniers & Eaton, 2009; Bärnighausen et al., 2012; Obare, 2010).

[Figure 1 image: three contour plot panels titled "Rotated Joe - 90 degrees (θ = −2)", "Rotated Joe - 90 degrees (θ = −7)" and "Rotated Joe - 90 degrees (θ = −14)"; horizontal axis: Propensity to Participate in Testing; vertical axis: Propensity to be HIV Positive.]

Figure 1: Three dependence scenarios for the 90-degree rotated Joe copula: θ = −2, minimal dependence, θ = −7, moderate dependence, θ = −14, high dependence. The range of θ is (−∞, −1). When this parameter is close to −1, there is no association between participation in testing and HIV status once observed characteristics have been adjusted for. Note that the dependence structure implied by the Joe 90 copula is consistent with the interpretation that those who are most likely to be HIV positive are those who are also most likely to decline to participate in testing.
The Joe 90 copula (and the closely related Clayton 270 and Gumbel 90 copulae) is asymmetric and has a relatively dense left-hand tail (when compared to the bivariate normal distribution). In our application, this supports the hypothesis that those who are most likely to be HIV positive are those who are also most likely to decline to participate in testing. Our sub-national HIV prevalence estimates, which are based on the approach outlined in Section 2.1, and the region-specific copula dependence parameters, are presented in Figures 2 (men) and 3 (women). There is clear variation in HIV prevalence within some countries, most notably for men in Zambia and women in Zambia and Zimbabwe, either on the basis of the imputation-based model, or the selection model estimates. For men in Zambia, selection model HIV prevalence ranges from 28% in Lusaka to 12% in Northwestern province. For women in Zambia, selection model HIV prevalence ranges from 27% in Lusaka to 10% in Northern province. For women in Zimbabwe, selection model HIV prevalence ranges from 26% in Harare to 18% in Matebeleland North. Although the sample size is reduced when conducting sub-national analyses and confidence intervals are enlarged compared to the national prevalence estimates, most of these differences between highest and lowest prevalence regions are statistically significant (results available upon request). Even in Swaziland, which is relatively more homogeneous, selection model HIV prevalence still differs by 5 percentage points between the region with the highest prevalence (28% in Hhohho) and lowest prevalence (23% in Shiselweni) for men, and 3 percentage points between the region with the highest prevalence (36% in Hhohho) and lowest prevalence (33% in Shiselweni) for women. However, these differences are not statistically significant.
In addition, there is support for heterogeneous selection mechanisms across regions within some of these countries, as we find that the copula dependence parameter varies according to location. For example, for men in Zambia, the selection model HIV prevalence for Northwestern province is 6 percentage points greater than that of the imputation-based model (12% compared to 6%), while for Luapula province, the difference is 15 percentage points (12% to 27%). In addition to this heterogeneity at the regional level, compared to a model which imposed homogeneity on the selection parameter, we found that this approach of allowing the dependence to reflect spatial variation was more efficient for estimating national HIV prevalence.

[Figure 2 panels, one row per country (Swaziland, Zambia, Zimbabwe), with regions labeled: HIV (%) − Imputation Model; HIV (%) − Selection Model; Copula parameter (θ̂).]
Figure 2: Sub-national HIV prevalence estimates for men obtained by applying the imputation and sample selection models. The copula dependence parameter plot reports the estimated absolute values of the association parameter with range (1, ∞) in a Joe copula rotated by 90 degrees. The higher the value, the stronger the association between the selection and outcome equations.

[Figure 3 panels, one row per country (Swaziland, Zambia, Zimbabwe), with regions labeled: HIV (%) − Imputation Model; HIV (%) − Selection Model; Copula parameter (θ̂).]
Figure 3: Sub-national HIV prevalence estimates for women obtained by applying the imputation and sample selection models. The copula dependence parameter plot reports the estimated absolute values of the association parameter with range (1, ∞) in a Joe copula rotated by 90 degrees. The higher the value, the stronger the association between the selection and outcome equations.

[Figure 4 panel y-axis labels: s(age,3.43), s(age,4.85), s(education,1), s(education,2.13), s(wealth,1), s(wealth,1.59); x-axes: age, education, wealth.]

Smoothed estimates obtained from our flexible spline approach for modeling the effects of continuous covariates (age, years of education and wealth) in Swaziland are shown in Figures 4 and 5. There is clear evidence of non-linearity for most of these variables in both consent to test for HIV and HIV status.
Some of these relationships are consistent across sex, for example, the association of education with participation in testing and its role as a risk factor for HIV status. Other associations differ by sex, for example, wealth exhibits a very different association with HIV status among men compared to among women. Among men, higher wealth is linearly associated with an increasing risk of being HIV positive, while there is no statistically significant association between household wealth and HIV status among women. We can use these results to identify peak prevalence (which has been adjusted for selective non-participation) according to the predictor of interest, for example, age. The highest HIV prevalence occurs at age 25 among women in Swaziland, compared to age 35 among men. The functional form for these relationships also differs across models, which supports our data-driven approach to model specification and the avoidance of imposing a common specification across models. Smooth function estimates for Zambia and Zimbabwe are available upon request.

Figure 4: Swaziland (men). Smooth function estimates and associated 95% point-wise confidence intervals in the selection (first row) and outcome (second row) equations obtained from the proposed sample selection model based on the Joe copula rotated by 90 degrees. Results are plotted on the scale of respective linear predictors. The jittered rug plot, at the bottom of each graph, shows the covariate values. The numbers in brackets in the y-axis captions are the effective degrees of freedom of the smooth curves; the higher the value, the more complex the estimated curve.
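To illustrate how the effective degrees of freedom reported in the smooth plots relate to the amount of penalization, the sketch below fits a penalized regression spline by ordinary least squares and computes the trace of the resulting smoother. It is a deliberately simplified stand-in and not the estimation procedure of the paper: it assumes a Gaussian response, a truncated-linear basis and a ridge penalty on the knot coefficients, and it uses synthetic data. Note also that this sketch counts the intercept and slope, so a straight line corresponds to 2 degrees of freedom here, whereas the figures report 1 for a linear effect.

```python
import numpy as np


def truncated_linear_basis(x, knots):
    """Columns: intercept, x, and (x - k)_+ for each knot k."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x] + [np.clip(x - k, 0.0, None) for k in knots]
    return np.column_stack(cols)


def penalized_spline(x, y, knots, lam):
    """Return coefficients and effective degrees of freedom for penalty weight lam."""
    B = truncated_linear_basis(x, knots)
    # Penalize only the knot coefficients; intercept and slope stay unpenalized.
    D = np.diag([0.0, 0.0] + [1.0] * len(knots))
    A = B.T @ B + lam * D
    beta = np.linalg.solve(A, B.T @ np.asarray(y, dtype=float))
    # edf = trace of the hat matrix B (B'B + lam D)^{-1} B'.
    edf = float(np.trace(np.linalg.solve(A, B.T @ B)))
    return beta, edf


if __name__ == "__main__":
    x = np.linspace(0.0, 1.0, 200)   # synthetic covariate rescaled to [0, 1]
    y = np.sin(2.0 * np.pi * x)      # synthetic response, not survey data
    knots = np.linspace(0.1, 0.9, 9)
    for lam in (0.0, 1.0, 1e8):
        _, edf = penalized_spline(x, y, knots, lam)
        print(lam, edf)
```

With no penalty the effective degrees of freedom equal the number of basis functions; as the penalty weight grows they shrink towards the dimension of the unpenalized (linear) part, which is why a smooth with low effective degrees of freedom looks close to a straight line.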
[Figure 5 panel y-axis labels: s(age,3.06), s(age,4.82), s(education,1.89), s(education,2.87), s(wealth,1), s(wealth,2.61); x-axes: age, education, wealth.]
Figure 5: Swaziland (women). Smooth function estimates and associated 95% point-wise confidence intervals in the selection (first row) and outcome (second row) equations obtained from the proposed sample selection model based on the Joe copula rotated by 90 degrees. Results are plotted on the scale of respective linear predictors. The jittered rug plot, at the bottom of each graph, shows the covariate values. The numbers in brackets in the y-axis captions are the effective degrees of freedom of the smooth curves; the higher the value, the more complex the estimated curve.

6 Discussion

The emergence of new datasets containing information on HIV status, obtained through testing of representative samples of the population of interest, has made an important contribution to our understanding of the evolution of the HIV epidemic (Boerma et al., 2003). These estimates provide important information for disease surveillance, support the targeting of interventions designed to slow the transmission of HIV and planning for the health needs of the population, and provide evidence on the effectiveness of population-based interventions or policies (Beyrer et al., 1999; De Cock et al., 2006; Granich et al., 2009). However, non-participation in testing as part of these surveys can lead to substantial amounts of missing data, and the assumption of missing at random for the HIV status of individuals who do not participate in testing may not be realistic (Arpino et al., 2014; Floyd et al., 2013; Reniers & Eaton, 2009; Bärnighausen et al., 2012; Obare, 2010).
Heckman-type selection models can be used to account for selective non-participation (Clark & Houle, 2014), but have a number of drawbacks which limit their application. In this paper, we develop a simultaneous equation framework which extends the capabilities of the selection model framework. In particular, previous implementations of selection models cannot account for the manner in which HIV is spread through proximal contact and social networks within communities (Klovdahl, 1985), which manifests itself in HIV data in terms of spatial dependence and clustering (Aral et al., 2005; Larmarange & Bendaud, 2014; Tanser et al., 2009). We relax the imposition of a homogeneous selection parameter, and allow for the association between testing and HIV status to vary by the location of respondents. Specifically, we adopt a Markov random field approach to model spatial variation in HIV prevalence and incorporate the geographic data on location into our estimation procedure. In addition, our approach to model specification is data-driven and is designed to avoid the researcher having to impose functional form decisions on covariate modeling when faced with censored data and potentially heterogeneous associational structure across countries or sex. Our spline approach allows for flexible model specification without the imposition of specific functional form assumptions, and the ridge penalty method for accounting for the non-identification of interviewer identity parameters does not require the pooling of successful and unsuccessful interviewers. Finally, the copula approach we adopt relaxes the usual assumption of bivariate normality, and therefore allows us to assess the sensitivity of our estimation results to non-Gaussian specifications for the structure of the dependence between participation in testing and HIV status.
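The Markov random field idea mentioned above can be illustrated with the quadratic penalty commonly used for regional effects, in which the squared differences between the coefficients of neighbouring regions are penalized. The sketch below builds that penalty matrix from a neighbourhood list. It is a generic construction offered under stated assumptions, not the paper's code: the function name is hypothetical, and the adjacency among the four Swaziland regions is assumed purely for illustration and may differ from the neighbourhood structure used in the analysis.

```python
import numpy as np

REGIONS = ["Hhohho", "Manzini", "Lubombo", "Shiselweni"]

# Assumed (illustrative) neighbourhood structure; must be symmetric.
NEIGHBOURS = {
    "Hhohho": ["Manzini", "Lubombo"],
    "Manzini": ["Hhohho", "Lubombo", "Shiselweni"],
    "Lubombo": ["Hhohho", "Manzini", "Shiselweni"],
    "Shiselweni": ["Manzini", "Lubombo"],
}


def mrf_penalty(regions, neighbours):
    """Penalty matrix K with beta' K beta = sum over neighbouring pairs of
    (beta_i - beta_j)^2: diagonal = number of neighbours, off-diagonal = -1."""
    index = {r: i for i, r in enumerate(regions)}
    K = np.zeros((len(regions), len(regions)))
    for r, nbrs in neighbours.items():
        for s in nbrs:
            i, j = index[r], index[s]
            K[i, j] = -1.0
            K[i, i] += 1.0
    if not np.allclose(K, K.T):
        raise ValueError("neighbourhood structure is not symmetric")
    return K


if __name__ == "__main__":
    K = mrf_penalty(REGIONS, NEIGHBOURS)
    beta = np.array([0.0, 1.0, 2.0, 3.0])  # arbitrary regional effects
    print(K)
    print("penalty:", beta @ K @ beta)
```

A constant vector of regional effects incurs no penalty, so the penalty shrinks neighbouring regions towards each other rather than towards zero, which is what allows geographic clustering to be borrowed across adjacent areas.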
All of the models and data used in this paper are publicly available, and our approach is designed to provide a straightforward but flexible way of estimating HIV prevalence which accounts for selective non-participation in testing. Our results for Zambia, Zimbabwe, and Swaziland indicate that some DHS HIV surveys are likely to be affected by selection bias. Using our simultaneous equation framework, we find that the selection model HIV prevalence is substantially higher than, and statistically different from, either the imputation-based single equation estimates or analysis of cases without missing data for men in Swaziland and Zambia, and women in Swaziland. These results are consistent with previous findings of selection bias in HIV data in several contexts (Bärnighausen et al., 2011; Clark & Houle, 2014; Hogan et al., 2012; Janssens et al., 2014), as are the similarities between results from imputation-based models and analysis of non-missing observations (Mishra et al., 2008; Marston et al., 2008). We also find that, once we account for the fact that the relationship between participation in testing and HIV status is unknown, conventional confidence intervals prove too narrow and do not reflect the true uncertainty associated with surveys which are affected by non-participation. Our sub-national estimates indicate the presence of substantial heterogeneity in both HIV prevalence and selection behavior, which supports the inclusion of spatial dependence in our framework. Similarly, there are important non-linearities and functional form differences across sex and country in the association between observed characteristics of survey respondents and testing participation and HIV status outcomes, which highlights the relevance of our spline and penalized smoothing framework.
In this paper, we have focused on HIV surveillance data from national household DHS surveys; however, there are many other contexts in which biomarker data collection is affected by non-participation. Specifically within HIV research, demographic surveillance sites which collect information on residents of localized communities, and randomized controlled trials which have HIV status as their primary outcome, are also affected by non-participation (Harel et al., 2012; Tanser et al., 2008). Beyond HIV research, many ageing studies now collect biomarker data from survey participants (such as the Health and Retirement Study, HRS, in the US or the English Longitudinal Study of Ageing, ELSA, in England); however, participation in the clinical assessment modules of these surveys can be low, potentially reflecting the fact that less healthy participants are more likely to opt out. This would lead to a similar problem with systematic non-participation as we observe in the case of HIV data. More broadly, there are many contexts in which data are missing for a substantial proportion of observations in medical and social science surveys, and in many of these the assumption of missing at random may be unrealistic due to the existence of plausible behavioral mechanisms leading to selection bias. The methodology we outline in this paper therefore has a wide range of potential applications outside of HIV research. The collection of information on survey design (such as interviewer identity) in many studies means that this approach provides a plausible way of dealing with systematic non-participation which can be easily implemented in many contexts (Bärnighausen et al., 2011). From a methodological point of view, it would be interesting to explore the use of semi/nonparametric copula approaches.
These would allow the margins and/or the copula to be estimated non-parametrically using, for instance, smoothing methods such as kernels, wavelets and orthogonal polynomials. If the specification of the model for the margins and copula is correct, then the parametric approach will outperform semi/non-parametric methods; however, the reverse will be true under misspecification. Without any valuable prior information, semi/non-parametric techniques should be favored as they will be more flexible in determining the shape of the underlying distribution. However, in practice, such techniques are typically limited with regard to the inclusion of a large set of covariates and very flexible linear predictor structures, may require the imposition of restrictions on the functions approximating the underlying distribution, and may be computationally demanding (e.g., Deheuvels, 1981a,b; Genest et al., 1995; Tutz & Petry, 2013). Future research will look into the feasibility of such developments.

References

Andrews, D. W. (1999). Estimation when a parameter is on a boundary. Econometrica, 67, 1341–1383.
Angotti, N., Bula, A., Gaydosh, L., Kimchi, E. Z., Thornton, R. L., & Yeatman, S. E. (2009). Increasing the acceptability of HIV counseling and testing with three C’s: convenience, confidentiality and credibility. Social Science & Medicine, 68(12), 2263–2270.
Aral, S. O., Padian, N. S., & Holmes, K. K. (2005). Advances in multilevel approaches to understanding the epidemiology and prevention of sexually transmitted infections and HIV: an overview. Journal of Infectious Diseases, 191(Supplement 1), S1–S6.
Arpino, B., Cao, E. D., & Peracchi, F. (2014). Using panel data for partial identification of Human Immunodeficiency Virus prevalence when infection status is missing not at random. Journal of the Royal Statistical Society: Series A, 177, 587–606.
Bärnighausen, T., Bor, J., Wandira-Kazibwe, S., & Canning, D. (2011).
Correcting HIV prevalence estimates for survey nonparticipation using Heckman-type selection models. Epidemiology, 22, 27–35.
Bärnighausen, T., Bor, J., Wandira-Kazibwe, S., & Canning, D. (2011). Interviewer identity as exclusion restriction in epidemiology. Epidemiology, 22(3), 446.
Bärnighausen, T., Tanser, F., Malaza, A., Herbst, K., & Newell, M.-L. (2012). HIV status and participation in HIV surveillance in the era of antiretroviral treatment: a study of linked population-based and clinical data in rural South Africa. Tropical Medicine and International Health, 17, e103–e110.
Beyrer, C., Baral, S., Kerrigan, D., El-Bassel, N., Bekker, L.-G., & Celentano, D. D. (1999). Expanding the space: Inclusion of most-at-risk populations in HIV prevention, treatment, and care services. Journal of Acquired Immune Deficiency Syndromes, 57(Suppl 2), S96.
Boerma, J. T., Ghys, P. D., & Walker, N. (2003). Estimates of HIV-1 prevalence from national population-based surveys as a new gold standard. Lancet, 362, 1929–1931.
Bor, J., Herbst, A. J., Newell, M.-L., & Bärnighausen, T. (2013). Increases in adult life expectancy in rural South Africa: valuing the scale-up of HIV treatment. Science, 339(6122), 961–965.
Brechmann, E. C. & Schepsmeier, U. (2013). Modeling dependence with C- and D-vine copulas: The R package CDVine. Journal of Statistical Software, 52(3), 1–27.
Butler, J. S. (1996). Estimating the correlation in censored probit models. The Review of Economics and Statistics, 78, 356–358.
Chiburis, R. C., Das, J., & Lokshin, M. (2012). A practical comparison of the bivariate probit and linear IV estimators. Economics Letters, 117, 762–766.
Clark, S. & Houle, B. (2012). Evaluation of Heckman selection model method for correcting estimates of HIV prevalence from sample surveys via realistic simulation. Center for Statistics and the Social Sciences Working Paper No. 120, University of Washington.
Clark, S. J. & Houle, B. (2014).
Validation, replication, and sensitivity testing of Heckman-type selection models to adjust estimates of HIV prevalence. PLoS ONE, 9, e112563.
Conniffe, D. & O’Neill, D. (2011). Efficient Probit Estimation with Partially Missing Covariates. Advances in Econometrics, 27, 209–245.
Corsi, D. J., Neuman, M., Finlay, J. E., & Subramanian, S. (2012). Demographic and health surveys: a profile. International Journal of Epidemiology, 41, 1602–1613.
De Cock, K. M., Bunnell, R., & Mermin, J. (2006). Unfinished business: expanding HIV testing in developing countries. New England Journal of Medicine, 354(5), 440–442.
De Luca, G. (2008). SNP and SML estimation of univariate and bivariate binary-choice models. Stata Journal, 8(2), 190–220.
Deheuvels, P. (1981a). A Kolmogorov-Smirnov type test for independence and multivariate samples. Romanian Journal of Pure and Applied Mathematics, 26, 213–226.
Deheuvels, P. (1981b). A nonparametric test for independence. Pub. Inst. Stat. Univ. Paris, 26, 29–50.
Donders, A. R. T., van der Heijden, G. J., Stijnen, T., & Moons, K. G. (2006). Review: a gentle introduction to imputation of missing values. Journal of Clinical Epidemiology, 59(10), 1087–1091.
Dubin, J. A. & Rivers, D. (1989). Selection bias in linear regression, logit and probit models. Sociological Methods & Research, 18, 360–390.
Eilers, P. H. C. & Marx, B. D. (1996). Flexible smoothing with B-splines and penalties. Statistical Science, 11(2), 89–121.
Fabic, M. S., Choi, Y., & Bird, S. (2012). A systematic review of Demographic and Health Surveys: data availability and utilization for research. Bulletin of the World Health Organization, 90(8), 604–612.
Floyd, S., Molesworth, A., Dube, A., Crampin, A. C., Houben, R., Chihana, M., Price, A., Kayuni, N., Saul, J., & French, N. (2013). Underestimation of HIV prevalence in surveys when some people already know their status, and ways to reduce the bias. AIDS, 27, 233–242.
Geneletti, S., Mason, A., & Best, N. (2011).
Adjusting for selection effects in epidemiologic studies: why sensitivity analysis is the only solution. Epidemiology, 22(1), 36–39.
Genest, C., Ghoudi, K., & Rivest, L. P. (1995). A semiparametric estimation procedure of dependence parameters in multivariate families of distributions. Biometrika, 82, 543–552.
Geyer, C. J. (2013). Trust regions. http://cran.r-project.org/web/packages/trust/vignet
Gillespie, S., Greener, R., Whiteside, A., & Whitworth, J. (2007). Investigating the empirical evidence for understanding vulnerability and the associations between poverty, HIV infection and AIDS impact. AIDS, 21, S1–S4.
Gouws, E., Stanecki, K. A., Lyerla, R., & Ghys, P. D. (2008). The epidemiology of HIV infection among young people aged 15–24 years in southern Africa. AIDS, 22, S5–S16.
Govindasamy, D., Ford, N., & Kranzer, K. (2012). Risk factors, barriers and facilitators for linkage to antiretroviral therapy care: a systematic review. AIDS, 26(16), 2059–2067.
Granich, R. M., Gilks, C. F., Dye, C., De Cock, K. M., & Williams, B. G. (2009). Universal voluntary HIV testing with immediate antiretroviral therapy as a strategy for elimination of HIV transmission: a mathematical model. The Lancet, 373(9657), 48–57.
Gu, C. (2002). Smoothing Spline ANOVA Models. Springer-Verlag, London.
Harel, O., Pellowski, J., & Kalichman, S. (2012). Are We Missing the Importance of Missing Values in HIV Prevention Randomized Clinical Trials? Review and Recommendations. AIDS and Behavior, 16(6), 1382–1393.
Hargreaves, J. R., Bonell, C. P., Boler, T., Boccia, D., Birdthistle, I., Fletcher, A., Pronyk, P. M., & Glynn, J. R. (2008). Systematic review exploring time trends in the association between educational attainment and risk of HIV infection in sub-Saharan Africa. AIDS, 22(3), 403–414.
Hastie, T. & Tibshirani, R. (1993). Varying-coefficient models. Journal of the Royal Statistical Society Series B, 55, 757–796.
Heckman, J. (1990). Varieties of selection bias.
American Economic Review, (pp. 313–318).
Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica, 47, 153–161.
Hogan, D. R., Salomon, J. A., Canning, D., Hammitt, J. K., Zaslavsky, A. M., & Bärnighausen, T. (2012). National HIV prevalence estimates for sub-Saharan Africa: controlling selection bias with Heckman-type selection models. Sexually Transmitted Infections, 88(Suppl 2), i17–i23.
Janssens, W., van der Gaag, J., de Wit, T. F. R., & Tanović, Z. (2014). Refusal bias in the estimation of HIV prevalence. Demography, 51(3), 1131–1157.
Kalichman, S. C. & Simbayi, L. C. (2003). HIV testing attitudes, AIDS stigma, and voluntary HIV counselling and testing in a black township in Cape Town, South Africa. Sexually Transmitted Infections, 79(6), 442–447.
Karim, Q. A., Karim, S. S. A., Frohlich, J. A., Grobler, A. C., Baxter, C., Mansoor, L. E., Kharsany, A. B., Sibeko, S., Mlisana, K. P., Omar, Z., et al. (2010). Effectiveness and safety of tenofovir gel, an antiretroviral microbicide, for the prevention of HIV infection in women. Science, 329(5996), 1168–1174.
Kauermann, G. (2005). Penalized spline smoothing in multivariable survival models with varying coefficients. Computational Statistics and Data Analysis, 49, 169–186.
Klein, N., Kneib, T., & Stefan, L. (2014a). Bayesian generalized additive models for location, scale and shape for zero-inflated and overdispersed count data. Journal of the American Statistical Association.
Klein, N., Kneib, T., & Stefan, L. (2014b). Bayesian structured additive distributional regression for multivariate responses. Journal of the Royal Statistical Society: Series C.
Klovdahl, A. S. (1985). Social networks and the spread of infectious diseases: the AIDS example. Social Science & Medicine, 21(11), 1203–1216.
Korenromp, E. L., Gouws, E., & Barrere, B. (2013). HIV prevalence measurement in household surveys: is awareness of HIV status complicating the gold standard? AIDS, 27(2), 285–287.
Kranzer, K., Govindasamy, D., Ford, N., Johnston, V., & Lawn, S. D. (2012). Quantifying and addressing losses along the continuum of care for people living with HIV infection in sub-Saharan Africa: a systematic review. Journal of the International AIDS Society, 15(2).
Kranzer, K., McGrath, N., Saul, J., Crampin, A. C., Jahn, A., Malema, S., Mulawa, D., Fine, P. E., Zaba, B., & Glynn, J. R. (2008). Individual, household and community factors associated with HIV test refusal in rural Malawi. Tropical Medicine & International Health, 13(11), 1341–1350.
Krivobokova, T., Kneib, T., & Claeskens, G. (2010). Simultaneous confidence bands for penalized spline estimators. Journal of the American Statistical Association, 105, 852–863.
Larmarange, J. & Bendaud, V. (2014). HIV estimates at second subnational level from national population-based surveys. AIDS, 28, S469–S476.
Madden, D. (2008). Sample selection versus two-part models revisited: the case of female smoking and drinking. Journal of Health Economics, 27, 300–307.
Manski, C. F. (1990). Nonparametric bounds on treatment effects. American Economic Review, (pp. 319–323).
Marra, G. & Radice, R. (2013). A penalized likelihood estimation approach to semiparametric sample selection binary response modeling. Electronic Journal of Statistics, 7, 1432–1455.
Marra, G. & Radice, R. (2015). SemiParBIVProbit: Semiparametric Bivariate Probit Modelling. R package version 3.4.
Marra, G. & Wood, S. (2012). Coverage properties of confidence intervals for generalized additive model components. Scandinavian Journal of Statistics, 39, 53–74.
Marston, M., Harriss, K., & Slaymaker, E. (2008). Non-response bias in estimates of HIV prevalence due to the mobility of absentees in national population-based surveys: a study of nine national surveys. Sexually Transmitted Infections, 84, i71–i77.
McGovern, M., Bärnighausen, T., Salomon, J., & Canning, D. (2015a). Using interviewer random effects to remove selection bias from HIV prevalence estimates.
BMC Medical Research Methodology, 15(8).
McGovern, M. E., Bärnighausen, T., Marra, G., & Radice, R. (2015b). On the assumption of bivariate normality in selection models: A copula approach applied to estimating HIV prevalence. Epidemiology, 26, 229–237.
Mishra, V., Barrere, B., Hong, R., & Khan, S. (2008). Evaluation of bias in HIV seroprevalence estimates from national household surveys. Sexually Transmitted Infections, 84, i63–i70.
Mishra, V., Vaessen, M., Boerma, J., Arnold, F., Way, A., Barrere, B., Cross, A., Hong, R., & Sangha, J. (2006). HIV testing in national population-based surveys: experience from the Demographic and Health Surveys. Bulletin of the World Health Organization, 84, 537–545.
Nelsen, R. (2006). An Introduction to Copulas. New York: Springer.
Nicoletti, C. (2006). Nonresponse in dynamic panel data models. Journal of Econometrics, 132(2), 461–489.
Nocedal, J. & Wright, S. J. (2006). Numerical Optimization. New York: Springer-Verlag.
Obare, F. (2010). Nonresponse in repeat population-based voluntary counseling and testing for HIV in rural Malawi. Demography, 47, 651–665.
Pettifor, A. E., MacPhail, C., Bertozzi, S., & Rees, H. V. (2007). Challenge of evaluating a national HIV prevention programme: the case of loveLife, South Africa. Sexually Transmitted Infections, 83(suppl 1), i70–i74.
Puhani, P. (2000). The Heckman correction for sample selection and its critique. Journal of Economic Surveys, 14(1), 53–68.
Pya, N. & Wood, S. (2014). Shape constrained additive models. Statistics and Computing, (pp. 1–17).
Radice, R., Marra, G., & Wojtys, M. (2015). Copula regression spline models for binary outcomes. Revise and Resubmit, Statistics and Computing.
Reniers, G., Araya, T., Berhane, Y., Davey, G., & Sanders, E. J. (2009). Implications of the HIV testing protocol for refusal bias in seroprevalence surveys. BMC Public Health, 9, 1–9.
Reniers, G. & Eaton, J. (2009).
Refusal bias in HIV prevalence estimates from nationally representative seroprevalence surveys. AIDS, 23, 621–629.
Rigby, R. A. & Stasinopoulos, D. M. (2005). Generalized additive models for location, scale and shape (with discussion). Journal of the Royal Statistical Society: Series C, 54, 507–554.
Rue, H. & Held, L. (2005). Gaussian Markov Random Fields. Chapman & Hall/CRC, Boca Raton, FL.
Ruppert, D., Wand, M. P., & Carroll, R. J. (2003). Semiparametric Regression. Cambridge University Press, New York.
Sankoh, O. & Byass, P. (2012). The INDEPTH network: filling vital gaps in global epidemiology. International Journal of Epidemiology, 41(3), 579–588.
Silverman, B. W. (1985). Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of The Royal Statistical Society Series B, 47, 1–52.
Sklar, A. (1959). Fonctions de répartition à n dimensions et leurs marges. Publications de l’Institut de Statistique de l’Université de Paris, 8, 229–231.
Sklar, A. (1973). Random variables, joint distributions, and copulas. Kybernetika, 9, 449–460.
Tanser, F., Bärnighausen, T., Cooke, G. S., & Newell, M.-L. (2009). Localized spatial clustering of HIV infections in a widely disseminated rural South African epidemic. International Journal of Epidemiology, 38(4), 1008–1016.
Tanser, F., Bärnighausen, T., Grapsa, E., Zaidi, J., & Newell, M.-L. (2013). High coverage of ART associated with decline in risk of HIV acquisition in rural KwaZulu-Natal, South Africa. Science, 339(6122), 966–971.
Tanser, F., Hosegood, V., Bärnighausen, T., Herbst, K., Nyirenda, M., Muhwava, W., Newell, C., Viljoen, J., Mutevedzi, T., & Newell, M.-L. (2008). Cohort Profile: Africa Centre Demographic Information System (ACDIS) and population-based HIV survey. International Journal of Epidemiology, 37(5), 956–962.
Tutz, G. & Petry, S. (2013). Generalized additive models with unknown link function including variable selection. Technical Report.
Van de Ven, W. P.
& Van Praag, B. (1981). The demand for deductibles in private health insurance: A probit model with sample selection. Journal of Econometrics, 17, 229–252.
Vella, F. (1998). Estimating models with sample selection bias: a survey. Journal of Human Resources, 33, 127–169.
Vytlacil, E. (2002). Independence, monotonicity, and latent index models: An equivalence result. Econometrica, 70(1), 331–341.
Wahba, G. (1983). Bayesian ‘confidence intervals’ for the cross-validated smoothing spline. Journal of The Royal Statistical Society Series B, 45, 133–150.
Wood, S. N. (2003). Thin plate regression splines. Journal of the Royal Statistical Society Series B, 65, 95–114.
Wood, S. N. (2004). Stable and efficient multiple smoothing parameter estimation for generalized additive models. Journal of the American Statistical Association, 99, 673–686.
Wood, S. N. (2006). Generalized Additive Models: An Introduction With R. Chapman & Hall/CRC, London.
Wood, S. N. (2013). On p-values for smooth components of an extended generalized additive model. Biometrika, 100, 221–228.