To ISSP members

The methodology committee has asked the secretariat to distribute a report on substitution written by Tom Smith that was also sent to you last year. This report is not intended as an exhaustive discussion of substitution. It deals with certain important forms of substitution and their consequences and can serve as an aid in preparing for the discussion of substitution at the plenary session on Monday morning. We hope you can read this carefully before Monday.

Notes on the Use of Substitution in Surveys

Tom W. Smith

April, 2007

Substitution is a widely used (Callens and Croux, 2003; Bianchini and Loret, 1974; Lynn, 2004; Survey Sampling, 2003; Vehovar, 1999; Waksberg, 1985), but little examined, sampling method (Chapman, 1983; Rubin and Zanutto, 2002; Vehovar, 1999). Substitution designs consist of samples in which at some point during data collection nonrespondents from the initial sample are replaced with substitute cases.

The literature on substitution is sparse, dated, limited, and fugitive. Most general textbooks on sampling and survey-research methods either ignore the procedure or contain only a cursory discussion (Chapman, 2003; Kish, 1965; Lessler and Kalsbeek, 1992; Little and Rubin, 2002; Lohr, 1999; Moser and Kalton, 1972; Vehovar, 1999). When substitution is covered, sampling and survey-methodology texts usually emphasize its shortcomings. Also, several studies indicate that substitution had been disallowed or dropped, or had been utilized but was not recommended (Kordos, 2005; Lynn et al., 2004; Pol, 1992; Vehovar, n.d.; Vehovar, 1999; Vehovar et al., 2002).

Empirical studies of substitution all use less than optimal designs.
They often use uncommon variants of substitution, do not conduct full comparisons to other methods such as continuous sampling [1] with post-stratification/non-response weighting, and typically compare substitute cases to completed cases, but rarely compare 1) initial nonrespondents to their substitutes or 2) initial plus substitute completed cases to weighted completed cases without substitutes, and never compare initial plus substitute cases weighted to continued, initial cases weighted.

This paper examines 1) past studies comparing substitution to other sample designs, 2) various forms of substitution designs, 3) frequently mentioned advantages and disadvantages of substitution, 4) how substitution resembles other sample and study designs, 5) specific examples of substitution and non-substitution designs, 6) how substitution compares to continuous samples with weighting, and 7) the conditions under which certain forms of substitution would tend to be most useful.

Past Studies

Twelve studies have been identified that have done some analysis of substitution designs such as comparing substitutes to non-substitutes and evaluating alternative designs. For three of the studies only short summaries were available and the write-ups and analyses in several of the available reports were quite limited. Eleven were empirical studies and one was entirely a simulation analysis. Two were done in the 1950s, three in the 1970s, two in the 1980s, and five in the 1990s. Six were from the US, five from Europe, and one from Japan. Of the eleven empirical studies, eight used in-person interviewing and three telephone; all employed inter-unit substitution; eight substituted final units (households/individuals), one both final and intermediate units (household and respondent), and two intermediate units (schools); six utilized random substitution, three were unclear, and three were non-random.
Substitution levels were indicated for eight studies and were 5-10% for school surveys and 9-39% for household surveys.

[1] Continuous sample designs are those not using substitution in which initial cases are worked on a continuing basis.

Forms of Substitution Designs

There are many different forms of substitution. The following five elements describe the major features of substitution designs.

1. Level: 1) intermediate or 2) lowest
2. Field Control: 1) loose/permissive or 2) tight/strict
3. Selection of substitutes: 1) random or 2) non-random
4. Matching: 1) none, 2) intermediate unit (e.g. area) only, or 3) intermediate unit (e.g. area) + case-level characteristics (e.g. demographics)
5. Types of substitutes: 1) intra-household or 2) inter-household

Different combinations of these produce at least several dozen types of basic substitution designs even without getting into more exotic hybrids such as using earlier cross-sections or panel designs [2]. Of course some procedures are relatively rare (e.g. no matching and intra-household) and certain elements tend to be used together so certain combinations are more common than others (e.g. loose/convenience/area only and strict/random/area+demographics).

First, the substitute units may be the final, target-population units (e.g. students, workers, adults) or some higher level unit used to reach the target population (e.g. schools, employers, sampling points/areas). Intermediate substitution is often used because non-response from an intermediate unit would eliminate multiple members of the target population (e.g. perhaps dozens of students in a school or employees in a company) and because the intermediate unit may not be a meaningful substantive unit in the study and, as only a technical component in sampling, should not be the basis for ruling out the final, target respondents (e.g. students, employees).

Second, substitution can be tightly-controlled or loose (permissive).
Under tight substitution, 1) substitution is used only after extensively working initial cases, 2) strict protocols exist for when substitution may be used, 3) close supervision checks that interviewers are making the required efforts and following the protocols, 4) substitute cases are chosen at random within strata, 5) when possible, initial and potential substitute cases are chosen as part of the master sample design and are matched on several variables based on information from a population register or other sample frame, and 6) the proportion of cases that are substitutes is limited (a result of following the first two points in the design of a tightly-designed use of substitution). In the most extreme case, interviewers are not even made aware that substitution is being used (Vehovar, 2003; Lynn, 2004).

Third, the selection of substitutes can be done randomly or non-randomly. Random selection is usually at the initial sampling stage when clusters are drawn consisting of one initial case and typically several substitute cases. This is commonly done in school-based samples and also in household samples using sample frames with detailed information about all units such as samples from population registers. Random sampling can also be done during the field period, but this is more difficult. Systematic replacement such as taking a neighbor of a nonrespondent is not random sampling since this would lead to the relative overrepresentation of cases living next to nonrespondents and the underrepresentation of cases living next to respondents (Pol, 1992; Sirken, 1975). Non-systematic or convenience substitution is non-random and is based solely on decisions made by interviewers.

[2] For drawing substitutes from nonrespondents to earlier surveys see Kish, 1965; Smith, 1983.

Fourth, substitution may involve matching or no matching.
When matching is used, it would usually be based on some intermediate unit such as area or the intermediate unit + some other known, case-level characteristics, typically demographics. Substitution almost always involves matching or substitution within sampling strata. A simple random sample design with no information about the sample units from the sample frame would be a case in which there would in effect be no individual matching; the substitutes as a group would replace the initial nonrespondents as a group. More typical would be matching within at least an intermediate stratum, such as area codes or exchanges in RDD surveys or neighborhoods in multi-stage, area probability samples. Substitution within a geographic unit such as a neighborhood means that the initial and substitute cases share all of the aggregate, contextual variables and, to the extent that locality is important, this is a plus. If the sample frame also had case-level information on the sample units, then these could be matched on as well. This is often the case in samples based on population registers in which attributes like age and gender may be known. In effect, area, gender, and age would form strata from which initial cases and substitutes could be drawn.

Finally, for household surveys, substitution could be within or between households. Within-household substitution is apparently rarely done. Doing it would in most cases mean a change in the gender of the respondent (since most two-adult households contain adults of opposite gender) and would also mean an overrepresentation of multiple-adult households, since single-adult households would have no within-household substitutes available (Forsman and Berg, 1992). If household-level characteristics or information on all household members being supplied from one respondent were important for the study, then using an intrahousehold substitute would be helpful.
In effect in surveys with such a focus, the initial and substitute respondents are informants about the same household. (Of course, if it was a pure household-level survey, the case would be the household and any suitable household member would merely be an informant, or in effect all suitable household members are interchangeable as possible informants/“respondents”.)

Advantages and Disadvantages of Substitution

The literature mentions several advantages and disadvantages for substitution designs. Mentioned advantages include:

1. Achieving more completed cases
2. Better balance within strata in stratified samples
3. Possible reduction in nonresponse bias
4. Simplicity for users; self-weighting
5. Incentive for completion of initial cases

Substitution typically achieves a larger sample size than continuous sample designs do and, when completely successful (i.e. when a substitute is obtained for all initial cases), ensures that the desired sample size is reached (Chapman, 1983; Elteto, 2003; Lessler and Kalsbeek, 1992; Moser and Kalton, 1972; Nathan, 1980; Vehovar, 1999). This can especially be useful when contracts specify a certain minimal number of cases to be completed (Waksberg, 1985). If substitution eliminates or reduces added variance from weighting, it will also produce a larger effective sample size (Biemer, Chapman, and Alexander, 1985). Likewise, in stratified samples better balance across strata is typically achieved by substitution (Chapman, 1983; Survey Sampling, 2003). However, not all substitution surveys succeed in interviewing substitutes for all initial cases and others need to go through multiple substitutes per case to cover all initial cases, which lowers the response rate (see below in disadvantages).
To the extent that substitutes more closely resemble initial nonrespondents (for whom they are substitutes) rather than initial respondents, substitution will reduce nonresponse bias (Chapman, 2003; Lessler and Kalsbeek, 1992; Rubin and Zanutto, 2002; Vehovar, 1999). This is discussed further below.

If substitution is complete and there is no need for a design-based weight (e.g. due to an oversample or interviewing only one respondent per household) and no use of other weights (e.g. post-stratification), then substitution may produce a simpler design with no need to weight (Chapman, 1983 & 2003; Vehovar, 1999). However, most designs and implementations of them do require a weight even if substitution is used, so this advantage is not common. In addition, methods for both creating and applying weights have become easier over time and avoiding weights is less beneficial than previously.

It is often argued that the use of substitution leads to initial cases being worked less vigorously (see disadvantages below), but Moser and Kalton (1972), based on Gray and Corlett (1950), advance the argument that interviewers will be more diligent in obtaining interviews from initial cases since they know that they will still have to obtain an interview from a substitute case and have to start from scratch with the substitute case.

Mentioned disadvantages include:

1. Lower response rate
2. False or misleading response rate/ignoring true response rate
3. No reduction/increase in nonresponse bias
4. Difficulty of having/maintaining field controls
5. Longer field period
6. Substitution incomplete
7. Substitution non-random

With a similar total level of effort, the nonresponse rate may be higher in substitution surveys even before substitution itself is taken into consideration, because less effort will be devoted to the initial cases (Chapman, 1983; Chapman and Roman, 1985; Elliot, 1993; Lohr, 1999; Rubin and Zanutto, 2002; Vehovar, 1999).
Substitution will probably yield more cases than non-substitution designs with a similar level of effort (because the former will pick up more easy cases than the latter adds hard cases). Of course once substitution is correctly accounted for, the response rate would be expected to fall since nonrespondents among the substitutes will be added to the initial nonrespondents. It will fall appreciably if multiple substitutes are used to secure a replacement for all initial nonrespondents. Say, for example, that the initial response rate was 50% and an average of 2.5 cases were needed to fill all substitute cases. That would lower the final response rate to 44.4% and an average of 3.0 substitutes per initial nonrespondent would mean a response rate of 40.0%. The discussion below on cases of study designs elaborates further on this point.

Several studies also show that substitutes have a lower response rate than initial cases (Biemer, Chapman, and Alexander, 1985; Callens and Croux, 2003; Vehovar, 1993). In some cases this appears to merely result from substitute cases being worked less hard (e.g. for a shorter field period). However, to the extent that substitutes represent initial nonrespondents rather than all initial cases, a lower response rate would be expected and might be seen as a sign that the substitutes were representing the initial nonrespondents.

Substitution is sometimes used to mask or miscalculate response rates (Chapman, 1983; Lessler and Kalsbeek, 1992; Nathan, 1980; Vehovar, 1999). Standard Definitions: Final Disposition of Case Codes and Outcome Rates for Surveys (2006) of the American Association for Public Opinion Research and World Association for Public Opinion Research stipulates how response rates using substitution should be calculated. In essence, all initial nonrespondents and all nonrespondents among substitutes must remain in the base.
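The response-rate example above (a 50% initial response rate with 2.5 or 3.0 cases per substitute) can be checked with a short sketch. It assumes, as one reading that reproduces the figures in the text, that every initial nonrespondent is eventually replaced by one completed substitute and that all substitute cases fielded stay in the base. The function name and parameters are illustrative and do not come from the report or from the AAPOR standard.

```python
def final_response_rate(initial_rr: float, cases_per_substitute: float) -> float:
    """Response rate under complete substitution, per initial sample case.

    Assumption (illustrative): each initial nonrespondent is replaced by
    exactly one completed substitute, which takes `cases_per_substitute`
    fielded substitute cases on average. Every fielded case, initial or
    substitute, remains in the denominator.
    """
    initial_nonresponse = 1.0 - initial_rr
    # Completes: initial respondents plus one substitute per initial nonrespondent.
    completes = initial_rr + initial_nonresponse
    # Base: all initial cases plus all substitute cases that had to be fielded.
    base = 1.0 + initial_nonresponse * cases_per_substitute
    return completes / base


for k in (2.5, 3.0):
    print(f"{k} cases per substitute: {final_response_rate(0.5, k):.1%}")
```

Under these assumptions the function returns about 44.4% for 2.5 cases per substitute and 40.0% for 3.0, matching the figures quoted above.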
Also according to these standards, the level of substitution used in a survey needs to be separately reported.

Substitution may be less likely to reduce nonresponse bias than a continuous design (Chapman, 1983; Elliot, 1993; Kish, 1965; Lessler and Kalsbeek, 1992; Little and Rubin, 2002; Moser and Kalton, 1972; Rubin and Zanutto, 2002; Smith, 1983). Moser and Kalton (1972) state that “If substitutes are taken, all that happens is that nonrespondents are replaced by respondents, so the risk of bias remains the same.” Williams and Folsom (1976) and Durbin and Stuart (1954) found the substitute samples to be biased; Marlini and Pacei (1993) and Williams and Folsom (1976) concluded that substitutes did not closely resemble the initial nonrespondents they replaced; and Sirken (1975) showed that substitutes were more like initial respondents than initial nonrespondents. Vehovar (2003) describes the idea that substitutions outperform weighting adjustments as “never proven.”

Apparently only two studies have compared substituting with continuous surveying using nonresponse/post-stratification weighting. One found more bias from substitution than weighting (Biemer, Chapman, and Alexander, 1985; Chapman and Roman, 1985). The other concluded that substitution provides “no improvement in nonresponse bias in comparison to the alternate weighting adjustment” (Vehovar, 1993).

The final achieved sample in a substitution design may well be less like the complete target sample because it will have fewer very difficult cases than a design that expends all efforts on only the initial sample and then weights for nonresponse, since the latter will include more difficult cases and these will be weighted up along with easier cases [3]. However, both approaches will underrepresent difficult cases.
[3] Scheuren (2004) has recently reprinted for consideration a design by Kish and Hess (2004) that used noncontacts from previous surveys as substitutes for noncontacts in a subsequent survey and which proposed that similar substitution could be done with refusals. This has the potential for making the substitutes less like initial cases and more like the cases they are replacing. However, this version of substitution is rarely, if ever, used.

Field control is always difficult in decentralized designs such as multi-stage, area probability samples and this is even more challenging when substitution is employed (Vehovar, 1999). Interviewers must follow protocols for the initial cases, for when substitutes can be used, and then for handling the substitute cases. Moreover, it is commonly assumed that interviewers want to drop difficult initial cases and try potentially easier substitute cases (Chapman, 2003; Chapman, 1983; Chapman and Roman, 1985; Lessler and Kalsbeek, 1992; Lohr, 1999; Vehovar, 1993). There is a motivation both to not vigorously pursue initial cases and to prematurely switch to substitutes. However, Sudman (1967) in a comparison of block quota and full probability surveys found that interviewers did not slack off in their pursuit of cases even though “substituting” cases was fully acceptable under quota rules and, as noted above, Durbin and Stuart (1954) argue that substitution motivates more efforts on initial cases. Most descriptions of substitution designs fail to mention either field control procedures or success in carrying them out.

Rigorous substitution designs first work initial cases for an extended period and then devote a similar extended effort to securing the substitute cases. This might well lead to the lengthening of the total field period (Rubin and Zanutto, 2002; Vehovar, 1999; Waksberg, 1985). Of course, a continuous design might continue to pursue initial cases for just as long as in a substitution design.
Alternatively, a substitution design could also be set up to go no longer than a period used for a continuous design, but that would mean working initial cases and substitute cases for less time than cases in a continuous design covering the same time span. Substitution is usually assumed to be complete, but this is often not the case (Chapman, 1983). In substitution surveys coverage sometimes falls well short of the target. For example, in the National Health Interview Survey/RDD Feasibility Study substitutes were secured for only 65% of initial cases (Chapman and Roman, 1985). Likewise, a school-based study of students was able to get substitute schools for just 41% of initial nonrespondents (Williams and Folsom, 1976). Incompleteness reduces several of the proffered advantages of substitution mentioned above. Substitution may be non-random (Chapman, 1983). While not used in an optimal substitution design, this is utilized in some substitution designs and this clearly means that a full-probability sample has not been followed. Resemblance of Various Forms of Substitution to Other Designs The nature of various substitution designs can be clarified by considering in what ways they resemble and in what ways they differ from alternative, nonsubstitution designs: 1. Low Response Rate, Continuous Designs: A large probability sample with a low response rate and few callbacks is like an uncontrolled and unmatched, but randomized, substitution design. In such a continuous design extra cases are included from the start and are like aggregate replacements for lightly worked initial cases and remain nonrespondents as opposed to the sequential and case-linked replacements used in substitution designs. 2. 
Replicates: The use of replicates may seem like substitution, but the additional sample is released for purposes other than compensating and adjusting for nonresponse, such as to avoid issuing more sample than is needed (and thus to minimize expenses, to sample time more evenly in studies with a long field period, and to maintain a higher response rate), or to manage the flow of cases. The use of replicates does not involve linked cases (initial and substitute) and usually does not mean that efforts to interview active cases from the initial or earlier replicates are abandoned when later replicates are released. Replicates augment the sample and do not replace cases.

3. Quota Samples: Quota samples resemble substitution in that there is substitution and that quota controls bear some resemblance to the matched controls typically used in substitution (Smith, 1983). The resemblance is even closer when substitution does not use a random selection of substitution cases. If the quotas are multivariate (e.g. say using an 8-fold classification of gender (2) * white/not-white (2) * under/over 40 years old (2)), then substitution and quotas are very similar. If the quotas are only distributional (e.g. so many men/women, whites/non-whites, old/young), then there is greater difference between them. They differ in that there are not really any initial cases in quota samples (there are simply cases approached sooner or later in the process of securing respondents), that the quota controls are in the aggregate and do not represent matching across specific cases, and that the total number of cases touched is likely to be greater with quotas than with substitution except for the use of the loosest set of substitution rules (e.g. indiscriminate substitution).

The literature on substitution often assumes that very elaborate versions are being used.
For example, it is assumed that field periods are longer because the substitute cases are being released only after initial cases have been extensively worked and then the substitute cases themselves are worked for an extended period. But in actuality, substitution appears to be often used as a quicker and less expensive method. In this situation, substitution is done early and often and substitutes are selected on the basis of convenience rather than randomly. Under such a substitution design a main aim is completing a target number of cases as rapidly and easily as possible. As such, permissive use of substitution more closely resembles quota sampling, with either matching acting like quota controls or with little or no matching except on sampling point, so that there is little or none of the control for demographics that quotas provide.

4. Household Proxies: Using household proxies is like within-household substitution. They are essentially the same when the survey deals with household-level information or information about all household members. For respondent-level information the initial respondent remains the same, but the information about that person is supplied by the proxy, while in substitution the initial target respondent is replaced by the substitute respondent (Boyle and Brann, 1992).

5. Household Informants: When the unit is the household or for collecting information on all household members and any eligible household member can provide that information, then there is no initial respondent and selecting an informant would closely resemble within-household substitution. The difference is that there is no formal substitution and no collection of respondent-only information, but that all eligible household members are potential informants and interchangeable for one another.
Examples Comparing Response Rate/Number of Cases for Various Designs

To further compare substitution to other sample designs, Table 1 examines one substitution design and five non-substitution designs. In these examples the initial sample is always 2000 and neither in the initial sample nor in any of the added sample are any of the cases out-of-scope. Also, all examples have the same result after 8 weeks, 1000 completed cases and a provisional response rate of 50%. The differences are in how the rest of the field effort is handled.

Case A is a standard, continuous approach with no changes in design after phase one (eight weeks). In effect, there is no phase one and phase two, just one continuing effort. All cases are worked until they become completed cases or are classified as nonrespondents either during the field period or at its conclusion. In this example, an additional 400 interviews are secured for a final sample size of 1400 and a response rate of 70%.

Case B uses a full-substitution design in which the 1000 nonrespondents at the end of phase one are replaced by 1000 substitute cases. Since they are matched on area and perhaps some other characteristics with the nonrespondents from phase one, it is assumed that their response rate will be lower than that of a random sample of respondents and nonrespondents. A new sample of the whole target population worked to the same field period would be expected to yield the same result as the initial sample (50% response rate). To the extent that the matching or stratifying characteristics are related to nonresponse, the substitute sample would be expected to yield a lower response rate. Here the assumption is that the matching is a fairly weak predictor, so there is only a 10% loss in productivity (yielding 450 cases instead of 500) and a final sample size and response rate of 1450/48.3%. If the sample cases were much more like the dropped phase one nonrespondents, then the yield would be less.
Case C is an unusual design in which a replicate of 800 is issued for phase two and work ceases on the phase one nonrespondents. It has a response rate of 50% in phase one, in phase two (since the field period and sample population are the same), and overall. It yields 1400 cases (1000+400).

Case D is the more typical replicate design in which a sample of 400 representing the whole target sample is added and the 1000 phase one nonrespondents continue to be worked. After phase two the replicate has yielded 200 cases (50% response rate) and the 1000 continued cases have yielded 200 cases. This yield is lower than in Case A. The difference arises because the effort devoted to the replicate is assumed to have been diverted from working the initial cases; that is, the total field effort was not expanded, but divided between two efforts. The final achieved sample size is 1400 and the response rate 58.3%.

Case E is a hybrid substitution/continuation design in which 500 of the nonrespondents at the end of phase one are dropped and substituted for and 500 are continued. The 500 substitute cases are assumed to yield 225 (the same 90% effectiveness assumption as in Case B). The 500 cases not substituted for are assumed to yield 200 cases (the same rate as in Case A). That yields a final total of 1425 cases (1000 initial cases from phase one + 225 substitutes + 200 continued cases) and a response rate of 57.0%.

Case F takes the 1000 nonrespondents at the end of phase one and subsamples them, dropping 500 and retaining 500. It is assumed that half of the 500 remaining are interviewed by the end of phase two. Since each of these cases represents two cases (due to the subsampling), they are weighted up to make 500 cases for purposes of both calculating the response rate and substantive analysis for a weighted total of 1500 and a response rate of 75%.
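The arithmetic behind Cases A through F can be laid out in a few lines. The sketch below is only a restatement of the assumptions given in the text (2000 starting cases, 1000 completes after phase one, and the stated phase-two yields); the class and field names are illustrative, and the response rate is taken as completed (or weighted) cases divided by all issued cases.

```python
from dataclasses import dataclass

STARTING_SAMPLE = 2000       # initial sample in every design
PHASE_ONE_COMPLETES = 1000   # completes after 8 weeks in every design


@dataclass
class Design:
    label: str
    added_sample: int            # cases issued beyond the starting sample
    phase_two_completes: float   # completes gained in phase two (weighted for F)

    @property
    def completes(self) -> float:
        return PHASE_ONE_COMPLETES + self.phase_two_completes

    @property
    def response_rate(self) -> float:
        # Every issued case, initial or added, stays in the base.
        return self.completes / (STARTING_SAMPLE + self.added_sample)


designs = [
    Design("A", 0, 400),                 # continuous: 400 more interviews
    Design("B", 1000, 0.9 * 500),        # substitutes at 90% of a fresh sample's yield
    Design("C", 800, 0.5 * 800),         # replicate worked at 50%, initial cases dropped
    Design("D", 400, 200 + 200),         # replicate yield + continued initial cases
    Design("E", 500, 0.9 * 250 + 200),   # 500 substitutes + 500 continued cases
    Design("F", 0, 2 * 250),             # 250 interviews, each weighted by 2
]

for d in designs:
    print(f"Case {d.label}: {d.completes:.0f} cases, response rate {d.response_rate:.1%}")
```

Run as written, this gives 1400/70.0%, 1450/48.3%, 1400/50.0%, 1400/58.3%, 1425/57.0%, and 1500/75.0% for Cases A through F, in line with the figures in the text.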
Of course the effective sample size is lower than 1500 (1250 minus design effects due to weighting) and this would have to be taken into consideration in inferential statistical comparisons. Also, note that the yield of 50% for the phase two cases is higher than the assumed yield of 40% for Case A. This is based on the assumption that by reducing the scope of work (from following up 500 cases to following up 250) more effort could be devoted per case. Among other things this might also mean using only the best interviewers from phase one during phase two.

Substitution and Nonresponse/Post-stratification Weighting

Nonresponse and post-stratification weighting assume that non-respondents are missing completely at random (MCAR) within the weight categories. That in effect means that within categories the respondents and non-respondents are the same. Substitution assumes that the substitutes resemble nonrespondents (i.e. original, non-interviewed cases) and not respondents from the initial sample. Both assumptions are usually problematic and at best only partly correct.

Weighting reduces nonresponse bias to the extent that the weight categories are correlated with nonresponse bias. They usually are correlated to some extent, but often the relationship is not strong. Weighting then assumes that within weighting categories nonrespondents are MCAR or in other words that they are the same as respondents. This is a dubious assumption since the one thing that is known about the nonrespondents and respondents within categories is that they differ in their response status.

Substitutes resemble the initial nonrespondents to the extent they are matched on the variables used for selecting substitutes. The matching variables are similar in function to the weighting categories used in nonresponse/post-stratification weights in non-substitution designs (Elliot, 1993). However, within the matched groups, the substitute cases are equivalent to all initial cases, not to initial nonrespondents alone.
Furthermore, it is likely that those substitutes that become respondents within matched groups will be more like initial respondents than nonrespondents since they share the characteristic of doing an interview and differ from initial nonrespondents who did not do an interview. By working cases longer and harder the continuous sampling + nonresponse/post-stratification approach should yield a higher response rate and have more hard cases among the completed cases than does substitution. This would suggest that weighting might reduce nonresponse bias more than substitution because there would be less difference between respondents and nonrespondents or in other words that the completed cases would better represent the nonresponse cases and thus the target population (Biemer, Chapman, and Alexander, 1985). Also, in most cases the range of variables known about cases from a sample frame is fairly limited and will not necessarily be good predictors of nonresponse bias. There would usually be more variables to choose from in developing a nonresponse/post-stratification weight and they could usually be selected because of how well they modeled nonresponse bias. Another factor to consider is that compared to substitution, nonresponse/poststratification weights increase the sampling variance and underestimate the population variance. When complete substitution is achieved, this reduces sampling variance (larger N and no extra variance from a weight) and increases true variance (substitutes are more variable than taking existing cases and weighting them up or down) and this better reflects the real variability of the target population. If substitution can avoid the use of weights or at least have weights with less variance, then substitution designs would have a greater effective sample size for a given number of actual cases. 
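The link between weight variability and effective sample size can be illustrated with Kish's approximation, n_eff = (sum of weights)^2 / (sum of squared weights). This formula is not named in the text and is used here only as a common rule of thumb; the weights below are taken from the Case F example (1000 phase-one completes at weight 1 and 250 subsampled completes at weight 2), and the function name is illustrative.

```python
def kish_effective_n(weights):
    """Kish's approximate effective sample size for a set of case weights."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq


# Case F: 1000 unweighted completes plus 250 completes weighted by 2.
case_f_weights = [1.0] * 1000 + [2.0] * 250
# A hypothetical unweighted design with the same number of actual interviews.
equal_weights = [1.0] * 1250

print(kish_effective_n(case_f_weights))   # weighted design
print(kish_effective_n(equal_weights))    # equal weights: no loss
```

With equal weights the effective sample size equals the number of interviews (1250), while the Case F weights reduce it to 1125 under this approximation, which is the sense in which avoiding or flattening weights preserves effective sample size.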
Finally, as Chapman (1983) has noted, “All (empirical studies) seemed to indicate that substitution procedures do not eliminate the effects of nonresponse bias (but) it is probably true there is no procedure available that can adequately correct nonresponse bias.” As he has further noted (Chapman, 2003), “(T)he fundamental question is whether it is better to use a substitute unit for a survey nonrespondent, rather than imputing nonrespondent data from a blend of the data reported by respondents in the same weighting class.”

Conditions under which Substitution is More Useful

Substitution designs would tend to overcome some of the noted disadvantages and achieve advantages under various conditions:

1. Having a sample frame with considerable information on the sample unit: European sample frames based on population registers with certain household and/or respondent-level demographic information are generally more suitable for substitution designs than area probability designs such as used in the US with little or no household/respondent information in the sample frame. It is particularly advantageous if the known attributes of cases are correlates of nonresponse.

2. Stratified samples rather than SRS designs: Even if little is known about the sample cases, the use of strata in the sample would allow substitution within strata such as neighborhoods within a multi-stage, area, probability design. Since locality is often related to response rates (e.g. lower in cities and higher in small towns and rural areas) and neighborhoods typically differ on other variables such as SES and race/ethnicity, within-strata weighting and/or substitution have the potential to reduce nonresponse bias.

3. Centralized vs. decentralized designs: Field control is difficult for all area probability designs in which interviewers are dispersed to many sample points and not under direct observation, and substitution designs increase this problem by both making field procedures more complex and providing an incentive to flout the rules. For example, interviewers may give up on hard cases and switch to substitutes either without all protocols being followed (e.g. number of call attempts) or by exercising less effort per attempt to secure interviews from the initial cases. Centralized designs, such as CATI surveys from a call center, are much more subject to strict control and thus mostly eliminate this problem (Biemer, Chapman, and Alexander, 1985). Moreover, it is even more desirable if interviewers are not even aware that substitution is being used (Lynn, 2004; Vehovar, 2003).

Conclusion

With optimal substitution (using close field supervision, full efforts for both initial cases and substitutes, randomly-selected substitutes, etc.), substitution resembles the use of random replicates and should probably be considered a full-probability design. The use of close control, random selection of substitutes, and full efforts to obtain initial cases leading to limited use of substitution are key design factors that make substitution a method for adjusting for nonresponse bias within a full-probability framework rather than being a non-probability design as with volunteer, convenience, and quota samples (Chapman, 1983; Lessler and Kalsbeek, 1992; Lohr, 1999; Rubin and Zanutto, 2002). Some substitution designs deviate no more from an ideal, full-probability design than does the use of nonresponse/post-stratification weights to compensate for nonresponse bias. Counter to what Lynn et al.
(2004) assert, it is not true of all substitution designs that “none of them meet the requirement for probability sampling.” Neither well-executed continuous designs nor optimal substitution designs achieve idealized full-probability standards (no over- or undercoverage, no nonresponse bias, etc.), but both can produce creditable, practical approximations of full-probability surveys.

It is less clear whether optimal substitution adjusts for nonresponse bias as effectively as weighting does. Theoretical expectations tend to favor weighting over substitution, and some empirical studies back this conclusion. But there are too few well-designed, comparative studies of the two approaches, and apparently none that compare substitution + weighting to continuous interviewing + weighting, to be confident in this as a general outcome. Clearly more research is needed to test how optimal substitution designs compare to continuous sampling designs, whether weighting the latter or both produces the best estimates, and what weighting does to effective sample size (Chapman, 1983; Groves et al., 2002; Scheuren, 2004). At least in simulation studies, Rubin and Zanutto (2002) find matched substitutes with multiple imputations to be a useful approach that is as good as or better than other common designs. Whether empirical studies will support this conclusion is an open question. Only careful experimental comparisons of well-designed and well-executed substitution and continuous designs will definitively establish the relative merits of the two approaches.

Table 1

Examples of Outcomes Using Different Designs

                                                            Completed Cases/Response Rate
Design                       Starting Sample  Added Sample  8 weeks       16 weeks
A. No Substitution           2000             None          1000/50%      1400/70%
B. Full Substitution         2000             1000          1000/50%      1450/48.3%
C. Replicate/Discontinue     2000             800           1000/50%      1400/50%
D. Replicate/Continue        2000             400           1000/50%      1400/58.3%
E. Substitute/Continue       2000             500           1000/50%      1425/57%
F. No Substitute/Sub-Sample  2000             None          1000/50%      1500/75%

References

American Association for Public Opinion Research, Standard Definitions: Final Disposition of Case Codes and Outcome Rates for Surveys. Lenexa, KS: AAPOR, 2006.
Bianchini, John C. and Loret, Peter G., Anchor Test Study: Final Report. Berkeley: Educational Testing Service, 1974.
Biemer, Paul; Chapman, David W.; and Alexander, Charles, “Some Research Issues in Random-Digit Dialing Sampling and Estimation,” in Proceedings, First Annual Research Conference March 20-23, 1985. Washington, DC: Bureau of the Census, 1985.
Boyle, Coleen A. and Brann, Edward A., “Proxy Respondents and the Validity of Occupational and Other Exposure Data,” American Journal of Epidemiology, 136 (1992), 712-721.
Callens, Marc and Croux, Christopher, “Nonresponse in the Belgium Fertility and Family Survey,” Unpublished report, Population and Family Study Center, Belgium, 2003.
Chapman, David W., “The Impact of Substitution on Survey Estimates,” in Incomplete Data in Sample Surveys, edited by William G. Madow, Ingram Olkin, and Donald B. Rubin. Volume 2. New York: Academic Press, 1983.
Chapman, David W., “To Substitute or Not to Substitute – That is the Question,” The Survey Statistician, 48 (2003), 32-34.
Chapman, David W. and Roman, Anthony M., “An Investigation of Substitution for an RDD Survey,” in 1985 Proceedings of the Section on Survey Research Methods. Washington, DC: American Statistical Association, 1985.
Durbin, J. and Stuart, A., “Callbacks and Clustering in Sample Surveys: An Experimental Study,” Journal of the Royal Statistical Society, Series A (1954), 387-410.
Elliot, Dave, “The Use of Substitution in Sampling,” Survey Methodology Bulletin, 33 (July, 1993), 8-11.
Elteto, Odon, “Substitution in the Hungarian HSB,” The Survey Statistician, 49 (2004), 16-17.
Forsman, Goesta and Berg, Sven, “Telephone Interviewing and Data Quality: An Overview and Empirical Study,” Unpublished report, Institute of Technology, Linkoeping University, 1992.
Gray, P.G. and Corlett, T., “Sampling for the Social Survey,” Journal of the Royal Statistical Society, A, 113 (1950), 150-206.
Jay, Gina M.; Liang, Jersey; Liu, Xian; and Sugisawa, Hidehiro, “Patterns of Nonresponse in a National Survey of Elderly Japanese,” Journal of Gerontology, 48 (1993), S143-S152.
Kalton, Graham and Kasprzyk, Daniel, “The Treatment of Missing Survey Data,” Survey Methodology, 12 (1986), 1-16.
Kish, Leslie, Survey Sampling. New York: John Wiley & Sons, 1965.
Kish, Leslie and Hess, Irene, “A ‘Replacement’ Procedure for Reducing the Bias of Nonresponse,” American Statistician, 58 (2004), 295-297. Excerpt from American Statistician, 13 (1959), 17-19.
Kordos, Jan, “Household Survey in Transition Countries,” in Household Sample Surveys in Developing and Transition Countries, edited by United Nations. Studies in Methods, Series F, No. 96. New York: UN, 2005.
Lessler, Judith T. and Kalsbeek, William D., Nonsampling Error in Surveys. New York: John Wiley & Sons, 1992.
Little, Roderick J.A. and Rubin, Donald B., Statistical Analysis with Missing Data. 2nd edition. New York: John Wiley & Sons, 2002.
Lohr, Sharon L., Sampling: Design and Analysis. Pacific Grove, CA: Duxbury Press, 1999.
Lynn, Peter, “The Use of Substitution in Surveys,” The Survey Statistician, 49 (2004), 14-16.
Lynn, Peter; Haeder, Sabine; Gabler, Siegfried; and Laaksonen, Seppo, “Methods for Achieving Equivalence of Samples in Cross-National Surveys: The European Social Survey Experience,” ISER Working Paper, 2004-09, 2004.
Marliani, Gianni and Pacei, Silvia, “Effects of Household Substitution in the Italian Consumer Expenditure Survey,” Bulletin of the International Statistical Institute. Firenze: ISI, 1993.
Moser, C.A. and Kalton, G., Survey Methods in Social Investigation. New York: Basic Books, 1972.
Nathan, Gad, “Substitution for Non-response as a Means to Control Sample Size,” Sankhya, 42 (1980), 50-55.
Pol, Louis G., “A Method to Increase Response When External Interference and Time Constraints Reduce Interview Quality,” Public Opinion Quarterly, 56 (1992), 356-359.
Rubin, Donald B. and Zanutto, Elaine, “Using Matched Substitutes to Adjust for Nonignorable Nonresponse through Multiple Imputations,” in Survey Nonresponse, edited by Robert M. Groves, Don A. Dillman, John L. Eltinge, and Roderick J.A. Little. New York: John Wiley & Sons, 2002.
Scheuren, Fritz, “Introduction,” American Statistician, 58 (2004), 290-291.
Smith, Tom W., “The Hidden 25 Percent: An Analysis of the 1980 General Social Survey,” Public Opinion Quarterly, 47 (1983), 386-404.
Sudman, Seymour, Reducing the Cost of Surveys. Chicago: Aldine, 1967.
Survey Sampling International, “Random Digit Telephone Sampling Methodology,” 2003.
Vehovar, Vasja, “Field Substitution and Unit Nonresponse,” Journal of Official Statistics, 15 (1999), 335-350.
Vehovar, Vasja, “Field Substitutions – A Neglected Option?” Unpublished paper, University of Ljubljana, n.d.
Vehovar, Vasja, “Field Substitution in Sample Surveys: The Case of the Slovenian Labour Force Survey,” Bulletin of the International Statistical Institute. Firenze: ISI, 1993.
Vehovar, Vasja, “Field Substitutions Redefined,” The Survey Statistician, 48 (2003), 35-37.
Vehovar, Vasja; Zaletel, Metka; Novak, Tatjana; Arnez, Marta; and Katja, Rutar, “Household Sample Surveys in Slovenia,” Statistics in Transition, 5 (2002), 671-685.
Waksberg, Joseph, “Discussion,” in Proceedings, First Annual Research Conference March 20-23, 1985. Washington, DC: Bureau of the Census, 1985.
Waksberg, Joseph, “Sampling Methods for Random Digit Dialing,” Journal of the American Statistical Association, 73 (1978), 40-46.
Williams, Stephen R. and Folsom, Ralph E., Jr., “Bias Resulting from Nonresponse: Methodology and Findings. Revised,” Technical Report on NLS Base-Year Estimates, RTI, September, 1976.

Current practice – Recommendations from the ISSP General Meeting 2003

a. The use of substitution is discouraged since i) in its looser forms it may notably deviate from full-probability designs and risk becoming a convenience sample that overrepresents easy-to-contact and compliant respondents and ii) its use may lead interviewers to reduce their efforts to obtain interviews from the originally selected units/persons.

b. The MC should collect more detailed information on exactly how substitution is used, including the conditions under which substitutions may be utilized, whether substitution is at the interviewer's discretion or only authorized by the central office, whether substitute units are pre-selected, the number of substitutions, etc.

c. After this review the Methodology and Standing Committees will further consider the use of substitution.

d. The total number of cases which are substitutions must be reported. In each data file all substitute cases should be marked with a flag variable.

e. All replaced cases must be counted as non-respondents. For example, consistent with the AAPOR/WAPOR standards, "if a household refuses, no one is reached at an initial substitute household, and an interview is completed at a second substitute household, then the total number of cases would increase by two and the three cases would be listed as one refusal, one no one at residence, and one interview."
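The bookkeeping implied by recommendations d and e can be sketched in a few lines of Python. This is a hypothetical illustration, not an ISSP or AAPOR tool: the record layout, the disposition labels, and the function name are assumptions, and the rate computed is a simplified ratio of completed interviews to all fielded cases that ignores the eligibility adjustments in the full AAPOR outcome-rate formulas.

```python
from dataclasses import dataclass


@dataclass
class Case:
    disposition: str     # e.g. "interview", "refusal", "no one at residence"
    is_substitute: bool  # flag variable marking substitute cases (recommendation d)


def summarize(cases):
    """Return (total cases, number of substitute cases, simple response rate)."""
    if not cases:
        raise ValueError("at least one case is required")
    total = len(cases)
    substitutes = sum(1 for c in cases if c.is_substitute)
    interviews = sum(1 for c in cases if c.disposition == "interview")
    # Replaced cases stay in the denominator as nonrespondents (recommendation e).
    return total, substitutes, interviews / total


# The example quoted in recommendation e: the initial household refuses,
# no one is reached at the first substitute, and the second substitute
# completes an interview.
household_chain = [
    Case("refusal", is_substitute=False),
    Case("no one at residence", is_substitute=True),
    Case("interview", is_substitute=True),
]

total, substitutes, rate = summarize(household_chain)
print(total, substitutes, round(rate, 3))
```

For the quoted example this yields three cases, two of them flagged as substitutes, and one interview out of three cases, rather than one interview out of one selected household.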