Data for the Second Modelling Paper The Use of Paradata (Field Process Data) in Cross–Sectional and Longitudinal Surveys Project (RES-062-23-2997) Construction of the datafile First, households which responded in Wave 1 and which had call record information as well as interviewer observations together with survey variables available were selected. A separate file with call record information was constructed for the Wave 2. As there is no such a thing as longitudinal household in the Understanding Society data, it was possible to merge data from Wave 1 and Wave 2 only on the basis of the individual identifiers which are stable and traceable across waves. In order to do this, first each member of each household in both waves obtained the same sequence of calls as the whole household would have. Then data for the individuals were merged for both waves on the basis of the unique individual identifier. Finally, only one record per households in Wave 2 remained in the final datafile as our analysis needs to be conducted at the household level. Ethnic Minority Boost sample was excluded from the analysis as the rules for the selection of this sample are different from the main sample and a separate analysis could be performed for this specific group. However, this is not the main aim of the current analysis. The British Household Panel Study sample was also excluded from the analysis as none of the participants from this sample were interviewed in the Wave 1 of the Understanding Society. All data then were cleaned and listwise deletion of observations with missing values on covariates was performed. This initial analysis sample contained 24,896 households including households with sequences available in both waves. First, we carried out preliminary work using this initial analysis sample, conducting separate analyses for all cases with at least 1, 2, 3 etc. calls. The 1 aim was to evaluate after every call, how well we can predict final outcome and length as more information about each case becomes available. As in previous analysis reported in XXX, this preliminary analysis also suggested that 3 calls may be sufficient to reach an acceptable level of predictability. To more formally assess this and to be able to compare model fit and prediction across the different models -since measures of predictability are dependent on sample size- we restrict our final analysis sample to all households from Wave 2 that received four or more calls (11,029 households). This approach enables us to employ call record information from the beginning of the field period in Wave 2 (here the first three calls) to predict final length and outcome. For these final models, we also excluded a small number of cases with missing values in the explanatory variables, as follows: there are no missing cases in any of the geographic information and design variables since these are derived from administrative data. Date and time of a call are automatically captured using computer assisted methods leading to no missing cases in these variables. Recordings of the call outcome of the households of interest did not contain any missing information either. Households in Wave 2 with missing items in the geographic and survey variables from Wave 1 as well as the interviewer observation variables from Wave 2 were deleted (399 cases (3.6%)). The final analysis sample, including only cases with four or more calls, therefore, contains 10,630 households with information from Wave 1 and Wave 2. To analyse length and call outcome (and to ensure the comparability of the different models), the following three response variables in Wave 2 are considered. Their distributions are presented in Table 1. 1.) length of call sequence (binary), distinguishing short sequences (up to 6 calls) and long sequences (more than 6 calls). The cut-off point at 6 calls was intentionally selected to fit with the survey protocol requirements of conducting a minimum of 6 call attempts if contact was not established earlier in the process. 2 2.) (final) outcome of call sequence (binary), distinguishing successful call sequences with at least one interview conducted in a household (after call 3) and unsuccessful call sequences with no interviews achieved (after call 3). 3.) a variable combining both length and final outcome, distinguishing 4 categories - short unsuccessful (up to 6 calls, no interview), short successful (up to 6 calls, at least one interview after call 3), long unsuccessful (more than 6 calls, no interview) and long successful (more than 6 calls, at least one interview after call 3). (Different definitions of the dependent variables - such as complete household outcome and different categorisations of sequence length - were also explored. The overall conclusions were very similar to the ones presented in this paper.) Table 1: Distributions of the three response variables in the final analysis sample (10,630 households). Frequencies Variables with categories Length 6704 Short sequence (up to 6 calls) 3926 Long sequence (7-30 calls) Final outcome No single interview in a sequence after call 3110 3 At least one interview in a sequence after 7520 call 3 Combined response 5304 Short successful 1400 Short unsuccessful 2216 Long successful 1710 Long unsuccessful Percentages 63.1 36.9 29.3 70.7 49.9 13.2 20.8 16.1 The explanatory variables used in the analysis can be split into seven main groups. The distributions of the explanatory variables broken down by the categories of the response variables are presented in the online Appendix (Table A2). 3 1.) geographic information and design variables from Wave 1 (4 variables: urban/rural indicator, government office region, low density area for ethnic minorities, and month and year of household issue); 2.) interviewer observation variables from Wave 1 (12 variables, e.g. indicators of entry barriers, conditions of surrounding area such as litter in street, abandoned buildings, heavy traffic, type of accommodation, relative condition of the property, garden); 3.) survey variables from Wave 1 (9 variables, e.g. household characteristics such as household income, the highest educational qualification, composition of the household, and others); 4.) call record variables from Wave 1 (5 variables, e.g. proportion of non-contact calls in a sequence or proportion of appointments in a sequence; length of sequence in Wave 1); 5.) call record variables from Wave 2 (11 variables, e.g. date, time of day, day of week, call outcome; also derived variables including time between calls); 6.) interviewer observations from Wave 2 (9 variables, e.g. household split identifier, presence of a car or a van, relative condition of the property, conditions of surrounding area such as litter in street, abandoned buildings, heavy traffic, type of accommodation, presence of children in a household, relative condition of the property, garden); 7.) change in interviewer observations between Waves 1 and 2 identifiers (8 variables, e.g. change in conditions of the garden between the 2 waves, change in presence of a car or a van between the two waves, and others). 4