The Conditional Independence Assumption in Probabilistic Record Linkage Methods

Most probabilistic record linkage software packages assume that, amongst record pairs which are matches (which refer to the same person), the event that the two records will agree on one linking field is stochastically independent of the event that they will agree on any other field. The same assumption is made for non-matches (record pairs which refer to different people). This assumption is popular because it reduces greatly the number of parameters to be estimated and hence helps to ensure the stability and robustness of the model. However little work seems to have been done on the implications of adopting it. Is it better not to assume independence? This paper uses Scottish data from the 2001 Census, from NHSCR and from the Higher Education Statistics Agency. Firstly it investigates the extent to which the assumption of conditional independence is violated in these data sets and secondly it assesses the degree to which departures from the assumption diminish the effectiveness with which the record linkage software can distinguish between matches and non-matches. The implications of the results for record linkage practice are discussed.

Key phrases: record linkage, conditional independence.

Contact details:
Stephen Sharp
National Records of Scotland
Ladywell House
Ladywell Road
Edinburgh EH12 7TF
0131 314 4270
stephen.sharp@gro-scotland.gsi.gov.uk

Introduction

Most probabilistic record linkage software packages assume that, amongst record pairs which are matches (which refer to the same person), the event that the two records will agree on one linking field is stochastically independent of the event that they will agree on any other field. The same assumption is made for non-matches (record pairs which refer to different people).
This assumption is popular because it reduces greatly the number of parameters to be estimated and hence helps to ensure the stability and robustness of the model. For example, if the level of agreement between the records for a given field is taken to be dichotomous (agree/disagree), if independence is assumed and if there are $F$ linkage fields, then there are $2F$ parameters to be estimated (the agreement probabilities for each field assuming a match and assuming a non-match). If independence is not assumed, then there are $2^F - 1$ parameters assuming a match (one for every combination of agreement and disagreement over the fields, less one as these probabilities must sum to unity) and the same number of parameters assuming a non-match. With $F = 8$, for instance, this is 16 parameters under independence against $2(2^8 - 1) = 510$ without it. It can be readily seen therefore that there are strong computational reasons for adopting the assumption. In particular, as the number of linking fields increases, the quality of the linkage ought to improve as it is based on more information. Assuming independence, this is possible while keeping the number of parameters manageable. Without this assumption, there is a tension between maximising the information used in linkage and keeping the number of parameters within practical limits.

This paper links Scottish data from the 2001 Census to the census coverage survey of that year, and from NHSCR to Higher Education Statistics Agency student records, in order (i) to investigate the extent to which the assumption of conditional independence is violated in these data sets and (ii) to assess the degree to which departures from the assumption diminish the effectiveness with which the record linkage software can distinguish between matches and non-matches.

Even without data it is possible to see that there will be some occasions on which the assumption of independence must fail.
The most obvious of these is that if agreement is observed on first name then, barring names given to both sexes such as Lindsay and Lesley, it is almost certain that there will be agreement on gender, regardless of whether the pair is a match or not. However, as will be shown later, correlations between agreement probabilities can arise for reasons not connected with the meaning of the fields.

It is possible to make some theoretical comments about the effect of assuming independence when it is not present in the data. The key statistic used by packages which follow the models proposed by Fellegi and Sunter (1969) or Copas and Hilton (1990) is the logarithm of the ratio of the probabilities of the observed outcome of the comparison given that the pair is a match and given that it is a non-match. If independence is assumed then the multiplicative rule for independent events applies and the logarithms can be calculated separately for each field and summed. Thus if the linkage score for field $f$ is $w_f$ then the linkage score for the record pair is $w = w_1 + w_2 + \dots + w_F$. The variance of this sum is composed of the sum of the variances of the individual terms $\sigma_i^2$ plus twice the sum of the pairwise covariances $\sigma_{ij}$ (taking each combination of $i$ and $j$ once since $\sigma_{ij} = \sigma_{ji}$). If there is a nonzero covariance between two fields, their actual joint contribution to the total variance will be the sum of their individual variances plus twice their covariance. The independence assumption however ignores the covariance term and so underestimates their contribution.

In fact it can be argued that the position is actually worse than this. Far from increasing the size of the contribution, positive covariance should reduce it since the introduction of $w_j$ into the sum replicates some of the information already added by $w_i$.
At the extreme, for example, where two fields are wholly predictable from each other, introducing the second adds no new marginal information and so should leave the variance of the sum unchanged. In general, where there is positive covariance, the contribution ought to be the sum of the two separate field variances minus the covariance (as this has been counted twice, as it were), not plus twice the covariance. There are therefore theoretical reasons for believing that violations of the assumption of independence will distort the weights accorded to the linking fields in determining the log likelihood ratio for a given record pair.

Field covariances can be investigated by assessing the agreement or disagreement between each pair of linking fields for a given record pair and calculating the correlation over record pairs. However, the question arises of which pairs should be used for this purpose. At first glance, the correlation over all pairs admitted by the blocking strategy is the obvious approach. However, this may give a misleading impression.

For practical purposes, the output from record linkage packages (which is a list of record pairs sorted in descending order of the summed log likelihood ratios) can be divided into three sectors. Sector 1 at the top of the table is largely composed of pairs which are easily identifiable as probable matches. Sector 3 at the bottom of the table is largely composed of pairs which are easily identifiable as probable non-matches. Sector 2 in the middle is the most important and is the reason why probabilistic record linkage is necessary. It is largely composed of cases which are difficult to classify as matches or non-matches with confidence.

This triage is important for present purposes since there will be little if any correlation in sectors 1 or 3. Sector 1 is dominated by matches and the great majority of field comparisons will be agreements or missing values.
Since correlation depends on the variance of the two variables being correlated and since there will be little variance, there will also be little correlation. Similarly sector 3 is dominated by non-matches and, although there will be variance here, it will be caused by agreements due to occasional ‘lucky hits’. Since the records refer to different people it is hard to see how there could be any more than random levels of correlation between agreement on different field pairs. Including record pairs in sectors 1 and 3 will therefore act to dilute the correlation observable in sector 2, which is where the negative impact of the correlation on the quality of the linkage process can be expected to manifest itself. The correlations reported below are therefore calculated only over record pairs in sector 2.

The boundaries of sector 2 are to some extent arbitrary as there can be ‘outliers’ (matches with very low linkage scores or non-matches with very high linkage scores) which can affect the boundaries between the three sectors. For present purposes, any outliers which had this effect were ignored. Otherwise, the correlations were calculated over all the record pairs in sector 2.

The data

Two data sets were used in the results reported below. The first consisted of (i) a file of 500,000 records randomly chosen from the 2001 Census in Scotland and (ii) the 77,800 records of the census coverage survey which had been linked to the Census in 2001. For quality assurance reasons, considerable effort was expended in 2001 on matching the two files using both electronic and manual methods and it is highly likely that the matches identified were true matches. However, further calculations undertaken in 2011 revealed a number of links (around 1.5% of those found in 2001) which were not identified as matches in 2001 but which appeared convincing when reviewed in 2011.
These were added to the file giving a total of 7,720 identified matches, consistent with the number expected given that the sample consists of around 10% of the census.

The second data set was taken from a project on the migration of students domiciled in Scotland but studying in England or Wales. The files were (i) all 665,000 records of people on the Scottish NHSCR database who were born between 1984 and 1990 inclusive and hence likely to be of student age in 2010 and (ii) a sample of 6,909 records taken from the HESA student records databases for 2007/8 and 2008/9 where (a) the date of birth in either file was between 1984 and 1990 and (b) there was a term time postcode which was either non-Scottish or missing. As part of a student migration project using these data, all the links made by the software had been reviewed manually to determine whether the link was a true match. In a small minority of cases, confident decisions about match status were not possible on the basis of the information available. Where there was doubt, a non-match was assumed.

The software

Two packages were used in this work. They were RecLink produced by the Statistical Research Division of the US Bureau of the Census and Link Plus developed at the US Centers for Disease Control and Prevention (CDC), Cancer Division. They are respectively available on request from the US Bureau of the Census and on the internet at http://www.cdc.gov/cancer/npcr/tools/registryplus/lp.htm.

The results

Two initial linkage runs were carried out. Details of the data sets linked, the software packages used and the fields used for linking and blocking are given in table 1. In all cases the agreement rule for all fields was identity. This is not necessarily the optimal rule for some fields (for example Jaro-Winkler is slightly better for text strings like names) but it is computationally simpler and close enough for present purposes.
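The identity agreement rule, and the summed log likelihood ratio described in the introduction, can be illustrated with a minimal sketch. It is not the internal logic of RecLink or Link Plus: the field names, the agreement probabilities given a match (`M_PROB`) and given a non-match (`U_PROB`), the use of natural logarithms and the decision to give missing comparisons no weight are all assumptions made for the example.

```python
import math

# Hypothetical agreement probabilities given a match (m) and a non-match (u).
# These are illustrative values only, not estimates from the data in this paper.
M_PROB = {"first_name": 0.95, "last_name": 0.95, "gender": 0.99}
U_PROB = {"first_name": 0.01, "last_name": 0.005, "gender": 0.50}


def agreement_pattern(rec_a, rec_b, fields):
    """Identity rule: True (agree), False (disagree) or None (value missing)."""
    pattern = {}
    for field in fields:
        a, b = rec_a.get(field), rec_b.get(field)
        pattern[field] = None if a is None or b is None else a == b
    return pattern


def linkage_score(pattern, m_prob, u_prob):
    """Sum of per-field log likelihood ratios.

    Adding the field weights is only justified if agreement on one field
    is conditionally independent of agreement on every other field.
    """
    score = 0.0
    for field, agrees in pattern.items():
        if agrees is None:
            continue  # assumption: a missing comparison contributes no weight
        m, u = m_prob[field], u_prob[field]
        score += math.log(m / u) if agrees else math.log((1 - m) / (1 - u))
    return score


# Example record pair (invented values).
record_a = {"first_name": "ANNE", "last_name": "SMITH", "gender": "F"}
record_b = {"first_name": "ANN", "last_name": "SMITH", "gender": None}
pattern = agreement_pattern(record_a, record_b, list(M_PROB))
score = linkage_score(pattern, M_PROB, U_PROB)
```

Here the pair disagrees on first name, agrees on last name and has no usable gender comparison, so the score is the disagreement weight for first name plus the agreement weight for last name.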
The tetrachoric correlations found in the two runs are reported in tables 2 and 3. The patterns of correlations for matches and non-matches are, as would be expected, very different. For matches, two of the three correlations greater than 0.13 occurred between components of the date of birth. It appears that if a recording error occurs for one of these then there is an increased probability of an error occurring in one or both of the others. This is not of course intuitively unreasonable.

Table 1: details of the two initial linkage runs

                    Run 1                       Run 2
Data                2001 Census - 10% sample    NHSCR - subset
                    Census coverage survey      HESA - subset
Software            RecLink                     Link Plus
Blocking field(s)   post code sector            post code sector
                                                first name soundex
Linking fields      first name                  first name
                    last name                   last name
                    house no                    date of birth
                    year of birth               post code
                    month of birth              gender
                    day of birth
                    post code
                    gender

The other coefficients are either close to zero or negative. The occurrence of negative coefficients probably arises as follows. If a pair of records which refer to the same person is found in sector 2, it is because a number of errors have occurred in different linking fields. However the number of errors must be small (as otherwise the pair would fall to sector 3) and hence if an error occurs in one field, an error in another is less likely.
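The paper does not state how the tetrachoric coefficients were computed. As a minimal sketch of one standard approach, the coefficient for two agreement indicators can be obtained from their 2x2 table by finding the correlation of a standard bivariate normal whose upper-quadrant probability equals the observed proportion of pairs agreeing on both fields. The function names, the numerical integration and the bisection bounds below are choices made for this illustration.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def upper_orthant(h, k, rho, steps=200):
    """P(X > h, Y > k) for a standard bivariate normal with correlation rho.

    Starts from the independent case and integrates the bivariate normal
    density at (h, k) over the correlation from 0 to rho (Simpson's rule).
    """
    independent = (1 - _STD_NORMAL.cdf(h)) * (1 - _STD_NORMAL.cdf(k))

    def density(r):
        q = 1 - r * r
        exponent = -(h * h - 2 * r * h * k + k * k) / (2 * q)
        return math.exp(exponent) / (2 * math.pi * math.sqrt(q))

    width = rho / steps  # steps must be even for Simpson's rule
    total = density(0.0) + density(rho)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * width)
    return independent + total * width / 3


def tetrachoric(n11, n10, n01, n00, tol=1e-9):
    """Tetrachoric correlation from a 2x2 table of agreement counts.

    n11: both fields agree; n10: only the first agrees;
    n01: only the second agrees; n00: neither agrees.
    Returns None when one field never varies (coefficient undefined).
    """
    n = n11 + n10 + n01 + n00
    p_first = (n11 + n10) / n
    p_second = (n11 + n01) / n
    if not (0 < p_first < 1 and 0 < p_second < 1):
        return None
    # Thresholds on the latent normal scale that reproduce the margins.
    h = _STD_NORMAL.inv_cdf(1 - p_first)
    k = _STD_NORMAL.inv_cdf(1 - p_second)
    target = n11 / n
    lo, hi = -0.999, 0.999
    # The orthant probability increases with rho, so bisection applies.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if upper_orthant(h, k, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

When every pair agrees on one of the fields the function returns None, since no coefficient can be computed for a field with no variance. Tables with an empty cell but non-degenerate margins are pushed to the bisection bound; a continuity correction is often applied in that situation but is not attempted here.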
Table 2 - Correlations for matches and non-matches in the Census/CCS data with post code sector as blocking field

Matches (N = 90)
                   first    last   house     dob     dob     dob    post
                    name    name      no    year     mon     day    code  gender
first name          1.00    0.02   -0.44   -0.26   -0.01   -0.51   -0.13   -0.16
last name           0.02    1.00   -0.42   -0.18   -0.10   -0.39   -0.33   -0.02
house number       -0.44   -0.42    1.00    0.26   -0.20    0.12    0.13   -0.17
year of birth      -0.26   -0.18    0.26    1.00    0.08    0.78   -0.30   -0.09
month of birth     -0.01   -0.10   -0.20    0.08    1.00    0.26   -0.36   -0.41
day of birth       -0.51   -0.39    0.12    0.78    0.26    1.00   -0.36   -0.11
post code          -0.13   -0.33    0.13   -0.30   -0.36   -0.36    1.00   -0.09
gender             -0.16   -0.02   -0.17   -0.09   -0.41   -0.11   -0.09    1.00

Non-matches (N = 100)
                   first    last   house     dob     dob     dob    post
                    name    name      no    year     mon     day    code  gender
first name          1.00   -0.46   -0.48    0.07    0.78    0.14   -0.53    0.28
last name          -0.46    1.00    0.09   -0.19   -0.60   -0.27    0.00   -0.19
house number       -0.48    0.09    1.00    0.17   -0.48    0.11   -0.27   -0.24
year of birth       0.07   -0.19    0.17    1.00    0.05   -0.01   -0.32   -0.07
month of birth      0.78   -0.60   -0.48    0.05    1.00    0.27   -0.37   -0.14
day of birth        0.14   -0.27    0.11   -0.01    0.27    1.00   -0.41   -0.19
post code          -0.53    0.00   -0.27   -0.32   -0.37   -0.41    1.00   -0.16
gender              0.28   -0.19   -0.24   -0.07   -0.14   -0.19   -0.16    1.00

For the non-matches, there is one large positive coefficient (between first name and month of birth) but otherwise the largest is +0.28. A different argument applies for negative coefficients because here, variance is caused not by recording errors but by “lucky hits”. However it is in some ways analogous to that which applies to matches. If a pair of records which do not refer to the same person is found in sector 2, it is because a number of lucky hits have occurred in different linking fields. However the number of lucky hits must be limited (as otherwise the pair would be in sector 1) and so again if a lucky hit occurs in one field, a lucky hit in another is less likely.
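The selection argument made above for both matches and non-matches can be illustrated with a small simulation. It is a sketch under stated assumptions rather than a reconstruction of the analysis: agreement indicators are generated independently, every field agrees with the same hypothetical probability, the number of agreeing fields stands in for the linkage score, and the ordinary (phi) correlation is used instead of the tetrachoric coefficient.

```python
import math
import random


def phi(xs, ys):
    """Pearson correlation between two 0/1 sequences (phi coefficient)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    # For a 0/1 variable the population variance is mean * (1 - mean).
    return cov / math.sqrt(mean_x * (1 - mean_x) * mean_y * (1 - mean_y))


def simulate(n_pairs=20000, n_fields=5, p_agree=0.9, band=(3, 4), seed=0):
    """Correlation between fields 0 and 1 over all pairs and over a middle band.

    All parameter values are hypothetical. The band keeps pairs whose number
    of agreeing fields lies between band[0] and band[1] inclusive, mimicking
    a 'sector 2' that excludes the clearest matches and non-matches.
    """
    rng = random.Random(seed)
    pairs = [
        [int(rng.random() < p_agree) for _ in range(n_fields)]
        for _ in range(n_pairs)
    ]
    middle = [p for p in pairs if band[0] <= sum(p) <= band[1]]
    corr_all = phi([p[0] for p in pairs], [p[1] for p in pairs])
    corr_middle = phi([p[0] for p in middle], [p[1] for p in middle])
    return corr_all, corr_middle


corr_all, corr_middle = simulate()
```

Because the fields are generated independently, the correlation over all pairs is close to zero, whereas restricting attention to the middle band produces a clearly negative coefficient. Negative correlations of the kind seen in sector 2 can therefore arise from the selection of pairs alone.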
Table 3 - Correlations for matches and non-matches in the NHSCR/HESA data with first name soundex OR post code sector as blocking fields

Matches (N = 450)
                   first    last   birth    post
                    name    name    date    code  gender
first name          1.00   -0.07   -0.07   -0.65    0.10
last name          -0.07    1.00   -0.03   -0.03   -0.01
date of birth      -0.07   -0.03    1.00   -0.26   -0.01
post code          -0.65   -0.03   -0.26    1.00   -0.13
gender              0.10   -0.01   -0.01   -0.13    1.00

Non-matches (N = 131)
                   first    last   birth    post
                    name    name    date    code  gender
first name          1.00   -0.66   -0.15   -0.54       .
last name          -0.66    1.00   -0.49    0.19       .
date of birth      -0.15   -0.49    1.00   -0.07       .
post code          -0.54    0.19   -0.07    1.00       .
gender                 .       .       .       .       .

Table 3 gives the tetrachoric correlations for the NHSCR/HESA linkage run. For the matches, only first name / gender returns a positive coefficient, though this is not inconsistent with table 2 since here the date of birth was treated as a single field so the positive correlations between the various components of the field were not measured separately. For the non-matches, only last name / post code is positive (+0.19), whereas there are three coefficients which are less than -0.40. Coefficients are not available for gender because for all 131 of the non-match pairs there was agreement on this field. Non-match agreement is of course very likely for this field and appeared to be effectively a precondition for non-match record pairs to get into sector 2.

These negative correlations may be artefacts in the sense that they have been calculated only on record pairs in sector 2 and record pairs (both matches and non-matches) are only found in this sector if particular circumstances apply. Nevertheless the correlations are real and have the potential to damage the effectiveness of the linkage process. To investigate whether this potential is realised it is necessary to have a means of assessing and comparing the quality of record linkage output.
The method most commonly used to compare the outputs of record linkage packages is the recall-versus-precision graph. These two variables are functions of the numbers of true positives (record pairs which refer to the same person and have been correctly linked by the software), false positives (record pairs which do not refer to the same person but have been incorrectly linked by the software) and false negatives (record pairs which refer to the same person but have not been linked by the software). Note that the sum of the first and third of these is the number of true matches, which is constant for a given data set.

This method will be used here though the data will be presented rather differently. Instead of plotting the derived variables of recall and precision, the numbers of true positives (or true links as they will be called) will be plotted directly against the number of false positives (or false links). This enables direct identification of the cumulative numbers of successes and failures in the output as the value of the log likelihood ratio decreases. The equations which relate the recall-precision axes to the true-false axes are given in table 4.

Table 4 - Raw and derived variables and the relationships between them. M is the total number of matches, which is constant for a given data set.

Variable                          Symbol   Relationship to other variables
True positives or true links      T        $T = MR$
False positives or false links    F        $F = MR(1 - P)/P$
Recall                            R        $R = T/M$
Precision                         P        $P = T/(T + F)$

It can be seen from table 2 that for the matches in the census/CCS data there are positive correlations (+0.78, +0.26 and +0.08) between the three elements of the date of birth. To assess the effect of assuming independence for these data, where it clearly does not apply, three further runs were undertaken, all using the census/CCS data, the Link Plus software, and blocking fields and linking fields as in table 2. The difference between these runs concerned how date of birth was handled.
In run 3, the date-specific agreement rule was used, which treats the three components as a single field. This is described in the user manual as follows:

This method incorporates partial matching to account for missing month values and/or day values. The Date matching method checks to see if two dates are the same on day, month, and year components. If they are the same on all three components, the comparison pair will get a high weight (w). If they agree on year and month but are missing on day, the weight (w1) will be positive but less than w. If they agree on year but are missing on month and day the weight (w2) will still be positive but less than w1. If they are not missing values, the Date matching method will check if the day and month are swapped. The method also checks for transposition.

In run 4, the three components of the date were treated as different linking fields (and hence as being independent) with the exact agreement rule used in each case. In run 5, the date was treated as one field but with the exact agreement rule, so that agreement for the field required agreement for all three components. If the violation of the independence assumption reduces the quality of the output, run 4 should produce lower quality than runs 3 or 5.

Fig 1: Census/CCS data with three date treatments. [Graph of true links (vertical axis) against false links (horizontal axis) for run 3 (one component, date specific rule), run 4 (three components, exact rule) and run 5 (one component, exact rule).]

In fact, as the true-false graph in fig 1 shows, run 4 produces a slightly better result than either of the runs which treat the date of birth field as a single field. Admittedly this is not a wholly fair comparison since run 5 treated the entire date field as a single unit and did not give any credit for partial agreement. As such it was considerably harsher than either of the other treatments and this no doubt contributed to its having the least effective results.
However for present purposes the important finding is that run 4, which assumes that the three components are independent, does not seem to suffer unduly for its incorrect assumption.

Conclusion

The data given above are exploratory and work to evaluate the packages featured here, and others available for record linkage, will continue. However the results available to date suggest that, while the assumption of conditional independence may be incorrect, the Fellegi-Sunter model, and the output from record linkage packages based on it, are robust to its violation. The effect of field dependencies is to change the weights allocated to agreement and disagreement and the same effect results from changing the values used for the agreement probabilities given a match and given a non-match. If it is the case that the quality of the output is not dependent on the exactitude of the parameter estimates, then it would also seem likely that it is not dependent on the exactitude with which the independence assumption holds. Such a position is entirely consistent with the data presented above.