The Conditional Independence Assumption in Probabilistic Record Linkage
Methods
Most probabilistic record linkage software packages assume that, amongst record
pairs which are matches (which refer to the same person), the event that the two
records will agree on one linking field is stochastically independent of the event that
they will agree on any other field. The same assumption is made for non-matches
(record pairs which refer to different people). This assumption is popular because it
reduces greatly the number of parameters to be estimated and hence helps to ensure
the stability and robustness of the model. However little work seems to have been
done on the implications of adopting it. Is it better not to assume independence?
This paper uses Scottish data from the 2001 Census, from NHSCR and from the
Higher Education Statistics Agency. Firstly it investigates the extent to which the
assumption of conditional independence is violated in these data sets and secondly it
assesses the degree to which departures from the assumption diminish the
effectiveness with which the record linkage software can distinguish between matches
and non-matches. The implications of the results for record linkage practice are
discussed.
Key phrases: record linkage, conditional independence.
Contact details:
Stephen Sharp
National Records of Scotland
Ladywell House
Ladywell Road
Edinburgh
EH12 7TF
0131 314 4270
stephen.sharp@gro-scotland.gsi.gov.uk
Introduction
Most probabilistic record linkage software packages assume that, amongst record
pairs which are matches (which refer to the same person), the event that the two
records will agree on one linking field is stochastically independent of the event that
they will agree on any other field. The same assumption is made for non-matches
(record pairs which refer to different people). This assumption is popular because it
reduces greatly the number of parameters to be estimated and hence helps to ensure
the stability and robustness of the model. For example, if the level of agreement
between the records for a given field is taken to be dichotomous (agree/disagree), if
independence is assumed and if there are $F$ linkage fields, then there are $2F$
parameters to be estimated (the agreement probabilities for each field assuming a
match and assuming a non-match). If independence is not assumed, then there are
$2^F - 1$ parameters assuming a match (one for every combination of agreement and
disagreement over the fields less one as these probabilities must sum to unity) and the
same number of parameters assuming a non-match. It can be readily seen therefore
that there are strong computational reasons for adopting the assumption. In particular,
as the number of linking fields increases, the quality of the linkage ought to improve
as it is based on more information. Assuming independence, this is possible while
keeping the number of parameters manageable. Without this assumption, there is a
tension between maximising the information used in linkage and keeping the number
of parameters within practical limits.
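The growth in the number of parameters can be made concrete with a short sketch. The function names below are illustrative only and do not come from any linkage package; the counts follow directly from the argument above for dichotomous agreement.

```python
def n_params_independent(n_fields: int) -> int:
    """One agreement probability per field, for matches and for non-matches."""
    return 2 * n_fields


def n_params_unrestricted(n_fields: int) -> int:
    """One probability per agree/disagree pattern, less one because the
    probabilities sum to unity, for matches and again for non-matches."""
    return 2 * (2 ** n_fields - 1)


# Compare the two counts as the number of linking fields grows.
for f in (2, 5, 8, 12):
    print(f, n_params_independent(f), n_params_unrestricted(f))
```

With eight linking fields, as in the first run reported later, the independence model needs 16 parameters while the unrestricted model needs 510.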
This paper links Scottish data from the 2001 Census to the census coverage survey of
that year, and from NHSCR to Higher Education Statistics Agency student records in
order (i) to investigate the extent to which the assumption of conditional
independence is violated in these data sets and (ii) to assess the degree to which
departures from the assumption diminish the effectiveness with which the record
linkage software can distinguish between matches and non-matches.
Even without data it is possible to see that there will be some occasions on which the
assumption of independence must fail. The most obvious of these is that if agreement
is observed on first name then, barring sexually ambiguous names such as Lindsay and
Lesley, it is almost certain that there will be agreement on gender, regardless of
whether the pair is a match or not. However, as will be shown later, correlations
between agreement probabilities can arise for reasons not connected with the meaning
of the fields.
It is possible to make some theoretical comments about the effect of assuming
independence when it is not present in the data. The key statistic used by packages
which follow the models proposed by Fellegi and Sunter (1969) or Copas and Hilton
(1990) is the logarithm of the ratio of the probabilities of the observed outcome of the
comparison given that the pair is a match and given that it is a non-match. If
independence is assumed then the multiplicative rule for independent events applies
and the logarithms can be calculated separately for each field and summed. Thus if
the linkage score for field $f$ is $w_f$ then the linkage score for the record pair is
$w = w_1 + w_2 + \dots$. The variance of this sum is composed of the sum of the variances
of the individual terms $\sigma_i^2$ plus twice the sum of the pairwise covariances $\sigma_{ij}^2$ (taking
each combination of $i$ and $j$ once since $\sigma_{ij}^2 = \sigma_{ji}^2$). If there is a nonzero covariance
between two fields, their actual joint contribution to the total variance will be the sum
of their individual variances plus twice their covariance. The independence
assumption however ignores the covariance term and so underestimates their
contribution. In fact it can be argued that the position is actually worse than this. Far
from increasing the size of the contribution, positive covariance should reduce it since
the introduction of $w_j$ into the sum replicates some of the information already added
by $w_i$. At the extreme for example where two fields are wholly predictable from each
other, introducing the second adds no new marginal information and so should leave
the variance of the sum unchanged. In general, where there is positive covariance,
the contribution ought to be the sum of the two separate field variances minus the
covariance (as this has been counted twice as it were) not plus twice the covariance.
There are therefore theoretical reasons for believing that violations of the assumption
of independence will distort the weights accorded to the linking fields in determining
the log likelihood ratio for a given record pair.
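A minimal sketch of the summed log likelihood ratio under independence is given below. It assumes dichotomous agreement, natural logarithms and made-up agreement probabilities; the function names are hypothetical and are not taken from any of the packages discussed in this paper.

```python
import math


def field_weight(agree: bool, m: float, u: float) -> float:
    """Log likelihood ratio for one field.

    m = P(agree | match), u = P(agree | non-match).
    """
    if agree:
        return math.log(m / u)
    return math.log((1 - m) / (1 - u))


def pair_score(agreements, m_probs, u_probs) -> float:
    """Sum of field weights, which is valid only if fields are independent."""
    return sum(field_weight(a, m, u)
               for a, m, u in zip(agreements, m_probs, u_probs))


# Illustrative probabilities only (not estimated from any data set).
m_probs = [0.9, 0.9]
u_probs = [0.1, 0.1]

score_one_field = pair_score([True], m_probs[:1], u_probs[:1])

# Extreme case: the second field is wholly predictable from the first.
# The independence model still adds its weight in full.
score_two_fields = pair_score([True, True], m_probs, u_probs)

print(score_one_field, score_two_fields)
```

In this extreme case the score doubles although the second field carries no new information, which is the kind of distortion described above.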
Field covariances can be investigated by assessing the agreement or disagreement
between each pair of linking fields for a given record pair and calculating the
correlation over record pairs. However the question arises of which pairs should be
used for this purpose. At first glance, the correlation over all pairs admitted by the
blocking strategy is the obvious approach. However this may give a misleading
impression. For practical purposes, the output from record linkage packages (which is
a list of record pairs sorted in descending order of the summed log likelihood ratios)
can be divided into three sectors. Sector 1 at the top of the table is largely composed
of pairs which are easily identifiable as probable matches. Sector 3 at the bottom of
the table is largely composed of pairs which are easily identifiable as probable
non-matches. Sector 2 in the middle is the most important and is the reason why
probabilistic record linkage is necessary. It is largely composed of cases which are
difficult to classify as matches or non-matches with confidence.
This triage is important for present purposes since there will be little if any correlation
in sectors 1 or 3. Sector 1 is dominated by matches and the great majority of field
comparisons will be agreements or missing values. Since correlation depends on the
variance of the two variables being correlated and since there will be little variance,
there will also be little correlation. Similarly sector 3 is dominated by non-matches
and, although there will be variance here, it will be caused by agreements due to
occasional ‘lucky hits’. Since the records refer to different people it is hard to see
how there could be any more than random levels of correlation between agreement on
different field pairs. Including record pairs in sectors 1 and 3 therefore will act to
dilute the correlation observable in sector 2 which is where the negative impact of the
correlation on the quality of the linkage process can be expected to manifest itself.
The correlations reported below therefore are calculated only over record pairs in
sector 2. The boundaries of sector 2 are to some extent arbitrary as there can be
‘outliers’ (matches with very low linkage scores or non-matches with very high
linkage scores) which can affect the boundaries between the three sectors. For present
purposes, any outliers which had this effect were ignored. Otherwise, the correlations
were calculated over all the record pairs in sector 2.
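The restriction to sector 2 can be sketched as follows. The record layout, the thresholds and the field names are assumptions made for illustration; in practice the sector boundaries were set by inspection of the output, as described above.

```python
from collections import Counter


def sector2_pairs(pairs, lower, upper):
    """Keep only pairs whose linkage score lies between the two boundaries."""
    return [p for p in pairs if lower <= p["score"] <= upper]


def agreement_table(pairs, field_a, field_b):
    """2x2 table of agreement on two fields.

    Returns (a, b, c, d): both agree, only field_a agrees,
    only field_b agrees, neither agrees.
    """
    counts = Counter((p["agree"][field_a], p["agree"][field_b]) for p in pairs)
    return (counts[(True, True)], counts[(True, False)],
            counts[(False, True)], counts[(False, False)])


# Hypothetical linkage output: a score and agreement flags for each pair.
pairs = [
    {"score": 25.0, "agree": {"first": True, "last": True}},
    {"score": 8.0, "agree": {"first": True, "last": False}},
    {"score": 6.5, "agree": {"first": False, "last": True}},
    {"score": 5.0, "agree": {"first": True, "last": True}},
    {"score": -12.0, "agree": {"first": False, "last": False}},
]

middle = sector2_pairs(pairs, lower=0.0, upper=10.0)
table = agreement_table(middle, "first", "last")
print(len(middle), table)
```

A correlation coefficient for each pair of fields can then be computed from the four cell counts.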
The data
Two data sets were used in the results reported below. The first consisted of (i) a file
of 500,000 records randomly chosen from the 2001 Census in Scotland and (ii) the
77,800 records of the census coverage survey which had been linked to the Census in
2001. For quality assurance reasons, considerable effort was expended in 2001 on
matching the two files using both electronic and manual methods and it is highly
likely that the matches identified were true matches. However further calculations
undertaken in 2011 revealed a number of links (around 1.5% of those found in 2001)
which were not identified as matches in 2001 but which appeared convincing when
reviewed in 2011. These were added to the file giving a total of 7,720 identified
matches, consistent with the number expected given that the sample consists of
around 10% of the census.
The second data set was taken from a project on the migration of students domiciled
in Scotland but studying in England or Wales. The files were (i) all 665,000 records
of people on the Scottish NHSCR database who were born between 1984 and 1990
inclusive and hence likely to be of student age in 2010 and (ii) a sample of 6,909
records taken from the HESA student records databases for 2007/8 and 2008/9 where
(a) the date of birth in either file was between 1984 and 1990 and (b) there was a term
time postcode which was either non-Scottish or missing. As part of a student
migration project using these data, all the links made by the software had been
reviewed manually to determine whether the link was a true match. In a small
minority of cases, confident decisions about match status were not possible on the
basis of the information available. Where there was doubt, a non-match was assumed.
The software
Two packages were used in this work. They were RecLink produced by the Statistical
Research Division of the US Bureau of the Census and Link Plus developed at the US
Centers for Disease Control and Prevention (CDC), Cancer Division. They are
respectively available on request from the US Bureau of the Census and on the
internet at http://www.cdc.gov/cancer/npcr/tools/registryplus/lp.htm.
The results
Two initial linkage runs were carried out. Details of the data sets linked, the software
packages used and the fields used for linking and blocking are given in table 1. In all
cases the agreement rule for all fields was identity. This is not necessarily the optimal
rule for some fields (for example Jaro-Winkler is slightly better for text strings like
names) but it is computationally simpler and close enough for present purposes. The
tetrachoric correlations found in the two runs are reported in tables 2 and 3.
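For readers who wish to reproduce coefficients of this kind, the sketch below uses the classical "cosine-pi" approximation to the tetrachoric correlation for a 2x2 table. This is an assumption made for illustration: the paper does not state how its coefficients were computed, and a full maximum likelihood estimate would give somewhat different values.

```python
import math


def tetrachoric_approx(a, b, c, d):
    """Cosine-pi approximation to the tetrachoric correlation.

    a = both fields agree, b = only the first agrees,
    c = only the second agrees, d = neither agrees.
    Returns None when the coefficient is undefined, for example when
    one of the two fields shows no variation at all.
    """
    concordant = a * d
    discordant = b * c
    if concordant == 0 and discordant == 0:
        return None
    if discordant == 0:
        return 1.0
    if concordant == 0:
        return -1.0
    return math.cos(math.pi / (1.0 + math.sqrt(concordant / discordant)))


print(tetrachoric_approx(40, 10, 10, 40))   # agreement tends to coincide
print(tetrachoric_approx(10, 40, 40, 10))   # agreement tends to alternate
print(tetrachoric_approx(50, 0, 50, 0))     # second field always agrees
```

The last call returns None because a field on which every pair agrees has no variance, so no coefficient can be calculated for it.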
The patterns of correlations for matches and non-matches are, as would be expected,
very different. For matches, two of the three correlations greater than 0.13 occurred
between components of the date of birth. It appears that if a recording error occurs for
one of these then there is an increased probability of an error occurring in one or both
of the others. This is not of course intuitively unreasonable.
Table 1: details of the two initial linkage runs

                    Run 1                       Run 2
Data                2001 Census – 10% sample    NHSCR - subset
                    Census coverage survey      HESA - subset
Software            RecLink                     Link Plus
Blocking field(s)   post code sector            post code sector
                                                first name soundex
Linking fields      first name                  first name
                    last name                   last name
                    house no                    date of birth
                    year of birth               post code
                    month of birth              gender
                    day of birth
                    post code
                    gender

The other coefficients are either close to zero or negative. The occurrence of
negative coefficients probably arises as follows. If a pair of records which refer to the
same person is found in sector 2, it is because a number of errors have occurred in
different linking fields. However the number of errors must be small (as otherwise
the pair would fall to sector 3) and hence if an error occurs in one field, an error in
another is less likely.
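This selection effect can be illustrated with a small exact calculation. The sketch below assumes fields that are truly independent, with made-up agreement probabilities, and treats membership of sector 2 as a restriction on the number of agreeing fields, which is a simplification of a restriction on the linkage score. It computes the ordinary (phi) correlation rather than the tetrachoric coefficient.

```python
import math
from itertools import product


def conditional_correlation(p_agree, min_agree, max_agree, i=0, j=1):
    """Exact correlation between agreement on fields i and j, given that
    the number of agreeing fields lies in [min_agree, max_agree].

    p_agree holds the agreement probability of each (independent) field.
    """
    total = e_i = e_j = e_ij = 0.0
    for pattern in product((0, 1), repeat=len(p_agree)):
        if not (min_agree <= sum(pattern) <= max_agree):
            continue
        prob = math.prod(p if x else 1 - p for p, x in zip(p_agree, pattern))
        total += prob
        e_i += prob * pattern[i]
        e_j += prob * pattern[j]
        e_ij += prob * pattern[i] * pattern[j]
    e_i, e_j, e_ij = e_i / total, e_j / total, e_ij / total
    cov = e_ij - e_i * e_j
    return cov / math.sqrt(e_i * (1 - e_i) * e_j * (1 - e_j))


p_agree = [0.9] * 6  # illustrative agreement probabilities for matches

all_pairs = conditional_correlation(p_agree, 0, 6)
middle_only = conditional_correlation(p_agree, 3, 4)
print(all_pairs, middle_only)
```

Over all pairs the correlation is zero, as it must be for independent fields, but among pairs with three or four agreements out of six it is negative.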
Table 2 - Correlations for matches and non-matches in the Census/CCS data with
post code sector as blocking field

Matches (N = 90)
                  first   last    house   dob     dob     dob     post
                  name    name    no      year    mon     day     code    gender
first name         1.00    0.02   -0.44   -0.26   -0.01   -0.51   -0.13   -0.16
last name          0.02    1.00   -0.42   -0.18   -0.10   -0.39   -0.33   -0.02
house number      -0.44   -0.42    1.00    0.26   -0.20    0.12    0.13   -0.17
year of birth     -0.26   -0.18    0.26    1.00    0.08    0.78   -0.30   -0.09
month of birth    -0.01   -0.10   -0.20    0.08    1.00    0.26   -0.36   -0.41
day of birth      -0.51   -0.39    0.12    0.78    0.26    1.00   -0.36   -0.11
post code         -0.13   -0.33    0.13   -0.30   -0.36   -0.36    1.00   -0.09
gender            -0.16   -0.02   -0.17   -0.09   -0.41   -0.11   -0.09    1.00

Non-matches (N = 100)
                  first   last    house   dob     dob     dob     post
                  name    name    no      year    mon     day     code    gender
first name         1.00   -0.46   -0.48    0.07    0.78    0.14   -0.53    0.28
last name         -0.46    1.00    0.09   -0.19   -0.60   -0.27    0.00   -0.19
house number      -0.48    0.09    1.00    0.17   -0.48    0.11   -0.27   -0.24
year of birth      0.07   -0.19    0.17    1.00    0.05   -0.01   -0.32   -0.07
month of birth     0.78   -0.60   -0.48    0.05    1.00    0.27   -0.37   -0.14
day of birth       0.14   -0.27    0.11   -0.01    0.27    1.00   -0.41   -0.19
post code         -0.53    0.00   -0.27   -0.32   -0.37   -0.41    1.00   -0.16
gender             0.28   -0.19   -0.24   -0.07   -0.14   -0.19   -0.16    1.00

For the non-matches, there is one large positive coefficient (between first name and
month of birth) but otherwise the largest is +0.28. A different argument applies for
negative coefficients because here, variance is caused not by recording errors but by
“lucky hits”. However it is in some ways analogous to that which applies to matches.
If a pair of records which do not refer to the same person is found in sector 2, it is
because a number of lucky hits have occurred in different linking fields. However the
number of lucky hits must be limited (as otherwise the pair would be in sector 1) and
so again if a lucky hit occurs in one field, a lucky hit in another is less likely.
Table 3 - Correlations for matches and non-matches in the NHSCR/HESA data
with first name soundex OR post code sector as blocking fields

Matches (N = 450)
                  first   last    birth   post
                  name    name    date    code    gender
first name         1.00   -0.07   -0.07   -0.65    0.10
last name         -0.07    1.00   -0.03   -0.03   -0.01
date of birth     -0.07   -0.03    1.00   -0.26   -0.01
post code         -0.65   -0.03   -0.26    1.00   -0.13
gender             0.10   -0.01   -0.01   -0.13    1.00

Non-matches (N = 131)
                  first   last    birth   post
                  name    name    date    code    gender
first name         1.00   -0.66   -0.15   -0.54    .
last name         -0.66    1.00   -0.49    0.19    .
date of birth     -0.15   -0.49    1.00   -0.07    .
post code         -0.54    0.19   -0.07    1.00    .
gender              .       .       .       .      .

Table 3 gives the tetrachoric correlations for the NHSCR/HESA linkage run. For the
matches, only first name / gender returns a positive coefficient, though this is not
inconsistent with table 2 since here the date of birth was treated as a single field so the
positive correlations between the various components of the field were not measured
separately. For the non-matches, only last name / post code is significantly positive,
whereas there are three coefficients which are less than -0.40. Coefficients are not
available for gender because for all 131 of the non-match pairs there was agreement
on this field. Non-match agreement is of course very likely for this field and
appeared to be effectively a precondition for non-match record pairs to get into sector
2.
These negative correlations may be artefacts in the sense that they have been
calculated only on record pairs in sector 2 and record pairs (both matches and
non-matches) are only found in this sector if particular circumstances apply. Nevertheless
the correlations are real and have the potential to damage the effectiveness of the
linkage process. To investigate whether this potential is realised it is necessary to
have a means of assessing and comparing the quality of record linkage output. The
method most commonly used to compare the outputs of record linkage packages is the
recall-versus-precision graph. These two variables are functions of the numbers of
true positives (record pairs which refer to the same person and have been correctly
linked by the software), false positives (record pairs which do not refer to the same
person but have been incorrectly linked by the software) and false negatives (record
pairs which refer to the same person but have not been linked by the software). Note
that the sum of the first and third of these is the number of true matches which is
constant for a given data set. This method will be used here though the data will be
presented rather differently. Instead of plotting the derived variables of recall and
precision, the numbers of true positives (or true links as they will be called) will be
plotted directly against the number of false positives (or false links). This enables
direct identification of the cumulative numbers of successes and failures in the output
as the value of the log likelihood ratio decreases. The equations which relate the
recall-precision axes to the true-false axes are given in table 4.
Table 4 – Raw and derived variables and the relationships between them.
M is the total number of matches which is constant for a given data set.

Variable                         Symbol   Relationship to other variables
True positives or true links     T        $T = MR$
False positives or false links   F        $F = MR(1 - P)/P$
Recall                           R        $R = T/M$
Precision                        P        $P = T/(T + F)$

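The relationships in table 4 can be checked with a short sketch that walks down linkage output sorted by score and accumulates true and false links. The scores and match flags are invented for illustration, as is the value of M.

```python
def cumulative_links(pairs):
    """Cumulative (true links, false links) going down the output in
    descending order of linkage score.

    pairs is an iterable of (score, is_match) tuples.
    """
    true_links = false_links = 0
    curve = []
    for _score, is_match in sorted(pairs, key=lambda p: p[0], reverse=True):
        if is_match:
            true_links += 1
        else:
            false_links += 1
        curve.append((true_links, false_links))
    return curve


def recall(t, m):
    return t / m


def precision(t, f):
    return t / (t + f)


def false_links_from(r, p, m):
    """F recovered from recall, precision and the total number of matches."""
    return m * r * (1 - p) / p


# Hypothetical output: four true matches exist (M = 4), one is never linked.
pairs = [(12.0, True), (9.5, True), (7.1, False), (6.0, True), (1.2, False)]
M = 4

curve = cumulative_links(pairs)
t, f = curve[3]                      # position after the fourth pair
r, p = recall(t, M), precision(t, f)
print(curve, r, p, false_links_from(r, p, M))
```

Plotting the first element of each tuple against the second gives a true-false graph of the kind used below.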
It can be seen from table 2 that for the matches in the census/CCS data there are
positive correlations (+0.78, +0.26 and +0.08) between the three elements of the date
of birth. To assess the effect of assuming independence for these data, where it
clearly does not apply, three further runs were undertaken, all using the census/CCS
data, the Link Plus software, and blocking fields and linking fields as in table 2. The
difference between these runs concerned how date of birth was handled. In run 3, the
date-specific agreement rule was used which treats the three components as a single
field. This is described in the user manual as follows:
This method incorporates partial matching to account for missing
month values and/or day values. The Date matching method
checks to see if two dates are the same on day, month, and year
components. If they are the same on all three components, the
comparison pair will get a high weight (w). If they agree on year
and month but are missing on day, the weight (w1) will be
positive but less than w. If they agree on year but are missing on
month and day the weight (w2) will still be positive but less than
w1. If they are not missing values, the Date matching method
will check if the day and month are swapped. The method also
checks for transposition.
In run 4, the three components of the date were treated as different linking fields (and
hence as being independent) with the exact agreement rule used in each case. In
run 5, the date was treated as one field but with the exact agreement rule so that
agreement for the field required agreement for all three components. If the violation
of independence reduces the quality of the output, then run 4 should produce lower
quality than runs 3 or 5.
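The difference between the treatments in runs 4 and 5 can be sketched as below. The function names are hypothetical. The date-specific rule of run 3 is not sketched because its internal workings are known here only from the manual's description quoted above.

```python
from datetime import date


def agree_components(d1: date, d2: date):
    """Run 4 style: year, month and day compared as three separate fields."""
    return (d1.year == d2.year, d1.month == d2.month, d1.day == d2.day)


def agree_whole(d1: date, d2: date) -> bool:
    """Run 5 style: one field, agreeing only if all three components agree."""
    return d1 == d2


# Two dates of birth differing only in the day, e.g. through a keying error.
a = date(1985, 3, 14)
b = date(1985, 3, 15)

print(agree_components(a, b))   # partial agreement is visible
print(agree_whole(a, b))        # recorded simply as a disagreement
```

Under the run 4 treatment the pair is still credited with agreement on year and month, whereas under the run 5 treatment the whole field counts as a disagreement.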
Fig 1: Census/CCS data with three date treatments
[Figure: true links (vertical axis, 5500 to 7500) plotted against false links (horizontal axis, 0 to 300) for run 3 (one component, date specific rule), run 4 (three components, exact rule) and run 5 (one component, exact rule).]
In fact, as the true-false graph in fig 1 shows, run 4 produces a slightly better result
than either of the runs which treat the date of birth field as a single field. Admittedly this is
not a wholly fair comparison since run 5 treated the entire date field as a single unit
and did not give any credit for partial agreement. As such it was significantly harsher
than either of the other treatments and this no doubt contributed to its having the least
effective results. However for present purposes the important finding is that run 4,
which assumes that the three components are independent, does not seem to suffer
unduly for its incorrect assumption.
Conclusion
The data given above are exploratory and work to evaluate the packages featured
here, and others available for record linkage, will continue. However the results
available to date suggest that, while the assumption of conditional independence may
be incorrect, the Fellegi-Sunter model, and the output from record linkage packages
based on it, are robust to its violation. The effect of field dependencies is to change
the weights allocated to agreement and disagreement and the same effect results from
changing the values used for the agreement probabilities given a match and given a
non-match. If it is the case that the quality of the output is not dependent on the
exactitude of the parameter estimates, then it would also seem likely that it is not
dependent on the exactitude with which the independence assumption holds. Such a
position is entirely consistent with the data presented above.