Object-oriented Bayesian networks for complex forensic DNA
profiling problems∗
A. P. Dawid
University College London
J. Mortera
P. Vicard
Università Roma Tre
September 6, 2005
Abstract
We describe a flexible computational toolkit, based on object-oriented Bayesian networks
(OOBNs), that can be used to model and solve a wide variety of complex problems of relationship testing using DNA profiles. In particular this can account for such complicating
features as missing individuals, mutation, and null alleles. We show by example how to build
a high-level representation of a disputed pedigree problem, and how to incorporate lower-level network models of the relevant complications. We illustrate the use of this toolkit with
several examples, including disputed paternity with missing or additional measurements, and
criminal identification. Using this technology, we investigate the effects on likelihood ratios
of introducing mutation and/or null alleles, and show that this can be very substantial even
when the underlying perturbations are very small.
Some key words and phrases: Bayesian network, DNA profile, missed allele, mutation, null allele,
object-oriented, paternity testing, silent allele.
1
Introduction
DNA parentage testing and forensic identification are currently conducted using DNA profiles,
comprised of several highly polymorphic short tandem repeat (STR) genetic markers each having
a repertory of alleles (“repeat numbers”) that can typically be represented as small integers. The
European standard AMPFlSTR® SGM Plus™ system uses ten such STR loci, plus amelogenin.
All are on different chromosomes and so segregate independently. Polymerase chain reaction
amplification now allows a profile to be obtained from very small amounts of DNA, even a single
cell. For an account of the relevant biotechnology see e.g. Buckleton et al. (2004).
The forensic impact of such DNA evidence is most appropriately captured by calculating the
corresponding likelihood ratio for comparing a pair of competing hypotheses (Evett and Weir
1998; Morling et al. 2002). However, this can become extremely challenging, both logically and
computationally, in the presence of additional complicating features such as missing data on some
individuals, mixed trace evidence, mutation, null alleles, etc. For example, in a paternity case
the true father may appear to be excluded, when in fact a mutation has taken place, or an allele
has not been recorded.
We have previously shown (Dawid et al. 2002; Mortera 2003; Mortera et al. 2003; Dawid 2003)
how such complex problems can be addressed by structuring and analysing them with the aid of
the computational technology of Bayesian networks (BN), also called Probabilistic Expert Systems
(PES) (Cowell et al. 1999). These have been implemented in general purpose software such as
Hugin.1
∗ Research report No. 256, Department of Statistical Science, University College London.
September 2005.
1 Obtainable from www.hugin.com
A recent extension of this BN technology is the object-oriented Bayesian network (OOBN).
This allows hierarchical definition and construction of a BN, utilising simple modular building
blocks. Additional complexity can easily be introduced by adding new modules or refining existing
ones. Object-oriented Bayesian network architectures have been described by Laskey and Mahoney
(1997); Koller and Pfeffer (1997); Bangsø and Wuillemin (2000).
In this paper we describe a construction set of basic OOBN modules for DNA identification,
and show how these can be flexibly combined to handle a wide variety of complex problems. Our
networks have been built using Hugin version 6.4.
One specific complicating feature that we address is mutation, which can lead to a child having
an allele that appears to have no source in either parent. Another is the possibility that observation
of an individual’s genotype can be incomplete on account of a “null allele”, i.e. one that is not
detected by the measuring apparatus. We further distinguish between the cases where this property
is non-inherited (when we term the allele “missed”) or inherited (which we term a “silent” allele).
An allele can be missed simply on account of sporadic equipment failure. A silent allele, on
the other hand, might be the result of a mutation in the primer binding region, causing DNA
amplification failure (Clayton et al. 2004). In this case only one allele is amplified and read, and
the individual appears, wrongly, to be homozygous. This feature will be passed, by Mendelian
inheritance, to a child, which, consequently, may again wrongly appear homozygous. We can thus
easily have false evidence of exclusion, leading us to conclude, wrongly, that the alleged father is
not the true father.
We apply our networks to analyse a number of specific forensic cases. We find that properly
accounting for a small probability of a silent allele can have a dramatic effect. In particular, in paternity testing where we can also observe the putative father’s brother, this additional information
can substantially change the probability of paternity in the presence of silent alleles.
The paper is organized as follows. In § 2 we describe a variety of problems of civil and criminal
forensic identification, and represent them as high-level disputed pedigree networks. Section 3
shows how DNA identification in such problems can be implemented by treating these as object-oriented Bayesian networks, having further internal structure that can be expressed by means of
lower-level networks as described in § 4. Modifications of the lower-level networks to incorporate
various complicating features, viz. mutation, silent alleles, missed alleles and combinations of
these features are described in § 5, § 6, § 7 and § 8, respectively. In § 9 we examine some numerical
examples to illustrate the effects of taking proper account of the various complications considered.
Section 10 presents further examples, showing the sometimes dramatic effect on the paternity ratio
of accounting for silent alleles etc. when measurements can be obtained from relatives; while § 11
presents a case of criminal identification. Closing remarks are given in § 12. Appendix A develops
some algebraic formulae for the paternity ratio, allowing for silent alleles, in a simple paternity
problem when we can also observe the genotype of the putative father’s brother.
2
Pedigrees
We give particular attention to problems of testing paternity, or other family relationships, using
DNA profile data. We always start by constructing a single pedigree to represent the relationships,
whether known, assumed, or uncertain, between relevant individuals.
2.1
Nuclear family
Figure 1 is a simple pedigree representation for a nuclear family consisting of father f, mother
m, and one child c (colour-coded blue for male, pink for female). Both f and m are instances of
type founder, having no parents represented in the pedigree, whereas c is an instance of type
child, having both parents represented. Cases where, say, only the individual’s father is known
or observed can be handled by adding the unknown mother as an additional founder.
Figure 1: Pedigree for nuclear family
2.2
Simple disputed paternity
In the simplest case of disputed paternity, we have an alleged family triplet formed by a disputed
child c, its undisputed mother m, and the putative father pf. The hypothesis of interest, H0, is
that the putative father is the true father tf of the child; the alternative hypothesis H1 is that
the true father is some unobserved alternative father, af, treated as drawn at random from the
population.
A pictorial representation of this disputed pedigree is shown in Figure 2 (unobserved individuals
being shown in a lighter shade). Each of m, pf and af is a founder, while c is a child. To represent
the disputed identity of the true father tf we describe him as a query individual, and include an
explicit “hypothesis node” tf=pf? to indicate that we have a choice between pf and af.
Figure 2: Pedigree for simple disputed paternity
We may have DNA profiles from m, c, and pf, constituting evidence E. The impact of this
evidence is carried by the likelihood ratio in favour of paternity:
LR = Pr(E | H0) / Pr(E | H1).    (1)
If we make some standard assumptions — Mendelian segregation, independent markers, known
population allele frequencies — this can be calculated by a simple and well-known algebraic formula
(Essen-Möller 1938).
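As a minimal sketch of what formula (1) amounts to at a single marker, the following enumerates the possible transmitted alleles under Mendelian segregation. The allele frequencies in FREQ and all function names are hypothetical illustrations (they are not the Table 1 values, and this is not the Hugin implementation described later); mutation and null alleles are ignored.

```python
from fractions import Fraction
from itertools import product

# Hypothetical allele frequencies for one marker (illustrative only).
FREQ = {14: Fraction(1, 10), 15: Fraction(2, 10),
        16: Fraction(3, 10), 17: Fraction(4, 10)}


def transmit(genotype):
    """Distribution of the allele passed on by a parent with this genotype."""
    dist = {}
    for allele in genotype:
        dist[allele] = dist.get(allele, Fraction(0)) + Fraction(1, 2)
    return dist


def random_allele():
    """Allele contributed by an unobserved man drawn from the population."""
    return dict(FREQ)


def child_prob(child, maternal_dist, paternal_dist):
    """Probability of the child's unordered genotype given both contributions."""
    target = tuple(sorted(child))
    total = Fraction(0)
    for (a, pa), (b, pb) in product(maternal_dist.items(), paternal_dist.items()):
        if tuple(sorted((a, b))) == target:
            total += pa * pb
    return total


def paternity_lr(mother, child, putative_father):
    """LR = Pr(E | H0) / Pr(E | H1); founder genotype probabilities cancel."""
    numerator = child_prob(child, transmit(mother), transmit(putative_father))
    denominator = child_prob(child, transmit(mother), random_allele())
    return numerator / denominator
```

For mother (14, 15), child (15, 16) and putative father (16, 17), the child's paternal allele must be 16, and the sketch returns 5/3, which agrees with the familiar value 1/(2 × 0.3) for a heterozygous putative father carrying the required allele.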
2.3
Missing individuals
In more complex cases, DNA profiles may be missing for one or more members of the basic family
triplet, but further information may be available in terms of profiles from known relatives. Forensic geneticists have not generally been able to handle such incomplete paternity data rigorously
because of the more complex logical and computational analysis required.
Figure 3 and Figure 4 relate to the two incomplete paternity cases described and analysed by
Dawid et al. (2002). They are variations on Figures 3 and 5 of that paper, extended to incorporate
explicitly all relevant individuals, whether observed or unobserved.
In Case 1, as displayed in Figure 3, we have DNA from a disputed child c1, but not from its
mother m1 nor from the putative father pf. We do however have DNA from c2, an undisputed
child of pf by a different, unobserved, mother m2, as well as from an undisputed full brother b of
pf. The sibling relationship is made explicit by the incorporation of the (unobserved) grandfather
gf and grandmother gm, parents of both pf and b. Nodes gf, gm, m1, m2 and af are all instances
of founder; pf, b, c1 and c2 are instances of child; and tf is an instance of query.
Case 2, displayed in Figure 4, is very similar, except that we now have DNA from both m1 and
m2, and from two full brothers, b1 and b2, of pf.
Figure 3: Pedigree for incomplete paternity case 1
Figure 4: Pedigree for incomplete paternity case 2
2.4
Criminal identification
Such genetic networks can also be used in certain criminal cases, as well as for identification of
victims of disasters.
The problem represented by Figure 5 is based on a real case. A body has been found, burnt
beyond recognition, but there is reason to believe it might be that of a missing criminal cr. DNA
is available from body, from the wife of cr, and from two children, c1 and c2, of cr and wife.
The hypothesis node now indicates that cr might be identical to body; otherwise he is treated as
an unobserved man, cr (unobs).
Figure 6 describes a British cause célèbre, the case of James Hanratty (H) who was found guilty
of murder and rape and hanged in 1962. In 1998 it was decided to apply modern DNA profiling
technology to certain items of evidence from the original trial, which had been retained by the
police, and a profile, taken to be from the culprit c (either H, or some other person o) was found.
In an attempt to prove Hanratty’s innocence, his mother m and full brother b offered themselves
for DNA profiling. In principle this might have excluded Hanratty, but in fact did not do so: the
associated likelihood ratio in favour of his having left the crime trace was about 440. In 2001
Hanratty’s body was exhumed, and it was found that his DNA did indeed provide a full match to
the crime profile, yielding an updated likelihood ratio of about 2.5 million.
Figure 5: Pedigree for criminal identification case
Figure 6: The case of James Hanratty
3
Object-oriented networks for DNA identification
So far we have merely described the type of problem we wish to address. In order to assess the
impact of the evidence in any but the simplest of such problems we shall generally have to make
use of sophisticated computational tools. Our approach is based on building Bayesian networks
to represent the assumed structure. These then allow insertion of the evidence and propagation
of its effect throughout the network. In particular, we can find its impact on the comparison of
competing hypotheses, e.g. as to paternity.
3.1
Object-oriented Bayesian networks
Dawid et al. (2002) showed how Bayesian networks can be built to represent problems such as
described above, allowing one to obtain the correct likelihood ratio for the hypotheses based on all
the available evidence. Here we describe a new, “object-oriented”, construction for such networks,
which greatly simplifies and clarifies the specification process.
Version 6 of the Bayesian network (BN) software system Hugin supports hierarchical definition
of a BN, whereby any network can itself contain repeated instances of some other generic (class)
network or networks. We use bold face to indicate a network class, and teletype face to
indicate an instance or regular node.
A class network is like a regular network, except that it can have interface — input and output
— nodes as well as internal nodes. Interface nodes are indicated by a grey outer ring, an input
node having a dotted outline, and an output node a solid outline. Any network can have nodes
that are themselves instances of other networks, in addition to regular nodes. Each instance of a
class network within another network is displayed as a rounded rectangle, which can be expanded
if desired to display its interface nodes; internal nodes remain hidden from view (although they
can be accessed in “run” mode for entering findings or extracting updated probabilities). Arrows
between nodes within the same network, or from output nodes to regular nodes in the containing
network, represent, in the standard way, the probabilistic or functional dependence of that “child”
node on its “parents” (Cowell et al. 1999). An input node can have at most one incoming arrow
from a node in the containing network (which could itself be an output node of some other
subnetwork): this is a “binding link”, indicating that these two nodes are to be identified.
All instances of a class have identical probabilistic structure, save that the table for an input
node is a default, being overwritten in any instance where that node is bound to a node of the
containing network. Only output nodes can be parents of external nodes (either regular nodes of
the containing network, or input nodes of other subnetworks).
This architecture enables a convenient modular approach to problem specification. It is particularly natural and useful for genetic networks, where there is repetition, across different individuals,
of such basic structures as Mendelian inheritance or mutation processes. Here we describe a set
of simple class networks that can be pieced together as required, much like a child’s construction
set, to represent a wide variety of problems. A specific application of this modular construction
process to a complex problem involving mutation has previously been described by Dawid (2003).
Note that the object-oriented structure is used purely for problem specification and network
construction. Within the software the network is expanded internally into a regular Bayes net
(which can be output if desired). Once an object-oriented network has been constructed, it can
be used for individual case analysis in essentially the same way as a regular network: see Dawid
et al. (2002) for illustrations. After entering evidence, computation and analysis are effected by
standard propagation algorithms (Cowell et al. 1999), initiated by means of simple mouse clicks.
3.2
Bayesian networks for DNA identification
The pedigrees displayed in § 2 above were constructed in Hugin 6.4. Over and above expressing
family relationships, this allows us to describe the operation of genetic inheritance in detail. We
do this in the context of forensic DNA profiles, each consisting of measurements on a collection of
STR genetic markers (which we shall usually simply call “gene”).
An individual’s DNA profile consists of measurements on a number of DNA markers. For
each such marker we observe a genotype, comprising the unordered pair of values (alleles) for its
constituent genes, one maternally and one paternally inherited, although this distinction cannot
usually be observed. When these alleles are the same the individual is called homozygous at
that marker, else heterozygous. Current technology utilises STR markers, which have a repertory
of 8–20 alleles that can commonly be described by a small integer. For present purposes these can
be regarded as measured without error, except for the specific possibility of “silent” or “missed”
alleles, as treated in § 6 ff. below.
Each of our networks describes the inheritance of a single marker: distinct markers require
distinct networks, but these will differ only in the details of the repertory of alleles, and their
population frequencies. On entering the available DNA profile data for a marker we can use the
system to calculate likelihood ratios for comparing hypotheses of interest. Throughout this paper
we assume that the networks for different markers are entirely independent (given any of the
hypotheses entertained), and calculate an overall likelihood ratio by simply multiplying the values
obtained from each component marker network.
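The combination step can be sketched as follows. The function names are ours, and the per-marker values in the usage note are invented purely for illustration.

```python
import math


def combined_lr(marker_lrs):
    """Overall LR as the product of single-marker LRs (markers independent)."""
    return math.prod(marker_lrs)


def combined_log10_lr(marker_lrs):
    """The same combination on the log10 scale; requires every LR > 0."""
    return sum(math.log10(lr) for lr in marker_lrs)
```

For example, combined_lr([2.0, 5.0, 0.5]) gives 5.0. Note that a single marker with likelihood ratio 0 (an apparent exclusion) forces the overall product to 0, which is why the treatment of mutation and null alleles at individual markers can matter so much.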
Note that colouring of nodes is purely for presentational purposes and has no effect on the
analysis.
3.3
Nuclear family
In Figure 1, each of its three nodes was defined as an instance of another, generic, class network,
having hidden internal structure. Both f and m are instances of a class founder, while c is an
instance of a class child.
In Figure 7, which is an expanded version of this network, we see that founder contains two
output nodes: pg, representing the founder’s paternally inherited gene, and mg, representing the
maternally inherited gene. As for child, in addition to output nodes pg and mg as for founder
it has input nodes fpg, fmg, mpg, mmg, representing respectively the child’s father’s paternal and
maternal genes, and his/her mother’s paternal and maternal genes. The arrows into these represent
binding links, specifying that these are identical copies of the associated gene nodes in the two
parental networks.
Figure 7: Expanded pedigree for nuclear family
The above class networks contain still further hidden structure, defining the nature of the
inheritance process and of the observable quantities (genotypes). This will be described in § 4
below.
3.4
Simple disputed paternity
In Figure 2, m, pf and af are again instances of class founder, and c an instance of class child,
exactly as described above. To model tf we need to construct a new network class query. Some
details of this are shown in the partially expanded version of Figure 8. Internally, the output node
tfpg is copied from either f1pg or f2pg, according as the Boolean variable tf=f1? is true or
false; and similarly for tfmg. Input nodes f1pg and f1mg are bound to output nodes pg and mg of
pf, while f2pg and f2mg are bound to output nodes pg and mg of af. Other connexions between
the nodes in Figure 2 are made exactly as described in § 3.3 above. We also include the explicit
“hypothesis node” tf=pf?, bound to tf=f1?, in the top-level network: this node embodies H0
or H1 according as its value is true or false. We initially set these as equally likely, so that after
propagation of evidence the ratio of their posterior probabilities can be interpreted as a likelihood
ratio.
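The reason for the equal prior can be checked with a small sketch of Bayes' theorem for two hypotheses. The function names and the numerical likelihoods used in the usage note are illustrative assumptions, not output from Hugin.

```python
from fractions import Fraction


def posterior(prior_h0, lik_h0, lik_h1):
    """Posterior probability of H0 by Bayes' theorem with two hypotheses."""
    joint_h0 = prior_h0 * lik_h0
    joint_h1 = (1 - prior_h0) * lik_h1
    return joint_h0 / (joint_h0 + joint_h1)


def lr_from_posterior(post_h0, prior_h0=Fraction(1, 2)):
    """Recover the likelihood ratio as posterior odds divided by prior odds."""
    posterior_odds = post_h0 / (1 - post_h0)
    prior_odds = prior_h0 / (1 - prior_h0)
    return posterior_odds / prior_odds
```

With Pr(E | H0) = 1/4 and Pr(E | H1) = 3/20, an equal prior gives posterior probability 5/8 for H0, so the posterior odds 5/3 are the likelihood ratio directly. With any other prior the same likelihood ratio is recovered only after dividing by the prior odds.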
3.5
Further networks
We now have all the ingredients to represent more complex problems, such as described in § 2.3
and § 2.4. All one has to do is to insert and connect together, in obvious ways determined by the
basic pedigree, instances of the already constructed networks founder, child and query, as well
as a hypothesis node. Armed with this “construction set” we can represent and so solve a very
wide variety of problems involving DNA profiles and disputed identity.
4
Detailed structure
We now give further details of the structure of the networks constructed above.
Figure 8: Partially expanded pedigree for simple disputed paternity
4.1
Network founder
The internal structure of the network class founder is shown in Figure 9.
Figure 9: Network founder
The internal nodes pgin and mgin represent the random paternally and maternally inherited genes of the founder, and are
themselves specified as instances of a class gene (not shown here), which consists of a single output
node, also called gene. Associated with gene in this simple network is the appropriate repertory
of allele values and their population frequencies.
For our illustrations in this paper we use forensic marker VWA, having alleles ranging from
12 to 22 and probability table as given in Table 1. These are Austrian-German population allele
frequencies.2
2 We are grateful to B. Brinkmann for supplying the data for Table 1.
The output nodes pg and mg of founder are specified as identical copies of the internal gene
node of pgin and mgin, respectively. Such duplication is necessary only because of limitations of
Hugin, which currently does not allow a node to be both an input and an output node, nor for
an arrow to cross more than one level of the hierarchy.
Figure 10: Network genotype
Finally the internal node gt of founder is an instance of the class genotype, as displayed in
Figure 10. Here gtmin and gtmax are defined (by means of Hugin expressions) as the minimum
and maximum of the two input gene nodes pg and mg, and represent the observable genotype of
an individual, being used for entering such genotype evidence when available; we colour such
an “observation node” in green. The input nodes pg and mg of genotype are bound to nodes pg
and mg of founder.
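The role of gtmin and gtmax can be sketched in a few lines. This assumes that the two genes of a founder are independent draws from the population frequencies; the function names and the frequencies used in the usage note are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction


def genotype(pg, mg):
    """Observable unordered genotype (gtmin, gtmax) of two gene values."""
    return (min(pg, mg), max(pg, mg))


def founder_genotype_dist(freq):
    """Genotype distribution of a founder whose two genes are independent
    draws from the allele frequency table `freq` (an assumption here)."""
    dist = defaultdict(Fraction)
    for pg, p in freq.items():
        for mg, q in freq.items():
            dist[genotype(pg, mg)] += p * q
    return dict(dist)
```

With frequencies 1/4 and 3/4 for alleles 14 and 15, the heterozygous genotype (14, 15) receives probability 2 × 1/4 × 3/4 = 3/8, since the two orderings (pg, mg) and (mg, pg) map to the same observable pair.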
4.2
Network child
The internal structure of network class child is displayed in Figure 11.
Figure 11: Network child
On the paternal (left-hand) side, the input nodes fpg and fmg of child are bound to the input
nodes pg and mg of an instance fmeiosis of a network class mendel. This in turn has an output
node cg, which is then copied identically to the output node pg of child (again, such duplication
would ideally be avoided but at present cannot be). An identical structure holds for the maternal
(right-hand) side of child. Finally pg and mg are fed into an instance gt of genotype, exactly as
in founder, again allowing input of observed genotype data.
Figure 12: Network mendel
Figure 12 shows the internal structure of mendel. Its internal Boolean node cg=pg? is modelled as having a 50% chance of being true, in which case output node cg is identical with input
node pg; else, when cg=pg? is false, cg is identical with input node mg. The effect is thus to
transmit, at random, just one of the two parental genes, in accord with Mendelian segregation.
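The marginal effect of mendel on the transmitted gene can be sketched as a two-component mixture. The function and parameter names are ours, and the sketch works with marginal distributions only rather than with the full network.

```python
from fractions import Fraction


def mendel(pg_dist, mg_dist, p_copy_pg=Fraction(1, 2)):
    """Marginal distribution of cg: a copy of pg with probability p_copy_pg
    (the node cg=pg? being true), otherwise a copy of mg."""
    cg = {}
    for allele, p in pg_dist.items():
        cg[allele] = cg.get(allele, Fraction(0)) + p_copy_pg * p
    for allele, p in mg_dist.items():
        cg[allele] = cg.get(allele, Fraction(0)) + (1 - p_copy_pg) * p
    return cg
```

For a parent known to carry genes 14 and 16, mendel({14: 1}, {16: 1}) assigns probability 1/2 to each; for a homozygous parent the transmitted gene is certain.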
4.3
Network query
The internal structure of network query is shown in Figure 13. This contains only the input and
output nodes as described in § 3.4 above. When tf=f1? is true, tfpg copies f1pg and tfmg copies
f1mg; when false, tfpg copies f2pg and tfmg copies f2mg.
Figure 13: Network query
4.4
Analysis
For case analysis the pedigree network describing a problem is used essentially as described in
Section 2.2 of Dawid et al. (2002): each observed genotype is entered (as gtmin and gtmax) inside
the instance gt of genotype within the relevant instance of founder or child. Then probability
propagation is performed by the software, following which we calculate, as the ratio of the updated
probabilities at node tf=pf?, the contribution to the likelihood ratio in favour of paternity based
on these observations at this marker. The global likelihood ratio is obtained by multiplication of
these contributions across all the markers measured.
4.5
Super-networks
We can even treat a “top-level” network, such as triplet, as a class, and create one instance of
it for each marker. Since Hugin does not currently allow modification of the states of a node
when reusing a network, we must first set up a single repertory of coded states in gene, and
specify appropriate correspondences with the actual alleles of the marker under consideration;
the allele frequencies are likewise edited appropriately for each marker. The resulting marker
networks can then be analysed separately, and their several likelihood ratios multiplied together.
Alternatively all the single-marker networks can be explicitly combined as instances within a single
super-network, with the node tf=pf? (now made into an input node) in each instance bound to
a new top-level hypothesis node tf=pf?. Then after entering the evidence on all individuals
at all markers, and propagating, we can obtain directly the global likelihood ratio from that
hypothesis node. Such super-networks are not ideally suited to the propagation algorithm used
by Hugin, since the links to the top-level hypothesis node can create very large cliques, and
thus severe computational inefficiencies. External combination of marker-specific calculations
is preferable whenever (as in the cases considered here) this is possible. However in some more
complex problems, e.g. those involving quantitative analysis of mixed samples (Cowell et al. 2004),
there are additional quantities common to all markers, and then such a super-network may be the
only way to proceed.
5
Mutation
It is easy to modify networks such as the above to account for possible mutation of genes in
transmission from parent to child. We distinguish between a child’s original gene cog, identical
with one of the parent’s own genes, and the actual gene cag available to the child, which may
differ from cog because of mutation.
Mutation network “mut” We must first construct a new class network mut to model the
relevant mutation process. This network should have og as an input node, and ag as an output
node.
Revised network “mendel” We also modify the class mendel of Figure 12 as shown in
Figure 14, renaming cg to cog (now made into an internal node) and binding this to input node
og of an instance cag of mutation network mut. The output node ag of cag is then duplicated
to supply the output node cg of mendel.
The overall effect is that the output of mendel now represents the result of mutation acting
on top of Mendelian segregation.
As a very simple example, the network mut shown in Figure 15 implements the proportional
mutation model: the actual gene ag is either identical to the original gene og, or else replaces that
by a new gene sampled randomly from the population distribution, obtained from the output of
an instance otherg of gene. The choice between these is made according to the outcome of a
biased coin toss bcoin.
Figure 14: Revised network mendel, incorporating mutation
Figure 15: Network mut for proportional mutation model
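A minimal sketch of the proportional mutation model follows. Here mu is our name for the probability that the coin toss bcoin selects replacement, and FREQ is a hypothetical frequency table; neither is taken from the networks or tables of this paper.

```python
from fractions import Fraction

# Hypothetical population allele frequencies (illustrative only).
FREQ = {14: Fraction(1, 4), 15: Fraction(3, 4)}


def proportional_mutation(og_dist, freq, mu):
    """Distribution of the actual gene ag: with probability 1 - mu it equals
    the original gene og, otherwise it is a fresh draw from `freq`."""
    ag = {}
    for allele, p in og_dist.items():
        ag[allele] = ag.get(allele, Fraction(0)) + (1 - mu) * p
    for allele, q in freq.items():
        ag[allele] = ag.get(allele, Fraction(0)) + mu * q
    return ag
```

With og known to be 14 and mu = 1/100, the actual gene is 14 with probability 99/100 + 1/400 = 397/400: the replacement gene may coincide with the original, so the probability of a visible change is smaller than mu.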
For some mutation models we might wish to allow the mutation process to vary, according as
it affects the paternal or the maternal line; in this case we need to incorporate a further Boolean
input node p or m? in mut to specify the parental line. We then duplicate this in mendel, and bind
these nodes together, as shown in Figure 16; and further modify child as in Figure 17, assigning
probabilities 1 and 0 appropriately at nodes pline and mline (each bound to input node p or m?
in the relevant instance fmeiosis or mmeiosis of mendel) to specify the relevant parental line.
Figure 16: Revised network mendel, incorporating mutation varying with parental line
For more complicated mutation models there may be further internal structure, and/or adjustable parameters, in mut. As an example, Figure 18 represents a “mixed mutation model”
(Dawid et al. 2001; Vicard and Dawid 2004). This chooses, as ag, either the original gene og,
or a mutated gene, represented by an instance mutg of the class mutg of Figure 19. The choice
is controlled by a coin toss bcoin, with bias determined by parameters xi, related to the overall
mutation rate, and rho, which can be set to allow for differential mutation rates in the male and
female lines. The mutated gene mutg is itself obtained by selecting between the outputs of the
proportional mutation model propmutg, an instance of gene, and that of the “single-step mutation model” onemutg, an instance of onestep (not shown here). A parameter h determines the
selection probability. For further details of this model see Dawid (2003).3
Figure 17: Revised network child, incorporating mutation varying with parental line
Figure 18: Network mut for mixed mutation model
Figure 19: Network mutg for mixed mutation model
If we were only concerned with fixed values of the parameters, we could omit the parameter
nodes and simply insert appropriate values into the conditional probability tables of the coin toss
or other nodes that they affect. In that case we could proceed exactly as described above for
the proportional mutation model. However, exploration of sensitivity to varying parameter values
would then require direct editing of these conditional probability tables. To avoid this we have inserted explicit parameter nodes h, xi and rho, each having a discrete collection of numerical values
we wish to experiment with, and specify the coin-toss probabilities etc. as algebraic expressions in
these parameters. Since typically several instances of a network class containing such a parameter
node will occur in the overall network, we need to ensure that any value set for the parameter is
transferred to all those instances. The “traverse instance” feature of Hugin 6.4 enables this to be
done easily.
Once an appropriate network mut has been built, and mendel (and possibly also child) modified as described above, pedigree networks constructed as in § 2 will now automatically incorporate
the additional possibility of mutation. No other changes are required.
3 Our network mut corresponds to the network ag of Dawid (2003), while our parameter xi is twice the parameter
lambda used there.
5.1
Non-stationarity
A stationary mutation model is one for which the allele frequency distribution of a gene after
mutation is identical with its distribution before mutation. The proportional mutation model described above is stationary, but in general the mixed mutation model is not. With non-stationary
mutation, allele frequencies will change slightly from one generation to the next, and the very
concept of a “population allele frequency distribution” dissolves into meaninglessness. A consequence of this is that we will get slightly different answers according as, say, our pedigree network
does or does not include parents for node pf. For example, if we were to use the pedigree of
Figure 3 to analyse the simple paternity problem of Figure 2, by inserting findings at m, pf and c,
we would get a slightly different answer simply in view of the fact that a (now unobserved) brother
is represented in the network. Various workarounds could be used to avoid this, but we have not
felt it worthwhile following this route, on the grounds that there is no logically compelling reason
to prefer raw over once-mutated, twice-mutated, . . . , frequencies, and the numerical differences
will in any case be small (vanishing completely for a stationary mutation process).
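The stationarity of the proportional mutation model can be verified in one line. In the following, p_j denotes the population frequency of allele j and \mu the probability that the gene is replaced; both symbols are our notation, introduced only for this check.

```latex
\begin{align*}
\Pr(ag = j \mid og = i) &= (1-\mu)\,\delta_{ij} + \mu\, p_j, \\
\Pr(ag = j) &= \sum_i p_i \bigl[(1-\mu)\,\delta_{ij} + \mu\, p_j\bigr]
             = (1-\mu)\, p_j + \mu\, p_j = p_j .
\end{align*}
```

The distribution after mutation therefore coincides with the distribution before mutation, whatever the value of \mu. No such identity is available in general once a single-step component is mixed in, which is consistent with the remark above that the mixed model is in general not stationary.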
6
Silent alleles
6.1
Background and assumptions
A null or drop-out allele is one that is not recorded by the equipment used. When this can
happen, what appears to be a homozygous genotype at some marker may not be so: an alternative
explanation is that we are seeing just one band of a heterozygous genotype, the other band being
null. This phenomenon will clearly affect the evidential interpretation of certain patterns of DNA
profiles. Several papers in the literature have dealt with genetic aspects of dropout and how to
allow for it in the analysis: Gill et al. (2000) develop formulae for the likelihood ratio, while
dna·view, a programme developed by C. Brenner, contains modules to perform the calculations.
This phenomenon can occur for a number of reasons. One possibility is “run-off”, where the
measuring apparatus used is simply unable to record certain allele values. Another is a mutation
in the primer binding site, near to the target marker, leading to failure of the amplification process.
In either of these cases a null allele will be inherited exactly like any other allele, distinct markers
still being unlinked. We term such an inherited null allele silent. We construct networks to model
and analyse this situation in § 6.2 below.
Clayton et al. (2004) found that about 3 × 10⁻⁴ apparent mutations detected in paternity
triplets were due to primer binding site mutations. They also suggest that such a mutation is
likely to be preferentially associated with some specific allele or alleles of the target marker. For
simplicity and demonstration purposes we have not taken account of this association, supposing
instead that every allele has the same probability of becoming silent. Thus the models developed
and the numerical values assumed here should be considered as purely illustrative: they are not
recommendations for use in forensic laboratory casework.
Another possible explanation for a null allele is sporadic failure of the apparatus to record the
correct allele value. In this case the property is not inherited; we refer to such a null allele as
missed. We describe how to handle this situation in § 7.
6.2
Networks for inherited silent alleles
We can construct Hugin networks to handle problems with inherited silent alleles by making
minor modifications to the basic building blocks: specifically, to gene and genotype. We now
make explicit use of the dummy value 99 to represent silence. Wherever any node in any network
represents a gene, its state-space must be augmented with this value (in fact, to avoid further
editing we already included this in our previous networks, giving it probability 0 in network
gene).
Revised network “gene” The simple one-node network gene is now renamed gene0, and an
instance gene0 of it is included in the new gene network shown in Figure 20. This has output
node gene, equal to the output of gene0 unless the binary node silent takes the value 1, in which
case gene is set to the silent value 99. The silence indicator silent is generated from Binomial(1,
pr(silent)), depending on parameter node pr(silent): we have made this a discrete numerical
node, so that we can vary its value (we consider values 0.000015, 0.00003, 0.0001, 0.0005, 0.001,
0.005 and 0.01). The overall effect is that, with probability pr(silent), any original allele value is
transformed into a silent allele. The probability of a silent allele is thus pr(silent), while initial
“real” allele frequencies are multiplied by 1 − pr(silent). A silent allele is inherited just like any
other allele.
Figure 20: Network gene for founder gene, incorporating silent allele
Revised network “genotype” The network of Figure 10 for class genotype also needs to
be modified, as shown in Figure 21, to account for the fact that silent alleles cannot be seen
in observed genotypes. Nodes pg, mg and gtmin are defined as before. Previous node gtmax is
renamed gtmax0, while new output node gtmax is equal to gtmax0 unless this has value 99, in
which case it is set equal to gtmin, so mimicking a homozygous genotype. If both alleles are silent
so will be both gtmin and gtmax, and nothing will be seen — an event which, though rare, has
been known to occur (Clayton et al. 2004, Figure 1).
Figure 21: Network genotype, incorporating silent allele
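The logic of the two revised building blocks can be mimicked outside Hugin in a few lines. The following Python sketch is only an illustration: the function names, the dictionary representation and the toy frequencies are assumptions of this example, and only the dummy value 99 and the min/max construction are taken from the description above.

```python
SILENT = 99  # dummy state used in the text to represent a silent allele


def founder_gene_distribution(raw_freqs, pr_silent):
    """Founder allele distribution once the 'silent' switch is applied.

    Each real allele keeps its frequency scaled by (1 - pr_silent);
    the silent state 99 receives probability pr_silent.
    """
    dist = {allele: (1.0 - pr_silent) * f for allele, f in raw_freqs.items()}
    dist[SILENT] = pr_silent
    return dist


def observed_genotype(pg, mg):
    """Return (gtmin, gtmax) as recorded, given paternal and maternal genes."""
    gtmin, gtmax0 = min(pg, mg), max(pg, mg)
    # A silent larger allele is not seen, so the genotype mimics a homozygote.
    gtmax = gtmin if gtmax0 == SILENT else gtmax0
    return gtmin, gtmax


# Hypothetical three-allele marker (not the VWA frequencies of Table 1).
TOY_FREQS = {12: 0.2, 14: 0.5, 16: 0.3}

print(founder_gene_distribution(TOY_FREQS, 0.001))
print(observed_genotype(14, SILENT))      # (14, 14): looks homozygous
print(observed_genotype(SILENT, SILENT))  # (99, 99): nothing is seen
```

Because 99 exceeds every real allele value, taking the minimum and maximum is enough to place a silent allele in gtmax0, which is why only that node needs special treatment.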
Again, once we have made the above replacements of lower level networks, we can simply reuse
top-level pedigree networks such as in § 2 — now automatically incorporating the possibility of
silent alleles into these problems.
7
Missed alleles
Modelling of sporadically missing alleles is just as straightforward. These only affect the way in
which a genotype is observed. We now use 99 to represent an unobserved “missed” value.
Observed allele network “geneobs” This new network, displayed in Figure 22, is very similar
to that for gene in Figure 20. Node pr(missed) is a discrete numerical parameter node allowing
us to set various values for the probability that an allele is missed (supposed independent of its
value). The binary missingness indicator missed has a Binomial(1, pr(missed)) distribution.
Input node gene0 represents an actual allele value, while output node gene, the possibly missed
gene, replaces this by 99 if missed takes value 1.
Figure 22: Network geneobs for observed gene, incorporating missed allele
Revised network “genotype” We also revise the network genotype of Figure 10, as in
Figure 23. New nodes pgobs and mgobs are instances of geneobs, thus transforming pg and
mg according to the missingness process. Nodes gtmin, gtmax0 and gtmax are obtained from the
resulting, possibly missing, alleles exactly as described in § 6.2.
Figure 23: Network genotype for observed genotype, incorporating missed allele
Yet again, existing pedigree networks can be reused, so as now to allow for missing alleles.
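To make the missingness process concrete, the sketch below enumerates what can be recorded for a given true genotype. It is an illustration under stated assumptions rather than the Hugin implementation: the function name and the dictionary output are hypothetical, and the two genes are assumed to be missed independently with the same probability.

```python
from itertools import product

MISSED = 99  # dummy state used in the text for an unobserved allele


def recorded_genotype_distribution(pg, mg, pr_missed):
    """Distribution of the recorded (gtmin, gtmax) given the true genes.

    Each gene is independently replaced by 99 with probability pr_missed
    (the geneobs step); the min/max construction of Section 6.2 follows.
    """
    dist = {}
    for miss_p, miss_m in product((False, True), repeat=2):
        prob = (pr_missed if miss_p else 1.0 - pr_missed) * (
            pr_missed if miss_m else 1.0 - pr_missed
        )
        pgobs = MISSED if miss_p else pg
        mgobs = MISSED if miss_m else mg
        gtmin, gtmax0 = min(pgobs, mgobs), max(pgobs, mgobs)
        gtmax = gtmin if gtmax0 == MISSED else gtmax0
        key = (gtmin, gtmax)
        dist[key] = dist.get(key, 0.0) + prob
    return dist


# A true heterozygote {12, 14} with a 1% chance of missing each allele.
print(recorded_genotype_distribution(12, 14, 0.01))
```

Unlike the silent case, nothing here feeds back into inheritance: the true genes pg and mg are passed to descendants unchanged, and only the recorded genotype is perturbed.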
8
Combination
We can readily combine any or all the complicating features so far introduced, thus allowing for the
possible simultaneous existence of inherited silent alleles, sporadic missed alleles, and mutation;
all within a wide variety of top-level pedigree networks incorporating further complications such
as missing individuals. We simply include all the appropriate new and revised networks needed
for the various extensions (when combining both silence and missingness — treated as operating
independently — we use the network genotype constructed for missingness). Further modifications can generally be introduced quite easily: for example, when combining mutation and silence
we have chosen to modify mendel, adding an extra arrow from cog to cg, to ensure that mutation
out of or into a silent allele is not allowed.
In all circumstances the identical pedigree networks can be used. We have created a number
of directories containing the appropriate lower-level networks for each combination of the above
features. Using instances of founder, child, query, a pedigree network to describe a new problem can be constructed in any one of these, and simply dropped into any other, for immediate
incorporation of the relevant additional features.
9
Examples
We now illustrate the effects of accounting for either the separate or the combined effects of silent
alleles, missed alleles, and mutation. All examples refer to marker VWA, with population gene
frequencies as given in Table 1.
We use the simple paternity pedigree network of Figure 2, extended, as described in § 8, to
allow for all the additional complications simultaneously. A mixed mutation model is assumed,
with parameter values set to h = 0.9, rho = 0.5 and xi = 0.005081 (corresponding to a combined
mutation rate of τ = 0.004982). When no mutation is allowed we set xi = 0.
After propagating the evidence, node tf=pf? contains the posterior probabilities of paternity
and non-paternity. We set the prior probability of paternity to 0.5, so that we can interpret the
ratio of the resulting (purely nominal) posterior probabilities as the likelihood ratio in favour of
paternity — which we henceforth term the paternity ratio.
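The step from posterior probabilities to a likelihood ratio is Bayes' theorem in odds form. Writing E for the evidence entered in the network (a symbol introduced here only for illustration):

```latex
\[
\frac{\Pr(P \mid E)}{\Pr(\bar{P} \mid E)}
  = \frac{\Pr(E \mid P)}{\Pr(E \mid \bar{P})}
    \times \frac{\Pr(P)}{\Pr(\bar{P})}
  = \frac{\Pr(E \mid P)}{\Pr(E \mid \bar{P})}
  \qquad \text{when } \Pr(P) = \Pr(\bar{P}) = 0.5 .
\]
```

With equal prior probabilities the prior odds are 1, so the posterior odds read off the query node equal the likelihood ratio.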
In our examples both the child’s and the putative father’s genotypes are apparently homozygous. It is easy to see that (in the absence of mutation) if either the child or the putative father
were heterozygous it would make no difference to introduce the possibility of a silent or a missed
allele.
Since a silent allele is inherited while a missed allele only affects the recorded genotype, allowing
for silence will typically have a much greater effect than allowing for missingness.
Example 9.1 The data are:
m : {12, 20}
pf : {18, 18}
c : {12, 12}.
Note that the child’s observed allele 12 is extremely rare, having frequency p12 = 0.03%; the
mother’s other allele 20 is somewhat less rare, with p20 = 1.4%; while the putative father’s
observed allele 18 is common, with p18 = 22%.
Table 2 shows the combined effects of silence and missingness with no mutation. Comparing
the column pr(missed) = 0 with the row pr(silent) = 0, we see that the effect of silence alone is
roughly 5 times that of missingness alone. On passing from pr(silent) = 0 to pr(silent) = 0.001
— the value estimated by the American Association of Bloodbanks — the paternity ratio goes
from 0 to 3.53: instead of the evidence ruling the putative father out, when we introduce a small
possibility of silence it actually favours paternity. Indeed, whenever pr(silent) ≥ 0.0001 all
entries in the table give a paternity ratio greater than 1, favouring paternity (allowing for
missingness in addition to silence slightly reduces the paternity ratio).
Intuitively this is because, as soon as the probability of silence is comparable with that of allele 12,
the child’s apparently homozygous genotype is well explained as really being truly heterozygous
{12, silent}. This in turn is readily explained under paternity if the putative father also has a
silent allele. A similar explanation based on a (non-inherited) missed allele is however much less
convincing.
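The effect of silence alone in this example can be checked independently of any network software by brute-force enumeration. The sketch below assumes no mutation and no missed alleles; the function names, the lumping of all unobserved alleles into a single OTHER state, and the use of the rounded frequencies quoted above are assumptions of this illustration.

```python
from itertools import product

SILENT = 99  # dummy value for a silent allele, as in the text
OTHER = 0    # hypothetical lump for every allele not appearing in the data

# Frequencies as quoted (rounded) in the text for marker VWA.
P_VWA = {12: 0.0003, 18: 0.2162, 20: 0.014}


def allele_freqs(p, pr_silent):
    """Founder frequencies: real alleles scaled by (1 - pr_silent), plus 99."""
    q = {a: (1.0 - pr_silent) * f for a, f in p.items()}
    q[OTHER] = (1.0 - pr_silent) * (1.0 - sum(p.values()))
    q[SILENT] = pr_silent
    return q


def observe(a, b):
    """Recorded genotype (gtmin, gtmax); a silent allele is hidden."""
    lo, hi = min(a, b), max(a, b)
    return (lo, lo) if hi == SILENT else (lo, hi)


def paternity_ratio(m_obs, pf_obs, c_obs, p, pr_silent):
    """Pr(data | paternity) / Pr(data | non-paternity) by enumeration."""
    q = allele_freqs(p, pr_silent)
    alleles = list(q)
    like_p, like_not_p = 0.0, 0.0
    for m1, m2, f1, f2 in product(alleles, repeat=4):
        if observe(m1, m2) != m_obs or observe(f1, f2) != pf_obs:
            continue
        weight = q[m1] * q[m2] * q[f1] * q[f2]
        for cm in (m1, m2):  # maternal gene, each with probability 1/2
            for cf in (f1, f2):  # paternity: gene comes from pf
                if observe(cm, cf) == c_obs:
                    like_p += weight * 0.25
            for cf in alleles:  # non-paternity: gene from a random man
                if observe(cm, cf) == c_obs:
                    like_not_p += weight * 0.5 * q[cf]
    return like_p / like_not_p


for ps in (0.0, 0.0001, 0.001):
    print(ps, paternity_ratio((12, 20), (18, 18), (12, 12), P_VWA, ps))
```

Under these assumptions the enumeration reduces to pr(silent) / {(q18 + 2 pr(silent)) (q12 + pr(silent))}, where q denotes a frequency scaled by 1 − pr(silent). With pr(silent) = 0.001 this should reproduce, to rounding, the value 3.53 quoted above.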
Table 3 shows the combined effect of silence, missingness and mutation. In the absence of
silence or missingness, a 6-step mutation would be required to explain the data under paternity,
and this is highly improbable under our mixed mutation model. Comparing Table 3 with Table 2
one in fact observes a negligible additional effect of allowing for mutation.
□
Example 9.2 Now consider data:
m : {12, 20}
pf : {13, 13}
c : {12, 12}.
The mother’s and child’s genotypes are the same as in Example 9.1, while the putative father’s
observed allele is now the relatively rare allele 13, with p13 = 0.2%. The combined effects of silence
and missingness are displayed in Table 4.
The impact of introducing the possibility of silence is overwhelming: for example, when
pr(silent) = 0.01% the paternity ratio is 125. Compared with Example 9.1, the greater rarity of the putative father’s observed allele now makes the presence of a silent allele still more
plausible. However the sheer magnitude of this effect is perhaps unexpected.
The effect of missingness alone is, however, similar to that in Example 9.1. The additional
effect of allowing for missingness over that of silence is to decrease the paternity ratio, markedly
so for pr(missed) ≥ 0.001.
The effect of further incorporating mutation can be seen in Table 5. Mutation by itself
(pr(silent) = pr(missed) = 0) has quite an impact, giving a paternity ratio of 3.79; intuitively this is because paternity can now be well-explained by a 1-step mutation, and this is quite
probable under the mixed model. This effect of mutation can still be seen when missingness is
introduced, but essentially disappears as soon as silence is allowed.
□
Example 9.3 The data are:
m : {16, 16}
pf : {18, 18}
c : {18, 18}.
The undisputed mother is apparently incompatible with the child: she must therefore have a
missed allele, or have transmitted a silent or mutated allele to her child. Given that p18 = 22% is
much larger than any value considered for pr(silent) or pr(missed), we can be pretty sure, first
that both pfgt and cgt are truly homozygous, and then that the child inherited allele 18 from its
father. This has probability close to 1 under paternity, and to p18 = 0.2162 under non-paternity.
Correspondingly the paternity ratio is close to 1/0.2162 ≈ 4.6 for any combination of the above
explanations. This can be confirmed by calculations (not shown), using our networks.
□
10
Additional individuals
Suppose that, in a simple disputed paternity case, the genotype bgt of the putative father’s full
brother b has been observed, in addition to those of the basic triplet m, pf and c. The relevant
pedigree is as shown in Figure 24. Under simple Mendelian segregation this additional observation
is independent of paternity status given the triplet evidence, and so makes no difference to the
impact of that evidence. However, once we allow for a silent or missed allele the paternity ratio
can be affected by knowledge of the brother’s genotype, because it can help to distinguish whether
the putative father is a true homozygote, or is truly heterozygous but with a silent or missed allele.
Figure 24: Pedigree for paternity testing with additional individual
The likelihood ratio in favour of paternity P based on just the triplet data D := (mgt, pfgt, cgt)
is
\[ L_D := \frac{\Pr(D \mid P)}{\Pr(D \mid \bar{P})}. \tag{2} \]
The impact of the additional information carried by the brother’s data B := (bgt) is measured by
\[ L_B := \frac{\Pr(B \mid D, P)}{\Pr(B \mid D, \bar{P})}, \tag{3} \]
and the overall paternity ratio, taking account of both D and B, is
\[ \mathrm{LR} := L_D \times L_B. \tag{4} \]
We can calculate LB directly by algebraic methods: this is developed in Appendix A. Alternatively we can compute LD and LR by numerical propagation, and thus derive LB from (4).
Our computations were made using the pedigree network of Figure 24, together with appropriate lower-level networks to incorporate the effects of silence or missingness (we do not consider
mutation here).
Example 10.1 To illustrate the possible effect of the additional measurement B on the paternity
ratio, we consider an example where the triplet evidence D is as follows:
m : {12, 15}
pf : {14, 14}
c : {12, 12}.
The putative father and child are both apparently homozygous, in a way that would be inconsistent with paternity under Mendelian segregation. However pf could still be the true father if he
had a silent allele he might have passed to the child, or if one of his alleles was missed. Observation
of his brother’s genotype can help to shed light on these possibilities.
Silent alleles. Table 6 displays the paternity ratio, allowing for silent alleles. The second column
gives the paternity ratio LD based on the triplet data only. The later columns show the additional
factor LB for various possible observations on the brother’s genotype bgt. The behaviour of this
term is determined by its relationship to the putative father’s observed genotype pfgt.
In columns 3 and 4 we consider bgt = {16, 20} and bgt = {12, 17}: b is heterozygous, and
does not share any allele (and in particular, not a silent allele) with pf. As is verified in Case 1 (a)
in Appendix A, the additional observation B makes no difference whatsoever in this case: LB = 1
for all values of pr(silent).
However, when b is heterozygous but shares an allele with pf, the paternity ratio is reduced
by this additional knowledge. Intuitively this is because it becomes more likely that pf is a true
homozygote, and hence excluded from paternity. This effect is seen in columns 5 and 6 of Table 6
for the cases bgt = {12, 14} and bgt = {14, 17}, so that b and pf share allele 14. The fact
that the additional paternity ratio factor is close to 0.5 is explained by the analysis of Case 1 (b)
in Appendix A, since in our example we have q14 ≈ p14 = 0.1009, considerably larger than the
various values considered for pr(silent). That analysis also explains why the results are the
same in both these columns.
Column 7 refers to the case bgt = pfgt ( = {14, 14}). Since b could now have a silent allele
the additional data do little to distinguish whether or not pf is a true homozygote. Indeed we see
that the extra factor LB is very close to 1, and so essentially uninformative. This is explained in
Case 2 (a) in Appendix A.
Finally we consider the case that b is apparently homozygous, but with bgt different from
pfgt. With such a configuration pf and b might still share a silent allele, and the additional
observation B therefore renders it more probable that pf is a false homozygote, who could have
passed a silent allele down to the child. As a consequence the paternity ratio is increased.
In column 8 the brother exhibits a relatively common allele, bgt = {16, 16}, where p16 ≈ 20%.
Even though this renders him likely to be a true homozygote, the effect on the paternity ratio of
the uncertainty introduced by this extra information is to introduce a factor of around 6 for small
ps, reducing somewhat as ps increases.
In column 9 we take a very rare allele, bgt = {12, 12}, where p12 = 0.03%. The increase in the
paternity ratio is now dramatic. The values here reflect the analysis of Case 2 (b) in Appendix A,
where it is shown that the additional effect is particularly strong when the allele of the brother
is rare, but the silent allele is rarer still. The limiting value of LB as ps → 0 here is 3334.33,
though to come close to this value ps needs to be less than 10⁻⁶. The overall paternity ratio
LR = LD × LB achieves a maximum value of 1027.3 at ps = 0.0000642.
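The figures quoted for column 9 can be explored numerically with the closed form for LB given under Case 2 (b) of Appendix A. The sketch below is illustrative only: the expression used for LD is not stated in the paper but is an assumption of this example, obtained by direct enumeration for an incompatible triplet with silent alleles and no mutation or missingness; the function names and the search grid are likewise hypothetical.

```python
P12, P14 = 0.0003, 0.1009  # values quoted in the text (p12 is rounded)


def triplet_ratio(p_a, p_z, ps):
    """L_D for mgt = {a, b}, pfgt = {z, z}, cgt = {a, a}, silent alleles only.

    Assumed closed form: ps / ((q_z + 2 ps)(q_a + ps)), q = (1 - ps) p.
    """
    q_a, q_z = (1.0 - ps) * p_a, (1.0 - ps) * p_z
    return ps / ((q_z + 2.0 * ps) * (q_a + ps))


def brother_factor_case_2b(p_x, p_z, ps):
    """L_B when the brother appears homozygous {x, x} with x != z."""
    q_x, q_z = (1.0 - ps) * p_x, (1.0 - ps) * p_z
    alpha = q_z / (q_z + 2.0 * ps)
    return 1.0 / (1.0 - alpha / (1.0 + q_x + 2.0 * ps))


def overall_ratio(ps):
    """LR = L_D x L_B for Example 10.1 with bgt = {12, 12}."""
    return triplet_ratio(P12, P14, ps) * brother_factor_case_2b(P12, P14, ps)


# Simple grid search over ps from 1e-7 to 1e-3 in steps of 1e-7.
grid = [k * 1e-7 for k in range(1, 10001)]
best_ps = max(grid, key=overall_ratio)
best_lr = overall_ratio(best_ps)

print("argmax ps:", best_ps, "max LR:", best_lr)
print("L_B for very small ps:", brother_factor_case_2b(P12, P14, 1e-12))
```

Under these assumptions the search should land close to the maximum of 1027.3 at ps = 0.0000642 reported above, and LB for very small ps should approach the limiting value 3334.33.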
Missing alleles.
Table 7 illustrates the effect of observing the brother when allowing for missing alleles. Now
the principal determinant of the additional effect of observing b is whether or not he shares an
allele with c.
Columns 3 (bgt = {16, 20}), 6 (bgt = {14, 17}), 7 (bgt = {14, 14}) and 8 (bgt = {16, 16})
involve cases where bgt and cgt have no common alleles. Since missing alleles occur independently
in different individuals, observation of the brother carries very little additional information on
paternity.
In columns 4 (bgt = {12, 17}), 5 (bgt = {12, 14}) and 9 (bgt = {12, 12}) the brother and the
child share allele 12. In this case, knowing that allele 12 is likely to be present in the paternal
line, because it has been observed in the putative father’s brother, makes it more probable that
pfgt, observed as {14, 14}, was in fact {12, 14}, but with allele 12 missed. This argument is
strengthened further when bgt = {12, 12}: whether this is a true homozygote or involves a missed
allele, it provides evidence for pfgt truly being {12, 14}. The strength of the effect is related to the
rarity of allele 12. It decreases slowly as pr(missed) increases.
□
Example 10.1 shows that when the possibility of silent or missed alleles is taken into account in a
paternity testing problem where the putative father appears incompatible with the child, additional
information on relatives of the putative father can have a dramatic effect on the paternity ratio.
An effect can also be seen in compatible cases.
Example 10.2 The triplet evidence D is now:
m : {12, 15}
pf : {13, 13}
c : {12, 13}.
Paternity ratios allowing for silent alleles are shown in Table 8. The values of LD in column
2 are much greater than 1 because the triplet is compatible, but they decrease as pr(silent)
increases since it is then more likely that pf carries a silent allele. When bgt is also observed,
its additional effect depends on its type. From column 6 of Table 8 we see that there is
no effect whatsoever when the brother is heterozygous with no allele in common with the child
(bgt = {21, 22}); otherwise there is some effect, which is most apparent in column 5, where bgt
is apparently homozygous but different from pfgt: it then becomes more plausible that pf is in
fact heterozygous with one silent allele.
The effect of allowing for missed alleles is shown in Table 9. In this case the most interesting
configurations are those where b shares at least one allele with pf. In particular, column 4 shows
that when the brother is heterozygous (bgt = {13, 16}), for larger values of pr(missed) the
paternity ratio decreases, since it is then more likely that pf is truly heterozygous but with a
missed allele. On the other hand if bgt = pfgt (= {13, 13}), the paternity ratio is increased by
the additional information.
□
11
Criminal Case
Here we analyse the criminal case represented by Figure 5. The identity of an unrecognisable
body is unknown, and it is questioned whether it might be that of a criminal cr whose family had
reported his disappearance. The DNA profiles of the criminal’s family members — his wife wife
and their two children c1 and c2 — were typed, and a DNA profile was also extracted from the
bodily remains.
Two different hypothetical cases are analysed below, to investigate the possible effects of allowing for silent and/or missed alleles (we do not illustrate the additional effects of mutation, which
were small). We again use marker VWA with allele frequencies as in Table 1.
Example 11.1 The observed genotypes are:
body : {16, 16}
wife : {13, 14}
c1 : {13, 13}
c2 : {14, 14}.
Both c1 and c2 are apparently incompatible with being the children of body. Table 10 shows
the likelihood ratio in favour of identity, body = cr, obtained by propagating the evidence in the
network of Figure 5, incorporating lower level networks for silent and missed alleles. The likelihood
ratio exceeds 1 for pr(silent) ≥ 0.0001. The effect of missingness alone is slight; when included
in addition to silence it slightly reduces the likelihood ratio.
□
Example 11.2 Here the DNA evidence is:
body : {16, 16}
wife : {13, 14}
c1 : {13, 13}
c2 : {14, 16}.
The difference from Example 11.1 is that c2 is now compatible with being the child of body.
Table 11 shows the results of propagating this evidence. When taking the possibility of silent
alleles into account the general effect is, as might have been expected, to increase the likelihood
ratio; however this is not so for small values of pr(silent) and pr(missed). The likelihood
ratio again exceeds 1 when pr(silent) ≥ 0.0001. Additional allowance for missingness increases
the likelihood ratio when pr(silent) ≤ 0.0001, while for pr(silent) ≥ 0.001 it slightly reduces
the likelihood ratio.
□
In both the above cases, an apparent exclusion can turn into strong positive evidence for
identity as soon as we allow only a small probability of a silent allele. Allowing a small probability
of a missed allele yields much weaker evidence in itself, but even here the overall effect of all the
evidence could be strongly in favour of identity when there is no exclusion on any other marker.
12
Conclusions
This paper has illustrated how object-oriented Bayesian networks can be fruitfully applied to solving complex problems of forensic DNA identification and paternity testing. The modularity and
flexibility of the approach allows ready application to numerous different cases and complicating
features. A significant application is to accommodate potential allelic drop-out.
When a silent or missing allele is suspected, the ambiguity in the genotype can sometimes be
resolved by retesting. In cases where this is impossible or proves ineffective, it has been common
simply to discard the data (Leopoldino and Pena 2002), but it is better to perform an appropriate
analysis that properly allows for the ambiguity. We have shown how this can be done using the
computational methodology of OOBNs, and have used this to illustrate the sometimes striking
impact of even very low levels of drop-out. In particular, as shown in § 10, in the presence of silent
alleles information on additional relatives can be very powerful in helping to resolve the ambiguity
and assess the strength of the evidence.
In this work we have used a very simple model in which the probability of allelic drop-out is
independent of the actual allele value. In fact small alleles may be less affected by degradation
and so less likely to drop out. Also, as suggested by Clayton et al. (2004), silence due to primer
binding site mutation is likely to be associated with the allele repeat number. It should be relatively
straightforward to incorporate such more realistic dependencies into our OOBNs.
There are numerous further artifacts, such as stutter, drop-in etc., that can occur in DNA
profiling and that we have not considered here. Again, most of these can be modelled by modifications
to our basic modular structures, along the lines already described. We hope to address some of
these issues in future work. Another important area where this approach could be applied is in
the analysis of low copy number (LCN) DNA, which is particularly sensitive both to drop-out and
to possible contamination. Whitaker et al. (2001) found that under low copy number conditions
approximately 10% per locus of all heterozygotes exhibit allelic drop-out.
Object-oriented Bayesian networks will also be useful for analysing other problems of interest
in forensic DNA identification. For example, Bayesian networks have been applied to the analysis
of mixed DNA traces, where several individuals may have contributed to the DNA trace (Mortera
et al. 2003; Cowell et al. 2004). In such cases allelic drop-out and other artifacts are known to occur
quite often. Incorporation of these additional complicating features in modular object-oriented
networks should be reasonably straightforward.
A
Appendix: The effect of observing the putative father’s brother
In this Appendix we develop algebraic formulae for the paternity ratio in certain cases where we
wish to allow for silent alleles, and we can also measure the brother B of the putative father PF.
In the absence of silent alleles, the brother’s genotype would contain no information relevant to
the paternity query. In their presence, however, it can carry useful additional information. This
happens when there is ambiguity as to PF’s full genotype, because his measured genotype appears
homozygous; and observing his brother can then provide information relevant to resolving this
ambiguity.
We confine attention to a single forensic marker; as always, assuming independence across
different markers we can obtain the overall paternity ratio by multiplication across markers.
Notation We denote by [x, y] an ordered genotype, where x is the paternally inherited allele,
and y is the maternal allele. We further denote by ⟨x, y⟩ the corresponding unordered genotype
(with possibly repeated values); and by {x, y} the measured genotype, identical with ⟨x, y⟩ when
x ≠ y, but with {x, x} ambiguously denoting the homozygous pair ⟨x, x⟩ or the pair ⟨x, s⟩ where
s is a silent allele. We denote the frequency of the silent allele s by $p_s$, and that of any other allele
x by $q_x$ (footnote 4).
We consider a putative family triplet with measurements D on the genotypes on all individuals:
mgt for mother M, pfgt for putative father PF, and cgt for child C. We have, in addition, measured
the genotype bgt of a full brother B of PF. Under non-paternity we assume that the true father
TF is unrelated to PF.
The impact of the additional information contained in B is carried by
\[ L_B := \frac{\Pr(B \mid D, P)}{\Pr(B \mid D, \bar{P})}, \tag{5} \]
the likelihood ratio in favour of paternity (P ) as against non-paternity (P̄ ), based on B, after
taking account of D. The overall likelihood ratio in favour of paternity is then LD × LB , where
LD is that based only on the data D on the family triplet. In particular, there is no additional
information in B just when LB = 1.
We also define, for a seemingly homozygous putative father with pfgt = {z, z},
\[ L_h := \frac{\Pr(B \mid \mathrm{pfgt} = \langle z, z \rangle)}{\Pr(B \mid \mathrm{pfgt} = \langle z, s \rangle)}. \tag{6} \]
This is the likelihood ratio in favour of PF’s being truly homozygous, as against heterozygous with
a silent allele, based on his brother’s data B. We denote $\lim_{p_s \to 0} L_h$ by $L_h^0$.
Inconsistent triplet
We shall here consider only triplets that are prima facie incompatible, but could be explained,
under paternity, by silent alleles. These would have measured genotype data D of the form: mgt
= {a, b}, pfgt = {z, z}, cgt = {a, a}, with z ≠ a (though we allow a = b). This is the pattern of
Example 10.1, with a = 12, b = 15, z = 14, and $q_z \approx p_{14} = 0.1009$. We denote the brother’s
measured genotype by {x, y}.
We shall require the following general results.
Footnote 4: Under the assumptions made in § 6, $q_x = (1 - p_s)\,p_x$ ($x \neq s$), where $p_x$ is the population frequency of allele x when silent alleles do not occur.
Lemma A.1 For prima facie incompatible triplet data D,
\[ L_B^{-1} = 1 - \alpha\,(1 - L_h), \tag{7} \]
where
\[ \alpha = \frac{q_z}{q_z + 2p_s}. \tag{8} \]
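Proof. Consider first Pr(B | D, P). Under paternity P, we can deduce from the family data D
that PF must have unordered genotype ⟨z, s⟩, including a silent allele. Given this, the profile B
of the brother is independent of those of M and C. Hence
\[ \Pr(B \mid D, P) = \Pr(B \mid \mathrm{pfgt} = \langle z, s \rangle). \tag{9} \]
Under non-paternity, P̄, the data on M and C are completely irrelevant to those on PF and
B, and we have
\[
\begin{aligned}
\Pr(B \mid D, \bar{P})
  &= \Pr(B \mid \mathrm{pfgt} = \langle z, s \rangle \text{ or } \langle z, z \rangle) \\
  &= (1 - \alpha)\,\Pr(B \mid \mathrm{pfgt} = \langle z, s \rangle)
     + \alpha\,\Pr(B \mid \mathrm{pfgt} = \langle z, z \rangle),
\end{aligned}
\tag{10}
\]
with
\[
\begin{aligned}
\alpha &:= \Pr(\mathrm{pfgt} = \langle z, z \rangle \mid \mathrm{pfgt} = \langle z, s \rangle \text{ or } \langle z, z \rangle) \\
       &= q_z / (q_z + 2p_s).
\end{aligned}
\]
The result now follows.
□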
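For completeness, the final step of the proof can be written out by dividing (10) by (9) and using the definition (6) of $L_h$:

```latex
\[
L_B^{-1}
  = \frac{\Pr(B \mid D, \bar{P})}{\Pr(B \mid D, P)}
  = (1 - \alpha)
    + \alpha\,
      \frac{\Pr(B \mid \mathrm{pfgt} = \langle z, z \rangle)}
           {\Pr(B \mid \mathrm{pfgt} = \langle z, s \rangle)}
  = 1 - \alpha + \alpha L_h
  = 1 - \alpha\,(1 - L_h).
\]
```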
Corollary A.2 LB = 1 (the brother is uninformative as to paternity) if and only if Lh = 1 (the
brother is uninformative as to silence).
Corollary A.3 $L_B \le 1/(1 - \alpha) = 1 + q_z/(2p_s)$.
Corollary A.4 $L_B \to L_B^0 := (L_h^0)^{-1}$ as $p_s \to 0$.
Lemma A.5 Consider two full brothers B1 and B2, with respective ordered genotypes [X, Y] and
[Z, W]. Then
\[ \Pr([X, Y] = [x, y] \mid [Z, W] = [z, w]) = \tfrac{1}{4}\,(\delta_{xz} + q_x)(\delta_{yw} + q_y), \]
where $\delta_{xz} := 1$ if x = z, 0 otherwise.
Proof. Let $I_P$ denote the event that B1 and B2 inherited the identical gene from their father.
Then $\Pr(I_P) = \tfrac{1}{2}$, independently of the paternal allele, Z, of brother B2. Clearly
$\Pr(X = x \mid Z = z, I_P) = \delta_{xz}$ and $\Pr(X = x \mid Z = z, \bar{I}_P) = q_x$, so that unconditionally
$\Pr(X = x \mid Z = z) = \tfrac{1}{2}(\delta_{xz} + q_x)$. Similarly $\Pr(Y = y \mid W = w) = \tfrac{1}{2}(\delta_{yw} + q_y)$.
The result follows since the maternal and paternal inheritance processes operate independently.
□
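Lemma A.5 can also be checked mechanically by enumerating the parents' genotypes. The sketch below uses exact rational arithmetic; the three-allele frequency table and the function names are assumptions made purely for this check, and the parents are taken to be unrelated founders with no mutation.

```python
from fractions import Fraction
from itertools import product

# Hypothetical allele frequencies, chosen only so that they sum to 1.
Q = {"a": Fraction(1, 10), "b": Fraction(3, 10), "c": Fraction(3, 5)}


def sibling_conditional(q, x, y, z, w):
    """Pr([X, Y] = [x, y] | [Z, W] = [z, w]) for full brothers, by enumeration.

    Each brother receives either gene of each parent with probability 1/2,
    independently of his sibling.
    """
    half = Fraction(1, 2)
    joint = Fraction(0)
    for f1, f2, m1, m2 in product(q, repeat=4):
        weight = q[f1] * q[f2] * q[m1] * q[m2]

        def child(a, b):
            # Probability that a child of these parents has ordered genotype [a, b].
            pat = half * (f1 == a) + half * (f2 == a)
            mat = half * (m1 == b) + half * (m2 == b)
            return pat * mat

        joint += weight * child(x, y) * child(z, w)
    return joint / (q[z] * q[w])  # the marginal of [Z, W] is q_z * q_w


def lemma_a5(q, x, y, z, w):
    """Closed form (1/4)(delta_xz + q_x)(delta_yw + q_y)."""
    return Fraction(1, 4) * ((x == z) + q[x]) * ((y == w) + q[y])


agree = all(
    sibling_conditional(Q, *combo) == lemma_a5(Q, *combo)
    for combo in product(Q, repeat=4)
)
print("Lemma A.5 agrees with enumeration:", agree)
```

Using Fraction avoids any rounding, so equality between the enumeration and the closed form can be tested exactly.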
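Corollary A.6
\[
\Pr(\langle X, Y \rangle = \langle x, y \rangle \mid \langle Z, W \rangle = \langle z, w \rangle) =
\begin{cases}
\tfrac{1}{4}\,\{(\delta_{xz} + q_x)(\delta_{yw} + q_y) + (\delta_{xw} + q_x)(\delta_{yz} + q_y)\} & (x \neq y) \\
\tfrac{1}{4}\,(\delta_{xz} + q_x)(\delta_{xw} + q_x) & (x = y).
\end{cases}
\]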
Paternity ratio
We consider various cases, according to the relationship between the brother’s measured genotype
{x, y} and those of the family triplet: mgt = {a, b}, pfgt = {z, z}, cgt = {a, a}.
Case 1: x ≠ y. This observation is equivalent to ⟨x, y⟩, with both x and y different from the
silent allele s. Applying Corollary A.6 with w = z for the numerator of (6) and w = s for its denominator, we obtain
\[ L_h = \frac{2(\delta_{xz} + q_x)(\delta_{yz} + q_y)}{2q_x q_y + \delta_{xz} q_y + \delta_{yz} q_x}. \tag{11} \]
(a). If x and y are both different from z (footnote 5), (11) reduces to $2q_x q_y / 2q_x q_y = 1$. So in this case
no further information is obtained from the brother’s genotype.
(b). Otherwise, suppose x = z, y ≠ z (footnote 6). We calculate
\[ L_h = \frac{2(1 + q_z)\,q_y}{2q_z q_y + q_y} = 1 + \frac{1}{1 + 2q_z}. \]
Thus from (7)
\[ L_B^{-1} = 1 + \frac{\alpha}{1 + 2q_z} = 1 + \frac{q_z}{(1 + 2q_z)(q_z + 2p_s)}. \]
In particular, $L_h \le 2$, and $L_B \ge 1/(1 + \alpha) = 1 - q_z/(2q_z + 2p_s)$, with approximate
equality so long as $q_z$ is small. This lower bound in turn exceeds 1/2, with approximate
equality when $p_s$ is much smaller than $q_z$. The limit as $p_s \to 0$ is
$L_B^0 = \tfrac{1}{2} \times (1 + 2p_z)/(1 + p_z)$.
Case 2: x = y. Then x ≠ s, and the observation is ambiguously either homozygous ⟨x, x⟩, or
heterozygous ⟨x, s⟩. Hence Pr(B | pfgt = ⟨z, z⟩) = Pr(bgt = ⟨x, x⟩ | pfgt = ⟨z, z⟩) +
Pr(bgt = ⟨x, s⟩ | pfgt = ⟨z, z⟩), and on applying Corollary A.6 we obtain
\[
\begin{aligned}
\Pr(B \mid \mathrm{pfgt} = \langle z, z \rangle)
  &= \tfrac{1}{4}\,(\delta_{xz} + q_x)^2 + \tfrac{1}{2}\,(\delta_{xz} + q_x)\,p_s \\
  &= \tfrac{1}{4}\,(\delta_{xz} + q_x)(\delta_{xz} + q_x + 2p_s).
\end{aligned}
\]
Similarly,
\[
\Pr(B \mid \mathrm{pfgt} = \langle z, s \rangle)
  = \tfrac{1}{4}\,\{(\delta_{xz} + q_x)(1 + q_x + p_s) + q_x p_s\}.
\]
(a). If x = z (footnote 7), we obtain
\[
\begin{aligned}
\Pr(B \mid \mathrm{pfgt} = \langle z, z \rangle) &= \tfrac{1}{4}\,(1 + q_z)(1 + q_z + 2p_s), \\
\Pr(B \mid \mathrm{pfgt} = \langle z, s \rangle) &= \tfrac{1}{4}\,\{(1 + q_z)(1 + q_z + 2p_s) - p_s\}.
\end{aligned}
\]
Since $p_s$ will be negligible in comparison with $(1 + q_z)(1 + q_z + 2p_s)$, which is at least
1, we see that in this case $L_h$ will be extremely close (though not exactly equal) to 1.
Correspondingly so will be $L_B$, and the additional information in the brother’s genotype
is virtually valueless.
Footnote 5: As for the case bgt = {16, 20} in Example 10.1.
Footnote 6: As for the case bgt = {12, 14} in Example 10.1.
Footnote 7: As for the case bgt = {14, 14} in Example 10.1.
(b). Finally, for x ≠ z (footnote 8) we similarly calculate
\[
L_h = 1 - \frac{1}{1 + q_x + 2p_s}, \qquad
L_B^{-1} = 1 - \frac{\alpha}{1 + q_x + 2p_s}.
\]
It follows from Corollary A.4 that $L_B \to 1 + p_x^{-1}$ as $p_s \to 0$. When $p_x$ is small, the
additional effect of observing the brother can thus be very substantial, even when the
probability of a silent allele is extremely tiny.
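The case analysis above can be cross-checked numerically by building Lh directly from Corollary A.6 and comparing it with the closed forms. The following sketch is an illustration only: the function names are hypothetical, the frequencies are the rounded values quoted in Example 10.1 (p16 being approximate), and pr(silent) = 0.001 is one of the values considered in the text.

```python
SILENT = "s"  # label for the silent allele
# Rounded frequencies quoted in the text; only the alleles used below are listed,
# so the table deliberately does not sum to 1.
P = {12: 0.0003, 14: 0.1009, 16: 0.20, 20: 0.014}


def freqs(ps):
    """q_x = (1 - p_s) p_x for real alleles, plus the silent frequency p_s."""
    q = {a: (1.0 - ps) * p for a, p in P.items()}
    q[SILENT] = ps
    return q


def delta(u, v):
    return 1.0 if u == v else 0.0


def cor_a6(q, x, y, z, w):
    """Corollary A.6: Pr(<X,Y> = <x,y> | <Z,W> = <z,w>) for full brothers."""
    if x != y:
        return 0.25 * ((delta(x, z) + q[x]) * (delta(y, w) + q[y])
                       + (delta(x, w) + q[x]) * (delta(y, z) + q[y]))
    return 0.25 * (delta(x, z) + q[x]) * (delta(x, w) + q[x])


def prob_measured(q, x, y, z, w):
    """Pr(bgt measured as {x, y} | PF has unordered genotype <z, w>)."""
    if x != y:
        return cor_a6(q, x, y, z, w)
    # An apparent homozygote {x, x} is either <x, x> or <x, s>.
    return cor_a6(q, x, x, z, w) + cor_a6(q, x, SILENT, z, w)


def l_h(q, x, y, z):
    """Equation (6): true homozygote <z,z> against <z,s>."""
    return prob_measured(q, x, y, z, z) / prob_measured(q, x, y, z, SILENT)


def l_b(q, x, y, z):
    """Equation (7) solved for L_B, with alpha from (8)."""
    alpha = q[z] / (q[z] + 2.0 * q[SILENT])
    return 1.0 / (1.0 - alpha * (1.0 - l_h(q, x, y, z)))


q = freqs(0.001)
for bgt in [(16, 20), (12, 14), (14, 14), (16, 16), (12, 12)]:
    print(bgt, "L_h =", l_h(q, *bgt, 14), "L_B =", l_b(q, *bgt, 14))
```

Under these assumptions Case 1 (a) gives Lh = 1, Case 1 (b) gives LB between 0.5 and 0.6, Case 2 (a) gives values very close to 1, and Case 2 (b) gives LB well above 1, in line with the qualitative behaviour described in Example 10.1.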
Acknowledgement
This research was supported by a Research Interchange Grant from the Leverhulme Trust. We
are indebted to Steffen Lauritzen for extremely helpful suggestions.
References
Bangsø, O. and Wuillemin, P. H. (2000). Object Oriented Bayesian Networks: A framework for
top-down specification of large Bayesian networks with repetitive structures. Technical report,
Hewlett-Packard Laboratory for Normative Systems, Aalborg University.
Buckleton, J. S., Triggs, C. M., and Walsh, S. J. (ed.) (2004). Forensic DNA Evidence Interpretation. CRC Press.
Clayton, T. M., Hill, S. M., Denton, L. A., Watson, S. K., and Urquhart, A. J. (2004). Primer
binding site mutations affecting the typing of STR loci contained within the AMPFlSTR®
SGM Plus™ kit. Forensic Science International, 139, 255–9.
Cowell, R. G., Dawid, A. P., Lauritzen, S. L., and Spiegelhalter, D. J. (1999). Probabilistic
Networks and Expert Systems. Springer, New York.
Cowell, R. G., Lauritzen, S. L., and Mortera, J. (2004). Identification and separation of DNA
mixtures using peak area information using a probabilistic expert system. Research Report 25,
Cass Business School, City University.
Dawid, A. P. (2003). An object-oriented Bayesian network for estimating mutation rates. In
Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, Jan
3–6 2003, Key West, Florida, (ed. C. M. Bishop and B. J. Frey). http://tinyurl.com/39bmh.
Dawid, A. P., Mortera, J., and Pascali, V. L. (2001). Non-fatherhood or mutation? A probabilistic
approach to parental exclusion in paternity testing. Forensic Science International, 124, 55–
61.
Dawid, A. P., Mortera, J., Pascali, V. L., and van Boxel, D. W. (2002). Probabilistic expert
systems for forensic inference from genetic markers. Scandinavian Journal of Statistics, 29,
577–95.
Essen-Möller, E. (1938). Die Beweiskraft der Ähnlichkeit im Vaterschaftsnachweis. Theoretische
Grundlagen. Mitteilungen der Anthropologischen Gesellschaft, 68, 9–53.
Evett, I. W. and Weir, B. S. (1998). Interpreting DNA Evidence. Sinauer, Sunderland, MA.
Gill, P., Whitaker, J., Flaxman, C., Brown, N., and Buckleton, J. (2000). An investigation of the
rigor of interpretation rules for STRs derived from less than 100 pg of DNA. Forensic Science
International, 112, 17–40.
Koller, D. and Pfeffer, A. (1997). Object-oriented Bayesian networks. In Proceedings of the 13th
Annual Conference on Uncertainty in Artificial Intelligence, (ed. D. Geiger and P. Shenoy),
pp. 302–13. Morgan Kaufmann Publishers, San Francisco.
8 As for the case bgt = {22, 22} in Example 10.1
Laskey, K. B. and Mahoney, S. M. (1997). Network fragments: Representing knowledge for constructing probabilistic models. In Proceedings of the 13th Annual Conference on Uncertainty
in Artificial Intelligence, (ed. D. Geiger and P. Shenoy), pp. 334–41. Morgan Kaufmann
Publishers, San Francisco.
Leopoldino, A. M. and Pena, S. D. J. (2002). The mutational spectrum of human autosomal
tetranucleotide microsatellites. Human Mutation, 21, 71–9.
Morling, N., Allen, R. W., Carracedo, A., Geada, H., Guidet, F., Hallenberg, C., Martin, W.,
Mayr, W. R., Olaisen, B., Pascali, V. L., and Schneider, P. M. (2002). Paternity Testing
Commission of the International Society of Forensic Genetics: Recommendations on genetic
investigations in paternity cases. Forensic Science International, 129, 148–57.
Mortera, J. (2003). Analysis of DNA mixtures using Bayesian networks. In Highly Structured
Stochastic Systems, (ed. P. J. Green, N. L. Hjort, and S. Richardson), chapter 1B, pp. 39–44.
Oxford University Press.
Mortera, J., Dawid, A. P., and Lauritzen, S. L. (2003). Probabilistic expert systems for DNA
mixture profiling. Theoretical Population Biology, 63, 191–205.
Vicard, P. and Dawid, A. P. (2004). A statistical treatment of biases affecting the estimation of
mutation rates. Mutation Research, 547, 19–33.
Whitaker, J. P., Cotton, E. A., and Gill, P. (2001). A comparison of the characteristics of profiles
produced with the AMPFlSTR® SGM Plus™ multiplex system for both standard and low
copy number (LCN) STR DNA analysis. Forensic Science International, 123, 215–23.
allele      12      13      14      15      16      17      18      19      20      21
frequency   0.0003  0.0018  0.1009  0.1004  0.1949  0.2834  0.2162  0.0866  0.0137  0.0015
Table 1: Population gene frequencies for marker VWA
              pr(missed)
pr(silent)    0         0.000015  0.0001    0.001
0             0         0.0477    0.2503    0.7701
0.000015      0.2202    0.2557    0.4083    0.8136
0.0001        1.1555    1.1497    1.1238    1.0422
0.001         3.5297    3.5004    3.3462    2.4128
Table 2: Example 9.1: mgt = {12, 20}, pfgt = {18, 18}, cgt = {12, 12}. Combined effect of
silent and missed alleles on likelihood ratio in favour of paternity.
              pr(missed)
pr(silent)    0         0.000015  0.0001    0.001
0             0.0003    0.0477    0.2497    0.7694
0.000015      0.2195    0.2549    0.4071    0.8128
0.0001        1.1517    1.1461    1.1208    1.0413
0.001         3.5260    3.4968    3.3430    2.4114
Table 3: Example 9.1: mgt = {12, 20}, pfgt = {18, 18}, cgt = {12, 12}. Combined effect of
silent and missed alleles together with mutation on likelihood ratio in favour of paternity.
              pr(missed)
pr(silent)    0         0.000015  0.0001    0.001
0             0         0.0554    0.2875    0.8299
0.000015      26.02     24.98     18.08     3.80
0.0001        125.02    120.79    91.14     18.61
0.001         202.57    199.98    178.77    75.45
Table 4: Example 9.2: mgt = {12, 20}, pfgt = {13, 13}, cgt = {12, 12}. Combined effect of
silent and missed alleles on likelihood ratio in favour of paternity.
              pr(missed)
pr(silent)    0         0.000015  0.0001    0.001
0             3.79      3.64      2.99      1.48
0.000015      29.49     27.78     20.61     4.43
0.0001        127.30    120.96    92.97     19.19
0.001         203.01    199.14    179.19    75.73
Table 5: Example 9.2: mgt = {12, 20}, pfgt = {13, 13}, cgt = {12, 12}. Combined effect of
silent and missed alleles together with mutation on likelihood ratio in favour of paternity.
Table 1 (continued): allele 22, frequency 0.0003
                       L_B with bgt =
pr(silent)   L_D       {16, 20}  {12, 17}  {12, 14}  {14, 17}  {14, 14}  {16, 16}  {12, 12}
0            0         1         1         0.546     0.546     1         6.13      3334
0.000015     0.472     1         1         0.546     0.546     1.0000    6.12      1595
0.0001       2.473     1         1         0.546     0.546     0.9999    6.07      403.7
0.001        7.485     1         1         0.551     0.551     0.9992    5.54      46.07
0.01         8.100     1         1         0.590     0.590     0.9932    3.19      5.45
Table 6: Example 10.1: mgt = {12, 15}, pfgt = {14, 14}, cgt = {12, 12}. Likelihood ratio in
favour of paternity allowing for silent alleles: L_D, without brother’s genotype. L_B, additional
effect of brother’s genotype.
                       L_B with bgt =
pr(missed)   L_D       {16, 20}  {12, 17}  {12, 14}  {14, 17}  {14, 14}  {16, 16}  {12, 12}
0            0         1         5.94      5.94      0.9987    0.9973    1         10.88
0.000015     0.048     1.0000    5.94      5.94      0.9987    0.9973    1.0000    10.05
0.0001       0.251     1.0000    5.92      5.93      0.9987    0.9973    1.0000    8.04
0.001        0.771     0.9999    5.76      5.84      0.9987    0.9974    0.9999    6.14
0.01         0.973     0.9996    4.60      5.14      0.9988    0.9978    0.9997    4.90
Table 7: Example 10.1: mgt = {12, 15}, pfgt = {14, 14}, cgt = {12, 12}. Likelihood ratio in
favour of paternity allowing for missed alleles: L_D, without brother’s genotype. L_B, additional
effect of brother’s genotype.
                       L_B with bgt =
pr(silent)   L_D       {13, 13}  {13, 16}  {22, 22}  {21, 22}
0            555.55    1         1         1         1
0.000015     551.01    1.0000    1.0041    0.5118    1
0.0001       527.83    1.0000    1.0249    0.5158    1
0.001        409.70    1.0002    1.1144    0.6102    1
0.01         303.54    1.0007    1.0632    0.8703    1
Table 8: Example 10.2: mgt = {12, 15}, pfgt = {13, 13}, cgt = {12, 13}. Likelihood ratio in
favour of paternity allowing for silent alleles: L_D, without brother’s genotype. L_B, additional
effect of brother’s genotype.
                       L_B with bgt =
pr(missed)   L_D       {13, 13}  {13, 16}  {22, 22}  {21, 22}
0            555.55    1         1         1         1
0.000015     551.01    1.0082    0.9918    0.9927    0.9920
0.0001       527.83    1.0524    0.9501    0.9685    0.9569
0.001        409.55    1.3537    0.7385    0.9296    0.8890
0.01         300.96    1.6720    0.5980    0.9758    0.9631
Table 9: Example 10.2: mgt = {12, 15}, pfgt = {13, 13}, cgt = {12, 13}. Likelihood ratio in
favour of paternity allowing for missed alleles: L_D, without brother’s genotype. L_B, additional
effect of brother’s genotype.
              pr(missed)
pr(silent)    0         0.000015  0.0001    0.001
0             0         0.0004    0.0002    0.0078
0.000015      0.3883    0.3823    0.3517    0.1956
0.0001        1.7563    1.7377    1.6394    1.0239
0.001         3.9567    3.9467    3.8901    3.3792
0.01          4.1576    4.1560    4.1467    4.0506
Table 10: Example 11.1: bogt = {16, 16}, wgt = {13, 14}, c1gt = {13, 13}, c2gt = {14, 14}.
Effect of silent and missed alleles on likelihood ratio in favour of identification.
              pr(missed)
pr(silent)    0         0.000015  0.0001    0.001
0             0         0.0845    0.5152    2.7005
0.000015      0.2175    0.2978    0.7071    2.7922
0.0001        1.3845    1.4429    1.7418    3.2977
0.001         9.3310    9.2850    9.0418    7.5166
0.01          20.6564   20.6138   20.3767   18.2226
Table 11: Example 11.2: bogt = {16, 16}, wgt = {13, 14}, c1gt = {13, 13}, c2gt = {14, 16}.
Effect of silent and missed alleles on likelihood ratio in favour of identification.