=============================
= Transmit (version 2.5.4) =
=============================
Author
======
David Clayton
(MRC Biostatistics Unit, Cambridge)
This version:
Description
===========
TRANSMIT tests for association between a genetic marker and disease by
examining the transmission of markers from parents to affected offspring.
The main features which differ from other similar programs are:
1. It can deal with transmission of multi-locus haplotypes, even if phase
   is unknown, and
2. Parental genotypes may be unknown.
The tests are based on a score vector which is averaged over all possible
configurations of parental haplotypes and transmissions consistent with
the observed data. Data from unaffected siblings (or siblings whose
disease status is unknown) may be used to narrow down the range of
possible parental genotypes which need to be considered. The program
produces the following asymptotic chi-squared tests:
1. For each haplotype or allele, a test on 1 df for excess transmission
   of that haplotype.
2. A global test for association on H-1 df, where H is the number of
   haplotypes for which transmission data are available.
The theory underlying the method is described in Clayton, Am. J. Hum.
Genetics, 65:1170-7, 1999.
The approximate chi-squared distribution of the test statistics will not
hold if rare haplotypes are included in the analysis. Two flags are
available to protect against this. The -agg flag causes aggregation and
renumbering of alleles before haplotype construction, while the -c flag
simply omits rare haplotypes from tests. The former approach inevitably
results in some information loss but, when parents are missing, it may
reduce the number of possible parental haplotypes that must be
considered, and considerably reduce the computational burden (both in
time and space).
It might reasonably be asked how common a haplotype must be for us to
legitimately use the chi-squared tests. A good guideline is to look at
the table of "observed" and "expected" transmissions. If we were to
observe N heterozygous parents carrying a specific haplotype then, under
the null hypothesis, we would expect the haplotype to be transmitted N/2
times. The variance of (O-E) will then be N/4. Thus, multiplying the
tabulated value for Var(O-E) in the TRANSMIT output by four gives us an
equivalent number of fully informative transmissions. A widely used
guideline for the applicability of chi-squared tests is that they should
only be used when all expected frequencies exceed five. This would
correspond to ten fully informative transmissions and to a value of 2.5
for Var(O-E). My instinct is that this is very much a minimum figure, and
I'd only really feel safe with a value of 5 or more for Var(O-E). But
there is a need for more simulation work to investigate this point.
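The arithmetic behind this guideline can be sketched in a few lines of
Python. The function names are hypothetical and are not part of TRANSMIT;
the only input is a Var(O-E) value read from the program's output table.

```python
def equivalent_transmissions(var_o_minus_e):
    """Equivalent number of fully informative transmissions.

    Under the null hypothesis, N heterozygous parents give
    Var(O-E) = N/4, so N = 4 * Var(O-E).
    """
    return 4.0 * var_o_minus_e


def expected_count(var_o_minus_e):
    """Expected number of transmissions, N/2, implied by Var(O-E)."""
    return equivalent_transmissions(var_o_minus_e) / 2.0


# The two thresholds discussed in the text: the conventional minimum
# (2.5) and the more cautious value (5).
for var in (2.5, 5.0):
    print(var, equivalent_transmissions(var), expected_count(var))
```

A Var(O-E) of 2.5 therefore corresponds to ten informative transmissions
and an expected count of five, the conventional minimum.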
In the most recent version of the program a bootstrap test procedure is
implemented, and this should be more accurate than the chi-squared
approximations.
Brief resume of theory
======================
The score vector for the "haplotype relative risk" parameters, which
specify allelic association, u, has elements
    u_i = Observed transmissions of haplotype i to affected offspring
          minus
          Expected transmissions under Mendelian inheritance.
When transmission is uncertain, u is averaged over all possible haplotype
assignments to parents and offspring, using weights proportional to the
probability of each assignment. Note that these weights depend on the
unknown haplotype frequencies. These are estimated from the data by
solving the estimating equations which set to zero the vector v, defined
by
    v_i = Observed minus expected frequency of haplotype i in parents
(under uncertain haplotype assignment this vector too must be averaged
over all possibilities in the same way as u). Solution of these equations
is carried out using an EM algorithm.
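To illustrate the idea of weighting each possible haplotype assignment by
its current probability, here is a minimal gene-counting EM sketch in
Python. It is NOT the algorithm implemented in TRANSMIT, which works on
nuclear families. It assumes unrelated individuals typed at two biallelic
loci, and all function names are hypothetical.

```python
from collections import Counter


def phase_configs(genotype):
    """Haplotype pairs compatible with an unphased two-locus genotype."""
    (a1, a2), (b1, b2) = genotype
    configs = {
        tuple(sorted([(a1, b1), (a2, b2)])),
        tuple(sorted([(a1, b2), (a2, b1)])),
    }
    return sorted(configs)


def em_haplotype_freqs(genotypes, n_iter=50):
    """Estimate haplotype frequencies by a simple EM (assumed setting:
    unrelated individuals, two biallelic loci, alleles coded 1 and 2)."""
    haps = [(a, b) for a in (1, 2) for b in (1, 2)]
    freq = {h: 0.25 for h in haps}
    for _ in range(n_iter):
        counts = Counter()
        for g in genotypes:
            configs = phase_configs(g)
            # E-step: weight each configuration by its probability under
            # the current frequencies (common factors cancel on scaling).
            weights = [freq[h1] * freq[h2] for h1, h2 in configs]
            total = sum(weights)
            for (h1, h2), w in zip(configs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: new frequencies are the expected haplotype proportions.
        n_haplotypes = 2 * len(genotypes)
        freq = {h: counts[h] / n_haplotypes for h in haps}
    return freq


# Made-up data: double heterozygotes are the only ambiguous genotypes.
data = ([((1, 1), (1, 1))] * 5 + [((2, 2), (2, 2))] * 5
        + [((1, 2), (1, 2))] * 2)
print(em_haplotype_freqs(data))
```

In this made-up sample the homozygotes pull the double heterozygotes
towards the 1-1/2-2 phase, so the other two haplotypes receive a
frequency close to zero.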
There is a "theoretical" variance-covariance matrix for (u, v) which can
be used to calculate a "profile" variance matrix, V, for u which takes
account of the fact that haplotype frequencies have been estimated by
setting v=0.
Alternatively, the variance-covariance matrix of (u, v) is estimated from
the empirical variance-covariance matrix of contributions from each
nuclear family and, again, the variance of u is adjusted to take account
of the restriction v=0. This is the "robust" option selected by the -ro
flag. Note that this option allows for multiple affected sibs within a
family, even in the presence of linkage.
Each allele is tested individually by calculating
    (u_i)^2 / V_ii
which is asymptotically chi-squared on 1 df. A global test is given by
the quadratic form
    u.V-inverse.u-transpose
which is asymptotically chi-squared on rank(V) degrees of freedom.
Sometimes (when there are one or more rare haplotypes) the estimated V is
not positive-semidefinite and the global test cannot be calculated. A
test based only on more common haplotypes can be carried out by using the
-c option.
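The two statistics can be computed as in the following Python sketch. It
is an illustration rather than TRANSMIT's own code: the function names
are hypothetical, the numbers are made up, and V is assumed to be
nonsingular (for example after dropping one redundant haplotype), so that
an ordinary linear solve can stand in for V-inverse.

```python
import math


def solve(a, b):
    """Solve a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [list(map(float, a[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("V is singular; global test not computable")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def one_df_tests(u, v):
    """(u_i)^2 / V_ii for each haplotype, with its 1-df p-value."""
    out = []
    for i, u_i in enumerate(u):
        stat = u_i ** 2 / v[i][i]
        # Upper tail of chi-squared on 1 df: P(X > x) = erfc(sqrt(x/2)).
        out.append((stat, math.erfc(math.sqrt(stat / 2.0))))
    return out


def global_test(u, v):
    """Quadratic form u.V-inverse.u-transpose."""
    return sum(u_i * x_i for u_i, x_i in zip(u, solve(v, u)))


u = [3.0, -1.0]                 # made-up score vector
V = [[4.0, 1.0], [1.0, 2.0]]    # made-up variance matrix
print(one_df_tests(u, V))
print(global_test(u, V))
```

Solving V.x = u and then forming u.x avoids computing V-inverse
explicitly, and the solver's error on a singular V corresponds to the
case in which the global test cannot be calculated.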
Bootstrap testing
=================
This is a new and experimental option, introduced in version 2.5.
The bootstrap test is carried out as follows:
1. Calculate the "maximum entropy" distribution which gives a probability
   weight to each family's contribution to the (u, v) vector in such a
   way that they have mean (0, 0).
2. Draw repeated bootstrap samples of (u, v)-contributions. For each
   sample, sum these to obtain (u*, v*).
3. Technically we should reestimate the haplotype frequencies since v* is
   no longer zero. We approximately simulate this by adjusting u* by
   H.v*, where H is the matrix of derivatives with H_ij = du_i/dv_j.
4. Calculate the test statistics based on u* and test whether they equal
   or exceed the observed values. The bootstrap p-value is the proportion
   of bootstrap samples that give an equal or larger value of the test
   statistic to that observed. The statistics calculated are as above,
   plus the maximum value of the 1-df test statistics.
Note that, when transmission is not uncertain, this procedure is expected
to yield the correct "exact" p-value (if sufficient bootstrap samples are
drawn).
Note also that the procedure should be robust to inclusion of multiple
affected offspring in each family, even in the presence of linkage.
Sometimes the maximum entropy distribution of score contributions cannot
be calculated. In these circumstances, the empirical distribution of the
contributions is used, its location being shifted so that its mean is
(0,0). This is second best, in that the p-value yielded in the simple
certain-transmission case is not the conventional "exact" p-value, and a
warning message is printed.
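The resampling idea can be sketched in Python under strong simplifying
assumptions. The sketch uses a single scalar score u per family, so there
is no v component and no H.v* adjustment, and it resamples from the
shifted empirical distribution (the "second best" fallback described
above) rather than the maximum entropy distribution. The function name
and the data are hypothetical; this is not TRANSMIT's implementation.

```python
import random


def bootstrap_p_value(contribs, variance, n_boot=2000, seed=1):
    """Bootstrap p-value for the 1-df statistic u^2 / V.

    contribs : per-family contributions to the score u (scalars)
    variance : the variance V used to scale the statistic
    """
    rng = random.Random(seed)
    n = len(contribs)
    observed = sum(contribs) ** 2 / variance

    # Shift the empirical distribution so that its mean is zero, which
    # imposes the null hypothesis on the resampling distribution.
    mean = sum(contribs) / n
    centred = [c - mean for c in contribs]

    hits = 0
    for _ in range(n_boot):
        # Resample n family contributions with replacement and sum them.
        u_star = sum(rng.choice(centred) for _ in range(n))
        if u_star ** 2 / variance >= observed:
            hits += 1
    # Proportion of samples with an equal or larger statistic.
    return hits / n_boot


# Made-up data: 40 heterozygous parents, 30 of whom transmit the
# haplotype (+0.5 each) and 10 of whom do not (-0.5 each); V = N/4.
contribs = [0.5] * 30 + [-0.5] * 10
print(bootstrap_p_value(contribs, variance=10.0))
```

Fixing the seed makes the result reproducible, in the same spirit as the
program's -rs flag.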
Data input
==========
The data input file should contain, for each person, the following
blank-delimited fields:
family      family or pedigree code (alphanumeric)
id          person's identifier within family (alphanumeric)
father      id of father (who must have the same family code)
mother      ditto for mother
sex         sex (2=Female, 1=Male)
affected    disease status (2=affected, 1=unaffected, 0=unknown)
marker 1    coded a/b, where a and b are the two alleles. Alleles
            must be coded as consecutive integers, with 0
            representing unknown. Thus 0/0 represents completely
            missing data but, for a biallelic marker, 2/0
            represents either 2/1 or 2/2. For markers on the X
            chromosome, males should have marker phenotypes coded
            a/0 or a/a, so that males and females have equal
            length records.
marker 2    ditto
...         etc.
Although these fields must appear in the specified order, persons need
not appear on the file in any particular order. Note that parents must
be included in the data file even if no data concerning them are
available; such entries are necessary to correctly identify sibships.
Persons who appear on the data file only as parents do not need to
have valid entries in the "mother" and "father" fields. A single
period (.) is recommended for the coding of these, but any identifier
which does not occur in the family will have the same effect. The
disease status of parents is not used by the program and may be coded
as 0.
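As an illustration of the record layout, the following Python sketch
parses person records and expands partially missing genotypes. The
function names and the three sample records are invented for this
example; the sketch does not reproduce TRANSMIT's own input routines.

```python
def parse_genotype(field):
    """Split a marker field 'a/b' into a pair of integer allele codes."""
    a, b = field.split("/")
    return int(a), int(b)


def parse_record(line, n_loci):
    """Parse one blank-delimited person record into a dictionary."""
    fields = line.split()
    if len(fields) != 6 + n_loci:
        raise ValueError(
            "expected %d fields, got %d" % (6 + n_loci, len(fields)))
    return {
        "family": fields[0],
        "id": fields[1],
        "father": fields[2],
        "mother": fields[3],
        "sex": int(fields[4]),        # 2=Female, 1=Male
        "affected": int(fields[5]),   # 2=affected, 1=unaffected, 0=unknown
        "markers": [parse_genotype(f) for f in fields[6:]],
    }


def possible_genotypes(genotype, n_alleles):
    """Expand unknown (0) alleles into every genotype they could mean."""
    a, b = genotype
    first = range(1, n_alleles + 1) if a == 0 else [a]
    second = range(1, n_alleles + 1) if b == 0 else [b]
    # Genotypes are unordered, so store each pair sorted and de-duplicated.
    return sorted({tuple(sorted((x, y))) for x in first for y in second})


# Invented example: two parents (coded "." for their own parents) and one
# affected son, typed at two marker loci.
lines = [
    "F1 1 . . 1 0 1/2 0/0",
    "F1 2 . . 2 0 2/0 1/1",
    "F1 3 1 2 1 2 1/2 1/2",
]
people = [parse_record(line, n_loci=2) for line in lines]
print(people[2])
print(possible_genotypes((2, 0), n_alleles=2))
```

For a biallelic marker the last call lists the genotypes 1/2 and 2/2,
matching the interpretation of 2/0 given in the field description.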
Data input is via the standard data input stream and may be fed into
TRANSMIT either via a filter program, or by using the < operator on
the command line, for example:
transmit <input.dat
It is envisaged that the input data will be extracted from a larger
database, and it should not prove too difficult to achieve the above
format. An alternative is to use Linkage PEDFILEs as input since these
files have the same basic structure even though they contain a few
extra fields. A filter program "ped2spl" is available which converts
Linkage PEDFILEs into a form suitable for input to splink or transmit.
Output
======
Output is to the standard output stream, but may be saved to a file
using the > command line operator:
transmit <input.dat >output.lst
The optional output of family transmission scores is controlled by the
-o flag (see below).
A further option is to write the U vector and V matrix to a file in a
format suitable for analysis in the Splus or R statistical programming
languages. This file can be read into either language with the statement
source("filename")
which creates several vectors and matrices (see below).
Flags
=====
A number of flags control program operation. In the description that
follows, the # character represents the optional value to be assigned
to the flag. The value represented by # must come immediately after the
flag, although intervening spaces are allowed. Logical flags are set by
simply including them on the command line. If a flag, e.g. -mf, is set
by default, it can be unset by writing either -nomf or -mf- .
-1       If more than one affected offspring in a nuclear family,
         use only one (selected at random)
-agg#    Aggregate alleles. All alleles with relative frequency
         not exceeding #% will be aggregated. Alleles will
         be renumbered. Note that -a0 will just renumber alleles,
         skipping any gaps.
-all     Consider all possible haplotypes. If this is not set, only
         haplotypes which are phase variations of observed genotypes
         will be considered (see note below).
-aoff    Use only families with affected offspring. Only these
         are informative about transmission, although other
         families carry information about haplotype frequencies.
-bs#     Carry out bootstrap significance testing using # bootstrap
         samples.
-c#      When computing tests, pool haplotypes with relative
         frequencies less than #%
-f#      Specify maximum number of (nuclear) families.
         (Default -f1000)
-h       Help (list command line options)
-l#      Specify number of marker loci. If missing, it is assumed
         that this number appears as the first item on the
         input file.
-mf      Allow multiple nuclear families from one pedigree
         (although the relationship between these families will
         be ignored). If not set, only the first nuclear family
         encountered in each pedigree is used. Default is for
         this flag to be set, but it may be unset by -nomf or
         -mf- .
-mhp#    Set the minimum haplotype probability to # %. Estimated
         probabilities less than this will be set to zero.
         (Default -mhp 0.01)
-n#      Specify maximum number of persons on data file.
         (Default -n5000)
-o<f>    Specifies that the transmission scores for each
         family will be written to file <f>. By default this
         option is turned off.
-O#      Controls amount of output from 0 (min) to 3 (max)
         (Default 2)
-pf<f>   Specifies that the data used in the analysis will be
         written to file <f> (in the format for a linkage "pedfile")
-ro      Use the robust estimate of the variance of the score vector
-rs#     Seed random number generator with an integer. If this is
         not set, the system clock will be used to generate a seed.
-s#      Only treat sex # (1=M, 2=F) as being affected
-S#      Write matrices in Splus format to file #
-x#      Specify maximum allowable ambiguity for parental
         haplotyping. If there are more than # possible
         parental haplotype assignments, the family is
         excluded. Note that, for speed, possible parental
         haplotypes are stored in dynamically allocated memory
         and the -x option will help if you run out of memory.
         (Default -x1000)
-X       Marker loci are on the X-chromosome; only transmission of
         maternal haplotype will be considered.
Matrix output
=============
When the -S flag is in force, the following matrices are written to
file in a form suitable for reading into Splus or R using the source()
function (sizes are for an H haplotype marker in F families):
score.vector          The vector u_beta (Hx1)
score.variance        The variance of u_beta (HxH)
full.information      The upper triangle of J (2Hx2H) (see known bugs)
full.score.variance   The empirical variance matrix V (2Hx2H)
score.contrib         The score contributions, (u_1, u_2, u_3, ...)
                      (2HxF)
The last two matrices are produced only when the -r option is in force.
Example:
=======
transmit <infile.dat -l2 -o scores.dat -S matrices.dat -c10
Changes implemented in Version 2.0
==================================
1. Version 1 had an error in the calculation of V when parental
genotypes were uncertain. This has been corrected. Thanks to Sandra
Cervino (Wellcome Trust Centre, Oxford) for discovering this error.
2. Robust variance estimate (-r flag) implemented.
3. X-chromosome transmission (-X flag) implemented.
4. Restriction of analysis to affected offspring of one sex (-s flag)
implemented.
5. Version 1 ignored the fact that haplotype frequencies are estimated.
Version 2.3
===========
1. Several small errors fixed.
2. -agg, -pf, and -1 flags implemented.
3. Command line processor modified to allow spaces between flags and
their values.
4. Initial estimate of haplotype frequencies has been improved. A side
effect of this is that alleles not occurring anywhere in the data now
have zero estimated probability rather than some very small value.
5. An error which affected the estimation of haplotype frequencies in
some circumstances (leading sometimes to a failure to increase the
likelihood) has been corrected.
6. Steps have been taken to avoid non-positive-semi-definite information
matrices (see below).
Version 2.4
===========
Error in computing variances when -r option in force corrected
Version 2.5
===========
1. Bootstrap testing procedure implemented
2. Error handling improved in case where variance matrix can't be
inverted
Known bugs and problems
=======================
The Information matrix can fail to be positive semidefinite in odd cases.
The problem only seems to arise when there are rare alleles (haplotypes)
and can usually be avoided by use of either the -agg flag or the
-c flag.
Compiling:
=========
Most of TRANSMIT is written in C++ and must be compiled using a suitable
C++ compiler. The files transmit.C (or transmit.cpp) and transfun.C (or
transfun.cpp) are C++ source files and cline.c, gamma.c, invert.c,
matrix.c, profile.c, stats.c, and bstrap.c are plain C source files. The
"header" files bstrap.h, cline.h, matrix.h, and transmit.h contain class
definitions, function prototypes etc. Finally, transmit.doc contains this
documentation as a plain text file.
In Unix, compilation would normally be by:
CC *.C *.c -lm -o transmit
A Makefile is supplied. This specifies the g++ (gnu C++) compiler and
must be edited if a different compiler is to be used. This Makefile has
been successfully tested with the "Cygwin" package, which creates a
Unix-like shell within Windows 95/98/NT and makes the gnu compilers and
utilities available. The main WWW page for the Cygwin project is
http://www.cygnus.com/cygwin
Note, however, that some editing of the Makefile may be necessary. For
example, "Cygwin" does not yet contain the 48-bit random number
generators.