Overview of eHap 2.0

The eHap software is designed to analyze multilocus data as haplotypes, and to determine
whether there is an association between haplotypes and phenotypes. The principal
architects of the software are Howard Seltman (statistical) and Shawn Wood
(graphical/analytical interface). Kathryn Roeder and Bernie Devlin made additional
contributions. We originally conceived eHap as a package for population- and family-based association studies in which inference would be guided by the evolutionary
relationships among the haplotypes. We soon realized, however, that we should broaden
the scope of eHap to encompass other analyses that are not evolutionarily based, given
the analytic machinery we were building into the software. This version of eHap
embodies a broad (and ever broadening) set of tools for haplotype-based inference for
association (and linkage) studies using population- and family-based samples.
The assumptions of eHap suggest targeting haplotype blocks. We recommend using
eHap in conjunction with Entropy Blocker (EB) to divide a genomic region into logical
units for analysis. EB is freeware, available at B Devlin’s computational genetics web
site.
Evolutionary-based analyses: Based on a specified cladogram, eHap will perform
any and all tests described in the following two manuscripts:
Seltman H, Roeder K, Devlin B (2001) TDT meets MHA: Family-based
association analysis guided by the evolution of haplotypes. Am J Hum Genet
68:1250-1263.
Seltman H, Roeder K, Devlin B (2003) Evolutionary-based association analysis
using haplotype data. Genet Epidemiol 25:48-58.
It also contains some tools to help the user build cladograms, as described in the 2003
manuscript. We expect to enhance these tools in the future. The tools will be described
later in the document.
Omnibus tests: If the user wants a test to determine if haplotypes are associated with a
phenotype, but does not wish to specify evolutionary hypotheses, omnibus tests are
available from eHap. By omnibus we mean tests for association including all possible
haplotypes or a specified subset thereof. For example, for a test of association of a set of
H haplotypes and affection status in a case-control study, eHap will produce an H-1 degree
of freedom hypothesis test of the null hypothesis of no association.
What will eHap not do yet? Many things. One thing we expect to build into it in the
future is a means of handling X-linked loci. Estimating haplotypes for X-linked loci is
straightforward, but models for eHap must be adapted to this case. For ET-TDT analysis,
only 1 child per family can be analyzed. It is worth keeping in mind that eHap is a work
in progress, so look for updates on the website regularly. Features expected “To Be
Implemented” into eHap soon are indicated by TBI.
For now, send suggestions/complaints/bugs to devlinbj@msx.upmc.edu. In the future
users can post comments to our site.
Getting Started
To get a feel for the software we suggest that you quickly peruse this document and then
try out the examples.
There are several files you will need before you can analyze your data. File names are
given below in bold and italics. Further descriptions of some of the files can be found in
other sections of this document.
I. Two executables:
eHap.exe and HapLinMod.exe are required. Why two executables? Largely because the
programs compiled into these executables have been written separately, and perform
different tasks. eHap, written by Shawn Wood, is the “front-end”. Its purpose is to pick
up instructions for analysis, as described in a “control” file to be described under “II”,
ensure that data are in proper format, determine the set of all possible haplotypes for each
individual multilocus genotype, check for errors, visualize the relationships among
haplotypes (as a network or cladogram), estimate haplotype frequencies (independent of
phenotypes) and so on. An important function of eHap is to set up the data and provide
instructions on how to analyze the data to HapLinMod. eHap automatically passes off to
HapLinMod, so you should never have to be concerned about the interface between the
two executables. HapLinMod, written by Howard Seltman, is the analysis engine.
Unlike eHap, it fits a generalized linear model or ET-TDT model to the possible
haplotype configurations and relevant phenotypes as laid out by the eHap program. In
brief, it embodies all of the statistical methods described in Seltman et al. (2001, 2003),
and a bit more. It goes without saying that we would appreciate a citation to these
manuscripts should you use eHap (our name for the general programs) in a publication of
your own.
II. Control/data files. These files are described in terms of their usage in eHap.
The “control” file. eHap implements a large number of analyses by using directions in
the control file and the contents of several data files. There are three ways to supply a
control file.
1. By clicking on eHap.exe, you prompt the program to look for a file named
Control.txt in the same directory as the two executable files.
2. If you click on eHap.exe, and it does not find a file Control.txt, it will prompt you
to search for the file via a pop-up window (see examples).
3. You can also run eHap from a command line prompt by typing “eHap
control.txt”. This might be of use for people that want to run a large number of
analyses using a script.
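For scripted runs, one approach is to write a control file per analysis and build the corresponding command line. The Python sketch below is only an illustration: the helper names (control_text, write_control, ehap_command) are hypothetical, and it assumes that a control file is a plain list of key=value lines, one option per line, using option names listed in this document.

```python
from pathlib import Path
import tempfile


def control_text(options):
    """Render options as key=value lines (assumed control-file layout)."""
    return "".join(f"{key}={value}\n" for key, value in options.items())


def write_control(directory, name, options):
    """Write one control file into `directory` and return its path."""
    path = Path(directory) / name
    path.write_text(control_text(options))
    return path


def ehap_command(control_path, exe="eHap"):
    """Argument list for a command of the form 'eHap control.txt'."""
    return [exe, str(control_path)]


# Build two control files that differ only in the GLM family.
base = {"pedfile": "pedigree.txt", "lfile": "loci.txt", "autorun": 1}
with tempfile.TemporaryDirectory() as tmp:
    for family in ("normal", "binomial"):
        options = dict(base, family=family)
        path = write_control(tmp, f"control_{family}.txt", options)
        # On a machine with eHap installed, this list could be handed
        # to subprocess.run(); here it is only printed.
        print(ehap_command(path))
```

Keeping the command as an argument list rather than a single string avoids quoting problems when directory names contain spaces.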
Contents of the control file. The options listed below can be put into the control file (value
in brackets is the default value). Many options are explained in more detail later.
FILES
pedfile=[pedigree.txt]
pedigree file
lfile=[loci.txt]
locus file
cfile=[cov.txt]
covariate file
haplist=[none]
haplotype list
emhapfile=[emhaps.txt]
haplotypes and estimated frequencies (by max. likelihood)
cladfile=[cladogram]
cladogram output file
out=[rslt.dat]
file name for HapLinMod standard output
err=[none]
HapLinMod error file; if this file is not specified, error messages
go only to the screen
eerror=[error.txt]
file containing non-Mendelian configurations of haplotypes
eHap=[eHap.txt]
collapsed cladogram file passed from HapLinMod to eHap
datafile=[config]
data file passed from eHap to HapLinMod
suffix=[.txt]
extension for datafile
SETTINGS
Settings for input data
transmit=[0]*
for a transmission disequilibrium test use 1, otherwise use 0
generation=[1]*
to analyze the phenotypes of the parental generation use 0, otherwise
use 1. If transmit = 1, generation must = 1.
*Family structure is always used to determine possible haplotype configurations.
ocv=[affection]
outcome variable to use for this analysis
usecov=[none]
number of covariates to use for this analysis; next line is
the name of the covariate(s) to be included
Settings for eHap
mdist=[1]
minimum distance for connection in terms of mutational steps
configmax=[30]
maximum # of configurations allowed per family
zerocut=[0.001]
cutoff for eliminating haplotypes by frequency (haplotypes with
estimated frequency less than zerocut are treated as if they had
frequency 0).
title
string displayed at top of screen, default is pedigree file name
length=[80]
length, in pixels, of distance between points
everbose=[0]
verbose printing of emhaps file (0=no, 1=yes). Care should be
exercised in this option because the verbose file can be very large.
rerun=[0]
indicates rerun of a previous run (0=no, 1=yes)
autorun=[0]
starts analysis immediately after cladogram is found (0=no, 1=yes)
nodraw=[0]
skips drawing step (0=no, 1=yes)
fake=[0]
does not perform any analyses (0=no, 1=yes; useful if eHap
is run in autorun mode, but HapLinMod analyses are not
needed)
Settings for HapLinMod
family=[normal]
GLM family type (normal/poisson/binomial/tdt)
method=[score]
analysis method (score/LR). Specify type of hypothesis test.
omnibus=[0]
request performance of an omnibus test (0=no, 1=yes).
alpha=[0.05]
nominal significance level
bonferroni=[1]
bonferroni correction (0 = no, 1 = yes)
singleton=[0]
test final collapse of remaining singleton nodes (0=no,1=yes)
nperm=n
n>1 sets up permutation testing (should be roughly 100-1000)
ptype=[0]
Except for ETTDT, permutation is normally both between
families of the same size, and within children for any families
with two or more children. ptype=1 or 3 suppresses
permutation within families; ptype=2 or 3 suppresses
permutation between families.
verbose=[0]
verbosity of HapLinMod (0=show only final cladogram / 1=also show
tests and p-values / 2=additional detail including option and column
id settings / -1=quiet mode (p-values only))
nuisance=[0]
adds printing of beta and psi values 0/1
thisthetaeq=[]
haplotype code which will have its coefficient fixed at zero in the
final estimation. The default is the first code in an alphabetic sort.
wtmin=[0.00001]
weight below which configurations are dropped from HapLinMod
analyses.
maxiter=[50]
maximum EM iterations before failing in HapLinMod
relative=[0.001]
relative error that will signify convergence in HLM EM algorithm
absolute=[0.00001]
absolute error that will signify convergence in HLM EM algorithm
nmax=[500000]
maximum total function evaluations for likelihood maximization.
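The four ptype values listed above can be read as two independent switches. The following Python sketch decodes them under that reading; the function name permutation_scheme and the returned keys are hypothetical, and the code reflects only the description given in this list, not HapLinMod's internals.

```python
def permutation_scheme(ptype):
    """Decode ptype (0-3) into the permutations that remain active."""
    if ptype not in (0, 1, 2, 3):
        raise ValueError("ptype must be 0, 1, 2 or 3")
    return {
        # ptype=1 or 3 suppresses permutation within families
        "within_families": ptype not in (1, 3),
        # ptype=2 or 3 suppresses permutation between families
        "between_families": ptype not in (2, 3),
    }


for value in range(4):
    print(value, permutation_scheme(value))
```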
USER INPUT FILES
The pedigree file: This file specifies the family and genotype data, which is expected
to be in “pre-linkage” format. In the pre-linkage format, each line consists of the
following fields (strings), which are all expected to be integer-valued, separated by at
least one space, and in this order: family id, individual id, father id, mother id, gender,
affection status, allele 1 of genotype 1, allele 2 of genotype 1, allele 1 of genotype 2, etc.
If you are not familiar with this format, please look at the examples we bundle with the
software. The default name for the pedigree file is pedigree.txt. You can change it to
something meaningful in the control file by including a line pedfile=filename.
If only unrelated individuals were sampled for a study, then including information about
parentage would be meaningless and unnecessary work. For that reason, the file may
omit records for parents, although the mother and father fields are still required
(enter mother and father as 0, the usual specification for a founder of a pedigree).
In the pedigree file, a missing value is specified by a zero in that field.
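To make the field order concrete, here is a minimal Python sketch that splits one pedigree line into its parts. It is not eHap's own reader: the names PedigreeRecord, parse_pedigree_line and is_founder are hypothetical, the sample lines are invented, and the sketch assumes integer-valued fields throughout, as described above.

```python
from typing import List, NamedTuple, Tuple


class PedigreeRecord(NamedTuple):
    family_id: int
    individual_id: int
    father_id: int
    mother_id: int
    gender: int
    affection: int
    genotypes: List[Tuple[int, int]]  # one (allele 1, allele 2) pair per locus


def parse_pedigree_line(line):
    """Split a whitespace-separated pre-linkage line into named fields."""
    fields = [int(token) for token in line.split()]
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 leading fields followed by allele pairs")
    alleles = fields[6:]
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    return PedigreeRecord(*fields[:6], genotypes)


def is_founder(record):
    """Founders (and unrelated individuals) carry 0 for both parents."""
    return record.father_id == 0 and record.mother_id == 0


child = parse_pedigree_line("12 3 1 2 1 2 1 2 0 0 2 2")
print(child.genotypes)   # a (0, 0) pair marks a missing genotype
print(is_founder(child))
```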
The locus file: In a traditional pre-linkage format, this file contains much information
we would not use. Thus we simplified the file. It only specifies the number of marker
loci genotyped (as opposed to the usual pre-linkage file, which also counts a disease
locus), the number of alleles per locus and their names. For example, suppose there are
three markers typed, with alleles 1/2, 1/2 and 1/2. Then the file would be organized as
follows:
3
2
1 2
2
1 2
2
1 2
You might wonder why we need this information at all. For two reasons. First, we
wanted to build in Quality Control checks. So eHap verifies that the alleles that are
supposed to be there are there. Second, we allow the use of non-integer names for alleles
and we want to provide a mapping from our allele names, which will be strictly integer,
to your allele names. Important: If you choose to use non-integer names for alleles and
if there are more than two alleles per locus, then you should give careful thought to how
you enter the names of alleles for a locus. eHap will give the alleles integer values
starting with one for the first allele encountered at that locus (in the locus file) and
increase by one for each new allele encountered. In terms of mutational steps between
alleles, eHap then assumes integer differences between alleles are meaningful (e.g. for
STRs).
The default name for the locus file is loc.txt. You can change it to something meaningful
in the control file by including a line lfile=filename.
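The locus file layout and the allele renaming rule can be sketched in a few lines of Python. This is an illustration under our reading of the format (number of loci, then for each locus the number of alleles followed by the allele names); the function name parse_locus_file is hypothetical.

```python
def parse_locus_file(text):
    """Return one {allele name: integer code} mapping per locus.

    Codes start at 1 for the first allele named at a locus and increase
    by one for each further allele, in the order given in the file.
    """
    tokens = text.split()
    n_loci = int(tokens[0])
    position = 1
    loci = []
    for _ in range(n_loci):
        n_alleles = int(tokens[position])
        names = tokens[position + 1 : position + 1 + n_alleles]
        if len(names) != n_alleles:
            raise ValueError("locus file ended before all alleles were read")
        loci.append({name: code for code, name in enumerate(names, start=1)})
        position += 1 + n_alleles
    if position != len(tokens):
        raise ValueError("unexpected trailing entries in locus file")
    return loci


example = "3\n2\n1 2\n2\n1 2\n2\n1 2\n"
print(parse_locus_file(example))
```

Because codes follow the order of the names, listing STR alleles from shortest to longest keeps integer differences between codes meaningful as mutational steps.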
The covariate file: This file contains the “covariate” and outcome information. In
keeping with the usual structure of pedigree files, the sex and affection status of
individuals are specified in the pedigree file. All other covariates/outcomes are specified
in the covariate file. The default name for the covariate file is cov.txt. You can change it
to something meaningful in the control file by including a line cfile=filename.
The first line of the covariate file specifies the number of variables in the file followed by
their names. Subsequent lines are led by two columns giving family identification and
individual identification, followed by the values for the variables for the specified
individual. Missing data should be specified by the value NA. Thus a first few lines of
this file might look as follows:
3 y x1 x2
144 3 1.71 0 3
222 3 0.55 NA 0.7
269 3 1.3 0 1.8
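A reader for this layout might look like the Python sketch below. It is an assumption-laden illustration, not eHap's parser: the function name parse_covariate_file is hypothetical, identifiers are kept as strings, and NA is mapped to None.

```python
def parse_covariate_file(text):
    """Return {(family id, individual id): {variable name: value}}."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header = rows[0]
    n_vars, names = int(header[0]), header[1:]
    if len(names) != n_vars:
        raise ValueError("header count does not match the number of names")
    table = {}
    for fields in rows[1:]:
        if len(fields) != 2 + n_vars:
            raise ValueError(f"wrong number of fields: {fields!r}")
        values = [None if tok == "NA" else float(tok) for tok in fields[2:]]
        table[(fields[0], fields[1])] = dict(zip(names, values))
    return table


sample = "3 y x1 x2\n144 3 1.71 0 3\n222 3 0.55 NA 0.7\n269 3 1.3 0 1.8\n"
for key, values in parse_covariate_file(sample).items():
    print(key, values)
```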
Of these four files, the control file, pedigree file, and locus file are essential for running
eHap. The covariate file is also essential if there are covariates other than gender or an
outcome other than diagnosis. Other files described below are either output files or files
used to control eHap after a first pass analysis of the data.
OUTPUT FILES and/or files for secondary analysis
The haplotype list file: This file specifies the set of possible haplotypes to be used in
the analysis. You can give it a name in the control file, such as
haplist=haplist.txt. It takes only the integer names of the haplotypes that have
been assigned by eHap, one integer name per line. How can you know those names?
The correspondence will be available in an output file named emhaps.txt, which will be
described shortly. You do not have to specify haplist.txt to analyze the data with
eHap.exe, and in fact it would be absent for the first pass at the data, when the list of
possible haplotypes in the sample and the frequencies of those haplotypes are being
estimated. Haplist.txt is used to limit the set of possible haplotypes, and there may be no
need to do so.
Haplist.txt is also the place to specify that certain haplotypes should be grouped. Let’s
imagine there are 12 observed haplotypes in the sample, and we wish to group (2,3) with
1, (5,6) with 4, (8,9) with 7 and (11,12) with 10. The 12 line file would look like:
1
2=1
3=1
4
5=4
6=4
7
8=7
9=7
10
11=10
12=10
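The grouping syntax can be checked with a short script before a run. The Python sketch below reads a haplotype list and reports which group each haplotype belongs to; the function name parse_haplist is hypothetical, and the reading of "a=b" as "group a with b" follows the example described above.

```python
def parse_haplist(text):
    """Map each listed haplotype code to the code of its group."""
    groups = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        code, separator, target = line.partition("=")
        if separator and not target.strip():
            raise ValueError(f"missing group after '=' in {line!r}")
        # A bare code forms (or heads) its own group.
        groups[int(code)] = int(target) if separator else int(code)
    return groups


haplist = "1\n2=1\n3=1\n4\n5=4\n6=4\n7\n8=7\n9=7\n10\n11=10\n12=10\n"
groups = parse_haplist(haplist)
print(sorted(set(groups.values())))  # the codes that head a group
```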
The cladogram file: This file specifies the pairwise relationships among haplotypes.
Its default name is cladogram.txt. You can change its default name by including a line in
the control file as follows: cladfile=username. You do not have to specify the
cladogram file to analyze the data with eHap.exe, and it can also be ignored if no
evolutionary-based test is desired. The former is expected, in fact, for a first pass at the
data. Many users will want to see what haplotypes are common in the population, which
haplotypes are rare, which are not found in the sample, and whether eHap finds a network
or a sensible cladogram from the data. From these preliminary analyses you can specify
which set of haplotypes should be considered in the analysis and what the likely
relationships are among the haplotypes. On the other hand, if you already have a
cladogram, you can specify your own cladogram file (see the section describing the eHap
window, which follows, for using your cladogram file). The structure of the file is
simple, but the default names for the haplotypes are less obvious, so bear with us for a
moment. Imagine there exists a set of 5 haplotypes with names 1, 2, 3, 4 and 5.
Haplotypes 2-5 are one-step descendants from 1. On each line of the cladogram file a
connection (edge) of the cladogram is specified. In this case, the file would look like
1 – 2
1 – 3
1 – 4
1 – 5
The order of the specified connections/edges is not important.
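As an illustration of the edge-list idea, the Python sketch below turns a list of edges into an adjacency map. It assumes that each line carries exactly two integer haplotype names, separated by spaces and/or a dash, and that an isolated node is written as an edge to itself (as in “278 – 278” in Example 2); the function name parse_cladogram is hypothetical.

```python
import re


def parse_cladogram(text):
    """Return {haplotype: set of neighbouring haplotypes}."""
    adjacency = {}
    for line in text.splitlines():
        codes = [int(token) for token in re.findall(r"\d+", line)]
        if not codes:
            continue
        if len(codes) != 2:
            raise ValueError(f"expected two haplotype names, got {line!r}")
        a, b = codes
        adjacency.setdefault(a, set())
        adjacency.setdefault(b, set())
        if a != b:  # a self-edge only registers an isolated node
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


star = parse_cladogram("1 - 2\n1 - 3\n1 - 4\n1 - 5\n")
print(star[1])  # the neighbours of haplotype 1
```

Because the result is a set-based map, the order of the lines in the file has no effect on it.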
The haplotype frequency file: This file specifies eHap’s integer name for each
haplotype, its corresponding verbose haplotype in integer allele names, and its maximum
likelihood estimated frequency obtained via the EM algorithm. The estimates take into
account family information but, as noted previously, completely ignore information from
the outcome phenotype and covariates. The default name for this file is
emhaps.txt. You can change it to something meaningful in the control file by including a
line emhapfile=filename. Please note that this file will print a concise version of
the set of haplotypes with non-zero estimates of frequencies (or whatever level of
tolerance is specified in the control file). If you wish to see the entire set of haplotypes
explored, then you can specify verbose output of this file by the option everbose=1 in
the control file.
The haplotype configurations file: For each person in the pedigree file, this file
contains the set of possible haplotypes. Left unrestricted, this set poses an obvious problem:
it can be rather large when many markers are examined and there
is little information to limit the set for certain individuals. By default, eHap tries to limit
the set of possible haplotypes. For example, by using the option zerocut=tol, the
user can eliminate any haplotype that has an estimated haplotype frequency less than tol
(e.g., 0.001, the default) from the analysis. Likewise, eHap uses that specification to
suppress printing of any haplotype of estimated frequency less than tol. You can
override this by using everbose=1.
The results file: This file contains the results from HapLinMod.exe. Its default name
is rslt.txt. You can change it to something meaningful in the control file by including a
line out=filename.
Error files:
eHap.exe error file: This file contains any errors encountered by eHap before data
are transmitted to HapLinMod for statistical analysis. Its default name is
error.txt. You can change it to something meaningful in the control file by including
a line eerror=filename. For example, any Mendelian errors encountered
will be noted here; please note that when a Mendelian error is encountered, eHap
removes the family from consideration and moves on.
HapLinMod.exe error file: This file contains errors encountered by HapLinMod
while analyzing data. Its default name is err.txt.
Ignorable files: A few files can be completely ignored by you because they are used to
pass data to or results from HapLinMod.exe. Nonetheless, the files could be of interest
because you might want to examine the data transmitted into the analysis module and the
results sent back to eHap.exe. For input into the analysis module, look for out.txt. You
can change the default by da=filename, although it is worth noting the default
extension remains in force. For results shipped back from the analysis module, look for
HLMeHap.txt. You can change the default by eHap=filename.
Further discussion of options.
The use of some options should be obvious (e.g., alpha=0.01). Other options seem
less obvious, and here we discuss how they work in more detail. For some, we also
discuss how they interact with other options.
zerocut & haplist: After maximum likelihood estimation of haplotype frequencies, eHap
will eliminate all haplotypes with frequency less than zerocut from subsequent
analyses. However, another way to exert control over what haplotypes to retain in
subsequent analyses is to produce a list of haplotypes to keep and record them in
haplist.txt. Any haplotype not named in haplist.txt will be removed from analysis even
before the estimation of haplotype frequencies. In this way one could keep certain
haplotypes and discard others even though they have the same estimated haplotype
frequency (e.g., the retained haplotypes might be essential for Mendelian inheritance
whereas the discarded ones were not).
mdist: The default for mdist is 1, which means eHap will connect all haplotypes
differing by “one step” mutations. The “distances” between haplotypes are given in
distmat.txt, a file produced by eHap. By forcing the user to produce a numeric scale for
alleles, we can easily calculate this distance matrix (and eHap doesn’t have to make
judgments about such things). Suppose, however, that a central haplotype in the
cladogram/network is not sampled, leaving two unconnected structures. Specifying
mdist=2 will join the structures, and eHap will remove unnecessary two-step
connections prior to analyses. See Example 2.
family: This option doesn’t have anything to do with relationships among individuals. It
refers to members of the exponential family of distributions, one of which the outcome
variable is assumed to follow. For example, we might be interested in a case-control
analysis, for which the outcome variable is case status; a binomial distribution is
appropriate in this instance and family=binomial would be the appropriate option.
The family=tdt merits more discussion. Obviously “TDT” is not a member of the
exponential family, but it is a facile way of forcing eHap to produce the ET-TDT
analyses described in Seltman et al. 2001. However, it is also worthwhile noting that
quantitative trait TDT using haplotypes is performed by using option family=normal.
See Example 1 for quantitative TDT, Example 4 for ET-TDT, and Example 3 for Poisson
analysis.
method: There are two basic types of test procedures: score tests and likelihood ratio
tests. Results are likely to be quite similar if each nuclear family has only one child. The
methods differ in how they handle correlated observations among siblings. If method = LR and
family = normal, then the correlation between siblings is directly modeled via a
correlation matrix. Alternatively if method = score, the correlation is adjusted for
indirectly (see Seltman et al. 2003). If the sample has multiple siblings and
family=binomial or poisson, then we recommend using method=score. LR is not
designed to model within family correlation for binomial or poisson models. For these
choices LR will treat siblings as independent, which could lead to false positives.
ocv & usecov: The first option, ocv, lets the user define the outcome variable for the
analysis. By default eHap assumes it is affection status from the pedigree file, but the
appropriate outcome variable might be a quantitative trait from the covariate file.
Likewise, for any particular analysis, the user might want to limit the covariates entering
as predictor variables, and this can be accomplished by usecov; for example, to use two
covariates named X_1 and X_2 (or whatever their names might be, as defined in the first
line of the covariates file) the following lines in the control file will suffice:
usecov=2
X_1 X_2
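The two-line form of usecov is the one place where a control file is not a simple list of key=value lines. The Python sketch below shows one way a script might read such a file; it is not eHap's parser, the function name parse_control is hypothetical, and it assumes one option per line with the covariate names on the line immediately after usecov.

```python
def parse_control(text):
    """Read key=value lines; usecov=N is followed by a line of N names."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    options = {}
    i = 0
    while i < len(lines):
        key, separator, value = lines[i].partition("=")
        if not separator:
            raise ValueError(f"expected key=value, got {lines[i]!r}")
        key, value = key.strip(), value.strip()
        i += 1
        if key == "usecov" and value != "none":
            count = int(value)
            names = lines[i].split() if count > 0 and i < len(lines) else []
            if len(names) != count:
                raise ValueError("usecov count does not match the names line")
            options[key] = names
            if count > 0:
                i += 1  # skip the names line just consumed
        else:
            options[key] = value
    return options


control = "ocv=y\nusecov=2\nX_1 X_2\nfamily=normal\n"
print(parse_control(control))
```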
everbose: By default, eHap will only print those haplotypes with estimated frequencies
greater than zerocut. To print all haplotypes, change the setting to
everbose=1.
omnibus=1: This option causes HapLinMod to produce a standard omnibus test for
association between the outcome variable and the set of haplotypes in the sample. The
likelihood of the phenotype is computed using generalized linear models. The test
accounts for missing phase information via the EM algorithm, based upon the missing
data principle. With R distinct haplotype effects under investigation, the test has R-1
degrees of freedom. If omnibus=0 the MHA test is performed.
nperm=n: Use this option to calculate simulation-based p-values. If omnibus=0,
permutation testing is on the MHA, otherwise it is based on the omnibus test.
singleton=1: With the selected choice of mdist it might happen that more than one tree
is formed; in this event we say there is a grove of trees. In this situation eHap makes all
of the comparisons dictated by MHA for each tree. By default eHap stops here. If
singleton=1, however, and if the edges in each tree collapse to a singleton, eHap
compares the singletons. If it finds no significant difference it indicates this with a
dotted line connecting the singleton nodes. Otherwise, a significant difference is
indicated graphically by no connection between singletons.
III. eHap window. When eHap is invoked it opens a window to display some of its
results. There are options available from this window. If the window is blank but it
appears eHap is churning away, it is establishing haplotype configurations and estimating
haplotype frequencies from the data (assuming rerun=0 and this is a first pass through the
data). Depending upon the size of the problem, eHap will be in this state for a short to
long time. If the problem is not too big, it should present a figure representing
relationships among haplotypes shortly. When the results are back, the window will
display the network of relationships among haplotypes. On the top left are two menus,
labeled “File” and “Draw”. Clicking on the Draw menu gives you four options: Network,
Cladogram, Labels and EMvalues. Clicking on “Cladogram” produces a cladogram from
a network using the Crandall/Templeton (1993) rules, and clicking on “Network” returns
to the network. By default the nodes are labeled with eHap’s names for the haplotypes;
they can be erased by clicking on “Labels” and re-evoked by clicking again on “Labels”.
The labels can be replaced by the estimated fraction of haplotypes of each variety by
clicking on “EMvalues”. A right-click on a haplotype reveals details about the
haplotype.
Under the “File” button are three options: “Open cladogram”, “Print” and “Exit”. The use of
the latter two options is obvious, but the first option deserves an explanation. Suppose the
user wanted to analyze the data with a different cladogram than the default cladogram
produced by eHap. Clicking on “File” and then “Open cladogram” allows the user to
select a user-specified cladogram for the analysis. The new, selected cladogram will be
drawn in the window.
When you want to analyze the data for association, simply click on the button
HapLinMod. Assuming no errors are detected, results will begin to appear in the
window, and the cladogram will likely be simplified. When HapLinMod is finished, a
final cladogram will be displayed in the eHap window. Results will also have been
written to the file you specified. This transition from eHap to HapLinMod happens
automatically if autorun=1.
The Exit button does the expected thing, exits eHap.
Examples of eHap analyses. We have organized data for the examples by
subdirectory, named EG1 for Example 1, EG2 for Example 2, and so forth. Likewise
results for the examples are directed to the appropriate subdirectory. Control files for
these examples are also in the subdirectory. When you click on eHap, it will pop up a
window to search for a control file because it did not find a control file in its home
directory (it looks there first). Use this window to find the control file you want. This
environment is fragile, for some reason; you will know you have successfully found a
control file and implemented its instructions if eHap pops up the message “Finding
Haplotype Network”.
You can learn much about eHap by looking at the structure of these files, running the
samples as is and examining the results, and then playing with the options to see how the
results change.
Example 1. Quantitative trait TDT and quantitative trait association
analysis.
This example is very similar to the one described in Seltman et al. (2003), and uses the
same network/cladogram. Nine bi-allelic loci are assayed, and 11 haplotypes are
estimated to occur at non-zero frequencies (actually, these haplotype frequencies were
estimated to be greater than the default for zerocut, which is 0.001) from the sample. The
data consist of trios of parents and one offspring, all of whom are genotyped (actually,
data for some fathers is missing at random) and a quantitative trait is measured on the
offspring. The set of realized haplotypes results in a network, which is then resolved to a
cladogram for analysis using the rules of Crandall and Templeton.
As the control file is set up, the data are analyzed as a quantitative trait TDT using the
cladogram to structure tests of significance. When HapLinMod is run from the eHap
window, the user will see results written to the eHap window, which are also written to
file rslts.txt. The output of these results is easy to read, once you know how, so let’s go
through one (underlined).
Check 265,266,267,269 => 257,385 | 1,17,33 | 129,193
What HapLinMod is doing at this step is checking to see if transmission of haplotypes
“265, 266, 267, & 269”, within families, has the same impact on the mean of the
quantitative trait, stochastically, as does transmission of haplotypes “257 & 385”, given
the impact of transmission of “nuisance” groups of haplotypes “1,17,33” and “129, 193”
on the quantitative trait. (Nuisance in the sense that these haplotypes do not directly enter
the test, but their effects must be estimated to fully specify the model).
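For readers who post-process the results file, the Python sketch below splits a “Check” line into the tested group, the comparison group and the nuisance groups. It assumes that every such line follows the pattern of the single example above; the function name parse_check_line and the dictionary keys are hypothetical.

```python
def parse_check_line(line):
    """Split 'Check A => B | N1 | N2 ...' into lists of haplotype codes."""
    prefix = "Check "
    if not line.startswith(prefix):
        raise ValueError("not a Check line")
    contrast, *nuisance = [part.strip() for part in line[len(prefix):].split("|")]
    left, arrow, right = contrast.partition("=>")
    if not arrow:
        raise ValueError("missing '=>' between the two haplotype groups")

    def to_codes(group):
        return [int(token) for token in group.split(",")]

    return {
        "tested": to_codes(left),
        "compared_with": to_codes(right),
        "nuisance": [to_codes(group) for group in nuisance],
    }


result = parse_check_line("Check 265,266,267,269 => 257,385 | 1,17,33 | 129,193")
print(result["tested"], result["compared_with"], result["nuisance"])
```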
The results of this test are
Score test p-value=1.523677e-006: reject
demonstrating a highly significant difference in the effect of transmission of the two
haplotype-groups of interest on the mean of the quantitative trait. In this example, eHap
pinpoints the cluster of haplotypes that bear alleles conferring liability.
If we were to change the control file from “transmit=1” to “transmit=0”, then we
force HapLinMod to ignore the transmission of haplotypes from parents to their
offspring, and instead estimate the relationship between haplotype and quantitative
phenotype as if the sample (of children) were a population sample of individuals (some of
whom could be siblings). In this case, similar results are obtained, although the p-value
is even more noteworthy:
Score test p-value=4.352074e-014: reject
If one wishes to change the cladogram inferred by eHap, it is necessary to first run eHap
to obtain the cladogram output file (cladogram.txt) and then edit this file prior to running
HapLinMod. For example, in the network file haplotype 385 is connected to haplotypes
257 and 129. The cladogram algorithm in eHap breaks the edge between 385 and 129.
To attach 385 to 129 instead of 257, simply replace the line “385 – 257” in cladogram.txt
with “385 – 129”, then run HapLinMod.
If one were not interested in using the evolutionary relationships among haplotypes, we
would change the control file by adding the line
omnibus=1
The results for the omnibus test for association yield a p-value that is infinitesimal (less
than 10**(-13)), but eHap notes only that it is less than 0.0005.
Chi^2=134.0837(10 df)
omnibus LR p-value<0.0005
You can get a closer approximation to the p-value using standard packages, such as SAS,
Splus, or R, although each is likely to report a zero value to you.
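If no statistical package is at hand, the tail probability can also be computed directly. The Python sketch below is a stand-alone illustration, not part of eHap: it assumes the reported statistic is referred to a chi-square distribution with the stated degrees of freedom, and it uses the closed form that holds when the degrees of freedom are even, P(X > x) = exp(-x/2) * sum over i < df/2 of (x/2)^i / i!. The function name chi2_sf_even_df is hypothetical.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variate with even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    if x < 0:
        raise ValueError("x must be non-negative")
    half = x / 2.0
    term = 1.0   # (x/2)^i / i! for i = 0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


p_value = chi2_sf_even_df(134.0837, 10)
print(f"{p_value:.3e}")
```

Working with the closed form avoids subtracting a cumulative probability from one, which is where very small tail probabilities are usually lost to rounding.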
Example 2. A grove of cladograms.
This example has nine loci, and takes more time to run (1 minute) on our computers. It
should be clear that if you put eHap on a truly tough problem, you might need to take
lunch or let it run overnight. As we noted in Seltman et al. (2003), we are working on
speeding it up. A feature that can speed up secondary runs of the analysis considerably is
rerun=1. With this option the program stores preliminary results such as the set of
haplotypes consistent with each family. This option is useful if in a subsequent run
options in HapLinMod are changed. Setting nodraw=1 is also a time saver.
This example shows you how eHap handles disconnected portions of a network or a
grove of cladograms. After eHap runs, you will see two cladograms and an isolated node
in the window. Haplotypes 118, 86, 78 & 70 form one cladogram and 472, 471, 470 & 466
the other cladogram, with 278 being the isolated node. You may want to manipulate the
figure to see this better: you can “grab” any node in the eHap window (by clicking and
dragging the cursor) and move it around. (Incidentally for this example to get an accurate
picture of the cladogram, you need to grab node 70 and move it.) Notice that in
cladogram.txt the singleton node 278 is indicated by the line “278 – 278”.
Below is reprinted the file distmat.txt that contains the mutational distances between the
set of observed haplotypes.
1 1 2 3 4 3 5 4
  2 3 4 5 4 6 5
    1 2 3 2 4 3
      3 4 3 5 4
        3 2 4 3
          1 3 2
            2 1
              1
By examining this file, you can see why the figure in the eHap window looks as it does,
and you will note that the isolated node is a two-step mutation away from both clusters of
four haplotypes. At this point the user should make a decision. You can either (1)
analyze the cladogram structure as is, so that contrasts will be made within each of the
cladograms, but not across them, or (2) change the control file from the default,
mdist=1, to mdist=2. The latter change will force eHap to connect all the haplotypes
into a network of 1- or 2-step mutations. The resulting network is fairly complicated, but
toggling between network and cladogram in the eHap window “erases” the unnecessary
connections, and eHap removes them before analysis in any case. For this
example, options (1) and (2) yield similar results, namely that there is a fundamental
difference in the mean of the continuous outcome as a function of transmission of
haplotype 118 versus all others. The choice of mdist=2 is often necessary to obtain a
single cladogram from a population of haplotypes.
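The effect of mdist on the grove can be reproduced from the distance file alone. The Python sketch below is an illustration, not eHap's algorithm: it joins any two haplotypes whose distance is at most mdist and lists the resulting connected groups. It assumes that the 36 distances in distmat.txt form the upper triangle of a nine-by-nine matrix and that the rows follow the haplotype names in ascending order (70, 78, 86, 118, 278, 466, 470, 471, 472); both points are our reading of the file, and the function name components is hypothetical.

```python
# Assumed row order of distmat.txt (ascending haplotype names).
LABELS = [70, 78, 86, 118, 278, 466, 470, 471, 472]

# Upper triangle: row i holds distances from LABELS[i] to later haplotypes.
UPPER = [
    [1, 1, 2, 3, 4, 3, 5, 4],
    [2, 3, 4, 5, 4, 6, 5],
    [1, 2, 3, 2, 4, 3],
    [3, 4, 3, 5, 4],
    [3, 2, 4, 3],
    [1, 3, 2],
    [2, 1],
    [1],
]


def components(labels, upper, mdist):
    """Group haplotypes linked by chains of steps no longer than mdist."""
    parent = {name: name for name in labels}

    def find(name):
        # Follow parent links to the representative, halving the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, row in enumerate(upper):
        for offset, distance in enumerate(row):
            if distance <= mdist:
                a, b = find(labels[i]), find(labels[i + 1 + offset])
                parent[a] = b

    groups = {}
    for name in labels:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


print(components(LABELS, UPPER, mdist=1))
print(components(LABELS, UPPER, mdist=2))
```

Under these assumptions, mdist=1 leaves the two four-haplotype groups and the isolated node 278 separate, while mdist=2 merges all nine haplotypes into one group.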
An interesting cautionary tale develops from this example. In this example there are
roughly 1000 families analyzed and zerocut=0.005. Look in the file eHaperr.txt.
You will find a substantial number of families that have been eliminated from the
analysis because they have “No configuration”. In other words, there are Mendelian
errors that could be due to simple incompatibility with Mendel’s first law or certain
haplotypes necessary for proper Mendelian transmission have been eliminated; i.e.,
1/4000 = 0.00025, so one might believe zerocut is too big. One could, in fact, set
zerocut=0.0001 and rerun the analyses. If you were to do that, you would see that
eight new, rare haplotypes are introduced into the sample and they have quite an impact
on the realized relationships among haplotypes: a network is formed and there is quite an
odd chain of haplotypes. Such a network would be hard to analyze, and in fact the
decision to set zerocut to such a low value would be a mistake. In the simulation we
have generated a low rate of genotype errors to demonstrate two points: errors do not
always show up as simple Mendelian single-locus errors; and that low levels of
genotyping errors create some nasty structure for a network/cladogram. This structure
could, in theory, have consequences for inference. (Note: the relationships among
haplotypes realized by the original configuration of the control file are the correct [generating]
relationships among the haplotypes.)
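The arithmetic behind the choice of zerocut can be made explicit. The Python sketch below is a rough guide only: it treats haplotype frequencies as simple proportions among founder haplotypes (two parents per family, two haplotypes each), whereas eHap's EM estimates need not be whole-number counts. The function name min_copies_to_survive is hypothetical; zerocut is passed as a string so that the decimal value is represented exactly.

```python
from fractions import Fraction
import math


def min_copies_to_survive(zerocut, n_families, founders_per_family=2):
    """Smallest count whose sample proportion is not below zerocut."""
    n_haplotypes = 2 * founders_per_family * n_families
    return math.ceil(Fraction(zerocut) * n_haplotypes)


n_families = 1000
print(Fraction(1, 4 * n_families))                  # proportion of a single copy
print(min_copies_to_survive("0.005", n_families))   # copies needed at zerocut=0.005
print(min_copies_to_survive("0.0001", n_families))  # copies needed at zerocut=0.0001
```

Seen this way, lowering zerocut to 0.0001 admits haplotypes seen only once, which is exactly where sporadic genotyping errors tend to appear.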
Example 3. Trait following Poisson distribution.
In this example, the outcome is generated from the Poisson distribution (small number of
discrete outcomes). Our emphasis here is to show how to handle different data types, in
this case by setting family=poisson in the control file. From this analysis, no errors
are issued by HapLinMod, although there are incompatible configurations again, which
result from random genotyping errors. A cladogram is realized from the observed set of
haplotypes and a significant effect of the transmission of haplotype 6, versus all others, is
observed on the values of the Poisson random variable.
Example 4. Analysis by ET-TDT.
This is another example of how to adjust for outcomes. In this case we wish to perform
an ET-TDT analysis. We do so by specifying transmit=1, method=sc &
family=tdt in the control file. Results and interpretation should now be
straightforward.