Overview of eHap 2.0

The eHap software is designed to analyze multilocus data as haplotypes and to determine whether there is an association between haplotypes and phenotypes. The principal architects of the software are Howard Seltman (statistical) and Shawn Wood (graphical/analytical interface). Kathryn Roeder and Bernie Devlin made additional contributions.

We originally conceived eHap as a package for population- and family-based association studies in which inference would be guided by the evolutionary relationships among the haplotypes. We soon realized, however, that given the analytic machinery we were building into the software, we should broaden the scope of eHap to encompass other analyses that are not evolutionarily based. This version of eHap embodies a broad (and ever broadening) set of tools for haplotype-based inference for association (and linkage) studies using population- and family-based samples.

The assumptions of eHap suggest targeting haplotype blocks. We recommend using eHap in conjunction with Entropy Blocker (EB) to divide a genomic region into logical units for analysis. EB is freeware, available at B Devlin’s computational genetics web site.

Evolutionary-based analyses: Based on a specified cladogram, eHap will perform any and all tests described in the following two manuscripts:

Seltman H, Roeder K, Devlin B (2001) TDT meets MHA: Family-based association analysis guided by the evolution of haplotypes. Am J Hum Genet 68:1250-1263.

Seltman H, Roeder K, Devlin B (2003) Evolutionary-based association analysis using haplotype data. Genet Epidemiol 25:48-58.

It also contains some tools to help the user build cladograms, as described in the 2003 manuscript. We expect to enhance these tools in the future. The tools are described later in this document.

Omnibus tests: If the user wants a test to determine whether haplotypes are associated with a phenotype, but does not wish to specify evolutionary hypotheses, omnibus tests are available from eHap.
By omnibus we mean tests for association that include all possible haplotypes or a specified subset thereof. For example, for a test of association between a set of H haplotypes and affection status in a case-control study, eHap will produce an H-1 degree of freedom test of the null hypothesis of no association.

What will eHap not do yet? Many things. One thing we expect to build into it in the future is a means of handling X-linked loci. Estimating haplotypes for X-linked loci is straightforward, but the models in eHap must be adapted to this case. For ET-TDT analysis, only one child per family can be analyzed. It is worth keeping in mind that eHap is a work in progress, so check the website regularly for updates. Features expected “To Be Implemented” in eHap soon are indicated by TBI. For now, send suggestions/complaints/bugs to devlinbj@msx.upmc.edu. In the future users will be able to post comments to our site.

Getting Started

To get a feel for the software we suggest that you quickly peruse this document and then try out the examples. There are several files you will need before you can analyze your data. File names are given below in bold and italics. Further descriptions of some of the files can be found in other sections of this document.

I. Two executables: eHap.exe and HapLinMod.exe are required. Why two executables? Largely because the programs compiled into these executables were written separately and perform different tasks.

eHap, written by Shawn Wood, is the “front-end”. Its purpose is to pick up instructions for analysis from a “control” file (described under “II”), ensure that data are in the proper format, determine the set of all possible haplotypes for each individual multilocus genotype, check for errors, visualize the relationships among haplotypes (as a network or cladogram), estimate haplotype frequencies (independent of phenotypes), and so on.
An important function of eHap is to set up the data for HapLinMod and to give HapLinMod instructions on how to analyze them. eHap automatically passes off to HapLinMod, so you should never have to be concerned about the interface between the two executables.

HapLinMod, written by Howard Seltman, is the analysis engine. Unlike eHap, it performs estimation in a generalized linear model or ET-TDT model, using the possible haplotype configurations and relevant phenotypes laid out by the eHap program. In brief, it embodies all of the statistical methods described in Seltman et al. (2001, 2003), and a bit more. It goes without saying that we would appreciate a citation to these manuscripts should you use eHap (our name for the general programs) in a publication of your own.

II. Control/data files. These files are described in terms of their usage in eHap.

The “control” file. eHap implements a large number of analyses by using directions in the control file and the contents of several data files. There are three ways to invoke a control file.

1. By clicking on eHap.exe, you prompt the program to look for a file named Control.txt in the same directory as the two executable files.
2. If you click on eHap.exe and it does not find a file named Control.txt, it will prompt you to search for the file via a pop-up window (see examples).
3. You can also run eHap from a command line prompt by typing “eHap control.txt”. This might be of use for people who want to run a large number of analyses using a script.

Contents of the control file. The options listed below can be put into the control file (the value in brackets is the default value). Many options are explained in more detail later.

FILES
pedfile=[pedigree.txt] pedigree file
lfile=[loci.txt] locus file
cfile=[cov.txt] covariate file
haplist=[none] haplotype list
emhapfile=[emhaps.txt] haplotypes and estimated frequencies (by maximum likelihood)
cladfile=[cladogram] cladogram output file
out=[rslt.dat] file name for HapLinMod standard output
err=[none] HapLinMod error file; if this file is not specified, error messages go only to the screen
eerror=[error.txt] file containing non-Mendelian configurations of haplotypes
eHap=[eHap.txt] collapsed cladogram file passed from HapLinMod to eHap
datafile=[config] data file passed from eHap to HapLinMod
suffix=[.txt] extension for datafile

SETTINGS

Settings for input data
transmit=[0]* for a transmission disequilibrium test use 1, otherwise use 0
generation=[1]* to analyze the phenotypes of the parental generation use 0, otherwise use 1. If transmit=1, generation must be 1.
*Family structure is always used to determine possible haplotype configurations.
ocv=[affection] outcome variable to use for this analysis
usecov=[none] number of covariates to use for this analysis; the next line gives the name(s) of the covariate(s) to be included

Settings for eHap
mdist=[1] maximum distance, in mutational steps, at which haplotypes are connected
configmax=[30] maximum # of configurations allowed per family
zerocut=[0.001] cutoff for eliminating haplotypes by frequency (haplotypes with estimated frequency less than zerocut are treated as if they had frequency 0)
title string displayed at top of screen; the default is the pedigree file name
length=[80] distance, in pixels, between points
everbose=[0] verbose printing of emhaps file (0=no, 1=yes). Use this option with care, because the verbose file can be very large.
rerun=[0] indicates rerun of a previous run (0=no, 1=yes)
autorun=[0] starts analysis immediately after cladogram is found (0=no, 1=yes)
nodraw=[0] skips drawing step (0=no, 1=yes)
fake=[0] does not perform any analyses (0=no, 1=yes; useful if eHap is run in autorun mode, but HapLinMod analyses are not needed)

Settings for HapLinMod
family=[normal] GLM family type (normal/poisson/binomial/tdt)
method=[score] analysis method (score/LR); specifies the type of hypothesis test
omnibus=[0] request performance of an omnibus test (0=no, 1=yes)
alpha=[0.05] nominal significance level
bonferroni=[1] Bonferroni correction (0=no, 1=yes)
singleton=[0] test final collapse of remaining singleton nodes (0=no, 1=yes)
nperm=n n>1 sets up permutation testing (n should be roughly 100-1000)
ptype=[0] except for ET-TDT, permutation is normally both between families of the same size and within children for any families with two or more children; ptype=1 or 3 suppresses permutation within families, ptype=2 or 3 suppresses permutation between families
verbose=[0] verbosity of HapLinMod (0=show only final cladogram / 1=also show tests and p-values / 2=additional detail including option and column id settings / -1=quiet mode (p-values only))
nuisance=[0] adds printing of beta and psi values (0/1)
thisthetaeq=[] haplotype code which will have its coefficient fixed at zero in the final estimation. The default is the first code in an alphabetic sort.
wtmin=[0.00001] weight below which configurations are dropped from HapLinMod analyses
maxiter=[50] maximum EM iterations before failing in HapLinMod
relative=[0.001] relative error that will signify convergence in the HapLinMod EM algorithm
absolute=[0.00001] absolute error that will signify convergence in the HapLinMod EM algorithm
nmax=[500000] maximum total function evaluations for likelihood maximization

USER INPUT FILES

The pedigree file: This file specifies the family and genotype data, which are expected to be in “pre-linkage” format. In the pre-linkage format, each line consists of the following fields (strings), which are all expected to be integer-valued, separated by at least one space, and in this order: family id, individual id, father id, mother id, gender, affection status, allele 1 of genotype 1, allele 2 of genotype 1, allele 1 of genotype 2, etc. If you are not familiar with this format, please look at the examples we bundle with the software.
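As a rough illustration of the pre-linkage layout just described, the following Python sketch splits one pedigree line into its six leading fields and its allele pairs. The names PedigreeRecord and parse_pedigree_line are hypothetical and are not part of eHap, and the example line is made up; the sketch only assumes whitespace-separated, integer-valued fields in the order listed above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PedigreeRecord:
    """One line of a pre-linkage pedigree file (hypothetical container)."""
    family_id: int
    individual_id: int
    father_id: int
    mother_id: int
    gender: int
    affection: int
    genotypes: List[Tuple[int, int]]  # one (allele 1, allele 2) pair per locus


def parse_pedigree_line(line: str) -> PedigreeRecord:
    """Split one pre-linkage line into six leading fields plus allele pairs."""
    fields = [int(tok) for tok in line.split()]
    # Six leading fields, then an even number of alleles (two per locus).
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 leading fields followed by allele pairs")
    alleles = fields[6:]
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return PedigreeRecord(*fields[:6], genotypes)


# Made-up example line: individual 3 of family 144, typed at three loci.
record = parse_pedigree_line("144 3 1 2 1 2 1 2 2 2 1 1")
print(record.genotypes)
```

A check of this kind can be useful before handing a file to eHap, since a line with an odd number of allele fields is a common formatting slip.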
The default name for the pedigree file is pedigree.txt. You can change it to something meaningful in the control file by including a line pedfile=filename. If only unrelated individuals were sampled for a study, then including information about parentage would be meaningless and unnecessary work. For that reason, we also allow the file to contain no parents, although the mother and father fields are still required (enter mother and father as 0, the usual specification for a founder of a pedigree). In the pedigree file, a missing value is specified by a zero in that field.

The locus file: In a traditional pre-linkage format, this file contains much information we would not use, so we simplified the file. It specifies only the number of marker loci genotyped (as opposed to the usual pre-linkage file, which also counts a disease locus), the number of alleles per locus, and their names. For example, suppose there are three markers typed, with alleles 1/2, 1/2 and 1/2. Then the file would be organized as follows:

3
2 1 2
2 1 2
2 1 2

You might wonder why we need this information at all. There are two reasons. First, we wanted to build in Quality Control checks, so eHap verifies that the alleles that are supposed to be there are there. Second, we allow the use of non-integer names for alleles and we want to provide a mapping from our allele names, which will be strictly integer, to your allele names.

Important: If you choose to use non-integer names for alleles and if there are more than two alleles per locus, then you should give careful thought to how you enter the names of alleles for a locus. eHap will give the alleles integer values, starting with one for the first allele encountered at that locus (in the locus file) and increasing by one for each new allele encountered. In terms of mutational steps between alleles, eHap then assumes integer differences between alleles are meaningful (e.g. for STRs).

The default name for the locus file is loc.txt.
You can change it to something meaningful in the control file by including a line lfile=filename.

The covariate file: This file contains the “covariate” and outcome information. In keeping with the usual structure of pedigree files, the sex and affection status of individuals are specified in the pedigree file. All other covariates/outcomes are specified in the covariate file. The default name for the covariate file is cov.txt. You can change it to something meaningful in the control file by including a line cfile=filename. The first line of the covariate file specifies the number of variables in the file followed by their names. Subsequent lines are led by two columns giving family identification and individual identification, followed by the values of the variables for the specified individual. Missing data should be specified by the value NA. Thus the first few lines of this file might look as follows:

3 y x1 x2
144 3 1.71 0 3
222 3 0.55 NA 0.7
269 3 1.3 0 1.8

Of these four files, the control file, pedigree file, and locus file are essential for running eHap. The covariate file is also essential if there are covariates other than gender or an outcome other than diagnosis. Other files described below are either output files or files used to control eHap after a first pass analysis of the data.

OUTPUT FILES and/or files for secondary analysis

The haplotype list file: This file specifies the set of possible haplotypes to be used in the analysis. You can give it a name in the control file, such as haplist=haplist.txt. It takes only the integer names of the haplotypes that have been assigned by eHap, one integer name per line. How can you know those names? The correspondence will be available in an output file named emhaps.txt, which will be described shortly.
You do not have to specify haplist.txt to analyze the data with eHap.exe, and in fact it would be absent for the first pass at the data, when the list of possible haplotypes in the sample and the frequencies of those haplotypes are being estimated. Haplist.txt is used to limit the set of possible haplotypes, and there may be no need to do so. Haplist.txt is also the place to specify that certain haplotypes should be grouped. Let’s imagine there are 12 observed haplotypes in the sample, and we wish to group (2,3) with 1, (5,6) with 4, (8,9) with 7 and (11,12) with 10. The 12-line file would look like:

1
2=1
3=1
4
5=4
6=4
7
8=7
9=7
10
11=10
12=10

The cladogram file: This file specifies the pairwise relationships among haplotypes. Its default name is cladogram.txt. You can change its default name by including a line in the control file as follows: cladfile=filename. You do not have to specify the cladogram file to analyze the data with eHap.exe, and it can also be ignored if no evolutionary-based test is desired. Leaving it unspecified is expected, in fact, for a first pass at the data. Many users will want to see which haplotypes are common in the population, which are rare, which are not found in the sample, and whether eHap finds a network or a sensible cladogram from the data. From these preliminary analyses you can specify which set of haplotypes should be considered in the analysis and what the likely relationships are among the haplotypes. On the other hand, if you already have a cladogram, you can specify your own cladogram file (see the section describing the eHap window, which follows, for using your cladogram file).

The structure of the file is simple, but the default names for the haplotypes are less obvious, so bear with us for a moment. Imagine there exists a set of 5 haplotypes with names 1, 2, 3, 4 and 5. Haplotypes 2-5 are one-step descendants of 1. On each line of the cladogram file a connection (edge) of the cladogram is specified.
In this case, the file would look like

1 2
1 3
1 4
1 5

The order of the specified connections/edges is not important.

The haplotype frequency file: This file specifies eHap’s integer name for each haplotype, its corresponding verbose haplotype in integer allele names, and its maximum likelihood estimated frequency obtained via the EM algorithm. The estimates take into account family information but, as noted previously, completely ignore information from the outcome phenotype and covariates. The default name for the haplotype frequency file is emhaps.txt. You can change it to something meaningful in the control file by including a line emhapfile=filename. Please note that this file will contain a concise version of the set of haplotypes, namely those with non-zero estimates of frequencies (or whatever level of tolerance is specified in the control file). If you wish to see the entire set of haplotypes explored, then you can request verbose output of this file with the option everbose=1 in the control file.

The haplotype configurations file: For each person in the pedigree file, this file contains the set of possible haplotypes. Users can easily see a problem with such an open-ended specification, because the set can be rather large when many markers are examined and there is little information to limit the set for certain individuals. By default, eHap tries to limit the set of possible haplotypes. For example, by using the option zerocut=tol, the user can eliminate from the analysis any haplotype that has an estimated frequency less than tol (e.g., 0.001, the default). Likewise, eHap uses that specification to suppress printing of any haplotype of estimated frequency less than tol. You can override this by using everbose=1.

The results file: This file contains the results from HapLinMod.exe. Its default name is rslt.txt. You can change it to something meaningful in the control file by including a line out=filename.
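To make the grouping syntax of the haplotype list file (described earlier in this section) concrete, here is a small Python sketch that reads such a list and records which haplotype each entry is grouped with. The function name parse_haplist is hypothetical, and the handling of blank lines is an assumption of the sketch rather than documented eHap behavior.

```python
from typing import Dict, List


def parse_haplist(lines: List[str]) -> Dict[int, int]:
    """Map each listed haplotype to the haplotype it is grouped with.

    A plain line such as "4" keeps haplotype 4 as its own group; a line
    such as "5=4" groups haplotype 5 with haplotype 4.
    """
    groups: Dict[int, int] = {}
    for raw in lines:
        entry = raw.strip()
        if not entry:
            continue  # assumption of this sketch: blank lines are skipped
        if "=" in entry:
            name, target = entry.split("=", 1)
            groups[int(name)] = int(target)
        else:
            groups[int(entry)] = int(entry)
    return groups


# The 12-haplotype grouping from the haplotype list example:
# (2,3) with 1, (5,6) with 4, (8,9) with 7 and (11,12) with 10.
example = ["1", "2=1", "3=1", "4", "5=4", "6=4",
           "7", "8=7", "9=7", "10", "11=10", "12=10"]
groups = parse_haplist(example)
print(sorted(set(groups.values())))
```

Collecting the distinct group targets, as in the last line, gives a quick check that the file defines the number of groups you intended.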
Error files:

eHap.exe error file: This file contains any errors encountered by eHap before data are transmitted to HapLinMod for statistical analysis. Its default name is error.txt. You can change it to something meaningful in the control file by including a line eerror=filename. For example, any Mendelian errors encountered will be noted here; please note that when a Mendelian error is encountered, eHap removes the family from consideration and moves on.

HapLinMod.exe error file: This file contains errors encountered by HapLinMod while analyzing data. Its default name is err.txt.

Ignorable files: A few files can be completely ignored by you because they are used to pass data to or results from HapLinMod.exe. Nonetheless, the files could be of interest because you might want to examine the data transmitted into the analysis module and the results sent back to eHap.exe. For input into the analysis module, look for out.txt. You can change the default by datafile=filename, although it is worth noting that the default extension remains in force. For results shipped back from the analysis module, look for HLMeHap.txt. You can change the default by eHap=filename.

Further discussion of options. The use of some options should be obvious (e.g., alpha=0.01). Other options are less obvious, and here we discuss how they work in more detail. For some, we also discuss how they interact with other options.

zerocut & haplist: After maximum likelihood estimation of haplotype frequencies, eHap will eliminate all haplotypes with frequency less than zerocut from subsequent analyses. However, another way to exert control over which haplotypes to retain in subsequent analyses is to produce a list of haplotypes to keep and record them in haplist.txt. Any haplotype not named in haplist.txt will be removed from analysis even before the estimation of haplotype frequencies.
In this way one could keep certain haplotypes and discard others even though they have the same estimated haplotype frequency (e.g., the retained haplotypes might be essential for Mendelian inheritance whereas the discarded ones were not).

mdist: The default for mdist is 1, which means eHap will connect all haplotypes differing by “one step” mutations. The “distances” between haplotypes are given in distmat.txt, a file produced by eHap. By forcing the user to produce a numeric scale for alleles, we can easily calculate this distance matrix (and eHap doesn’t have to make judgments about such things). Suppose, however, that a central haplotype in the cladogram/network is not sampled, leaving two unconnected structures. Specifying mdist=2 will join the structures, and eHap will remove unnecessary two-step connections prior to analyses. See Example 2.

family: This option doesn’t have anything to do with relationships among individuals. It refers to members of the exponential family of distributions, one of which the outcome variable is assumed to follow. For example, we might be interested in a case-control analysis, for which the outcome variable is case status; a binomial distribution is appropriate in this instance and family=binomial would be the appropriate option. The option family=tdt merits more discussion. Obviously “TDT” is not a member of the exponential family, but it is a facile way of forcing eHap to produce the ET-TDT analyses described in Seltman et al. 2001. However, it is also worth noting that quantitative trait TDT using haplotypes is performed by using the option family=normal. See Example 1 for quantitative TDT, Example 4 for ET-TDT, and Example 3 for Poisson analysis.

method: There are two basic types of test procedures: score tests and likelihood ratio tests. Results are likely to be quite similar if each nuclear family has only one child. The methods differ in how they handle correlated observations from siblings.
If method=LR and family=normal, then the correlation between siblings is directly modeled via a correlation matrix. Alternatively, if method=score, the correlation is adjusted for indirectly (see Seltman et al. 2003). If the sample has multiple siblings and family=binomial or poisson, then we recommend using method=score. LR is not designed to model within-family correlation for binomial or Poisson models. For these choices LR will treat siblings as independent, which could lead to false positives.

ocv & usecov: The first option, ocv, lets the user define the outcome variable for the analysis. By default eHap assumes it is affection status from the pedigree file, but the appropriate outcome variable might be a quantitative trait from the covariate file. Likewise, for any particular analysis, the user might want to limit the covariates entering as predictor variables, and this can be accomplished by usecov; for example, to use two covariates named X_1 and X_2 (or whatever their names might be, as defined in the first line of the covariate file) the following lines in the control file will suffice:

usecov=2
X_1 X_2

everbose: By default, eHap will print only those haplotypes with estimated frequencies greater than zerocut. To print them all, change the setting to everbose=1.

omnibus=1: This option causes HapLinMod to produce a standard omnibus test for association between the outcome variable and the set of haplotypes in the sample. The likelihood of the phenotype is computed using generalized linear models. The test accounts for missing phase information via the EM algorithm, based upon the missing data principle. With R distinct haplotype effects under investigation, the test has R-1 degrees of freedom. If omnibus=0 the MHA test is performed.

nperm=n: Use this option to calculate simulation-based p-values. If omnibus=0, permutation testing is based on the MHA test; otherwise it is based on the omnibus test.

singleton=1:
With the selected choice of mdist it might happen that more than one tree is formed; in this event we say there is a grove of trees. In this situation eHap makes all of the comparisons dictated by MHA for each tree. By default eHap stops here. If singleton=1, however, and if the edges in each tree collapse to a singleton, eHap compares the singletons. If it finds no significant difference, it indicates this with a dotted line connecting the singleton nodes. Otherwise, a significant difference is indicated graphically by no connection between singletons.

III. eHap window.

When eHap is invoked it opens a window to display some of its results. There are options available from this window. If the window is blank but it appears eHap is churning away, it is establishing haplotype configurations and estimating haplotype frequencies from the data (assuming rerun=0 and this is a first pass through the data). Depending upon the size of the problem, eHap will be in this state for a short to long time. If the problem is not too big, it should shortly present a figure representing relationships among haplotypes.

When the results are back, the window will display the network of relationships among haplotypes. On the top left are two menus, labeled “File” and “Draw”. Clicking on the Draw menu gives you four options: Network, Cladogram, Labels and EMvalues. Clicking on “Cladogram” produces a cladogram from a network using the Crandall/Templeton (1993) rules, and clicking on “Network” returns to the network. By default the nodes are labeled with eHap’s names for the haplotypes; the labels can be erased by clicking on “Labels” and restored by clicking again on “Labels”. The labels can be replaced by the estimated fraction of haplotypes of each variety by clicking on “EMvalues”. A right-click on a haplotype reveals details about the haplotype. Under the “File” button are three options: “Open cladogram”, “Print” and “Exit”.
The use of the latter two options is obvious; the first deserves an explanation. Suppose the user wants to analyze the data with a cladogram different from the default cladogram produced by eHap. Clicking on “File” and then “Open cladogram” allows the user to select a user-specified cladogram for the analysis. The newly selected cladogram will be drawn in the window.

When you want to analyze the data for association, simply click on the button HapLinMod. Assuming no errors are detected, results will begin to appear in the window, and the cladogram will likely be simplified. When HapLinMod is finished, a final cladogram will be displayed in the eHap window. Results will also have been written to the file you specified. This transition from eHap to HapLinMod happens automatically if autorun=1. The Exit button does the expected thing: it exits eHap.

Examples of eHap analyses.

We have organized data for the examples by subdirectory, named EG1 for Example 1, EG2 for Example 2, and so forth. Likewise, results for the examples are directed to the appropriate subdirectory. Control files for these examples are also in the subdirectory. When you click on eHap, it will pop up a window to search for a control file because it did not find a control file in its home directory (it looks there first). Use this window to find the control file you want. This environment is fragile, for some reason; you will know you have successfully found a control file and implemented its instructions if eHap pops up the message “Finding Haplotype Network”. You can learn much about eHap by looking at the structure of these files, running the samples as is and examining the results, and then playing with the options to see how the results change.

Example 1. Quantitative trait TDT and quantitative trait association analysis. This example is very similar to the one described in Seltman et al. (2003), and uses the same network/cladogram.
Nine bi-allelic loci are assayed, and 11 haplotypes are estimated to occur at non-zero frequencies in the sample (actually, these haplotype frequencies were estimated to be greater than the default for zerocut, which is 0.001). The data consist of trios of parents and one offspring, all of whom are genotyped (actually, data for some fathers are missing at random), and a quantitative trait is measured on the offspring. The set of realized haplotypes results in a network, which is then resolved to a cladogram for analysis using the rules of Crandall and Templeton. As the control file is set up, the data are analyzed as a quantitative trait TDT using the cladogram to structure tests of significance. When HapLinMod is run from the eHap window, results are written to the eHap window and also to the file rslts.txt. The output is easy to read, once you know how, so let’s go through one result (underlined).

Check 265,266,267,269 => 257,385 | 1,17,33 | 129,193

What HapLinMod is doing at this step is checking whether transmission of haplotypes “265, 266, 267, & 269”, within families, has the same impact on the mean of the quantitative trait, stochastically, as does transmission of haplotypes “257 & 385”, given the impact of transmission of the “nuisance” groups of haplotypes “1,17,33” and “129, 193” on the quantitative trait. (Nuisance in the sense that these haplotypes do not directly enter the test, but their effects must be estimated to fully specify the model.) The results of this test are

Score test p-value=1.523677e-006: reject

demonstrating a highly significant difference in the effect of transmission of the two haplotype groups of interest on the mean of the quantitative trait. In this example, eHap pinpoints the cluster of haplotypes that bear alleles conferring liability.
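For readers who want to post-process such output, the following Python sketch splits a “Check” line of the form shown above into the two haplotype groups being compared and the nuisance groups. The function name parse_check_line and the dictionary keys are hypothetical labels chosen for this illustration; the sketch assumes every Check line follows exactly the layout of the single example quoted here, which has not been verified against other HapLinMod output.

```python
from typing import Dict, List, Union

Group = List[int]


def _to_ints(text: str) -> Group:
    """Turn a comma-separated list such as '257,385' into [257, 385]."""
    return [int(tok) for tok in text.split(",")]


def parse_check_line(line: str) -> Dict[str, Union[Group, List[Group]]]:
    """Split 'Check A => B | N1 | N2 ...' into compared and nuisance groups."""
    body = line.strip()
    if not body.startswith("Check "):
        raise ValueError("not a Check line")
    left, right = body[len("Check "):].split("=>", 1)
    parts = [part.strip() for part in right.split("|")]
    return {
        "first": _to_ints(left),                     # group before '=>'
        "second": _to_ints(parts[0]),                # group after '=>'
        "nuisance": [_to_ints(p) for p in parts[1:]],  # remaining groups
    }


parsed = parse_check_line("Check 265,266,267,269 => 257,385 | 1,17,33 | 129,193")
print(parsed["first"], parsed["second"], parsed["nuisance"])
```

Keeping the nuisance groups as a list of lists preserves the grouping, which matters because each group contributes its own effect to the model.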
If we were to change the control file from “transmit=1” to “transmit=0”, then we would force HapLinMod to ignore the transmission of haplotypes from parents to their offspring, and instead estimate the relationship between haplotype and quantitative phenotype as if the sample (of children) were a population sample of individuals (some of whom could be siblings). In this case, similar results are obtained, although the p-value is even more noteworthy:

Score test p-value=4.352074e-014: reject

If one wishes to change the cladogram inferred by eHap, it is necessary to first run eHap to obtain the cladogram output file (cladogram.txt) and then edit this file prior to running HapLinMod. For example, in the network file haplotype 385 is connected to haplotypes 257 and 129. The cladogram algorithm in eHap breaks the edge between 385 and 129. To attach 385 to 129 instead of 257, simply replace the line “385 – 257” in cladogram.txt with “385 – 129”, then run HapLinMod.

If one were not interested in using the evolutionary relationships among haplotypes, one would change the control file by adding the line

omnibus=1

The omnibus test for association yields a p-value that is infinitesimal (less than 10**(-13)), but eHap reports only that it is less than 0.0005.

Chi^2=134.0837(10 df) omnibus LR p-value<0.0005

You can get a closer approximation to the p-value using standard packages, such as SAS, Splus, or R, although each is likely to report a zero value to you.

Example 2. A grove of cladograms. This example has nine loci, and takes more time to run (1 minute) on our computers. It should be clear that if you put eHap on a truly tough problem, you might need to take lunch or let it run overnight. As we noted in Seltman et al. (2003), we are working on speeding it up. A feature that can speed up secondary runs of the analysis considerably is rerun=1. With this option the program stores preliminary results, such as the set of haplotypes consistent with each family.
This option is useful if only HapLinMod options are changed in a subsequent run. nodraw=1 is also a time saver.

This example shows you how eHap handles disconnected portions of a network or a grove of cladograms. After eHap runs, you will see two cladograms and an isolated node in the window. Haplotypes 118, 86, 78 & 70 form one cladogram and 472, 471, 470 & 466 the other cladogram, with 278 being the isolated node. You may want to manipulate the figure to see this better: you can “grab” any node in the eHap window (by clicking and dragging the cursor) and move it around. (Incidentally, to get an accurate picture of the cladogram for this example, you need to grab node 70 and move it.) Notice that in cladogram.txt the singleton node 278 is indicated by the line “278 – 278”. Reprinted below is the file distmat.txt, which contains the mutational distances between the set of observed haplotypes.

1 1 2 3 4 3 5 4 2 3 4 5 1 2 3 3 4 3 4 2 3 2 1 6 4 5 4 3 2 5 3 4 3 2 1 1

By examining this file, you can see why the figure in the eHap window looks as it does, and you will note that the isolated node is a two-step mutation away from both clusters of four haplotypes. At this point the user should make a decision. You can either (1) analyze the cladogram structure as is, so that contrasts will be made within each of the cladograms, but not across them, or (2) change the control file from the default, mdist=1, to mdist=2. The latter change will force eHap to connect all the haplotypes into a network of 1- or 2-step mutations (it produces a fairly complicated network, but the unnecessary connections are “erased” when you toggle between network and cladogram in the eHap window, and eHap erases the other connections before analysis). For this example, options (1) and (2) yield similar results, namely that there is a fundamental difference in the mean of the continuous outcome as a function of transmission of haplotype 118 versus all others.
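The effect of mdist on whether a grove or a single connected structure results can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name grove and the toy distances below are hypothetical (they are not the values in distmat.txt), haplotypes are simply connected whenever their distance is at most mdist, and the pruning of unnecessary two-step connections that eHap performs is not reproduced.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def grove(nodes: List[int],
          dist: Dict[FrozenSet[int], int],
          mdist: int) -> List[Set[int]]:
    """Return connected components when pairs with distance <= mdist are joined."""
    adjacency: Dict[int, Set[int]] = {n: set() for n in nodes}
    for a, b in combinations(nodes, 2):
        if dist[frozenset((a, b))] <= mdist:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen: Set[int] = set()
    components: List[Set[int]] = []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first search collects everything reachable from `start`.
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adjacency[node] - comp)
        seen |= comp
        components.append(comp)
    return components


# Toy distances: two tight pairs (1,2) and (3,4), with haplotype 5
# two steps away from one member of each pair.
pairs = {(1, 2): 1, (1, 3): 5, (1, 4): 6, (1, 5): 3, (2, 3): 4,
         (2, 4): 5, (2, 5): 2, (3, 4): 1, (3, 5): 2, (4, 5): 3}
toy = {frozenset(k): v for k, v in pairs.items()}
print(len(grove([1, 2, 3, 4, 5], toy, mdist=1)))
print(len(grove([1, 2, 3, 4, 5], toy, mdist=2)))
```

With these toy values, mdist=1 leaves three separate pieces (two pairs and an isolated node), while mdist=2 joins everything through haplotype 5, mirroring the situation described for the isolated node in this example.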
The choice of mdist=2 is often necessary to obtain a single cladogram from a population of haplotypes.

An interesting cautionary tale develops from this example. In this example there are roughly 1000 families analyzed and zerocut=0.005. Look in the file eHaperr.txt. You will find a substantial number of families that have been eliminated from the analysis because they have “No configuration”. In other words, there are Mendelian errors, which could be due to simple incompatibility with Mendel’s first law or to the elimination of certain haplotypes necessary for proper Mendelian transmission; for example, a haplotype with frequency 1/4000 = 0.00025 falls below zerocut, so one might believe zerocut is too big. One could, in fact, set zerocut=0.0001 and rerun the analyses. If you do that, you will see that eight new, rare haplotypes are introduced into the sample and that they have quite an impact on the realized relationships among haplotypes: a network is formed and there is quite an odd chain of haplotypes. Such a network would be hard to analyze, and in fact the decision to set zerocut to such a low value would be a mistake. In the simulation we generated a low rate of genotype errors to demonstrate two points: errors do not always show up as simple Mendelian single-locus errors, and low levels of genotyping errors create some nasty structure for a network/cladogram. This structure could, in theory, have consequences for inference. (Note: the relationships among haplotypes realized by the original configuration of the control file are the correct [generating] relationships among the haplotypes.)

Example 3. Trait following Poisson distribution. In this example, the outcome is generated from the Poisson distribution (small number of discrete outcomes). Our emphasis here is to show how to handle different data types, in this case by setting family=poisson in the control file.
From this analysis, no errors are issued by HapLinMod, although there are incompatible configurations again, which result from random genotyping errors. A cladogram is realized from the observed set of haplotypes and a significant effect of the transmission of haplotype 6, versus all others, is observed on the values of the Poisson random variable. Example 4. Analysis by ET-TDT. This is another example of how to adjust for outcomes. In this case we wish to perform an ET-TDT analysis. We do so by specifying transmit=1, method=sc & family=tdt in the control file. Results and interpretation should now be straightforward.