SNPHAP
A program for estimating frequencies of large haplotypes of SNPs
================================================================
(Version 1.0)

David Clayton
Department of Medical Genetics
Cambridge Institute for Medical Research
Wellcome Trust/MRC Building
Addenbrooke's Hospital, Cambridge, CB2 2XY


Disclaimer
----------

This program has now had a fair bit of use by our group and others.
However, it comes with no guarantees. I'd like to know of any
difficulties/bugs that people experience and will try to deal with
them, but this may take some time.


Introduction
------------

This program implements a fairly standard method for estimating
haplotype frequencies using data from unrelated individuals. It uses
an EM algorithm to calculate ML estimates of haplotype frequencies
given genotype measurements which do not specify phase. The algorithm
also allows for some genotype measurements to be missing (due, for
example, to PCR failure). It also allows multiple imputation of
individual haplotypes.


EM algorithm
------------

The well-known algorithm first expands the data for each subject into
the complete set of (fully phased) pairs of haplotypes consistent with
the observed data (the "haplotype instances"). The EM algorithm then
proceeds as follows:

E step: Given current estimates of haplotype probabilities and
        assuming Hardy-Weinberg equilibrium, calculate the probability
        of each phased and complete genotype assignment for each
        subject. Scale these so that they sum to 1.0 within each
        subject. These are then the POSTERIOR PROBABILITIES of the
        genotype assignments.

M step: Calculate the next set of estimates of haplotype probabilities
        by summing posterior probabilities over instances of each
        distinct haplotype. After scaling to sum to 1.0, these provide
        the new estimates of the PRIOR PROBABILITIES.

This algorithm is widely used and not novel.
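
The E and M steps described above can be sketched in a few lines of
Python. This is a minimal illustration only, not the program's own
code: the function names `expand` and `em`, the tuple representation
of haplotypes, the uniform starting frequencies and the fixed
iteration count are all assumptions made for the example, and missing
data are not handled.

```python
from collections import defaultdict
from itertools import product


def expand(genotype):
    """List the phased haplotype pairs consistent with an unphased genotype.

    genotype: one (a, b) allele pair per locus, alleles coded 1/2.
    Returns a sorted list of unordered pairs (h1, h2) of haplotype tuples.
    """
    per_locus = [[(a, b)] if a == b else [(a, b), (b, a)] for a, b in genotype]
    pairs = set()
    for choice in product(*per_locus):
        h1 = tuple(c[0] for c in choice)
        h2 = tuple(c[1] for c in choice)
        pairs.add(tuple(sorted((h1, h2))))  # store each pair once, unordered
    return sorted(pairs)


def em(genotypes, iterations=50):
    """Estimate haplotype frequencies by EM (hypothetical helper)."""
    instances = [expand(g) for g in genotypes]
    haps = sorted({h for inst in instances for pair in inst for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}  # assumed uniform start
    for _ in range(iterations):
        counts = defaultdict(float)
        for inst in instances:
            # E step: Hardy-Weinberg probability of each phased assignment,
            # scaled to sum to 1.0 within the subject.
            w = [freq[h1] * freq[h2] * (1 if h1 == h2 else 2) for h1, h2 in inst]
            total = sum(w)
            for (h1, h2), wi in zip(inst, w):
                counts[h1] += wi / total
                counts[h2] += wi / total
        # M step: sum posteriors per haplotype and rescale to sum to 1.0.
        n = sum(counts.values())
        freq = {h: counts[h] / n for h in haps}
    return freq
```

With two homozygous subjects (1.1/1.1 and 2.2/2.2) and one double
heterozygote, the estimates for haplotypes 1.1 and 2.2 each approach
0.5, while 1.2 and 2.1 shrink towards zero.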

However, for a large number of loci, the number of possible haplotype
instances rapidly becomes impossibly large, even though the eventual
solution may only give appreciable support to a rather limited set of
haplotypes. This program avoids this difficulty by starting with
2-locus haplotypes and extending the solution by one locus at a time.
As each new locus is added, the number of haplotype instances to be
considered is first expanded, by considering all possible larger
haplotypes. Then, after applying the EM algorithm to estimate the
"prior" haplotype probabilities and the posterior probabilities of
haplotype instances, the haplotype instances are culled in two ways:

Posterior trimming: Any genotype assignment whose posterior
        probability falls below a given threshold is deleted and the
        posterior probabilities of the remaining genotype assignments
        for the subject are recomputed.

Prior trimming: All instances of any haplotype whose prior probability
        falls below a threshold are removed. This option is not used
        by default since it can lead to difficulties in comparing
        likelihoods (see below).

We add one locus at a time until completion.

The process of culling haplotype assignments at early stages can lead
to solutions which are not optimal. For example, haplotype 1.1.1 may
have zero estimated frequency in the maximum likelihood analysis of
the three-locus haplotype, while 1.1.1.x may have non-zero estimated
frequency in the ML solution to the four-locus problem. It is not
clear how often this will be a problem. A partial solution is to try
including loci in different orders and to see whether the solution
obtained varies. A further protection is not to cull haplotypes after
inclusion of every locus, but only after every k loci, although large
values of k incur a penalty in both computer time and memory use.
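
The two culling rules can be illustrated with a short Python sketch.
The function names and data structures are hypothetical, and the
choice to always keep a subject's most probable assignment (so that
no subject is left without any assignment) is an assumption of this
example rather than documented behaviour of the program.

```python
def trim_posterior(posteriors, threshold=0.01):
    """Posterior trimming for one subject.

    posteriors: dict mapping each haplotype-pair assignment to its
    posterior probability. Assignments below `threshold` are deleted
    and the survivors are rescaled to sum to 1.0.
    """
    best = max(posteriors, key=posteriors.get)
    kept = {a: p for a, p in posteriors.items() if p >= threshold or a == best}
    total = sum(kept.values())
    return {a: p / total for a, p in kept.items()}


def trim_prior(instances, freq, threshold):
    """Prior trimming for one subject.

    instances: list of (h1, h2) haplotype pairs; freq: dict of prior
    haplotype probabilities. Any instance that uses a haplotype whose
    prior probability is below `threshold` is removed.
    """
    return [
        (h1, h2)
        for h1, h2 in instances
        if freq[h1] >= threshold and freq[h2] >= threshold
    ]
```

For example, trimming the posteriors 0.7, 0.295 and 0.005 at the
default threshold of 0.01 removes the third assignment and rescales
the first two so that they again sum to 1.0.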


Multiple imputation
-------------------

Sampling the (Bayesian) posterior distribution of individual haplotype
data is conveniently carried out using a Gibbs sampler. This mimics
the EM algorithm but uses stochastic steps rather than deterministic
ones. It has been termed the IP (Imputation/Posterior sampling)
algorithm. In our case the algorithm works as follows:

I-step (replaces the E-step): For each subject, pick a haplotype
        assignment from the possible instances, with probability given
        by the current posterior, or "full conditional", distribution.

P-step (replaces the M-step): Sample the haplotype population
        frequencies from their full conditional Bayesian posterior.
        If the prior is a Dirichlet distribution with constant df on
        all possible haplotypes*, the full conditional posterior
        distribution is also Dirichlet. To obtain the set of df
        parameters for this posterior Dirichlet, we simply add the
        constant prior df to the number of chromosomes currently
        assigned to each haplotype.

(* i.e. if the unknown population haplotype relative frequencies are
   denoted p_1, p_2, ..., p_i, ..., p_n, then their prior density is
   assumed to be proportional to

       \prod_{i=1}^{n} p_i^{d-1}

   where d is the prior degree of freedom parameter.)

There can be difficulties in sampling the entire space using this
algorithm if the prior Dirichlet df is taken as zero; if a haplotype
is not assigned to any individual at one step, then the full
conditional posterior is improper and the haplotype will be given zero
probability at the next step. Thereafter it can never be sampled
again. Also, when there are multiple maxima in the likelihood, the
algorithm may become "stuck" under one peak. To avoid these
difficulties, provision is made to start the prior df parameter at a
relatively large value, thereby giving all haplotypes an appreciable
probability of being sampled. Thereafter the prior df parameter is
reduced at each step.
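
A rough Python sketch of the I-step and P-step follows. It is an
illustration under stated assumptions, not the program's own code:
the function names are hypothetical, the Dirichlet draw is made by
normalising gamma variates, the prior df is given here directly per
haplotype, and the linear reduction of the df from its starting to
its final value is an assumption (the text above says only that the
df is reduced at each step).

```python
import random


def i_step(instances, freq, rng):
    """Pick one haplotype pair per subject from its full conditional."""
    chosen = []
    for inst in instances:
        w = [freq[h1] * freq[h2] * (1 if h1 == h2 else 2) for h1, h2 in inst]
        chosen.append(rng.choices(inst, weights=w, k=1)[0])
    return chosen


def p_step(chosen, haplotypes, prior_df, rng):
    """Sample haplotype frequencies from the posterior Dirichlet."""
    df = {h: prior_df for h in haplotypes}
    for h1, h2 in chosen:
        df[h1] += 1  # one chromosome currently assigned to h1
        df[h2] += 1  # and one to h2
    # A haplotype with zero df receives zero probability, as described above.
    draws = {h: rng.gammavariate(a, 1.0) if a > 0 else 0.0 for h, a in df.items()}
    total = sum(draws.values())
    return {h: g / total for h, g in draws.items()}


def impute(instances, haplotypes, freq, steps, df_start, df_end, seed=0):
    """Run `steps` (>= 1) IP steps and return the last assignment drawn."""
    rng = random.Random(seed)
    for s in range(steps):
        frac = s / max(steps - 1, 1)
        prior_df = df_start + (df_end - df_start) * frac  # assumed linear schedule
        chosen = i_step(instances, freq, rng)
        freq = p_step(chosen, haplotypes, prior_df, rng)
    return chosen, freq
```

Here `instances` holds, for each subject, the list of haplotype pairs
still under consideration, and `freq` would start from the maximum
likelihood estimates.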

This algorithm is repeated for a fixed number of steps to obtain a
single imputation. The prior df parameter is then set back up to the
high value, the population haplotype frequencies are restored to their
MLEs, and the process is repeated to obtain the next imputation, and
so on.

Warning: Although multiple imputation using the IP algorithm is an
established technique (see Schafer J.L. "Analysis of Incomplete
Multivariate Data" Chapman and Hall: London, 1997), it remains to be
rigorously validated in this application.


Multiple maxima
---------------

It is well known that the likelihood surface for this problem may have
multiple maxima and that the EM algorithm will only converge to a
local maximum. After all loci have been added and a final trimmed list
of haplotype instances has been computed, the EM algorithm may be
repeated multiple times from random starting points in order to search
for the global maximum. The random starting points may be chosen in
one of two ways:

(a) from randomly chosen values for the prior haplotype probabilities,
    or
(b) from randomly chosen posterior probabilities for each haplotype
    assignment.

Random starting points can also be chosen in the first set of EM
iterations and, in this case, method (b) is used.


Use
---

The program is invoked from the command line by

   snphap [-ds # -de # -i # -l # -mb # -mc # -mi # -mm # -n -pr # -po #
           -ro -rv -rp -rs -sd # -ss -to # -th #]
          input-file [output-file-1] [output-file-2]

The input file should contain the data in subject order, with a
subject identifier followed by the pair of alleles at each locus. The
subject identifier need not be numeric, but must not include "white
space" (blanks or tabs). The alleles should be coded either 1, 2
(numeric coding), or A, C, G or T ("nucleotide" coding). Missing data
are indicated by 0 in numeric coding and, for nucleotide coding, by
any character not hitherto mentioned. Data fields should be separated
by any "white space" (any number of blanks, tabs or new-line
characters).
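
To make the input layout concrete, here is a small Python sketch that
reads data written one line per subject. It is not the program's own
reader: the function name `parse_subjects` and the subject
identifiers and alleles in the usage note are invented for
illustration, and the sketch assumes one subject per line (the
program itself accepts new-lines anywhere between fields).

```python
def parse_subjects(text):
    """Parse one-line-per-subject input into (identifier, genotype) tuples.

    Each genotype is a list with one (allele, allele) pair per locus.
    Alleles are kept as strings, so both numeric (1, 2, 0 for missing)
    and nucleotide (A, C, G, T) coding pass through unchanged.
    """
    subjects = []
    for line in text.splitlines():
        fields = line.split()  # any run of blanks or tabs separates fields
        if not fields:
            continue  # skip blank lines
        ident, alleles = fields[0], fields[1:]
        if len(alleles) % 2 != 0:
            raise ValueError("odd number of alleles for subject " + ident)
        genotype = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
        subjects.append((ident, genotype))
    return subjects
```

For instance, the hypothetical line `S01 1 2 2 2 0 0` describes a
subject typed at three loci: heterozygous at the first, homozygous
for allele 2 at the second, and missing at the third.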

By default, loci are added in the same order that they appear in the
input file but, optionally, they may be added in reverse order or
random order.

The log likelihood output from this program should be used with some
caution, particularly when prior trimming has been applied, since
likelihoods which do not consider the same subsets of possible
haplotypes may not be comparable.


Options
-------

The optional flags allow one to set the following parameters:

-i #    The maximum number of EM iterations at each step is #
        (default 50).

-l #    The number of loci is #. This option is no longer necessary
        if the data are entered as one line per subject.

-k #    Kill improbable haplotype assignments (see -pr and -po) after
        every # loci (default is 5).

-mb #   The maximum amount of dynamic storage to be allocated to the
        program is # Mbytes. This option should not be needed since,
        by default, the program should be able to determine this.

-mi #   Create # output files containing fully phased genotypes
        imputed at random from the posterior distribution.

-mm #   Carry out the final EM maximization # times, starting from
        random starting points. The solution with the largest
        likelihood is accepted.

-nu     Forces numeric coding (1/2) of alleles on output, even when
        input data are in A/T/C/G format (default is "off").

-nh     Locus names are supplied as the first line of the input file.

-nf #   Locus names are supplied in file #. This file should have one
        line per locus, with the locus name as the first field.

-pr #   The threshold for prior trimming is # (default is zero).

-po #   The threshold for posterior trimming is # (default is 0.01).

-q      "Quiet" operation. This suppresses the constantly-updated
        progress report that is written to the screen. This is needed
        when the program is to be run in "batch" mode.

-ro     Add new loci in random order (off by default).

-rv     Add new loci in reverse order (off by default).

-rp     When the -mm option is set, each repeated EM algorithm is
        restarted with random values for the prior haplotype
        probabilities. If not set, each EM algorithm is restarted with
        random posterior assignment probabilities for each subject
        (the default behaviour).

-rs     Select a random starting point for each EM iteration.
        Otherwise we start by assuming linkage equilibrium between
        the new marker and the previous ones (default is "off").

-sd #   Set the pseudo-random number generator seed to a large
        integer, #. The default is to generate a seed from the date
        and time.

-ss     Specifies that the output file should be written as a
        tab-delimited file with variable names on the first line
        (suitable for reading into spreadsheet or statistical
        programs; default is "off").

-to #   The convergence criterion for the EM iterations is # (the
        tolerated change in log-likelihood between two iterations;
        default 0.0001).

-th #   A number between 0 and 1, controlling the posterior threshold
        for writing most likely haplotype assignments to subjects to
        output-file-2. Only assignments whose posterior probability
        exceeds this multiple of that of the most likely a posteriori
        assignment will be written to the output file (not relevant
        in multiple imputation mode).

Multiple imputation options:

-mc #   The number of MCMC steps between imputations.

-ds #   The starting value of the df parameters for the Dirichlet
        prior. This is specified as a multiple of the number of
        chromosomes observed (i.e. twice the number of subjects).
        The default value is 0.1.

-de #   The final value of the df parameters for the Dirichlet prior
        (specified in the same way as above). The default value is
        zero, corresponding to complete ignorance.

If the command is issued without options or arguments, a brief
description of available options is written to the screen.


Output
------

1. Iteration progress reports (written to the screen). Note that some
   terminal emulators which provide "scrolling" may seriously slow
   down operation of the program.
   In this case you should either use a standard nonscrolling xterm,
   or invoke the -q option, which suppresses this output.

2. A file listing the haplotypes found, and their probabilities
   (output-file-1). The list is in descending order of probability
   and a cumulative probability is also listed. The cumulative
   probability is suppressed if the -ss option is in force.

3. A file listing assignments of haplotypes to subjects
   (output-file-2). This file contains all assignments whose
   posterior probability exceeds a multiple of that of the most
   probable assignment (see the -th option). The file is in "long
   format" -- that is, the pair of haplotypes for each assignment
   appears as successive lines.

4. A file (named "snphap-warnings") which contains any warning
   messages.

Output files output-file-1 and output-file-2 are in a compressed and
easily readable format. Alternatively, they can be saved as
tab-delimited text files (see the -ss option) suitable for reading
into a spreadsheet program, or a statistical program such as "Stata".
Both file names are optional and a missing argument can be indicated
with a single "." (period or full-stop) character. But since it must
be assumed that you want SOME output, omission of both file names
causes the program to default to "snphap.out" for output-file-1.

In multiple imputation mode, an additional series of files is created.
Each imputation causes a fresh file (or pair of files) to be written.
The file names are as specified on the command line, but with the
strings .001, .002, .003, etc. appended.


Building
--------

A primitive Makefile is supplied. This uses the gcc compiler and will
need to be edited if a different C compiler is to be used. You may
also need to edit the CMP_FLAGS and LD_FLAGS options (which provide
flags used by the compiler at the compile and load stages
respectively).

For Microsoft Windows users, I suggest use of a Unix emulation
package.
See "Cygwin":

    http://www.redhat.com/software/tools/cygwin

I found that setting LD_FLAGS to -lm worked for me on both Linux and
Solaris (this is the default setting), but on Cygwin I had to omit
this flag.

The default uniform random number generator (UNIFORM_RANDOM) is set to
be the standard 48-bit function `drand48', and the corresponding
seeding function (RANDOM_SEED) is `srand48'. However, for systems
which do not support the 48-bit functions (this includes Cygwin), the
32-bit versions can be chosen:

    UNIFORM_RANDOM = drand
    RANDOM_SEED = srand

`drand()' is defined as a macro evaluating to
(0.5+rand())/(1+RAND_MAX).

A short test data file is also included. This contains typings of 100
subjects for 51 SNPs in a small region. To test the program:

    ./snphap test.dat

Alternatively, if you wish to incorporate locus names in the output,

    ./snphap -nf test.nam test.dat


Acknowledgements
----------------

Thanks to Newton Morton and Nikolas Maniatis for their helpful
comments and suggestions on an earlier version. Thanks also to anyone
who has pointed out bugs in earlier versions.


David Clayton
Diabetes and Inflammation Laboratory        Tel: 44 (0)1223 762669
Cambridge Institute for Medical Research    Fax: 44 (0)1223 762102
Wellcome Trust/MRC Building                 david.clayton@cimr.cam.ac.uk
Addenbrooke's Hospital, Cambridge, CB2 2XY  wwwgene.cimr.cam.ac.uk/clayton