************************************************************************
                                 MTDIS
************************************************************************

Welcome to the mitochondrial DNA distance analysis program (MTDIS). This program was initially written to calculate genetic distances between populations for restriction site data obtained from single copy DNA sequences (e.g. mtDNA). The present version of MTDIS will also calculate inter-population genetic distances for DNA sequence data obtained from single copy DNA.

The algorithm used compares frequency differences between populations for the presence of a restriction site or polymorphic nucleotide base across all haplotypes present in both populations. The formulation of the distance measure is taken from the one given on page 651 of:

    Danzmann et al. 1991. J. Fish Biol. 39:649-659.

Note: The minimum number of populations being compared should be 3.

The present version of the program will allow genetic distances to be calculated between single copy DNA (e.g. mtDNA) haplotypes for:

    1) restriction site data
    2) DNA sequences

Note: only a p-distance estimation of sequence divergence is made between haplotypes, where:

    p = nd / n

    nd = the number of base pair (bp) differences between haplotypes, and
    n  = the total number of bp sequenced.

Thus, MTDIS is most appropriate for analyzing recently diverged conspecific populations where pairwise divergences are small. The program should not be used for interspecific phylogenetic analyses.

Two types of data sets may be analyzed with MTDIS.

    1. Single data sets that would generally consist of the empirical data.
    2. Bootstrapped data sets that can be used to test confidence limits on the empirical data set.

Bootstrapped data sets can be obtained by modifying your input data set according to the file format for SEQBOOT in PHYLIP (see below) (Felsenstein, 1993).
The randomized data sets may then be generated by running SEQBOOT for the desired number of randomizations. Once the SEQBOOT output file is generated, the first few lines of the file will need to be modified before analyzing the data for restriction site or DNA sequence differences. SEQBOOT is available in version 3.5 of PHYLIP.

________________________________________________________________________

HAPLOTYPE DESCRIPTION INPUT FILES:

1. RESTRICTION SITE and DNA SEQUENCE DATA:

Two types of input files are required for MTDIS. One file consists of the population data set and contains the population name designations (maximum of 8 letters), the haplotype designations (maximum of 8 letters), the number of haplotypes in each population, and the number of individuals possessing each haplotype (see section 2 below). The second input file consists of either the site matrix for restriction site data or the DNA sequence data, as follows:

1a) Restriction site data:

These input data files consist of site gains ONLY! between populations.

Note: All files of this type must have a *.res extension.

The data matrix which must be supplied is a 0,1 matrix of recognition site presence=1/absence=0 for each haplotype being compared. If a site is unknown, it may be coded as any number greater than 1 (I suggest using the numbers 2 - 9 to keep your column structure linear) and MTDIS will then simply skip that site in its frequency calculation. Data MUST be entered using haplotypes as the rows and recognition sites as the columns.

For example, if you have three haplotypes in your species with six polymorphic recognition sites, where:

    site 1 is shared between haplotype 1 + 2
    site 2 is shared between haplotype 1 + 3
    site 3 is shared between haplotype 2 + 3
    site 4 is only present in haplotype 3
    site 5 is only present in haplotype 3
    site 6 is only present in haplotype 1

You should have the following data matrix structure.

    3 6
    110001
    101000
    011110

Before you enter the presence/absence matrix, you must specify the total number of haplotypes observed as the first value, followed by the total number of polymorphic restriction sites detected as the second value. In the present example 3 haplotypes were described and 6 sites were detected. Remember to leave at least one space between each data value! Also, you must start data entry on line 1 and column 1. Do not leave empty lines in your data entry, as the program may read these as zero values.

Following construction of the data matrix you will need to enter a data line indicating whether each site is recognized by a 6-base, 5-base, or 4-base cutter. For example, if all 6 restriction sites in the previous example are recognized by a 6-base cutter, then enter:

    3 6
    110001
    101000
    011110
    666666

Finally, you will need to enter the number of monomorphic sites (shared among all haplotypes recorded) for all 4-base, 5-base, 6-base, and 8-base cutting enzymes used. Enter these values on a separate line for each class of enzyme. The monomorphic sites detected with 4-base cutters are entered first, followed by 5-base, 6-base, and 8-base sites respectively. In the present example only 6-base cutting enzymes were used, and therefore 0 values are entered for the other categories. Let us suppose that, along with the six polymorphic 6-base sites, 15 monomorphic sites were detected. The data input file would then look like this:

    3 6
    110001
    101000
    011110
    666666
    0
    0
    15
    0

A recognition sequence with intervening unspecified nucleotides, such as ATTNNNNNNNAAT, would be considered a 6-base recognition site since only six positions are defined. It was considered unlikely that a more complex recognition enzyme than an 8-base cutter would be used in a routine population genetics survey, and therefore input for restriction fragment patterns from such enzymes is not accommodated.
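For readers who prepare their files with a script rather than a text editor, the layout described above can be sketched in a few lines of Python. This is only an illustration and not part of MTDIS: the function name `build_res_text` and its parameters are hypothetical, matrix rows are written exactly as printed in the example above, and the closing "DONE" line is the terminator that this README requires at the end of every input file.

```python
def build_res_text(haplotypes, cutter_sizes, monomorphic):
    """Return the text of a restriction-site (*.res) input file.

    haplotypes   -- one row per haplotype; each row is a list of site
                    scores (1 = present, 0 = absent, 2-9 = unknown)
    cutter_sizes -- recognition-sequence length (4, 5, 6 or 8) per column
    monomorphic  -- dict mapping cutter size to its monomorphic site count
    """
    n_sites = len(cutter_sizes)
    for row in haplotypes:
        if len(row) != n_sites:
            raise ValueError("every haplotype row needs one score per site")
        if any(not 0 <= score <= 9 for score in row):
            raise ValueError("site scores must be single digits")
    if any(size not in (4, 5, 6, 8) for size in cutter_sizes):
        raise ValueError("cutter sizes must be 4, 5, 6 or 8")

    # Line 1: number of haplotypes, then number of polymorphic sites.
    lines = [f"{len(haplotypes)} {n_sites}"]
    # Presence/absence matrix, haplotypes as rows (layout as printed above).
    lines += ["".join(str(score) for score in row) for row in haplotypes]
    # Cutter class of each polymorphic site, left to right.
    lines.append("".join(str(size) for size in cutter_sizes))
    # Monomorphic site counts: 4-, 5-, 6- and 8-base cutters, one per line.
    lines += [str(monomorphic.get(size, 0)) for size in (4, 5, 6, 8)]
    lines.append('"DONE"')
    return "\n".join(lines) + "\n"


example_text = build_res_text(
    haplotypes=[[1, 1, 0, 0, 0, 1],
                [1, 0, 1, 0, 0, 0],
                [0, 1, 1, 1, 1, 0]],
    cutter_sizes=[6] * 6,
    monomorphic={6: 15},
)
print(example_text)
```

If your copy of MTDIS expects the matrix values to be separated by spaces, change the two `"".join(...)` calls to `" ".join(...)`.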
Note: You will have to apply an averaging factor for multiple recognition sequence enzymes that share sites with other enzymes. For example, if two 6-base cutters recognize 4 different sites, and one of these sequence sites is recognized by both enzymes, then subtract 1/8 of the total number of sites recognized by both enzymes.

Note: It is a good idea to insert a number of lines at the end of your data file describing which enzyme and fragment sizes correspond to each data column. These lines will not be read by the program and are merely a suggestion for your archival convenience. Start each of these reference lines with a REM statement or with an ' (apostrophe). VBASIC recognizes such lines as REMARK lines and will not process their contents.

Files may be set up with any standard ASCII text editor such as Norton Commander or CED.COM, or with most word processing packages that will save text in ASCII format.

The final line of the file MUST contain the word "DONE" to signal that the file input is ended. For example:

    3 6
    110001
    101000
    011110
    666666
    0
    0
    15
    0
    "DONE"

1b) DNA sequence data:

The second type of data set which may be entered is DNA sequence data for single copy DNA.

Note: All data sets of this type must have a *.seq extension.

I have also included a program called SEQFLTR with this package that will allow you to prepare raw DNA sequence data from other applications for input into MTDIS. The program essentially filters out all monomorphic base pairs among the haplotypes you are comparing and writes a sequential input file for MTDIS containing only the polymorphic sites (see below).

As with *.res files, the first two values entered in the MTDIS *.seq input file must be the number of haplotypes followed by the number of polymorphic DNA sequence sites, e.g.

    3 6
    ACCGAG
    GTTGAA
    ATCAGG
    94
    "DONE"

In this example, you will note that DNA sequence data must be entered in sequential format WITHOUT any spaces between the nucleotide designations. Nucleotides may be entered in either upper or lower case, but a U designation is not allowed. Missing sites (i.e. INDEL regions) must be coded as a gap character (i.e. "-"). Unknown data sites may be coded as either an "N", "X", or "?".

Following the entry of polymorphic sites, the number of monomorphic sites is entered. Thus, in the present example, 100 bp are examined in each of the 3 haplotypes. Six of these sites are polymorphic and 94 are monomorphic.

If your data set is relatively small then you could list the entire DNA sequence for each haplotype (including monomorphic sites) in a sequential fashion. You would then still need to indicate the number of monomorphic sites "outside" of the sequences input to MTDIS. In this case a zero value would be listed for this input variable.

Note: Array sizes are limited to 32,000 elements, and thus any combination of haplotypes x number of base pairs screened which exceeds this limit will cause the program to 'crash'. If this is the case you will need to run SEQFLTR to reduce your data set to informative sites only. It is generally a good idea to do this anyway. Why bother manipulating all those non-informative sites?

When data entry is complete, you must write "DONE" in quotation marks at the end of the file, as in the format for *.res files.

1b.i) Distance calculations:

The algorithm used for the estimation of pairwise genetic distances among the populations being compared for DNA sequence data is based upon the summation of frequency differences at nucleotide sites between populations as follows:

    ( SUM[i = 1 to S] ( |fA1 - fA2| + |fC1 - fC2| + |fG1 - fG2| + |fT1 - fT2| ) / 2 ) / t

where:

    S           = the total number of polymorphic sites compared.
    fA1 and fA2 = the frequency of A nucleotides at site i in population 1 and 2, respectively.
    fC1 and fC2 = the frequency of C nucleotides at site i in population 1 and 2, respectively.
    fG1 and fG2 = the frequency of G nucleotides at site i in population 1 and 2, respectively.
    fT1 and fT2 = the frequency of T nucleotides at site i in population 1 and 2, respectively.
    t           = the total number of nucleotides compared.

1b.ii) INDELS (Insertion/deletion sequence elements):

MTDIS will also accommodate the analysis of INDELS in your data set. However, this is optional, as the inclusion or removal of INDEL sequences may be specified in the program SEQFLTR during the construction of your input files for MTDIS.

If you choose to include the analysis of INDELS then each INDEL stretch will be scored as 1 mutational step change from a sequence which does not possess the INDEL, regardless of length. However, if you know that there are tandem repeat elements in the INDEL stretches then you may also score these differences according to the number of tandem repeats detected. You will need to answer -YES- to a specific query regarding this within the program. MTDIS will keep track of the number of missing sites within an INDEL and simply score the number of mutational steps for these regions as:

    bp / tr

where:

    bp = the number of missing bp in the INDEL, and
    tr = the number of bp in the tandem repeat unit

For example, if the sequence AAAAACGACGGCTTTTT formed the following repeating stretches in four haplotypes:

    AGCAAAAAACGACGGCTTTTTAAAAACGACGGCTTTTTAAAAACGACGGCTTTTTCGATACTATATCCA
    AGTA-----------------AAAAACGACGGCTTTTTAAAAAGGAGGGCTTTTTCGGTACCACACCCA
    GGCAAAAAACGACGGCTTTTTAAAAACCACGGCTTTTTAAAAACGACGGCTTTTTCGATACTGTGTCCG
    AACG---------------------------------------------------GAACGTTGTGTTTG
                               *               *  *

Haplotypes 1 and 3 would differ from haplotypes 2 and 4 by 1 and 3 mutational steps, respectively, for the INDEL regions, while haplotype 4 would differ from haplotype 2 by two mutational steps within this region.
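The arithmetic of sections 1b.i and 1b.ii can be pictured with a short Python sketch. It is an illustration written from the descriptions in this README, not the MTDIS source code, and every function name in it is hypothetical. Three assumptions are made: t is taken to be the number of polymorphic sites plus the number of monomorphic sites, characters other than A, C, G and T add nothing to any nucleotide frequency, and an INDEL whose length is not a whole number of repeat units is rejected (the README does not say how MTDIS treats that case). The haplotypes are those of the *.seq example, and the haplotype counts are those given for POP-1 and POP-2 in the *.pop example of section 2.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"


def site_frequencies(haplotypes, counts, site):
    """Frequencies of A, C, G and T at one site within one population."""
    total = sum(counts.values())
    tally = Counter()
    for name, n_individuals in counts.items():
        base = haplotypes[name][site].upper()
        if base in NUCLEOTIDES:  # gaps and unknowns contribute nothing
            tally[base] += n_individuals
    return {base: tally[base] / total for base in NUCLEOTIDES}


def frequency_distance(haplotypes, pop1, pop2, n_monomorphic):
    """Sum over sites of half the absolute frequency differences, over t."""
    n_polymorphic = len(next(iter(haplotypes.values())))
    t = n_polymorphic + n_monomorphic  # assumption: total bp compared
    summed = 0.0
    for site in range(n_polymorphic):
        f1 = site_frequencies(haplotypes, pop1, site)
        f2 = site_frequencies(haplotypes, pop2, site)
        summed += sum(abs(f1[b] - f2[b]) for b in NUCLEOTIDES) / 2
    return summed / t


def indel_steps(missing_bp, repeat_unit_bp=None):
    """Mutational steps for one INDEL: 1, or bp / tr for tandem repeats."""
    if missing_bp <= 0:
        return 0
    if repeat_unit_bp is None:
        return 1  # default scoring: one step regardless of length
    steps, remainder = divmod(missing_bp, repeat_unit_bp)
    if remainder:
        raise ValueError("INDEL length is not a multiple of the repeat unit")
    return steps


haplotypes = {"1": "ACCGAG", "2": "GTTGAA", "3": "ATCAGG"}
pop1 = {"1": 2, "2": 10, "3": 5}   # individuals per haplotype in POP-1
pop2 = {"1": 12}                   # POP-2 carries haplotype 1 only
distance = frequency_distance(haplotypes, pop1, pop2, n_monomorphic=94)

repeat_unit = "AAAAACGACGGCTTTTT"  # 17 bp tandem repeat from the example
steps_hap2 = indel_steps(17, len(repeat_unit))
steps_hap4 = indel_steps(51, len(repeat_unit))
print(distance, steps_hap2, steps_hap4)
```

With these inputs the six per-site terms sum to 55/17, so the distance is 55/1700, and the two INDELs score 1 and 3 steps, matching the haplotype 2 and haplotype 4 values quoted above.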
Of course, if you do not feel that tandem repeats correspond to a stepwise mutation model, you may simply score all INDELS as one mutational step regardless of length.

The asterisks in the figure above indicate positions where polymorphic nucleotide sites in other haplotypes overlap with an INDEL region. MTDIS handles the scoring of these variants in a different fashion. For example, in the following four haplotypes:

    AGCAATATTACGGGCAATACGAAGTACAGATTAGACTAGGAAATACTATACTATACGATACTATATCCA
    AGTA--------------------------------------------------ACGGTACCACACCCA
    GGCAATATTACGAGCAATACGAAGTACAGATTAGACTAGGAAATACTATACTACACGATACTGTGTCCG
    AACGATATTACGGGTAATGCGAAGTATAGATTAGACTAGGAAATACTATACTATGGAACGTTGTGTTTG
                *  *   *       *

Sequence 3 differs from sequences 1 and 4 within the INDEL stretch at the first asterisk, while sequence 4 differs from sequences 1 and 3 at asterisks 2-4 within the INDEL. In sequence comparisons that do not include haplotype 2, divergence estimates will be calculated in the normal fashion. However, distance estimates involving haplotype 2 are averaged over the variant sites without any frequency contribution from the INDEL sites in haplotype 2 (i.e. frequencies for the four variant nucleotides within an INDEL region will be scored as 0). Since nucleotides cannot be assigned to these positions, the frequency of haplotypes possessing INDELS cannot contribute to frequency summations for the nucleotides. The overall effect is that divergence estimates are reduced by 0.5 for each such site in comparisons between haplotypes where one haplotype possesses an INDEL and the other haplotypes possess a variant nucleotide site within the INDEL. If you feel that this is an unacceptable bias, you may delete the INDEL regions from the analysis using SEQFLTR, as previously mentioned.

________________________________________________________________________

2. POPULATION DESIGNATION AND HAPLOTYPE FREQUENCY INPUT FILES:

Note: All files of this type must have a *.pop extension.
In this data file you will need to construct a listing of the number of individuals possessing different haplotypes in the populations you have examined.

Note: All data input files will need to reside in the same directory on your floppy disk or hard disk, as the program will only ask you to specify the path for your first input file (i.e. the program assumes both input files are in the same path).

Populations are input as rows. The first value to input on the first line is the number of populations which you have compared in your study. The second line of the file should contain the names of the populations in the order in which they are input in the file. For example, if 5 populations are being compared they may be designated as follows:

    5
    "POP-1" "POP-2" "POP-3" "POP-4" "POP-5"

You may input the population names on more than one line. However, no blank lines should be included and all designations should start in column 1. Population designations are usually given as abbreviations and are listed in quotations for alphanumeric designations.

Following input of the population designations it will be necessary to enter the designations of the haplotypes described in the *.res or *.seq file. These designations MUST be in the order in which the haplotypes were entered in the *.res or *.seq file. For example, if you had three haplotypes designated A, B, & D and you input these haplotypes as A=row 1; D=row 2; and B=row 3 in the *.res file, then you must enter:

    "A" "D" "B"

on the 3rd line of the *.pop file describing the haplotypes. For the example given above, if the three haplotypes are simply designated as 1, 2, and 3, and were entered in that order, your file should look like this:

    5
    "POP-1" "POP-2" "POP-3" "POP-4" "POP-5"
    "1" "2" "3"

Next you will need to input the population frequency data. Remember: each line will represent a different population, and populations MUST be entered in the order they are specified in the second line of the file,
i.e. population "POP-1" must be entered first. The first number input on the line will be the number of clones or haplotypes found in that population. Following this will be each haplotype designation (in quotations) followed by the number of individuals sampled possessing that haplotype.

For example, if in "POP-1" three different haplotypes were found with 2 individuals possessing haplotype 1, 10 individuals possessing haplotype 2, and 5 individuals possessing haplotype 3, the following should be entered:

    5
    "POP-1" "POP-2" "POP-3" "POP-4" "POP-5"
    "1" "2" "3"
    3 "1" 2 "2" 10 "3" 5

If the entire data file looked like this:

    5
    "POP-1" "POP-2" "POP-3" "POP-4" "POP-5"
    "1" "2" "3"
    3 "1" 2 "2" 10 "3" 5
    1 "1" 12
    2 "1" 6 "2" 18
    3 "1" 5 "2" 1 "3" 6
    2 "2" 3 "3" 12
    "DONE"
    'POP-1 = Blackwater Cr.
    'POP-2 = Brownwater Cr.
    'POP-3 = Clearwater Cr.
    'POP-4 = Teawater Cr.
    'POP-5 = Fish Cr.

This would indicate that in Population 5, three individuals were sampled which had haplotype 2, while twelve individuals were sampled which had haplotype 3.

Remember to write "DONE" at the end of your input file. It is also a good idea to insert a number of lines at the end of the data file indicating the population name that corresponds to each abbreviation given in the main file. Remember to start these comment lines with a REM statement or '.

The program will then convert the observed number of individuals possessing each haplotype into population frequencies.

________________________________________________________________________

3. BOOTSTRAPPING FILES:

To produce a number of randomized data sets (bootstrap, half-jackknife) using SEQBOOT in PHYLIP you will need to put your input data sets into a format that may be read by SEQBOOT. The formats are essentially identical for both programs. The main difference is that you will need to edit the data files used for MTDIS by adding the name of the haplotype at the beginning of each line.
The first 10 columns of each line may only be occupied by the haplotype designation. Data entry starts at column 11. (See the README file in PHYLIP for further information on how to run SEQBOOT.)

The output file generated from SEQBOOT may be read directly by MTDIS with the following modifications.

3a) Restriction site data:

    Line 1: lists the number of haplotypes followed by the number of polymorphic sites being compared. This line is written by SEQBOOT and should not be modified.
    Line 2: indicate the number of randomized data sets that are contained in the file. Thus, if 100 bootstrapping matrices were generated then type 100 on the second line.
    Line 3: list the number of nucleotide bp detected by each restriction enzyme, in the same order in which the sites were entered in the data matrix (i.e. left to right).
    Line 4: list the number of monomorphic sites detected with 4-base cutters.
    Line 5: list the number of monomorphic sites detected with 5-base cutters.
    Line 6: list the number of monomorphic sites detected with 6-base cutters.
    Line 7: list the number of monomorphic sites detected with 8-base cutters.
    Line 8: type the word "READY" in quotation marks.

The input order listed above is identical to the format that would be used to specify input for *.res files. The word "READY" signals MTDIS that the output file from SEQBOOT has been modified to provide all the data MTDIS needs for the calculation of genetic distances. An example of how to modify the beginning lines in the OUTFILE from PHYLIP for the above data set after 100 bootstrapping replicates would be as follows:

    3 6
    100
    666666
    0
    0
    15
    0
    "READY"

Note: Remember to rename your file with a *.res extension, otherwise MTDIS will not recognize it as an input file.
Also, the haplotype names (listed in columns 1-10) generated by PHYLIP will be expected by MTDIS when it processes bootstrapping files, so these names should not be removed from the PHYLIP OUTFILE.

3b) DNA sequence data:

On the second line of the file insert the number of randomized data sets present in the file. The first line of the output file will already list the number of haplotypes and polymorphic sites being compared. This line should not be adjusted. On the third line list the number of monomorphic bp present among all the haplotypes being compared. On the fourth line type the word "READY" in quotation marks. This signals MTDIS that the output file from SEQBOOT has been modified to provide all the data MTDIS needs for the calculation of genetic distances. An example for the above data set would look like this:

    3 6
    100
    94
    "READY"

Note: Remember to rename your file with a *.seq extension, otherwise MTDIS will not recognize it as an input file. Also, the haplotype names (listed in columns 1-10) generated by PHYLIP will be expected by MTDIS when it processes bootstrapping files, so these names should not be removed from the PHYLIP OUTFILE.

3b.i) Bootstrapping INDEL regions:

The output produced by SEQFLTR will contain complete INDEL stretches if this option is specified. If you enter these regions into SEQBOOT then multiple INDEL sites would be randomly generated. This would greatly inflate the distance estimates obtained for haplotypes possessing such INDELs. My recommendation is that you edit your sequences in a text editor and remove all INDEL deletion sites (i.e. "-") except 1 for each INDEL unit. In the case of overlapping polymorphic sites, one possible solution is to keep the deletion signature (i.e. "-") at the corresponding position in the haplotype with the deletion. This will of course result in the generation of more than the expected number of deletions within this haplotype following the bootstrapping procedure.
One way to compensate for this is to arbitrarily reduce the number of deletion sites by the number of additions in the sequence. For example, if you have 1 deletion region in a haplotype and added two more sites to correspond to polymorphic sites in other haplotypes, then you could remove 2 deletion sites from all bootstrapped data sets. If only 1 or 2 sites are generated the job is easy. The question becomes: which 2 sites do I remove if 3 or more deletion sites are generated? I'll leave that one to you.

________________________________________________________________________

4. OUTPUT FILES:

Output from MTDIS is written to disk in ASCII as an upper right-hand pairwise distance matrix. Such a matrix may be readily input to other programs such as PHYLIP or MEGA (Kumar et al. 1993) for the construction of UPGMA trees or Neighbor-Joining trees. Two types of output files will be written to disk depending upon whether you are analyzing:

    a) a single data set, or
    b) multiple data sets

When converting the output file for input into MEGA it will be necessary to put # symbols at the beginning of each population designation. The distance matrix itself will not need to be modified. For converting the file into PHYLIP format, the population designations at the beginning of the file will need to be removed and re-entered at the beginning of each comparison row in columns 1-10. The actual values of the distance matrix must follow in column 11 or greater. This will require some editing of the output file from MTDIS.

Note: All output files have a *.mtd extension.

4a) Single data sets:

The output file generated for the above restriction site data set would look like this:

    1. Genetic sequence divergence among populations.
       Restriction site data.

    Input population file = "name of your input file"
    Input restriction site data file = "name of your input file"

    Population 1 = POP-1
    Population 2 = POP-2
    Population 3 = POP-3
    Population 4 = POP-4
    Population 5 = POP-5

    0.02567694 0.01038749 0.01365546 0.01699347
               0.01973684 0.02182540 0.03650793
                          0.01984127 0.02738095
                                     0.01468254

    2. Intrapopulation nucleotide diversities (diagonal) and
       interpopulation nucleotide diversities (upper right matrix).

    0.01904271 0.02715121 0.01646273 0.01973684 0.02623225
               0.00000000 0.02452618 0.02203425 0.02925230
                          0.02200108 0.03700919 0.02306144
                                     0.01029748 0.02097605
                                                0.01142857

    3. NUCLEON DIVERSITIES (using Formula 8.5 of Nei, M. 1987).

    Population    h           +/- standard error
    _________________________________________________
    POP-1         .5882353    5.247664E-02
    POP-2         0           0
    POP-3         .3913043    4.876184E-02
    POP-4         .6212121    5.387287E-02
    POP-5         .3428571    6.982974E-02

The first section of the output file lists the population names in the order in which they were entered in the *.pop file. The genetic distances given in the right-hand matrix are the average population sequence divergence estimates from the algorithm of Danzmann et al. (1991). Each value corresponds to a pair of the populations listed at the top of the file, in the sequential order given. For example, row 1 contains 4 values representing the sequential pairs of populations tested with POP-1 (i.e. POP-1 & POP-2; POP-1 & POP-3; POP-1 & POP-4; and POP-1 & POP-5). In a similar fashion, row 2 contains all the remaining pairwise distances with POP-2, etc.

Intrapopulation nucleotide diversities (dx) are calculated using a formula equivalent to 10.19 of Nei (1987), based upon empirical numbers of individuals sampled for each haplotype instead of frequencies. Interpopulation nucleotide diversities are estimated in a similar fashion and are equivalent to dxy using formula 10.20 of Nei (1987). Nei recommends calculating interpopulation distances as dA = dxy - ((dx + dy) / 2).
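A reader who wants the corrected distance can derive it by hand from section 2 of the output file. The following Python sketch shows the arithmetic for POP-1 and POP-2, using the first two diagonal entries and the POP-1 & POP-2 entry of the sample output above. The function name `net_divergence` is hypothetical and the code is not part of MTDIS.

```python
def net_divergence(dxy, dx, dy):
    """Nei's net interpopulation distance: dA = dxy - ((dx + dy) / 2)."""
    return dxy - (dx + dy) / 2.0


# Values read from section 2 of the sample output file.
dx_pop1 = 0.01904271      # intrapopulation diversity of POP-1 (diagonal)
dx_pop2 = 0.00000000      # intrapopulation diversity of POP-2 (diagonal)
dxy_pop1_pop2 = 0.02715121  # interpopulation diversity, POP-1 & POP-2

dA = net_divergence(dxy_pop1_pop2, dx_pop1, dx_pop2)
print(f"dA(POP-1, POP-2) = {dA:.9f}")
```

Note that dA can come out negative for very similar populations, since dx and dy are estimated with a sample-size correction while dxy is not; that is an expected property of the estimator rather than an error in the output file.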
Note that the values given in the output table for MTDIS are simply dxy, without Nei's correction. The ordering of the population pairs is similar to that for the top right-hand matrix, with the exception that numbers on the diagonal represent intra-population nucleotide diversities. Thus, row 1 has 5 numbers, with the second value representing the interpopulation nucleotide diversity between POP-1 & POP-2, etc.

Nucleon diversities are calculated according to formula 8.5 of Nei (1987), and the standard errors are calculated as the square root of the variance for single copy DNA given by formula 8.12 of Nei (1987).

Note: nucleotide diversities and nucleon diversities are only reported if a single data set is entered. If multiple data sets are being input the program assumes that these are randomized input data sets.

Output generated from a single analysis using DNA sequence data would have an identical format.

4a.i) Analysis of mtDNA haplotype lineages:

Note: MTDIS may also be used to calculate interhaplotype distances by modifying the input *.pop file such that each row represents a different haplotype, with only 1 haplotype being present in each row. The sample size must be greater than 1, however, for each haplotype, and should be identical for each row (= haplotype). Thus, if you were comparing 10 haplotypes called C1 - C10, your input *.pop file could look like this:

    10
    "C1" "C2" "C3" "C4" "C5" "C6" "C7" "C8" "C9" "C10"
    "C1" "C2" "C3" "C4" "C5" "C6" "C7" "C8" "C9" "C10"
    1 "C1" 3
    1 "C2" 3
    1 "C3" 3
    1 "C4" 3
    1 "C5" 3
    1 "C6" 3
    1 "C7" 3
    1 "C8" 3
    1 "C9" 3
    1 "C10" 3
    "DONE"

The haplotype names appear on both the second and third rows because the second row names the units being compared (haplotypes replacing populations) and the third row must still list all haplotypes, as in any *.pop file.

4b) Multiple data sets:

The randomized data sets (e.g. bootstrapping matrices) produced by SEQBOOT are analyzed by MTDIS to produce a sequential set of interpopulation distances presented as upper right-hand matrices. These data sets are in a format that may be read by the PHYLIP program NEIGHBOR to construct multiple UPGMA or Neighbor-Joining trees. The treefile output from NEIGHBOR may then be read by CONSENSE of PHYLIP to produce a majority-rule consensus tree of the randomized data sets. In other words, MTDIS is used to generate the distance matrices, but all other aspects of performing the bootstrapping tests are done using the programs SEQBOOT, NEIGHBOR, and CONSENSE in PHYLIP. Users are referred to the *.DOC files in PHYLIP for further instructions on implementing these specific programs.

________________________________________________________________________

5. COMPLETE DNA SEQUENCE INPUT FILES:

I have provided an additional program with this package called SEQFLTR.EXE. This stand alone executable allows you to read in the complete DNA sequence of several haplotypes. Software packages that perform sequence alignment procedures will print out left-justified sequences that may be modified with a minimum of effort for input into MTDIS.

All data files read into SEQFLTR must have the *.inf extension. DNA sequences may be input in either the SEQUENTIAL or INTERLEAVED format.

Sequential input is structured such that all the DNA sequence from one haplotype is input before the data from the next haplotype is entered.

Interleaved input is structured such that a stretch of the DNA sequence from one haplotype is entered on one line, followed by the corresponding stretch from the next haplotype on the following line, and so on. When this stretch has been entered for all haplotypes a blank line is usually inserted, and the next stretch of data from each haplotype is entered. This data entry format is repeated in 1 line data blocks for all haplotypes until the complete sequence is entered.
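The relationship between the two layouts can be pictured with a short Python sketch that joins interleaved blocks into one string per haplotype. It is only an illustration of the format described above, not the SEQFLTR code. The function name is hypothetical, the header line of the file is assumed to have been removed already, and the two sample blocks are the first 40 bp of the three haplotypes used in the example file in this section.

```python
def interleaved_to_sequential(data_lines, n_haplotypes):
    """Join interleaved data lines into one sequence per haplotype.

    data_lines   -- the sequence lines of the file, without the header line
    n_haplotypes -- number of haplotypes (rows per interleaved block)
    """
    sequences = [""] * n_haplotypes
    row = 0
    for line in data_lines:
        text = line.strip()
        if not text:
            continue  # blank lines only separate blocks
        # Rows cycle through the haplotypes in the same order in each block.
        sequences[row % n_haplotypes] += text.replace(" ", "")
        row += 1
    if row % n_haplotypes:
        raise ValueError("last block does not contain every haplotype")
    if len({len(seq) for seq in sequences}) > 1:
        raise ValueError("haplotypes do not all have the same length")
    return sequences


interleaved = [
    "ACCGAGGGGG GACTTGGAAA",
    "GTCGAA---- GACTTGGAAA",
    "ATCGAGGGGG TACTTCCTAA",
    "",
    "CAGACTCAGT ACCTCCCAGA",
    "C?GACTCAGT ACCTCCCAGA",
    "CAGACACAGT ACCTCCCAGA",
]
sequences = interleaved_to_sequential(interleaved, n_haplotypes=3)
for number, sequence in enumerate(sequences, start=1):
    print(number, sequence)
```

The length check at the end reflects the requirement that every haplotype occupy the same columns; a file that fails it would need to be realigned before being given to SEQFLTR.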
Note: all corresponding sites across haplotypes must be positioned in the same data entry column in the input file, otherwise a data entry error will occur. In other words, you may not have 61 bp in one row for a given haplotype, and 60 bp in the next row for the second haplotype. Spaces are permitted between nucleotide designations, but you may NOT have more than ONE space between base pairs. However, more than one blank line (or no blank lines) may be left between haplotype inputs. You will be asked by SEQFLTR whether your data is in a sequential or interleaved format. An example of the Sequential vs. Interleaved input formats is given as follows:

Sequential:

    3 80 0
    ACCGAGGGGG GACTTGGAAA
    CAGACTCAGT ACCTCCCAGA
    CGGACTATTA CGATCACGTT
    ACCCTGAGGA ACTGACCCTC

    GTCGAA---- GACTTGGAAA
    C?GACTCAGT ACCTCCCAGA
    CGGACTATTA CGATCACGTT
    ACCCTGAGGA ACTGACCCTC

    ATCGAGGGGG TACTTCCTAA
    CAGACACAGT ACCTCCCAGA
    CGGACTATTA CTATCACGTT
    ACCCTGAGGA ACTGATTTTC

Interleaved:

    3 80 0
    ACCGAGGGGG GACTTGGAAA
    GTCGAA---- GACTTGGAAA
    ATCGAGGGGG TACTTCCTAA

    CAGACTCAGT ACCTCCCAGA
    C?GACTCAGT ACCTCCCAGA
    CAGACACAGT ACCTCCCAGA

    CGGACTATTA CGATCACGTT
    CGGACTATTA CGATCACGTT
    CGGACTATTA CTATCACGTT

    ACCCTGAGGA ACTGACCCTC
    ACCCTGAGGA ACTGACCCTC
    ACCCTGAGGA ACTGATTTTC

You will note that there are three numeric inputs required on the first line of the data input file. The first value indicates the number of haplotypes being compared. In the above example, 3 haplotypes are being compared. The second value lists the number of base pairs compared. The final value indicates whether you wish to keep track of the haplotype designations in the file. The zero value in the current example indicates that you do not wish to give these haplotypes designations (perhaps they are just called haplotype 1, 2, and 3) or that you have already designated them in the *.pop file. However, if you wish to designate the haplotypes then enter the value 3 for this input variable and SEQFLTR will look for three haplotype designations at the end of the file.
Many programs such as MEGA and PHYLIP write this information in the left-hand column beside each haplotype and start data entry in a subsequent column. To accommodate all these formats is a difficult programming task. Thus, I have taken the easy way out and will require you to do some text editing. If your data is in one of these formats then you will have to remove all the leading columns and start data entry in column 1 as shown. If you specify a haplotype name input then these names must be given IMMEDIATELY following the last row of data entry, without any intervening blank lines, as follows:

    3 80 3
    ACCGAGGGGG GACTTGGAAA
    CAGACTCAGT ACCTCCCAGA
    CGGACTATTA CGATCACGTT
    ACCCTGAGGA ACTGACCCTC

    GTCGAA---- GACTTGGAAA
    C?GACTCAGT ACCTCCCAGA
    CGGACTATTA CGATCACGTT
    ACCCTGAGGA ACTGACCCTC

    ATCGAGGGGG TACTTCCTAA
    CAGACACAGT ACCTCCCAGA
    CGGACTATTA CTATCACGTT
    ACCCTGAGGA ACTGATTTTC
    Haplotype1
    Haplotype2
    Haplotype3

If INDELS are present in your DNA sequences you may tell SEQFLTR to either keep these sections in the output *.seq file, or remove them entirely.

IMPORTANT NOTE: The order of the haplotypes in your *.inf file must be the same order in which you designate the haplotypes in the *.pop file.

The output file produced by SEQFLTR will have the *.seq extension and will be in the proper format for input into MTDIS (i.e. only the polymorphic sites across haplotypes will be listed, followed by the number of monomorphic sites). In addition, SEQFLTR will give a listing of where the polymorphic sites occur within the original sequences.

5a) Array size:

The MS-DOS version of VISUAL BASIC with which SEQFLTR was compiled limits array sizes to approximately 32,000 elements. Therefore, you must keep this fact in mind when you are converting DNA sequences for input into MTDIS. If you are analyzing more than this number you will have to truncate the bp range accordingly. This is not a major problem.
Since the MTDIS input file may be structured with only polymorphic sites, the total number of DNA sequence bp screened could potentially be quite large. You simply have to append the output from several different input data sets. Append (any text editor will do this) the output from each analysis in a sequential fashion (i.e. the polymorphic sites from haplotype 1 must be listed first, followed by haplotype 2, etc.). Remember, no spaces are allowed between bp, but multiple lines of input are allowed. Therefore, simply place the output from file 1 in the first row, followed by the output from file 2 in the next row, and finally the output from additional files in additional rows. Then go on to the next haplotype. Do not leave any spaces between haplotype entries, and remember to add together the number of monomorphic bp detected from each input file for entry at the bottom of your MTDIS input file.

________________________________________________________________________

If you have problems with setting up the data files or implementing the program please do not hesitate to contact me directly by E-mail at: rdanzman@uoguelph.ca

Roy Danzmann, Department of Zoology, University of Guelph, Guelph, Ontario, Canada N1G 2W1

dated: June 1, 1999.

These programs may be obtained by anonymous ftp to site: 131.104.50.2
Use the password: danzmann
Then use the command GET to retrieve the file: MTDIS.ZIP
PKUNZIP this file. It contains the following files.

________________________________________________________________________

Files provided with MTDIS.

1. MTDIS.EXE  MS-DOS stand alone executable for estimating average sequence divergence among populations.
2. SEQFLTR.EXE  MS-DOS stand alone executable for converting complete DNA sequences into the correct input format for MTDIS.
3. example.res  An example file showing the proper format for the input restriction site data.
4. example.seq  An example file showing the proper format for the input DNA sequence data.
5.
example.pop  An example file showing the proper format for the input population data.
6. bootexam.res, bootexam.seq  Example files showing how to convert the restriction site and DNA sequence input files into a format that may be read by SEQBOOT in PHYLIP.
7. bootfile.res, bootfile.seq  Example files showing how to convert the randomized OUTFILEs from SEQBOOT in PHYLIP into a format that may be read by MTDIS.
8. examres.mtd  The output file generated from the example.res and example.pop data files.
9. examseq.mtd  The output file generated from the example.seq and example.pop data files.
10. bootres.mtd  The output file generated from the bootfile.res and example.pop data files. This file may be read directly by NEIGHBOR in PHYLIP.
11. bootseq.mtd  The output file generated from the bootfile.seq and example.pop data files. This file may be read directly by NEIGHBOR in PHYLIP.
12. examres.meg, examseq.meg  Example files showing how to convert the output examres.mtd and examseq.mtd files into a format that may be read by the program MEGA (Kumar et al. 1993).
    Note: Remember to insert a line between the population designations and the first line of the distance matrix.
13. readme.now  An ASCII text file with instructions on how to use these programs.
14. readme.doc  A WORD file with instructions on how to use these programs.

________________________________________________________________________

References:

Danzmann, R.G., M.M. Ferguson, S. Skulason, S.S. Snorrason, and D.L.G. Noakes, 1991. Mitochondrial DNA diversity among four sympatric morphs of Arctic charr, Salvelinus alpinus L., from Thingvallavatn, Iceland. J. Fish Biol. 39: 649-659.

Felsenstein, J. 1993. Phylogeny Inference Package (PHYLIP). Version 3.5. Department of Genetics, University of Washington, Seattle.

Kumar, S., K. Tamura, and M. Nei, 1993. MEGA: Molecular Evolutionary Genetics Analysis. ver. 1.01. The Pennsylvania State University, University Park, PA 16802.

Nei, M. 1987.
Molecular Evolutionary Genetics. Columbia University Press, New York.

________________________________________________________________________

Citation:

Danzmann, R.G. 1998. MTDIS: A computer program to estimate genetic distances among populations based upon variation in single copy DNA. J. Hered. 89: 283-284.